Cannot run humann v3.7 using the latest Chocophlan database

On my institute’s computer cluster I ran metaphlan v4.0.6 with the mpa_vOct22_CHOCOPhlAnSGB_202212 database version (downloaded from http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/) on a collection of samples. Now I want to use the generated taxonomic profiles to run humann v3.7 on the same samples. However, when I try to do this with the same Chocophlan database version that I used with metaphlan, humann stops with a critical error telling me to install the latest version of the database: v201901_v31.

Here’s the humann command line:
humann --taxonomic-profile …/…/metaphlan/${name}.tax_profile.txt --input ${input_dir}/${name}_kneaddata.fastq.gz --output /vast/scratch/users/schulze.a/humann_output/MIT --threads 8 --metaphlan-options '--index mpa_vOct22_CHOCOPhlAnSGB_202212'

Here’s humann’s error message:
CRITICAL ERROR: The directory provided for ChocoPhlAn contains files ( mpa_latest ) that are not of the expected version. Please install the latest version of the database: v201901_v31

If I use Chocophlan v201901_v31 (downloaded from Index of /humann_data/chocophlan/) in humann, I don’t get this error. However, because the input taxonomic profiles were produced using mpa_vOct22_CHOCOPhlAnSGB_202212, I understand that doing this would be a mistake, right? I guess I should use the same database version in both the metaphlan and humann runs?

Also just to be sure, mpa_vOct22_CHOCOPhlAnSGB_202212 is the latest available metaphlan database, right?

Thanks in advance.

Hello, thank you for the detailed post! Yes, you want to use HUMAnN and MetaPhlAn versions that are compatible with respect to their databases. Yes, vOct22 is the latest MetaPhlAn database and v201901_v31 is the latest HUMAnN database. If you have the latest HUMAnN v3.7 and MetaPhlAn v4, you should be set using the default databases they download (v201901_v31 and vOct22). Sorry for the confusion: the version naming conventions of the two tools make it hard to tell which databases are in sync. We tried to add code to HUMAnN so that it alerts you if you are using a database that is not in sync with your MetaPhlAn version. Please post if you have other issues or questions.

Thanks!
Lauren

Hi Lauren,

Thanks a lot for your message! This clarifies everything. So there wasn’t a problem when I thought there was one.

Yes, I agree that it would be easier if the equivalent metaphlan and humann Chocophlan database versions had similar names. That, or perhaps a note somewhere in the humann manual. I was assuming that mpa_vOct22_CHOCOPhlAnSGB_202212 was the latest version and that v201901_v31 wasn’t, in part because of the creation or modification dates shown for them on their respective download sites.

Can I just ask you one more thing? In the logfiles of the humann runs that did work (because I used the v201901_v31 database) I get a few messages of this kind:

8/22/2023 06:45:54 AM - humann.search.prescreen - DEBUG: Taxon not in mapping file: k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__GGB9176|s__GGB9176_SGB14114|t__SGB14114 2|1239|186801|186802|186803||| 2.88808

Should I worry about this? Could this mean that I’m not using the correct mapping file?

Definitely! No, you don’t need to worry about that. Some MetaPhlAn v4.0 SGBs don’t have a direct mapping in HUMAnN v3.7.
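To see how much abundance these unmapped SGBs account for, the messages can be pulled out of the HUMAnN log. This is only a sketch: `unmapped_taxa` is a hypothetical helper name, and it assumes the last field of each message is the relative abundance reported by MetaPhlAn, as it appears to be in the log line quoted above.

```shell
#!/usr/bin/env bash
# Print "<abundance><TAB><taxon lineage>" for every "Taxon not in mapping file"
# message in a HUMAnN log, largest abundance first.
# Assumption: the last whitespace-separated field is the relative abundance.
unmapped_taxa() {
    grep 'Taxon not in mapping file:' "$1" \
        | sed 's/.*Taxon not in mapping file: *//' \
        | awk '{ print $NF "\t" $1 }' \
        | sort -k1,1nr
}
```

Calling `unmapped_taxa sample.log` (the log name is a placeholder) prints one line per unmapped taxon; an empty result means no such messages were logged.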

Thanks!
Lauren

OK, perfect.

Thanks a lot,
Enrique

Hi Lauren,

I have a related question. I ran HUMAnN 3 (with MetaPhlAn 4) on mouse metagenomic data using the following databases:

  • Marker database: mpa_vOct22_CHOCOPhlAnSGB_202212
  • Genomes: full_chocophlan.v201901_v31
  • UniRef: uniref90_annotated_v201901b

I am seeing 14% of reads unclassified with MetaPhlAn, but 85% of reads unaligned after nucleotide alignment, which improves to 60% unaligned after translated alignment.

I believe that the percentage of reads unaligned after nucleotide alignment is poor because many of the bugs that get classified with MetaPhlAn 4 don’t have corresponding genomes in the genomes database. Could you please confirm if my understanding is correct?

Also, there is currently no full CHOCOPhlAn genome database corresponding to the mpa_vOct22_CHOCOPhlAnSGB_202212 marker database, right?

Thank you,
Lev

I can confirm that your conclusion is likely correct here. MetaPhlAn 4 has much better representation of the murine gut than MetaPhlAn 3 / HUMAnN 3. HUMAnN 4 will “catch up” in this regard by mapping to MetaPhlAn 4’s SGB pangenomes.

OK. So, this has confused me for quite a while, and this thread is the closest I’ve seen to answering my questions. I think I’m close to actually understanding what’s going on, but I still need some confirmation.

I think my biggest mistake was to assume that Metaphlan4 was integrated in the Humann pipeline starting from version 3.5. Do I understand correctly that the Metaphlan used in every Humann3 version is still Metaphlan3 (which would explain why I get errors saying it doesn’t recognise the newest Metaphlan databases, from Jan21 on)?

I guess it would be this post that somewhat confused me. What does the first point under “What has changed” mean exactly? Does it mean that we’re able to use Metaphlan4’s output as input to the Humann pipeline to take care of the taxonomy (but that Metaphlan should be run separately, outside of the Humann pipeline, first)? And so, not what I thought at first, that Humann >3.5 would run Metaphlan4 automatically for the taxonomy part and use this information for the functional annotation as well?

Right, HUMAnN 3.x is still based around MetaPhlAn 3, but as of HUMAnN 3.5 we have been adding forward compatibility with the latest MetaPhlAn 4 releases. This will stop once HUMAnN 4 is released.

Thanks for your reply @franzosa.

For some reason it is not working out for me. Could you maybe post how you would go about getting the latest Metaphlan4 output and using it in Humann3?
Would I have to run Metaphlan4 separately at first and use its output as input to Humann3 with the flag --taxonomic-profile $FILE?

I have tried this before, but I still get the error:

CRITICAL ERROR: The directory provided for ChocoPhlAn contains files ( mpa_vJan21_CHOCOPhlAnSGB_202103.pkl ) that are not of the expected version. Please install the latest version of the database: v201901_v31

(I get a similar error if I try the Oct22 database)

Something I’m thinking about just now is that I pass the mpa_vJan21_CHOCOPhlAnSGB_202103 database as the index (see command below). Should I not do this, and instead “trick” the pipeline by setting --index to the older database version v201901_v31 (since it won’t use it anyway if you pass the --taxonomic-profile flag, right?) rather than to the newest?
Could this work? Or will it still detect the database version used in the taxonomic-profile file?

humann --threads 4 --bowtie-options='--threads 4' --nucleotide-database /scratch/12484287/mock_10k_chocophlan --protein-database /db/uniref --metaphlan-options='--bowtie2db /scratch/12484287/mock_10k_chocophlan --index mpa_vJan21_CHOCOPhlAnSGB_202103' --input-format fastq --input results/mock_10k/humann_input_reads/mock_10k.fastq --output results/mock_10k/profiling --taxonomic-profile metaphlan4_vJan21/mock_10k_profile.txt

The CRITICAL ERROR you are seeing is because you downloaded a MetaPhlAn PKL file into the folder containing your HUMAnN pangenomes. HUMAnN will not run if that folder is “contaminated” in any way by files of the wrong type / version.
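A quick way to spot such stray files before launching HUMAnN is to list everything in the pangenome folder whose name lacks the expected version string. This is a minimal sketch, not HUMAnN’s own check: it assumes that a file-name match on v201901_v31 is what matters (inferred from the error message, not confirmed against the HUMAnN source), and `find_stray_files` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List files in a ChocoPhlAn pangenome folder whose names do not contain
# the expected version string.
# Assumption: a file-name match is what matters; this is inferred from the
# error message, not taken from HUMAnN's code.
find_stray_files() {
    local db_dir="$1"
    local version="${2:-v201901_v31}"
    local path name
    for path in "$db_dir"/*; do
        [ -e "$path" ] || continue        # folder is empty: nothing to report
        name="$(basename "$path")"
        case "$name" in
            *"$version"*) ;;              # expected version, ignore
            *) printf '%s\n' "$name" ;;   # anything else is suspicious
        esac
    done
}
```

For the command above this would be `find_stray_files /scratch/12484287/mock_10k_chocophlan`; any file it prints (for example a MetaPhlAn .pkl) should be moved to a separate folder. Hidden files are not covered by the glob.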

@franzosa Thanks again for taking time to reply :slight_smile:

I downloaded these files to a separate directory and was trying to refer Humann to these, by using --index, but it seems as if Humann3 (v 3.6) is not compatible with these newer “Metaphlan4 databases” (because this method does work when I refer to the 2019 database version). Is this correct?

However you did say in your previous message “…but as of HUMAnN 3.5 we have been adding forward compatibility with the latest MetaPhlAn 4 releases”. I’m not really sure how. Would you care to explain?
I have been trying to get Metaphlan4 (outputs) and Humann3.6 to work together for some time now, which I thought should be possible according to other forum posts I’ve seen and also according to your statement above (?), but I just haven’t been able to figure out how to make this happen yet.

Any help would be appreciated.

We have been making HUMAnN 3.X compatibility releases for successive versions of MetaPhlAn 4. For example, the most recent HUMAnN 3 release (3.8) added compatibility with MetaPhlAn 4’s Oct22 marker/SGB database.

You would need to make sure you’re using a compatible MetaPhlAn 4 and HUMAnN for this to work, as HUMAnN is very careful about not interpreting MetaPhlAn output that it isn’t specifically trained for.
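One way to set this up is to run MetaPhlAn 4 first, against its own database folder, and then hand the resulting profile to HUMAnN with --taxonomic-profile. The sketch below assumes a HUMAnN 3 release that is compatible with the chosen MetaPhlAn 4 database (3.8 for Oct22, as mentioned above). The function names, the output file names, and the `DRY_RUN` switch are hypothetical; only the metaphlan and humann flags themselves come from the tools.

```shell
#!/usr/bin/env bash
# Print each command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Arguments: reads, MetaPhlAn database folder, HUMAnN pangenome folder,
# UniRef folder, output folder, optional MetaPhlAn index name.
# The MetaPhlAn and HUMAnN database folders must be different directories.
profile_then_humann() {
    local reads="$1" mpa_db_dir="$2" chocophlan_dir="$3" uniref_dir="$4" out_dir="$5"
    local index="${6:-mpa_vOct22_CHOCOPhlAnSGB_202212}"
    local profile="$out_dir/metaphlan_profile.txt"

    run mkdir -p "$out_dir" || return 1

    # Step 1: taxonomic profiling with MetaPhlAn 4.
    run metaphlan "$reads" --input_type fastq \
        --bowtie2db "$mpa_db_dir" --index "$index" \
        --bowtie2out "$out_dir/metaphlan_bowtie2.txt" \
        -o "$profile" || return 1

    # Step 2: functional profiling with HUMAnN, reusing the profile from step 1
    # so that HUMAnN does not run MetaPhlAn itself.
    run humann --input "$reads" --output "$out_dir" \
        --taxonomic-profile "$profile" \
        --nucleotide-database "$chocophlan_dir" \
        --protein-database "$uniref_dir"
}
```

Setting `DRY_RUN=1` before calling `profile_then_humann` prints the three commands without running anything, which is a cheap way to check the paths first.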

It took me a while to understand how this all works…but I think I got it now. Still want to confirm…

HUMAnN3 still uses the version 2019_v31 pangenomes, which correspond to the 2019_v31 version of MetaPhlAn’s marker database, right? So, I can run MetaPhlAn and HUMAnN on this same pangenome/marker database (Chocophlan). This would mean that the stratified results of the functional analyses in this case would be in complete concordance. And this 2019_v31 version is the one and only version available right now for which this is the case? Is this all correct?

So, now I ran HUMAnN3.9 with MetaPhlAn4 on the latest MetaPhlAn database (Jun 2023), so the pangenome database and the MetaPhlAn taxonomic database are no longer in concordance, which would explain why I don’t have stratified results? But do I understand correctly that they are somewhat compatible, and that because they’re not 1-to-1 (less overlap) the functional analyses would produce far fewer stratified results? So, in my case, since it’s a small test dataset, it was most likely a coincidence that I had 0 stratified (all unclassified) results, but I might find some if I ran “real data”?

Sorry if these questions seem simple and redundant to you, but I’m really trying to understand how the different “modules” of the pipelines work (together).

Thanks.

This is mostly right. If you’re using MetaPhlAn 3.X with a compatible HUMAnN 3.Y then you’ll be talking about marker genes and pangenomes based on the same underlying database and taxonomy.

MetaPhlAn 4.0 transitioned to the SGB database, markers, and taxonomy, which is essentially a superset of the v3 databases + some species being subdivided into multiple new species based on new genomic evidence.

So when you run MetaPhlAn 4.X with a compatible HUMAnN 3.Y you should still get stratified functional profiles, but the taxonomic stratifications will be in the language that HUMAnN understands (genus.species) not MetaPhlAn’s SGB taxonomy. This is based on the idea that we know in the underlying data that SGB X from MetaPhlAn 4 contains genomes that were at one time grouped into species Y (known to HUMAnN 3).
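To check how much of a run was actually stratified to species, the stratified rows of a HUMAnN gene families table can be counted. In that table, stratified rows carry a `|` followed by either a genus.species label or `unclassified`. The sketch below is illustrative only; `stratification_summary` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Summarise the stratified rows of a HUMAnN gene families table:
# rows assigned to a species, rows labelled "unclassified", and the
# number of distinct species labels seen.
stratification_summary() {
    awk -F '\t' '
        /^#/ { next }                      # header line
        index($1, "|") == 0 { next }       # community totals and UNMAPPED
        { split($1, parts, "|"); taxon = parts[2] }
        taxon == "unclassified" { unclassified_rows++; next }
        { species[taxon]++; species_rows++ }
        END {
            n = 0
            for (t in species) n++
            printf "species_rows\t%d\n", species_rows
            printf "unclassified_rows\t%d\n", unclassified_rows
            printf "distinct_species\t%d\n", n
        }
    ' "$1"
}
```

A result with `species_rows` at 0 means every stratified row was unclassified, which is the situation described above for the small test dataset.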