The bioBakery help forum

Discrepancy between metaphlan3 community profile and humann3 gene families

I am analyzing a fecal metagenomics dataset from cats with diabetes. I have observed a strange discrepancy between the metaphan3 and humann3 results with respect to the presence/absence of specific taxa. For instance metaphlan identified one species in only 2/24 samples, but when I perform differential abundance testing on the gene families abundance and pathways data (using maaslin2), dozens of genes annotated to the same species are differentially abundant and present in the majority of samples (16-21 of 24).

How is it possible that metaphlan3 identified this species in only 2 samples, but it is present in the majority of gene families features in the humann3 output?

I see a similar problem was discussed here:Discrepancy between metaphlan3 and Humann3 in profiling micriobiota. However I am not sure the issue was resolved.

P.S. I am a novice data scientist, this is my first metaphlan3/humann3 dataset and my first time posting on the biobakery forum. Please excuse the inelegant way I may have posed the question

1 Like

I agree that this does not sound possible. Have you examined the merged HUMAnN tables to see if this species’ genes were ever non-zero in samples where the species was not detected by MetaPhlAn? I’m trying to determine if this is a problem with your HUMAnN output or the downstream MaAsLin analysis.

Were you running HUMAnN with default settings? If not, can you provide the command that you executed?

@franzosa

Thanks for your comment.

I have investigated the merged humann tables and confirmed that gene family features annotated to the species of interest are non-zero in samples for which it was not present in metaphlan. In fact, hundred of features from this species are present in samples for which the species was not detected in metaphlan.

After performing QC and removing host sequences, we followed the the HUMAnN3 joint taxonomy workflow here: GitHub - biobakery/humann: HUMAnN 3.0 is the next generation of HUMAnN 1.0 (HMP Unified Metabolic Analysis Network).

We created taxonomic profiles for each of the samples with MetaPhlAn (v 3.0.4)

metaphlan ${R1},${R2} \ 
    --input_type fastq \ 
    --bowtie2out ${pair_id}.bowtie2.bz2 \ 
    --nproc 12 \ 
    -o ${pair_id}.profile.txt \ 
    --tmp_dir /scratch ${b2db} 

We joined all of the taxonomic profiles, located in directory ‘metaphlan’, into a table of taxonomic profiles for all samples (HUMAnN3 v3.0.0a5):

humann_join_tables -i ./metaphlan \ 
    -o joined_profiled_metagenome.tsv 

We reduced this file into a taxonomic profile that represented the maximum abundances from all of the samples in the set:

humann_reduce_table -i joined_profiled_metagenome.tsv \ 
    -o max_profiled_metagenome.tmp.tsv \ 
    --function max \ 
    --sort-by level 

Bootstrap step: We rann HUMAnN 3.0 on one of our samples (concatenating the PE read data) providing the max taxonomic profile to create the custom indexed ChocoPhlAn database (in ‘bootstrap/$SAMPLE_1_humann_temp/’).

cat R1.fastq.gz R2.fastq.gz > bootstrap.reads.fastq.gz 

humann --input bootstrap.reads.fastq.gz \ 
    --output bootstrap \ 
    --search-mode uniref90 \ 
    --threads 24 \ 
    --taxonomic-profile max_profiled_metagenome.tsv

Finally, we ran each sample using the custom Chocophlan database (‘${db}’ below) on the concatenated sample reads:

humann --input all.reads.fastq.gz \ 
    --output ${sampleid} \ 
    --search-mode uniref90 \ 
    --threads 24 \ 
    --nucleotide-database ${db} \ 
    --bypass-nucleotide-index   

The species we are concerned with is Feacalibacterium. Examining the metaphlan output, it was detected in only 3 samples. However, examining the humann3 output, gene features annotated to Feacalibacterium are present in all samples, except a couple of our negative controls. There is overlap in that the three samples with Feacalibacterium in the metaphlan output are also represented in the humann3 output.

One explanation posed by a more experienced data scientist (a consultant on this project) is that we achieved incredible sequencing depth (200-320 million reads per sample; 10-20X) which could have resulted in some gene features from Feacalibacterium being detected by humann3 (using a more comprehensive pan-genome approach from the custom cocophlan database) but not in metaphlan due to it employing a marker-based approach. This makes sense to me, but I would think that with incredible sequencing depth, we would have captured enough information to generate an accurate metaphlan profile based on the marker genes. I would expect that deep sequencing would increase detection of both maker and non-marker genes.

Any help would be appreciated. Thanks!

I see what’s happening - you’re using a merged MetaPhlAn taxonomic profile as input to HUMAnN, specifically one that captures the maximum abundance of each species across all samples. This means that HUMAnN will map against the same (super)set of species in each per-sample run, including many species that were not detected in the individual sample being analyzed. The default procedure is just to map against the pangenomes of species detected in a given sample, which is both faster and more specific.

The merged taxonomic profile approach trades sensitivity for specificity. For example, species X might be present in sample 1, but with insufficient coverage to be detected by MetaPhlAn. If X was detected in sample 2, then you’ll also map reads to the X pangenome in the HUMAnN run for sample 2. (Given that you have a lot of sequencing depth, this scenario is unlikely unless X’s abundance is vanishingly small in sample 1.) Alternatively, if X is not present in sample 1, then the X pangenome might still recruit sample 1 reads (spuriously) due to homology between X genes and those of present species. I think this (latter) scenario is a lot more likely in your data.

Unless you had a really strong reason to go with the merged profile approach (?) I would stick with the default. You could rescue more “default-like” results without rerunning HUMAnN by only keeping per-sample species stratifications from species that MetaPhlAn detected and binning everything else as “unclassified.” For example, if HUMAnN reported…

Gene1   10
Gene1|X  7
Gene1|Y  2
Gene1|Z  1

But only species X was detected, it should be reasonably safe to reform this as:

Gene1              10
Gene1|X             7
Gene1|unclassified  3