@franzosa
Thanks for your comment.
I have investigated the merged humann tables and confirmed that gene family features annotated to the species of interest are non-zero in samples for which it was not present in metaphlan. In fact, hundred of features from this species are present in samples for which the species was not detected in metaphlan.
After performing QC and removing host sequences, we followed the the HUMAnN3 joint taxonomy workflow here: GitHub - biobakery/humann: HUMAnN 3.0 is the next generation of HUMAnN 1.0 (HMP Unified Metabolic Analysis Network).
We created taxonomic profiles for each of the samples with MetaPhlAn (v 3.0.4)
metaphlan ${R1},${R2} \
--input_type fastq \
--bowtie2out ${pair_id}.bowtie2.bz2 \
--nproc 12 \
-o ${pair_id}.profile.txt \
--tmp_dir /scratch ${b2db}
We joined all of the taxonomic profiles, located in directory âmetaphlanâ, into a table of taxonomic profiles for all samples (HUMAnN3 v3.0.0a5):
humann_join_tables -i ./metaphlan \
-o joined_profiled_metagenome.tsv
We reduced this file into a taxonomic profile that represented the maximum abundances from all of the samples in the set:
humann_reduce_table -i joined_profiled_metagenome.tsv \
-o max_profiled_metagenome.tmp.tsv \
--function max \
--sort-by level
Bootstrap step: We rann HUMAnN 3.0 on one of our samples (concatenating the PE read data) providing the max taxonomic profile to create the custom indexed ChocoPhlAn database (in âbootstrap/$SAMPLE_1_humann_temp/â).
cat R1.fastq.gz R2.fastq.gz > bootstrap.reads.fastq.gz
humann --input bootstrap.reads.fastq.gz \
--output bootstrap \
--search-mode uniref90 \
--threads 24 \
--taxonomic-profile max_profiled_metagenome.tsv
Finally, we ran each sample using the custom Chocophlan database (â${db}â below) on the concatenated sample reads:
humann --input all.reads.fastq.gz \
--output ${sampleid} \
--search-mode uniref90 \
--threads 24 \
--nucleotide-database ${db} \
--bypass-nucleotide-index
The species we are concerned with is Feacalibacterium. Examining the metaphlan output, it was detected in only 3 samples. However, examining the humann3 output, gene features annotated to Feacalibacterium are present in all samples, except a couple of our negative controls. There is overlap in that the three samples with Feacalibacterium in the metaphlan output are also represented in the humann3 output.
One explanation posed by a more experienced data scientist (a consultant on this project) is that we achieved incredible sequencing depth (200-320 million reads per sample; 10-20X) which could have resulted in some gene features from Feacalibacterium being detected by humann3 (using a more comprehensive pan-genome approach from the custom cocophlan database) but not in metaphlan due to it employing a marker-based approach. This makes sense to me, but I would think that with incredible sequencing depth, we would have captured enough information to generate an accurate metaphlan profile based on the marker genes. I would expect that deep sequencing would increase detection of both maker and non-marker genes.
Any help would be appreciated. Thanks!