Hello,
We have profiled shotgun metagenomic sequences derived from human fecal samples using first HUMAnN3 to get functional profiles, then by MetaPhlAn3 using the bowtie2 intermediate file that is spit out by running HUMAnN3 to get normalized marker counts (parameter --marker_ab_table
). An interesting finding of the profiling is that an Alcanivorax species (Alcanivorax sp N3 2A) is detected in every single one of our subject’s samples (~800 total) at a reasonable relative abundance (mean relative abundance ~0.3%), but the literature suggests that Alcanivorax species are found in marine environments. Additionally another species isolated from marine environment, Vibrio fortis, is also detected in every sample, but at a much lower relative abundance (mean relative abundance ~0.005%). We do not suspect geography to be at play as samples were collected from subjects from both southern United States and Canada.
My question is a little open ended, but I wanted to see if anyone else has had these organisms profiled using MetaPhlAn with human fecal samples? Or perhaps there is an error in the way I am performing HUMAnN/MetaPhlAn profiling?
Commands used for profiling are below.
Thanks,
Zach
FILE=sampleA.fastq
FILE_NAME=sampleA
RESULTS_DIR=directory_for_results
CHOCO=full_chocophlan.v296_201901b
UNIREF=uniref90_annotated_v201901b_full
MARKERS=mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
# run humann
humann --input $FILE \
--output $RESULTS_DIR \
--output-basename $FILE_NAME \
--metaphlan-options '-t rel_ab --add_viruses' \
--nucleotide-database $CHOCO \
--protein-database $UNIREF \
--prescreen-threshold 0.001 \
--threads 10 \
--verbose > ${RESULTS_DIR}/${FILE_NAME}.log 2>&1
# re-run metaphlan with bowtie2 intermediate file to get normalized marker counts
metaphlan --input_type bowtie2out \
--add_viruses \
-t marker_ab_table \
${RESULTS_DIR}/${FILE_NAME}_metaphlan_bowtie2.txt \
${RESULTS_DIR}/${FILE_NAME}_metaphlan_norm_abun_table.tsv \
>> ${RESULTS_DIR}/${FILE_NAME}.log 2>&1
# add clade marker names to normalized marker output
join -j 1 -o 1.3,1.2,2.2 \
<(sort -k1,1 <(paste <(bzcat $MARKERS | awk -F"\t" '{print $1}') \
<(bzcat $MARKERS | awk -F"__" '{print $1}') \
<(bzcat $MARKERS | awk -F"('taxon': '|'})" '{print $2}'))) \
<(sort -k1,1 ${RESULTS_DIR}/${FILE_NAME}_metaphlan_norm_abun_table.tsv) | \
sed '1s/^/#clade_name NCBI_tax_id normalized_abundance\n/' | sed 's/ /\t/g' | \
awk 'FNR==1{print;next}{val=$3;$3="~";a[$0]+=val}!b[$0]++{c[++count]=$0}END{for(i=1;i<=count;i++){sub("~",a[c[i]],c[i]);print c[i]}}' OFS='\t' | \
cat <(sed -n 1p ${RESULTS_DIR}/${FILE_NAME}_metaphlan_norm_abun_table.tsv) - > ${FILE_NAME}_temp.txt
mv ${FILE_NAME}_temp.txt ${RESULTS_DIR}/${FILE_NAME}_metaphlan_norm_abun_table.tsv
I just checked and don’t see these species in HMP stool samples that we’ve profiled by MetaPhlAn 3.
Are they showing up in the default taxonomic profiles produced by MetaPhlAn when it’s run inside of HUMAnN (as deposited in the per-sample temp folder)? If so, I’d start wondering about possible contamination at the sequencing site.
On the other hand, if you’re seeing outputs for these species in your custom marker-level analysis (conducted external to HUMAnN), my guess is that you might be seeing individual markers for these species recruiting reads, but not enough for a default MetaPhlAn run to consider them as confidently detected.
I do not see these species appearing in the metaphlan bugs list outputted by HUMAnN, only in the custom marker-level analysis conducted after running of HUMAnN. So does that mean these are not confident detections? and is there some way to gauge “confident-ness” of each detected species to know who is a confident hit and who is not?
Additionally, there were ~950 species included in the bugs list, all of which were included in the custom marker-level analysis (except a handful of viruses for some reason), but the custom marker-level analysis outputted ~2700 species.
So why are so many less species being included in the HUMAnN intermediate run of MetaPhlAn vs the externally run MetaPhlAn? On the surface, it looks like the only difference in the MetaPhlAn runs is that I am changing -t rel_ab
parameter to -t marker_ab_table
in the external run. I’m even using the bowtie2 intermediate file outputted by the HUMAnN run of MetaPhlAn. Is there some underlying filtering or QC going on when rel_ab
is used vs the marker_ab_table
?
Thank you for your help.
A -t marker_ab_table
analysis will report the abundances of any marker that recruited one or more reads. The default MetaPhlAn analysis estimates each species’ coverage as a trimmed average of its markers, hence if only a few markers are detected then the species will not be included in the output. This is by design, as the detection of only a few markers (out of 100s) could indicate a spurious mapping event, or HGT of a marker gene to another background.
Got it, that makes sense.
Thank you!
I actually have a follow up question.
We would like to have observed abundances (normalized for length of genome or marker gene) per species in order to do analyses such as calculating SparCC correlations. This is the reason why I perform the marker-level analysis of MetaPhlAn after HUMAnN, then collapse marker-level abundances to species.
My question is: How would you recommend getting per species abundances AND make sure filtering that is performed in default MetaPhlAn run is also incorporated?
Since posting the follow up question I’ve done more digging and found this post where you recommended using the -t rel_ab_w_read_stats
parameter to obtain species abundances. I ran MetaPhlAn using this parameter on one of our samples and found it had the same species outputted as the default MetaPhlAn run. So this seems to be the preferred parameter to obtain species abundances while also performing the default MetaPhlAn filtering. Is this correct?