Hi all,
I noticed something while reviewing the data for one of my samples (49915_2_156
), which is paired-end and contains no singleton reads. I ran it through MetaPhlAn 4.0.6using the -t rel_ab_w_read_stats
option and the updated 2024 MetaPhlAn database(#mpa_vJun23_CHOCOPhlAnSGB_202403).
The MetaPhlAn output reports that 137,492 reads were processed, yet the estimated read counts mapped/unmapped to taxa sum to 1,554,273 for this sample.
I was expecting these numbers to be closer. Does MetaPhlAn count individual reads (rather than read pairs), and is it normal for the estimated read count to significantly exceed the number of processed reads? I’d really appreciate an explanation for this disparity between the input and output read counts.
Thanks in advance!
Uzma
Hi @Uzma_Basit_Khan
The number of reads assigned to each species is just an estimation based on the avg coverage of the markers and the median length of the species’ genomes, so it is expected that the number of processed reads does not match the estimated count. However, usually it is closer estimation. Did you perhaps consider different taxonomic levels when summing up to 1,554,273? You should only consider rows from the same taxonomic level, because each level is representing the same information, just at different depth and each level will sum up to the same amount e.g. to count reads assigned to SGBs: cat profile.tsv | grep ‘t__’ | cut -f5
Let me know if this was the issue, if not could you share the profile?
Thanks
Hi Claudia,
Thank you so much for the clarification that was really helpful.
You were absolutely right. I had initially summed estimated counts across all taxonomic levels, which is why I was seeing such an inflated number. After filtering to only include species-level entries (s__
), the total estimated read count I get was:
cat merged_metaphlan_counts.txt | grep ‘s__’ | cut -f5 | awk ‘{sum+=$1} END {print sum}’
1182444
It is still quite high, ~1.1 million estimated reads from ~137k processed reads
Best regards
With ‘s__’ you are still including both species (s__Species_name) and SGB levels (e.g. s__Species_name|t__SGBx), so the total will be 591222. My suggestion is to try using the latest version of MetaPhlAn (4.2.2), since the 4.0.6 is now 2 years old, and check if the estimate is more accurate in the current code, if not we will look more into this. Thanks