Discrepancy Between Input Reads and Estimated Read Counts in MetaPhlAn 4

Hi all,

I noticed something while reviewing the data for one of my samples (49915_2_156), which is paired-end and contains no singleton reads. I ran it through MetaPhlAn 4.0.6using the -t rel_ab_w_read_stats option and the updated 2024 MetaPhlAn database(#mpa_vJun23_CHOCOPhlAnSGB_202403).

The MetaPhlAn output reports that 137,492 reads were processed, yet the estimated read counts mapped/unmapped to taxa sum to 1,554,273 for this sample.

I was expecting these numbers to be closer. Does MetaPhlAn count individual reads (rather than read pairs), and is it normal for the estimated read count to significantly exceed the number of processed reads? I’d really appreciate an explanation for this disparity between the input and output read counts.

Thanks in advance!
Uzma

Hi @Uzma_Basit_Khan

The number of reads assigned to each species is just an estimation based on the avg coverage of the markers and the median length of the species’ genomes, so it is expected that the number of processed reads does not match the estimated count. However, usually it is closer estimation. Did you perhaps consider different taxonomic levels when summing up to 1,554,273? You should only consider rows from the same taxonomic level, because each level is representing the same information, just at different depth and each level will sum up to the same amount e.g. to count reads assigned to SGBs: cat profile.tsv | grep ‘t__’ | cut -f5
Let me know if this was the issue, if not could you share the profile?
Thanks