Can we add a column with the number of reads for each taxon next to relative_abundance?


I am using Metaphlan4 for fecal metagenomics. Although it works well, I would like to add the exact number of reads detected for each taxon in a column next to the relative abundance column.

This is easy to calculate manually or in R/python since we are given the total number of reads in the sample, however would be equally useful to have it automatically calculated in the metaphlan output.

Is there a particular parameter to get this that I am not aware of?
I have read that some people used something like “-t rel_ab_w_read_stats” but had a discrepancy in the counts obtained.
Can you please help me?


Hi @Ameneh
MetaPhlAn 4 profiling is based on the mapping against species-specific markers genes, that only represent a part of the species genome. Thus the number of reads assigned to each species is just an estimation based on the avg coverage of the markers and the median length of the species’ genomes (what is calculated using “-t rel_ab_w_read_stats”). The EXACT number of reads is not possible to be calculated using MetaPhlAn

Hi @aitor.blancomiguez

Thanks for your response.

When I run Metaphlan 4, bowtie2out provide an text file as output with the number of reads. for example you can see this output:
:#52373574 reads processed
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 100.0
k__Bacteria|p__Bacteroidetes 2|976 68.29302

which means you have number of reads based on sequencing data you entered.
But if there is something else that i am not aware of, please let me know,

Hi @Ameneh
Yes, metaphlan will report the number of reads that were processed as input, but what I tried to say is that known the exact number of reads assigned to each taxa is not possible to assess via MetaPhlAn as it does not map against full genomes but just species-specific marker genes. The calc of the reads assigned to each taxa is a rough estimation based on the coverage of the markers and the median genome length of each species