Can we add a column with the number of reads for each taxon next to relative_abundance?

Ameneh · July 3, 2023, 11:59am

Hello,

I am using Metaphlan4 for fecal metagenomics. Although it works well, I would like to add the exact number of reads detected for each taxon in a column next to the relative abundance column.

This is easy to calculate manually or in R/python since we are given the total number of reads in the sample, however would be equally useful to have it automatically calculated in the metaphlan output.

Is there a particular parameter to get this that I am not aware of?
I have read that some people used something like “-t rel_ab_w_read_stats” but had a discrepancy in the counts obtained.
Can you please help me?

Best,
Ameneh.

aitor.blancomiguez · July 28, 2023, 1:55pm

Hi @Ameneh
MetaPhlAn 4 profiling is based on the mapping against species-specific markers genes, that only represent a part of the species genome. Thus the number of reads assigned to each species is just an estimation based on the avg coverage of the markers and the median length of the species’ genomes (what is calculated using “-t rel_ab_w_read_stats”). The EXACT number of reads is not possible to be calculated using MetaPhlAn

Ameneh · August 28, 2023, 1:43pm

Hi @aitor.blancomiguez

Thanks for your response.

When I run Metaphlan 4, bowtie2out provide an text file as output with the number of reads. for example you can see this output:
:#52373574 reads processed
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 100.0
k__Bacteria|p__Bacteroidetes 2|976 68.29302
…

which means you have number of reads based on sequencing data you entered.
But if there is something else that i am not aware of, please let me know,

aitor.blancomiguez · September 1, 2023, 12:11pm

Hi @Ameneh
Yes, metaphlan will report the number of reads that were processed as input, but what I tried to say is that known the exact number of reads assigned to each taxa is not possible to assess via MetaPhlAn as it does not map against full genomes but just species-specific marker genes. The calc of the reads assigned to each taxa is a rough estimation based on the coverage of the markers and the median genome length of each species

jihen · September 19, 2024, 7:57am

Hi @aitor.blancomiguez,

Sorry for reviving this topic, but I have some questions regarding read counts in MetaPhlAn, especially after going through the 4.1 tutorial.

I analyzed the file SRS014476-Supragingival_plaque.fasta.gz and found the following:

Total reads: 20,000 (I counted this with zcat SRS014476-Supragingival_plaque.fasta.gz | grep -c ">")
MetaPhlAn output:
- Reads processed: 19,048
- Estimated reads mapped to known clades: 2,298 (using the -t rel_ab_w_read_stats flag)

My Questions:

From 20,000 to 19,048:

You mention that “metaphlan will report the number of reads that were processed as input”, then how does the count decrease from 20,000 to 19,048? Could the 952 reads that were not processed be due to them not being bacteria reads or possibly failing quality checks?

From 19,048 to 2,298:

I understand that the decrease to 2,298 may be influenced by the -stat_q parameter. When I tested different values:
- -stat_q 0.1 resulted in 4,659 estimated reads.
- -stat_q 0.0 resulted in 13,477 estimated reads.
If I set -stat_q to 0.0, I expected to see the number of estimated reads equal to 19,048 since that would include all reads. Why is it still less than that?

I appreciate your insights, and I hope I’m approaching this correctly!

Thanks in advance!

Claudia_Mengoni · September 24, 2024, 9:07am

Hi @jihen
to answer your questions:

The number of processed reads as you guessed is due to quality checks on reads, specifically based on minimum read length (default value is 70)
The 2,298 is the estimated number of reads that are assigned to a taxa by MetaPhlAn. The stat_q parameter (default 0.20) is a parameter that is used to exclude upper and lower quartiles of markers detected for each taxa, so it will directly influence the estimation of reads. This is used to improve the accuracy of the profiling, reducing this to zero is not advised as it will increase sensitivity at the expense of specificity.
The missing number of reads can be due again to the fact that this is an estimation, so it is unlikely it will correspond to the exact number of reads or to the fact that some reads may belong to unclassified taxa not present in the database.

jihen · September 26, 2024, 12:45am

Hi @Claudia_Mengoni, thanks so much for your response! It really helped clarify my thoughts.

Topic		Replies	Views
Discrepancy Between Input Reads and Estimated Read Counts in MetaPhlAn 4 MetaPhlAn	3	35	July 10, 2025
Metaphlan output question MetaPhlAn	4	911	August 25, 2022
Metaphlan3 relative abundance MetaPhlAn	14	7050	June 9, 2025
Relative clade abundance with low number of reads MetaPhlAn	1	431	April 4, 2022
Sgb_to_gtdb_profile.py keep read count estimation from -t rel_ab_w_read_stats MetaPhlAn	4	422	January 11, 2024

Can we add a column with the number of reads for each taxon next to relative_abundance?

My Questions:

Related topics