Can we add a column with the number of reads for each taxon next to relative_abundance?

Hello,

I am using Metaphlan4 for fecal metagenomics. Although it works well, I would like to add the exact number of reads detected for each taxon in a column next to the relative abundance column.

This is easy to calculate manually or in R/python since we are given the total number of reads in the sample, however would be equally useful to have it automatically calculated in the metaphlan output.

Is there a particular parameter to get this that I am not aware of?
I have read that some people used something like “-t rel_ab_w_read_stats” but had a discrepancy in the counts obtained.
Can you please help me?

Best,
Ameneh.

Hi @Ameneh
MetaPhlAn 4 profiling is based on the mapping against species-specific markers genes, that only represent a part of the species genome. Thus the number of reads assigned to each species is just an estimation based on the avg coverage of the markers and the median length of the species’ genomes (what is calculated using “-t rel_ab_w_read_stats”). The EXACT number of reads is not possible to be calculated using MetaPhlAn

Hi @aitor.blancomiguez

Thanks for your response.

When I run Metaphlan 4, bowtie2out provide an text file as output with the number of reads. for example you can see this output:
:#52373574 reads processed
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 100.0
k__Bacteria|p__Bacteroidetes 2|976 68.29302
…

which means you have number of reads based on sequencing data you entered.
But if there is something else that i am not aware of, please let me know,

Hi @Ameneh
Yes, metaphlan will report the number of reads that were processed as input, but what I tried to say is that known the exact number of reads assigned to each taxa is not possible to assess via MetaPhlAn as it does not map against full genomes but just species-specific marker genes. The calc of the reads assigned to each taxa is a rough estimation based on the coverage of the markers and the median genome length of each species

Hi @aitor.blancomiguez,

Sorry for reviving this topic, but I have some questions regarding read counts in MetaPhlAn, especially after going through the 4.1 tutorial.

I analyzed the file SRS014476-Supragingival_plaque.fasta.gz and found the following:

  • Total reads: 20,000 (I counted this with zcat SRS014476-Supragingival_plaque.fasta.gz | grep -c ">")
  • MetaPhlAn output:
    • Reads processed: 19,048
    • Estimated reads mapped to known clades: 2,298 (using the -t rel_ab_w_read_stats flag)

My Questions:

  1. From 20,000 to 19,048:
  • You mention that “metaphlan will report the number of reads that were processed as input”, then how does the count decrease from 20,000 to 19,048? Could the 952 reads that were not processed be due to them not being bacteria reads or possibly failing quality checks?
  1. From 19,048 to 2,298:
  • I understand that the decrease to 2,298 may be influenced by the -stat_q parameter. When I tested different values:
    • -stat_q 0.1 resulted in 4,659 estimated reads.
    • -stat_q 0.0 resulted in 13,477 estimated reads.
  • If I set -stat_q to 0.0, I expected to see the number of estimated reads equal to 19,048 since that would include all reads. Why is it still less than that?

I appreciate your insights, and I hope I’m approaching this correctly!

Thanks in advance!

Hi @jihen
to answer your questions:

  1. The number of processed reads as you guessed is due to quality checks on reads, specifically based on minimum read length (default value is 70)
  2. The 2,298 is the estimated number of reads that are assigned to a taxa by MetaPhlAn. The stat_q parameter (default 0.20) is a parameter that is used to exclude upper and lower quartiles of markers detected for each taxa, so it will directly influence the estimation of reads. This is used to improve the accuracy of the profiling, reducing this to zero is not advised as it will increase sensitivity at the expense of specificity.
    The missing number of reads can be due again to the fact that this is an estimation, so it is unlikely it will correspond to the exact number of reads or to the fact that some reads may belong to unclassified taxa not present in the database.
1 Like

Hi @Claudia_Mengoni, thanks so much for your response! It really helped clarify my thoughts.