Can we add a column with the number of reads for each taxon next to relative_abundance?

Hello,

I am using Metaphlan4 for fecal metagenomics. Although it works well, I would like to add the exact number of reads detected for each taxon in a column next to the relative abundance column.

This is easy to calculate manually or in R/python since we are given the total number of reads in the sample, however would be equally useful to have it automatically calculated in the metaphlan output.

Is there a particular parameter to get this that I am not aware of?
I have read that some people used something like “-t rel_ab_w_read_stats” but had a discrepancy in the counts obtained.
Can you please help me?

Best,
Ameneh.

Hi @Ameneh
MetaPhlAn 4 profiling is based on the mapping against species-specific markers genes, that only represent a part of the species genome. Thus the number of reads assigned to each species is just an estimation based on the avg coverage of the markers and the median length of the species’ genomes (what is calculated using “-t rel_ab_w_read_stats”). The EXACT number of reads is not possible to be calculated using MetaPhlAn

Hi @aitor.blancomiguez

Thanks for your response.

When I run Metaphlan 4, bowtie2out provide an text file as output with the number of reads. for example you can see this output:
:#52373574 reads processed
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 100.0
k__Bacteria|p__Bacteroidetes 2|976 68.29302
…

which means you have number of reads based on sequencing data you entered.
But if there is something else that i am not aware of, please let me know,

Hi @Ameneh
Yes, metaphlan will report the number of reads that were processed as input, but what I tried to say is that known the exact number of reads assigned to each taxa is not possible to assess via MetaPhlAn as it does not map against full genomes but just species-specific marker genes. The calc of the reads assigned to each taxa is a rough estimation based on the coverage of the markers and the median genome length of each species

Hi @aitor.blancomiguez,

Sorry for reviving this topic, but I have some questions regarding read counts in MetaPhlAn, especially after going through the 4.1 tutorial.

I analyzed the file SRS014476-Supragingival_plaque.fasta.gz and found the following:

  • Total reads: 20,000 (I counted this with zcat SRS014476-Supragingival_plaque.fasta.gz | grep -c ">")
  • MetaPhlAn output:
    • Reads processed: 19,048
    • Estimated reads mapped to known clades: 2,298 (using the -t rel_ab_w_read_stats flag)

My Questions:

  1. From 20,000 to 19,048:
  • You mention that “metaphlan will report the number of reads that were processed as input”, then how does the count decrease from 20,000 to 19,048? Could the 952 reads that were not processed be due to them not being bacteria reads or possibly failing quality checks?
  1. From 19,048 to 2,298:
  • I understand that the decrease to 2,298 may be influenced by the -stat_q parameter. When I tested different values:
    • -stat_q 0.1 resulted in 4,659 estimated reads.
    • -stat_q 0.0 resulted in 13,477 estimated reads.
  • If I set -stat_q to 0.0, I expected to see the number of estimated reads equal to 19,048 since that would include all reads. Why is it still less than that?

I appreciate your insights, and I hope I’m approaching this correctly!

Thanks in advance!

Hi @jihen
to answer your questions:

  1. The number of processed reads as you guessed is due to quality checks on reads, specifically based on minimum read length (default value is 70)
  2. The 2,298 is the estimated number of reads that are assigned to a taxa by MetaPhlAn. The stat_q parameter (default 0.20) is a parameter that is used to exclude upper and lower quartiles of markers detected for each taxa, so it will directly influence the estimation of reads. This is used to improve the accuracy of the profiling, reducing this to zero is not advised as it will increase sensitivity at the expense of specificity.
    The missing number of reads can be due again to the fact that this is an estimation, so it is unlikely it will correspond to the exact number of reads or to the fact that some reads may belong to unclassified taxa not present in the database.
1 Like

Hi @Claudia_Mengoni, thanks so much for your response! It really helped clarify my thoughts.

Hi @Claudia_Mengoni,

Sorry for necromancing this post again. I’m following up with another question after running MetaPhlAn 4.1 and comparing the stats with the input FASTQ files.

Here’s the setup:

  • I input two paired-end FASTQ files (host removed), each with 99,320 reads.

  • I verified that all reads are ≥70 bp using both seqkit and awk, e.g.:

    zcat RC2_host_rm_1.fq.gz | awk 'NR%4==2{if(length($0)>=70) count++} END{print count}'
    # Output: 99320
    

    Same result using:

    seqkit seq -m 70 RC2_host_rm_1.fq.gz | grep -c "^@"
    # Output: 99320
    
  • So based on your previous response that MetaPhlAn filters out reads shorter than 70 bp, I think this show there will be no read drop as all reads are above the threshold.

Then, I ran MetaPhlAn with this command:

metaphlan RC2_host_rm_1.fq.gz,RC2_host_rm_2.fq.gz \
  --input_type fastq -t rel_ab_w_read_stats \
  --index mpa_vJun23_CHOCOPhlAnSGB_202307 \
  --bowtie2db /home/bt2db/ \
  --bowtie2out RC2.bowtie2.bz2 \
  --nproc 8 -s RC2_metaphlan.sam -o RC2_profile_abs.txt

And I got this in the RC2_profile_abs.txt header:

#198640 reads processed
#estimated_reads_mapped_to_known_clades:82739

But here’s where I get confused:

  • There are 99,320 reads are in each file, and all are ≥70 bp, while MetaPhlAn report 198,640 reads processed (I guess this is because it added R1 + R2, so far so good),

  • Then only 82,739 reads are estimated to come from known clades (which as you said, some reads may belong to unclassified taxa or not present in the database, thus the number dropped),

  • However, when I export a sam file and do a flagstat, I saw only 4,944 reads appear as mapped in the SAM file:

    samtools flagstat RC2_metaphlan.sam
    # 4944 + 0 in total
    # 4944 + 0 mapped (100.00%)
    

:red_question_mark: My questions:

  1. What is the relationship between:

    • Number of input reads (198,640),
    • Number of mapped reads in the .sam (4,944),
    • Estimated number of reads from known clades (82,739)?
  2. Can I use samtools flagstat on the MetaPhlAn .sam file to infer anything meaningful about read mapping rate, coverage, or sequencing depth?

    • What does the 100% mapping rate reflect? Is it a subset of the total 198,640 reads?
  3. Similarly, can I use samtools coverage and samtools depth on the MetaPhlAn .sam file to infer anything meaningful about read mapping rate, coverage, or sequencing depth?

I’m trying to understand what each number represents in the pipeline and whether any of them can inform about read alignment quality or some statistics about sequencing coverage.


Thanks again for the great tool and support!