How to get the absolute counts for each taxon

I have read your paper on bioBakery 3 and I am using MetaPhlAn 3. Thanks for making it available; it is a really good tool. I want to run alpha diversity analysis and DESeq2 on the taxa, so I need absolute read counts. I ran the following command with the additional flag “-t rel_ab_w_read_stats”:

metaphlan r1.fastq.gz,r2.fastq.gz -t rel_ab_w_read_stats -o s1.tsv --input_type fastq --bowtie2out s1.bowtie2.bz2 --nproc 4

In the output file, there are additional columns, including “coverage” and “estimated_number_of_reads_from_the_clade”.

My questions are:

  • Is “estimated_number_of_reads_from_the_clade” the absolute read count for a specific taxon?
  • There is also a “coverage” column. What does it really mean? Should I filter the output table on this coverage value to get a good taxa table? If so, which threshold would be good to use?

Best

  • The read count is an estimate of the number of reads contributed by a given clade. It is computed by extending the “reads per kilobase” estimate from a species’ marker genes over the species’ average genome length, and then summing species counts to higher-level clades. You could use these counts for count-based models (like DESeq2) but I would not use them directly in alpha diversity estimates (they are not equivalent to organism/cell counts).

  • “Coverage” is a measure of sequencing depth per unit length of a genome (or summed over genomes within a clade). Coverage is proportional to cell count (if A has twice B’s coverage, we infer that twice as many A cells were present vs. B cells). Relative abundance is sum-normalized species coverage (which can then be summed to higher taxonomic levels). We typically filter on relative abundance and prevalence, e.g. keeping taxa that exceed 0.1% abundance in 10% of samples and collapsing everything else into a lower-confidence “other” bin.
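The arithmetic described in this reply can be sketched in a few lines of plain Python. This is only an illustration of the description above, not MetaPhlAn's actual code: the function names, the dictionary layout, and the toy numbers are all hypothetical, and the thresholds are the 0.1% abundance and 10% prevalence values mentioned as an example.

```python
def estimate_reads(rpk, genome_length_bp):
    """Extend a reads-per-kilobase estimate over an average genome length."""
    return rpk * genome_length_bp / 1000.0


def relative_abundance(coverage):
    """Sum-normalize per-species coverage so the values add to 100%."""
    total = sum(coverage.values())
    if total == 0:
        return {taxon: 0.0 for taxon in coverage}
    return {taxon: 100.0 * cov / total for taxon, cov in coverage.items()}


def filter_taxa(samples, min_abundance=0.1, min_prevalence=0.1):
    """Keep taxa above min_abundance (%) in at least min_prevalence of samples.

    `samples` maps sample name -> {taxon: relative abundance in %}.
    All other taxa are collapsed into one "other" entry per sample.
    """
    n_samples = len(samples)
    taxa = {t for profile in samples.values() for t in profile}
    kept = set()
    for taxon in taxa:
        hits = sum(1 for p in samples.values() if p.get(taxon, 0.0) > min_abundance)
        if hits / n_samples >= min_prevalence:
            kept.add(taxon)
    filtered = {}
    for name, profile in samples.items():
        row = {t: profile.get(t, 0.0) for t in sorted(kept)}
        row["other"] = sum(v for t, v in profile.items() if t not in kept)
        filtered[name] = row
    return filtered


# Toy data (made up): taxon C never exceeds 0.1% and ends up in "other".
toy = {
    "s1": {"A": 60.0, "B": 39.95, "C": 0.05},
    "s2": {"A": 80.0, "B": 20.0},
    "s3": {"A": 50.0, "B": 49.92, "C": 0.08},
}
print(filter_taxa(toy))
print(estimate_reads(rpk=2.5, genome_length_bp=4_000_000))
```

With only a handful of samples, a 10% prevalence threshold is met by a single sample, so the abundance cutoff does most of the work; the prevalence cutoff matters more in larger cohorts.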

Thanks for your reply.

  • If I want to get the count table for each clade, is there any other way to do so with MetaPhlAn 3 besides adding the “-t rel_ab_w_read_stats” flag?

  • I am looking at the relative abundance results and see values greater than 1. I thought relative abundance should be between 0 and 1. Are these numbers already multiplied by 100?

Thanks

  • I believe that’s the best way to get the estimated read counts. MetaPhlAn is more focused on relative abundance estimation (hence focusing on those numbers in the primary output).

  • I believe the numbers add to 100% rather than 1.0 in the default output.
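For downstream work, the percentages can be rescaled to fractions and the estimated read counts pulled out per species. The sketch below assumes the “-t rel_ab_w_read_stats” table is tab-separated with a “#clade_name” header line naming the columns clade_name, clade_taxid, relative_abundance, coverage, and estimated_number_of_reads_from_the_clade; please check this against your own s1.tsv. The function names and the toy table (including its taxids and values) are hypothetical.

```python
import io


def parse_profile(handle):
    """Read a profile table into a list of {column: value} dictionaries.

    Assumes the column names sit on a comment line starting with '#clade_name'.
    """
    columns = None
    records = []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#clade_name"):
                columns = line[1:].split("\t")
            continue
        if columns is None:
            raise ValueError("no '#clade_name' header line found")
        records.append(dict(zip(columns, line.split("\t"))))
    return records


def species_table(records):
    """Keep species-level rows; convert percent to fraction, reads to int."""
    table = {}
    for rec in records:
        last_rank = rec["clade_name"].split("|")[-1]
        if not last_rank.startswith("s__"):
            continue
        reads = rec["estimated_number_of_reads_from_the_clade"]
        table[last_rank] = {
            "fraction": float(rec["relative_abundance"]) / 100.0,
            "reads": int(float(reads)),
        }
    return table


# Made-up table in the assumed layout (not real MetaPhlAn output).
TOY = (
    "#clade_name\tclade_taxid\trelative_abundance\tcoverage\t"
    "estimated_number_of_reads_from_the_clade\n"
    "k__Bacteria\t2\t100.0\t3.0\t3000\n"
    "k__Bacteria|s__Species_A\t2|11\t75.0\t2.25\t2400\n"
    "k__Bacteria|s__Species_B\t2|12\t25.0\t0.75\t600\n"
)

table = species_table(parse_profile(io.StringIO(TOY)))
for name, values in table.items():
    print(name, values["fraction"], values["reads"])
```

To use it on a real file, pass an open file handle for s1.tsv to parse_profile instead of the io.StringIO toy input.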