The bioBakery help forum

Metaphlan3 relative abundance

Hi,

I have run metaphlan3 using rel_ab_w_read_stats for outputs because I would like to have estimated number of reads from each clade. If I calculate relative abundances using estimated reads from each clade, i.e. estimated reads from each clade is divided by the total number of estimated reads of the corresponding sample, why I get different values as metaphlan3 reports in the rel_ab_w_read_stats output?

Here is an example of relative abundances: the column on the right includes relative abundances reported by Metaphlan3, and the column on the left includes relative abundances which I calculated from the estimated number of reads from the clade.

k_Bacteria 81.20709 98.04187213
k_Viruses 18.57751 1.179186285
k_Eukaryota 0.21539 0.778941581

Can you specify me how the metaphlan3 calculates the relative abundance?

Best, Hanna

Hi Hanna,
MetaPhlAn species relative abundances are not calculated directly from the ratio between the estimated number of reads and the total number of reads.
For each clade, a vector of markers and their count (#reads hitting the marker) is created and sorted according to the count.
According to the quantile value chosen (stat_q value, in MetaPhlAn3 the value is set to 0.2), the markers taken into consideration for the local abundance calculation are the one falling between the 20th and 80th percentile.
Using this subset of makers, the local abundance is calculated by dividing the sum(#reads mapping on the clade markers) by the sum(clade marker length).
The number of reads mapping to the clade is estimated by multiplying the local clade abundance by the clade’s average genome size. Then, relative abundances are then calculated as local clade abundance divided by the sum of all the local abundances

1 Like

Thanks a lot Francesco! I’m still presenting stupid questions.

You said “The number of reads mapping to the clade is estimated by multiplying the local clade abundance by the clade’s average genome size. Then, relative abundances are then calculated as local clade abundance divided by the sum of all the local abundances”

Does this mean that relative abundances are not scaled by the genome size and only the number of reads mapping to the clade are scaled by genome size (i.e. multiplying the local clade abundance by the clade’s average genome size)?

Best, Hanna

Yes, that’s correct.

Hi Francesco,

If I understand it correctly, this means that for two clades with the same local abundance, the count will be higher for the clade having the longest average genome size?

I am going to use count data in a statistical model, and I am afraid that this will affect the result. Is it maybe better for me to calculate the counts from the relative abundances and read depth instead (where rel_ab is scaled according to unknown estimate to ensure that the counts are not overestimated)? In this calculation, two genomes with the same abundances have the same count independent of their genome size.

Best, Anne

@fbeghini I hope that you saw my question even though this discussion is quite old. I am looking forward to get your opinion of this.

Hi Anne,
Sorry for not catching this earlier.

Yes, that’s right. For your purpose you could try obtaining the raw maker counts by running MetaPhlAn with option -t marker_counts. These are counts not normalized with the average genome length.