I have run metaphlan3 using rel_ab_w_read_stats for outputs because I would like to have estimated number of reads from each clade. If I calculate relative abundances using estimated reads from each clade, i.e. estimated reads from each clade is divided by the total number of estimated reads of the corresponding sample, why I get different values as metaphlan3 reports in the rel_ab_w_read_stats output?
Here is an example of relative abundances: the column on the right includes relative abundances reported by Metaphlan3, and the column on the left includes relative abundances which I calculated from the estimated number of reads from the clade.
k_Bacteria 81.20709 98.04187213
k_Viruses 18.57751 1.179186285
k_Eukaryota 0.21539 0.778941581
Can you specify me how the metaphlan3 calculates the relative abundance?
MetaPhlAn species relative abundances are not calculated directly from the ratio between the estimated number of reads and the total number of reads.
For each clade, a vector of markers and their count (#reads hitting the marker) is created and sorted according to the count.
According to the quantile value chosen (stat_q value, in MetaPhlAn3 the value is set to 0.2), the markers taken into consideration for the local abundance calculation are the one falling between the 20th and 80th percentile.
Using this subset of makers, the local abundance is calculated by dividing the sum(#reads mapping on the clade markers) by the sum(clade marker length).
The number of reads mapping to the clade is estimated by multiplying the local clade abundance by the clade’s average genome size. Then, relative abundances are then calculated as local clade abundance divided by the sum of all the local abundances
Thanks a lot Francesco! I’m still presenting stupid questions.
You said “The number of reads mapping to the clade is estimated by multiplying the local clade abundance by the clade’s average genome size. Then, relative abundances are then calculated as local clade abundance divided by the sum of all the local abundances”
Does this mean that relative abundances are not scaled by the genome size and only the number of reads mapping to the clade are scaled by genome size (i.e. multiplying the local clade abundance by the clade’s average genome size)?
If I understand it correctly, this means that for two clades with the same local abundance, the count will be higher for the clade having the longest average genome size?
I am going to use count data in a statistical model, and I am afraid that this will affect the result. Is it maybe better for me to calculate the counts from the relative abundances and read depth instead (where rel_ab is scaled according to unknown estimate to ensure that the counts are not overestimated)? In this calculation, two genomes with the same abundances have the same count independent of their genome size.
@fbeghini I hope that you saw my question even though this discussion is quite old. I am looking forward to get your opinion of this.
Sorry for not catching this earlier.
Yes, that’s right. For your purpose you could try obtaining the raw maker counts by running MetaPhlAn with option
-t marker_counts. These are counts not normalized with the average genome length.
You can obtain the
-t marker_counts output by re-running MetaPhlAn using the previously produced bowtie2out file as input.
Keep in mind that these are raw reads hits count and it is possbile that the internal algorithm discards some markers (e.g. marker is ignored as it is a quasi marker with hits from multiple species)
hello @fbeghini This discussion helped me to understand how to calculate relative abundance. Still, I have two questions:
- Where I can get clade’s average genome size? Is That only summing the length of all available markers for that clade and dividing by the number of markers present in the clade?
- How the stat_choices = [‘avg_g’,‘avg_l’,‘tavg_g’,‘tavg_l’,‘wavg_g’,‘wavg_l’,‘med’] values calculated? can you explained one in detail for e.g “tavg_g”. Though I looked into script but I did not understand formula construct.
Thanks in Advance.
Hi @fbeghini. Sorry to revive this old thread if this has been already discussed elsewhere, but I would like to ask why there could be many differences between the
-t marker_counts and the
-t rel_ab outputs. In fact, in one of my experiments, I found the majority of the reads coming from bowtie2 mapping to an organism (s__Mesorhizobium_plurifarium), but in the re_ab output it doesn’t appear.
Should I consider it as a “quasi marker”, like you stated?
Thank you very much
Hi fbeghini, May I know how can I calculate this "directly from the ratio between the estimated number of reads and the total number of reads " from the relative abundance produced by metaphlan output?