Metaphlan3 relative abundance

Hi,

I have run metaphlan3 using rel_ab_w_read_stats for outputs because I would like to have estimated number of reads from each clade. If I calculate relative abundances using estimated reads from each clade, i.e. estimated reads from each clade is divided by the total number of estimated reads of the corresponding sample, why I get different values as metaphlan3 reports in the rel_ab_w_read_stats output?

Here is an example of relative abundances: the column on the right includes relative abundances reported by Metaphlan3, and the column on the left includes relative abundances which I calculated from the estimated number of reads from the clade.

k_Bacteria 81.20709 98.04187213
k_Viruses 18.57751 1.179186285
k_Eukaryota 0.21539 0.778941581

Can you specify me how the metaphlan3 calculates the relative abundance?

Best, Hanna

2 Likes

Hi Hanna,
MetaPhlAn species relative abundances are not calculated directly from the ratio between the estimated number of reads and the total number of reads.
For each clade, a vector of markers and their count (#reads hitting the marker) is created and sorted according to the count.
According to the quantile value chosen (stat_q value, in MetaPhlAn3 the value is set to 0.2), the markers taken into consideration for the local abundance calculation are the one falling between the 20th and 80th percentile.
Using this subset of makers, the local abundance is calculated by dividing the sum(#reads mapping on the clade markers) by the sum(clade marker length).
The number of reads mapping to the clade is estimated by multiplying the local clade abundance by the clade’s average genome size. Then, relative abundances are then calculated as local clade abundance divided by the sum of all the local abundances

2 Likes

Thanks a lot Francesco! I’m still presenting stupid questions.

You said “The number of reads mapping to the clade is estimated by multiplying the local clade abundance by the clade’s average genome size. Then, relative abundances are then calculated as local clade abundance divided by the sum of all the local abundances”

Does this mean that relative abundances are not scaled by the genome size and only the number of reads mapping to the clade are scaled by genome size (i.e. multiplying the local clade abundance by the clade’s average genome size)?

Best, Hanna

Yes, that’s correct.

Hi Francesco,

If I understand it correctly, this means that for two clades with the same local abundance, the count will be higher for the clade having the longest average genome size?

I am going to use count data in a statistical model, and I am afraid that this will affect the result. Is it maybe better for me to calculate the counts from the relative abundances and read depth instead (where rel_ab is scaled according to unknown estimate to ensure that the counts are not overestimated)? In this calculation, two genomes with the same abundances have the same count independent of their genome size.

Best, Anne

@fbeghini I hope that you saw my question even though this discussion is quite old. I am looking forward to get your opinion of this.

Hi Anne,
Sorry for not catching this earlier.

Yes, that’s right. For your purpose you could try obtaining the raw maker counts by running MetaPhlAn with option -t marker_counts. These are counts not normalized with the average genome length.

Hi. Couple of questions.

  1. Because the estimated read counts do not accurately represent number of organisms they would skew count-based diversity metrics towards organisms with longer genomes. Instead, would it be more appropriate to use -t marker_counts in count-based diversity calculations?

  2. Is there any way to obtain -t marker_counts without rerunning the samples through metaphlan?

You can obtain the -t marker_counts output by re-running MetaPhlAn using the previously produced bowtie2out file as input.
Keep in mind that these are raw reads hits count and it is possbile that the internal algorithm discards some markers (e.g. marker is ignored as it is a quasi marker with hits from multiple species)

1 Like

hello @fbeghini This discussion helped me to understand how to calculate relative abundance. Still, I have two questions:

  1. Where I can get clade’s average genome size? Is That only summing the length of all available markers for that clade and dividing by the number of markers present in the clade?
  2. How the stat_choices = [‘avg_g’,‘avg_l’,‘tavg_g’,‘tavg_l’,‘wavg_g’,‘wavg_l’,‘med’] values calculated? can you explained one in detail for e.g “tavg_g”. Though I looked into script but I did not understand formula construct.
    Thanks in Advance.

Hi @fbeghini. Sorry to revive this old thread if this has been already discussed elsewhere, but I would like to ask why there could be many differences between the -t marker_counts and the -t rel_ab outputs. In fact, in one of my experiments, I found the majority of the reads coming from bowtie2 mapping to an organism (s__Mesorhizobium_plurifarium), but in the re_ab output it doesn’t appear.

Should I consider it as a “quasi marker”, like you stated?
Thank you very much

Hi fbeghini, May I know how can I calculate this "directly from the ratio between the estimated number of reads and the total number of reads " from the relative abundance produced by metaphlan output?

Hi @fbeghini . This topic is interesting to me because I am trying to get a sense of the number of reads mapping to known clades and compare it to my total number of reads. This is because I am analyzing fecal metagenomes from wildlife that I suspect will have some proportion of unclassifiable reads and I want to understand the drop off of reads that map to known clades, and specifically known bacterial clades.

I decided to use the -t rel_ad_with_stats flag in Metaphlan4 to get the estimated numebr of reads from the clades. I also have samtools summary numbers from my original fastq files. I noticed a pecular pattern across my samples - the total number of reads from samtools is much smaller than the estimated number of reads for clade hits. As an example, I have a sample with 8 million reads frm samtools. For the same sequence, I ended up getting an estimated number of reads for just a single clade (bacteria) that was nearly 10 x this values at ~85,000,000. What could be the cause for this discrepancy? Is there an analytical value I could use instead of the estimated number of reads to answer my question?
Thanks!!

Hi Francesco,
Thank you for this explanation. I wish to know if the sum of relative abundance of all species calculated this way would be one?
Best
Hamid