The bioBakery help forum

Metaphlan Read count output instead of relative abundances

Hi @fbeghini. I have ran MetaPhlAn 3.0 with -t rel_ab_w_read_stats option to get read counts along with relative abundances. But, I am seeing descrepancy in many cases.

For example: relative abundance of Prevotella_copri is 54.10114. But, total mapped read count is 7569053 and read count for Prevotella copri is 4558937. So, it was supposed to be (4558937/7569053)*100 = 60.231273318.
Why I’m seeing such difference? Is there any normalization or something? If yes, how can I get absolute read counts of each clade?

Thanks,
DC7

See Metaphlan3 relative abundance

Considering your reply in that link, I think there’s no way to get the absolute read counts of each of the clades. Right?
In that case, is it impossible to calculate alpha diversity or do rarefaction analysis? Or, is there any way out?

Thanks,
DC7

No. Alpha diversity can be calculated in terms of present species, a tally of species with non-zero relative abundance. As mentioned in Rarefy metagenome sequence data before MetaPhlAn3 analysis , rarefaction should be done on the raw reads or the bowtie2out file.

regarding alpha diversity calculation it is not possible with phyloseq. Because it does not accept relative abundance for the calculation. It takes the read counts of the species. So, considering that, can I multiply the relative abundance of each of the species with number of reads in bowtie2out.txt file and then analyse with phyloseq?
Thanks,
DC7

It you want to use the estimated number of reads, you can re-run the analysis using -t rel_ab_w_read_stats. It is not necessary to use phyloseq to calculate the alpha diversity if you use a richness index, just count the species present.

Thanks @fbeghini. I know I am asking a lot of questions that may bother you. Please don’t mind and bear with me as I am facing a lot of confusions.
Actually, I told you regarding the multiplication with bowtie2out mapped read numbers because here Dr. Segata says:

There are however two ways to get an estimate:

  • just multiply the total number of reads by the relative abundance of each species. This works well, but you are likely overestimating the number of reads in each species because you cannot count the number of reads that would not map against any reference genome

  • use the “-t rel_ab_w_read_stats” which estimates the number of reads that should come from a given species by considering the coverage of the species’ markers and the length of the genome (taken from reference genomes).

(However, the overestimation stated by Dr. Segata can be avoided with using bowtie2out mapped read number.)

Now, I am confused among the two ways what to follow because, both of them will give different read counts. And also, I want to do all the statistical analysis (e.g. wilcoxon ranksum, correlation, regression, etc.) with the read counts rather than relative abundance. So, what will be better?

Again, I am apologetic for asking too many questions.

Thanks,
DC7

So, are you saying to calculate only the Observed alpha diversity which only takes into account the richness not the evenness? In case of Shannon diversity index both richness and evenness is considered. So, I think the count number is must.
Not?

No, you could also calculate the Shannon Index or the Gini Simpson index considering the relative abundances,but without using phyloseq since it requires to use count data.
You can just write a code snippet that performs the calculations for each considered sample.

1 Like

Sir, you did not respond to one of my previous question. Can I multiply the “#nreads” number from “.bowtie2out.txt” file with the relative abundances of each of the taxa to get their respective read count?

Thanks,
DC7

Yes, but the number will be the same as the one reported by rel_ab_w_read_stats

1 Like

But, in that case, the merge_metaphlan_tables.py command does not work.