Rarecurve on long reads from MetaPhlAn 4.2.2

Hello,

I wanted to generate rarefaction curves from the MetaPhlAn output to see whether my sequencing depth captures the majority of my taxa. I was going to use the summary table, but I noticed it reports the number of bases in each clade, not the number of reads. How does MetaPhlAn arrive at this number? How does this affect making a rarecurve? Thank you!

Hi @GNas !

When running MetaPhlAn with the option for long-read sequencing, the clade coverage is computed by counting the number of bases mapping to the markers instead of the number of reads, because long reads vary widely in length and counting reads would not take that into account. Consequently, all computations in the MetaPhlAn workflow that involve the number of reads are done with the number of bases for long reads (e.g. subsampling is performed by number of bases). This should not affect your rarefaction curves as long as you are aware that the unit is number of bases and not reads.
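
To illustrate why the unit matters, here is a minimal sketch with made-up numbers. The clade names and lengths are hypothetical, and each length stands in for the number of bases of one read that mapped to a clade's markers. This is not MetaPhlAn's code, only a toy comparison of the two counting schemes.

```python
# Hypothetical example data: bases mapped per read, grouped by clade.
mapped_read_lengths = {
    "clade_A": [15000, 12000, 18000],               # few long reads
    "clade_B": [1000, 1200, 900, 1100, 800, 1000],  # many short reads
}


def count_by_reads(lengths_by_clade):
    """Count how many reads mapped to each clade."""
    return {clade: len(lengths) for clade, lengths in lengths_by_clade.items()}


def count_by_bases(lengths_by_clade):
    """Sum the mapped bases for each clade."""
    return {clade: sum(lengths) for clade, lengths in lengths_by_clade.items()}


read_counts = count_by_reads(mapped_read_lengths)
base_counts = count_by_bases(mapped_read_lengths)

print("by reads:", read_counts)
print("by bases:", base_counts)
```

In this toy data clade_B has more reads, but clade_A has far more mapped bases, which is why counting reads alone would be misleading for long reads.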

Hope this is helpful!

Linda

Thank you for the explanation @lindacova !

To clarify, then: when I use the summary table, what I am doing is “sampling” using the probability of each clade, so is it as simple as a bases for that clade / total number of bases calculation to simulate sampling at various depths?
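
For reference, the sampling idea in this question could be sketched roughly as below. This is only an assumption about what such a simulation might look like, not something MetaPhlAn does: the clade names and base counts are hypothetical, each clade is drawn with probability (bases for that clade) / (total bases), and the draws are made with replacement and treat every base as independent, which real reads are not.

```python
import random


def rarefaction_curve(base_counts, depths, seed=0):
    """Return (depth, number of distinct clades drawn) for each depth.

    Depths are in bases. Sampling is with replacement, weighted by each
    clade's share of the total bases (a multinomial approximation).
    """
    rng = random.Random(seed)
    clades = list(base_counts)
    weights = [base_counts[clade] for clade in clades]
    curve = []
    for depth in depths:
        drawn = rng.choices(clades, weights=weights, k=depth)
        curve.append((depth, len(set(drawn))))
    return curve


# Hypothetical per-clade base counts, as if read from a summary table.
example_counts = {
    "clade_A": 900_000,
    "clade_B": 90_000,
    "clade_C": 9_000,
    "clade_D": 900,
    "clade_E": 90,
}

for depth, richness in rarefaction_curve(example_counts, [100, 1_000, 10_000, 100_000]):
    print(depth, richness)
```

Because this resamples from a table that already reflects one sequencing depth, it cannot reveal clades that were never detected in the first place.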

Hi @GNas ,

If you want to simulate various sequencing depths, I would suggest running MetaPhlAn on different subsamplings of your sample.
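
As a rough sketch of what subsampling long reads to a base budget could look like before running MetaPhlAn on each subset: the function name, the in-memory `(read_id, sequence)` tuples, and the read lengths below are all hypothetical, and this is not MetaPhlAn's own subsampling code. Reads are shuffled and kept until the cumulative number of bases reaches the target.

```python
import random


def subsample_by_bases(reads, target_bases, seed=0):
    """Randomly pick reads until their total length reaches target_bases.

    reads is a list of (read_id, sequence) tuples. Returns the picked reads
    and their total number of bases. The total can overshoot the target by
    up to one read, and falls short if the input has too few bases.
    """
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)
    picked, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        picked.append(reads[index])
        total += len(reads[index][1])
    return picked, total


# Hypothetical long reads with very different lengths.
lengths = [2000, 500, 8000, 1200, 3000, 700]
example_reads = [(f"read_{i}", "A" * n) for i, n in enumerate(lengths)]

for target in (2_000, 6_000, 12_000):
    subset, n_bases = subsample_by_bases(example_reads, target)
    print(target, len(subset), n_bases)
```

Each subset would then be written back to FASTQ and profiled separately, giving one point per depth on the curve.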

Regarding the use of the summary table, if you are referring to the MetaPhlAn output obtained with the --rel_ab_w_read_stats option, that file provides an estimate of the number of bases covering each clade. This value is calculated by multiplying the clade coverage by the clade-specific average genome length.
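
As a small worked example of that multiplication, with made-up numbers (the function name, the coverage value, and the genome length are hypothetical and not taken from any real profile):

```python
def estimated_clade_bases(clade_coverage, avg_genome_length):
    """Estimate bases covering a clade as coverage times average genome length."""
    return clade_coverage * avg_genome_length


# Hypothetical clade: 2.5x coverage, 4 Mb average genome length.
example_estimate = estimated_clade_bases(2.5, 4_000_000)
print(example_estimate)  # 10000000.0
```

Since the value is derived from coverage rather than counted directly, it is best treated as an approximate figure when used as input for a rarefaction curve.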