Rarefy metagenome sequence data before MetaPhlAn3 analysis

Hi community!!! @fbeghini,
Should I rarefy WGS metagenome sequence data before MetaPhlAn3 analysis?
Is this step super necessary? If yes, can I do this step on bowtie2out.txt or profile.txt files?
Thanks, DC7

Usually, no, but it depends on what you have to do. If you need to perform alpha-diversity analyses and the metagenome sizes are quite different, you should go for it.

Thanks @fbeghini

  • In that case, at which step should I rarefy? Can I do it on the bowtie2out.txt file?


  • I don’t really understand what you mean by “quite different”. My samples range from 600 MB to 2.2 GB. Should I rarefy?


Hi @DEEPCHANDA7, a common rarefaction procedure is to rarefy each sample to the 10th percentile of the dataset's read-count distribution, meaning that 10% of the samples will fall below the threshold and will be excluded.
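
To make the threshold concrete, here is a minimal Python sketch of how the 10th percentile and the per-sample retain fractions could be computed. It is not taken from any MetaPhlAn tool: the function names are hypothetical, the input is assumed to be a dictionary of total read counts per sample, and the nearest-rank percentile definition is just one possible choice (other definitions give slightly different thresholds).

```python
import math

def percentile_threshold(read_counts, pct=10.0):
    """Nearest-rank percentile of the per-sample read counts.

    read_counts: dict mapping sample name -> total number of reads.
    The nearest-rank definition is an assumption made for this sketch.
    """
    ordered = sorted(read_counts.values())
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]

def retain_probabilities(read_counts, pct=10.0):
    """Return (threshold, kept, dropped).

    kept maps each sample at or above the threshold to the fraction of
    its reads to retain (threshold / sample read count); dropped lists
    the samples below the threshold, which are excluded.
    """
    threshold = percentile_threshold(read_counts, pct)
    kept = {s: threshold / n for s, n in read_counts.items() if n >= threshold}
    dropped = sorted(s for s, n in read_counts.items() if n < threshold)
    return threshold, kept, dropped
```

With 20 samples holding 1000, 2000, ..., 20000 reads, this sketch picks 2000 as the threshold, drops the smallest sample, and assigns the largest one a retain fraction of 0.1.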

You can subsample either the raw reads or the bowtie2 alignments down to the 10th percentile (and then pass the rarefied bowtie2 outputs as input to MetaPhlAn). I can’t think of any procedure capable of rarefying the taxonomic profiles directly.

Since by doing rarefaction you will not only lose 10% of the samples (if you apply this type of rarefaction) but also a large part of the sample diversity, I would personally use the rarefied profiles just for alpha diversity. Since 2.2 GB is a normal size and 600 MB is a medium-small size for a metagenome, in your case I would personally go with rarefaction, but just for the alpha-diversity computation.


Hi Paolo, many thanks for your reply.
Can you please tell me which software I can use to rarefy the bowtie2out.txt file? I have already run MetaPhlAn on a large dataset, so I would rather not repeat it. Also, the 10th percentile you mentioned is not clear to me; do you have any article about this?

Thanks and regards,

I don’t know of any specific software. I could point you to a repository with a custom Python script, but at the moment I feel confident only in the version that rarefies the raw reads.
As for the procedure, it’s easy: the 10th percentile of the distribution of the number of reads is the depth you want to reduce each sample to. For each sample with a number of reads higher than or equal to the percentile, you compute the proportion of its reads that makes it equal to the percentile, that is, percentile / sample number of reads. Then you iterate over the reads, generate for each one a random number in [0, 1], and keep the read if the random number is lower than or equal to the retain probability. If you do this with the bowtie2 alignments it’s the same, but you subset the lines of the bowtie2 output according to the retain probability instead of the reads themselves.
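
The procedure above can be sketched in a few lines of Python. This is only an illustration, not the custom script mentioned in the thread: the function names are hypothetical, FASTQ records are assumed to span exactly four lines each, and the bowtie2 output is assumed to hold one alignment per line, so each line is kept or dropped on its own.

```python
import random

def subsample_records(records, retain_prob, seed=None):
    """Yield each record with probability retain_prob.

    records can be any iterable: lines of a bowtie2 output file, or
    FASTQ records produced by fastq_records() below. A fixed seed makes
    the subsampling reproducible.
    """
    rng = random.Random(seed)
    for record in records:
        # One random number per record; keep it if it does not exceed
        # the retain probability (percentile / sample read count).
        if rng.random() <= retain_prob:
            yield record

def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-line records.

    Assumes well-formed, unwrapped FASTQ (header, sequence, '+', quality).
    """
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            yield tuple(record)
            record = []
```

Note that this per-read draw yields a number of reads that is close to, but not exactly, the target depth. For paired-end FASTQ files, the same seed and the same read order in both files would be needed to keep mates together.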


Hi @paolinomanghi, @fbeghini, @DEEPCHANDA7

I would like to come back to this discussion. Most of my samples are between 2 GB and 3.8 GB, with some samples having only 100-300 MB. Therefore I would like to subsample in order to calculate reliable alpha-diversity metrics. Do you know a tool for random subsampling of FASTQ data? Or should I use a tool like phyloseq, which subsamples after the taxonomic classification?


I am curious about rarefaction too. I would like to generate rarefaction curves for my samples to determine if we are sequencing deep enough to capture the diversity of each sample. Is there a way to extract the taxon count data from MetaPhlAn3 prior to the relative abundance normalization?