Rarefying raw file for taxonomic classification MetaPhlAn3

Hi @fbeghini sir
I have done taxonomic profiling using MetaPhlAn3 but my question is that my input files or raw files (.fastq.gz format) are of varying sizes ranging from 200 MB to 18GB.

  • rarefy raw files?
    so do you think I need to rarefy my data before doing the taxonomic profiling? if yes then what is the method to do so?
    I am asking this question because after doing the analysis I got different results for different sizes of files. For e.g If the file was having 641 M Bases of sequencing yield then it gave me 62 species and when the file was having 4473 M bases sequencing yield I got 181 species for that particular file/sample.

  • get the datasheet for top 100 or top 25 species
    I apologise for asking question related to other topic but I need to know when we use the flag --ftop 100 or --ftop n in merged_profile.txt file we get n number of species or whatever level(phylum/class…etc) while making the heat map using #hclust2, how is it done what is the method? Do you take average of all the samples? Can I get the datasheet or a table for the top 25 features/clades?

Thanks a ton in Advance

Hi @fbeghini
Please reply to the 1st question if you find any time because I found the answer to the 2nd question. Thanks.


If your samples come from different environments, then differences in diversity are to be expected. If they do not and you expect them to have a similar number of species, then this is probably caused by biases introduced by different sequencing depths, therefore it is unlikely that rarefying could improve your results.