Metaphlan - requirements on input data volume?

tool & version: metaphlan 4.0.4 (conda install) + Oct2022 data


I have a question about input data volume for MetaPhlAn 4: is there a minimum amount of input data needed to obtain a sensible metagenomic profile?

To explain why I am asking: I am working with metatranscriptomics data from total RNA sequencing of human host samples, and I am trying to profile the FASTQ reads that do not map to the human reference (without much success so far). As an example, one of my samples contains ~500,000 paired-end non-human reads, 2x150 bp (a bit shorter after cleaning). Of those read pairs, ~180,000 are reported in the --samout output of MetaPhlAn, the vast majority of them mapping to the reference with very low MAPQ (usually 0 or 1). About 500 reads with higher MAPQ are reported in the --bowtie2out output. The resulting profile comes out as 100% unclassified, with ~300 markers in the presence table.
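
For anyone who wants to reproduce the MAPQ breakdown described above, here is a minimal sketch that tallies alignments per MAPQ value from SAM text. It only assumes the standard SAM column layout (FLAG in column 2, MAPQ in column 5). The function names and the threshold are illustrative and are not part of MetaPhlAn.

```python
from collections import Counter


def mapq_histogram(sam_lines):
    """Count mapped, primary alignments per MAPQ value from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        flag = int(fields[1])
        if flag & 0x4:    # read is unmapped
            continue
        if flag & 0x900:  # secondary (0x100) or supplementary (0x800) alignment
            continue
        counts[int(fields[4])] += 1
    return counts


def fraction_at_or_above(counts, threshold):
    """Fraction of counted alignments with MAPQ >= threshold (hypothetical cutoff)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for q, n in counts.items() if q >= threshold) / total
```

To use it on a real file, pass an open text handle, for example `mapq_histogram(open("sample.sam"))`. If the SAM output is bz2-compressed, `bz2.open(path, "rt")` from the standard library should work in the same way.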

I am trying to figure out whether there is a problem with how MetaPhlAn is run (MetaPhlAn 4 with the Oct2022 data, running offline, containerised) or whether the data are simply not good enough for this kind of profiling. I also ran Kraken2, and most of the input reads were classified (I know the tools work differently; I am just trying to show that something strange is going on…).

Thanks for any input.

Hi @tinav
It is perfectly normal that only a small fraction of the reads are mapped in the SAM file when running MetaPhlAn. MetaPhlAn profiling is based on mapping reads to up to 200 marker genes per species, a small fraction of the total gene content of a microbial genome (which contains thousands of genes), so you would expect to see this reflected in the SAM files.
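
As a rough back-of-the-envelope illustration of that point, the sketch below computes the share of a genome's genes that the markers represent. The genome size of 4,000 genes is a hypothetical round number chosen only to stand in for "thousands of genes". The calculation also assumes that marker genes are about as long and as well covered as other genes, which need not hold, especially for transcript data.

```python
def rough_marker_fraction(markers_per_species, genes_per_genome):
    """Share of a genome's genes that are markers (a crude proxy for read share)."""
    if genes_per_genome <= 0:
        raise ValueError("genes_per_genome must be positive")
    return markers_per_species / genes_per_genome


# Up to 200 markers per species (from the reply above) against a
# hypothetical genome of 4,000 genes.
print(f"{rough_marker_fraction(200, 4000):.1%}")  # prints 5.0%
```

Under these assumptions, only a few percent of a species' reads could land on its markers at all, so a low mapped fraction on its own is not a sign of a broken run.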

Hi @aitor.blancomiguez, thanks for the reply!

Do you have a feel for what proportion of non-human reads usually maps to those marker genes, and how much that proportion varies between experiments?

If everything went right and I get no reads mapping to the marker genes, then MetaPhlAn may not be the right tool for my case, correct?

Hi @tinav
It is difficult to say: it varies from sample to sample and depends heavily on the environment, so I cannot really give you a number here.
In cases like yours, where the number of microbial reads is really low, MetaPhlAn might not be the best approach, and I would go for something more sensitive such as Kraken2 (but be really careful when interpreting the results, as it is prone to a high number of low-abundance false positives).
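
One simple way to be careful with low-abundance hits is to drop taxa supported by very few reads. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, direct reads, rank code, taxid, name). The function name and the default cutoff of 50 reads are hypothetical and would need tuning for each dataset.

```python
def filter_kraken2_report(report_lines, min_clade_reads=50, rank="S"):
    """Keep taxa of the given rank code with at least min_clade_reads clade reads.

    Returns (name, taxid, clade_reads) tuples sorted by clade reads, descending.
    """
    kept = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:  # skip lines that are not in the assumed format
            continue
        clade_reads = int(fields[1])
        if fields[3] == rank and clade_reads >= min_clade_reads:
            # Names are indented by taxonomic depth, so strip the padding.
            kept.append((fields[5].strip(), int(fields[4]), clade_reads))
    return sorted(kept, key=lambda t: t[2], reverse=True)
```

A read-count cutoff like this is only a heuristic. Kraken2's own `--confidence` option is another lever for reducing spurious assignments at classification time.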
