Metaphlan - requirements on input data volume?

tool & version: metaphlan 4.0.4 (conda install) + Oct2022 data


I have a question regarding input data volume for metaphlan4: are there minimal requirements on the input data volume for obtaining a sensible metagenomic profile?

To explain why I am asking: I am working with metatranscriptomics data from total RNA sequencing of human host samples, and I am trying to profile the fastq reads that do not map to the human reference (without much success so far). As an example, one of my samples contains ~500,000 non-human paired-end reads, 2x150 bp (a bit shorter after cleaning). Of those read pairs, ~180,000 are reported in the --samout output of metaphlan, with the vast majority mapping to the reference with very low MAPQ (usually 0 or 1). About 500 reads with higher MAPQ are reported in the --bowtie2out output. The profile itself comes out as 100% unclassified, with ~300 markers in the presence table.

I am trying to figure out whether there is a problem with how metaphlan is run (metaphlan4 with Oct2022 data, running offline, containerised) or whether the data are simply not good enough for this kind of profiling. I’ve also run kraken2, which classified most of the input reads (I know the tools work differently, I am just trying to show that something strange is going on…).

Thanks for any input.

Hi @tinav
It is perfectly normal that only a small fraction of the reads map in the SAM file when running MetaPhlAn. MetaPhlAn profiling is based on mapping reads to up to 200 marker genes per species, which is a small fraction of the total gene content of a microbial genome (typically thousands of genes), so you would expect to see this reflected in the SAM files.
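
If it helps to quantify this for a given sample, here is a minimal Python sketch that tallies the alignments in a SAM file by MAPQ. It relies only on the standard SAM column layout (FLAG in column 2, MAPQ in column 5). The function names, the file name in the usage comment, and the `min_mapq=5` cutoff are assumptions made for illustration; check the `--min_mapq_val` setting of your own MetaPhlAn installation for the threshold it actually applies.

```python
import bz2
from collections import Counter


def open_sam(path):
    """Open a SAM file as text; supports plain text and .bz2 files."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def mapq_summary(lines, min_mapq=5):
    """Tally alignments by MAPQ from an iterable of SAM lines.

    min_mapq is an assumed cutoff used only to count "passing" alignments.
    """
    mapq_counts = Counter()
    unmapped = 0
    for line in lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:  # 0x4 = read unmapped
            unmapped += 1
            continue
        mapq_counts[int(fields[4])] += 1
    mapped = sum(mapq_counts.values())
    passing = sum(n for q, n in mapq_counts.items() if q >= min_mapq)
    return {
        "mapped": mapped,
        "unmapped": unmapped,
        "passing": passing,
        "mapq_counts": dict(mapq_counts),
    }


# Usage (hypothetical file name):
# with open_sam("sample.sam.bz2") as fh:
#     print(mapq_summary(fh))
```

Comparing `passing` with `mapped` shows how many alignments would survive a MAPQ filter, which is usually a more useful number than the raw count of records in the SAM file.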