Metaphlan output question

Hi everyone,

I am quite inexperienced in Metaphlan, so I am a bit puzzled about one of the outputs when using the flag -t rel_ab_with_read_stats. Using this flag something called ‘estimated reads mapped to known clades’ is written on top of the output. How is this number calculated and what does it actually mean?

This number is also not the same as the estimated number of reads mapping to clade for k_bacteria, which is 100% in my sample. Could it be that it is the estimated number of reads for bacteria without any classification filters applied?

Hope you could help me out.

Kind regards,

Moelong Yu

Hi @Moellie
As metaphlan profiling is based on mapping only to species-specific marker genes, it has to estimate how many of the reads are coming for each of the taxa. To do so, it uses the coverage of the markers multiplied by the median genome length of the species to give that estimation.
For the difference between the reads mapping to bacteria and the first line, if I recall correctly, the first one is the raw number before any QC on the species in the profile while the second one is only calculated on the quality-controlled species

1 Like

Hi @aitor.blancomiguez

Thank you for your reply. So if I interpret this correctly, that would mean that ‘estimated reads mapped to known clades’ means the calculation that you used before QC happens. Would this number be an appropriate way of filtering out samples i.e. only samples with more than 100.000 estimated reads mapped to known clades are considered in by dataset.

And does before quality control mean that if there was just one read mapping to a species-specific marker gene, that species would instantly be identified by Metaphlan. So in a scenario where QC does not happen, only one read mapping to a species-specifc clade marker gene of E.coli, would result in an E.coli identification in the Metaphlan output.

Hi @Moellie
I think this is tricky, filtering out samples by the estimated mapped reads could be problematic in low-characterized environments where most of the reads will not be mapping to the current reference databases. I would rather filter out samples with low sequence coverage.
For QC, MetaPhlAn, by default, needs to find reads mapping to, at least, 20% of the markers available for a species (you can modify this behaviour with the --stat_q parameter). If a species does not pass this filter, it won’t be reported in the profile.

1 Like

Hi @aitor.blancomigue, this was really helpful. Thank you very much for the fast reply.