Hello Everyone,
I’m looking for clarification on MetaPhlAn4’s --unclassified_estimation feature.
Given MetaPhlAn’s marker-gene approach, I’m curious how “unclassified” reads are defined and estimated. How does the tool differentiate between reads from genuinely unknown species and reads that simply do not come from marker genes (and thus, as I understand it, wouldn’t map regardless)?
Specifically, how is the unclassified percentage estimated?
Thank you for any explanation you might give!
Hi @ossannav
For each SGB that is detected, we estimate the total number of reads it contributed to the sample (not only the ones that mapped), based on its read depth and average genome length. We then subtract these estimated reads from the total number of reads in the sample, and what remains is the unclassified portion.
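To make the arithmetic concrete, here is a minimal Python sketch of the idea. It is not MetaPhlAn's actual code. It assumes the reads for an SGB can be approximated as depth × average genome length / average read length; the division by read length, the clamp at the total read count, and all names and numbers below are illustrative assumptions, so check the paper for the exact formula.

```python
def estimate_unclassified_fraction(sgbs, total_reads, avg_read_length):
    """Rough estimate of the unclassified fraction of a sample.

    sgbs: list of (read_depth, avg_genome_length_bp) pairs, one per detected SGB.
    total_reads: total number of reads in the sample.
    avg_read_length: average read length in bp (assumed conversion factor).
    """
    # Estimated reads per SGB: bases covered (depth * genome length)
    # divided by the bases contributed by a single read.
    estimated_reads = sum(
        depth * genome_length / avg_read_length
        for depth, genome_length in sgbs
    )
    # Assumption: never attribute more reads than the sample contains.
    classified_reads = min(estimated_reads, total_reads)
    return 1.0 - classified_reads / total_reads


# Hypothetical sample: two detected SGBs, 1M reads of 100 bp.
example_sgbs = [(10.0, 3_000_000), (2.0, 5_000_000)]
fraction = estimate_unclassified_fraction(example_sgbs, 1_000_000, 100)
print(f"Estimated unclassified fraction: {fraction:.2f}")
```

In this made-up example the two SGBs account for an estimated 400,000 of the 1,000,000 reads, so about 60% is reported as unclassified. Note that reads from non-marker regions of a detected SGB are counted as classified here, because the estimate covers the whole genome rather than only the marker hits.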
You can find all the details in the ‘MetaPhlAn 4 unclassified reads calculation’ paragraph of the MetaPhlAn 4 paper, “Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4” (Nature Biotechnology).