Extract "Unclassified" sequences

Hi all,

I’m wondering if there’s a way to obtain the sequences corresponding to the entries labeled as “Unclassified” in a MetaPhlAn output profile. I understand these reads are either not represented in the MetaPhlAn database or don’t have enough marker coverage to be assigned to a known taxon. Still, I’d like to know if there’s a way to extract the actual reads (as a fasta file) or retrieve them from the SAM file for further exploration.

Thanks a lot for your help!

Best,

Fangxi

Hi @Fangxi_Xu

The unclassified percentage is an estimate: it is computed from the coverage of the markers used by MetaPhlAn and the expected genome size of the identified species. The mapping file therefore shows only the reads that mapped to the markers. Any read that is not reported there could be either a truly unclassified read or a read that would map to the rest of a detected genome (the parts that are not markers). In short, there is no easy way to know which reads belong to the unclassified portion.
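
If it helps for exploration, you can at least pull out every read that did not map to a marker, keeping in mind that this set mixes unclassified reads with non-marker reads from the detected genomes, so it is an upper bound and not the unclassified fraction itself. Below is a minimal sketch in plain Python, assuming you kept the SAM file from the run and still have the input FASTQ. The function names (`mapped_read_ids`, `non_marker_reads`, `to_fasta`) are made up for this example and are not part of MetaPhlAn, and it assumes read names in the SAM match the FASTQ headers up to the first whitespace.

```python
from typing import Iterable, Iterator, Set, Tuple


def mapped_read_ids(sam_lines: Iterable[str]) -> Set[str]:
    """Collect the names of reads that aligned to a marker in a SAM file."""
    ids: Set[str] = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        name, flag = fields[0], int(fields[1])
        # Bit 0x4 of the SAM flag means "read unmapped".
        if flag & 0x4:
            continue
        ids.add(name)
    return ids


def non_marker_reads(
    fastq_lines: Iterable[str], mapped: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for FASTQ reads that did not map to a marker.

    Assumes plain 4-line FASTQ records (no wrapped sequences).
    """
    it = iter(fastq_lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line
        next(it, "")  # quality line
        # Read name is the header without '@', up to the first whitespace.
        name = header[1:].split()[0]
        if name not in mapped:
            yield name, seq


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

With real files you would pass open file handles instead of lists, for example `mapped_read_ids(open("sample.sam"))`, where `sample.sam` is a placeholder name. The resulting FASTA could then be used with another tool (an assembler or a broader reference database) for further exploration.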