Extract "Unclassified" sequences

Fangxi_Xu · October 13, 2025, 6:25pm

Hi all,

I’m wondering if there’s a way to obtain the sequences corresponding to the entries labeled as “Unclassified” in a MetaPhlAn output profile. I understand these reads are either not represented in the MetaPhlAn database or don’t have enough marker coverage to be assigned to a known taxon. Still, I’d like to know if there’s a way to extract the actual reads (as a fasta file) or retrieve them from the SAM file for further exploration.

Thanks a lot for your help!

Best,

Fangxi

Claudia_Mengoni · October 14, 2025, 9:56am

Hi @Fangxi_Xu

The unclassified percentage is an estimate based on the coverage of the markers used by MetaPhlAn and the expected genome size of the identified species. This means the mapping file only shows the reads that mapped to the markers. All the other reads, which are not reported, could be either truly unclassified reads or reads that would map to the non-marker parts of the detected genomes. In short, there is no easy way to know which reads belong to the unclassified portion.
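
For anyone who still wants to explore the reads that did not hit any marker, here is a minimal sketch in plain Python. It assumes you have a standard SAM file from the marker mapping and the original FASTQ with four-line records, and that read names match between the two files. The function names are made up for this example and are not part of MetaPhlAn. Keep in mind the caveat above: the output is "reads with no marker alignment", which mixes genuinely unknown reads with reads from non-marker regions of detected genomes, so it is not the same thing as the "Unclassified" fraction in the profile.

```python
import io


def mapped_read_names(sam_lines):
    """Return the names of reads with at least one alignment in a SAM stream.

    A record counts as mapped when bit 0x4 ("segment unmapped") of its
    FLAG field is not set.
    """
    mapped = set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a valid alignment record
        name, flag = fields[0], int(fields[1])
        if not flag & 0x4:
            mapped.add(name)
    return mapped


def fastq_to_fasta_excluding(fastq_lines, exclude, out):
    """Write every FASTQ read whose name is not in `exclude` as FASTA.

    Assumes four-line FASTQ records. Returns the number of reads written.
    """
    written = 0
    lines = iter(fastq_lines)
    for header in lines:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(lines).strip()
        next(lines)  # '+' separator line
        next(lines)  # quality line
        name = header[1:].split()[0]  # drop '@' and any description
        if name not in exclude:
            out.write(f">{name}\n{seq}\n")
            written += 1
    return written


if __name__ == "__main__":
    # Tiny in-memory example; with real data, pass open file handles instead.
    sam = io.StringIO(
        "@HD\tVN:1.0\n"
        "r1\t0\tmarkerA\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\n"
    )
    fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
    out = io.StringIO()
    fastq_to_fasta_excluding(fastq, mapped_read_names(sam), out)
    print(out.getvalue(), end="")
```

The set of mapped names is held in memory, which is usually fine here because only a small share of reads hits the markers. To narrow the result further, you could map the remaining reads against the full genomes of the detected species and drop anything that aligns, but that is a separate step outside MetaPhlAn.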
