The bioBakery help forum

How is UNKNOWN calculated?


I was wondering if you could provide some further information on how the UNKNOWN output is calculated?

My understanding of Metaphlan2 is that it basically aligns reads to a marker gene database and then from this generates an abundance calculation, adjusting for genome size and such as it goes. For something to be unknown I guess it means that it’s not in the marker gene database? But couldn’t these just be other genomic data DNA from organisms? The pan-genome can be quite large after all.

Mostly I’m interested in the interpretation because I want to be sure it’s different to other meanings of unclassified. e.g. highly conserved DNA sequences across multiple species/families/phyla which can’t be classified to a taxonomic rank, but is in fact present in the database.


That’s right, the estimation done by MetaPhlAn is done by not taking into account the total number of mapped reads, but the full metagenome size.

Hello guys,
I also wanted to know if I was correctly interpreting the UNKNOWN value. After reading your comments I think it is now clear to me. The UNKNOWN value represents the % of input reads that could not be mapped to the gene database. That is correct right?
If that’s in fact correct, I now have another question. I’m running Metaphlan3 with the --samout tag to get a SAM file as output too. The resulting abundance file shows the following:

#/usr/local/bin/metaphlan /batchx/input/file0/ERR1293537.fq.gz --input_type fastq --bowtie2out /batchx/output/metaphlan3/ERR1293537.all-kingdoms.bowtie2.bz2 --samout /batchx/kronaTmp//ERR1293537.all-kingdoms.sam --biom /batchx/output/metaphlan3/ERR1293537.all-kingdoms.biom -t rel_ab --bowtie2db /batchx/mpa_v30_CHOCOPhlAn_201901 --unknown_estimation --stat_q 0.2 --perc_nonzero 0.33 --min_mapq_val 5 --read_min_len 50 --add_viruses --sample_id ERR1293537 -o /batchx/output/metaphlan3/ERR1293537.all-kingdoms.txt --nproc 10
#SampleID ERR1293537
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 63.74387
k__Bacteria 2 36.22163950842033
k__Viruses 10239 0.034494085501256995

Meaning that ~36% of the reads were mapped while ~64% were classified as UNKNOWN since they could not be mapped. Everything is fine at this point. However if I count the number of reads in the SAM file there are only around 5% of the original input reads, far from that 36% of mapped read.
So my question is why such big difference? What happened to the rest of mapped reads? Is that the SAM only keep those reads that meet all the filtering criteria like mapping quality etc?
Best regards!

Hi David

No, this is not correct. MetaPhlAn calculates taxon relative abundance, not read relative abundance, so the UNKNOWN value estimates the total relative abundance of unknown species

Thanks for clearing that out for me Francesco. So, if I were interested in finding out the absolute number of aligned reads where should I look? Is counting the number of reads in the SAM file accurate? if not where?
Best regards!

Yes, by looking in the SAM file you will get all the reads that have been mapped against a marker.

All clear now! Thanks a lot Francesco.

Hi Franscesco,
I have another question regarding the unknowns.
I am applying metaphlan on fastQ files derived from isolates (so not from microbiomes). But the result is still a high % of unknown (about 20% for a M. abscessus sample) and 60% for a Haemophilus species. As both genus/species are present in the database, I would suspect that %unknown would be 0? Because obviously a lot of reads will not map, as you are using only about 80 genes per species and that could actually explain partly why percentage unknown is so big but then I read that this unknown is not the percentage of total reads…