How is UNKNOWN calculated?

Hello,

I was wondering if you could provide some further information on how the UNKNOWN output is calculated?

My understanding of Metaphlan2 is that it basically aligns reads to a marker gene database and then from this generates an abundance calculation, adjusting for genome size and such as it goes. For something to be unknown I guess it means that it’s not in the marker gene database? But couldn’t these just be other genomic data DNA from organisms? The pan-genome can be quite large after all.

Mostly I’m interested in the interpretation because I want to be sure it’s different to other meanings of unclassified. e.g. highly conserved DNA sequences across multiple species/families/phyla which can’t be classified to a taxonomic rank, but is in fact present in the database.

Thanks!

That’s right, the estimation done by MetaPhlAn is done by not taking into account the total number of mapped reads, but the full metagenome size.

Hello guys,
I also wanted to know if I was correctly interpreting the UNKNOWN value. After reading your comments I think it is now clear to me. The UNKNOWN value represents the % of input reads that could not be mapped to the gene database. That is correct right?
If that’s in fact correct, I now have another question. I’m running Metaphlan3 with the --samout tag to get a SAM file as output too. The resulting abundance file shows the following:

#mpa_v30_CHOCOPhlAn_201901
#/usr/local/bin/metaphlan /batchx/input/file0/ERR1293537.fq.gz --input_type fastq --bowtie2out /batchx/output/metaphlan3/ERR1293537.all-kingdoms.bowtie2.bz2 --samout /batchx/kronaTmp//ERR1293537.all-kingdoms.sam --biom /batchx/output/metaphlan3/ERR1293537.all-kingdoms.biom -t rel_ab --bowtie2db /batchx/mpa_v30_CHOCOPhlAn_201901 --unknown_estimation --stat_q 0.2 --perc_nonzero 0.33 --min_mapq_val 5 --read_min_len 50 --add_viruses --sample_id ERR1293537 -o /batchx/output/metaphlan3/ERR1293537.all-kingdoms.txt --nproc 10
#SampleID ERR1293537
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 63.74387
k__Bacteria 2 36.22163950842033
k__Viruses 10239 0.034494085501256995

Meaning that ~36% of the reads were mapped while ~64% were classified as UNKNOWN since they could not be mapped. Everything is fine at this point. However if I count the number of reads in the SAM file there are only around 5% of the original input reads, far from that 36% of mapped read.
So my question is why such big difference? What happened to the rest of mapped reads? Is that the SAM only keep those reads that meet all the filtering criteria like mapping quality etc?
Best regards!

Hi David

No, this is not correct. MetaPhlAn calculates taxon relative abundance, not read relative abundance, so the UNKNOWN value estimates the total relative abundance of unknown species

Thanks for clearing that out for me Francesco. So, if I were interested in finding out the absolute number of aligned reads where should I look? Is counting the number of reads in the SAM file accurate? if not where?
Best regards!

Yes, by looking in the SAM file you will get all the reads that have been mapped against a marker.

All clear now! Thanks a lot Francesco.

Hi Franscesco,
I have another question regarding the unknowns.
I am applying metaphlan on fastQ files derived from isolates (so not from microbiomes). But the result is still a high % of unknown (about 20% for a M. abscessus sample) and 60% for a Haemophilus species. As both genus/species are present in the database, I would suspect that %unknown would be 0? Because obviously a lot of reads will not map, as you are using only about 80 genes per species and that could actually explain partly why percentage unknown is so big but then I read that this unknown is not the percentage of total reads…

Dear @fbeghini
I have a quick follow-up question to this. Since the UNKNOWN value estimates the total relative abundance of unknown species, I should definitely keep that estimation in my profiles, right? My samples vary between 20-80% unknown not including unknown feels like introducing a bias. But if my understanding is correct, why does mpa not include unknown as per default?
Please, let me know if I misunderstood anything!
Thanks a lot!
/Stef

@fbeghini could you please comment on this? Or anyone else?
I appreciate it :slight_smile:

Hi and sorry for the late reply.
Can you check with the latest version of MetaPhlAn? We have refined the unknown estimation as we were seeing the same behaviour you encountered.

Hi Stef,
We choose not to include by default the unknown estimation and add a flag to enable it only if more expert users wants to use the estimated value

May I ask how you refined the estimation? I also wonder whether you have human/mammalian markers in your db. I wonder how efficient my host removal step prior to mpa is and if it isn’t, would the remaining human reads end up as unknown or human/mammalian?
/Stef

About the estimation refinement, I’ll let @aitor.blancomiguez answer your question.
We do not have any mammalian nor human markers in the database as it is built to target microbial organisms, but the remaining reads could be included in the unknown relative abundance.

Hi @Stef . We indeed made some improvements in the UNKNOWN estimation, but tecnically related to the normalization inside the metaphlan code. This fraction represents the proportion of unclassified reads that we cannot assign to any microbial organisms present in the metaphlan database, so any other DNA present in your sample, e.g. unknown microbial species, host DNA, food-related DNA, etc… will be included within that fraction.

Ok, thank you! So when you have a cohort and want to compare e.g. ctrls and cases, would you recommend including UNKNOWN (as they can be microbial) or exclude them (as they can come from food)? Most human reads are removed before we run mpa.
So how do you guys usually deal with UNKNOWN?