I also wanted to check whether I was interpreting the UNKNOWN value correctly. After reading your comments I think it is now clear to me: the UNKNOWN value represents the percentage of input reads that could not be mapped to the gene database. Is that correct?
If that’s in fact correct, I now have another question. I’m running MetaPhlAn 3 with the --samout option to get a SAM file as output too. The resulting abundance file shows the following:
#/usr/local/bin/metaphlan /batchx/input/file0/ERR1293537.fq.gz --input_type fastq --bowtie2out /batchx/output/metaphlan3/ERR1293537.all-kingdoms.bowtie2.bz2 --samout /batchx/kronaTmp//ERR1293537.all-kingdoms.sam --biom /batchx/output/metaphlan3/ERR1293537.all-kingdoms.biom -t rel_ab --bowtie2db /batchx/mpa_v30_CHOCOPhlAn_201901 --unknown_estimation --stat_q 0.2 --perc_nonzero 0.33 --min_mapq_val 5 --read_min_len 50 --add_viruses --sample_id ERR1293537 -o /batchx/output/metaphlan3/ERR1293537.all-kingdoms.txt --nproc 10
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 63.74387
k__Bacteria 2 36.22163950842033
k__Viruses 10239 0.034494085501256995
Meaning that ~36% of the reads were mapped, while ~64% were classified as UNKNOWN because they could not be mapped. Everything is fine up to this point. However, if I count the reads in the SAM file, they amount to only around 5% of the original input reads, far from the 36% of mapped reads.
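For anyone who wants to reproduce the comparison, here is a minimal sketch of how such a count could be made. It assumes a standard FASTQ with four lines per record and counts the distinct read names in the SAM file whose unmapped flag (0x4) is not set. The function names are made up for illustration, and this is not necessarily how MetaPhlAn counts reads internally.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly 4 lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def count_mapped_sam_reads(path):
    """Count distinct read names in a SAM file that have an alignment."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("@"):
                continue  # header line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            flag = int(fields[1])
            if flag & 0x4:
                continue  # read is flagged as unmapped
            names.add(fields[0])  # a read with several alignments counts once
    return len(names)


def mapped_percentage(fastq_path, sam_path):
    """Percentage of input reads that appear as mapped in the SAM file."""
    total = count_fastq_reads(fastq_path)
    if total == 0:
        return 0.0
    return 100.0 * count_mapped_sam_reads(sam_path) / total
```

Calling `mapped_percentage("ERR1293537.fq.gz", "ERR1293537.all-kingdoms.sam")` on local copies of the two files would give the percentage I am comparing against the ~36% above.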
So my question is: why is there such a big difference? What happened to the rest of the mapped reads? Is it that the SAM file only keeps the reads that meet all the filtering criteria, such as mapping quality?