Only Bacteria is profiled

Hello everyone,
I have been running some tests using metaphlan3 (version 3.0.2 (23 Jul 2020)) with the latest chocophlan version (mpa_v30_CHOCOPhlAn_201901). I noticed that all the samples I have used return results only for the Bacteria kingdom, whereas I was expecting to obtain some archaeal and eukaryotic profiles as well. This is how I have been running the program:

metaphlan <my-fastq-file> --input_type fastq -o <my-output-file> --nproc 4

I even tried the SRS019033.fastq example file and obtained similar results: no trace of any kingdom other than Bacteria. I would guess there should be some, since when I add --unknown_estimation I get 76.65 classified as unknown.
I have tried switching to an older chocophlan version (mpa_v296_CHOCOPhlAn_201901) but still haven’t obtained any profiles other than Bacteria.
As a final test I created a FASTQ file exclusively out of the Aeropyrum pernix FASTA reference and got 100% unclassified reads.
I must be missing something very obvious I guess.
By the way, I didn’t add --ignore_archaea flag if that’s what you are thinking : )
Thanks in advance and congrats on the new version!
Best regards,
David

Small update which looks promising:
Lowering the --stat_q parameter from 0.2 to 0.1 (as recommended in other posts) results in k__Archaea appearing in the output. I still have a high number classified as UNKNOWN (> 60), though.
Apart from this being a problem with the sample itself (biological or otherwise), is there any configuration parameter that could be tuned to decrease this value?
Best regards,
David

More updates on this topic. I have lowered several thresholds in an attempt to decrease the number of UNKNOWN matches and to get matches for other kingdoms:
--stat_q 0.1 --perc_nonzero 0.25 --min_mapq_val -1 --read_min_len 50

But I continue to have very little success. Here are the first lines of the output:

#mpa_v30_CHOCOPhlAn_201901
#/usr/local/bin/metaphlan /batchx/input/file0/ERR479430-test.fq.gz --input_type fastq --bowtie2out /batchx/output/metaphlan3/ERR479430-test.bowtie2.bz2 --biom /batchx/output/metaphlan3/ERR479430-test.biom -t rel_ab --bowtie2db /batchx/metaphlan_databases --unknown_estimation --stat_q 0.1 --perc_nonzero 0.25 --min_mapq_val -1 --read_min_len 50 --sample_id ERR479430 -o /batchx/output/metaphlan3/ERR479430-test.txt --nproc 16
#SampleID ERR479430
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 81.24807
k__Bacteria 2 18.751932003563716
k__Bacteria|p__Bacteroidetes 2|976 9.098928708747609

Any suggestions to decrease the number of UNKNOWN matches as well as to get other kingdoms besides Bacteria?

Hi David,
Sorry for the late reply

Unfortunately, this will not work if there is not enough coverage.
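If the coverage remark refers to the single-genome test described earlier (a FASTQ built from one FASTA reference), one way to explore it is to regenerate that test file at a chosen depth. The following is only a minimal Python sketch under stated assumptions: reads are error-free, single-end, sampled uniformly, and given a constant placeholder quality. The names `simulate_reads` and `to_fastq` and the toy genome are hypothetical and not part of metaphlan.

```python
import random


def simulate_reads(genome, read_len=100, coverage=5.0, seed=0):
    """Yield (name, sequence, quality) tuples sampled uniformly from genome.

    Assumption: error-free single-end reads with a constant quality string.
    """
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the requested read length")
    rng = random.Random(seed)
    # Number of reads needed to reach the requested average depth.
    n_reads = int(len(genome) * coverage / read_len)
    for i in range(n_reads):
        start = rng.randrange(0, len(genome) - read_len + 1)
        seq = genome[start:start + read_len]
        yield f"read_{i}", seq, "I" * read_len


def to_fastq(reads):
    """Format (name, sequence, quality) tuples as four-line FASTQ records."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in reads)


# Toy stand-in for a real reference sequence (hypothetical data).
_rng = random.Random(1)
toy_genome = "".join(_rng.choice("ACGT") for _ in range(2000))

fastq_text = to_fastq(simulate_reads(toy_genome, read_len=100, coverage=5.0))
print(fastq_text.count("\n") // 4, "reads written")
```

Varying `coverage` and rerunning the profiler on the resulting file would show whether depth is what keeps the genome from being reported.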

I see that you have lowered the stat_q and removed the mapping quality filter; is this an environmental sample?

Hello Francesco,
no worries, you replied at the perfect time.
I have run multiple tests with different samples. One that I think should be easy for you to replicate, as it is the sample listed in file_list.txt, is SRS019033 (the one I mentioned in my first post). I believe this sample is from a retroauricular crease, according to this EBI link.

These are the first lines from the metaphlan3 output I get after running the program:

#mpa_v30_CHOCOPhlAn_201901
#/usr/local/bin/metaphlan /batchx/input/file0/SRS019033.fastq --input_type fastq --bowtie2out /batchx/output/metaphlan3/SRS019033.bowtie2.bz2 --samout /batchx/output/metaphlan3/SRS019033.sam --biom /batchx/output/metaphlan3/SRS019033.biom -t rel_ab --bowtie2db /batchx/metaphlan_databases --unknown_estimation --stat_q 0.1 --perc_nonzero 0.25 --min_mapq_val -1 --read_min_len 50 --sample_id SRS019033 -o /batchx/output/metaphlan3/SRS019033.txt --nproc 16
#SampleID SRS019033
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 61.865
k__Bacteria 2 38.135000000000005
k__Bacteria|p__Firmicutes 2|1239 31.279840959500003

Is this the expected output when running the program with the configuration shown above, with 61.865 classified as UNKNOWN and only Bacteria matches? Isn’t 61.865 too high?

In addition, I was wondering whether there is a sample that has been well characterized with metaphlan3, which I could use to validate that everything was installed and configured properly (preferably one that shows other kingdoms besides Bacteria).
Thanks in advance!
David

Hi David,
the profile of sample SRS017821 from the HMP has all three kingdoms.
For a skin sample, 38% of mapped reads is OK: the average mappability in skin samples is around 40%. See Figure 2A (https://www.sciencedirect.com/science/article/pii/S0092867419300017#fig2 ) in the Pasolli et al. 2019 paper.
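The 38% figure can be read straight off the UNKNOWN row: in the outputs posted in this thread, UNKNOWN and the kingdom rows add up to 100, so whatever is not UNKNOWN was assigned to known clades. A minimal Python sketch of that arithmetic, using the two UNKNOWN values posted above; `mapped_percent` is a hypothetical helper name, not part of metaphlan.

```python
def mapped_percent(unknown_percent):
    """Percentage assigned to known clades, given the UNKNOWN row value."""
    return 100.0 - unknown_percent


# UNKNOWN values copied from the outputs posted in this thread.
unknown_by_sample = {"ERR479430": 81.24807, "SRS019033": 61.865}

for sample, unknown in unknown_by_sample.items():
    print(f"{sample}: {mapped_percent(unknown):.3f} assigned to known clades")
```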

Hello Francesco,
thank you for your reply. I ran the analysis with the SRS017821 sample as you suggested. The output now shows a small proportion of Archaea and Eukaryota in addition to Bacteria (good news :) ):

#mpa_v30_CHOCOPhlAn_201901
#/usr/local/bin/metaphlan /batchx/input/file0/SRS017821.fq.gz --input_type fastq --biom /batchx/output/metaphlan3/SRS017821.biom -t rel_ab --bowtie2db /batchx/metaphlan_databases --unknown_estimation --stat_q 0.2 --perc_nonzero 0.33 --min_mapq_val 20 --read_min_len 70 --sample_id SRS017821 -o /batchx/output/metaphlan3/SRS017821.txt --nproc 16
#SampleID SRS017821
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 45.60938
k__Bacteria 2 54.292647237692464
k__Eukaryota 2759 0.09489531656933838
k__Archaea 2157 0.003083948214295573

I assume this is the correct output, right?
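One quick sanity check on an output like this is to pull out only the UNKNOWN and kingdom-level rows and confirm that they add up to roughly 100. The sketch below is a minimal Python example, assuming whitespace-separated columns as they appear in this post (clade name first, relative abundance third); `kingdom_summary` is a hypothetical helper, not something metaphlan provides.

```python
def kingdom_summary(profile_text):
    """Return {clade_name: relative_abundance} for UNKNOWN and kingdom rows."""
    summary = {}
    for line in profile_text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        clade, abundance = fields[0], float(fields[2])
        # Kingdom-level rows start with 'k__' and have no '|' separator.
        if clade == "UNKNOWN" or (clade.startswith("k__") and "|" not in clade):
            summary[clade] = abundance
    return summary


# Rows copied from the SRS017821 output above.
example = """\
#mpa_v30_CHOCOPhlAn_201901
#SampleID SRS017821
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 45.60938
k__Bacteria 2 54.292647237692464
k__Eukaryota 2759 0.09489531656933838
k__Archaea 2157 0.003083948214295573
"""

summary = kingdom_summary(example)
for clade, abundance in summary.items():
    print(f"{clade}\t{abundance:.5f}")
print(f"total\t{sum(summary.values()):.3f}")
```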
Thank you also for referring me to that article, I saw the figure but I now want to read it thoroughly.
One last question Francesco, if I may. What is the default mapping quality used by metaphlan3? I might have missed it, but I didn’t find this value in the documentation.
Best regards,
David