MetaPhlan3 --unknown_estimation

Hello Francesco,

I have a similar problem.
I analyzed 456 samples with Metaphlan3 using this template command line: metaphlan --bowtie2out --nproc 20 --input_type fastq --unknown_estimation -o <output-file
279 are my own: shotgun metagenomic sequencing of stool samples on a NovaSeq 6000, paired-end 150 nt.
177 are controls that I selected from the curatedMetagenomicData package.
The average unknown% in my samples is 56%.
In the samples from the curatedMetagenomicData package:

  •  Samples from SchirmerM_2016 average unknown% = 67%
  •  Samples from PasolliE_2018 average unknown% = 81%
  •  Samples from Raymond_2016 average unknown% = 54%
  •  Samples from Obregon.TitoAJ_2015 average unknown% = 77%
  •  Samples from HMP_2012 average unknown% = 68%
  •  Samples from NielsenHB_2014 average unknown% = 55%
  •  Samples from BritoIL_2016 average unknown% = 85%

The average unknown% estimated by MetaPhlAn 3 is a lot higher than I would expect from Figure 2A of the Pasolli et al. 2019 paper.
However, the methods of that paper state that the reference genomes used to map the reads come from several databases:

  •  17,607 microbial species from the UniProt portal
  •  80,853 genomes from NCBI GenBank database
  •  137 isolate genomes collected in (Browne et al., 2016)
  •  all the 159,803 assemblies available in NCBI as of September 2018

This means that 258,400 reference genomes were used for this paper.
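As a quick check of that total, here is a minimal sketch that sums the four counts listed above. The dictionary keys are just descriptive labels chosen for this illustration; the numbers are the ones quoted from the paper's methods.

```python
# Reference counts as quoted above from the methods of Pasolli et al., 2019.
# The keys are descriptive labels made up for this illustration.
reference_sources = {
    "UniProt microbial species": 17_607,
    "NCBI GenBank genomes": 80_853,
    "Browne et al., 2016 isolate genomes": 137,
    "NCBI assemblies (September 2018)": 159_803,
}

# Sum the four sources to get the size of the reference panel.
total_references = sum(reference_sources.values())
print(total_references)  # 258400
```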
How could the read mappability with MetaPhlAn 3 be as high as in this paper, when the database used by MetaPhlAn 3 is made of 100,000 reference genomes?

curatedMetagenomicData uses profiles from MetaPhlAn2, not MetaPhlAn 3. How did you obtain the average unknown percentage?

Unlike the method described in Pasolli et al., 2019, MetaPhlAn 3 estimates read mappability from the number of reads mapping to the detected species, rather than by mapping the metagenome to the panel of 258,400 reference genomes.

Hello @fbeghini,

Thank you for your answer.

I used the accession numbers in curatedMetagenomicData, downloaded the raw reads, performed quality control, and ran MetaPhlAn 3 on them.

Also, I did not use --add_viruses. Could that be a reason why the unknown% is so high?

I am wondering what could explain such a high % of unknowns. It is way higher than 40%.

I am also not sure I understand the distinction you made here: “read mappability by MetaPhlan3 is done by estimating the amount of reads mapping to the detected species instead of mapping the metagenome to the panel of 258,400 reference genomes.”

No, --add_viruses should increase the mappability a little, but not by much.
Since MetaPhlAn is marker-based, you cannot directly count how many reads map to the whole genomes, so for each detected clade the total number of reads in the metagenome belonging to it is estimated instead.
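To make the idea concrete, here is a rough sketch of how a marker-based estimate could be extrapolated to whole genomes and turned into an unknown fraction. This is only an illustration under simplifying assumptions, not MetaPhlAn's actual code: the function names, the use of marker coverage times genome length divided by read length, and all the numbers are hypothetical.

```python
def estimate_clade_reads(marker_coverage, genome_length, read_length):
    """Extrapolate marker coverage to the reads expected over the whole genome.

    Assumption for this sketch: the genome is covered as deeply as its markers.
    """
    return marker_coverage * genome_length / read_length


def unknown_fraction(clades, total_reads, read_length):
    """Fraction of the metagenome not attributed to any detected clade."""
    estimated = sum(
        estimate_clade_reads(c["marker_coverage"], c["genome_length"], read_length)
        for c in clades
    )
    # The estimate can overshoot, so cap the explained fraction at 1.
    explained = min(estimated / total_reads, 1.0)
    return 1.0 - explained


# Hypothetical sample: two detected clades, one million 150 nt reads.
example_clades = [
    {"marker_coverage": 10.0, "genome_length": 3_000_000},
    {"marker_coverage": 2.0, "genome_length": 5_000_000},
]
example_unknown = unknown_fraction(example_clades, total_reads=1_000_000, read_length=150)
print(round(example_unknown, 3))  # 0.733
```

In this toy case only the two detected clades contribute estimated reads, so everything else in the sample, including organisms with no markers in the database, ends up in the unknown fraction.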

@fbeghini, thank you for the answer. I understand now :slight_smile:. Just to be sure: I should not worry about the 56% of UNKNOWN in my samples, then?

Yes! The number is reasonable.

@fbeghini great, thank you so much!

Hi,

As I have a similar question, I figured I could ask it here rather than opening a new topic. I ran MetaPhlAn 3 on human stool samples sequenced on an Illumina HiSeq 2000 platform, with approximately 8 Gb of 150 bp paired-end reads per sample (mean 7.9 Gb, st. dev. 1.2 Gb).

metaphlan <sample.fastq> --input_type fastq --nproc 16 -o_metaphlan.txt --tmp_dir --bowtie2db --force --add_viruses -t rel_ab_w_read_stats --unknown_estimation

The mean +/- std relative abundance for the unknown category is 67.6 +/- 13.8, which indeed seems high. However, this is not my question. I was wondering whether it is possible to merge an mp3 abundance table generated with unknown estimation with an mp3 abundance table generated without it. Can I remove the unknown category, re-scale, and then merge the two tables? Related to that, when you don't estimate the unknown, where do those reads end up? Are they just removed?

I find that the average relative abundances of e.g. s__Faecalibacterium_prausnitzii and s__Escherichia_coli are fairly comparable between these two mp3 datasets (7.289 vs 5.447 for F. prausnitzii and 0.421 vs 0.525 for E. coli).
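For clarity, this is the arithmetic I have in mind when I say "remove the unknown category then re-scale". It is a minimal sketch on a made-up species-level profile: the function name and the species labels are hypothetical, and the row label is assumed to be UNKNOWN. It only shows the re-normalization itself, not whether merging the tables afterwards is appropriate.

```python
def drop_unknown_and_rescale(profile, unknown_label="UNKNOWN"):
    """Remove the unknown row and re-scale the remaining rows to sum to 100.

    `profile` maps clade names to relative abundances (in percent) and should
    hold a single taxonomic level, e.g. species only.
    """
    known = {clade: ab for clade, ab in profile.items() if clade != unknown_label}
    total = sum(known.values())
    if total == 0:
        # Nothing was classified, so there is nothing to re-scale.
        return known
    return {clade: 100.0 * ab / total for clade, ab in known.items()}


# Made-up species-level profile produced with unknown estimation.
with_unknown = {"UNKNOWN": 60.0, "s__species_A": 30.0, "s__species_B": 10.0}
rescaled = drop_unknown_and_rescale(with_unknown)
print(rescaled)  # {'s__species_A': 75.0, 's__species_B': 25.0}
```

After this step the profile sums to 100 again, like a profile generated without unknown estimation.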

Thanks,
Johannes

I have a soil sample with 2x150 bp paired-end reads. I am getting about 80% unknown. Can you please tell me which parameters I should change to reduce the unknown percentage? What min_alignment_len do you recommend for this data?