MetaPhlan3 --unknown_estimation

Stef · September 11, 2020, 11:57am

Dear MetaPhlan3 developers,
I am very excited about this new release with so many more references.
I did a test run like this:
for i in *.fastq; do metaphlan $i --input_type fastq --nproc 20 --unknown_estimation --index latest --add_viruses > ../metaphlan3/${i%.fastq}_profile.txt; done
I am a bit confused about the output. Please check the attached file. Why are there two rows named ‘UNKNOWN’? One has only ‘0’, but the first has very high values (80-90+ %), which seems a lot considering that these are human fecal samples.
When I sum up all relative abundances I end up with 150-180%, which is strange too.
Please help me interpret my results!
Thank you!
Stef

fbeghini · September 14, 2020, 8:59am

Hi Stef,
For fecal samples, that is quite a high value. Is it possible that the samples contain contaminants such as human sequences?
About the >100% sum: the UNKNOWN value is relative to the sum of the relative abundances at a single clade level. The profile reports every taxonomic level, so summing all rows counts the same abundance several times. If you sum up only the species’ relative abundances and add the UNKNOWN value, you’ll get 100%.
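
The check described above can be tried on a single profile. This is only a minimal sketch: the function name `sum_species_and_unknown` is made up for illustration, and it assumes a tab-separated profile with the clade name in column 1 and the relative abundance in column 3, with species (`s__`) as the deepest level reported.

```shell
# Sum species-level relative abundances and the UNKNOWN row of one profile.
# Assumed layout (not verified against every MetaPhlAn version):
# column 1 = clade name, column 3 = relative abundance, tab-separated.
sum_species_and_unknown() {
    awk -F '\t' '
        /^#/ { next }                              # skip header lines
        $1 == "UNKNOWN" { unknown = $3; next }     # unclassified fraction
        {
            n = split($1, levels, "|")
            # keep only rows whose deepest level is a species
            if (levels[n] ~ /^s__/) species += $3
        }
        END {
            printf "species=%.2f unknown=%.2f total=%.2f\n", species, unknown, species + unknown
        }
    ' "$1"
}
```

Calling `sum_species_and_unknown PB.39_profile.txt` prints the two partial sums and their total; a total far from 100 would suggest the column assumption does not match the file.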

Stef · September 14, 2020, 1:17pm

I removed human reads before running MetaPhlAn 3 (they made up at most 4% in any one sample), so that should not be it. And why are there two rows with ‘UNKNOWN’, one being zero and the other above 80%? Could you please take a look at the output I posted?
I very much appreciate your help interpreting the results!
/Stef

fbeghini · September 14, 2020, 3:51pm

That’s pretty strange. Can you upload here all the bowtie2out files that MetaPhlAn generated?

Stef · September 15, 2020, 6:23am

PB.39.fastq.bowtie2out.txt (801.0 KB) PB.41.fastq.bowtie2out.txt (1.3 MB) PB.42.fastq.bowtie2out.txt (3.7 MB)
Here are the bowtie2 output files of three samples.
Thank you so much for helping!
/Stef

fbeghini · September 15, 2020, 8:01am

Hi Stef,
I cannot reproduce the same behaviour (two UNKNOWN rows) after merging the three outputs. Which version of MetaPhlAn are you using?

merged.txt (23.3 KB)

Stef · September 15, 2020, 8:36am

Metaphlan3
I have more samples (70), so could the problem be somewhere else? How can I identify a potentially problematic sample?

Stef · September 15, 2020, 8:37am

Still, the unknown fraction should not be above 80%, since these are fecal samples after removal of host reads.

Stef · September 15, 2020, 8:43am

I identified the issue! The second UNKNOWN row came from the negative control. It was 100 there and 0 for all other samples. When merging all samples without the NC, it looks fine.
But I am still VERY worried about the UNKNOWN in my samples being above 80%!
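
One way to spot such a sample before merging is to list the UNKNOWN value of each per-sample profile. The sketch below is illustrative only: `list_unknown` is a made-up helper name, and it assumes tab-separated profiles with the clade name in column 1 and the relative abundance in column 3.

```shell
# Print "file<TAB>UNKNOWN value" for each profile passed as an argument.
# Files without a row labelled exactly UNKNOWN are reported as NA.
list_unknown() {
    for f in "$@"; do
        awk -F '\t' -v name="$f" '
            /^#/ { next }                               # skip header lines
            {
                clade = $1
                gsub(/^[ \t]+|[ \t]+$/, "", clade)      # tolerate stray whitespace
                if (clade == "UNKNOWN") {
                    printf "%s\t%s\n", name, $3
                    found = 1
                }
            }
            END { if (!found) printf "%s\tNA\n", name }
        ' "$f"
    done
}
```

Running `list_unknown ../metaphlan3/*_profile.txt` gives one line per sample. A value of 100, or an NA caused by a differently spelled label, points to a profile worth checking by hand.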

Stef · September 18, 2020, 6:15am

Hi again,
Do you have any suggestions on how I can increase the mapping rate to reduce the percentage of UNKNOWN reads?

fbeghini · September 18, 2020, 2:18pm

I’ll resolve this issue. It seems that the string printed when no output is available and the one used for the unknown estimation are slightly different.
About increasing the mappability: the metagenome size seems below average. Are these MiSeq reads?

Stef · September 18, 2020, 2:55pm

Exactly! MiSeq data, 2x300 bp, about 2 M read pairs per sample, sometimes only 1 M. Is there any useful fine-tuning for fewer but longer reads?

fbeghini · September 18, 2020, 4:07pm

Given the particularly long reads, I’d try running MetaPhlAn with local alignment. You can do this with the --bt2_ps sensitive-local or --bt2_ps very-sensitive-local parameter.
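
As a rough sketch, the command from the first post could be adapted as follows. The wrapper name `profile_local` and its arguments are hypothetical; the `metaphlan` options other than `--bt2_ps` are copied from that first command, with `--index latest` and `--add_viruses` left out here for brevity.

```shell
# Profile one FASTQ file using a local-alignment Bowtie2 preset.
# $1 = input FASTQ, $2 = output directory,
# $3 = preset (defaults to very-sensitive-local; sensitive-local also works)
profile_local() {
    fq=$1
    outdir=$2
    preset=${3:-very-sensitive-local}
    metaphlan "$fq" --input_type fastq --nproc 20 --unknown_estimation \
        --bt2_ps "$preset" \
        > "$outdir/$(basename "${fq%.fastq}")_profile.txt"
}
```

A loop such as `for fq in *.fastq; do profile_local "$fq" ../metaphlan3; done` would then re-profile every sample. If bowtie2out files from an earlier run are still sitting next to the inputs, they will probably need to be moved or removed first so that the alignment is actually redone.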

Stef · September 21, 2020, 6:13am

Thanks! I will!
Could you please tell me how exactly sensitive and very sensitive differ? I cannot find that information in the tutorial. And which min_alignment_len do you recommend?
/Stef

fbeghini · September 21, 2020, 8:37am

For the parameter definitions, I’ll point you to the Bowtie2 manual, since these are Bowtie2 presets. I would not decrease min_alignment_len below 100: you should not have markers of that size, and it should still guarantee that you find enough hits.

Stef · September 24, 2020, 6:25am

Hi Francesco,
Using the local alignment I could decrease the UNKNOWN by around half. So this is much better, but about 40% is still left as unknown. Do you have any further suggestions on how to optimise the parameters for longer MiSeq reads and shallow datasets?
Thank you!
Stef

fbeghini · September 24, 2020, 7:51am

I’m glad it worked out. 40% is a reasonable number for UNKNOWN.
For longer reads, the tuneable parameters are the two you used before (min_alignment_len and --bt2_ps), and they are the ones that have the most impact on mappability.

Stef · September 24, 2020, 7:53am

Thank you for your help! So 40% is what you expect in fecal samples? Is there still so much dark matter?

fbeghini · September 24, 2020, 8:34am

Yes, the average mappability in stool samples is around 60%. I’ll point you to Figure 2A (https://www.sciencedirect.com/science/article/pii/S0092867419300017#fig2) from the Pasolli et al. 2019 paper.

Stef · September 24, 2020, 8:36am

Well, then that’s great! I guess I am now ready to analyse my taxonomic profiles. Thank you so much!
