MetaPhlan3 --unknown_estimation

metaphlan3outputEX.txt (73.4 KB)

Dear MetaPhlan3 developers,
I am very excited about this new release with so many more references :slight_smile:
I did a test run like this:
for i in *.fastq; do metaphlan "$i" --input_type fastq --nproc 20 --unknown_estimation --index latest --add_viruses > "../metaphlan3/${i%.fastq}_profile.txt"; done
I am a bit confused about the output; please see the attached file. Why are there two rows named ‘UNKNOWN’? One contains only ‘0’, but the first has very high values (80-90+ %), which seems like a lot considering that these are human fecal samples.
When I sum up all relative abundances I end up with 150-180%, which is strange too.
Please help me interpret my results!
Thank you!
Stef

Hi Stef,
For fecal samples, that is quite a high value. Is it possible that the samples contain contaminants such as human sequences?
About the >100% sum: the UNKNOWN value is relative to the sum of the relative abundances at a single clade level. The profile reports every taxonomic level, so summing all rows counts the same abundance several times. If you sum only the species’ relative abundances and add the UNKNOWN value, you’ll get 100%.
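The species-plus-UNKNOWN check described above can be sketched in shell. This is a minimal sketch, assuming a tab-separated profile with the clade name in column 1, the relative abundance in column 3, and header lines starting with `#`; the function name `species_plus_unknown` is hypothetical and not part of MetaPhlAn.

```shell
# Sum the species-level relative abundances and the UNKNOWN row of one profile.
# Assumption: tab-separated columns, clade name in column 1 and relative
# abundance in column 3; lines starting with '#' are headers.
species_plus_unknown() {
  awk -F'\t' '
    /^#/ || NF < 3 { next }                  # skip header and blank lines
    $1 == "UNKNOWN" { unknown = $3; next }   # unmapped fraction
    {
      n = split($1, lineage, "|")            # last element is the deepest rank
      if (lineage[n] ~ /^s__/) species += $3 # count species-level rows only
    }
    END {
      printf "species=%.2f unknown=%.2f total=%.2f\n", species, unknown, species + unknown
    }
  ' "$1"
}
```

Running `species_plus_unknown` on a single profile file should report a total close to 100, whereas summing every row counts each lineage once per taxonomic level and gives a much larger number.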

I removed human reads before running MetaPhlAn 3 (they made up 4% at most, in one sample), so that should not be it. And why are there two rows named ‘UNKNOWN’, one being zero and the other above 80%? Could you please take a look at the output I posted?
I very much appreciate your help interpreting the results!
/Stef

That’s pretty strange. Can you upload all the bowtie2out files that MetaPhlAn generated?

PB.39.fastq.bowtie2out.txt (801.0 KB) PB.41.fastq.bowtie2out.txt (1.3 MB) PB.42.fastq.bowtie2out.txt (3.7 MB)
Here are the bowtie2 output files of three samples.
Thank you so much for helping!
/Stef

Hi Stef,
I cannot reproduce the same behaviour (two UNKNOWN rows) after merging the three outputs. Which version of MetaPhlAn are you using?

merged.txt (23.3 KB)

Metaphlan3
I have more samples (70), so could the problem be somewhere else? How can I identify a potentially problematic sample?

Still, the UNKNOWN fraction should not be above 80%, since these are fecal samples with host reads removed.

I identified the issue! The second row of UNKNOWN came from the negative control. It was 100 there and 0 for all samples. When merging all samples without the NC it looks fine.
But I am still VERY worried about the UNKNOWN in my samples being above 80%!
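A check like the one that exposed the negative control can be sketched in shell. This is only a sketch under stated assumptions: the merged table is tab-separated, its first line not starting with `#` holds the sample names, and any row whose name contains "unknown" (in any capitalisation) is of interest, since the exact labels of the two rows are not shown in this thread. The function name `flag_empty_samples` is hypothetical.

```shell
# List samples whose UNKNOWN-like row reaches a cutoff (default 100) in a
# merged abundance table.
# Assumptions: tab-separated table; the first line not starting with '#'
# holds the column names; matching rows contain "unknown" in column 1.
flag_empty_samples() {
  awk -F'\t' -v cutoff="${2:-100}" '
    /^#/ { next }                            # skip comment lines
    !seen_header {                           # remember the sample names
      for (i = 1; i <= NF; i++) name[i] = $i
      seen_header = 1
      next
    }
    toupper($1) ~ /UNKNOWN/ {
      for (i = 2; i <= NF; i++)              # non-negative numeric cells only
        if ($i ~ /^[0-9.]+$/ && $i + 0 >= cutoff + 0) print name[i] "\t" $i
    }
  ' "$1"
}
```

Any sample listed at a cutoff of 100 produced no profile at all, which is what one would hope for in a negative control; it can then be left out of the merge.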

Hi again,
Do you have any suggestions on how I can increase the mapping rate to reduce the percentage of UNKNOWN reads?

I’ll resolve this issue. It seems that the string printed when no output is available and the one used for the unknown estimation are slightly different.
About increasing the mappability: the metagenome size seems below average. Are these MiSeq reads?

Exactly! MiSeq data, 2x300 bp, about 2 M read pairs per sample, sometimes only 1 M. Is there any useful fine-tuning for fewer but longer reads?

Given the particularly long read length, I’d try running MetaPhlAn with local alignment. You can do this with the --bt2_ps sensitive-local or --bt2_ps very-sensitive-local parameter.
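Applied to the loop from the first post, the local-alignment suggestion could look like the sketch below. The function name `profile_local` and the `METAPHLAN` override variable are hypothetical conveniences; the flags are the ones already used in this thread plus `--bowtie2out`, which is added here on the assumption that a leftover `.bowtie2out.txt` from the earlier run would otherwise get in the way of a rerun.

```shell
# Re-run the profiling loop with a local-alignment Bowtie2 preset.
# Usage: profile_local INPUT_DIR OUTPUT_DIR [PRESET]
# Setting METAPHLAN=echo gives a dry run that only writes out the arguments.
profile_local() {
  in_dir=$1
  out_dir=$2
  preset=${3:-sensitive-local}
  mkdir -p "$out_dir"
  for fq in "$in_dir"/*.fastq; do
    [ -e "$fq" ] || continue                 # no FASTQ files matched
    sample=$(basename "$fq" .fastq)
    # Separate Bowtie2 output per preset, so earlier runs are not reused.
    "${METAPHLAN:-metaphlan}" "$fq" --input_type fastq --nproc 20 \
      --unknown_estimation --index latest --add_viruses \
      --bt2_ps "$preset" \
      --bowtie2out "$out_dir/${sample}.${preset}.bowtie2out.txt" \
      > "$out_dir/${sample}_profile.txt"
  done
}
```

For example, `profile_local . ../metaphlan3_local very-sensitive-local` would write one profile per FASTQ file into `../metaphlan3_local` (a directory name chosen here for illustration), leaving the end-to-end results untouched for comparison.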

Thanks! I will!
Could you please tell me how exactly sensitive and very sensitive differ? I cannot find that information in the tutorial. And which min_alignment_len do you recommend?
/Stef

For the parameter definitions, I’ll point you to the Bowtie2 manual, since --bt2_ps passes a Bowtie2 preset. I wouldn’t decrease min_alignment_len below 100: you should not have markers that short, and it should still let you find enough hits.

Hi Francesco,
Using local alignment, I could decrease the UNKNOWN fraction by around half. This is much better, but about 40% is still left as unknown. Do you have any further suggestions on how to optimise the parameters for longer MiSeq reads and shallow datasets?
Thank you!
Stef

I’m glad it worked out. 40% is a reasonable number for UNKNOWN.
For longer reads, the tunable parameters are the two you used before (min_alignment_len and --bt2_ps); they are the ones with the greatest impact on mappability.

Thank you for your help! So 40% is what you expect in fecal samples? Is there still so much dark matter?

Yes, the average mappability in stool samples is around 60%. I’ll point you to Figure 2A (https://www.sciencedirect.com/science/article/pii/S0092867419300017#fig2) of the Pasolli et al. 2019 paper.

Well, then that’s great! I guess I am now ready to analyse my taxonomic profiles. Thank you so much! :smiley: