Metaphlan3 keeps giving an UNKNOWN 100 output

Hi,

I’m currently trying to perform a community analysis for milk samples. I’ve received the data and performed a trimming using trimmomatic and then proceeded to use metaphlan. However, all the samples have returned an UNKNOWN 100 output.
I used the following command:

#!/bin/bash

module load miniconda
conda activate metaphlan

METAPHLAN_DB=meta_database/

for f in *.gz
do name=$(basename $f .fastq.gz)
   metaphlan --nproc $SLURM_CPUS_PER_TASK --bowtie2out ${name}.bt2.bz2 --input_type fastq --bowtie2db $METAPHLAN_DB $f > ${name}_profile.txt
done

And I received this output:

#mpa_v30_CHOCOPhlAn_201901
#/home/ccastillo/.conda/envs/metaphlan/bin/metaphlan L006_S3_L001_R1_001.fastq.gz --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --bowtie2db meta_db --input_type fastq --unknown_estimation -o profiled_me$
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
UNKNOWN -1	100.0

So, I decided to perform a metaphlan run with just one sample. I used the following command:

metaphlan L006_S3_L001_R1_trimmed.fastq.gz --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --bowtie2db meta_db --input_type fastq --unknown_estimation -o profiled_metagenome.txt

And I still recieved the same output. I tried the raw samples (before the trimming) using the following command:

metaphlan L006_S3_L001_R1_001.fastq.gz --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --bowtie2db meta_db --input_type fastq --unknown_estimation -o profiled_metagenome.txt

And the same problem persisted.

Reading that the read length might be an issue, I performed the following command:

gunzip -c L006_S3_L001_R1_trimmed.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' > count.txt

And after performing head and tail in the output file, I noticed that most reads are in the 250 nt length. Which apparently isn’t a problem:

head

36 3
37 6
38 30
39 31
40 12
41 10
42 7
43 11
44 10
45 17

tail
243 159
244 317
245 336
246 292
247 621
248 2385
249 8040
250 37725
251 94589

So, right now, I’m out of ideas. I was wondering if maybe it has something to do with the database, and maybe I’m specifying it wrongly.

This is the version I’m currently using:

MetaPhlAn version 3.0.14 (19 Jan 2022)

Thanks for any help you could give me!

Hi,

indeed it looks like the database is not specified correctly. In your meta_database/ folder you should have a set of bowtie indexes files (.bt2 files). Try specifying the path including the basename of these files. Usually it looks like -x meta_database/mpa_v30_CHOCOPhlAn_201901

Hope this is clear and will solve your issue. Feel free to ask if not
Have a nice day!

Léonard

Hi Léonard,

I applied the change you suggested. However, I still get an UNKNOWN 100. Here’s how the new script looked like:

#!/bin/bash

module load miniconda
conda activate metaphlan

METAPHLAN_DB=meta_database/mpa_v30_CHOCOPhlAn_201901

for f in *.gz
do name=$(basename $f .fastq.gz)
   metaphlan --nproc $SLURM_CPUS_PER_TASK --bowtie2out ${name}.bt2.bz2 --input_type fastq --bowtie2db $METAPHLAN_DB $f > ${name}_profile.txt
done

In which meta_database/mpa_v30_CHOCOPhlAn_201901 was the new path, but it only created a new directory that downloaded the database again.

Regards,
Camila C.

Hi @reymonera
As metaphlan is producing the output (even if it is UNKNOWN 100%) it means that the execution was performed correctly. This is not really an error but a report saying that after mapping all your reads, MetaPhlAn is not able to detect any microbial species in your sample. This does not mean that your sample does not contain any microorganism, but that with the current parameters, MetaPhlAn is not able to detect them. By default, MetaPhlAn needs to find 20% of the markers of a species to report the species is present in the sample. This is controlled by the stat_q parameter (Check this post Fungal signatures - #2 by aitor.blancomiguez). In samples with low microbial biomass, 20% can be too strict (I am not an expert on human milk microbiome, but I think this might be your case). Reducing the --stat_q parameter (maybe to 10%) might help.

Hi!

Thanks for your input!

I changed the stat_q flag to 0.1 and 0.05 and I’m still getting that error. Here’s my attempt:

metaphlan --nproc 6 --bowtie2out L006_S3_L001_R1_trimmed.bt2.bz2 --input_type fastq --stat_q 0.05 --bowtie2db meta_db L006_S3_L001_R1_trimmed.fastq.gz

And here’s the output:

#mpa_v30_CHOCOPhlAn_201901
#/home/ccastillo/.conda/envs/metaphlan/bin/metaphlan --nproc 6 --bowtie2out L006_S3_L001_R1_trimmed.bt2.bz2 --input_type fastq --stat_q 0.05 --bowtie2db meta_db L006_S3_L001_R1_trimmed.fastq.gz
#SampleID	Metaphlan_Analysis
#clade_name	NCBI_tax_id	relative_abundance	additional_species
UNKNOWN	-1	100.0	

Thanks for any input you could give me.

Regards,
Camila C.

Hi @reymonera
Can you check the content of the L006_S3_L001_R1_trimmed.bt2.bz2 file? this file contains a simplified version of the mapping results and it will be really useful to see how many reads are mapping.

Hi!

Yes, this is what appears inside the .bt2 file:

M04874:89:000000000-CTKPJ:1:1102:2622:14981__1.8674     610243__A0A060QJF7__ASAP_2763
M04874:89:000000000-CTKPJ:1:1102:23769:26112__1.11366   610243__A0A060QJF7__ASAP_2763
M04874:89:000000000-CTKPJ:1:1114:19261:2019__1.79569    610243__A0A060QJF7__ASAP_2763
M04874:89:000000000-CTKPJ:1:2101:3489:20055__1.89927    610243__A0A060QJF7__ASAP_2763
M04874:89:000000000-CTKPJ:1:2106:16354:27005__1.124143  610243__A0A060QJF7__ASAP_2763
M04874:89:000000000-CTKPJ:1:2113:7893:24764__1.167862   610243__A0A060QJF7__ASAP_2763
#nreads 173884
#avg_read_length        234.74154608819674

Hi @reymonera
So I see the problem. From your 173,884 reads, only 6 are mapping against MetaPhlAn markers (in this case just to one marker) so it is not possible for MetaPhlAn to accurately detect any species on it.

Hi,

I see. Thanks for the input. However, what should be the course of action here? Is there something wrong with the sequences?

Also, just as a question: are .bt2 files supposed to be compressed inside the database directory? Because any time I tried to perform the head command on these files in the database directory, they’re unreadable.

Hi @reymonera,

Are you running metaphlan on a 16S rRNA dataset? From the total number of reads and the average length it seems so.

Hi,

After your question, I asked again if these were indeed shotgun reads as I’ve been told and they confirmed this again. I’ll try to check again but what number of reads and average length should I expect from a shotgun run?

Thanks!

Hi @reymonera
If you are sure your sequences are from shotgun sequencing and not 16S, there is not too much to do. I might be that there is nothing wrong in your sample. Shotgun sequencing of milk samples for microbial DNA is not an easy task. As an example, in breast milk samples, most of the reads will come from human DNA and the microbial DNA is so low that it is posible to get no reads at all mapping against microbial species (https://journals.asm.org/doi/10.1128/mSystems.00164-16). I am sorry I cannot give you more help than this

Hi! Just to report on this: Apparently there was a confusion labelling the shotgun samples and they got mixed with the 16S samples. I’ve ran the shotgun reads with MetaPhlan and it all went well after putting in some sensible parameters. Human DNA was very mixed with the microbial DNA.

Thanks for the help and input!

Hi! I have the same issue here. Can you share the parameters applied? Thanks!