Too many unclassified reads?

vrrodovalho · November 21, 2022, 6:22pm

Hi, everybody!

I’m using MetaPhlAn [version 4.0.3 (24 Oct 2022)] for taxonomic profiling, and the result files show a high percentage of unclassified reads (~82%). This means that only ~18% of the reads are assigned to taxa, right? Isn’t that too low?

I removed adapters, overrepresented sequences, and host sequences with KneadData, and checked quality with FastQC / MultiQC. It all seemed OK.

Does anybody know what could be the issue here?

Here is the command I used:

metaphlan ${SEQS}/${name}.fastq \
    --input_type fastq \
    --bowtie2db ${bowtie2db} \
    --sample_id ${name} \
    --nproc ${threads} \
    --bowtie2out ${result_dir}/${name}.bowtie2.bz2 \
    --unclassified_estimation \
    --subsampling 14800000 \
    --subsampling_seed 0 \
    --output_file ${result_dir}/${name}_profile.txt

I really appreciate any help you can provide.

aitor.blancomiguez · November 25, 2022, 8:36am

Hi @vrrodovalho
That fraction really depends on the data you are analyzing. In the human gut it is usually around 25%, but in other, less characterized environments it can be quite high.
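
For comparing samples against a figure like that, the unclassified percentage can be read directly from each profile. The following is only a minimal sketch: the function name `unclassified_pct` is hypothetical, and it assumes the profile is tab-separated with the clade name in column 1 and the relative abundance in column 3, and that a row named UNCLASSIFIED is present because --unclassified_estimation was used.

```shell
# Print the relative abundance reported on the UNCLASSIFIED row of a profile.
# Assumption: tab-separated columns, clade name in column 1,
# relative abundance in column 3.
# Returns a non-zero status if the profile has no UNCLASSIFIED row.
unclassified_pct() {
    awk -F'\t' '
        $1 == "UNCLASSIFIED" { print $3; found = 1; exit }
        END { if (!found) exit 1 }
    ' "$1"
}

# Usage (hypothetical file name):
#   unclassified_pct "${result_dir}/${name}_profile.txt"
```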

vrrodovalho · November 26, 2022, 8:36pm

Hi @aitor.blancomiguez, thank you for your answer. I’m working with samples from a rodent, the golden hamster. So I assume the expected rate of unclassified reads would be higher than 25%?

aitor.blancomiguez · November 28, 2022, 10:52am

Hi @vrrodovalho
I would say a higher unclassified fraction is expected in such an environment. We do not have many rodent MAGs in the database.

vrrodovalho · May 29, 2023, 6:33pm

Hi,

I’m still working on this (shotgun metagenomics analyzed with MetaPhlAn version 4.0.3), which showed many unclassified reads when using the --unclassified_estimation parameter. I moved on, since this might be expected when dealing with less studied environments (such as the golden hamster gut).

However, among the classified reads, which were assigned to some taxon during the MetaPhlAn analysis, there are still many that are only assigned to higher taxonomic ranks (such as the kingdom Bacteria). As a matter of fact, I end up with an average of ~20% “Bacteria_unclassified” at the phylum level.
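
To check a figure like that per sample, the phylum-level rows can be listed from a profile. This is only a rough sketch: the function name `phylum_rows` is hypothetical, and it assumes a tab-separated profile with `|`-delimited clade names in column 1 (for example `k__Bacteria|p__Bacteria_unclassified`), the relative abundance in column 3, and header lines starting with `#`.

```shell
# Print the phylum-level rows of a profile as "phylum<TAB>abundance".
# Assumption: clade names look like k__...|p__...|c__... in column 1
# and the relative abundance is in column 3.
phylum_rows() {
    awk -F'\t' '
        /^#/ { next }                        # skip header lines
        {
            n = split($1, rank, "[|]")       # one element per taxonomic rank
            if (n == 2 && rank[2] ~ /^p__/)  # keep rows that stop at phylum
                print rank[2] "\t" $3
        }
    ' "$1"
}

# Usage (hypothetical file name):
#   phylum_rows "${result_dir}/${name}_profile.txt" | grep unclassified
```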

My questions are:

1. Is it expected to have so many unclassified reads at the phylum level, due to the changes in database structure (species-level genome bins)?
2. Is there some parameter I could use to relax the taxon assignment? I thought about the following parameters:

   --min_cu_len: minimum total nucleotide length for the markers in a clade for estimating the abundance without considering sub-clade abundances [default 2000]
   --avoid_disqm: Deactivate the procedure of disambiguating the quasi-markers based on the marker abundance pattern found in the sample. It is generally recommended to keep the disambiguation procedure in order to minimize false positives

Do you think it would work and result in fewer unclassified reads? Or do you have some other hints?

Thank you!
