Too many unclassified reads?

Hi, everybody!

I’m using MetaPhlAn [version 4.0.3 (24 Oct 2022)] for taxonomic profiling, and the result files show a high percentage of unclassified reads (~82%). This means that only ~18% of the reads are assigned to taxa, right? Isn’t that too low?

I removed adapters, overrepresented sequences, and host sequences with KneadData, and checked quality with FastQC / MultiQC. Everything looked OK.

Does anybody know what could be the issue here?

Here is the command I used:

metaphlan ${SEQS}/${name}.fastq \
    --input_type fastq \
    --bowtie2db ${bowtie2db} \
    --sample_id ${name} \
    --nproc ${threads} \
    --bowtie2out ${result_dir}/${name}.bowtie2.bz2 \
    --unclassified_estimation \
    --subsampling 14800000 \
    --subsampling_seed 0 \
    --output_file ${result_dir}/${name}_profile.txt

I really appreciate any help you can provide.

Hi @vrrodovalho
That fraction really depends on the data you are analyzing. In the human gut it is usually around 25%, but in less well-characterized environments it can be much higher.

Hi @aitor.blancomiguez, thank you for your answer. I’m working with samples from a rodent, the golden hamster. So I should expect the rate of unclassified reads to be higher than 25%?

Hi @vrrodovalho
I would say a higher unclassified fraction is expected in such an environment. We do not have many rodent MAGs in the database.

Hi,

I’m still working on this (shotgun metagenomics analyzed with MetaPhlAn version 4.0.3), which showed many unclassified reads when using the --unclassified_estimation parameter. I moved on, since this might be expected when dealing with less-studied environments (such as the golden hamster gut).

However, among the classified reads, many are still assigned only to higher taxonomic ranks (such as the kingdom Bacteria). As a matter of fact, I end up with an average of ~20% “Bacteria_unclassified” at the phylum level.
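
To put a number on this, the phylum-level rows of a profile can be summed directly. The following is a minimal sketch, not part of MetaPhlAn: `phylum_summary` is a hypothetical helper name. It assumes the default MetaPhlAn 4 output layout (tab-separated, `#` comment lines, relative abundance in the third column) and that unresolved phyla carry an `_unclassified` suffix, as in `p__Bacteria_unclassified`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with MetaPhlAn): sum phylum-level
# abundances in a profile and report how much of that sum is "*_unclassified".
# Assumes columns: clade_name, NCBI_tax_id, relative_abundance, additional_species.
# Usage: phylum_summary path/to/sample_profile.txt
phylum_summary() {
    awk -F'\t' '
        /^#/ { next }                      # skip header and comment lines
        $1 == "UNCLASSIFIED" { next }      # whole-sample estimate, not a phylum row
        {
            n = split($1, ranks, "|")      # clade names look like k__Bacteria|p__...
            # keep only rows that end at the phylum rank
            if (n != 2 || ranks[2] !~ /^p__/) next
            total += $3
            if (ranks[2] ~ /_unclassified$/) unclassified += $3
        }
        END {
            printf "phylum_total\t%.2f\n", total
            printf "phylum_unclassified\t%.2f\n", unclassified
        }
    ' "$1"
}
```

Dividing `phylum_unclassified` by `phylum_total` gives the share of the classified fraction that stops at the kingdom level. As far as I understand, with --unclassified_estimation the abundances are scaled to include the UNCLASSIFIED row, so the two numbers are best compared with each other rather than read as percentages of 100.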

My questions are:

  1. Is it expected to have so many unclassified reads at the phylum level, due to the changes in database structure (species-level genome bins)?

  2. Is there some parameter I could use to make the taxon assignment less strict? I thought about the following parameters:
    --min_cu_len minimum total nucleotide length for the markers in a clade for
    estimating the abundance without considering sub-clade abundances
    [default 2000]
    --avoid_disqm Deactivate the procedure of disambiguating the quasi-markers based on the
    marker abundance pattern found in the sample. It is generally recommended
    to keep the disambiguation procedure in order to minimize false positives
    Do you think it would work and result in fewer unclassified reads? Or do you have any other hints?

Thank you!