Everything is unclassified

I am running Humann v4.0.0alpha.1 on plant-associated PacBio reads, using the default settings.

The Metaphlan prescreen gives INFO: Total species selected from prescreen: 0
and all of the downstream results are listed as unclassified.

Is this okay, or can I change some inputs to improve the results?

Humann is likely not suitable for your data, because they are long reads.

Both humann (functional profiling, restricted to an automatically selected subset of taxa from the metaphlan results) and the tool that runs before it, metaphlan (taxonomic profiling), align reads with bowtie2 against gene-level indexes (marker genes, in metaphlan's case).

Bowtie2, as far as I know, cannot be used for long read alignments. If it can, it likely requires specialized configuration during humann’s execution, which may be possible using:

humann --bowtie-options "[pacbio specific parameters for bowtie2]"

An additional issue is that the reads, often several Kbp long, are likely larger than the marker genes, which are typically 1-2 Kbp, and will therefore appear not to map well in most cases. Many read aligners, or post-alignment workflows, filter out poor alignments, for example by requiring that more than 75% of the read length is aligned. If a 3 Kbp read aligns perfectly to a 2 Kbp gene, only about 67% of the read is aligned, so it would not pass the 75% threshold and would not be considered aligned.
Furthermore, even if reads did align to a few marker genes, a minimum number of marker genes must recruit enough reads for a taxon to be considered present and therefore passed from metaphlan to humann. Any resulting alignments by humann can then only involve the unclassified uniref references, as no taxa were identified as present.
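
To make the coverage arithmetic above concrete, here is a minimal Python sketch. It is not code from humann or metaphlan: the function names are made up for illustration, and the 75% cutoff and 2 Kbp gene length are just the example values used above. It computes the best-case fraction of a read that a single gene can cover.

```python
def max_query_coverage(read_length: int, gene_length: int) -> float:
    """Best-case percentage of a read that can align to a single gene."""
    if read_length <= 0 or gene_length <= 0:
        raise ValueError("lengths must be positive")
    # A read cannot align over more bases than the shorter of the two sequences.
    aligned = min(read_length, gene_length)
    return 100.0 * aligned / read_length


def passes_filter(read_length: int, gene_length: int, threshold: float = 75.0) -> bool:
    """True if even a perfect alignment would exceed the (illustrative) threshold."""
    return max_query_coverage(read_length, gene_length) > threshold


if __name__ == "__main__":
    for read_length in (150, 1500, 3000, 10000):
        coverage = max_query_coverage(read_length, 2000)
        verdict = passes_filter(read_length, 2000)
        print(f"{read_length:>6} bp read vs 2000 bp gene: "
              f"at most {coverage:.1f}% covered, passes 75% filter: {verdict}")
```

Short reads fit inside the gene and can reach 100% coverage, while a read longer than the gene is capped below 100% no matter how good the alignment is.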

Do you have (a good fraction of) reads that align to the unclassified genes?

To my knowledge there is no comparable tool for functional profiling of long reads based on read mapping, and you may need to consider assembly-based approaches, or align the reads more directly to reference genomes.

Two last things to clarify -

Are your data full-length 16S amplicons? If so, there are no rRNA genes in the metaphlan or humann databases to align against, which would make these tools even less suitable.

When you say plant-associated, this does not mean the plant itself, correct? Biobakery is also not suitable for plant functional profiling.

Hi @Gsmith535, thanks for your reply!

I was told by a specialist that Humann would “probably” work for my reads so perhaps I was a bit too optimistic.
My data is bacterial, not amplicons.
I will tweak my bowtie2 settings and try again. At what stage does the alignment workflow filter out poor alignments?

Thank you!

Has the specialist applied humann to long read data before? If so, they may have the necessary parameters.

Good to hear that it is a bacterial metagenome, as this covers one of the necessary criteria for applying humann.

Workflow-wise, that’s a better question for devs (I am not one).
Typically, reads are aligned to the index, creating a BAM file. This BAM file is then filtered and alignment data reported, which are often then normalized.
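
As a rough illustration of that kind of post-alignment filtering, here is a Python sketch that reads the CIGAR string of a SAM record and keeps the alignment only if enough of the read is aligned. This is an assumption about how such a filter could work, not humann's actual code: the function names, the choice to count M/=/X/I operations as aligned, and the example records are all made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def query_coverage_from_cigar(cigar: str) -> float:
    """Percentage of the original read that is aligned (not clipped)."""
    aligned = 0
    read_length = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=XI":      # read bases that take part in the alignment
            aligned += n
            read_length += n
        elif op in "SH":      # clipped read bases: part of the read, not aligned
            read_length += n
        # D, N and P do not consume read bases, so they are ignored here.
    if read_length == 0:
        raise ValueError(f"no read bases found in CIGAR {cigar!r}")
    return 100.0 * aligned / read_length


def keep_alignment(sam_line: str, threshold: float) -> bool:
    """Keep a SAM record only if it is mapped and its coverage meets the threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4 or cigar == "*":   # 0x4 is the 'read unmapped' flag bit
        return False
    return query_coverage_from_cigar(cigar) >= threshold


if __name__ == "__main__":
    # Hypothetical records: a 3 Kbp read with 2 Kbp aligned, and a short read.
    long_read = "read1\t0\tgeneA\t1\t42\t2000M1000S\t*\t0\t0\t*\t*"
    short_read = "read2\t0\tgeneA\t1\t42\t150M\t*\t0\t0\t*\t*"
    for line in (long_read, short_read):
        print(line.split("\t")[0], keep_alignment(line, threshold=75.0))
```

With a 75% cutoff the soft-clipped long read is dropped even though its aligned portion matches the gene end to end.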

From the manual for humann, regardless of bowtie options, you would likely need to include:

--nucleotide-query-coverage-threshold [something less than the 90% default, because your queries, i.e. reads, are potentially much larger than the subjects, i.e. genes]
--translated-query-coverage-threshold [something less than 90% default, same logic]

--bypass-translated-search # similar logic as above: your reads are probably bigger than the coding regions, so this step may not work well even if it runs, so don't spend the compute on it.

I have also not run humann4, only humann3, so I am uncertain of possible differences.
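
For what it's worth, the options discussed above could be assembled into a single command along these lines. This is only a sketch: the option names are the ones quoted in this thread, `--input` and `--output` are assumed from humann3 usage, the file names and the threshold of 50 are placeholders, and none of it has been checked against humann4 (see `humann --help` for your version). The helper only builds and prints the command; it does not run humann.

```python
import shlex
from typing import List, Optional


def build_humann_command(
    input_path: str,
    output_dir: str,
    nucleotide_query_coverage: float,
    translated_query_coverage: Optional[float] = None,
    bypass_translated_search: bool = True,
    bowtie_options: Optional[str] = None,
) -> List[str]:
    """Assemble a humann argument list from the options discussed in the thread."""
    cmd = [
        "humann",
        "--input", input_path,      # assumed from humann3 usage
        "--output", output_dir,     # assumed from humann3 usage
        "--nucleotide-query-coverage-threshold", str(nucleotide_query_coverage),
    ]
    if bypass_translated_search:
        # Skipping the translated search makes its coverage threshold irrelevant.
        cmd.append("--bypass-translated-search")
    elif translated_query_coverage is not None:
        cmd += ["--translated-query-coverage-threshold", str(translated_query_coverage)]
    if bowtie_options:
        # Passed through as one string, as in the --bowtie-options example earlier.
        cmd += ["--bowtie-options", bowtie_options]
    return cmd


if __name__ == "__main__":
    # Placeholder file names and threshold; adjust to your own data.
    command = build_humann_command("reads.fastq", "humann_out", nucleotide_query_coverage=50.0)
    print(shlex.join(command))
```

Printing the command first makes it easy to inspect before handing the list to something like `subprocess.run`.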