HUMAnN nucleotide alignment

Hi,

Thank you so much for the development of HUMAnN!

I have performed shotgun sequencing of several mouse fecal samples. My reads are 150bp.

My issue is with the nucleotide alignment step using HUMAnN. For my first run I used:

humann --input sample_cat.fasta.gz --output sampleoutput/ --memory-use maximum --threads 150

Here, I used UniRef90 with default parameters:
SEARCH MODE
search mode = uniref90
nucleotide identity threshold = 0.0
translated identity threshold = 80.0

ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 50.0
translated query coverage threshold = 90.0
nucleotide subject coverage threshold = 50.0
nucleotide query coverage threshold = 90.0

From this, my sample's output was “Unaligned reads after nucleotide alignment: 88.7167874589 %”. All of my samples ranged from 85% to 92% here.

The translated alignment output was “Unaligned reads after translated alignment: 58.3685585936 %”. My other samples ranged from 30-60%.
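To make these two numbers easier to read, here is a minimal Python sketch that pulls them out of the log text and splits the reads into tiers. It assumes both percentages are relative to the total input reads and that the log lines look exactly like the ones quoted above; `tier_breakdown` is a helper name made up for this example, not part of HUMAnN.

```python
import re

# Matches the two summary lines quoted in this post.
UNALIGNED = re.compile(
    r"Unaligned reads after (nucleotide|translated) alignment:\s*([\d.]+)\s*%"
)

def tier_breakdown(log_text):
    """Return the percent of reads accounted for by each search tier.

    Assumption: both logged percentages are fractions of the total input reads.
    """
    found = {tier: float(pct) for tier, pct in UNALIGNED.findall(log_text)}
    after_nuc = found["nucleotide"]
    after_trans = found["translated"]
    return {
        "nucleotide": 100.0 - after_nuc,        # aligned in the nucleotide tier
        "translated": after_nuc - after_trans,  # additionally aligned by translated search
        "unaligned": after_trans,               # still unaligned at the end
    }

log_text = """
Unaligned reads after nucleotide alignment: 88.7167874589 %
Unaligned reads after translated alignment: 58.3685585936 %
"""

breakdown = tier_breakdown(log_text)
for tier, pct in breakdown.items():
    print(f"{tier:>10}: {pct:6.2f} %")
```

For this sample, roughly 11% of reads align in the nucleotide tier and a further 30% or so in the translated tier.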

After reading many posts, I decided to relax the settings and use UniRef50:

humann --input sample.fasta.gz --output sampleoutput/ --search-mode uniref50 --translated-subject-coverage-threshold 0.0 --nucleotide-subject-coverage-threshold 0.0 --nucleotide-query-coverage-threshold 50.0 --translated-query-coverage-threshold 50.0 --memory-use maximum --threads 150

SEARCH MODE
search mode = uniref50
nucleotide identity threshold = 0.0
translated identity threshold = 50.0

ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --sensitive --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 0.0
translated query coverage threshold = 50.0
nucleotide subject coverage threshold = 0.0
nucleotide query coverage threshold = 50.0

Here my output was: “Unaligned reads after nucleotide alignment: 88.4718670763 %”, which was essentially unchanged.

The translated alignment was “Unaligned reads after translated alignment: 29.4553120544 %”

Lastly, I relaxed the settings even more:

humann --input sample.fasta.gz --output sampleoutput/ --search-mode uniref50 --translated-subject-coverage-threshold 0.0 --nucleotide-subject-coverage-threshold 0.0 --nucleotide-query-coverage-threshold 0.0 --translated-query-coverage-threshold 50.0 --memory-use maximum --threads 150

SEARCH MODE
search mode = uniref50
nucleotide identity threshold = 0.0
translated identity threshold = 50.0

ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --sensitive --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 0.0
translated query coverage threshold = 50.0
nucleotide subject coverage threshold = 0.0
nucleotide query coverage threshold = 0.0

Here, I got “Unaligned reads after nucleotide alignment: 88.4718670763 %”
“Unaligned reads after translated alignment: 29.4552988863 %”
This is essentially the same as my last run.
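Putting the three runs side by side shows the pattern. This is a small Python sketch using only the percentages quoted in this post, and it assumes all three runs refer to the same sample; the run labels are my own shorthand.

```python
# (unaligned after nucleotide, unaligned after translated), in percent,
# copied from the three runs described above.
runs = {
    "uniref90, defaults": (88.7167874589, 58.3685585936),
    "uniref50, query coverage 50": (88.4718670763, 29.4553120544),
    "uniref50, nucleotide query coverage 0": (88.4718670763, 29.4552988863),
}

baseline_nuc, baseline_trans = runs["uniref90, defaults"]

# Percentage-point drop in unaligned reads relative to the first run.
changes = {}
for label, (nuc, trans) in runs.items():
    changes[label] = (baseline_nuc - nuc, baseline_trans - trans)
    print(f"{label:<40} nucleotide: {changes[label][0]:6.2f}  translated: {changes[label][1]:6.2f}")
```

Relaxing the settings moves the nucleotide figure by about a quarter of a percentage point, while the translated figure drops by almost 29 points.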

My question is: is there something that explains the low nucleotide alignment? Am I doing something wrong? The nucleotide alignment did not improve after relaxing the parameters, so should I just use the first run with UniRef90? Which results should I trust?

Thanks so much for all the help and sorry for all the questions!

Hi,
Have you been able to solve this problem?
I have exactly the same problem, and the taxonomic results I find don’t make much biological sense.

Hi cgar,

No, I have not been able to solve this problem. I am hoping that someone from the HUMAnN development team can supply some answers.

@franzosa

Hi, I am hoping you can give some insights here. Thanks so much!

Hi all,

I just ran HUMAnN again with the nucleotide search bypassed, and only 26% of reads remained unaligned after the translated search.

humann --input WT_819_cat.fasta.gz --output /bypass_nucl/ --protein-database /home/uniref90/uniref/ --search-mode uniref90 --memory-use maximum --threads 150 --bypass-nucleotide-search

I believe that the ChocoPhlAn database may not have contained many of the species that were identified by MetaPhlAn 4, so their reads were counted as unaligned.

Sorry for missing this thread! If the issue is that HUMAnN isn’t finding the species in your sample (because they aren’t in our database), then relaxing parameters won’t improve nucleotide alignment, but it will improve species-agnostic protein-level alignment, just as you’re seeing. As HUMAnN 4 transitions to MetaPhlAn 4’s SGB model it will do a better job identifying and mapping to species from the murine microbiome during the nucleotide search phase.
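A toy Python sketch of why the coverage and identity settings cannot help in that situation: the nucleotide tier only maps reads against pangenomes of species that were detected and that exist in the database. This is an illustration under stated assumptions, not HUMAnN's actual code; the function name, the species labels, and the abundances are all hypothetical, and the 0.01 default simply mirrors the prescreen threshold shown in the logs above.

```python
def species_for_nucleotide_search(profile, pangenomes, prescreen_threshold=0.01):
    """Split detected species into those with and without a pangenome.

    profile: hypothetical mapping of species name -> percent abundance.
    pangenomes: hypothetical set of species that have a pangenome in the database.
    """
    detected = {sp for sp, abundance in profile.items() if abundance >= prescreen_threshold}
    searchable = sorted(detected & pangenomes)  # reads can map here
    missing = sorted(detected - pangenomes)     # no reference, so reads stay unaligned
    return searchable, missing

# Hypothetical taxonomic profile for a mouse fecal sample.
profile = {
    "s__Species_A": 35.0,
    "s__Species_B": 20.0,
    "s__Species_C": 0.005,   # below the prescreen threshold
    "s__Unnamed_SGB": 44.995,
}
pangenomes = {"s__Species_A", "s__Species_C"}

searchable, missing = species_for_nucleotide_search(profile, pangenomes)
print("searchable:", searchable)
print("missing from database:", missing)
```

In this made-up example only one species has a reference to map against, so most reads would fall through to the translated search no matter how the alignment thresholds are set.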

Hi Eric,

Thanks for the reply!

I am trying to draw some conclusions about my data, at least in terms of the overall abundance of some genes in my samples rather than which microbes are contributing to that abundance. While we wait for HUMAnN 4.0 to be released, do you think bypassing the nucleotide alignment is okay to do?

Thanks!

Yes, while there is no harm in letting the nucleotide alignment do a small amount of alignment to the species it can find, just doing pure translated search to UniRef50 is also a fine strategy for functional profiling. This is essentially how the original HUMAnN worked before we developed the tiered search.
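For anyone scripting this, a minimal Python sketch that assembles a translated-only UniRef50 command from the flags already used in this thread. The file names and database path are placeholders, the helper name is made up, and the command is only printed here, not run.

```python
import shlex

def translated_only_command(input_path, output_dir, protein_db, threads=8):
    """Build a humann argument list for a pure translated search against UniRef50.

    Uses only flags that appear earlier in this thread; all paths are placeholders.
    """
    return [
        "humann",
        "--input", input_path,
        "--output", output_dir,
        "--protein-database", protein_db,
        "--search-mode", "uniref50",
        "--bypass-nucleotide-search",
        "--threads", str(threads),
    ]

cmd = translated_only_command(
    "sample.fasta.gz",       # placeholder input file
    "sampleoutput_bypass/",  # placeholder output directory
    "/path/to/uniref50/",    # placeholder location of a UniRef50 database
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.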

Hi,

Thanks for the input! Is there a reason that you specified UniRef50 instead of UniRef90 in your reply?

UniRef50 is better for communities that you expect to have more remote homology, since it allows reads to align at 50% identity (vs. our allowed 80% for UniRef90).
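As a rough illustration of that difference, here is a Python sketch that filters a few made-up translated hits by the identity thresholds mentioned in this thread (80% for UniRef90, 50% for UniRef50) together with a query coverage threshold. The hits are invented, `passes` is a hypothetical helper, and treating a hit exactly at the threshold as passing is an assumption.

```python
# Identity thresholds per search mode, as quoted in this thread.
IDENTITY_THRESHOLD = {"uniref90": 80.0, "uniref50": 50.0}

def passes(identity, query_coverage, search_mode, query_coverage_threshold=90.0):
    """Return True if a hit meets the identity and query coverage thresholds."""
    return (
        identity >= IDENTITY_THRESHOLD[search_mode]
        and query_coverage >= query_coverage_threshold
    )

# Invented hits: (percent identity, percent of the read covered by the alignment).
hits = [(95.0, 100.0), (72.0, 96.0), (55.0, 92.0), (45.0, 98.0)]

kept = {
    mode: [hit for hit in hits if passes(hit[0], hit[1], mode)]
    for mode in IDENTITY_THRESHOLD
}
for mode, mode_hits in kept.items():
    print(f"{mode}: {len(mode_hits)} of {len(hits)} hits kept")
```

The hits at 72% and 55% identity are the kind of remote homologs that survive under UniRef50 but are dropped under UniRef90.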