Output Depends on Read Order Due to Ambiguous Protein Length Assignment in `blastx_coverage()`

weichi · May 29, 2025, 8:37am

Version: HUMAnN v3.9

Database: full_chocophlan_v201901_v31

Input: demo.fastq.gz (In my case, I split the FASTQ file, then call nucleotide.alignment() to perform parallel mapping, concatenate the SAM files into one, and then call nucleotide.unaligned_reads(). )

Summary

I have discovered that the order of input reads can affect HUMAnN’s final results. This appears to be due to a design issue in the blastx_coverage() function, which determines whether a gene should be retained based on protein coverage. The issue arises when the same protein_name (e.g., a UniRef90 ID) is associated with multiple gene lengths.

Example

This occurs with the UniRef90 ID UniRef90_I9Q8N6, which appears in multiple gene annotations across different species:

From:

g__Bacteroides.s__Bacteroides_dorei.centroids.v201901_v31.ffn.gz
g__Bacteroides.s__Bacteroides_vulgatus.centroids.v201901_v31.ffn.gz

Relevant headers:

>357276__I9Q8N6_B5F00_18260|...g__Bacteroides.s__Bacteroides_dorei|UniRef90_I9Q8N6|UniRef50_A0A069D069|3021
>821__I9Q8N6__BHV80_16865|...g__Bacteroides.s__Bacteroides_vulgatus|UniRef90_I9Q8N6|UniRef50_A0A069D069|324

So, UniRef90_I9Q8N6 is linked to both a gene of length 3021 and another of length 324.

Root Cause

In blastx_coverage(), protein lengths are stored in a dictionary keyed by protein_name. When multiple genes share the same protein_name but have different lengths, the length used depends on whichever entry is processed last, i.e., the order of the reads in the input file.

This makes the results non-deterministic and potentially biologically misleading, because the blastx_coverage() function calculates coverage using a protein length that can change depending on read order. Specifically, coverage is computed as:

coverage = len(range_hit_positions) / float(protein_lengths[protein_name]) * 100

If the same protein_name is associated with multiple gene lengths (e.g., 3021 vs. 324), the final coverage value — and thus whether a hit passes the threshold — will vary depending on which length was stored last. This means the same alignment could either be accepted or discarded based on arbitrary input order, compromising reproducibility and biological validity.

Potential Solution

Here are two potential solutions, in my opinion.
We can discuss them here to determine which one is more reasonable, and I can help to revise the code and create a pull request. (The code below is just a quick draft to illustrate the approach.)

Calculate coverage separately for each different length (or bugs) associated with the same protein_name. If any of them pass the coverage threshold, the protein_name should be accepted.


# blastx_coverage.py
#https://github.com/biobakery/humann/blob/9c6dfef873837c0ed281e1093718769d1aea98c9/humann/search/blastx_coverage.py#L49C9-L49C52
#line 49
#(Assume that one protein name in one bug should have only one length.)
bug_protein=protein_name+';'+bug
protein_lengths[bug_protein] = gene_length

#line 55 
protein_hits[bug_protein]+="{0}-{1};".format(subject_start_index, subject_stop_index)

#line 61
for bug_protein, hit_positions in protein_hits.items():

#line 72
coverage = len( range_hit_positions ) / float( protein_lengths[bug_protein] ) * 100

#line 77
if coverage >= min_coverage:
		allowed.add(bug_protein.split(';')[0])

Calculate coverage separately for each different length (or bugs) associated with the same protein_name, and return not only the accepted genes but also the names of the bugs entries. Then revise the filtering logic in translated.unaligned_reads and nucleotide.unaligned_reads accordingly.


#blastx_coverage.py
#https://github.com/biobakery/humann/blob/9c6dfef873837c0ed281e1093718769d1aea98c9/humann/search/blastx_coverage.py#L49C9-L49C52
#line 49
#(Assume that one protein name in one bug should have only one length.)
bug_protein=protein_name+';'+bug
protein_lengths[bug_protein] = gene_length

#line 55 
protein_hits[bug_protein]+="{0}-{1};".format(subject_start_index, subject_stop_index)

#line 61
for bug_protein, hit_positions in protein_hits.items():

#line 72
coverage = len( range_hit_positions ) / float( protein_lengths[bug_protein] ) * 100

#line 77
if coverage >= min_coverage:
		allowed.add(bug_protein)
		
		
#nucleotide.py
#line 328
bug_gene=gene_name+';'+bug
if not bug_gene in allowed_genes:

#translated.py
#line 302
bug_protein=protein_name+';'+bug
if bug_protein in allowed_proteins:

franzosa · May 29, 2025, 10:30pm

Many, many thanks for catching this and for your detailed analysis!

I can see exactly what happened, which is that 1) we developed this filter in HUMAnN v2 for translated search, where the gene IDs never repeat; then 2) we later learned it was also somewhat helpful for boosting specificity during nucleotide search, so we added that option in HUMAnN v3; but 3) we didn’t account for nucleotide gene IDs sometimes repeating across species.

We’ll work on a fix for this. In the meantime, if you’d rather disable this filter entirely (to avoid the bug and keep things deterministic), you can set --nucleotide-subject-coverage-threshold 0.0, which will disable the nucleotide coverage filter (at the potential cost of some extra false positive hits).

weichi · June 1, 2025, 7:29am

Thank you for the clear explanation and background.
I appreciate the suggested workaround; setting --nucleotide-subject-coverage-threshold 0.0 sounds like a reasonable temporary solution.

If possible, could the fix also be applied to HUMAnN v3.9 in addition to v4? That would be really helpful for those of us still working with HUMAnN 3.

Thanks again for the quick response and all your great work on HUMAnN!

Topic		Replies	Views
Theoretical query regarding HUMAnN analysis HUMAnN	2	435	October 2, 2020
Humann nucleotide alignment HUMAnN	9	140	June 24, 2024
Custom UniRef90 database with Humann3 HUMAnN	4	804	March 15, 2021
Different UniRef90 ID has the same nucleotide sequences in ChocoPhlAn database HUMAnN	3	514	August 4, 2020
Bowtie2 unaligned reads slow HUMAnN	14	1941	November 8, 2024

Output Depends on Read Order Due to Ambiguous Protein Length Assignment in `blastx_coverage()`

Summary

Example

Root Cause

Potential Solution

Related topics