Output Depends on Read Order Due to Ambiguous Protein Length Assignment in `blastx_coverage()`

Version: HUMAnN v3.9

Database: full_chocophlan_v201901_v31

Input: demo.fastq.gz (In my case, I split the FASTQ file, then call nucleotide.alignment() to perform parallel mapping, concatenate the SAM files into one, and then call nucleotide.unaligned_reads(). )


Summary

I have discovered that the order of input reads can affect HUMAnN’s final results. This appears to be due to a design issue in the blastx_coverage() function, which determines whether a gene should be retained based on protein coverage. The issue arises when the same protein_name (e.g., a UniRef90 ID) is associated with multiple gene lengths.


Example

This occurs with the UniRef90 ID UniRef90_I9Q8N6, which appears in multiple gene annotations across different species:

From:

  • g__Bacteroides.s__Bacteroides_dorei.centroids.v201901_v31.ffn.gz
  • g__Bacteroides.s__Bacteroides_vulgatus.centroids.v201901_v31.ffn.gz

Relevant headers:

>357276__I9Q8N6_B5F00_18260|...g__Bacteroides.s__Bacteroides_dorei|UniRef90_I9Q8N6|UniRef50_A0A069D069|3021
>821__I9Q8N6__BHV80_16865|...g__Bacteroides.s__Bacteroides_vulgatus|UniRef90_I9Q8N6|UniRef50_A0A069D069|324

So, UniRef90_I9Q8N6 is linked to both a gene of length 3021 and another of length 324.


Root Cause

In blastx_coverage(), protein lengths are stored in a dictionary keyed by protein_name. When multiple genes share the same protein_name but have different lengths, the length used depends on whichever entry is processed last, i.e., the order of the reads in the input file.

This makes the results non-deterministic and potentially biologically misleading, because the blastx_coverage() function calculates coverage using a protein length that can change depending on read order. Specifically, coverage is computed as:

coverage = len(range_hit_positions) / float(protein_lengths[protein_name]) * 100

If the same protein_name is associated with multiple gene lengths (e.g., 3021 vs. 324), the final coverage value — and thus whether a hit passes the threshold — will vary depending on which length was stored last. This means the same alignment could either be accepted or discarded based on arbitrary input order, compromising reproducibility and biological validity.


Potential Solution

Here are two potential solutions, in my opinion.
We can discuss them here to determine which one is more reasonable, and I can help to revise the code and create a pull request. (The code below is just a quick draft to illustrate the approach.)

  • Calculate coverage separately for each different length (or bugs) associated with the same protein_name. If any of them pass the coverage threshold, the protein_name should be accepted.

    
    # blastx_coverage.py
    #https://github.com/biobakery/humann/blob/9c6dfef873837c0ed281e1093718769d1aea98c9/humann/search/blastx_coverage.py#L49C9-L49C52
    #line 49
    #(Assume that one protein name in one bug should have only one length.)
    bug_protein=protein_name+';'+bug
    protein_lengths[bug_protein] = gene_length
    
    #line 55 
    protein_hits[bug_protein]+="{0}-{1};".format(subject_start_index, subject_stop_index)
    
    #line 61
    for bug_protein, hit_positions in protein_hits.items():
    
    #line 72
    coverage = len( range_hit_positions ) / float( protein_lengths[bug_protein] ) * 100
    
    #line 77
    if coverage >= min_coverage:
    		allowed.add(bug_protein.split(';')[0])
    
  • Calculate coverage separately for each different length (or bugs) associated with the same protein_name, and return not only the accepted genes but also the names of the bugs entries. Then revise the filtering logic in translated.unaligned_reads and nucleotide.unaligned_reads accordingly.

    
    #blastx_coverage.py
    #https://github.com/biobakery/humann/blob/9c6dfef873837c0ed281e1093718769d1aea98c9/humann/search/blastx_coverage.py#L49C9-L49C52
    #line 49
    #(Assume that one protein name in one bug should have only one length.)
    bug_protein=protein_name+';'+bug
    protein_lengths[bug_protein] = gene_length
    
    #line 55 
    protein_hits[bug_protein]+="{0}-{1};".format(subject_start_index, subject_stop_index)
    
    #line 61
    for bug_protein, hit_positions in protein_hits.items():
    
    #line 72
    coverage = len( range_hit_positions ) / float( protein_lengths[bug_protein] ) * 100
    
    #line 77
    if coverage >= min_coverage:
    		allowed.add(bug_protein)
    		
    		
    #nucleotide.py
    #line 328
    bug_gene=gene_name+';'+bug
    if not bug_gene in allowed_genes:
    
    #translated.py
    #line 302
    bug_protein=protein_name+';'+bug
    if bug_protein in allowed_proteins:
    
    

Many, many thanks for catching this and for your detailed analysis! :folded_hands:

I can see exactly what happened, which is that 1) we developed this filter in HUMAnN v2 for translated search, where the gene IDs never repeat; then 2) we later learned it was also somewhat helpful for boosting specificity during nucleotide search, so we added that option in HUMAnN v3; but 3) we didn’t account for nucleotide gene IDs sometimes repeating across species. :person_facepalming:

We’ll work on a fix for this. In the meantime, if you’d rather disable this filter entirely (to avoid the bug and keep things deterministic), you can set --nucleotide-subject-coverage-threshold 0.0, which will disable the nucleotide coverage filter (at the potential cost of some extra false positive hits).

1 Like

Thank you for the clear explanation and background.
I appreciate the suggested workaround; setting --nucleotide-subject-coverage-threshold 0.0 sounds like a reasonable temporary solution.

If possible, could the fix also be applied to HUMAnN v3.9 in addition to v4? That would be really helpful for those of us still working with HUMAnN 3.

Thanks again for the quick response and all your great work on HUMAnN!