Version: HUMAnN v3.9
Database: full_chocophlan_v201901_v31
Input: demo.fastq.gz
(In my case, I split the FASTQ file, then call nucleotide.alignment()
to perform parallel mapping, concatenate the SAM files into one, and then call nucleotide.unaligned_reads()
. )
Summary
I have discovered that the order of input reads can affect HUMAnN’s final results. This appears to be due to a design issue in the blastx_coverage()
function, which determines whether a gene should be retained based on protein coverage. The issue arises when the same protein_name
(e.g., a UniRef90 ID) is associated with multiple gene lengths.
Example
This occurs with the UniRef90 ID UniRef90_I9Q8N6
, which appears in multiple gene annotations across different species:
From:
g__Bacteroides.s__Bacteroides_dorei.centroids.v201901_v31.ffn.gz
g__Bacteroides.s__Bacteroides_vulgatus.centroids.v201901_v31.ffn.gz
Relevant headers:
>357276__I9Q8N6_B5F00_18260|...g__Bacteroides.s__Bacteroides_dorei|UniRef90_I9Q8N6|UniRef50_A0A069D069|3021
>821__I9Q8N6__BHV80_16865|...g__Bacteroides.s__Bacteroides_vulgatus|UniRef90_I9Q8N6|UniRef50_A0A069D069|324
So, UniRef90_I9Q8N6
is linked to both a gene of length 3021 and another of length 324.
Root Cause
In blastx_coverage()
, protein lengths are stored in a dictionary keyed by protein_name
. When multiple genes share the same protein_name
but have different lengths, the length used depends on whichever entry is processed last, i.e., the order of the reads in the input file.
This makes the results non-deterministic and potentially biologically misleading, because the blastx_coverage()
function calculates coverage using a protein length that can change depending on read order. Specifically, coverage is computed as:
coverage = len(range_hit_positions) / float(protein_lengths[protein_name]) * 100
If the same protein_name
is associated with multiple gene lengths (e.g., 3021 vs. 324), the final coverage value — and thus whether a hit passes the threshold — will vary depending on which length was stored last. This means the same alignment could either be accepted or discarded based on arbitrary input order, compromising reproducibility and biological validity.
Potential Solution
Here are two potential solutions, in my opinion.
We can discuss them here to determine which one is more reasonable, and I can help to revise the code and create a pull request. (The code below is just a quick draft to illustrate the approach.)
-
Calculate coverage separately for each different length (or bugs) associated with the same
protein_name
. If any of them pass the coverage threshold, theprotein_name
should be accepted.# blastx_coverage.py #https://github.com/biobakery/humann/blob/9c6dfef873837c0ed281e1093718769d1aea98c9/humann/search/blastx_coverage.py#L49C9-L49C52 #line 49 #(Assume that one protein name in one bug should have only one length.) bug_protein=protein_name+';'+bug protein_lengths[bug_protein] = gene_length #line 55 protein_hits[bug_protein]+="{0}-{1};".format(subject_start_index, subject_stop_index) #line 61 for bug_protein, hit_positions in protein_hits.items(): #line 72 coverage = len( range_hit_positions ) / float( protein_lengths[bug_protein] ) * 100 #line 77 if coverage >= min_coverage: allowed.add(bug_protein.split(';')[0])
-
Calculate coverage separately for each different length (or bugs) associated with the same
protein_name
, and return not only the accepted genes but also the names of the bugs entries. Then revise the filtering logic intranslated.unaligned_reads
andnucleotide.unaligned_reads
accordingly.#blastx_coverage.py #https://github.com/biobakery/humann/blob/9c6dfef873837c0ed281e1093718769d1aea98c9/humann/search/blastx_coverage.py#L49C9-L49C52 #line 49 #(Assume that one protein name in one bug should have only one length.) bug_protein=protein_name+';'+bug protein_lengths[bug_protein] = gene_length #line 55 protein_hits[bug_protein]+="{0}-{1};".format(subject_start_index, subject_stop_index) #line 61 for bug_protein, hit_positions in protein_hits.items(): #line 72 coverage = len( range_hit_positions ) / float( protein_lengths[bug_protein] ) * 100 #line 77 if coverage >= min_coverage: allowed.add(bug_protein) #nucleotide.py #line 328 bug_gene=gene_name+';'+bug if not bug_gene in allowed_genes: #translated.py #line 302 bug_protein=protein_name+';'+bug if bug_protein in allowed_proteins: