The current bottleneck when running some of our larger samples is the diamond blastx stage: some jobs with modest-sized input files are timing out after 2 days on 32 cores. I would love to be able to provide precomputed translated-search results, similar to how we can provide a precomputed bugs list from MetaPhlAn. That way, we could parallelize the blast jobs on our cluster using array jobs, etc.
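For concreteness, here is a minimal sketch of the sharding I have in mind (the `split_fasta` helper and its chunk naming are hypothetical): split the unaligned FASTA into N chunks, run `diamond blastx` on each chunk as one array task, and concatenate the tabular outputs afterwards. Because every chunk is searched against the same full database, the e-values should come out the same as in a single run.

```python
# Hypothetical helper: shard a FASTA round-robin into n_chunks files so each
# cluster array task can run `diamond blastx` on one chunk independently.
def split_fasta(fasta_path, n_chunks, prefix="unaligned_chunk"):
    outs = [open(f"{prefix}_{i:03d}.fasta", "w") for i in range(n_chunks)]
    try:
        with open(fasta_path) as fa:
            record, idx = [], 0
            for line in fa:
                # a new header flushes the previous record to the next chunk
                if line.startswith(">") and record:
                    outs[idx % n_chunks].writelines(record)
                    idx += 1
                    record = []
                record.append(line)
            if record:
                outs[idx % n_chunks].writelines(record)
    finally:
        for fh in outs:
            fh.close()
```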
Currently HUMAnN supports using blast results as input, but "using these files as input to HUMAnN 3.0 will bypass both the nucleotide and translated mapping portions of the flow," which isn't ideal: I still want the benefits of the nucleotide mapping against the ChocoPhlAn bugs. One solution would be to accept this file as an optional arg, and then subset the precomputed alignment to retain only the reads in the unaligned FASTA created by the nucleotide alignment step. Something like the following:
```python
#from https://github.com/biobakery/humann/blob/f147b28ea9f363c608057f5ca37de67b0bcb9542/humann/humann.py#L772
if not config.bypass_translated_search:
    # Run translated search on UniRef database if unaligned reads exist
    if unaligned_reads_store.count_reads() > 0:
        if config.precomputed_blast8m:
            filtered_blast8m = "filtered_blast8m"
            with open(unaligned_reads_file_fasta) as unal_fa, \
                 open(config.precomputed_blast8m) as pcb, \
                 open(filtered_blast8m, "w") as outf:
                # collect the IDs of all reads still unaligned after the
                # nucleotide step (a set gives O(1) membership tests; take
                # the ID up to the first whitespace to match blast8 query IDs)
                unaligned_read_names = set()
                for line in unal_fa:
                    if line.startswith(">"):
                        unaligned_read_names.add(line[1:].split()[0])
                # write a new blast8m results file for just those unmapped reads
                for line in pcb:
                    if line.split("\t", 1)[0] in unaligned_read_names:
                        outf.write(line)
            translated_alignment_file = filtered_blast8m
        else:
            translated_alignment_file = translated.alignment(config.protein_database, unaligned_reads_file_fasta)
        start_time = timestamp_message("translated alignment", start_time)
```
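For the optional arg itself, the wiring could look roughly like this (a hypothetical sketch; humann's real option parsing lives in its config module, and the flag name is my invention):

```python
import argparse

# Hypothetical CLI wiring, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--precomputed-blast8m",
    dest="precomputed_blast8m",
    default=None,
    metavar="<file>",
    help="precomputed translated-search results in blast8/m8 tabular format; "
         "these are filtered to the reads left unaligned after nucleotide "
         "mapping instead of running the translated search",
)
```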
The filtering itself could no doubt be made more efficient, maybe even offloaded to a tool built for set operations in the spirit of bedtools; see the sketch below.
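For instance, a memory-lean variant could stream both files with a two-pointer merge instead of holding every read name in a set (a hypothetical sketch; it assumes both inputs are byte-sorted on the read ID, e.g. with `LC_ALL=C sort -k1,1`):

```python
def filter_blast_sorted(sorted_names_path, sorted_blast_path, out_path):
    """Stream-filter a blast8 file against a sorted list of read IDs.

    Assumes both inputs are byte-sorted on the read ID, so a two-pointer
    merge needs only O(1) memory regardless of file size.
    """
    with open(sorted_names_path) as names, \
         open(sorted_blast_path) as blast, \
         open(out_path, "w") as outf:
        name = next(names, "").strip()
        for line in blast:
            read_id = line.split("\t", 1)[0]
            # advance the name cursor until it catches up with this hit
            while name and name < read_id:
                name = next(names, "").strip()
            if name == read_id:
                outf.write(line)
```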
Would this be possible? Happy to help implement!