Feature Request: full support for precomputed diamond results

The current bottleneck when running some of our large samples is the diamond blastx stage. Some jobs with modest-sized files are timing out after 2 days with 32 cores. I would love to be able to provide precomputed results similar to how we can provide a pre-computed bugs list from metaphlan. That way, we could parallelize the blast jobs on our cluster using array jobs, etc.

Currently Humann supports using the blast results as input, but “using these files as input to HUMAnN 3.0 will bypass both the nucleotide and translated mapping portions of the flow.”, which isn’t ideal. I still want to have the benefits of the nucleotide mapping against the chocophlan bugs. One solution is to provide this file as an optional arg, and then subset the alignment to retain only the reads in the unaligned fasta created by the nucleotide alignment step. Something like the following:

 #from  https://github.com/biobakery/humann/blob/f147b28ea9f363c608057f5ca37de67b0bcb9542/humann/humann.py#L772

       if not config.bypass_translated_search:                                                                                                                                                                                                                                                                                          
            # Run translated search on UniRef database if unaligned reads exit                                                                                                                                                                                                                                                           
            if unaligned_reads_store.count_reads()>0:
                if config.precomputed_blast8m:
                     with open(unaligned_reads_file_fasta) as unal_fa, open(precomputed_blast8m, "r") as pcb, open("filtered_blast8m", "w") as outf:
                     # get a list of all reads 
                     unaligned_read_names = []
                     for line in unal_fa:
                         if line.startswith(">"):
                             unaligned_read_names.append(line.strip().replace(">", "")
                    #write a new blast8m results file for just those unmapped reads
                    for line in pcb:
                         if line.split("\t")[0] in unaligned_read_names:
                     translated_alignment_file = filtered_blast8m
                    translated_alignment_file = translated.alignment(config.protein_database, unaligned_reads_file_fasta)                                                                                                                                                                                                                                                                                          
                start_time=timestamp_message("translated alignment",start_time)                                                                                                                                                                                                                                                          

This could be made more efficient I’m sure, maybe even offloading to something like bedtools that handles set operations.

Would this be possible? Happy to help implement!