The current bottleneck when running some of our larger samples is the diamond blastx stage: some jobs with modest-sized input files are timing out after 2 days on 32 cores. I would love to be able to provide precomputed translated-search results, similar to how we can provide a precomputed bugs list from MetaPhlAn. That way, we could parallelize the blast jobs on our cluster using array jobs, etc.
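For concreteness, here is a minimal sketch of the sharding I have in mind (the `split_fasta` helper and its chunk naming are hypothetical): split the unaligned FASTA into N chunks, run `diamond blastx` on each chunk as one array task, and concatenate the tabular outputs afterwards. Because every chunk is searched against the same full database, the e-values should come out the same as in a single run.

```python
# Hypothetical helper: shard a FASTA round-robin into n_chunks files so each
# cluster array task can run `diamond blastx` on one chunk independently.
def split_fasta(fasta_path, n_chunks, prefix="unaligned_chunk"):
    outs = [open(f"{prefix}_{i:03d}.fasta", "w") for i in range(n_chunks)]
    try:
        with open(fasta_path) as fa:
            record, idx = [], 0
            for line in fa:
                # a new header flushes the previous record to the next chunk
                if line.startswith(">") and record:
                    outs[idx % n_chunks].writelines(record)
                    idx += 1
                    record = []
                record.append(line)
            if record:
                outs[idx % n_chunks].writelines(record)
    finally:
        for fh in outs:
            fh.close()
```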
Currently HUMAnN supports using blast results as input, but "using these files as input to HUMAnN 3.0 will bypass both the nucleotide and translated mapping portions of the flow," which isn't ideal: I still want the benefits of the nucleotide mapping against the ChocoPhlAn bugs. One solution would be to accept this file as an optional arg, and then subset the precomputed alignment to retain only the reads in the unaligned FASTA created by the nucleotide alignment step. Something like the following:
```python
#from https://github.com/biobakery/humann/blob/f147b28ea9f363c608057f5ca37de67b0bcb9542/humann/humann.py#L772
if not config.bypass_translated_search:
    # Run translated search on UniRef database if unaligned reads exist
    if unaligned_reads_store.count_reads() > 0:
        if config.precomputed_blast8m:
            filtered_blast8m = "filtered_blast8m"
            with open(unaligned_reads_file_fasta) as unal_fa, \
                 open(config.precomputed_blast8m) as pcb, \
                 open(filtered_blast8m, "w") as outf:
                # collect the IDs of all reads still unaligned after the
                # nucleotide step (a set gives O(1) membership tests; take
                # the ID up to the first whitespace to match blast8 query IDs)
                unaligned_read_names = set()
                for line in unal_fa:
                    if line.startswith(">"):
                        unaligned_read_names.add(line[1:].split()[0])
                # write a new blast8m results file for just those unmapped reads
                for line in pcb:
                    if line.split("\t", 1)[0] in unaligned_read_names:
                        outf.write(line)
            translated_alignment_file = filtered_blast8m
        else:
            translated_alignment_file = translated.alignment(config.protein_database, unaligned_reads_file_fasta)
        start_time = timestamp_message("translated alignment", start_time)
```
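For the optional arg itself, the wiring could look roughly like this (a hypothetical sketch; humann's real option parsing lives in its config module, and the flag name is my invention):

```python
import argparse

# Hypothetical CLI wiring, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--precomputed-blast8m",
    dest="precomputed_blast8m",
    default=None,
    metavar="<file>",
    help="precomputed translated-search results in blast8/m8 tabular format; "
         "these are filtered to the reads left unaligned after nucleotide "
         "mapping instead of running the translated search",
)
```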
The filtering itself could no doubt be made more efficient, maybe even offloaded to a tool built for set operations in the spirit of bedtools; see the sketch below.
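For instance, a memory-lean variant could stream both files with a two-pointer merge instead of holding every read name in a set (a hypothetical sketch; it assumes both inputs are byte-sorted on the read ID, e.g. with `LC_ALL=C sort -k1,1`):

```python
def filter_blast_sorted(sorted_names_path, sorted_blast_path, out_path):
    """Stream-filter a blast8 file against a sorted list of read IDs.

    Assumes both inputs are byte-sorted on the read ID, so a two-pointer
    merge needs only O(1) memory regardless of file size.
    """
    with open(sorted_names_path) as names, \
         open(sorted_blast_path) as blast, \
         open(out_path, "w") as outf:
        name = next(names, "").strip()
        for line in blast:
            read_id = line.split("\t", 1)[0]
            # advance the name cursor until it catches up with this hit
            while name and name < read_id:
                name = next(names, "").strip()
            if name == read_id:
                outf.write(line)
```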
Would this be possible? Happy to help implement!