Humann3 error for some samples but not others

Hi all,

I am trying to process my samples with Humann 3.6 and the Struo2 release of the GTDB202 database for Humann. I’m using this command for running Humann:

humann3 -vvv --input ${name}.cleaned.fastq.gz --input-format fastq.gz \
        --output humann3_out --output-basename ${name} \
        --threads 16 \
        --protein-database $PROTEIN \
        --nucleotide-database $NUC_DB \
        --bypass-nucleotide-index \
        --search-mode uniref90 \
        --remove-temp-output

For a subset of my samples, I am getting an error like the following:

TIMESTAMP: Completed nucleotide alignment : 1150 seconds

Traceback (most recent call last):
  File "/research/bsi/projects/staff_analysis/m141127/conda/biobakery3/bin/humann3", line 33, in <module>
    sys.exit(load_entry_point('humann==3.6', 'console_scripts', 'humann3')())
  File "/research/bsi/projects/staff_analysis/m141127/conda/biobakery3/lib/python3.7/site-packages/humann/humann.py", line 1000, in main
    nucleotide_alignment_file, alignments, unaligned_reads_store, keep_sam=True)
  File "/research/bsi/projects/staff_analysis/m141127/conda/biobakery3/lib/python3.7/site-packages/humann/search/nucleotide.py", line 263, in unaligned_reads
    if int(info[config.sam_flag_index]) & config.sam_unmapped_flag != 0:
ValueError: invalid literal for int() with base 10: 'AS:i:-6'

The offending value in the ValueError varies between samples, e.g., ‘YT:Z:UU\n’, ‘AS:i:-28’, or ‘AS:i:-6’. The error is reproducible: when I re-run the command on the affected samples, it fails again, while the remaining samples finish without problems. I’ve also verified the md5sums of the database, so I don’t think database corruption is the cause. Any guidance you can provide will be greatly appreciated, and I’ll be happy to provide any additional info.

Thanks!

As a caveat, Struo2 was developed outside our group, so if this error is on their end it will be trickier for us to diagnose/solve. That said, the error you’re seeing looks like it’s arising from your SAM output having an unexpected structure, potentially because the columns are shifted: HUMAnN is trying to read the SAM flag field as an integer and is finding an optional tag such as ‘AS:i:-6’ there instead. Since it’s not happening with every sample, my first guess would be that there is something funky going on with the read names in some of the samples that is disrupting the output.
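
One way to test the read-name guess is to scan the FASTQ headers of a failing sample for characters that could break a tab-delimited SAM row. The following is only a rough sketch: it assumes standard four-line FASTQ records, and the function name `suspicious_read_names` is made up for illustration and is not part of HUMAnN.

```python
import gzip
import sys


def suspicious_read_names(fastq_lines, limit=10):
    """Return (line_number, reason, header) for FASTQ headers that look unusual.

    Assumes four-line FASTQ records, so every fourth line is a header.
    """
    hits = []
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue  # sequence, separator, or quality line
        name = line.rstrip("\r\n")
        if not name.startswith("@"):
            reason = "does not start with '@'"
        elif "\t" in name:
            reason = "contains a tab"
        elif not name.isprintable():
            reason = "contains a non-printable character"
        else:
            continue
        hits.append((index + 1, reason, name))
        if len(hits) >= limit:
            break
    return hits


# Usage: python check_fastq_names.py sample.cleaned.fastq.gz
if __name__ == "__main__" and len(sys.argv) > 1:
    with gzip.open(sys.argv[1], "rt") as handle:
        for number, reason, name in suspicious_read_names(handle):
            print(f"line {number}: {reason}: {name!r}")
```

If this reports nothing for a failing sample, the read names are probably not the culprit and the SAM file itself is the next thing to look at.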

If you can share a full SAM alignment row from a sample that worked vs. one that didn’t, that might point to an answer.
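
To find candidate rows to share, something like the sketch below could help. It relies only on the SAM specification (11 mandatory tab-separated fields, with the integer FLAG in the second column) and assumes you have re-run a failing sample without --remove-temp-output so the intermediate SAM file is kept. The function name `find_malformed_rows` and the file name in the usage comment are placeholders, not HUMAnN names.

```python
import sys

SAM_MANDATORY_FIELDS = 11  # number of mandatory columns in the SAM specification
FLAG_COLUMN = 1            # zero-based index of the FLAG column


def find_malformed_rows(lines, limit=10):
    """Return (line_number, reason, text) for alignment rows that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("@"):
            continue  # header line
        text = line.rstrip("\n")
        fields = text.split("\t")
        if len(fields) < SAM_MANDATORY_FIELDS:
            reason = f"only {len(fields)} fields"
        elif not fields[FLAG_COLUMN].isdigit():
            reason = f"non-integer FLAG {fields[FLAG_COLUMN]!r}"
        else:
            continue
        problems.append((number, reason, text))
        if len(problems) >= limit:
            break
    return problems


# Usage: python check_sam.py path/to/intermediate_alignments.sam  (placeholder path)
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for number, reason, text in find_malformed_rows(handle):
            print(f"line {number}: {reason}: {text[:200]}")
```

A row reported as having too few fields, especially one that begins partway through a record, would suggest a broken or interleaved line rather than a shifted column, so it is worth including the lines just before and after it as well.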