Mouse stool high unaligned reads, humann 3

Hello,

I have some mouse stool metagenome samples (~10M reads per sample). I preprocessed the samples with KneadData and then ran them through HUMAnN 3. Unaligned reads were 88.7% after nucleotide alignment and 80.2% after translated alignment. I have posted the translated alignment output below.
I went through the bowtie2_unaligned.fa file and ran BLAST on a handful of sequences. Many of them had microbial hits (e.g., Lachnospiraceae, Muribaculum).

Is there a way to verify that I am doing this correctly? And if so, are there any other ways to classify the remaining 80% of the reads in these samples?

Best,
Jacob

08/05/2021 06:51:32 PM - humann.utilities - INFO: Execute command: /bin/cat /home/lab_user/microbiome_files/SE7259/FT-SA93359/kneaddata_output/humann/SE7259_SA93359_S9_kneaddata_combined_humann_temp/tmp_qipdi2t/diamond_m8_jfq0xudm
08/05/2021 06:51:35 PM - humann.humann - INFO: TIMESTAMP: Completed translated alignment : 8487 seconds
08/05/2021 06:56:02 PM - humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
08/05/2021 06:56:02 PM - humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
08/05/2021 06:56:02 PM - humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
08/05/2021 06:56:02 PM - humann.utilities - DEBUG: Total alignments not included based on large e-value: 0
08/05/2021 06:56:02 PM - humann.utilities - DEBUG: Total alignments not included based on small percent identity: 2748634
08/05/2021 06:56:02 PM - humann.utilities - DEBUG: Total alignments not included based on small query coverage: 1407002
08/05/2021 06:56:23 PM - humann.search.blastx_coverage - INFO: Total alignments without coverage information: 0
08/05/2021 06:56:23 PM - humann.search.blastx_coverage - INFO: Total proteins in blastx output: 719474
08/05/2021 06:56:23 PM - humann.search.blastx_coverage - INFO: Total proteins without lengths: 0
08/05/2021 06:56:23 PM - humann.search.blastx_coverage - INFO: Proteins with coverage greater than threshold (50.0): 47960
08/05/2021 07:00:57 PM - humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
08/05/2021 07:00:57 PM - humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
08/05/2021 07:00:57 PM - humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
08/05/2021 07:00:57 PM - humann.utilities - DEBUG: Total alignments not included based on large e-value: 0
08/05/2021 07:00:57 PM - humann.utilities - DEBUG: Total alignments not included based on small percent identity: 2748634
08/05/2021 07:00:57 PM - humann.utilities - DEBUG: Total alignments not included based on small query coverage: 1407002
08/05/2021 07:00:57 PM - humann.search.translated - DEBUG: Total translated alignments not included based on small subject coverage value: 4617279
08/05/2021 07:05:14 PM - humann.humann - INFO: TIMESTAMP: Completed translated alignment post-processing : 820 seconds
08/05/2021 07:05:14 PM - humann.humann - INFO: Total bugs after translated alignment: 27
08/05/2021 07:05:14 PM - humann.humann - INFO:
g__Lachnospiraceae_unclassified.s__Lachnospiraceae_bacterium_10_1: 188407 hits
g__Helicobacter.s__Helicobacter_typhlonius: 193124 hits
g__Bacteroides.s__Bacteroides_caecimuris: 117577 hits
g__Lachnospiraceae_unclassified.s__Lachnospiraceae_bacterium_COE1: 119192 hits
g__Muribaculaceae_unclassified.s__Muribaculaceae_bacterium_DSM_103720: 280656 hits
g__Muribaculum.s__Muribaculum_intestinale: 194922 hits
g__Lachnospiraceae_unclassified.s__Lachnospiraceae_bacterium_A2: 538415 hits
g__Firmicutes_unclassified.s__Firmicutes_bacterium_ASF500: 97123 hits
g__Bacteroides.s__Bacteroides_vulgatus: 99958 hits
g__Oscillibacter.s__Oscillibacter_sp_1_3: 123789 hits
g__Bacteroides.s__Bacteroides_sartorii: 181472 hits
g__Lachnospiraceae_unclassified.s__Lachnospiraceae_bacterium_3_1: 87535 hits
g__Bacteroides.s__Bacteroides_uniformis: 154944 hits
g__Dorea.s__Dorea_sp_5_2: 80392 hits
g__Parabacteroides.s__Parabacteroides_distasonis: 61290 hits
g__Lactobacillus.s__Lactobacillus_reuteri: 31306 hits
g__Anaerotruncus.s__Anaerotruncus_sp_G3_2012: 38340 hits
g__Clostridium.s__Clostridium_sp_ASF502: 63716 hits
g__Acutalibacter.s__Acutalibacter_muris: 20694 hits
g__Clostridium.s__Clostridium_sp_ASF356: 22462 hits
g__Lactobacillus.s__Lactobacillus_murinus: 8669 hits
g__Helicobacter.s__Helicobacter_apodemus: 12369 hits
g__Mucispirillum.s__Mucispirillum_schaedleri: 5446 hits
g__Lactobacillus.s__Lactobacillus_intestinalis: 16865 hits
g__Lactobacillus.s__Lactobacillus_johnsonii: 5246 hits
g__Enterorhabdus.s__Enterorhabdus_caecimuris: 388 hits
unclassified: 2421004 hits
08/05/2021 07:05:14 PM - humann.humann - INFO: Total gene families after translated alignment: 91351
08/05/2021 07:05:14 PM - humann.humann - INFO: Unaligned reads after translated alignment: 80.2395493667 %

Nothing here suggests to me that you’ve done something wrong. You could be seeing independent BLAST hits to known species among the unmapped reads because 1) the reads come from non-protein-coding regions (which we don’t map to), 2) they are too diverged from the reference sequences for bowtie2 to map them, or 3) we didn’t have sufficient evidence of the source species to include it with confidence during pangenome mapping.
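
As a quick sanity check on a run like this, the log itself can be summarized. The following is a minimal Python sketch, assuming the log lines look like the ones posted above (`name: N hits` and `Unaligned reads after ...: P %`). The function names are hypothetical, and it keeps only the last count seen for each name, which is an assumption about how you want repeated listings handled.

```python
import re

# Matches per-species lines such as "g__X.s__Y: 123 hits" or "unclassified: 123 hits".
HITS_RE = re.compile(r"^(?P<name>\S+): (?P<hits>\d+) hits$")
# Matches lines such as "... Unaligned reads after translated alignment: 80.2 %".
UNALIGNED_RE = re.compile(r"Unaligned reads after (?P<step>.+?): (?P<pct>[\d.]+) %")


def summarize_log(lines):
    """Collect per-name hit counts and unaligned percentages from log lines."""
    hits = {}
    unaligned = {}
    for raw in lines:
        line = raw.strip()
        m = HITS_RE.match(line)
        if m:
            # Assumption: a later listing for the same name replaces an earlier one.
            hits[m.group("name")] = int(m.group("hits"))
            continue
        m = UNALIGNED_RE.search(line)
        if m:
            unaligned[m.group("step")] = float(m.group("pct"))
    return hits, unaligned


def unclassified_share(hits):
    """Fraction of all logged hits that were assigned to 'unclassified'."""
    total = sum(hits.values())
    return hits.get("unclassified", 0) / total if total else 0.0


# A few lines copied from the log in this thread.
sample = [
    "g__Lactobacillus.s__Lactobacillus_johnsonii: 5246 hits",
    "g__Enterorhabdus.s__Enterorhabdus_caecimuris: 388 hits",
    "unclassified: 2421004 hits",
    "08/05/2021 07:05:14 PM - humann.humann - INFO: Unaligned reads after translated alignment: 80.2395493667 %",
]
hits, unaligned = summarize_log(sample)
print(unaligned)
print(f"unclassified share of logged hits: {unclassified_share(hits):.3f}")
```

Running this over the full log file (for example, `summarize_log(open(path))` with your own path) shows how the mapped hits are split between named species and `unclassified`, next to the unaligned percentage for each step.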

To increase your mapping rate you could run in UniRef50 mode instead of the default UniRef90 mode. This won’t change your species-stratified abundances, but it will likely increase the percentage of reads you map to known proteins with unclassified taxonomy. The infer_taxonomy script (humann_infer_taxonomy) can then be used to make guesses about the taxonomy of the unclassified abundances.
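
A UniRef50 rerun could be set up roughly as follows. This is a sketch under assumptions: the `--search-mode` and `--protein-database` options are the ones I understand HUMAnN 3 to provide (please confirm with `humann --help`), a UniRef50 DIAMOND database must already be installed, and the file and directory names are placeholders, not real paths. The helper function is hypothetical.

```python
import shlex


def build_humann_uniref50_command(input_fastq, output_dir, uniref50_db_dir):
    """Build an argument list for a HUMAnN run in UniRef50 search mode.

    All three arguments are user-supplied paths; nothing is checked on disk.
    """
    return [
        "humann",
        "--input", input_fastq,
        "--output", output_dir,
        # Switch translated search from the default UniRef90 to UniRef50.
        "--search-mode", "uniref50",
        # Directory holding the UniRef50 DIAMOND database (placeholder path).
        "--protein-database", uniref50_db_dir,
    ]


cmd = build_humann_uniref50_command(
    "sample_kneaddata.fastq",      # hypothetical input file
    "humann_uniref50_out",         # hypothetical output directory
    "/path/to/uniref50_diamond",   # hypothetical database location
)
# Print the command for inspection; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list keeps paths with spaces safe if you later hand it to `subprocess.run(cmd, check=True)`.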

Thank you, this is very helpful.

Could you also explain the meaning of the following output from HUMAnN 3: “Selected species explain 99.98% of the predicted community composition”?

Unless you turn on MetaPhlAn’s % unknown estimation, the total species abundance from MetaPhlAn will sum to 100%. HUMAnN will then select some or all of those species for pangenome mapping, requiring by default that each has >0.01% relative abundance. The 99.98% in your question is the sum of the selected species abundances, indicating that unselected (trace) species only accounted for ~0.02% of relative abundance.

Note: this does not indicate that we expect to explain 99.98% of reads in the sample, which is an easy mistake to make.
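
The arithmetic behind that message can be illustrated with a small Python sketch. The profile values below are invented for illustration (they are not from this sample), and the function name is hypothetical; only the >0.01% default threshold comes from the explanation above.

```python
def explained_by_selected(profile, threshold=0.01):
    """Select species above a relative-abundance threshold (in percent)
    and return them together with the sum of their abundances."""
    selected = {sp: ab for sp, ab in profile.items() if ab > threshold}
    return selected, sum(selected.values())


# Hypothetical species abundances (percent) that sum to 100.
profile = {
    "s__Species_A": 60.0,
    "s__Species_B": 39.98,
    "s__Trace_C": 0.008,
    "s__Trace_D": 0.007,
    "s__Trace_E": 0.005,
}
selected, explained = explained_by_selected(profile)
print(sorted(selected))
print(f"Selected species explain {explained:.2f}% of the predicted community composition")
```

Here the three trace species fall below the threshold, so the selected species account for everything except their combined ~0.02%. The figure describes the taxonomic profile only and says nothing about the fraction of reads that will map.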