Low percentage of aligned reads

Hello,

When running humann2, some samples in our dataset show a high percentage of unaligned reads (80-90%), while for others the unaligned read count is closer to ~15% [example from Terminal below]. This is when running against the UniRef50 database. I have seen at least one forum post where Eric worked through troubleshooting a similar problem with someone else (https://groups.google.com/forum/m/#!msg/humann-users/vcGFdlBnZA0/kU5BHAYyBQAJ), but I have not been successful in reducing the unaligned reads.

I am wondering whether you can help me troubleshoot what is happening with the samples that have a high % of unaligned reads. Checking quality with FastQC, I see no adapter sequences; I did deplete host DNA content using the kneaddata databases; and FastQC does show a warning for duplicated sequences, but these total less than 0.5%. My understanding is that this is relatively low, so it is probably not what is driving the issue for this sample. Any advice you can give on which parameters to change or check, and how, would be very useful. Thank you.

Christine

EXAMPLE RUN:

Running humann2 on this test sample:

Running metaphlan2.py …

Found g__Blautia.s__Ruminococcus_gnavus : 20.85% of mapped reads

Found g__Enterococcus.s__Enterococcus_avium : 19.59% of mapped reads

Found g__Gammaretrovirus.s__Murine_osteosarcoma_virus : 14.90% of mapped reads

Found g__Anaerostipes.s__Anaerostipes_caccae : 12.52% of mapped reads

Found g__Anaerostipes.s__Anaerostipes_unclassified : 6.49% of mapped reads

Found g__Erysipelotrichaceae_noname.s__Clostridium_innocuum : 5.45% of mapped reads

Found g__Turicibacter.s__Turicibacter_unclassified : 4.48% of mapped reads

Found g__Peptostreptococcaceae_noname.s__Clostridium_difficile : 4.19% of mapped reads

Found g__Erysipelotrichaceae_noname.s__Erysipelotrichaceae_bacterium_2_2_44A : 2.97% of mapped reads

Found g__Betaretrovirus.s__Mouse_mammary_tumor_virus : 2.14% of mapped reads

Found g__Lactococcus.s__Lactococcus_lactis : 2.04% of mapped reads

Found g__Listeria.s__Listeria_monocytogenes : 1.30% of mapped reads

Found g__Blautia.s__Ruminococcus_torques : 1.22% of mapped reads

Found g__Clostridiaceae_noname.s__Clostridiaceae_bacterium_JC118 : 0.88% of mapped reads

Found g__Turicibacter.s__Turicibacter_sanguinis : 0.50% of mapped reads

Found g__Propionibacterium.s__Propionibacterium_acnes : 0.34% of mapped reads

Found g__Corynebacterium.s__Corynebacterium_kroppenstedtii : 0.13% of mapped reads

Total species selected from prescreen: 17

Selected species explain 100.00% of predicted community composition

Creating custom ChocoPhlAn database …

Running bowtie2-build …

Running bowtie2 …

Total bugs from nucleotide alignment: 15

g__Gammaretrovirus.s__Murine_osteosarcoma_virus: 25 hits

g__Erysipelotrichaceae_noname.s__Clostridium_innocuum: 31300 hits

g__Betaretrovirus.s__Mouse_mammary_tumor_virus: 13 hits

g__Propionibacterium.s__Propionibacterium_acnes: 1059 hits

g__Clostridiaceae_noname.s__Clostridiaceae_bacterium_JC118: 2754 hits

g__Blautia.s__Ruminococcus_torques: 975 hits

g__Peptostreptococcaceae_noname.s__Clostridium_difficile: 33851 hits

g__Listeria.s__Listeria_monocytogenes: 4844 hits

g__Turicibacter.s__Turicibacter_sanguinis: 7739 hits

g__Enterococcus.s__Enterococcus_avium: 62526 hits

g__Corynebacterium.s__Corynebacterium_kroppenstedtii: 508 hits

g__Erysipelotrichaceae_noname.s__Erysipelotrichaceae_bacterium_2_2_44A: 31950 hits

g__Anaerostipes.s__Anaerostipes_caccae: 36787 hits

g__Lactococcus.s__Lactococcus_lactis: 8110 hits

g__Blautia.s__Ruminococcus_gnavus: 41868 hits

Total gene families from nucleotide alignment: 21047

Unaligned reads after nucleotide alignment: 87.8752111330 %

Agreed that the duplicated reads are likely not the issue here.

The example you pasted shows the % of reads unaligned after the pangenome (nucleotide) search stage but not the translated search stage. The latter is where I’d expect a lot of extra reads to map using the UniRef50 database. Can you confirm that the % unaligned rates you give at the start of your post correspond to the end of the HUMAnN2 workflow (i.e. after all mapping steps)?

In either case, one of the best ways to diagnose low mapping rates is to take ~10 reads from the final unmapped.fasta file and blast them online (https://blast.ncbi.nlm.nih.gov/Blast.cgi) to see if they hit new genomes / possible contaminants / etc.
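If it helps, a quick way to grab a handful of reads for a manual web BLAST (a sketch with coreutils only; the demo filename is illustrative — in practice point it at the unaligned-reads FASTA that HUMAnN leaves in its temp output directory, e.g. a file like <sample>_diamond_unaligned.fa):

```shell
# Illustrative only: generate a stand-in FASTA so the snippet runs
# anywhere; substitute HUMAnN's real unaligned-reads temp file.
seq 1 30 | awk '{print ">read" $1; print "ACGTACGTACGT"}' > unaligned_demo.fa

# Take the first 10 sequences (assumes 2-line records: header + sequence)
# to paste into the NCBI BLAST web form.
head -n 20 unaligned_demo.fa > reads_to_blast.fa
grep -c '^>' reads_to_blast.fa
```

If your FASTA wraps sequences over multiple lines, use a FASTA-aware tool (e.g. seqtk) instead of head.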

Hello Eric,
Thank you for the reply. Yes, we still see a high percentage of unaligned reads after all of the mapping steps.

We diagnosed the issue by blasting sequences as you suggested. A lot of the reads correspond to mouse DNA, so it seems likely that host DNA depletion was not successful. I had previously depleted host DNA using the following kneaddata command:

kneaddata -i Raw_Files/input1.fastq.gz -i Raw_Files/input2.fastq.gz -o kneaddata_out \
    --reference-db kneaddata/knead_mouse_db

Is there another way to deplete host DNA or to troubleshoot this step? Thank you so much for your assistance!

Christine

Well, in one sense that means things are working well, since the mouse-derived reads are not being mapped by HUMAnN2.

I believe the default mouse database for KneadData is built from a C57BL mouse. If you’re working with a different strain, you could make a custom database from its genome (if available) to improve depletion, though I’d be surprised if this had a dramatic effect.

Was there anything in common among the mouse sequences you saw hit in the blast searches? Repeat regions, for example, can be hard to fully deplete since they are hard to get right even in model genomes.

I am having a similar problem: the number of reads mapped at the nucleotide level ranges from 16% to 70% (mean 56%). I am using the latest HUMAnN version (3.0 alpha) via conda with UniRef90, on human microbiome shotgun data. I ran blastn on 1% of the KneadData QC’d reads (with repeats removed) from the sample with ~85% unmapped reads, using the parameters: blastn -query seq.fa -task megablast -db nt -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -perc_identity 50 -max_target_seqs 1. About 18% of the reads had hits, and most of them seem to be complete bacterial genomes.
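(For reference, one way to take such a subsample from a FASTA when seqtk isn’t at hand; the filenames, fraction, and demo input are all illustrative:)

```shell
# Illustrative stand-in input: 1000 two-line FASTA records.
seq 1 1000 | awk '{print ">read" $1; print "ACGTACGT"}' > qc_reads_demo.fa

# Keep each record with probability ~1%. The "keep" flag is set at each
# header line and persists until the next header, so multi-line FASTA
# records are kept or dropped as a unit.
awk 'BEGIN { srand(42) } /^>/ { keep = (rand() < 0.01) } keep' \
    qc_reads_demo.fa > subsample.fa
```

seqtk (`seqtk sample`) does the same thing more robustly if it is installed.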

Should I relax the parameters --nucleotide-identity-threshold and --translated-identity-threshold? humann --help says that the --nucleotide-identity-threshold default is 0.0; is that correct?

The default of 0.0 for nucleotide identity means “believe any hits reported by bowtie2.” Bowtie2 doesn’t explicitly filter on alignment identity, but it’s very rare for it to report alignments <90% nucleotide ID. The filter allows you to post-process the bowtie2 hits to be more stringent (if desired).

Otherwise it looks like your blastn results are concordant with the low end of the mapping rates you’re seeing from HUMAnN 3.0. Maybe I’m misunderstanding something about your question? Were you suggesting that blastn was mapping reads that HUMAnN wasn’t?

My mistake. I ran blastn on the unaligned reads only.

Yes. I was trying to figure out why humann wasn't mapping a large proportion of my reads. I thought maybe they were not bacterial reads, or they were from novel genomes, but I ruled those out with my blastn search. It seems that HUMAnN 3 is too stringent and is missing reads, but I don't know.

Gotcha. I’d guess that a good chunk of the 18% mapped by blastn and not HUMAnN are hitting non-protein-coding regions (RNAs and so forth). HUMAnN is only mapping to protein-coding genes, so you’ll definitely align more if working with full-length genomes.

As far as stringency, it’s probably the database coverage filters more than the identity filters that are preventing the rest of the reads from mapping. For example, if a single read aligns well to a 1 kb gene, it will be counted as “unmapped” since the gene itself is not sufficiently well-covered to be believable (in our view). We implemented this sort of coverage filter for translated search in v2.0 and extended it to nucleotide search in v3.0 (the filters are separately tunable). These filters dramatically reduce false positives compared to just giving each read its best hit, but you are welcome to turn them off (i.e. set to 0.0) if you want to maximize the number of mapped reads.
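To make the coverage-filter arithmetic concrete (the numbers here are illustrative; the 50% default matches the "coverage greater than threshold (50.0)" lines in the logs later in this thread):

```shell
# Illustrative coverage arithmetic: a single 100 nt read hitting a
# 1000 nt gene covers only 10% of it -- below a 50% subject-coverage
# threshold, so the hit is discarded even with perfect identity.
awk 'BEGIN {
    read_len = 100; gene_len = 1000; threshold = 50
    cov = 100 * read_len / gene_len
    printf "coverage = %.0f%% -> %s\n", cov, (cov >= threshold ? "kept" : "filtered")
}'
```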

Got it. Would tuning the coverage filters result in higher accuracy than using UniRef50 instead of UniRef90?

Using UniRef50 over UniRef90 will have a similar effect as lowering the identity threshold (i.e. you will find hits at more remote homology); it won’t rescue hits that are lost to the above-described coverage filters. I tend to stick with UniRef90 for anything human-associated and favor UniRef50 otherwise.


The log below says 29514910 (83.44%) aligned 0 times.

For this sample (the one that lost the most reads), the majority of the loss during translated alignment appears to be due to identity: 12,713,631 alignments lost to the identity filter vs. 7,218,428 to query coverage and 9,235,839 to subject coverage. However, in the majority of samples, most reads lost during translated search are lost to query or subject coverage, not identity.

On average, 30% of reads do not align after translated search. That's common, right? One sample had nearly 60% of reads unaligned after translated search. Would you still suggest changing the parameters for all samples, or just for this one? If so, which parameters?

humann.utilities - DEBUG: bowtie2 summary:
35371160 reads; of these:
  35371160 (100.00%) were unpaired; of these:
    29514910 (83.44%) aligned 0 times
    3151724 (8.91%) aligned exactly 1 time
    2704526 (7.65%) aligned >1 times
16.56% overall alignment rate

humann.humann - INFO: TIMESTAMP: Completed nucleotide alignment : 529 seconds
humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
humann.utilities - DEBUG: Total alignments not included based on large e-value: 0
humann.utilities - DEBUG: Total alignments not included based on small percent identity: 0
humann.utilities - DEBUG: Total alignments not included based on small query coverage: 0
humann.search.blastx_coverage - INFO: Total alignments without coverage information: 0
humann.search.blastx_coverage - INFO: Total proteins in blastx output: 136134
humann.search.blastx_coverage - INFO: Total proteins without lengths: 0
humann.search.blastx_coverage - INFO: Proteins with coverage greater than threshold (50.0): 90546
humann.search.nucleotide - DEBUG: Total nucleotide alignments not included based on filtered genes: 231574
humann.search.nucleotide - DEBUG: Total nucleotide alignments not included based on small percent identities: 8
humann.search.nucleotide - DEBUG: Total nucleotide alignments not included based on query coverage threshold: 0
humann.search.nucleotide - DEBUG: Keeping sam file
humann.humann - INFO: TIMESTAMP: Completed nucleotide alignment post-processing : 1461 seconds
05/29/2020 11:18:21 PM - humann.humann - INFO: Total bugs from nucleotide alignment: 44

humann.humann - INFO: Total gene families from nucleotide alignment: 90546
humann.humann - INFO: Unaligned reads after nucleotide alignment: 84.0981296627 %

humann.humann - INFO: TIMESTAMP: Completed translated alignment : 75698 seconds
humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
humann.utilities - DEBUG: Total alignments not included based on large e-value: 4072
humann.utilities - DEBUG: Total alignments not included based on small percent identity: 12713631
humann.utilities - DEBUG: Total alignments not included based on small query coverage: 7218428
humann.search.blastx_coverage - INFO: Total alignments without coverage information: 0
humann.search.blastx_coverage - INFO: Total proteins in blastx output: 716163
humann.search.blastx_coverage - INFO: Total proteins without lengths: 0
humann.search.blastx_coverage - INFO: Proteins with coverage greater than threshold (50.0): 107222
humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
humann.utilities - DEBUG: Total alignments not included based on large e-value: 4072
humann.utilities - DEBUG: Total alignments not included based on small percent identity: 12713631
humann.utilities - DEBUG: Total alignments not included based on small query coverage: 7218428
humann.search.translated - DEBUG: Total translated alignments not included based on small subject coverage value: 9235839

humann.humann - INFO: Total gene families after translated alignment: 188899
humann.humann - INFO: Unaligned reads after translated alignment: 59.2021777064 %
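As a sanity check, the translated-search filter counts in the log above can be totted up (note these are alignment counts, not unique reads — a single read can contribute several filtered alignments):

```shell
# Sum the translated-search filter counts reported in the log:
# identity + query coverage + subject coverage.
awk 'BEGIN {
    id = 12713631; qcov = 7218428; scov = 9235839
    total = id + qcov + scov
    printf "total filtered alignments: %d\n", total
    printf "identity share: %.1f%%\n", 100 * id / total
}'
```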

Could the high % of unaligned reads after translated alignment also be due to novel species in your community, e.g. those not found in MetaPhlAn2? Or is the translated unaligned % the % remaining after mapping to the whole UniRef90 database (species-agnostic)?

30% unmapped after translated search seems very reasonable (check out Fig. S3 of the HUMAnN 2.0 paper for reference). I would not recommend varying the parameters sample-by-sample – this just seems like a pathway to unintended batch effects.

Are you confident that the samples have been fully host depleted? Perhaps the outlier sample has an excess of hidden contamination?

Otherwise you could, as an experiment, try running the outlier sample in UniRef50 mode to see if more reads align with relaxed translated search (as might be expected if the sample contains an abundant unknown species).


Hi there,
We also got very low mapping rates using humann2 for rat gut microbiome samples (95%+ unaligned reads after nucleotide alignment and 90%+ unaligned after translated alignment against UniRef90). We used kneaddata to filter out the host sequences.
I wonder whether HUMAnN 3.0 also performs better on rat samples.
Thanks!

I have not tested HUMAnN 3 on a rat microbiome myself, but it is generally expected to explain more reads under default settings. Give it a try and let us know what you find!

We just tested one of our samples. It resulted in 94% unaligned reads after nucleotide alignment (95% with humann2). We used the UniRef50 database instead of UniRef90 with default settings; the unaligned reads after translated alignment decreased to 48%, which is okay, I think? The only potential issue is that the MetaPhlAn3 taxonomy is quite different from the MetaPhlAn2 taxonomy, with some taxa no longer present and changes in abundance.

I’ve done something similar with mouse microbiome samples (~90% of reads unaligned after translated alignment against UniRef90). MetaPhlAn 3 was a big issue for me: the bug list was very different from paired 16S composition data for the same sample (16S = 30% Firmicutes, MetaPhlAn 3 = 1.5% Firmicutes). So I ran HUMAnN 3 bypassing MetaPhlAn (so it uses the entire ChocoPhlAn database) and was able to get down to ~70% of reads unaligned after translated search. However, because the ChocoPhlAn database only includes taxonomy to the genus level, after bypassing MetaPhlAn I can’t get a count of my sample’s taxonomy at the phylum level to compare against the paired 16S data. The metagenome functional data from HUMAnN 3 looked great, but since I couldn’t trust the taxonomic composition, I wasn’t sure how representative the data was.

I ended up switching to a basic pipeline built on the iMGMC mouse microbiome database, which resulted in a similar proportion of reads mapping, but the bug composition at the metagenome level is much closer to the paired 16S data we have (16S = 30% Firmicutes, iMGMC mapping = 32% Firmicutes). The tools natively provided by iMGMC are not nearly as robust as HUMAnN 3, but I think the basic data output is more representative of my samples. I never tried UniRef50 in HUMAnN 3 because of how off the taxonomy was, but perhaps that would have helped?

UniRef90 vs UniRef50 won’t make much of a difference if you’re doing most of your mapping against pangenomes (it just changes the “resolution” of the output to broader rather than more focused pangenomes).

UniRef50 makes more of a difference if you’re focusing on translated search, since in that case it allows the actual read-vs-protein alignments to be a lot more relaxed.

Once we release the updated infer_taxonomy script you should be able to make a guess about the phylum-level recruitment of the sample from any of these mapping options. Apologies for the delay there.