Hello,
When running humann2, some samples in our dataset show a high percentage of unaligned reads (80-90%), while for others the unaligned read percentage is closer to 15% (example terminal output below). This is with the UniRef50 database. I have seen at least one forum post where Eric worked through a similar problem with someone else (https://groups.google.com/forum/m/#!msg/humann-users/vcGFdlBnZA0/kU5BHAYyBQAJ), but I have not managed to reduce the unaligned reads.
Could you help me troubleshoot what is happening with the samples that have a high percentage of unaligned reads? FastQC shows no adapter sequences, and I depleted host DNA using the kneaddata databases. FastQC does show a warning for duplicated sequences, but the total comes to less than 0.5%. My understanding is that this is relatively low, so it is probably not what is driving the issue for this sample. Any advice on which parameters to change or check, and how, would be very useful. Thank you.
Christine
EXAMPLE RUN:
Running humann2 on this test sample:
Running metaphlan2.py …
Found g__Blautia.s__Ruminococcus_gnavus : 20.85% of mapped reads
Found g__Enterococcus.s__Enterococcus_avium : 19.59% of mapped reads
Found g__Gammaretrovirus.s__Murine_osteosarcoma_virus : 14.90% of mapped reads
Found g__Anaerostipes.s__Anaerostipes_caccae : 12.52% of mapped reads
Found g__Anaerostipes.s__Anaerostipes_unclassified : 6.49% of mapped reads
Found g__Erysipelotrichaceae_noname.s__Clostridium_innocuum : 5.45% of mapped reads
Found g__Turicibacter.s__Turicibacter_unclassified : 4.48% of mapped reads
Found g__Peptostreptococcaceae_noname.s__Clostridium_difficile : 4.19% of mapped reads
Found g__Erysipelotrichaceae_noname.s__Erysipelotrichaceae_bacterium_2_2_44A : 2.97% of mapped reads
Found g__Betaretrovirus.s__Mouse_mammary_tumor_virus : 2.14% of mapped reads
Found g__Lactococcus.s__Lactococcus_lactis : 2.04% of mapped reads
Found g__Listeria.s__Listeria_monocytogenes : 1.30% of mapped reads
Found g__Blautia.s__Ruminococcus_torques : 1.22% of mapped reads
Found g__Clostridiaceae_noname.s__Clostridiaceae_bacterium_JC118 : 0.88% of mapped reads
Found g__Turicibacter.s__Turicibacter_sanguinis : 0.50% of mapped reads
Found g__Propionibacterium.s__Propionibacterium_acnes : 0.34% of mapped reads
Found g__Corynebacterium.s__Corynebacterium_kroppenstedtii : 0.13% of mapped reads
Total species selected from prescreen: 17
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database …
Running bowtie2-build …
Running bowtie2 …
Total bugs from nucleotide alignment: 15
g__Gammaretrovirus.s__Murine_osteosarcoma_virus: 25 hits
g__Erysipelotrichaceae_noname.s__Clostridium_innocuum: 31300 hits
g__Betaretrovirus.s__Mouse_mammary_tumor_virus: 13 hits
g__Propionibacterium.s__Propionibacterium_acnes: 1059 hits
g__Clostridiaceae_noname.s__Clostridiaceae_bacterium_JC118: 2754 hits
g__Blautia.s__Ruminococcus_torques: 975 hits
g__Peptostreptococcaceae_noname.s__Clostridium_difficile: 33851 hits
g__Listeria.s__Listeria_monocytogenes: 4844 hits
g__Turicibacter.s__Turicibacter_sanguinis: 7739 hits
g__Enterococcus.s__Enterococcus_avium: 62526 hits
g__Corynebacterium.s__Corynebacterium_kroppenstedtii: 508 hits
g__Erysipelotrichaceae_noname.s__Erysipelotrichaceae_bacterium_2_2_44A: 31950 hits
g__Anaerostipes.s__Anaerostipes_caccae: 36787 hits
g__Lactococcus.s__Lactococcus_lactis: 8110 hits
g__Blautia.s__Ruminococcus_gnavus: 41868 hits
Total gene families from nucleotide alignment: 21047
Unaligned reads after nucleotide alignment: 87.8752111330 %
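
In case it helps with comparing samples, here is a minimal Python sketch for summarizing terminal output like the run above: it sums the per-species hit counts and uses the reported unaligned percentage to back out a rough total read count. The function name `summarize_nucleotide_step` is hypothetical and not part of humann2. It assumes the log lines look exactly like the ones pasted here and that each "hit" corresponds to one aligned read, which may not hold exactly.

```python
import re

# Matches per-species lines such as "g__Listeria.s__Listeria_monocytogenes: 4844 hits".
HITS_RE = re.compile(r"^[ \t]*(\S+):[ \t]+(\d+)[ \t]+hits[ \t]*$", re.MULTILINE)

# Matches "Unaligned reads after nucleotide alignment: 87.8752111330 %".
UNALIGNED_RE = re.compile(
    r"Unaligned reads after nucleotide alignment:[ \t]*([0-9.]+)[ \t]*%"
)


def summarize_nucleotide_step(log_text):
    """Summarize the nucleotide alignment step from pasted terminal output.

    Assumption: one hit is one aligned read, so the total read count can be
    estimated as total_hits / aligned_fraction. Treat it as a rough figure.
    """
    hits = {name: int(count) for name, count in HITS_RE.findall(log_text)}
    total_hits = sum(hits.values())

    match = UNALIGNED_RE.search(log_text)
    if match is None:
        raise ValueError("no 'Unaligned reads after nucleotide alignment' line found")
    unaligned_pct = float(match.group(1))

    aligned_fraction = 1.0 - unaligned_pct / 100.0
    # Avoid dividing by zero if the log reports 100% unaligned.
    estimated_total = (
        round(total_hits / aligned_fraction) if aligned_fraction > 0 else None
    )

    return {
        "hits_per_species": hits,
        "total_hits": total_hits,
        "unaligned_pct": unaligned_pct,
        "estimated_total_reads": estimated_total,
    }
```

Passing the saved terminal output of each sample to this function as a string would give one summary per sample, which makes it easier to see whether the high-unaligned samples also have unusually few total hits or are dominated by a few species.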