Hi @fbeghini, @franzosa !!!
I am getting 67.78% unaligned reads with HUMAnN 3.0 after the nucleotide search tier. After the translated search (against the uniref90_ec_filtered database) this dropped to 63.05%.
Is that an unusually high amount? If so, what should I do?
What is a normal range of unaligned reads for this step?
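For reference, I'm reading these percentages out of the HUMAnN log in the temp folder, roughly like this (the sample/output names are placeholders for my own):

    grep "Unaligned reads after" humann_out/sample_humann_temp/sample.log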
That is a bit high for the gut in my experience. On a recent analysis of some gut metagenomes with HUMAnN 3 I saw ~55-75% of reads mapping to pangenomes (25-45% unmapped; IQRs).
What sequencing depth are you working with? If it's particularly low, it's possible that many genes are failing to reach the 50% minimum coverage threshold and so their reads are treated as unmapped.
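If you want to check whether that is what is happening, those coverage filters are exposed as command-line options (double-check the exact flag names against humann --help for your version); something like:

    # defaults shown; lowering the subject-coverage thresholds makes the filters more permissive
    humann --input sample.fastq --output humann_out \
        --nucleotide-subject-coverage-threshold 50.0 \
        --translated-subject-coverage-threshold 50.0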
My files are of different sizes. Some are around 2 GB and some around 4 GB. Should I use files with the same sequencing depth? Should I rarefy/normalize them before the HUMAnN run? If yes, how do I do that?
No need to rarefy. We tend to exclude files with very low sequencing depths (where something appears to have gone wrong, e.g. <1M reads) but otherwise you should be OK with variable depths (downstream normalization will correct for this).
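For that downstream normalization, humann_renorm_table is the usual route, e.g. converting the default RPK units to relative abundances:

    humann_renorm_table --input sample_genefamilies.tsv \
        --output sample_genefamilies_relab.tsv --units relab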
Hello, I'd like to re-open this thread because I have similarly high unmapped rates (73% unmapped after translated alignment) from mouse gut metatranscriptome data (2 x 75 bp) and am not sure how best to troubleshoot. I had poor mapping at the nucleotide alignment step (95% unaligned), so I tried loosening the nucleotide coverage threshold to 0 and the nucleotide percent identity to 0.8, but the overall improvement was marginal (1% lower unmapped). I will try relaxing the thresholds at the translated stage, since this is where most of my reads are mapping anyway, but would love to get input while that is running. Is it typical for the nucleotide alignment to be so low (just 5%)? Any other ideas to improve mapping?
For context, I used kneaddata to trim adapters and remove contaminants (rRNA, mouse) and then provided the clean, concatenated fastq file (~10GB) to humann3 with default parameters.
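Roughly, the commands looked like the sketch below (database paths are placeholders, and the kneaddata input flags are spelled differently across versions, so treat this as a sketch rather than the exact call):

    # QC: trim adapters, remove rRNA and host (mouse) reads
    kneaddata --input sample_R1.fastq --input sample_R2.fastq \
        --reference-db /path/to/mouse_db --reference-db /path/to/rRNA_db \
        --output kneaddata_out

    # concatenate the cleaned mates and run HUMAnN 3 with defaults
    cat kneaddata_out/*paired_1.fastq kneaddata_out/*paired_2.fastq > sample_clean.fastq
    humann --input sample_clean.fastq --output humann_out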
When I look into the final unaligned file and blast a few reads (commands sketched after the list below), I get mapping to:
genome not in the ChocoPhlAn database (I suppose I can add a custom genome for coding regions, but this read also has good homology with M. intestinale, which is present in the reference database):
Muribaculum gordoncarteri: CCTTGGTTTCGCGTCCACCCCCGCCGACTGTGGCGCCTTGTTCAGACTCGCTTTCGCTTCGGCTCCGTGCGTCCTC
genome in the reference database and meeting the threshold (not sure why this failed?):
Bacillus sp.: TNCGTCACGGCTCAGGCTTACGACATGCGTACTTCACTACATGCCACCCTTACCGCTTGGACGCGTCACCATCTGC
genome in the reference database but below the threshold:
Roseburia hominis: CGTACGGGTATGCTATGAACAATAGCGGCTTTTCTCGGTACATGGCATGCATGCTTCGCTACTATAAGTTCGCTCC
nothing with significant similarity (>50% of reads; why are there so many?)
I would hope to recover more of the reads belonging to the first two categories and understand why so many reads are falling into the last category.
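For anyone who wants to reproduce this check: the reads that survive both search tiers unaligned end up in the temp folder, so I just pulled the first few and pasted them into web BLAST against nt. The file name below is whatever HUMAnN wrote for my sample, so adjust accordingly.

    # first few leftover reads after the translated (DIAMOND) search tier
    head -n 20 humann_out/sample_humann_temp/sample_diamond_unaligned.fa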
Hello all,
Just wanted to follow up on this post as (after messing around to high heaven with the metaphlan/bowtie options) I think I've uncovered the culprit... I added the parameter --add_viruses to the metaphlan options and saw that 80% of my reads are mapping to a murine astrovirus (!). In the end, then, I guess I can proceed knowing I have comfortably mapped most of the bacterial transcripts. The consequence of the viral RNA in these samples is a bit outside my scope of expertise, but it is super interesting and in a way satisfying to see, because it actually makes some sense. I have previously run this pipeline with samples from gnotobiotic mice processed the same way and had 10% unmapped, so in the end I have the same unmapped rate for these SPF samples if I ignore the viral RNA.
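In case it's useful to others, I passed the flag through HUMAnN rather than running MetaPhlAn separately; with MetaPhlAn 3 (where --add_viruses is available) it looked roughly like this:

    humann --input sample_clean.fastq --output humann_out \
        --metaphlan-options="--add_viruses"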
Hope this post might help others in a similar position of high unaligned reads!
Hello, I have a really similar problem to the one in this thread.
My samples are anodic biofilm samples from benthic microbial fuel cells, but I am getting over 90% of the relative abundance (after renorm) assigned to UNMAPPED (attached: merged_genefamilies_EC_unstratified_RA.tsv, 153.9 KB).
My files are mostly 2 GB to 9 GB after the BBMerge step, so seemingly depth isn't the problem.
Sorry for being slow to get caught up here. I would check out my responses in other threads about low mapping rates for additional details. Two quick summary replies here: 1) depending on how your metatranscriptomes were processed upstream of HUMAnN, you might have a lot of non-protein-coding RNA reads. Because HUMAnN is focused on protein-coding genes, it won't be able to map those. 2) For environmental samples (DNA or RNA) I recommend mapping to UniRef50 rather than UniRef90. We automatically use a more relaxed homology-based search for the former, which might help to improve your alignment rate.
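As a sketch of the UniRef50 switch (the download location is a placeholder; humann_databases --available lists the exact build names):

    # fetch the UniRef50 DIAMOND database and point HUMAnN at it
    humann_databases --download uniref uniref50_diamond /path/to/databases
    humann --input sample.fastq --output humann_out \
        --protein-database /path/to/databases/uniref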