Hi @fbeghini, @franzosa!
I am getting 67.78% unaligned reads with HUMAnN 3.0 after the nucleotide search tier. After the translated search (against the uniref90_ec_filtered database) this drops to 63.05%.
Is this an unusually high amount? If so, what should I do?
What is a normal range of unaligned reads for this step?
What environment are your reads from?
I am working with gut metagenome.
That is a bit high for the gut in my experience. On a recent analysis of some gut metagenomes with HUMAnN 3 I saw ~55-75% of reads mapping to pangenomes (25-45% unmapped; IQRs).
What sequencing depth are you working with? If it's particularly low, it's possible that many genes are failing to reach the 50% minimum coverage threshold, and so their reads are treated as unmapped.
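To illustrate the mechanism described above, here is a toy sketch (not HUMAnN's actual implementation) of a 50% subject-coverage filter: reads hitting a gene only count if the gene is covered over at least half its length, so at low depth a gene's sparse hits can all be discarded as unmapped.

```python
def gene_coverage(gene_length, alignments):
    """Fraction of gene positions covered by at least one alignment.
    `alignments` is a list of (start, end) half-open intervals."""
    covered = set()
    for start, end in alignments:
        covered.update(range(start, min(end, gene_length)))
    return len(covered) / gene_length

def count_mapped_reads(gene_length, alignments, min_coverage=0.5):
    """Keep a gene's reads only if its coverage meets the threshold."""
    if gene_coverage(gene_length, alignments) >= min_coverage:
        return len(alignments)
    return 0  # every read on this gene is treated as unmapped

# Deep sample: 18 reads tile a 1000 bp gene -> ~95% covered, all count.
deep = [(i, i + 100) for i in range(0, 900, 50)]
# Shallow sample: 2 reads on the same gene -> only 20% covered.
shallow = [(0, 100), (400, 500)]

print(count_mapped_reads(1000, deep))     # prints 18
print(count_mapped_reads(1000, shallow))  # prints 0
```

The point is that the same gene, with the same per-read alignment quality, yields zero "mapped" reads at low depth purely because of the coverage cutoff.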
My files are of different sizes: some are around 2 GB and some around 4 GB. Should I use files with the same sequencing depth? Should I rarefy/normalize them before running HUMAnN? If yes, how do I do that?
No need to rarefy. We tend to exclude files with very low sequencing depths (where something appears to have gone wrong, e.g. <1M reads) but otherwise you should be OK with variable depths (downstream normalization will correct for this).
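For what "downstream normalization" looks like conceptually: HUMAnN ships a `humann_renorm_table` utility for this, and in relative-abundance mode it amounts to dividing each sample's feature abundances by that sample's total, so variable sequencing depths cancel out. A minimal sketch (sample names and counts below are made up):

```python
def to_relative_abundance(table):
    """table: {sample: {feature: abundance}} -> per-sample fractions."""
    out = {}
    for sample, features in table.items():
        total = sum(features.values())
        if total:
            out[sample] = {f: v / total for f, v in features.items()}
        else:
            out[sample] = dict(features)  # leave an empty sample as-is
    return out

# A "deep" sample has 2.5x the counts of a "shallow" one, but the same
# underlying composition; after normalization they look identical.
counts = {
    "shallow_2GB": {"geneA": 100.0, "geneB": 300.0},
    "deep_4GB":    {"geneA": 250.0, "geneB": 750.0},
}
relab = to_relative_abundance(counts)
print(relab["shallow_2GB"]["geneA"], relab["deep_4GB"]["geneA"])  # 0.25 0.25
```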
Hello, I'd like to re-open this thread because I have similarly high unmapped rates (73% unmapped after translated alignment) from mouse gut metatranscriptome data (2 x 75 bp) and am not sure how best to troubleshoot. I had poor mapping at the nucleotide alignment step (95% unaligned), so I tried loosening the nt coverage threshold to 0 and the nt percent identity to 0.8, but the overall improvement was marginal (1% lower unmapped). I will try relaxing the thresholds at the translated stage, since that is where most of my reads are mapping anyway, but I would love to get input while that runs. Is it typical for the nucleotide alignment rate to be so low (just 5% mapped)? Any other ideas to improve mapping?
For context, I used kneaddata to trim adapters and remove contaminants (rRNA, mouse) and then provided the clean, concatenated fastq file (~10 GB) to humann3 with default parameters.
When I look into the final unaligned file and blast a few reads, I get mapping to:
- genome not in the ChocoPhlAn database (I suppose I can add a custom genome to cover its coding regions, but this read also has good homology with M. intestinale, which is present in the reference database):
  Muribaculum gordoncarteri: `CCTTGGTTTCGCGTCCACCCCCGCCGACTGTGGCGCCTTGTTCAGACTCGCTTTCGCTTCGGCTCCGTGCGTCCTC`
- genome in the reference database meeting the threshold (not sure why this failed?):
- genome in the reference database but below the threshold:
  Roseburia hominis: `CGTACGGGTATGCTATGAACAATAGCGGCTTTTCTCGGTACATGGCATGCATGCTTCGCTACTATAAGTT`
- nothing with significant similarity (>50% of reads; why are there so many?)
I would hope to recover more of the reads belonging to the first two categories and understand why so many reads are falling into the last category.
I’ve uploaded the log file and an abbreviated version of the unaligned reads here:
Thanks in advance for your suggestions!
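For anyone wanting to spot-check their own unaligned reads the same way, a minimal FASTA sampler like the one below pulls a random handful of records to paste into web BLAST. The input path is a placeholder; point it at the unaligned-reads file HUMAnN writes (the exact filename depends on your run).

```python
import random

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def sample_reads(path, n=5, seed=0):
    """Return up to n randomly chosen (header, sequence) records."""
    records = list(read_fasta(path))
    random.seed(seed)  # fixed seed so the spot-check is reproducible
    return random.sample(records, min(n, len(records)))
```

Usage would be something like `sample_reads("sample_unaligned.fa", n=10)` (the filename here is hypothetical), then BLAST each returned sequence and tally which category it falls into.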
Just wanted to follow up on this post, as (after messing around to high heaven with the MetaPhlAn/Bowtie2 options) I think I've uncovered the culprit: I added the `--add_viruses` parameter to the MetaPhlAn options and saw that 80% of my reads map to a murine astrovirus (!). In the end, then, I can proceed knowing I have comfortably mapped most of the bacterial transcripts. The consequence of the viral RNA in these samples is a bit outside my expertise, but it is super interesting and in a way satisfying to see, because it actually makes some sense. I previously ran this pipeline with samples from gnotobiotic mice processed the same way and had 10% unmapped, so in the end I have the same unmapped rate for these SPF samples if I ignore the viral RNA.
Hope this post might help others in a similar position of high unaligned reads!