I’m trying to analyze some metatranscriptomics data from human stool with HUMAnN2, but the results I’m getting seem to be really ambiguous (unfortunately no paired metagenomics to go along with this). I’m removing adapters and low complexity sequences with bbduk, removing human sequences and 16S rRNA with kneaddata, but ending up with 75%+ of my data being “UNMAPPED” in the genefamilies.tsv file. Here’s the average proportions across three timepoints:
If I regroup to EC or MetaCyc, then >99% ends up being UNMAPPED or UNINTEGRATED.
I inspected the metaphlan_bugs_list.tsv files in the temp directories of a few samples and saw that some samples are absolutely dominated by some Viruses. Is it possibly the case that these viruses simultaneously extremely active at transcription and also not well-represented in the uniref90 database? I’m not sure what else might be going on. If there’s some parameters I could tune to make my results more useful, I would greatly appreciate having them pointed out to me.
Are you using the latest update to the KneadData rRNA database (see the pinned comment on the KneadData channel)? The previous version was not properly formatted for use with bowtie2 and was allowing a lot of rRNA reads to sneak through (I suspect this is what you’re seeing in the large UNMAPPED section).
Alternatively, if you blast a handful of UNMAPPED reads and confirm they map to rRNA sequences, you could continue with your analysis without having to re-profile the data from scratch.
For the other questions, the MetaPhlAn 2.0 viral abundances are a function of viral coverage and not total read mass: it’s possible that your sample is all viral, but it’s more likely (in my experience) that you have a small number of highly covered viruses that don’t actually account for much total read mass.
For the EC regrouping rate, EC annotations are relatively rare (<10% of proteins). So if only ~20% of your RNA reads are coding, and only 5% of those are from EC-annotated proteins, then you’d see 1% of total reads annotated to ECs. If you want to use a system with a higher annotation rate, I’ve had a lot of success with UniProt’s Pfam annotations (also available as a regrouping option in HUMAnN).
This is extremely helpful, thank you. I just BLASTed a dozen unmapped reads and half of them were rRNA. I’m going to re-process everything with the updated rRNA database and we’ll see if that clears things up.
I ended up upgrading to humann3 and re-running everything with the newer databases in addition to fixing the kneaddata rRNA db.
It seems that the move to humann3 increased the mapping rate substantially. If I use the old rRNA db + humann3, I get ~45% of the data as UNMAPPED (40% as RPKM, 45% as CoPM). If I use the new rRNA db + humann3, UNMAPPED increases to 53% for RPKM and 58% for CoPM.
BLASTing a random sampling of unmapped reads seems to be a mixture of bacteria, other microbes (e.g., protozoa), and “nothing”.
Glad that HUMAnN 3 is improving mapping! I’m surprised though that you’re getting more unmapped reads using the fixed rRNA depletion database (given that many of the previously unmapped reads appeared to be ribosomal by BLAST). Am I missing something here?
I’m surprised as well! Is it possible that some of the rRNA reads that snuck through before were getting mapped to something in UniRef? That could explain why a higher proportion of reads are now unmapped.
Possibly, but I would’ve expected the fraction of newly removed rRNA reads to dwarf that number. Can you confirm that a lot more reads are removed by KneadData during QC using the updated rRNA database?
Here’s the total remaining reads from 5 samples with the old rRNA DB:
Sample_1006093: 3684737
Sample_1006102: 3489848
Sample_1006103: 3842213
Sample_1006104: 4571333
Sample_1006105: 5687093
And with the new rRNA DB:
Sample_1006093: 1562635
Sample_1006102: 2581638
Sample_1006103: 2873539
Sample_1006104: 3392668
Sample_1006105: 2022936
Those numbers are reassuring, thanks! But for the mapping rate to drop with the increased filtering it would require the contaminant reads to have been mapping at a higher rate than the non-contaminant reads. I guess that’s possible it just feels really strange. We’ll have to try some experiments on some MTX data we have here and see if we observe a similar pattern.
Following up on this. I do see in some samples that the mapping rate for the contaminant sequences is mapping at a much higher rate than the non-contaminants (I also see some samples where the decontaminated mapping rate increases or remains stable). I then did the following:
I picked a sample where the contaminated mapping rate was 66% and the decontaminated mapping rate was 3%.
I BLASTed some of the reads from the decontaminated sample’s unaligned_diamond.fa. The breakdown was roughly 45% Endolimax nana, 50% unaligned, 5% some other microbe
I found this to be the case in at least one other sample with a similar decontaminated/contaminated mapping rate
Thanks for sharing this. There is some precedent for spurious ORFs being called within rRNA genes (and winding up as official “proteins” in UniProt). That could be a partial explanation for the high mapping rate on the rRNA-deplete-able reads.
The alignments to Endolimax nana are really interesting. Although this parasite has been known for a while it doesn’t seem to show up in UniProt, hence being missed by our profiling. As far as I can tell, there are rRNA sequences available for it (which is probably what your BLAST search is hitting?) but no genome.
Providing a quick update on this in case it is useful for other people. I’m averaging about 30-99% of reads being unclassified after translated search. I suspect there’s still lots of rRNA leaking through, so I’ve tried a few different methods for screening out rRNA (kneaddata, nhmmer, biobloomtools) , but BLASTing unaligned sequences still gives me consistent but weak hits (i.e., low query cov., middling % identity) to various rRNA sequences from all sorts of organisms.
I believe the company that did this sequencing includes viral, eukaryote, etc. sequences in addition to bacteria + archaea. Knowing that, does it seem reasonable that such a low alignment rate is mostly due to a high abundance of poorly-characterized organisms? I’m struggling to justify whether or not to keep that large “UNMAPPED” number in subsequent analysis as it seems like it has huge implications for CPM / relative abundances.
Seeing a majority of novel sequences from the gut, which is one of the better-referenced community types, is definitely surprising. However, the fact that those reads don’t BLAST well against much of anything at least confirms that they are truly quite novel and not simply missing from HUMAnN’s databases.
An alternative analysis approach is to normalize only within the known/mapped read mass (i.e. excluding unmapped) and then include features like total read depth and % unmapped as covariates in downstream statistical models. That helps to exclude potential false positives that are driven solely by variation in mapping rate. We recently used this approach when analyzing metagenomic data from skin samples that had a high proportion of human DNA contamination (even after removing easily identifiable human DNA with KneadData).