High value of UNINTEGRATED reads

Hello,

I have been analyzing mouse gut samples with HUMAnN 3.0. The properties of samples: pair-end, read_len=100, depth ~20M reads/sample.
Initially, I had a lot of unmapped reads (~60%). After investigating the log file, many reads were discarded based on the --translated-subject-coverage-threshold param. I lowered it to 30%, which made sense in my head, considering the depth of the sequencing. Now, I reduced the UNMAPPED to ~20-25%, which seems reasonable. Just to mention, the number of unmapped reads after the nucleotide alignment is >95%.

The problem is that now, I have ~70% of UNINTEGRATED reads. I understand that only some parts are mapping to pathways, but the thing that worries me is that over 65% are “UNINTEGRATED|unclassified”. If I understand correctly, even though reads can’t be integrated into a pathway, they should be classified to a species level.

Would you expect this kind of behavior?

Thanks,
Matija

1 Like

I’m surprised about the low mapping rate since we did evaluations on CAMI murine gut metagenomes as part of the bioBakery 3 paper and performance was reasonable there. I know this is something we expect to further improve with HUMAnN 4 (in development) since it will include murine gut MAGs in its database.

Lowering --translated-subject-coverage-threshold means that you will accept proteins from translated search even though only 30% of the proteins’ sites (vs. the default 50%) have been covered. I am surprised that that allows you to map so many more reads. It is worth checking some of the UniRefs that show up in the 30% coverage run vs. the 50% coverage run to make sure that they are reasonable (i.e. that the proteins’ source species in UniProt make sense for the murine gut).

Lastly, % UNINTEGRATED is really just a reflection of the fact that many genes are not annotated to MetaCyc reactions, and so they don’t get used in pathway quantification. This number tends to always be fairly high even in (relatively) well annotated communities.

Hi,
I have been working on metagenomic sequence analysis using human3. Unfortunately, I have found a huge portion of unidentified and unintegrated sequences (often a total of 99%). According to my understanding by going through some of the discussions in the forum, using UniRef50 instead of UniRef90 for gene family mapping could help improve the annotation percentage.

I was wondering how I could update the settings in Humann3 to use UniRef50 for analysis.

I have attached a file containing the first few rows of gene family abundance and pathway abundance output files. I would appreciate your thoughts and/or suggestions.
Thanks
Mashuk
gfA_PWY_pcov.txt (2.2 KB)

You would need to download the UniRef50 protein database using the humann_databases script and then either via that script or humann_config tell HUMAnN to use that new database as your default database. Make sure to download it to its own folder - you do not want it in the same folder as other HUMAnN databases (e.g. your UniRef90 database). When running HUMAnN, you would additionally specify --mode uniref50 to use a more relaxed set of translated search settings. Together these changes should increase your mapping rate.

2 Likes

Hi Franzosa,
Thank you so much for your suggestions. I have applied Uniref50 for reanalysis of my sequence following this command
‘humann --input 680_SME_L001_R1_001.fastq.gz --output Output/680_R1_reanalysis --bypass-translated-search --search-mode uniref50’ and compared the processed output files between uniref50 and uniref90. Interestingly uniref90 showed a better output compared to uniref50 whereas the total unintegrated and unaligned output remained high (~95%)
SME_comparison_GFA_Uniref90_uniref50.csv (7.8 MB)
SME_comparison_PWY-coverage_Uniref90_uniref50.csv (49.1 KB)
SME_comparison_RXN_Uniref90_uniref50.csv (128.5 KB)
in both cases.
I would appreciate your thoughts.

Thanks
Mashuk

Hi @franzosa
I would appreciate your kind suggestion regarding this.

Switching to UniRef50 will only map more reads if you’re doing translated search, which you aren’t right now (because of the bypass). Otherwise it’s just a difference in reporting, i.e. reporting nucleotide hits to genes by their UniRef50 family rather than their UniRef90 family.

Thank you a lot for the clarification. I will try running the sequences by skipping the bypass and keep you posted.