Hello,
I have been analyzing mouse gut samples with HUMAnN 3.0. Sample properties: paired-end, read length 100 bp, depth ~20M reads/sample.
Initially, I had a lot of unmapped reads (~60%). After investigating the log file, I found that many reads were being discarded based on the --translated-subject-coverage-threshold parameter. I lowered it to 30%, which made sense to me considering the sequencing depth. This reduced UNMAPPED to ~20-25%, which seems reasonable. Just to mention, the fraction of unmapped reads after the nucleotide alignment step is >95%.
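For reference, this is roughly the command I used for the lowered-threshold run (file names and paths are placeholders):
humann --input sample_cat.fastq.gz --output humann_out_cov30 --translated-subject-coverage-threshold 30.0 --threads 8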
The problem is that now I have ~70% UNINTEGRATED reads. I understand that only part of the reads map to pathways, but what worries me is that over 65% are "UNINTEGRATED|unclassified". If I understand correctly, even if reads can't be integrated into a pathway, they should still be classified to the species level.
Would you expect this kind of behavior?
Thanks,
Matija
I'm surprised about the low mapping rate since we did evaluations on CAMI murine gut metagenomes as part of the bioBakery 3 paper and performance was reasonable there. I know this is something we expect to further improve with HUMAnN 4 (in development) since it will include murine gut MAGs in its database.
Lowering --translated-subject-coverage-threshold means that you will accept proteins from translated search even though only 30% of the proteins' sites (vs. the default 50%) have been covered. I am surprised that this allows you to map so many more reads. It is worth checking some of the UniRefs that show up in the 30% coverage run vs. the 50% coverage run to make sure that they are reasonable (i.e. that the proteins' source species in UniProt make sense for the murine gut).
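As a rough sketch of that check (assuming the two runs' genefamilies outputs are named as below), you could extract the community-level UniRef IDs from each run, list the ones that only appear at the 30% threshold, and then spot-check their source species in UniProt:
cut -f1 genefamilies_cov50.tsv | grep -v '^#' | grep -v '|' | sort > ids_cov50.txt
cut -f1 genefamilies_cov30.tsv | grep -v '^#' | grep -v '|' | sort > ids_cov30.txt
comm -13 ids_cov50.txt ids_cov30.txt > ids_new_at_cov30.txt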
Lastly, % UNINTEGRATED is really just a reflection of the fact that many genes are not annotated to MetaCyc reactions, and so they don't get used in pathway quantification. This number tends to always be fairly high even in (relatively) well annotated communities.
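If you want to put a number on it, one quick check (assuming default output names) is to renormalize the pathway abundance table to relative abundances and then look at the UNMAPPED and UNINTEGRATED rows:
humann_renorm_table --input sample_pathabundance.tsv --output sample_pathabundance_relab.tsv --units relab
grep -E '^(UNMAPPED|UNINTEGRATED)' sample_pathabundance_relab.tsv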
Hi,
I have been working on metagenomic sequence analysis using HUMAnN 3. Unfortunately, I have found a huge portion of unmapped and unintegrated sequences (often totaling 99%). From my reading of some discussions in this forum, using UniRef50 instead of UniRef90 for gene family mapping could help improve the annotation percentage.
I was wondering how I could update the settings in HUMAnN 3 to use UniRef50 for the analysis.
I have attached a file containing the first few rows of gene family abundance and pathway abundance output files. I would appreciate your thoughts and/or suggestions.
Thanks
Mashuk
gfA_PWY_pcov.txt (2.2 KB)
You would need to download the UniRef50 protein database using the humann_databases script and then, either via that script or humann_config, tell HUMAnN to use that new database as your default. Make sure to download it to its own folder - you do not want it in the same folder as other HUMAnN databases (e.g. your UniRef90 database). When running HUMAnN, you would additionally specify --search-mode uniref50 to use a more relaxed set of translated search settings. Together these changes should increase your mapping rate.
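For example, the steps would look roughly like this (the install path is just illustrative; you can also point at the folder per-run with --protein-database instead of updating the config):
humann_databases --download uniref uniref50_diamond /path/to/uniref50_db
humann_config --update database_folders protein /path/to/uniref50_db
humann --input sample.fastq.gz --output sample_uniref50_out --search-mode uniref50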
Hi Franzosa,
Thank you so much for your suggestions. I have applied UniRef50 to reanalyze my sequences with the following command:
humann --input 680_SME_L001_R1_001.fastq.gz --output Output/680_R1_reanalysis --bypass-translated-search --search-mode uniref50
and compared the processed output files between UniRef50 and UniRef90. Interestingly, UniRef90 showed better output than UniRef50, while the total unintegrated and unaligned output remained high (~95%) in both cases.
SME_comparison_GFA_Uniref90_uniref50.csv (7.8 MB)
SME_comparison_PWY-coverage_Uniref90_uniref50.csv (49.1 KB)
SME_comparison_RXN_Uniref90_uniref50.csv (128.5 KB)
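For reference, a command along these lines can merge the two runs' pathway coverage tables into one side-by-side file (directory and file names here are just placeholders for where I copied both runs' outputs):
humann_join_tables --input pathcoverage_tables/ --output joined_pathcoverage.tsv --file_name pathcoverage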
I would appreciate your thoughts.
Thanks
Mashuk
Hi @franzosa
I would appreciate your kind suggestion regarding this.