Running HUMAnN: pre-computed protein blastx M8 input

Hi,

Thanks for a great tool! I am running HUMAnN with a pre-computed DIAMOND bastx input. I am doing this because I want to use the full UniRef90 database (together with the newest version of DIAMOND). I would still like to use HUMAnN for quantification of abundances and for re-grouping of UniRef90 to KO and GO.

However, I get a high fraction of ungrouped UniRef90 families (96-97% for KO and 70-72% for GO). Compared to 50-70 % for KO in a default HUMAnN setup.

Thus, I suspect that a high fraction of the UniRef90 proteins are not found in the utility mapping of HUMAnN.

Is it possible to obtain utility mapping files that are updated for the newest version of UniRef90? Or to obtain the script for generating these utility mapping files.

The reason for the choice of setup is that I want to analyze soil samples and the default HUMAnN does not seem to be optimal for soil samples.

Kind regards
Anne Mathiesen

We recommend using the UniRef50 database / analysis mode for environmental samples. You might try that within the default HUMAnN 3 ecosystem and see how it works? If you’re sticking with a custom, updated UniRef90 database, I would be cautious about only considering really high-identity alignments (our default with UniRef90), since there’s probably a lot more remote homology to take advantage of with your community.

We don’t have updated utility files for the latest UniRef90 database and the repo for building those is currently private. For common systems like KO / GO you might be able to pull the mappings directly from the big “idmapping” file that UniProt maintains? Otherwise they can be parsed from the DR lines in the raw text representations of the SwissProt and TrEMBL databases.

https://www.uniprot.org/downloads

Thanks for the fast answer and great suggestions!

So would you recommend that I decrease the --translated-identity-threshold from the default 80%? I chose to run HUMAnN with --translated-subject-coverage-threshold and --translated-query-coverage-threshold = 0, as I experienced that the default values removed too many of the DIAMOND alignments. But I might also need to decrease the identity threshold?

Thanks, I will look at your suggestions! Do you know if the UniRef90 within HUMAnN 3 will be updated at some point? The size of the newest version of UniRef90 is twice as big as the one in HUMAnN (based on the size of the DIAMOND-indexed databases), so I would expect that we loose a lot of information by running default HUMAnN 3 setup?

Kind regards
Anne

@franzosa from your suggestions, I tried running the default HUMAnN 3 ecosystem on some of the soil samples using UniRef50 to see how this affects the results. I used default parameters besides that I added the “–bypass-nucleotide-search” flag, as I am not interested in taxonomy. I am getting really high percentage of unaligned reads ( e.g. “Unaligned reads after translated alignment: 87.5807396132 %”). Isn’t this really high? I am quite disappointed that such a small fraction of my data is used.

The sequencing depth of the samples is 10,000,000-20,000,000 reads, so the sequencing depth should be sufficient.

Kind regards
Anne

@franzosa did you get the chance to look at my questions?

@franzosa I am sorry to bother you, but I would really appreciate your inputs to my question

Sorry for being so slow. Just to confirm that the UniRef50 search worked as intended, are you seeing UniRef50 gene families (rather than UniRef90 families) in your genefamilies.tsv file? If so, then I am also surprised that the mapping rate was so low given the relaxed search.

The next step I recommend is taking a random sample of ~10 reads that didn’t map and searching them on NCBI BLAST (the web version) to see what they hit. You could do this with blastn or blastp. If they don’t get any hits by this approach, it might mean you just have a lot of really novel DNA in your sample. If the reads have weak hits to known proteins, it might mean that the UniRef50 mapping criteria are still too stringent. If the reads get hits to something weird (e.g. non-coding DNA) it might suggest that your sample composition is different from what you expect. For example, this sometimes happens for us with human metagenomes that haven’t been properly depleted of human genomic DNA or metatranscriptomes with a large fraction of rRNA reads (I wouldn’t expect those exact cases to be the explanation here, but it could be something similar).

One of those cases ought to be true and will help suggest next steps.

Yes, I am seeing UniRef50 gene families! But just to clarify, the relaxed settings I described was for the initial run with UniRef90 and newest version of UniRef. After your suggestion I re-ran the analysis with UniRef50, the HUMAnN3 version of UniRef50, and default settings (so this time, I didn’t relax the --translated-subject-coverage-threshold and --translated-query-coverage-threshold). Would you suggest to lower these given the high percentage of unaligned reads against the UniRef50 (>80%)?

I will try your suggestions and get back to you! Thanks a lot for your help!

When mapping to UniRef50 HUMAnN will default to a 50% amino acid identity threshold (to match the 50% AA clustering of UniRef50). I would not recommend raising that. You could turn off the --translated-subject-coverage-threshold like you did before (i.e. set it to 0). Doing so will then report any UniRef50 that received a hit, even if the protein wasn’t well covered globally. I don’t recommend reducing --translated-query-coverage-threshold unless you’re working with very long reads (i.e. that might be too long to fit entirely within a typical protein sequence).