Protein database choice and low alignment rate in HUMAnN

Hi,

I am totally new to metagenome analysis, with limited Linux and Python experience. I am using HUMAnN 3 to analyze my metagenome and metatranscriptome sequencing data. I have several questions that I need help with:
1. I installed all four DIAMOND databases under the uniref folder (uniref50, uniref90, uniref50-ec-filtered, uniref90-ec-filtered). I found that the HUMAnN pipeline aligns against all four of these databases by default when I run it. I tried to specify the protein database with --protein-database uniref/uniref90_201901b_full.dmnd, but it shows the error 'CRITICAL ERROR: the directory provided for the protein database does not exist'. I also tried the absolute path "/home/humann_dbs/uniref/uniref90_201901b_full.dmnd" and the bare file name "uniref90_201901b_full.dmnd", but both gave the same error. I am confused about the correct way to specify which protein database to align against.
2. If DIAMOND runs against both the UniRef50 and UniRef90 databases, how does that affect my results? I have read the documentation on UniRef50 and UniRef90, and I understand it is a tradeoff between alignment rate and resolution, but I am still curious what happens when both databases are run.
3. After my first HUMAnN run finished, I found that the alignment rate after the nucleotide database search is really low (~93% unaligned reads), while ~55% of reads remain unaligned after the protein database search (all four databases). Did anything go wrong to cause such a low alignment rate in the nucleotide search step? How can I increase the rate at this step?

–Chao

Re: 1) You need to point HUMAnN at a folder that contains your protein database, which may be split over one or more files (splitting is atypical but required for some users, hence the constraint). For example, if you just want to map against the full UniRef90 database, you should put its single file in a folder and point HUMAnN at that location. Right now you are pointing at a location with multiple databases, which HUMAnN interprets as parts of a larger database that you want to search in its entirety.
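For example (using the path from your post; the input and output names here are just placeholders):

    # Give the full UniRef90 file its own folder (move, copy, or symlink it)
    mkdir -p /home/humann_dbs/uniref90_only
    cp /home/humann_dbs/uniref/uniref90_201901b_full.dmnd /home/humann_dbs/uniref90_only/

    # Point --protein-database at the folder, not at the .dmnd file itself
    humann --input sample.fastq --output sample_out \
        --protein-database /home/humann_dbs/uniref90_only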

Re: 2) The results from mapping to multiple databases will be messy. I recommend repeating the mapping against a single database, and I recommend UniRef50 for most samples that are not derived from the human microbiome. You can use the EC-filtered version of the UniRef90 or UniRef50 database to boost performance (at the cost of explaining fewer reads).
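If it's easier, you can also re-download a single database into its own folder with the humann_databases utility, e.g. (the install location here is a placeholder):

    # Fetch only the EC-filtered UniRef90 DIAMOND database into its own folder
    humann_databases --download uniref uniref90_ec_filtered_diamond /home/humann_dbs/uniref90_ec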

Re: 3) This sort of alignment pattern is common if you are working with communities that do not contain many characterized species (at least as of HUMAnN's most recent pangenome database), which is what we map to by nucleotide search. The higher mapping rate after translated search reflects HUMAnN's ability to identify homology to known proteins even when their source species is not known.

Hi Eric,

I really appreciate your reply. For the third question, I looked into the details of the diamond_unaligned.fa file and found that many of the unaligned sequences (3 of the 4 that I picked) are bacterial coding sequences according to blastx. Why does DIAMOND assign these sequences as unaligned? Are there any DIAMOND options that can make it recognize these sequences?

Chao

I tried lowering --translated-query-coverage-threshold to 50 (my reads are ~150 nt), but it made only a minor difference. Do you think lowering --translated-subject-coverage-threshold could help? How low can this argument be set while still producing reliable data?

Chao

How similar were the sequences by blastx? If you were running in the default UniRef90 mode, we only accept alignments at >80% AA identity. If you switch to UniRef50 and the associated mode (>50% AA identity), that might improve mapping. I would avoid dropping the subject coverage threshold unless you have really low read depth, as doing so tends to increase false positives markedly.
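If you want to try that, the sketch below assumes you've placed the UniRef50 DIAMOND database in its own folder (all paths are placeholders). HUMAnN normally infers the mode from the database file name, but --search-mode makes it explicit:

    # Rerun translated search in UniRef50 mode (>50% AA identity)
    humann --input sample.fastq --output sample_out_u50 \
        --protein-database /home/humann_dbs/uniref50_only \
        --search-mode uniref50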

The similarity is really high (>80% identity), and the query coverage is also higher than 80%.

My understanding of --translated-subject-coverage-threshold 50 is that, when reads match a coding gene, even with 100% similarity, if the subject coverage is lower than 50%, DIAMOND will still assign those reads as unaligned. Is this true? Actually, I am confused about whether --translated-subject-coverage-threshold 50 means the subject coverage must be >=50% within a single read's match or across all reads' matches combined.

Does the default ChocoPhlAn database downloaded according to the user manual contain representatives from the murine gut, or does it only contain human-associated ones? Are there any suggestions for the ChocoPhlAn database if I need to analyze mouse microbiota data?

Sorry to ask so many questions in this thread; I am really new to HUMAnN. As you mentioned above, if my reads have low sequencing depth, I could try relaxing --translated-query-coverage-threshold to get enough mapping. As I am working with some specific genes of low abundance in the mouse gut microbiome, their read counts are expected to be low even though my sample has 10M reads. In this context, do you think lowering --translated-query-coverage-threshold is favorable? I ask because I can't find my target genes in the final genefamilies.tsv when I use the default arguments for HUMAnN.

Sorry for the mistake: I meant --translated-subject-coverage-threshold here, not --translated-query-coverage-threshold.

Getting caught up here… If you're seeing strong online blastx hits, then the issue is either 1) that the proteins aren't in our database or 2) that you are getting strong local hits but not good metagenomic coverage of the protein (i.e. subject coverage).

--translated-subject-coverage-threshold = 50 means that, when post-processing hits from DIAMOND, HUMAnN will ignore a hit between a read X and protein Y if protein Y wasn’t covered at 50% of amino acid sites by reads mapped to Y. It is not expected that any single read would cover 50% of a protein (unless the protein was very tiny), hence the idea of this threshold is to boost our confidence that Y is present based on its coverage by multiple reads. It is reasonable to relax this parameter if you expect your proteins to have very low coverage breadth (due to e.g. low sequencing depth). In general, lowering this parameter boosts sensitivity to proteins of interest but reduces specificity (you will “detect” more proteins that are not actually in your sample because they have local homology to a protein that IS in your sample, and a single read can’t tell them apart).
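If you do decide to relax it, it's a single flag on the run; the value below is only illustrative (the default is 50.0), and the paths are placeholders:

    # Hypothetical example: accept proteins covered at >=25% of AA sites
    humann --input sample.fastq --output sample_out \
        --protein-database /home/humann_dbs/uniref90_only \
        --translated-subject-coverage-threshold 25.0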

ChocoPhlAn is not organism-specific; it just has a bias toward microbes that are relevant to human health (since this is what NCBI contains).

Hi Eric,

Thanks so much for clarifying; it helps a lot. I have another question, though I am not sure this is the right place to ask it. Just a quick question: can I concatenate the unmatched reads from kneaddata as input to HUMAnN, given that HUMAnN takes a single input and does not treat reads as paired? I ask because the newest kneaddata has an issue where bowtie2 does not recognize that my data are paired-end and dumps all of the reads into the "_unmatched" output files. My sequencing data are shotgun paired-end reads with lengths of 150 nt.

Yes, we will concatenate the still-paired reads and both sets of orphans as a single input to the bioBakery workflows.
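For example, with kneaddata's usual paired-end output names (adjust to match your actual file names; the sample and output names are placeholders):

    # Combine still-paired reads and both sets of orphans into one input
    cat sample_kneaddata_paired_1.fastq \
        sample_kneaddata_paired_2.fastq \
        sample_kneaddata_unmatched_1.fastq \
        sample_kneaddata_unmatched_2.fastq \
        > sample_kneaddata_cat.fastq

    humann --input sample_kneaddata_cat.fastq --output sample_out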

Thanks a lot for clarifying.