Guidance on UniRef database contents/filtering

Hi there,

I am seeking some clarification on a curious result/behavior I obtained with humann2 (v0.11.2). I am working with a set of metagenomes derived from mouse samples that are infected with C. difficile and sequenced with ~10 million reads each. I have previously isolated the strain from the mice, generated a MAG for the C. difficile strain from these metagenomes, and quantified the strain and one of its effectors (toxin B) by qPCR.

C. difficile appears in the metaphlan_bugs_list.txt as below:

MMSP_H1-1_T1_metaphlan_bugs_list.tsv:k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptostreptococcaceae|g__Peptostreptococcaceae_noname|s__Clostridium_difficile	0.01538
MMSP_H1-1_T7_metaphlan_bugs_list.tsv:k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptostreptococcaceae|g__Peptostreptococcaceae_noname|s__Clostridium_difficile	0.09357
MMSP_H1-2_T21_metaphlan_bugs_list.tsv:k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptostreptococcaceae|g__Peptostreptococcaceae_noname|s__Clostridium_difficile	0.12672
MMSP_H1-2_T7_metaphlan_bugs_list.tsv:k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptostreptococcaceae|g__Peptostreptococcaceae_noname|s__Clostridium_difficile	0.37416
MMSP_H1-3_T21_metaphlan_bugs_list.tsv:k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptostreptococcaceae|g__Peptostreptococcaceae_noname|s__Clostridium_difficile	0.43544
MMSP_H1-3_T7_metaphlan_bugs_list.tsv:k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptostreptococcaceae|g__Peptostreptococcaceae_noname|s__Clostridium_difficile	0.64551
MMSP_H2-1_T7_metaphlan_bugs_list.tsv:k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptostreptococcaceae|g__Peptostreptococcaceae_noname|s__Clostridium_difficile	0.01453

However, the gene family representing the glycosyltransferase toxin B (UniRef90_T3DH50) does not appear in any of the genefamilies.tsv files, but there are most certainly reads mapping to this gene (as I was able to assemble it). There are also other genes and pathways from C. difficile in the outputs. Out of interest, I checked the diamond database file (uniref90_annotated.1.1.dmnd) which was downloaded using the humann2_databases script, and it does not appear that UniRef90_T3DH50 is listed within it. Is there an obvious reason I am missing as to why this family would have been excluded?



If possible, I’d recommend upgrading to MetaPhlAn / HUMAnN 3.0, as I know their databases tend to perform better on mouse communities than the 2.0 versions.

In this specific case, the issue is that UniRef90 centroids move around over time (as more sequences are added to UniProt and the sequences are reclustered). In the 2.0 databases T3DH50 was part of cluster UniRef90_C9XKU6 (as opposed to a representative of its own cluster). Hence, you ought to see UniRef90_C9XKU6 in the diamond database at least, and thus potentially in your unclassified hits.

Note: C9XKU6 is now considered a redundant sequence, but you can find more about its historical annotations here:

Great thanks! Switching to 3.0 is on the to do list, but in the meanwhile it may be worth noting that these are humanized gnotobiotic mice.

I have checked for the new accession and it still doesn’t appear in any of the gene families files. Any thoughts? Is there any form of low-abundance filter for gene families that can be controlled?

The only thing like an abundance filter is that v2.0 requires 50% of sites in a protein to be hit for it to be reported (during translated search only; v3.0 adds this requirement to nucleotide search as well). However you must be above this threshold if you managed to assemble the gene.

Another possibility is that the v2.0 identity threshold (90%) is a bit too stringent here. We’ve softened this to 80% in v3.0. You can see if this would’ve helped by directly looking for alignments to your protein of interest in the DIAMOND output file (assuming you saved it). This would indicate than an alignment was found but rejected on the basis of low identity.

A final option: if you know which reads align to your assembled version of the gene, you could directly look up their identifiers in the HUMAnN mapping output to see what (if anything) we mapped them too.