Virulence Factor Identification

I am new to microbiome research and have used HUMAnN 3 and MetaPhlAn 3 so far. Another researcher who focuses on microbiome research recommended doing a virulence gene analysis. Is there anything in bioBakery for that purpose? For example, Fusobacterium nucleatum is a commensal oral microbe, but a recent journal article reported that its FadA gene is associated with cancer progression. Could such a finding be easily analysed with some bioBakery tool that I am not aware of? I find that the metabolic pathways are not really what’s needed for a cancer vs. normal tissue comparison.

HUMAnN is our general-purpose functional profiling tool. If you are able to associate your genes of interest with UniRef identifiers, you can look them up directly in the genefamilies.tsv output for each sample. Alternatively, if the genes are associated with broader Pfam/KO/eggNOG categories, you can regroup UniRef abundances to those systems and look up the corresponding identifiers.
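As a rough illustration of the direct lookup, here is a minimal Python sketch that pulls rows for a set of UniRef identifiers out of a gene families table. It assumes the usual tab-separated layout (a header line starting with "#", community totals as plain UniRef90 IDs, and per-species rows after a "|"), and it only reads the first sample column. The identifiers UniRef90_HYPOTHETICAL1 and UniRef90_HYPOTHETICAL2, the sample name, and the abundances are made up for the example; you would substitute the UniRef90 IDs you have mapped your genes of interest to.

```python
import csv
import io


def lookup_gene_families(handle, target_ids):
    """Return {feature: abundance} for rows whose gene family is in target_ids.

    Matches both community-total rows and species-stratified rows
    (feature names of the form "<family>|<taxon>").
    """
    targets = set(target_ids)
    hits = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank lines and the header
        feature = row[0]
        family = feature.split("|", 1)[0]   # drop the taxon stratification
        family = family.split(":", 1)[0]    # drop a name appended after ":" if present
        if family in targets:
            hits[feature] = float(row[1])   # first sample column only
    return hits


# Made-up table contents standing in for a real genefamilies.tsv file.
example = (
    "# Gene Family\tsampleA_Abundance-RPKs\n"
    "UNMAPPED\t1000.0\n"
    "UniRef90_HYPOTHETICAL1\t12.5\n"
    "UniRef90_HYPOTHETICAL1|g__Fusobacterium.s__Fusobacterium_nucleatum\t10.0\n"
    "UniRef90_HYPOTHETICAL1|unclassified\t2.5\n"
    "UniRef90_HYPOTHETICAL2\t7.0\n"
)

hits = lookup_gene_families(io.StringIO(example), {"UniRef90_HYPOTHETICAL1"})
for feature, abundance in hits.items():
    print(feature, abundance)
```

For a real run you would pass an open file handle instead of the StringIO object. The stratified rows are what tell you which species is contributing the gene in each sample.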

We also offer ShortBRED as an approach to targeted functional profiling. There, you start with a small set of gene sequences of interest and identify peptide-level markers that are conserved within them but rare in other proteins. These markers can then be used for highly specific, accelerated functional profiling. Compared with HUMAnN, the ShortBRED approach provides more confident presence/absence calls for specific gene families, but requires you to pre-specify those families and do some “indexing” on them (to identify their peptide markers) before analyzing your sample.
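To make the two-step workflow concrete, here is a small Python sketch that only assembles the two command lines (it does not run anything). The script names and flags (--goi, --ref, --markers, --tmp, --wgs, --results) are as I recall them from the ShortBRED README, so please confirm them against each script's --help output for your installed version. All file names are placeholders.

```python
import shlex


def shortbred_commands(goi_faa, ref_faa, markers_faa, reads, results_txt):
    """Build (identify, quantify) argument lists for a ShortBRED run.

    Flags are assumed from the ShortBRED README; verify with --help.
    """
    # Step 1 ("indexing"): find peptide markers for the proteins of interest
    # that are rare in the comprehensive reference set.
    identify = [
        "shortbred_identify.py",
        "--goi", goi_faa,
        "--ref", ref_faa,
        "--markers", markers_faa,
        "--tmp", "tmp_identify",
    ]
    # Step 2: profile a metagenome against those markers.
    quantify = [
        "shortbred_quantify.py",
        "--markers", markers_faa,
        "--wgs", reads,
        "--results", results_txt,
        "--tmp", "tmp_quantify",
    ]
    return identify, quantify


# Placeholder file names, for illustration only.
identify, quantify = shortbred_commands(
    goi_faa="virulence_proteins.faa",
    ref_faa="uniref90.fasta",
    markers_faa="virulence_markers.faa",
    reads="sampleA.fasta",
    results_txt="sampleA_shortbred.txt",
)
print(shlex.join(identify))
print(shlex.join(quantify))
```

The identify step is run once per set of proteins of interest. The resulting markers file is then reused in the quantify step for every sample.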

Hope this helps!

ShortBRED sounds great for my use case. From its journal article in 2015:

ShortBRED-Identify takes two inputs: (i) a FASTA file of proteins of interest and (ii) a comprehensive catalog of reference protein sequences (as a FASTA file or preformatted BLAST database). As of this writing, IMG is no longer available for download, and we recommend using UniRef100 or UniRef90 as alternative comprehensive protein reference datasets.

Can you provide modern-day recommendations for a good reference database?

UniRef90 is still around and remains our go-to for a non-redundant representation of the known protein universe.