The bioBakery help forum

Customizing Chochophlan panproteome and Metaphlan marker gene databases with new taxa

The problem: I’m currently working on a metagenomics project with a Zymo Spike-In (low-microbial load), which incorporates three taxa. Trupera radiovictrix & Imtechella haloterans are in the Chochophlan/metaphlan databases. The problem child is Allobacillus halotolerans, which is not. Although Allobacillus halotolerans IS in the NCBI taxonomy database (and has been since 2011) it does not seem to be in UniRef 90 (which is likely why it never made it into Chochophlan).

Thus, I have been trying to follow the advice from the Metaphlan Github Issue 103 (link here) for adding a taxa to the Metaphlan database that is NOT in UniRef 90:

Of course this works for genomes included in Uniprot. In case of MAGs,
annotation with DIAMOND/mmseqs2 is an alternative. For annotating MAGs, I
use DIAMOND on the proteins obtained with prokka, using evalue 1, coverage
0.8 and identity percentage 90%, the same thresholds that defines UniRef90

What I have: I have a .fna of all the CDS/protein coding sequencing (nucleotide sequences) from my taxa of interest, which I downloaded from the annotation on NCBI.

My questions:

For Chochophlan database:

  1. Can I just add the fasta of the CDS from my taxa of interest to the chochophlan database of species pan proteomes .fnn.gz, since the pan proteome is non-redundant representation of the species’ protein coding potential? Is there another key somewhere I also need to update? The advice for metaphlan database addition given here makes it seem like it doesn’t matter but should maybe match the header in the metaphlan pickle.

  2. Do I need to change the headers in my .fna to match the headers of the .fnns of the chochophlan data base - example:

my taxa CDS fasta file headers look like:

>lcl|NZ_JAHLZF010000001.1_cds_WP_216686447.1_2 [gene=glnA] [locus_tag=KQ486_RS00010] [protein=type I glutamate–ammonia ligase] [protein_id=WP_216686447.1] [location=765…2099] [gbkey=CDS]

while the .fnn of the chochophlan database look like:


I can add taxonomic information, but do I also need to add some key at the beginning like the “1309__UPI0002B5E709__SMU98_02423”. Also do I need to add UniRef90 and UniRef50 IDS?

For Metaphlan database:
For metaphlan database, I know I need to extract marker genes from the CDS fasta file for Allobacillus halotolerans, and then follow the instructions here for adding the new marker gene to the bowtie2 database and the python pickle. I know that marker genes should be universal within the clade and unique to the clade, but I’m confused as to how to extract marker genes.

  1. Which database should I map my fasta file of CDS from Allobacillus halotolerans against using DIAMOND, following the advice of github issue 103? Should I make a diamond binary database of the fasta file of existing Metaphlan marker genes, and map against that, then select out the sequences that do not map as marker genes? Or against UniRef 90? I tried mapping against the uniref90_201901b_full.dmnd provided with Humann and got 0 hits when using parameters --evalue 1 --id 0.9 --query-cover 0.8.

Hi Freida,

If you have the genomes annotated with Prokka, you could run Roary to generate a pangenome and then extract the core genes.
If you already clustered your CDS at 90%, you can select the cores using your clustering output.

Theoretically, you should map against all the genomes included in MetaPhlAn (You can extract the GCAs from the pickle file) in order to identify if the marker is unique, and, if not, include the list of other species in which the marker can be found. The best situation would be to have all the markers identified to be unique, if you end up with 100 unique CDS is fine.

This is reassuring, probably the CDS you identified are unique, but if I would double check with a mapping against all the genomes.


Yes if you want to use this for HUMAnN