Customizing Chochophlan panproteome and Metaphlan marker gene databases with new taxa

The problem: I’m currently working on a metagenomics project with a Zymo Spike-In (low-microbial load), which incorporates three taxa. Trupera radiovictrix & Imtechella haloterans are in the Chochophlan/metaphlan databases. The problem child is Allobacillus halotolerans, which is not. Although Allobacillus halotolerans IS in the NCBI taxonomy database (and has been since 2011) it does not seem to be in UniRef 90 (which is likely why it never made it into Chochophlan).

Thus, I have been trying to follow the advice from the Metaphlan Github Issue 103 (link here) for adding a taxa to the Metaphlan database that is NOT in UniRef 90:

Of course this works for genomes included in Uniprot. In case of MAGs,
annotation with DIAMOND/mmseqs2 is an alternative. For annotating MAGs, I
use DIAMOND on the proteins obtained with prokka, using evalue 1, coverage
0.8 and identity percentage 90%, the same thresholds that defines UniRef90

What I have: I have a .fna of all the CDS/protein coding sequencing (nucleotide sequences) from my taxa of interest, which I downloaded from the annotation on NCBI.

My questions:

For Chochophlan database:

  1. Can I just add the fasta of the CDS from my taxa of interest to the chochophlan database of species pan proteomes .fnn.gz, since the pan proteome is non-redundant representation of the species’ protein coding potential? Is there another key somewhere I also need to update? The advice for metaphlan database addition given here makes it seem like it doesn’t matter but should maybe match the header in the metaphlan pickle.

  2. Do I need to change the headers in my .fna to match the headers of the .fnns of the chochophlan data base - example:

my taxa CDS fasta file headers look like:

>lcl|NZ_JAHLZF010000001.1_cds_WP_216686447.1_2 [gene=glnA] [locus_tag=KQ486_RS00010] [protein=type I glutamate–ammonia ligase] [protein_id=WP_216686447.1] [location=765…2099] [gbkey=CDS]

while the .fnn of the chochophlan database look like:


I can add taxonomic information, but do I also need to add some key at the beginning like the “1309__UPI0002B5E709__SMU98_02423”. Also do I need to add UniRef90 and UniRef50 IDS?

For Metaphlan database:
For metaphlan database, I know I need to extract marker genes from the CDS fasta file for Allobacillus halotolerans, and then follow the instructions here for adding the new marker gene to the bowtie2 database and the python pickle. I know that marker genes should be universal within the clade and unique to the clade, but I’m confused as to how to extract marker genes.

  1. Which database should I map my fasta file of CDS from Allobacillus halotolerans against using DIAMOND, following the advice of github issue 103? Should I make a diamond binary database of the fasta file of existing Metaphlan marker genes, and map against that, then select out the sequences that do not map as marker genes? Or against UniRef 90? I tried mapping against the uniref90_201901b_full.dmnd provided with Humann and got 0 hits when using parameters --evalue 1 --id 0.9 --query-cover 0.8.

Hi Freida,

If you have the genomes annotated with Prokka, you could run Roary to generate a pangenome and then extract the core genes.
If you already clustered your CDS at 90%, you can select the cores using your clustering output.

Theoretically, you should map against all the genomes included in MetaPhlAn (You can extract the GCAs from the pickle file) in order to identify if the marker is unique, and, if not, include the list of other species in which the marker can be found. The best situation would be to have all the markers identified to be unique, if you end up with 100 unique CDS is fine.

This is reassuring, probably the CDS you identified are unique, but if I would double check with a mapping against all the genomes.


Yes if you want to use this for HUMAnN

Hey @fbeghini thanks for your response!

I’ve annotated the genomes (although there are only 2 genomes available for this species) with Prokka and ran Roary, extracting core genes.

For the next step, you said I should map against all the genomes in Metaphlan. I extracted the GCAs from the metaphlan pickle file using the code below:

import pickle
import bz2 
from itertools import chain

 db = pickle.load('mpa_v30_CHOCOPhlAn_201901.pkl', 'r'))

#example with empty GCAs

#extract GCAs for bacteria 
my_gcas=[i[1]['ext'] for i in list(db['markers'].items()) if 'Bacteria' in i[1]['taxon']]
#unfold the nested lists
len(my_gcas) # 3,012,893

#but these are from the marker database, and a genome can have >1 marker 
len(set(my_gcas)) #24,212

#get unique gcas 

I noticed there are some entries in the markers database without GCAs. Is this ok?

I’m also concerned because I get 24,212 unique GCAs. Does that sound right? It doesn’t match whats on the metaphlan page here (~13,500 bacterial genomes).

If that is right, I should next download all of those GCAs from NCBI

  • Should these be the assembly fasta files or the CDS from genomic fasta files?
  • Once I have all the genomes from metaphlan as fasta files, can I just concatenate them and make a single diamond database?


Yes! Those are unique markers, so no other species in the database share the same sequence.

13,500 is not correct, those are the number of genomes considered in MetaPhlAn2, probably the website is not updated. MetaPhlAn uses 97.9k genomes, the one that are considered for marker extraction in MetaPhlAn are 28,784 (total genomes, including also archaea, viruses, and eukaryotes. The number is lower due to removal of multiple genomes per species).
You can get the full list of genomes by running

my_gcas=[i.split('|')[-1][3:] for i in db['taxonomy']]

This is safer since you get also genomes that are not listed in the ['ext'] field.

The assemblies are ok

I used bowtie2 with command line -f mpa_markers_150bpsplitted_fna -x bin_idx -a --very-sensitive --no-unal

To clarify this:

I used bowtie2 with command line -f mpa_markers_150bpsplitted_fna -x bin_idx -a --very-sensitive --no-unal

Is bin_idx a bowtie2 database made from all of the assembly fasta files of the metaphlan GCAs ( i.e cat ./*.fasta > genomes.fasta then bowtie2-build genomes.fna bin_idx?)

Then the mpa_markers_150bpsplitted_fna will for me be the fasta file of the core pangenome genes from the species I want to add to the metaphlan database?

Can you explain why you use the option --no-unal? I thought I would want the reads from my core genes that don’t align to the genomes from metaphlan, since those will be the ones that are unique to the species I want to add. Then I could pull out those reads with samtools, and covert to a fasta file, and then add those new markers to the metaphlan database.

Yes, I would suggest to have a maximum of 1000 genomes per index to avoid the creation of maxi-indexes

That’s right.

That’s another possible approach, I used the --no-unal to remove unwanted alignment since I wanted to (i) refine the uniqueness by including or removing external species; (ii) slim down the ouput SAM file since it could be enormous.
If you are lucky, you could get enough genes that not map elsewhere so the --no-unal is useful, otherwise you need to keep the one having the minimum number of other species that shares the gene.

Hey Francesco,

Tagging @neal78 as also expressed interest.

I got this running to the point that Metaphlan reports A. halotolerans in my samples but had two caveats that I ran into.

  1. The accessions for viruses in the Metaphlan pickle did not seem to match up with anything in the NCBI database. I cross-references both the current and the historical ncbi lists from the FTP site, and I tried just googling some of the viral accessions without success. Do you have any possible explanation for this? In the end I just mapped my conserved CDS against the bacterial genomes in Metaphlan, since I was unable to download viral genomes from the accessions provided in the pickle file. I’m sure this is not best practice. But it ended up working for me.

  2. I had to adjust the quantile for the robust averaging procedure down to 0.05 from 0.2 in order to detect A. halotolerans in some of my samples. I think this is because of two things 1) A. halotolerans is very low absolute abundance in the spike-in (0.0029 ng in the 20 microL of the prep), 2) I ended up with many marker genes for A. halotolerans that were added to the Metaphlan database (more on that below in Note). Thus, while the A. halotolerans marker genes were in my intermediary bowtie2 file output, A. halotolerans would be reported as 0% abundance because the top 20% most abundant marker genes would be excluded. (this matches my experience with the positive control samples, in which Metaphlan did not report any Cryptococcus neoformans, which was present in the positive control at a low abundance when q=0.2 but does report it at q=0.05).

Note: I ended up with many marker genes for A. halotolerans. Many genes were considered conserved, as there are only two available genomes (2677/2791 genes). Additionally, many of the genes from A. halotolerans did not map to any other bacterial genomes (2634/2677) (this might be because A. halotolerans is the only member of its genus and is a weird shrimp-paste bug)

Do you think these caveats are ok? Here is a link to an html file showing the benchmarking of the quantile value in the positive control and in a single low abundance sample.

@blostein Hi Freda,
I am running into the same issue. Did you get any success?

Yeah, I was able to add the Allobacillus halotolerans to the Metaphlan database, with the caveats above about mapping conserved CDS against only bacterial genomes and having to lower the robust averaging quantile from 0.2 to 0.05. Are you trying to add the same Allobacillus halotolerans or a different bug? I can send you my updated database in the first case. In the second case, I have some of the code from when I did it up on a github I can add you to (although it’s still kind of messy).

Hi Freda,
I am adding the same Allobacillus halotolerans. If you can send me the updated database that would be great. Will there be any issue with the size? My mail id is (if needed). Or if the database or codes will be on Github, thaht will also work. Thanks again

With Regards,