Downloading metaphlan4 SGB genomes

Hi all, Can you please suggest a method to download all the SGB genomes used to construct the metaphlan4 database? If I understand correctly, the links on the Segata Lab website (Segata Lab - Computational Metagenomics) only contain a subset of the SGB genomes used, but not all of them.
I was considering using the dataset download tool available on NCBI to download the genomes via accession ids (provided in the supplementary table) but is there an easier/faster way? Thanks.
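For reference, the accession-based route could look roughly like the sketch below. It assumes the supplementary table has been exported as plain text and that the NCBI `datasets` command-line tool is installed and on the PATH. The file names and the helper names (`extract_accessions`, `build_datasets_command`, `download`) are made up for illustration, and the flags should be checked against `datasets download genome accession --help` for the installed version.

```python
import re
import subprocess
from pathlib import Path

# Matches GenBank (GCA_) and RefSeq (GCF_) assembly accessions, with an optional version.
ACCESSION_RE = re.compile(r"\bGC[AF]_\d{9}(?:\.\d+)?\b")


def extract_accessions(text):
    """Return unique assembly accessions in order of first appearance."""
    seen = {}
    for accession in ACCESSION_RE.findall(text):
        seen.setdefault(accession, None)
    return list(seen)


def build_datasets_command(accession_file, out_zip="sgb_genomes.zip"):
    """Build the argument list for one batch download of genome FASTA files."""
    return [
        "datasets", "download", "genome", "accession",
        "--inputfile", str(accession_file),
        "--include", "genome",
        "--filename", str(out_zip),
    ]


def download(table_path, accession_file="accessions.txt"):
    """Write one accession per line, then call the datasets CLI."""
    accessions = extract_accessions(Path(table_path).read_text())
    Path(accession_file).write_text("\n".join(accessions) + "\n")
    subprocess.run(build_datasets_command(accession_file), check=True)
```

Calling `download("supplementary_table.tsv")` (a hypothetical file name) would write `accessions.txt` and request a single zip archive. Accessions that have been suppressed on NCBI may not be returned.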

Hi @anashank97

Unfortunately, not all the sequences of the MAGs used to build the MetaPhlAn 4 database are available yet. The only MAGs and references available are those from the original SGB publication you linked, but we are still working on a platform to make all the MAGs publicly available.

Thank you so much for the info

Hi,
I am also looking to download the SGB representative genomes for the vOct22_202403 SGB release as I’m using MetaPhlAn4.1 and HUMAnN4.0.0.alpha. I’ve read in another post (MAG sequences used in MetaPhlAn 4 data base - Microbial community profiling / MetaPhlAn - The bioBakery help forum) that there are plans for sharing the genomes in future CHOCOPhlAn(SGB) releases, but looks like this is still underway.

In the meantime, I’ve been trying to work backwards by mapping the SGB_IDs to NCBI taxonomy IDs to help identify the genomes for downloading. I found the file http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/SGB.Oct22.tar, which contains SGB.Oct22.txt.bz2 with the SGB_ID-to-NCBI-taxonomy-ID assignments, but I’ve noticed some inconsistencies.

For example, in SGB.Oct22.txt.bz2, the SGB_ID 15286 is assigned to taxID 2086273:

  • k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Subdoligranulum|s__Subdoligranulum_sp_APC924_74|t__SGB1528

but in mpa_vOct22_CHOCOPhlAnSGB_202403_species.txt.bz2, SGB 15286 is assigned to

  • k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Candidatus_Cibionibacter|s__Candidatus_Cibionibacter_quicibialis

I understand that taxonomies evolve between releases, but since the mpa_vOct22 file doesn’t include taxIDs, I cannot confirm what has changed. I searched for “Candidatus Cibionibacter quicibialis” on NCBI Taxonomy but got no results; the closest match was Candidatus Cibiobacter qucibialis (taxID: 2500537).
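To see exactly which ranks differ between the two files for a given SGB, the lineage strings can be split on `|` and compared rank by rank. A minimal sketch using the two strings quoted above, copied as they appear in this post; the function names are illustrative only:

```python
RANK_ORDER = "kpcofgst"  # kingdom, phylum, class, order, family, genus, species, SGB


def parse_lineage(lineage):
    """Map each rank prefix (k, p, c, ...) to its name."""
    ranks = {}
    for field in lineage.split("|"):
        prefix, _, name = field.partition("__")
        ranks[prefix] = name
    return ranks


def differing_ranks(lineage_a, lineage_b):
    """Return {rank: (name_a, name_b)} for every rank where the lineages disagree."""
    a, b = parse_lineage(lineage_a), parse_lineage(lineage_b)
    return {
        rank: (a.get(rank), b.get(rank))
        for rank in RANK_ORDER
        if a.get(rank) != b.get(rank)
    }


phylophlan = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|"
    "f__Oscillospiraceae|g__Subdoligranulum|"
    "s__Subdoligranulum_sp_APC924_74|t__SGB1528"
)
metaphlan = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|"
    "f__Ruminococcaceae|g__Candidatus_Cibionibacter|"
    "s__Candidatus_Cibionibacter_quicibialis"
)

for rank, (left, right) in differing_ranks(phylophlan, metaphlan).items():
    print(rank, left, right, sep="\t")
```

For this pair, the labels agree down to class and diverge from order onwards. A missing rank on one side (here the `t__` field) is reported as `None`.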

My questions

  1. Does the PhyloPhlAn/SGB.Oct22.tar file contain the same SGB features that were used as input to generate the MetaPhlAn species marker reference database and the CHOCOPhlAn pan-genomes database used by HUMAnN 4?
  2. The modified date for PhyloPhlAn/SGB.Oct22. was 2024-01-17, so I assumed that it should correspond with the updated mpa_vOct22_202403 release, maybe this is wrong?
  3. Is the “202403” in the mpa_v* filenames referring to the processing date? I’m getting confused between the two dates in the filenames (vOct22)_(202403).
  4. Can you please let me know what NCBI taxonomy release was used in the mpa_vOct22_202403 related database files?

Much appreciated!!

Hi @xyc

I confirm that we still don’t have a public repository with SGB representatives, unfortunately. “Candidatus Cibiobacter qucibialis” is a specific case: it is a MAG [https://www.ncbi.nlm.nih.gov/nuccore/SAUS00000000] that our group characterized in this work (Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle - PMC), so we manually curated its taxonomy in MetaPhlAn to be s__Candidatus_Cibionibacter_quicibialis; it was later renamed Candidatus Cibiobacter qucibialis, as you saw.
In PhyloPhlAn, it looks like we instead used the taxonomy of the reference genome that is within the same SGB.

In general, a single SGB can contain multiple isolate genomes, so we assign the SGB its taxonomy by majority voting. I suggest using this file as the reference for each SGB’s taxonomy and its alternative taxonomies: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202403_species.txt.bz2
As a general rule, if the species name contains SGBx, it is an unknown SGB (no reference available on NCBI), while for the others you should be able to download a genome of that species. Be aware, however, that the same taxonomy may be assigned to different SGBs, so downloading any genome with a given taxonomy does not necessarily give you the genome that was used to build the SGB.
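The rule of thumb above can be sketched in a few lines. This assumes the species file has already been read into `(sgb_id, lineage)` pairs; the exact column layout of the file is not assumed here and should be checked first. The function names are hypothetical.

```python
import re
from collections import defaultdict

# "SGBx" in the species name, e.g. a name ending in SGB followed by digits.
SGB_IN_NAME = re.compile(r"SGB\d+")


def species_of(lineage):
    """Return the species name (without the s__ prefix), or None if absent."""
    for field in lineage.split("|"):
        if field.startswith("s__"):
            return field[3:]
    return None


def is_unknown_sgb(lineage):
    """True if the species name itself contains an SGB identifier."""
    species = species_of(lineage)
    return species is not None and SGB_IN_NAME.search(species) is not None


def species_shared_by_multiple_sgbs(records):
    """Group known species by name and keep those assigned to more than one SGB."""
    by_species = defaultdict(list)
    for sgb_id, lineage in records:
        species = species_of(lineage)
        if species is not None and not is_unknown_sgb(lineage):
            by_species[species].append(sgb_id)
    return {sp: ids for sp, ids in by_species.items() if len(ids) > 1}
```

Species returned by `species_shared_by_multiple_sgbs` are the cases where picking an arbitrary NCBI genome of that species is most likely to miss the genome actually used for a given SGB.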

To reply to your questions specifically

  1. Yes
  2. Yes, the data we used corresponds. There may be some taxonomic label inconsistencies in the case of manually curated taxonomies, but the data is the same.
  3. Yes, the second date is the processing date. When there are multiple versions, it is because we may have changed some minor details in the database, but the genomes used (vOct22) are still the same.
  4. Difficult to tell exactly. Every now and then we download new genomes from UniRef in batch and add them to the database. For the vOct22 database we downloaded reference genomes around January 2021, so it’s very possible that there have been taxonomy updates.
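The naming convention described in point 3 can be made explicit with a small parser. This is only a sketch based on the file names mentioned in this thread; the pattern and the field names `genome_set` and `processed` are labels chosen for illustration, not something defined by MetaPhlAn.

```python
import re

# mpa_v<genome set>_CHOCOPhlAnSGB_<processing date as YYYYMM>...
MPA_NAME = re.compile(
    r"^mpa_v(?P<genome_set>[A-Za-z]+\d+)_CHOCOPhlAnSGB_(?P<processed>\d{6})"
)


def parse_mpa_name(filename):
    """Split a database file name into genome-set version and processing date."""
    match = MPA_NAME.match(filename)
    return match.groupdict() if match else None


print(parse_mpa_name("mpa_vOct22_CHOCOPhlAnSGB_202403_species.txt.bz2"))
```

Two files that share the same `genome_set` but differ in `processed` would, per the answer above, be built from the same genomes with only minor database changes.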

Sorry if this is somewhat confusing. If it helps, I can send a list of SGB-to-GCA IDs, at least for the SGBs whose genome was downloaded from NCBI (if it was not suppressed).

Best,
Claudia

Hi @Claudia_Mengoni,
Thanks very much for the detailed reply!
Yes, it would be most helpful if you can send the SGB-to-GCA ids.
Much appreciated.