Downloading metaphlan4 SGB genomes

anashank97 · October 23, 2024, 2:50pm

Hi all, can you please suggest a method to download all the SGB genomes used to construct the metaphlan4 database? If I understand correctly, the links on the Segata Lab website (Segata Lab - Computational Metagenomics) only contain a subset of the SGB genomes used, not all of them.
I was considering using the dataset download tool available on NCBI to download the genomes via accession IDs (provided in the supplementary table), but is there an easier/faster way? Thanks.
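
For readers taking the accession-based route mentioned above, here is a minimal sketch of a batch download with NCBI's `datasets` command-line tool, driven from Python. It assumes the tool is installed and on PATH and that the accessions have been saved one per line; the file names `accessions.txt` and `sgb_genomes.zip` are placeholders, not anything provided by the MetaPhlAn team. Check `datasets download genome accession --help` for the options your installed version supports.

```python
import shutil
import subprocess
from pathlib import Path


def build_datasets_command(accession_file, out_zip="sgb_genomes.zip"):
    """Return the argv list for one batch download with NCBI's `datasets` CLI."""
    return [
        "datasets", "download", "genome", "accession",
        "--inputfile", str(accession_file),  # one GCA_/GCF_ accession per line
        "--include", "genome",               # genomic FASTA only
        "--filename", str(out_zip),          # name of the resulting zip archive
    ]


def download_genomes(accession_file, out_zip="sgb_genomes.zip"):
    """Run the download, failing early if the tool or the input file is missing."""
    if shutil.which("datasets") is None:
        raise RuntimeError("NCBI `datasets` CLI not found on PATH")
    if not Path(accession_file).is_file():
        raise FileNotFoundError(accession_file)
    subprocess.run(build_datasets_command(accession_file, out_zip), check=True)


if __name__ == "__main__":
    # Only print the command here, so running the sketch has no side effects.
    print(" ".join(build_datasets_command("accessions.txt")))
```

Calling `download_genomes("accessions.txt")` would perform the actual download. Accessions that NCBI has suppressed may not be retrievable this way.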

Claudia_Mengoni · November 8, 2024, 4:11pm

Hi @anashank97

Unfortunately, we cannot currently provide all the sequences of the MAGs used to build the metaphlan4 database. The only MAGs and references available are those from the original SGB publication you linked, but we are still working on a platform to make all the MAGs publicly available.

anashank97 · December 4, 2024, 7:09pm

Thank you so much for the info

xyc · June 27, 2025, 5:27am

Hi,
I am also looking to download the SGB representative genomes for the vOct22_202403 SGB release, as I’m using MetaPhlAn4.1 and HUMAnN4.0.0.alpha. I’ve read in another post (MAG sequences used in MetaPhlAn 4 data base - Microbial community profiling / MetaPhlAn - The bioBakery help forum) that there are plans for sharing the genomes in future CHOCOPhlAn(SGB) releases, but it looks like this is still underway.

In the meantime, I’ve been trying to work backwards by mapping the SGB_IDs to NCBI taxonomy IDs to help identify the genomes for downloading. I found the file http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/SGB.Oct22.tar, which contains SGB.Oct22.txt.bz2 with the mapping from SGB_IDs to assigned NCBI taxonomy IDs, but I’ve noticed some inconsistencies.

For example, in SGB.Oct22.txt.bz2, the SGB_ID 15286 is assigned to taxID 2086273:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Subdoligranulum|s__Subdoligranulum_sp_APC924_74|t__SGB1528

but in the mpa_vOct22_CHOCOPhlAnSGB_202403_species.txt.bz2, this SGB 15286 is assigned to

k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Candidatus_Cibionibacter|s__Candidatus_Cibionibacter_quicibialis

I understand that the taxonomies evolve between releases, but since the mpa_vOct22 file doesn’t have the taxIDs, I cannot confirm what has changed. I searched for “Candidatus Cibionibacter quicibialis” on NCBI Taxonomy but got no results; the closest match was Candidatus Cibiobacter qucibialis (taxID: 2500537).
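
To see exactly which ranks differ between the two labels quoted above, one option is to split each lineage on `|` and compare rank by rank. The following is a small sketch, not part of any MetaPhlAn or PhyloPhlAn tooling; the function names are made up for illustration, and the trailing `t__` field of the first lineage is left out because the second lineage has no counterpart for it.

```python
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "t": "sgb",
}


def parse_lineage(lineage):
    """Split 'k__X|p__Y|...' into a {rank_name: taxon_name} dict."""
    ranks = {}
    for field in lineage.strip().split("|"):
        prefix, sep, name = field.partition("__")
        if not sep or prefix not in RANK_PREFIXES:
            raise ValueError(f"unexpected lineage field: {field!r}")
        ranks[RANK_PREFIXES[prefix]] = name
    return ranks


def differing_ranks(lineage_a, lineage_b):
    """Return {rank: (name_a, name_b)} for ranks present in both but named differently."""
    a, b = parse_lineage(lineage_a), parse_lineage(lineage_b)
    return {rank: (a[rank], b[rank]) for rank in a if rank in b and a[rank] != b[rank]}


# The two labels for SGB 15286 quoted in this post.
phylophlan_label = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|"
    "f__Oscillospiraceae|g__Subdoligranulum|s__Subdoligranulum_sp_APC924_74"
)
metaphlan_label = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|"
    "f__Ruminococcaceae|g__Candidatus_Cibionibacter|"
    "s__Candidatus_Cibionibacter_quicibialis"
)

for rank, (old, new) in differing_ranks(phylophlan_label, metaphlan_label).items():
    print(f"{rank}: {old} -> {new}")
```

For this pair the labels agree down to class and diverge from order onward.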

My questions

1. Is the PhyloPhlAn/SGB.Oct22.tar file the same set of SGBs that was used as input to generate the MetaPhlAn species marker reference database and also the CHOCOPhlAn pan-genomes database used by HUMAnN4?
2. The modified date for PhyloPhlAn/SGB.Oct22. was 2024-01-17, so I assumed that it should correspond to the updated mpa_vOct22_202403 release. Maybe this is wrong?
3. Is the “202403” in the mpa_v* filenames referring to the processing date? I’m getting confused between the two dates in the filenames (vOct22)_(202403).
4. Can you please let me know which NCBI taxonomy release was used in the mpa_vOct22_202403 related database files?

Much appreciated!!

Claudia_Mengoni · June 30, 2025, 2:28pm

Hi @xyc

I confirm we still don’t have a public repository with SGB representatives, unfortunately. “Candidatus Cibiobacter qucibialis” is a special case: it is a MAG [https://www.ncbi.nlm.nih.gov/nuccore/SAUS00000000] that our group characterized in this work (Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle - PMC), so we manually curated its taxonomy in MetaPhlAn to be s__Candidatus_Cibionibacter_quicibialis. It was later renamed Candidatus Cibiobacter qucibialis, as you saw.
In PhyloPhlAn, it looks like we instead used the taxonomy of the reference genome that is within the same SGB.

In general, an SGB can contain multiple isolate genomes, so we assign a taxonomy to the SGB by majority voting. I suggest using this file as the reference for the taxonomy of each SGB and its alternative taxonomies: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202403_species.txt.bz2
As a general rule, if the species name contains SGBx, it is an unknown SGB (no reference available on NCBI), while for the others you should be able to download a genome of that species. However, be aware that the same taxonomy may be assigned to different SGBs, so downloading any genome with a given taxonomy does not necessarily give you the same genome that was used in building the SGB.
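
The rule of thumb and the caveat above can be sketched as two small checks. This is an illustration only: the function names are made up, the check is meant to be applied to the species field alone (a full lineage with a trailing `t__SGB...` field would always match), and apart from the label quoted earlier in this thread, the IDs and labels in the toy mapping are invented placeholders rather than entries from the real database.

```python
import re
from collections import defaultdict

SGB_IN_NAME = re.compile(r"SGB\d+")


def is_unknown_sgb(species_label):
    """True when the species label itself embeds an SGB number."""
    return SGB_IN_NAME.search(species_label) is not None


def shared_taxonomies(sgb_to_species):
    """Return {species_label: [sgb_ids]} for labels assigned to more than one SGB."""
    groups = defaultdict(list)
    for sgb_id, label in sgb_to_species.items():
        groups[label].append(sgb_id)
    return {label: sorted(ids) for label, ids in groups.items() if len(ids) > 1}


# Toy mapping; only the first label comes from this thread.
toy = {
    "SGB15286": "s__Candidatus_Cibionibacter_quicibialis",
    "SGB_A": "s__Example_species",   # placeholder
    "SGB_B": "s__Example_species",   # placeholder sharing the same label
    "SGB_C": "s__GGB0000_SGB0000",   # placeholder shaped like an unknown SGB
}

for sgb_id, label in toy.items():
    status = "unknown SGB" if is_unknown_sgb(label) else "named species"
    print(sgb_id, label, status)
print(shared_taxonomies(toy))
```

Any label returned by `shared_taxonomies` is one where a species-level download cannot tell you which SGB the genome belongs to.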

To reply to your questions specifically:

1. Yes.
2. Yes, the data we used corresponds. There may be some taxonomic label inconsistencies in the case of manually curated taxonomies, but the data is the same.
3. Yes, the second date is the processing date. When there are multiple versions, it is because we may have changed some minor details in the database, but the genomes used (vOct22) are still the same.
4. It is difficult to tell exactly. Every now and then we download new genomes from UniRef in batch and add them to the database. For the vOct22 database we downloaded reference genomes around January 2021, so it is very possible that there have been taxonomy updates.
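
The naming convention described above (genome set first, processing date second) can be read off a database name with a small regular expression. This is a sketch that only assumes the pattern seen in the two names mentioned in this thread; `parse_db_name` is a made-up helper, not a MetaPhlAn function.

```python
import re

DB_NAME = re.compile(
    r"^mpa_v(?P<genome_set>[A-Z][a-z]{2}\d{2})_CHOCOPhlAnSGB_(?P<processed>\d{6})"
)


def parse_db_name(name):
    """Return (genome_set, processing_date) from a name like mpa_vOct22_CHOCOPhlAnSGB_202403."""
    match = DB_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised database name: {name!r}")
    processed = match["processed"]
    # The six digits are read as YYYYMM.
    return match["genome_set"], f"{processed[:4]}-{processed[4:]}"


for name in (
    "mpa_vOct22_CHOCOPhlAnSGB_202403_species.txt.bz2",
    "mpa_vJan25_CHOCOPhlAnSGB_202503",
):
    genome_set, processed = parse_db_name(name)
    print(f"{name}: genome set {genome_set}, processed {processed}")
```

Two files that share the genome set but differ in the processing date would, per the answer above, be built from the same genomes.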

Sorry if this is somewhat confusing. If it helps, I can send a list of SGB-to-GCA IDs, at least for the SGBs whose genome was downloaded from NCBI (if it was not suppressed).
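
Once such a list is available, it could be turned into an accession file for a batch download along these lines. The layout of the list is not specified in this thread, so the sketch assumes a hypothetical tab-separated format with the SGB ID in the first column and the assembly accession in the second; the accession in the example is an invented placeholder.

```python
import re

# GenBank/RefSeq assembly accessions: GCA_ or GCF_, nine digits, optional version.
ASSEMBLY_ACCESSION = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def read_sgb_to_gca(lines):
    """Parse 'SGB_ID<TAB>accession' lines, skipping blanks, comments and malformed rows."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2 or not ASSEMBLY_ACCESSION.match(fields[1]):
            continue
        mapping[fields[0]] = fields[1]
    return mapping


def accession_list(mapping):
    """Return unique accessions, one per line, ready to save as a download input file."""
    return "\n".join(sorted(set(mapping.values()))) + "\n"


example = [
    "#sgb_id\taccession",
    "SGB_A\tGCA_000000000.1",   # placeholder accession
    "SGB_B\tnot_available",     # rows without a valid accession are dropped
]
mapping = read_sgb_to_gca(example)
print(accession_list(mapping), end="")
```

Keeping the SGB-to-accession dictionary around makes it possible to rename or group the downloaded genomes by SGB afterwards.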

Best,
Claudia

xyc · July 3, 2025, 5:21am

Hi @Claudia_Mengoni,
Thanks very much for the detailed reply!
Yes, it would be most helpful if you can send the SGB-to-GCA ids.
Much appreciated.

Joshua_Mu · September 9, 2025, 12:06pm

Hi, Claudia

I’m very glad to see your comment. Recently, I have had a similar need and hope to obtain the genome sequences of these SGBs. Although these sequences have not been publicly released yet, providing the corresponding list of SGB-to-GCA ids might be the most effective way to access this information at present. I look forward to your sharing of the SGB-to-GCA ids list.

Much appreciated!
Joshua

Claudia_Mengoni · September 9, 2025, 12:11pm

Hi @Joshua_Mu

are you interested in the same version of the database? (vOct22)

Joshua_Mu · September 9, 2025, 12:30pm

Hi @Claudia_Mengoni
Thank you very much for your response.

Yes, our current analysis is based on the mpa_vOct22_CHOCOPhlAnSGB_202403 version.

Meanwhile, I have noticed the recent update, which includes a substantial number of new SGBs. If possible, I would greatly appreciate it if you could also provide the SGB-to-GCA ID list corresponding to the mpa_vJan25_CHOCOPhlAnSGB_202503 version.

Thank you very much for your kind assistance, and I look forward to your support.

Best regards,
Joshua
