Presence of fungi in ChocoPhlAn vJan25

Hi everyone,

In the MetaPhlAn 4 paper (based on the vJun23 database version), the authors mention that “the current methods do not extensively incorporate viral or eukaryotic microbial sequences, due to their unique genomic architectures and quality control requirements relative to bacterial and archaeal genomes.” From this, I assume that that version contained no marker genes for fungi.

I’ve been analyzing the composition of the new vJan25 database and noticed that it now includes marker genes for around 303 fungal species (from http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan25_CHOCOPhlAnSGB_202503_species.txt.bz2).
I went back to review the database construction procedure described in the MetaPhlAn 4 paper, and from what I understand, the pipeline relies on CheckM and Prokka, which are designed for prokaryotic genomes. I took this as the likely reason why previous versions would contain no fungi.

Given this, I’m wondering:

  • How were the fungal genomes in the vJan25 database incorporated? Were they processed with the same pipeline (CheckM + Prokka + marker selection) or through a different approach?

  • More generally, do you plan to expand fungal coverage in future releases, or should users consider adding their own marker genes for missing species? I was thinking of doing this with UFCG (Universal Fungal Core Genes, steineggerlab/ufcg on GitHub). Does anyone have experience using it?

I’d really appreciate any clarification on this, as I would be very interested in analyzing fungi in my samples.

Thanks a lot for your time and for all the great work on MetaPhlAn!

Best,

Alberto

UPDATE: I found out that the same 303 fungi that are present in vJan25 are also present in vJun23.

I retrieved the species names from mpa_vJan25_CHOCOPhlAnSGB_202503_species.txt and mpa_vJun23_CHOCOPhlAnSGB_202403_species.txt using the following code:

# Each pipe sits at the end of its line so that the trailing comments do not break the line continuation.
grep "k__Eukaryota" mpa_vJan25_CHOCOPhlAnSGB_202503_species.txt |
        tr ',' '\n' |        # some species are on the same line because they're in the same SGB
        cut -d'|' -f7 |      # keep only the species field
        cut -c4- |           # remove the 's__' prefix
        tr '_' ' ' |         # replace '_' with ' ' between genus and species name
        sort -u |
        while IFS= read -r species; do
                output=$(ete3 ncbiquery --search "$species" --info 2>&1)    # search the species name in NCBI

                # if it is not found (happens with unclassified species like "Penicillium sp W3 MMC 2018"), search only the genus
                if echo "$output" | grep -q "could not be translated into taxids"; then
                        genus=$(echo "$species" | awk '{print $1}')
                        output=$(ete3 ncbiquery --search "$genus" --info 2>&1)
                fi

                # output contains the whole taxonomic lineage for that species; if it mentions fungi, it is a fungus
                if echo "$output" | grep -iq "fungi"; then
                        echo "$species"
                fi
        done > fungi-in-MPA.txt
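
A minimal sketch of how the two resulting lists could be compared, assuming the pipeline above is run once per release and each output is saved under its own name. The helper compare_lists and the file names fungi-vJan25.txt and fungi-vJun23.txt are hypothetical and do not appear in the code above. It relies only on the standard sort, comm and wc tools.

```shell
# compare_lists FILE_A FILE_B
# Hypothetical helper: reports how many species names (one per line)
# are unique to each list and how many are shared.
compare_lists() {
        a=$(mktemp)
        b=$(mktemp)
        # comm needs both inputs sorted with the same collation
        LC_ALL=C sort -u "$1" > "$a"
        LC_ALL=C sort -u "$2" > "$b"
        echo "only in $1: $(LC_ALL=C comm -23 "$a" "$b" | wc -l | tr -d ' ')"
        echo "only in $2: $(LC_ALL=C comm -13 "$a" "$b" | wc -l | tr -d ' ')"
        echo "in both: $(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' ')"
        rm -f "$a" "$b"
}

# Example call (assumed file names):
# compare_lists fungi-vJan25.txt fungi-vJun23.txt
```

If both "only in" counts are 0, the two releases contain the same set of fungal species names according to this check.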

Please tell me if I’m missing something.

Thank you,
Alberto