Dear MetaPhlAn developers, I have just used the `sgb_to_gtdb_profile.py` script and noticed what might be two issues, and I was wondering if you could help. Firstly, when running the script, we noticed that the database name is parsed with the following code:
```python
def get_database_name(self):
    """Gets database name

    Returns:
        str: the database name
    """
    return self.database.split('/')[-1][:-4]
```
This meant that when specifying the database we had to give the full path `metaphlan_db_vOct22/mpa_vOct22_CHOCOPhlAnSGB_202212.bt2` instead of a general link to the database directory `metaphlan_db_vOct22`. I'm not sure if this was intentional?
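To show what I mean, here is a minimal sketch of how that expression behaves on the two paths. `name_by_slicing` just copies the expression from the snippet above; `name_by_splitext` is a hypothetical alternative I made up for comparison, not something from the MetaPhlAn code base.

```python
import os


def name_by_slicing(database):
    # Same expression as in get_database_name: take the last path
    # component and drop its final 4 characters unconditionally.
    return database.split('/')[-1][:-4]


def name_by_splitext(database):
    # Hypothetical alternative: strip only a real file extension, if present.
    return os.path.splitext(os.path.basename(database.rstrip('/')))[0]


file_path = "metaphlan_db_vOct22/mpa_vOct22_CHOCOPhlAnSGB_202212.bt2"
dir_path = "metaphlan_db_vOct22"

print(name_by_slicing(file_path))   # extension removed, name intact
print(name_by_slicing(dir_path))    # last 4 characters of the directory name are lost
print(name_by_splitext(file_path))
print(name_by_splitext(dir_path))
```

With the directory path, the slicing version cuts into the directory name itself, which I assume is why the script only worked for us with the full file path. The `splitext` version avoids the truncation, but it still returns the directory name rather than a database name.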
Secondly, after successfully running the script, we noticed that the SGB-to-GTDB conversion raised the relative abundance of Bacteria to 100. In my samples a large percentage of reads were unclassified, so with the MetaPhlAn database we got this as an example:
|taxonomy|C178B|
|---|---|
|UNCLASSIFIED|32.23385|
|k__Bacteria|67.76615398125996|
|k__Bacteria\|p__Fusobacteria|33.96444381171528|
|k__Bacteria\|p__Proteobacteria|16.390565780370355|
|k__Bacteria\|p__Firmicutes|14.516940048634888|
|k__Bacteria\|p__Bacteroidetes|2.8942043405394373|
However, after converting to GTDB we got this:
#clade_name relative_abundance
UNCLASSIFIED 32.23385
d__Bacteria 100.0
d__Bacteria;p__Fusobacteriota 50.12007
d__Bacteria;p__Proteobacteria 22.51449
d__Bacteria;p__Firmicutes_A 21.422109999999996
d__Bacteria;p__Bacteroidota 4.45885
d__Bacteria;p__Campylobacterota 1.48448
The samples no longer summed to 100. I think that perhaps the ratios between the phyla abundances are correct, but they just need to be rescaled in proportion to the bacterial abundance, which should not be 100 but rather 100 - UNCLASSIFIED (67.76615 in this sample). Let me know if this seems correct or if you think the error came from elsewhere.
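To make the suggestion concrete, here is a minimal sketch of the rescaling I have in mind. It is not the script's actual code: the function name `rescale_to_classified` is made up, and the values are simply copied from the GTDB output above.

```python
def rescale_to_classified(profile, unclassified_key="UNCLASSIFIED"):
    """Scale every classified clade so that the total, including the
    unclassified fraction, adds up to 100 again."""
    unclassified = profile.get(unclassified_key, 0.0)
    factor = (100.0 - unclassified) / 100.0
    return {
        clade: (value if clade == unclassified_key else value * factor)
        for clade, value in profile.items()
    }


# Values copied from the converted profile shown above.
gtdb_profile = {
    "UNCLASSIFIED": 32.23385,
    "d__Bacteria": 100.0,
    "d__Bacteria;p__Fusobacteriota": 50.12007,
    "d__Bacteria;p__Proteobacteria": 22.51449,
    "d__Bacteria;p__Firmicutes_A": 21.422109999999996,
    "d__Bacteria;p__Bacteroidota": 4.45885,
    "d__Bacteria;p__Campylobacterota": 1.48448,
}

rescaled = rescale_to_classified(gtdb_profile)
for clade, value in rescaled.items():
    print(f"{clade}\t{value:.5f}")
```

After rescaling, `d__Bacteria` plus `UNCLASSIFIED` sums to 100, and Fusobacteriota comes out at roughly 33.96, which agrees with the phylum-level value in the original MetaPhlAn profile.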
Thanks in advance.