Total Bacteria species relative abundance no longer sums to 1 after converting the SGB profile to GTDB

Hi,
I used MetaPhlAn version 4.0.4 (17 Jan 2023) to perform taxonomy profiling on our shotgun metagenomic samples and after getting the metaphlan profiles (SGB level), I used the sgb_to_gtdb_profile.py command to convert the SGB profiles to GTDB profiles.
However, after converting, the total bacteria relative abundance of some samples became no longer 100% . Some became down to ~ 98.5 - 99% and some became up to 100.0001%.
I wonder why the relative abundances of species changed after re-annotating the taxa?
Can you please suggest some solutions to proceed with?
Can we renormalize the GTDB profiles to sum to 100%?

Thank you so much.

Best,
Fangxi

1 Like

Hi @Fangxi_Xu
Could you share with as an example profile (both the metaphlan default output and the gtdb converted) to have a better look into the problem?

Hi @aitor.blancomiguez ,
Thank you for your prompt response. Please find 2 MetaPhlAn SGB profiles and 2 converted GTDB profiles attached.
The sample S16 had total bacteria relative abundance 100% in the SGB profile and 100.00002% in the GTDB profile; while sample S43 had total bacteria relative abundance 100% and 97.77355000000001% in the GTDB profile.
These are oral samples from mice. Please let me know if more information is needed.
Thank you so much for your help.

Best,
Fangxi

gtdb_profile_S16.txt (3.6 KB)
gtdb_profile_S43.txt (27.0 KB)
metaphlan_profile_S16.txt (7.7 KB)
metaphlan_profile_S43.txt (68.6 KB)

Hi @Fangxi_Xu
I had a look at the problem and it seems it is due to some decimal rounding step when reporting the original metaphlan results. As the abundances of higher taxonomic levels are sum up from the species level (up to 5 decimals), the rounding error seems to be increasing. Renormalizing the abundances to 100% in each tax level would solve the issue. I included the re-normalization step in the script uploaded to the github repo (MetaPhlAn/sgb_to_gtdb_profile.py at master · biobakery/MetaPhlAn · GitHub). I will include it in the next metaphlan release but if you download it you can use it as:
$ python path_to_the_script/sgb_to_gtdb_profile.py -i .....

1 Like

Hi @aitor.blancomiguez ,
Thank you for your reply; however, I think it’s not just a decimal rounding issue here.
Can you please provide some details of how the conversion was performed? Was that done by exact mapping from NCBI annotation to GTDB?
We noticed that the number of species in these profiles are not matched.
Can you please look into it?
Thanks.

Best,
Fangxi

To add to this, we noticed that the missing relative abundance % in the GTDB assignment is the relative abundance % sum of those taxa found only in the MetaPhlAn assignment minus the relative abundance % sum of those taxa found only in the GTDB assignment. We identified this by matching the relative abundance numbers between the two files and noting where there were discrepancies.

Hi @Fangxi_Xu and Scott
For each bacterial and archaeal SGB in the mpa4 database, we got the assignment to the GTDB database using the GTDB-Tk tool. For some SGBs, there is not correspondance in the GTDB database, so the assignment sometimes does not reach the species level.
E.g. SGB105011 d__Archaea;p__Thermoplasmatota;c__Thermoplasmata;o__ARK-15;f__ARK-15;g__ARK-15;s__
in those cases, abundances of SGBs assigned to the same genus (e.g. ARK-15) but to an unknown species (s__) are sumup as the same taxa. This will happen also for the SGBs that even assigned at the species level, they are assigned to the same GTDB species.
E.g.

SGB74984|d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;f__Thermococcaceae;g__Thermococcus;s__Thermococcus sp012027395
SGB74985|d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;f__Thermococcaceae;g__Thermococcus;s__Thermococcus sp012027595

In the case of eukaryotes, as they are not present in the GTDB database, they are not accounted in the final GTDB profile

Thanks for the clarifications on the mapping and GTDB assignment of SGBs.
I still don’t understand how we can be missing ~2.22647% relative abundance in the GTDB assignment, though.
For example, in S43 provided above, k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridia_unclassified|f__Clostridia_unclassified|g__GGB31835|s__GGB31835_SGB45211 is an abundant SGB in the dataset (~2%) from the MetaPhlan analysis. Although unclassified at the order level, the abundance does not appear to be included elsewhere in the GTDB analysis.

We do not have any Eukaryotes in the analysis, so this does not cause the loss of relative abundance values when going from MetaPhlAn to GTDB.

I keep tracing the problem and I think I discover the root of it in line 73 of the code. So when two SGBs were assigned to the same GTDB taxonomy, instead of summing up the abundances, only the abundance of the last SGB in the profile was used. I pushed a fix of the code in the repo: MetaPhlAn/sgb_to_gtdb_profile.py at master · biobakery/MetaPhlAn · GitHub
Please, let me know if something is still working wrongly

Hi @aitor.blancomiguez ,
Thank you so much for the fixed script. I ran it by removing the last few lines for renormalization and the converted GTDB profiles look good now.

The sums are now very close to 100% and the minor differences are less than ~0.0001% across all our samples.

Will this command be fixed in the next release of MetaPhlAn?

Thank you again for all of your help.

Best,
Fangxi

Hi @Fangxi_Xu
Yes, it will be included in the next version!

1 Like

Great! Looking forward to it. Thanks for all the hard work.

1 Like