MetaPhlAn4 (mpa_vJun23_CHOCOPhlAnSGB_202307): unexpected behavior with fix_relab_mpa4.py

Hi,

I ran MetaPhlAn 4.1 with the mpa_vJun23_CHOCOPhlAnSGB_202307 version of the database for taxonomic profiling. I then learned about the fix_relab_mpa4.py script, which fixes the relative abundance (RA) of taxa at each rank. However, in my results the relative abundance values, including unclassified, already sum to 100% at each rank with mpa_vJun23_CHOCOPhlAnSGB_202307, and running the fixing script just messes up the values at the lower ranks. Could you please let me know what is wrong, or what I have misunderstood? Thanks!

I have two files for your reference:

Hi @Saharb
Would it be possible for you to also upload to the Dropbox the original, unedited MetaPhlAn output file for one of the samples, so I can run the fixing script and see what is happening? Thanks

Dear Claudia,

Thanks for your reply. Sure! I have uploaded one file.

Thanks for checking it.

Sorry, I only got to check the file now.
The good news is that (1) your profiles do not need fixing. The issue in mpa_vJun23_CHOCOPhlAnSGB_202307 comes from naming convention changes in NCBI: some SGBs are reported as p__Bacillota instead of p__Firmicutes, and the same happens for f__Saccharomycetales_unclassified SGBs reported as f__Debaryomycetaceae. As a result, some SGBs that should have had the same phylum or family were given different ones, so the sum at those ranks differed from 100. This never occurred in your profiles, so you are safe to use the original profiles.
(2) Just to make sure, I checked the script, and we did add some modifications as of 4.2 that account for "unclassified"; you were probably using an older script from 4.1. If you need to run it on other samples in the future, just download the latest fix_relab_mpa4.py from GitHub to be sure you are using the most up-to-date version.
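
For anyone who wants to check their own profiles before deciding whether to run the fixing script, here is a minimal Python sketch of the per-rank sum check discussed in this thread. It is not part of MetaPhlAn or fix_relab_mpa4.py. It assumes a tab-separated profile in which the first column is the clade name (ranks joined by "|", each with a prefix such as k__ or p__), the third column is the relative abundance, header lines start with "#", and an optional UNCLASSIFIED row is present. The function names rank_sums and ranks_off_100 and the 0.01 tolerance are made up for illustration.

```python
from collections import defaultdict

# Rank prefixes as they appear in clade names (k__ ... t__).
RANKS = ["k", "p", "c", "o", "f", "g", "s", "t"]


def rank_sums(lines):
    """Sum relative abundances per rank, adding UNCLASSIFIED to every rank.

    Assumes (not verified against every MetaPhlAn version) that column 1 is
    the clade name and column 3 is the relative abundance.
    """
    sums = defaultdict(float)
    unclassified = 0.0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        clade, abundance = fields[0], float(fields[2])
        if clade == "UNCLASSIFIED":
            unclassified = abundance
            continue
        # The rank of a row is the prefix of its last taxon, e.g. "p" for
        # "k__Bacteria|p__Firmicutes".
        rank = clade.split("|")[-1].split("__")[0]
        sums[rank] += abundance
    return {r: sums[r] + unclassified for r in RANKS if r in sums}


def ranks_off_100(lines, tol=0.01):
    """Return only the ranks whose total differs from 100 by more than tol."""
    return {r: s for r, s in rank_sums(lines).items() if abs(s - 100.0) > tol}


# Hypothetical usage (the file name is a placeholder):
# with open("sample_profile.txt") as handle:
#     print(ranks_off_100(handle))
```

If ranks_off_100 returns an empty dictionary, every rank already sums to 100 within the tolerance, which is the situation described above where the original profiles can be used as they are.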