High Memory Usage of sample2markers.py

Hi Biobakery Team!
I am noticing that the sample2markers.py script consumes a very large amount of memory. With an input sam.bz2 of 134M, it ends up using 80 GB of RAM. Is this a bug? Or is there a way of executing it on fragments of the alignment and recombining the results to make it more scalable?
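For example, something along these lines is what I had in mind (just a rough sketch using pysam, assuming the alignment is first converted to a coordinate-sorted, indexed BAM; I haven't checked whether the per-marker outputs can actually be recombined safely):

```python
import pysam

# Hypothetical chunking idea: split the alignment by reference/marker,
# run sample2markers.py on each chunk, then merge the per-marker results.
# Assumes "sample.sorted.bam" is coordinate-sorted and indexed.
with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam:
    for ref in bam.references:
        # Marker names may contain characters that are unsafe in file
        # names, so a real script would sanitize them first.
        with pysam.AlignmentFile(f"chunk_{ref}.bam", "wb", template=bam) as out:
            for read in bam.fetch(ref):
                out.write(read)
```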

Thanks in advance!

Hi @nickp60
Thanks for reporting this; we have never experienced such high RAM consumption when executing sample2markers.py. Would it be possible to share the input SAM file so we can get a better idea of what is going on?

Sure, what's the best email for you? I’ll send a download link. Thanks so much!

Here are the samtools stats, for the record:
sample2markers_highmem.stats.txt (30.8 KB)

Hi @nickp60
How many procs were you using for the sample2markers execution?

So sorry I missed this! I was using 16.

Then I think this is expected: the memory consumption of sample2markers.py grows roughly linearly with the number of cores used. However, if you are interested, we are currently working on a new version that should speed up the process while keeping memory consumption stable. You can check an alpha version of the code in this branch of the mpa repository: GitHub - biobakery/MetaPhlAn at sample2markers_speedup
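To illustrate why (a toy sketch, not the actual sample2markers.py internals): each worker process holds its own in-memory copy of the data it is working on, so peak memory multiplies with the pool size.

```python
import multiprocessing as mp

def process_chunk(chunk_id):
    # Stand-in for the real per-process work: each worker holds its own
    # chunk of the alignment in memory, so peak RSS scales with the
    # number of workers, not just with the input size.
    data = b"A" * (200 * 1024 * 1024)  # ~200 MB resident per worker (illustrative)
    return chunk_id, len(data)

if __name__ == "__main__":
    # 16 workers peak near 16 x 200 MB in this toy; 4 workers near 800 MB.
    with mp.Pool(processes=4) as pool:
        results = pool.map(process_chunk, range(32))
    print(f"processed {len(results)} chunks")
```

So, as a short-term workaround, lowering the number of processes (e.g. running with -n 4 instead of -n 16) should cut the peak memory roughly proportionally, at the cost of a longer runtime.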