I’m running PhyloPhlAn on a rather large genome set (~20k genomes). I’m trying to figure out the best computational resources to give it for this task. From the most recent paper it looks like 100 CPUs for 10 days would do it for 17k genomes. So, the more cores the better? What about memory? I experimented with a high-mem volume, and it didn’t really speed things up. Is there a way to get PhyloPhlAn to use more memory? Thanks!
Hi and thanks for the question.
So, one thing to speed things up a bit (at least at a later stage) would be to be quite aggressive on trimming, e.g. the --diversity high --fast combination of params.
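For reference, such a run could look roughly like the sketch below (folder, database, and config names are placeholders for your own setup; please check phylophlan --help on your version for the exact flags):

```
# Aggressive-trimming run; all input/output/config names below are
# placeholders, not defaults. --diversity high selects the trimming
# preset for very divergent genomes, --fast trades some accuracy for
# speed, and --nproc sets how many jobs run in parallel.
phylophlan \
    -i input_genomes \
    -d phylophlan \
    -f supermatrix_aa.cfg \
    --diversity high \
    --fast \
    --nproc 100 \
    -o output_20k_tree
```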
Prior to this, though, the most time-consuming step will be the mapping of the database markers against your ~20k inputs. PhyloPhlAn parallelizes this mapping based on the number of CPUs: since the gain from multi-threading a single job is never linear, we decided to run single-threaded jobs and instead parallelize over the inputs. So, the more CPUs you provide, the more inputs will be mapped at the same time. When running many mappings concurrently, you should keep an eye on the RAM usage of each single job, so that the sum does not exceed the memory available on the machine; a rough way to size this is sketched below. This is also why a high-memory machine by itself won't speed things up: PhyloPhlAn won't use more RAM than the concurrent jobs need.
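As a crude back-of-the-envelope for choosing --nproc (the per-job peak RAM below is an assumption; please measure it on a few of your own inputs first, e.g. with /usr/bin/time -v):

```
# Hypothetical numbers: replace them with the peak RSS you actually
# observe for one single-threaded mapping job on your data.
total_ram_gb=256   # RAM available on the machine (assumption)
per_job_gb=2       # peak RAM of one mapping job (assumption)
max_jobs=$(( total_ram_gb / per_job_gb ))
echo "Cap --nproc at ~${max_jobs} jobs to stay within ${total_ram_gb} GB"
```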
I hope this helps, but please let me know if something is not clear.
Many thanks,
Francesco