MaAsLin2 runs forever for 530K associations

Hello bioBakery help forum,
I am running MaAsLin2 on 530K associations from the HUMAnN 3 genefamily_relab.tsv output. I summed UniRef90 to find genus associations with my exposure. The last step below takes forever: it has been running for more than 10 hours and is still counting. Could you help me solve this issue? The run is on an HPC cluster, so I don’t think memory is the problem.
2021-01-07 23:01:52 INFO::Counting total values for each feature

Thank you!
Yike

The results came after 15 hours… Good reminder to request a long run time…

Hi!

Unfortunately, with that many associations the tool will take a while to profile the associations within the community. You can improve the speed by reducing the search space, for example by filtering out low-abundance or low-prevalence features, or features that are common between your exposure and control.
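As a minimal sketch of that kind of pre-filtering, the Python below keeps only features that exceed an abundance cutoff in a minimum fraction of samples. The function name, the toy table, and both thresholds are hypothetical and chosen for illustration. MaAsLin 2 also exposes its own `min_abundance` and `min_prevalence` options, which apply a similar filter inside the tool.

```python
def filter_features(table, min_abundance=0.0001, min_prevalence=0.1):
    """Keep features whose abundance exceeds min_abundance in at least
    min_prevalence (a fraction between 0 and 1) of the samples.

    table maps a feature name to a list of per-sample relative abundances.
    """
    kept = {}
    for feature, values in table.items():
        if not values:
            continue  # no samples, nothing to test
        n_present = sum(1 for v in values if v > min_abundance)
        if n_present / len(values) >= min_prevalence:
            kept[feature] = values
    return kept


# Toy relative abundances across four samples (made-up numbers).
toy = {
    "UniRef90_A": [0.02, 0.03, 0.00, 0.01],
    "UniRef90_B": [0.00, 0.00, 0.00, 0.00005],
    "UniRef90_C": [0.00, 0.00, 0.00, 0.004],
}

filtered = filter_features(toy, min_abundance=0.0001, min_prevalence=0.5)
print(sorted(filtered))
```

Here only `UniRef90_A` survives, because it is the only feature above the cutoff in at least half of the samples. Sensible thresholds depend on the dataset, so treat these values as placeholders.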

I think from your message you are already doing this, but make sure to also reduce the stratified table down to either the bugs contributing the function or the general gene families. This will also reduce the size of the dataset.
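To illustrate what reducing the stratified table means, here is a small Python sketch. It assumes HUMAnN's convention that a stratified row carries a `|` followed by the contributing taxon in its feature ID, while the community-level (unstratified) row does not. The function name and the toy rows are hypothetical. HUMAnN also ships a `humann_split_stratified_table` utility for doing this on real output files.

```python
def split_stratified(rows):
    """Split (feature_id, values) rows into unstratified and stratified lists.

    Assumes stratified feature IDs look like 'UniRef90_X|g__Genus.s__Species'
    and unstratified IDs contain no '|' character.
    """
    unstratified, stratified = [], []
    for feature_id, values in rows:
        if "|" in feature_id:
            stratified.append((feature_id, values))
        else:
            unstratified.append((feature_id, values))
    return unstratified, stratified


# Toy rows (made-up IDs and abundances).
rows = [
    ("UniRef90_A", [0.5, 0.4]),
    ("UniRef90_A|g__Bacteroides.s__Bacteroides_vulgatus", [0.3, 0.4]),
    ("UniRef90_A|unclassified", [0.2, 0.0]),
    ("UniRef90_B", [0.1, 0.2]),
]

unstratified, stratified = split_stratified(rows)
print(len(unstratified), "unstratified rows;", len(stratified), "stratified rows")
```

Running MaAsLin 2 on only one of the two resulting tables avoids testing each gene family both at the community level and once per contributing taxon.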

Finally, MaAsLin can be run in parallel with the --cores flag, which could also improve your run time.

I hope this helps!
Best,
Kelsey


Hello Kelsey,
Thank you for your answer! One more question.
I was trying to explain which species contribute to the association, and found many redundant species in the UniRef90 database output. How do people usually approach this redundancy? For my paper I only need one association per species, rather than multiple associations for the same species with different coefficients and q-values. The associations for the same species were all positive, but the adjusted q-value ranged from 0.4% to 10% and the coefficients differed by 0.1. Thank you!

Hi @YikeShen - in addition to @Kelsey_Thompson’s suggestion above on reducing the stratified table, I would suggest further filtering out features explainable by at most a single taxon before running MaAsLin 2. You can do that by discarding features that correlate very highly with individual microbial abundances. A similar strategy was used in the original iHMP paper; see the last sentence of “Differential microbiome feature abundance” in Methods (https://www.nature.com/articles/s41586-019-1237-9). Hope this helps!
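A rough Python sketch of that correlation filter follows. It is not the iHMP paper's exact procedure: the use of Pearson correlation, the 0.9 threshold, the function names, and the toy data are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_taxon_explained(features, taxa, threshold=0.9):
    """Discard features whose abundance tracks any single taxon too closely.

    features and taxa map names to per-sample abundances (same sample order).
    Only positive correlation is treated as 'explained by the taxon'.
    """
    kept = {}
    for name, values in features.items():
        max_r = max((pearson(values, t) for t in taxa.values()), default=0.0)
        if max_r < threshold:
            kept[name] = values
    return kept


# Toy data across four samples (made-up numbers).
taxa = {
    "s__X": [0.1, 0.2, 0.3, 0.4],
    "s__Y": [0.4, 0.1, 0.3, 0.2],
}
features = {
    "gene_1": [0.01, 0.02, 0.03, 0.04],  # scales exactly with s__X
    "gene_2": [0.03, 0.01, 0.04, 0.02],  # only partly tracks s__Y
}

kept = drop_taxon_explained(features, taxa, threshold=0.9)
print(sorted(kept))
```

In this toy case `gene_1` is removed because it is a scaled copy of `s__X`, while `gene_2` is kept. A rank-based correlation may be a better fit for sparse abundance data, and the threshold would need tuning on real data.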


Hello bioBakery help forum,
It looks like the most recent HUMAnN 3 attaches the “human-readable” UniRef protein name to the output table. I had already run HUMAnN 3, and it took a considerable amount of run time. Is there a downstream step that can attach protein names to the genefamily.tsv table?
Thank you!

Hi @YikeShen,

Yes, you can use the humann_rename_table command to add human-readable names to the UniRef IDs, similar to how the tutorial does for ECs.
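A hedged sketch of how that renaming step could be driven from Python is shown below. To my knowledge the HUMAnN 3 utility is installed as `humann_rename_table` and takes `--input`, `--names`, and `--output`. The helper function and both file names are hypothetical, and the UniRef90 name mapping may require the utility mapping database to be downloaded first.

```python
import os
import shutil
import subprocess


def build_rename_command(input_tsv, output_tsv, names="uniref90"):
    """Build the argument list for HUMAnN's rename utility (not executed here)."""
    return [
        "humann_rename_table",
        "--input", input_tsv,
        "--names", names,
        "--output", output_tsv,
    ]


# Hypothetical file names; replace with your own table paths.
cmd = build_rename_command("genefamilies.tsv", "genefamilies_names.tsv")
print(" ".join(cmd))

# Run only if the tool is on PATH and the input table exists.
if shutil.which(cmd[0]) and os.path.exists("genefamilies.tsv"):
    subprocess.run(cmd, check=True)
```

Because this works on the finished table, HUMAnN itself does not need to be rerun.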

I hope this helps!
Kelsey
