Hi,
I’m running HUMAnN with a fixed Bowtie2 --seed value, and I’ve confirmed that the genefamilies.tsv output is identical across runs, suggesting consistent upstream results.
However, I’ve noticed that the order of reactions passed into MinPath varies between runs, even though the contents are the same. This change in input order causes the final pathabundance.tsv and pathcoverage.tsv outputs to differ slightly.
To clarify:
- MinPath itself is not producing nondeterministic output given the same input.
- The issue lies in the order in which reactions are written into the MinPath input file, which appears to be non-deterministic.
- I tested multiple repeated runs using multithreading on my system, and the results were consistent across those runs.
- However, when I compared my results with a colleague’s (same input data and environment), the pathway abundance outputs differed, which suggests that some non-deterministic behavior remains, likely in how the reaction list is prepared.
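The "same contents, different order" observation above can be checked with something like the following. This is only a minimal sketch: the function name and the reaction IDs are hypothetical, and it assumes the reaction identifiers have already been extracted from the two MinPath input files into plain lists.

```python
from collections import Counter


def compare_reaction_lists(run_a, run_b):
    """Return (same_contents, same_order) for two lists of reaction IDs."""
    # Counter compares the lists as multisets, ignoring order.
    same_contents = Counter(run_a) == Counter(run_b)
    # Plain list equality is order-sensitive.
    same_order = list(run_a) == list(run_b)
    return same_contents, same_order


# Hypothetical reaction IDs, for illustration only.
run_a = ["RXN-1", "RXN-2", "RXN-3"]
run_b = ["RXN-3", "RXN-1", "RXN-2"]
print(compare_reaction_lists(run_a, run_b))
```

A result where the first value is true and the second is false matches what is described here: identical reactions, written in a different order.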
I suspect this may be due to the use of unordered data structures (e.g., dictionaries) or parallel execution in the step that prepares the reaction list for MinPath.
Would it be possible to enforce a deterministic sorting of reaction inputs before writing the MinPath input file to ensure reproducible pathway results?
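As a minimal sketch of the kind of sorting meant here (the function name, the reaction IDs, and the two-column index/reaction layout are assumptions for illustration, not taken from the HUMAnN source):

```python
import io


def write_reactions_sorted(reactions, handle):
    """Write one reaction per line in sorted order, prefixed by a running index."""
    # sorted() gives the same order regardless of how `reactions` was built
    # (dict iteration order, set order, thread completion order, ...).
    for index, reaction in enumerate(sorted(reactions), start=1):
        handle.write(f"{index}\t{reaction}\n")


# Two hypothetical runs that collected the same reactions in different orders.
first_run = io.StringIO()
second_run = io.StringIO()
write_reactions_sorted(["RXN-3", "RXN-1", "RXN-2"], first_run)
write_reactions_sorted(["RXN-2", "RXN-3", "RXN-1"], second_run)
print(first_run.getvalue() == second_run.getvalue())
```

With the sort in place, the file contents depend only on which reactions are present, not on the order in which they were collected.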
Thanks in advance.
There have been a few reports recently about nondeterministic behavior, and we are investigating them. This is not expected behavior from the HUMAnN code itself. So far my only lead is that bowtie2 uses a --seed for some internal calculations, and I’m wondering if we’re bumping into that. If so, we can adjust the MetaPhlAn and HUMAnN bowtie2 calls to pass a specific seed value from the HUMAnN CLI.
Hi,
I’m running HUMAnN with a fixed Bowtie2 --seed value, and I’ve confirmed that the genefamilies.tsv output is identical across runs, suggesting consistent upstream results.
However, I’ve noticed that the order of reactions passed into MinPath varies between runs, even though the contents are the same. This change in input order causes the final pathabundance.tsv and pathcoverage.tsv outputs to differ slightly.
To clarify:
- MinPath itself is not producing nondeterministic output given the same input.
- The issue lies in the order in which reactions are written into the MinPath input file, which appears to be non-deterministic.
I suspect this is due to the use of unordered data structures (e.g., dictionaries) or multithreading in the step that prepares the reaction list for MinPath.
Would it be possible to enforce a deterministic sorting of reaction inputs before writing the MinPath input file to ensure reproducible pathway results?
Thanks in advance.
Really interesting and surprising. Thanks for investigating this; I will look into fixing it. In the meantime, you can also disable MinPath (--minpath off) if that is more convenient.