Problem with panphlan_map


I am using PanPhlAn 3.1 (13 Jan 2021) with samtools version = 1.12 and I encountered a problem.
Sometimes, the mapping step stalls and do not proceed. The initial SAM file is produced, but then it just stalls forever. I tried interrupting the command and re-runned the specific sample and then it could easily run the sample. The bug seems quite random - sometimes it is capable of running and sometimes it stalls forever.

For a temporary solution I included the timeout command in my pipeline when calling the script, so that the sample is skipped if PanPhlAn stalls more than 20 minutes.

When it stalls, there is no output in the terminal, but it is the step just before the following is printed in the terminal (from a problem-free run).

“13323640 reads; of these:
13323640 (100.00%) were unpaired; of these:
13258233 (99.51%) aligned 0 times
467 (0.00%) aligned exactly 1 time
64940 (0.49%) aligned >1 times
0.49% overall alignment rate
[mpileup] 1 samples in 1 input files”

Could this problem be solve somehow?

Kind regards Anne


Thanks for raising this issue. Unfortunately, it will be difficult to solve a problem that seems to be randomly happening.

If you get the following output before the stalling, it means that at least the samtools mpileup process went fine. There are not so much computation left after that in the PanPhlAn mapping step, thus I see three possible causes for the problem :

  1. Either it’s a problem from dealing with the tmp files. That could happen in some environment… Are you running PanPhlAn on your laptop ? A remote server ? A HPC ?

  2. After the mpileup, the coverage are summarized by gene families using the information from the species pangenome. If you downloaded it with things should be fine. It might be also possible to run the --species [name of the species] --pangenome [path to the pangenome folder]. This can sometimes solve problems due to duplicated sequences. But I wouldn’t bet so much on this.

  3. The last cause could be the species you’re interested in. When lots of reference genomes are available the species pangenome can be quite big and PanPhlAn takes more time to run. Also depends on the sequencing depth of your metagenomic sample. As an example, the PanPhlAn jobs I’m running with 10 cores and 5GB of RAM usually take between 5 and 20 minutes but some outliers can run for a couple of hours (3-5 hours). In that case it’s usually due to the number of reads int he sample. I also observed strong variations in mean run time between species having 2-4 reference genomes and some having 100+
    So here it depends on what you call “stalling forever”. Does the job actually crash, or is it taking longer than expected ?

Sorry for not having more precise ideas about what could go wrong. I hope this can be of any help. Let me know if you have more questions, or if you solve the bug.

Kind regards,

1 Like

Hi LĂ©onard,

Thank you so much for the detailed answer!

I can see that my question might be a bit unclear, because the output that I included is NOT produced when PanPhlan stalls. It solely makes the initial SAM file in the temp folder and then stalls with no output in the terminal. So I think it is samtools that causes the problem.

  1. I am using a remote sever and I specified the location of the tmp folder.

  2. Thanks for the suggestion. I actually already implemented the clean_pangenome step as I saw that the duplicated sequences in the pangenome might give some problems for samtools

  3. I am running on 14 cores with 30GB RAM. Actually I am using data, which I simulated, so I have 10 samples with the same read depth. So e.g. when running PanPhlAn on Akkermansia_muciniphila, the mapping step took 5-10 minutes for each sample and then suddenly it would stall in 14 hours for one sample until I interrupted it. This sample could easily be mapped when re-running the sample.

Kind regards,

@leonard.dubois I also tried with a different data set and I encountered the same problem
Kind regards,