Problem with panphlan_map

Hi,

I am using PanPhlAn 3.1 (13 Jan 2021) with samtools version = 1.12 and I encountered a problem.
Sometimes, the mapping step stalls and do not proceed. The initial SAM file is produced, but then it just stalls forever. I tried interrupting the command and re-runned the specific sample and then it could easily run the sample. The bug seems quite random - sometimes it is capable of running and sometimes it stalls forever.

For a temporary solution I included the timeout command in my pipeline when calling the panphlan_map.py script, so that the sample is skipped if PanPhlAn stalls more than 20 minutes.

When it stalls, there is no output in the terminal, but it is the step just before the following is printed in the terminal (from a problem-free run).

“13323640 reads; of these:
13323640 (100.00%) were unpaired; of these:
13258233 (99.51%) aligned 0 times
467 (0.00%) aligned exactly 1 time
64940 (0.49%) aligned >1 times
0.49% overall alignment rate
[mpileup] 1 samples in 1 input files”

Could this problem be solve somehow?

Kind regards Anne

Hello,

Thanks for raising this issue. Unfortunately, it will be difficult to solve a problem that seems to be randomly happening.

If you get the following output before the stalling, it means that at least the samtools mpileup process went fine. There are not so much computation left after that in the PanPhlAn mapping step, thus I see three possible causes for the problem :

  1. Either it’s a problem from dealing with the tmp files. That could happen in some environment… Are you running PanPhlAn on your laptop ? A remote server ? A HPC ?

  2. After the mpileup, the coverage are summarized by gene families using the information from the species pangenome. If you downloaded it with panphlan_download_pangenome.py things should be fine. It might be also possible to run the panphlan_clean_pangenome.py --species [name of the species] --pangenome [path to the pangenome folder]. This can sometimes solve problems due to duplicated sequences. But I wouldn’t bet so much on this.

  3. The last cause could be the species you’re interested in. When lots of reference genomes are available the species pangenome can be quite big and PanPhlAn takes more time to run. Also depends on the sequencing depth of your metagenomic sample. As an example, the PanPhlAn jobs I’m running with 10 cores and 5GB of RAM usually take between 5 and 20 minutes but some outliers can run for a couple of hours (3-5 hours). In that case it’s usually due to the number of reads int he sample. I also observed strong variations in mean run time between species having 2-4 reference genomes and some having 100+
    So here it depends on what you call “stalling forever”. Does the job actually crash, or is it taking longer than expected ?

Sorry for not having more precise ideas about what could go wrong. I hope this can be of any help. Let me know if you have more questions, or if you solve the bug.

Kind regards,
LĂ©onard

1 Like

Hi LĂ©onard,

Thank you so much for the detailed answer!

I can see that my question might be a bit unclear, because the output that I included is NOT produced when PanPhlan stalls. It solely makes the initial SAM file in the temp folder and then stalls with no output in the terminal. So I think it is samtools that causes the problem.

  1. I am using a remote sever and I specified the location of the tmp folder.

  2. Thanks for the suggestion. I actually already implemented the clean_pangenome step as I saw that the duplicated sequences in the pangenome might give some problems for samtools

  3. I am running on 14 cores with 30GB RAM. Actually I am using data, which I simulated, so I have 10 samples with the same read depth. So e.g. when running PanPhlAn on Akkermansia_muciniphila, the mapping step took 5-10 minutes for each sample and then suddenly it would stall in 14 hours for one sample until I interrupted it. This sample could easily be mapped when re-running the sample.

Kind regards,
Anne

@leonard.dubois I also tried with a different data set and I encountered the same problem
Kind regards,
Anne

Hi, I’m experiencing the same issue. I run a group of samples and I got the mapping ok (however no strains found with the profiling), but then I did the mapping of same amount of samples, and It got stuck for a day without any outcome. I repeated the samples that worked before and no result came out.

I] Bowtie2 is installed

version: 2.4.5, path: /Users/arrow/bowtie2//bowtie2

[I] Samtools version 1.16.1; path: /usr/local/bin/samtools

[I] bowtie2 --very-sensitive --no-unal -x …/panphlan/Escherichia_coli/Escherichia_coli -U Sample10.fastq -p 12 --quiet

[I] Rejected 0 reads over 125215 total

Bowtie2 mapping and SAM filtering completed.

[I] Samtools version 1.16.1; path: /usr/local/bin/samtools

[I] samtools view -bS /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_5igejywc.sam

[I] samtools index /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_g0tdff6r.bam

[I] samtools mpileup /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_g0tdff6r.bam > /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_lk_pyq0b.csv

[mpileup] 1 samples in 1 input files. <<<<<<<----- stuck here forever

Hi,

looking at this issue, it seems that the mpileup can be quite long sometimes. Do you know if the jobs is still running and just extremely slow or if the job is stuck somehow and not advancing anymore?

It could also be due to E. coli pangenome being one of the biggest we have and thus . We are working on a updated version of the pangenome database that should improve that

1 Like

I think was the length, I finally was able to do it. Now I’m struggling with the next step hahha.

panphlan_profiling.py -i map_results/ --o_matrix result_profile_ecoli.tsv -p …/panphlan/Escherichia_coli/Escherichia_coli_pangenome.tsv --add_ref -v

STEP 1. Processing genes informations from pangenome file…
Number of reference genomes: 200
Average number of gene-families per genome: 4881
Total number of pangenome gene-families 72508

STEP 1b. Get genes present in reference genomes…

STEP 2. Create coverage matrix
[I] Reading mapping result file: Sample2.ecoli.tsv.bz2
[I] Reading mapping result file: .DS_Store
Traceback (most recent call last):
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 763, in
main()
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 709, in main
dna_samples_covs = read_map_results(args.i_dna, args.verbose)
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 286, in read_map_results
dna_samples_covs[dna_sample_id] = read_gene_cov_file(os.path.join(i_dna, dna_covs_file))
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 272, in read_gene_cov_file
for line in f:
File “/Users/arrow/miniconda3/lib/python3.9/bz2.py”, line 208, in readline
return self._buffer.readline(size)
File “/Users/arrow/miniconda3/lib/python3.9/_compression.py”, line 68, in readinto
data = self.read(len(byte_view))
File “/Users/arrow/miniconda3/lib/python3.9/_compression.py”, line 103, in read
data = self._decompressor.decompress(rawblock, size)
OSError: Invalid data stream

I think is there an issue with the names… because when I change the files from Sample2.ecoli.tsv.bz2 to Sample2_ecoli.tsv.bz2 I get a different error. And inside de script in some parts says “csv” and others “tsv”.

Thanks

Pablo