Problem with panphlan_map

Anne_Mathiesen · May 27, 2021, 7:41am

Hi,

I am using PanPhlAn 3.1 (13 Jan 2021) with samtools version = 1.12 and I encountered a problem.
Sometimes, the mapping step stalls and do not proceed. The initial SAM file is produced, but then it just stalls forever. I tried interrupting the command and re-runned the specific sample and then it could easily run the sample. The bug seems quite random - sometimes it is capable of running and sometimes it stalls forever.

For a temporary solution I included the timeout command in my pipeline when calling the panphlan_map.py script, so that the sample is skipped if PanPhlAn stalls more than 20 minutes.

When it stalls, there is no output in the terminal, but it is the step just before the following is printed in the terminal (from a problem-free run).

“13323640 reads; of these:
13323640 (100.00%) were unpaired; of these:
13258233 (99.51%) aligned 0 times
467 (0.00%) aligned exactly 1 time
64940 (0.49%) aligned >1 times
0.49% overall alignment rate
[mpileup] 1 samples in 1 input files”

Could this problem be solve somehow?

Kind regards Anne

leonard.dubois · May 27, 2021, 9:47am

Hello,

Thanks for raising this issue. Unfortunately, it will be difficult to solve a problem that seems to be randomly happening.

If you get the following output before the stalling, it means that at least the samtools mpileup process went fine. There are not so much computation left after that in the PanPhlAn mapping step, thus I see three possible causes for the problem :

Either it’s a problem from dealing with the tmp files. That could happen in some environment… Are you running PanPhlAn on your laptop ? A remote server ? A HPC ?
After the mpileup, the coverage are summarized by gene families using the information from the species pangenome. If you downloaded it with panphlan_download_pangenome.py things should be fine. It might be also possible to run the panphlan_clean_pangenome.py --species [name of the species] --pangenome [path to the pangenome folder]. This can sometimes solve problems due to duplicated sequences. But I wouldn’t bet so much on this.
The last cause could be the species you’re interested in. When lots of reference genomes are available the species pangenome can be quite big and PanPhlAn takes more time to run. Also depends on the sequencing depth of your metagenomic sample. As an example, the PanPhlAn jobs I’m running with 10 cores and 5GB of RAM usually take between 5 and 20 minutes but some outliers can run for a couple of hours (3-5 hours). In that case it’s usually due to the number of reads int he sample. I also observed strong variations in mean run time between species having 2-4 reference genomes and some having 100+
So here it depends on what you call “stalling forever”. Does the job actually crash, or is it taking longer than expected ?

Sorry for not having more precise ideas about what could go wrong. I hope this can be of any help. Let me know if you have more questions, or if you solve the bug.

Kind regards,
Léonard

Anne_Mathiesen · May 27, 2021, 10:11am

Hi Léonard,

Thank you so much for the detailed answer!

I can see that my question might be a bit unclear, because the output that I included is NOT produced when PanPhlan stalls. It solely makes the initial SAM file in the temp folder and then stalls with no output in the terminal. So I think it is samtools that causes the problem.

I am using a remote sever and I specified the location of the tmp folder.
Thanks for the suggestion. I actually already implemented the clean_pangenome step as I saw that the duplicated sequences in the pangenome might give some problems for samtools
I am running on 14 cores with 30GB RAM. Actually I am using data, which I simulated, so I have 10 samples with the same read depth. So e.g. when running PanPhlAn on Akkermansia_muciniphila, the mapping step took 5-10 minutes for each sample and then suddenly it would stall in 14 hours for one sample until I interrupted it. This sample could easily be mapped when re-running the sample.

Kind regards,
Anne

Anne_Mathiesen · June 2, 2021, 8:37am

@leonard.dubois I also tried with a different data set and I encountered the same problem
Kind regards,
Anne

PabloGallardoM · October 21, 2022, 8:15am

Hi, I’m experiencing the same issue. I run a group of samples and I got the mapping ok (however no strains found with the profiling), but then I did the mapping of same amount of samples, and It got stuck for a day without any outcome. I repeated the samples that worked before and no result came out.

I] Bowtie2 is installed

version: 2.4.5, path: /Users/arrow/bowtie2//bowtie2

[I] Samtools version 1.16.1; path: /usr/local/bin/samtools

[I] bowtie2 --very-sensitive --no-unal -x …/panphlan/Escherichia_coli/Escherichia_coli -U Sample10.fastq -p 12 --quiet

[I] Rejected 0 reads over 125215 total

Bowtie2 mapping and SAM filtering completed.

[I] Samtools version 1.16.1; path: /usr/local/bin/samtools

[I] samtools view -bS /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_5igejywc.sam

[I] samtools index /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_g0tdff6r.bam

[I] samtools mpileup /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_g0tdff6r.bam > /var/folders/cj/1gfhn2_94vs9rt6h8p27jh9h0000gn/T/panphlan_lk_pyq0b.csv

[mpileup] 1 samples in 1 input files. <<<<<<<----- stuck here forever

leonard.dubois · October 21, 2022, 1:16pm

Hi,

looking at this issue, it seems that the mpileup can be quite long sometimes. Do you know if the jobs is still running and just extremely slow or if the job is stuck somehow and not advancing anymore?

It could also be due to E. coli pangenome being one of the biggest we have and thus . We are working on a updated version of the pangenome database that should improve that

PabloGallardoM · October 23, 2022, 6:52pm

I think was the length, I finally was able to do it. Now I’m struggling with the next step hahha.

panphlan_profiling.py -i map_results/ --o_matrix result_profile_ecoli.tsv -p …/panphlan/Escherichia_coli/Escherichia_coli_pangenome.tsv --add_ref -v

STEP 1. Processing genes informations from pangenome file…
Number of reference genomes: 200
Average number of gene-families per genome: 4881
Total number of pangenome gene-families 72508

STEP 1b. Get genes present in reference genomes…

STEP 2. Create coverage matrix
[I] Reading mapping result file: Sample2.ecoli.tsv.bz2
[I] Reading mapping result file: .DS_Store
Traceback (most recent call last):
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 763, in
main()
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 709, in main
dna_samples_covs = read_map_results(args.i_dna, args.verbose)
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 286, in read_map_results
dna_samples_covs[dna_sample_id] = read_gene_cov_file(os.path.join(i_dna, dna_covs_file))
File “/Users/arrow/miniconda3/bin/panphlan_profiling.py”, line 272, in read_gene_cov_file
for line in f:
File “/Users/arrow/miniconda3/lib/python3.9/bz2.py”, line 208, in readline
return self._buffer.readline(size)
File “/Users/arrow/miniconda3/lib/python3.9/_compression.py”, line 68, in readinto
data = self.read(len(byte_view))
File “/Users/arrow/miniconda3/lib/python3.9/_compression.py”, line 103, in read
data = self._decompressor.decompress(rawblock, size)
OSError: Invalid data stream

I think is there an issue with the names… because when I change the files from Sample2.ecoli.tsv.bz2 to Sample2_ecoli.tsv.bz2 I get a different error. And inside de script in some parts says “csv” and others “tsv”.

Thanks

Pablo

Topic		Replies	Views
Samtools failure "[mpileup] fail to read the header of /tmp/panphlan_nprq80h8.bam" PanPhlAn	14	4796	November 3, 2020
Panphlan Profiling Error PanPhlAn	3	422	December 6, 2021
Sample2markers.py would not close when done MetaPhlAn	0	214	August 4, 2021
PanPhlAn_pangenome_exporter issue while writing .tsv PanPhlAn	7	612	January 25, 2021
StrainPhlAn Tutorial Error StrainPhlAn	0	48	August 25, 2024

Problem with panphlan_map

Related topics