Samtools failure "[mpileup] fail to read the header of /tmp/panphlan_nprq80h8.bam"

Attempting to run panphlan_map.py, and running into an issue with samtools:

panphlan_map.py -p /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii_pangenome.tsv \
                --indexes /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii \
                -i /lovelace/echo/analysis/biobakery3/batch001/output/kneaddata/C0052-5F-1A_S49_kneaddata_paired_1.fastq.gz \
                -i /lovelace/echo/analysis/biobakery3/batch001/output/kneaddata/C0052-5F-1A_S49_kneaddata_paired_2.fastq.gz \
                -o /babbage/echo/panphlan_analysis/Alistipes_finegoldii/C0052_5F_1A.tsv
[I] Bowtie2 is installed
    version: 2.3.5.1, path: /usr/bin/bowtie2
[I] Samtools version 1.10;  path: /opt/samtools/bin/samtools
[I] input_path
[I] bowtie2 --very-sensitive --no-unal -x /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -U - -p 12 --quiet
[I] Rejected 0 reads over 22732 total
Bowtie2 mapping and SAM filtering completed.
[I] Samtools version 1.10;  path: /opt/samtools/bin/samtools
[I] samtools view -bS /tmp/panphlan_7jn89s03.sam
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000001.1'
# ... lots of other duplicates
# ...
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000008.1'
[E::sam_hrecs_update_hashes] Duplicate entry "CYYQ01000001.1" in sam header
samtools view: failed to add PG line to the header
samtools sort: failed to read header from "-"
[I] samtools index /tmp/panphlan_3_46v4dq.bam
[I] samtools mpileup /tmp/panphlan_3_46v4dq.bam > /tmp/panphlan_d0k1ptap.csv
samtools index: "/tmp/panphlan_3_46v4dq.bam" is in a format that cannot be usefully indexed
[mpileup] fail to read the header of /tmp/panphlan_3_46v4dq.bam
Traceback (most recent call last):
  File "/opt/miniconda3/envs/panphlan/bin/panphlan_map.py", line 399, in <module>
    main()
  File "/opt/miniconda3/envs/panphlan/bin/panphlan_map.py", line 389, in main
    piling_up(out_bam, is_tmp, tmp_csv.name, args)
  File "/opt/miniconda3/envs/panphlan/bin/panphlan_map.py", line 299, in piling_up
    os.unlink(bam_file + '.bai')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/panphlan_3_46v4dq.bam.bai'

I’ve tried on 3 different pangenomes, including p.copri with the same results. Using conda - installed panphlan (panphlan 3.0 py_0 bioconda), and the most recent version of samtools installed separately (not sure if it’s related, but I had to jump through some hoops to get samtools installed and working - the conda-installed one was throwing some weird errors)

Just cloned the repository, checked out 3.0 and installed via setup.py, same results.

Hello,

thanks for using PanPhlAn and pointing out the issue. Your input seems a bit strange to me, why do you have twice the -i argument with different files ? Has it already work with a similar output ?

Otherwise, if the issue comes from duplicated sequences, you might find the duplicates in the contigs file. Try removing them and rebuild the Bowtie2 indexes with the bowtie-build command. Similar problems were fixed like that.

Anyway, let me then know if you still encounter this issue

They’re from paired end reads - I assumed that I could pass both and have them concatenated, since this is what I do with kneaddata, but I just tried it with only one -i and I get the same result.

I just removed the bt2 files from the folder and ran

$ bowtie2-build Alistipes_finegoldii_pangenome_contigs.fna Allistipes_finegoldii

Then ran the command (with just 1 -i), got the same result.

$ panphlan_map.py -p /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii_pangenome.tsv  --indexes /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -i /lovelace/echo/analysis/biobakery3/batch001/output/kneaddata/C0052-5F-1A_S49_kneaddata_paired_2.fastq.gz  -o /babbage/echo/panphlan_analysis/Alistipes_finegoldii/C0052_5F_1A.tsv
[I] Bowtie2 is installed
    version: 2.3.5.1, path: /usr/bin/bowtie2
[I] Samtools version 1.10;  path: /opt/samtools/bin/samtools
[I] input_path
[I] bowtie2 --very-sensitive --no-unal -x /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -U - -p 12 --quiet
(ERR): "/babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii" does not exist or is not a Bowtie 2 index
Exiting now ...
[I] Rejected 0 reads over 0 total
Bowtie2 mapping and SAM filtering completed.
[I] Samtools version 1.10;  path: /opt/samtools/bin/samtools
[I] samtools view -bS /tmp/panphlan_g9052szw.sam
[main_samview] fail to read the header from "/tmp/panphlan_g9052szw.sam".
samtools sort: failed to read header from "-"
[I] samtools index /tmp/panphlan_5y84vooj.bam
[I] samtools mpileup /tmp/panphlan_5y84vooj.bam > /tmp/panphlan_7d9muezw.csv
samtools index: "/tmp/panphlan_5y84vooj.bam" is in a format that cannot be usefully indexed
[mpileup] fail to read the header of /tmp/panphlan_5y84vooj.bam
Traceback (most recent call last):
  File "/opt/miniconda3/bin/panphlan_map.py", line 4, in <module>
    __import__('pkg_resources').run_script('panphlan==3.0', 'panphlan_map.py')
  File "/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 665, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1470, in run_script
    exec(script_code, namespace, namespace)
  File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 396, in <module>
  File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 386, in main
  File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 297, in piling_up
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/panphlan_5y84vooj.bam.bai'

The file is indeed missing:

$ ls /tmp/panphlan_*            
panphlan_7d9muezw.csv 
$ ls /babbage/panphlan_pangenomes/Alistipes_finegoldii/
Alistipes_finegoldii_pangenome_contigs.fna     Allistipes_finegoldii.4.bt2 
Alistipes_finegoldii_pangenome.tsv             Allistipes_finegoldii.rev.1.bt2 
Allistipes_finegoldii.1.bt2                    Allistipes_finegoldii.rev.2.bt2 
Allistipes_finegoldii.2.bt2                    panphlan_Alistipes_finegoldii_annot.tsv 
Allistipes_finegoldii.3.bt2 

Hello Kevin,

Actually the problem seems not to come from samtools tempfiles but rather from above in the analysis :
(ERR): "/babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii" does not exist or is not a Bowtie 2 index Exiting now ...

PanPhlAn failed because the name of your indexes contains 2 l and the path you provided has only one.
Allistipes_finegoldii vs Alistipes_finegoldii .

Let’s try again with the right names. If it work, it would mean that the issue was indeed caused by duplicated sequences in the first place. Otherwise, I’ll go check more deeply :sweat_smile:

But in any case, you also pointed out that if the bowtie2 step failed, PanPhlAn tries to perform the samtools step nonetheless and this should definitely be fixed. Thanks for that !

Let me know how this is going.
Best

:man_facepalming: Well - that was silly, but sadly, it’s not the (only) problem;

$ panphlan_map.py -p /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii_pangenome.tsv  --indexes /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -i /lovelace/echo/analysis/biobakery3/batch001/output/kneaddata/C0052-5F-1A_S49_kneaddata_paired_2.fastq.gz  -o /babbage/echo/panphlan_analysis/Alistipes_finegoldii/C0052_5F_1A.tsv 
[I] Bowtie2 is installed
    version: 2.3.5.1, path: /usr/bin/bowtie2
[I] Samtools version 1.10;  path: /opt/samtools/bin/samtools
[I] input_path
[I] bowtie2 --very-sensitive --no-unal -x /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -U - -p 12 --quiet
[I] Rejected 0 reads over 22732 total
Bowtie2 mapping and SAM filtering completed.
[I] Samtools version 1.10;  path: /opt/samtools/bin/samtools
[I] samtools view -bS /tmp/panphlan_glroc86k.sam
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000001.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000002.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000003.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000004.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000005.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000006.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000007.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000008.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000009.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000010.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000011.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000012.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000013.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000014.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000015.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000016.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000017.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000018.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000019.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000020.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000021.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000022.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000023.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000024.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000025.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000026.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000027.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000028.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000029.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000030.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000031.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000032.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000033.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000034.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000035.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000036.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000037.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000038.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000039.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000040.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000041.1'
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000042.1'
[W::sam_hdr_create] Duplicated sequence 'CP003274.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000001.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000002.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000003.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000004.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000005.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000006.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000007.1'
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000008.1'
[E::sam_hrecs_update_hashes] Duplicate entry "CYYQ01000001.1" in sam header
samtools view: failed to add PG line to the header
samtools sort: failed to read header from "-"
[I] samtools index /tmp/panphlan_06smnmdk.bam
[I] samtools mpileup /tmp/panphlan_06smnmdk.bam > /tmp/panphlan_dnj3pb8h.csv
samtools index: "/tmp/panphlan_06smnmdk.bam" is in a format that cannot be usefully indexed
[mpileup] fail to read the header of /tmp/panphlan_06smnmdk.bam
Traceback (most recent call last):
  File "/opt/miniconda3/bin/panphlan_map.py", line 4, in <module>
    __import__('pkg_resources').run_script('panphlan==3.0', 'panphlan_map.py')
  File "/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 665, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1470, in run_script
    exec(script_code, namespace, namespace)
  File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 396, in <module>
  File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 386, in main
  File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 297, in piling_up
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/panphlan_06smnmdk.bam.bai'
$ ls
      Alistipes_finegoldii.1.bt2 
      Alistipes_finegoldii.2.bt2 
      Alistipes_finegoldii.3.bt2 
      Alistipes_finegoldii.4.bt2 
      Alistipes_finegoldii_pangenome_contigs.fna 
      Alistipes_finegoldii_pangenome.tsv 
      Alistipes_finegoldii.rev.1.bt2 
      Alistipes_finegoldii.rev.2.bt2 
      panphlan_Alistipes_finegoldii_annot.tsv 

I also tried with a couple of other taxa after rebuilding without success.

Have you tried to run a uniq Alistipes_finegoldii_pangenome_contigs.fna before rebuilding the indexes ?

I did not - I downloaded the pangenomes using the included script - I didn’t know this would be necessary. I don’t think that’s actually filtering unique sequences, it’s mostly just eliminating strings of Ns.

diff Alistipes_finegoldii_pangenome_contigs*
317,318d316
< NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
< NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
525d522
< NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
802d798
< NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

In any case, building indicies with that pangenome gives the same error.

I checked the database again.
For some strange reason, all reference genome are included twice, thus doubling the amount of sequences… I’ll go back in the database building script…

Meanwhile, I was able to get rid of the duplicated lines with sort Alistipes_finegoldii_pangenome_contigs.fna | uniq.

This lead to a malformed FASTA file though -

$ head Alistipes_finegoldii_pangenome_contigs_u.fna 
AAAAAAAACAGGGCGATTCTTTGCCTGCGGAATAGGTGATTCGCGCGCTTTCCTTGTATTATATACAAGAAAGAAAGCTA
AAAAAAAACGGAGCCCGCACCTCCGGGATGCAGGCTCCGCAAAAGCATACGGGACGATCAGAGCAGGGTGCCGTCCTCGA
AAAAAAAAGGAAACCCGGTATTCCGATGGAATACCACAGTTTCCTCACCAACCTAAAACTACTAACCTAAATGAACCTTA
AAAAAAAATATTTTTTCAGAAAGGGAGCGGTTTTAGTTACGGATCGTAATTTTTATTTCCGACCCGGGGATACATTGCCG
AAAAAAAATCCGGTTCCGAACTCCCGGAATCGAACCCCCATTTCATGCAACCGGACGCGCACCTATAAATAAAGCCGCCC
AAAAAAACGACAAGTATGAAAAAGATATTATTGATGTGTCTGTCCCTGCTGCTGTTCGCGTCAGCTCTGGTCTCCTGCGA
AAAAAAAGACACCGGGCAAATGCCTGGTGTCTTATTCATGTCCGGAGCCGAACCCGGCCGCCGCGCTACCTCGGCCCGTG
AAAAAAAGATAACTCTGCGGCGGCGGGCCTTTTGATCGCGGTGAACCATTGCAAACATTTCACCATTTGAATCCCGCCGC
AAAAAAAGGAAACCCGGTATTCCGATGGAATACCACAGTTTCCTCACCAACCTAAAACTACTAACCTAAATGAACCTTAA
AAAAAAAGGCACGGCGCAATAAAATCACAGATACACCCGGAACCGCAGGCTCCCCTGCAAATTCCTCCGCATTCGGACGA
$ tail Alistipes_finegoldii_pangenome_contigs_u.fna 
TTTTTTTCATTGGGTTCCTGGTCATAGTTTCCCTTTCTATCGCGAAAGCAGCGCCGCAATCCGCGTACGGAGTTCATCCG
TTTTTTTCGCTTTTATCAAAAAAAGTATCCGAAACCGCACTCCGGCTCGGATACTCCCGGGCCGCAAGACCCGTATCATC
TTTTTTTCGGCCTGAGCCTTCAGATAATCGATATCGTCGTCCAGATAAGCCTCGTCGGATCCCAATTTATCGAAATCGGC
TTTTTTTGATTTTGGTCTTTCAAAACAGCTCCTACAGCGCCGATTTTCCGATCCGGATTCTTTTCGCAGAAATCGAAGAA
TTTTTTTGCCTGTGCGGCGTGCCCTTCCGGTTCGTTTACCGGCGTTTTGTTCCTGCGTATCGTGCCTGTACCGGGTTGGT
TTTTTTTGCTATCTTCGTTTGCCGTCCTACTACTGCATCCTCTTCGAGCCTTTTCCTGGGGCCTCCTCTTCCTGGGGGTG
TTTTTTTGGGTGCTCCCGCTTTCTGGCATCTAAAAAACGGGGGATAGGTTGACAATGATCTGTTTGATAACGTAATGCCG
TTTTTTTGTGAACATATGTCATTGGAACCTGAAGTAAACTTATGGTAAAATAATGTCAAACTAAAAATCAATGCTATGTT
TTTTTTTGTGCCCGCCCCCGAAAACTCCCGCCGCCCCGAGTAACGGCCTGTACAGGCGGCCATCTTCGACGACCTGTTTC
TTTTTTTTGATTTTATCGCAGAATGCGCTCCAGAACGCCTGCGGCAAGCACCAGTGGCAGGTAATCACGCAGGACTTCGC

and all the headers are in the middle (between lines that start with C and those that start with G).

But anyway, I was able to remove duplicates with a short julia script, and then rebuild indices. That does seem to solve the issue :+1:

Ok, great !

It is still strange for some species have duplicate, other not. Sometimes it causes troubles, sometimes not.
Anyway thanks a lot for pointing out the importance of the problem.

1 Like

Hi,

I am having exactly the same errors as described above for the E. coli pangenome. I also tried the ‘sort’ solution offered above, but indeed also get a malformed FASTA file. Is there by any chance already a workaround for this issue?

Best,
Quinten

Hi,

I’m currently working on fixing the database duplicates, a temporary solution seems to be to remove the second half of the fna file (contigs have been written twice in some cases) :

Some pangenome of the database might have a problem of duplicated sequences leading to an error raised during the mapping step : [E::sam_hrecs_update_hashes] Duplicate entry “XXX” in sam header".

In these cases, better check the duplication comparing :
wc -l [species_name]_pangenome_contigs.fna and
sort [species_name]_pangenome_contigs.fna | uniq | wc -l
that should give roughly half of the previous number.

Then just cut the fna file in half (using for example head -[half of the lines in the fna file] [species_name]_pangenome_contigs.fna > new_contigs.fna ) and then regenerate the indexes using bowtie2-build.

Hope this can fix your problem quickly. I hope the curated database will be available at the end of the week.

Best

Hi,

Thanks a lot for thee feedback.

There is no big rush, so I will just wait for the new database =).

Again, thanks a lot.

Best,

Quinten

Hi,

The last commit on GitHub (https://github.com/SegataLab/panphlan) provides an updated/cleaned version of the database that should fix this issue