Attempting to run panphlan_map.py, and running into an issue with samtools:
panphlan_map.py -p /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii_pangenome.tsv \
--indexes /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii \
-i /lovelace/echo/analysis/biobakery3/batch001/output/kneaddata/C0052-5F-1A_S49_kneaddata_paired_1.fastq.gz \
-i /lovelace/echo/analysis/biobakery3/batch001/output/kneaddata/C0052-5F-1A_S49_kneaddata_paired_2.fastq.gz \
-o /babbage/echo/panphlan_analysis/Alistipes_finegoldii/C0052_5F_1A.tsv
[I] Bowtie2 is installed
version: 2.3.5.1, path: /usr/bin/bowtie2
[I] Samtools version 1.10; path: /opt/samtools/bin/samtools
[I] input_path
[I] bowtie2 --very-sensitive --no-unal -x /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -U - -p 12 --quiet
[I] Rejected 0 reads over 22732 total
Bowtie2 mapping and SAM filtering completed.
[I] Samtools version 1.10; path: /opt/samtools/bin/samtools
[I] samtools view -bS /tmp/panphlan_7jn89s03.sam
[W::sam_hdr_create] Duplicated sequence 'CYYQ01000001.1'
# ... lots of other duplicates
# ...
[W::sam_hdr_create] Duplicated sequence 'CZBZ01000008.1'
[E::sam_hrecs_update_hashes] Duplicate entry "CYYQ01000001.1" in sam header
samtools view: failed to add PG line to the header
samtools sort: failed to read header from "-"
[I] samtools index /tmp/panphlan_3_46v4dq.bam
[I] samtools mpileup /tmp/panphlan_3_46v4dq.bam > /tmp/panphlan_d0k1ptap.csv
samtools index: "/tmp/panphlan_3_46v4dq.bam" is in a format that cannot be usefully indexed
[mpileup] fail to read the header of /tmp/panphlan_3_46v4dq.bam
Traceback (most recent call last):
File "/opt/miniconda3/envs/panphlan/bin/panphlan_map.py", line 399, in <module>
main()
File "/opt/miniconda3/envs/panphlan/bin/panphlan_map.py", line 389, in main
piling_up(out_bam, is_tmp, tmp_csv.name, args)
File "/opt/miniconda3/envs/panphlan/bin/panphlan_map.py", line 299, in piling_up
os.unlink(bam_file + '.bai')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/panphlan_3_46v4dq.bam.bai'
I’ve tried on 3 different pangenomes, including p.copri with the same results. Using conda - installed panphlan (panphlan 3.0 py_0 bioconda), and the most recent version of samtools installed separately (not sure if it’s related, but I had to jump through some hoops to get samtools installed and working - the conda-installed one was throwing some weird errors)
thanks for using PanPhlAn and pointing out the issue. Your input seems a bit strange to me, why do you have twice the -i argument with different files ? Has it already work with a similar output ?
Otherwise, if the issue comes from duplicated sequences, you might find the duplicates in the contigs file. Try removing them and rebuild the Bowtie2 indexes with the bowtie-build command. Similar problems were fixed like that.
Anyway, let me then know if you still encounter this issue
They’re from paired end reads - I assumed that I could pass both and have them concatenated, since this is what I do with kneaddata, but I just tried it with only one -i and I get the same result.
I just removed the bt2 files from the folder and ran
Then ran the command (with just 1 -i), got the same result.
$ panphlan_map.py -p /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii_pangenome.tsv --indexes /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -i /lovelace/echo/analysis/biobakery3/batch001/output/kneaddata/C0052-5F-1A_S49_kneaddata_paired_2.fastq.gz -o /babbage/echo/panphlan_analysis/Alistipes_finegoldii/C0052_5F_1A.tsv
[I] Bowtie2 is installed
version: 2.3.5.1, path: /usr/bin/bowtie2
[I] Samtools version 1.10; path: /opt/samtools/bin/samtools
[I] input_path
[I] bowtie2 --very-sensitive --no-unal -x /babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii -U - -p 12 --quiet
(ERR): "/babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii" does not exist or is not a Bowtie 2 index
Exiting now ...
[I] Rejected 0 reads over 0 total
Bowtie2 mapping and SAM filtering completed.
[I] Samtools version 1.10; path: /opt/samtools/bin/samtools
[I] samtools view -bS /tmp/panphlan_g9052szw.sam
[main_samview] fail to read the header from "/tmp/panphlan_g9052szw.sam".
samtools sort: failed to read header from "-"
[I] samtools index /tmp/panphlan_5y84vooj.bam
[I] samtools mpileup /tmp/panphlan_5y84vooj.bam > /tmp/panphlan_7d9muezw.csv
samtools index: "/tmp/panphlan_5y84vooj.bam" is in a format that cannot be usefully indexed
[mpileup] fail to read the header of /tmp/panphlan_5y84vooj.bam
Traceback (most recent call last):
File "/opt/miniconda3/bin/panphlan_map.py", line 4, in <module>
__import__('pkg_resources').run_script('panphlan==3.0', 'panphlan_map.py')
File "/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 665, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1470, in run_script
exec(script_code, namespace, namespace)
File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 396, in <module>
File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 386, in main
File "/opt/miniconda3/lib/python3.7/site-packages/panphlan-3.0-py3.7.egg/EGG-INFO/scripts/panphlan_map.py", line 297, in piling_up
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/panphlan_5y84vooj.bam.bai'
Actually the problem seems not to come from samtools tempfiles but rather from above in the analysis : (ERR): "/babbage/panphlan_pangenomes/Alistipes_finegoldii/Alistipes_finegoldii" does not exist or is not a Bowtie 2 index Exiting now ...
PanPhlAn failed because the name of your indexes contains 2 l and the path you provided has only one. Allistipes_finegoldii vs Alistipes_finegoldii .
Let’s try again with the right names. If it work, it would mean that the issue was indeed caused by duplicated sequences in the first place. Otherwise, I’ll go check more deeply
But in any case, you also pointed out that if the bowtie2 step failed, PanPhlAn tries to perform the samtools step nonetheless and this should definitely be fixed. Thanks for that !
I did not - I downloaded the pangenomes using the included script - I didn’t know this would be necessary. I don’t think that’s actually filtering unique sequences, it’s mostly just eliminating strings of Ns.
I checked the database again.
For some strange reason, all reference genome are included twice, thus doubling the amount of sequences… I’ll go back in the database building script…
Meanwhile, I was able to get rid of the duplicated lines with sort Alistipes_finegoldii_pangenome_contigs.fna | uniq.
It is still strange for some species have duplicate, other not. Sometimes it causes troubles, sometimes not.
Anyway thanks a lot for pointing out the importance of the problem.
I am having exactly the same errors as described above for the E. coli pangenome. I also tried the ‘sort’ solution offered above, but indeed also get a malformed FASTA file. Is there by any chance already a workaround for this issue?
I’m currently working on fixing the database duplicates, a temporary solution seems to be to remove the second half of the fna file (contigs have been written twice in some cases) :
Some pangenome of the database might have a problem of duplicated sequences leading to an error raised during the mapping step : [E::sam_hrecs_update_hashes] Duplicate entry “XXX” in sam header".
In these cases, better check the duplication comparing : wc -l [species_name]_pangenome_contigs.fna and sort [species_name]_pangenome_contigs.fna | uniq | wc -l
that should give roughly half of the previous number.
Then just cut the fna file in half (using for example head -[half of the lines in the fna file] [species_name]_pangenome_contigs.fna > new_contigs.fna ) and then regenerate the indexes using bowtie2-build.
Hope this can fix your problem quickly. I hope the curated database will be available at the end of the week.