PanPhlAn_pangenome_exporter issue while writing .tsv

drelo · January 21, 2021, 8:07pm

Dear bioBakery, after I fixed a previous issue and installed the correct version of diamond, the analysis ran fine until close to the end, where it seems it cannot parse the table (I used a dummy set of 3 individuals).
I got an error message at the end…

This is the last part of the output:

```
Closing the input file...  [7e-06s]
Closing the output file...  [2.2e-05s]
Closing the database file...  [3e-06s]
Deallocating taxonomy...  [1e-06s]
Total time = 335.776s
Reported 36612 pairwise alignments, 36662 HSPs.
3578 queries aligned.
Parsing results file:
  trash1/tmp/uniref/RT078_CDM120/tmp/RT078_CDM120.faa.uniref50.hits
Writing new output file:
  trash1/tmp/uniref/RT078_CDM120/RT078_CDM120.faa
Summary of annotations:
  Genes in input FASTA: 3,594
  UniRef90 codes assigned: 3,427 (95.4%)
  UniRef50 codes assigned: 3,465 (96.4%)
  UniRef50 codes inferred from UniRef90 codes: 0 (0.0%)
Finished successfully.

Thu Jan 21 12:53:43 2021 Done.
Thu Jan 21 12:53:43 2021 Clustering unnanotated proteins at UniRef90 level...['mmseqs', 'createdb', 'trash1/tmp/unannotated/unannotated_90.faa', 'trash1/tmp/mmseq/db/unannotated_90']
['mmseqs', 'cluster', 'trash1/tmp/mmseq/db/unannotated_90', 'trash1/tmp/mmseq/db_clustered/unannotated_90', 'trash1/tmp/mmseq/tmp', '-c', '0.8', '--min-seq-id', '0.9', '--threads', '6']
['mmseqs', 'createtsv', 'trash1/tmp/mmseq/db/unannotated_90', 'trash1/tmp/mmseq/db/unannotated_90', 'trash1/tmp/mmseq/db_clustered/unannotated_90', 'trash1/tmp/unannotated_90.clustered.tsv', '--threads', '6']

Thu Jan 21 12:53:44 2021 Done.
Thu Jan 21 12:53:44 2021 Clustering unnanotated proteins at UniRef50 level...['mmseqs', 'createdb', 'trash1/tmp/unannotated/unannotated_50.faa', 'trash1/tmp/mmseq/db/unannotated_50']
['mmseqs', 'cluster', 'trash1/tmp/mmseq/db/unannotated_50', 'trash1/tmp/mmseq/db_clustered/unannotated_50', 'trash1/tmp/mmseq/tmp', '-c', '0.8', '--min-seq-id', '0.5', '--threads', '6']
['mmseqs', 'createtsv', 'trash1/tmp/mmseq/db/unannotated_50', 'trash1/tmp/mmseq/db/unannotated_50', 'trash1/tmp/mmseq/db_clustered/unannotated_50', 'trash1/tmp/unannotated_50.clustered.tsv', '--threads', '6']

Thu Jan 21 12:53:47 2021 Done.
Thu Jan 21 12:53:47 2021 Reannotating genomes...
Thu Jan 21 12:56:00 2021 Done.
Thu Jan 21 12:56:00 2021 Writing PanPhlAn tsv...Traceback (most recent call last):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 47, in __init__
    self.stream = open(source, "r" + mode)
TypeError: expected str, bytes or os.PathLike object, not FakeHandle

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./panphlan_exporter.py", line 520, in <module>
    panphlan_exporter(args.input, args.tmp, args.output, args.clade_name, args.nprocs, args.db_path)
  File "./panphlan_exporter.py", line 501, in panphlan_exporter
    write_panphlan_tsv(inputdir, tmp_dir, ppa_outdir, clade_name, contigs_names_dict, contigs_names_dict_prokka, extend_pangenome)
  File "./panphlan_exporter.py", line 425, in write_panphlan_tsv
    for rec in GFF.parse(gff_file, limit_info=dict(gff_type = ['CDS'])):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 745, in parse
    for rec in parser.parse_in_parts(gff_files, base_dict, limit_info,
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 322, in parse_in_parts
    for results in self.parse_simple(gff_files, limit_info, target_lines):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 343, in parse_simple
    for results in self._gff_process(gff_files, limit_info, target_lines):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 637, in _gff_process
    for out in self._lines_to_out_info(line_gen, limit_info, target_lines):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 699, in _lines_to_out_info
    fasta_recs = self._parse_fasta(FakeHandle(line_iter))
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 560, in _parse_fasta
    return list(SeqIO.parse(in_handle, "fasta"))
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 607, in parse
    return iterator_generator(handle)
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 183, in __init__
    super().__init__(source, mode="t", fmt="Fasta")
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 51, in __init__
    if source.read(0) != "":
TypeError: read() takes 1 positional argument but 2 were given```

Sorry for posting another question so soon but I couldn’t diagnose the issue by myself. Thanks for the help.

leonard.dubois · January 22, 2021, 9:19am

Hello,

It seems the exporter has trouble opening the temporary GFF file in order to write the PanPhlAn tsv file.
What command line did you use? Did you specify a tmp location?

Actually, we should improve the software by making some arguments mandatory…

drelo · January 22, 2021, 1:39pm

Dear Leonard, thanks for your help. I used the -t flag for other runs but omitted it in the run described above.

Now I did two things:

1. I fixed tbl2asn by downloading it directly from NCBI, reinstalling prokka from conda, and switching the executable in the panphlan environment.
2. I reran it using a temp folder: nohup python ./panphlan_exporter.py --input dummy3 --output oxxx -t T2 -d . -n 7 > OUT.txt &

I got the same error.
Please find OUT.txt attached.

OUT.txt (261.7 KB)

Here are the first lines of one of the .tsv files inside the prokka folder:

| locus_tag | ftype | length_bp | gene | EC_number | COG | product |
| --- | --- | --- | --- | --- | --- | --- |
| DIAHKAHI_00001 | CDS | 1320 | dnaA | | COG0593 | Chromosomal replication initiator protein DnaA |
| DIAHKAHI_00001 | gene | 1320 | dnaA | | | |
| DIAHKAHI_00002 | CDS | 1107 | dnaN | | | Beta sliding clamp |
| DIAHKAHI_00002 | gene | 1107 | dnaN | | | |
| DIAHKAHI_00003 | CDS | 207 | | | | hypothetical protein |

leonard.dubois · January 22, 2021, 2:32pm

Hi,

I’m trying to identify your problem. Everything seems to go well until the program tries to open the annotation GFF file.
Could you tell me what your T2/tmp/annotations/ folder contains?

There should be a couple of gff.bz2 files…
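If it helps, here is a small sketch for checking that folder from Python. It only assumes that the annotation files end in `.gff.bz2`, as mentioned above; the function name `summarize_annotations` is made up for this example and is not part of the exporter.

```python
import bz2
from pathlib import Path


def summarize_annotations(folder):
    """Return (file name, first line) for every *.gff.bz2 file in `folder`."""
    summary = []
    for path in sorted(Path(folder).glob("*.gff.bz2")):
        # Open in text mode so the first line comes back as str, not bytes
        with bz2.open(path, "rt") as handle:
            first_line = handle.readline().rstrip("\n")
        summary.append((path.name, first_line))
    return summary


if __name__ == "__main__":
    # "T2/tmp/annotations" is the folder named in this thread; adjust it to
    # whatever you passed with -t
    for name, first_line in summarize_annotations("T2/tmp/annotations"):
        print(name, "->", first_line)
```

An empty result would mean no `.gff.bz2` files were found in that folder.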

leonard.dubois · January 22, 2021, 3:18pm

OK, I had a chat with people from the lab. The issue actually comes from an incompatibility between Biopython and the GFF parser. Have a look at this similar problem: polymut.py error reading gff file ? · Issue #4 · SegataLab/cmseq · GitHub
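To illustrate what the two errors in the traceback suggest, here is a minimal sketch that reproduces the same pattern without Biopython or bcbio-gff. `NoArgReadHandle` and `open_like_seqio` are hypothetical stand-ins written for this example. They are not the real `FakeHandle` or the real `SeqIO` code, and only mimic the two steps visible in the traceback.

```python
import io


class NoArgReadHandle:
    """Hypothetical stand-in for a wrapper whose read() accepts no size."""

    def __init__(self, lines):
        self._lines = iter(lines)

    def read(self):
        return "".join(self._lines)


def open_like_seqio(source):
    """Mimic the traceback: try `source` as a path, then probe it as a handle."""
    try:
        return open(source, "rt")
    except TypeError:
        # Not a path, so assume a file-like object and probe it with read(0)
        if source.read(0) != "":
            raise ValueError("expected a text-mode handle")
        return source


def probe(source):
    """Return 'ok' if the source is accepted, else the TypeError message."""
    try:
        open_like_seqio(source)
    except TypeError as err:
        return str(err)
    return "ok"


if __name__ == "__main__":
    print(probe(io.StringIO(">seq1\nACGT\n")))       # a normal handle is accepted
    print(probe(NoArgReadHandle([">seq1\n", "ACGT\n"])))  # read(0) is rejected
```

The second call fails because `read(0)` passes a size argument to a `read()` that takes none, which matches the final `TypeError` in the log above.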

drelo · January 22, 2021, 5:44pm

Sorry, I tried to fix this like they did in that issue, but I hit the same rock again…
I uninstalled Biopython, then ran:
pip install Biopython==1.78
pip install bcbio-gff

The error looks the same as the previous one. I will now try to build the environment from scratch to see if that helps. Edit: I started from scratch in a new environment, fixing diamond, Biopython, tbl2asn and the blast version, and I got the same error.

Fri Jan 22 14:30:38 2021 Done.
Fri Jan 22 14:30:38 2021 Clustering unnanotated proteins at UniRef90 level...['mmseqs', 'createdb', 'T22/tmp/unannotated/unannotated_90.faa', 'T22/tmp/mmseq/db/unannotated_90']
['mmseqs', 'cluster', 'T22/tmp/mmseq/db/unannotated_90', 'T22/tmp/mmseq/db_clustered/unannotated_90', 'T22/tmp/mmseq/tmp', '-c', '0.8', '--min-seq-id', '0.9', '--threads', '7']
['mmseqs', 'createtsv', 'T22/tmp/mmseq/db/unannotated_90', 'T22/tmp/mmseq/db/unannotated_90', 'T22/tmp/mmseq/db_clustered/unannotated_90', 'T22/tmp/unannotated_90.clustered.tsv', '--threads', '7']

Fri Jan 22 14:30:39 2021 Done.
Fri Jan 22 14:30:39 2021 Clustering unnanotated proteins at UniRef50 level...['mmseqs', 'createdb', 'T22/tmp/unannotated/unannotated_50.faa', 'T22/tmp/mmseq/db/unannotated_50']
['mmseqs', 'cluster', 'T22/tmp/mmseq/db/unannotated_50', 'T22/tmp/mmseq/db_clustered/unannotated_50', 'T22/tmp/mmseq/tmp', '-c', '0.8', '--min-seq-id', '0.5', '--threads', '7']
['mmseqs', 'createtsv', 'T22/tmp/mmseq/db/unannotated_50', 'T22/tmp/mmseq/db/unannotated_50', 'T22/tmp/mmseq/db_clustered/unannotated_50', 'T22/tmp/unannotated_50.clustered.tsv', '--threads', '7']

Fri Jan 22 14:30:42 2021 Done.
Fri Jan 22 14:30:42 2021 Reannotating genomes...
Fri Jan 22 14:32:47 2021 Done.
Fri Jan 22 14:32:47 2021 Writing PanPhlAn tsv...Traceback (most recent call last):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 47, in __init__
    self.stream = open(source, "r" + mode)
TypeError: expected str, bytes or os.PathLike object, not FakeHandle

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./panphlan_exporter.py", line 520, in <module>
    panphlan_exporter(args.input, args.tmp, args.output, args.clade_name, args.nprocs, args.db_path)
  File "./panphlan_exporter.py", line 501, in panphlan_exporter
    write_panphlan_tsv(inputdir, tmp_dir, ppa_outdir, clade_name, contigs_names_dict, contigs_names_dict_prokka, extend_pangenome)
  File "./panphlan_exporter.py", line 425, in write_panphlan_tsv
    for rec in GFF.parse(gff_file, limit_info=dict(gff_type = ['CDS'])):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 745, in parse
    for rec in parser.parse_in_parts(gff_files, base_dict, limit_info,
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 322, in parse_in_parts
    for results in self.parse_simple(gff_files, limit_info, target_lines):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 343, in parse_simple
    for results in self._gff_process(gff_files, limit_info, target_lines):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 637, in _gff_process
    for out in self._lines_to_out_info(line_gen, limit_info, target_lines):
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 699, in _lines_to_out_info
    fasta_recs = self._parse_fasta(FakeHandle(line_iter))
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 560, in _parse_fasta
    return list(SeqIO.parse(in_handle, "fasta"))
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 607, in parse
    return iterator_generator(handle)
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 183, in __init__
    super().__init__(source, mode="t", fmt="Fasta")
  File "/home/andrespara/anaconda3/envs/panphlan/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 51, in __init__
    if source.read(0) != "":
TypeError: read() takes 1 positional argument but 2 were given

leonard.dubois · January 25, 2021, 8:26am

Hello,

The issue says that this happens with Biopython versions newer than 1.76.
It seems you used version 1.78, so it makes sense that you got the same error.
Just try 1.76.
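As a quick way to check which version an environment actually has, something like the sketch below could be run inside it. The 1.76 threshold comes from the linked issue as described above; the helper names `parse_version` and `is_newer_than_last_working` are invented for this example, and the check says nothing about other package combinations.

```python
import re

# Per this thread: Biopython versions newer than 1.76 triggered the error
LAST_WORKING = (1, 76)


def parse_version(text):
    """Turn a version string such as '1.78' into a tuple of ints: (1, 78)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer_than_last_working(version_text):
    """True if the given version is newer than the last one reported to work."""
    return parse_version(version_text) > LAST_WORKING


if __name__ == "__main__":
    from importlib import metadata  # standard library since Python 3.8

    try:
        installed = metadata.version("biopython")
    except metadata.PackageNotFoundError:
        print("Biopython is not installed in this environment")
    else:
        if is_newer_than_last_working(installed):
            print(f"Biopython {installed} is newer than 1.76; the error may occur")
        else:
            print(f"Biopython {installed} should be fine according to this thread")
```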

drelo · January 25, 2021, 11:46am

I used 1.76 and now it works fine.
Thanks for the help
