PhyloPhlAn seems to be unable to find UniRef sequences (perhaps a change in the URL structure in UniProt?). This simple code reproduces the issue:
phylophlan_setup_database -g s__Xanthomonas_citri --verbose 2>&1 | tee log/phylophlan_setup_database.log
And the output shows multiple failed downloads:
...
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A2H1SFM2.fasta" to "s__Xanthomonas_citri/A0A2H1SFM2.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A2H1SFM2.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_H8FBN8.fasta" to "s__Xanthomonas_citri/H8FBN8.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_H8FBN8.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A0W7Y4U0.fasta" to "s__Xanthomonas_citri/A0A0W7Y4U0.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A0W7Y4U0.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_D4T8I5.fasta" to "s__Xanthomonas_citri/D4T8I5.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_D4T8I5.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_Q5H0C9.fasta" to "s__Xanthomonas_citri/Q5H0C9.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_Q5H0C9.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A2S6YIG2.fasta" to "s__Xanthomonas_citri/A0A2S6YIG2.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A2S6YIG2.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A2H1S9C3.fasta" to "s__Xanthomonas_citri/A0A2H1S9C3.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A2H1S9C3.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_Q3BMB6.fasta" to "s__Xanthomonas_citri/Q3BMB6.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_Q3BMB6.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A0H2X5L3.fasta" to "s__Xanthomonas_citri/A0A0H2X5L3.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A0H2X5L3.fasta"
...
Strangely, most but not all downloads fail. I double-checked, and the URLs indeed point to a page with the error:
Re-trying to download 2457 core proteins that just failed, please wait as it might take some time
[e] unable convert UniProtKB ID to UniRef90 ID
Traceback (most recent call last):
File "/home/c718/c7181116/.conda/envs/phylophlan/bin/phylophlan_setup_database", line 10, in <module>
sys.exit(phylophlan_setup_database())
File "/home/c718/c7181116/.conda/envs/phylophlan/lib/python3.10/site-packages/phylophlan/phylophlan_setup_database.py", line 407, in phylophlan_setup_database
get_core_proteins(taxa2core_file_latest, args.get_core_proteins, args.output, args.output_extension, verbose=args.verbose)
File "/home/c718/c7181116/.conda/envs/phylophlan/lib/python3.10/site-packages/phylophlan/phylophlan_setup_database.py", line 333, in get_core_proteins
for uniref90_id in (x[1].split('_')[-1] for x in uniprotkb2uniref90[1:]):
UnboundLocalError: local variable 'uniprotkb2uniref90' referenced before assignment
I have the same error and have made sure to download the newest version available on the GitHub in its own environment. For me it basically stalls partway through the operation after several failures to download. It’s never run long enough to fail with a specific message but I can leave it up and see when it does. It’s downloaded only ~100 of the 2131 core proteins it initially identified.
I faced the same issue using PhyloPhlAn v. 3.0.67 (as installed from conda).
I cloned the GitHub repository at the latest commit and ran phylophlan_setup_database again, but it still fails to download several proteins. When I open the URLs, I am also redirected to an empty page with the text
“Error messages
Resource not found”
Hi, I thought I would post the workaround I used; hopefully it is useful. I downloaded the proteins from the bash command line using the UniProt REST API.
Parse the log file to get the IDs of the proteins whose download failed:
cat phylophlan_setup_database.log | grep "^\[e\]" | cut -f 2 -d '_' | cut -d '.' -f 1 | head -n -1 > proteins_to_rematch.txt
Use the API to match each ID to the updated UniRef90 representative (the output folder must match the one read in the next step):
for i in $(cat proteins_to_rematch.txt); do curl "https://rest.uniprot.org/uniref/search?query=(uniprot_id:"$i")%20AND%20(identity:0.9)&format=fasta" > rematched_proteins/$i.faa; done
Parse the downloaded FASTA files to get the ID of the actual representative (the filenames still carry the obsolete representatives):
for i in $(ls rematched_proteins); do cat rematched_proteins/$i | head -1 | cut -f 2 -d '_' | cut -f 1 -d ' ' | paste <(echo $i) -; done > rename_rematched_proteins.txt
Filter out empty files (in my case about 15%, mostly deleted entries):
awk -F '\t' '$2 != ""' rename_rematched_proteins.txt | sponge rename_rematched_proteins.txt
Rename the files that contain successfully downloaded representatives and move them to the database folder:
while read a b; do mv rematched_proteins/$a s__Cupriavidus_necator_fail/$b.faa; done < rename_rematched_proteins.txt
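For anyone who prefers Python over the shell pipeline, the parsing parts of this workaround can be sketched roughly as follows. This is only a minimal sketch: the function names (failed_ids, search_url, representative_id) are made up for illustration, the log format is assumed to match the "[e] unable to download" lines quoted in this thread, and the FASTA header is assumed to start with ">UniRef90_<ID>" as the cut commands above imply. It performs no network access; fetching each URL is left to curl or a similar tool.

```python
import re
from urllib.parse import quote

# Assumed log format, taken from the failure lines quoted in this thread:
# [e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A2H1SFM2.fasta"
FAILED_RE = re.compile(r'^\[e\] unable to download ".*/UniRef90_([A-Z0-9]+)\.fasta"')


def failed_ids(log_text):
    """Return the IDs of failed downloads, in log order, without duplicates."""
    seen = {}
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


def search_url(uniprot_id):
    """Build the same UniRef search URL that the curl command above requests."""
    query = f"(uniprot_id:{uniprot_id}) AND (identity:0.9)"
    # Keep parentheses and colons literal; spaces become %20.
    encoded = quote(query, safe="():")
    return f"https://rest.uniprot.org/uniref/search?query={encoded}&format=fasta"


def representative_id(fasta_text):
    """Return the ID after 'UniRef90_' in the first header, or None if there is none."""
    lines = fasta_text.splitlines()
    if not lines:
        return None  # empty response, e.g. a deleted entry
    match = re.match(r">UniRef90_(\S+)", lines[0])
    return match.group(1) if match else None
```

Because the pattern only matches "unable to download" lines, the final "[e] unable convert UniProtKB ID to UniRef90 ID" line is skipped without needing head -n -1.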
Hello, same problem here with the phylophlan-3.0.3 tagged archive:
rpm_maker:examples/01_saureus > phylophlan_setup_database --version
phylophlan_setup_database.py version 3.0.23 (27 November 2020)
After running the command described in the tutorial:
rpm_maker:examples/01_saureus > phylophlan_setup_database -g s__Staphylococcus_aureus --verbose 2>&1 | tee logs/phylophlan_setup_database.log
I got the following error:
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A389G6R0.fasta"
Re-trying to download 1436 core proteins that just failed, please wait as it might take some time
[e] unable convert UniProtKB ID to UniRef90 ID
Traceback (most recent call last):
File "/opt/gensoft/exe/phylophlan/3.0.3/bin/phylophlan_setup_database", line 33, in <module>
sys.exit(load_entry_point('PhyloPhlAn==3.0.3', 'console_scripts', 'phylophlan_setup_database')())
File "/opt/gensoft/exe/phylophlan/3.0.3/venv/lib/python3.8/site-packages/PhyloPhlAn-3.0.3-py3.8.egg/phylophlan/phylophlan_setup_database.py", line 407, in phylophlan_setup_database
get_core_proteins(taxa2core_file_latest, args.get_core_proteins, args.output, args.output_extension, verbose=args.verbose)
File "/opt/gensoft/exe/phylophlan/3.0.3/venv/lib/python3.8/site-packages/PhyloPhlAn-3.0.3-py3.8.egg/phylophlan/phylophlan_setup_database.py", line 333, in get_core_proteins
for uniref90_id in (x[1].split('_')[-1] for x in uniprotkb2uniref90[1:]):
UnboundLocalError: local variable 'uniprotkb2uniref90' referenced before assignment
There seems to be a logic problem in the code: uniprotkb2uniref90 is assigned inside a try block, which fails with error code 405. Even when the try fails, the code continues and tries to use uniprotkb2uniref90, which in that case was never assigned.
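To illustrate the pattern being described, here is a small standalone sketch. It is not the PhyloPhlAn source: fetch_mapping is a hypothetical stand-in for the request that fails with HTTP 405, and both functions are invented for illustration. The first reproduces the UnboundLocalError; the second shows one possible guard.

```python
def fetch_mapping(ids):
    """Hypothetical stand-in for the ID-mapping request; always fails here."""
    raise RuntimeError("HTTP Error 405")


def uniref90_ids_unsafe(ids):
    try:
        mapping = fetch_mapping(ids)  # never assigned when the call raises
    except Exception as err:
        print(f"[e] unable convert UniProtKB ID to UniRef90 ID ({err})")
    # Execution continues after the except block, so 'mapping' is unbound here.
    return [row[1].split("_")[-1] for row in mapping[1:]]


def uniref90_ids_guarded(ids):
    mapping = None  # bind the name before the try block
    try:
        mapping = fetch_mapping(ids)
    except Exception as err:
        print(f"[e] unable convert UniProtKB ID to UniRef90 ID ({err})")
    if mapping is None:
        return []  # nothing to retry; the caller decides how to proceed
    return [row[1].split("_")[-1] for row in mapping[1:]]
```

The guard only avoids the crash. The downloads would still fail until the request itself targets an endpoint that UniProt currently serves.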
Maybe the UniProt REST id_mapping examples can help.