Unable to download proteins

PhyloPhlAn seems to be unable to find UniRef sequences (perhaps a change in the URL structure in UniProt?). This simple code reproduces the issue:

phylophlan_setup_database -g s__Xanthomonas_citri  --verbose 2>&1 | tee log/phylophlan_setup_database.log

The output shows many failed downloads:

...
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A2H1SFM2.fasta" to "s__Xanthomonas_citri/A0A2H1SFM2.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A2H1SFM2.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_H8FBN8.fasta" to "s__Xanthomonas_citri/H8FBN8.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_H8FBN8.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A0W7Y4U0.fasta" to "s__Xanthomonas_citri/A0A0W7Y4U0.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A0W7Y4U0.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_D4T8I5.fasta" to "s__Xanthomonas_citri/D4T8I5.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_D4T8I5.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_Q5H0C9.fasta" to "s__Xanthomonas_citri/Q5H0C9.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_Q5H0C9.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A2S6YIG2.fasta" to "s__Xanthomonas_citri/A0A2S6YIG2.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A2S6YIG2.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A2H1S9C3.fasta" to "s__Xanthomonas_citri/A0A2H1S9C3.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A2H1S9C3.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_Q3BMB6.fasta" to "s__Xanthomonas_citri/Q3BMB6.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_Q3BMB6.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A0H2X5L3.fasta" to "s__Xanthomonas_citri/A0A0H2X5L3.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A0H2X5L3.fasta"
...

Strangely, most but not all downloads fail. I double-checked, and the URLs indeed point to a page with the error:

Error messages
Resource not found
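
To put a number on how many downloads fail, the log can be tallied with a few lines of shell. This is only a minimal sketch: it assumes the log keeps the two line formats shown in the excerpt above (attempts start with `Downloading "`, failures start with `[e] unable to download`), and the function name `summarize_downloads` is hypothetical, not part of PhyloPhlAn. Retried downloads are counted again, so the numbers are per attempt rather than per protein.

```shell
# Hypothetical helper: tally download attempts and failures in a
# phylophlan_setup_database log. Assumes the line formats shown in the
# log excerpt; it is not part of PhyloPhlAn itself.
summarize_downloads() {
    local log="$1"
    local attempts failures
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under "set -e".
    attempts=$(grep -c '^Downloading "' "$log" || true)
    failures=$(grep -c '^\[e\] unable to download ' "$log" || true)
    printf 'attempts=%s failures=%s\n' "$attempts" "$failures"
}
```

For example, `summarize_downloads log/phylophlan_setup_database.log` prints a single line with both counts.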

Thanks!
Miguel.

BTW, the run finally fails with the message:

Re-trying to download 2457 core proteins that just failed, please wait as it might take some time
[e] unable convert UniProtKB ID to UniRef90 ID
Traceback (most recent call last):
  File "/home/c718/c7181116/.conda/envs/phylophlan/bin/phylophlan_setup_database", line 10, in <module>
    sys.exit(phylophlan_setup_database())
  File "/home/c718/c7181116/.conda/envs/phylophlan/lib/python3.10/site-packages/phylophlan/phylophlan_setup_database.py", line 407, in phylophlan_setup_database
    get_core_proteins(taxa2core_file_latest, args.get_core_proteins, args.output, args.output_extension, verbose=args.verbose)
  File "/home/c718/c7181116/.conda/envs/phylophlan/lib/python3.10/site-packages/phylophlan/phylophlan_setup_database.py", line 333, in get_core_proteins
    for uniref90_id in (x[1].split('_')[-1] for x in uniprotkb2uniref90[1:]):
UnboundLocalError: local variable 'uniprotkb2uniref90' referenced before assignment

Dear @lrr, thanks for reporting this. I believe this is the same issue reported in "UnboundLocalError: local variable 'uniprotkb2uniref90' referenced before assignment" (Issue #98, biobakery/phylophlan on GitHub).
Briefly, UniProt changed the way it handles ID conversion, so I have just finished implementing and testing the new UniProt APIs, which should solve this issue (commit c64a75f in biobakery/phylophlan, "updating UniRef90 IDs retrieval using latest APIs from UniProt").
Can you please pull the latest version from the PhyloPhlAn repo and check whether the issue is solved?

Many thanks,
Francesco

Hi Francesco,

I have the same error, and I have made sure to download the newest version available on GitHub into its own environment. For me it stalls partway through the operation after several failed downloads. It has never run long enough to fail with a specific message, but I can leave it running and see when it does. So far it has downloaded only ~100 of the 2131 core proteins it initially identified.

Best,
Margot

Hi,

I faced the same issue using PhyloPhlAn v3.0.67 (as installed from conda).

I cloned the GitHub repository at the latest commit and ran phylophlan_setup_database again, but it is still unable to download several proteins. When I click on the URLs, I am also redirected to an empty page with the text
“Error messages
Resource not found”

Best,
Maria Silvia

Hi, I thought I would post the workaround I used; hopefully it is useful. I downloaded the proteins from the bash command line using the UniProt REST API.

  1. Parse the log file to get the IDs of the proteins whose download failed (head -n -1 drops the last "[e]" line, which in the log above is the final ID-conversion error rather than a failed download):
    grep "^\[e\]" phylophlan_setup_database.log | cut -f 2 -d '_' | cut -d '.' -f 1 | head -n -1 > proteins_to_rematch.txt

  2. Use the API to match each ID to the updated UniRef90 representative:
    mkdir -p rematched_proteins
    for i in $(cat proteins_to_rematch.txt); do curl "https://rest.uniprot.org/uniref/search?query=(uniprot_id:${i})%20AND%20(identity:0.9)&format=fasta" > rematched_proteins/$i.faa; done

  3. Parse the downloaded FASTA files to get the ID of the actual representative (the filenames still carry the obsolete representative IDs):
    for i in $(ls rematched_proteins); do head -1 rematched_proteins/$i | cut -f 2 -d '_' | cut -f 1 -d ' ' | paste <(echo $i) -; done > rename_rematched_proteins.txt
    Filter out empty files (in my case about 15%, mostly deleted entries); sponge is part of moreutils:
    awk -F '\t' '$2 != ""' rename_rematched_proteins.txt | sponge rename_rematched_proteins.txt

  4. Rename the files that contain successfully downloaded representatives and move them to the database folder:
    while read a b; do mv rematched_proteins/$a s__Cupriavidus_necator_fail/$b.faa; done < rename_rematched_proteins.txt
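
Steps 3 and 4 above can also be folded into two small shell functions that skip empty downloads directly, which avoids the intermediate table and the sponge dependency. This is a sketch under stated assumptions, not a tested replacement: it assumes each downloaded file starts with a header of the form `>UniRef90_<ID> <description>`, the function names `representative_id` and `rename_rematched` are hypothetical, and it copies files instead of moving them so the originals stay in place.

```shell
# Hypothetical helper: print the UniRef90 representative ID taken from the
# first header line of a FASTA file. Prints nothing for an empty file or a
# header that does not look like ">UniRef90_<ID> <description>" (assumed format).
representative_id() {
    head -n 1 "$1" | sed -n 's/^>UniRef90_\([^ ]*\).*/\1/p'
}

# Hypothetical helper: copy every non-empty re-matched file from the source
# directory into the destination directory, named after its representative.
rename_rematched() {
    local src="$1" dest="$2" f id
    mkdir -p "$dest"
    for f in "$src"/*.faa; do
        [ -s "$f" ] || continue      # skip empty downloads (e.g. deleted entries)
        id=$(representative_id "$f")
        [ -n "$id" ] || continue     # skip files without a recognizable header
        cp "$f" "$dest/$id.faa"
    done
}
```

With the directory names used above, the call would be `rename_rematched rematched_proteins s__Cupriavidus_necator_fail`.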