[e] gene_markers_selection crashed (SOLUTION)

Hi,

Thank you for developing this great tool!

I was running PhyloPhlAn version 3.1.68 to create a phylogenomic tree using the command:

phylophlan -i Protein_files/ -o Phylogenomic_tree/ -f supermatrix_aa.cfg --diversity high --fast --nproc 8 -d phylophlan --proteome_extension .faa --verbose

However, I got the following error:

Selecting "phylophlan_test/tmp/map_aa/0BWT1.b6o.bkp"

[e] expected str, bytes or os.PathLike object, not NoneType

[e] gene_markers_selection crashed

After rerunning the command above with 1 thread, I was able to identify the specific *.b6o.bkp files that caused the error. These files were corrupted: the last line of the diamond output table was incomplete. Rows in these files are supposed to be 12 columns wide, but in the corrupted files the last row had fewer (e.g., 1, 2, 3, or 9 columns). Apparently, when aligning the genome protein files to the markers, the process stopped before writing the whole output. Such files cannot be processed by the “best_hit” function (line 1337 of the phylophlan.py script), since it relies on column position to identify each variable, and this causes the error.

I removed only those corrupted files and reran the phylophlan command, and the alignment completed correctly this time. So it seems to be a random, non-reproducible error (maybe a problem with the multiprocessing module?). However, it also seems to be a recurring problem that several users have reported before.

It would be very useful if you could add a few lines of code to check file integrity before proceeding to marker selection and, if the last row of a file does not have 12 columns, rerun the alignment. I think it would fit in the “gene_markers_identification_rec” function (line 1246 of the phylophlan.py script).
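To illustrate the kind of check I mean, here is a minimal sketch in Python. It is not taken from phylophlan.py: the function name `last_row_is_complete` and the constant `EXPECTED_COLUMNS` are hypothetical, and it assumes the *.b6o.bkp files are plain text with tab-separated columns. An empty file is treated as "no hits" rather than as truncated, which is also an assumption.

```python
EXPECTED_COLUMNS = 12  # column count of the diamond output table described above


def last_row_is_complete(path, expected=EXPECTED_COLUMNS):
    """Return True if the last non-empty line of `path` has `expected` fields.

    Hypothetical helper, not part of PhyloPhlAn. Assumes a plain-text,
    tab-separated file.
    """
    last = None
    with open(path) as handle:
        for line in handle:
            if line.strip():  # remember the most recent non-blank line
                last = line
    if last is None:
        # Assumption: an empty file means "no hits", not a truncated write.
        return True
    return len(last.rstrip("\n").split("\t")) == expected
```

A file for which this returns False could then be deleted and its alignment rerun before marker selection starts. Note that this only catches rows cut off between columns; a row cut off inside the 12th column would still pass.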

As a temporary workaround, I would suggest that users facing the same error use this line of code to identify corrupted *.b6o.bkp files:

out=Phylogenomic_tree/ # Set the phylophlan output folder
for i in "$out"/tmp/map_aa/*.bkp; do ncol=$(tail -n 1 "$i" | awk '{print NF}'); if [[ -n $ncol ]] && (( ncol < 12 )); then echo "$i" "$ncol"; fi; done

It will print to the screen the files with fewer than 12 columns in the last row, which should be removed before rerunning phylophlan.

Hope it helps!

Borja