How many markers are included in the PhyloPhlAn concatenated alignment?

Hello,
I am using PhyloPhlAn to create a tree from several of my own MAGs, as well as complete genomes and publicly available MAGs and SAGs. As most of the MAGs and SAGs are not complete, I wanted to know how many of the 400 marker genes in the PhyloPhlAn database were found in each included genome, and how many genes were used to create the trees. I am not seeing a summary or log file with that information. Is there a file that gives it, or is there another way to find it out?
Thanks!
Alma

Hello Alma, you’re right that PhyloPhlAn does not output a report about this, and I will consider adding one in the next release.
However, you can get this info with a bit of bash using the files PhyloPhlAn generated in the tmp folder.

Depending on the configuration of PhyloPhlAn, you will have either the folder markers_dna or markers_aa, or both. If you have both, use the one that was generated last. Inside you’ll find one compressed file for each of your inputs, and you can count the number of markers each input mapped to with:

for i in *.bz2; do printf '%s\t' "$i"; bzgrep -c "^>" "$i"; done > ../../markers_mapped_by_inputs.tsv

Assuming you’re inside one of the two folders above, this code will write markers_mapped_by_inputs.tsv to the output folder, with one line per input: the file name, a tab, and the number of markers found.
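Once you have that table, a quick way to spot very incomplete MAGs or SAGs is to list the inputs that fall below some marker count. This is only a sketch: the function name low_marker_inputs and the cutoff of 100 are my own placeholders, not something PhyloPhlAn defines, and it assumes the two-column, tab-separated layout produced by the loop above.

```shell
# Hypothetical helper (not part of PhyloPhlAn): print the inputs whose marker
# count is below a cutoff, sorted from the fewest markers to the most.
# $1 = path to the TSV (file name <TAB> marker count)
# $2 = cutoff (optional; 100 is an arbitrary default chosen for illustration)
low_marker_inputs() {
    awk -F'\t' -v min="${2:-100}" '($2 + 0) < (min + 0)' "$1" \
        | sort -t "$(printf '\t')" -k2,2n
}
```

For example, from the output folder: low_marker_inputs markers_mapped_by_inputs.tsv 100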

If you want to know how many of the 400 universal markers in the PhyloPhlAn database were used to build your phylogeny, you can check the following. Depending on the parameters you specified for PhyloPhlAn, you can have different trimming steps (or no trimming at all). Look in the tmp folder and take whichever of the following folders was generated last:

  • msas
  • trim_gap_trim
  • trim_gap_perc
  • trim_not_variant
  • sub

Then count the files it contains. That count is the number of markers used to build your phylogeny.
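Counting the files can be done along these lines. This is a minimal sketch: count_markers_used is a name I made up, it assumes the folder holds one file per marker as described above, and you have to pass it the folder you identified as the latest one.

```shell
# Hypothetical helper (not part of PhyloPhlAn): count the regular files that
# sit directly inside a folder, ignoring any subfolders.
# $1 = path to the folder, e.g. the latest of msas, trim_gap_trim, ...
count_markers_used() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

For example, from inside the tmp folder, if trim_gap_perc is the latest one you have: count_markers_used trim_gap_perc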

Please let me know if anything is unclear.

Many thanks,
Francesco