I am using StrainPhlAn 3 and would like to add multiple reference genomes as outgroups, but I am having trouble adding more than one reference genome to a tree (adding a single reference genome works).
Adding a single reference genome works:
strainphlan -s consensus_markers/*.pkl -m db_markers/s__Bacteroides_caccae.fna -r ref1.fna.bz2 -o output -c s__Bacteroides_caccae
Adding multiple reference genomes “works” without an error, but the output tree either does not contain any of the references or contains only one of them:
strainphlan -s consensus_markers/*.pkl -m db_markers/s__Bacteroides_caccae.fna -r ref1.fna.bz2 ref2.fna.bz2 ref3.fna.bz2 ref4.fna.bz2 ref5.fna.bz2 ref6.fna.bz2 ref7.fna.bz2 -o output -c s__Bacteroides_caccae
(“Number of main references” is correctly recognized by strainphlan in the s__Bacteroides_caccae.info output)
All of the reference genomes are almost equally divergent from the specified clade, so I assumed that if one of them can be added to the tree, the others should have enough markers to extract as well. I have also tried passing --secondary_references but have the same issue. I have read related posts; I have enough samples (>100) in the final tree and have relaxed the parameters to --marker_in_n_samples 10 --sample_with_n_markers 10.
Any comment is appreciated!
Thanks for getting in touch.
I suppose that all the reference genomes you added belong to the B. caccae species. If they don’t, that can explain why the references do not appear in your tree. The marker genes are species-specific (with some exceptions, called quasi-markers, which can also appear in other species), so in general a reference from another species won’t appear in the tree. If they are from the same species, can you share the content of the s__Bacteroides_caccae.info file?
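To see which references actually made it into the tree, one option is to compare the leaf labels of the output tree with the reference file names. The Python sketch below is only an illustration under a few assumptions: the tree is plain Newick with unquoted labels, and each reference appears in the tree under its file name with the extensions removed (e.g. ref1 for ref1.fna.bz2). The helper names newick_leaves and missing_references and the example tree string are made up for this post and are not part of StrainPhlAn.

```python
import re

def newick_leaves(newick):
    """Return the set of leaf labels in a Newick string (unquoted labels only)."""
    # A leaf label directly follows '(' or ','; internal node labels and
    # support values follow ')' and branch lengths follow ':', so they are skipped.
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))

def missing_references(newick, ref_files):
    """List the references whose assumed leaf name is absent from the tree."""
    leaves = newick_leaves(newick)
    # Assumption: leaf name = file name up to the first dot.
    names = [path.split("/")[-1].split(".")[0] for path in ref_files]
    return [name for name in names if name not in leaves]

# Hypothetical tree and reference list, for illustration only.
tree = "((sample1:0.01,ref1:0.02):0.03,(sample2:0.01,sample3:0.02):0.01);"
refs = ["ref1.fna.bz2", "ref2.fna.bz2"]
print(missing_references(tree, refs))
```

For a real run, the Newick string would be read from the tree file in the output folder, and the reference list would be the same files passed to -r.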
Thanks for the very quick reply! That explains it. I think 1 out of the 7 genomes happened to be captured using a different species-specific marker.
Can you think of a way to add an outgroup (a different species) to StrainPhlAn-produced trees? I guess one would need to align the reference genomes to multiple marker genes (.aln output), but that is not straightforward because the number and position of marker genes differ in each taxon, correct?
Since StrainPhlAn is based on species-specific marker genes rather than universal marker genes, adding an outgroup from a different species is not possible in most cases (another species can occasionally be so close to your target species that some regions of its marker genes can be mapped, but this is not common).
Thank you for all your input. I have been able to add different species in a couple of instances when they were very closely related, but I understand now that this is not common.