Strainphlan shallow shotgun sequencing data

Dear,

I recently tried to run strainphlan3 on shallow shotgun sequencing data from 166 skin samples (sequencing depth 12M reads). In the example below, I show the phylogenetic tree of the most abundant species, Cutibacterium acnes, across all my samples.
My first question is, can you use strainphlan3 on shallow shotgun sequencing data to examine sub-species diversity within the most abundant species? I suppose it will be more difficult for low-abundance species due to insufficient coverage.
My second question is, can strainphlan identify the number of strains of the same species that are present in a sample?
My third question concerns the PCoA plot of the MSA. As you can see, the first component explains more than 100% of the variance, which is not possible. Is this due to correlation between the variables that I am comparing with one another, or to outliers in my data? What is the most correct way to fix this? Should I remove the outliers and one of each pair of highly correlated variables?

Thank you in advance!

Best regards,
Britta

Hi @bdpessem
Answering your questions:

  1. Exactly, it is possible to use it with shallow sequencing data as long as there is enough coverage of the species you are interested in.
  2. This is a tricky question, as it is particularly difficult to define what counts as the same strain. For a similar task in one of our latest works (in this case, detecting strain sharing events: The person-to-person transmission landscape of the gut and oral microbiomes | Nature), from version 4 it is possible to use a utility script called strain_transmission.py (Strain Sharing Inference · biobakery/MetaPhlAn Wiki · GitHub).
  3. This is probably an error in the R script that calculates the variance explained by the X axis.
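
To sanity-check point 3, the percentages can be recomputed independently of the plotting script. The sketch below is not the StrainPhlAn R script. It is a minimal classical PCoA in Python with NumPy, and the function name `pcoa_variance_explained` and the toy distance matrix are assumptions made for illustration. One common way a value above 100% can arise (not confirmed to be the cause here) is that non-Euclidean distances produce negative eigenvalues, and dividing by the sum of all eigenvalues then shrinks the denominator. The sketch normalises by the positive eigenvalues only, so no axis can exceed 100%.

```python
import numpy as np


def pcoa_variance_explained(dist):
    """Fraction of variance per PCoA axis (hypothetical helper, not part of StrainPhlAn).

    `dist` is a square, symmetric pairwise distance matrix.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centering of the squared distances: B = -1/2 * J * D^2 * J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # B is symmetric, so eigvalsh applies; it returns ascending order, so reverse it
    eigvals = np.linalg.eigvalsh(b)[::-1]
    # Keep only clearly positive eigenvalues (relative tolerance for round-off);
    # negative ones come from non-Euclidean distances and are excluded here
    tol = 1e-9 * np.abs(eigvals).max()
    positive = eigvals[eigvals > tol]
    return positive / positive.sum()


# Toy example: Euclidean distances between four made-up 2D points
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
toy_dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
fractions = pcoa_variance_explained(toy_dist)
print(np.round(100 * fractions, 1))
```

If the percentages from a check like this look reasonable on your own distance matrix, the ordination itself is likely fine and only the axis label in the plot is affected.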