The bioBakery help forum

Length of concatenated alignment file

Hello.
Thank you so much for your work with the Biobakery tools. The tools are very useful!

In am working with StrainPhlAn 3 and encountered a problem I hope you can help me solve.

The concatenated alignment file is quite short - I included two examples of this. E.g. for s__Akkermansia_muciniphila I have 52 markers available after filtering. However, in the concatenated alignment file the concatenated length for each sample is solely 5396 characters.
(for E. coli it is 27 markers and only 997 characters). You can see the output for the two species in the attached files.

I tried to change the --phylophlan_mode to “accurate” as I saw that using the default parameter only 500 nucleotides are included from each marker. However, this did not cause any difference.
I also updated both StrainPhlAn and PhyloPhlAn and likewise, this did not help.

I am worried that this will be too sparse information to obtain a meaningful phylogenetic reconstruction (if e.g. it is solely based on 997 nucleotides).

I can see from the output in the tutorial that you get 14355 characters pr. sample (for s__Eubacterium_rectale)

Hope you can help clarify my problem.

Kind regards Anne.
s__Escherichia_coli.StrainPhlAn3_concatenated.aln.txt (14.9 KB)
s__Escherichia_coli.info.txt (743 Bytes)
s__Akkermansia_muciniphila.StrainPhlAn3_concatenated.aln.txt (63.4 KB)
s__Akkermansia_muciniphila.info.txt (751 Bytes)

Hi @Anne_Mathiesen
Thanks for getting in contact.
With 52 markers, an alignment length of ~5,400 bases can be something expected. In the tutorial, for E. rectale, StrainPhlAn is able to reconstruct 135 markers (almost 3 times more). Also, you should take into account that, for the final alignment, PhyloPhlAn removes all the bases without phylogenetic signal and, in fast mode, those not covered by 67% of the samples.
However, for A. muciniphila, StrainPhlAn uses a DB of 150 marker genes, but in your case you are only able to use 53 for the alignment. I can see from the file that some of the samples have a lot of markers missing (-) and the markers present are not well covered or with a lot of polymorphisms (N). Probably discarding those samples from the analysis will help to have a bigger alignment and a more robust phylogeny.
In the case of E. coli, the alignment length will be always quite small, since it is a species with already a few markers available in our database (32). If the strains are quite similar, after removing the non-variant positions the length of the final alignment can be quite small.
I hope this helps.

1 Like

Hi @aitor.blancomiguez,

Thank you so much for the prompt and detailed reply. It makes so much more sense now! I will try to remove the samples with markers missing for A. muciniphila.

Is there somehow I can assess how long the concatenated marker length is before the PhyloPhlAn analysis? E.g. for E. coli I have 27 markers present in my samples, but these can be of varying length (150-1,500), so it is hard for me to know the concatenated marker length.

Kind regards Anne.

Hi @Anne_Mathiesen
If you know the IDs of the markers present in your samples, you could retrieve their length from this file: https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAlyQITZuUCtBUJxpxhIroIa/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1 Remember also that StrainPhlAn will trim the first and last 50 nt for QC purposes.
Best,
Aitor

1 Like

Hi @aitor.blancomiguez

Thanks a lot for you inputs!

I been trying different things out based on your suggestions and encountered a question. Would it be OK to lower the marker_in_n_samples to e.g. 60 % in order to get a bigger alignment and a more robust phylogeny? In this approach the samples with many markers missing can subsequently be filtered out based on % of gaps ("-") in the MSA file and thus, I do not have to perform 2 runs of StrainPhlAn as suggested by you. I have many samples and many species of interest and therefore, I am trying to find a solution that scales well.

Would the two approaches give a similar result in terms of phylogenetic reconstruction by PhyloPhlAn 3?

Kind regards,
Anne

Hi @Anne_Mathiesen

It is OK if you lower the --marker_in_n_samples, but this won’t ensure you to will have a more robust phylogeny, as a lot of gaps will be introduced in the final alignment. Manually removing the samples with a lot of gaps will help and it can work, but I think you could have StrainPhlAn doing that job for you.
While filtering marker / samples, StrainPhlAn first applies the --samples_with_n_markers parameter (discarding the samples with few markers) and then --marker_in_n_samples (discarding the markers present in few samples). As you are interested in getting a larger alignment, your first targeted parameter should be--samples_with_n_markers. Filtering the samples you use before running StrainPhlAn based on a minimum abundance in MetaPhlAn can also help. A threshold we commonly use is 0.05 rel. abundance.
However, getting a set of StrainPhlAn parameter that work for every sample / species is something really tricky, as the number of markers per species and their presence will be really variable.
Regarding the last question, I didn’t get which is the other approach you are talking about, neither the 2 StrainPhlAn runs you mentioned before.

Best,
Aitor

1 Like

Hi @aitor.blancomiguez

Thank you so much for the detailed answer. I am sorry that my question was unclear, but you actually answered all of my questions.

Kind regards,
Anne

Hi @aitor.blancomiguez
I been working with StrainPhlAn since I posted this question and I encountered a new problem regarding the MSA of StrainPhlAn 3. I hope you can clarify this for me.

I wanted to calculate the identity of two consensus sequences from the MSA and I was inspired by the tutorial of StrainPhlAn 3. Here you use distmat to calculate a distance matrix “representing the nucleotide substitution rate, i.e. the number of substitutions per 100 nucleotides of sequence between a given pair of samples.”. This matrix is computed from the MSA. However if we solely compute this from the MSA of which PhyloPhlAn has removed all nucleotides without a signal, I would expect this to be incorrect? I would expect this to work only if we had the full-length MSA with both the non-variable and variable nucleotides? I can see that StrainPhlAn 2 gives the full MSA - is there a reason why this is not included in the output of StrainPhlAn 3?

Kind regards
Anne

I am refererring to this tutorial: StrainPhlAn3 · biobakery/biobakery Wiki · GitHub

Hi @Anne_Mathiesen
You are right, if you compute that from the phylophlan’s MSA, you only account for the variant parts. Instead, you can use the --mutation_rates option when running StrainPhlAn, since it is computed on the full MSA before phylophlan filters and trims it.

Best,
Aitor

Hi @aitor.blancomiguez.
OK, thanks. I also looked at the --mutation_rates flag but I cannot find information about how this is calculated? I tried looking at both the documentation from StrainPhlAn and PhyloPhlAn without any luck.
Best,
Anne

They are calculated as the amount of nucleotide or amino acid changes in each aligned marker, but without accounting for positions containing Ns or gaps. This will be represented in the resulted matrix as a percentage (upper triangle) and as a fraction (lower triangle).

1 Like

Okay, sounds very good! That was exactly what I was looking for!
Thanks for your help once again!
Kind regards
Anne