What are the limitations for number of input sequences?

Hi,

Is it possible to use PhyloPhlAn with single protein sequences representing each species? I realise this isn’t the main purpose for the algorithm, but we have a mixed set of protein sequences from both known organisms and a mix of metagenomic analyses and we’d like to be able to generate a phylogeny independently of those protein sequences as a gut-check/confirmation of our results. Any help/advice would be welcome!

Best,

Bruce

Hi Bruce,

I’m not 100% sure I understood your exact situation, but my understanding is that you have single protein sequences, each one representing a different organism that you want to have in your tree.
Now, I don’t know how long these protein sequences are, but in PhyloPhlAn, when the inputs are proteins, they are assumed to be the proteins of a genome, so in your case each input would be the single protein for that genome. There are some quality-filtering parameters, such as --min_num_proteins (default 1) and --min_len_protein (default 50), and inputs that do not meet them are discarded. From what I understood, --min_num_proteins should not be a problem, but you may want to tune the --min_len_protein parameter.
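As a quick pre-check before running anything, a small Python sketch like the one below could list which input files would survive thresholds like these. It is not part of PhyloPhlAn and only approximates the filters described above: the function names, the ".faa" extension default, and the use of ">=" for the comparisons are assumptions made for illustration.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_inputs(folder, min_len=50, min_num=1, extension=".faa"):
    """Return {filename: (proteins_long_enough, passes)} for each input file.

    min_len and min_num mirror the defaults quoted for --min_len_protein
    and --min_num_proteins; the exact comparison PhyloPhlAn applies may differ.
    """
    report = {}
    for path in sorted(Path(folder).glob("*" + extension)):
        kept = sum(1 for _, seq in read_fasta(path) if len(seq) >= min_len)
        report[path.name] = (kept, kept >= min_num)
    return report
```

Calling check_inputs on the input folder and printing the entries whose second value is False gives the files most likely to be dropped by the length filter.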

Now, PhyloPhlAn will basically map your single protein sequences against the database. It is not clear to me whether you are going to use one of the default databases or a custom one, but in general I don’t see huge problems here. You just want to check which markers in the database have a hit against your inputs, because an input that doesn’t hit any marker won’t appear in the phylogeny.

Please let me know if the above makes sense or let me know if you have other details.

Many thanks,
Francesco

Hi Francesco,

Yes, this makes sense. So practically, I would provide each sequence separately as a representative of the genome of interest. In case some do not map directly, I assume there’s a warning in the output (we do suspect that some of the sequences might not be found in common genome databases, since they are derived from recent metagenomic mining).

I will write back if we have any further questions, but thank you in the meantime!

Best,

Bruce

Hi Francesco,

We’ve finally managed to find the time/head space to tackle this again. We can get the default PhyloPhlAn running with our single protein sequences, but there is no output. I suspect our configuration is off – it appears that PhyloPhlAn is looking for our protein sequences in the default markers that are downloaded automatically on the first run of the algorithm, and we know that our proteins are not part of that set of markers. We have a mix of full-length protein sequences from known, sequenced organisms, and a handful from unknown organisms, and I would have expected, had things been configured correctly, that the connections to known genomes would have been readily identified. Clearly, we’re missing a database or reference genome set (are these the same thing? The terms appear interchangeably in the tutorials) for this work – any guidance would be welcome.

Best,

Bruce

Hi Bruce,

Thanks for getting back on this. Apologies, but I’m not sure I can help with just the above information.
To give me a bit of context about this analysis, can you please provide the:

  • PhyloPhlAn command
    • If you save the output of PhyloPhlAn to a text file, can you also provide it? (It would be great if you ran it with the --verbose parameter)
  • the configuration file
  • the list of the content of the database folder you specified
  • the list of the content of the input folder you specified

If you can upload them as separate files, that would be fantastic, as it will be much easier for me to check them.

Many thanks,
Francesco

Hi Francesco,

Apologies for the slight delay – my student needed a moment to collate her files, and thank you for taking the time to help us with this. Attached you will find our output from PhyloPhlAn, the input folder contents (a list of FASTA *.faa files), and the database folder contents. We’re using the supermatrix_aa.cfg file as our starting point (but we’ve attached the version we have on hand).

Thank you again for your feedback! It would be great to get this closer to working!

Best,

Bruce

Database folder content.txt (135 Bytes)

Output_Error_Phylophlan.txt (22.4 KB)

(Attachment supermatrix_aa.cfg is missing)

Input Folder Content.txt (529 Bytes)

Hi Bruce, many thanks for sharing the files.

So, the problem seems to be with diamond and the mapping. From the output it is clear that none of the 31 input genomes could be annotated with any of the 400 universal markers from the phylophlan database:

Not enough markers mapped (0/1) in "/path/to/your/genome.b6o.bkp"

Now, this is strange, but it could be because the “single protein sequences representing each species” are not actually complete and hence don’t match any of the proteins in the phylophlan database.
Alternatively, I would suggest trying the amphora2 database, as it contains some ribosomal proteins that might have better luck. Or, if you know which proteins you have in your inputs and a number of them are conserved, you can collect them and create a custom database from those.
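To check whether an input had any hits at all before switching databases, the mapping file named in the message above could be inspected with a sketch like this one. It assumes the .b6o file is in the standard BLAST-style tabular layout (12 tab-separated columns, with the query in column 1, the subject in column 2 and the e-value in column 11); that layout, the function name and the e-value cutoff are assumptions for illustration and not PhyloPhlAn's own logic.

```python
from collections import defaultdict


def markers_hit(b6o_path, max_evalue=1e-5):
    """Map each query id to the set of subject ids it hit.

    Assumes BLAST-style tabular output: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    The max_evalue cutoff is an illustrative choice.
    """
    hits = defaultdict(set)
    with open(b6o_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip comment lines and anything that is not a full record.
            if line.startswith("#") or len(fields) < 12:
                continue
            query, subject, evalue = fields[0], fields[1], float(fields[10])
            if evalue <= max_evalue:
                hits[query].add(subject)
    return dict(hits)
```

An empty result for a file would be consistent with the "0/1" markers reported above, and would suggest that the protein simply has no counterpart in the chosen database rather than that the configuration is wrong.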

Please, let me know if something is not clear.

Many thanks,
Francesco