PCA of input samples based on taxonomic profile


I apologize for the basic question, but I was wondering if you could guide me to the correct software and commands I need to use.

I have 20 WGS .fastq.gz files from 10 different fecal samples; these are Illumina short-read paired-end sequence data. My goal is to generate a taxonomic profile of each sample so that I can compare the similarities and differences between the ten samples. In the end, I need a taxonomic summary for each sample, a phylogenetic comparison of the ten samples, and a PCA of the ten samples.

I was thinking of either using just MetaPhlAn 3 or running the bioBakery workflow for metagenome profiling. Based on these results, I want to be able to conduct a phylogenetic comparison and a PCA of the ten samples.

Could you please let me know whether it would be better to use just MetaPhlAn or to run the bioBakery workflow for this project, and which software I should use for the phylogeny and PCA analyses?

Thank you for your help!


Indeed, for a taxonomic summary of each sample, MetaPhlAn 3 will be the best tool to use. From the taxonomic profiles, you can then easily generate a PCoA of the samples using, for example, the vegan R package.
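For readers who would rather stay in Python than R, here is a minimal sketch of what such a PCoA involves. It is not the vegan implementation. It assumes the per-sample MetaPhlAn profiles have already been combined into one samples-by-species table of relative abundances, and it assumes Bray-Curtis as the dissimilarity, which is a common choice but not one prescribed in this thread. The sample names and abundance values are made up for illustration.

```python
import numpy as np


def bray_curtis(profiles):
    """Pairwise Bray-Curtis dissimilarity between rows (one row per sample)."""
    n = profiles.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(profiles[i] - profiles[j]).sum()
            den = (profiles[i] + profiles[j]).sum()
            dist[i, j] = dist[j, i] = num / den if den > 0 else 0.0
    return dist


def pcoa(dist, n_axes=2):
    """Classical PCoA: double-centre the squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Bray-Curtis is not Euclidean, so small negative eigenvalues can occur;
    # they are clipped to zero here rather than corrected.
    kept = np.clip(eigvals[:n_axes], 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(kept)
    explained = kept / eigvals[eigvals > 0].sum()
    return coords, explained


# Hypothetical relative abundances (%): rows are samples, columns are species.
samples = ["S1", "S2", "S3", "S4"]
profiles = np.array([
    [60.0, 30.0, 10.0, 0.0],
    [55.0, 35.0, 10.0, 0.0],
    [5.0, 10.0, 25.0, 60.0],
    [0.0, 20.0, 20.0, 60.0],
])

dist = bray_curtis(profiles)
coords, explained = pcoa(dist)
for name, (x, y) in zip(samples, coords):
    print(f"{name}\tPCo1={x:.3f}\tPCo2={y:.3f}")
print("fraction of positive-eigenvalue variation on each axis:", explained)
```

The two coordinate columns are what you would scatter-plot, one point per sample. With only ten samples, treat the plot as an exploratory overview rather than a statistical test.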

I’m not exactly sure what you mean by “a phylogeny comparison of the ten samples”. Using the bioBakery tools StrainPhlAn or PhyloPhlAn, you can easily generate a phylogenetic tree of a species of interest. The best approach might be to generate the MetaPhlAn 3 profiles first and then see from them which species would be relevant for further phylogenetic analysis.

Have a nice day!


Thank you for your response!

I meant to say a hierarchical clustering analysis of the ten samples based on the taxonomy found. Can I use PhyloPhlAn to conduct this?

Also, is it possible to use Whole Genome Sequencing files (Illumina NextSeq2000) as input to MetaPhlAn 3? I do not have shotgun sequencing files to use as input.

Thank you for your help!

Hi, for the hierarchical clustering, just plot the MetaPhlAn taxonomic profile table as a heatmap and apply the usual hierarchical clustering. No need for PhyloPhlAn here.
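To make "the usual hierarchical clustering" concrete, here is a small hand-rolled sketch of average-linkage (UPGMA) clustering on Bray-Curtis dissimilarities. Both the linkage method and the dissimilarity are assumptions chosen for illustration, and the sample names and abundances are made up. In practice you would more likely call an existing routine such as `hclust` in R or `scipy.cluster.hierarchy.linkage` in Python, which also gives you the dendrogram drawn beside the heatmap.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / total if total else 0.0


def average_linkage(names, profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    dist = {
        frozenset((i, j)): bray_curtis(profiles[i], profiles[j])
        for i, j in combinations(range(len(names)), 2)
    }

    def mean_distance(a, b):
        # Average of all pairwise sample distances between two clusters.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are closest on average.
        a, b = min(combinations(clusters, 2), key=lambda pair: mean_distance(*pair))
        merges.append(([names[i] for i in a], [names[i] for i in b], mean_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical relative abundances (%): one row per sample, one column per species.
samples = ["S1", "S2", "S3", "S4"]
profiles = [
    [60.0, 30.0, 10.0, 0.0],
    [55.0, 35.0, 10.0, 0.0],
    [5.0, 10.0, 25.0, 60.0],
    [0.0, 20.0, 20.0, 60.0],
]

merges = average_linkage(samples, profiles)
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height:.3f}")
```

The merge order and heights are the information a dendrogram displays: samples joined at a low height have similar taxonomic profiles.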

If by “Whole Genome Sequencing” you mean reference genomes sequenced from isolates, you can add them to a PhyloPhlAn tree alongside the metagenomic samples. Check the PhyloPhlAn wiki for more details. MetaPhlAn is meant to be used with shotgun sequencing.

Hope that answers your questions.