Conceptual Gap?: diversity metrics from metaphlan metagenomic output

Hi all,

I’ve had a collaborator provide to me output of metagenomic shotgun sequencing run through the metaphlan2 pipeline. I’ve merged the taxonomic abundance output files of all my samples, and am trying now to investigate sample diversity. I’ve been running into a ton of errors in this process, some of which I worry may come down to gaps in my conceptual understanding; I am new to the world of microbiome analyses, and my learning about the subject has been more piecemeal and self-guided than anticipated. I’m hoping that in laying out my confusion below, someone can point out what I may be misunderstanding so that I can be more focused in my googling and self-learning.

Here is a screenshot my merged table…note that I did not include the first row, which says “clade, sample_1, sample_2, sample_3, …” (I cut it off because my sample names include potentially identifiable info). I’ll also note that I am able to produce this merged table as a .csv and .biom file.

Initially, I thought I could take this merged abundance table and bring it forward into programs like phyloseq. However, it seems like I need a taxonomic tree and/or table to make use of some of these programs for generating alpha and beta diversity. For example, this tutorial references a tree and a refseq file as used in its import functions. I tried porting my merged abundance table into R to use following a divnet tutorial, I get the following error early on: “The tax_glom() function requires that physeq contain a taxonomyTable.” I tried using the QIIME2 pipeline as well, but it expects a tree file when I try to generate alpha diversity (see –i-phylogeny)

Neither a taxonomy table or phylogenetic tree were provided to me by my collaborator. I mentioned my stumbling in passing to my collaborator and they advised me that if the methods I was investigating for generating diversity required a tree or table, then I was probably looking into the wrong methods. I have since spent a lot of time trying to figure out a way to use the above programs with purely my metaphlan output table, and I have been stymied. I’ve also tried seeing if I could generate a tree from my metaphlan2 output. Based on this tutorial, I spent a day thinking Graphlan might be able to generate such a tree, but I have run into issues with export2graphphlan.py and my data. The error I get when trying to use that script to create a tree from my merged metaphlan output is
" line 380, in get_biomarkes
while lvl < len(max(cc)):
ValueError: max() arg is an empty sequence"

Looking at this tutorial for metaphlan, I see that strainphlan can generate these trees. I don’t have access to the raw sequence data as required by that program, however, so investigating this route would require reconnecting again with my collaborator.

Before I reach out again to my collaborator with more candor about my confusion and to see if they could run the fasta files through strainphlan, I wanted to reach out and see if anyone on this forum could help me out. I am new to microbiome analyses and this is my first time working with metaphlan2 output files; I feel like there is something I’m missing, either a conceptual gap in understanding and/or a means of generating the diversity metrics I’m after (things like Shannon, Chao1, and Bray–Curtis dissimilarity). I would love to figure this out on my own, but the next best thing would be to make sure I’d done my homework before leaning back on my collaborator for help. With this in mind, I would greatly appreciate feedback or if recommendations as to resources (eg programs, concept explanations, or alternate forums) that they think would be helpful.

Thanks to anyone for assistance of any sort!

Hi Jacob,

Yes, you can definitely import the MetaPhlAn abundance table into phyloseq. As you pointed out, you can import those using the otu_table function, you can extract the taxonomic table from the same abundance table since you can use the taxonomy labels from the first column to build the 7-column table for tax_table. However, the phylogenetic tree is not required for performing some of the alpha/beta diversity measures (e.g. Shannon or BrayCurtis).

For an easy data import into phyloseq you should have a look at this R function from Import a table of MetaPhlAn taxonomic abundances into phyloseq (github.com) which also provides a phylogenetic tree built using all the reference genomes considered by MetaPhlAn2 (you can find it in the Bioconductor - curatedMetagenomicData package).

GraPhlAn does not generate a phylogenetic tree, it is a visualization tool for producing annotated circular trees.

Yes, you are right, StrainPhlAn does produces tree, but the tools is aimed to produce intra-species phylogenies.

Feel free to ask here any question!

Thank you so much for your response! I’d tried using metaphlantopython before, but had gotten an error and so had moved on to investigating other methods and considering other causes for error messages. Because of your direction, i went back and troubleshot the method, and was able to find a post that let me run the code (I think successfully).

My metaphlantophyloseq object is in the screenshot below. In addition to creating the base object, I’ve worked today also to include some metadata. Ultimately I’d like to be able to group samples according to metadata (eg sex of study subject from which a sample was taken), and then calculate within (alpha) and between (beta) diversity using those groups. I’m still troubleshooting this metadata step; for some reason, when I use merge_phyloseq to append sample data to my metaphlantophyloseq output, I lose a couple samples, down to 85 from 89.

Putting that aside, though, the main followup issue I’d love to pick your brain about is again more conceptual. When I try to get alpha diversity from phyloseq (following this tutorial), I get the error below:

Looking at my initial merged metaphlan output (a screenshot of which is embedded within my initial post) I believe I have relative abundances, not absolute, and that might be what’s causing me trouble here. Is it true, as per this post, that I’d need absolute abundances for calculating alpha diversity using phyloseq? This other forum seems to imply that I can calculate the diversity metrics I’m interested in from relative abundance using qiime2. I see in googling that there are means of transporting the phyloseq object I’ve created into qiime2 (one example). Before I set down that path, though, I wanted to check whether my understanding of restrictions around use of relative abundance in phyloseq was accurate. Does qiime2 seem like a solid next move, or is there another option I don’t know to consider?

Thanks again for your help in response to my last post!!

Yes, this is right. Since phyloseq was designed to be used with amplicon data, it requires to use counts data. You can predict read counts with MetaPhlAn if you run it with the option -t rel_ab_w_read_stats, or otherwise, you can use the imported matrix as input for other packages like vegan or just use a custom function implementing an alpha diversity measure. I don’t see the point of exporting to qiime2 since it will anyhow requires counts values.

Unlike the alpha diversity, you can easily run the beta diversity with phyloseq using relative abundances.