Hi all,
I’ve had a collaborator provide to me output of metagenomic shotgun sequencing run through the metaphlan2 pipeline. I’ve merged the taxonomic abundance output files of all my samples, and am trying now to investigate sample diversity. I’ve been running into a ton of errors in this process, some of which I worry may come down to gaps in my conceptual understanding; I am new to the world of microbiome analyses, and my learning about the subject has been more piecemeal and self-guided than anticipated. I’m hoping that in laying out my confusion below, someone can point out what I may be misunderstanding so that I can be more focused in my googling and self-learning.
Here is a screenshot my merged table…note that I did not include the first row, which says “clade, sample_1, sample_2, sample_3, …” (I cut it off because my sample names include potentially identifiable info). I’ll also note that I am able to produce this merged table as a .csv and .biom file.
Initially, I thought I could take this merged abundance table and bring it forward into programs like phyloseq. However, it seems like I need a taxonomic tree and/or table to make use of some of these programs for generating alpha and beta diversity. For example, this tutorial references a tree and a refseq file as used in its import functions. I tried porting my merged abundance table into R to use following a divnet tutorial, I get the following error early on: “The tax_glom() function requires that physeq contain a taxonomyTable.” I tried using the QIIME2 pipeline as well, but it expects a tree file when I try to generate alpha diversity (see –i-phylogeny)
Neither a taxonomy table or phylogenetic tree were provided to me by my collaborator. I mentioned my stumbling in passing to my collaborator and they advised me that if the methods I was investigating for generating diversity required a tree or table, then I was probably looking into the wrong methods. I have since spent a lot of time trying to figure out a way to use the above programs with purely my metaphlan output table, and I have been stymied. I’ve also tried seeing if I could generate a tree from my metaphlan2 output. Based on this tutorial, I spent a day thinking Graphlan might be able to generate such a tree, but I have run into issues with export2graphphlan.py and my data. The error I get when trying to use that script to create a tree from my merged metaphlan output is
" line 380, in get_biomarkes
while lvl < len(max(cc)):
ValueError: max() arg is an empty sequence"
Looking at this tutorial for metaphlan, I see that strainphlan can generate these trees. I don’t have access to the raw sequence data as required by that program, however, so investigating this route would require reconnecting again with my collaborator.
Before I reach out again to my collaborator with more candor about my confusion and to see if they could run the fasta files through strainphlan, I wanted to reach out and see if anyone on this forum could help me out. I am new to microbiome analyses and this is my first time working with metaphlan2 output files; I feel like there is something I’m missing, either a conceptual gap in understanding and/or a means of generating the diversity metrics I’m after (things like Shannon, Chao1, and Bray–Curtis dissimilarity). I would love to figure this out on my own, but the next best thing would be to make sure I’d done my homework before leaning back on my collaborator for help. With this in mind, I would greatly appreciate feedback or if recommendations as to resources (eg programs, concept explanations, or alternate forums) that they think would be helpful.
Thanks to anyone for assistance of any sort!