I’ve been working on a QIIME 2 plugin to facilitate using HUMAnN 3 and MetaPhlAn 3 output with downstream tools in QIIME 2 (taxonomy/functional category plots, ordination, etc.). The file formats that I’m specifically focused on at the moment are the HUMAnN 3 gene family and pathway abundance tables and the MetaPhlAn merged abundance table (small examples of all three can be found here). I have a couple of questions related to this work.
First, are there changes to the output file formats between HUMAnN 3 and 3.5, or between MetaPhlAn 3 and 4?
Second, are there example output files around anywhere that I could use in my testing? I’ve pulled a few from my own work and from the docs, but I was wondering if there is a canonical set that you use for testing on your end that might facilitate testing of 3rd party tools.
Note that the plugin is not ready for use in production environments yet. If there is interest from the community here, I’m happy to post a note on it when it’s ready. (Also happy to have any feedback, including suggestions for a new name!)
Hi all, just wanted to follow up to see if anyone could provide input on these questions. Thanks!
Hi @gregcaporaso - I don’t believe we’ve made any changes to the HUMAnN output formats since HUMAnN 2.0, and my expectation is that they will stay the same in v4.0 (with the exception of possibly replacing the seldom-used pathway coverage file with something like enzyme/reaction abundances).
The default MetaPhlAn taxonomic profile format has changed a couple of times throughout v3 development: first to add a second column with the NCBI taxid-equivalent taxonomy for each row (bumping the abundances to the third column), then to add a fourth column indicating other names assigned to genomes that we group with the species listed in the first column. I believe there were some further modifications to accommodate the move to SGB-based profiling in v4, but I would check with folks on the MetaPhlAn topic to be sure. However, when we merge the separate profiles to make an integrated table, it’s always the human-readable taxonomy column and the abundance column that get extracted.
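For anyone writing a reader that has to cope with these layouts, here is a minimal Python sketch of one way to pull out the clade name and abundance regardless of how many columns a single-sample profile has. It is not how MetaPhlAn's own merge utility works; `read_profile` is a hypothetical helper, and it assumes tab-delimited rows, header/metadata lines that start with `#`, and the column order described above (abundance second in the two-column layout, third once the taxid column is present).

```python
from typing import Dict, Iterable


def read_profile(lines: Iterable[str]) -> Dict[str, float]:
    """Map clade name -> abundance for one single-sample profile.

    Hypothetical helper; the layout rules below are assumptions taken
    from the description in this thread, not from a format spec.
    """
    abundances: Dict[str, float] = {}
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: header and metadata lines start with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"Unexpected row: {line!r}")
        # Two columns: clade, abundance.
        # Three or more: clade, taxid, abundance[, other names].
        abundance_index = 1 if len(fields) == 2 else 2
        abundances[fields[0]] = float(fields[abundance_index])
    return abundances


# Made-up rows for illustration only (not real profiling output).
two_column = ["k__Bacteria\t100.0\n"]
four_column = ["# header line\n", "k__Bacteria\t2\t100.0\t\n"]
print(read_profile(two_column) == read_profile(four_column))
```

Keying on the column count keeps the sketch short, but a stricter reader would probably check the header line for the abundance column's name instead, especially given the possible v4 changes.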
For testing, I am usually focused on testing our path from reads to profiles, and I use two sets of samples for that. The first are the synthetic samples drawn from non-human-associated species that we published with the bioBakery 3 paper:
And the second are a selection of ecologically diverse real metagenomes from HMP1-II (specifically, these are the k=5 medoids of each HMP body area as determined from earlier profiling):
I don’t have those profiles packed up nicely, but I could do that if useful? They (along with lots of other profiles) could probably also be extracted from curatedMetagenomicData (Curated Metagenomic Data of the Human Microbiome). Hope this helps!
Thanks @franzosa! Ok, so I think my plugin should work just fine for HUMAnN 2 through 3.5 data, but I’ll probably need to do some work to get it to support different versions of MetaPhlAn. I support the format that includes the NCBI taxonomy id column, but I haven’t run across the version that includes the “other names” column you mention. In the meantime, I’ll add a note to the README on my project with that information.
I hadn’t come across curatedMetagenomicData before - that sounds handy. You don’t need to create a package of those profiles for me, but thanks for the offer. It could be a handy general resource, however, for individuals who want to build tools that read HUMAnN output.