HUMAnN and MetaPhlAn file formats

Hi all,
I’ve been working on a QIIME 2 plugin to facilitate using HUMAnN 3 and MetaPhlAn 3 output with downstream tools in QIIME 2 (taxonomy/functional category plots, ordination, etc.). The file formats that I’m specifically focused on at the moment are the HUMAnN 3 gene family and pathway abundance tables and the MetaPhlAn merged abundance table (small examples of all three can be found here). I have a couple of questions related to this work.

First, are there changes to the output file formats between HUMAnN 3 and 3.5, or between MetaPhlAn 3 and 4?

Second, are there example output files available anywhere that I could use in my testing? I’ve pulled a few from my own work and from the docs, but I was wondering if there is a canonical set that you use for testing on your end that might facilitate testing of third-party tools.

Note that the plugin is not ready for use in production environments yet. If there is interest from the community here, I’m happy to post a note on it when it’s ready. (Also happy to have any feedback, including suggestions for a new name! :) )

Thank you!

Hi all, just wanted to follow up to see if anyone could provide input on these questions. Thanks!

Hi @gregcaporaso - I don’t believe we’ve made any changes to the HUMAnN output formats since HUMAnN 2.0, and my expectation is that they will stay the same in v4.0 (with the exception of possibly replacing the seldom-used pathway coverage file with something like enzyme/reaction abundances).

The default MetaPhlAn taxonomic profile format has changed a couple of times throughout v3 development: first to add a second column with the NCBI taxid-equivalent taxonomy for each row (bumping the abundances to the third column), then to add a fourth column indicating other names assigned to genomes that we group with the species listed in the first column. I believe there were some further modifications to accommodate the move to SGB-based profiling in v4, but I would check with folks on the MetaPhlAn topic to be sure. However, when we merge the separate profiles to make an integrated table, it’s always the human-readable taxonomy column and the abundance column that get extracted.
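As a rough illustration of the column layouts described above, here is a minimal Python sketch of reading a single profile and merging several into one table. It is not MetaPhlAn's own merge code. The function names `read_profile` and `merge_profiles` are hypothetical, the example rows and header text are made up for illustration, and the sketch assumes tab-separated files where header lines start with `#`, a 2-column file holds the abundance in column 2, and a 3- or 4-column file holds it in column 3. Any further changes in v4 are not handled.

```python
import io


def read_profile(handle):
    """Return {clade_name: abundance} from one MetaPhlAn-style profile.

    Assumption (not confirmed against every release): fields are
    tab-separated and header/comment lines start with '#'.
    """
    abundances = {}
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines and '#'-prefixed header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"Expected at least 2 columns, got: {line!r}")
        # Assumed layouts: 2 columns -> name, abundance;
        # 3 or 4 columns -> name, taxid, abundance[, other names].
        abundance_index = 1 if len(fields) == 2 else 2
        abundances[fields[0]] = float(fields[abundance_index])
    return abundances


def merge_profiles(profiles):
    """Turn {sample: {clade: abundance}} into {clade: {sample: abundance}}.

    Clades missing from a sample are filled with 0.0.
    """
    clades = sorted({clade for p in profiles.values() for clade in p})
    return {
        clade: {sample: p.get(clade, 0.0) for sample, p in profiles.items()}
        for clade in clades
    }


# Illustrative (made-up) 4-column profile.
sample_a = read_profile(io.StringIO(
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria\t2\t100.0\t\n"
    "k__Bacteria|p__Firmicutes\t2|1239\t60.0\t\n"
))

# Illustrative (made-up) 2-column profile.
sample_b = read_profile(io.StringIO(
    "#clade_name\trelative_abundance\n"
    "k__Bacteria\t100.0\n"
))

merged = merge_profiles({"A": sample_a, "B": sample_b})
print(merged)
```

Choosing the abundance column by column count keeps a third-party reader tolerant of the older and newer layouts, but it is only a heuristic. A stricter reader could instead look up the abundance column by name in the header line.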

For testing, I am usually focused on testing our path from reads to profiles, and I use two sets of samples for that. The first are the synthetic samples drawn from non-human-associated species that we published with the bioBakery 3 paper:

And the second are a selection of ecologically diverse real metagenomes from HMP1-II (specifically, these are the k=5 medoids of each HMP body area as determined from earlier profiling):

SRS013269  hmp-nasal
SRS016752  hmp-nasal
SRS019119  hmp-nasal
SRS019386  hmp-nasal
SRS050025  hmp-nasal
SRS015921  hmp-oral
SRS018975  hmp-oral
SRS020862  hmp-oral
SRS021986  hmp-oral
SRS049147  hmp-oral
SRS019015  hmp-skin
SRS020263  hmp-skin
SRS024482  hmp-skin
SRS052988  hmp-skin
SRS057083  hmp-skin
SRS014287  hmp-stool
SRS015960  hmp-stool
SRS022071  hmp-stool
SRS045713  hmp-stool
SRS062427  hmp-stool
SRS011111  hmp-vagina
SRS014494  hmp-vagina
SRS015072  hmp-vagina
SRS017497  hmp-vagina
SRS023604  hmp-vagina

I don’t have those profiles packed up nicely, but I could do that if useful? They (along with lots of other profiles) could probably also be extracted from curatedMetagenomicData (Curated Metagenomic Data of the Human Microbiome). Hope this helps!

Thanks @franzosa! Ok, so I think my plugin should work just fine for HUMAnN 2 through 3.5 data, but I’ll probably need to do some work to get it to support different versions of MetaPhlAn. I support the format that includes the NCBI taxonomy ID column, but I haven’t run across the version that includes the “other names” column you mention. In the meantime, I’ll add a note to the README on my project with that information.

I hadn’t come across curatedMetagenomicData before - that sounds handy. You don’t need to create a package of those profiles for me, but thanks for the offer. It could be a handy general resource, however, for people who want to build tools that read HUMAnN output.