Merging results of metaphlan and humman tables for two batches of the same study (two different timepoints)

Hi,

I’m facing an imminent deadline and would really appreciate your guidance at your earliest convenience.

I have two sequencing runs containing samples from different time points. For each run, I have already generated MetaPhlAn taxonomic profiles (count and relative abundance) and HUMAnN functional profiles (CPM and relative abundance). I would like to combine the results across runs so that all samples can be analyzed together for downstream analyses, but I want to ensure this is done correctly and in line with best practices.

Any advice you could provide on how to appropriately merge these outputs would be extremely helpful. My question is especially about relative abundance data for metaphlan and CPM and relative abundance data for humann.

Thank you very much for your time and help.

Sahra

Hi Sahra,

There‘s quite a bit of documentation about how to use MaAsLin 3 (probably your first analysis step) here. There’s also information on standard visualization techniques here.

Hi Will,

I’m not sure I fully understand your response. My question is specifically about combining results from two separate sequencing runs. Each run has already been processed independently through the MetaPhlAn and HUMAnN pipelines. I would like to merge these outputs into unified tables: MetaPhlAn relative abundance profiles, and HUMAnN gene family and pathway abundance tables (both relative abundance and CPM), for downstream analysis.

I would appreciate your guidance on best practices in this situation. Is it sound to merge results after processing the runs separately, or is it recommended to reprocess all samples together in a single run before downstream analysis?

Thanks

Ah, sorry - thought you were asking about downstream analysis more generally. Is this MetaPhlAn merging information and this HUMAnN merging information what you’re looking for? Both of these tools operate per-sample, so it shouldn’t matter whether you did single files or many files originally.

Thanks for your quick reply, Will. My main concern was whether it is appropriate to merge samples originating from different sequencing runs when the outputs are already normalized (relative abundance for MetaPhlAn and relative abundance/CPM for HUMAnN).

I wanted to confirm that it is acceptable to use the same merging scripts I typically use for combining samples within a single run, but instead apply them across two independent runs. Since both MetaPhlAn and HUMAnN perform normalization on a per-sample basis, my understanding is that I can place all *_profile.txt files from both runs into a single directory and merge them using the command below for MetaPhlAn count and relative abundance data.

merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt

Similarly, for HUMAnN, I would copy all joined_taxonomic_profile.tsv files per database from both runs into one directory and run:

humann_join_tables --input $DIR --output joined_taxonomic_profile.tsv

Downstream, I would then account for potential batch effects by including sequencing run as a covariate in the statistical analyses. Please let me know if this approach is sound or if there are any additional considerations I should be aware of.

Thank you so much,

Yes - both MetaPhlAn and HUMAnN use per-sample normalization, so combining the batches like this should work fine. In downstream analysis, you could also include the batch as a covariate (in addition to read depth, which should always be there) if you think there are systematic changes per batch.

Thank you very much, Will, for the quick response. Just to clarify, by read depth you are referring to the read count per sample. Would you recommend using the raw read counts, or the read counts after quality control (e.g., post-QC / post-host removal)?

Thanks,

I’d use the total read count you tried to map to microbial genomes (which is probably your post-QC/host removal read count) since read count is mainly a confounder because when you map to microbial genomes, you’re more likely to see a particular species present if you have a higher total read count.

Thanks, Will! that’s very helpful. For the per-sample total microbial-eligible read count (post-QC / post-host removal), would you recommend log-transforming this variable when including it as a covariate in downstream analyses?

If you’re using MaAsLin 3, I’d include it not log transformed (but leave the scale parameter TRUE like the defaults). For other analyses, it’s hard to give a blanket statement, but log transforming is likely a good idea.

Thank you so much for getting back to me quickly and answering all my questions!!!