Hello,
I would like to run HUMAnN 3.0 on a collection of more than 12,000 metagenomes, which means a very large batch of jobs.
I was wondering how using a single “general” MetaPhlAn bugs list (generated from a subsample of the metagenome collection) for every sample would affect the results.
Any comment about this would be appreciated!
This is not something we usually do, but other people have tried it in the past. It mostly hurts your computational efficiency since you’re mapping each sample against one large database instead of smaller sample-specific databases.
If you want to do this, I would profile all of the samples with MetaPhlAn first, then combine the resulting profiles into a joint list of “species of interest.” This could be any species that was seen in any sample, or maybe more conservatively any species that was seen at >0.1% abundance in at least one sample. After you have that species list, you would concatenate their pangenomes from the HUMAnN ChocoPhlAn database into one big FASTA file and then index that file for use with bowtie2 inside of HUMAnN. You can then pass that custom index to HUMAnN to profile each sample against.
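The species-selection step described above can be sketched in Python. This is a minimal sketch, not part of HUMAnN or MetaPhlAn: the function name `species_of_interest` is hypothetical, and it assumes MetaPhlAn 3-style tab-separated profiles in which the clade name is the first column and the relative abundance (in percent) is the third column. Adjust `abundance_column` if your profiles are laid out differently.

```python
from typing import Iterable, Set


def species_of_interest(profiles: Iterable[Iterable[str]],
                        min_abundance: float = 0.1,
                        abundance_column: int = 2) -> Set[str]:
    """Return species seen above min_abundance (percent) in at least one profile.

    ASSUMPTION: each profile is an iterable of tab-separated lines with the
    clade name in column 0 and the relative abundance in `abundance_column`.
    """
    selected = set()
    for lines in profiles:
        for line in lines:
            # Skip blank lines and header/comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= abundance_column:
                continue
            # The last '|'-separated level is the rank of this row.
            clade = fields[0].split("|")[-1]
            # Keep species-level rows only (skips kingdom..genus and strain rows).
            if not clade.startswith("s__"):
                continue
            if float(fields[abundance_column]) > min_abundance:
                selected.add(clade)
    return selected
```

In practice you would pass one open file handle per sample profile. Setting `min_abundance=0.0` gives the more permissive "seen in any sample" list.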
Note that you want to build that big index just once and reuse it for all samples. If you instead passed the combined taxonomic profile to HUMAnN for each sample, it would rebuild the big index every time, which would be very time-consuming.
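As a rough illustration of building the joint FASTA once, here is a minimal Python sketch. The function name `concatenate_pangenomes` is hypothetical, and the file-name convention is an assumption: it expects gzipped ChocoPhlAn pangenome files ending in `.ffn.gz` whose second dot-separated field is the species label (for example `g__Genus.s__Genus_species.<suffix>.ffn.gz`). Check this against your own ChocoPhlAn directory before relying on it.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable, List


def concatenate_pangenomes(chocophlan_dir, species: Iterable[str],
                           out_fasta) -> List[str]:
    """Write the decompressed pangenomes of the selected species to one FASTA.

    ASSUMPTION: pangenome files are named '<genus>.<species>.<...>.ffn.gz',
    so the species label is the second dot-separated field of the file name.
    Returns the names of the files that were included.
    """
    wanted = set(species)
    used = []
    with open(out_fasta, "wb") as out:
        for path in sorted(Path(chocophlan_dir).glob("*.ffn.gz")):
            parts = path.name.split(".")
            if len(parts) > 1 and parts[1] in wanted:
                # Stream-decompress each pangenome into the joint FASTA.
                with gzip.open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
                used.append(path.name)
    return used
```

The resulting FASTA would then be indexed a single time with `bowtie2-build` (input FASTA followed by an index basename), and every HUMAnN job pointed at that prebuilt index. HUMAnN documents a `--bypass-nucleotide-index` option used together with `--nucleotide-database` for this; confirm the exact usage in the manual for your version.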
Dear Eric Franzosa,
Thank you so much for your answer!
I was wondering if you have any other general advice on how to lower the computational time for HUMAnN 3.0.
Best,
Most of the runtime is spent in translated search. You can bypass translated search to speed up the runs, but that’s really only advisable for very well characterized environment types (where you expect most of the taxa to be known), and even then you could be losing out on important unclassified signals.
Another option is to use the EC-filtered translated search database instead of the full database. This restricts translated search to proteins with an EC annotation (~10% of the total). This is faster and still allows you to explore unclassified enzyme and pathway contributions, but you lose coverage of less-well-annotated proteins.
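For a batch of thousands of jobs, the two options above come down to which arguments each `humann` call receives. The helper below is a hypothetical sketch for generating those command lines (the function name and all paths are placeholders). It assumes the EC-filtered database has already been downloaded to a local directory, and it uses HUMAnN's `--bypass-translated-search` and `--protein-database` options; verify both against `humann --help` for your installed version.

```python
from typing import List, Optional


def humann_command(sample: str, out_dir: str,
                   protein_db: Optional[str] = None,
                   bypass_translated: bool = False) -> List[str]:
    """Build an argument list for one HUMAnN run (suitable for subprocess.run).

    bypass_translated=True skips translated search entirely (fastest option).
    Otherwise, protein_db may point at a smaller translated-search database,
    such as a directory holding the EC-filtered UniRef database.
    """
    cmd = ["humann", "--input", sample, "--output", out_dir]
    if bypass_translated:
        cmd.append("--bypass-translated-search")
    elif protein_db:
        cmd += ["--protein-database", protein_db]
    return cmd
```

Returning a list rather than a single string avoids shell-quoting problems when sample paths contain spaces, and makes it easy to write the commands to a job-array file.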