Thank you for the wonderful tools from the bioBakery team.
I have some questions about choosing HUMAnN and MetaPhlAn versions to maximise compatibility for my downstream data analysis.
I followed the thread at MetaPhlAn 4 + HUMAnN 4 compatibility - #8 by jtrachsel to get HUMAnN + MetaPhlAn working on my workstation. Since HUMAnN 4 seemed to output MetaPhlAn results as well, I took it as a one-stop solution for processing my samples.
However, during initial trial runs on some of my decontaminated samples, only about 4% of my reads mapped to UniRef genes, which raised a couple of red flags. After a little digging, I inferred that this is probably because there is no full UniRef database for HUMAnN 4 yet, which may explain the low mapping rate.
I am now considering HUMAnN 3 + MetaPhlAn 3 for my analysis, since version 4 of both tools seems to be early in development. Following MetaPhlAn 4.2.2 work with humann v3.0.1 - #5 by hammond, I would like to ask:
- Is HUMAnN 3.9 + MetaPhlAn 3.1.0 the "latest" stable and "compatible" pair of suites? If not, which pairing is recommended for maximum software compatibility? I intend to run MaAsLin2/3 (whichever is suitable) later in the pipeline. (A software compatibility list/matrix would be helpful here.)
- If running MetaPhlAn 3.1.0, which Bowtie 2 index is recommended? I would like to verify that `mpa_v31_CHOCOPhlAn_201901` is the "correct"/"latest" one to use for 3.1.0. I used `metaphlan --install --bowtie2db <PATH/TO/DB>` for this installation.
- I see that HUMAnN 3 has only three output files, as opposed to four in HUMAnN 4. If I finalise on HUMAnN 3 + MetaPhlAn 3 for the project, do I need to run MetaPhlAn 3 first and then HUMAnN 3 to get HUMAnN 4-like outputs? I want to get taxonomic, gene-family, and pathway profiles.
- In case I do need to run MetaPhlAn 3 and then HUMAnN 3, does the CHOCOPhlAn database need to be the same across the two runs? I am assuming I will run something like:

```
metaphlan trial_f1.fq.gz,trial_f2.fq.gz --bowtie2out ./trial_bt.out.bz2 --nproc 32 --input_type fastq --db_dir metaphlan_databases/vJan19/mpa_v31_CHOCOPhlAn_201901 -o ./output/kneaddata/01.kneaddata.txt

humann -r -i trial_concat.fq.gz -o ./output/humann --threads 32 --remove-temp-output --protein-database humann3_dbs/uniref --nucleotide-database humann3_dbs/chocophlan --metaphlan-options "-t rel_ab_w_read_stats --bowtie2db metaphlan_databases/vJan19 --index mpa_v31_CHOCOPhlAn_201901"
```
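For concreteness, here is a rough dry-run sketch of the two-step route I have in mind, printed rather than executed since the databases are large; I am assuming HUMAnN 3's `--taxonomic-profile` option is the intended way to reuse a precomputed MetaPhlAn profile, and the file names are placeholders from my trial run:

```shell
#!/bin/sh
# Dry run: print each command instead of executing it
# (swap 'echo' for the real invocation once the databases are in place).
run() { echo "+ $*"; }

# Step 1: taxonomic profile with MetaPhlAn 3
run metaphlan trial_f1.fq.gz,trial_f2.fq.gz --input_type fastq \
    --bowtie2db metaphlan_databases/vJan19 --index mpa_v31_CHOCOPhlAn_201901 \
    --nproc 32 -t rel_ab_w_read_stats -o trial_profile.tsv

# Step 2: HUMAnN 3, reusing the profile from step 1 so that
# MetaPhlAn is not run a second time internally
run humann -i trial_concat.fq.gz -o ./output/humann --threads 32 \
    --taxonomic-profile trial_profile.tsv \
    --nucleotide-database humann3_dbs/chocophlan \
    --protein-database humann3_dbs/uniref
```

If `--taxonomic-profile` is not the right way to hand HUMAnN 3 an existing profile, I would appreciate a correction there too.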
I would really appreciate a thorough clarification. Sorry for the lengthy query, and many thanks.
---
Additional info:
I used the latest KneadData (installed with `pip install kneaddata` in an independent conda environment, with the appropriate dependencies) and cleaned my samples using the CHM13 (T2T) indexes. (Though I doubt it matters much; cleaning against GRCh38 yielded a similar number of reads.)
My reads are Illumina 150 bp × 2, trimmed and cleaned with fastp; the average read length is over 149.9 bp. I simply concatenated the read pairs (as per the GitHub instructions) for the HUMAnN analysis.
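To be explicit about what I mean by "simply concatenated", here is a toy version of that step with stand-in reads (gzip members can be joined directly with `cat`; only the trial_* file names come from my actual run):

```shell
# Two tiny stand-in gzipped FASTQ files (4 lines = 1 read each);
# the real inputs are the fastp-cleaned pair.
printf '@r1/1\nACGTACGT\n+\nIIIIIIII\n' | gzip > trial_f1.fq.gz
printf '@r1/2\nTGCATGCA\n+\nIIIIIIII\n' | gzip > trial_f2.fq.gz

# Concatenated gzip members form one valid gzip stream.
cat trial_f1.fq.gz trial_f2.fq.gz > trial_concat.fq.gz

# Sanity check: 8 FASTQ lines = 2 reads
gzip -cd trial_concat.fq.gz | wc -l
```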