Respected developers,
Greetings! I hope this message finds you well. I am currently working on the development of a pipeline, specifically focusing on the decontamination step of NGS data quality control. While MetaPhlAn version 4.0.6 (1 Mar 2023) is primarily designed for species classification in metagenomic data, I strongly believe that its advantages in terms of speed, recent release, and active maintenance make it a suitable candidate for effective decontamination. There have already been successful precedents of using Kraken2-based approaches for decontamination.
My initial idea is to utilize the output.txt file generated by Metaphlan4 and the intermediate files from Bowtie2 to carry out the decontamination process. In the first column of the final output.txt file, the classification hierarchy is refined down to the SGB level. Therefore, my proposed approach involves adding SGB labels to each read in the input fastq files using Metaphlan4. Subsequently, we can filter out reads based on different classification hierarchy keywords such as species, genus, and family, using the assigned SGB labels as criteria for decontamination. I have already initiated the following command, and it is producing the expected output file:
$ metaphlan --bowtie2db /path/to/db/ \
/path/to/SRR5194984_1.fastq,/path/to/SRR5194984_2.fastq \
--nproc 12 --bowtie2out SRR5194984.bowtie2.bz2 --input_type fastq -o SRR5194984.txt
$cat SRR5194984.txt
#25407082 reads processed
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 100.0
k__Bacteria|p__Proteobacteria 2|1224 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria 2|1224|1236 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales 2|1224|1236|91347 100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae 2|1224|1236|91347|543 99.84893
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae 2|1224|1236|91347|1903414 0.15107
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia 2|1224|1236|91347|543|561 99.84893
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus 2|1224|1236|91347|1903414|583 0.15107
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli 2|1224|1236|91347|543|561|562 99.61441
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_marmotae 2|1224|1236|91347|543|561|1499973 0.23452
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_mirabilis 2|1224|1236|91347|1903414|583|584 0.15107
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__SGB10068 2|1224|1236|91347|543|561|562| 99.61441 k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_sonnei,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_flexneri,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_boydii,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_dysenteriae, ......
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_marmotae|t__SGB10064 2|1224|1236|91347|543|561|1499973| 0.23452 k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_sp_MOD1_EC5462,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_sp_MOD1_EC5438, ......
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_mirabilis|t__SGB9992 2|1224|1236|91347|1903414|583|584| 0.15107 k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_sp_G4399,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_sp_G4441, ......
#part message of SRR5194984.bowtie2.bz2
SRR5194984.177__1.177 UniRef90_P76204|1__4|SGB10068
SRR5194984.165__1.165 UniRef90_P69435|1__11|SGB10068
SRR5194984.151__1.151 UniRef90_P0A9S4|1__6|SGB10068
SRR5194984.272__1.272 UniRef90_Q8XDZ2|2__5|SGB10068
SRR5194984.236__1.236 UniRef90_P0A9S4|1__6|SGB10068
.
.
.
SRR5194984.12703024__2.12703024 UniRef90_P75800|3__7|SGB10068
SRR5194984.12703015__2.12703015 UniRef90_P76230|4__9|SGB10068
SRR5194984.12702952__2.12702952 UniRef90_P13518|2__10|SGB10064
SRR5194984.12702961__2.12702961 UniRef90_P75990|1__7|SGB10068
SRR5194984.12703282__2.12703282 UniRef90_P75908|1__4|SGB10068
#nreads 25407082
#avg_read_length 99.0
Since the final position of Bowtie2 results points to the SGB sequence identifier, my decontamination approach can be summarized as follows: I take the fastq file as input and annotate each read with information based on Metaphlan4. Then, I filter the reads based on the classification hierarchy. Initially, I thought that the Bowtie2 results file would include labels for each read. However, I later realized that this is not the case:
$ less SRR5194984.bowtie2.bz2 | grep "SRR5194984" | wc -l
217474
#nreads 25407082
The number of Bowtie2 results is much smaller compared to the number of reads in the input fastq file. Initially, I thought this might be due to the sampling limitation of the Metaphlan4 software. However, I addressed this issue by adding the parameter “–subsampling 25407081” to the command, making the complete command as follows:
$ metaphlan --bowtie2db /path/to/db/ \
/path/to/SRR5194984_1.fastq,/path/to/SRR5194984_2.fastq \
--nproc 12 --bowtie2out SRR5194984.bowtie2.bz2 --subsampling 25407081 \
--input_type fastq -o SRR5194984.txt
The generated result files remain unchanged compared to the previous ones. Therefore, I would like to know what command should I use to run Metaphlan4 in order to obtain the taxonomic classification information or specific SGB identifiers for each read. This will facilitate the implementation of the decontamination process in my script on the input fastq file.
I kindly request your thoughts and feedback regarding this approach. If you have any suggestions or recommendations to enhance the decontamination process, I would greatly appreciate your input.
Thank you for your time and consideration.
Asher