How to get the SGB number of each reads in fastq

Respected developers,

Greetings! I hope this message finds you well. I am currently working on the development of a pipeline, specifically focusing on the decontamination step of NGS data quality control. While MetaPhlAn version 4.0.6 (1 Mar 2023) is primarily designed for species classification in metagenomic data, I strongly believe that its advantages in terms of speed, recent release, and active maintenance make it a suitable candidate for effective decontamination. There have already been successful precedents of using Kraken2-based approaches for decontamination.

My initial idea is to utilize the output.txt file generated by Metaphlan4 and the intermediate files from Bowtie2 to carry out the decontamination process. In the first column of the final output.txt file, the classification hierarchy is refined down to the SGB level. Therefore, my proposed approach involves adding SGB labels to each read in the input fastq files using Metaphlan4. Subsequently, we can filter out reads based on different classification hierarchy keywords such as species, genus, and family, using the assigned SGB labels as criteria for decontamination. I have already initiated the following command, and it is producing the expected output file:

$ metaphlan --bowtie2db /path/to/db/ \
 /path/to/SRR5194984_1.fastq,/path/to/SRR5194984_2.fastq \
 --nproc 12 --bowtie2out SRR5194984.bowtie2.bz2 --input_type fastq -o SRR5194984.txt

$cat SRR5194984.txt 
#25407082 reads processed
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       100.0
k__Bacteria|p__Proteobacteria   2|1224  100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    2|1224|1236     100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales        2|1224|1236|91347       100.0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae  2|1224|1236|91347|543   99.84893
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae      2|1224|1236|91347|1903414       0.15107
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia   2|1224|1236|91347|543|561       99.84893
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus   2|1224|1236|91347|1903414|583   0.15107
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli       2|1224|1236|91347|543|561|562   99.61441
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_marmotae   2|1224|1236|91347|543|561|1499973      0.23452
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_mirabilis      2|1224|1236|91347|1903414|583|584       0.15107
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__SGB10068   2|1224|1236|91347|543|561|562| 99.61441 k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_sonnei,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_flexneri,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_boydii,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella_dysenteriae, ......
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_marmotae|t__SGB10064       2|1224|1236|91347|543|561|1499973|      0.23452 k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_sp_MOD1_EC5462,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_sp_MOD1_EC5438, ......
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_mirabilis|t__SGB9992   2|1224|1236|91347|1903414|583|584|     0.15107  k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_sp_G4399,k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Morganellaceae|g__Proteus|s__Proteus_sp_G4441, ......

#part message of SRR5194984.bowtie2.bz2

SRR5194984.177__1.177   UniRef90_P76204|1__4|SGB10068
SRR5194984.165__1.165   UniRef90_P69435|1__11|SGB10068
SRR5194984.151__1.151   UniRef90_P0A9S4|1__6|SGB10068
SRR5194984.272__1.272   UniRef90_Q8XDZ2|2__5|SGB10068
SRR5194984.236__1.236   UniRef90_P0A9S4|1__6|SGB10068
.
.
.
SRR5194984.12703024__2.12703024 UniRef90_P75800|3__7|SGB10068
SRR5194984.12703015__2.12703015 UniRef90_P76230|4__9|SGB10068
SRR5194984.12702952__2.12702952 UniRef90_P13518|2__10|SGB10064
SRR5194984.12702961__2.12702961 UniRef90_P75990|1__7|SGB10068
SRR5194984.12703282__2.12703282 UniRef90_P75908|1__4|SGB10068
#nreads 25407082
#avg_read_length        99.0

Since the final position of Bowtie2 results points to the SGB sequence identifier, my decontamination approach can be summarized as follows: I take the fastq file as input and annotate each read with information based on Metaphlan4. Then, I filter the reads based on the classification hierarchy. Initially, I thought that the Bowtie2 results file would include labels for each read. However, I later realized that this is not the case:

$ less SRR5194984.bowtie2.bz2 | grep "SRR5194984" | wc -l
217474

#nreads 25407082

The number of Bowtie2 results is much smaller compared to the number of reads in the input fastq file. Initially, I thought this might be due to the sampling limitation of the Metaphlan4 software. However, I addressed this issue by adding the parameter “–subsampling 25407081” to the command, making the complete command as follows:

$ metaphlan --bowtie2db /path/to/db/ \
 /path/to/SRR5194984_1.fastq,/path/to/SRR5194984_2.fastq \
 --nproc 12 --bowtie2out SRR5194984.bowtie2.bz2 --subsampling 25407081 \
 --input_type fastq -o SRR5194984.txt

The generated result files remain unchanged compared to the previous ones. Therefore, I would like to know what command should I use to run Metaphlan4 in order to obtain the taxonomic classification information or specific SGB identifiers for each read. This will facilitate the implementation of the decontamination process in my script on the input fastq file.
I kindly request your thoughts and feedback regarding this approach. If you have any suggestions or recommendations to enhance the decontamination process, I would greatly appreciate your input.

Thank you for your time and consideration.
Asher

Hi @Asher
MetaPhlAn 4 profiling is based on the mapping against species-specific markers genes, that only represent a part of the species genome, and thus, not all the reads will map to the database. This means that, using metaphlan4 you will never be able to assign all your reads to a taxon