Data Preprocessing for Human Whole Genome Sequencing

Dario · April 13, 2021, 7:00am

I have a data set that consists of oral squamous cell carcinoma tumour whole genome sequencing and adjacent normal whole genome sequencing. Is there a guide I coud follow about preprocessing such data? The simple approach I am taking is to extract unmapped reads from the BAM files (aligned to hg38) with samtools.

samtools view -b -f 12 -F 256 $alignmentsFilePath > $unmappedBam
samtools sort -n $unmappedBam -o $unmappedSortedBam
samtools fastq $unmappedSortedBam -1 ${sampleID}_unmapped_R1.fastq.gz -2 ${sampleID}_unmapped_R2.fastq.gz

Would that create suitable input data for MetaPhlAn 3? Kraken 2 natively handles paired FASTQ input via its --paired option. How are paired reads expected to be input to MetaPhlAn 3?

Is there any beginner’s tutorial using a small example data set, similar to the vignettes of Bioconductor packages, which has a step-by-step processing guide and explains the outputs at each stage?

shalvichirmade · May 24, 2022, 11:37am

Hi Dario,

The GitHub wiki page for MetaPhlAn seems to contain the information you are looking for. Here is the link: metaphlan3 · biobakery/biobakery Wiki · GitHub

Hope this helps! Good luck.

Dario · May 24, 2022, 10:00pm

That documentation has nothing specific about human whole genome sequencing data, which is a low biomass scenario.

aitor.blancomiguez · May 27, 2022, 10:37am

Hi @Dario
Thanks for getting in touch. MetaPhlAn 3 accepts FASTQ files but it does not use the paired-end information in the mapping step. You can pass the files (compressed or nor) as in this example:

metaphlan metagenome_R1.fastq,metagenome_R2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

In relation with your problem with the low microbial biomass scenario it is quite tricky. The approach you are following by using the reads not mapped to the human genomes is probably the best strategy here.

Topic		Replies	Views
MetaPhlAn preprocessing of reads MetaPhlAn	1	524	December 3, 2021
HUManN3 functional annotation doubts HUMAnN	6	1336	June 29, 2022
Humann3 Paired end reads HUMAnN	19	5871	October 30, 2024
Combine metaphlan and humann HUMAnN	2	36	June 15, 2025
Am I doing it right...? PanPhlAn	2	1060	July 8, 2020

Data Preprocessing for Human Whole Genome Sequencing

Related topics