Data Preprocessing for Human Whole Genome Sequencing

I have a data set that consists of oral squamous cell carcinoma tumour whole genome sequencing and adjacent normal whole genome sequencing. Is there a guide I coud follow about preprocessing such data? The simple approach I am taking is to extract unmapped reads from the BAM files (aligned to hg38) with samtools.

samtools view -b -f 12 -F 256 $alignmentsFilePath > $unmappedBam
samtools sort -n $unmappedBam -o $unmappedSortedBam
samtools fastq $unmappedSortedBam -1 ${sampleID}_unmapped_R1.fastq.gz -2 ${sampleID}_unmapped_R2.fastq.gz

Would that create suitable input data for MetaPhlAn 3? Kraken 2 natively handles paired FASTQ input via its --paired option. How are paired reads expected to be input to MetaPhlAn 3?

Is there any beginner’s tutorial using a small example data set, similar to the vignettes of Bioconductor packages, which has a step-by-step processing guide and explains the outputs at each stage?

Hi Dario,

The GitHub wiki page for MetaPhlAn seems to contain the information you are looking for. Here is the link: metaphlan3 · biobakery/biobakery Wiki · GitHub

Hope this helps! Good luck.

That documentation has nothing specific about human whole genome sequencing data, which is a low biomass scenario.

Hi @Dario
Thanks for getting in touch. MetaPhlAn 3 accepts FASTQ files but it does not use the paired-end information in the mapping step. You can pass the files (compressed or nor) as in this example:

metaphlan metagenome_R1.fastq,metagenome_R2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

In relation with your problem with the low microbial biomass scenario it is quite tricky. The approach you are following by using the reads not mapped to the human genomes is probably the best strategy here.