Data Preprocessing for Human Whole Genome Sequencing

I have a data set that consists of oral squamous cell carcinoma tumour whole genome sequencing and adjacent normal whole genome sequencing. Is there a guide I coud follow about preprocessing such data? The simple approach I am taking is to extract unmapped reads from the BAM files (aligned to hg38) with samtools.

samtools view -b -f 12 -F 256 $alignmentsFilePath > $unmappedBam
samtools sort -n $unmappedBam -o $unmappedSortedBam
samtools fastq $unmappedSortedBam -1 ${sampleID}_unmapped_R1.fastq.gz -2 ${sampleID}_unmapped_R2.fastq.gz

Would that create suitable input data for MetaPhlAn 3? Kraken 2 natively handles paired FASTQ input via its --paired option. How are paired reads expected to be input to MetaPhlAn 3?

Is there any beginner’s tutorial using a small example data set, similar to the vignettes of Bioconductor packages, which has a step-by-step processing guide and explains the outputs at each stage?

Hi Dario,

The GitHub wiki page for MetaPhlAn seems to contain the information you are looking for. Here is the link: metaphlan3 · biobakery/biobakery Wiki · GitHub

Hope this helps! Good luck.

That documentation has nothing specific about human whole genome sequencing data, which is a low biomass scenario.