I have a data set consisting of whole genome sequencing of oral squamous cell carcinoma tumours and of adjacent normal tissue. Is there a guide I could follow for preprocessing such data? The simple approach I am taking is to extract unmapped reads from the BAM files (aligned to hg38) with samtools:
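# Keep reads where both the read and its mate are unmapped (-f 12), skipping secondary alignments (-F 256)
samtools view -b -f 12 -F 256 "$alignmentsFilePath" > "$unmappedBam"
# Sort by read name so that mates are adjacent
samtools sort -n -o "$unmappedSortedBam" "$unmappedBam"
# Write each mate to its own FASTQ file
samtools fastq -1 "${sampleID}_unmapped_R1.fastq.gz" -2 "${sampleID}_unmapped_R2.fastq.gz" "$unmappedSortedBam"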
Would that create suitable input data for MetaPhlAn 3? Kraken 2 natively handles paired FASTQ input via its --paired option. How are paired reads expected to be passed to MetaPhlAn 3?
Is there any beginner’s tutorial using a small example data set, similar to the vignettes of Bioconductor packages, which has a step-by-step processing guide and explains the outputs at each stage?
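For context on the flag filter in the first samtools command above: as far as I understand the SAM flag bits, -f 12 requires both bit 4 (read unmapped) and bit 8 (mate unmapped) to be set, and -F 256 drops secondary alignments. So reads whose mate mapped to hg38 are not extracted. The following is only a minimal sketch of that logic in plain shell arithmetic; classify_flag is a hypothetical helper of my own, not part of samtools, and the example flag values are just illustrative.

```shell
#!/bin/sh
# Sketch: emulate `samtools view -f 12 -F 256` on a single SAM FLAG value.
# classify_flag is a hypothetical helper, not a samtools command.

REQUIRED=12    # 4 (read unmapped) + 8 (mate unmapped): every bit must be set (-f)
EXCLUDED=256   # secondary alignment: bit must not be set (-F)

classify_flag() {
    # $1 is the integer FLAG field of a SAM record
    if [ $(( $1 & REQUIRED )) -eq "$REQUIRED" ] && [ $(( $1 & EXCLUDED )) -eq 0 ]; then
        echo keep
    else
        echo drop
    fi
}

# 77  = paired, read unmapped, mate unmapped, first in pair
# 141 = paired, read unmapped, mate unmapped, second in pair
# 73  = paired, read mapped, mate unmapped, first in pair
# 69  = paired, read unmapped, mate mapped, first in pair
# 333 = 77 plus the secondary-alignment bit
for flag in 77 141 73 69 333; do
    printf '%s\t%s\n' "$flag" "$(classify_flag "$flag")"
done
```

If that reading is right, only flags such as 77 and 141 are kept, which means the extracted FASTQ files contain complete pairs only.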