The bioBakery help forum

Metagenomic raw data size

Dear bioBakery team,

I have downloaded the raw metagenomic data (MGX) from ftp://ftp.broadinstitute.org/raw/HMP2/MGX/2018-05-04/*.tar.
However, I found that the file sizes vary greatly. For example:
-rw-rw-r-- 1 carze broad 1228011520 May 4 2018 HSM6XRR3.tar
-rw-rw-r-- 1 carze broad 334336000 May 4 2018 HSM6XRR5.tar
-rw-rw-r-- 1 carze broad 681635840 May 4 2018 HSM6XRR7.tar
-rw-rw-r-- 1 carze broad 143360 May 4 2018 HSM6XRR9.tar
-rw-rw-r-- 1 carze broad 1518469120 May 4 2018 HSM6XRRB.tar

Prior to Illumina sequencing, were the library sizes of the samples close to each other? Or is the difference in file size because the data on the HMP2 servers are cleaned data, from which low-quality reads and human reads have been removed?
Thank you for your time.

Best regards,
Sheng

While the target sequencing depths were similar, that doesn’t always produce the same number of reads per sample in the end. That, combined with the QC procedures, explains the variance in file size and read count.
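
To put a number on the spread, here is a minimal Python sketch that compares the five archive sizes quoted in the question. The byte counts are copied from that listing; the helper name `to_mib` is only illustrative. File size is a rough proxy for read count at best, since it also depends on read length and on how the files inside each archive are compressed.

```python
# Archive sizes in bytes, copied from the listing in the question.
sizes = {
    "HSM6XRR3.tar": 1228011520,
    "HSM6XRR5.tar": 334336000,
    "HSM6XRR7.tar": 681635840,
    "HSM6XRR9.tar": 143360,
    "HSM6XRRB.tar": 1518469120,
}


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)


# Identify the extremes of the listing.
largest = max(sizes, key=sizes.get)
smallest = min(sizes, key=sizes.get)

# Print the archives from smallest to largest.
for name, n_bytes in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{name}\t{to_mib(n_bytes):10.2f} MiB")

# Ratio between the largest and the smallest archive.
ratio = sizes[largest] / sizes[smallest]
print(f"{largest} is about {ratio:.0f}x the size of {smallest}")
```

In this listing HSM6XRR9.tar stands out: at 143360 bytes it is far smaller than the other four, so it may be worth checking that sample's read count directly rather than relying on file size.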