WGS files not complete?

I downloaded a number of wgs tar files, and I’m finding inconsistencies between the number of reads in the fastq files and the number of reads stated in the metadata.
I’m using bioawk to count the reads in the files and comparing the result to “reads_filtered” in the metadata, although the other “reads_*” fields don’t match either.
I didn’t have errors during the download, and processing the files did not create problems (I would have expected some if the downloading was not complete).
I report here three samples for which the #reads_expected and the #reads_found don’t match:
sample     #reads_expected   #reads_found
CSM5MCXD   11141935          38259
CSM5MCUO   8976280           1476056
CSM5MCWQ   15777116          1467797
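
For anyone who wants to reproduce this kind of check without bioawk, here is a minimal Python sketch of the comparison. It is not the bioawk command I ran: the function names `count_fastq_reads` and `compare_to_metadata` are hypothetical, and it assumes each FASTQ record spans exactly four lines (bioawk parses records properly, so multi-line FASTQ would need a real parser).

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    # Assumption: files ending in .gz are gzip-compressed, others are plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 hints at a truncated file.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def compare_to_metadata(found, expected):
    """Return (sample, expected, found, missing) for samples whose counts differ."""
    rows = []
    for sample, n_expected in expected.items():
        n_found = found.get(sample)
        if n_found is not None and n_found != n_expected:
            rows.append((sample, n_expected, n_found, n_expected - n_found))
    return rows
```

Here `expected` would be a dict built from the “reads_filtered” column of the metadata and `found` a dict of counts per sample ID; for example, `compare_to_metadata({"CSM5MCXD": 38259}, {"CSM5MCXD": 11141935})` reports that sample as a mismatch.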

I would really appreciate it if you could give me some feedback on this. Thank you!

My guess is that the read counts from the table correspond to the first iteration of the reads/QC as published in the HMP2 paper. I know we have updated the filtered reads on ibdmdb.org since then as better QC options became available. That said, I’m surprised by how dramatically different these three cases are. Were they the only outliers, or was this a more general trend?

Thank you for your reply @franzosa!
Out of the 521 files I downloaded, 386 had fewer reads in the files than in the table.
Of these 386, 200 files have at least 1M fewer reads, and 108 of those have at least 10M fewer reads. I hope this helps! Thanks :slight_smile:

The updated QC being associated with fewer reads makes sense - the QC was improved to be more stringent. If the files with the most reads lost are mostly metatranscriptomes, I think that also tracks as they tend to be enriched for rRNA reads, and we improved the removal of those reads in the revised QC.

Thanks @franzosa!
It makes sense if the QC pipeline was updated, but the numbers I mentioned were for metagenomic files. Let me know if you would like to inspect them; I can pass along a list of IDs :slight_smile:

Hi @chiaramazzoni ,

Thank you for reaching out to the bioBakery Lab. We wanted to let you know that we are checking the metadata files for the wgs data for any issues. We will update you soon.


