Dataset S3 CRISPR

My lab thoroughly enjoyed reading your recent Cell Host & Microbe paper
analyzing CRISPR immune systems in the HMP data. We are hoping to
leverage your taxonomic mapping for a project of our own, but my student
Wei and I can find only a portion of the data in the supplemental data
files posted on your website.

The paper describes 1,630,590 spacer sequences taxonomically identified
by mapping to assemblies with MetaPhlAn2 or UniRef90 annotations and an
additional 768,068 spacer sequences taxonomically identified with
DIAMOND blastx directly against UniRef90. The latter set is clearly
identified in your “hmp1-II-crispr-spacers-annotation.tar.gz” data file.
However, we have been unable to find the former (larger) set.

Can you give any guidance? The paper references “Dataset S3,” but it is
not clear which file corresponds to this dataset.

Thanks for pointing this out! We have uploaded the dataset to the crispr2020 page on the Huttenhower Lab website as mapping_assembly.tar.gz. The archive contains the taxonomic annotations for the 1,630,590 spacers.

Best,
Philipp