The bioBakery help forum

Fasta sequences for metagenomic data


I am a Tenure-Track research fellow at Liverpool University with a current research interest in carbohydrate sulfatases found in the human gut microbiota. I recently read your paper in Nature Microbiology titled ‘Gut microbiome structure and metabolic activity in inflammatory bowel disease’ and was very interested in the data. It was a very nice piece of work.

I am particularly interested in the metagenomic analyses, which highlighted gene/protein abundance as well as differential expression. I observed in the supplemental that the sulfatases belonging to the following EC classes were in both categories: arylsulfatase, N-acetylgalactosamine-4-sulfatase, and choline-sulfatase.

I was hoping you would share your compiled protein fasta files of the metagenomic analyses so that my collaborators would be able to analyse the sulfatase sequences in the data. This data would be of great help to us.

Hi Alan - The QC’ed sequencing reads underlying this study are available here:

The proteins you highlighted were quantified with HUMAnN 2.0 (first by mapping reads to UniRef90 and then summing UniRef90 abundances according to their EC annotations from UniProt). Hence, we don’t have novel protein sequences per se from this study. To get those one would need to assemble the reads linked above and then perform protein annotation on the resulting contigs.
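To make the "summing UniRef90 abundances according to their EC annotations" step concrete, here is a minimal sketch of that kind of regrouping. It is not HUMAnN's own code: the function name, the UniRef90 IDs, and the abundance values are all hypothetical, and counting a family toward every EC it is annotated with is an assumption made for illustration.

```python
from collections import defaultdict


def sum_by_ec(uniref_abundance, uniref_to_ecs):
    """Sum per-family abundances into per-EC totals.

    Assumption: a family annotated with several ECs contributes its full
    abundance to each of them; families with no EC go to "UNGROUPED".
    """
    totals = defaultdict(float)
    for family, abundance in uniref_abundance.items():
        for ec in uniref_to_ecs.get(family, ["UNGROUPED"]):
            totals[ec] += abundance
    return dict(totals)


# Hypothetical UniRef90 IDs and abundances, for illustration only.
abundance = {
    "UniRef90_X1": 12.0,
    "UniRef90_X2": 3.5,
    "UniRef90_X3": 4.0,
    "UniRef90_X4": 1.0,
}
# Hypothetical family-to-EC annotations (3.1.6.1 is arylsulfatase,
# 3.1.6.6 is choline-sulfatase).
ec_map = {
    "UniRef90_X1": ["3.1.6.1"],
    "UniRef90_X2": ["3.1.6.1"],
    "UniRef90_X3": ["3.1.6.6"],
}

ec_totals = sum_by_ec(abundance, ec_map)
```

The point of the sketch is that the EC-level numbers are sums over reference families, which is why no new protein sequences come out of this step.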

Hi Eric,

Thanks for looking into this for me.

So to be clear, we would need to assemble the nucleotide data and perform ORF prediction to derive the protein sequences? My collaborator, Gurvan Michel, could do this, but we were hoping to avoid these steps. Given the current lockdown situation and the timelines for completing the project, it may not be possible.



If you wanted to extract the actual protein sequences present in the community, the approach you summarized (assemble + call ORFs + annotate) would be the way to go.

An approximate approach would be to pull out the UniRef90 reference sequences that we quantified from the samples by mapping and then later associated with your ECs of interest. Let me know if that would be useful and I can look into it.
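As a rough illustration of what "pulling out" those reference sequences could look like, the sketch below filters FASTA records by a set of wanted IDs. The helper names, the example records, and the IDs are hypothetical; it assumes the record ID is the first whitespace-delimited token of each header line.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_records(lines, wanted_ids):
    """Keep only records whose ID (first token of the header) is wanted."""
    for header, sequence in iter_fasta(lines):
        if header.split()[0] in wanted_ids:
            yield header, sequence


# Hypothetical records standing in for a UniRef90 FASTA file.
example = """>UniRef90_X1 hypothetical sulfatase
MKKL
LAVA
>UniRef90_X9 unrelated protein
MSTT
>UniRef90_X3 hypothetical sulfatase
MRRQ
""".splitlines()

selected = list(select_records(example, {"UniRef90_X1", "UniRef90_X3"}))
```

For a real file, the same functions could be fed an open file handle instead of the in-memory example, so the full database never has to be held in memory.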

Hi Eric,

I just wanted to say thanks for your help with this. I spoke to Gurvan, and he initially suggested that assemble + call ORFs + annotate could be possible, but it seems to be too big a task for the 220 samples. He has come up with an alternate approach:

  1.  A cleaning step: quality checking of the reads + elimination of the human 16S DNA traces.
  2.  Mapping of the reads on the SulfAtlas database allowing read assignment and quantification.
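
The quantification part of step 2 could be sketched roughly as follows. This is only an illustration under loud assumptions: it assumes the mapping tool writes 12-column BLAST-style tabular hits (read ID, reference ID, percent identity, ..., bit score), it keeps a single best hit per read, and the 90% identity cutoff, function name, read IDs, and reference IDs are all hypothetical rather than anything specified for SulfAtlas.

```python
from collections import Counter


def count_best_hits(rows, min_identity=90.0):
    """Count reads per reference, using each read's best-scoring hit.

    Assumes tab-separated rows with at least 12 columns where column 1 is
    the read ID, column 2 the reference ID, column 3 the percent identity,
    and column 12 the bit score. The identity cutoff is an arbitrary
    illustrative default.
    """
    best = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        read_id, ref_id = fields[0], fields[1]
        identity, bitscore = float(fields[2]), float(fields[11])
        if identity < min_identity:
            continue
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (ref_id, bitscore)
    return Counter(ref_id for ref_id, _ in best.values())


# Hypothetical hits: read1 maps to two references, read3 is below the cutoff.
hits = [
    "read1\tsulfA\t98.0\t50\t1\t0\t1\t50\t10\t59\t1e-20\t95.0",
    "read1\tsulfB\t92.0\t50\t4\t0\t1\t50\t10\t59\t1e-15\t80.0",
    "read2\tsulfB\t95.0\t50\t2\t0\t1\t50\t10\t59\t1e-18\t88.0",
    "read3\tsulfA\t70.0\t50\t15\t0\t1\t50\t10\t59\t1e-05\t40.0",
]

counts = count_best_hits(hits)
```

Raw counts like these would still need normalising (for example by reference length and sequencing depth) before comparing across the samples.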

So we may try this and see where we get. I just thought I would let you know and say thank you for offering to help us out.