The bioBakery help forum

Would it be a problem for us to use the samtools restrictions on allowable characters in sequence names?

Asking for a friend (really! from Nathan Sheffield):

"I’m working on a specification for a new standard called ‘sequence
collections’ that defines a way to algorithmically identify a set of
sequences (like, from a FASTA file, for a reference genome). As part of
this, a group of us is defining a formal specification, and we are
trying to come up with the list of “allowable characters” in sequence
names. Right now we’re planning to go with the SAM file specification,
which prohibits a few characters like whitespace, brackets, quotes,
etc. I wanted to reach out to people outside human genomics, the area I’m
familiar with, in other areas like plant or microbial genomics, where
there may be different conventions for how to refer to sequence names, so
that we make sure we can accommodate common practices from other
communities.

So, my question for you is, in the microbial genomics world, would it be
a problem for us to use the samtools restrictions, or would that cut out
some common use cases in your community? You can see more and read the
exact character limitations and comment on this issue here: