NCBI Data in XML
Introduction
Extensible Markup Language (XML) is a tagged format similar to HTML on which web
pages are based. The familiar text format and availability of public domain tools for
parsing this language is making it a popular choice for the exchange of structured data
over the WWW. Roughly ten years ago, NCBI chose a language called Abstract Syntax
Notation 1 (ASN.1) for describing and exchanging information in a manner similar to the
ways XML is now used. ASN.1 came out of the telecommunications industry and is a
compact binary encoding intended for both human readable text as well as integers,
floating point numbers, and so on. While this is "software friendly" it is less accessible to
users familiar with HTML and other text based languages. Tools for ASN.1 have largely
stayed within the commercial telecommunications industry while a host of public domain
tools of varying character have arisen for XML and HTML.
NCBI has recently added support for XML output to its ASN.1 toolkit. An ASN.1
specification can be automatically rendered into an XML DTD. Data encoded in ASN.1
can automatically be output in XML which will validate against the DTD using standard
XML tools. We hope this will make the structured sequence, map, and structure data, as
well as the output of tools like BLAST, more accessible to those who wish to work in
XML.
We are providing XML in two basic modes. Full Data Conversion is the direct mapping
of every data field used within NCBI to XML. This is not for the faint of heart, but it
does mean that whatever we have, you have. The other mode is to provide smaller,
Targeted DTDs for end users. These are still first done as ASN.1, but with an eye to
providing smaller, standalone data outputs as XML. These two modes are described in
detail below.
Full Data Conversion
Note that the full conversion of existing ASN.1 specified data into XML has some
specific properties. NCBI is not proposing a new data model, but is simply transliterating
the data model we have used for the last decade into a different language for the
convenience of our users. ASN.1 has a number of specific data types such as INTEGER
or REAL numbers while XML has only strings, so our DTD automatically adds some
ENTITY definitions at the top which maps these numbers to strings. This mapping only
allows humans that read the DTD to see where numbers are expected; an XML validator
will not care what is there. The ASN.1 validators do care, and can also check ranges of
values and so on, so those continue to be used to read and process the data within NCBI.
Reuse and Roles
ASN.1 is also designed to allow the reuse of modules in a specification. Modules may be in multiple
files and mixed and matched as needed, similar to C or C++ header files defining structures and
classes. Most XML specifications in biology have been relatively small thus far, and/or focussed on
the work of a specific group. Thus the DTDs tend to be in a single file. It is possible to write a
large modular DTD in XML, and this is done by commercial publishing houses, but in XML the including
process requires two sets of files. One file is basically a list of DTDs to put together to make the
complete DTD. The other is the DTD modules themselves. In the NCBI XML specs, the files with a .dtd
extension are the ones referenced by the DOCTYPE line in an XML file. The DTDs for individual
modules have the extension .mod, and these corresspond to the ASN.1 modules.
XML can be "valid" or "well formed". Valid XML means that the data in a record is compared with a
specific DTD and all the rules and elements defined in the DTD are correctly reflected in the data.
Well formed XML just means that the file does not break any XML syntax rules, but no check is made
that it actually follows the specification of its DTD. ASN.1 was designed on the basis that data
must always be "valid". Not only is this more "type safe", but it also means that the ASN.1 parser
always knows the structure of the data. This makes compact binary encoding possible. It also means
that data elements can be reused in different roles without lots of extra tagging since the context
is always known. So in ASN.1 (or most computer languages) the data structure "Person" can have a field
called "name" and "Gene" can also have a field called "name", and nothing gets confused. XML requires
that every ELEMENT have a unique tag, so if "Person" and "Gene" appear in the same DTD, you cannot
have a single tag, "name" that means two different things depending on context.
Roles:
For example, the NCBI ASN.1 specification was designed to be used in a modular way. So a single Date
object is defined with the fields year, month, day, etc. It is then referenced in any object that
needs a date, that is, this object can be reused in a variety of roles. Since ASN.1 assumes a
modular structure, it is straightforward to reuse data in different roles without a lot of overhead.
For this specification:
Record ::= SEQUENCE {
create-date Date,
update-date Date }
Date ::= SEQUENCE {
month INTEGER,
year INTEGER }
and some sample data might be:
Record ::= SEQUENCE {
create-date {
month 6,
year 1999 },
update-date {
month 8,
year 2000 } }
the direct mapping to XML requires that every ELEMENT be explicitly tagged and not
implied by the context. So the equivalent DTD is more verbose:
as is the XML data itself:
6
1999
8
2000
There is a tendency in XML DTDs to adjust to this expansion of tag levels due to roles,
by defining each role separately as it occurs:
Scope:
ASN.1 does not require that a name be unique except within a structure, similar to
C or C++. XML however requires that all names be unique across the DTD, unless they
are attributes which must come from a limited repertoire. Many XML parsers rely on this
so that callback functions are associated wth a tag, not a tag within context. As a trivial
illustration, if both people and genes have names, they are distinct in ASN.1:
Person ::= SEQUENCE {
name VisibleString,
room-number INTEGER }
Gene ::= SEQUENCE {
name VisibleString,
map VisibleString }
but must be made unique in XML to be distinguished:
In the case above, we prefixed the element (name) that was used in two contexts with the
name of the context to make it unique. But this requires an analysis of all the modules of
the specification at once. In addition, it assumes the modules will not be used in other
contexts in future, which might make other elements non-unique. So the automatic
converter guarantees that every element is unique by always prefixing all element names
with the context (and would produce both Person_room, and Gene_map, in the example
above).
Alternate Representations:
In a number of cases the ASN.1 specification allows alternate forms of the same data object. This is
because our goal was to get a workable specification that would incorporate data from all the
available sources. While the overall model is designed to a view of how it "should be" there
are lots of places where we allow for the reality of available sources. So, for example, while we
might prefer that a Date have fields for month and year, for some sources we may only have a string.
Rather than drop the Date altogether in those cases, we allow alternate forms in ASN.1:
Date ::= CHOICE {
str VisibleString, -- when it is all we have
std Date-std } -- preferred
Date-std ::= SEQUENCE {
month INTEGER,
year INTEGER }
which is represented in ASN.1 data as:
Date ::= std {
month 8,
year 1999 }
However in XML it requires two more layers of explicit tags:
8
1999
Note the use of hyphen in the original names (eg. Date-std) and of underline to delimit a
role in another object (eg. Date_std).
Summary:
While the effect of Roles, Scope, and Alternate Forms results in extensive
tags in the XML, it does accurately reflect the structure and use of the data. It allows
XML programs to capture as little or as much of the full data structure as they wish. And
once converted back from XML to structures or classes in a variety of programming
languages there is minimal overhead once again. The full NCBI DTD reflects this
structure. What is called the NCBI DTD actually only specifies the basic data structures
for publications, sequences, maps, alignments, and structures. These same elements are
reused in different roles in many services as well, such as BLAST which produces
alignments (defined in NCBI DTD) as well as other elements specific to BLAST. We
have not copied all the referenced modules into a DTD for every service as a practical
matter, although we can produce XML output from any ASN.1 interface.
Targeted DTDs
Many people do not want, or will not make use of the full data specification used
internally by NCBI. It is possible for us to fairly easily write specialized subsets into
standalone specifications when there is a clear community need that will be served. Just
as FASTA files are a very limited representation of a sequence, they are sufficient for a
large number of users most of the time.
In the NCBI toolkit are tools which, given an ASN.1 specification, will automatically
generate the C or C++ code (C++ version is still in development) to read and write data
conforming to that specification in ASN.1, the C structures or classes to store it in, the
XML DTD, and the code to write it in XML. Thus we can specify a simpler, special
purpose structure, automatically generate most of the necessary code, then manually
write a relatively small bit of code to fill in the fields in the new C structure from our
existing C structures of the full version.
We have created two small examples of this. The Minimal Sequence (MinSeq) example
keeps some of the modular structure of the full specification, but greatly reduces the
number and depths of elements, and does not reference any other specification. The Tiny
Sequence (TinySeq) removes all modularity (and thus a lot of the flexibility for growth
and modification) of MinSeq but results in an extremely simple structure. All these forms
of any sequence are available in the XML demo application. We welcome comments and
suggestions after you have looked through the demo.
asn2xml
asn2xml is a utility program designed to read sequence data in ASN.1 and output it as
"full XML", for those who would prefer working with that format. The only change to
the data itself, in addition to the remapping to XML, is to convert binary sequence
alphabets to text. Especially for long DNA sequences NCBI normally stores the data
in ASN.1 in 2 bits per base if there are no ambiguity codes, or 4 bits per base if there
are. This reduces the data size by a factor of 2 or 4, and is also a more convenient
form for many computations. Since XML is a text format, the alphabets are converted.
This, and the more verbose tagging in XML, result in considerable expansion of the
data from the binary ASN.1 on our ftp site. So, to conserve our heavily used bandwidth
and disk space, we provide this utility. You can ftp binary ASN.1 and then expand it
on your site to XML.
The arguments to asn2xml (or any NCBI application) can be seen by typing the name and a
hyphen.. "asn2xml -" which will give you:
asn2xml 1.0 arguments:
-i Filename for asn.1 input [File In]
default = stdin
-e Input is a Seq-entry [T/F] Optional
default = F
-b Input asnfile in binary mode [T/F] Optional
default = T
-o Filename for XML output [File Out] Optional
default = stdout
-l Log errors to file named: [File Out] Optional
The defaults are set to read a binary update file into stdin and output xml from stdout:
gzcat update.aso | asn2xml > update.xml
The binary ASN.1 files can be found in the ncbi ftp directory at ftp.ncbi.nih.gov/ncbi-asn1
Be sure to transfer them in binary format. Note that these files include GenBank in ASN.1,
as well as other sources such as RefSeq, PIR, PDB, etc. SWISSPROT is not included since it
is no longer distributable in the public domain.
Documentation on the ASN.1 specification, and pointers to the DTDs, and a demo program that shows
MinSeq and TinySeq are at http://www.ncbi.nlm.nih.gov/IEB from the upper right hand corner of the
page. This page is not really finished, but interest in XML has prompted us to show it to you
anyway. The ASN.1 spec documentation is directly relevant to the XML version since they are the same
logical structure with pretty much the same names. Note that our DOCTYPE line is set up so that
you can validate XML either with local DTD files from us, or using the public repository at
http://www.ncbi.nlm.nih.gov/IEB/DTD