Automatic extraction of structure into SGML format


for general circulation

Mark Eichin
Cygnus Solutions

Introduction

A wide variety of tools exist for manipulating and examining documents marked up with SGML. In particular, they deal cleanly with the structure described by the DTD and the markup. This makes SGML well suited as a "common ground" format for converting structured data.

HP100 Generic Databases

Given a database format that already has self-describing fields, it is even possible to construct a DTD directly that covers all of the existing structure. One example is the GDB (Generic Database) format used by the HP palmtop applications. The application on the palmtop has a format editor that lets one easily change field names and types, as well as their location on input screens.

The database format is published, and there are gdbload and gdbdump programs that convert databases to and from comma-separated value format. I have modified gdbdump to accept a -S flag that generates SGML output, including the DOCTYPE declaration describing the data. The outermost element (covering the entire database) is named from the filename given to the conversion command. Within that, a dbentry element is composed of all of the field names stored in the database header, which are themselves declared as simple #PCDATA elements.
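The construction described above can be sketched in a few lines of Python. This is only an illustration, not the modified gdbdump itself: the function names sgml_name and build_doctype are hypothetical, the field names are passed in as a plain list rather than read from a GDB header, and the choice of a strict sequence for the dbentry content model is an assumption.

```python
import re


def sgml_name(raw):
    """Turn a file or field name into a plausible SGML name (assumed rules)."""
    # Keep letters, digits, '.' and '-'; collapse anything else into '-'.
    name = re.sub(r"[^A-Za-z0-9.-]+", "-", raw.strip()).strip("-").lower()
    # SGML names must start with a letter, so prefix the ones that do not.
    if not name or not name[0].isalpha():
        name = "f-" + name
    return name


def build_doctype(db_name, field_names):
    """Build a DOCTYPE declaration from a database name and its field names.

    Duplicate field names are not handled in this sketch.
    """
    root = sgml_name(db_name)
    fields = [sgml_name(f) for f in field_names]
    lines = [f"<!DOCTYPE {root} ["]
    # The outermost element covers the whole database.
    lines.append(f"<!ELEMENT {root} - - (dbentry*)>")
    # Assumption: each record holds every field once, in header order.
    lines.append(f"<!ELEMENT dbentry - - ({', '.join(fields)})>")
    # Each field is a simple character-data element.
    for field in fields:
        lines.append(f"<!ELEMENT {field} - - (#PCDATA)>")
    lines.append("]>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_doctype("phone.gdb", ["Name", "Phone Number"]))
```

Generating the declaration from the header means the DTD always matches the database it came from, even after field names have been changed in the palmtop's format editor.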

There is one odd case: a record can contain a Note field with multiple lines of text. To preserve the line breaks, an empty crnl element is permitted within any element corresponding to a field that has the note attribute in the database description.
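As a rough sketch of how such a field might be handled, the Python below replaces each line break in a note with an empty crnl start tag and produces a mixed-content declaration for the field. The helper names are hypothetical, and the exact declarations emitted by the modified gdbdump may differ; the mixed content model shown is one common way to allow an empty element inside character data.

```python
# Declaration for the empty line-break element; an EMPTY element has no end tag.
CRNL_DECLARATION = "<!ELEMENT crnl - O EMPTY>"


def escape_pcdata(text):
    """Escape the characters that would otherwise be read as markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def note_to_sgml(text):
    """Render multi-line note text, marking each line break with <crnl>."""
    # Normalise CR LF and bare CR to LF before splitting into lines.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "<crnl>".join(escape_pcdata(line) for line in lines)


def note_declaration(field):
    """Declare a note field as character data mixed with crnl elements."""
    return f"<!ELEMENT {field} - - (#PCDATA | crnl)*>"


if __name__ == "__main__":
    print(CRNL_DECLARATION)
    print(note_declaration("note"))
    print("<note>" + note_to_sgml("Call after 5pm\r\nAsk for R&D") + "</note>")
```

Using an element rather than a literal newline keeps the break explicit, since SGML parsers may treat record ends in character data as insignificant.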

USR PalmPilot databases

The USR PalmPilot palmtop also has several databases, though not using such a generic format. In fact, the current tools for the Pilot query the palmtop directly for records, rather than manipulating the database images available on the server side of a HotSync or pilot-link backup. Since the primary motivation for this effort was to handle conversion from the HP to the Pilot, the next step is to write a program that will

The jade package can easily handle the DTD conversion using a DSSSL style-sheet.

Further Work

Back-conversion is also important. Having gone through the effort of producing the data in a marked-up form, it may be worth treating that form as the canonical one.