This is /home/pdm/install/Python-2.1/Doc/lib/python-lib.info, produced
by makeinfo version 4.0 from lib.texi.

   April 15, 2001		2.1


File: python-lib.info,  Node: sgmllib,  Next: htmllib,  Prev: Structured Markup Processing Tools,  Up: Structured Markup Processing Tools

Simple SGML parser
==================

   Only as much of an SGML parser as needed to parse HTML.

   This module defines a class `SGMLParser' which serves as the basis
for parsing text files formatted in SGML (Standard Generalized Mark-up
Language).  In fact, it does not provide a full SGML parser -- it only
parses SGML insofar as it is used by HTML, and the module only exists
as a base for the `htmllib' module.

`SGMLParser()'
     The `SGMLParser' class is instantiated without arguments.  The
     parser is hardcoded to recognize the following constructs:

        * Opening and closing tags of the form `<TAG ATTR="VALUE" ...>'
          and `</TAG>', respectively.

        * Numeric character references of the form `&#NAME;'.

        * Entity references of the form `&NAME;'.

        * SGML comments of the form `<!--TEXT-->'.  Note that spaces,
          tabs, and newlines are allowed between the trailing `>' and
          the immediately preceding `--'.


   `SGMLParser' instances have the following interface methods:

`reset()'
     Reset the instance.  Loses all unprocessed data.  This is called
     implicitly at instantiation time.

`setnomoretags()'
     Stop processing tags.  Treat all following input as literal input
     (CDATA).  (This is only provided so the HTML tag `<PLAINTEXT>' can
     be implemented.)

`setliteral()'
     Enter literal mode (CDATA mode).

`feed(data)'
     Feed some text to the parser.  It is processed insofar as it
     consists of complete elements; incomplete data is buffered until
     more data is fed or `close()' is called.

`close()'
     Force processing of all buffered data as if it were followed by an
     end-of-file mark.  This method may be redefined by a derived class
     to define additional processing at the end of the input, but the
     redefined version should always call the base class `close()'.

`get_starttag_text()'
     Return the text of the most recently opened start tag.  This should
     not normally be needed for structured processing, but may be
     useful in dealing with HTML "as deployed" or for re-generating
     input with minimal changes (whitespace between attributes can be
     preserved, etc.).

`handle_starttag(tag, method, attributes)'
     This method is called to handle start tags for which either a
     `start_TAG()' or `do_TAG()' method has been defined.  The TAG
     argument is the name of the tag converted to lower case, and the
     METHOD argument is the bound method which should be used to
     support semantic interpretation of the start tag.  The ATTRIBUTES
     argument is a list of `(NAME, VALUE)' pairs containing the
     attributes found inside the tag's `<>' brackets.  The NAME has
     been translated to lower case and double quotes and backslashes in
     the VALUE have been interpreted.  For instance, for the tag `<A
     HREF="http://www.cwi.nl/">', this method would be called as
     `handle_starttag('a', METHOD, [('href', 'http://www.cwi.nl/')])',
     provided a `start_a()' or `do_a()' method is defined.  The base
     implementation simply calls METHOD with ATTRIBUTES as the only
     argument.

`handle_endtag(tag, method)'
     This method is called to handle end tags for which an `end_TAG()'
     method has been defined.  The TAG argument is the name of the tag
     converted to lower case, and the METHOD argument is the bound
     method which should be used to support semantic interpretation of
     the end tag.  If no `end_TAG()' method is defined for the closing
     element, this handler is not called.  The base implementation
     simply calls METHOD.

`handle_data(data)'
     This method is called to process arbitrary data.  It is intended
     to be overridden by a derived class; the base class implementation
     does nothing.

`handle_charref(ref)'
     This method is called to process a character reference of the form
     `&#REF;'.  In the base implementation, REF must be a decimal
     number in the range 0-255.  It translates the character to ASCII
     and calls the method `handle_data()' with the character as
     argument.  If REF is invalid or out of range, the method
     `unknown_charref(REF)' is called to handle the error.  A subclass
     must override this method to provide support for named character
     entities.

`handle_entityref(ref)'
     This method is called to process a general entity reference of the
     form `&REF;' where REF is a general entity reference.  It looks
     for REF in the instance (or class) variable `entitydefs' which
     should be a mapping from entity names to corresponding
     translations.  If a translation is found, it calls the method
     `handle_data()' with the translation; otherwise, it calls the
     method `unknown_entityref(REF)'.  The default `entitydefs' defines
     translations for `&amp;', `&apos;', `&gt;', `&lt;', and `&quot;'.

`handle_comment(comment)'
     This method is called when a comment is encountered.  The COMMENT
     argument is a string containing the text between the `<!--' and
     `-->' delimiters, but not the delimiters themselves.  For example,
     the comment `<!--text-->' will cause this method to be called with
     the argument `'text''.  The default method does nothing.

`handle_decl(data)'
     Method called when an SGML declaration is read by the parser.  In
     practice, the `DOCTYPE' declaration is the only thing observed in
     HTML, but the parser does not discriminate among different (or
     broken) declarations.  Internal subsets in a `DOCTYPE' declaration
     are not supported.  The DATA parameter will be the entire contents
     of the declaration inside the `<!'...`>' markup.  The default
     implementation does nothing.

`report_unbalanced(tag)'
     This method is called when an end tag is found which does not
     correspond to any open element.

`unknown_starttag(tag, attributes)'
     This method is called to process an unknown start tag.  It is
     intended to be overridden by a derived class; the base class
     implementation does nothing.

`unknown_endtag(tag)'
     This method is called to process an unknown end tag.  It is
     intended to be overridden by a derived class; the base class
     implementation does nothing.

`unknown_charref(ref)'
     This method is called to process unresolvable numeric character
     references.  Refer to `handle_charref()' to determine what is
     handled by default.  It is intended to be overridden by a derived
     class; the base class implementation does nothing.

`unknown_entityref(ref)'
     This method is called to process an unknown entity reference.  It
     is intended to be overridden by a derived class; the base class
     implementation does nothing.

   Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are case
independent; the TAG occurring in method names must be in lower case:

`start_TAG(attributes)'
     This method is called to process an opening tag TAG.  It has
     preference over `do_TAG()'.  The ATTRIBUTES argument has the same
     meaning as described for `handle_starttag()' above.

`do_TAG(attributes)'
     This method is called to process an opening tag TAG that does not
     come with a matching closing tag.  The ATTRIBUTES argument has the
     same meaning as described for `handle_starttag()' above.

`end_TAG()'
     This method is called to process a closing tag TAG.

   Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by `start_TAG()' are
pushed on this stack.  Definition of an `end_TAG()' method is optional
for these tags.  For tags processed by `do_TAG()' or by
`unknown_starttag()', an `end_TAG()' method need not be defined; if one
is defined, it will not be used.  If both `start_TAG()' and `do_TAG()'
methods exist for a tag, the `start_TAG()' method takes precedence.


File: python-lib.info,  Node: htmllib,  Next: htmlentitydefs,  Prev: sgmllib,  Up: Structured Markup Processing Tools

A parser for HTML documents
===========================

   A parser for HTML documents.

   This module defines a class which can serve as a base for parsing
text files formatted in the HyperText Mark-up Language (HTML).  The
class is not directly concerned with I/O -- it must be provided with
input in string form via a method, and makes calls to methods of a
"formatter" object in order to produce output.  The `HTMLParser' class
is designed to be used as a base class for other classes in order to
add functionality, and allows most of its methods to be extended or
overridden.  In turn, this class is derived from and extends the
`SGMLParser' class defined in module `sgmllib'.  The `HTMLParser'
implementation supports the HTML 2.0 language as described in RFC 1866.
Two implementations of formatter objects are provided in the
`formatter' module; refer to the documentation for that module for
information on the formatter interface.

   The following is a summary of the interface defined by
`sgmllib.SGMLParser':

   * The interface to feed data to an instance is through the `feed()'
     method, which takes a string argument.  This can be called with as
     little or as much text at a time as desired; `p.feed(a);
     p.feed(b)' has the same effect as `p.feed(a+b)'.  When the data
     contains complete HTML tags, these are processed immediately;
     incomplete elements are saved in a buffer.  To force processing of
     all unprocessed data, call the `close()' method.

     For example, to parse the entire contents of a file, use:
          parser.feed(open('myfile.html').read())
          parser.close()

   * The interface to define semantics for HTML tags is very simple:
     derive a class and define methods called `start_TAG()',
     `end_TAG()', or `do_TAG()'.  The parser will call these at
     appropriate moments: `start_TAG()' or `do_TAG()' is called when an
     opening tag of the form `<TAG ...>' is encountered; `end_TAG()' is
     called when a closing tag of the form `</TAG>' is encountered.  If
     an opening tag requires a corresponding closing tag, like `<H1>'
     ... `</H1>', the class should define the `start_TAG()' method; if
     a tag requires no closing tag, like `<P>', the class should define
     the `do_TAG()' method.


   The module defines a single class:

`HTMLParser(formatter)'
     This is the basic HTML parser class.  It supports all entity names
     required by the HTML 2.0 specification (RFC 1866).  It also defines
     handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.

   See also:

   *Note htmlentitydefs:: Definition of replacement text for HTML 2.0
entities.  *Note sgmllib:: Base class for `HTMLParser'.

* Menu:

* HTMLParser Objects::


File: python-lib.info,  Node: HTMLParser Objects,  Prev: htmllib,  Up: htmllib

HTMLParser Objects
------------------

   In addition to tag methods, the `HTMLParser' class provides some
additional methods and instance variables for use within tag methods.

`formatter'
     This is the formatter instance associated with the parser.

`nofill'
     Boolean flag which should be true when whitespace should not be
     collapsed, or false when it should be.  In general, this should
     only be true when character data is to be treated as
     "preformatted" text, as within a `<PRE>' element.  The default
     value is false.  This affects the operation of `handle_data()' and
     `save_end()'.

`anchor_bgn(href, name, type)'
     This method is called at the start of an anchor region.  The
     arguments correspond to the attributes of the `<A>' tag with the
     same names.  The default implementation maintains a list of
     hyperlinks (defined by the `HREF' attribute for `<A>' tags) within
     the document.  The list of hyperlinks is available as the data
     attribute `anchorlist'.

`anchor_end()'
     This method is called at the end of an anchor region.  The default
     implementation adds a textual footnote marker using an index into
     the list of hyperlinks created by `anchor_bgn()'.

`handle_image(source, alt[, ismap[, align[, width[, height]]]])'
     This method is called to handle images.  The default implementation
     simply passes the ALT value to the `handle_data()' method.

`save_bgn()'
     Begins saving character data in a buffer instead of sending it to
     the formatter object.  Retrieve the stored data via `save_end()'.
     Use of the `save_bgn()' / `save_end()' pair may not be nested.

`save_end()'
     Ends buffering character data and returns all data saved since the
     preceding call to `save_bgn()'.  If the `nofill' flag is false,
     whitespace is collapsed to single spaces.  A call to this method
     without a preceding call to `save_bgn()' will raise a `TypeError'
     exception.


File: python-lib.info,  Node: htmlentitydefs,  Next: xml.parsers.expat,  Prev: htmllib,  Up: Structured Markup Processing Tools

Definitions of HTML general entities
====================================

   Definitions of HTML general entities.

   This section was written by Fred L. Drake, Jr. <fdrake@acm.org>.
This module defines a single dictionary, `entitydefs', which is used by
the `htmllib' module to provide the `entitydefs' member of the
`HTMLParser' class.  The definition provided here contains all the
entities defined by HTML 2.0 that can be handled using simple textual
substitution in the Latin-1 character set (ISO-8859-1).

`entitydefs'
     A dictionary mapping HTML 2.0 entity names to their replacement
     text in ISO Latin-1.


File: python-lib.info,  Node: xml.parsers.expat,  Next: xml.dom,  Prev: htmlentitydefs,  Up: Structured Markup Processing Tools

Fast XML parsing using Expat
============================

   An interface to the Expat non-validating XML parser.  This module
was documented by Paul Prescod <paul@prescod.net>.
This section was written by A.M. Kuchling <amk1@bigfoot.com>.
_Added in Python version 2.0_

   The `xml.parsers.expat' module is a Python interface to the Expat
non-validating XML parser.  The module provides a single extension
type, `xmlparser', that represents the current state of an XML parser.
After an `xmlparser' object has been created, various attributes of the
object can be set to handler functions.  When an XML document is then
fed to the parser, the handler functions are called for the character
data and markup in the XML document.

   This module uses the `pyexpat' module to provide access to the Expat
parser.  Direct use of the `pyexpat' module is deprecated.

   This module provides one exception and one type object:

`ExpatError'
     The exception raised when Expat reports an error.

`error'
     Alias for `ExpatError'.

`XMLParserType'
     The type of the return values from the `ParserCreate()' function.

   The `xml.parsers.expat' module contains two functions:

`ErrorString(errno)'
     Returns an explanatory string for a given error number ERRNO.

`ParserCreate([encoding[, namespace_separator]])'
     Creates and returns a new `xmlparser' object.  ENCODING, if
     specified, must be a string naming the encoding used by the XML
     data.  Expat doesn't support as many encodings as Python does, and
     its repertoire of encodings can't be extended; it supports UTF-8,
     UTF-16, ISO-8859-1 (Latin1), and ASCII.  If ENCODING is given it
     will override the implicit or explicit encoding of the document.

     Expat can optionally do XML namespace processing for you, enabled
     by providing a value for NAMESPACE_SEPARATOR.  The value must be a
     one-character string; a `ValueError' will be raised if the string
     has an illegal length (`None' is considered the same as omission).
     When namespace processing is enabled, element type names and
     attribute names that belong to a namespace will be expanded.  The
     element name passed to the element handlers `StartElementHandler'
     and `EndElementHandler' will be the concatenation of the namespace
     URI, the namespace separator character, and the local part of the
     name.  If the namespace separator is a zero byte (`chr(0)') then
     the namespace URI and the local part will be concatenated without
     any separator.

     For example, if NAMESPACE_SEPARATOR is set to a space character
     (` ') and the following document is parsed:

          <?xml version="1.0"?>
          <root xmlns    = "http://default-namespace.org/"
                xmlns:py = "http://www.python.org/ns/">
            <py:elem1 />
            <elem2 xmlns="" />
          </root>

     `StartElementHandler' will receive the following strings for each
     element:

          http://default-namespace.org/ root
          http://www.python.org/ns/ elem1
          elem2
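
     The namespace example above can be reproduced with a short
     script.  This is only a sketch: the handler name `start_element'
     and the list `names' are illustrative, and the arguments are
     passed positionally so that no keyword names are assumed.

```python
import xml.parsers.expat

names = []  # element names exactly as passed to the start handler

def start_element(name, attrs):
    names.append(name)

# First argument: encoding (None keeps the document's own encoding).
# Second argument: namespace_separator, here a single space.
p = xml.parsers.expat.ParserCreate(None, ' ')
p.StartElementHandler = start_element

document = """<?xml version="1.0"?>
<root xmlns    = "http://default-namespace.org/"
      xmlns:py = "http://www.python.org/ns/">
  <py:elem1 />
  <elem2 xmlns="" />
</root>"""
p.Parse(document, 1)
```

     After the call to `Parse()', `names' holds one string per
     element, in document order.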

* Menu:

* XMLParser Objects::
* ExpatError Exceptions::
* Example 8::
* Content Model Descriptions::
* Expat error constants::


File: python-lib.info,  Node: XMLParser Objects,  Next: ExpatError Exceptions,  Prev: xml.parsers.expat,  Up: xml.parsers.expat

XMLParser Objects
-----------------

   `xmlparser' objects have the following methods:

`Parse(data[, isfinal])'
     Parses the contents of the string DATA, calling the appropriate
     handler functions to process the parsed data.  ISFINAL must be
     true on the final call to this method.  DATA can be the empty
     string at any time.

`ParseFile(file)'
     Parse XML data reading from the object FILE.  FILE only needs to
     provide the `read(NBYTES)' method, returning the empty string when
     there's no more data.

`SetBase(base)'
     Sets the base to be used for resolving relative URIs in system
     identifiers in declarations.  Resolving relative identifiers is
     left to the application: this value will be passed through as the
     BASE argument to the `ExternalEntityRefHandler',
     `NotationDeclHandler', and `UnparsedEntityDeclHandler' functions.

`GetBase()'
     Returns a string containing the base set by a previous call to
     `SetBase()', or `None' if `SetBase()' hasn't been called.

`GetInputContext()'
     Returns the input data that generated the current event as a
     string.  The data is in the encoding of the entity which contains
     the text.  When called while an event handler is not active, the
     return value is `None'.  _Added in Python version 2.1_

`ExternalEntityParserCreate(context[, encoding])'
     Create a "child" parser which can be used to parse an external
     parsed entity referred to by content parsed by the parent parser.
     The CONTEXT parameter should be the string passed to the
     `ExternalEntityRefHandler()' handler function, described below.
     The child parser is created with its `ordered_attributes',
     `returns_unicode' and `specified_attributes' attributes set to the
     values of this parser.
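
   As an illustration of the `Parse()' method described above, the
following sketch feeds a document in several pieces and sets ISFINAL
only on the last call.  The chunk boundaries and the names `chunks' and
`started' are made up for the example.

```python
import xml.parsers.expat

started = []  # names of the elements seen so far

def start_element(name, attrs):
    started.append(name)

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element

# Hypothetical chunks, deliberately split in the middle of a tag.
chunks = ['<doc><it', 'em>text</item>', '</doc>']
for chunk in chunks[:-1]:
    p.Parse(chunk)          # ISFINAL omitted: more data will follow
p.Parse(chunks[-1], 1)      # ISFINAL true on the final call
```

   The start handler fires once the parser has seen a complete start
tag, regardless of how the input was divided between calls.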

   `xmlparser' objects have the following attributes:

`ordered_attributes'
     Setting this attribute to a non-zero integer causes the attributes
     to be reported as a list rather than a dictionary.  The attributes
     are presented in the order found in the document text.  For each
     attribute, two list entries are presented: the attribute name and
     the attribute value.  (Older versions of this module also used this
     format.)  By default, this attribute is false; it may be changed at
     any time.  _Added in Python version 2.1_

`returns_unicode'
     If this attribute is set to a non-zero integer, the handler
     functions will be passed Unicode strings.  If `returns_unicode' is
     0, 8-bit strings containing UTF-8 encoded data will be passed to
     the handlers.  _Changed in Python version 1.6_

`specified_attributes'
     If set to a non-zero integer, the parser will report only those
     attributes which were specified in the document instance and not
     those which were derived from attribute declarations.
     Applications which set this need to be especially careful to use
     what additional information is available from the declarations as
     needed to comply with the standards for the behavior of XML
     processors.  By default, this attribute is false; it may be
     changed at any time.  _Added in Python version 2.1_
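
   A minimal sketch of the effect of `ordered_attributes': with the
attribute set, the start handler receives a flat list of alternating
names and values instead of a dictionary.  The names `seen', `flat',
and `pairs' are illustrative only.

```python
import xml.parsers.expat

seen = []  # (element name, attributes) as passed to the handler

def start_element(name, attrs):
    seen.append((name, attrs))

p = xml.parsers.expat.ParserCreate()
p.ordered_attributes = 1
p.StartElementHandler = start_element
p.Parse('<e b="2" a="1"/>', 1)

# attrs is now [name, value, name, value, ...] in document order;
# regroup it into (name, value) pairs.
element, flat = seen[0]
pairs = []
for i in range(0, len(flat), 2):
    pairs.append((flat[i], flat[i + 1]))
```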

   The following attributes contain values relating to the most recent
error encountered by an `xmlparser' object, and will only have correct
values once a call to `Parse()' or `ParseFile()' has raised an
`xml.parsers.expat.ExpatError' exception.

`ErrorByteIndex'
     Byte index at which an error occurred.

`ErrorCode'
     Numeric code specifying the problem.  This value can be passed to
     the `ErrorString()' function, or compared to one of the constants
     defined in the `errors' object.

`ErrorColumnNumber'
     Column number at which an error occurred.

`ErrorLineNumber'
     Line number at which an error occurred.
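
   The error attributes might be used along the following lines.  This
is a sketch, not a recommended error-reporting scheme; the malformed
input and the name `report' are invented for the example.

```python
import xml.parsers.expat

p = xml.parsers.expat.ParserCreate()
report = None
try:
    # The end tag </a> on the second line closes the wrong element.
    p.Parse('<a>\n<b></a>', 1)
except xml.parsers.expat.ExpatError:
    # The attributes are meaningful only after the exception is raised.
    report = (p.ErrorLineNumber,
              p.ErrorColumnNumber,
              p.ErrorByteIndex,
              xml.parsers.expat.ErrorString(p.ErrorCode))
```

   Here `report' collects the line, column, and byte index of the
error together with the explanatory string for the error code.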

   Here is the list of handlers that can be set.  To set a handler on an
`xmlparser' object O, use `O.HANDLERNAME = FUNC'.  HANDLERNAME must be
taken from the following list, and FUNC must be a callable object
accepting the correct number of arguments.  The arguments are all
strings, unless otherwise stated.

`XmlDeclHandler(version, encoding, standalone)'
     Called when the XML declaration is parsed.  The XML declaration is
     the (optional) declaration of the applicable version of the XML
     recommendation, the encoding of the document text, and an optional
     "standalone" declaration.  VERSION and ENCODING will be strings of
     the type dictated by the `returns_unicode' attribute, and
     STANDALONE will be `1' if the document is declared standalone, `0'
     if it is declared not to be standalone, or `-1' if the standalone
     clause was omitted.  This is only available with Expat version
     1.95.0 or newer.  _Added in Python version 2.1_

`StartDoctypeDeclHandler(doctypeName, systemId, publicId, has_internal_subset)'
     Called when Expat begins parsing the document type declaration
     (`<!DOCTYPE ...').  The DOCTYPENAME is provided exactly as
     presented.  The SYSTEMID and PUBLICID parameters give the system
     and public identifiers if specified, or `None' if omitted.
     HAS_INTERNAL_SUBSET will be true if the document contains an
     internal document type declaration subset.  This requires Expat
     version 1.2 or newer.

`EndDoctypeDeclHandler()'
     Called when Expat is done parsing the document type declaration.
     This requires Expat version 1.2 or newer.

`ElementDeclHandler(name, model)'
     Called once for each element type declaration.  NAME is the name
     of the element type, and MODEL is a representation of the content
     model.

`AttlistDeclHandler(elname, attname, type, default, required)'
     Called for each declared attribute for an element type.  If an
     attribute list declaration declares three attributes, this handler
     is called three times, once for each attribute.  ELNAME is the name
     of the element to which the declaration applies and ATTNAME is the
     name of the attribute declared.  The attribute type is a string
     passed as TYPE; the possible values are `'CDATA'', `'ID'',
     `'IDREF'', ...  DEFAULT gives the default value for the attribute
     used when the attribute is not specified by the document instance,
     or `None' if there is no default value (`#IMPLIED' values).  If
     the attribute is required to be given in the document instance,
     REQUIRED will be true.  This requires Expat version 1.95.0 or
     newer.

`StartElementHandler(name, attributes)'
     Called for the start of every element.  NAME is a string
     containing the element name, and ATTRIBUTES is a dictionary
     mapping attribute names to their values.

`EndElementHandler(name)'
     Called for the end of every element.

`ProcessingInstructionHandler(target, data)'
     Called for every processing instruction.

`CharacterDataHandler(data)'
     Called for character data.  This will be called for normal
     character data, CDATA marked content, and ignorable whitespace.
     Applications which must distinguish these cases can use the
     `StartCdataSectionHandler', `EndCdataSectionHandler', and
     `ElementDeclHandler' callbacks to collect the required information.

`UnparsedEntityDeclHandler(entityName, base, systemId, publicId, notationName)'
     Called for unparsed (NDATA) entity declarations.  This is only
     present for version 1.2 of the Expat library; for more recent
     versions, use `EntityDeclHandler' instead.  (The underlying
     function in the Expat library has been declared obsolete.)

`EntityDeclHandler(entityName, is_parameter_entity, value, base, systemId, publicId, notationName)'
     Called for all entity declarations.  For parameter and internal
     entities, VALUE will be a string giving the declared contents of
     the entity; this will be `None' for external entities.  The
     NOTATIONNAME parameter will be `None' for parsed entities, and the
     name of the notation for unparsed entities.  IS_PARAMETER_ENTITY
     will be true if the entity is a parameter entity or false for
     general entities (most applications only need to be concerned with
     general entities).  This is only available starting with version
     1.95.0 of the Expat library.  _Added in Python version 2.1_

`NotationDeclHandler(notationName, base, systemId, publicId)'
     Called for notation declarations.  NOTATIONNAME, BASE, SYSTEMID,
     and PUBLICID are strings if given.  If the public identifier is
     omitted, PUBLICID will be `None'.

`StartNamespaceDeclHandler(prefix, uri)'
     Called when an element contains a namespace declaration.  Namespace
     declarations are processed before the `StartElementHandler' is
     called for the element on which declarations are placed.

`EndNamespaceDeclHandler(prefix)'
     Called when the closing tag is reached for an element that
     contained a namespace declaration.  This is called once for each
     namespace declaration on the element in the reverse of the order
     for which the `StartNamespaceDeclHandler' was called to indicate
     the start of each namespace declaration's scope.  Calls to this
     handler are made after the corresponding `EndElementHandler' for
     the end of the element.

`CommentHandler(data)'
     Called for comments.  DATA is the text of the comment, excluding
     the leading `<!--' and trailing `-->'.

`StartCdataSectionHandler()'
     Called at the start of a CDATA section.  This and
     `EndCdataSectionHandler' are needed to be able to identify the
     syntactical start and end for CDATA sections.

`EndCdataSectionHandler()'
     Called at the end of a CDATA section.

`DefaultHandler(data)'
     Called for any characters in the XML document for which no
     applicable handler has been specified.  This means characters that
     are part of a construct which could be reported, but for which no
     handler has been supplied.

`DefaultHandlerExpand(data)'
     This is the same as the `DefaultHandler', but doesn't inhibit
     expansion of internal entities.  The entity reference will not be
     passed to the default handler.

`NotStandaloneHandler()'
     Called if the XML document hasn't been declared as being a
     standalone document.  This happens when there is an external
     subset or a reference to a parameter entity, but the XML
     declaration does not set standalone to `yes'.  If this handler
     returns `0', then the parser will throw an
     `XML_ERROR_NOT_STANDALONE' error.  If this handler is not set, no
     exception is raised by the parser for this condition.

`ExternalEntityRefHandler(context, base, systemId, publicId)'
     Called for references to external entities.  BASE is the current
     base, as set by a previous call to `SetBase()'.  The public and
     system identifiers, SYSTEMID and PUBLICID, are strings if given;
     if the public identifier is not given, PUBLICID will be `None'.
     The CONTEXT value is opaque and should only be used as described
     below.

     For external entities to be parsed, this handler must be
     implemented.  It is responsible for creating the sub-parser using
     `ExternalEntityParserCreate(CONTEXT)', initializing it with the
     appropriate callbacks, and parsing the entity.  This handler
     should return an integer; if it returns `0', the parser will throw
     an `XML_ERROR_EXTERNAL_ENTITY_HANDLING' error, otherwise parsing
     will continue.

     If this handler is not provided, external entities are reported by
     the `DefaultHandler' callback, if provided.
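
   As noted under `CharacterDataHandler', an application that needs to
tell CDATA sections apart from ordinary character data can combine it
with the two CDATA section handlers.  The sketch below shows one way to
do so; the flag `in_cdata' and the lists `plain' and `literal' are
hypothetical names, and the pieces are joined afterwards because
character data may be delivered in more than one call.

```python
import xml.parsers.expat

in_cdata = [0]   # one-element list used as a mutable flag
plain = []       # character data outside CDATA sections
literal = []     # character data inside CDATA sections

def start_cdata():
    in_cdata[0] = 1

def end_cdata():
    in_cdata[0] = 0

def char_data(data):
    if in_cdata[0]:
        literal.append(data)
    else:
        plain.append(data)

p = xml.parsers.expat.ParserCreate()
p.StartCdataSectionHandler = start_cdata
p.EndCdataSectionHandler = end_cdata
p.CharacterDataHandler = char_data
p.Parse('<d>a<![CDATA[<b>]]>c</d>', 1)

plain_text = ''.join(plain)
literal_text = ''.join(literal)
```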


File: python-lib.info,  Node: ExpatError Exceptions,  Next: Example 8,  Prev: XMLParser Objects,  Up: xml.parsers.expat

ExpatError Exceptions
---------------------

   This section was written by Fred L. Drake, Jr. <fdrake@acm.org>.
`ExpatError' exceptions have a number of interesting attributes:

`code'
     Expat's internal error number for the specific error.  This will
     match one of the constants defined in the `errors' object from
     this module.  _Added in Python version 2.1_

`lineno'
     Line number on which the error was detected.  The first line is
     numbered `1'.  _Added in Python version 2.1_

`offset'
     Character offset into the line where the error occurred.  The first
     column is numbered `0'.  _Added in Python version 2.1_
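
   A short sketch of reading these attributes from a caught exception.
The instance is fetched with `sys.exc_info()' so that the example does
not depend on a particular form of the `except' clause; the malformed
input and the name `details' are made up for the illustration.

```python
import sys
import xml.parsers.expat

p = xml.parsers.expat.ParserCreate()
details = None
try:
    p.Parse('<a><b></a>', 1)     # mismatched end tag on line 1
except xml.parsers.expat.ExpatError:
    exc = sys.exc_info()[1]      # the ExpatError instance
    details = (exc.code, exc.lineno, exc.offset)
```

   The three values describe the same error as the parser's
`ErrorCode', `ErrorLineNumber', and `ErrorColumnNumber' attributes.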


File: python-lib.info,  Node: Example 8,  Next: Content Model Descriptions,  Prev: ExpatError Exceptions,  Up: xml.parsers.expat

Example
-------

   The following program defines three handlers that just print out
their arguments.

     import xml.parsers.expat
     
     # 3 handler functions
     def start_element(name, attrs):
         print 'Start element:', name, attrs
     def end_element(name):
         print 'End element:', name
     def char_data(data):
         print 'Character data:', repr(data)
     
     p = xml.parsers.expat.ParserCreate()
     
     p.StartElementHandler = start_element
     p.EndElementHandler = end_element
     p.CharacterDataHandler = char_data
     
     p.Parse("""<?xml version="1.0"?>
     <parent id="top"><child1 name="paul">Text goes here</child1>
     <child2 name="fred">More text</child2>
     </parent>""", 1)

   The output from this program is:

     Start element: parent {'id': 'top'}
     Start element: child1 {'name': 'paul'}
     Character data: 'Text goes here'
     End element: child1
     Character data: '\n'
     Start element: child2 {'name': 'fred'}
     Character data: 'More text'
     End element: child2
     Character data: '\n'
     End element: parent


File: python-lib.info,  Node: Content Model Descriptions,  Next: Expat error constants,  Prev: Example 8,  Up: xml.parsers.expat

Content Model Descriptions
--------------------------

   This section was written by Fred L. Drake, Jr. <fdrake@acm.org>.
Content models are described using nested tuples.  Each tuple contains
four values: the type, the quantifier, the name, and a tuple of
children.  Children are simply additional content model descriptions.

   The values of the first two fields are constants defined in the
`model' object of the `xml.parsers.expat' module.  These constants can
be collected in two groups: the model type group and the quantifier
group.

   The constants in the model type group are:

`XML_CTYPE_ANY'
     The element named by the model name was declared to have a content
     model of `ANY'.

`XML_CTYPE_CHOICE'
     The named element allows a choice from a number of options; this is
     used for content models such as `(A | B | C)'.

`XML_CTYPE_EMPTY'
     Elements which are declared to be `EMPTY' have this model type.

`XML_CTYPE_MIXED'

`XML_CTYPE_NAME'

`XML_CTYPE_SEQ'
     Models which represent a series of models which follow one after
     the other are indicated with this model type.  This is used for
     models such as `(A, B, C)'.

   The constants in the quantifier group are:

`XML_CQUANT_NONE'

`XML_CQUANT_OPT'
     The model is optional: it can appear once or not at all, as for
     `A?'.

`XML_CQUANT_PLUS'
     The model must occur one or more times (`A+').

`XML_CQUANT_REP'
     The model can occur zero or more times, as for `A*'.
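
   The tuple layout described above can be walked recursively.  The
following is a minimal sketch, not part of the module: `render_model'
and `QUANT_SUFFIX' are hypothetical names, the sample tuple is built by
hand rather than obtained from a parser, the use of `None' as the name
of a group is an assumption, and so is the rendering chosen for
`XML_CTYPE_MIXED'.  It is written for a current Python interpreter.

```python
import xml.parsers.expat

model = xml.parsers.expat.model

# DTD suffix for each quantifier constant.
QUANT_SUFFIX = {
    model.XML_CQUANT_NONE: "",
    model.XML_CQUANT_OPT: "?",
    model.XML_CQUANT_REP: "*",
    model.XML_CQUANT_PLUS: "+",
}

def render_model(description):
    """Turn a (type, quantifier, name, children) tuple into DTD-like text."""
    ctype, quant, name, children = description
    suffix = QUANT_SUFFIX[quant]
    if ctype == model.XML_CTYPE_NAME:
        return name + suffix
    if ctype == model.XML_CTYPE_EMPTY:
        return "EMPTY"
    if ctype == model.XML_CTYPE_ANY:
        return "ANY"
    # Children are themselves content model descriptions, so recurse.
    parts = [render_model(child) for child in children]
    if ctype == model.XML_CTYPE_SEQ:
        return "(" + ", ".join(parts) + ")" + suffix
    if ctype == model.XML_CTYPE_CHOICE:
        return "(" + " | ".join(parts) + ")" + suffix
    # Assumption: XML_CTYPE_MIXED lists the elements allowed next to text.
    return "(" + " | ".join(["#PCDATA"] + parts) + ")" + suffix

# Hand-built description of the model (A, B?, (C | D)*).
example = (model.XML_CTYPE_SEQ, model.XML_CQUANT_NONE, None, (
    (model.XML_CTYPE_NAME, model.XML_CQUANT_NONE, "A", ()),
    (model.XML_CTYPE_NAME, model.XML_CQUANT_OPT, "B", ()),
    (model.XML_CTYPE_CHOICE, model.XML_CQUANT_REP, None, (
        (model.XML_CTYPE_NAME, model.XML_CQUANT_NONE, "C", ()),
        (model.XML_CTYPE_NAME, model.XML_CQUANT_NONE, "D", ()),
    )),
))

print(render_model(example))   # (A, B?, (C | D)*)
```

   Comparing against the named constants, rather than their integer
values, keeps such code independent of how the constants are numbered.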


File: python-lib.info,  Node: Expat error constants,  Prev: Content Model Descriptions,  Up: xml.parsers.expat

Expat error constants
---------------------

   This section was written by A.M. Kuchling <amk1@bigfoot.com>.
The following constants are provided in the `errors' object of the
`xml.parsers.expat' module.  These constants are useful in interpreting
some of the attributes of the `ExpatError' exception objects raised
when an error has occurred.

   The `errors' object has the following attributes:

`XML_ERROR_ASYNC_ENTITY'

`XML_ERROR_ATTRIBUTE_EXTERNAL_ENTITY_REF'
     An entity reference in an attribute value referred to an external
     entity instead of an internal entity.

`XML_ERROR_BAD_CHAR_REF'

`XML_ERROR_BINARY_ENTITY_REF'

`XML_ERROR_DUPLICATE_ATTRIBUTE'
     An attribute was used more than once in a start tag.

`XML_ERROR_INCORRECT_ENCODING'

`XML_ERROR_INVALID_TOKEN'

`XML_ERROR_JUNK_AFTER_DOC_ELEMENT'
     Something other than whitespace occurred after the document
     element.

`XML_ERROR_MISPLACED_XML_PI'

`XML_ERROR_NO_ELEMENTS'
     The document contains no elements.

`XML_ERROR_NO_MEMORY'
     Expat was not able to allocate memory internally.

`XML_ERROR_PARAM_ENTITY_REF'

`XML_ERROR_PARTIAL_CHAR'

`XML_ERROR_RECURSIVE_ENTITY_REF'

`XML_ERROR_SYNTAX'
     Some unspecified syntax error was encountered.

`XML_ERROR_TAG_MISMATCH'
     An end tag did not match the innermost open start tag.

`XML_ERROR_UNCLOSED_TOKEN'

`XML_ERROR_UNDEFINED_ENTITY'
     A reference was made to an entity which was not defined.

`XML_ERROR_UNKNOWN_ENCODING'
     The document encoding is not supported by Expat.
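
   One way to use these constants is to compare them with the message
belonging to a raised exception.  The sketch below assumes a current
Python interpreter, where the attributes of `errors' are message
strings, the exception carries a numeric `code' attribute, and
`ErrorString()' maps that code to its message.  The helper name
`expat_error_message' is hypothetical.

```python
import xml.parsers.expat

errors = xml.parsers.expat.errors

def expat_error_message(text):
    """Return Expat's message for the first error in text, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return xml.parsers.expat.ErrorString(err.code)
    return None

message = expat_error_message("<a><b></a>")
if message == errors.XML_ERROR_TAG_MISMATCH:
    print("end tag does not match the innermost open start tag")
```

   Testing against the symbolic constant avoids hard-coding either the
numeric code or the wording of the message.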


File: python-lib.info,  Node: xml.dom,  Next: xml.dom.minidom,  Prev: xml.parsers.expat,  Up: Structured Markup Processing Tools

The Document Object Model API
=============================

   Document Object Model API for Python.

   This section was written by Paul Prescod <paul@prescod.net> and
Martin v. Löwis <loewis@informatik.hu-berlin.de>.
_Added in Python version 2.0_

   The Document Object Model, or "DOM," is a cross-language API from
the World Wide Web Consortium (W3C) for accessing and modifying XML
documents.  A DOM implementation presents an XML document as a tree
structure, or allows client code to build such a structure from
scratch.  It then gives access to the structure through a set of
objects which provide well-known interfaces.

   The DOM is extremely useful for random-access applications.  SAX only
allows you a view of one bit of the document at a time.  If you are
looking at one SAX element, you have no access to another.  If you are
looking at a text node, you have no access to a containing element.
When you write a SAX application, you need to keep track of your
program's position in the document somewhere in your own code.  SAX
does not do it for you.  Also, if you need to look ahead in the XML
document, you are just out of luck.

   Some applications are simply impossible in an event-driven model with
no access to a tree.  Of course you could build some sort of tree
yourself in SAX events, but the DOM allows you to avoid writing that
code.  The DOM is a standard tree representation for XML data.

   The Document Object Model is being defined by the W3C in stages, or
"levels" in their terminology.  The Python mapping of the API is
substantially based on the DOM Level 2 recommendation.  Some aspects of
the API will only become available in Python 2.1, or may only be
available in particular DOM implementations.

   DOM applications typically start by parsing some XML into a DOM.  How
this is accomplished is not covered at all by DOM Level 1, and Level 2
provides only limited improvements.  There is a `DOMImplementation'
object class which provides access to `Document' creation methods, but
these methods were only added in DOM Level 2 and were not implemented
in time for Python 2.0.  There is also no well-defined way to access
these methods without an existing `Document' object.  For Python 2.0,
consult the documentation for each particular DOM implementation to
determine the bootstrap procedure needed to create and initialize
`Document' and `DocumentType' instances.

   Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods.  These properties are
defined in the DOM specification; this portion of the reference manual
describes the interpretation of the specification in Python.
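
   As an illustration of this kind of random access, the sketch below
uses `xml.dom.minidom' as the DOM implementation.  That choice, and the
`parseString()' bootstrap function, are specific to this example; other
implementations may provide a different way to obtain a document.  The
code is written for a current Python interpreter.

```python
import xml.dom.minidom

document = xml.dom.minidom.parseString(
    '<parent id="top"><child1 name="paul">Text goes here</child1>'
    '<child2 name="fred">More text</child2></parent>'
)

root = document.documentElement     # the <parent> element
second = root.childNodes[1]         # jump straight to <child2>
first = second.previousSibling      # step backwards to <child1>

print(root.tagName)                 # parent
print(second.getAttribute("name"))  # fred
print(first.firstChild.data)        # Text goes here
print(second.parentNode is root)    # True
```

   Unlike an event-driven parser, nothing here has to remember where it
is in the document; every node can reach its parent, children, and
siblings directly.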

   The specification provided by the W3C defines the DOM API for Java,
ECMAScript, and OMG IDL.  The Python mapping defined here is based in
large part on the IDL version of the specification, but strict
compliance is not required (though implementations are free to support
the strict mapping from IDL).  See section *Note Conformance::,
"Conformance," for a detailed discussion of mapping requirements.

   See also:

`Document Object Model (DOM) Level 2 Specification'
     The W3C recommendation upon which the Python DOM API is based.

`Document Object Model (DOM) Level 1 Specification'
     The W3C recommendation for the DOM supported by `xml.dom.minidom'.

`PyXML'
     Users who require a full-featured implementation of DOM should
     use the PyXML package.

`CORBA Scripting with Python'
     This specifies the mapping from OMG IDL to Python.

* Menu:

* Module Contents 2::
* Objects in the DOM::
* Conformance::


File: python-lib.info,  Node: Module Contents 2,  Next: Objects in the DOM,  Prev: xml.dom,  Up: xml.dom

Module Contents
---------------

   The `xml.dom' module contains the following functions:

`registerDOMImplementation(name, factory)'
     Register the FACTORY function with the name NAME.  The factory
     function should return an object which implements the
     `DOMImplementation' interface.  The factory function can return
     the same object every time, or a new one for each call, as
     appropriate for the specific implementation (e.g. if that
     implementation supports some customization).

`getDOMImplementation(name = None, features = ())'
     Return a suitable DOM implementation.  NAME is either a well-known
     name, the module name of a DOM implementation, or `None'.  If it
     is not `None', the corresponding module is imported and a
     `DOMImplementation' object is returned if the import succeeds.  If
     no name is given, and if the environment variable `PYTHON_DOM' is
     set, this variable is used to find the implementation.

     If NAME is not given, the available implementations are examined
     to find one with the required feature set.  If no implementation
     can be found, an `ImportError' is raised.  FEATURES must be a
     sequence of (feature, version) pairs which are passed to the
     `hasFeature()' method.
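
   The two functions can be combined as in the following sketch.  The
name `example-dom' and the function `minidom_factory' are made up for
this illustration, and `xml.dom.minidom' stands in for whatever
implementation an application would actually register.  The code is
written for a current Python interpreter.

```python
import xml.dom
import xml.dom.minidom

def minidom_factory():
    # This factory returns the same implementation object on every call.
    return xml.dom.minidom.getDOMImplementation()

# "example-dom" is a hypothetical name used only in this sketch.
xml.dom.registerDOMImplementation("example-dom", minidom_factory)

# Look the implementation up by the name it was registered under.
impl = xml.dom.getDOMImplementation("example-dom")
print(impl.hasFeature("core", "2.0"))     # True

# Alternatively, ask for any implementation with the required features.
any_impl = xml.dom.getDOMImplementation(features=[("core", "2.0")])

# A DOMImplementation can then be used to create a new document.
new_doc = impl.createDocument(None, "root", None)
print(new_doc.documentElement.tagName)    # root
```

   Registering a factory rather than an object leaves it to the
implementation to decide whether each caller receives a fresh object or
a shared one.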

   In addition, `xml.dom' contains the `Node' class and the DOM
exception classes.


File: python-lib.info,  Node: Objects in the DOM,  Next: Conformance,  Prev: Module Contents 2,  Up: xml.dom

Objects in the DOM
------------------

   The definitive documentation for the DOM is the DOM specification
from the W3C.

   Note that DOM attributes may also be manipulated as nodes instead of
as simple strings.  It is fairly rare that you must do this, however,
so this usage is not yet documented.

Interface                Section                  Purpose
------                   -----                    -----
DOMImplementation        *Note DOMImplementation  Interface to the
                         Objects::                underlying
                                                  implementation.
Node                     *Note Node Objects::     Base interface for most
                                                  objects in a document.
NodeList                 *Note NodeList           Interface for a
                         Objects::                sequence of nodes.
DocumentType             *Note DocumentType       Information about the
                         Objects::                declarations needed to
                                                  process a document.
Document                 *Note Document           Object which represents
                         Objects::                an entire document.
Element                  *Note Element Objects::  Element nodes in the
                                                  document hierarchy.
Attr                     *Note Attr Objects::     Attribute value nodes
                                                  on element nodes.
Comment                  *Note Comment Objects::  Representation of
                                                  comments in the source
                                                  document.
Text                     *Note Text and           Nodes containing
                         CDATASection Objects::   textual content from
                                                  the document.
ProcessingInstruction    *Note                    Processing instruction
                         ProcessingInstruction    representation.
                         Objects::                

   An additional section describes the exceptions defined for working
with the DOM in Python.

* Menu:

* DOMImplementation Objects::
* Node Objects::
* NodeList Objects::
* DocumentType Objects::
* Document Objects::
* Element Objects::
* Attr Objects::
* NamedNodeMap Objects::
* Comment Objects::
* Text and CDATASection Objects::
* ProcessingInstruction Objects::
* Exceptions 2::


File: python-lib.info,  Node: DOMImplementation Objects,  Next: Node Objects,  Prev: Objects in the DOM,  Up: Objects in the DOM

DOMImplementation Objects
.........................

   The `DOMImplementation' interface provides a way for applications to
determine the availability of particular features in the DOM they are
using.  DOM Level 2 added the ability to create new `Document' and
`DocumentType' objects using the `DOMImplementation' as well.

`hasFeature(feature, version)'
     Return true if the feature identified by the pair of strings
     FEATURE and VERSION is implemented.

File: python-lib.info,  Node: Node Objects,  Next: NodeList Objects,  Prev: DOMImplementation Objects,  Up: Objects in the DOM

Node Objects
............

   All of the components of an XML document are subclasses of `Node'.

`nodeType'
     An integer representing the node type.  Symbolic constants for the
     types are on the `Node' object: `ELEMENT_NODE', `ATTRIBUTE_NODE',
     `TEXT_NODE', `CDATA_SECTION_NODE', `ENTITY_NODE',
     `PROCESSING_INSTRUCTION_NODE', `COMMENT_NODE', `DOCUMENT_NODE',
     `DOCUMENT_TYPE_NODE', `NOTATION_NODE'.  This is a read-only
     attribute.

`parentNode'
     The parent of the current node, or `None' for the document node.
     The value is always a `Node' object or `None'.  For `Element'
     nodes, this will be the parent element, except for the root
     element, in which case it will be the `Document' object.  For
     `Attr' nodes, this is always `None'.  This is a read-only
     attribute.

`attributes'
     A `NamedNodeMap' of attribute objects.  Only elements have actual
     values for this; others provide `None' for this attribute.  This
     is a read-only attribute.

`previousSibling'
     The node that immediately precedes this one with the same parent.
     For instance, this may be the element whose end-tag comes just
     before the SELF element's start-tag.  Of course, XML documents are
     made up of
     more than just elements so the previous sibling could be text, a
     comment, or something else.  If this node is the first child of the
     parent, this attribute will be `None'.  This is a read-only
     attribute.

`nextSibling'
     The node that immediately follows this one with the same parent.
     See also `previousSibling'.  If this is the last child of the
     parent, this attribute will be `None'.  This is a read-only
     attribute.

`childNodes'
     A list of nodes contained within this node.  This is a read-only
     attribute.

`firstChild'
     The first child of the node, if there are any, or `None'.  This is
     a read-only attribute.

`lastChild'
     The last child of the node, if there are any, or `None'.  This is
     a read-only attribute.

`localName'
     The part of the `tagName' following the colon if there is one,
     else the entire `tagName'.  The value is a string.

`prefix'
     The part of the `tagName' preceding the colon if there is one,
     else the empty string.  The value is a string, or `None'.

`namespaceURI'
     The namespace associated with the element name.  This will be a
     string or `None'.  This is a read-only attribute.

`nodeName'
     This has a different meaning for each node type; see the DOM
     specification for details.  You can always get the information you
     would get here from another property such as the `tagName'
     property for elements or the `name' property for attributes.  For
     all node types, the value of this attribute will be either a
     string or `None'.  This is a read-only attribute.

`nodeValue'
     This has a different meaning for each node type; see the DOM
     specification for details.  The situation is similar to that with
     `nodeName'.  The value is a string or `None'.

`hasAttributes()'
     Returns true if the node has any attributes.

`hasChildNodes()'
     Returns true if the node has any child nodes.

`isSameNode(other)'
     Returns true if OTHER refers to the same node as this node.  This
     is especially useful for DOM implementations which use any sort of
     proxy architecture (because more than one object can refer to the
     same node).

     *Note:*  This is based on a proposed DOM Level 3 API which is
     still in the "working draft" stage, but this particular interface
     appears uncontroversial.  Changes from the W3C will not necessarily
     affect this method in the Python DOM interface (though any new W3C
     API for this would also be supported).

`appendChild(newChild)'
     Add a new child node to this node at the end of the list of
     children, returning NEWCHILD.

`insertBefore(newChild, refChild)'
     Insert a new child node before an existing child.  It must be the
     case that REFCHILD is a child of this node; if not, `ValueError'
     is raised.  NEWCHILD is returned.

`removeChild(oldChild)'
     Remove a child node.  OLDCHILD must be a child of this node; if
     not, `ValueError' is raised.  OLDCHILD is returned on success.  If
     OLDCHILD will not be used further, its `unlink()' method should be
     called.

`replaceChild(newChild, oldChild)'
     Replace an existing node with a new node. It must be the case that
     OLDCHILD is a child of this node; if not, `ValueError' is raised.

`normalize()'
     Join adjacent text nodes so that all stretches of text are stored
     as single `Text' instances.  This simplifies processing text from a
     DOM tree for many applications.  _Added in Python version 2.1_

`cloneNode(deep)'
     Clone this node.  If DEEP is true, all child nodes are cloned as
     well.  This returns the clone.
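
   Several of the attributes and methods above are exercised in the
sketch below.  It assumes `xml.dom.minidom' as the implementation and a
current Python interpreter; `createElement()', `createTextNode()', and
`toxml()' are not described in this section and are used here only to
build and display the tree.

```python
import xml.dom
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<list><item>one</item><item>three</item></list>")
root = doc.documentElement
first, third = root.firstChild, root.lastChild

# nodeType is compared against the symbolic constants on Node.
print(root.nodeType == xml.dom.Node.ELEMENT_NODE)           # True
print(first.firstChild.nodeType == xml.dom.Node.TEXT_NODE)  # True

# Build a new element and insert it before an existing child.
second = doc.createElement("item")
second.appendChild(doc.createTextNode("two"))
root.insertBefore(second, third)
print(first.nextSibling is second)       # True
print(third.previousSibling is second)   # True

# A deep clone copies the element together with its text child.
copy = third.cloneNode(True)
root.appendChild(copy)

# removeChild() returns the detached node; unlink() it when done.
removed = root.removeChild(first)
removed.unlink()

# Two adjacent text nodes are joined into one by normalize().
second.appendChild(doc.createTextNode("!"))
second.normalize()
print(len(second.childNodes))            # 1
print(root.toxml())
```

   The final line prints
`<list><item>two!</item><item>three</item><item>three</item></list>',
showing the combined effect of the insertion, clone, removal, and
normalization.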


File: python-lib.info,  Node: NodeList Objects,  Next: DocumentType Objects,  Prev: Node Objects,  Up: Objects in the DOM

NodeList Objects
................

   A `NodeList' represents a sequence of nodes.  These objects are used
in two ways in the DOM Core recommendation:  `Element' objects provide
one as their list of child nodes, and the `getElementsByTagName()' and
`getElementsByTagNameNS()' methods of `Node' return objects with this
interface to represent query results.

   The DOM Level 2 recommendation defines one method and one attribute
for these objects:

`item(i)'
     Return the I'th item from the sequence, if there is one, or
     `None'.  The index I is not allowed to be less than zero or
     greater than or equal to the length of the sequence.

`length'
     The number of nodes in the sequence.

   In addition, the Python DOM interface requires that some additional
support is provided to allow `NodeList' objects to be used as Python
sequences.  All `NodeList' implementations must include support for
`__len__()' and `__getitem__()'; this allows iteration over the
`NodeList' in `for' statements and proper support for the `len()'
built-in function.

   If a DOM implementation supports modification of the document, the
`NodeList' implementation must also support the `__setitem__()' and
`__delitem__()' methods.
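
   The DOM-style and Python-style views of a `NodeList' can be compared
side by side.  This is a small sketch assuming `xml.dom.minidom' as the
implementation and a current Python interpreter; other implementations
should behave the same way for the operations shown.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<r><x/><y/><x/></r>")
matches = doc.getElementsByTagName("x")

# DOM interface: the length attribute and the item() method.
print(matches.length)                   # 2
print(matches.item(0) is matches[0])    # True
print(matches.item(5))                  # None

# Python sequence interface: len() and iteration in a for statement.
print(len(matches))                     # 2
names = [node.tagName for node in doc.documentElement.childNodes]
print(names)                            # ['x', 'y', 'x']
```

   Note that `item()' reports a missing entry by returning `None',
whereas ordinary Python indexing would raise an exception for an index
that is out of range.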

