This is /home/pdm/install/Python-2.1/Doc/lib/python-lib.info, produced by makeinfo version 4.0 from lib.texi.

April 15, 2001 2.1

File: python-lib.info, Node: sgmllib, Next: htmllib, Prev: Structured Markup Processing Tools, Up: Structured Markup Processing Tools

Simple SGML parser
==================

Only as much of an SGML parser as needed to parse HTML.

This module defines a class `SGMLParser' which serves as the basis for parsing text files formatted in SGML (Standard Generalized Mark-up Language). In fact, it does not provide a full SGML parser -- it only parses SGML insofar as it is used by HTML, and the module only exists as a base for the `htmllib' module.

`SGMLParser()'
     The `SGMLParser' class is instantiated without arguments. The parser is hardcoded to recognize the following constructs:

        * Opening and closing tags of the form `<TAG ATTR="VALUE" ...>' and `</TAG>', respectively.

        * Numeric character references of the form `&#NAME;'.

        * Entity references of the form `&NAME;'.

        * SGML comments of the form `<!--TEXT-->'. Note that spaces, tabs, and newlines are allowed between the trailing `>' and the immediately preceding `--'.

`SGMLParser' instances have the following interface methods:

`reset()'
     Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

`setnomoretags()'
     Stop processing tags. Treat all following input as literal input (CDATA). (This is only provided so the HTML tag `<PLAINTEXT>' can be implemented.)

`setliteral()'
     Enter literal mode (CDATA mode).

`feed(data)'
     Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or `close()' is called.

`close()'
     Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the base class `close()' method.

`get_starttag_text()'
     Return the text of the most recently opened start tag.
     This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).

`handle_starttag(tag, method, attributes)'
     This method is called to handle start tags for which either a `start_TAG()' or `do_TAG()' method has been defined. The TAG argument is the name of the tag converted to lower case, and the METHOD argument is the bound method which should be used to support semantic interpretation of the start tag. The ATTRIBUTES argument is a list of `(NAME, VALUE)' pairs containing the attributes found inside the tag's `<>' brackets. The NAME has been translated to lower case and double quotes and backslashes in the VALUE have been interpreted. For instance, for the tag `<A HREF="http://www.cwi.nl/">', this method would be called as `handle_starttag('a', METHOD, [('href', 'http://www.cwi.nl/')])', where METHOD is the `start_a()' or `do_a()' method defined for the tag. The base implementation simply calls METHOD with ATTRIBUTES as the only argument.

`handle_endtag(tag, method)'
     This method is called to handle end tags for which an `end_TAG()' method has been defined. The TAG argument is the name of the tag converted to lower case, and the METHOD argument is the bound method which should be used to support semantic interpretation of the end tag. If no `end_TAG()' method is defined for the closing element, this handler is not called. The base implementation simply calls METHOD.

`handle_data(data)'
     This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.

`handle_charref(ref)'
     This method is called to process a character reference of the form `&#REF;'. In the base implementation, REF must be a decimal number in the range 0-255. It translates the character to ASCII and calls the method `handle_data()' with the character as argument.
     If REF is invalid or out of range, the method `unknown_charref(REF)' is called to handle the error. A subclass must override this method to provide support for named character entities.

`handle_entityref(ref)'
     This method is called to process a general entity reference of the form `&REF;' where REF is a general entity reference. It looks for REF in the instance (or class) variable `entitydefs' which should be a mapping from entity names to corresponding translations. If a translation is found, it calls the method `handle_data()' with the translation; otherwise, it calls the method `unknown_entityref(REF)'. The default `entitydefs' defines translations for `&amp;', `&apos;', `&gt;', `&lt;', and `&quot;'.

`handle_comment(comment)'
     This method is called when a comment is encountered. The COMMENT argument is a string containing the text between the `<!--' and `-->' delimiters, but not the delimiters themselves. For example, the comment `<!--text-->' will cause this method to be called with the argument `'text''. The default method does nothing.

`handle_decl(data)'
     Method called when an SGML declaration is read by the parser. In practice, the `DOCTYPE' declaration is the only thing observed in HTML, but the parser does not discriminate among different (or broken) declarations. Internal subsets in a `DOCTYPE' declaration are not supported. The DATA parameter will be the entire contents of the declaration inside the `<!'...`>' markup. The default implementation does nothing.

`report_unbalanced(tag)'
     This method is called when an end tag is found which does not correspond to any open element.

`unknown_starttag(tag, attributes)'
     This method is called to process an unknown start tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_endtag(tag)'
     This method is called to process an unknown end tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_charref(ref)'
     This method is called to process unresolvable numeric character references. Refer to `handle_charref()' to determine what is handled by default. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_entityref(ref)'
     This method is called to process an unknown entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.

Apart from overriding or extending the methods listed above, derived classes may also define methods of the following form to define processing of specific tags. Tag names in the input stream are case independent; the TAG occurring in method names must be in lower case:

`start_TAG(attributes)'
     This method is called to process an opening tag TAG. It takes precedence over `do_TAG()'. The ATTRIBUTES argument has the same meaning as described for `handle_starttag()' above.

`do_TAG(attributes)'
     This method is called to process an opening tag TAG that does not come with a matching closing tag. The ATTRIBUTES argument has the same meaning as described for `handle_starttag()' above.

`end_TAG()'
     This method is called to process a closing tag TAG.

Note that the parser maintains a stack of open elements for which no end tag has been found yet. Only tags processed by `start_TAG()' are pushed on this stack. Definition of an `end_TAG()' method is optional for these tags. For tags processed by `do_TAG()' or by `unknown_starttag()', no `end_TAG()' method needs to be defined; if one is defined, it will not be used. If both `start_TAG()' and `do_TAG()' methods exist for a tag, the `start_TAG()' method takes precedence.

File: python-lib.info, Node: htmllib, Next: htmlentitydefs, Prev: sgmllib, Up: Structured Markup Processing Tools

A parser for HTML documents
===========================

A parser for HTML documents.

This module defines a class which can serve as a base for parsing text files formatted in the HyperText Mark-up Language (HTML).
The class is not directly concerned with I/O -- it must be provided with input in string form via a method, and makes calls to methods of a "formatter" object in order to produce output. The `HTMLParser' class is designed to be used as a base class for other classes in order to add functionality, and allows most of its methods to be extended or overridden. In turn, this class is derived from and extends the `SGMLParser' class defined in module `sgmllib'. The `HTMLParser' implementation supports the HTML 2.0 language as described in RFC 1866. Two implementations of formatter objects are provided in the `formatter' module; refer to the documentation for that module for information on the formatter interface.

The following is a summary of the interface defined by `sgmllib.SGMLParser':

   * The interface to feed data to an instance is through the `feed()' method, which takes a string argument. This can be called with as little or as much text at a time as desired; `p.feed(a); p.feed(b)' has the same effect as `p.feed(a+b)'. When the data contains complete HTML tags, these are processed immediately; incomplete elements are saved in a buffer. To force processing of all unprocessed data, call the `close()' method.

     For example, to parse the entire contents of a file, use:

          parser.feed(open('myfile.html').read())
          parser.close()

   * The interface to define semantics for HTML tags is very simple: derive a class and define methods called `start_TAG()', `end_TAG()', or `do_TAG()'. The parser will call these at appropriate moments: `start_TAG()' or `do_TAG()' is called when an opening tag of the form `<TAG ...>' is encountered; `end_TAG()' is called when a closing tag of the form `</TAG>' is encountered. If an opening tag requires a corresponding closing tag, like `<H1>' ... `</H1>', the class should define the `start_TAG()' method; if a tag requires no closing tag, like `<P>', the class should define the `do_TAG()' method.

The module defines a single class:

`HTMLParser(formatter)'
     This is the basic HTML parser class. It supports all entity names required by the HTML 2.0 specification (RFC 1866). It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.

See also:

*Note htmlentitydefs::
     Definition of replacement text for HTML 2.0 entities.

*Note sgmllib::
     Base class for `HTMLParser'.

* Menu:

* HTMLParser Objects::

File: python-lib.info, Node: HTMLParser Objects, Prev: htmllib, Up: htmllib

HTMLParser Objects
------------------

In addition to tag methods, the `HTMLParser' class provides some additional methods and instance variables for use within tag methods.

`formatter'
     This is the formatter instance associated with the parser.

`nofill'
     Boolean flag which should be true when whitespace should not be collapsed, or false when it should be. In general, this should only be true when character data is to be treated as "preformatted" text, as within a `<PRE>' element. The default value is false. This affects the operation of `handle_data()' and `save_end()'.

`anchor_bgn(href, name, type)'
     This method is called at the start of an anchor region. The arguments correspond to the attributes of the `<A>' tag with the same names. The default implementation maintains a list of hyperlinks (defined by the `HREF' attribute for `<A>' tags) within the document. The list of hyperlinks is available as the data attribute `anchorlist'.

`anchor_end()'
     This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by `anchor_bgn()'.

`handle_image(source, alt[, ismap[, align[, width[, height]]]])'
     This method is called to handle images. The default implementation simply passes the ALT value to the `handle_data()' method.

`save_bgn()'
     Begins saving character data in a buffer instead of sending it to the formatter object. Retrieve the stored data via `save_end()'.
     Use of the `save_bgn()' / `save_end()' pair may not be nested.

`save_end()'
     Ends buffering character data and returns all data saved since the preceding call to `save_bgn()'. If the `nofill' flag is false, whitespace is collapsed to single spaces. A call to this method without a preceding call to `save_bgn()' will raise a `TypeError' exception.

File: python-lib.info, Node: htmlentitydefs, Next: xml.parsers.expat, Prev: htmllib, Up: Structured Markup Processing Tools

Definitions of HTML general entities
====================================

Definitions of HTML general entities.

This section was written by Fred L. Drake, Jr. <fdrake@acm.org>.

This module defines a single dictionary, `entitydefs', which is used by the `htmllib' module to provide the `entitydefs' member of the `HTMLParser' class. The definition provided here contains all the entities defined by HTML 2.0 that can be handled using simple textual substitution in the Latin-1 character set (ISO-8859-1).

`entitydefs'
     A dictionary mapping HTML 2.0 entity definitions to their replacement text in ISO Latin-1.

File: python-lib.info, Node: xml.parsers.expat, Next: xml.dom, Prev: htmlentitydefs, Up: Structured Markup Processing Tools

Fast XML parsing using Expat
============================

An interface to the Expat non-validating XML parser.

This module was documented by Paul Prescod <paul@prescod.net>.

This section was written by A.M. Kuchling <amk1@bigfoot.com>.

_Added in Python version 2.0_

The `xml.parsers.expat' module is a Python interface to the Expat non-validating XML parser. The module provides a single extension type, `xmlparser', that represents the current state of an XML parser. After an `xmlparser' object has been created, various attributes of the object can be set to handler functions. When an XML document is then fed to the parser, the handler functions are called for the character data and markup in the XML document.

This module uses the `pyexpat' module to provide access to the Expat parser.
Direct use of the `pyexpat' module is deprecated.

This module provides one exception and one type object:

`ExpatError'
     The exception raised when Expat reports an error.

`error'
     Alias for `ExpatError'.

`XMLParserType'
     The type of the return values from the `ParserCreate()' function.

The `xml.parsers.expat' module contains two functions:

`ErrorString(errno)'
     Returns an explanatory string for a given error number ERRNO.

`ParserCreate([encoding[, namespace_separator]])'
     Creates and returns a new `xmlparser' object. ENCODING, if specified, must be a string naming the encoding used by the XML data. Expat doesn't support as many encodings as Python does, and its repertoire of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII. If ENCODING is given it will override the implicit or explicit encoding of the document.

     Expat can optionally do XML namespace processing for you, enabled by providing a value for NAMESPACE_SEPARATOR. The value must be a one-character string; a `ValueError' will be raised if the string has an illegal length (`None' is considered the same as omission). When namespace processing is enabled, element type names and attribute names that belong to a namespace will be expanded. The element name passed to the element handlers `StartElementHandler' and `EndElementHandler' will be the concatenation of the namespace URI, the namespace separator character, and the local part of the name. If the namespace separator is a zero byte (`chr(0)') then the namespace URI and the local part will be concatenated without any separator.

     For example, if NAMESPACE_SEPARATOR is set to a space character (` ') and the following document is parsed:

          <?xml version="1.0"?>
          <root xmlns    = "http://default-namespace.org/"
                xmlns:py = "http://www.python.org/ns/">
            <py:elem1 />
            <elem2 xmlns="" />
          </root>

     `StartElementHandler' will receive the following strings for each element:

          http://default-namespace.org/ root
          http://www.python.org/ns/ elem1
          elem2

* Menu:

* XMLParser Objects::
* ExpatError Exceptions::
* Example 8::
* Content Model Descriptions::
* Expat error constants::

File: python-lib.info, Node: XMLParser Objects, Next: ExpatError Exceptions, Prev: xml.parsers.expat, Up: xml.parsers.expat

XMLParser Objects
-----------------

`xmlparser' objects have the following methods:

`Parse(data[, isfinal])'
     Parses the contents of the string DATA, calling the appropriate handler functions to process the parsed data. ISFINAL must be true on the final call to this method. DATA can be the empty string at any time.

`ParseFile(file)'
     Parse XML data reading from the object FILE. FILE only needs to provide the `read(NBYTES)' method, returning the empty string when there's no more data.

`SetBase(base)'
     Sets the base to be used for resolving relative URIs in system identifiers in declarations. Resolving relative identifiers is left to the application: this value will be passed through as the BASE argument to the `ExternalEntityRefHandler', `NotationDeclHandler', and `UnparsedEntityDeclHandler' functions.

`GetBase()'
     Returns a string containing the base set by a previous call to `SetBase()', or `None' if `SetBase()' hasn't been called.

`GetInputContext()'
     Returns the input data that generated the current event as a string. The data is in the encoding of the entity which contains the text. When called while an event handler is not active, the return value is `None'.
     _Added in Python version 2.1_

`ExternalEntityParserCreate(context[, encoding])'
     Create a "child" parser which can be used to parse an external parsed entity referred to by content parsed by the parent parser. The CONTEXT parameter should be the string passed to the `ExternalEntityRefHandler()' handler function, described below. The child parser is created with the `ordered_attributes', `returns_unicode' and `specified_attributes' attributes set to the values of this parser.

`xmlparser' objects have the following attributes:

`ordered_attributes'
     Setting this attribute to a non-zero integer causes the attributes to be reported as a list rather than a dictionary. The attributes are presented in the order found in the document text. For each attribute, two list entries are presented: the attribute name and the attribute value. (Older versions of this module also used this format.) By default, this attribute is false; it may be changed at any time. _Added in Python version 2.1_

`returns_unicode'
     If this attribute is set to a non-zero integer, the handler functions will be passed Unicode strings. If `returns_unicode' is 0, 8-bit strings containing UTF-8 encoded data will be passed to the handlers. _Changed in Python version 1.6_

`specified_attributes'
     If set to a non-zero integer, the parser will report only those attributes which were specified in the document instance and not those which were derived from attribute declarations. Applications which set this need to be especially careful to use what additional information is available from the declarations as needed to comply with the standards for the behavior of XML processors. By default, this attribute is false; it may be changed at any time. _Added in Python version 2.1_

The following attributes contain values relating to the most recent error encountered by an `xmlparser' object, and will only have correct values once a call to `Parse()' or `ParseFile()' has raised an `xml.parsers.expat.ExpatError' exception.

`ErrorByteIndex'
     Byte index at which an error occurred.

`ErrorCode'
     Numeric code specifying the problem. This value can be passed to the `ErrorString()' function, or compared to one of the constants defined in the `errors' object.

`ErrorColumnNumber'
     Column number at which an error occurred.

`ErrorLineNumber'
     Line number at which an error occurred.

Here is the list of handlers that can be set. To set a handler on an `xmlparser' object O, use `O.HANDLERNAME = FUNC'. HANDLERNAME must be taken from the following list, and FUNC must be a callable object accepting the correct number of arguments. The arguments are all strings, unless otherwise stated.

`XmlDeclHandler(version, encoding, standalone)'
     Called when the XML declaration is parsed. The XML declaration is the (optional) declaration of the applicable version of the XML recommendation, the encoding of the document text, and an optional "standalone" declaration. VERSION and ENCODING will be strings of the type dictated by the `returns_unicode' attribute, and STANDALONE will be `1' if the document is declared standalone, `0' if it is declared not to be standalone, or `-1' if the standalone clause was omitted. This is only available with Expat version 1.95.0 or newer. _Added in Python version 2.1_

`StartDoctypeDeclHandler(doctypeName, systemId, publicId, has_internal_subset)'
     Called when Expat begins parsing the document type declaration (`<!DOCTYPE ...'). The DOCTYPENAME is provided exactly as presented. The SYSTEMID and PUBLICID parameters give the system and public identifiers if specified, or `None' if omitted. HAS_INTERNAL_SUBSET will be true if the document contains an internal document declaration subset. This requires Expat version 1.2 or newer.

`EndDoctypeDeclHandler()'
     Called when Expat is done parsing the document type declaration. This requires Expat version 1.2 or newer.

`ElementDeclHandler(name, model)'
     Called once for each element type declaration.
     NAME is the name of the element type, and MODEL is a representation of the content model.

`AttlistDeclHandler(elname, attname, type, default, required)'
     Called for each declared attribute for an element type. If an attribute list declaration declares three attributes, this handler is called three times, once for each attribute. ELNAME is the name of the element to which the declaration applies and ATTNAME is the name of the attribute declared. The attribute type is a string passed as TYPE; the possible values are `'CDATA'', `'ID'', `'IDREF'', ... DEFAULT gives the default value for the attribute used when the attribute is not specified by the document instance, or `None' if there is no default value (`#IMPLIED' values). If the attribute is required to be given in the document instance, REQUIRED will be true. This requires Expat version 1.95.0 or newer.

`StartElementHandler(name, attributes)'
     Called for the start of every element. NAME is a string containing the element name, and ATTRIBUTES is a dictionary mapping attribute names to their values.

`EndElementHandler(name)'
     Called for the end of every element.

`ProcessingInstructionHandler(target, data)'
     Called for every processing instruction.

`CharacterDataHandler(data)'
     Called for character data. This will be called for normal character data, CDATA marked content, and ignorable whitespace. Applications which must distinguish these cases can use the `StartCdataSectionHandler', `EndCdataSectionHandler', and `ElementDeclHandler' callbacks to collect the required information.

`UnparsedEntityDeclHandler(entityName, base, systemId, publicId, notationName)'
     Called for unparsed (NDATA) entity declarations. This is only present for version 1.2 of the Expat library; for more recent versions, use `EntityDeclHandler' instead. (The underlying function in the Expat library has been declared obsolete.)

`EntityDeclHandler(entityName, is_parameter_entity, value, base, systemId, publicId, notationName)'
     Called for all entity declarations. For parameter and internal entities, VALUE will be a string giving the declared contents of the entity; this will be `None' for external entities. The NOTATIONNAME parameter will be `None' for parsed entities, and the name of the notation for unparsed entities. IS_PARAMETER_ENTITY will be true if the entity is a parameter entity or false for general entities (most applications only need to be concerned with general entities). This is only available starting with version 1.95.0 of the Expat library. _Added in Python version 2.1_

`NotationDeclHandler(notationName, base, systemId, publicId)'
     Called for notation declarations. NOTATIONNAME, BASE, SYSTEMID, and PUBLICID are strings if given. If the public identifier is omitted, PUBLICID will be `None'.

`StartNamespaceDeclHandler(prefix, uri)'
     Called when an element contains a namespace declaration. Namespace declarations are processed before the `StartElementHandler' is called for the element on which declarations are placed.

`EndNamespaceDeclHandler(prefix)'
     Called when the closing tag is reached for an element that contained a namespace declaration. This is called once for each namespace declaration on the element in the reverse of the order for which the `StartNamespaceDeclHandler' was called to indicate the start of each namespace declaration's scope. Calls to this handler are made after the corresponding `EndElementHandler' for the end of the element.

`CommentHandler(data)'
     Called for comments. DATA is the text of the comment, excluding the leading `<!--' and trailing `-->'.

`StartCdataSectionHandler()'
     Called at the start of a CDATA section. This and `EndCdataSectionHandler' are needed to be able to identify the syntactical start and end for CDATA sections.

`EndCdataSectionHandler()'
     Called at the end of a CDATA section.

`DefaultHandler(data)'
     Called for any characters in the XML document for which no applicable handler has been specified. This means characters that are part of a construct which could be reported, but for which no handler has been supplied.

`DefaultHandlerExpand(data)'
     This is the same as the `DefaultHandler', but doesn't inhibit expansion of internal entities. The entity reference will not be passed to the default handler.

`NotStandaloneHandler()'
     Called if the XML document hasn't been declared as being a standalone document. This happens when there is an external subset or a reference to a parameter entity, but the XML declaration does not set standalone to `yes'. If this handler returns `0', then the parser will throw an `XML_ERROR_NOT_STANDALONE' error. If this handler is not set, no exception is raised by the parser for this condition.

`ExternalEntityRefHandler(context, base, systemId, publicId)'
     Called for references to external entities. BASE is the current base, as set by a previous call to `SetBase()'. The public and system identifiers, SYSTEMID and PUBLICID, are strings if given; if the public identifier is not given, PUBLICID will be `None'. The CONTEXT value is opaque and should only be used as described below.

     For external entities to be parsed, this handler must be implemented. It is responsible for creating the sub-parser using `ExternalEntityParserCreate(CONTEXT)', initializing it with the appropriate callbacks, and parsing the entity. This handler should return an integer; if it returns `0', the parser will throw an `XML_ERROR_EXTERNAL_ENTITY_HANDLING' error, otherwise parsing will continue.

     If this handler is not provided, external entities are reported by the `DefaultHandler' callback, if provided.

File: python-lib.info, Node: ExpatError Exceptions, Next: Example 8, Prev: XMLParser Objects, Up: xml.parsers.expat

ExpatError Exceptions
---------------------

This section was written by Fred L. Drake, Jr. <fdrake@acm.org>.

`ExpatError' exceptions have a number of interesting attributes:

`code'
     Expat's internal error number for the specific error. This will match one of the constants defined in the `errors' object from this module. _Added in Python version 2.1_

`lineno'
     Line number on which the error was detected. The first line is numbered `1'. _Added in Python version 2.1_

`offset'
     Character offset into the line where the error occurred. The first column is numbered `0'. _Added in Python version 2.1_

File: python-lib.info, Node: Example 8, Next: Content Model Descriptions, Prev: ExpatError Exceptions, Up: xml.parsers.expat

Example
-------

The following program defines three handlers that just print out their arguments.

     import xml.parsers.expat

     # 3 handler functions
     def start_element(name, attrs):
         print 'Start element:', name, attrs
     def end_element(name):
         print 'End element:', name
     def char_data(data):
         print 'Character data:', repr(data)

     p = xml.parsers.expat.ParserCreate()

     p.StartElementHandler = start_element
     p.EndElementHandler = end_element
     p.CharacterDataHandler = char_data

     p.Parse("""<?xml version="1.0"?>
     <parent id="top"><child1 name="paul">Text goes here</child1>
     <child2 name="fred">More text</child2>
     </parent>""")

The output from this program is:

     Start element: parent {'id': 'top'}
     Start element: child1 {'name': 'paul'}
     Character data: 'Text goes here'
     End element: child1
     Character data: '\n'
     Start element: child2 {'name': 'fred'}
     Character data: 'More text'
     End element: child2
     Character data: '\n'
     End element: parent

File: python-lib.info, Node: Content Model Descriptions, Next: Expat error constants, Prev: Example 8, Up: xml.parsers.expat

Content Model Descriptions
--------------------------

This section was written by Fred L. Drake, Jr. <fdrake@acm.org>.

Content models are described using nested tuples. Each tuple contains four values: the type, the quantifier, the name, and a tuple of children. Children are simply additional content model descriptions.
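
As a rough illustration of the four-value tuple layout, the sketch below registers an `ElementDeclHandler` and unpacks the model it receives. The DTD, the `declared` dictionary, and the `child_names` helper are invented for this example, and the sketch assumes an Expat library recent enough to report element declarations. It avoids `print` so that it also runs unchanged on later Python releases.

```python
import xml.parsers.expat

# Hypothetical storage for this example: element name -> content model tuple.
declared = {}

def element_decl(name, content_model):
    # Called once for each <!ELEMENT ...> declaration in the DTD.
    declared[name] = content_model

def child_names(content_model):
    # A model is (type, quantifier, name, children); each child is itself
    # a model, so its name sits at index 2.
    ctype, quant, name, children = content_model
    return [child[2] for child in children]

p = xml.parsers.expat.ParserCreate()
p.ElementDeclHandler = element_decl
p.Parse("""<?xml version="1.0"?>
<!DOCTYPE root [
  <!ELEMENT root (a, b?)>
  <!ELEMENT a EMPTY>
  <!ELEMENT b EMPTY>
]>
<root><a/></root>""", 1)

root_model = declared["root"]
names = child_names(root_model)
```

Here `root_model` describes the sequence `(a, b?)`, and `names` lists the element names of its two children in declaration order. The numeric type and quantifier fields can be compared against the constants of the `model` object.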

The values of the first two fields are constants defined in the `model' object of the `xml.parsers.expat' module. These constants can be collected in two groups: the model type group and the quantifier group.

The constants in the model type group are:

`XML_CTYPE_ANY'
     The element named by the model name was declared to have a content model of `ANY'.

`XML_CTYPE_CHOICE'
     The named element allows a choice from a number of options; this is used for content models such as `(A | B | C)'.

`XML_CTYPE_EMPTY'
     Elements which are declared to be `EMPTY' have this model type.

`XML_CTYPE_MIXED'

`XML_CTYPE_NAME'

`XML_CTYPE_SEQ'
     Models which represent a series of models which follow one after the other are indicated with this model type. This is used for models such as `(A, B, C)'.

The constants in the quantifier group are:

`XML_CQUANT_NONE'

`XML_CQUANT_OPT'
     The model is optional: it can appear once or not at all, as for `A?'.

`XML_CQUANT_PLUS'
     The model must occur one or more times (`A+').

`XML_CQUANT_REP'
     The model must occur zero or more times, as for `A*'.

File: python-lib.info, Node: Expat error constants, Prev: Content Model Descriptions, Up: xml.parsers.expat

Expat error constants
---------------------

This section was written by A.M. Kuchling <amk1@bigfoot.com>.

The following constants are provided in the `errors' object of the `xml.parsers.expat' module. These constants are useful in interpreting some of the attributes of the `ExpatError' exception objects raised when an error has occurred.

The `errors' object has the following attributes:

`XML_ERROR_ASYNC_ENTITY'

`XML_ERROR_ATTRIBUTE_EXTERNAL_ENTITY_REF'
     An entity reference in an attribute value referred to an external entity instead of an internal entity.

`XML_ERROR_BAD_CHAR_REF'

`XML_ERROR_BINARY_ENTITY_REF'

`XML_ERROR_DUPLICATE_ATTRIBUTE'
     An attribute was used more than once in a start tag.

`XML_ERROR_INCORRECT_ENCODING'

`XML_ERROR_INVALID_TOKEN'

`XML_ERROR_JUNK_AFTER_DOC_ELEMENT'
     Something other than whitespace occurred after the document element.

`XML_ERROR_MISPLACED_XML_PI'

`XML_ERROR_NO_ELEMENTS'
     The document contains no elements.

`XML_ERROR_NO_MEMORY'
     Expat was not able to allocate memory internally.

`XML_ERROR_PARAM_ENTITY_REF'

`XML_ERROR_PARTIAL_CHAR'

`XML_ERROR_RECURSIVE_ENTITY_REF'

`XML_ERROR_SYNTAX'
     Some unspecified syntax error was encountered.

`XML_ERROR_TAG_MISMATCH'
     An end tag did not match the innermost open start tag.

`XML_ERROR_UNCLOSED_TOKEN'

`XML_ERROR_UNDEFINED_ENTITY'
     A reference was made to an entity which was not defined.

`XML_ERROR_UNKNOWN_ENCODING'
     The document encoding is not supported by Expat.

File: python-lib.info, Node: xml.dom, Next: xml.dom.minidom, Prev: xml.parsers.expat, Up: Structured Markup Processing Tools

The Document Object Model API
=============================

Document Object Model API for Python.

This section was written by Paul Prescod <paul@prescod.net>.

This section was written by Martin v. Löwis <loewis@informatik.hu-berlin.de>.

_Added in Python version 2.0_

The Document Object Model, or "DOM," is a cross-language API from the World Wide Web Consortium (W3C) for accessing and modifying XML documents. A DOM implementation presents an XML document as a tree structure, or allows client code to build such a structure from scratch. It then gives access to the structure through a set of objects which provide well-known interfaces.

The DOM is extremely useful for random-access applications. SAX only allows you a view of one bit of the document at a time. If you are looking at one SAX element, you have no access to another. If you are looking at a text node, you have no access to a containing element. When you write a SAX application, you need to keep track of your program's position in the document somewhere in your own code. SAX does not do it for you.
Also, if you need to look ahead in the XML document, you are just out of luck.

Some applications are simply impossible in an event-driven model with no access to a tree. Of course you could build some sort of tree yourself from SAX events, but the DOM allows you to avoid writing that code. The DOM is a standard tree representation for XML data.

The Document Object Model is being defined by the W3C in stages, or "levels" in their terminology. The Python mapping of the API is substantially based on the DOM Level 2 recommendation. Some aspects of the API will only become available in Python 2.1, or may only be available in particular DOM implementations.

DOM applications typically start by parsing some XML into a DOM. How this is accomplished is not covered at all by DOM Level 1, and Level 2 provides only limited improvements. There is a `DOMImplementation' object class which provides access to `Document' creation methods, but these methods were only added in DOM Level 2 and were not implemented in time for Python 2.0. There is also no well-defined way to access these methods without an existing `Document' object. For Python 2.0, consult the documentation for each particular DOM implementation to determine the bootstrap procedure needed to create and initialize `Document' and `DocumentType' instances.

Once you have a DOM document object, you can access the parts of your XML document through its properties and methods. These properties are defined in the DOM specification; this portion of the reference manual describes the interpretation of the specification in Python.

The specification provided by the W3C defines the DOM API for Java, ECMAScript, and OMG IDL. The Python mapping defined here is based in large part on the IDL version of the specification, but strict compliance is not required (though implementations are free to support the strict mapping from IDL). See section *Note Conformance::, "Conformance," for a detailed discussion of mapping requirements.
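As an illustration of reaching the parts of a document through node properties, here is a minimal sketch. It assumes the `xml.dom.minidom' implementation and its `parseString()' function; the sample document and the helper name `describe' are invented for this example, and other DOM implementations may bootstrap differently.

```python
# Sketch only: parse a small document with xml.dom.minidom and walk the
# resulting tree through the Node properties described in this chapter.
# SAMPLE and describe() are illustrative names, not part of any API.
from xml.dom import Node
from xml.dom.minidom import parseString

SAMPLE = "<library><book id='b1'>Dune</book><!-- note --></library>"


def describe(node, depth=0, out=None):
    """Collect (depth, nodeType, nodeName) for a node and its descendants."""
    if out is None:
        out = []
    out.append((depth, node.nodeType, node.nodeName))
    for child in node.childNodes:
        describe(child, depth + 1, out)
    return out


doc = parseString(SAMPLE)        # the Document node
root = doc.documentElement       # the <library> element
book = root.firstChild           # the <book> element
rows = describe(doc)             # one tuple per node in the tree
```

Unlike a SAX handler, the code can move freely in either direction: `book.parentNode' leads back to the root element, and `book.nextSibling' is the comment node that follows it.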
See also:

`Document Object Model (DOM) Level 2 Specification'
     The W3C recommendation upon which the Python DOM API is based.

`Document Object Model (DOM) Level 1 Specification'
     The W3C recommendation for the DOM supported by `xml.dom.minidom'.

`PyXML'
     Users that require a full-featured implementation of DOM should use the PyXML package.

`CORBA Scripting with Python'
     This specifies the mapping from OMG IDL to Python.

* Menu:

* Module Contents 2::
* Objects in the DOM::
* Conformance::

File: python-lib.info, Node: Module Contents 2, Next: Objects in the DOM, Prev: xml.dom, Up: xml.dom

Module Contents
---------------

The `xml.dom' module contains the following functions:

`registerDOMImplementation(name, factory)'
     Register the FACTORY function with the name NAME. The factory function should return an object which implements the `DOMImplementation' interface. The factory function can return the same object every time, or a new one for each call, as appropriate for the specific implementation (e.g. if that implementation supports some customization).

`getDOMImplementation(name = None, features = ())'
     Return a suitable DOM implementation. The NAME is either well-known, the module name of a DOM implementation, or `None'. If it is not `None', imports the corresponding module and returns a `DOMImplementation' object if the import succeeds. If no name is given, and if the environment variable `PYTHON_DOM' is set, this variable is used to find the implementation.

     If name is not given, consider the available implementations to find one with the required feature set. If no implementation can be found, raise an `ImportError'. The features list must be a sequence of (feature, version) pairs which are passed to `hasFeature()'.

In addition, `xml.dom' contains the `Node' class and the DOM exception classes.
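The two functions above might be used as in the following sketch. It assumes that `xml.dom.minidom' is the implementation found on the system and that it reports support for the `("core", "2.0")' feature; the registered name `"example-dom"' and the factory `_example_factory' are hypothetical and exist only for this illustration.

```python
# Sketch only: look up a DOM implementation, build a document with it, and
# register a factory under an invented name.
from xml.dom import getDOMImplementation, registerDOMImplementation

# No name given: ask for any available implementation that claims to
# support the listed (feature, version) pairs.
impl = getDOMImplementation(features=[("core", "2.0")])

# DOM Level 2 document creation: namespace URI, root element name, doctype.
doc = impl.createDocument(None, "inventory", None)
item = doc.createElement("item")
item.setAttribute("sku", "A-100")
doc.documentElement.appendChild(item)


def _example_factory():
    """Hypothetical factory; here it simply hands back minidom's object."""
    return getDOMImplementation("minidom")


# "example-dom" is an invented name used only for this example.
registerDOMImplementation("example-dom", _example_factory)
registered_impl = getDOMImplementation("example-dom")
```

A factory is free to return a shared object, as this one does, or to construct a fresh implementation object on each call.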
File: python-lib.info, Node: Objects in the DOM, Next: Conformance, Prev: Module Contents 2, Up: xml.dom

Objects in the DOM
------------------

The definitive documentation for the DOM is the DOM specification from the W3C.

Note that DOM attributes may also be manipulated as nodes instead of as simple strings. It is fairly rare that you must do this, however, so this usage is not yet documented.

Interface              Section                                  Purpose
---------              -------                                  -------
DOMImplementation      *Note DOMImplementation Objects::        Interface to the underlying implementation.
Node                   *Note Node Objects::                     Base interface for most objects in a document.
NodeList               *Note NodeList Objects::                 Interface for a sequence of nodes.
DocumentType           *Note DocumentType Objects::             Information about the declarations needed to process a document.
Document               *Note Document Objects::                 Object which represents an entire document.
Element                *Note Element Objects::                  Element nodes in the document hierarchy.
Attr                   *Note Attr Objects::                     Attribute value nodes on element nodes.
Comment                *Note Comment Objects::                  Representation of comments in the source document.
Text                   *Note Text and CDATASection Objects::    Nodes containing textual content from the document.
ProcessingInstruction  *Note ProcessingInstruction Objects::    Processing instruction representation.

An additional section describes the exceptions defined for working with the DOM in Python.

* Menu:

* DOMImplementation Objects::
* Node Objects::
* NodeList Objects::
* DocumentType Objects::
* Document Objects::
* Element Objects::
* Attr Objects::
* NamedNodeMap Objects::
* Comment Objects::
* Text and CDATASection Objects::
* ProcessingInstruction Objects::
* Exceptions 2::

File: python-lib.info, Node: DOMImplementation Objects, Next: Node Objects, Prev: Objects in the DOM, Up: Objects in the DOM

DOMImplementation Objects
.........................

The `DOMImplementation' interface provides a way for applications to determine the availability of particular features in the DOM they are using.
DOM Level 2 added the ability to create new `Document' and `DocumentType' objects using the `DOMImplementation' as well.

`hasFeature(feature, version)'

File: python-lib.info, Node: Node Objects, Next: NodeList Objects, Prev: DOMImplementation Objects, Up: Objects in the DOM

Node Objects
............

All of the components of an XML document are subclasses of `Node'.

`nodeType'
     An integer representing the node type. Symbolic constants for the types are on the `Node' object: `ELEMENT_NODE', `ATTRIBUTE_NODE', `TEXT_NODE', `CDATA_SECTION_NODE', `ENTITY_NODE', `PROCESSING_INSTRUCTION_NODE', `COMMENT_NODE', `DOCUMENT_NODE', `DOCUMENT_TYPE_NODE', `NOTATION_NODE'. This is a read-only attribute.

`parentNode'
     The parent of the current node, or `None' for the document node. The value is always a `Node' object or `None'. For `Element' nodes, this will be the parent element, except for the root element, in which case it will be the `Document' object. For `Attr' nodes, this is always `None'. This is a read-only attribute.

`attributes'
     A `NamedNodeMap' of attribute objects. Only elements have actual values for this; others provide `None' for this attribute. This is a read-only attribute.

`previousSibling'
     The node that immediately precedes this one with the same parent. For instance, this is the element whose end-tag comes just before the SELF element's start-tag. Of course, XML documents are made up of more than just elements, so the previous sibling could be text, a comment, or something else. If this node is the first child of the parent, this attribute will be `None'. This is a read-only attribute.

`nextSibling'
     The node that immediately follows this one with the same parent. See also `previousSibling'. If this is the last child of the parent, this attribute will be `None'. This is a read-only attribute.

`childNodes'
     A list of nodes contained within this node. This is a read-only attribute.

`firstChild'
     The first child of the node, if there is one, or `None'.
     This is a read-only attribute.

`lastChild'
     The last child of the node, if there is one, or `None'. This is a read-only attribute.

`localName'
     The part of the `tagName' following the colon if there is one, else the entire `tagName'. The value is a string.

`prefix'
     The part of the `tagName' preceding the colon if there is one, else the empty string. The value is a string, or `None'.

`namespaceURI'
     The namespace associated with the element name. This will be a string or `None'. This is a read-only attribute.

`nodeName'
     This has a different meaning for each node type; see the DOM specification for details. You can always get the information you would get here from another property such as the `tagName' property for elements or the `name' property for attributes. For all node types, the value of this attribute will be either a string or `None'. This is a read-only attribute.

`nodeValue'
     This has a different meaning for each node type; see the DOM specification for details. The situation is similar to that with `nodeName'. The value is a string or `None'.

`hasAttributes()'
     Returns true if the node has any attributes.

`hasChildNodes()'
     Returns true if the node has any child nodes.

`isSameNode(other)'
     Returns true if OTHER refers to the same node as this node. This is especially useful for DOM implementations which use any sort of proxy architecture (because more than one object can refer to the same node).

     *Note:* This is based on a proposed DOM Level 3 API which is still in the "working draft" stage, but this particular interface appears uncontroversial. Changes from the W3C will not necessarily affect this method in the Python DOM interface (though any new W3C API for this would also be supported).

`appendChild(newChild)'
     Add a new child node to this node at the end of the list of children, returning NEWCHILD.

`insertBefore(newChild, refChild)'
     Insert a new child node before an existing child.
     It must be the case that REFCHILD is a child of this node; if not, `ValueError' is raised. NEWCHILD is returned.

`removeChild(oldChild)'
     Remove a child node. OLDCHILD must be a child of this node; if not, `ValueError' is raised. OLDCHILD is returned on success. If OLDCHILD will not be used further, its `unlink()' method should be called.

`replaceChild(newChild, oldChild)'
     Replace an existing node with a new node. It must be the case that OLDCHILD is a child of this node; if not, `ValueError' is raised.

`normalize()'
     Join adjacent text nodes so that all stretches of text are stored as single `Text' instances. This simplifies processing text from a DOM tree for many applications. _Added in Python version 2.1_

`cloneNode(deep)'
     Clone this node. Setting DEEP means to clone all child nodes as well. This returns the clone.

File: python-lib.info, Node: NodeList Objects, Next: DocumentType Objects, Prev: Node Objects, Up: Objects in the DOM

NodeList Objects
................

A `NodeList' represents a sequence of nodes. These objects are used in two ways in the DOM Core recommendation: an `Element' object provides one as its list of child nodes, and the `getElementsByTagName()' and `getElementsByTagNameNS()' methods of `Node' return objects with this interface to represent query results.

The DOM Level 2 recommendation defines one method and one attribute for these objects:

`item(i)'
     Return the I'th item from the sequence, if there is one, or `None'. The index I is not allowed to be less than zero or greater than or equal to the length of the sequence.

`length'
     The number of nodes in the sequence.

In addition, the Python DOM interface requires that some additional support is provided to allow `NodeList' objects to be used as Python sequences. All `NodeList' implementations must include support for `__len__()' and `__getitem__()'; this allows iteration over the `NodeList' in `for' statements and proper support for the `len()' built-in function.
If a DOM implementation supports modification of the document, the `NodeList' implementation must also support the `__setitem__()' and `__delitem__()' methods.
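The following sketch shows the DOM-style and Python-style ways of reading a `NodeList' side by side. It assumes the `xml.dom.minidom' implementation; the sample document and variable names are invented for the example, and the out-of-range behaviour of `item()' shown here is what minidom does rather than a guarantee for every implementation.

```python
# Sketch only: treat the result of getElementsByTagName() both as a DOM
# NodeList (item(), length) and as a Python sequence (len(), indexing, for).
from xml.dom.minidom import parseString

doc = parseString(
    "<menu><dish>soup</dish><dish>salad</dish><dish>pie</dish></menu>"
)
dishes = doc.getElementsByTagName("dish")

count_dom = dishes.length        # DOM attribute
count_py = len(dishes)           # Python protocol, via __len__()

first = dishes.item(0)           # DOM method
second = dishes[1]               # Python protocol, via __getitem__()
missing = dishes.item(99)        # minidom returns None when out of range

# Iteration in a for statement works because of the sequence support.
names = []
for dish in dishes:
    names.append(dish.firstChild.data)
```

Code that must run against several DOM implementations can rely on `len()', indexing, and `for' loops, since the Python mapping requires them of every `NodeList'.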