This is Info file /home/pdm/tmp/Python-1.5.2p1/Doc/lib/python-lib.info, produced by Makeinfo version 1.68 from the input file lib.texi.

July 6, 1999 1.5.2

File: python-lib.info, Node: sgmllib, Next: htmllib, Prev: Internet Data Handling, Up: Internet Data Handling

Simple SGML parser
==================

Only as much of an SGML parser as needed to parse HTML.

This module defines a class `SGMLParser' which serves as the basis for parsing text files formatted in SGML (Standard Generalized Mark-up Language). In fact, it does not provide a full SGML parser -- it only parses SGML insofar as it is used by HTML, and the module only exists as a base for the `htmllib' module.

`SGMLParser()'
     The `SGMLParser' class is instantiated without arguments. The parser is hardcoded to recognize the following constructs:

        * Opening and closing tags of the form `<TAG ATTR="VALUE" ...>' and `</TAG>', respectively.

        * Numeric character references of the form `&#NAME;'.

        * Entity references of the form `&NAME;'.

        * SGML comments of the form `<!--TEXT-->'. Note that spaces, tabs, and newlines are allowed between the trailing `>' and the immediately preceding `--'.

`SGMLParser' instances have the following interface methods:

`reset()'
     Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

`setnomoretags()'
     Stop processing tags. Treat all following input as literal input (CDATA). (This is only provided so the HTML tag `<PLAINTEXT>' can be implemented.)

`setliteral()'
     Enter literal mode (CDATA mode).

`feed(data)'
     Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or `close()' is called.

`close()'
     Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call `SGMLParser.close()'.
`handle_starttag(tag, method, attributes)'
     This method is called to handle start tags for which either a `start_TAG()' or `do_TAG()' method has been defined. The TAG argument is the name of the tag converted to lower case, and the METHOD argument is the bound method which should be used to support semantic interpretation of the start tag. The ATTRIBUTES argument is a list of `(NAME, VALUE)' pairs containing the attributes found inside the tag's `<>' brackets. The NAME has been translated to lower case, and double quotes and backslashes in the VALUE have been interpreted. For instance, for the tag `<A HREF="http://www.cwi.nl/">', this method would be called as `handle_starttag('a', METHOD, [('href', 'http://www.cwi.nl/')])', where METHOD is the bound method found for the tag. The base implementation simply calls METHOD with ATTRIBUTES as the only argument.

`handle_endtag(tag, method)'
     This method is called to handle end tags for which an `end_TAG()' method has been defined. The TAG argument is the name of the tag converted to lower case, and the METHOD argument is the bound method which should be used to support semantic interpretation of the end tag. If no `end_TAG()' method is defined for the closing element, this handler is not called. The base implementation simply calls METHOD.

`handle_data(data)'
     This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.

`handle_charref(ref)'
     This method is called to process a character reference of the form `&#REF;'. In the base implementation, REF must be a decimal number in the range 0-255. It translates the character to ASCII and calls the method `handle_data()' with the character as argument. If REF is invalid or out of range, the method `unknown_charref(REF)' is called to handle the error. A subclass must override this method to provide support for named character entities.
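The base-class behaviour of `handle_charref()' described above can be sketched as follows. This is a standalone illustration only: it does not import `sgmllib', the class name `CharRefSketch' and its `data' and `unknown' lists are hypothetical, and the real implementation may differ in detail.

```python
class CharRefSketch:
    """Hypothetical stand-in that mimics the documented handle_charref() rules."""

    def __init__(self):
        self.data = []      # characters passed to handle_data()
        self.unknown = []   # references passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        # REF must be a decimal number in the range 0-255.
        try:
            n = int(ref)
        except ValueError:
            self.unknown_charref(ref)
            return
        if not 0 <= n <= 255:
            self.unknown_charref(ref)
            return
        # A valid reference is delivered as ordinary character data.
        self.handle_data(chr(n))
```

Feeding `'65'' to this sketch delivers the character `A' as data, while `'300'' and `'x41'' are routed to `unknown_charref()'.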
`handle_entityref(ref)'
     This method is called to process a general entity reference of the form `&REF;' where REF is the name of a general entity. It looks for REF in the instance (or class) variable `entitydefs' which should be a mapping from entity names to corresponding translations. If a translation is found, it calls the method `handle_data()' with the translation; otherwise, it calls the method `unknown_entityref(REF)'. The default `entitydefs' defines translations for `&amp;', `&apos;', `&gt;', `&lt;', and `&quot;'.

`handle_comment(comment)'
     This method is called when a comment is encountered. The COMMENT argument is a string containing the text between the `<!--' and `-->' delimiters, but not the delimiters themselves. For example, the comment `<!--text-->' will cause this method to be called with the argument `'text''. The default method does nothing.

`report_unbalanced(tag)'
     This method is called when an end tag is found which does not correspond to any open element.

`unknown_starttag(tag, attributes)'
     This method is called to process an unknown start tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_endtag(tag)'
     This method is called to process an unknown end tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_charref(ref)'
     This method is called to process unresolvable numeric character references. Refer to `handle_charref()' to determine what is handled by default. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_entityref(ref)'
     This method is called to process an unknown entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.

Apart from overriding or extending the methods listed above, derived classes may also define methods of the following form to define processing of specific tags.
Tag names in the input stream are case independent; the TAG occurring in method names must be in lower case:

`start_TAG(attributes)'
     This method is called to process an opening tag TAG. It takes precedence over `do_TAG()'. The ATTRIBUTES argument has the same meaning as described for `handle_starttag()' above.

`do_TAG(attributes)'
     This method is called to process an opening tag TAG that does not come with a matching closing tag. The ATTRIBUTES argument has the same meaning as described for `handle_starttag()' above.

`end_TAG()'
     This method is called to process a closing tag TAG.

Note that the parser maintains a stack of open elements for which no end tag has been found yet. Only tags processed by `start_TAG()' are pushed on this stack. Definition of an `end_TAG()' method is optional for these tags. For tags processed by `do_TAG()' or by `unknown_starttag()', no `end_TAG()' method need be defined; if defined, it will not be used. If both `start_TAG()' and `do_TAG()' methods exist for a tag, the `start_TAG()' method takes precedence.

File: python-lib.info, Node: htmllib, Next: htmlentitydefs, Prev: sgmllib, Up: Internet Data Handling

A parser for HTML documents
===========================

This module defines a class which can serve as a base for parsing text files formatted in the HyperText Mark-up Language (HTML). The class is not directly concerned with I/O -- it must be provided with input in string form via a method, and makes calls to methods of a "formatter" object in order to produce output. The `HTMLParser' class is designed to be used as a base class for other classes in order to add functionality, and allows most of its methods to be extended or overridden. In turn, this class is derived from and extends the `SGMLParser' class defined in module `sgmllib'. The `HTMLParser' implementation supports the HTML 2.0 language as described in RFC 1866.
Two implementations of formatter objects are provided in the `formatter' module; refer to the documentation for that module for information on the formatter interface.

The following is a summary of the interface defined by `sgmllib.SGMLParser':

   * The interface to feed data to an instance is through the `feed()' method, which takes a string argument. This can be called with as little or as much text at a time as desired; `p.feed(a); p.feed(b)' has the same effect as `p.feed(a+b)'. When the data contains complete HTML tags, these are processed immediately; incomplete elements are saved in a buffer. To force processing of all unprocessed data, call the `close()' method.

     For example, to parse the entire contents of a file, use:

          parser.feed(open('myfile.html').read())
          parser.close()

   * The interface to define semantics for HTML tags is very simple: derive a class and define methods called `start_TAG()', `end_TAG()', or `do_TAG()'. The parser will call these at appropriate moments: `start_TAG()' or `do_TAG()' is called when an opening tag of the form `<TAG ...>' is encountered; `end_TAG()' is called when a closing tag of the form `</TAG>' is encountered. If an opening tag requires a corresponding closing tag, like `<H1>' ... `</H1>', the class should define the `start_TAG()' method; if a tag requires no closing tag, like `<P>', the class should define the `do_TAG()' method.

The module defines a single class:

`HTMLParser(formatter)'
     This is the basic HTML parser class. It supports all entity names required by the HTML 2.0 specification (RFC 1866). It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.

See also:

     *Note htmlentitydefs:: Definition of replacement text for HTML 2.0 entities.

     *Note sgmllib:: Base class for `HTMLParser'.
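The `start_TAG()' / `do_TAG()' / `end_TAG()' lookup summarized above can be illustrated with a small standalone sketch. It does not use `sgmllib' or `htmllib'; the names `TagDispatchSketch', `open_tag()', `close_tag()' and `log' are hypothetical, and the real parser also tokenizes input and tracks a stack of open elements, which is omitted here.

```python
class TagDispatchSketch:
    """Hypothetical dispatcher: finds handler methods by tag name."""

    def __init__(self):
        self.log = []

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown_start', tag))

    def unknown_endtag(self, tag):
        self.log.append(('unknown_end', tag))

    def open_tag(self, tag, attributes):
        tag = tag.lower()                      # tag names are case independent
        # start_TAG() takes precedence over do_TAG().
        for prefix in ('start_', 'do_'):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                method(attributes)
                return
        self.unknown_starttag(tag, attributes)

    def close_tag(self, tag):
        tag = tag.lower()
        method = getattr(self, 'end_' + tag, None)
        if method is None:
            self.unknown_endtag(tag)
        else:
            method()


class Demo(TagDispatchSketch):
    def start_h1(self, attributes):            # <H1> needs a closing tag
        self.log.append(('start_h1', attributes))

    def end_h1(self):
        self.log.append(('end_h1',))

    def do_p(self, attributes):                # <P> needs no closing tag
        self.log.append(('do_p', attributes))
```

Calling `open_tag('H1', [])' on a `Demo' instance reaches `start_h1()', while a tag with no handler, such as `BLINK', falls through to `unknown_starttag()'.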
* Menu:

* HTMLParser Objects::

File: python-lib.info, Node: HTMLParser Objects, Prev: htmllib, Up: htmllib

HTMLParser Objects
------------------

In addition to tag methods, the `HTMLParser' class provides some additional methods and instance variables for use within tag methods.

`formatter'
     This is the formatter instance associated with the parser.

`nofill'
     Boolean flag which should be true when whitespace should not be collapsed, or false when it should be. In general, this should only be true when character data is to be treated as "preformatted" text, as within a `<PRE>' element. The default value is false. This affects the operation of `handle_data()' and `save_end()'.

`anchor_bgn(href, name, type)'
     This method is called at the start of an anchor region. The arguments correspond to the attributes of the `<A>' tag with the same names. The default implementation maintains a list of hyperlinks (defined by the `HREF' attribute for `<A>' tags) within the document. The list of hyperlinks is available as the data attribute `anchorlist'.

`anchor_end()'
     This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by `anchor_bgn()'.

`handle_image(source, alt[, ismap[, align[, width[, height]]]])'
     This method is called to handle images. The default implementation simply passes the ALT value to the `handle_data()' method.

`save_bgn()'
     Begins saving character data in a buffer instead of sending it to the formatter object. Retrieve the stored data via `save_end()'. Use of the `save_bgn()' / `save_end()' pair may not be nested.

`save_end()'
     Ends buffering character data and returns all data saved since the preceding call to `save_bgn()'. If the `nofill' flag is false, whitespace is collapsed to single spaces. A call to this method without a preceding call to `save_bgn()' will raise a `TypeError' exception.
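The `save_bgn()' / `save_end()' contract described above can be sketched in isolation. This is an assumption-laden illustration rather than the `htmllib' source: `SaveBufferSketch', `savedata' and `sent' are hypothetical names, and `sent' merely stands in for data that would otherwise go to the formatter.

```python
class SaveBufferSketch:
    """Hypothetical model of the documented save_bgn()/save_end() behaviour."""

    def __init__(self):
        self.savedata = None   # None means "not currently saving"
        self.nofill = False    # true inside preformatted text
        self.sent = []         # stand-in for data passed to the formatter

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data      # buffer instead of formatting
        else:
            self.sent.append(data)

    def save_bgn(self):
        self.savedata = ''

    def save_end(self):
        data = self.savedata
        self.savedata = None
        if data is None:
            # Mirrors the documented TypeError for an unmatched save_end().
            raise TypeError('save_end() called without save_bgn()')
        if not self.nofill:
            # Collapse every run of whitespace to a single space.
            data = ' '.join(data.split())
        return data
```

A tag method such as a title handler could call `save_bgn()' on the start tag and use the string returned by `save_end()' on the end tag.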
File: python-lib.info, Node: htmlentitydefs, Next: xmllib, Prev: htmllib, Up: Internet Data Handling

Definitions of HTML general entities
====================================

This section was written by Fred L. Drake, Jr. <fdrake@acm.org>.

This module defines a single dictionary, `entitydefs', which is used by the `htmllib' module to provide the `entitydefs' member of the `HTMLParser' class. The definition provided here contains all the entities defined by HTML 2.0 that can be handled using simple textual substitution in the Latin-1 character set (ISO-8859-1).

`entitydefs'
     A dictionary mapping HTML 2.0 entity definitions to their replacement text in ISO Latin-1.

File: python-lib.info, Node: xmllib, Next: formatter, Prev: htmlentitydefs, Up: Internet Data Handling

A parser for XML documents
==========================

This module was documented by Sjoerd Mullender <Sjoerd.Mullender@cwi.nl>.
This section was written by Sjoerd Mullender <Sjoerd.Mullender@cwi.nl>.

*Changed in Python version 1.5.2*

This module defines a class `XMLParser' which serves as the basis for parsing text files formatted in XML (Extensible Markup Language).

`XMLParser()'
     The `XMLParser' class must be instantiated without arguments.

This class provides the following interface methods and instance variables:

`attributes'
     A mapping of element names to mappings. The latter mapping maps attribute names that are valid for the element to the default value of the attribute, or to `None' if there is no default. The default value is the empty dictionary. This variable is meant to be overridden, not extended, since the default is shared by all instances of `XMLParser'.

`elements'
     A mapping of element names to tuples. The tuples contain a function for handling the start and end tag respectively of the element, or `None' if the method `unknown_starttag()' or `unknown_endtag()' is to be called. The default value is the empty dictionary.
     This variable is meant to be overridden, not extended, since the default is shared by all instances of `XMLParser'.

`entitydefs'
     A mapping of entity names to their values. The default value contains definitions for `'lt'', `'gt'', `'amp'', `'quot'', and `'apos''.

`reset()'
     Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

`setnomoretags()'
     Stop processing tags. Treat all following input as literal input (CDATA).

`setliteral()'
     Enter literal mode (CDATA mode). This mode is automatically exited when the close tag matching the last unclosed open tag is encountered.

`feed(data)'
     Feed some text to the parser. It is processed insofar as it consists of complete tags; incomplete data is buffered until more data is fed or `close()' is called.

`close()'
     Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call `XMLParser.close()'.

`translate_references(data)'
     Translate all entity and character references in DATA and return the translated string.

`handle_xml(encoding, standalone)'
     This method is called when the `<?xml ...?>' tag is processed. The arguments are the values of the encoding and standalone attributes in the tag. Both encoding and standalone are optional. The values passed to `handle_xml()' default to `None' and the string `'no'' respectively.

`handle_doctype(tag, data)'
     This method is called when the `<!DOCTYPE...>' tag is processed. The arguments are the name of the root element and the uninterpreted contents of the tag, starting after the white space after the name of the root element.

`handle_starttag(tag, method, attributes)'
     This method is called to handle start tags for which a start tag handler is defined in the instance variable `elements'.
     The TAG argument is the name of the tag, and the METHOD argument is the function (method) which should be used to support semantic interpretation of the start tag. The ATTRIBUTES argument is a dictionary of attributes, the key being the NAME and the value being the VALUE of the attribute found inside the tag's `<>' brackets. Character and entity references in the VALUE have been interpreted. For instance, for the start tag `<A HREF="http://www.cwi.nl/">', this method would be called as `handle_starttag('A', self.elements['A'][0], {'HREF': 'http://www.cwi.nl/'})'. The base implementation simply calls METHOD with ATTRIBUTES as the only argument.

`handle_endtag(tag, method)'
     This method is called to handle end tags for which an end tag handler is defined in the instance variable `elements'. The TAG argument is the name of the tag, and the METHOD argument is the function (method) which should be used to support semantic interpretation of the end tag. For instance, for the end tag `</A>', this method would be called as `handle_endtag('A', self.elements['A'][1])'. The base implementation simply calls METHOD.

`handle_data(data)'
     This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.

`handle_charref(ref)'
     This method is called to process a character reference of the form `&#REF;'. REF can either be a decimal number, or a hexadecimal number when preceded by an `x'. In the base implementation, REF must be a number in the range 0-255. It translates the character to ASCII and calls the method `handle_data()' with the character as argument. If REF is invalid or out of range, the method `unknown_charref(REF)' is called to handle the error. A subclass must override this method to provide support for character references outside of the ASCII range.

`handle_entityref(ref)'
     This method is called to process a general entity reference of the form `&REF;' where REF is the name of a general entity.
     It looks for REF in the instance (or class) variable `entitydefs' which should be a mapping from entity names to corresponding translations. If a translation is found, it calls the method `handle_data()' with the translation; otherwise, it calls the method `unknown_entityref(REF)'. The default `entitydefs' defines translations for `&amp;', `&apos;', `&gt;', `&lt;', and `&quot;'.

`handle_comment(comment)'
     This method is called when a comment is encountered. The COMMENT argument is a string containing the text between the `<!--' and `-->' delimiters, but not the delimiters themselves. For example, the comment `<!--text-->' will cause this method to be called with the argument `'text''. The default method does nothing.

`handle_cdata(data)'
     This method is called when a CDATA section is encountered. The DATA argument is a string containing the text between the `<![CDATA[' and `]]>' delimiters, but not the delimiters themselves. For example, the section `<![CDATA[text]]>' will cause this method to be called with the argument `'text''. The default method does nothing, and is intended to be overridden.

`handle_proc(name, data)'
     This method is called when a processing instruction (PI) is encountered. The NAME is the PI target, and the DATA argument is a string containing the text between the PI target and the closing delimiter, but not the delimiter itself. For example, the instruction `<?XML text?>' will cause this method to be called with the arguments `'XML'' and `'text''. The default method does nothing. Note that if a document starts with `<?xml ..?>', `handle_xml()' is called to handle it.

`handle_special(data)'
     This method is called when a declaration is encountered. The DATA argument is a string containing the text between the `<!' and `>' delimiters, but not the delimiters themselves. For example, the declaration `<!ENTITY text>' will cause this method to be called with the argument `'ENTITY text''. The default method does nothing.
     Note that `<!DOCTYPE ...>' is handled separately if it is located at the start of the document.

`syntax_error(message)'
     This method is called when a syntax error is encountered. The MESSAGE is a description of what was wrong. The default method raises a `RuntimeError' exception. If this method is overridden, it is permissible for it to return. This method is only called when the error can be recovered from. Unrecoverable errors raise a `RuntimeError' without first calling `syntax_error()'.

`unknown_starttag(tag, attributes)'
     This method is called to process an unknown start tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_endtag(tag)'
     This method is called to process an unknown end tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_charref(ref)'
     This method is called to process unresolvable numeric character references. It is intended to be overridden by a derived class; the base class implementation does nothing.

`unknown_entityref(ref)'
     This method is called to process an unknown entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.

See also:

     The Python XML Topic Guide provides a great deal of information on using XML from Python and links to other sources of information on XML. It's located on the Web at `http://www.python.org/topics/xml/'.

     The Python XML Special Interest Group is developing substantial support for processing XML from Python. See `http://www.python.org/sigs/xml-sig/' for more information.

* Menu:

* XML Namespaces::

File: python-lib.info, Node: XML Namespaces, Prev: xmllib, Up: xmllib

XML Namespaces
--------------

This module has support for XML namespaces as defined in the XML Namespaces proposed recommendation.

Tag and attribute names that are defined in an XML namespace are handled as if the name of the tag or element consisted of the namespace (i.e.
the URL that defines the namespace) followed by a space and the name of the tag or attribute. For instance, the tag `<html xmlns='http://www.w3.org/TR/REC-html40'>' is treated as if the tag name were `'http://www.w3.org/TR/REC-html40 html'', and the tag `<html:a href='http://frob.com'>' inside the above mentioned element is treated as if the tag name were `'http://www.w3.org/TR/REC-html40 a'' and the attribute name as if it were `'http://www.w3.org/TR/REC-html40 href''.

An older draft of the XML Namespaces proposal is also recognized, but triggers a warning.

File: python-lib.info, Node: formatter, Next: rfc822, Prev: xmllib, Up: Internet Data Handling

Generic output formatting
=========================

Generic output formatter and device interface.

This module supports two interface definitions, each with multiple implementations. The *formatter* interface is used by the `HTMLParser' class of the `htmllib' module, and the *writer* interface is required by the formatter interface.

Formatter objects transform an abstract flow of formatting events into specific output events on writer objects. Formatters manage several stack structures to allow various properties of a writer object to be changed and restored; writers need not be able to handle relative changes nor any sort of "change back" operation. Specific writer properties which may be controlled via formatter objects are horizontal alignment, font, and left margin indentations. A mechanism is provided which supports providing arbitrary, non-exclusive style settings to a writer as well. Additional interfaces facilitate formatting events which are not reversible, such as paragraph separation.

Writer objects encapsulate device interfaces. Abstract devices, such as file formats, are supported as well as physical devices. The provided implementations all work with abstract devices. The interface makes available mechanisms for setting the properties which formatter objects manage and inserting data into the output.
* Menu:

* Formatter Interface::
* Formatter Implementations::
* Writer Interface::
* Writer Implementations::

File: python-lib.info, Node: Formatter Interface, Next: Formatter Implementations, Prev: formatter, Up: formatter

The Formatter Interface
-----------------------

Interfaces to create formatters are dependent on the specific formatter class being instantiated. The interfaces described below are the required interfaces which all formatters must support once initialized.

One data element is defined at the module level:

`AS_IS'
     Value which can be used in the font specification passed to the `push_font()' method described below, or as the new value to any other `push_PROPERTY()' method. Pushing the `AS_IS' value allows the corresponding `pop_PROPERTY()' method to be called without having to track whether the property was changed.

The following attributes are defined for formatter instance objects:

`writer'
     The writer instance with which the formatter interacts.

`end_paragraph(blanklines)'
     Close any open paragraphs and insert at least BLANKLINES blank lines before the next paragraph.

`add_line_break()'
     Add a hard line break if one does not already exist. This does not break the logical paragraph.

`add_hor_rule(*args, **kw)'
     Insert a horizontal rule in the output. A hard break is inserted if there is data in the current paragraph, but the logical paragraph is not broken. The arguments and keywords are passed on to the writer's `send_hor_rule()' method.

`add_flowing_data(data)'
     Provide data which should be formatted with collapsed whitespace. Whitespace from preceding and successive calls to `add_flowing_data()' is considered as well when the whitespace collapse is performed. The data which is passed to this method is expected to be word-wrapped by the output device. Note that any word-wrapping still must be performed by the writer object due to the need to rely on device and font information.
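How whitespace can be collapsed across successive `add_flowing_data()' calls is sketched below. This is a standalone illustration under stated assumptions, not the module's code: the class name `FlowCollapserSketch', the `out' list (standing in for the writer) and the two flags `softspace' and `nospace' are hypothetical choices made for the example.

```python
class FlowCollapserSketch:
    """Hypothetical model of cross-call whitespace collapsing."""

    def __init__(self):
        self.out = []           # stand-in for writer.send_flowing_data()
        self.softspace = False  # a space is pending from an earlier call
        self.nospace = True     # nothing emitted yet, so drop leading space

    def add_flowing_data(self, data):
        if not data:
            return
        prespace = data[:1].isspace()
        postspace = data[-1:].isspace()
        data = ' '.join(data.split())       # collapse runs inside this chunk
        if self.nospace and not data:
            return                          # only whitespace at the very start
        if prespace or self.softspace:
            if not data:
                # Whitespace-only chunk: remember it as one pending space.
                if not self.nospace:
                    self.softspace = True
                return
            if not self.nospace:
                data = ' ' + data           # emit exactly one separating space
        self.nospace = False
        self.softspace = postspace          # trailing space is deferred
        self.out.append(data)
```

With this sketch, passing `'Hello  '', `'  world\n'' and `'again'' in three separate calls yields output that joins to `Hello world again', with one space at each boundary.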
`add_literal_data(data)'
     Provide data which should be passed to the writer unchanged. Whitespace, including newline and tab characters, is considered legal in the value of DATA.

`add_label_data(format, counter)'
     Insert a label which should be placed to the left of the current left margin. This should be used for constructing bulleted or numbered lists. If the FORMAT value is a string, it is interpreted as a format specification for COUNTER, which should be an integer. The result of this formatting becomes the value of the label; if FORMAT is not a string it is used as the label value directly. The label value is passed as the only argument to the writer's `send_label_data()' method. Interpretation of non-string label values is dependent on the associated writer.

     Format specifications are strings which, in combination with a counter value, are used to compute label values. Each character in the format string is copied to the label value, with some characters recognized to indicate a transform on the counter value. Specifically, the character `1' represents the counter value formatted as an arabic number, the characters `A' and `a' represent alphabetic representations of the counter value in upper and lower case, respectively, and `I' and `i' represent the counter value in Roman numerals, in upper and lower case. Note that the alphabetic and roman transforms require that the counter value be greater than zero.

`flush_softspace()'
     Send any pending whitespace buffered from a previous call to `add_flowing_data()' to the associated writer object. This should be called before any direct manipulation of the writer object.

`push_alignment(align)'
     Push a new alignment setting onto the alignment stack. This may be `AS_IS' if no change is desired. If the alignment value is changed from the previous setting, the writer's `new_alignment()' method is called with the ALIGN value.

`pop_alignment()'
     Restore the previous alignment.
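The label format specifications described for `add_label_data()' can be illustrated with a standalone sketch. The function names `format_counter()', `format_letter()' and `format_roman()' are hypothetical helpers for this example. Two behaviours are assumptions not stated in the text above: a non-positive counter produces no output for the alphabetic and roman transforms, and counters above 26 continue as `aa', `ab', and so on.

```python
def format_letter(case, counter):
    """Alphabetic label: 1 -> a, 26 -> z, 27 -> aa (assumed continuation)."""
    label = ''
    while counter > 0:
        counter, x = divmod(counter - 1, 26)
        label = chr(ord(case) + x) + label
    return label


def format_roman(case, counter):
    """Roman numeral label in the case given by CASE ('i' or 'I')."""
    pairs = [(1000, 'm'), (900, 'cm'), (500, 'd'), (400, 'cd'),
             (100, 'c'), (90, 'xc'), (50, 'l'), (40, 'xl'),
             (10, 'x'), (9, 'ix'), (5, 'v'), (4, 'iv'), (1, 'i')]
    label = ''
    for value, numeral in pairs:
        while counter >= value:
            label += numeral
            counter -= value
    return label.upper() if case == 'I' else label


def format_counter(fmt, counter):
    """Copy each character of FMT, transforming '1', 'a', 'A', 'i' and 'I'."""
    label = ''
    for c in fmt:
        if c == '1':
            label += str(counter)
        elif c in 'aA':
            if counter > 0:          # assumption: emit nothing otherwise
                label += format_letter(c, counter)
        elif c in 'iI':
            if counter > 0:          # assumption: emit nothing otherwise
                label += format_roman(c, counter)
        else:
            label += c               # ordinary characters are copied as-is
    return label
```

Under these assumptions the format `'1.'' with counter 3 gives `3.', `'(a)'' with counter 2 gives `(b)', and `'I.'' with counter 4 gives `IV.'.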
`push_font((size, italic, bold, teletype))'
     Change some or all font properties of the writer object. Properties which are not set to `AS_IS' are set to the values passed in while others are maintained at their current settings. The writer's `new_font()' method is called with the fully resolved font specification.

`pop_font()'
     Restore the previous font.

`push_margin(margin)'
     Increase the number of left margin indentations by one, associating the logical tag MARGIN with the new indentation. The initial margin level is `0'. Changed values of the logical tag must be true values; false values other than `AS_IS' are not sufficient to change the margin.

`pop_margin()'
     Restore the previous margin.

`push_style(*styles)'
     Push any number of arbitrary style specifications. All styles are pushed onto the styles stack in order. A tuple representing the entire stack, including `AS_IS' values, is passed to the writer's `new_styles()' method.

`pop_style([n = 1])'
     Pop the last N style specifications passed to `push_style()'. A tuple representing the revised stack, including `AS_IS' values, is passed to the writer's `new_styles()' method.

`set_spacing(spacing)'
     Set the spacing style for the writer.

`assert_line_data([flag = 1])'
     Inform the formatter that data has been added to the current paragraph out-of-band. This should be used when the writer has been manipulated directly. The optional FLAG argument can be set to false if the writer manipulations produced a hard line break at the end of the output.

File: python-lib.info, Node: Formatter Implementations, Next: Writer Interface, Prev: Formatter Interface, Up: formatter

Formatter Implementations
-------------------------

Two implementations of formatter objects are provided by this module. Most applications may use one of these classes without modification or subclassing.

`NullFormatter([writer])'
     A formatter which does nothing. If WRITER is omitted, a `NullWriter' instance is created.
     No methods of the writer are called by `NullFormatter' instances. Implementations should inherit from this class if implementing a writer interface but don't need to inherit any implementation.

`AbstractFormatter(writer)'
     The standard formatter. This implementation has demonstrated wide applicability to many writers, and may be used directly in most circumstances. It has been used to implement a full-featured world-wide web browser.

File: python-lib.info, Node: Writer Interface, Next: Writer Implementations, Prev: Formatter Implementations, Up: formatter

The Writer Interface
--------------------

Interfaces to create writers are dependent on the specific writer class being instantiated. The interfaces described below are the required interfaces which all writers must support once initialized. Note that while most applications can use the `AbstractFormatter' class as a formatter, the writer must typically be provided by the application.

`flush()'
     Flush any buffered output or device control events.

`new_alignment(align)'
     Set the alignment style. The ALIGN value can be any object, but by convention is a string or `None', where `None' indicates that the writer's "preferred" alignment should be used. Conventional ALIGN values are `'left'', `'center'', `'right'', and `'justify''.

`new_font(font)'
     Set the font style. The value of FONT will be `None', indicating that the device's default font should be used, or a tuple of the form `(SIZE, ITALIC, BOLD, TELETYPE)'. Size will be a string indicating the size of font that should be used; specific strings and their interpretation must be defined by the application. The ITALIC, BOLD, and TELETYPE values are boolean indicators specifying which of those font attributes should be used.

`new_margin(margin, level)'
     Set the margin level to the integer LEVEL and the logical tag to MARGIN.
     Interpretation of the logical tag is at the writer's discretion; the only restriction on the value of the logical tag is that it not be a false value for non-zero values of LEVEL.

`new_spacing(spacing)'
     Set the spacing style to SPACING.

`new_styles(styles)'
     Set additional styles. The STYLES value is a tuple of arbitrary values; the value `AS_IS' should be ignored. The STYLES tuple may be interpreted either as a set or as a stack depending on the requirements of the application and writer implementation.

`send_line_break()'
     Break the current line.

`send_paragraph(blankline)'
     Produce a paragraph separation of at least BLANKLINE blank lines, or the equivalent. The BLANKLINE value will be an integer. Note that the implementation will receive a call to `send_line_break()' before this call if a line break is needed; this method should not include ending the last line of the paragraph. It is only responsible for vertical spacing between paragraphs.

`send_hor_rule(*args, **kw)'
     Display a horizontal rule on the output device. The arguments to this method are entirely application- and writer-specific, and should be interpreted with care. The method implementation may assume that a line break has already been issued via `send_line_break()'.

`send_flowing_data(data)'
     Output character data which may be word-wrapped and re-flowed as needed. Within any sequence of calls to this method, the writer may assume that spans of multiple whitespace characters have been collapsed to single space characters.

`send_literal_data(data)'
     Output character data which has already been formatted for display. Generally, this should be interpreted to mean that line breaks indicated by newline characters should be preserved and no new line breaks should be introduced. The data may contain embedded newline and tab characters, unlike data provided to the `send_flowing_data()' interface.

`send_label_data(data)'
     Set DATA to the left of the current left margin, if possible.
     The value of DATA is not restricted; treatment of non-string
     values is entirely application- and writer-dependent. This method
     will only be called at the beginning of a line.


File: python-lib.info, Node: Writer Implementations, Prev: Writer Interface, Up: formatter

Writer Implementations
----------------------

Three implementations of the writer object interface are provided as
examples by this module. Most applications will need to derive new
writer classes from the `NullWriter' class.

`NullWriter()'
     A writer which only provides the interface definition; no actions
     are taken on any methods. This should be the base class for all
     writers which do not need to inherit any implementation methods.

`AbstractWriter()'
     A writer which can be used in debugging formatters, but not much
     else. Each method simply announces itself by printing its name
     and arguments on standard output.

`DumbWriter([file[, maxcol = 72]])'
     Simple writer class which writes output on the file object passed
     in as FILE or, if FILE is omitted, on standard output. The output
     is simply word-wrapped to the number of columns specified by
     MAXCOL. This class is suitable for reflowing a sequence of
     paragraphs.


File: python-lib.info, Node: rfc822, Next: mimetools, Prev: formatter, Up: Internet Data Handling

Parse RFC 822 mail headers
==========================

Parse RFC 822 style mail headers.

This module defines a class, `Message', which represents a collection
of "email headers" as defined by the Internet standard RFC 822. It is
used in various contexts, usually to read such headers from a file.
This module also defines a helper class `AddressList' for parsing RFC
822 addresses.

Note that there's a separate module to read UNIX, MH, and MMDF style
mailbox files: `mailbox'.

`Message(file[, seekable])'
     A `Message' instance is instantiated with an input object as
     parameter. `Message' relies only on the input object having a
     `readline()' method; in particular, ordinary file objects
     qualify.
     Instantiation reads headers from the input object up to a
     delimiter line (normally a blank line) and stores them in the
     instance.

     This class can work with any input object that supports a
     `readline()' method. If the input object has seek and tell
     capability, the `rewindbody()' method will work; also, illegal
     lines will be pushed back onto the input stream. If the input
     object lacks seek but has an `unread()' method that can push back
     a line of input, `Message' will use that to push back illegal
     lines. Thus this class can be used to parse messages coming from
     a buffered stream.

     The optional SEEKABLE argument is provided as a workaround for
     certain stdio libraries in which `tell()' discards buffered data
     before discovering that the `lseek()' system call doesn't work.
     For maximum portability, you should set the SEEKABLE argument to
     zero to prevent that initial `tell()' when passing in an
     unseekable object such as a file object created from a socket
     object.

     Input lines as read from the file may either be terminated by
     CR-LF or by a single linefeed; a terminating CR-LF is replaced by
     a single linefeed before the line is stored.

     All header matching is done independent of upper or lower case;
     e.g. `M['From']', `M['from']' and `M['FROM']' all yield the same
     result.

`AddressList(field)'
     You may instantiate the `AddressList' helper class using a single
     string parameter, a comma-separated list of RFC 822 addresses to
     be parsed. (The parameter `None' yields an empty list.)

`parsedate(date)'
     Attempts to parse a date according to the rules in RFC 822.
     However, some mailers don't follow that format as specified, so
     `parsedate()' tries to guess correctly in such cases. DATE is a
     string containing an RFC 822 date, such as `'Mon, 20 Nov 1995
     19:12:08 -0500''. If it succeeds in parsing the date,
     `parsedate()' returns a 9-tuple that can be passed directly to
     `time.mktime()'; otherwise `None' will be returned.
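As an illustration of `parsedate()' and of its two companion functions
`parsedate_tz()' and `mktime_tz()' documented in this section, here is
a minimal sketch. It assumes a current Python 3 interpreter, where the
`rfc822' module is no longer shipped, and therefore imports the
functions of the same names from `email.utils'. That substitution and
the variable names are assumptions of this example, not part of the
module described here.

```python
# Sketch only: `email.utils' is used as a stand-in (an assumption) for
# the rfc822 functions of the same names.
import time
from email.utils import parsedate, parsedate_tz, mktime_tz

date = 'Mon, 20 Nov 1995 19:12:08 -0500'

# parsedate() gives a 9-tuple in time.mktime() layout; the zone is dropped.
t9 = parsedate(date)
year, month, day, hour, minute, second = t9[:6]

# parsedate_tz() appends the offset from UTC in seconds as a tenth item.
t10 = parsedate_tz(date)
offset = t10[9]

# mktime_tz() folds the offset in and yields a UTC timestamp.
stamp = mktime_tz(t10)
utc = time.gmtime(stamp)[:6]

# An unparsable string yields None rather than raising an exception.
bad = parsedate('not a date')
```

Here `offset' is negative because the zone `-0500' lies west of UTC,
and `utc' is five hours later than the wall-clock fields in `t9'.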
`parsedate_tz(date)'
     Performs the same function as `parsedate()', but returns either
     `None' or a 10-tuple; the first 9 elements make up a tuple that
     can be passed directly to `time.mktime()', and the tenth is the
     offset of the date's timezone from UTC (which is the official
     term for Greenwich Mean Time). (Note that the sign of the
     timezone offset is the opposite of the sign of the
     `time.timezone' variable for the same timezone; the latter
     variable follows the POSIX standard while this module follows RFC
     822.) If the input string has no timezone, the last element of
     the tuple returned is `None'.

`mktime_tz(tuple)'
     Turn a 10-tuple as returned by `parsedate_tz()' into a UTC
     timestamp. If the timezone item in the tuple is `None', assume
     local time. Minor deficiency: this first interprets the first 8
     elements as a local time and then compensates for the timezone
     difference; this may yield a slight error around daylight savings
     time switch dates. Not enough to worry about for common use.

* Menu:

* Message Objects::
* AddressList Objects::


File: python-lib.info, Node: Message Objects, Next: AddressList Objects, Prev: rfc822, Up: rfc822

Message Objects
---------------

A `Message' instance has the following methods:

`rewindbody()'
     Seek to the start of the message body. This only works if the
     file object is seekable.

`isheader(line)'
     Returns a line's canonicalized fieldname (the dictionary key that
     will be used to index it) if the line is a legal RFC 822 header;
     otherwise returns `None' (implying that parsing should stop here
     and the line be pushed back on the input stream). It is sometimes
     useful to override this method in a subclass.

`islast(line)'
     Return true if the given line is a delimiter on which `Message'
     should stop. The delimiter line is consumed, and the file
     object's read location positioned immediately after it. By
     default this method just checks that the line is blank, but you
     can override it in a subclass.
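How the `isheader()' and `islast()' hooks fit into header reading can
be sketched as follows. `HeaderReader' and `DashDelimited' are
hypothetical classes written for this example only; they are not the
`rfc822' implementation. The sketch ignores continuation lines and,
unlike `Message', does not push a rejected line back onto the input.

```python
# Simplified, hypothetical header-reading loop with overridable hooks.
import io


class HeaderReader:
    def __init__(self, fp):
        self.headers = {}
        while True:
            line = fp.readline()
            if not line or self.islast(line):
                break               # end of input, or delimiter consumed
            name = self.isheader(line)
            if name is None:
                break               # not a header: stop parsing here
            self.headers[name] = line.split(':', 1)[1].strip()

    def isheader(self, line):
        # Return the lower-cased field name, or None for a non-header.
        name, colon, _ = line.partition(':')
        if colon and name and not name[0].isspace():
            return name.lower()
        return None

    def islast(self, line):
        # Default delimiter: a blank line.
        return line.strip() == ''


class DashDelimited(HeaderReader):
    def islast(self, line):
        # Overridden hook: a line of four dashes also ends the headers.
        return line.strip() in ('', '----')


text = 'From: jack@cwi.nl\nSubject: test\n----\nbody text\n'
fp = io.StringIO(text)
msg = DashDelimited(fp)
rest = fp.read()    # reading resumes right after the delimiter line
```

Overriding only `islast()' changes where the header block ends without
touching the loop itself, which is the reason such hooks are exposed.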
`iscomment(line)'
     Return true if the given line should be ignored entirely, just
     skipped. By default this is a stub that always returns false, but
     you can override it in a subclass.

`getallmatchingheaders(name)'
     Return a list of lines consisting of all headers matching NAME,
     if any. Each physical line, whether it is a continuation line or
     not, is a separate list item. Return the empty list if no header
     matches NAME.

`getfirstmatchingheader(name)'
     Return a list of lines comprising the first header matching NAME,
     and its continuation line(s), if any. Return `None' if there is
     no header matching NAME.

`getrawheader(name)'
     Return a single string consisting of the text after the colon in
     the first header matching NAME. This includes leading whitespace,
     the trailing linefeed, and internal linefeeds and whitespace if
     any continuation line(s) were present. Return `None' if there is
     no header matching NAME.

`getheader(name[, default])'
     Like `getrawheader(NAME)', but strip leading and trailing
     whitespace. Internal whitespace is not stripped. The optional
     DEFAULT argument can be used to specify a different default to be
     returned when there is no header matching NAME.

`get(name[, default])'
     An alias for `getheader()', to make the interface more compatible
     with regular dictionaries.

`getaddr(name)'
     Return a pair `(FULL NAME, EMAIL ADDRESS)' parsed from the string
     returned by `getheader(NAME)'. If no header matching NAME exists,
     return `(None, None)'; otherwise both the full name and the
     address are (possibly empty) strings.

     Example: If M's first `From' header contains the string
     `'jack@cwi.nl (Jack Jansen)'', then `m.getaddr('From')' will
     yield the pair `('Jack Jansen', 'jack@cwi.nl')'. If the header
     contained `'Jack Jansen <jack@cwi.nl>'' instead, it would yield
     the exact same result.

`getaddrlist(name)'
     This is similar to `getaddr(LIST)', but parses a header
     containing a list of email addresses (e.g.
     a `To' header) and returns a list of `(FULL NAME, EMAIL ADDRESS)'
     pairs (even if there was only one address in the header). If
     there is no header matching NAME, return an empty list.

     If multiple headers exist that match the named header (e.g. if
     there are several `Cc' headers), all are parsed for addresses.
     Any continuation lines the named headers contain are also parsed.

`getdate(name)'
     Retrieve a header using `getheader()' and parse it into a 9-tuple
     compatible with `time.mktime()'. If there is no header matching
     NAME, or it is unparsable, return `None'.

     Date parsing appears to be a black art, and not all mailers
     adhere to the standard. While it has been tested and found
     correct on a large collection of email from many sources, it is
     still possible that this function may occasionally yield an
     incorrect result.

`getdate_tz(name)'
     Retrieve a header using `getheader()' and parse it into a
     10-tuple; the first 9 elements will make a tuple compatible with
     `time.mktime()', and the 10th is a number giving the offset of
     the date's timezone from UTC. Similarly to `getdate()', if there
     is no header matching NAME, or it is unparsable, return `None'.

`Message' instances also support a read-only mapping interface. In
particular: `M[name]' is like `M.getheader(name)' but raises
`KeyError' if there is no matching header; and `len(M)',
`M.has_key(name)', `M.keys()', `M.values()' and `M.items()' act as
expected (and consistently).

Finally, `Message' instances have two public instance variables:

`headers'
     A list containing the entire set of header lines, in the order in
     which they were read (except that setitem calls may disturb this
     order). Each line contains a trailing newline. The blank line
     terminating the headers is not contained in the list.

`fp'
     The file or file-like object passed at instantiation time. This
     can be used to read the message content.


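The case-insensitive mapping access and the address parsing described
in this node can be tried out with the sketch below. It assumes a
current Python 3 interpreter and uses the standard `email' package as
a stand-in, since the `rfc822' module is not shipped there. The
`email' classes are not the `Message' class documented here, and the
`example.org' addresses are made up for the illustration.

```python
# Sketch only: the `email' package stands in (an assumption) for the
# rfc822 Message class; method names differ from those listed above.
import email
from email.utils import parseaddr, getaddresses

text = (
    'From: jack@cwi.nl (Jack Jansen)\n'
    'To: Ann <ann@example.org>, bob@example.org\n'
    'Subject: hello\n'
    '\n'
    'body\n'
)
m = email.message_from_string(text)

# Header lookup ignores case, as described for M[name].
same = m['From'] == m['from'] == m['FROM']

# get() returns the given default when no header matches.
missing = m.get('X-Missing', 'n/a')

# Counterpart of getaddr(): one (full name, email address) pair.
name, addr = parseaddr(m['From'])

# Counterpart of getaddrlist(): a list of pairs, one per address.
pairs = getaddresses(m.get_all('To', []))
```

One difference worth keeping in mind: in this stand-in, `m[name]'
returns `None' for a missing header instead of raising `KeyError'.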
File: python-lib.info, Node: AddressList Objects, Prev: Message Objects, Up: rfc822

AddressList Objects
-------------------

An `AddressList' instance has the following methods:

`__len__()'
     Return the number of addresses in the address list.

`__str__()'
     Return a canonicalized string representation of the address list.
     Addresses are rendered in "name" <host@domain> form,
     comma-separated.

`__add__(alist)'
     Return an `AddressList' instance that contains all addresses in
     both `AddressList' operands, with duplicates removed (set union).

`__sub__(alist)'
     Return an `AddressList' instance that contains every address in
     the left-hand `AddressList' operand that is not present in the
     right-hand address operand (set difference).

Finally, `AddressList' instances have one public instance variable:

`addresslist'
     A list of string pairs (tuples), one per address. In each member,
     the first item is the canonicalized name part of the address and
     the second is the route-address (@-separated host-domain pair).


File: python-lib.info, Node: mimetools, Next: MimeWriter, Prev: rfc822, Up: Internet Data Handling

Tools for parsing MIME messages
===============================

Tools for parsing MIME style message bodies.

This module defines a subclass of the `rfc822.Message' class and a
number of utility functions that are useful for the manipulation of
MIME multipart or encoded messages.

It defines the following items:

`Message(fp[, seekable])'
     Return a new instance of the `Message' class. This is a subclass
     of the `rfc822.Message' class, with some additional methods (see
     below). The SEEKABLE argument has the same meaning as for
     `rfc822.Message'.

`choose_boundary()'
     Return a unique string that has a high likelihood of being usable
     as a part boundary. The string has the form
     `'HOSTIPADDR.UID.PID.TIMESTAMP.RANDOM''.

`decode(input, output, encoding)'
     Read data encoded using the allowed MIME ENCODING from open file
     object INPUT and write the decoded data to open file object
     OUTPUT.
     Valid values for ENCODING include `'base64'',
     `'quoted-printable'' and `'uuencode''.

`encode(input, output, encoding)'
     Read data from open file object INPUT and write it encoded using
     the allowed MIME ENCODING to open file object OUTPUT. Valid
     values for ENCODING are the same as for `decode()'.

`copyliteral(input, output)'
     Read lines until `EOF' from open file INPUT and write them to
     open file OUTPUT.

`copybinary(input, output)'
     Read blocks until `EOF' from open file INPUT and write them to
     open file OUTPUT. The block size is currently fixed at 8192.

* Menu:

* mimetools.Message Methods::
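
The behaviour described for `copybinary()' and `decode()' can be
sketched with stand-alone functions. The function bodies below are
assumptions written for this example on a current Python 3
interpreter; they are not the `mimetools' source, and only the
`'base64'' encoding is handled.

```python
# Sketch only: hypothetical re-implementations for illustration.
import base64
import io

BLOCKSIZE = 8192    # block size quoted in the text above


def copybinary(input, output):
    # Copy fixed-size blocks until read() signals EOF with empty data.
    while True:
        block = input.read(BLOCKSIZE)
        if not block:
            break
        output.write(block)


def decode(input, output, encoding):
    # Dispatch on the encoding name; only 'base64' is handled here.
    if encoding == 'base64':
        base64.decode(input, output)
    else:
        raise ValueError('unsupported encoding: %s' % encoding)


# Copy more than two blocks' worth of data between in-memory files.
src = io.BytesIO(b'x' * 20000)
dst = io.BytesIO()
copybinary(src, dst)

# Decode a base64 stream from one file object into another.
enc = io.BytesIO(base64.encodebytes(b'hello world'))
dec = io.BytesIO()
decode(enc, dec, 'base64')
```

Both helpers work on any objects with `read()' and `write()' methods,
which is why in-memory `io.BytesIO' objects suffice for a quick check.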