Pure Python Doc Compression Code
--------------------------------

The code in doc_compress.py is not intended to be used unless
absolutely necessary; the C code is much faster and better. However, I
wanted to write pure-Python Doc compression code so that I could
experiment with it more easily, so I figured I might as well put it in
Pyrite. The App.Doc module will try to import the C code first, and if
it can't be loaded for some reason it will use the pure version.

The main difference between the Python and C compressors is speed: the
Python compressor is at least an order of magnitude slower than the
compiled one. With a reasonably fast CPU, however, it still might not
be annoyingly slow. For example, my system is a Cyrix M2 at 207 MHz
(83 MHz bus) with 1 MB cache, and it compresses 2-3 blocks per second
using the Python code.

I should note, however, that at present there is one difference in
output between the Python and C code. In the Doc compression scheme,
characters with the high bit set -- accented characters, non-ASCII
symbols, and the like -- must be escaped when they are stored in the
compressed output. This escaping takes the form of a count byte
0x01-0x08 followed by that many (1-8) bytes of data. As the compressor
outputs bytes, it escapes every high-bit-set character individually,
even if there are several of them in a row. The C compressor then
makes a second pass over the data, collapsing sequences of escapes.
For example, the main compression loop might output:

    0x01 0x9f 0x01 0x80 0x01 0x8d 0x01 0xea

and the second pass would collapse this to:

    0x04 0x9f 0x80 0x8d 0xea

The Python compressor doesn't do this. Collapsing or not collapsing
sequences of escapes doesn't affect decompression at all; however, it
may make a compressed record slightly bigger *if there are runs of
more than one escaped character in a row*. Whether this makes much
practical difference remains to be seen.

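To make the second pass concrete, here is a minimal sketch of an escape
collapser in Python. It is not the code from the C compressor or from
doc_compress.py; the name collapse_escapes is hypothetical. It also
assumes something the text above does not state: that in the compressed
stream a byte in the range 0x80-0xbf starts a two-byte back-reference,
whose second byte must be copied untouched rather than read as an
escape count (this is how the Doc format is usually described).

```python
def collapse_escapes(data: bytes) -> bytes:
    """Merge adjacent escapes (count byte 0x01-0x08 + data) into longer ones.

    Hypothetical helper for illustration only; not part of doc_compress.py.
    """
    out = bytearray()
    run = bytearray()  # escaped data bytes waiting to be written

    def flush():
        # Write pending escaped bytes as escapes of at most 8 bytes each.
        for start in range(0, len(run), 8):
            chunk = run[start:start + 8]
            out.append(len(chunk))
            out.extend(chunk)
        run.clear()

    i = 0
    while i < len(data):
        b = data[i]
        if 0x01 <= b <= 0x08:
            # Escape: the next b bytes are literal data; queue them.
            run.extend(data[i + 1:i + 1 + b])
            i += 1 + b
        elif 0x80 <= b <= 0xBF:
            # ASSUMPTION: two-byte back-reference; copy both bytes verbatim
            # so the second byte is never mistaken for an escape count.
            flush()
            out.extend(data[i:i + 2])
            i += 2
        else:
            # Any other byte is passed through unchanged.
            flush()
            out.append(b)
            i += 1
    flush()
    return bytes(out)
```

Applied to the example above, collapse_escapes(bytes([0x01, 0x9f, 0x01,
0x80, 0x01, 0x8d, 0x01, 0xea])) yields the five bytes 0x04 0x9f 0x80
0x8d 0xea. Escapes separated by any other byte are left as they are,
and a run longer than eight escaped bytes is split into several escapes.
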
In ordinary English text, it doesn't make much difference at all,
because high-bit characters are rare and unlikely to come in large
clumps. In non-English text, however, high-bit characters are more
likely to occur next to each other, especially if the text is heavy in
accents. Such text will produce slightly larger compressed output
using the Python code: a run of n escaped characters (n up to 8) takes
2n bytes uncollapsed, but only n + 1 bytes once collapsed.

-- Rob Tillotson