1. Header Block
2. Big Block Depot
3. Big Data Blocks
3.1 Small Block Depot
3.2 Property Set Storage
3.3 Property Storage
3.4 Where Are The Small Data Blocks?
3.5 Property Sets
4. Trash Blocks
Table 1: LAOLA Header
Table 2: Property Storage
When I first looked at the document binaries I soon got confused. At first glance the file format of the documents seemed to differ very much from that of files stored with Word 2. On closer inspection it became clear that some very familiar binary pieces were stored within that new-looking data. In fact a Word 6 document is more or less a Word 2 format document stored among additional data. This additional data makes up a kind of file system.
As far as I know there is no publicly available source of information on how this file system (document format) works. In general I demand that the producing industry has the duty to show what ingredients are in its products. In the case of authoring systems I think that people ought to know what kind of information is in their (possibly publicly distributed) documents: for example, whether documents contain information about the creation date, printer(s), directory structure or serial numbers, or, even worse, whether they contain or might contain other private data.
To summarize this topic for the file format described below: it always stores the last modification date of objects. Because of a bad implementation for Windows 3.x and older distributions of 32 bit Windows systems, it also always contains some "data trash" sections. These sections might contain personal data. Of course, depending on the authoring program, other private data might be stored invisibly, too.
This text does *not* explain how a Microsoft Word file is structured. This text does explain how the file system works that newer Windows programs like Microsoft Word use to store their documents. So actually it should be called the OLE file system, as the philosophy behind this file system is Microsoft's OLE / COM technology. But for lack of any binary level technical specification on this topic, my explanation might differ in some cases or even be wrong. In those cases I would certainly not be explaining the OLE file system, but something similar. So I decided to pick a similar name, too. The name is LAOLA.
Copying. This file and the source code referenced here are distributed under the terms of Version 2 of the GNU General Public License of June 1991. If you have no copy you should find one here.
This publication is made without regard to any possible patent protection. Trade names are used without any guarantee that they are free to use.
Actually this work could have been done well by adopting one of the popular archive formats, like the whole "zip" family. But Microsoft went their own way. I think this is not only because of their market philosophy, but also because the intention to develop a file system seems to have been directed by their "OLE philosophy", which in a way demands a hierarchical file structure.
Unfortunately Microsoft did not include mechanisms to assure the well-being of a document. So if, e.g., a Laola document gets corrupted, this normally stays undetected. If somebody tampers with the document, normally nobody can notice. If a document contains much unused space, it normally stays that way. Microsoft's strategy involved no compensation for the disadvantages of a new file system.
What is a Laola file? In short, a Laola file is an archive. The archive can maintain files and directories. Each archive entry has an info block that is 0x80 bytes long. To store the files the archive maintains a list of big data blocks and a list of small data blocks. Files smaller than 0x1000 bytes are stored in the small data blocks, all other files in big data blocks.
Data types. Laola files have three basic data types:
    1. 4 byte integers ("long")   0x12345678           ->  0x78 0x56 0x34 0x12
    2. 2 byte integers ("word")   0x1234 0x5678        ->  0x34 0x12 0x78 0x56
    3. 1 byte integers ("char")   0x12 0x34 0x56 0x78  ->  0x12 0x34 0x56 0x78

Integers are stored in "little endian" ("VAX", "x86") mode. That means they are stored eightbitwise (byte by byte), the least significant byte first. Char streams are stored first in, first out.
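The three data types can be tried out with a few lines of Python. This is only an illustrative sketch (it is not taken from laola.pl); it relies on the standard `struct` module, where the format prefix `<` selects little endian byte order.

```python
import struct

# The same four bytes, read as each of the three basic types.
raw = bytes([0x78, 0x56, 0x34, 0x12])

(as_long,) = struct.unpack("<L", raw)   # one 4 byte integer ("long")
as_words = struct.unpack("<2H", raw)    # two 2 byte integers ("word")
as_chars = struct.unpack("<4B", raw)    # four 1 byte integers ("char")

print(hex(as_long))                 # 0x12345678
print([hex(w) for w in as_words])   # ['0x5678', '0x1234']
print([hex(c) for c in as_chars])   # ['0x78', '0x56', '0x34', '0x12']
```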
Blocks. In a first step each Laola file is divided into 0x200 (512) byte "big blocks", so each Laola file's size is a multiple of 0x200 bytes. Each block corresponds to a number, starting with -1 for the first 0x200 bytes, then counting upwards. So the file is made out of the set of blocks:

    file <=> union of big blocks {-1, 0, 1 .. $maxblock},
    $maxblock = (sizeof(file)-1) / 0x200 - 1,
    $maxblock element of {1, 2, ..}

Basic parts. The big blocks divide the file into the four basic parts:

    1. Header block
    2. Big block depot
    3. Big data blocks
    4. Trash blocks
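The block numbering described above boils down to a little arithmetic. The following Python sketch shows it; the function names are my own choice for illustration and do not come from the Laola sources.

```python
BIG_BLOCK_SIZE = 0x200

def big_block_offset(block_number):
    """File offset of a big block; block -1 (the header) starts at offset 0."""
    return (block_number + 1) * BIG_BLOCK_SIZE

def max_block(file_size):
    """Highest big block number, following the $maxblock formula above."""
    return (file_size - 1) // BIG_BLOCK_SIZE - 1

print(hex(big_block_offset(0)))   # 0x200
print(max_block(0x1800))          # 10
```

So a file of 0x1800 bytes holds twelve big blocks, numbered -1 to 10.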
Dead non-trash information. The Laola blocks seem to contain several constant data entries. Some of them seem (still?) to have absolutely no function; the functions of others are still unclear. Anyway, for the creation of Laola files it would be enough just to copy these values. In the tables at the end of this document the constant values are marked with a dot ".", the known and changing values with an exclamation mark "!".
Example:
    00000: d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00
    00010: 00 00 00 00 00 00 00 00 3b 00 03 00 fe ff 09 00
    00020: 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
    00030: 01 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00
    00040: 01 00 00 00 fe ff ff ff 00 00 00 00 00 00 00 00
    00050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *

    00:   stream $laola_id           identifier {d0 cf 11 e0 a1 b1 1a e1}
    2c:   long   $num_of_bbd_blocks  Number of big block depot blocks
    30:   long   $root_startblock    Root chain's first big block
    3c:   long   $sbd_startblock     small block depot's first big block
    4c[]: long   $bbd_list[i]        array of $num_of_bbd_blocks big block numbers

    (for detailed info look at: Table 1)

    big block depot <=> union of big blocks {bbd_list[i]},
    bbd_list consists of $num_of_bbd_blocks elements,
    stored from position <header:4c> on.

Example:
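Reading these header fields could look like the following Python sketch. It is an illustration only, not the code of laola.pl: the function name and the returned dictionary are my own choice, and I assume that the "long" values are signed, since the text writes values like -2.

```python
import struct

LAOLA_ID = bytes.fromhex("d0cf11e0a1b11ae1")

def parse_header(block):
    """Pull the changing ("!") fields out of the 0x200 byte header block."""
    if block[:8] != LAOLA_ID:
        raise ValueError("missing Laola identifier")
    num_of_bbd_blocks, root_startblock = struct.unpack_from("<ll", block, 0x2C)
    (sbd_startblock,) = struct.unpack_from("<l", block, 0x3C)
    bbd_list = list(struct.unpack_from("<%dl" % num_of_bbd_blocks, block, 0x4C))
    return {
        "num_of_bbd_blocks": num_of_bbd_blocks,
        "root_startblock": root_startblock,
        "sbd_startblock": sbd_startblock,
        "bbd_list": bbd_list,
    }

# The header from the example above; the rest of the block is filled with ff.
example_header = bytes.fromhex(
    "d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 3b 00 03 00 fe ff 09 00 "
    "06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 "
    "01 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 "
    "01 00 00 00 fe ff ff ff 00 00 00 00 00 00 00 00"
).ljust(0x200, b"\xff")

print(parse_header(example_header))
# {'num_of_bbd_blocks': 1, 'root_startblock': 1, 'sbd_startblock': 2, 'bbd_list': [0]}
```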
    00200: fd ff ff ff 05 00 00 00 fe ff ff ff 04 00 00 00
    00210: 06 00 00 00 fe ff ff ff 07 00 00 00 08 00 00 00
    00220: 09 00 00 00 0a 00 00 00 0b 00 00 00 fe ff ff ff
    00230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *

The big block depot represents a table of block numbers whose index starts at zero. Entry 0, in this example with the value 0xfffffffd (-3), refers to block 0. Entry 1, with the value 0x00000005, refers to block 1. Entry 2 refers to block 2 ... and so on. Each entry may have one of these values:

    0xfffffffd (-3) : this block is a special block
    0xfffffffe (-2) : end of chain
    0xffffffff (-1) : unused
    0 .. $maxblock  : next element of chain (a big block number)
    $maxblock+1 ..  : not defined

In the header the variable $root_startblock has been initialized; the example gives it the value 1. This value tells which block is the first in the chain of blocks belonging to the "root". In the example it would be read out as follows:
How to read a block chain. Start at position 1, so the first block belonging to root is block 1. The value of the big block depot's entry at position 1 is 0x00000005. So the next block belonging to root is block 5. The value of the big block depot's entry at position 5 is 0xfffffffe (-2). That means: here is the end of the chain. So "root" finally consists of the blocks: {1, 5}
From the header the value of the variable $sbd_startblock is known, too. Try to find its value, then try to get the corresponding block chain! (If you want to see the solution, look at the end of this document)
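The chain walk just described can be sketched in Python as follows. The function names are hypothetical (this is not the code of laola.pl), and the check against loops is my own addition; the range check follows from the table of allowed entry values.

```python
import struct

END_OF_CHAIN = -2

def read_depot(depot_bytes):
    """Interpret raw depot bytes as a list of signed 4 byte integers."""
    return list(struct.unpack("<%dl" % (len(depot_bytes) // 4), depot_bytes))

def get_chain(depot, start):
    """Follow a chain through the depot, starting at block number start."""
    chain = []
    block = start
    while block != END_OF_CHAIN:
        # Only 0 .. $maxblock may continue a chain; a repeat would loop forever.
        if not 0 <= block < len(depot) or block in chain:
            raise ValueError("broken chain at entry %d" % block)
        chain.append(block)
        block = depot[block]
    return chain

# The first twelve entries of the example big block depot.
bbd = read_depot(bytes.fromhex(
    "fdffffff 05000000 feffffff 04000000 "
    "06000000 feffffff 07000000 08000000 "
    "09000000 0a000000 0b000000 feffffff"
))

print(get_chain(bbd, 1))   # [1, 5]
```

A start value of -2 yields an empty chain, which is convenient for properties or depots that are absent.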
Note: when reading in a chain, only the values "0 .. $maxblock" and "-2" are ok. If other values do occur in a chain, some error happened.
Note: The small block depot may be absent. In that case $sbd_startblock is 0xfffffffe (-2).
Summary: with the help of header block and big block depot the values of the big block lists: @root_list and @sbd_list are known.
    small block depot <=> union of big blocks {sbd_list[i]},
    sbd_list consists of (number of chain_elements(sbd_list)) elements.
    The list is read out from the big block depot,
    the list's start is $sbd_startblock (-> Section 2)

Example:

    00600: 01 00 00 00 fe ff ff ff ff ff ff ff ff ff ff ff
    00610: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

The entries of the small block depot do *not* refer to absolute positions in the file, as the entries of the big block depot do. They refer to positions in the "file" made out of the big blocks of the big block list @sb_list. This list is denoted by the root entry of the property storage. So the property storage has to be explained first.
    property set storage blocks <=> union of big blocks {root_list[i]},
    root_list consists of (number of chain_elements(root_list)) elements.
    The list is read out from the big block depot,
    the list's start is $root_startblock (-> Section 2)

Example:

    00400: 52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00   R o o t   E n t
    00410: 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00   r y
    00420: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00440: 16 00 05 00 ff ff ff ff ff ff ff ff 03 00 00 00
    00450: 00 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46
    00460: 00 00 00 00 00 00 00 00 00 00 00 00 86 29 f6 1f
    00470: ad 57 bb 01 03 00 00 00 00 0f 00 00 00 00 00 00

    40: word $pps_sizeofname  size of $pps_rawname
    42: byte $pps_type        type of pps (1=storage|2=stream|5=root)
    44: long $pps_prev        previous pps
    48: long $pps_next        next pps
    4c: long $pps_dir         directory pps
    74: long $pps_sb          starting block of property
    78: long $pps_size        size of property

    (for detailed info look at: Table 2)

The first 0x40 bytes are reserved for the name of the pps. The length of the name is given in $pps_sizeofname. The name can be converted to an ASCII string $pps_name just by removing every second char. In this example the length of the name is 0x16, and in the end $pps_name is "Root Entry\00". The C-style zero should be removed. If $pps_sizeofname is zero, then this 0x80 block is no pps and has to be ignored.
Each pps can have a successor and a predecessor. Each pps can also be a directory (or "storage"). $pps_prev, $pps_next and $pps_dir refer to the ascending number of the 0x80 blocks as mentioned above. So in the example the pps starting at 0x400 gets the handle (number) 0, the pps starting at 0x480 gets the handle 1, the pps starting at 0x500 gets the handle 2 and so on. When read skillfully, an ordered listing of pps results in the end (look at function get_pps_chain in "laola.pl").
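Decoding a single 0x80 byte pps entry might be sketched in Python like this. The function and the dictionary keys are hypothetical names of mine, modelled on the variables used in the text; it is not the code of laola.pl. Dropping every second byte of the name is done by slicing, and decoding the rest as Latin-1 is my own assumption.

```python
import struct

def parse_pps(entry):
    """Decode one 0x80 byte property storage entry; None if it is no pps."""
    (sizeofname,) = struct.unpack_from("<H", entry, 0x40)
    if sizeofname == 0:
        return None
    # Keep every second byte of the raw name, then drop the C-style zero.
    name = entry[:sizeofname][::2].decode("latin-1").rstrip("\x00")
    (pps_type,) = struct.unpack_from("<B", entry, 0x42)
    prev, next_, dir_ = struct.unpack_from("<lll", entry, 0x44)
    sb, size = struct.unpack_from("<ll", entry, 0x74)
    return {"name": name, "type": pps_type, "prev": prev, "next": next_,
            "dir": dir_, "sb": sb, "size": size}

# The root entry from the example above (file offsets 0x400 .. 0x47f).
root_entry = bytes.fromhex(
    "52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00 "
    "72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "16 00 05 00 ff ff ff ff ff ff ff ff 03 00 00 00 "
    "00 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46 "
    "00 00 00 00 00 00 00 00 00 00 00 00 86 29 f6 1f "
    "ad 57 bb 01 03 00 00 00 00 0f 00 00 00 00 00 00"
)

pps = parse_pps(root_entry)
print(pps["name"], pps["type"], pps["dir"], pps["sb"], hex(pps["size"]))
# Root Entry 5 3 3 0xf00
```

When a whole property storage is cut into 0x80 byte pieces in order, the position of a piece in that sequence is its handle.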
Property types. Each pps has one of these three types: 1 = storage (a directory), 2 = stream, 5 = root.
If $pps_size is not zero, $pps_sb points to the starting block of the corresponding property. The starting block refers to the big block depot if $pps_size is greater than or equal to 0x1000 (4096) bytes. If the property's size is smaller, $pps_sb refers to the small block depot. There is one exception: $pps_sb of the Root entry (always pps 0) always refers to the big block depot.
It is now easy to read the "files" in: the big or small block list has to be fetched from the big or small block depot (as was done before with root_list and sbd_list), the blocks referred to have to be read, and as the last step the result might have to be truncated to fit $pps_size.
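Put together, reading a property could look like the following Python sketch. All names are hypothetical and the code is not taken from laola.pl. I assume here that small block n sits at offset n * 0x40 of the "small block file", and that this small block file has already been assembled from its big blocks.

```python
BIG_BLOCK, SMALL_BLOCK, THRESHOLD = 0x200, 0x40, 0x1000

def get_chain(depot, start):
    """Follow a depot chain until the end-of-chain mark (-2)."""
    chain, block = [], start
    while block != -2:
        if not 0 <= block < len(depot) or block in chain:
            raise ValueError("broken chain")
        chain.append(block)
        block = depot[block]
    return chain

def read_blocks(container, chain, block_size, first_offset=0):
    """Concatenate the blocks of a chain, cut out of container."""
    return b"".join(
        container[first_offset + n * block_size:
                  first_offset + (n + 1) * block_size]
        for n in chain
    )

def read_property(file_bytes, bbd, small_file, sbd, pps_sb, pps_size,
                  is_root=False):
    """Return the contents of one property, truncated to pps_size."""
    if pps_size == 0:
        return b""
    if is_root or pps_size >= THRESHOLD:
        # Big block n starts at (n + 1) * 0x200, because of the header block.
        data = read_blocks(file_bytes, get_chain(bbd, pps_sb),
                           BIG_BLOCK, first_offset=BIG_BLOCK)
    else:
        data = read_blocks(small_file, get_chain(sbd, pps_sb), SMALL_BLOCK)
    return data[:pps_size]

# Made-up demo data: three small blocks, and a chain 0 -> 2 -> end.
small_file = b"A" * 0x40 + b"B" * 0x40 + b"C" * 0x40
sbd = [2, -1, -2]
data = read_property(b"", [], small_file, sbd, pps_sb=0, pps_size=0x50)
print(len(data), data[:1], data[-1:])   # 80 b'A' b'C'
```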
If the type of a pps is root or storage, at least the variables $pps_ts2d and $pps_ts2s are initialized. Together these variables build a 64 bit integer that represents time and date. It counts in units of 10^-7 seconds, starting at 01/01/1601 00:00. If the type is root, $pps_sb points to the first big block of the small block list @sb_list. See just below:
    00200: fd ff ff ff 05 00 00 00 fe ff ff ff 04 00 00 00
    00210: 06 00 00 00 fe ff ff ff 07 00 00 00 08 00 00 00
    00220: 09 00 00 00 0a 00 00 00 0b 00 00 00 fe ff ff ff
    00230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *

Starting at block 3 (the root entry's $pps_sb), this results in the chain: {3, 4, 6, 7, 8, 9, a, b}. The chain obtained this way gives the blocks that provide the space for the small blocks. So small block references refer to positions in the "small block file", which in this example is built from the big block chain starting with block 3.

    small data blocks <=> union of big blocks {sb_list[i]},
    sb_list consists of (number of chain_elements(sb_list)) elements.
    The list is read out from the big block depot,
    the list's start is $root_startblock -> property storage 0 -> pps_sb

With this in mind, everything needed to pull a file out of a Laola archive is available. You could test this with "lls -s". But there is still more behind the thing.
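To make the indirection concrete, here is a small Python sketch that computes where a small block ends up in the real file. The function name is hypothetical, and I assume that the small blocks are simply packed one after another, eight to a big block, counting from zero.

```python
BIG_BLOCK, SMALL_BLOCK = 0x200, 0x40
SMALL_PER_BIG = BIG_BLOCK // SMALL_BLOCK   # 8 small blocks per big block

def small_block_offset(sb_list, n):
    """Absolute file offset of small block n, given the big block chain."""
    big_block = sb_list[n // SMALL_PER_BIG]
    # Big block b starts at (b + 1) * 0x200, because of the header block.
    return (big_block + 1) * BIG_BLOCK + (n % SMALL_PER_BIG) * SMALL_BLOCK

# The chain from the example: {3, 4, 6, 7, 8, 9, a, b}
sb_list = [3, 4, 6, 7, 8, 9, 0xA, 0xB]

print(hex(small_block_offset(sb_list, 0)))   # 0x800
print(hex(small_block_offset(sb_list, 9)))   # 0xa40
```

Small block 9 is the second small block of the second big block in the chain (block 4), hence 0xa00 + 0x40.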
Summary and further information are - still to be done ! -
Some blocks consist only partially of trash; they could be called stinky blocks. This is because the size of a property fits exactly to the 0x200 byte border (0x40 for small blocks) only by chance. So the rest of the last block of a chain nearly always contains some bytes of rubbish.
Like all trash, data trash is troublesome. In some cases it is simply annoying because it blows up the file's size. In every case it is relevant with regard to data security. Because you cannot know what is in there, you lack control over your own data. It has even been reported on Usenet that private mail and encrypted passwords happened to stick in a Word file.
The good thing is that, unlike nuclear trash, this data trash is removable. Just look at the demonstration program "lclean" in the source code section of this document. Regarding data trash I have heard that Microsoft knows about this OLE bug and provides a fix for 32 bit Windows systems. However, if you use Windows 3.1 you probably have to rely on lclean.
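How much rubbish the last block of a chain holds follows directly from the property size. A small Python sketch, with a function name of my own choosing; it only covers the partially used last block of a chain, not blocks that are trash as a whole.

```python
def trash_bytes(pps_size, is_root=False):
    """Unused bytes in the last block of a property's chain."""
    # The root entry and properties from 0x1000 bytes on use big blocks.
    block_size = 0x200 if (is_root or pps_size >= 0x1000) else 0x40
    return -pps_size % block_size

print(trash_bytes(100))                      # 28
print(hex(trash_bytes(0x1234)))              # 0x1cc
print(hex(trash_bytes(0xF00, is_root=True))) # 0x100
```

The last line is the root entry of the example: its 0xf00 bytes leave 0x100 bytes of the eighth big block unused.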
Comments appreciated!
Martin
- The End -
Table 1: Block 0 (laola header)

    offset  type    value               const?  function
    00:     stream  $laola_id           !       identifier {d0 cf 11 e0 a1 b1 1a e1}
    08:     long    0                   .       ?
    0c:     long    0                   .       ?
    10:     long    0                   .       ?
    14:     long    0                   .       ?
    18:     word    3b                  .       ? revision ?
    1a:     word    3                   .       ? version ?
    1c:     word    -2                  .       ?
    1e:     byte    9                   .       ?
    1f:     byte    0                   .       ?
    20:     long    6                   .       ?
    24:     long    0                   .       ?
    28:     long    0                   .       ?
    2c:     long    $num_of_bbd_blocks  !       Number of big block depot blocks
    30:     long    $root_startblock    !       Root chain 1st block
    34:     long    0                   .       ?
    38:     long    1000                .       ?
    3c:     long    $sbd_startblock     !       small block depot 1st block
    40:     long    1                   .       ?
    44:     long    -2                  .       ?
    48:     long    0                   .       ?
    4c[]:   long    $bbd_list[i]        !       array of $num_of_bbd_blocks big block numbers

    The rest of block 0 should be:
            long    -1                  .

Table 2: Property Storage

    offset  type    value               const?  function
    00:     stream  $pps_rawname        !       name of the pps
    40:     word    $pps_sizeofname     !       size of $pps_rawname
    42:     byte    $pps_type           !       type of pps (1=storage|2=stream|5=root)
    43:     byte    $pps_uk0            !       ?
    44:     long    $pps_prev           !       previous pps
    48:     long    $pps_next           !       next pps
    4c:     long    $pps_dir            !       directory pps
    50:     stream  00 09 02 00         .       ?
    54:     long    0                   .       ?
    58:     long    c0                  .       ?
    5c:     stream  00 00 00 46         .       ?
    60:     long    0                   .       ?
    64:     long    $pps_ts1s           !       timestamp 1 : "seconds"
    68:     long    $pps_ts1d           !       timestamp 1 : "days"
    6c:     long    $pps_ts2s           !       timestamp 2 : "seconds"
    70:     long    $pps_ts2d           !       timestamp 2 : "days"
    74:     long    $pps_sb             !       starting block of property
    78:     long    $pps_size           !       size of property
    7c:     long                        .       ?
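As an illustration of the time stamp fields in Table 2, the two halves can be combined into a date with a few lines of Python. This is a sketch under one assumption of mine: that the 64 bit value is itself stored little endian, so the "seconds" long at the lower offset is the low half and the "days" long is the high half. The function name is hypothetical.

```python
from datetime import datetime, timedelta

def pps_time(ts_s, ts_d):
    """Combine both halves into a datetime; the unit is 10^-7 seconds."""
    ticks = (ts_d << 32) | ts_s
    # datetime resolves microseconds only, so the last digit is dropped.
    return datetime(1601, 1, 1) + timedelta(microseconds=ticks // 10)

# Time stamp 2 of the example root entry:
# $pps_ts2s = 86 29 f6 1f, $pps_ts2d = ad 57 bb 01
stamp = pps_time(0x1FF62986, 0x01BB57AD)
print(stamp.year)   # 1996
```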