LAOLA file system

"Structured Storage"

The binary structure of OLE / Compound Documents

Hacking guide

1. Header Block
2. Big Block Depot
3. Big Data Blocks
    3.1 Small Block Depot
    3.2 Property Set Storage
    3.3 Property Storage
    3.4 Where Are The Small Data Blocks?
    3.5 Property Sets
4. Trash Blocks

Table 1: LAOLA Header
Table 2: Property Storage


A - Preface

One day I started writing a program that needed access to documents created with Microsoft Word for Windows 6. I wanted to keep it portable, so it could not rely on methods specific to one operating system. So I decided to learn the binary structure of the documents.

Looking at the document binaries, I soon got confused. At first glance the file format seemed to differ very much from that of files stored with Word 2. On closer inspection it became clear that some very familiar binary pieces were stored within that new looking data. In fact, a Word 6 document is more or less a Word 2 format document stored among additional data. This additional data makes up a kind of file system.

As far as I know there is no publicly available source of information on how this file system (document format) works. Generally I demand that the producing industry has a duty to declare what ingredients are in its products. In the case of authoring systems I think people ought to know what kind of information is in their (maybe publicly distributed) documents: for example, whether documents contain information about the creation date, printer(s), directory structure or serial numbers. Or, even worse, whether documents contain or might contain other private data.

To summarize this topic for the file format described below: it always stores the last modification date of objects. Because of a bad implementation for Windows 3.x and older distributions of 32 bit Windows systems, it still always contains some "data trash" sections. These sections might contain personal data. Of course, depending on the authoring program, other private data might be stored invisibly, too.

This text does *not* explain how a Microsoft Word file is structured. It does explain how the file system works that newer Windows programs like Microsoft Word use to store their documents. So it actually should be called the OLE file system, as the philosophy behind this file system is Microsoft's OLE / COM technology. But for lack of any binary level technical specification on this topic, my explanation might differ in some cases or even be wrong. In those cases I would certainly not be explaining the OLE file system, but something similar. So I decided to pick a similar name as well. The name is LAOLA.

Copying. This file and the source code referenced here are distributed under the terms of Version 2 of the GNU General Public License of June 1991. If you have no copy you should find one here.

This publication is made without regard to any possible patent protection. Trade names are used without any guarantee that they are free to use.

B - General

Digital documents tend to consist of more than just one file. When storing or exchanging such documents it used to be a problem to bind all those files together and to agree on the same hierarchy conventions. The LAOLA file format allows files to be stored in a hierarchical order in a single file. This file may contain one or more directories, each of which may contain one or more files and directories.

Actually this job could have been done well by adopting one of the popular archive formats, like the whole "zip" family. But Microsoft went their own way. I think this was not only because of their market philosophy, but also because the development of a file system seems to have been driven by their "OLE philosophy", which in a way demands a hierarchical file structure.

Unfortunately Microsoft did not include mechanisms to assure the well-being of a document. So if, for example, a Laola document gets corrupted, this normally stays undetected. If somebody tampers with the document, normally nobody can notice. If a document contains a lot of unused space, it will normally stay that way. Microsoft's strategy involved no compensation for the disadvantages of a new file system.

C - Basics

No guarantee! The things described here are assumptions, based mainly on speculation and experiment and only a little on documentation. Although I'm sure things are pretty much OK, there might be errors!

What is a Laola file? In short, a Laola file is an archive. The archive can hold files and directories. Each archive entry has an info block 0x80 bytes long. To store the files the archive maintains a list of big data blocks and a list of small data blocks. Files smaller than 0x1000 bytes are stored in the small data blocks, all other files in big data blocks.

Data types. Laola files have three basic data types:

   1.  4 byte integers ("long") 0x12345678           ->  0x78 0x56 0x34 0x12
   2.  2 byte integers ("word") 0x1234 0x5678        ->  0x34 0x12 0x78 0x56
   3.  1 byte integers ("char") 0x12 0x34 0x56 0x78  ->  0x12 0x34 0x56 0x78

Integers are stored in "little endian" ("VAX", "x86") mode. That means they are stored byte by byte, least significant byte first. Char streams are stored in their natural order (first in, first out).
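The byte orders listed above can be checked with a few lines of Python. This is only a sketch using the standard struct module; the variable names are made up for the illustration and do not come from the Laola sources.

```python
import struct

# The byte sequences from the data type list above.
raw_long = bytes([0x78, 0x56, 0x34, 0x12])
raw_words = bytes([0x34, 0x12, 0x78, 0x56])

# "<" selects little endian, "I" an unsigned 4 byte integer,
# "H" an unsigned 2 byte integer.
(long_value,) = struct.unpack("<I", raw_long)
word_values = struct.unpack("<2H", raw_words)

# A signed read ("<i") is convenient for the special values
# -1, -2 and -3 that appear in the block depots.
(signed_value,) = struct.unpack("<i", bytes([0xfe, 0xff, 0xff, 0xff]))

print(hex(long_value))                  # 0x12345678
print([hex(w) for w in word_values])    # ['0x1234', '0x5678']
print(signed_value)                     # -2
```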

Blocks. As a first step each Laola file is divided into 0x200 (512) byte "big blocks", so each Laola file's size is a multiple of 0x200 bytes. Each block is assigned a number, starting with -1 for the first 0x200 bytes and counting upwards. So the file is made up of the set of blocks:

    file <=> union of big blocks {-1, 0, 1 .. $maxblock},

    ($maxblock = (sizeof(file)-1) / 0x200 - 1, integer division), $maxblock in {1, 2, .. }
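As a small sketch of this numbering (the function names are hypothetical): because the header is block -1, big block n starts at file offset (n + 1) * 0x200.

```python
BIG_BLOCK_SIZE = 0x200

def big_block_offset(block_number):
    """File offset of a big block; block -1 is the header at offset 0."""
    return (block_number + 1) * BIG_BLOCK_SIZE

def max_block(file_size):
    """Highest block number, following the $maxblock formula above."""
    return (file_size - 1) // BIG_BLOCK_SIZE - 1

print(hex(big_block_offset(-1)))   # 0x0
print(hex(big_block_offset(0)))    # 0x200
print(max_block(0x1800))           # 10
```

A file of 0x1800 bytes holds twelve blocks, numbered -1 to 10.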

Basic parts. The big blocks divide the file into the four basic parts:

Header Block
Big Block Depot
Big Data Blocks
Trash Data Blocks

Dead non-trash information. The Laola blocks seem to contain several constant data entries. Some of them seem to have (still?) absolutely no function; the functions of others are not yet clear. Anyway, to create Laola files it would be enough just to copy these values. In the tables at the end of this document the constant values are marked with a dot ".", the known and changing values with an exclamation mark "!".

1. Header Block

The header consists of the first block (block -1, running from file offset 0x00 to 0x1ff). It starts with the eight byte hex string {d0 cf 11 e0 a1 b1 1a e1} and contains some initial structure information. The function of its fields is explained later in this document.

Example:

    00000: d0 cf 11 e0  a1 b1 1a e1  00 00 00 00  00 00 00 00  
    00010: 00 00 00 00  00 00 00 00  3b 00 03 00  fe ff 09 00 
    00020: 06 00 00 00  00 00 00 00  00 00 00 00  01 00 00 00
    00030: 01 00 00 00  00 00 00 00  00 10 00 00  02 00 00 00
    00040: 01 00 00 00  fe ff ff ff  00 00 00 00  00 00 00 00
    00050: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff *


00:     stream $laola_id          identifier {d0 cf 11 e0 a1 b1 1a e1}
2c:     long $num_of_bbd_blocks   Number of big block depot blocks
30:     long $root_startblock     Root chain's first big block
3c:     long $sbd_startblock      small block depot's first big block
4c[]:   long $bbd_list[i]         array of $num_of_bbd_blocks big block numbers

(for detailed info look at: Table 1)
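The known header fields can be read as in the following sketch. It is not taken from "laola.pl"; the function name and the dictionary layout are assumptions made for this illustration, and only the offsets come from the listing above.

```python
import struct

LAOLA_ID = bytes.fromhex("d0cf11e0a1b11ae1")

def parse_header(block):
    """Pull the known fields out of block -1 (the first 0x200 bytes)."""
    if block[:8] != LAOLA_ID:
        raise ValueError("identifier does not match, not a Laola file")
    num_of_bbd_blocks, root_startblock = struct.unpack_from("<ii", block, 0x2c)
    (sbd_startblock,) = struct.unpack_from("<i", block, 0x3c)
    bbd_list = list(struct.unpack_from("<%di" % num_of_bbd_blocks, block, 0x4c))
    return {
        "num_of_bbd_blocks": num_of_bbd_blocks,
        "root_startblock": root_startblock,
        "sbd_startblock": sbd_startblock,
        "bbd_list": bbd_list,
    }

# The example header from above, padded with 0xff up to 0x200 bytes.
example = bytes.fromhex(
    "d0cf11e0a1b11ae1 0000000000000000"
    "0000000000000000 3b000300feff0900"
    "0600000000000000 0000000001000000"
    "0100000000000000 0010000002000000"
    "01000000feffffff 0000000000000000"
) + b"\xff" * (0x200 - 0x50)

header = parse_header(example)
print(header)
```

For the example this gives one big block depot block (block 0), a root chain starting at block 1 and a small block depot starting at block 2.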

2. Big Block Depot

The big block depot manages the big blocks. Big blocks have a length of exactly 0x200 (512) bytes. Often the big block depot consists of exactly one big block.

    big block depot <=> union of big blocks {bbd_list[i]},  

    bbd_list  consists of $num_of_bbd_blocks elements, stored from
              position <header:4c> on.

Example:

   00200: fd ff ff ff  05 00 00 00  fe ff ff ff  04 00 00 00
   00210: 06 00 00 00  fe ff ff ff  07 00 00 00  08 00 00 00
   00220: 09 00 00 00  0a 00 00 00  0b 00 00 00  fe ff ff ff
   00230: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff *

The big block depot represents a table of block numbers whose index starts at zero. Entry 0, in this example with the value 0xfffffffd (-3), refers to block 0. Entry 1, with the value 0x00000005, refers to block 1. Entry 2 refers to block 2, and so on. Each entry may have one of these values:

   0xfffffffd (-3) : this block is a special block
   0xfffffffe (-2) : end of chain
   0xffffffff (-1) : unused
   0 .. $maxblock  : next element of chain (a big block number)
   $maxblock+1 ..  : not defined

In the header the variable $root_startblock has been set; in the example it has the value 1. This value tells which block is the first in the chain of blocks belonging to the "root". In the example the chain is read as follows:

How to read a block chain. Start at position 1: the first block belonging to root is block 1. The value of the big block depot's entry at position 1 is 0x00000005, so the next block belonging to root is block 5. The value of the big block depot's entry at position 5 is 0xfffffffe (-2). That means: this is the end of the chain. So "root" finally consists of the blocks {1, 5}.

The value of the variable $sbd_startblock is also known from the header. Try to find its value, then try to work out the block chain belonging to it! (If you want to see the solution, look at the end of this document.)

Note: when reading a chain, only the values "0 .. $maxblock" and "-2" are OK. If other values occur in a chain, some error has happened.
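The chain reading described above can be sketched like this. The function names are hypothetical. As a simplification, the sketch checks entries against the length of the depot instead of $maxblock, and the guard against a chain that loops back on itself is my own addition, not something the format description asks for.

```python
import struct

END_OF_CHAIN = -2

def parse_depot(raw):
    """Interpret depot bytes as a list of signed little endian longs."""
    return list(struct.unpack("<%di" % (len(raw) // 4), raw))

def read_chain(depot, start):
    """Follow a block chain from `start` until the end of chain marker."""
    chain = []
    seen = set()
    block = start
    while block != END_OF_CHAIN:
        # Anything but a valid block number or -2 means a damaged file.
        if block < 0 or block >= len(depot) or block in seen:
            raise ValueError("broken chain at entry %d" % block)
        seen.add(block)
        chain.append(block)
        block = depot[block]
    return chain

# The example big block depot from above, padded with 0xff (unused).
example = bytes.fromhex(
    "fdffffff 05000000 feffffff 04000000"
    "06000000 feffffff 07000000 08000000"
    "09000000 0a000000 0b000000 feffffff"
) + b"\xff" * (0x200 - 0x30)

depot = parse_depot(example)
root_list = read_chain(depot, 1)
print(root_list)    # [1, 5]
```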

Note: The small block depot may be absent. In that case $sbd_startblock is 0xfffffffe (-2).

Summary: with the help of the header block and the big block depot, the big block lists @root_list and @sbd_list are now known.

3. Big Data Blocks

3.1 Small Block Depot

The small block depot manages the small blocks. Small blocks have a length of exactly 0x40 (64) bytes. Often the small block depot consists of exactly one big block. Some documents do not have a small block depot at all.

    small block depot <=> union of big blocks {sbd_list[i]},  

    sbd_list  consists of (number of chain_elements(sbd_list))
              elements. The list is read out from the big block depot;
              the list's start is $sbd_startblock (-> Section 2)

Example:

    00600: 01 00 00 00  fe ff ff ff  ff ff ff ff  ff ff ff ff
    00610: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff

Unlike the entries of the big block depot, the entries of the small block depot do *not* refer to absolute positions in the file. They refer to positions in a "small block file" that is made up of the big blocks in the big block list @sb_list. This list is given by the root entry of the property storage. So the property storage has to be explained first.

3.2 Property Set Storage

From section 2 the values of the big block list @root_list are known. These blocks contain the property set storage (ppss) blocks.

    property set storage blocks <=> union of big blocks {root_list[i]},  

    root_list  consists of (number of chain_elements(root_list))
               elements. The list is read out from the big block depot;
               the list's start is $root_startblock (-> Section 2)

3.3 Property Storage

The ppss blocks (-> section 3.2) are split into 0x80 byte blocks. Each of these 0x80 byte blocks forms a property storage (pps). Every pps gets a number, starting with zero. A pps either refers to a "file" or represents a "directory".

Example:

   00400: 52 00 6f 00  6f 00 74 00  20 00 45 00  6e 00 74 00   R o o t  E n t
   00410: 72 00 79 00  00 00 00 00  00 00 00 00  00 00 00 00   r y 
   00420: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 
   00430: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 
   00440: 16 00 05 00  ff ff ff ff  ff ff ff ff  03 00 00 00
   00450: 00 09 02 00  00 00 00 00  c0 00 00 00  00 00 00 46
   00460: 00 00 00 00  00 00 00 00  00 00 00 00  86 29 f6 1f
   00470: ad 57 bb 01  03 00 00 00  00 0f 00 00  00 00 00 00

40:  word $pps_sizeofname      size of $pps_rawname
42:  byte $pps_type            type of pps (1=storage|2=stream|5=root)
44:  long $pps_prev            previous pps
48:  long $pps_next            next pps
4c:  long $pps_dir             directory pps
74:  long $pps_sb              starting block of property
78:  long $pps_size            size of property

(for detailed info look at: Table 2)

The first 0x40 bytes are reserved for the name of the pps. The length of the name, in bytes, is given by $pps_sizeofname. The name can be converted to an ASCII string $pps_name just by removing every second byte. In this example the length of the name is 0x16, and $pps_name ends up as "Root Entry\00". The C-style terminating zero should be removed. If $pps_sizeofname is zero, the 0x80 block is not a pps and has to be ignored.
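A sketch of decoding one 0x80 byte pps, including the name conversion just described. The function name and the returned dictionary are assumptions for this illustration; the name is decoded as latin-1 only so that bytes above 0x7f cannot make the conversion fail.

```python
import struct

def parse_pps(entry):
    """Decode one 0x80 byte pps; returns None for an unused slot."""
    (size_of_name,) = struct.unpack_from("<H", entry, 0x40)
    if size_of_name == 0:
        return None
    # Keep every second byte of the raw name, then drop the trailing zero.
    name = entry[0:size_of_name:2].decode("latin-1").rstrip("\x00")
    prev, nxt, directory = struct.unpack_from("<iii", entry, 0x44)
    start_block, size = struct.unpack_from("<iI", entry, 0x74)
    return {
        "name": name,
        "type": entry[0x42],    # 1=storage, 2=stream, 5=root
        "prev": prev,
        "next": nxt,
        "dir": directory,
        "sb": start_block,
        "size": size,
    }

# The example pps at file offset 0x400 from above.
example = bytes.fromhex(
    "52006f006f007400 200045006e007400"
    "7200790000000000 0000000000000000"
    + "00" * 0x20 +
    "16000500ffffffff ffffffff03000000"
    "0009020000000000 c000000000000046"
    "0000000000000000 000000008629f61f"
    "ad57bb0103000000 000f000000000000"
)

pps = parse_pps(example)
print(pps)
```

For the example this yields the name "Root Entry", type 5, no previous or next pps (-1), directory pps 3, starting block 3 and size 0xf00.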

Each pps can have a successor and a predecessor. Each pps can also be a directory (or "storage"). $pps_prev, $pps_next and $pps_dir refer to the running number of the 0x80 blocks mentioned above. So in the example the pps starting at 0x400 gets the handle (number) 0, the pps starting at 0x480 gets the handle 1, the pps starting at 0x500 gets the handle 2, and so on. When read skillfully, an ordered listing of the pps results in the end (look at the function get_pps_chain in "laola.pl").
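One plausible way to obtain such an ordered listing is sketched below. It is an assumption of mine, not a transcription of get_pps_chain: it treats -1 as "no pps", visits the predecessor first, then the pps itself, then the contents of its directory, then the successor. The table of handles is made up for the illustration and does not belong to the example file. The sketch does no checking for links that loop.

```python
NO_PPS = -1

def walk(pps_table, handle, depth=0, out=None):
    """Collect (depth, name) pairs in order, starting from `handle`."""
    if out is None:
        out = []
    if handle == NO_PPS:
        return out
    pps = pps_table[handle]
    walk(pps_table, pps["prev"], depth, out)      # predecessor first
    out.append((depth, pps["name"]))              # then the pps itself
    walk(pps_table, pps["dir"], depth + 1, out)   # then its directory
    walk(pps_table, pps["next"], depth, out)      # then the successor
    return out

# Hypothetical pps table: a root directory holding three streams.
table = [
    {"name": "Root Entry", "prev": -1, "next": -1, "dir": 2},
    {"name": "A", "prev": -1, "next": -1, "dir": -1},
    {"name": "B", "prev": 1, "next": 3, "dir": -1},
    {"name": "C", "prev": -1, "next": -1, "dir": -1},
]

listing = walk(table, 0)
for depth, name in listing:
    print("  " * depth + name)
```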

Property types. Each pps has one of these three types:

Storage, that is a directory
Stream, that is a file
Root, that is the root directory

If $pps_size is not zero, $pps_sb points to the starting block of the property's data. The starting block refers to the big block depot if $pps_size is greater than or equal to 0x1000 (4096) bytes. If the property is smaller, $pps_sb refers to the small block depot. There is one exception: $pps_sb of the Root entry (always pps 0) always refers to the big block depot.

Now it is easy to read the "files": the big or small block list has to be fetched from the big or small block depot (as was done before with root_list and sbd_list), the blocks referred to have to be read, and as a last step the data might have to be truncated to fit $pps_size.
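The reading steps can be sketched as follows, assuming the block chain has already been fetched from the right depot. The function names, the `base` parameter and the little test file are made up for the illustration. For big blocks the container is the file itself and `base` skips the header; for small blocks the container would be the small block file and `base` stays 0.

```python
BIG_BLOCK_SIZE = 0x200
SMALL_BLOCK_SIZE = 0x40
SMALL_LIMIT = 0x1000

def uses_small_blocks(size, is_root=False):
    """True if $pps_sb refers to the small block depot."""
    return size < SMALL_LIMIT and not is_root

def read_property(container, block_size, chain, size, base=0):
    """Concatenate the blocks of `chain` and truncate to `size` bytes."""
    parts = [
        container[base + n * block_size: base + (n + 1) * block_size]
        for n in chain
    ]
    return b"".join(parts)[:size]

# Hypothetical file: a header block followed by three big blocks
# filled with "A", "B" and "C".
fake_file = bytes(0x200) + b"A" * 0x200 + b"B" * 0x200 + b"C" * 0x200

data = read_property(fake_file, BIG_BLOCK_SIZE, [2, 0], 0x210,
                     base=BIG_BLOCK_SIZE)
print(len(data), data[:4], data[-4:])
```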

If the type of a pps is root or storage, at least the variables $pps_ts2d and $pps_ts2s are initialized. Together these variables form a 64 bit integer that represents time and date. It counts units of 10^-7 seconds, starting at 01/01/1601 00:00. If the type is root, $pps_sb points to the first big block of the small block list @sb_list; see section 3.4.
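As an aside on the timestamp mentioned above, a conversion could look like this sketch. It assumes that $pps_ts2d holds the upper and $pps_ts2s the lower 32 bits, which follows from their offsets and the little endian byte order but is not stated anywhere official. The function name is made up, and treating an all-zero value as "no timestamp" is my own choice.

```python
from datetime import datetime, timedelta

EPOCH_1601 = datetime(1601, 1, 1)

def pps_timestamp(ts_d, ts_s):
    """Combine the two unsigned longs and convert to a datetime."""
    ticks = (ts_d << 32) | ts_s        # units of 10^-7 seconds
    if ticks == 0:
        return None                    # timestamp not set
    # datetime resolves microseconds, so the last decimal digit is dropped.
    return EPOCH_1601 + timedelta(microseconds=ticks // 10)

# Values from the example pps: 0x6c holds 86 29 f6 1f, 0x70 holds ad 57 bb 01.
stamp = pps_timestamp(0x01bb57ad, 0x1ff62986)
print(stamp)
```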

3.4 Where Are The Small Data Blocks?

The root entry is an exception to the 0x1000 bytes rule of section 3.3. Its size in the example is 0xf00, so it actually should belong to the small block depot. In fact the data of the root entry always refers to the big block depot. The start block here is 3. Examining the big block depot (copied from above):

    00200: fd ff ff ff  05 00 00 00  fe ff ff ff  04 00 00 00
    00210: 06 00 00 00  fe ff ff ff  07 00 00 00  08 00 00 00
    00220: 09 00 00 00  0a 00 00 00  0b 00 00 00  fe ff ff ff
    00230: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff *

results in the chain {3, 4, 6, 7, 8, 9, a, b}. The chain obtained this way gives the blocks that provide the space for the small blocks. So small block references refer to positions in the "small block file", which in this example is built from the big block chain starting with block 3.

    small data blocks <=> union of big blocks {sb_list[i]},  

    sb_list  consists of (number of chain_elements(sb_list))
             elements. The list is read out from the big block depot;
             the list's start is
             $root_startblock -> property storage 0 -> pps_sb
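To make the indirection concrete, here is a sketch that maps a small block number to an absolute file offset. The function name is hypothetical. It relies only on the sizes given in this document: eight small blocks of 0x40 bytes fit into one big block of 0x200 bytes.

```python
BIG_BLOCK_SIZE = 0x200
SMALL_BLOCK_SIZE = 0x40
SMALL_PER_BIG = BIG_BLOCK_SIZE // SMALL_BLOCK_SIZE   # 8

def small_block_offset(sb_list, small_block):
    """Absolute file offset of a small block, given the list @sb_list."""
    index, slot = divmod(small_block, SMALL_PER_BIG)
    big_block = sb_list[index]
    # Big block n starts at (n + 1) * 0x200 because the header is block -1.
    return (big_block + 1) * BIG_BLOCK_SIZE + slot * SMALL_BLOCK_SIZE

# The chain found in the example above.
sb_list = [3, 4, 6, 7, 8, 9, 0xa, 0xb]

print(hex(small_block_offset(sb_list, 0)))    # 0x800
print(hex(small_block_offset(sb_list, 9)))    # 0xa40
print(hex(small_block_offset(sb_list, 16)))   # 0xe00
```

Small block 9 is the second small block inside the second big block of the chain, which is big block 4.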

With this in mind, everything needed to pull a file out of a Laola archive is available. You could test this with "lls -s". But there is still more to it.

3.5 Property Sets

Apart from just storing plain files, it is also possible to store specially structured "database" files. These structured files are called property sets. There is a good article about how they are built on Microsoft's web service, "OLE Property Sets Exposed" by Charlie Kindel. It claims that accurate information about property sets is provided with the Win32 SDK, too.

A summary and further information are still to be done!

4. Trash Blocks

The last of the four parts is the "trash data blocks". These are blocks stored in the document without being referred to by the Laola system. There should be no such trash, but sometimes there is. Pretty notorious for this is the Microsoft Word option "fast saving" (switch it off if you haven't yet). Documents stored this way usually consist of about one half garbage. Another example is Star Writer 3.1, which as a matter of principle produces 2 big blocks of trash.

Some blocks consist only partially of trash; they could be called stinky blocks. This is because the size of a property fits exactly to the 0x200 byte border (0x40 for small blocks) only by chance. So the rest of the last block of a chain nearly always contains some bytes of rubbish.
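How many bytes of the last block are left over can be computed directly. A one-line sketch, with a hypothetical function name:

```python
def slack_bytes(size, block_size):
    """Unused bytes at the end of the last block of a property's chain."""
    return -size % block_size

print(hex(slack_bytes(0xf00, 0x200)))   # 0x100
print(slack_bytes(0x400, 0x200))        # 0
print(slack_bytes(100, 0x40))           # 28
```

The first call uses the root entry of the example: with a size of 0xf00 bytes stored in big blocks, the last 0x100 bytes of its final block are not covered by $pps_size.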

Like all trash, data trash is troublesome. In some cases it is simply annoying because it blows up the file's size. In every case it is relevant to data security: because you cannot know what is in there, you lack control over your own data. It has even been reported on Usenet that private mail and encrypted passwords ended up in a Word file.

The good thing is that, unlike nuclear waste, this data trash is removable. Just look at the demonstration program "lclean" in the source code section of this document. I've heard that Microsoft knows about this OLE bug and provides a fix for 32 bit Windows systems. However, if you use Windows 3.1 you probably have to rely on lclean.

Comments appreciated!

Martin

- The End -

TABLES



Table 1: Block -1 (Laola header)

(all values are hexadecimal)

offset  type value           const?  function
00:     stream $laola_id          !  identifier {d0 cf 11 e0 a1 b1 1a e1}
08:     long 0                    .  ?
0c:     long 0                    .  ?
10:     long 0                    .  ?
14:     long 0                    .  ?
18:     word 3b                   .  ? revision ?
1a:     word 3                    .  ? version  ?
1c:     word -2                   .  ?
1e:     byte 9                    .  ?
1f:     byte 0                    .  ?
20:     long 6                    .  ?
24:     long 0                    .  ?
28:     long 0                    .  ?
2c:     long $num_of_bbd_blocks   !  Number of big block depot blocks
30:     long $root_startblock     !  Root chain 1st block
34:     long 0                    .  ?
38:     long 1000                 .  ?
3c:     long $sbd_startblock      !  small block depot 1st block
40:     long 1                    .  ?
44:     long -2                   .  ?
48:     long 0                    .  ?
4c[]:   long $bbd_list[i]         !  array of $num_of_bbd_blocks big block
                                     numbers
The rest of block -1 should be:
        long -1                   .

####


Table 2: Property Storage

offset  type value           const?  function
00:     stream $pps_rawname       !  name of the pps
40:     word $pps_sizeofname      !  size of $pps_rawname
42:     byte $pps_type            !  type of pps (1=storage|2=stream|5=root)
43:     byte $pps_uk0             !  ?
44:     long $pps_prev            !  previous pps
48:     long $pps_next            !  next pps
4c:     long $pps_dir             !  directory pps
50:     stream 00 09 02 00        .  ?
54:     long 0                    .  ?
58:     long c0                   .  ?
5c:     stream 00 00 00 46        .  ?
60:     long 0                    .  ?
64:     long $pps_ts1s            !  timestamp 1 : "seconds"
68:     long $pps_ts1d            !  timestamp 1 : "days"
6c:     long $pps_ts2s            !  timestamp 2 : "seconds"
70:     long $pps_ts2d            !  timestamp 2 : "days"
74:     long $pps_sb              !  starting block of property
78:     long $pps_size            !  size of property
7c:     long                      .  ?

Solutions:

Lesson 1: Find the block chain belonging to the variable $sbd_startblock
Solution: The value of $sbd_startblock is stored at position 0x3c of the header. Its value is 0x00000002 == 2. The value of the big block depot's entry at position 2 is 0xfffffffe, which means end of chain. So the sbd consists only of big block 2.