This file is a short discussion of the uuencode/uudecode 'standard': how they work, pitfalls for them, and variants on the theme. If you have any questions or comments, or if you just want to bitch at me about something, contact me at readdm@ccwf.cc.utexas.edu

Here's how UUENCODE and UUDECODE do their thing:

The basic idea is to convert binary data to plain ASCII data, so it can be mailed safely over standard mail systems. You start with 3 8-bit bytes (24 bits) and convert them to 4 6-bit bytes (also 24 bits). Printable ASCII text has values between 32 and 126 (all of the 'control' characters are between 0 & 31, and 127 is DEL), so you need to add 32 to every byte as an offset. ASCII only goes up to 127, which takes 7 bits, so with only 6 bits (maximum value = 63), you can add 32 and still be under 128 (the largest possible result is 95).

The encoding process goes like this: you take the high-order 6 bits from the first input byte (Ibyte1 for short) and they become the 6 bits of the first output byte (Obyte1)...you have to remember to mask off the top two bits, which is done by doing a logical .and. with 63, which is the lower six bits all *on* and the top two *off*. Then you add 32 to force the result to be at least 32. Now Obyte2 is made from the lower 2 bits of Ibyte1 and the top 4 bits of Ibyte2. Obyte3 is the lower 4 bits of Ibyte2 and the top 2 bits of Ibyte3. Obyte4 is the lower 6 bits of Ibyte3. Again, mask off the top two bits of each output byte and add 32.

The first byte of each encoded output line tells you how many bytes will be output when that line is decoded. This count byte is also encoded; the 'M' you see at the beginning of almost all lines is ASCII character 77...subtracting 32 gets you 45. This 45 refers to the count of DECODED bytes, which come in 3-byte cells, so this means that there are 15 cells on a line that begins with 'M', or 4 * 15 = 60 encoded bytes on that line. Remember that you can't have less than a whole cell, even though you may need only 1 or 2 bytes of the 3-byte decoded cell; the count byte is what tells the decoder how many of the decoded bytes to keep.
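The steps above can be sketched in a few lines of Python. This is only an illustration of the bit-shuffling described here, not the source of any particular uuencode program; the function name `uuencode_line` is made up for this example, and padding a short final cell with zero bytes is an assumption.

```python
def uuencode_line(chunk: bytes) -> str:
    """Encode 1 to 45 raw bytes as one uuencoded line (without the newline)."""
    if not 0 < len(chunk) <= 45:
        raise ValueError("a line carries 1 to 45 decoded bytes")
    # Assumption: pad with zero bytes up to a whole number of 3-byte cells.
    padded = chunk + b"\x00" * (-len(chunk) % 3)
    out = [chr(len(chunk) + 32)]        # encoded count byte, e.g. 45 -> 'M'
    for i in range(0, len(padded), 3):
        b1, b2, b3 = padded[i:i + 3]
        sextets = (
            b1 >> 2,                    # high 6 bits of Ibyte1
            (b1 << 4) | (b2 >> 4),      # low 2 of Ibyte1 + high 4 of Ibyte2
            (b2 << 2) | (b3 >> 6),      # low 4 of Ibyte2 + high 2 of Ibyte3
            b3,                         # low 6 bits of Ibyte3
        )
        # Mask off everything above the low six bits, then add the offset of 32.
        out.extend(chr((s & 63) + 32) for s in sextets)
    return "".join(out)
```

For example, `uuencode_line(b"Cat")` gives `#0V%T`: the '#' is the count (3 + 32 = 35), and the other four characters are the one encoded cell.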
Now to discuss a few common variations on the uuencode 'standard.'

One of the earliest things that people noticed was that some mailers like to strip off any trailing spaces on lines of mail text. Seems like a good idea, no? You're saving storage space by trimming off useless data, right? Well...unfortunately, trailing spaces are perfectly valid in uuencoded data. I have seen several interesting ways around this problem.

The first is to replace the space character (ASCII value = 0x20, or 32 decimal) with the ` character (it's called a grave, pronounced "grahhhve"), which has an ASCII value of 0x60 (96 decimal). Under the standard decoding procedure (subtract 32, then mask with 63), the grave decodes the same way as a space:

    for a space: (0x20 - 0x20) & 0x3F = (0x00) & 0x3F = 0x00
    for a grave: (0x60 - 0x20) & 0x3F = (0x40) & 0x3F = 0x00

or in binary format:

    space = 00100000
    grave = 01100000
    space - 32 = 0
    grave - 32 = 01000000
    (grave - 32) & 63 = 01000000 & 00111111 = 0

So you see that both decode to zero. The difference between them, of course, is that the grave is not seen by the mailer as a space, so it does not get truncated!

The other way around this truncation problem is to place some non-space character at the end of every line...then the trailing spaces don't trail any more, so they're not trimmed! It really doesn't matter which character gets put there, so long as it is *not* a space. A normal decoder will never see this character, because it stops once it has read the number of bytes given by the count byte at the start of the line, as described in the earlier discussion. Some decoders do choke on this, but they're pretty rare.

Something I said earlier is not strictly true; the space and the grave do not *always* decode to zero. If you have a decoder that uses a look-up table for decoding, and its table does not include both the space and the grave, then you're hosed.

Speaking of look-up tables, there's another variant of uuencode/uudecode called xxencode/xxdecode.
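Before getting to xxencode, here is a small Python sketch of the arithmetic decoding discussed above, showing that space and grave come out the same and that an extra character stuck on the end of a line is never looked at. The names `decode_char` and `uudecode_line` are made up for this example. Padding a short line back out with spaces is an extra assumption of mine, not something every decoder does.

```python
def decode_char(c: str) -> int:
    """Map one encoded character back to its 6-bit value."""
    return (ord(c) - 32) & 63

def uudecode_line(line: str) -> bytes:
    """Decode one uuencoded line (given without its newline)."""
    count = decode_char(line[0])              # number of DECODED bytes
    needed = 4 * ((count + 2) // 3)           # encoded chars, in whole cells
    # Assumption: if a mailer trimmed trailing spaces, put them back.
    body = line[1:1 + needed].ljust(needed, " ")
    out = bytearray()
    for i in range(0, needed, 4):
        c1, c2, c3, c4 = (decode_char(c) for c in body[i:i + 4])
        out += bytes((
            (c1 << 2 | c2 >> 4) & 255,
            (c2 << 4 | c3 >> 2) & 255,
            (c3 << 6 | c4) & 255,
        ))
    # Anything after the first `needed` characters is simply ignored.
    return bytes(out[:count])
```

Here `decode_char(" ")` and `decode_char("`")` both return 0, and `uudecode_line("#0V%T")` and `uudecode_line("#0V%Ta")` both return `b"Cat"`.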
Xxencode was invented to avoid problems in a conversion routine between the ASCII world and the EBCDIC world (used primarily by IBM mainframes). It seems that a bug in the conversion routine incorrectly maps some ASCII characters onto other characters in EBCDIC. In a fit of right-thinking, Phil Howard realized that uuencode/uudecode were just a fancy form of a look-up table, and set about finding a character set that does *not* get munged by the conversion routine. He then modified some uuencode source to handle the new character set, and *presto* the problem was solved. The rest of the code remains basically unchanged, but all encoding/decoding is done via the look-up table instead of via the "get a value between 0 and 63 and then add 32" method that uuencode uses. You can always recognize xxencoded files because the full-length lines start with 'h' instead of the uuencode 'M'.

There's yet another variant of uuencode that I've seen. Once people realized that this was all just a look-up table problem, it wasn't long before encoders started appearing that included the entire table before the data, like this:

    table
    `!"#$%&'()*+,-./0123456789:;<=>?
    @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]~_
    begin 666 ROGER.GIF
    M1TE&.#=A@`)>`9,``````*JJ`*H`J@"JJO]5557_5:JJJ@``JE55_U5550"Ja
    M`*H``%7___]5____5?___RP`````@`)>`0,$_A#(2:N]..O-N_]@*(YD:9YHa
    MJJYLZ[YP+,]T;=]XKN]\[__`H'!(+!J/R*1RR6RV#-"H=$JM6J_8K';+[7J_a
    MX+!X3"Z;S~BT>LUN%][PN'Q.K]OO~+Q~S~_7H00&#H$$#@8$#X6'@(N!AX.`a

Then your decoder has an absolute reference for which characters are where in the table! Again, if the standard character set is used, this shouldn't present any problems, because most uudecoders skip over everything before the 'begin' line before they start decoding.

Note that in the example above, the encoder used *both* methods of avoiding the trailing-space-truncation problem, by using grave instead of space, and by placing the 'a' at the end of every line. Kind of redundant, but I suppose it's better than nothing.
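The look-up table idea can be sketched in Python as below. This is a rough illustration, not Phil Howard's code or anybody else's: the names `make_lookup` and `decode_line` are invented for this example. `XX_TABLE` is the usual xxencode character set, and `POSTED_TABLE` is copied from the two table lines in the example above. Registering both space and grave as zero is my own precaution.

```python
STANDARD_TABLE = "".join(chr(32 + i) for i in range(64))   # ' ' through '_'
XX_TABLE = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
POSTED_TABLE = ("`!\"#$%&'()*+,-./0123456789:;<=>?"
                "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]~_")

def make_lookup(table: str) -> dict:
    """Build a character -> 6-bit value map from a 64-character table."""
    if len(table) != 64 or len(set(table)) != 64:
        raise ValueError("table must hold 64 distinct characters")
    lookup = {ch: value for value, ch in enumerate(table)}
    # Precaution: if the table spells zero as space or grave, accept both.
    if table[0] in (" ", "`"):
        lookup.setdefault(" ", 0)
        lookup.setdefault("`", 0)
    return lookup

def decode_line(line: str, lookup: dict) -> bytes:
    """Decode one line; the first character is the encoded byte count."""
    count = lookup[line[0]]
    needed = 4 * ((count + 2) // 3)
    values = [lookup[c] for c in line[1:1 + needed]]
    out = bytearray()
    for i in range(0, needed, 4):
        v1, v2, v3, v4 = values[i:i + 4]
        out += bytes(((v1 << 2 | v2 >> 4) & 255,
                      (v2 << 4 | v3 >> 2) & 255,
                      (v3 << 6 | v4) & 255))
    return bytes(out[:count])
```

With the standard table, the first eight data characters of the example (`1TE&.#=A`) decode to `GIF87a`, and `XX_TABLE[45]` is `h`, which is why full xxencoded lines start with that letter. In `POSTED_TABLE`, the `~` sits at position 62.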
Note also that this example would *not* decode under a standard decoder that knew nothing about 'table' statements; the tilde (~) specified as the second-to-last character in the table is *not* the same as the caret (^) that should be there! This example is actual data that someone posted to the net, which drove everybody nuts.

Now I've probably got you wondering about that 'begin' line in the example. This line tells the decoder where to start decoding. It also tells the decoder which mode to use for opening the output file (for UNIX systems), and what to call the output file. Most encoders will write 644 as the mode, which gives read/write privileges to the 'owner', but only read privileges to the 'group' and to 'others.' Note that in the example, the mode is 666, so read/write privileges are given to *everybody.* Dangerous!

There is also a corresponding 'end' line at the end of the data. Without one, decoders would not know where data stops and other trailer info begins.

We now come to the last topic: multi-part files. Many mailers cannot handle files larger than some maximum size (some are as low as 32 kbytes!), so a common practice is to cut the uuencoded output into smaller chunks which can be mailed. At the other end, the chunks are re-combined and then decoded. The problem here is that mail and news software adds headers to every message, and many people use newsreaders/posters that automatically append a signature to every post, which means that the resulting concatenated file will have bad data in it! This is why you must trim off the headers & trailers by hand.

Unfortunately, there is no standard yet for a uuencoder/decoder which recognizes multi-part files. Recently I've seen some people put "cut here" lines in their files like this:

    BEGIN----------------CUT HERE-----------------
    ...
    END------------------CUT HERE-----------------

so that simple programs can be written which recognize these markers.
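A simple program of that kind might look like the Python sketch below. It is only a guess at what such a tool would do, since there is no standard: the function names and the exact marker test (a line starting with `BEGIN-` or `END-` and containing `CUT HERE`) are assumptions for this example. The second function then looks for the lowercase 'begin' and 'end' lines in whatever is left.

```python
def strip_cut_markers(lines):
    """Yield only the lines sitting between BEGIN...CUT HERE and END...CUT HERE."""
    inside = False
    for line in lines:
        if line.startswith("BEGIN-") and "CUT HERE" in line:
            inside = True
        elif line.startswith("END-") and "CUT HERE" in line:
            inside = False
        elif inside:
            yield line

def extract_payload(lines):
    """Return (mode, name, data_lines) for the first begin/end pair."""
    it = iter(lines)
    for line in it:
        parts = line.split(None, 2)
        # The comparison is case-sensitive, so 'BEGIN----...' never matches.
        if (len(parts) == 3 and parts[0] == "begin"
                and all(c in "01234567" for c in parts[1])):
            mode, name = int(parts[1], 8), parts[2]
            break
    else:
        raise ValueError("no 'begin' line found")
    data = []
    for line in it:
        if line.strip() == "end":
            return mode, name, data
        data.append(line)
    raise ValueError("no 'end' line found")
```

Feeding the concatenated posts through `strip_cut_markers` first drops the headers and signatures, and `extract_payload` then hands back the file mode, the file name, and the encoded lines ready for decoding.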
Again, normal decoders will have no problem with these cut-here lines, because they watch for the 'begin' marker in a case-sensitive way; 'BEGIN' is different from 'begin.'

That's about it for uuencode and uudecode. If you have any questions, you can find me at the address listed above.