Xref: bloom-picayune.mit.edu alt.binaries.pictures.utilities:201 alt.graphics.pixutils:3514
Path: bloom-picayune.mit.edu!snorkelwacker.mit.edu!usc!sdd.hp.com!mips!mips!decwrl!deccrl!news.crl.dec.com!hollie.rdg.dec.com!decvax.dec.com!maxx!tyager
From: tyager@maxx.UUCP (Tom Yager)
Newsgroups: alt.binaries.pictures.utilities,alt.graphics.pixutils
Subject: sift2: an automatic graphic file extractor/decoder
Message-ID: <231@maxx.UUCP>
Date: 27 Jun 92 04:48:30 GMT
Organization: New Hampshire Public Access UNIX
Lines: 722

The following shar file splits into the UNIX source code for a graphic
file extractor. It decodes decipherable files directly from a Usenet
spool directory, or from some directory you create. It is intended for
large-scale use, as when you wish to decode a large number of files at
once, or you want to maintain an archive of decoded files.

More details in sift2.doc below.
(ty)
---
Tom Yager, BYTE Multimedia Lab Director
           Author, Addison-Wesley UNIX programming book (Summer, 1991)
           Author, Harcourt Brace Jovanovich multimedia book (Fall, 1992)
           decvax!maxx!tyager, maxx!tyager@bytepb.byte.com
---
#	This is a shell archive.
#	Remove everything above and including the cut line.
#	Then run the rest of the file through sh.
#----cut here-----cut here-----cut here-----cut here-----
#!/bin/sh
# shar:	Shell Archiver
#	Run the following text with /bin/sh to create:
#	sift2.doc
#	sift2.c
#	eek2.c
#	makefile
# This archive created: Sat Jun 27 00:40:18 1992
cat << \ABRACADABRA > sift2.doc
                                   SIFT2
         Combines and decodes graphics files posted to the Usenet
                       Written 06/26/92 by Tom Yager

This is a (very) brief description of what sift2 does and how to use it.
You'll find very little here on how it works; those with a little UNIX
programming experience can figure that out themselves.

WHAT IS IT?
There are now several Usenet newsgroups which hold postings of graphical
images, ranging from postcards to pornography. These files, split into
pieces to prevent truncation, typically land on news sites in the wrong
order. Ordering them, decoding them, and storing the resulting image files
has become something of a favorite hacker's problem.

All sorts of marvelous, elegant solutions have been posted; I can't touch
the least of them for technical splendor. But all of the solutions I've
seen are targeted at the news reader, who must manually save a file's
components before the file can be retrieved. That wasn't what I needed. As
a site administrator, I wanted the ability to decode the contents of an
entire spool directory, quickly, and with minimal interaction. So I wrote a
tool for myself to do it.

Sift2 is that tool. Point it at any directory, and it will do its best to
pull out all the decipherable GIF and JPEG files (it looks for names ending
in .gif or .jpg). You can add more file types with simple source code
modifications.

HOW DO YOU USE IT?
Under normal conditions, you'll call sift2 from an interactive shell
session. Syntax is simple:

   sift2 spooldir

where "spooldir" is the name of the directory holding the raw Usenet
posting files. It does not write into this directory, so you can safely
point it directly at your Usenet spool area. Sift2 will then create a
database of decipherable files, sort it, and then use that database to pass
the raw files through a decoder (eek2).

As sift2 reads the directory entries, it displays a running count on your
terminal. The count is written to stderr, so redirect stderr to /dev/null
if you don't want to see it. Then it sorts and sifts, and provides fairly
verbose output as it does the latter. This is easy to change in the source;
I find the information useful since I generally run sift2 interactively.

Sift2 creates two working files in the current directory: sift.out and
sift.out.s. These are not removed automatically, but will be overwritten if
sift2 is run again (see the exception below).

HOW DOES IT WORK?
After sift2 builds its database, it passes the files, in order, through a
program called eek2 (included in sift2's distribution). This is the same
"eek" distributed to the Usenet, but slightly modified for use as a filter
for sift2. Both eek2 and sort must be in a directory listed in your PATH
environment variable for sift2 to work correctly.

Sift2 uses the "Subject:" line of the article header to determine the name
of the graphics file, how many parts it has, and which part each article
contains. It's pretty forgiving about variances in format, but generally
expects something like this:

   Subject: moose.gif (part 1 of 7)

See the regular expressions in initexp() in the source for the patterns
used; both the "1 of 7" and "1/7" forms are recognized.
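
As an illustration of the kind of matching involved, here is a rough sketch
in ANSI C (the sift2 sources are K&R style). It is not the code sift2 uses:
sift2 relies on the System V regcmp()/regex() routines, while this sketch
scans by hand. The name parse_parts is made up for the example. It takes the
first "N of M" or "N/M" pair it finds, reading left to right, and it expects
a lowercase "of".

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of sift2: find "N of M" or "N/M" in a
 * Subject line. Returns 0 and fills *part and *total on success, or -1
 * if no such pair is present. */
int parse_parts(const char *subject, int *part, int *total)
{
    const char *p = subject;

    while (*p != '\0') {
        char *end;
        long n, m;

        if (!isdigit((unsigned char)*p)) {
            p++;
            continue;
        }
        n = strtol(p, &end, 10);        /* first number of the pair */
        p = end;                        /* resume after it if this fails */
        while (*end == ' ' || *end == '\t')
            end++;
        if (strncmp(end, "of", 2) == 0)
            end += 2;
        else if (*end == '/')
            end += 1;
        else
            continue;                   /* no separator after the number */
        while (*end == ' ' || *end == '\t')
            end++;
        if (!isdigit((unsigned char)*end))
            continue;                   /* separator not followed by digits */
        m = strtol(end, NULL, 10);
        *part = (int)n;
        *total = (int)m;
        return 0;
    }
    return -1;
}
```

Called on the example Subject line above, this should report part 1 of 7. A
name containing digits, such as "moose2.gif", is stepped over because the
digits are not followed by "of" or "/".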

To decode correctly, each file must be in uuencoded format. Alternate
methods of packing the data, including shar, will confuse sift2. When it
gets confused, you won't get a file, or you might get a bad one. Its
priority is to get all the way through the list, no matter how screwed up
any of the files are.
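
To give a feel for what "uuencoded format" means at the line level, here is
a rough sketch in ANSI C of a test for whether a line has the shape of
uuencoded data. The name uu_line_ok is made up for the example and is not
part of sift2 or eek2; eek2 itself uses a cruder test (a leading 'M' and a
length between 60 and 70). The sketch assumes the traditional encoding, in
which the first character gives the number of decoded bytes and every
character falls between space and backquote.

```c
#include <string.h>

/* Hypothetical check, not taken from eek2: does this line have the
 * shape of a uuencoded data line? Returns 1 if so, 0 otherwise. */
int uu_line_ok(const char *line)
{
    size_t len = strcspn(line, "\r\n");   /* length without line ending */
    size_t i, nbytes, nchars;

    if (len == 0 || line[0] < ' ' || line[0] > '`')
        return 0;
    nbytes = (size_t)((line[0] - ' ') & 077);   /* decoded bytes on line */
    nchars = ((nbytes + 2) / 3) * 4;            /* encoded chars needed */
    if (len - 1 < nchars)
        return 0;                               /* line is too short */
    for (i = 1; i <= nchars; i++)
        if (line[i] < ' ' || line[i] > '`')
            return 0;                           /* outside uuencode range */
    return 1;
}
```

A full line starts with 'M' (45 decoded bytes) and carries 60 encoded
characters. Header lines such as "Subject: ..." fail because they are far
shorter than their first character would imply.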

If sift2 doesn't decipher a file you thought it would, you can kick off the
second half of the process again by adding a second argument to the command
line, like so:

  sift2 /usr/spool/news/alt/binaries/pictures/misc sift.out.s

You must use "sift.out.s" unless you change the source code to use an
alternate name. Using this syntax skips the building of the database, and
starts decoding files based on the order in sift.out.s. This is an ordinary
text file holding the information sift2 collected. You can add to it,
reorder it, or fix any problem with it in an editor, then tell sift2 to
start over. It will overwrite any same-named graphic file in the current
directory, so it's safe to run it repeatedly on the same set of data.
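
If you plan to edit sift.out.s by hand, it helps to know the record layout.
Going by the fprintf() call in makedb(), each line holds four comma-separated
fields: image name, part number, total parts, and article file name, with
the two numbers zero-padded to two digits so that a plain sort puts the
parts in order. The sketch below (ANSI C, with the made-up names parse_entry
and FIELD_MAX) shows one way to read such a line. It is an illustration, not
the parsing code from sift2.c.

```c
#include <stdio.h>

#define FIELD_MAX 255   /* same size as the name buffers in sift2.c */

/* Hypothetical helper: split one "name,part,of,article" record.
 * name and article must each hold at least FIELD_MAX bytes.
 * Returns 0 on success, -1 if the line is malformed. */
int parse_entry(const char *line, char *name, int *part, int *total,
                char *article)
{
    /* %254[^,] reads up to the first comma; %254[^\n] reads the rest
     * of the line without the trailing newline. */
    if (sscanf(line, "%254[^,],%d,%d,%254[^\n]",
               name, part, total, article) != 4)
        return -1;
    return 0;
}
```

A line such as "moose.gif,01,07,3220" splits into the name moose.gif, part 1,
total 7, and article file 3220. A line missing any of the four fields is
reported as malformed.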

GENERAL INFORMATION
Everything included in this distribution is provided as-is. Use it entirely
at your own risk. I'm interested in your comments, but I don't have time to
answer technical questions or help you with changes. Once you pick it up,
you're on your own.

The C source code was targeted for System V, Release 3.2 UNIX (SCO,
Interactive, ESIX, Dell, and so on). It should compile on BSD derivatives
using System V libraries and headers (if you have them), or you can change
the sources to render them Berkeleyesque.

This is not commercial-quality code. I think it's pretty stable, but if
you're looking for a good example of UNIX programming, this isn't it.

Good luck with sift2. I hope it saves you some effort.
(ty)
---
Tom Yager, BYTE Multimedia Lab Director
           Author, Addison-Wesley UNIX programming book (Summer, 1991)
           Author, Harcourt Brace Jovanovich multimedia book (Fall, 1992)
           decvax!maxx!tyager, maxx!tyager@bytepb.byte.com
ABRACADABRA
cat << \ABRACADABRA > sift2.c
/* sift2.c: decode images posted to usenet newsgroups */
/* Written 06/26/92 by Tom Yager. Provided as-is; anyone using this program */
/* or any portion of it does so at his or her own risk. */
/* Permission granted for free distribution of source code and/or binary, */
/* provided this header and the file "sift2.doc" are included. */

/* Set tabstops to 4 spaces apart */

#include <stdio.h>
#include <ctype.h>
#include <sys/types.h>
#include <dirent.h>
#include <sys/stat.h>
#include <string.h>
#include <malloc.h>
#include <signal.h>

char	*ofexp, *slashexp;
FILE	*dbgout;

#define SEARCHLIM 25

/* return index of s2 in s1. return -1 for no match. */
/* set nocase to do case-insensitive search */
int findstr(s1, s2, nocase)
char	*s1, *s2;
int		nocase;
{
	char	*ptr1, *ptr1a, *ptr2a;

	/* if "nocase" is nonzero, convert s1 to lowercase first and compare */
	/* s2 as if it were lowercase. s2 itself is left alone, since callers */
	/* pass string literals, which may not be writable. */
	/* MAKE NOTE: THIS FUNCTION MODIFIES s1! */
	if (nocase) {
		for (ptr1 = s1; *ptr1 != 0; ptr1++)
			*ptr1 = (char)tolower((unsigned char)*ptr1);
	}

	if ((strlen(s2) > strlen(s1)) || (*s2 == 0) || (*s1 == 0))
	/* no match possible */
		return(-1);

	/* try for a match starting at each position in s1 */
	for (ptr1 = s1; *ptr1 != 0; ptr1++) {
		ptr1a = ptr1; ptr2a = s2;
		while ((*ptr2a != 0) && (*ptr1a == (nocase ?
			(char)tolower((unsigned char)*ptr2a) : *ptr2a))) {
			ptr1a++; ptr2a++;
		}
		if (*ptr2a == 0) /* match */
			return((int)(ptr1 - s1));	/* index of first matching char */
	}
	return(-1);
}

void dofiles(srcdir, listfilename)
char	*srcdir, *listfilename;
{
	FILE	*listfile, *infile, *outfile, *eekpipe;
	char	infilename[255], gifname[255], rawname[255], 
				outfilename[255], tmp[300];
	char	prevgifname[255];
	char	input_line[1024], prev_input_line[1024], *strptr;
	int		result, part, of, thispart, prevpart, file_list_index = 0, i;
	int		prevof, done = 0;
	char	*file_list[100];

	signal(SIGPIPE, SIG_IGN);
	fprintf(stderr, "list file: %s\n", listfilename);
	listfile = fopen(listfilename, "r");
	if (listfile == NULL) {
		perror("List file open");
		exit(1);
	}
	
	strcpy(prevgifname, "\0");
	strcpy(prev_input_line, "\0");
	eekpipe = (FILE *)NULL;
	prevpart = -1;
	while (1) {
		strptr = fgets(input_line, 1024, listfile);
		if (strptr != NULL) {
			/* expect "name,part,of,rawname"; skip hand-edited lines */
			/* that don't fit, rather than crash on them */
			if (sscanf(input_line, "%254[^,],%d,%d,%254[^\n]",
				gifname, &part, &of, rawname) != 4) {
				fprintf(stderr, "** bad list entry skipped\n");
				continue;
			}
		} else {
			fprintf(stderr, "EOF detected\n");
			gifname[0] = '\0';
			done = 1;
		}
		
		/* What was expected? */
		if (strcmp(gifname, prevgifname)) {		/* new file! */
			if (*prevgifname != '\0') {			/* unfinished business */
				eekpipe = popen("eek2", "w");
				if (eekpipe == NULL) {
					perror("Pipe open");
					exit(3);
				}
				for (i = 0; i < file_list_index; i++) {
					if (file_list_index < prevof)	{	/* don't bother */
						fprintf(stderr, "** Too few parts\n");
						break;
					}
					sprintf(infilename, "%s/%s", srcdir, file_list[i]);
					free(file_list[i]);
					infile = fopen(infilename, "r");
					fprintf(stderr, "copying '%s'\n", infilename);
					if (infile == NULL) {
						perror("Input file open");
						exit(2);
					}
					/* copy the file to eek for cleaning */
					while (fgets(input_line, 1024, infile)) {
						fputs(input_line, eekpipe);
					}
					fclose(infile);
				}
				pclose(eekpipe);
			}
			/* set up for new file */
			file_list_index = 0;
			prevpart = -1;
		}
		if (done)
			break;
		printf("name '%s'  part %d  of %d  fname '%s'\n",
			gifname, part, of, rawname);
		strcpy(prevgifname, gifname);
		if (part == prevpart)		/* skip repeats */
			continue;
		prevpart = part;
		prevof = of;
		if (file_list_index >= 100)		/* file_list[] is full; drop extras */
			continue;
		file_list[file_list_index] = malloc(strlen(rawname) + 1);
		strcpy(file_list[file_list_index], rawname);
		++file_list_index;
	}
	fclose(listfile);
	return;
}

/* build file entry database */
makedb(srcdir, listfile)
char	*srcdir, *listfile;
{
	DIR		*dir_ptr;
	FILE	*infile, *outfile;
	char	input_line[1024], gifname[255], *strptr;
	char	outfilename[255], infilename[255];
	struct dirent	*dir_entry;
	struct stat		stat_buf;
	int		i, result, pos, pos1, part0, part1, count = 0;
	
	if (listfile == NULL) {
		dir_ptr = opendir(srcdir);
		if (dir_ptr == NULL) {
			perror("Can't read requested directory");
			exit(1);
		}
		
		strcpy(outfilename, "sift.out");
		outfile = fopen(outfilename, "w");
		if (outfile == NULL) {
			perror("Output file open");
			exit(1);
		}
		
		while ((dir_entry = readdir(dir_ptr)) != NULL) {
			sprintf(infilename, "%s/%s", srcdir, dir_entry->d_name);
			fprintf(dbgout, "\n");
			fprintf(dbgout, "%s", dir_entry->d_name);
			if (stat(infilename, &stat_buf) < 0) {
				perror("Can't stat file");
				exit(2);
			}
			fprintf(dbgout, "  %ld bytes, %s\n", stat_buf.st_size,
				(S_ISREG(stat_buf.st_mode)) ? "reg file" : "not reg");
			if ((S_ISREG(stat_buf.st_mode)) == 0)
				continue;
			infile = fopen(infilename, "r");
			if (infile == NULL) {
				perror("Can't open file");
				exit(3);
			}

			/* perform Subject: search on first SEARCHLIM lines */
			result = 0;
			for (i = 0; i < SEARCHLIM; i++) {
				strptr = fgets(input_line, 1024, infile);
				if (strptr == NULL)
					break;
				if (strncmp("Subject:", input_line, 8) == 0) {
					result = 1;
					break;	
				}
			}
			fclose(infile);
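			if (!result) {	/* no Subject: in the first SEARCHLIM lines */
				fprintf(dbgout, "   ** no Subject: line found\n");
				continue;
			}
			fprintf(dbgout, "   %s", input_line);

			/* findstr takes three arguments; nocase must be passed */
			if ((findstr(input_line, "Re:", 1) != -1) ||
				((findstr(input_line, ".gif", 1) == -1) &&
				(findstr(input_line, ".jpg", 1) == -1))) {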
				fprintf(dbgout, "   ** doesn't look like a GIF/JPG entry\n");
				continue;
			}
			/* hunt for file name */
			pos1 = pos = findstr(input_line, ".gif", 1);
			if (pos1 == -1) 
				pos1 = pos = findstr(input_line, ".jpg", 1);
			for (; pos >= 0; pos--)
				if ((char *)strchr(": 	/,;\"'`[](){}<>*=|", 
					*(input_line + pos)) != NULL)
					break;
			if (pos < 0) {
				fprintf(dbgout, "   ** confused: can't break out file name\n");
				continue;
			}
			pos1 += 4;	/* leave .gif/jpg in */
			strncpy(gifname, input_line + pos + 1, (pos1 - pos) - 1);
			gifname[(pos1 - pos) - 1] = '\0';

			/* now try to find out what part this is */
			part0 = part1 = 0;	/* defaults if no part numbers are found */
			if (getparts(input_line, &part0, &part1) == -1)
				fprintf(dbgout, "  >> getparts failed\n");
			fprintf(outfile, 
				"%s,%2.2d,%2.2d,%s\n", gifname, part0, part1, dir_entry->d_name);
			fprintf(stderr, "%d       %c", ++count, 13);
		}
		closedir(dir_ptr);
		fclose(outfile);

		sprintf(input_line, "%s.s", outfilename);
		unlink(input_line);
		sprintf(input_line, "sort %s >%s.s", outfilename, outfilename);
		fprintf(stderr, "%s\n", input_line);
		result = system(input_line);
		fprintf(stderr, "sort returned %d\n", result);
		strcat(outfilename, ".s");
	} else {
		strcpy(outfilename, listfile);
		result = 0;
	}
	if (result == 0) {
		dofiles(srcdir, outfilename);
	}
	fprintf(dbgout, "*** END *** \n");
}

void	initexp()
{
	slashexp = (char *)regcmp(".*[^0-9]([0-9]+)$0[	 ]*/[	 ]*([0-9]+)$1",
		(char*)0);
	ofexp = (char *)regcmp(".*[^0-9]([0-9]+)$0[	 ]*of[	 ]*([0-9]+)$1",
		(char*)0);
	if ((ofexp == NULL) || (slashexp == NULL)) {
		fprintf(dbgout, "Regular expression compile failed\n");
		exit(1);
	}
}

int		getparts(str, part0, part1)
char	*str;
int		*part0, *part1;
{
	char	str1[255];
	char	*strptr, *exp1;
	char	num1[10], num2[10];		/* assumes people will behave */
	int		result, result1;

	strptr = (char *)regex(ofexp, str, num1, num2);
	if (strptr != NULL) {	/* we got one */
		result = sscanf(num1, "%d", part0);
		result1 = sscanf(num2, "%d", part1);
		if ((result != 1) || (result1 != 1)) {
			fprintf(dbgout, "   ** illegal part number(s)\n");
			return(-1);
		}
		return(0);
	}
	strptr = (char *)regex(slashexp, str, num1, num2);
	if (strptr != NULL) {
		result = sscanf(num1, "%d", part0);
		result1 = sscanf(num2, "%d", part1);
		if ((result != 1) || (result1 != 1)) {
			fprintf(dbgout, "   ** illegal part number(s)\n");
			return(-1);
		}
		fprintf(dbgout, "  == parts found (/): %d of %d\n", *part0, *part1);
		return(0);
	}
	return(-1);	/* neither pattern matched */
}


main(argc,argv)
int		argc;
char	**argv;
{
	char	*srcdir;

	if (argc < 2) {
		fprintf(stderr, "\nusage: %s srcdir [listfile.s]\n", argv[0]);
		exit(1);
	}

	srcdir = argv[1];
	/* Change the assignment of dbgout below if you want more detailed
	   console output. */
	/*dbgout = stderr;*/	/* one way to see what's happening */
	dbgout = fopen("/dev/null", "w");

	initexp();
	if (argc == 2) {
		makedb(srcdir, NULL);
	} else {
		makedb(srcdir, argv[2]);
	}
	fclose(dbgout);	/* dbgout is a FILE *, so fclose, not close */
	return(0);
}
ABRACADABRA
cat << \ABRACADABRA > eek2.c
/*
 * This UNIX C program takes any number of uuencoded files
 * from <stdin>, extracts the good stuff, and feeds
 * each file one at a time to uudecode.
 *
 * To compile, "cc -o eek2 eek2.c" should suffice.
 *
 * From the prompt, you can use the 's' command like this:
 *
     [rn prompt from within the group] 3220,3219,3221,3224-3228 s ims
 *
 * If this is the first set of images this run, it will ask if you
 * want mailbox format, to which "n" (no) is the best response, though
 * it shouldn't make much difference.
 *
 * This will send four pictures to the file "ims".  Notice that when
 * the parts aren't in order, you have to specify the numbers in the
 * proper order -- but it can still be done on one line (in one command).
 *
 * Once out of "rn", the command
 *
     eek < ims
 *
 * will extract the four images into the current directory, at which point
 * you probably want to remove "ims" so you can start fresh next time.
 *
 * It will break if someone has junk lines starting with "end" or "begin #".
 * Probably in other ways too.  But heh, it's free.  Whaddya want?  If you
 * improve upon it, please send me a copy so I can use it too.
 *
 * If you find it useful, drop me a note so I can feel all warm inside.
 *
 *  - Brandyn (brandyn@apple.com), 1/25/90
 *
 * Ps. I've added a few trivial enhancements:
 *
 *    - If multiple archives are in one file, and one of them is corrupt
 *      and missing an "end" line, Eek will detect the "begin" line of the
 *      next archive and start a new extraction.  Before, it would lose
 *      the second file (actually, it would pack it onto the end of the
 *      corrupt one, not knowing any better).
 *
 *    - If the file to be extracted already exists, the name of the new
 *      file will have a "+" appended to it (or ++, +++, etc.. until it
 *      finds an unused name).  This is to ensure that eek doesn't clobber
 *      any of your files.  Bug: Eek determines the existence of a file by
 *      trying to open it for reading.  If the file is not readable, eek
 *      will think it is not there and use the same name.  If it is writable,
 *      it will clobber it.  Cheap hack, obscure bug.  That's life.
 *
 *    - The mode of the extracted file will always be 600.  You can change
 *      the code if you want another default.  Some postings had obnoxious
 *      protection modes (like 0) so I found this useful.
 *
 *  -Brandyn - 1/17/92 (Two years! Boy do I feel old.)
 */

#include <ctype.h>
#include <stdio.h>
#include <string.h>

    /*
     * Current input line (which
     * may be held over from earlier):
     */
    static char curline[512];
    static int  heldover=0;

main()
{
    FILE *uu;
    int num;

	uu  = popen("uudecode","w");
	if (uu == NULL)		/* couldn't start the decoder */
		exit(1);
	CopyOne(stdin,uu);
	pclose(uu);
	/*printf("Done\n"); */
}

/*
 * Throws the input stream <from> away until
 * it finds "begin ...", then copies
 * all uuencode type lines to <to> until
 * it finds "end ..." at which point it may
 * reintroduce the last one or two "rejected"
 * lines to account for uuencode's funny endings.
 *
 * Returns TRUE on EOF.
 */
CopyOne(from,to)
FILE *from,*to;
{
    int len;
    char new[250];
    char a[250],b[250];
    char *res;

    /* nothing has been rejected yet; the loop below reads a and b */
    a[0] = '\0';
    b[0] = '\0';

    /*
     * Find the begin line, or use
     * the heldover one:
     */
    if (!heldover) {

        while (res = fgets(curline,250,from))
            if (strncmp(curline,"begin ",6)==0 && isdigit(curline[6]))
                break;

        if (!res)
            return(1);

    } else
        heldover=0;

    RewriteHeader(curline,new);
    fputs(new,to);

    while (res = fgets(curline,250,from)) {

        if (!strncmp(curline,"end",3))
            break;

        len = strlen(curline);

        if (curline[0] != 'M' ||
            len < 60      ||
            len > 70      ||
            (strncmp(curline,"Message-ID",10)==0)) {

                /*
                 * Check for nested "begin" line:
                 */
                if (strncmp(curline,"begin ",6)==0 && isdigit(curline[6])) {
                    heldover++;
                    fprintf(stderr,"Eek! Found a nested begin line!\n");
                    fputs("end\n",to);
                    return(0);
                }

                /* fprintf(stderr,"Rejecting \"%s\"\n",curline); */
                strcpy(b,a);
                strcpy(a,curline);
                continue;

        }
        a[0] = '\0';
        b[0] = '\0';
        fputs(curline,to);
    }

    if (!res) {
        fprintf(stderr,
            "Warning: hit end of file without ever finding \"end\"\n");
        exit(0);
    }

    if (b[0]) {
        /* fprintf(stderr,"Restoring \"%s\"\n",b); */
        fputs(b,to);
    }

    if (a[0]) {
        /* fprintf(stderr,"Restoring \"%s\"\n",a); */
        fputs(a,to);
    }
    fputs(curline,to);

    return(0);
}

RewriteHeader(old,new)
char *old;
char *new;
{
    char fname[256];

    if (ExtractFileName(old,fname,250)) {
        strcpy(new,old);
        fprintf(stderr,"[can't parse header!!]");
        fflush(stderr);
        return;
    }

	if (strlen(fname) > 13)
		fname[11] = '\0';
    /*while (Isafile(fname))
        strcat(fname,"+"); */

    sprintf(new,"begin 600 %s\n",fname);
}

Isafile(fn)
char *fn;
{
    FILE *fp;

    if (fp = fopen(fn,"r")) {
        fclose(fp);
        return(1);
    }
    return(0);
}

/*
 * I don't know, maybe I was tired when
 * I wrote this.  But you must admit it's
 * robust?  I know, I know -- use sscanf.
 * I must have had a reason to do it this
 * way... (ehem).
 */
ExtractFileName(header,fname,len)
char *header,*fname;
int len;
{
    int i,j;

    for (j=5; header[j]==' '; j++)
        ;
    for (; header[j] && header[j]!=' '; j++)
        ;
    for (; header[j]==' '; j++)
        ;
    for (i=0; i<len; i++) {
        fname[i] = header[i+j];
        if (fname[i] == ' '  ||
            fname[i] == '\t' ||
            fname[i] == '\n' ||
            fname[i] == '\0')
            break;
    }
    fname[i] = '\0';

    if (!i)
        return(1);

    return(0);
}
ABRACADABRA
cat << \ABRACADABRA > makefile
CFLAGS = -O

# eek2 is built from eek2.c by make's default rules
all:	sift2 eek2

sift2:	sift2.o
	cc -o sift2 sift2.o
ABRACADABRA
#	End of shell archive
exit 0