
                              Qui-ne-faut FAQ
                                      
General interest questions

  Q. What is OCR?
  
   A. OCR stands for Optical Character Recognition: put a printed text in
   a scanner, and the software should rebuild the separate characters
   from a huge bitmap and identify them, to output an editable text. OCR
   is also possible with handwritten characters (it is then called ICR),
   but that case is *extremely* difficult. Look at the OCR FAQ (disable
   style sheets under netscape-4, or, better, use lynx); also look
   there.
   
  Q. Is there commercial OCR software for UN*X?
  
   A. Yes.
   
  Q. What are/were other freeish OCR attempts for UN*X platforms?
  
   A. (mainly from the OCR FAQ) These pointers were good as of Jan 16
   1999. I have personal copies of all of these packages.
     * Cal Poly OCR: last update Jan 24 1994
     * xocr: *shareware*, last update Apr 18 1996; ftp (english version),
       ftp (german version)
     * OCRchie: last update May 18 1996
     * Project-O2: last update Oct 30 1998
     * socr: site last updated Jul 14 1998. Abandoned Dec 1 1998: dead.
       
  Q. Font/Rotation/Scale independent OCR...?
  
   A. Look at David Squire's thesis for some ideas.
   
  Q. How does it work?
  
   A. Two steps:
     * first, we have to make out the characters on the page. This is not
       trivial, and is a bit boring when you know that the difficult
       part of the job is still to come. Once we have these characters
       (we have the outlines!) we compute classifiers: h/w ratio, number
       of holes, etc.
     * second, those classifiers are fed into an algorithm that tries to
       decide whether this vector of classifiers is a P, a # or a z. I
       see three ways to do this: basic R^n distance comparison,
       classification tree, neural network. In all three cases, you need
       a good learning base of a few thousand (vector,letter) solution
       couples.
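
   The simplest of the three, basic R^n distance comparison, can be
   sketched as follows. This is only an illustration, not quinefaut
   code: the struct, the function names and the number of classifiers
   (NCLASSIFIERS) are assumptions made for the example. The letter of
   the closest learned vector wins.

```c
#include <assert.h>
#include <stddef.h>

#define NCLASSIFIERS 3   /* hypothetical: h/w ratio, number of holes, ... */

/* One learned (vector,letter) couple. */
struct sample {
    double v[NCLASSIFIERS];
    char letter;
};

/* Squared Euclidean distance between two vectors of R^n. */
static double dist2(const double *a, const double *b)
{
    double d = 0.0;
    size_t k;
    for (k = 0; k < NCLASSIFIERS; k++)
        d += (a[k] - b[k]) * (a[k] - b[k]);
    return d;
}

/* Return the letter of the learned sample closest to v. */
char nearest_letter(const struct sample *base, size_t n, const double *v)
{
    size_t i, best = 0;
    double d, dbest;
    assert(n > 0);
    dbest = dist2(base[0].v, v);
    for (i = 1; i < n; i++) {
        d = dist2(base[i].v, v);
        if (d < dbest) {
            dbest = d;
            best = i;
        }
    }
    return base[best].letter;
}
```

   Every decision scans the whole learning base, so storage and decision
   time both grow with the number of samples.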
       
  Q. What's the difference between a classification tree and a neural network?
  
   A. See the article An Empirical Comparison of Decision Trees and Other
   Classification Methods, by Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan
   Shih. Neural networks are very strange objects, very heavy to
   manipulate and maybe overrated. I chose classification trees. Why?
   The initial learning stage is long. Very long. But the storage
   requirement of the decision tree is very small and decision time is
   extremely short. Note that Stuart Inglis, the SOCR developer, used
   C4.5 trees for preliminary work.
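
   To see why storage is small and decision time short, here is a
   minimal sketch of a tree stored as an array of nodes and walked from
   the root. The node layout is an assumption for illustration; it is
   not the format produced by QUEST or C4.5.

```c
#include <assert.h>

/* Internal nodes test one classifier against a threshold;
   leaves (feature < 0) carry the decided letter. */
struct node {
    int feature;        /* index into the vector, or -1 for a leaf */
    double threshold;   /* go left if v[feature] <= threshold */
    int left, right;    /* indices of the children in the node array */
    char letter;        /* only meaningful for a leaf */
};

/* Walk the tree from the root (index 0) down to a leaf. */
char tree_decide(const struct node *tree, const double *v)
{
    int i = 0;
    assert(tree != 0 && v != 0);
    while (tree[i].feature >= 0)
        i = (v[tree[i].feature] <= tree[i].threshold)
            ? tree[i].left : tree[i].right;
    return tree[i].letter;
}
```

   One decision costs one comparison per level of the tree, whatever the
   size of the learning base was.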
   
  Q. How do you build the classification trees?
  
   A. With QUEST, a FORTRAN package by Yu-Shan Shih, mentioned in the
   technical report above. It is not clear whether I may use QUEST to
   build trees and package them with quinefaut without license trouble:
   the QUEST implementation is not free for commercial use, whereas
   GPL'ed software can be used in commercial applications without
   limitation. Another candidate is FACT, by Wei-Yin Loh. Note that
   implementations may be protected, but algorithms are not. So if there
   is *real* trouble using these trees, the solution would be to
   reimplement a classification tree building algorithm from scratch.
   
  Q. How to make an outline from a bitmap?
  
   A. Look closely at GNU fontutils, especially the LIMN algorithm (see
   README in fontutils/limn/), a very interesting algorithm for
   converting a bitmap into Bezier contours. We do not use it here. We
   compute line segments over triangles, which is much simpler; see
   connexity.c. We could use cleverer approximations such as spline
   interpolation (see package xfarbe), but that would be hard to
   program. See NURBS?
   
Usage

  Q. Should I use interpolated resolution on my scanner?
  
   A. Interpolated resolution is a despicable commercial kludge. Forget
   it. NEVER use interpolated resolution. Read this.
   
  Q. What is the memory usage of quinefaut?
  
   A. Well well well; quinefaut operates on files of 5MB on average. That
   is a lot of data, and the computations on it are not so trivial, so
   don't be surprised that it is CPU and memory intensive. The critical
   stage is the connexity_find_paths_monolithic function, which must run
   entirely in RAM, because swapping there would slow execution to a
   crawl. This function works on a data structure that weighs 75MB for
   an average book page, and for the time being computers in which this
   amount of data fits entirely in RAM are not so common. So I decided
   to write something that cuts a page into little squares called
   ``patches'', and connexity_find_paths_monolithic is run on these
   patches; with the -m option you give x = the size of the patches in
   megabytes. (That does not mean that all quinefaut data will fit in
   exactly x megabytes.)
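
   As a rough illustration of the cutting itself, the sketch below
   covers a page with square patches of a given side in pixels and clips
   the patches on the right and bottom edges. All names are
   hypothetical. How quinefaut converts the -m megabytes into a side,
   and how it handles characters that straddle two patches, is not shown
   here.

```c
#include <assert.h>

struct patch {
    long x0, y0;   /* top-left corner in the page */
    long w, h;     /* size, clipped at the page edges */
};

/* Number of patches of side `side` needed to cover `len` pixels. */
static long patches_along(long len, long side)
{
    return (len + side - 1) / side;   /* ceiling division */
}

long patch_count(long width, long height, long side)
{
    assert(width > 0 && height > 0 && side > 0);
    return patches_along(width, side) * patches_along(height, side);
}

/* Rectangle of patch number k, patches being numbered row by row. */
struct patch patch_rect(long width, long height, long side, long k)
{
    long per_row = patches_along(width, side);
    struct patch p;
    assert(k >= 0 && k < patch_count(width, height, side));
    p.x0 = (k % per_row) * side;
    p.y0 = (k / per_row) * side;
    p.w = (p.x0 + side <= width) ? side : width - p.x0;
    p.h = (p.y0 + side <= height) ? side : height - p.y0;
    return p;
}
```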
   
  Q. How can I use --dump-paths-and-quit option?
  
   A. Example: look closely at the use of ``P''ath and ``p''ath below!
$ ./quinefaut --dump-paths-and-quit -n cro >enormousfile
...

$ grep Path enormousfile
Path000 length=142 area=131.69381 closed [10.45997<25.04790<38.61311]x[57.97514<62.83528<66.25212]
Path001 length=54 area=53.41130 closed [13.88729<18.01650<22.12796]x[410.82814<414.98363<419.08787]
Path002 length=54 area=52.42692 closed [14.83339<19.01323<23.18012]x[445.89770<450.01081<454.05927]
Path003 length=134 area=306.40131 closed [14.95238<24.49555<34.03151]x[488.95708<498.49969<508.03602]
...

   gives the list of all positive area paths. Do you want to see Path006?
$ grep path006 enormousfile >xydata
$ gnuplot
gnuplot> plot 'xydata'

   cool!
   
Maths

  Q. What is connexity/simple connexity/...?
  
   A. pffft
   
  Q. What is the Green-Riemann formula?
  
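   A. For a domain D bounded by a closed, positively oriented curve C:
   
      \oint_C P\,dx + Q\,dy = \iint_D \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dx\,dy
   
   leading to (taking P = -y/2 and Q = x/2) the area enclosed by C:
   
      \mathrm{Area}(D) = \frac{1}{2} \oint_C ( x\,dy - y\,dx )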
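
   On a closed path made of line segments, the contour integral
   (1/2) \oint (x dy - y dx) for the enclosed area reduces to a finite
   sum over the vertices. Here is a minimal sketch of that sum; the
   function name is hypothetical, and that quinefaut computes its area=
   values exactly this way is an assumption. The sign tells the
   orientation of the path.

```c
#include <assert.h>
#include <stddef.h>

/* Signed area of the closed polygon (x[i],y[i]), i = 0..n-1:
   positive when the vertices turn counterclockwise (y axis pointing
   up), negative when they turn the other way. */
double polygon_signed_area(const double *x, const double *y, size_t n)
{
    double twice = 0.0;
    size_t i, j;
    assert(n >= 3);
    for (i = 0; i < n; i++) {
        j = (i + 1 == n) ? 0 : i + 1;   /* next vertex, wrapping around */
        twice += x[i] * y[j] - x[j] * y[i];
    }
    return twice / 2.0;
}
```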
                                      
Further developments

  Q. Why do you call MYOWN_FREE(,,) with three parameters when the libc
  ``free'' function has just one arg?
  
   A. When you free a pointer, you *always* know how much you mallocated.
   So passing the size is no strain, and it is extremely helpful for
   keeping the malloc/free balance, for a smooth run.
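
   The idea can be sketched with a pair of wrappers that keep a running
   byte count. The SKETCH_ names and the (pointer, type, count)
   parameter order are assumptions for illustration; the real MYOWN_
   macros may look different.

```c
#include <assert.h>
#include <stdlib.h>

static long sketch_balance = 0;   /* bytes currently allocated */

/* Allocate n objects of the given type and add them to the balance. */
#define SKETCH_MALLOC(ptr, type, n) \
    do { \
        (ptr) = malloc((n) * sizeof(type)); \
        if ((ptr) == NULL) \
            abort(); \
        sketch_balance += (long)((n) * sizeof(type)); \
    } while (0)

/* Free them; the caller states again how much was allocated. */
#define SKETCH_FREE(ptr, type, n) \
    do { \
        free(ptr); \
        (ptr) = NULL; \
        sketch_balance -= (long)((n) * sizeof(type)); \
    } while (0)

long sketch_get_balance(void)
{
    return sketch_balance;
}

/* To be called just before exit(): everything must have been freed. */
void sketch_check_balance(void)
{
    assert(sketch_balance == 0);
}
```

   A leak, or a free with the wrong size, then shows up as a nonzero
   balance at the end of the run.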
   
  Q. What are the coding rules for quinefaut?
  
   A. If you intend to develop or modify code in quinefaut, do what you
   want; but if you want me to read and/or accept it, respect the
   following rules:
     * the work has begun in C and will go on in C. No C++.
     * gcc is strongly preferred. Linux or any working platform is ok.
       Windo'z not.
     * compile with gcc -Wall, with *zero* warnings.
     * respect MYOWN_ dynamic allocation functions. Any modification must
       respect malloc/free balance and *zero* at the end, even for a poor
       three char string, even if you have dozens of free followed
       immediately by... exit().
     * beware, the number of chars in a string is strlen, but if you
       include the last \0, it is 1+strlen = STRSIZE, so use STRSIZE when
       freeing a string
     * *DO NOT* use break, except in switch
     * C performs boolean shortcuts, abandoning evaluation as soon as one
       item in a x&&y&&z&&t expression is false, or one item in a
       x||y||z||t expression is true. Use it, but please comment boolean
       shortcuts.

for (i=0;(i<T->size)&&(T->table[i].area<0);i++); /* boolean shortcut! */
     * warning: `j' might be used uninitialized in this function. I
       cannot stand those warnings. There is a macro:
DECLARE_UNINITIALIZED(double,firsty);
       that allows you to declare such a variable. It is strictly
       equivalent to saying double firsty=0; but by using this macro, you
       clearly mean the *only* reason to initialize is to suppress the
       warning.
     * do not make any assumption about the length of an array or a
       string. Use dynamic allocation, asprintf, getline. There is no
       exception in the whole program. Remember The Ten Commandments for C
       Programmers: ``Thou shalt check the array bounds of all strings
       (indeed, all arrays), for surely where thou typest ``foo'' someone
       someday shall type ``supercalifragilisticexpialidocious''.''
     * there is a macro to ``securely increment'' an array index when
       you're compelled to do so, i.e., when, while filling an array, you
       really don't know how big it will be:
/* incrementation and reallocation if needed */
PLUSPLUSANDCTRL(l,C->table,line,Csize);
       remember that realloc sometimes copies the whole array elsewhere,
       so you might experience weird things with reckless pointer
       arithmetic.
     * recursive functions are beautiful but cost a lot of stack. When
       recursion can be avoided by writing in a flat way (e.g. freeing a
       linked list), do it. I don't think the tree freeing algorithm can
       be quickly written another way.
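
   A few of the rules above, sketched in code. The DECLARE_UNINITIALIZED
   definition is merely one that matches the description given (the real
   macro may be written differently); sketch_increment is a hypothetical
   stand-in for PLUSPLUSANDCTRL, whose actual definition is not
   reproduced here; list_free_flat shows the flat way to free a linked
   list.

```c
#include <assert.h>
#include <stdlib.h>

/* Initialize only to silence the "might be used uninitialized" warning. */
#define DECLARE_UNINITIALIZED(type, name) type name = 0

/* Increment *index and double the array when the new index no longer
   fits. Returns the array, which realloc may have moved elsewhere:
   never keep old pointers into it. */
double *sketch_increment(double *table, size_t *index, size_t *capacity)
{
    assert(*capacity > 0);
    (*index)++;
    if (*index >= *capacity) {
        *capacity *= 2;
        table = realloc(table, *capacity * sizeof *table);
        if (table == NULL)
            abort();
    }
    return table;
}

struct cell {
    struct cell *next;
};

/* Free a linked list without recursion; returns the number of cells. */
size_t list_free_flat(struct cell *head)
{
    size_t n = 0;
    struct cell *next;
    while (head != NULL) {
        next = head->next;   /* read the link before the cell goes away */
        free(head);
        head = next;
        n++;
    }
    return n;
}
```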
       
  Q. What are final checks?
  
   A.
     * every .c file must include its own .h file...
     * track unused variables and stray debugging hooks
     * using ``int'' is like playing with fire. We make no assumption
       about the size of int, except that the type must properly encode
       the height or width of a scanned image, that is, about 3000 at
       most (this is the order of magnitude). int must also be able to
       hold GRANULARITY. Anything homogeneous to int*int must be stored
       in a double.
     * every extern function should be declared extern even if it's the
       default.
     * static variables?
     * run gprof to find the slow spots
       
  Q. How to add classifiers?
  
   A. This is trivial: there are numerical (=returns double) classifiers
   and categorical (=returns int) ones. Just look at the way one is
   written and mimic it. One trouble is that you obviously have to
   rebuild the base of learned samples when you modify the set of
   classifiers...
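
   To give an idea of the two kinds, here is a sketch with one numerical
   and one categorical classifier, registered in two tables. The glyph
   struct, the function names and the tables are hypothetical; the real
   quinefaut interfaces may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one character outline. */
struct glyph {
    double width, height;
    int holes;
};

/* Numerical classifier: returns a double. */
double classifier_hw_ratio(const struct glyph *g)
{
    assert(g->width > 0.0);
    return g->height / g->width;
}

/* Categorical classifier: returns an int. */
int classifier_holes(const struct glyph *g)
{
    return g->holes;
}

typedef double (*numerical_fn)(const struct glyph *);
typedef int (*categorical_fn)(const struct glyph *);

/* Adding a classifier = writing its function and adding it here. */
static const numerical_fn numerical[] = { classifier_hw_ratio };
static const categorical_fn categorical[] = { classifier_holes };

#define N_NUMERICAL (sizeof numerical / sizeof numerical[0])
#define N_CATEGORICAL (sizeof categorical / sizeof categorical[0])

/* Fill the vector of classifiers for one glyph. */
void glyph_vector(const struct glyph *g, double *num, int *cat)
{
    size_t i;
    for (i = 0; i < N_NUMERICAL; i++)
        num[i] = numerical[i](g);
    for (i = 0; i < N_CATEGORICAL; i++)
        cat[i] = categorical[i](g);
}
```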
   
  Q. How to speed up?
  
   A. There are at least two algorithms that are very badly implemented,
   but the speed of today's machines hides this (assignment of holes to
   contours, composition of lines). The data type for a path is a bit
   naive, since each time you register a new pair (x,y) you call malloc
   for approx 30 bytes. It would be much better to work with lists of
   100-pair arrays, or with realloc. I have no clear estimate of the
   time lost malloc'ing 100 times when you could do it just once.
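
   The ``lists of 100-pair arrays'' idea could look like the sketch
   below: one malloc per 100 pairs instead of one per pair. This is not
   the current quinefaut path type; all names are hypothetical, and the
   mallocs counter is there only to make the saving visible.

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNK_PAIRS 100

struct chunk {
    double x[CHUNK_PAIRS], y[CHUNK_PAIRS];
    size_t used;            /* pairs stored in this chunk */
    struct chunk *next;
};

struct path {
    struct chunk *first, *last;
    size_t length;          /* total number of pairs */
    size_t mallocs;         /* number of malloc calls so far */
};

void path_init(struct path *p)
{
    p->first = p->last = NULL;
    p->length = 0;
    p->mallocs = 0;
}

/* Register a new pair (x,y); allocate only when the last chunk is full. */
void path_append(struct path *p, double x, double y)
{
    struct chunk *c;
    assert(p != NULL);
    c = p->last;
    if (c == NULL || c->used == CHUNK_PAIRS) {
        c = malloc(sizeof *c);
        if (c == NULL)
            abort();
        c->used = 0;
        c->next = NULL;
        if (p->last != NULL)
            p->last->next = c;
        else
            p->first = c;
        p->last = c;
        p->mallocs++;
    }
    c->x[c->used] = x;
    c->y[c->used] = y;
    c->used++;
    p->length++;
}

/* Free all chunks in a flat way and reset the path. */
void path_free(struct path *p)
{
    struct chunk *c = p->first, *next;
    while (c != NULL) {
        next = c->next;
        free(c);
        c = next;
    }
    path_init(p);
}
```

   The price is up to 99 unused pairs in the last chunk of every path,
   which matters if most paths are short.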
   
  Q. Planned troubles
  
   A.
     * correct_orientation fails on Chinese, and on lots and lots of
       others
     * consider pnmrotate as a faster and more precise rotation routine
     * find a good bend function
     * dangerous bend: basename is already a valid name in string.h
       
  Q. What about GIMP plug-ins?
  
   A. Obviously, an OCR would make a very interesting plug-in for the
   GIMP, since xscanimage is already very well interfaced with the GIMP.
   I downloaded some simple plug-ins such as rand-noted.c or stereogram.c
   (this one is excellent, thanks to Francisco Bustamante) to compile
   them and see what it's like to program with GTK. Compile with:
gcc `gtk-config --cflags --libs` -lgimp -lgimpui stereogram.c -o stereogram; mv stereogram ~/.gimp/plug-ins/

   I made some changes: it is easy but time-consuming. Note that the
   bitmap->outline conversion routine and the deskewing routine could
   make interesting separate plug-ins; think of the gyve project. Well,
   here is a beautiful stereogram:
   
                             [image: stereogram]
