                         How cvs2svn Works
                         =================

A cvs2svn run consists of nine passes.  Each pass saves the data it
produces to files on disk, so that a) we don't hold huge amounts of
state in memory, and b) the conversion process is resumable.

CollectRevsPass (formerly called pass1)
===============

The goal of this pass is to write a summary of each CVS file as a
pickled CVSFile to 'cvs2svn-cvs-files.db', and a summary of each CVS
file revision as a pickled CVSRevision to 'cvs2svn-cvs-items.db'.  In
each case, items are assigned an arbitrary key that is used to refer
to them.

We walk over the repository, collecting data about the RCS files into
an instance of CollectData.  Each RCS file is processed with
rcsparse.parse(), which invokes callbacks from an instance of
cvs2svn's _FileDataCollector class (which is a subclass of
rcsparse.Sink).

For each RCS file, the first thing the parser encounters is the
administrative header, including the head revision, the principal
branch, symbolic names, RCS comments, etc.  The main thing that
happens here is that _FileDataCollector.define_tag() is invoked on
each symbolic name and its attached revision, so all the tags and
branches of this file get collected.  When this stage is done, the
parser invokes admin_completed(), which writes the CVSFile to the
database.

Next, the parser hits the revision summary section.  That's the part
of the RCS file that looks like this:

   1.6
   date	2002.06.12.04.54.12;	author captnmark;	state Exp;
   branches
   	1.6.2.1;
   next	1.5;

   1.5
   date	2002.05.28.18.02.11;	author captnmark;	state Exp;
   branches;
   next	1.4;

   [...]

For each revision summary, _FileDataCollector.define_revision() is
invoked, recording that revision's metadata in various variables of
the _FileDataCollector class instance.

After finishing the revision summaries, the parser invokes
_FileDataCollector.tree_completed(), which loops over the revision
information stored, determining if there are instances where a higher
revision was committed "before" a lower one (rare, but it can happen
when there was clock skew on the repository machine).  If there are
any, it "resyncs" the timestamp of the earlier rev to be just before
that of the later rev, but saves the original timestamp in
self._rev_data[blah].original_timestamp, so we can later write out a
record to the resync file indicating that an adjustment was made (this
makes it possible to catch the other parts of this commit and resync
them similarly; more details below).
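
The clock-skew adjustment can be sketched roughly as follows.  This is
a minimal illustration, not cvs2svn's actual code: the RevData class
and the resync_timestamps() function are hypothetical names, and the
one-second step is an assumption about what "just before" means.

```python
class RevData(object):
    """Hypothetical stand-in for the per-revision data of one file."""

    def __init__(self, rev, timestamp):
        self.rev = rev
        self.timestamp = timestamp
        # Kept so that a resync record can be written out later.
        self.original_timestamp = timestamp


def resync_timestamps(revs):
    """Make the timestamps in REVS strictly increasing, in place.

    REVS is a list of RevData ordered from lowest to highest revision.
    Walk from the highest revision downwards; whenever a lower revision
    is not older than its successor, move it to one second before the
    successor.  Return the list of revisions that were adjusted.
    """
    adjusted = []
    for i in range(len(revs) - 2, -1, -1):
        earlier, later = revs[i], revs[i + 1]
        if earlier.timestamp >= later.timestamp:
            earlier.timestamp = later.timestamp - 1
            adjusted.append(earlier)
    return adjusted


# Revision 1.4 claims to be younger than 1.5 because of clock skew.
revs = [RevData('1.4', 1000), RevData('1.5', 990), RevData('1.6', 1010)]
adjusted = resync_timestamps(revs)
```

Here only 1.4 is adjusted: its timestamp becomes 989, while its
original_timestamp stays 1000.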

Next, the parser encounters the *real* revision data, which has the
log messages and file contents.  For each revision, it invokes
_FileDataCollector.set_revision_info(), which writes a record to
'cvs2svn-cvs-items.db'.

Also, for resync'd revisions, a line like this is written out to
'cvs2svn-resync.txt':

   3d6c1329 18a 3d6c1328

The fields are:

   NEW_TIMESTAMP   METADATA_ID   OLD_TIMESTAMP

(The resync file will be explained later.)
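
As an illustration of the record layout, here is a small sketch that
writes and reads such a line.  The function names are hypothetical,
and it assumes that all three fields are hexadecimal with the
timestamps padded to eight digits, which matches the sample line above
but is not spelled out in these notes.

```python
def format_resync_line(new_timestamp, metadata_id, old_timestamp):
    """Return 'NEW_TIMESTAMP METADATA_ID OLD_TIMESTAMP' in hex."""
    return '%08x %x %08x' % (new_timestamp, metadata_id, old_timestamp)


def parse_resync_line(line):
    """Return (new_timestamp, metadata_id, old_timestamp) as ints."""
    new_hex, metadata_hex, old_hex = line.split()
    return int(new_hex, 16), int(metadata_hex, 16), int(old_hex, 16)


line = format_resync_line(0x3d6c1329, 0x18a, 0x3d6c1328)
fields = parse_resync_line(line)
```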

That's it -- the RCS file is done.

When every CVS file is done, CollectRevsPass is complete, and:

   - 'cvs2svn-cvs-files.db' contains a record of every CVS file.

   - 'cvs2svn-cvs-items.db' contains a summary of every revision to
     every CVS file, including a reference to the corresponding CVS
     file record in 'cvs2svn-cvs-files.db'.  The order of the
     revisions is arbitrary.  In other words, a multi-file commit will
     be scattered all over the place.

   - 'cvs2svn-a-revs.txt' contains a list of CVSRevision keys that are
     in 'cvs2svn-cvs-items.db', in the order that they were written.

   - 'cvs2svn-resync.txt' contains a small amount of resync data, in
     no particular order.

   - 'cvs2svn-symbol-stats.txt' contains one line for each symbol that
     was seen in the CVS repository.  The fields are:

         ID NAME TAG_COUNT BRANCH_COUNT BRANCH_COMMIT_COUNT BLOCKERS

     where ID is a unique number identifying this symbol (in
     hexadecimal), NAME is the symbol name, TAG_COUNT and BRANCH_COUNT
     are the number of CVS files on which this symbol was used as a
     tag or branch respectively, and BRANCH_COMMIT_COUNT is the number
     of files for which commits were made on a branch with the given
     name.  BLOCKERS is a space-separated list of tags and branches
     that were defined on branches named NAME (a branch cannot be
     excluded if it has any blockers that are not also being
     excluded).  These data are used to look for inconsistencies in
     the use of symbols under CVS and to decide which symbols can be
     excluded or forced to be branches and/or tags.

   - 'cvs2svn-metadata.db' contains information that will help
     determine what CVSRevisions might have been made together and
     will therefore be combined into a single SVNCommit.  CVSRevisions
     that were part of a single CVS commit always have a common author
     and log message.  This database contains two mappings for each
     (author, log_msg,) combination:

     digest (40-byte string) -> metadata_id (int)

     metadata_id (int as hex) -> (author, log_msg,) (tuple)

     The digest is used to locate the metadata_id for the metadata
     record having a specific (author, log_msg,) tuple.  The
     metadata_id, in turn, is used as a key to locate the actual
     metadata.  CVSRevision records include the metadata_id.
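
The two mappings in 'cvs2svn-metadata.db' can be pictured with the
following sketch.  It is an assumption-laden illustration: the
MetadataTable class is a hypothetical name, plain dictionaries stand
in for the database, and SHA-1 is assumed as the digest only because
its hex form is 40 characters long.

```python
import hashlib


class MetadataTable(object):
    """Hypothetical in-memory model of the two metadata mappings."""

    def __init__(self):
        self._digest_to_id = {}    # digest -> metadata_id (int)
        self._id_to_metadata = {}  # metadata_id as hex -> (author, log_msg)

    @staticmethod
    def digest(author, log_msg):
        # Assumed digest: SHA-1 over author and log message.
        data = (author + '\n' + log_msg).encode('utf-8')
        return hashlib.sha1(data).hexdigest()

    def get_id(self, author, log_msg):
        """Return the metadata_id for this pair, allocating if needed."""
        digest = self.digest(author, log_msg)
        if digest not in self._digest_to_id:
            metadata_id = len(self._digest_to_id) + 1
            self._digest_to_id[digest] = metadata_id
            self._id_to_metadata['%x' % metadata_id] = (author, log_msg)
        return self._digest_to_id[digest]

    def get_metadata(self, metadata_id):
        """Return the (author, log_msg) tuple for METADATA_ID."""
        return self._id_to_metadata['%x' % metadata_id]


table = MetadataTable()
first = table.get_id('captnmark', 'Doc tweaks.')
second = table.get_id('captnmark', 'Doc tweaks.')
other = table.get_id('captnmark', 'Fix build.')
```

Two revisions with the same author and log message get the same
metadata_id, which is what later lets them be grouped together.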


CollateSymbolsPass
==================

Use the symbol statistics collected in CollectRevsPass and any
command-line options to determine which symbols should be treated as
branches, which as tags, and which symbols should be excluded from the
conversion altogether.

Create 'cvs2svn-symbols.dat', which contains a pickle of a list of
BranchSymbol, TagSymbol, and ExcludedSymbol objects indicating how
each symbol should be processed in the conversion.


ResyncRevsPass (formerly called pass2)
==============

This is where the resync file is used.  The goal of this pass is to
output the information from 'cvs2svn-cvs-items.db' to a new file,
'cvs2svn-cvs-items-resync.db' (clean revs).  It has the same content
as the original file, except for some resync'd timestamps.

First, read the whole resync file into a hash table that maps each
metadata_id to a list of lists.  Each sublist represents one of the
timestamp adjustments from CollectRevsPass, and looks like this:

   [old_time_lower, old_time_upper, new_time]

The reason to map each metadata_id to a list of sublists, instead of
to one list, is that sometimes you'll get the same metadata for
unrelated commits (for example, the same author commits many times
using the empty log message, or a log message that just says "Doc
tweaks.").  So each metadata_id may need to "fan out" to cover
multiple commits, but without accidentally unifying those commits.
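
Reading the resync file into that hash table might look roughly like
this.  The sketch is not the real implementation: read_resync() is a
hypothetical name, the COMMIT_THRESHOLD value is assumed, and so is
the choice to start each range at half a threshold on either side of
the old timestamp.

```python
COMMIT_THRESHOLD = 5 * 60  # assumed value, in seconds


def read_resync(lines):
    """Map each metadata_id to a list of
    [old_time_lower, old_time_upper, new_time] sublists."""
    resync = {}
    for line in lines:
        new_hex, metadata_hex, old_hex = line.split()
        new_time = int(new_hex, 16)
        old_time = int(old_hex, 16)
        metadata_id = int(metadata_hex, 16)
        # One sublist per adjustment, so that unrelated commits that
        # share the same metadata are not unified by accident.
        resync.setdefault(metadata_id, []).append([
            old_time - COMMIT_THRESHOLD // 2,
            old_time + COMMIT_THRESHOLD // 2,
            new_time,
        ])
    return resync


resync = read_resync(['3d6c1329 18a 3d6c1328'])
```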

Now we loop over the CVSRevisions in 'cvs2svn-cvs-items.db', and for
each record write a line to 'cvs2svn-c-revs.txt'.  Each line of this
file looks like this:

   3dc32955 5a 12ab

The fields are:

   1.  a fixed-width timestamp
   2.  the metadata_id of the metadata (log message + author)
       associated with this CVSRevision, as a hexadecimal string.
   3.  the integer unique ID for this CVSRevision, as a hexadecimal
       string.
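
A line in this format can be produced with a sketch like the one
below.  The function name is hypothetical, and the eight-digit
hexadecimal timestamp is inferred from the sample line.  The fixed
width matters: it makes a plain textual sort of the file equivalent to
a chronological sort.

```python
def format_c_rev_line(timestamp, metadata_id, cvs_rev_id):
    """Return 'TIMESTAMP METADATA_ID CVS_REV_ID', all in hexadecimal.

    The timestamp is zero-padded to eight digits so that sorting the
    lines as text orders them by time.
    """
    return '%08x %x %x' % (timestamp, metadata_id, cvs_rev_id)


line = format_c_rev_line(0x3dc32955, 0x5a, 0x12ab)

# Textual order equals time order, even for very small timestamps.
early = format_c_rev_line(0x10, 0x5a, 0x1)
in_order = sorted([line, early]) == [early, line]
```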

Any CVSRevision record in 'cvs2svn-cvs-items.db' whose metadata_id
matches some resync entry and appears to be part of the same commit as
one of the sublists in that entry, gets tweaked.  The tweak is to
adjust the commit time of the line to the new_time, which is taken
from the resync hash and results from the adjustment described in
CollectRevsPass.

The way we figure out whether a given line needs to be tweaked is to
loop over all the sublists, seeing if this commit's original time
falls within the old<-->new time range for the current sublist.  If it
does, we tweak the line before writing it out, and then conditionally
adjust the sublist's range to account for the timestamp we just
adjusted (since it could be an outlier).  Note that this could, in
theory, result in separate commits being accidentally unified, since
we might gradually adjust the two sides of the range such that they are
eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
really a case of CVS not recording enough information to disambiguate
the commits; we'd know we have a time range that exceeds the
COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
up.  We could try some clever heuristic, but for now it's not
important -- after all, we're talking about commits that weren't
important enough to have a distinctive log message anyway, so does it
really matter if a couple of them accidentally get unified?  Probably
not.
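
The tweak described above can be sketched as follows.  This is a
simplified illustration with hypothetical names; the way the range is
widened (half a COMMIT_THRESHOLD around the timestamp just adjusted)
is an assumption, as is the threshold value.

```python
COMMIT_THRESHOLD = 5 * 60  # assumed value, in seconds


def resync_timestamp(resync, metadata_id, timestamp):
    """Return the possibly adjusted commit time for one CVSRevision.

    RESYNC maps metadata_id to a list of [lower, upper, new_time]
    sublists.  If TIMESTAMP falls inside one of the ranges, widen that
    range to cover TIMESTAMP (it could be an outlier) and return the
    new_time; otherwise return TIMESTAMP unchanged.
    """
    for record in resync.get(metadata_id, []):
        lower, upper, new_time = record
        if lower <= timestamp <= upper:
            record[0] = min(lower, timestamp - COMMIT_THRESHOLD // 2)
            record[1] = max(upper, timestamp + COMMIT_THRESHOLD // 2)
            return new_time
    return timestamp


resync = {0x18a: [[850, 1150, 1001]]}
a = resync_timestamp(resync, 0x18a, 1100)  # in range: tweaked
b = resync_timestamp(resync, 0x18a, 1200)  # in the widened range
c = resync_timestamp(resync, 0x18a, 2000)  # far away: untouched
d = resync_timestamp(resync, 0x5a, 1100)   # other metadata: untouched
```

The second call shows the gradual widening mentioned above: 1200 was
outside the original range, but is accepted after the first call
stretched the upper bound to 1250.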


SortRevsPass (formerly called pass3)
============

This is where we deduce the changesets, that is, the grouping of file
changes into single commits.

It's very simple -- run 'sort' on 'cvs2svn-c-revs.txt', converting it
to 'cvs2svn-s-revs.txt'.  Because of the way the data is laid out,
this causes commits with the same metadata_id (that is, the same
author and log message) to be grouped together.  Poof!  We now have
the CVS changes grouped by logical commit.

In some cases, the changes in a given commit may be interleaved with
other commits that went on at the same time, because the sort gives
precedence to date before metadata_id.  However, CreateDatabasesPass
detects this by seeing that the metadata_id is different, and
re-separates the commits.
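
The sort-then-separate idea can be illustrated with this sketch.  It
is not the code of either pass: group_changesets() is a hypothetical
name, the lines are sorted in memory instead of with 'sort', and the
rule for closing a commit (a gap of more than COMMIT_THRESHOLD
seconds, with an assumed value) is a simplification.

```python
COMMIT_THRESHOLD = 5 * 60  # assumed value, in seconds


def group_changesets(lines):
    """Group 'TIMESTAMP METADATA_ID CVS_REV_ID' lines into commits.

    Return a list of (metadata_id, [cvs_rev_id, ...]) tuples.  Lines
    with different metadata_ids never share a commit, even when they
    are interleaved in time.
    """
    open_commits = {}  # metadata_id -> [last_timestamp, [rev ids]]
    finished = []
    for line in sorted(lines):
        timestamp_hex, metadata_id, rev_id = line.split()
        timestamp = int(timestamp_hex, 16)
        commit = open_commits.get(metadata_id)
        if commit is not None and timestamp - commit[0] > COMMIT_THRESHOLD:
            # Same metadata, but too late to be the same commit.
            finished.append((metadata_id, commit[1]))
            commit = None
        if commit is None:
            commit = open_commits[metadata_id] = [timestamp, []]
        commit[0] = timestamp
        commit[1].append(rev_id)
    for metadata_id, commit in open_commits.items():
        finished.append((metadata_id, commit[1]))
    return finished


lines = [
    '3dc32957 5a 12ad',
    '3dc32955 5a 12ab',
    '3dc33000 5a 12ae',  # same metadata, but much later
    '3dc32956 7 12ac',   # interleaved commit with other metadata
]
groups = group_changesets(lines)
```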


CreateDatabasesPass (formerly called pass4)
===================

Find and create a database containing the last CVS revision that is a
source (also referred to as an "opening" revision) for each symbol.
This will result in a database containing key-value pairs whose key is
the id for a CVSRevision, and whose value is a list of symbol ids for
which that CVSRevision is the last "opening."

The format for this file is:

    'cvs2svn-symbol-last-cvs-revs.db':
         Key                      Value
         CVS Revision ID          array of symbol ids

    For example:

         5c                      --> [3, 8]
         62                      --> [15]
         4d                      --> [29, 5]
         f                       --> [18, 12]
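
    A sketch of how such a mapping could be computed is shown below.
    The function name and the input format are hypothetical: the
    input is assumed to be a sequence of (cvs_rev_id, symbol_ids)
    pairs in the order the revisions are processed, where symbol_ids
    lists the symbols for which that revision is a source.

```python
def last_openings(revisions):
    """Return {cvs_rev_id as hex: [symbol ids]} for the last openings.

    REVISIONS is an iterable of (cvs_rev_id, symbol_ids) pairs in
    processing order.  A later pair overrides an earlier one for the
    same symbol, so each symbol ends up with its last source only.
    """
    last = {}  # symbol_id -> id of the last CVSRevision opening it
    for cvs_rev_id, symbol_ids in revisions:
        for symbol_id in symbol_ids:
            last[symbol_id] = cvs_rev_id

    # Invert the mapping: revision id -> symbols it opens last.
    result = {}
    for symbol_id, cvs_rev_id in sorted(last.items()):
        result.setdefault('%x' % cvs_rev_id, []).append(symbol_id)
    return result


revisions = [
    (0xf, [18, 12]),
    (0x4d, [29, 5, 3]),  # symbol 3 is opened again by 5c below
    (0x5c, [3, 8]),
    (0x62, [15]),
]
table = last_openings(revisions)
```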


AggregateRevsPass (formerly called pass5)
=================

Primarily, this pass gathers CVS revisions into Subversion revisions
(a Subversion revision consists of one or more CVS revisions) before
we actually begin committing (where "committing" means either to a
Subversion repository or to a dump file).

This pass does the following:

1. Creates a database file to map Subversion revision numbers to
   SVNCommit instances ('cvs2svn-svn-commits.db').  Creates another
   database file to map CVS Revisions to their Subversion Revision
   numbers ('cvs2svn-cvs-revs-to-svn-revnums.db').

2. When a file is copied to a symbolic name in cvs2svn, there is a
   range of valid Subversion revisions that we can copy the file from.
   The first valid Subversion revision number for a symbolic name is
   called the "Opening", and the first *invalid* Subversion revision
   number encountered after the "Opening" is called the "Closing".  In
   this pass, the SymbolingsLogger class writes out a line (for each
   symbolic name that it opens) to cvs2svn-symbolic-names.txt if it is
   the first possible source revision (the "opening" revision) for a
   copy to create a branch or tag, or if it is the last possible
   revision (the "closing" revision) for a copy to create a branch or
   tag.  Not every opening will have a corresponding closing.

   The format of each line is:

       SYMBOL_ID SVN_REVNUM TYPE BRANCH_ID CVS_FILE_ID

   For example:

       1c 00000234 O * 1a7
       34 00000245 O * 1a9
       18a 00000241 C 34 1a7
       122 00000201 O 7e 1b3

   Here is what the columns mean:

   SYMBOL_ID: The id of the branch or tag that starts or ends in this
              CVS Revision (there can be multiples per CVS rev).

   SVN_REVNUM: The Subversion revision number that is the opening or
               closing for this SYMBOL_ID.  This number is written
               with a fixed number of digits so that the file sorts
               correctly.

   TYPE: "O" for Openings and "C" for Closings.

   BRANCH_ID: The id of the branch where this opening or closing
              happened.  '*' denotes the default branch.

   CVS_FILE_ID: The ID of the CVS file where this opening or closing
                happened, in hexadecimal.

   See SymbolingsLogger for more details.
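
   The line format can be illustrated with the sketch below.  It is
   not taken from SymbolingsLogger: the function names are
   hypothetical, and it assumes that SVN_REVNUM is decimal and padded
   to eight digits while the ids are hexadecimal, as the sample lines
   suggest.

```python
OPENING = 'O'
CLOSING = 'C'


def format_symboling(symbol_id, svn_revnum, type_, branch_id, cvs_file_id):
    """Return 'SYMBOL_ID SVN_REVNUM TYPE BRANCH_ID CVS_FILE_ID'.

    BRANCH_ID may be None, which is written as '*' (default branch).
    """
    if branch_id is None:
        branch = '*'
    else:
        branch = '%x' % branch_id
    return '%x %08d %s %s %x' % (
        symbol_id, svn_revnum, type_, branch, cvs_file_id)


def parse_symboling(line):
    """Inverse of format_symboling(); returns a 5-tuple."""
    symbol_hex, revnum, type_, branch, file_hex = line.split()
    if branch == '*':
        branch_id = None
    else:
        branch_id = int(branch, 16)
    return (int(symbol_hex, 16), int(revnum), type_, branch_id,
            int(file_hex, 16))


opening = format_symboling(0x1c, 234, OPENING, None, 0x1a7)
closing = format_symboling(0x18a, 241, CLOSING, 0x34, 0x1a7)
```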


SortSymbolsPass (formerly called pass6)
===============

This pass merely sorts 'cvs2svn-symbolic-names.txt' into
'cvs2svn-symbolic-names-s.txt'.  This orders the file first by
symbolic name, and second by Subversion revision number, thus grouping
all openings and closings for each symbolic name together.


IndexSymbolsPass (formerly called pass7)
================

This pass iterates through all the lines in
'cvs2svn-symbolic-names-s.txt', writing out a database file
('cvs2svn-symbolic-name-offsets.db') mapping SYMBOL_ID to the file
offset in 'cvs2svn-symbolic-names-s.txt' where SYMBOL_ID is first
encountered.  This will allow us to seek to the various offsets in the
file and sequentially read only the openings and closings that we
need.
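
The offset index and its use can be sketched like this.  The function
names are hypothetical, a dictionary stands in for the database file,
and an in-memory io.BytesIO object stands in for the sorted text file.

```python
import io


def index_symbol_offsets(f):
    """Map each SYMBOL_ID to the offset of its first line in F.

    F must be a binary file whose lines are sorted by SYMBOL_ID.
    """
    offsets = {}
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        symbol_id = line.split(b' ', 1)[0]
        if symbol_id not in offsets:
            offsets[symbol_id] = offset
    return offsets


def read_symbolings(f, offsets, symbol_id):
    """Return only the lines for SYMBOL_ID, by seeking to its offset."""
    f.seek(offsets[symbol_id])
    lines = []
    for line in iter(f.readline, b''):
        if line.split(b' ', 1)[0] != symbol_id:
            break
        lines.append(line.rstrip(b'\n'))
    return lines


unsorted_lines = [
    b'1c 00000234 O * 1a7\n',
    b'34 00000245 O * 1a9\n',
    b'18a 00000241 C 34 1a7\n',
    b'1c 00000250 C * 1a7\n',
    b'122 00000201 O 7e 1b3\n',
]
sorted_file = io.BytesIO(b''.join(sorted(unsorted_lines)))
offsets = index_symbol_offsets(sorted_file)
symbolings = read_symbolings(sorted_file, offsets, b'1c')
```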


OutputPass (formerly called pass8)
==========

This pass has very little "thinking" to do--it basically opens the
svn-nums-to-cvs-revs.db and, starting with Subversion revision 2
(revision 1 creates /trunk, /tags, and /branches), sequentially plays
out all the commits to either a Subversion repository or to a
dumpfile.

In --dump-only mode, the result of this pass is a Subversion
repository dumpfile (suitable for input to 'svnadmin load').  The
dumpfile is the data's last static stage: last chance to check over
the data, run it through svndumpfilter, move the dumpfile to another
machine, etc.

However, when not in --dump-only mode, no full dumpfile is created for
subsequent load into a Subversion repository.  Instead, miniature
dumpfiles representing a single revision are created, loaded into the
repository, and then removed.

In both modes, the dumpfile revisions are created by walking through
'cvs2svn-s-revs.txt'.

The databases 'cvs2svn-svn-nodes.db' and 'cvs2svn-svn-revisions.db'
form a skeletal (metadata only, no content) mirror of the repository
structure that cvs2svn is creating.  They provide data about previous
revisions that cvs2svn requires while constructing the dumpstream.


                  ===============================
                      Branches and Tags Plan.
                  ===============================

This pass is also where tag and branch creation is done.  Since
Subversion does tags and branches by copying from existing revisions
(then maybe editing the copy, making subcopies underneath, etc.), the
big question for cvs2svn is how to achieve the minimum number of
operations per creation.  For example, if it's possible to get the
right tag by just copying revision 53, then it's better to do that
than, say, copying revision 51 and then sub-copying in bits of
revision 52 and 53.

Also, since CVS does not version symbolic names, there is the
secondary question of *when* to create a particular tag or branch.
For example, a tag might have been made at any time after the youngest
commit included in it, or might even have been made piecemeal; and the
same is true for a branch, with the added constraint that for any
particular file, the branch must have been created before the first
commit on the branch.

Answering the second question first: cvs2svn creates tags as soon as
possible and branches as late as possible.

Tags are created as soon as cvs2svn encounters the last CVS Revision
that is a source for that tag.  The whole tag is created in one
Subversion commit.

For branches, this is "just in time" creation -- the moment it sees
the first commit on a branch, it snaps the entire branch into
existence (or as much of it as possible), and then outputs the branch
commit.

The reason we say "as much of it as possible" is that it's possible to
have a branch where some files have branch commits occurring earlier
than the other files even have the source revisions from which the
branch sprouts (this can happen if the branch was created piecemeal,
for example).  In this case, we create as much of the branch as we
can, that is, as much of it as there are source revisions available to
copy, and leave the rest for later.  "Later" might mean just until
other branch commits come in, or else during a cleanup stage that
happens at the end of this pass (about which more later).

How just-in-time branch creation works:

In order to make the "best" set of copies/deletes when creating a
branch, cvs2svn keeps track of two sets of trees while it's making
commits:

   1. A skeleton mirror of the Subversion repository, that is, an
      array of revisions, with a tree hanging off each revision.  (The
      "array" is actually implemented as an anydbm database itself,
      mapping string representations of numbers to root keys.)

   2. A tree for each CVS symbolic name, and the svn file/directory
      revisions from which various parts of that tree could be copied.

Both tree sets live in anydbm databases, using the same basic schema:
unique keys map to marshal.dumps() representations of dictionaries,
which in turn map entry names to other unique keys:

   root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
   entrykey1 ==> { entrynameX : entrykeyX, ... }
   entrykey2 ==> { entrynameY : entrykeyY, ... }
   entrykeyX ==> { etc, etc ...}
   entrykeyY ==> { etc, etc ...}

(The leaf nodes -- files -- are also dictionaries, for simplicity.)
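
The schema can be illustrated with the following sketch.  The
TreeStore class and its methods are hypothetical names, and a plain
dictionary stands in for the anydbm database; only the use of
marshal.dumps() for the node dictionaries comes from the description
above.

```python
import marshal


class TreeStore(object):
    """Hypothetical model of a tree stored as key -> marshalled dict."""

    def __init__(self):
        self._db = {}       # stand-in for the anydbm database
        self._next_key = 0

    def add_node(self, entries=None):
        """Store a node mapping entry names to keys; return its key."""
        key = '%x' % self._next_key
        self._next_key += 1
        self._db[key] = marshal.dumps(entries or {})
        return key

    def get_node(self, key):
        """Return the {entryname: entrykey} dictionary for KEY."""
        return marshal.loads(self._db[key])

    def lookup(self, root_key, path):
        """Return the key of PATH under ROOT_KEY, or None if absent."""
        key = root_key
        for name in path.split('/'):
            entries = self.get_node(key)
            if name not in entries:
                return None
            key = entries[name]
        return key


store = TreeStore()
readme_key = store.add_node()  # leaf (file): an empty dictionary
trunk_key = store.add_node({'README': readme_key})
root_key = store.add_node({'trunk': trunk_key})
found = store.lookup(root_key, 'trunk/README')
missing = store.lookup(root_key, 'tags/v1')
```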

The repository mirror allows cvs2svn to remember what paths exist in
what revisions.

For details on how branches and tags are created, please see the
docstring of the SymbolingsLogger class (and its methods).

-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-

Some older notes and ideas about cvs2svn.  Not deleted, because they
may contain suggestions for future improvements in design.

-----------------------------------------------------------------------

An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
considerations for the tool.

------
From: John Gardiner Myers <jgmyers@speakeasy.net>
Subject: Thoughts on CVS to SVN conversion
To: gstein@lyra.org
Date: Sun, 15 Apr 2001 17:47:10 -0700

Some things you may want to consider for a CVS to SVN conversion utility:

If converting a CVS repository to SVN takes days, it would be good for
the conversion utility to keep its progress state on disk.  If the
conversion fails halfway through due to a network outage or power
failure, that would allow the conversion to be resumed where it left off
instead of having to start over from an empty SVN repository.

It is a short step from there to allowing periodic updates of a
read-only SVN repository from a read/write CVS repository.  This allows
the more relaxed conversion procedure:

1) Create SVN repository writable only by the conversion tool.
2) Update SVN repository from CVS repository.
3) Announce the time of CVS to SVN cutover.
4) Repeat step (2) as needed.
5) Disable commits to CVS repository, making it read-only.
6) Repeat step (2).
7) Enable commits to SVN repository.
8) Wait for developers to move their workspaces to SVN.
9) Decommission the CVS repository.

You may forward this message or parts of it as you see fit.
------

-----------------------------------------------------------------------

Further design thoughts from Greg Stein <gstein@lyra.org>

* timestamp the beginning of the process. ignore any commits that
  occur after that timestamp; otherwise, you could miss portions of a
  commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
  revision for items in B; we missed A)

* the above timestamp can also be used for John's "grab any updates
  that were missed in the previous pass."

* for each file processed, watch out for simultaneous commits. this
  may cause a problem during the reading/scanning/parsing of the file,
  or the parse succeeds but the results are garbaged. this could be
  fixed with a CVS lock, but I'd prefer read-only access.

  algorithm: get the mtime before opening the file. if an error occurs
  during reading, and the mtime has changed, then restart the file. if
  the read is successful, but the mtime changed, then restart the
  file.

* use a separate log to track unique branches and non-branched forks
  of revision history (Q: is it possible to create, say, 1.4.1.3
  without a "real" branch?). this log can then be used to create a
  /branches/ directory in the SVN repository.

  Note: we want to determine some way to coalesce branches across
  files. It can't be based on name, though, since the same branch name
  could be used in multiple places, yet they are semantically
  different branches. Given files R, S, and T with branch B, we can
  tie those files' branch B into a "semantic group" whenever we see
  commit groups on a branch touching multiple files. Files that have
  a (named) branch but no commits on it are simply ignored. For
  each "semantic group" of a branch, we'd create a branch based on
  their common ancestor, then make the changes on the children as
  necessary. For single-file commits to a branch, we could use
  heuristics (pathname analysis) to add these to a group (and log what
  we did), or we could put them in a "reject" kind of file for a human
  to tell us what to do (the human would edit a config file of some
  kind to instruct the converter).

* if we have access to the CVSROOT/history, then we could process tags
  properly. otherwise, we can only use heuristics or configuration
  info to group up tags (branches can use commits; there are no
  commits associated with tags)

* ideally, we store every bit of data from the ,v files to enable a
  complete restoration of the CVS repository. this could be done by
  storing properties with CVS revision numbers and stuff (i.e. all
  metadata not already embodied by SVN would go into properties)

* how do we track the "states"? I presume "dead" is simply deleting
  the entry from SVN. what are the other legal states, and do we need
  to do anything with them?

* where do we put the "description"? how about locks, access list,
  keyword flags, etc.

* note that using something like the SourceForge repository will be an
  ideal test case. people *move* their repositories there, which means
  that all kinds of stuff can be found in those repositories, from
  wherever people used to run them, and under whatever development
  policies may have been used.

  For example: I found one of the projects with a "permissions 644;"
  line in the "gnuplot" repository.  Most RCS releases issue warnings
  about that (although they properly handle/skip the lines), and CVS
  ignores RCS newphrases altogether.
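
The mtime-based restart algorithm suggested above for guarding against
simultaneous commits could look roughly like this.  It is only a
sketch of the idea, not part of cvs2svn: read_stable() and its
parameters are hypothetical, and the retry limit is an added
assumption so that the loop cannot run forever.

```python
import os


def read_stable(path, parse, max_attempts=5):
    """Call PARSE on the open file at PATH, restarting on changes.

    Get the mtime before opening the file.  If parsing fails and the
    mtime has changed, restart; if parsing succeeds but the mtime has
    changed, restart as well.  Otherwise return the parse result.
    """
    for _ in range(max_attempts):
        mtime = os.stat(path).st_mtime
        try:
            with open(path, 'rb') as f:
                result = parse(f)
        except Exception:
            if os.stat(path).st_mtime != mtime:
                continue  # the file changed under us: try again
            raise         # a genuine parse error
        if os.stat(path).st_mtime == mtime:
            return result
    raise RuntimeError('%s kept changing while being read' % path)
```

A caller would pass its own parsing function, for example
read_stable(path, lambda f: f.read()) to simply fetch the contents.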

# vim:tw=70