C Token Comparator Version 1.3

	  	 (c) 2003, 2004, Warren Toomey wkt@tuhs.org
			   under the GPL license

INTRODUCTION
------------

The purpose of this set of programs is to allow you to compare
two sets of C code trees on a token basis, rather than on a
line by line basis. The programs help to identify similarities
between snippets of code in both trees, even if the actual lines
are dissimilar.

CHANGES
-------

1.3: Changed storage of identifiers to hashed values, so that
     you can check if the same variable names were used in
     the 2 code trees.

1.2: Further performance increases with the new Rabin-Karp
     string comparison algorithm; maybe it's better than O(n2) now
     but I'm not sure. No other changes or bug fixes.

1.1: The program's performance since 1.0 has been improved
     significantly but it is still O(n2) where n is the number of
     tokens in both input files. More importantly, several bugs were
     fixed that prevented all positive matches from bring reported.
     The format of the ctf file was also changed slightly.

BUILDING THE TOOLS
------------------

You will need: make, lex, a C Compiler, the fts(3) function
and a Unix or Linux box. Type 'make' to make the tools. If 
the make fails, you're on your own! The programs are small
enough to make the chore of fixing them not too difficult.
The code compiles and runs on FreeBSD 4.x and FreeBSD 5.x,
and Debian Linux with a 2.4.20 kernel.

TOKENISING A CODE TREE
----------------------

If you have a C code tree in /some/where/code and you want to
tokenise the files in the tree and store the result in a
"C Token File" called mytree.ctf, you would do:

  $ ./buildctf /some/where/code mytree.ctf

I recommend that you only do .c files for now, and remove the .h files.
The directory name can be relative or absolute (i.e it doesn't have to
start with a /).

VIEWING THE TOKENISED RESULTS FILE
----------------------------------

To view the resulting C token file, use the detok command:

  $ ./detok mytree.ctf | less

This shows you the basic C tokens for each file in your source tree.


COMPARING TWO SOURCE TREES
--------------------------

First, tokenise the two code trees:

  $ ./buildctf /this/is/the/first/codebase      tree1.ctf
  $ ./buildctf /and/heres/the/second/codebase   tree2.ctf

Then do the cross-compare and save the result

  $ ./ctcompare tree1.ctf tree2.ctf > results

This can be slow, as the algorithm is O(n2). The results have 3 columns:

#tokens_in_common	filename:line_number	filename:line_number

1st column is the number of C tokens found to be consecutive.
2nd column names a file and line number from tree1 where the run starts.
3rd column names a file and line number from tree2 where the run starts.

The program is able to detect runs of C code that are algorithmically
isomorphic, even if the variable names have changed, or declared in
different orders. See the source code of ctcompare.c for more details.

OPTIONS TO CTCOMPARE
--------------------

Usage: ctcompare [-p] [-t] [-s] [-x] [-i] [-I] [-n nnn] ctf_file ctf_file
        -p: print progress reports to stderr
        -t: show matching tokens when match is found
        -s: show matching source lines when match is found
        -x: set -s, print results side by side on same line
        -i: also match C preprocessor tokens (#include etc.)
        -I: turn off isomorphic code comparison
        -n nnn: Set the minimum matching run length to nnn, default 20

or ctcompare [-s] [-x] results_file
         where results_file is output of ctcompare ctf1 ctf2

   -n nnn	Set the minimum number of tokens in the run to nnn.
		Only runs of length nnn or greater will be reported.
		The default threshold value is 20.
   -i		Include C preprocessor tokens (#include etc.). This
		is normally disabled to stop long runs of #include
		being reported.
   -p		Print a progress report to stderr for each file
		we are working on in the first ctf file.
   -t		When a token run is found, print out the list of
		tokens in the token run to stdout.
   -s		When a token run is found, print out the lines from
		the file in the first code tree, followed by the lines
		in the second code tree.
   -x		When a token run is found, print out the lines from
	  	both files side-by-side on stdout. I recommend you
		have a 160 column wide terminal window with this
		option. If ctcompare can't open the source files,
		no output is produced; thus, -x is safe even when
		you don't have access to the original source files.
		Note: -t and -x are mutually exclusive.
			

CTF FILE FORMAT
---------------

The ctf file format was designed to be extensible, even though it
is currently very simplistic.

Each ctf file begins with a five octet header. The header starts
with the three octets 0x63 0x74 0x66, i.e "ctf" in ASCII. This is
followed by two octets. The first octet indicates the version number
of the ctf file. The second octet indicates the sub-version. At present,
we use version 1 and sub-version 2, so these octets will be 0x01 and 0x02,
respectively.

After the ctf header, the ctf file is composed of a set of
variable-length records, one for each file which has been tokenised.
Each record starts with the octet 0x09, followed by a filename in
ASCII which is terminated by a 0x00 octet. This is then followed
by a set of tokens for the file.

Most tokens are 1 octet long, and are listed in tokens.h. However, the
following two token values are followed by two further octets:

INTVAL:     0x39
IDENTIFIER: 0x5f

The two octets after one of these token represent, in big-endian format,
the unique value of that specific token in the file. For example,
take the short C code:

	void print_word(char *str) {
	  char *c;
	  
	  c=str;
	  while ((c!='\0') && (c!=' ')) putchar(c++);
	  putchar('\n');
 	}

There are several identifiers and character constants in the code.
As buildctf parses the file, it might build up this table:

	Identifier
	-------------
	1= print_word
	2= str
	3= c
	4= putchar

Thus, the code would be represented in the ctf file as follows:

VOID IDENTIFIER 1 OPENPAREN CHAR MULT IDENTIFIER 2 CLOSEPAREN OPENCURLY LINE
CHAR MULT IDENTIFIER 3 SEMICOLON LINE
LINE
IDENTIFIER 3 EQUALS IDENTIFIER 2 SEMICOLON LINE
WHILE OPENPAREN OPENPAREN IDENTIFIER 3 NOT EQUALS CHARCONST    etc etc.

The value 0 must not be used for identifier tokens in the ctf file,
although it can be used for numeric constant tokens.

The set of tokens representing a single file in the ctf file is
terminated either by the beginning 0x09 of a new file, or the end
of the ctf file itself. It goes without saying that the value 0x09
does not represent a token.