C Token Comparator Version 1.3 (c) 2003, 2004, Warren Toomey wkt@tuhs.org under the GPL license INTRODUCTION ------------ The purpose of this set of programs is to allow you to compare two sets of C code trees on a token basis, rather than on a line by line basis. The programs help to identify similarities between snippets of code in both trees, even if the actual lines are dissimilar. CHANGES ------- 1.3: Changed storage of identifiers to hashed values, so that you can check if the same variable names were used in the 2 code trees. 1.2: Further performance increases with the new Rabin-Karp string comparison algorithm; maybe it's better than O(n2) now but I'm not sure. No other changes or bug fixes. 1.1: The program's performance since 1.0 has been improved significantly but it is still O(n2) where n is the number of tokens in both input files. More importantly, several bugs were fixed that prevented all positive matches from bring reported. The format of the ctf file was also changed slightly. BUILDING THE TOOLS ------------------ You will need: make, lex, a C Compiler, the fts(3) function and a Unix or Linux box. Type 'make' to make the tools. If the make fails, you're on your own! The programs are small enough to make the chore of fixing them not too difficult. The code compiles and runs on FreeBSD 4.x and FreeBSD 5.x, and Debian Linux with a 2.4.20 kernel. TOKENISING A CODE TREE ---------------------- If you have a C code tree in /some/where/code and you want to tokenise the files in the tree and store the result in a "C Token File" called mytree.ctf, you would do: $ ./buildctf /some/where/code mytree.ctf I recommend that you only do .c files for now, and remove the .h files. The directory name can be relative or absolute (i.e it doesn't have to start with a /). VIEWING THE TOKENISED RESULTS FILE ---------------------------------- To view the resulting C token file, use the detok command: $ ./detok mytree.ctf | less This shows you the basic C tokens for each file in your source tree. COMPARING TWO SOURCE TREES -------------------------- First, tokenise the two code trees: $ ./buildctf /this/is/the/first/codebase tree1.ctf $ ./buildctf /and/heres/the/second/codebase tree2.ctf Then do the cross-compare and save the result $ ./ctcompare tree1.ctf tree2.ctf > results This can be slow, as the algorithm is O(n2). The results have 3 columns: #tokens_in_common filename:line_number filename:line_number 1st column is the number of C tokens found to be consecutive. 2nd column names a file and line number from tree1 where the run starts. 3rd column names a file and line number from tree2 where the run starts. The program is able to detect runs of C code that are algorithmically isomorphic, even if the variable names have changed, or declared in different orders. See the source code of ctcompare.c for more details. OPTIONS TO CTCOMPARE -------------------- Usage: ctcompare [-p] [-t] [-s] [-x] [-i] [-I] [-n nnn] ctf_file ctf_file -p: print progress reports to stderr -t: show matching tokens when match is found -s: show matching source lines when match is found -x: set -s, print results side by side on same line -i: also match C preprocessor tokens (#include etc.) -I: turn off isomorphic code comparison -n nnn: Set the minimum matching run length to nnn, default 20 or ctcompare [-s] [-x] results_file where results_file is output of ctcompare ctf1 ctf2 -n nnn Set the minimum number of tokens in the run to nnn. Only runs of length nnn or greater will be reported. The default threshold value is 20. -i Include C preprocessor tokens (#include etc.). This is normally disabled to stop long runs of #include being reported. -p Print a progress report to stderr for each file we are working on in the first ctf file. -t When a token run is found, print out the list of tokens in the token run to stdout. -s When a token run is found, print out the lines from the file in the first code tree, followed by the lines in the second code tree. -x When a token run is found, print out the lines from both files side-by-side on stdout. I recommend you have a 160 column wide terminal window with this option. If ctcompare can't open the source files, no output is produced; thus, -x is safe even when you don't have access to the original source files. Note: -t and -x are mutually exclusive. CTF FILE FORMAT --------------- The ctf file format was designed to be extensible, even though it is currently very simplistic. Each ctf file begins with a five octet header. The header starts with the three octets 0x63 0x74 0x66, i.e "ctf" in ASCII. This is followed by two octets. The first octet indicates the version number of the ctf file. The second octet indicates the sub-version. At present, we use version 1 and sub-version 2, so these octets will be 0x01 and 0x02, respectively. After the ctf header, the ctf file is composed of a set of variable-length records, one for each file which has been tokenised. Each record starts with the octet 0x09, followed by a filename in ASCII which is terminated by a 0x00 octet. This is then followed by a set of tokens for the file. Most tokens are 1 octet long, and are listed in tokens.h. However, the following two token values are followed by two further octets: INTVAL: 0x39 IDENTIFIER: 0x5f The two octets after one of these token represent, in big-endian format, the unique value of that specific token in the file. For example, take the short C code: void print_word(char *str) { char *c; c=str; while ((c!='\0') && (c!=' ')) putchar(c++); putchar('\n'); } There are several identifiers and character constants in the code. As buildctf parses the file, it might build up this table: Identifier ------------- 1= print_word 2= str 3= c 4= putchar Thus, the code would be represented in the ctf file as follows: VOID IDENTIFIER 1 OPENPAREN CHAR MULT IDENTIFIER 2 CLOSEPAREN OPENCURLY LINE CHAR MULT IDENTIFIER 3 SEMICOLON LINE LINE IDENTIFIER 3 EQUALS IDENTIFIER 2 SEMICOLON LINE WHILE OPENPAREN OPENPAREN IDENTIFIER 3 NOT EQUALS CHARCONST etc etc. The value 0 must not be used for identifier tokens in the ctf file, although it can be used for numeric constant tokens. The set of tokens representing a single file in the ctf file is terminated either by the beginning 0x09 of a new file, or the end of the ctf file itself. It goes without saying that the value 0x09 does not represent a token.