I have written a couple of simple programs to check for indications of plagiarism between undergraduate students, by cross-correlating syntactic and semantic information within the set of files submitted by the students.
The programs are written in Perl and use simple heuristics to extract the syntactic and semantic content of the files. They could easily be changed to glean information from files in other formats, e.g. Java programs, C++ programs, etc.
You can download the Fingerprint shar file, which is a text file containing the source to the programs. The user manual is below.
The two programs, find_newest and fingerprint, can be used to find pairs or groups of Ada programs or documentation files which have similarities. Note that what you end up with is a list of similarity `fingerprints', and you have to eyeball the files yourself to determine exactly how similar two or more files are.
Find_newest finds the most recent submission of each student. You need a list of student userids in a file called, say, studentlist to run this program. An example list of userids is:
alldc96 andpd96 andtn96 arcbd96 arcjem96 arcpj96 atkjt96 baalm96 baimc96 bakkj96 bansp96 ...
You run the script as follows:
$ find_newest studentlist
The Bourne shell script finds the most recent .a, .d and .t files submitted. These are Ada source files, documentation files and program output files, respectively. If your students are submitting files with different suffixes, you will want to alter this script. It's also very slooow, so any speed improvements are welcome! The script generates the files afiles, dfiles and tfiles, which list the most recent .a, .d and .t files, respectively.
This program is designed to be used on a set of files sent in by students using the ADFA submit program. In other words, there is a directory for submissions. Inside this directory there are subdirectories for each lab period where the submissions are kept. Filenames have the format:
userid_procid_realfilename
where userid is the student's username, procid is a unique number and realfilename is the name of the file as submitted by the student. The find_newest script should be run from the directory for submissions, and not from any of the subdirectories.
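As an illustration of the filename convention above, here is a minimal Python sketch (not the author's shell script) that splits a submission name into its three parts and picks the newest file per student. The function names are hypothetical, and it assumes that "most recent" means the latest modification time and that userids contain no underscores; the real find_newest may decide differently.

```python
import os


def parse_submission_name(filename):
    """Split 'userid_procid_realfilename' into (userid, procid, realfilename).

    Only the first two underscores are treated as separators, because the
    student's own file name may itself contain underscores (e.g. 'i_mis.a').
    Returns None if the name does not fit the pattern.
    """
    parts = os.path.basename(filename).split("_", 2)
    if len(parts) != 3 or not parts[1].isdigit():
        return None
    return parts[0], parts[1], parts[2]


def newest_by_student(paths_with_mtime, suffix):
    """Return {userid: path} holding each student's newest file with `suffix`.

    `paths_with_mtime` is an iterable of (path, modification_time) pairs.
    Assumption: newest means largest modification time.
    """
    newest = {}
    for path, mtime in paths_with_mtime:
        parsed = parse_submission_name(path)
        if parsed is None or not parsed[2].endswith(suffix):
            continue
        userid = parsed[0]
        if userid not in newest or mtime > newest[userid][1]:
            newest[userid] = (path, mtime)
    return {userid: path for userid, (path, _) in newest.items()}
```

The (path, time) pairs could be gathered with os.walk and os.path.getmtime from the submissions directory; calling newest_by_student once each for ".a", ".d" and ".t" would give lists comparable to afiles, dfiles and tfiles.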
Fingerprint is a large Perl script that does a number of things. It can find duplicate words, lines, Ada comments, Ada procedure or function names, Ada variables or strings enclosed in "" in the files named as command-line arguments. It prints out the fingerprints found, preceded by the number of times each was found.
Usage is:
fingerprint [-A|-F|-c|-w|-l|-s|-v|-p] file_of_filenames ...
    -A  Only print out the adjacency list, don't show fingerprints
    -F  List filenames where each fingerprint is found
    -c  Look for Ada comments (i.e. -- text .... )
    -w  Look for words (i.e. non-whitespace character runs)
    -l  Look for lines, ignoring leading whitespace
    -s  Look for characters enclosed in "" characters
    -v  Look for Ada variables.
    -p  Look for Ada subprogram declarations.
The Ada variable name and Ada subprogram declaration stuff is pretty crude, but it gives reasonable results.
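To make the idea of a fingerprint concrete, the following Python sketch shows one plausible way to collect line, comment and string fingerprints from a file's text and count them across files. It is not the Perl script itself. The label "str", the function names, and the choice to count each fingerprint once per file are all assumptions; the regular expressions are deliberately crude (a "--" inside a string literal is treated as a comment, for instance).

```python
import re
from collections import Counter

COMMENT_RE = re.compile(r"--\s*(.*)$")   # Ada comment: everything after "--"
STRING_RE = re.compile(r'"([^"]*)"')     # characters enclosed in "" characters


def fingerprints(text, kinds=("line", "cmt", "str")):
    """Return the set of (kind, value) fingerprints found in one file's text."""
    found = set()
    for raw in text.splitlines():
        line = raw.lstrip()              # ignore leading whitespace
        if not line:
            continue                     # skip blank lines
        if "line" in kinds:
            found.add(("line", line))
        if "cmt" in kinds:
            match = COMMENT_RE.search(line)
            if match:
                found.add(("cmt", match.group(1).rstrip()))
        if "str" in kinds:
            for value in STRING_RE.findall(line):
                found.add(("str", value))
    return found


def count_fingerprints(files):
    """Count, for each fingerprint, how many files contain it.

    `files` maps a file name to that file's text.
    Assumption: a fingerprint counts once per file, however often it repeats.
    """
    counts = Counter()
    for text in files.values():
        counts.update(fingerprints(text))
    return counts
```

Sorting the result with counts.most_common() gives a highest-first listing similar in spirit to the fingerprint output shown in this manual.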
Fingerprint is not dependent on the directory structure. All it needs is an input file which lists the names of the text files to examine. An example input file might be:
fri89/alldc96_13457_mis.a
thu89/andpd96_25068_i_mis.a
fri67/andtn96_29019_mis.a
thu67/arcjem96_1090_mis.a
wed67/arcpj96_14957_mis.a
wed45/baimc96_1199_i_mis.a
wed23/bakkj96_28586_i_mis.a
wed45/bennm96_908_mis.a
wed67/blim96_12372_i_mis.a
fri67/boujt96_26078_mis.a
wed45/carmr96_3201_mis.a
fri89/chegl96_22896_i_mis.a
tue67/clirl95_20252_i_mis.a
wed23/cooma96_27378_mis.a
fri89/coopa96_1355_i_mis.a
mon78/dalsr96_18975_mis.a
thu89/deaak96_8885_i_mis.a
thu67/denn96_29996_mis.a
. . .
Therefore, you can use fingerprint on any set of input files, and not just the specific directory structure used by submit. For the examples that follow, let's assume that we have three input files: afiles, dfiles, and tfiles. These hold the names of the Ada source files, the documentation files and the program output files, respectively.
You can now use the fingerprint script as follows:
$ fingerprint -c -l -s -v -p afiles > a_output
$ fingerprint -l dfiles > d_output
$ fingerprint -l tfiles > t_output
The first command would show the fingerprints for comments, lines, strings, variables and procedures in the Ada files. Adding -F to the flags would also list the files where each fingerprint occurred. The script takes a while to run, but if you look in the file a_output you should see output like:
96 line: end loop;
95 line: end record;
84 line: new_line;
83 line: end if;
75 var: profit
64 cmt:
63 proc: mis
61 line: --
57 line: use text_io;
56 line: with text_io;
56 line: procedure mis is
52 var: day
50 line: end mis;
46 var: sale_price
The number shows how many times the fingerprint occurred in the files. If you use fewer options on the command line, fewer fingerprints are found and the script goes faster!
After the list of fingerprints, you get an adjacency list showing the number of fingerprints shared between pairs of files:
Adjacency list
--------------
64 fri89/rigsp96_1418_i_mis.a tue89/fogce96_29016_i_mis.a
46 mon78/leeycg96_10811_i_mis.a tue89/sutmj96_426_i_mis.a
39 fri89/hicjl96_23046_mis.a thu67/yould96_1198_mis.a
35 fri67/goute96_13900_i_mis.a thu89/warja96_29296_i_mis.a
35 thu89/deaak96_8885_i_mis.a thu89/warja96_29296_i_mis.a
34 thu67/arcjem96_1090_mis.a thu89/deaak96_8885_i_mis.a
33 thu67/stomj96_1489_mis.a wed45/tost96_1247_mis.a
33 fri89/smism96_11146_mis.a mon78/dalsr96_18975_mis.a
33 fri67/rolns96_12643_mis.a fri89/meaaj96_24511_mis.a
32 fri89/meaaj96_24511_mis.a fri89/smism96_11146_mis.a
Obviously, the more shared fingerprints, the greater the likelihood of plagiarism. The -A option only shows this adjacency list. After eyeballing the files, I found that the students in the first five or six pairs did cheat. You might want to cut & paste from the lines to see the files in question, e.g.
$ less fri89/rigsp96_1418_i_mis.a tue89/fogce96_29016_i_mis.a
where the filenames were cut & pasted from the output displayed above. The adjacency lists from the d_output and t_output files should also show who plagiarised documentation and program output.
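The adjacency list can be thought of as a count of shared fingerprints for every pair of files. The Python sketch below shows that idea under stated assumptions: each file is represented by a set of fingerprints, a pair's score is the size of the intersection, and pairs with nothing in common are omitted. The function name is hypothetical, and the manual does not say whether the real script weights or filters fingerprints in any way.

```python
from itertools import combinations


def adjacency_list(file_fingerprints):
    """Return [(shared_count, file_a, file_b), ...], highest count first.

    `file_fingerprints` maps each file name to the set of fingerprints
    found in that file. Pairs sharing no fingerprints are left out.
    """
    rows = []
    for file_a, file_b in combinations(sorted(file_fingerprints), 2):
        shared = len(file_fingerprints[file_a] & file_fingerprints[file_b])
        if shared:
            rows.append((shared, file_a, file_b))
    # Sort by descending shared count, then by file names for stable output.
    rows.sort(key=lambda row: (-row[0], row[1], row[2]))
    return rows
```

Because every pair of files is compared, the work grows with the square of the number of files, which is one reason a tool of this kind can take a while on a large class.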
Note that the -w option for finding word matches is really only useful for finding spelling mistakes which might indicate plagiarism. I wouldn't rely too heavily on the adjacency list when using this option.