Copied and pasted code is usually bad

But it can be hard to find, especially in a large project. So we wrote a utility, CPD (the Copy/Paste Detector), to find it for us. The first version used a variant of Michael Wise's Greedy String Tiling algorithm (our variant is described here). Brian Ewins then rewrote it completely using the Burrows-Wheeler transform or, at least, the first part of it.
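To give a feel for why sorting helps, here is a minimal sketch in Java. It is not CPD's actual code. It assumes that "the first part" of the transform means the sorting step, and it sorts token suffixes rather than full rotations to keep things short. Once the suffixes are sorted, any two places that begin with the same run of tokens end up next to each other, so duplicates can be found by comparing neighbours. The class name `DuplicateSketch`, its methods, and the whitespace tokenizing are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a toy duplicate finder, not the CPD implementation.
public class DuplicateSketch {

    /** Number of leading tokens shared by the suffixes starting at a and b. */
    static int commonPrefix(String[] tokens, int a, int b) {
        int n = 0;
        while (a + n < tokens.length && b + n < tokens.length
                && tokens[a + n].equals(tokens[b + n])) {
            n++;
        }
        return n;
    }

    /** Token-by-token lexicographic order; a suffix that runs out first sorts first. */
    static int compareSuffixes(String[] tokens, int a, int b) {
        int n = commonPrefix(tokens, a, b);
        if (a + n == tokens.length || b + n == tokens.length) {
            return Integer.compare(tokens.length - a, tokens.length - b);
        }
        return tokens[a + n].compareTo(tokens[b + n]);
    }

    /** Returns {firstStart, secondStart, length} for each duplicated run found. */
    static List<int[]> findDuplicates(String[] tokens, int minimumTokenCount) {
        Integer[] suffixes = new Integer[tokens.length];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // The sorting step: equal runs of tokens become neighbours.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(tokens, a, b));

        List<int[]> matches = new ArrayList<>();
        for (int i = 1; i < suffixes.length; i++) {
            int a = Math.min(suffixes[i - 1], suffixes[i]);
            int b = Math.max(suffixes[i - 1], suffixes[i]);
            int length = commonPrefix(tokens, a, b);
            if (length < minimumTokenCount) {
                continue;
            }
            // Skip matches that are just a longer match shifted right by one token.
            if (a > 0 && tokens[a - 1].equals(tokens[b - 1])) {
                continue;
            }
            matches.add(new int[] {a, b, length});
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] tokens = "int a = 1 ; foo ( ) ; int a = 1 ; bar ( ) ;".split(" ");
        for (int[] m : findDuplicates(tokens, 4)) {
            System.out.println("tokens " + m[0] + " and " + m[1]
                    + " match for " + m[2] + " tokens");
        }
    }
}
```

The `minimumTokenCount` parameter plays the same role as the attribute of that name in the Ant task: shorter matches are ignored. The sketch only compares neighbouring suffixes and does not check whether the two copies overlap, so a real tool has more work to do when a fragment appears three or more times.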

Here's a screenshot of CPD after running on the JDK java.lang package.

Note that CPD works with Java, C, C++, and PHP code.

If you have Java Web Start, you can run CPD by clicking here.

Here are the duplicates CPD found in the JDK 1.4 source code.

Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).

There's also a JavaSpaces version available for splitting the CPD effort across a farm of machines. I usually post news on that here, and the releases are here.

Andy Glover wrote an Ant task for CPD; here's how to use it:


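<target name="cpd">
    <!-- Register the CPD task; the PMD jar must be on Ant's classpath -->
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <!-- Report duplicated runs of at least 100 tokens to a text file -->
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt" verbose="true">
        <!-- The source files to scan -->
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>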

       

Suggestions? Comments? Post them here. Thanks!