GVM Java Library

The GVM algorithm was initially developed as a Java library.

Download

The GVM library jar can be downloaded from the Google Code project for this site.

Building from Source

To build the current development snapshot, you need svn, Java 1.6 (or above) and Maven 2.2 (or above).

svn checkout http://tomgibara.googlecode.com/svn/trunk/cluster
cd cluster/cluster-mojo/
mvn install
cd ..
mvn install

You can also browse the source code online.

Using the Library

Clustering using the GVM Java library is very simple.

First you need to choose the subpackage that matches your coordinate type. The GVM library uses source code generation to provide efficient implementations for each of the numeric Java primitives; the generic root package should be avoided. The packages are:

Next you create a Clusters object, specifying the dimension and the maximum number of clusters. For example, if up to 10 clusters of 3-dimensional double coordinates were being sought:

DblClusters<Key> clusters = new DblClusters<Key>(3, 10);

Here Key is the type of object that your application associates with each point/cluster. There is one key for each cluster (equivalently, each point). How keys are assigned to new clusters, or combined when clusters are merged, is controlled by the Keyer. By default, the clusters use a Keyer that picks the key from the largest cluster/point, but other implementations are possible, and the implementation used can be set like so:

clusters.setKeyer(myKeyer);
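
To make the role of a keyer more concrete, here is a minimal, self-contained sketch of the default policy described above (keep the key of the larger cluster). The SketchKeyer interface, its mergeKeys method and the LargestMassKeyer class are hypothetical names invented for this illustration; they are not the library's Keyer API, whose actual method signatures should be taken from the Javadoc.

```java
// Hypothetical stand-in for a keyer; this is NOT the GVM library's interface.
interface SketchKeyer<K> {
    // Chooses the key for the cluster formed by merging two clusters.
    K mergeKeys(double massA, K keyA, double massB, K keyB);
}

// Keeps the key of whichever cluster carries more mass (ties favour the first).
class LargestMassKeyer<K> implements SketchKeyer<K> {
    @Override
    public K mergeKeys(double massA, K keyA, double massB, K keyB) {
        return massA >= massB ? keyA : keyB;
    }
}

class KeyerSketch {
    public static void main(String[] args) {
        SketchKeyer<String> keyer = new LargestMassKeyer<String>();
        // The heavier cluster (mass 5.0) supplies the key for the merged cluster.
        System.out.println(keyer.mergeKeys(2.0, "small", 5.0, "large")); // prints "large"
    }
}
```

An application might instead combine the two keys, for example by concatenating lists of member identifiers; that choice depends on what the key represents.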

Once the Clusters object has been created, it's simply a matter of adding points to it. Each point has a mass (which can be used as a weighting by the application, or set to 1.0 for every point), a coordinate vector (in the form of an array of coordinates), and (optionally) a key. They are added using the add method:

clusters.add(mass, pt, key);
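
As an illustration of what it means for the mass to act as a weighting, the sketch below accumulates mass-weighted sums for a set of points and derives their centroid. It is generic weighted-mean arithmetic and is not taken from the library's implementation; the WeightedPoints class and its methods are hypothetical. With a mass of 1.0 for every point the centroid reduces to the ordinary mean.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the GVM library.
class WeightedPoints {
    private final double[] weightedSum; // sum of mass * coordinate, per dimension
    private double totalMass;           // sum of all masses added so far

    WeightedPoints(int dimension) {
        weightedSum = new double[dimension];
    }

    // Adds a point with the given mass; the array length must match the dimension.
    void add(double mass, double[] pt) {
        if (pt.length != weightedSum.length) {
            throw new IllegalArgumentException("wrong dimension: " + pt.length);
        }
        for (int i = 0; i < pt.length; i++) {
            weightedSum[i] += mass * pt[i];
        }
        totalMass += mass;
    }

    // Returns the mass-weighted mean of all points added so far.
    double[] centroid() {
        if (totalMass == 0.0) {
            throw new IllegalStateException("no mass added");
        }
        double[] c = new double[weightedSum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = weightedSum[i] / totalMass;
        }
        return c;
    }

    public static void main(String[] args) {
        WeightedPoints pts = new WeightedPoints(2);
        pts.add(1.0, new double[] { 0.0, 0.0 });
        pts.add(3.0, new double[] { 4.0, 8.0 }); // three times the weight of the first point
        System.out.println(Arrays.toString(pts.centroid())); // prints [3.0, 6.0]
    }
}
```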

At any point during the clustering process (but usually after the last point has been added, and before the results have been returned), the number of clusters can be reduced by calling the reduce method. The first argument constrains the total variance (negative if there's no constraint); the second constrains the number of clusters (zero if there's no constraint).

clusters.reduce(100.0, 2);

Finally, the computed clustering can be obtained by calling the results method (usually after all the points have been added, though it can be called at any time). This method performs almost no computation and returns a list of Result objects that each contain information about an identified cluster. For example:

List<DblResult<Key>> results = clusters.results();

Documentation

Javadoc library documentation is available from the project's Maven site. The essential package documentation is that of the com.tomgibara.cluster.gvm subpackages.

Sample Code

ClusterPoints.java
This is a class used to produce the k-means comparison clustering. It reads pairs of coordinates from each line of a number of text files, clusters them, and writes a new file recording the cluster for each point.
CityDemo.java
This is the class that clusters the cities in the demonstration applet.
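
For readers preparing their own input in the style described for ClusterPoints.java, the following is a minimal sketch of turning lines of text into coordinate arrays suitable for passing to an add call. It is not an excerpt from ClusterPoints.java: the CoordinateLines class is hypothetical, and the whitespace-separated format is an assumption, since the actual file format is not specified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not taken from ClusterPoints.java.
class CoordinateLines {
    // Parses one line holding exactly two whitespace-separated numbers (assumed format).
    static double[] parsePair(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected two coordinates: " + line);
        }
        return new double[] { Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) };
    }

    // Parses every non-blank line into a 2-dimensional coordinate array.
    static List<double[]> parseAll(Iterable<String> lines) {
        List<double[]> points = new ArrayList<double[]>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            points.add(parsePair(line));
        }
        return points;
    }

    public static void main(String[] args) {
        List<double[]> points = parseAll(Arrays.asList("1.5 2.0", "", "  -3 4.25 "));
        System.out.println(points.size()); // prints 2
    }
}
```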