GVM: Fast Spatial Clustering

As a result of investigations into various different fields of data processing, I have developed need of a spatial clustering algorithm that can cope with extremely large data sets. I wasn't able to find an existing algorithm that suited my needs so I developed a new one. It is called gvm for Greedy Variance Minimization.

This algorithm is new to me; if I've rediscovered it please let me know, email me at tomgibara at hotmail dot com.

I am still developing the algorithm in an attempt to make it more efficient. I don't know how its performance compares (in terms of speed, quality or memory) to established algorithms, but in my case the characteristics I list below were more important than raw speed.

When I have finalized the algorithm, I will publish a complete description of it here.

Characteristics

The gvm algorithm has the following characteristics:

Online Demonstration

My thanks go to the GeoNames project which freely provides an accessible source of valuable geographic data.

The applet below doubles as both a demonstration and first implementation of the gvm algorithm. It requires a browser with support for Java 1.5 (or above) to run.

Use the slider to change the number of clusters. The number of cities being clustered is necessarily constrained by a desire to keep the applet download small. The largest city in each of the largest 20 clusters is indicated with a separate marker.

This demo requires Java 1.5

If you can't view the applet for any reason there are always the screenshots, or you can download the demonstration as a Java application.

Screenshots

All the cities put into one cluster. Shanghai is the largest city and so is choosen in the demonstration as the label for the cluster.
Two clusters results in the Asian cities being essentially being treated as the densest set of outliers, split away from a larger central cluster of cities.
With three clusters, the cities become broadly clustered by longitude.
Four clusters results in the American cluster being split across the two land masses.
Five clusters, and the Euro-African cluster fragments, with South Africa's populous cities being split away.
With 25 clusters, all of the continents become represented. The lower population of Saharan Africa starts to become observable.
Increasing the number of clusters demonstrates that with this data set, the algorithm exhibits linear scaling (78ms for 100 clusters, 156ms for 200 clusters) with the number of clusters – not the n log n worst case.

Downloads

At this time, the only download available for this algorithm is the demonstration applet in the form of a Java application you can run locally:

gvm-city.jar

If you already have Java 1.5 (or above) there is nothing to install. Simply download the above jar file and then double click on it, or execute the following command line:

java -jar gvm-city.jar