Recently, I've been playing with various computer vision algorithms. Implementations of these algorithms are often highly sensitive to optimization. One situation that arises quite often is returning a coordinate pair from a method that may be called many times (often several million times) in a single pass of the algorithm. Since Java lacks the ability to return tuples from methods, an object must be created on every return to package up the two coordinates.
As far as I understand, Java 6 (the latest release at the time of writing) performs escape analysis but performs no optimizations based on it. So, although significant work has been undertaken in each of Sun's Java releases to speed up object creation, all of those return objects are still going to need heap allocation. It is natural, therefore, to look for a technique that avoids heap allocation and that might be faster.
The obvious technique, when we are dealing with a pair of integer components, is to 'bit stuff' them into a long (which can be stack allocated if necessary) and then 'unstuff' them. So the question is which is faster: creating an object and accessing its fields, or calling methods to bit-manipulate a long as two ints?
In particular, we want to compare the speed of creating one-off instances of the following class (or something equivalent to it) and accessing its fields, against calls to the static methods that follow it.
/** A simple class that records a pair of integer coordinates. */
private static class Point {

    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

}
/** Combines two integer coordinates into a single long. */
private static long point(int x, int y) {
    // mask y to avoid sign extension clobbering the x bits when y is negative
    return (((long) x) << 32) | (y & 0xFFFFFFFFL);
}

/** Extracts the x coordinate from a point long. */
private static int x(long point) {
    return (int) (point >> 32);
}

/** Extracts the y coordinate from a point long. */
private static int y(long point) {
    return (int) point;
}
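As a quick sanity check (my own addition, not part of the benchmark), a pack/unpack round trip should recover both coordinates exactly; negative values are what make the mask in point() necessary:

long p = point(3, -7);
// prints "3 -7"; without the mask, sign extension of a negative y would
// set the upper 32 bits to ones and x would come back as -1
System.out.println(x(p) + " " + y(p));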
This is all extremely simple; the tricky part is knowing which is faster. Actually, it's impossible to know in general. All we can do is measure the performance on particular JVMs in specific usage scenarios and see which one is best for the task at hand.
I wrote a simple benchmark that compares the performance of these two approaches:
private static void testObject(int max) {
    for (int y = 0; y < max; y++) {
        for (int x = 0; x < max; x++) {
            Point point = new Point(x, y);
            // reading the fields back into the loop variables leaves their
            // values unchanged, so the loops still terminate correctly
            x = point.x;
            y = point.y;
        }
    }
}
private static void testLong(int max) {
    for (int y = 0; y < max; y++) {
        for (int x = 0; x < max; x++) {
            long point = point(x, y);
            // as above, these calls return the original loop values
            x = x(point);
            y = y(point);
        }
    }
}
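My full benchmarking code isn't reproduced here; a minimal sketch of the harness, assuming a testNoop baseline with empty loop bodies and ten timed runs (the real code's output formatting is slightly fancier), might look like this:

/** Baseline: empty loops, measuring loop overhead only. */
private static void testNoop(int max) {
    for (int y = 0; y < max; y++) {
        for (int x = 0; x < max; x++) {
            // deliberately empty
        }
    }
}

public static void main(String[] args) {
    int max = 10000; // 10000 * 10000 = 100000000 inner iterations
    for (int run = 0; run < 10; run++) {
        long start = System.currentTimeMillis();
        testNoop(max);
        System.out.println("Time for 100000000 noops: " + (System.currentTimeMillis() - start) + "ms");
        start = System.currentTimeMillis();
        testObject(max);
        System.out.println("Time for 100000000 objects: " + (System.currentTimeMillis() - start) + "ms");
        start = System.currentTimeMillis();
        testLong(max);
        System.out.println("Time for 100000000 longs: " + (System.currentTimeMillis() - start) + "ms");
    }
}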
Observe that in these tests I am not taking into account the invocation of a separate method that returns the long/Object; this is a weakness in my approach, but it probably doesn't invalidate the results.
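A variant along the following lines would close that gap for the object case (makePoint and testObjectViaCall are hypothetical names of my own; note that the long test already pays a call to point(x, y)):

/** Hypothetical factory method standing in for a real algorithm's return. */
private static Point makePoint(int x, int y) {
    return new Point(x, y);
}

private static void testObjectViaCall(int max) {
    for (int y = 0; y < max; y++) {
        for (int x = 0; x < max; x++) {
            Point point = makePoint(x, y);
            x = point.x;
            y = point.y;
        }
    }
}

In practice, HotSpot would very likely inline a method this small, which is presumably why omitting it doesn't change the picture much.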
When run under Java 1.6.0 with no command line switches, on a 3.4GHz Pentium 4 PC with 1GB of RAM running Debian Linux, the full benchmarking code outputs the following results:
Time for 100000000 noops: 299ms (299)
Time for 100000000 objects: 1s 578ms (1578)
Time for 100000000 longs: 828ms (828)
Time for 100000000 noops: 79ms (79)
Time for 100000000 objects: 1s 571ms (1571)
Time for 100000000 longs: 825ms (825)
Time for 100000000 noops: 87ms (87)
Time for 100000000 objects: 1s 538ms (1538)
Time for 100000000 longs: 857ms (857)
Time for 100000000 noops: 88ms (88)
Time for 100000000 objects: 1s 545ms (1545)
Time for 100000000 longs: 852ms (852)
Time for 100000000 noops: 78ms (78)
Time for 100000000 objects: 1s 543ms (1543)
Time for 100000000 longs: 845ms (845)
Time for 100000000 noops: 81ms (81)
Time for 100000000 objects: 1s 599ms (1599)
Time for 100000000 longs: 856ms (856)
Time for 100000000 noops: 92ms (92)
Time for 100000000 objects: 1s 539ms (1539)
Time for 100000000 longs: 850ms (850)
Time for 100000000 noops: 73ms (73)
Time for 100000000 objects: 1s 547ms (1547)
Time for 100000000 longs: 845ms (845)
Time for 100000000 noops: 89ms (89)
Time for 100000000 objects: 1s 539ms (1539)
Time for 100000000 longs: 845ms (845)
Time for 100000000 noops: 87ms (87)
Time for 100000000 objects: 1s 539ms (1539)
Time for 100000000 longs: 851ms (851)
In this output, the lines marked "noops" provide a baseline in which only the looping was executed. With the usual caveat that micro-benchmarks can become very badly skewed under a JVM that performs optimized pre-compilation (as a case in point, note the initial noop time compared to subsequent ones), the use of longs does seem significantly faster: roughly 850ms per run against roughly 1550ms for objects, nearly a factor of two. Here's the obligatory chart for people who don't enjoy reading.
I think this approach (which may provide benefits for float pairs too) is worth considering for algorithms that depend heavily on 2D integer coordinates, though it should certainly be confined to the internals of an algorithm and not exposed in an API.
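For float pairs, a sketch along the same lines (my own, untested in the benchmark; floatPoint/floatX/floatY are names I've made up) would go through Float.floatToRawIntBits and Float.intBitsToFloat:

/** Combines two float coordinates into a single long. */
private static long floatPoint(float x, float y) {
    return (((long) Float.floatToRawIntBits(x)) << 32)
            | (Float.floatToRawIntBits(y) & 0xFFFFFFFFL);
}

/** Extracts the x coordinate from a float-point long. */
private static float floatX(long point) {
    return Float.intBitsToFloat((int) (point >> 32));
}

/** Extracts the y coordinate from a float-point long. */
private static float floatY(long point) {
    return Float.intBitsToFloat((int) point);
}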