Author's note: This document was written in draft form in 1997, for potential inclusion into my book on the Java Virtual Machine. It didn't make it into the book, and it is now out of date and has certain typos. However, much of the information it contains is still relevant.
In this document we discuss how to write efficient Java code. The discussion starts with a high level introduction. By the end of the document, we'll be looking at performance benchmarks for individual JVM instructions. Our goal is to examine ways to optimize Java code. We will look at some common optimization techniques, and also see speed comparisons of several JVM implementations.
This document is written for programmers with some Java programming experience. It also contains pointers that will be useful for newer programmers.
Speed is important. Writing code that runs quickly isn't just good for marketing brochures and customer satisfaction, it also lets you tackle new kinds of problems that you wouldn't otherwise consider.
In Java, a large number of factors determine how fast a particular application runs, including the virtual machine implementation it is running on, the efficiency of the standard Java classes provided with that machine, and the efficiency of the application's own Java classes. To complicate matters further, if you are running over the internet, you have to take into account the download time for your classes. If you use lots of classes, this can be a significant amount of time (you can use an archive format such as JAR can reduce this overhead).
You can make code go faster by using better compilation and execution technology (e.g. using an optimizing compiler instead of a non-optimizing compiler, or switching to a faster virtual machine or computer). However, in some cases this isn't enough and you need to spend some effort rewriting the code so that it runs faster. We'll see some rewriting techniques later in this document.
|
|
A good general rule with regards to hand optimizing your code is: don't do it unless you have to.
Programmers are sometimes tempted to put in that extra effort to write really optimal code, without considering carefully if the code they are optimizing is actually the cause of a speed bottleneck.
Its easy to fall into the trap of thinking that every line of code must be blazingly fast. But code that is hand-optimized is usually harder to read than code which hasn't been written for speed, because it contains non-obvious reorganizations designed to make it go faster (e.g. unrolled loops, extra variable assignments or missing variable assignments, code repeated several times with minor variations, etc.) Understanding and debugging code that is hand-optimized is therefore hard, compared to understanding code that has been written for legibility instead than speed. This is a problem when you have to come back and maintain or modify an application.
Its better to only concentrate hand-optimization efforts on the parts of the code that are 'speed critical'. Detecting these areas of code is sometimes difficult - profiling tools such as jprof can help, and some of the Java debuggers are beginning to appear with profilers. If you don't have a profiler, look for functions that are used a lot in the code, or loops that have many iterations.
|
|
When deciding what code to optimize, think big. Concentrate on areas in your application that are noticably slow to the user. A small increase in the speed of an inner loop can lead to a much larger increase in the overall speed of the program - look for optimizations that will double the speed of the resulting program, not just increase it by a factor of 10%.
In some case, rather than bashing on the code optimizing it again and again for a small gain in speed, its better to rethink the approach and write the code again from scratch. Using a different algorithm or datastructure may be a better solution than trying to make the existing system just one more notch faster. Rewriting code often leads to faster, cleaner applications that have less bugs and are easier to maintain.
In those cases where you do add optimizations, make sure you add clear comments explaining what is optimized and why. There is nothing worse than trying to understand code you wrote several months back which is full of poorly commented optimization tricks.
In this document, we are going to look at some of the techniques you can use to make your Java code run more quickly. We'll talk about optimizing techniques in four steps: (1) use an optimizing compiler, (2) utilize standard classes effectively, (3) understand the machine, and (4) analyze the bytecode. We won't discuss profiling techniques here.
At the end of the document, we'll look into the future a little, and discuss how to write code which will run well on the next generation JVM engines.
Before reaching for your tuning forks and starting to hand modify your code, its worth seeing how fast you can make your program run just by using an optimizing Java compiler. All the commercial strength Java compilers out there offer some level of optimization support, but its surprising how many programmers forget to use the optimizer.
For Sun's JDK, to turn on the optimizer you simply supply javac with a "-O" flag. For example:
javac -O MyClass.java
This runs the optimizing compiler on the Java class. In some cases, the -O flag can really make a difference to the speed of the resulting class.
|
|
When the optimizer is turned on, the compiler takes longer to compile the file, since it is spending more time analyzing the code looking for optimizations. For this reason, many programmers only use the optimizer on code they are about to ship.
Compiling your code with an optimizer is an important first step. If you are using Sun's JDK, for example, using the -O flag will give you:
inlined private/final methods and fields, (saves the cost of a method dispatch and field lookup)
'peephole' optimizated bytecode.
Peephole optimization looks at a few instructions at a time, replacing
common sequences of operations with faster versions that have the same
results. For example, a peephole optimizer might replace a jump to a
goto instruction with a jump directly to the destination.
saving one instruction. Peephole optimizaters also removes dead
code and unused variables, and perform other small code rewrites.
These techniques do improve resulting code, but they won't necessarily solve your speed problems. The sad reality is that existing Java compilers don't do all that many optimizations. You may be surprised to learn that Sun's compiler (and several other Java compilers I've tried) doesn't perform any of the following standard optimization techniques:
common subexpression elimination
strength reduction
local variable analysis/reuse
hoisting or uplifting
unrolling
[We don't have room here to discuss these optimization techniques in detail. See [dragon book] if you want to learn more about optimization.]
Java is relatively new, and good optimizing compilers have yet to be written. Future releases may well include more of these optimization techniques.
It certainly doesn't help that Java is fairly complex to optimize. Java programs consist of method calls, exception handlers and threads. These can all have subtle side effects which must be analyzed carefully by the optimizer before it can make optimizations. Developing a good optimizing compiler is hard.
JIT (Just in time) code generation technology may be another factor holding up the development of optimizing Java compilers. Just-in-time code generators convert Java class files to native code on the fly, dynamically at runtime, just before it is executed. The JIT implementations themselves include various optimization tricks, which make the static optimizations performed by Java compilers less important. Java compiler writers may be holding off from developing fully optimizing Java compilers until it becomes clearer just how far JIT technology will go; spending time optimizing Java bytecodes at compile time within a Java compiler may be counterproductive if the JIT code generator in the runtime system doesn't produce effective native code from the results!
An optimizing compiler can save you some work, but it isn't an automatic guarantee of speed.
To write fast Java code, its important to become familiar with the standard Java classes shipped with Java. Using these classes wisely will help to make your code run quickly.
For example, consider:
for (int i = 0; i < numItems; i++)
dst[i] = src[i];
which copies numItems from one array to another. Java provides a builtin method which does exactly this, so an experienced Java hacker would replace this with:
System.arraycopy(src, 0, dst, 0, numItems);
Since many classes rely on the System.arraycopy method, Java implementations are likely to ensure that this goes fast - e.g. by using native code. So using System.arraycopy is probably going to lead to faster copying time than manually looping through the array in Java.
The following table lists all the public methods provided in the standard Java 1.1 classes which are marked as native in Sun's JDK. This will give you a good starting point for learning about which methods the Sun developers felt needed native support. In some cases, native methods are used to access facilities not available in the Java language. In other cases, native code is used for its raw speed.
Class Method
---------------------------------------------------------------------
java.io.File boolean isAbsolute()
java.io.FileDescriptor boolean valid()
void sync()
java.io.FileInputStream int read()
long skip(long n)
int available()
void close()
java.io.FileOutputStream void write(int b)
void close()
java.io.RandomAccessFile int read()
void write(int b)
long getFilePointer()
void seek(long pos)
long length()
void close()
java.lang.Class static Class forName(String className)
Object newInstance()
boolean isInstance(Object obj)
boolean isAssignableFrom(Class cls)
boolean isInterface()
boolean isArray()
boolean isPrimitive()
String getName()
ClassLoader getClassLoader()
Class getSuperclass()
Class[] getInterfaces()
Class getComponentType()
int getModifiers()
Object[] getSigners()
java.lang.Compiler static boolean compileClass(Class clazz)
static boolean compileClasses(String string)
static Object command(Object any)
static void enable()
static void disable()
java.lang.Double static long doubleToLongBits(double value)
static double longBitsToDouble(long bits)
java.lang.Float static int floatToIntBits(float value)
static float intBitsToFloat(int bits)
java.lang.Math static double sin(double a)
static double cos(double a)
static double tan(double a)
static double asin(double a)
static double acos(double a)
static double atan(double a)
static double exp(double a)
static double log(double a)
static double sqrt(double a)
static double IEEEremainder(double f1,
double f2)
static double ceil(double a)
static double floor(double a)
static double rint(double a)
static double atan2(double a, double b)
static double pow(double a, double b)
java.lang.Object final Class getClass()
int hashCode()
final void notify()
final void notifyAll()
final void wait(long timeout)
java.lang.Runtime long freeMemory()
long totalMemory()
void gc()
void runFinalization()
void traceInstructions(boolean on)
void traceMethodCalls(boolean on)
java.lang.String String intern()
java.lang.System static long currentTimeMillis()
static void arraycopy(Object src, ...)
static int identityHashCode(Object x)
java.lang.Thread static Thread currentThread()
static void yield()
static void sleep(long millis)
final boolean isAlive()
int countStackFrames()
java.lang.Throwable Throwable fillInStackTrace()
|
|
In general, if there is a builtin class offering a particular feature, its better to use that class than to reinvent your own. The builtin classes are more likely to be tuned properly for the JVM they run on, and will probably be faster than your homegrown alternatives.
This isn't a hard and fast rule, however. For example, consider the java.util.Vector class. Most of the public methods of this class are marked as synchronized. Synchronized methods have much higher overheads than unsynchronized methods (more on this later), so if you need expandable vectors in a time critical portion of your code, its better to write your own vector class or use arrays instead.
You can use javap to find out if a particular method in Java is synchronized.
For example, to find out which methods in java.util.Hashtable are synchronized,
use:
javap java.util.Hashtable
This produces something like the printout below (your version may be more detailed...)
Compiled from Hashtable.java
public class Hashtable extends Dictionary
implements Cloneable, Serializable {
public Hashtable(int,float);
public Hashtable(int);
public Hashtable();
public int size();
public boolean isEmpty();
public synchronized Enumeration keys();
public synchronized Enumeration elements();
public synchronized boolean contains(Object);
public synchronized boolean containsKey(Object);
public synchronized Object get(Object);
protected void rehash();
public synchronized Object put(Object,Object);
public synchronized Object remove(Object);
public synchronized void clear();
public synchronized Object clone();
public synchronized String toString();
}
As you can see, many of the methods in java.util.Hashtable are synchronized. So if you are using Hashtables (or Vectors) in a performance critical part of your code, and don't need thread synchronization, you should consider implementing your own versions of the Hashtable/Vector classes. In my tests, an unsynchronized version of the Hashtable class runs about 50% faster than a synchronized version (don't rely on this figure - other runtime systems will produce different results).
There are several well known slow-performers in Java.
Threads and thread synchronizations have high overheads associated with them, so try to avoid using threads or synchronized methods within the inner loops of your application. (Some people avoid using Threads altogether, to make their applications run more quickly).
Creating objects or arrays is expensive, so you should avoid doing this if you can. Sometimes its better to recycle objects, using techniques such as free lists to keep track of which objects are available for reuse.
Early versions of the AWT are less than efficient, though Java 1.1 is better in this department.
|
|
An important technique is to write small test programs and try things out. If threads are critical to your application, write a small Java application which creates some threads, and do some timings. You may find that the JVM you are using handles your case with performance to spare. We'll show an example of a simple test program later in this document.
|
Borrow shamelessly (but carefully) |
As you become more experienced in Java, you'll learn more about how to use the standard class libraries. One way to do this is to study Java source code that is in the public domain. You will probably learn some things about the Java language that you didn't know. In some cases you can discover tricks that translate into faster ways of doing things in your code. (of course, you need to be sensitive to ownership and copyright issues when doing this).
One useful method you might come across is the String.intern() method. This takes the String object you apply it to and looks it up in an internal hash table. If a String object with the same characters already exists in the table, then the String instance from the table is returned. Otherwise your String is added to the table and returned. When you intern a String, you end up with a reference to a unique String object containing the same characters. Because its unique, you can use "==" to perform fast equality tests on strings, without having to compare the characters one at a time using the equals() method.
Literal strings are 'interned' automatically. This means that:
"abc" == "abc"
evaluates to true - i.e. both "abc" literals refererence the same underlying string object (N.B. a bug in Sun's JDK 1.0.2 prevents this from working properly - try it on your runtime system before relying on this behavior).
Note that:
"abc" == ("a" + "b" + "c")
evaluates to false - the string built by dynamically concatenating the three literals "a", "b" and "c" is not interned automatically. Of course you can intern it explicitly by calling the intern() method, so the expression
"abc" == ("a" + "b" + "c").intern()
evaluates to true.
Using interned Strings can up programs that perform a lot of symbol manipulation, such as compilers or parsers.
Knowing the Java class library well and using a good optimizing compiler will get you 90% of the way.
But what do you do if your code is still too slow? Before resorting to an extreme measure like using native methods, its worth looking at what your code is doing. With some knowledge of how the JVM works, it may be possible to rewrite parts of the code and eliminate the performance problem.
For example, consider the following three loops, where
vector is an instance of java.util.Vector, and
array is an array of Objects:
// Iteration using Enumeration
for (Enumeration e = vector.elements(); e.hasMoreElements();) {
Object x = e.nextElement();
}
// Iteration using elementAt()
for (int j = 0; j < vector.length(); j++) {
Object x = vector.elementAt(j);
}
// Iteration using arrays
for (int j = 0; j < array.length; j++) {
Object x = array[j];
}
Which approach is fastest? The answer is that the last version runs very quickly, the middle version runs at a medium clip, and the first one is the slowest of the three. Using our "write simple tests" rule, lets write a simple class to verify this:
import java.util.Vector;
import java.util.Enumeration;
public class TestIteration {
static Object x;
public static void main(String args[]) {
Enumeration e;
int i;
long startTime;
Vector vector = new Vector();
Object array[] = new Object[100];
// Setup the vector and the array with some dummy objects
for (i = 0; i < 100; i++) {
vector.addElement(new Object());
array[i] = new Object();
}
// Test Enumeration
startTime = System.currentTimeMillis();
for (i = 0; i < 10000; i++) {
for (e = vector.elements(); e.hasMoreElements(); ) {
x = e.nextElement();
}
}
System.out.println(System.currentTimeMillis() - startTime);
// Test Vector.elementAt
startTime = System.currentTimeMillis();
for (i = 0; i < 10000; i++) {
for (int j = 0; j < 100; j++) {
x = vector.elementAt(j);
}
}
System.out.println(System.currentTimeMillis() - startTime);
// Test array access
startTime = System.currentTimeMillis();
for (i = 0; i < 10000; i++) {
for (int j = 0; j < 100; j++) {
x = array[j];
}
}
System.out.println(System.currentTimeMillis() - startTime);
}
}
Running this on JDK 1.0 on my system produces:
% java TestIteration
14085
8513
537
Under JDK 1.1, my system produces:
% java TestIteration
6329
4136
371
As you can see, the array based approach is much faster than the Enumeration based approach. Around 20 times faster!
Why is this? A large part of the answer lies in the cost of method calls. Lets look at this in more detail.
If you are calling a method inside a critical portion of your application, its important to know what sort of overhead is involved with that method call. Overhead refers to the time taken by the system to locate, setup and invoke the new method, and doesn't include the time actually taken to execute the method. For some types of method call, the overhead can be quite large. Overhead varies from one JVM implementation to another.
|
|
As a general rule of thumb, the static, final, private and native method modifiers reduce the overhead of a method call. The synchronized modifier increases the overhead of a method call, sometimes quite substantially. Also, interface method calls are slower than normal method calls. Interface method calls are used whenever you call a method on a variable or field which declared as an interface type. For example, in the code:
Enumeration e = vector.elements();
while (e.hasMoreElements()) ...
The hasMoreElements() method call is an interface method call. Its important to realize that if you have:
class E implements Enumeration {
boolean hasMoreElements() { ... }
...
void foo() {
if (this.hasMoreElements()) {
}
}
}
then the call of hasMoreElements() in foo() is not an interface method call, its a normal method call. In foo(), -this- is of type E (a class type) and not of type Enumeration (an interface type) so the this.hasMoreElements() is treated as a standard method call rather than an interface call.
The exact overhead of a method call depends upon the JVM implementation, and also what combination of method modifiers are used. In some implementations final native methods are faster than non-final native methods, and static native methods are faster still. If you have several modifiers from the left column below, that's generally a good thing:
| Faster | Slower |
|
static final private native |
synchronized (interface methods) |
Sometimes it is possible to replace method calls with code that doesn't involve a method. For example, if x and y are both doubles, consider the two expressions:
y = Math.abs(x);
and
y = (x < 0 ? -x : x);
The first version uses the Math.abs method to obtain an absolute
value for x. The second uses the
In the last section, we looked at the overhead associated with methods. What about the overheads of fields and local variables?
As with methods, some field modifiers have higher overheads than others. Retrieving static (class) fields is faster than retrieving instance fields. Final static fields are faster still. Note that, unlike private methods, private fields are not usually any faster than public ones - the same adressing mechanisms are needed for both.
|
|
In all the implementations I've tried, local variables are faster to access than fields. This means that if you are retrieving a field many times in a loop, then it may be better to cache the value of the field in a local variable, and use the local variable in the loop instead.
This is especially true if you are using an integer counter, so:
class Example {
int x;
void count() {
while (true) {
x++;
...
}
}
}
is not as efficient as:
class Example {
void count() {
int x;
while (true) {
x++;
...
}
}
}
Unfortunately, existing Java compilers don't intelligently cache fields in local variables. If you want to do this kind of reorganization, you have to do it manually.
Here are some other useful things to know about local variables:
|
|
The JVM has specialized instructions for retrieving local variables numbered 0, 1, 2 and 3. Its not mentioned anywhere in the language or compiler spec, but its reasonable to assume that the first few variables and parameters you declare in your method are going to end up with these local variable numbers. If your method only has a total of four local variables+parameters, you are all set. If it has a lot more than this, you might consider declaring important looping variables first in the code. JVM implementations currently retrieve these first few local variables a little faster than the other local variables, so it may make your code go just that little bit faster.
The JVM has a special instruction, iinc, for incrementing and decrementing integer local variables. You can use this fact to your advantage by making sure that you utilize the "++", "--", "+=" and "-=" operators with local variables when possible, as illustrated by the table below.
| Faster | Slower |
| x++ y-- x += 5 y -= 5 |
x = x + 1 y = y - 1 x = x + 5 y = y - 5 |
(of course, a good optimizing compiler would transform the expressions in the right column into those shown in the left column. Unfortunately, the compilers I've tested don't do this).
|
|
If you have a piece of code containing a loop which is performance critical, its a good idea to 'hoist' and 'unroll' the code in the loop yourself, since at least for the moment Java compilers and JIT code generators aren't necessarily going to do this for you.
Hoisting refers to moving code out of inner loops that doesn't need to be in inner loops. For example, in a loop like:
for (y = 0; y < height; y++) {
for (x = 0; x < width; x++) {
array[(y * width) + x] = 1;
}
}
its better to move the (y * width) calculation outside the loop:
for (y = 0; y < height; y++) {
int offset = y * width;
for (x = 0; x < width; x++) {
array[offset + x] = 1;
}
}
You've added an extra line to the program, making it a little longer to read, but it will run faster.
Unrolling is the technique of replacing loops like:
for (i = 0; i < 128; i++) {
doSomething(i);
}
with:
for (i = 0; i < 128; ) {
doSomething(i++);
doSomething(i++);
doSomething(i++);
doSomething(i++);
doSomething(i++);
doSomething(i++);
doSomething(i++);
doSomething(i++);
}
In this revised version of the loop, doSomething is called eight times in each iteration, instead of once each iteration. The overall effect of the loop is the same, but the unrolled version spends less time testing the value of i and branching, so the unrolled version of the loop takes less time to run.
|
|
While the Java language has full support for short, bytes and chars, inside the JVM these types are treated as second class citizens. They are known as "storage types": There are mechanisms for declaring that a field or array stores shorts, bytes or chars, but the arithmetic instructions, stack and local variables work only with ints, floats, longs and doubles. This means that every time you assign a value to a byte/short/char field the system has to truncate the value from an int to a byte/short/char before doing the assignment (and visa versa). This conversion is fast, but it still takes time. This means if you see a loop like:
for (short x = 0; x < 1000; x++) {
...
}
your better off using an int instead.
|
|
The JVM is a word oriented machine, in which the 'logical' word is 32
bits. Since doubles and longs are 64 bits apiece, they take two words
of storage. On most existing implementations I've tried, the
following speed ranking exists:
ints < floats < longs < doubles
Since ints are faster than floats, its a good idea to use as much
integer arithmetic as possible. For example, say you have a
integer between 0 and A and you want to scale it to a value between
0 and B. A very general way to write this is:
(int)((i / (float)A) * B)
which works well for any values of A and B. But in some cases
you can do much better. For example, if A is 256 and B is 1024,
with a little care you can rewrite the above to use integer arithmetic:
(i * 1024) / 256
Notice the multiplication by B (i.e. 1024) comes first.
This is still not fully optimal. Observe that 1024 / 256 is four - so this expression is really just multiplying the value of i by four. Multiplies and divides by powers of 2 can be done using shifting, which is faster. So the fastest way to write the above expression is:
i << 2
I ran some tests of these simple on two systems:
| Expression | SGI Impact | Windows PC |
(int)((i / 256.0f) * 1024) | 177 | 471 |
(i * 1024) / 256 | 246 | 40 |
i << 2 | 41 | 10 |
As you can see, the shifting approach is very fast on both systems. Also notice that, on the Silicon Graphics system, the floating point divison version is slightly faster than the integer division version. SGI systems have fast floating point hardware, so this isn't all that surprising. But it does raise another guideline:
|
|
At the end of the day, your Java program will most likely end up compiled into a number of Java class files. For ultimate speed freaks (and also for programmers with a healthy level of professional curiosity) its worth taking a look at the contents of your class files to see how well your compiler is doing.
For example, looking at the arithmetic expressions we saw in the last section, an experience programmer might expect an optimizing compiler to automatically spot code like:
(i * 1024) / 256
and, using a technique known as strength reduction, replace it with:
i << 2
(In fact, GNU's gcc compiler does this simple form of transformation even when you haven't turned optimization on!)
How can you find out if your Java compiler is doing this transformation? The easiest way is to compile the program and then run it through a disassembler and look at the bytecode.
You can use the javap program to disassemble a Java class
file and print out its bytecode. For example, the command:
javap -c java.lang.String
shows you a text representation of the JVM bytecode that makes up the String class.
You can also use a disassembler like the one provided by Kimera on the web (see http://kimera.cs.washington.edu/disassembler.html). This lets you disassemble a class file into Jasmin syntax. You can then take this class file, edit it, and use Jasmin (see http://www.cybergrain.com/tech/jasmin) to assemble the class back into its binary form - giving you the ultimate control over the contents of your class files.
Going back to our example, lets consider a Java class like:
class Test {
int testExpr(int x) {
return (x * 1024) / 256;
}
}
Compile this with optimization turned on, and then ran javap on the result. You will see something like:
javap -c Test
class Test extends java.lang.Object {
int testExpr(int);
Test();
Method int testExpr(int)
0 iload_1
1 sipush 1024
4 imul
5 sipush 256
8 idiv
9 ireturn
Method Test()
0 aload_0
1 invokespecial #3 ()V>
4 return
}
Even without knowing much about the JVM instruction set, you can see the numbers 1024 and 256 appearing in the code for testExpr() - its very clear that this expression has not been converted into a shift by 2. In this case its up to you to rewrite the divide using a shift.
If you want to learn more about what these JVM instructions are actually doing, you can look at: http://www.cybergrain.com/tech/jvmref .
Looking at the contents of a class file, you will see lots of instructions like:
iload_1 sipush 1024 invokespecial ...
Just how expensive are these operations?
In order to find out, I wrote a simple benchmarking program. The program tests a large percentage of the JVM instruction set. It doesn't test the efficiency of Java classes that are shipped with a runtime system (after all, you can't do much about improving them!), but instead just tests how fast the runtime system is able to execute the bytecode itself.
I've divided the tests into a number of categories:
For each test category, there are a series of tests, as well as a "noop" test that shows a reference value indicating how much overhead the JVM is taking to simply run the test. Each test shows two numbers: the first is a timing for the test in milliseconds. The second is a ranking of the instruction, where the noop gets a ranking of 0, simple instructions get a ranking of 1, and more complex instructions get higher rankings.The ranking gives you an idea of how costly the instruction is.
Looking at these results, you can get a feel for how fast a particular JVM system is (by looking at the absolute timings), and also learn about which operations are typically efficient, and which are more expensive (by looking at the relative instruction ranking).
Note that these tests differ from the results you would see from a benchmark test (such as CaffieneMark). Benchmark tests try to exercise as many different parts of the JVM as possible, to give an accurate overall rating of a runtime system. Our tests, on the other hand, only focus on one aspect of the instruction set at a time. The results are good for understanding how fast a specific instruction is, but are not a good measure of the overall performance of a system.
{{These results need to be presented in a nicer format, e.g. bar charts}}
Local Variables
JDK 1.0.2 on R10K noop 434 0 dup 667 1 aload_0 790 1 aload 10 696 1 aload 260 783 1 lload 10 726 1 JDK 1.0.2 on SPARC 20 noop 541 0 dup 780 1 aload_0 780 1 aload 10 804 1 aload 260 924 1 lload 10 845 1 JDK 1.1 on R10K noop 357 0 dup 550 1 aload_0 545 1 aload 10 563 1 aload 260 659 1 lload 10 546 1 JDK 1.1 on SPARC 20 noop 242 0 dup 330 1 aload_0 331 1 aload 10 346 1 aload 260 490 2 lload 10 354 1 JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium) noop 135 0 dup 170 1 aload_0 185 1 aload 10 185 1 aload 260 305 2 lload 10 190 1 Microsoft SDK-Java 1.5, Windows NT (Gateway 200MHz Pentium) noop 411 0 dup 601 1 aload_0 580 1 aload 10 651 1 aload 260 691 1 lload 10 621 1Comments
Here you can see that local variable 0 is a little faster to access (using aload_0) than local variable 10 (using aload 10). Retrieving local variable 260 is slower still (local variables with indices greater than 256 use a "wide" instruction form which is slower than the instruction form for the first 256 locals).
Field Access
JDK 1.0.2 on R10K
noop 434 0
getfield 901 2
getfield(byte) 968 2
getfield(short) 946 2
getstatic 711 1
getstatic(final) 701 1
JDK 1.0.2 on SPARC 20
noop 541 0
getfield 1073 1
getfield(byte) 1067 1
getfield(short) 1078 1
getstatic 860 1
getstatic(final) 861 1
JDK 1.1 on R10K
noop 357 0
getfield 694 1
getfield(byte) 696 1
getfield(short) 698 1
getstatic 584 1
getstatic(final) 584 1
JDK 1.1 on SPARC 20
noop 242 0
getfield 418 1
getfield(byte) 418 1
getfield(short) 419 1
getstatic 370 1
getstatic(final) 370 1
JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium)
noop 135 0
getfield 225 1
getfield(byte) 220 1
getfield(short) 220 1
getstatic 215 1
getstatic(final) 215 1
Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium)
noop 411 0
getfield 681 1
getfield(byte) 681 1
getfield(short) 681 1
getstatic 821 1
getstatic(final) 832 2
Comments
Notice that these values are consistently higher than the values for retrieving local variables. Static fields are also typically faster than non-static fields. Byte and short fields are not noticably slower to retrieve than int fields (hinting perhaps that these implementations are actually using int-sized members to store byte and short values?).
Array Access
JDK 1.0.2 on R10K noop 434 0 aaload 1017 2 aastore 1374 3 aastore(in interface array) 2038 4 aastore(null) 1169 2 iaload 1026 2 iastore 1022 2 laload 1053 2 lastore 1138 2 baload 1008 2 bastore 1166 2 baload (boolean) 1025 2 bastore (boolean) 1015 2 JDK 1.0.2 on SPARC 20 noop 541 0 aaload 1183 2 aastore 1583 2 aastore(in interface array) 2598 4 aastore(null) 1404 2 iaload 1252 2 iastore 1230 2 laload 1300 2 lastore 1223 2 baload 1207 2 bastore 1350 2 baload (boolean) 1403 2 bastore (boolean) 1200 2 JDK 1.1 on R10K noop 357 0 aaload 789 2 aastore 1149 3 aastore(in interface array) 1236 3 aastore(null) 959 2 iaload 815 2 iastore 799 2 laload 812 2 lastore 874 2 baload 809 2 bastore 896 2 baload (boolean) 810 2 bastore (boolean) 819 2 JDK 1.1 on SPARC 20 noop 242 0 aaload 515 2 aastore 767 3 aastore(in interface array) 847 3 aastore(null) 581 2 iaload 521 2 iastore 499 2 laload 531 2 lastore 524 2 baload 516 2 bastore 571 2 baload (boolean) 516 2 bastore (boolean) 500 2 JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium) noop 135 0 aaload 255 1 aastore 666 4 aastore(in interface array) 591 4 aastore(null) 375 2 iaload 255 1 iastore 270 2 laload 265 1 lastore 345 2 baload 275 2 bastore 315 2 baload (boolean) 270 2 bastore (boolean) 290 2 Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium) noop 411 0 aaload 771 1 aastore 921 2 aastore(in interface array) 961 2 aastore(null) 802 1 iaload 791 1 iastore 751 1 laload 791 1 lastore 881 2 baload 801 1 bastore 832 2 baload (boolean) 801 1 bastore (boolean) 761 1
Comments
Here you can see that arrays are slower to work with than either fields or locals. Also note that storing values in an interface array takes longer - when an object is placed in the an object array the JVM must perform a runtime check to make sure that the object is compatible with the array type. This can take extra time.
Method Invocation
JDK 1.0.2 on R10K noop 434 0 invokevirtual 1742 4 invokevirtual(final) 1736 4 invokevirtual(private,final) 1733 3 invokevirtual(synchronized) 7439 17 invokevirtual(native) 1505 3 invokespecial(private,final) 1613 3 invokestatic 1545 3 invokestatic(final) 1519 3 invokestatic(native,final) 1337 3 invokeinterface 1858 4 JDK 1.0.2 on SPARC 20 noop 541 0 invokevirtual 2085 3 invokevirtual(final) 2081 3 invokevirtual(private,final) 2087 3 invokevirtual(synchronized) 10607 19 invokevirtual(native) 1672 3 invokespecial(private,final) 2320 4 invokestatic 1976 3 invokestatic(final) 1978 3 invokestatic(native,final) 1740 3 invokeinterface 2372 4 JDK 1.1 on R10K noop 357 0 invokevirtual 1407 3 invokevirtual(final) 830 2 invokevirtual(private,final) 1326 3 invokevirtual(synchronized) 3373 9 invokevirtual(native) 1246 3 invokespecial(private,final) 866 2 invokestatic 715 2 invokestatic(final) 727 2 invokestatic(native,final) 1084 3 invokeinterface 1489 4 JDK 1.1 on SPARC 20 noop 242 0 invokevirtual 1052 4 invokevirtual(final) 524 2 invokevirtual(private,final) 1055 4 invokevirtual(synchronized) 2664 10 invokevirtual(native) 1238 5 invokespecial(private,final) 529 2 invokestatic 485 2 invokestatic(final) 505 2 invokestatic(native,final) 1143 4 invokeinterface 1441 5 JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium) noop 135 0 invokevirtual 600 4 invokevirtual(final) 261 1 invokevirtual(private,final) 570 4 invokevirtual(synchronized) 4606 34 invokevirtual(native) 661 4 invokespecial(private,final) 250 1 invokestatic 255 1 invokestatic(final) 255 1 invokestatic(native,final) 631 4 invokeinterface 831 6 Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium) noop 411 0 invokevirtual 1942 4 invokevirtual(final) 1963 4 invokevirtual(private,final) 2023 4 invokevirtual(synchronized) 2814 6 invokevirtual(native) 1532 3 invokespecial(private,final) 2043 4 invokestatic 1753 4 invokestatic(final) 1803 4 invokestatic(native,final) 1432 3 invokeinterface 2153 5Comments
Notice how consistently slow synchronized methods are. Also see how virtual methods are slower than static methods. Method calls are slower than most other instructions on the JVM.
Object Operations
JDK 1.0.2 on R10K noop 434 0 instanceof(class) 1125 2 instanceof(interface) 1868 4 checkcast(class) 1105 2 checkcast(interface) 2009 4 new(Object) 4317 9 new(String) 8699 20 new(Test) 5439 12 newarray 3448 7 anewarray 3656 8 JDK 1.0.2 on SPARC 20 noop 541 0 instanceof(class) 1289 2 instanceof(interface) 2407 4 checkcast(class) 1180 2 checkcast(interface) 2287 4 new(Object) 6250 11 new(String) 13394 24 new(Test) 7816 14 newarray 6088 11 anewarray 6507 12 JDK 1.1 on R10K noop 357 0 instanceof(class) 926 2 instanceof(interface) 1002 2 checkcast(class) 933 2 checkcast(interface) 1009 2 new(Object) 3845 10 new(String) 8731 24 new(Test) 4125 11 newarray 3881 10 anewarray 4024 11 JDK 1.1 on SPARC 20 noop 242 0 instanceof(class) 613 2 instanceof(interface) 795 3 checkcast(class) 575 2 checkcast(interface) 725 2 new(Object) 4135 17 new(String) 9032 37 new(Test) 4467 18 newarray 4624 19 anewarray 4930 20 JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium) noop 135 0 instanceof(class) 406 3 instanceof(interface) 696 5 checkcast(class) 390 2 checkcast(interface) 495 3 new(Object) 4216 31 new(String) 12498 92 new(Test) 5513 40 newarray 4942 36 anewarray 5262 38 Microsoft SDK-Java 1.5 noop 411 0 instanceof(class) 1211 2 instanceof(interface) 1252 3 checkcast(class) 1212 2 checkcast(interface) 1252 3 new(Object) 8482 20 new(String) 18537 45 new(Test) 11096 26 newarray 6209 15 anewarray 6889 16Comments
Its surprising how long it takes to make new objects. This is because the new instruction does quite a lot of work. For more complex classes, new will take even longer. Making arrays is generally faster than making objects, since no object initialization needs to be performed (arrays are initialized to null (for object arrays) or 0 (for numeric arrays)).
Also notice that testing if something belongs to an interface is slower than testing if it belongs to a class.
Arithmetic
JDK 1.0.2 on R10K noop 434 0 barith 1068 2 sarith 1050 2 iarith 946 2 larith 1256 2 farith 926 2 darith 1269 2 JDK 1.0.2 on SPARC 20 noop 541 0 barith 1274 2 sarith 1246 2 iarith 1103 2 larith 1225 2 farith 1125 2 darith 1237 2 JDK 1.1 on R10K noop 357 0 barith 913 2 sarith 878 2 iarith 768 2 larith 925 2 farith 757 2 darith 909 2 JDK 1.1 on SPARC 20 noop 242 0 barith 621 2 sarith 585 2 iarith 483 1 larith 547 2 farith 483 1 darith 555 2 JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium) noop 135 0 barith 260 1 sarith 270 1 iarith 245 1 larith 280 2 farith 300 2 darith 390 2 Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium) noop 411 0 barith 882 2 sarith 881 2 iarith 781 1 larith 821 1 farith 791 1 darith 822 1Comments
On SGI systems, floating point arithment is actually a little faster than integer arithmetic, which is a surprising result. In general, floats and ints are faster than longs and doubles.
As things stand now, Java is not as fast as C for image processing, graphics, or other computationally intensive tasks. But what about in the future?
Expect to see hardware support for Java, better JIT code generators, and even fully-compiling Java compilers (compilers which go directly from Java source code to native code without the using Java class files as an intermediate representation). These technologies will all help to make Java code run significantly faster - as fast as C++ code in some cases. (though for certain kinds of operation C++ will always be faster than Java).
I've heard from an internal source at Sun that they are betting on JIT to deliver faster Java performance in the future, rather than on more optimizations in the static compiler. Just how far JIT technology can go is still an open question, but you can bet that this will be a highly competitive arena.
In the meantime, the techniques described in this article should help you to write Java code which runs quickly on todays Java runtime systems.
Copyright © Jon Meyer, June 1997