Writing Efficient Java

Jon Meyer, June 1997

Author's note: This document was written in draft form in 1997, for potential inclusion into my book on the Java Virtual Machine. It didn't make it into the book, and it is now out of date and has certain typos. However, much of the information it contains is still relevant.

In this document we discuss how to write efficient Java code. The discussion starts with a high level introduction. By the end of the document, we'll be looking at performance benchmarks for individual JVM instructions. Our goal is to examine ways to optimize Java code. We will look at some common optimization techniques, and also see speed comparisons of several JVM implementations.

This document is written for programmers with some Java programming experience. It also contains pointers that will be useful for newer programmers.

Writing Optimized Code

Speed is important. Writing code that runs quickly isn't just good for marketing brochures and customer satisfaction, it also lets you tackle new kinds of problems that you wouldn't otherwise consider.

In Java, a large number of factors determine how fast a particular application runs, including the virtual machine implementation it is running on, the efficiency of the standard Java classes provided with that machine, and the efficiency of the application's own Java classes. To complicate matters further, if you are running over the internet, you have to take into account the download time for your classes. If you use lots of classes, this can be a significant amount of time (you can use an archive format such as JAR can reduce this overhead).

You can make code go faster by using better compilation and execution technology (e.g. using an optimizing compiler instead of a non-optimizing compiler, or switching to a faster virtual machine or computer). However, in some cases this isn't enough and you need to spend some effort rewriting the code so that it runs faster. We'll see some rewriting techniques later in this document.

What Should You Optimize

  • Don't hand-optimize code unless you have to.
  • A good general rule with regards to hand optimizing your code is: don't do it unless you have to.

    Programmers are sometimes tempted to put in that extra effort to write really optimal code, without considering carefully if the code they are optimizing is actually the cause of a speed bottleneck.

    Its easy to fall into the trap of thinking that every line of code must be blazingly fast. But code that is hand-optimized is usually harder to read than code which hasn't been written for speed, because it contains non-obvious reorganizations designed to make it go faster (e.g. unrolled loops, extra variable assignments or missing variable assignments, code repeated several times with minor variations, etc.) Understanding and debugging code that is hand-optimized is therefore hard, compared to understanding code that has been written for legibility instead than speed. This is a problem when you have to come back and maintain or modify an application.

    Its better to only concentrate hand-optimization efforts on the parts of the code that are 'speed critical'. Detecting these areas of code is sometimes difficult - profiling tools such as jprof can help, and some of the Java debuggers are beginning to appear with profilers. If you don't have a profiler, look for functions that are used a lot in the code, or loops that have many iterations.

  • Go for order of magnitude improvements, ignore minor ones.
  • When deciding what code to optimize, think big. Concentrate on areas in your application that are noticably slow to the user. A small increase in the speed of an inner loop can lead to a much larger increase in the overall speed of the program - look for optimizations that will double the speed of the resulting program, not just increase it by a factor of 10%.

    In some case, rather than bashing on the code optimizing it again and again for a small gain in speed, its better to rethink the approach and write the code again from scratch. Using a different algorithm or datastructure may be a better solution than trying to make the existing system just one more notch faster. Rewriting code often leads to faster, cleaner applications that have less bugs and are easier to maintain.

    In those cases where you do add optimizations, make sure you add clear comments explaining what is optimized and why. There is nothing worse than trying to understand code you wrote several months back which is full of poorly commented optimization tricks.

    Optimization In Four Steps

    In this document, we are going to look at some of the techniques you can use to make your Java code run more quickly. We'll talk about optimizing techniques in four steps: (1) use an optimizing compiler, (2) utilize standard classes effectively, (3) understand the machine, and (4) analyze the bytecode. We won't discuss profiling techniques here.

    At the end of the document, we'll look into the future a little, and discuss how to write code which will run well on the next generation JVM engines.

    Step 1: Use An Optimizing Compiler

    Before reaching for your tuning forks and starting to hand modify your code, its worth seeing how fast you can make your program run just by using an optimizing Java compiler. All the commercial strength Java compilers out there offer some level of optimization support, but its surprising how many programmers forget to use the optimizer.

    For Sun's JDK, to turn on the optimizer you simply supply javac with a "-O" flag. For example:

      javac -O MyClass.java
    

    This runs the optimizing compiler on the Java class. In some cases, the -O flag can really make a difference to the speed of the resulting class.

  • Always try the "-O" approach before hand optimizing
  • When the optimizer is turned on, the compiler takes longer to compile the file, since it is spending more time analyzing the code looking for optimizations. For this reason, many programmers only use the optimizer on code they are about to ship.

    How Effective Are Optimzing Java Compilers?

    Compiling your code with an optimizer is an important first step. If you are using Sun's JDK, for example, using the -O flag will give you:

    Peephole optimization looks at a few instructions at a time, replacing common sequences of operations with faster versions that have the same results. For example, a peephole optimizer might replace a jump to a goto instruction with a jump directly to the destination. saving one instruction. Peephole optimizaters also removes dead code and unused variables, and perform other small code rewrites.

    These techniques do improve resulting code, but they won't necessarily solve your speed problems. The sad reality is that existing Java compilers don't do all that many optimizations. You may be surprised to learn that Sun's compiler (and several other Java compilers I've tried) doesn't perform any of the following standard optimization techniques:

    [We don't have room here to discuss these optimization techniques in detail. See [dragon book] if you want to learn more about optimization.]

    Java is relatively new, and good optimizing compilers have yet to be written. Future releases may well include more of these optimization techniques.

    It certainly doesn't help that Java is fairly complex to optimize. Java programs consist of method calls, exception handlers and threads. These can all have subtle side effects which must be analyzed carefully by the optimizer before it can make optimizations. Developing a good optimizing compiler is hard.

    JIT (Just in time) code generation technology may be another factor holding up the development of optimizing Java compilers. Just-in-time code generators convert Java class files to native code on the fly, dynamically at runtime, just before it is executed. The JIT implementations themselves include various optimization tricks, which make the static optimizations performed by Java compilers less important. Java compiler writers may be holding off from developing fully optimizing Java compilers until it becomes clearer just how far JIT technology will go; spending time optimizing Java bytecodes at compile time within a Java compiler may be counterproductive if the JIT code generator in the runtime system doesn't produce effective native code from the results!

    Step 2: Utilize Standard Classes Effectively

    An optimizing compiler can save you some work, but it isn't an automatic guarantee of speed.

    To write fast Java code, its important to become familiar with the standard Java classes shipped with Java. Using these classes wisely will help to make your code run quickly.

    For example, consider:

        for (int i = 0; i < numItems; i++) 
            dst[i] = src[i];
    

    which copies numItems from one array to another. Java provides a builtin method which does exactly this, so an experienced Java hacker would replace this with:

        System.arraycopy(src, 0, dst, 0, numItems);
    

    Since many classes rely on the System.arraycopy method, Java implementations are likely to ensure that this goes fast - e.g. by using native code. So using System.arraycopy is probably going to lead to faster copying time than manually looping through the array in Java.

    The following table lists all the public methods provided in the standard Java 1.1 classes which are marked as native in Sun's JDK. This will give you a good starting point for learning about which methods the Sun developers felt needed native support. In some cases, native methods are used to access facilities not available in the Java language. In other cases, native code is used for its raw speed.

    Class                           Method
    ---------------------------------------------------------------------
    
    java.io.File                        boolean isAbsolute()
    
    java.io.FileDescriptor              boolean valid()
                                        void sync() 
    
    java.io.FileInputStream             int read() 
                                        long skip(long n) 
                                        int available() 
                                        void close() 
    
    java.io.FileOutputStream            void write(int b) 
                                        void close() 
    
    java.io.RandomAccessFile            int read() 
                                        void write(int b) 
                                        long getFilePointer() 
                                        void seek(long pos) 
                                        long length() 
                                        void close() 
    
    java.lang.Class                    static Class forName(String className)
                                       Object newInstance() 
                                       boolean isInstance(Object obj)
                                       boolean isAssignableFrom(Class cls)
                                       boolean isInterface()
                                       boolean isArray()
                                       boolean isPrimitive()
                                       String getName()
                                       ClassLoader getClassLoader()
                                       Class getSuperclass()
                                       Class[] getInterfaces()
                                       Class getComponentType()
                                       int getModifiers()
                                       Object[] getSigners()
    
    java.lang.Compiler              static boolean compileClass(Class clazz)
                                    static boolean compileClasses(String string)
                                    static Object command(Object any)
                                    static void enable()
                                    static void disable()
    
    java.lang.Double                    static long doubleToLongBits(double value)
                                        static double longBitsToDouble(long bits)
    
    java.lang.Float                    static int floatToIntBits(float value)
                                       static float intBitsToFloat(int bits)
    
    java.lang.Math                    static double sin(double a)
                                      static double cos(double a)
                                      static double tan(double a)
                                      static double asin(double a)
                                      static double acos(double a) 
                                      static double atan(double a)
                                      static double exp(double a)
                                      static double log(double a)
                                      static double sqrt(double a)
                                      static double IEEEremainder(double f1, 
                                                                double f2)
                                      static double ceil(double a)
                                      static double floor(double a)
                                      static double rint(double a)
                                      static double atan2(double a, double b)
                                      static double pow(double a, double b)
    
    java.lang.Object                    final Class getClass()
                                        int hashCode()
                                        final void notify()
                                        final void notifyAll()
                                        final void wait(long timeout) 
    
    java.lang.Runtime               long freeMemory()
                                    long totalMemory()
                                    void gc()
                                    void runFinalization()
                                    void traceInstructions(boolean on)
                                    void traceMethodCalls(boolean on)
    
    java.lang.String                    String intern()
    
    java.lang.System                    static long currentTimeMillis()
                                        static void arraycopy(Object src, ...)
                                        static int identityHashCode(Object x)
    
    java.lang.Thread                    static Thread currentThread()
                                        static void yield()
                                        static void sleep(long millis) 
                                        final boolean isAlive()
                                        int countStackFrames()
    
    java.lang.Throwable             Throwable fillInStackTrace()
    
    

  • Use builtin methods and classes where possible
  • In general, if there is a builtin class offering a particular feature, its better to use that class than to reinvent your own. The builtin classes are more likely to be tuned properly for the JVM they run on, and will probably be faster than your homegrown alternatives.

    This isn't a hard and fast rule, however. For example, consider the java.util.Vector class. Most of the public methods of this class are marked as synchronized. Synchronized methods have much higher overheads than unsynchronized methods (more on this later), so if you need expandable vectors in a time critical portion of your code, its better to write your own vector class or use arrays instead.

    You can use javap to find out if a particular method in Java is synchronized. For example, to find out which methods in java.util.Hashtable are synchronized, use:

        javap java.util.Hashtable
    

    This produces something like the printout below (your version may be more detailed...)

    Compiled from Hashtable.java
    public class Hashtable extends Dictionary 
                           implements Cloneable, Serializable {
        public Hashtable(int,float);
        public Hashtable(int);
        public Hashtable();
        public int size();
        public boolean isEmpty();
        public synchronized Enumeration keys();
        public synchronized Enumeration elements();
        public synchronized boolean contains(Object);
        public synchronized boolean containsKey(Object);
        public synchronized Object get(Object);
        protected void rehash();
        public synchronized Object put(Object,Object);
        public synchronized Object remove(Object);
        public synchronized void clear();
        public synchronized Object clone();
        public synchronized String toString();
    }
    
    

    As you can see, many of the methods in java.util.Hashtable are synchronized. So if you are using Hashtables (or Vectors) in a performance critical part of your code, and don't need thread synchronization, you should consider implementing your own versions of the Hashtable/Vector classes. In my tests, an unsynchronized version of the Hashtable class runs about 50% faster than a synchronized version (don't rely on this figure - other runtime systems will produce different results).

    Slow Performers

    There are several well known slow-performers in Java.

  • Write simple test programs
  • An important technique is to write small test programs and try things out. If threads are critical to your application, write a small Java application which creates some threads, and do some timings. You may find that the JVM you are using handles your case with performance to spare. We'll show an example of a simple test program later in this document.

    Tricks and Techniques

  • Borrow shamelessly (but carefully)

  • As you become more experienced in Java, you'll learn more about how to use the standard class libraries. One way to do this is to study Java source code that is in the public domain. You will probably learn some things about the Java language that you didn't know. In some cases you can discover tricks that translate into faster ways of doing things in your code. (of course, you need to be sensitive to ownership and copyright issues when doing this).

    One useful method you might come across is the String.intern() method. This takes the String object you apply it to and looks it up in an internal hash table. If a String object with the same characters already exists in the table, then the String instance from the table is returned. Otherwise your String is added to the table and returned. When you intern a String, you end up with a reference to a unique String object containing the same characters. Because its unique, you can use "==" to perform fast equality tests on strings, without having to compare the characters one at a time using the equals() method.

    Literal strings are 'interned' automatically. This means that:

       "abc" == "abc"
    

    evaluates to true - i.e. both "abc" literals refererence the same underlying string object (N.B. a bug in Sun's JDK 1.0.2 prevents this from working properly - try it on your runtime system before relying on this behavior).

    Note that:

       "abc" == ("a" + "b" + "c")
    

    evaluates to false - the string built by dynamically concatenating the three literals "a", "b" and "c" is not interned automatically. Of course you can intern it explicitly by calling the intern() method, so the expression

       "abc" == ("a" + "b" + "c").intern()
    

    evaluates to true.

    Using interned Strings can up programs that perform a lot of symbol manipulation, such as compilers or parsers.

    Step 3: Understand the machine

    Knowing the Java class library well and using a good optimizing compiler will get you 90% of the way.

    But what do you do if your code is still too slow? Before resorting to an extreme measure like using native methods, its worth looking at what your code is doing. With some knowledge of how the JVM works, it may be possible to rewrite parts of the code and eliminate the performance problem.

    For example, consider the following three loops, where vector is an instance of java.util.Vector, and array is an array of Objects:

        // Iteration using Enumeration
        for (Enumeration e = vector.elements(); e.hasMoreElements();) {
            Object x = e.nextElement();
        }
    
    
        // Iteration using elementAt()
        for (int j = 0; j < vector.length(); j++) {
            Object x = vector.elementAt(j);
        }
    
    
        // Iteration using arrays
        for (int j = 0; j < array.length; j++) {
            Object x = array[j];
        }
    

    Which approach is fastest? The answer is that the last version runs very quickly, the middle version runs at a medium clip, and the first one is the slowest of the three. Using our "write simple tests" rule, lets write a simple class to verify this:

    import java.util.Vector;
    import java.util.Enumeration;
    
    public class TestIteration {
    
        static Object x;
        
        public static void main(String args[]) {
            Enumeration e;
            int i;
            long startTime;
    
            Vector vector = new Vector();
            Object array[] = new Object[100];
    
            // Setup the vector and the array with some dummy objects
            for (i = 0; i < 100; i++) {
                vector.addElement(new Object());            
                array[i] = new Object();
            }
    
            // Test Enumeration
            startTime = System.currentTimeMillis();
            for (i = 0; i < 10000; i++) {
                 for (e = vector.elements(); e.hasMoreElements(); ) {
                     x = e.nextElement();
                 }
            }
            System.out.println(System.currentTimeMillis() - startTime);
    
            // Test Vector.elementAt
            startTime = System.currentTimeMillis();
            for (i = 0; i < 10000; i++) {
                 for (int j = 0; j < 100; j++) {
                     x = vector.elementAt(j);
                 }
            }
            System.out.println(System.currentTimeMillis() - startTime);
    
            // Test array access
            startTime = System.currentTimeMillis();
            for (i = 0; i < 10000; i++) {
                 for (int j = 0; j < 100; j++) {
                     x = array[j];
                 }
            }
            System.out.println(System.currentTimeMillis() - startTime);
        }
    }
    

    Running this on JDK 1.0 on my system produces:

        % java TestIteration
        14085
        8513
        537
    

    Under JDK 1.1, my system produces:

        % java TestIteration
        6329
        4136
        371
    

    As you can see, the array based approach is much faster than the Enumeration based approach. Around 20 times faster!

    Why is this? A large part of the answer lies in the cost of method calls. Lets look at this in more detail.

    Method Calls

    If you are calling a method inside a critical portion of your application, its important to know what sort of overhead is involved with that method call. Overhead refers to the time taken by the system to locate, setup and invoke the new method, and doesn't include the time actually taken to execute the method. For some types of method call, the overhead can be quite large. Overhead varies from one JVM implementation to another.

  • Synchronized and interface methods are slow
  • As a general rule of thumb, the static, final, private and native method modifiers reduce the overhead of a method call. The synchronized modifier increases the overhead of a method call, sometimes quite substantially. Also, interface method calls are slower than normal method calls. Interface method calls are used whenever you call a method on a variable or field which declared as an interface type. For example, in the code:

        Enumeration e = vector.elements();
    
        while (e.hasMoreElements()) ...
    

    The hasMoreElements() method call is an interface method call. Its important to realize that if you have:

        class E implements Enumeration {
    
            boolean hasMoreElements() { ... }
    
            ...
    
            void foo() {
                if (this.hasMoreElements()) {
                    
                }
            }
        }
    

    then the call of hasMoreElements() in foo() is not an interface method call, its a normal method call. In foo(), -this- is of type E (a class type) and not of type Enumeration (an interface type) so the this.hasMoreElements() is treated as a standard method call rather than an interface call.

    The exact overhead of a method call depends upon the JVM implementation, and also what combination of method modifiers are used. In some implementations final native methods are faster than non-final native methods, and static native methods are faster still. If you have several modifiers from the left column below, that's generally a good thing:

    Faster Slower
    static
    final
    private
    native
    synchronized
    (interface methods)
     
     
    Method Modifier Overheads

    Sometimes it is possible to replace method calls with code that doesn't involve a method. For example, if x and y are both doubles, consider the two expressions:

        y = Math.abs(x);
    

    and

        y = (x < 0 ? -x : x);
    

    The first version uses the Math.abs method to obtain an absolute value for x. The second uses the ? : syntax to perform the same operation. Even though Math.abs is a final static method (which has a fairly low overhead), the second version is still a little faster on the systems I've tried. Of course, its also requires more typing and is harder to read...

    Local Variables and Fields

    In the last section, we looked at the overhead associated with methods. What about the overheads of fields and local variables?

    As with methods, some field modifiers have higher overheads than others. Retrieving static (class) fields is faster than retrieving instance fields. Final static fields are faster still. Note that, unlike private methods, private fields are not usually any faster than public ones - the same adressing mechanisms are needed for both.

  • Local variables are faster than fields
  • In all the implementations I've tried, local variables are faster to access than fields. This means that if you are retrieving a field many times in a loop, then it may be better to cache the value of the field in a local variable, and use the local variable in the loop instead.

    This is especially true if you are using an integer counter, so:

       class Example {
          int x;
    
          void count() {
                while (true) {
                   x++;
                   ...
                }
          }
       }
    

    is not as efficient as:

       class Example {
    
          void count() {
              int x;
              while (true) {
                   x++;
                   ...
              }
          }
       }
    

    Unfortunately, existing Java compilers don't intelligently cache fields in local variables. If you want to do this kind of reorganization, you have to do it manually.

    Here are some other useful things to know about local variables:

  • The first few locals are special
  • The JVM has specialized instructions for retrieving local variables numbered 0, 1, 2 and 3. Its not mentioned anywhere in the language or compiler spec, but its reasonable to assume that the first few variables and parameters you declare in your method are going to end up with these local variable numbers. If your method only has a total of four local variables+parameters, you are all set. If it has a lot more than this, you might consider declaring important looping variables first in the code. JVM implementations currently retrieve these first few local variables a little faster than the other local variables, so it may make your code go just that little bit faster.

  • The JVM has a special instruction, iinc, for incrementing and decrementing integer local variables. You can use this fact to your advantage by making sure that you utilize the "++", "--", "+=" and "-=" operators with local variables when possible, as illustrated by the table below.

    FasterSlower
    x++
    y--
    x += 5
    y -= 5
    x = x + 1
    y = y - 1
    x = x + 5
    y = y - 5
    Incrementing/Decrementing Integer Locals

    (of course, a good optimizing compiler would transform the expressions in the right column into those shown in the left column. Unfortunately, the compilers I've tested don't do this).

    Loops

  • Assume the compiler is stupid (for now)
  • If you have a piece of code containing a loop which is performance critical, its a good idea to 'hoist' and 'unroll' the code in the loop yourself, since at least for the moment Java compilers and JIT code generators aren't necessarily going to do this for you.

    Hoisting refers to moving code out of inner loops that doesn't need to be in inner loops. For example, in a loop like:

       for (y = 0; y < height; y++) {
    
           for (x = 0; x < width; x++) {
    
               array[(y * width) + x] = 1;
    
           }
       }
    

    its better to move the (y * width) calculation outside the loop:

       for (y = 0; y < height; y++) {
    
           int offset = y * width;
    
           for (x = 0; x < width; x++) {
    
               array[offset + x] = 1;
    
           }
       }
    

    You've added an extra line to the program, making it a little longer to read, but it will run faster.

    Unrolling is the technique of replacing loops like:

        for (i = 0; i < 128; i++) {
            doSomething(i);
        }
    
    with:

        for (i = 0; i < 128; ) {
            doSomething(i++); 
            doSomething(i++); 
            doSomething(i++); 
            doSomething(i++); 
            doSomething(i++); 
            doSomething(i++); 
            doSomething(i++); 
            doSomething(i++); 
        }
    
    

    In this revised version of the loop, doSomething is called eight times in each iteration, instead of once each iteration. The overall effect of the loop is the same, but the unrolled version spends less time testing the value of i and branching, so the unrolled version of the loop takes less time to run.

    Arithmetic

  • Avoid shorts, bytes and chars
  • While the Java language has full support for short, bytes and chars, inside the JVM these types are treated as second class citizens. They are known as "storage types": There are mechanisms for declaring that a field or array stores shorts, bytes or chars, but the arithmetic instructions, stack and local variables work only with ints, floats, longs and doubles. This means that every time you assign a value to a byte/short/char field the system has to truncate the value from an int to a byte/short/char before doing the assignment (and visa versa). This conversion is fast, but it still takes time. This means if you see a loop like:

       for (short x = 0; x < 1000; x++) {
          ...
       }
    

    your better off using an int instead.

  • ints and floats are faster than doubles and longs
  • The JVM is a word oriented machine, in which the 'logical' word is 32 bits. Since doubles and longs are 64 bits apiece, they take two words of storage. On most existing implementations I've tried, the following speed ranking exists:

        ints < floats < longs < doubles
    

    Since ints are faster than floats, its a good idea to use as much integer arithmetic as possible. For example, say you have a integer between 0 and A and you want to scale it to a value between 0 and B. A very general way to write this is:

        (int)((i / (float)A) * B)
    

    which works well for any values of A and B. But in some cases you can do much better. For example, if A is 256 and B is 1024, with a little care you can rewrite the above to use integer arithmetic:

        (i * 1024) / 256
    

    Notice the multiplication by B (i.e. 1024) comes first.

    This is still not fully optimal. Observe that 1024 / 256 is four - so this expression is really just multiplying the value of i by four. Multiplies and divides by powers of 2 can be done using shifting, which is faster. So the fastest way to write the above expression is:

        i << 2
    

    I ran some tests of these simple on two systems:

    ExpressionSGI Impact Windows PC
    (int)((i / 256.0f) * 1024) 177 471
    (i * 1024) / 256 246 40
    i << 2 41 10
    Timings of arithmetic expressions

    As you can see, the shifting approach is very fast on both systems. Also notice that, on the Silicon Graphics system, the floating point divison version is slightly faster than the integer division version. SGI systems have fast floating point hardware, so this isn't all that surprising. But it does raise another guideline:

  • Run test programs on the hardware you are targetting
  • Step 4: Analyze the bytecode

    At the end of the day, your Java program will most likely end up compiled into a number of Java class files. For ultimate speed freaks (and also for programmers with a healthy level of professional curiosity) its worth taking a look at the contents of your class files to see how well your compiler is doing.

    For example, looking at the arithmetic expressions we saw in the last section, an experience programmer might expect an optimizing compiler to automatically spot code like:

        (i * 1024) / 256
    

    and, using a technique known as strength reduction, replace it with:

        i << 2
    

    (In fact, GNU's gcc compiler does this simple form of transformation even when you haven't turned optimization on!)

    How can you find out if your Java compiler is doing this transformation? The easiest way is to compile the program and then run it through a disassembler and look at the bytecode.

    You can use the javap program to disassemble a Java class file and print out its bytecode. For example, the command:

        javap -c java.lang.String
    

    shows you a text representation of the JVM bytecode that makes up the String class.

    You can also use a disassembler like the one provided by Kimera on the web (see http://kimera.cs.washington.edu/disassembler.html). This lets you disassemble a class file into Jasmin syntax. You can then take this class file, edit it, and use Jasmin (see http://www.cybergrain.com/tech/jasmin) to assemble the class back into its binary form - giving you the ultimate control over the contents of your class files.

    An Example

    Going back to our example, lets consider a Java class like:

        class Test {
            int testExpr(int x) {
                return (x * 1024) / 256;
            }
        }
    

    Compile this with optimization turned on, and then ran javap on the result. You will see something like:

    javap -c Test
    class Test extends java.lang.Object {
        int testExpr(int);
        Test();
    
    Method int testExpr(int)
       0 iload_1
       1 sipush 1024
       4 imul
       5 sipush 256
       8 idiv
       9 ireturn
    
    Method Test()
       0 aload_0
       1 invokespecial #3 ()V>
       4 return
    }
    

    Even without knowing much about the JVM instruction set, you can see the numbers 1024 and 256 appearing in the code for testExpr() - its very clear that this expression has not been converted into a shift by 2. In this case its up to you to rewrite the divide using a shift.

    If you want to learn more about what these JVM instructions are actually doing, you can look at: http://www.cybergrain.com/tech/jvmref .

    Benchmarking Instructions

    Looking at the contents of a class file, you will see lots of instructions like:

       iload_1
       sipush 1024
       invokespecial ...
    

    Just how expensive are these operations?

    In order to find out, I wrote a simple benchmarking program. The program tests a large percentage of the JVM instruction set. It doesn't test the efficiency of Java classes that are shipped with a runtime system (after all, you can't do much about improving them!), but instead just tests how fast the runtime system is able to execute the bytecode itself.

    The Tests

    I've divided the tests into a number of categories:

    Local Variables
    Setting/getting local variables

    Field Access
    Setting/getting fields

    Array Access
    Setting/getting values in arrays

    Method invocation
    Various styles of method dispatch

    Object operations
    Various object operations (e.g. making new objects)

    Arithmetic
    Some basic arithmetic tests.

    For each test category, there are a series of tests, as well as a "noop" test that shows a reference value indicating how much overhead the JVM is taking to simply run the test. Each test shows two numbers: the first is a timing for the test in milliseconds. The second is a ranking of the instruction, where the noop gets a ranking of 0, simple instructions get a ranking of 1, and more complex instructions get higher rankings.The ranking gives you an idea of how costly the instruction is.

    Looking at these results, you can get a feel for how fast a particular JVM system is (by looking at the absolute timings), and also learn about which operations are typically efficient, and which are more expensive (by looking at the relative instruction ranking).

    Note that these tests differ from the results you would see from a benchmark test (such as CaffieneMark). Benchmark tests try to exercise as many different parts of the JVM as possible, to give an accurate overall rating of a runtime system. Our tests, on the other hand, only focus on one aspect of the instruction set at a time. The results are good for understanding how fast a specific instruction is, but are not a good measure of the overall performance of a system.

    The Results

    {{These results need to be presented in a nicer format, e.g. bar charts}}


    Local Variables

    
    JDK 1.0.2 on R10K
    
    noop                                                        434       0
    dup                                                         667       1
    aload_0                                                     790       1
    aload 10                                                    696       1
    aload 260                                                   783       1
    lload 10                                                    726       1
    
    JDK 1.0.2 on SPARC 20
    
    noop                                                        541       0
    dup                                                         780       1
    aload_0                                                     780       1
    aload 10                                                    804       1
    aload 260                                                   924       1
    lload 10                                                    845       1
    
    JDK 1.1 on R10K
    
    noop                                                        357       0
    dup                                                         550       1
    aload_0                                                     545       1
    aload 10                                                    563       1
    aload 260                                                   659       1
    lload 10                                                    546       1
    
    JDK 1.1 on SPARC 20
    
    noop                                                        242       0
    dup                                                         330       1
    aload_0                                                     331       1
    aload 10                                                    346       1
    aload 260                                                   490       2
    lload 10                                                    354       1
    
    JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        135       0
    dup                                                         170       1
    aload_0                                                     185       1
    aload 10                                                    185       1
    aload 260                                                   305       2
    lload 10                                                    190       1
    
    Microsoft SDK-Java 1.5, Windows NT (Gateway 200MHz Pentium)
    
    noop                                                        411       0
    dup                                                         601       1
    aload_0                                                     580       1
    aload 10                                                    651       1
    aload 260                                                   691       1
    lload 10                                                    621       1
    
    
    Comments

    Here you can see that local variable 0 is a little faster to access (using aload_0) than local variable 10 (using aload 10). Retrieving local variable 260 is slower still (local variables with indices greater than 256 use a "wide" instruction form which is slower than the instruction form for the first 256 locals).


    Field Access

    
    JDK 1.0.2 on R10K
    
    noop                                                        434       0
    getfield                                                    901       2
    getfield(byte)                                              968       2
    getfield(short)                                             946       2
    getstatic                                                   711       1
    getstatic(final)                                            701       1
    
    JDK 1.0.2 on SPARC 20
    
    noop                                                        541       0
    getfield                                                    1073      1
    getfield(byte)                                              1067      1
    getfield(short)                                             1078      1
    getstatic                                                   860       1
    getstatic(final)                                            861       1
    
    JDK 1.1 on R10K
    
    noop                                                        357       0
    getfield                                                    694       1
    getfield(byte)                                              696       1
    getfield(short)                                             698       1
    getstatic                                                   584       1
    getstatic(final)                                            584       1
    
    JDK 1.1 on SPARC 20
    
    noop                                                        242       0
    getfield                                                    418       1
    getfield(byte)                                              418       1
    getfield(short)                                             419       1
    getstatic                                                   370       1
    getstatic(final)                                            370       1
    
    JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        135       0
    getfield                                                    225       1
    getfield(byte)                                              220       1
    getfield(short)                                             220       1
    getstatic                                                   215       1
    getstatic(final)                                            215       1
    
    Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        411       0
    getfield                                                    681       1
    getfield(byte)                                              681       1
    getfield(short)                                             681       1
    getstatic                                                   821       1
    getstatic(final)                                            832       2
    
    
    Comments

    Notice that these values are consistently higher than the values for retrieving local variables. Static fields are also typically faster than non-static fields. Byte and short fields are not noticably slower to retrieve than int fields (hinting perhaps that these implementations are actually using int-sized members to store byte and short values?).


    Array Access

    
    JDK 1.0.2 on R10K
    
    noop                                                        434       0
    aaload                                                      1017      2
    aastore                                                     1374      3
    aastore(in interface array)                                 2038      4
    aastore(null)                                               1169      2
    iaload                                                      1026      2
    iastore                                                     1022      2
    laload                                                      1053      2
    lastore                                                     1138      2
    baload                                                      1008      2
    bastore                                                     1166      2
    baload (boolean)                                            1025      2
    bastore (boolean)                                           1015      2
    
    JDK 1.0.2 on SPARC 20
    
    noop                                                        541       0
    aaload                                                      1183      2
    aastore                                                     1583      2
    aastore(in interface array)                                 2598      4
    aastore(null)                                               1404      2
    iaload                                                      1252      2
    iastore                                                     1230      2
    laload                                                      1300      2
    lastore                                                     1223      2
    baload                                                      1207      2
    bastore                                                     1350      2
    baload (boolean)                                            1403      2
    bastore (boolean)                                           1200      2
    
    JDK 1.1 on R10K
    
    noop                                                        357       0
    aaload                                                      789       2
    aastore                                                     1149      3
    aastore(in interface array)                                 1236      3
    aastore(null)                                               959       2
    iaload                                                      815       2
    iastore                                                     799       2
    laload                                                      812       2
    lastore                                                     874       2
    baload                                                      809       2
    bastore                                                     896       2
    baload (boolean)                                            810       2
    bastore (boolean)                                           819       2
    
    JDK 1.1 on SPARC 20
    
    noop                                                        242       0
    aaload                                                      515       2
    aastore                                                     767       3
    aastore(in interface array)                                 847       3
    aastore(null)                                               581       2
    iaload                                                      521       2
    iastore                                                     499       2
    laload                                                      531       2
    lastore                                                     524       2
    baload                                                      516       2
    bastore                                                     571       2
    baload (boolean)                                            516       2
    bastore (boolean)                                           500       2
    
    
    JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        135       0
    aaload                                                      255       1
    aastore                                                     666       4
    aastore(in interface array)                                 591       4
    aastore(null)                                               375       2
    iaload                                                      255       1
    iastore                                                     270       2
    laload                                                      265       1
    lastore                                                     345       2
    baload                                                      275       2
    bastore                                                     315       2
    baload (boolean)                                            270       2
    bastore (boolean)                                           290       2
    
    Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium) 
    
    noop                                                        411       0
    aaload                                                      771       1
    aastore                                                     921       2
    aastore(in interface array)                                 961       2
    aastore(null)                                               802       1
    iaload                                                      791       1
    iastore                                                     751       1
    laload                                                      791       1
    lastore                                                     881       2
    baload                                                      801       1
    bastore                                                     832       2
    baload (boolean)                                            801       1
    bastore (boolean)                                           761       1
    
    

    Comments

    Here you can see that arrays are slower to work with than either fields or locals. Also note that storing values in an interface array takes longer - when an object is placed in the an object array the JVM must perform a runtime check to make sure that the object is compatible with the array type. This can take extra time.


    Method Invocation

    
    JDK 1.0.2 on R10K
    
    noop                                                        434       0
    invokevirtual                                               1742      4
    invokevirtual(final)                                        1736      4
    invokevirtual(private,final)                                1733      3
    invokevirtual(synchronized)                                 7439      17
    invokevirtual(native)                                       1505      3
    invokespecial(private,final)                                1613      3
    invokestatic                                                1545      3
    invokestatic(final)                                         1519      3
    invokestatic(native,final)                                  1337      3
    invokeinterface                                             1858      4
    
    JDK 1.0.2 on SPARC 20
    
    noop                                                        541       0
    invokevirtual                                               2085      3
    invokevirtual(final)                                        2081      3
    invokevirtual(private,final)                                2087      3
    invokevirtual(synchronized)                                 10607     19
    invokevirtual(native)                                       1672      3
    invokespecial(private,final)                                2320      4
    invokestatic                                                1976      3
    invokestatic(final)                                         1978      3
    invokestatic(native,final)                                  1740      3
    invokeinterface                                             2372      4
    
    JDK 1.1 on R10K
    
    noop                                                        357       0
    invokevirtual                                               1407      3
    invokevirtual(final)                                        830       2
    invokevirtual(private,final)                                1326      3
    invokevirtual(synchronized)                                 3373      9
    invokevirtual(native)                                       1246      3
    invokespecial(private,final)                                866       2
    invokestatic                                                715       2
    invokestatic(final)                                         727       2
    invokestatic(native,final)                                  1084      3
    invokeinterface                                             1489      4
    
    JDK 1.1 on SPARC 20
    
    noop                                                        242       0
    invokevirtual                                               1052      4
    invokevirtual(final)                                        524       2
    invokevirtual(private,final)                                1055      4
    invokevirtual(synchronized)                                 2664      10
    invokevirtual(native)                                       1238      5
    invokespecial(private,final)                                529       2
    invokestatic                                                485       2
    invokestatic(final)                                         505       2
    invokestatic(native,final)                                  1143      4
    invokeinterface                                             1441      5
    
    JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        135       0
    invokevirtual                                               600       4
    invokevirtual(final)                                        261       1
    invokevirtual(private,final)                                570       4
    invokevirtual(synchronized)                                 4606      34
    invokevirtual(native)                                       661       4
    invokespecial(private,final)                                250       1
    invokestatic                                                255       1
    invokestatic(final)                                         255       1
    invokestatic(native,final)                                  631       4
    invokeinterface                                             831       6
    
    Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        411       0
    invokevirtual                                               1942      4
    invokevirtual(final)                                        1963      4
    invokevirtual(private,final)                                2023      4
    invokevirtual(synchronized)                                 2814      6
    invokevirtual(native)                                       1532      3
    invokespecial(private,final)                                2043      4
    invokestatic                                                1753      4
    invokestatic(final)                                         1803      4
    invokestatic(native,final)                                  1432      3
    invokeinterface                                             2153      5
    
    
    Comments

    Notice how consistently slow synchronized methods are. Also see how virtual methods are slower than static methods. Method calls are slower than most other instructions on the JVM.


    Object Operations

    JDK 1.0.2 on R10K
    
    noop                                                        434       0
    instanceof(class)                                           1125      2
    instanceof(interface)                                       1868      4
    checkcast(class)                                            1105      2
    checkcast(interface)                                        2009      4
    new(Object)                                                 4317      9
    new(String)                                                 8699      20
    new(Test)                                                   5439      12
    newarray                                                    3448      7
    anewarray                                                   3656      8
    
    JDK 1.0.2 on SPARC 20
    
    noop                                                        541       0
    instanceof(class)                                           1289      2
    instanceof(interface)                                       2407      4
    checkcast(class)                                            1180      2
    checkcast(interface)                                        2287      4
    new(Object)                                                 6250      11
    new(String)                                                 13394     24
    new(Test)                                                   7816      14
    newarray                                                    6088      11
    anewarray                                                   6507      12
    
    JDK 1.1 on R10K
    
    noop                                                        357       0
    instanceof(class)                                           926       2
    instanceof(interface)                                       1002      2
    checkcast(class)                                            933       2
    checkcast(interface)                                        1009      2
    new(Object)                                                 3845      10
    new(String)                                                 8731      24
    new(Test)                                                   4125      11
    newarray                                                    3881      10
    anewarray                                                   4024      11
    
    JDK 1.1 on SPARC 20
    
    noop                                                        242       0
    instanceof(class)                                           613       2
    instanceof(interface)                                       795       3
    checkcast(class)                                            575       2
    checkcast(interface)                                        725       2
    new(Object)                                                 4135      17
    new(String)                                                 9032      37
    new(Test)                                                   4467      18
    newarray                                                    4624      19
    anewarray                                                   4930      20
    
    JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        135       0
    instanceof(class)                                           406       3
    instanceof(interface)                                       696       5
    checkcast(class)                                            390       2
    checkcast(interface)                                        495       3
    new(Object)                                                 4216      31
    new(String)                                                 12498     92
    new(Test)                                                   5513      40
    newarray                                                    4942      36
    anewarray                                                   5262      38
    
    Microsoft SDK-Java 1.5
    
    noop                                                        411       0
    instanceof(class)                                           1211      2
    instanceof(interface)                                       1252      3
    checkcast(class)                                            1212      2
    checkcast(interface)                                        1252      3
    new(Object)                                                 8482      20
    new(String)                                                 18537     45
    new(Test)                                                   11096     26
    newarray                                                    6209      15
    anewarray                                                   6889      16
    
    
    Comments

    Its surprising how long it takes to make new objects. This is because the new instruction does quite a lot of work. For more complex classes, new will take even longer. Making arrays is generally faster than making objects, since no object initialization needs to be performed (arrays are initialized to null (for object arrays) or 0 (for numeric arrays)).

    Also notice that testing if something belongs to an interface is slower than testing if it belongs to a class.


    Arithmetic

    
    JDK 1.0.2 on R10K
    
    noop                                                        434       0
    barith                                                      1068      2
    sarith                                                      1050      2
    iarith                                                      946       2
    larith                                                      1256      2
    farith                                                      926       2
    darith                                                      1269      2
    
    JDK 1.0.2 on SPARC 20
    
    noop                                                        541       0
    barith                                                      1274      2
    sarith                                                      1246      2
    iarith                                                      1103      2
    larith                                                      1225      2
    farith                                                      1125      2
    darith                                                      1237      2
    
    JDK 1.1 on R10K
    
    noop                                                        357       0
    barith                                                      913       2
    sarith                                                      878       2  
    iarith                                                      768       2  
    larith                                                      925       2  
    farith                                                      757       2  
    darith                                                      909       2  
    
    JDK 1.1 on SPARC 20
    
    noop                                                        242       0  
    barith                                                      621       2  
    sarith                                                      585       2  
    iarith                                                      483       1  
    larith                                                      547       2  
    farith                                                      483       1  
    darith                                                      555       2  
    
    JDK 1.1.2, Windows NT (Gateway 200Mhz Pentium)
    
    noop                                                        135       0  
    barith                                                      260       1  
    sarith                                                      270       1  
    iarith                                                      245       1  
    larith                                                      280       2  
    farith                                                      300       2
    darith                                                      390       2
    
    
    Microsoft SDK-Java 1.5, Windows NT (Gateway 200Mhz Pentium)  
    
    noop                                                        411       0
    barith                                                      882       2
    sarith                                                      881       2
    iarith                                                      781       1
    larith                                                      821       1
    farith                                                      791       1
    darith                                                      822       1
    
    
    Comments

    On SGI systems, floating point arithment is actually a little faster than integer arithmetic, which is a surprising result. In general, floats and ints are faster than longs and doubles.


    Optimization In The Future

    As things stand now, Java is not as fast as C for image processing, graphics, or other computationally intensive tasks. But what about in the future?

    Expect to see hardware support for Java, better JIT code generators, and even fully-compiling Java compilers (compilers which go directly from Java source code to native code without the using Java class files as an intermediate representation). These technologies will all help to make Java code run significantly faster - as fast as C++ code in some cases. (though for certain kinds of operation C++ will always be faster than Java).

    I've heard from an internal source at Sun that they are betting on JIT to deliver faster Java performance in the future, rather than on more optimizations in the static compiler. Just how far JIT technology can go is still an open question, but you can bet that this will be a highly competitive arena.

    In the meantime, the techniques described in this article should help you to write Java code which runs quickly on todays Java runtime systems.