Compiling Java with GCJ
Java has not become as pervasive as the original hype suggested, but it is a popular language, used a lot for in-house and server-side development and other applications. Java has less mind-share in the free software world, although many projects are now using it. Examples of free projects using Java include Jakarta from the Apache Foundation (jakarta.apache.org), various XML tools from W3C (www.w3.org) and Freenet (freenet.sourceforge.net). See also the FSF's Java page (www.gnu.org/software/java).
One reason relatively few projects use Java has been the real or perceived lack of quality, free implementations of Java. Two free Java implementations, however, have been around since the early days of Java. One is Kaffe (www.kaffe.org), originally written by Tim Wilkinson and still developed by the company he cofounded, Transvirtual. The other is GCJ (the GNU Compiler for the Java language), which I started in 1996 at Cygnus Solutions (and which this article discusses). GCJ has been fully integrated and supported as a GCC language since GCC version 3.0.
The traditional way to implement Java is a two-step process: a translation phase and an execution phase. (In this respect Java is like C.) A Java program is compiled by javac, which produces one or more files with the extension .class. Each such file is the binary representation of the information in a single class, including the expressions and statements of the class' methods. All of these have been translated into bytecode, which is basically the instruction set for a virtual, stack-based computer. (Because some chips also have a Java bytecode instruction set, it also can be a real instruction set.)
The execution phase is handled by a Java Virtual Machine (JVM) that reads in and executes the .class files. Sun's version is called plain “java”. Think of the JVM as a simulator for a machine whose instruction set is Java bytecodes.
Using an interpreter (simulator) adds quite a bit of execution overhead. A common solution for high-performance JVMs is to use dynamic translation or just-in-time (JIT) compilers. In that case, the runtime system will notice a method has been called enough times to make it worthwhile to generate machine code for that method on the fly. Future calls to the method will execute the machine code directly.
A problem with JITs is startup overhead. It takes time to compile a method, especially if you want to do any optimization, and this compilation is done each time the application is run. If you decide to compile only the methods most often executed, then you have the overhead of measuring those. Another problem is that a good JIT is complex and takes up a fair bit of space (plus the generated code needs space, which may be on top of the space used by the original bytecode). Little of this space can be in shared memory.
Traditional Java implementation techniques also do not interoperate well with other languages. Applications are deployed differently (a Java Archive .jar file, rather than an executable); they require a big runtime system, and calling between Java and C/C++ is slow and inconvenient.
The approach of the GCJ Project is radically traditional. We view Java as simply another programming language and implement it the way we implement other compiled languages. As Cygnus had been long involved with GCC, which was already being used to compile a number of different programming languages (C, C++, Pascal, Ada, Modula2, Fortran, Chill), it made sense to think about compiling Java to native code using GCC.
On the whole, compiling a Java program is actually much simpler than compiling a C++ program, because Java has no templates and no preprocessor. The type system, object model and exception-handling model are also simpler. In order to compile a Java program, the program basically is represented as an abstract syntax tree, using the same data structure GCC uses for all of its languages. For each Java construct, we use the same internal representation as the equivalent C++ would use, and GCC takes care of the rest.
GCJ can then make use of all the optimizations and tools already built for the GNU tools. Examples of optimizations are common sub-expression elimination, strength reduction, loop optimization and register allocation. Additionally, GCJ can do more sophisticated and time-consuming optimizations than a just-in-time compiler can. Some people argue, however, that a JIT can do more tailored and adaptive optimizations (for example, change the code depending on actual execution). In fact, Sun's HotSpot technology is based on this premise, and it certainly does an impressive job. Truthfully, running a program compiled by GCJ is not always noticeably faster than running it on a JIT-based Java implementation; sometimes it even may be slower, but that usually is because we have not had time to implement Java-specific optimizations and tuning in GCJ, rather than any inherent advantage of HotSpot technology. GCJ is often significantly faster than alternative JVMs, and it is getting faster as people improve it.
A big advantage of GCJ is startup speed and modest memory usage. Originally, people claimed that bytecode was more space-efficient than native instruction sets. This is true to some extent, but remember that about half the space in a .class file is taken up by symbolic (non-instruction) information. These symbols are duplicated for each .class file, while ELF executables or libraries can do much more sharing. But where bytecodes really lose out to native code is in terms of memory inside a JVM with a JIT. Starting up Sun's JVM and JIT compiling and applications' classes take a huge amount of time and memory. For example, Sun's IDE Forte for Java (available in the NetBeans open-source version) is huge. Starting up NetBeans takes 74MB (as reported by the top command) before you actually start doing anything. The amount of main memory used by Java applications complicates their deployment. An illustration is JEmacs (JEmacs.sourceforge.net), a (not very active) project of mine to implement Emacs in Java using Swing (and Kawa, discussed below, for Emacs Lisp support). Starting up a simple editor window using Sun's JDK1.3.1 takes 26MB (according to top). XEmacs, in contrast, takes 8MB.
Running the Kawa test suite using GCJ vs. JDK1.3.1, GCJ is about twice as fast, causes about half the page faults (according to the time command) and uses about 25% less memory (according to top). The test suite is a script that starts the Java environment multiple times and runs too many different things for a JIT to help (which penalizes JDK). It also loads Scheme code interactively, so GCJ has to run it using its interpreter (which penalizes GCJ). This experiment is not a real benchmark, but it does indicate that even in its current status you can get improved performance using GCJ. (As always, if you are concerned about performance, run your own benchmark based on your expected job mix.)
GCJ has other advantages, such as debugging with GDB and interfacing with C/C++ (mentioned below). Finally, GCJ is free software, based on the industry-standard GCC, allowing it to be freely modified, ported and distributed.
Some have complained that ahead-of-time compilation loses the big write-once, run-anywhere portability advantage of bytecodes. However, that argument ignores the distinction between distribution and installation. We do not propose native executables as a distribution format, expect perhaps as prebuilt packages (e.g., RPMs) for a particular architecture. You still can use Java bytecodes as a distribution format, even though they don't have any major advantages over Java source code. (Java source code tends to have fewer portability problems than C or C++ source.) We suggest that when you install a Java application, you should compile it to native code if it isn't already so compiled.
- High-Availability Storage with HA-LVM
- DNSMasq, the Pint-Sized Super Dæmon!
- March 2015 Issue of Linux Journal: System Administration
- Localhost DNS Cache
- Real-Time Rogue Wireless Access Point Detection with the Raspberry Pi
- Days Between Dates: the Counting
- The Usability of GNOME
- PostgreSQL, the NoSQL Database
- Linux for Astronomers
- You're the Boss with UBOS