Writing a GCC Front End
GCC, the premier free software compiler suite, has undergone many changes in the last few years. One change in particular, the merging of the tree-ssa branch, has made it much simpler to write a new GCC front end.
GCC always has had two different internal representations, trees and RTL. RTL, the register transfer language, is the low-level representation that GCC uses when generating machine code. Traditionally, all optimizations were done in RTL. Trees are a higher-level representation; traditionally, they were less documented and less well known than RTL.
The tree-ssa Project, a long-term reworking of GCC internals spearheaded by Diego Novillo, changes all that. Now, trees are much better although still imperfectly documented, and many optimizations are done at the tree level. A side effect of this work on trees was the clear specification of a tree-based language called GENERIC. All GCC front ends generate GENERIC, which is later lowered to another tree-based representation called GIMPLE, and from there it goes to RTL.
What this means to you is that it is much, much simpler to write a new front end for GCC. In fact, it now is feasible to write a front end for GCC one without any knowledge of RTL whatsoever. This article provides a tour of how you would go about connecting your own compiler front end to GCC. The information in this article is specific to GCC 4.0, due to be released in 2005.
For our purposes, compilation is done in two phases, parsing and semantic analysis and then code generation. GCC handles the second phase for you, so the question is, what is the best way to implement phase one?
Traditional GCC front ends, such as the C and C++ front ends, generate trees during parsing. Front ends like these typically add their own tree codes for language-specific constructs. Then, after semantic analysis has completed, these trees are lowered to GENERIC by replacing high-level, language-specific trees with lower-level equivalents. One advantage of this approach is the language-specific trees usually are nearly GENERIC already. The lowering phase often can prevent too much garbage from generating.
The primary disadvantage of this approach is trees are typed dynamically. In theory, this might not seem so bad—many dynamically typed environments exist that can be used efficiently by developers, including Lisp and Python. However, these are complete environments, and GCC's heavily macro-ized C code doesn't confer the same advantages.
My preferred approach to writing a front end is to have a strongly typed, language-specific representation of the program, called an abstract syntax tree (AST). This is the approach used by the Ada front end and by gcjx, a rewrite of the front end for the Java programming language.
For instance, gcjx is written in C++ and has a class hierarchy that models the elements of the Java programming language. This code actually is independent of GCC and can be used for other purposes. In gcjx's case, the model can be lowered to GENERIC, but it also can be used to generate bytecode or JNI header files. In addition, it could be used for code introspection of various kinds; in practice, the front end is a reusable library.
This approach provides all the usual advantages of a strongly typed design, and in the GCC context, it results in a program that is easier to understand and debug. The relative independence of the resulting front end from the rest of GCC also is an advantage, because GCC changes rapidly and this loose coupling minimizes your exposure.
Potential disadvantages of this approach are the possibilities that your compiler might do more work than is strictly needed or use more memory. In practice, this doesn't seem to be too important.
Before we talk about some details of interfacing your front end to GCC, let's take a look at some of the documentation and source files you need to know. Because it hasn't been a priority in the GCC community to make it simpler to write front ends, some things you need to know are documented only in the source. The documentation references here refer to info pages and not URLs, because GCC 4.0 has not yet been released. Thus, the Web pages reflect earlier versions. Your best bet is to check out a copy of GCC from CVS and dig around in the source.
gcc/c.opt: describes command-line options used by the C family of front ends. More importantly, it describes the format of the .opt files. You'll be writing one of these.
gcc info page, node Spec Files (source file gcc/doc/invoke.texi): describes the spec minilanguage used by the GCC driver. You'll write some specs to tell GCC how to invoke your front end.
gccint info page, node Front End (source file gcc/doc/sourcebuild.texi): describes how to integrate your front end into the GCC build process.
gccint info page, node Tree SSA (source file gcc/doc/tree-ssa.texi): describes GENERIC.
gcc/tree.def, gcc/tree.h: some attributes of trees don't seem to be documented, and reading these files can help. tree.def defines all the tree codes and is, in large part, explanatory comments. tree.h defines the tree node structures, the many accessor macros and declares functions that are useful in building trees of various types.
libcpp/include/line-map.h: line maps are used to represent source code locations in GCC. You may or may not use these in your front end—gcjx does not. Even if you do not use them, you need to build them when lowering to GENERIC, as information in line maps is used when generating debug information.
gcc/errors.h, gcc/diagnostic.h: defines the interface to GCC's error formatting functions, which you may choose to use.
gcc/gdbinit.in: defines some GDB commands that are handy when debugging GCC. For instance, the pt command prints a textual representation of a tree. The file .gdbinit also is made in the GCC build directory; if you debug there, the macros immediately are available.
gcc/langhooks.h: lang hooks are a mechanism GCC uses to allow front ends to control some aspects of GCC's behavior. Each front end must define its own copy of the langhooks structures; these structures consist largely of function pointers. GCC's middle and back ends call these functions to make language-specific decisions during compilation. The langhooks structures do change from time to time, but due to the way GCC expects front ends to initialize these structures, you largely are insulated from these changes at the source level. Some of these lang hooks are not optional, so your front end is going to implement them. Others are ad hoc additions for particular problems. For instance, the can_use_bit_fields_p hook was introduced solely to work around an optimization problem with the current gcj front end.
Editorial Advisory Panel
Thank you to our 2014 Editorial Advisors!
- Jeff Parent
- Brad Baillio
- Nick Baronian
- Steve Case
- Chadalavada Kalyana
- Caleb Cullen
- Keir Davis
- Michael Eager
- Nick Faltys
- Dennis Frey
- Philip Jacob
- Jay Kruizenga
- Steve Marquez
- Dave McAllister
- Craig Oda
- Mike Roberts
- Chris Stark
- Patrick Swartz
- David Lynch
- Alicia Gibb
- Thomas Quinlan
- Carson McDonald
- Kristen Shoemaker
- Charnell Luchich
- James Walker
- Victor Gregorio
- Hari Boukis
- Brian Conner
- David Lane