The Compiler as Attack Vector

Can an attacker build a compromised program from good source code? Yes, if he or she controls the tools. Learn how an attack can happen during the build process.

Media exposure of serious security threats has sky-rocketed in the last five years, and this has caused a strange parallel to develop. As software developers have become more aware of security problems and have taken steps to mitigate them during the development phase, attackers have been forced to become more insidious in exploit vectors. A possible vector that often is not explored is attacking the program as it is built.

I first encountered this idea while reading the September 1995 ACM classic of the month article “Trusting Trust”, by Ken Thompson. The article originally appeared in the August 1984 issue of Communications of the ACM, and it deals with the belief that ultimate security is impossible to achieve because in the chain of building an application there is no way to trust every link fully. The particular focus was on the C compiler for UNIX and how, within the build process, the programmer can be blind to the compiler's actions.

The same problem still exists currently. Because so many things in the Linux world are downloaded and compiled, an avenue of attack opens. Binary distributions like RPMs and Debian packages are becoming increasingly popular; thus, attacking the build machines for the distributions would yield many unsuspecting victims.

GCC and Glibc

Before engaging in a discussion of how such attacks could take place, it is important to become familiar with the target, and how someone would evaluate it for places to attack. GCC, written and distributed by the GNU Project, supports many languages and architectures. For the sake of brevity, we focus on ANSI C and the x86 architecture in this article.

The first task is to become more familiar with GCC—what it does to code and where. The best way to start this is to build a simple Hello World program, passing GCC the -v option at compile time. The output should look something similar to that shown in Listing 1. Examining it yields several important details, as GCC is not a single program. It invokes several programs to translate the c source file into an ELF binary. It also links in numerous system libraries with virtually no verification that they are what they appear to be.

Further information can be gained by repeating the same build with the -save-temps options. This saves the intermediate files created by GCC during the build. In addition to the binary and source file, you now have filename.i, filename.s and filename.o. The .i file contains your source after preprocessing, the .s contains the translated assembly and the .o is the assembled file before any linking happens. Using the file command on these files provides some information as to what they are.

The thing to focus on while looking through the temp files is the type and amount of code added at each step, as well as where the code comes from. Attackers look for places where they can add code, often called payloads, without being noticed. Attackers also must add statements somewhere in the flow of a program to execute the payload. For attackers, ideally this would be done with the least amount of effort, changing only one or two files. The phase that covers both these requirements is called the linking phase.

The linking phase, which generates the final ELF binary, is the best place for attackers to exploit to ensure that their changes are not detected. The linking phase also gives attackers a chance to modify the flow of the program by changing the files that are linked in by the compiler. Examining the verbose output of the Hello World build, you can see several files like linked in. These are the files an attacker will pay the most attention to because they contain the standard functions the program needs to work. These collections are often the easiest in which to add a malicious payload and the code to call it, often by replacing only a single file.

Let's take a small aside here and discuss some parts of ELF binaries, how they work and how attackers can use this to their advantage. Ask many people who write C code where their programs begin executing and they will say “main”, of course. This is true only to a point; main is where the code they wrote begins execution, but in actuality, the code started executing long before main. You can examine this with tools like nm, readelf and gdb. Executing the command readelf --l hello shows the entry point for the program. This is where the program begins executing. You then can look at what this does by setting a breakpoint for the entry point, and then run the program. You will find the program actually starts executing at a function called _start, line 47 of file <glibc-base-directory>/sysdeps/i386/elf/start.S. This is actually part of glibc.

Attackers can modify the assembly directly, or they can trace the execution to a point where they are working with C for easier modifications. In start.S, __libc_start_main is called with the comments Call the user's main function. Looking through the glibc source tree brings you to <glibc-base-directory>/sysdeps/generic/libc-start.c. Examining this file, you see that not only does this call the user's main function, it also is responsible for setting up command-line and environment options, like argc, argv and evnp, to pass to main. It is also in C, which makes modifications easier than in assembly. At this point, making an effective attack is as simple as adding code to execute before main is called. This is effective for several reasons. First, in order for the attack to succeed, only one file needs to be changed. Second, because it is before main(), typical debugging does not discover it. Finally, because main is about to be called, all the built-ins that C coders expect already have been set up.