Embedding a File in an Executable, aka Hello World, Version 5967

I recently needed to embed a file in an executable. Since I'm working at the command line with gcc et al., and not with a fancy RAD tool that makes it all happen magically, it wasn't immediately obvious how to do this. A bit of searching on the net found a hack that essentially cats the file onto the end of the executable and then deciphers where it starts based on a bunch of information I didn't want to know about. It seemed like there ought to be a better way...

And there is: objcopy to the rescue. objcopy converts object files or executables from one format to another. One of the formats it understands is "binary", which is basically any file that's not in one of the other formats it understands. So you've probably envisioned the idea: convert the file that we want to embed into an object file, and it can simply be linked in with the rest of our code.

Let's say we have a file named data.txt that we want to embed in our executable:

  # cat data.txt
  Hello world
To convert this into an object file that we can link with our program we just use objcopy to produce a ".o" file:
  # objcopy --input binary \
            --output elf32-i386 \
            --binary-architecture i386 data.txt data.o
This tells objcopy that our input file is in the "binary" format and that our output file should be in the "elf32-i386" format (object files on 32-bit x86). The --binary-architecture option tells objcopy that the output file is meant to "run" on an x86; this is needed so that ld will accept the file for linking with other x86 object files. One would think that specifying the output format as "elf32-i386" would imply this, but it does not.

Now that we have an object file we only need to include it when we run the linker:

  # gcc main.c data.o
When we run the result we get the prayed-for output:
  # ./a.out
  Hello world
Of course, I haven't told the whole story yet, nor shown you main.c. When objcopy does the above conversion it adds some "linker" symbols to the converted object file:
   _binary_data_txt_start
   _binary_data_txt_end
After linking, these symbols specify the start and end of the embedded file. The symbol names are formed by prepending _binary_ and appending _start or _end to the file name. If the file name contains any characters that would be invalid in a symbol name, they are converted to underscores (e.g., data.txt becomes data_txt). If you get unresolved names when linking using these symbols, do a hexdump -C on the object file and look at the end of the dump for the names that objcopy chose.
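If you'd rather not pick through a hex dump, nm lists the generated symbols directly. A minimal sketch, assuming GNU binutils on a 64-bit x86 system (use elf32-i386 and i386 on 32-bit):

```shell
# Build the object file and list the symbols objcopy generated.
# Besides _start and _end, objcopy also emits an absolute
# _binary_data_txt_size symbol holding the data's length.
printf 'Hello world\n' > data.txt
objcopy --input-target binary \
        --output-target elf64-x86-64 \
        --binary-architecture i386:x86-64 data.txt data.o
nm data.o
```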

The code to actually use the embedded file should now be reasonably obvious:

#include <stdio.h>

extern char _binary_data_txt_start;
extern char _binary_data_txt_end;

int main(void)
{
    char *p = &_binary_data_txt_start;

    while ( p != &_binary_data_txt_end ) putchar(*p++);
    return 0;
}
One important and subtle thing to note is that the symbols added to the object file aren't "variables". They don't contain any data; rather, their address is their value. I declare them as type char because it's convenient for this example: the embedded data is character data. However, you could declare them as anything: as int if the data is an array of integers, or as struct foo_bar_t if the data were an array of foo bars. If the embedded data is not uniform, then char is probably the most convenient: take its address and cast the pointer to the proper type as you traverse the data.

______________________

Mitch Frazier is an Associate Editor for Linux Journal.

Comments


C++ Linkage

Jon Purdy's picture

NB: In order to compile with C++, declare the symbols as follows.

extern "C" {
extern char _binary_data_txt_start;
extern char _binary_data_txt_end;
}

whoa the version number on

stabu's picture

whoa the version number on this article!

For 64-bit x86, use --output elf64-x86-64. The --binary-architecture option need not change, again somewhat unintuitively.

It's the program version number

Mitch Frazier's picture

The version number is the version of the "hello world" program, not the article. And could somebody please come up with a new standard first program. If I see "hello world" in one more language I'm gonna spit-up :).

Mitch Frazier is an Associate Editor for Linux Journal.

so much stuff for little problem...

Anonymous's picture

See man xxd for "xxd -i":

cat input_file | ( echo "unsigned char xxx[] = {"; xxd -i; echo "};" ) > output_file.c
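Worth noting: if you hand xxd the file name instead of piping through it, it generates the declaration and a length variable by itself, naming the array after the file (punctuation mapped to underscores). A quick sketch, assuming xxd is installed:

```shell
printf 'Hello world\n' > data.txt
# Produces: unsigned char data_txt[] = { ... };
#           unsigned int data_txt_len = 12;
xxd -i data.txt > data.c
cat data.c
```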

There is another, portable way to do this

Hans Bezemer's picture

I was facing exactly the same problem when I wanted to embed 4tH bytecode into an executable. The trick is to convert the file into a C file that can be compiled properly by any C compiler. 4tH features a program to do that. In essence it works like this: you read the file in binary mode byte by byte and convert those bytes to unsigned characters. A converted file looks like this:

static unit HelloWorld [] = {
  '\x01', '\x02', '\x04', '\x00', '\xff', '\xff', '\xff', '\x7f', '\x04',
  '\x5c', '\x03', '\x08', '\x02', '\x02', '\x02', '\x0d', '\x08', '\x08',
  '\x08', '\x05', '\x08', '\x02', '\x48', '\x65', '\x6c', '\x6c', '\x6f',
  '\x20', '\x77', '\x6f', '\x72', '\x6c', '\x64', '\x21', '\x00', '\xfd'
};

'unit' is equivalent to 'unsigned char'. You can even embed several files like this. IMHO this method is more transparent to both the programmer and the compiler. The source to do this is pretty trivial:

\ 4tH binary to .h file converter - Copyright 2007 J.L. Bezemer
\ You can redistribute this file and/or modify it under
\ the terms of the GNU General Public License

\ This file is geared toward the conversion of 4tH HX bytecode.
\ In order to convert other binary files, just change 'unit' to 'char'.
s" static unit " sconstant header      \ declaration header

include lib/argopen.4th                \ use ARG-OPEN word
include lib/ulcase.4th                 \ case conversion

     9 constant /line                  \ number of bytes per line
char ' constant quote                  \ single quote character
char , constant colon                  \ single colon character

/line string line                      \ input buffer

: .char ." '\x" <# # # #> s>lower type quote emit ;
: .char, .char colon emit space ;      ( n --)
: ?c@ dup if 1- chars + c@ else 2drop 0 then ;
: ?char if ?c@ .char else 2drop then ; ( a n f --)
: .header header type 1 args type ."  [] = {" cr ;
: .footer ." };" cr ;                  ( --)
: ?bounds space space over 0<> and if 1- then bounds ;
: read over over accept tuck <> ;      ( a n1 -- a n2 f)
: .line >r 2dup r@ ?bounds ?do i c@ .char, loop r@ ?char cr r> ;
: .lines hex begin line /line read .line until ;
: Usage argn 4 < abort" Usage: bin2h variable file h-file" ;
: OpenFiles Usage input 2 arg-open output 3 arg-open ;
: Convert Openfiles .header .lines .footer close close ;

Convert

Hans Bezemer

Same Thing Using "Standard" Linux Commands

Mitch Frazier's picture

As I alluded to in my comment reply below about assembler output, you can create C (or assembler) data with standard Linux commands:

#!/bin/bash

if [[ $# -ne 1 ]]; then
    echo "Usage: $0 FILENAME"
    exit 1
fi
file=$1

if [[ ! -f "$file" ]]; then
    echo "File not found: $file"
    exit 1
fi

cname=$file
cname=${cname//-/_}
cname=${cname//./_}

echo "static unsigned char $cname[] = {"
hexdump -v -e '" " 16/1 "  0x%02x, " "\n"' $file | \
   sed -e '$s/0x  ,//g'
echo "};"

Using objcopy does this without the extra compilation step, although using the result is a bit more obscure. The other thing I like about using objcopy is that it doesn't leave a "temporary" ".c" file sitting around. Makes me nervous deleting ".c" files.

PS: Try this; the hexdump command looks freaky but it actually does work!

Mitch Frazier is an Associate Editor for Linux Journal.

That is one of the most

Anonymous's picture

That is one of the most interesting things I have ever seen in this magazine. It's almost an introduction to how a linker works. It would be really excellent to expand upon this article, although I'm not expert enough to suggest in what way.

Thanks.

Use reswrap instead

Sander's picture

Or you could just use a utility called reswrap, which can convert any file into C/C++ data arrays. It's more portable and a lot easier to use.
It's part of the FOX toolkit (www.fox-toolkit.org):


Usage: reswrap [options] [-o[a] outfile] files...
Convert files containing images, text, or binary data into C/C++ data arrays.

Options:
-o[a] outfile Output [append] to outfile instead of stdout
-h Print help
-v Print version number
-d Output as decimal
-m Read files with MS-DOS mode (default is binary)
-x Output as hex (default)
-t[a] Output as [ascii] text string
-e Generate external reference declaration
-i Build an include file
-k Keep extension, separated by underscore
-s Suppress header in output file
-p prefix Place prefix in front of names of declarations and definitions
-n namespace Place declarations and definitions inside given namespace
-c cols Change number of columns in output to cols
-u Force unsigned char even for text mode
-z Output size in declarations

Each file may be preceded by the following extra option:
-r name Override resource name of following resource file

How about assembler?

Mitch Frazier's picture
This will convert to assembly on stdout:
  echo ' .global data_txt'
  echo 'data_txt:'
  hexdump -v -e '" .byte " 16/1 "  0x%02x, " "\n"' data.txt | \
      sed -e '$s/0x  ,//g' -e 's/, *$//'
  echo ' .end'

Mitch Frazier is an Associate Editor for Linux Journal.

Ehhh...

mkourinny's picture


.globl data_begin
.data
data_begin:
.incbin "data.txt"
.globl data_end
data_end:

Good luck to us,
Mikhail Kourinny

Macro version

Harvey's picture

(Thank you for the initial code that got me started.)
I turned the code into a macro, got rid of the global data_end and replaced it with data_len. You could go one big step forward and create a common header file containing the assembly and C macros. It could also contain a macro for C++. Then, just ifdef the macros based on the compiler flags. Then, you can just #include the same file, I think, in many places.

// Common Include File: test.h

#ifdef __ASSEMBLER__
.altmacro
.macro binfile p q
	.globl \p&_begin
\p&_begin:
	.incbin \q
\p&_end:
	// Put a ".byte 0" here if you know your data is text
	// and you wish to use \p&_begin as a C string. It 
	// doesn't hurt to leave it here even for binary data
	// since it is not counted in \p&_len
	.byte 0
	.globl \p&_len
\p&_len:
	.int (\p&_end - \p&_begin)
.endm	

#else // Not __ASSEMBLER__

#ifdef __cplusplus
        extern "C" {
#endif

#define BIN_DATA(_NAME) \
  extern char _NAME##_begin; \
  extern int _NAME##_len

#ifdef __cplusplus
         }
#endif

#endif

// Assembly: test.S

#include "test.h"
	.data
	binfile data "data.txt"
	binfile src "test.S"

// C or C++:

#include "test.h"

BIN_DATA(data);
BIN_DATA(src);

Hi mkourinny & Mitch

Anonymous's picture

Hi mkourinny & Mitch Frazier,

Both of your scripts mentioned above for assembly give the same output.

But I don't understand what "converting to assembly" means. Sorry if it sounds silly. I guess it's converting an assembly file (.s) to hex bytes.

Thanks,
Ram

Not Quite

Mitch Frazier's picture

It's converting a data file, containing any type of data, into text that is valid assembly language. The resulting output can then be passed to the assembler and "assembled" (i.e., compiled by the assembler) into an object file.

Some of the other comments mention converting it to C and then compiling the C, this is the same idea only the target language is assembly language and not C.

The Linux assembler is a program invoked with the command "as"; it is sometimes referred to as "gas", for the GNU Assembler.

Mitch Frazier is an Associate Editor for Linux Journal.

Thank u much. :) Sorry for

Ram's picture

Thank you much. :)
Sorry for posting many times.
It happened without my knowledge.

Ram

At Last an Assembly Language Programmer

Mitch Frazier's picture

Didn't know that!

Mitch Frazier is an Associate Editor for Linux Journal.
