Dynamic Kernels: Modularized Device Drivers

by Alessandro Rubini

Kernel modules are a great feature of recent Linux kernels. Although most users feel modules are only a way to free some memory by keeping the floppy driver out of the kernel most of the time, the real benefit of using modules is support for adding devices without patching the kernel source. In the next few Kernel Korners, Georg Zezschwitz and I will try to introduce the “art” of writing a powerful module while avoiding common design errors.

What is a device?

A device driver is the lowest level of the software that runs on a computer, as it is directly bound to the hardware features of the device.

The concept of “device driver” is quite abstract, actually, and the kernel can be considered as if it were a big device driver for a device called “computer”. Usually, however, you don't consider your computer a monolithic entity, but rather a CPU equipped with peripherals. The kernel can thus be considered as an application running on top of the device drivers: each driver manages a single part of the computer, while the kernel proper builds process scheduling and file-system access on top of the available devices.

See Figure 1.

A few mandatory drivers are “hardwired” in the kernel, such as the processor driver and the memory driver; the others are optional, and the computer is usable both with and without them, although a kernel with neither the console driver nor the network driver is pointless for a conventional user.

The description above is somewhat simplistic, and slightly philosophical too. Real drivers interact in complex ways, and a clean distinction among them is sometimes difficult to achieve.

In the Unix world things like the network driver and a few other complex drivers belong to the kernel, and the name of device driver is reserved to the low-level software interface to devices belonging to the following three groups:

character devices

Those which can be considered files, in that they can be read-from and/or written-to. The console (i.e. monitor and keyboard) and the serial/parallel ports are examples of character devices. Files like /dev/tty0 and /dev/cua0 provide user access to the device. A char device usually can only be accessed sequentially.

block devices

Historically: devices which can be read and written only in multiples of the block-size, often 512 or 1024 bytes. These are devices on which you can mount a filesystem, most notably disks. Files like /dev/hda1 provide access to the devices. Blocks of block devices are cached by the buffer cache. Unix provides uncached character devices corresponding to block devices, but Linux does not.

network interfaces

Network interfaces don't fall in the device-file abstraction. Network interfaces are identified by means of a name (such as eth0 or plip1) but they are not mapped to the filesystem. It would be theoretically possible, but it is impractical from a programming and performance standpoint; a network interface can only transfer packets, and the file abstraction does not efficiently manage structured data like packets.

The description above is rather sketchy, and each flavour of Unix differs in some details about what a block device is. It doesn't make much of a difference, actually, because the distinction is only relevant inside the kernel, and we aren't going to talk about block drivers in detail.

What is missing in the previous representation is that the kernel also acts as a library for device drivers; drivers request services from the kernel. Your module will be able to call functions to perform memory allocation, filesystem access, and so on.

As far as loadable modules are concerned, any of the three driver-types can be constructed as a module. You can also build modules to implement filesystems, but this is outside of our scope.

These columns will concentrate on character device drivers, because special (or home-built) hardware fits the character device abstraction most of the time. There are only a few differences between the three types, and so to avoid confusion, we'll only cover the most common type.

You can find an introduction to block drivers in issues 9, 10, and 11 of Linux Journal, as well as in the Linux Kernel Hackers' Guide. Although both are slightly outdated, taken together with these columns, they should give you enough information to get started.

What is a module?

A module is a code segment which registers itself with the kernel as a device driver, is called by the kernel in order to communicate with the device, and in turn invokes other kernel functions to accomplish its tasks. Modules utilize a clean interface between the “kernel proper” and the device, which both makes the modules easy to write and keeps the kernel source code from being cluttered.

The module must be compiled to object code (no linking; leave the compiled code in .o files), and then loaded into the running kernel with insmod. The insmod program is a runtime linker, which resolves any undefined symbols in the module to addresses in the running kernel by means of the kernel symbol table.

This means that you can write a module much like a conventional C-language program, and you can call functions you don't define, in the same way you usually call printf() and fopen() in your application. However, you can count only on a minimal set of external functions, which are the public functions provided by the kernel. insmod will put the right kernel-space addresses in your compiled module wherever your code calls a kernel function, then insert the module into the running Linux kernel.

If you are in doubt whether a kernel-function is public or not, you can look for its name either in the source file /usr/src/linux/kernel/ksyms.c or in the run-time table /proc/ksyms.

To use make to compile your module, you'll need a Makefile as simple as the following one:

TARGET = myname

ifdef DEBUG
  # -O is needed, because of "extern inline"
  # Add -g if your gdb is patched and can use it
  CFLAGS = -O -DDEBUG_$(TARGET) -D__KERNEL__ -Wall
else
  CFLAGS = -O3 -D__KERNEL__ -fomit-frame-pointer
endif

all: $(TARGET).o

As you see, no special rule is needed to build a module, only the correct value for CFLAGS. I recommend that you include debugging support in your code, because without patches, gdb isn't able to take advantage of the symbol information provided by the -g flag to a module while it is part of the running kernel.

Debugging support will usually mean extra code to print messages from within the driver. Using printk() for debugging is powerful, and the alternatives are running a debugger on the kernel, peeking in /dev/mem, and other extremely low-level techniques. There are a few tools available on the Internet to help use these other techniques, but you need to be conversant with gdb and be able to read real kernel code in order to benefit from them. The most interesting tool at time of writing is kdebug-1.1, which lets you use gdb on a running kernel, examining or even changing kernel data structures (including those in loaded kernel modules) while the kernel is running. Kdebug is available for ftp from sunsite.unc.edu and its mirrors under /pub/Linux/kernel.

Just to make things a little harder, the kernel equivalent of the standard printf() function is called printk(), because it does not work exactly the same as printf(). Before 1.3.37, conventional printk()'s generated lines in /var/adm/messages, while later kernels will dump them to the console. If you want quiet logging (only within the messages file, via syslogd) you must prepend the symbol KERN_DEBUG to the format string. KERN_DEBUG and similar symbols are simply strings, which get concatenated to your format string by the compiler. This means that you must not put a comma between KERN_DEBUG and the format string. These symbols can be found in <linux/kernel.h>, and are documented there. Also, printk() does not support floating point formats.
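The string concatenation described above can be tried out in user space. What follows is a minimal sketch, not kernel code: DEMO_KERN_DEBUG and demo_format() are hypothetical names introduced for illustration, and the value "<7>" is assumed to match what <linux/kernel.h> defines for KERN_DEBUG.

```c
#include <stdio.h>

/* Hypothetical stand-in for KERN_DEBUG; assumed to be the string "<7>". */
#define DEMO_KERN_DEBUG "<7>"

/*
 * Build the string that printk() would receive.  There is no comma after
 * DEMO_KERN_DEBUG: the compiler joins the two adjacent string literals
 * into the single format string "<7>irq %d claimed\n".
 */
int demo_format(char *buf, size_t len, int irq)
{
    return snprintf(buf, len, DEMO_KERN_DEBUG "irq %d claimed\n", irq);
}
```

After a call such as demo_format(buf, sizeof buf, 7), the buffer starts with the priority prefix, followed by the message. With a comma after the prefix, the prefix alone would become the format string and the real message would be passed as an ignored extra argument.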

Remember that syslogd writes to the messages file as soon as possible, in order to save all messages on disk in case of a system crash. This means that an over-printk-ed module will slow down perceptibly, and will fill your disk in a short time.

Almost any module misbehaviour will generate an Oops message from the kernel. An Oops is what happens when the kernel gets an exception from kernel code. In other words, Oopses are the equivalent of segmentation faults in user space, though no core file is generated. The result is usually the sudden destruction of the responsible process, and a few lines of low-level information in the messages file. Most Oops messages are the result of dereferencing a NULL pointer.

This way to handle disasters is a friendly one, and you'll enjoy it whenever your code is faulty: most other Unices produce a kernel panic instead. This does not mean that Linux never panics. You must be prepared to generate panics whenever you write functions that operate outside of a process context, such as within interrupt handlers and timer callbacks.

The scarce, nearly unintelligible information included with the Oops message represents the processor state when the code faulted, and can be used to understand where the error is. A tool called ksymoops is able to print more readable information out of the Oops, provided you have a kernel map handy. The map is what is left in /usr/src/linux/System.map after a kernel compilation. Ksymoops was distributed within util-linux-2.4, but was removed in 2.5 because it was included in the kernel distribution during the linux-1.3 development.

If you really understand the Oops message, you can use it as you wish, for example by invoking gdb off-line to disassemble the whole responsible function. If you understand neither the Oops nor the ksymoops output, you'd better add some more debugging printk() code, recompile, and reproduce the bug.

The following code can ease management of debugging messages. It must reside in the module's public include file, and will work for both kernel code (the module) and user code (applications). Note, however, that this code is gcc-specific, which is not too big a problem for a kernel module, since it is gcc-dependent anyway. This code was suggested by Linus Torvalds as an enhancement over my previous ANSI-compliant approach.

#ifndef PDEBUG
#  ifdef DEBUG_modulename
#    ifdef __KERNEL__
#      define PDEBUG(fmt, args...) printk (KERN_DEBUG fmt , ## args)
#    else
#      define PDEBUG(fmt, args...) fprintf (stderr, fmt , ## args)
#    endif
#  else
#    define PDEBUG(fmt, args...)
#  endif
#endif

#ifndef PDEBUGG
#  define PDEBUGG(fmt, args...)
#endif

After this code, every PDEBUG("any %i or %s...\n", i, s); in the module will result in a printed message only if the code is compiled with -DDEBUG_modulename, while PDEBUGG() with the same arguments will expand to nothing. In user mode applications, it works the same, except that the message is printed to stderr instead of the messages file.

Using this code, you can enable or disable any message by removing or adding a single G character.
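To see the two macros side by side, here is a small user-space sketch. It copies only the non-kernel branch of the macros above and assumes a module called "demo"; DEBUG_demo and demo_count() are hypothetical names. Like the original macros, it relies on gcc's named variadic macro extension.

```c
#include <stdio.h>

#define DEBUG_demo   /* normally supplied as -DDEBUG_demo on the compiler command line */

/* User-space branch of the macros above, with "demo" as the module name */
#ifdef DEBUG_demo
#  define PDEBUG(fmt, args...) fprintf(stderr, fmt , ## args)
#else
#  define PDEBUG(fmt, args...)
#endif
#define PDEBUGG(fmt, args...)

/* Returns how many of the two increments below were actually compiled in */
int demo_count(void)
{
    int n = 0;

    PDEBUG("enabled: n is now %d\n", ++n);    /* printed to stderr, n becomes 1 */
    PDEBUGG("disabled: n is now %d\n", ++n);  /* expands to nothing, n stays 1 */
    return n;
}
```

The sketch also shows a side effect worth keeping in mind: arguments of a disabled message are never evaluated, so debugging calls should not contain expressions the driver depends on.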

Writing Code

Let's look at what kind of code must go inside the module. The simple answer is “whatever you need”. In practice, you must remember that the module is kernel code, and must fit a well-defined interface with the rest of Linux.

Usually, you start with header inclusion, and here you begin to have constraints: you must always define the __KERNEL__ symbol before including any header unless it is defined in your makefile, and you must only include files pertaining to the <linux/*> and <asm/*> hierarchies. Sure, you can include your module-specific header, but never, ever, include library-specific files, such as <stdio.h> or <sys/time.h>.

The code fragment in Listing 1 represents the first lines of source of a typical character driver. If you are going to write a module, it will be easier to cut and paste these lines from existing source rather than copying them by hand from this article.

#define __KERNEL__         /* kernel code */

#define MODULE             /* always as a module */
#include <linux/module.h>  /* can't do without it */
#include <linux/version.h> /* and this too */

/*
 * Then include whatever header you need.
 * Most likely you need the following:
 */
#include <linux/types.h>   /* ulong and friends */
#include <linux/sched.h>   /* current, task_struct, other goodies */
#include <linux/fcntl.h>   /* O_NONBLOCK etc. */
#include <linux/errno.h>   /* return values */
#include <linux/ioport.h>  /* request_region() */
#include <linux/config.h>  /* system name and global items */
#include <linux/malloc.h>  /* kmalloc, kfree */

#include <asm/io.h>        /* inb() inw() outb() ... */
#include <asm/irq.h>       /* unreadable, but useful */

#include "modulename.h" /* your own material */

After including the headers, there comes actual code. Before talking about specific driver functionality—most of the code—it is worth noting that there exist two module-specific functions, which must be defined in order for the module to be loaded:

int init_module (void);
void cleanup_module (void);

The first is in charge of module initialization (looking for the related hardware and registering the driver in the appropriate kernel tables), while the second is in charge of releasing any resources the module has allocated and deregistering the driver from the kernel tables.

If these functions are not there, insmod will fail to load your module.

The init_module() function returns 0 on success and a negative value on failure. The cleanup_module() function returns void, because it only gets invoked when the module is known to be unloadable. A kernel module keeps a usage count, and cleanup_module() is only called when that counter's value is 0 (more on this later on).

Skeletal code for these two functions will be presented in the next installment. Their design is fundamental for proper loading and unloading of the module, and a few details must be dealt with. So here, I'll introduce you to each of the details, so that next month I can present the structure without explaining all of them.

Getting a major number

Both character drivers and block drivers must register themselves in a kernel array; this step is fundamental for the driver to be used. After init_module() returns, the driver's code segment is part of the kernel, and won't ever be called again unless the driver registers its functionality. Linux, like most Unix flavors, keeps an array of device drivers, and each driver is identified by a number, called the major number, which is nothing more than the index in the array of available drivers.

The major number of a device is the first number appearing in the output of ls -l for the device file. The other one is the minor number (you guessed it). All the devices (file nodes) featuring the same major number are serviced by the same driver code.
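The same two numbers that ls -l prints can be read from a program. The following user-space sketch assumes a glibc-style Linux system, where major() and minor() come from <sys/sysmacros.h>; demo_dev_numbers() is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor(): assumed available, as on glibc */

/*
 * Store the major and minor numbers of the device node at path.
 * Returns 0 on success, -1 if the path is missing or is not a device node.
 */
int demo_dev_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;                        /* node does not exist or is unreadable */
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;                        /* ordinary files carry no device numbers */
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 0;
}
```

On a typical Linux system, calling it on /dev/null yields major 1 and minor 3, the same pair shown by ls -l /dev/null.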

It is clear that your modularized driver needs its own major number. The problem is that the kernel currently uses a static array to hold driver information, and the array is as small as 64 entries (it used to be 32, but it was increased during the 1.2 kernel development because of lack of major numbers).

Fortunately, the kernel allows dynamic assignment of major numbers. The invocation of the function

int register_chrdev(unsigned int major,
                    const char *name,
                    struct file_operations *fops);

will register your char-driver within the kernel. The first argument is either the number you are requesting or 0, in which case dynamic allocation is performed. The function returns a number less than 0 to signal an error, and 0 or greater to signal successful completion. If you asked for a dynamically assigned number, the positive return value is the major number your driver was assigned. The name argument is the name of your driver, and is what appears within the /proc/devices file. Finally, fops is the structure used for calling all the other functions in your driver, and will be described later on.

Using dynamic allocation of major numbers is a winning choice for custom device drivers: you're assured that your device number doesn't conflict with any other device within your system—you're assured that register_chrdev() will succeed, unless you have loaded so many devices that you have run out of free device numbers, which is unlikely.
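The return-value convention can be made concrete with a toy user-space model of the driver array. This is only a sketch of the behaviour described in the text, not the kernel's implementation: every demo_ name and DEMO_ constant is hypothetical, the fops argument is left out, and scanning for a free slot from the top of the table is an assumption.

```c
#include <stddef.h>

#define DEMO_MAX_CHRDEV 64   /* table size quoted in the text */
#define DEMO_EBUSY      16   /* illustrative error codes */
#define DEMO_EINVAL     22

/* One slot per major number; NULL means the slot is free */
static const char *demo_chrdevs[DEMO_MAX_CHRDEV];

/*
 * major == 0 asks for dynamic allocation and returns the number assigned;
 * a nonzero major returns 0 on success.  Errors are negative.
 */
int demo_register_chrdev(unsigned int major, const char *name)
{
    if (major == 0) {
        /* Assumption: look for a free slot starting from the highest number */
        for (major = DEMO_MAX_CHRDEV - 1; major > 0; major--) {
            if (demo_chrdevs[major] == NULL) {
                demo_chrdevs[major] = name;
                return (int)major;
            }
        }
        return -DEMO_EBUSY;              /* every slot is taken */
    }
    if (major >= DEMO_MAX_CHRDEV)
        return -DEMO_EINVAL;
    if (demo_chrdevs[major] != NULL)
        return -DEMO_EBUSY;              /* someone else owns this number */
    demo_chrdevs[major] = name;
    return 0;
}
```

In this model a driver that asks for a fixed number can collide with an earlier one, while a driver passing 0 only fails when the whole table is full, which is the point made above.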

Loading and unloading

Since the major number is recorded inside the filesystem node that applications use to access the device, dynamic allocation of the major number means that you can't create your nodes once, and keep them in /dev forever. You need to recreate them each time you load your module.

The scripts on this page are the ones I use to load and to unload my modules. A little editing will suit your own module: you only need to change the module name and the device name.

The mknod command creates a device node with a given major and minor number (I'll talk about minor numbers in the next installment), and chmod gives the desired permissions to the new devices.

Though some of you may dislike creating device nodes (and changing their permissions) every time the system is booted, there is nothing strange in it. If you are concerned about becoming root to perform the task, remember that insmod itself must be issued with root privileges.

The loading script can be conveniently called drvname_load, where drvname is the prefix you use to identify your driver; the same one used in the name argument passed to register_chrdev(). The script can be invoked by hand during driver development, and by rc.local after module installation. Remember that insmod looks both in the current directory and in the installation directory (somewhere in /lib/modules) for modules to install.

If your module depends on other modules or if your system setup is somehow peculiar, you can invoke modprobe instead of insmod. The modprobe utility is a refined version of insmod which manages module dependencies and conditional loading. The tool is quite powerful and well documented. If your driver needs exotic handling, you're better off reading the manpage.

At time of writing, however, none of the standard tools handles generation of device nodes for automatically allocated major numbers, and I can't even conceive how they could know the names and minor numbers of your driver. This means that a custom script is needed in any case.

Here's drvname_load:

#!/bin/sh
# Install the drvname driver,
# including creating device nodes.

# FILE and DEV may be the same.
# The former is the object file to load,
# the latter is the official name within
#  the kernel.

FILE="drvname"
DEV="devicename"

/sbin/insmod -f $FILE $*  || \
 { echo "$DEV not inserted" ; exit 1; }

# retrieve major just assigned
major=`grep $DEV /proc/devices | \
  awk "{print \\$1}"`

# make device nodes
cd /dev
rm -f mynode0 mynode1

mknod mynode0 c $major 0
mknod mynode1 c $major 1

# edit this line to suit your needs
chmod go+rw mynode0 mynode1
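The lookup of the major number in the script above matches the device name as a substring, so a name that is a prefix of another registered name could match more than one line. As a hedged alternative, the sketch below does the same lookup in C with an exact comparison. It works on a text buffer in the layout of /proc/devices (a number followed by a name on each line, which is an assumption about the file format); demo_find_major() is a hypothetical helper.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the major number listed for `name` in /proc/devices-style text,
 * or -1 if no line carries exactly that name.  The first match wins, so
 * character devices are found before block devices of the same name.
 */
int demo_find_major(const char *text, const char *name)
{
    const char *line = text;

    while (line != NULL && *line != '\0') {
        int major;
        char dev[64];

        /* Header lines such as "Character devices:" fail the %d conversion */
        if (sscanf(line, "%d %63s", &major, dev) == 2 &&
            strcmp(dev, name) == 0)
            return major;

        line = strchr(line, '\n');       /* advance to the next line, if any */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

A loader written in C could read /proc/devices into a buffer, call this function with the driver name, and pass the result to mknod.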

And drvname_unload:

#!/bin/sh
# Unload the drvname driver

FILE="drvname"
DEV="devicename"

/sbin/rmmod $FILE $* || \
 { echo "$DEV not removed" ; exit 1; }

# remove device nodes
cd /dev
rm -f mynode0 mynode1

Allocating Resources

The next important task of init_module() is allocating any resources needed by the driver for correct operation. We call any tiny piece of the computer a “resource”, where “piece” is a logical (or software) representation of a physical part of the computer. Usually a driver will request memory, I/O ports, and IRQ lines.

Programmers are familiar with requesting memory. The kmalloc() function will do it, and you can use it much as you would malloc(), apart from an extra argument giving the allocation priority. Requesting I/O ports, on the contrary, is unusual. They're there, free of charge. There is no “I/O port fault” equivalent of a “segmentation fault”. However, writing to I/O ports belonging to other devices can still crash your system.

Linux implements essentially the same policy for I/O ports as is used for memory. The only real difference is that the CPU does not generate exceptions when you write to a port address that you have not requested. Port registering, like memory registering, also helps keep the kernel's housekeeping tidy.

If you ever scratched your head about the port address to assign to your newly acquired board, you'll soon forget the feeling: cat /proc/ioports and cat /proc/interrupts will quickly uncover the secrets of your own hardware.

Registering the I/O ports you use is a little more complicated than requesting memory, because you often have to “probe” to find out where your device is. To avoid “probing” ports that other devices have already registered, you can call check_region() to ask if the region you are considering looking in is already claimed. Do this once for each region as you probe. Once you find the device, use the request_region() function to reserve the region. When your module is unloaded, it should call release_region() to free the ports. Here are the function declarations from <linux/ioport.h>:

int check_region(unsigned int from,
                 unsigned int extent);
void request_region(unsigned int from,
                    unsigned int extent,
                    const char *name);
void release_region(unsigned int from,
                    unsigned int extent);

The from argument is the beginning of a contiguous region, or range, of I/O ports, the extent is the number of ports in the region, and name is the name of the driver.

If you forget to register your I/O ports, nothing bad will happen, unless you have two such misbehaving drivers, or you need the information to fit a new board in your computer. If you forget to release ports when unloading, any subsequent program accessing the /proc/ioports file will “Oops”, because the driver name will refer to unmapped memory. Besides, you won't be able to load your driver again, because your own ports are no longer available. Thus, you should be careful to free your ports.
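The check, request, and release sequence can be followed in a toy user-space model of the port registry. This is a sketch of the bookkeeping only, under the assumption that a busy region is reported as a negative value; it touches no hardware, and all demo_ names and DEMO_ constants are hypothetical.

```c
#include <stddef.h>

#define DEMO_MAX_REGIONS 16
#define DEMO_EBUSY       16   /* illustrative error code */

struct demo_region {
    unsigned int from, extent;
    const char *name;          /* NULL marks a free table slot */
};

static struct demo_region demo_regions[DEMO_MAX_REGIONS];

/* 0 if [from, from + extent) is unclaimed, negative if it overlaps a claim */
int demo_check_region(unsigned int from, unsigned int extent)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_REGIONS; i++) {
        const struct demo_region *r = &demo_regions[i];
        if (r->name != NULL &&
            from < r->from + r->extent && r->from < from + extent)
            return -DEMO_EBUSY;
    }
    return 0;
}

/* Record the claim in the first free slot; returns nothing, as in the text */
void demo_request_region(unsigned int from, unsigned int extent,
                         const char *name)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_REGIONS; i++) {
        if (demo_regions[i].name == NULL) {
            demo_regions[i].from = from;
            demo_regions[i].extent = extent;
            demo_regions[i].name = name;
            return;
        }
    }
}

/* Drop the claim that matches exactly, making the ports available again */
void demo_release_region(unsigned int from, unsigned int extent)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_REGIONS; i++) {
        struct demo_region *r = &demo_regions[i];
        if (r->name != NULL && r->from == from && r->extent == extent)
            r->name = NULL;
    }
}
```

In the model, a driver that never releases its region leaves the entry in place, so a second load of the same driver finds its own ports busy, as described above.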

A similar allocation policy exists for IRQ lines (see <linux/sched.h>):

int request_irq(uint irq,
           void (*handler)(int, struct pt_regs *),
           ulong flags, const char *name);
void free_irq(uint irq);

Note again that name is what appears in the /proc/ files, and thus should be myhardware rather than mydrv.

If you forget to register IRQ lines, your interrupt handler won't be called; if you forget to unregister, you won't be able to read /proc/interrupts. In addition, if the board continues generating irq's after your handler is unloaded, something weird may happen (I can't tell exactly, because it never happened to me, and I'm not likely to try it in order to document it here). [I think you get a kernel panic, but I've never managed (or tried) to make it happen, either—ED]

The last point I'd like to touch here is introduced by Linus's comment in <linux/io.h>: you have to find your hardware. If you want to make usable drivers, you have to autodetect your devices. Autodetection is vital if you want to distribute your driver to the general public, but don't call it “Plug and Play”, since that is now a trademark.

The driver should detect both the I/O ports and the IRQ number. If the board doesn't tell which IRQ line it will use, you can go through a trial and error technique; it works great, if you do it carefully. The technique will be covered in a later installment.

When you know the IRQ number of your device, you should use free_irq() to release it before returning from init_module(). You can request it again when your device is actually opened. If you keep hold of the interrupt, you won't be able to multiplex hardware on it (and the i386 has too few IRQ lines to allow wasting them). Thus I run plip and my frame grabber on the same interrupt without unloading any module: I just open only one of them at a time.
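The policy of holding the interrupt only while the device is open can be illustrated with another toy user-space model. It is a sketch of the bookkeeping, not of real interrupt handling: the handler and flags arguments are omitted, and every demo_ name and DEMO_ constant is hypothetical.

```c
#include <stddef.h>

#define DEMO_NR_IRQS 16
#define DEMO_EBUSY   16   /* illustrative error codes */
#define DEMO_EINVAL  22

/* Name of the current owner of each line; NULL means the line is free */
static const char *demo_irq_owner[DEMO_NR_IRQS];

/* 0 on success, negative if the line is invalid or already taken */
int demo_request_irq(unsigned int irq, const char *name)
{
    if (irq >= DEMO_NR_IRQS)
        return -DEMO_EINVAL;
    if (demo_irq_owner[irq] != NULL)
        return -DEMO_EBUSY;
    demo_irq_owner[irq] = name;
    return 0;
}

void demo_free_irq(unsigned int irq)
{
    if (irq < DEMO_NR_IRQS)
        demo_irq_owner[irq] = NULL;
}

/* open and close of a hypothetical device: the line is held only while open */
int demo_open(unsigned int irq, const char *name)
{
    return demo_request_irq(irq, name);
}

void demo_close(unsigned int irq)
{
    demo_free_irq(irq);
}
```

Two devices configured for the same line can both stay loaded in this model; the second open fails only while the first device is still open.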

Unfortunately, there are rare cases where autodetection won't work, so you must provide a way to pass information to the driver about ports and IRQs. A probe will usually fail only during system boot, when the first drivers have access to several unregistered devices, and a driver can mistake another device for the one it is looking for. Sometimes probing for a device can be “destructive” for another device, preventing its future initialization. Neither problem should happen to a module, which comes last, and thus can't request ports belonging to other devices. Nonetheless, a way to disable autodetection and force values in the driver is an important feature to implement. At least, it's easier than autodetection, and can help you in successfully loading the module before autodetection is there.

Load-time configuration will be the first topic of the next issue, where the full source of init_module() and cleanup_module() will be uncovered.

Additional information

The Kernel Korner columns of the following months will introduce further points of module-writing. Code samples can be found inside the kernel and on ftp sites near you.

In particular, what I describe is based on my personal experience with device drivers: both the ceddrv-0.xx and cxdrv-0.xx resemble the code I describe. Georg Zezschwitz and I wrote the ceddrv, which drives a lab interface (A/D, D/A, bells and whistles). The cxdrv driver is simpler, and drives a memory-mapped frame grabber. The latest versions of both drivers are available on ftp://iride.unipv.it/pub/linux for public ftp. ceddrv is also on tsx-11.mit.edu, while cxdrv is on sunsite.unc.edu in apps/video.

There are quite a few books about device drivers out there, but they're often too system-specific and describe an awkward interface; Linux is easier. Generic books about Unix internals and the kernel source are the best teachers. I suggest getting one of the following:

  • Maurice J. Bach, The Design of the UNIX Operating System, Prentice Hall, 1986

  • Andrew S. Tanenbaum, Operating Systems: Design and Implementation, Prentice Hall, 1987

  • Andrew S. Tanenbaum, Modern Operating Systems, Prentice Hall, 1992

Alessandro Rubini ([email protected]) is taking his PhD course in computer science and is breeding two small Linux boxes at home. Wild by his very nature, he loves trekking, canoeing, and riding his bike.
