Adding Your Code to the Kernel: A Book Excerpt

December 20th, 2005 by Claudia Salzberg Rodriguez, Gordon Fischer and Steve Smolski

From their book's section on adding your own code to the kernel, the authors demonstrate how device drivers are represented in the filesystem.

This content is excerpted from Chapter 10 of the book titled The Linux Kernel Primer: A Top-Down Approach for x86 and PowerPC Architectures, authored by Claudia Salzberg Rodriguez, Gordon Fischer and Steven Smolski. ISBN: 0-13-118163-7. Copyright 2006, Pearson Education, Inc. To learn more about this book, including purchasing options, please visit the book's Web page.

Device drivers encompass the interface that the Linux kernel uses to allow the programmer to control the system's input/output devices. Entire books have been written specifically on Linux device drivers. In this chapter we attempt to distill this topic down to its essentials. We will follow a device driver from how the device is represented in the filesystem and then through the specific kernel code that controls it. We start by exploring the filesystem and show how these files tie into the kernel.

Getting Familiar with the Filesystem

Devices in Linux can be accessed via /dev. For example, an ls -l /dev/random yields:


crw-rw-rw-    1 root     root       1,   8 Oct  2 08:08 /dev/random

The leading 'c' tells us that the device is a character device; a 'b' identifies a block device. After the owner and group columns there are two numbers separated by a comma--in this case 1, 8. The first number is the driver's major number and the second its minor number. When a device driver registers with the kernel, it registers a major number. When a given device is opened, the kernel uses the major number of the device file to find the driver that has registered with that major number. The minor number is passed through the kernel to the device driver itself, as a single driver can control multiple devices. For example, /dev/urandom has a major number of 1 and a minor number of 9. This means that the device driver registered with major number 1 handles both /dev/random and /dev/urandom.
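The two numbers that ls -l prints can also be read from a program. The following is a minimal user-space sketch, assuming a Linux system whose C library provides <sys/sysmacros.h>; the helper name device_numbers() is our own invention and not part of any kernel or library API.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor(), makedev() */
#include <sys/types.h>

/*
 * Hypothetical helper: look up the major and minor numbers of a device
 * file. Returns 0 on success, or -1 if the path cannot be stat()ed or
 * is not a character or block device.
 */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;

    /* st_rdev holds the device ID that the device file refers to. */
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 0;
}
```

On the system shown in the listing above, calling device_numbers() on /dev/random would be expected to report 1 and 8.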

To generate a random number we simply read from /dev/random or /dev/urandom. The following is one possible way to read 4 bytes of random data, here from /dev/urandom.


lkp@lkp:~$ head -c4 /dev/urandom | od -x
0000000 823a 3be5
0000004

If you repeat this command you'll notice the four bytes [823a 3be5] continue to change. To demonstrate how the Linux kernel uses device drivers, we'll follow the steps the kernel takes when a user accesses /dev/random.
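The same read can be done from a C program with open() and read(). This is only a sketch: the helper name read_device_bytes() is hypothetical, and error handling is kept to a minimum.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to n bytes from the device file at path.
 * Returns the number of bytes actually read, or -1 on error.
 */
ssize_t read_device_bytes(const char *path, unsigned char *buf, size_t n)
{
    size_t done = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    while (done < n) {
        /* read() may return fewer bytes than were requested. */
        ssize_t r = read(fd, buf + done, n - done);

        if (r < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: try again */
            close(fd);
            return -1;
        }
        if (r == 0)             /* end of file */
            break;
        done += (size_t)r;
    }

    close(fd);
    return (ssize_t)done;
}
```

Calling read_device_bytes("/dev/urandom", buf, 4) corresponds to the head -c4 invocation above; the bytes returned differ on every call.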

We know that the /dev/random device file has a major number of 1. We can determine what driver controls the node by checking /proc/devices.


lkp@lkp:~$ less /proc/devices
Character devices:
  1 mem

Let us examine the mem device driver and search for occurrences of "random".


drivers/char/mem.c
653 static int memory_open(struct inode * inode, struct file * filp)
654 {
655         switch (iminor(inode)) {
656                 case 1:
...
676                 case 8:
677                         filp->f_op = &random_fops;
678                         break;
679                 case 9:
680                         filp->f_op = &urandom_fops;
681                         break;

Lines 655-681: This switch statement initializes driver structures based upon the minor number of the device being operated upon. Specifically, filps and fops are being set.

Which leads us to ask, "What is a filp? And what is a fop?"

Filps and Fops

A filp is simply a file struct pointer and a fop is a file_operations struct pointer. The kernel uses the file_operations structure to determine what functions to call when the file is operated upon. Below are selected sections of the structures that are used in the random device driver.


include/linux/fs.h
556 struct file {
557         struct list_head        f_list;
558         struct dentry           *f_dentry;
559         struct vfsmount         *f_vfsmnt;
560         struct file_operations  *f_op;
561         atomic_t                f_count;
562         unsigned int            f_flags;
...
581         struct address_space    *f_mapping;
582 };

and


include/linux/fs.h
863 struct file_operations {
864         struct module *owner;
865         loff_t (*llseek) (struct file *, loff_t, int);
866         ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
867         ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
868         ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
869         ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
870         int (*readdir) (struct file *, void *, filldir_t);
871         unsigned int (*poll) (struct file *, struct poll_table_struct *);
872         int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
...
888 };

The random device driver declares which file operations it provides in the following way. Functions that the drivers implement must conform to the prototypes listed in the file_operations structure.


drivers/char/random.c
1824 struct file_operations random_fops = {
1825         .read           = random_read,
1826         .write          = random_write,
1827         .poll           = random_poll,
1828         .ioctl          = random_ioctl,
1829 };
1830
1831 struct file_operations urandom_fops = {
1832         .read           = urandom_read,
1833         .write          = random_write,
1834         .ioctl          = random_ioctl,
1835 };

Lines 1824-1829: The random device provides the operations of read, write, poll and ioctl.

Lines 1831-1835: The urandom device provides the operations of read, write and ioctl.
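The way an open call installs an operations table and a later read is routed through it can be modeled in ordinary user-space C. The sketch below is a toy: every toy_ name is hypothetical, the structures are heavily simplified stand-ins for the kernel's, and the "random" data is a placeholder.

```c
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Simplified stand-in for struct file_operations: one operation only. */
struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t n);
};

/* Simplified stand-in for struct file: only the f_op pointer. */
struct toy_file {
    const struct toy_file_operations *f_op;
};

static long toy_random_read(struct toy_file *f, char *buf, size_t n)
{
    (void)f;
    memset(buf, 'r', n);        /* placeholder data, not random */
    return (long)n;
}

static long toy_urandom_read(struct toy_file *f, char *buf, size_t n)
{
    (void)f;
    memset(buf, 'u', n);        /* placeholder data, not random */
    return (long)n;
}

static const struct toy_file_operations toy_random_fops  = { .read = toy_random_read };
static const struct toy_file_operations toy_urandom_fops = { .read = toy_urandom_read };

/* Same shape as memory_open(): choose a table by minor number. */
int toy_memory_open(unsigned int minor_nr, struct toy_file *filp)
{
    switch (minor_nr) {
    case 8:
        filp->f_op = &toy_random_fops;
        break;
    case 9:
        filp->f_op = &toy_urandom_fops;
        break;
    default:
        return -1;              /* minor number not handled by this toy */
    }
    return 0;
}

/* A read goes through whichever table the open call installed. */
long toy_read(struct toy_file *filp, char *buf, size_t n)
{
    if (filp->f_op == NULL || filp->f_op->read == NULL)
        return -1;
    return filp->f_op->read(filp, buf, n);
}
```

The caller of toy_read() never names the driver function directly. It only follows the f_op pointer, which is why one major number can serve several devices with different behavior.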

The poll operation allows a programmer to check before performing an operation to see if that operation will block. This suggests, and is indeed the case, that /dev/random will block if a request has been made for more bytes of entropy than are in its entropy pool. /dev/urandom will not block but may not return completely random data if the entropy pool is too small. For more information: man 4 random.
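From user space, the driver's poll operation is reached through the poll() system call. The following sketch shows one way to ask whether a read could proceed without blocking; the helper name would_read_succeed() is hypothetical, and the descriptor could be one opened on /dev/random or any other pollable file.

```c
#include <poll.h>

/*
 * Hypothetical helper: ask whether a read on fd could proceed without
 * blocking. Waits at most timeout_ms milliseconds (0 returns at once).
 * Returns 1 if data is readable, 0 if not, -1 on error.
 */
int would_read_succeed(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

A program that must not stall can call this with a timeout of 0 and fall back to other work when it returns 0.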

Digging deeper into the code, notice that when a read operation is performed on /dev/random, the kernel passes control to the function random_read() (See line 1825). random_read() is defined as follows.


drivers/char/random.c

1588 static ssize_t
1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)

The function parameters are as follows:

  • file points to the file structure of the device.

  • buf points to an area of user memory where the result is to be stored.

  • nbytes is the size of data requested.

  • ppos points to a position within the file that the user is accessing.

Wait Queues

Occasionally a driver may need to wait for some condition to be true--perhaps access to a system resource. In this case we don't want the kernel to wait for the access to complete. It is bad to cause the kernel to wait since all other system processing halts while the wait occurs. By declaring a wait queue, you can postpone processing until a later time when the condition you are waiting on has occurred.

Two structures are used for this process of waiting: a wait queue and a wait queue head. A module should create a wait queue head, and the parts of the module that wait on or announce the condition should use the sleep_on and wake_up macros to manage it. This is the pattern that random_read() follows.


drivers/char/random.c

1588 static ssize_t
1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)
1590 {
1591         DECLARE_WAITQUEUE(wait, current);
...
1597         while (nbytes > 0) {
...
1608                 n = extract_entropy(sec_random_state, buf, n,
1609                                     EXTRACT_ENTROPY_USER |
1610                                     EXTRACT_ENTROPY_LIMIT |
1611                                     EXTRACT_ENTROPY_SECONDARY);
...
1618                 if (n == 0) {
1619                         if (file->f_flags & O_NONBLOCK) {
1620                                 retval = -EAGAIN;
1621                                 break;
1622                         }
1623                         if (signal_pending(current)) {
1624                                 retval = -ERESTARTSYS;
1625                                 break;
1626                         }
...
1632                         set_current_state(TASK_INTERRUPTIBLE);
1633                         add_wait_queue(&random_read_wait, &wait);
1634
1635                         if (sec_random_state->entropy_count / 8 == 0)
1636                                 schedule();
1637
1638                         set_current_state(TASK_RUNNING);
1639                         remove_wait_queue(&random_read_wait, &wait);
...
1645                         continue;
1646                 }

Line 1591: The wait queue wait is initialized on the current task. current is a macro that refers to a pointer to the current task's task_struct.

Lines 1608-1611: We extract a chunk of random data from the device.

Lines 1618-1626: If we could not extract the necessary amount of entropy from the entropy pool and we are non-blocking or there is a signal pending, we return an error to the caller.

Lines 1632-1633: Set up the wait queue. random_read() uses its own wait queue, random_read_wait, instead of the system wait queue.

Lines 1635-1636: We are on a blocking read at this point, and if we don't have one byte worth of entropy, we release control of the processor by calling schedule(). Note that the entropy_count variables hold bits and not bytes, thus the division by 8 to determine whether we have a full byte of entropy.

Lines 1638-1639: When we are eventually restarted, we clean up our wait queue.
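The bits-to-bytes arithmetic behind the test on line 1635 can be written out as a small worked example. The helper names below are hypothetical. For a non-negative count, entropy_bits / 8 == 0 holds exactly when entropy_bits < 8, because C integer division truncates toward zero.

```c
/* Hypothetical helper: whole bytes available for an entropy count in bits. */
int entropy_bytes(int entropy_bits)
{
    return entropy_bits / 8;
}

/*
 * Same condition as line 1635: true when less than one full byte of
 * entropy is available, so a blocking reader would have to wait.
 */
int must_wait(int entropy_bits)
{
    return entropy_bits / 8 == 0;
}
```

For example, a count of 7 bits gives zero whole bytes, so must_wait() is true; a count of 17 bits gives two whole bytes, so it is false.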

Note: The random device in Linux requires the entropy queue to be full before returning. The urandom device does not have this requirement and will return regardless of the size of data available in the entropy pool.

Eventually the kernel will give control back to random_read(), and we clean up our wait queue and continue. This repeats the loop, and if the system has generated enough entropy we should be able to return with the requested number of random bytes.
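Seen from user space, the -EAGAIN path on lines 1619-1622 appears as read() returning -1 with errno set to EAGAIN. The sketch below shows how a caller might request non-blocking behavior and detect that case; both helper names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch an open descriptor to non-blocking mode. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Hypothetical helper: attempt a single read. Returns what read()
 * returned and sets *would_block to 1 when the call failed only
 * because no data was available yet, 0 otherwise.
 */
ssize_t try_read(int fd, void *buf, size_t n, int *would_block)
{
    ssize_t r = read(fd, buf, n);

    *would_block = (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return r;
}
```

A caller that sees would_block set can do other work and retry later, instead of sleeping inside the driver as a blocking reader would.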

random_read() sets its state to TASK_INTERRUPTIBLE before calling schedule() to allow itself to be interrupted by signals while it is on a wait queue. The driver's own code wakes the task when extra entropy is collected, by calling wake_up_interruptible() in batch_entropy_process() and random_ioctl(). TASK_UNINTERRUPTIBLE is usually used when the task is waiting for hardware to respond; TASK_INTERRUPTIBLE is normally used when it is waiting on software.

random_read() uses its own wait queue code instead of the standard macros but essentially does an interruptible_sleep_on() (from the scheduler code), with the exception that if we have at least a full byte's worth of entropy we don't yield control and instead loop again to try to get all the entropy requested. If there isn't enough entropy, random_read() waits until it is awoken, with wake_up_interruptible(), by the entropy-gathering parts of the driver.
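The relationship between a wait queue head, the entries placed on it, and a wake-up call can be illustrated with a single-threaded toy model. This is not the kernel's implementation: there is no scheduler or locking here, and every toy_ name is hypothetical. It only shows the bookkeeping that add_wait_queue(), wake_up_interruptible() and remove_wait_queue() perform in spirit.

```c
#include <stddef.h>

enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

struct toy_task {
    enum toy_state state;
};

/* One entry per sleeper, in the role of the kernel's wait queue entry. */
struct toy_wait_entry {
    struct toy_task *task;
    struct toy_wait_entry *next;
};

/* Declared once by the driver, in the role of random_read_wait. */
struct toy_wait_head {
    struct toy_wait_entry *first;
};

void toy_add_wait_queue(struct toy_wait_head *h, struct toy_wait_entry *e)
{
    e->next = h->first;
    h->first = e;
}

void toy_remove_wait_queue(struct toy_wait_head *h, struct toy_wait_entry *e)
{
    struct toy_wait_entry **pp = &h->first;

    while (*pp != NULL) {
        if (*pp == e) {
            *pp = e->next;      /* unlink the entry */
            e->next = NULL;
            return;
        }
        pp = &(*pp)->next;
    }
}

/* Mark every sleeping task on the queue runnable; return how many woke. */
int toy_wake_up_interruptible(struct toy_wait_head *h)
{
    int woken = 0;
    struct toy_wait_entry *e;

    for (e = h->first; e != NULL; e = e->next) {
        if (e->task->state == TOY_INTERRUPTIBLE) {
            e->task->state = TOY_RUNNING;
            woken++;
        }
    }
    return woken;
}
```

As in random_read(), the sleeper in this model adds itself to the queue before it would give up the processor and removes itself after it has been woken; the wake-up call only changes task state.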

Other Types of Drivers

Until now, all the device drivers we have dealt with have been character drivers. These are usually the easiest to understand, but you may want to write other drivers that interface with the kernel in different ways.

Block devices are similar to character devices in that they can be accessed via the filesystem. For example, /dev/hda is the device file for the primary IDE hard drive on the system. Block devices are registered and unregistered in similar ways to character devices, using the functions register_blkdev() and unregister_blkdev().

A major difference between block drivers and character drivers is that block drivers do not provide their own read and write functionality and instead use a request method.

The 2.6 kernel has undergone major changes in the block device subsystem. Old functions, like block_read() and block_write(), and kernel structures, like blk_size and blksize_size, have been removed. In this section we focus solely on the 2.6 block device implementation.

If you need the Linux kernel to work with a disk or a disk-like device, you'll need to write a block device driver. The driver must inform the kernel what kind of disk it's interfacing with. It does this by using the gendisk structure.


include/linux/genhd.h

82 struct gendisk {
83         int major;                      /* major number of driver */
84         int first_minor;
85         int minors;
86         char disk_name[32];             /* name of major driver */
87         struct hd_struct **part;        /* [indexed by minor] */
88         struct block_device_operations *fops;
89         struct request_queue *queue;
90         void *private_data;
91         sector_t capacity;
...

Line 83: major is the major number for the block device. This can be either statically set or dynamically generated, using register_blkdev(), as it was in character devices.

Lines 84-85: first_minor and minors are used in determining the number of partitions within the block device. minors contains the maximum number of minor numbers the device can have. first_minor contains the first minor device number of the block device.

Line 86: disk_name is a 32 character name for the block device. It will appear in the /dev filesystem, sysfs and /proc/partitions.

Line 87: part points to the set of partitions (hd_struct entries) that are associated with the block device.

Line 88: fops is a pointer to a block_device_operations structure that contains the operations open, release, ioctl, media_changed and revalidate_disk. See include/linux/fs.h. In the 2.6 kernel each device has its own set of operations.

Line 89: queue is a pointer to the request queue, which helps manage the pending operations for the device.

Line 90: private_data points to information that will not be accessed by the kernel's block subsystem. Typically this is used to store data that is used in low-level, device specific operations.

Line 91: capacity is the size of the block device in 512 byte sectors. If the device is removable, like a floppy disk or CD, a capacity of 0 signifies no disk is present. If your device doesn't use 512 byte sectors you'll need to set this value as if it did. For example, if your device has 1000 256 byte sectors, that's equivalent to 500 512 byte sectors.
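The conversion described for capacity is simple arithmetic and can be sketched as a small helper. The function name is hypothetical, and the sketch assumes the total size in bytes fits in 64 bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper: express a device's size in 512-byte sectors,
 * given its native sector count and native sector size in bytes.
 * Rounds down if the total size is not a multiple of 512.
 */
uint64_t capacity_in_512_byte_sectors(uint64_t native_sectors,
                                      uint32_t native_sector_size)
{
    return (native_sectors * native_sector_size) / 512;
}
```

With the figures from the text, 1000 sectors of 256 bytes come to 256000 bytes, which is 500 sectors of 512 bytes.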

In addition to having a gendisk structure, a block device also needs a spinlock structure for use with its request queue.

Both the spinlock and fields in the gendisk structure need to be initialized by the device driver. After the device is initialized and ready to handle requests the function add_disk() should be called to add the block device to the system.

Finally, if the block device can be used as a source of entropy for the system the module initialization can also call add_disk_randomness()--see drivers/char/random.c for more detailed information.

Now that we've covered the basics of block device initialization we can examine its complement, exiting and cleaning up the block device driver. This is quite easy in the 2.6 version of Linux.

del_gendisk(struct gendisk *) removes the gendisk from the system and cleans up its partition information. This call should be followed by put_disk(struct gendisk *), which releases kernel references to the gendisk. The block device is unregistered via a call to unregister_blkdev(int major, char[16] device_name), which then allows us to free the gendisk structure.

We also need to clean up the request queue associated with the block device driver. This is done using blk_cleanup_queue(struct request_queue *). Note: If you can only reference the request queue via the gendisk structure, be sure to call blk_cleanup_queue() before freeing the gendisk.

__________________________



Small question

On December 22nd, 2005 Frederic Mora (not verified) says:

Great article, it is a good way to introduce this book!

I have a question about a minor detail. In drivers/char/random.c, line 1635 reads:

if (sec_random_state->entropy_count / 8 == 0)

Now, why is this test using a division, instead of the much quicker less-than? I'd like to know why this cannot be written:

if (sec_random_state->entropy_count < 8 )

Any idea?

Thanks,
--Fred

Possibilities

On December 26th, 2005 rich (not verified) says:

Fred,
I agree with the premise of your question. The division by 8 can be implemented by a shift, so it isn't necessarily slower than the comparison. Perhaps there's some architecture where it's actually a lot faster, so someone coded it that way. Or perhaps in a previous incarnation, the number of bytes available was needed nearby, so with optimization the divide is free. Or perhaps that's just the way this particular programmer thinks.
Rich

Re: Small question

On December 22nd, 2005 Anonymous (not verified) says:

Fred,

one checks for a value divisible by 8, the other for a value less than 8. the difference should be clear now.

Clearly you mis-read the

On January 3rd, 2006 Anonymous (not verified) says:

Clearly you mis-read the code.

Here are some code examples:

(1) if (my_val / 8 == 0) {...}
(2) if (my_val < 8) {...}
(3) if (my_val % 8 == 0) {...}

And explanations:

(1) Checks if a value is less than 8 and greater than -8. Any value not in -7...7 will be nonzero.
(2) Checks if a value is less than 8.
(3) Checks if a value is divisible by 8. (If the integer remainder of division by 8 is zero).
