Adding Your Code to the Kernel: A Book Excerpt

From their book's section on adding your own code to the kernel, the authors demonstrate how device drivers are represented in the filesystem.


This content is excerpted from Chapter 10 of the book titled The Linux
Kernel Primer: A Top-Down Approach for x86 and PowerPC Architectures,
authored by Claudia Salzberg Rodriguez, Gordon Fischer and Steven Smolski.
ISBN: 0-13-118163-7. Copyright 2006, Pearson Education, Inc. To learn more
about this book, including purchasing options, please visit the book's
Web page.

Device drivers encompass the interface that the Linux kernel uses to
allow the programmer to control the system's input/output devices.
Entire books have been written specifically on Linux device drivers.
In this chapter we attempt to distill this topic down to its essentials.
We will follow a device driver from how the device is represented in the
filesystem and then through the specific kernel code that controls it.
We start by exploring the filesystem and show how these files tie into
the kernel.

Getting Familiar with the Filesystem
Devices in Linux can be accessed via /dev. For example, an
ls -l /dev/random yields:


crw-rw-rw-    1 root     root       1,   8 Oct  2 08:08 /dev/random

The leading 'c' tells us that the device is a character device; a 'b'
identifies a block device. After the owner and group columns there
are two numbers separated by a comma--in this case 1, 8. The first
number is the driver's major number and the second its minor number.
When a device driver registers with the kernel, it will register a major
number. When a given device is opened the kernel uses the major number
of the device file to find the driver that has registered with that
major number. The minor number is passed through the kernel to the
device driver itself, as a single driver can control multiple devices.
For example, /dev/urandom has a major number of 1 and a minor number of 9.
This means that the device driver registered with major number 1 handles
both /dev/random and /dev/urandom.
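
The same two numbers can be read programmatically. The following is a
minimal user-space sketch, assuming a Linux system with glibc, that uses
stat(2) and the major() and minor() macros from <sys/sysmacros.h>. The
helper name device_numbers is hypothetical and does not come from the
book or the kernel.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Hypothetical helper: look up the major and minor numbers of a device node.
 * Returns 0 on success, -1 if the path is missing or is not a device file.
 */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;

    /* st_rdev packs both numbers; the macros unpack them. */
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 0;
}
```

On the system shown in the listing above, calling
device_numbers("/dev/random", &maj, &min) would be expected to yield
1 and 8.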

To generate a random number we simply read from /dev/random or
/dev/urandom. The following is one possible way to read 4 bytes of
random data from /dev/urandom.


lkp@lkp:~$ head -c4 /dev/urandom | od -x
0000000 823a 3be5
0000004

If you repeat this command you'll notice the four bytes [823a 3be5]
continue to change. To demonstrate how the Linux kernel uses device
drivers, we'll follow the steps the kernel takes when a user accesses
/dev/random.
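
From a program's point of view, the device file is opened and read like
any other file. The sketch below is one possible user-space way to read
a fixed number of bytes; the helper name read_device_bytes is
hypothetical, and only the standard open(2), read(2) and close(2) calls
are used.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read exactly n bytes from a device file such as
 * /dev/urandom. Returns n on success, or -1 on error or early end of file.
 */
ssize_t read_device_bytes(const char *path, unsigned char *buf, size_t n)
{
    size_t done = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    while (done < n) {
        ssize_t got = read(fd, buf + done, n - done);

        if (got < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (got <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)got;
    }
    close(fd);
    return (ssize_t)done;
}
```

Each read() in the loop is what eventually reaches the driver's read
operation inside the kernel.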

We know that the /dev/random device file has a major number of 1.
We can determine what driver controls the node by checking /proc/devices.


lkp@lkp:~$ less /proc/devices
Character devices:
  1 mem

Let us examine the mem device driver and search for occurrences of
"random".


drivers/char/mem.c
653 static int memory_open(struct inode * inode, struct file * filp)
654 {
655         switch (iminor(inode)) {
656                 case 1:
...
676                 case 8:
677                         filp->f_op = &random_fops;
678                         break;
679                 case 9:
680                         filp->f_op = &urandom_fops;
681                         break;

Lines 655-681: This switch statement initializes driver structures based upon the minor
number of the device being operated upon. Specifically, filps and fops
are being set.

This leads us to ask, "What is a filp? And what is a fop?"

Filps and Fops
A filp is simply a file struct pointer and a fop is a file_operations
struct pointer. The kernel uses the file_operations structure to
determine what functions to call when the file is operated upon.
Below are selected sections of the structures that are used in the random
device driver.


include/linux/fs.h
struct file {
        struct list_head        f_list;
        struct dentry           *f_dentry;
        struct vfsmount         *f_vfsmnt;
        struct file_operations  *f_op;
        atomic_t                f_count;
        unsigned int            f_flags;
...
        struct address_space    *f_mapping;
};

and


include/linux/fs.h
struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
...
};

The random device driver declares which file operations it provides in
the following way. Functions that the drivers implement must conform
to the prototypes listed in the file_operations structure.


drivers/char/random.c
1824 struct file_operations random_fops = {
1825         .read           = random_read,
1826         .write          = random_write,
1827         .poll           = random_poll,
1828         .ioctl          = random_ioctl,
1829 };
1830
1831 struct file_operations urandom_fops = {
1832         .read           = urandom_read,
1833         .write          = random_write,
1834         .ioctl          = random_ioctl,
1835 };

Lines 1824-1829: The random device provides the operations of read, write, poll and ioctl.

Lines 1831-1835: The urandom device provides the operations of read, write and ioctl.
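
To make the dispatch mechanism concrete, here is a small user-space
model of it. This is only a sketch of the idea, not kernel code: every
name beginning with demo_ is hypothetical, the structures are cut down
to two operations, and the "read" functions merely fill the buffer with
a marker byte.

```c
#include <stddef.h>
#include <string.h>

struct demo_file;

/* Hypothetical, cut-down analogue of struct file_operations. */
struct demo_file_operations {
    long (*read)(struct demo_file *, char *, size_t);
    long (*write)(struct demo_file *, const char *, size_t);
};

/* Hypothetical analogue of struct file: it carries the ops pointer. */
struct demo_file {
    const struct demo_file_operations *f_op;
};

static long demo_random_read(struct demo_file *filp, char *buf, size_t n)
{
    (void)filp;
    memset(buf, 'R', n);        /* marker standing in for random data */
    return (long)n;
}

static long demo_urandom_read(struct demo_file *filp, char *buf, size_t n)
{
    (void)filp;
    memset(buf, 'U', n);
    return (long)n;
}

/* Designated initializers: operations not listed stay NULL. */
static const struct demo_file_operations demo_random_fops = {
    .read = demo_random_read,
};
static const struct demo_file_operations demo_urandom_fops = {
    .read = demo_urandom_read,
};

/* Same shape as memory_open(): choose an ops table from the minor number. */
int demo_open(unsigned int minor_nr, struct demo_file *filp)
{
    switch (minor_nr) {
    case 8:
        filp->f_op = &demo_random_fops;
        break;
    case 9:
        filp->f_op = &demo_urandom_fops;
        break;
    default:
        return -1;
    }
    return 0;
}

/* Generic read path: call through the table; an unset slot is an error. */
long demo_read(struct demo_file *filp, char *buf, size_t n)
{
    if (!filp->f_op || !filp->f_op->read)
        return -1;
    return filp->f_op->read(filp, buf, n);
}
```

The caller of demo_read() never names a driver function directly; which
function runs depends only on the table that demo_open() installed.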

The poll operation allows a programmer to check before performing an
operation to see if that operation will block. This suggests, and is
indeed the case, that /dev/random will block if a request has been made
for more bytes of entropy than are in its entropy pool. /dev/urandom
will not block but may not return completely random data if the entropy
pool is too small. For more information, see man 4 random.
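
From user space, the poll operation is reached through the poll(2)
system call. The following sketch shows one way a program might check
whether a descriptor can be read without blocking; the helper names
is_readable and read_if_ready are hypothetical.

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if fd has data to read within timeout_ms, 0 if not, -1 on error. */
int is_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Reads only when poll() reports data, so the caller does not block in
 * read(). Returns 0 when nothing is ready (or poll failed).
 */
ssize_t read_if_ready(int fd, void *buf, size_t n)
{
    if (is_readable(fd, 0) != 1)
        return 0;
    return read(fd, buf, n);
}
```

A descriptor opened on /dev/random could be passed to these helpers in
the same way as any other file descriptor.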

Digging deeper into the code, notice that when a read operation is
performed on /dev/random, the kernel passes control to the function
random_read() (See line 1825). random_read() is defined as follows.


drivers/char/random.c
1588 static ssize_t
1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)

The function parameters are as follows:

  • file points to the file structure of the
    device.
  • buf points to an area of user memory where the result is
    to be stored.
  • nbytes is the size of data requested.
  • ppos points to a position within the file that the user
    is accessing.

Wait Queues
Occasionally a driver may need to wait for some condition to be true,
perhaps access to a system resource. In this case we don't want the
kernel itself to wait for the access to complete, since all other system
processing halts while the kernel waits. By declaring a wait queue, you
can postpone processing until a later time when the condition you are
waiting on has occurred.

Two structures are used for this process of waiting: a wait queue and
a wait queue head. A module should create a wait queue head, and the
parts of the module that need to sleep or to wake sleepers should use
the sleep_on and wake_up macros. This is precisely what occurs in
random_read().


drivers/char/random.c
1588 static ssize_t
1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)
1590 {
1591         DECLARE_WAITQUEUE(wait, current);
...
1597         while (nbytes > 0) {
...
1608                 n = extract_entropy(sec_random_state, buf, n,
1609                                     EXTRACT_ENTROPY_USER |
1610                                     EXTRACT_ENTROPY_LIMIT |
1611                                     EXTRACT_ENTROPY_SECONDARY);
...
1618                 if (n == 0) {
1619                         if (file->f_flags & O_NONBLOCK) {
1620                                 retval = -EAGAIN;
1621                                 break;
1622                         }
1623                         if (signal_pending(current)) {
1624                                 retval = -ERESTARTSYS;
1625                                 break;
1626                         }
...
1632                         set_current_state(TASK_INTERRUPTIBLE);
1633                         add_wait_queue(&random_read_wait, &wait);
1634
1635                         if (sec_random_state->entropy_count / 8 == 0)
1636                                 schedule();
1637
1638                         set_current_state(TASK_RUNNING);
1639                         remove_wait_queue(&random_read_wait, &wait);
...
1645                         continue;
1646                 }

Line 1591: The wait queue wait is initialized on the current task.
current is a macro that refers to a pointer to the current task's
task_struct.

Lines 1608-1611: We extract a chunk of random data from the device.

Lines 1618-1626: If we could not extract the necessary amount of
entropy from the entropy pool and we are non-blocking or there is a
signal pending, we return an error to the caller.

Lines 1632-1633: Set up the wait queue. random_read() uses its own wait
queue, random_read_wait, instead of the system wait queue.

Lines 1635-1636: We are on a blocking read at this point, and if we don't
have one byte worth of entropy, we release control of the processor by
calling schedule(). Note that the entropy_count variables hold bits
and not bytes, thus the division by 8 to determine whether we have a
full byte of entropy.
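
The bit-to-byte arithmetic can be written out as a tiny worked example.
The function names below are hypothetical and exist only to illustrate
the integer division used on line 1635.

```c
/* entropy_bits is a count of bits, as entropy_count is in the driver. */
int has_full_byte(int entropy_bits)
{
    /* Integer division discards the remainder: 0..7 bits give 0 bytes. */
    return entropy_bits / 8 != 0;
}

/* Whole bytes that can be handed out from a given number of entropy bits. */
int available_bytes(int entropy_bits)
{
    return entropy_bits / 8;
}
```

For a non-negative count, the test entropy_bits / 8 == 0 is true exactly
when fewer than 8 bits are available.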

Lines 1638-1639: When we are eventually restarted, we clean up our wait
queue.

Note: The random device in Linux requires sufficient entropy in the
entropy pool before returning. The urandom device does not have this
requirement and will return regardless of the amount of entropy
available in the pool.

Eventually the kernel will give control back to random_read()
and we clean up our wait queue and continue. This repeats the loop and
if the system has generated enough entropy we should be able to return
with the requested number of random bytes.

random_read() sets its state to TASK_INTERRUPTIBLE before calling
schedule() to allow itself to be interrupted by signals while it is
on a wait queue. The driver's own code wakes the sleeping task when
extra entropy is collected by calling wake_up_interruptible() in
batch_entropy_process() and random_ioctl(). TASK_UNINTERRUPTIBLE is
usually used when the task is waiting for hardware to respond, whereas
TASK_INTERRUPTIBLE is normally used when the task is waiting on software.

random_read() uses its own wait queue code instead of the standard macros
but essentially does an interruptible_sleep_on() (from the scheduler
code), with the exception that if we have at least a full byte's worth
of entropy we don't yield control and instead loop again to try to get
all the entropy requested. If there isn't enough entropy, random_read()
waits until awoken, with wake_up_interruptible(), by the entropy-gathering
processes of the driver.
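
The sleep-until-woken pattern described above has a close user-space
analogue in POSIX condition variables. The sketch below is that analogy
only: it uses pthreads rather than the kernel's wait queue API, and the
demo_pool names are hypothetical. The condition variable plays the role
of random_read_wait, and the broadcast plays the role of
wake_up_interruptible().

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the driver's entropy pool. */
struct demo_pool {
    pthread_mutex_t lock;
    pthread_cond_t  readers;    /* plays the role of random_read_wait */
    int             entropy_bits;
};

void demo_pool_init(struct demo_pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->readers, NULL);
    p->entropy_bits = 0;
}

/* Producer side: credit entropy and wake any sleeping readers. */
void demo_pool_add(struct demo_pool *p, int bits)
{
    pthread_mutex_lock(&p->lock);
    p->entropy_bits += bits;
    pthread_cond_broadcast(&p->readers);
    pthread_mutex_unlock(&p->lock);
}

/* Consumer side: sleep until at least one byte is available, then take it. */
void demo_pool_take_byte(struct demo_pool *p)
{
    pthread_mutex_lock(&p->lock);
    while (p->entropy_bits / 8 == 0)    /* re-check after every wake-up */
        pthread_cond_wait(&p->readers, &p->lock);
    p->entropy_bits -= 8;
    pthread_mutex_unlock(&p->lock);
}
```

As in random_read(), the consumer checks the condition before sleeping,
so it does not give up the processor when a full byte is already
available.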

Other Types of Drivers
Until now, all the device drivers we have dealt with have been character
drivers. These are usually the easiest to understand, but you may want
to write other drivers that interface with the kernel in different ways.

Block devices are similar to character devices in that they can be
accessed via the filesystem. /dev/hda is the device file for the
primary IDE hard drive on the system. Block devices are registered and
unregistered in similar ways to character devices using the functions
register_blkdev() and unregister_blkdev().

A major difference between block drivers and character drivers is that
block drivers do not provide their own read and write functionality and
instead use a request method.
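
The request idea can be modeled in user space. The sketch below is not
the kernel's request structure or queue API; it is a simplified,
hypothetical model (all demo_ names are invented here) in which
transfers are queued first and a single driver-side function later
drains the queue against an in-memory "disk".

```c
#include <stddef.h>
#include <string.h>

#define DEMO_SECTOR_SIZE 512
#define DEMO_SECTORS     16
#define DEMO_QUEUE_LEN   8

/* Hypothetical, simplified request: one transfer of whole sectors. */
struct demo_request {
    int            write;       /* nonzero = write to device, zero = read */
    unsigned long  sector;      /* first sector of the transfer */
    unsigned long  nsectors;    /* number of sectors to move */
    unsigned char *buffer;      /* caller's data buffer */
};

struct demo_disk {
    unsigned char       data[DEMO_SECTORS * DEMO_SECTOR_SIZE];
    struct demo_request queue[DEMO_QUEUE_LEN];
    size_t              pending;
};

/* The "block layer" side: queue a request instead of calling read/write. */
int demo_submit(struct demo_disk *d, struct demo_request rq)
{
    if (d->pending == DEMO_QUEUE_LEN ||
        rq.sector >= DEMO_SECTORS ||
        rq.nsectors > DEMO_SECTORS - rq.sector)
        return -1;
    d->queue[d->pending++] = rq;
    return 0;
}

/* The "driver" side: a request function that drains the queue in order. */
void demo_request_fn(struct demo_disk *d)
{
    for (size_t i = 0; i < d->pending; i++) {
        struct demo_request *rq = &d->queue[i];
        unsigned char *dev = d->data + rq->sector * DEMO_SECTOR_SIZE;
        size_t len = rq->nsectors * DEMO_SECTOR_SIZE;

        if (rq->write)
            memcpy(dev, rq->buffer, len);
        else
            memcpy(rq->buffer, dev, len);
    }
    d->pending = 0;
}
```

The point of the model is the division of labor: callers describe what
they want moved, and the driver decides when to service the queue.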

The 2.6 kernel has undergone major changes in the block device subsystem.
Old functions, like block_read() and block_write(), and kernel
structures, like blk_size and blksize_size, have been removed. In this section we
focus solely on the 2.6 block device implementation.

If you need the Linux kernel to work with a disk or a disk-like
device, you'll need to write a block device driver. The driver must inform the
kernel what kind of disk it's interfacing with. It does this by using
the gendisk structure.


include/linux/genhd.h

82 struct gendisk {
83         int major;                      /* major number of driver */
84         int first_minor;
85         int minors;
86         char disk_name[32];             /* name of major driver */
87         struct hd_struct **part;        /* [indexed by minor] */
88         struct block_device_operations *fops;
89         struct request_queue *queue;
90         void *private_data;
91         sector_t capacity;
...

Line 83: major is the major number for the block device. This can be
either statically set or dynamically generated, using register_blkdev(),
as it was in character devices.

Lines 84-85: first_minor and minors are used in determining the number of
partitions within the block device. minors contains the maximum number
of minor numbers the device can have. first_minor contains the first
minor device number of the block device.
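
As a worked illustration, and assuming the common convention that the
first minor number names the whole disk and the following ones name its
partitions, the mapping can be sketched as below. The helper name
partition_minor is hypothetical.

```c
/*
 * Hypothetical helper: minor number of partition `part` on a disk, where
 * part 0 stands for the whole disk. Returns -1 if the disk's range of
 * `minors` minor numbers does not cover that partition.
 */
int partition_minor(int first_minor, int minors, int part)
{
    if (part < 0 || part >= minors)
        return -1;
    return first_minor + part;
}
```

Under that assumption, a disk with first_minor 64 and minors 64 would
place its first partition at minor 65, and a disk with minors 16 could
expose at most 15 partitions.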

Line 86: disk_name is a 32 character name for the block device. It will
appear in the /dev filesystem, sysfs and /proc/partitions.

Line 87: part points to the set of partitions (hd_struct entries) that
are associated with the block device.

Line 88: fops is a pointer to a block_device_operations structure that
contains the operations open, release, ioctl, media_changed and
revalidate_disk. See include/linux/fs.h. In the 2.6 kernel each device
has its own set of operations.

Line 89: queue is a pointer to the request queue that helps manage the
pending operations for the device.

Line 90: private_data points to information that will not be accessed
by the kernel's block subsystem. Typically this is used to store data
that is used in low-level, device specific operations.

Line 91: capacity is the size of the block device in 512 byte sectors.
If the device is removable, like a floppy disk or CD, a capacity of
0 signifies no disk is present. If your device doesn't use 512 byte
sectors you'll need to set this value as if it did. For example, if your device
has 1000 256 byte sectors, that's equivalent to 500 512 byte sectors.
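
The conversion in that example is a single multiplication and division.
A minimal sketch, with a hypothetical function name:

```c
/* Convert a device's native sector count into 512-byte units. */
unsigned long long capacity_in_512(unsigned long long native_sectors,
                                   unsigned int native_sector_size)
{
    return native_sectors * native_sector_size / 512;
}
```

With the numbers from the text, capacity_in_512(1000, 256) gives 500.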

In addition to having a gendisk structure, a block device also needs a
spinlock structure for use with its request queue.

Both the spinlock and fields in the gendisk structure need to be
initialized by the device driver. After the device is initialized and
ready to handle requests the function add_disk() should be called to
add the block device to the system.

Finally, if the block device can be used as a source of entropy for the
system the module initialization can also call add_disk_randomness()--see
drivers/char/random.c for more detailed information.

Now that we've covered the basics of block device initialization we can
examine its complement, exiting and cleaning up the block device driver.
This is quite easy in the 2.6 version of Linux.

del_gendisk(struct gendisk *) removes the gendisk from the system and
cleans up its partition information. This call should be followed
by put_disk(struct gendisk *), which will release kernel references
to the gendisk. The block device is unregistered via a call to
unregister_blkdev(int major, char[16] device_name), which then allows
us to free the gendisk structure.

We also need to clean up the request queue associated with the block device
driver. This is done using blk_cleanup_queue(struct request_queue *).
Note: If you can only reference the request queue via the gendisk
structure, be sure to call blk_cleanup_queue() before freeing the gendisk.

Comments

Small question

Frederic Mora

Great article, it is a good way to introduce this book!

I have a question about a minor detail. In drivers/char/random.c, line 1635 reads:

if (sec_random_state->entropy_count / 8 == 0)

Now, why is this test using a division, instead of the much quicker less-than? I'd like to know why this cannot be written:

if (sec_random_state->entropy_count < 8 )

Any idea?

Thanks,
--Fred

Possibilities

rich

Fred,
I agree with the premise of your question. The division by 8 can be implemented by a shift, so it isn't necessarily slower than the comparison. Perhaps there's some architecture where it's actually a lot faster, so someone coded it that way. Or perhaps in a previous incarnation, the number of bytes available was needed nearby, so with optimization the divide is free. Or perhaps that's just the way this particular programmer thinks.
Rich

Re: Small question

Anonymous

Fred,

one checks for a value divisible by 8, the other for a value less than 8. the difference should be clear now.

Clearly you mis-read the

Anonymous

Clearly you mis-read the code.

Here are some code examples:

(1) if (my_val / 8 == 0) {...}
(2) if (my_val < 8) {...}
(3) if (my_val % 8 == 0) {...}

And explanations:

(1) Checks if a value is less than 8 and greater than -8. Any value not in -7...7 will be nonzero.
(2) Checks if a value is less than 8.
(3) Checks if a value is divisible by 8. (If the integer remainder of division by 8 is zero).
