The sysctl Interface
Although low-level, the tunable parameters of the kernel are very interesting to tweak and can help optimize system performance for the different environments where Linux is used.
The following list is an overview of some relevant /kernel and /vm files in /proc/sys. (This information applies to all kernels from 2.0 through 2.1.35.)
kernel/panic - The integer value is the number of seconds the system will wait before automatic reboot in case of system panic. A value of 0 means “disabled”. Automatic reboot is an interesting feature to turn on for unattended systems. The command-line option panic=value can be used to set this parameter at boot time.
kernel/file-max - The maximum number of open files in the system. file-nr, on the other hand, is read-only: it reports how many file structures are currently allocated, so it can't be modified. Similar entries exist for the inodes: a tunable system-wide maximum and a read-only counter. Servers with many processes and many open files might benefit from increasing the two system-wide maximums.
kernel/securelevel - This is a hook for security features in the system. The securelevel file is currently read-only even for root, so it can only be changed by program code (e.g., modules). Only the EXT2 file system uses securelevel—it refuses to change file flags (like immutable and append-only) if securelevel is greater than 0. This means that a kernel, precompiled with a non-zero securelevel and no support for modules, can be used to protect precious files from corruption in case of network intrusions. But stay tuned for new features of securelevel.
vm/freepages - Contains three numbers, all counts of free pages. The first number is the minimum free space in the system. Free pages are needed to fulfill atomic allocation requests, like incoming network packets. The second number is the level at which to start heavy swapping, and the third is the level to start light swapping. A network server with high bandwidth benefits from higher numbers in order to avoid dropping packets due to free memory shortage. By default, one percent of the memory is kept free.
vm/bdflush - The numbers in this file can fine-tune the behaviour of the buffer cache. They are documented in fs/buffer.c.
vm/kswapd - This file exists in all of the 2.0.x kernels, but was removed in 2.1.33 as not useful. It can safely be ignored.
vm/swapctl - This big file encloses all the parameters used in fine-tuning the swapping algorithms. The fields are listed in include/linux/swapctl.h and are used in mm/swap.c.
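Since the entries above are plain text files holding one or more integers, they can be read and written with ordinary file I/O. The following is a minimal user-space sketch, not code from the kernel or from this article; the helper names read_sysctl_ints and write_sysctl_int are hypothetical, and writing normally requires root privileges.

```c
#include <stdio.h>

/*
 * Read up to `max` whitespace-separated integers from a /proc/sys-style
 * file. Returns the number of values read, or -1 if the file cannot be
 * opened. (Hypothetical helper, for illustration only.)
 */
int read_sysctl_ints(const char *path, long *values, int max)
{
    FILE *fp = fopen(path, "r");
    int n = 0;

    if (fp == NULL)
        return -1;
    while (n < max && fscanf(fp, "%ld", &values[n]) == 1)
        n++;
    fclose(fp);
    return n;
}

/*
 * Write a single integer to a /proc/sys-style file. Returns 0 on
 * success, -1 on failure (for example, when not running as root).
 */
int write_sysctl_int(const char *path, long value)
{
    FILE *fp = fopen(path, "w");
    int ok;

    if (fp == NULL)
        return -1;
    ok = fprintf(fp, "%ld\n", value) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For instance, calling write_sysctl_int("/proc/sys/kernel/panic", 30) as root would ask for a reboot 30 seconds after a panic, and read_sysctl_ints("/proc/sys/vm/freepages", v, 3) would be expected to fill v with the three free-page thresholds on kernels that provide that file.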
Module writers can easily add their own tunable features to /proc/sys by using the programming interface to extend the control tree. The kernel exports the following two functions to modules:
struct ctl_table_header *register_sysctl_table(ctl_table *table, int insert_at_head);
void unregister_sysctl_table(struct ctl_table_header *table);
The former function is used to register a “table” of entries and returns a token, which is used by the latter function to detach (unregister) your table. The argument insert_at_head tells whether the new table must be inserted before or after the other ones, and you can easily ignore the issue and specify 0, which means “not at head”.
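The register/unregister pairing can be tried out in user space with a toy model. The sketch below is not the kernel's implementation: the two functions are stand-ins that only keep a linked list of registered tables, the ctl_table type is reduced to a single field, and busy_init and busy_cleanup are hypothetical names for the places where a module would make the two calls.

```c
#include <stdlib.h>

/* Toy stand-ins for the kernel types; only list linkage is modelled. */
typedef struct ctl_table { const char *procname; } ctl_table;

struct ctl_table_header {
    ctl_table *table;
    struct ctl_table_header *prev, *next;
};

/* Hypothetical list of registered tables, head first. */
static struct ctl_table_header *toy_head, *toy_tail;

/* Stand-in: link the table in and hand back a token for later removal. */
struct ctl_table_header *register_sysctl_table(ctl_table *table,
                                               int insert_at_head)
{
    struct ctl_table_header *h = malloc(sizeof(*h));

    if (h == NULL)
        return NULL;
    h->table = table;
    h->prev = h->next = NULL;
    if (toy_head == NULL) {
        toy_head = toy_tail = h;
    } else if (insert_at_head) {
        h->next = toy_head;
        toy_head->prev = h;
        toy_head = h;
    } else {
        h->prev = toy_tail;
        toy_tail->next = h;
        toy_tail = h;
    }
    return h;
}

/* Stand-in: unlink the table identified by the token and free the token. */
void unregister_sysctl_table(struct ctl_table_header *header)
{
    if (header->prev)
        header->prev->next = header->next;
    else
        toy_head = header->next;
    if (header->next)
        header->next->prev = header->prev;
    else
        toy_tail = header->prev;
    free(header);
}

/* Calling pattern as a module might use it (names are hypothetical). */
static struct ctl_table_header *busy_header;

int busy_init(ctl_table *root)
{
    busy_header = register_sysctl_table(root, 0);   /* 0: not at head */
    return busy_header ? 0 : -1;
}

void busy_cleanup(void)
{
    if (busy_header)
        unregister_sysctl_table(busy_header);
    busy_header = NULL;
}
```

The point of the model is the token: the module keeps the pointer returned at registration time and passes it back unchanged when it unloads.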
What is the ctl_table type? It is a structure made up of the following fields:
int ctl_name - This is a numeric ID, unique within each table.
const char *procname - If the entry must be visible through /proc, this is the corresponding name.
void *data - The pointer to data. For example, it will point to an integer value for integer items.
int maxlen - The size of the data pointed to by the previous field; for example, sizeof(int).
mode_t mode - The mode of the file. Directories should have the executable bit turned on (e.g., 0555 octal).
ctl_table *child - For directories, the child table. For leaf nodes, NULL.
proc_handler *proc_handler - The handler is in charge of performing any read/write triggered through /proc files. If the item has no procname, this field is not used.
ctl_handler *strategy - This handler reads/writes data when the sysctl system call is used.
struct proc_dir_entry *de - Used internally.
void *extra1, *extra2 - These fields were introduced in version 1.3.69 and are used to specify extra information for specific handlers. The kernel has a handler for integer vectors, for example, that uses the extra fields to learn the minimum and maximum values allowed for each number in the array.
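The field list can be summarized as a user-space mirror of the structure, together with a sketch of how an integer-vector handler might use extra1 and extra2. This is an illustration under assumptions: the handler and proc_dir_entry types are reduced to opaque pointers, mode_t is replaced by unsigned short, and toy_intvec_store is a hypothetical function that rejects out-of-range writes; it is not the kernel's own handler.

```c
#include <stddef.h>

/* User-space mirror of the fields listed above (types simplified). */
typedef struct ctl_table ctl_table;
struct ctl_table {
    int ctl_name;           /* numeric ID, unique within the table */
    const char *procname;   /* name under /proc/sys, if any */
    void *data;             /* pointer to the value(s) */
    int maxlen;             /* size of the data in bytes */
    unsigned short mode;    /* mode_t in the kernel */
    ctl_table *child;       /* child table for directories, else NULL */
    void *proc_handler;     /* opaque here: /proc read/write handler */
    void *strategy;         /* opaque here: system call handler */
    void *de;               /* used internally */
    void *extra1, *extra2;  /* handler-specific extra information */
};

/*
 * Hypothetical range check in the spirit of an integer-vector handler:
 * extra1 and extra2, when non-NULL, point to arrays holding the minimum
 * and maximum allowed for each element. Returns 0 if every new value is
 * in range and has been stored, -1 otherwise (nothing is stored).
 */
int toy_intvec_store(ctl_table *t, const int *newval, size_t count)
{
    int *dst = t->data;
    const int *min = t->extra1;
    const int *max = t->extra2;
    size_t i;

    if (count * sizeof(int) > (size_t)t->maxlen)
        return -1;
    for (i = 0; i < count; i++) {
        if (min && newval[i] < min[i])
            return -1;
        if (max && newval[i] > max[i])
            return -1;
    }
    for (i = 0; i < count; i++)
        dst[i] = newval[i];
    return 0;
}
```

Validating the whole vector before storing anything keeps the data consistent when only one element is out of range; whether a real handler rejects or clamps such values is a design choice this sketch does not settle.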
Well, the previous list may have scared most readers. Therefore, I won't show the prototypes for the handling functions and will instead switch directly to some sample code. Writing code is much easier than understanding it, because you can start by copying lines from existing files. The resulting code will fall under the GPL—of course, I don't see that as a disadvantage.
Let's write a module with two integer parameters, called ontime and offtime. The module will busy-loop for a few timer ticks and sleep for a few more; the parameters control the duration of each state. Yes, this is silly, but it is the simplest hardware-independent example I could imagine.
The parameters will be put in /proc/sys/kernel/busy, a new directory. To this end, we need to register a tree like the one shown in Figure 1. The /kernel directory won't be created by register_sysctl_table, because it already exists. Also, it won't be deleted at unregister time, because it still has active child files; thus, by specifying the whole tree of directories you can add files to every directory within /proc/sys.
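The nesting of tables needed for /proc/sys/kernel/busy can be sketched in user space as three small arrays, each ending with an entry whose ctl_name is 0. This is an assumption-laden illustration rather than the article's listing: the structure is trimmed to the fields that describe the tree, the numeric IDs and default values are invented for the sketch (real code would take the ID for the kernel directory from the kernel headers), and toy_lookup is a hypothetical helper that only shows how a path maps onto the tree.

```c
#include <stddef.h>
#include <string.h>

/* Trimmed user-space stand-in for ctl_table: tree-related fields only. */
typedef struct ctl_table ctl_table;
struct ctl_table {
    int ctl_name;           /* 0 terminates a table */
    const char *procname;
    void *data;
    int maxlen;
    unsigned short mode;
    ctl_table *child;
};

/* Hypothetical IDs and defaults, chosen only for this sketch. */
enum { TOY_KERN = 1, TOY_BUSY = 100, TOY_ONTIME = 1, TOY_OFFTIME = 2 };
static int busy_ontime = 1;
static int busy_offtime = 1;

/* Leaf entries: the two tunable integers. */
static ctl_table busy_table[] = {
    { TOY_ONTIME,  "ontime",  &busy_ontime,  sizeof(int), 0644, NULL },
    { TOY_OFFTIME, "offtime", &busy_offtime, sizeof(int), 0644, NULL },
    { 0 }
};

/* The new "busy" directory. */
static ctl_table busy_kern_table[] = {
    { TOY_BUSY, "busy", NULL, 0, 0555, busy_table },
    { 0 }
};

/* The existing "kernel" directory, named so the tree starts at the root. */
static ctl_table busy_root_table[] = {
    { TOY_KERN, "kernel", NULL, 0, 0555, busy_kern_table },
    { 0 }
};

/* Resolve a path such as "kernel/busy/ontime" against a table. */
ctl_table *toy_lookup(ctl_table *table, const char *path)
{
    for (; table && table->ctl_name; table++) {
        size_t len = strlen(table->procname);

        if (strncmp(path, table->procname, len) != 0)
            continue;
        if (path[len] == '\0')
            return table;
        if (path[len] == '/' && table->child) {
            ctl_table *hit = toy_lookup(table->child, path + len + 1);
            if (hit)
                return hit;
        }
    }
    return NULL;
}
```

Only the root table would be handed to register_sysctl_table; the directories and leaves are reached through the child pointers.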
Listing 2 is the interesting part of busy.c, which does all the work related to sysctl. The trick here is leaving all the hard work to proc_dointvec and sysctl_intvec. These handlers are exported only by version 2.1.8 and later of the kernel, so you need to copy them into your module (or implement something similar) when compiling for older kernels.
I won't show the code related to busy looping here, because it is completely out of the scope of this article. Once you have downloaded the source from the FTP site [1], it can be compiled on your own system. It works with both versions 2.0 and 2.1 on the Intel, Alpha and SPARC platforms.