Network Buffers and Memory Management

Writing a network device driver for Linux is fundamentally simple; most of the complexity (other than talking to the hardware) involves managing network packets in memory.

Optional Functionality

Each device has the option of providing additional functions and facilities to the protocol layers. Leaving these functions out degrades the service available via the interface, but does not prevent operation. The operations split into two categories: activation/shutdown and configuration.

Activation and Shutdown

When a device is activated (i.e., the flag IFF_UP is set), the dev->open() method is invoked if the device has provided one. This invocation lets the device take whatever action is needed before the interface is used, such as enabling the hardware. An error return from this function causes the device to stay down and causes the user's activation request to fail with the error returned by dev->open().

The dev->open() method has a second job for any device that is loaded as a module. The module must not be unloaded while the device is open; thus, the MOD_INC_USE_COUNT macro must be used within the open method.

The dev->close() method is invoked when the device is configured down. It should shut off the hardware in such a way as to minimise machine load (e.g., by disabling the interface or its ability to generate interrupts). It can also be used to allow a module device to be unloaded once it is down. The rest of the kernel is structured so that when a device is closed, all references to it by pointer are removed, which ensures that the device can be safely unloaded from a running system. The close method is not permitted to fail.
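The activation and shutdown rules above can be sketched as a small user-space model. This is not kernel code: struct model_dev, the model_* functions, the hw_present field and the use_count field are hypothetical stand-ins, with use_count playing the role that MOD_INC_USE_COUNT and its counterpart MOD_DEC_USE_COUNT play in a real module.

```c
#include <errno.h>
#include <stddef.h>

#define IFF_UP 0x1   /* same value as the real interface flag */

/* Hypothetical, cut-down stand-in for the kernel's device structure. */
struct model_dev {
    unsigned flags;
    int use_count;                         /* models the module use count */
    int hw_present;                        /* pretend result of probing the card */
    int (*open)(struct model_dev *dev);    /* optional, may fail */
    int (*close)(struct model_dev *dev);   /* optional, must not fail */
};

/* Driver side: enable the hardware and pin the module while open. */
static int model_open(struct model_dev *dev)
{
    if (!dev->hw_present)
        return -ENODEV;     /* the activation request fails with this code */
    dev->use_count++;       /* where MOD_INC_USE_COUNT would go */
    return 0;
}

/* Driver side: quiesce the hardware and release the module. */
static int model_close(struct model_dev *dev)
{
    dev->use_count--;       /* where MOD_DEC_USE_COUNT would go */
    return 0;
}

/* Protocol-layer side: what happens when IFF_UP is requested. */
static int model_bring_up(struct model_dev *dev)
{
    int err = dev->open ? dev->open(dev) : 0;

    if (err)
        return err;         /* device stays down */
    dev->flags |= IFF_UP;
    return 0;
}

/* Protocol-layer side: configuring the device down cannot fail. */
static void model_bring_down(struct model_dev *dev)
{
    if (!(dev->flags & IFF_UP))
        return;
    if (dev->close)
        dev->close(dev);
    dev->flags &= ~IFF_UP;
}
```

The point of the model is the asymmetry: model_bring_up() propagates the driver's error and leaves IFF_UP clear, while model_bring_down() has no error path at all.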

Configuration and Statistics

A set of functions provides the ability to query and to set operating parameters. The first and most basic of these is a get_stats routine which, when called, returns a struct enet_statistics block for the interface. This block allows user programs such as ifconfig to see the loading of the interface and any logged problem frames. Not providing this block means that no statistics will be available.

The dev->set_mac_address() function is called whenever a superuser process issues an ioctl of type SIOCSIFHWADDR to change the physical address of a device. For many devices this function is not meaningful and for others it is not supported. In these cases, set this function pointer to NULL. Some devices can only perform a physical address change if the interface is taken down. For these devices, check the IFF_UP flag, and if it is set, return -EBUSY.
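As a rough illustration of the "refuse while up" case, here is a user-space sketch. The names struct mac_dev and model_set_mac_address() are hypothetical and the signature is simplified; a real handler would also program the new address into the card.

```c
#include <errno.h>
#include <string.h>

#define IFF_UP   0x1   /* same value as the real interface flag */
#define ETH_ALEN 6     /* length of an Ethernet address in bytes */

/* Hypothetical, cut-down device structure for this sketch only. */
struct mac_dev {
    unsigned flags;
    unsigned char dev_addr[ETH_ALEN];
};

/* Change the physical address, but only while the interface is down. */
static int model_set_mac_address(struct mac_dev *dev, const unsigned char *addr)
{
    if (dev->flags & IFF_UP)
        return -EBUSY;                      /* hardware cannot switch while running */
    memcpy(dev->dev_addr, addr, ETH_ALEN);  /* record the new address */
    return 0;
}
```

With this behaviour a user has to take the interface down, change the address, and bring the interface back up.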

The dev->set_config() function is called in response to the SIOCSIFMAP ioctl when a user enters a command like ifconfig eth0 irq 11. It is passed an ifmap structure containing the desired I/O and other interface parameters. For most interfaces this function is not useful, and you can set this function pointer to NULL.

Finally, the dev->do_ioctl() call is invoked whenever an ioctl in the range SIOCDEVPRIVATE to SIOCDEVPRIVATE+15 is used on your interface. All these ioctl calls take a struct ifreq, which is copied into kernel space before your handler is called and copied back at the end. For maximum flexibility any user can make these calls, and it is up to your code to check for superuser status when appropriate. For example, the PLIP driver uses these calls to set parallel port timeouts so that a user can tune the plip device for a particular machine.
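A minimal user-space sketch of such a handler follows. Everything prefixed model_ or MODEL_ is hypothetical: the two command numbers, the timeout field, and the is_superuser argument, which stands in for the kernel's own privilege check. The long pointed to by arg plays the part of the data carried in the struct ifreq.

```c
#include <errno.h>

#define SIOCDEVPRIVATE 0x89F0   /* start of the 16 device-private ioctls */

/* Hypothetical private commands for an imaginary driver. */
#define MODEL_GET_TIMEOUT (SIOCDEVPRIVATE + 0)
#define MODEL_SET_TIMEOUT (SIOCDEVPRIVATE + 1)

struct model_priv {
    long timeout;   /* some tunable the driver exposes */
};

static int model_do_ioctl(struct model_priv *priv, int cmd, long *arg,
                          int is_superuser)
{
    switch (cmd) {
    case MODEL_GET_TIMEOUT:
        *arg = priv->timeout;       /* reading is open to any user */
        return 0;
    case MODEL_SET_TIMEOUT:
        if (!is_superuser)
            return -EPERM;          /* the driver must check this itself */
        priv->timeout = *arg;
        return 0;
    default:
        return -EOPNOTSUPP;         /* unused private command numbers */
    }
}
```

Splitting the commands into an unprivileged read and a privileged write is only one possible policy; the text above leaves that choice to the driver.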

Multicasting

Certain physical media types, such as Ethernet, support multicast frames at the physical layer. A multicast frame is heard by a group of hosts (not necessarily all) on the network, rather than going from one host to another.
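On Ethernet, a multicast destination is marked by the least significant bit of the first address byte, and the broadcast address (all ones) is the special case that every host hears. The sketch below shows that test; the function names are illustrative only and are not taken from the kernel.

```c
#define ETH_ALEN 6   /* length of an Ethernet address in bytes */

/* Non-zero if the group bit of the destination address is set. */
static int model_is_multicast(const unsigned char *addr)
{
    return addr[0] & 0x01;
}

/* Non-zero only for ff:ff:ff:ff:ff:ff. */
static int model_is_broadcast(const unsigned char *addr)
{
    for (int i = 0; i < ETH_ALEN; i++)
        if (addr[i] != 0xff)
            return 0;
    return 1;
}
```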

The capabilities of Ethernet cards are fairly variable. Most fall into one of three categories:

  • No multicast filters. The card either receives all multicasts or none of them. Such cards can be a nuisance on a network with a lot of multicast traffic, such as group video conferences.

  • Hash filters. A table is loaded onto the card giving a mask of entries for desired multicasts. This method filters out some of the unwanted multicasts but not all.

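  • Perfect filters. The card compares each frame against an exact list of addresses, so no unwanted multicasts get through. Most cards that support perfect filters combine this option with one of the first two above, because the perfect filter often has a length limit of 8 or 16 entries.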
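To see why a hash filter lets some unwanted multicasts through, consider the following sketch. It assumes a 64-bit table indexed by the top six bits of a CRC-32 over the address. That is one common arrangement, but the table size and the bits used differ between cards, and all function names here are hypothetical.

```c
#include <stdint.h>

#define ETH_ALEN 6   /* length of an Ethernet address in bytes */

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), no final inversion. */
static uint32_t model_crc32(const unsigned char *data, int len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (int i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return crc;
}

/* Map an address to one of 64 table bits (assumed layout). */
static unsigned hash_index(const unsigned char *addr)
{
    return model_crc32(addr, ETH_ALEN) >> 26;
}

/* Driver side: mark an address as wanted. */
static void hash_filter_add(uint64_t *table, const unsigned char *addr)
{
    *table |= (uint64_t)1 << hash_index(addr);
}

/* Card side: a frame passes if its bit is set, whoever set it. */
static int hash_filter_accepts(uint64_t table, const unsigned char *addr)
{
    return (int)((table >> hash_index(addr)) & 1);
}
```

Because many addresses share each of the 64 bits, any address that hashes to the same bit as a wanted one is accepted too. The frames that slip through still have to be discarded in software.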

It is especially important that Ethernet interfaces are programmed to support multicasting. Several Ethernet protocols (notably AppleTalk and IP multicast) rely on Ethernet multicasting. Fortunately, most of the work is done by the kernel for you (see net/core/dev_mcast.c).

The kernel support code maintains lists of the physical addresses your interface should accept for multicast. The device driver may return frames matching more than the requested list of multicasts if it is not able to do perfect filtering.

Whenever the list of multicast addresses changes, the device driver's dev->set_multicast_list() function is invoked. The driver can then reload its physical tables. Typically this looks something like:

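/* The SetToHear...() and LoadAddressList() helpers are placeholders
   for card-specific register programming. */
if (dev->flags & IFF_PROMISC) {
    SetToHearAllPackets();          /* promiscuous: accept every frame */
} else if (dev->flags & IFF_ALLMULTI) {
    SetToHearAllMulticasts();       /* accept every multicast frame */
} else if (dev->mc_count < 16) {
    LoadAddressList(dev->mc_list);  /* list is small enough for the card's filter */
    SetToHearList();
} else {
    SetToHearAllMulticasts();       /* too many addresses: fall back */
}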

A small number of cards can only do unicast or promiscuous mode. In this case the driver, when presented with a request for multicasts, has to go promiscuous. If this is done, the driver must itself set the IFF_PROMISC flag in dev->flags.

In order to aid the driver writer, the multicast list is kept valid at all times. This simplifies many drivers, as a reset from an error condition in a driver often has to reload the multicast address lists.

______________________

Comments

What about rmem_max / rmem_default ?

Anonymous

An admirable in-depth article. Just a stupid question (I'm so slow-witted): I still don't catch the link between the rmem_default/rmem_max sysctl parameters (socket receive buffer default/max length) and the buffer allocated by dev_alloc_skb(). Socket receive buffer vs buffer of skb: are we talking about the same memory area, or are they different things (involving necessarily a copy from the one to the other, sooner or later)?

Thanks for anyone who would make it clear to me,
Telenn

Missing pictures

Ovy

The links to figures do not work (File not found error). I guess time does matter (1996 article!). To anyone reading this article, please provide us some links for the pictures (or link to some other up to date articles).

Thank you,
Ovy

Fixed

Mitch Frazier

Should be working now.

Mitch Frazier is an Associate Editor for Linux Journal.

thnx

Ravikumar

Thanks for the great article.

At each layer the data and tail pointers change, right?

So if I need to access the L7 data (consider UDP), can I take it from the pre-routing hook, can I take data+udphdr->length?

Help Required....

Ram

Hi Alan Cox,
Thanks for the article.
I am Ram. I am new to device driver development.
Somehow I managed to write a network driver.
Still I need some help. I want to access the driver functions directly from a user program written in C.

I.e., I want to access the open, close, hard_start_xmit() and ioctl functions directly without using the socket API (socket, bind, connect etc). I want my own function API.
Is it possible to do it?

Thanks in advance,

good article

Ajay Thakur

Thanks for this article. It explains most of the things. But I still feel that some more material on bottom half/top half processing should be added, and the logic of freeing/owning skbuffers is not clear either.

Ajay
