Network Buffers and Memory Management

Writing a network device driver for Linux is fundamentally simple. Most of the complexity (other than talking to the hardware) involves managing network packets in memory.

Flags

A set of flags is used to maintain the interface properties. Some of these are “compatibility” items and as such are not directly useful. The flags are:

  • IFF_UP The interface is currently active. In Linux, the IFF_RUNNING and IFF_UP flags are basically handled as a pair, existing as two items for compatibility reasons. When an interface is not marked as IFF_UP, it can be removed. Unlike BSD, an interface that does not have IFF_UP set will never receive packets.

  • IFF_BROADCAST The interface has broadcast capability. There will be a valid IP address for the interface stored in the device addresses.

  • IFF_DEBUG Indicates debugging is desired. Not currently used.

  • IFF_LOOPBACK The loopback interface (lo) is the only interface that has this flag set. Setting it on other interfaces is neither defined nor a very good idea.

  • IFF_POINTOPOINT This interface is a point to point link (such as SLIP or PPP). There is no broadcast capability as such. The remote point to point address in the device structure is valid. Normally, a point to point link has no netmask or broadcast, but it can be enabled if needed.

  • IFF_NOTRAILERS More of a prehistoric than an historic compatibility flag. Not used.

  • IFF_RUNNING See IFF_UP.

  • IFF_NOARP The interface does not perform ARP queries. Such an interface must have either a static table of address conversions or no need to perform mappings. The NetROM interface is a good example of this. Here all entries are hand configured as the NetROM protocol cannot do ARP queries.

  • IFF_PROMISC If it is possible, the interface will hear all of the packets on the network. This flag is typically used for network monitoring, although it can also be used for bridging. One or two interfaces like the AX.25 interfaces are always in promiscuous mode.

  • IFF_ALLMULTI Receive all multicast packets. An interface that cannot perform this operation but can receive all packets will go into promiscuous mode when asked to perform this task.

  • IFF_MULTICAST Indicates that the interface supports multicast IP traffic, which is not the same as supporting a physical multicast. AX.25 for example supports IP multicast using physical broadcast. Point to point protocols such as SLIP generally support IP multicast.
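
As a rough illustration of how these flags are read, the sketch below tests bits in a flags word from user space. It assumes the IFF_* constants exported by <net/if.h>; the helper names iface_is_active(), iface_can_broadcast() and iface_uses_arp() are invented for this example and are not kernel functions.

```c
#define _DEFAULT_SOURCE      /* ask glibc to expose the IFF_* constants */
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>          /* IFF_UP, IFF_RUNNING, IFF_BROADCAST, IFF_NOARP, ... */

/* Hypothetical helper: IFF_UP and IFF_RUNNING are treated as a pair. */
int iface_is_active(unsigned int flags)
{
    return (flags & IFF_UP) && (flags & IFF_RUNNING);
}

/* Hypothetical helper: the interface advertises broadcast capability. */
int iface_can_broadcast(unsigned int flags)
{
    return (flags & IFF_BROADCAST) != 0;
}

/* Hypothetical helper: the interface performs ARP unless IFF_NOARP is set. */
int iface_uses_arp(unsigned int flags)
{
    return (flags & IFF_NOARP) == 0;
}
```

For example, iface_is_active(IFF_UP | IFF_RUNNING) is non-zero, while iface_uses_arp(IFF_UP | IFF_NOARP) is zero. A driver would apply the same bit tests to the flags word held in its device structure.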

The Packet Queue

Packets are queued for an interface by the kernel protocol code. Within each device, buffs[] is an array of packet queues, one for each kernel priority level. These are maintained entirely by the kernel code, but must be initialized by the device itself on boot up. The initialization code used is:

int ct;

/* Initialize one packet queue per kernel priority level. */
for (ct = 0; ct < DEV_NUMBUFFS; ct++)
    skb_queue_head_init(&dev->buffs[ct]);

All other fields should be initialized to 0.

The device gets to select the queue length it needs by setting the field dev->tx_queue_len to the maximum number of frames the kernel should queue for the device. Typically this is around 100 for Ethernet and 10 for serial lines. A device can modify this dynamically, although its effect will lag the change slightly.
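
The effect of the limit can be pictured with a small user-space model. This is only a sketch: struct queue_model and its functions are invented for illustration, and the text does not describe how the kernel enforces the limit, so the model simply assumes that frames are refused once the number waiting reaches tx_queue_len.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a device's transmit queue; not a kernel structure. */
struct queue_model {
    unsigned int queued;        /* frames currently waiting */
    unsigned int tx_queue_len;  /* limit chosen by the driver */
    unsigned int refused;       /* frames turned away because the queue was full */
};

void queue_model_init(struct queue_model *q, unsigned int tx_queue_len)
{
    assert(q != NULL);
    q->queued = 0;
    q->refused = 0;
    q->tx_queue_len = tx_queue_len;
}

/* Returns 1 if the frame was queued, 0 if it was refused (assumed policy). */
int queue_model_enqueue(struct queue_model *q)
{
    assert(q != NULL);
    if (q->queued >= q->tx_queue_len) {
        q->refused++;
        return 0;
    }
    q->queued++;
    return 1;
}
```

With the limit set to 10, as suggested for a serial line, the first ten calls to queue_model_enqueue() succeed and the eleventh is refused.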

Network Device Methods

Each network device has to provide a set of actual functions (methods) for the basic low level operations. It should also provide a set of support functions that interface the protocol layer to the protocol requirements of the link layer it is providing.

Setup

The init method is called when the device is initialized and registered with the system, in order to perform any low level verification and checking needed. It returns an error code if the device is not present, if areas cannot be registered or if it is otherwise unable to proceed. If the init method returns an error, the register_netdev() call returns the error code, and the device is not created.
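
The relationship between the init method and registration can be sketched in user space as follows. Everything here is a simplified model: struct model_netdev, model_register_netdev() and the two probe functions are hypothetical names, and the real register_netdev() does considerably more. The point shown is only that an error from init is passed back and the device is not created.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for a network device; not the kernel's structure. */
struct model_netdev {
    const char *name;
    int (*init)(struct model_netdev *dev);  /* the device's init method */
    int registered;                         /* set once registration succeeds */
};

/* Model of registration: call init, and give up if it reports an error. */
int model_register_netdev(struct model_netdev *dev)
{
    int err;

    assert(dev != NULL && dev->init != NULL);
    err = dev->init(dev);
    if (err != 0)
        return err;          /* the device is not created */
    dev->registered = 1;
    return 0;
}

/* Hypothetical init method for hardware that was found. */
int probe_present(struct model_netdev *dev)
{
    (void)dev;
    return 0;
}

/* Hypothetical init method for hardware that is missing. */
int probe_absent(struct model_netdev *dev)
{
    (void)dev;
    return -ENODEV;
}
```

Registering a device whose init method is probe_absent() returns -ENODEV and leaves registered at 0.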

Frame Transmission

All devices must provide a transmit function. It is possible for a device to exist that cannot transmit. In this case, the device needs a transmit function that simply frees the buffer passed to it. The dummy device has exactly this functionality on transmit.
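
A transmit function of this kind can be modeled in a few lines. The sketch below runs in user space; struct sink_skb, struct sink_dev and the sink_* functions are invented stand-ins for the kernel's buffer and device types, and the discarded counter is added purely so the behavior can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space stand-ins for a network buffer and a device (hypothetical). */
struct sink_skb {
    size_t len;
    unsigned char *data;
};

struct sink_dev {
    unsigned long discarded;   /* frames accepted and thrown away */
};

struct sink_skb *sink_alloc_skb(size_t len)
{
    struct sink_skb *skb = malloc(sizeof *skb);

    if (skb == NULL)
        return NULL;
    skb->data = calloc(len ? len : 1, 1);
    if (skb->data == NULL) {
        free(skb);
        return NULL;
    }
    skb->len = len;
    return skb;
}

void sink_free_skb(struct sink_skb *skb)
{
    if (skb != NULL) {
        free(skb->data);
        free(skb);
    }
}

/* Transmit method for a device that cannot transmit: free the buffer
 * and return 0 to report that it has been dealt with. */
int sink_xmit(struct sink_skb *skb, struct sink_dev *dev)
{
    assert(skb != NULL && dev != NULL);
    dev->discarded++;
    sink_free_skb(skb);
    return 0;
}
```

After sink_xmit() returns, the caller must treat the buffer as gone, just as with a real transmit function.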

The dev->hard_start_xmit() function is called to provide the driver with its own device pointer and a network buffer (a sk_buff) for transmitting. If your device is unable to accept the buffer, it should return 1 and set dev->tbusy to a non-zero value. The buffer is then queued to be retried later, although there is no guarantee that a retry will occur. If the protocol layer decides to free the buffer that the driver has rejected, then the buffer will not be offered back to the device. If the device knows the buffer cannot be transmitted in the near future, for example due to bad congestion, it can call dev_kfree_skb() to dump the buffer and return 0, indicating the buffer has been processed.

If there is space, the buffer should be processed. The buffer handed down already contains all the necessary headers, including link layer headers, and need only be loaded into the hardware for transmission. In addition, the buffer is locked, which means that the device driver has absolute ownership of the buffer until it chooses to relinquish it. The contents of a sk_buff remain read-only, with the exception that you are guaranteed that the next/previous pointers are free, so that you can use the sk_buff list primitives to build internal chains of buffers.

When the buffer has been loaded into the hardware or, in the case of some DMA driven devices, when the hardware has indicated transmission is complete, the driver must release the buffer by calling dev_kfree_skb(skb, FREE_WRITE). As soon as this call is made, the sk_buff in question may spontaneously disappear, and the device driver should not reference it again.
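
The three outcomes described above (reject and mark busy, drop, or load and release) can be put together in a user-space model. This is a sketch under stated assumptions, not driver code: the model_* names, the two-slot "hardware" and the congested field are all hypothetical, and model_kfree_skb() merely stands in for dev_kfree_skb().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HW_SLOTS     2      /* hypothetical hardware with two transmit slots */
#define HW_SLOT_SIZE 1536   /* hypothetical size of one slot, in bytes */

/* User-space stand-ins for the network buffer and the device. */
struct model_skb {
    size_t len;
    unsigned char data[HW_SLOT_SIZE];
};

struct model_dev {
    int tbusy;       /* non-zero while the device cannot accept buffers */
    int congested;   /* set when frames cannot go out in the near future */
    int slots_used;
    unsigned char hw[HW_SLOTS][HW_SLOT_SIZE];
};

struct model_skb *model_alloc_skb(size_t len)
{
    struct model_skb *skb;

    if (len > HW_SLOT_SIZE)
        return NULL;
    skb = calloc(1, sizeof *skb);
    if (skb != NULL)
        skb->len = len;
    return skb;
}

void model_kfree_skb(struct model_skb *skb)
{
    free(skb);
}

int model_hard_start_xmit(struct model_skb *skb, struct model_dev *dev)
{
    assert(skb != NULL && dev != NULL);

    if (dev->congested) {
        /* No prospect of sending soon: dump the buffer, report it handled. */
        model_kfree_skb(skb);
        return 0;
    }
    if (dev->slots_used == HW_SLOTS) {
        /* No room: reject the buffer. The caller still owns it. */
        dev->tbusy = 1;
        return 1;
    }
    /* Load the frame into the hardware, then release the buffer. */
    memcpy(dev->hw[dev->slots_used], skb->data, skb->len);
    dev->slots_used++;
    model_kfree_skb(skb);   /* skb must not be referenced after this point */
    return 0;
}
```

In this model the first two frames are loaded and return 0, and a third returns 1 with tbusy set. The rejected buffer remains valid and may be offered again, whereas a buffer for which 0 was returned has already been released.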

______________________

Comments

What about rmem_max / rmem_default ?

An admirable in-depth article. Just a stupid question (I'm so slow-witted) : I still don't catch the link between the rmem_default/rmem_max sysctl parameters (socket receive buffer default/max length) and the buffer allocated by dev_alloc_skb(). Socket receive buffer vs buffer of skb : are we talking about the same memory area, or are they different things (involving necessarily a copy from the one to the other, sooner or later) ?

Thanks for anyone who would make it clear to me,
Telenn

Missing pictures

The links to figures do not work (File not found error). I guess time does matter (1996 article!). To anyone reading this article, please provide us some links for the pictures (or link to some other up to date articles).

Thank you,
Ovy

Fixed

Should be working now.

Mitch Frazier is an Associate Editor for Linux Journal.

thnx

thanx for the great article..

at each layer the data and tail pointers change right??

so if i need to acces the L7 data,consider UDP can i take the from pre routing hook can i take data+udphdr->length..??

Help Required....

Hi Alan Cox,
Thanx for the article.
Iam Ram.Iam new to device driver development.
some how i manged to write a network driver.
still i need some help.But I want to access the driver functions directly from user program written in c.

i.e. I want to access the open,close,hard_start_xmit(),ioctl functions directly without using the socket api(socket,bind,connect etc). I want my own function api.
is it possible to do it.

Thanx in adavance,

good article

thanks for this article. It explains most of the things. But still I feel that some more thing related to Bottom Half/Top half processing should be added. and also things are not clear about the logic of freeing/owning skbuffers.

Ajay
