Inside the Linux Packet Filter
Network geeks among you may remember my article, “Linux Socket Filter: Sniffing Bytes over the Network”, in the June 2001 issue of LJ, regarding the use of the packet filter built inside the Linux kernel. In that article I provided an overview of the functionality of the packet filter itself; this time, I delve into the depths of the kernel mechanisms that allow the filter to work and share some insights on Linux packet processing internals.
In the previous article, some arguments regarding kernel packet processing were raised. It is worthwhile to recall briefly the most important of them:
Packet reception is first dealt with at the network card's driver level, more precisely in the interrupt service routine. The service routine looks up the protocol type inside the received frame and queues it appropriately for later processing.
During reception and protocol processing, packets might be discarded if the machine is congested. Furthermore, as they travel upward toward user land, packets lose network lower-level information.
At the socket level, just before reaching user land, the kernel checks whether an open socket for the given packet exists. If it does not, the packet is discarded.
Then the Linux kernel implements a generic-purpose protocol, called PF_PACKET, which allows you to create a socket that receives packets directly from the network card driver. Hence, any other protocols' handling is skipped, and any packets can be received.
An Ethernet card usually passes only the packets destined to itself to the kernel, discarding all the others. Nevertheless, it is possible to configure the card in such a way that all the packets flowing through the network are captured, independent of their MAC address (promiscuous mode).
Finally, you can attach a filter to a socket, so that only packets matching your filter's rules are accepted and passed to the socket. Combined with PF_PACKET sockets, this mechanism allows you to sniff selected packets efficiently from your LAN.
Even though we built our sniffer using PF_PACKET sockets, the Linux socket filter (LSF) is not limited to those. In fact, the filter also can be used on plain TCP and UDP sockets to filter out unwanted packets—of course, this use of the filter is much less common.
In the following, I sometimes refer either to a socket or to a sock structure. As far as this article is concerned, both forms indicate the same object, and the latter corresponds to the kernel's internal representation of the former. Actually, the kernel holds both a socket structure and a sock structure, but the difference between the two is not relevant here.
Another data structure that will recur quite often is the sk_buff (short for socket buffer), which represents a packet inside the kernel. The structure is arranged in such a way that addition and removal of header and trailer information to the packet data can be done in a relatively inexpensive way: no data actually needs to be copied since everything is done by just shifting pointers.
Before going on, it may be useful to clear up possible ambiguities. Despite having a similar name, the Linux socket filter has a completely different purpose with respect to the Netfilter framework introduced into the kernel in early 2.3 versions. Even if Netfilter allows you to bring packets up to user space and feed them to your programs, the focus there is to handle network address translation (NAT), packet mangling, connection tracking, packet filtering for security purposes and so on. If you just need to sniff packets and filter them according to certain rules, the most straightforward tool is LSF.
Now we are going to follow the trip of a packet from its very ingress into the computer to its delivery to user land at the socket level. We first consider the general case of a plain (i.e., not PF_PACKET) socket. Our analysis at link layer level is based on Ethernet, since this is the most widespread and representative LAN technology. Cases of other link layer technologies do not present significant differences.
As we mentioned in the previous article, the Ethernet card is hard-wired with a particular link layer (or MAC) address and is always listening for packets on its interface. When it sees a packet whose MAC address matches either its own address or the link layer broadcast address (i.e., FF:FF:FF:FF:FF:FF for Ethernet) it starts reading it into memory.
Upon completion of packet reception, the network card generates an interrupt request. The interrupt service routine that handles the request is the card driver itself, which runs with interrupts disabled and typically performs the following operations:
Allocates a new sk_buff structure, defined in include/linux/skbuff.h, which represents the kernel's view of a packet.
Fetches packet data from the card buffer into the freshly allocated sk_buff, possibly using DMA.
Invokes netif_rx(), the generic network reception handler.
When netif_rx() returns, re-enables interrupts and terminates the service routine.
The netif_rx() function prepares the kernel for the next reception step; it puts the sk_buff into the incoming packets queue for the current CPU and marks the NET_RX softirq (softirq is explained below) for execution via the __cpu_raise_softirq() call. Two points are worth noticing at this stage. First, if the queue is full the packet is discarded and lost forever. Second, we have one queue for each CPU; together with the new deferred kernel processing model (softirqs instead of bottom halves), this allows for concurrent packet reception in SMP machines.
If you want to see a real-world Ethernet driver in action, you can refer to the simple NE 2000 card PCI driver, located in drivers/net/8390.c; the interrupt service routine called ei_interrupt(), calls ei_receive(), which in turn, performs the following procedure:
Allocates a new sk_buff structure via the dev_alloc_skb() call.
Reads the packet from the card buffer (ei_block_input() call) and sets skb->protocol accordingly.
Repeats the procedure for a maximum of ten consecutive packets.
A slightly more complex example is provided by the 3COM driver, located in 3c59x.c, which uses DMA to transfer the packet from the card memory to the sk_buff.