Inside the Linux Packet Filter, Part II
As a side note, you may be wondering, how does a user process come to sleep on a given socket when it invokes a recv(), recvfrom() or recvmsg() system call? The mechanism is actually pretty easy: all the recv functions are implemented inside the kernel by calling, more or less directly, sock_recvmsg() (in net/socket.c). This function, in turn, calls the recvmsg() function registered inside the protocol-specific operations within the sock structure. For example, this function is packet_recvmsg() in the case of the PF_PACKET protocol.
The protocol-specific recvmsg function, among other things, sooner or later will call skb_recv_datagram(), a generic function handling datagram reception for all protocols. The latter function puts the process to sleep by calling wait_for_packet() (in net/core/datagram.c), which sets the process state to TASK_INTERRUPTIBLE (i.e., sleeping task) and queues the process on the socket's sleep queue. The process rests there until a call to wake_up_interruptible() is triggered by the arrival of a new packet, as we saw in the previous paragraphs.
The main filter implementation resides in net/core/filter.c, whereas the SO_ATTACH_FILTER and SO_DETACH_FILTER setsockopt() commands are dealt with in net/core/sock.c. The filter initially is attached to a socket via the sk_attach_filter() function, which copies it from user space to kernel space and runs an integrity check on it (sk_chk_filter()). The check is aimed at ensuring that no inconsistent or unsafe code is executed by the filter interpreter. Finally, the filter base address is copied into the filter field of the sock structure, where it will be used for filter invocation as we saw before.
The packet filter proper is implemented in the sk_run_filter() function, which is given an skb (the current packet) and a filter program. The latter is simply an array of BPF instructions (see Resources), that is, a sequence of numeric opcodes and operands. The sk_run_filter() function does nothing more than implement a BPF code interpreter (or a virtual CPU, if you prefer) in a pretty straightforward way: a long switch/case statement discriminates the opcode and takes actions on emulated registers and memory accordingly. The emulated memory space, over which the filter code runs, is of course the packet's payload (skb->data). The filter execution flow terminates, leading toward exiting the function, when a BPF RET instruction is encountered.
Note that the sk_run_filter() function is called directly only from PF_PACKET processing routines. Socket-level receive routines (i.e., TCP, UDP and raw IP ones) go through the wrapper function sk_filter() (in sock.h), which, in addition to calling sk_run_filter() internally, trims the packet to the length returned by the filter.
Our tour of the kernel packet handling functions is now complete. It is interesting to draw some conclusions regarding the packet filter invocation points. As we have seen, there are three distinct call points inside the kernel where the filter may get invoked: the TCP and UDP (layer 4) receive functions, and the PF_PACKET (layer 2.5) receive function. Raw IP packets are also filtered, because they pass through the same path followed by UDP packets (namely, sock_queue_rcv_skb(), which is used for datagram-oriented reception).
It is important to notice that, at each layer, the filter is applied to that layer's payload. That is, as the packet travels upward, the filter can see less and less information. For PF_PACKET sockets, the filter is applied to layer 2 information, which includes either the whole link-layer frame (for SOCK_RAW sockets) or the whole IP packet (for SOCK_DGRAM sockets); for TCP/UDP sockets, the filter is applied to layer 4 information (basically, port numbers and little other useful data). For this reason, layer 4 socket filtering is likely to be of limited use. Of course, in any case the application-level payload (user data) is always available for the filter, even if it is often of little or no use at all.
A clear example of layer 4 filtering is given in Listing 1 and Listing 2 [available at ftp.linuxjournal.com/pub/lj/listings/issue95/5617.tgz], which present a simple UDP server with an attached socket filter and an associated simple UDP data sender. The filter will accept only packets whose payload starts with “lj” (hex 0x6c6a). To test the program, compile and run Listing 1, called udprcv. Then compile Listing 2 (udpsnd), and launch it like this:
./udpsnd 127.0.0.1 "hello world"
Nothing will be printed by udprcv. Now, try writing a string starting with “lj”, as in
./udpsnd 127.0.0.1 "lj rules"

This time the string is printed correctly by udprcv, since the packet payload matches the filter.
Another important issue filter writers should be aware of is that the filter must be written for the type of socket (PF_PACKET, raw IP or TCP/UDP) it will be attached to. In fact, filter memory accesses use offsets that are relative to the first byte in the packet payload as seen at that specific level. Filter memory base addresses corresponding to the most common families are reported in Table 1.
Moreover, the method described in the June 2001 article for obtaining the filter code (i.e., using tcpdump -dd) no longer applies if non-PF_PACKET sockets are used, as it produces a filter that works only at layer 2 (it assumes that address 0 is the start of the link-layer frame).