Building a Next-Generation Residential Gateway

You may not need as much as you think to build a residential gateway.

Now to the real fun—choosing (or even better, designing) your hardware. MIPS is the most commonly used CPU architecture in these kinds of devices, with older devices using the MIPS32 4K cores, and newer devices probably using the MIPS32 24K cores. As MIPS 24K is about 3% slower than MIPS 4K at the same clock rate (because of the longer pipeline), it makes sense to use it only if you really need clock rates of about 400MHz. However, benchmarks show that networking devices are memory- and I/O-bound, not CPU-bound (after all, there is not much number crunching in forwarding packets, with a few exceptions discussed later), so you probably will need these high clock rates only if you are going to use DDR RAM and not SDRAM (which is limited to 133MHz).

Another option is the new (at least in the embedded world) multithreading (MT) technology of the MIPS32 34K, which can have a configurable number of Virtual Processing Elements (VPEs) and Thread Contexts (TCs). Without going into too much detail, this technology is a more flexible equivalent of Intel Hyper-Threading, which helps the CPU exploit parallelism at the process and thread level (compared to parallelism at the instruction level, which superscalar processors can exploit quite well). This can boost performance in some cases, especially when there are a few concurrent processes or threads that are I/O- or memory-bound, but benchmarks show that this may not always be the case. In addition, MT has an overhead—MIPS 34K is slightly slower than MIPS 24K at the same clock rate, and the Linux kernel becomes slightly slower when compiled with symmetric multiprocessing (SMP) enabled.

Another CPU you probably should consider is MIPS' closest competitor—ARM. It is used less frequently than MIPS for these particular devices, mostly for historical reasons. ARM7 is very popular, small and has a low power core; however, it is clearly too slow for an RG. If you decide to go with ARM, you should consider one of the ARM9 or ARM10 families of processors, with the major differences, as far as an RG application is concerned, being performance, power consumption and cost.

RAM and Flash choice is obvious; you need at least 8MB of RAM and 2MB of Flash. However, as memory prices are dropping, I expect that future devices will have 64MB of RAM and 4MB of Flash. For the same reason, you probably will have to go with DDR and not SDRAM—as SDRAM actually becomes more expensive than DDR.

Assuming you are going to use one of the reference design boards provided by a manufacturer (board design is beyond the scope of this article anyway), you are ready to test your brand-new RG. Unfortunately, you are in for a big surprise. Your CPU is going to choke at not so unreasonable bandwidths.

One possible solution is to use a more powerful (and significantly more expensive) CPU, such as the Intel IXP or Freescale PowerQUICC network processors. For instance, the 533MHz IXP425 will have no problem sustaining that kind of traffic. Unfortunately, in order to stay competitive, RG manufacturers cannot afford these high-end processors, so a more creative solution is required.


This is the challenge of next-generation RGs—achieve high throughput using low-end hardware. One possible solution is to offload the whole data path of the RG to hardware, the way it works on high-end core routers, which usually employ a general-purpose CPU only for management and control. There are a few chips that can do this for the low-end RG market, such as the Realtek RTL8650B/RTL8651B, which can do routing, NAT and firewall in hardware. Of course, the implementation is limited compared to the Linux TCP/IP stack, but the hardware can be configured to trap the CPU in case it encounters a packet it cannot handle, so that most of the packets will be forwarded from one interface to another without interrupting the CPU. However, this approach has a number of problems, the most serious one being the fact that the hardware TCP/IP stack is limited to a fixed interface (MII in case of RTL8650). It would be difficult and, in some cases, impossible, to connect other interfaces, such as 802.11, DOCSIS and xDSL, to that logic. Therefore, I believe that although this approach can work in some specific cases, it is a wrong idea in general.

Typically, it is a good idea to keep to the software as much as possible, because it's easier to develop and debug. Another optimization approach is based on the observation that RGs (and networking applications in general) are memory-bound, so it would be extremely beneficial to improve memory access. Let's separate data and code for the sake of this discussion. As far as the code is concerned, we want to keep it in cache as much as possible. Ideally, we want the whole critical path routines—that is, starting from one driver's receive function via the TCP/IP stack and to the other driver's send function to stay in cache all the time. This is not possible with most embedded processors, which have only a 32K cache. However, it can be shown that the Linux 2.6 critical path—that is, functions used 95% of the time, under firewall and NAT configuration, including Ethernet drivers' send and receive routines, can fit into a 64K cache, and there are embedded processors with 64K on the market. If your CPU does not have that much cache, but instead has scratchpad SRAM, you can modify the Linux linker script to place certain routines in the SRAM memory region.

If you want to test the above observation (or calculate how much cache your particular application needs), use OProfile, which is a system-wide profiler for Linux that allows you to profile user-mode applications, kernel and drivers, and supports many embedded architectures along with objdump (or any other utility) that will show you how much memory each routine requires.

As for the data, it is absolutely necessary to ensure that all network drivers follow zero-copy methodology, and it may be wise to place frequently accessed data structures in a scratchpad SRAM.

Yet another optimization approach is the mixture of the above two—using profiler, find the most CPU-intensive pieces of code and offload this particular functionality to the hardware. As it turns out, the two most CPU-intensive tasks related to networking are IPSec and the UDP/TCP checksum calculation. It is very convenient (and not very surprising) that both have a well-defined architecture for hardware offloading. UDP and TCP checksum offloading is extremely beneficial, because if it is checked in the hardware on receive and calculated in the hardware on transmit, the CPU will never have to bring the whole packet into the cache, significantly reducing the number of memory accesses. IPSec, on the other hand, is less useful, as an RG is rarely an IPSec termination point—usually IPSec (VPN) traffic is passed through and terminated on the PC.

Another approach that I definitely do not recommend, but one being taken by some manufacturers because it is actually cheaper than the ones mentioned above, is to “optimize” Linux by creating various types of “fast paths”. For instance, if the L2 bridge performance is not satisfactory, it is possible to pass packets from one network driver directly to the other, eliminating at least one context switch and some other code, resulting in performance gain. Although it will not work in general cases, it does work for an RG, where the manufacturer controls the whole system, including all the drivers and the kernel itself. The biggest problem with this approach is that it actually cripples the stock Linux kernel, limiting functionality and introducing bugs. These modifications rarely are submitted to the Linux-kernel mailing list, and even if they were, they never would be accepted. But, they do go into some products you can find in stores.