Critical Server Needs and the Linux Kernel
This article provides some examples of features and mechanisms needed in the Linux kernel for server nodes operating in mission-critical environments, such as telecom, where reliability, performance, availability and security are extremely important. Here, we discuss four such features: a cluster communication protocol, support for multi-FIB, a module to verify digital signatures of binaries at run time and an efficient low-level asynchronous event mechanism. For some of these example features, open-source projects already exist to provide their implementations. For other features, there currently is no open-source project that can implement them. For each of our four example features, we discuss the feature, its importance, the advantages it provides, its implementation when available and the status of its integration with the Linux kernel.
Today's computing and telecommunication environments increasingly are adopting clustered servers to gain benefits in performance, availability and scalability. The resulting benefits of a cluster are greater and/or more cost-efficient than what a single server offers. Furthermore, in the case of the telecommunication industry, the interest in clustering originates from the fact that clusters address carrier-grade characteristics--guaranteed service availability, reliability and scaled performance--using cost-effective hardware and software. Without being absolute about them, these requirements can be divided into three categories: short failure detection and failure recovery, guaranteed availability of service and short response times. The most widely adopted clustering technique is the use of multiple interconnected, loosely coupled nodes to create a single highly available system.
The direct advantages of clustering in telecom servers include:
High availability through redundancy and failover techniques, which isolate or reduce the impact of a failure in the machine, resources or device.
Manageability through appropriate system management facilities that reduce system management costs and balance loads for efficient resource utilization.
Scalability and performance through expanding the capacity of the cluster by adding more servers or, in terms of servers, adding more processors, memory, storage or other resources to support growth and to achieve higher levels of performance.
In addition, using commercial off-the-shelf building blocks in clustered systems offers a number of advantages, including a better price/performance ratio when compared to specialized parallel supercomputers; deployment of the latest mass-market technology as it becomes available at low cost; and added benefits from the latest standard operating system features, as they become available.
One feature missing from the Linux kernel in this area is a reliable, efficient and transparent interprocess and interprocessor communication protocol that we can use to build highly available Linux clusters. Transparent interprocess communication (TIPC) is a suitable open-source implementation that fills this gap and provides an efficient cluster communication protocol, leveraging the particular conditions present within loosely coupled clusters.
Figure 1. Functional View of TIPC
TIPC is unique because no other protocol seems to provide a comparable combination of versatility and performance. It includes some original innovations, such as functional addressing, topology subscription services and a reactive connection concept. Other important TIPC features include full location transparency, support for lightweight connections, reliable multicast, a signaling link protocol and more.
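To make functional addressing and topology subscription more concrete, here is a minimal user-space sketch in Python. It is not TIPC code and does not use the TIPC API: the `NameTable` class, its method names, the round-robin choice among servers and the port strings are all assumptions made for illustration. The idea it shows is that a client addresses a service by a (type, instance) name rather than by node, and that interested parties are told when bindings appear or disappear.

```python
from collections import defaultdict


class NameTable:
    """Toy name table (hypothetical): maps (service type, instance) to serving ports."""

    def __init__(self):
        self._bindings = defaultdict(list)     # (type, instance) -> [port, ...]
        self._subscribers = defaultdict(list)  # type -> [callback, ...]
        self._cursor = defaultdict(int)        # round-robin position per name

    def subscribe(self, service_type, callback):
        """Ask to be told when an instance of service_type appears or disappears."""
        self._subscribers[service_type].append(callback)

    def publish(self, service_type, instance, port):
        self._bindings[(service_type, instance)].append(port)
        self._notify("published", service_type, instance, port)

    def withdraw(self, service_type, instance, port):
        self._bindings[(service_type, instance)].remove(port)
        self._notify("withdrawn", service_type, instance, port)

    def resolve(self, service_type, instance):
        """Pick a port for the name; the caller never mentions a node."""
        key = (service_type, instance)
        ports = self._bindings[key]
        if not ports:
            return None
        port = ports[self._cursor[key] % len(ports)]
        self._cursor[key] += 1
        return port

    def _notify(self, event, service_type, instance, port):
        for callback in self._subscribers[service_type]:
            callback(event, instance, port)


events = []
table = NameTable()
table.subscribe(1000, lambda event, instance, port: events.append((event, instance, port)))
table.publish(1000, 1, "node-a:7")   # two servers offer the same functional name
table.publish(1000, 1, "node-b:3")
table.withdraw(1000, 1, "node-a:7")  # node-a leaves; clients keep using the name
print(table.resolve(1000, 1), events)
```

Because clients resolve a name each time, a server can move or fail over to another node without the clients changing the address they use.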
TIPC should be regarded as a useful toolbox for anyone wanting to develop or use carrier-grade or highly available Linux clusters. It provides the necessary infrastructure for cluster, network and software management functionality, as well as good support for designing site-independent, scalable, distributed, high-availability and high-performance applications.
It also is worth mentioning that the ForCES working group within IETF has agreed that it must be possible to carry its router internal protocol (the ForCES protocol) over different types of transport protocols. There is consensus that TCP is the protocol to be used when ForCES messages are transported over the Internet, while TIPC is the protocol to be used in closed environments (LANs), where special characteristics such as high performance and multicast support are desirable. Other protocols also may be added as options.
In addition, TIPC meets several priority level 1 and 2 requirements, as defined in the OSDL Carrier Grade Linux Requirements Definition, Versions 2.0 and 3.0, providing an implementation for the various protocols under the Cluster Communication Service requirements.
TIPC is a contribution from Ericsson to the Open Source community. It has undergone a significant redesign over the past two years and now is available as a portable source code package of about 12,000 lines of C code. The code implements a kernel driver, a design that has made it possible to boost performance--35% faster than TCP--and minimize the code footprint. The current version is available under a dual BSD and GPL license. It runs on Linux 2.4 and 2.6 and was announced on LKML (see Resources). Several proprietary ports to other operating systems (OSE, Tru64, Dicos, VxWorks) exist, and more are planned before the end of 2004.
Routers are core elements of modern telecom networks. They propagate and direct billions of data packets from their sources to their destinations using air transport devices or high-speed links. Routers must operate as fast as the medium they use in order to deliver the best quality of service and have a negligible effect on communications. To give some figures, it is common for routers to manage between 10,000 and 500,000 routes. In these situations, good performance is achievable by handling around 2,000 routes/sec.
The current implementation of the IP stack in Linux works fine for home or small business routers. However, given the high expectations of telecom operators and the new capabilities of telecom hardware, it is barely possible to use Linux as an efficient forwarding and routing element of a high-end router for large networks (core/border/access router) or as a high-end server with routing capabilities.
Two problems with the networking stack in Linux are the lack of support for multiple forwarding information bases (multi-FIB) with overlapping interface IP addresses and the lack of appropriate interfaces for addressing a FIB. Another problem with the current implementation is the limited scalability of the routing table.
The solution to these problems is to provide support for multi-FIB with overlapping IP addresses. As such, we can have different VLANs or different physical interfaces forming independent networks in the same Linux box. A good reason to separate VLANs is for security through separation of services. For instance, a GSN node having multiple company networks connected to it could use VLANs for separation, but that might not hold on the other side of the node. The only way to keep separation (and security) would be to have multiple FIBs.
Consider the example (see Figure 2) of having two HTTP servers serving two different networks with potentially the same IP address. One HTTP server serves the network/FIB 10, while the other HTTP server serves the network/FIB 20. The advantage gained is to have one Linux box serving two different customers using the same IP address. ISPs adopt this approach by providing services for multiple customers sharing the same server (server partitioning), instead of using a server per customer.
Figure 2. Example of Usage
The way to achieve this is to have an ID (an identifier that identifies the customer or user of the service) to separate the routing table completely in memory. Two approaches to doing this exist. The first is to have separate routing tables; each routing table is looked up by its ID, and within that table the lookup is done by the prefix. The second approach is to have one table, in which the lookup is done on the combined key = prefix + ID.
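As a rough illustration of these two approaches, the following Python sketch implements both with the standard `ipaddress` module. It is a user-space model only, not kernel code; the class names, method names and next-hop strings are hypothetical, and longest-prefix matching is assumed as the lookup rule within a FIB.

```python
import ipaddress


class PerIdFibs:
    """Approach 1: one routing table per FIB ID; pick the table, then match the prefix."""

    def __init__(self):
        self._tables = {}  # fib_id -> {network: next_hop}

    def add_route(self, fib_id, prefix, next_hop):
        self._tables.setdefault(fib_id, {})[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, fib_id, address):
        addr = ipaddress.ip_address(address)
        table = self._tables.get(fib_id, {})
        matches = [net for net in table if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        return table[max(matches, key=lambda net: net.prefixlen)]


class CombinedKeyFib:
    """Approach 2: a single table whose key combines the FIB ID and the prefix."""

    def __init__(self):
        self._routes = {}  # (fib_id, network) -> next_hop

    def add_route(self, fib_id, prefix, next_hop):
        self._routes[(fib_id, ipaddress.ip_network(prefix))] = next_hop

    def lookup(self, fib_id, address):
        addr = ipaddress.ip_address(address)
        # Try the longest prefix first, so the first hit is the most specific route.
        for length in range(addr.max_prefixlen, -1, -1):
            net = ipaddress.ip_network((addr, length), strict=False)
            hop = self._routes.get((fib_id, net))
            if hop is not None:
                return hop
        return None


for fib in (PerIdFibs(), CombinedKeyFib()):
    # The same prefix exists in FIB 10 and FIB 20 with different next hops.
    fib.add_route(10, "192.168.1.0/24", "if-customer-10")
    fib.add_route(20, "192.168.1.0/24", "if-customer-20")
    print(type(fib).__name__, fib.lookup(10, "192.168.1.5"), fib.lookup(20, "192.168.1.5"))
```

In both variants the overlapping address resolves differently depending on the ID, which is the separation the two HTTP servers in the example rely on.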
A different kind of problem arises when we are not able to predict access time with the chaining in the hash table of the routing cache and FIB. This problem is of particular interest in an environment that requires predictable performance.
Another aspect of the problem is that the route cache and the routing table are not kept synchronized most of the time (path MTU, to name one). The route cache flush is executed regularly; therefore, any updates on the cache are lost. For example, if you have a routing cache flush, you have to rebuild every route you currently are talking to by going for every route in the hash/trie table and rebuilding the information. First, you have to look it up in the routing cache; if you have a miss, you need to go in the hash/trie table. This process is slow and not predictable, because the hash/trie table is implemented with linked lists and the potential for collisions is high when a large number of routes are present. This design is suitable for a home PC with a few routes, but it is not scalable for a large server.
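The effect of chaining on lookup time can be seen in a small Python model. This is only a sketch of a generic chained hash table, not the kernel's routing code; the class and function names and the bucket count of 16 are assumptions chosen to keep the numbers easy to follow.

```python
class ChainedRouteTable:
    """Fixed number of buckets, each holding a chain (list) of (key, route) pairs."""

    def __init__(self, buckets):
        self._buckets = [[] for _ in range(buckets)]

    def insert(self, key, route):
        self._buckets[hash(key) % len(self._buckets)].append((key, route))

    def lookup(self, key):
        """Return (route, probes), where probes counts the chain entries examined."""
        chain = self._buckets[hash(key) % len(self._buckets)]
        for probes, (candidate, route) in enumerate(chain, start=1):
            if candidate == key:
                return route, probes
        return None, len(chain)


def worst_case_probes(route_count, buckets=16):
    """Insert route_count integer keys and report the longest successful lookup."""
    table = ChainedRouteTable(buckets)
    for key in range(route_count):
        table.insert(key, f"route-{key}")
    return max(table.lookup(key)[1] for key in range(route_count))


print(worst_case_probes(16))    # one entry per bucket
print(worst_case_probes(1600))  # chains grow with the number of routes
```

With the bucket count fixed, the worst-case number of probes grows in proportion to the number of routes, so lookup time depends on table occupancy rather than on a fixed bound.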
To support the various routing requirements of server nodes operating in high-performance, mission-critical environments, Linux should support the following:
An implementation of multi-FIB using a tree (radix, Patricia and so on). It is important to have predictable performance in insert/delete/lookup from 10,000 to 500,000 routes. In addition, it is favorable to have the same data structure for both IPv4 and IPv6.
Socket and ioctl interfaces for addressing multi-FIB.
Multi-FIB support for neighbors (ARP).
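To illustrate why a tree gives predictable lookups, here is a minimal binary trie in Python. It is a sketch under stated assumptions, not a proposed kernel design: the `PrefixTrie` class is hypothetical, one instance is assumed per FIB and per address family, the trie is uncompressed (a radix or Patricia tree would collapse single-child paths) and deletion is omitted for brevity.

```python
import ipaddress


class PrefixTrie:
    """Binary trie keyed on address bits; a lookup visits at most one node per bit."""

    def __init__(self):
        self._root = {}  # node layout: {0: child, 1: child, "hop": next_hop}

    @staticmethod
    def _bits(value, width, count):
        """Yield the top `count` bits of a `width`-bit integer, most significant first."""
        for i in range(count):
            yield (value >> (width - 1 - i)) & 1

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        node = self._root
        for bit in self._bits(int(net.network_address), net.max_prefixlen, net.prefixlen):
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, address):
        """Return (next_hop, steps) for the longest matching prefix."""
        addr = ipaddress.ip_address(address)
        node, best, steps = self._root, self._root.get("hop"), 0
        for bit in self._bits(int(addr), addr.max_prefixlen, addr.max_prefixlen):
            node = node.get(bit)
            if node is None:
                break
            steps += 1
            if "hop" in node:
                best = node["hop"]  # remember the most specific match so far
        return best, steps


fibs = {10: PrefixTrie(), 20: PrefixTrie()}  # one trie per FIB ID
fibs[10].insert("10.0.0.0/8", "gw-a")
fibs[10].insert("10.1.0.0/16", "gw-b")
fibs[20].insert("10.0.0.0/8", "gw-c")
print(fibs[10].lookup("10.1.2.3"), fibs[20].lookup("10.1.2.3"))
```

The number of steps is bounded by the address width (32 for IPv4, 128 for IPv6) no matter how many routes are stored, and the same structure serves both address families.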
Providing these implementations in Linux affects a large part of net/core, net/ipv4 and net/ipv6; these subsystems, mostly the network layer, will need to be rewritten. Other areas will feel minimal impact at the source code level; most of the impact will be at the transport layer--socket, TCP, UDP, RAW, NAT, IPIP, IGMP and so on.
As for the availability of an open-source project that can provide these functionalities, an existing project, Linux Virtual Routing and Forwarding, may be able to help. This project aims at implementing a flexible and scalable mechanism for providing multiple routing instances within the Linux kernel. The project has some potential for providing the needed functionalities; however, no progress has been made since 2002, and the project now appears to be inactive.
The Distributed Security Infrastructure (DSI) is an open-source project started at Ericsson to provide a secure framework for carrier-grade Linux clusters that run soft real-time distributed applications. Carrier-grade clusters have tight restrictions on performance and response time, making the design of security solutions difficult. Many security solutions cannot be used due to their high resource consumption. A security framework that targets carrier-grade Linux clusters therefore was needed to provide advanced security levels in such systems.
Linux generally has been considered immune to the spread of viruses, backdoors and Trojan programs on the Internet. However, with the increasing popularity of Linux as a desktop platform, the risk of seeing viruses or Trojans developed for this platform is growing. One way of mitigating this risk is to allow the system to prevent, at run time, the execution of untrusted software.
One solution is to digitally sign the trusted binaries and have the system check the digital signature of binaries before running them. Untrusted (unsigned) binaries therefore are denied execution. This can improve the security of the system by preventing a wide range of malicious binaries from running on it.
Figure 3. bsign's Signature Section as Added in an ELF Binary
Figure 4. DigSig in Action
DigSig, a component of DSI, is one implementation of such a feature. DigSig is a Linux kernel module that checks the signature of a binary before running it. DigSig inserts digital signatures inside the ELF binary and verifies this signature before loading the binary. It is based on the Linux security module (LSM) hooks. LSM has been integrated into the Linux kernel since the 2.5.x series.
Typically, in this approach, vendors do not sign binaries; the control of the system remains with the local administrator. The administrator is responsible for signing all binaries she trusts with her private key. DigSig therefore guarantees two things. First, if you signed a binary, no one else can modify that binary without being detected. Second, nobody can run a binary that is not signed or is signed badly.
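The two guarantees can be demonstrated with a toy sign-and-verify model in Python. This sketch does not reflect DigSig's actual code, key handling or signature format: it uses textbook RSA over a SHA-256 digest with deliberately small, insecure parameters, and the function names are hypothetical. Its only purpose is to show the check the loader performs before allowing execution.

```python
import hashlib

# Toy RSA parameters for illustration only: two Mersenne primes, far too small for real use.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537                               # public exponent, known to the verifier
D = pow(E, -1, (P - 1) * (Q - 1))       # private exponent, kept by the administrator (Python 3.8+)


def digest(binary: bytes) -> int:
    """Hash the binary and reduce the digest into the toy modulus."""
    return int.from_bytes(hashlib.sha256(binary).digest(), "big") % N


def sign(binary: bytes) -> int:
    """Administrator side: sign the digest with the private key."""
    return pow(digest(binary), D, N)


def may_execute(binary: bytes, signature) -> bool:
    """Loader side: allow execution only if the signature matches the binary."""
    if signature is None:
        return False                    # unsigned binaries are refused
    return pow(signature, E, N) == digest(binary)


trusted = b"\x7fELF...trusted program"
signature = sign(trusted)
print(may_execute(trusted, signature))                # signed and unmodified
print(may_execute(trusted + b"payload", signature))   # modified after signing
print(may_execute(b"\x7fELF...unknown", None))        # never signed
```

Only the first call is allowed: a binary modified after signing no longer matches its signature, and a binary without a signature is rejected outright.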
Several initiatives in this domain already have been made, such as Tripwire, bsign and Cryptomark, but we believe the DigSig project is the first to be easily accessible to all--it is available on SourceForge under the GPL license--and to operate at the kernel level at run time. Run time is particularly important for carrier-grade environments, as it takes into account the high availability aspects of the system.
The DigSig approach has been to use existing solutions such as GnuPG and bsign rather than reinventing the wheel. However, in order to reduce the overhead in the kernel, the DigSig project took only the minimum code necessary from GnuPG. This helped to reduce the amount of code imported to the kernel; only one-tenth of the original GnuPG 1.2.2 source code has been imported to the kernel module.
DigSig is a contribution from Ericsson to the Open Source community under the GPL license. DigSig has been announced on LKML; however, it is not yet integrated in the Linux kernel.
Operating systems for carrier-grade systems must be able to deliver a high response rate with minimum downtime. In addition, carrier-grade systems must take into account characteristics such as scalability, high availability and performance.
In carrier-grade systems, thousands of requests must be handled concurrently without affecting the overall system's performance, even under extremely high loads. Subscribers expect some latency time when issuing a request, but they are not willing to accept an unbounded response time. Such transactions are not handled instantaneously for many reasons, and it can take some milliseconds or seconds to reply. Waiting for an answer reduces applications' abilities to handle other transactions.
Many different solutions have been proposed and prototyped to improve the Linux kernel capabilities in this area. Most have focused on using different types of software organization, such as multithreaded architectures, implementing efficient POSIX interfaces or improving the scalability of existing kernel routines.
One possible solution appropriate for carrier-grade servers is the asynchronous event mechanism (AEM). AEM provides asynchronous execution of processes in the Linux kernel. It implements native support for asynchronous events in the Linux kernel and aims to bring carrier-grade characteristics to Linux in the areas of scalability, performance and soft real-time responsiveness.
An event-based mechanism provides a new programming model that offers software developers unique and powerful support for asynchronous execution of processes. Of course, it differs radically from the sequential programming styles used, but it offers a design framework better structured for software development. It also simplifies the integration and the interoperability of complex software components. In addition, AEM offers an event-based development framework, scalability, flexibility and extensibility.
The emerging paradigm of AEM provides a simpler and more natural programming style when compared to the complexity offered by multithreaded architectures. It proves its efficiency for the development of multilayer software architectures, where each layer provides a service to the upper layer. This type of architecture is quite common for distributed applications. One of the strengths of AEM is its ability to combine synchronous and asynchronous code in the same application, or even mix these two types of models within the same code routine. With this hybrid approach, it is possible to take advantage of their respective capabilities, depending on the situation. This model is favorable especially for the development of secure software and for the long-term maintenance of mission-critical applications.
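The event-based style described here can be sketched in a few lines of Python. This is a user-space illustration of the programming model only; it is not AEM and does not use its interfaces, and the `EventLoop` class, the event names and the `serve` function are assumptions made for the example. Each handler runs its synchronous part to completion and posts an event for work that should finish later, so accepting a new request never waits on an earlier reply.

```python
from collections import deque


class EventLoop:
    """Minimal event dispatcher: handlers run when their event is delivered."""

    def __init__(self):
        self._handlers = {}      # event name -> list of callbacks
        self._pending = deque()  # events posted but not yet delivered

    def on(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def post(self, event, payload):
        """Queue an event and return immediately; nothing blocks here."""
        self._pending.append((event, payload))

    def run(self):
        """Deliver queued events in order, including ones posted by handlers."""
        while self._pending:
            event, payload = self._pending.popleft()
            for handler in self._handlers.get(event, []):
                handler(payload)


def serve(requests):
    loop, log = EventLoop(), []

    def on_request(req):
        log.append(f"accepted {req}")  # synchronous part: runs to completion
        loop.post("reply", req)        # asynchronous part: completed later

    def on_reply(req):
        log.append(f"replied {req}")

    loop.on("request", on_request)
    loop.on("reply", on_reply)
    for req in requests:
        loop.post("request", req)
    loop.run()
    return log


print(serve(["a", "b"]))
```

In the resulting log, both requests are accepted before either reply is produced, which is the interleaving a blocking, strictly sequential handler would not allow.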
Ericsson released AEM to the Open Source community in February 2003 under the GPL license. AEM was announced on LKML and received a lot of feedback. The feedback suggested changes to the design, which resulted in an improved implementation and a better kernel-compliant code structure. AEM is not yet integrated into the Linux kernel.
Many challenges accompany the migration from proprietary to open platforms. One of the main challenges remains the availability of the various kernel features and mechanisms needed for telecom platforms and the integration of these features into the Linux kernel.
Ibrahim Haddad works for the Ericsson Research branch in Montreal, Canada. He also serves as contributing editor to Linux Journal. Ibrahim co-authored with Richard Peterson Red Hat Linux Pocket Administrator and Red Hat Enterprise and Fedora Edition: The Complete Reference (DVD Edition), both published by McGraw-Hill/Osborne. He currently is a Dr. Sc. Candidate at Concordia University.