The Linux Kernel Summit
A webcast of the summit will be up April 10th on the OSDN web site. I recommend listening to the talks there to learn more about what went on.
After registration and danish, Ted opened the show by expressing his thanks to the show sponsors, IBM, AMD and EMC. The show was well appointed, with a wireless 802.11b network and cards available for attendees. On each table was a power strip for laptops--you've never seen so many laptops. The network was somewhat sporadic, but as I walked around the room I noticed a lot of compiling going on. A lot of software development was happening. While I was there, Dave Miller wrote a utility to modulate the speed of the CPU fans based on the temperature reading from the motherboard.
Each talk lasted about one hour, with ample time for questions and interruptions built in.
The first talk, given by Lance Larsh of Oracle, covered the requirements for running a high-performance database under Linux. Just its inclusion in the show should indicate this is an important issue for the kernel group. The back-and-forth was exciting and important for both sides. One interesting point was the statement that the raw filesystem isn't as important as previously considered because database administrators hate it, and it makes a database harder to back up and restore.
That was followed by a spirited discussion of O_SYNC and O_DSYNC (flags to the open system call to force synchronous I/O) and the changes from 2.2 to 2.4 on multiple-CPU machines running SCSI. This was followed by many discussions, ranging from shared memory paging and page sizes to the memory each process consumes for page tables. The problem with shared page tables, Linus noted, is that you would need some sort of page table lock or semaphore, and it wasn't going to happen that way.
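For readers who haven't met these flags, here is a minimal sketch of what a database-style synchronous write looks like from user space. The helper name write_synchronously and the file name redo.log are made up for illustration; only the flags themselves come from the discussion.

```python
import os
import tempfile


def write_synchronously(path, data, data_only=False):
    """Write data so that write() returns only after it has reached the device.

    O_SYNC waits for the data and all file metadata; O_DSYNC waits for the
    data plus only the metadata needed to read it back.
    """
    sync_flag = os.O_DSYNC if data_only else os.O_SYNC
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag, 0o600)
    try:
        return os.write(fd, data)  # number of bytes written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "redo.log")  # hypothetical file name
        write_synchronously(target, b"commit 1\n")
        write_synchronously(target, b"commit 2\n", data_only=True)
```

The trade-off is the one the database folks care about: each write is slower, but once it returns, the data is not sitting only in a cache.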
SCTP is the Stream Control Transmission Protocol for streaming media, and La Monte has been working on a kernel module to implement it. Selected by the IETF, it is a peer to TCP and UDP. It's described as a reliable, message-oriented, multiple-ordered message stream with support for automatic network failover--think multihomed multimedia serving.
SCTP requires a number of changes to established systems, namely the bind(2) system call (for SCTP, bindx). Since SCTP is designed to be handled across a number of machines, it requires sets of addresses from bind. The X/Open folks were reluctant to advocate multiple bind calls for resolution, favoring a set to be returned from a single request.
For proper implementation, SCTP also needs some other network features. Additionally, some of the more "single-threaded" bits of the networking present challenges for SCTP as well.
There was some very real action going on during the breaks. Ted Ts'o, the organizer, knew this would be true and planned 30-minute breaks between talks. It's all very "old home week" for these guys, and you can tell that many haven't seen each other for some time. For myself, it was good meeting people I had previously met only through e-mail and seeing those whom I run into only at these kinds of conferences. I don't want to focus too much on personality in this article, but it's all very chummy and fun.
It's funny to think that Linux and the surrounding software were all developed over the Net and not in person. Of course, you could never bring everyone involved into one place, but it's impressive that the Open Source/Free Software movement can change the world so effectively while not being in physical proximity. I don't want to get too Jon Katzian here, but you can be assured that free software is alive, well and thriving.
Ted jokingly called this "a completely non-controversial and inconsequential talk". The block device is absolutely vital and important and, well, Stephen is the guy.
Starting with scalability, Stephen spoke about the naming of devices, the need for bounce buffering to be optional for large memory support, the 2TB device support limit, the need to drop the 1k disk alignment, and issues regarding SCSI (LUN rescanning) and SMP scalability in the SCSI layers.
On robustness, Stephen commented on the need to deal intelligently with different errors. Currently error response can be brittle, causing drives to be taken off-line when it could be a fairly inconsequential sector error. Also, he mentioned there is a problem to date with not distinguishing between read and write failures.
In the realm of performance, Stephen mentioned issues including buffer efficiency problems and queuing issues. Stephen feels a per-spindle approach is preferable to a per-unit approach.
During his discussion of the extra features he was considering, one mentioned was the possibility to defer atime updates on sleeping drives. This would make for more power-efficient Linux laptops and desktops, which, as a Californian, I can appreciate. Atime is set whenever a file is accessed by any program, even if it is not changed. This could be considered useful but not enough to spin up the drive and spend vital power on. One of Nate Myers's tricks back when he ran Linux Laptops was to mount drives with the atime feature shut off to save power, and it is a great trick to know.
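To make the deferred-atime idea concrete, here is a toy model of it. This is only a sketch: the SleepyDisk class and its methods are invented for illustration and say nothing about how the kernel would actually implement the feature.

```python
import time


class SleepyDisk:
    """Toy model of a drive that queues atime updates while spun down."""

    def __init__(self):
        self.spinning = False
        self.spin_ups = 0
        self.atime_on_disk = {}  # path -> atime actually written to the platter
        self.pending_atime = {}  # path -> newest atime still held in memory

    def spin_up(self):
        if not self.spinning:
            self.spinning = True
            self.spin_ups += 1
        # The drive is awake anyway, so flush the deferred updates now.
        self.atime_on_disk.update(self.pending_atime)
        self.pending_atime.clear()

    def spin_down(self):
        self.spinning = False

    def read_cached(self, path, now=None):
        """A read served from cache: only the atime update would touch the disk."""
        now = time.time() if now is None else now
        if self.spinning:
            self.atime_on_disk[path] = now
        else:
            self.pending_atime[path] = now  # defer instead of waking the drive
```

In this model, any number of cached reads while the drive sleeps cost zero spin-ups, and the newest access time is written out the next time the drive wakes for some other reason. The mount trick mentioned above is the blunter version of the same idea: the noatime mount option simply stops recording access times.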
Lunch was, well, a hotel lunch. Weird brown meat, fried and dried chicken, scary mayonnaise-covered noodles and tortellini filled with liquid cement. No different from any trade show I've been to. The sandwiches on Day 2 were good, but hotels should really just stop.
Unfortunately, I was pulled away for the first 15 minutes of the talk and missed a bunch of stuff. The slides can be found on-line. Stephen noted that the new I/O patches from Stephen Tweedie, combined with SGI's allocator, were able to make great strides on XFS integration. This eventually led to an interesting debate on the capacities of Linux. Namely, Dave Miller talked about how unwise it would be to plan for everyone having 128 processors and thus make "sparser" environments lacking in efficiencies.
Dave Miller, as usual, was a very smart person in this debate.
In this talk, Jamal brought up issues he had with the current networking code. This was a very good talk, and I'd recommend listening to the webcast; it is well worth your time and too complex for me to simplify here. I spent much of the talk passing the mic around (I was mic monkey for Ted Ts'o), but it was a lot of fun.
Especially interesting was the discussion of the uses of consolidating interrupts for busy networks. Imagine if the kernel could queue up data and not waste a lot of time switching about servicing the data. Anyhow, go listen.
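A back-of-the-envelope way to see why consolidating interrupts matters is to count them. The sketch below is my own simplification, not anything shown in the talk: the function name and the timer-window model are assumptions.

```python
def count_interrupts(arrival_times_us, coalesce_window_us=0):
    """Count interrupts raised for packets arriving at the given times.

    With a window of 0, every packet raises its own interrupt. With a larger
    window, the first packet starts a timer and every packet arriving before
    it expires is serviced by that same interrupt.
    """
    interrupts = 0
    window_end = None
    for t in sorted(arrival_times_us):
        if window_end is None or t >= window_end:
            interrupts += 1
            window_end = t + coalesce_window_us
    return interrupts
```

Feed it 100 packets spaced 10 microseconds apart, for example count_interrupts(range(0, 1000, 10), 100), and the count drops from 100 interrupts to 10. The cost is a little added latency for the packets that wait in the queue.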
Johannes Erdfelt, in case you didn't know, is the guy handling USB for Linux. USB, PCMCIA, 1394, SCSI and PCI all have to deal with the problem of resources and devices being installed into and removed from the operating system. As the main job of the kernel is managing resources, handling disappearing and reappearing resources is quite challenging.
Do you remove a /dev/radio simply because the radio device has been pulled? What if a device begins to misbehave? What if it's mounted and pulled out? Hot Plugging (and unplugging) of devices presents a great challenge to the device driver writer, so I looked forward to this talk.
Issues that come up with Hot Plug include device naming and enumeration, and bus enumeration. USB can be tricky from an architectural point of view. Johannes mentioned the problems with user-space notification of devices. For instance, when a mouse is attached, it would be nice to have X notice properly. I learned from the talk that USB is an imperfect standard, but the best imperfect, perhaps.
As this progressed, we were informed by H. Peter Anvin about the starvation problem for major and minor numbers. We only have around 28 majors left before something must be done. There was also an obvious dislike of DevFS, or at least of requiring DevFS in order to use hot plug or new devices past the 8-bit limit.
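The arithmetic behind the shortage is simple. Assuming the traditional 16-bit device number, with 8 bits for the major and 8 bits for the minor, it can be sketched like this (the function names are mine):

```python
MAJOR_BITS = 8
MINOR_BITS = 8


def make_dev(major, minor):
    """Pack a major/minor pair into a 16-bit device number."""
    if not (0 <= major < 1 << MAJOR_BITS and 0 <= minor < 1 << MINOR_BITS):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << MINOR_BITS) | minor


def split_dev(dev):
    """Unpack a 16-bit device number into (major, minor)."""
    return dev >> MINOR_BITS, dev & ((1 << MINOR_BITS) - 1)
```

So the first partition on the first IDE disk, major 3 and minor 1, packs to 0x0301, and there is simply no way to express a major of 256 or higher. That gives 256 majors in total, each with 256 minors, which is why running low on majors forces the issue.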
Linus brought up the point that there is zero, zip, nada need for new devices to receive major numbers. And he defined major/minor numbers as being one of the three huge problems with UNIX/Linux. The other two are ioctls and something, uh, else. He stopped short of mentioning number three.
Alan Cox brought up the side issue of the need for Japanese users to see a Japanese /dev directory. And that brought up that server number and device naming can be indirect.
KBuild is, uh, challenged. Eric, author of a new configuration system called CML2, and Keith, the maintainer of the current KBuild, assert it is needlessly complex and intimidating, and the configuration language is ill-suited to the task. In addition, make dep is "fundamentally broken" and parental dependencies of Make create issues. To the statement that make -j4 was broken, Linus replied that was not the case, and he obviously felt the current system isn't that bad. Shortly thereafter, CML2, written in Python, was demonstrated.
The new configuration system seems to be quite good, considering the stage it is at, but the use of Python gave audience members pause. As the current system relies on shell scripts and C, many felt that requiring Python could prove to be a problem when extending to different architectures. Eric countered that those in a cross-compile situation wouldn't feel bad about taking the configuration file and passing it on to the new system in the event that the new system didn't have Python installed. They are both right in their way. Many would like to have the complete build environment on the target machine to bootstrap the system completely, without having to include the Python interpreter, but it is true that porting over Python isn't a huge challenge once gcc is working on a foreign system.
How will memory management change? Some of the new features for 2.5 include virtual memory balancing, NUMA support, memory quality of service and improvements in the handling of physical and virtual memory. NUMA was getting a lot of good attention at the conference. A BOF (birds of a feather session) was scheduled for the night before, and it appeared to be well attended.
Security Enhanced Linux (SELinux) is a way of moving access control and protection to the kernel. It was Peter's assertion that the proper place for security is in the kernel, and all access to all resources should be mediated and administered. He proposed Mandatory Access Control (MAC), with a central administrator that creates policy that controls all resources in the OS. Having this outside the kernel is not practical, for there is no way a user-space application can guarantee a hostile kernel module isn't doing improper things.
You can receive key benefits from MAC. Memory can be secured cryptographically, as can pipelines, and you can prevent the ability to bypass secured applications so you wouldn't be able to skip past the security. You can also assure that you are running only appropriate and cleared code.
How can we get to MAC? Well, it's not trivial. Traditionally, you can enforce a system-wide security policy. Based on integrity and confidentiality attributes of subjects, MAC can monitor and regulate use accordingly.
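To give a feel for what "regulating use by confidentiality attributes" means, here is a textbook-style toy check. It is emphatically not SELinux's policy engine: the level names and the mac_allows function are assumptions for illustration, and the rules are the classic "no read up, no write down" ones.

```python
# Hypothetical confidentiality levels, lowest to highest.
LEVELS = {"public": 0, "confidential": 1, "secret": 2}


def mac_allows(subject_level, object_level, operation):
    """Decide access from labels alone; the file's owner gets no say."""
    s, o = LEVELS[subject_level], LEVELS[object_level]
    if operation == "read":
        return s >= o  # no reading above your clearance
    if operation == "write":
        return s <= o  # no writing down, so secrets cannot leak to lower levels
    return False  # anything the policy doesn't mention is denied
```

The point of the "mandatory" part is visible even in this toy: a process labeled "secret" cannot write to a "public" file no matter what the file's permissions say, because the decision is made centrally from the labels.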
It is nice that all Linux applications do not have to support SELinux; the only caveat is that programs may fail in different ways if they attempt to access secured resources. What is very interesting is SELinux's benchmarks show little effect on speed, which is quite impressive. Pipes do take a bit of a hit (up to 17%), as do file copies (up to 9.85%).
Using SELinux will also result in a 10-20% slowdown on networking. So, security doesn't come cheaply, but Peter noted they had not pursued optimization yet. Also, the lock used by the MAC system needs to achieve a finer grain to reach greater speed.
Asynchronous I/O has always been the red-headed stepson of the Linux kernel family. It may seem a bad metaphor, but low speed and underutilized drivers haven't held a lot of appeal for the average driver writer. The current model needs replacing, according to Ben, for many reasons. To make async I/O useful in event-driven applications, to make efficient use of raw I/O and to make it possible for async I/O to fit within the zero copy ideal, it needs some real changes.
Ben presented the needs for contextual async device I/O for events. Although Linus shot down the need for an mmap handle, Ben was agreeable and asked good-naturedly where people were when he requested comments on-line. He also presented four new system calls for handling async I/O. Linus suggested a fifth call for get events, but Ben disagreed, citing the capacity of the buffer for the I/O events.
After a brief list of reasons why we need power management, Andy gave a short history of power management under Linux and within the industry, noting both APM and ACPI. He proceeded with a description of the system and device power states, and for those interested in the inner workings of ACPI, it was very interesting to see the meat of the 500-page ACPI spec.
Intel's group has spent some time creating an ACPI core that is cross-platform and is in use on other Unices and under FreeBSD. The core was developed by Intel and open-source developers, and their intention is to have this reference driver available for OS developer use.
Currently, the great problem is sleep support. However, it is my personal experience that when you look at the battery costs of backing up any significant amount of memory to non-volatile (read: disk) storage, you are better off shutting down than going into a hibernate mode.
Andy points out that to do sleep right, you must have the context of every device in the system, so every device must respond to PM events completely and accurately. That isn't quite done right now and won't be for a bit. Linus countered that this was a dream, and the kernel would be better off killing and reestablishing devices on wake up to ensure they are running properly.
Driver abstraction came up again, as was brought up in the hot plug discussion, but the problem of redoing the entire driver system is somewhat daunting.
Talks 12 and 13 were both shorter and more about audience participation...
It is widely known that the kernel has shunned the use of version control systems like CVS as, well, Linus doesn't want them to be used. Larry McVoy of LMBench fame started a company some three years ago to create a new kind of version control system under a new kind of license. More about the license in a bit. Larry started the project as a way of doing source control in a more peer-to-peer way with multiple repositories, all of which are peers of each other. BitKeeper is a very complete version control system, with command-line and GUI merge applications that are very impressive.
What wasn't talked about was the non-OSI-compliant BitKeeper license. It's worth examining, and it's on the BitKeeper web site. Check it out. Subversion (another next generation version control system) was discussed in a BOF session at the summit as well.
Ted Ts'o also wanted the group to discuss the use of a formal bug control process, a contentious issue at best. Linus doesn't much care as he sees it to be the job of the lieutenants, like Alan Cox, Ted Ts'o and Dave Miller, to deal with. He's pessimistic that any system will fulfill the needs of the kernel team, citing the geographic dispersal and ill-formed bug reports seen on the kernel mailing list.
Bugzilla was discussed and rejected without changes to its structure, as was GNATS. There was some talk about integrating bugs with some future patch system. In the short term (assuming the adoption of some numbered bug submission system), each patch would feature a list of the bugs it fixes by referring to the bug id in the code. Much of this was rejected as well, as it seemed to be a case of putting the cart before the horse.
A part of me thinks that most of the work done was accomplished out in the foyer, as developers would go out during talks that were not compelling to them and work out problems with other developers. If that was the only benefit of the summit, then it would be enough really. Ted did a terrific job of bringing together the right people, and there are already plans for another summit.
What I took away from the meeting was a greater sense of the complete inevitability of Linux. That this all can come about and has worked this well for ten years is a monumental achievement; one that will benefit computer science and engineering for decades to come.
Chris DiBona has been using Linux since early 1995. In addition to being the Linux International Grant Chair, he is part of the events staff of the Open Source Developers Network (a subsidiary of VA Linux Systems). He was coeditor of the O'Reilly book, Open Sources, and was proud to work with Marty Garbus and the Electronic Frontier Foundation (which you should join!) to defend the freedom of software developers from copyright maximalists in the 2600 case. If you'd like to contact Chris, you can find him at http://www.dibona.com, and he welcomes e-mail at firstname.lastname@example.org