The Penguin and the Dinosaur
Although Linux began its life on the Intel x86 architecture and still has strong roots there, it is highly portable. Linux runs on everything from SGIs at the high end to Palm Pilots, Psion PDAs and tiny embedded microcontrollers at the low end.
Linux's world just got a little larger. It now runs on the IBM System/390 mainframe. That's right; there is now a port of Linux to the IBM System/390 mainframe architecture. Once you've booted it, it works much like you'd expect any other Linux system to. As a matter of fact, there are two ports: there's IBM's, which runs only on relatively recent System/390 machines and was developed in secret, and there's also Linas Vepstas' “Bigfoot” port, which runs on older System/370 machines as well and was done as a proper open-source project. Bigfoot was quite close to running—it boots the kernel and loads /bin/sh—when IBM dropped the bomb. Reconciliation of the two projects is an ongoing and somewhat acrimonious issue.
I can hear some readers scratching their heads in puzzlement. If you have a mainframe, you presumably already have a perfectly good and extremely expensive operating system for it, right? Running Linux on such a machine means taking down what is almost certainly an incredibly expensive and probably highly loaded production machine—not a good idea.
Well, the first part is true. However, if you have a recent S/390 or any version of VM (virtual machine), you don't have to interrupt service at all to play with other operating systems on your production machine. If you're coming to this from the Linux/x86 world, you're probably aware of VMWare, which effectively splits your PC and lets you boot Windows NT in a virtual machine under Linux (and if you're not, you should be—it's an incredibly cool application). Well, IBM mainframes did it first.
The VM operating system (also called a “hypervisor”, like a supervisor on steroids) was designed from the start to do exactly that; in essence, under VM, each user gets his or her own copy of the machine. It looks exactly like a System/390, complete with attached peripherals, and the user owns (effectively, is root on) that virtual machine. Behind the scenes, of course, the peripherals are simulated and are managed by VM, and the hardware is designed with virtualization assistance features which VM exploits. This is analogous to the way VMWare provides “disk access” by simulating each disk as a file within the Linux file system. VM can also simply present an actual physical device, if appropriate to your needs.
This gets really interesting when you realize you can run VM in your own Virtual Machine. In fact, a very common use of VM is testing a new system second-level; that is, running underneath the instance of VM actually running on the bare metal. This allows you to make sure all your existing programs still work. Only when you have all the bugs worked out of the new system do you briefly shut down the machine and bring it back up with the new system running on the actual, rather than a virtual, machine. You can be confident it will work, since you've already extensively tested it.
There are other tricks you can do with VM. IBM's mainframe cash cow is OS/390, formerly MVS. You can run OS/390—or VSE, the third mainframe OS—in a virtual machine, although you cannot run guest operating systems under OS/390. Thus, VM lets you do the same test upgrade process with OS/390 as it does with VM itself. The virtual machines do not have to reflect the hardware configuration of the physical machine: virtual machines can be multiprocessor machines, even on a uniprocessor real system, or vice versa. Give each machine as much memory as you think it ought to have and its own set of printers, tapes, LAN adapters, whatever.
This is such a massively useful feature that IBM, in its more recent machines, has included a hardware feature which acts as a stripped-down VM in microcode; whether it is, in fact, a stripped-down older version of VM is a subject of debate. In any event, even without running VM, you get the benefit of LPARs (“Logical PARtitions”) which effectively allow you to partition your mainframe into a small number (usually 15 or fewer) of conceptual machines. IBM's mainframe competitors offer essentially the same functionality under different names, such as Amdahl's MDF.
How efficient is virtualization under VM? The architecture was designed with self-virtualization in mind, so it is much more efficient than VMWare. When a virtual machine under VM isn't doing anything, it consumes very little in the way of resources. Medium to large VM systems typically support about 5000 simultaneous logins (each, mind you, with its own virtual machine) without hiccuping.
David Boyes at Dimension Enterprises recently did a web server torture test—in an LPAR running a single VM instance, so his machine was running second-level on a medium-sized S/390, and each of his Linux machines was running third-level. His one goal was to see how many Linux boxes he could bring up, simultaneously serving requests. The results were somewhat astounding.
Are you sitting down? He ran out of resources at 41,400 simultaneous Linux machines—forty-one thousand four hundred. I'll get to what that's good for a little later; the notion of 41,400 web servers on a single physical box ought to give you an idea, though.
The upshot is, if you have access to a recent System/390, you can almost certainly spare the resources to play with Linux/390 without noticeably affecting your production system. If you don't have access to a mainframe, well, just keep reading. I have a little surprise coming up.
One obvious question is, “what good does this do?” Let's look at a few scenarios. Scott Courtney has addressed this issue wonderfully in his essay at LinuxPlanet (see Resources), which should be required reading if you're interested in L/390—it is a great deal more in-depth than my article.
UNIX administration and development skills are much more common and much cheaper than mainframe skills. Thus, it's going to be easier to find people to work on Linux on your mainframe than it would be to find OS/390, VSE or VM gurus.
What good is a mainframe? Mainframes traditionally had little in the way of CPU power (no longer true—you get plenty of raw CPU speed out of one), but have absolutely fantastic I/O capabilities. One of the main needs in a big web server farm or a big e-commerce server is I/O, of course.
Linux/390 presents an interesting migration path for organizations which are seeking to de-emphasize their mainframes but can't just decommission them, because it provides services simply unavailable in other environments. I've personally seen this scenario happen twice: once at Rice University and once at Princeton. Plans to shut the dinosaur down were announced, then retracted once it became clear that vital parts of the campus would grind to a halt, since there was simply no viable alternative to some of the mainframe services. Moving over to Linux for those things it can do provides a smoother and cheaper transition, without the need for additional hardware. Let's face it: if your organization is going to be moving away from OS/390 or VM, better they should move to Linux than to another OS. Since Linux running Samba is already a good back-office substitute for NT, you could provide Windows browsing services to your users without ever needing PC hardware, let alone an NT license.
Another exciting possibility is that VM licenses tend to be held by academic institutions. Imagine a third-year computer science course on operating systems or networking. Now imagine each student getting his very own Linux box with which to play. There's full OS source code, a full development environment and isolation from the production systems and other students. A fantastic course could be developed around a study of Linux internals in such an environment, and a medium-sized S/390 could support a class of 25 students, all recompiling the kernel at once, without breaking a sweat.
The commercial version of this scenario is the commercial web-hosting server. Traditionally, this means you get dedicated access to a machine in a rack space somewhere, physically managed by an ISP. We'll do the numbers a little later, but in short, if you're doing this on a large scale, the price of a mainframe and a VM license get dwarfed quickly by the price of a whole bunch of fast, rack-mounted PCs. Your labor cost also drops radically, as you don't have to physically set up yet another PC; you simply create one more virtual machine on your VM box, and give it its own copy of the installed system disks.
As an aside, it doesn't hurt to remember that the total cost of network ownership typically breaks down to less than one-quarter hardware, roughly one-third service and facilities, with the remainder the necessary staff to support it (IDC, 1996). Administration tasks are obviously greatly simplified when the entire network of Linux machines is contained within a single box.
It would also be quite possible to separate various services onto various virtual machines. Sendmail would get its own (virtual) Linux box, DNS another, Samba another. This would be good from both a security standpoint (an exploit on one machine compromises only one service) and a reliability standpoint. You can also split the various pieces of a multi-tier application (e.g., web front end, business rules processing engine in the middle and RDBMS on the back end) among separate virtual machines, and run your database on OS/390, if you prefer. The isolation would make both debugging and development somewhat easier. I know this is something we always laugh at the NT people for requiring, but there are three advantages to the Linux-on-a-virtual-machine method of service isolation:
Additional hardware cost is zero, as opposed to a couple thousand dollars per machine.
Additional software cost is zero, as opposed to the cost of an NT license per machine.
Actual resource utilization overhead is very low, since VM's virtual machines consume almost no resources unless they are actually running.
Mainframes are expensive. There's no way around that fact. They're less expensive than they used to be, but they're certainly not $800 PCs. I'm not particularly au courant with IBM's pricing structure, but let's take a million dollars as a high-end price. A million dollars will buy you a lot of mainframe; you'd get a terrific IBM support contract with it and a VM license, as well as a backup solution. Most mainframe shops run OS/390, often in tandem with VM, but if your purpose is to run many Linux virtual machines, then you'd want VM and would have no use for OS/390 unless you wanted to do traditional mainframe computing too. I'm told a brand-new, top-end system with several terabytes of DASD is closer to $600,000 than a million, and much cheaper secondhand. However, for the purposes of argument, let's stick with a million as a nice round figure.
What do you get for that price? A machine which will run—I'm being extremely conservative here—1000 simultaneous Linux machines without a problem. You're talking $1000 per Linux box: not too different from the cost of a low-end rackmount Linux system, thus the price right there is a wash.
Obviously, it's a lot cheaper to rent space for a single S/390 and its associated disk arrays than it is to rent space for 1000 physical Linux boxes; even in one-unit packaging at 42 units/rack, you're still talking 24 racks. That's certainly more than even a big S/390 is going to take. Networking gets easier too; buy the 155MBps ATM options on your OSA cards, plug up to twelve ATM interfaces into your machine (if you need more scalability, options exist for you) and coordinate internal communication between your virtual machines via a virtual LAN. An internal, virtual LAN is much easier and cleaner to administer than a physical topology of switches and runs of Cat5 and fiber.
You can cut back on resource usage, too. If some of your machines will have identical file systems (it's fairly common, for example, in the UNIX world, to mount /usr read-only and have symlinks to what needs to be writable), then you can maintain a single disk with the shared file system and mount it from each of the virtual machines. It's just like sharing file systems over NFS, only without the ugliness of NFS. This, incidentally, was what David Boyes did when running his 41,400 machines: all shared /usr and /bin; with a little more trickery, /lib could be shared too. Furthermore, there's no need to specify much—or indeed, any—swap for your Linux machines; VM knows all about memory management and does it very effectively. Give each of your Linux machines a bunch of virtual memory, and let the VM hypervisor worry about paging it in and out.
Now let's look at reliability. We'll assume one of these $1000 machines has a MTBF of 1000 days. That's probably on the high side for a thousand-dollar machine, if you're pushing it fairly hard. At $1000, you're not getting RAID, your disks are probably IDE, and even redundant power supplies are unlikely. If you have a thousand of these boxes in a room, the chance you will get through a day with none of them failing is 1-(1/1000)<+>1000<+>, or about 37%. In other words, two days out of three, you're going to have to replace something.
Of course, if the mainframe fails, all your machines fail. However, one of the things you're paying for with your million dollars is rock-solid reliability: a System/390 is built with enough redundancy that if something fails, the rest of the system stays up and can be hot-swapped. This isn't just disks: on multiprocessor machines, you can replace a failed processor without bringing down the system. I giggled when I saw Microsoft trumpeting it had achieved 99.5% up time with NT. Thus, under exceptionally good circumstances, properly administered and maintained, NT is down, on average, a little less than an hour a week. I was a VM systems programmer from 1992 to 1994; during that time, we typically had under an hour of scheduled down time a year. Unscheduled down time was zero.
Backups cease to be an issue; because VM is managing all the disks for the virtual machines, all their data is backed up with the VM backup. The same with data integrity; for that kind of cash, you get well-implemented hot-swappable RAID, in which the complexity is never even visible to the Linux machines, because they see their disk space just as devices presented to them by VM. Basically, no matter how many virtual machines you have, you have only one actual machine to protect, so the cost of doing so remains constant, rather than scaling with the number of machines.
Furthermore, VM has an extremely efficient cache. Frequently accessed disk blocks will be held in the cache and requests never go to the drives at all. This is a huge win if you either share disks among machines, or if you're running a server farm.
There's also a low end in the mainframe market. The P/390 is a PCI card containing a chip with the S/390 instruction set on it. It is sold in combination with another PCI card which provides a channel interface (so you can drive your real S/390 peripherals), a PC running OS/2 and some driver software; you run VM, VSE, OS/390 or Linux on the card. The prices for the PC Server System/390 are under $10,000 now. There's also the slightly more expensive R/390, which sits in an AIX box. Ten thousand dollars is well within the budget of a smallish software development company. Of course, these boxes won't support a thousand simultaneous machines, but they'd do fifty fairly comfortably (they support 130 or so simultaneous CMS users). In fact, I'm planning to use just such a system to play with a virtual Beowulf cluster, among other things, and maybe experiment with MOSIX, too.
There's also a lot of middle ground between these two ends. In short, it's not much more expensive, from a purely hardware point of view, to put N virtual Linux boxes on a mainframe than it is to simply buy N boxes, and it becomes a good deal cheaper as N increases.
The short answer is, yes, it's Linux. The version I'm running at penguinvm.princeton.edu is 2.2.13. The S/390 patches are already in the stock 2.2.14 source, and IBM's team is hard at work integrating L/390 into 2.3.x. By the time you read this, it should all have been merged fairly seamlessly into the mainline development kernel tree.
Linux on the System/390 is just as much Linux as Linux/PPC, Linux/SPARC, Linux/m68k or Linux/Alpha. The System/390 port is integrated into the main kernel source tree; unlike ELKS or RT-Linux, it's the stock kernel, rather than a subset or an extension. From a user's perspective, nothing differs between running it on a 390 and running it on an x86 machine. The shell works the same way, X applications work the same way, GNOME has been ported (by adding a couple of lines to the configure scripts) and it all just works. There's not much exciting to say, because if it comes with source, the odds are very good you can build and run it with minimal effort. As “Think Blue Linux” gets off the ground, you should be able to install a binary with RPM. At the time of this writing, they had about 420 RPMs available; there should soon be many more.
From an application programmer's perspective, it's about the same. There are a couple of very minor patches for config.sub and config.guess to get configure to recognize the S/390 architecture (these have already been submitted to Ben Elliston at Red Hat, who maintains autoconf). With its next release, configure should support L/390 with no modification. Once the patches have been applied, anything which uses autoconf builds and runs right out of the box. This includes Bochs (an x86 emulator); apparently, some intrepid soul has built Bochs and booted the NT Server, very slowly, under L/390.
The other major issue is endianness. Unlike the x86, S/390 is big-endian. Programs which assume Intel byte order will fail. But if these programs have already been ported to, say, Linux/PPC (another 32-bit, big-endian platform), then those bugs have been crushed. This is not a 390-specific problem as such, but a general portability issue.
At the time of this writing, pthreads support was still buggy in the version of L/390 I was using, which caused odd bugs in VNC and the Hercules emulator; this has apparently already been fixed, but I haven't rebuilt the system with the latest patches. Likewise, there was an optimizer bug in GCC, causing it to generate bad code with -O2 and above; this too has already been fixed, and should no longer be an issue by the time this appears in print.
Deep within the guts of the kernel, things get fairly hairy. IBM has released its kernel code and glibc modifications under the GPL, and they're available from the IBM DeveloperWorks page; it's all there and well-commented. It's fascinating stuff, especially the huge differences between x86 interrupt design and the way S/390 talks to its peripherals, but far beyond the scope of this article.
Networking is handled via either the OSA-2 (Ethernet or token-ring) device driver or point-to-point high-speed channels. (OSA is the ironically named “open systems adapter”: the device driver is OCO—object code only, meaning no source.) If you're running under VM, you can set up something called IUCV (Inter User Communications Vehicle), which basically lets you establish a PPP connection to another virtual machine. If, by the way, you're beginning to get the feeling IBM has its own language, all acronyms, you're absolutely right. What this IUCV means is that it's purely internal—there is no real-world device corresponding to an IUCV adapter. You can also set up a virtual CTCA interface (channel-to-channel adapter). Channels are real-world interfaces, implemented these days over fiber optic cables. IUCV runs at 500MBps, CTCA at 250MBps. Communication between your virtual machines should not be a bottleneck. Writing to IBM urging them to change their OSA-2 OCO policy is unlikely to help, but might make you feel better.
Currently, penguinvm.princeton.edu uses CTCA to talk to the VM hypervisor's TCP/IP stack, but it would also work to set up a Linux virtual machine and use it as a firewall, doing packet filtering and network address translation for the machines behind it. That machine can then talk directly to either the outside world with its OSA-2 interface or VM's stack with a different IUCV/CTCA interface.
Installation is still ugly. Basically, you're not going to get anywhere unless you understand how to install a new operating system on a System/390, and you understand how to install and set up Linux. It's a four-step process:
Build a boot loader image from the supplied files.
Boot the loader.
Dump the file system tar archive onto newly created EXT2 partitions.
Reboot and answer the configuration questions.
The first two steps require S/390 knowledge, the last two Linux knowledge. You may be able to puzzle it out from the documentation, but if you're shaky in one area or the other, it's probably a good idea to find someone who has the other base covered.
Marist College has been at the forefront of Linux/390 development. Marist's page contains the tape and disk images necessary to get started, as well as documentation, and is the place to begin. There is also an extremely active mailing list hosted at Marist, Linux-VM, which is not VM-specific, but is devoted to Linux on the mainframe, with or without VM. The file system layout of the Marist disk is a little weird if you're used to Red Hat or SuSE. This will change as actual distributions get ported.
There is already a distribution for the S/390: “Think Blue Linux” or “Iron Penguin”, based on Red Hat 6.1. At the time of this writing, it is only a downloadable collection of packages and not yet available on CD or tape.
As soon as I get the testbed P/390 I have available in Virginia fired up and networked (which should also have happened long ago by the time you read this), Richard Higson and I will begin the process of porting Debian Potato. I'm expecting it to be fairly easy, since Debian already supports quite a few platforms.
In short, by the time this is published, installation should be much simpler than it currently is. It's not insanely difficult even now, but it is comparable to Linux/x86 in early 1993, when SLS was beginning to get off the ground. It certainly is no Lizard or YAST2, and installation requires you to understand both the System/390 and Linux.
Perhaps I've convinced you Linux/390 is cool, and you'd like to see it for yourself. Now, you probably have a problem: you do not have access to a System/390. There are at least three ways you can play with Linux/390 without already having an actual System/390.
The least interesting option is to do what the IBM guys did initially and download their glibc, gcc and kernel patches, and build L/390 software in a cross-compilation environment hosted under Linux/x86. Sure, this works, but it's not very much fun. For the experience, you need to be running L/390 itself, which requires a System/390.
The expensive option is to buy a mainframe. A minimal P/390 on the used market will set you back less than $10,000. That's not all that much, is it? Okay, so $10,000 might be a little steep for a neat toy.
You'd still like to play, but can't spend ten grand? How does “free” sound, then? Free, as in beer and speech. Roger Bowler has written an intensely cool emulator, Hercules, which runs under Linux and emulates either a System/370 (the previous generation of IBM mainframe) or a System/390; it also emulates many of the more common mainframe peripherals. It's open source (not GPL, but the license is quite reasonable, basically just forbidding commercial use) and very easy to build. The emulator is sufficiently thorough to boot OS/MFT (a simpler IBM mainframe OS, circa 1966) and use MFT to build and boot OS/MVT (MVS's progenitor). Work is being done to bring MVS 3.8 up on it.
More to the point, Hercules emulates a System/390 well enough to boot Linux/390. I've put up a page explaining how; see the Resources section. Currently, it's not yet useful; there is no network device emulation, so getting stuff into and out of the machine is difficult. It can, we think, be done with the supplied tools and tape image files. By the time you read this, someone will probably have figured out how to load the Marist file system onto a 3330 image, which would allow for actual development. Hopefully, it will not take too long to develop enough network support to allow the virtual S/390 to appear on the network, in the same way VMWare machines do.
Be warned: you'll need patience. On my PII-300, Hercules takes close to an hour to boot Linux, although once it's up and running, interactive performance is not actually bad at all (I started playing with Linux on a 4MB 386/25, and Hercules is no more painful than it was). Many people on the Hercules mailing list (sign up from the Hercules home page) are aggressively working on performance-tweaking Hercules, so its speed should increase significantly. The referenced page will contain updates as we turn Hercules into a reasonable development environment for L/390, so check back often.
It's a grand adventure; we're exploring new territory every day. We need your help. Hop aboard, and bring your penguins. The dinosaur doesn't bite. Honest.
Adam J. Thornton has been using Linux since 0.09, making him older than he cares to contemplate. He distinctly remembers thinking SLS was for sissies. When he's not hunched troglodytically in front of his monitor, he enjoys playing with his Greater Swiss Mountain Dog Vinnie, bicycling with his fiancee Amy, and drinking Scotch, but not all at the same time. Write him at firstname.lastname@example.org.