Some Thoughts on the Occasion of the NSA Linux Release

by Earl Boebert

Assurance

When I say "assurance" I mean the process of acquiring confidence that your box isn't going to do something really ugly if you fire it up or put it on your net. Operational, or black-box, assurance is based on the observation that a certain class of boxes hasn't killed anybody yet, so you're probably safe. Reliance on this form of assurance has led to some pretty nasty surprises. Formal, or glass-box, assurance attempts to provide confidence from some combination of design characteristics, analysis and testing. This approach is still prone to nasty surprises, but they tend to be fewer--or at least easier to explain after the fact. Most high assurance work has been done in the area of kinetic devices and infernal machines that are controlled by stupid robots. As information processing technology becomes more important to society, these concerns spread to areas previously thought inherently harmless, like operating systems. Security is the most obvious example, along with availability of service in chaotic or hostile environments.

The NSA release incorporates an idea called Type Enforcement (TE) that Dick Kain and I cooked up over 15 years ago, as part of a project to investigate high assurance systems. It's intended as a design characteristic to support analysis and testing, in aid of assurance. Our retrospective paper [1] covers those aspects, so I'll concentrate here on the short- and long-term development work that I think needs to be done.

Type Enforcement Explained

The shortest explanation I ever gave of TE was in response to a question by Butler Lampson: "It's the Lampson Access Matrix organized into equivalence classes for efficiency." The Lampson Access Matrix is a way of modeling the protection state of a system, to deduce implications of a particular policy. It is a two-dimensional matrix, with active entities (subjects, processes, threads) on one axis and passive entities (objects, files, segments) on the other. The entries in a [row, column] intersection define the operations that active entities can perform on passive ones. The matrix is defined in [2], which is one of the classics of computer security and is still worth study.

So much for theory, now for the implementation. Data objects are assigned an attribute called Type, and processes an attribute called Domain. Conceptually (but not generally in the implementation) there is an internal matrix, one of whose axes is Type and the other Domain. For each (Type, Domain) pair, this matrix defines the allowed accesses: read, write and execute for starters.
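To make the conceptual matrix concrete, here is a toy sketch in Python. Every name in it is invented, and the real implementation is nothing like a dictionary lookup, but the shape of the mediation is the point: subjects collapse into Domains, objects into Types, and access is decided by the pair.

```python
# Toy TE matrix: illustrative only, not the NSA release's data structures.
domain_of = {"httpd": "web_d", "sendmail": "mail_d"}   # process -> Domain
type_of = {"/var/www/index.html": "web_t",
           "/var/spool/mail/ebo": "mail_t"}            # object -> Type

# Allowed operations per (Domain, Type) pair; an absent cell means "no".
TE_MATRIX = {
    ("web_d", "web_t"): {"read"},
    ("mail_d", "mail_t"): {"read", "write"},
}

def allowed(process: str, obj: str, op: str) -> bool:
    """Mediate one access: look up the (Domain, Type) cell, check the op."""
    cell = TE_MATRIX.get((domain_of[process], type_of[obj]), set())
    return op in cell

assert allowed("httpd", "/var/www/index.html", "read")
assert not allowed("httpd", "/var/spool/mail/ebo", "read")   # default deny
```

This is Lampson's matrix with its rows and columns grouped into equivalence classes: two processes in the same Domain are indistinguishable to the policy, which is where the efficiency comes from.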

Two things were left fuzzy at the start and need work: rules for filling out the TE matrix, and the control and mechanization of communication between Domains. The former has been addressed by Dan Sterne and his colleagues in what they call Domain Type Enforcement (DTE), which develops a relationship between the file hierarchy and the matrix. You should understand this work before attempting extensions to the NSA release, just to avoid reinvention. The latter area, inter-Domain interaction, is the one most ripe for innovation.
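To give the flavor of DTE's implicit typing (only the flavor; read Sterne's papers for the real rules), here is a hedged sketch in which a type is bound to a subtree of the file hierarchy and the most specific assignment wins. Paths and type names are invented:

```python
import posixpath

# DTE-flavored implicit typing: bind a type to a subtree; the deepest
# (most specific) assignment wins. All names invented for illustration.
TYPE_ASSIGNMENTS = {
    "/": "root_t",
    "/etc": "conf_t",
    "/var/www": "web_t",
}

def type_of(path: str) -> str:
    """Walk up the hierarchy until an assigned prefix is found."""
    p = posixpath.normpath(path)
    while True:
        if p in TYPE_ASSIGNMENTS:
            return TYPE_ASSIGNMENTS[p]
        if p == "/":
            raise LookupError("no type assigned")
        p = posixpath.dirname(p)

assert type_of("/var/www/htdocs/index.html") == "web_t"
assert type_of("/etc/passwd") == "conf_t"
assert type_of("/usr/bin/ls") == "root_t"    # falls through to the root
```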

What To Do With It (Short Term)

First, the one thing not to do with it: attempt to build a general-purpose, resource managing OS that automatically exports some property (safety, security, whatever) to any and all applications that happen to run on it. That's what the whole Orange Book effort was about. Been there, tried that, never got a T-shirt.

Thinking about TE on and off all these years has led me to a more refined definition: TE is a mechanism for integrating an application and its underlying resource manager in a way that reduces the chances of unintended ugliness. It's really a tool for building high assurance, special-purpose boxes. The way to gain immediate benefit from the NSA release is to use it to build armor-plated web servers, mail servers, database servers, etc. Take the release, strip out every function not needed by the app and bolt the whole thing down using the TE matrix. ("What? No more 'Bug in sendmail allows unauthorized root access'? What's the world coming to?")
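As a toy illustration of what bolting things down buys you, consider a matrix for a hypothetical web server. The domains, types and operations are all invented, but notice what isn't there: no cell lets the daemon modify its own binary or its content, so even a fully compromised httpd has nowhere to go.

```python
# Illustrative lock-down matrix for an armor-plated web server.
WEB_MATRIX = {
    ("httpd_d", "content_t"): {"read"},            # serve pages
    ("httpd_d", "log_t"): {"append"},              # add to logs, never rewrite
    ("httpd_d", "binary_t"): {"execute"},          # run code, never modify it
    ("admin_d", "content_t"): {"read", "write"},   # only admin edits content
    ("admin_d", "binary_t"): {"read", "write"},    # only admin patches code
}

def allowed(domain: str, obj_type: str, op: str) -> bool:
    return op in WEB_MATRIX.get((domain, obj_type), set())

# A subverted daemon still can't trojan its binary or deface the site:
assert not allowed("httpd_d", "binary_t", "write")
assert not allowed("httpd_d", "content_t", "write")
```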

To do this you need to recognize ugly when you see it, and you need to know the structure of the application. The first is a policy question, and I can't help you much there. As far as the second is concerned, TE is a data flow control mechanism, and that's what needs to be defined. One structure TE was aimed at from the beginning is what Dick and I dubbed "assured pipelines". This refers to the case where some function (like crypto) absolutely positively has to happen between a predecessor operation (message prep) and a successor (message release). Finding such functions and laying out the pipelines is a necessary first step to defining the Types, Domains and initial form of the TE matrix.
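Here is what an assured pipeline looks like as matrix entries (all names invented). The useful property is that you can verify, by inspecting the matrix alone, that nothing downstream of message prep reads plaintext except the crypto stage:

```python
# Toy assured pipeline: prep -> crypto -> release. Types carry the data
# between stages; the matrix makes the crypto step impossible to bypass.
PIPELINE_MATRIX = {
    ("prep_d", "plaintext_t"): {"write"},     # prep emits plaintext
    ("crypto_d", "plaintext_t"): {"read"},    # only crypto may consume it
    ("crypto_d", "ciphertext_t"): {"write"},  # crypto emits ciphertext
    ("release_d", "ciphertext_t"): {"read"},  # release sees ciphertext only
}

def readers(wanted_type: str) -> set:
    """All Domains granted read access to a given Type."""
    return {d for (d, t), ops in PIPELINE_MATRIX.items()
            if t == wanted_type and "read" in ops}

# The pipeline property, checkable statically from the matrix alone:
assert readers("plaintext_t") == {"crypto_d"}     # no bypass around crypto
assert readers("ciphertext_t") == {"release_d"}   # release gets crypto output
```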

What To Do With It (Long Term)

The biggest thing that needs finishing, as I mentioned before, is inter-Domain communication. When you work on this, you'll have an opportunity to radically rethink message passing, multithreaded processing, interrupt-driven schedulers and contemporary stack design. The first two are the worst things ever to happen to analytic assurance, and the last is the oldest hoo-ha in OS design.

But first (says he, waiting for the boos and catcalls to subside), let me state that assurance always comes with a cost in performance. The situation is similar to that described by a Saab aerodynamicist in the early days of the JA-37 autopilot project: you can only haul so much energy into combat, and it's the designer's job to distribute it sensibly among the contending requirements of speed, altitude, range and payload. Same thing for cycles and memory. Luckily, today they're dirt cheap, so, for once, let's consider burning them in aid of a box you would trust with human life instead of ever more realistic dancing pigs, okay?

Analytic assurance rests on reasoning based on a description of the system, and the fidelity of the description counts for a lot. Ideally you'd like to reason from object code, but source will work if you pay attention to your compiler. This sort of thing used to be called "static analysis", and people actually built and used tools that helped do it. What you're looking for is the dependency tree, or what Parnas dubbed the "uses" hierarchy: those modules that other stuff depends on and what has to work in order for them to be correct, dependency level by dependency level. Knowing the dependencies is essential to orderly test and integration. You also want to eliminate circularities (A depends on B and B depends on A; longer loops are possible) because they form "lumps" that have to be integrated and tested as a unit. (Exercise for the reader: demonstrate that non-preemptive multiprogramming is a built-in circularity.)
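If you want to play with this, the skeleton of such a tool is small. The sketch below (module names invented) recovers an integration order from a "uses" graph and reports the lump when a circularity is introduced:

```python
# Toy static analysis in the Parnas "uses" spirit: recover the dependency
# levels, or report the circular "lump" that must be tested as a unit.
from graphlib import TopologicalSorter, CycleError

USES = {                      # module -> modules it depends on
    "fs": {"scheduler", "alloc"},
    "scheduler": {"clock"},
    "alloc": set(),
    "clock": set(),
}

def integration_order(uses: dict) -> None:
    try:
        # Leaves first, then level by level up the hierarchy.
        print("integration order:",
              list(TopologicalSorter(uses).static_order()))
    except CycleError as e:
        print("lump found, test as a unit:", e.args[1])

integration_order(USES)          # clean hierarchy: an order exists
USES["clock"] = {"scheduler"}    # A depends on B and B depends on A
integration_order(USES)          # now the lump is reported instead
```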

Message passing and multithreading kill static dependency analysis dead. What you have is a processing model in which a bunch of flyweight processes run around leaving notes under rocks for each other and lifting rocks to read the other guy's note. Maybe. Reading the program text in any form isn't going to tell you, so you have to fire up the old debugger and trace the progress of the code, note by note, rock by rock. Such a task will never yield a dependency tree, but it might drive an unwanted programmer out of your team.

I would suggest that anyone exploring high assurance systems take a look back at large process models in which the program text, and in particular the call tree, is closer to the dependency tree. In the course of such an effort, you might be able to convince me the message passing model was an innovation with a motive behind it other than the desire to be innovative.

Interrupt-driven schedulers finish off any hope of assurance because they make the internal progress of the system a slave to indeterminate external events. When I used to teach this stuff, I used a hypothetical exchange between a user and an OS vendor:

User: "My OS just did something funny."

Vendor: "Try it again."

(Pause) User: "It didn't do it this time."

Vendor: "See?"

Now you don't tell a test pilot who's just punched out after an uncommanded control hard over to repack his parachute and try it again. People want to know what happened, why it happened and why it isn't going to happen again. (Sooner or later this attitude is going to spread outside of kinetic vehicles, and the OS community had better be ready for it.) All three of the above questions are the very devil to answer in an interrupt-driven system. There are alternatives, such as rate structure (an extreme form of round-robin scheduling) and post-and-wait (a hybrid), which are documented on petroglyphs somewhere in the Southwest and deserve to be dug up as starting points for scheduler research.
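For those who have never seen one, here is a toy rate-structure loop. The tasks and the 10 ms slot are invented, but the point is visible: every pass through the frame looks exactly like every other pass, no matter what the outside world is doing, so "try it again" actually means something.

```python
import time

SLOT = 0.010   # seconds per slot; an invented budget each task must fit

def poll_network(): pass   # sample external events; bounded work only
def run_control(): pass    # the real workload
def housekeeping(): pass   # logs, timers, etc.

# The frame is fixed at build time; external events are sampled only when
# the poller's slot comes up, never by preempting whatever is running.
FRAME = [poll_network, run_control, run_control, housekeeping]

def frame_loop(frames: int) -> None:
    """Run a fixed number of frames with identical timing on every pass."""
    for _ in range(frames):
        for task in FRAME:
            start = time.monotonic()
            task()
            slack = SLOT - (time.monotonic() - start)
            if slack > 0:
                time.sleep(slack)   # burn the slack; determinism before speed

frame_loop(frames=3)
```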

Just how improper contemporary schedulers are is made obvious by the very existence of denial of service attacks. I'll accept that you can spray packets at a system that will cause the network interface to be temporarily unavailable. What I will not accept is the notion that any input across any interface should cause an entire OS to fly up its own orifice and disappear in a puff of smoke.

And, finally, the stack mess. What a grim joke this is: the longest-running, dumbest vulnerability in OS history. The idea of mixing data, buffers and control information on the same stack dates back to the early 1960s and Peter Naur's GIER Algol compiler. Those were heady days: Dijkstra had just shown the essential identity of pushdown stacks and the evaluation of infix notation, and when I took a summer course in compiler writing from Naur I thought that the generalized stack (which he called the "display") was the coolest thing ever. Now we know better.

There are several starting points for a fix. The most straightforward is to run separate data and control stacks. Another is to keep only pointers on the stack and allocate all parameter and buffer data out of the heap. The best form of this would be proper descriptors, with base and bounds fields. You could even add an "unavailable" bit and implement true dynamic linking. <sarcasm>Gosh, what an idea! Why, with a little thought you could do run-time software updates!</sarcasm> Or you could take a different leaf from the Multics playbook and run a stack per domain, with specialized "gate" procedures handling the housekeeping. ("What's that I hear?" "It's the script kiddies, sir, wailing their little hearts out.")
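To show how little machinery a proper descriptor needs, here is a toy version. Doing it in software misses the point of hardware enforcement, and everything here is invented for illustration, but the shape is right: references carry base and bounds, overruns trap, and the "unavailable" bit gives you the dynamic-linking hook for free.

```python
from dataclasses import dataclass

HEAP = bytearray(1024)   # all parameter and buffer data lives off the stack

@dataclass
class Descriptor:
    base: int
    bounds: int                # number of addressable bytes
    unavailable: bool = False  # set until the segment is linked in

    def read(self, offset: int) -> int:
        if self.unavailable:
            self._resolve()    # first touch faults to the (stubbed) linker
        if not 0 <= offset < self.bounds:
            raise MemoryError("bounds violation")   # trapped, not exploited
        return HEAP[self.base + offset]

    def _resolve(self) -> None:
        # Stand-in for a dynamic linker fixing up the descriptor on first use.
        self.unavailable = False

buf = Descriptor(base=0, bounds=16, unavailable=True)
buf.read(0)        # triggers the linking stub, then a checked read
try:
    buf.read(16)   # one past the end
except MemoryError as e:
    print(e)       # "bounds violation", not a smashed return address
```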

Software fixes will burn lots of cycles; a real fix will require changes to the processor architecture, and the software fixes will point the way. Unfortunately, the current crop of chip designs accepts the idea that having read access identical to execute access constitutes responsible engineering, which doesn't bode well for their understanding of something as subtle as a proper stack. Developers who want examples of previous work to stimulate thinking (Yes, Virginia, there was Life Before The Web) would do well to study [3].

Coda

Well, that's it. Now you know everything that I do, so I can get back to being retired. I hope this stimulates thought, discussion and, in particular, the questioning of received wisdom about OS structure. Who knows, maybe some high assurance software will come of it. That would definitely make an old man feel good.

References

[1] Boebert, W. E. and Kain, R. Y., "A Further Note on the Confinement Problem", Proc. 30th Annual Carnahan Conference on Security Technology, 1996, pp. 198-203, IEEE 0-7803-3537-6-9/96.

[2] Lampson, Butler W., "Protection", Proc. 5th Princeton Symp. on Information Sciences and Systems, Mar. 1971, pp. 437-443. Reprinted January 1974 in ACM SIGOPS Operating Systems Review 8(1), pp. 18-24.

[3] Kain, R. Y., "Advanced Computer Architecture: A Systems Design Approach", Prentice Hall, ISBN 0-13-007741-0.
