Non-Child Process Exit Notification Support
Daniel Colascione submitted some code to support processes knowing when
others have terminated. Normally a process can tell when its own child
processes have ended, but not unrelated processes, or at least not
trivially. Daniel's patch created a new file in the /proc directory entry
for each process—a file called "exithand" that is readable by any other
process. If the target process is still running, attempts to read() the
exithand file will simply block, forcing the querying process to wait.
When the target process ends, the
read() operation will complete, and the
querying process will thereby know that the target process has ended.
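As a rough illustration, a watcher built on such a file might look something like the sketch below, written in Python for brevity. The exithand path follows the description above, but the file exists only on a kernel carrying Daniel's patch, and the helper name wait_for_exit and its proc_root parameter are hypothetical. The sketch makes no assumption about what bytes, if any, the read delivers; it simply returns whatever arrives.

```python
import os


def wait_for_exit(pid, proc_root="/proc"):
    """Block until the process identified by pid exits.

    Assumes a kernel carrying the proposed patch, where reading
    <proc_root>/<pid>/exithand blocks until the target terminates.
    """
    path = os.path.join(proc_root, str(pid), "exithand")
    fd = os.open(path, os.O_RDONLY)
    try:
        # No polling loop: the caller sleeps inside read() and is
        # woken only when the read completes.
        return os.read(fd, 128)
    finally:
        os.close(fd)
```

The whole wait is a single open and a single read, which is why the same thing could in principle be done from a shell with a plain file read.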
It may not be immediately obvious why such a thing would be useful. After all, non-child processes are by definition unrelated. Why would the kernel want to support them keeping tabs on each other? Daniel gave a concrete example, saying:
Android's lmkd kills processes in order to free memory in response to various memory pressure signals. It's desirable to wait until a killed process actually exits before moving on (if needed) to killing the next process. Since the processes that lmkd kills are not lmkd's children, lmkd currently lacks a way to wait for a process to actually die after being sent SIGKILL.
Daniel explained that on Android, lmkd currently just keeps checking the /proc directory for the existence of each process it has tried to kill. With the new interface, lmkd could instead simply wait until the read() operation completed, thus saving the CPU cycles needed for continuous polling.
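The polling approach can be sketched roughly as follows, in Python for brevity. This is not lmkd's actual code; the function name, the sleep interval and the proc_root parameter are illustrative assumptions.

```python
import os
import time


def poll_until_gone(pid, interval=0.01, proc_root="/proc"):
    """Return the number of checks made before <proc_root>/<pid> vanished."""
    checks = 0
    while True:
        checks += 1
        if not os.path.exists(os.path.join(proc_root, str(pid))):
            return checks
        # Each pass costs a wakeup, and the interval is a guess: too
        # short wastes CPU, too long delays the next kill.
        time.sleep(interval)
```

Note that a check by number alone cannot distinguish the original process from a newer one that happens to receive the same PID.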
And more generally, Daniel said in a later email:
I want to get polling loops out of the system. Polling loops are bad for wakeup attribution, bad for power, bad for priority inheritance, and bad for latency. There's no right answer to the question "How long should I wait before checking $CONDITION again?". If we can have an explicit waitqueue interface to something, we should. Besides, PID polling is vulnerable to PID reuse, whereas this mechanism (just like anything based on struct pid) is immune to it.
Joel Fernandes suggested, as an alternative, using ptrace() to get the process exit notifications, instead of creating a whole new file under /proc. Daniel explained:
Only one process can ptrace a given process at a time, so I don't like ptrace as a mechanism for anything except debugging. Relying on ptrace for exit notification would interfere with things like debuggers and crash dump collection systems. Besides, ptrace can do too much (like read and write process memory) and so requires very strong privileges not necessary for this mechanism. Besides: ptrace's interface is complicated and relies on repeated calls to various wait functions, whereas the interface in this patch is simple enough to use from the shell.
The issue of PID (process ID) reuse came up again, because it wasn't clear to everyone that a whole new file in the /proc directory was the best way to solve the problem. As David Laight said, Linux used a reference counter on all PIDs, so that any reuse could be seen. He figured the /proc directory should include some way to expose that reference count.
Other operating system kernels have other ways of trying to avoid PID reuse or at least mitigate its downsides. As Joel explained:
If you look at the NetBSD pid allocator you'll see that it uses the low pid bits to index an array and the high bits as a sequence number. The array slots are also reused LIFO, so you always need a significant number of pid allocate/free before a number is reused. The non-sequential allocation also makes it significantly more difficult to predict when a pid will be reused. The table size is doubled when it gets nearly full.
But to this, Daniel replied:
NetBSD is still just papering over the problem. The real issue is that the whole PID-based process API model is unsafe, and a clever PID allocator doesn't address the fundamental race condition. As long as PID reuse is possible at all, there's a potential race condition, and correctness depends on hope. The only way you could address the PID race problem while not changing the Unix process API is by making pid_t ridiculously wide so that it never wraps around.
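Both sides of this exchange can be seen in a toy model. The Python sketch below is not NetBSD's allocator: it uses a fixed-size table that never doubles, and the class name, method names and bit widths are all made up for illustration. It keeps the slot index in the low bits of the PID, a per-slot sequence number in the high bits, and reuses freed slots LIFO as the quote describes.

```python
class ToyPidTable:
    """Toy PID allocator: low bits pick a table slot, high bits count reuses."""

    def __init__(self, slot_bits=3, seq_bits=4):
        self.slot_bits = slot_bits
        self.slot_mask = (1 << slot_bits) - 1
        self.seq_mask = (1 << seq_bits) - 1
        size = 1 << slot_bits
        self.seq = [0] * size          # per-slot sequence number
        self.in_use = [False] * size
        # Stack of free slots: pop() hands out the most recently freed one.
        self.free = list(range(size - 1, -1, -1))

    def alloc(self):
        if not self.free:
            raise RuntimeError("table full (a real allocator would grow it)")
        slot = self.free.pop()
        self.in_use[slot] = True
        return (self.seq[slot] << self.slot_bits) | slot

    def release(self, pid):
        slot = pid & self.slot_mask
        if not self.in_use[slot] or pid >> self.slot_bits != self.seq[slot]:
            raise ValueError("stale or unknown pid")
        self.in_use[slot] = False
        # Bump the sequence so the next PID from this slot differs,
        # until the counter wraps around.
        self.seq[slot] = (self.seq[slot] + 1) & self.seq_mask
        self.free.append(slot)
```

With slot_bits=2 and seq_bits=2, repeatedly allocating and releasing hands out 0, 4, 8, 12 and then 0 again. A stale PID is caught while the sequence number still differs, but once the counter wraps, the old number names a new process, which is the race Daniel says no allocator can remove.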
Elsewhere, Aleksa Sarai was still unconvinced that a whole new file in the /proc directory would be a good thing, if there were a way to avoid it. Aleksa understood that Daniel wanted to avoid continuous polling, but felt there were still workable alternatives. For example, Aleksa said, "When you open /proc/$pid, you already have a handle for the underlying process, and you can already poll to check whether the process has died (fstatat fails for instance). What if we just used an inotify event to tell userspace that the process has died -- to avoid userspace doing a poll loop?"
Daniel replied that Aleksa's solution was far more complicated than Daniel's. He said that inotify and related APIs were:
...intended for broad monitoring of system activity, not for waiting for some specific event. They require a substantial amount of setup code, and since both are event-streaming APIs with buffers that can overflow, both need some logic for userspace to detect buffer overrun and fall back to explicit scanning if that happens. They're also optional part of the kernel.
Daniel went on:
Given that we *can*, cheaply, provide a clean and consistent API to userspace, why would we instead want to inflict some exotic and hard-to-use interface on userspace instead? Asking that userspace poll on a directory file descriptor and, when poll returns, check by looking for certain errors (we'd have to spec which ones) from fstatat is awkward. /proc/pid is a directory. In what other context does the kernel ask userspace to use a directory this way?
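To make the comparison concrete, here is a rough sketch of the directory-handle check Aleksa described, in Python for brevity. The helper names and the choice of the stat entry as a probe are assumptions, and the notification step (an inotify event or a poll on the directory) is left out entirely.

```python
import os


def open_proc_handle(pid, proc_root="/proc"):
    """Open the per-process directory; the descriptor serves as a handle."""
    return os.open(os.path.join(proc_root, str(pid)),
                   os.O_RDONLY | os.O_DIRECTORY)


def handle_looks_alive(dir_fd, probe="stat"):
    """Probe an entry relative to dir_fd, as fstatat() would in C."""
    try:
        os.stat(probe, dir_fd=dir_fd)
        return True
    except OSError:
        # Which errno signals "process gone" is unspecified in the
        # thread, so this sketch treats any failure as exit.
        return False
```

The caller learns of the exit only indirectly, by interpreting a failed lookup, which is the awkwardness Daniel objects to.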
The debate went on, with no resolution on the mailing list. Daniel continued to insist that his approach was simpler than any of the proposed alternatives, and he also felt it was in keeping with the spirit of UNIX itself. At one point, he explained:
The basic unix data access model is that a userspace application wants information (e.g., next bunch of bytes in a file, next packet from a socket, next signal from a signal FD, etc.), and tells the kernel so by making a system call on a file descriptor. Ordinarily, the kernel returns to userspace with the requested information when it's available, potentially after blocking until the information is available. Sometimes userspace doesn't want to block, so it adds O_NONBLOCK to the open file mode, and in this mode, the kernel can tell the userspace requestor "try again later", but the source of truth is still that ordinarily-blocking system call. How does userspace know when to try again in the "try again later" case? By using select/poll/epoll/whatever, which suggests a good time for that "try again later" retry, but is not dispositive about it, since that ordinarily-blocking system call is still the sole source of truth, and that poll is allowed to report spurious readability. This model works fine and has a ton of mental and technical infrastructure built around it. It's the one the system uses for almost every bit of information useful to an application.
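The pattern Daniel describes can be sketched in a few lines of Python. The function name and the timeout parameter are illustrative assumptions; the descriptor could be a pipe, a socket or anything else that supports poll().

```python
import os
import select


def read_when_ready(fd, nbytes=4096, timeout_ms=1000):
    """Read from a non-blocking descriptor, using poll() only as a hint."""
    os.set_blocking(fd, False)
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    while True:
        try:
            # The read itself is the source of truth.
            return os.read(fd, nbytes)
        except BlockingIOError:
            # EAGAIN: "try again later". poll() suggests when, but a
            # spurious wakeup just sends us around the loop again.
            if not poller.poll(timeout_ms):
                raise TimeoutError("descriptor not readable in time")
```

Because the loop always goes back to read(), a poll() that reports readiness too eagerly costs one extra iteration and nothing more.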
The opposition to Daniel's patch seems to emanate from the desire to avoid adding new files to /proc. There's a real risk of /proc, and other kernel interfaces, growing bloated, overly complex and unmaintainable over time. Linus Torvalds and other top contributors want to avoid this, especially since it is very difficult to remove interfaces once they are implemented. Once user software starts to rely on a given interface, there's a great reluctance in Linux to break that software. One reason for this is that not all software is open source, and older closed-source tools may not be maintained, and thus may not have the option to adapt to any new interface. A change in something they rely on may mean the software simply can't be used with newer kernels. The kernel developers want to avoid that situation if at all possible.
It's unclear whether Daniel's patch will go into the tree in its current form, given the opposition. It may be that user code—the Android OS in this case—for now will have to continue to use other, more complicated ways of knowing when processes have died.
Addendum from Daniel Colascione (January 10, 2019)
Thanks for writing about this work! I just want to add that the project is ongoing and that I plan to refresh my non-child wait work after Christian Brauner's pidfd_kill patches land. My current thinking is that a system call returning an exit handle might be a viable alternative to a new readdir-visible proc file.
One unresolved difficulty is figuring out who should be able to read a process's exit status, as discussed in the thread. Do just parents have access? All processes, as apparently in FreeBSD? Same user only? Root?
Still, the general idea is that you should be able, somehow, to get a file descriptor from which you read(2) a siginfo_t containing exit status (like for waitid(2)), and I'm looking forward to adding this capability to Linux one way or another.