Virtualizing the Clock
Dmitry Safonov wanted to implement a namespace for time information. The twisted and bizarre thing about virtual machines is that they get more virtual all the time. There's always some new element of the host system that can be given its own namespace and enter the realm of the virtual machine. But as that process rolls forward, virtual systems have to share aspects of themselves with other virtual systems and the host system itself—for example, the date and time.
Dmitry's idea is that users should be able to set the day and time on their virtual systems, without worrying about other systems being given the same day and time. This is actually useful, beyond the desire to live in the past or future. Being able to set the time in a container is apparently one of the crucial elements of being able to migrate containers from one physical host to another, as Dmitry pointed out in his post.
As he put it:
The kernel provides access to several clocks: CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the start points for them are not defined and are different for each running system. When a container is migrated from one node to another, all clocks have to be restored into consistent states; in other words, they have to continue running from the same points where they have been dumped.
Dmitry's patch wasn't feature-complete. There were various questions still to consider. For example, how should a virtual machine interpret the time changing on the host hardware? Should the virtual time change by the same offset? Or continue unchanged? Should file creation and modification times reflect the virtual machine's time or the host machine's time?
Eric W. Biederman supported this project overall and liked the code in the patch, but he did feel that the patch could do more. He thought it was a little too lightweight. He wanted users to be able to set up new time namespaces at the drop of a hat, so they could test things like leap seconds before they actually occurred and see how their own projects' code worked under those various conditions.
To do that, he felt there should be a whole "struct timekeeper" data structure for each namespace. Then pointers to those structures could be passed around, and the times of virtual machines would be just as manipulable and useful as times on the host system.
In terms of timestamps for filesystems, however, Eric felt that it might be best to limit the feature set a little bit. If users could create files with timestamps in the past, it could introduce some nasty security problems. He felt it would be sufficient simply to "do what distributed filesystems do when dealing with hosts with different clocks".
The two went back and forth on the technical implementation details. At one point, Eric remarked, in defense of his preference:
My experience with namespaces is that if we don't get the advanced features working there is little to no interest from the core developers of the code, and the namespaces don't solve additional problems. Which makes the namespace a hard sell. Especially when it does not solve problems the developers of the subsystem have.
At one point, Thomas Gleixner came into the conversation to remind Eric that the time code needed to stay fast. Virtualization was good, he said, but "timekeeping_update() is already heavy and walking through a gazillion of namespaces will just make it horrible."
He reminded Eric and Dmitry that:
It's not only timekeeping, i.e. reading time, this is also affecting all timers which are armed from a namespace.
That gets really ugly because when you do settimeofday() or adjtimex() for a particular namespace, then you have to search for all armed timers of that namespace and adjust them.
The original posix timer code had the same issue because it mapped the clock realtime timers to the timer wheel so any setting of the clock caused a full walk of all armed timers, disarming, adjusting and requeing them. That's horrible not only performance wise, it's also a locking nightmare of all sorts.
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
So, there clearly are many nuances to consider. The discussion ended there, but this is a good example of the trouble with extending Linux to create virtual machines. It's almost never the case that a whole feature can be fully virtualized and isolated from the host system. Security concerns, speed concerns, and even code complexity and maintainability come into the picture. Even really elegant solutions can be shot down by, for example, the possibility of hostile users creating files with unnaturally old timestamps.
Note: if you're mentioned above and want to post a response above the comment section, send a message with your response text to [email protected]