The Power of the Incredible Hulk—the ILM Linux Death Star
“Florian Kainz, who was also behind OpenEXR, wrote our batch scheduler along with a couple others a long time ago for SGI IRIX”, says Hess. “That made use of big iron and desktops. When we started our move to Linux we wanted better resource management.” The first version of that scheduler, ObaQ, divided machines by show, which was not an efficient use of resources.
“Our attempt to replace ObaQ with a centralized resource management system called the IMP Project didn't work out”, says Hess. “We went back to ObaQ, and the Linux port of that took about two weeks. Three or four months ago, Florian decided he was going to fix that so any show could use any machine.” ObaQ is a peer-to-peer (P2P) scheduler system. The advantage of a P2P scheduler is that the failure of a scheduler server won't knock the entire system off-line. The new version, ObaQ2, uses a single machine for global scheduling, but that machine only advises the independent machines running ObaQ, so losing the ObaQ2 server won't bring the entire facility down. There are scheduler system alternatives, such as the popular proprietary product Platform LSF or the open-source Condor and OpenPBS schedulers, but ILM plans to continue to use ObaQ.
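The advisory arrangement can be illustrated with a minimal Python sketch. It is not ObaQ code: the Peer class, the advisor callables and the job names are all hypothetical, and the only idea taken from the description above is that each machine treats the global scheduler's answer as a hint and keeps working from its own queue if the global scheduler is unreachable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical type: an advisor takes a peer name and suggests a job (or None).
Advisor = Callable[[str], Optional[str]]


@dataclass
class Peer:
    """One render machine running its own local scheduler (illustrative only)."""
    name: str
    local_queue: List[str] = field(default_factory=list)

    def next_job(self, advisor: Optional[Advisor]) -> Optional[str]:
        # Ask the global scheduler first, but treat its answer as advice only.
        if advisor is not None:
            try:
                hint = advisor(self.name)
            except ConnectionError:
                hint = None  # global scheduler is down: carry on locally
            if hint is not None:
                return hint
        # Fall back to whatever this peer already has queued.
        return self.local_queue.pop(0) if self.local_queue else None


def healthy_advisor(peer_name: str) -> Optional[str]:
    """Stand-in for a reachable global scheduler."""
    return f"global-job-for-{peer_name}"


def dead_advisor(peer_name: str) -> Optional[str]:
    """Stand-in for a global scheduler that has failed."""
    raise ConnectionError("global scheduler unreachable")
```

With a healthy advisor the peer runs the suggested job; with a dead one it simply drains its local queue, which is the property that keeps one server failure from stopping the facility.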
SGI had added functionality in the IRIX kernel for process monitoring, such as CPU time and temp space. Those values determine how ILM machine time gets charged back by central accounting to projects. ILM discovered that the Linux /proc filesystem either didn't provide all of those statistics or created excessive overhead in gathering them, so Linux couldn't support ObaQ without changes.
“Florian asked me to address some of the Linux kernel issues”, says Hess. “For one thing, Linux provides no way to tell if something is a thread or a process. In ps, every thread shows as a separate process.” Some jobs, such as Mental Ray, can run multiple threads per frame in parallel. Linux top or ps shows each thread using 1GB of RAM, but that is the same shared memory being counted once per thread. Linux also couldn't tell which job was opening temporary files. ObaQ needs to know that in order to clean up temporary files if it kills a job.
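The double-counting problem is easy to see with a few made-up numbers. The Python sketch below assumes each ps-style entry carries a thread-group ID that ties threads back to their owning process; that field is exactly the information Hess says stock Linux did not provide at the time, so the Task record and its values are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Task(NamedTuple):
    pid: int      # ID of the entry as ps would list it (one per thread)
    tgid: int     # assumed thread-group ID: identical for threads of one process
    rss_mb: int   # resident memory reported for this entry, in MB


def naive_total(tasks: Iterable[Task]) -> int:
    """Sum every entry, the way a per-thread listing invites you to."""
    return sum(t.rss_mb for t in tasks)


def memory_by_process(tasks: Iterable[Task]) -> Dict[int, int]:
    """Count memory once per thread group instead of once per thread."""
    usage: Dict[int, int] = {}
    for t in tasks:
        # Threads share one address space, so take the max rather than the sum.
        usage[t.tgid] = max(usage.get(t.tgid, 0), t.rss_mb)
    return usage


# Two threads of one renderer sharing 1GB, plus an unrelated 256MB process.
tasks = [Task(100, 100, 1024), Task(101, 100, 1024), Task(200, 200, 256)]
```

Here naive_total(tasks) reports 2,304MB, while summing the values of memory_by_process(tasks) gives 1,280MB, the figure an accounting system would actually want to charge.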
Hess created a Linux kernel module to trap opens, forks, clones, vforks, exits and renames, to make accurate statistics possible. The kernel module does most of the work, but the hooked calls should ignore any job not being run by ObaQ. Doing that required hacking the kernel. “I used one of the unused bits in the ptrace flag”, says Hess. “Every x86 job has a 32-bit ptrace vector. As of 2.4.20, 10 bits are used to indicate ptrace modes, such as single step. Sometime last year Linux or glibc changed how the ptrace flag works so it clears on fork. I found all places the kernel clears those bits and keep bit 32.” Hess says the OPROFILE feature in the 2.5 kernel has enhanced accounting facilities, so his hack might not be needed in 2.5. Commandeering an unused bit in the ptrace flag was a quick hack to mark jobs as being ObaQ tasks. “This is one of the great things about Linux”, says Hess. “Because we had the source, we could make this change ourselves, and very quickly. No third-party vendor had to be involved to do custom engineering, as in the IRIX case.”
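The bit-marking trick can be modeled in a few lines of Python. This is a simulation of the idea, not the kernel patch: the constant names are invented, the marker is assumed to be the top bit of the 32-bit word (one reading of "bit 32"), and the ptrace modes are assumed to occupy the low 10 bits.

```python
WORD_MASK = 0xFFFFFFFF             # the flag is a 32-bit word
PTRACE_MODE_MASK = (1 << 10) - 1   # assumption: low 10 bits hold ptrace modes
OBAQ_MARK = 1 << 31                # assumption: top bit marks an ObaQ job


def mark_as_farm_job(flags: int) -> int:
    """Set the marker bit on a job started by the scheduler."""
    return (flags | OBAQ_MARK) & WORD_MASK


def flags_after_fork(parent_flags: int) -> int:
    """Clear the ptrace modes on fork, but let the marker bit survive."""
    return parent_flags & ~PTRACE_MODE_MASK & OBAQ_MARK


def is_farm_job(flags: int) -> bool:
    """Check whether a hooked call belongs to a marked job."""
    return bool(flags & OBAQ_MARK)


def on_open(flags: int, path: str, opened: list) -> None:
    """Hypothetical hook: record opens only for marked jobs, ignore the rest."""
    if is_farm_job(flags):
        opened.append(path)
```

Because the child of a marked job inherits the marker while losing the ptrace modes, every process a render job spawns stays attributable to that job, and unmarked processes pass through the hooks untouched.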
“Now that we have all this firepower in the renderfarm it can overwhelm any file server”, says Thompson. “In The Hulk we have these nuclear explosion renders that are really crunchy—causing major grief for us lately. It is easy for an artist to proc-up a render [add more processors to a task] to the point that it brings a file server to its knees. We're doling out 700 times the data we used to!”
ILM uses a Sun T3 disk array to serve NFS. Adopting Linux as an NFS client presented a number of problems when the Linux machines were brought on-line a year and a half ago. Due to a Linux NFS bug that sent UDP packets out of order (fixed in 2.4.18), after a couple of hours the Sun Solaris server would spin up to 100% utilization and be dragged down. Sun came to the rescue with a proprietary Solaris kernel module and IP stack patch to work around the Linux bug.
A nagging issue with running Linux NFS over UDP is the lack of flow control. “When we get into hot spot problems on file servers, the renderfarm makes a denial-of-service attack on our file servers”, says Thompson. “We're going to try TCP NFS on a Linux client again, now that it's a year and a half later. We'll start testing that next week.” TCP adds about 5% overhead.
ILM is scaling up, from about 20TB of file server storage now to double that next year. “For Star Wars Episode III we're going to double the size of our renderfarm”, says Thompson. “We can do that by ordering another 3,000 nodes from RackSaver, but that could destroy our file servers.” Thompson plans to head off that NFS server meltdown by going to a clustered file server, Sistina GFS or something like that. File serving isn't limited to the ILM facility itself.