Linux Helps Bring Titanic to Life
Digital Domain is an advanced full-service production studio located in Venice, California. There, we generate visual effects for feature films and commercials as well as new media applications. Our feature film credits include Interview with the Vampire, True Lies, Apollo 13, Dante's Peak and The Fifth Element. Our commercial credits are challenging to count, much less list here (see the web site at http://www.d2.com/). While we are best known for the excellent technical quality of our work, we are also well respected for our creative contributions to our assignments.
The film Titanic (written and directed by James Cameron) opened in theaters December 19, 1997. Set on the Titanic during its first and final voyage across the Atlantic Ocean, this tale had to be recreated on the screen in all the splendor and drama of both the ship and the tragedy. Digital Domain was selected to produce a large number of extraordinarily challenging visual effects for this demanding film.
Digital visual effects are a large portion of our work. For many digital effects shots, original photographic images are first shot on film (using conventional cinematic methods) and then scanned into the computer. Each “cut” or “scene” is set up as a collection of directories with an “element” directory for all the photographic passes that contribute to the final scene. Each frame of film is stored as a separate file on a central file server.
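As a rough illustration of the layout described above, the sketch below builds the path of a single frame file. The directory names, the zero-padded frame number and the file extension are all hypothetical; the article does not give the actual naming scheme.

```python
from pathlib import PurePosixPath


def frame_path(root, shot, element, frame, ext="img"):
    """Return <root>/<shot>/<element>/<element>.<frame>.<ext> (hypothetical scheme)."""
    name = f"{element}.{frame:04d}.{ext}"
    return PurePosixPath(root) / shot / element / name


def element_frames(root, shot, element, first, last):
    """List one file path per frame for the inclusive range first..last."""
    return [frame_path(root, shot, element, f) for f in range(first, last + 1)]


if __name__ == "__main__":
    # Hypothetical shot and element names, for illustration only.
    for path in element_frames("/shows/demo", "shot010", "plate", 1, 3):
        print(path)
```

Keeping one file per frame means any machine that can see the file server can work on any frame independently of the others.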
A digital artist then begins working on the shot. The work may involve creating whole new elements, for example by animating and rendering 3D models, or modifying existing elements, for example by painting out a wire or isolating the areas of interest in the original film.
This work is done at the artist's desktop (often on an SGI or NT workstation). Once the setup for this work is done, the process is repeated for each frame of the shot. This batch processing is done on all the available CPUs in the facility, often in parallel, and requires a distributed file system and a uniform view of the data. A goal of this processing is to remain platform independent whenever possible.
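Because each frame is processed independently, a shot can be cut into frame ranges and handed out to whatever machines are free. The following is a minimal sketch of that idea, not a description of our actual queueing software; the chunk size, the host names and the round-robin policy are assumptions made for illustration.

```python
def split_frames(first, last, chunk_size):
    """Split the inclusive frame range first..last into contiguous chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = []
    start = first
    while start <= last:
        end = min(start + chunk_size - 1, last)
        chunks.append((start, end))
        start = end + 1
    return chunks


def assign_round_robin(chunks, hosts):
    """Hand chunks to hosts in turn (a deliberately simple, hypothetical policy)."""
    assignment = {host: [] for host in hosts}
    for i, chunk in enumerate(chunks):
        assignment[hosts[i % len(hosts)]].append(chunk)
    return assignment


if __name__ == "__main__":
    chunks = split_frames(1, 10, 4)
    print(assign_round_robin(chunks, ["hostA", "hostB"]))
```

A real scheduler would also account for machine load and failed jobs; the point here is only that per-frame work partitions cleanly.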
Finally, once all the elements are created, the final image is “composited”. During this step the individual elements are color corrected to match the original photography, spatially coordinated and layered to create the final image. Again, the setup for compositing work is usually done on a desktop SGI, and the batch processing is done throughout the facility.
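The layering step can be illustrated with the standard "over" operator on premultiplied RGBA pixels. This is a generic textbook sketch, assuming premultiplied alpha and floating-point channel values in the range 0 to 1; it is not a description of the compositing software used on the film.

```python
def over(fg, bg):
    """Composite one premultiplied RGBA pixel over another: fg + (1 - fg_alpha) * bg."""
    keep = 1.0 - fg[3]
    return tuple(f + keep * b for f, b in zip(fg, bg))


def composite(layers):
    """Layer pixels listed from bottom to top into a single pixel."""
    result = layers[0]
    for layer in layers[1:]:
        result = over(layer, result)
    return result


if __name__ == "__main__":
    ocean = (0.0, 0.0, 1.0, 1.0)   # opaque blue background
    smoke = (0.5, 0.0, 0.0, 0.5)   # half-transparent red, premultiplied
    print(composite([ocean, smoke]))
```

In production the same operation is applied to every pixel of every frame, which is why compositing parallelizes across frames as readily as rendering does.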
Since building a full-scale model of the Titanic would have been prohibitively expensive, only a portion of the ship was built full size (by the production staff), and miniatures were used for the rest of the scenes. To this model we added other elements of the scene such as the ocean, people, birds, smoke and other details that make the model appear to be docked, sailing or sunk in the ocean. To this end, we built a 3D model and photographed 2D elements to simulate underwater, airborne and land-based photography.
During the work on Titanic the facility had approximately 350 SGI CPUs, 200 DEC Alpha CPUs and 5 terabytes of disk, all connected by a 100Mbps or faster network.
Our objective is always to create the highest quality images within financial and schedule constraints. Image creation is accomplished in two phases. In the first phase, the digital artist works at an interactive workstation utilizing specific, sophisticated software packages and specific high-performance hardware. During the second phase the work is processed in batch mode on as many CPUs as possible, regardless of their vintage, location or any features intended to enhance interactive performance.
It is difficult to improve on that first, interactive phase. The digital artists require certain packages that are not always available on other platforms. Even if similar packages are available, there is a significant cost associated with interoperating between them.
Another problem is that some of the packages require certain high-end (often 3D) hardware acceleration. That same quality and performance of 3D acceleration may not be available on other platforms.
In the batch-processing phase, improvements are more easily found, since basic requirements are high-bandwidth computation, access to large storage and a fast network. If the appropriate applications are available, we can improve that part of the process. Even in cases where only a subset of the applications are available on a particular platform, using that platform gives us the ability to partition work flow to improve access to resources in general.
We rapidly concluded the DEC Alpha-based systems served our batch-processing needs very well. They provide extremely high floating-point performance in commodity packaging. We were able to identify certain floating-point-intensive applications as port targets. The Alpha systems could be configured with large amounts of memory and fast networking at extremely attractive price points. Overall, the DEC Alpha had the best price/performance match for our needs.
The next question was which operating system to use. We had the usual choices: Windows NT, Digital UNIX and Linux. We knew which programs we needed to run on the systems, so we assembled systems of each type and proceeded to evaluate their suitability for the various tasks we needed to complete for this production.
Windows NT had several shortfalls. First, our standard applications, which normally run on SGI hardware, were not available under NT. Our software staff could port the tools, but that solution would be quite expensive. NT also had several other limitations; it didn't support an automounter, NFS or symbolic links, all of which are critical to our distributed storage architecture. There were third-party applications available to fill some of these holes, but they added to the cost and, in many cases, did not perform well in handling our general computing needs.
Digital UNIX performed very well and integrated nicely into our environment. The biggest limitations of Digital UNIX were cost and lack of flexibility. We would be purchasing and reconfiguring a large number of systems. Separately purchasing Digital UNIX for each system would have been time consuming and expensive. Digital UNIX also didn't have certain extensions we required and could not provide them in an acceptable time frame. For example, we needed to communicate with our NT-based file servers, connect two unusual varieties of tape drives and allow large numbers of users on a single system; none are supported by Digital UNIX.
Linux fulfilled the task very well. It handled every job we threw at it. During our testing phase, we used its ability to emulate Digital UNIX applications to benchmark standard applications and show that its performance would meet our needs. The flexibility of the existing devices and available source code gave Linux a definitive advantage.
The downside of Linux was the engineering effort required to support it. We knew that we would need to dedicate one engineer to support these systems during their setup. Fortunately, we had engineers with significant previous experience with Linux on Intel systems (the author and other members of the system-administration staff) and enough Unix-system experience to make any required modifications. We carefully tested a variety of hardware to make sure all were completely compatible with Linux.
The Linux distribution used was Red Hat 4.1. At that time Red Hat was shipping Linux 2.0.18, which didn't support the PC164 mainboard, so the first thing we had to do was upgrade the kernel. During our testing we tracked down a number of problems with devices and kept up with both the 2.0 and 2.1 series of kernels. We ended up sticking with 2.1.42 with a few patches. We also decided on the NCR 810 SCSI card with the BSD-based driver and the SMC 100Mbps Ethernet card with the de4x5 driver. It turned out to be a very stable configuration, but there was one serious floating-point problem that caused our water-rendering software to die with an unexpected floating-point exception.
This turned out to be a tricky problem to fix, and the fix didn't make it into the kernel sources until 2.0.31-pre5 and 2.1.43. The Alpha kernel contains code to catch floating-point exceptions and to handle them according to the IEEE standard. That code failed to handle one of the floating-point instructions that could generate an exception. As a result, when that case occurred, the application would exit with a floating-point exception. Once the bug was fixed, our applications ran quite smoothly on the Alpha systems.
At this point, the decision was made to purchase 160 433MHz DEC Alpha systems from Carrera Computers of Newport Beach, California. Of those 160 machines, 105 are running Linux and the other 55 are running NT. The machines are connected with 100Mbps Ethernet to each other and to the rest of our facility.
The staff at Carrera was extraordinarily helpful and provided invaluable support for our project. It began at the factory and continued through delivery, ongoing support and repair.
We created a master disk, which we provided to Carrera, along with a single initialization script that would configure the generic master disk to one of the 160 unique personalities by setting up parameters such as the system name and IP address. Carrera built, configured and burned in each machine, then logged in as a special user, causing the setup script to execute. When the script completed, the machine automatically shut down.
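The idea behind the initialization script can be sketched as a function that maps a machine number to its unique settings. The host-name prefix, the base address and the numbering scheme below are hypothetical, and the real script also had to write these values into the system configuration files, which is not shown here.

```python
import ipaddress


def personality(index, prefix="alpha", base_ip="10.1.0.0", count=160):
    """Derive a host name and IP address for machine number `index` (1..count).

    The prefix and base address are placeholders, not the values used in production.
    """
    if not 1 <= index <= count:
        raise ValueError(f"index must be between 1 and {count}")
    hostname = f"{prefix}{index:03d}"
    address = ipaddress.IPv4Address(base_ip) + index
    return {"hostname": hostname, "ip": str(address)}


if __name__ == "__main__":
    for n in (1, 2, 160):
        print(personality(n))
```

Deriving every setting from a single number keeps the master disk identical across all machines, so the only per-machine input at the factory is that number.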
This process made configuring the machines easy for both Carrera and us. When the hosts arrived, we just plugged them in and flipped the switch, and they came up on the network. All 160 machines are housed in a small room at Digital Domain in ten 19-inch racks. They are all connected to a central screen, keyboard and mouse via a switching system to allow an operator to sit in the middle of the room and work on the console of any machine in the room.
Figure 2. Digital Domain Computer Room
The room was assembled in two weeks, including the installation of the electrical, computing and networking systems. The effort put into the initialization script paid off, as it allowed the machines to be dropped in place with relatively little trouble. At that point we began running the Titanic work through the “Render Ranch” of Alphas.
The first part of this work partition was to simulate and render the water elements. We knew that the water elements were computationally very expensive, so this process was one of the major reasons for purchasing the Alphas.
These jobs computed for approximately 45 minutes and then generated several hundred megabytes of image data to be stored on central storage servers. Intermediate data was stored on the local SCSI disk of the Alpha. The floating-point power of the DEC Alpha made jobs run about 3.5 times faster than on our old SGI systems.
As the water rendering completed, the task load then switched to compositing. These jobs were more I/O bound, because they had to read elements from disks on servers spread around the facility and combine them into frames to be stored centrally. Even so, we still saw improvements of a factor of two for these tasks.
We were extremely pleased with the results. Between the beginning of June and the end of August, the Alpha Linux systems processed over three hundred thousand frames. The systems were up and running 24 hours a day, seven days a week. There were no extended downtimes, and many of the machines were up for more than a month at a time.
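To put that figure in perspective, the short calculation below turns it into an average daily rate. It assumes the period ran from June 1 to August 31, 1997 (the year is an assumption), treats three hundred thousand as a lower bound, and spreads the work evenly over the 105 Linux machines, which real job scheduling would not do exactly.

```python
from datetime import date

FRAMES = 300_000             # "over three hundred thousand", so a lower bound
LINUX_MACHINES = 105         # Alpha systems running Linux
START = date(1997, 6, 1)     # assumed start of the period
END = date(1997, 8, 31)      # assumed end of the period

days = (END - START).days + 1
frames_per_day = FRAMES / days
frames_per_machine_per_day = frames_per_day / LINUX_MACHINES

print(f"{days} days, about {frames_per_day:.0f} frames per day, "
      f"about {frames_per_machine_per_day:.0f} per machine per day")
```

Under these assumptions the systems averaged a little over 3,200 frames per day, or roughly 31 frames per machine per day.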
We addressed a number of different problems using a variety of techniques. Some of the problems were Alpha specific, and some were issues for the Linux community at large. We hope that describing these issues will help others in the same position and provide useful feedback for the Linux community.
Hardware compatibility, particularly with Alpha Linux, is still a problem. Carrera was very cooperative about sending us multiple card varieties, so that we could do extensive testing. The range of choices was large enough that we were able to find a combination that worked. We had to pay careful attention to which products we were using, as the particular chip revision made a difference in one case.
The floating-point problem (discussed above) was the toughest problem we had to address. We didn't expect to find this kind of problem when we started the project. This was a long-standing bug that had never been tracked down—we attribute this fact to the relatively small Alpha Linux community.
Linux software for Alpha seems to be less tested than the equivalent software for the Intel processors, which is again a function of the user-base size. The situation was exacerbated by the fact that Alpha Linux uses glibc instead of libc5, which introduced problems in our code and, we suspect, in other packages.
We had a number of small configuration issues with respect to the size of our facility. Most of these were just parameter changes in the kernel, but they took some effort to track down. For example, we had to increase the number of simultaneously mounted file systems (64 was not sufficient). Also, NFS directory reads were expected to fit within one page (4K on Intel, 8K on Alpha); we had to double this number to support the average number of frames stored in a single directory.
Boot management under Alpha Linux was more difficult than we would have liked. We felt the documentation needed improvements to make it more useful. Boot management required extensive knowledge of ARC, MILO and Linux to make it work. ARC requires entering a reasonably large amount of data to get MILO to boot. MILO worked well and provided a good set of options, but we never managed to get soft reboots to operate correctly. We've been working with the engineers at DEC to improve some of these issues.
The weakest link in the current Linux kernel appeared to be the NFS implementation, which caused most of our system crashes. We generally had a large number of file systems mounted simultaneously, and those file systems were often under heavy load. When central servers died or had problems, the Linux systems didn't recover. The common symptoms of these problems were stale NFS handles and kernel hangs. When all the servers were running, the Linux boxes worked correctly. Overall, the NFS implementation worked, but it should be more robust.
The Linux systems worked incredibly well for our problems. The cost benefit was overwhelmingly positive even including the engineering resources we devoted to the problems. Alpha Linux turned out to be slightly more difficult than we first expected, but the state of Alpha Linux is improving very rapidly and should be substantially better now.
Digital Domain will continue to improve and expand the tools we have available on these systems. We are encouraging the development of more commercial and in-house applications for Linux. We are requesting that vendors port their applications and libraries. At this time, the Linux systems are only used for batch processing, but we expect our compositing software to be used interactively by our digital artists. This software does not require dedicated acceleration hardware, and the speed provided by the Alpha processor is a great benefit to productivity.
Feature film and television visual effects development has provided a high-performance, cost-sensitive proving ground for Linux. We believe that the general purpose nature of the platform coupled with commodity pricing gives it wide application in areas outside our industry. The low entry cost, versatility and interoperability of Linux are sufficiently attractive to warrant more extensive investigation, experimentation and deployment. We are currently at the forefront of that development within our industry and hope to be joined shortly by our peers.
Why Risk Linux? A Production Perspective
Currently, Digital Domain's core business is as a premier provider of visual effects creativity and services to the feature film and commercial production industries. As such, we often take a conservative approach to changes in infrastructure and methodologies in order to meet aggressive delivery schedules and the most demanding standards of product quality.
During the course of work on several recent feature film productions, we encountered situations where our installed base of equipment was not adequate to meet changing production schedules and dynamic visual effects requirements (in terms of increasing magnitude of effort and complexity). We needed to meet these challenges head on without impacting the existing pipeline and without creating new methodologies or systems which would require re-engineering or re-training. Alpha Linux helped us overcome these challenges both cost effectively and quickly (a rare combination).
Selecting Linux as part of the production pipeline for the film Titanic required several goals to be met. If we had not met these requirements, it is unlikely we would have been able to deliver sufficient computing resources in a timely fashion to the production. We needed interoperability and, to a certain degree, compatibility with our SGI/Irix-based systems. Interoperability and compatibility with Linux had been demonstrated during a previous effort (Dante's Peak). We ported critical infrastructure elements (to support distributed processing) to the Linux environment in days, not weeks, using existing staff. The developers of these tools were able to rapidly deploy to the Linux environment, demonstrating that we could leverage that environment in short order. We needed performance, as the schedule for the production, as well as the magnitude of the work, implied a 100% or more increase in studio processing capacity. As we had shown that Alpha Linux provided a performance factor of three to four over our SGI systems (see main article), it was possible to deliver that increased level of performance while physically constrained (air, power and floor space) within our current facility.
As to cost effectiveness, we would have needed more than twice as many Intel machines as Alphas to meet our performance goals. SGI was a valid contender, but could not compete on a price per CPU basis. We also needed a viable structure for delivery, installation and support. Carrera Computers had proven their ability to supply and support us in a timely and cost-effective manner prior to this order, and that company continued to provide an extraordinary level of service throughout the Titanic project.
All things considered, this risk paid off in substantial dividends of project quality and time. Because the urgency of the situation demanded that we think “outside the box”, we were able to deliver a superior solution in a framework that was entirely compatible with our normal operating models and that delivered twice the productivity of our previous infrastructure. The satisfaction in this success actually made up for the stress incurred in risking one's job and career.
Wook has been a software engineer for over 20 years, having discovered computers and become a complete geek at the age of 14. He has worked for many companies over those years, finally coming to rest at Digital Domain, where he was considered unfit for the task of software engineering and has been relegated to the position of Director of (Digital) Engineering.