Making Inodes Behave
It started with a phone call. Could we build a Linux box that would mount and read a DEC drive and make the data available to NT workstations via Samba? The answer was “I'll have to get back to you.” Because the caller was General Dynamics (the company that makes most of the US Navy's nuclear subs, among many other things) and we are a for-profit Linux company, I planned to get back to them as quickly as possible.
Cosmos Engineering Company has been building custom computer systems for industry since 1984. In 1996 we focused exclusively on building Linux systems. That same year we introduced Linux on a Disk. The next year we became founding Red Hat Hardware Partners. By the time General Dynamics came to us, we had been solving problems with Linux for a few years. This was right up our alley.
The DEC drives were Quantum 9GB SCSI-2 drives with one OSF partition formatted with a UFS file system. Although I wasn't sure of that at the time, I did know that the Linux kernel hackers had been gobbling up new file formats and partition types faster than Carlie gives out free drink tickets at a Linux Journal party. For example, between 2.2.15 and 2.3.99pre9 the number of supported foreign partition types went from three to 15. UFS has been in the kernel since 2.0.XX. But then there are different “flavors” of UFS, aren't there? The number of those has also been growing in the march towards 2.4.0-use-me-1.
So I checked with the latest experimental kernel tree, which at the time was 2.3.99pre9. Yes, there was support for seven different UFS types now, although DEC was not specifically mentioned, and support for the OSF partition had just been added. On the strength of this I got back to John Loeffler of General Dynamics Electronic Systems with a tentative yes, and, “Would it be possible to get a sample drive?” so that we could give them a definitive answer. One FedEx shipment later, I was building the newest experimental kernel with all of the partition types and file systems enabled that might even think about working.
As I suspected from the beginning, mounting the drive as an OSF partition type with UFS type SUN file system read only appeared to work. The sample disk had three files on it: rc.local, hosts and a rather large tar ball—oilpatch.tar. We faxed them the contents of the two short files as they requested, and as a result Cosmos Engineering Company was able to get a contract to build a series of Cosmos 500 Linux Servers customized to the task at hand.
That task was to build Linux Servers that could each support four 9gig SCSI DEC OSF drives and allow network attached NT workstations to read the data. This task was then expanded to include the ability to delete files and directories. Okay, no problem, kernel 2.4.0-test1 has experimental write support for UFS. The hardware itself was nothing special. Each server was a 500MHz Pentium III, built around the ASUS i440BX ATX motherboard with 128MB Ram in a mid-tower ATX case. Red Hat Linux 6.1 was installed on the 18GB IBM 2Ultra SCSI system drive. What made this system special was its ability to read the foreign disk format. That capability called for the creation of a special kernel.
The paperwork involved in doing a deal with a company used to subcontracting parts for a destroyer was, well, impressive. We'd never done a contract that involved so many government regulations. But then again, we'd never done a contract with a separate boilerplate stipulating how cost overruns above $500,000 should be handled.
The first sample disk, and a number of others that followed, quite obviously had dummy data on them. We never knew what the real data was. Whatever it is, there must be a lot of it, and they must have wanted badly to keep the data transfer operation in-house to go the route they did. We did know that the end user was the US government, and it did come out in the course of the work that that meant the military. Later, while I was helping them to troubleshoot some SCSI termination problems, I was told that the SCSI drives were enclosed in canisters that plugged into ruggedized quad bays that “We use in everything.” “Everything” in this case being nuclear subs destroyers and other weapons systems. I also knew that with the persecution of Dr. Lee moving ahead full throttle, and with a notebook hard drive full of nuclear secrets turning up missing at Los Alamos, there was a lot of concern in the government with securing the nation's military data. Maybe that had nothing to do with the implied mass data migration. That is just pure speculation. I don't know what our servers are being used for and furthermore, I don't want to know. In talking to them I always had the feeling that they could tell me but then they would have to, well, you know.
With the contract signed we started to build and test the first of the servers. The project hit a snag when we discovered that there was a problem copying long files from the DEC drives. Files were being truncated after 96KB. That wouldn't do. Why 96KB? That's what I wanted to know.
Further features of this problem were that long files, for example the kernel tree tar ball, could be copied to the DEC drive and then copied back intact in that session. But if the system was shut down and brought back up, and then the kernel tar ball was copied back from the DEC drive, the results would be of the right length but the contents garbage. I suspect it had to do with how inodes are managed in memory and stored on disk.
This began my long descent into Inode Land, and that is what this story is really about. Previously I'd heard about inodes. I knew they were disk structures related to files. I knew I always needed to have enough free ones. Ever get a “no space on the device” message with 4GB free? No free inodes. Inodes were largely a mystery to me. In the next several weeks I learned more about inodes than I'll ever remember.
Inodes are little data structures that define files. In the ext2 file system an inode is 128 bits long. Conveniently, all the stuff that follows applies to both the Linux ext2 file system and the DEC UFS file system. An inode is a table of information including the file's owner, file permissions, create and modify data, file size and most importantly, the location of the actual file data. Everything, in fact, but the file name. The file name is quite incidental to the file. It's there for our benefit. It's a directory entry. Say you want to hear a song. So you double click The_Train_They_Call_the_City_of_New_Orleans.mp3. That's a directory entry that points to an inode that knows where that music lives and who gets to play it. There could be other directory entries for the same inode. This is what is called a hard link. A file can have more than one name but only one inode. That is why Sun folks say, “The Inode is the File.” Every file, including devices, directories and soft links, requires an inode.
Inodes are a limited resource. When you create a file system, you determine the number of inodes you create, and that's that. If you run out of inodes because you put lots of tiny files on a partition with a small inode table, you are SOL even if you have gigibytes free. You may be able to fool a system that has run out of disk space on a particular partition with a symbolic link to a subdirectory on another partition. But you won't be able to fool one that has run out of free inodes, at least not until you delete something and free up an inode. Soft links, being files that point to other directory entries, require an inode.
Inodes are data structures, but they are usually stored on a disk and read into memory for reference and modification.
Looking at the inode data structure it's easy to see why 96KB is important. The first block of data in the inode has to do with time, date and file ownership, etc. Then, at offset 0x28, we get to what I think of as the “good stuff”--the location of the file's data on the disk. Oh, did I mention that there are different types of inode structures for different types of files, and we are currently discussing only the structure of inodes for DEC ufs regular files?
In the ext2 or DEC UFS 32-bit file system data structure, a four-byte word points to one of many possible data blocks in the file system. So the first 48-bytes of this section consists of 12 four byte pointers to the first 12 blocks of the files data. In the case of this file system, each data block is 8KB long. These are called the direct blocks, because their address can be stored directly in the inode. This also means that small files, 96KB or smaller, can be accessed faster. For larger files the 13th address block in the inode points not to another 8KB block of data, but to an 8KB block of addresses, or 2048 addresses of other 8KB data blocks. This is called an indirect block. For even larger files the 14th address block in the inode points to an 8KB block with 2048 addresses of 8KB blocks, each of which has 2048 addresses to data. This is called a double indirect block. This allows for some very big files indeed.
My problem was that Linux could only read the direct blocks of the DEC file's data. Attempts to read files that required the use of indirect block addressing failed with an “attempt to access beyond end of device” error.
Two things were required to fix this problem: 1) a clear understanding of the structure of the data on the disk, most importantly the inode structure, and how the address information is stored; and 2) a clear understanding of what the operating system is doing with the data it finds on the disk, most importantly how it reads an inode. I had neither of these.
However, I did have the help of the Linux community. Early on I told members of my user group, Linux Users Los Angeles, about the work we were doing and the challenge we faced. A number of people offered insight and suggestions. Christopher Smith went much further. He sent me an e-mail, “If you want to drop off one of these drives and a SCSI controller, I'll try futzing with it during my copious amounts of free time.” I brought him over a prototype server and a copy of one of the sample drives we had received from General Dynamics.
His assistance helped immeasurably in clarifying the problem and how to approach it. Dan Kegel suggested I post to the linux-kernel list. This gave me the courage to do so. He also found the linux-fsdevel list for me. I got a lot of help from people on both those lists.
Peter Swain wrote, it “looks like your indirect block handling is twisted, either through endianness or 64bitness issues. DEC might well have a nonconforming implementation there, but you should conform to it, at least as a mount option.” Jim Nance also thought, “It sounds like some sort of 32/64 bit problem, or perhaps a byte ordering problem of some sort.” He also suggested we investigate with the user-mode port of the kernel and sent us information on how to get it. Peter Rival of Tru64 QMG Performance Engineering suggested we contact Marcus Barrow of Mission Critical Linux, whom he described as “the UFS engineer du jour here until he went to MCL”.
Marcus Barrow responded to my e-mail by saying, “I'd be happy to take a look at this.” He warned “Avoid writing to all three of your disks with Linux. Your file system might be becoming corrupted.” We were already using cloned drives. He wrote extensively on the problem, saying
Firstly I thought the problem might be dealing with the 8K/1K block/fragment issues. On second thought, I'm more confused. I don't know why you could read the direct blocks but not the indirect blocks. Particularly when Linux can read its own indirect blocks.
Lyle Seaman pointed out that, “Your task would be much simpler if you had the relevant files from /usr/include/fs/* for that OS....They must be in somebody's museum somewhere.” The support we got from Linux users and developers on these two lists was critical to getting this job done.
We also had the kernel source code, which is precisely what made this trip possible. Free software means knowing what the system is doing. It also means being able to modify it. I printed out a ream of files like linux/fs/ufs/inode.c and known associates, and hit the books. Along the way I found some very good material in the Open Source community explaining file systems and the software that loves them. Two tutorials on the Linux file system I'll mention here are
“Design and Implementation of the Second Extended File System” by Rémy Card, Laboratoire MASI—Institut Blaise Pascal, e-mail: email@example.com; Theodore Ts'o, Massachusetts Institute of Technology, e-mail: firstname.lastname@example.org; and Stephen Tweedie, University of Edinburgh, e-mail: email@example.com khg.redhat.com/HyperNews/get/fs/ext2intro.html.
“The Linux Kernel” by David A. Rusling (firstname.lastname@example.org), Chapter 9: The File System, www.linuxdoc.org/LDP/tlk/tlk.html.
So at least in theory I have an understanding of what the OS is doing because I had the source code I was running and the means to modify it and create new binaries. Ordinary stuff for a Linux system.
To fulfill understanding of how the file system is structured, I had to be able to read the data on the disk and understand how it was organized. I needed to see how the inode on the target drive was really structured. I needed to read the inode and see if, and how, the indirect block pointed to the file's missing data.
By then inodes had become very real to me. I devised a plan to trap one, namely the inode for the misbehaving oilpatch.tar. Because I could mount the partition and then remount it read/write with the experimental write code in the kernel, I could change the ownership of oilpatch.tar. So I did.
I knew my area of interest would be near the beginning of the drive, so I copied that area to a file with a command like:
dd if=/dev/sda of=root-patch bs=1k count=100000
Then I mounted the DEC partition and changed the owner of oilpatch.tar from root (ID 0) to nobody (ID 99), unmounted it and made another file with the command:
dd if=/dev/sda of=nobody-patch bs=1k count=100000Taking the difference of these two files reveals a byte value change from 0 to 99 at a certain offset in the files. The four-byte file owner ID # is known to be at a certain offset into the inode, so plugging the proper numbers into a dd command like:
dd if=/dev/sda of=copy_of_inode_for_oilpatch.tar bs=128 count=1 skip=offset_from_frontwrites the inode to a file. Now I could examine the inode in some detail. I could print it out, read its hidden messages and find the files data. Fortunately, oilpatch.tar turned out to be a text file of predictable construction. Blocks of text fit together like parts of a jigsaw puzzle, and it's pretty apparent when you have the right order.
What I found was that the first 12 pointers pointed to the first 12 blocks of the file, and the first four bytes of the block pointed to the 13th pointer which pointed at the 13th block of the file. The next four bytes of that block of pointers pointed to the 14th block of the files data and so on.
And there's nothing strange about how it worked at all. No weird “big-endian, little-endian” problems. No unexpected bit shifts at all. All very neat, very straightforward. The way I would have done it. The same way Linux does it, in fact, in the ext2 file system. And the same way the kernel UFS code was expecting it to be, which didn't help to explain why it wasn't working.
I can spend a lot of time looking for a problem where it isn't. Slowly, I came to the conclusion that the problem wasn't in the kernel UFS code at all. The inode was being read as it should. The DEC UFS should read as the ufs_type=sun. In fact, the UFS code had been around for some time, so it was hard to believe it was broken. The problem lay at a deeper level: in the communication between the Linux virtual file system (VFS)--in the by then 2.4.0-test1 kernel and the UFS code, caused by changes in the upper-level code. The pointer to the first indirect block was not being passed properly to the VFS, and this broke the UFS code.
The test of this thesis also became the solution to my problem. I was using first the 2.3.99, and later, the 2.4.-test series kernel because it had support for the OSF partition lacking in the 2.2 production series kernels. If I was right, and the UFS code had only recently been broken, the 2.2 kernel should be able to read the DEC drive with no problem provided it could also handle the partition type.
Well, the back port of the OSF partition code from the 2.4 kernel source to 2.2.16 was easy, even for me, and the resulting 2.2.16 kernel did everything right with the DEC OSF drives, including read and write big files all the livelong day. Since 2.2.16 is a better choice in a production environment, and I had a working solution for this job, we were able to move ahead and complete the work.
We started shipping machines with the hacked 2.2.16 kernel to General Dynamics, and its client, and they are doing whatever they are doing with them. I reported this bug in the 2.4.0-test1 kernel to the linux-kernel and linux-fsdevel lists. I said, “Unfortunately I've found a bug.” Alan Cox wrote back, “Uncovering bugs is good news. The chances are, this one wouldn't have been found before 2.4.0 otherwise.”
That was pretty much the story. Later, when the first systems got on-site, there was a problem with the Windows 2000 workstations mangling long file names, but the Samba people have stayed on top of Microsoft's latest tricks, so an upgrade to the newest Samba release solved that problem.
They also needed to have a static relationship between DEC drives with specific SCSI IDs and mount points regardless of the number of DEC drives installed. Since Linux likes to assign SCSI hard drive devices sda, sdb, sdc and in the order it finds them, a dynamic approach to drive mounting had to be created.
The original Tekram controllers weren't entirely happy driving the external ruggedized qual bays and got replaced by Adaptec.
Those are the normal hiccups in a job like this. The real task here had been making those inodes behave.