The Lustre Distributed Filesystem
There comes a time in a network or storage administrator's career
when a large collection of storage volumes needs to be pooled together
and distributed within a clustered or multiple client network, while
maintaining high performance with little to no bottlenecks when accessing
the same files. That is where Lustre comes into the picture. The Lustre
filesystem is a high-performance distributed filesystem intended for
large networks and high-availability environments.
The Storage Area Network and Linux
Traditionally, Lustre is configured to manage remote data storage disk
devices within a Storage Area Network (SAN), which is two or more
remotely attached disk devices communicating via a Small Computer System
Interface (SCSI) protocol. Such protocols include Fibre Channel, Fibre Channel
over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI. To better
explain what a SAN is, it may be more beneficial to begin with what it
isn't. For instance, a SAN shouldn't be confused with a Local Area Network
(LAN), even if that LAN carries storage traffic (that is, via networked filesystem shares and so on). Only if the LAN carries storage traffic using
the iSCSI or FCoE protocols can it then be considered a SAN. A SAN
also is not Network Attached Storage (NAS). Again, the
SAN relies heavily on a SCSI protocol, while NAS uses the NFS and
SMB/CIFS file-sharing protocols.
An external storage target device will represent storage volumes as
Logical Units within the SAN. Typically, a set of Logical Units will be
mapped across a SAN to an initiator node—in our case, it would be the
server(s) managing the Lustre filesystem. In turn, the server(s) will
identify one or more SCSI disk devices within its SCSI subsystem and treat
them as if they were local drives. The number of SCSI disks identified
is determined by the number of Logical Units mapped to the initiator. If
you want to follow along with the examples here, it is relatively simple
to configure a couple of virtual machines: one as the server node with
one or more additional disk devices to export and the second to act as
a client node and mount the Lustre-enabled volume. Although it is bad
practice, for testing purposes, it also is possible to have a single
virtual machine configured as both server and client.
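Once Logical Units have been mapped, one rough way to see how many disks the node's SCSI subsystem has picked up is to count the sd* entries under /sys/block. The following is only a sketch and not part of the original setup: the function name count_scsi_disks is made up for this illustration, and sd* also matches local SATA, SAS and USB disks, because Linux routes those through the same SCSI layer.

```shell
# count_scsi_disks [DIR]
# Counts the sd* block devices listed under DIR (default: /sys/block).
# Hypothetical helper for illustration; partitions are not counted because
# they do not appear at the top level of /sys/block.
count_scsi_disks() {
    dir="${1:-/sys/block}"
    count=0
    for dev in "$dir"/sd*; do
        # If nothing matches, the pattern is left as-is; skip it.
        [ -e "$dev" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}

echo "SCSI disks visible on this node: $(count_scsi_disks)"
```

Running it before and after mapping additional Logical Units should show the count rise by the number of units mapped, assuming the initiator has rescanned its SCSI bus.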
SCSI is an ANSI-standardized hardware and software computing
interface adopted by all early storage manufacturers. Revised editions
of the standard continue to be used today.
The Distributed Filesystem
A distributed filesystem allows access to files from multiple hosts
sharing a computer network. This makes it possible for multiple
users on multiple client nodes to share files and storage resources. The
client nodes do not have direct access to the underlying block storage
but interact with it over the network using a protocol. This makes it possible
to restrict access to the filesystem depending on access lists or
capabilities on both the servers and the clients. A clustered filesystem is different: all nodes have equal access to the block storage where the
filesystem is located, so access control must reside on
the client. Another advantage of distributed filesystems is
that they may include facilities for transparent replication
and fault tolerance. So, when a limited number of nodes in a filesystem
go offline, the system continues to work without any data loss.
Lustre (a name derived from Linux and cluster) is one such distributed filesystem,
usually deployed for large-scale cluster computing. Licensed under
the GNU General Public License (or GPL), Lustre provides a solution in
which high performance and scalability to tens of thousands of nodes
and petabytes of storage become a reality, and it is relatively simple to
deploy and configure. Although Lustre 2.0 has been released,
for this article, I work with the generally available 1.8.5.
Lustre has a somewhat unique architecture, with three major
functional units. The first is a single metadata server (MDS) that contains
a single metadata target (MDT) for each Lustre filesystem. The MDT
stores namespace metadata, which includes filenames, directories, access
permissions and file layout. The MDT data is stored in a single disk filesystem mapped locally to the serving node. It is a dedicated filesystem
that controls file access and informs the client node(s) which object(s)
make up a file. Second are one or more object storage servers (OSSes) that
store file data on one or more object storage targets (OSTs). An OST is a
dedicated object-based filesystem exported for read/write operations. The
capacity of a Lustre filesystem is the sum of the
capacities of its OSTs. Finally, there are the clients that access and
use the file data.
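Because usable capacity is simply the sum of the OST capacities, it can be worked out with a few lines of shell arithmetic. This is a minimal sketch: the function name sum_ost_capacity and the OST sizes are hypothetical and chosen only for illustration.

```shell
# sum_ost_capacity SIZE...
# Adds up OST capacities given as whole numbers in the same unit (GB here).
# Hypothetical helper; the MDT is not included because it holds metadata only.
sum_ost_capacity() {
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    echo "$total"
}

# Three example OSTs of 500GB, 500GB and 1000GB.
echo "Filesystem capacity: $(sum_ost_capacity 500 500 1000)GB"
```

With those three example OSTs, the filesystem would offer 2000GB for file data. Adding another OST grows the total by that OST's capacity.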
Lustre presents all clients with a unified namespace
for all of the files and data in the filesystem, which allows concurrent
and coherent read and write access to the files in the filesystem. When
a client accesses a file, it completes a filename lookup on the MDS,
and either a new file is created or the layout of an existing file is
returned to the client. Locking the file on the OST, the client will
then run one or more read or write operations to the file but will not
directly modify the objects on the OST. Instead, it will delegate tasks to
the OSS. This approach ensures scalability and improved security and
reliability, as it does not allow clients direct access to the underlying storage,
which would increase the risk of filesystem corruption from misbehaving or defective clients. Although all three components (MDT, OST and client)
can run on the same node, they typically are configured on separate
nodes communicating over a network (see the details on LNET later in this
article). In this example, I'm running the MDT and OST on a single server node
while the client will be accessing the OST from a separate node.
To obtain Lustre 1.8.5, download the prebuilt binaries packaged in RPMs,
or download the source and build the modules and utilities for your
respective Linux distribution. Oracle provides server RPM packages
for both Oracle Enterprise Linux (OEL) 5 and Red Hat Enterprise Linux
(RHEL) 5, while also providing client RPM packages for OEL 5, RHEL 5
and SUSE Linux Enterprise Server (SLES) 10 and 11. If you will be building
Lustre from source, ensure that you are using Linux
kernel 2.6.16 or greater. Note that in all deployments of Lustre, any
server that runs as an MDS, MGS (discussed below) or OSS must use a
patched kernel. Running a patched kernel on a Lustre client is optional
and required only if the client will be used for multiple purposes,
such as running as both a client and an OST.
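Before building from source, it may help to confirm that the running kernel meets the 2.6.16 minimum. The following is a rough sketch rather than anything shipped with Lustre: kernel_at_least is a hypothetical helper that compares only the first three numeric fields of a release string and ignores everything after the first hyphen.

```shell
# kernel_at_least RELEASE MINIMUM
# Succeeds if RELEASE (for example, the output of uname -r) is at least
# MINIMUM, comparing the first three dot-separated numeric fields.
# Hypothetical helper for illustration only.
kernel_at_least() {
    # Drop the distribution suffix after the first "-" before comparing.
    printf '%s %s\n' "${1%%-*}" "$2" | awk '{
        split($1, have, ".")
        split($2, want, ".")
        for (i = 1; i <= 3; i++) {
            h = have[i] + 0
            w = want[i] + 0
            if (h > w) exit 0
            if (h < w) exit 1
        }
        exit 0
    }'
}

if kernel_at_least "$(uname -r)" 2.6.16; then
    echo "Kernel is 2.6.16 or greater"
else
    echo "Kernel is older than 2.6.16"
fi
```

This checks only the version number. It says nothing about whether the kernel carries the Lustre patches that the server roles require.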
If you already have a supported operating system,
make sure that the patched kernel, lustre-modules, lustre-ldiskfs (a
Lustre-patched backing filesystem kernel module package for the ext3 filesystem), lustre (which includes userspace utilities to configure and run
Lustre) and e2fsprogs packages are installed on the host system, and that
their dependencies are resolved from a local or remote repository. Use
the rpm command to install all necessary packages:
$ sudo rpm -ivh kernel-2.6.18-22.214.171.124.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-modules-1.8.4-2.6.18_126.96.36.199.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-ldiskfs-3.1.3-2.6.18_188.8.131.52.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-1.8.4-2.6.18_184.108.40.206.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh e2fsprogs-1.41.10.sun2-0redhat.oel5.i386.rpm
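To double-check that nothing was skipped, the list of installed package names can be compared against the packages named above. This is a sketch only: missing_packages is a hypothetical helper, and the patched kernel is left out because its package name (kernel) is the same as the stock kernel's.

```shell
# missing_packages NAME...
# Reads installed package names on standard input, one per line, and prints
# every NAME argument that does not appear in that list.
# Hypothetical helper for illustration only.
missing_packages() {
    installed="$(cat)"
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || echo "$pkg"
    done
}

# Example use on an RPM-based system (prints nothing if all are installed):
#   rpm -qa --qf '%{NAME}\n' | \
#       missing_packages lustre-modules lustre-ldiskfs lustre e2fsprogs
```

Any name it prints still needs to be installed before continuing.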
After these packages have been installed, list the boot directory to
reveal the newly installed patched Linux kernel:
[petros@lustre-host ~]$ ls /boot/
config-2.6.18-220.127.116.11.1.el5_lustre.1.8.4
grub
initrd-2.6.18-18.104.22.168.1.el5_lustre.1.8.4.img
lost+found
symvers-2.6.18-22.214.171.124.1.el5_lustre.1.8.4.gz
System.map-2.6.18-126.96.36.199.1.el5_lustre.1.8.4
vmlinuz-2.6.18-188.8.131.52.1.el5_lustre.1.8.4
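The kernel files listed in /boot carry a lustre tag in their names, so after rebooting, a quick look at the running kernel's release string is one way to tell whether the patched kernel was actually booted. The sketch below assumes that the release string reported by uname -r carries the same tag. The helper name is_lustre_kernel is made up for this illustration.

```shell
# is_lustre_kernel RELEASE
# Succeeds if the kernel release string contains "lustre".
# Hypothetical helper; assumes the patched kernel keeps the lustre tag
# in its release string, as the file names in /boot suggest.
is_lustre_kernel() {
    case "$1" in
        *lustre*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_lustre_kernel "$(uname -r)"; then
    echo "Running a Lustre-patched kernel"
else
    echo "Not running a Lustre-patched kernel; check the bootloader default"
fi
```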
Petros Koutoupis is currently a senior software developer at Cleversafe, an IBM Company. He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for more than a decade.