The Lustre Distributed Filesystem

There comes a time in a network or storage administrator's career
when a large collection of storage volumes needs to be pooled together
and made available to a cluster or to multiple clients over a network,
while maintaining high performance and few or no bottlenecks when
accessing the same files. That is where Lustre comes into the picture. The
Lustre filesystem is a high-performance distributed filesystem intended
for large network and high-availability environments.

The Storage Area Network and Linux

Traditionally, Lustre is configured to manage remote data storage disk
devices within a Storage Area Network (SAN): two or more remotely
attached disk devices communicating via a Small Computer System
Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel
over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI. To better
explain what a SAN is, it may be more beneficial to begin with what it
isn't. For instance, a SAN shouldn't be confused with a Local Area Network
(LAN), even if that LAN carries storage traffic (that is, via networked
filesystem shares and so on). Only if the LAN carries storage traffic using
the iSCSI or FCoE protocols can it be considered a SAN. Another
thing that a SAN isn't is Network Attached Storage (NAS): a
SAN relies heavily on a SCSI protocol, while NAS uses the NFS and
SMB/CIFS file-sharing protocols.

An external storage target device presents storage volumes as
Logical Units within the SAN. Typically, a set of Logical Units is
mapped across the SAN to an initiator node, which in our case is the
server(s) managing the Lustre filesystem. In turn, the server(s) will
identify one or more SCSI disk devices within its SCSI subsystem and treat
them as if they were local drives. The number of SCSI disks identified
is determined by the number of Logical Units mapped to the initiator. If
you want to follow along with the examples here, it is relatively simple
to configure a couple of virtual machines: one as the server node with
one or more additional disk devices to export and the second to act as
a client node and mount the Lustre-enabled volume. Although it is bad
practice, for testing purposes, it also is possible to have a single
virtual machine configured as both server and client.
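
Before configuring anything Lustre-specific, it's worth confirming that
the server node actually sees its additional disk device(s) in the SCSI
subsystem. A quick check might look like this (the device name shown is
only an example and will vary with your setup):

# List the SCSI devices known to the kernel:
$ cat /proc/scsi/scsi

# List all block devices and partitions; the disk(s) to be
# dedicated to Lustre (e.g., /dev/sdb) should appear here:
$ cat /proc/partitions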

SCSI

SCSI is an ANSI-standardized hardware and software computing
interface adopted by all early storage manufacturers. Revised editions
of the standard continue to be used today.

The Distributed Filesystem

A distributed filesystem allows access to files from multiple hosts
sharing a computer network. This makes it possible for multiple
users on multiple client nodes to share files and storage resources. The
client nodes do not have direct access to the underlying block storage
but interact with it over the network using a protocol, which makes it
possible to restrict access to the filesystem depending on access lists or
capabilities on both the servers and the clients. This is unlike a
clustered filesystem, where all nodes have equal access to the block
storage on which the filesystem resides and where access control must,
therefore, reside on the client. Another advantage of distributed
filesystems is that they may provide facilities for transparent replication
and fault tolerance, so that when a limited number of nodes in a filesystem
goes off-line, the system continues to work without any data loss.

Lustre (or Linux Cluster) is one such distributed filesystem,
usually deployed for large-scale cluster computing. Licensed under
the GNU General Public License (GPL), Lustre provides high performance
and scalability to tens of thousands of nodes and petabytes of storage,
while remaining relatively simple to deploy and configure. Although
Lustre 2.0 has been released, for this article, I work with the
generally available 1.8.5.

Lustre has a somewhat unique architecture, with three major
functional units. First is a single metadata server, or MDS, that contains
a single metadata target, or MDT, for each Lustre filesystem. This
stores namespace metadata, which includes filenames, directories, access
permissions and file layout. The MDT data is stored in a single disk
filesystem mapped locally to the serving node; this dedicated filesystem
controls file access and informs the client node(s) which object(s)
make up a file. Second are one or more object storage servers (OSSes) that
store file data on one or more object storage targets (OSTs). An OST is a
dedicated object-based filesystem exported for read/write operations. The
capacity of a Lustre filesystem is the sum of the total
capacities of the OSTs. Finally, there are the client(s) that access and
use the file data.
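
To make this concrete, here is a minimal sketch of how the targets might
be formatted and started on the server node with the mkfs.lustre utility.
The filesystem name (testfs), hostname (lustre-host) and device names
(/dev/sdb, /dev/sdc) are placeholders for this sketch, not values taken
from the article:

# Format the metadata target; --mgs also places the management
# server (MGS, discussed below) on this node:
$ sudo mkfs.lustre --fsname=testfs --mgs --mdt /dev/sdb

# Format an object storage target, pointing it at the MGS node:
$ sudo mkfs.lustre --fsname=testfs --ost --mgsnode=lustre-host@tcp0 /dev/sdc

# The targets come on-line when mounted locally on the server:
$ sudo mkdir -p /mnt/mdt /mnt/ost0
$ sudo mount -t lustre /dev/sdb /mnt/mdt
$ sudo mount -t lustre /dev/sdc /mnt/ost0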

Lustre presents all clients with a unified namespace
for all of the files and data in the filesystem, allowing concurrent
and coherent read and write access to the files in the filesystem. When
a client accesses a file, it completes a filename lookup on the MDS,
and either a new file is created or the layout of an existing file is
returned to the client. After locking the file on the OST, the client
then runs one or more read or write operations on the file but does not
directly modify the objects on the OST. Instead, it delegates those tasks
to the OSS. This approach ensures scalability and improves security and
reliability, because allowing clients direct access to the underlying
storage would increase the risk of filesystem corruption from
misbehaving/defective clients. Although all three components (MDT, OST
and client) can run on the same node, they typically are configured on
separate nodes communicating over a network (see the details on LNET
later in this article). In this example, I'm running the MDT and OST on
a single server node while the client accesses the OST from a separate node.
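
For illustration, once the MDT and OST are up on the server, mounting
the filesystem from the client node might look like the following,
reusing the placeholder names from the sketch above:

# On the client, mount the Lustre filesystem by naming the MGS
# node and the filesystem label rather than a block device:
$ sudo mkdir -p /mnt/testfs
$ sudo mount -t lustre lustre-host@tcp0:/testfs /mnt/testfs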

Installing Lustre

To obtain Lustre 1.8.5, download the prebuilt binaries packaged in RPMs,
or download the source and build the modules and utilities for your
respective Linux distribution. Oracle provides server RPM packages
for both Oracle Enterprise Linux (OEL) 5 and Red Hat Enterprise Linux
(RHEL) 5, while also providing client RPM packages for OEL 5, RHEL 5
and SUSE Linux Enterprise Server (SLES) 10 and 11. If you will be building
Lustre from source, ensure that you are using Linux
kernel 2.6.16 or greater. Note that in all deployments of Lustre, any
server that runs as an MDS, MGS (discussed below) or OSS must utilize a
patched kernel. Running a patched kernel on a Lustre client is optional
and required only if the client will be used for multiple purposes,
such as running as both a client and an OST.
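
If you're not sure whether your kernel meets that 2.6.16 minimum, or
whether you're already running a Lustre-patched kernel, a quick check is
(the output shown here is just an example):

# Print the running kernel release:
$ uname -r
2.6.18-194.el5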

If you already have a supported operating system,
make sure that the patched kernel, lustre-modules, lustre-ldiskfs (a
Lustre-patched backing filesystem kernel module package for the ext3
filesystem), lustre (which includes userspace utilities to configure and
run Lustre) and e2fsprogs packages are installed on the host system,
resolving their dependencies from a local or remote repository. Use
the rpm command to install all necessary packages:


$ sudo rpm -ivh kernel-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-modules-1.8.4-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-ldiskfs-3.1.3-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-1.8.4-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh e2fsprogs-1.41.10.sun2-0redhat.oel5.i386.rpm

After these packages have been installed, list the boot directory to
reveal the newly installed patched Linux kernel:


[petros@lustre-host ~]$ ls /boot/
config-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
grub
initrd-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.img
lost+found
symvers-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.gz
System.map-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
vmlinuz-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
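
Before rebooting into the patched kernel, make sure the bootloader will
load it by default. On a RHEL/OEL 5 system, this typically is handled in
/boot/grub/grub.conf; the excerpt below is only a sketch, and the entry
index, title and root device depend entirely on your existing
configuration:

# /boot/grub/grub.conf (excerpt): set "default" to the index
# (counting from 0) of the Lustre-patched kernel's entry.
default=0
timeout=5
title Lustre (2.6.18-194.3.1.0.1.el5_lustre.1.8.4)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.3.1.0.1.el5_lustre.1.8.4 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.img
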
______________________

Petros Koutoupis is a full-time Linux kernel, device-driver and
application developer for embedded and server platforms. He has been
working in the data storage industry for more than six years and enjoys discussing the same technologies.

Comments

Lustre vs GPFS

gs: It would be nice if you could write a similar article using GPFS and compare Lustre and GPFS.

Lustre != SAN, distributed != many (remote clients)

Cédric Dufour: The purpose of Lustre is to be a "distributed filesystem" in the sense of distributing the data among multiple storage nodes, via the local area network (and protocols such as TCP/IP over Ethernet or InfiniBand). It is this distribution on the *server* side that allows Lustre capacity and bandwidth to scale up to several petabytes of storage and gigabytes per second of bandwidth (thanks to a theoretical limit of 8,192 OSS servers).

And it is certainly not recommended to configure Lustre "to manage remote data storage disk devices within a Storage Area Network (SAN)", as SANs quickly become performance bottlenecks given the bandwidth of common SAN protocols, which can be circumvented only by very costly hardware. On the contrary, Lustre aims to distribute storage across as many commodity storage (OSS) nodes as possible, each designed to provide balanced/optimal performance at every level (storage, memory, processing power and client-facing network). It thus proves more efficient to use more "scaled-down" OSS nodes with direct-attached disks (and not too many of them, as the SCSI/SAS or PCI bus will quickly become a bottleneck) rather than fewer "beefed-up" OSS nodes in front of SANs.

Gluster

Anonymous: You may want to look at Gluster, which provides a single namespace for storage like Lustre but is much simpler to implement and doesn't require its own kernel or module. Clients use either NFS or FUSE.

Newer versions of Lustre available

Andreas Dilger: Please note that even though Oracle has stopped releasing newer versions of Lustre, there is a considerable amount of development effort in the Lustre community.

New releases are available for download at http://downloads.whamcloud.com/public/lustre/ and you can find more detailed information on the wiki: http://wiki.whamcloud.com/
