TL;DR: Lustre Unveiled (3): Architecture Details and Comparative Study

Lustre Unveiled: Evolution, Design, Advancements, and Current Trends provides a comprehensive tour of Lustre: its history and evolution, detailed architecture and design elements, a comparison with other prominent storage technologies, a case study of Lustre on a real-world supercomputer, and the future development of Lustre.

In this post I share my digest of the paper’s section on Lustre’s architecture details.

Architecture Details of Lustre

Object Storage for Data and Metadata

The centerpiece of Lustre’s architectural design is distributed object storage.

Objects are logical storage locations with identifiers that are globally unique within the filesystem. Given the global id for an object, any client or server can derive the storage target that holds the object.

Data Objects: the contents of a Lustre file are spread across one or more data objects according to the file layout. These objects are organized within the OST backend filesystem to facilitate quick lookup by global id.

The simplest and most common file layout is a “plain” striping layout, similar to RAID-0, defined by a stripe count and a stripe size. The stripe count is the number of OSTs used to hold the file’s data. The stripe size is the amount of data written to the data object on each OST before moving on to the next OST.
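The plain striping rule above can be sketched in a few lines. This is a hypothetical helper, not Lustre’s actual code; it only illustrates the RAID-0-style arithmetic that maps a logical file offset to an OST object and an offset within it.

```python
def stripe_location(offset, stripe_size, stripe_count):
    """Map a logical file offset to (OST index within the layout,
    offset inside that OST's data object) for a plain striped layout.
    Illustrative sketch only."""
    stripe_number = offset // stripe_size            # which stripe overall
    ost_index = stripe_number % stripe_count         # round-robin over OSTs
    object_offset = ((stripe_number // stripe_count) * stripe_size
                     + offset % stripe_size)         # offset in the object
    return ost_index, object_offset

# With 1 MiB stripes over 4 OSTs, byte offset 5 MiB lands on the second
# OST (index 1), 1 MiB into its object.
print(stripe_location(5 * 2**20, 2**20, 4))  # → (1, 1048576)
```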

Metadata Objects: a metadata object contains the metadata (e.g. POSIX filesystem attributes) for a file within the user-visible namespace. These objects are organized within the MDT backend filesystem to reflect the hierarchical namespace structure.

FID and Distributed Management of Global Object Identifiers: a FID is a 128-bit identifier, unique for each object, broken into a sequence number (SEQ), object ID (OID), and object version (VER), so that different parts of the filesystem can autonomously generate FIDs without any need for serialization or communication with other hosts.
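A minimal sketch of the FID structure, assuming the 64-bit SEQ / 32-bit OID / 32-bit VER split; the packing order into a single 128-bit integer here is illustrative, not Lustre’s on-wire encoding.

```python
from typing import NamedTuple

class FID(NamedTuple):
    seq: int  # 64-bit sequence number, leased out so hosts can
              # allocate FIDs independently
    oid: int  # 32-bit object ID within the sequence
    ver: int  # 32-bit object version

    def pack(self) -> int:
        """Pack into one 128-bit integer (illustrative encoding)."""
        return (self.seq << 64) | (self.oid << 32) | self.ver

    @classmethod
    def unpack(cls, value: int) -> "FID":
        return cls(value >> 64, (value >> 32) & 0xFFFFFFFF,
                   value & 0xFFFFFFFF)

fid = FID(seq=0x200000400, oid=7, ver=0)
assert FID.unpack(fid.pack()) == fid
```

Because each host holds its own SEQ range, allocating the next OID is a purely local operation: no cross-host coordination is needed to mint a globally unique FID.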

Object Index: an OI maintained on each target maps FIDs to local filesystem identifiers (i.e. inode number and generation) in that backing filesystem. The OI abstracts the backend inode numbers from the global namespace, so that changes to the underlying inode numbers do not require rewriting the FID references held in other data structures.
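The indirection the OI provides can be shown with a toy mapping; the class and method names are hypothetical, but they capture why only the OI entry changes when backend inode numbers move.

```python
class ObjectIndex:
    """Toy OI: FID -> (inode number, generation) in the backing fs."""

    def __init__(self):
        self._map = {}

    def insert(self, fid, ino, gen):
        self._map[fid] = (ino, gen)

    def lookup(self, fid):
        return self._map.get(fid)  # None if the FID is unknown

    def remap(self, fid, new_ino, new_gen):
        # e.g. after a backend restore assigns new inode numbers, only
        # this entry is rewritten; everything else keeps using the FID.
        self._map[fid] = (new_ino, new_gen)

oi = ObjectIndex()
oi.insert("0x200000400:0x7:0x0", ino=12345, gen=1)
oi.remap("0x200000400:0x7:0x0", new_ino=67890, new_gen=2)
```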

Lustre Filesystem Operations

Application I/O system calls are initially handled by the Linux Virtual File System (VFS) layer on the client machine, and then directed to the Lustre llite kernel module, which implements the necessary VFS operations. Depending on the request type, it takes one of two paths: the Logical Metadata Volume (LMV) for metadata requests or the Logical Object Volume (LOV) for data requests.

The client maintains a Metadata Client (MDC) component for each MDT in the filesystem. The LMV routes a metadata request to the correct MDC based on the request’s target (determined by the FID, the directory layout, or a hash of the filename). The MDC prepares the request for the Portal RPC (PtlRPC) subsystem, which sends it over the wire using LNet.
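The routing step can be sketched as picking an MDC by hashing the filename. The hash function and policy below are purely illustrative (Lustre supports several directory hash types); only the idea of deterministic, client-side routing is the point.

```python
import zlib

def route_metadata_request(filename: str, mdc_list: list):
    """Pick the MDC for an entry in a striped directory by hashing its
    name. Illustrative only: CRC32 stands in for the configured
    directory hash type."""
    index = zlib.crc32(filename.encode()) % len(mdc_list)
    return mdc_list[index]

mdcs = ["mdc0", "mdc1", "mdc2"]
target = route_metadata_request("results.dat", mdcs)
```

Because every client computes the same hash, all clients route requests for the same filename to the same MDT without consulting a central coordinator.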

On the server side, the incoming request from the LNet layer is passed through PtlRPC to the MDS component. The MDS sends any lock-related requests to the Lustre Distributed Lock Manager (LDLM) layer, while the metadata requests are handled by the Metadata Device (MDD), which implements common metadata functionality such as lookups, permission checks, attributes, and xattrs. These requests are then passed to the OSD layer, which interfaces directly with the backing filesystem.

The path for data requests is very similar. The client maintains an Object Storage Client (OSC) component for each OST in the filesystem. The LOV acts as an abstraction layer over the OSC components and routes each data request to the correct OSC based on the request’s final target. The OSC then sends the request along the PtlRPC -> LNet -> physical network -> LNet -> PtlRPC path, where it arrives at the OSS on the server side.

The OSS directs lock-related requests to the LDLM component, while data requests are sent to the OBD Filter Device (OFD) component, which handles common object functions such as compound OSD transactions, I/O and attribute requests, and recovery logging. These requests are then passed to the OSD layer, which interfaces with the backing filesystem.

Path Resolution. To navigate the filesystem namespace, clients do incremental lookups for each component of a pathname by sending requests to the appropriate MDS to retrieve the FID of the metadata object for a specific parent directory FID and the filename therein.
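The incremental lookup loop can be sketched as follows. The `lookup` callable stands in for the per-component RPC to the MDS; the FID strings and root name are hypothetical.

```python
def resolve_path(path: str, lookup):
    """Resolve a pathname component by component, starting from the
    root FID. Each step asks the metadata service: given this parent
    FID and this name, what is the child's FID?"""
    fid = "ROOT"  # stand-in for the filesystem root FID
    for name in path.strip("/").split("/"):
        fid = lookup(fid, name)  # one metadata RPC per component
    return fid

# Toy namespace keyed by (parent FID, entry name).
ns = {("ROOT", "home"): "fid:home", ("fid:home", "user"): "fid:user"}
print(resolve_path("/home/user", lambda p, n: ns[(p, n)]))  # → fid:user
```

Note that in a filesystem with multiple MDTs, consecutive lookups may go to different servers, since each directory can live on a different MDT.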

Listing Directory Entries. Directory listing requests are sent to one or more MDT(s) holding the directory contents.

Creating Files and Directories. When a client creates a new directory entry (i.e., a file or subdirectory), it forwards the creation request to the MDT hosting the new entry’s parent directory.

Opening an Existing File. When a Lustre client wants to access file data, the client will traverse the pathname components doing an incremental lookup on each one. The client then sends a lock intent request to do a lookup on the last component of the filename.

Writing Data to a File. During write operations on a file, the client uses the logical file offset and length of the write buffer to identify the file extent and the OST object(s) containing data within the extent.

Reading Data from a File. During read operations on a file, the client uses the logical file offset and length of the read buffer to identify the file extent that will be read and then checks its local data cache for availability.
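For both reads and writes, the extent-to-object mapping above amounts to finding which stripes a byte range touches. A hypothetical helper under a plain RAID-0 layout:

```python
def stripes_for_extent(offset, length, stripe_size, stripe_count):
    """Return the set of OST indices (within the file's layout) that
    hold any data in [offset, offset + length). Illustrative sketch."""
    first = offset // stripe_size                  # first stripe touched
    last = (offset + length - 1) // stripe_size    # last stripe touched
    if last - first + 1 >= stripe_count:
        return set(range(stripe_count))            # extent spans all OSTs
    return {s % stripe_count for s in range(first, last + 1)}

# A 3 MiB read from offset 0 with 1 MiB stripes over 4 OSTs touches
# the first three OSTs in the layout.
print(stripes_for_extent(0, 3 * 2**20, 2**20, 4))  # → {0, 1, 2}
```

The client then issues RPCs only to the OSCs for those OSTs, which is what lets large reads and writes proceed in parallel across servers.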

Closing a File. When an application closes a file, and depending on the file access pattern, the client may cache an open file lock locally to avoid repeated open and close RPCs being sent to the MDS for the same file. Upon the final close of the file, the client will send an RPC to the MDS that it is no longer accessing the file, and send along the current size, blocks and timestamp attributes so the MDT can cache them in the Lazy Size-on-MDT (LSOM) extended attribute for future reference.

Concurrency and Consistency for Parallel I/O

Lustre’s client architecture enables concurrent communication and processing of remote actions. Parallel applications often issue I/O from many processes at once, so the filesystem must coordinate that concurrent access while keeping data and metadata consistent.

Distributed Consistency via Locking. The Lustre Distributed Lock Manager(LDLM) manages coherent access to shared resources across the filesystem.

LDLM design incorporates various lock modes, including read (shared), write, and exclusive. Each lock mode controls the type of concurrency allowed, such as many clients reading and caching the same file attributes and data, writing different parts of a file in parallel, or exclusively cancelling all other locks to perform changes to a file’s layout.
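The interaction between modes can be captured as a compatibility matrix. The table below is the classic DLM matrix (from the VAX/VMS lock manager that LDLM’s six modes derive from); treat it as a sketch of the concept rather than a statement about Lustre’s exact internal tables.

```python
# True means a requested lock can be granted alongside a held lock.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},  # Null: compatible with all
    "CR": {"NL", "CR", "CW", "PR", "PW"},        # Concurrent Read
    "CW": {"NL", "CR", "CW"},                    # Concurrent Write
    "PR": {"NL", "CR", "PR"},                    # Protected Read
    "PW": {"NL", "CR"},                          # Protected Write
    "EX": {"NL"},                                # Exclusive
}

def compatible(held: str, requested: str) -> bool:
    return requested in COMPAT[held]

assert compatible("PR", "PR")      # many readers can share a resource
assert not compatible("PW", "PW")  # two protected writers conflict
assert not compatible("EX", "PR")  # exclusive blocks everything but NL
```

A request that is incompatible with currently granted locks must wait (or trigger callbacks asking holders to cancel), which is where the waiting and conversion states mentioned above come in.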

It intelligently manages lock states and transitions, such as waiting or converting locks, to optimize resource access and minimize delays. This guarantees the efficient and reliable execution of filesystem operations from simple reads to complex data manipulations.

Lock Namespaces. Lock namespaces segregate locks on different resources to be independently handled by the individual servers managing the storage for those resources. Each MDT and OST has its own lock namespace, as does the MGT. Clients have “shadow” namespaces where they manage their subset of locks for each target.

Each namespace is responsible for managing the locks related to its particular set of local resources (e.g., files or objects). This approach allows LDLM to efficiently scale and concurrently manage millions of locks across thousands of servers as the filesystem expands to accommodate billions of files and PBs of data.

Lock Resources. A lock resource in LDLM refers to a specific entity within a lock namespace, such as a file, directory, or an object, for which access needs to be controlled and synchronized. Each resource manages its locks independently, and there can be one or many individual locks on a single resource.

Lock Granularity. LDLM’s locking scheme includes both fine-grained and coarse-grained locks, enabling it to efficiently manage access to filesystem resources.

LDLM locks have six Lock Modes:

  • EX (Exclusive) mode
  • PW (Protected Write) mode
  • PR (Protected Read) mode
  • CW (Concurrent Write) mode
  • CR (Concurrent Read) mode
  • NL (Null) mode

LDLM locks have four Lock Types:

  • Extent Locks: Extent locks manage access to specific data ranges within an OST object, ensuring that data on the OSTs can be accessed and modified safely and concurrently by multiple clients.
  • Inode Locks (Inodebits/ibits): Inode locks control access to the metadata attributes of files and directories, safeguarding the metadata stored on the MDTs and ensuring consistent metadata operations.
  • Flock (File Locks): Flocks implement the POSIX file locking semantics, supporting userspace requests for advisory lock operations on files, thereby facilitating compatibility with applications that rely on traditional file locking mechanisms.
  • Plain Locks: Plain locks represent the original and simplest form of LDLM locking, suitable for scenarios requiring only a single lock on a resource.
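For extent locks specifically, the conflict rule boils down to range overlap plus mode: two locks clash only if their byte ranges intersect and at least one is a write. A simplified sketch (the types and two-mode reduction are hypothetical):

```python
from typing import NamedTuple

class ExtentLock(NamedTuple):
    start: int  # first byte covered
    end: int    # last byte covered (inclusive)
    mode: str   # "PR" (read) or "PW" (write), simplified to two modes

def conflicts(a: ExtentLock, b: ExtentLock) -> bool:
    """Two extent locks conflict iff their ranges overlap and at least
    one of them is a write."""
    overlap = a.start <= b.end and b.start <= a.end
    return overlap and "PW" in (a.mode, b.mode)

# Overlapping readers coexist; an overlapping writer does not; and
# writers to disjoint ranges of the same object proceed in parallel.
assert not conflicts(ExtentLock(0, 99, "PR"), ExtentLock(50, 150, "PR"))
assert conflicts(ExtentLock(0, 99, "PW"), ExtentLock(50, 150, "PR"))
assert not conflicts(ExtentLock(0, 99, "PW"), ExtentLock(100, 199, "PW"))
```

The last case is what allows many clients to write different stripes or regions of one file concurrently without serializing on a single whole-file lock.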

Lustre Networking

Lustre’s networking layer, LNet, facilitates communication between clients and storage servers (OSS, MDS, MGS) and among servers. The key design features include:

  • Scalability: LNet is designed to handle large-scale Lustre deployments in a parallel and distributed fashion.
  • Performance: LNet is tuned for high performance, aiming to deliver rapid and scalable storage solutions.
  • Modularity: LNet is inherently modular, accommodating integration of a wide array of network transports without requiring changes to the software components using the network abstraction layer.
  • Reliability: LNet incorporates a range of reliability features, such as multi-rail (MR), MR discovery and LNet health, which are essential for preserving the integrity and accessibility of data.
  • Compatibility: LNet provides system administrators the means to configure it to align with the particular needs of their Lustre deployment, including suitable network transport, performance-related parameters and compatibility with underlying hardware infrastructure.

The architecture of LNet is divided into user space and kernel space components.

In user space, the lnetctl utility is used to configure LNet. lnetctl interacts with the Dynamic LNet Configuration (DLC) library, which implements the configuration API. DLC translates the configurations provided via the API into Netlink protocol messages understood by the LNet Netlink component in kernel space.

The core logic for message handling is located in the LNet module, while Lustre Network Drivers (LNDs) serve as hardware-specific interfaces: socklnd for TCP, o2iblnd for Verbs-based fabrics such as Mellanox InfiniBand or Omni-Path, and kfilnd for HPE Slingshot.

The basic functionality of LNet involves communication between two hosts, each configured with a single Network Interface (NI) on the same network. Beyond this basic case is the routed single-hop configuration, where the two hosts sit on different network types; an LNet router host with interfaces on both networks is placed between them to enable message exchange. Further variations include routed single-hop with multiple LNet routers, where each router introduces its own routing configuration and each route can be assigned a priority, and routed multi-hop, where multiple router hosts are interposed between the two endpoints.
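Route selection among multiple routers can be illustrated with a toy route table keyed by destination network, picking the lowest-numbered priority. This is only a sketch of the priority idea; real LNet selection also weighs hop count, router health, and multi-rail state, and the network names below are made up.

```python
# Toy LNet-style route table: each entry says "to reach network `net`,
# forward via `gateway`, with this priority (lower wins)".
routes = [
    {"net": "o2ib1", "gateway": "10.0.0.1@tcp", "priority": 1},
    {"net": "o2ib1", "gateway": "10.0.0.2@tcp", "priority": 0},
]

def pick_route(dest_net):
    """Choose the best configured route to a remote network, or None
    if the network is unreachable."""
    candidates = [r for r in routes if r["net"] == dest_net]
    return min(candidates, key=lambda r: r["priority"]) if candidates else None

best = pick_route("o2ib1")
print(best["gateway"])  # → 10.0.0.2@tcp (priority 0 beats priority 1)
```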


Reference

Anjus George, Andreas Dilger, Michael J. Brim, Richard Mohr, Amir Shehata, Jong Youl Choi, Ahmad Maroof Karimi, Jesse Hanley, James Simmons, Dominic Manno, Veronica Melesse Vergara, Sarp Oral, and Christopher Zimmer. 2025. Lustre Unveiled: Evolution, Design, Advancements, and Current Trends. ACM Trans. Storage 21, 3, Article 21 (June 2025), 109 pages. https://doi.org/10.1145/3736583