Exploiting 64-Bit Linux
Wide availability of 64-bit systems offers an opportunity to simplify programming. By leveraging large virtual addresses and memory mapped files, we combine programming and file persistence into a single activity. This simplifies programming when persistence and sharing of large complex data structures are required. It also improves the efficiency of data access and sharing, as multiple programs can access data directly (operate in place) from the single real page copy. This eliminates the need to copy the data through multiple layers of buffering. When all programs share data at the same virtual address, there are opportunities for the kernel to manage the memory map to avoid aliases and share MMU resources across applications.
As a proof of concept, I have a small user library I call SPHDE (Shared Persistent Heap/Data Environment). This library manages a portion of the application's address space to provide blocks of persistent and shared storage. The library also provides some locking primitives and some “utility objects” that provide finer-grained storage allocation and indexing.
As a demonstration, I wrote a gigapixel Mandelbrot (GTK2) GUI program. This program computes each point in the Mandelbrot set once and saves the resulting image as a quadtree in SPHDE storage. This quadtree representation can be shared and displayed by multiple instances of the program. Program instances can cooperate to fill in the next level of detail.
System design and programming models are (initially) driven by the need to manage scarce resources. In the early days of programming, both address spaces and physical memory were small. IO Subsystems evolved to manage the buffering of serial media, such as tape and punch cards. Even the introduction of random access disk and virtual memory did not change the basic paradigm much. The programming paradigms we use and teach today have far outlived the original scarcity that caused them. The separation of programming from persistent data is a prime example. It creates inefficiencies in both programming effort and runtime execution.
With the prevalence of 64-bit microprocessors, the original (addressing/memory) scarcity is gone. We can leverage virtual memory to access and share massive data structures directly using “operate in place” and “implicit persistence” patterns. This simplifies programming, because any data structure is implicitly persistent by where it is allocated. This improves operational efficiency, because it eliminates layers of buffer management and the OS has more opportunities to share a single physical copy. There are additional opportunities for improving address translation efficiency by eliminating address aliasing and exploiting hardware features, such as large pages.
There have been attempts to unify programming and persistence before. Persistence frameworks abound, but they simply add layers of buffering and data conversions to the top of the IO Subsystem stack. Operational efficiency is lost to obtain a modest reduction in programming effort.
Alternatively, Single-Level Stores have been a research topic for years. Although the technology works (simplified programming and data efficiency), the requirement for specialized hardware or unique OS environments has limited acceptance.
This picture changes again with widely available 64-bit microprocessors and standardized POSIX memory management APIs. The addition of a small runtime library allows adventurous programmers to use operate in place and implicit persistence across numerous OSes and 64-bit hardware platforms. This hybrid Single-Level Store approach allows for incremental exploration and adoption of a new programming paradigm.
Simple memory-mapped files is not enough. We need a design for how the address space and backing files are managed. The design should balance simplicity and efficiency in the management of the virtual (address space) and physical (real memory and disk space) resources.
Here are some principles I believe are important to include in the design. The runtime should be simple to use and compatible with UNIX programming and POSIX APIs. It should use resources (address space) that are not normally used in existing applications. It should appear to the program like a large shared persistent heap (SPH). Access to data in the SPH should use standard C pointer semantics. Setup to access to data in the SPH should be minimum. The design evolved to the concepts of regions, stores, segment and blocks, all with power of two sizes.
The region is simply the range of virtual addresses that the application wants the runtime to manage on its behalf. The region should be in a range of virtual addresses not normally used by the application. We want to ensure that the region does not interfere with the normal program usage of (local temporary) heap and dynamic libraries. On Linux, this means the area above the TASK_UNMAPPED_BASE and below the main stack. Leaving lots of room below the stack and above TASK_UNMAPPED_BASE still allows for the use of up to half the program's address space for SPH.
Next, we need to choose a segment size. The segment is the unit size for allocating backing files and memory mapping those files into the region. This should be large enough so that we don't do frequent kernel calls and small enough so that we don't waste file space.
A store is a directory that contains one or more files used to back SPH segments. A system can have multiple SPH stores by allocating the backing files in different directories. For now (also to keep things simple), a program can bind to only one region/store at a time.
Finally, a block is the unit of space allocation within an SPH region. It is normally between a page and a segment in size, but that is not a hard limit. The only hard limit is that the block fits with the remaining space at the time. If the block is allocated to a portion of the SPH region that does not currently have a backing file allocated, the SPH runtime will implicitly create the file(s) in the store directory associated with the region.
For overall simplicity, we use power of two sizes for blocks and segments. This allows a power of two buddy system to be used to track all aspects of SPH storage. The buddy system reduces fragmentation and simplifies recombining smaller blocks into larger blocks when they are freed.
We need some low-level utility functions to help manage the buddy lists. First, we need a simple heap manager to sub-allocate blocks for smaller control blocks and lists. The lists need to be ordered and will be searched frequently, so we use a simple binary tree. Algorithms for the heap manager and binary tree can be found in Donald E. Knuth's The Art of Computer Programming (see Resources).
Finally, we need a place to store and maintain this information. The anchor block is the first block in the region. The anchor block starts with a block header, which includes signature words (eye-catchers initialized to special values), type info, pointer to the start of the local free list (heap) and a set of pointers for the various binary trees that manage space in the SPH region. The remainder of the anchor block is initialized as free heap space. This internal heap is used to allocate additional nodes required for the binary trees and any other internal control blocks needed to manage the SPH region.
After an SPH region is initialized, we have mostly an empty region with one segment (backing file in the store directory) and one block (the anchor block) allocated from that segment. The various lists have been initialized to depict this state. For example, the region free and used lists contain entries for the unallocated (no backing files) and allocated (with backing files) portion of the region. An entry for the anchor block is added to the “used” list. The “uncommitted” list contains entries for the current unused portions of the first segment. And, the free list is empty. Now we are ready to allocate more blocks.
|Speed Up Your Web Site with Varnish||Jun 19, 2013|
|Non-Linux FOSS: libnotify, OS X Style||Jun 18, 2013|
|Containers—Not Virtual Machines—Are the Future Cloud||Jun 17, 2013|
|Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer||Jun 12, 2013|
|Weechat, Irssi's Little Brother||Jun 11, 2013|
|One Tail Just Isn't Enough||Jun 07, 2013|
- Speed Up Your Web Site with Varnish
- Containers—Not Virtual Machines—Are the Future Cloud
- Linux Systems Administrator
- Non-Linux FOSS: libnotify, OS X Style
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- RSS Feeds
- Reply to comment | Linux Journal
3 hours 7 min ago
- Yeah, user namespaces are
4 hours 24 min ago
- Cari Uang
7 hours 55 min ago
- user namespaces
10 hours 48 min ago
11 hours 14 min ago
- One advantage with VMs
13 hours 43 min ago
- about info
14 hours 16 min ago
14 hours 17 min ago
14 hours 18 min ago
14 hours 20 min ago
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?