Memory Ordering in Modern Microprocessors, Part II
Although AMD64 is compatible with x86, it offers a slightly stronger memory-consistency model, in that it does not reorder a store ahead of a load. After all, loads are slow and cannot be buffered, so why reorder a store ahead of a load? Although it is possible in theory to create a parallel program that works on some x86 CPUs but fails on AMD64 due to this difference in memory-consistency model, in practice this difference has little effect on porting code from x86 to AMD64.
The AMD64 implementation of the Linux smp_mb() primitive is mfence, smp_rmb() is lfence and smp_wmb() is sfence.
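The mapping above can be sketched in user space with GCC-style inline assembly. The sketch_ names, the publish()/consume() pair and the non-x86-64 fallback branch are all assumptions made for illustration; they are not the kernel's own definitions.

```c
#include <assert.h>

/* Hypothetical user-space stand-ins for the kernel primitives.
 * The "memory" clobber also keeps the compiler from moving memory
 * accesses across each fence. */
#if defined(__x86_64__)
#define sketch_smp_mb()  __asm__ __volatile__("mfence" ::: "memory")
#define sketch_smp_rmb() __asm__ __volatile__("lfence" ::: "memory")
#define sketch_smp_wmb() __asm__ __volatile__("sfence" ::: "memory")
#else
/* Assumption: elsewhere, fall back to a full compiler-provided fence
 * so that the example still builds. */
#define sketch_smp_mb()  __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_smp_rmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_smp_wmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

static int payload;
static int ready;

/* Writer: store the payload, then set the flag. */
static void publish(int value)
{
    payload = value;
    sketch_smp_wmb();   /* order the payload store before the flag store */
    ready = 1;
}

/* Reader: check the flag, then read the payload; -1 means "not ready". */
static int consume(void)
{
    if (!ready)
        return -1;
    sketch_smp_rmb();   /* order the flag load before the payload load */
    return payload;
}

/* Single-threaded usage; it exercises the calls, not the ordering. */
static void message_passing_demo(void)
{
    assert(consume() == -1);
    publish(42);
    assert(consume() == 42);
}
```

Real concurrent code would run publish() and consume() on different CPUs and would also need the accesses to payload and ready to be protected from compiler tearing and caching, which this sketch leaves out.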

Figure 1. Why smp_read_barrier_depends() Is Required
IA64 offers a weak consistency model, so that in the absence of explicit memory-barrier instructions, IA64 is within its rights to reorder memory references arbitrarily. IA64 has a memory-fence instruction named mf, as well as half-memory-fence modifiers for loads, stores and some of its atomic instructions. The acq modifier prevents subsequent memory-reference instructions from being reordered before the acq, but it permits prior memory-reference instructions to be reordered after the acq, as fancifully illustrated by Figure 2. Similarly, the rel modifier prevents prior memory-reference instructions from being reordered after the rel, but it allows subsequent memory-reference instructions to be reordered before the rel.
These half-memory fences are useful for critical sections, as it is safe to push operations into a critical section. It can be fatal, however, to allow them to bleed out.
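The same acquire/release pairing is available portably through C11 atomics, which makes for a compact illustration. The following is a minimal sketch of a test-and-set lock, not the kernel's spinlock; sketch_lock and the counter are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical test-and-set lock built on C11 atomics. */
struct sketch_lock {
    atomic_flag held;
};

static void sketch_lock_acquire(struct sketch_lock *l)
{
    /* Acquire: later accesses cannot move before this operation,
     * but earlier ones may still move after it (into the section). */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* spin until the current holder releases */
}

static void sketch_lock_release(struct sketch_lock *l)
{
    /* Release: earlier accesses cannot move after this operation,
     * but later ones may still move before it (into the section). */
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static struct sketch_lock counter_lock = { ATOMIC_FLAG_INIT };
static long counter;

/* Increment the shared counter inside the critical section. */
static long counter_increment(void)
{
    long seen;

    sketch_lock_acquire(&counter_lock);
    seen = ++counter;
    sketch_lock_release(&counter_lock);
    return seen;
}

/* Single-threaded usage of the lock. */
static void counter_demo(void)
{
    assert(counter_increment() == 1);
    assert(counter_increment() == 2);
}
```

Nothing inside the critical section can leak out past either end, yet the lock needs no full fence, which is the property the half-memory fences are meant to provide.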
The IA64 mf instruction is used for the smp_rmb(), smp_mb() and smp_wmb() primitives in the Linux kernel. Oh, and despite persistent rumors to the contrary, the mf mnemonic really does stand for memory fence.
Although the PA-RISC architecture permits full reordering of loads and stores, actual CPUs run fully ordered. This means the Linux kernel's memory-ordering primitives generate no code; they do, however, use the GCC "memory" clobber to disable compiler optimizations that would reorder code across the memory barrier.
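A compiler-only barrier of this kind can be sketched with an empty GCC-style asm statement carrying a "memory" clobber. The macro and function names below are hypothetical; the point is only that the statement emits no instruction yet still constrains the compiler.

```c
#include <assert.h>

/* Hypothetical name. The empty asm emits no machine instruction, but
 * the "memory" clobber forbids the compiler from moving memory accesses
 * across it or keeping memory values cached in registers over it. */
#define sketch_barrier() __asm__ __volatile__("" ::: "memory")

static int data;
static int flag;

static void publish_in_order(int value)
{
    data = value;
    sketch_barrier();   /* the data store may not be sunk below this point */
    flag = 1;
}

/* Single-threaded usage. */
static void barrier_demo(void)
{
    publish_in_order(7);
    assert(data == 7 && flag == 1);
}
```

On a CPU that executes memory references in order, this is all a barrier has to do; on a weakly ordered CPU it would not be sufficient by itself.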
Listing 2. Safe Insert and Lock-Free Search
struct el *insert(long key, long data)
{
        struct el *p;

        /* GFP_ATOMIC: the allocation must not sleep. */
        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        if (!p)
                return NULL;
        spin_lock(&mutex);
        p->next = head.next;
        p->key = key;
        p->data = data;
        smp_wmb();      /* initialize the element before publishing it */
        head.next = p;
        spin_unlock(&mutex);
        return p;
}

struct el *search(long key)
{
        struct el *p;

        p = head.next;
        while (p != &head) {
                /* Order the load of p before the loads through p. */
                smp_read_barrier_depends();
                if (p->key == key)
                        return p;
                p = p->next;
        }
        return NULL;
}
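For readers who want to experiment outside the kernel, the same publish-then-search pattern can be approximated with C11 atomics. This is a swapped-in technique, not the listing's own code: a release store stands in for smp_wmb() plus the pointer assignment, and an acquire load stands in for the pointer load plus smp_read_barrier_depends(), which is stronger than strictly needed. The names node, list_head, list_insert and list_search are hypothetical, and writers are assumed to be serialized by the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    struct node *_Atomic next;
    long key;
    long data;
};

/* Empty circular list: the head points to itself. */
static struct node list_head = { &list_head, 0, 0 };

/* Insert at the front. Assumption: callers serialize writers themselves. */
static struct node *list_insert(long key, long data)
{
    struct node *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    p->key = key;
    p->data = data;
    atomic_store_explicit(&p->next,
                          atomic_load_explicit(&list_head.next,
                                               memory_order_relaxed),
                          memory_order_relaxed);
    /* Release: the node's fields become visible before the node
     * itself becomes reachable from the head. */
    atomic_store_explicit(&list_head.next, p, memory_order_release);
    return p;
}

/* Lock-free search; returns NULL if the key is absent. */
static struct node *list_search(long key)
{
    struct node *p = atomic_load_explicit(&list_head.next,
                                          memory_order_acquire);

    while (p != &list_head) {
        if (p->key == key)
            return p;
        p = atomic_load_explicit(&p->next, memory_order_acquire);
    }
    return NULL;
}

/* Single-threaded usage. */
static void list_demo(void)
{
    struct node *n;

    assert(list_search(1) == NULL);
    n = list_insert(1, 10);
    assert(n != NULL && list_search(1) == n);
}
```

As in the listing, nodes are never removed, so the sketch sidesteps the question of when a node's memory may safely be freed.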
The POWER and PowerPC CPU families have a wide variety of memory-barrier instructions:
sync causes all preceding instructions, not only memory references, to appear to have completed before any subsequent operations are started. This instruction, therefore, is quite expensive.
lwsync, or lightweight sync, orders loads with respect to subsequent loads and stores, and it also orders stores. However, it does not order stores with respect to subsequent loads. Interestingly enough, the lwsync instruction enforces the same ordering as does the zSeries and, coincidentally, the SPARC TSO.
eieio (enforce in-order execution of I/O, in case you were wondering) causes all preceding cacheable stores, which are normal memory references, to appear to have completed before all subsequent cacheable stores. It also causes all preceding non-cacheable, memory-mapped I/O (MMIO) stores to appear to have completed before all subsequent non-cacheable stores. However, the stores to cacheable memory are ordered separately from the stores to non-cacheable memory, which, for example, means that eieio does not force an MMIO store to precede a spinlock release.
isync forces all preceding instructions to appear to have completed before any subsequent instructions start execution. This means that the preceding instructions must have progressed far enough that any traps they might generate either have happened or are guaranteed not to happen. Furthermore, any side effects of these instructions—for example, page-table changes—are seen by the subsequent instructions.
Unfortunately, none of these instructions line up exactly with Linux's wmb() primitive, which requires all stores to be ordered but does not require the other high-overhead actions of the sync instruction. But there is no choice: ppc64 versions of wmb() and mb() are defined to be the heavyweight sync instruction. Linux's smp_wmb() primitive, however, is never used for MMIO, because a driver must carefully order MMIOs in UP as well as SMP kernels. It therefore is defined to be the lighter-weight eieio instruction, which may be unique in having a five-vowel mnemonic. The smp_mb() primitive also is defined to be the sync instruction, but both smp_rmb() and rmb() are defined to be the lighter-weight lwsync instruction.
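The ppc64 mapping described above can be written down as a set of macros. This is a sketch under stated assumptions: the sketch_ names are hypothetical, the fallback branch exists only so the example builds on other architectures, and the "doorbell" is ordinary memory standing in for a device register.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__powerpc64__)
#define sketch_mb()      __asm__ __volatile__("sync"   ::: "memory")
#define sketch_wmb()     __asm__ __volatile__("sync"   ::: "memory")
#define sketch_rmb()     __asm__ __volatile__("lwsync" ::: "memory")
#define sketch_smp_mb()  __asm__ __volatile__("sync"   ::: "memory")
#define sketch_smp_wmb() __asm__ __volatile__("eieio"  ::: "memory")
#define sketch_smp_rmb() __asm__ __volatile__("lwsync" ::: "memory")
#else
/* Assumption: use a full fence everywhere on other architectures. */
#define SKETCH_FENCE()   __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_mb()      SKETCH_FENCE()
#define sketch_wmb()     SKETCH_FENCE()
#define sketch_rmb()     SKETCH_FENCE()
#define sketch_smp_mb()  SKETCH_FENCE()
#define sketch_smp_wmb() SKETCH_FENCE()
#define sketch_smp_rmb() SKETCH_FENCE()
#endif

/* Simulated device register: plain memory standing in for MMIO. */
static volatile uint32_t fake_doorbell;
static int lock_held = 1;

/* Ring the doorbell, then drop the lock. */
static void ring_and_unlock(uint32_t value)
{
    fake_doorbell = value;  /* would be a non-cacheable MMIO store */
    sketch_wmb();           /* sync orders both classes of store together */
    lock_held = 0;          /* cacheable store that releases the lock */
}

/* Single-threaded usage. */
static void doorbell_demo(void)
{
    ring_and_unlock(5);
    assert(fake_doorbell == 5 && lock_held == 0);
}
```

Using sketch_smp_wmb() in ring_and_unlock() would not be enough on ppc64, because eieio orders cacheable and non-cacheable stores separately.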
Many members of the POWER architecture have incoherent instruction caches, so a store to memory is not necessarily reflected in the instruction cache. Thankfully, few people write self-modifying code these days, but JITs do it all the time. Furthermore, recompiling a recently run program looks like self-modifying code from the CPU's viewpoint. The icbi instruction, instruction cache block invalidate, invalidates a specified cache line from the instruction cache and may be used in these situations.
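User-space code generators rarely issue icbi by hand; a common approach is to call the GCC/Clang builtin __builtin___clear_cache() after writing new instructions and leave the architecture-specific sequence to the toolchain. The helper below is a minimal sketch with a hypothetical name, and it only copies bytes and flushes the range; it does not map the memory executable or run the code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy freshly generated instructions into place,
 * then ask the toolchain to make the instruction cache coherent with
 * the new contents of [dst, dst + len). */
static void install_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}

/* Usage with placeholder bytes rather than real machine code. */
static void install_demo(void)
{
    unsigned char src[4] = { 1, 2, 3, 4 };
    unsigned char dst[4] = { 0, 0, 0, 0 };

    install_code(dst, src, sizeof(src));
    assert(memcmp(dst, src, sizeof(src)) == 0);
}
```

On CPUs with coherent instruction caches the builtin may compile to nothing, so the call costs little where it is not needed.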
Comments
memory addressing question
First...Loved your article.
I hope I am not bothering you. But I have a question regarding
memory addressing in Linux.
As I have read (in Mel Gorman's book), a virtual address in kernel space below the first 896 MB is simply an offset, PAGE_OFFSET, which is stored in the DS register.
So when the CPU wishes to access it, it subtracts this value from the address when it is in kernel mode.
Well, if it does, how can the processor tell a vmalloc virtual
address (896 MB to 1 GB) in kernel space from a virtual address in kernel
space (below 896 MB)?
Furthermore, if I boot my Linux machine (an Intel machine, an IBM T42 laptop) using only part of the memory (boot mem=400M out of 512M), I would not be able to address addresses above 400 MB.
I tried to memcpy to an address above 400 MB and I crashed.
So I really have no idea where I am wrong.
I would most appreciate your kind help.
Thank you.
Raz
PS.
I am looking for some information/articles regarding how the CPU actually accesses memory.