Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
Lock-Free Multi-Producer Multi-Consumer Queue
Hopefully, you do not have to ask the kernel for help with user-space thread synchronization. The CPU (at least in the most common architectures) provides atomic memory operations and barriers. With the operations, you can atomically:
Read the memory operand, modify it and write it back.
Read the memory operand, compare it with a value and swap it with the other value.
Memory barriers are special assembly instructions also known as fences. Fences guarantee an instruction's execution order on the local CPU and visibility order on other CPUs. Let's consider two independent data instructions, A and B, separated by fence (let's use mfence, which provides a guarantee for ordering read and write operations):
A mfence B
The fence guarantees that:
Compiler optimizations won't move A after the fence or B before the fence.
The CPU will execute A and B instructions in order (it normally executes instructions out of order).
Other CPU cores and processor packages, which work on the same bus, will see the result of instruction A before the result of instruction B.
For our queue, we need to synchronize multiple threads' access to the head_ and
tail_ fields. Actually, when you run
is an example of an RMW,
Read-Modify-Write, operation because the processor must read the current head_ value,
increment it locally and write it back to memory) on two cores, both
could read the current head_ value simultaneously, increment it and
write the new value back simultaneously, so one increment is lost. For atomic operations,
C++11 provides the std::atomic template, which should replace the current GCC sync_
intrinsics in the future. Unfortunately, for my compiler (GCC 4.6.3 for
std::atomic<> methods still generate extra fences independently on specified memory
order. So, I'll the use old GCC's intrinsics for atomic operations.
We can atomically read and increment a variable (for example, our head_) by:
This makes the CPU lock the shared memory location on which it's going to do an operation (increment, in our case). In a multiprocessor environment, processors communicate to each other to ensure that they all see the relevant data. This is known as the cache coherency protocol. By this protocol, a processor can take exclusive ownership on a memory location. However, these communications are not for free, and we should use such atomic operations carefully and only when we really need them. Otherwise, we can hurt performance significantly.
Meanwhile, plain read and write operations on memory locations execute
atomically and do not require any additional actions (like specifying the
prefix to make the instruction run atomically on x86 architecture).
In our lock-free implementation, we're going to abandon the mutex mtx_ and consequently both the condition variables. However, we still need to wait if the queue is full on push and if the queue is empty on pop operations. For push, we would do this with a simple loop like we did for the locked queue:
while (tail_ + Q_SIZE < head_) sched_yield();
sched_yield() just lets the other thread run on the current processor. This is
the native way and the fastest way to re-schedule the current thread. However, if there is no
other thread waiting in the scheduler run queue for available CPU,
the current thread will be scheduled back immediately. Thus, we'll always see 100%
CPU usage, even if we have no data to process. To cope with this, we can use
usleep(3) with some small value.
Let's look more closely at what's going on in the loop. First, we read the tail_ value; next we read the value of head_, and after that, we make the decision whether to wait or push an element and move head_ forward. The current thread can schedule at any place during the check and even after the check. Let's consider the two-thread scenario (Table 1).
Alexander Krizhanovsky is the software architect and founder of NatSys-Lab. Before NatSys-Lab, he worked as a Senior Software Developer at IBM, Yandex and Parallels. He specializes in high-performance solutions for UNIX environments.
- Stepping into Science
- Linux Journal December 2016
- Synacor, Inc.'s Zimbra Open Source Support and Zimbra Suite Plus
- CORSAIR's Carbide Air 740
- A Better Raspberry Pi Streaming Solution
- Tech Tip: Really Simple HTTP Server with Python
- Tyson Foods Honored as SUSE Customer of the Year
- The Tiny Internet Project, Part II
- Connecting Apache's Web Server to Multiple Instances of Tomcat
- Radio Free Linux