Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer

We can find the lowest tail value among all the consumers with the following loop:


auto min = tail_;
for (size_t i = 0; i < n_consumers_; ++i) {
    auto tmp_t = thr_p_[i].tail;

    asm volatile("" ::: "memory"); // compiler barrier

    if (tmp_t < min)
        min = tmp_t;
}

The temporary variable tmp_t is required here because we cannot atomically compare whether thr_p_[i].tail is less than min and update min if it is. When we remember the current consumer's tail and compare it with min, the consumer can move the tail. It can move it only forward, so the check in the while condition is still correct, and we won't overwrite any live queue elements. But if we don't use tmp_t and instead write the code like this:


if (thr_p_[i].tail < min)
    min = thr_p_[i].tail;

Then the consumer could have a lower tail value while we're comparing it with min, but move it far forward after the comparison is done and just before the assignment. So we could end up with an incorrect minimum value.

I added the compiler barrier asm volatile("" ::: "memory") (a GCC-specific compiler barrier) to make sure that the compiler won't reorder the thr_p_[i].tail access and will read the memory location only once, to load its value into tmp_t.

One important thing about the array is that it must be indexed by the current thread identifier. Because POSIX threads (and consequently the C++ threads that use them) do not use small monotonically increasing values for thread identifiers, we need our own thread identifiers. I'll use the inline thr_pos() method of the queue to access the array elements:


ThrPos& thr_pos() const
{
    return thr_p_[ThrId()];
}

You can find an example of the ThrId() implementation in the source referenced at the end of the article.
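
The version in the source may differ, but a minimal sketch of the idea, assuming each thread simply claims the next free slot on its first call, could look like this:


// A sketch of one possible ThrId(): each thread atomically claims the
// next small index on its first call and caches it in a thread-local
// variable, producing the dense identifiers 0, 1, 2, ... that we need
// for array indexing (thread_local requires C++11).
inline size_t ThrId()
{
    static size_t counter = 0;
    thread_local size_t id = __sync_fetch_and_add(&counter, 1);
    return id;
}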

Before writing the final implementation of push() and pop(), let's go back to the initial application of our queue, the work queue. Usually, producers and consumers do a lot of work between operations with the queue. For instance, it could be a very slow IO operation. So, what happens if one consumer fetches an element from the queue and goes to sleep during a long IO operation? Its tail value will stay the same for a long time, and all the producers will wait on it even though all the other consumers have fully cleared the queue. This is not desired behavior.

Let's fix this in two steps. First, let's assign to the per-thread tail pointer the maximum allowed value just after fetching the element. So, we should write the following at the end of the pop() method:


T *ret = ptr_array_[thr_pos().tail & Q_MASK];
thr_pos().tail = ULONG_MAX;
return ret;

Because a producer in push() starts its search for the minimal allowed value of last_tail_ from the current value of the global tail_, it can assign the current tail_ value to last_tail_ only when there are no active consumers, which is exactly what we want.

Generally speaking, other processors can see the thr_pos().tail update before the local processor reads from ptr_array_, so they can move and overwrite the position in the array before the local processor reads it. This is possible on processors with relaxed memory operation ordering. However, x86 provides relatively strict memory ordering rules. In particular, it guarantees that 1) stores are not reordered with earlier loads and 2) stores are seen in a consistent order by other processors. Thus, loading from ptr_array_ and storing to thr_pos().tail in the code above will be executed on x86, and seen by all processors, in exactly this order.
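
On a weakly ordered architecture, however, we would have to enforce this order ourselves. A sketch of what the end of pop() would need there (the barrier is redundant on x86):


T *ret = ptr_array_[thr_pos().tail & Q_MASK];

// On weakly ordered CPUs (e.g., ARM or POWER), a full memory barrier
// is required so the store to thr_pos().tail cannot become visible to
// producers before the load from ptr_array_ has completed.
__sync_synchronize();

thr_pos().tail = ULONG_MAX;
return ret;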

The second step is to set thr_pos().tail correctly at the beginning of pop(). We assign the current thr_pos().tail with:


thr_pos().tail = __sync_fetch_and_add(&tail_, 1);

The problem is that the operation is atomic only for the tail_ shift, not for the thr_pos().tail assignment. So there is a time window in which thr_pos().tail still equals ULONG_MAX while tail_ has already been shifted significantly by other consumers, so push() will set last_tail_ to the current, just-incremented tail_. So when we're going to pop an element, we have to reserve a tail position less than or equal to the tail_ value from which we'll pop the element:


thr_pos().tail = tail_;
thr_pos().tail = __sync_fetch_and_add(&tail_, 1);

In this code, we actually perform the following three operations:

  • Write tail_ to thr_pos().tail.

  • Increment tail_.

  • Write the previous value of tail_ to thr_pos().tail.

Again, in the general case, we have no guarantee that other processors will "see" the results of the write operations in the same order. Potentially, some other processor can read the incremented tail_ value first, try to find the new last_tail_, and only after that read the new current thread tail value. However, __sync_fetch_and_add() executes a locked instruction, which implies an implicit full memory barrier on most architectures (including x86), so neither the first nor the third operation can be reordered across the second one. Therefore, we can skip explicit memory barriers here as well.
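
For reference, the same reservation can be written with the newer GCC __atomic builtins (GCC 4.7 and later), which make the ordering requirement explicit. This is a sketch, not the code from the article:


// Sketch with the __atomic builtins: __ATOMIC_SEQ_CST on the fetch-add
// provides the same full-barrier behavior as __sync_fetch_and_add(),
// so neither surrounding write can be reordered across it.
thr_pos().tail = __atomic_load_n(&tail_, __ATOMIC_RELAXED);
thr_pos().tail = __atomic_fetch_add(&tail_, 1, __ATOMIC_SEQ_CST);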

Thus, if the queue is almost full, all producers will stop at or before the position of the element that we're popping.

Now let's write our final implementation of the push() and pop() methods:


void push(T *ptr)
{
    thr_pos().head = head_;
    thr_pos().head = __sync_fetch_and_add(&head_, 1);

    while (__builtin_expect(thr_pos().head >=
                            last_tail_ + Q_SIZE, 0))
    {
        ::sched_yield();

        auto min = tail_;
        for (size_t i = 0; i < n_consumers_; ++i) {
            auto tmp_t = thr_p_[i].tail;

            asm volatile("" ::: "memory"); // compiler barrier

            if (tmp_t < min)
                min = tmp_t;
        }
        last_tail_ = min;
    }

    ptr_array_[thr_pos().head & Q_MASK] = ptr;
    thr_pos().head = ULONG_MAX;
}

T *pop()
{
    thr_pos().tail = tail_;
    thr_pos().tail = __sync_fetch_and_add(&tail_, 1);

    while (__builtin_expect(thr_pos().tail >=
                            last_head_, 0))
    {
        ::sched_yield();

        auto min = head_;
        for (size_t i = 0; i < n_producers_; ++i) {
            auto tmp_h = thr_p_[i].head;
       
            asm volatile("" ::: "memory"); // compiler barrier

            if (tmp_h < min)
                min = tmp_h;
        }
        last_head_ = min;
    }

    T *ret = ptr_array_[thr_pos().tail & Q_MASK];
    thr_pos().tail = ULONG_MAX;
    return ret;
}

Careful readers will notice that multiple threads can scan the current head or tail values of all the producing or consuming threads at the same time. So several threads can find different min values and try to write them to last_head_ or last_tail_ simultaneously, which suggests using a CAS operation here. However, atomic CAS is expensive, and the worst that can happen is that we assign too small a value to last_head_ or last_tail_. If we overwrite a newer, higher value with a smaller, older one, we'll just fall into sched_yield() again. We may fall into sched_yield() more frequently than if we used a synchronized CAS operation, but in practice the cost of the extra atomic operation reduces performance more.
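
For comparison, the rejected CAS variant would look roughly like this; it guarantees that last_tail_ never moves backward, at the price of a locked instruction on every update (a sketch):


// The rejected alternative: publish min only if it actually advances
// last_tail_. __sync_val_compare_and_swap() returns the value it found,
// so we retry until we either win or see a value that is already higher.
auto cur = last_tail_;
while (cur < min) {
    auto prev = __sync_val_compare_and_swap(&last_tail_, cur, min);
    if (prev == cur)
        break;      // we published min
    cur = prev;     // last_tail_ moved; re-check against min
}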

Also, I used __builtin_expect with a zero expect argument to say that we do not expect the condition in the while statement to become true frequently, so the compiler should move the inner loop code after the code executed when the condition is false. This way, we can improve instruction cache usage.

Finally, let's run the same test as for the naive queue:


# time ./a.out 

real    1m53.566s
user    27m55.784s
sys     2m4.461s

This is 3.7 times faster than our naive queue implementation!

Conclusion

Nowadays, high-performance computing is typically achieved in two ways: horizontal scaling (scale-out) by adding new computational nodes, and vertical scaling (scale-up) by adding extra computational resources (like CPUs or memory) to a single node. Unfortunately, linear scaling is possible only in theory. In practice, if you double your computational resources, you are likely to get only a 30–60% performance gain. Lock contention is one of the problems that prevents efficient scale-up by adding extra CPUs. Lock-free algorithms make scale-up more productive and allow you to get more performance from multicore environments.

The code for naive and lock-free queue implementations with the tests for correctness is available at https://github.com/krizhanovsky/NatSys-Lab/blob/master/lockfree_rb_q.cc.

Acknowledgement

Special thanks to Johann George for the final review of this article.

______________________

Alexander Krizhanovsky is the software architect and founder of NatSys-Lab. Before NatSys-Lab, he worked as a Senior Software Developer at IBM, Yandex and Parallels. He specializes in high-performance solutions for UNIX environments.

Comments


what's the solution for (Anonymous)

What's the solution for the compile error?


compiling (John Ellson)

Doesn't compile for me using: g++ (GCC) 4.8.1 20130603 (Red Hat 4.8.1-1)

The error messages fill a screen, but the first few lines are:

lockfree_rb_q.cc: In member function ‘void NaiveQueue::push(T*)’:
lockfree_rb_q.cc:72:31: error: capture of non-variable ‘NaiveQueue::head_’
    cond_overflow_.wait(lock, [&head_, &tail_]() {
                               ^
lockfree_rb_q.cc:98:17: note: ‘long unsigned int NaiveQueue::head_’ declared here
    unsigned long head_, tail_;
                  ^
lockfree_rb_q.cc:72:39: error: capture of non-variable ‘NaiveQueue::tail_’
    cond_overflow_.wait(lock, [&head_, &tail_]() {
                                       ^
lockfree_rb_q.cc:98:24: note: ‘long unsigned int NaiveQueue::tail_’ declared here
    unsigned long head_, tail_;
                         ^

Suggestions?

Fixed (A.Krizhanovsky)

John, thank you for the bug report!

I've fixed the compilation errors for GCC 4.8. Please fetch the new version of the code from GitHub.
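
The error itself comes from the lambda's capture list: a C++ lambda cannot capture class data members such as head_ and tail_ by reference directly, so capturing this instead is the usual fix. A sketch (the predicate body is illustrative, not necessarily the exact code from the source):


// A lambda cannot capture non-static data members like head_ and
// tail_ in its capture list; capturing the enclosing object works.
// The wait predicate shown here is illustrative.
cond_overflow_.wait(lock, [this]() {
        return head_ - tail_ < Q_SIZE;
});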


Temporal variable (Anonymous)

Do you really mean temporal variable, or just a temporary variable?

I had never heard of a temporal variable before reading this article, so I did some googling. Of the hits I checked, at least one clearly meant temporary variable (and, at the time I looked, that article itself used the phrase temporary variable). The only hit that suggested a distinct meaning was about something else entirely:

'
On the semantics of (Bi)temporal variable databases - Springer
link.springer.com/chapter/10.1007%2F3-540-57818-8_53
Numerous proposals for extending the relational data model to incorporate the
temporal dimension of data have appeared during the past several years.
'

My point (especially as an old guy trying to keep up with some of this stuff) is that it sure would help if terminology didn't change unnecessarily. If the variable in this article truly is something more or different than a temporary variable, fine (but then please provide a definition or a pointer to one); if it is no different, then please just use "temporary variable".

Misprint (A.Krizhanovsky)

Yes, sure, I did mean temporary variable, not temporal variable. Thank you for pointing out the error.


Inaccuracy (alastair)

The article claims that the "mfence" in the sequence

A
mfence
B

will cause the instruction "A" to execute before the instruction "B".

This is untrue (and is explicitly contradicted by the Intel manuals, which state categorically that “mfence does not serialize the instruction stream”; i.e., the instructions can still execute out of order).

The mfence will cause memory accesses before the fence to complete before memory accesses after the fence (more accurately, it causes memory accesses before the fence to become globally visible, i.e., apparent to other cores in the system, before those after the fence).


Simplification (A.Krizhanovsky)

This was just a simplification for a gentle introduction to memory ordering and to when and why barriers are used. Unfortunately, the article has a limited size, so there was no opportunity to describe this and some other interesting points carefully and fully.


Re: Inaccuracy (digitas)

A
mfence
B

If the effects of instructions A and B are to be visible outside the processor core, they must somehow access the memory (or, to be precise, the cache). The article explains inter-core and inter-processor relations, so IMHO the explanation in the article is a slight simplification, but it is not inaccurate.
