Kaspar Daugaard’s Blog

How To Make Your SPSC Ring Buffer Do Nothing, More Efficiently

September 5, 2018 – Posted in Programming

In my original post Writing a Fast and Versatile SPSC Ring Buffer I suggested that there was a better way of waiting for data to be written, and buffer space to be available, than just continually checking the other thread’s progress. Now I am going to elaborate and get into the dirty details.

A marginal improvement is to tell the processor that’s doing the waiting to use fewer resources. On Windows you would use YieldProcessor which on Intel CPUs translates to the pause instruction. You can find similar instructions on other architectures.
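For illustration, a minimal cross-platform wrapper for that hint could look like this (a sketch; CpuRelax is my own name, not part of the original code). A busy-wait on the shared position would then call CpuRelax() on every iteration.

    #if defined(_WIN32)
        #include <windows.h>
        inline void CpuRelax() { YieldProcessor(); }      // expands to pause on x86
    #elif defined(__x86_64__) || defined(__i386__)
        #include <immintrin.h>
        inline void CpuRelax() { _mm_pause(); }           // the pause instruction
    #elif defined(__aarch64__) || defined(__arm__)
        inline void CpuRelax() { asm volatile("yield"); } // ARM's equivalent hint
    #else
        inline void CpuRelax() {}                         // no hint available
    #endif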

Yielding can free up cycles for another thread running on the same physical core (which happens when the CPU uses hyper-threading), but it doesn’t help if the thread you are waiting on isn’t currently running. Dealing with that requires whoever is scheduling the threads – typically the operating system – to be aware of the dependency. In your program, you use synchronization objects like semaphores to make a thread go idle and wake it up again when you have some useful work for it.

For those who are unfamiliar with semaphores, they essentially allow threads to wait for a signal from another thread. They are more expensive than the atomic operations I used in the original ring buffer implementation, but the advantage is that once you are waiting you no longer spend CPU cycles doing so because the thread is suspended.
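The snippets below assume a Semaphore class with signal() and wait(). A minimal stand-in built on a mutex and condition variable could look like this (a sketch for clarity only; the implementation linked at the end of the post is lighter weight):

    #include <condition_variable>
    #include <mutex>

    class Semaphore
    {
    public:
        void signal()
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            ++m_Count;
            m_Cond.notify_one();
        }

        void wait()
        {
            std::unique_lock<std::mutex> lock(m_Mutex);
            m_Cond.wait(lock, [this] { return m_Count > 0; });
            --m_Count;
        }

    private:
        std::mutex m_Mutex;
        std::condition_variable m_Cond;
        int m_Count = 0;
    };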

Ideally we only want to use semaphores when the producer or consumer thread is actually blocked. In other words, the buffer should run locklessly when not waiting and switch out of lockless mode to wait.

Setting up the semaphores

In the previous version, each thread’s shared state was simply a position. It looks like we are going to need a semaphore per thread as well, and a bool to keep track of whether the other thread is waiting for a signal.

    struct alignas(CACHE_LINE_SIZE) SharedState
    {
        std::atomic<size_t> pos;
        std::atomic<bool> shouldSignal;
        Semaphore semaphore;
    };

The idea is that we are going to check the bool every time we finish writing something or reading something:

void RingBuffer::FinishWrite()
{
    m_WriterShared.pos.store(m_Writer.base + m_Writer.pos, std::memory_order_release);
    if (m_WriterShared.shouldSignal.exchange(false))
        m_ReaderShared.semaphore.signal();
}

The call to exchange returns the existing value of shouldSignal while also resetting it to false.
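FinishRead is the mirror image (shown here as a sketch, since only the writer side appears above): the reader publishes its progress and wakes the producer if it asked for a signal.

    void RingBuffer::FinishRead()
    {
        // Publish how far the reader has got, then wake the writer if it
        // requested a signal while waiting for buffer space.
        m_ReaderShared.pos.store(m_Reader.base + m_Reader.pos, std::memory_order_release);
        if (m_ReaderShared.shouldSignal.exchange(false))
            m_WriterShared.semaphore.signal();
    }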

Requesting a signal

Instead of checking the written position in an indefinite loop, we now sleep on the consumer thread and ask the producer to wake us up the next time it has written some data:

    for (;;)
    {
        size_t writerPos = m_WriterShared.pos.load(std::memory_order_acquire);
        size_t available = writerPos - m_Reader.base;
        // Signed comparison (available can be negative)
        if (static_cast<ptrdiff_t>(available) >= static_cast<ptrdiff_t>(end))
        {
            m_Reader.end = std::min(available, m_Reader.size);
            break;
        }
        m_WriterShared.shouldSignal.store(true);
        m_ReaderShared.semaphore.wait();
    }

The producer thread goes through the same steps, but with the reader and writer state swapped. The code is symmetrical except for the fact that the writer starts out with an empty buffer to write to, so its version of available adds the size of the buffer.
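As a sketch of that symmetry (member names like m_Writer.size are assumptions based on the description, not taken from the actual source), the producer’s loop could look like this:

    for (;;)
    {
        size_t readerPos = m_ReaderShared.pos.load(std::memory_order_acquire);
        // The writer can use everything the reader has finished with plus the
        // rest of the buffer, hence the added buffer size.
        size_t available = readerPos - m_Writer.base + m_Writer.size;
        if (static_cast<ptrdiff_t>(available) >= static_cast<ptrdiff_t>(end))
        {
            m_Writer.end = std::min(available, m_Writer.size);
            break;
        }
        m_ReaderShared.shouldSignal.store(true);
        m_WriterShared.semaphore.wait();
    }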

Pseudocode review

Let’s revisit the steps in FinishWrite/FinishRead:

  1. Store position
  2. Check if other thread is waiting
  3. Signal other thread if needed

And similarly GetBufferSpaceToReadFrom/GetBufferSpaceToWriteTo:

  1. Read other thread’s position
  2. Return if there’s enough data (or empty buffer space)
  3. Request that the other thread signals us
  4. Wait for signal
  5. Loop to 1

This looks sane… or does it?

Seeing it all fall apart

You may have noticed an issue with the algorithm, especially if you are a veteran in lockless programming. It turns out that it’s possible for a thread to miss a position update before going to sleep, while the thread that was supposed to wake it up misses the request to do so. For example, the consumer could check for incoming data just before the producer updates its progress, and the producer could then check whether it should send a signal before the consumer requests one.

Let’s go through that scenario again using bullet points:

  • Consumer reads producer’s position
  • Producer updates position
  • Producer checks if consumer needs a signal
  • Consumer asks for a signal and goes to sleep

In lockless code, there are no guarantees as to how the instructions running on two different threads will be interleaved, so you have to assume that they will be interleaved in the worst possible way. In fact, it’s even worse since in this case both threads may see an older version of the opposite thread’s position.

I said earlier that the positions are written and read using release/acquire ordering, which only guarantees that they are consistent with the data they refer to. It does not mean that the values are visible instantaneously. Sadly, it’s perfectly possible for both threads to write out their new positions, check each other’s positions, and fail to see the latest progress. As you can imagine, this leads to deadlocks.

Changing the position to use a stricter memory order means that only one thread would initially get stuck, but that’s not a great solution since one stuck thread can still cause a deadlock. For example, you might wait for the consumer thread to finish a particular task by sending a signal back to the producer when it’s done. However, if the consumer thread got stuck while receiving the task data, you are going to wait forever if you wait around without sending more data.

Another scenario for a deadlock is when the producer waits due to the buffer being full, and the consumer frees up the buffer quickly enough that it doesn’t notice the producer waiting. This would likely only happen if you filled up the entire buffer with one big element. In this case the consumer never wakes up the producer thread, which never produces more data for the consumer thread, which means both threads are forever stuck waiting.

Picking up the pieces

Fortunately, there is a solution. It involves checking the position again after requesting a signal and before going to sleep:

    for (;;)
    {
        size_t writerPos = m_WriterShared.pos.load(std::memory_order_acquire);
        size_t available = writerPos - m_Reader.base;
        // Signed comparison (available can be negative)
        if (static_cast<ptrdiff_t>(available) >= static_cast<ptrdiff_t>(end))
        {
            m_Reader.end = std::min(available, m_Reader.size);
            break;
        }
        m_WriterShared.shouldSignal.store(true);
        if (writerPos != m_WriterShared.pos.load(std::memory_order_relaxed))
        {
            // Position changed after we requested signal, try to cancel
            if (m_WriterShared.shouldSignal.exchange(false))
                continue; // Request successfully cancelled
        }
        m_ReaderShared.semaphore.wait();
    }

In case we missed an update, we see if we can cancel the request for a signal by changing the bool back to false. We then jump back to the beginning of the for loop and check if we got the data we needed. It’s possible that the producer already saw the request and preemptively woke us up, so in that case we fall through to the semaphore wait() before starting over.

The reason this works is that either the producer’s position becomes visible before the consumer sets shouldSignal to true, in which case the consumer picks up the latest position on its second attempt, or the producer sees that shouldSignal was already set to true and signals the consumer.

Anything that happens before or after setting/checking shouldSignal will be observed to happen in that order on either thread. This is because shouldSignal uses the default memory order for atomic operations (sequentially consistent), which is stricter than the ordering used for the positions.

Call for comments

Checking the position again seems like a fairly obvious solution to a race condition in a lockless ring buffer, and almost a little too easy. I assume that others have used the same approach with varying degrees of confidence in its robustness.

Part of my motivation for publishing this is to have more people examine it and either tell me why it shouldn’t work or that it does indeed work. I wrote a ring buffer based on the same concept for Unity’s multithreaded rendering, so even if it isn’t supposed to work in theory, at least I can prove that it works in practice.

It’s quite likely that there is literature out there on how to block when a lockless data structure is empty or at capacity, or indeed that Dijkstra wrote a paper about it in the 1970s (Dijkstra invented the semaphore concept). However, it seems like the kind of problem that almost everyone ignores.

Conclusion

I find it intensely satisfying that you can get the benefits of a lockless ring buffer while also using proper synchronization for waiting instead of keeping the CPU busy. It feels better to not burn a lot of cycles without accomplishing anything.

Compared to the original algorithm, calling FinishWrite() and FinishRead() has become more expensive because they now use an atomic exchange operation – in addition to the atomic store with release ordering, which is just a regular store on Intel CPUs. Expect about an order of magnitude difference. The best usage of this ring buffer is to not call those functions for tiny amounts of data, but amortize the cost over slightly bigger chunks so you can still push data through fast.
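For example, instead of publishing after every single element, the producer can batch a chunk of values per FinishWrite (a sketch using the API from the original post; queue, values and count stand in for your own buffer and data):

    // Amortize the cost of FinishWrite over chunks of 16 values.
    for (size_t i = 0; i < count; i += 16)
    {
        size_t n = std::min<size_t>(16, count - i);
        queue.WriteArray(&values[i], n);
        queue.FinishWrite();
    }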

I think that despite the added cost, this version is a significant improvement if you care about stable performance and are not in complete control of the environment you are running in. Unless you always have free processors for both threads to run on, you can get pathologically bad performance hiccups when the thread that’s allowed to run has to wait on the thread that’s not running.

In case you really care about performance, look for situations where you don’t need to guarantee that the data is consumed right away. For example, when you know that you will always write more data later, even if the consumer thread nodded off the last time you sent it some data.

Source code

The full source code is here. It depends on a semaphore class, e.g. Jeff Preshing’s, which is here.


Writing a Fast and Versatile SPSC Ring Buffer – Performance Results

August 13, 2018 – Posted in Programming

In the previous post I proposed an API for a ring buffer that can be used for a variety of different types of data. I also suggested that it’s possible to implement without sacrificing efficiency compared to ring buffers that only support one type of data.

The secret was supposedly keeping thread local state and shared state in separate cache lines and only accessing the shared state when you absolutely have to. So how big a difference does this make?

I wrote a benchmark which writes a billion integers into a 1 megabyte ring buffer on one thread while simultaneously reading the integers on another thread. Here are the results from my MacBook Pro:

Cache line separation    Integers per Finish call    Time in seconds
No                       1                           36.0
No                       16                          7.8
Yes                      1                           1.7
Yes                      16                          1.2
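The benchmark has roughly this shape (a sketch; the thread setup and names are mine rather than the exact benchmark code):

    #include <thread>

    constexpr size_t kCount = 1000000000; // one billion ints

    void RunBenchmark(RingBuffer& queue)
    {
        std::thread producer([&queue] {
            for (size_t i = 0; i < kCount; i++)
            {
                queue.Write(static_cast<int>(i));
                queue.FinishWrite(); // one int per Finish call
            }
        });
        std::thread consumer([&queue] {
            for (size_t i = 0; i < kCount; i++)
            {
                int value = queue.Read<int>();
                (void)value;
                queue.FinishRead();
            }
        });
        producer.join();
        consumer.join();
    }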

Note: You can find the full implementation of the buffer here. I left out some details in the last post, so you may notice that LocalState contains additional variables. I store the buffer pointer and size redundantly in the reader and writer state, since this uses less memory than sharing that data, which would introduce an extra cache line. Pretty unintuitive!

I removed alignment support for the benchmark and that shaved off a fraction of a second. The versions with and without cache line separation are identical except for the memory layout of the ring buffer state, which obviously makes a huge difference.

The second column is how often FinishWrite() and FinishRead() are called. There’s a decent incremental improvement in calling them less frequently even when the data is laid out well.

Checking out the competition

I was curious how these results compare to other implementations, so I looked at one described here, which is a fairly simple fixed element size ring buffer. I changed the element type to int, fixed various typos in the code and ran the same test of pushing through a billion integers. This takes a little over 100 seconds.

What’s going on? You might notice that both ring buffers use std::atomic but my version uses “release” (and “acquire”) memory order while the other uses the default memory order. The default is “sequentially-consistent ordering” which offers some strict guarantees about the globally observed order of changes to variables. Those guarantees shouldn’t be needed for SPSC ring buffers. Both versions are “lockless” but there are lots of subtleties in how to write fast lockless algorithms.
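The cost difference shows up directly in the generated code (a simplified illustration): on x86 a sequentially-consistent store compiles to a locked exchange or a store followed by a fence, while a release store is an ordinary mov.

    std::atomic<size_t> pos;
    size_t p = 42;

    pos.store(p);                             // seq_cst: xchg or mov + mfence on x86
    pos.store(p, std::memory_order_release);  // release: plain mov, sufficient for SPSC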

Changing the other ring buffer’s memory order makes it run in about 30 seconds. I haven’t looked closely at whether it’s safe, but it worked well enough for the benchmark. There appeared to be an off-by-one error originally, but after replacing the <= size comparison with < size the same billion values come out as were put in.

I can improve the speed further by replacing the modulo of the size with & (size - 1) based on the fact that I know the size is a power of two. That brings the time down to about 9.5 seconds. Beyond that there’s not much I can do without essentially rewriting the code. Even though it looks simpler, it’s actually quite a bit slower than my proposal for a ring buffer that supports variable element sizes.
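For reference, the transformation is simply this (only valid when the size is a power of two):

    // pos % size and pos & (size - 1) give the same result when size is a
    // power of two, but the bitwise AND avoids an integer division.
    size_t Wrap(size_t pos, size_t size)
    {
        return pos & (size - 1);
    }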

Boost

Boost has another fixed element size SPSC ring buffer implementation. There’s padding so the read and write positions occupy different cache lines, but there’s also a lot of contention between the threads since they access the positions simultaneously. It runs the billion integer test in 42 seconds out of the box.

I figured that this was probably because of poor inlining, since the code seems fairly well written, so I tried again with -Ofast optimizations. For the other tests I ran release builds with default settings and manually forced the buffer functions to be inlined. Xcode defaults to optimization level -Os which balances speed with small size. -Ofast brings the time down to about 12 seconds.

The Boost ring buffer does support writing arrays, so you can create an spsc_queue of char and serialize various types of data that way. However, you can’t directly copy C++ objects into the buffer and use them in-place when reading from the buffer.

Memcpy

It looks like the approach I describe is pretty fast compared to alternative ring buffer implementations, but how close is it to the theoretical limit? For example, the speed of memcpy. I created a benchmark that copies integers in 1 gigabyte blocks, first into a shared buffer on the producer thread and then out of the shared buffer on the consumer thread. This takes about 1.1 seconds per billion integers.

Note that I’m cheating a little, since the 1 megabyte ring buffer I used is small enough to stay in the L3 cache on my test hardware (but not L2 cache). The memcpy speed is for uncached memory, since I’m copying a gigabyte at a time. Like with the other tests, I try to get a reliable measurement by running it in a loop and only looking at the result after it’s stabilized.

Conclusion

It turns out that ring buffers can get pretty close to memcpy speed. Here is a summary of the results of writing a billion ints on one thread and reading them on another thread:

Ring buffer                     Notes                Time in seconds
spsc_lockless_sequential.cpp    Original             >100
spsc_lockless_sequential.cpp    Optimized            9.5
spsc_queue.hpp                  Xcode default        42
spsc_queue.hpp                  LLVM -Ofast          12
RingBuffer_v1.hpp               1 int at a time      1.7
RingBuffer_v1.hpp               16 ints at a time    1.2
memcpy()                        Gigabyte blocks      1.1

It’s obviously quite a stress test that both the producer and consumer thread are constantly accessing the buffer, so in other scenarios the differences would be smaller.


Writing a Fast and Versatile SPSC Ring Buffer

August 13, 2018 – Posted in Programming

I wanted to share some notes about how to write a circular queue/ring buffer designed for a single producer thread and a single consumer thread (SPSC). What’s more, I will try to describe a ring buffer that can be used for lots of different things and is still very fast. Quite often you want to use the buffer for commands that can have variable sizes. However, most ring buffers that I have seen published only seem to support fixed-size elements.

Ideally you should be able to send anything through the buffer without incurring a performance cost for that extra flexibility. You can have data that varies both in size and what alignment it requires. With a bit of luck, we can support that and still beat other ring buffers on speed. I am assuming that you’re using a language like C++ that lets you be explicit about things like memory layout and thread synchronization.

Another design goal is that data should be available for reading as soon as the producer wants it to be available. We don’t want to wait until the producer thread has filled out a pre-defined “page size” before we can consume it. There may be reasons why you would want to use pages, but here we are optimizing for low latency. Low response time is often more important than throughput in real-time applications like games and VR.

Usage

Let’s look at a sample use case. It turns out that you may want separate function calls for writing data and actually submitting the data to the consumer thread. Likewise, you may want separate calls for reading data and telling the producer that you finished reading it. For example, your producer thread can write commands like this:

// Send command to process array of items.
queue.Write(Command::ConsumeItems);
queue.Write(itemCount);
queue.WriteArray(items, itemCount);
queue.FinishWrite();

Processing the commands looks something like this:

Command cmd = queue.Read<Command>();
switch (cmd)
{
    case Command::ConsumeItems:
        int itemCount = queue.Read<int>();
        const Item* items = queue.ReadArray<Item>(itemCount);
        ConsumeItems(items, itemCount);
        queue.FinishRead();
        break;
}

The consumer thread is allowed to access the items in the buffer until it explicitly calls FinishRead(). This avoids having to copy the items out of the buffer. The finish functions don’t need to be called at the same frequency, but they do need to be called often enough that the buffer doesn’t fill up with unfinished data.

Wrapping

For simplicity, I will require the size of the buffer to be a power of two. It’s certainly possible to support non-power-of-two sizes and still make it fast, but it makes the wrapping logic that much more complicated. On the other hand, I will not cut corners on making sure that the buffer supports writing an unlimited amount of data. Sometimes that requires taking extra care with comparisons. It would be a tragedy if a size_t overflow caused the buffer to break down, even if that’s a lot of bytes on 64 bit computers.

When you allow variable-sized elements, you have to decide what to do about elements that partially go past the end of the buffer. One approach is to use a magic ring buffer where the virtual memory past the end of the buffer is mapped to the beginning of the buffer. Another approach is to just leave a gap at the end of the buffer and start over from the beginning. I find that totally reasonable, especially if the buffer is large compared to the element sizes.

Note that if you leave gaps at the end of the buffer, it’s important that you read exactly the same element sizes as you originally wrote. If you write 4 bytes twice, you can’t read those elements as one 8 byte element, because that would behave differently when wrapping.

Buffer API

In the code example above, I introduced a Write() function to write a single element and a similar Read() function. There was another pair of functions for arrays, WriteArray() and ReadArray(). I obviously promised to be able to write various kinds of data, so these are all template functions that take any copy-constructible type.

The previous functions can be implemented using a pair of lower-level functions that prepare a number of bytes in the buffer for writing (or reading). They also need to be aware of the alignment, since certain types need to be aligned to N-byte boundaries.

Consider something like this:

class RingBuffer
{
public:
    // Allocate buffer space for writing.
    void* PrepareWrite(size_t size, size_t alignment);

    // Publish written data.
    void FinishWrite();

    // Write an element to the buffer.
    template <typename T> void Write(const T& value)
    {
        void* dest = PrepareWrite(sizeof(T), alignof(T));
        new(dest) T(value);
    }

    // Write an array of elements to the buffer.
    template <typename T> void WriteArray(const T* values, size_t count)
    {
        void* dest = PrepareWrite(sizeof(T) * count, alignof(T));
        for (size_t i = 0; i < count; i++)
            new(static_cast<T*>(dest) + i) T(values[i]);
    }

The new expression here is C++’s placement new, which doesn’t allocate memory but initializes an unused piece of memory with a copy of an object. alignof is a C++11 keyword similar to sizeof, except it returns the required alignment of a type.

Here’s what the API for reading would look like:

    // Get read pointer. Size and alignment should match written data.
    void* PrepareRead(size_t size, size_t alignment);

    // Finish and make buffer space available to writer.
    void FinishRead();

    // Read an element from the buffer.
    template <typename T> const T& Read()
    {
        void* src = PrepareRead(sizeof(T), alignof(T));
        return *static_cast<T*>(src);
    }

    // Read an array of elements from the buffer.
    template <typename T> const T* ReadArray(size_t count)
    {
        void* src = PrepareRead(sizeof(T) * count, alignof(T));
        return static_cast<T*>(src);
    }

Local and shared state

The first step to writing a fast ring buffer is not being clever with atomic operations, but thinking about what state is required by the producer and consumer. Each thread needs some information that is local to itself and some that is shared.

For optimal performance, you want to avoid reading data that may be simultaneously written by another thread. In the API above, you ideally want PrepareWrite() and PrepareRead() to only access local state, since it will be much faster when both threads are active. Local state can be kept fully in the L1 cache, or even in registers, while changes to shared state have to be expensively synchronized across different processors.

Separating local and shared state is trickier than you might expect because of the risk of false sharing where values that are close to each other in memory affect each other’s read and write performance. In situations like this, where the memory layout can make a big difference, you want to force each type of state to take up a multiple of entire cache lines.

Assume we have a define for the typical cache line size – a reasonably safe value to use on modern platforms is 64 bytes. An initial version of the producer and consumer threads’ local state might look like this:

    struct alignas(CACHE_LINE_SIZE) LocalState
    {
        char* buffer;
        size_t pos;
        size_t end;
    };

    LocalState m_Writer;
    LocalState m_Reader;

It makes sense that you want a pointer to the buffer memory and the writer or reader’s current position, but what about the end variable? You could store the opposite thread’s last known (finished) position, but then things get complicated when it goes past the end of the buffer and wraps to the beginning. What you want is a fast way to decide if you can write or read data right now with as little logic as possible. For that purpose it’s better that pos and end represent the window in the buffer that’s immediately available without wrapping or checking the other thread’s latest state.

The function for allocating buffer space for writing ends up looking something like this, with a single branch for handling non-trivial cases in a separate function:

void* RingBuffer::PrepareWrite(size_t size, size_t alignment)
{
    size_t pos = Align(m_Writer.pos, alignment);
    size_t end = pos + size;
    if (end > m_Writer.end)
        GetBufferSpaceToWriteTo(pos, end);
    m_Writer.pos = end;
    return m_Writer.buffer + pos;
}

Moving complexity like wrapping the position and waiting for buffer space to be available into GetBufferSpaceToWriteTo() means we can aggressively inline PrepareWrite() without bloating code size.

Alignment

There’s one thing left in the function above which is fairly annoying, and that is spending time aligning the position for every write. In trivial cases this may be optimized away by the compiler, if it can see that the previous write was aligned. You can also ignore alignment altogether if you never write data which has to be aligned to a buffer that might be misaligned.

Another approach that I quite like is rounding all sizes up to a multiple of a minimum alignment, e.g. 4 bytes, so you don’t need to realign the position for types whose alignof is less than or equal to that. This requires PrepareWrite() to be inlined so the compiler knows about the alignment at compile time.
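For completeness, the Align call used in PrepareWrite above can be implemented along these lines (a sketch, assuming alignment is a power of two):

    inline size_t Align(size_t pos, size_t alignment)
    {
        // Round pos up to the next multiple of alignment.
        return (pos + alignment - 1) & ~(alignment - 1);
    }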

Synchronizing state

Note that none of the code so far needed to synchronize values across different threads, because all the data was local. The only state that needs to be shared is the finished position for the writer and reader. I will be using std::atomic along with alignas to prevent false sharing:

    struct alignas(CACHE_LINE_SIZE) SharedState
    {
        std::atomic<size_t> pos;
    };

    SharedState m_WriterShared;
    SharedState m_ReaderShared;

FinishWrite() will look something like this:

void RingBuffer::FinishWrite()
{
    m_WriterShared.pos.store(m_Writer.pos, std::memory_order_release);
}

See this page on cppreference.com for a discussion about the different types of memory order. It’s worth noting that release ordering comes for free on strongly-ordered platforms like Intel x86, so the store doesn’t generate different code from a regular store or load. Using std::atomic is still a good idea for documenting the code and preventing the compiler from reordering writes and reads (which it is allowed to do for non-volatile variables).
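On the consumer side, the matching load uses acquire ordering, so everything written to the buffer before the producer’s release store is visible once the new position is read (a fragment; the full read path is in the source):

    size_t writerPos = m_WriterShared.pos.load(std::memory_order_acquire);
    // Data up to writerPos can now be read safely.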

Blocking

The read functions in the proposed API may have to block until some data has been written, and the write functions may block because the buffer is full. An alternative API could have used TryWrite() and TryRead() functions, which are allowed to fail when running out of data or buffer space, but that gets tedious if you want to communicate any complex information through the ring buffer.

How do you actually implement waiting for the opposite thread? The simplest way is to just keep checking the shared position in a loop. This is mostly fine if you know the producer and consumer threads are running on different processors, although you will burn unnecessary CPU cycles.

Waiting in a tight loop is a bad idea if the producer and consumer happen to be scheduled on the same processor. In that case, you might spend a lot of CPU time on the wrong thread while the thread being waited on is prevented from running.

It turns out that this is solvable, but that the solution is not trivial. You can wait on an event or semaphore, though you only want to do that when you are actually blocked, since doing it all the time would be far too slow. The tricky part is how to request that the event should be signalled. I may get back to that in a later post.

Next up: Performance Results.

© 2018 Kaspar Daugaard