Designing High-Performance Thread-Safe Counters
In this article I will explore several implementations of thread-safe counters and how they can be optimized to increase throughput. We will go from a naive unsynchronized counter, through mutex-based approaches, to high-performance sharded counters, showing how cache alignment, false sharing, and wait-free design can drastically affect performance, especially as the number of threads scales up.
If you are implementing metrics counters, instrumentation probes, or concurrent accumulators, this is directly applicable.
Let’s start with a simple, single-variable counter shared between threads:
class CounterNaive {
private:
    int v;
public:
    CounterNaive(int _numThreads) : v(0) {}
    int64_t inc(int tid) { return v++; }
    int64_t read() { return v; }
};
The inc() method uses the v++ post-increment operator without any synchronization. At first glance, this looks like it should work: every thread increments the counter, and we expect the final value to match the total number of increments.
Here’s how we tested it:
CounterNaive counter(10);
std::vector<std::thread> threads;
std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> dist(1'000'000, 1'000'000'000);
for (int tid = 0; tid < 10; ++tid) {
    // Draw the iteration count before spawning the thread:
    // std::mt19937 is not thread-safe, so rng must not be
    // shared across threads.
    const auto num_inc = dist(rng);
    threads.emplace_back([&, tid, num_inc]() {
        for (int i = 0; i < num_inc; ++i) {
            counter.inc(tid);
        }
        std::println("tid: {}, num_inc: {}", tid, num_inc);
    });
}
for (auto& t : threads) t.join();
std::println("Counter Value: {}", counter.read());
Outputs:
tid: 1, num_inc: 800992823
tid: 3, num_inc: 914215200
tid: 6, num_inc: 288466445
tid: 7, num_inc: 397029764
tid: 2, num_inc: 701856543
tid: 8, num_inc: 564335446
tid: 9, num_inc: 914107421
tid: 8, num_inc: 426239428
tid: 10, num_inc: 860952345
tid: 8, num_inc: 137102172
Counter Value: 1710330291
Sum of all increments: 6,005,297,587
Reported counter value: 1,710,330,291
This is a massive discrepancy: over 4 billion increments were lost to data races. The v++ operation is not atomic; it reads the current value, increments it, and writes it back. When multiple threads do this concurrently, they can overwrite each other's updates, leading to lost increments.
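To make the failure mode concrete, here is a minimal sketch of the steps hidden inside the unsynchronized v++ (the expansion and the interleaving in the comments are illustrative, not the exact machine code):

int v = 0; // shared, unsynchronized

// Illustrative expansion of v++ into its read-modify-write steps:
int inc_expanded() {
    int tmp = v; // 1. read the current value
    v = tmp + 1; // 2. increment and 3. write back
    return tmp;
    // If two threads both read v == 5 before either writes,
    // both store 6 and one increment is silently lost.
}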
To eliminate the data races we saw in the naive implementation, the simplest solution is to serialize access to the counter using a std::mutex. This guarantees that only one thread can update or read the counter at a time:
class CounterLocked {
private:
    int64_t state;
    std::mutex state_mutex;
public:
    CounterLocked(int _numThreads) : state(0) {}
    int64_t inc(int tid) {
        const std::lock_guard<std::mutex> g(state_mutex);
        return state++;
    }
    int64_t read() {
        const std::lock_guard<std::mutex> g(state_mutex);
        return state;
    }
};
Testing it again:
tid: 6, num_inc: 111514906
tid: 8, num_inc: 131574102
tid: 7, num_inc: 166769447
tid: 4, num_inc: 243170435
tid: 0, num_inc: 336813736
tid: 1, num_inc: 375111202
tid: 9, num_inc: 543642060
tid: 2, num_inc: 820265280
tid: 3, num_inc: 876108386
tid: 5, num_inc: 873528855
Counter Value: 4478498409
Sum of all increments: 4,478,498,409
Reported counter value: 4,478,498,409
While the mutex-based version is correct, it comes at a cost: performance. Each inc() call now involves locking and unlocking a std::mutex, which means every increment pays the overhead of acquiring and releasing the lock, and all threads serialize on a single point of contention.
To compare different approaches, we need to quantify that slowdown.
To fairly compare counter implementations, we use a consistent and structured benchmarking harness. Here’s how it works:
template <class CounterType>
struct experiment_detail {
    ...
    int64_t numThreads;
    int64_t millsToRun;
};
This experiment_detail struct acts as the shared state for an experiment:
- It holds the actual CounterType implementation under test.
- It tracks the total number of increments performed.
- It stores the start time and run duration (in milliseconds).
- It uses a std::atomic_flag barrier to synchronize the start across all threads.
- It adds padding between members to avoid false sharing and ensure cache-line isolation for clean benchmarking (see the sketch below).
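A minimal sketch of what the elided members might look like; the member names (counter, totalIncrements, start) and the exact padding layout are assumptions for illustration, not the article's exact code:

#include <atomic>
#include <cstdint>
#include <new> // std::hardware_destructive_interference_size

template <class CounterType>
struct experiment_detail {
    CounterType counter; // the implementation under test

    // Over-alignment keeps the hot members on separate cache lines,
    // so the harness itself does not introduce false sharing.
    alignas(std::hardware_destructive_interference_size)
    std::atomic<int64_t> totalIncrements{0};

    alignas(std::hardware_destructive_interference_size)
    std::atomic_flag start; // start barrier (clear until the run begins)

    int64_t numThreads;
    int64_t millsToRun; // run duration in milliseconds

    experiment_detail(int64_t n, int64_t ms)
        : counter(static_cast<int>(n)), numThreads(n), millsToRun(ms) {}
};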
template <class CounterType>
void run(experiment_detail<CounterType>& g);
This function:
- Spawns numThreads worker threads.
- Uses a barrier to start all threads simultaneously (reducing warm-up bias).
- Has each thread loop, calling counter->inc(tid), until the target time limit is hit (see the sketch below).
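A minimal sketch of run(), using the hypothetical experiment_detail members shown above (the real harness also records the start time and other bookkeeping omitted here):

#include <chrono>
#include <thread>
#include <vector>

template <class CounterType>
void run(experiment_detail<CounterType>& g) {
    std::vector<std::thread> workers;
    for (int tid = 0; tid < g.numThreads; ++tid) {
        workers.emplace_back([&g, tid] {
            g.start.wait(false); // block here until the flag is set
            const auto deadline = std::chrono::steady_clock::now()
                                + std::chrono::milliseconds(g.millsToRun);
            int64_t local = 0;
            while (std::chrono::steady_clock::now() < deadline) {
                g.counter.inc(tid);
                ++local;
            }
            g.totalIncrements.fetch_add(local); // publish this thread's count
        });
    }
    g.start.test_and_set(); // release every worker at the same moment
    g.start.notify_all();
    for (auto& t : workers) t.join();
}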
Full source code is available here: View on GitHub Gist
Let’s benchmark the mutex counter to see how it performs as we increase thread count:
Threads: 1 2 4 8 16 32
Ops/sec: 126M 126M 32M 33M 126M 126M
This reveals a critical scalability problem. Despite adding more threads, performance actually degrades at 4-8 threads before recovering. The mutex creates a serialization bottleneck where all threads compete for the same lock.
Using perf stat confirms this:
- 447 billion CPU cycles consumed
- Poor instructions-per-cycle ratio due to blocking
- High system time from mutex operations
The solution is to eliminate the single point of contention by sharding the counter across multiple independent counters, each with its own mutex:
struct padded_int_mutex {
    int64_t v{0};
    std::mutex m;
};

class CounterShardedLocked {
private:
    int _numThreads;
    padded_int_mutex shards[MAX_THREADS];
public:
    CounterShardedLocked(int numThreads) : _numThreads(numThreads) {}
    int64_t inc(int tid) {
        std::lock_guard<std::mutex> g(shards[tid].m);
        return shards[tid].v++;
    }
    int64_t read() {
        int64_t total = 0;
        for (int i = 0; i < _numThreads; ++i) {
            std::lock_guard<std::mutex> g(shards[i].m);
            total += shards[i].v;
        }
        return total;
    }
};
Key design element: each thread tid operates exclusively on shards[tid], eliminating inter-thread contention on the hot path.

The sharded approach delivers dramatic improvements:
Threads   Mutex (ops/sec)   Sharded (ops/sec)
1         126M              127M
2         126M              127M   ↑ Consistent
4         32M               127M   ↑ 4x improvement
8         33M               127M   ↑ 4x improvement
16        126M              127M   ↑ Stable
32        126M              127M   ↑ Stable
CPU efficiency:
- 23% fewer CPU cycles (343B vs 447B)
- Better instructions-per-cycle ratio
- Eliminated lock contention
Mutex Counter (Bottleneck):
Thread 1 ──┐
Thread 2 ──┤ → [SINGLE MUTEX] → Counter
Thread 4 ──┤ (Serialized)
Thread 8 ──┘
Sharded Counter (Parallel):
Thread 1 → [Mutex 1] → Shard 1
Thread 2 → [Mutex 2] → Shard 2
Thread 4 → [Mutex 4] → Shard 4
Thread 8 → [Mutex 8] → Shard 8
By transforming one contended resource into N independent resources, we achieve true parallelism. However, we can optimize further by addressing false sharing between shard structures - our next improvement will add proper cache line padding.
Adding proper cache line alignment to prevent false sharing shows minimal impact for this workload:
Sharded (No Padding): 127M ops/sec, 343B CPU cycles
Sharded (With Padding): 127M ops/sec, 343B CPU cycles
The performance remains essentially identical because each thread operates on its dedicated shard with minimal cross-thread interference.
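For reference, here is a minimal sketch of what the padded shard variant might look like. Note that std::hardware_destructive_interference_size (from <new>, C++17) is the constant intended for this purpose: the minimum offset between two objects needed to avoid false sharing.

#include <cstdint>
#include <mutex>
#include <new> // std::hardware_destructive_interference_size

// Aligning each shard to a cache-line boundary guarantees that two
// shards never occupy the same cache line.
struct alignas(std::hardware_destructive_interference_size) padded_int_mutex {
    int64_t v{0};
    std::mutex m;
};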
Moving beyond locks entirely, we can use atomic operations. A simple fetch_add approach:
class CounterFetchAndAdd {
private:
    std::atomic<int64_t> state{0};
public:
    CounterFetchAndAdd(int _numThreads) {} // thread count unused; kept for the common interface
    int64_t inc(int tid) {
        return state.fetch_add(1);
    }
    int64_t read() {
        return state;
    }
};
Performance Results:
Fetch-and-Add: 127M ops/sec, 171B CPU cycles, 359B instructions
The fetch-and-add counter uses 50% fewer CPU cycles than sharded locks (171B vs 343B) while maintaining similar throughput. However, it reintroduces cache line contention as all threads compete for the same atomic variable.
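As an aside (not benchmarked in the article), a monotonic counter does not need fetch_add's default sequentially consistent ordering; a relaxed fetch_add is a common further tweak, though it does nothing about the cache line contention itself:

#include <atomic>
#include <cstdint>

std::atomic<int64_t> state{0};

// Relaxed ordering is sufficient when the counter value is not used
// to synchronize access to other memory.
int64_t inc_relaxed() {
    return state.fetch_add(1, std::memory_order_relaxed);
}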
The ultimate optimization combines the best of both worlds: per-thread shards with atomic operations instead of locks:
struct alignas(std::hardware_destructive_interference_size) padded_atomic_int {
    // destructive interference size: the minimum offset between
    // objects needed to avoid false sharing
    std::atomic<int64_t> v{0};
};

class CounterShardedWaitfree {
private:
    int _numThreads;
    padded_atomic_int shards[MAX_THREADS];
public:
    CounterShardedWaitfree(int numThreads) : _numThreads(numThreads) {}
    int64_t inc(int tid) {
        return shards[tid].v++; // No locks needed!
    }
    int64_t read() {
        int64_t total = 0;
        for (int i = 0; i < _numThreads; ++i) {
            total += shards[i].v;
        }
        return total;
    }
};
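A quick usage sketch, assuming the MAX_THREADS constant is defined elsewhere in the source (e.g. constexpr int MAX_THREADS = 64; the exact value is a hypothetical here):

#include <print>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    CounterShardedWaitfree counter(kThreads);
    std::vector<std::thread> threads;
    for (int tid = 0; tid < kThreads; ++tid) {
        threads.emplace_back([&counter, tid] {
            for (int i = 0; i < 1'000'000; ++i) counter.inc(tid);
        });
    }
    for (auto& t : threads) t.join();
    // With all threads joined, read() returns the exact total: 4000000.
    std::println("Counter Value: {}", counter.read());
}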
Performance Breakthrough:

                     Ops/sec   CPU Cycles
Mutex Counter:       33M       447B
Sharded Locks:       127M      343B
Fetch-and-Add:       127M      171B
Wait-Free Sharded:   290M      [lower]   ↑ 2.3x improvement!
The wait-free sharded counter delivers:
- 2.3x higher throughput than sharded locks
- No blocking or lock contention
- True wait-free operation: threads never block
- Better scalability: throughput grows with thread count instead of collapsing under contention
Scaling Performance:
Threads: 1 2 4 8 16 32
Ops/sec: 290M 289M 565M 542M 569M 519M
At 4 threads we see 565M ops/sec, roughly double the single-thread rate, after which throughput holds steady in the 520-570M range. The cache line padding ensures each thread's atomic shard sits in its own cache line, eliminating the false sharing that would otherwise degrade performance.
Counter Type        Throughput (ops/sec)   Scalability   CPU Efficiency
─────────────────────────────────────────────────────────────────────────
Naive               Broken                 N/A           N/A
Mutex               33M                    Poor          Low (447B cycles)
Sharded Locks       127M                   Good          Medium (343B cycles)
Fetch-and-Add       127M                   Limited       High (171B cycles)
Wait-Free Sharded   290M                   Excellent     Highest
In the next article, we'll push performance even further by exploring:
- Thread pinning to specific CPU cores
- NUMA-aware memory allocation