Designing High-Performance Thread-Safe Counters
In this article I will explore several implementations of thread-safe counters and how they can be optimized to increase throughput. We will go from a naive unsynchronized counter, through mutex-based approaches, to high-performance sharded counters, showing how cache alignment, false sharing, and wait-free design can drastically affect performance, especially as the number of threads scales up.
If you are implementing metrics counters, instrumentation probes, or concurrent accumulators, this is directly applicable.
Let’s start with a simple, single-variable counter shared between threads:
class CounterNaive {
private:
    int v;
public:
    CounterNaive(int _numThreads) : v(0) {}
    int64_t inc(int tid) { return v++; }
    int64_t read() { return v; }
};
The inc() method uses the v++ post-increment operator without any synchronization. At first glance, this looks like it should work: every thread increments the counter, and we expect the final value to match the total number of increments.
Here’s how we tested it:
CounterNaive counter(10);
std::vector<std::thread> threads;
std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> dist(1'000'000, 1'000'000'000);
for (int tid = 0; tid < 10; ++tid) {
    // Draw the iteration count before spawning the thread:
    // std::mt19937 is not thread-safe, so rng must not be
    // shared across threads.
    const auto num_inc = dist(rng);
    threads.emplace_back([&, tid, num_inc]() {
        for (int i = 0; i < num_inc; ++i) {
            counter.inc(tid);
        }
        std::println("tid: {}, num_inc: {}", tid, num_inc);
    });
}
for (auto& t : threads) t.join();
std::println("Counter Value: {}", counter.read());
Outputs:
tid: 1, num_inc: 800992823
tid: 3, num_inc: 914215200
tid: 6, num_inc: 288466445
tid: 7, num_inc: 397029764
tid: 2, num_inc: 701856543
tid: 8, num_inc: 564335446
tid: 9, num_inc: 914107421
tid: 8, num_inc: 426239428
tid: 10, num_inc: 860952345
tid: 8, num_inc: 137102172
Counter Value: 1710330291
Sum of all increments: 6,005,297,587
Reported counter value: 1,710,330,291
This is a massive discrepancy: over 4 billion increments were lost to data races. The v++ operation is not atomic; it reads the current value, increments it, and writes it back. When multiple threads do this concurrently, they can overwrite each other's updates, leading to lost increments.
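To make the failure mode concrete, here is a minimal sketch of the steps hidden inside the unsynchronized v++ (the expansion and the interleaving in the comments are illustrative, not the exact machine code):

int v = 0; // shared, unsynchronized

// Illustrative expansion of v++ into its read-modify-write steps:
int inc_expanded() {
    int tmp = v; // 1. read the current value
    v = tmp + 1; // 2. increment and 3. write back
    return tmp;
    // If two threads both read v == 5 before either writes,
    // both store 6 and one increment is silently lost.
}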
To eliminate the data races we saw in the naive implementation, the simplest solution is to serialize access to the counter using a std::mutex. This guarantees that only one thread can update or read the counter at a time:
class CounterLocked {
private:
    int64_t state;
    std::mutex state_mutex;
public:
    CounterLocked(int _numThreads) : state(0) {}
    int64_t inc(int tid) {
        const std::lock_guard<std::mutex> g(state_mutex);
        return state++;
    }
    int64_t read() {
        const std::lock_guard<std::mutex> g(state_mutex);
        return state;
    }
};
Testing it again:
tid: 6, num_inc: 111514906
tid: 8, num_inc: 131574102
tid: 7, num_inc: 166769447
tid: 4, num_inc: 243170435
tid: 0, num_inc: 336813736
tid: 1, num_inc: 375111202
tid: 9, num_inc: 543642060
tid: 2, num_inc: 820265280
tid: 3, num_inc: 876108386
tid: 5, num_inc: 873528855
Counter Value: 4478498409
Sum of all increments: 4,478,498,409
Reported counter value: 4,478,498,409
While the mutex-based version is correct, it comes at a cost: performance. Each inc() call now involves locking and unlocking a std::mutex, which means every increment pays the overhead of acquiring and releasing the lock, and all threads serialize on a single point of contention.
To compare different approaches, we need to quantify that slowdown.
To fairly compare counter implementations, we use a consistent and structured benchmarking harness. Here’s how it works:
template <class CounterType>
struct experiment_detail {
    ...
    int64_t numThreads;
    int64_t millsToRun;
};
This experiment_detail struct acts as the shared state for an experiment:
- It holds the actual CounterType implementation under test.
- It tracks the total number of increments performed.
- It stores the start time and run duration (in milliseconds).
- It uses a std::atomic_flag barrier to synchronize the start across all threads.
- It adds padding between members to avoid false sharing and ensure cache-line isolation for clean benchmarking (see the sketch below).
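A minimal sketch of what the elided members might look like; the member names (counter, totalIncrements, start) and the exact padding layout are assumptions for illustration, not the article's exact code:

#include <atomic>
#include <cstdint>
#include <new> // std::hardware_destructive_interference_size

template <class CounterType>
struct experiment_detail {
    CounterType counter; // the implementation under test

    // Over-alignment keeps the hot members on separate cache lines,
    // so the harness itself does not introduce false sharing.
    alignas(std::hardware_destructive_interference_size)
    std::atomic<int64_t> totalIncrements{0};

    alignas(std::hardware_destructive_interference_size)
    std::atomic_flag start; // start barrier (clear until the run begins)

    int64_t numThreads;
    int64_t millsToRun; // run duration in milliseconds

    experiment_detail(int64_t n, int64_t ms)
        : counter(static_cast<int>(n)), numThreads(n), millsToRun(ms) {}
};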
template <class CounterType>
void run(experiment_detail<CounterType>& g);
This function:
- Spawns numThreads worker threads.
- Uses a barrier to start all threads simultaneously (reducing warm-up bias).
- Has each thread loop, calling counter->inc(tid), until the target time limit is hit (see the sketch below).
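A minimal sketch of run(), using the hypothetical experiment_detail members shown above (the real harness also records the start time and other bookkeeping omitted here):

#include <chrono>
#include <thread>
#include <vector>

template <class CounterType>
void run(experiment_detail<CounterType>& g) {
    std::vector<std::thread> workers;
    for (int tid = 0; tid < g.numThreads; ++tid) {
        workers.emplace_back([&g, tid] {
            g.start.wait(false); // block here until the flag is set
            const auto deadline = std::chrono::steady_clock::now()
                                + std::chrono::milliseconds(g.millsToRun);
            int64_t local = 0;
            while (std::chrono::steady_clock::now() < deadline) {
                g.counter.inc(tid);
                ++local;
            }
            g.totalIncrements.fetch_add(local); // publish this thread's count
        });
    }
    g.start.test_and_set(); // release every worker at the same moment
    g.start.notify_all();
    for (auto& t : workers) t.join();
}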
Full source code is available here: View on GitHub Gist
Let’s benchmark the mutex counter to see how it performs as we increase thread count:
Threads: 1 2 4 8 16 32
Ops/sec: 126M 126M 32M 33M 126M 126M
This reveals a critical scalability problem. Despite adding more threads, performance actually degrades at 4-8 threads before recovering. The mutex creates a serialization bottleneck where all threads compete for the same lock.
Using perf stat confirms this:
- 447 billion CPU cycles consumed
- Poor instructions-per-cycle ratio due to blocking
- High system time from mutex operations
The solution is to eliminate the single point of contention by sharding the counter across multiple independent counters, each with its own mutex:
struct padded_int_mutex {
    int64_t v{0};
    std::mutex m;
};

class CounterShardedLocked {
private:
    int _numThreads;
    padded_int_mutex shards[MAX_THREADS];
public:
    CounterShardedLocked(int numThreads) : _numThreads(numThreads) {}
    int64_t inc(int tid) {
        std::lock_guard<std::mutex> g(shards[tid].m);
        return shards[tid].v++;
    }
    int64_t read() {
        int64_t total = 0;
        for (int i = 0; i < _numThreads; ++i) {
            std::lock_guard<std::mutex> g(shards[i].m);
            total += shards[i].v;
        }
        return total;
    }
};
Key design element: each thread tid operates exclusively on shards[tid], eliminating inter-thread contention on the hot path.

The sharded approach delivers dramatic improvements:
Threads   Mutex (ops/sec)   Sharded (ops/sec)
1         126M              127M
2         126M              127M   ↑ Consistent
4         32M               127M   ↑ 4x improvement
8         33M               127M   ↑ 4x improvement
16        126M              127M   ↑ Stable
32        126M              127M   ↑ Stable
CPU efficiency:
- 23% fewer CPU cycles (343B vs 447B)
- Better instructions-per-cycle ratio
- Eliminated lock contention
Mutex Counter (Bottleneck):
Thread 1 ──┐
Thread 2 ──┤ → [SINGLE MUTEX] → Counter
Thread 4 ──┤ (Serialized)
Thread 8 ──┘
Sharded Counter (Parallel):
Thread 1 → [Mutex 1] → Shard 1
Thread 2 → [Mutex 2] → Shard 2
Thread 4 → [Mutex 4] → Shard 4
Thread 8 → [Mutex 8] → Shard 8
By transforming one contended resource into N independent resources, we achieve true parallelism. However, we can optimize further by addressing false sharing between shard structures - our next improvement will add proper cache line padding.
Adding proper cache line alignment to prevent false sharing shows minimal impact for this workload:
Sharded (No Padding): 127M ops/sec, 343B CPU cycles
Sharded (With Padding): 127M ops/sec, 343B CPU cycles
The performance remains essentially identical because each thread operates on its dedicated shard with minimal cross-thread interference.
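For reference, here is a minimal sketch of what the padded shard variant might look like. Note that std::hardware_destructive_interference_size (from <new>, C++17) is the constant intended for this purpose: the minimum offset between two objects needed to avoid false sharing.

#include <cstdint>
#include <mutex>
#include <new> // std::hardware_destructive_interference_size

// Aligning each shard to a cache-line boundary guarantees that two
// shards never occupy the same cache line.
struct alignas(std::hardware_destructive_interference_size) padded_int_mutex {
    int64_t v{0};
    std::mutex m;
};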
Moving beyond locks entirely, we can use atomic operations. A simple fetch_add approach:
class CounterFetchAndAdd {
private:
    std::atomic<int64_t> state{0};
public:
    CounterFetchAndAdd(int _numThreads) {} // thread count unused; kept for the common interface
    int64_t inc(int tid) {
        return state.fetch_add(1);
    }
    int64_t read() {
        return state;
    }
};
Performance Results:
Fetch-and-Add: 127M ops/sec, 171B CPU cycles, 359B instructions
The fetch-and-add counter uses 50% fewer CPU cycles than sharded locks (171B vs 343B) while maintaining similar throughput. However, it reintroduces cache line contention as all threads compete for the same atomic variable.
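As an aside (not benchmarked in the article), a monotonic counter does not need fetch_add's default sequentially consistent ordering; a relaxed fetch_add is a common further tweak, though it does nothing about the cache line contention itself:

#include <atomic>
#include <cstdint>

std::atomic<int64_t> state{0};

// Relaxed ordering is sufficient when the counter value is not used
// to synchronize access to other memory.
int64_t inc_relaxed() {
    return state.fetch_add(1, std::memory_order_relaxed);
}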
The ultimate optimization combines the best of both worlds: per-thread shards with atomic operations instead of locks:
struct alignas(std::hardware_destructive_interference_size) padded_atomic_int {
    // destructive interference size: the minimum offset between
    // objects needed to avoid false sharing
    std::atomic<int64_t> v{0};
};

class CounterShardedWaitfree {
private:
    int _numThreads;
    padded_atomic_int shards[MAX_THREADS];
public:
    CounterShardedWaitfree(int numThreads) : _numThreads(numThreads) {}
    int64_t inc(int tid) {
        return shards[tid].v++; // No locks needed!
    }
    int64_t read() {
        int64_t total = 0;
        for (int i = 0; i < _numThreads; ++i) {
            total += shards[i].v;
        }
        return total;
    }
};
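A quick usage sketch, assuming the MAX_THREADS constant is defined elsewhere in the source (e.g. constexpr int MAX_THREADS = 64; the exact value is a hypothetical here):

#include <print>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    CounterShardedWaitfree counter(kThreads);
    std::vector<std::thread> threads;
    for (int tid = 0; tid < kThreads; ++tid) {
        threads.emplace_back([&counter, tid] {
            for (int i = 0; i < 1'000'000; ++i) counter.inc(tid);
        });
    }
    for (auto& t : threads) t.join();
    // With all threads joined, read() returns the exact total: 4000000.
    std::println("Counter Value: {}", counter.read());
}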
Performance Breakthrough:

                     Ops/sec   CPU Cycles
Mutex Counter:       33M       447B
Sharded Locks:       127M      343B
Fetch-and-Add:       127M      171B
Wait-Free Sharded:   290M      [lower]   ↑ 2.3x improvement!
The wait-free sharded counter delivers:
- 2.3x higher throughput than sharded locks
- No blocking or lock contention
- True wait-free operation: threads never block
- Better scalability: throughput grows with thread count instead of collapsing under contention
Scaling Performance:
Threads: 1 2 4 8 16 32
Ops/sec: 290M 289M 565M 542M 569M 519M
At 4 threads we see 565M ops/sec, roughly double the single-thread rate, after which throughput holds steady in the 520-570M range. The cache line padding ensures each thread's atomic shard sits in its own cache line, eliminating the false sharing that would otherwise degrade performance.
Counter Type        Throughput (ops/sec)   Scalability   CPU Efficiency
─────────────────────────────────────────────────────────────────────────
Naive               Broken                 N/A           N/A
Mutex               33M                    Poor          Low (447B cycles)
Sharded Locks       127M                   Good          Medium (343B cycles)
Fetch-and-Add       127M                   Limited       High (171B cycles)
Wait-Free Sharded   290M                   Excellent     Highest
In the next article, we'll push performance even further by exploring:
- Thread pinning to specific CPU cores
- NUMA-aware memory allocation