Be Aware of the Cacheline
Most Golang tutorials will teach you how to use a mutex properly.
Here is an example of how to effectively utilize modern hardware.
Cacheline
Most CPUs have a cacheline size of 64 bytes. When the CPU fetches data from memory into the L1/L2 caches, it fetches a whole cacheline at a time, so keeping your data nice and compact increases read performance.
That is not the case if you want to update the data concurrently.
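To see the read-side effect, here is a rough sketch of mine (timings, not a rigorous benchmark; it assumes 8-byte int64 values and 64-byte cachelines): both loops add the same number of values, but the strided walk uses only one value out of every cacheline it fetches, so it pulls roughly 8x more memory through the caches.
package main

import (
	"fmt"
	"time"
)

func main() {
	const n = 1 << 22 // 4M values
	compact := make([]int64, n)  // 32 MB, read contiguously
	sparse := make([]int64, n*8) // 256 MB, read one value per cacheline

	var sum int64
	start := time.Now()
	for i := 0; i < n; i++ {
		sum += compact[i]
	}
	fmt.Println("compact:", time.Since(start), sum)

	sum = 0
	start = time.Now()
	for i := 0; i < n*8; i += 8 { // stride of 8 int64s = 64 bytes
		sum += sparse[i]
	}
	fmt.Println("strided:", time.Since(start), sum)
}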
False Sharing
You can search for "cacheline" and "false sharing" online to learn what they are; there are plenty of articles that explain them well.
Below is a simple Golang benchmark showing the impact of false sharing.
The full code snippet is available here.
package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

const THREADS int = 8
const LOOP int = 100000

// All goroutines increment one shared counter guarded by a mutex.
func BenchmarkCachelineMutex(b *testing.B) {
	for run := 0; run < b.N; run++ {
		total := 0
		var wg sync.WaitGroup
		wg.Add(THREADS)
		var mu sync.Mutex
		for i := 0; i < THREADS; i++ {
			go func() {
				for j := 0; j < LOOP; j++ {
					mu.Lock()
					total++
					mu.Unlock()
				}
				wg.Done()
			}()
		}
		wg.Wait()
	}
}

// All goroutines increment one shared counter with atomic adds.
func BenchmarkCachelineAtomic(b *testing.B) {
	for run := 0; run < b.N; run++ {
		total := new(int32)
		var wg sync.WaitGroup
		wg.Add(THREADS)
		for i := 0; i < THREADS; i++ {
			go func() {
				for j := 0; j < LOOP; j++ {
					atomic.AddInt32(total, 1)
				}
				wg.Done()
			}()
		}
		wg.Wait()
	}
}

// Each goroutine increments its own slot, one int apart:
// all eight counters sit in the same cacheline.
func BenchmarkCachelineArray1(b *testing.B) {
	for run := 0; run < b.N; run++ {
		var total [2048]int
		var wg sync.WaitGroup
		wg.Add(THREADS)
		for i := 0; i < THREADS; i++ {
			go func(idx int) {
				for j := 0; j < LOOP; j++ {
					total[idx]++
				}
				wg.Done()
			}(i * 1)
		}
		wg.Wait()
	}
}

// Same as above, but the slots are two ints apart.
func BenchmarkCachelineArray2(b *testing.B) {
	for run := 0; run < b.N; run++ {
		var total [2048]int
		var wg sync.WaitGroup
		wg.Add(THREADS)
		for i := 0; i < THREADS; i++ {
			go func(idx int) {
				for j := 0; j < LOOP; j++ {
					total[idx]++
				}
				wg.Done()
			}(i * 2)
		}
		wg.Wait()
	}
}

// BenchmarkCachelineArray3 through BenchmarkCachelineArray13 are omitted;
// they are identical except for the stride multiplier.
func BenchmarkCachelineArray14(b *testing.B) {
	for run := 0; run < b.N; run++ {
		var total [2048]int
		var wg sync.WaitGroup
		wg.Add(THREADS)
		for i := 0; i < THREADS; i++ {
			go func(idx int) {
				for j := 0; j < LOOP; j++ {
					total[idx]++
				}
				wg.Done()
			}(i * 14)
		}
		wg.Wait()
	}
}
Run the command go test cacheline_test.go -run=BenchmarkCacheline -bench=BenchmarkCacheline -benchmem
and you will get benchmark statistics like these:
BenchmarkCachelineMutex-16 30 40018997 ns/op 1256 B/op 6 allocs/op
BenchmarkCachelineAtomic-16 145 8215842 ns/op 226 B/op 2 allocs/op
BenchmarkCachelineArray1-16 442 2677674 ns/op 16456 B/op 2 allocs/op
BenchmarkCachelineArray2-16 849 1436529 ns/op 16406 B/op 2 allocs/op
BenchmarkCachelineArray3-16 1143 1034337 ns/op 16401 B/op 2 allocs/op
BenchmarkCachelineArray4-16 1976 621679 ns/op 16408 B/op 2 allocs/op
BenchmarkCachelineArray5-16 1790 699448 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray6-16 2383 458006 ns/op 16401 B/op 2 allocs/op
BenchmarkCachelineArray7-16 2215 563357 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray8-16 5644 211981 ns/op 16405 B/op 2 allocs/op
BenchmarkCachelineArray9-16 4479 281926 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray10-16 6565 211207 ns/op 16403 B/op 2 allocs/op
BenchmarkCachelineArray11-16 5283 220976 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray12-16 6944 189071 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray13-16 5514 219693 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray14-16 5882 189697 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray15-16 5656 236692 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray16-16 5886 214639 ns/op 16400 B/op 2 allocs/op
BenchmarkCachelineArray17-16 5835 207106 ns/op 16400 B/op 2 allocs/op
PASS
ok command-line-arguments 25.139s
Most of the time, using a mutex is good enough to avoid race conditions. Yet using atomic (lock-free) operations is usually faster, in this case about 4.8x.
In the BenchmarkCachelineArray%d methods, I use an array to store a separate result for each goroutine. Going by the numbers above, BenchmarkCachelineArray12 is roughly 212x faster than using the mutex.
Take BenchmarkCachelineArray8 for example: the 1st goroutine updates total[0], the 2nd one updates total[8], and so on. On a 64-bit machine a Go int is 8 bytes, so total[0] and total[8] are 64 bytes apart, exactly one cacheline, which puts each goroutine's counter on its own cacheline.
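You can check that arithmetic on your own platform with a tiny standalone snippet (mine, not part of the benchmark file):
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	var total [2048]int
	// On a 64-bit platform this prints 8 and 64: total[0] and total[8]
	// are exactly one 64-byte cacheline apart.
	fmt.Println("sizeof(int):", unsafe.Sizeof(total[0]))
	fmt.Println("offset of total[8]:",
		uintptr(unsafe.Pointer(&total[8]))-uintptr(unsafe.Pointer(&total[0])))
}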
BenchmarkCachelineArray1 is the slowest one due to false sharing: whenever the 1st goroutine writes to total[0], it invalidates every other core's copy of the cacheline, because total[0] through total[7] all live in one cacheline.
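In real code you usually would not multiply indexes by a magic stride; the common fix is to pad each counter so it occupies a full cacheline. Here is a minimal sketch of that idea, assuming a 64-byte cacheline; the paddedCounter type is my own illustration (golang.org/x/sys/cpu also ships a CacheLinePad type for this purpose).
package main

import (
	"fmt"
	"sync"
)

// paddedCounter assumes a 64-byte cacheline: 8 bytes of counter plus
// 56 bytes of padding keep neighboring counters on separate lines.
type paddedCounter struct {
	n int64
	_ [56]byte
}

func main() {
	const threads = 8
	const loop = 100000
	var counters [threads]paddedCounter
	var wg sync.WaitGroup
	wg.Add(threads)
	for i := 0; i < threads; i++ {
		go func(idx int) {
			defer wg.Done()
			for j := 0; j < loop; j++ {
				counters[idx].n++
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(counters[0].n) // 100000
}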
Benchmark results can vary due to system scheduling; on my machine, the next run showed BenchmarkCachelineArray14 as the fastest. Also, you should adjust THREADS according to the number of CPU cores on your machine.
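For instance, runtime.NumCPU reports the number of logical CPUs, so you could derive THREADS from it instead of hard-coding 8 (a small sketch of mine):
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// runtime.NumCPU reports the number of logical CPUs available to
	// the process; one goroutine per core is a sensible default.
	fmt.Println("suggested THREADS:", runtime.NumCPU())
}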