You would think that benchmarks are your friend when it comes to measuring your computer's performance. At least until you talk to the actual IT guys and get laughed at, because benchmarks reveal absolutely nothing. Their only use case is e-peen contests – and now, discovering why so many memory benchmarks report wildly different scores.
Where It Began
It all began with an innocent comment written on 28th January 2017 for the AMD Encoder Plugin for OBS Studio: “ToDo: Can we make this threaded?”. The idea was to increase memory throughput and reduce overall code latency in order to make the encoder even more responsive.
I tried every single solution known to me: inline assembler, intrinsics, threading – hell, I even tried core parking and assigning thread/process priority. The best result I got was 7.2 gibibytes per second, and nothing ever got me more than that.
So I started running benchmarks that measured memory performance. None showed any anomalies: latency and speeds were very close to the maximum memory throughput my Intel i5-4690 is rated for.
I didn’t understand why my own application didn’t manage to get even half of that – technically mine was literally hogging the entire CPU, sharing it with nothing else, so it should’ve performed better.
That is until I randomly discovered a trick, if I may call it that.
It turns out that if the size of the memory you allocated is smaller than the L2 cache, your memory throughput suddenly explodes: instead of a measly 7 gibibytes per second, I was now at 42 gibibytes per second – an increase of 35 gibibytes per second! Even higher than the CPU was actually rated for!
Except there was a weird side effect: depending on when the test was launched, it would either cap out just below 21 gibibytes per second or reach the full 42 gibibytes per second.
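For anyone who wants to poke at this themselves, here is a minimal sketch of such a copy test. It assumes plain C++ with `std::memcpy` and `std::chrono`. The function name `measure_copy_gibps` and its parameters are made up for illustration, and this is not the plugin's actual code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies `bytes` from one buffer to another
// `iterations` times and returns the average throughput in GiB/s.
double measure_copy_gibps(std::size_t bytes, std::size_t iterations) {
    assert(bytes > 0 && iterations > 0);

    std::vector<unsigned char> src(bytes, 0x5A);
    std::vector<unsigned char> dst(bytes, 0x00);

    // Reading the result into a volatile discourages the compiler
    // from dropping the copies as dead code.
    volatile unsigned char sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        src[0] = static_cast<unsigned char>(i);  // change the input on every pass
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[bytes - 1];
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double gib = static_cast<double>(bytes) * static_cast<double>(iterations)
                       / (1024.0 * 1024.0 * 1024.0);
    return gib / seconds;
}
```

Calling it once with a small buffer (say 64 KiB) and once with a large one (say 256 MiB) should show the gap described above, though the exact numbers depend entirely on the machine, the compiler flags, and what else is running.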
And that’s not all: most of the memcpy calls took next to no CPU cycles at all – a sign that something else was happening. So:
What’s really happening here?
It seems that we have found a point where either Intel has “optimized” a benchmark (essentially cheating) or we are at a size where memory prefetch works for copy operations.
Both of them mean the same thing: any benchmark that uses memory sizes where this applies is not actually testing system memory throughput, but the CPU's ability to prefetch and serve cached results.
The solution for benchmarks wanting to test true system memory throughput is to increase the tested memory size to the point where it exceeds all caches.
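As a rough sketch of how a benchmark could pick its test sizes, the helpers below choose a buffer several times larger than the last-level cache and build a doubling sweep of sizes to time. Everything here is an assumption for illustration: the names `uncached_buffer_bytes` and `sweep_sizes` are made up, and the 6 MiB cache size is a placeholder that should be replaced with the real value for the machine under test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder value for illustration only; look up or query the real
// last-level cache size of the machine under test.
constexpr std::size_t kAssumedLastLevelCacheBytes = 6 * 1024 * 1024;

// Hypothetical helper: smallest power-of-two size that is at least
// `factor` times the last-level cache, so the copy cannot be served
// from cache.
std::size_t uncached_buffer_bytes(std::size_t llc_bytes, std::size_t factor) {
    assert(llc_bytes > 0 && factor > 0);
    const std::size_t target = llc_bytes * factor;
    std::size_t size = 1;
    while (size < target) {
        size <<= 1;
    }
    return size;
}

// Hypothetical helper: sizes from min_bytes up to max_bytes, doubling
// each step. Timing a copy at every size shows where throughput drops
// as each cache level is exceeded.
std::vector<std::size_t> sweep_sizes(std::size_t min_bytes, std::size_t max_bytes) {
    assert(min_bytes > 0 && min_bytes <= max_bytes);
    std::vector<std::size_t> sizes;
    for (std::size_t s = min_bytes; s <= max_bytes; s *= 2) {
        sizes.push_back(s);
    }
    return sizes;
}
```

A benchmark would then time a copy at each size from `sweep_sizes` and report the figure at `uncached_buffer_bytes(kAssumedLastLevelCacheBytes, 4)` or larger as the system memory throughput. Keep in mind that a copy touches two buffers, so source and destination together need to exceed the caches.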