author     Ben Avison <bavison@riscosopen.org>    2013-01-24 18:19:48 +0000
committer  Søren Sandmann Pedersen <ssp@redhat.com>    2013-01-29 15:23:05 -0500
commit     69a7a9b6b6dc5b769888c469de3435059318f7cc (patch)
tree       05329aa8f4c366e58807a2a6e5d10c014789378e
parent     1fa67f499d3826fad8783684bb90c8aadd9f682f (diff)
Improve L1 and L2 benchmark tests for caches that don't use allocate-on-write
In particular this affects single-core ARMs (e.g. ARM11, Cortex-A8), which
are usually configured this way. For other CPUs, this should only add a
constant time, which will be cancelled out by the EXCLUDE_OVERHEAD runs.

The problems were caused by cachelines becoming permanently evicted from
the cache, because the code that was intended to pull them back in again
on each iteration assumed too long a cache line (for the L1 test) or
failed to read memory beyond the first pixel row (for the L2 test). Also,
the reloading of the source buffer was unnecessary.

These issues were identified by Siarhei in this post:
http://lists.freedesktop.org/archives/pixman/2013-January/002543.html
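For orientation before the diff, here is a minimal standalone sketch of the
cache-priming pattern the patch adopts, assuming the 32-byte cacheline and
32-bit pixels used by the benchmark. The helper name prime_dst_cache and its
parameters are hypothetical, and the real patch uses a wider inner bound
(width + 62, visible in the hunk below) rather than width:

    #include <stdint.h>

    #define CACHELINE_LENGTH (32)    /* bytes, as assumed by the patch */
    #define WIDTH            (1920)  /* pixels per buffer row, as in the benchmark */

    /* Read one pixel from every cacheline covering the first 'width' pixels
     * of each of 'lines' destination rows, so that a cache which does not
     * allocate on write pulls those lines back in before the timed blit
     * writes to them.  The sum is returned so the reads cannot be
     * optimised away. */
    static int
    prime_dst_cache (const uint32_t *dst, int width, int lines)
    {
        int j, k;
        int q = 0;
        int step = CACHELINE_LENGTH / sizeof *dst;   /* 32 / 4 = 8 pixels */

        for (j = 0; j < lines; j++)
        {
            for (k = 0; k < width; k += step)
                q += dst[j * WIDTH + k];
            /* if dst is not cacheline-aligned, the last pixel of the row can
             * sit in one further cacheline than the strided loop reached */
            q += dst[j * WIDTH + width - 1];
        }
        return q;
    }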
-rw-r--r--  test/lowlevel-blt-bench.c | 31
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/test/lowlevel-blt-bench.c b/test/lowlevel-blt-bench.c
index 7336fa0..8e80b42 100644
--- a/test/lowlevel-blt-bench.c
+++ b/test/lowlevel-blt-bench.c
@@ -33,6 +33,14 @@
#define L1CACHE_SIZE (8 * 1024)
#define L2CACHE_SIZE (128 * 1024)
+/* This is applied to both L1 and L2 tests - alternatively, you could
+ * parameterise bench_L or split it into two functions. It could be
+ * read at runtime on some architectures, but it only really matters
+ * that it's a number that's an integer divisor of both cacheline
+ * lengths; and further, it is only really important for caches that
+ * don't do allocate-on-write. */
+#define CACHELINE_LENGTH (32) /* bytes */
+
#define WIDTH 1920
#define HEIGHT 1080
#define BUFSIZE (WIDTH * HEIGHT * 4)
@@ -168,18 +176,29 @@ bench_L (pixman_op_t op,
int width,
int lines_count)
{
-    int64_t i, j;
+    int64_t i, j, k;
    int x = 0;
    int q = 0;
    volatile int qx;
    for (i = 0; i < n; i++)
    {
-        /* touch destination buffer to fetch it into L1 cache */
-        for (j = 0; j < width + 64; j += 16) {
-            q += dst[j];
-            q += src[j];
-        }
+        /* For caches without allocate-on-write, we need to force the
+         * destination buffer back into the cache on each iteration,
+         * otherwise if they are evicted during the test, they remain
+         * uncached. This doesn't matter for tests which read the
+         * destination buffer, or for caches that do allocate-on-write,
+         * but in those cases this loop just adds constant time, which
+         * should be successfully cancelled out.
+         */
+        for (j = 0; j < lines_count; j++)
+        {
+            for (k = 0; k < width + 62; k += CACHELINE_LENGTH / sizeof *dst)
+            {
+                q += dst[j * WIDTH + k];
+            }
+            q += dst[j * WIDTH + width + 62];
+        }
        if (++x >= 64)
            x = 0;
        call_func (func, op, src_img, mask_img, dst_img, x, 0, x, 0, 63 - x, 0, width, lines_count);
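The 62 in the bounds above is not spelled out in the commit message; the
following is an inference from the call_func arguments visible in this hunk,
assuming the usual pixman composite argument order (src_x, src_y, mask_x,
mask_y, dst_x, dst_y, width, height), written as a small self-contained check:

    #include <assert.h>
    #include <stdint.h>

    int
    main (void)
    {
        uint32_t pixel;
        int      width = 256;    /* any representative test width (hypothetical) */

        /* The priming loop steps CACHELINE_LENGTH / sizeof *dst elements at a
         * time: one 4-byte pixel read per 32-byte cacheline. */
        assert (32 / sizeof pixel == 8);

        /* dst_x is passed as 63 - x, so it can be as large as 63; the highest
         * destination index a blit of 'width' pixels can then write is
         * 63 + width - 1, i.e. width + 62, which is exactly the element the
         * new loop reads explicitly after the strided pass. */
        assert (63 + width - 1 == width + 62);

        (void) pixel;
        return 0;
    }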