Optimization Guide
====================

All the general SIMD optimization principles, such as avoiding branching and not wasting
SIMD lanes, also apply to Beignet optimization on the Gen platform. In addition,
there are some tips specific to Beignet.

Work Group Size Determination
-----------------------------

The basic idea is to launch enough work to keep as many EUs as possible busy, so we need
to set a relatively large global work size. The difficult part is choosing the work group
size. We have the following rules:

* The work group size should be at least 16 and a multiple of 16.

  The possible SIMD widths on Gen are 8 and 16, so following this rule avoids wasting SIMD lanes.

* Set a relatively large work group size if barriers or SLM are used.

  Using a barrier restricts the number of concurrently running work groups to 16 on each half-slice, but each half-slice can
  run 256-1024 work items depending on the platform. So if the work group size is too small, for example 16, then
  only 16*16 work items can run on the half-slice, wasting up to 3/4 of the computing resources on some platforms.

  Using SLM introduces a similar restriction, and it gets worse if a kernel uses too much SLM. Each half-slice only has
  64KB of SLM, and each work group needs its own SLM space. Let's do the math: if one work group needs 64KB, then the
  whole half-slice can only run one work group at a time, and if the work group size is 16, we may waste 98% of the
  computing resources on some platforms.

* Some examples of work group size requirements

  If a barrier is used, the work group size should be at least 16 on Baytrail and at least 64 on other
  platforms.

  If SLM is used, the minimum work group size depends on the amount of SLM each work group needs:  
    <table border=1 width=400>
      <tr>
          <th> SLM per work group </th>
          <td> 0 </td>
          <td> 4K </td>
          <td> 8K </td>
          <td> 16K </td>
          <td> 32K </td>
      </tr>
      <tr>
          <th> Minimum work group size </th>
          <td> 16 </td>
          <td> 64 </td>
          <td> 128 </td>
          <td> 256 </td>
          <td> 512 </td>
      </tr>
    </table>
  If both a barrier and SLM are used, follow the larger of the two requirements.

In practice, a good approach is to pass a NULL local work size to clEnqueueNDRangeKernel and let the driver determine the best work group size for you, as in the sketch below.
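
A minimal host-side sketch (the helper name and the caller-supplied queue, kernel and global size are illustrative):

    <pre><code>
     #include <CL/cl.h>

     /* Passing NULL as the local work size lets the driver pick the work group size. */
     cl_int enqueue_with_driver_chosen_wg(cl_command_queue queue, cl_kernel kernel,
                                          size_t global_size)
     {
       return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global_size, NULL,  /* NULL local work size */
                                     0, NULL, NULL);
     }
    </code></pre>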

Memory Access Optimizations
---------------------------

Memory bandwidth is a key factor for many workloads. Although the total amount of memory read and written
may be constant for a given algorithm, the tips below show how to increase cache efficiency
and how to get the maximum read/write throughput.

* Using shorter data types can give better performance. Some detailed tips are below; a sketch follows this list.
  * Use uchar16/ushort8/uint4 as much as possible.
  * If the data has to be DWORD (4 bytes) unaligned, it's better to use vload16 (for char) or vload8 (for short) to load the data.
  * Reading/writing scalar char/short is very slow and may perform even worse than using a DWORD type.
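
  For example, a minimal sketch (the kernel name and buffers are illustrative) that loads 16 chars per work item with vload16 instead of 16 scalar loads:

    <pre><code>
     __kernel void copy_u8(__global const uchar *src, __global uchar *dst)
     {
       size_t gid = get_global_id(0);
       uchar16 v = vload16(gid, src);   /* one wide load instead of 16 scalar char loads */
       vstore16(v, gid, dst);
     }
    </code></pre>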

* Avoid heavily strided global/constant memory access.

  Some examples are below (assume data is a cache-line-aligned global/constant uint buffer, and the work group size is 16 in SIMD16 mode):  
  `uint x = data[get_global_id(0)];      //best, reads only one cache line, no bandwidth waste`  
  `uint x = data[get_global_id(0) + 1]; //bad, crosses 2 cache lines, wastes half of the bandwidth`  
  `uint x = data[get_global_id(0) * 16]; //worst, crosses 16 cache lines, wastes 15/16 of the bandwidth.`  

* Avoid dynamically indexed private buffers if possible.

  Currently, private buffer access in the Beignet backend is very slow. Many small private buffers can be optimized
  away by the compiler, but a dynamically indexed private buffer like the one below cannot be optimized.

  The following example may cause a major performance issue:

    <pre><code>
     uint private_buffer[32];
     for (uint i = 0; i < xid; i++) {
       int dynamic_idx = src[i];                    /* index value is only known at run time */
       private_buffer[dynamic_idx % 10] = src1[i];  /* dynamically indexed private access */
       ...
     }
    </code></pre>

  The following example will not cause a performance issue:

    <pre><code>
     uint private_buffer[32];
     for (uint i = 0; i < xid; i++) {
       private_buffer[xid % 32] = src1[xid];  /* the index does not depend on loaded data */
       ...
     }
    </code></pre>

 
* Use SLM to reduce the system memory bandwidth requirement if possible.

  On the Gen platform, SLM lives in the GPU's L3 cache. If it can be used to
  share data between work items, it reduces the traffic on the system memory bus,
  which is a big win for many I/O-intensive kernels. A sketch follows below.
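
  A minimal sketch of a 3-tap box filter, assuming a work group size of 16 and a global work size equal to n (the kernel name and buffers are illustrative); each element is fetched from global memory once and the neighbours are then read from SLM:

    <pre><code>
     __kernel void box3(__global const float *src, __global float *dst, int n)
     {
       __local float tile[18];                      /* 16 work items + 2 halo elements */
       int lid = get_local_id(0);
       int gid = get_global_id(0);

       tile[lid + 1] = src[gid];                    /* each element read from global memory once */
       if (lid == 0)
         tile[0] = src[max(gid - 1, 0)];
       if (lid == 15)
         tile[17] = src[min(gid + 1, n - 1)];
       barrier(CLK_LOCAL_MEM_FENCE);

       /* the two neighbours come from SLM instead of the system memory bus */
       dst[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
     }
    </code></pre>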

* Avoid SLM bank conflicts.

  SLM is banked at DWORD granularity, with 16 banks in total. Accesses to the same
  bank at different addresses conflict with each other and should be avoided.

  Some examples are below (assume data is a __local uint array, and the work group size is 16 in SIMD16 mode):  
  `uint x = data[get_local_id(0)];      //best, no bank conflicts, no bandwidth waste`  
  `uint x = data[get_local_id(0) + 1]; //best, no bank conflicts, no bandwidth waste`  
  `uint x = data[get_local_id(0) * 2]; //bad, work items (id) and (id + 8) conflict with each other, wastes half of the bandwidth`  
  `uint x = data[get_local_id(0) * 16]; //worst, all work items conflict on bank zero, wastes 15/16 of the bandwidth.`  

* Zero copy on buffer creation. (Only available in the git master branch and the Release\_v1.0 branch.)

  Use CL\_MEM\_USE\_HOST\_PTR to create the buffer, and pass in a page-aligned
  host pointer whose size is a multiple of the page size. Beignet
  will then leverage userptr to create a buffer object that uses that
  host buffer directly. Alternatively, you can use the CL\_MEM\_ALLOC\_HOST\_PTR
  flag to let the driver allocate a userptr-qualified buffer, which
  guarantees zero copy on the buffer. A host-side sketch follows below.

  Please note that this feature requires a kernel newer than 3.16 and a libdrm newer than 2.4.57.
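
  A minimal host-side sketch, assuming a 4MB buffer (the helper name and size are illustrative):

    <pre><code>
     #include <stdlib.h>
     #include <CL/cl.h>

     #define BUF_SIZE (4096 * 1024)   /* must be a multiple of the page size */

     cl_mem create_zero_copy_buffer(cl_context ctx, cl_int *err)
     {
       void *host_ptr = NULL;
       /* page-aligned allocation qualifies for the userptr zero-copy path */
       if (posix_memalign(&host_ptr, 4096, BUF_SIZE) != 0)
         return NULL;
       return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                             BUF_SIZE, host_ptr, err);
     }
    </code></pre>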

Miscellaneous
--------
* Use float data type as much as possible.

  The two ALUs of an EU can both handle float data, but only one of them can handle non-float data.

* Avoid using long.

  Gen7 and Gen7.5 don't support long natively, and Gen8's native long support is still under development.

* Declare small constant buffers, with their contents, in the kernel if possible.

  For a small constant buffer, it's better to declare it directly in the kernel source with "const \_\_constant". The compiler may optimize accesses to it if the buffer is defined inside the kernel, as in the sketch below.
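
  A minimal sketch with an illustrative lookup table:

    <pre><code>
     const __constant float coeff[4] = { 0.25f, 0.5f, 0.75f, 1.0f };  /* table defined in the kernel source */

     __kernel void scale(__global float *data)
     {
       int gid = get_global_id(0);
       data[gid] *= coeff[gid & 3];
     }
    </code></pre>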

* Avoid unnecessary synchronization.

  This applies both to the runtime and to the kernel: for example, clFinish and clWaitForEvents in the runtime, and barrier() in the kernel.

* Consider the native versions of the math built-ins, such as native\_sin and native\_cos, if your kernel is not precision sensitive.
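
  A minimal sketch (kernel name and buffers are illustrative):

    <pre><code>
     __kernel void fast_sine(__global const float *in, __global float *out)
     {
       int gid = get_global_id(0);
       out[gid] = native_sin(in[gid]);  /* faster, lower-precision variant of sin() */
     }
    </code></pre>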

* Use fma()/mad() as much as possible.
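
  A minimal sketch (kernel name and buffers are illustrative):

    <pre><code>
     __kernel void axpy(__global const float *x, __global const float *y,
                        __global float *out, float a)
     {
       int gid = get_global_id(0);
       out[gid] = mad(a, x[gid], y[gid]);  /* one multiply-add instead of separate mul and add */
     }
    </code></pre>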

* Try to eliminate branching as much as possible.

  For example, use the min, max, clamp or select built-ins instead of if/else where possible, as in the sketch below.
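
  A minimal sketch (kernel name and buffers are illustrative), replacing a divergent if/else with clamp:

    <pre><code>
     __kernel void threshold(__global const float *src, __global float *dst,
                             float lo, float hi)
     {
       int gid = get_global_id(0);
       float v = src[gid];

       /* branchy version:
        * if (v < lo) v = lo; else if (v > hi) v = hi; */
       dst[gid] = clamp(v, lo, hi);  /* branch-free equivalent */
     }
    </code></pre>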