Optimization Guide ==================== All the SIMD optimization principles such as avoid branching and don't waste SIMD lanes are also applied to Beignet optimization on Gen platform. Furthermore, there are some special tips for Beignet optimization. Work Group Size Determination ----------------------------- The basic thoery here is to launch enough work to use as much as possible EUs to do the work. So basicly, we need to set a relatively large global work size. The difficult part is how to choose work group size. We have the following rules: * Work group Size should be larger than 16 and be multiple of 16. As two possible SIMD lanes on Gen are 8 or 16. To not waste SIMD lanes, we need to follow this rule. * Need to set relatively large work group size if barrier or SLM are used. Using barrier restricts the number concurent running work groups to 16 on each half-slice. But each half-slice could emit 256-1024 work items according to different platforms. So if the work group size is too small, for example 16, then only 16*16 work items could run on the half-slice, and may waste 3/4 of the computing resource on some platforms. Using SLM also introduces such restrication and even worse if a kernel is using too much SLM. Each half-slice only has 64KB SLM. And each work group needs one additional SLM space. Let's do the math, if one work group needs 64KB, then the whole half-slice could only run one work group one time. And if the work group size is 16, then we may waste 98% of the computing resources on some platforms. * Some examples of work group If use barrier, work group size should be equal or larger than 16 on baytrail and be equal or larger than 64 on other platforms. If use SLM, here are a table:
 Amount of SLM Work Group Size 0 4K 8K 16K 32K 16 64 128 256 512
``````
uint private_buffer[32];
for (i = 0; i < xid; i++)    {
int dynamic_idx = src[xid];
private_buffer[dynamic_idx % 10] = src1[xid];
...
}
``````
The following example will not bring performance issue:
``````
uint private_buffer[32];
for (i = 0; i < xid; i++)    {
private_buffer[xid % 32] = src1[xid];
...
}
``````
* Use SLM to reduce the host memory bandwidth requirement if possible. On Gen platform, SLM is in GPU's L3 cache, if it could be used to share data between work items, it could reduce the memory bandwidth on the system memory bus. This will be a big win for many I/O intensity kernels. * Avoid SLM bank conflicts. SLM is banked at a DWORD granularity, totally 16 banks. Access on the same bank with different addresses will lead to a conflicts. It should be avoided. The worst case is: Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode): `uint x = data[get_global_id(0)]; //best, no bank conflicts, no bandwidth waste` `uint x = data[get_global_id(0) + 1]; //best, no bank conflicts, no bandwidth waste` `uint x = data[get_global_id(0) * 2]; //bad, work item (id) and (id + 8) conflict to each other, waste half of the bandwidth` `uint x = data[get_global_id(0) * 16]; //worst, all work items conflicts on the zero bank, waste 15/16 bandwidth.` * Zero copy on buffer creation. (Only avaliable in git master branch and Release\_v1.0 branch). Use CL\_MEM\_USE\_HOST\_PTR to create buffer, and pass in a page aligned host pointer which also has multiple page size. Beignet will leverage userptr to create a buffer object by using that host buffer directly. If possible, you can also use CL\_MEM\_ALLOC\_HOST\_PTR flag to let the driver to allocate a userptr qualified buffer which could guarantee zero copy on the buffer. Please be noted, this feature requires the kernel is newer than 3.16 and the libdrm version is newer than 2.4.57. Miscellaneous -------- * Use float data type as much as possible. The two ALUs of one EU could both handle float data,but only one of them could handle non-float type data. * Avoid using long. GEN7 and Gen7.5 doesn't support long natively. And Gen8's native long support is still under development. * Declare small constant buffer with content in the kernel if possible. For a small constant buffer, it's better to declare it in the kernel directly with "const \_\_constant". The compiler may optimize it if the buffer is defined inside kernel. * Avoid unnecessary synchronizations. Both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime and barrier() in the kernel. * Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive. * Use fma()/mad() as much as possible. * Try to eliminate branching as much as possible. For example using min, max, clamp or select built-ins instead of if/else if possible.