

## Ideas for improving Intel 3D driver performance


### 965: Profile URB allocation

How often are we cutting down to minimal URB allocation size? 


### G4x: Use transposed reads

This would cut the URB size needed from SF to WM, allowing more concurrency.  See the g45-transposed-read branch of ~anholt/mesa.


### 965: Cut down on state flagging in brw_new_batch

We just need to re-emit everything (BRW_NEW_CONTEXT), not re-calculate all state.  Must make sure that BRW_NEW_CONTEXT is set where it needs to be.  Also, merge this and BRW_NEW_BATCH together. 
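The difference between re-emitting and re-calculating can be sketched with a toy dirty-flag tracker. Everything here is hypothetical: the `SKETCH_*` flag values, the struct, and the two "atoms" are stand-ins for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not the driver's real ones. */
#define SKETCH_NEW_CONTEXT  (1u << 0) /* hardware state lost: re-emit packets */
#define SKETCH_NEW_VIEWPORT (1u << 1) /* viewport inputs changed */
#define SKETCH_NEW_PROGRAM  (1u << 2) /* program inputs changed */

struct sketch_state {
    uint32_t dirty;        /* flags accumulated since the last upload */
    unsigned recalculated; /* how many atoms were recomputed */
    unsigned emitted;      /* how many atoms were written to the batch */
};

/* Heavy-handed variant: a new batch marks every input as changed. */
static void sketch_new_batch_flag_all(struct sketch_state *s)
{
    assert(s);
    s->dirty |= SKETCH_NEW_CONTEXT | SKETCH_NEW_VIEWPORT | SKETCH_NEW_PROGRAM;
}

/* Proposed variant: a new batch only says the hardware lost its state. */
static void sketch_new_batch_flag_context(struct sketch_state *s)
{
    assert(s);
    s->dirty |= SKETCH_NEW_CONTEXT;
}

/* Recompute an atom only if its own input changed; emit it if it was
 * recomputed or if the context flag demands a re-emit. */
static void sketch_upload_state(struct sketch_state *s)
{
    static const uint32_t atoms[] = { SKETCH_NEW_VIEWPORT, SKETCH_NEW_PROGRAM };

    assert(s);
    for (unsigned i = 0; i < sizeof(atoms) / sizeof(atoms[0]); i++) {
        if (s->dirty & atoms[i])
            s->recalculated++;
        if (s->dirty & (atoms[i] | SKETCH_NEW_CONTEXT))
            s->emitted++;
    }
    s->dirty = 0;
}
```

In this toy model both variants emit every atom after a new batch, but only the flag-everything variant pays for recomputing them.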


### 965: Enable other-sized dispatch in wm

Right now we only enable 16-pixel or 8-pixel dispatch.  Creating program binaries with multiple entrypoints for differently-sized dispatch could save us many cycles.


### 965: Merge brw_wm_glsl.c into brw_wm_emit.c

This would get us 16-wide dispatch in GLSL, which looks like a 10-20% performance win. 


### 915: Avoid no-op updates of non-pipelined state

Calling DrawBuffer with the buffer that is already current is painful, as it flags us for updating the draw buffer, which is non-pipelined state.  This takes the meta clear code from 200 MB/s down to 6 MB/s.  Something like what 965 does would be better for state tracking in this driver.
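A minimal sketch of filtering out the no-op case, assuming the driver can compare the requested buffer against the one it last bound. The struct, the flag, and the function name are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical dirty bit for the non-pipelined draw buffer state. */
#define SKETCH_DIRTY_DRAW_BUFFER (1u << 0)

struct sketch_ctx {
    int draw_buffer; /* identifier of the currently bound draw buffer */
    unsigned dirty;  /* state that must be re-sent before the next draw */
};

/* Flag the draw buffer for update only when it actually changes.
 * Returns true if state was flagged, false for a no-op call. */
static bool sketch_set_draw_buffer(struct sketch_ctx *ctx, int buffer)
{
    assert(ctx);
    if (ctx->draw_buffer == buffer)
        return false; /* same buffer: leave non-pipelined state alone */

    ctx->draw_buffer = buffer;
    ctx->dirty |= SKETCH_DIRTY_DRAW_BUFFER;
    return true;
}
```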


### both: Save state instead of using push/pop in metaops

Push/pop is expensive, and just keeping track of the state in static structures would be a win.


### both: Use fps and vps in metaops

This is partially done now, but using fragment programs (fps) and vertex programs (vps) for metaops lets us push/pop less state and reduces the cost for Mesa and 965 to calculate the state updates that result.


### both: Avoid CPU-dirtying BOs kicked out of the aperture

Right now when an application exceeds the aperture size, it hits a performance cliff because BOs removed from the aperture have their pages unpinned so they can be swapped.  If we kept the BOs pinned but had a memory pressure handler, we could avoid CPU-dirtying them, which would remove most of the unbind/rebind thrashing cost.


### both: Use PPGTT to have a larger aperture size

This would pull us back from the performance cliff in the previous entry.  It would also let us successfully render larger FBOs and textures where we fail currently.


### i965: Avoid creating new VBO until we've sent the last one out in a batchbuffer

Right now when vbo_exec_api.c flushes a set of primitives, it does BufferData(size, NULL) on the VBO, so that you don't block on mapping the old one due to existing rendering.  VBOs are relatively huge, so if you're doing tiny draws it's a lot of overhead, which keeps us from using real VBOs for vbo_exec.
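One way to picture the idea is a bookkeeping-only model that keeps appending vertex data to the same buffer and asks for fresh storage only once a submitted batch references the old one (or it runs out of room). No GL calls are made; the struct and both functions are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_vbo {
    size_t size;         /* capacity of the current storage in bytes */
    size_t used;         /* bytes already handed out from it */
    bool in_flight;      /* a submitted batch references this storage */
    unsigned generation; /* number of times fresh storage was requested */
};

/* Reserve space for one flush of primitives and return its offset.
 * Fresh storage is requested only when the old storage is in flight
 * or too full, instead of on every flush. */
static size_t sketch_vbo_alloc(struct sketch_vbo *vbo, size_t bytes)
{
    assert(vbo && bytes <= vbo->size);
    if (vbo->in_flight || vbo->used + bytes > vbo->size) {
        vbo->generation++; /* stands in for allocating a new VBO */
        vbo->used = 0;
        vbo->in_flight = false;
    }

    size_t offset = vbo->used;
    vbo->used += bytes;
    return offset;
}

/* Called when the batchbuffer referencing the VBO has been sent out. */
static void sketch_batch_submitted(struct sketch_vbo *vbo)
{
    assert(vbo);
    if (vbo->used > 0)
        vbo->in_flight = true;
}
```

With many tiny draws per batch, this model requests new storage once per batch rather than once per flush.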


### i965: Implement the ranged mapping extension

This would obsolete the previous entry, as the vbo module would start doing what we want for us. 


### i965: Only upload the constant buffer when the contents or the fencing of it has changed

Right now we upload it if you change programs, fencing, pipelined state, or anything related to transforms or projection.  Would this help? 
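A minimal sketch of the content check, assuming a shadow copy of the last uploaded constants is kept on the CPU side. The names and the fixed `SKETCH_MAX_CONSTS` limit are hypothetical, and a fencing change is modelled simply as invalidating the shadow copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_CONSTS 64 /* hypothetical capacity for this example */

struct sketch_const_buf {
    float shadow[SKETCH_MAX_CONSTS]; /* copy of what was last uploaded */
    size_t count;                    /* number of floats in the shadow copy */
    bool valid;                      /* false until the first upload */
    unsigned uploads;                /* how many uploads really happened */
};

/* Model a fencing change: the next call must upload again. */
static void sketch_constants_invalidate(struct sketch_const_buf *cb)
{
    assert(cb);
    cb->valid = false;
}

/* Upload only when the bytes differ from the last upload.
 * Returns true if an upload was performed. */
static bool sketch_upload_constants(struct sketch_const_buf *cb,
                                    const float *values, size_t count)
{
    assert(cb && values && count <= SKETCH_MAX_CONSTS);
    if (cb->valid && cb->count == count &&
        memcmp(cb->shadow, values, count * sizeof(float)) == 0)
        return false; /* identical contents: skip the upload */

    memcpy(cb->shadow, values, count * sizeof(float));
    cb->count = count;
    cb->valid = true;
    cb->uploads++; /* stands in for the real upload */
    return true;
}
```

The comparison is bytewise, so it may upload more often than strictly needed (for example when `0.0` becomes `-0.0`), but it never skips an upload whose bytes differ.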


### i965: Use PIPE_CONTROL

This is supposed to get us better pipelining behavior than MI_FLUSH.