This page documents improvements to the i965 driver that we would like to make in the future (time permitting).  To see things that we don't intend to fix (e.g. known hardware bugs), see [[I965Errata]].

# gen7 (Ivy Bridge) and newer

### Combine GS_OPCODE_END_THREAD with earlier URB writes (easy)

Currently, we always use a separate URB write message to end geometry shader threads.  If there's an immediately preceding URB write message, we can simply set the EOT bit on that message and drop the extra one.

Note that because EmitVertex() may be called from a loop, there might not always be an immediately preceding URB write.
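
A minimal sketch of the peephole, using hypothetical stand-ins for the backend IR (the struct, the enum values, and remove_inst() are illustrative, not the driver's actual types):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical, minimal stand-ins for the backend IR. */
    enum opcode { GS_OPCODE_URB_WRITE, GS_OPCODE_END_THREAD };

    struct backend_inst {
        struct backend_inst *prev;
        enum opcode opcode;
        bool eot;
    };

    /* Placeholder: unlink the instruction from the instruction list. */
    static void remove_inst(struct backend_inst *inst) { (void)inst; }

    /* If the end-of-thread message immediately follows a URB write,
     * set EOT on that write and drop the separate end message. */
    static bool
    merge_end_thread(struct backend_inst *end_thread)
    {
        struct backend_inst *prev = end_thread->prev;

        /* As noted above, EmitVertex() in a loop means there may be no
         * immediately preceding URB write at compile time. */
        if (prev == NULL || prev->opcode != GS_OPCODE_URB_WRITE)
            return false;

        prev->eot = true;
        remove_inst(end_thread);
        return true;
    }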

### Combine GS URB write messages (moderate)

Every GS EmitVertex() call generates its own URB write messages.  When geometry shaders emit multiple vertices in a row, this produces several messages in a row.  We could coalesce these into a single, longer message.  We would need to ensure that the offsets/lengths describe a contiguous block of URB space, and that the combined message obeys the message length limits.

It may be easier to recognize this case if we combine the message header setup, GS_OPCODE_SET_WRITE_OFFSET, and GS_OPCODE_URB_WRITE into a single, logical message.
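
A hypothetical sketch of the legality check, assuming each logical URB write carries its destination offset and length in vec4 slots (the struct and names are illustrative):

    #include <stdbool.h>

    /* Hypothetical: one logical URB write message. */
    struct urb_write {
        unsigned offset;   /* destination offset in the URB, in vec4 slots */
        unsigned len;      /* payload length, in vec4 slots */
    };

    /* Two writes can be merged when they cover a contiguous block of URB
     * space and the combined payload obeys the message length limit. */
    static bool
    can_coalesce(const struct urb_write *a, const struct urb_write *b,
                 unsigned max_msg_len)
    {
        return a->offset + a->len == b->offset &&
               a->len + b->len <= max_msg_len;
    }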

### Optimize CS local ID push constant register usage (moderate)

See the comment above brw_cs_prog_local_id_payload_dwords() in gen7_cs_state.c.

# gen6 (Sandy Bridge) and newer 

### Use SSA form for the scalar backend (hard)

At some point, we want to use SSA form for the scalar backend.  Some
thoughts for that have been collected at [[I965ScalarSSA]].

### Improve performance of ARB_shader_atomic_counters

In the fragment shader, if all channels performing an atomic add target the same address, then doing a single atomic add of the number of active channels and deriving each channel's result from the returned value should be more efficient than asking the hardware to perform each channel's atomic operation (even though the per-channel version already takes only the one SEND instruction).
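
As a sketch of the arithmetic, in plain C over a subgroup execution mask (the function and parameter names are hypothetical, not driver code):

    #include <stdint.h>

    /* One atomic add of the number of live channels reserves a contiguous
     * range; each channel then takes a distinct slot in that range, which
     * is all the atomic counter semantics require. */
    static uint32_t
    atomic_add_same_address(uint32_t exec_mask, uint32_t *counter, unsigned channel)
    {
        uint32_t live = __builtin_popcount(exec_mask);
        uint32_t base = __sync_fetch_and_add(counter, live);
        /* This channel's slot: base + number of live channels below it. */
        return base + __builtin_popcount(exec_mask & ((1u << channel) - 1));
    }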

### Improve code generation for if statement conditions (easy)

Something like "if (a == b && c == d)" produces:

    cmp.e.f0(8)     g32<1>D         g6<8,8,1>F      g31<8,8,1>F     { align1 WE_normal 1Q };
    cmp.e.f0(8)     g34<1>D         g5<8,8,1>F      g33<8,8,1>F     { align1 WE_normal 1Q };
    and(8)          g35<1>D         g32<8,8,1>D     g34<8,8,1>D     { align1 WE_normal 1Q };
    and.ne.f0(8)    null            g35<8,8,1>D     1D              { align1 WE_normal 1Q };
    (+f0) if(8) 0 0                 null            0x00000000UD    { align1 WE_normal 1Q switch };

when it would be better to produce something like:

    cmp.e.f0(8)           g32<1>D         g6<8,8,1>F      g31<8,8,1>F     { align1 WE_normal 1Q };
    (+f0) cmp.e.f0(8)     g34<1>D         g5<8,8,1>F      g33<8,8,1>F     { align1 WE_normal 1Q };
    (+f0) if(8) 0 0                       null            0x00000000UD    { align1 WE_normal 1Q switch };

### Recognize DPH patterns and generate them (moderate)

We could teach nir_opt_algebraic to recognize fdp4 with the first source's .w component == 1.0 and turn that into fdph.  There are probably patterns where teaching nir_search about swizzles would be useful.
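
Written out in plain C for clarity, the identity being exploited is (purely illustrative scalar code):

    /* With a[3] == 1.0, dp4(a, b) reduces to
     * a.x*b.x + a.y*b.y + a.z*b.z + b.w, which is exactly dph(a, b). */
    static float dp4(const float a[4], const float b[4])
    {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    }

    static float dph(const float a[4], const float b[4])  /* a[3] is ignored */
    {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + b[3];
    }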

### Return-from-main using HALT (easy)

Right now, when there's a "return" in the main() function, we lower all later assignments to be conditional moves.  But using the HALT instruction, we can tell the hardware to stop execution for some channels until a certain IP is reached.  We use this for discards to have subspans stop executing once they're discarded (for efficiency), and we could do basically the same thing on a channel-wise basis for return-from-main.  Take a look at FS_OPCODE_DISCARD_JUMP, FS_OPCODE_PLACEHOLDER_HALT, and patch_discard_jumps_to_fb_writes().

### Loop invariant code motion (hard)

When there's a for loop like

    for (int i = 0; i < max; i++) {
        result += texture2D(sampler0, offsets[i]) * texture2D(sampler1, vec2(0.0));
    }

it would be nice to recognize that texture2D(sampler1, vec2(0.0)) doesn't depend on the loop iteration and hoist it out of the loop, as shown below.  This is a standard compiler optimization that we lack.
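
The hoisted form would look something like:

    vec4 tex1 = texture2D(sampler1, vec2(0.0));
    for (int i = 0; i < max; i++) {
        result += texture2D(sampler0, offsets[i]) * tex1;
    }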

### 32-wide dispatch for regular fragment shaders (hard)

This might involve double-emitting each operation at the LIR level, or might involve making 32-wide (4-register instead of 2-register) vgrfs.  Both seem like big changes.

### Experiment with VFCOMP_NO_SRC in vertex fetch.

Right now, if a VS takes two vec2 inputs (say, a 2D position and 2D texcoord), it will get put in the VUE as two vec4s, each formatted as (x, y, 0.0, 1.0).

The VUE could be shrunk if we noticed that and packed the two into one VUE slot, using VFCOMP_NO_SRC in the VERTEX_ELEMENT_STATE to write half of the VUE slot with each vec2 input.  This assumes VFCOMP_NO_SRC works the way we hope it does (the "no holes" comments are concerning).

[Note that on ILK+ the destination offset in the VUE is no longer controllable, so only things which can share a VERTEX_ELEMENT_STATE can be packed. -chrisf]

### Use vertex/fragment shaders in meta.c (easy)

This is partially done now, but using fragment and vertex shaders for
metaops lets us push/pop less state and reduces the cost for mesa and
965 of computing the resulting state updates.

### Full accelerated glBitmap() support (moderate)

You'd take the bitmap, upload it to a texture, and put it in a spare surface slot in the brw_wm_surface_state.c-related code.  Use meta.c to generate a primitive covering the area to be rasterized by glBitmap().  Set a flag in the driver across the meta.c calls indicating that we're doing a bitmap operation; then, in brw_fs_visitor.c, when the flag is set, prepend the shader with a texture sample from the bitmap and a discard (see the sketch below).
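
In shader terms, the prepended logic amounts to something like the following (the sampler name and coordinate are hypothetical placeholders):

    // Kill any fragment whose bitmap texel is unset.
    if (texture2D(bitmap_tex, bitmap_coord).x == 0.0)
        discard;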

### Full accelerated glDrawPixels() support (moderate)

Like glBitmap() above, except you're replacing the incoming color instead of doing a discard.

### Full accelerated glAccum() support (easy)

Using FBOs in meta.c, this should be fairly easy, except that we don't have tests.

### Full accelerated glRenderMode(GL_SELECT) support (moderate)

This seems doable using meta.c and FBO rendering to pick the result out.

### Full accelerated glRenderMode(GL_FEEDBACK) support (hard)

This would involve transform feedback in some way.

### Trim down memory allocation.

Right now, running a minimal shader program takes up 24MB of memory.  There's a big 16MB allocation for swrast spans, then some more allocations of around 1MB each for TNL, 1.5MB for the register allocator, and then a bunch of noise.

On a core/GLES3 context, we skip the swrast and tnl allocations, but most apps aren't core apps.  If we could delay the swrast/tnl allocations until needed, that would save people a ton of memory.  The bitmap/drawpixels/rendermode tasks above are motivated by making it possible to not initialize swrast at all.

# Pre-gen6 (Iron Lake and older)

### Port fast color clears from gen6+ to gen5/4 (moderate)

While it's a very minor win by itself, this would be the first step in
getting blorp ported so that we can do efficient glBlitFramebuffer()
and glCopyTexSubImage() on older gens.

### Port GL_ARB_blend_func_extended to gen4/5 (easy)

While we don't use it in our 2D driver or cairo-gl yet (no glamor
support), it should be a significant win when we do.

### Port GL_EXT_transform_feedback to gen4/5 (hard)

You have to run the geometry shader and have it write the vertices out
to a surface, like the gen6 code does. [You also have to do the accounting yourself, as the SVBI hardware support only exists on gen6+ -chrisf]

### Port HiZ support to gen5 (hard)

It should be worth a 10-20% performance boost on most apps, but expect
a lot of work in getting piglit fbo-depthstencil tests working.

### Use transposed URB reads on g4x (moderate)

This would cut the URB space used for sf->wm data, allowing more concurrency.  See the g45-transposed-read branch of ~anholt/mesa.

### Port ARB_uniform_buffer_object to gen4/5 (moderate)

The gen6 code generation may just work on gen4/5, but there will probably be a little bit of work to get the brw_wm_surface_state.c code updated.

### Port ARB_texture_buffer_object/ARB_texture_buffer_object_rgb32/ARB_texture_buffer_range (easy)

The gen6 code generation may just work on gen4/5, but there will probably be a little bit of work to get the brw_wm_surface_state.c code updated.