


Glamor _should_ be able to accelerate all of X, given a sufficiently capable GL implementation. It doesn't, yet, and this document attempts to describe the remaining work.

## Core ops

The two bits of core GC state that tend to cause fallbacks are GCFunction and GCPlaneMask. For the former we try to use glLogicOp, which works for draw calls but not for texture image calls, and doesn't exist in GLES. For the latter we basically just punt if the planemask isn't trivial.

We could probably accelerate both of the above if the GL(SL) below us has something like [[MESA_shader_integer_functions|https://www.opengl.org/registry/specs/MESA/shader_integer_functions.txt]] or [[EXT_framebuffer_fetch|https://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_framebuffer_fetch.txt]].
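For reference, the per-pixel result a shader would have to reproduce, given integer operations and access to the current destination value, is the X raster op followed by a planemask merge. The following is a minimal CPU-side sketch of that arithmetic, assuming 32-bit pixels. The names `enum rop`, `x_rop` and `x_apply_gc` are hypothetical and are not glamor functions; the numeric values mirror the GX* function codes from X.h.

```c
#include <stdint.h>

/* Same numeric values as the GX* codes in X.h, redefined so the sketch
 * builds without X headers. The names are hypothetical. */
enum rop {
    ROP_CLEAR = 0x0, ROP_AND = 0x1, ROP_AND_REVERSE = 0x2, ROP_COPY = 0x3,
    ROP_AND_INVERTED = 0x4, ROP_NOOP = 0x5, ROP_XOR = 0x6, ROP_OR = 0x7,
    ROP_NOR = 0x8, ROP_EQUIV = 0x9, ROP_INVERT = 0xa, ROP_OR_REVERSE = 0xb,
    ROP_COPY_INVERTED = 0xc, ROP_OR_INVERTED = 0xd, ROP_NAND = 0xe,
    ROP_SET = 0xf
};

/* Combine a source and destination pixel according to the GC function. */
uint32_t x_rop(enum rop op, uint32_t src, uint32_t dst)
{
    switch (op) {
    case ROP_CLEAR:         return 0;
    case ROP_AND:           return src & dst;
    case ROP_AND_REVERSE:   return src & ~dst;
    case ROP_COPY:          return src;
    case ROP_AND_INVERTED:  return ~src & dst;
    case ROP_NOOP:          return dst;
    case ROP_XOR:           return src ^ dst;
    case ROP_OR:            return src | dst;
    case ROP_NOR:           return ~(src | dst);
    case ROP_EQUIV:         return ~(src ^ dst);
    case ROP_INVERT:        return ~dst;
    case ROP_OR_REVERSE:    return src | ~dst;
    case ROP_COPY_INVERTED: return ~src;
    case ROP_OR_INVERTED:   return ~src | dst;
    case ROP_NAND:          return ~(src & dst);
    case ROP_SET:           return 0xffffffffu;
    }
    return dst; /* unknown op: leave the destination untouched */
}

/* Only bits set in the planemask take the rop result; all other bits
 * keep the old destination value. */
uint32_t x_apply_gc(enum rop op, uint32_t planemask,
                    uint32_t src, uint32_t dst)
{
    return (x_rop(op, src, dst) & planemask) | (dst & ~planemask);
}
```

Both steps need the destination value, which is why something like framebuffer fetch (or an equivalent way to read the destination in the shader) is the interesting prerequisite.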

Composite clip walk could likely use [[EXT_window_rectangles|https://www.opengl.org/registry/specs/EXT/window_rectangles.txt]] instead of glScissor.

### Image ops

PutImage is implemented with TexSubImage2D, so it is only accelerated for GXcopy operations. If LogicOp works, you could accelerate the other operations by uploading to a temporary texture and drawing from it to the destination.

PutImage can't accelerate XYPixmap because that's not an image format that GL hardware (or anything postdating the Amiga, really) is equipped to deal with. You basically can't accelerate this unless you can accelerate arbitrary planemasks.

PutImage (not ShmPutImage) tends to memcpy once more than a classic X driver because it buffers the image data into the GL command stream, where exa might just blast the data directly into the pixmap. It's difficult to get around this without either a hint to the GL that the upload should be immediate (with all the stall/flush that implies), or some _serious_ surgery to the protocol buffering code.

GetImage falls back for ZPixmap with non-trivial planemask. There's not really a sane way to accelerate that through the GL, but it's cheap to fix up in software by just zeroing out the unset planes before writing to the client.
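The software fixup is a single pass over the downloaded pixels. A minimal sketch, assuming the image has already been read back as 32-bit pixels; `mask_zpixmap_planes` is a hypothetical helper name, and other pixel sizes would need their own loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Zero every bit that is not selected by the planemask.
 * Assumes one 32-bit word per pixel. */
void mask_zpixmap_planes(uint32_t *pixels, size_t count, uint32_t planemask)
{
    for (size_t i = 0; i < count; i++)
        pixels[i] &= planemask;
}
```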

GetImage falls back entirely for XYPixmap, and again, GL very wisely doesn't believe in XYPixmap. Best approach is likely to download to a temporary as if ZPixmap, and then walk the planemask pulling out a plane at a time into the reply buffer.
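The plane walk could look roughly like the sketch below. It assumes 32-bit source pixels, least-significant-bit-first bit order within each byte, and scanlines padded to 32 bits; a real server would take bit order and padding from its advertised bitmap format. `plane_stride` and `zpixmap_to_xy_planes` are hypothetical names. Planes are emitted from most significant to least significant, as the protocol specifies for XYPixmap replies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes per scanline of one 1-bit plane, padded to 32 bits (assumed). */
size_t plane_stride(unsigned width)
{
    return ((size_t)width + 31) / 32 * 4;
}

/* Split ZPixmap-style pixels into one bitmap per selected plane.
 * `out` must hold plane_stride(width) * height bytes for every bit set
 * in the low `depth` bits of planemask. Returns the bytes written. */
size_t zpixmap_to_xy_planes(const uint32_t *pixels,
                            unsigned width, unsigned height,
                            unsigned depth, uint32_t planemask,
                            uint8_t *out)
{
    size_t stride = plane_stride(width);
    uint8_t *dst = out;

    for (int plane = (int)depth - 1; plane >= 0; plane--) {
        uint32_t bit = (uint32_t)1 << plane;

        if (!(planemask & bit))
            continue; /* unselected planes are not sent at all */

        memset(dst, 0, stride * height);
        for (unsigned y = 0; y < height; y++)
            for (unsigned x = 0; x < width; x++)
                if (pixels[(size_t)y * width + x] & bit)
                    dst[y * stride + x / 8] |= (uint8_t)(1u << (x % 8));

        dst += stride * height;
    }
    return (size_t)(dst - out);
}
```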

### Geometry ops

It's extremely difficult to want to reimplement the arc or polygon code directly in the shader, as neither one sees very much use even in legacy apps. We should profile to make sure the current approach of decomposing to spans on the CPU is "fast enough".

The rectangle ops are probably about as perfect as they're going to get.

### Text ops

TODO

### Span ops

SetSpans, like PutImage, operates directly on the texture, so it can't accelerate non-GXcopy operations.

FillSpans might benefit from an alternate calling convention: rather than being passed a span list, the caller asks the driver to allocate the span storage, fills that in, then calls Set or Fill. This would let us eliminate the copy into the vbo and just store there directly.
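As a rough illustration of that convention, here is a sketch in which a plain array stands in for the mapped vbo. Everything in it (`struct span`, `struct span_store`, `span_store_alloc`, `span_store_fill`) is hypothetical and not part of the X server or glamor; the fill step only counts pixels so the example stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* One span: start point plus width. */
struct span {
    int16_t x, y;
    uint16_t width;
};

/* Driver-owned storage; in a real driver this would be mapped vbo memory. */
struct span_store {
    struct span *spans;
    size_t capacity;
    size_t used;
};

/* The caller asks the driver for room for n spans and writes into it. */
struct span *span_store_alloc(struct span_store *store, size_t n)
{
    if (n > store->capacity - store->used)
        return NULL;

    struct span *p = store->spans + store->used;
    store->used += n;
    return p;
}

/* The driver consumes the spans in place, with no intermediate copy.
 * Here it just sums the covered pixels and resets the store. */
size_t span_store_fill(struct span_store *store)
{
    size_t pixels = 0;

    for (size_t i = 0; i < store->used; i++)
        pixels += store->spans[i].width;
    store->used = 0;
    return pixels;
}
```

The point of the design is that the span generator writes straight into memory the driver already owns, so the only copy left is the one the generator had to make anyway.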

GetSpans is irrelevant: you can't hit it unless you're using the mi blit routines, and we're not.

## Render

The Render implementation needs major work:

0) Determine if the largepixmap code is salvageable or easily replaceable.

Right now modification of the Render extension support is nearly impossible because of the largepixmap render support. If we got rid of this it would be easy to hack on, but it may be that we need it for performance (for example, for Firefox's scaling of large photographs).

1) Use GL_ARB_texture_view to reinterpret pixmaps as various sets of channels.

Without this, Render ops with a2r10g10b10 or r5g6b5 will fall back to software because we only store pixmaps as either a8r8g8b8 or a8.

This requires use of GL_ARB_texture_storage and therefore GL_ARB_sampler_objects.  See https://github.com/anholt/xserver/tree/glamor-sampler-objects

2) Use the new program infrastructure for transformations.

We do a ton of computation on the CPU for the coordinates of our Render ops. These are best done in shader code, instead.

3) Accelerate trapezoids

With modern GL, we should be just fine writing the trapezoid rendering in a shader, improving xlib-cairo path performance massively. Today, we allocate an in-memory pixmap, rasterize traps into it, then upload to GL and do a Composite from there.

## XVideo

TODO