


Glamor _should_ be able to accelerate all of X, given a sufficiently capable GL implementation. It doesn't, yet, and this document attempts to describe the remaining work.

## Core ops

The two bits of core GC state that tend to cause fallbacks are GCFunction and GCPlaneMask. For the former we try to use glLogicOp, which works for draw calls but not for texture image calls, and doesn't exist in GLES. For the latter we basically just punt if the planemask isn't trivial.

We could probably accelerate both of the above if the GL(SL) below us has something like [[MESA_shader_integer_functions|https://www.opengl.org/registry/specs/MESA/shader_integer_functions.txt]] or [[EXT_shader_framebuffer_fetch|https://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_framebuffer_fetch.txt]].
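To make concrete what a shader would have to compute per pixel, here is a minimal CPU-side sketch in C of an X raster op combined with a plane mask. The enum values follow the GXclear..GXset numbering from X.h; the names `enum rop`, `rop_eval` and `apply_rop` are made up for illustration and are not glamor functions. With integer operations and framebuffer fetch, a fragment shader could in principle evaluate the same expression against the fetched destination value.

```c
#include <assert.h>
#include <stdint.h>

/* Same numbering as GXclear..GXset in X.h, redefined so the sketch stands alone. */
enum rop {
    ROP_CLEAR = 0x0, ROP_AND = 0x1, ROP_AND_REVERSE = 0x2, ROP_COPY = 0x3,
    ROP_AND_INVERTED = 0x4, ROP_NOOP = 0x5, ROP_XOR = 0x6, ROP_OR = 0x7,
    ROP_NOR = 0x8, ROP_EQUIV = 0x9, ROP_INVERT = 0xa, ROP_OR_REVERSE = 0xb,
    ROP_COPY_INVERTED = 0xc, ROP_OR_INVERTED = 0xd, ROP_NAND = 0xe, ROP_SET = 0xf
};

/* Evaluate one of the sixteen raster ops on whole pixel values. */
static uint32_t rop_eval(enum rop op, uint32_t src, uint32_t dst)
{
    switch (op) {
    case ROP_CLEAR:         return 0;
    case ROP_AND:           return src & dst;
    case ROP_AND_REVERSE:   return src & ~dst;
    case ROP_COPY:          return src;
    case ROP_AND_INVERTED:  return ~src & dst;
    case ROP_NOOP:          return dst;
    case ROP_XOR:           return src ^ dst;
    case ROP_OR:            return src | dst;
    case ROP_NOR:           return ~(src | dst);
    case ROP_EQUIV:         return ~src ^ dst;
    case ROP_INVERT:        return ~dst;
    case ROP_OR_REVERSE:    return src | ~dst;
    case ROP_COPY_INVERTED: return ~src;
    case ROP_OR_INVERTED:   return ~src | dst;
    case ROP_NAND:          return ~(src & dst);
    case ROP_SET:           return 0xffffffffu;
    }
    return dst;
}

/* Bits outside the planemask keep their old destination value. */
uint32_t apply_rop(enum rop op, uint32_t planemask, uint32_t src, uint32_t dst)
{
    assert((unsigned)op <= ROP_SET);
    uint32_t result = rop_eval(op, src, dst);
    return (result & planemask) | (dst & ~planemask);
}
```

For example, `apply_rop(ROP_COPY, 0x000000ff, src, dst)` replaces only the low byte of `dst`, which is the kind of non-trivial planemask that currently falls back.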

Composite clip walk could likely use [[EXT_window_rectangles|https://www.opengl.org/registry/specs/EXT/window_rectangles.txt]] instead of glScissor. Alternatively, if GL had an R1 texture format (effectively a bitmap) then it might be a win to use that as a clip mask.

### Blits

Overlapping blits are hard for GL. We handle them by allocating a temporary the size of the overlapping region, copying into that, and then copying back. For large overlaps it would be cheaper to allocate a new fbo the same size as the source, copy into the (clipped) destination, copy the rest untranslated from the source, swap the destination in as the new pixmap, and free the original. "Large" is probably some heuristic percentage of the image dirtied by the blit.

This would have obvious interactions with DRI though.
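As a rough illustration of the "large overlap" heuristic, the following C sketch computes the overlap between a blit's source and destination rectangles and compares the dirtied area against a percentage of the pixmap. The function names and the `threshold_percent` parameter are hypothetical; nothing here is taken from the glamor source, and the right threshold would have to come from profiling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Area shared by a w x h rectangle and the same rectangle shifted by (dx, dy). */
int64_t blit_overlap_area(int w, int h, int dx, int dy)
{
    assert(w >= 0 && h >= 0);
    int64_t ow = (int64_t)w - (dx < 0 ? -(int64_t)dx : dx);
    int64_t oh = (int64_t)h - (dy < 0 ? -(int64_t)dy : dy);
    if (ow <= 0 || oh <= 0)
        return 0;
    return ow * oh;
}

/*
 * Hypothetical policy: take the "new fbo and swap" path when the source and
 * destination overlap and the blit dirties at least threshold_percent of
 * the pixmap.
 */
bool blit_should_swap(int pix_w, int pix_h, int w, int h, int dx, int dy,
                      int threshold_percent)
{
    assert(pix_w > 0 && pix_h > 0);
    if (blit_overlap_area(w, h, dx, dy) == 0)
        return false;   /* no overlap, so an ordinary copy is safe */
    int64_t dirtied = (int64_t)w * h;
    int64_t total = (int64_t)pix_w * pix_h;
    return dirtied * 100 >= total * threshold_percent;
}
```

A scroll of a 900x900 region by one pixel inside a 1000x1000 pixmap would take the swap path with a 50% threshold, while a 100x100 scroll would keep using the temporary.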

CopyPlane to a destination that's >1bpp expands the selected bit plane of the source through the GC foreground/background colors. You would need integer operations in the shader, a bitmap texture format, or both to accelerate this, depending on whether you wanted to accelerate just bitmap sources or arbitrary depths.
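The per-pixel work is small, which is why integer support in the shader would be enough. A minimal C sketch of the expansion follows, assuming one `uint32_t` per source and destination pixel; `copy_plane_expand` is an illustrative name, not an existing function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand one bit plane of src through the foreground/background pixels. */
void copy_plane_expand(const uint32_t *src, uint32_t *dst, size_t n,
                       uint32_t plane, uint32_t fg, uint32_t bg)
{
    /* CopyPlane requires exactly one bit set in the plane argument. */
    assert(plane != 0 && (plane & (plane - 1)) == 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] & plane) ? fg : bg;
}
```

In a shader, the loop body would be the whole fragment program: sample the source as an integer, test the bit, and select between two uniforms.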

### Image ops

PutImage is implemented with TexSubImage2D, so is only accelerated for GXcopy operations. If LogicOp works, you could accelerate this by using a temporary texture and drawing through it to the destination.

PutImage can't accelerate XYPixmap because that's not an image format that GL hardware (or anything postdating the Amiga, really) is equipped to deal with. For 1bpp (equivalent to XYBitmap) you could plausibly do the color expansion in software to a temporary, or accelerate it if you had bitmap textures. For >1bpp XYPixmap it's a lot harder.

PutImage (not ShmPutImage) tends to memcpy once more than a classic X driver because it buffers the image data into the GL command stream, where exa might just blast the data directly into the pixmap. It's difficult to get around this without either a hint to the GL that the upload should be immediate (with all the stall/flush that implies), or some _serious_ surgery to the protocol buffering code.

GetImage falls back entirely for XYPixmap, and again, GL very wisely doesn't believe in XYPixmap. Best approach is likely to download to a temporary as if ZPixmap, and then walk the planemask pulling out a plane at a time into the reply buffer. To do that one first needs to fix GetImage's calling convention so that a single XYPixmap get is responsible for all planes, not one at a time.
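The "download as ZPixmap, then pull out planes" step could look roughly like the C sketch below. It assumes one `uint32_t` per downloaded pixel, MSB-first bit packing, and rows padded only to a byte boundary; the real server has to honor the bitmap bit order and scanline pad it advertised to the client. `z_to_xy` and `plane_stride` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes per plane row, assuming (for this sketch only) 8-bit scanline padding. */
static size_t plane_stride(int width)
{
    return ((size_t)width + 7) / 8;
}

/*
 * Split a width x height ZPixmap into XYPixmap planes. Only planes selected
 * by planemask are written, most significant first. Returns the number of
 * planes written; out must hold that many planes of stride * height bytes.
 */
int z_to_xy(const uint32_t *z, int width, int height, int depth,
            uint32_t planemask, uint8_t *out)
{
    assert(depth >= 1 && depth <= 32 && width >= 0 && height >= 0);
    size_t stride = plane_stride(width);
    size_t plane_bytes = stride * (size_t)height;
    int planes = 0;

    for (int bit = depth - 1; bit >= 0; bit--) {
        uint32_t plane = (uint32_t)1 << bit;
        if (!(planemask & plane))
            continue;
        uint8_t *p = out + (size_t)planes * plane_bytes;
        memset(p, 0, plane_bytes);
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                if (z[(size_t)y * (size_t)width + (size_t)x] & plane)
                    p[(size_t)y * stride + (size_t)(x / 8)] |=
                        (uint8_t)(0x80 >> (x % 8));
        planes++;
    }
    return planes;
}
```

Note that this handles every requested plane in a single call, which is the calling-convention change the paragraph above asks for.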

### Geometry ops

It's extremely difficult to want to reimplement the arc or polygon code directly in the shader, as neither one sees very much use even in legacy apps. We should profile to make sure the current approach of decomposing to spans on the CPU is "fast enough".

The rectangle ops are probably about as perfect as they're going to get.

### Text ops

When damage is active for a drawable we look up the glyphs twice, once in damage and once in glamor, which is a waste of CPU.

### Span ops

FillSpans might benefit from an alternate calling convention, where rather than being passed a span list the caller asks the driver to allocate the span storage, fills that in, then calls Fill.  This would let us eliminate the copy into the vbo and just store there directly.
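One possible shape for that calling convention is sketched below in C. Everything here is hypothetical: `struct span`, `span_buf_alloc` and `span_buf_fill` do not exist in the server, and `malloc` stands in for what would really be a mapped vbo range. The point is only the ordering: the driver hands out storage, the caller writes spans into it, and the fill consumes them in place.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One span in the layout the driver would want in its vertex buffer. */
struct span {
    int16_t x, y;
    uint16_t width;
};

struct span_buf {
    struct span *spans;   /* would point into a mapped vbo; malloc stands in */
    int count;
};

/* Step 1: the caller asks the driver for span storage. */
struct span_buf *span_buf_alloc(int count)
{
    assert(count > 0);
    struct span_buf *buf = malloc(sizeof(*buf));
    if (!buf)
        return NULL;
    buf->spans = malloc(sizeof(struct span) * (size_t)count);
    if (!buf->spans) {
        free(buf);
        return NULL;
    }
    buf->count = count;
    return buf;
}

/*
 * Step 3: the driver draws straight from the storage it handed out, with no
 * intermediate copy. In place of a draw call this sketch just returns the
 * number of pixels covered, then releases the storage.
 */
long span_buf_fill(struct span_buf *buf)
{
    long pixels = 0;
    for (int i = 0; i < buf->count; i++)
        pixels += buf->spans[i].width;
    free(buf->spans);
    free(buf);
    return pixels;
}
```

Step 2, between the two calls, is the caller writing directly into `buf->spans[i]`.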

GetSpans and SetSpans are irrelevant: you can't hit them unless you're using the mi blit routines, and we're not.

## Render

The Render implementation needs major work:

0. Determine if the largepixmap code is salvageable or easily replaceable.

  Right now modification of the Render extension support is nearly impossible because of the largepixmap render support.  If we got rid of this it would be easy to hack on, but it may be that we need it for performance (for example, on Firefox's scaling of large photographs).

1. Use GL_ARB_texture_view to reinterpret pixmaps as various sets of channels.

  Without this, Render ops with a2r10g10b10 or r5g6b5 will fall back to software because we only store pixmaps as either a8r8g8b8 or a8.

  This requires use of GL_ARB_texture_storage and therefore GL_ARB_sampler_objects.  See [[https://github.com/anholt/xserver/tree/glamor-sampler-objects]]

2. Use the new program infrastructure for transformations.

  We do a ton of computation on the CPU for the coordinates of our Render ops. These are best done in shader code, instead.

3. Accelerate trapezoids

  With modern GL, we should be just fine writing the trapezoids rendering in a shader, improving xlib-cairo path performance massively.  Today, we allocate an in-memory pixmap, rasterize traps into it, then upload to GL and do a Composite from there.

4. Accelerate the Render 0.11 blend modes

  Right now we're only doing the basic Porter-Duff modes. 0.11 adds the PDF blend modes (dodge, burn, etc.) and there's no reason we couldn't accelerate them as well.
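For a sense of how little arithmetic the simpler blend modes from item 4 need, here is a C sketch of a few separable blend functions as defined in the PDF imaging model. It works on a single non-premultiplied channel in [0, 1] and ignores alpha entirely, which is a simplification: Render pictures are premultiplied, so a real shader would need the full compositing formula around these. Dodge and burn are left out because they need division and special cases. The names `enum blend` and `blend_channel` are made up for this example.

```c
#include <assert.h>

enum blend {
    BLEND_MULTIPLY,
    BLEND_SCREEN,
    BLEND_DARKEN,
    BLEND_LIGHTEN,
    BLEND_DIFFERENCE,
    BLEND_EXCLUSION
};

/* Blend one colour channel; both inputs must lie in [0, 1]. */
float blend_channel(enum blend mode, float backdrop, float source)
{
    assert(backdrop >= 0.0f && backdrop <= 1.0f);
    assert(source >= 0.0f && source <= 1.0f);

    switch (mode) {
    case BLEND_MULTIPLY:
        return backdrop * source;
    case BLEND_SCREEN:
        return backdrop + source - backdrop * source;
    case BLEND_DARKEN:
        return backdrop < source ? backdrop : source;
    case BLEND_LIGHTEN:
        return backdrop > source ? backdrop : source;
    case BLEND_DIFFERENCE:
        return backdrop > source ? backdrop - source : source - backdrop;
    case BLEND_EXCLUSION:
        return backdrop + source - 2.0f * backdrop * source;
    }
    return source;
}
```

Each case maps directly onto a line or two of GLSL once the shader can read the destination colour, either by sampling a copy of it or through framebuffer fetch.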

## XVideo

Might be nice to add support for more formats? Should inspect the source of various video players and see what they'd prefer.

## Übershader

Switching among shader programs isn't free. Would be worth investigating whether a single shader performs as well as the current design.

## Flush reduction

Currently we flush in BlockHandler, which is quite often. Strictly speaking we only need to do this when we're trying to synchronize between the X and GL command streams (or, arguably, when sending damage events), and can otherwise let the implicit flush in SwapBuffers do the job for us.

The GL sync extensions should be able to help us out with this, though on at least i965 they're equivalent to a flush so they're not _that_ much help.

## Pseudocolor

Glamor doesn't accelerate drawing to pseudocolor at all. Probably a "good enough" solution would be to draw to the 8bpp surface in software as normal, and expand to 32bpp with a paletted texture (or equivalently, dependent texture lookups). Drawing to R8 directly would be... let's say "difficult".
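The expansion step is just a table lookup per pixel. A minimal C sketch follows, assuming a 256-entry palette of packed 32-bit pixels; `expand_palette` is an illustrative name. A dependent texture lookup in a fragment shader would do the same thing, with the palette held in a 256x1 texture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand n 8bpp colormap indices to 32bpp pixels through the palette. */
void expand_palette(const uint8_t *indices, uint32_t *out, size_t n,
                    const uint32_t palette[256])
{
    assert(indices && out && palette);
    for (size_t i = 0; i < n; i++)
        out[i] = palette[indices[i]];
}
```

Because the index is a `uint8_t`, the lookup can never run off the end of the 256-entry table.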

## Fixed since 1.19

- GetImage for ZPixmap with non-trivial planemask is no longer a fallback.
- Overlapping blits now use [[MESA_tile_raster_order|https://www.khronos.org/registry/OpenGL/extensions/MESA/MESA_tile_raster_order.txt]] if available (~40% faster on VC4).