1 files changed, 286 insertions, 23 deletions
diff --git a/docs/drivers/panfrost.rst b/docs/drivers/panfrost.rst
index 0ed8b2d8cbd..7fc1a32e9f0 100644
--- a/docs/drivers/panfrost.rst
+++ b/docs/drivers/panfrost.rst
@@ -1,29 +1,37 @@
 Panfrost
 ========
 
-The Panfrost driver stack includes a **non-conformant** OpenGL ES
-implementation for Arm Mali GPUs based on the Midgard and Bifrost
-microarchitectures. The following GPUs are currently supported:
-
-=========  ============ ============ =======
-Product    Architecture OpenGL ES    OpenGL
-=========  ============ ============ =======
-Mali T720  Midgard (v4) 2.0          2.1
-Mali T760  Midgard (v5) 3.1          3.1
-Mali T820  Midgard (v5) 3.1          3.1
-Mali T860  Midgard (v5) 3.1          3.1
-Mali G72   Bifrost (v6) 3.1          3.1
-Mali G31   Bifrost (v7) 3.1          3.1
-Mali G52   Bifrost (v7) 3.1          3.1
-=========  ============ ============ =======
-
-Other Midgard and Bifrost chips (T604, T620, T830, T880, G71, G51, G76) may
-work but may be buggy. End users are advised against using Panfrost on
-unsupported hardware. Developers interested in porting will need to allowlist
-the hardware (``src/gallium/drivers/panfrost/pan_screen.c``).
+The Panfrost driver stack includes an OpenGL ES implementation for Arm Mali
+GPUs based on the Midgard and Bifrost microarchitectures. It is **conformant**
+on Mali-G52 and Mali-G57 but **non-conformant** on other GPUs. The following
+hardware is currently supported:
+
+=========  ============= ============ =======
+Product    Architecture  OpenGL ES    OpenGL
+=========  ============= ============ =======
+Mali T600  Midgard (v4)  2.0          2.1
+Mali T620  Midgard (v4)  2.0          2.1
+Mali T720  Midgard (v4)  2.0          2.1
+Mali T760  Midgard (v5)  3.1          3.1
+Mali T820  Midgard (v5)  3.1          3.1
+Mali T830  Midgard (v5)  3.1          3.1
+Mali T860  Midgard (v5)  3.1          3.1
+Mali T880  Midgard (v5)  3.1          3.1
+Mali G72   Bifrost (v6)  3.1          3.1
+Mali G31   Bifrost (v7)  3.1          3.1
+Mali G51   Bifrost (v7)  3.1          3.1
+Mali G52   Bifrost (v7)  3.1          3.1
+Mali G76   Bifrost (v7)  3.1          3.1
+Mali G57   Valhall (v9)  3.1          3.1
+Mali G310  Valhall (v10) 3.1          3.1
+Mali G610  Valhall (v10) 3.1          3.1
+=========  ============= ============ =======
+
+Other Midgard and Bifrost chips (e.g. G71) are not yet supported.
 
 Older Mali chips based on the Utgard architecture (Mali 400, Mali 450) are
-supported in the Lima driver, not Panfrost. Lima is also available in Mesa.
+supported in the :doc:`Lima <lima>` driver, not Panfrost. Lima is also
+available in Mesa.
 
 Other graphics APIs (Vulkan, OpenCL) are not supported at this time.
 
@@ -39,7 +47,7 @@ it's easy to add support, see the commit ``cff7de4bb597e9`` as an example.
 LLVM is *not* required by Panfrost's compilers. LLVM support in Mesa can
 safely be disabled for most OpenGL ES users with Panfrost.
 
-Build like ``meson . build/ -Ddri-drivers= -Dvulkan-drivers=
+Build like ``meson . build/ -Dvulkan-drivers=
 -Dgallium-drivers=panfrost -Dllvm=disabled`` for a build directory
 ``build``.
 
@@ -49,4 +57,259 @@ For general information on building Mesa, read :doc:`the install documentation
 Chat
 ----
 
-Panfrost developers and users hang out on IRC at ``#panfrost`` on OFTC.
+Panfrost developers and users hang out on IRC at ``#panfrost`` on OFTC. Note
+that registering and authenticating with ``NickServ`` is required to prevent
+spam. `Join the chat. <https://webchat.oftc.net/?channels=panfrost>`_
+
+Compressed texture support
+--------------------------
+
+In the driver, Panfrost supports ASTC, ETC, and all BCn formats (RGTC, S3TC,
+etc.). However, Panfrost depends on the hardware to support these formats
+efficiently. All supported Mali architectures support these formats, but not
+every system-on-chip with a Mali GPU supports all of them. Many lower-end
+systems lack support for some BCn formats, which can cause problems when playing
+desktop games with Panfrost. To check whether this issue applies to your
+system-on-chip, Panfrost includes a ``panfrost_texfeatures`` tool to query
+supported formats.
+
+To use this tool, include the option ``-Dtools=panfrost`` when configuring Mesa.
+Then inside your Mesa build directory, the tool is located at
+``src/panfrost/tools/panfrost_texfeatures``. Copy it to your target device,
+mark it executable if necessary, and run it there. A table of supported
+formats will be printed to standard output.
+
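+As a rough sketch of that workflow, the commands below extend the build
+invocation from the building section with ``-Dtools=panfrost``. The ``ninja``
+step, the ``scp``/``ssh`` transfer, and the ``user@board`` host name are
+assumptions for illustration, and the build is assumed to target the device's
+architecture; substitute whatever matches your setup:
+
+.. code-block:: sh
+
+   # Configure and build Mesa with the Panfrost tools enabled
+   meson . build/ -Dvulkan-drivers= -Dgallium-drivers=panfrost \
+   -Dllvm=disabled -Dtools=panfrost
+   ninja -C build/
+
+   # Copy the tool to the (hypothetical) target device and run it there
+   scp build/src/panfrost/tools/panfrost_texfeatures user@board:
+   ssh user@board 'chmod +x panfrost_texfeatures && ./panfrost_texfeatures'
+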
+drm-shim
+--------
+
+Panfrost implements ``drm-shim``, stubbing out the Panfrost kernel interface.
+Use cases for this functionality include:
+
+- Future hardware bring up
+- Running shader-db on non-Mali workstations
+- Reproducing compiler (and some driver) bugs without Mali hardware
+
+Although Mali hardware is usually paired with an Arm CPU, Panfrost is portable C
+code and should work on any Linux machine. In particular, you can test the
+compiler on shader-db on an Intel desktop.
+
+To build Mesa with Panfrost drm-shim, configure Meson with
+``-Dgallium-drivers=panfrost`` and ``-Dtools=drm-shim``. See the above
+building section for a full invocation. The drm-shim binary will be built to
+``build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so``.
+
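+A minimal sketch of such an invocation, reusing the flags from the building
+section and adding ``-Dtools=drm-shim`` (the ``ninja`` step is an assumption
+about how the build is driven):
+
+.. code-block:: sh
+
+   meson . build/ -Dvulkan-drivers= -Dgallium-drivers=panfrost \
+   -Dllvm=disabled -Dtools=drm-shim
+   ninja -C build/
+
+   # The shim should then exist at the path given above
+   ls build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so
+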
+To use, set the ``LD_PRELOAD`` environment variable to the drm-shim binary.  It
+may also be necessary to set ``LIBGL_DRIVERS_PATH`` to the location where Mesa
+was installed.
+
+By default, drm-shim mocks a Mali-G52 system. To select a specific Mali GPU,
+set the ``PAN_GPU_ID`` environment variable to the desired GPU ID:
+
+=========  ============= =======
+Product    Architecture  GPU ID
+=========  ============= =======
+Mali-T720  Midgard (v4)  720
+Mali-T860  Midgard (v5)  860
+Mali-G72   Bifrost (v6)  6221
+Mali-G52   Bifrost (v7)  7212
+Mali-G57   Valhall (v9)  9093
+Mali-G610  Valhall (v10) a867
+=========  ============= =======
+
+Additional GPU IDs are enumerated in the ``panfrost_model_list`` list in
+``src/panfrost/lib/pan_props.c``.
+
+As an example: assuming Mesa is installed to a local path ``~/lib`` and Mesa's
+build directory is ``~/mesa/build``, a shader can be compiled for Mali-G52 as:
+
+.. code-block:: sh
+
+   ~/shader-db$ BIFROST_MESA_DEBUG=shaders \
+   LIBGL_DRIVERS_PATH=~/lib/dri/ \
+   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
+   PAN_GPU_ID=7212 \
+   ./run shaders/glmark/1-1.shader_test
+
+The same shader can be compiled for Mali-T720 as:
+
+.. code-block:: sh
+
+   ~/shader-db$ MIDGARD_MESA_DEBUG=shaders \
+   LIBGL_DRIVERS_PATH=~/lib/dri/ \
+   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
+   PAN_GPU_ID=720 \
+   ./run shaders/glmark/1-1.shader_test
+
+These examples set the compilers' ``shaders`` debug flags to dump the optimized
+NIR, backend IR after instruction selection, backend IR after register
+allocation and scheduling, and a disassembly of the final compiled binary.
+
+As another example, this invocation runs a single dEQP test "on" Mali-G52,
+pretty-printing GPU data structures and disassembling all shaders
+(``PAN_MESA_DEBUG=trace``) as well as dumping raw GPU memory
+(``PAN_MESA_DEBUG=dump``). The ``EGL_PLATFORM=surfaceless`` environment variable
+and various flags to dEQP mimic the surfaceless environment that our
+continuous integration (CI) uses. This eliminates window system dependencies,
+although it requires a specially built CTS:
+
+.. code-block:: sh
+
+   ~/VK-GL-CTS/build/external/openglcts/modules$ PAN_MESA_DEBUG=trace,dump \
+   LIBGL_DRIVERS_PATH=~/lib/dri/ \
+   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
+   PAN_GPU_ID=7212 EGL_PLATFORM=surfaceless \
+   ./glcts --deqp-surface-type=pbuffer \
+   --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 \
+   --deqp-surface-height=256 -n \
+   dEQP-GLES31.functional.shaders.builtin_functions.common.abs.float_highp_compute
+
+U-interleaved tiling
+---------------------
+
+Panfrost supports u-interleaved tiling. U-interleaved tiling is
+indicated by the ``DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED`` modifier.
+
+The tiling reorders whole pixels (blocks). It does not compress or modify the
+pixels themselves, so it can be used for any image format. Internally, images
+are divided into tiles. Tiles occur in source order, but pixels (blocks) within
+each tile are reordered according to a space-filling curve.
+
+For regular formats, 16x16 tiles are used. This harmonizes with the default tile
+size for binning and CRCs (transaction elimination). It also means a single line
+(16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.
+
+For formats that are already block compressed (S3TC, RGTC, etc.), 4x4 tiles are
+used, where entire blocks are reordered. Most of these formats compress 4x4
+blocks, so this gives an effective 16x16 tiling. This justifies the tile size
+intuitively, though it's not a rule: ASTC may use larger blocks.
+
+Within a tile, the X and Y bits are interleaved (like Morton order), but with a
+twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious.
+Visually, addresses take the form::
+
+   | y3 | (y3 ^ x3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
+
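+As a worked example, assuming the layout above is read with the most
+significant bit on the left, take the pixel at :math:`x = 5 = 0101_2` and
+:math:`y = 3 = 0011_2` within a 16x16 tile. Substituting bit by bit gives:
+
+.. math::
+   y_3 = 0,\; y_3 \oplus x_3 = 0,\; y_2 = 0,\; y_2 \oplus x_2 = 1,\;
+   y_1 = 1,\; y_1 \oplus x_1 = 1,\; y_0 = 1,\; y_0 \oplus x_0 = 0
+
+so the pixel lands at index :math:`00011110_2 = 30` within the tile.
+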
+Reference routines to encode/decode u-interleaved images are available in
+``src/panfrost/shared/test/test-tiling.cpp``, which documents the space-filling
+curve. This reference implementation is used to unit test the optimized
+implementation used in production. The optimized implementation is available in
+``src/panfrost/shared/pan_tiling.c``.
+
+Although these routines are part of Panfrost, they are also used by Lima, as Arm
+introduced the format with Utgard. It is the only tiling supported on Utgard. On
+Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
+should be used instead where possible. However, not all formats are
+compressible, so u-interleaved tiling remains an important fallback on Panfrost.
+
+Instancing
+----------
+
+The attribute descriptor lets the attribute unit compute the address of an
+attribute given the vertex and instance ID. Unfortunately, the way this works is
+rather complicated when instancing is enabled.
+
+To explain this, first we need to explain how compute and vertex threads are
+dispatched.  When a quad is dispatched, it receives a single, linear index.
+However, we need to translate that index into a (vertex id, instance id) pair.
+One option would be to do:
+
+.. math::
+   \text{vertex id} = \text{linear id} \% \text{num vertices}
+
+   \text{instance id} = \text{linear id} / \text{num vertices}
+
+but this involves a costly division and modulus by an arbitrary number.
+Instead, we could pad num_vertices. We dispatch padded_num_vertices *
+num_instances threads instead of num_vertices * num_instances, which results
+in some "extra" threads with vertex_id >= num_vertices, which we have to
+discard.  The more we pad num_vertices, the more "wasted" threads we
+dispatch, but the division is potentially easier.
+
+One straightforward choice is to pad num_vertices to the next power of two,
+which means that the division and modulus are just simple bit shifts and
+masking. But the actual algorithm is a bit more complicated. The thread
+dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
+to dividing by a power of two. As a result, padded_num_vertices can be
+1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
+since we need less padding.
+
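+To make the trade-off concrete, consider num_vertices = 70 (the same count used
+in the example further down). Padding to the next power of two versus the
+smallest allowed multiple wastes, per instance:
+
+.. math::
+   2^7 - 70 = 128 - 70 = 58 \text{ threads}
+
+   9 \cdot 2^3 - 70 = 72 - 70 = 2 \text{ threads}
+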
+padded_num_vertices is picked by the hardware. The driver just specifies the
+actual number of vertices. Note that padded_num_vertices is a multiple of four
+(presumably because threads are dispatched in groups of 4). Also,
+padded_num_vertices is always at least one more than num_vertices, which seems
+like a quirk of the hardware. For larger num_vertices, the hardware uses the
+following algorithm: using the binary representation of num_vertices, we look at
+the most significant set bit as well as the following 3 bits. Let n be the
+number of bits after those 4 bits. Then we set padded_num_vertices according to
+the following table:
+
+==========  =======================
+high bits   padded_num_vertices
+==========  =======================
+1000        :math:`9 \cdot 2^n`
+1001        :math:`5 \cdot 2^{n+1}`
+101x        :math:`3 \cdot 2^{n+2}`
+110x        :math:`7 \cdot 2^{n+1}`
+111x        :math:`2^{n+4}`
+==========  =======================
+
+For example, if num_vertices = 70 is passed to glDraw(), its binary
+representation is 1000110, so n = 3 and the high bits are 1000, and
+therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
+
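+Two further worked examples of the table, using arbitrary illustrative vertex
+counts: num_vertices = 100 is 1100100 in binary, so the high bits are 1100 and
+n = 3; num_vertices = 1000 is 1111101000, so the high bits are 1111 and n = 6:
+
+.. math::
+   100 \rightarrow 7 \cdot 2^{3+1} = 112
+
+   1000 \rightarrow 2^{6+4} = 1024
+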
+The attribute unit works in terms of the original linear_id. If
+num_instances = 1, then they are the same, and everything is simple.
+However, with instancing things get more complicated. There are four
+possible modes, two of which we group together:
+
+1. Use the linear_id directly. Only used when there is no instancing.
+
+2. Use the linear_id modulo a constant. This is used for per-vertex
+   attributes with instancing enabled by making the constant equal to
+   padded_num_vertices. Because the modulus is always padded_num_vertices, this
+   mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
+   The shift field specifies the power of two, while the extra_flags field
+   specifies the odd number. If shift = n and extra_flags = m, then the modulus
+   is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
+   computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
+   extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
+   algorithm used to get padded_num_vertices in order to correctly implement
+   per-vertex attributes.
+
+3. Divide the linear_id by a constant. In order to correctly implement
+   instance divisors, we have to divide linear_id by padded_num_vertices times
+   the user-specified divisor. So first we compute padded_num_vertices, again
+   following the exact same algorithm that the hardware uses, then multiply it
+   by the GL-level divisor to get the hardware-level divisor. This case is
+   further divided into two more cases. If the hardware-level divisor is a
+   power of two, then we just need to shift. The shift amount is specified by
+   the shift field, so that the hardware-level divisor is just
+   :math:`2^\text{shift}`.
+
+If it isn't a power of two, then we have to divide by an arbitrary integer.
+For that, we use the well-known technique of multiplying by an approximation
+of the inverse. The driver must compute the magic multiplier and shift
+amount, and then the hardware does the multiplication and shift. The
+hardware and driver also use the "round-down" optimization as described in
+https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
+The hardware further assumes the multiplier is between :math:`2^{31}` and
+:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
+to 0 by the driver -- presumably this simplifies the hardware multiplier a
+little. The hardware first multiplies linear_id by the multiplier and
+takes the high 32 bits, then applies the round-down correction if
+extra_flags = 1, then finally shifts right by the shift field.
+
+There are some differences between ridiculousfish's algorithm and the Mali
+hardware algorithm, which means that the reference code from ridiculousfish
+doesn't always produce the right constants. Mali does not use the pre-shift
+optimization, since that would make a hardware implementation slower (it
+would have to always do the pre-shift, multiply, and post-shift operations).
+It also forces the multiplier to be at least :math:`2^{31}`, which means
+that the exponent is entirely fixed, so there is no trial-and-error.
+Altogether, given the divisor d, the algorithm the driver must follow is:
+
+1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
+2. Compute :math:`m = \lceil 2^{\text{shift} + 32} / d \rceil` and
+   :math:`e = 2^{\text{shift} + 32} \% d`.
+3. If :math:`e \le 2^{\text{shift}}`, then we need to use the round-down
+   algorithm. Set magic_divisor = m - 1 and extra_flags = 1.
+4. Otherwise, set magic_divisor = m and extra_flags = 0.
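+
+As a worked example of these steps (the divisors are chosen for illustration),
+take d = 72, i.e. padded_num_vertices = 72 with a GL-level divisor of 1:
+
+.. math::
+   \text{shift} = \lfloor \log_2(72) \rfloor = 6
+
+   m = \lceil 2^{38} / 72 \rceil = 3817748708, \quad e = 2^{38} \% 72 = 40
+
+Since :math:`e = 40 \le 2^6 = 64`, the round-down case applies:
+magic_divisor = 3817748707 and extra_flags = 1. For an arbitrary d = 11, picked
+only to exercise the other branch:
+
+.. math::
+   \text{shift} = 3, \quad m = \lceil 2^{35} / 11 \rceil = 3123612579, \quad
+   e = 2^{35} \% 11 = 10
+
+Here :math:`e = 10 > 2^3 = 8`, so magic_divisor = 3123612579 and
+extra_flags = 0. Both multipliers lie between :math:`2^{31}` and
+:math:`2^{32}`, as the hardware assumes.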