Age | Commit message (Collapse) | Author | Files | Lines |
|
In the pursuit of lowering driver overhead, it became clear that some
amount of redesign of how libdrm_freedreno constructs the submit ioctl
would be needed. In particular, as the gallium driver is starting to
make heavier use of CP_SET_DRAW_STATE state groups/objects, the over-
head of tracking cmd buffers and relocs becomes too much. And for
"streaming" state, which isn't ever reused (like uniform uploads) the
overhead of allocating/freeing ringbuffer[1] objects is too high.
This redesign makes two main changes:
1) Introduces a fd_submit object for tracking bos and cmds table
for the submit ioctl, making ringbuffer objects more light-
weight. This was previously done in the ringbuffer. But we
have many ringbuffer instances involved in a submit (gmem +
draw + potentially 1000's of state-group rbs), and only need
a single bos and cmds table. (Reloc table is still per-rb)
The submit is also a convenient place for a slab allocator for
ringbuffer objects. Other options would have required locking
because, while we can guarantee allocations will only happen on
a single thread, free's could happen either on the application
thread or the flush_queue thread. With the slab allocator in
the submit object, any frees that happen on the flush_queue
thread happen after we know that the application thread is done
with the submit.
2) Introduce a new "softpin" msm_ringbuffer_sp implementation that
does not use relocs and only has cmds table entries for IB1 (ie.
the cmdstream buffers that kernel needs to CP_INDIRECT_BUFFER
to from the RB). To do this properly will require some updates
on the kernel side, so whether you get the softpin or legacy
submit/ringbuffer implementation at runtime depends on your
kernel version.
To make all these changes in libdrm would basically require adding a
libdrm_freedreno2, so this is a good point to just pull the libdrm code
into mesa. Plus it allows for using mesa's hashtable, slab allocator,
etc. And it lets us have asserts enabled for debug mesa buids but
omitted for release builds. And it makes life easier if further API
changes become necessary.
At this point I haven't tried to pull in the kgsl backend. Although
I left the level of vfunc indirection which would make it possible
to have other backends. (And this was convenient to keep to allow
for the "softpin" ringbuffer to coexist.)
NOTE: if bisecting a build error takes you here, try a clean build.
There are a bunch of ways things can go wrong if you still have
libdrm_freedreno cflags.
[1] "ringbuffer" is probably a bad name, the only level of cmdstream
buffer that is actually a ring is RB managed by kernel. User-
space cmdstream is all IB1/IB2 and state-groups.
Reviewed-by: Kristian H. Kristensen <hoegsberg@chromium.org>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
These are not necessary because the corresponding settings are set via
the .dir-locals.el file anyway. Most of them were missing a ‘:’ after
“tab-width” which was making Emacs display an annoying warning
whenever you open the file.
This patch was made with:
sed -ri '/-\*- mode:/,/^$/d' \
$(find src/gallium/{drivers,winsys} -name \*.\[ch\] \
-exec grep -l -- '-\*- mode:' {} \+)
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
This is defined to always clear the entire surface(s) specified,
regardless of scissor state.. mesa/st will turn scissored clears
into a draw. So rip about a bunch of unnecessary machinery.
Also remove a comment that was obsolete since using u_blitter to
turn clear into draw (for the cases where there isn't a hw blitter
fast-path).
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
This adds a freedreno backend for the a6xx generation GPUs, which at
the time of this commit is about 98% GLES2 conformant. Much remains to
be done - both performance work and feature work towards more recent
GLES versions, but this is a good start.
Signed-off-by: Kristian H. Kristensen <hoegsberg@chromium.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
This was basically to avoid a zero-dword IB (indirect-branch), but
instead just don't emit the IB packet in that case.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Get rid of "gmem" (ie. tiling) ringbuffer, and just emit setup commands
directly to "draw" ringbuffer for compute (and in future for blits not
using the 3d pipe). This way we can have a simple flat cmdstream buffer
and bypass setup related to 3d pipe.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
ctx->last_fence isn't such a terribly clever idea, if batches can be
flushed out of order. Instead, each batch now holds a fence, which is
created before the batch is flushed (useful for next patch), that later
gets populated after the batch is actually flushed.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
To add context priority support we need to have an fd_pipe per context,
rather than per-screen. Which conflicts with existing ctx->pipe (which
is actually a visibility stream pipe (hw resource). So just rename it.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Causing issues with stk on a4xx.. still probably a good idea, but seems
some debugging is needed first.
This reverts commit 3ab072d3c8643c66d8e07e63df970b792728bac6.
|
|
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Some things trigger batches that only contain a clear (like glmark2
startup). No point to use GMEM for this.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Possibly other gen's have a similar limit. Fixes glmark2 -b shadow
with larger resolutions on devices with small gmem (for example,
fullscreen 1080p on 8x16/db410c).
Cc: mesa-stable@lists.freedesktop.org
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
For a5xx (and actually some queries on a4xx) we can accumulate results
in the cmdstream, so we don't need this elaborate mechanism of tracking
per-tile query results. So make it into vfuncs so generation specific
backend can use it when it makes sense.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Fixes some issues at least with GMEM bypass mode, where we'd sometimes
end up with some FS quads not hitting memory.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Deal w/ differing gmem tile size alignment between generations, and make
sure texture pitch matches.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
If app tries to create a fence but there is no rendering to submit, we
need a dummy/no-op submit. Use a string-marker for the purpose.. mostly
since it avoids needing to realize that the packet format changes in
later gen's (so one less place to fixup for a5xx).
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Prep-work for next patch, mostly move to tracking last_fence as a
pipe_fence_handle (created now only in fd_gmem_render_tiles()), and a
bit of superficial renaming.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
a5xx seems to prefer 64 pixel alignment, in at least some cases. Make
this configurable per generation.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
This is also used in gmem code, which executes from the "bottom half"
(ie. from the flush_queue worker thread), so it cannot be in fd_context.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
They weren't really used, and it gets somewhat more complicated to deal
with if batches are flushed asynchronously (on another thread). So just
drop them, and move _query_set_state(NULL) call into batch (so it is not
happening on background thread).
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
With the state accessed from GMEM+submit factored out of fd_context and
into fd_batch, now it is possible to punt this off to a helper thread.
And more importantly, since there are cases where one context might
force the batch-cache to flush another context's batches (ie. when there
are too many in-flight batches), using a per-context helper thread keeps
various different flushes for a given context serialized.
TODO as with batch-cache, there are a few places where we'll need a
mutex to protect critical sections, which is completely missing at the
moment.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Push query state down to batch, and use the resource tracking to figure
out which batch(es) need to be flushed to get the query result.
This means we actually need to allocate the prsc up front, before we
know the size. So we have to add a special way to allocate an un-
backed resource, and then later allocate the backing storage.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Make it easier to track batches, to ensure things happen properly when
they are reordered.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
To flush batches out of order, the gmem code needs to not depend on
state from fd_context (since that may apply to a more recent batch).
So this all moves into batch.
The one exception is the gmem/pipe/tile state itself. But this is
only used from gmem code (and batches are flushed serially). The
alternative would be having to re-calculate GMEM layout on every
batch, even if the dimensions of the render targets are the same.
Note: This opens up the possibility of pushing gmem/submit into a
helper thread.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Introduce the batch object, to track a batch/submit's worth of
ringbuffers and other bookkeeping. In this first step, just move
the ringbuffers into batch, since that is mostly uninteresting
churn.
For now there is just a single batch at a time. Note that one
outcome of this change is that rb's are allocated/freed on each
use. But the expectation is that the bo pool in libdrm_freedreno
will save us the GEM bo alloc/free which was the initial reason
to implement a rb pool in gallium.
The purpose of the batch is to eventually facilitate out-of-order
rendering, with batches associated to framebuffer state, and
tracking the dependencies on other batches.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Some a4xx firmware doesn't implement the "PFD" (prefetch-disabled)
version of the CP_INDIRECT_BUFFER packet. So allow for PFD vs PFE per
generation. Switch a3xx and a4xx over to using prefetch-enabled version
(which is also what blob does.. it seems only on a2xx we cannot use
PFE).
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
We hard-coded 4 or 8 as the max in various places. Switch it all to a
define since the limit will go up with a4xx (and maybe even again in the
future?)
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
This was causing corruption with hw binning on a306. Unlikely that it
is a306 specific, but rather the smaller gmem size resulted in different
tile configuration which was triggering the bug at certain resolutions.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
Cc: "10.4" and "10.5" and "10.6" <mesa-stable@lists.freedesktop.org>
|
|
Enables ARB_depth_buffer_float. There is no sampling support for
interleaved Z32F_S8, so we store the two textures separately, one as
Z32F, the other as S8. As a result, we need a lot of additional logic
for restores and transfers.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
|
|
Waiting on a bo being ready is handled in fd_bo_cpu_prep. No need to
keep separate timestamps around.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
|
|
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
|
|
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
|
|
A bunch of open-coded 'gpu_id > 300's seems like it will eventually
cause problems with future generations. There were already a few minor
problems with caps for features that still need additional work on a4xx.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
dscis -> noscis
dbypass -> nobypass
a bit more consistant w/ nobin, etc. And IMO a bit more sensible names.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
The optimization of avoiding restore (mem2gmem) if there was a clear
falls down a bit if you don't have a fullscreen scissor. We need to
make the decision logic a bit more clever to keep track of *what* was
cleared, so that we can (a) completely skip mem2gmem if entire buffer
was cleared, or (b) skip mem2gmem on a per-tile basis for tiles that
were completely cleared.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
4f338c9b introduced logic to trigger a flush rather than overflowing
cmdstream buffer. But the threshold was too low, triggering flushes
where they were not needed. This caused problems with games like
xonotic.
Part of the problem is that we need to mark all state dirty between
cmdstream submit ioctls, because we cannot rely on state being
preserved across ioctls. But even with that, there are still some
problems that are still being debugged. For now:
1) correctly mark all state dirty
2) introduce FD_MESA_DEBUG flush flag to force rendering to be flushed
between each draw, to trigger problems (so that I can debug)
3) use a more reasonable threshold so for normal usecases we don't
trigger the problems
This at least corrects the regression, but there is still more debugging
to do.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
Worth about ~0.5fps in xonotic, for example.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
Real GPU queries need some infrastructure to track samples per tile and
accumulate the results. But fortunately this can be shared across GPU
generation.
See:
https://github.com/freedreno/freedreno/wiki/Queries#hardware-queries
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
If scissor optimization is used (to avoid bringing scissored portions of
the render target into GMEM and then back out to system memory) in
combination with hw binning pass, the result would be a scissor mismatch
between binning pass and rendering pass. This would cause rendering
bugs in some scenarios with (for example) gnome-shell.
I would have expected that simply using the correct screen-scissor
during the binning pass would be enough, but seems like there is
something else missing. So for now disable binning pass if scissor
optimization is used.
|
|
Updates to non-banked registers, CP_LOAD_STATE, etc, need a WFI if there
is potentially pending rendering. Track this better, and add fd_wfi()
calls everywhere that might potentially need CP_WAIT_FOR_IDLE.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
Add for now some simple/basic query support (ie. things not actually
requiring the GPU). Might change around a bit when I actually add
GPU queries, but for now this enables some useful performance info
in the GALLIUM_HUD. For example:
GALLIUM_HUD=fps+batches+batches-sysmem+batches-gmem+restores,draw-calls
The driver specific specific queries are:
+ draw-calls
+ batches - number of batches per second, sum of batches-sysmem
plus batches-gmem
+ batches-gmem - render a set of tiles in GMEM, for each tile
(optionally) system mem -> gmem (restore), plus N draws,
plus gmem -> system mem (resolve) per second
+ batches-sysmem - N draws to system memory (GMEM bypass) per
second
+ restores - number of GMEM batches that required restore per
second
Ideally for GMEM rendering, you want batches-gmem to equal fps. If
the app is doing something that triggers multiple passes (ie. requires
extra round trip gmem <-> system memory) then the # of batches per
second will go up relative to fps.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
The binning pass sorts vertices into which bins/tiles they apply to.
The visibility information generated during the binning pass can be
used to speed up the rendering pass by filtering out vertices which
do not apply to the current tile. See:
https://github.com/freedreno/freedreno/wiki/Adreno-tiling#optimized-approach
This brings a significant fps boost. A rough assortment of tests
(supertuxkart, etracer, tremulous, glmark2 'build' test, etc) seems
to yield a ~35-45% fps improvement.
For now, to be conservative, the binning pass is not enabled yet by
default. To enable it use:
FD_MESA_DEBUG=binning
So far I haven't found anything that breaks with binning enabled,
but I'd like a bit more testing before I enable it as default.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
Only need to leave room for depth/stencil if it is actually used, etc.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|
|
Using RMW on banked context registers is not safe. The value read
could be the wrong one. So if there has been a DRAW_IDX launched,
the RMW must be preceded by a WAIT_FOR_IDLE to ensure the read part
of RMW sees the correct value.
To avoid unnecessary WFI's, keep track if there is a need for WFI,
and only emit one if needed. Furthermore, keep track if we even
need to update the register in the first place.
And to cut down on the amount of RMW to avoid excessive WFI's, at the
tiling/GMEM level we can always overwrite RB_RENDER_CONTROL, as the
state at beginning of draw/clear cmds (which we IB to) is always
undefined. In the draw/clear commands, we always still use RMW (with
WFI if needed), but only if the register value actually changes. (At
points where the current value cannot be known, the saved value is
reset to ~0, which includes bits outside of RBRC_DRAW_STATE, so there
never is chance for confusion.)
Signed-off-by: Rob Clark <robclark@freedesktop.org>
|