path: root/src/mesa/drivers/dri/i965/brw_fs_cse.cpp
Age | Commit message | Author | Files | Lines
2015-01-08 | i965: Consider SEL.{GE,L} to be commutative operations. | Matt Turner | 1 | -4/+11
Reviewed-by: Kenneth Graunke <>
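The reasoning behind the SEL change can be sketched as follows: with a GE or L conditional modifier, SEL picks the larger or the smaller of its two sources, so swapping the sources leaves the result unchanged. The `Opcode`, `CondMod`, and `Inst` types below are hypothetical simplifications made up for this illustration, not the driver's actual definitions.

```cpp
// Hypothetical, simplified stand-ins for the driver's opcode and
// conditional-modifier enums; every name here is an assumption.
enum class Opcode { ADD, MUL, AND, OR, XOR, SEL, MOV };
enum class CondMod { NONE, GE, L };

struct Inst {
    Opcode opcode;
    CondMod cond;
};

// Returns true when the two sources of `inst` may be swapped freely.
bool is_commutative(const Inst &inst)
{
    switch (inst.opcode) {
    case Opcode::ADD:
    case Opcode::MUL:
    case Opcode::AND:
    case Opcode::OR:
    case Opcode::XOR:
        return true;
    case Opcode::SEL:
        // SEL.GE acts as max and SEL.L acts as min, so operand order
        // is irrelevant. A plain (predicated) SEL is order-sensitive.
        return inst.cond == CondMod::GE || inst.cond == CondMod::L;
    default:
        return false;
    }
}
```

A CSE pass could consult such a predicate before trying a reversed-operand comparison.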
2014-12-05 | i965/fs: Perform CSE on MOV ..., VF instructions. | Matt Turner | 1 | -5/+11
Safe from causing optimization loops, since we don't constant propagate VF arguments. (for this and the previous patch): total instructions in shared programs: 4289075 -> 4271932 (-0.40%) instructions in affected programs: 1616779 -> 1599636 (-1.06%) Reviewed-by: Ian Romanick <>
2014-11-21 | i965: Combine offset/texture_offset fields. | Matt Turner | 1 | -1/+1
texture_offset was only used by some texturing operations, and offset was only used by spill/unspill and some URB operations. These fields are never used at the same time. Reviewed-by: Jason Ekstrand <>
2014-10-29 | i965/fs: Perform CSE on MAD instructions with final arguments switched. | Matt Turner | 1 | -1/+5
Multiplication is commutative. instructions in affected programs: 48314 -> 47954 (-0.75%) Reviewed-by: Kenneth Graunke <>
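A minimal sketch of the MAD comparison, assuming MAD computes src[0] + src[1] * src[2], so that only the two multiplicands may trade places. `Reg` and `mad_operands_match` are hypothetical names used only for this example.

```cpp
#include <array>

// Hypothetical: a register is reduced to a plain integer ID here.
using Reg = int;

// Two MADs compute the same value when the addend matches and the
// multiplicands match in either order.
bool mad_operands_match(const std::array<Reg, 3> &a,
                        const std::array<Reg, 3> &b)
{
    if (a[0] != b[0])
        return false;
    return (a[1] == b[1] && a[2] == b[2]) ||
           (a[1] == b[2] && a[2] == b[1]);
}
```

Swapping the addend with a multiplicand changes the result, which is why the first source is compared strictly.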
2014-10-15 | i965: Allow CSE on Gen4-5 unary math. | Kenneth Graunke | 1 | -1/+1
Due to the implicit move-from-GRF, unary math looks a lot like the Gen6+ math instruction: it's a single instruction (SEND) with a GRF source. The difference is that it also implicitly clobbers a message register. The only visible effect is that CSE will remove the MRF-clobbering from later math operations. This should be fine; compute_to_mrf and remove_redundant_mrf_writes don't look at the values populated by implied writes, so they can't rely on those values being present. Less interference may actually help those passes make more progress. Binary math is still problematic, since it involves a separate MOV instruction to load the second operand. We continue disallowing CSE for binary math operations. total instructions in shared programs: 3340303 -> 3340100 (-0.01%) instructions in affected programs: 26927 -> 26724 (-0.75%) Nothing hurt, gained, or lost. ~6% reduction on a few shaders. Signed-off-by: Kenneth Graunke <> Reviewed-by: Matt Turner <>
2014-09-30 | i965/fs_reg: Allocate double the number of vgrfs in SIMD16 mode | Jason Ekstrand | 1 | -9/+13
This is actually the squash of a bunch of different changes. Individual commit titles follow:
  i965/fs: Always 2-align registers SIMD16 for gen <= 5
  i965/fs: Use the register width when applying offsets
    This reworks both byte_offset() and offset() to be more intelligent. The byte_offset() function now supports offsets bigger than 32. The offset() function uses the byte_offset() function together with the register width and the type size to offset the register by the correct amount.
  i965/fs: Change regs_read to be in hardware registers
  i965/fs: Change regs_written to be actual hardware registers
  i965/fs: Properly handle register widths in LOAD_PAYLOAD
    The LOAD_PAYLOAD instruction is a bit special because it collects a bunch of registers (with possibly different widths) into a single payload block. Once the payload is constructed, it's treated as a single block of data and most of the information such as register widths doesn't matter anymore. In particular, the offset of any particular source register is the accumulation of the sizes of the previous source registers.
  i965/fs: Properly set writemasks in LOAD_PAYLOAD
  i965/fs: Handle register widths in demote_pull_constants
  i965/fs: Get rid of implicit register doubling in the allocator
  i965/fs: Reserve enough registers for PLN instructions
  i965/fs: Make sources and destinations interfere in 16-wide
  i965/fs: Properly handle register widths in CSE
  i965/fs: Properly handle register widths in register_coalesce
  i965/fs: Properly handle widths in copy propagation
  i965/fs: Properly handle register widths in VARYING_PULL_CONSTANT_LOAD
  i965/fs: Properly handle register widths and odd register sizes in spilling
  i965/fs: Don't waste a register on texture lookups for gen >= 7
    Previously, we were wasting a register in SIMD16 mode because we could only allocate registers in pairs. Now that we can allocate and address odd-sized registers, let's get rid of this special-case.
Signed-off-by: Jason Ekstrand <> Reviewed-by: Matt Turner <>
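The LOAD_PAYLOAD rule stated above (each source starts where the previous sources end) amounts to a running sum. The sketch below illustrates only that arithmetic; `payload_offsets` is a hypothetical helper, and sizes are assumed to be counted in hardware registers.

```cpp
#include <vector>

// Given the size of each payload source (in hardware registers, an
// assumption for this sketch), return the offset at which each source
// lands inside the assembled payload block.
std::vector<unsigned> payload_offsets(const std::vector<unsigned> &src_sizes)
{
    std::vector<unsigned> offsets;
    unsigned running = 0;
    for (unsigned size : src_sizes) {
        offsets.push_back(running); // starts after all earlier sources
        running += size;
    }
    return offsets;
}
```

For example, sources of sizes 1, 2, and 2 would start at offsets 0, 1, and 3 respectively.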
2014-09-30 | i965/fs: Use offset a lot more places | Jason Ekstrand | 1 | -8/+4
We have this wonderful offset() function for advancing registers, but we're not using it. Using offset() allows us to do some sanity checking and avoid manually touching fs_reg::reg_offset. In a few commits, we will make offset do even more nifty things for us. Signed-off-by: Jason Ekstrand <> Reviewed-by: Matt Turner <>
2014-09-24 | i965: Make instruction lists local to the bblocks. | Matt Turner | 1 | -6/+0
Reviewed-by: Topi Pohjolainen <>
2014-09-24 | i965: Remove cfg-invalidating parameter from invalidate_live_intervals. | Matt Turner | 1 | -1/+1
Everything has been converted to preserve the CFG. Reviewed-by: Topi Pohjolainen <>
2014-08-22 | i965: Use basic-block aware insertion/removal functions. | Matt Turner | 1 | -4/+4
To avoid invalidating and recreating the control flow graph. Also stop invalidating the CFG in places we didn't add or remove an instruction. cfg calculations: 202951 -> 80307 (-60.43%) Reviewed-by: Topi Pohjolainen <>
2014-08-18 | i965: Add and use foreach_block macro. | Matt Turner | 1 | -3/+1
Use this as an opportunity to rename 'block_num' to 'num'. block->num is clear, and block->block_num has always been redundant.
2014-08-11 | i965/cse: Don't eliminate instructions with side-effects | Jason Ekstrand | 1 | -1/+1
This causes problems when converting atomics to use the GRF. Sometimes the atomic operation would get eaten by CSE when it shouldn't. v2: Roll the has_side_effects check into is_expression Signed-off-by: Jason Ekstrand <> Reviewed-by: Matt Turner <>
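A minimal sketch of what rolling the side-effects check into is_expression could look like. The `Opcode` values and the `Inst` struct are hypothetical; only the idea (an instruction with side effects is never a CSE candidate) comes from the commit message.

```cpp
// Hypothetical opcodes: two pure arithmetic ops and one atomic.
enum class Opcode { ADD, MUL, UNTYPED_ATOMIC };

struct Inst {
    Opcode opcode;

    // Assumption for this sketch: only the atomic has side effects.
    bool has_side_effects() const
    {
        return opcode == Opcode::UNTYPED_ATOMIC;
    }
};

// An instruction qualifies for CSE only if it is a pure expression.
bool is_expression(const Inst &inst)
{
    if (inst.has_side_effects())
        return false; // eliminating it would drop a visible memory update

    switch (inst.opcode) {
    case Opcode::ADD:
    case Opcode::MUL:
        return true;
    default:
        return false;
    }
}
```

Two identical atomic increments must both execute, so treating the second as a redundant copy of the first would change program behavior.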
2014-08-09 | i965: Get rid of backend_instruction::sampler | Chris Forbes | 1 | -1/+0
The generators no longer use this. Signed-off-by: Chris Forbes <> Reviewed-by: Matt Turner <> Reviewed-by: Ian Romanick <>
2014-07-21 | i965: Add cfg to backend_visitor. | Matt Turner | 1 | -4/+3
Reviewed-by: Topi Pohjolainen <>
2014-07-15 | i965/fs: Perform CSE on sends-from-GRF rather than textures. | Matt Turner | 1 | -1/+1
Should potentially allow a few more cases, while avoiding doing CSE on texture operations on Gen <= 6 with the MRF. Bugzilla: Reviewed-by: Kenneth Graunke <> Tested-by: lu hua <>
2014-07-14 | i965/fs: Invalidate live intervals in opt_cse, not _local. | Matt Turner | 1 | -3/+3
Reviewed-by: Kenneth Graunke <>
2014-07-14 | i965/fs: Move aeb list into opt_cse_local. | Matt Turner | 1 | -6/+6
Reviewed-by: Kenneth Graunke <>
2014-07-01 | i965/fs: Pass cfg to calculate_live_intervals(). | Matt Turner | 1 | -2/+1
We've often created the CFG immediately before, so use it when available. Reviewed-by: Ian Romanick <>
2014-07-01 | i965: Use typed foreach_in_list_safe instead of foreach_list_safe. | Matt Turner | 1 | -3/+1
Acked-by: Ian Romanick <>
2014-07-01 | i965: Add and use foreach_inst_in_block macros. | Matt Turner | 1 | -4/+1
Reviewed-by: Ian Romanick <>
2014-07-01 | mesa: Add and use foreach_in_list_use_after. | Matt Turner | 1 | -4/+1
Reviewed-by: Ian Romanick <>
2014-06-17 | i965/fs: Perform CSE on texture operations. | Matt Turner | 1 | -1/+10
Helps Unigine Tropics and some (old) gstreamer shaders in shader-db. instructions in affected programs: 792 -> 744 (-6.06%) Reviewed-by: Kenneth Graunke <>
2014-06-17 | i965/fs: Perform CSE on load_payload instructions if it's not a copy. | Matt Turner | 1 | -0/+18
Since CSE creates instructions, if we let CSE generate things register coalescing can't remove, bad things will happen. Only let CSE combine non-copy load_payloads. E.g., allow CSE to handle this
    load_payload vgrf4+0, vgrf5, vgrf6
but not this
    load_payload vgrf4+0, vgrf5+0, vgrf5+1
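The distinction between the two load_payload examples can be sketched as a predicate over the source list: a "copy" payload reads consecutive offsets of a single VGRF starting at offset 0. The `Src` struct and this `is_copy_payload` are hypothetical simplifications for illustration and are not taken from the driver source.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical source operand: a virtual register number plus an
// offset within that register.
struct Src {
    int reg;
    int reg_offset;
};

// True when the sources are reg+0, reg+1, reg+2, ... of one register,
// i.e. the load_payload merely copies an existing VGRF.
bool is_copy_payload(const std::vector<Src> &srcs)
{
    if (srcs.empty() || srcs[0].reg_offset != 0)
        return false;

    for (std::size_t i = 1; i < srcs.size(); i++) {
        if (srcs[i].reg != srcs[0].reg ||
            srcs[i].reg_offset != static_cast<int>(i))
            return false;
    }
    return true;
}
```

Under these assumptions, the sources `vgrf5, vgrf6` are not a copy (two different registers), while `vgrf5+0, vgrf5+1` are.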
2014-06-17 | i965/fs: Emit load_payload instead of multiple MOVs for large VGRFs. | Matt Turner | 1 | -12/+21
2014-06-17 | i965/fs: Only consider real sources when comparing instructions. | Matt Turner | 1 | -4/+15
2014-06-11 | i965/fs: Clean up tabs in brw_fs_cse.cpp. | Matt Turner | 1 | -43/+43
I'm adding vec4 CSE, and I want to diff the files.
2014-06-10 | i965/fs: Allow CSE on math opcodes on Gen6+. | Kenneth Graunke | 1 | -0/+11
total instructions in shared programs: 2081469 -> 2081248 (-0.01%) instructions in affected programs: 22606 -> 22385 (-0.98%) No programs were hurt by this patch. Signed-off-by: Kenneth Graunke <> Reviewed-by: Matt Turner <> Reviewed-by: Chris Forbes <>
2014-06-01 | i965/fs: Loop from 0 to inst->sources, not 0 to 3. | Matt Turner | 1 | -1/+1
Reviewed-by: Chris Forbes <> Reviewed-by: Tapani Pälli <> Reviewed-by: Kenneth Graunke <>
2014-04-05 | i965/fs: Name temporary ralloc contexts something other than mem_ctx. | Matt Turner | 1 | -3/+3
Or else poor programmers might mistakenly use the temporary mem_ctx, instead of the fs_visitor's mem_ctx and wonder why their code is crashing. Also remove the parenting. These contexts are local to the optimization passes they're in and are freed at the end.
2013-12-04 | i965/cfg: Clean up cfg_t constructors. | Matt Turner | 1 | -1/+1
parent_mem_ctx was unused since db47074a, so remove the two wrappers around create() and make create() the constructor. Reviewed-by: Eric Anholt <>
2013-11-09 | i965/fs: Don't perform CSE on inst HW_REG dests (unless it's null) | Matt Turner | 1 | -1/+2
Commit b16b3c87 began performing CSE on CMP instructions with null destinations. I relaxed the restrictions a bit too much, thereby allowing CSE to be performed on instructions with, for instance, an explicit accumulator destination. This broke the arb_gpu_shader5/fs-imulExtended shader tests because they emit MUL instructions with the accumulator as the destination. CSE would instead cause the MUL to write to a GRF, which is lower precision than the accumulator. Reviewed-by: Eric Anholt <> Cc: 10.0 <>
2013-10-30 | i965/fs: Perform CSE on CMP(N) instructions. | Matt Turner | 1 | -10/+29
Optimizes
    null g45<8,8,1>F 0F
    (+f0) sel(8) g50<1>F g40<8,8,1>F g10<8,8,1>F
    null g45<8,8,1>F 0F
    (+f0) sel(8) g51<1>F g41<8,8,1>F g11<8,8,1>F
    null g45<8,8,1>F 0F
    (+f0) sel(8) g52<1>F g42<8,8,1>F g12<8,8,1>F
    null g45<8,8,1>F 0F
    (+f0) sel(8) g53<1>F g43<8,8,1>F g13<8,8,1>F
into
    null g45<8,8,1>F 0F
    (+f0) sel(8) g50<1>F g40<8,8,1>F g10<8,8,1>F
    (+f0) sel(8) g51<1>F g41<8,8,1>F g11<8,8,1>F
    (+f0) sel(8) g52<1>F g42<8,8,1>F g12<8,8,1>F
    (+f0) sel(8) g53<1>F g43<8,8,1>F g13<8,8,1>F
total instructions in shared programs: 1644938 -> 1638181 (-0.41%)
instructions in affected programs: 574955 -> 568198 (-1.18%)
Two more 16-wide programs (in L4D2). Some large (-9%) decreases in instruction count in some of Valve's Source Engine games. No regressions. Reviewed-by: Eric Anholt <> Reviewed-by: Paul Berry <>
2013-10-30 | i965/fs: Don't emit null MOVs in CSE. | Matt Turner | 1 | -17/+25
We'd like to CSE some instructions, like CMP, that often have null destinations. Instead of replacing them with MOVs to null, just don't emit the MOV. Reviewed-by: Paul Berry <>
2013-10-25 | i965/fs: Match commutative expressions with reversed arguments. | Matt Turner | 1 | -3/+23
total instructions in shared programs: 1645011 -> 1644938 (-0.00%) instructions in affected programs: 17543 -> 17470 (-0.42%) Reviewed-by: Eric Anholt <>
2013-10-25 | i965: s/Muchnik/Muchnick/. | Matt Turner | 1 | -1/+1
Reviewed-by: Eric Anholt <>
2013-10-10 | i965/fs: Create a helper function for invalidating live intervals. | Kenneth Graunke | 1 | -1/+1
For now, this simply sets live_intervals_valid = false, but in the future it will do something more sophisticated. Based on a patch by Eric Anholt. Signed-off-by: Kenneth Graunke <> Reviewed-by: Eric Anholt <>
2013-10-07 | i965/fs: Disable CSE on instructions writing to HW_REG. | Matt Turner | 1 | -1/+2
CSE would otherwise combine the two mul(8) emitted by [iu]mulExtended:
    mul(8) acc0 x y
    mach(8) null x y
    mov(8) lsb acc0
    ...
    mul(8) acc0 x y
    mach(8) msb x y
Into:
    mul(8) temp x y
    mov(8) acc0 temp
    mach(8) null x y
    mov(8) lsb acc0
    ...
    mov(8) acc0 temp
    mach(8) msb x y
But mul(8) into the accumulator produces more than 32-bits of precision, which is required and lost if multiplying into a general register and moving to the accumulator. Reviewed-by: Eric Anholt <>
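The precision argument can be illustrated with plain integer arithmetic. The sketch below is a numeric analogy only, not a model of how the hardware's mul/mach pair works: it assumes a 64-bit value stands in for the wide accumulator and a 32-bit value stands in for a general register. Both function names are hypothetical.

```cpp
#include <cstdint>

// High 32 bits of x * y when the full-width product is kept, as it
// would be in a wide accumulator (a 64-bit stand-in here).
uint32_t high_from_wide(uint32_t x, uint32_t y)
{
    uint64_t acc = static_cast<uint64_t>(x) * y;
    return static_cast<uint32_t>(acc >> 32);
}

// High 32 bits recovered after the product has first been squeezed
// through a 32-bit register and then moved back into the accumulator.
uint32_t high_from_truncated(uint32_t x, uint32_t y)
{
    uint32_t temp = static_cast<uint32_t>(static_cast<uint64_t>(x) * y);
    uint64_t acc = temp;                     // upper bits are already gone
    return static_cast<uint32_t>(acc >> 32); // always zero
}
```

With x = 0x80000000 and y = 4 the full product is 0x200000000, so the wide path yields a high word of 2 while the truncated path yields 0.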
2013-09-05 | i965: Remove never used RSR and RSL opcodes. | Matt Turner | 1 | -2/+0
RSR and RSL are listed in the "Defeatured Instructions" section of the 965 PRM, Volume 4: "The following instructions are removed from Gen4 implementation mainly due to implementation cost/schedule reasons. They are candidates for future generations." Reviewed-by: Kenneth Graunke <>
2013-08-12 | i965/fs: Explicitly disallow CSE on predicated instructions. | Kenneth Graunke | 1 | -1/+3
The existing inst->is_partial_write() already disallows predicated instructions, so this has no functional change. However, it's worth doing explicitly since the CSE pass does not consider the flag register. This means it could blindly factor out operations that use the same sources, but which have different condition codes set. This prevents a regression in the next commit. Signed-off-by: Kenneth Graunke <> Reviewed-by: Matt Turner <>
2013-05-09 | i965/fs: Make virtual grf live intervals actually cover their used range. | Eric Anholt | 1 | -1/+1
Previously, we would sometimes not consider a write to a register to extend the end of the interval, nor would we consider a read before a write to extend the start. This made for a bunch of complicated logic related to how to treat the results when dead code might be present. Instead, just extend the interval and fix dead code elimination to know how to remove it. Interestingly, this actually results in a tiny bit more optimization: total instructions in shared programs: 1391220 -> 1390799 (-0.03%) instructions in affected programs: 14037 -> 13616 (-3.00%) v2: Fix a theoretical problem with the simd16 workaround if dst == src, where we would revert the bump of the live range. Reviewed-by: Ian Romanick <> (v1)
2013-04-12 | i965/fs: Add a helper function for checking for partial register updates. | Eric Anholt | 1 | -2/+1
These checks were all over, and every time I wrote one I had to try to decide again what the cases were for partial updates. v2: Fix inadvertent reladdr check removal. Reviewed-by: Matt Turner <>
2013-04-01 | i965/fs: Allow CSE on pre-gen7 varying-index uniform loads | Eric Anholt | 1 | -1/+1
All the other expression types allowed here have inst->mlen == 0, and this one has implied MRF writes for all of its payload, so nothing else in the implementation should need to change. Reduces SEND messages for loading from pull constants in kwin's Lanczos shader from 16 to 6. (Due to a deficiency in constant propagation, I can't use the hack I did in the previous commit to test the performance change) Reviewed-by: Kenneth Graunke <> Bugzilla: NOTE: This is a candidate for the 9.1 branch.
2013-04-01 | i965/fs: Use LD messages for pre-gen7 varying-index uniform loads | Eric Anholt | 1 | -0/+1
This comes at a minor performance cost at the moment (-3.2% +/- 0.2%, n=14 on my GM45 forced to load all uniforms through the varying-index path), but we get a whole vec4 at a time to reuse in the next commit. v2: Fix comment about channels in the other message. Reviewed-by: Kenneth Graunke <> NOTE: This is a candidate for the 9.1 branch.
2013-04-01 | i965/fs: Bake regs_written into the IR instead of recomputing it later. | Eric Anholt | 1 | -3/+3
For sampler messages, it depends on the target gen, and on gen4 SIMD16-sampler-on-SIMD8-execution we were returning 4 instead of 8 like we should. Reviewed-by: Kenneth Graunke <> NOTE: This is a candidate for the 9.1 branch.
2013-04-01 | i965/fs: Do CSE on gen7's varying-index pull constant loads. | Eric Anholt | 1 | -11/+32
This is our first CSE on a regs_written() > 1 instruction, so it takes a bit of extra fixup. Reduces the number of loads on kwin's Lanczos shader from 12 to 2. v2: Fix compiler warning (false positive on possibly-uninitialized variable) Bugzilla: Reviewed-by: Kenneth Graunke <> (v1) NOTE: This is a candidate for the 9.1 branch.
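One plausible reading of the "extra fixup" for an instruction that writes several registers is that reusing its result takes one copy per written register rather than a single MOV. The sketch below shows that idea under this assumption; the `Mov` struct and `copies_for_reuse` are hypothetical and do not reproduce the driver's code.

```cpp
#include <vector>

// Hypothetical register-to-register copy at a given register offset.
struct Mov {
    int dst_reg;
    int dst_offset;
    int src_reg;
    int src_offset;
};

// Build the copies needed to forward a multi-register result from
// src_reg to dst_reg: one MOV for each register the original
// instruction wrote.
std::vector<Mov> copies_for_reuse(int dst_reg, int src_reg, int regs_written)
{
    std::vector<Mov> movs;
    for (int i = 0; i < regs_written; i++)
        movs.push_back(Mov{dst_reg, i, src_reg, i});
    return movs;
}
```

For a load that writes four registers, this yields four copies, at offsets 0 through 3.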
2013-03-11 | i965/fs: Improve CSE performance by expiring some available expressions. | Eric Anholt | 1 | -1/+19
We're already walking the list, and we can easily know when something has no reason to be in the list any longer, so take a brief extra step to reduce our worst-case runtime (an oglconform test that emits the maximum instructions in a fragment program). I don't actually know what the worst-case runtime was, because it was too long and I got bored. Reviewed-by: Kenneth Graunke <>
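A rough sketch of expiring entries during the walk. The commit message does not spell out the expiry criterion, so the rule used here is an assumption: an entry is dropped once any register it reads has passed its last use, since no later instruction could match it. `Entry`, `last_use`, and `expire_entries` are hypothetical names.

```cpp
#include <list>
#include <vector>

// Hypothetical available-expression entry: just the registers that
// the generating instruction reads.
struct Entry {
    std::vector<int> src_regs;
};

// Remove entries that can no longer match anything at or after `ip`.
// last_use[r] is the instruction index of the final use of register r
// (assumed to come from a liveness analysis).
void expire_entries(std::list<Entry> &aeb,
                    const std::vector<int> &last_use, int ip)
{
    aeb.remove_if([&](const Entry &e) {
        for (int r : e.src_regs) {
            if (last_use[r] < ip)
                return true; // a source is dead, so no later match exists
        }
        return false;
    });
}
```

Keeping the list short matters because each candidate instruction is compared against every live entry, so stale entries inflate the worst-case running time.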
2013-02-28 | i965/fs: Use the LRP instruction for ir_triop_lrp when possible. | Kenneth Graunke | 1 | -0/+1
v2 [mattst88]:
  - Add BRW_OPCODE_LRP to list of CSE-able expressions.
  - Fix op_var[] array size.
  - Rename arguments to emit_lrp to (x, y, a) to clear confusion.
  - Add LRP function to brw_fs.cpp/.h.
  - Corrected comment about LRP instruction arguments in emit_lrp.
v3 [mattst88]:
  - Duplicate MAD code for LRP instead of using a function pointer.
  - Check for != GRF instead of == IMM in emit_lrp.
  - Lower LRP on gen < 6.
Reviewed-by: Matt Turner <> Reviewed-by: Eric Anholt <> Signed-off-by: Kenneth Graunke <>
2013-02-19 | i965/fs: Enable CSE on uniform pull constant loads. | Eric Anholt | 1 | -0/+3
Improves on a major performance regression for the dolphin wii emulator from its move to using UBOs. Performance in the UBO codepath (as replayed through apitrace) is up 21.1% +/- 2.3% (n=26/29). Reviewed-by: Kenneth Graunke <>
2013-02-19 | i965/fs: Only do CSE when the dst types match. | Eric Anholt | 1 | -1/+2
We could potentially do some CSE even when the dst types aren't the same on gen6 where there is no implicit dst type conversion iirc, or in the case of uniform pull constant loads where the dst type doesn't impact what's stored. But it's not worth worrying about. Reviewed-by: Kenneth Graunke <> NOTE: This is a candidate for the 9.1 branch.
2012-10-17 | i965: Make the cfg reusable from the VS. | Eric Anholt | 1 | -1/+1
Reviewed-by: Kenneth Graunke <>