# Optimization for r300/r400 fragment shader programs

We want a few things. This work should be done in Gallium. It could be developed against the software rendering pipe, which makes debugging easier, since there is no r300/r400 Gallium driver yet. For the sake of simplicity we will use the ARB fragment and vertex program extensions, as we expect higher-level languages such as GLSL to be optimized in a first pass by something like LLVM. We focus on optimizing a bit further the "ASM" we get as the output of that stage. We also do not want one optimization pass to destroy the work done by another. To solve this, I think the cleanest and simplest solution is to try each permutation of the optimization stages and select the one that produces the lowest number of instructions while still fitting the hardware.
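
A minimal sketch of that selection loop, written in Python for brevity. Everything in it is hypothetical: `best_pass_order`, the toy passes `remove_nops` and `fuse_mad`, and the `fits_hardware` callback are illustrative names, and a program is modelled as a plain list of opcode strings rather than real ARB assembly.

```python
from itertools import permutations


def best_pass_order(program, passes, fits_hardware):
    """Try every ordering of `passes` and keep the shortest result that fits.

    Returns (order, optimized_program), or None if no ordering fits.
    """
    best = None
    for order in permutations(passes):
        candidate = program
        for optimization_pass in order:
            candidate = optimization_pass(candidate)
        if not fits_hardware(candidate):
            continue
        if best is None or len(candidate) < len(best[1]):
            best = (order, candidate)
    return best


# Toy passes, only here to show that pass order can matter.
def remove_nops(program):
    return [op for op in program if op != "NOP"]


def fuse_mad(program):
    """Fuse an adjacent MUL, ADD pair into a single MAD."""
    out = []
    i = 0
    while i < len(program):
        if program[i] == "MUL" and i + 1 < len(program) and program[i + 1] == "ADD":
            out.append("MAD")
            i += 2
        else:
            out.append(program[i])
            i += 1
    return out
```

With these toy passes, running `remove_nops` before `fuse_mad` on `["MUL", "NOP", "ADD"]` yields one instruction, while the opposite order leaves two. The number of orderings grows factorially with the number of passes, so this only stays practical for a handful of passes.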

## Reshuffle texture instructions

To work around the limit of 4 texture indirections, we need to reshuffle texture instructions. For instance, the following program can be rewritten.

Original program, which exceeds the texture indirection limit (here 5 indirections):

    TEMP a, b, c, d;
    # node 0 - 0 indirection
    TEX a, fragment.color, texture[0], 2D;
    # node 1 - 1 indirection
    TEX b, a, texture[1], 2D;
    ADD c, b, 1;
    MUL b, c, b;
    # node 2 - 2 indirections because c has been written in the previous node
    TEX c, fragment.color, texture[2], 2D;
    # node 3 - 3 indirections, c has been written in the previous node
    TEX d, c, texture[3], 2D;
    ADD a, b, d;
    # node 4 (out of limit!) - a has been written in the previous node
    TEX result.color, a, texture[4], 2D;

Reshuffled program, which passes the texture indirection limit (here 3 indirections):

    TEMP a, b, c, d;
    TEMP _ts_0;
    # node 0 - 0 indirection
    TEX a, fragment.color, texture[0], 2D;
    TEX c, fragment.color, texture[2], 2D;
    # node 1 - 1 indirection
    TEX b, a, texture[1], 2D;
    TEX d, c, texture[3], 2D;
    ADD _ts_0, b, 1;
    MUL b, _ts_0, b;
    ADD a, b, d;
    # node 2 - 2 indirections, a has been written in the previous node
    TEX result.color, a, texture[4], 2D;
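
The node counting used in the two listings can be sketched as a small checker, which is handy for testing whether a reshuffle actually helped. This is a simplified model, not driver code: `count_indirections` and the tuple encoding of instructions are hypothetical, and the rule for starting a new node is an assumed reading of the ARB_fragment_program counting rule (a texture instruction starts a new node when its coordinate is a temporary already written in the current node, or when its result is a temporary already read or written by an ALU instruction in the current node).

```python
TEX_OPS = {"TEX", "TXP", "TXB"}


def count_indirections(program, temps):
    """Count texture indirection nodes; a program always has at least one.

    `program` is a list of (opcode, destination, sources) tuples. For
    texture opcodes, sources[0] is the texture coordinate.
    """
    nodes = 1
    written = set()      # temporaries written so far in the current node
    alu_touched = set()  # temporaries read or written by ALU ops in this node
    for op, dst, srcs in program:
        if op in TEX_OPS:
            coord = srcs[0]
            dependent_coord = coord in temps and coord in written
            clobbered_result = dst in temps and dst in alu_touched
            if dependent_coord or clobbered_result:
                nodes += 1
                written = set()
                alu_touched = set()
        else:
            alu_touched.update(s for s in srcs if s in temps)
            if dst in temps:
                alu_touched.add(dst)
        if dst in temps:
            written.add(dst)
    return nodes


ORIGINAL = [
    ("TEX", "a", ["fragment.color"]),
    ("TEX", "b", ["a"]),
    ("ADD", "c", ["b", "1"]),
    ("MUL", "b", ["c", "b"]),
    ("TEX", "c", ["fragment.color"]),
    ("TEX", "d", ["c"]),
    ("ADD", "a", ["b", "d"]),
    ("TEX", "result.color", ["a"]),
]

RESHUFFLED = [
    ("TEX", "a", ["fragment.color"]),
    ("TEX", "c", ["fragment.color"]),
    ("TEX", "b", ["a"]),
    ("TEX", "d", ["c"]),
    ("ADD", "_ts_0", ["b", "1"]),
    ("MUL", "b", ["_ts_0", "b"]),
    ("ADD", "a", ["b", "d"]),
    ("TEX", "result.color", ["a"]),
]

TEMPS = {"a", "b", "c", "d", "_ts_0"}
```

Under this model `count_indirections(ORIGINAL, TEMPS)` gives 5 and `count_indirections(RESHUFFLED, TEMPS)` gives 3, matching the counts quoted for the two programs.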

## Use native swizzle

Please refer to the hardware documentation for the native swizzles of r300/r400. We want to rewrite the ASM to take advantage of native swizzles. This stage could be combined with the scalar/vector optimization pass.

The following program can be optimized (assuming xyzw, wzyx and wyxz are native but xzwy is not):

    TEMP a, b;
    PARAM coef = {0.5f, 0.6f, 0.7f, 0.8f};
    ADD a, fragment.color.xzwy, coef.wyxz;

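It can be rewritten as follows. Each component still adds the same pair of values, but the results land in a in a different component order, so later reads of a have to use the swizzle wyxz (native under the assumption above) to recover the original order: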

    TEMP a, b;
    PARAM coef = {0.5f, 0.6f, 0.7f, 0.8f};
    ADD a, fragment.color.wzyx, coef.xyzw;

The optimization can get more complex once negation of individual components is added to the equation.
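
One way to find such a rewrite is to search over permutations of the result components: permuting every source swizzle the same way keeps each per-component operation intact and only changes where the result lands. The sketch below is a hypothetical helper (`native_rewrites` and `permute` are illustrative names) that ignores negation and write masks, and treats the set of native swizzles as an input rather than as the real r300/r400 list.

```python
from itertools import permutations

COMPONENTS = "xyzw"


def permute(swizzle, p):
    """Reorder a 4-component swizzle so that new component i is old component p[i]."""
    return "".join(swizzle[i] for i in p)


def native_rewrites(sources, native):
    """Find result permutations that make every source swizzle native.

    Returns a list of (new_source_swizzles, readback_swizzle). The readback
    swizzle is the one later instructions must apply to the permuted result
    to see the original component order.
    """
    results = []
    for p in permutations(range(4)):
        new_sources = tuple(permute(s, p) for s in sources)
        if all(s in native for s in new_sources):
            readback = [None] * 4
            for new_index, old_index in enumerate(p):
                readback[old_index] = COMPONENTS[new_index]
            results.append((new_sources, "".join(readback)))
    return results


NATIVE = {"xyzw", "wzyx", "wyxz"}
candidates = native_rewrites(("xzwy", "wyxz"), NATIVE)
# Keep only rewrites whose readback swizzle is itself native.
usable = [c for c in candidates if c[1] in NATIVE]
```

For the example above, `usable` contains the single rewrite `(("wzyx", "xyzw"), "wyxz")`. The search also finds `("xyzw", "wzyx")` for the sources, but its readback swizzle would be xzwy, which is not native under the stated assumption.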

## Reshuffle instructions

By reshuffling instructions, we can take advantage of the scalar/vector split in the hardware units to parallelize computation a bit more.
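
As a rough illustration, a greedy pairing pass could look like the sketch below. It rests on assumptions the notes do not spell out: each issue slot can hold one vector and one scalar instruction, an instruction can only read results produced in an earlier slot, and every temporary is written exactly once (so only read-after-write dependencies need tracking). The name `pair_slots` and the tuple encoding are hypothetical.

```python
def pair_slots(instructions):
    """Greedily pack (unit, destination, sources) tuples into issue slots.

    `unit` is "vec" or "scalar". Each slot holds at most one instruction
    per unit. Assumes every destination is written only once.
    """
    slots = []     # each slot: {"vec": instruction or None, "scalar": ...}
    ready_at = {}  # register -> first slot index allowed to read it
    for unit, dst, srcs in instructions:
        # Inputs never written by this program are ready from slot 0.
        index = max((ready_at.get(s, 0) for s in srcs), default=0)
        # Skip slots where this unit is already busy.
        while index < len(slots) and slots[index][unit] is not None:
            index += 1
        if index == len(slots):
            slots.append({"vec": None, "scalar": None})
        slots[index][unit] = (unit, dst, srcs)
        ready_at[dst] = index + 1
    return slots
```

For example, two independent vector/scalar chains such as `("vec", "t0", ["in0", "in1"])`, `("scalar", "t1", ["in0", "in1"])`, `("vec", "t2", ["t0", "in2"])`, `("scalar", "t3", ["t1", "in2"])` end up in two slots instead of four instructions issued one after another.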

## References

[[http://www.opengl.org/registry/specs/ARB/fragment_program.txt|http://www.opengl.org/registry/specs/ARB/fragment_program.txt]]