summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2012-05-30Speed up _pixman_image_get_solid() in common casesglyph3Søren Sandmann Pedersen1-6/+27
Make _pixman_image_get_solid() faster by special-casing the common cases where the image is SOLID or a repeating a8r8g8b8 image. This optimization together with the previous one results in a small but reproducable performance improvement on the xfce4-terminal-a1 cairo trace: [ # ] backend test min(s) median(s) stddev. count Before: [ 0] image xfce4-terminal-a1 1.221 1.239 1.21% 100/100 After: [ 0] image xfce4-terminal-a1 1.170 1.199 1.26% 100/100 Either optimization by itself is difficult to separate from noise.
2012-05-30Speed up _pixman_composite_glyphs_no_mask()Søren Sandmann Pedersen3-23/+112
Bypass much of the overhead of pixman_image_composite32() by only computing the composite region once instead of once per glyph, and by only looking up the composite function whenever the glyph format or flags change. As part of this, the pixman_compute_composite_region32() was renamed to _pixman_compute_composite_region32() and exported in pixman-private.h. I couldn't find a trace that would reliably demonstrate that this is actually an improvement by itself (since _pixman_composite_glyphs_no_mask() is called so rarely), but together with the following optimization for solid sources, there is a small but reliable improvement to the xfce4-a1-terminal cairo trace.
2012-05-30Speed up pixman_composite_glyphs()Søren Sandmann Pedersen3-24/+108
When adding glyphs to the mask, bypass most of the overhead of pixman_image_composite32() by: - Only looking up the composite function when the glyph changes either format or flags. - Just intersecting the glyph rectangle with the destination rectangle instead of doing the full _pixman_composite_region32(). Performance results: [ # ] backend test min(s) median(s) stddev. count Before: [ 0] image firefox-talos-gfx 5.516 5.531 0.21% 5/6 After: [ 0] image firefox-talos-gfx 4.398 4.404 0.16% 5/6
2012-05-30test: Add glyph-testSøren Sandmann Pedersen2-0/+322
This test tests the new glyph cache and compositing API. Much of this test is intending to making sure that clipping and alpha map handling survive any optimizations that may be added to the glyph compositing.
2012-05-30Add support for alpha maps to compute_crc32_for_image().Søren Sandmann Pedersen1-13/+71
When a destination image I has an alpha map A, the following rules apply: - If I has an alpha channel itself, the content of that channel is undefined - If A has RGB channels, the content of those channels is undefined. Hence in order to compute the CRC32 for such an image, we have to mask off the alpha channel of the image, and the RGB channels of the alpha map.
2012-05-30Move CRC32 computation from blitters-test.c into utils.cSøren Sandmann Pedersen3-31/+46
This way it can be used in other tests.
2012-05-30Add pixman_glyph_cache_t APISøren Sandmann Pedersen3-0/+496
This new API allows entire glyph strings to be composited in one go which reduces overhead compared to multiple calls to pixman_image_composite32(). The pixman_glyph_cache_t is a hash table that maps two keys (a "font" and a "glyph" key, but they are just keys; there is no distinction between them as far as pixman is concerned) to a glyph. Glyphs in the cache can be composited through two new entry points pixman_glyph_cache_composite_glyphs() and pixman_glyph_cache_composite_glyphs_no_mask(). A glyph cache may only be inserted into when it is "frozen", which is achieved by calling pixman_glyph_cache_freeze(). When pixman_glyph_cache_thaw() is later called, if the cache has become too crowded, some glyphs (currently the least-recently-used) will automatically be evicted. This means that a user must ensure that all the required glyphs are present in the cache before compositing a string. The intended way to use the cache is like this: pixman_glyph_t glyphs[MAX_GLYPHS]; pixman_glyph_cache_freeze (cache); for (i = 0; i < n_glyphs; ++i) { const void *g; if (!(g = pixman_glyph_cache_lookup (cache, font_key, glyph_key))) { img = <rasterize glyph as a pixman_image_t>; g = pixman_glyph_cache_insert (cache, font_key, glyph_key, glyph_origin_x, glyph_origin_y, img); if (!g) { /* Clean up out-of-memory condition */ goto oom; } glyphs[i].pos_x = glyph_x_pos; glyphs[i].pos_y = glyph_y_pos; glyphs[i].glyph = g; } } pixman_composite_glyphs (op, src, dest, ..., cache, n_glyphs, glyphs); pixman_glyph_cache_thaw (cache);
2012-05-30Add doubly linked listsSøren Sandmann Pedersen2-0/+56
This commit adds some new inline functions to maintain a doubly linked list. The way to use them is to embed a pixman_link_t into the structures that should be linked, and use a pixman_list_t as the head of the list. The new functions are pixman_list_init (pixman_list_t *list); pixman_list_prepend (pixman_list_t *list, pixman_link_t *link); pixman_list_move_to_front (pixman_list_t *list, pixman_link_t *link); There are also new macros: OFFSET_OF(type, member) CONTAINER_OF(type, member, data); that can be used to get from a pointer to a member to the containing structure.
2012-05-30Make use of image flags in mmx and sse2 iteratorsSøren Sandmann Pedersen2-20/+8
Now that we have the full image flags available, the SSE2 and MMX iterators can simply check against SAMPLES_COVER_CLIP_NEAREST (which is computed in pixman_image_composite32()) instead of comparing all the x/y/width/height parameters.
2012-05-30Pass the full image flags to iteratorsSøren Sandmann Pedersen12-33/+40
When pixman_image_composite32() is called some flags are computed that indicate various things about the composite operation that can't be deduced from the image flags themselves. These additional flags are not currently available to iterators. All they can do is read the image flags in image->common.flags. Fix that by passing the info->{src, mask, dest}_flags on to the iterator initialization and store the flags in the iter struct as "image_flags". At the same time rename the *iterator* flags variable to "iter_flags" to avoid confusion.
2012-05-27mmx: add missing _mm_empty callsMatt Turner1-0/+5
Fixes spurious test failures on x86-32.
2012-05-26mmx: add over_reverse_n_8888Matt Turner2-0/+73
Loongson: over_reverse_n_8888 = L1: 16.04 L2: 15.35 M: 10.20 ( 27.96%) HT: 10.95 VT: 10.45 R: 9.18 RT: 6.99 ( 76Kops/s) over_reverse_n_8888 = L1: 27.40 L2: 26.67 M: 16.97 ( 45.78%) HT: 16.66 VT: 15.38 R: 14.15 RT: 9.44 ( 97Kops/s) image poppler 34.106 35.500 1.48% 6/6 image poppler 29.598 30.835 1.70% 6/6 ARM/iwMMXt: over_reverse_n_8888 = L1: 15.63 L2: 14.33 M: 10.83 ( 27.55%) HT: 9.78 VT: 9.91 R: 9.49 RT: 6.96 ( 69Kops/s) over_reverse_n_8888 = L1: 22.79 L2: 19.40 M: 13.76 ( 34.19%) HT: 11.66 VT: 11.86 R: 11.17 RT: 7.85 ( 75Kops/s) image poppler 38.040 38.606 1.10% 6/6 image poppler 31.686 32.278 0.80% 5/6
2012-05-26mmx: add add_0565_0565Matt Turner1-0/+86
Loongson: add_0565_0565 = L1: 15.37 L2: 14.91 M: 11.83 ( 16.06%) HT: 10.53 VT: 10.15 R: 9.74 RT: 6.19 ( 68Kops/s) add_0565_0565 = L1: 45.06 L2: 46.71 M: 27.45 ( 38.00%) HT: 23.76 VT: 22.84 R: 18.96 RT: 9.79 ( 104Kops/s) ARM/iwMMXt: add_0565_0565 = L1: 12.87 L2: 11.58 M: 10.11 ( 12.50%) HT: 9.06 VT: 8.66 R: 7.70 RT: 5.62 ( 58Kops/s) add_0565_0565 = L1: 31.14 L2: 28.87 M: 22.46 ( 28.60%) HT: 18.61 VT: 17.04 R: 15.21 RT: 9.35 ( 90Kops/s)
2012-05-26fast: add add_0565_0565 functionMatt Turner1-0/+44
I'll need this code for header and tail alignment loops in MMX, so I might as well implement a fast path here.
2012-05-26mmx: implement expand_4x565 in terms of expand_4xpacked565Matt Turner1-27/+59
Loongson: over_n_0565 = L1: 38.57 L2: 38.88 M: 30.01 ( 20.97%) HT: 23.60 VT: 23.88 R: 21.95 RT: 11.65 ( 113Kops/s) over_n_0565 = L1: 56.28 L2: 55.90 M: 34.20 ( 23.82%) HT: 25.66 VT: 26.60 R: 23.78 RT: 11.80 ( 115Kops/s) over_8888_0565 = L1: 35.89 L2: 36.11 M: 21.56 ( 45.47%) HT: 18.33 VT: 17.90 R: 16.27 RT: 9.07 ( 98Kops/s) over_8888_0565 = L1: 40.91 L2: 41.06 M: 23.13 ( 48.46%) HT: 19.24 VT: 18.71 R: 16.82 RT: 9.18 ( 99Kops/s) over_n_8_0565 = L1: 28.92 L2: 29.12 M: 21.42 ( 30.00%) HT: 18.37 VT: 17.75 R: 16.15 RT: 8.79 ( 91Kops/s) over_n_8_0565 = L1: 32.32 L2: 32.13 M: 22.44 ( 31.27%) HT: 19.15 VT: 18.66 R: 16.62 RT: 8.86 ( 92Kops/s) over_n_8888_0565_ca = L1: 29.33 L2: 29.22 M: 18.99 ( 66.69%) HT: 16.69 VT: 16.22 R: 14.63 RT: 8.42 ( 88Kops/s) over_n_8888_0565_ca = L1: 34.97 L2: 34.14 M: 20.32 ( 71.73%) HT: 17.67 VT: 17.19 R: 15.23 RT: 8.50 ( 89Kops/s) ARM/iwMMXt: over_n_0565 = L1: 29.70 L2: 30.53 M: 24.47 ( 14.84%) HT: 22.28 VT: 21.72 R: 21.13 RT: 12.58 ( 105Kops/s) over_n_0565 = L1: 41.42 L2: 40.00 M: 30.95 ( 19.13%) HT: 27.06 VT: 27.28 R: 23.43 RT: 14.44 ( 114Kops/s) over_8888_0565 = L1: 12.73 L2: 11.53 M: 9.07 ( 16.47%) HT: 9.00 VT: 9.25 R: 8.44 RT: 7.27 ( 76Kops/s) over_8888_0565 = L1: 23.72 L2: 21.76 M: 15.89 ( 29.51%) HT: 14.36 VT: 14.05 R: 12.44 RT: 8.94 ( 86Kops/s) over_n_8_0565 = L1: 6.80 L2: 7.15 M: 6.37 ( 7.90%) HT: 6.58 VT: 6.24 R: 6.49 RT: 5.94 ( 59Kops/s) over_n_8_0565 = L1: 12.06 L2: 11.02 M: 10.16 ( 13.43%) HT: 9.57 VT: 8.49 R: 9.10 RT: 6.86 ( 69Kops/s) over_n_8888_0565_ca = L1: 7.62 L2: 7.01 M: 6.27 ( 20.52%) HT: 6.00 VT: 6.07 R: 5.68 RT: 5.53 ( 57Kops/s) over_n_8888_0565_ca = L1: 13.54 L2: 11.96 M: 9.76 ( 30.66%) HT: 9.72 VT: 8.45 R: 9.37 RT: 6.85 ( 67Kops/s)
2012-05-26mmx: add and use expand_4xpacked565 functionMatt Turner2-6/+59
Loongson: add_0565_0565 = L1: 14.39 L2: 13.98 M: 11.28 ( 15.22%) HT: 10.11 VT: 9.74 R: 9.39 RT: 6.05 ( 67Kops/s) add_0565_0565 = L1: 15.37 L2: 14.91 M: 11.83 ( 16.06%) HT: 10.53 VT: 10.15 R: 9.74 RT: 6.19 ( 68Kops/s) ARM/iwMMXt: add_0565_0565 = L1: 11.12 L2: 10.40 M: 8.82 ( 10.65%) HT: 7.98 VT: 7.41 R: 7.57 RT: 5.21 ( 54Kops/s) add_0565_0565 = L1: 12.87 L2: 11.58 M: 10.11 ( 12.50%) HT: 9.06 VT: 8.66 R: 7.70 RT: 5.62 ( 58Kops/s)
2012-05-26Post-release version bump to 0.27.1Søren Sandmann Pedersen1-2/+2
2012-05-26Pre-release version bump to 0.26.0Søren Sandmann Pedersen1-2/+2
2012-05-25Fix MSVC compilationIngmar Runge1-2/+7
Only up to three SSE intrinsics supported in function declaration.
2012-05-24test: Composite with solid images instead of using pixman_image_fill_*Søren Sandmann Pedersen2-11/+15
There is a couple of places where the test suite uses the pixman_image_fill_* functions to initialize images. These functions can fail, and will do so if the "fast" implementation is disabled. So to make sure the test suite passes even using PIXMAN_DISABLE="fast", use pixman_image_composite32() with a solid image instead of pixman_image_fill_*.
2012-05-23MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.Nemanja Lukic4-0/+177
Performance numbers before/after on MIPS-74kc @ 1GHz Referent (before): cairo-perf-trace: [ # ] backend test min(s) median(s) stddev. count [ # ] image: pixman 0.25.3 [ 0] image firefox-fishtank 2289.180 2290.567 0.05% 5/6 Optimized: cairo-perf-trace: [ # ] backend test min(s) median(s) stddev. count [ # ] image: pixman 0.25.3 [ 0] image firefox-fishtank 1700.925 1708.314 0.22% 5/6
2012-05-23MIPS: DSPr2: Fix bug in over_n_8888_8888_ca/over_n_8888_0565_ca routinesNemanja Lukic1-32/+28
In main loop (unrolled by factor 2), instead of negating multiplied mask values by srca, values of srca was negated, and passed as alpha argument for UN8x4_MUL_UN8x4_ADD_UN8x4 macro. Instead of: ma = ~ma; UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s); Code was doing this: ma = ~srca; UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s); Key is in substituting registers s0/s1 (containing srca value), with t0/t1 containing mask values multiplied by srca. Register usage is also improved (less registers are saved on stack, for over_n_8888_8888_ca routine). The bug was introduced in commit d2ee5631 and revealed by composite test.
2012-05-20demos: Add parrot.jpg to EXTRA_DISTSøren Sandmann Pedersen1-1/+1
Pointed out by Cyril Brulebois.
2012-05-15configure.ac: Fail the ARM/iwMMXt test if not compiling with -march=iwmmxtMatt Turner1-0/+3
If not compiling with -march=iwmmxt, the configure test will still pass, thinking that the __builtin_arm_* intrinsic is a function instead of generating a single instruction. Since no linking is done, the configure test doesn't catch this, and we get linking errors in the build.
2012-05-15Post-release version bump to 0.25.7Søren Sandmann Pedersen1-1/+1
2012-05-15Pre-release version bump to 0.25.6Søren Sandmann Pedersen1-1/+1
Note that 0.25.4 was a botched release that doesn't have a tag and doesn't correspond to any commit ID. It was however uploaded and announced, so I'll just use the 0.25.6 version number.
2012-05-15demos/Makefile.am: Add parrot.c to EXTRA_DISTSøren Sandmann Pedersen1-0/+2
To get 'make distcheck' to pass.
2012-05-11configure.ac: Rename loongson -> loongson-mmiMatt Turner1-7/+7
Make it match with the other fast paths, and the PIXMAN_DISABLE value is already loongson-mmi.
2012-05-11configure.ac: Fix loongson-mmi out-of-tree buildsMatt Turner1-1/+1
When building out-of-tree, gcc wasn't able to find loongson-mmintrin.h to compile the test program. Add -I$srcdir to CFLAGS to point gcc to it.
2012-05-11MIPS: DSPr2: Added over_n_8_8888 and over_n_8_0565 fast paths.Nemanja Lukic3-0/+301
Performance numbers before/after on MIPS-74kc @ 1GHz Referent (before): lowlevel-blt-bench: over_n_8_8888 = L1: 10.40 L2: 9.79 M: 8.47 ( 33.62%) HT: 7.64 VT: 7.59 R: 7.48 RT: 5.30 ( 40Kops/s) over_n_8_0565 = L1: 7.40 L2: 7.23 M: 6.78 ( 17.94%) HT: 6.23 VT: 6.17 R: 6.14 RT: 4.62 ( 37Kops/s) Optimized: lowlevel-blt-bench: over_n_8_8888 = L1: 27.25 L2: 26.24 M: 18.15 ( 72.12%) HT: 14.52 VT: 14.31 R: 13.83 RT: 7.57 ( 48Kops/s) over_n_8_0565 = L1: 18.91 L2: 17.59 M: 15.06 ( 39.90%) HT: 12.18 VT: 11.98 R: 11.83 RT: 6.80 ( 46Kops/s)
2012-05-10mmx: add and use pack_4x565 functionMatt Turner1-55/+52
The pack_4x565 makes use of the pack_4xpacked565 function which uses pmadd. Some of the speed up is probably attributable to removing the artificial serialization imposed by the vdest = pack_565 (..., vdest, 0); vdest = pack_565 (..., vdest, 1); ... pattern. Loongson: over_n_0565 = L1: 16.44 L2: 16.42 M: 13.83 ( 9.85%) HT: 12.83 VT: 12.61 R: 12.34 RT: 8.90 ( 93Kops/s) over_n_0565 = L1: 42.48 L2: 42.53 M: 29.83 ( 21.20%) HT: 23.39 VT: 23.72 R: 21.80 RT: 11.60 ( 113Kops/s) over_8888_0565 = L1: 15.61 L2: 15.42 M: 12.11 ( 25.79%) HT: 11.07 VT: 10.70 R: 10.37 RT: 7.25 ( 82Kops/s) over_8888_0565 = L1: 35.01 L2: 35.20 M: 21.42 ( 45.57%) HT: 18.12 VT: 17.61 R: 16.09 RT: 9.01 ( 97Kops/s) over_n_8_0565 = L1: 15.17 L2: 14.94 M: 12.57 ( 17.86%) HT: 11.96 VT: 11.52 R: 10.79 RT: 7.31 ( 79Kops/s) over_n_8_0565 = L1: 29.83 L2: 29.79 M: 21.85 ( 30.94%) HT: 18.82 VT: 18.25 R: 16.15 RT: 8.72 ( 91Kops/s) over_n_8888_0565_ca = L1: 15.25 L2: 15.02 M: 11.64 ( 41.39%) HT: 11.08 VT: 10.72 R: 10.02 RT: 7.00 ( 77Kops/s) over_n_8888_0565_ca = L1: 30.12 L2: 29.99 M: 19.47 ( 68.99%) HT: 17.05 VT: 16.55 R: 14.67 RT: 8.38 ( 88Kops/s) ARM/iwMMXt: over_n_0565 = L1: 19.29 L2: 19.88 M: 17.38 ( 10.54%) HT: 15.53 VT: 16.11 R: 13.69 RT: 11.00 ( 96Kops/s) over_n_0565 = L1: 36.02 L2: 34.85 M: 28.04 ( 16.97%) HT: 22.12 VT: 24.21 R: 22.36 RT: 12.22 ( 103Kops/s) over_8888_0565 = L1: 18.38 L2: 16.59 M: 12.34 ( 22.29%) HT: 11.67 VT: 11.71 R: 11.02 RT: 6.89 ( 72Kops/s) over_8888_0565 = L1: 24.96 L2: 22.17 M: 15.11 ( 26.81%) HT: 14.14 VT: 13.71 R: 13.18 RT: 8.13 ( 78Kops/s) over_n_8_0565 = L1: 14.65 L2: 12.44 M: 11.56 ( 14.50%) HT: 10.93 VT: 10.39 R: 10.06 RT: 7.05 ( 70Kops/s) over_n_8_0565 = L1: 18.37 L2: 14.98 M: 13.97 ( 16.51%) HT: 12.67 VT: 10.35 R: 11.80 RT: 8.14 ( 74Kops/s) over_n_8888_0565_ca = L1: 14.27 L2: 12.93 M: 10.52 ( 33.23%) HT: 9.70 VT: 9.90 R: 9.31 RT: 6.34 ( 65Kops/s) over_n_8888_0565_ca = L1: 19.69 L2: 17.58 M: 13.40 ( 42.35%) HT: 11.75 VT: 11.33 R: 11.17 RT: 7.49 ( 73Kops/s)
2012-05-10configure.ac: make -march=loongson2f come before CFLAGSMatt Turner1-1/+1
Otherwise we'd have -march=loongson2f being overridden by automake's CFLAGS ordering which causes build failures when -march=<not loongson2f> is specified by the user.
2012-05-10Add Makefile.win32 and Makefile.win32.common to EXTRA_DISTSøren Sandmann Pedersen1-0/+4
https://bugs.freedesktop.org/show_bug.cgi?id=46905
2012-05-09.gitignore: add demos/checkerboard and demos/quad2quadMatt Turner1-0/+2
2012-04-27mmx: Use wpackhus in src_x888_0565 on iwMMXtMatt Turner1-1/+5
iwMMXt which has an unsigned saturation pack instruction, while MMX/EXT and Loongson don't. ARM/iwMMXt: src_8888_0565 = L1: 110.38 L2: 82.33 M: 40.92 ( 73.22%) HT: 35.63 VT: 32.22 R: 30.07 RT: 18.40 ( 132Kops/s) src_8888_0565 = L1: 117.91 L2: 83.05 M: 41.52 ( 75.58%) HT: 37.63 VT: 35.40 R: 29.37 RT: 19.39 ( 134Kops/s)
2012-04-27mmx: add src_8888_0565Matt Turner2-0/+96
Uses the pmadd technique described in http://software.intel.com/sites/landingpage/legacy/mmx/MMX_App_24-16_Bit_Conversion.pdf The technique uses the packssdw instruction which uses signed saturatation. This works in their example because they pack 888 to 555 leaving the high bit as zero. For packing to 565, it is unsuitable, so we replace it with an or+shuffle. Loongson: src_8888_0565 = L1: 106.13 L2: 83.57 M: 33.46 ( 68.90%) HT: 30.29 VT: 27.67 R: 26.11 RT: 15.06 ( 135Kops/s) src_8888_0565 = L1: 122.10 L2: 117.53 M: 37.97 ( 78.58%) HT: 33.14 VT: 30.09 R: 29.01 RT: 15.76 ( 139Kops/s) ARM/iwMMXt: src_8888_0565 = L1: 67.88 L2: 56.61 M: 31.20 ( 56.74%) HT: 29.22 VT: 27.01 R: 25.39 RT: 19.29 ( 130Kops/s) src_8888_0565 = L1: 110.38 L2: 82.33 M: 40.92 ( 73.22%) HT: 35.63 VT: 32.22 R: 30.07 RT: 18.40 ( 132Kops/s)
2012-04-27mmx: add x8f8g8b8 fetcherMatt Turner1-0/+42
Loongson: add_x888_x888 = L1: 29.36 L2: 27.81 M: 14.05 ( 38.74%) HT: 12.45 VT: 11.78 R: 11.52 RT: 7.23 ( 75Kops/s) add_x888_x888 = L1: 36.06 L2: 34.55 M: 14.81 ( 41.03%) HT: 14.01 VT: 13.41 R: 13.06 RT: 9.06 ( 90Kops/s) src_x888_8_x888 = L1: 21.92 L2: 20.15 M: 13.35 ( 41.42%) HT: 11.70 VT: 10.95 R: 10.53 RT: 6.18 ( 65Kops/s) src_x888_8_x888 = L1: 25.43 L2: 23.51 M: 14.12 ( 44.00%) HT: 13.14 VT: 12.50 R: 11.86 RT: 7.49 ( 76Kops/s) over_x888_8_0565 = L1: 10.64 L2: 10.17 M: 7.74 ( 21.35%) HT: 6.83 VT: 6.55 R: 6.34 RT: 4.03 ( 46Kops/s) over_x888_8_0565 = L1: 11.41 L2: 10.97 M: 8.07 ( 22.36%) HT: 7.42 VT: 7.18 R: 6.92 RT: 4.62 ( 52Kops/s) ARM/iwMMXt: add_x888_x888 = L1: 22.10 L2: 18.93 M: 13.48 ( 32.29%) HT: 11.32 VT: 10.64 R: 10.36 RT: 6.51 ( 61Kops/s) add_x888_x888 = L1: 24.26 L2: 20.83 M: 14.52 ( 35.64%) HT: 12.66 VT: 12.98 R: 11.34 RT: 7.69 ( 72Kops/s) src_x888_8_x888 = L1: 19.33 L2: 17.66 M: 14.26 ( 38.43%) HT: 11.53 VT: 10.83 R: 10.57 RT: 6.12 ( 58Kops/s) src_x888_8_x888 = L1: 21.23 L2: 19.60 M: 15.41 ( 42.55%) HT: 12.66 VT: 13.30 R: 11.55 RT: 7.32 ( 67Kops/s) over_x888_8_0565 = L1: 8.15 L2: 7.56 M: 6.50 ( 15.58%) HT: 5.73 VT: 5.49 R: 5.50 RT: 3.53 ( 38Kops/s) over_x888_8_0565 = L1: 8.35 L2: 7.85 M: 6.68 ( 16.40%) HT: 6.12 VT: 5.97 R: 5.78 RT: 4.03 ( 43Kops/s)
2012-04-27mmx: add a8 fetcherMatt Turner2-0/+68
oprofile of xfce4-terminal-a1 210535 9.0407 libpixman-1.so.0.25.3 fetch_scanline_a8 144802 6.0054 libpixman-1.so.0.25.3 mmx_fetch_a8 Loongson: add_8_8_8 = L1: 17.98 L2: 17.28 M: 14.28 ( 19.79%) HT: 11.11 VT: 10.38 R: 9.97 RT: 5.14 ( 55Kops/s) add_8_8_8 = L1: 20.44 L2: 19.65 M: 15.62 ( 21.53%) HT: 12.86 VT: 11.98 R: 11.32 RT: 6.13 ( 64Kops/s) src_8888_8_0565 = L1: 19.97 L2: 18.59 M: 13.42 ( 32.55%) HT: 11.46 VT: 10.78 R: 10.33 RT: 5.87 ( 61Kops/s) src_8888_8_0565 = L1: 21.16 L2: 19.68 M: 13.94 ( 33.64%) HT: 12.31 VT: 11.52 R: 11.02 RT: 6.54 ( 68Kops/s) src_x888_8_x888 = L1: 20.54 L2: 18.88 M: 13.07 ( 40.74%) HT: 11.05 VT: 10.36 R: 10.02 RT: 5.68 ( 60Kops/s) src_x888_8_x888 = L1: 21.92 L2: 20.15 M: 13.35 ( 41.42%) HT: 11.70 VT: 10.95 R: 10.53 RT: 6.18 ( 65Kops/s) over_x888_8_0565 = L1: 10.32 L2: 9.85 M: 7.63 ( 21.13%) HT: 6.56 VT: 6.30 R: 6.12 RT: 3.80 ( 43Kops/s) over_x888_8_0565 = L1: 10.64 L2: 10.17 M: 7.74 ( 21.35%) HT: 6.83 VT: 6.55 R: 6.34 RT: 4.03 ( 46Kops/s) ARM/iwMMXt: add_8_8_8 = L1: 13.10 L2: 11.67 M: 10.74 ( 13.46%) HT: 8.62 VT: 8.15 R: 7.94 RT: 4.39 ( 44Kops/s) add_8_8_8 = L1: 13.81 L2: 12.79 M: 11.63 ( 13.93%) HT: 9.33 VT: 9.20 R: 9.04 RT: 5.43 ( 52Kops/s) src_8888_8_0565 = L1: 16.62 L2: 15.07 M: 12.52 ( 27.46%) HT: 10.07 VT: 10.17 R: 9.95 RT: 5.64 ( 54Kops/s) src_8888_8_0565 = L1: 16.84 L2: 16.11 M: 13.22 ( 27.71%) HT: 11.74 VT: 10.90 R: 10.80 RT: 6.66 ( 62Kops/s) src_x888_8_x888 = L1: 17.49 L2: 16.22 M: 13.73 ( 38.73%) HT: 10.10 VT: 10.33 R: 9.55 RT: 5.21 ( 52Kops/s) src_x888_8_x888 = L1: 19.33 L2: 17.66 M: 14.26 ( 38.43%) HT: 11.53 VT: 10.83 R: 10.57 RT: 6.12 ( 58Kops/s) over_x888_8_0565 = L1: 7.57 L2: 7.29 M: 6.37 ( 15.97%) HT: 5.53 VT: 5.33 R: 5.21 RT: 3.22 ( 35Kops/s) over_x888_8_0565 = L1: 8.15 L2: 7.56 M: 6.50 ( 15.58%) HT: 5.73 VT: 5.49 R: 5.50 RT: 3.53 ( 38Kops/s)
2012-04-27mmx: add r5g6b5 fetcherMatt Turner1-0/+100
Loongson: add_0565_0565 = L1: 12.73 L2: 12.26 M: 10.05 ( 13.87%) HT: 8.77 VT: 8.50 R: 8.25 RT: 5.28 ( 58Kops/s) add_0565_0565 = L1: 14.04 L2: 13.63 M: 10.96 ( 15.19%) HT: 9.73 VT: 9.43 R: 9.11 RT: 5.93 ( 64Kops/s) ARM/iwMMXt: add_0565_0565 = L1: 10.36 L2: 10.03 M: 9.04 ( 10.88%) HT: 3.11 VT: 7.16 R: 7.72 RT: 5.12 ( 51Kops/s) add_0565_0565 = L1: 10.84 L2: 10.20 M: 9.15 ( 11.46%) HT: 7.60 VT: 7.82 R: 7.70 RT: 5.41 ( 53Kops/s)
2012-04-27mmx: Use Loongson pextrh instruction in expand565Matt Turner2-0/+15
Same story as pinsrh in the previous commit. text data bss dec hex filename 25336 1952 0 27288 6a98 .libs/libpixman_loongson_mmi_la-pixman-mmx.o 25072 1952 0 27024 6990 .libs/libpixman_loongson_mmi_la-pixman-mmx.o -dsll: 95 +dsll: 70 -dsrl: 135 +dsrl: 105 -ldc1: 462 +ldc1: 445 -lw: 721 +lw: 700 +pextrh: 30
2012-04-27mmx: Use Loongson pinsrh instruction in pack_565Matt Turner2-0/+25
The pinsrh instruction is analogous to MMX EXT's pinsrw, except like other Loongson vector instructions it cannot access the general purpose registers. In the cases of other Loongson vector instructions, this is a headache, but it is actually a good thing here. Since the instruction is different from MMX, I've named the intrinsic loongson_insert_pi16. text data bss dec hex filename 25976 1952 0 27928 6d18 .libs/libpixman_loongson_mmi_la-pixman-mmx.o 25336 1952 0 27288 6a98 .libs/libpixman_loongson_mmi_la-pixman-mmx.o -and: 181 +and: 147 -dsll: 143 +dsll: 95 -dsrl: 87 +dsrl: 135 -ldc1: 523 +ldc1: 462 -lw: 767 +lw: 721 +pinsrh: 35
2012-04-27mmx: don't pack and unpack src unnecessarilyMatt Turner1-47/+35
The combine function was store8888'ing the result, and all consumers were immediately load8888'ing it, causing lots of unnecessary pack and unpack instructions. It's a very straight forward conversion, except for mmx_combine_over_u and mmx_combine_saturate_u. mmx_combine_over_u was testing the integer result to skip pixels, so we use the is_* functions to test the __m64 data directly without loading it into an integer register. For mmx_combine_saturate_u there's not a lot we can do, since it uses DIV_UN8.
2012-04-27mmx: introduce is_equal, is_opaque, and is_zero functionsMatt Turner1-0/+41
To be used by the next commit.
2012-04-27mmx: simplify srcsrcsrcsrc calculation in over_n_8_0565Matt Turner1-7/+3
2012-04-27mmx: remove unnecessary uint64_t<->__m64 conversionsMatt Turner1-3/+1
Loongson: add_8888_8888 = L1: 68.73 L2: 55.09 M: 25.39 ( 68.18%) HT: 25.28 VT: 22.42 R: 20.74 RT: 13.26 ( 131Kops/s) add_8888_8888 = L1: 159.19 L2: 114.10 M: 30.74 ( 77.91%) HT: 27.63 VT: 24.99 R: 24.61 RT: 14.49 ( 141Kops/s)
2012-04-27mmx: compile on MIPS for Loongson MMI optimizationsMatt Turner6-13/+350
image image16 evolution 32.985 -> 29.667 27.314 -> 23.870 firefox-planet-gnome 197.982 -> 180.437 220.986 -> 205.057 gnome-system-monitor 48.482 -> 49.752 52.820 -> 49.528 gnome-terminal-vim 60.799 -> 50.528 51.655 -> 44.131 grads-heat-map 3.167 -> 3.181 3.328 -> 3.321 gvim 38.646 -> 32.552 38.126 -> 34.453 midori-zoomed 44.371 -> 43.338 28.860 -> 28.865 ocitysmap 23.065 -> 18.057 23.046 -> 18.055 poppler 43.676 -> 36.077 43.065 -> 36.090 swfdec-giant-steps 20.166 -> 20.365 22.354 -> 16.578 swfdec-youtube 31.502 -> 28.118 44.052 -> 41.771 xfce4-terminal-a1 69.517 -> 51.288 62.225 -> 53.309
2012-04-27mmx: make ldq_u take __m64* directlyMatt Turner1-26/+26
Before, if __m64 is allocated in vector or floating-point registers, __m64 vs = ldq_u((uint64_t *)src); would cause src to be loaded into an integer register and then transferred to an __m64 register. By switching ldq_u's argument type to __m64 we give the compile enough information to recognize that it can load to the vector register directly. This patch is necessary for the Loongson optimizations when __m64 is typedef'd as double.
2012-04-27mmx: add load function and use it in add_8888_8888Matt Turner1-5/+11
2012-04-27mmx: add store function and use it in add_8888_8888Matt Turner1-5/+11
2012-04-20bits_image_fetch_pixel_convolution(): Make sure channels are signedSøren Sandmann Pedersen1-5/+5
In the computation: srtot += RED_8 (pixel) * f RED_8 (pixel) is an unsigned quantity, which means the signed filter coefficient f gets converted to an unsigned integer before the multiplication. We get away with this because when the 32 bit unsigned result is converted to int32_t, the correct sign is produced. But if srtot had been an int64_t, the result would have been a very large positive number. Fix this by explicitly casting the channels to int.