Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM assembly to address performance of blit and fill routines #777

Closed
SDLBugzilla opened this issue Feb 10, 2021 · 3 comments
Closed

ARM assembly to address performance of blit and fill routines #777

SDLBugzilla opened this issue Feb 10, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@SDLBugzilla
Copy link
Collaborator

This bug report was migrated from our old Bugzilla tracker.

These attachments are available in the static archive:

Reported in version: HG 1.2
Reported for operating system, platform: Linux, ARM

Comments on the original bug report:

On 2018-11-07 15:29:25 +0000, Ben Avison wrote:

Created attachment 3452
ARM: Create configure option --enable-arm-simd to govern assembly optimizations

In tests on the Raspberry Pi, various applications suffer poor graphics performance, traceable to SDL routines. Presented here are a series of patches to SDL 1.2 which provide assembly versions of these functions, under control from the configure script.

The Raspberry Pi is available with a range of ARM architectures (ARMv6 for the Pi 1 and 0, ARMv7/v8 with NEON for the Pi 2 and 3), and two different implementations give best results on different versions. A single binary can contain both implementations and choose between them by detecting the CPU at runtime.

On 2018-11-07 15:31:22 +0000, Ben Avison wrote:

Created attachment 3453
ARM: SIMD assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2018-11-07 15:31:58 +0000, Ben Avison wrote:

Created attachment 3454
ARM: SIMD assembly optimization for function BlitARGBto565PixelAlpha

On 2018-11-07 15:32:27 +0000, Ben Avison wrote:

Created attachment 3455
SDL_blit: use a named enum for required hardware features bits in dispatch tables

On 2018-11-07 15:33:01 +0000, Ben Avison wrote:

Created attachment 3456
ARM: SIMD assembly optimization for BGR-to-RGB 32bpp normal blits

On 2018-11-07 15:33:28 +0000, Ben Avison wrote:

Created attachment 3457
ARM: assembly optimization for SDL_FillRect

On 2018-11-07 15:34:05 +0000, Ben Avison wrote:

Created attachment 3458
ARM: SIMD optimization for 4:4:4:4 to 8:8:8:8 normal blits

On 2018-11-07 15:34:34 +0000, Ben Avison wrote:

Created attachment 3459
ARM: Create configure option --enable-arm-neon to govern assembly optimizations

On 2018-11-07 15:35:09 +0000, Ben Avison wrote:

Created attachment 3460
ARM: NEON assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2018-11-07 15:35:38 +0000, Ben Avison wrote:

Created attachment 3461
ARM: NEON assembly optimization for function BlitARGBto565PixelAlpha

On 2018-11-07 15:36:10 +0000, Ben Avison wrote:

Created attachment 3462
ARM: NEON assembly optimization for SDL_FillRect

On 2018-11-07 23:42:46 +0000, Sam Lantinga wrote:

Do you have equivalent patches for SDL 2.0, along with performance testing results?

Thanks!

On 2018-11-08 05:20:43 +0000, Ozkan Sezer wrote:

Should we merge the existing patches to SDL-1.2 if we get performance
test results?

On 2018-11-09 02:07:00 +0000, Ben Avison wrote:

If you want an idea of the difference this makes in a real-world application, running this Pygame application on a Pi 3:

http://sphinx.mythic-beasts.com/~eupton/myriapod.tgz

  • does 9.6 fps with the existing C implementations
  • does 22.6 fps with the ARMv6-tuned optimizations
  • does 27.3 fps with the NEON optimizations

For more formal benchmarking, I wrote the attached benchmarker utility. It's based on the benchmarker that's included with Pixman (from where I borrowed the assembly framework used in most of the opimizations). It measures performance in cases where the image data is in the L1 cache, vs in the L2 cache, vs in main memory.

Figures represent the rate of writing to the output image, in megabytes/second. Assembly speedups are quoted relative to the original C versions. It's worth bearing in mind that in cases that do alpha blending, these benchmarks only measure the case where all source pixels are semi-transparent, which is not necessarily very representative of real-world use.

For the Pi 1 (ARMv6, ARM1176JZFS):

FillRect
L1 L2 Memory
C 889 462 337
ARMv6 assembly 3518 (x3.96) 1406 (x3.04) 1314 (x3.90)

BlitRGBtoRGB
L1 L2 Memory
C 74.9 47.8 38.6
ARMv6 assembly 91.5 (x1.22) 81.2 (x1.69) 79.3 (x2.06)

BlitARGBto565
L1 L2 Memory
C 38.0 25.1 21.7
ARMv6 assembly 48.7 (x1.28) 42.5 (x1.69) 43.2 (x1.99)

Blit_BGR888_RGB888
L1 L2 Memory
C 29.2 24.3 24.4
ARMv6 assembly 528 (x18.1) 295 (x12.1) 279 (x11.5)

Blit_RGB444_RGB888
L1 L2 Memory
C 31.0 27.0 27.3
ARMv6 assembly 328 (x10.6) 190 (x7.04) 246 (x9.02)

For the Pi 3 (ARMv8 with NEON, Cortex-A53):

FillRect
L1 L2 Memory
C 2120 2100 1091
ARMv6 assembly 4704 (x2.22) 4591 (x2.19) 1098 (x1.01)
NEON assembly 15020 (x7.08) 10662 (x5.08) 1093 (x1.00)

BlitRGBtoRGB
L1 L2 Memory
C 236 233 205
ARMv6 assembly 242 (x1.02) 241 (x1.04) 238 (x1.16)
NEON assembly 1081 (x4.58) 1015 (x4.36) 571 (x2.79)

BlitARGBto565
L1 L2 Memory
C 111 110 101
ARMv6 assembly 126 (x1.14) 126 (x1.14) 125 (x1.23)
NEON assembly 555 (x4.99) 535 (x4.84) 440 (x4.34)

Blit_BGR888_RGB888
L1 L2 Memory
C 101 101 97.5
ARMv6 assembly 1510 (x14.9) 1470 (x14.6) 886 (x9.09)

Blit_RGB444_RGB888
L1 L2 Memory
C 102 101 98.9
ARMv6 assembly 667 (x6.56) 668 (x6.59) 660 (x6.68)

Regarding the 2.0 branch, I've had a go at porting it across, so I shall shortly be attaching a second patch series. It builds for me, but it's untested and unbenchmarked.

I see that the 2.0 branch already has some ARM NEON detection logic, though it's used for audio rather than video. It didn't feature an equivalent of the --enable-arm-neon configure switch I created for the 1.2 branch, so I've migrated that across. At present, it only controls whether my video NEON optimisations are built - perhaps you'd like it to control the existing audio ones also?

On 2018-11-09 02:09:04 +0000, Ben Avison wrote:

Created attachment 3468
Benchmark utility

On 2018-11-09 02:13:05 +0000, Ben Avison wrote:

Created attachment 3469
[SDL2] ARM: Create configure option --enable-arm-simd to govern assembly optimizations

On 2018-11-09 02:13:51 +0000, Ben Avison wrote:

Created attachment 3470
[SDL2] ARM: SIMD assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2018-11-09 02:14:27 +0000, Ben Avison wrote:

Created attachment 3471
[SDL2] ARM: SIMD assembly optimization for function BlitARGBto565PixelAlpha

On 2018-11-09 02:15:13 +0000, Ben Avison wrote:

Created attachment 3472
[SDL2] SDL_blit: use a named enum for required hardware features bits in dispatch tables

On 2018-11-09 02:17:22 +0000, Ben Avison wrote:

Created attachment 3473
[SDL2] ARM: SIMD assembly optimization for BGR-to-RGB 32bpp normal blits

On 2018-11-09 02:18:06 +0000, Ben Avison wrote:

Created attachment 3474
[SDL2] ARM: assembly optimization for SDL_FillRect

On 2018-11-09 02:18:36 +0000, Ben Avison wrote:

Created attachment 3475
ARM: SIMD optimization for 4:4:4:4 to 8:8:8:8 normal blits

On 2018-11-09 02:19:13 +0000, Ben Avison wrote:

Created attachment 3476
[SDL2] ARM: Create configure option --enable-arm-neon to govern assembly optimizations

On 2018-11-09 02:19:49 +0000, Ben Avison wrote:

Created attachment 3477
[SDL2] ARM: NEON assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2018-11-09 02:21:32 +0000, Ben Avison wrote:

Created attachment 3478
[SDL2] ARM: NEON assembly optimization for function BlitARGBto565PixelAlpha

On 2018-11-09 02:22:01 +0000, Ben Avison wrote:

Created attachment 3479
[SDL2] ARM: NEON assembly optimization for SDL_FillRect

On 2019-01-23 13:53:25 +0000, Sylvain wrote:

Hi,
Just wondering. I heard NEON assembly had differences between armeabi-v7a and arm64. Is this true ? And if so, which arch those neon routines are for ?

On 2019-01-23 14:14:52 +0000, Ben Avison wrote:

Yes, 32-bit and 64-bit ARM assembly are quite different in many ways, enough that it's generally impractical to write either pure assembly or inline assembly in C/C++ to target both. Some code sharing can be achieved using compiler intrinsics, but in my experience it's not possible to achieve the best results using intrinsics, at least with current compilers.

All these optimizations are written for 32-bit ARM instruction sets (and should work when compiled as either ARM or Thumb).

On 2019-01-25 10:09:31 +0000, Sam Lantinga wrote:

Thanks for creating these!

It looks like you're the original author of the pixman code used here. Some companies are leery of mixed licenses, would it be possible to re-license the assembly under the Zlib license?

The patches generally look good, here are a few comments:

  • CPU_HAS_NEON should not be changed, this breaks ABI compatibility
  • Casting an enum to unsigned int can be dangerous, some compilers use smaller datatypes (e.g. char) for small enumerations. You should declare features as the type of variable you're scanning into here:
    SDL_sscanf(override, "%u", (unsigned int *) &features);

On 2019-02-06 13:38:43 +0000, Ben Avison wrote:

For some reason, bugzilla has only just let me know about Sam's comment!

It looks like you're the original author of the pixman code used here. Some
companies are leery of mixed licenses, would it be possible to re-license the
assembly under the Zlib license?

Well, I wrote a lot of the pixman code. I'm pretty sure I can relicense that.

However, pixman-arm-neon-asm.h and the top part of pixman-arm-neon-asm.S (down to line 77) were written by Siarhei Siamashka and say they're (c) Nokia. It appears he left their employ in 2012, so I don't know how practical it would be to change the license on those. pixman-arm-asm.h is pretty trivial, but its sole macro definition was actually also originally written by Siarhei during his time at Nokia (pixman commit 3ef20333 - the Mozilla copyright refers to the C source code of which this is a disassembly, but none of that remains in pixman-arm-asm.h).

The pixman code is under a variant of the MIT license, which is similar in spirit to the zlib license. It looks to me as though main additional action it places on the user to do (as opposed to not do) is that it requires the copyright message and license header be included in documentation, whereas the zlib license only requests ackonwledgment in the documentation. How much of a deal-breaker is it if I can't get permission to relicense from Nokia?

  • CPU_HAS_NEON should not be changed, this breaks ABI compatibility

It's only on the SDL2 branch that I changed value of existing defines - I think it was because I wrote the SDL1.2 patches first and I wanted to keep the ordering consistent between the two branches, but you're right, ABI compatibility trumps that consideration.

  • Casting an enum to unsigned int can be dangerous, some compilers use smaller datatypes (e.g. char) for small enumerations. You should declare features as the type of variable you're scanning into here:
    SDL_sscanf(override, "%u", (unsigned int *) &features);

Good catch. I suppose I could revert to using unsigned int and use #defines for the values instead, but I generally prefer enums where possible because it lets debuggers present the symbolic names. Would something like this be acceptable?

--- a/src/video/SDL_blit_N.c
+++ b/src/video/SDL_blit_N.c
@@ -856,12 +856,13 @@ static enum blit_features
GetBlitFeatures(void)
{
static enum blit_features features = -1;

  • if (features == -1) {
  • if (features == (enum blit_features) -1) {
    /* Provide an override for testing .. */
    char *override = SDL_getenv("SDL_ALTIVEC_BLIT_FEATURES");
    if (override) {
  •        features = 0;
    
  •        SDL_sscanf(override, "%u", (unsigned int *) &features);
    
  •        unsigned int features_as_uint = 0;
    
  •        SDL_sscanf(override, "%u", &features_as_uint);
    
  •        features = (enum blit_features) features_as_uint;
       } else {
           features = (0
                       /* Feature 1 is has-MMX */
    

One more question: can I post a changes like this (or changes of license headers) here as patches to be applied at the end after the existing ones, or do you need me to rework the old ones to incorporate review comments? It'll be a bit tedious to re-upload so many of them, and it won't be immediately obvious which patches are outdated, so perhaps it would be best to open a new "bug" in that case?

On 2019-02-28 19:52:35 +0000, Ben Avison wrote:

In an attempt to keep this from going stale, I have rolled in the above changes and rebased my patch series to the tips of the 1.2 and 2.0 branches respectively (nearly 4 months of development have introduced some conflicts). I have relicensed what I can to the zlib license, but I have yet to establish who at Nokia would have the rights to relicense the remaining MIT-licensed code.

On 2019-02-28 19:54:21 +0000, Ben Avison wrote:

Created attachment 3661
[SDL1.2] ARM: Create configure option --enable-arm-simd to govern assembly optimizations

On 2019-02-28 19:55:11 +0000, Ben Avison wrote:

Created attachment 3662
[SDL1.2] ARM: SIMD assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-02-28 19:56:56 +0000, Ben Avison wrote:

Created attachment 3663
[SDL1.2] ARM: SIMD assembly optimization for function BlitARGBto565PixelAlpha

On 2019-02-28 19:58:53 +0000, Ben Avison wrote:

Created attachment 3664
[SDL1.2] SDL_blit: use a named enum for required hardware features bits in dispatch tables

On 2019-02-28 19:59:47 +0000, Ben Avison wrote:

Created attachment 3665
[SDL1.2] ARM: SIMD assembly optimization for BGR-to-RGB 32bpp normal blits

On 2019-02-28 20:00:56 +0000, Ben Avison wrote:

Created attachment 3666
[SDL1.2] ARM: assembly optimization for SDL_FillRect

On 2019-02-28 20:02:20 +0000, Ben Avison wrote:

Created attachment 3667
[SDL1.2] ARM: SIMD optimization for 4:4:4:4 to 8:8:8:8 normal blits

On 2019-02-28 20:05:02 +0000, Ben Avison wrote:

Created attachment 3668
[SDL1.2] ARM: Create configure option --enable-arm-neon to govern assembly optimizations

On 2019-02-28 20:05:37 +0000, Ben Avison wrote:

Created attachment 3669
[SDL1.2] ARM: NEON assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-02-28 20:06:10 +0000, Ben Avison wrote:

Created attachment 3670
[SDL1.2] ARM: NEON assembly optimization for function BlitARGBto565PixelAlpha

On 2019-02-28 20:06:44 +0000, Ben Avison wrote:

Created attachment 3671
[SDL1.2] ARM: NEON assembly optimization for SDL_FillRect

On 2019-02-28 20:09:20 +0000, Ben Avison wrote:

Created attachment 3672
[SDL2] ARM: Create configure option --enable-arm-simd to govern assembly optimizations

On 2019-02-28 20:12:24 +0000, Ben Avison wrote:

Created attachment 3673
[SDL2] ARM: SIMD assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-02-28 20:12:49 +0000, Ben Avison wrote:

Created attachment 3674
[SDL2] ARM: SIMD assembly optimization for function BlitARGBto565PixelAlpha

On 2019-02-28 20:13:19 +0000, Ben Avison wrote:

Created attachment 3675
[SDL2] SDL_blit: use a named enum for required hardware features bits in dispatch tables

On 2019-02-28 20:13:57 +0000, Ben Avison wrote:

Created attachment 3676
[SDL2] ARM: SIMD assembly optimization for BGR-to-RGB 32bpp normal blits

On 2019-02-28 20:14:53 +0000, Ben Avison wrote:

Created attachment 3677
[SDL2] ARM: assembly optimization for SDL_FillRect

On 2019-02-28 20:15:41 +0000, Ben Avison wrote:

Created attachment 3678
[SDL2] ARM: SIMD optimization for 4:4:4:4 to 8:8:8:8 normal blits

On 2019-02-28 20:16:29 +0000, Ben Avison wrote:

Created attachment 3679
[SDL2] ARM: Create configure option --enable-arm-neon to govern assembly optimizations

On 2019-02-28 20:18:21 +0000, Ben Avison wrote:

Created attachment 3680
[SDL2] ARM: NEON assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-02-28 20:18:54 +0000, Ben Avison wrote:

Created attachment 3681
[SDL2] ARM: NEON assembly optimization for function BlitARGBto565PixelAlpha

On 2019-02-28 20:19:28 +0000, Ben Avison wrote:

Created attachment 3682
[SDL2] ARM: NEON assembly optimization for SDL_FillRect

On 2019-07-30 17:49:37 +0000, Ryan C. Gordon wrote:

(Sorry if you get several emails like this, we're marking a bunch of bugs.)

We're hoping to ship SDL 2.0.11 on a much shorter timeframe than we have historically done releases, so I'm starting to tag bugs we hope to have closed in this release cycle.

Note that this tag means we just intend to scrutinize this bug for the 2.0.11 release: we may fix it, reject it, or even push it back to a later release for now, but this helps give us both a goal and a wishlist for the next release.

If this bug has been quiet for a few months and you have new information (such as, "this is definitely still broken" or "this got fixed at some point"), please feel free to retest and/or add more notes to the bug.

--ryan.

On 2019-08-19 17:13:56 +0000, Ben Avison wrote:

I've finally managed to get a response from Siarhei, thee author of pixman-arm-neon-asm.h and the top part of pixman-arm-neon-asm.S (down to line 77) about the potential of changing the license on that code. It doesn't sound too promising:

Unfortunately I can't help much. I don't know who even owns this
copyright nowadays after Microsoft bought Nokia's smartphone business.

But if I remember correctly, Nokia's lawyers did not particularly
like the zlib license because its no-liability disclaimer was
kinda weak. Looks like the following stackexchange question
discusses this particular problem:
https://opensource.stackexchange.com/questions/4329/is-zlib-license-no-waranty-no-liability-clause-enough-for-us-and-eu-law

For example, the zlib license only mentions "authors", while the MIT
license mentions "AUTHORS OR COPYRIGHT HOLDERS".

Given that this is very specialist code, it's going to be hard to find someone able to do a clean-room implementation that could be under the zlib license. Is that really the only path ahead?

On 2019-09-07 10:25:50 +0000, Ozkan Sezer wrote:

Are there any licensing issues with the SDL-1.2 versions of the patches?

On 2019-09-10 14:41:17 +0000, Ben Avison wrote:

Unfortunately, the same MIT-licensed code is present in both the SDL1.2 and SDL2.0 branches.

However, if you're able to apply just the ARMv6-compatible patches, I have been able to relicense all of those to zlib. However, as the benchmarks show, you'll still be missing most of the speed boost if you do that (apart from on 1st-generation Raspberry Pis or Pi Zeros, both of which have an ARMv6 CPU).

On 2019-09-10 16:37:26 +0000, Ozkan Sezer wrote:

(In reply to bavison from comment # 56)

Unfortunately, the same MIT-licensed code is present
in both the SDL1.2 and SDL2.0 branches.

Unlike SDL-2.x, SDL-1.2 is LGPL, so your patches might
be OK in SDL-1.2. (Not a lawyer, hence I'm asking.)

On 2019-09-10 18:35:55 +0000, Sam Lantinga wrote:

Let's go ahead and apply the patches for 2.0.11, with the MIT code clearly delineated, and afterward see if we can do a clean-room implementation, or disable compiling that code by default.

Companies that accept the zlib license also generally accept the MIT license, so worst case we can just identify the code for lawyers to decide whether they want to accept it.

On 2019-09-10 19:21:44 +0000, Ben Avison wrote:

OK. Do you need me to split pixman-arm-neon-asm.S into two so that each half only has one license header, or can you handle that when you merge it?

FWIW, I had a few thoughts on clean-rooming, in case it comes to that:

The main bit of code that would need a clean-room reimplementation is the NEON version of the macro generate_composite_function. I've had so much exposure to it with my work on Pixman that I'm not really in a position to do it myself. However, I wrote a roughly analogous macro for ARMv6 which might be a useful guide, as I was able to relicense that; the main difference is that ARMv6 has to use its main register bank for pixel data manipulation as well as for holding pointers, counters etc, so everything is much tighter for space.

Basically, what these macros do is handle all the actions that are common to different blit routines - looping over the rectangle, loading and storing data (whenever possible, with cacheline alignment at the destination image), deinterleaving and reinterleaving colour components of 32bpp pixels, and issuing prefetch instructions.

The prefetch part of the functionality is the part that's most likely to get into legal trouble, since it has a very distinct and not necessarily obvious algorithm. However, software prefetch's greatest benefit was for relatively low-end ARMv7 processors like Cortex-A8, A9 or A7 (and pre-NEON ones too, but I'm discounting them because I'm talking about NEON code here). It's less of an issue with the latest CPUs, so the easiest approach might just to be to abandon any effort to do software prefetch.

One thing in our favour is that generate_composite_function is way more general than is needed for the two SDL functions I used it for - for example, they both only have one source image, and the destination image is 32bpp in both cases.

On 2019-09-20 20:47:39 +0000, Ryan C. Gordon wrote:

We're changing how we do SDL release versions; now releases will be even numbers (2.0.10, 2.0.12, etc), and as soon as we tag a release, we'll move the internal version number to an odd number (2.0.12 ships, we tag the latest in revision control as 2.0.13 immediately, which will become 2.0.14 on release, etc).

As such, I'm moving the bugs tagged with target-2.0.11 to target 2.0.12. Sorry if you get a lot of email from this change!

Thanks,
--ryan.

On 2019-09-20 20:48:39 +0000, Ryan C. Gordon wrote:

We're changing how we do SDL release versions; now releases will be even numbers (2.0.10, 2.0.12, etc), and as soon as we tag a release, we'll move the internal version number to an odd number (2.0.12 ships, we tag the latest in revision control as 2.0.13 immediately, which will become 2.0.14 on release, etc).

As such, I'm moving the bugs tagged with target-2.0.11 to target 2.0.12. Sorry if you get a lot of email from this change!

Thanks,
--ryan.

On 2019-09-24 10:31:33 +0000, Sylvain wrote:

Maybe we could pixman directly instead ? Would it be ok with the license?

I mean modifying the software renderer to use optionnally pixman transformation routines.

On 2019-09-24 13:33:58 +0000, Ben Avison wrote:

Unfortunately, the per-pixel calculations differ in Pixman, meaning the results would be incompatible with what existing SDL clients expect if we were to attempt to link with a Pixman library binary.

Specifically, Pixman consistently always uses "premultiplied alpha" colour components where maximum saturation of any given colour channel is achieved when its value matches the corresponding pixel of the alpha channel. With SDL, maximum saturation of of a colour channel is represented by 0xFF (or correspondingly less when fewer than 8 bits are used to store it) even when the alpha channel has a lower value.

The other difference I recall is that SDL blend functions leave the alpha channel of the destination image unchanged, whereas Pixman would blend in an appropriate fraction of the source image's alpha channel, in the same manner as the RGB channels. This cannot be changed without breaking compatibility with existing SDL clients, as I discovered during implementation.

The code I borrowed from Pixman forms a common wrapper around its many rectangle processing optimisations, and ensures that loads, stores and colour component packing and unpacking are performed as efficiently as possible. It's probably at least as important to the speedups as is the SIMD maths that is performed by the code I wrote specifically to implement SDL's blending rules.

I have to say that I think it's highly unlikely that Pixman would accept SDL-compatible blit routines into their library: they're very strict about the algorithms they use. This is why the only option I was left with was to copy their macro definitions across into SDL.

On 2019-10-18 04:55:37 +0000, Ryan C. Gordon wrote:

I just want to make sure: is the license incompatibility a non-starter for us here? We need to decide if we're going to close this bug or work on merging these patches in.

We should assume no one is ever going to rewrite that one file; it's a massive pile of assembly code.

Sam, do you want to revisit your previous decision on the matter, include it as-is, or punt this down the road to 2.0.13?

--ryan.

On 2019-10-19 20:03:15 +0000, Sam Lantinga wrote:

Yeah, I think it's still okay to include if we mark it appropriately with the license, and ifdef the code so it's easy to exclude if somebody wants.

Ben, can you split pixman-arm-neon-asm.S based on license, the way you described?

Thanks for all your work on this!

On 2019-10-21 17:35:08 +0000, Ben Avison wrote:

Created attachment 3995
[SDL1.2] ARM: Create configure option --enable-arm-simd to govern assembly optimizations

On 2019-10-21 17:36:40 +0000, Ben Avison wrote:

Created attachment 3996
[SDL1.2] ARM: SIMD assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-10-21 17:37:09 +0000, Ben Avison wrote:

Created attachment 3997
[SDL1.2] ARM: SIMD assembly optimization for function BlitARGBto565PixelAlpha

On 2019-10-21 17:37:47 +0000, Ben Avison wrote:

Created attachment 3998
[SDL1.2] SDL_blit: use a named enum for required hardware features bits in dispatch tables

On 2019-10-21 17:38:24 +0000, Ben Avison wrote:

Created attachment 3999
[SDL1.2] ARM: SIMD assembly optimization for BGR-to-RGB 32bpp normal blits

On 2019-10-21 17:38:51 +0000, Ben Avison wrote:

Created attachment 4000
[SDL1.2] ARM: assembly optimization for SDL_FillRect

On 2019-10-21 17:39:19 +0000, Ben Avison wrote:

Created attachment 4001
[SDL1.2] ARM: SIMD optimization for 4:4:4:4 to 8:8:8:8 normal blits

On 2019-10-21 17:39:49 +0000, Ben Avison wrote:

Created attachment 4002
[SDL1.2] ARM: Create configure option --enable-arm-neon to govern assembly optimizations

On 2019-10-21 17:40:20 +0000, Ben Avison wrote:

Created attachment 4003
[SDL1.2] ARM: NEON assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-10-21 17:40:41 +0000, Ben Avison wrote:

Created attachment 4004
[SDL1.2] ARM: NEON assembly optimization for function BlitARGBto565PixelAlpha

On 2019-10-21 17:41:12 +0000, Ben Avison wrote:

Created attachment 4005
[SDL1.2] ARM: NEON assembly optimization for SDL_FillRect

On 2019-10-21 17:44:57 +0000, Ben Avison wrote:

Created attachment 4006
[SDL2] ARM: Create configure option --enable-arm-simd to govern assembly optimizations

On 2019-10-21 17:45:24 +0000, Ben Avison wrote:

Created attachment 4007
[SDL2] ARM: SIMD assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-10-21 17:45:55 +0000, Ben Avison wrote:

Created attachment 4008
[SDL2] ARM: SIMD assembly optimization for function BlitARGBto565PixelAlpha

On 2019-10-21 17:46:21 +0000, Ben Avison wrote:

Created attachment 4009
[SDL2] SDL_blit: use a named enum for required hardware features bits in dispatch tables

On 2019-10-21 17:47:00 +0000, Ben Avison wrote:

Created attachment 4010
[SDL2] ARM: SIMD assembly optimization for BGR-to-RGB 32bpp normal blits

On 2019-10-21 17:47:27 +0000, Ben Avison wrote:

Created attachment 4011
[SDL2] ARM: assembly optimization for SDL_FillRect

On 2019-10-21 17:47:57 +0000, Ben Avison wrote:

Created attachment 4012
[SDL2] ARM: SIMD optimization for 4:4:4:4 to 8:8:8:8 normal blits

On 2019-10-21 17:48:27 +0000, Ben Avison wrote:

Created attachment 4013
[SDL2] ARM: Create configure option --enable-arm-neon to govern assembly optimizations

On 2019-10-21 17:48:57 +0000, Ben Avison wrote:

Created attachment 4014
[SDL2] ARM: NEON assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-10-21 17:49:24 +0000, Ben Avison wrote:

Created attachment 4015
[SDL2] ARM: NEON assembly optimization for function BlitARGBto565PixelAlpha

On 2019-10-21 17:49:50 +0000, Ben Avison wrote:

Created attachment 4016
[SDL2] ARM: NEON assembly optimization for SDL_FillRect

On 2019-10-21 17:53:19 +0000, Ben Avison wrote:

There you go - achieved by shuffling the offending lines from pixman-arm-neon-asm.S into pixman-arm-neon-asm.h, which was already 100% MIT-licensed. I've also rebased everything again so that it should all apply cleanly.

On 2019-10-25 01:44:09 +0000, Ryan C. Gordon wrote:

(In reply to Ben Avison from comment # 88)

There you go - achieved by shuffling the offending lines from
pixman-arm-neon-asm.S into pixman-arm-neon-asm.h, which was already 100%
MIT-licensed. I've also rebased everything again so that it should all apply
cleanly.

Thanks!

These are all in revision control now:

https://hg.libsdl.org/SDL/rev/de3493dacdaf
https://hg.libsdl.org/SDL/rev/924cc078b2a0
https://hg.libsdl.org/SDL/rev/f84bb6a0ac86
https://hg.libsdl.org/SDL/rev/5b1f186e43f4
https://hg.libsdl.org/SDL/rev/0259428bf4cf
https://hg.libsdl.org/SDL/rev/bd6e73345614
https://hg.libsdl.org/SDL/rev/eab5b002d2ab
https://hg.libsdl.org/SDL/rev/60b1290fe381
https://hg.libsdl.org/SDL/rev/f48e492e4ba7
https://hg.libsdl.org/SDL/rev/1d8fafac75cc
https://hg.libsdl.org/SDL/rev/094ee379b862

Plus one from me to warn about MIT-license:

https://hg.libsdl.org/SDL/rev/398670c70a02

I'll look at CMake support real fast and then I think this bug is done for SDL2.

Ozkan: you want me to deal with SDL 1.2, or do you want to do it?

--ryan.

On 2019-10-25 03:24:51 +0000, Ryan C. Gordon wrote:

CMake is wired up in https://hg.libsdl.org/SDL/rev/9bf7f9ba2bf4

All that's left is SDL 1.2, so I'm taking the target-2.0.12 keyword off this bug now.

--ryan.

On 2019-10-25 07:12:26 +0000, Ozkan Sezer wrote:

Ozkan: you want me to deal with SDL 1.2, or do you want to do it?

The new SDL-1.2 versions of the patches do not apply. (The older
ones still apply though.) If Ben re-diffs them, I can deal with
SDL-1.2.

On 2019-10-25 14:10:07 +0000, Sylvain wrote:

Not sure why, but SDL_cpuinfo refuses to compile on the android buildbot,
(See https://buildbot.libsdl.org/waterfall)

Maybe a matter of updating the NDK, (it compiles on newer I think)

In file included from ./src/cpuinfo/SDL_cpuinfo.c:335:
/android-ndk-r13b/platforms/android-16/arch-arm/usr/include/elf.h:50:5: error: expected identifier
AT_HWCAP,
^
./src/cpuinfo/SDL_cpuinfo.c:68:18: note: expanded from macro 'AT_HWCAP'
#define AT_HWCAP 16
^
1 error generated.
make: *** [build/android/obj/local/armeabi-v7a/objs/SDL2/src/cpuinfo/SDL_cpuinfo.o] Error 1

On 2019-10-27 13:56:54 +0000, Sylvain wrote:

Fixed it in https://hg.libsdl.org/SDL/rev/8974226b4075

linux is defined when compiling android-ndk, so it includes <elf.h>
which try to create an enum like:

#if !defined(AT_NULL)
/* these definitions are missing from the BSD sources */
enum {
AT_NULL = 0,
...
AT_PLATFORM,
AT_HWCAP,
AT_CLKTCK,

AT_SECURE = 23

};
#endif

(This is not the case in newer ndk because elf.h is different)

On 2019-10-29 15:16:42 +0000, Sylvain wrote:

I think also there is an issue in SDL_FillRects for arm simd/neon, it needs to set the fill_function. Not tested ...
https://hg.libsdl.org/SDL/rev/38141e83d0fa

On 2019-10-30 15:44:41 +0000, Ben Avison wrote:

Created attachment 4026
[SDL1.2] ARM: Create configure option --enable-arm-simd to govern assembly optimizations

On 2019-10-30 15:45:32 +0000, Ben Avison wrote:

Created attachment 4027
[SDL1.2] ARM: SIMD assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-10-30 15:46:25 +0000, Ben Avison wrote:

Created attachment 4028
[SDL1.2] ARM: SIMD assembly optimization for function BlitARGBto565PixelAlpha

On 2019-10-30 15:46:59 +0000, Ben Avison wrote:

Created attachment 4029
[SDL1.2] SDL_blit: use a named enum for required hardware features bits in dispatch tables

On 2019-10-30 15:47:32 +0000, Ben Avison wrote:

Created attachment 4030
[SDL1.2] ARM: SIMD assembly optimization for BGR-to-RGB 32bpp normal blits

On 2019-10-30 15:48:08 +0000, Ben Avison wrote:

Created attachment 4031
[SDL1.2] ARM: assembly optimization for SDL_FillRect

On 2019-10-30 15:48:36 +0000, Ben Avison wrote:

Created attachment 4032
[SDL1.2] ARM: SIMD optimization for 4:4:4:4 to 8:8:8:8 normal blits

On 2019-10-30 15:49:09 +0000, Ben Avison wrote:

Created attachment 4033
[SDL1.2] ARM: Create configure option --enable-arm-neon to govern assembly optimizations

On 2019-10-30 15:49:45 +0000, Ben Avison wrote:

Created attachment 4034
[SDL1.2] ARM: NEON assembly optimization for function BlitRGBtoRGBPixelAlpha

On 2019-10-30 15:50:14 +0000, Ben Avison wrote:

Created attachment 4035
[SDL1.2] ARM: NEON assembly optimization for function BlitARGBto565PixelAlpha

On 2019-10-30 15:50:46 +0000, Ben Avison wrote:

Created attachment 4036
[SDL1.2] ARM: NEON assembly optimization for SDL_FillRect

On 2019-10-30 16:07:35 +0000, Ben Avison wrote:

@ozkan I can't see any changes on the SDL1.2 branch that should have interacted with my changes since I last rebased my patch series. However, it's quite possible I fluffed it somehow while generating them, so I've had another go.

I also have a copy of the changes on a branch on my mirror at https://github.com/bavison/SDL/commits/arm-blit-1.2 if it's any easier to get the patches from there. It's certainly quite laborious for me to upload them to Bugzilla!

I haven't included a commit to reflect the changes to configure or test/configure that happen when I run autogen.sh, partly because I don't have the same version of Autoconf installed so there are loads of unrelated changes that drown out the ones relating to the new command line options. Is that the problem?

I don't think any of the other corrections that have been made on the SDL2 branch need to be backported to SDL1.2, with the possible exception of the Android build fix, but I could be wrong!

On 2019-10-31 11:05:15 +0000, Ozkan Sezer wrote:

(In reply to Ben Avison from comment # 106)

@ozkan I can't see any changes on the SDL1.2 branch that should have
interacted with my changes since I last rebased my patch series. However,
it's quite possible I fluffed it somehow while generating them, so I've had
another go.
[...]
I don't think any of the other corrections that have been made on the SDL2
branch need to be backported to SDL1.2, with the possible exception of the
Android build fix, but I could be wrong!

The first patch had mistakenly removed CheckAltivecApple() from
configure.in. Corrected locally and applied all of the SDL-1.2
patches. Thanks. Will look at the Android build fix, soon.

http://hg.libsdl.org/SDL/rev/3d6dc20a0974
http://hg.libsdl.org/SDL/rev/db77604e083d
http://hg.libsdl.org/SDL/rev/78d5c2b67346
http://hg.libsdl.org/SDL/rev/84aa936d90e4
http://hg.libsdl.org/SDL/rev/eb0ed9be8a68
http://hg.libsdl.org/SDL/rev/d5d5686855d3
http://hg.libsdl.org/SDL/rev/8a6c0f0319d4
http://hg.libsdl.org/SDL/rev/4f88e197acad
http://hg.libsdl.org/SDL/rev/0ae1ddca5e85
http://hg.libsdl.org/SDL/rev/3705e81df6ff
http://hg.libsdl.org/SDL/rev/c8562ecca3c9

Regenerated configure:
http://hg.libsdl.org/SDL/rev/4ceb979e228a

Changelog update:
http://hg.libsdl.org/SDL/rev/9ed6ca274a7e

On 2019-10-31 11:29:34 +0000, Ben Avison wrote:

Thanks everyone for helping to get this over the line!

On 2019-10-31 11:53:03 +0000, Ozkan Sezer wrote:

OK. Closing this as resolved/fixed.

On 2020-02-27 17:32:57 +0000, Sam Lantinga wrote:

I temporarily disabled the NEON optimizations. ABGR -> ARGB blit yields an empty surface, and this needs to be looked at before they're re-enabled.
https://hg.libsdl.org/SDL/rev/3d49344c6988

On 2020-05-07 23:28:12 +0000, Brad Smith wrote:

The code added also does not build with Clang's assembler.

cc -O2 -pipe -I/usr/local/include -DUSING_GENERATED_CONFIG_H -Iinclude -I/usr/obj/ports/sdl2-2.0.12/SDL2-2.0.12/include -idirafter /usr/obj/ports/sdl2-2.0.12/SDL2-2.0.12/src/video/khronos -Wall -fno-strict-aliasing -fvisibility=hidden -Wdeclaration-after-statement -Werror=declaration-after-statement -I/usr/X11R6/include -DHAVE_USBHID_H -DUSBHID_UCR_DATA -DUSBHID_NEW -D_REENTRANT -MMD -MT build/pixman-arm-simd-asm.lo -c /usr/obj/ports/sdl2-2.0.12/SDL2-2.0.12/src/video/arm/pixman-arm-simd-asm.S -fPIC -DPIC -o build/.libs/pixman-arm-simd-asm.o
:1:1: error: unknown directive
.func fname
^
/usr/obj/ports/sdl2-2.0.12/SDL2-2.0.12/src/video/arm/pixman-arm-simd-asm.h:599:5: note: while in macro instantiation
pixman_asm_function fname
^
:3:1: note: while in macro instantiation
...
:345:5: error: unknown directive
.endfunc
^
:3:1: note: while in macro instantiation

I see FFmpeg tests the assembler with a check like..

 check_as as_func ".func test
                   .endfunc"

#if HAVE_AS_FUNC

define FUNC

#else

define FUNC @

#endif

FUNC .func EXTERN_ASM\name
FUNC .endfunc

On 2020-05-12 16:36:30 +0000, Dan Lawrence wrote:

Not sure if this is the right place to address this or if I should open a new bug, but it seems as though the --enable-arm-simd optimisations are also producing incorrect results on my raspberry Pi 4.

If I use the default configuration of SDL 2.0.12 (or SDL 2.0.9) (which has ARM SIMD enabled) then alpha blitting functions incorrectly in the alpha channel.

Essentially it just always copies the destination surface's alpha over to the final surface rather than doing the proper:

dstA = srcA + (dstA * (1-srcA))

Calculation. Essentially right now we are getting:

dstA = dstA

If you disable the simd extensions with:

../configure --disable-arm-simd

Then this error disappears and the Pi4 works the same as the other platforms.

I originally noticed this bug while using pygame and have an issue tracking the problem on the pygame GitHub here:

pygame/pygame#1722

Here is a C++ reproduction case:

#define SDL_MAIN_HANDLED
#include "SDL/SDL.h"
#include <stdio.h>
#include
#include

int main(int argc, char* argv[]) {

SDL_Window *window;
SDL_SetMainReady();
SDL_Init(SDL_INIT_VIDEO);

window = SDL_CreateWindow(
    "Pi4 Alpha blending bug",                  
    SDL_WINDOWPOS_UNDEFINED,           
    SDL_WINDOWPOS_UNDEFINED,           
    640,                              
    480,                               
    0          // flags 
);

if (window == NULL) {
    printf("Could not create window: %s\n", SDL_GetError());
    return 1;
}
int width = 256;
int height = 256;
Uint32 rmask, gmask, bmask, amask;

#if SDL_BYTEORDER == SDL_BIG_ENDIAN
rmask = 0xff000000;
gmask = 0x00ff0000;
bmask = 0x0000ff00;
amask = 0x000000ff;
#else
rmask = 0x000000ff;
gmask = 0x0000ff00;
bmask = 0x00ff0000;
amask = 0xff000000;
#endif

SDL_Surface* src = SDL_CreateRGBSurface(0, width, height, 32,
                        rmask, gmask, bmask, amask);
SDL_FillRect(src, NULL, SDL_MapRGBA(src->format, 255, 255, 255, 255));

SDL_Surface* dst = SDL_CreateRGBSurface(0, width, height, 32,
                        rmask, gmask, bmask, amask);
SDL_FillRect(dst, NULL, SDL_MapRGBA(src->format, 0, 0, 0, 128));

SDL_BlitSurface(src, NULL, dst, NULL);

int x = 128;
int y = 128; 
Uint8 rgba[4] = {0, 0, 0, 0};
SDL_PixelFormat *format = dst->format;

Uint8 * pixels = (Uint8 *)dst->pixels;
Uint32 color = *((Uint32 *)(pixels + y * dst->pitch) + x);
SDL_GetRGBA(color, format, rgba, rgba + 1, rgba + 2, rgba + 3);

std::cout << SDL_GetPixelFormatName(format->format) << std::endl;
std::cout << std::hex << color << std::endl;

// Close and destroy the window
SDL_DestroyWindow(window);

// Clean up
SDL_Quit();
return 0;

}

Output with default configuration on Windows:

SDL_PIXELFORMAT_ABGR8888
ffffffff

Output with --disable-arm-simd:

SDL_PIXELFORMAT_ABGR8888
ffffffff

Output with default SDL2 configuration:

SDL_PIXELFORMAT_ABGR8888
80ffffff

Hope that is enough information. I'll keep tracking the bug on the pygame GitHub as well.

Cheers!

On 2020-06-08 20:30:42 +0000, Sam James wrote:

I encountered some test failures with pygame recently, and after a hint from one of their developers, found this bug. All of the failures were resolved on ARMv7 (Raspberry Pi 2) by building libsdl2 with --disable-arm-simd.

References:

On 2020-06-09 00:12:04 +0000, Sam Lantinga wrote:

With the correctness problem and failure to compile on clang, I'm inclined to disable this for 2.0.14.

What do you think, Ryan?

On 2020-06-09 19:34:37 +0000, Ben Avison wrote:

Sorry I haven't been able to find time to investigate this recently, folks. Can anyone confirm which branches are affected?

On 2020-06-10 18:30:03 +0000, Sam James wrote:

(In reply to Ben Avison from comment # 115)

Sorry I haven't been able to find time to investigate this recently, folks.
Can anyone confirm which branches are affected?

No worries! I've reproduced this on 2.0.12. I can try others if needed?

On 2020-06-27 01:56:59 +0000, Ryan C. Gordon wrote:

(In reply to Sam Lantinga from comment # 114)

With the correctness problem and failure to compile on clang, I'm inclined
to disable this for 2.0.14.

What do you think, Ryan?

Yeah, let's disable this and bump to target-2.0.16 for now.

--ryan.

On 2020-06-27 03:40:00 +0000, Ryan C. Gordon wrote:

Yeah, let's disable this and bump to target-2.0.16 for now.

Disabled in https://hg.libsdl.org/SDL/rev/d86a8168fac9

--ryan.

On 2020-06-27 08:47:53 +0000, Ozkan Sezer wrote:

I have no way of testing SDL-1.2 with arm myself, so I disabled it in
the SDL-1.2 branch too by default for now, to be safe:
https://hg.libsdl.org/SDL/rev/05db7c08aa56

@SDLBugzilla SDLBugzilla added the enhancement New feature or request label Feb 10, 2021
@bavison
Copy link

bavison commented Jun 30, 2021

I think I can see what's happened. In SDL 2.0, the way BlitRGBtoRGBPixelAlpha() handled the alpha channel changed in an incompatible way. To be precise, in this commit:

libsdl-org/SDL@89bc80f

There is no equivalent commit in SDL 1.2. My ARM assembly optimisations - both the SIMD and NEON versions - faithfully copied the SDL 1.2 behaviour (this was, after all, my primary target at the time). It looks like nobody, including myself, noticed the difference between the two branches.

First conclusion: I think the code can safely be re-enabled on SDL 1.2.

Having looked more closely at the SDL 2.0 code, I think the new handling of alpha is actually incorrect, according to its own specification. It's desirable that every colour component is treated identically (not least because it makes SIMD processing easier). Look at the least-significant byte of d1/s1 (blue) - we can ignore the AND mask because the intermediate value can't overflow for any combination of input values - we can rearrange

d1 + ((s1 - d1) * alpha >> 8)

to

(s1 * alpha + d1 * 0x100 - d1 * alpha) >> 8

If you substitute s1 with alpha and d1 with dalpha, this gives

(alpha * alpha + dalpha * 0x100 - dalpha * alpha) >> 8

By contrast, we can rearrange

alpha + (dalpha * (alpha ^ 0xFF) >> 8)

to

(0x100 * alpha + dalpha * 0xFF - dalpha * alpha) >> 8

These are not equivalent, except (almost) when alpha == 0xFF - but that's already been special-cased a few lines above!

Some worked examples:

source 0x0f0f0f0f, destination 0x00000000, SDL1.2 result 0x00000000, SDL2.0 result 0x0f000000

source 0x0f0f0f0f, destination 0xffffffff, SDL1.2 result 0xfff0f0f0, SDL2.0 result 0xfef0f0f0

Ideally, I'd have said that we should be aiming for every colour component to be treated the same, which would result in 0x00000000 and 0xf0f0f0f0 respectively for these examples. (There's also an argument that some rounding should be done rather than simple truncation of the 16-bit intermediate product, but I won't go into that now.)

If I were to re-work the SDL2.0 assembly, would it be acceptable to treat all components the same in this way? I wouldn't have to revisit it yet again if the equation is changed a year or two down the line. I can change BlitRGBtoRGBPixelAlpha()to match this behaviour easily enough, but would need some assistance from someone familiar with MMX and 3dNOW to make those cases match.

@sezero
Copy link
Collaborator

sezero commented Jun 30, 2021

Created a new issue from your latest comment in the main SDL repo as libsdl-org/SDL#4484

@sezero
Copy link
Collaborator

sezero commented Jun 30, 2021

Re-enabled ARM SIMD and NEON asm blitters as of cb261c8
Closing this one. SDL2 discussion should continue at libsdl-org/SDL#4484

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants