We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 3921

Summary: Add some Fastpath to BlitNtoNKey and BlitNtoNKeyCopyAlpha
Product: SDL Reporter: Sylvain <sylvain.becker>
Component: *don't know*Assignee: Sam Lantinga <slouken>
Status: RESOLVED FIXED QA Contact: Sam Lantinga <slouken>
Severity: normal    
Priority: P2 CC: sezeroz
Version: don't know   
Hardware: All   
OS: All   
Attachments: test case
patch
Fast path using DUFFS_LOOP
patch

Description Sylvain 2017-10-26 10:57:48 UTC
When doing combination of surfaces with colorkey, some fastpath are obvious:
assuming identical pixel format for source and destination and also bpp=32.
loop are also vectorized (this only adds a small improvement factor).

But, in the end, with this patch, it's ~10x faster.

Also, a small patch for android so that if goes faster in arm. (on arm, that does not seem to be vectorized, but it's also 10x faster).
Comment 1 Sylvain 2017-10-26 10:58:28 UTC
Created attachment 3043 [details]
test case

test case
Comment 2 Sylvain 2017-10-26 10:59:02 UTC
Created attachment 3044 [details]
patch

patch
Comment 3 Ozkan Sezer 2018-09-27 07:12:42 UTC
What is the status of this patch?
Comment 4 Sylvain 2018-09-27 09:43:34 UTC
I have been using this patch in production for almost 1 year now
Comment 5 Sam Lantinga 2018-09-27 21:46:13 UTC
Created attachment 3319 [details]
Fast path using DUFFS_LOOP

I adjusted the patch to match the other code, which should be faster on non-vectorizing compilers. Can you check to make sure it works and is as fast as yours?
Comment 6 Ozkan Sezer 2018-09-27 22:07:00 UTC
You already applied your DUFFS_LOOP patch along with Sylvain's
warnings patch in https://hg.libsdl.org/SDL/rev/df7260f149f2 :)
Comment 7 Sam Lantinga 2018-09-27 22:17:21 UTC
Oops, well please let me know if it needs improvement. :)
Comment 8 Sylvain 2018-09-28 06:47:44 UTC
Hi, I'll try that next week and let you know the results!
Comment 9 Sam Lantinga 2018-09-28 07:41:58 UTC
Okay, thanks!
Comment 10 Sylvain 2018-10-01 20:05:21 UTC
Created attachment 3333 [details]
patch

I did various benches. with clang 6.0.0 on linux, and ndk-r16b on android (NDK_TOOLCHAIN_VERSION=clang).

- still see a x10 speed factor.
- with duff_loops, it does not use vectorisation (but doesn't seem to be a problem).

on linux my patch is already at full speed on -O2, whereas the duff_loops need -O3 (200 ms at -03, and 300ms at -02).

I realized that on Android, I had a slight variation which fits best.
both on linux with -O2 and -O3, and on android with 02/03 and armeabi-v7a/arm64.

Here's the patch.
Comment 11 Sam Lantinga 2018-10-01 21:45:08 UTC
Okay, your patch is in:
https://hg.libsdl.org/SDL/rev/736699aa224d

It may make sense to revisit the DUFFS_LOOP code. Maybe we can make a version that is a simple loop, depending on whether SDL_VECTORIZATION_FRIENDLY_LOOP is defined, or something like that. That should automatically improve all the blitting code using that.

I'm going to close this bug for now though. Feel free to open a new bug if you want to investigate that possibility.

Thanks!
Comment 12 Sylvain 2018-10-02 10:46:41 UTC
In fact, the big performance issue with blit functions is that there are per-pixel macros (SDL_blit.h) that starts by doing : 

switch (bpp) { ... }.

And bpp is not a constant but evaluated twice per pixel (read/write), so not optimized because it just could removed by the compiler. 

This is the reason why we can get a x10 faster implementation with the previous patches.

An improvement idea would be to turn SDL_Blit_Slow into a macro, and generate all the blits function from this seed macro with hard-coded constants.

They would look like : 

blit_function()
{
const int src_format = RGB888;
const int dst_format = ARGB888;
const int has_COPY_BLEND = 1;
const int has_COPY_MODULATE_COLOR = 1;
const int has_COPY_MODULATE_ALPHA = 1;
const int has_colorkey = 1;
...
CALL_SDL_BLIT_MACRO();
}

It may be a lot of function (nb_src_format * nb_src_format * has_colorkey * nb_blend_mode  * ... ). but that can be generated automatically.

With constants the compiler can optimise the READ/WRITE pixel function and the various steps of SDL_blit_slow() by dropping unsed steps.

That would a total replacement of SDL_blit_* stuff..