We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 3921 - Add some Fastpath to BlitNtoNKey and BlitNtoNKeyCopyAlpha
Summary: Add some Fastpath to BlitNtoNKey and BlitNtoNKeyCopyAlpha
Status: RESOLVED FIXED
Alias: None
Product: SDL
Classification: Unclassified
Component: *don't know* (show other bugs)
Version: don't know
Hardware: All All
: P2 normal
Assignee: Sam Lantinga
QA Contact: Sam Lantinga
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-26 10:57 UTC by Sylvain
Modified: 2018-10-02 10:46 UTC (History)
1 user (show)

See Also:


Attachments
test case (2.82 KB, text/x-csrc)
2017-10-26 10:58 UTC, Sylvain
Details
patch (6.17 KB, patch)
2017-10-26 10:59 UTC, Sylvain
Details | Diff
Fast path using DUFFS_LOOP (2.20 KB, patch)
2018-09-27 21:46 UTC, Sam Lantinga
Details | Diff
patch (908 bytes, patch)
2018-10-01 20:05 UTC, Sylvain
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sylvain 2017-10-26 10:57:48 UTC
When doing combination of surfaces with colorkey, some fastpath are obvious:
assuming identical pixel format for source and destination and also bpp=32.
loop are also vectorized (this only adds a small improvement factor).

But, in the end, with this patch, it's ~10x faster.

Also, a small patch for android so that if goes faster in arm. (on arm, that does not seem to be vectorized, but it's also 10x faster).
Comment 1 Sylvain 2017-10-26 10:58:28 UTC
Created attachment 3043 [details]
test case

test case
Comment 2 Sylvain 2017-10-26 10:59:02 UTC
Created attachment 3044 [details]
patch

patch
Comment 3 Ozkan Sezer 2018-09-27 07:12:42 UTC
What is the status of this patch?
Comment 4 Sylvain 2018-09-27 09:43:34 UTC
I have been using this patch in production for almost 1 year now
Comment 5 Sam Lantinga 2018-09-27 21:46:13 UTC
Created attachment 3319 [details]
Fast path using DUFFS_LOOP

I adjusted the patch to match the other code, which should be faster on non-vectorizing compilers. Can you check to make sure it works and is as fast as yours?
Comment 6 Ozkan Sezer 2018-09-27 22:07:00 UTC
You already applied your DUFFS_LOOP patch along with Sylvain's
warnings patch in https://hg.libsdl.org/SDL/rev/df7260f149f2 :)
Comment 7 Sam Lantinga 2018-09-27 22:17:21 UTC
Oops, well please let me know if it needs improvement. :)
Comment 8 Sylvain 2018-09-28 06:47:44 UTC
Hi, I'll try that next week and let you know the results!
Comment 9 Sam Lantinga 2018-09-28 07:41:58 UTC
Okay, thanks!
Comment 10 Sylvain 2018-10-01 20:05:21 UTC
Created attachment 3333 [details]
patch

I did various benches. with clang 6.0.0 on linux, and ndk-r16b on android (NDK_TOOLCHAIN_VERSION=clang).

- still see a x10 speed factor.
- with duff_loops, it does not use vectorisation (but doesn't seem to be a problem).

on linux my patch is already at full speed on -O2, whereas the duff_loops need -O3 (200 ms at -03, and 300ms at -02).

I realized that on Android, I had a slight variation which fits best.
both on linux with -O2 and -O3, and on android with 02/03 and armeabi-v7a/arm64.

Here's the patch.
Comment 11 Sam Lantinga 2018-10-01 21:45:08 UTC
Okay, your patch is in:
https://hg.libsdl.org/SDL/rev/736699aa224d

It may make sense to revisit the DUFFS_LOOP code. Maybe we can make a version that is a simple loop, depending on whether SDL_VECTORIZATION_FRIENDLY_LOOP is defined, or something like that. That should automatically improve all the blitting code using that.

I'm going to close this bug for now though. Feel free to open a new bug if you want to investigate that possibility.

Thanks!
Comment 12 Sylvain 2018-10-02 10:46:41 UTC
In fact, the big performance issue with blit functions is that there are per-pixel macros (SDL_blit.h) that starts by doing : 

switch (bpp) { ... }.

And bpp is not a constant but evaluated twice per pixel (read/write), so not optimized because it just could removed by the compiler. 

This is the reason why we can get a x10 faster implementation with the previous patches.

An improvement idea would be to turn SDL_Blit_Slow into a macro, and generate all the blits function from this seed macro with hard-coded constants.

They would look like : 

blit_function()
{
const int src_format = RGB888;
const int dst_format = ARGB888;
const int has_COPY_BLEND = 1;
const int has_COPY_MODULATE_COLOR = 1;
const int has_COPY_MODULATE_ALPHA = 1;
const int has_colorkey = 1;
...
CALL_SDL_BLIT_MACRO();
}

It may be a lot of function (nb_src_format * nb_src_format * has_colorkey * nb_blend_mode  * ... ). but that can be generated automatically.

With constants the compiler can optimise the READ/WRITE pixel function and the various steps of SDL_blit_slow() by dropping unsed steps.

That would a total replacement of SDL_blit_* stuff..