| Summary: | Add some Fastpath to BlitNtoNKey and BlitNtoNKeyCopyAlpha | ||
|---|---|---|---|
| Product: | SDL | Reporter: | Sylvain <sylvain.becker> |
| Component: | *don't know* | Assignee: | Sam Lantinga <slouken> |
| Status: | RESOLVED FIXED | QA Contact: | Sam Lantinga <slouken> |
| Severity: | normal | ||
| Priority: | P2 | CC: | sezeroz |
| Version: | don't know | ||
| Hardware: | All | ||
| OS: | All | ||
| Attachments: |
test case
patch Fast path using DUFFS_LOOP patch |
||
|
Description
Sylvain
2017-10-26 10:57:48 UTC
Created attachment 3043 [details]
test case
test case
Created attachment 3044 [details]
patch
patch
What is the status of this patch? I have been using this patch in production for almost 1 year now Created attachment 3319 [details]
Fast path using DUFFS_LOOP
I adjusted the patch to match the other code, which should be faster on non-vectorizing compilers. Can you check to make sure it works and is as fast as yours?
You already applied your DUFFS_LOOP patch along with Sylvain's warnings patch in https://hg.libsdl.org/SDL/rev/df7260f149f2 :) Oops, well please let me know if it needs improvement. :) Hi, I'll try that next week and let you know the results! Okay, thanks! Created attachment 3333 [details]
patch
I did various benches. with clang 6.0.0 on linux, and ndk-r16b on android (NDK_TOOLCHAIN_VERSION=clang).
- still see a x10 speed factor.
- with duff_loops, it does not use vectorisation (but doesn't seem to be a problem).
on linux my patch is already at full speed on -O2, whereas the duff_loops need -O3 (200 ms at -03, and 300ms at -02).
I realized that on Android, I had a slight variation which fits best.
both on linux with -O2 and -O3, and on android with 02/03 and armeabi-v7a/arm64.
Here's the patch.
Okay, your patch is in: https://hg.libsdl.org/SDL/rev/736699aa224d It may make sense to revisit the DUFFS_LOOP code. Maybe we can make a version that is a simple loop, depending on whether SDL_VECTORIZATION_FRIENDLY_LOOP is defined, or something like that. That should automatically improve all the blitting code using that. I'm going to close this bug for now though. Feel free to open a new bug if you want to investigate that possibility. Thanks! In fact, the big performance issue with blit functions is that there are per-pixel macros (SDL_blit.h) that starts by doing :
switch (bpp) { ... }.
And bpp is not a constant but evaluated twice per pixel (read/write), so not optimized because it just could removed by the compiler.
This is the reason why we can get a x10 faster implementation with the previous patches.
An improvement idea would be to turn SDL_Blit_Slow into a macro, and generate all the blits function from this seed macro with hard-coded constants.
They would look like :
blit_function()
{
const int src_format = RGB888;
const int dst_format = ARGB888;
const int has_COPY_BLEND = 1;
const int has_COPY_MODULATE_COLOR = 1;
const int has_COPY_MODULATE_ALPHA = 1;
const int has_colorkey = 1;
...
CALL_SDL_BLIT_MACRO();
}
It may be a lot of function (nb_src_format * nb_src_format * has_colorkey * nb_blend_mode * ... ). but that can be generated automatically.
With constants the compiler can optimise the READ/WRITE pixel function and the various steps of SDL_blit_slow() by dropping unsed steps.
That would a total replacement of SDL_blit_* stuff..
|