We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 3816

Summary: asm code in video/SDL_stretch.c
Product: SDL Reporter: Ozkan Sezer <sezeroz>
Component: videoAssignee: Sam Lantinga <slouken>
Status: RESOLVED FIXED QA Contact: Sam Lantinga <slouken>
Severity: normal    
Priority: P2 CC: sylvain.becker
Version: HG 2.0   
Hardware: x86   
OS: All   
Attachments: test-case
test image
test log
test-case (fixed)
i686 test log
updated patch
updated patch - v3
updated patch -- v3
test-case
test-case
test log 64 bits
new SDL_stretch.c

Description Ozkan Sezer 2017-09-10 18:47:39 UTC
Below is a test patch for SDL_stretch.c, which adds OS-specific memory
calls (win32, os/2: where mprotect() isn't available) for the asm code.
It's compile-tested only, not run-tested (at least not yet).

Questions:
- Has the asm code path ever been tested with any newer gcc than 4.4.1,
 i.e. is the failure reason known?
- Asm code is disabled not just for gcc but for other compilers, too:
 has any failure been observed for them too?

----

diff --git a/src/video/SDL_stretch.c b/src/video/SDL_stretch.c
--- a/src/video/SDL_stretch.c
+++ b/src/video/SDL_stretch.c
@@ -43,7 +43,14 @@
 
 #ifdef USE_ASM_STRETCH
 
-#ifdef HAVE_MPROTECT
+#ifdef __WIN32__
+#define WIN32_LEAN_AND_MEAN
+#include <windows.h>
+#elif defined (__OS2__)
+#define INCL_DOSMEMMGR
+#define INCL_DOSERRORS
+#include <os2.h>
+#elif defined(HAVE_MPROTECT)
 #include <sys/types.h>
 #include <sys/mman.h>
 #endif
@@ -77,6 +84,9 @@ generate_rowbytes(int src_w, int dst_w, 
         int status;
     } last;
 
+#ifdef  __WIN32__
+    DWORD oldprot;
+#endif
     int i;
     int pos, inc;
     unsigned char *eip, *fence;
@@ -104,8 +114,16 @@ generate_rowbytes(int src_w, int dst_w, 
     default:
         return SDL_SetError("ASM stretch of %d bytes isn't supported", bpp);
     }
-#ifdef HAVE_MPROTECT
     /* Make the code writeable */
+#ifdef __WIN32__
+    if (!VirtualProtect(copy_row, sizeof(copy_row), PAGE_READWRITE, &oldprot)) {
+        return SDL_SetError("Couldn't make copy buffer writeable");
+    }
+#elif defined (__OS2__)
+    if (DosSetMem(copy_row, sizeof(copy_row), PAG_READ | PAG_WRITE) != NO_ERROR) {
+        return SDL_SetError("Couldn't make copy buffer writeable");
+    }
+#elif defined(HAVE_MPROTECT)
     if (mprotect(copy_row, sizeof(copy_row), PROT_READ | PROT_WRITE) < 0) {
         return SDL_SetError("Couldn't make copy buffer writeable");
     }
@@ -136,8 +154,16 @@ generate_rowbytes(int src_w, int dst_w, 
     }
     *eip++ = RETURN;
 
-#ifdef HAVE_MPROTECT
     /* Make the code executable but not writeable */
+#ifdef __WIN32__
+    if (!VirtualProtect(copy_row, sizeof(copy_row), PAGE_EXECUTE_READ, &oldprot)) {
+        return SDL_SetError("Couldn't make copy buffer executable");
+    }
+#elif defined (__OS2__)
+    if (DosSetMem(copy_row, sizeof(copy_row), PAG_READ | PAG_EXECUTE) != NO_ERROR) {
+        return SDL_SetError("Couldn't make copy buffer executable");
+    }
+#elif defined(HAVE_MPROTECT)
     if (mprotect(copy_row, sizeof(copy_row), PROT_READ | PROT_EXEC) < 0) {
         return SDL_SetError("Couldn't make copy buffer executable");
     }
Comment 1 Ozkan Sezer 2021-01-26 20:52:29 UTC
Sylvain: your commit https://hg.libsdl.org/SDL/rev/a304d0bbe458
enabled the stretch asm code for compilers other than gcc.  See
the patch in comment #0.
Comment 2 Sylvain 2021-01-26 21:52:09 UTC
Hey Ozkan,

I re-enabled as I tested with gcc 4.6, gcc 6 and gcc 10.2, and clang 
with HAVE_MPROTECT

compiling with -m32 and using multilib package on 64bits
not sure about others OS ..

checking with a basic test-case

(it was disabled here: https://hg.libsdl.org/SDL/rev/8ae607392409)
Comment 3 Ozkan Sezer 2021-01-27 01:16:24 UTC
Can you post your testcase?
Comment 4 Sylvain 2021-01-27 07:10:41 UTC
Created attachment 4711 [details]
test-case

test-case
Comment 5 Sylvain 2021-01-27 07:11:18 UTC
Created attachment 4712 [details]
test image

test image
Comment 6 Sylvain 2021-01-27 07:11:54 UTC
Created attachment 4713 [details]
test log

test log
Comment 7 Sylvain 2021-01-27 07:13:30 UTC
I've double-checked and, it work with USE_ASM + MPROTECT (on i386 linux/multilib), but
If I manually un-define HAVE_MPROTECT, then it crashes !
Comment 8 Sylvain 2021-01-27 08:35:33 UTC
So, disable asm path if mprotect isn't available
https://hg.libsdl.org/SDL/rev/9eb1c95cd4d6
Comment 9 Ozkan Sezer 2021-01-27 09:49:25 UTC
Created attachment 4714 [details]
test-case (fixed)

Fixed testcase (fixes type of crc local var)
Comment 10 Ozkan Sezer 2021-01-27 09:55:19 UTC
Created attachment 4715 [details]
i686 test log

Ran this on my i686 CentOS-6.10 (Intel Core2 Duo E8400 @ 3.00GHz)
using gcc4.4.7:  Every line in the logs stay the same, except for
delta time of course:

Without USE_ASM_STRETCH (because you disable it for gcc < 4.6),
delta time is usually around 420 ms

With USE_ASM_STRETCH enabled, delta time is around ~90 ms.

My gcc is gcc4.4.7 of RedHat. Maybe the issue with gcc4.4.1 is
fixed later in newer versions?

Sam?
Comment 11 Ozkan Sezer 2021-01-27 11:58:14 UTC
Made a quick win32 build using gcc-4.5.4, seems to work:

Every crc is the same with asm disabled/enabled, delta-time
is reduced to from ~151 ms to ~51 ms.
Comment 12 Ozkan Sezer 2021-01-27 12:26:08 UTC
Created attachment 4716 [details]
updated patch
Comment 13 Ozkan Sezer 2021-01-27 12:29:30 UTC
Built using VS2017 for i686 with the attached updated patch applied:

CRCs differ againsts MinGW-built dll. However, with asm disabled or
enabled, CRCS stay the same for MSVC-built dlls. Delta time reduces
to ~51 ms from ~115 ms.
Comment 14 Ozkan Sezer 2021-01-27 13:27:52 UTC
OS/2 builds using Watcom without asm matches all the CRCs from
the linux and windows gcc builds.  But the asm-enabled builds
crash (possibly for lack of any align directive in C code) ??
I won't go after that one and leave watcom alone w/o asm..
Comment 15 Ozkan Sezer 2021-01-27 18:34:24 UTC
Created attachment 4717 [details]
updated patch - v3

- adds MSVC __declspec(align(x)) support,
- disables asm if PAGE_ALIGNED no macro is defined,
- still disables asm for gcc < 4.6, need more info,
- drops Watcom support.
Comment 16 Ozkan Sezer 2021-01-27 18:41:55 UTC
Created attachment 4718 [details]
updated patch -- v3

Correct version of patch attached this time:

- adds MSVC __declspec(align(x)) support,
- disables asm if PAGE_ALIGNED no macro is defined,
- still disables asm for gcc < 4.6, need more info,
- drops Watcom support.
Comment 17 Sylvain 2021-01-27 19:28:30 UTC
Created attachment 4719 [details]
test-case

I have update the test case so that it allows try with a 24 bpp image
Comment 18 Sylvain 2021-01-27 19:32:49 UTC
Created attachment 4720 [details]
test-case

Updated again with your modif !
Comment 19 Sylvain 2021-01-27 19:33:44 UTC
Created attachment 4721 [details]
test log 64 bits

my test log on 64 bits
Comment 20 Sylvain 2021-01-27 19:37:24 UTC
Created attachment 4722 [details]
new SDL_stretch.c

I propose this new version for SDL_stretch.c
that drops mprotect and asm

Code is similar to the StretchLinear, but the steps computation are kept similar to the nearest. 
so that:
- it's pixel perfect with nearest
- as fast as asm I think
- no asm, nor mprotect
- benefit for all archicture
Comment 21 Sam Lantinga 2021-01-27 19:39:43 UTC
Ozkan, I tweaked your patch and applied it:
https://hg.libsdl.org/SDL/rev/f32946bbe1e6
Comment 22 Sam Lantinga 2021-01-27 19:41:39 UTC
Sylvain, can you confirm the performance of your code relative to what I just checked in?
Comment 23 Sylvain 2021-01-27 20:11:42 UTC
on i386 ( gcc-multilib)
 
I see the current head version with asm/mprotect gives:
60 ms for 100 SDL_Stretch() 
without asm: 149 ms (gcc 10) or 130ms  (gcc 4.6)

whereas my C version gives 73 ms.
Comment 24 Sylvain 2021-01-27 20:15:16 UTC
on x84, there is no asm, it gives 215 ms.

whereas my c version is 60 ms
Comment 25 Sylvain 2021-01-27 20:16:26 UTC
... with clang,  with gcc: 75 ms
Comment 26 Ozkan Sezer 2021-01-27 20:23:27 UTC
(In reply to Sylvain from comment #23)
> on i386 ( gcc-multilib)
>  
> I see the current head version with asm/mprotect gives:
> 60 ms for 100 SDL_Stretch() 
> without asm: 149 ms (gcc 10) or 130ms  (gcc 4.6)
> 
> whereas my C version gives 73 ms.

Your new C version should be good: 60 <-> 73 ms should be
negligible, no?
Comment 27 Sam Lantinga 2021-01-27 20:50:52 UTC
All right! This code is in, thanks Sylvain!
https://hg.libsdl.org/SDL/rev/e39af0d27669
Comment 28 Ozkan Sezer 2021-01-27 21:13:53 UTC
Sylvain:  Applied this:  https://hg.libsdl.org/SDL/rev/0645e1a9812e
Please confirm that it's correct.

About new code:  *(Uint16*)dst = *(Uint16*)s_00_01
in scale_mat_nearest_2() won't cause any bus errors
because of possible unaligned access, won't it?
Comment 29 Ozkan Sezer 2021-01-27 21:23:37 UTC
(In reply to Ozkan Sezer from comment #28)
> About new code:  *(Uint16*)dst = *(Uint16*)s_00_01
> in scale_mat_nearest_2() won't cause any bus errors
> because of possible unaligned access, won't it?

Same question for scale_mat_nearest_4() .
Comment 30 Sylvain 2021-01-27 21:38:22 UTC
Indeed, I think https://hg.libsdl.org/SDL/rev/0645e1a9812e is correct, 
otherwise no code return, if no sse/neon


I don't think there would be bus error, because it always move by 'bpp' bytes, so it remains with the same alignment. (like the original copy_row of any image copying)


now, I hesitate between keeping the code same as the bilinear one, 
or just simplify it to the most basic step/increment & loop.
because, it would just apply to more scale loop I guess..
Comment 31 Sylvain 2021-01-28 10:47:06 UTC
some simplification: https://hg.libsdl.org/SDL/rev/1010371c31c4
Comment 32 Sylvain 2021-01-28 10:49:47 UTC
Changing the other function in bug #5510