We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 4503

Summary: bad textures on bigendian cpu
Product: SDL Reporter: Dmitry Gapkalov <gapkalov>
Component: renderAssignee: Sam Lantinga <slouken>
Status: RESOLVED FIXED QA Contact: Sam Lantinga <slouken>
Severity: critical    
Priority: P2 CC: gapkalov, sylvain.becker
Version: 2.0.9   
Hardware: Other   
OS: Other   
Attachments: bad pic
image for testcase
test case
compare script
ref and new
compare script
ref2, new and result
test case
very small image for the testcase
compare script
test case for unaligned store
SDL_blit_N.c
results
result with large picture
SDL_blit_N.c
new result
SDL_blit_N.c
SDL_blit_N.c with intermediate aligned int on stack
results
result7

Description Dmitry Gapkalov 2019-02-14 02:11:51 UTC
Hello,
with latest SDL2 changes from Hg, when I using software rendering on the BigEndian CPU, textures(created from jpeg`s) on the screen displayed with wrong colors. It looks like if some color components(RGB) is exchanged.
Comment 1 Sylvain 2019-02-14 08:22:00 UTC
Probably my last commits, haven't tested with big endian.
which hardware do you use ?

Can you help to do some check ?

Can you try this patch:

diff -r 48e26b5d4f56 src/video/SDL_blit_N.c
--- a/src/video/SDL_blit_N.c	Sat Feb 09 17:40:32 2019 +0100
+++ b/src/video/SDL_blit_N.c	Thu Feb 14 09:19:57 2019 +0100
@@ -2155,7 +2155,11 @@
         int *_p0 , int *_p1, int *_p2, int *_p3, int *_alpha_channel)
 {
     int alpha_channel = 0, p0, p1, p2, p3;
+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
     int Pixel = 0x04030201; /* identity permutation */
+#else
+    int Pixel = 0x01020304; /* identity permutation */
+#endif
 
     if (srcfmt->Amask) {
         RGBA_FROM_PIXEL(Pixel, srcfmt, p0, p1, p2, p3);
Comment 2 Dmitry Gapkalov 2019-02-14 15:55:39 UTC
No, it doesn't help
Comment 3 Dmitry Gapkalov 2019-02-14 16:08:41 UTC
Created attachment 3611 [details]
bad pic
Comment 4 Sylvain 2019-02-14 17:36:16 UTC
Created attachment 3613 [details]
image for testcase

image for test case
Comment 5 Sylvain 2019-02-14 17:36:38 UTC
Created attachment 3614 [details]
test case

test case
Comment 6 Sylvain 2019-02-14 17:42:41 UTC
I've just uploaded my test-case and its png image

Could you try to run it with:
- your previous version of SDL.
- the current head of SDL (also eventually a modified on with the patch ..)
Save the full log. 

It blits any format to any format, with/without colorkey.

It does a crc on each image and can help to see what's going on / regression.
(You need some other dependencies: zlib.h and also md5.h (compile with -lz -lcrypto) )
Comment 7 Sylvain 2019-02-14 17:51:33 UTC
Created attachment 3615 [details]
compare script

Here's also a script to compare two logs you would have saved.
It tells if blits get optimised and if there're crc errors.
Comment 8 Dmitry Gapkalov 2019-02-14 22:54:48 UTC
Created attachment 3617 [details]
ref and new

attached two logs:
ref.txt - reference 2.0.9
new.txt - from the latest Hg changes

compare script doesn't work for me:

Starting new.txt new.txt
error parsing file ref.txt
Comment 9 Sylvain 2019-02-15 10:13:54 UTC
Created attachment 3622 [details]
compare script

I've fixed the compare script.
The line format was a little bit different (starting with "INFO:" or not).
+ some other internals because of missing lines.
Comment 10 Sylvain 2019-02-15 10:25:50 UTC
So there are lots of errors.
But I don't think it needs a big fix. 


I don't you if you want to debug it or not.

I have not much time in the next few days. But after I can have a look, and set up a virtual machine with big endian. 

What I would do:

1/
enable again the patch, and see if the number of crc mismatch decrease.

+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
     int Pixel = 0x04030201; /* identity permutation */
+#else
+    int Pixel = 0x01020304; /* identity permutation */
+#endif

if it doesn't help, then maybe there should a little different start permutation for big/little endian. 



2/ not 100% necessary, but
The reference "ref.txt"  is missing lines.
I've fix two issues, which we should apply before have the ref.txt
https://hg.libsdl.org/SDL/rev/233b2a61cad1
https://hg.libsdl.org/SDL/rev/dff36de37426


3/ We should also activate the "Blit_3or4_to_3or4__inversed_rgb" in normal_blit_3 for big endian, so there is also something to fix here.
(remove ifdef, and see where it fails :)
Comment 11 Dmitry Gapkalov 2019-02-15 15:29:45 UTC
- Sorry for misunderstanding, new.txt already contains your changes
+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
     int Pixel = 0x04030201; /* identity permutation */
+#else
+    int Pixel = 0x01020304; /* identity permutation */
+#endif
Comment 12 Dmitry Gapkalov 2019-02-16 03:25:43 UTC
Created attachment 3627 [details]
ref2, new and result

get new ref2.txt logs(with patches) old new.txt, and put the results.

p.s. If I uncomment #ifdef I have a crash.
Comment 13 Sylvain 2019-02-18 21:19:48 UTC
Ok, so I think I've fixed all the issues with big endian.
and enabled all the optimized routines for big endian.
I checked this with a virtual machine.

https://hg.libsdl.org/SDL/rev/c8f998f83225

1/ Can you see if this works, and re-run the test-suite and see how it compares with the reference ? 


2/ On my virtual machine, valgrind doesn't work (crashed at start, even with an helloworld). Just to make sure there is no bad access, could you run it.
with some parameters:

- remove the use of aligned memory (it can hide bad access): 
(remove https://hg.libsdl.org/SDL/rev/66cd8731c3b1 ) 
or just apply:

diff -r c8f998f83225 src/video/SDL_surface.c
--- a/src/video/SDL_surface.c	Mon Feb 18 22:06:53 2019 +0100
+++ b/src/video/SDL_surface.c	Mon Feb 18 22:15:59 2019 +0100
@@ -120,13 +120,13 @@
             return NULL;
         }
 
-        surface->pixels = SDL_SIMDAlloc((size_t)size);
+        surface->pixels = SDL_malloc((size_t)size);
         if (!surface->pixels) {
             SDL_FreeSurface(surface);
             SDL_OutOfMemory();
             return NULL;
         }
-        surface->flags |= SDL_SIMD_ALIGNED;
+//        surface->flags |= SDL_SIMD_ALIGNED;
         /* This is important for bitmaps */
         SDL_memset(surface->pixels, 0, surface->h * surface->pitch);


- use new test-suite which removes any extra pitch-padding. making pitch == width * bpp
Comment 14 Sylvain 2019-02-18 21:21:06 UTC
Created attachment 3632 [details]
test case

new test case.
- removes the width padding
Comment 15 Sylvain 2019-02-18 21:22:02 UTC
Created attachment 3633 [details]
very small image for the testcase

A smaller image for the testcase
Comment 16 Sylvain 2019-02-18 21:22:37 UTC
Created attachment 3634 [details]
compare script

Update the compare script
Comment 17 Dmitry Gapkalov 2019-02-19 18:12:07 UTC
I try latest changes with new test suit, but have a crash. Last line in log before crash is:

BlitNtoN       RGB24 ->       RGB24, in    20 ms       0
crc32: D0108A95 full: D0108A95 md5=78580e3deb728e4e7568b92edd3de58b
Comment 18 Dmitry Gapkalov 2019-02-19 18:15:22 UTC
I guess crash is in blitter:

BlitNtoN       RGB24 ->       BGR24
Comment 19 Sylvain 2019-02-19 19:06:51 UTC
It sounds like an invalid read/write since I get the correct crc32 and md5 for all conversions. And also as I don't have the crash with qemu.

which hardware do you use ? 


It can be the:
  RGB24 ->       BGR24

but than can be also a previous which doing an invalid read/write.
that can also be the underneath preparation conversion image_format -> BGR24


If this is RGB24 -> BGR24,
this is handled in : 
 Blit_3or4_to_3or4__inversed_rgb
 it's 3->3, so NO_ALPHA
 
Maybe you can try to comment out things:
the whole block, then the while loop, then the last_line, to try to narrow down the issue !
Comment 20 Dmitry Gapkalov 2019-02-19 19:55:01 UTC


        while (height--) {
            /* *INDENT-OFF* */
            DUFFS_LOOP(
            {
                Uint32 *dst32 = (Uint32*)dst;
                Uint8 s0 = src[i0];
                Uint8 s1 = src[i1];
                Uint8 s2 = src[i2];
                /* inversed, compared to Blit_3or4_to_3or4__same_rgb */
                *dst32 = (s0 << shift0) | (s1 << shift1) | (s2 << shift2); // 

if destination bpp is 3, than we have exception in this line, because we canno write 4 byte into 3 ;)
Comment 21 Sylvain 2019-02-19 20:03:52 UTC
I agree, but exception is only for the very last pixel. This is handled in the "last_line" block.

I have enabled the big_endian code in my little endian machine. and run valgrind with no issue.

Are  you sure you're running the latest code ?

And if so, what's the type of crash ? maybe you can get the stacktrace in gdb ?
Comment 22 Dmitry Gapkalov 2019-02-19 23:39:55 UTC
>>which hardware do you use ?
175Mhz core mips 32bit cpu

Its hard to explain, but this code generate exception:

                Uint8 s0 = src[i0];
                Uint8 s1 = src[i1];
                Uint8 s2 = src[i2];
				/* inversed, compared to Blit_3or4_to_3or4__same_rgb */
                *dst32 = (s0 << shift0) | (s1 << shift1) | (s2 << shift2);

but this code NOT:

		dst[0] = s0;
		dst[1] = s1;
		dst[2] = s2;
Comment 23 Sylvain 2019-02-20 07:12:25 UTC
strange ... any subtleties with mips ? 
I also ran qemu with mips eb debian and did hit this.

So with this patch, how does this behave for the full testcase ?
Comment 24 Sylvain 2019-02-20 08:40:37 UTC
mips doesn't like store on non 32 byte aligned memory probably ? And qemu doesn't simulate this.
but I would be surprised if this is the only place where there is such access.
Comment 25 Sylvain 2019-02-20 08:41:41 UTC
I mean -word- store on non 4 bytes / 32bits aligned memory
Comment 26 Sylvain 2019-02-20 12:17:59 UTC
Created attachment 3640 [details]
test case for unaligned store

to check this, can you try this test case that does non aligned int store
Comment 27 Dmitry Gapkalov 2019-02-20 15:53:27 UTC
on ARM or Mips - based systems you cannot address a 32-bit word that is not aligned to a 4-byte boundary. Doing so will result in an access violation exception. On x86 you can access such non-aligned data, though the performance suffers a little since two words have to fetched from memory instead of just one.

p.s. latest test example also crush

in: 
0, 
1, 
2, 
3, 
4, 
5, 
6, 
7, 
8, 
9, 
10, 
11, 
12, 
13, 
14, 
15, 
16, 
17, 
18, 
19, 
20, 
21, 
22, 
23, 
24, 
25, 
26, 
27, 
28, 
29, 
--than crash
Comment 28 Sylvain 2019-02-20 21:40:37 UTC
Created attachment 3643 [details]
SDL_blit_N.c

So it seems to be the issue!
Here's a new SDL_blit_N.c that removes the un-aligned load/store + some clean-up, please try it!

- You may also need to edit the testcase and comment out the 4 calls to "remove_padding()".
It could create un-aligned access ... or maybe not since, the padding are only for particular format ).

- In the compare script, if you un-comment the log faster/slower, you'll see how it improves the blit.
Comment 29 Dmitry Gapkalov 2019-02-21 17:55:06 UTC
Created attachment 3646 [details]
results

with this version all problems now is solved, all texturest displayed correctly without any artifacts.

ref4.txt - 2.0.9 with few patches
new_9.txt - latest mercury with new SDL_blit_N.c
result2.txt - results

results is little bit slower than original, but my hardware timer precision is 10ms, it maybe not enough to make correct decision.
Comment 30 Sylvain 2019-02-21 18:56:08 UTC
Ok great !

Yes the timings seem wrongs. the image was very small to run the test quickly.
The previous image is bigger so the timing should be more accurate.
Comment 31 Dmitry Gapkalov 2019-02-21 21:10:45 UTC
Created attachment 3647 [details]
result with large picture

Starting ref_3.txt new_10.txt 
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 145990 msec
Current: Time BlitNtoNKey : 135720 msec
Comment 32 Sylvain 2019-02-22 08:35:11 UTC
So it's seems very bad :)
Whereas it always faster on x86, it just almost always slower in Mips.

I think the board memory is slow ? or maybe each char write is done as a int32 store maybe ?


I committing the previous SDL_Blit_N.c, because it solves the crashes.
https://hg.libsdl.org/SDL/rev/8a160ecca90f

But, I'll provide you a new modified file to see it this can be improved
Comment 33 Sylvain 2019-02-22 14:07:43 UTC
Created attachment 3648 [details]
SDL_blit_N.c

Here's a new SDL_blit_N.c
Please gives a try !

(maybe change the compare_script threshold, so that it display faster/slower when the difference is > 1.2). (low resolution, but seems quite stables).
Comment 34 Dmitry Gapkalov 2019-02-22 17:46:57 UTC
Created attachment 3649 [details]
new result

Starting ref_3.txt new_11.txt 
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 136620 msec
Current: Time BlitNtoNKey : 127060 msec
Comment 35 Sylvain 2019-02-22 19:57:20 UTC
Created attachment 3651 [details]
SDL_blit_N.c

Most of regressions seems to have disappeared. There is one left which I've undefined in this new file. Can you test it to make sure?

After that,there should be no perf regression left. And there are still some faster blits case (but less).
I think it still valuable as you can get a x2 ratio if you convert a rgb24 with color-key to alpha surface.


Then, I will make this define active/un-active based on arch.
Not sure when it should disabled ... I have tried on arm, and all are ok. but I don't have very old one.
And what define can I use to make sure it selects or not your board or any MIPS ? Maybe I should disable the slower for all big endian ? 
Btw, what's your board exactly so that I can the specs ?
Comment 36 Sylvain 2019-02-22 20:33:10 UTC
Created attachment 3652 [details]
SDL_blit_N.c with intermediate aligned int on stack

Can also try this SDL_blit_N.c where it uses an intermediate aligned int ? It may enable most of the optimisation !
Comment 37 Dmitry Gapkalov 2019-02-22 21:13:14 UTC
Created attachment 3653 [details]
results


Starting ref_3.txt new_13.txt 
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 136510 msec
Current: Time BlitNtoNKey : 125990 msec

>>And what define can I use to make sure it selects or not your board or any >>MIPS ? Maybe I should disable the slower for all big endian ? 
>>Btw, what's your board exactly so that I can the specs ?

my hardware doesn't have common specific macro to determine it.
Comment 38 Dmitry Gapkalov 2019-02-22 22:02:36 UTC
Created attachment 3654 [details]
result7

Starting ref_3.txt new_14.txt 
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 151860 msec
Current: Time BlitNtoNKey : 140680 msec
Comment 39 Sylvain 2019-02-23 08:42:22 UTC
The SDL_blit_N.c of new_13.txt fixes the regression on mips. So it's commited.
I un-activate the routines for any __MIPS__ 
https://hg.libsdl.org/SDL/rev/c005c49beaa9

So the perf I see (I filtered some noise), compared to the ref are below :

===== BlitNtoNKey ========
ABGR8888 -> BGR888 :  faster x1   (240 -> 140)
ABGR8888 -> RGB24 :  faster x1   (300 -> 270)

ARGB8888 -> RGB888 :  faster x1   (250 -> 140)

BGR24 -> BGR24 :  faster x1   (290 -> 250)
BGR24 -> RGB24 :  faster x1   (300 -> 260)

BGR888 -> ABGR8888 :  faster x1   (250 -> 140)
BGR888 -> RGB24 :  faster x1   (310 -> 270)

BGRA8888 -> BGRX8888 :  faster x1   (240 -> 130)
BGRA8888 -> RGB24 :  faster x1   (300 -> 270)

BGRX8888 -> BGR24 :  faster x1   (300 -> 270)
BGRX8888 -> BGRA8888 :  faster x1   (250 -> 130)
BGRX8888 -> BGRX8888 :  faster x1   (250 -> 140)
BGRX8888 -> RGB24 :  faster x1   (310 -> 280)

RGB24 -> BGR24 :  faster x1   (290 -> 260)
RGB24 -> RGB24 :  faster x1   (290 -> 260)

RGB888 -> ARGB8888 :  faster x1   (240 -> 130)
RGB888 -> BGR24 :  faster x1   (310 -> 270)
RGB888 -> RGB24 :  faster x1   (310 -> 280)

RGBA8888 -> BGR24 :  faster x1   (310 -> 270)
RGBA8888 -> RGB24 :  faster x1   (310 -> 270)
RGBA8888 -> RGBX8888 :  faster x1   (260 -> 140)

RGBX8888 -> RGB24 :  faster x1   (310 -> 270)
RGBX8888 -> RGBA8888 :  faster x1   (250 -> 130)
RGBX8888 -> RGBX8888 :  faster x1   (250 -> 140)
===== BlitNtoN ========
ABGR8888 -> ARGB8888 :  faster x1   (290 -> 150)
ABGR8888 -> BGR24 :  faster x1   (330 -> 290)
ARGB8888 -> ABGR8888 :  faster x2   (300 -> 150)
ARGB8888 -> RGB24 :  faster x1   (320 -> 290)

BGR24 -> ABGR8888 :  faster x1   (210 -> 140)
BGR24 -> ARGB8888 :  faster x1   (220 -> 150)

BGR888 -> ARGB8888 :  faster x1   (250 -> 150)
BGR888 -> BGR24 :  faster x1   (320 -> 290)
BGR888 -> RGB24 :  faster x1   (330 -> 290)

BGRA8888 -> BGR24 :  faster x1   (330 -> 290)
BGRA8888 -> RGB24 :  faster x1   (330 -> 290)
BGRX8888 -> BGR24 :  faster x1   (320 -> 290)

RGB24 -> ABGR8888 :  faster x1   (210 -> 140)
RGB24 -> ARGB8888 :  faster x1   (210 -> 130)

RGB888 -> ABGR8888 :  faster x1   (260 -> 150)
RGB888 -> RGB24 :  faster x1   (330 -> 290)

RGBA8888 -> RGB24 :  faster x1   (330 -> 290)
RGBX8888 -> BGR24 :  faster x1   (320 -> 290)
RGBX8888 -> RGB24 :  faster x1   (340 -> 300)
Comment 40 Sylvain 2019-02-23 08:48:59 UTC
I mark the issue as fixed.

The new_14.txt is still slower. 
But I was expecting some improvement by using a word write, and an intermediate 4 char buffer

2257         char  buf[7];
2258         char *tmp = (char *)(((size_t)buf + 3) & ~3);
2259 
2264         while (height--) {
2266             DUFFS_LOOP(
2267             {
2268                 Uint32 *dst32 = (Uint32*)dst;
2269                 tmp[0] = src[p0];
2270                 tmp[1] = src[p1];
2271                 tmp[2] = src[p2];
2272                 tmp[3] = src[p3];
2273                 tmp[alpha_channel] = alpha;
2274                 *dst32 = *(Uint32 *)tmp;
2275                 src += 4;
2276                 dst += 4;
2277             }, width);

Maybe the intermediate tmp[4] it's not optimized as it should be. And should be marked as
register char tmp[4]; or a global static var aligned(4) char tmp[4];

Anyway, feel free to try to optimize it if you need.