Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bad textures on bigendian cpu #3162

Closed
SDLBugzilla opened this issue Feb 11, 2021 · 0 comments
Closed

bad textures on bigendian cpu #3162

SDLBugzilla opened this issue Feb 11, 2021 · 0 comments

Comments

@SDLBugzilla
Copy link
Collaborator

This bug report was migrated from our old Bugzilla tracker.

These attachments are available in the static archive:

Reported in version: 2.0.9
Reported for operating system, platform: Other, Other

Comments on the original bug report:

On 2019-02-14 02:11:51 +0000, Dmitry Gapkalov wrote:

Hello,
with latest SDL2 changes from Hg, when I using software rendering on the BigEndian CPU, textures(created from jpeg`s) on the screen displayed with wrong colors. It looks like if some color components(RGB) is exchanged.

On 2019-02-14 08:22:00 +0000, Sylvain wrote:

Probably my last commits, haven't tested with big endian.
which hardware do you use ?

Can you help to do some check ?

Can you try this patch:

diff -r 48e26b5d4f56 src/video/SDL_blit_N.c
--- a/src/video/SDL_blit_N.c Sat Feb 09 17:40:32 2019 +0100
+++ b/src/video/SDL_blit_N.c Thu Feb 14 09:19:57 2019 +0100
@@ -2155,7 +2155,11 @@
int *_p0 , int *_p1, int *_p2, int *_p3, int _alpha_channel)
{
int alpha_channel = 0, p0, p1, p2, p3;
+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
int Pixel = 0x04030201; /
identity permutation */
+#else

  • int Pixel = 0x01020304; /* identity permutation */
    +#endif

    if (srcfmt->Amask) {
    RGBA_FROM_PIXEL(Pixel, srcfmt, p0, p1, p2, p3);

On 2019-02-14 15:55:39 +0000, Dmitry Gapkalov wrote:

No, it doesn't help

On 2019-02-14 16:08:41 +0000, Dmitry Gapkalov wrote:

Created attachment 3611
bad pic

On 2019-02-14 17:36:16 +0000, Sylvain wrote:

Created attachment 3613
image for testcase

image for test case

On 2019-02-14 17:36:38 +0000, Sylvain wrote:

Created attachment 3614
test case

test case

On 2019-02-14 17:42:41 +0000, Sylvain wrote:

I've just uploaded my test-case and its png image

Could you try to run it with:

  • your previous version of SDL.
  • the current head of SDL (also eventually a modified on with the patch ..)
    Save the full log.

It blits any format to any format, with/without colorkey.

It does a crc on each image and can help to see what's going on / regression.
(You need some other dependencies: zlib.h and also md5.h (compile with -lz -lcrypto) )

On 2019-02-14 17:51:33 +0000, Sylvain wrote:

Created attachment 3615
compare script

Here's also a script to compare two logs you would have saved.
It tells if blits get optimised and if there're crc errors.

On 2019-02-14 22:54:48 +0000, Dmitry Gapkalov wrote:

Created attachment 3617
ref and new

attached two logs:
ref.txt - reference 2.0.9
new.txt - from the latest Hg changes

compare script doesn't work for me:

Starting new.txt new.txt
error parsing file ref.txt

On 2019-02-15 10:13:54 +0000, Sylvain wrote:

Created attachment 3622
compare script

I've fixed the compare script.
The line format was a little bit different (starting with "INFO:" or not).

  • some other internals because of missing lines.

On 2019-02-15 10:25:50 +0000, Sylvain wrote:

So there are lots of errors.
But I don't think it needs a big fix.

I don't you if you want to debug it or not.

I have not much time in the next few days. But after I can have a look, and set up a virtual machine with big endian.

What I would do:

1/
enable again the patch, and see if the number of crc mismatch decrease.

+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
int Pixel = 0x04030201; /* identity permutation */
+#else

  • int Pixel = 0x01020304; /* identity permutation */
    +#endif

if it doesn't help, then maybe there should a little different start permutation for big/little endian.

2/ not 100% necessary, but
The reference "ref.txt" is missing lines.
I've fix two issues, which we should apply before have the ref.txt
https://hg.libsdl.org/SDL/rev/233b2a61cad1
https://hg.libsdl.org/SDL/rev/dff36de37426

3/ We should also activate the "Blit_3or4_to_3or4__inversed_rgb" in normal_blit_3 for big endian, so there is also something to fix here.
(remove ifdef, and see where it fails :)

On 2019-02-15 15:29:45 +0000, Dmitry Gapkalov wrote:

  • Sorry for misunderstanding, new.txt already contains your changes
    +#if SDL_BYTEORDER == SDL_LIL_ENDIAN
    int Pixel = 0x04030201; /* identity permutation */
    +#else
  • int Pixel = 0x01020304; /* identity permutation */
    +#endif

On 2019-02-16 03:25:43 +0000, Dmitry Gapkalov wrote:

Created attachment 3627
ref2, new and result

get new ref2.txt logs(with patches) old new.txt, and put the results.

p.s. If I uncomment #ifdef I have a crash.

On 2019-02-18 21:19:48 +0000, Sylvain wrote:

Ok, so I think I've fixed all the issues with big endian.
and enabled all the optimized routines for big endian.
I checked this with a virtual machine.

https://hg.libsdl.org/SDL/rev/c8f998f83225

1/ Can you see if this works, and re-run the test-suite and see how it compares with the reference ?

2/ On my virtual machine, valgrind doesn't work (crashed at start, even with an helloworld). Just to make sure there is no bad access, could you run it.
with some parameters:

diff -r c8f998f83225 src/video/SDL_surface.c
--- a/src/video/SDL_surface.c Mon Feb 18 22:06:53 2019 +0100
+++ b/src/video/SDL_surface.c Mon Feb 18 22:15:59 2019 +0100
@@ -120,13 +120,13 @@
return NULL;
}

  •    surface->pixels = SDL_SIMDAlloc((size_t)size);
    
  •    surface->pixels = SDL_malloc((size_t)size);
       if (!surface->pixels) {
           SDL_FreeSurface(surface);
           SDL_OutOfMemory();
           return NULL;
       }
    
  •    surface->flags |= SDL_SIMD_ALIGNED;
    

+// surface->flags |= SDL_SIMD_ALIGNED;
/* This is important for bitmaps */
SDL_memset(surface->pixels, 0, surface->h * surface->pitch);

  • use new test-suite which removes any extra pitch-padding. making pitch == width * bpp

On 2019-02-18 21:21:06 +0000, Sylvain wrote:

Created attachment 3632
test case

new test case.

  • removes the width padding

On 2019-02-18 21:22:02 +0000, Sylvain wrote:

Created attachment 3633
very small image for the testcase

A smaller image for the testcase

On 2019-02-18 21:22:37 +0000, Sylvain wrote:

Created attachment 3634
compare script

Update the compare script

On 2019-02-19 18:12:07 +0000, Dmitry Gapkalov wrote:

I try latest changes with new test suit, but have a crash. Last line in log before crash is:

BlitNtoN RGB24 -> RGB24, in 20 ms 0
crc32: D0108A95 full: D0108A95 md5=78580e3deb728e4e7568b92edd3de58b

On 2019-02-19 18:15:22 +0000, Dmitry Gapkalov wrote:

I guess crash is in blitter:

BlitNtoN RGB24 -> BGR24

On 2019-02-19 19:06:51 +0000, Sylvain wrote:

It sounds like an invalid read/write since I get the correct crc32 and md5 for all conversions. And also as I don't have the crash with qemu.

which hardware do you use ?

It can be the:
RGB24 -> BGR24

but than can be also a previous which doing an invalid read/write.
that can also be the underneath preparation conversion image_format -> BGR24

If this is RGB24 -> BGR24,
this is handled in :
Blit_3or4_to_3or4__inversed_rgb
it's 3->3, so NO_ALPHA

Maybe you can try to comment out things:
the whole block, then the while loop, then the last_line, to try to narrow down the issue !

On 2019-02-19 19:55:01 +0000, Dmitry Gapkalov wrote:

    while (height--) {
        /* *INDENT-OFF* */
        DUFFS_LOOP(
        {
            Uint32 *dst32 = (Uint32*)dst;
            Uint8 s0 = src[i0];
            Uint8 s1 = src[i1];
            Uint8 s2 = src[i2];
            /* inversed, compared to Blit_3or4_to_3or4__same_rgb */
            *dst32 = (s0 << shift0) | (s1 << shift1) | (s2 << shift2); // 

if destination bpp is 3, than we have exception in this line, because we canno write 4 byte into 3 ;)

On 2019-02-19 20:03:52 +0000, Sylvain wrote:

I agree, but exception is only for the very last pixel. This is handled in the "last_line" block.

I have enabled the big_endian code in my little endian machine. and run valgrind with no issue.

Are you sure you're running the latest code ?

And if so, what's the type of crash ? maybe you can get the stacktrace in gdb ?

On 2019-02-19 23:39:55 +0000, Dmitry Gapkalov wrote:

which hardware do you use ?
175Mhz core mips 32bit cpu

Its hard to explain, but this code generate exception:

            Uint8 s0 = src[i0];
            Uint8 s1 = src[i1];
            Uint8 s2 = src[i2];
  		/* inversed, compared to Blit_3or4_to_3or4__same_rgb */
            *dst32 = (s0 << shift0) | (s1 << shift1) | (s2 << shift2);

but this code NOT:

  dst[0] = s0;
  dst[1] = s1;
  dst[2] = s2;

On 2019-02-20 07:12:25 +0000, Sylvain wrote:

strange ... any subtleties with mips ?
I also ran qemu with mips eb debian and did hit this.

So with this patch, how does this behave for the full testcase ?

On 2019-02-20 08:40:37 +0000, Sylvain wrote:

mips doesn't like store on non 32 byte aligned memory probably ? And qemu doesn't simulate this.
but I would be surprised if this is the only place where there is such access.

On 2019-02-20 08:41:41 +0000, Sylvain wrote:

I mean -word- store on non 4 bytes / 32bits aligned memory

On 2019-02-20 12:17:59 +0000, Sylvain wrote:

Created attachment 3640
test case for unaligned store

to check this, can you try this test case that does non aligned int store

On 2019-02-20 15:53:27 +0000, Dmitry Gapkalov wrote:

on ARM or Mips - based systems you cannot address a 32-bit word that is not aligned to a 4-byte boundary. Doing so will result in an access violation exception. On x86 you can access such non-aligned data, though the performance suffers a little since two words have to fetched from memory instead of just one.

p.s. latest test example also crush

in:
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
--than crash

On 2019-02-20 21:40:37 +0000, Sylvain wrote:

Created attachment 3643
SDL_blit_N.c

So it seems to be the issue!
Here's a new SDL_blit_N.c that removes the un-aligned load/store + some clean-up, please try it!

  • You may also need to edit the testcase and comment out the 4 calls to "remove_padding()".
    It could create un-aligned access ... or maybe not since, the padding are only for particular format ).

  • In the compare script, if you un-comment the log faster/slower, you'll see how it improves the blit.

On 2019-02-21 17:55:06 +0000, Dmitry Gapkalov wrote:

Created attachment 3646
results

with this version all problems now is solved, all texturest displayed correctly without any artifacts.

ref4.txt - 2.0.9 with few patches
new_9.txt - latest mercury with new SDL_blit_N.c
result2.txt - results

results is little bit slower than original, but my hardware timer precision is 10ms, it maybe not enough to make correct decision.

On 2019-02-21 18:56:08 +0000, Sylvain wrote:

Ok great !

Yes the timings seem wrongs. the image was very small to run the test quickly.
The previous image is bigger so the timing should be more accurate.

On 2019-02-21 21:10:45 +0000, Dmitry Gapkalov wrote:

Created attachment 3647
result with large picture

Starting ref_3.txt new_10.txt
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 145990 msec
Current: Time BlitNtoNKey : 135720 msec

On 2019-02-22 08:35:11 +0000, Sylvain wrote:

So it's seems very bad :)
Whereas it always faster on x86, it just almost always slower in Mips.

I think the board memory is slow ? or maybe each char write is done as a int32 store maybe ?

I committing the previous SDL_Blit_N.c, because it solves the crashes.
https://hg.libsdl.org/SDL/rev/8a160ecca90f

But, I'll provide you a new modified file to see it this can be improved

On 2019-02-22 14:07:43 +0000, Sylvain wrote:

Created attachment 3648
SDL_blit_N.c

Here's a new SDL_blit_N.c
Please gives a try !

(maybe change the compare_script threshold, so that it display faster/slower when the difference is > 1.2). (low resolution, but seems quite stables).

On 2019-02-22 17:46:57 +0000, Dmitry Gapkalov wrote:

Created attachment 3649
new result

Starting ref_3.txt new_11.txt
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 136620 msec
Current: Time BlitNtoNKey : 127060 msec

On 2019-02-22 19:57:20 +0000, Sylvain wrote:

Created attachment 3651
SDL_blit_N.c

Most of regressions seems to have disappeared. There is one left which I've undefined in this new file. Can you test it to make sure?

After that,there should be no perf regression left. And there are still some faster blits case (but less).
I think it still valuable as you can get a x2 ratio if you convert a rgb24 with color-key to alpha surface.

Then, I will make this define active/un-active based on arch.
Not sure when it should disabled ... I have tried on arm, and all are ok. but I don't have very old one.
And what define can I use to make sure it selects or not your board or any MIPS ? Maybe I should disable the slower for all big endian ?
Btw, what's your board exactly so that I can the specs ?

On 2019-02-22 20:33:10 +0000, Sylvain wrote:

Created attachment 3652
SDL_blit_N.c with intermediate aligned int on stack

Can also try this SDL_blit_N.c where it uses an intermediate aligned int ? It may enable most of the optimisation !

On 2019-02-22 21:13:14 +0000, Dmitry Gapkalov wrote:

Created attachment 3653
results

Starting ref_3.txt new_13.txt
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 136510 msec
Current: Time BlitNtoNKey : 125990 msec

And what define can I use to make sure it selects or not your board or any >>MIPS ? Maybe I should disable the slower for all big endian ?
Btw, what's your board exactly so that I can the specs ?

my hardware doesn't have common specific macro to determine it.

On 2019-02-22 22:02:36 +0000, Dmitry Gapkalov wrote:

Created attachment 3654
result7

Starting ref_3.txt new_14.txt
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 151860 msec
Current: Time BlitNtoNKey : 140680 msec

On 2019-02-23 08:42:22 +0000, Sylvain wrote:

The SDL_blit_N.c of new_13.txt fixes the regression on mips. So it's commited.
I un-activate the routines for any MIPS
https://hg.libsdl.org/SDL/rev/c005c49beaa9

So the perf I see (I filtered some noise), compared to the ref are below :

===== BlitNtoNKey ========
ABGR8888 -> BGR888 : faster x1 (240 -> 140)
ABGR8888 -> RGB24 : faster x1 (300 -> 270)

ARGB8888 -> RGB888 : faster x1 (250 -> 140)

BGR24 -> BGR24 : faster x1 (290 -> 250)
BGR24 -> RGB24 : faster x1 (300 -> 260)

BGR888 -> ABGR8888 : faster x1 (250 -> 140)
BGR888 -> RGB24 : faster x1 (310 -> 270)

BGRA8888 -> BGRX8888 : faster x1 (240 -> 130)
BGRA8888 -> RGB24 : faster x1 (300 -> 270)

BGRX8888 -> BGR24 : faster x1 (300 -> 270)
BGRX8888 -> BGRA8888 : faster x1 (250 -> 130)
BGRX8888 -> BGRX8888 : faster x1 (250 -> 140)
BGRX8888 -> RGB24 : faster x1 (310 -> 280)

RGB24 -> BGR24 : faster x1 (290 -> 260)
RGB24 -> RGB24 : faster x1 (290 -> 260)

RGB888 -> ARGB8888 : faster x1 (240 -> 130)
RGB888 -> BGR24 : faster x1 (310 -> 270)
RGB888 -> RGB24 : faster x1 (310 -> 280)

RGBA8888 -> BGR24 : faster x1 (310 -> 270)
RGBA8888 -> RGB24 : faster x1 (310 -> 270)
RGBA8888 -> RGBX8888 : faster x1 (260 -> 140)

RGBX8888 -> RGB24 : faster x1 (310 -> 270)
RGBX8888 -> RGBA8888 : faster x1 (250 -> 130)
RGBX8888 -> RGBX8888 : faster x1 (250 -> 140)
===== BlitNtoN ========
ABGR8888 -> ARGB8888 : faster x1 (290 -> 150)
ABGR8888 -> BGR24 : faster x1 (330 -> 290)
ARGB8888 -> ABGR8888 : faster x2 (300 -> 150)
ARGB8888 -> RGB24 : faster x1 (320 -> 290)

BGR24 -> ABGR8888 : faster x1 (210 -> 140)
BGR24 -> ARGB8888 : faster x1 (220 -> 150)

BGR888 -> ARGB8888 : faster x1 (250 -> 150)
BGR888 -> BGR24 : faster x1 (320 -> 290)
BGR888 -> RGB24 : faster x1 (330 -> 290)

BGRA8888 -> BGR24 : faster x1 (330 -> 290)
BGRA8888 -> RGB24 : faster x1 (330 -> 290)
BGRX8888 -> BGR24 : faster x1 (320 -> 290)

RGB24 -> ABGR8888 : faster x1 (210 -> 140)
RGB24 -> ARGB8888 : faster x1 (210 -> 130)

RGB888 -> ABGR8888 : faster x1 (260 -> 150)
RGB888 -> RGB24 : faster x1 (330 -> 290)

RGBA8888 -> RGB24 : faster x1 (330 -> 290)
RGBX8888 -> BGR24 : faster x1 (320 -> 290)
RGBX8888 -> RGB24 : faster x1 (340 -> 300)

On 2019-02-23 08:48:59 +0000, Sylvain wrote:

I mark the issue as fixed.

The new_14.txt is still slower.
But I was expecting some improvement by using a word write, and an intermediate 4 char buffer

2257 char buf[7];
2258 char *tmp = (char *)(((size_t)buf + 3) & ~3);
2259
2264 while (height--) {
2266 DUFFS_LOOP(
2267 {
2268 Uint32 dst32 = (Uint32)dst;
2269 tmp[0] = src[p0];
2270 tmp[1] = src[p1];
2271 tmp[2] = src[p2];
2272 tmp[3] = src[p3];
2273 tmp[alpha_channel] = alpha;
2274 *dst32 = *(Uint32 *)tmp;
2275 src += 4;
2276 dst += 4;
2277 }, width);

Maybe the intermediate tmp[4] it's not optimized as it should be. And should be marked as
register char tmp[4]; or a global static var aligned(4) char tmp[4];

Anyway, feel free to try to optimize it if you need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant