| Summary: | bad textures on bigendian cpu | ||
|---|---|---|---|
| Product: | SDL | Reporter: | Dmitry Gapkalov <gapkalov> |
| Component: | render | Assignee: | Sam Lantinga <slouken> |
| Status: | RESOLVED FIXED | QA Contact: | Sam Lantinga <slouken> |
| Severity: | critical | ||
| Priority: | P2 | CC: | gapkalov, sylvain.becker |
| Version: | 2.0.9 | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Attachments: |
bad pic
image for testcase test case compare script ref and new compare script ref2, new and result test case very small image for the testcase compare script test case for unaligned store SDL_blit_N.c results result with large picture SDL_blit_N.c new result SDL_blit_N.c SDL_blit_N.c with intermediate aligned int on stack results result7 |
||
|
Description
Dmitry Gapkalov
2019-02-14 02:11:51 UTC
Probably my last commits, haven't tested with big endian.
which hardware do you use ?
Can you help to do some check ?
Can you try this patch:
diff -r 48e26b5d4f56 src/video/SDL_blit_N.c
--- a/src/video/SDL_blit_N.c Sat Feb 09 17:40:32 2019 +0100
+++ b/src/video/SDL_blit_N.c Thu Feb 14 09:19:57 2019 +0100
@@ -2155,7 +2155,11 @@
int *_p0 , int *_p1, int *_p2, int *_p3, int *_alpha_channel)
{
int alpha_channel = 0, p0, p1, p2, p3;
+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
int Pixel = 0x04030201; /* identity permutation */
+#else
+ int Pixel = 0x01020304; /* identity permutation */
+#endif
if (srcfmt->Amask) {
RGBA_FROM_PIXEL(Pixel, srcfmt, p0, p1, p2, p3);
No, it doesn't help Created attachment 3611 [details]
bad pic
Created attachment 3613 [details]
image for testcase
image for test case
Created attachment 3614 [details]
test case
test case
I've just uploaded my test-case and its png image Could you try to run it with: - your previous version of SDL. - the current head of SDL (also eventually a modified on with the patch ..) Save the full log. It blits any format to any format, with/without colorkey. It does a crc on each image and can help to see what's going on / regression. (You need some other dependencies: zlib.h and also md5.h (compile with -lz -lcrypto) ) Created attachment 3615 [details]
compare script
Here's also a script to compare two logs you would have saved.
It tells if blits get optimised and if there're crc errors.
Created attachment 3617 [details]
ref and new
attached two logs:
ref.txt - reference 2.0.9
new.txt - from the latest Hg changes
compare script doesn't work for me:
Starting new.txt new.txt
error parsing file ref.txt
Created attachment 3622 [details]
compare script
I've fixed the compare script.
The line format was a little bit different (starting with "INFO:" or not).
+ some other internals because of missing lines.
So there are lots of errors.
But I don't think it needs a big fix.
I don't you if you want to debug it or not.
I have not much time in the next few days. But after I can have a look, and set up a virtual machine with big endian.
What I would do:
1/
enable again the patch, and see if the number of crc mismatch decrease.
+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
int Pixel = 0x04030201; /* identity permutation */
+#else
+ int Pixel = 0x01020304; /* identity permutation */
+#endif
if it doesn't help, then maybe there should a little different start permutation for big/little endian.
2/ not 100% necessary, but
The reference "ref.txt" is missing lines.
I've fix two issues, which we should apply before have the ref.txt
https://hg.libsdl.org/SDL/rev/233b2a61cad1
https://hg.libsdl.org/SDL/rev/dff36de37426
3/ We should also activate the "Blit_3or4_to_3or4__inversed_rgb" in normal_blit_3 for big endian, so there is also something to fix here.
(remove ifdef, and see where it fails :)
- Sorry for misunderstanding, new.txt already contains your changes
+#if SDL_BYTEORDER == SDL_LIL_ENDIAN
int Pixel = 0x04030201; /* identity permutation */
+#else
+ int Pixel = 0x01020304; /* identity permutation */
+#endif
Created attachment 3627 [details]
ref2, new and result
get new ref2.txt logs(with patches) old new.txt, and put the results.
p.s. If I uncomment #ifdef I have a crash.
Ok, so I think I've fixed all the issues with big endian. and enabled all the optimized routines for big endian. I checked this with a virtual machine. https://hg.libsdl.org/SDL/rev/c8f998f83225 1/ Can you see if this works, and re-run the test-suite and see how it compares with the reference ? 2/ On my virtual machine, valgrind doesn't work (crashed at start, even with an helloworld). Just to make sure there is no bad access, could you run it. with some parameters: - remove the use of aligned memory (it can hide bad access): (remove https://hg.libsdl.org/SDL/rev/66cd8731c3b1 ) or just apply: diff -r c8f998f83225 src/video/SDL_surface.c --- a/src/video/SDL_surface.c Mon Feb 18 22:06:53 2019 +0100 +++ b/src/video/SDL_surface.c Mon Feb 18 22:15:59 2019 +0100 @@ -120,13 +120,13 @@ return NULL; } - surface->pixels = SDL_SIMDAlloc((size_t)size); + surface->pixels = SDL_malloc((size_t)size); if (!surface->pixels) { SDL_FreeSurface(surface); SDL_OutOfMemory(); return NULL; } - surface->flags |= SDL_SIMD_ALIGNED; +// surface->flags |= SDL_SIMD_ALIGNED; /* This is important for bitmaps */ SDL_memset(surface->pixels, 0, surface->h * surface->pitch); - use new test-suite which removes any extra pitch-padding. making pitch == width * bpp Created attachment 3632 [details]
test case
new test case.
- removes the width padding
Created attachment 3633 [details]
very small image for the testcase
A smaller image for the testcase
Created attachment 3634 [details]
compare script
Update the compare script
I try latest changes with new test suit, but have a crash. Last line in log before crash is: BlitNtoN RGB24 -> RGB24, in 20 ms 0 crc32: D0108A95 full: D0108A95 md5=78580e3deb728e4e7568b92edd3de58b I guess crash is in blitter: BlitNtoN RGB24 -> BGR24 It sounds like an invalid read/write since I get the correct crc32 and md5 for all conversions. And also as I don't have the crash with qemu. which hardware do you use ? It can be the: RGB24 -> BGR24 but than can be also a previous which doing an invalid read/write. that can also be the underneath preparation conversion image_format -> BGR24 If this is RGB24 -> BGR24, this is handled in : Blit_3or4_to_3or4__inversed_rgb it's 3->3, so NO_ALPHA Maybe you can try to comment out things: the whole block, then the while loop, then the last_line, to try to narrow down the issue !
while (height--) {
/* *INDENT-OFF* */
DUFFS_LOOP(
{
Uint32 *dst32 = (Uint32*)dst;
Uint8 s0 = src[i0];
Uint8 s1 = src[i1];
Uint8 s2 = src[i2];
/* inversed, compared to Blit_3or4_to_3or4__same_rgb */
*dst32 = (s0 << shift0) | (s1 << shift1) | (s2 << shift2); //
if destination bpp is 3, than we have exception in this line, because we canno write 4 byte into 3 ;)
I agree, but exception is only for the very last pixel. This is handled in the "last_line" block. I have enabled the big_endian code in my little endian machine. and run valgrind with no issue. Are you sure you're running the latest code ? And if so, what's the type of crash ? maybe you can get the stacktrace in gdb ? >>which hardware do you use ?
175Mhz core mips 32bit cpu
Its hard to explain, but this code generate exception:
Uint8 s0 = src[i0];
Uint8 s1 = src[i1];
Uint8 s2 = src[i2];
/* inversed, compared to Blit_3or4_to_3or4__same_rgb */
*dst32 = (s0 << shift0) | (s1 << shift1) | (s2 << shift2);
but this code NOT:
dst[0] = s0;
dst[1] = s1;
dst[2] = s2;
strange ... any subtleties with mips ? I also ran qemu with mips eb debian and did hit this. So with this patch, how does this behave for the full testcase ? mips doesn't like store on non 32 byte aligned memory probably ? And qemu doesn't simulate this. but I would be surprised if this is the only place where there is such access. I mean -word- store on non 4 bytes / 32bits aligned memory Created attachment 3640 [details]
test case for unaligned store
to check this, can you try this test case that does non aligned int store
on ARM or Mips - based systems you cannot address a 32-bit word that is not aligned to a 4-byte boundary. Doing so will result in an access violation exception. On x86 you can access such non-aligned data, though the performance suffers a little since two words have to fetched from memory instead of just one. p.s. latest test example also crush in: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, --than crash Created attachment 3643 [details]
SDL_blit_N.c
So it seems to be the issue!
Here's a new SDL_blit_N.c that removes the un-aligned load/store + some clean-up, please try it!
- You may also need to edit the testcase and comment out the 4 calls to "remove_padding()".
It could create un-aligned access ... or maybe not since, the padding are only for particular format ).
- In the compare script, if you un-comment the log faster/slower, you'll see how it improves the blit.
Created attachment 3646 [details]
results
with this version all problems now is solved, all texturest displayed correctly without any artifacts.
ref4.txt - 2.0.9 with few patches
new_9.txt - latest mercury with new SDL_blit_N.c
result2.txt - results
results is little bit slower than original, but my hardware timer precision is 10ms, it maybe not enough to make correct decision.
Ok great ! Yes the timings seem wrongs. the image was very small to run the test quickly. The previous image is bigger so the timing should be more accurate. Created attachment 3647 [details]
result with large picture
Starting ref_3.txt new_10.txt
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 145990 msec
Current: Time BlitNtoNKey : 135720 msec
So it's seems very bad :) Whereas it always faster on x86, it just almost always slower in Mips. I think the board memory is slow ? or maybe each char write is done as a int32 store maybe ? I committing the previous SDL_Blit_N.c, because it solves the crashes. https://hg.libsdl.org/SDL/rev/8a160ecca90f But, I'll provide you a new modified file to see it this can be improved Created attachment 3648 [details]
SDL_blit_N.c
Here's a new SDL_blit_N.c
Please gives a try !
(maybe change the compare_script threshold, so that it display faster/slower when the difference is > 1.2). (low resolution, but seems quite stables).
Created attachment 3649 [details]
new result
Starting ref_3.txt new_11.txt
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 136620 msec
Current: Time BlitNtoNKey : 127060 msec
Created attachment 3651 [details]
SDL_blit_N.c
Most of regressions seems to have disappeared. There is one left which I've undefined in this new file. Can you test it to make sure?
After that,there should be no perf regression left. And there are still some faster blits case (but less).
I think it still valuable as you can get a x2 ratio if you convert a rgb24 with color-key to alpha surface.
Then, I will make this define active/un-active based on arch.
Not sure when it should disabled ... I have tried on arm, and all are ok. but I don't have very old one.
And what define can I use to make sure it selects or not your board or any MIPS ? Maybe I should disable the slower for all big endian ?
Btw, what's your board exactly so that I can the specs ?
Created attachment 3652 [details]
SDL_blit_N.c with intermediate aligned int on stack
Can also try this SDL_blit_N.c where it uses an intermediate aligned int ? It may enable most of the optimisation !
Created attachment 3653 [details] results Starting ref_3.txt new_13.txt Reference: Time BlitNtoN : 137890 msec Reference: Time BlitNtoNKey : 127910 msec Current: Time BlitNtoN : 136510 msec Current: Time BlitNtoNKey : 125990 msec >>And what define can I use to make sure it selects or not your board or any >>MIPS ? Maybe I should disable the slower for all big endian ? >>Btw, what's your board exactly so that I can the specs ? my hardware doesn't have common specific macro to determine it. Created attachment 3654 [details]
result7
Starting ref_3.txt new_14.txt
Reference: Time BlitNtoN : 137890 msec
Reference: Time BlitNtoNKey : 127910 msec
Current: Time BlitNtoN : 151860 msec
Current: Time BlitNtoNKey : 140680 msec
The SDL_blit_N.c of new_13.txt fixes the regression on mips. So it's commited. I un-activate the routines for any __MIPS__ https://hg.libsdl.org/SDL/rev/c005c49beaa9 So the perf I see (I filtered some noise), compared to the ref are below : ===== BlitNtoNKey ======== ABGR8888 -> BGR888 : faster x1 (240 -> 140) ABGR8888 -> RGB24 : faster x1 (300 -> 270) ARGB8888 -> RGB888 : faster x1 (250 -> 140) BGR24 -> BGR24 : faster x1 (290 -> 250) BGR24 -> RGB24 : faster x1 (300 -> 260) BGR888 -> ABGR8888 : faster x1 (250 -> 140) BGR888 -> RGB24 : faster x1 (310 -> 270) BGRA8888 -> BGRX8888 : faster x1 (240 -> 130) BGRA8888 -> RGB24 : faster x1 (300 -> 270) BGRX8888 -> BGR24 : faster x1 (300 -> 270) BGRX8888 -> BGRA8888 : faster x1 (250 -> 130) BGRX8888 -> BGRX8888 : faster x1 (250 -> 140) BGRX8888 -> RGB24 : faster x1 (310 -> 280) RGB24 -> BGR24 : faster x1 (290 -> 260) RGB24 -> RGB24 : faster x1 (290 -> 260) RGB888 -> ARGB8888 : faster x1 (240 -> 130) RGB888 -> BGR24 : faster x1 (310 -> 270) RGB888 -> RGB24 : faster x1 (310 -> 280) RGBA8888 -> BGR24 : faster x1 (310 -> 270) RGBA8888 -> RGB24 : faster x1 (310 -> 270) RGBA8888 -> RGBX8888 : faster x1 (260 -> 140) RGBX8888 -> RGB24 : faster x1 (310 -> 270) RGBX8888 -> RGBA8888 : faster x1 (250 -> 130) RGBX8888 -> RGBX8888 : faster x1 (250 -> 140) ===== BlitNtoN ======== ABGR8888 -> ARGB8888 : faster x1 (290 -> 150) ABGR8888 -> BGR24 : faster x1 (330 -> 290) ARGB8888 -> ABGR8888 : faster x2 (300 -> 150) ARGB8888 -> RGB24 : faster x1 (320 -> 290) BGR24 -> ABGR8888 : faster x1 (210 -> 140) BGR24 -> ARGB8888 : faster x1 (220 -> 150) BGR888 -> ARGB8888 : faster x1 (250 -> 150) BGR888 -> BGR24 : faster x1 (320 -> 290) BGR888 -> RGB24 : faster x1 (330 -> 290) BGRA8888 -> BGR24 : faster x1 (330 -> 290) BGRA8888 -> RGB24 : faster x1 (330 -> 290) BGRX8888 -> BGR24 : faster x1 (320 -> 290) RGB24 -> ABGR8888 : faster x1 (210 -> 140) RGB24 -> ARGB8888 : faster x1 (210 -> 130) RGB888 -> ABGR8888 : faster x1 (260 -> 150) RGB888 -> RGB24 : faster x1 (330 -> 290) RGBA8888 -> RGB24 : faster x1 (330 -> 290) RGBX8888 -> BGR24 : faster x1 (320 -> 290) RGBX8888 -> RGB24 : faster x1 (340 -> 300) I mark the issue as fixed.
The new_14.txt is still slower.
But I was expecting some improvement by using a word write, and an intermediate 4 char buffer
2257 char buf[7];
2258 char *tmp = (char *)(((size_t)buf + 3) & ~3);
2259
2264 while (height--) {
2266 DUFFS_LOOP(
2267 {
2268 Uint32 *dst32 = (Uint32*)dst;
2269 tmp[0] = src[p0];
2270 tmp[1] = src[p1];
2271 tmp[2] = src[p2];
2272 tmp[3] = src[p3];
2273 tmp[alpha_channel] = alpha;
2274 *dst32 = *(Uint32 *)tmp;
2275 src += 4;
2276 dst += 4;
2277 }, width);
Maybe the intermediate tmp[4] it's not optimized as it should be. And should be marked as
register char tmp[4]; or a global static var aligned(4) char tmp[4];
Anyway, feel free to try to optimize it if you need.
|