We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 3689 - MMX YUV renderer crash
Summary: MMX YUV renderer crash
Status: RESOLVED FIXED
Alias: None
Product: SDL
Classification: Unclassified
Component: render (show other bugs)
Version: HG 2.0
Hardware: x86 Linux
: P2 normal
Assignee: Ryan C. Gordon
QA Contact: Sam Lantinga
URL:
Keywords:
: 3479 3707 3715 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-06-30 12:59 UTC by felix
Modified: 2019-10-24 16:17 UTC (History)
6 users (show)

See Also:


Attachments
Patch (requires recent-ish GCC) (4.06 KB, patch)
2017-07-04 13:49 UTC, felix
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description felix 2017-06-30 12:59:29 UTC
The functions in src/render/SDL_yuv_mmx.c contain the following inline assembly snippet:

        /* tap dance to workaround the inability to use %%ebx at will... */
        /*  move one thing to the stack... */
        "pushl $0\n"  /* save a slot on the stack. */
        "pushl %%ebx\n"  /* save %%ebx. */
        "movl %0, %%ebx\n"  /* put the thing in ebx. */
        "movl %%ebx,4(%%esp)\n"  /* put the thing in the stack slot. */
        "popl %%ebx\n"  /* get back %%ebx (the PIC register). */

Here's how it ended up in a binary on my old laptop:

   0xb5c17dbd <ColorRGBDitherYV12MMX1X+93>:	push   $0x0
   0xb5c17dbf <ColorRGBDitherYV12MMX1X+95>:	push   %ebx
   0xb5c17dc0 <ColorRGBDitherYV12MMX1X+96>:	mov    0xc(%esp),%ebx
   0xb5c17dc4 <ColorRGBDitherYV12MMX1X+100>:	mov    %ebx,0x4(%esp)
   0xb5c17dc8 <ColorRGBDitherYV12MMX1X+104>:	pop    %ebx

Apparently the compiler, oblivious to the fact that the assembly snippet manipulates the %esp register, decided to refer to the operand via that same register instead of via %ebp (I believe -fomit-frame-pointer enables this). This causes %ebx to be loaded with the wrong value, which later leads to a null pointer dereference.

Recent GCC can use the %ebx register normally: <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47602#c16>. There is even an explicit constraint "b" for allocating it.
Comment 1 Sam Lantinga 2017-07-03 18:57:40 UTC
Good catch! Do you have a tested fix?
Comment 2 felix 2017-07-04 13:49:13 UTC
Created attachment 2787 [details]
Patch (requires recent-ish GCC)

Tested? Not really, but something like this ought to work, at least if you don't mind dropping support for older GCCs (they should fail to build). In that case you'll need to update the #if __GNUC__ > 2 checks as well, but I left that part out.
Comment 3 felix 2017-07-14 08:10:50 UTC
Comment on attachment 2787 [details]
Patch (requires recent-ish GCC)

I've been running SDL2 with this patch for a while, and I don't get any more crashes. It's just a question of which compilers you care to support at this point.
Comment 4 Sam Lantinga 2017-07-20 17:49:37 UTC
I accepted your patch. Failing to compile inline assembly on older compilers is much better than crashing with newer compilers.
https://hg.libsdl.org/SDL/rev/2ee7d2fa299b

Thanks!
Comment 5 Ozkan Sezer 2017-07-26 18:08:07 UTC
Which version(s) of gcc results in a miscompile?

The merged fix most certainly breaks older gcc, and clang as well
(see Ryan's commits 5ba02f3c5a8b and 43c7baa53681).  Can you guys
not add some ifdefs to the original code instead?
Comment 6 Ozkan Sezer 2017-07-28 10:03:15 UTC
Bug #3715 reports that even gcc-6.3 fails building.  Which
gcc version really resulted in a miscompile before?
Comment 7 felix 2017-07-30 18:16:16 UTC
The miscompiled binary is the stock Debian sid package, version 2.0.5+dfsg1-3; apparently it is built with gcc 6.4.0. The patch compiles fine with gcc 7.1.0. Apparently I underestimated how young that GCC change is...

There are some preprocessor guards already around this inline-assembly renderer; those guards could be bumped to check if __GNUC__ >= 7. I didn't include that change in my patch to make it easier to review, and I figured I'd leave the decision to the upstream how to handle this. I gotta leave them with _something_ to do ;)

Other possible solutions include making sure the inline-assembly renderer is compiled with -fno-omit-frame-pointer (this may hurt performance a little) or removing it entirely, instead judiciously decorating the pure-C renderer with __restrict__ and/or __attribute__((target_clones)) and then crossing fingers that the compiler will manage to vectorise the code on its own.
Comment 8 Ryan C. Gordon 2017-07-30 19:00:44 UTC
While it's more work, I'm inclined to think that the better solution would be to drop this code and write an SSE version with compiler intrinsics. Surely we don't _need_ either MMX or inline assembly at this point?

I'd also be interested in seeing if the C fallback is measurably slower on modern compilers than the MMX code at this point, in which case I'd recommend we just delete it and leave a FIXME to revisit with SSE later.

--ryan.
Comment 9 Sam Lantinga 2017-07-31 20:00:40 UTC
Agreed.
Comment 10 Sam Lantinga 2017-07-31 20:12:52 UTC
Interestingly, that code compiles with GCC 5.4.0 here... ?
$ $CC --version
i686-w64-mingw32-gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Comment 11 Ryan C. Gordon 2017-07-31 20:36:50 UTC
(In reply to Sam Lantinga from comment #10)
> Interestingly, that code compiles with GCC 5.4.0 here... ?
> i686-w64-mingw32-gcc (GCC) 5.4.0

Different ABI requirements on Windows?

--ryan.
Comment 12 Ryan C. Gordon 2017-08-01 03:35:54 UTC
Does anyone have a program that calls this MMX code? Some way for me to profile this?

--ryan.
Comment 13 felix 2017-08-01 13:49:31 UTC
Comment #11: Exactly: %ebx is used when generating position-independent code on x86-32 Linux, which until recently made it 'fixed' in GCC and unavailable to the register allocator (and not even possible to be declared clobbered in inline-assembly snippets).

Comment #12: ffplay with an MPEG-1 video file should do it (not all codecs will trigger it!). At least that's the way I've been hitting this bug on my Pentium III laptop with an R100-based card, for which the latest supported OpenGL version is 1.3...

...because of which I do beg you not to remove acceleration support for MMX-only CPUs: with such old hardware, I'm going to need all the acceleration I can get. (I'm kinda stuck with it for now, and I expect this may last for a while.)

Will you at least look at how __restrict__ + __attribute__((target_clones)) + -ftree-vectorize behave before reaching for SSE intrinsics?
Comment 14 Ryan C. Gordon 2017-08-02 03:39:37 UTC
> ...because of which I do beg you not to remove acceleration support for
> MMX-only CPUs: with such old hardware, I'm going to need all the
> acceleration I can get. (I'm kinda stuck with it for now, and I expect this
> may last for a while.)

I wouldn't object to MMX intrinsics, I just want to get rid of the assembly code.

--ryan.
Comment 15 Ryan C. Gordon 2017-08-08 22:33:00 UTC
*** Bug 3715 has been marked as a duplicate of this bug. ***
Comment 16 Ryan C. Gordon 2017-08-09 05:25:36 UTC
(Sorry if you get a lot of copies of this email, we're touching dozens of bug reports right now.)

Tagging a bunch of bugs as target-2.0.6.

This means we're in the final stretch for an official SDL 2.0.6 release! These are the bugs we really want to fix before shipping if humanly possible.

That being said, we don't promise to fix them because of this tag, we just want to make sure we don't forget to deal with them before we bless a final 2.0.6 release, and generally be organized about what we're aiming to ship. After some debate, we might just remove this tag again and deal with it for a later release.

Hopefully you'll hear more about this bug soon. If you have more information (including "this got fixed at some point, nevermind"), we would love to have you come add more information to the bug report when you have a moment.

Thanks!
--ryan.
Comment 17 Sam Lantinga 2017-08-11 17:23:46 UTC
We'll leave MMX assembly disabled for the 2.0.6 release.
Comment 18 Sam Lantinga 2017-08-11 20:27:32 UTC
*** Bug 3479 has been marked as a duplicate of this bug. ***
Comment 19 felix 2017-08-12 14:27:09 UTC
I did some testing on godbolt.org, and it seems that this code can be made to compile under GCC as old as 4.4.1. The catch is, whether compilation succeeds is quite sensitive to compiler options: if the source file is compiled with -fPIC, -fomit-frame-pointer must be enabled as well. Otherwise the compiler runs out of registers and compilation fails. The same caveat applies to clang.

Decorating those two functions with __attribute__((optimize("omit-frame-pointer"))) should fix this on GCC; clang doesn't support this attribute, unfortunately. It's a bit of a hack anyway.

By the way, I think bug #3707 might be a duplicate as well.
Comment 20 Sam Lantinga 2017-08-12 22:08:34 UTC
*** Bug 3707 has been marked as a duplicate of this bug. ***
Comment 21 Sylvain 2017-10-09 06:43:05 UTC
> Will you at least look at how __restrict__ + __attribute__((target_clones))
> + -ftree-vectorize behave before reaching for SSE intrinsics?

YUV/RGB conversion have been implemented in SDL_ConvertPixels (bug 3857) and they are somehow optimized with gcc vectorize.

It you are using the mmx version (ColorRGBDitherYV12MMX1X, not the 565 ...), you could try to tell how the C version runs on you side.

A quick patch to do that is to recompile SDL2 with the change in file "render/SDL_yuv_sw.c". Just reimplement the function as below:

int
SDL_SW_CopyYUVToRGB(SDL_SW_YUVTexture * swdata, const SDL_Rect * srcrect,
                     Uint32 target_format, int w, int h, void *pixels,
                     int pitch)
{
         return SDL_ConvertPixels(swdata->w, swdata->h,
                 swdata->format, swdata->planes[0], swdata->pitches[0],
                 target_format, pixels, pitch);
}


Also, you need to make sure that gcc is compile with -O3. Or/and you can add: -ftree-loop-vectorize -fopt-info-vec-optimized -fopt-info-optimized
opt-info tells if loops are vectorized or not. You should see 16 loops optimized at the end of the file.

And also you can try with: -march=native
Comment 22 Sylvain 2019-10-24 16:17:10 UTC
Closing because the code YUV ColorRGBDitherYV12MMX1X and so on, has been removed/updated