We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 4530

Summary: Classic open source engine Exult (Ultima VII) segfaults on exit with the KMSDRM backend
Product: SDL Reporter: Manuel Alfayate <redwindwanderer2>
Component: videoAssignee: Sam Lantinga <slouken>
Status: RESOLVED FIXED QA Contact: Sam Lantinga <slouken>
Severity: normal    
Priority: P2 CC: icculus, sylvain.becker
Version: HG 2.1Keywords: target-2.0.10
Hardware: All   
OS: Linux   
Attachments: Valgrind log for Exult on SDL2 on KMSDRM
test case

Description Manuel Alfayate 2019-03-01 17:53:44 UTC
Created attachment 3685 [details]
Valgrind log for Exult on SDL2 on KMSDRM

Hi there,

I noticed yesterday that the classic open source engine Exult, with code repository located here->(https://github.com/exult/exult.git) is segfaulting on quit when using with the KMSDRM backend.

I opened an issue there, but I suspect it may have to do with the KMSDRM backend, and not with the engine, but I am not sure, and after investigating for A LOT of hours, I come here for help.

It seems to be segfaulting on different points of the SDL2 quit sequence, varying with each engine session.
I am attaching a valgrind log in case someone can take a look and find a solution.

It usually segfaults on KMSDRM_WarpMouseGlobal() because it enters a section that's conditioned to a pointer variable not being NULL. Well, there's no way this variable can't be NULL, yet it enters that section sometimes.

I'd like someone with better knowledge of SDL2 and C memory problems builds SDL2 in KMSDRM mode and tests Exult with it, so I can fix the KMSDRM backend *if that's the cause*.

Thanks!
Comment 1 Sylvain 2019-03-08 17:24:08 UTC
Most of the log is video driver and can be ignore (really ?).

The part which is very likely the issue is: 

==28799== 
==28799== Invalid read of size 8
==28799==    at 0x4A52FFA: KMSDRM_WarpMouseGlobal (SDL_kmsdrmmouse.c:425)
==28799==    by 0x4A52FB4: KMSDRM_WarpMouse (SDL_kmsdrmmouse.c:415)
==28799==    by 0x4888A77: SDL_WarpMouseInWindow_REAL (SDL_mouse.c:695)
==28799==    by 0x4941E82: SDL_RestoreMousePosition (SDL_video.c:1177)
==28799==    by 0x4942246: SDL_UpdateFullscreenMode (SDL_video.c:1344)
==28799==    by 0x4944A5E: SDL_HideWindow_REAL (SDL_video.c:2174)
==28799==    by 0x494688C: SDL_DestroyWindow_REAL (SDL_video.c:2696)
==28799==    by 0x4946C3A: SDL_VideoQuit_REAL (SDL_video.c:2821)
==28799==    by 0x4862166: SDL_QuitSubSystem_REAL (SDL.c:318)
==28799==    by 0x48622A4: SDL_Quit_REAL (SDL.c:378)
==28799==    by 0x487A5C0: SDL_Quit (SDL_dynapi_procs.h:89)
==28799==    by 0x52B142B: __run_exit_handlers (exit.c:108)
==28799==  Address 0x83417c8 is 8 bytes inside a block of size 16 free'd
==28799==    at 0x483897B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28799==    by 0x48DA2B0: SDL_free_REAL (SDL_malloc.c:5372)
==28799==    by 0x4A52F8B: KMSDRM_FreeCursor (SDL_kmsdrmmouse.c:406)
==28799==    by 0x4888858: SDL_MouseQuit (SDL_mouse.c:612)
==28799==    by 0x4946C0E: SDL_VideoQuit_REAL (SDL_video.c:2813)
==28799==    by 0x4862166: SDL_QuitSubSystem_REAL (SDL.c:318)
==28799==    by 0x48622A4: SDL_Quit_REAL (SDL.c:378)
==28799==    by 0x487A5C0: SDL_Quit (SDL_dynapi_procs.h:89)
==28799==    by 0x52B142B: __run_exit_handlers (exit.c:108)
==28799==    by 0x52B1559: exit (exit.c:139)
==28799==    by 0x52910A1: (below main) (libc-start.c:342)

What happens:

First SDL_MouseQuit() which free mouse 
Then SDL_VideoQuit() which call KMSDRM_WarpMouseGlobal and invalid read.

KMSDRM_WarpMouseGlobal, which does:
 if (mouse != NULL && mouse->cur_cursor != NULL && mouse->cur_cursor->driverdata != NULL) {

mouse->cur_cursor is freed, but isn't cleared.

Patch: clear "mouse->cur_cursor" in SDL_QuitMouse()

--- a/src/events/SDL_mouse.c	Sat Feb 23 09:36:56 2019 +0100
+++ b/src/events/SDL_mouse.c	Fri Mar 08 18:18:59 2019 +0100
@@ -607,6 +607,7 @@
         cursor = next;
     }
     mouse->cursors = NULL;
+    mouse->cur_cursor = NULL;
 
     if (mouse->def_cursor && mouse->FreeCursor) {
         mouse->FreeCursor(mouse->def_cursor);


Let me know if this works
Comment 2 Manuel Alfayate 2019-03-10 13:26:49 UTC
(In reply to Sylvain from comment #1)
> Most of the log is video driver and can be ignore (really ?).
> 
> The part which is very likely the issue is: 
> 
> ==28799== 
> ==28799== Invalid read of size 8
> ==28799==    at 0x4A52FFA: KMSDRM_WarpMouseGlobal (SDL_kmsdrmmouse.c:425)
> ==28799==    by 0x4A52FB4: KMSDRM_WarpMouse (SDL_kmsdrmmouse.c:415)
> ==28799==    by 0x4888A77: SDL_WarpMouseInWindow_REAL (SDL_mouse.c:695)
> ==28799==    by 0x4941E82: SDL_RestoreMousePosition (SDL_video.c:1177)
> ==28799==    by 0x4942246: SDL_UpdateFullscreenMode (SDL_video.c:1344)
> ==28799==    by 0x4944A5E: SDL_HideWindow_REAL (SDL_video.c:2174)
> ==28799==    by 0x494688C: SDL_DestroyWindow_REAL (SDL_video.c:2696)
> ==28799==    by 0x4946C3A: SDL_VideoQuit_REAL (SDL_video.c:2821)
> ==28799==    by 0x4862166: SDL_QuitSubSystem_REAL (SDL.c:318)
> ==28799==    by 0x48622A4: SDL_Quit_REAL (SDL.c:378)
> ==28799==    by 0x487A5C0: SDL_Quit (SDL_dynapi_procs.h:89)
> ==28799==    by 0x52B142B: __run_exit_handlers (exit.c:108)
> ==28799==  Address 0x83417c8 is 8 bytes inside a block of size 16 free'd
> ==28799==    at 0x483897B: free (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==28799==    by 0x48DA2B0: SDL_free_REAL (SDL_malloc.c:5372)
> ==28799==    by 0x4A52F8B: KMSDRM_FreeCursor (SDL_kmsdrmmouse.c:406)
> ==28799==    by 0x4888858: SDL_MouseQuit (SDL_mouse.c:612)
> ==28799==    by 0x4946C0E: SDL_VideoQuit_REAL (SDL_video.c:2813)
> ==28799==    by 0x4862166: SDL_QuitSubSystem_REAL (SDL.c:318)
> ==28799==    by 0x48622A4: SDL_Quit_REAL (SDL.c:378)
> ==28799==    by 0x487A5C0: SDL_Quit (SDL_dynapi_procs.h:89)
> ==28799==    by 0x52B142B: __run_exit_handlers (exit.c:108)
> ==28799==    by 0x52B1559: exit (exit.c:139)
> ==28799==    by 0x52910A1: (below main) (libc-start.c:342)
> 
> What happens:
> 
> First SDL_MouseQuit() which free mouse 
> Then SDL_VideoQuit() which call KMSDRM_WarpMouseGlobal and invalid read.
> 
> KMSDRM_WarpMouseGlobal, which does:
>  if (mouse != NULL && mouse->cur_cursor != NULL &&
> mouse->cur_cursor->driverdata != NULL) {
> 
> mouse->cur_cursor is freed, but isn't cleared.
> 
> Patch: clear "mouse->cur_cursor" in SDL_QuitMouse()
> 
> --- a/src/events/SDL_mouse.c	Sat Feb 23 09:36:56 2019 +0100
> +++ b/src/events/SDL_mouse.c	Fri Mar 08 18:18:59 2019 +0100
> @@ -607,6 +607,7 @@
>          cursor = next;
>      }
>      mouse->cursors = NULL;
> +    mouse->cur_cursor = NULL;
>  
>      if (mouse->def_cursor && mouse->FreeCursor) {
>          mouse->FreeCursor(mouse->def_cursor);
> 
> 
> Let me know if this works

Thanks for your input on this, really. Having the KMSDRM driver in good shape is very important, I believe! :)

This worked, yes. No more crashes in KMSDRM_WarpMouseGlobal()!

However, there are other crashes. This one seems impossible to debug for me, too, so I would be very thankful if you gave me ideas on this one.
It's a segfault on KMSDRM_VideoQuit(), when this call is done:

KMSDRM_gbm_device_destroy(vdata->gbm);

This seems to be the typical case where I am destroying something already destroyed, but the ONLY other place where vdata->gbm is destroyed is in KMSDRM_VideoInit(), on the "cleaup:" label.
There was a missing return before the "cleanup:" label, so I added "return ret;" after the "KMSDRM_InitMouse(_this);" call there. But I am STILL getting the segfault on KMSDRM_VideoQuit():  


[code]
Thread 1 "exult" received signal SIGSEGV, Segmentation fault.
0x00007ffff71854a8 in malloc_consolidate (av=av@entry=0x7ffff72d9c40 <main_arena>)
    at malloc.c:4442
4442    malloc.c: No such file or directory.
#0  0x00007ffff71854a8 in malloc_consolidate (av=av@entry=0x7ffff72d9c40 <main_arena>)
    at malloc.c:4442
#1  0x00007ffff718758c in _int_free (av=0x7ffff72d9c40 <main_arena>, p=0x555555bcf8d0, 
    have_lock=<optimized out>) at malloc.c:4348
#2  0x00007ffff355e8c8 in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#3  0x00007ffff3593d19 in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#4  0x00007ffff3525bdf in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#5  0x00007ffff3db5740 in ?? () from /usr/lib/x86_64-linux-gnu/libgbm.so.1
#6  0x00007ffff7f2cb42 in KMSDRM_VideoQuit (_this=0x55555583eb60)
    at /home/manuel/src/SDL2-2.0.9/src/video/kmsdrm/SDL_kmsdrmvideo.c:536
#7  0x00007ffff7e1ec74 in SDL_VideoQuit_REAL ()
    at /home/manuel/src/SDL2-2.0.9/src/video/SDL_video.c:2823
#8  0x00007ffff7d3a167 in SDL_QuitSubSystem_REAL (flags=62001)
    at /home/manuel/src/SDL2-2.0.9/src/SDL.c:318
#9  0x00007ffff7d3a2a5 in SDL_Quit_REAL () at /home/manuel/src/SDL2-2.0.9/src/SDL.c:378
#10 0x00007ffff7d525c1 in SDL_Quit ()
    at /home/manuel/src/SDL2-2.0.9/src/dynapi/SDL_dynapi_procs.h:89
#11 0x00007ffff713942c in __run_exit_handlers (status=0, listp=0x7ffff72d9718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#12 0x00007ffff713955a in __GI_exit (status=<optimized out>) at exit.c:139
#13 0x00007ffff71190a2 in __libc_start_main (main=0x5555555a6080 <main>, argc=3, 
    argv=0x7fffffffe3d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffe3c8) at ../csu/libc-start.c:342
#14 0x00005555555a8eea in _start ()
frame 6
#6  0x00007ffff7f2cb42 in KMSDRM_VideoQuit (_this=0x55555583eb60)
    at /home/manuel/src/SDL2-2.0.9/src/video/kmsdrm/SDL_kmsdrmvideo.c:536
536             KMSDRM_gbm_device_destroy(vdata->gbm);
print vdata->gbm
$1 = (struct gbm_device *) 0x55555583b5e0
[/code]

So this is REALLY the first time KMSDRM_gbm_device_destroy() is called, because of course KMSDRM_VideoInit() returns before the "cleanup:" label.
Any ideas on what could be going on here?
Comment 3 Sylvain 2019-03-10 17:24:20 UTC
Just wondering, how do you try kmsdrm ? need a RPY ?


Really not sure but some guess in KMSDRM_VideoQuit:

- maybe SDL_GL_UnloadLibrary() should be the last thing to do

- if (vdata->drm_fd > 0

  drm_fp 0 should also be checked no ? it's a valid file descriptor. Only -1 is invalid ?
 
  (In KMSDRM_VideoQuit, that means save_crtc isn't restored)
  

- also, it seems to be call by through an at_exit(), maybe it's not the best place and it's always confusing how it gets called. Maybe it should be done by calling SDL_Quit before the main returns.

- I would also run Valgrind to see what happens.


Let me know if this works
Comment 4 Sylvain 2019-03-11 14:29:35 UTC
Added the missing "return ret;" in VideoInit();
I'm surprised it could even start without this.

https://hg.libsdl.org/SDL/rev/8e52484226af
Comment 5 Sylvain 2019-03-11 14:32:59 UTC
SDL_MouseQuit(): clear mouse->cur_cursor
in https://hg.libsdl.org/SDL/rev/0010b9a064d3
Comment 6 Manuel Alfayate 2019-03-11 23:32:30 UTC
(In reply to Sylvain from comment #3)
> Just wondering, how do you try kmsdrm ? need a RPY ?
> 
> 
> Really not sure but some guess in KMSDRM_VideoQuit:
> 
> - maybe SDL_GL_UnloadLibrary() should be the last thing to do
> 
> - if (vdata->drm_fd > 0
> 
>   drm_fp 0 should also be checked no ? it's a valid file descriptor. Only -1
> is invalid ?
>  
>   (In KMSDRM_VideoQuit, that means save_crtc isn't restored)
>   
> 
> - also, it seems to be call by through an at_exit(), maybe it's not the best
> place and it's always confusing how it gets called. Maybe it should be done
> by calling SDL_Quit before the main returns.
> 
> - I would also run Valgrind to see what happens.
> 
> 
> Let me know if this works

Thanks a lot for these ideas, Sylvain!

I tried them and things worked (somehow!):

In KMSDRM_VideoQuit():

-Putting the SDL_GL_UnloadLibrary() as the last thing to do did not make a difference regarding segfaults, but it seems a good idea to me, too, if you want to move it.

-I'm not sure what you mean about the if (vdata->drm_fd > 0): in KMSDRM_VideoQuit(), it's "if (vdata->drm_fd => 0)", because the the file descriptor is an integer that, as far as I know, could be zero too. Anyway, nothing to change there, as far as I can understand.

NOW, what did things work, NO MORE SEGFAULTS ON QUIT: I moved SDL_Quit() from the AtExit to the end of the main() function on Exult. And voila!! No more segfaults, no matter how hard I try :D

You don't need a Raspberry Pi to test the SDL2 KMSDRM video driver: that's precisely the good thing about KMSDRM, it works on any X86 workstation and on a Raspberry Pi.
Just build SDL2 with:

./configure --prefix=/usr --enable-video-kmsdrm --enable-video-opengles
make 
sudo make install
(you should end up with SDL2 libs in /usr/lib)

Beware you may need to manually remove your distro's libSDL2 from /usr/lib/x86_64-linux-gnu, or your programs will keep trying to use that instead of your freshly built KMSDRM-enabled version.
And you need to be in a TTY console ouside the Xorg server to use it! CTRL+ALT+F2 should give you one.
If you run a KMSDRM program under GDB and it segfaults, GDB will be waiting for your input while the program is frozen in fullscreen, so debug via SSH if you try this.

So, two questions:
1) What do you think about the solution of moving SDL_Quit() to the end of main()? If it's a good idea (and it seems to be) then why doesn't it crash when using the SDL2 X11 backend? 
2) Could you *please* give it a try under Valgrind and give me ideas and opinions? Feel free to make any fixes, of course! (Now that you know that you can test this on your very computer, no Pi needed!)
Comment 7 Sylvain 2019-03-12 13:33:48 UTC
I gave a try and got a non initialized memory warning for ioctcl. Fixed at https://hg.libsdl.org/SDL/rev/141e8355078d

Otherwise:
- I can run a KMSDRM helloworld or my apps. But I have no crash in the end. Using atexit(SDL_Quit) or not.
  Not sure what is exactly the difference when it's called from atexit.

- For me, mouse and keyboards doesn't work. But joysticks does.

- Exult didn't work. It starts but remain white screen.

About KMSDRM_VideoQuit() and "if(vdata->drm_fd > 0"
- Sorry I try to explain:  
   drm_fd -1 is invalid and drm_fd >= 0 are correct fd
   so checking "vdata->drm_fd > 0" is at first strange: you want a valid fd, but not the one which has value 0. (maybe it would make sense)
Comment 8 Manuel Alfayate 2019-03-12 15:16:56 UTC
(In reply to Sylvain from comment #7)
> I gave a try and got a non initialized memory warning for ioctcl. Fixed at
> https://hg.libsdl.org/SDL/rev/141e8355078d
> 
> Otherwise:
> - I can run a KMSDRM helloworld or my apps. But I have no crash in the end.
> Using atexit(SDL_Quit) or not.
>   Not sure what is exactly the difference when it's called from atexit.
> 
> - For me, mouse and keyboards doesn't work. But joysticks does.
> 
> - Exult didn't work. It starts but remain white screen.
> 
> About KMSDRM_VideoQuit() and "if(vdata->drm_fd > 0"
> - Sorry I try to explain:  
>    drm_fd -1 is invalid and drm_fd >= 0 are correct fd
>    so checking "vdata->drm_fd > 0" is at first strange: you want a valid fd,
> but not the one which has value 0. (maybe it would make sense)

Thanks A LOT about that! :)

Exult does work here. White screen? How strange... Never saw that! What video chip does your test computer have? I am on Intel here ("Intel Corporation HD Graphics 620" according to lspci) and it works fine. It also works on the VC4 (Raspberry Pi). Do other SDL2 games work in KMSDRM mode for you? Like SDLPop (https://github.com/NagyD/SDLPoP), for example. 

About the keyboard and mouse not working: keep in mind this X-less solution uses udev for input reading. So put yourself on the input group, or try this:
https://docs.libretro.com/guides/input-joypad-drivers/#setting-up-udev-permissions

That should make mouse and keyboard work.

Now, about the (vdata->drm_fd > 0) comparision, sorry, now I understand. And yes, it makes sense to do vdata->drm_fd >= 0) instead. Since you can do commits directly, feel free to make this one :)
Comment 9 Sylvain 2019-03-13 07:56:02 UTC
I haven't yet checked more precisely on kmsdrm, but on x11 for instance, it displays a splash screen asking to provide the game static data. Not sure how to get them. 

btw, you're the author of the kmsdrm code ? (from bug 3690), so you're more aware of the modification that should me done on this side ? Can you provide test patch ?
Comment 10 Manuel Alfayate 2019-03-13 13:45:30 UTC
(In reply to Sylvain from comment #9)
> I haven't yet checked more precisely on kmsdrm, but on x11 for instance, it
> displays a splash screen asking to provide the game static data. Not sure
> how to get them. 
> 
> btw, you're the author of the kmsdrm code ? (from bug 3690), so you're more
> aware of the modification that should me done on this side ? Can you provide
> test patch ?

I'm not the only author of this: I just polished the work done by Sigmaris (Hugh Cole-Baker), who did the bulk of the driver, so there are things outside my understanding.

As for the change, I just meant that you change:

if(vdata->drm_fd > 0 && vdata->saved_conn_id > 0) {

for:

if(vdata->drm_fd >= 0 && vdata->saved_conn_id > 0) {

in KMSDRM_VideoQuit().

That's according to your previous comments, where I think you are right about the right comparision being >=.
Comment 11 Sylvain 2019-03-13 13:56:05 UTC
Ok. So I fixed this in https://hg.libsdl.org/SDL/rev/958835bd12c2
at two places. Please double-check it!
Comment 12 Manuel Alfayate 2019-03-15 11:56:54 UTC
(In reply to Sylvain from comment #11)
> Ok. So I fixed this in https://hg.libsdl.org/SDL/rev/958835bd12c2
> at two places. Please double-check it!

Thanks! It seems good!

Did you get Exult to work in KMSDRM mode? Other games like SDLPop work for you in KMSDRM mode?
Comment 13 Sylvain 2019-03-15 12:17:55 UTC
Thanks to your link, I got the mouse and keyboard working.

I tried my apps and all is correct. I also try to use atexit(SDL_Quit). And it also works fine.
-> Please double if you still have the issue ? 
run with valgrind gdb.


I can't run exult because it requires the mods ? / eg games datas, which I haven't.

I think it's the same with SDLPop. (But do you have the atexit issue with it?)
Comment 14 Manuel Alfayate 2019-03-15 13:57:17 UTC
@Sylvain:

I don't have any problems with SDLPop, since it does not use atexit(), but it calls SDL_Quit() properly instead.
I just wanted to know if SDLPop worked for you in KMSDRM mode, since it works great here.

I have uploaded the datas for both SDLPop and Exult, so you can try them:

https://www.dropbox.com/s/bfr8wy005lck42b/sdlpop_datas.zip?dl=0
https://www.dropbox.com/s/mvo6j3w0filsixv/exult_datas.zip?dl=0

In Exult, with latest SDL2 revision, I still have the segfault on quit but only if I let it use "atexit()" to call SDL_Quit(). If I move SDL_Quit() to the end of the main function (in exult.cc), there are no problems at all.

I have ran exult (with the original atexit(SDL_Quit), which causes the segfault on quit) with both valgrind and gdb: here are the results.

file:///home/manuel/ultima7/exult_valgrind_log002.txt
file:///home/manuel/ultima7/exult_gdb_log002.txt

What could be causing this, is beyond my understanding.
Comment 15 Sylvain 2019-03-15 14:47:30 UTC
Thanks,
Can you upload the valgrind files ?
Comment 16 Manuel Alfayate 2019-03-15 15:11:43 UTC
(In reply to Sylvain from comment #15)
> Thanks,
> Can you upload the valgrind files ?

What valgrind files? I copied all valgrind output to this file:
file:///home/manuel/ultima7/exult_valgrind_log002.txt
Comment 17 Manuel Alfayate 2019-03-15 15:18:36 UTC
(In reply to Manuel Alfayate from comment #16)
> (In reply to Sylvain from comment #15)
> > Thanks,
> > Can you upload the valgrind files ?
> 
> What valgrind files? I copied all valgrind output to this file:
> file:///home/manuel/ultima7/exult_valgrind_log002.txt

Oh, sorry!!! How stupid on my part... Here they are! 

https://www.dropbox.com/s/ges504rcz24vwiv/exult_valgrind_log002.txt?dl=0
https://www.dropbox.com/s/5twgfltscfjwriq/exult_gdb_log002.txt?dl=0

Sorry! :D
Comment 18 Sylvain 2019-03-15 20:47:08 UTC
Thanks, I got exult running. It also crashes at the end. But I can't run gdb. As you said, I should connect with ssh and I have nothing setup like that. 

Last message is "SDL restoring keyboard".
After putting some printf, it seems to happens in: 
src/core/linux/SDL_evdev_kbd.c
in the "raise(sig);"
and if I comment this is even after. Maybe there are threading issues, and this is not related.


Can you try to add an "atexit(SDL_Quit)" into SDLPop and see if it crashes the same way ?
Comment 19 Manuel Alfayate 2019-03-15 23:52:56 UTC
(In reply to Sylvain from comment #18)
> Thanks, I got exult running. It also crashes at the end. But I can't run
> gdb. As you said, I should connect with ssh and I have nothing setup like
> that. 
> 
> Last message is "SDL restoring keyboard".
> After putting some printf, it seems to happens in: 
> src/core/linux/SDL_evdev_kbd.c
> in the "raise(sig);"
> and if I comment this is even after. Maybe there are threading issues, and
> this is not related.
> 
> 
> Can you try to add an "atexit(SDL_Quit)" into SDLPop and see if it crashes
> the same way ?

I have tried moving the SDL_Quit() call in SDLPop to atexit(SDL_Main), and IT DOES CRASH IN THE SAME WAY. 
So... This is what was going on. Confirmed. But why? and why ONLY on KMSDRM and not on X11?
Comment 20 Manuel Alfayate 2019-03-15 23:54:45 UTC
(In reply to Manuel Alfayate from comment #19)
> (In reply to Sylvain from comment #18)
> > Thanks, I got exult running. It also crashes at the end. But I can't run
> > gdb. As you said, I should connect with ssh and I have nothing setup like
> > that. 
> > 
> > Last message is "SDL restoring keyboard".
> > After putting some printf, it seems to happens in: 
> > src/core/linux/SDL_evdev_kbd.c
> > in the "raise(sig);"
> > and if I comment this is even after. Maybe there are threading issues, and
> > this is not related.
> > 
> > 
> > Can you try to add an "atexit(SDL_Quit)" into SDLPop and see if it crashes
> > the same way ?
> 
> I have tried moving the SDL_Quit() call in SDLPop to atexit(SDL_Main), and
> IT DOES CRASH IN THE SAME WAY. 
> So... This is what was going on. Confirmed. But why? and why ONLY on KMSDRM
> and not on X11?

Also, the "restoring keyboard" message is not related to the problem. It's what modern SDL2 does when it segfaults (if it wouldn't do it, keyboard in X-less mode would be blocked after any segfault, so it's a good thing).
Comment 21 Sylvain 2019-03-16 14:08:46 UTC
Created attachment 3698 [details]
test case

Here's a very small testcase. I'm not sure of what happens.
but to reproduce the issue, it seems to need:

1) a user atexit before CreateWindow, or only before SDL_init. 

2) that the user atexit does the SDL_DestroyWindow() (via SDL_Quit), but not the SDL_DestroyRenderer()

because 1), maybe it conflicts with the other atexit: kbd_cleanup_atexit
in src/core/linux/SDL_evdev_kbd.c which is probably installed in the CreateWindow
when calling: CreateWindow -> KMSDRM_VideoInit -> SDL_EVDEV_Init > SDL_EVDEV_kbd_init
Comment 22 Sylvain 2019-03-16 20:10:49 UTC
I looked again, and it doesn't seem to conflict with kbd_cleanup_atexit. (If I comment it out, it still crash).

I think it's more like: 
atexit(SDL_Quit) is called before/after shared libraries destructor, whether it's called before/after those shared libaries are loaded.

If so, solution would just to call atexit after SDL_CreateWindow (maybe even after SDL_CreateRenderer).

Maybe you can try to move the atexit after SDL_CreateWindow/Renderer in exult and sdlpop ?


Valgrind log is:

my_at_exist() is doing an invalid write, because egl context was freed first, by exit/__run_exit_handlers (libEGL_mesage).
And it was allocated first, KMSDRM_GLES_LoadLibrar / SDL_GL_LoadLibrary


INFO: my_at_exit
==7337== Invalid write of size 8
==7337==    at 0x12520E05: ??? (in /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0)
==7337==    by 0x12521524: eglMakeCurrent (in /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0)
==7337==    by 0x52AFA1B: ??? (in /usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0)
==7337==    by 0x49737BF: SDL_EGL_MakeCurrent (SDL_egl.c:800)
==7337==    by 0x4A7C1F0: KMSDRM_GLES_MakeCurrent (SDL_kmsdrmopengles.c:185)
==7337==    by 0x4987A63: SDL_GL_MakeCurrent_REAL (SDL_video.c:3509)
==7337==    by 0x4988FB7: SDL_GL_DeleteContext_REAL (SDL_video.c:3604)
==7337==    by 0x48D28BE: GL_DestroyRenderer (SDL_render_gl.c:1452)
==7337==    by 0x48C93FF: SDL_DestroyRenderer_REAL (SDL_render.c:3216)
==7337==    by 0x4892F83: SDL_DestroyRenderer (SDL_dynapi_procs.h:379)
==7337==    by 0x4011E8: my_at_exit (main_bug_kmsdrm_2.c:12)
==7337==    by 0x531A42B: __run_exit_handlers (exit.c:108)

==7337==  Address 0x122fc700 is 32 bytes inside a block of size 48 free'd
==7337==    at 0x483897B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7337==    by 0x1252B3F1: ??? (in /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0)
==7337==    by 0x1252C8A3: ??? (in /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0)
==7337==    by 0x531A42B: __run_exit_handlers (exit.c:108)
==7337==    by 0x531A559: exit (exit.c:139)
==7337==    by 0x52FA0A1: (below main) (libc-start.c:342)

==7337==  Block was alloc'd at
==7337==    at 0x4839775: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7337==    by 0x1252B41F: ??? (in /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0)
==7337==    by 0x12520DF8: ??? (in /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0)
==7337==    by 0x1252113E: eglGetProcAddress (in /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0)
==7337==    by 0x52B33CB: ??? (in /usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0)
==7337==    by 0x52B39D9: ??? (in /usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0)
==7337==    by 0x64990C6: __pthread_once_slow (pthread_once.c:116)
==7337==    by 0x52B3B8E: ??? (in /usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0)
==7337==    by 0x52AF7D0: ??? (in /usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0)
==7337==    by 0x4972691: SDL_EGL_LoadLibrary (SDL_egl.c:421)
==7337==    by 0x4A7BBA0: KMSDRM_GLES_LoadLibrary (SDL_kmsdrmopengles.c:40)
==7337==    by 0x4981D8D: SDL_GL_LoadLibrary_REAL (SDL_video.c:2867)
Comment 23 Manuel Alfayate 2019-03-20 00:24:07 UTC
@Sylvain:

In both SDLPop and Exult:
I have moved atexit(SDL_Quit) after SDL_CreateWindow() and it still segfaults.
BUT moving it after SDL_CreateRenderer(), it does NOT segfault!

Does it help to solve the puzzle on what's going on here?
Comment 24 Sylvain 2019-03-20 08:57:33 UTC
I am not sure how it should be fixed ...

- ask exult/sdlpop to move their atexit but it isn't a fix.

- maybe there is an API to see if mesa has already called its atexit handlers ..
  or to see if egl drivers are already terminated ...

See:
https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/eglglobals.c

https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/egldisplay.c#L184
https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/egldevice.c#L53
https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/egldriver.c#L121


- or maybe could do some workaround and add immediatly an atexit in SDL_Init(), as soon as it starts. So we can first the exit ...

I'd say ask some help on the mesa mailing list :/
Comment 25 Manuel Alfayate 2019-03-20 12:53:58 UTC
(In reply to Sylvain from comment #24)
> I am not sure how it should be fixed ...
> 
> - ask exult/sdlpop to move their atexit but it isn't a fix.
> 
> - maybe there is an API to see if mesa has already called its atexit
> handlers ..
>   or to see if egl drivers are already terminated ...
> 
> See:
> https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/eglglobals.
> c
> 
> https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/egldisplay.
> c#L184
> https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/egldevice.
> c#L53
> https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/egl/main/egldriver.
> c#L121
> 
> 
> - or maybe could do some workaround and add immediatly an atexit in
> SDL_Init(), as soon as it starts. So we can first the exit ...
> 
> I'd say ask some help on the mesa mailing list :/

So it's not KMSDRM's SDL2 fault, is it? It's a MESA problem as I understand your comments.
Comment 26 Sylvain 2019-03-20 15:59:23 UTC
Not really a mesa problem, but an issue that comes up using both atexit(SDL_Quit) + mesa which also has its atexit().

According to my previous valgrind log: mesa -through an atexit- cleans up what wasn't cleaned up when the application exits, which sounds ok. 
But, what happens is that exult also wants to clean up those things with an atexit() 
(the first thing is egl_display, allocated by SDL_EGL_LoadLibrary (SDL_egl.c:421), but what crashes it actually when re-destroying gbm).

if SDL2 cleans up first, then mesa knows it and won't try to clean up it again.
But if mesa cleans up first, SDL2 doesn't know it, and it tries to release it a second time, and that start making invalid writes.
Comment 27 Ryan C. Gordon 2019-03-22 14:29:44 UTC
> if SDL2 cleans up first, then mesa knows it and won't try to clean up it
> again.
> But if mesa cleans up first, SDL2 doesn't know it, and it tries to release
> it a second time, and that start making invalid writes.

This isn't an immediate solution, but perhaps Mesa shouldn't be using atexit() but rather a function marked __attribute__((destructor)) so it runs on library unload instead of process termination?

--ryan.
Comment 28 Sylvain 2019-03-23 13:07:58 UTC
Yes, it sounds also a possibility.

I notice there is a Mesa function: _eglCheckDisplayHandle(EGLDisplay dpy); 
but I am not sure if we can access it since SDL is a generic layer.

I'll write a message to the mailing list to have some advice
Comment 29 Ryan C. Gordon 2019-05-18 18:48:53 UTC
Tagging a bunch of bugs with "target-2.0.10" so we have a clear list of things to address before a 2.0.10 release.

Please note that "addressing" one of these bugs might mean deciding to defer on it until after 2.0.10, or resolving it as WONTFIX, etc. This is just here to tell us we should look at it carefully, and soon.

If you have new information or feedback on this issue, this is a good time to add it to the conversation, as we're likely to be paying attention to this specific report in the next few days/weeks.

Thanks!

--ryan.
Comment 30 Sylvain 2019-05-20 18:53:48 UTC
Answer I got from mesa mailing list was simply not to use atexit().
Comment 31 Ryan C. Gordon 2019-05-24 19:23:47 UTC
I think this is a bad response from Mesa, but I also don’t think this is SDL’s bug: at this point, it’s either Mesa’s or Exult’s, depending on where you are standing.

Should we close this?
Comment 32 Sylvain 2019-05-24 19:36:03 UTC
Yes, I think so, I'm closing it:

Exult issue is here:
https://github.com/exult/exult/issues/30

Mesa message and reply here:
https://lists.freedesktop.org/archives/mesa-users/2019-March/001519.html