| Summary: | Concurrency issues in Android backend | ||
|---|---|---|---|
| Product: | SDL | Reporter: | Andreas Falkenhahn <andreas> |
| Component: | video | Assignee: | Sam Lantinga <slouken> |
| Status: | WAITING --- | QA Contact: | Sam Lantinga <slouken> |
| Severity: | major | ||
| Priority: | P2 | CC: | etc0de, superfury, sylvain.becker |
| Version: | 2.0.8 | ||
| Hardware: | All | ||
| OS: | Android (All) | ||
| Attachments: | test-case | ||
Yep, these are definitely issues. I don't have an Android environment set up right now, so do you want to submit a tested patch that addresses these? Unfortunately, I currently don't have enough time to work on a fix but maybe in a few weeks. I am preparing a fix for this issue because this can clearly occur! I think this is fixed by: https://hg.libsdl.org/SDL/rev/93771c30420b Maybe there are corners cases remaining ... Previous commits are no-op, just for simplification https://hg.libsdl.org/SDL/rev/219a154f54cb https://hg.libsdl.org/SDL/rev/6053ab61996a Created attachment 3558 [details]
test-case
Here's a testcase,
Create-destroy window/renderer multiple times, adding also some timing delays.
To use it, you need some modification:
1/ You can deactivate semaphores by commenting out its creation in
src/core/android/SDL_android.c
(eg: Android_ActivitySem = SDL_CreateSemaphore(1); )
It will remain NULL, and be ignored by SDL_SemWait()/SDL_SemPost()
2/ You can add a delay in the java side
In 'src/video/android/SDL_androidtouch.c' file
In 'Android_OnTouch()' function
SDL_Log("Wait <...");
SDL_Delay(500);
Android_GetWindowCoordinates(window, 1, 1, &window_x, &window_y);
SDL_Log(" ...> Done (window is %d x %d)", window_x, window_y);
test 1:
It shows concurrency issues :
- the window size gets invalid sometimes. Because the window is created/destroyed during the delay.
(well, this is the purpose of this test-case ...)
test 2:
At the end, a crash is reproducible (and it seems to be the normal case).
If you do quickly:
touch screen several time (add some lag in the main loop)
press back to quit
touch screen several time -> Crash
(in SetMouseFocus() .. WINDOWENTER_EVENT, because SDL_VideoDevice "_this" became null)
Update: use SDL_Mutex instead of SDL_Semaphore (which were used as mutex) https://hg.libsdl.org/SDL/rev/c2e223dc5da9 Another concurrency issue: eglSwapBuffers reports an error of EGL_BAD_SURFACE https://hg.libsdl.org/SDL/rev/e6130a4cca7a Occurs when application goes to background: - Java activity is destroying SurfaceView holder and "egl_surface" (in onNativeSurfaceDestroyed()) - While native thread is in Android_GLES_SwapWindow(), prepared to call SDL_EGL_SwapBuffers() The error is "call to eglSwapBuffers failed, reporting an error of EGL_BAD_SURFACE" It can be reproduced easily by adding a SDL_Delay(100) at the beginning of SDL_EGL_SwapBuffers(), and putting the application into background What about this code in android_egl_context_backup()?
/* Keep a copy of the EGL Context so we can try to restore it when we resume */
SDL_WindowData *data = (SDL_WindowData *) window->driverdata;
data->egl_context = SDL_GL_GetCurrentContext();
/* We need to do this so the EGLSurface can be freed */
SDL_GL_MakeCurrent(window, NULL);
It seems to be called when onPause() is called but not on the UI thread but on the SDL thread. That's why it could happen that onSurfaceDestroyed() gets to be called before android_egl_context_backup(). Then android_egl_context_backup() won't be able to backup the context because it is NULL already (it is set to NULL in onNativeSurfaceDestroyed). So maybe the context should be backed up in onNativeSurfaceDestroyed(), somewhere before here:
if (data->egl_surface != EGL_NO_SURFACE) {
SDL_EGL_MakeCurrent(_this, NULL, NULL);
SDL_EGL_DestroySurface(_this, data->egl_surface);
data->egl_surface = EGL_NO_SURFACE;
}
But I'm not sure...
Yes android_egl_context_backup() is called from the SDL thread. Just before the main loop in the user application blocks. Yes, when going to backgroung for instance, usually you have android_egl_context_backup() called in the SDL thread. then onNativeSurfaceDestroyed() called from the UI thread. If you add a SDL_Delay just at the begining of Android_PumpEvents(), it becomes slower you can have the inverse. 1/ onNativeSurfaceDestroyed, then 2/ android_egl_context_backup(). But, it doesn't seem to be an issue .. I see no error/nor black screen. I see no SDL_RENDER_DEVICE_RESET neither in _restore(). Since android_egl_context_backup() is doing "data->egl_context = SDL_GL_GetCurrentContext();" Isn't this ok ? Does it restore correctly in that case? I think so: .. I see no error/nor black screen. I see no SDL_RENDER_DEVICE_RESET neither in _restore(). Or how should I check ? I tried with an app. And also with testgles.c Maybe it's harmless then. I just thought it's a little odd to backup the context on the SDL thread instead of in onNativeSurfaceDestroyed() which looks like the natural place to do this. It also would eliminate one more switch between the UI and SDL thread which I think would be quite desirable because it reduces the multithreading complexity even more. - backup/restore use thread local storage SDL_TLS Set/Get so maybe it has to run on the SDL Thread ?! Also: testegl.c is an interesting testcase. User manipulates its context. for this testcase, the backup() has to be done exactly before waiting the semaphore. (before "if (SDL_SemWait(Android_ResumeSem) == 0)" ); https://hg.libsdl.org/SDL/rev/6acc7df4a175 https://bugzilla.libsdl.org/show_bug.cgi?id=2134 - onNativeSurfaceDestroyed: is the place to invalidate the egl_surface, where we have to destroy it. ... hence, the synchronisation between UI Thread and SDL Thread. This may be totally unrelated, but the code reads a little as if SDL_ANDROID_BLOCK_ON_PAUSE set to 0, the context won't be backed up and later restored at all. Does that mean SDL_ANDROID_BLOCK_ON_PAUSE=0 will currently inevitably risk corruption, even if the user code/app is taking super extra care to never accesses the GL context on its own in between SDL_APP_WILLENTERBACKGROUND and SDL_DIDENTERFOREGROUND? Sorry if that isn't the case, it was just something that occurred me reading the patches and changes linked so far in the discussion Thanks for asking, I've less tested the non-blocking mode https://hg.libsdl.org/SDL/file/d311fcb8066a/src/video/android/SDL_androidevents.c#l143 It seems it also calls: line 158 android_egl_context_restore(Android_Window); line 171 android_egl_context_backup(Android_Window); Indeed, there is no implicit acknowledge like making sure the app has consumed WILL_ENTER_BG before backing-up the context. > Indeed, there is no implicit acknowledge like making sure the app has consumed WILL_ENTER_BG before backing-up the context.
I just tested this, I have a couple of moments where my app will hang a second or two (loading screen!) and won't poll events immediately, and when I tab out exactly while that happens I'm almost GUARANTEED to get texture corruption once I tab back in. So this definitely seems to be an issue in practice
I also see some glitch. but this is a little different (but 50% reproduce able on my side, portrait App). Each time I got to background/foreground, when the app reappears, the last/bottom line of pixel of the screen has noisy pixels. It's really quick and not disturbing, but it's there. I thought this was also the lack of hand-shaking for the onPause(). So I just add a SDL_Delay(500) at the end of "src/core/android/SDL_android.c:onPuase()" but that doesn't change anything. It should have done the trick. Maybe you could try also a SDL_Delay() there for you to see what happens. Anyway: if you are stuck a few seconds, you don't want this kind of acknowledgement. You'll get an ANR report. And, this seems to me more dangerous than solving anything. I would have added a small SDL_Delay() as a *soft/loose* hand-shaking. but that doesn't change anything. I just checked, ANR seems to be 5 seconds. That is pretty long. In any case, I think SDL2 should at least wait a little to give the app a chance to possibly consume the event before it backups the context, and I think that should be a longer time than like just 100ms IF that is possible per android APIs. (Of course if the app consumes it quickly it should also back it up quickly) For what it's worth, regarding the trigger: If I back out at any other point other then these hangs I don't see this corruption. (at least so far never have) And these hangs are never GFX related, it's not an asset loading screen: I'm working on a sort of document editor and the hang is e.g. during complex serialization for saving, where it doesn't touch any SDL_Texture or similar at all - so it shouldn't be related to any specific type of drawing op And for the sort of corruption: Also what I'm seeing is more than just a few noisy pixels, what happens is that a huge bunch of the SDL_Textures just goes pitch black and others go fully transparent/unrenderable (they disappear), and from that point on there is a chance that with any future tab out/back in the entire viewport will freeze with the old image and no longer render and update. So this is more than just a small cosmetic thing As for SDL_Delay testing: That might be worth a try, although what I'll probably do first is try to reduce the time between the SDL_PollEvent a little since I should do that anyway. I would assume that might make it less likely already Maybe this is different issue ... SDL sends events: - will_enter_bg - did_enter_bg Once you receive will_enter_bg, you're warned. And on your next call of SDL_PollEvent, the egl context is actually backed-up and you read did_enter_bg event. This seems ok to me. this is how it's done in block_on_pause mode. on non-blocking, you need this patch from bug 4578 so that it remains in state "inPausing" until you fetch the did_enter_bg event. Yeah I run with non-blocking, and I don't use the patch right now. (just the tip with no patches) So that would be the problem then? I do serialize on SDL_WILLENTERBACKGROUND and not did enter background, so regarding that I should be okay Sorry for muddling this ticket with a potentially unrelated issue :O Maybe, the patch will wait for did_enter_bg to be removed to backup the context This might be the case as well with current SDL commits. I notice that, since enabling my app to run in the background, the graphics context or something related doesn't get restored most of the time putting the app back in the foreground(SDL_APP_DIDENTEREDFOREGROUND event). My app, when entering the foreground that way, will start rendering again, forcing a redraw of screen content from the backbuffer it uses. Said forced redraw also forces a new SDL context etc. to be createn(SDL_CreateWindow etc.), but destroying the old context etc. that was already there. I don't really fully understand your message :) Do you use the latest source code ? As soon as your read WILL_ENTER_FG you should stop rendering. As soon as your read DID_ENTER_BG you can resume rendering. I re-tried quickly with SDL_SetHint(SDL_HINT_ANDROID_BLOCK_ON_PAUSE, "0"); and faking with some SDL_Delay() It seems to work ... You can add some log when you stop/start rendering. And when the backup/restore are done. (SDL_androidevents.c) My app keeps a set of flags that handle foreground/background operations, focus information etc. as a single bitmap(1 bit for each), see https://bitbucket.org/superfury/commonemuframework/src/35b198714b17c0ab7aa3cc80e2238d62b299f1e6/emu/io/input.c look for the didenterforeground/didenterbackground. It updates the global flags, which are used by the main loop to determine what to do with SDL input/output/looping control(making the main routine stop working, depending on settings to fully halt, partly halt(keep recording/playback of sound or just stop rendering). Everything is determined bg the haswindowactive flags. For the GPU side, look at https://bitbucket.org/superfury/commonemuframework/src/35b198714b17c0ab7aa3cc80e2238d62b299f1e6/emu/gpu/gpu.c (CPU_updateVideo, which (re)creates the SDL rendering surface when starting/resuming/changing the display), https://bitbucket.org/superfury/commonemuframework/src/35b198714b17c0ab7aa3cc80e2238d62b299f1e6/emu/gpu/gpu_renderer.c (for SDL surface flipping). Audio rendering keeps running, but may be muted(both input and/or output, depending on settings), using the same flags. Everything but the video aspects is user-configurable using a settings option(configurable from a settings menu or configuration text file). Of course, all other surfaces that are not the main rendering surface(that's flipped) keep updating unconditionally at a up to 60Hz rate. It's just the rendering to the screen(the SDL_flip equivalent) that's affected, together with audio and surface creation of the window surface. Also, the SDL(2) layer used: https://bitbucket.org/superfury/commonemuframework/src/35b198714b17c0ab7aa3cc80e2238d62b299f1e6/emu/gpu/gpu_sdl.c Just to make sure, when you're pushed to foreground, you don't need to re-create the window, nor the renderer. What issues are you seeing, and also what is in the android log cat ? Did you try disabling the BATCHING ? Use latest sources ? Things I can share from my experience: 1. re-creating the renderer when resuming not only will not help but actually break things, 2. you can't not only not flip while in background, but you can also neither create nor destroy textures, nor do anything else with them (no color mod, alpha mod changes, no pixel access, ...), 3. Surface access/creation/changes should be fine as long as you're not updating the window surface or converting any surfaces to textures I'm using the SDL2 commit from 20190619 at 06:43:54. I've just modified the update function to renew the renderer surface(release it and recall the allocation functions(updateWindow)) after being restored to foreground. Or shouldn't updateWindow be called again? But the pixels in the surface are being updated in the background, just not flipped(rendered using the SDL compatibility method (the flip method in my app's gpu_sdl.c). Again: when you're pushed to foreground: - you don't need to re-create the window, nor the renderer. - you don't need to modify any of your SDL_Surface. - you don't need to modify any of your SDL_Texture, unless you got one of the event SDL_RENDER_TARGETS_RESET or SDL_RENDER_DEVICE_RESET. BTW, looking at your code: case SDL_WINDOWEVENT_FOCUS_LOST: lock(LOCK_INPUT); hasinputfocus = 0; //Lost input focus! unlock(LOCK_INPUT); #ifdef ANDROID goto didenterbackground; //Simulate background! #endif You don't need anymore to do a didenterbackground when receiving FOCUS_LOST. Now, Android can work in multi-window (split view), so you can have two apps shown at the same time. And when your app is not focus, it can still run and update the screen. OK. I'll look at that one later. For now, I've added a bit more code to the didenterforeground event. It now will set needvideoupdate and GPU.forceredraw to 1, enforcing a full redraw across all used video layers(from bottom to top layer, all the way to the renderer's layer). I've also forced a renderer(updateWindow) refresh on entering foreground(on Android), which will release the old SDL_Surface of the renderer by setting it to NULL before calling updateWindow. UpdateWindow in turn will do a full SDL reallocation(SDL_destroytexture, SDL_destroyRenderer, free the surface wrapper(used for the app only, leaving the actual SDL_Surface untouched), SDL_setWindowSize, SDL_SetWindowFullscreen, SDL_setWindowIcon, SDL_CreateRenderer and finally SDL_RenderSetLogicalSize and SDL_CreateRGBSurface for a new rendering surface). I now notice that the app correctly returns from being backgrounded every single time, instead of at random/sporadically. So something seems to have been fixed? (In reply to Sylvain from comment #31) > Again: when you're pushed to foreground: > - you don't need to re-create the window, nor the renderer. > - you don't need to modify any of your SDL_Surface. > - you don't need to modify any of your SDL_Texture, unless you got one of > the event SDL_RENDER_TARGETS_RESET or SDL_RENDER_DEVICE_RESET. > > > BTW, looking at your code: > > case SDL_WINDOWEVENT_FOCUS_LOST: > lock(LOCK_INPUT); > hasinputfocus = 0; //Lost input focus! > unlock(LOCK_INPUT); > #ifdef ANDROID > goto didenterbackground; //Simulate background! > #endif > > You don't need anymore to do a didenterbackground when receiving FOCUS_LOST. > Now, Android can work in multi-window (split view), so you can have two apps > shown at the same time. And when your app is not focus, it can still run and > update the screen. I tried using it with my Android device, but Android is giving me a message saying that it "Can't use this app in multi-window view", as that message states. The icon for multiple windows is also not visible(the two squares that enable said view). Perhaps it needs some kind of permission to use that? Anyway, this way, that multi-window view issue simply solved itself? One thing about the resuming from being backgrounded, since it determines the maximum resolution each time it resumes, I saw one instance where the fullscreen view wasn't completely fullscreen? It always takes the first display resolution item it gets the first entry of SDL_ListModes(NULL,SDL_FULLSCREEN|SDL_HWSURFACE) for SDL or SDL_getCurrentDisplayMode(0,&displayMode)==0. But that's only done on boot, so why wouldn't it properly use fullscreen? It looked like it's used some kind of border filling display mode? Also, you say that responding to input is fine when in multi-window mode? But when you minimize your app to start closing it, switch to another app or switch to multi-window mode, doesn't the app receive a SDL_EVENT_DIDENTERBACKGROUND, making it forbidden to render or give output? So the app will inhibit rendering in multi-window mode, while accepting input anyways? Also, since my app runs at fullscreen at a fixed autodetect resolution(the (screen) resolution during it's startup), it won't display fine with only half a screen? Probably only the left half would be rendered? Or does the reported first resolution of the list change during such a mode? It sounds like these questions might be better directed at the forum, but for what it's worth: multi-window means multi-window as on desktop, you see multiple windows at once. (E.g. as in Android 7+'s side by side, or Samsung Dex, or Android 10's upcoming experimental desktop mode) If you use latest SDL hg, you won't get a background event in side by side if you lose focus and your app stays visible. In that scenario, while you're still no longer the main currently active foreground/focused app, you can still render & get input
> Or does the reported first resolution of the list change during such a mode?
You get a window resize same as when orientation changes, so your app needs to be able to handle that properly
With SDL_WINDOWEVENT_SIZE_CHANGED, it simply does the same as resuming the app(SDL_EVENT_DIDENTERFOREGROUND), but clears window_xres and window_yres, which causes the renderer to call the earlier mentioned SDL_GetCurrentDisplayMode(0,&displayMode)==0, which will cause it to update the display resolution according to the displaymode.w and displaymode.h accordingly. Will that event happen again when minimizing the app to the Android task manager and resuming it in fullscreen(single-window) mode? Otherwise it will keep said resolution when resuming in singls-window mode, which would give strange output. one more concurrency issue: https://hg.libsdl.org/SDL/rev/dd9169424181 |
AFAICS the Android backend has several concurrency issues because of the fact that SDL_Main() is run on its own thread whereas window callbacks like onNativeSurfaceChanged() and onNativeSurfaceDestroyed() run on the UI thread. Both threads can access "Android_Window" at the same time and there doesn't seem to be any synchronization mechanism. For example, consider the following code in onNativeSurfaceDestroyed() in SDL_android.c. This code is running on the UI thread: if (Android_Window == NULL || Android_Window->driverdata == NULL ) { return; } _this = SDL_GetVideoDevice(); data = (SDL_WindowData *) Android_Window->driverdata; It could happen that "Android_Window" is set to NULL by the thread that is running SDL_Main() right after the code above that is running on the UI thread has passed the NULL pointer check. When trying to access "Android_Window->driverdata" then, the code will crash because "Android_Window" is now NULL. So I think this code needs some synchronization with the thread that is running SDL_Main(). Also, onNativeSurfaceDestroyed() has this code: if (data->egl_surface != EGL_NO_SURFACE) { SDL_EGL_MakeCurrent(_this, NULL, NULL); SDL_EGL_DestroySurface(_this, data->egl_surface); data->egl_surface = EGL_NO_SURFACE; } Once again, there is no synchronization and the SDL thread could be trying to access "data->egl_surface" while the UI thread is trying to destroy it which could lead to a crash. These are just two issues I saw. There are probably more because of the multithreaded nature of the Android backend. I've also posted about this here but I haven't received any reply: https://discourse.libsdl.org/t/potential-serious-threading-issue-in-android-backend/24135 Since I consider this to be a serious issue I've decided to file a ticket for this so that it isn't forgotten.