| Summary: | crash in testhotplug on macOS, via hot-unplugging of USB device | ||
|---|---|---|---|
| Product: | SDL | Reporter: | David Ludwig <dll> |
| Component: | joystick | Assignee: | Ryan C. Gordon <icculus> |
| Status: | RESOLVED FIXED | QA Contact: | Sam Lantinga <slouken> |
| Severity: | normal | ||
| Priority: | P2 | CC: | dll, icculus |
| Version: | HG 2.1 | Keywords: | target-2.0.12 |
| Hardware: | x86_64 | ||
| OS: | macOS 10.15 | ||
| Attachments: |
partial fix for bug 4961
unfinished fix; adds CFRelease, but still needs CFRetain, and maybe more CFRetain? patch #3: needs review, and more testing; adds both CFRetain and CFRelease for IOKit device-object patch #4: add an extra CFRelease, on device-disconnect |
||
I _think_ what's happening is that we mark the deviceRef as NULL (but not the device pointer) in a callback that happens in another thread, but that particular call in IOKit gets mad about a NULL deviceRef. I've pushed an untested patch for this in https://hg.libsdl.org/SDL/rev/736e6cf8e5f8, which hopefully fixes the crash. Let me know if it doesn't! --ryan. Thanks for the update, Ryan.
I tried your patch, however, I am still seeing the crash. This was via a full, clean rebuild, using the latest sources (from changeset 13452) :-(
Here is the latest that I get from Address Sanitizer:
davidl@Wart build % cd test
davidl@Wart test % ./testhotplug2020-01-27 18:10:14.007 testhotplug[53382:4847153] INFO: There are 0 joysticks at startup
2020-01-27 18:10:14.007 testhotplug[53382:4847153] INFO: There are 0 haptic devices at startup
2020-01-27 18:10:26.590 testhotplug[53382:4847153] INFO: Joy Added : 0 : Smartjoy Family Super Smartjoy 2
2020-01-27 18:10:26.590 testhotplug[53382:4847153] INFO: No haptic found
AddressSanitizer:DEADLYSIGNAL
=================================================================
==53382==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fff38d16c18 bp 0x7ffee471a810 sp 0x7ffee471a800 T0)
==53382==The signal is caused by a READ memory access.
==53382==Hint: address points to the zero page.
#0 0x7fff38d16c17 in IOHIDDeviceGetValue (IOKit:x86_64+0x39c17)
#1 0x10bac9574 in GetHIDElementState SDL_sysjoystick.c:166
#2 0x10bac91b4 in GetHIDScaledCalibratedState SDL_sysjoystick.c:190
#3 0x10bac4063 in DARWIN_JoystickUpdate SDL_sysjoystick.c:933
#4 0x10baa874d in SDL_JoystickUpdate_REAL SDL_joystick.c:1212
#5 0x10b556244 in SDL_PumpEvents_REAL SDL_events.c:681
#6 0x10b5562c4 in SDL_WaitEventTimeout_REAL SDL_events.c:718
#7 0x10b556286 in SDL_PollEvent_REAL SDL_events.c:700
#8 0x10b537483 in SDL_PollEvent SDL_dynapi_procs.h:153
#9 0x10b4e9bc9 in main testhotplug.c:66
#10 0x7fff6d5b87fc in start (libdyld.dylib:x86_64+0x1a7fc)
==53382==Register values:
rax = 0x00006040000f6490 rbx = 0x00007ffee471a860 rcx = 0x00006040000f6490 rdx = 0x00007ffee471a840
rdi = 0x0000000000000000 rsi = 0x000060c000046840 rbp = 0x00007ffee471a810 rsp = 0x00007ffee471a800
r8 = 0x00007ffee471aa60 r9 = 0x9afcfe5373260069 r10 = 0x0000606000095860 r11 = 0x0000606000095858
r12 = 0x0000000000000000 r13 = 0x0000000000000000 r14 = 0x0000000000000000 r15 = 0x0000000000000000
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (IOKit:x86_64+0x39c17) in IOHIDDeviceGetValue
==53382==ABORTING
zsh: abort ./testhotplug
davidl@Wart test %
Hmm, might be a race condition? We're checking for NULL in that new patch, _then_ we're setting deviceRef to NULL in the removal callback and _then_ it calls IOHIDDeviceGetValue? --ryan. Perhaps. The IOKit's function does, as of macOS 10.15.2, look like it has a simple and short implementation, when disassembled.
Here is a sample, lldb callstack, from a reproduction with Address Sanitizer turned OFF:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x15)
* frame #0: 0x00007fff38d16c18 IOKit`IOHIDDeviceGetValue + 12
frame #1: 0x00000001001f252d testhotplug`GetHIDElementState(pDevice=0x000000010076cbc0, pElement=0x0000000100770580, pValue=0x00007ffeefbfed34) at SDL_sysjoystick.c:166:13
frame #2: 0x00000001001f2458 testhotplug`GetHIDScaledCalibratedState(pDevice=0x000000010076cbc0, pElement=0x0000000100770580, min=-32768, max=32767, pValue=0x00007ffeefbfed34) at SDL_sysjoystick.c:190:9
frame #3: 0x00000001001f0510 testhotplug`DARWIN_JoystickUpdate(joystick=0x000000010076ca10) at SDL_sysjoystick.c:933:20
frame #4: 0x00000001001e9280 testhotplug`SDL_JoystickUpdate_REAL at SDL_joystick.c:1212:17
frame #5: 0x0000000100038063 testhotplug`SDL_PumpEvents_REAL at SDL_events.c:681:9
frame #6: 0x00000001000380e5 testhotplug`SDL_WaitEventTimeout_REAL(event=0x00007ffeefbfeeb0, timeout=0) at SDL_events.c:718:9
frame #7: 0x00000001000380a7 testhotplug`SDL_PollEvent_REAL(event=0x00007ffeefbfeeb0) at SDL_events.c:700:12
frame #8: 0x000000010001ec64 testhotplug`SDL_PollEvent(a=0x00007ffeefbfeeb0) at SDL_dynapi_procs.h:153:1
frame #9: 0x0000000100005bdf testhotplug`main(argc=1, argv=0x00007ffeefbfef18) at testhotplug.c:66:15
frame #10: 0x00007fff6d5b87fd libdyld.dylib`start + 1
frame #11: 0x00007fff6d5b87fd libdyld.dylib`start +
and here is the disassembled function, annotated with what I think is the suspect line:
_IOHIDDeviceGetValue:
0000000000039c0c push rbp
0000000000039c0d mov rbp, rsp
0000000000039c10 sub rsp, 0x10
0000000000039c14 mov rdi, qword [rdi+0x10]
0000000000039c18 mov rax, qword [rdi] <------ CRASHES HERE, maybe? (0x39c0c + 12 (offset in base 10, IIRC))
0000000000039c1b mov dword [rsp+0x10+var_10], 0x0
0000000000039c22 xor ecx, ecx
0000000000039c24 xor r8d, r8d
0000000000039c27 xor r9d, r9d
0000000000039c2a call qword [rax+0x58]
0000000000039c2d add rsp, 0x10
0000000000039c31 pop rbp
0000000000039c32 ret
; endp
I'm a bit hazy on my ability to parse x86-64 function arguments on macOS, though.
Minor update: the annotated line is indeed the instruction where it crashes, as observed in lldb. If I have some time, I can try to look into this a bit more, though it is in some SDL code that I'm currently unfamiliar with. (In reply to David Ludwig from comment #5) > Minor update: the annotated line is indeed the instruction where it crashes, > as observed in lldb. > > If I have some time, I can try to look into this a bit more, though it is in > some SDL code that I'm currently unfamiliar with. In that disassembly, the RDI register has the DeviceRef pointer (the first argument to IOHIDDeviceGetValue), then it looks 16 bytes into that struct for another pointer, which is clearly NULL here probably from the device being disconnected, and thus crashes. But crucially: the deviceRef itself isn't NULL. So basically, we can't use this DeviceRef at this point, so we need to figure out how to mark SDL's device struct as disconnected sooner so we know not to try to update it under DARWIN_JoystickUpdate. --ryan. I have a terrible idea... Can we check to see if 16 bytes into the ref pointer is NULL? I'm about to dig into this in a bit more depth (starting now). If anyone wants to get in touch with me more quickly, I'll be on Skype as dludwig@pobox.com, and Freenode IRC as "dll255". Sam, I'll consider your idea too, but I can see how it -might- be a bad idea. :-) More to come... IOHIDDeviceRef's internal struct is private, and as such, I'm thinking that it'd be a bad idea to try modifying it. For what it's worth, though, it's code is open, at least as of macOS 10.14.5. It's defined in this file, as 'typedef struct __IOHIDDevice': https://opensource.apple.com/source/IOKitUser/IOKitUser-1483.260.4/hid.subproj/IOHIDDevice.c.auto.html Next, I'll look into Ryan's race-condition hypothesis. SDL_JoystickUpdate appears to run joystick-detection routines *AFTER* it runs joystick-updating functions. Perhaps this is part of it?
I tried adding in a separate call to the joystick-drivers' Detect() routines before its Update() was called, but this seems to lead to a similar, but subtly-different crash, with a stack trace that looks like the following, albeit with the source-lines being a bit off, due to additional debug code:
Process 70604 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
frame #0: 0x00007fff38d16c18 IOKit`IOHIDDeviceGetValue + 12
IOKit`IOHIDDeviceGetValue:
-> 0x7fff38d16c18 <+12>: movq (%rdi), %rax
0x7fff38d16c1b <+15>: movl $0x0, (%rsp)
0x7fff38d16c22 <+22>: xorl %ecx, %ecx
0x7fff38d16c24 <+24>: xorl %r8d, %r8d
Target 0: (testhotplug) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
* frame #0: 0x00007fff38d16c18 IOKit`IOHIDDeviceGetValue + 12
frame #1: 0x00000001001f241d testhotplug`GetHIDElementState(pDevice=0x00000001007152e0, pElement=0x0000000100726470, pValue=0x00007ffeefbfed44) at SDL_sysjoystick.c:166:13
frame #2: 0x00000001001f2348 testhotplug`GetHIDScaledCalibratedState(pDevice=0x00000001007152e0, pElement=0x0000000100726470, min=-32768, max=32767, pValue=0x00007ffeefbfed44) at SDL_sysjoystick.c:190:9
frame #3: 0x00000001001f03e0 testhotplug`DARWIN_JoystickUpdate(joystick=0x0000000100735110) at SDL_sysjoystick.c:934:20
frame #4: 0x00000001001e9155 testhotplug`SDL_JoystickUpdate_REAL at SDL_joystick.c:1233:17
frame #5: 0x0000000100037df3 testhotplug`SDL_PumpEvents_REAL at SDL_events.c:681:9
frame #6: 0x0000000100037e75 testhotplug`SDL_WaitEventTimeout_REAL(event=0x00007ffeefbfeec0, timeout=0) at SDL_events.c:718:9
frame #7: 0x0000000100037e37 testhotplug`SDL_PollEvent_REAL(event=0x00007ffeefbfeec0) at SDL_events.c:700:12
frame #8: 0x000000010001e9f4 testhotplug`SDL_PollEvent(a=0x00007ffeefbfeec0) at SDL_dynapi_procs.h:153:1
frame #9: 0x000000010000596f testhotplug`main(argc=1, argv=0x00007ffeefbfef20) at testhotplug.c:66:15
frame #10: 0x00007fff6d5b87fd libdyld.dylib`start + 1
(lldb)
SDL_JoystickUpdate's call to drivers' Detect() method is preceded by a comment about it having to be called after a thing happens:
/* this needs to happen AFTER walking the joystick list above, so that any
dangling hardware data from removed devices can be free'd
*/
I'm unsure exactly what this means.
Tips welcomed, if any!
Nevermind about the last stack-trace (whereby I attempted a quick-fix by adding a new Detect() call to SDL_JoystickUpdate) looking subtly-different. My caffeine appears to be wearing off. I'm off to take a bit of a break, now. Back later. PROGRESS!!! Modifying the interval parameter in darwin/SDL_sysjoystick.c's calls to CFRunLoopRunInMode, in my current case, from 0 to 0.001, makes the bug almost, though not 100%, go away. In a debug build, with it set to 0, testhotplug will almost always crash on the first disconnect, and I've never been able to get past ten disconnects before a crash. With it set to 0.001, I can easily get above 10 disconnect-attempts before a crash. To add to my previous comment (regarding progress and run loops), a similar fix helped fixed an iOS event-handling bug in SDL, way back in https://hg.libsdl.org/SDL/rev/8faee8f46165 Created attachment 4179 [details] partial fix for bug 4961 Here is a partial fix for bug 4961, which for me, observably reduces the frequency with which it occurs. Unfortunately, it can still happen, but it does appear to reduce it on my dev machine, at least. Augmenting my previous partial fix, I tried adding a call to DARWIN_JoystickDetect to DARWIN_JoystickUpdate, however, this didn't lead to a 100% fix. This smells like a race condition to me, perhaps whereby the device removal occurs, but the removal-event either hasn't been received, or perhaps even sent to the app. I'd be willing to bet that this a bug in IOKit as well, however, that doesn't seem terribly helpful to us. Perhaps there is an IOKit function that could be used to effectively query whether or not the device-ref is still valid? The NULL-pointer deference occurs in Release builds, as well as Debug. I need to move onto some other projects for a bit. If anyone could use additional help with this, please let me know, and I can try to revisit. Created attachment 4182 [details]
unfinished fix; adds CFRelease, but still needs CFRetain, and maybe more CFRetain?
This proof-of-concept, which increases a game controller device object's reference count upon device-connection, appears to fix the bug, at least for my testing (100 plugs and unplugs; maybe there are other repro-algorithms). IOKit device objects have a reference count. The fix increments that reference count, in SDL's callback that gets invoked when a USB game-controller/joystick is plugged in on Darwin (in SDL's private function, JoystickDeviceWasAddedCallback). It does not, however, add in associated reference-count decrement or decrements. It might need more ref-count increments, especially if SDL gets IOKit device-object references from places other than its Darwin, usb-device-was-added callback.
This fix does not require that a delay be introduced into any Darwin/macOS run-loop-iteration function calls, and appears to render my previous fix-attempt obsolete.
Created attachment 4184 [details]
patch #3: needs review, and more testing; adds both CFRetain and CFRelease for IOKit device-object
Here is an updated patch, which attempts to both CFRetain and CFRelease the game controller's IOKit-provided device-object.
Code review would be helpful here! Either way, I'll try to take another look at it within a few days' time.
This sounds like the right direction. We do need to release the reference though, once we detect that the device is no longer connected. Created attachment 4190 [details]
patch #4: add an extra CFRelease, on device-disconnect
Looks good, thanks! https://hg.libsdl.org/SDL/rev/9c763de27c67 Sure thing. Thanks for your help on this! 👍✌️ I augmented my patch a tiny bit further, in order to fix a seg-fault that seemed possible within FreeDevice (a private function in darwin/SDL_sysjoystick.c), if and when joystick-init fails just after it gets plugged in. I went ahead and pushed this patch to HG (as it looks innocuous otherwise, to me at least), via https://hg.libsdl.org/SDL/rev/cec913fbe656 |
testhotplug is somewhat-reliably crashing for me on macOS 10.15.2 (Catalina), when I unplug a previously-detected joystick. I haven't had much time to dig into the specifics of it, but here is a bit of info that I've been able to glean. This is occurring for me in the latest HG version, which is currently at changeset 13449. Steps to reproduce: 1. run testhotplug 2. repeatedly plug and unplug a USB device, with a delay of at least a few seconds in between plugging and unplugging (and vice-versa). Observed Result: the app crashes with a seg-fault much, but not 100% of the time. (I have, so far, able to get it to reliably reproduce, however it may take a few tries.) Expected Result: no crash Notes: The crash seems a bit less likely to occur if the hot [un]plugging occurs quickly, with little time in-between plugs and unplugs. Below is a stack trace from the crash, in a build with Clang's Address Sanitizer turned on. (Build command used, from an empty SDL/build directory): cmake -DCMAKE_BUILD_TYPE=Debug "-DCMAKE_C_FLAGS=-fsanitize=address -fno-omit-frame-pointer" -DSDL_TEST=1 .. && make VERBOSE=1 ) davidl@Wart test % ./testhotplug 2020-01-27 13:22:41.276 testhotplug[2841:4641112] INFO: There are 0 joysticks at startup 2020-01-27 13:22:41.277 testhotplug[2841:4641112] INFO: There are 0 haptic devices at startup 2020-01-27 13:22:53.168 testhotplug[2841:4641112] INFO: Joy Added : 0 : Smartjoy Family Super Smartjoy 2 2020-01-27 13:22:53.168 testhotplug[2841:4641112] INFO: No haptic found AddressSanitizer:DEADLYSIGNAL ================================================================= ==2841==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fff38d16c18 bp 0x7ffee9940830 sp 0x7ffee9940820 T0) ==2841==The signal is caused by a READ memory access. ==2841==Hint: address points to the zero page. #0 0x7fff38d16c17 in IOHIDDeviceGetValue (IOKit:x86_64+0x39c17) #1 0x1068a3a85 in GetHIDElementState SDL_sysjoystick.c:166 #2 0x1068a3724 in GetHIDScaledCalibratedState SDL_sysjoystick.c:190 #3 0x10689e5d3 in DARWIN_JoystickUpdate SDL_sysjoystick.c:933 #4 0x106882cbd in SDL_JoystickUpdate_REAL SDL_joystick.c:1212 #5 0x106330874 in SDL_PumpEvents_REAL SDL_events.c:681 #6 0x1063308f4 in SDL_WaitEventTimeout_REAL SDL_events.c:718 #7 0x1063308b6 in SDL_PollEvent_REAL SDL_events.c:700 #8 0x106311ab3 in SDL_PollEvent SDL_dynapi_procs.h:153 #9 0x1062c41f9 in main testhotplug.c:66 #10 0x7fff6d5b87fc in start (libdyld.dylib:x86_64+0x1a7fc) ==2841==Register values: rax = 0x00006040000b9010 rbx = 0x00007ffee9940880 rcx = 0x00006040000b9010 rdx = 0x00007ffee9940860 rdi = 0x0000000000000000 rsi = 0x000060c000046840 rbp = 0x00007ffee9940830 rsp = 0x00007ffee9940820 r8 = 0x00007ffee9940a60 r9 = 0x153ed16fb5f2007d r10 = 0x000060600006f820 r11 = 0x000060600006f818 r12 = 0x0000000000000000 r13 = 0x0000000000000000 r14 = 0x0000000000000000 r15 = 0x0000000000000000 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (IOKit:x86_64+0x39c17) in IOHIDDeviceGetValue ==2841==ABORTING zsh: abort ./testhotplug I checked to see if any of the values directly-passed into IOHIDDeviceGetValue() were NULL, however they were not. Here is the debug code I used to check this, which was placed just-before the call (to IOHIDDeviceGetValue()): printf("0x%x\n", (unsigned int)pDevice->deviceRef); printf("0x%x\n", (unsigned int)pElement->elementRef); printf("0x%x\n", (unsigned int)&valueRef); printf("-----\n");