We are currently migrating Bugzilla to GitHub issues.
Any changes made to the bug tracker now will be lost, so please do not post new bugs or make changes to them.
When we're done, all bug URLs will redirect to their equivalent location on the new bug tracker.

Bug 5350

Summary: Help needed implementing VULKAN support on the KMSDRM backend
Product: SDL Reporter: Manuel Alfayate Corchete <redwindwanderer>
Component: videoAssignee: Sam Lantinga <slouken>
Status: RESOLVED FIXED QA Contact: Sam Lantinga <slouken>
Severity: normal    
Priority: P2 CC: wahil1976
Version: 2.0.13   
Hardware: All   
OS: Linux   

Description Manuel Alfayate Corchete 2020-11-12 12:53:27 UTC
Hi there,

Recently Vulkan has started working x-less on the Raspberry Pi, thanks to recent MESA advancements. Before that, it was working for quite a while on Intel graphics, amdgpu, etc

So, it's time for the KMSDRM backend to get Vulkan support, or maybe to do a new specific backend that would simply:

-Give the SDL2+Vulkan programs the active extensions list (SDL_GetInstanceExtensions)
-Create the vulkan surface specific to non-X Vulkan (SDL_Vulkan_CreateSurface() which in turn calls vkCreateDisplayPlaneSurfaceKHR). 

In either case, only those functionality is required from SDL2, but I am failing to implement it.

I have the VK_KHR_Display extension active, and as such I can retrieve the pointer to vkCreateDisplayPlaneSurfaceKHR(), using vkGetInstanceProcAddr() on the program's instance. That proves that the extension is, indeed, active.

However, when I try to get the display count using vkGetPhysicalDeviceDisplayPropertiesKHR(), I get 0 displays. Any idea on what am I doing wrong?

Here's where I successfully retrieve the vkCreateDisplayPlaneSurfaceKHR() pointer: 
[https://github.com/vanfanel/SDL/blob/11e4911657edc495e196b9f262b0b523fe9ad8e0/src/video/kmsdrm/SDL_kmsdrmvulkan.c#L179](http://Code)

And here is where I try to retrieve the display count that results in 0 displays:
[https://github.com/vanfanel/SDL/blob/11e4911657edc495e196b9f262b0b523fe9ad8e0/src/video/kmsdrm/SDL_kmsdrmvulkan.c#L276](http://Code)

I am trying this on AMDGPU and intel graphics with the same result.
Any Vulkan experts in the room?
Comment 1 wahil1976 2020-11-18 17:23:11 UTC
Try vkGetPhysicalDeviceDisplayProperties2KHR and see if it helps.
Comment 2 wahil1976 2020-11-18 17:35:04 UTC
Addendum: You will need to add yourself to the render group too, otherwise the Vulkan device will not show up.
Comment 3 wahil1976 2020-11-19 07:49:49 UTC
Yet another note: Try selecting the second GPU first before selecting the first GPU.
Comment 4 wahil1976 2020-11-24 18:41:27 UTC
After a bit of a research, I finally tracked down the issue. You have to drop master privileges from the KMSDRM file descriptor to allow the Mesa library to set itself as master which is required for Xorg-less Vulkan to work, since Linux does not allow more than one DRM card file descriptor to be master at the same time.

Try stealing the DRM master file descriptor by poking the /proc/self/fd/ directory and looking for symlinks to the DRM master device, ignoring the KMSDRM file descriptor. It seems to be only available on Linux, so making KMSDRM Vulkan exclusive to Linux will be your only option for now. You will need to restore master privileges when the application closes the KMSDRM window so that you can clean up KMSDRM states.

Try with the SDL 2.0.12 legacy KMSDRM backend if you end up with problems with Atomic DRM backend; the Mesa Vulkan WSI code exclusively uses legacy DRM for its Xorg-less Vulkan rendering.
Comment 5 wahil1976 2020-11-24 18:53:43 UTC
By stealing I meant "check if the file descriptor references the DRM device node and is the master file descriptor and use the same value as that master file descriptor". Sorry for my English.
Comment 6 Manuel Alfayate Corchete 2020-11-24 22:12:31 UTC
@wahil1976@outlook.com

Thanks a lot!

Don't you think it would be better to do a new backend instead of reusing the KMSDRM one? From what you say, it would be better to prevent KMSDRM from even initializing, in order to let Vulkan take the file descriptor for itself.
Comment 7 wahil1976 2020-11-25 04:42:23 UTC
Going down the route of creating the new backend purely for Xorg-less Vulkan isn't something that would simply work; many SDL2 applications ignore whatever backend SDL2 is using and rely upon several assumptions they may have made when creating the SDL2 window. You would also need a Vulkan instance to do anything useful and the application may not provide it at all. It would be really complicated. You would need to complicate the graphical cursor code further for example.
Comment 8 wahil1976 2020-11-25 05:23:13 UTC
Here's a hint: Drop master privileges from KMSDRM *before* calling vkEnumeratePhysicalDevices, and then if you can't create the surface, find the file descriptor that references the DRM primary node device file and is master and close it, assuming it is not the KMSDRM one and is instead the one by Mesa library.
Comment 9 Manuel Alfayate Corchete 2020-11-29 14:54:02 UTC
@wahil1976@outlook.com

Thanks again.
I do this to check the contents in /proc/self/fd

struct dirent *res;
DIR *folder;

if (folder) {
    while ((res = readdir(folder))) {
        printf("---%s\n", res->d_name);
    }   
    closedir(folder);
}   

I can easily confirm that the KMSDRM FD number (3 in this case) is on that list.

However, what do you mean by this?
"check if the file descriptor references the DRM device node and is the master file descriptor and use the same value as that master file descriptor"

I think you believe that I must confirm that the KMSDRM FD is on the list of FDs in /proc/self/fd, right?

But then, how do I take it from KMSDRM? By closing it and opening it again? In my head that makes no sense: we are still in the KMSDRM backend, I would get the same FD if I reopen it again, wouldn't I? 

Sorry if I sound a bit obtuse, trying my best here, really.
Comment 10 wahil1976 2020-11-29 16:20:30 UTC
Opening the same file again will only yield another file descriptor, as per UNIX rules.

A alternate solution is to save the Vulkan file descriptor and then whenever the backend is invoked from the application, temporarily make the KMSDRM file descriptor master and restore the master to Vulkan file descriptor at the end of the function or if any error occurs.

Note that hardware cursor is impossible when Vulkan is running because the Vulkan swapchain function will disable hardware cursor whenever it is called.
Comment 11 wahil1976 2020-11-29 16:21:28 UTC
Check the contents of /proc/self/fd and /dev/fd too.
Comment 12 Manuel Alfayate Corchete 2020-11-29 16:41:51 UTC
@wahil1976@outlook.com
What Vulkan file descriptor? I only open an FD from KMSDRM, but not from Vulkan.


And... What do you mean by "make file descriptor master"?
How does one make a file descriptor master? Master of what?
Comment 13 Manuel Alfayate Corchete 2020-11-29 16:47:10 UTC
(In reply to Manuel Alfayate Corchete from comment #12)
> @wahil1976@outlook.com
> What Vulkan file descriptor? I only open an FD from KMSDRM, but not from
> Vulkan.
> 
> 
> And... What do you mean by "make file descriptor master"?
> How does one make a file descriptor master? Master of what?

I have been reading about this. Ok, there are ioctls for SET_MASTER and DROP_MASTER.

But still, I don't see it clear. Should I do the following?
-Issue IOCTL DROP_MASTER on the KMSDRM FD before calling vkGetPhysicalDeviceDisplayPropertiesKHR()
-Restore it before destroying the SDL/KMSDRM window.

Is that right??


But then: Why should I iterate on /proc/self/fd? I already have the KMSDRM FD.
Comment 14 Manuel Alfayate Corchete 2020-11-29 17:02:00 UTC
@wahil1976@outlook.com

Ok! I am doing:

ioctl(viddata->drm_fd, DRM_IOCTL_DROP_MASTER, 0);

...and it DOES drop MASTER on the DRM FD, and I can indeed recover the number of displays via vkCreateDisplayPlaneSurfaceKHR().

But there's another small problem: the DRM_IOCTL_DROP_MASTER ioctl requires root permissions... At least until Linux kernel 5.8, where it is not required anymore:

https://stackoverflow.com/questions/29708596/drmdropmaster-requires-root-privileges
Comment 15 Manuel Alfayate Corchete 2020-11-29 17:27:17 UTC
@wahil1976@outlook.com Currently, testvulkan.c seems to be failing with:

ERROR: vkQueuePresentKHR(): VK_ERROR_SURFACE_LOST_KHR

Which seems to mean that the surface is no longer available (¿?)

It's strange because this call DOES succeed:

result = vkCreateDisplayPlaneSurfaceKHR(instance, &createInfo,
                                       NULL, surface);

It returns VK_SUCCESS.

So, any idea on what's taking the surface away?
Comment 16 wahil1976 2020-11-29 18:09:36 UTC
A look at the Mesa WSI source code indicates that surface will be lost if:
1. drmModeGetResources fails in the library.
2. The DRM connector cannot be setup again.
3. The Vulkan file descriptor is no longer valid.

There's a merge request from Keith Packard in the Mesa Gitlab repository that makes everything done through the atomic DRM API if it is supported but it hasn't been merged yet. Another one by Jonathan Marek does the same thing but removes the legacy DRM API support entirely, but it hasn't been merged either.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6173
[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4176

The former is more likely to work since it doesn't involve the hardware cursor crap and does not involved setting up the connector again. The latter I am not sure since I didn't look closely at the code. Keith Packard was also working on a Vulkan extension  named VK_KEITHP_kms_display that would prove very useful here (since you can tell the library to use the KMSDRM file descriptor instead of opening its own) but there was zero progress on that since then.

I am not sure what is the solution here currently.
Comment 17 Manuel Alfayate Corchete 2020-11-29 18:53:12 UTC
@wahil1976@outlook.com Ohh.. I thought it would be easier. We are almost there!
If you have any idea, please tell me so I can try...
Comment 18 Manuel Alfayate Corchete 2020-11-29 21:29:15 UTC
@wahil1976@outlook.com I have updated the code for SDL_kmsdrmvulkan:

https://github.com/vanfanel/SDL/blob/master/src/video/kmsdrm/SDL_kmsdrmvulkan.c

Where do you think that KMSDRM is stealing the Vulkan file descriptor?
I have disabled the creation of the dumb_buffer, the KMSDRM_CreateSurfaces() call in KMSDRM_CreateWindow()... but still, Vulkan is losing it's FD.
Comment 19 wahil1976 2020-11-30 17:15:16 UTC
Try releasing the connector and CRTCs that KMSDRM claimed to let Vulkan claim it for itself.
Comment 20 Manuel Alfayate Corchete 2020-12-02 16:55:11 UTC
@wahil1976@outlook.com

I have tried freeing the KMSDRM CRTC and connector as soon as I enter Vulkan_LoadLibrary(), but I am still receiving:

ERROR: vkQueuePresentKHR(): VK_ERROR_SURFACE_LOST_KHR

So, I went and simply disabled the code for CRTC, connector and primary plane selection in KMSDRM_VideoInit(). You can see it here:

https://github.com/vanfanel/SDL/blob/48aab9caec742496051ce938e6ae0efe978807d7/src/video/kmsdrm/SDL_kmsdrmvideo.c#L1158

As you will see, KMSDRM_VideoInit() does VERY little now: it does not take any planes, CRTCs or connectors, to let Vulkan work with these instead.

What can be failing now? Really, KMSDRM is not taking anything now, yet the same problem happens.

What's curious is that if I don't let KMSDRM open drm_fd, vkGetPhysicalDeviceDisplayPropertiesKHR() fails to get the display_count. Isn't Vulkan supposed to open his own FD itself?
Comment 21 wahil1976 2020-12-02 17:04:53 UTC
Did you disable the GBM stuff?
Comment 22 Manuel Alfayate Corchete 2020-12-02 17:40:45 UTC
@wahil1976@outlook.com

Yes, you can see here that the GBM device creation is disabled:

https://github.com/vanfanel/SDL/blob/48aab9caec742496051ce938e6ae0efe978807d7/src/video/kmsdrm/SDL_kmsdrmvideo.c#L1225
Comment 23 Manuel Alfayate Corchete 2020-12-02 18:25:10 UTC
@wahil1976@outlook.com

What really puzzles me is that I need to open fd with:
viddata->drm_fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);

And then drop master on it with:
ret = ioctl(viddata->drm_fd, DRM_IOCTL_DROP_MASTER, 0);

...Or else, vkGetPhysicalDeviceDisplayPropertiesKHR() fails to get the display_count.
The STRANGRE part is that viddata->drm_fd is never passed to any Vulkan function, at least explicitly. This is very strange.
Comment 24 Manuel Alfayate Corchete 2020-12-02 18:26:45 UTC
What I mean is that Vulkan should work without me opening the DRM FD, because Vulkan should be opening it's own FD internally.
Comment 25 wahil1976 2020-12-02 18:51:11 UTC
You will need to work out a solution with Mesa library developers. Looks like I am unable to provide any help beyond this at the moment.
Comment 26 Manuel Alfayate Corchete 2020-12-02 19:12:52 UTC
(In reply to wahil1976 from comment #25)
> You will need to work out a solution with Mesa library developers. Looks
> like I am unable to provide any help beyond this at the moment.

Ok, I will try my luck in #dri-devel...
Comment 27 wahil1976 2020-12-03 04:59:58 UTC
So, any solutions yet?
Comment 28 Manuel Alfayate Corchete 2020-12-03 10:23:07 UTC
@wahil1976@outlook.com

No, nothing yet. Do you know any Vulkan expert I could ask?
Comment 29 wahil1976 2020-12-03 11:17:45 UTC
Not that I know off.

Did you try running testvulkan as root though?
Comment 30 Manuel Alfayate Corchete 2020-12-03 15:14:52 UTC
(In reply to wahil1976 from comment #29)
> Not that I know off.
> 
> Did you try running testvulkan as root though?

Yes. Anyway, I am taking some clues from vkcube. Maybe I can make it work by following what vkcube does more closely.
Comment 31 Manuel Alfayate Corchete 2020-12-03 16:25:22 UTC
@wahil1976@outlook.com

I think I found the problem (but have yet to get a solution...).

In testvulkan.c, there is findPhysicalDevice(), which opens the physical device (gpu FD under the hood, I understand) and stores it in vulkanContext.physicalDevice

In KMSDRM_Vulkan_CreateSurface(), I need to find the physical device again because: -It's needed by vkGetPhysicalDeviceDisplayPropertiesKHR() which I use to get the displays of the phsysical device.
-It's needed by vkGetPhysicalDeviceDisplayPlanePropertiesKHR() which I use to get the planes of the physical device.

So I guess the problem comes from getting the physical device from two different places: testvulkan.c and KMSDRM_Vulkan_CreateSurface(). That would explain the lost Vulkan FD.

But now I need to find a solution, which seems no easy since testvulkan.c doesn't pass the physical device to SDL, at least explicitly. Is vulkanContext somehow accessible from SDL? If it's not, maybe this is as far as I can go without changes on how SDL works.
Comment 32 wahil1976 2020-12-03 18:00:34 UTC
I think you will need to build the entire Mesa library (including the Vulkan drivers) with debug symbols and use a debugger (preferably GDB) to put a breakpoint at each and every line of VK_ERROR_SURFACE_LOST_KHR return statement in wsi_common_display.c and then run the program. Do the debugger task from a external device that can SSH to the computer where you want to run the testvulkan program. You will need to switch to a terminal on that target computer for the program to run. Once the breakpoint hits at one of those return statements it will finally pinpoint the problem, which is better than guessing in the dark.

I can't right now because for some reason GDB can't find the debug symbols in either libvulkan_intel.so or libvulkan_radeon.so no matter what I do.
Comment 33 Manuel Alfayate Corchete 2020-12-03 19:32:24 UTC
(In reply to wahil1976 from comment #32)
> I think you will need to build the entire Mesa library (including the Vulkan
> drivers) with debug symbols and use a debugger (preferably GDB) to put a
> breakpoint at each and every line of VK_ERROR_SURFACE_LOST_KHR return
> statement in wsi_common_display.c and then run the program. Do the debugger
> task from a external device that can SSH to the computer where you want to
> run the testvulkan program. You will need to switch to a terminal on that
> target computer for the program to run. Once the breakpoint hits at one of
> those return statements it will finally pinpoint the problem, which is
> better than guessing in the dark.
> 
> I can't right now because for some reason GDB can't find the debug symbols
> in either libvulkan_intel.so or libvulkan_radeon.so no matter what I do.

I have done debugging exactly like that (GDB over SSH) multiple times. Raspberry Pi having Vulkan support now would help immensely if I have to come to that!
But, for a start, I believe accessing the physical device from both testvulkan.c and SDL is plain wrong and has to be corrected. Maybe I am wrong and it's fine, but VK_ERROR_SURFACE_LOST_KHR could be caused by that.
Comment 34 Manuel Alfayate Corchete 2020-12-04 17:20:32 UTC
@wahil1976@outlook.com
Couldn't get it out of my head, so finally setup a MESA debug build.
I set breakpoints at every VK_ERROR_SURFACE_LOST_KHR return in wsi_common_display.c as you suggested, and I found that it's returned exactly here:

Breakpoint 8, _wsi_display_queue_next (drv_chain=0x5555666160) at ../src/vulkan/wsi/wsi_common_display.c:1742

Which is this block in wsi_common_display.c:

     if (ret != -EACCES) {
         connector->active = false;
         image->state = WSI_IMAGE_IDLE;
         return VK_ERROR_SURFACE_LOST_KHR;
     }    


Where the evaluated "ret" comes from here:

        /* XXX allow setting of position */
         ret = drmModeSetCrtc(wsi->fd, connector->crtc_id,
                              image->fb_id, 0, 0,
                              &connector->id, 1,
                              &connector->current_drm_mode);

...Where drmModeSetCrtc() returns -28, which according to:
(gdb) p strerror(errno) 

means...
$2 = 0x7ff7e4ef10 "No space left on device"

So.. what does that mean in this context? All drmModeSetCrtc() is an ioctl on the FD under the hood, but never had to deal with that errno on a graphics fd.
Comment 35 wahil1976 2020-12-04 17:33:12 UTC
Try grepping the drivers/gpu/drm/ folder inside the Linux kernel source with pattern -ENOSPC to find for any clues of what could be causing it to return -28 (-ENOSPC).
Comment 36 Manuel Alfayate Corchete 2020-12-04 17:55:40 UTC
wahil1976@outlook.com
There are countless places where that's returned.
On the VC4 driver, which is small, it's only being returned on these two places, which makes no sense here:


        /* The absolute limit is 2Gbyte/sec, but let's take a margin to let
         * the system work when other blocks are accessing the memory.
         */
        if (load_state->membus_load > SZ_1G + SZ_512M)
                return -ENOSPC;

        /* HVS clock is supposed to run @ 250Mhz, let's take a margin and
         * consider the maximum number of cycles is 240M.
         */
        if (load_state->hvs_load > 240000000ULL)
                return -ENOSPC;
Comment 37 Manuel Alfayate Corchete 2020-12-04 18:09:51 UTC
@wahil1976@outlook.com

In the end, I went back to my old ways and instead actitated the kernel drm debug logging, and looked through dmesg, which told me this:

[11353.733278] [drm:drm_framebuffer_check_src_coords [drm]] Invalid source coordinates 1920.000000x1080.000000+0.000000+0.000000 (fb 640x480)

It seems I am somehow trying to scan a 1920x1080 image on a 640x480 fb.
Doesn't seem to be related to physical device after all, heh.
Comment 38 wahil1976 2020-12-04 18:16:22 UTC
That explains why I am seeing GZDoom (it's a DOOM source port if you didn't know) report "Resolution: 640x480" at startup with OpenGL renderer on pre-Vulkan KMSDRM.

Find some way to make sure the KMSDRM framebuffer is always created at the highest resolution available. That should fix the Vulkan "surface lost" problem.
Comment 39 wahil1976 2020-12-06 07:31:21 UTC
This code should do the trick. Use it after you get the number of display modes.
    dispModeProps = calloc(sizeof(VkDisplayModePropertiesKHR),mode_count);
    vkGetDisplayModePropertiesKHR(gpu, display, &mode_count,dispModeProps);
    for (int i = 0; i < mode_count; i++)
    {
        if (dispModeProps[i].parameters.visibleRegion.width == dispdata->mode.hdisplay
            && dispModeProps[i].parameters.visibleRegion.height == dispdata->mode.vdisplay)
            {
                mode_props = dispModeProps[i];
                break;
            }
    }
    free(dispModeProps);
    if (mode_props.parameters.visibleRegion.width == 0
        || mode_props.parameters.visibleRegion.height == 0)
    {
        SDL_SetError("Vulkan did not return any display modes matching the current desktop mode");
        return SDL_FALSE;
    }
Comment 40 Manuel Alfayate Corchete 2020-12-06 11:16:14 UTC
(In reply to wahil1976 from comment #39)
> This code should do the trick. Use it after you get the number of display
> modes.
>     dispModeProps = calloc(sizeof(VkDisplayModePropertiesKHR),mode_count);
>     vkGetDisplayModePropertiesKHR(gpu, display, &mode_count,dispModeProps);
>     for (int i = 0; i < mode_count; i++)
>     {
>         if (dispModeProps[i].parameters.visibleRegion.width ==
> dispdata->mode.hdisplay
>             && dispModeProps[i].parameters.visibleRegion.height ==
> dispdata->mode.vdisplay)
>             {
>                 mode_props = dispModeProps[i];
>                 break;
>             }
>     }
>     free(dispModeProps);
>     if (mode_props.parameters.visibleRegion.width == 0
>         || mode_props.parameters.visibleRegion.height == 0)
>     {
>         SDL_SetError("Vulkan did not return any display modes matching the
> current desktop mode");
>         return SDL_FALSE;
>     }

I had figured that out already, but thanks (dispModeProps in your code must be initialized because VkDisplayModePropertiesKHR structs seem to come with garbage values at least here).

I am now trying to figure out how to display, let's say, a 320x240 window (or any other window sizes with no matching video modes) without an intermediate buffer.

Thing is:
-vkCreateDisplayPlaneSurfaceKHR() accepts displayMode and imageExtent parameters. 
-displayMode has the physical video mode to use.
-imageExtent, as far as I understand, should let me specify the size of the image to scanout, which should be independent from displayMode.

...But I get VK_ERROR_SURFACE_LOST_KHR again if I try an imageExtent smaller than the displayMode.parameters.visibleRegion.
Any ideas on this one?
Comment 41 wahil1976 2020-12-06 16:44:30 UTC
That's strange. But I would guess the smaller the ImageExtent values the larger the image and the larger the ImageExtent values the smaller the image.
Comment 42 Manuel Alfayate Corchete 2021-01-15 21:49:19 UTC
This can be closed, Vulkan support is in place.