Freezing VDI sessions with an Nvidia Tesla T4 GPU

Currently I’m working on a VMware Horizon VDI project where I am responsible for the Windows 10 Golden Image (GI) and distribution of the applications through the Golden Image, AppStacks or ThinApps.

The hardware used in the pilot environment was almost the same as in the production environment; the only difference was the Nvidia GPU card. The pilot environment had the Nvidia Tesla P40 and the production environment has the Nvidia Tesla T4. The Tesla T4 is now the more common card for density, replacing the older Nvidia Tesla M10, although the choice still depends on the bare-metal server you bought or will buy. The pilot environment with the Tesla P40 at my customer was sponsored by the hardware vendor.

While we didn’t have any issues during the pilot, things were different when we went to production. Users noticed that their VDI sessions froze at irregular moments. The freezes most commonly happened in these situations:

  • At VDI session startup: within 30-90 seconds the session froze and only the desktop with the taskbar was shown.
  • After reconnecting to an existing VDI session: the session screen stayed black.

We analyzed the issue by digging into the Horizon Client, Agent and Connection Server logs, but couldn’t find any related errors or warnings that could explain the freezing. The only thing that froze was the Blast/PCoIP connection: when we connected through RDP with the local admin account, we could still get into the non-persistent desktop.

After some searching on the VMware Community forums, Reddit and the Nvidia forum, I found a post from MaorZ on the Nvidia forum that looked very similar to our issue (reference: https://gridforums.nvidia.com/default/topic/9657/vdi-machines-with-nvidia-tesla-t4-profile-are-stuck-when-using-horizon-client). A reply from Justin (vdiguywi) was very helpful, and other administrators and consultants responded in that topic that they had the same issue. Nvidia Support’s advice is to set a registry value for the Nvidia Windows Kernel Mode Driver (nvlddmkm). Because the issue is specific to the Tesla T4 GPU card, and the GPU card was the only difference between our pilot and production environments, it seemed likely that this would resolve our issue as well. To test whether it really fixed the problem, we added the registry value to our Windows 10 Golden Image, sealed the image, created a snapshot and pushed it to a test VDI pool. Over two days, no users reported freezing during startup or reconnecting anymore.

ROOT CAUSE:
On the first Horizon session, the vGPU driver thinks it is not installed properly and reloads, which causes the screen to freeze. A reboot of Windows is then required to finish the driver reload, but with my non-persistent instant clones the VM is deleted when the user logs out, so every connection is a first session.

SOLUTION:
Create the following registry value "NVFBCEnable"=dword:00000001 under [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\nvlddmkm]. My advice is not to set it through a Windows GPO; just create it while building your Windows 10 Golden Image. A reboot is required after setting the registry value.

A good moment is right after installing your Nvidia Windows driver. Our imaging process is based on PowerShell, so I used this command:

New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\services\nvlddmkm" -Name "NVFBCEnable" -Type dword -Value 1 | Out-Null
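
If you want the imaging step to be safe to rerun and to confirm the result, the command above could be wrapped as in the sketch below. This is only an illustration, not part of Nvidia's advice: the existence check, the use of -Force and the read-back are my own assumptions about what a build script might want, and it assumes an elevated PowerShell session inside the Golden Image with the Nvidia driver already installed.

```powershell
# Sketch only: assumes an elevated session in the Golden Image,
# run after the Nvidia Windows driver has been installed.
$path = 'HKLM:\SYSTEM\CurrentControlSet\services\nvlddmkm'
$name = 'NVFBCEnable'

# The nvlddmkm service key is created by the driver installer;
# stop early if it is missing instead of creating an orphan key.
if (-not (Test-Path -Path $path)) {
    throw "Registry key '$path' not found. Is the Nvidia driver installed?"
}

# -Force overwrites an existing value, so the step can be rerun
# without New-ItemProperty failing on "property already exists".
New-ItemProperty -Path $path -Name $name -PropertyType DWord -Value 1 -Force | Out-Null

# Read the value back to confirm it was written as expected.
$value = (Get-ItemProperty -Path $path -Name $name).$name
if ($value -ne 1) {
    throw "$name is '$value', expected 1."
}

Write-Host "$name is set to $value. Reboot before sealing the image."
```

The reboot itself is left out on purpose, because where it fits depends on the rest of your image build sequence.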