Heads Up! 3D graphics enabled on ESXi 6.0 U1 can cause PSOD


This issue was brought to me by my former colleague Toni Lindberg, meaning I have not seen it myself. However, I thought it was critical enough that I wanted to share it with you.

Toni’s customer runs vSphere 6.0 U1 (upgraded from ESXi 6.0 GA with the latest patches) and Horizon View 6.1.1, and the environment is pretty graphics-intensive, using NVIDIA K2 cards. They use both the built-in 3D feature and vGPU.

After upgrading to ESXi 6.0 U1 they started seeing ESXi purple screens of death (PSOD), which had never occurred in this environment before.

The following errors were found in the ESXi host vmkernel and vmkwarning log files:

  • 2015-0X-YYT06:16:18.341Z cpu19:111396)WARNING: Heartbeat: 781: PCPU 13 didn’t have a heartbeat for 21 seconds; *may* be locked up.
  • 2015-0X-YYT06:16:18.341Z cpu13:219369)ALERT: NMI: 681: NMI IPI recvd. We Halt. eip(base):ebp:cs [0x343a63(0x418034e00000):0x43940749b890:0x4010](Src0x1, CPU13)
  • 2015-0X-YYT06:16:18.341Z cpu19:111396)World: 9729: PRDA 0x418044c00000 ss 0x4018 ds 0x4018 es 0x4018 fs 0x0 gs 0x0
  • 2015-0X-YYT06:16:18.341Z cpu19:111396)World: 9731: TR 0x4000 GDT 0xfffffffffc60a000 (0xffff) IDT 0xfffffffffc608000 (0xffff)
  • 2015-0X-YYT06:16:18.341Z cpu19:111396)World: 9732: CR0 0x80050031 CR3 0x6de47f4000 CR4 0x42668
  • 2015-0X-YYT06:16:18.341Z cpu13:219369)0x43940749b790:[0x418035143a63]Printf_WithFunc@vmkernel#nover+0x6b7 stack: 0x3f00000002
  • 2015-0X-YYT06:16:18.341Z cpu13:219369)0x43940749b890:[0x418035144093]vsnprintf@vmkernel#nover+0x33 stack: 0x1
  • 2015-0X-YYT06:16:18.341Z cpu13:219369)0x43940749b8b0:[0x418034e9b922]vmk_StringFormat@vmkernel#nover+0x72 stack: 0x43940749b920
  • 2015-0X-YYT06:16:18.341Z cpu13:219369)0x43940749b920:[0x418034e8f3b6]Util_FormatTimestampUTC@vmkernel#nover+0x9a stack: 0x417f00000010
  • 2015-0X-YYT07:31:13.652Z cpu32:146698)WARNING: VmMemPf: vm 146115: 654: COW copy failed: pgNum=0x290f42, mpn=0x3fffffffff
  • 2015-0X-YYT07:31:13.652Z cpu32:146698)WARNING: VmMemPf: vm 146115: 654: COW copy failed: pgNum=0x290f42, mpn=0x3fffffffff

A non-maskable interrupt (NMI) is sent because a PCPU has locked up. In both cases the PCPU was handling the vmx-svga module, which relates to graphics, at the time of the failure.
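If you want to check your own hosts for the same pattern, a small script can pull the stalled PCPU and the halted CPU out of the log lines. This is only a minimal sketch: the regular expressions are modelled on the two log lines quoted above, the function name `find_lockups` is my own, and other ESXi builds may word these messages differently.

```python
import re

# Pattern for the "no heartbeat" warning, modelled on the log line quoted above.
# The apostrophe may be straight or curly depending on how the log was copied.
HEARTBEAT_RE = re.compile(
    r"cpu(?P<reporter>\d+):(?P<world>\d+)\)WARNING: Heartbeat: \d+: "
    r"PCPU (?P<pcpu>\d+) didn[’']t have a heartbeat for (?P<secs>\d+) seconds"
)
# Pattern for the NMI alert logged by the CPU that was halted.
NMI_RE = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)ALERT: NMI: \d+: NMI IPI recvd"
)


def find_lockups(lines):
    """Return ({stalled PCPU: seconds without heartbeat}, {CPUs that got an NMI})."""
    stalled, halted = {}, set()
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            stalled[int(m.group("pcpu"))] = int(m.group("secs"))
        m = NMI_RE.search(line)
        if m:
            halted.add(int(m.group("cpu")))
    return stalled, halted


# Two lines taken from the excerpt above (shortened).
sample = [
    "2015-0X-YYT06:16:18.341Z cpu19:111396)WARNING: Heartbeat: 781: "
    "PCPU 13 didn’t have a heartbeat for 21 seconds; *may* be locked up.",
    "2015-0X-YYT06:16:18.341Z cpu13:219369)ALERT: NMI: 681: NMI IPI recvd. We Halt.",
]
stalled, halted = find_lockups(sample)
print(stalled, halted)
```

In the excerpt, the PCPU reported as stalled (13) is the same CPU that logs the NMI alert, which is what you would expect when the NMI is sent to the locked-up PCPU.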

A VMware SR was opened, and support responded that they are aware of the problem and will release a fix in the second half of Q4. The PSOD is related to an internally known issue: it is caused by heavy usage of vmx mapped memory (3D) and relates to a problem with the retry logic.

The NVIDIA card was not the root cause of the problem; it is the use of the ESXi 3D feature that causes it.
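To get a feel for which VMs might be exposed, you could check their configuration for 3D support. The sketch below is not from Toni's case: it assumes 3D support shows up in a VM's .vmx file as the `mks.enable3d` setting, and the helper name `has_3d_enabled` is hypothetical. Treat it as a starting point and verify it against your own environment.

```python
import re

# Assumption: 3D support is recorded in the .vmx file as  mks.enable3d = "TRUE".
ENABLE3D_RE = re.compile(
    r'^\s*mks\.enable3d\s*=\s*"?(\w+)"?\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def has_3d_enabled(vmx_text):
    """Return True if the .vmx text sets mks.enable3d to TRUE."""
    m = ENABLE3D_RE.search(vmx_text)
    return bool(m) and m.group(1).upper() == "TRUE"


# Hypothetical .vmx fragment for illustration only.
example_vmx = 'displayName = "vdi-01"\nmks.enable3d = "TRUE"\n'
print(has_3d_enabled(example_vmx))
```

You would feed this the contents of each .vmx file on the datastore; a VM without the setting is reported as not using 3D.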

