Session “Troubleshooting vSphere 6 Made Easy: Expert Talk – INF9205R” was presented by Ragavendra Kumar who works as a Technical Support Supervisor at VMware and Abhilash Kunhappan who is a Staff Technical Support Engineer at VMware. The following areas were covered:
What are the troubleshooting options available for customers. And yes, the error messages mentioned are actually much better compared to a few years ago:)
The log files will tell you what the problem is; the trick is to understand how to read and correlate them. Helpful tools for this are vRealize Log Insight (vRLI) and/or the vSphere Management Assistant (vMA).
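To make the correlation idea a bit more concrete, here is a minimal shell sketch that interleaves several log files by timestamp. It assumes every entry starts with an ISO 8601 UTC timestamp (e.g. 2015-09-01T10:00:01.000Z), which is the format I would expect in ESXi 6 logs such as hostd.log and vmkernel.log. merge_logs is a hypothetical helper name, not a VMware tool, and continuation lines without a timestamp are simply dropped.

```shell
# merge_logs FILE...  (hypothetical helper, not a VMware tool)
# Assumption: every log entry begins with an ISO 8601 UTC timestamp,
# so a plain byte-wise sort puts the lines in chronological order.
merge_logs() {
    for f in "$@"; do
        name=$(basename "$f")
        # Keep only timestamped lines and tag each one with its source file.
        sed -n "s/^\([0-9][0-9][0-9][0-9]-[^ ]*\) /\1 [$name] /p" "$f"
    done | LC_ALL=C sort
}
```

Something like `merge_logs /var/log/hostd.log /var/log/vmkernel.log /var/log/vobd.log | less` then gives one combined timeline to read through. For anything bigger than a single host, vRLI is the better fit.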
Heads up: If you have a VMFS corruption, always reach out to support rather than following anything you might find in KB articles.
Before digging into the different sections, I want to make you aware of the command localcli, in case you are not already familiar with it. When you use esxcli for configuration or troubleshooting and it hangs, it usually means that the ESXi process hostd has a problem. localcli is your friend here since it bypasses hostd.
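The fallback described above can be sketched as a small wrapper. This is only an illustration: run_cli and the variables ESXCLI, LOCALCLI and CLI_TIMEOUT are names I made up, and it assumes a timeout utility that is called as `timeout SECONDS COMMAND` (some BusyBox builds may expect `timeout -t SECONDS COMMAND` instead, so check yours first).

```shell
# Hypothetical wrapper: try esxcli first, fall back to localcli.
# ESXCLI, LOCALCLI and CLI_TIMEOUT are illustrative names, not VMware settings.
ESXCLI="${ESXCLI:-esxcli}"
LOCALCLI="${LOCALCLI:-localcli}"
CLI_TIMEOUT="${CLI_TIMEOUT:-30}"

run_cli() {
    # Assumption: `timeout SECONDS COMMAND` syntax is available.
    if timeout "$CLI_TIMEOUT" "$ESXCLI" "$@"; then
        return 0
    fi
    echo "esxcli failed or timed out, retrying with localcli" >&2
    # localcli takes the same namespaces but does not go through hostd.
    "$LOCALCLI" "$@"
}
```

For example, `run_cli network ip interface list` would only reach localcli when the esxcli call fails or hangs for more than 30 seconds. The timeout value is arbitrary; pick whatever fits your environment.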
vCenter Server
The most common issues are:
- vCenter Server Upgrade
- Challenges with SSL Certificates – Totally agree 🙂
- Linked Mode Configuration Issues – One or more vCenter Servers are missing from the inventory
- Both internal & external DB issues
- Crashes
An important first step is to identify whether the vCenter Server and the Platform Services Controller (PSC) run on the same VM or not. For the Windows-based vCenter Server you can run the following command to find that out:
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\vCenter Server" /v INSTALL_TYPE
The output will be either:
- Embedded – meaning both vCenter Server and PSC are installed on the same server
- Infrastructure – meaning vCenter Server and PSC are not installed on the same server
Another command you can use is the one below, which gives you the vCenter Server build number:
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\vCenter Server" /v BuildNumber
Some useful tools available are listed below.
- certificate-manager – Manage PSC and vCenter Server certificates
- certool – Manage certificates
- cmsso-util – repoint, reconfigure vCenter Server, unregister and move services across PSCs.
- dir-cli – to manage solution users, certificates, passwords
- service-control --list – List the vCenter Server services (and there are a lot)
- vdcadmintool – Test LDAP connectivity and reset passwords
- vdcrepadmin – show, add, remove replication between PSCs
- ssolscli listServices – List registered services with vCenter Server 5.x
- lstool.py – List registered services with vCenter Server 6.x
Log File Locations
ESXi Server
Relationship between vpxd (vCenter Server Process) – vpxa (ESXi agent) – hostd (ESXi process) was described.
The most common issues are:
- ESXi crash
- ESXi host disconnected from vCenter Server or not responding
- ESXi upgrade to version 6
- HA agent not getting configured
- VM issues including e.g. Power operations, Snapshot Consolidation
Available ESXi tools
- esxcli – Config & monitoring. Can also use localcli
- esxtop – Performance monitor & troubleshooting
- vim-cmd – ESXi & VM config. Useful to verify if hostd is fine.
- vm-support – Generate log bundle
- vmdumper – VM core dump management
- vmkfstools – VMFS and VM disk management. Hidden option -t 10 (meaning you increase the verbosity by 10 times)
- vmkping – Check network connectivity for vmkernel interfaces
Log files:
One log file I tend to forget about is vobd.log, so don’t forget that one 🙂
Networking
As always, if anything fails blame the network 🙂
Let’s again start with the most common issues:
- Network performance problems
- No connection to the ESXi host, or packet drops are experienced
- Ports closed in firewall between ESXi and vCenter Server causing communication issues or ESXi host disconnect.
- VM does not have network connectivity
- VM loses network connectivity after power cycle or vMotion operation – Solved by disabling and re-enabling the vNIC
- vMotion failing – If it fails between 1% and 10% (11%), it is caused by a network problem.
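The vMotion rule of thumb from the list above is easy to encode as a first triage step. This is just a sketch of what was said in the session, not an official diagnostic; vmotion_failure_hint is a hypothetical helper name and the percentage is whatever the failed task reported.

```shell
# vmotion_failure_hint PERCENT  (hypothetical helper)
# Encodes the session's rule of thumb: a failure at 1-11% points at the network.
vmotion_failure_hint() {
    pct=$1
    if [ "$pct" -ge 1 ] && [ "$pct" -le 11 ]; then
        echo "likely network: check vmkernel connectivity, e.g. with vmkping"
    else
        echo "no network hint from the percentage alone: check the logs"
    fi
}
```

For example, `vmotion_failure_hint 9` points at the network, while `vmotion_failure_hint 65` just sends you back to the logs.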
Troubleshooting commands:
- esxcli network – Configure & monitor
- esxcfg-nics or esxcfg-vmknic
- esxtop option n – Performance monitor & troubleshooting
- ethtool – Identify device driver setting and see network statistics
- pktcap-uw – Capture packets on both uplink and vmkernel interfaces
  - pktcap-uw -h | more
  - pktcap-uw --vmk vmk0 -o pktcap.pcap
- vsish
  - vsish -e get /net/portsets/vSwitch<x>/ports/port-ID/vmxnet
  - vsish -e get /net/pNics/vmnic<x>/stats
- tcpdump-uw – Capture packets on a vmkernel interface
  - tcpdump-uw -i vmk0 -s 1514 -w traffic.pcap
Storage
This part started by explaining the well-known DAVG/cmd, KAVG/cmd and GAVG/cmd counters (device, kernel and guest latency per command), which you can inspect via esxtop.
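As a reminder of how these three counters relate: GAVG is roughly the sum of DAVG and KAVG, so comparing the two parts tells you whether the time is spent in the storage device/fabric or in the VMkernel. Below is a minimal sketch of that arithmetic. latency_breakdown is a hypothetical helper name, the values are in milliseconds as read from esxtop, and I deliberately left out any "good/bad" thresholds since none were given here.

```shell
# latency_breakdown DAVG KAVG  (hypothetical helper, values in ms from esxtop)
# Uses the relationship GAVG ~= DAVG + KAVG and reports the larger contributor.
latency_breakdown() {
    awk -v d="$1" -v k="$2" 'BEGIN {
        g = d + k
        side = (d >= k) ? "device" : "kernel"
        printf "GAVG=%.2f ms, larger share: %s\n", g, side
    }'
}
```

For example, `latency_breakdown 12.5 0.3` reports a GAVG of 12.80 ms with the device side as the larger share, which would send you towards the array or fabric rather than the host.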
The most common issues are:
- All Paths Down (APD)
- Missing LUN
- VMFS Datastore inaccessible or not visible
- SCSI Reservation Conflicts – Should no longer occur thanks to e.g. ATS
- Storage or datastore perf issues
Troubleshooting commands include:
- esxcli storage – Configure & monitor
- esxtop option d – Performance monitor & troubleshooting
  - enable read and write latency via the f option
- vmkchdev – Map HBA devices to PCI slots
- vmkload_mod – View, load & unload HBA drivers
- partedUtil – List, create, recreate partition table on disks
  - partedUtil getptbl /vmfs/devices/disks/naa.XYZ
- voma – Check VMFS metadata (vSphere On-Disk Metadata Analyzer)
  - voma -m vmfs -d /vmfs/devices/disks/naa.XYZ -s /tmp/analyze.txt
- vmkfstools – VMFS and VM disk management