CLI madness part I – performance health check for ESX in 5 minutes

Quite a few customers and partners have asked about quick ways to do a live performance health check for VMware ESX environments. The short answer is to evaluate the vCenter Operations Suite! It is a very intuitive tool that enables VI admins to pinpoint the culprits that are causing issues in a given infrastructure. For customers that are more CLI driven and do not yet have the luxury of purchasing vCenter Operations (I am always a big fan of the command line, as it is geeky, fast, and fun!), there is always esxtop for some quick spot checks. Below is a quick guide on how to use esxtop to do a 5-minute health check of your ESX environment end to end (compute, network & storage). Part II of this blog will add more color to storage-side data analysis for the Nimble array.

To get started, establish an ssh session to the ESX host of interest, and then issue the ‘esxtop’ command.

NOTE: It is always good practice to change the default display interval to “2” seconds for a more frequent data refresh (the default value is 5). Change this by typing “s”; at the prompt to enter a value in seconds, type in “2” and press ENTER.

The initial screen shows CPU usage across the board. Type the following keys to switch between CPU, memory, network, HBA adapter, LUN, and VM virtual disk performance stats:
• c – CPU
• m – memory
• n – network
• d – disk adapter (in our example screenshots below, it’ll be vmhba33/37, depending on which one represents the sw iSCSI adapter)
• u – all LUNs
• v – virtual machine disks (VMDKs)

Quick spot check for CPU usage:
Type ‘c’ to switch to the CPU view. Pay attention to the “%RDY” time for all the VMs. If the value is greater than 10%, the ESX server is CPU bound, meaning the virtual machine vCPU is ready to execute instructions, but the VMkernel was not able to schedule it on a physical CPU (PCPU). For example, if VM A has a %RDY time of 10, then 1 out of 10 times it has to wait for the VMkernel to schedule a physical CPU for it to execute instructions.

Quick spot check for memory usage:
Type ‘m’ to switch to the memory view. Always check how much free memory is available. More importantly, see whether the VMkernel is swapping. Typically, if the VMkernel is swapping to disk, the ESX server is memory bound, potentially due to too much memory overcommitment.

Quick spot check for network usage:
Type ‘n’ to switch to the network view. For Nimble Storage, it is especially important to identify the VMkernel ports here. They are typically ‘vmk1’ and ‘vmk2’, as ‘vmk0’ is used by the ESX management port. For both vmk1 and vmk2, pay attention to “%DRPTX” and “%DRPRX” to see what percentage of packets are dropped during transmit or receive. If the ESX host is pegged with high CPU utilization, there is a good chance of a high percentage of packet drops. The throughput numbers are also useful to see how much traffic is being transmitted, both in packets per second and in Mb/sec.

Quick spot check for storage usage:
Keys to remember:
d – vmhba stats view
v – all virtual machines view. If a given VM has multiple VMDKs, type “e” to expand the view to include all VMDKs; after typing ‘e’, esxtop will prompt you to enter the GID of the VM of interest. Simply type in the corresponding GID of the VM to see stats for all of its VMDKs.
u – all datastores view
For ‘d’ (vmhba stats), be sure to identify the vmhba # for the sw iSCSI adapter (you can find this either in vCenter Server or from the CLI by typing ‘# esxcli iscsi adapter list’).
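For reference, here is roughly what that CLI lookup looks like (the output below is only a sketch; the adapter name, vmhba33 in this example, will vary from host to host):

# esxcli iscsi adapter list
Adapter   Driver     State   UID             Description
-------   ---------  ------  --------------  ----------------------
vmhba33   iscsi_vmk  online  iscsi.vmhba33   iSCSI Software Adapter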
While in this view, type ‘f’ to add additional fields to display, then type ‘d’, ‘h’, and ‘i’, where ‘d’ adds queue stats, ‘h’ adds read latency stats, and ‘i’ adds write latency stats. In this view, pay close attention to the following fields:
MBREAD/s (read throughput)
MBWRTN/s (write throughput)
DAVG/cmd (storage device latency per SCSI command, in ms)
KAVG/cmd (kernel latency per SCSI command, in ms)
DAVG/rd (storage device latency for read IO, in ms)
DAVG/wr (storage device latency for write IO, in ms)
KAVG/rd (kernel latency for read IO, in ms)
KAVG/wr (kernel latency for write IO, in ms)

The throughput and “DAVG/cmd” values should match what you see in the Nimble performance monitoring UI (or stats output).

If you need to drill down on an individual volume/datastore, switch to the datastore view by pressing ‘u’. The tricky part here is that the device name is displayed using the NAA ID of the volume.

esxtop in batch mode
All of the above is useful for a live health check. If you want to collect the stats in batch mode from each ESX host for deeper analysis, below are the steps (a consolidated example is shown at the end of this post).

First, run esxtop in batch mode:
# cd /tmp
# esxtop -b -d 2 > esxtopresult.csv

The above command runs esxtop in batch mode for an unspecified amount of time. It is useful in cases where the customer is not clear how long a given performance test takes. If the customer knows ahead of time how long their repro/performance test runs, then use -n to specify the number of iterations esxtop should run. For example, if the customer wants to run a 10-minute test, the command is the following (10 minutes is 600 seconds, and the command collects stats every 2 seconds, so 600/2 yields 300):

# esxtop -b -d 2 -n 300 > esxtopresult.csv

Now that the esxtopresult.csv file has been generated, find a way for the customer to get the file to you. Once the file has arrived, the following tools are best for analyzing the results:
• Perfmon (Windows Performance Monitor)
• MS Excel (as the results are in .CSV format)
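To tie the batch collection steps together, here is a minimal sketch for a 10-minute run. Note that the -a flag (which tells esxtop to capture all counters for offline analysis) and the gzip step (to shrink the file before sending it) are optional additions of mine, not part of the steps above:

# cd /tmp
# esxtop -b -a -d 2 -n 300 > esxtopresult.csv
# gzip esxtopresult.csv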