Hi All ...
Today, we will talk about Microsoft SQL Server Virtualization. SQL Databases have been in our data centers since 1990s carrying TBs of data that are stored and retrieved by many and many of applications. Nowadays, no business application can run without a SQL back-end DB. This leads to a "Spree" of databases in our data centers, pushing Microsoft SQL Server to be one of our most critical applications that spans both production and non-production regions. The latest editions now are Microsoft SQL Server 2012 and 2014.
We'll talk about how to virtualize Microsoft SQL Server using vSphere 5.x platform and how to leverage new SQL Server features, like: Always-on Availability Groups (AAG) and Data Mirorring, to provide the required level of performance and availability. Best practices mentioned here is collected from different sources that are mentioned in References section below. I'll follow the same schema of my previous posts and relate these best practices to our five Design Qualifiers (AMPRS - Availability, Manageability, Performance, Recoverability and Security) in addition to Scalability.
Availability:
1-) Try to use vSphere HA in addition to SQL Always-on Availability Group (AAG) to provide the highest level of availability. Adapt a protection policy of N+1, as N is the number of AAG members VMs in vSphere Cluster of N+1 hosts. In case of an ESXi failure, in the background, vSphere HA powers-on the failed virtual machine on another host, restoring the failed AAG member and bringing the newly passive (failed old active node) database up to date and ready to take over in case of a failure, or to be manually reactivated as the primary active database.
2-) Try to separate your SQL AAG VMs on different Racks, Blade Chassis and Storage Arrays if available using VMs Affinity/Anti-affinity rules and Storage Affinity/Anti-affinity rules for the most possible availability.
3-) For SQL AAG VMs, and when using VMs anti-affinity rules to separate them over different hosts, vSphere HA will not respect the anti-affinity rule, but on the following DRS invocation, the VM will be migrated by DRS to respect the rule. In vSphere 5.5, configure the vSphere Cluster with “das.respectVmVmAntiAffinityRules” set to 1 to respect all VMs Anti-affinity rules. For licensing limits, use Host-VM must affinity rule to force VMs to run on licensed hosts only. Keep in mind that, Must Rules are respected even in case of HA invocation, i.e. HA will not restart a failed VM if it can’t respect a Must Rule. In this case, It’s recommended to separate your licensed hosts on different Racks, Blade Chassis and Power Supplies for more availability.
4-) If SQL AAG isn’t available, there’re many other High-Availability technologies to be used in SQL Clusters:
Technology | Automatic Failover | No. of Secondaries | Readable Secondaries | RPO | RTO | vSphere Compatibility |
Always-on Availability Group (AAG)-Synchronous Mode | Yes | Max. 2 nodes | Yes | 0 | Seconds | Totally compatible and supported with vSphere HA, DRS and vMotion |
Always-on Availability Group (AAG)-Asynchronous Mode | No | Max. 4 nodes | Yes | Seconds | Minutes | Totally compatible and supported with vSphere HA, DRS and vMotion |
Always-on Failover Cluster Instances | Yes | Max. 4 nodes | No | 0 | Seconds | Requires certain configuration to be supported on vSphere Clusters. Totally supported with HA but isn’t supported for vMotion and Automatic DRS. For more info, check Virtualizing Microsoft Clustering Services (MSCS)-Windows 2012 on vSphere Best Practices. |
Data Mirroring – High Safety Mode with automatic Failover | Yes | Max. single node | No | 0 | Seconds | Totally compatible and supported with vSphere HA, DRS and vMotion |
Data Mirroring – High Safety Mode without automatic Failover | No | Max. single node | No | 0 | Minutes | Totally compatible and supported with vSphere HA, DRS and vMotion |
Data Mirroring – High Performance Mode | No | Max. single node | No | Seconds | Minutes | Totally compatible and supported with vSphere HA, DRS and vMotion |
Log Shipping | No | N/A | No | Minutes | Minutes-Hrs | Totally compatible and supported with vSphere HA, DRS and vMotion |
Backup/Restore | No | N/A | No | Minutes-Hrs | Hrs-Days | Totally compatible and supported with vSphere HA, DRS and vMotion |
Keep in mind that, all of AAG, Log Shipping or Data Mirroring require DB Full Recovery Mode, which may lead to higher storage growth.
5-) As Microsoft supports vMotion of SQL VMs, use DRS Clusters in Fully Automated Mode. It’ll always load balance your SQL VMs across the cluster, respecting all of your configured affinity/anti-affinity rules. Keep in mind that, Software Assurance Program by Microsoft is needed for unlimited vMotion ability. Review Manageability section below.
6-) vMotion:
a- Leverage Multi-NIC vMotion feature to make vMotion process much faster even for large SQL VMs.
b- Try to enable Jumbo Frames on vMotion Network to reduce overhead and increase throughput.
c- SQL AAG members are very sensitive to any latency or drop in its heartbeat network and hence, any vMotion operation may cause a false fail over during the switch between the source and destination hosts –although this drop is really negligible and characterized by single ping-packet drop. That’s why we need to set SQL AAG Cluster Heartbeat setting “SameSubnetDelay” to 2 seconds (2000 milliseconds). In some cases when leveraging Multi-NIC vMotion and enabling Jumbo Frames on vMotion Network, changing heartbeat setting isn't necessary.
7-) Try to leverage VM Monitoring to mitigate the risk of Guest OS failure. VMware Tools inside SQL VMs will send heartbeats to HA driver on the host. If it’s stopped because Guest OS failure, the host will monitor IO and network activity of the VM for certain period. If there’s also no activity, the host will restart the VM. This add additional layer of availability for SQL VMs. For more information about VM Monitoring:
vSphere HA VM Monitoring - Back to Basics | VMware vSphere Blog - VMware Blogs
8-) Try to leverage Symantec Application HA Agent for SQL with vSphere HA for max. availability. Using Application HA, the monitoring agent will monitor SQL instance and its services, sending heartbeats to HA agent on ESXi host. In case of application failure, it may restart services or mount databases. If Application HA Agent can’t recover the application from its failure, it’ll stop sending heartbeats and the host will initiate a VM restart as a HA action. For more information about Symantec Application HA, check this pdf from VMware.
Performance:
1-) Try to leverage Resources Governor limits and pools in conjunction with using vSphere Resource Pools for more control over the compute resources presented to SQL Server. Generally, it’s better to adjust resources settings from SQL Server Resources Governor itself, rather than vSphere Resource Pools. SQL Server Resources Governor can now create up to 64 resource pools and can sit affinity rules on which CPUs or NUMA nodes to process similar to vSphere Resource-controlling capabilities.
2-) Disable virus scanning on SQL DB and logs as it may lead to a performance impact on your SQL DB.
3-) CPU Sizing:
a- As most of SQL VMs underutilize their vCPUs (95% of SQL servers are under 30% utilization as stated by VMware Capacity Planner), it’s preferred –if available- to choose physical CPUs with higher clock and lower number of cores. This reduces the license costs as the number of physical cores is reduced and doesn't affect the cores performance due to higher clock speed and more MHz to share between the low-utilization SQL VMs. Keep in mind that, you have to do your homework of capacity planning and performance monitoring to maintain your performance baseline before deciding. You can also useSPECint2006 Resultsto compare between CPUs.
b- Don’t over-commit CPUs. It’s better to keep it nearly 1:1. In some cases like small environments, over-commitment is allowed after establishing a performance baseline of normal-state utilization.
c- Enable Hyper-threading when available. It won’t double the processing power –in opposite to what shown on ESXi host as double number of logical cores- but it’ll give a CPU processing boost up to 20-25% in some cases and these added logical cores won’t be taken into licensing consideration (Max. Virtualization per Core Licensing Model). Don’t consider it when calculating Virtual:Physical Cores ratio. In some cases of Tier-0/1 SQL Servers, setting VM Setting ”Hyperthreading Sharing” to None –disabling Hyper-threading for this VM- may lead to better performance than enabling Hyper-threading. Test both options and choose what fits your performance requirements.
d- ESXi Hypervisor is NUMA aware and it leverages the NUMA topology to gain a significant performance boost. Try to size your SQL VMs to fit inside single NUMA node to gain the performance boost of NUMA node locality.
e- In case of large SQL VMs that won’t fit inside single NUMA node, keep in mind that SQL 2012/2014 are vNUMA-aware, so it’s recommended to show underlying physical NUMA topology to SQL VMs. It’s applicable for large VMs with 8 or more vCPUs or can be enabled for small VMs using advanced VM Setting “numa.vcpu.min”. For more information about Virtual NUMA (vNUMA), check my blog post:
Part IV - Virtual Machines Settings - Advanced/Troubleshooting Notes & Tips:
3-) Memory Sizing:
a- Don’t over-commit memory, as SQL 2012/2014 is a memory-intensive application. If needed, reserve the configured memory to provide the required performance level. Keep in mind that memory reservation affects many aspects, like: HA Slot Size, vMotion chances and duration. In addition, reservation of memory removes VM swapfiles from datastores and hence, its space is usable for adding more VMs. For some cases, where a lot of underutilized SQL Servers there, over-commitment is allowed to get higher consolidation ratios. Performance monitoring is mandatory in this case to maintain a baseline of normal-state utilization.
b- Don’t disable Balloon Driver installed with VMware Tools. Ballooning is the last line of defense of ESXi Host before compression and swapping to disk when its memory becomes too low. When Ballooning is needed, Balloon Driver will force Guest OS to swap the idle memory pages to disk to return the free memory pages to the Host, i.e. swapping is done according to Guest OS techniques. Swapping to disk can be done by ESXi Host itself. Host’d swap memory pages from physical memory to VM swap file on disk without knowing what these pages contents or if these pages are required (active) or idle. Ballooning effect on performance is somehow much lower than Swapping to disk. Generally speaking as mentioned in the previous point, don’t over-commit memory for business critical SQL VMs and if you’ll do some over-commitment, don’t push it to these extreme limits.
c- SQL Server 2012/2014 tends to use all the configured memory of the VM. This may lead to some performance issues cause any other application installed –like: backup agent or AV agent- wouldn't find adequate memory to operate. It’s recommended to use “Max Server Memory” parameter from inside SQL itself and set it to Configured Memory minus 4GB to allow some memory for Guest OS and any 3rd-party Applications.
d- Set “Min Server Memory” to define a min amount of memory for SQL server to acquire for processing. “Min. Sever Memory” will not immediately be allocated at start-up. However, after memory usage has reached this value due to client load, SQL Server will not free memory unless the minimum server memory value is reduced. This can help in high-consolidation-ratios scenarios, as when Host is memory-contended and Balloon Driver at its full swing to free memory pages from Guest OS’es, “Min. Server Memory” will prevent Balloong Driver from inflating and taking this defined memory space.
e- Use Large Memory Pages for Tier-1 SQL VMs. SQL Server supports the concept of Large Memory Pages when allocating memory for some internal structures and the buffer pool, when the following conditions are met:
- You are using SQL Server Enterprise Edition.
- VM has 8GB or more of physical RAM.
- The Lock Pages in Memory privilege is set for the service account. Check this Microsoft article.
To enable all SQL Server buffer to use Large Memory pages, you should start your 64bit SQL Server with “trace flag 834”. To enable “trace flag 834”, check:Tuning options for SQL Server when running in high performance workloads.
This will lead to:
- With large pages enabled in the guest operating system, and the virtual machine is running on a host that supports large pages, vSphere does not perform Transparent Page Sharing on the VM’s memory unless hosts reached Memory Hard state.
- With trace flag 834 enabled, SQL Server start-up behavior changes. Instead of allocating memory dynamically at runtime, SQL Server allocates all buffer pool memory during start-up. Therefore, SQL Server start-up time can be significantly delayed.
- With trace flag 834 enabled, SQL Server allocates memory in 2MB contiguous blocks instead of 4KB blocks. After the host has been running for a long time, it might be difficult to obtain contiguous memory due to fragmentation. If SQL Server is unable to allocate the amount of contiguous memory it needs, it can try to allocate less, and SQL Server might then run with less memory than you intended.
This only should be enabled on Tier-1 dedicated SQL VMs with memory reservation and no memory over-commitment. Any memory over-commitment technique can lead to performance issues with trace flag 834. For more information:SQL Server and Large Pages Explained - MSDN Blogs
f- SQL Server uses all the configured memory as a cache of its queries and it manages its configured memory by its own techniques. Active Memory counter from vSphere (Web) Client may not reflect the actual usage of SQL Server memory. Use in-guest memory counters and SQL Server memory counters for more accurate monitoring.
4-) Storage Sizing:
a- If possible, use SQLIOSIM tool –found in Binn folder found under your SQL instance installation path- to simulate your SQL DB IO load on your backend storage to establish a performance baseline of your backend storage array. It’s recommended to run it in non-business hours as it loads your HBAs, NICs and back-end array. Check:How to use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem
b- Always consider any storage space overhead while calculating VMs space size required. Overhead can be: swapfiles, VMs logs or snapshots. It’s recommended to add 20-30% of space as an overhead.
c- Follow Microsoft SQL Storage Best Practices:SQL Server Storage Top 10 Best Practices
d- Separate different SQL VMs’ disks on different –dedicated if needed- datastores to avoid IO contention, as SQL is an IO-intensive application.
e- Provide at least 4 paths, through two HBAs, between each ESXi host and the Storage Array for max. availability.
f- For IP-based Storage, enable Jumbo Frames on its network end-to-end. Jumbo Frames reduces network overhead and increases throughput.
g- RDM can be used in many cases, like: SQL P2V migration or to leverage 3rd Party array-based backup tool. Choosing RDM disks or VMFS-based disks are based on your technical requirements. No performance difference between these two types of disks.
h- Use Paravirtual SCSI Driver in all of your SQL VMs, specially disks used for DB and Logs, for max. performance, least latency and least CPU overhead. Check Paravirtual SCSI Controller Compatibility here.
i- Distribute any SQL VM disks on the four allowed SCSI drivers for max. performance paralleling and higher IOps. It’s recommended to use Eager-zeroed Thick disks for DB and Logs disks.
j- The following table shows the Read/Write behavior of each of SQL DB components:
DB Component | Read/Write | RAID Recommended |
DB | Read Intensive | RAID 5 |
Logs | Write Intensive | RAID 1/10 |
tempdb | Write Intensive | RAID 10 |
k- Allocate "tempdb" for each of your SQ VM vCPU. It’s recommended if these "tempdbs" are on fast SSD array for better overall performance.
l- Partition Alignment gives a performance boost to your back-end storage, as spindles will not make two reads or writes to process single request. VMFS5 created using vSphere (Web) Client will be aligned automatically as well as any disks formatted using newer versions of Windows. Any upgraded VMFS datastores or upgraded versions of Windows Guests will require a partitions alignment process. For upgraded VMFS, it’s done by migrate VMs disks to another datastore using Storage vMotion, then format and recreate the datastore on VMFS5.
m- Enable Instant Initialization feature for speeding up the growth or creating new databases. It forces Windows to initialize the DB file before zeroing all of the blocks in its allocated size, then Windows pads zeros to it as long as a new writes occur. Log files disks can’t be instantly initializaed. Check: Database Instant File Initialization
5-) Network:
a- Use VMXNet3 vNIC in all SQL VMs for max. performance and throughput and least CPU overhead.
b- SQL VMs port group should have at least 2 physical NICs for redundancy and NIC teaming capabilities. Connect each physical NIC to a different physical switch for max. redundancy.
c- Consider network separation between different types of networks, like: vMotion, Management, SQL production, SQL Replication, Fault Tolerance, etc. Network separation is either physical or logical using VLANs.
d- Clustered SQL VMs should have two vNICs, one for public network and the other one for heartbeat and replication network. It’s better to dedicate a physical NIC on ESXi hosts for replication network between Clustered SQL VMs, specially when using Synchronous-commit mode AAGs or Data Mirroring in High-safety mode.
6-) Monitoring:
Try to establish a performance baseline for your SQL VMs and VI by monitoring the following:
- ESXi Hosts and VMs counters:
Resource | Metric (esxtop/resxtop) | Metric (vSphere Client) | Host/ VM | Description |
CPU | %USED | Used | Both | CPU used over the collection interval (%) |
%RDY | Ready | VM | CPU time spent in ready state | |
%SYS | System | Both | Percentage of time spent in the ESX/ESXi Server VMKernel | |
Memory | Swapin, | Swapinrate, Swapoutrate | Both | Memory ESX/ESXi host swaps in/out from/to disk (per virtual machine, or cumulative over host) |
MCTLSZ (MB) | vmmemctl | Both | Amount of memory reclaimed from resource pool by way of ballooning | |
Disk | READs/s, WRITEs/s | NumberRead, NumberWrite | Both | Reads and Writes issued in the collection interval |
DAVG/cmd | deviceLatency | Both | Average latency (ms) of the device (LUN) | |
KAVG/cmd | KernelLatency | Both | Average latency (ms) in the VMkernel, also known as Queuing Time‖ | |
Network | MbRX/s, MbTX/s | Received, Transmitted | Both | Amount of data received/transmitted per second |
PKTRX/s, PKTTX/s | PacketsRx, PacketsTx | Both | Received/Transmitted Packets per second | |
%DRPRX, %DRPTX | DroppedRx, DroppedTx | Both | Receive/Transmit Dropped packets per second |
- In-guest Counters:
SQL Server on VMware - Best Practices Guide
Manageability:
1-) SQL Server 2012 Support Statement for Virtualization:Support Policy for Microsoft SQL Server on Virtual Hardware Platform
2-) SQL Server 2012/2014 Editions:SQL Server Editions
3-) SQL Server 2012/2014 Licensing:
2012:SQL Server 2012 Licensing Reference Guide
2014:SQL Server 2014 Licensing Reference Guide
Two licensing approaches: Licensing your VMs or Licensing your physical hosts (Max. Virtualization Approach).
a- Licensing your VMs:
Used usually for small deployments. Two models: Licensing your virtual cores or Server/CAL licensing.
Licensing Virtual Cores | Licensing Server/CAL |
- Min.: 4 virtual cores per VM. - Core Factor doesn’t apply. Hyperthreading is taken into consideration as additional vCPUs must be licensed if used. - Available only for Standard Edition. - Allow for VMs mobility across different hosts using VMware vMotion with Software Assurance (SA) Benefits. | - Each VM will have Server license and each connection requires CAL license. - Available for Standard and Business Intelligence Editions. - Allow for VMs mobility across different hosts using VMware vMotion with Software Assurance (SA) Benefits. |
b- Licensing your Physical Hosts (Max. Virtualization Approach):
Used for large virtual environments. You count only for physical CPU cores with consideration of Core Factor and Hyperthreading isn’t taken into considerations. With SA Benefits, it allows for unlimited number of SQL Server VMs and allows for license mobility as much as possible in your datacenter. Without SA, you’re limited in deploying total vCPUs of SQL VMs less than or equal to number of physical Cores licensed for SQL per host.
4-) SQL Server 2012/2014 Maximums:Max. Capacity Specifications for SQL Server.
5-) Try to have a policy for Change Management in your environment to control the SQL Server VMs sprawl due to high demand on SQL Servers for the different purposes in your environment, like: testing, development, supporting single production application, etc.
6-) Try to leverage monitoring and capacity planning tools, like: Microsoft MAP Tool. It helps significantly in monitoring all your SQL VMs performance and utilization, your DBs sizes and performance trends as well as creating a complete inventory of all your SQL servers, their editions and licenses (This helps a lot in case of P2V, DBs migration to SQL Servers or SQL Server migration to a newer version).
7-) Try to use Upgrade Advisor to perform a full check-up on the old SQL Servers before upgrading to SQL 2012/2014. Keep in mind that it can’t check on OS or some 3rd-party applications installed that may not allow for upgrade. For more information:Use Upgrade Advisor to Prepare for SQL Server Upgrade
8-) Try to leverage Microsoft SQL Best Practices Analyzer to make sure you follow the best practices of deploying SQL Server in your environment.
9-) Try to leverage VMware Converter for faster migration from physical SQL Servers. It’s recommended to use Cold P2V conversion to preserve DBs consistency. Keep in mind to remove unneeded HW drivers from P2V VMs after successful conversions. This is can be done by:
- Change Windows Environmental Variable “devmgr_show_unpresent_devices” to 1.
- From Device Manager, and after all not presented physical drivers are shown, remove all unneeded old drivers.
- Restart the VM.
10-) An availability group listener is a virtual network name (VNN) that directs read-write requests to the primary replica and read-only requests to the read-only secondary replica. Always create an availability group listener when deploying AAG on vSphere. That enables application clients to connect to an availability replica without knowing the name of the physical instance of the SQL Server installation. For more information:Availability Group Listeners
11-) Time Synchronization is one of the most important things in SQL environments. It’s recommended to do the following:
a- Let all your SQL VMs sync their time with DC’s only, not with VMware Tools.
b- Disable time-sync between SQL VMs and Hosts using VMware Tools totally (Even after uncheck the box from VM settings page, VM can sync with the Host using VMware Tools in case of startup, resume, snapshotting, etc.) according to the following KB:
VMware KB - Disabling Time Sync
c- Sync all ESXi Hosts in the VI to the same Startum 1 NTP Server which is the same time source of your forest/domain.
12-) Make sure that you enable “Full Recovery Mode” on all DBs that will be included in your SQL AAGs and also make sure that at least single Full Backup is taken.
Recoverability:
1-) Try to leverage any backup software that uses Microsoft Volume Shadow Service (VSS). These are SQL-aware and don’t cause any corruption in DB due to quiesceing the DB during the backup operation. Ofcourse, one of them is vSphere Advanced Data Protection. Check the following link:VMware vSphere Data Protection FAQ
2-) If available, you can use any backup software that depends on array-based snapshots if it’s SQL-aware.
3-) Use VMware Site Recovery Manager (SRM) if available for DR. With SRM, automated failover to a replicated copy of the VMs in your DR site can be carried over in case of a disaster or even a failure of single critical SQL VM in your environment.
4-) If VMware SRM isn’t available, you can leverage some availability features of SQL itself for more recoverability and DR. You can either use a mix between AAG Synchronous/Asynchronous replicas or Data Mirroring in High safety Mode with Log Shipping in DR Site. This approach leads to lower cost, but with higher management overhead and higher RPO/RTO results than using VMware SRM.
Security:
1-) All security procedures done for securing physical Microsoft SQL Servers should be done in Virtual SQL VMs, like: Role-based Access Policy.
2-) Follow VMware Hardening Guide (v5.1/v5.5) for more security procedures to secure both of your VMs and vCenter Server.
Scalability:
1-) Leverage CPU/Memory Hot add with SQL VMs to scale them as needed. All you need is to use “RECONFIGURE” query using Management Studio to force SQL Server to use the newly added resources.
2-) Try to leverage SQL Sysprep Tool to provide a golden template of your SQL VM. It’ll reduce the time required for deploying or scaling your SQL environment as well as preserve consistency of configuration throughout your environment. For more information, check:Considerations for Installing SQL Server Using SysPrep&Install SQL Server 2014 Using SysPrep
3-) Scale-up Approach of SQL VMs requires a large ESXi Hosts with many sockets and RAM. It reduces the number of VMs required to serve certain number of DBs and hence, a single failed VM will affect a large portion of users. That’s why Scale-up Approach needs a careful attention to availability and usage of SQL AAGs in parallel with vSphere HA. In the same time it reduces the cost of software licenses and physical hosts. Scale-out Approach requires smaller ESXi Hosts and gives a more flexibility in designing a SQL VM, but requires high number of ESXi hosts to provide the required level of availability and more software licenses and hence, more cost. A single VM failure has a less effect using Scale-out Approach and it requires less time for migration using vMotion and hence, DRS will be more effective. There’s no best approach here. It all depends on your environment and your requirements.
Another long post and I'm so sorry for that. I hope this can summarize tons of papers out there covering virtualizing Microsoft SQL Servers on vSphere 5.x platform. Wish you a smooth virtualization experience .
Share the knowledge ...
References:
-- Virtualizing MS Business Critical Applications by Matt Liebowitz and Alex Fontana.
-- SQL Server on VMware - Best Practices Guide
--SQL Server on VMware - Availability & Recovery Options
--VMware DBA Guide to Virtual Databases
-- Infrastructure Planning and Design Guide for Microsoft SQL Server 2008
-- vSphere Design Sybex 2nd Edition by Scott Lowe, Kendrick Coleman and Forbes Guthrie