Channel: VMware Communities : Blog List - All Communities

Stretching your SQL Server over datacenters without worries… Proving All-Flash Virtual SAN Resiliency and SQL performance on Stretched Cluster


All-Flash Virtual SAN Resiliency

Virtual SAN provides enterprise availability on cost-effective x86 servers, with built-in failure tolerance for data protection. Placing your application on a Virtual SAN Stretched Cluster, which replicates synchronously between two data centers, ensures no data loss and near-zero downtime even if an entire site fails.

To validate Virtual SAN resilience, we measured the impact of disk, disk group, and host failures while running one 200GB TPC-E-like OLTP workload on a 4-node All-Flash Virtual SAN cluster. The results show that none of the component failures caused a database or application outage, and that TPS recovered within a few hundred seconds.

All-Flash Virtual SAN Specifications

VMware VSAN Disk Group Specification (per Host)

  • SSD: 2 x 400GB Solid State Drive (Intel SSDSC2BA40) as Cache SSD
  • SSD: 8 x 400GB Solid State Drive (Intel SSDSC2BX40) as Capacity SSD

SQL Server Testing Configuration (VM)

  • Windows Server 2012 R2
  • 1 x 200GB TPC-E like database

Regular Cluster Resiliency (disk, disk group and host failure)

We tested the disk, disk group, and host failures shown in the figure below. Before each validation, we verified that the user database was stored on the capacity SSD. We injected a permanent disk error into a capacity SSD to simulate a disk failure and into a cache SSD to simulate a disk group failure, and we powered off one node in the cluster to simulate a host failure.

failedcomponents.png

All-Flash Virtual SAN handles component failures differently once Deduplication and Compression is enabled in Virtual SAN 6.2. With Deduplication & Compression enabled, a single disk failure makes the whole disk group inaccessible, because Deduplication & Compression operates at the disk group level.

Without Deduplication & Compression enabled:

  • A single physical disk failure in the capacity SSD group caused TPS to drop from 1813 to an average of 1643, and the recovery time was ~160 seconds
  • One disk group failure caused TPS to drop from 1853 to an average of 1659, and the recovery time was ~430 seconds
  • One host failure caused TPS to drop from 1568 to an average of 1312, and the recovery time was ~715 seconds

Failure Type    TPS before Failure    Average TPS after Failure    Time to recover to steady-state TPS after failure (sec)
Disk            1,813                 1,643                        160
Disk group      1,853                 1,659                        430
Host            1,568                 1,312                        715

 

With Deduplication & Compression enabled:

  • A single physical disk failure or one disk group failure caused TPS to drop from 1760 to an average of 1510, and the recovery time was ~540 seconds
  • One host failure caused TPS to drop from 1590 to an average of 1211, and the recovery time was ~560 seconds

Failure Type          TPS before Failure    Average TPS after Failure    Time to recover to steady-state TPS after failure (sec)
Disk or disk group    1,760                 1,510                        540
Host                  1,638                 1,211                        560
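The relative TPS drop for each failure type follows directly from the figures in the two tables. The snippet below is a minimal sketch of that arithmetic in Python. The names `RESULTS`, `tps_drop_percent`, and `summarize` are illustrative only and are not part of any VMware tooling. For the host failure with Deduplication & Compression enabled, it assumes the table value (1,638) as the TPS before failure.

```python
# Figures copied from the two result tables.
# Each entry: (TPS before failure, average TPS after failure, recovery seconds)
RESULTS = {
    "without dedup/compression": {
        "disk": (1813, 1643, 160),
        "disk group": (1853, 1659, 430),
        "host": (1568, 1312, 715),
    },
    "with dedup/compression": {
        "disk or disk group": (1760, 1510, 540),
        # Assumption: uses the table value for TPS before failure.
        "host": (1638, 1211, 560),
    },
}


def tps_drop_percent(before, after):
    """Relative TPS drop, as a percentage of the pre-failure TPS."""
    return (before - after) / before * 100.0


def summarize(results):
    """Return (configuration, failure type, drop %, recovery seconds) rows."""
    rows = []
    for config, failures in results.items():
        for failure, (before, after, recovery_s) in failures.items():
            drop = round(tps_drop_percent(before, after), 1)
            rows.append((config, failure, drop, recovery_s))
    return rows


if __name__ == "__main__":
    for config, failure, drop, recovery_s in summarize(RESULTS):
        print(f"{config:28s} {failure:20s} drop {drop:5.1f}%  recovery {recovery_s}s")
```

With these figures, the drop ranges from about 9.4% for a single disk failure without Deduplication & Compression to about 26.1% for a host failure with it enabled.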

 

Stretched Cluster Resiliency (site failure)

We validated the stretched cluster resiliency of All-Flash Virtual SAN using a four-node All-Flash cluster running four SQL Server virtual machines with two database sizes: two 200GB databases and two 500GB databases. To drive the workload, we used Benchmark Factory's TPC-E. Before the failure emulation, the inter-site latency was 2ms and the latency from each site to the witness was 200ms. The test confirmed that the All-Flash cluster could continue to serve all four SQL Servers after one site went down.

sitefailure.png

All-Flash Virtual SAN Stretched Cluster Specifications

SQL Server Testing Configuration (per VM)

  • Windows Server 2012 R2
  • Storage Footprint:
    • 2 x 200GB TPC-E like database
    • 2 x 500GB TPC-E like database

Host placement and roles

  • Site A: 2 physical hosts
  • Site B: 2 physical hosts
  • Site C: 1 witness appliance

Site latencies

  • 2ms (Site A to Site B)
  • 200ms (Site C to Site A/B)

Before the site-down emulation, the aggregated TPS was 7743. After one site failed and the four virtual machines moved to the two remaining servers, the TPS fell to 5856, a degradation of around 24.5%. The average disk read latency increased to 4ms. Because there was no longer a mirrored write to the cache SSD, the average write latency of the data virtual disks decreased from 7ms to 3.6ms.
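As a worked check of the degradation figure, using only the two aggregated TPS values quoted above:

```latex
\[
\text{degradation}
  = \frac{\text{TPS}_{\text{before}} - \text{TPS}_{\text{after}}}{\text{TPS}_{\text{before}}}
  = \frac{7743 - 5856}{7743}
  = \frac{1887}{7743}
  \approx 0.244
\]
```

That is roughly 24.4%, close to the approximate percentage reported for the site failure.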

 

All-Flash Virtual SAN Stretched Cluster Performance

By introducing network latency for Virtual SAN, we emulated inter-site latencies of 1ms, 2ms, and 4ms. As the inter-site latency increased, we observed a reduction in aggregated TPS.

stretchedcluster-tps.png

We measured that VMware Virtual SAN disk write latency increased from 2ms to 10ms as inter-site latency increased. The average log virtual disk write latency increased from 5ms to 15ms. For mission-critical applications, we DO NOT recommend deploying a SQL Server database across a stretched cluster with more than 2ms of inter-site latency.

stretchedcluster-datalatency.png

stretchedcluster--loglatency.png
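The 2ms recommendation can be expressed as a simple pre-deployment check. The sketch below is illustrative only: `StretchedCluster`, `within_recommendation`, and `MAX_RECOMMENDED_INTER_SITE_MS` are hypothetical names, not a VMware API, and the sample values are the host counts and latencies listed for the tested configuration.

```python
from dataclasses import dataclass

# Threshold taken from the recommendation in the text: no more than 2 ms of
# inter-site latency for mission-critical SQL Server databases.
MAX_RECOMMENDED_INTER_SITE_MS = 2.0


@dataclass(frozen=True)
class StretchedCluster:
    """Hypothetical description of a stretched cluster layout."""
    hosts_site_a: int
    hosts_site_b: int
    inter_site_latency_ms: float
    witness_latency_ms: float


def within_recommendation(cluster: StretchedCluster) -> bool:
    """True if the inter-site latency does not exceed the recommended limit."""
    return cluster.inter_site_latency_ms <= MAX_RECOMMENDED_INTER_SITE_MS


# The tested layout: 2 hosts per data site, 2 ms between sites, 200 ms to the witness.
tested = StretchedCluster(
    hosts_site_a=2,
    hosts_site_b=2,
    inter_site_latency_ms=2.0,
    witness_latency_ms=200.0,
)

if __name__ == "__main__":
    print("tested layout within recommendation:", within_recommendation(tested))
```

A layout like the 4ms emulation would fail this check, which matches the latency increases observed at higher inter-site latency.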

Summary

Resiliency against failure scenarios is the top priority for mission-critical applications running on SQL Server. Virtual SAN with all-flash storage provides strong resiliency against component failures. An All-Flash Stretched Cluster provides reasonable OLTP performance across physically dispersed data centers and keeps applications running even when one site fails, further enhancing the resiliency of software-defined storage.

This blog is a preview of a comprehensive reference architecture paper that will be published very soon. Stay tuned.

