I came across a problem where vSAN reported an error on one of the virtual disks (VMDKs) attached to the vSAN witness appliance, specifically the cache disk.
The vSAN health check reported an “Operational Health Alarm” and all of the VM objects showed “Reduced availability with no rebuild”.
Strangely enough, the underlying storage was not reporting any errors or problems, and other VMs on the same datastore were all fine.
As there was no time for a proper investigation, the decision was made to restart the witness appliance VM.
When the witness came back online, everything returned to normal and the errors were gone.
One thing I noticed was that the witness appliance had accidentally been included in a snapshot-based (quiesced) backup policy, and the backup job had started a few hours before the incident. It crossed my mind that the problem might have something to do with quiesced snapshots.
I tried to reproduce the problem in my lab and managed to recreate the same issue.
I generated some I/O in my stretched vSAN cluster and triggered some resync operations by changing storage policy settings.
At the same time, I started creating quiesced snapshots of the witness appliance VM (a sketch of this is shown below). After a while, the same error appeared in my lab.
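For anyone who wants to try the same thing, here is a minimal pyVmomi sketch of the snapshot part of my repro. The vCenter address, credentials and the VM name “vsan-witness” are placeholders for my lab setup; quiesce=True is what makes VMware Tools quiesce the guest file system, mimicking a snapshot-based backup job.

    #!/usr/bin/env python3
    # Repeatedly take and delete quiesced snapshots of the witness appliance VM.
    import ssl
    import time
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab only: skip certificate checks
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="***", sslContext=ctx)
    try:
        view = si.content.viewManager.CreateContainerView(
            si.content.rootFolder, [vim.VirtualMachine], True)
        witness = next(vm for vm in view.view if vm.name == "vsan-witness")
        for i in range(10):
            # quiesce=True asks VMware Tools to quiesce the guest file system,
            # which is what a quiesced, snapshot-based backup does
            WaitForTask(witness.CreateSnapshot_Task(
                name="repro-%d" % i, description="quiesced snapshot test",
                memory=False, quiesce=True))
            WaitForTask(witness.snapshot.currentSnapshot.RemoveSnapshot_Task(
                removeChildren=False))
            time.sleep(60)  # let some vSAN I/O happen between snapshots
    finally:
        Disconnect(si)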
The following alarm appeared on the witness appliance’s nested ESXi host:
The vSAN health reported “Operational health error” – permanent disk failure:
And “vSAN object health” showed “Reduced availability with no rebuild”:
The witness was inoperable at this stage, and since it is a vital component of a stretched vSAN cluster, the whole environment was affected. Of course, the existing VMs kept running, as vSAN is a robust solution and can handle such situations (the cluster still had quorum). However, without “Force Provisioning” enabled in the storage policy, no new objects (VMs, snapshots, etc.) could be created.
Further investigation of the logs (vmkernel.log and vmkwarning.log) on the witness appliance revealed problems with access to the affected disk (vmhba1:C0:T0:L0).
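If you want to pull these entries out quickly, something as simple as the following will do (the device path is the one from my witness appliance; adjust it for yours):

    #!/usr/bin/env python3
    # Filter vmkernel.log and vmkwarning.log for the affected device path.
    DEVICE = "vmhba1:C0:T0:L0"  # adapter:channel:target:LUN of the cache disk

    for logfile in ("/var/log/vmkernel.log", "/var/log/vmkwarning.log"):
        with open(logfile, errors="replace") as f:
            for line in f:
                if DEVICE in line:
                    print(logfile + ": " + line.rstrip())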
That proved the problem was indeed related to the virtual disk and was caused by the snapshot.
I tried to fix it by rescanning the storage adapter, but to no avail, so I decided to reboot the appliance.
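For reference, the rescan can also be done programmatically. Here is a minimal pyVmomi sketch, assuming a direct connection to the nested witness ESXi host; the host name, credentials and adapter name are placeholders:

    #!/usr/bin/env python3
    # Rescan a storage adapter on the (nested) witness ESXi host via pyVmomi.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect

    ctx = ssl._create_unverified_context()  # lab only: skip certificate checks
    si = SmartConnect(host="witness.lab.local", user="root", pwd="***",
                      sslContext=ctx)
    try:
        # Connected directly to an ESXi host, the inventory tree holds exactly
        # one datacenter, one compute resource and one host
        host = (si.content.rootFolder.childEntity[0]
                .hostFolder.childEntity[0].host[0])
        # Equivalent to a storage adapter rescan from the UI
        host.configManager.storageSystem.RescanHba("vmhba1")
    finally:
        Disconnect(si)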
Once the appliance was online again, the “Operational health error” disappeared.
However, there were still 7 objects with “Reduced availability with no rebuild”.
After examining these objects, it turned out that their witness components were missing. Fortunately, this was quite easy to fix using the “Repair Object Immediately” option in vSAN health.
It looks like taking snapshots of the vSAN witness appliance not only makes no sense (I can’t think of a single use case for it) but can also cause serious problems in the environment.
There is a configuration parameter that can prevent such accidents from happening: “snapshot.maxSnapshots”.
If it is set to “0” at the VM level, it effectively disables snapshots for that VM, so I would strongly advise setting it on the vSAN witness appliance (a quick sketch follows below).
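The parameter can be set from the vSphere Client (VM > Edit Settings > Advanced Parameters) or programmatically. Below is a minimal pyVmomi sketch; as before, the vCenter address, credentials and the VM name “vsan-witness” are placeholders for your environment:

    #!/usr/bin/env python3
    # Set snapshot.maxSnapshots = 0 on the witness VM to disable snapshots.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab only: skip certificate checks
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="***", sslContext=ctx)
    try:
        view = si.content.viewManager.CreateContainerView(
            si.content.rootFolder, [vim.VirtualMachine], True)
        witness = next(vm for vm in view.view if vm.name == "vsan-witness")
        # Add the advanced parameter to the VM's extra configuration
        spec = vim.vm.ConfigSpec(extraConfig=[
            vim.option.OptionValue(key="snapshot.maxSnapshots", value="0")])
        WaitForTask(witness.ReconfigVM_Task(spec))
    finally:
        Disconnect(si)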