So, this was a weird issue.
I'd been on vacation for a couple of weeks. On the first weekend, I received an alert that our vCenter server was down. The vCenter server is itself a VM, in a cluster which it manages, so this is always fun.
I tracked down the host it was running on (always a challenge when DRS is enabled), logged in to that host via the vSphere Client, and discovered there was an outstanding question for the VM regarding failed connectivity to a datastore. The datastore in question was an NFS share containing an ISO that was configured for the virtual DVD drive. To my knowledge the drive was only configured, not connected, but I can't confirm that now.
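If you've never had to do this with vCenter down, the trick is to ask each host directly. Something like the sketch below works from a management box, assuming SSH is enabled on the hosts; the host names are placeholders, not our real ones:

    # Ask each ESXi host in the cluster whether it holds the vCenter VM.
    # esx01..esx06 are placeholder host names; adjust to your environment.
    for host in esx01 esx02 esx03 esx04 esx05 esx06; do
      echo "== $host =="
      ssh root@$host 'vim-cmd vmsvc/getallvms | grep -i vcenter'
    done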
At any rate, I told it to ignore the error and disconnect from the datastore, after which the VM came back online, vCenter was working, and everything was roses and rainbows. Or so I thought.
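You can also answer that kind of pending question from the host shell instead of the client. A rough sketch; the VM ID, question ID, and answer index below are illustrative, so use whatever vim-cmd actually prints for your VM:

    # Show the pending question for the VM (ID 42 is a placeholder
    # taken from vim-cmd vmsvc/getallvms):
    vim-cmd vmsvc/message 42

    # Answer it with the numbered choice that ignores/disconnects.
    # The question ID (_vmx1) and choice (1) here are illustrative only.
    vim-cmd vmsvc/message 42 _vmx1 1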
When I got back from vacation today, I discovered that a number of VMs hadn't been backed up in my absence due to the inability to create a snapshot. The VMs in question (about 25% of our total production VMs at one site) spanned all the hosts in our production cluster, and all the datastores in that cluster as well. And there were VMs on every host and datastore that were not affected. Even within the backup configuration there was nothing common between the affected VMs: every backup sub-client was affected, but not all the VMs in any sub-client were.
I've seen the symptoms before, but not under the current versions of our backup software (Simpana v10) and vCenter (v5.5 Update 2). In vCenter, the virtual disk shows as zero bytes, and you cannot take a snapshot, either via vCenter or from the host CLI. On the host there are multiple snapshot files, but both the host and vCenter consider the VM to have no snapshots, so you can't remove or consolidate them. The vmx file for the VM shows the current VMDK as the most recent snapshot image. All the VMDKs for the affected VMs were in the same state. Fortunately, the affected VMs continued to run normally, which is why nobody noticed.
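The mismatch is easy to see from the host shell. Roughly like this, where the VM name, datastore, and VM ID are all placeholders:

    # In the VM's directory on its datastore (names are placeholders):
    cd /vmfs/volumes/prod-ds-01/myvm

    # Plenty of snapshot delta files on disk...
    ls -lh *-delta.vmdk

    # ...and the vmx points the disk at the latest snapshot descriptor...
    grep -i 'scsi0:0.fileName' myvm.vmx

    # ...yet the host insists the VM has no snapshots:
    vim-cmd vmsvc/snapshot.get 42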
I worked through the problem with one development VM, but none of the "fixes" I've used in the past got me past it. I eventually did a manual consolidation, using vmkfstools to clone the last snapshot to a new VMDK (which worked fine), then attached the new VMDK to the VM by manually editing the vmx file (since vCenter wouldn't let me delete the virtual disk). But snapshots still failed, with no particularly evident error. The VM would power on and function fine; we just couldn't get a snapshot, which meant we couldn't back up the VM.
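For anyone who wants to try the same consolidation, it went roughly like this; the file names and VM ID are placeholders, and the VM should be powered off first:

    # Clone the newest snapshot descriptor; vmkfstools follows the whole
    # chain, so the result is a fully consolidated standalone disk.
    vmkfstools -i myvm-000042.vmdk myvm-clone.vmdk -d thin

    # Point the vmx at the new disk by editing the line:
    #   scsi0:0.fileName = "myvm-clone.vmdk"
    # then make the host re-read the config:
    vim-cmd vmsvc/reload 42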
Ultimately, I discovered that the NFS services on our NFS file server were screwed up. Note that the VMs had no files on the NFS datastore; all the VM files are on shared Fibre Channel SAN. The only files on the NFS store are the ISO files we use as virtual media. However, once I restarted the NFS services on the file server, the problem resolved itself. I could create a snapshot on each VM, after which all the "phantom" snapshots came into focus in vCenter, and I could do a "delete all" to clean them up. It took a while, as there were between 30 and 50 snapshots per affected VMDK.
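The cleanup itself is scriptable. A rough sketch; the restart command assumes a Linux NFS server, which may not match your filer, and the VM ID is again a placeholder:

    # On the NFS file server (assuming Linux; adjust for your platform):
    service nfs restart        # or: systemctl restart nfs-server

    # Back on the host, one successful snapshot makes the phantom chain
    # visible again, and a delete-all consolidates everything:
    vim-cmd vmsvc/snapshot.create 42 cleanup "flush phantom snapshots" 0 0
    vim-cmd vmsvc/snapshot.removeall 42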
So, there you have it. Why problems connecting to an NFS datastore that only stores ISO images would affect the ability to snapshot some, but not all, VMs is beyond me. But it certainly caused a lot of head scratching.
If anyone else runs into a similar issue, I hope this helps.