In the project I'm currently involved in, the network team had planned an upgrade of the Cisco 4500 switches located in both datacenters. These switches connect the new server hardware to the current environment. The new hardware consists of a print, a file and a mail cluster, among other services. The print and file clusters have their SAN disks in the first datacenter (A), the mail cluster in the other datacenter (B). During the upgrade, for some reason a complete shutdown of the switch in the first datacenter was required. It was a planned intervention and the environment is still in the build phase, so this was not an issue. We, the ones responsible for the server setup, didn't stay to check the health of the clusters. We just expected the file and print services to fail over to the other datacenter. Not!
When the switch in datacenter A was powered off at 18.00, the cluster nodes for file and print in datacenter A lost their network. Hence they stopped their services, disks and network names so the nodes in datacenter B could start them. I intended to provide a small network diagram, but I have no time for this.
As the Cisco 4500 in datacenter A was powered off, the Cisco 4500 in datacenter B had to learn new network routes to other LAN segments, such as the one containing the DNS servers. Learning these new routes takes up to about 6 minutes, because RIP is currently in use as the routing protocol. During these crucial 6 minutes the cluster nodes in datacenter B tried to start the resources. Bringing the disks online was no issue, and the same goes for the IP addresses. Bringing the CAPs (Client Access Points) online failed, so basically no services were available. The process of bringing the network names online failed because Active Directory couldn't be contacted. The domain controllers themselves were available, as they are located on VLANs which are routable by the Cisco 4500. But to be able to contact AD the cluster nodes required DNS, which they didn't have because they lacked a network route to that specific subnet for 6 minutes.
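The dependency described above (no route to the DNS subnet means no name resolution, which means no AD) is easy to check from a node after a network intervention. Below is a minimal Python sketch of such a check, not something the cluster service does itself. The function name `wait_for_dns` and its parameters are made up for illustration; the only real API used is `socket.getaddrinfo` from the standard library. The resolver, sleep and clock are passed in as arguments so the loop can be tried out without touching the network.

```python
import socket
import time


def wait_for_dns(hostname, timeout_s=600.0, interval_s=15.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until `hostname` resolves.

    Returns the number of seconds waited, or None if the name still
    did not resolve once `timeout_s` seconds had passed.
    (Hypothetical helper, not part of any cluster tooling.)
    """
    start = clock()
    while True:
        try:
            # Any successful answer means a DNS server was reachable.
            resolve(hostname, None)
            return clock() - start
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            pass
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Calling `wait_for_dns("dc1.domain.local")` on a node right after the switch comes back would show roughly how long the node was without name resolution. With a route convergence of up to 6 minutes, the default timeout of 10 minutes is an assumption that leaves some margin.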
So when we arrived the morning after, we found our file and print clusters offline. At first we wondered why the clusters didn't try to bring their network names online again at a later interval. After all, the network was "healthy" all night. KB947172 explains why. In short: if a resource fails once on the first node, followed by a failure on the other node, manual intervention is required.
Below is a copy-paste from the cluster log when the check succeeds:
INFO [RES] Network Name <vdmprinters>: Initiating the Network Name operation : 'Verifying computer object associated with network name resource printcluster'
INFO [RES] Network Name <vdmprinters>: Trying to find computer account printcluster object GUID(b4a849281d4b47d30af3681b8590a20e) on any available domain controller.
INFO [RES] Network Name <vdmprinters>: Found computer account printcluster on domain controller \dc1.domain.local.
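If you want to check quickly whether this verification succeeded for each network name, the "Found computer account" line is the one to look for. Here is a small Python sketch that pulls it out of a set of log lines. The regular expression is derived only from the three lines shown above, so treat it as an assumption about the log format and not a complete parser; `find_verified_accounts` is a name made up for this example.

```python
import re

# Pattern based solely on the success line shown in this post.
FOUND_RE = re.compile(
    r"Network Name <(?P<resource>[^>]+)>: "
    r"Found computer account (?P<account>\S+) "
    r"on domain controller (?P<dc>\S+?)\.?$"
)


def find_verified_accounts(log_lines):
    """Return (resource, account, domain controller) for each success line."""
    results = []
    for line in log_lines:
        match = FOUND_RE.search(line.strip())
        if match:
            results.append((match["resource"], match["account"], match["dc"]))
    return results
```

A network name resource that shows up in the log without a matching "Found computer account" line is a good candidate for a closer look.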
So the point of this story: although the cluster service no longer requires an Active Directory user account, you still need AD AND DNS to be around at all times, especially during a cluster failover.