I recently opened Outlook only to discover that it would not connect to my Exchange mailbox server. Everything else seemed to be working correctly, so I assumed that one of my Exchange 2007 servers had failed. But they were still online and had been for several months.
When the mailbox servers checked out, I thought that a service had failed or that the mailbox database had been dismounted. Unfortunately, neither of these was the cause. Next, I thought the connection trouble might stem from a network card failure, so I pinged the IP address of each server. All of the pings were successful.
Next, I looked at the servers’ event logs for clues. They showed that, although all of the network links were functioning properly, a communications failure was occurring.
Checking hub transport server and cluster node communication
A majority node set cluster, which is the cluster type that cluster continuous replication (CCR) uses, requires that a majority of the cluster nodes be functional for the cluster to retain quorum. CCR uses only two cluster nodes, so a single node failure would leave no majority. To get around this, CCR relies on a special share hosted on the hub transport server.
This share is known as the file share witness. A file share witness acts as a third cluster node; if a cluster node fails, the remaining node can maintain quorum by treating the file share witness as if it were a cluster node.
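The majority rule described above can be sketched in a few lines. This is only an illustration of the arithmetic, not how the cluster service is implemented; the function name has_quorum is hypothetical, and the sketch simplifies matters by counting the file share witness as one ordinary vote.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Return True when strictly more than half of all votes are online."""
    if total_votes <= 0 or not 0 <= votes_online <= total_votes:
        raise ValueError("vote counts are out of range")
    return 2 * votes_online > total_votes


# Two CCR nodes plus the file share witness give three votes in total
# (simplifying assumption: the witness counts as one ordinary vote).
TOTAL_VOTES = 3

print(has_quorum(2, TOTAL_VOTES))  # one node down, witness reachable
print(has_quorum(1, TOTAL_VOTES))  # one node alone, witness unreachable
print(has_quorum(1, 2))            # two nodes, no witness, one node down
```

The last call shows why two nodes alone are not enough: with only two votes, losing either node leaves exactly half, which is not a majority.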
After examining my cluster node logs, I discovered that neither node could communicate with the file share witness. Both nodes reside on physical hardware, whereas the file share witness was running within a virtual machine (VM). This had never been a problem before, so I thought nothing of it. The error message I received stated:
Cluster resource “File Share Witness (\\mirage\FSM_DIR_CCR)” in clustered service or application “Cluster Group” failed (Figure 1).
The error message was clear, but I was still quite confused. I verified that the hub transport server that was acting as the file share witness was functional and accessible. Hoping that the problem was a fluke, I rebooted my hub transport server. Unfortunately, that didn’t do the trick.
Once the hub transport server was back online, I pinged the server from each cluster node -- first by IP address, then by computer name. All of the pings were successful.
Troubleshooting Exchange permissions problems
I tried to manually access the shared folder that the file share witness was using, but it was inaccessible. To find out whether the problem was limited to that share’s permissions, I tried to access a different shared folder on the same server. Access to that folder was denied as well.
To further isolate the problem, I went to a different computer on my network and tried to access the same shared folders. I was able to access the folders without issue, so I went back to one of my cluster nodes and tried to access the shared folder again.
On my second try, I substituted the hub transport server’s IP address for the server’s name. The shared folder was accessible when I used the computer’s IP address, but not when I used the computer’s name. What complicated matters was the fact that I had already confirmed that my DNS server was resolving the computer name properly.
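The name-versus-address comparison above can be set up with a short script. This is a minimal sketch rather than the procedure actually used: the helper names are hypothetical, the server name and share name are taken from the error message, and the IP address is a placeholder from the documentation range.

```python
import socket
from typing import Callable


def name_resolves_to(name: str, expected_ip: str,
                     resolver: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True when `name` resolves to `expected_ip` (IPv4)."""
    try:
        return resolver(name) == expected_ip
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def unc_paths(name: str, ip: str, share: str) -> tuple[str, str]:
    """Build the name-based and IP-based UNC paths for the same share."""
    return (f"\\\\{name}\\{share}", f"\\\\{ip}\\{share}")


# "mirage" and "FSM_DIR_CCR" come from the error message; the IP address
# is a hypothetical placeholder.
by_name, by_ip = unc_paths("mirage", "192.0.2.10", "FSM_DIR_CCR")
print(by_name)
print(by_ip)
```

A True result from name_resolves_to only rules out name resolution as the culprit. As this case shows, the name-based path can still fail while the IP-based path works, so both paths are worth trying side by side.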
Double checking servers’ log files
I decided to look back through some of the other log files on my servers. As I did, I discovered that the electricity had gone out for about 12 hours while I was away. Although the servers are on battery backups, they all shut down as the batteries ran low. When the power came back on, all the servers were automatically brought back online. This fact helped lead me to the root cause of my problem.
The communication failure was the product of a date and time mismatch between my file share witness and cluster nodes. The host server that runs the file share witness VM also runs several other VMs, including my primary domain controller. Like all servers, the host server has an internal battery that maintains the server’s low-level configuration, including the clock. That battery had died, so when I lost power, the server lost track of the date and time.
When the power came back on, all of the virtual servers that were running on the host had a consistent but incorrect date and time. The cluster nodes that were running on the physical hardware had retained the correct date and time.
Because the clock failure happened on a host server, all VMs running on that server remained in sync with each other. This provided the illusion that there was nothing wrong with the clocks. I finally resolved the problem by resetting the server clocks.
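The illusion described above is easy to reproduce with a small clock comparison. The sketch below is illustrative only: the function name, every host name except mirage, and all timestamps are hypothetical, and the five-minute tolerance is an assumption rather than a figure from this incident.

```python
from datetime import datetime, timedelta

MAX_SKEW = timedelta(minutes=5)  # assumed tolerance, not taken from the incident


def skewed_hosts(clocks: dict[str, datetime], reference: str,
                 max_skew: timedelta = MAX_SKEW) -> list[str]:
    """List hosts whose clock differs from the reference host by more than max_skew."""
    ref_time = clocks[reference]
    return sorted(host for host, t in clocks.items()
                  if abs(t - ref_time) > max_skew)


# Hypothetical readings: the physical nodes kept the right time, while the
# VMs on the affected host share the same wrong time.
physical = {
    "node1": datetime(2009, 6, 1, 9, 0, 0),
    "node2": datetime(2009, 6, 1, 9, 0, 20),
}
virtual = {
    "mirage": datetime(2005, 1, 1, 0, 14, 0),
    "dc1": datetime(2005, 1, 1, 0, 14, 5),
}

print(skewed_hosts(virtual, "dc1"))                    # VMs compared with each other
print(skewed_hosts({**physical, **virtual}, "node1"))  # VMs compared with a physical node
```

Comparing the VMs only with each other reports nothing wrong, while comparing them with a physical node flags every VM on the host. That is why the reference clock should sit outside the group being checked.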
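While it’s well known that bad things can happen if your server clocks fall out of sync, this particular problem proved quite challenging. Even though the clocks were to blame, the symptoms were more in line with a DNS name resolution failure. A likely explanation for that pattern is authentication: connecting by computer name normally uses Kerberos, which rejects requests when the clocks differ by more than a few minutes (five by default), whereas connecting by IP address typically falls back to NTLM, which does not depend on the clocks.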
ABOUT THE AUTHOR
Brien M. Posey, MCSE, is a seven-time Microsoft MVP for his work with Windows 2000 Server, Exchange Server and IIS. He has served as CIO for a nationwide chain of hospitals and was once in charge of IT security for Fort Knox. For more information visit www.brienposey.com.