Purpose:
On a high-latency network, it may be necessary to adjust the heartbeat settings of the DAG cluster. This is likely the case if your DAG spans multiple sites.
There are four cluster properties to be aware of when a DAG runs on a high-latency network:
SameSubnetDelay
SameSubnetThreshold
CrossSubnetDelay
CrossSubnetThreshold
You can see the current values using:
cluster /prop
Use the following commands to lengthen the heartbeat settings. These settings control how long it takes before a database fails over.
To set SameSubnetDelay to its maximum:
cluster /prop SameSubnetDelay=2000:DWORD
To set CrossSubnetDelay to its maximum:
cluster /prop CrossSubnetDelay=4000:DWORD
To set CrossSubnetThreshold to 10 missed heartbeats, or 40 seconds at a 4000 ms delay (can be increased):
cluster /prop CrossSubnetThreshold=10:DWORD
To set SameSubnetThreshold to 10 missed heartbeats, or 20 seconds at a 2000 ms delay (can be increased):
cluster /prop SameSubnetThreshold=10:DWORD
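If you apply the same settings to several DAGs, it can help to generate the commands rather than type them. The following is a minimal sketch: the helper name `build_prop_commands` and its input dictionary are hypothetical, and only the `cluster /prop Name=Value:DWORD` syntax comes from the commands above. It only builds the command strings; it does not run cluster.exe.

```python
# Hypothetical helper: builds "cluster /prop Name=Value:DWORD" command strings
# for the four DAG heartbeat properties. It does not execute anything.

HEARTBEAT_PROPERTIES = (
    "SameSubnetDelay",
    "SameSubnetThreshold",
    "CrossSubnetDelay",
    "CrossSubnetThreshold",
)


def build_prop_commands(settings: dict[str, int]) -> list[str]:
    """Return one 'cluster /prop' command per requested heartbeat setting."""
    commands = []
    for name, value in settings.items():
        # Reject typos so a misspelled property is not silently created.
        if name not in HEARTBEAT_PROPERTIES:
            raise ValueError(f"unknown heartbeat property: {name}")
        if value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value}")
        commands.append(f"cluster /prop {name}={value}:DWORD")
    return commands


if __name__ == "__main__":
    # The same four settings used in the commands above.
    for line in build_prop_commands({
        "SameSubnetDelay": 2000,
        "CrossSubnetDelay": 4000,
        "CrossSubnetThreshold": 10,
        "SameSubnetThreshold": 10,
    }):
        print(line)
```

Paste the printed lines into an elevated command prompt on a DAG member after reviewing them.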
The *SubnetDelay properties set how often cluster heartbeats are sent between DAG nodes. The default for both same subnet and cross subnet is 1000 ms. The maximum is 2000 ms for same subnet and 4000 ms for cross subnet.
The *SubnetThreshold properties set how many heartbeats can be missed before a node is considered failed and a DAG failover is initiated. The default for both same and cross subnet is 5, and the maximum for both is 120.
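The defaults and maxima above can be captured in a small lookup table so that proposed values are checked before any command is run. This is a sketch that assumes the values exactly as stated in this article (newer Windows Server releases may use different defaults); the lower bound of 1 is an assumption for illustration, since only maxima are given here.

```python
# Defaults and maxima as stated in this article. Verify them against the
# documentation for your Windows Server version before relying on them.
LIMITS: dict[str, tuple[int, int]] = {
    # property name: (default, maximum)
    "SameSubnetDelay": (1000, 2000),    # milliseconds between heartbeats
    "CrossSubnetDelay": (1000, 4000),   # milliseconds between heartbeats
    "SameSubnetThreshold": (5, 120),    # missed heartbeats tolerated
    "CrossSubnetThreshold": (5, 120),   # missed heartbeats tolerated
}


def is_within_limits(name: str, value: int) -> bool:
    """True if value lies between 1 (assumed lower bound) and the stated maximum."""
    _default, maximum = LIMITS[name]
    return 1 <= value <= maximum


if __name__ == "__main__":
    for name, (default, maximum) in LIMITS.items():
        print(f"{name}: default {default}, max {maximum}")
    # A same-subnet delay of 4000 ms exceeds the 2000 ms maximum.
    print(is_within_limits("SameSubnetDelay", 4000))
```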
So the default settings allow a 5-second delay before a DAG failover is initiated (1000 ms × 5 = 5 seconds). For our example, if we want to increase that value to 40 seconds because our DAG spans multiple sites across several subnets, we would set the properties as follows:
SubnetDelay=2000
SubnetThreshold=20 (2 seconds × 20 = 40 seconds)
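The arithmetic is simply heartbeat interval multiplied by the number of missed heartbeats. The sketch below (function names are illustrative, not part of any Exchange or cluster tooling) computes the failover window for a pair of settings, and works backwards from a target window to the smallest threshold that reaches it at a given delay.

```python
import math


def failover_window_seconds(delay_ms: int, threshold: int) -> float:
    """Time before a node is declared failed: interval x missed heartbeats."""
    return delay_ms * threshold / 1000


def threshold_for_window(target_seconds: float, delay_ms: int) -> int:
    """Smallest threshold whose window is at least target_seconds at this delay."""
    return math.ceil(target_seconds * 1000 / delay_ms)


print(failover_window_seconds(1000, 5))   # default settings: 5.0
print(threshold_for_window(40, 2000))     # the 40-second example: 20
print(failover_window_seconds(2000, 20))  # check: 40.0
```

Rounding up means the resulting window is never shorter than the target; for example, a 41-second target at 2000 ms needs a threshold of 21 (42 seconds).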
Event IDs to look for when investigating a database failover.
Event IDs in the Microsoft Exchange HighAvailability Operational log:
127, 147, 161, 184, 252, 292, 293, 301, 316, 333
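To pull just these events, you can build an Event Log XPath filter and pass it to `wevtutil qe`. This is a sketch under assumptions: the log name `Microsoft-Exchange-HighAvailability/Operational` should be confirmed in Event Viewer on your server, and the helper names are hypothetical. `/q:` (XPath query), `/rd:true` (newest first), `/c:` (count) and `/f:text` are standard `wevtutil qe` options.

```python
import subprocess

FAILOVER_EVENT_IDS = [127, 147, 161, 184, 252, 292, 293, 301, 316, 333]

# Assumed channel name for the HighAvailability Operational log.
LOG_NAME = "Microsoft-Exchange-HighAvailability/Operational"


def build_event_query(event_ids: list[int]) -> str:
    """Event Log XPath filter matching any of the given event IDs."""
    clauses = " or ".join(f"EventID={eid}" for eid in event_ids)
    return f"*[System[({clauses})]]"


def wevtutil_command(event_ids: list[int], count: int = 50) -> list[str]:
    """Argument list for wevtutil: newest `count` matching events as text."""
    return [
        "wevtutil", "qe", LOG_NAME,
        f"/q:{build_event_query(event_ids)}",
        "/rd:true", f"/c:{count}", "/f:text",
    ]


def run_query(count: int = 50) -> str:
    """Run the query (Windows only; needs permission to read the log)."""
    result = subprocess.run(
        wevtutil_command(FAILOVER_EVENT_IDS, count),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the filter only; call run_query() on an Exchange server.
    print(build_event_query(FAILOVER_EVENT_IDS))
```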
Database Failovers
A database failover occurs when a database copy that was active is no longer able to remain active. The following occurs as part of a database failover:
- The database failure is detected by the Microsoft Exchange Information Store service.
- The Microsoft Exchange Information Store service writes failure events to the crimson channel event log.
- The Active Manager on the server that contains the failed database detects the failure events.
- The Active Manager requests the database copy status from the other servers that hold a copy of the database.
- The other servers return the requested database copy status to the requesting Active Manager.
- The Primary Active Manager (PAM) initiates a move of the active database to another server in the DAG using a best copy selection algorithm.
- The PAM updates the database mount location in the cluster database to refer to the selected server.
- The PAM sends a request to the Active Manager on the selected server to become the database master.
- The Active Manager on the selected server requests that the Microsoft Exchange Replication service attempt to copy the last logs from the previous server and set the mountable flag for the database.
- The Microsoft Exchange Replication service copies the logs from the server that previously had the active copy of the database.
- The Active Manager reads the maximum log generation number from the cluster database.
- The Microsoft Exchange Information Store service mounts the new active database copy.
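The sequence above can be modeled as a small simulation to show how the mount location in the cluster database moves from the failed server to the chosen copy. This is a simplified sketch, not Exchange's implementation: all class and function names are hypothetical, and `select_target` replaces the real best copy selection algorithm (which weighs more factors) with "healthy copy with the shortest copy queue".

```python
from dataclasses import dataclass, field


@dataclass
class CopyStatus:
    server: str
    healthy: bool
    copy_queue_length: int  # logs not yet copied to this server (illustrative)


@dataclass
class ClusterDatabase:
    # Stand-in for the cluster database: database name -> mounted server.
    mount_location: dict[str, str] = field(default_factory=dict)


def select_target(statuses: list[CopyStatus], failed_server: str) -> str:
    """Simplified stand-in for best copy selection."""
    candidates = [s for s in statuses if s.healthy and s.server != failed_server]
    if not candidates:
        raise RuntimeError("no healthy passive copy available")
    return min(candidates, key=lambda s: s.copy_queue_length).server


def fail_over(db: str, cluster: ClusterDatabase,
              statuses: list[CopyStatus]) -> list[str]:
    """Walk through the main failover steps and return a log of them."""
    failed = cluster.mount_location[db]
    log = [f"Store on {failed} reports failure of {db}"]
    # Copy status gathered from the other servers drives the selection.
    target = select_target(statuses, failed)
    cluster.mount_location[db] = target  # PAM updates the mount location
    log.append(f"PAM selects {target} and updates the cluster database")
    log.append(f"Replication service on {target} copies last logs from {failed}")
    log.append(f"Store on {target} mounts {db}")
    return log


if __name__ == "__main__":
    cluster = ClusterDatabase({"DB01": "EX1"})
    for step in fail_over("DB01", cluster, [
        CopyStatus("EX1", False, 0),
        CopyStatus("EX2", True, 12),
        CopyStatus("EX3", True, 3),
    ]):
        print(step)
```

In this example EX3 wins because it is healthy and has the shortest copy queue, so the cluster database ends up pointing DB01 at EX3.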