This is part 2 of the VMware Stretched Cluster on IBM SVC blogpost series.
PART 1 (intro, SVC cluster, I/O group, nodes)
PART 2 (split I/O group, deployment, quorum, config node)
PART 3 (HA, PDL, APD)
SVC split I/O group
It's time to split our SVC nodes between failure domains (sites). While the SVC technically supports a maximum round-trip time (RTT) of 80 ms, Metro vMotion supports an RTT of up to 10 ms (Enterprise Plus license).
You can split the nodes in two ways: with or without the use of ISLs (Inter-Switch Links). Both deployment methods are covered in detail in this document.
Deployment without ISL
Nodes are directly connected to the FC switches in both the local and the remote site, without traversing an ISL. Passive WDM devices (red line) can be used to reduce the number of links. You'll need to equip the nodes with "colored" long-distance SFPs.
Please note that the image above does show ISLs. They are only used to connect the switches; the node connections (green lines) do not utilize the ISLs.
This deployment method lets you cover up to 40 km (at a reduced speed).
The failure domains for the project I worked on were approximately 55 km apart (total dark fiber length), so we had to use the following method.
Deployment with ISL
Nodes are connected to the local FC switches, ISLs are configured between the sites, and all traffic traverses the ISLs. You are required to configure a so-called private SAN for node-to-node communication and a public SAN for host and storage array communication. You can separate the SANs by using dedicated switches or by using Virtual Fabrics.
N.B. The private-public separation isn't always strictly enforced. In case of a failure (let's say all public SAN ISLs fail), the SVC can route public traffic over the private SAN (and vice versa).
The implementation I worked on consisted of 2x Brocade 6510 in each site (each switch containing two Virtual Fabrics). We used an MRV LambdaDriver 1600 (DWDM mux/demux) to create 4x 8 Gb FC ISLs over two dark fibers between sites 1 and 2, and 1x 2 Gb FC link to the quorum site.
Make sure all dark fibers between the three sites use different physical paths from at least two providers. Keep in mind, however, that fiber providers sometimes swap fibers, so despite using different providers, your fibers may still end up in the same tube.
Also, keep in mind the attenuation of the optical signal for all (short-wave) patches. Full 8 Gb FC can cover up to 150 meters over OM3 cabling. If your FC switch is located inside a rack and the multiplexer in the Main Equipment Room (MER) or a Satellite Equipment Room (SER), you might be covering more meters than you think.
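To get a feeling for how quickly those meters add up, here's a minimal Python sketch that simply sums all patch segments between the switch port and the multiplexer. The segment names and lengths are made-up example values; only the 150 m OM3 limit comes from the paragraph above.

```python
# Sanity check for short-wave patch runs: add up every segment between the
# FC switch port and the multiplexer and compare the total with the 150 m
# limit for 8 Gb FC over OM3. The segment lengths are made-up example values.

OM3_8GFC_LIMIT_M = 150

segments_m = {
    "switch port to rack patch panel": 5,
    "rack patch panel to SER": 55,
    "SER to MER trunk": 75,
    "MER patch panel to multiplexer": 10,
}

total_m = sum(segments_m.values())
print(f"Total run: {total_m} m (limit: {OM3_8GFC_LIMIT_M} m)")
if total_m > OM3_8GFC_LIMIT_M:
    print("Over budget: full 8 Gb FC speed is not guaranteed on this run")
```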
Long distance impact
The further you stretch your main sites apart, the higher the latency and the bigger the performance impact: 10 km adds roughly 0.10 ms to your RTT, 25 km adds 0.25 ms, and so on. My synthetic benchmarks (mixed workload) showed that a split I/O group cluster (main sites approximately 50-60 km of fiber apart) was 54% slower than a local SVC cluster. That's the price you pay.
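If you want a quick estimate for your own distance, here's a back-of-the-envelope Python sketch (my own helper, not an SVC tool) based on the rule of thumb above:

```python
# Back-of-the-envelope estimate of the extra round-trip time added by the
# fiber distance between the main sites, using the rule of thumb above:
# every 10 km of fiber adds roughly 0.10 ms to the RTT.

RTT_MS_PER_KM = 0.01

def added_rtt_ms(fiber_km: float) -> float:
    """Extra round-trip time (ms) for a given one-way fiber length (km)."""
    return fiber_km * RTT_MS_PER_KM

for km in (10, 25, 55):
    print(f"{km:>3} km of fiber adds ~{added_rtt_ms(km):.2f} ms to the RTT")
```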
Quorum site
As you may have already noticed in the pictures above, a third (independent) site is required for the (active) quorum. As with all non-majority clusters (the nodes are always divided 50:50), a tie-breaker is needed. The main sites each contain a candidate quorum disk. The active quorum decides which site stays up in case of a split-brain.
For the active quorum disk, you’ll need a storage array that supports (what IBM calls) extended quorum. We used a V3700 in the quorum site. The quorum disk only takes up around 256 MB.
Configuration node
I haven't discussed the term configuration node yet. There is one configuration node in the SVC cluster, which manages the cluster configuration (what's in a name). The configuration node is elected by the system; you cannot change the configuration node manually. If the configuration node fails, another node takes over its role.
Another task of the configuration node is to closely monitor the active quorum. This can have an impact on split-brain remediation, which I'll discuss shortly.
Cluster status
In the table below you'll find some failure scenarios and the resulting cluster status. Please note there's a small error in the 4th entry: the write cache remains enabled when only the quorum fails!
Let's take a closer look at entry 5: the dreaded split-brain scenario.
“The node that accesses the active quorum disk first remains active and the partner node goes offline.”
There's no way you can be sure which node accesses the active quorum first. However, we do know the configuration node already accesses the active quorum frequently, so it has a high chance of winning. In my validation testing, the site that contained the configuration node always won, despite the fact that this site was located 40 km farther from the quorum than the other site! Based on this, we can make the following assumptions about who has the highest chance of accessing the active quorum first:
- Configuration node
- Node in the site physically closest to the active quorum
- Node in the site physically farther from the active quorum
Voting set
A voting set is formed by all nodes that are fully connected to one another. A site has quorum (stays online) if one of the following is true (see the sketch after this list).
- it holds more than half of all nodes in the voting set
- it holds half of all nodes in the voting set AND the active quorum
- if there's no active quorum: it holds half of the nodes, and that half includes the founding node (usually the configuration node)
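To make these three rules a little more tangible, here's a minimal Python sketch of the decision as I understand it. This is my own simplification for illustration purposes, not IBM's actual implementation.

```python
# Simplified model of the voting-set rules above (my own interpretation,
# not IBM's actual algorithm).

def partition_stays_online(nodes_in_partition: int,
                           total_nodes: int,
                           reaches_active_quorum: bool,
                           active_quorum_exists: bool,
                           has_founding_node: bool) -> bool:
    """Decide whether one partition of the split cluster keeps running."""
    if 2 * nodes_in_partition > total_nodes:
        return True                                   # rule 1: node majority
    if 2 * nodes_in_partition == total_nodes:
        if reaches_active_quorum:
            return True                               # rule 2: half + active quorum
        if not active_quorum_exists and has_founding_node:
            return True                               # rule 3: no quorum, founding node
    return False

# Example: 4-node cluster, all ISLs down. Site 2 holds the configuration
# (founding) node and wins the race to the active quorum; site 1 loses it.
print(partition_stays_online(2, 4, reaches_active_quorum=True,
                             active_quorum_exists=True, has_founding_node=True))   # True
print(partition_stays_online(2, 4, reaches_active_quorum=False,
                             active_quorum_exists=True, has_founding_node=False))  # False
```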
Let me clarify some scenarios with my anti-anti-aliased Visio drawings.
This shows a 4-node cluster in normal operation. The voting set consists of all nodes that are fully connected with one another.
In this scenario, all ISLs have failed. There's no node majority in the voting set. The configuration node (NODE 2) is the first to access the quorum, so site 2 stays online.
I know you’ve been waiting for this scenario! Although highly unlikely, I did test it 😉
- The active quorum fails; at this point there's no impact. The quorum disks in site 1 and site 2 remain in a candidate state
- All ISLs fail; a new active quorum is elected. The configuration node is located in site 2, so the candidate quorum disk there becomes active and site 2 stays online
Now we know how a split I/O group cluster behaves. In PART 3 we will see how this all interacts with VMware HA, and we'll take a closer look at, for instance, APD (All Paths Down), PDL (Permanent Device Loss), and some advanced DRS (Distributed Resource Scheduler) settings.
Thanks for reading!
Nice article. The problem definitely comes from the inability to determine where the datastore is actually accessed; there is no way (from a VMware point of view) to know whether we are doing our write I/O on the local or the distant site, which could affect performance.
When you map the volumes to hosts in SVC, you have to specify the "site" field. That way SVC is aware which storage is local to that host and will not send data to the distant site.
Hello, we have a configuration similar to the one you implemented, but are having some issues with the SAN configuration.
Do you have the details of the SAN implementation?
Currently, when we disconnect one FC provider for the ISL, the nodes warmstart, so we don't have redundancy there.
Are you using trunks for the internode communication?
Thanks for any help.
We have the same issue; the solution is unreliable. Did you manage to find a way to solve your issue?
Hello, our trunk in the switches for the SVC was misconfigured.
There was a length difference between the links: Port 34 was 15 km long and Port 35 was 14.5 km.
“Length difference of participating links no more than 30m recommended by brocade. (officially 400 meter)”
http://windowspeople.com/brocade/san-brocade-trunking-1-concept.html
It is very reliable now; even after losing one fabric (the fiber of the ISL between the sites was cut), the nodes no longer restart.
The voting set is interesting, especially the third rule in the list. Does that mean the cluster will keep running even after a failure of half of the nodes and the quorum, as long as the configuration node survives?
Correct!
Rule 3 applies: "if there's no active quorum: it holds half of the nodes, and that half includes the founding node (usually the configuration node)"