This is part 2 of the VMware Stretched Cluster on IBM SVC blogpost series.
SVC split I/O group
It’s time to split our SVC nodes between failure domains (sites). While the SVC technically supports a maximum round-trip time (RTT) of 80 ms, Metro vMotion supports an RTT of up to 10 ms (and requires an Enterprise Plus license).
You can split the nodes in two ways: with or without the use of ISLs (Inter-Switch Links). Both deployment methods are covered in detail in this document.
Deployment without ISL
Nodes are directly connected to the FC switches in both the local and the remote site, without traversing an ISL. Passive WDM devices (red line) can be used to reduce the number of links; you’ll need to equip the nodes with “colored” long-distance SFPs.
Please note that the image above does show ISLs. They are used to connect the switches; the node connections (green lines) do not utilize the ISLs.
This deployment method lets you cover up to 40 km (at a reduced speed) (source).
The failure domains for the project I worked on were approximately 55 km (total dark-fiber length) apart, so we had to use the following method.
Deployment with ISL
Nodes are connected to the local FC switches, ISLs are configured between the sites, and all traffic traverses the ISLs. You are required to configure a so-called private SAN for node-to-node communication and a public SAN for host and storage-array communication. You can separate the SANs by using dedicated switches or by using Virtual Fabrics.
N.B. The private/public separation isn’t always strictly enforced. In case of a failure (let’s say all public SAN ISLs fail), the SVC can route public traffic over the private SAN (and vice versa).
The implementation I worked on consisted of 2x Brocade B6510 in each site (each B6510 containing two Virtual Fabrics). We used an MRV LambdaDriver 1600 (DWDM mux/demux) to create 4x 8 Gb FC ISLs over 2 dark fibers between site 1 and site 2, and 1x 2 Gb FC link to the quorum site.
Make sure all dark fibers between the 3 sites use different physical paths from at least 2 providers. Keep in mind, however, that fiber providers sometimes swap fibers; so despite using different providers, your fibers may end up in the same duct.
Also, keep in mind the attenuation of the optical signal for all (short-wave) patches. Full-speed 8 Gb FC reaches up to 150 meters over OM3 cabling. If your FC switch is located inside a rack and the multiplexer in the Main Equipment Room (MER) or Satellite Equipment Room (SER), you might be covering more meters than you think (source).
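A quick back-of-the-envelope loss-budget calculation makes this concrete. All the numbers below are illustrative assumptions (typical datasheet-style values), not measurements from my deployment; always check the specifications of your actual SFPs, cabling, and patch panels:

```python
# Rough optical loss-budget check for a short-wave multimode patch path.
# The attenuation and connector-loss figures are ASSUMED typical values,
# purely for illustration -- verify against your own component datasheets.

FIBER_LOSS_DB_PER_KM = 3.5   # assumed OM3 attenuation at 850 nm
CONNECTOR_LOSS_DB = 0.5      # assumed loss per mated connector pair

def path_loss_db(length_m: float, connector_pairs: int) -> float:
    """Total insertion loss of a multimode patch path in dB."""
    return (length_m / 1000) * FIBER_LOSS_DB_PER_KM + connector_pairs * CONNECTOR_LOSS_DB

# Example: rack -> patch panel -> patch panel -> MER multiplexer,
# 120 m of fiber in total and 4 mated connector pairs along the way.
print(f"{path_loss_db(120, 4):.2f} dB")  # 2.42 dB
```

Note how the four connector pairs contribute more loss (2.0 dB) than the 120 m of fiber itself (0.42 dB); every extra patch panel in the path eats into your budget.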
Long distance impact
The farther you stretch your main sites apart, the higher the latency and the bigger the performance impact: 10 km adds 0.10 ms to your RTT, 25 km adds 0.25 ms, and so on. My synthetic benchmarks (mixed workload) showed that a split I/O group cluster (main sites approximately 50-60 km of fiber apart) was 54% slower than a local SVC cluster. That’s the price you pay.
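Those latency figures follow directly from the speed of light in glass (roughly 200,000 km/s, i.e. about 0.005 ms per km one way). A minimal sanity-check calculation:

```python
# Estimate the extra round-trip time (RTT) added by a stretched fiber run.
# Light travels through fiber at roughly 200,000 km/s (about 2/3 of c),
# i.e. ~0.005 ms per km one way, so ~0.01 ms per km round trip.
# Switch, mux, and protocol overhead come on top of this.

LIGHT_SPEED_FIBER_KM_PER_MS = 200.0  # 200,000 km/s = 200 km per ms

def added_rtt_ms(fiber_km: float) -> float:
    """Round-trip latency added by the fiber distance alone."""
    one_way_ms = fiber_km / LIGHT_SPEED_FIBER_KM_PER_MS
    return 2 * one_way_ms

for km in (10, 25, 55):
    print(f"{km:>3} km -> +{added_rtt_ms(km):.2f} ms RTT")
# 10 km -> +0.10 ms, 25 km -> +0.25 ms, 55 km -> +0.55 ms
```

At our ~55 km of fiber, that is over half a millisecond added to every synchronous round trip, which is where the benchmark penalty comes from.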
As you may have already noticed in the pictures above, a third (independent) site is required for the (active) quorum. As with all non-majority clusters (the nodes are always divided 50:50), a tie-breaker is needed. Both main sites contain a candidate quorum disk; the active quorum decides which site stays up in case of a split-brain.
For the active quorum disk, you’ll need a storage array that supports what IBM calls extended quorum. We used a V3700 in the quorum site. The quorum disk only takes up around 256 MB.
I haven’t discussed the term configuration node yet. There is one configuration node in the SVC cluster, which manages the configuration (what’s in a name). The configuration node is elected by the system; you cannot change it manually. If the configuration node fails, another node takes over its role.
Another task of the configuration node is to closely monitor the active quorum. This can have an impact on the split-brain remediation, which I’ll discuss shortly.
In the table below, you’ll find some failure scenarios and the resulting cluster status. Please note there’s a small error in the 4th entry: the write cache is enabled when only the quorum fails!
Let’s take a closer look at entry 5: the dreaded split-brain scenario.
“The node that accesses the active quorum disk first remains active and the partner node goes offline.”
There’s no way to be sure which node accesses the active quorum first. However, we do know the configuration node already accesses the active quorum frequently, so it has a high chance of winning. In my validation testing, the site that contained the configuration node always won, despite the fact that this site was located 40 km farther from the quorum than the other site! Based on this, we can rank the nodes by their chance of accessing the active quorum first:
- Configuration node
- Node in the site physically closest to the active quorum
- Node in the site physically farther from the active quorum
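This observed priority order can be expressed as a simple sort key. To be clear: this ranking is an inference from my own validation tests, not documented SVC behavior, and the node names and distances below are made up for illustration:

```python
# Rank nodes by their observed chance of winning the race to the active
# quorum: being the configuration node trumps physical distance, and
# among the remaining nodes, closer to the quorum site is better.
# (Observation from validation testing, not documented behavior.)

def race_priority(node: dict) -> tuple:
    # False sorts before True, so the configuration node comes first;
    # ties are broken by fiber distance to the quorum site (ascending).
    return (not node["is_config_node"], node["km_to_quorum"])

nodes = [
    {"name": "node1", "is_config_node": False, "km_to_quorum": 15},
    {"name": "node2", "is_config_node": True,  "km_to_quorum": 55},
    {"name": "node3", "is_config_node": False, "km_to_quorum": 55},
]
winner_order = sorted(nodes, key=race_priority)
print([n["name"] for n in winner_order])  # ['node2', 'node1', 'node3']
```

Note that node2 ranks first despite being 40 km farther from the quorum, matching what I saw in testing.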
A voting set is formed by all nodes that are fully connected to one another. A site has quorum (stays online) if any one of the following is true:
- it contains more than half of all nodes in the voting set
- it contains half of all nodes in the voting set AND the active quorum
- if there’s no active quorum: it contains half of the nodes, including the founding node (usually the configuration node)
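The three rules above can be sketched as a small decision function. This is a didactic model of the rules as described here, not the actual SVC firmware logic, and all names are mine:

```python
# Illustrative sketch of the quorum rules described above -- a model of
# the blog's description, not the real SVC implementation. A partition
# (one side of the split) stays online if:
#   1. it holds more than half of all nodes, or
#   2. it holds exactly half of the nodes AND the active quorum disk, or
#   3. no active quorum is available, and it holds half of the nodes
#      including the founding node (usually the configuration node).

def partition_stays_online(partition_nodes: set[str],
                           total_nodes: int,
                           has_active_quorum: bool,
                           active_quorum_exists: bool,
                           founding_node: str) -> bool:
    n = len(partition_nodes)
    if n * 2 > total_nodes:                        # rule 1: node majority
        return True
    if n * 2 == total_nodes:                       # exact 50:50 split
        if active_quorum_exists:
            return has_active_quorum               # rule 2: tie-breaker disk
        return founding_node in partition_nodes    # rule 3: founding node
    return False                                   # minority always loses

# Split-brain in a 4-node cluster: the half that reaches the active
# quorum first stays up -- here site 2, holding the configuration node.
site2 = {"node2", "node4"}
print(partition_stays_online(site2, 4, True, True, "node2"))  # True
```

Running the same function with `active_quorum_exists=False` reproduces rule 3: the 50% partition that contains the founding node survives.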
Let me clarify some scenarios with my anti-anti-aliased Visio drawings.
This shows a 4-node cluster in normal operation. The voting set consists of all nodes that are fully connected with one another.
In this scenario, all ISLs have failed and there’s no node majority in the voting set. The configuration node (NODE 2) is the first to access the quorum, so site 2 stays online.
I know you’ve been waiting for this scenario! Although highly unlikely, I did test it 😉
- The active quorum fails; at this point there’s no impact. The quorum disks in sites 1 and 2 remain in a candidate state.
- All ISLs fail; a new active quorum is elected. The configuration node is located in site 2, so the quorum disk there switches from candidate to active and site 2 stays online.
Now we know how a split I/O group cluster behaves. In PART 3 we’ll see how all of this interacts with VMware HA, and we’ll take a closer look at, for instance, APD (All Paths Down), PDL (Permanent Device Loss), and some advanced DRS (Distributed Resource Scheduler) settings.
Thanks for reading!