Lifecycle Management Done for You with VMware Cloud on AWS

IT infrastructure lifecycle management must be done in a timely manner. One of the key reasons is to keep your systems patched against bugs and vulnerabilities, making sure your IT infrastructure is secure! New releases also bring new capabilities and features, along with performance enhancements and compatibility improvements. We talk to customers a lot, and lifecycle management is a recurring topic for VMware infrastructures. Customers typically struggle to plan and carry out lifecycle management and keep their IT infrastructure stack up to date. VMware solutions provide lots of capabilities to help customers overcome these challenges. A great example is VMware Cloud on AWS: because it is a managed service, one of the many benefits is that customers no longer have to worry about SDDC lifecycle management. This is done for you by VMware.

This article zooms in on what is done in the background to ensure your VMware Cloud on AWS SDDCs are up-to-date!

6-Month Release Cadence

VMware Cloud on AWS Software-Defined Data Centers (SDDCs) run specific SDDC versions. An SDDC version corresponds to a bundle of ESXi, vSAN, vCenter Server, and NSX versions. We make a distinction between essential releases, which are major versions for all SDDCs, and optional releases. Customers can review SDDC versions in the ‘Support’ tab in the Cloud Console portal.

Essential SDDC software versions are mandatory for all SDDCs and are even-numbered, for example, 1.20 or 1.22. SDDCs are upgraded to a new essential release every 6 months. Optional releases are odd-numbered, like 1.19 or 1.21, and are limited to a smaller subset of SDDCs based on business needs.

In the 6-month timeframe between major SDDC version releases, patches and/or fixes may be released to address bugs and vulnerabilities. We don’t want customers to have to wait until the next essential SDDC version upgrade. Our approach is to release so-called ‘v-releases’, for example 1.20v2, when required. SDDCs are remediated with these patches and fixes in targeted maintenance windows outside of upgrades.
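To make the numbering scheme concrete, here is a minimal Python sketch that classifies a version string. The string format is inferred only from the examples above (1.20, 1.21, 1.20v2); the names `SddcVersion` and `parse_sddc_version` are hypothetical and not part of any VMware tooling.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches strings such as "1.20" or "1.20v2" (format inferred from the examples in this article).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:v(\d+))?$")


@dataclass(frozen=True)
class SddcVersion:
    major: int
    minor: int
    v_release: Optional[int]  # None when the string has no "vN" suffix

    @property
    def kind(self) -> str:
        # Even-numbered versions are essential, odd-numbered versions are optional.
        return "essential" if self.minor % 2 == 0 else "optional"


def parse_sddc_version(text: str) -> SddcVersion:
    """Parse an SDDC version string into its numeric parts."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized SDDC version: {text!r}")
    major, minor, v_release = match.groups()
    return SddcVersion(int(major), int(minor), int(v_release) if v_release else None)


if __name__ == "__main__":
    for raw in ("1.20", "1.21", "1.20v2"):
        version = parse_sddc_version(raw)
        print(raw, version.kind, version.v_release)
```

Under these assumptions, 1.20v2 is the second patch release on top of the essential 1.20 version, while 1.21 is an optional release.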

Rollout of SDDC Upgrades in Waves

Each 6-month SDDC version upgrade is rolled out in waves. This gradually expands the impact of the upgrade across the fleet, so our SRE teams can extract learnings from each wave and iterate.

The first wave covers internal SDDCs only. The 2nd wave covers only smaller SDDCs with ‘easy’ configurations. In the 3rd wave, the majority of our customer SDDCs are upgraded. The 4th wave is for the largest SDDCs with potentially more complex configurations. Each SDDC upgrade is SRE-triggered but largely automated. Our SRE teams monitor the entire process and will proactively open support tickets in the event of issues. This proactive response helps speed up the support process, without the need for customer intervention.

Using this approach, we adjust rollout velocity based on success metrics. We capture known issues and continuously develop our runbooks to improve processes and adjust lifecycle automation for our customers.
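The wave logic described above can be sketched roughly as follows. This is an illustration only: the host-count cutoffs, the success-rate threshold, and the names `assign_wave` and `may_start_next_wave` are assumptions made for the example, not the actual criteria or services used by the SRE teams.

```python
from dataclasses import dataclass


@dataclass
class Sddc:
    name: str
    internal: bool
    host_count: int


def assign_wave(sddc: Sddc, small_max: int = 4, large_min: int = 32) -> str:
    """Place an SDDC in one of the four waves (cutoff values are hypothetical)."""
    if sddc.internal:
        return "internal"   # wave 1: internal SDDCs only
    if sddc.host_count <= small_max:
        return "small"      # wave 2: smaller SDDCs
    if sddc.host_count >= large_min:
        return "largest"    # wave 4: the largest SDDCs
    return "majority"       # wave 3: everything in between


def may_start_next_wave(successes: int, attempts: int, min_success_rate: float = 0.98) -> bool:
    """Gate the next wave on the success rate observed in the current one."""
    if attempts == 0:
        return False  # nothing has been upgraded yet, so there is nothing to learn from
    return successes / attempts >= min_success_rate


if __name__ == "__main__":
    fleet = [Sddc("lab", True, 6), Sddc("shop", False, 3), Sddc("bank", False, 48)]
    for sddc in fleet:
        print(sddc.name, assign_wave(sddc))
    print(may_start_next_wave(successes=99, attempts=100))
```

In reality, configuration complexity also plays a role in wave placement. The sketch uses host count alone to keep the example short.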

Because we have a large number of customer SDDCs to upgrade, we developed specific internal services to support the upgrade process, as well as the cluster conversion process. Some of the services we use in the background are:

  • Release Coordination Engine (RCE) – the orchestration engine for SDDC maintenance workflows.
  • Rollout Lifecycle Management (RLCM) – end-to-end management of rollouts of SDDC change workflows to the SDDC fleet.
  • Backup and Restore Service (BRS) – SDDC backup and restore for management appliances (vCenter Server, NSX), making use of object storage (S3 buckets in AWS infrastructure).

vSphere features like Quick Boot are used to save time by restarting only the ESXi kernel instead of doing a full host reboot.

Phased Approach

The actual SDDC upgrade uses a phased approach to upgrade constructs in the correct order while minimizing the impact on customer workloads.

The process remediates the management plane first, before moving on to update or upgrade the ESXi hosts and the NSX appliances. Using AWS fleet management, non-billable hosts are added during the process to ensure that customers always have the number of hosts available for which they have a subscription.

The impact during the upgrade process is minimized for the workloads. In specific phases of the process, the control plane might not be available. This means customers cannot access vCenter Server or the NSX Manager. The focus here is to keep the workloads running, making good use of vMotion, like you would in any VMware vSphere environment.

Phase 1 – Control Plane

The first phase covers the control plane updates. These are the updates to vCenter Server and NSX Edges. A backup of the management appliances is taken during this phase, using the Backup and Restore Service (BRS). If a problem occurs, there is a restore point for the SDDC. A management gateway firewall rule is added during this phase.

During this upgrade phase, there is an NSX Edge failover resulting in a brief downtime of the NSX Edge. Customers do not have access to NSX Manager and vCenter Server during this phase. During this time, customer workloads and other resources function as usual.

Certificates for vCenter Server and NSX Edge are replaced during Phase 1 if the certificates were last replaced more than 14 days ago. If you are using other software that relies on the vCenter Server certificate, such as Horizon Enterprise, vRealize Operations, vRealize Automation, VMware Site Recovery, and many third-party management applications, you must re-accept the vCenter Server and NSX certificates in that software after Phase 1 of the upgrade.
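If you want to predict whether you will need to re-accept certificates after an upgrade, the 14-day rule can be expressed as a simple check. This is a sketch under the assumption that the age is measured from the last replacement to the start of Phase 1; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

CERT_MAX_AGE = timedelta(days=14)  # threshold stated in this article


def certificate_will_be_replaced(last_replaced: datetime, phase1_start: datetime) -> bool:
    """True when the certificate was last replaced more than 14 days before Phase 1 starts."""
    return phase1_start - last_replaced > CERT_MAX_AGE


if __name__ == "__main__":
    phase1 = datetime(2023, 6, 20, tzinfo=timezone.utc)
    last = datetime(2023, 6, 1, tzinfo=timezone.utc)
    print(certificate_will_be_replaced(last, phase1))
```

When the check returns true, plan time after Phase 1 to re-accept the certificates in any dependent software.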

Phase 2 – Rolling Host Upgrades

This phase covers the host updates: the updates to the ESXi hosts and host networking software in the SDDC. An additional host is temporarily added to your SDDC to provide enough capacity for the update. You are not billed for these host additions. vMotion and DRS activities occur to facilitate the update. The upgrade process has been improved so that only one NSX Edge migration occurs during the update. During this time, your workloads and other resources function as usual, subject to the constraints outlined above. When Phase 2 is complete, the hosts that were temporarily added are removed from each cluster in the SDDC.
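The effect of the temporary host on capacity can be illustrated with a small simulation for a single cluster. This is a simplified model, not the actual automation: the host names, the `rolling_host_upgrade` function, and the assumption that the temporary host joins already on the target version are all made up for the example.

```python
from typing import Dict, List


def rolling_host_upgrade(hosts: List[str], target: str, versions: Dict[str, str]) -> List[int]:
    """Upgrade hosts one at a time and return the in-service host count at each step."""
    subscribed = len(hosts)
    temp_host = "temp-host"        # hypothetical name for the non-billable host
    versions[temp_host] = target   # assumption: the temporary host already runs the target version
    in_service = set(hosts) | {temp_host}
    capacity_log = []

    for host in hosts:
        in_service.remove(host)    # maintenance mode: workloads move off via vMotion/DRS
        capacity_log.append(len(in_service))
        if len(in_service) < subscribed:
            raise RuntimeError("capacity dropped below the subscribed host count")
        versions[host] = target    # update ESXi and host networking software
        in_service.add(host)       # host returns to service

    in_service.remove(temp_host)   # the temporary host is removed when the phase completes
    del versions[temp_host]
    capacity_log.append(len(in_service))
    return capacity_log


if __name__ == "__main__":
    cluster = {"esx-1": "old", "esx-2": "old", "esx-3": "old"}
    print(rolling_host_upgrade(list(cluster), "new", cluster))
    print(cluster)
```

In this model, the number of hosts in service never drops below the subscribed count, which is the point of adding the non-billable host before any host enters maintenance mode.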

Phase 3 – NSX Appliances

These are the updates to the NSX appliances. A backup of the management appliances is taken during this phase. If a problem occurs, there is a restore point for the SDDC. A management gateway firewall rule is added during this phase. You do not have access to NSX Manager and vCenter Server during this phase. During this time, your workloads and other resources function as usual, subject to the constraints outlined above.

When Phase 3 is complete, you receive a notification.

Scheduled Maintenance

Once a customer’s SDDCs are scheduled for maintenance, the customer is immediately notified. This is done over e-mail and within the Cloud Console. The scheduling for most SDDCs is executed using automation that takes many factors into consideration, like region, business hours, holidays, and support capacity, to name a few. The goal here is to unburden the customer, so no customer intervention is required for SDDC lifecycle management.

However, there are some factors that customers can influence. The maintenance window is communicated to customers, but if it overlaps with customer activities or simply comes at an inconvenient time, customers can request another date/time for the maintenance window. If customers run multiple SDDCs, they can influence the order in which the SDDCs are upgraded.

Email Notifications

The SDDC maintenance notification e-mail includes the rollout details: the dates and version of the SDDC upgrades, and estimated start and end times for each of the 3 phases. This gives customers an idea of what to expect.

Cloud Console

Even more info is displayed under the ‘Maintenance’ tab in the Cloud Console portal. This is also the place to influence the SDDC scheduled maintenance or SDDC order. Looking at the inventory, customers immediately see which SDDCs have maintenance scheduled.

On the detail page of an SDDC, customers see detailed information per phase. Times are shown converted to their local time zone to avoid confusion about the maintenance window. When maintenance is done, customers can see details here, too. Once maintenance is concluded, the SDDC version under the ‘Support’ tab reflects the upgraded SDDC version.

Firmware Updates

Next to the SDDC upgrade workflows, we developed a non-disruptive workflow that provides automated firmware updates in a timely manner. From time to time, AWS develops firmware updates for the EC2 fleet to address known issues or improve performance. These updates are staged on the instance and installed the next time the host reboots. Rather than waiting for that next reboot, the workflow expedites the delivery of these improvements.

With automated firmware updates enabled by default, the process first adds a new non-billable host to augment the cluster’s capacity. Once this new host is online and healthy, the service initiates the firmware update: the host to be updated is placed into maintenance mode, all its workloads are migrated to other hosts in the cluster using vMotion, and the host is rebooted. Once the host is back online and confirmed healthy to run workloads, the non-billable host is removed.

To Conclude

Hopefully, this resource provides a better look under the hood of VMware Cloud on AWS and how SDDC maintenance is done. The same cycle repeats itself every 6 months for major releases. Patches and critical fixes are applied as soon as possible! The main benefit for customers is that they automatically stay up to date without spending time and effort themselves. This allows customers to be secure and to focus on what is important: their data and workloads. VMware Cloud on AWS takes care of the underlying SDDC infrastructure!

–originally authored and posted by me at https://vmc.techzone.vmware.com/resource/infrastructure-lifecycle-management-done-you-vmware-cloud-aws–
