VMMIG Module 03: Plan Foundation

The Plan and Foundation Phase


● Assess/Discover your application landscape
● Plan/Foundation: create a landing zone
● Migrate! Pick a path and get started
● Optimize your operations and save on costs

Cloud migration is a journey: the end-to-end lifecycle whereby workloads move from other locations (on-prem, other clouds) into GCP. GCP is the destination these workloads migrate to, and they are often modernized/optimized in-cloud afterwards.

After learning about the client’s environment in the Assess/Discover phase, we are
now going to build the foundation, the landing zone we’ll migrate our VMs into.
Learn how to...
Plan a GCP landing zone for your migrated VMs

Map VMs to GCP IAM and networking infrastructure

Lay the groundwork for monitoring and billing

Overcome foundational challenges

Create the migration factory


Agenda
Plan and foundation

IAM and networking

Instrumentation and cost control

Data considerations

Windows

Building the migration factory


Plan and foundation overview
The plan phase is designed to plan and build your foundational GCP “landing zone”,
pilot and validate the migration approach while aligning on the long-term roadmap.

Input:
● Application/VM grouping
● First mover workloads identified
● Initial sorting of groups to move

Activities:
● Establish IAM and org structure
● Create/configure the network resources
● Stackdriver setup
● Billing export/labels
● Plan the pilot
● Rough out migration waves
● Create the migration factory

Output:
● Delivery of foundation/landing zone scripts
● Org structure
● Agile migration factory

Important: the Plan phase doesn't produce a monolithic, linear plan, so don't get stuck in analysis paralysis. We'll group workloads into migration waves and build a detailed plan for the first workloads; the plan is necessarily less detailed for later waves. Planning is iterative, building and maintaining a pipeline of workloads that are ready to migrate.
Assumptions
● You are a Google Cloud architect
● You can build projects, VMs, VPCs, subnets, and firewalls
● You understand how GCP IAM and org structure works
● You know the GCP building blocks
● This chapter is about applying that knowledge to VM migration

You will need a good knowledge of GCP to perform the tasks associated with this phase.
Configuration management
● You should have a configuration management system in place
○ Ansible, Puppet, Chef, etc.
○ Improve and move opportunity?
● Lift and shift typically involves configuration changes
● Automated configuration management can help
○ On-prem pushing changes to cloud (networking, security...)
○ Install in cloud as part of foundation

Lift and shift inevitably involves some configuration changes at scale, and you will need a way of automating them. You may have Chef (or another tool) on-prem and a second instance in the cloud, installed as part of the foundation. Alternatively, the on-prem system can push changes to the cloud; make sure the networking, firewall rules, and connectivity are in place so the configuration management system can resolve and reach the servers it needs to talk to.
Infrastructure as Code (IaC)
● IaC helps on three fronts:
  ○ Cost reduction (people and effort)
  ○ Speed in building and tweaking the foundation for each wave
  ○ Lower risk
● There will be similarities between migration waves
  ○ DRY principle—Don't Repeat Yourself
  ○ Find what works and stick to it
  ○ IaC makes sure you don't forget steps
● Repeatable and recreatable (waves and DR)

This class will advocate IaC. We will build our foundation using Terraform, though there are other options available; Terraform is one of the most popular IaC tools.
Infrastructure as Code (IaC) tools

                               Deployment   Puppet       Chef         Terraform    Cloud
                               Manager                                             Formation
Imperative vs. Declarative     Declarative  Declarative  Imperative   Declarative  Declarative
Hosted                         Yes          No           No           No           Yes
Driven by Discovery/Swagger    Yes          No           No           No           No
Multi-Platform                 No           Yes          Yes          Yes          No
Integrated with a Platform     Yes          No           No           No           Yes
(IAM, UI, …)

Deployment Manager is the native GCP tool, but is relatively verbose (Terraform scripts tend to be shorter, for example). Cloud Formation is the AWS tool.
Terraform
● Cloud-agnostic IaC tool from HashiCorp
  ○ Freeware CLI, but also a team-oriented enterprise version
● Easy process
  ○ Define infrastructure with configuration files
  ○ Generate a plan for verification
  ○ Execute
● Handles creation and change automation
● Efficient and codifiable
● Terraform Intro
● Build GCP with Terraform example

Although Terraform supports multiple cloud environments, individual scripts are inevitably cloud-specific (e.g. the properties of VMs on GCP are not identical to the properties of VMs on AWS or Azure).
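A minimal sketch of that define/plan/execute workflow, assuming a hypothetical project ID, region, and VM name (not part of the course materials):

# main.tf — names and project are placeholder assumptions
provider "google" {
  project = "my-migration-project"
  region  = "us-central1"
}

resource "google_compute_instance" "pilot_vm" {
  name         = "pilot-vm"
  machine_type = "n1-standard-1"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-9"
    }
  }

  network_interface {
    network = "default"
  }
}

With this file in place, terraform init downloads the provider, terraform plan shows what would be created or changed for verification, and terraform apply executes it; the same files can be re-applied and tweaked for later waves.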
Cloud Foundation Toolkit
● Google’s reference templates for:
○ Deployment Manager
○ Terraform
● Help build a repeatable, enterprise-ready foundation
● Cloud Foundation Toolkit

Google recognises that either Deployment Manager or Terraform may support a particular customer's requirements. The Cloud Foundation Toolkit is a set of examples to help people get started with either tool.
Lab 5
Terraforming GCP
Cloud foundations
Whatever the shape of your migration, you need to build strong foundations. These
should be defined as code and automated, providing a GCP “landing zone”.

Identity & Access
● Define IAM policies
● Determine structure of the resource hierarchy
● Create projects and folders until org is mature

Networking
● Create networks, subnets, network devices (cloud routers, VPNs, and load balancers), etc.
● Create firewall rules, unless maintained by the security admin

Instrumentation
● Define baseline monitoring
● Build Stackdriver dashboards
● Initial alerting groups and processes

Cost Control
● Set up a billing account
● Define billing export and resource labelling

During this phase, get your foundations right - make sure you're ready to migrate at scale. Google has a Cloud Foundation Toolkit to help. Identity and Access, Networking, Instrumentation, and Cost Control are the four key aspects of building a foundation for the GCP deployment.
Agenda
Plan and foundation

IAM and networking

Instrumentation and cost control

Data considerations

Windows

Building the migration factory


Identity and access management
● Two key identity concerns
○ Accessing GCP resources
○ VM workload specific access requirements
● Let’s start with GCP access

IAM determines how individuals are authenticated to GCP, and what resources they
are authorized to access. IAM also addresses how code and VMs access GCP
resources.
GCP authentication
● Use G Suite directly
○ May need to import/create users
○ Migrating to Cloud Identity or G Suite
● Link GCP to a SAML2 compliant SSO service
● Use one of the two main AD options
○ Federate using Google Cloud Directory Sync (GCDS)
○ Federate with Azure AD
○ Federating Google Cloud Platform with Active Directory

These are the three choices for GCP authentication.

If you only need core admin-type access, manually creating credentials might be fine. For larger user populations, however, you will need something more sophisticated.
Federate with AD using GCDS
● Syncs users and groups
● One way, scheduled push from
AD to G Suite
● Auth through ADFS
● Fine-grained control using rules
● Details

Active Directory remains the “source of truth” for authentication. In this scenario, I
would need to go back to the on-prem AD system to actually authenticate a particular
logon.

https://cloud.google.com/solutions/federating-gcp-with-active-directory-introduction
Federate directly with Azure AD
● On-prem AD and Azure AD as one source of truth
● Updates are immediate (no manual/timed sync)
● May be worth adding the Azure AD piece, even if the client isn't using it already
● Details

For complex AD environments, the most seamless integration with on-prem AD may
be via Azure AD, rather than GCDS.
IAM tied to organizational structure
● Org/folder/project/resource organization
● IAM uses a union-of-permissions approach
  ○ Inherited permissions may not be removed, only added to
● Decide where and how to use folders
  ○ Will require input on the client's culture
  ○ Folders help with organization and base permissions
● Resource Organization and IAM

You need to set up the org/folder structure before you start creating projects. Getting this right at the beginning lets you take advantage of permission inheritance to simplify the granting of access, e.g. you can assign permissions at the folder level and have all the projects in the folder inherit those permissions.
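A hedged sketch of that pattern in Terraform; the org ID, project IDs, billing account, and group address below are illustrative assumptions:

# Folder under the organization; roles granted here are inherited by everything beneath it
resource "google_folder" "prod" {
  display_name = "prod"
  parent       = "organizations/123456789012" # hypothetical org ID
}

resource "google_project" "myapp1_prod" {
  name            = "myapp1-prod"
  project_id      = "myapp1-prod-001"         # hypothetical project ID
  folder_id       = google_folder.prod.name
  billing_account = "000000-AAAAAA-000000"    # hypothetical billing account
}

# Grant once at the folder level instead of repeating it per project
resource "google_folder_iam_member" "prod_viewers" {
  folder = google_folder.prod.name
  role   = "roles/viewer"
  member = "group:prod-viewers@example.com"
}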
Example: granular access-oriented hierarchy

[Diagram: an org hierarchy with top-level folders (Retail, Risk mgmt, Financial, Commercial), sub-folders (Apps, Sandbox, Shared, Core serv, D&A, Ctrl serv), Prod/Dev folders, and projects such as MyApp1, MyApp2, and NetHost. Pros: granular permissions, easily extensible, inheritance. Cons: complex structure/group/role management, complex network topology.]

The structure you come up with should reflect how the client thinks about their own organisation, and GCP org structure best practices.

Be careful with naming folders around department names, as organisations will often reorganize. Base your structure around reasonably stable reference points.
Create a solid set of org-level starter groups

Org admin:
● Define IAM policies.
● Determine structure of the resource hierarchy.
● Create projects, until org is mature.

Network admin:
● Create networks, subnets, network devices (cloud routers, cloud VPNs, and cloud load balancers).
● Maintain firewall rules, unless maintained by the security admin.

Security admin:
● Establish policies and constraints for the entire organization, folders, and projects.
● Establish IAM roles for projects.
● Maintain visibility on logs and resources.

Billing admin:
● Set up a billing account.
● Monitor usage.

Details

The linked document provides a deeper dive into how you typically set up your groups when starting with GCP.
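As a small sketch, the starter groups can be bound to their org-level roles in Terraform; the org ID and group addresses are assumptions:

resource "google_organization_iam_member" "network_admins" {
  org_id = "123456789012" # hypothetical org ID
  role   = "roles/compute.networkAdmin"
  member = "group:gcp-network-admins@example.com"
}

resource "google_organization_iam_member" "billing_admins" {
  org_id = "123456789012"
  role   = "roles/billing.admin"
  member = "group:gcp-billing-admins@example.com"
}

Binding roles to groups (rather than individuals) keeps membership changes out of the IaC and inside the identity system.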
Migrated VMs and credential management
● Migrated machines frequently need two forms of IAM
○ Machine to GCP resources (IAM Service Accounts)
○ User to machine, machine to machine, workload specific
● How would code on a VM access a GCS bucket?
● How does logging into a Windows machine in the typical
corporate environment work?
● How do most Windows Server applications authenticate into
SQL Server?
AD and VM workloads
● Most popular credential management system in the
world
● Options
○ 1: Migrate to GCP
○ 2: Leave where it is (on-prem or Azure)
○ 3: Set up a new AD server in GCP just for apps
○ 4: Set up a new AD server and federate
○ 5: Managed AD

Used by 90% of the Fortune 1000.

Options 1-3 would work for any auth provider (LDAP, etc.).
Option 4 would also work for many.
Option 1: Migrating to GCP
● Pros:
○ Might be simple
■ Just move the AD server(s) to GCP
○ Application, VM, auth. provider, all can move as a
unit
● Cons:
○ What about apps and users left on prem?
○ Can be more challenging if in an HA configuration
○ Latency?

Moving your AD to GCP might be overkill, if only a small subset of your AD users
require GCP access.
Option 2: Connect back to on-prem AD
● Pros
○ Requires little to no application/VM changes
○ Fast
○ Don’t have to move/recreate AD
● Cons:
○ Trust boundary for AD now extended to include the cloud
● DNS: Make sure it resolves correctly!
○ Only global domains allowed
● Also works for Azure AD
● AD in a hybrid environment
Option 3: New AD server just for apps
● Works best in environments where only a small admin/support
staff will need access
○ Or where it’s used for access between machines
● Pros:
○ Easy
○ Quickly done
● Cons:
○ Have a new AD server to manage
■ Can become management nightmare
● Deploying a fault-tolerant AD to GCP

Here you create an independent AD in the cloud, just to support cloud-based apps. If
you have a lot of users, it can become a pain to maintain two sets of users in two
different ADs. Often this works best as an interim option
Option 4: New AD server, cross-forest trust
● Essentially, two different AD servers that have a
special trust relationship
● Several different configuration options
● Pros:
○ Less management
● Cons:
○ Harder to set up
● AD-GCP integration patterns

This option requires some AD expertise to set up correctly.


Option 5: Managed AD
● Why set it up yourself, when GCP can?
● Pros:
○ HA, hardened, fully managed by Google
○ Can be standalone or in trust-relationship with
on-prem
● Cons:
○ Still cooking (alpha/beta)
● Managed AD

Keep an eye on this development - it may provide the best solution in the longer term.
Network concepts

[Diagram: a project contains a network (VPC); the network spans regions; each region contains zones; subnets (e.g. 192.168.0.0/16, 172.16.0.0/12, 10.0.0.0/8) span zones within a region.]

Overview of networking in GCP:

● Networks live within projects.
● Networks may consist of multiple regions.
● Regions may consist of multiple zones; locations within a region have roundtrip network latency under 5 ms.
● Subnets may span zones.

Cloud VPNs are regional resources.

Internet egress is at the region level.

All elements of a network may talk to one another, firewall rules permitting.
Laying the VPC (network)
● Need an initial network
● Connection to on-prem requirements
● Velostrata networking requirements
○ Details coming
● Networking Deep Dive
● Best practices and reference VPC architectures

● One network per application? Shared network for multiple apps?
● There's a good chance you'll need a shared VPC for Velostrata
● Hub and spoke? Full mesh?
● Shared network?
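A minimal sketch of the initial custom-mode network in Terraform; the network name, region, and CIDR range are illustrative assumptions:

resource "google_compute_network" "migration" {
  name                    = "migration-vpc"
  auto_create_subnetworks = false # custom mode: we pick subnets and ranges ourselves
}

resource "google_compute_subnetwork" "us_central1" {
  name          = "migration-us-central1"
  network       = google_compute_network.migration.id
  region        = "us-central1"
  ip_cidr_range = "10.10.0.0/20"
}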
Client connection to Google

● Cloud VPN
● Dedicated or Partner Interconnect and Cloud Router

Both of these options allow you to extend your on-prem network into GCP in a relatively seamless manner.
VPN vs interconnect

VPN:
● Up to 3 Gbps
● $37 per tunnel (HA 2x)
● SLA only with HA VPN (beta)
● Encrypted
● Works anywhere

Interconnect:
● 50 Mbps-100 Gbps
● $39-$13,000
● SLA
● Unencrypted
● 81 co-location facilities
● Low latency (<5 ms round trip in specific facilities)

https://cloud.google.com/interconnect/docs/concepts/dedicated-overview
https://cloud.google.com/interconnect/docs/how-to/choose-type

10 Gbps: $1,700/mo, up to 8 links. 100 Gbps: $13,000/mo, up to 2 links. The latency figure is from the VM to the co-location facility.
Shared VPC

[Diagram: a host project ("net") contains the shared network with subnets 10.0.0.0/8 and 192.168.0.0/16; service projects for apps, db, and mgmt each run Compute Engine VMs attached to those subnets.]

Shared VPC Network (XPN) Overview

Shared VPC allows you to use the same VPC across multiple projects within an organization. In a shared VPC implementation, projects are designated as either host projects or service projects.

The Shared VPC host project contains a network that the Shared VPC service projects are granted access to.

A VM in a service project may connect to any of the subnets it has been granted access to in the same region as the VM.
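A sketch of designating the host project and attaching a service project in Terraform; both project IDs are hypothetical:

# The host project owns the shared network
resource "google_compute_shared_vpc_host_project" "host" {
  project = "net-host-project"
}

# Service projects consume subnets from the host project's VPC
resource "google_compute_shared_vpc_service_project" "apps" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "apps-service-project"
}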
Prefer custom mode subnets
● Peering VPCs, or even hybrid environments, might have issues with auto-mode subnet choices
● You have more control with custom mode
● Also, you can choose unique, descriptive names
  ○ Naming convention
● Don't forget to delete the default network
● Fewer, larger subnets are easier to manage
Load balancers
● Global/regional load balancers might be an easy fit
  ○ Just balancing the load, right?
● Know how the original load balancer operated
  ○ Does it have an agent on the VM?
  ○ May need to do improve and move
● Watch the floating IPs
  ○ IP to MAC address translation
https://docs.google.com/presentation/d/1JsMbSu1fjFSiTvU0DkxknZxqUVXBlTDBCW
8jZcI0qaU/edit#slide=id.g495dbabbf9_1_922

GCP load balancers may not be simple 1-to-1 replacements for on-prem load
balancers. Make sure you understand how the on-prem load balancer works when
planning a migration.
Floating ("shared" or "virtual") IP Addresses

● A single IP can, under different conditions, refer to different VMs
● Many on-prem solutions will not work out of the box in GCP
  ○ E.g., NGINX, HAProxy

Possible GCP solutions

On the left (in the original diagram) is the before: two HAProxy servers. They keep a heartbeat between them so if one goes down, the floating IP can be grabbed by the network interface on the other. On the right is one common GCP solution to the problem: rather than using the floating IP to address server failure, you use an Internal Load Balancer (ILB). When one of the migrated HAProxy servers goes offline, the load balancer automatically routes to the other.
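A hedged sketch of that ILB pattern in Terraform, building on the migration-vpc network sketched earlier; the region, the former floating IP, and the instance group variable are assumptions, and the two migrated HAProxy VMs are presumed to already exist in an unmanaged instance group:

variable "haproxy_instance_group" {
  description = "Self link of the unmanaged instance group holding the two migrated HAProxy VMs"
  type        = string
}

resource "google_compute_health_check" "haproxy" {
  name = "haproxy-hc"
  tcp_health_check {
    port = 80
  }
}

resource "google_compute_region_backend_service" "haproxy" {
  name                  = "haproxy-ilb"
  region                = "us-central1"
  load_balancing_scheme = "INTERNAL"
  protocol              = "TCP"
  health_checks         = [google_compute_health_check.haproxy.id]

  backend {
    group = var.haproxy_instance_group
  }
}

# The forwarding rule takes over the address that used to float between the servers
resource "google_compute_forwarding_rule" "haproxy_vip" {
  name                  = "haproxy-vip"
  region                = "us-central1"
  load_balancing_scheme = "INTERNAL"
  backend_service       = google_compute_region_backend_service.haproxy.id
  ip_address            = "10.10.0.9" # the former floating IP
  ports                 = ["80"]
  network               = google_compute_network.migration.id
  subnetwork            = google_compute_subnetwork.us_central1.id
}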
DNS
● VM DNS servers are set as part of DHCP
○ Point at metadata server: 169.254.169.254
● VMs may need external (on-prem) resolvers
● Updating VM DNS configurations
○ Scripting
○ Pre-migration manual configuration
○ Config management system
● Velostrata can help! (more later)

You may need to change the default DNS on your GCP VMs, if, for example, you
need to access an on-prem DNS to find an on-prem AD.
Firewall rules

[Diagram: firewall rules attached to a network, allowing or denying connections matched by source and destination; rules can target all instances, instances carrying a network tag, or instances running as a target service account. Source and destination can be on the same network.]
● Know payloads and communication requirements
  ○ Workload specific?
  ○ Hybrid cloud?
● Examine compliance requirements
● Velostrata requirements
● GCP Firewalls

Risk: applying the right firewall rules to the right VMs in an environment is a way to provide least-privilege (network) access.

Note: FW rules may be applied to specific VMs by either a Tag or a Service Account, but not both.

Firewall rules are applied to your VPC network, on which your GCE instances reside. Firewall rules apply to both inbound (ingress) and outbound (egress) traffic. They can also be applied between instances in your network. Firewall rules can be set to allow or deny traffic based on protocol, ports, and IP addresses. Firewall rules have the following settings:

● Action of the rule—allows or denies the connection that matches the firewall rule.
● Direction of the rule—specifies whether the rule applies to inbound (ingress) or outbound (egress) network traffic.
● Source or Destination—specifies the traffic source, if the rule is for ingress, or the destination, if the rule is for egress, by IP address. Source tags are also available for ingress rules; tags are also available for egress rules on GCP resources.
● Protocol and Ports—specifies the protocol (TCP, UDP, both) in use and the port number.
● Instance to which the rule is applied—specifies which GCE instances the rule applies to, based on tags.
● Priority of the rule—specifies where in the priority lineup this rule falls. Values closest to zero are processed first.

You should keep your firewall rules in line with the model of least privilege. To allow traffic through, the user needs to create firewall rules to explicitly allow the traffic necessary for your applications to communicate.

Assign the Compute Security Admin role to your security or networking team so that they can configure and modify the firewall rules on your network.

Mention that the ingress default is deny and the egress default is allow.
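A small sketch of an ingress rule scoped with target tags in Terraform; the network name (from the earlier sketch), source range, and tag are assumptions:

resource "google_compute_firewall" "allow_web_ingress" {
  name      = "allow-web-ingress"
  network   = "migration-vpc" # network from the earlier sketch
  direction = "INGRESS"

  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = ["10.0.0.0/8"] # internal/on-prem ranges only
  target_tags   = ["web"]        # applies only to VMs carrying this network tag
}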
Security and compliance
● Special security needs?
  ○ Do encryption keys need to be managed in a particular way?
● Regional data requirements
● Existing security policies and procedures need updating for the cloud
● Compliance?
  ○ HIPAA
  ○ PCI
  ○ ISO

You should have identified the security/compliance requirements as part of your Discovery/Assess phase. These could be quite extensive for some enterprise clients.
Lab 6
Document org structure and network requirements
Agenda
Plan and foundation

IAM and networking

Instrumentation and cost control

Data considerations

Windows

Building the migration factory



Site reliability

SREs:
Monitoring is at the base of site reliability.
Stackdriver helps make this easier.

Site reliability engineering is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems.

Site reliability starts at the core of any infrastructure. SREs are site reliability engineers: they are responsible for keeping the lights on at Google. SREs monitor, but they don't necessarily stare at a screen. There are certain reactive operations that they take care of, and they also try to automate responses. Their core focus is root-cause analysis and working out how to test and re-release fixes.

For more information, see: https://landing.google.com/sre/book.html



Multiple integrated products

● Benefits of Stackdriver:
○ Monitors multi-cloud
■ GCP and AWS
○ Identify trends and prevent issues
■ Charts
○ Reduce monitoring overhead
○ Improve signal-to-noise
■ Advanced alerting
○ Fix problems faster
■ Alerts -> Dashboards -> Logs
Migration monitoring and logging strategy
● VMs send metrics to Stackdriver
  ○ Monitoring agent will need to be installed (coming!)
● It is possible to monitor from Stackdriver
  ○ But what about servers still on-prem?
● Decide on a path
  ○ Monitor everything with Stackdriver
  ○ Monitor with Stackdriver and an on-prem system (Splunk?)
  ○ Export from Stackdriver back to the on-prem system

It is usually not feasible to monitor everything (i.e. on-prem and cloud) with
Stackdriver. Having parallel monitoring systems (one for on-prem, one for GCP) may
be confusing. Exporting from Stackdriver back to on-prem systems is perhaps the
most popular choice for clients running a hybrid environment.
Exporting logs from Stackdriver
● Stackdriver can export logs three ways:
○ JSON in Cloud Storage
○ BigQuery
○ Cloud Pub/Sub
● Exported:
○ Cloud Audit Logs: admin activity and data access
○ Monitored Service Log: GCE, Cloud SQL, etc.
● When exporting to Splunk, need the Splunk Add-on for GCP
○ Subscribes Splunk to Pub/Sub
○ Make sure to check details on where to install add-on

Most of the popular on-prem systems have a way of integrating with Stackdriver
(usually by exporting log entries to GCS or a Pub/Sub queue).
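A hedged sketch of a log export sink to Pub/Sub, which the Splunk Add-on for GCP could then subscribe to; the topic name and the audit-log filter are assumptions:

resource "google_pubsub_topic" "log_export" {
  name = "stackdriver-log-export"
}

resource "google_logging_project_sink" "to_pubsub" {
  name                   = "export-to-pubsub"
  destination            = "pubsub.googleapis.com/${google_pubsub_topic.log_export.id}"
  filter                 = "logName:\"cloudaudit.googleapis.com\"" # example: audit logs only
  unique_writer_identity = true
}

# Allow the sink's service account to publish to the topic
resource "google_pubsub_topic_iam_member" "sink_writer" {
  topic  = google_pubsub_topic.log_export.name
  role   = "roles/pubsub.publisher"
  member = google_logging_project_sink.to_pubsub.writer_identity
}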
Stackdriver groups
● Logical groups of resources
● Simplifies the monitoring of a set
of resources
● Groups can be nested to form hierarchies
● Logical unit being migrated may all belong to the
same Stackdriver group
● Terraforming Stackdriver Groups

Stackdriver groups allow you to group resources as a single logical unit for monitoring
purposes (e.g. a group of web servers used by a particular application).
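A minimal sketch of a Stackdriver group in Terraform; the display name and the name-prefix filter are illustrative assumptions:

# Groups every resource whose name starts with "web-" into one monitored unit
resource "google_monitoring_group" "web_tier" {
  display_name = "myapp1-web-tier"
  filter       = "resource.metadata.name = starts_with(\"web-\")"
}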
Alerting policy foundation
● Alerts will be triggered when a set of conditions are met
○ Notify critical personnel when actions need to be taken
○ Intro to Alerting
● Baseline and document alerts pre-migration
○ Who receives which alerts and what do they contain?
○ Alert triggers?
○ Room for improvements? Now or later?
● Exporting or using Stackdriver, build your alerting foundation
● Alerting policies with Terraform

If you are going to migrate your alerting to Stackdriver, you need to develop a good understanding of the current alerting strategy used by the client. Make sure you are not reproducing limitations in the existing system; look to improve the alerting approach where required.
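A hedged sketch of an alerting policy and its notification channel in Terraform; the threshold, duration, and email address are assumptions:

resource "google_monitoring_notification_channel" "ops_email" {
  display_name = "Ops email"
  type         = "email"
  labels = {
    email_address = "ops@example.com"
  }
}

resource "google_monitoring_alert_policy" "high_cpu" {
  display_name = "VM CPU above 80%"
  combiner     = "OR"

  conditions {
    display_name = "CPU utilization"
    condition_threshold {
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.ops_email.id]
}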
Uptime checks

● Checks the availability of web or TCP-based services
  ○ Again, document how these are handled now and by whom
● Uptime checks with Terraform

For some customers, uptime checks may already be in place for key web-facing applications.
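A small sketch of an HTTPS uptime check in Terraform; the host, path, and project ID are assumptions:

resource "google_monitoring_uptime_check_config" "frontend" {
  display_name = "frontend-uptime"
  timeout      = "10s"
  period       = "300s"

  http_check {
    path    = "/healthz" # hypothetical health endpoint
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = "my-migration-project" # hypothetical
      host       = "www.example.com"
    }
  }
}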
Dashboards
● Provide at-a-glance BI over one or many monitored resources
● Who needs access, and what should each dashboard display?
Setting up billing

Billing may happen at the project level, but organization, folder structure, project, and
resource details are all part of the billing reports (and BigQuery export).

https://cloud.google.com/billing/docs/onboarding-checklist

Your default choice for billing should be to set up a single billing account, only to be
modified where there is a clear client need to have multiple billing accounts.
Billing tips
● Make sure the org understands how billing works
○ Who has access? Who gets alerts? Who’s paying the bill?
● Set up spend and trend alerts
○ BigQuery quotas
● Billing reports and cost trends viewable in Cloud Billing Reports
● Export to BigQuery for better detailed analysis
○ Terraform can help
○ Exporting billing data to BigQuery
○ Visualize with DataStudio
● Cost management

Give your finance people the tools they need to understand cloud costs.
Labeling for resource identification
● Key-value resource identifier
○ Bulk operation runs
○ Added identity for better cost analysis
● Examples
○ Team, cost center, owner, person
○ Application layer (web, app-server)
○ State, stage, or environment
● BigQuery over billing data examples

Labelling resources is a big part of cloud cost management, as the labels are visible
when querying cost data (e.g. in BigQuery). There are examples of this in the linked
article in the slide.
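A sketch of applying labels programmatically when defining a VM in Terraform; the label keys and values are illustrative assumptions:

resource "google_compute_instance" "web_1" {
  name         = "web-1"
  machine_type = "n1-standard-2"
  zone         = "us-central1-a"

  # Labels surface in the billing export, so keep the set small and consistent
  labels = {
    cost_center = "retail"
    environment = "prod"
    app_layer   = "web"
  }

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-9"
    }
  }

  network_interface {
    network = "default"
  }
}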
Label usage

What can be labeled?
● Virtual machine instances
● BigQuery
● Images
● Persistent disks
● Cloud Storage
● Static external IP addresses (Alpha)
● VPN tunnels (Alpha)
● And many more [source]

Considerations for applying labels
● Focus on label consistency (apply programmatically)
● Make labels a simple, standard set of values that are useful both technically and business value–wise
  ○ Despite the ability to apply up to 64 labels per resource, try to stick with no more than 5 labels
● Follow .json format of -l <key>:<value>
  ○ No more than 63 characters each
  ○ Only contain lowercase letters, numeric characters, underscores, and dashes

Alpha features do not have the ability to apply labels in the console or gcloud; use the Alpha API.

Note: Alpha resources are publicly documented here:
https://cloud.google.com/compute/docs/labeling-resources
Cost analysis and rightsizing

● Google offers rightsizing recommendations out of the box
● Discussed more during the optimize portion of our class
  ○ Day 4

The slide above shows the message I would get from the recommendation engine in
Compute Engine if I had over-provisioned a VM and the VM has been running for at
least 24 hours. Up until recently, you could only see the recommendations in the
console, but a new Recommendations API is, at the time of writing, in alpha.

There are third party tools that can also make recommendations, but Google
recommends only following the recommendations within GCP.
Lab 7
Plan monitoring and billing
Agenda
Plan and foundation

IAM and networking

Instrumentation and cost control

Data considerations

Windows

Building the migration factory


Data
● Applications and their VMs commonly have persistence requirements
  ○ Memory cache (Redis, Aerospike, etc.)
  ○ Key-value pair configuration data
  ○ Relational (MySQL, SQL Server, Oracle, etc.)
  ○ NoSQL (Cassandra, HBase, etc.)
  ○ File shares
Common challenges

● Hard-coded IPs (discovery!)
● Floating IPs for failover (discussed)
● Multiple applications sharing a database server
● NAS/SMB
● Oracle or SQL Server
● Quantity of data
  ○ Transfer Appliance

The sheer amount of data that you have to move can be a challenge, both in time to
perform the migration, and cost (e.g. egress charges from another cloud
environment).
Databases and discovery
● Get details through questionnaires and interviews
● Common source of VM Migration pain
○ Static IPs, blocked ports, missing backup
locations, etc
● Learn what you can in discovery and foundation
○ Fill gaps in the migration factory

Often you will only have gathered the high-level detail about data migration
requirements during the Discovery/Assess phase, and you need to get more detail
before the specific migration.
Shared database servers
● Several apps might share a common database
● Common solutions:
○ Move the whole group
○ Use backup/restore to extract the single
required database
■ Rebuild in cloud
○ Consider an improve and move strategy, swapping the database to Cloud SQL, Spanner, Bigtable, etc.

Moving all the applications accessing a shared database at one time might be an
unacceptable risk for some migrations.
NFS/SMB

● Migrate with VMs
● Accessed through VPN/Interconnect
● Improve and move
  ○ Filestore
  ○ Cloud Storage
  ○ Persistent Disk
  ○ Etc.
● Compute Engine persistent disks

Managed filer solutions:
● NetApp Cloud Volumes
● Elastifile

Supported filer solutions in GCP Marketplace:
● Panzura
● Quobyte

Supported filer solutions from partners:
● Avere vFXT

Unsupported filer solutions:
● Single Node File Server
Oracle options

● Move it into GCP even though it isn't officially supported
● Leave it on-prem and create a hybrid solution
  ○ GCP caching?
  ○ Latency? Network fees?
● Investigate Accenture
  ○ Google partner with co-located, Oracle-certified hardware
  ○ Latency guaranteed <2 ms
  ○ Creatable through Marketplace and billed through GCP
● Improve and move, leveraging Cloud SQL or Spanner

Leaving the database on-prem might introduce unacceptable latency into the app.
Splitting the application between GCP and on-prem might increase your network
egress charges. Improve and move might require code changes, e.g. if you are using
Oracle stored procedures.
SQL Server

● One of the most popular databases in the world
● Can be migrated directly to GCP
  ○ Watch your licensing! (more soon)
● For DR, back up to a local SSD, push to Cloud Storage
  ○ Install Cloud SDK and use a SQL Server Agent job to resync
● SQL Server Always On can migrate too
  ○ But need to fix Windows Server Failover Clustering (soon)
● Move and improve to managed SQL Server?
  ○ Coming soon
● SQL Server in GCP best practices

SQL Server will soon be available as a managed service within Cloud SQL.
Agenda
Plan and foundation

IAM and networking

Instrumentation and cost control

Data considerations

Windows

Building the migration factory


Microsoft licensing concepts

● SPLA Licenses
● Enterprise Agreement
● Live Migration
● Software Assurance
● BYOL
● Product Classes
● License Mobility
● Sole Tenancy
● Custom T&Cs

https://www.microsoft.com/en-us/licensing/licensing-programs/spla-program?activetab
=spla-program%3aprimaryr2
https://www.microsoft.com/en-in/licensing/licensing-programs/FAQ-Software-Assuran
ce
https://cloud.google.com/compute/docs/instances/windows/bring-your-own-license/
https://cloud.google.com/compute/docs/instances/windows/ms-licensing
https://cloud.google.com/compute/docs/instances/windows/bring-your-own-license/fre
quently-asked-questions
https://cloud.google.com/compute/docs/instances/windows/ms-licensing

GCP Premium Images use SPLA Licenses, which are charged by the vendor (Microsoft) and billed by Google. These are not necessarily connected in any way to existing licenses your customer may have purchased.

Software Assurance is a volume licensing program. Benefits can be found here:
http://download.microsoft.com/download/0/0/3/0039F316-45CF-4083-AA6E-C35DA9D25C1B/SA_InteractiveBenefitsChart.pdf

License Mobility allows you to deploy certain licenses to cloud deployments.

Enterprise Agreements are volume licensing agreements aimed at larger license purchases (500+), and are at least 3 years in length. Software Assurance is included in an Enterprise Agreement.

BYOL = Bring Your Own License. Rather than paying for SPLA licenses within premium images, you assign a license you already own to the deployment.

In GCP, Sole Tenancy means that an entire physical server is reserved for
consumption by a single customer (no sharing).

Microsoft Licensing typically requires that a system running Microsoft products with
non-SPLA licenses must serve a single customer. Sole tenant nodes will allow
customers to bring existing licenses to the cloud. For workloads that are not
concerned with physical core or socket usage based on the nature of the licensing
and product terms, you can use sole-tenant nodes without the in-place restart feature.

In GCP, Live Migration is the default configuration for new VMs, and will move a VM
from one physical host to another when a migration event is scheduled. For licenses
that limit physical core or socket use, you will need to use in-place restart. When you
enable in-place restart on sole-tenant nodes, Compute Engine minimizes the number
of physical servers your VM runs on by restarting the VM on the same server
whenever possible. If restarting VMs on the same physical server is not possible (for
example, if the physical server experiences a critical hardware failure), your VMs will
be migrated to another server. Compute Engine will assign and report a new physical
server ID and the old server ID will be permanently retired.

This is especially useful during host maintenance events; instead of live migrating to a
new physical server, Compute Engine terminates and restarts the VM on the same
server. Please note that VMs will be taken offline and be unavailable while
maintenance is applied.

Microsoft product classes (Server, Client, Server Applications) have different licensing
schemes, and these will affect the licensing decisions you make.

Everything above is simplified and drawn from publicly available EULAs, SPLAs, and T&Cs. Customers may have custom T&Cs which require review and differing strategies.
Microsoft licensing details

GCP premium images (core/minute charge):
● GCP availability: GA
● Permitted by MSFT: Server yes, Clients no, Apps yes
● Good use cases: "spiky" workloads; high memory/CPU configs; getting started; avoid license management; containers; MSFT is a low % of overall spend
● Bad use cases/pitfalls: customer already has licenses; can get really expensive for large, stable workloads

GCP BYOL with Software Assurance/License Mobility:
● GCP availability: GA
● Permitted by MSFT: Server no, Clients no, Apps yes
● Good use cases: customer has licenses; high memory / low CPU
● Bad use cases/pitfalls: customer has to pay for SA/LM; does not support Windows OS

Sole Tenant offering, no Live Migration:
● GCP availability: GA
● Permitted by MSFT: Server yes, Clients yes, Apps yes
● Good use cases: apps that can tolerate unscheduled downtime (~hour a month); fault-tolerant apps
● Bad use cases/pitfalls: ~1 hr of downtime/month

Sole Tenant offering, Live Migration:
● GCP availability: Alpha Q3; GA EOY
● Permitted by MSFT: Server yes, Clients yes, Apps yes
● Good use cases: apps that can't tolerate downtime
● Bad use cases/pitfalls: extra infrastructure/licensing spend

https://www.microsoft.com/en-us/licensing/licensing-programs/spla-program?activetab
=spla-program%3aprimaryr2
https://www.microsoft.com/en-in/licensing/licensing-programs/FAQ-Software-Assuran
ce
https://cloud.google.com/compute/docs/instances/windows/bring-your-own-license/
https://cloud.google.com/compute/docs/instances/windows/ms-licensing
https://cloud.google.com/compute/docs/instances/windows/bring-your-own-license/fre
quently-asked-questions
https://cloud.google.com/compute/docs/instances/windows/ms-licensing

A premium image VM is licensed to host an unlimited number of containers.

Based on ~50 deals analyzed by the PM team:
- about 30% of workloads are viable for No Live Migration (test and dev, analytic workloads, etc.)
- about 70% are not

No Live Migration VMs are terminated and restarted in-place during host maintenance.

The extra cost for Live Migration is associated with maintaining a dedicated pool of VMs within which live migration can happen.
Windows OSes supported by BYOL with sole-tenant nodes:
- Windows Server 2008 R2 SP1
- Windows Server 2012
- Windows Server 2012 R2
- Windows Server 2016
- Windows 7 SP1 Enterprise x64
- Windows 10 Enterprise x64

At time of writing, BYOL with sole-tenant nodes is limited to:
- us-central1
- us-west1
- us-east1
- europe-west1
GCP sole-tenant nodes

● A sole-tenant node is a physical server that hosts VMs only for your project
  ○ A node is associated with one server
  ○ A node can host multiple VMs
  ○ Nodes are defined in node groups
  ○ Affinity controls where VMs run
● Limitations include…
  ○ Nodes available in select zones
  ○ VMs must have at least two vCPUs
  ○ GPUs, Local SSDs are unavailable

[Diagram: a node template (region, server config, restart config, affinity tags) defines a node group; the node group contains nodes; each node hosts VMs.]

https://cloud.google.com/compute/docs/nodes/
https://cloud.google.com/compute/docs/nodes/create-nodes
https://cloud.google.com/compute/pricing#nodes

To see if a given region/zone support sole tenant nodes, visit the GCE Regions and
Zones page.

At time of writing, there is only one available node type: n1-node-96-624. A project
must have sufficient CPU quota in a given region to allow node creation.

Affinity is controlled by assigning affinity labels to node groups and nodes. Some
labels are assigned automatically:
- compute.googleapis.com/node-group-name = [node group name]
- compute.googleapis.com/node-name = [node name]
Additional labels can be applied when defining a template; e.g. env = production; app
= frontend; license = byol

When creating a new VM, you can specify node affinity settings; e.g.:
- license:IN:byol
- env:NOT:production
When creating a VM in a node group with in-place restart configured, you must set the "On host maintenance" policy to Terminate.
Windows Server Failover Clustering

● WSFC is a Windows Server HA feature
  ○ If one node fails, automatically routes to another
  ○ Heart of SQL Server's AlwaysOn Availability Groups
● Normally accomplished with Address Resolution Protocol (ARP)
  ○ Maps IPs to MAC addresses
  ○ When a failure occurs, ARP routes the same IP to a different MAC
● Also requires a file share witness off the clustered machines
● Will require an agent installed in the OS
  ○ Part of the Windows Guest Environment for GCP (discussed tomorrow)
● Place servers behind the Google Internal Load Balancer (ILB)
● ILB health check hits the agent on the OS
  ○ If failure, then assigns IP to the backup server
Re: diagram above:
● The Compute Engine agent for the VM named wsfc-2 is
responding to the health check with the value 1, indicating it is
the active cluster node. For wsfc-1, the response is 0.
● The load balancer is routing requests to wsfc-2, as indicated by
the arrow.
● The load balancer and wsfc-2 both have the IP address
10.0.0.9. For the load balancer, this is the specified frontend IP
address. For the VM, it's the IP address of the application. The
failover cluster sets this IP address on the currently active node.
● The failover cluster and wsfc-2 both have the IP address
10.0.0.8. The VM has this IP address because it currently hosts
the cluster resources.

On failure:

● Windows failover clustering changes the status of the active node to indicate that it has failed.
● Failover clustering moves any cluster resources and roles from the failing node to the best node, as defined by the quorum. This action includes moving the associated IP addresses.
● Failover clustering broadcasts ARP packets to notify hardware-based network routers that the IP addresses have moved. For this scenario, GCP networking ignores these packets.
● After the move, the Compute Engine agent on the VM for the failing node changes its response to the health check from 1 to 0, because the VM no longer hosts the IP address specified in the request.
● The Compute Engine agent on the VM for the newly active node likewise changes its response to the health check from 0 to 1.
● The internal load balancer stops routing traffic to the failing node and instead routes traffic to the newly active node.

● Running Windows Server Failover Clustering


Agenda
Plan and foundation

IAM and networking

Instrumentation and cost control

Data considerations

Windows

Building the migration factory


The migration machine

● How do you eat an elephant?
  ○ One bite at a time
● Migration can be a huge undertaking
● Need a system that's flexible, reusable, and iterable
● Migration won't be a purely waterfall/linear process
● Divide and conquer!

We need a way of breaking up our migration task (which could be huge) into a series
of manageable tasks.
The Agile-inspired VM migration process

Migration sprint cycle:
● Prepare: build target environments and select migration candidates from the backlog
● Migrate: execute the move of apps and services to GCP
● Test/Verify: conduct UAT and regression testing
● Optimize: decouple stateful and stateless, scale horizontally, rightsize & PVM
● Improve: learn lessons, improve the migration process

We can use an Agile approach by using migration sprints. Typically a sprint lasts for two weeks (though the time box is up to your team), and the goal is to move a certain number of VMs/apps within that two-week period. After each sprint, especially the early ones, you learn a lot and refine your approach for the next sprint. The iterative process will continue to give the team an opportunity to learn and improve in efficiency throughout the migration.
Involve application owners

● This is their stuff, which they built and are maintaining
  ○ Migration shouldn't be a surprise
● Get their buy-in
  ○ The data center isn't gone, it's just moving
● Work with them to update SOPs
● Training! Training! Training!
  ○ Changes to DR
  ○ Changes to logging, monitoring, debugging, etc.
● What's happening to the developer pipeline?

As with any project, getting the right people involved, with the right skills and
motivation, is the challenge.
Sprint planning meeting
● Discuss an initial sprint length and high-level plan
from discovery
○ Tweak the plan each iteration to incorporate
lessons learned
○ Two weeks works well
● Finalize migration team(s)
● Run through the rough wave plan and time
estimations
○ This is your product backlog
● Confirm the workload for this sprint
○ Starting with first mover

How are you going to allocate migrations to particular sprints? Who needs to be
involved? Your first mover app is going to be part of the first sprint.
Week 1: discovery and foundation scripting
● Most of week 1 is getting to know the application(s) being migrated
  ○ Dig into the detail from automatic discovery and the questionnaires
● Reach out to application owners for clarifications
● What foundational changes are needed for this workload?
● Foundation in place, script (Terraform) the deltas
  ○ Excellent support in Terraform

We often need to change our foundation e.g. refine network, firewalls, IAM etc, to
support the requirements of a particular migration.
Week 2: complete foundation, migrate, test
● Beginning week 2, work to finalize foundational
scripts
● Early-mid week 2 should be the migration
○ Tomorrow’s class!
● Smoke test it to make sure it’s working
● End the week by analyzing:
○ What worked?
○ How could the process be improved?
○ Any sorting of backlog needed?

Every sprint provides lessons that can be applied to subsequent sprints. You may
also rethink your migration priorities based on the outcomes of the early sprint.
Lab 8
Plan for data and other concerns, then generate a
backlog for our migration factory
● Assess/Discover your application landscape
● Plan/Foundation: create a landing zone
● Migrate! Pick a path and get started
● Optimize your operations and save on costs

Cloud migration is a journey: the end-to-end lifecycle whereby workloads move from other locations (on-prem, other clouds) into GCP. GCP is the destination these workloads migrate to, and they are often modernized/optimized in-cloud afterwards.

Time to move on
Deliverables
● GCP landing zone
  ○ Built with Terraform IaC
● Our org structure
● A migration factory
  ○ Though it will need tweaking every sprint
Training
● Throughout everything you do, help advise on training
● How will the client support the org structure you've built?
● Do they understand the new architecture?
● Has the security team been brought up to speed on the new structure?
● You're going to leave
  ○ And the client will have to support all of this
