VMMIG Module 03: Plan/Foundation
Assess/Discover your application landscape -> Plan/Foundation: create a landing zone ->
Migrate! Pick a path to the cloud and get started -> Optimize your operations and save on costs
After learning about the client’s environment in the Assess/Discover phase, we are
now going to build the foundation, the landing zone we’ll migrate our VMs into.
Learn how to...
Plan a GCP landing zone for your migrated VMs
Data considerations
Windows
Important: Plan isn’t a monolithic, linear plan. Don’t get stuck in analysis paralysis.
We’ll help group workloads into migration waves and build a detailed plan for the
first workloads.
The plan is necessarily less detailed for later waves. It is iterative, building and
maintaining a pipeline of workloads that are ready for migration.
Assumptions
● You are a Google Cloud architect
● You can build projects, VMs, VPCs, subnets, and firewall rules
● You understand how GCP IAM and org structure work
● You know the GCP building blocks
● This chapter is about applying that knowledge to VM migration
You will need a good knowledge of GCP to perform the tasks associated with this
phase.
Configuration management
● You should have a configuration management system in place
○ Ansible, Puppet, Chef, etc.
○ Improve and move opportunity?
● Lift and shift typically involves configuration changes
● Automated configuration management can help
○ On-prem pushing changes to cloud (networking, security...)
○ Install in cloud as part of foundation
Lift and shift inevitably involves some configuration changes at scale, and you will need a
way of automating them.
You may have Chef (or a similar tool) on-prem and another instance in the cloud (as part of
the foundation), or you may keep it on-prem and push changes to the cloud (make sure the
networking, firewall rules, connectivity, and so on are in place). Either way, make sure the
configuration management system can resolve and talk to the servers it needs to manage.
Infrastructure as Code (IaC)
● IaC helps on three fronts:
○ Reduced cost (people and effort)
○ Faster building and tweaking of the foundation for each wave
○ Lower risk
● There will be similarities between migration waves
○ DRY principle: Don’t Repeat Yourself
○ Find what works and stick to it
○ IaC makes sure you don’t forget steps
● Repeatable and recreatable (waves and DR)
This class advocates IaC. We will build our foundation using Terraform, though
other options are available; Terraform is one of the most popular IaC tools.
Infrastructure as Code (IaC) tools
● Deployment Manager: declarative; hosted; driven by Discovery/Swagger; not multi-platform
● Puppet: declarative; not hosted; not driven by Discovery/Swagger; multi-platform
● Chef: imperative; not hosted; not driven by Discovery/Swagger; multi-platform
● Terraform: declarative; not hosted; not driven by Discovery/Swagger; multi-platform
● Cloud Formation: declarative; hosted; not driven by Discovery/Swagger; not multi-platform
Deployment Manager is the native GCP tool, but is relatively verbose (Terraform
scripts tend to be shorter, for example). Cloud Formation is the AWS tool.
Terraform
● Cloud agnostic IaC tool from HashiCorp
○ Freeware CLI, but also a team-oriented enterprise version
● Easy process
○ Define infrastructure with configuration files
○ Generate a plan for verification
○ Execute
● Handles creation and change automation
● Efficient and codifiable
● Terraform Intro
● Build GCP with Terraform example
During this phase, get your foundations right: make sure you’re ready to migrate at
scale. Google has a Cloud Foundation Toolkit. Identity and Access, Networking,
Instrumentation, and Cost Control are the four key aspects of building a foundation for
the GCP deployment.
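As a minimal sketch of that define/plan/execute flow (the project ID, region, and CIDR range below are placeholder assumptions, not values from this course), a landing-zone network in Terraform might start like this:

```hcl
# Minimal landing-zone sketch (illustrative names and ranges).
provider "google" {
  project = "my-landing-zone-project"   # placeholder project ID
  region  = "us-central1"
}

# Custom-mode VPC so subnets are created deliberately, per wave.
resource "google_compute_network" "migration_vpc" {
  name                    = "migration-vpc"
  auto_create_subnetworks = false
}

# First subnet for migrated VMs; add more as waves require.
resource "google_compute_subnetwork" "wave1" {
  name          = "wave1-subnet"
  ip_cidr_range = "10.10.0.0/20"
  region        = "us-central1"
  network       = google_compute_network.migration_vpc.id
}
```

Running terraform plan shows what would be created for review; terraform apply then builds it, and the same files can be re-applied for later waves or DR rebuilds.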
Agenda
Plan and foundation
Data considerations
Windows
IAM determines how individuals are authenticated to GCP, and what resources they
are authorized to access. IAM also addresses how code and VMs access GCP
resources.
GCP authentication
● Use G Suite directly
○ May need to import/create users
○ Migrating to Cloud Identity or G Suite
● Link GCP to a SAML2 compliant SSO service
● Use one of the two main AD options
○ Federate using Google Cloud Directory Sync (GCDS)
○ Federate with Azure AD
○ Federating Google Cloud Platform with Active Directory
If you only need core admin-type access, manually creating credentials might be fine. For
a larger user base, however, you will need something more sophisticated.
Federate with AD using GCDS
● Syncs users and groups
● One way, scheduled push from
AD to G Suite
● Auth through ADFS
● Fine-grained control using rules
● Details
Active Directory remains the “source of truth” for authentication. In this scenario, I
would need to go back to the on-prem AD system to actually authenticate a particular
logon.
https://cloud.google.com/solutions/federating-gcp-with-active-directory-introduction
Federate directly with Azure AD
● On-prem AD plus Azure AD remain the one source of truth
● Updates are immediate (no manual/timed sync)
● May be worth adding the Azure AD piece, even if the client is not
already using it
● Details
For complex AD environments, the most seamless integration with on-prem AD may
be via Azure AD, rather than GCDS.
IAM tied to organizational structure
● Org/folder/project/resource organization
● IAM uses a union of permissions approach
○ Inherited permissions may not be removed, only
added to
● Decide where and how to use folders
○ Will require input on the client’s culture
○ Folders help with organization and base
permissions
● Resource Organization and IAM
You need to set up the org/folder structure before you start creating projects. Getting
this right at the beginning lets you take advantage of permission inheritance to simplify
the granting of access, e.g. you can assign permissions at the folder level and have all
the projects in the folder inherit those permissions.
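As an illustrative sketch (the org ID, folder names, and group address are placeholders, not the client’s real structure), folder-level IAM in Terraform looks like this; every project later created under the folder inherits the binding:

```hcl
# Hypothetical folder layout; adjust to the client's structure.
resource "google_folder" "apps" {
  display_name = "Apps"
  parent       = "organizations/123456789012"   # placeholder org ID
}

resource "google_folder" "shared_services" {
  display_name = "Shared Services"
  parent       = "organizations/123456789012"
}

# Grant a group viewer access at the folder level; every project
# created under the folder inherits the binding.
resource "google_folder_iam_member" "apps_viewers" {
  folder = google_folder.apps.name
  role   = "roles/viewer"
  member = "group:gcp-apps-viewers@example.com"   # placeholder group
}
```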
Example: granular access-oriented hierarchy
(Diagram) Folders such as Apps, Sandbox, Shared, Core services, D&A, and Control services
sit under the org, giving granular permissions, easy extensibility, inheritance, and support
for a complex network topology.
The structure you come up with should reflect how the client thinks about their own
organisation, and GCP org structures best practices.
Be careful with naming folders around department names, as organisations will often
reorganize. Base your structure around reasonably stable reference points.
Create a solid set of org-level starter groups
Details
The linked document provides a deeper dive about how you typically set up your
groups when starting with GCP
Migrated VMs and credential management
● Migrated machines frequently need two forms of IAM
○ Machine to GCP resources (IAM Service Accounts)
○ User to machine, machine to machine, workload specific
● How would code on a VM access a GCS bucket?
● How does logging into a Windows machine in the typical
corporate environment work?
● How do most Windows Server applications authenticate into
SQL Server?
AD and VM workloads
● Most popular credential management system in the
world
● Options
○ 1: Migrate to GCP
○ 2: Leave where it is (on-prem or Azure)
○ 3: Set up a new AD server in GCP just for apps
○ 4: Set up a new AD server and federate
○ 5: Managed AD
Moving your AD to GCP might be overkill, if only a small subset of your AD users
require GCP access.
Option 2: Connect back to on-prem AD
● Pros
○ Requires little to no application/VM changes
○ Fast
○ Don’t have to move/recreate AD
● Cons:
○ Trust boundary for AD now extended to include the cloud
● DNS: Make sure it resolves correctly!
○ Only global domains allowed
● Also works for Azure AD
● AD in a hybrid environment
Option 3: New AD server just for apps
● Works best in environments where only a small admin/support
staff will need access
○ Or where it’s used for access between machines
● Pros:
○ Easy
○ Quickly done
● Cons:
○ Have a new AD server to manage
■ Can become management nightmare
● Deploying a fault-tolerant AD to GCP
Here you create an independent AD in the cloud, just to support cloud-based apps. If
you have a lot of users, it can become a pain to maintain two sets of users in two
different ADs. Often this works best as an interim option
Option 4: New AD server, cross-forest trust
● Essentially, two different AD servers that have a
special trust relationship
● Several different configuration options
● Pros:
○ Less management
● Cons:
○ Harder to setup
● AD-GCP integration patterns
Keep an eye on this development - it may provide the best solution in the longer term.
Network concepts
(Diagram) A project contains a VPC network that spans regions; each region holds subnets
(e.g. 192.168.0.0/16, 172.16.0.0/12, 10.0.0.0/8). Connectivity back to on-prem comes via
Cloud VPN or Dedicated/Partner Interconnect, with Cloud Router for dynamic routing.
Both of these options allow you to extend your on-prem network into GCP in a
relatively seamless manner.
VPN vs interconnect
● Cloud VPN
○ Up to 3 Gbps per tunnel
○ $37 per tunnel (HA: 2x)
○ SLA only with HA VPN (beta)
○ Encrypted
○ Works anywhere
● Interconnect
○ 50 Mbps to 100 Gbps
○ $39 to $13,000
○ SLA
○ Unencrypted
○ 81 co-location facilities
○ Low latency (<5 ms round trip in specific facilities)
https://cloud.google.com/interconnect/docs/concepts/dedicated-overview
https://cloud.google.com/interconnect/docs/how-to/choose-type
10 Gbps: $1,700/mo, up to 8 links; 100 Gbps: $13,000/mo, up to 2 links.
The latency figure is from the VM to the co-location facility.
Shared VPC
(Diagram) A host project’s network contains subnets (e.g. 10.0.0.0/8 and 192.168.0.0/16)
that Compute Engine VMs in the service projects (apps, db) attach to.
Shared VPC allows you to use the same VPC across multiple
projects within an organization. In a Shared VPC implementation,
projects are designated as either host projects or service projects.
The Shared VPC host project contains a network that the Shared VPC
service projects are granted access to.
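A hedged Terraform sketch of the same idea (the project IDs are placeholders) designates the host project and attaches the two service projects:

```hcl
# Designate the network project as the Shared VPC host.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "net-host-project"            # placeholder host project ID
}

# Attach the application and database projects as service projects.
resource "google_compute_shared_vpc_service_project" "apps" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "svc-apps-project"    # placeholder
}

resource "google_compute_shared_vpc_service_project" "db" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "svc-db-project"      # placeholder
}
```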
GCP load balancers may not be simple 1-to-1 replacements for on-prem load
balancers. Make sure you understand how the on-prem load balancer works when
planning a migration.
Floating (“shared” or “virtual”) IP addresses
● A single IP can, under different conditions, refer to different VMs
● Many on-prem solutions will not work out of the box in GCP
○ E.g., NGINX, HAProxy
DNS
● VM DNS servers are set as part of DHCP
○ Point at metadata server: 169.254.169.254
● VMs may need external (on-prem) resolvers
● Updating VM DNS configurations
○ Scripting
○ Pre-migration manual configuration
○ Config management system
● Velostrata can help! (more later)
You may need to change the default DNS on your GCP VMs, if, for example, you
need to access an on-prem DNS to find an on-prem AD.
Firewall rules
(Diagram) A firewall rule matches traffic by source or destination (IP ranges, network tags,
or service accounts on Compute Engine instances) and by target (all instances in the network,
instances with a target tag, or instances with a target service account), and then allows or
denies the connection. Sources and targets can be on the same network.
Firewall rules are applied to the VPC network on which your GCE
instances reside. They apply to both inbound (ingress) and
outbound (egress) traffic, including traffic between instances in
your network, and can be set to allow or deny traffic based on
protocol, ports, and IP addresses.
Keep your firewall rules in line with the model of least
privilege: to allow traffic through, you need to create firewall
rules that explicitly allow the traffic necessary for your
applications to communicate.
Note that ingress traffic is denied by default and egress traffic is allowed by
default.
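For example, a least-privilege ingress rule scripted in Terraform might look like the following (the network name, source range, and tag are placeholder assumptions):

```hcl
# Allow HTTPS from the on-prem range only to VMs tagged "web".
resource "google_compute_firewall" "allow_https_from_onprem" {
  name      = "allow-https-from-onprem"
  network   = "migration-vpc"             # placeholder network name
  direction = "INGRESS"

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  source_ranges = ["192.168.0.0/16"]      # placeholder on-prem range
  target_tags   = ["web"]
}
```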
Security and compliance
● Special security needs?
○ Do encryption keys need to be managed in a particular way?
● Regional data requirements
● Existing security policies and procedures may need updating for the
cloud
● Compliance?
○ HIPAA
○ PCI
○ ISO
● Standards, regulations & certifications
Data considerations
Windows
Site reliability
Monitoring is at the base of site reliability.
Stackdriver helps make this easier.
Monitoring
Site reliability starts at the core of any infrastructure. SREs are site reliability
engineers: they are responsible for keeping the lights on at Google. SREs monitor,
but they don’t necessarily stare at a screen. There are certain reactive actions they
take care of, and they also try to automate responses. Their core work is root cause
analysis and determining how to test and re-release fixes.
● Benefits of Stackdriver:
○ Monitors multi-cloud
■ GCP and AWS
○ Identify trends and prevent issues
■ Charts
○ Reduce monitoring overhead
○ Improve signal-to-noise
■ Advanced alerting
○ Fix problems faster
■ Alerts -> Dashboards -> Logs
Migration monitoring and logging strategy
● VMs send metrics to Stackdriver
○ The monitoring agent will need to be installed (coming!)
● It is possible to monitor everything from Stackdriver
○ But what about servers still on-prem?
● Decide on a path
○ Monitor everything with Stackdriver
○ Monitor with Stackdriver and an on-prem system
(Splunk?)
○ Export from Stackdriver back to the on-prem system
It is usually not feasible to monitor everything (i.e. on-prem and cloud) with
Stackdriver. Having parallel monitoring systems (one for on-prem, one for GCP) may
be confusing. Exporting from Stackdriver back to on-prem systems is perhaps the
most popular choice for clients running a hybrid environment.
Exporting logs from Stackdriver
● Stackdriver can export logs three ways:
○ JSON in Cloud Storage
○ BigQuery
○ Cloud Pub/Sub
● Exported:
○ Cloud Audit Logs: admin activity and data access
○ Monitored service logs: GCE, Cloud SQL, etc.
● When exporting to Splunk, you need the Splunk Add-on for GCP
○ Subscribes Splunk to Pub/Sub
○ Make sure to check the details on where to install the add-on
Most of the popular on-prem systems have a way of integrating with Stackdriver
(usually by exporting log entries to GCS or a Pub/Sub queue).
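As a sketch of the Pub/Sub route (the topic and sink names are placeholders, and the log filter is just an example), the export can be scripted in Terraform; the Splunk Add-on for GCP then subscribes to the topic:

```hcl
# Pub/Sub topic the Splunk Add-on for GCP will subscribe to.
resource "google_pubsub_topic" "log_export" {
  name = "stackdriver-log-export"          # placeholder name
}

# Export admin-activity audit logs (adjust the filter as needed).
resource "google_logging_project_sink" "to_splunk" {
  name                   = "export-to-splunk"
  destination            = "pubsub.googleapis.com/${google_pubsub_topic.log_export.id}"
  filter                 = "logName:\"cloudaudit.googleapis.com%2Factivity\""
  unique_writer_identity = true
}

# Let the sink's service account publish to the topic.
resource "google_pubsub_topic_iam_member" "sink_publisher" {
  topic  = google_pubsub_topic.log_export.name
  role   = "roles/pubsub.publisher"
  member = google_logging_project_sink.to_splunk.writer_identity
}
```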
Stackdriver groups
● Logical groups of resources
● Simplifies the monitoring of a set
of resources
● Groups can be nested to form hierarchies
● Logical unit being migrated may all belong to the
same Stackdriver group
● Terraforming Stackdriver Groups
Stackdriver groups allow you to group resources as a single logical unit for monitoring
purposes (e.g. a group of web servers used by a particular application).
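A minimal Terraform sketch of such a group (the display name and name prefix are assumptions for illustration):

```hcl
# Group all instances whose names start with the wave's prefix,
# e.g. the web tier of the application being migrated.
resource "google_monitoring_group" "wave1_web" {
  display_name = "wave1-web-servers"                         # placeholder
  filter       = "resource.metadata.name = starts_with(\"wave1-web\")"
}
```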
Alerting policy foundation
● Alerts are triggered when a set of conditions is met
○ Notify critical personnel when actions need to be taken
○ Intro to Alerting
● Baseline and document alerts pre-migration
○ Who receives which alerts and what do they contain?
○ Alert triggers?
○ Room for improvements? Now or later?
● Whether exporting from Stackdriver or using it directly, build your alerting foundation
● Alerting policies with Terraform
If you are going to migrate your alerting to Stackdriver, you need to develop a good
understanding of the current alerting strategy used by the client. Make sure you are
not reproducing limitations of the existing system; look to improve the alerting
approach where required.
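A hedged Terraform sketch of one such policy (the channel address, threshold, and duration are illustrative assumptions, not prescribed values):

```hcl
# Placeholder email channel for the on-call group.
resource "google_monitoring_notification_channel" "oncall" {
  display_name = "Migration on-call"
  type         = "email"
  labels = {
    email_address = "oncall@example.com"   # placeholder address
  }
}

# Alert when a migrated VM's CPU stays above 80% for 5 minutes.
resource "google_monitoring_alert_policy" "high_cpu" {
  display_name = "High CPU on migrated VMs"
  combiner     = "OR"

  conditions {
    display_name = "CPU utilization > 80%"
    condition_threshold {
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "300s"
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.oncall.id]
}
```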
Uptime checks
For some customers, uptime checks may already be in place for key web-facing
applications.
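If you do manage them as code, an uptime check can also be declared in Terraform; the host, path, and project below are placeholders:

```hcl
# HTTPS uptime check against a migrated web front end.
resource "google_monitoring_uptime_check_config" "web_front_end" {
  display_name = "web-front-end-uptime"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path    = "/healthz"      # placeholder health endpoint
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = "my-landing-zone-project"   # placeholder
      host       = "app.example.com"           # placeholder
    }
  }
}
```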
Dashboards
● Provide at-a-glance visibility over one or many monitored resources
● Who needs access, and what should each dashboard display?
Setting up billing
Billing may happen at the project level, but organization, folder structure, project, and
resource details are all part of the billing reports (and BigQuery export).
https://cloud.google.com/billing/docs/onboarding-checklist
Your default choice for billing should be to set up a single billing account, only to be
modified where there is a clear client need to have multiple billing accounts.
Billing tips
● Make sure the org understands how billing works
○ Who has access? Who gets alerts? Who’s paying the bill?
● Set up spend and trend alerts
○ BigQuery quotas
● Billing reports and cost trends viewable in Cloud Billing Reports
● Export to BigQuery for better detailed analysis
○ Terraform can help
○ Exporting billing data to BigQuery
○ Visualize with DataStudio
● Cost management
Give your finance people the tools they need to understand cloud costs.
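One way to script spend alerts is a billing budget; this is a hedged sketch in which the billing account ID and amounts are placeholders, and depending on provider version the resource may require the google-beta provider:

```hcl
# Budget with spend alerts at 50%, 90%, and 100% of a monthly amount.
resource "google_billing_budget" "migration_budget" {
  billing_account = "000000-000000-000000"   # placeholder billing account
  display_name    = "Migration monthly budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "10000"                # placeholder monthly budget
    }
  }

  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.9 }
  threshold_rules { threshold_percent = 1.0 }
}
```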
Labeling for resource identification
● Key-value resource identifier
○ Bulk operation runs
○ Added identity for better cost analysis
● Examples
○ Team, cost center, owner, person
○ Application layer (web, app-server)
○ State, stage, or environment
● BigQuery over billing data examples
Labelling resources is a big part of cloud cost management, as the labels are visible
when querying cost data (e.g. in BigQuery). There are examples of this in the linked
article in the slide.
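As a sketch, labels can be applied when VMs are created with Terraform (all values below are placeholders); the same keys then appear in the billing export for cost queries:

```hcl
# Labels flow through to billing export rows, making per-team and
# per-environment cost queries straightforward in BigQuery.
resource "google_compute_instance" "app_server" {
  name         = "wave1-app-01"            # placeholder
  machine_type = "n1-standard-2"
  zone         = "us-central1-a"

  labels = {
    team        = "payments"               # placeholder values
    cost-center = "cc-1234"
    environment = "production"
    layer       = "app-server"
  }

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"     # placeholder image
    }
  }

  network_interface {
    subnetwork = "wave1-subnet"            # placeholder subnet
  }
}
```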
Label usage
Some alpha features do not support applying labels in the console or with gcloud.
Use the alpha API instead.
The slide above shows the message I would get from the recommendation engine in
Compute Engine if I had over-provisioned a VM and the VM has been running for at
least 24 hours. Up until recently, you could only see the recommendations in the
console, but a new Recommendations API is, at the time of writing, in alpha.
There are third party tools that can also make recommendations, but Google
recommends only following the recommendations within GCP.
Lab 7
Plan monitoring and billing
Agenda
Plan and foundation
Data considerations
Windows
The sheer amount of data that you have to move can be a challenge, both in time to
perform the migration, and cost (e.g. egress charges from another cloud
environment).
Databases and discovery
● Get details through questionnaires and interviews
● Common source of VM Migration pain
○ Static IPs, blocked ports, missing backup
locations, etc
● Learn what you can in discovery and foundation
○ Fill gaps in the migration factory
Often you will only have gathered high-level detail about data migration
requirements during the Assess/Discover phase, and you need to get more detail
before the specific migration.
Shared database servers
● Several apps might share a common database
● Common solutions:
○ Move the whole group
○ Use backup/restore to extract the single
required database
■ Rebuild in cloud
○ Consider improve and move strategy, swapping
the database to Cloud SQL, Spanner, BigTable,
etc.
Moving all the applications accessing a shared database at one time might be an
unacceptable risk for some migrations.
NFS/SMB
Leaving the database on-prem might introduce unacceptable latency into the app.
Splitting the application between GCP and on-prem might increase your network
egress charges. Improve and move might require code changes, e.g. if you are using
Oracle stored procedures.
SQL Server
SQL Server will soon be available as a managed service within Cloud SQL.
Agenda
Plan and foundation
Data considerations
Windows
https://www.microsoft.com/en-us/licensing/licensing-programs/spla-program?activetab=spla-program%3aprimaryr2
https://www.microsoft.com/en-in/licensing/licensing-programs/FAQ-Software-Assurance
https://cloud.google.com/compute/docs/instances/windows/bring-your-own-license/
https://cloud.google.com/compute/docs/instances/windows/ms-licensing
https://cloud.google.com/compute/docs/instances/windows/bring-your-own-license/frequently-asked-questions
GCP premium images use SPLA licenses, which are charged by the vendor
(Microsoft) and billed by Google. These are not necessarily connected in any way to
existing licenses your customer may have purchased.
In GCP, Sole Tenancy means that an entire physical server is reserved for
consumption by a single customer (no sharing).
Microsoft Licensing typically requires that a system running Microsoft products with
non-SPLA licenses must serve a single customer. Sole tenant nodes will allow
customers to bring existing licenses to the cloud. For workloads that are not
concerned with physical core or socket usage based on the nature of the licensing
and product terms, you can use sole-tenant nodes without the in-place restart feature.
In GCP, Live Migration is the default configuration for new VMs, and will move a VM
from one physical host to another when a migration event is scheduled. For licenses
that limit physical core or socket use, you will need to use in-place restart. When you
enable in-place restart on sole-tenant nodes, Compute Engine minimizes the number
of physical servers your VM runs on by restarting the VM on the same server
whenever possible. If restarting VMs on the same physical server is not possible (for
example, if the physical server experiences a critical hardware failure), your VMs will
be migrated to another server. Compute Engine will assign and report a new physical
server ID and the old server ID will be permanently retired.
This is especially useful during host maintenance events; instead of live migrating to a
new physical server, Compute Engine terminates and restarts the VM on the same
server. Please note that VMs will be taken offline and be unavailable while
maintenance is applied.
Microsoft product classes (Server, Client, Server Applications) have different licensing
schemes, and these will affect the licensing decisions you make.
Everything above is simplified and drawn from publicly available EULAs, SPLAs, and
T&Cs. Customers may have custom T&Cs that require review and differing strategies.
Microsoft licensing details
(Diagram) GCP premium images (core/minute charge); GCP BYOL Sole Tenant offering:
no Live Migration (in-place restart); Software Assurance / License Mobility: Live Migration.
The extra cost for Live Migration is associated with maintaining a dedicated pool
within which live migration can happen.
Windows OSes supported by BYOL with Sole Tenant node
- Windows Server 2008 R2 SP1
- Windows Server 2012
- Windows Server 2012 R2
- Windows Server 2016
- Windows 7 SP1 Enterprise x64
- Windows 10 Enterprise x64
At time of writing, BYOL with Sole Tenant nodes is limited to:
- us-central1
- us-west1
- us-east1
- europe-west1
GCP sole-tenant nodes
● A sole-tenant node is a physical server that hosts VMs only for your project
● Node template: region, server config, restart config, affinity tags
https://cloud.google.com/compute/docs/nodes/
https://cloud.google.com/compute/docs/nodes/create-nodes
https://cloud.google.com/compute/pricing#nodes
To see if a given region/zone support sole tenant nodes, visit the GCE Regions and
Zones page.
At time of writing, there is only one available node type: n1-node-96-624. A project
must have sufficient CPU quota in a given region to allow node creation.
Affinity is controlled by assigning affinity labels to node groups and nodes. Some
labels are assigned automatically:
- compute.googleapis.com/node-group-name = [node group name]
- compute.googleapis.com/node-name = [node name]
Additional labels can be applied when defining a template; e.g. env = production; app
= frontend; license = byol
When creating a new VM, you can specify node affinity settings; e.g.:
- license:IN:byol
- env:NOT:production
When creating a VM in a node group with in-place restart configured, you must set
the "On host maintenance" policy to Terminate.
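A hedged Terraform sketch of the pieces described above (the names, zone, and image are placeholders, and some argument names, such as the node group size, vary between provider versions):

```hcl
# Node template carrying a BYOL affinity label and in-place restart.
resource "google_compute_node_template" "byol" {
  name      = "byol-node-template"           # placeholder
  region    = "us-central1"
  node_type = "n1-node-96-624"

  node_affinity_labels = {
    license = "byol"
  }

  # Keep restarts on as few physical servers as possible (in-place restart).
  server_binding {
    type = "RESTART_NODE_ON_MINIMAL_SERVERS"
  }
}

# A group of one sole-tenant node built from the template.
resource "google_compute_node_group" "byol_group" {
  name          = "byol-node-group"
  zone          = "us-central1-a"
  node_template = google_compute_node_template.byol.id
  initial_size  = 1
}

# Windows VM pinned to BYOL nodes; in-place restart requires
# on_host_maintenance = "TERMINATE".
resource "google_compute_instance" "win_byol" {
  name         = "win-byol-01"                # placeholder
  machine_type = "n1-standard-8"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "my-imported-windows-image"     # placeholder BYOL image
    }
  }

  network_interface {
    subnetwork = "wave1-subnet"               # placeholder
  }

  scheduling {
    on_host_maintenance = "TERMINATE"
    node_affinities {
      key      = "license"
      operator = "IN"
      values   = ["byol"]
    }
  }
}
```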
Windows Server Failover Clustering
On failure, the cluster’s floating IP moves to the standby node, so the floating IP
considerations above apply.
Data considerations
Windows
We need a way of breaking up our migration task (which could be huge) into a series
of manageable tasks.
The Agile-inspired VM migration process
The migration sprint cycles through:
● Prepare: build target environments and select migration candidates from the backlog
● Migrate: execute the move of apps and services to GCP
● Test/Verify: conduct UAT and regression testing
● Optimize: decouple stateful and stateless components, scale horizontally, rightsize, and use preemptible VMs (PVMs)
● Improve: learn lessons and improve the migration process
● This is their stuff, which they built and are maintaining
○ Migration shouldn’t be a surprise
● Get their buy-in
○ The data center isn’t gone, it’s just moving
● Work with them to update SOPs
● Training! Training! Training!
○ Changes to DR
○ Changes to logging, monitoring, debugging, etc.
● What’s happening to the developer pipeline?
As with any project, getting the right people involved, with the right skills and
motivation, is the challenge.
Sprint planning meeting
● Discuss an initial sprint length and high-level plan
from discovery
○ Tweak the plan each iteration to incorporate
lessons learned
○ Two weeks works well
● Finalize migration team(s)
● Run through the rough wave plan and time
estimations
○ This is your product backlog
● Confirm the workload for this sprint
○ Starting with first mover
How are you going to allocate migrations to particular sprints? Who needs to be
involved? Your first mover app is going to be part of the first sprint.
Week 1: discovery and foundation scripting
● Most of week 1 is getting to know the application(s)
being migrated
○ Dig into the detail from automated discovery and the
questionnaires
● Reach out to application owners for clarifications
● What foundational changes are needed for this
workload?
● Foundation in place, script (Terraform) the deltas
○ Excellent support in Terraform
We often need to change our foundation, e.g. refine networking, firewall rules, IAM, etc., to
support the requirements of a particular migration.
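One way to keep those per-wave deltas DRY is a reusable wave module; this is purely illustrative, and the module path and variables are hypothetical:

```hcl
# Reusable wave module (hypothetical local module) capturing the
# per-wave deltas: subnet, firewall openings, service accounts, etc.
module "wave2" {
  source = "./modules/migration-wave"       # hypothetical module path

  wave_name     = "wave2"
  subnet_cidr   = "10.20.0.0/20"            # placeholder range
  allowed_ports = ["443", "1433"]           # placeholder app + SQL Server ports
  network       = "migration-vpc"           # placeholder shared network
}
```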
Week 2: complete foundation, migrate, test
● Beginning week 2, work to finalize foundational
scripts
● Early-mid week 2 should be the migration
○ Tomorrow’s class!
● Smoke test it to make sure it’s working
● End the week by analyzing:
○ What worked?
○ How could the process be improved?
○ Any sorting of backlog needed?
Every sprint provides lessons that can be applied to subsequent sprints. You may
also rethink your migration priorities based on the outcomes of the early sprints.
Lab 8
Plan for data and other concerns, then generate a
backlog for our migration factory
Assess/Discover your application landscape -> Plan/Foundation: create a landing zone ->
Migrate! Pick a path to the cloud and get started -> Optimize your operations and save on costs
Time to move on
Deliverables
● GCP landing zone
○ Built with Terraform IaC
● Our Org structure
● A migration factory
○ Though it will need tweaking every sprint
Training
● Throughout everything you do, help advise on
training
● How will the client support the org structure you’ve
built?
● Do they understand the new architecture?
● Has the security team been brought up to speed on the
new structure?
● You’re going to leave
○ And the client will have to support all of this