VMMIG Module05 Optimize Phase
Optimize costs
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
The Optimize phase is where it gets cloudy
Having moved your workloads, you can now update to fully exploit the cloud.
https://cloud.google.com/solutions/image-management-best-practices
Start with a base OS installation, or if building images for GCP, start with a public boot
image
Periodically, take the base image and harden it by removing services, changing
settings, installing security components, etc. Build the hardened image every 90 days,
or whatever frequency makes sense for the organization. This becomes the basis of
subsequent builds.
More frequently, build platform-specific images. One image for web servers, one for
application servers, one for databases, etc. Build this image maybe every 30 days.
As frequently as you build an app, create new VM images for the new versions of the
application. You might create new application images on a daily basis.
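As a hedged illustration of this cadence, image families let each layer build on the
latest image of the layer below it (all names here are hypothetical):

# Capture a newly hardened base build; the family alias always points at the
# most recent image in the family.
gcloud compute images create hardened-base-v20190401 \
    --source-disk=hardening-build-disk --source-disk-zone=us-central1-a \
    --family=hardened-base

# Platform and application builds then start from the family, not a pinned version.
gcloud compute images describe-from-family hardened-base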
Goals for image management
[Slide diagram: bespoke servers (Bespoke 2, 3, 4) on Compute Engine, transitioning to
servers (Server 2, 3, 4) built from versioned boot images (v1.0, v1.1)]
After migration, you have servers with independent configurations. They may, or may
not, be managed with a configuration management solution. However, each is
managed as a unique asset.
By updating the servers to all use a consistent base image, you ensure uniform
configuration across multiple instances. You also make it possible to combine like
servers into managed instance groups. This provides benefits such as:
- Health checks
- Ability to resize the cluster easily
- Autoscaling (for workloads that will scale horizontally)
- A cloud-native approach to VM updates, that is, the use of immutable images.
Combined with the rolling update feature of MIGs, this makes rolling out new
versions easy (see the sketch below).
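As a hedged sketch, a rolling update to a template built from a new image might look
like this (template and group names are hypothetical):

gcloud compute instance-groups managed rolling-action start-update web-mig \
    --version=template=web-template-v1-1 \
    --max-surge=3 --max-unavailable=0 \
    --region=us-central1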
Organizational maturity
[Slide diagram: maturity spectrum, from hand-crafted servers to automated pipelines
producing purpose-built images such as a Web image and a DB image]
Robust systems will be run much like a standard DevOps pipeline. Commits to a code
base will trigger build jobs, which will create/test/deploy images. The image building
tool can leverage configuration management systems to automate the configuration of
the image.
Many customers will have some version of the second option, with a set of images
that may be built manually or with partial automation. They don't get built as often,
and certainly not daily.
Some customers will have hand-crafted servers, and have no existing process in
place for creating/baking images.
GCP Images
https://cloud.google.com/solutions/image-management-best-practices
There are three main approaches to creating GCP boot images that can be used for
managed instance groups.
Best: Build an image from the ground up, starting with a public image. Develop a
clean CI/CD pipeline for generating these images, using tools like Packer and
Chef/Puppet/Ansible.
Good: Use existing image-generation pipelines and produce output images for GCP.
Tools like Packer and Vagrant that are being used to produce VMware images can
also output these images for use with GCP.
Not-so-good (some would say bad): Take the migrated VM's disk and create an image.
Then manually prune and tailor the image.
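For that last approach, the starting point is an image created directly from the
migrated VM's disk, for example (disk and zone names are hypothetical):

gcloud compute images create app-server-migrated \
    --source-disk=app-server-disk --source-disk-zone=us-central1-a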
How much configuration is baked in?
https://cloud.google.com/solutions/image-management-best-practices
There are many variables that go into deciding how much configuration you bake into
an image.
https://cloud.google.com/community/tutorials/create-cloud-build-image-factory-using-packer
Packer can't SSH in successfully on instances where OS Login is enabled. Make sure
the metadata enabling this feature is not set on the project where you are demoing.
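At its core, the image factory boils down to running Packer against GCP; a minimal
sketch, assuming a template file packer.json with a googlecompute builder (file and
variable names are hypothetical):

packer build \
    -var "project_id=${PROJECT_ID}" \
    -var "zone=us-central1-a" \
    packer.json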
Optimizing for configuration management
[Slide diagram: an on-premises CM server manages on-prem VMs, while a CM server on
Compute Engine manages migrated VMs; routes, firewall rules, and bandwidth constrain
managing cloud VMs from on-premises, and playbooks disable network, authorization,
and firewall management for cloud instances]
https://cloud.google.com/solutions/configuration-management/
As noted in the module on the Plan phase, companies should really have
configuration management for their on-prem assets in place prior to migrating VMs
into the cloud.
When extending infrastructure into the cloud, one common approach is to place CM
servers in the cloud as well. You configure the on-prem servers to manage the
on-prem inventory, and the cloud servers to manage the cloud inventory. You then
have either separate playbooks for the different environments, or adaptable playbooks
that use environment-specific variables or context to perform slightly different
configuration depending on whether the VM is in the cloud or on-prem.
For VMs migrated into GCP, you'll want to remove the normal CM commands that
configure network, firewall, and authorization settings as they will be managed
differently in GCP.
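With Ansible, for example, one hedged approach is to tag those tasks and skip them on
cloud inventories (inventory and tag names are hypothetical):

# On-prem run: all tasks apply.
ansible-playbook -i inventory/onprem site.yml

# GCP run: skip tasks that GCP handles with VPC routes, firewall rules, and IAM.
ansible-playbook -i inventory/gcp site.yml --skip-tags "network,firewall,auth"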
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Optimizing for scaling and release management
https://cloud.google.com/compute/docs/instance-groups/rolling-out-updates-to-managed-instance-groups
https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-organizations#high-availability
https://cloud.google.com/docs/geography-and-regions
Regional instance groups distribute instances created from your template across
zones. If a zone goes down, the instances in other zones remain available, but the
regional MIG will not automatically create replacement instances in the remaining
zones (unless autoscaling is enabled). An alternative is to use multiple zonal
managed instance groups.
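A hedged sketch of creating a regional MIG with autoscaling enabled (names are
hypothetical):

gcloud compute instance-groups managed create web-mig \
    --region=us-central1 --template=web-template-v1-1 --size=3

gcloud compute instance-groups managed set-autoscaling web-mig \
    --region=us-central1 --min-num-replicas=3 --max-num-replicas=9 \
    --target-cpu-utilization=0.6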
Not mentioned on slide, but also important, is ensuring you have a high-availability
interconnect between GCP and your on-premises networks. This should have been
handled during the Plan phase.
Optimizing for disaster recovery
● The lower the tolerance for loss, the higher the cost and complexity
● Options include…
○ Cold: rebuild app in another region
○ Warm: scaled-down, idle app in another region
○ Hot: app runs across regions
https://cloud.google.com/solutions/dr-scenarios-planning-guide
DR: Cold pattern
[Slide diagram: Region 1 runs the app on a Compute Engine MIG with a Cloud SQL app DB;
Deployment Manager templates and database backups in a multi-regional (MR) Cloud
Storage bucket allow the stack to be recreated in Region 2]
https://cloud.google.com/solutions/dr-scenarios-planning-guide
The original environment is deployed using Infrastructure as Code (IaC). The app is
implemented using a managed instance group and instance templates. The database
is backed up periodically to a multiregional bucket.
If a region fails, the application can be redeployed fairly quickly into a new region,
the database can be restored from the latest backup, and the load balancer can be
reconfigured with a new backend service.
RTO is typically bounded by the time required to restore the DB. RPO is bounded by
how frequently you perform database backups.
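A hedged sketch of the failover steps, assuming IaC templates and a recent dump in
the multi-regional bucket (all names are hypothetical):

# Recreate the stack in the recovery region from templates.
gcloud deployment-manager deployments create app-dr --config=app-dr.yaml

# Restore the database from the most recent backup in the MR bucket.
gcloud sql import sql app-db-dr gs://app-mr-bucket/backups/latest.sql \
    --database=appdb

# Point the load balancer at the new regional backend.
gcloud compute backend-services add-backend app-backend-service --global \
    --instance-group=app-mig-dr --instance-group-region=us-east1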
DR: Warm pattern
[Slide diagram: Region 1 runs the app on Compute Engine with a Cloud SQL app DB;
Region 2 runs a smaller, idle app MIG and a Cloud SQL replica receiving replication
traffic from the primary]
https://cloud.google.com/solutions/dr-scenarios-planning-guide
App deployments are made into multiple regions, but failover regions have smaller
application MIGs which don't serve traffic. A DB replica is created in the failover
region; it receives replication traffic from the DB master, keeping it nearly
up-to-date.
In the case of failure, update the load balancer to include the region 2 instance group
as a backend, increase the size of the instance group, and point the app to the replica
(this could be done via DNS changes, or by placing a load balancer in front of the DB
and changing the load balancing configuration).
This design reduces the RTO and RPO significantly. However, it does introduce
cross-regional replication costs.
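In gcloud terms, the failover might look roughly like this (names are hypothetical;
promoting the replica turns the standby into a writable primary):

# Promote the cross-region read replica to a standalone primary.
gcloud sql instances promote-replica app-db-replica

# Scale the standby MIG up to serving capacity.
gcloud compute instance-groups managed resize app-mig-dr \
    --region=us-east1 --size=6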
DR: Hot pattern
[Slide diagram: app MIGs in Region 1 and Region 2 on Compute Engine, served by a
global load balancer and backed by a single Cloud Spanner database]
https://cloud.google.com/solutions/dr-scenarios-planning-guide
App deployment occurs in multiple regions. The load balancer does geo-aware
routing of requests to the nearest region. The backing database service, Spanner,
handles replication across regions.
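A hedged sketch of provisioning the database layer (instance name is hypothetical;
nam3 is one of Spanner's multi-region configurations):

gcloud spanner instances create app-db \
    --config=nam3 --nodes=3 --description="Multi-region app database"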
https://www.youtube.com/watch?v=1ibeCQjjpBw&autoplay=1
https://forsetisecurity.org/about/
https://cloud.google.com/vpc/docs/firewalls#service-accounts-vs-tags
https://cloud.google.com/blog/products/gcp/simplify-cloud-vpc-firewall-management-with-service-accounts
Optimizing load balancing
[Slide diagram: a global load balancer routing to app backends on Compute Engine and
to on-premises backends]
● Proxy-based load balancers offer cross-regional routing
● Hybrid load balancing can be DNS-based
https://cloud.google.com/load-balancing/
Optimizing security at the edge
https://cloud.google.com/files/GCPDDoSprotection-04122016.pdf
https://cloud.google.com/armor/
Optimizing secret management
[Slide diagram: an app on a VM using its assigned service account credentials to call
Google APIs]
https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances
https://cloud.google.com/kms/
https://cloud.google.com/hsm/
https://cloud.google.com/kms/docs/encrypt-decrypt
Apps running on an instance can use the VM-assigned service account by using the
Cloud Client Libraries; credentials are passed to the application via metadata.
Apps can leverage Cloud KMS to decrypt secrets that are stored either in app
configuration or in GCS. The application operates within the context of a service
account. That account has permission to use a given key, and that key is used by
Cloud KMS to encrypt/decrypt secrets. The diagram shows a secret stored in GCS; the
app reads the encrypted file and asks Cloud KMS to decrypt it.
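A hedged sketch of that GCS-plus-KMS flow (bucket, keyring, and key names are
hypothetical):

# Fetch the encrypted secret; the VM's service account needs read access.
gsutil cp gs://app-config/db-password.enc /tmp/db-password.enc

# Ask Cloud KMS to decrypt it with a key the service account may use.
gcloud kms decrypt --location=global --keyring=app-secrets --key=app-key \
    --ciphertext-file=/tmp/db-password.enc --plaintext-file=/tmp/db-password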
Optimizing IAM configurations
https://cloud.google.com/policy-intelligence/
Logging and monitoring for security
https://cloud.google.com/logging/docs/export/
https://cloud.google.com/solutions/exporting-stackdriver-logging-for-splunk
https://www.splunk.com/blog/2016/03/23/announcing-splunk-add-on-for-google-cloud-platform-gcp-at-gcpnext16.html
https://resources.netskope.com/cloud-security-collateral-2/netskope-for-google-cloud-platform
https://help.sumologic.com/03Send-Data/Sources/02Sources-for-Hosted-Collectors/Google-Cloud-Platform-Source
https://cloud.google.com/logging/docs/audit/
https://cloud.google.com/vpc/docs/using-flow-logs
https://cloud.google.com/vpc/docs/firewall-rules-logging
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Optimizing with managed services
Cloud SQL for MySQL: consider cost and the fact that instances are not on your VPC;
see the docs on differences from stock MySQL and on known issues.
https://cloud.google.com/sql/docs/mysql/features#differences
https://cloud.google.com/sql/faq
Cloud SQL costs roughly 2x the cost of unmanaged MySQL running on a VM. Cloud
SQL VMs are not on a VPC in the project; they are accessed via peering or public IP.
https://cloud.google.com/pubsub/architecture
https://cloud.google.com/pubsub/docs/faq
https://cloud.google.com/pubsub/docs/ordering
https://cloud.google.com/pubsub/pricing
https://cloud.google.com/dataproc/pricing
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
https://cloud.google.com/dataproc/docs/resources/faq
Dataproc's $0.01/vCPU/hour charge adds up on very large, long-lived clusters.
In general, the online documentation does a good job of detailing key issues. Review
the concepts section, the known issues section, and the pricing. Also, Googling <gcp
product> vs. <other product> often yields good initial results.
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Make sure you tailor instance sizes to real needs
https://cloud.google.com/compute/docs/instances/apply-sizing-recommendations-for-instances
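Recommendations can be applied in the console, or listed via the Recommender CLI in
newer gcloud releases; a hedged sketch (zone is hypothetical, and the command may
require a recent SDK):

gcloud recommender recommendations list \
    --project=${PROJECT_ID} --location=us-central1-a \
    --recommender=google.compute.instance.MachineTypeRecommender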
There's more to TCO than VM costs
● Persistent Disk costs
● Network egress costs
● Intra-VPC traffic costs
● Load balancer costs
https://cloud.google.com/billing/docs/how-to/export-data-bigquery
BigQuery is hugely useful for analyzing billing data. It can be used to find large,
and potentially unexpected, sources of cost - which can then be optimized.
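For example, a hedged sketch of a query over the billing export (dataset and table
names are hypothetical):

bq query --use_legacy_sql=false '
SELECT service.description AS service,
       ROUND(SUM(cost), 2) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
GROUP BY service
ORDER BY total_cost DESC
LIMIT 10'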
Watch network costs
Remember that traffic transferred within a VPC but across zones or regions incurs
costs (in addition to the more obvious egress out of the VPC).
https://cloud.google.com/vpc/docs/vpc-peering
https://cloud.google.com/vpc/docs/shared-vpc
https://cloud.google.com/network-tiers/
Working with budgets
https://cloud.google.com/billing/docs/how-to/budgets
https://cloud.google.com/billing/docs/how-to/notify
https://cloud.google.com/bigquery/docs/custom-quotas
https://cloud.google.com/appengine/pricing#spending_limit
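As a sketch, budgets can also be scripted (the billing account ID is hypothetical, and
the budgets command has moved between beta and GA tracks, so flags may differ by SDK
version):

gcloud billing budgets create \
    --billing-account=000000-AAAAAA-BBBBBB \
    --display-name="Migration project budget" \
    --budget-amount=1000USD \
    --threshold-rule=percent=0.5 --threshold-rule=percent=0.9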
Lab 13
Defining an optimization strategy
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Moving VMs into containers
Why Kubernetes/GKE?
Secure kernel
Density
Resiliency
Modernization
Google offers automatic updates, which keeps the kernel on the machines running
your apps secure.
You can run more apps on a given host for better resource utilization.
Istio, for example, makes service discovery, traffic splitting, authorization,
circuit-breaker patterns, and other features easy to implement without having to
rewrite apps.
Teams can get experience using GKE and K8s without having to totally re-engineer
their apps.
Migrate for Anthos
https://cloud.google.com/migrate/anthos/
Migrate for Anthos components
● Firewall rules
○ Two additional firewall rules are required to allow migrated workloads to
speak to the Migrate Manager and Cloud Extension
Creating the GKE Cluster
https://cloud.google.com/migrate/anthos/docs/configuring-a-cluster
Deploying Migrate for Anthos
https://cloud.google.com/migrate/anthos/docs/creating-migrate-anthos-configuration
Migrate for Anthos architecture
[Slide diagram: a GKE cluster running the Migrate for Anthos controller component and
a per-node component, with a StorageClass backed by Edge nodes (Edge A, Edge B) that
stream from the migration backend]
To migrate an app
1. Deploy the PVC. This is the storage resource that will be used by the pod. The
blocks it accesses are provided via the driver, which is talking to the Edge
node, which is talking to the backend. The streaming mechanism from the
backend to the Cloud Extension is what you've seen before. The new part is
the conduit from the edge node, through the CSI driver, to the PVC, to the
pod.
2. Deploy the app, with the pod definition including a volume that uses the PVC.
This is an example PVC configuration. The administrator will need to populate the
values in the square brackets.
- The PVC name can be anything
- The VM_ID is the VMware-specific ID for a given VM. For details on how to
get the ID, see
https://cloud.google.com/migrate/anthos/docs/migrate-vmware-to-gke
- The storage class name is typically specified during Migrate for Anthos
installation
There are also some options that can be set differently than in the example.
- vm-data-access-mode can be streaming or fully cached
- run-mode can be normal or testclone
Application Configuration
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: [APPLICATION_NAME]
  namespace: default
spec:
  serviceName: [SERVICE_NAME]
  replicas: 1
  selector:
    matchLabels:
      app: [APPLICATION_NAME]
  template:
    metadata:
      labels:
        app: [APPLICATION_NAME]
      annotations:
        anthos-migrate.gcr.io/action: run
        anthos-migrate.gcr.io/source-type: streaming-disk
        # source-pvc needs to match the name of the PVC declared above.
        anthos-migrate.gcr.io/source-pvc: [PVC_NAME]
    spec:
      containers:
      - name: [APPLICATION_NAME]
        # The image for the Migrate for Anthos system container.
        image: anthos-migrate.gcr.io/v2k-run:v1.0.1
The source-type annotation dictates whether the CSI driver and streaming is used, or
whether it uses an exported PVC (more on this to come).
How does it work?
https://cloud.google.com/migrate/anthos/docs/architecture
https://cloud.google.com/migrate/anthos/docs/export-storage
Export configuration (part 1)
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: [STORAGE_CLASS_NAME]
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: none
Export configuration (part 2)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # Replace this with the name of your application
  name: [TARGET_PVC_NAME]
spec:
  storageClassName: [STORAGE_CLASS_NAME]
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      # Replace this with the quantity you'll need in the target volume, such as
      # 20G. You can use the included script to make this calculation (see the
      # section earlier in this topic).
      storage: [TARGET_STORAGE]
Export configuration (part 3)
apiVersion: v1
kind: ConfigMap
metadata:
  name: [CONFIGMAP_NAME]
data:
  config: |-
    appSpec:
      dataFilter:
      - "- *.swp"
      - "- /etc/fstab"
      - "- /boot/"
      - "- /tmp/*"
Export configuration (part 4)
apiVersion: batch/v1
kind: Job
metadata:
  name: [JOB_NAME]
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
        anthos-migrate.gcr.io/action: export
        anthos-migrate.gcr.io/source-type: streaming-disk
        anthos-migrate.gcr.io/source-pvc: [SOURCE_PVC_NAME]
        anthos-migrate.gcr.io/target-pvc: [TARGET_PVC_NAME]
        anthos-migrate.gcr.io/config: [CONFIGMAP_NAME]
    spec:
      restartPolicy: OnFailure
      containers:
      - name: exporter-sample
        image: anthos-migrate.gcr.io/v2k-export:v1.0.1
Modified app config using exported storage
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: [STATEFULSET_NAME]
spec:
  serviceName: "[SERVICE_NAME]"
  replicas: 1
  selector:
    matchLabels:
      app: [STATEFULSET_NAME]
  template:
    metadata:
      labels:
        app: [STATEFULSET_NAME]
      annotations:
        anthos-migrate.gcr.io/action: run
        anthos-migrate.gcr.io/source-type: exported
        anthos-migrate.gcr.io/source-pvc: [TARGET_PVC_NAME]
    spec:
      containers:
      - name: [STATEFULSET_NAME]
        image: anthos-migrate.gcr.io/v2k-run:v1.0.1
But wait, there's more (coming soon)
[Slide diagram: a VM stack (BIOS, OS kernel, init/systemd, services such as sshd,
apache, and crond) mapped onto a pod sandbox: an initContainer performs image
adaptation, a storage aggregator streams the VM's data, and the same services run
under init/systemd inside the container]
Basic concepts
● User-space part of the VM (its services) runs within a container
● Networking (NIC and DNS) provided by GKE
● Storage streaming happens in the background, abstracted by k8s PV
init/systemd vs. typical app
● Assumes it runs as PID 1
○ Also deals with process reaping (app containers typically do not do
that)
● Does not react the same way to SIGTERM or SIGBREAK
○ sysv expects SIGPWR
○ systemd typically expects SIGRTMIN+3
○ SIGBREAK will cause a reboot (privileged)
○ SIGKILL is usually blocked
● Typically runs multiple sub-processes under different user contexts
● Does not produce console output in the same way
○ Typically works with terminal devices
● Most importantly, does a lot of setup (devices, cgroups, mounts, networking,
…), so Migrate for Anthos must:
○ Disable some network configuration
○ Fix signals
○ Take care of devices
○ Disable some services (iptables, firewall, etc.)