Preparing for Your Professional Cloud Architect Journey
Module 4: Analyzing and Optimizing Technical and Business Processes
Week 5 agenda

● QUIZ
● Optimizing Cymbal Direct’s technical and business processes and procedures
● Data services (Filestore, Firestore & Firebase, Memorystore, Spanner, BigQuery, Bigtable)
● Mountkirk Games case study analysis
● Diagnostic Questions for exam guide Section 4: Analyzing and optimizing technical and business processes
QUIZ time!
Optimizing Cymbal Direct’s technical and business processes and procedures
Your role in analyzing and optimizing business and technical processes
● Analyzing and defining technical processes
● Analyzing and defining business processes
● Developing procedures to ensure reliability of solutions in production
Business Requirements

● Cymbal Direct’s management wants to make sure that they can easily scale to handle additional demand when needed, so they can feel comfortable with expanding to more test markets.
● Streamline development for application modernization and new features/products.
● Ensure that developers spend as much time on core business functionality as possible, and do not have to worry about scalability wherever possible.
● Allow partners to order directly via an API.
● Get a production version of the social media highlighting service up and running, and ensure no inappropriate content.
Technical Requirements

● Move to managed services wherever possible.
● Ensure that developers can deploy container-based workloads to testing and production environments in a highly scalable environment.
● Standardize on containers where possible, but also allow for existing virtualization infrastructure to run as-is without a rewrite, so it can be slowly refactored over time.
● Securely allow partner integration.
● Allow for streaming of IoT data from drones.
Process optimization
The current build process at Cymbal Direct is:
● Package monolithic application with its dependencies
● Check it in and notify the QA team they need to test it
● Stress test the application to ensure it performs well
● Build a VM image for deployment

Target pipeline stages:
● Plan: Stakeholder
● Code: Check in
● Build: Docker image
● Test: Unit, Integration (on failure, loop back)
● Release: Tag, Artifact available
● Deploy: VM, Kubernetes, Cloud Run; red/black, canary (on failure, loop back)
● Operate + Monitor: Scale, Ensure availability

The earlier stages (code, build, test) form the Continuous Integration part of the pipeline; the later stages (release, deploy, operate) form the Continuous Deployment part.


Process optimization

Requirements not met:


● Development is streamlined
● Developers focus on core business functionality
● Move to managed services wherever possible
● Deploy container-based workloads
Example of end-to-end CI/CD pipeline

[Diagram] Code is built and pushed to Artifact Registry, where Vulnerability Scanning inspects each image. Binary Authorization, backed by an attestor (attestor-demo) and a signing key in Key Management Service (attest-key), admits only trusted, signed images (hello-world-signed) into the gke-demo cluster; untrusted or unsigned images (hello-world-not-signed) are rejected and recorded in the Audit Log.

Container Registry vs Artifact Registry

● Container Registry is currently in maintenance mode. Although it is still available and supported as a Google Enterprise API, it won’t see any new features.
● Artifact Registry is the successor and the recommended solution: prefer it for new projects.
● Artifact Registry covers all use cases of Container Registry and can also be used for additional package formats like Maven, npm, Python, etc.
● See more info on how to migrate here: https://cloud.google.com/artifact-registry/docs/transition/transition-from-gcr
Process optimization
New Process
● New features are implemented as
microservices in Docker containers
● Code check-in triggers CI/CD
pipeline w/ automatic test & release
● Code is deployed to Cloud Run
Developing Cymbal Direct's procedures to ensure solution reliability

Chaos Engineering
● Creates a culture of reliability
● Crashes systems intentionally to build resiliency
● Service Mesh can help you here!

Penetration testing
● Mimics the behavior of hackers to attack your own environment

“If you plan to evaluate the security of your Cloud Platform infrastructure
with penetration testing, you are not required to contact us.”
Filestore

Filestore
Managed NFS, NOT a database

Filestore Basic (GA):
● Workloads: file sharing, software dev, and web hosting
● Capacity: 1 to 64 TiB; scale-up; capacity management: grow only
● Max performance: 1.2 GiB/s throughput | 60k IOPS
● Data protection: backups; availability SLA: 99.9%

Filestore High Scale (Public Preview):
● Workloads: HPC, financial modeling, pharma, and analytics
● Capacity: 10 to 100 TiB; scale-out; capacity management: grow & shrink
● Max performance: 26 GiB/s throughput | 920k IOPS
● Data protection: none; availability SLA: 99.9%

Filestore Enterprise (GA):
● Workloads: SAP, GKE, and “lift & shift” apps
● Capacity: 1 to 10 TiB; scale-out; capacity management: grow & shrink
● Max performance: 1.2 GiB/s throughput | 120k IOPS
● Data protection: snapshots; availability SLA: 99.99%


Firestore
Firestore: When to use?

Firestore is ideal for applications that rely on highly available structured data at scale.

Ideal Use Cases:

● Product catalogs that provide real-time inventory and product details for a retailer.
● User profiles that deliver a customized experience based on the user’s past activities and preferences.
● Transactions based on ACID properties

Non-Ideal Use Cases:

● OLTP relational database with full SQL support. Consider: Cloud SQL
● Data isn’t highly structured or no need for ACID transactions. Consider: Cloud Bigtable
● Interactive querying in an online analytical processing (OLAP) system. Consider: BigQuery
● Unstructured data such as images or movies. Consider: Cloud Storage
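To make the “product catalog” use case concrete, here is a minimal sketch with the Firestore Python client; the project, collection, and field names are hypothetical:

```python
from google.cloud import firestore

# Firestore in Native mode, accessed through the official Python client.
db = firestore.Client(project="my-project")

# Upsert a product document into a "products" collection.
db.collection("products").document("sku-123").set(
    {"name": "Drone propeller", "stock": 42, "price": 9.99}
)

# Query in-stock products; Firestore serves this with strong consistency at scale.
for doc in db.collection("products").where("stock", ">", 0).stream():
    print(doc.id, doc.to_dict())
```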
Firestore: Datastore mode vs Firestore (Native) mode

● Data model: strong consistency (both); documents and collections (Native mode only); entities, kinds, ancestor queries/results (Datastore mode only)
● Performance limits: no read limits (both); 10K writes/sec and 500 documents per transaction (Native mode only)
● API: Firestore (documents) in Native mode; Datastore (entities) in Datastore mode
● Security: IAM (both); Firebase Rules (Native mode only)
● Offline data persistence: Native mode only
● Real-time updates: Native mode only

Firestore or Datastore - comparison


Firestore vs Filestore

… vs Firebase
Exam Tip: Firestore is a NoSQL Database, but Firebase
is a development platform with a ton of additional
features that uses Firestore. Make sure to differentiate
between them!
Firebase
*** Platform, NOT a database ***

Firebase is Google’s complete app development platform.
Complete = it provides different products to:
● Build apps
● Test apps
● Implement authentication (Firebase Authentication can be a part of the PCA exam on a very high level!)
● Run apps
● Run analytics
● Personalize apps
● And more…

[Diagram] Develop / Run / Engage product areas: backend compute, data + authentication, testing, release management, crash reporting, analytics, messaging, experimentation, and personalization, with SDKs for iOS, Android, Web, C++, and Unity.

Exam Tip: Firestore is usually part of a Firebase-based app (for storing and syncing data)
Memorystore
Spanner
What workloads fit Cloud Spanner best?

01 Sharded RDBMS: Manually sharding is difficult. People do it to achieve scale. Cloud Spanner gives you relational data and scale.
02 Scalable relational data: Instead of moving to NoSQL, move from one relational database to a more scalable relational database.
03 Manageability/HA: Highly automated. Online schema changes and patching. No planned downtime, and comes with up to a 99.999% availability SLA.
04 Multi-region: Write once and automatically replicate your data to multiple regions. Most customers use regional instances, but multi-region is there if you need it.
When Cloud Spanner fits less well

TIP
It’s NOT a straightforward thing to migrate a different RDBMS to Cloud Spanner. Be familiar with the challenges at a high level.

1. Lift and shift
2. Lots of in-database business logic (triggers, stored procedures)
3. Compatibility needed
4. App is very sensitive to very low latency (micro/nano/low single-digit ms)
Also: lots of analytics / OLAP-type queries / workloads
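For reference, reading from Spanner with the Python client looks like the following minimal sketch; it assumes an existing instance, database, and leaderboard table, and all names are hypothetical:

```python
from google.cloud import spanner

# Connect to an existing Spanner instance and database.
client = spanner.Client(project="my-project")
database = client.instance("game-instance").database("leaderboard-db")

# Strongly consistent read using a read-only snapshot.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT PlayerId, Score FROM Leaderboard ORDER BY Score DESC LIMIT 10"
    )
    for player_id, score in rows:
        print(player_id, score)
```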
BigQuery
BigQuery hierarchy
Project -> Dataset -> Tables (-> Partitions)
● For each query, BigQuery executes a full-column scan.
● BigQuery performance and query costs are based on the
amount of data scanned.
● You can set the geographic location of a Dataset at creation
time only.
● All tables that are referenced in a query must be stored in
datasets in the same location.
● When you copy a table (bq cp), the datasets that contain the
source table and destination table must reside in the same
location.
○ You can copy a dataset (NOT with bq cp, but with BigQuery
Data Transfer Service) within a region or from one region to
another
● Dataset names are case-sensitive
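A short sketch of the “location is set at creation time” point, using the Python client; the project and dataset names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# The dataset location can only be chosen here, at creation time.
dataset = bigquery.Dataset("my-project.analytics_eu")
dataset.location = "EU"
client.create_dataset(dataset, exists_ok=True)

# Every table referenced in a query must live in datasets in the same location.
job = client.query("SELECT COUNT(*) FROM `my-project.analytics_eu.events`")
print(list(job.result()))
```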
BigQuery: Controlling access to datasets

Exam Tip: It’s a common practice to have a dataset in one project and perform queries from another one (split billing!).

Common BigQuery predefined roles:
● Admin: full access to all datasets
● Data Editor: access to edit all contents of the datasets
● Data Owner: full access to datasets and all of their contents
● Data Viewer: access to view datasets and all of their contents
● Job User: access to run jobs
● Metadata Viewer: access to view table and dataset metadata
● User: access to run queries and create datasets
● Read Sessions User: access to create and use read sessions
BigQuery: Controlling access to datasets

You can grant access at the following BigQuery resource levels:


● organization or Google Cloud project level
● dataset level
● table or view level
a. Authorized Views
● You can also restrict access to data at a more granular level by using the following methods:
a. column-level access control
b. dynamic data masking (aka “some columns may be hidden, depending on privileges”)
i. Works together with column-level security.
ii. no need to modify existing queries by excluding the columns that the user cannot access
c. row-level security (aka “some rows may be hidden, depending on privileges”)
i. One table can have multiple row-level access policies. Row-level access policies can coexist on a
table with column-level security as well as dataset-level, table-level, and project-level access
controls.
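As an illustration of row-level security, the DDL below (run here through the Python client) creates a row access policy; the table, group, and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Only members of the US sales group will see rows where region = "US";
# other principals simply see no rows from this table.
client.query(
    """
    CREATE ROW ACCESS POLICY us_sales_only
    ON `my-project.sales.orders`
    GRANT TO ("group:us-sales@example.com")
    FILTER USING (region = "US")
    """
).result()
```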
BigQuery: Controlling access to datasets
Authorized Views
1. View: View is a virtual table defined by a SQL query. When you
create a view, you query it in the same way you query a table

2. Query: When a user queries the view, the query results


contain data only from the tables and fields specified in the
query that defines the view.

3. Authorized Views: An authorized view allows you to share


query results with particular users and groups without giving
them access to the underlying tables.

Exam Tip: Authorized Views were especially useful when there were no table/column-level permissions. However, they’re still an often-used way to selectively share access to datasets (and they pop up on the exam!).
MAKE SURE TO UNDERSTAND HOW TO CREATE AND SHARE SUCH A VIEW.
BigQuery: Controlling access to datasets
Authorized Views

[Diagram] A consumer account is granted the BigQuery Data Viewer role.
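A sketch of creating and sharing an authorized view with the Python client. It assumes a private source dataset and a separate dataset for shared views (all names hypothetical): consumers are granted access only on the view’s dataset, while the view itself is authorized on the source dataset.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Create the view in a dataset that analysts are allowed to read.
view = bigquery.Table("my-project.shared_views.daily_orders_v")
view.view_query = "SELECT order_id, order_total FROM `my-project.private_data.orders`"
view = client.create_table(view)

# 2. Authorize the view on the *source* dataset so it can read the underlying
#    table even though the analysts themselves cannot.
source = client.get_dataset("my-project.private_data")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])
```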
BigQuery - Data Transfer Service
Mostly useful for regular data transfers to BigQuery

● BigQuery Data Transfer Service


automates data movement from
various sources into BigQuery on a
scheduled, managed basis.
● You can initiate data backfills to
recover from any outages or gaps.
BigQuery - Batch vs Streaming inserts
Most common architectures

Exam Tip: There is additional cost for streaming (both inserts and reads) in BigQuery.
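To make the trade-off concrete, here is a minimal sketch of both ingestion paths with the Python client (table, bucket, and field names are hypothetical): streamed rows become queryable within seconds but incur streaming charges, while batch loads from Cloud Storage are free of loading charges but are not real-time.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.analytics.events"

# Streaming insert: rows are available for querying almost immediately.
errors = client.insert_rows_json(
    table_id, [{"userId": 52, "eventDate": "2024-01-03", "action": "login"}]
)
print("streaming errors:", errors)  # [] on success

# Batch load: bulk-load newline-delimited JSON files from Cloud Storage.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    ),
)
load_job.result()  # waits for the load job to complete
```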
BigQuery: Sharing Datasets with others
AllAuthenticatedUsers

The special setting allAuthenticatedUsers makes


a dataset public. Authenticated users must use
BigQuery within their own project and have
access to run BigQuery jobs so that they can
query the Public Dataset. The billing for the
query goes to their project, even though the
query is using public or shared data. In summary,
the cost of a query is always assigned to the
active project from where the query is executed.
BigQuery: Sharing Queries with others
Mostly for collaboration

● Query needs to be saved first, before it’s shared;


● Can share incomplete / invalid queries -> collaboration;
● Project-level saved queries are visible to principals with the required permissions;
● Public saved queries are visible to anyone with a link to the query;
BigQuery: Scheduling queries
Mostly useful for regular execution

● Scheduled queries use features of


BigQuery Data Transfer Service.
● If the destination table for your
results doesn't exist when you set up
the scheduled query, BigQuery
attempts to create the table for you.
● You can set up a scheduled query to
authenticate as a service account.
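A sketch of creating a scheduled query through the BigQuery Data Transfer Service Python client, following the pattern shown in the public documentation; the project, dataset, and query are hypothetical:

```python
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="reporting",
    display_name="Nightly orders rollup",
    data_source_id="scheduled_query",  # marks this transfer as a scheduled query
    params={
        "query": "SELECT CURRENT_DATE() AS run_date, COUNT(*) AS orders "
                 "FROM `my-project.sales.orders`",
        "destination_table_name_template": "orders_rollup_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)
```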
BigQuery: Query results caching

● Query results are cached to improve performance and reduce costs


for repeated queries
● Cache is per user
● Still subject to quota policies
● Cache results have a size limit of 128 MB compressed
● No charge for queries that use cached results
● Results are cached for approximately 24 hours
● Lifetime extended when a query returns a cached result
● Use of cached results can be turned off (useful for benchmarking)
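For benchmarking, the cache can be switched off per query; a minimal sketch with the Python client, run here against a public sample table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Disable the results cache so the query is always executed (and billed).
job_config = bigquery.QueryJobConfig(use_query_cache=False)
job = client.query(
    "SELECT corpus, COUNT(*) FROM `bigquery-public-data.samples.shakespeare` "
    "GROUP BY corpus",
    job_config=job_config,
)
job.result()
print("served from cache:", job.cache_hit)  # False when the cache was bypassed
```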
BigQuery: table/partition (automatic) data expiration
Can be set for dataset / table / partition

Best practice for data lifecycle management.


Expiration in BigQuery automatically implements
retention policy.

● Dataset expiration
○ = “default table expiration time” for a dataset

● Table expiration
○ If Dataset expiration is set, each table inherits this setting by default

● Partition expiration:
○ The setting applies to all partitions in the table, but is calculated
independently for each partition based on the partition time.
○ At any point after a table is created, you can update the table's
partition expiration
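A sketch of setting these expirations with the Python client; the dataset and table names are hypothetical, and the partition expiration only applies to a table that is already time-partitioned:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Dataset level: default expiration for newly created tables (here, 30 days).
dataset = client.get_dataset("my-project.staging")
dataset.default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])

# Partition level: each partition is deleted 90 days after its partition time.
table = client.get_table("my-project.staging.events")
table.time_partitioning.expiration_ms = 90 * 24 * 60 * 60 * 1000
client.update_table(table, ["time_partitioning"])
```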
BigQuery: Table Partitioning

Partitioning versus sharding:
● Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD. Partitioning is recommended over table sharding, because partitioned tables perform better.

You can partition BigQuery tables by:
● Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.
● Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
● Integer range: Tables are partitioned based on an integer column.

[Diagram] A table partitioned by an eventDate column (partitions for 2018-01-01 through 2018-01-05): a query such as SELECT * FROM ... WHERE eventDate BETWEEN “2018-01-03” AND “2018-01-04” only scans the matching partitions.
BigQuery: Table Clustering

[Diagram] Within each eventDate partition, rows are ordered by the clustering column (userId): a query such as SELECT c1, c3 FROM ... WHERE userId BETWEEN 52 AND 63 AND eventDate BETWEEN “2018-01-03” AND “2018-01-04” reads only the matching partitions and clustered blocks.
BigQuery - table partitioning vs clustering
Decision making

● Clustering gives you more granularity than partitioning alone allows


● Use clustering if your queries commonly use filters or aggregation against multiple particular
columns.
BigQuery: table partitioning AND clustering
Both partitioning and clustering can improve performance and reduce query cost

Exam Tip: You can combine partitioning with


clustering. Data is first partitioned and then data in
each partition is clustered by the clustering columns.
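A sketch of creating such a table with the Python client, matching the eventDate/userId example above; the project and dataset names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("userId", "INTEGER"),
    bigquery.SchemaField("eventDate", "DATE"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)
# Partition by the eventDate column (daily partitions by default)...
table.time_partitioning = bigquery.TimePartitioning(field="eventDate")
# ...then cluster rows within each partition by userId.
table.clustering_fields = ["userId"]

client.create_table(table)
```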
BigQuery: Storage Pricing

Storage pricing is the cost to store data that you load into BigQuery. You pay for active storage and
long-term storage.

● Active storage includes any table or table partition that has been modified in the last 90 days.
● Long-term storage includes any table or table partition that has not been modified for 90
consecutive days. The price of storage for that table automatically drops by approximately
50%. There is no difference in performance, durability, or availability between active and long-term
storage.
Bigtable

Bigtable is a common migration target for key-value, wide-column and time-series databases
● Petabyte-scale
● Fully managed NoSQL database service for use cases where low-latency random data access, scalability and reliability are critical
● Scales seamlessly
● Integrates with the Apache® ecosystem and supports the HBase™ API
What is Bigtable good for?

Use case examples:
● Time-series data, such as CPU and memory usage over time for multiple servers.
● Marketing data, such as purchase histories and customer preferences.
● Financial data, such as transaction histories, stock prices, and currency exchange rates.
● Internet of Things data, such as usage reports from energy meters and home appliances.
● Graph data, such as information about how users are connected to one another.

Applications that need:
● Very high throughput
● Scalability
● Non-structured key/value data where each value is no larger than 10 MB

Storage engine for:
● Batch MapReduce
● Stream processing/analytics
● ML applications

Exam Tip: types of apps where you’d consider using Bigtable: recommendation engines, personalizing user experience, Internet of Things, real-time analytics, fraud detection, migrating from HBase or Cassandra, Fintech, gaming, high-throughput data streaming for creating / improving ML models.
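A minimal sketch of the Bigtable access pattern with the Python client, assuming an existing instance and table (the names and the column family are hypothetical); note how the row key starts with the device id so that readings for one device sort together:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("iot-instance").table("sensor-readings")

# Write one cell: row key = device id + timestamp for read locality.
row = table.direct_row("device42#2024-01-03T12:00:00Z")
row.set_cell("metrics", "temperature", "21.5")
row.commit()

# Low-latency point read of the same row.
data = table.read_row("device42#2024-01-03T12:00:00Z")
print(data.cells["metrics"][b"temperature"][0].value)
```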
Bigtable for analytics… ?
Bigtable vs BigQuery

Cloud Bigtable:
● NoSQL wide-column database
● Low-latency per-entry access
● Heavy read/write events
● Optimized for: point read/write
● Typical target: user/entity level

BigQuery:
● Enterprise data warehouse for relational structured data
● Large-scale, ad-hoc SQL-based OLAP analysis; organizational insights
● Analyze data from a Cloud Bigtable database
● Optimized for: ad-hoc analysis and reporting
● Typical target: cohort/population level

Exam Tip: Bigtable might be optimal for “real-time analytics”, when you need to make decisions on events as they’re happening.
Bigtable: Hadoop migration and modernization

“Before”: Apache Hadoop/HBase data ecosystem (simplified Hadoop stack)
● Stream Processing: Kafka, Spark
● Database: HBase
● Scripting & Querying: HIVE, Impala, Pig, Mahout
● Distributed Processing: Spark, MapReduce
● Distributed Storage: HDFS

“After”: Google Cloud storage and databases
● Stream Processing: Kafka, Spark
● Database: Cloud Bigtable
● Compute: Dataproc, running Scripting & Querying (HIVE, Impala, Pig, Mahout) and Distributed Processing (Spark, MapReduce)
● Storage: Cloud Storage

Exam Tip: Main goal: decoupling of storage & compute. As a consequence, you can treat Dataproc clusters as job-specific / ephemeral.
What is Bigtable not good for?

Not good for:
● Not a relational database
● No SQL queries or joins
● No multi-row transactions

Considerations:
● You need full SQL support for OLTP → consider Spanner or Cloud SQL
● Interactive querying for OLAP → consider BigQuery
● Need to store immutable blobs larger than 10 MB (e.g. movies, images) → consider Cloud Storage
Comparing GCP
storage solutions

SQL vs NoSQL

SQL (aka ‘Relational’):
● “Traditional” table-based RDBMSes
● Strongly typed, fixed schemas
● Almost all ACID-compliant
● Considerable percentage of logic can be done in the database layer
● Default choice for most monoliths
● Performance capped at some point (vertical scaling only, plus sharding, offloading read-only, etc.)
● In GCP: Cloud SQL, Cloud Spanner
● Outside of GCP: MySQL, Oracle, PostgreSQL, Microsoft SQL Server

NoSQL (aka ‘Non-relational’):
● Key-value, wide column, document
● Dynamic schemas
● Mostly BASE
● Most of the logic needs to be offloaded to the application
● Suitable for some microservices
● Processing nodes often separate from storage nodes (if the network is fast enough)
● In GCP: Firestore, Bigtable
● Outside of GCP: MongoDB, Redis, Cassandra, HBase, CouchDB

OLTP vs OLAP

OLTP (OnLine Transactional Processing):
● For processing data in transaction-oriented apps
● Large amounts of transactions
● A mix of inserts, updates, deletes on individual records
● Tables are normalized
● ACID & (mostly) SQL
● Cloud SQL, Cloud Spanner

OLAP (OnLine Analytical Processing):
● Multi-dimensional, analytical queries used in BI, reporting, data mining, etc.
● Large volume of data
● Loading data from source + selects; optimized for high-throughput reads on a large number of records
● Tables are not normalized
● SQL (sometimes NoSQL)
● BigQuery

Exam Tip: Here you’ll find a GREAT decision tree for database choices on AWS, Microsoft Azure, Google Cloud Platform, and cloud-agnostic.
Cloud Storage

Overview:
● Fully managed, highly reliable
● Cost-efficient, scalable object/blob store
● Objects accessed via HTTP requests
● Object name is the only key

Ideal for:
● Images and videos
● Objects and blobs
● Unstructured data
● Static website hosting
Cloud Datastore

Overview:
● Fully managed NoSQL
● Scalable

Ideal for:
● Semi-structured application data
● Durable key-value data
● Hierarchical data
● Managing multiple indexes
● Transactions
Cloud Firestore

Overview:
● Fully managed, serverless, NoSQL
● Scalable
● Native mobile and web client libraries
● Real-time updates

Ideal for:
● Document-oriented data
● Large collections of small documents
● Native mobile and web clients
● Durable key-value data
● Hierarchical data
● Managing multiple indexes
● Transactions
Cloud Bigtable

Overview:
● High-performance wide-column NoSQL database service
● Sparsely populated table
● Can scale to billions of rows and thousands of columns
● Can store TB to PB of data

Ideal for:
● Operational applications
● Analytical applications
● Storing large amounts of single-keyed data
● MapReduce operations
Cloud SQL

Overview:
● Managed service
  ○ Replication
  ○ Failover
  ○ Backups
● MySQL, PostgreSQL, and SQL Server
● Relational database service
● Proxy allows for secure access to your Cloud SQL Second Generation instances without whitelisting

Ideal for:
● Web frameworks
● Structured data
● OLTP workloads
● Applications using MySQL/PGS
Cloud Spanner

Overview:
● Mission-critical relational database service
● Transactional consistency
● Global scale
● High availability
● Multi-region replication
● 99.999% SLA

Ideal for:
● Mission-critical applications
● High transactions
● Scale and consistency requirements
BigQuery

Overview:
● Low-cost enterprise data warehouse for analytics
● Fully managed
● Petabyte scale
● Fast response times
● Serverless

Ideal for:
● Online Analytical Processing (OLAP) workloads
● Big data exploration and processing
● Reporting via Business Intelligence (BI) tools
Comparing storage and database

● In memory: App Engine Memcache. Good for: web/mobile apps, gaming. Such as: game state, user sessions.
● Relational: Cloud SQL. Good for: web frameworks. Such as: CMS, eCommerce.
● Relational: Cloud Spanner. Good for: RDBMS+scale, HA, HTAP. Such as: user metadata, Ad/Fin/MarTech.
● Non-relational: Cloud Firestore. Good for: hierarchical, mobile, web. Such as: user profiles, game state.
● Non-relational: Cloud Bigtable. Good for: heavy read + write, events. Such as: AdTech, financial, IoT.
● Object: Cloud Storage. Good for: binary or object data. Such as: images, media serving, backups.
● Warehouse: BigQuery. Good for: enterprise data warehouse. Such as: analytics, dashboards.

TIP
Try to read from the bottom up (what’s the most appropriate storage for analytics workloads? What’s good for a global, horizontally scalable RDBMS?)
Comparing storage options: Technical details

● Firestore: type: NoSQL document; transactions: yes; complex queries: yes; capacity: terabytes+; unit size: 1 MB/entity
● Bigtable: type: NoSQL wide column; transactions: single-row; complex queries: no; capacity: petabytes+; unit size: ~10 MB/cell, ~100 MB/row
● Cloud Storage: type: blobstore; transactions: no; complex queries: no; capacity: petabytes+; unit size: 5 TB/object
● Cloud SQL: type: relational, SQL for OLTP; transactions: yes; complex queries: yes; capacity: 10,230 GB; unit size: determined by DB engine
● Cloud Spanner: type: relational, SQL for OLTP; transactions: yes; complex queries: yes; capacity: petabytes; unit size: 10,240 MiB/row
● BigQuery: type: relational, SQL for OLAP; transactions: no; complex queries: yes; capacity: petabytes+; unit size: 10 MB/row
GCP: storage service decision tree
GCP: storage service decision tree (version #2)

[Diagram] Start: Is your data structured?
● No → Do you need mobile SDKs? If yes, Cloud Storage for Firebase; if no, Cloud Storage.
● Yes → Is your workload analytics?
  ○ Yes → Do you need updates or low latency? If yes, Cloud Bigtable (high throughput); if no, BigQuery (data warehouse, tabular data).
  ○ No → Is your data relational?
    ○ Yes → Do you need horizontal scalability? If yes, Cloud Spanner; if no, Cloud SQL.
    ○ No → Do you need mobile SDKs? If yes, Firebase Realtime DB; if no, Firestore (transactions).
Mountkirk Games case study analysis

MountKirk Games: Analytics pipeline
[Diagram]

Proposed Technical Solutions (MountKirk Games)
● Containers
○ GKE with multiple regional clusters and Workload Identity
○ Services exposed via global load balancers
○ (possibly) Connect the clusters with Anthos (which gives additional benefits: control and encryption of the traffic,
centralized management etc).
○ Cluster and workload autoscaling -> either configure GKE autoscalers, or just deploy GKE clusters in AutoPilot mode.
○ Additional Node Pools with preemptible instances.
○ Additional Node Pools with GPUs.
● Cloud Spanner as database for leaderboards.
○ Deployed in multi-region setup to minimize latency from GKE clusters to Spanner.
● CI / CD pipeline for rapid deployment:
○ Cloud Source Repositories to store and work on source code
○ Cloud Build
○ Artifact Registry (previously: Container Registry, focused only on containers) for storing artifacts after they're built
○ Cloud Deploy; Alternatively, 3rd party software (Jenkins, Spinnaker etc)
● Migrate for GKE/Anthos -> migrating VM-based workloads to Kubernetes (GKE).
● Cloud Operations Suite for monitoring / telemetry.
○ GCS buckets / BigQuery to store logs for longer periods of time.
● Advanced analytics: source (GKE game servers or GCS buckets) -> Pub/Sub -> Dataflow -> BigQuery + Data Studio / Looker
● GCP Game Servers: possibly, but the architecture above will handle it just fine as well.
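For the advanced-analytics path (game servers -> Pub/Sub -> Dataflow -> BigQuery), the producer side can be as small as the sketch below; the topic and project names are hypothetical, and the Dataflow and BigQuery stages are assumed to exist separately:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "game-events")

# A game server publishes one event; a Dataflow pipeline subscribed to the
# topic transforms events and streams them into BigQuery for analysis.
event = {"player": "p1", "score": 1200, "event": "level_complete"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("published message id:", future.result())
```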
[Mountkirk Games case study] Diagnostic Question #1

For this question, refer to the Mountkirk Games case study. Mountkirk Games wants to migrate from their current analytics and statistics reporting model to one that meets their technical requirements on Google Cloud Platform.

Which two steps should be part of their migration plan? (Choose two.)

A. Evaluate the impact of migrating their current batch ETL code to Cloud Dataflow.
B. Write a schema migration plan to denormalize data for better performance in BigQuery.
C. Draw an architecture diagram that shows how to move from a single MySQL database to a MySQL cluster.
D. Load 10 TB of analytics data from a previous game into a Cloud SQL instance, and run test queries against the full dataset to confirm that they complete successfully.
E. Integrate Cloud Armor to defend against possible SQL injection attacks in analytics files uploaded to Cloud Storage.
[Mountkirk Games case study] Diagnostic Question #2

Mountkirk Games wants to limit the physical location of resources to their operating Google Cloud regions.

What should you do?

A. Configure an organizational policy which constrains where resources can be deployed.
B. Configure IAM conditions to limit what resources can be configured.
C. Configure the quotas for resources in the regions not being used to 0.
D. Configure a custom alert in Cloud Monitoring so you can disable resources as they are created in other regions.
[Mountkirk Games case study] Diagnostic Question #3

Mountkirk Games wants you to design their new testing strategy. How should the test coverage differ from their existing backends on the other platforms?

A. Tests should scale well beyond the prior approaches.
B. Unit tests are no longer required, only end-to-end tests.
C. Tests should be applied after the release is in the production environment.
D. Tests should include directly testing the Google Cloud Platform (GCP) infrastructure.
[optional] Links to useful materials

Optional materials 1
[ READING ]
● Make sure you know the differences between BigQuery and BigTable.
● Be aware how BigQuery table partitioning works.

[ VIDEOS ]
● Cloud Networking 104 (Load Balancers): Cloud OnAir: Networking 104 - Everything You Need to Know About Load
Balancers on GCP
● Querying external data with BigQuery
● BigQuery: What is BigQuery?
● [IMPORTANT TO KNOW] Sharing BigQuery data with others: Protect data with authorized views
● BigTable: What is Cloud Bigtable?
● Data Studio introduction: Data Studio in a minute
● BigTable: What can you do with Bigtable?
● Cloud Spanner [5 min]: What is Cloud Spanner | Cloud Spanner Explained | Cloud Native Relational Database
● Cloud Spanner [2x5min]: How to set up a Cloud Spanner instance & Cloud Spanner: Database deep dive
● Introduction to Firestore: Introduction to Firestore | NoSQL Document Database
● What is Dataprep? (do not confuse with Dataproc, Dataflow and other Data<service>) No code data wrangling with
Dataprep #GCPSketchnote
● Decision tree to migrate Apache Hadoop workloads to Dataproc: Decision tree to migrate Apache Hadoop
workloads to Dataproc #GCPSketchnote

Optional materials 2
● Creating a large Dataproc Cluster with preemptible VMs: Creating a large Dataproc Cluster with preemptible VMs
● What is Cloud Build?: What is Cloud Build? #GCPSketchnote
● Three ways to improve CI/CD in your serverless app
● How to protect secrets with Secret Manager: Level Up - Secret Manager
● What is Cloud Armor?: What is Cloud Armor? #GCPSketchnote

[ PODCASTS ]
● BigQuery Admin Reference Guides
● Firebase (not to be mixed up with Firestore!)
● Cloud Functions
● Cloud BigTable

[ DEEP DIVES ]
● BigQuery and Cloud Spanner deep dive: Under the hood of Google Cloud data technologies: BigQuery and Cloud
Spanner
● (~20 mins) BigQuery lab that will familiarize you with basics and show interesting insights at the same time.
Diagnostic Questions for Exam Guide Section 4: Analyzing and optimizing technical and business processes
PCA Exam Guide Section 4:
Analyzing and optimizing technical and business processes

4.1 Analyzing and defining technical processes

4.2 Analyzing and defining business processes

4.3 Developing procedures to ensure reliability


of solutions in production
4.1 Analyzing and defining technical processes

Considerations include:
● Software development life cycle (SDLC)
● Continuous integration / continuous deployment
● Troubleshooting / root cause analysis best practices
● Testing and validation of software and infrastructure
● Service catalog and provisioning
● Business continuity and disaster recovery
4.1 Diagnostic Question 01 Discussion

You are asked to implement a lift and shift operation for Cymbal Direct’s Social Media Highlighting service. You compose a Terraform configuration file to build all the necessary Google Cloud resources.

What is the next step in the Terraform workflow for this effort?

A. Commit the configuration file to your software repository.
B. Run terraform plan to verify the contents of the Terraform configuration file.
C. Run terraform apply to deploy the resources described in the configuration file.
D. Run terraform init to download the necessary provider modules.
4.1 Diagnostic Question 02 Discussion

You have implemented a manual CI/CD process for the container services required for the next implementation of Cymbal Direct’s Drone Delivery project. You want to automate the process.

What should you do?

A. Implement and reference a source repository in your Cloud Build configuration file.
B. Implement a build trigger that applies your build configuration when a new software update is committed to Cloud Source Repositories.
C. Specify the name of your Container Registry in your Cloud Build configuration.
D. Configure and push a manifest file into an environment repository in Cloud Source Repositories.
4.1 Diagnostic Question 03 Discussion

You have an application implemented on Compute Engine. You want to increase the durability of your application.

What should you do?

A. Implement a scheduled snapshot on your Compute Engine instances.
B. Implement a regional managed instance group.
C. Monitor your application’s usage metrics and implement autoscaling.
D. Perform health checks on your Compute Engine instances.
4.1 Diagnostic Question 04 Discussion

Developers on your team frequently write new versions of the code for one of your applications. You want to automate the build process when updates are pushed to Cloud Source Repositories.

What should you do?

A. Implement a Cloud Build configuration file with build steps.
B. Implement a build trigger that references your repository and branch.
C. Set proper permissions for Cloud Build to access deployment resources.
D. Upload application updates and Cloud Build configuration files to Cloud Source Repositories.
4.1 Diagnostic Question 05 Discussion

Your development team used Cloud Source Repositories, Cloud Build, and Artifact Registry to successfully implement the build portion of an application's CI/CD process. However, the deployment process is erroring out. Initial troubleshooting shows that the runtime environment does not have access to the build images. You need to advise the team on how to resolve the issue.

What could cause this problem?

A. The runtime environment does not have permissions to the Artifact Registry in your current project.
B. The runtime environment does not have permissions to Cloud Source Repositories in your current project.
C. The Artifact Registry might be in a different project.
D. You need to specify the Artifact Registry image by name.
4.1 Diagnostic Question 06 Discussion

You are implementing a disaster recovery plan for the cloud version of your drone solution. Sending videos to the pilots is crucial from an operational perspective.

What design pattern should you choose for this part of your architecture?

A. Hot with a low recovery time objective (RTO)
B. Warm with a high recovery time objective (RTO)
C. Cold with a low recovery time objective (RTO)
D. Hot with a high recovery time objective (RTO)
4.1 Diagnostic Question 07 Discussion

The number of requests received by your application is nearing the maximum specified in your design. You want to limit the number of incoming requests until the system can handle the workload.

What design pattern does this situation describe?

A. Applying a circuit breaker
B. Applying exponential backoff
C. Increasing jitter
D. Applying graceful degradation
4.1 Diagnostic Question 08 Discussion

The pilot subsystem in your Delivery by Drone service is critical to your service. You want to ensure that connections to the pilots can survive a VM outage without affecting connectivity.

What should you do?

A. Configure proper startup scripts for your VMs.
B. Deploy a load balancer to distribute traffic across multiple machines.
C. Create persistent disk snapshots.
D. Implement a managed instance group and load balancer.
4.1 Diagnostic Question 09 Discussion

Cymbal Direct wants to improve its drone pilot interface. You want to collect feedback on proposed changes from the community of pilots before rolling out updates systemwide.

What type of deployment pattern should you implement?

A. You should implement canary testing.
B. You should implement A/B testing.
C. You should implement a blue/green deployment.
D. You should implement an in-place release.
4.1 Analyzing and defining technical processes

Resources to start your journey

Securing the software development lifecycle with Cloud Build and SLSA
CI/CD with Google Cloud
Site Reliability Engineering
DevOps tech: Continuous testing | Google Cloud
Application deployment and testing strategies | Cloud Architecture Center
Chapter 17 - Testing for Reliability
Service Catalog documentation | Google Cloud
What is Disaster Recovery? | Google Cloud
API design guide
4.2 Analyzing and defining business processes

Considerations include:
● Stakeholder management (e.g. influencing and facilitation)
● Change management
● Team assessment / skills readiness
● Decision-making processes
● Customer success management
● Cost optimization / resource optimization (capex / opex)
4.2 Analyzing and defining business processes

Resources to start your journey

What is Digital Transformation?


Cloud Cost Optimization: Principles for Lasting Success
Cost Optimization on Google Cloud for Developers and Operators
Certification solutions for Team Readiness
4.3 Developing procedures to ensure reliability of solutions in production

● Chaos engineering
● Penetration testing
4.3 Diagnostic Question 10 Discussion

You want to establish procedures for testing the resilience of the delivery-by-drone solution.

How would you simulate a scalability issue?

A. Block access to storage assets in one of your zones.
B. Inject a bad health check for one or more of your resources.
C. Load test your application to see how it responds.
D. Block access to all resources in a zone.
4.3 Developing procedures to ensure reliability of solutions in production

Resources to start your journey

Site Reliability Engineering


Site Reliability Engineering (SRE) | Google Cloud
Patterns for scalable and resilient apps | Cloud Architecture Center
How to achieve a resilient IT strategy with Google Cloud
Disaster recovery planning guide | Cloud Architecture Center
Make sure to…
Enjoy the journey as much as the destination!
