Data Engineering
SNOWFLAKE
ALL CONCEPTS TO GET STARTED
Data Engineering 101- Snowflake
Cloud Data
Warehouse
A cloud-based platform for storing and
analyzing data, which offers scalability,
flexibility, and cost-efficiency compared to
traditional on-premises data warehouses.
Snowflake provides a fully managed service
with separate compute, storage, and cloud
services layers, making it easier to scale and
manage data operations.
2
Data Engineering 101- Snowflake
Snowflake
Architecture
Snowflake's architecture separates storage
and compute, allowing for independent
scaling and efficient data management.
This design eliminates many limitations of
traditional data warehouses.
Snowflake uses a multi-cluster shared data
architecture, where storage is centralized,
and compute resources can be scaled up or
down independently based on workload.
3
Data Engineering 101- Snowflake
Virtual Warehouse
A virtual warehouse is a cluster of compute
resources in Snowflake. Each virtual
warehouse can be scaled independently to
match the workload, providing the
necessary compute power for query
execution without affecting other
warehouses.
If a company needs to run a heavy
analytical query during peak business
hours, they can scale up the virtual
warehouse to a larger size, ensuring faster
query performance. After the peak hours,
the warehouse can be scaled down to save
costs.
4
Data Engineering 101- Snowflake
Database
A logical grouping of schemas, tables, and
other database objects. It provides a
namespace for organizing and managing
data.
Creating a new database in Snowflake:
CREATE DATABASE sales_data;.
This command sets up a new database
where all sales-related schemas and tables
can be organized.
5
Data Engineering 101- Snowflake
Schema
A logical grouping of database objects such
as tables, views, and stored procedures.
Schemas help organize objects within a
database.
Creating a new schema in a database:
CREATE SCHEMA sales_data.january;.
This schema can contain all tables related
to January's sales data.
6
Data Engineering 101- Snowflake
Table
A structured set of data elements (values)
organized in rows and columns. Tables are
fundamental storage objects in a database.
Creating a new table:
CREATE TABLE customers
(id INT, name STRING, email STRING);.
This table stores customer information.
7
Data Engineering 101- Snowflake
View
A virtual table based on the result-set of a
SQL query. Views do not store data
themselves but provide a way to represent
data stored in tables.
Creating a new view:
CREATE VIEW vip_customers
AS SELECT *FROM customers
WHERE status ='VIP';.
This view shows only VIP customers.
8
Data Engineering 101- Snowflake
Stage
A location where data files are stored
temporarily before being loaded into
Snowflake tables. Stages can be internal or
external (e.g., S3, Azure Blob Storage).
Creating an internal stage:
CREATE STAGE my_stage;.
This stage can be used to store data files
before loading them into tables.
9
Data Engineering 101- Snowflake
File Format
Defines the format of data files to be loaded
into Snowflake (e.g., CSV, JSON, Avro). File
formats specify how Snowflake should
interpret the contents of the files.
Creating a file format for CSV files:
CREATE FILE FORMAT my_csv_format
TYPE ='CSV'
FIELD_OPTIONALLY_ENCLOSED_BY ='"';.
10
Data Engineering 101- Snowflake
Warehouse Size
Snowflake offers different sizes for virtual
warehouses (e.g., X-Small, Small, Medium,
Large) to accommodate various workloads.
Larger sizes provide more compute
resources.
A Small warehouse might be sufficient for
routine queries, while a Large warehouse
can handle complex analytical queries.
Adjust the size based on workload
demands.
11
Data Engineering 101- Snowflake
Scaling Up
Increasing the size of a virtual warehouse to
provide more compute resources for a
specific workload.
Scaling up a virtual warehouse:
ALTER WAREHOUSE my_warehouse
SET WAREHOUSE_SIZE ='LARGE';.
This increases the compute power available
for queries.
12
Data Engineering 101- Snowflake
Scaling Out
Adding more compute clusters to a virtual
warehouse to handle increased
concurrency and workload demands.
Enabling auto-scaling for a warehouse:
ALTER WAREHOUSE my_warehouse
SET MAX_CLUSTER_COUNT =5;.
Snowflake will add clusters as needed to
handle concurrent queries.
13
Data Engineering 101- Snowflake
Auto-Suspend
Automatically suspends a virtual warehouse
when it is idle for a specified period, saving
costs.
Setting auto-suspend for a warehouse:
ALTER WAREHOUSE my_warehouse
SET AUTO_SUSPEND =300;.
The warehouse will suspend after 5 minutes
of inactivity.
14
Data Engineering 101- Snowflake
Auto-Resume
Automatically resumes a suspended virtual
warehouse when a query is submitted,
ensuring availability without manual
intervention.
Enabling auto-resume for a warehouse:
ALTER WAREHOUSE my_warehouse
SET AUTO_RESUME =TRUE;.
The warehouse will resume automatically
when a query is submitted.
15
Data Engineering 101- Snowflake
Query Caching
Snowflake caches the results of queries to
speed up repeated query executions,
reducing the need for re-computation and
saving compute resources.
Running the same query twice will utilize
the cached result if the underlying data has
not changed, improving performance and
efficiency.
16
Data Engineering 101- Snowflake
Result Cache
Stores the results of queries executed
within the past 24 hours. The cache is
accessible to all users within the account,
reducing compute costs and speeding up
query performance.
If a query is run and then re-run within 24
hours without changes to the underlying
data, the result is fetched from the result
cache, saving compute resources.
17
Data Engineering 101- Snowflake
Metadata Cache
Stores metadata about database objects to
speed up query parsing and planning. This
cache helps optimize query execution by
reducing the time needed to access
metadata.
Metadata about tables, columns, and
statistics is cached, allowing faster query
planning and execution. This helps
Snowflake optimize performance for
complex queries.
18
Data Engineering 101- Snowflake
Data Caching
Snowflake caches data in the local storage
of virtual warehouses to improve query
performance. This cache is independent for
each virtual warehouse.
Frequently accessed data is stored in the
local disk cache of a virtual warehouse,
reducing the need to fetch data from
remote storage repeatedly, thus improving
performance.
19
Data Engineering 101- Snowflake
Stages
Locations in Snowflake where data files can
be stored before being loaded into tables.
Stages can be internal (within Snowflake) or
external (e.g., AWS S3).
An internal stage can be created using
CREATE STAGE my_stage;.
Data files can be uploaded to this stage
before being loaded into a table.
20
Data Engineering 101- Snowflake
COPY INTO
Command
Used to load data from a stage into a
Snowflake table. The command specifies
the target table and the source file(s) along
with optional transformations.
Loading data from a stage into a table:
COPY INTO my_table
FROM @my_stage/file.csv
FILE_FORMAT =(FORMAT_NAME =
'my_csv_format');.
21
Data Engineering 101- Snowflake
Time Travel
Allows users to query, clone, or restore data
to a previous state within a defined
retention period. This feature aids in data
recovery and auditing.
Querying a table as it was at a specific point
in time:
SELECT *FROM my_table
AT (TIMESTAMP =>'2022-06-01T00:00:00');.
22
Data Engineering 101- Snowflake
Zero-Copy Cloning
Enables creating a clone of a database,
schema, or table without copying the data.
Changes to the clone do not affect the
original, and vice versa.
Creating a clone of a table:
CREATE CLONE my_table_clone
OF my_table;.
This allows working with a snapshot of the
data without additional storage costs.
23
Data Engineering 101- Snowflake
Secure Data Sharing
Allows sharing of data across different
Snowflake accounts without moving or
copying the data. Consumers can query
shared data in real-time.
Sharing data with another Snowflake
account:
CREATE SHARE my_share;
GRANT SELECT ON TABLE my_table TO
SHARE my_share;.
The recipient can access the shared data
directly.
24
Data Engineering 101- Snowflake
Snowsight
Snowflake's new web user interface that enhances the user experience
with features like integrated dashboards, interactive visualizations, and
an improved SQL editor.
Users can create and manage interactive dashboards within
Snowsight, allowing them to visualize data trends and share insights
with their team. For example, a sales team can use Snowsight to
build a dashboard that tracks monthly sales performance across
different regions.
25
Data Engineering 101- Snowflake
Snowflake
Community
A vibrant network of users, experts, and
partners who share knowledge, best
practices, and support each other in using
Snowflake. It includes user groups, forums,
and special interest groups.
Joining the Snowflake Community allows
users to participate in discussions, attend
meetups, and access valuable resources.
For instance, a data analyst can join a virtual
special interest group focused on data
warehousing to learn from others'
experiences and share their own insights.
26
Data Engineering 101- Snowflake
Data Marketplace
The Snowflake Data Marketplace is a
platform where users can discover, access,
and share live data sets from various
providers. It facilitates data collaboration
and allows users to enrich their own data
with external data sources.
A marketing team can access demographic
data from a third-party provider through
the Data Marketplace to enhance their
customer analysis. They can integrate this
data with their internal sales data to gain
deeper insights into customer behavior and
preferences.
27
Data Engineering 101- Snowflake
Multi-Cluster
Warehouses
Multi-Cluster Warehouses allow Snowflake
to automatically manage the number of
compute clusters needed to handle varying
workloads. This ensures optimal
performance and resource utilization
without manual intervention.
A retail company can set up a multi-cluster
warehouse to handle the high concurrency
of queries during Black Friday sales.
Snowflake automatically adds clusters to
manage the increased load and removes
them when the load decreases, ensuring
efficient use of resources and cost
management.
28
Data Engineering 101- Snowflake
Materialized Views
Materialized views store the result set of a
query physically and automatically update
when the underlying data changes. They
improve query performance by providing
pre-computed results.
Creating a materialized view:
CREATE MATERIALIZED VIEW mv_sales
AS SELECT *FROM sales
WHERE year =2022;.
Queries on this view are faster since the
results are pre-computed.
29
Data Engineering 101- Snowflake
Task
Tasks are used to automate the execution
of SQL statements, including procedural
logic, at specified intervals or upon
completion of other tasks.
Creating a task to run a query every hour:
CREATE TASK hourly_task
WAREHOUSE ='my_warehouse'
SCHEDULE ='1 HOUR'
AS I
NSERT INTO daily_sales
SELECT *FROM sales
WHERE sales_date =CURRENT_DATE;.
30
Data Engineering 101- Snowflake
Stream
Streams track changes to a table (inserts,
updates, deletes) and provide a change
data capture (CDC) mechanism for efficient
data processing.
Creating a stream:
CREATE STREAM sales_stream ON TABLE
sales;.
The stream captures changes to the sales
table, which can be processed later.
31
Data Engineering 101- Snowflake
Pipe
Pipes automate data loading by
continuously ingesting data from external
stages (e.g., AWS S3, Azure Blob Storage)
into Snowflake tables.
Creating a pipe to load data from an S3
bucket:
CREATE PIPE my_pipe
AS COPY INTO my_table
FROM @my_stage/file.csv
FILE_FORMAT =(FORMAT_NAME =
'my_csv_format');.
32
Data Engineering 101- Snowflake
Warehouse
Monitoring
Snowflake provides tools to monitor the
performance and usage of virtual
warehouses, helping users optimize
resource allocation and manage costs.
Using the WAREHOUSE_METERING_HISTORY
view to monitor warehouse usage and costs:
SELECT *FROM
WAREHOUSE_METERING_HISTORY
WHERE WAREHOUSE_NAME =
'my_warehouse';.
33
Data Engineering 101- Snowflake
Role-Based Access
Control (RBAC)
A security model that restricts access to
data and resources based on the roles
assigned to users. Snowflake allows fine-
grained control over access permissions.
Creating a role and granting privileges:
CREATE ROLE analyst_role;
GRANT SELECT ON DATABASE sales_data TO
ROLE analyst_role;.
Assigning the role to a user:
GRANT ROLE analyst_role TO USER john_doe;.
34
Data Engineering 101- Snowflake
Dynamic Data
Masking
Dynamic Data Masking allows Snowflake to
hide sensitive data in query results based
on the role of the user accessing the data.
This enhances data security and privacy.
Masking sensitive data:
CREATE MASKING POLICY ssn_mask
AS (val STRING)
RETURNS STRING ->CASE
WHEN CURRENT_ROLE() IN ('analyst_role')
THEN 'XXX-XX-XXXX'
ELSE val END; Applying the policy:
ALTER TABLE customers
MODIFY COLUMN ssn
SET MASKING POLICY ssn_mask;.
35
Data Engineering 101- Snowflake
External Tables
External tables allow Snowflake to query
data stored in external locations (e.g., AWS
S3, Azure Blob Storage) without loading it
into Snowflake.
Creating an external table:
CREATE EXTERNAL TABLE my_ext_table WITH
LOCATION ='@my_external_stage'
FILE_FORMAT =(FORMAT_NAME =
'my_csv_format');.
This table allows querying data directly
from the external stage.
36
Data Engineering 101- Snowflake
Data Replication
Snowflake's data replication feature allows
for the replication of databases across
different regions and cloud providers to
enhance data availability and disaster
recovery.
Setting up data replication:
CREATE REPLICATION GROUP
my_replication AS REPLICATION
TO REGION 'aws_us_west_2';.
This replicates the database to a different
AWS region.
37
Data Engineering 101- Snowflake
Failover and
Failback
Snowflake provides failover and failback
capabilities to ensure high availability and
disaster recovery. Failover allows switching
to a replica in case of a failure, and failback
switches back once the original is restored.
Configuring failover for a database:
ALTER DATABASE my_database
SET FAILOVER GROUP =my_failover_group;.
This ensures that the database can switch
to a replica in case of a failure.
38
Data Engineering 101- Snowflake
Search Optimization
Service
A Snowflake feature that improves the
performance of searches on large tables by
creating and maintaining search
optimization structures.
Enabling search optimization for a table:
ALTER TABLE my_table SET SEARCH
OPTIMIZATION =TRUE;.
This improves the performance of search
queries on the table.
39
Data Engineering 101- Snowflake
Snowflake Data
Exchange
A platform that allows Snowflake users to
share and access live data securely. It
facilitates data collaboration and
monetization by providing a marketplace
for data providers and consumers.
Publishing data to the Data Exchange:
CREATE EXCHANGE my_exchange;
GRANT SELECT ON TABLE my_table TO
EXCHANGE my_exchange;.
Other users can subscribe to and query the
shared data.
40
Data Engineering 101- Snowflake
Data Masking
Data masking provides a way to protect
sensitive data by masking it in query results,
based on user roles. This ensures that
sensitive information is not exposed to
unauthorized users.
Creating a data masking policy:
CREATE MASKING POLICY email_mask
AS (val STRING)
RETURNS STRING ->CASE
WHEN CURRENT_ROLE() IN ('analyst_role')
THEN '********@domain.com' ELSE val END;
Applying the policy to a column:
ALTER TABLE users
MODIFY COLUMN email SET MASKING POLICY
email_mask;.
41
Data Engineering 101- Snowflake
Snowpipe
Snowpipe is Snowflake's continuous data
ingestion service, which allows for the
automated loading of data from external
stages into Snowflake tables.
Creating a Snowpipe to load data:
CREATE PIPE my_pipe
AS COPY INTO my_table
FROM @my_stage
FILE_FORMAT =(FORMAT_NAME =
'my_csv_format');.
Snowpipe will automatically load new data
files as they arrive in the stage.
42
Data Engineering 101- Snowflake
External Functions
External functions allow Snowflake to call
external services and integrate with
external systems directly from SQL queries.
This enables advanced data processing and
integration capabilities.
Creating an external function:
CREATE EXTERNAL FUNCTION
my_ext_function()
RETURNS STRING API_INTEGRATION =
my_api_integration;.
This function can call an external API and
return the result to Snowflake.
43
Data Engineering 101- Snowflake
Streams and Tasks
Streams track changes to tables, and tasks
automate the execution of SQL based on
schedules or events. Together, they enable
efficient change data capture and
automation.
Creating a stream and task:
CREATE STREAM my_stream
ON TABLE my_table;
CREATE TASK my_task WAREHOUSE =
'my_warehouse' SCHEDULE ='1 HOUR'
AS
INSERT INTO my_target_table
SELECT *FROM my_stream;.
44
Data Engineering 101- Snowflake
Snowflake
Organizations
Snowflake Organizations provide a way to
manage multiple Snowflake accounts
within an organization. This enables better
resource allocation, cost management, and
governance.
Creating an organization:
CREATE ORGANIZATION my_org;
and adding accounts to it. This allows
central management of multiple Snowflake
accounts.
45
Data Engineering 101- Snowflake
Data Governance
Snowflake offers features for data
governance, including access controls, data
masking, and audit logging, to ensure data
security, privacy, and compliance.
Implementing data governance:
CREATE ROW ACCESS POLICY my_policy AS
(val STRING)
RETURNS BOOLEAN ->CURRENT_ROLE()
IN ('data_governance_role');
and applying it to a table.
46
Data Engineering 101- Snowflake
Account Usage
Snowflake provides account usage views to
track and analyze resource usage, query
performance, and cost management. These
views help in monitoring and optimizing
Snowflake usage.
Querying account usage:
SELECT *FROM
ACCOUNT_USAGE.QUERY_HISTORY WHERE
QUERY_TEXT ILIKE '%SELECT%';
This retrieves the history of SELECT queries
executed in the account.
47
Data Engineering 101- Snowflake
Resource Monitors
Resource monitors allow administrators to
manage and control compute resource
usage by setting thresholds and triggering
actions when limits are reached.
Creating a resource monitor:
CREATE RESOURCE MONITOR my_monitor
WITH CREDIT_QUOTA =1000;
and assigning it to a warehouse. This
monitor will track the compute credits used
by the warehouse and take action if the
quota is exceeded.
48
Data Engineering 101- Snowflake
Query Optimization
Snowflake provides various tools and techniques to optimize query
performance, including using the Query Profiler, optimizing table
structures, and leveraging caching.
Using the Query Profiler:
SELECT *FROM
TABLE(QUERY_HISTORY_BY_SESSION(SESSI
ON_ID =>'my_session'));
This helps identify and optimize slow- running queries.
49
Data Engineering 101- Snowflake
Data Sharing
Snowflake allows secure sharing of data
between different accounts without data
movement. Shared data can be accessed in
real-time, ensuring consistency and
reducing latency.
Creating a share: CREATE SHARE my_share;
and adding tables to it. Other Snowflake
accounts can access the shared data
directly.
50
Data Engineering 101- Snowflake
Cloning
Cloning in Snowflake creates a copy of a
database, schema, or table without
duplicating the data. This is useful for
creating test environments and for backup
purposes.
Cloning a table:
CREATE CLONE my_table_clone OF
my_table;
This allows working with a snapshot of the
data without additional storage costs.
51
Data Engineering 101- Snowflake
Data Load and
Unload
Snowflake provides various methods for
loading and unloading data, including bulk
loading with the COPY command, using
Snowpipe for continuous loading, and
unloading data to external stages.
Loading data:
COPY INTO my_table
FROM @my_stage FILE_FORMAT =
(FORMAT_NAME ='my_csv_format');
and unloading data:
COPY INTO @my_stage FROM my_table;.
52
Data Engineering 101- Snowflake
Data Encryption
Snowflake encrypts data at rest and in
transit to ensure data security. Encryption
keys are managed automatically, and users
can also provide their own keys for
additional security.
Enabling encryption for a table:
ALTER TABLE my_table SET
DATA_RETENTION_TIME_IN_DAYS =90;
This ensures that data is encrypted and
retained for a specified period.
53
Data Engineering 101- Snowflake
Data Retention
Snowflake provides data retention policies
to manage how long data is kept in the
system. This includes Time Travel and Fail-
safe periods for data recovery.
Setting data retention:
ALTER TABLE my_table SET
DATA_RETENTION_TIME_IN_DAYS =7;
This configures the table to retain historical
data for 7 days.
54
Data Engineering 101- Snowflake
Fail-Safe
Fail-Safe is a Snowflake feature that
provides an additional 7-day period for
recovering data after the Time Travel
retention period has expired. This ensures
data recovery in case of failures.
Accessing Fail-Safe data:
SELECT *FROM my_table BEFORE
(END_TIME =>'2022-06-01T00:00:00');
This retrieves data that is in the Fail-Safe
period.
55
Data Engineering 101- Snowflake
User-Defined
Functions (UDFs)
UDFs allow users to define their own
functions in SQL or JavaScript, extending
Snowflake's built-in functionality with
custom logic.
Creating a SQL UDF:
CREATE FUNCTION my_udf(x INT)
RETURNS INT
LANGUAGE SQL
AS
'RETURN x *2';
This function multiplies the input by 2.
56
Data Engineering 101- Snowflake
Stored Procedures
Stored procedures in Snowflake allow for procedural logic and complex
operations to be encapsulated in SQL or JavaScript, enabling
automation and reusable code.
Creating a stored procedure:
CREATE PROCEDURE my_proc()
RETURNS STRING LANGUAGE JAVASCRIPT
AS $$ return 'Hello, World!';
$$;
and calling it:
CALL my_proc();.
57
Data Engineering 101- Snowflake
Privileges and
Grants
Snowflake's security model uses privileges
and grants to control access to database
objects. Roles are assigned privileges, and
users are assigned roles.
Granting privileges:
GRANT SELECT ON TABLE my_table
TO ROLE analyst_role;
This allows users with the analyst_role to
query the table.
58
Data Engineering 101- Snowflake
Roles and Role
Hierarchies
Roles in Snowflake define a set of privileges
and can be assigned to users. Role
hierarchies allow roles to inherit privileges
from other roles, simplifying access
management.
Creating a role hierarchy:
CREATE ROLE senior_analyst;
GRANT ROLE analyst_role TO ROLE
senior_analyst;
Users with the senior_analyst role inherit
privileges from the analyst_role.
59
Data Engineering 101- Snowflake
Session Variables
Session variables in Snowflake store values
that can be used within a session. They
allow for dynamic SQL and reusable code.
Setting and using a session variable:
SET my_var ='Hello, World!';
and SELECT $my_var;
This returns the value of the variable.
60
Data Engineering 101- Snowflake
Parameter
Management
Snowflake allows configuration of various
parameters at the account, session, and
object levels to customize behavior and
optimize performance.
Setting a session parameter:
ALTER SESSION SET QUERY_TAG =
'MyQuery';
This tags queries within the session for
easier tracking.
61
Data Engineering 101- Snowflake
Semi-Structured
Data
Snowflake supports semi-structured data
formats such as JSON, Avro, Parquet, and
XML. This allows for flexible data modeling
and integration with modern data sources.
Querying JSON data: SELECT json_data:id
FROM my_table;. This retrieves the "id" field
from JSON data stored in a column.
62
Data Engineering 101- Snowflake
Data Compression
Snowflake automatically compresses data
to reduce storage costs and improve query
performance. Different compression
algorithms are used based on the data type.
Snowflake's automatic compression means
users don't need to manually configure
compression settings, as the platform
optimizes storage efficiency.
63
Data Engineering 101- Snowflake
Cost Management
Snowflake provides tools and practices to
manage and optimize costs, including
resource monitors, usage views, and best
practices for query optimization.
Using resource monitors to control costs:
CREATE RESOURCE MONITOR my_monitor
WITH CREDIT_QUOTA =1000;
and setting up alerts for budget thresholds.
64
Data Engineering 101- Snowflake
Query History
Snowflake tracks query history, allowing
users to review and analyze past queries for
performance optimization and
troubleshooting.
Accessing query history:
SELECT *FROM QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%SELECT%'; This
retrieves a history of SELECT queries
executed in the account.
65
Data Engineering 101- Snowflake
Metadata
Management
Snowflake manages metadata for all
database objects, providing detailed
information about tables, columns, and
other objects. This metadata is used for
query optimization and data governance.
Querying metadata:
SELECT *FROM
INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA ='PUBLIC';
This retrieves information about all tables in
the PUBLIC schema.
66
Data Engineering 101- Snowflake
Data Import/Export
Snowflake supports various methods for
importing and exporting data, including
bulk loading with the COPY command and
unloading to external stages.
Importing data:
COPY INTO my_table
FROM @my_stage FILE_FORMAT =
(FORMAT_NAME ='my_csv_format');
and exporting data:
COPY INTO @my_stage FROM my_table;
67
Data Engineering 101- Snowflake
Data Quality
Snowflake provides features to ensure data
quality, including constraints, data
validation, and profiling.
Implementing data quality checks:
CREATE TABLE my_table (
id INT PRIMARY KEY,
name STRING NOT NULL);
This ensures that the "id" column is unique
and the "name" column is not null.
68
Data Engineering 101- Snowflake
Data Lineage
Data lineage tracks the flow of data
through Snowflake, from ingestion to
transformation to analysis, providing
visibility into data dependencies and
transformations.
Using views and tasks to track data lineage:
CREATE VIEW my_view
AS
SELECT *FROM my_table;
and
CREATE TASK my_task
AS
INSERT INTO my_table
SELECT *FROM my_view;
69
Data Engineering 101- Snowflake
Business Continuity
Snowflake's features for business continuity
include data replication, failover, and fail-
safe, ensuring that data is always available
and recoverable in case of disasters.
Setting up a failover group:
CREATE FAILOVER GROUP my_group
AS FAILOVER TO REGION 'aws_us_west_2';
This ensures that the database can switch
to a replica in case of a failure.
70
Data Engineering 101- Snowflake
Governance and
Compliance
Snowflake provides tools for data
governance and compliance, including
access controls, data masking, and audit
logging, to ensure data security and
regulatory compliance.
Implementing compliance policies:
CREATE ROW ACCESS POLICY
compliance_policy AS (val STRING)
RETURNS BOOLEAN ->CURRENT_ROLE()
IN ('compliance_role');
and applying it to a table.
71
Data Engineering 101- Snowflake
Advanced Analytics
Snowflake supports advanced analytics
capabilities, including machine learning
integration, geospatial data processing, and
complex data transformations.
Integrating with machine learning models:
CREATE FUNCTION predict_sales(x FLOAT)
RETURNS FLOAT
LANGUAGE PYTHON RUNTIME ='3.8'
HANDLER ='my_model.predict';
This function calls a Python model for sales
prediction.
72
Data Engineering 101- Snowflake
Data Monetization
Snowflake's data marketplace and secure
data sharing enable organizations to
monetize their data assets by sharing or
selling data to other Snowflake users.
Publishing data for monetization:
CREATE EXCHANGE my_exchange;
and adding data to it for other users to
access and purchase.
73
Data Engineering 101- Snowflake
Geospatial Data
Snowflake supports geospatial data types
and functions, allowing users to store,
query, and analyze spatial data such as
points, polygons, and geometries.
Querying geospatial data:
SELECT ST_DISTANCE(point1, point2)
FROM my_table;
This calculates the distance between two
points stored in a table.
74
Data Engineering 101- Snowflake
IoT Data Processing
Snowflake's scalable architecture and
support for semi-structured data make it
well-suited for processing and analyzing IoT
(Internet of Things) data.
Loading IoT data:
COPY INTO my_table
FROM @iot_stage
FILE_FORMAT =(FORMAT_NAME =
'json_format');
This ingests JSON data from IoT devices.
75
Data Engineering 101- Snowflake
Real-Time Analytics
Snowflake supports real-time analytics by
allowing continuous data ingestion and
immediate querying of fresh data.
Using Snowpipe for real-time data
ingestion:
CREATE PIPE my_pipe
AS COPY INTO my_table
FROM @my_stage
FILE_FORMAT =(FORMAT_NAME =
'my_csv_format');
76
Data Engineering 101- Snowflake
Data Federation
Snowflake's external tables and data sharing features enable data
federation, allowing users to query and combine data from multiple
sources without moving the data.
Creating an external table to federate data: CREATE
EXTERNAL TABLE my_ext_table WITH LOCATION
='@my_external_stage' FILE_FORMAT =
(FORMAT_NAME =
'my_csv_format');
This table allows querying data directly from the external stage.
77
Data Engineering 101- Snowflake
Security
Integrations
Snowflake integrates with security tools
and frameworks, including single sign-on
(SSO), multi-factor authentication (MFA),
and encryption key management, to
enhance data security.
Configuring SSO:
ALTER ACCOUNT SET SSO_LOGIN_PAGE =
'https://mycompany.com/sso';.
This enables single sign-on for Snowflake
users.
78
Data Engineering 101- Snowflake
Continuous Data
Protection
Snowflake's continuous data protection
features include Time Travel, Fail-safe, and
data replication, ensuring data integrity and
availability at all times.
Setting up data replication:
CREATE REPLICATION GROUP
my_replication
AS REPLICATION TO REGION
'aws_us_west_2';.
This replicates the database to a different
AWS region.
79
Data Engineering 101- Snowflake
Custom Data Types
Snowflake allows users to define custom
data types and enforce data integrity
through constraints and validation rules.
Creating a custom data type:
CREATE DOMAIN email_type
AS STRING CHECK (VALUE LIKE '%@%.%');
This enforces email format validation.
80
Data Engineering 101- Snowflake
Hybrid Tables
Hybrid tables in Snowflake combine the
benefits of transactional and analytical
processing, allowing for efficient real-time
data analysis.
Creating a hybrid table:
CREATE HYBRID TABLE my_table
(id INT, data STRING);
This table supports both transactional and
analytical workloads.
81
Data Engineering 101- Snowflake
Data Archiving
Snowflake's data retention and archiving
features help manage long-term storage of
historical data, ensuring that it is available
for compliance and analysis.
Setting data retention:
ALTER TABLE my_table
SET DATA_RETENTION_TIME_IN_DAYS =365;
This configures the table to retain historical
data for one year.
82
Data Engineering 101- Snowflake
Data Classification
Data classification in Snowflake helps
categorize data based on sensitivity and
importance, enabling better data
governance and security.
Classifying data:
ALTER TABLE my_table SET TAG
classification = 'sensitive';
This tags the table as containing sensitive
data.
83
Data Engineering 101- Snowflake
Data Masking
Policies
Data masking policies in Snowflake provide
dynamic masking of sensitive data based
on user roles, ensuring that only authorized
users can see the actual data.
Creating a data masking policy:
CREATE MASKING POLICY ssn_mask
AS (val STRING)
RETURNS STRING ->CASE
WHEN CURRENT_ROLE() IN ('analyst_role')
THEN 'XXX-XX-XXXX' ELSE val END;
Applying the policy:
ALTER TABLE customers MODIFY COLUMN ssn
SET MASKING POLICY ssn_mask;
84
Data Engineering 101- Snowflake
Row Access Policies
Row access policies allow Snowflake to
restrict access to specific rows in a table
based on user roles and other criteria,
enhancing data security and compliance.
Creating a row access policy:
CREATE ROW ACCESS POLICY row_policy
AS (val STRING)
RETURNS BOOLEAN ->CURRENT_ROLE() IN
('analyst_role');
Applying the policy:
ALTER TABLE my_table MODIFY ROW
ACCESS POLICY row_policy;.
85
Data Engineering 101- Snowflake
Cross-Cloud
Replication
Snowflake supports cross-cloud replication,
allowing data to be replicated across
different cloud providers (e.g., AWS, Azure,
Google Cloud) for high availability and
disaster recovery.
Setting up cross-cloud replication:
CREATE REPLICATION GROUP
my_replication
AS REPLICATION TO REGION 'azure_eastus';
This replicates the database to an Azure
region.
86
Data Engineering 101- Snowflake
Event-Driven Data
Processing
Snowflake's tasks and streams enable
event-driven data processing, allowing
actions to be triggered based on changes in
data or scheduled intervals.
Creating an event-driven task:
CREATE TASK my_task
WAREHOUSE ='my_warehouse'
AFTER INSERT ON my_table
AS I
NSERT INTO audit_table
SELECT *FROM my_table;
87
Data Engineering 101- Snowflake
Data Encryption Key
Management
Snowflake allows users to manage their
own encryption keys for added security,
providing control over data encryption and
compliance with regulatory requirements.
Setting a customer-managed key:
ALTER DATABASE my_database
SET ENCRYPTION ='my_custom_key';
This uses a user-provided key for data
encryption.
88
Data Engineering 101- Snowflake
Geospatial
Functions
Snowflake provides geospatial functions to
perform spatial analysis and operations on
geographic data, such as distance
calculations and spatial joins.
Using a geospatial function:
SELECT ST_DISTANCE(point1, point2)
FROM my_table;
This calculates the distance between two
geographic points stored in a table.
89
Data Engineering 101- Snowflake
Graph Analytics
Snowflake supports graph analytics,
enabling users to model and analyze
relationships between data points using
graph structures and algorithms.
Performing graph analytics:
CREATE TABLE graph_edges
(src INT, dst INT);
and running graph queries to analyze
relationships.
90
Data Engineering 101- Snowflake
Data Versioning
Snowflake's Time Travel and Zero-Copy
Cloning features enable data versioning,
allowing users to create, manage, and
query different versions of data for analysis
and auditing.
Creating a version of a table:
CREATE CLONE my_table_clone
OF my_table;
This clone represents a version of the
original table that can be queried and
analyzed separately.
91
Data Engineering 101- Snowflake
API Integration
Snowflake supports integration with
external APIs, allowing users to call external
services and incorporate real-time data into
Snowflake queries and workflows.
Creating an external function to call an API:
CREATE EXTERNAL FUNCTION
my_ext_function() RETURNS STRING
API_INTEGRATION =my_api_integration;
This function can call an external API and
return the result to Snowflake.
92
THANK YOU
93