Getting Started with Amazon Redshift
Maor Kleider, Sr. Product Manager, Amazon Redshift
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Introduction
• Benefits
• Use cases
• Getting started
• Q&A
What is Big Data?
When your data sets become so large and diverse
that you have to start innovating around how to
collect, store, process, analyze and share them
It’s never been easier to generate vast amounts of data
Generate → Collect & Store → Analyze → Collaborate & Act
Individual AWS customers generate over a PB/day
Amazon S3 lets you collect and store all this data
Generate → Collect & Store → Analyze → Collaborate & Act
Individual AWS customers generating over a PB/day can store exabytes of data in S3
But how do you analyze it?
Generate → Collect & Store → Analyze (highly constrained) → Collaborate & Act
Individual AWS customers generating over a PB/day store exabytes of data in S3
The Dark Data Problem
Most generated data is unavailable for analysis
[Chart: data volume by year, 1990–2020 — generated data grows far faster than the data available for analysis]
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
AWS Big Data Portfolio
Collect: Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Direct Connect, AWS Snowball, AWS Database Migration Service
Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Aurora, Amazon CloudSearch, Amazon Elasticsearch Service
Analyze: Amazon EMR, Amazon EC2, Amazon Athena, Amazon Redshift, Amazon QuickSight, Amazon Machine Learning, AWS Glue
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for $1,000/TB/year
150+ features: a lot faster, a lot simpler, a lot cheaper
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
Selected Amazon Redshift customers
Use Case: Traditional Data Warehousing
Business reporting • Advanced pipelines and queries • Secure and compliant • Bulk loads and updates
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Japanese mobile phone provider • World's largest children's book publisher • Powering 100 marketplaces in 50 countries
Use Case: Log Analysis
Log & machine data • IoT data • Clickstream events • Time-series data
Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real-time analytics
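The micro-batch loading pattern mentioned above can be sketched in a few lines. This is an illustration of the buffering idea only, not Kinesis Firehose's actual implementation: records accumulate until either a count or an age threshold is reached, then the whole batch is handed to a sink (in practice, staged to S3 and bulk-loaded with COPY). The `MicroBatcher` name and thresholds are hypothetical.

```python
import time

class MicroBatcher:
    """Buffer records; flush when a size or age threshold is hit.

    Illustrative sketch of size/interval buffering, the same idea
    Kinesis Firehose uses before bulk-loading into Redshift.
    """

    def __init__(self, max_records=500, max_age_s=60.0, sink=None):
        self.max_records = max_records
        self.max_age_s = max_age_s
        self.sink = sink or (lambda batch: None)  # e.g. stage to S3, then COPY
        self.buffer = []
        self.first_arrival = None
        self.flushed = []  # kept for inspection in this sketch

    def add(self, record, now=None):
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.first_arrival = now
        self.buffer.append(record)
        if (len(self.buffer) >= self.max_records
                or now - self.first_arrival >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.flushed.append(self.buffer)
            self.buffer = []
            self.first_arrival = None
```

Batching like this trades a bounded delay (the age threshold) for far fewer, larger loads, which suits Redshift's parallel bulk-load path much better than row-at-a-time inserts.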
Interactive data analysis and recommendation engine • Ride analytics for pricing and product development • Ad prediction and on-demand analytics
Use Case: Business Applications
Multi-tenant BI applications • Back-end services • Analytics as a service
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can
focus on your business applications
Ease of Chargeback – Pay as you go and add clusters as needed: a few big common clusters, several data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Infosys Information Platform (IIP) • Analytics-as-a-Service • Product and consumer analytics
Amazon Redshift architecture
Leader node
• Simple SQL endpoint (JDBC/ODBC) for BI tools, analytics tools, and SQL clients
• Stores metadata
• Optimizes query plan
• Coordinates query execution
Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, and resizes
• Interconnected over 10 GigE (HPC) networking
• Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
Ingestion, backup, and restore integrate with Amazon S3, Amazon EMR, Amazon DynamoDB, and SSH
Benefit #1: Amazon Redshift is fast
Dramatically less I/O:
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw

[Zone map figure: each block records the min and max values it contains (e.g. 10–324, 375–623, 637–959), so blocks whose range cannot match a query's filter are skipped entirely]
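Two of the I/O savers above can be sketched in a few lines of Python. This is an illustration of the ideas only, not Redshift's on-disk format: `delta` encoding (as chosen for `listid` in the ANALYZE COMPRESSION output) stores differences instead of full values, and zone maps skip blocks whose min/max range cannot match a filter.

```python
def delta_encode(values):
    """Store the first value, then successive differences.

    A sorted ID column like listid yields tiny deltas that compress
    far better than raw 8-byte integers.
    """
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Rebuild the original values by running a cumulative sum."""
    values, running = [], 0
    for d in deltas:
        running += d
        values.append(running)
    return values

def blocks_to_scan(zone_map, lo, hi):
    """Zone-map pruning: keep only blocks whose [min, max] range can
    overlap the query's filter range [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(zone_map)
            if bmax >= lo and bmin <= hi]
```

For example, `delta_encode([100, 101, 103, 106])` returns `[100, 1, 2, 3]`, and with the block ranges from the figure, `blocks_to_scan([(10, 324), (375, 623), (637, 959)], 300, 400)` returns `[0, 1]`: the third block is never read.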
Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
Benefit #1: Amazon Redshift is fast
Hardware optimized for I/O intensive workloads, 4 GB/sec/node
Enhanced networking, over 1 million packets/sec/node
Choice of storage type, instance size
Regular cadence of auto-patched improvements
Benefit #1: Amazon Redshift is fast
"Did I mention that it's ridiculously fast? We're using it to provide our analysts with an alternative to Hadoop"
"After investigating Redshift, Snowflake, and BigQuery, we found that Redshift offers top-of-the-line performance at best-in-market price points"
"On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift"
"…[Redshift] performance has blown away everyone here. We generally see 50-100X speedup over Hive"
"We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily"
"We saw a 2X performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement"
And has gotten faster...
5X query throughput improvement over the past year:
• Memory allocation (launched)
• Improved commit and I/O logic (launched)
• Queue hopping (launched)
• Query monitoring rules (launched)
10X vacuuming performance improvement:
• Ensures data is sorted for efficient and fast I/O
• Reclaims space from deleted rows
• Enhanced vacuum performance leads to better system throughput
The life of a query
1. A client (BI tool, analytics tool, or SQL client) submits a query to the leader node
2. The leader node plans the query and routes it to a WLM queue (e.g. Queue 1 or Queue 2)
3. The compute nodes execute the query in parallel
Query monitoring rules
• Allows automatic handling of runaway (poorly written) queries
• Metrics with operators and values (e.g. query_cpu_time > 1000) create a predicate
• Multiple predicates can be AND-ed together to create a rule
• Multiple rules can be defined for a queue in WLM. These rules are OR-ed together
If {rule} then [action]
{rule: metric operator value}, e.g. rows_scanned > 100000
• Metric: cpu_time, query_blocks_read, rows_scanned, query_execution_time, CPU & I/O skew per slice, join_row_count, etc.
• Operator: <, >, ==
• Value: integer
• [action]: hop, log, abort
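The rule semantics above (predicates AND-ed within a rule, rules OR-ed within a queue) can be sketched as a small evaluator. This is my illustration of the stated logic, not Redshift's WLM implementation; the metric names mirror the slide, and the data structures are hypothetical.

```python
import operator

# The three operators the slide lists for QMR predicates.
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def predicate_holds(metrics, metric, op, value):
    """True if a single predicate (metric operator value) matches."""
    return OPS[op](metrics[metric], value)

def matched_actions(metrics, rules):
    """Evaluate QMR-style rules against a query's metrics.

    rules: list of (name, [(metric, op, value), ...], action).
    Predicates within a rule are AND-ed; rules are OR-ed, so every
    rule whose predicates all hold contributes its action.
    """
    actions = []
    for name, predicates, action in rules:
        if all(predicate_holds(metrics, m, o, v) for m, o, v in predicates):
            actions.append((name, action))
    return actions
```

With a rule such as `("INTERACTIVE", [("query_execution_time", ">", 15)], "hop")`, a query reporting `query_execution_time = 20` would be hopped to the next queue, matching the "protect interactive queues" use case on the next slide.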
Query monitoring rules
• Monitor and control cluster resources consumed by a query
• Get notified; abort and reprioritize long-running / bad queries
• Pre-defined templates for common use cases
Query monitoring rules
Common use cases:
• Protect interactive queues
INTERACTIVE = { "query_execution_time > 15 sec" or
                "query_cpu_time > 1500 uSec" or
                "query_blocks_read > 18000 blocks" } [HOP]
• Monitor ad-hoc queues for heavy queries
AD-HOC = { "query_execution_time > 120" or
           "query_cpu_time > 3000" or
           "query_blocks_read > 180000" or
           "memory_to_disk > 400000000000" } [LOG]
• Limit the number of rows returned to a client
MAXLINES = { "RETURN_ROWS > 50000" } [ABORT]
Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD)          | Price per hour for DS2.XL single node | Effective annual price per TB compressed
On-demand          | $0.850                                | $3,725
1-year reservation | $0.500                                | $2,190
3-year reservation | $0.228                                | $999

DC1 (SSD)          | Price per hour for DC1.L single node  | Effective annual price per TB compressed
On-demand          | $0.250                                | $13,690
1-year reservation | $0.161                                | $8,795
3-year reservation | $0.100                                | $5,500

Pricing is simple:
• Number of nodes x price/hour
• No charge for leader node
• No upfront costs
• Pay as you go
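The "effective annual price per TB" column follows directly from node price × hours per year ÷ compressed storage per node. A quick check, assuming (per the scale figures earlier in the deck) roughly 2 TB of compressed storage per DS2.XL node and 0.16 TB per DC1.L node:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_price_per_tb(price_per_hour, tb_per_node):
    """Effective annual price per compressed TB for one node.

    Cluster price is simply number_of_nodes * price_per_hour;
    there is no charge for the leader node.
    """
    return price_per_hour * HOURS_PER_YEAR / tb_per_node
```

For example, `annual_price_per_tb(0.228, 2.0)` gives about $999/TB/year, matching the 3-year reserved DS2 row, and `annual_price_per_tb(0.250, 0.16)` gives about $13,690/TB/year for on-demand DC1.L.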
Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
• Multiple copies within the cluster
• Continuous and incremental backups to Amazon S3
• Continuous and incremental backups across regions (Region 1 → Region 2)
• Streaming restore
Benefit #3: Amazon Redshift is fully managed
Fault tolerance
• Disk failures
• Node failures
• Network failures
• Availability Zone/region level disasters (backups in Amazon S3 across regions)
Node fault tolerance
Data-path monitoring agents provide node-level monitoring that can detect SW/HW issues and take action:
1. A failure is detected at one of the compute nodes
2. Redshift parks the client connections
3. The failed node is replaced
4. Queries are re-submitted
Cluster-level monitoring agents add a further layer for the leader node and the network.
Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit; ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
  • All blocks on disks and in S3 encrypted
  • Block key, cluster key, master key (AES-256)
  • On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
Benefit #5: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine learning
• Data science
Benefit #6: Amazon Redshift has a large ecosystem
Data integration Business intelligence Systems integrators
Benefit #7: Service-oriented architecture
Amazon Redshift integrates with EC2/SSH, DynamoDB, RDS/Aurora, Amazon ML, EMR, CloudSearch, AWS Data Pipeline, Amazon Kinesis, Amazon Mobile Analytics, and S3
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data in S3
• No ETL: query data in-place using open file formats
• Full Amazon Redshift SQL support
Life of a query:

SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…

The query arrives over JDBC/ODBC at the Amazon Redshift cluster, which pushes S3 scans down to a fleet of Redshift Spectrum nodes (1 … N) for fast execution at exabyte scale. Data lives in Amazon S3 (exabyte-scale object storage); table metadata comes from the Data Catalog (Apache Hive Metastore).
Amazon Redshift Spectrum – Current support
File formats: Parquet, CSV, Sequence, RCFile, ORC (coming soon), RegExSerDe (coming soon)
Compression: Gzip, Snappy, Lzo (coming soon), Bz2
Encryption: SSE with AES256, SSE-KMS with default key
Column types:
• Numeric: bigint, int, smallint, float, double and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a partitioning key
Table types:
• Non-partitioned table (s3://mybucket/orders/..)
• Partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
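The payoff of the partitioned layout above is that a query filtering on the partition column can skip whole S3 prefixes without reading them. A minimal sketch of that pruning, assuming the Hive-style `date=YYYY-MM-DD` key layout shown above (the parsing code is my illustration, not Spectrum's implementation):

```python
def partition_value(key, column="date"):
    """Extract a partition value from a Hive-style S3 key, e.g.
    's3://mybucket/orders/date=2017-06-01/part-0000' -> '2017-06-01'."""
    marker = column + "="
    for part in key.split("/"):
        if part.startswith(marker):
            return part[len(marker):]
    return None  # key is not under a partition for this column

def prune(keys, lo, hi, column="date"):
    """Keep only keys whose partition value falls in [lo, hi].

    ISO YYYY-MM-DD dates compare correctly as plain strings, which is
    one reason that layout is a good choice for a partition key.
    """
    return [k for k in keys
            if lo <= (partition_value(k, column) or "") <= hi]
```

With objects for May, June, and July 2017, `prune(keys, "2017-06-01", "2017-06-30")` keeps only the June prefix; the other months are never scanned (or billed, under pay-per-query).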
The Emerging Analytics Architecture
Storage: Amazon S3 (exabyte-scale object storage), AWS Glue Data Catalog (Hive-compatible metastore)
Serverless compute: Amazon Kinesis Firehose (real-time data streaming), AWS Glue (ETL & data catalog), Amazon Redshift Spectrum (fast @ exabyte scale), AWS Lambda (trigger-based code execution)
Data processing: Amazon EMR (managed Hadoop applications), Amazon Redshift (petabyte-scale data warehousing), Amazon Athena (interactive query)
Over 20 customers helped preview Amazon Redshift Spectrum
Use cases
NTT Docomo: Japan’s largest mobile service provider
Environment:
• 68 million customers
• Tens of TBs per day of data across a mobile network
• 6 PB of total data (uncompressed)
• Data science for marketing operations, logistics, and so on
• Greenplum on-premises
Challenges:
• Scaling challenges
• Performance issues
• Need same level of security
• Need for a hybrid environment
NTT Docomo: Japan’s largest mobile service provider
• 125-node DS2.8XL cluster
• 4,500 vCPUs, 30 TB RAM
• 2 PB compressed
Pipeline: data sources feed an ET forwarder/loader with state management into S3, which loads Amazon Redshift (with a sandbox); clients connect over AWS Direct Connect
Results:
• 10x faster analytic queries
• 50% reduction in time for new BI application deployment
• Significantly less operations overhead
Nasdaq: powering 100 marketplaces in 50 countries
Environment:
• Orders, quotes, trade executions, market "tick" data from 7 exchanges
• 7 billion rows/day
• Analyze market share, client activity, surveillance, billing, and so on
• Microsoft SQL Server on-premises
Challenges:
• Expensive legacy DW ($1.16M/yr.)
• Limited capacity (1 yr. of data online)
• Needed lower TCO
• Must satisfy multiple security and regulatory requirements
• Required similar performance
Nasdaq: powering 100 marketplaces in 50 countries
• 23-node DS2.8XL cluster
• 828 vCPUs, 5 TB RAM
• 368 TB compressed
• 2.7 T rows, 900 B derived
• 8 tables with 100 B rows
• 7 man-month migration
• ¼ the cost, 2x storage, room to grow
• Faster performance, very secure
Amazon.com clickstream analytics
Web log analysis for Amazon.com
• PBs workload, 2 TB/day @ 67% YoY growth
• Largest table: 400 TB
Goal: understand customer behavior
Previous solution:
• Legacy DW (Oracle) – query across 1 week of data per hour
• Hadoop – query across 1 month of data per hour
Results with Amazon Redshift:
• Query 15 months in 14 min
• Load 5B rows in 10 min
• 21B w/ 10B rows: 3 days to 2 hrs (Hive → Redshift)
• Load pipeline: 90 hrs to 8 hrs (Oracle → Redshift)
• 100-node DS2.8XL clusters
• Easy resizing
• Managed backups and restore
• Failure tolerance and recovery
• 20% time of one DBA
• Increased productivity
Resources
Detail Pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
• https://aws.amazon.com/redshift/developer-resources/
• Amazon Redshift Utilities - GitHub
Best Practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Thank you!