De-Identifying Data in Snowflake and Amazon Redshift
Harold Byun
VP Products
© 2020, Baffle. All rights reserved. Confidential & Proprietary 1
Introduction
• Overview of Data Analytics Trends and the Move to Cloud Data Lakes
• Key Data Privacy Challenges
• Methods for Data De-Identification
• Architecture Models to Support a De-Identified Data Pipeline
• Live Demo of De-Identification and Data Processing
• A Glimpse Into Privacy Preserving and Advanced Data Analytics
• Q&A
Questions throughout – use the chat panel
Speaker Bio
Harold Byun is VP of Products at Baffle, an end-to-end data-centric protection company. His career has
focused on data containment and security technologies including data loss prevention and activity
monitoring, cloud access security broker, and mobile data containment capabilities. He holds several
data security related patents.
Overview of Data Analytics Trends and the Move to Cloud Data Lakes
AI and Big Data are a Big Deal
Trends Impacting Cloud Data Analytics and Data Lakes
1 • By the end of 2024, 75% of organizations will shift from piloting to operationalizing artificial intelligence
(AI), driving a 5 times increase in streaming data and analytics infrastructures. (Gartner)
2 • By 2022, 35% of large organizations will be either sellers or buyers of data via formal online data
marketplaces, up from 25% in 2020. (Gartner)
3 • Existing on-premises big data environments remain static and are running out of room
4 • A significant move to leverage cloud-based data lakes for analytics and AI/ML
5 • Continued inadvertent exposure of data in aggregated environments
Moving to Cloud-based Data Lakes
ENTERPRISE – CURRENT STATE
APPLICATIONS
DATA STORES
DISTRIBUTED DATA
Key Data Privacy Challenges
Continued Data Exposure or Leakage
1 Data breaches continue unabated
Data loss and leakage is the #1 cloud security concern (2019 Cloud Security Report)
2 Third party risk and data sharing
~60% of CISOs have reported data leakage via a third party in 2018. (Ponemon Institute)
3 Cloud storage data leaks continue
Over 1 billion records leaked and an estimated 11% of cloud storage left open to public
Data Analytics Challenges
Q: What are the biggest data management/analytics challenges faced by your organization?
Source: 451 Research’s Voice of the Enterprise: Data & Analytics, 1H 2019
Privacy Around the World
• GDPR, CCPA and other privacy regulations taking effect
• Financial penalties and brand impact are more severe
Source: https://www.dlapiperdataprotection.com/index.html?t=about&c=AO
Data Privacy Enforced
Source: 451 Research’s Voice of the Enterprise: Data & Analytics, 1H 2019
Data Privacy Resources
• Gartner Report on Privacy Preserving Analytics
• CCPA Compliance Simplified White Paper
• Encryption Simplified
Privacy? So What, You’re Going to Collect Data Anyway
Continued Data Exposure or Leakage
Source: Gartner, “Securing the Data and Advanced Analytics Pipeline”, 27 Jan 2020
Methods for Data De-Identification
Infrastructure vs. Data
Customer responsibility: “Security in the Cloud”
AWS responsibility: “Security of the Cloud”
AWS is responsible for protecting the infrastructure that runs all of the services offered in the AWS Cloud.
Existing Infrastructure Control Methods
NOTE: This is not an exhaustive list
AWS
Block S3 public access
Bucket ACLs
IAM Roles for controlling access from instances
Monitoring and Logging:
- Policy-based discovery for open principal access "*"
- ListBucket assessments
- Access monitoring with CloudWatch, CloudTrail
- Discovery via Macie
Encryption at-rest:
- SSE-S3 – Server-side encryption with AWS Managed Keys
- SSE-KMS – Server-side encryption with customer keys stored in AWS KMS
- SSE-C – Server-side encryption with customer provided keys
- Client-Side Encryption – Data is encrypted before upload using client encryption
HTTPS / TLS – Encryption in-transit
VPC Endpoints – Establishes S3 connectivity via VPC to prevent traffic from traversing the public internet
Azure
Azure AD integration for authorization to Azure Blob Storage
Azure AD, roles and shared access signatures (SAS)
Shared Access Signatures – SAS allows for a URI with resource and query parameters to restrict access and authorization to storage resources. Can be established as a service or user delegation
Monitoring and Logging:
- Advanced Threat Protection
- Access monitoring via Azure Monitor
Encryption at-rest:
- Enabled by default for all blobs
- Microsoft-managed keys – blob encryption using a Microsoft key store
- Azure Key Vault – Customer-managed keys to encrypt blob storage and Azure files
- Customer-provided keys – customer owned key store used to encrypt blobs
HTTPS / TLS – Encryption in-transit
Azure Private Endpoints – Enables connectivity via a virtual network (VNet) to prevent traffic from traversing the public internet
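The "policy-based discovery for open principal access" item can be illustrated with a small check over a bucket policy document. This is a minimal sketch, not a description of any AWS tool: the function name, the bucket name, and the example policy are hypothetical, and the check ignores `Condition` blocks, so a flagged statement is only a candidate for review.

```python
import json


def open_principal_statements(policy_json: str) -> list:
    """Return the Allow statements whose Principal is the wildcard "*"."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*":
            flagged.append(stmt)
        elif isinstance(principal, dict):
            # Principal may also be written as {"AWS": "*"} or {"AWS": ["*", ...]}
            values = principal.get("AWS", [])
            if isinstance(values, str):
                values = [values]
            if "*" in values:
                flagged.append(stmt)
    return flagged


# Hypothetical policy for illustration only.
EXAMPLE_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
        {
            "Sid": "AccountOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::example-bucket",
        },
    ],
})

if __name__ == "__main__":
    for stmt in open_principal_statements(EXAMPLE_POLICY):
        print("Open principal:", stmt["Sid"])
```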
Common Methods for De-Identification
Supported Data Protection Modes and Descriptions
• Data Encryption – Table or column-based encryption using randomized or deterministic AES-CTR encryption, or FPE
• Secure Data Tokenization (TOK) – Uses deterministic AES encryption to generate a deterministic encrypted transform for a given value. Can be applied to support JOINs and foreign key constraints to preserve referential integrity. Does NOT use code book method
• Format Preserving Encryption (FPE) – Supports encryption where the cipher text output has the same form as the input. Preserves length of the data type. Can be applied to support JOINs and foreign key constraints to preserve referential integrity. Does NOT use code book method. Cannot be used in conjunction with RLE or Advanced Encryption. Baffle uses NIST approved FF1 and FF3-1 algorithms for FPE
• Data Masking – Supports a library of masking formats that protects data at the presentation layer to prevent users from viewing data in the clear. Masking can be applied using static alphanumeric characters, randomly generated data values, and/or partially masked data values. Masking can be applied to both clear text and/or encrypted data
• Role-based Data Masking – Supports role or group-based policies in conjunction with data masking policies to restrict viewing of data based on group membership or other attribution.
• Advanced Encryption (SMPC) – Support for privacy preserving analytics and secure data sharing on encrypted table or columnar data using randomized AES and secure multiparty compute (SMPC). This encryption mode facilitates operations and analytics on encrypted data across multiple parties without revealing data to other participating parties.
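Two of the modes above, deterministic tokenization and role-based masking, can be sketched in a few lines. This is an illustration under stated assumptions, not Baffle's implementation: the slide describes deterministic AES, whereas this sketch substitutes keyed HMAC-SHA256 (which is one-way, so tokens cannot be decrypted) because it is available in the Python standard library. The key, role name, and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would fetch keys from a key store.
DEMO_KEY = b"demo-key-not-for-production"

# Hypothetical policy: roles allowed to see clear text.
CLEAR_TEXT_ROLES = {"privacy_officer"}


def tokenize(value: str, key: bytes = DEMO_KEY) -> str:
    """Deterministic keyed token: the same value and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask_partial(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Partial mask: hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def present(value: str, role: str) -> str:
    """Role-based masking: privileged roles see clear text, others a partial mask."""
    return value if role in CLEAR_TEXT_ROLES else mask_partial(value)


if __name__ == "__main__":
    ssn = "123-45-6789"
    print(tokenize(ssn) == tokenize(ssn))   # equal inputs map to equal tokens
    print(present(ssn, "analyst"))          # masked view
    print(present(ssn, "privacy_officer"))  # clear view
```

Because equal inputs produce equal tokens, a tokenized column can still serve as a join key or foreign key, which is the referential-integrity property the table describes.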
Objects Encryption vs. Data-Centric Encryption
ENCRYPTED DATA
CLEAR TEXT DATA
Key Benefits
• De-identify, tokenize or encrypt data INSIDE objects and files
• Safe harbor under key privacy and compliance regulations in the event of accidental data leaks
• Accelerate cloud-based data analytics programs by addressing key security and privacy
concerns
Architecture Models for a De-Identified Data Pipeline
Data Pipeline Architecture
Data Pipeline Example
Example of a De-Identified Pipeline
Diagram components: on-premise database, AWS DMS, Baffle Shield, S3 bucket (encrypted data), AWS Glue, AWS Athena, AWS EMR, AWS Redshift, Snowflake
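The idea behind this pipeline, de-identifying sensitive fields before the data lands in cloud storage and then running analytics on the protected values, can be sketched end to end with the Python standard library. Everything here is a stand-in chosen for illustration: CSV strings play the role of replicated source data, HMAC-SHA256 tokens stand in for the encryption applied by the shield, and an in-memory SQLite database stands in for Redshift or Snowflake. Column names and the key are hypothetical.

```python
import csv
import hashlib
import hmac
import io
import sqlite3

DEMO_KEY = b"demo-key-not-for-production"  # hypothetical; real keys live in a key store

# Hypothetical extracts from the source database.
PATIENTS_CSV = "name,state\nJohn Doe,CA\nJane Doe,NY\n"
VISITS_CSV = "name,cost\nJohn Doe,100\nJohn Doe,50\nJane Doe,70\n"


def tokenize(value: str) -> str:
    """Deterministic keyed token (stand-in for the shield's encryption)."""
    return hmac.new(DEMO_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def deidentify(clear_csv: str, sensitive: set) -> list:
    """Tokenize the sensitive columns of every row before it leaves the source side."""
    rows = []
    for row in csv.DictReader(io.StringIO(clear_csv)):
        rows.append({k: tokenize(v) if k in sensitive else v for k, v in row.items()})
    return rows


def run_pipeline() -> list:
    patients = deidentify(PATIENTS_CSV, {"name"})
    visits = deidentify(VISITS_CSV, {"name"})

    warehouse = sqlite3.connect(":memory:")  # stand-in for Redshift or Snowflake
    warehouse.execute("CREATE TABLE patients (name_tok TEXT, state TEXT)")
    warehouse.execute("CREATE TABLE visits (name_tok TEXT, cost INTEGER)")
    warehouse.executemany(
        "INSERT INTO patients VALUES (?, ?)",
        [(r["name"], r["state"]) for r in patients],
    )
    warehouse.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [(r["name"], int(r["cost"])) for r in visits],
    )

    # The JOIN runs on tokens alone: equal names produce equal tokens.
    return warehouse.execute(
        "SELECT p.state, SUM(v.cost) FROM patients p "
        "JOIN visits v ON p.name_tok = v.name_tok "
        "GROUP BY p.state ORDER BY p.state"
    ).fetchall()


if __name__ == "__main__":
    print(run_pipeline())
```

The warehouse never sees a patient name, yet the join and aggregation still work because the tokens are deterministic.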
Live Demo
Baffle / Snowflake Integration
Diagram components: masking profile, Azure Blob Storage, Azure API Management, Baffle Azure Functions, Azure Key Vault
Baffle’s Data Protection Service Architecture
Make data breaches irrelevant
Baffle Manager
• Cloud-based management console for all data encryption and key management across the enterprise
• Comprehensive compliance and audit reporting
• Provides protection for applications, business intelligence tools, containers and serverless code
Baffle Shield
• Restricts access and decryption to calling application
• Enables data access monitoring to track anomalies
• No changes to the application required
• Supports a variety of databases including Amazon RDS
Baffle Secure Multiparty Compute (SMPC)
• Delivered as a software solution that automates the encryption process for any application on any database
• Dynamic access control
• Comprehensive compliance monitoring
• Requires that user defined functions (UDFs) are deployed
Diagram layers: Application Tier, SQL Interface (JDBC, ODBC), Database Tier, Physical Storage
A Glimpse Into Privacy Preserving Analytics
Privacy Preserving Analytics
What is it?
• A computational method that allows for operations, processing and analysis of data without
revealing the underlying data values or violating the data privacy contract.
Gartner Report on Privacy Preservation in Analytics
More info and resources: https://baffle.io/privacy
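As a minimal illustration of computing on data without revealing the underlying values, the sketch below uses additive secret sharing, a standard building block of secure multiparty computation. It is not Baffle's protocol: the function names, the modulus, and the example inputs are assumptions chosen for the example.

```python
import secrets

# Arithmetic is done modulo a large prime so that individual shares look random.
PRIME = 2**61 - 1


def share(value: int, n_parties: int) -> list:
    """Split `value` into n random-looking shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list) -> int:
    """Recombine shares; any proper subset reveals nothing about the value."""
    return sum(shares) % PRIME


def secure_sum(private_inputs: list) -> int:
    """Sum the inputs without any single party seeing another party's value."""
    n = len(private_inputs)
    # Each party splits its own input into n shares, one destined for each party.
    all_shares = [share(v, n) for v in private_inputs]
    # Party j adds up the j-th share of every input; it only handles random-looking numbers.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Only the partial sums are combined, which reveals the total and nothing else.
    return reconstruct(partial_sums)


if __name__ == "__main__":
    print(secure_sum([120, 45, 300]))
```

Each party learns the total but not the other parties' inputs, which is the property the definition above describes.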
USE CASE
Data as a Service - 3rd Party Data Access Control
1 3rd party organizations can be granted granular access to a subset of a data store
2 Companies better control access to data and enable a centralized informational model
Key Benefits
• Organizations can control and minimize data sharing via a centralized data model
• Rather than spend time vetting 3rd parties via questionnaires and then giving them your data, allow them to securely integrate into your centralized data management structure
• Achieve the benefits of sourcing specific operations, without compromising your security posture
Diagram labels: Vendor 1, Vendor 2; Table/Col 1 (ABC Key), Table/Col 2 (XYZ Key)
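Granting a third party access to only a subset of a data store can be pictured as a column-scoped view. This is a simplified sketch with hypothetical vendor and column names; it models the access decision only, whereas the slide's diagram suggests the scoping is enforced with a separate key per table or column.

```python
# Hypothetical grants: each vendor is scoped to specific columns.
GRANTS = {
    "vendor_1": {"col_1"},
    "vendor_2": {"col_2"},
}


def vendor_view(rows: list, vendor: str) -> list:
    """Return only the columns the vendor has been granted; unknown vendors get none."""
    allowed = GRANTS.get(vendor, set())
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


if __name__ == "__main__":
    data = [{"col_1": "a", "col_2": "b"}]
    print(vendor_view(data, "vendor_1"))
    print(vendor_view(data, "vendor_2"))
```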
USE CASE
Healthcare Data Sharing
Diagram labels: VPC 1, VPC 2; Org 1 and Org 1 BaffleShield, with a keystore holding Key ID 1; Org 2 and Org 2 BaffleShield, with a keystore holding Key ID 2; shared encrypted DB; SMPC servlets
1 Org 1 publishes health information on a patient to a shared database, encrypting the patient data with their own encryption key.
Step 1 example: Publish: John Doe, Has_Condition = ‘Yes’ → ABCDEF, Has_Condition = ‘DEF123459’
2 There are no encryption keys present in the shared database and no access to keys.
3 Org 2 queries to confirm if Org 1 has information on patients with a given condition. The patient PHI is encrypted using Org 2’s encryption key.
Publish: Patient, Jane Doe, Has_Condition = ‘Yes’ → XYZGHI, Has_Condition = ‘AEFEWDCDSW’
4 SMPC performs a comparison operation across values encrypted with different keys, without ever exposing the data values. The results are returned without decrypting the data.
Query: Patients = Has_Condition = ‘Yes’ → XYZGHI, Has_Condition = ‘AEFEWDCDSW’
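Steps 1 and 2 of this walkthrough, where each organization encrypts with its own key and the shared database never holds a key, can be sketched as follows. This is an illustration only and makes several assumptions: the cipher is a toy keystream built from HMAC-SHA256 so the example runs on the standard library (the deck names AES, and this toy construction should not be used to protect real data), and the keystore layout, record fields, and function names are hypothetical. The cross-key comparison in steps 3 and 4 is not modeled here.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from HMAC-SHA256 in counter mode (illustration only)."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        blocks.append(hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest())
        counter += 1
    return b"".join(blocks)[:length]


def encrypt(key: bytes, plaintext: str) -> bytes:
    """Return nonce + ciphertext; a fresh nonce randomizes every encryption."""
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


# Each organization holds its own keystore; the shared database holds no keys.
ORG1_KEYSTORE = {1: secrets.token_bytes(32)}
ORG2_KEYSTORE = {2: secrets.token_bytes(32)}
SHARED_DB = []


def publish(keystore: dict, key_id: int, patient: str, has_condition: str) -> None:
    """Encrypt on the publisher's side and store only the key ID and ciphertext."""
    key = keystore[key_id]
    SHARED_DB.append({
        "key_id": key_id,
        "patient": encrypt(key, patient),
        "has_condition": encrypt(key, has_condition),
    })


# Step 1: Org 1 publishes a record under its own key.
publish(ORG1_KEYSTORE, 1, "John Doe", "Yes")

if __name__ == "__main__":
    record = SHARED_DB[0]
    print(record["key_id"], record["patient"].hex())
    print(decrypt(ORG1_KEYSTORE[1], record["patient"]))
```

The shared record carries a key ID but no key material, so only the holder of the matching keystore can recover the clear text.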
Summary
• Leverage cloud data lakes to enable flexibility and accommodate data growth easily
• Implement data-centric protection methods to reduce the risk of data leakage
• Leverage de-identification capabilities to accelerate analytics and data monetization efforts
that still comply with data privacy regulations
• Examine operational models that minimize impact to DevOps and business data flows
Data Privacy Resources
• Simplifying Encryption White Paper
• Gartner Report on Privacy Preserving Analytics
• Video Talks and 1:1 Technical Consultation
Q&A
Thank You!
[email protected]