De-Identify Snowflake and Redshift

The document discusses a presentation on de-identifying data in Snowflake and Amazon Redshift. It provides an overview of data analytics trends driving organizations to move workloads to cloud data lakes. It also covers key data privacy challenges around continued data exposure, third party risks, and new regulations. Finally, it discusses various methods for de-identifying data like encryption, tokenization, and generalization.

De-Identifying Data in Snowflake and Amazon Redshift

Harold Byun
VP Products

© 2020, Baffle. All rights reserved. Confidential & Proprietary 1


Introduction

• Overview of Data Analytics Trends and the Move to Cloud Data Lakes

• Key Data Privacy Challenges

• Methods for Data De-Identification

• Architecture Models to Support a De-Identified Data Pipeline

• Live Demo of De-Identification and Data Processing

• A Glimpse Into Privacy Preserving and Advanced Data Analytics

• Q&A

Questions throughout – use the chat panel

Email [email protected], [email protected]



Speaker Bio

Harold Byun is VP of Products at Baffle, an end-to-end data-centric protection company. His career has
focused on data containment and security technologies including data loss prevention and activity
monitoring, cloud access security broker, and mobile data containment capabilities. He holds several
data security related patents.



Overview of Data Analytics Trends and
the Move to Cloud Data Lakes



AI and Big Data are a Big Deal



Trends Impacting Cloud Data Analytics and Data Lakes

1 • By the end of 2024, 75% of organizations will shift from piloting to operationalizing artificial intelligence (AI), driving a fivefold increase in streaming data and analytics infrastructures. (Gartner)

2 • By 2022, 35% of large organizations will be either sellers or buyers of data via formal online data marketplaces, up from 25% in 2020. (Gartner)

3 • Existing on-premises big data environments remain static and are running out of room

4 • A significant move to leverage cloud-based data lakes for analytics and AI/ML

5 • Continued inadvertent exposure of data in aggregated environments



Moving to Cloud-based Data Lakes
ENTERPRISE – CURRENT STATE

APPLICATIONS

DATA STORES

DISTRIBUTED DATA



Key Data Privacy Challenges



Continued Data Exposure or Leakage
1 • Data breaches continue unabated: Data loss and leakage is the #1 cloud security concern (2019 Cloud Security Report)

2 • Third party risk and data sharing: ~60% of CISOs have reported data leakage via a third party in 2018 (Ponemon Institute)

3 • Cloud storage data leaks continue: Over 1 billion records leaked and an estimated 11% of cloud storage left open to the public


Data Analytics Challenges
Q: What are the biggest data
management/analytics challenges
faced by your organization?

Source: 451 Research’s Voice of the Enterprise: Data & Analytics, 1H 2019


Privacy Around the World

• GDPR, CCPA and other privacy regulations taking effect

• Financial penalties and brand impact are more severe

Source: https://www.dlapiperdataprotection.com/index.html?t=about&c=AO


Data Privacy Enforced

Source: 451 Research’s Voice of the Enterprise: Data & Analytics, 1H 2019


Data Privacy Resources

• Gartner Report on Privacy Preserving Analytics

• CCPA Compliance Simplified

• Encryption Simplified White Paper


Privacy? So What, You’re Going to Collect Data Anyway



Continued Data Exposure or Leakage

Source: Gartner, “Securing the Data and Advanced Analytics Pipeline”, 27 Jan 2020



Methods for Data De-Identification



Infrastructure vs. Data

Customer responsibility: “Security in the Cloud”

AWS responsibility: “Security of the Cloud”

AWS is responsible for protecting the infrastructure that runs all of the services offered in the AWS Cloud.


Existing Infrastructure Control Methods
NOTE: This is not an exhaustive list

AWS

• Block S3 public access
• Bucket ACLs
• IAM roles for controlling access from instances
• Monitoring and logging:
- Policy-based discovery for open principal access ("*")
- ListBucket assessments
- Access monitoring with CloudWatch, CloudTrail
- Discovery via Macie
• Encryption at rest:
- SSE-S3 – Server-side encryption with AWS managed keys
- SSE-KMS – Server-side encryption with customer keys stored in AWS KMS
- SSE-C – Server-side encryption with customer-provided keys
- Client-side encryption – Data is encrypted before upload using client encryption
• HTTPS / TLS – Encryption in transit
• VPC Endpoints – Establishes S3 connectivity via VPC to prevent traffic from traversing the public internet

Azure

• Azure AD integration for authorization to Azure Blob Storage
• Azure AD, roles and shared access signatures (SAS)
• Shared access signatures – SAS allows for a URI with resource and query parameters to restrict access and authorization to storage resources. Can be established as a service or user delegation
• Monitoring and logging:
- Advanced Threat Protection
- Access monitoring via Azure Monitor
• Encryption at rest:
- Enabled by default for all blobs
- Microsoft-managed keys – Blob encryption using a Microsoft key store
- Azure Key Vault – Customer-managed keys to encrypt blob storage and Azure Files
- Customer-provided keys – Customer-owned key store used to encrypt blobs
• HTTPS / TLS – Encryption in transit
• Azure Private Endpoints – Enables connectivity via a virtual network (VNet) to prevent traffic from traversing the public internet
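
The "policy-based discovery for open principal access" item in the list above can be illustrated with a small check over a bucket policy document. This is a minimal sketch, assuming the policy is already available as JSON text; the function name `open_principal_statements`, the sample policy, the bucket name and the account ID are all hypothetical, and Condition blocks are deliberately not evaluated.

```python
import json


def _as_list(value):
    """Policy fields may hold a single item or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def open_principal_statements(policy_json: str) -> list:
    """Return Allow statements whose Principal is the wildcard "*".

    Covers both spellings used in policies: "Principal": "*" and
    "Principal": {"AWS": "*"}. Condition blocks are not evaluated, so a hit
    means "review this statement", not "this bucket is public".
    """
    policy = json.loads(policy_json)
    hits = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*":
            hits.append(stmt)
        elif isinstance(principal, dict) and "*" in _as_list(principal.get("AWS", [])):
            hits.append(stmt)
    return hits


# Hypothetical policy: one wildcard statement, one scoped to a single role.
SAMPLE_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "PublicRead", "Effect": "Allow", "Principal": "*",
         "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Sid": "AnalyticsRole", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::111122223333:role/analytics"},
         "Action": "s3:ListBucket", "Resource": "arn:aws:s3:::example-bucket"},
    ],
})

if __name__ == "__main__":
    for stmt in open_principal_statements(SAMPLE_POLICY):
        print("open principal:", stmt.get("Sid"))
```

A check like this only flags candidates for review. It complements, and does not replace, the account-level and bucket-level controls listed above.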



Common Methods for De-Identification
Supported data protection modes and their descriptions:

• Data Encryption: Table or column-based encryption using randomized or deterministic AES-CTR encryption, or FPE

• Secure Data Tokenization (TOK): Uses deterministic AES encryption to generate a deterministic encrypted transform for a given value. Can be applied to support JOINs and foreign key constraints to preserve referential integrity. Does NOT use code book method

• Format Preserving Encryption (FPE): Supports encryption where the cipher text output has the same form as the input. Preserves length of the data type. Can be applied to support JOINs and foreign key constraints to preserve referential integrity. Does NOT use code book method. Cannot be used in conjunction with RLE or Advanced Encryption. Baffle uses NIST approved FF1 and FF3-1 algorithms for FPE

• Data Masking: Supports a library of masking formats that protect data at the presentation layer to prevent users from viewing data in the clear. Masking can be applied using static alphanumeric characters, randomly generated data values, and/or partially masked data values. Masking can be applied to both clear text and/or encrypted data

• Role-based Data Masking: Supports role or group-based policies in conjunction with data masking policies to restrict viewing of data based on group membership or other attribution.

• Advanced Encryption (SMPC): Support for privacy preserving analytics and secure data sharing on encrypted table or columnar data using randomized AES and secure multiparty compute (SMPC). This encryption mode facilitates operations and analytics on encrypted data across multiple parties without revealing data to other participating parties.
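
The property that makes deterministic tokenization usable for JOINs is that equal inputs always produce equal tokens. The sketch below illustrates that property and a simple partial mask. It is an assumption-laden illustration, not the product's method: HMAC-SHA256 from the Python standard library is swapped in for deterministic AES, so unlike encryption the tokens cannot be reversed, and the key, function names and sample values are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would come from a key store.
DEMO_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = DEMO_KEY) -> str:
    """Deterministic keyed token: same (key, value) always yields the same token.

    HMAC-SHA256 is a stand-in for deterministic AES here. Unlike encryption,
    it cannot be reversed to recover the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; shorter tokens raise collision risk


def mask_partial(value: str, visible: int = 4, fill: str = "*") -> str:
    """Partial masking: keep only the last `visible` characters."""
    if visible <= 0:
        return fill * len(value)
    return fill * max(len(value) - visible, 0) + value[-visible:]


# Two tables keyed on the same identifier, tokenized independently.
customers = {tokenize("123-45-6789"): "customer row"}
orders = [(tokenize("123-45-6789"), "order A"), (tokenize("987-65-4321"), "order B")]

# The join still works on tokens because equal inputs give equal tokens.
joined = [(customers[tok], order) for tok, order in orders if tok in customers]

if __name__ == "__main__":
    print(joined)
    print(mask_partial("123-45-6789"))
```

The same determinism that preserves referential integrity also reveals which rows share a value, which is why the modes above distinguish deterministic from randomized encryption.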



Objects Encryption vs. Data-Centric Encryption
ENCRYPTED DATA

CLEAR TEXT DATA



Key Benefits

• De-identify, tokenize or encrypt data INSIDE objects and files

• Safe harbor under key privacy and compliance regulations in the event of accidental data leaks

• Accelerate cloud-based data analytics programs by addressing key security and privacy
concerns
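
To make the first benefit concrete, the sketch below rewrites selected fields inside a CSV document before it would be written to object storage, so the sensitive values stay protected even if the object itself is exposed. It is a minimal illustration under stated assumptions: the column names, the key and the helper names are hypothetical, and a keyed HMAC stands in for whichever field-level tokenization or encryption mode is actually used.

```python
import csv
import hashlib
import hmac
import io

DEMO_KEY = b"demo-key-not-for-production"  # hypothetical; use a managed key in practice
SENSITIVE_COLUMNS = {"ssn", "email"}       # hypothetical column names


def protect_field(value: str) -> str:
    """Keyed, deterministic stand-in for field-level tokenization or encryption."""
    return hmac.new(DEMO_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def deidentify_csv(text: str) -> str:
    """Rewrite sensitive columns of a CSV document, leaving other columns in the clear."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in SENSITIVE_COLUMNS & set(row):
            row[col] = protect_field(row[col])
        writer.writerow(row)
    return out.getvalue()


RAW = "id,ssn,email,state\n1,123-45-6789,jane@example.com,CA\n"

if __name__ == "__main__":
    print(deidentify_csv(RAW))
```

Non-sensitive columns such as `state` remain usable for analytics, which is the point of protecting data inside the object rather than only encrypting the object as a whole.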



Architecture Models for a De-Identified Data Pipeline



Data Pipeline Architecture



Data Pipeline Example



Example of a De-Identified Pipeline

[Diagram labels: On-premise Database, AWS DMS, Baffle Shield, S3 Bucket, Encrypted Data, AWS Glue, AWS Athena, AWS EMR, AWS Redshift, Snowflake]


Live Demo



Baffle / Snowflake Integration

[Diagram labels: Masking Profile, Azure Blob Storage, Azure API Management, Baffle, Azure Functions, Azure Key Vault]


Baffle’s Data Protection Service Architecture
Make data breaches irrelevant
[Diagram layers: Application Tier, SQL Interface (JDBC, ODBC), Database Tier, Physical Storage]

Baffle Manager
• Cloud-based management console for all data encryption and key management across the enterprise
• Comprehensive compliance and audit reporting
• Provides protection for applications, business intelligence tools, containers and serverless code

Baffle Shield
• Restricts access and decryption to calling application
• Enables data access monitoring to track anomalies
• No changes to the application required
• Supports a variety of databases including Amazon RDS

Baffle Secure Multiparty Compute (SMPC)
• Delivered as a software solution that automates the encryption process for any application on any database
• Dynamic access control
• Comprehensive compliance monitoring
• Requires that user defined functions (UDFs) are deployed


A Glimpse Into Privacy Preserving Analytics



Privacy Preserving Analytics
What is it?

• A computational method that allows for operations, processing and analysis of data without
revealing the underlying data values or violating the data privacy contract.
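
One way to see how a computation can run without revealing its inputs is additive secret sharing, a common building block in multiparty computation. The sketch below is a generic illustration of that building block and makes no claim about how any particular product implements it; the modulus, function names and input values are assumptions chosen for the example.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; any value larger than the expected total works


def share(value: int, parties: int = 3) -> list:
    """Split `value` into random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: list) -> int:
    """Recombine shares; only meaningful when all shares are present."""
    return sum(shares) % MODULUS


def secure_sum(private_values: list) -> int:
    """Each owner shares its value; each party adds the shares it holds locally.

    Only the per-party partial sums are combined, so the total is revealed
    while the individual inputs are not.
    """
    parties = len(private_values)
    all_shares = [share(v, parties) for v in private_values]
    partial_sums = [sum(col) % MODULUS for col in zip(*all_shares)]
    return reconstruct(partial_sums)


if __name__ == "__main__":
    print(secure_sum([120, 75, 310]))
```

Any single share is a uniformly random number, so a party holding one share learns nothing about the value it came from.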

Gartner Report on Privacy Preservation in Analytics

More info and resources: https://baffle.io/privacy


USE CASE

Data as a Service - 3rd Party Data Access Control

1 3rd party organizations can be granted granular access to a subset of a data store

2 Companies better control access to data and enable a centralized informational model

[Diagram labels: Vendor 1, Vendor 2, Table/Col 1 with ABC Key, Table/Col 2 with XYZ Key]

Key Benefits

• Organizations can control and minimize data sharing via a centralized data model

• Rather than spend time vetting 3rd parties via questionnaires and then giving them your data, allow them to securely integrate into your centralized data management structure

• Achieve the benefits of sourcing specific operations, without compromising your security posture
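
The idea of granting each third party access to only a subset of a data store can be sketched as a mapping from columns to keys and from vendors to the keys they are entitled to use. Everything in this example is hypothetical: the column names, the vendor-to-key assignments and the `view_for` helper are illustrative, and the masking stands in for "cannot decrypt".

```python
# Hypothetical mapping of columns to the key that protects them.
COLUMN_KEYS = {"diagnosis": "ABC", "billing_total": "XYZ"}

# Hypothetical entitlements: which key identifiers each vendor may use.
VENDOR_KEYS = {"vendor1": {"ABC"}, "vendor2": {"XYZ"}}

MASK = "****"


def view_for(vendor: str, row: dict) -> dict:
    """Return the row as `vendor` would see it.

    Columns protected by a key the vendor is entitled to are shown; other
    protected columns are masked. Unprotected columns pass through unchanged.
    """
    allowed = VENDOR_KEYS.get(vendor, set())
    visible = {}
    for column, value in row.items():
        key_id = COLUMN_KEYS.get(column)
        visible[column] = value if key_id is None or key_id in allowed else MASK
    return visible


ROW = {"patient_id": 7, "diagnosis": "asthma", "billing_total": 120}

if __name__ == "__main__":
    print(view_for("vendor1", ROW))
    print(view_for("vendor2", ROW))
```

Because access follows the key rather than the copy of the data, both vendors can work against the same centralized store.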
USE CASE

Healthcare Data Sharing

[Diagram labels: VPC 1, VPC 2, Org 1, Org 1 BaffleShield, Shared Encrypted DB, Org 2 BaffleShield, Org 2, Keystore with Key ID 1, Keystore with Key ID 2, SMPC Servlets]

1 Org 1 publishes health information on a patient to a shared database, encrypting the patient data with their own encryption key.


Example for step 1: Publish: John Doe, Has_Condition = ‘Yes’ → ABCDEF, Has_Condition = ‘DEF123459’


2 There are no encryption keys present in the shared database and no access to keys.

3 Org 2 queries to confirm if Org 1 has information on patients with a given condition. The patient PHI is encrypted using Org 2’s encryption key.

Example for step 3: Publish: Patient, Jane Doe, Has_Condition = ‘Yes’ → XYZGHI, Has_Condition = ‘AEFEWDCDSW’


4 SMPC performs a comparison operation on data encrypted using different keys without ever accessing the decrypted data values. The results are returned without decrypting the data.

Example for step 4: Query: Patients = Has_Condition = ‘Yes’ → XYZGHI, Has_Condition = ‘AEFEWDCDSW’


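
To give a feel for how two values protected under different keys can be compared without decrypting either one, the sketch below uses commutative blinding by modular exponentiation, a technique found in some private set intersection protocols. It is swapped in purely for illustration and is not a description of the SMPC protocol in the slides. The modulus is toy-sized and not a secure parameter choice, and all names and values are assumptions.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, chosen only so the demo is self-contained


def new_secret() -> int:
    """Pick a random exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def to_group(value: str) -> int:
    """Hash a value to a nonzero element modulo P."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def blind(element: int, secret: int) -> int:
    """Apply one party's key; blinding with two keys commutes."""
    return pow(element, secret, P)


org1_key, org2_key = new_secret(), new_secret()

# Each org blinds its own value, then the other org blinds it again.
published = blind(blind(to_group("Has_Condition=Yes"), org1_key), org2_key)
query_hit = blind(blind(to_group("Has_Condition=Yes"), org2_key), org1_key)
query_miss = blind(blind(to_group("Has_Condition=No"), org2_key), org1_key)

if __name__ == "__main__":
    print(published == query_hit)   # same underlying value, so the blinded values match
    print(published == query_miss)  # different underlying values, so they differ
```

Neither party ever hands over its key or sees the other's unblinded value. Only the doubly blinded results are compared.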
Summary

• Leverage cloud data lakes to enable flexibility and accommodate data growth easily

• Implement data-centric protection methods to reduce the risk of data leakage

• Leverage de-identification capabilities to accelerate analytics and data monetization efforts that still comply with data privacy regulations

• Examine operational models that minimize impact to DevOps and business data flows


Data Privacy Resources

• Simplifying Encryption White Paper

• Gartner Report on Privacy Preserving Analytics

• Video Talks and 1:1 Technical Consultation


Q&A



Thank You!
[email protected]
