Snowflake
INTRODUCTION:
1. Decoupled compute and storage
2. No hardware to manage and no maintenance
3. Snowflake is fully ACID compliant, and this applies to all table types (except External
Tables)
4. SnowSQL is a CLI Tool
5. SQL functionality can be extended via SQL UDFs, JavaScript UDFs and session variables.
6. Result caching can be disabled for a session with USE_CACHED_RESULT = FALSE.
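A minimal sketch of setting this parameter (it can also be set at the account or user level):
ALTER SESSION SET USE_CACHED_RESULT = FALSE;          -- queries in this session bypass the result cache
SHOW PARAMETERS LIKE 'USE_CACHED_RESULT' IN SESSION;  -- verify the current setting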
Architecture:
1. Multi-cluster, shared-data architecture: a hybrid of shared-disk and shared-nothing architectures
2. 3 Layers:
a. Data Storage (compressed columnar format; tables are automatically divided into
micro-partitions).
i. Two important features enabled by this layer: Time Travel and zero-copy cloning
b. Virtual Warehouses (Query Processing layer; independent compute clusters that do not
share resources)
c. Cloud Services Layer (CSL) (the "brain": query planning, optimization and compilation)
i. Authentication
ii. Infrastructure Management
iii. Metadata Management
iv. Query Parsing and Optimization
v. Access Control
7. Caches: reduce cost and improve query performance.
a. Metadata Cache (also referred to as the Metadata Layer or Services Layer; owned by the CSL)
i. Holds object information
ii. Queries like:
1. DESC TABLE table_name
2. SELECT CURRENT_USER(), CURRENT_VERSION()
3. SELECT COUNT(*) FROM table_name
4. SELECT MIN(col) FROM table_name
b. Result Cache (owned by the CSL)
i. Results are held for 24 hours; each reuse resets the 24-hour window, up to a maximum of 31 days
ii. Requires the exact same query
iii. A user cannot view another user's results, but a cached result produced by one user
can be reused by another user (with the required privileges)
iv. Not used if the query contains non-deterministic values such as CURRENT_TIMESTAMP
v. When the underlying data changes or the table is dropped, this cache is
invalidated and may be unavailable
c. Local Disk Cache, also called Warehouse Cache, SSD Cache, Data Cache or Raw Data
Cache (i.e. it holds raw, not aggregated, data):
i. Data held locally by the warehouse on its SSDs
ii. Deleted when the warehouse is suspended
iii. If the warehouse is resized, the cache is purged, i.e. removed
iv. Even a modified query can use this cache if the required data is available
d. In the query profile, bytes scanned shown in green => read from remote storage; bytes
scanned shown in blue => read from local (cache) storage
Releases:
1. Releases every week
2. The release process is transparent to users, with no downtime
3. Release Types:
a. New Release (new features, updates, enhancements, bug fixes).
b. Patch Release(Fixes)
8. A new release follows a 3-stage process.
9. A patch release is made available to all accounts at the same time.
10. A new release is first given to early-access accounts, then Standard, Enterprise and so on.
11. Early access is provided only to Enterprise and higher editions, at least 24 hours before general deployment.
12. You cannot get back to previous versions.
Web UI:
1. Databases, Shares, Data Marketplace, Warehouses, Worksheets, History
3. ACCOUNT_USAGE vs INFORMATION_SCHEMA
a. INFORMATION_SCHEMA does not contain dropped objects; ACCOUNT_USAGE does.
b. ACCOUNT_USAGE retention: 1 year
c. ACCOUNT_USAGE latency: 45 mins to 3 hrs (INFORMATION_SCHEMA has no latency)
4. ORGANIZATION_USAGE
5. READER_ACCOUNT_USAGE
ii. SNOWFLAKE_SAMPLE_DATA
1. INFORMATION_SCHEMA
2. TPC…. Schemas
3. WEATHER Schema
d. When a new DB is created, it has INFORMATION_SCHEMA and PUBLIC
Schema.
e. TABLES:
i. All SF tables are automatically divided into micro-partitions.
ii. Each micro-partition holds compressed columnar data:
1. Max size of 16MB compressed.
2. Stored on logical hard drives (cloud storage)
3. They are immutable and can't be changed.
4. Metadata of each micro-partition is stored in the CSL
5. Improves query performance on large tables by skipping data.
iii. SF recommends defining a clustering key for very large tables (multi-TB in size)
iv. SF only enforces NOT NULL constraint.
v. Table Types:
1. PERMANENT:
a. Default Table type
b. Persist until dropped
c. Time Travel: 0 to 90 days
d. Failsafe: Yes
2. TEMPORARY:
a. Persists only for the duration of the session
b. Time Travel: 0 or 1 day
c. Failsafe: No
d. Cannot be converted into any other table type
3. TRANSIENT:
a. Persist until dropped
b. Time Travel: 0 or 1 day
c. Failsafe: No
4. EXTERNAL TABLE:
a. Persist until dropped
b. Read only
c. Time Travel: No
d. Failsafe: No
e. Cloning: No
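A minimal sketch showing how the table types above are created (table and column names are made up for illustration):
CREATE TABLE t_perm (id INT);                             -- permanent (default table type)
CREATE TEMPORARY TABLE t_temp (id INT);                   -- temporary, session-scoped
CREATE TRANSIENT TABLE t_trans (id INT);                  -- transient, no Fail-safe
ALTER TABLE t_perm SET DATA_RETENTION_TIME_IN_DAYS = 90;  -- Time Travel up to 90 days (Enterprise+)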
STAGES:
1. User:
a. @~ (one per user/login)
b. Automatically defined
c. File format must be specified in the COPY INTO statement
2. Table:
a. @%[TABLE_NAME]
b. Automatically defined
c. No transformations while loading
d. File format must be specified in the COPY INTO statement
3. Named:
a. Internal Named: @[STAGE_NAME]
i. If created as temporary: when dropped, the staged data files are purged.
ii. Cannot be cloned
b. External Named (Azure, GCP, AWS): @[STAGE_NAME]
i. If created as temporary: when dropped, the data files are not
removed from the external location.
ii. Can be cloned
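A short sketch of the stage types above (stage names, bucket URL and credentials are placeholders):
CREATE STAGE my_int_stage;                                     -- internal named stage
CREATE STAGE my_ext_stage                                      -- external named stage
  URL = 's3://my-bucket/path/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
LIST @~;               -- user stage
LIST @%my_table;       -- table stage
LIST @my_int_stage;    -- named stage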
j. DATA TYPES:
i. Numeric - NUMBER(38,0) is the default (precision 38, scale 0)
ii. String/Binary:
1. STRING/TEXT/VARCHAR/CHARACTER/CHAR (max 16 MB uncompressed;
default: maximum length)
2. CHAR defaults to CHAR(1), i.e. it is equivalent to VARCHAR(1)
3. BINARY = VARBINARY (max 8 MB uncompressed; the default length is
always the maximum)
iii. BOOLEAN (can have an unknown value, i.e. NULL)
1. Conversion from string: 'true'/'t'/'yes'/'y'/'on'/'1' => TRUE
2. Conversion from numeric: 0 => FALSE, any non-zero value => TRUE
iv. Date/Time (fractional-second precision 0 to 9, default 9; uses the Gregorian calendar):
1. TIME (HH:MI:SS)
2. DATE
3. DATETIME (alias for TIMESTAMP_NTZ)
4. TIMESTAMP (alias for one of the TIMESTAMP_* types, per TIMESTAMP_TYPE_MAPPING)
5. TIMESTAMP_LTZ
6. TIMESTAMP_NTZ
7. TIMESTAMP_TZ
v. INTERVAL constants can be used to add/subtract from dates/times (INTERVAL is not a data type)
vi. FLOAT4, FLOAT8, etc. are all treated as FLOAT:
1. Supports the special values 'NaN', 'inf', '-inf'
vii. String constants/literals:
1. Must be enclosed in single quotes (') or between pairs of dollar signs ($$)
viii. The number of digits after the decimal point (scale) has an impact on storage.
ix. Unsupported data types: LOB (BLOB/CLOB), ENUM and user-defined data types
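A small sketch illustrating the defaults and conversions above (the table name dtypes_demo is made up):
CREATE TABLE dtypes_demo (
  n  NUMBER,           -- defaults to NUMBER(38,0)
  s  VARCHAR,          -- defaults to maximum length (16 MB)
  b  BOOLEAN,
  ts TIMESTAMP_NTZ(9)  -- 9 digits of fractional seconds (default precision)
);
SELECT '1'::BOOLEAN, 'yes'::BOOLEAN, 0::BOOLEAN;      -- TRUE, TRUE, FALSE
SELECT CURRENT_TIMESTAMP() + INTERVAL '2 days, 3 hours';  -- INTERVAL constant arithmetic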
Data Sharing:
1. The sharing feature is implemented via the SF Services Layer and metadata store (only metadata is shared; no data is copied).
2. Storage is charged to the producer's account and compute to the consumer's account (for
reader accounts, compute is charged to the producer/provider).
3. Tables, external tables, secure views, secure materialized views and secure UDFs
can be shared.
4. VPS does not support secure data sharing.
5. The Web UI does not support adding/removing secure UDFs from shares.
6. A share is a named Snowflake object; it can contain objects from only a single database.
7. Sharing data from multiple databases (or multiple tables) can be done via secure views.
8. Only ACCOUNTADMIN can provision a share object or a reader account.
9. Changing ownership of existing shares is not possible.
10. Shared databases are read-only
11. No limit on adding shares and consumers.
12. For any new object added to a shared database, grants to the share have to be given explicitly
13. A consumer cannot create two or more databases from the same share
14. Consumers can query shared objects in the same way they query their own objects.
15. If you add an object to a share, it is immediately available to consumers. Similarly, if you
revoke privileges, the objects immediately become inaccessible.
16. There is no Time Travel on the consumer's shared database.
17. Data sharing is only possible within the same cloud and the same region.
18. Attempting to re-share a shared object results in an error.
19. Cloning of shared objects and Time Travel on them are not allowed.
20. The comment on a shared database cannot be edited.
21. SHOW SHARES; has a KIND column that shows whether a share is Inbound or Outbound (see the sketch at the end of this section).
22. Product offering for Secure data sharing:
a. Direct share
b. Data marketplace:
i. It has two types of Data listings: Standard Data Listing, Personalised Data
listing.
c. Data exchange.
23. Reader’s Account (Managed Accounts):
a. It is an alternative when the consumer does not have a SF account
b. Owned and controlled by the producer or provider account.
c. SHOW MANAGED ACCOUNTS;
d. CREATE MANAGED ACCOUNT jainam ADMIN_NAME = '', ADMIN_PASSWORD = '', TYPE = READER;
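A minimal provider-side sketch of direct sharing (database, schema, view and account names are placeholders):
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON VIEW sales_db.public.secure_sales_v TO SHARE sales_share;  -- secure view
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;                   -- same cloud/region
SHOW SHARES;                                                               -- KIND = OUTBOUND on the provider side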
VIRTUAL WAREHOUSES:
1. A VWH is a cluster of servers providing CPU, memory and disk
2. Executes SELECT queries as well as DML operations (DELETE, INSERT, UPDATE, COPY INTO)
3. X-Small has 1 server per cluster; similarly, Small has 2 servers per cluster.
4. VWH sizes follow T-shirt sizes (8 of them), i.e. X-Small to 4XL (1, 2, 4, 8, 16, 32, 64, 128 servers)
5. When created via SQL/CLI the default size is X-Small; in the Web UI the default is X-Large.
6. Can be suspended and resized at any time (only new queries are affected)
7. When creating a VWH you can specify properties such as size, auto-suspend/auto-resume
and (for multi-cluster) min/max cluster count; most properties can also be modified later (a sketch follows).
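A sketch of warehouse creation and modification (warehouse name and parameter values are illustrative):
CREATE WAREHOUSE my_wh
  WAREHOUSE_SIZE = 'XSMALL'       -- default size in SQL/CLI
  AUTO_SUSPEND = 300              -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3           -- multi-cluster (Enterprise edition and above)
  INITIALLY_SUSPENDED = TRUE;
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';  -- resize at any time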
g. Failsafe:
i. Not configurable (a 7-day period that begins after Time Travel ends)
ii. Available only for permanent tables
iii. Data is recoverable only by Snowflake (via Snowflake Support)
h. Cloning:
i. Referred as Zero-copy-cloning.
ii. Only metadata is copied, so there are no storage costs until data is changed. If a
change is made to the cloned table, new micro-partitions are created (and those
incur storage costs).
iii. The clone references the source table's micro-partitions (no storage costs). A cloned
object does not inherit the source object's grant privileges, but if the source is a
database or schema, the privileges on its child objects are inherited.
iv. To clone a table, your current role must have SELECT privileges on
source table.
v. To clone a database or schema, your current role must have USAGE
privileges on the source database/schema
vi. External tables cannot be cloned
vii. All Stages except Internal Named Stage can be cloned.
viii. When a stream is cloned, any unconsumed records in the stream are
inaccessible.
ix. When a task is cloned, it is suspended and needs to be resumed individually.
x. If cloning has started, data in the source table is changed, and the retention time is 0,
the clone fails with an error. To avoid this, either do not perform DML operations on
the source during cloning or increase the retention time.
xi. Files that have already been loaded into the source table can be
loaded again into the cloned table, i.e. the load history of files is not copied to the clone.
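A minimal zero-copy cloning sketch (object names are placeholders):
CREATE TABLE orders_clone CLONE orders;                          -- metadata-only copy
CREATE DATABASE dev_db CLONE prod_db;                            -- clones all schemas/tables within
CREATE TABLE orders_yday CLONE orders AT (OFFSET => -60*60*24);  -- clone combined with Time Travel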
DATA MOVEMENT:
1. File Location:
a. On Local
b. On Cloud (S3, Blob storage, GCS)
2. File Type:
a. Structured: CSV,TSV,etc
b. Semi-structured: JSON,ORC,PARQUET,AVRO and XML
c. Users can specify the compression type when loading; the default compression is GZIP (files uploaded with PUT are gzip-compressed by default)
3. Encryption on Load:
a. Already-encrypted files can be loaded into SF by providing the key to SF at load time
b. Unencrypted files are encrypted by Snowflake using AES 128-bit keys by default (256-bit
keys can be enabled via CLIENT_ENCRYPTION_KEY_SIZE)
4. Best Practices while Loading:
a. Split larger files into smaller files
b. Compressed files of roughly 10-100 MB are ideal for data loading
c. Parquet >3Gb compressed - should be split into 1GB chunks
d. Variant datatype has 16 Mb compressed size limit per row
c. BULK LOADING (COPY INTO) - key copy options (a sketch follows this list):
2. PURGE: FALSE (default). If the purge operation fails after a successful load,
no error is returned to the user.
3. FORCE: FALSE (default)
4. PATTERN = '<regex pattern>'
5. To load files whose load metadata has expired (older than 64 days), set
LOAD_UNCERTAIN_FILES = TRUE.
6. To ignore the load metadata entirely, set FORCE = TRUE.
7. Files are loaded in parallel; up to 1000 files can be specified per COPY command.
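A sketch of a bulk load using the options above (stage, table, pattern and file format are placeholders):
COPY INTO my_table
  FROM @my_int_stage
  PATTERN = '.*sales.*[.]csv'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  FORCE = FALSE                  -- default: files already recorded in load metadata are skipped
  PURGE = FALSE                  -- default: staged files are not deleted after loading
  LOAD_UNCERTAIN_FILES = TRUE;   -- also load files whose load metadata has expired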
d. CONTINUOUS LOADING (SNOWPIPE):
i. 14 days of load (metadata) history is retained
ii. Snowflake-managed (serverless) compute resources are used
iii. Snowpipe is used to load small volumes of frequent (continuous) data
iv. Snowpipe loads data from files as soon as they are available in
a stage
v. Loading is defined by a COPY INTO statement in the pipe definition (all data types are supported)
vi. File arrived detection mechanism:
1. Using Cloud Notification i.e AUTO_INGEST (External Stages
only)
2. Calling REST API Endpoint (Internal + External stages)
a. insertFiles: Informs snowflake about files to be ingested.
b. insertReport: reports load events; at most 10,000 events are retained, for a
maximum of 10 minutes.
c. loadHistoryScan: Fetches a report about ingested files
whose contents have been added to the table.
vii. Snowpipe can be paused or resumed using
PIPE_EXECUTION_PAUSED = TRUE; the parameter can be set at the account, schema
or pipe level (see the CREATE PIPE sketch after this list).
viii. Stopped is not an execution state.
ix. Snowpipe copies files into an ingestion queue, from which they are loaded into
Snowflake.
x. Snowpipe will not reload a file with the same name, even if the file has been modified.
xi. Snowflake features for enabling continuous data pipelines:
1. Continuous data loading:
a. Snowpipe.
b. Snowflake connector for Kafka:
i. A Snowflake table loaded by the Kafka connector has a
schema consisting of 2 VARIANT columns:
RECORD_CONTENT, RECORD_METADATA
ii. RECORD_METADATA contains: topic, partition, key,
CreateTime/LogAppendTime
iii. The Kafka connector guarantees exactly-once delivery.
iv. When neither key.converter nor value.converter is
set, most SMTs (single message transformations) are supported, with the exception
of regex.router.
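A minimal Snowpipe sketch with AUTO_INGEST, as referenced above (pipe, table and stage names are placeholders; the target table is assumed to have a VARIANT column):
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE                    -- cloud-notification-based detection (external stages only)
AS
  COPY INTO my_table
  FROM @my_ext_stage
  FILE_FORMAT = (TYPE = JSON);
ALTER PIPE my_pipe SET PIPE_EXECUTION_PAUSED = TRUE;  -- pause the pipe
SELECT SYSTEM$PIPE_STATUS('my_pipe');                 -- check the execution state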
6. UNLOADING DATA
a. The COPY INTO <location> command is used for unloading to cloud storage (this can also be
done without a stage, using a URL and proper credentials); files are then downloaded locally
with the GET command (a sketch appears at the end of this section).
b. Using a SELECT statement as the source of the COPY is preferred, since any query
operations can be applied.
c. Unloading can be done to Internal Stage (Any), External Stage, and External
Locations
d. GET is not supported for downloading files from external stages, and is not supported by the
Go Snowflake driver, the .NET driver or Node.js, with only limited ODBC driver support.
e. Parallelism for GET is controlled with PARALLEL = <integer>, which
can be from 1 to 99 (default: 10)
f. GET has an option: PATTERN = '<regex>'
g. File Formats: Flat (CSV,TSV, etc), JSON,PARQUET
h. Use OBJECT_CONSTRUCT to create semi structure format files.
i. During unloading, gzip compression is applied by default (Snappy for Parquet output).
j. Output can be a single file or multiple files (default: multiple, with a default maximum of 16 MB per file)
k. Set SINGLE = TRUE for a single output file; MAX_FILE_SIZE can be set to limit file size.
l. Strings can be enclosed in double or single quotes; EMPTY_FIELD_AS_NULL: TRUE (default: FALSE);
NULL values can be converted using NULL_IF
m. Any type of data transformation can be applied in the unload query
n. The default output file name is "data_<n>_<n>_<n>"
o. Unloading to an S3 bucket requires the s3:DeleteObject and s3:PutObject permissions.
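A sketch of unloading and downloading (stage, table and local path are placeholders):
COPY INTO @my_int_stage/unload/data
  FROM (SELECT OBJECT_CONSTRUCT(*) FROM my_table)  -- build semi-structured output rows
  FILE_FORMAT = (TYPE = JSON)
  SINGLE = FALSE                                   -- default: multiple output files
  MAX_FILE_SIZE = 16777216;                        -- ~16 MB per file
GET @my_int_stage/unload/ file:///tmp/unload/ PARALLEL = 10 PATTERN = '.*data.*';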
SECURITY:
1. Role Based Access Control (RBAC): all privileges related to objects are
assigned to roles, and roles are assigned to users (a minimal sketch follows).
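A minimal RBAC sketch (role, object and user names are placeholders):
CREATE ROLE analyst;
GRANT USAGE ON DATABASE sales_db TO ROLE analyst;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;  -- privileges go to roles
GRANT ROLE analyst TO USER jainam;                                     -- roles go to users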
2. AUTHENTICATION:
a. MFA:
i. Provided by Duo Services
ii. Each user must enroll themselves (it is not enforced by default)
iii. SF recommends enabling MFA at minimum for users with the ACCOUNTADMIN role
iv. MFA can be disabled or temporarily bypassed by an ACCOUNTADMIN or SECURITYADMIN using:
DISABLE_MFA = TRUE or MINS_TO_BYPASS_MFA = 5 (see the sketch after this section)
v. Works with the UI, SnowSQL, ODBC, JDBC and the Python connector
b. SSO (federated authentication via a SAML 2.0 IdP) allows users to log in through their identity
provider, i.e. using tokens directly.
c. SSO is available on Enterprise+
d. MFA, OAuth and SSO are available to all editions
e. Compliance: SOC 1 and SOC 2 Type 2, HITRUST/HIPAA (Business Critical Edition or higher),
PCI DSS (Business Critical Edition or higher), FedRAMP, GxP, ISO 27001
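A sketch of the MFA controls mentioned under Authentication above (the user name is a placeholder):
ALTER USER jainam SET DISABLE_MFA = TRUE;        -- disable MFA for the user
ALTER USER jainam SET MINS_TO_BYPASS_MFA = 5;    -- temporarily bypass MFA for 5 minutes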
SNOWFLAKE SEMI_STRUCTURED:
1. Can be operated on: JSON,AVRO,ORC,PARQUET,XML
2. Stores in compressed columnar binary representation
3. It is stored in the VARIANT type (a "universal" type), which can hold a value of any other type,
including ARRAY and OBJECT
4. Max size: 16 MB compressed per value
5. In a VARIANT column, JSON null is stored as a VARIANT null value (displayed as the text "null"), distinct from SQL NULL
6. When semi-structured data is inserted into a VARIANT column, Snowflake attempts to extract it into a
columnar format based on certain rules.
7. Querying semi-structured Data (Primarily JSON):
a. This path notation can be used on semi-structured data, but not on XML
b. <column1>:<level_1_element>
c. Query output is enclosed in double quotes because the output is VARIANT and
not VARCHAR
d. 2 ways to access an element in a JSON object:
i. Dot notation: SELECT src:sales.name FROM table
ii. Bracket notation: SELECT src['sales']['name'] FROM table
iii. Element names (sales, name) are case-sensitive, while the column name (src) is case-insensitive
e. Casting is done using ::
f. FLATTEN / PARSE_JSON / GET FUNCTION:
i. FLATTEN is used to produce a lateral view of a VARIANT, OBJECT or ARRAY column
ii. FLATTEN can be used with LATERAL or as a TABLE(...) function.
iii. The output of a FLATTEN query has the following columns:
1. Seq
2. Key
3. Path
4. Index
5. Value
6. This
iv. To parse nested arrays, FLATTEN can be applied recursively (RECURSIVE => TRUE).
v. GET takes a VARIANT/OBJECT/ARRAY value as its first argument and extracts the element
identified by the second argument (a field name or index; GET_PATH accepts a path)
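A small sketch of the notations and FLATTEN output described above (the table raw_json, column src and the JSON shape are made up for illustration):
CREATE TABLE raw_json (src VARIANT);
INSERT INTO raw_json SELECT PARSE_JSON('{"sales":{"name":"ACME","items":[1,2,3]}}');
SELECT src:sales.name::STRING AS name_str,   -- dot notation + cast to VARCHAR
       src['sales']['name']   AS name_var    -- bracket notation (VARIANT, shown in double quotes)
FROM raw_json;
SELECT f.seq, f.key, f.path, f.index, f.value
FROM raw_json, LATERAL FLATTEN(INPUT => src:sales.items) f;  -- one row per array element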
ACID OR TRANSACTIONS:
1. SHOW LOCKS or SHOW TRANSACTIONS.
2. Snowflake does not support nested transactions: when a transaction is started from within
another transaction, it is not nested; instead it runs in its own scope. These are
called SCOPED TRANSACTIONS.
3. Commit operations lock resources.
4. UPDATE, DELETE, and MERGE statements hold locks.
5. If a session is disconnected with a transaction left open, the transaction is aborted
after 4 hours, or it can be aborted manually using SYSTEM$ABORT_TRANSACTION (see the sketch below).
6. For, Multi-threaded Programs,
a. Use a separate connection for each thread.
b. Execute the threads synchronously (coordinate them).
c. Fact: multiple sessions cannot share the same transaction, but multiple threads
using a single connection do share the same transaction.
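A minimal explicit-transaction sketch (table and column names are placeholders):
BEGIN;                                                     -- start an explicit transaction
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- DML acquires a lock
SHOW LOCKS;                                                -- inspect locks held by open transactions
COMMIT;
-- an abandoned open transaction is aborted after 4 hours, or manually:
-- SELECT SYSTEM$ABORT_TRANSACTION(<transaction_id>);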
IMPORTANT POINTS:
1. If you create a table, drop it, create a new table with the same name and then run the UNDROP
command, it will fail (an object with that name already exists).
2. Replication is supported for DATABASES only.
3. INFORMATION_SCHEMA contains table functions that provide account-level usage and
historical data for storage, warehouses, etc.
4. Session variables can be set in Snowflake; string/binary variables are limited to 256 bytes.
5. DATE_TRUNC truncates a date/time/timestamp to a specified date part (see the sketch after this list).
6. Snowflake uses Lacework to monitor network traffic and user activity, and Sumo Logic and
Threat Stack to monitor failed logins, perform file integrity monitoring and detect unauthorized
system modifications.
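A minimal DATE_TRUNC sketch, as referenced in the Important Points list (values are illustrative):
SELECT DATE_TRUNC('MONTH', CURRENT_TIMESTAMP());              -- first instant of the current month
SELECT DATE_TRUNC('DAY', '2024-08-15 13:45:00'::TIMESTAMP);   -- 2024-08-15 00:00:00.000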