#### Apache Spark
-- A lightning-fast unified analytics engine for big data processing and
machine learning.
#### Cluster
-- A group of virtual machines with a driver node and worker nodes. The driver node
orchestrates and distributes tasks among the
worker nodes.
#### Types of Cluster
All Purpose                          Job Cluster
-----------------------------------  -----------------------------------
Created manually                     Created by jobs
Persistent                           Terminated at the end of the job
Suitable for interactive workloads   Suitable for automated workloads
Shared among many users              Isolated just for the job
Expensive to run                     Cheaper to run
#### Cluster configuration
Single node -- only one VM, which acts as both the driver and the worker node
Multi node -- one driver node and one or more worker nodes
Databricks Runtime -- the set of software artifacts that run on the clusters
of machines managed by Databricks.
Autoscaling -- the process of adding and removing worker nodes depending on
the workload
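A rough sketch of where autoscaling appears in a cluster definition (the field names follow the Databricks cluster JSON/API; the runtime and VM-size values are placeholders, so treat this as an assumption-laden illustration):
# Fixed-size cluster: a set number of worker nodes
fixed_cluster = {
    "cluster_name": "demo-fixed",
    "spark_version": "<databricks-runtime-version>",
    "node_type_id": "<vm-size>",
    "num_workers": 4
}
# Autoscaling cluster: Databricks adds/removes workers within the min/max range based on load
autoscaling_cluster = {
    "cluster_name": "demo-autoscale",
    "spark_version": "<databricks-runtime-version>",
    "node_type_id": "<vm-size>",
    "autoscale": {"min_workers": 2, "max_workers": 8}
}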
##### Databricks Unit
A DBU is a normalized unit of processing power on the Databricks Lakehouse Platform,
used for measurement and pricing purposes.
#### Cluster Pool
A set of idle, ready-to-use nodes that reduces cluster creation time
#### Magic commands
%sql -- switches the cell language to SQL (similarly %python, %scala, %r)
### Clear state option in the RUN tab
clears all the variable values
### %md magic command
markdown command to document the code
### %fs magic command
ls -- to list down files
### %sh magic command
ps -- to list down all the processes which are running
#### Databricks Utilities can be used in all notebooks except SQL
dbutils.help() -- To get help
dbutils.fs.help() -- To get help on fs command
dbutils.fs.help('ls') -- To get help on ls within fs command
dbutils.fs.ls('/databricks-datasets') -- to list all the files within a folder
or
%fs
ls
### Access Gen2 using Access key ########
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>")
OR
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
### Access Gen2 using SAS token ########
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))
### Access Gen2 using Service principal ########
1) Register an Azure AD application / service principal
2) Generate a secret/password for the application
3) Set the Spark config with the app/client ID, directory/tenant ID & secret
4) Assign the Storage Blob Data Contributor role on the data lake
1) Register an Azure AD application / service principal
Azure portal -> Azure Active Directory (search for it) -> App registrations -> New
registration -> formula1-app
copy the
Application (Client) ID
Directory (Tenant) ID
2) Generate a secret/password for the application
go to the app registration (here formula1-app) -> Certificates & secrets -> give the
secret a name -> copy the value
3) Set the Spark config with the app/client ID, directory/tenant ID & secret
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net",
    "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
    service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
    "https://login.microsoftonline.com/<directory-id>/oauth2/token")
4) Assign the Storage Blob Data Contributor role on the data lake
storage account (formula1) -> Access control (IAM) -> Add -> Add role assignment ->
Storage Blob Data Contributor
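A consolidated sketch of the same service-principal setup, reading the IDs and secret from a secret scope (the scope and key names used here are illustrative assumptions):
# Hypothetical scope/key names -- replace with your own
client_id     = dbutils.secrets.get(scope="formula1-scope", key="app-client-id")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="app-tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="app-client-secret")
account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
# Verify access
display(dbutils.fs.ls(f"abfss://<container-name>@{account}.dfs.core.windows.net/"))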
### Cluster scoped authentication ################
go to the compute -> edit -> advanced ->
copy paste the spark config
fs.azure.account.key.<storage-account>.dfs.core.windows.net <storage-account-access-key>
### Azure Data Lake access using credential passthrough ###
go to the compute -> edit -> advanced ->
TICK "Enable credential passthrough for user-level data access"
also, to access the folders, give the user access at the storage level:
storage account (formula1) -> Access control (IAM) -> Add -> Add role assignment ->
Storage Blob Data Contributor -> add member -> give the member's mail id
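With passthrough enabled and the role assigned, the lake can be read directly using the logged-in user's Azure AD identity; a minimal check (container and account names are placeholders):
# No account key or spark.conf.set(...) needed -- the user's AAD credentials are passed through
display(dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/"))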
###############################
Securing access to Azure Data Lake
1) Databricks-backed secret scope
2) Azure Key Vault-backed secret scope
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
create the Azure Key Vault and store the access key there
and
go to the Databricks workspace homepage
append #secrets/createScope to the workspace URL
1) give the secret scope name
2) give the DNS name as the Vault URI copied from the Azure Key Vault
3) give the resource ID copied from the Azure Key Vault
##### dbutils.secrets
dbutils.secrets.help()
dbutils.secrets.listScopes() -- List all the scopes
dbutils.secrets.list(scope='formula1-scope') -- List the secret keys (metadata) within the scope
dbutils.secrets.get(scope : String, key : String)
dbutils.secrets.get(scope='formula1-scope', key='formula1-account-key')
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
### using secrets utility in clusters
go to the compute -> edit -> advanced ->
copy paste the spark config
fs.azure.account.key.<storage-account>.dfs.core.windows.net
{{secrets/<scope-name>/<secret-key-name>}}
#### DBFS
Databricks File System -- an abstraction layer over cloud object storage available to the workspace
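A small sketch of two common ways to touch DBFS paths (this assumes the built-in /databricks-datasets folder contains a README.md and that the /dbfs/ FUSE mount is available on the cluster):
# Through dbutils
display(dbutils.fs.ls("dbfs:/databricks-datasets"))
# Through local file APIs via the /dbfs/ mount
with open("/dbfs/databricks-datasets/README.md") as f:
    print(f.read()[:200])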
#### Mount
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type":
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret":
dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint":
"https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
dbutils.fs.mounts() ---- list the mount points
dbutils.fs.unmount('/mnt/formula1/demo')
############### Spark Architecture ##################
Driver (VM)
Worker node (VM)        Worker node (VM)
   (Executor)              (Executor)
Application -> job(s) -> stage(s) -> task(s)
(one job per action; stages are split at shuffle boundaries; one task per partition within a stage)
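A tiny sketch of how the hierarchy plays out (reusing the circuits file from the ingestion notes below):
df = spark.read.option("header", True).csv("/mnt/formula1/raw/circuits.csv")
# Transformations are lazy -- no job runs yet
selected_df = df.select("name")
# An action triggers a job, which Spark breaks into stages and then tasks (visible in the Spark UI)
selected_df.count()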
############### Ingestion
df= spark.read.option('header',True).csv('dbfs:/mnt/formula1/raw/circuits.csv')
df.show(truncate=False)
display(df)
df.printSchema()
df.describe().show()
df= spark.read.option('header',True).option('inferSchema', True).csv('/mnt/formula1/raw/circuits.csv')
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
circuits_schema=StructType(fields=[StructField("Id",IntegerType(), False),
                                   StructField("name",StringType(), False)
])
df= spark.read \
.option('header',True) \
.schema(circuits_schema) \
.csv('/mnt/formula1/raw/circuits.csv')
##### Select
from pyspark.sql.functions import col
df_select= df.select("id","name")
df_select= df.select(df.id, df.name)
df_select= df.select(df["id"], df["name"])
df_select= df.select(col("id"), col("name"))
df_select= df.select(col("id").alias("circuit_id"), col("name"))
##### withColumnRenamed
df_with = df.withColumnRenamed("existing_column", "new_column")
df_with = df.withColumnRenamed("lat", "latitude")
##### withColumn
from pyspark.sql.functions import current_timestamp, lit
df_final = df.withColumn("new_column", current_timestamp())
df_final = df.withColumn("date", current_timestamp())
### lit
df_final = df.withColumn("date", current_timestamp())\
.withColumn("env", lit("production"))
#### write data
df_final.write.parquet('/mnt/formula1/processed/circuit')
df_final.write.mode('overwrite').parquet('/mnt/formula1/processed/circuit')
#### partitionBy
df_final.write.mode('overwrite').partitionBy("race_year").parquet('/mnt/formula1/processed/circuit')
### DDL Schema
const_schema="id INT, name STRING"
### JSON file
json_df= spark.read \
.schema(const_schema) \
.json('/mnt/formula1/raw/constructors.json')
### DROP column
const_dropped_df= json_df.drop("url")
### Nested columns in a json file
name_schema=StructType(fields=[StructField("firstname",StringType(), False),
                               StructField("lastname",StringType(), False)
])
driver_schema=StructType(fields=[StructField("drive_license",StringType(), False),
                                 StructField("name",name_schema, False)   # note: the nested struct is used as the column type
])
from pyspark.sql.functions import concat, col, lit
driver_df= df.withColumn("name", concat(col("name.firstname"), lit(" "),
                                        col("name.lastname")))
##### Multi line JSON
json_df= spark.read \
.option('multiLine', True) \
.schema(pitstop_schema) \
.json('/mnt/formula1/raw/pitstop.json')
#### To process multiple CSV or json files
# Use a wildcard character
df= spark.read \
.schema(circuits_schema) \
.csv('/mnt/formula1/raw/circuits/circuits_*.csv')
or
# Use the folder path
df= spark.read \
.schema(circuits_schema) \
.csv('/mnt/formula1/raw/circuits/')
##### Databricks workflows
%run magic command to include one notebook in another notebook
%run /includes/configuration
raw_folder_path='/mnt/formula1/raw'
df= spark.read \
.schema(circuits_schema) \
.csv(f"{raw_folder_path}/circuits/")
### Passing parameters
dbutils.widgets.help()
dbutils.widgets.text('p_data_source','')
v_data_source=dbutils.widgets.get('p_data_source')
df= cir_df.withColumn('data_source', lit(v_data_source))
#### Notebook workflows --- To run a notebook
dbutils.notebook.help()
dbutils.notebook.run("notebook name to run", 0, {"p_data_source" : "Ergast API"})
Here 0 indicates it never times out
p_data_source is the parameter
Ergast API is the parameter value
put
dbutils.notebook.exit("Success") in the called notebook to return a status
v_status=dbutils.notebook.run("notebook name to run", 0, {"p_data_source" : "Ergast API"})
v_status
Success
#### Filter transformations
filter_df= df.filter("age = 35" and name = 'Malli') -- sql
filter_df= df.filter(df["age"] == 35 & df["name"] = 'Malli')) -- python
#### Join
a) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"inner")
b) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"inner") \
.select(circuit_df.col1,races_df.col1)
c) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"left")
d) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"right")
e) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"full")
### semi and anti joins
semi join
similar to an inner join, but returns only the columns of the left dataframe (matching rows only)
a) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"semi")
anti join
opposite of semi -- returns the left dataframe rows that have no match in the right
b) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"anti")
cross join
a) df3=df1.crossJoin(df2)
#### Aggregations
from pyspark.sql.functions import count, countDistinct, sum
df.select(count("*")).show()
df.select(count("race_id")).show()
df.select(countDistinct("race_id")).show()
df\
.groupBy("DriverName") \
.sum("Points") \
.show()
#### agg
Allows applying more than one aggregate function at once
df\
.groupBy("DriverName") \
.agg(sum("Points") , countDistinct("race_name"))\
.show()
### Window functions
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
driverRankSpec = Window.partitionBy("race_year").orderBy("total_points")
demoGroupeddf.withColumn("rank", rank().over(driverRankSpec))
#### when
from pyspark.sql.functions import when
df\
.groupBy("DriverName") \
.agg(sum("Points"),
     countDistinct("race_name"),
     count(when(col("position") == 1, True)).alias("wins"))\
.show()
### Rank and dense rank
rank will skip positions when there are ties
dense_rank won't
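A minimal sketch contrasting the two over the same window (dataframe and column names follow the window-function example above):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, desc
points_window = Window.partitionBy("race_year").orderBy(desc("total_points"))
ranked_df = (demoGroupeddf
    .withColumn("rank", rank().over(points_window))              # ties produce 1, 2, 2, 4
    .withColumn("dense_rank", dense_rank().over(points_window))  # ties produce 1, 2, 2, 3
)
display(ranked_df)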
### temporary and global views
df.createOrReplaceTempView("v_race_results") -- is available only within the
spark session
%sql
select * from v_race_results
python
spark.sql("select * from v_race_results")
df=spark.sql("select * from v_race_results")
df.createOrReplaceGlobalTempView("gv_race_results")
spark.sql("select * from global_temp.gv_race_results")
############## Spark SQL
a) CREATE DATABASE IF NOT EXISTS demo;
b) SHOW databases;
c) DESC database demo or DESCRIBE database demo;
d) DESCRIBE database EXTENDED demo; --To get more information
e) SELECT CURRENT_DATABASE();
f) USE demo;
g) SHOW TABLES;
h) SHOW TABLES in demo;
## Managed and External Table
#Using Python
#Managed table
race_results_df=spark.read.parquet(f"{presentation_folder_path}/race_results")
race_results_df.write.format("parquet").saveAsTable("demo.race_results_python")
#Using SQL
create table demo.race_results_sql
as
select * from demo.race_results_python
## External Table
#using python
race_results_df.write.format("parquet").option("path",f"{presentation_folder_path}/
race_results_ext_py").saveAsTable("demo.race_results_python_ext_py")
#using sql
create table demo.race_results_ext_sql
(id integer,
name string
)
using parquet
LOCATION '/mnt/formula1/presentation/race_results_ext_sql'
OR
create table demo.race_results_ext_sql
(id integer,
name string
)
using csv
options (path '/mnt/formula1/presentation/results.csv', header 'true')
INSERT INTO demo.race_results_ext_sql
SELECT * FROM demo.race_results_ext_py
where race_year=2020
Difference between managed and external table
For a managed table, Spark manages both the metadata and the data files (dropping the table deletes the data)
For an external table, Spark manages only the metadata and we manage the data files (dropping the table keeps the data)
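One way to check which kind a table is (a sketch; table names follow the examples above -- the "Type" row of the output shows MANAGED or EXTERNAL):
spark.sql("DESCRIBE EXTENDED demo.race_results_python").show(truncate=False)
spark.sql("DESCRIBE EXTENDED demo.race_results_ext_sql").show(truncate=False)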
### Views
CREATE OR REPLACE TEMP VIEW v_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2010;
CREATE OR REPLACE GLOBAL TEMP VIEW gv_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2010;
select * from global_temp.gv_race_results
## Permanent view
CREATE OR REPLACE VIEW pv_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2000;
########################## DELTA LAKE #######################################
#write data to delta lake
#Managed Table
results_df.write.format("delta").mode("overwrie").saveAsTable("f1_demo.results_mana
ged")
#To save it to a location
results_df.write.format("delta").mode("overwrite").save("/mnt/formula1/demo/results_external")
#External Table
CREATE TABLE f1_demo.results_external
USING DELTA
LOCATION "/mnt/formula1/demo/results_external"
#To read a file
results_external_df=spark.read.format("delta").load("/mnt/formula1/demo/results_external")
# To partition data
results_df.write.format("delta").mode("overwrie").partitionBy("constructor_id").sav
eAsTable("f1_demo.results_partitioned")
#show
SHOW PARTITIONS f1_demo.results_partitioned;
# Updates and Deletes using sql
similar to standard SQL UPDATE / DELETE statements (see the sketch below)
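A minimal sketch of the SQL form run from a Python cell (the table name follows the managed-table example above; the SET expression is illustrative):
# Standard UPDATE / DELETE statements work directly on Delta tables
spark.sql("UPDATE f1_demo.results_managed SET points = 11 - position WHERE position <= 10")
spark.sql("DELETE FROM f1_demo.results_managed WHERE position > 10")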
# Updates and Deletes using python
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/mnt/formula1/demo/results_managed")
deltaTable.update(
    condition = "position <= 10",
    set = { "points": "21 - position" })
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/mnt/formula1/demo/results_managed")
deltaTable.delete(
    condition = "position <= 10")
### Merge
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN
UPDATE SET
id = people10mupdates.id,
firstName = people10mupdates.firstName,
middleName = people10mupdates.middleName,
lastName = people10mupdates.lastName,
gender = people10mupdates.gender,
birthDate = people10mupdates.birthDate,
ssn = people10mupdates.ssn,
salary = people10mupdates.salary
WHEN NOT MATCHED
THEN INSERT (
id,
firstName,
middleName,
lastName,
gender,
birthDate,
ssn,
salary
)
VALUES (
people10mupdates.id,
people10mupdates.firstName,
people10mupdates.middleName,
people10mupdates.lastName,
people10mupdates.gender,
people10mupdates.birthDate,
people10mupdates.ssn,
people10mupdates.salary
)
# Using python
from delta.tables import *
deltaTablePeople = DeltaTable.forPath(spark, '/tmp/delta/people-10m')
deltaTablePeopleUpdates = DeltaTable.forPath(spark, '/tmp/delta/people-10m-updates')
dfUpdates = deltaTablePeopleUpdates.toDF()
deltaTablePeople.alias('people') \
.merge(
dfUpdates.alias('updates'),
'people.id = updates.id'
) \
.whenMatchedUpdate(set =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.whenNotMatchedInsert(values =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.execute()
### History, Versioning and Time Travel
#using SQL
DESC HISTORY f1_demo.drivers_merge;
select * from f1_demo.drivers_merge VERSION AS OF 2;
select * from f1_demo.drivers_merge TIMESTAMP AS OF '<timestamp>';
#using python
df=spark.read.format("delta").option("versionAsOf",2).load('/mnt/formula1dl/demo/drivers_merge')
# VACUUM
VACUUM f1_demo.drivers_merge; -- deletes data files older than the default retention of 7 days
VACUUM f1_demo.drivers_merge RETAIN 0 HOURS; -- deletes the entire history of data files
select * from f1_demo.drivers_merge VERSION AS OF 2; -- won't work after vacuum, since the old files are gone
### Convert parquet to delta
#Table
CONVERT TO DELTA f1_delta.convert_to_delta;
# File
df=spark.table("f1_delta.convert_to_delta")
df.write.format("parquet").save('/mnt/formula1dl/demo/convert_to_delta_new');
CONVERT TO DELTA parquet.`/mnt/formula1dl/demo/convert_to_delta_new`;
# dropDuplicates function
df.dropDuplicates(['name', 'height']).show()
### Unity Catalog
Unity Catalog is a Databricks-offered unified solution for implementing data
governance in the data lakehouse
It provides
Data access control
Data audit
Data lineage
Data discoverability
## Unity catalog object model
Managed tables can only be in delta format
Stored in the default storage
Deleted data retained for 30 days
Benefits from automatic maintenance and performance improvements
#To access a table
select * from catalog.schema.table
## Catalog
it contains
1) hive_metastore -- tables from the legacy (pre-Unity Catalog) workspace metastore
2) main -- default catalog created by Databricks
3) system -- contains tables for audit and other system information
4) samples -- contains sample tables
#Commands
SHOW CATALOGS;
SELECT current_catalog();
SELECT current_schema();
#Data discoverability
# To find the tables
select * from system.information_schema.tables
where table_name='results';
# Data lineage
It is the process of following/tracking the journey of data within the pipeline.