Azure Databricks Notes

The document provides an overview of Apache Spark, detailing its architecture, cluster types, and configuration options. It covers various data access methods, including using Azure Data Lake, and outlines commands and utilities for managing data within Databricks. Additionally, it explains data ingestion, transformations, and the use of Delta Lake for managing data efficiently.

#### Apache Spark

-- A lightning-fast unified analytics engine for big data processing and machine learning.

#### Cluster

-- A group of virtual machines with a driver node and worker nodes. The driver node orchestrates and distributes tasks among the worker nodes.

#### Types of Cluster

All Purpose                              Job Cluster
-----------------------                  -------------------------
Created manually                         Created by jobs
Persistent                               Terminated at the end of the job
Suitable for interactive workloads       Suitable for automated workloads
Shared among many users                  Isolated just for the job
Expensive to run                         Cheaper to run

#### Cluster configuration

Single node -- only one VM, which acts as both the driver and the worker node

Multi node -- one driver node and one or more worker nodes

Databricks Runtime -- the set of software artifacts that run on the clusters of machines managed by Databricks.

Autoscaling -- the process of adding and removing worker nodes depending on the workload

##### Databricks Unit

DBU is a normalized unit of processing power on the Databricks Lakehouse platform, used for measurement and pricing purposes.
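As a rough illustration of how DBUs translate into cost (the rate and cluster size below are made-up values, not actual Azure Databricks pricing):

# Hypothetical cost estimate: DBUs consumed x hours x rate per DBU
dbu_per_hour = 2           # assumed DBU consumption of the cluster
hours = 3                  # how long the cluster runs
rate_per_dbu = 0.40        # hypothetical $/DBU rate for the workload tier
estimated_cost = dbu_per_hour * hours * rate_per_dbu
print(estimated_cost)      # 2.4 -- the underlying VM cost is billed separately by Azure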

#### Cluster Pool

A set of predefined idle nodes kept on standby to speed up cluster creation


#### Magic commands

%sql -- run SQL in a cell of a non-SQL notebook (other magic commands such as %md, %fs, %sh and %run are covered below)

### Clear state option in the RUN tab

Clears all the variable values held in the notebook state

### %md magic command

markdown command to document the code

### %fs magic command

ls -- to list files

### %sh magic command

ps -- to list all the running processes

#### Databricks Utilities (dbutils) can be used in all notebook languages except SQL

dbutils.help() -- To get help

dbutils.fs.help() -- To get help on fs command

dbutils.fs.help('ls') -- To get help on ls within fs command

dbutils.fs.ls('/databricks-datasets') -- to list all the files within a folder

or

%fs
ls

### Access ADLS Gen2 using an access key ########

spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>")

OR

spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))

dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

### Access ADLS Gen2 using a SAS token ########

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))

### Access ADLS Gen2 using a service principal ########

1) Register an Azure AD application / service principal
2) Generate a secret/password for the application
3) Set the Spark config with the app/client ID, directory/tenant ID & secret
4) Assign the Storage Blob Data Contributor role on the data lake

1) Register an Azure AD application / service principal

Azure portal -> Azure Active Directory (search for it) -> App registrations -> New registration -> formula1-app

Copy the

Application (client) ID
Directory (tenant) ID

2) Generate a secret/password for the application

Go to the app registration (here formula1-app) -> Certificates & secrets -> give the secret a name -> copy the value

3) Set the Spark config with the app/client ID, directory/tenant ID & secret

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

4) Assign the Storage Blob Data Contributor role on the data lake

Storage account (formula1) -> Access control (IAM) -> Add -> Add role assignment -> Storage Blob Data Contributor

### Cluster-scoped authentication ################

Go to the compute -> Edit -> Advanced options ->

copy-paste the Spark config:

fs.azure.account.key.<storage-account>.dfs.core.windows.net <storage-account-access-key>
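With the key set at cluster level, any notebook attached to that cluster can reach the storage account directly; a minimal sketch (container, account and path are placeholders):

# No spark.conf.set needed in the notebook -- the cluster config supplies the key
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
df = spark.read.option('header', True).csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")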

### Azure Data Lake using credential passthrough ###

Go to the compute -> Edit -> Advanced options ->

Tick "Enable credential passthrough for user-level data access"

Also, to access the folder, give the user access at the storage level:

Storage account (formula1) -> Access control (IAM) -> Add -> Add role assignment -> Storage Blob Data Contributor -> Add member -> give the member's mail id

###############################

Securing access to Azure Data Lake

1) Databricks-backed secret scope
2) Azure Key Vault-backed secret scope

spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))

Create the Azure Key Vault and store the access key there,

and

go to the Databricks workspace homepage

in the URL type secrets/createScope

1) give the secret scope name
2) give the DNS name as the Vault URI copied from the Azure Key Vault
3) give the resource ID copied from the Azure Key Vault

##### dbutils.secrets

dbutils.secrets.help()

dbutils.secrets.listScopes() -- List all the scopes

dbutils.secrets.list(scope= 'formula1-scope') -- List the metadata

dbutils.secrets.get(scope: String, key: String)

dbutils.secrets.get(scope='formula1-scope', key='formula1-account-key')

spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))

### Using the secrets utility in clusters

Go to the compute -> Edit -> Advanced options ->

copy-paste the Spark config:

fs.azure.account.key.<storage-account>.dfs.core.windows.net {{secrets/<scope-name>/<secret-name>}}

#### DBFS

Databricks File System -- a distributed file system mounted into the workspace and backed by cloud object storage
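A minimal sketch of browsing DBFS with the file system utilities covered earlier:

# List the DBFS root and the sample datasets folder
dbutils.fs.ls('/')
dbutils.fs.ls('/databricks-datasets')

# Or with the %fs magic command:
# %fs ls /databricks-datasets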

#### Mount
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)

dbutils.fs.mounts() ---- list the mount points

dbutils.fs.unmount('/mnt/formula1/demo') ---- unmount a mount point

############### Spark Architecture ##################

Driver (VM)

Worker node (VM)        Worker node (VM)
  (Executor)              (Executor)

Application -> job -> stage(s) -> task
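A sketch of how the hierarchy shows up in code: transformations are lazy, and it is the action that makes the driver submit a job, which Spark splits into stage(s) and tasks executed on the workers (the file path and filter column are placeholders):

# Transformations -- lazy, nothing runs on the executors yet
df = spark.read.option('header', True).csv('/mnt/formula1/raw/circuits.csv')
filtered_df = df.filter("country = 'Australia'")

# Action -- triggers a Spark job; check the Spark UI to see its stages and tasks
filtered_df.count()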

############### Ingestion

df = spark.read.option('header', True).csv('dbfs:/mnt/formula1/raw/circuits.csv')

df.show(truncate=False)

display(df)

df.printSchema()

df.describe().show()

df = spark.read.option('header', True).option('inferSchema', True).csv('/mnt/formula1/raw/circuits.csv')

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

circuits_schema = StructType(fields=[StructField("Id", IntegerType(), False),
                                     StructField("name", StringType(), False)
])

df = spark.read \
    .option('header', True) \
    .schema(circuits_schema) \
    .csv('/mnt/formula1/raw/circuits.csv')

##### Select

from pyspark.sql.functions import col

df_select = df.select("id", "name")

df_select = df.select(df.id, df.name)

df_select = df.select(df["id"], df["name"])

df_select = df.select(col("id"), col("name"))

df_select = df.select(col("id").alias("circuit_id"), col("name"))

##### withColumnRenamed

df_with = df.withColumnRenamed("existing_column", "new_column")

df_with = df.withColumnRenamed("lat", "latitude")

##### withColumn

from pyspark.sql.functions import current_timestamp, lit

df_final = df.withColumn("new_column", current_timestamp())

df_final = df.withColumn("date", current_timestamp())

### lit

df_final = df.withColumn("date", current_timestamp()) \
    .withColumn("env", lit("production"))

#### write data

df_final.write.parquet('/mnt/formula1/processed/circuit')
df_final.write.mode('overwrite').parquet('/mnt/formula1/processed/circuit')

#### partitionBy

df_final.write.mode('overwrite').partitionBy("race_year").parquet('/mnt/formula1/processed/circuit')

### DDL Schema

const_schema="id INT, name STRING"

### JSON file

json_df= spark.read \
.schema(const_schema) \
.json('/mnt/formula1/raw/constructors.json')

### DROP column

const_dropped_df= json_df.drop("url")

### Nested column in JSON file

name_schema = StructType(fields=[StructField("firstname", StringType(), False),
                                 StructField("lastname", StringType(), False)
])

driver_schema = StructType(fields=[StructField("drive_license", StringType(), False),
                                   StructField("name", name_schema, False)   # note: the nested struct is used as the column type
])

from pyspark.sql.functions import concat

driver_df = df.withColumn("name", concat(col("name.firstname"), lit(" "), col("name.lastname")))

##### Multi-line JSON

json_df = spark.read \
    .option('multiline', True) \
    .schema(pitstop_schema) \
    .json('/mnt/formula1/raw/pitstop.json')

#### To process multiple CSV or JSON files

# Use a wildcard character

df = spark.read \
    .schema(circuits_schema) \
    .csv('/mnt/formula1/raw/circuits/circuits_*.csv')

or

# Use the folder path

df = spark.read \
    .schema(circuits_schema) \
    .csv('/mnt/formula1/raw/circuits/')

##### Databricks workflows

%run magic command to include one notebook in another notebook

%run /includes/configuration

raw_folder_path = '/mnt/formula1/raw'    # defined in the included configuration notebook

df= spark.read \
.schema(circuits_schema) \
.csv(f"{raw_folder_path}/circuits/")

### Passing parameters

dbutils.widgets.help()
dbutils.widgets.text('p_data_source','')
v_data_source=dbutils.widgets.get('p_data_source')

df = cir_df.withColumn('data_source', lit(v_data_source))

#### Notebook workflows --- To run a notebook

dbutils.notebook.help()

dbutils.notebook.run("notebook name to run", 0, {"p_data_source" : "Ergast API"})

Here 0 means it never times out,
p_data_source is the parameter, and
Ergast API is the parameter value.

Put

dbutils.notebook.exit("Success")

in the called notebooks to return a status.

v_status = dbutils.notebook.run("notebook name to run", 0, {"p_data_source": "Ergast API"})

v_status

Success

#### Filter transformations

filter_df = df.filter("age = 35 and name = 'Malli'")               -- SQL-style expression

filter_df = df.filter((df["age"] == 35) & (df["name"] == 'Malli'))  -- Python-style expression

#### Join

a) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"inner")

b) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"inner") \
.select(circuit_df.col1,races_df.col1)

c) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"left")

d) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"right")

e) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"full")

### semi and anti joins

semi join

Similar to an inner join, but returns only the columns from the left dataframe

a) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"semi")

anti join

Opposite of the semi join -- picks the left dataframe records that have no match

b) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"anti")

cross join

a) df3=df1.crossJoin(df2)

#### Aggregations

from pyspark.sql.functions import count, countDistinct, sum

df.select(count("*")).show()

df.select(count("race_id")).show()

df.select(countDistinct("race_id")).show()
df\
.groupBy("DriverName") \
.sum("Points") \
.show()

#### agg

Allows applying two or more aggregate functions in one groupBy

df\
.groupBy("DriverName") \
.agg(sum("Points") , countDistinct("race_name"))\
.show()

### Window functions

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

driverRankSpec = Window.partitionBy("race_year").orderBy("total_points")
demoGroupeddf.withColumn("rank", rank().over(driverRankSpec))

#### when

from pyspark.sql.functions import when, col

df\
    .groupBy("DriverName") \
    .agg(sum("Points"),
         countDistinct("race_name"),
         count(when(col("position") == 1, True)).alias("wins")) \
    .show()

### Rank and dense rank

rank will skip positions when there are ties (same values)

dense rank won't; see the sketch below
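A minimal sketch comparing the two, reusing the window spec pattern from above (dataframe and column names follow the earlier example):

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, desc

driverRankSpec = Window.partitionBy("race_year").orderBy(desc("total_points"))

ranked_df = demoGroupeddf \
    .withColumn("rank", rank().over(driverRankSpec)) \
    .withColumn("dense_rank", dense_rank().over(driverRankSpec))

# With tied total_points, rank gives 1, 2, 2, 4 while dense_rank gives 1, 2, 2, 3
display(ranked_df)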

### temporary and global views

df.createOrReplaceTempView("v_race_results") -- available only within the Spark session

%sql
select * from v_race_results

python

spark.sql("select * from v_race_results")

df=spark.sql("select * from v_race_results")

df.createOrReplaceGlobalTempView("gv_race_results")

spark.sql("select * from global_temp.gv_race_results")

############## Spark SQL

a) CREATE DATABASE IF NOT EXISTS demo;

b) SHOW databases;

c) DESC database demo or DESCRIBE database demo;

d) DESCRIBE database EXTENDED demo; --To get more information

e) SELECT CURRENT_DATABASE();

f) USE demo;

g) SHOW TABLES;

h) SHOW TABLES in demo;

## Managed and External Table

#Using Python

#Managed table

race_results_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

race_results_df.write.format("parquet").saveAsTable("demo.race_results_python")

#Using SQL

create table demo.race_results_sql
as
select * from demo.race_results_python

## External Table

#using python

race_results_df.write.format("parquet").option("path", f"{presentation_folder_path}/race_results_ext_py").saveAsTable("demo.race_results_python_ext_py")

#using sql

create table demo.race_results_ext_sql


(id integer,
name string
)
using parquet
LOCATION '/mnt/formula1/presentation/race_results_ext_sql'

OR

create table demo.race_results_ext_sql


(id integer,
name string
)
using csv
options( path '/mnt/formula1/presentation/results.csv', header true)

INSERT INTO demo.race_results_ext_sql


SELECT * FROM demo.race_results_ext_py
where race_year=2020

Difference between managed and external tables

For managed tables, Spark manages both the metadata and the data files.

For external tables, Spark manages only the metadata; we manage the data files ourselves.
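A quick way to verify this (a sketch using the tables created above): DESCRIBE EXTENDED reports the table Type, and dropping the tables shows the behaviour difference.

# The Type row shows MANAGED vs EXTERNAL
spark.sql("DESCRIBE EXTENDED demo.race_results_python").show(truncate=False)
spark.sql("DESCRIBE EXTENDED demo.race_results_ext_sql").show(truncate=False)

# DROP TABLE on the managed table removes metadata and data files;
# on the external table it removes only the metadata -- the files stay in the data lake
# spark.sql("DROP TABLE demo.race_results_python")
# spark.sql("DROP TABLE demo.race_results_ext_sql")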

### Views
CREATE OR REPLACE TEMP VIEW v_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2010;

CREATE OR REPLACE GLOBAL TEMP VIEW gv_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2010;

select * from global_temp.gv_race_results

## Permanent view

CREATE OR REPLACE VIEW pv_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2000;

########################## DELTA LAKE #######################################

#write data to delta lake

#Managed Table

results_df.write.format("delta").mode("overwrite").saveAsTable("f1_demo.results_managed")

#To save it to a location

results_df.write.format("delta").mode("overwrite").save("/mnt/formula1/demo/results_external")

#External Table
CREATE TABLE f1_demo.results_external
USING DELTA
LOCATION "/mnt/formula1/demo/results_external"

#To read a file

results_external_df = spark.read.format("delta").load("/mnt/formula1/demo/results_external")

# To partition data

results_df.write.format("delta").mode("overwrite").partitionBy("constructor_id").saveAsTable("f1_demo.results_partitioned")

#show

SHOW PARTITIONS f1_demo.results_partitioned;

# Updates and Deletes using SQL

Standard SQL UPDATE and DELETE statements work on Delta tables (see the sketch below)
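A minimal sketch against the table created above (run the statements in a %sql cell or through spark.sql; the update expression mirrors the Python example below):

# Delta tables accept standard UPDATE / DELETE SQL
spark.sql("UPDATE f1_demo.results_managed SET points = 21 - position WHERE position <= 10")
spark.sql("DELETE FROM f1_demo.results_managed WHERE position > 10")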

# Updates and Deletes using python

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/formula1/demo/results_managed")

deltaTable.update(
    condition = "position <= 10",
    set = { "points": "21 - position" })

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/formula1/demo/results_managed")

deltaTable.delete(
    condition = "position <= 10")

### Merge
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN
UPDATE SET
id = people10mupdates.id,
firstName = people10mupdates.firstName,
middleName = people10mupdates.middleName,
lastName = people10mupdates.lastName,
gender = people10mupdates.gender,
birthDate = people10mupdates.birthDate,
ssn = people10mupdates.ssn,
salary = people10mupdates.salary
WHEN NOT MATCHED
THEN INSERT (
id,
firstName,
middleName,
lastName,
gender,
birthDate,
ssn,
salary
)
VALUES (
people10mupdates.id,
people10mupdates.firstName,
people10mupdates.middleName,
people10mupdates.lastName,
people10mupdates.gender,
people10mupdates.birthDate,
people10mupdates.ssn,
people10mupdates.salary
)

# Using python

from delta.tables import *

deltaTablePeople = DeltaTable.forPath(spark, '/tmp/delta/people-10m')


deltaTablePeopleUpdates = DeltaTable.forPath(spark, '/tmp/delta/people-10m-
updates')

dfUpdates = deltaTablePeopleUpdates.toDF()

deltaTablePeople.alias('people') \
.merge(
dfUpdates.alias('updates'),
'people.id = updates.id'
) \
.whenMatchedUpdate(set =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.whenNotMatchedInsert(values =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.execute()

### History, Versioning and Time Travel

#using SQL

DESC HISTORY f1_demo.drivers_merge;

select * from f1_demo.drivers_merge VERSION AS OF 2;

select * from f1_demo.drivers_merge TIMESTAMP AS OF '<timestamp>';

#using python

df = spark.read.format("delta").option("versionAsOf", 2).load('/mnt/formula1dl/demo/drivers_merge')

# VACUUM

VACUUM f1_demo.drivers_merge; -- deletes data files older than the default retention period of 7 days

VACUUM f1_demo.drivers_merge RETAIN 0 HOURS; -- deletes the entire history of data

select * from f1_demo.drivers_merge VERSION AS OF 2; -- won't work after VACUUM
### Convert parquet to delta

#Table

CONVERT TO DELTA f1_delta.convert_to_delta;

# File

df=spark.table("f1_delta.convert_to_delta")

df.write.format("parquet").save('/mnt/formula1dl/demo/convert_to_delta_new');

CONVERT TO DELTA parquet.`/mnt/formula1dl/demo/convert_to_delta_new`;

# Drop duplicates function

df.dropDuplicates(['name', 'height']).show()

### Unity Catalog

Unity Catalog is a unified solution offered by Databricks for implementing data governance in the data lakehouse.

It provides

Data access control
Data audit
Data lineage
Data discoverability

## Unity catalog object model

Managed tables can only be in delta format

Stored in the default storage

Deleted data retained for 30 days

Benefits from automatic maintenance and performance improvements

#To access a table

select * from catalog.schema.table
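A sketch of the three-level namespace in practice (catalog, schema, and table names are placeholders; creating catalogs requires the appropriate Unity Catalog privileges):

# <catalog>.<schema>.<table>
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema")
spark.sql("CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.results (id INT, name STRING)")
display(spark.sql("SELECT * FROM demo_catalog.demo_schema.results"))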


## Catalog

it contains

1) hive_metastore -- has table information from the legacy workspace metastore

2) main -- the default catalog created by Databricks

3) system -- contains tables with audit information

4) samples -- contains sample tables

#Commands

SHOW CATALOGS;

SELECT current_schema();

SELECT current_catalog();

#Data discoverability

# To find the tables

select * from system.information_schema.tables
where table_name='results';

# Data lineage

It is the process of following/tracking the journey of data within the pipeline.
