#### Apache Spark
-- A lightning-fast unified analytics engine for big data processing and
machine learning.
#### Cluster
-- A group of virtual machines with a driver node and worker nodes. The driver node
orchestrates and distributes tasks among the
worker nodes.
#### Types of Cluster
All Purpose                          Job Cluster
-----------------------------------  -----------------------------------
Created manually                     Created by jobs
Persistent                           Terminated at the end of the job
Suitable for interactive workloads   Suitable for automated workloads
Shared among many users              Isolated just for the job
Expensive to run                     Cheaper to run
#### Cluster configuration
Single node -- only one VM, which acts as both the driver and the worker node
Multi node -- one driver node and one or more worker nodes
Databricks Runtime -- the set of software artifacts that run on the clusters
of machines managed by Databricks.
Autoscaling -- the process of adding and removing worker nodes depending on
the workload
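A rough sketch of where autoscaling appears in a cluster definition (the field names follow the Databricks cluster JSON/API; the runtime and VM-size values are placeholders, so treat this as an assumption-laden illustration):
# Fixed-size cluster: a set number of worker nodes
fixed_cluster = {
    "cluster_name": "demo-fixed",
    "spark_version": "<databricks-runtime-version>",
    "node_type_id": "<vm-size>",
    "num_workers": 4
}
# Autoscaling cluster: Databricks adds/removes workers within the min/max range based on load
autoscaling_cluster = {
    "cluster_name": "demo-autoscale",
    "spark_version": "<databricks-runtime-version>",
    "node_type_id": "<vm-size>",
    "autoscale": {"min_workers": 2, "max_workers": 8}
}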
##### Databricks Unit
A DBU is a normalized unit of processing power on the Databricks Lakehouse Platform,
used for measurement and pricing purposes.
#### Cluster Pool
A set of idle, ready-to-use nodes that reduces cluster creation time
#### Magic commands
%sql -- switches the cell language to SQL (similarly %python, %scala, %r)
### Clear state option in the RUN tab
clears all the variable values
### %md magic command
markdown command to document the code
### %fs magic command
ls -- to list down files
### %sh magic command
ps -- to list down all the processes which are running
#### Databricks Utilities can be used in all notebooks except SQL
dbutils.help() -- To get help
dbutils.fs.help() -- To get help on fs command
dbutils.fs.help('ls') -- To get help on ls within fs command
dbutils.fs.ls('/databricks-datasets') -- to list all the files within a folder
or
%fs
ls
### Access Gen2 using Access key ########
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>")
OR
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
### Access Gen2 using SAS token ########
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))
### Access Gen2 using Service principal ########
1) Register an Azure AD application / service principal
2) Generate a secret/password for the application
3) Set the Spark config with the app/client ID, directory/tenant ID & secret
4) Assign the Storage Blob Data Contributor role on the data lake
1) Register an Azure AD application / service principal
Azure portal -> Azure Active Directory (search for it) -> App registrations -> New
registration -> formula1-app
copy the
Application (Client) ID
Directory (Tenant) ID
2) Generate a secret/password for the application
go to the app registration (here formula1-app) -> Certificates & secrets -> give the
secret a name -> copy the value
3) Set the Spark config with the app/client ID, directory/tenant ID & secret
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net",
    "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
    service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
    "https://login.microsoftonline.com/<directory-id>/oauth2/token")
4) Assign the Storage Blob Data Contributor role on the data lake
storage account (formula1) -> Access control (IAM) -> Add -> Add role assignment ->
Storage Blob Data Contributor
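A consolidated sketch of the same service-principal setup, reading the IDs and secret from a secret scope (the scope and key names used here are illustrative assumptions):
# Hypothetical scope/key names -- replace with your own
client_id     = dbutils.secrets.get(scope="formula1-scope", key="app-client-id")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="app-tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="app-client-secret")
account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
# Verify access
display(dbutils.fs.ls(f"abfss://<container-name>@{account}.dfs.core.windows.net/"))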
### Cluster scoped authentication ################
go to the compute -> edit -> advanced ->
copy paste the spark config
fs.azure.account.key.<storage-account>.dfs.core.windows.net <storage-account-access-key>
### Azure Data Lake access using credential passthrough ###
go to the compute -> edit -> advanced ->
TICK "Enable credential passthrough for user-level data access"
also, to access the folders, give the user access at the storage level:
storage account (formula1) -> Access control (IAM) -> Add -> Add role assignment ->
Storage Blob Data Contributor -> add member -> give the member's mail id
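With passthrough enabled and the role assigned, the lake can be read directly using the logged-in user's Azure AD identity; a minimal check (container and account names are placeholders):
# No account key or spark.conf.set(...) needed -- the user's AAD credentials are passed through
display(dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/"))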
###############################
Securing access to Azure Data Lake
1) Databricks-backed secret scope
2) Azure Key Vault-backed secret scope
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
create the Azure Key Vault and store the access key there
and
go to the Databricks workspace homepage
append #secrets/createScope to the workspace URL
1) give the secret scope name
2) give the DNS name as the Vault URI copied from the Azure Key Vault
3) give the resource ID copied from the Azure Key Vault
##### dbutils.secrets
dbutils.secrets.help()
dbutils.secrets.listScopes() -- List all the scopes
dbutils.secrets.list(scope='formula1-scope') -- List the secret keys (metadata) within the scope
dbutils.secrets.get(scope : String, key : String)
dbutils.secrets.get(scope='formula1-scope', key='formula1-account-key')
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
### using secrets utility in clusters
go to the compute -> edit -> advanced ->
copy paste the spark config
fs.azure.account.key.<storage-account>.dfs.core.windows.net
{{secrets/<scope-name>/<secret-key-name>}}
#### DBFS
Databricks File System -- an abstraction layer over cloud object storage available to the workspace
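A small sketch of two common ways to touch DBFS paths (this assumes the built-in /databricks-datasets folder contains a README.md and that the /dbfs/ FUSE mount is available on the cluster):
# Through dbutils
display(dbutils.fs.ls("dbfs:/databricks-datasets"))
# Through local file APIs via the /dbfs/ mount
with open("/dbfs/databricks-datasets/README.md") as f:
    print(f.read()[:200])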
#### Mount
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type":
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret":
dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint":
"https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
dbutils.fs.mounts() ---- list the mount points
dbutils.fs.unmount('/mnt/formula1/demo')
############### Spark Architecture ##################
Driver (VM)
Worker node (VM)        Worker node (VM)
   (Executor)              (Executor)
Application -> job(s) -> stage(s) -> task(s)
(one job per action; stages are split at shuffle boundaries; one task per partition within a stage)
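A tiny sketch of how the hierarchy plays out (reusing the circuits file from the ingestion notes below):
df = spark.read.option("header", True).csv("/mnt/formula1/raw/circuits.csv")
# Transformations are lazy -- no job runs yet
selected_df = df.select("name")
# An action triggers a job, which Spark breaks into stages and then tasks (visible in the Spark UI)
selected_df.count()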
############### Ingestion
df= spark.read.option('header',True).csv('dbfs:/mnt/formula1/raw/circuits.csv')
df.show(truncate=False)
display(df)
df.printSchema()
df.describe().show()
df= spark.read.option('header',True).option('inferSchema', True).csv('/mnt/formula1/raw/circuits.csv')
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
circuits_schema=StructType(fields=[StructField("Id",IntegerType(), False),
                                   StructField("name",StringType(), False)
])
df= spark.read \
.option('header',True) \
.schema(circuits_schema) \
.csv('/mnt/formula1/raw/circuits.csv')
##### Select
from pyspark.sql.functions import col
df_select= df.select("id","name")
df_select= df.select(df.id, df.name)
df_select= df.select(df["id"], df["name"])
df_select= df.select(col("id"), col("name"))
df_select= df.select(col("id").alias("circuit_id"), col("name"))
##### withColumnRenamed
df_with = df.withColumnRenamed("existing_column", "new_column")
df_with = df.withColumnRenamed("lat", "latitude")
##### withColumn
from pyspark.sql.functions import current_timestamp, lit
df_final = df.withColumn("new_column", current_timestamp())
df_final = df.withColumn("date", current_timestamp())
### lit
df_final = df.withColumn("date", current_timestamp())\
.withColumn("env", lit("production"))
#### write data
df_final.write.parquet('/mnt/formula1/processed/circuit')
df_final.write.mode('overwrite').parquet('/mnt/formula1/processed/circuit')
#### partitionBy
df_final.write.mode('overwrite').partitionBy("race_year").parquet('/mnt/formula1/processed/circuit')
### DDL Schema
const_schema="id INT, name STRING"
### JSON file
json_df= spark.read \
.schema(const_schema) \
.json('/mnt/formula1/raw/constructors.json')
### DROP column
const_dropped_df= json_df.drop("url")
### Nested columns in a json file
name_schema=StructType(fields=[StructField("firstname",StringType(), False),
                               StructField("lastname",StringType(), False)
])
driver_schema=StructType(fields=[StructField("drive_license",StringType(), False),
                                 StructField("name",name_schema, False)   # note: the nested struct is used as the column type
])
from pyspark.sql.functions import concat, col, lit
driver_df= df.withColumn("name", concat(col("name.firstname"), lit(" "),
                                        col("name.lastname")))
##### Multi line JSON
json_df= spark.read \
.option('multiLine', True) \
.schema(pitstop_schema) \
.json('/mnt/formula1/raw/pitstop.json')
#### To process multiple CSV or json files
# Use a wildcard character
df= spark.read \
.schema(circuits_schema) \
.csv('/mnt/formula1/raw/circuits/circuits_*.csv')
or
# Use the folder path
df= spark.read \
.schema(circuits_schema) \
.csv('/mnt/formula1/raw/circuits/')
##### Databricks workflows
%run magic command to include one notebook in another notebook
%run /includes/configuration
raw_folder_path='/mnt/formula1/raw'
df= spark.read \
.schema(circuits_schema) \
.csv(f"{raw_folder_path}/circuits/")
### Passing parameters
dbutils.widgets.help()
dbutils.widgets.text('p_data_source','')
v_data_source=dbutils.widgets.get('p_data_source')
df= cir_df.withColumn('data_source', lit(v_data_source))
#### Notebook workflows --- To run a notebook
dbutils.notebook.help()
dbutils.notebook.run("notebook name to run", 0, {"p_data_source" : "Ergast API"})
Here 0 indicates it never times out
p_data_source is the parameter
Ergast API is the parameter value
put
dbutils.notebook.exit("Success") in the called notebook to return a status
v_status=dbutils.notebook.run("notebook name to run", 0, {"p_data_source" : "Ergast API"})
v_status
Success
#### Filter transformations
filter_df= df.filter("age = 35" and name = 'Malli') -- sql
filter_df= df.filter(df["age"] == 35 & df["name"] = 'Malli')) -- python
#### Join
a) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"inner")
b) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"inner") \
.select(circuit_df.col1,races_df.col1)
c) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"left")
d) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"right")
e) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"full")
### semi and anti joins
semi join
similar to an inner join, but returns only the columns of the left dataframe (matching rows only)
a) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"semi")
anti join
opposite of semi -- returns the left dataframe rows that have no match in the right
b) join_df = circuit_df.join(races_df,
circuit_df.circuit_id==races_df.circuit_id,"anti")
cross join
a) df3=df1.crossJoin(df2)
#### Aggregations
from pyspark.sql.functions import count, countDistinct, sum
df.select(count("*")).show()
df.select(count("race_id")).show()
df.select(countDistinct("race_id")).show()
df\
.groupBy("DriverName") \
.sum("Points") \
.show()
#### agg
Allows applying more than one aggregate function at once
df\
.groupBy("DriverName") \
.agg(sum("Points") , countDistinct("race_name"))\
.show()
### Window functions
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
driverRankSpec = Window.partitionBy("race_year").orderBy("total_points")
demoGroupeddf.withColumn("rank", rank().over(driverRankSpec))
#### when
from pyspark.sql.functions import when
df\
.groupBy("DriverName") \
.agg(sum("Points"),
     countDistinct("race_name"),
     count(when(col("position") == 1, True)).alias("wins"))\
.show()
### Rank and dense rank
rank will skip positions when there are ties
dense_rank won't
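A minimal sketch contrasting the two over the same window (dataframe and column names follow the window-function example above):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, desc
points_window = Window.partitionBy("race_year").orderBy(desc("total_points"))
ranked_df = (demoGroupeddf
    .withColumn("rank", rank().over(points_window))              # ties produce 1, 2, 2, 4
    .withColumn("dense_rank", dense_rank().over(points_window))  # ties produce 1, 2, 2, 3
)
display(ranked_df)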
### temporary and global views
df.createOrReplaceTempView("v_race_results") -- is available only within the
spark session
%sql
select * from v_race_results
python
spark.sql("select * from v_race_results")
df=spark.sql("select * from v_race_results")
df.createOrReplaceGlobalTempView("gv_race_results")
spark.sql("select * from global_temp.gv_race_results")
############## Spark SQL
a) CREATE DATABASE IF NOT EXISTS demo;
b) SHOW databases;
c) DESC database demo or DESCRIBE database demo;
d) DESCRIBE database EXTENDED demo; --To get more information
e) SELECT CURRENT_DATABASE();
f) USE demo;
g) SHOW TABLES;
h) SHOW TABLES in demo;
## Managed and External Table
#Using Python
#Managed table
race_results_df=spark.read.parquet(f"{presentation_folder_path}/race_results")
race_results_df.write.format("parquet").saveAsTable("demo.race_results_python")
#Using SQL
create table demo.race_results_sql
as
select * from demo.race_results_python
## External Table
#using python
race_results_df.write.format("parquet").option("path",f"{presentation_folder_path}/
race_results_ext_py").saveAsTable("demo.race_results_python_ext_py")
#using sql
create table demo.race_results_ext_sql
(id integer,
name string
)
using parquet
LOCATION '/mnt/formula1/presentation/race_results_ext_sql'
OR
create table demo.race_results_ext_sql
(id integer,
name string
)
using csv
options (path '/mnt/formula1/presentation/results.csv', header 'true')
INSERT INTO demo.race_results_ext_sql
SELECT * FROM demo.race_results_ext_py
where race_year=2020
Difference between managed and external table
For a managed table, Spark manages both the metadata and the data files (dropping the table deletes the data)
For an external table, Spark manages only the metadata and we manage the data files (dropping the table keeps the data)
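One way to check which kind a table is (a sketch; table names follow the examples above -- the "Type" row of the output shows MANAGED or EXTERNAL):
spark.sql("DESCRIBE EXTENDED demo.race_results_python").show(truncate=False)
spark.sql("DESCRIBE EXTENDED demo.race_results_ext_sql").show(truncate=False)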
### Views
CREATE OR REPLACE TEMP VIEW v_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2010;
CREATE OR REPLACE GLOBAL TEMP VIEW gv_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2010;
select * from global_temp.gv_race_results
## Permanent view
CREATE OR REPLACE VIEW pv_race_results
as
select * from
demo.race_results_ext_sql
where race_year=2000;
########################## DELTA LAKE #######################################
#write data to delta lake
#Managed Table
results_df.write.format("delta").mode("overwrie").saveAsTable("f1_demo.results_mana
ged")
#To save it to a location
results_df.write.format("delta").mode("overwrite").save("/mnt/formula1/demo/results_external")
#External Table
CREATE TABLE f1_demo.results_external
USING DELTA
LOCATION "/mnt/formula1/demo/results_external"
#To read a file
results_external_df=spark.read.format("delta").load("/mnt/formula1/demo/results_external")
# To partition data
results_df.write.format("delta").mode("overwrie").partitionBy("constructor_id").sav
eAsTable("f1_demo.results_partitioned")
#show
SHOW PARTITIONS f1_demo.results_partitioned;
# Updates and Deletes using sql
similar to standard SQL UPDATE / DELETE statements (see the sketch below)
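A minimal sketch of the SQL form run from a Python cell (the table name follows the managed-table example above; the SET expression is illustrative):
# Standard UPDATE / DELETE statements work directly on Delta tables
spark.sql("UPDATE f1_demo.results_managed SET points = 11 - position WHERE position <= 10")
spark.sql("DELETE FROM f1_demo.results_managed WHERE position > 10")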
# Updates and Deletes using python
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/mnt/formula1/demo/results_managed")
deltaTable.update(
    condition = "position <= 10",
    set = { "points": "21 - position" })
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/mnt/formula1/demo/results_managed")
deltaTable.delete(
    condition = "position <= 10")
### Merge
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN
UPDATE SET
id = people10mupdates.id,
firstName = people10mupdates.firstName,
middleName = people10mupdates.middleName,
lastName = people10mupdates.lastName,
gender = people10mupdates.gender,
birthDate = people10mupdates.birthDate,
ssn = people10mupdates.ssn,
salary = people10mupdates.salary
WHEN NOT MATCHED
THEN INSERT (
id,
firstName,
middleName,
lastName,
gender,
birthDate,
ssn,
salary
)
VALUES (
people10mupdates.id,
people10mupdates.firstName,
people10mupdates.middleName,
people10mupdates.lastName,
people10mupdates.gender,
people10mupdates.birthDate,
people10mupdates.ssn,
people10mupdates.salary
)
# Using python
from delta.tables import *
deltaTablePeople = DeltaTable.forPath(spark, '/tmp/delta/people-10m')
deltaTablePeopleUpdates = DeltaTable.forPath(spark, '/tmp/delta/people-10m-updates')
dfUpdates = deltaTablePeopleUpdates.toDF()
deltaTablePeople.alias('people') \
.merge(
dfUpdates.alias('updates'),
'people.id = updates.id'
) \
.whenMatchedUpdate(set =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.whenNotMatchedInsert(values =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.execute()
### History, Versioning and Time Travel
#using SQL
DESC HISTORY f1_demo.drivers_merge;
select * from f1_demo.drivers_merge VERSION AS OF 2;
select * from f1_demo.drivers_merge TIMESTAMP AS OF '<timestamp>';
#using python
df=spark.read.format("delta").option("versionAsOf",2).load('/mnt/formula1dl/demo/drivers_merge')
# VACUUM
VACUUM f1_demo.drivers_merge; -- deletes data files older than the default retention of 7 days
VACUUM f1_demo.drivers_merge RETAIN 0 HOURS; -- deletes the entire history of data files
select * from f1_demo.drivers_merge VERSION AS OF 2; -- won't work after vacuum, since the old files are gone
### Convert parquet to delta
#Table
CONVERT TO DELTA f1_delta.convert_to_delta;
# File
df=spark.table("f1_delta.convert_to_delta")
df.write.format("parquet").save('/mnt/formula1dl/demo/convert_to_delta_new');
CONVERT TO DELTA parquet.`/mnt/formula1dl/demo/convert_to_delta_new`;
# dropDuplicates function
df.dropDuplicates(['name', 'height']).show()
### Unity Catalog
Unity Catalog is a Databricks-offered unified solution for implementing data
governance in the data lakehouse
It provides
Data access control
Data audit
Data lineage
Data discoverability
## Unity catalog object model
Managed tables can only be in delta format
Stored in the default storage
Deleted data retained for 30 days
Benefits from automatic maintenance and performance improvements
#To access a table
select * from catalog.schema.table
## Catalog
it contains
1) hive_metastore -- tables from the legacy (pre-Unity Catalog) workspace metastore
2) main -- default catalog created by Databricks
3) system -- contains tables for audit and other system information
4) samples -- contains sample tables
#Commands
SHOW CATALOGS;
SELECT current_catalog();
SELECT current_schema();
#Data discoverability
# To find the tables
select * from system.information_schema.tables
where table_name='results';
# Data lineage
It is the process of following/tracking the journey of data within the pipeline.