29/4/2021 4.1 The Spark UI - Databricks
4.1 The Spark UI
%run ../Includes/Classroom-Setup
Mounting course-specific datasets to /mnt/training...
Datasets are already mounted to /mnt/training from s3a://databricks-corp-training/common
res1: Boolean = false
res2: Boolean = false
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING csv
OPTIONS (
path "/mnt/training/dataframes/people-10m.csv",
header "true");
DROP TABLE IF EXISTS ssaNames;
CREATE TABLE ssaNames USING parquet OPTIONS (
path "/mnt/training/ssn/names.parquet",
header "true"
);
OK
Catalog Error
SELECT
firstName,
lastName,
birthDate
FROM
People10M
WHERE
year(birthDate) > 1990
AND gender = 'F'
firstName lastName birthDate
1 An Cowper 1992-02-08T05:00:00.000Z
2 Caroyln Cardon 1994-05-15T04:00:00.000Z
3 Yesenia Goldring 1997-07-09T04:00:00.000Z
4 Hedwig Pendleberry 1998-12-02T05:00:00.000Z
5 Kala Lyfe 1994-06-23T04:00:00.000Z
6 Gussie McKeeman 1991-11-15T05:00:00.000Z
7 Pansy Shrieves 1991-05-24T04:00:00.000Z
Showing the first 1000 rows.
Plan Optimization Example
CREATE OR REPLACE TEMPORARY VIEW joined AS
SELECT People10M.firstName,
to_date(birthDate) AS date
FROM People10M
JOIN ssaNames ON People10M.firstName = ssaNames.firstName;
CREATE OR REPLACE TEMPORARY VIEW filtered AS
SELECT firstName, count(firstName)
FROM joined
WHERE
date >= "1980-01-01"
GROUP BY
firstName, date;
OK
SELECT * FROM filtered;
firstName count(firstName)
1 Ellan 49
2 Charline 117
3 Latisha 72
4 Tonita 73
5 Gwenn 76
6 Nidia 67
7 Torri 91
Showing the first 1000 rows.
CACHE TABLE filtered;
OK
SELECT * FROM filtered;
firstName count(firstName)
1 Ellan 49
2 Charline 117
3 Latisha 72
4 Tonita 73
5 Gwenn 76
6 Nidia 67
7 Torri 91
Showing the first 1000 rows.
SELECT * FROM filtered WHERE firstName = "Latisha";
firstName count(firstName)
1 Latisha 72
2 Latisha 72
3 Latisha 72
4 Latisha 72
5 Latisha 72
6 Latisha 72
7 Latisha 72
Showing all 513 rows.
UNCACHE TABLE IF EXISTS filtered;
OK
SELECT * FROM filtered WHERE firstName = "Latisha";
firstName count(firstName)
1 Latisha 72
2 Latisha 72
3 Latisha 72
4 Latisha 72
5 Latisha 72
6 Latisha 72
7 Latisha 72
Showing all 513 rows.
Set Partitions
DROP TABLE IF EXISTS bikeShare;
CREATE TABLE bikeShare
USING csv
OPTIONS (
path "/mnt/training/bikeSharing/data-001/hour.csv",
header "true")
OK
SELECT
*
FROM
bikeShare
WHERE
hr = 10
instant dteday season yr mnth hr
1 11 2011-01-01 1 0 1 10
2 34 2011-01-02 1 0 1 10
3 56 2011-01-03 1 0 1 10
4 79 2011-01-04 1 0 1 10
5 102 2011-01-05 1 0 1 10
6 125 2011-01-06 1 0 1 10
7 148 2011-01-07 1 0 1 10
Showing all 727 rows.
DROP TABLE IF EXISTS bikeShare_partitioned;
CREATE TABLE bikeShare_partitioned
PARTITIONED BY (p_hr)
AS
SELECT
instant,
dteday,
season,
yr,
mnth,
hr AS p_hr,
holiday,
weekday,
workingday,
weathersit,
temp
FROM
bikeShare
Query returned no results
SELECT * FROM bikeShare_partitioned WHERE p_hr = 10
instant dteday season yr mnth p_hr
1 11 2011-01-01 1 0 1 10
2 34 2011-01-02 1 0 1 10
3 56 2011-01-03 1 0 1 10
4 79 2011-01-04 1 0 1 10
5 102 2011-01-05 1 0 1 10
6 125 2011-01-06 1 0 1 10
7 148 2011-01-07 1 0 1 10
Showing all 727 rows.
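Both queries above return the same 727 rows. The difference is in how much data has to be read to find them: with a table partitioned on p_hr, a filter on p_hr can be answered from the matching partition alone. The sketch below is a plain-Python simulation of that idea, not Spark code. The toy rows, the function names, and the dictionary standing in for partition directories are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for bikeShare: (instant, hr).
# Three synthetic days of 24 hourly records each.
rows = [(instant, (instant - 1) % 24) for instant in range(1, 73)]


def full_scan(rows, hr):
    """Unpartitioned layout: every row is read, then filtered."""
    rows_read = len(rows)
    matches = [row for row in rows if row[1] == hr]
    return matches, rows_read


def build_partitions(rows):
    """Partitioned layout: one bucket (think: directory) per distinct hr."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)
    return partitions


def pruned_scan(partitions, hr):
    """Only the bucket for the requested hr is read; the rest are skipped."""
    matches = list(partitions.get(hr, []))
    return matches, len(matches)


partitions = build_partitions(rows)
scan_matches, scan_read = full_scan(rows, 10)
pruned_matches, pruned_read = pruned_scan(partitions, 10)

print(scan_matches == pruned_matches)  # same result either way
print(scan_read, pruned_read)          # rows touched: 72 vs 3
```

In the Spark UI, the comparable thing to look for is the amount of input read by the scan stage of each query.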
Beware of small files!
DROP TABLE IF EXISTS bikeShare_parquet;
CREATE TABLE bikeShare_parquet
PARTITIONED BY (p_instant)
AS
SELECT
instant AS p_instant,
dteday,
season,
yr,
mnth,
hr,
holiday,
weekday,
workingday,
weathersit,
temp
FROM
bikeShare
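The statement above partitions by p_instant, a record index that is different on every row. A partitioned write produces one directory per distinct partition value, each holding at least one file, so a unique key yields roughly one tiny file per row. The sketch below uses plain Python on synthetic rows to compare that choice with partitioning by hour. The row count and the helper name partition_stats are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy rows: (instant, hr). instant is a unique record index,
# hr cycles through 0..23. Ten synthetic days of hourly records.
rows = [(instant, (instant - 1) % 24) for instant in range(1, 241)]


def partition_stats(rows, key_index):
    """Return (number of partitions, average rows per partition)."""
    counts = Counter(row[key_index] for row in rows)
    n_partitions = len(counts)
    return n_partitions, len(rows) / n_partitions


by_hr = partition_stats(rows, 1)
by_instant = partition_stats(rows, 0)

print(by_hr)       # (24, 10.0)
print(by_instant)  # (240, 1.0)
```

With hr as the key, the partition count stays at 24 however many rows arrive. With instant as the key, it grows with the table and every partition holds a single row.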
%run ../Includes/Classroom-Cleanup
Citations
Bike Sharing Data
[1] Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors
and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15,
Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
@article{
year={2013},
issn={2192-6352},
journal={Progress in Artificial Intelligence},
doi={10.1007/s13748-013-0040-3},
title={Event labeling combining ensemble detectors and background knowledge},
url={http://dx.doi.org/10.1007/s13748-013-0040-3},
publisher={Springer Berlin Heidelberg},
keywords={Event labeling; Event detection; Ensemble learning; Background knowledge},
author={Fanaee-T, Hadi and Gama, Joao},
pages={1-15}
}
© 2020 Databricks, Inc. All rights reserved.
Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache
Software Foundation (http://www.apache.org/).