07 Structured Data Processing - SQL
Amir H. Payberah
[email protected]
2022-09-20
The Course Web Page
https://id2221kth.github.io
The Questions-Answers Page
Where Are We?
Motivation
Spark and Spark SQL
Structured Data vs. RDD (1/2)
Structured Data vs. RDD (2/2)
DataFrames and DataSets
DataFrame
Schema
► Defines the column names and types of a DataFrame.
► Assume people.json file as an input:

{"name":"Michael", "age":15, "id":12}
{"name":"Andy", "age":30, "id":15}
{"name":"Justin", "age":19, "id":20}
{"name":"Andy", "age":12, "id":15}
{"name":"Jim", "age":19, "id":20}
{"name":"Andy", "age":12, "id":10}

val people = spark.read.format("json").load("people.json")
people.schema

// returns:
StructType(StructField(age, LongType, true),
  StructField(id, LongType, true),
  StructField(name, StringType, true))
Column (1/2)
► They are like columns in a table.
► col returns a reference to a column.
► expr performs transformations on a column.
► columns returns all columns of a DataFrame.

val people = spark.read.format("json").load("people.json")

col("age")

people.columns
// returns: Array[String] = Array(age, id, name)
Column (2/2)
► Different ways to refer to a column:

val people = spark.read.format("json").load("people.json")

people.col("name")
col("name")
column("name")
'name
$"name"
expr("name")
Row
► A row is a record of data.
► They are of type Row.
► Rows do not have schemas.
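► Since rows carry no schema, fields are accessed by position. A minimal sketch of creating and reading a Row by hand (the values here are illustrative, not from the slides):

import org.apache.spark.sql.Row

// Create a row manually; fields are accessed by position, not by name.
val myRow = Row("Michael", 15L, 12L)

val name = myRow.getString(0) // "Michael"
val age = myRow.getLong(1)    // 15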
Creating a DataFrame
Creating a DataFrame - From an RDD
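► The code on this slide did not survive extraction; a minimal sketch of the usual pattern, assuming an RDD of tuples and the implicits of the active SparkSession:

import spark.implicits._

// An RDD of (name, age, id) tuples.
val peopleRDD = spark.sparkContext.parallelize(Seq(("Michael", 15, 12), ("Andy", 30, 15)))

// toDF turns the RDD into a DataFrame with the given column names.
val peopleDF = peopleRDD.toDF("name", "age", "id")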
Creating a DataFrame - From Data Source
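► The slide body is lost; reading from a data source follows the same pattern used throughout this deck:

// Read a DataFrame directly from a JSON source.
val people = spark.read.format("json").load("people.json")

// Equivalent shorthand for built-in sources:
val people2 = spark.read.json("people.json")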
DataFrame Transformations (1/5)
► Add and remove rows or columns
► Transform a row into a column (or vice versa)
► Change the order of rows based on the values in columns
DataFrame Transformations (2/5)

// select
people.select("name", "age", "id").show(2)

// selectExpr
people.selectExpr("*", "(age < 20) as teenager").show()
people.selectExpr("avg(age)", "count(distinct(name))", "sum(id)").show()
DataFrame Transformations (3/5)
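► The examples for this slide are lost; a hedged sketch of row-level transformations on the people DataFrame (an assumption, consistent with the list on slide (1/5)), using import org.apache.spark.sql.functions._:

// filter and where are equivalent ways to keep matching rows.
people.filter(col("age") < 20).show()
people.where("age < 20").show()

// sort changes the order of rows based on column values.
people.sort(desc("age")).show()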
What is the output?

people.selectExpr("avg(age)", "count(distinct(name)) as distinct").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                    Option 2
+--------+--------+         +--------+--------+
|avg(age)|distinct|         |avg(age)|distinct|
+--------+--------+         +--------+--------+
|  21.333|       3|         |  21.333|       2|
+--------+--------+         +--------+--------+
DataFrame Transformations (4/5)

// withColumnRenamed
people.withColumnRenamed("name", "username").columns

// drop
people.drop("name").columns
What is the output?

people.withColumn("teenager", expr("age < 20")).show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20| Justin|
+---+---+-------+

Option 1                           Option 2
+---+---+-------+--------+         +---+---+-------+--------+
|age| id|   name|teenager|         |age| id|   name|teenager|
+---+---+-------+--------+         +---+---+-------+--------+
| 15| 12|Michael|    true|         | 15| 12|Michael|    true|
| 30| 15|   Andy|   false|         | 19| 20| Justin|    true|
| 19| 20| Justin|    true|         +---+---+-------+--------+
+---+---+-------+--------+
DataFrame Transformations (5/5)

val df = spark.createDataFrame(Seq((0, "hello"), (1, "world"))).toDF("id", "text")
DataFrame Actions
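► The examples on this slide are lost; a minimal sketch of common actions on the people DataFrame (standard Spark methods, the usage here is assumed):

people.show()    // print the first rows in tabular form
people.count()   // number of rows
people.first()   // first row
people.take(2)   // first two rows as an Array[Row]
people.collect() // all rows to the driver; use with care on large data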
Aggregation
Grouping Types
► Summarizing a complete DataFrame
► Group by
► Windowing
Summarizing a Complete DataFrame Functions (1/2)

people.select(count("age")).show()
Summarizing a Complete DataFrame Functions (2/2)
► min and max extract the minimum and maximum values from a DataFrame.
► sum adds all the values in a column.
► avg calculates the average.

val people = spark.read.format("json").load("people.json")
people.select(min("age"), max("id")).show()
people.select(sum("age")).show()
people.select(avg("age")).show()
Group By (1/3)
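► The slide body is lost; grouping happens in two steps: groupBy returns a RelationalGroupedDataset, and an aggregation turns it back into a DataFrame. A minimal sketch (assumed, consistent with the next two slides):

people.groupBy("name").count().show()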
Group By (2/3)

people.groupBy("name").agg(count("age").alias("ageagg")).show()
Group By (3/3)

people.groupBy("name").agg("age" -> "count", "age" -> "avg", "id" -> "max").show()
What is the output?

people.groupBy("name").agg("age" -> "count", "age" -> "avg", "id" -> "max").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                                    Option 2
+-------+----------+--------+-------+       +-------+----------+--------+-------+
|   name|count(age)|avg(age)|max(id)|       |   name|count(age)|avg(age)|max(id)|
+-------+----------+--------+-------+       +-------+----------+--------+-------+
|Michael|         1|    15.0|     12|       |Michael|         1|   21.33|     20|
|   Andy|         2|    24.5|     20|       |   Andy|         2|   21.33|     20|
+-------+----------+--------+-------+       +-------+----------+--------+-------+
Windowing (1/2)
► Computing some aggregation on a specific window of data.
► The window determines which rows will be passed in to this function.
► You define them by using a reference to the current data.
► A group of rows is called a frame.
Windowing (2/2)

val windowSpec = Window.rowsBetween(-1, 1)
val avgAge = avg(col("age")).over(windowSpec)
people.select(col("name"), col("age"), avgAge.alias("avg_age")).show()
What is the output?

val windowSpec = Window.rowsBetween(-1, 1)
val avgAge = avg(col("age")).over(windowSpec)
people.select(col("name"), col("age"), avgAge.alias("avg_age")).show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                      Option 2
+-------+---+--------+        +-------+---+--------+
|   name|age| avg_age|        |   name|age| avg_age|
+-------+---+--------+        +-------+---+--------+
|Michael| 15|    22.5|        |Michael| 15|     7.5|
|   Andy| 30|   21.33|        |   Andy| 30|    22.5|
|   Andy| 19|    24.5|        |   Andy| 19|   21.33|
+-------+---+--------+        +-------+---+--------+
Joins
Joins
► Joins are relational constructs you use to combine relations together.
► Different join types: inner join, outer join, left outer join, right outer join, left semi join, left anti join, cross join
Joins Example
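► The DataFrames behind the next two result tables did not survive extraction; a sketch of what they plausibly looked like, with assumed names and a join on group_id = id (the data is read off the result tables below):

import spark.implicits._

val person = Seq((0, "Seif", 0), (1, "Amir", 1), (2, "Sarunas", 1))
  .toDF("id", "name", "group_id")

val group = Seq((0, "SICS/KTH"), (1, "KTH"), (2, "SICS"))
  .toDF("id", "department")

val joinExpression = person.col("group_id") === group.col("id")

// Inner join (the default join type):
person.join(group, joinExpression).show()

// Outer join keeps rows from both sides, filling missing values with null:
person.join(group, joinExpression, "outer").show()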
Joins Example - Inner

+---+-------+--------+---+----------+
| id|   name|group_id| id|department|
+---+-------+--------+---+----------+
|  0|   Seif|       0|  0|  SICS/KTH|
|  1|   Amir|       1|  1|       KTH|
|  2|Sarunas|       1|  1|       KTH|
+---+-------+--------+---+----------+
Joins Example - Outer

+----+-------+--------+---+----------+
|  id|   name|group_id| id|department|
+----+-------+--------+---+----------+
|   1|   Amir|       1|  1|       KTH|
|   2|Sarunas|       1|  1|       KTH|
|null|   null|    null|  2|      SICS|
|   0|   Seif|       0|  0|  SICS/KTH|
+----+-------+--------+---+----------+
Joins Communication Strategies
Shuffle Join
► Every node talks to every other node.
► They share data according to which node has a certain key or set of keys.
Broadcast Join
► When the table is small enough to fit into the memory of a single worker node.
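► A sketch of hinting a broadcast join explicitly with the broadcast function from org.apache.spark.sql.functions, reusing the assumed DataFrames from the join example above:

import org.apache.spark.sql.functions.broadcast

// Ship the small group table to every worker instead of shuffling both sides.
person.join(broadcast(group), person.col("group_id") === group.col("id")).show()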
SQL
SQL
► You can run SQL queries on views/tables via the sql method on the SparkSession object.

spark.sql("SELECT * FROM people_view").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20| Justin|
| 12| 15|   Andy|
| 19| 20|    Jim|
| 12| 10|   Andy|
+---+---+-------+
Temporary View

val teenagersDF = spark.sql("SELECT name, age FROM people_view WHERE age BETWEEN 13 AND 19")
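► The view registration itself is not on the surviving slides; a sketch of how people_view would be created from the people DataFrame (assumed):

// Register the DataFrame as a temporary view so SQL can reference it by name.
people.createOrReplaceTempView("people_view")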
DataSet
Untyped API with DataFrame
Why DataSet?
► What is in Row?

// collectedPeople is an Array[Row]; each field must be cast by hand.
val collectedPeople = people.collect()

val collectedList = collectedPeople.map {
  row => (row(0).asInstanceOf[String], row(1).asInstanceOf[Int], row(2).asInstanceOf[Int])
}
DataSet
► Datasets can be thought of as typed distributed collections of data.
► The Dataset API unifies the DataFrame and RDD APIs.
► You can consider a DataFrame as an alias for Dataset[Row], where a Row is a generic untyped JVM object.

[http://why-not-learn-something.blogspot.com/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html]
Structured APIs in Spark
Creating DataSets

val ds1 = sc.parallelize(personSeq).toDS
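► personSeq is not defined on the surviving slides; a self-contained sketch under the assumption that it is a sequence of a Person case class matching the people.json fields:

case class Person(name: String, age: Long, id: Long)

import spark.implicits._

val personSeq = Seq(Person("Michael", 15, 12), Person("Andy", 30, 15))

// From an RDD:
val ds1 = spark.sparkContext.parallelize(personSeq).toDS

// Directly from a local collection:
val ds2 = personSeq.toDS

// Or from an existing DataFrame:
val ds3 = people.as[Person]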
DataSet Transformations
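► The slide's examples are lost; a sketch of typed transformations, which take Scala functions over the Person type rather than column expressions (assumptions as above):

// filter with a typed predicate; the compiler checks field names and types.
val teenagers = ds1.filter(p => p.age < 20)

// map to a strongly typed result.
val names = ds1.map(p => p.name)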
Structured Data Execution
Structured Data Execution Steps
► 1. Write DataFrame/Dataset/SQL code.
► 2. If the code is valid, Spark converts it to a logical plan.
► 3. Spark transforms this logical plan to a physical plan.
  • Checking for optimizations along the way.
► 4. Spark then executes this physical plan (RDD manipulations) on the cluster.
Logical Planning (1/2)
► The logical plan represents a set of abstract transformations.
► This plan is unresolved.
  • The code might be valid, but the tables/columns that it refers to might not exist.
► Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer.
Logical Planning (2/2)
Physical Planning
► The physical plan specifies how the logical plan will execute on the cluster.
Execution
Summary
References
Questions?