07 Structured Data Processing

The document provides an overview of structured data processing using Spark and Spark SQL, highlighting the differences between RDDs and structured collections like DataFrames and Datasets. It explains how to create DataFrames from RDDs and raw data sources, as well as various transformations that can be performed on DataFrames, such as selecting, filtering, and modifying columns. Additionally, the document includes examples of schema definitions and operations on structured data.

Structured Data Processing - Spark SQL
Amir H. Payberah
[email protected]
2022-09-20

The Course Web Page

https://id2221kth.github.io
The Questions-Answers Page

https://tinyurl.com/bdenpwc5
Where Are We?

Motivation

Spark and Spark SQL
Structured Data vs. RDD (1/2)

► case class Account(name: String, balance: Double, risk: Boolean)

► RDD[Account]

► RDDs don't know anything about the schema of the data they are dealing with.
Structured Data vs. RDD (2/2)

► case class Account(name: String, balance: Double, risk: Boolean)

► RDD[Account]

► A database/Hive sees it as columns of named and typed values.
DataFrames and DataSets

► Spark has two notions of structured collections:
  • DataFrames
  • Datasets

► They are distributed table-like collections with well-defined rows and columns.

► They represent immutable, lazily evaluated plans.

► When an action is performed on them, Spark performs the actual transformations and returns the result (see the sketch below).
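A minimal sketch of this laziness, assuming the people.json file used later in these slides: the select and filter calls only build a plan, and nothing is computed until show() (an action) is called.

val people = spark.read.format("json").load("people.json")

// Transformations: only a logical plan is built here, no job runs yet.
val teens = people.select("name", "age").filter("age < 20")

// Action: triggers planning, optimization, and execution on the cluster.
teens.show()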
DataFrame
DataFrame

► Consists of a series of rows and a number of columns.

► Equivalent to a table in a relational database.

► Spark + RDD: functional transformations on partitioned collections of objects.

► SQL + DataFrame: declarative transformations on partitioned collections of tuples (contrasted in the sketch below).
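A minimal sketch of this contrast, assuming the Account case class from the previous slides and the implicits the rest of the deck already relies on: the RDD version spells out how to compute the result with functions on objects, while the DataFrame version declares what to compute over named columns.

case class Account(name: String, balance: Double, risk: Boolean)

val accountsRDD = sc.parallelize(Seq(Account("seif", 1000.0, false), Account("amir", 150.0, true)))
val accountsDF = accountsRDD.toDF()

// RDD: functional transformation on objects.
val riskyNamesRDD = accountsRDD.filter(_.risk).map(_.name)

// DataFrame: declarative transformation on named, typed columns.
val riskyNamesDF = accountsDF.filter("risk = true").select("name")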
Schema

► Defines the column names and types of a DataFrame.

► Assume people.json file as an input:

{"name":"Michael", "age":15, "id":12}
{"name":"Andy", "age":30, "id":15}
{"name":"Justin", "age":19, "id":20}
{"name":"Andy", "age":12, "id":15}
{"name":"Jim", "age":19, "id":20}
{"name":"Andy", "age":12, "id":10}

val people = spark.read.format("json").load("people.json")

people.schema

// returns:
StructType(StructField(age, LongType, true),
  StructField(id, LongType, true),
  StructField(name, StringType, true))
Column (1/2)

► They are like columns in a table.
► col returns a reference to a column.
► expr performs transformations on a column.
► columns returns all columns of a DataFrame.

val people = spark.read.format("json").load("people.json")

col("age")

expr("age + 5 < 32")

people.columns
// returns:
Array[String] = Array(age, id, name)
Column (2/2)

► Different ways to refer to a column:

val people = spark.read.format("json").load("people.json")

people.col("name")

col("name")

column("name")

'name

$"name"

expr("name")
Row

► A row is a record of data.

► They are of type Row.

► Rows do not have schemas.
  • The order of values should be the same as the order of the schema of the DataFrame to which they might be appended.

► To access data in a row, you need to specify the position you would like to read.

import org.apache.spark.sql.Row

val myRow = Row("Seif", 65, 0)

myRow(0)                       // type Any
myRow(0).asInstanceOf[String]  // String
myRow.getString(0)             // String
myRow.getInt(1)                // Int
Creating a DataFrame

► Two ways to create a DataFrame:
  1. From an RDD
  2. From raw data sources
Creating a DataFrame - From an RDD

► The schema is automatically inferred.

► You can use toDF to convert an RDD to a DataFrame.

val tupleRDD = sc.parallelize(Array(("seif", 65, 0), ("amir", 40, 1)))
val tupleDF = tupleRDD.toDF("name", "age", "id")

► If the RDD contains case class instances, Spark infers the attributes from it.

case class Person(name: String, age: Int, id: Int)
val peopleRDD = sc.parallelize(Array(Person("seif", 65, 0), Person("amir", 40, 1)))
val peopleDF = peopleRDD.toDF
Creating a DataFrame - From Data Source

► Data sources supported by Spark:
  • CSV, JSON, Parquet, ORC, JDBC/ODBC connections, plain-text files
  • Cassandra, HBase, MongoDB, AWS Redshift, XML, etc.

val peopleJson = spark.read.format("json").load("people.json")

val peopleCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("people.csv")
DataFrame Transformations (1/5)

► Add and remove rows or columns
► Transform a row into a column (or vice versa)
► Change the order of rows based on the values in columns (see the sorting sketch below)

[M. Zaharia et al., Spark: The Definitive Guide, O'Reilly Media, 2018]
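Sorting is not demonstrated elsewhere in these slides, so here is a minimal sketch, assuming the people.json DataFrame from earlier: sort and orderBy change the order of rows based on column values.

import org.apache.spark.sql.functions.col

val people = spark.read.format("json").load("people.json")

// Ascending by age.
people.sort("age").show()

// Descending by age, then ascending by name.
people.orderBy(col("age").desc, col("name").asc).show()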
DataFrame Transformations (2/5)

► select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of data.

// select
people.select("name", "age", "id").show(2)

// selectExpr
people.selectExpr("*", "(age < 20) as teenager").show()
people.selectExpr("avg(age)", "count(distinct(name))", "sum(id)").show()
DataFrame Transformations (3/5)

► filter and where both filter rows.

► distinct can be used to extract unique rows.

people.filter("age < 20").show()

people.where("age < 20").show()

people.select("name").distinct().show()
What is the output?

people.selectExpr("avg(age)", "count(distinct(name)) as distinct").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                  Option 2
+--------+--------+       +--------+--------+
|avg(age)|distinct|       |avg(age)|distinct|
+--------+--------+       +--------+--------+
|  21.333|       3|       |  21.333|       2|
+--------+--------+       +--------+--------+
DataFrame Transformations (4/5)

► withColumn adds a new column to a DataFrame.
► withColumnRenamed renames a column.
► drop removes a column.

// withColumn
people.withColumn("teenager", expr("age < 20")).show()

// withColumnRenamed
people.withColumnRenamed("name", "username").columns

// drop
people.drop("name").columns
What is the output?

people.withColumn("teenager", expr("age < 20")).show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20| Justin|
+---+---+-------+

Option 1                           Option 2
+---+---+-------+--------+         +---+---+-------+--------+
|age| id|   name|teenager|         |age| id|   name|teenager|
+---+---+-------+--------+         +---+---+-------+--------+
| 15| 12|Michael|    true|         | 15| 12|Michael|    true|
| 30| 15|   Andy|   false|         | 19| 20| Justin|    true|
| 19| 20| Justin|    true|         +---+---+-------+--------+
+---+---+-------+--------+
DataFrame Transformations (5/5)

► You can use udf to define new column-based functions.

import org.apache.spark.sql.functions.{col, udf}

val df = spark.createDataFrame(Seq((0, "hello"), (1, "world"))).toDF("id", "text")

val upper: String => String = _.toUpperCase

val upperUDF = spark.udf.register("upper", upper)

df.withColumn("upper", upperUDF(col("text"))).show
DataFrame Actions

► Like RDDs, DataFrames also have their own set of actions (a short sketch follows below).

► collect: returns an array that contains all of the rows in the DataFrame.

► count: returns the number of rows in the DataFrame.

► first and head: return the first row of the DataFrame.

► show: displays the top 20 rows of the DataFrame in a tabular form.

► take: returns the first n rows of the DataFrame.
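A minimal sketch of these actions, assuming the people.json DataFrame from earlier; each call triggers an actual job, unlike the transformations above.

val people = spark.read.format("json").load("people.json")

val allRows = people.collect()   // Array[Row] with every row
val numRows = people.count()     // Long
val firstRow = people.first()    // same as people.head()
people.show()                    // prints up to 20 rows as a table
val someRows = people.take(3)    // Array[Row] with the first 3 rows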
Aggregation
Aggregation

► In an aggregation you specify:
  • A key or grouping
  • An aggregation function

► The given function must produce one result for each group.
Grouping Types

► Summarizing a complete DataFrame

► Group by

► Windowing
Summarizing a Complete DataFrame - Functions (1/2)

► count returns the total number of values.
► countDistinct returns the number of unique groups.
► first and last return the first and last value of a DataFrame.

val people = spark.read.format("json").load("people.json")

people.select(count("age")).show()

people.select(countDistinct("name")).show()

people.select(first("name"), last("age")).show()
Summarizing a Complete DataFrame - Functions (2/2)

► min and max extract the minimum and maximum values from a DataFrame.
► sum adds all the values in a column.
► avg calculates the average.

val people = spark.read.format("json").load("people.json")

people.select(min("name"), max("age"), max("id")).show()

people.select(sum("age")).show()

people.select(avg("age")).show()
Group By (1/3)

► Perform aggregations on groups in the data.

► Typically on categorical data.

► We do this grouping in two phases:
  1. Specify the column(s) on which we would like to group.
  2. Specify the aggregation(s).
Group By (2/3)

► Grouping with expressions
  • Rather than passing the aggregation function as an expression into a select statement, we specify it within agg.

val people = spark.read.format("json").load("people.json")

people.groupBy("name").agg(count("age").alias("ageagg")).show()
Group By (3/3)

► Grouping with maps
  • Specify transformations as a series of maps.
  • The key is the column, and the value is the aggregation function (as a string).

val people = spark.read.format("json").load("people.json")

people.groupBy("name").agg("age" -> "count", "age" -> "avg", "id" -> "max").show()
What is the output?

people.groupBy("name").agg("age" -> "count", "age" -> "avg", "id" -> "max").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                                  Option 2
+-------+----------+--------+-------+     +-------+----------+--------+-------+
|   name|count(age)|avg(age)|max(id)|     |   name|count(age)|avg(age)|max(id)|
+-------+----------+--------+-------+     +-------+----------+--------+-------+
|Michael|         1|    15.0|     12|     |Michael|         1|   21.33|     20|
|   Andy|         2|    24.5|     20|     |   Andy|         2|   21.33|     20|
+-------+----------+--------+-------+     +-------+----------+--------+-------+
Windowing (1/2)

► Computing some aggregation on a specific window of data.
► The window determines which rows will be passed in to this function.
► You define them by using a reference to the current data.
► A group of rows is called a frame.

[M. Zaharia et al., Spark: The Definitive Guide, O'Reilly Media, 2018]
Windowing (2/2)

► Unlike grouping, here each row can fall into one or more frames.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

val people = spark.read.format("json").load("people.json")

val windowSpec = Window.rowsBetween(-1, 1)
val avgAge = avg(col("age")).over(windowSpec)
people.select(col("name"), col("age"), avgAge.alias("avg_age")).show()
What is the output?

val windowSpec = Window.rowsBetween(-1, 1)
val avgAge = avg(col("age")).over(windowSpec)
people.select(col("name"), col("age"), avgAge.alias("avg_age")).show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                    Option 2
+-------+---+--------+      +-------+---+--------+
|   name|age| avg_age|      |   name|age| avg_age|
+-------+---+--------+      +-------+---+--------+
|Michael| 15|    22.5|      |Michael| 15|     7.5|
|   Andy| 30|   21.33|      |   Andy| 30|    22.5|
|   Andy| 19|    24.5|      |   Andy| 19|   21.33|
+-------+---+--------+      +-------+---+--------+
Joins
Joins

► Joins are relational constructs you use to combine relations together.

► Different join types: inner join, outer join, left outer join, right outer join, left semi join, left anti join, cross join
Joins Example

val person = Seq((0, "Seif", 0), (1, "Amir", 1), (2, "Sarunas", 1))
  .toDF("id", "name", "group_id")

val group = Seq((0, "SICS/KTH"), (1, "KTH"), (2, "SICS"))
  .toDF("id", "department")
Joins Example - Inner

val joinExpression = person.col("group_id") === group.col("id")

var joinType = "inner"

person.join(group, joinExpression, joinType).show()

+---+-------+--------+---+----------+
| id|   name|group_id| id|department|
+---+-------+--------+---+----------+
|  0|   Seif|       0|  0|  SICS/KTH|
|  1|   Amir|       1|  1|       KTH|
|  2|Sarunas|       1|  1|       KTH|
+---+-------+--------+---+----------+
Joins Example - Outer

val joinExpression = person.col("group_id") === group.col("id")

var joinType = "outer"

person.join(group, joinExpression, joinType).show()

+----+-------+--------+---+----------+
|  id|   name|group_id| id|department|
+----+-------+--------+---+----------+
|   1|   Amir|       1|  1|       KTH|
|   2|Sarunas|       1|  1|       KTH|
|null|   null|    null|  2|      SICS|
|   0|   Seif|       0|  0|  SICS/KTH|
+----+-------+--------+---+----------+
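The same pattern covers the other join types listed earlier. A minimal sketch, reusing person, group, and joinExpression from the examples above; the type strings are Spark's standard join type names.

// Left outer: keep every person, even without a matching group.
person.join(group, joinExpression, "left_outer").show()

// Left semi: keep only persons that have a match, with person columns only.
person.join(group, joinExpression, "left_semi").show()

// Left anti: keep only persons without a match.
person.join(group, joinExpression, "left_anti").show()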
Joins Communication Strategies

► Two different communication ways during joins:
  • Shuffle join: big table to big table
  • Broadcast join: big table to small table
46 /
Shuffle
Join
► Every node talks to every other node.
► They share data according to which node has a certain key or
set of keys.

[M. Zaharia e t a l . , Spa rk : The D e fi n i ti v e Guide, O ’ R e i l l y Media, 2018]

47 /
Broadcast Join

► When the table is small enough to fit into the memory of a single worker node (see the hint sketch below).

[M. Zaharia et al., Spark: The Definitive Guide, O'Reilly Media, 2018]
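A minimal sketch of asking for a broadcast join explicitly, reusing the person and group DataFrames from the join example; broadcast is a hint from org.apache.spark.sql.functions that marks the small side for broadcasting.

import org.apache.spark.sql.functions.broadcast

// Hint that the small group table should be shipped to every worker.
person.join(broadcast(group), person.col("group_id") === group.col("id")).show()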
SQL
SQL

► You can run SQL queries on views/tables via the method sql on the SparkSession object.

spark.sql("SELECT * FROM people_view").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20| Justin|
| 12| 15|   Andy|
| 19| 20|    Jim|
| 12| 10|   Andy|
+---+---+-------+
Temporary View

► createOrReplaceTempView creates (or replaces) a lazily evaluated view.
► You can use it like a table in Spark SQL.

people.createOrReplaceTempView("people_view")

val teenagersDF = spark.sql("SELECT name, age FROM people_view WHERE age BETWEEN 13 AND 19")
51 /
DataS
et

52 /
Untyped API with DataFrame

► DataFrame elements are Rows, which are generic untyped JVM objects.
► The Scala compiler cannot type check Spark SQL schemas in DataFrames.
► The following code compiles, but you get a runtime exception.
  • id_num is not in the DataFrame columns [name, age, id]

// people columns: ("name", "age", "id")
val people = spark.read.format("json").load("people.json")

people.filter("id_num < 20") // runtime exception
Why DataSet?

► Assume the following example:

case class Person(name: String, age: BigInt, id: BigInt)
val peopleRDD = sc.parallelize(Array(Person("seif", 65, 0), Person("amir", 40, 1)))
val peopleDF = peopleRDD.toDF

► Now, let's use collect to bring it back to the master.

val collectedPeople = peopleDF.collect()
// collectedPeople: Array[org.apache.spark.sql.Row]

► What is in Row?
Why DataSet?

► To be able to work with the collected values, we should cast the Rows.
  • How many columns?
  • What types?

// Person(name: String, age: BigInt, id: BigInt)

val collectedList = collectedPeople.map {
  row => (row(0).asInstanceOf[String], row(1).asInstanceOf[Int], row(2).asInstanceOf[Int])
}

► But, what if we cast the types wrong?

► Wouldn't it be nice if we could have both Spark SQL optimizations and type safety?
DataSet

► Datasets can be thought of as typed distributed collections of data.
► The Dataset API unifies the DataFrame and RDD APIs.
► You can consider a DataFrame as an alias for Dataset[Row], where a Row is a generic untyped JVM object.

type DataFrame = Dataset[Row]

[http://why-not-learn-something.blogspot.com/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html]
Structured APIs in Spark

[J.S. Damji et al., Learning Spark - Lightning-Fast Data Analytics]
Creating DataSets

► To convert a sequence or an RDD to a Dataset, we can use toDS().
► You can call as[SomeCaseClass] to convert a DataFrame to a Dataset.

case class Person(name: String, age: BigInt, id: BigInt)
val personSeq = Seq(Person("Max", 33, 0), Person("Adam", 32, 1))

val ds1 = sc.parallelize(personSeq).toDS

val ds2 = spark.read.format("json").load("people.json").as[Person]
DataSet Transformations

► Transformations on Datasets are the same as those that we had on DataFrames.

► Datasets allow us to specify more complex and strongly typed transformations.

case class Person(name: String, age: BigInt, id: BigInt)

val people = spark.read.format("json").load("people.json").as[Person]

people.filter(x => x.age < 40).show()

people.map(x => (x.name, x.age + 5, x.id)).show()
Structured Data Execution
Structured Data Execution Steps

► 1. Write DataFrame/Dataset/SQL code.
► 2. If the code is valid, Spark converts it to a logical plan.
► 3. Spark transforms this logical plan into a physical plan.
  • Checking for optimizations along the way.
► 4. Spark then executes this physical plan (RDD manipulations) on the cluster.
► The sketch below shows how to inspect these plans with explain().

[M. Zaharia et al., Spark: The Definitive Guide, O'Reilly Media, 2018]
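A minimal sketch of inspecting these plans, assuming the people.json DataFrame from earlier: explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the physical plan Spark will run.

val people = spark.read.format("json").load("people.json")

val query = people.filter("age < 20").groupBy("name").count()

// Prints parsed/analyzed/optimized logical plans and the physical plan.
query.explain(true)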
Logical Planning (1/2)

► The logical plan represents a set of abstract transformations.
► This plan is unresolved.
  • The code might be valid, but the tables/columns that it refers to might not exist.
► Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer.

[M. Zaharia et al., Spark: The Definitive Guide, O'Reilly Media, 2018]
Logical Planning (2/2)

► The analyzer might reject the unresolved logical plan.

► If the analyzer can resolve it, the result is passed through the Catalyst optimizer.

► It converts the user's set of expressions into the most optimized version.

[M. Zaharia et al., Spark: The Definitive Guide, O'Reilly Media, 2018]
Physical Planning

► The physical plan specifies how the logical plan will execute on the cluster.

► Physical planning results in a series of RDDs and transformations.

[M. Zaharia et al., Spark: The Definitive Guide, O'Reilly Media, 2018]
Execution

► Upon selecting a physical plan, Spark runs all of this code over RDDs.

► Spark performs further optimizations at runtime.

► Finally, the result is returned to the user.
Summary
Summary

► RDD vs. DataFrame vs. DataSet

► Logical and physical plans
References

► M. Zaharia et al., "Spark: The Definitive Guide", O'Reilly Media, 2018 - Chapters 4-11.

► M. Armbrust et al., "Spark SQL: Relational Data Processing in Spark", ACM SIGMOD, 2015.
Questions?
