07 Structured Data Processing - SQL
Amir H. Payberah
[email protected]
2022-09-20
The Course Web Page
https://id2221kth.github.io
The Questions-Answers Page
Where Are We?
Motivation
Spark and Spark SQL
Structured Data vs. RDD (1/2)
Structured Data vs. RDD (2/2)
DataFrames and DataSets
DataFrame
Schema
► Defines the column names and types of a DataFrame.
► Assume people.json file as an input:

{"name":"Michael", "age":15, "id":12}
{"name":"Andy", "age":30, "id":15}
{"name":"Justin", "age":19, "id":20}
{"name":"Andy", "age":12, "id":15}
{"name":"Jim", "age":19, "id":20}
{"name":"Andy", "age":12, "id":10}

val people = spark.read.format("json").load("people.json")
people.schema

// returns:
StructType(StructField(age, LongType, true),
  StructField(id, LongType, true),
  StructField(name, StringType, true))
Column (1/2)
► They are like columns in a table.
► col returns a reference to a column.
► expr performs transformations on a column.
► columns returns all columns of a DataFrame.

val people = spark.read.format("json").load("people.json")

col("age")

people.columns
// returns: Array[String] = Array(age, id, name)
Column (2/2)
► Different ways to refer to a column:

val people = spark.read.format("json").load("people.json")

people.col("name")
col("name")
column("name")
'name
$"name"
expr("name")
Row
► A row is a record of data.
► They are of type Row.
► Rows do not have schemas.
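► Since rows carry no schema, fields are accessed by position. A minimal sketch of creating and reading a Row by hand (the values here are illustrative, not from the slides):

import org.apache.spark.sql.Row

// Create a row manually; fields are accessed by position, not by name.
val myRow = Row("Michael", 15L, 12L)

val name = myRow.getString(0) // "Michael"
val age = myRow.getLong(1)    // 15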
Creating a DataFrame
Creating a DataFrame - From an RDD
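► The code on this slide did not survive extraction; a minimal sketch of the usual pattern, assuming an RDD of tuples and the implicits of the active SparkSession:

import spark.implicits._

// An RDD of (name, age, id) tuples.
val peopleRDD = spark.sparkContext.parallelize(Seq(("Michael", 15, 12), ("Andy", 30, 15)))

// toDF turns the RDD into a DataFrame with the given column names.
val peopleDF = peopleRDD.toDF("name", "age", "id")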
Creating a DataFrame - From Data Source
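► The slide body is lost; reading from a data source follows the same pattern used throughout this deck:

// Read a DataFrame directly from a JSON source.
val people = spark.read.format("json").load("people.json")

// Equivalent shorthand for built-in sources:
val people2 = spark.read.json("people.json")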
DataFrame Transformations (1/5)
► Add and remove rows or columns
► Transform a row into a column (or vice versa)
► Change the order of rows based on the values in columns
DataFrame Transformations (2/5)

// select
people.select("name", "age", "id").show(2)

// selectExpr
people.selectExpr("*", "(age < 20) as teenager").show()
people.selectExpr("avg(age)", "count(distinct(name))", "sum(id)").show()
DataFrame Transformations (3/5)
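► The examples for this slide are lost; a hedged sketch of row-level transformations on the people DataFrame (an assumption, consistent with the list on slide (1/5)), using import org.apache.spark.sql.functions._:

// filter and where are equivalent ways to keep matching rows.
people.filter(col("age") < 20).show()
people.where("age < 20").show()

// sort changes the order of rows based on column values.
people.sort(desc("age")).show()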
What is the output?

people.selectExpr("avg(age)", "count(distinct(name)) as distinct").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                    Option 2
+--------+--------+         +--------+--------+
|avg(age)|distinct|         |avg(age)|distinct|
+--------+--------+         +--------+--------+
|  21.333|       3|         |  21.333|       2|
+--------+--------+         +--------+--------+
DataFrame Transformations (4/5)

// withColumnRenamed
people.withColumnRenamed("name", "username").columns

// drop
people.drop("name").columns
What is the output?

people.withColumn("teenager", expr("age < 20")).show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20| Justin|
+---+---+-------+

Option 1                           Option 2
+---+---+-------+--------+         +---+---+-------+--------+
|age| id|   name|teenager|         |age| id|   name|teenager|
+---+---+-------+--------+         +---+---+-------+--------+
| 15| 12|Michael|    true|         | 15| 12|Michael|    true|
| 30| 15|   Andy|   false|         | 19| 20| Justin|    true|
| 19| 20| Justin|    true|         +---+---+-------+--------+
+---+---+-------+--------+
DataFrame Transformations (5/5)

val df = spark.createDataFrame(Seq((0, "hello"), (1, "world"))).toDF("id", "text")
DataFrame Actions
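► The examples on this slide are lost; a minimal sketch of common actions on the people DataFrame (standard Spark methods, the usage here is assumed):

people.show()    // print the first rows in tabular form
people.count()   // number of rows
people.first()   // first row
people.take(2)   // first two rows as an Array[Row]
people.collect() // all rows to the driver; use with care on large data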
Aggregation
Grouping Types
► Summarizing a complete DataFrame
► Group by
► Windowing
Summarizing a Complete DataFrame Functions (1/2)

people.select(count("age")).show()
Summarizing a Complete DataFrame Functions (2/2)
► min and max extract the minimum and maximum values from a DataFrame.
► sum adds all the values in a column.
► avg calculates the average.

val people = spark.read.format("json").load("people.json")
people.select(min("age"), max("id")).show()
people.select(sum("age")).show()
people.select(avg("age")).show()
Group By (1/3)
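► The slide body is lost; grouping happens in two steps: groupBy returns a RelationalGroupedDataset, and an aggregation turns it back into a DataFrame. A minimal sketch (assumed, consistent with the next two slides):

people.groupBy("name").count().show()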
Group By (2/3)

people.groupBy("name").agg(count("age").alias("ageagg")).show()
Group By (3/3)

people.groupBy("name").agg("age" -> "count", "age" -> "avg", "id" -> "max").show()
What is the output?

people.groupBy("name").agg("age" -> "count", "age" -> "avg", "id" -> "max").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                                    Option 2
+-------+----------+--------+-------+       +-------+----------+--------+-------+
|   name|count(age)|avg(age)|max(id)|       |   name|count(age)|avg(age)|max(id)|
+-------+----------+--------+-------+       +-------+----------+--------+-------+
|Michael|         1|    15.0|     12|       |Michael|         1|   21.33|     20|
|   Andy|         2|    24.5|     20|       |   Andy|         2|   21.33|     20|
+-------+----------+--------+-------+       +-------+----------+--------+-------+
Windowing (1/2)
► Computing some aggregation on a specific window of data.
► The window determines which rows will be passed in to this function.
► You define them by using a reference to the current data.
► A group of rows is called a frame.
Windowing (2/2)

val windowSpec = Window.rowsBetween(-1, 1)
val avgAge = avg(col("age")).over(windowSpec)
people.select(col("name"), col("age"), avgAge.alias("avg_age")).show()
What is the output?

val windowSpec = Window.rowsBetween(-1, 1)
val avgAge = avg(col("age")).over(windowSpec)
people.select(col("name"), col("age"), avgAge.alias("avg_age")).show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20|   Andy|
+---+---+-------+

Option 1                      Option 2
+-------+---+--------+        +-------+---+--------+
|   name|age| avg_age|        |   name|age| avg_age|
+-------+---+--------+        +-------+---+--------+
|Michael| 15|    22.5|        |Michael| 15|     7.5|
|   Andy| 30|   21.33|        |   Andy| 30|    22.5|
|   Andy| 19|    24.5|        |   Andy| 19|   21.33|
+-------+---+--------+        +-------+---+--------+
Joins
Joins
► Joins are relational constructs you use to combine relations together.
► Different join types: inner join, outer join, left outer join, right outer join, left semi join, left anti join, cross join
Joins Example
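► The DataFrames behind the next two result tables did not survive extraction; a sketch of what they plausibly looked like, with assumed names and a join on group_id = id (the data is read off the result tables below):

import spark.implicits._

val person = Seq((0, "Seif", 0), (1, "Amir", 1), (2, "Sarunas", 1))
  .toDF("id", "name", "group_id")

val group = Seq((0, "SICS/KTH"), (1, "KTH"), (2, "SICS"))
  .toDF("id", "department")

val joinExpression = person.col("group_id") === group.col("id")

// Inner join (the default join type):
person.join(group, joinExpression).show()

// Outer join keeps rows from both sides, filling missing values with null:
person.join(group, joinExpression, "outer").show()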
Joins Example - Inner

+---+-------+--------+---+----------+
| id|   name|group_id| id|department|
+---+-------+--------+---+----------+
|  0|   Seif|       0|  0|  SICS/KTH|
|  1|   Amir|       1|  1|       KTH|
|  2|Sarunas|       1|  1|       KTH|
+---+-------+--------+---+----------+
Joins Example - Outer

+----+-------+--------+---+----------+
|  id|   name|group_id| id|department|
+----+-------+--------+---+----------+
|   1|   Amir|       1|  1|       KTH|
|   2|Sarunas|       1|  1|       KTH|
|null|   null|    null|  2|      SICS|
|   0|   Seif|       0|  0|  SICS/KTH|
+----+-------+--------+---+----------+
Joins Communication Strategies
Shuffle Join
► Every node talks to every other node.
► They share data according to which node has a certain key or set of keys.
Broadcast Join
► When the table is small enough to fit into the memory of a single worker node.
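► A sketch of hinting a broadcast join explicitly with the broadcast function from org.apache.spark.sql.functions, reusing the assumed DataFrames from the join example above:

import org.apache.spark.sql.functions.broadcast

// Ship the small group table to every worker instead of shuffling both sides.
person.join(broadcast(group), person.col("group_id") === group.col("id")).show()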
SQL
SQL
► You can run SQL queries on views/tables via the sql method on the SparkSession object.

spark.sql("SELECT * FROM people_view").show()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 15| 12|Michael|
| 30| 15|   Andy|
| 19| 20| Justin|
| 12| 15|   Andy|
| 19| 20|    Jim|
| 12| 10|   Andy|
+---+---+-------+
Temporary View

val teenagersDF = spark.sql("SELECT name, age FROM people_view WHERE age BETWEEN 13 AND 19")
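► The view registration itself is not on the surviving slides; a sketch of how people_view would be created from the people DataFrame (assumed):

// Register the DataFrame as a temporary view so SQL can reference it by name.
people.createOrReplaceTempView("people_view")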
DataSet
Untyped API with DataFrame
Why DataSet?
► What is in Row?

// collectedPeople is an Array[Row]; each field must be cast by hand.
val collectedPeople = people.collect()

val collectedList = collectedPeople.map {
  row => (row(0).asInstanceOf[String], row(1).asInstanceOf[Int], row(2).asInstanceOf[Int])
}
DataSet
► Datasets can be thought of as typed distributed collections of data.
► The Dataset API unifies the DataFrame and RDD APIs.
► You can consider a DataFrame as an alias for Dataset[Row], where a Row is a generic untyped JVM object.

[http://why-not-learn-something.blogspot.com/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html]
Structured APIs in Spark
Creating DataSets

val ds1 = sc.parallelize(personSeq).toDS
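► personSeq is not defined on the surviving slides; a self-contained sketch under the assumption that it is a sequence of a Person case class matching the people.json fields:

case class Person(name: String, age: Long, id: Long)

import spark.implicits._

val personSeq = Seq(Person("Michael", 15, 12), Person("Andy", 30, 15))

// From an RDD:
val ds1 = spark.sparkContext.parallelize(personSeq).toDS

// Directly from a local collection:
val ds2 = personSeq.toDS

// Or from an existing DataFrame:
val ds3 = people.as[Person]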
DataSet Transformations
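► The slide's examples are lost; a sketch of typed transformations, which take Scala functions over the Person type rather than column expressions (assumptions as above):

// filter with a typed predicate; the compiler checks field names and types.
val teenagers = ds1.filter(p => p.age < 20)

// map to a strongly typed result.
val names = ds1.map(p => p.name)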
Structured Data Execution
Structured Data Execution Steps
► 1. Write DataFrame/Dataset/SQL code.
► 2. If the code is valid, Spark converts it to a logical plan.
► 3. Spark transforms this logical plan to a physical plan.
  • Checking for optimizations along the way.
► 4. Spark then executes this physical plan (RDD manipulations) on the cluster.
Logical Planning (1/2)
► The logical plan represents a set of abstract transformations.
► This plan is unresolved.
  • The code might be valid, but the tables/columns that it refers to might not exist.
► Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer.
Logical Planning (2/2)
Physical Planning
► The physical plan specifies how the logical plan will execute on the cluster.
Execution
Summary
References
Questions?