val x = sc.parallelize(List("spark rdd example", "sample example"))
val x = sc.parallelize(List("spark rdd example", "sample example"), 2)
x.collect()
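To check how the list was split across partitions, a quick sketch (glom gathers each partition into an array):
x.getNumPartitions   // 2 when created with the second argument above
x.glom().collect()   // e.g. Array(Array(spark rdd example), Array(sample example))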
val textFileLocalTest = sc.textFile("/Users/syedrizvi/Desktop/HadoopExamples/file.txt")
val textFile = sc.textFile("hdfs://localhost:9000/test.txt")
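A quick sanity check that the file loads (textFile is lazy, so nothing is read until an action such as count runs):
textFile.count()   // number of lines in the file
textFile.first()   // first line of the file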
FlatMap
val x = sc.parallelize(List("spark rdd example", "sample example"))
val y = x.flatMap(x => x.split(" "))
Map
val z = y.map(word => (word, 1))
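Chaining the flatMap and map above with reduceByKey (covered below) completes the classic word count; the expected output assumes the input list above:
val counts = z.reduceByKey(_ + _)
counts.collect()   // Array((spark,1), (rdd,1), (example,2), (sample,1)), order may vary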
Filter
val x = sc.parallelize(1 to 10)
Or with the number of partitions specified
val x = sc.parallelize(1 to 10, 2)
val y = x.filter(num => num%2==0)
y.collect()
Reduce
val x = sc.parallelize(1 to 10, 2)
val y = x.reduce((a, b) => (a+b))
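fold is a close relative of reduce that takes an explicit zero value; a minimal sketch on the same RDD:
val sum = x.fold(0)((a, b) => a + b)   // 55, same result as the reduce above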
Pair RDD Operations
GroupBy
val x = sc.parallelize(Array("Joseph", "Jimmy", "Tina","Thomas", "James", "Cory","Christine", "Jackeline",
"Juan"))
val y = x.groupBy(word => word.charAt(0))
y.collect();
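groupBy returns one (key, Iterable) pair per first letter; converting the values to lists makes the output easier to read (a sketch):
y.mapValues(_.toList).collect()
// e.g. Array((T,List(Tina, Thomas)), (C,List(Cory, Christine)), (J,List(Joseph, Jimmy, James, Jackeline, Juan)))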
ReduceByKey
val x = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1),("a", 1), ("b", 1),("b", 1),("b", 1), ("b", 1)))
val y = x.reduceByKey((a, b) => a + b)
y.collect()
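Since every value here is 1, countByKey gives the same totals directly as a local Map (a sketch):
x.countByKey()   // Map(a -> 3, b -> 5); the reduceByKey above yields Array((a,3), (b,5))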
SortByKey
val y = x.sortByKey()
y.collect()
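sortByKey also takes an ascending flag; a sketch for descending order:
val desc = x.sortByKey(ascending = false)
desc.collect()   // keys from b down to a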
Joins
val salesprofit = sc.parallelize(Array(("Cadbury's", 3.5), ("Nestle", 2.8), ("Mars", 2.5), ("Thorton's", 2.2)))
val salesyear = sc.parallelize(Array(("Cadbury's", 2015), ("Nestle", 2014), ("Mars", 2014), ("Thorton's", 2013)))
val join = salesprofit.join(salesyear)
join.collect()
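join keeps only keys present in both RDDs, pairing the values as (profit, year) tuples. leftOuterJoin keeps every key from the left side, wrapping the right-hand value in an Option (a sketch):
val leftJoin = salesprofit.leftOuterJoin(salesyear)
leftJoin.collect()   // e.g. Array((Mars,(2.5,Some(2014))), ...)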
Spark SQL
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("/Users/syedrizvi/Desktop/HadoopExamples/Spark/sample.json")
df.show()
df.printSchema()
df.select("name").show()
df.select(df("name"), df("age") + 1).show()
df.filter(df("age") > 21).show()
df.groupBy("age").count().show()
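DataFrames can be persisted and re-read; a minimal sketch using Parquet (the output path is illustrative):
df.write.parquet("/tmp/people.parquet")
val dfBack = sqlContext.read.parquet("/tmp/people.parquet")
dfBack.show()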
Creating Temp Views
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
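Temp views are scoped to the session; a global temp view is shared across sessions and is addressed through the global_temp database (a sketch):
df.createGlobalTempView("people_global")
spark.sql("SELECT * FROM global_temp.people_global").show()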
Creating Datasets on the Fly
case class Person(name: String, age: Long)
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect()
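A DataFrame can also be converted to a typed Dataset with as[T]; this sketch assumes the sample.json above has name and age fields matching Person:
val peopleDS = df.as[Person]
peopleDS.show()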
Creating Schemas with Reflection
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
case class Person(name: String, age: Long)
val peopleDF = spark.sparkContext.textFile("/Users/syedrizvi/Desktop/HadoopExamples/Spark/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
peopleDF.createOrReplaceTempView("people")
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
Interacting with Hive
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val warehouseLocation = "spark-warehouse"
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
import spark.sql
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH '/Users/syedrizvi/Desktop/HadoopExamples/Spark/kv1.txt' INTO TABLE
src")
sql("SELECT * FROM src").show()
sql("select current_database()").show(false)
Spark Streaming
To run the bundled example
First start netcat as a simple data server:
nc -lk 9999
Then, in a separate terminal, run the example:
/usr/local/Cellar/apache-spark/2.1.0/bin/run-example streaming.NetworkWordCount localhost 9999
Your own word count
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
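To count over a sliding window instead of per batch, reduceByKeyAndWindow can replace the plain reduceByKey; a sketch (set up before ssc.start(), with a 30-second window sliding every 10 seconds, both multiples of the 1-second batch interval):
val windowedCounts = words.map(x => (x, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()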