Prerequisites:
1. Anaconda installation
2. Java 1.8
3. Spark 2.3
1. Verify the installation
Log in to your EC2 instance as the root user.
a. Run the following command to verify the Anaconda installation
conda --version
b. Run the following command to verify the Java 8 installation
java -version
c. Run the following command to verify the Spark 2 installation
spark2-shell --version
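If you prefer to run all three checks at once, here is a small optional Python sketch; it simply shells out to the same three commands and assumes conda, java, and spark2-shell are on the root user's PATH.
import subprocess

# Run the same three version checks and print their output
for cmd in (["conda", "--version"],
            ["java", "-version"],
            ["spark2-shell", "--version"]):
    # java and spark2-shell print their version banner to stderr,
    # so redirect stderr into the captured output
    out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    print(" ".join(cmd))
    print(out.decode().strip())
    print()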
2. Configure Jupyter (one-time process)
Jupyter Notebook is already installed on our CDH instance along with Anaconda. However, we need to
configure it before we can actually run it.
a. Run the following command to generate the Jupyter configuration file.
jupyter notebook --generate-config
You can see jupyter_notebook_config.py has been created inside the /root/.jupyter directory.
b. Allow access to your remote Jupyter server
Open the jupyter_notebook_config.py file
vi .jupyter/jupyter_notebook_config.py
Press 'i' to enter insert mode.
Copy and paste the following two lines:
c.NotebookApp.allow_origin = '*' # allow all origins
c.NotebookApp.ip = '0.0.0.0' # listen on all IPs
To save the changes and exit:
> Press ‘Esc’ > Type :wq! > Hit Enter
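If you would rather not edit the file in vi, the same two settings can be appended with a short Python snippet. This is a minimal sketch; it assumes the config file was generated at the default location, /root/.jupyter/jupyter_notebook_config.py.
# Append the two Jupyter settings to the generated config file
config_path = "/root/.jupyter/jupyter_notebook_config.py"
with open(config_path, "a") as f:
    f.write("\nc.NotebookApp.allow_origin = '*'  # allow all origins\n")
    f.write("c.NotebookApp.ip = '0.0.0.0'  # listen on all IPs\n")
Either way, the end result is the same two lines at the bottom of the config file.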
3. Now, use the following command to run the Jupyter Notebook
jupyter notebook --port 7861 --allow-root
You will see that the Jupyter server has started.
Note: Don't kill this process; keep it running until you are done using your Jupyter notebook.
To stop/close your Jupyter Notebook, open this terminal and press Ctrl + C.
You can see it has given you a URL where your Jupyter notebook is running. In my case the
URL is
http://(ip-10-0-0-228.ec2.internal or 127.0.0.1):7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
In order to open the Jupyter notebook in your browser, replace the hostname part of the URL
(ip-10-0-0-228.ec2.internal or 127.0.0.1) with the public IP of your EC2 instance.
In my case, the final URL will be-
http://3.89.129.54:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
Now, you can open your web browser and copy-paste your final URL to open the Jupyter
notebook. You should be able to see the Jupyter home page.
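If you find the manual substitution error-prone, the purely illustrative snippet below does the same replacement in Python; the hostname and public IP are just the example values from this guide, so substitute your own.
# Replace the internal hostname in the server URL with the instance's public IP
server_url = "http://ip-10-0-0-228.ec2.internal:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005"
public_ip = "3.89.129.54"
final_url = server_url.replace("ip-10-0-0-228.ec2.internal", public_ip)
print(final_url)  # http://3.89.129.54:7861/?token=...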
4. PySpark on Jupyter
In order to run PySpark in a Jupyter notebook, you need to paste and run the following
few lines of code in every PySpark notebook.
import os
import sys

# Python interpreter that PySpark should use (the Anaconda parcel's python)
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
# Java 8 installation used by Spark
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
# Spark 2 parcel shipped with CDH
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the bundled py4j and pyspark packages importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
Note: The values of the above environment variables may be different in your case; we suggest
you verify them before using them. Please follow the steps below to do so (a consolidated
verification sketch follows after point iv).
i. Anaconda and Python
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
Make sure Anaconda is installed in the /opt/cloudera/parcels/ directory. To verify, check
whether the Anaconda directory is present under the /opt/cloudera/parcels/ path.
ls /opt/cloudera/parcels/
ii. Java Home
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
Run the following command to get your JAVA_HOME path:
echo $JAVA_HOME
Hence, /usr/java/jdk1.8.0_161/jre will be the value of our JAVA_HOME variable.
iii. Spark Home
For the Spark home path in the following line:
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
The name of this Spark parcel directory must be exactly the same as the one present under your
/opt/cloudera/parcels/ directory.
Run the following command to check/verify it, and replace the value in case you have a different
version/distribution installed.
ls /opt/cloudera/parcels/
iv. Version of py4j and pyspark.zip file
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
After you have identified the correct Spark home path, this step verifies the version of the
py4j file.
For that, run the following command (make sure to modify it according to the Spark home you
identified in point iii).
ls /opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib
You should be able to see the py4j and pyspark zip files. Make sure you modify the file names
in the above code according to your instance.
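To save switching between the terminal and the notebook, the following sketch runs the same four checks from a notebook cell. The parcel paths are the ones assumed in this guide; adjust them if your layout differs.
import glob
import os

# i. Anaconda python from the parcel
print(os.path.exists("/opt/cloudera/parcels/Anaconda/bin/python"))

# ii. JAVA_HOME as seen by the notebook process (compare with `echo $JAVA_HOME`)
print(os.environ.get("JAVA_HOME"))

# iii. Exact name of the Spark 2 parcel under /opt/cloudera/parcels/
print(glob.glob("/opt/cloudera/parcels/SPARK2*"))

# iv. py4j and pyspark zip files under <SPARK_HOME>/python/lib
spark_dirs = glob.glob("/opt/cloudera/parcels/SPARK2*/lib/spark2")
if spark_dirs:
    print(os.listdir(os.path.join(spark_dirs[0], "python", "lib")))
Use the printed parcel name and zip file names to adjust the SPARK_HOME and sys.path lines above if they differ.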
5. Test your PySpark setup
a. Open a new Jupyter Notebook
b. Copy-paste the environment-variable code that you finalized in step 4 into a cell.
(You need to do this for every PySpark notebook you create.)
c. Run the cell; you should not see any errors.
d. Now, let's initialize the SparkContext object. Copy-paste the following code into a new cell
> Run the cell
from pyspark import SparkContext, SparkConf
# Create a SparkContext that runs Spark on YARN in client mode
conf = SparkConf().setAppName("jupyter_Spark").setMaster("yarn-client")
sc = SparkContext(conf=conf)
sc
You should be able to see the SparkContext details in the output.
This means your PySpark setup is working fine.
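Optionally, you can run a trivial job to confirm that executors actually start on the cluster. This is just a small sanity check; sc is the SparkContext created above.
# Run a tiny distributed job and print the Spark version
rdd = sc.parallelize(range(1, 101))
print(rdd.sum())   # expected: 5050
print(sc.version)  # should report your Spark 2.3 version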
6. Closing Jupyter Notebook
To stop the notebook, open the terminal/PuTTY window where the Jupyter process is running and
press Ctrl + C.
Enter 'y' to shut down the Jupyter server.