Prerequisites:
1. Anaconda installation
2. Java 1.8
3. Spark 2.3
1. Verify the installation
Log in to your EC2 instance as the root user.
a. Run the following command to verify the Anaconda installation
conda --version
b. Run the following command to verify the Java 8 installation
java -version
c. Run the following command to verify the Spark 2 installation
spark2-shell --version
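If you prefer to run all three checks at once, here is a small optional Python sketch; it simply shells out to the same three commands and assumes conda, java, and spark2-shell are on the root user's PATH.
import subprocess

# Run the same three version checks and print their output
for cmd in (["conda", "--version"],
            ["java", "-version"],
            ["spark2-shell", "--version"]):
    # java and spark2-shell print their version banner to stderr,
    # so redirect stderr into the captured output
    out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    print(" ".join(cmd))
    print(out.decode().strip())
    print()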
2. Configure Jupyter (one-time process)
Jupyter Notebook is already installed on our CDH instance along with Anaconda. However, we need to
configure it before we can actually run it.
a. Run the following command to generate the Jupyter configuration file.
jupyter notebook --generate-config
You can see jupyter_notebook_config.py has been created inside the /root/.jupyter directory.
b. Allow access to your remote Jupyter server
Open the jupyter_notebook_config.py file
vi .jupyter/jupyter_notebook_config.py
Press 'i' to enter insert mode.
Copy and paste the following two lines:
c.NotebookApp.allow_origin = '*' # allow all origins
c.NotebookApp.ip = '0.0.0.0' # listen on all IPs
To save the changes and exit:
> Press ‘Esc’ > Type :wq! > Hit Enter
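If you would rather not edit the file in vi, the same two settings can be appended with a short Python snippet. This is a minimal sketch; it assumes the config file was generated at the default location, /root/.jupyter/jupyter_notebook_config.py.
# Append the two Jupyter settings to the generated config file
config_path = "/root/.jupyter/jupyter_notebook_config.py"
with open(config_path, "a") as f:
    f.write("\nc.NotebookApp.allow_origin = '*'  # allow all origins\n")
    f.write("c.NotebookApp.ip = '0.0.0.0'  # listen on all IPs\n")
Either way, the end result is the same two lines at the bottom of the config file.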
3. Now, use the following command to run the Jupyter Notebook
jupyter notebook --port 7861 --allow-root
You will see that the Jupyter server has started.
Note: Don't kill this process; keep it running until you are done using your Jupyter notebook.
To stop/close your Jupyter Notebook, open this terminal and press Ctrl + C.
You can see it has given you a URL where your Jupyter notebook is running. In my case the
URL is
http://(ip-10-0-0-228.ec2.internal or 127.0.0.1):7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
In order to open the Jupyter notebook in your browser, replace the hostname part of the URL
(ip-10-0-0-228.ec2.internal or 127.0.0.1) with the public IP of your EC2 instance.
In my case, the final URL will be-
http://3.89.129.54:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
Now, you can open your web browser and copy-paste your final URL to open the Jupyter
notebook. You should be able to see the Jupyter home page.
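If you find the manual substitution error-prone, the purely illustrative snippet below does the same replacement in Python; the hostname and public IP are just the example values from this guide, so substitute your own.
# Replace the internal hostname in the server URL with the instance's public IP
server_url = "http://ip-10-0-0-228.ec2.internal:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005"
public_ip = "3.89.129.54"
final_url = server_url.replace("ip-10-0-0-228.ec2.internal", public_ip)
print(final_url)  # http://3.89.129.54:7861/?token=...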
4. PySpark on Jupyter
In order to run PySpark in a Jupyter notebook, you need to paste and run the following
few lines of code in every PySpark notebook.
import os
import sys

# Python interpreter that PySpark should use (the Anaconda parcel's python)
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
# Java 8 installation used by Spark
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
# Spark 2 parcel shipped with CDH
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the bundled py4j and pyspark packages importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
Note: The values of the above environment variables may be different in your case; we suggest
you verify them before using them. Please follow the steps below to do so (a consolidated
verification sketch follows after point iv).
i. Anaconda and Python
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
Make sure Anaconda is installed in the /opt/cloudera/parcels/ directory. To verify, check
whether the Anaconda directory is present under the /opt/cloudera/parcels/ path.
ls /opt/cloudera/parcels/
ii. Java Home
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
Run the following command to get your JAVA_HOME path:
echo $JAVA_HOME
Hence, /usr/java/jdk1.8.0_161/jre will be the value of our JAVA_HOME variable.
iii. Spark Home
For the Spark home path in the following line:
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
The name of this Spark parcel directory must be exactly the same as the one present under your
/opt/cloudera/parcels/ directory.
Run the following command to check/verify it, and replace the value in case you have a different
version/distribution installed.
ls /opt/cloudera/parcels/
iv. Version of py4j and pyspark.zip file
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
After you have identified the correct Spark home path, this step verifies the version of the
py4j file.
For that, run the following command (make sure to modify it according to the Spark home you
identified in point iii).
ls /opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib
You should be able to see the py4j and pyspark zip files. Make sure you modify the file names
in the above code according to your instance.
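To save switching between the terminal and the notebook, the following sketch runs the same four checks from a notebook cell. The parcel paths are the ones assumed in this guide; adjust them if your layout differs.
import glob
import os

# i. Anaconda python from the parcel
print(os.path.exists("/opt/cloudera/parcels/Anaconda/bin/python"))

# ii. JAVA_HOME as seen by the notebook process (compare with `echo $JAVA_HOME`)
print(os.environ.get("JAVA_HOME"))

# iii. Exact name of the Spark 2 parcel under /opt/cloudera/parcels/
print(glob.glob("/opt/cloudera/parcels/SPARK2*"))

# iv. py4j and pyspark zip files under <SPARK_HOME>/python/lib
spark_dirs = glob.glob("/opt/cloudera/parcels/SPARK2*/lib/spark2")
if spark_dirs:
    print(os.listdir(os.path.join(spark_dirs[0], "python", "lib")))
Use the printed parcel name and zip file names to adjust the SPARK_HOME and sys.path lines above if they differ.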
5. Test your PySpark setup
a. Open a new Jupyter Notebook
b. Copy-paste the environment-variable code that you finalized in step 4 into a cell.
(You need to do this for every PySpark notebook you create.)
c. Run the cell; you should not see any errors.
d. Now, let's initialize the SparkContext object. Copy-paste the following code into a new cell
> Run the cell
from pyspark import SparkContext, SparkConf
# Create a SparkContext that runs Spark on YARN in client mode
conf = SparkConf().setAppName("jupyter_Spark").setMaster("yarn-client")
sc = SparkContext(conf=conf)
sc
You should be able to see the SparkContext details in the output.
This means your PySpark setup is working fine.
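Optionally, you can run a trivial job to confirm that executors actually start on the cluster. This is just a small sanity check; sc is the SparkContext created above.
# Run a tiny distributed job and print the Spark version
rdd = sc.parallelize(range(1, 101))
print(rdd.sum())   # expected: 5050
print(sc.version)  # should report your Spark 2.3 version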
6. Closing Jupyter Notebook
To stop the notebook, open the terminal/PuTTY window where the Jupyter process is running and
press Ctrl + C.
Enter 'y' to shut down the Jupyter server.