This project contains all of the code necessary to deploy and run PySpark using Docker on a Windows machine. The process for deploying and then running the container is as follows:
- Copy windows_install.bat to the remote machine, then execute:

  ```powershell
  net use x: \\computer-name\C$ /user:domain\username
  copy-item C:\docker\windows_install.bat x:\windows_install.bat
  psexec \\remote-machine c:\windows_install.bat
  ```

- The remote machine will restart; after you log in, the batch file that builds the container should start automatically
- Start the Spark master:

  ```sh
  /spark_home_directory/sbin/start-master.sh
  ```

  By default the master listens on port 7077 and prints its URL (spark://<host-name>:7077) in its startup log.

- Run the Docker container on the remote machine to start a worker and attach it to the master:

  ```sh
  docker run -P --net=host --add-host=moby:127.0.0.1 -it neighdough/spark ./sbin/start-slave.sh spark://<host-name>:7077
  ```
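Once the master and at least one worker are up, a PySpark driver can attach to the cluster by its master URL. The following is a minimal sketch, assuming the driver machine can reach port 7077 and using the same `<host-name>` placeholder as above; the app name is arbitrary:

```python
# Minimal sketch: attach a PySpark driver to the standalone cluster
# started above. Replace <host-name> with the master's hostname.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster('spark://<host-name>:7077')
        .setAppName('cluster-smoke-test'))  # app name is arbitrary
sc = SparkContext(conf=conf)

# Quick smoke test: distribute a small range and sum it on the workers
print(sc.parallelize(range(100)).sum())  # expected: 4950
sc.stop()
```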
To work with the container interactively, start a shell instead of a worker:

```sh
docker run -P --net=host --add-host=moby:127.0.0.1 -it neighdough/spark /bin/bash
```

To make a local data directory visible inside the container, mount it at /data:

```sh
docker run -v local/data/directory:/data -it neighdough/spark /bin/bash
```

From inside the container, start PySpark with IPython as the driver shell:

```sh
PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
```

A mounted file can then be processed as usual; for example, to build a list of [word, line-index] pairs:

```python
import re

exp = r'[^\w\s]'  # assumed pattern (strip punctuation); substitute your own regex

# Split each line into words after stripping the pattern, then pair it with its line index
text = sc.textFile('/data/data/sample-text.txt').map(
    lambda line: re.sub(exp, '', line).split(' ')
).zipWithIndex()

# Emit a [word, line-index] pair for every word on every line
text.flatMap(lambda x: [[i, x[1]] for i in x[0]]).collect()
```

To connect to a PostgreSQL database, download the JDBC driver (https://jdbc.postgresql.org/download.html) and place it in the SPARK_HOME/jars/ directory.
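With the driver jar in place, a table can be read over JDBC. The following is a minimal sketch, assuming a SparkSession is available (Spark 2.x+); the host, database, table, and credentials below are placeholders:

```python
# Minimal sketch: read a PostgreSQL table into a DataFrame over JDBC.
# dbhost, mydb, mytable, and the credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pg-read').getOrCreate()

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:postgresql://dbhost:5432/mydb')
      .option('dbtable', 'mytable')
      .option('user', 'username')
      .option('password', 'password')
      .option('driver', 'org.postgresql.Driver')
      .load())

df.show(5)
```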