This project contains all of the code necessary to deploy and run PySpark using Docker on a Windows machine. The process for deploying and then running the container is as follows:
- Copy windows_install.bat to the remote machine, then execute:

  ```powershell
  net use x: \\computer-name\C$ /user:domain\username
  copy-item C:\docker\windows_install.bat x:\windows_install.bat
  psexec \\remote-machine c:\windows_install.bat
  ```

- The remote machine will restart; after you log in, the batch file that builds the container should start automatically
- Start the Spark master:

  ```sh
  /spark_home_directory/sbin/start-master.sh
  ```

  By default the master listens on port 7077 and prints its URL (spark://<host-name>:7077) in its startup log.

- Run the Docker container on the remote machine to start a worker and attach it to the master:

  ```sh
  docker run -P --net=host --add-host=moby:127.0.0.1 -it neighdough/spark ./sbin/start-slave.sh spark://<host-name>:7077
  ```
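Once the master and at least one worker are up, a PySpark driver can attach to the cluster by its master URL. The following is a minimal sketch, assuming the driver machine can reach port 7077 and using the same `<host-name>` placeholder as above; the app name is arbitrary:

```python
# Minimal sketch: attach a PySpark driver to the standalone cluster
# started above. Replace <host-name> with the master's hostname.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster('spark://<host-name>:7077')
        .setAppName('cluster-smoke-test'))  # app name is arbitrary
sc = SparkContext(conf=conf)

# Quick smoke test: distribute a small range and sum it on the workers
print(sc.parallelize(range(100)).sum())  # expected: 4950
sc.stop()
```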
To work with the container interactively, start a shell instead of a worker:

```sh
docker run -P --net=host --add-host=moby:127.0.0.1 -it neighdough/spark /bin/bash
```

To make a local data directory visible inside the container, mount it at /data:

```sh
docker run -v local/data/directory:/data -it neighdough/spark /bin/bash
```

From inside the container, start PySpark with IPython as the driver shell:

```sh
PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
```

A mounted file can then be processed as usual; for example, to build a list of [word, line-index] pairs:

```python
import re

exp = r'[^\w\s]'  # assumed pattern (strip punctuation); substitute your own regex

# Split each line into words after stripping the pattern, then pair it with its line index
text = sc.textFile('/data/data/sample-text.txt').map(
    lambda line: re.sub(exp, '', line).split(' ')
).zipWithIndex()

# Emit a [word, line-index] pair for every word on every line
text.flatMap(lambda x: [[i, x[1]] for i in x[0]]).collect()
```

To connect to a PostgreSQL database, download the JDBC driver (https://jdbc.postgresql.org/download.html) and place it in the SPARK_HOME/jars/ directory.
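With the driver jar in place, a table can be read over JDBC. The following is a minimal sketch, assuming a SparkSession is available (Spark 2.x+); the host, database, table, and credentials below are placeholders:

```python
# Minimal sketch: read a PostgreSQL table into a DataFrame over JDBC.
# dbhost, mydb, mytable, and the credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pg-read').getOrCreate()

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:postgresql://dbhost:5432/mydb')
      .option('dbtable', 'mytable')
      .option('user', 'username')
      .option('password', 'password')
      .option('driver', 'org.postgresql.Driver')
      .load())

df.show(5)
```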