
neighdough/docker


Description

This project contains all of the code necessary to deploy and run PySpark using Docker on a Windows machine. The process for deploying and then running the container is as follows:

  1. Copy windows_install.bat to the remote machine, then execute:
net use x: \\computer-name\C$ /user:domain\username
copy-item C:\docker\windows_install.bat x:\windows_install.bat
psexec \\remote-machine c:\windows_install.bat
  2. The remote machine will restart; after you log in, the batch file that builds the container should start automatically.
  3. Start the Spark master:
/spark_home_directory/sbin/start-master.sh
  4. Run the Docker container on the remote machine, either starting a Spark worker or opening an interactive shell (a quick connectivity check follows this list):
docker run -P --net=host --add-host=moby:127.0.0.1 -it neighdough/spark ./sbin/start-slave.sh spark://<host-name>:7077
docker run -P --net=host --add-host=moby:127.0.0.1 -it neighdough/spark /bin/bash
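
Once the worker has registered, a short PySpark job run against the master confirms the cluster is healthy. This is a minimal sketch; spark://<host-name>:7077 is the same placeholder master URL used above:

from pyspark import SparkConf, SparkContext

# Point at the standalone master started in step 3; <host-name> is a
# placeholder for the master's actual host name.
conf = (SparkConf()
        .setMaster('spark://<host-name>:7077')
        .setAppName('cluster-smoke-test'))
sc = SparkContext(conf=conf)

# Trivial job: if a worker is attached, this prints 4950
print(sc.parallelize(range(100)).sum())
sc.stop()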

Sample Commands

Run the container, mount a local data directory, and start an interactive shell:

docker run -v local/data/directory:/data -it neighdough/spark /bin/bash

Inside the container, set IPython as the PySpark driver and then run PySpark:

PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark

Run the container and publish all ports:

docker run -P --net=host --add-host=moby:127.0.0.1 -it neighdough/spark /bin/bash

Sample PySpark commands

Inside PySpark, tokenize a text file from the mounted /data directory and pair each word with its line number:

import re

# exp is assumed here: a pattern that strips punctuation before splitting
exp = r'[^\w\s]'

# Pair each line's list of words with its line number
text = sc.textFile('/data/data/sample-text.txt').map(
            lambda line: re.sub(exp, '', line).split(' ')
            ).zipWithIndex()

# Flatten into [word, line_number] pairs
text.flatMap(lambda x: [[i, x[1]] for i in x[0]]).collect()
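
collect() returns the flattened result to the driver as a list of [word, line_number] pairs, which can serve as the starting point for a simple inverted index.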

General Information

To connect to a PostgreSQL database, download the JDBC driver (https://jdbc.postgresql.org/download.html) and place it in the SPARK_HOME/jars/ directory.
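
With the driver in place, a table can be read into a DataFrame over JDBC. A minimal sketch, assuming an active SparkSession named spark; the host, database, table, and credentials below are placeholders:

df = (spark.read
      .format('jdbc')
      .option('url', 'jdbc:postgresql://db-host:5432/mydb')  # placeholder host/database
      .option('dbtable', 'public.sample_table')              # placeholder table
      .option('user', 'username')                            # placeholder credentials
      .option('password', 'password')
      .option('driver', 'org.postgresql.Driver')
      .load())
df.show(5)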
