Codestin Search App

#Pyspark - KMeans

The following example uses KMeans clustering to produce geographic Lat/Long centers based on purchases. Steps are outlined below to setup the local environment to run the example. Check out the Spark docs for the lastest functionality.

Note: using Spark 1.2.0 package with CDH 5.3

###Steps

The following directions assume you have a Spark environment up and running. If not, go here Spark VM.

###Setup Spark Environment

I grabbed sample from the watson_bisum_deals to import into my Local Hive table. I logged onto i75 and pulled the sample data into my home directory. From there use rsync or scp to pull to your local machine.

A. Getting sample Data

hive -e 'select * from watson_bisum_deals limit 10000' > deals_sample_data.tsv

B. Download the file

Note: Copying the file to the /home/hadmindirectory inside the VM.

scp -3 {user name}@{server name}:{path to sample file} hadmin@hadoop-manager:/home/hadmin

I set up a local hive table to mimic a production table 'watson_bisum_deals'. Run from inside the VM.

A. Use the create_watson_bisum_deals.sql file to create a local hive table(run as hadmin).

hive -f /home/hadmin/kmeans_example/create_watson_bisum_deals.sql

B. Import the sample data file into the table. Use the load_data.sql file.

hive -f /home/hadmin/kmeans_example/load_data.sql

C. Check Hive to sure everything looks good.

Is the Table there?

hive -e 'SHOW TABLES'

Check data:

hive -e 'select * from watson_bisum_deals limit 5'

###Running Spark

Log into the VM

ssh hadmin@hadoop-manager

Use the spark-submit utility to run the app(always run as hadmin user).

spark-submit {path to file}/kmeans.py

####Notes about the Running of the Spark App

Currently the app uses two centers in order the cluster the data points. The purchases are filtered by the 'Beauty/Health' category so results find the center lat/long based on 'Beauty/Health' purchases
Data for the input are pulled from your local hive table 'watson_bisum_deals'
The results are the Lat/Long of two centers where the data points are clustered
The results are written to HDFS temporarily(tab delimited file) in the /tmp/{output_dir}/{unix_timestamp}. The {output_dir} is set by the user while the {unix_timestamp} is a dynamic directory in order to avoid errors. **Note: the {output_dir} must exist prior to running to avoid errors.
The results are finally loaded into Hive. The table name is set by the user {hive_table}

###Using Pyspark Repl

The code contained with the 'kmeans.py' file can also be executed use the pyspark repl. This utility acts just like a python repl, and is good just to test and experiment

Start the repl:

pyspark

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
collaborative_filtering_example		collaborative_filtering_example
data		data
kmeans_example		kmeans_example
lib		lib
README.md		README.md
avro_inputformat.py		avro_inputformat.py
filter_count.py		filter_count.py
pagerank.py		pagerank.py
sort.py		sort.py
spark_sql.py		spark_sql.py
wordcount.py		wordcount.py
wordcount_lzo_file.py		wordcount_lzo_file.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages