Running Awk in Parallel to Process 256M Records

I wrote a blog post about this work, and it drew some discussion on Hacker News.

This repo contains the artefacts from my solution to the Smoky Mountains Data Challenge 2018, which won first prize. Below I describe the approach, the method, and some interesting tidbits.

A PDF report is available in the /report folder.

SMC Data Challenge 4: Scientific Publications Mining

. To run the awk code:

awk -f prob2.awk stop_words.txt data_dir/*.txt
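The command above passes the stop-word list as the first file and the data files after it. A common awk idiom for that argument order is to load the first file into an array while `NR == FNR` holds, then process the remaining files against it. The sketch below illustrates only that idiom. It is an assumption about how a script invoked this way could be organised, not the contents of prob2.awk, and the sample files and words are made up.

```shell
#!/bin/sh
# Hypothetical sample data; not the challenge data set.
tmp=$(mktemp -d)
printf 'the\nof\na\n' > "$tmp/stop_words.txt"
printf 'the mining of a corpus\n' > "$tmp/a.txt"
printf 'a corpus of papers\n' > "$tmp/b.txt"

# NR == FNR is true only while awk reads its first file argument,
# so the first rule sees the stop-word list and nothing else.
counts=$(awk '
    NR == FNR { stop[$1] = 1; next }      # first file: remember stop words
    {
        for (i = 1; i <= NF; i++) {       # later files: count remaining words
            w = tolower($i)
            if (!(w in stop)) freq[w]++
        }
    }
    END { for (w in freq) print freq[w], w }
' "$tmp/stop_words.txt" "$tmp/a.txt" "$tmp/b.txt" | sort -k1,1nr -k2,2)

printf '%s\n' "$counts"
rm -rf "$tmp"
```

Note that the `NR == FNR` test misbehaves if the first file is empty, because the second file then also starts at record 1.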

. To compile the Swift code:

stc runprob2.swift # generates runprob2.tic

. To run Swift code:

turbine -n 340 runprob2.tic
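Here `-n 340` asks turbine to launch 340 processes. For readers without a Swift/T installation, the same split-then-merge shape can be tried on one machine with plain shell background jobs. The sketch below is a stand-in, not the Swift/T workflow in this repo: each input file gets its own awk process that writes partial counts, and a final awk sums them. The file names, the `count.awk` helper, and the sample text are all hypothetical.

```shell
#!/bin/sh
# Hypothetical inputs: four small files with identical content.
tmp=$(mktemp -d)
for i in 1 2 3 4; do
    printf 'awk swift awk\n' > "$tmp/part$i.txt"
done

# Per-file worker (hypothetical helper): emit "word count" pairs.
cat > "$tmp/count.awk" <<'EOF'
{ for (i = 1; i <= NF; i++) c[$i]++ }
END { for (w in c) print w, c[w] }
EOF

# Map: one background awk per input file, each writing its own output
# file so that results from different workers never interleave.
for f in "$tmp"/part*.txt; do
    awk -f "$tmp/count.awk" "$f" > "$f.counts" &
done
wait    # block until every worker has finished

# Reduce: sum the partial counts per word.
merged=$(cat "$tmp"/*.counts |
    awk '{ total[$1] += $2 } END { for (w in total) print w, total[w] }' |
    sort)

printf '%s\n' "$merged"
rm -rf "$tmp"
```

This starts one process per file with no cap on concurrency, which is fine for a handful of files but would need throttling for a large data directory.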

pebblefoot31/SMC18

Folders and files

Latest commit

History

Repository files navigation

Running Awk in Parallel to process 256M Records

A pdf report may be found in the /report folder.

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages