0% found this document useful (0 votes)

82 views72 pages

Cluster ParallelTechniques PDF

The document discusses parallel programming concepts in R. It covers different frameworks for parallelization in R including snow, foreach, and multicore. Snow uses a master-worker model with functions like parLapply to parallelize tasks. Foreach provides a for-loop style interface using %dopar% to parallelize across nodes. Multicore only works for multicore machines and uses mclapply to parallelize tasks across cores. Examples are given of using these frameworks to parallelize tasks and distribute data across nodes for processing. Good practices like testing, portability, and error handling are recommended.

Uploaded by

Mark Castillo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

82 views72 pages

Cluster ParallelTechniques PDF

Uploaded by

Mark Castillo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 72

Parallel Programming in R PS236b Spring 2010

Mark Huberty

February 24, 2010

Parallel Programming in R PS236b Spring 2010

Concepts Parallelization in R snow foreach multicore Three benchmarks Considerations Good habits: testing, portability, and error handling The Political Science Cluster GPU Computing

Parallel Programming in R PS236b Spring 2010 Concepts

Parallelization: basic concepts

Parallel Programming in R PS236b Spring 2010 Concepts

The commodity computing revolution

Parallel Programming in R PS236b Spring 2010 Concepts

The fast computing revolution

Parallel Programming in R PS236b Spring 2010 Concepts

The three aspects of the revolution

Cheap computing now comes in three forms: Many cores on the same chip Many chips in the same computer Many computers joined with high-speed connections We can refer to any of these parallel processing units as nodes.

Parallel Programming in R PS236b Spring 2010 Concepts

Kinds of parallelization
Taking advantage of these forms of compute power requires code that can do one of several kinds of parallelization: Bit -based parallelization Already have this: the move up the chain via 4-8-16-32-64 bit machines changes the number of steps required to run a single instruction Instruction-based parallelization Processor/program layer Data-based parallelization Decompose large data structures into independent chunks, on which you perform the same operation Task - based parallelization Perform different, independent tasks on the same data For R, we are mostly interested in data and task parallelization

Parallel Programming in R PS236b Spring 2010 Concepts

Data Parallelization, contd

Data parallelization is very common: Bootstrapping: Sample N times from data D and apply function F to each sample Genetic matching: generate N realizations of matches between groups T and C and calculate the balance on each; repeat for G generations Monte Carlo simulations Google does a ton of this kind of work via its MapReduce framework

Parallel Programming in R PS236b Spring 2010 Concepts

Task parallelization

Task parallelization is a little less obvious. Ideas include: Given N possible estimators of treatment effect , test all against data set D Machine learning: given N different classication schemes for some data set D , generate some test statistics S for all of them

Parallel Programming in R PS236b Spring 2010 Concepts

Data parallelization: an overview

So implement data parallelization, we must: Conceptualize the problem as a set of operations against independent data sets Break up this set of operations into independent components Assign each component to a node for processing Collect the output of each component and return it to the master process

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

For the hardware geeks: Multiple cores or servers

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

For the hardware geeks: Multiple cores or servers Some means to connect them

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

For the hardware geeks: Multiple cores or servers Some means to connect them A way to communicate among them
Sockets MPI (Message Passing Interface)

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

For the hardware geeks: Multiple cores or servers Some means to connect them A way to communicate among them
Sockets MPI (Message Passing Interface)

A means of sharing programs and data

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

For the hardware geeks: Multiple cores or servers Some means to connect them A way to communicate among them
Sockets MPI (Message Passing Interface)

A means of sharing programs and data A framework to organize the division of tasks and the collection of results

Parallel Programming in R PS236b Spring 2010 Concepts

How fast can we get?

For some program containing a function F that will operate on some data set D, decompose D into di , i N and perform F on each, splitting the tasks across M nodes For the program containing F , the maximum gain from parallelization is given by Amdahls Law. For a program with P percent parallizability running on M nodes: S= 1 1P +
P M

(1)

Parallel Programming in R PS236b Spring 2010 Concepts

The Political Science Compute Cluster

Heres what we have: 9 2-chip Opteron 248 servers Gigabit ethernet interconnects OpenMPI message passing A Network File System

Parallel Programming in R PS236b Spring 2010 Parallelization in R

Parallelization in R

Parallel Programming in R PS236b Spring 2010 Parallelization in R

Frameworks for parallelization

R has several frameworks to manage data parallelization. Three mature and effective ones are: snow, which uses the apply model for task division foreach, which uses a for loop model for division multicore, which is only suitable for the many-cores hardware model There are several other possibilities (nws, mapreduce, pvm) at different levels of obsolescence or instability.

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

The snow model

snow is a master/worker model: From N nodes create 1 master and N 1 workers1 ; farm jobs out to the workers. This is a little weird when using MPI-based systems where the nodes are undifferentiated; keep this in mind when using MPI for R jobs.

parlance uses the master/slave terminology, which has been abandoned for obvious reasons

1 Older

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

snow and parallelization

The snow library makes parallization straightforward: Create a cluster (usually with either sockets or MPI) Use parallel versions of the apply functions to run stuff across the nodes of the cluster So this is pretty easy: we already know how to use apply

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Cluster creation in snow

See the clusterCreate function in the clusterSetup.R code.

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

snow example

1 ## assuming you ve a l r e a d y c r e a t e d a c l u s t e r c l : 2 m < matrix ( rnorm ( 1 6 ) , 4 , 4 ) 3 c l u s t e r E x p o r t ( m ) 4 parSapply ( c l , 1 : n c o l s (m) , function ( x ) { 5 mean (m[ , x ] ) 6 } 7 ) Notice there that parSapply has replaced sapply, but nothing much else has changed.

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Data vs. Task parallelization with snow

Data: 1 2 3 4 5 6 7 p a r L a p p l y ( c l , 1 : nSims , function ( x ) { n < dim ( data ) [ 1 ] simdata < data [ sample ( 1 : n , n , replace =TRUE ) , ] o u t < myfunc ( simdata ) return ( out ) } ) Task: 1 2 3 4 5 f u n c l i s t < l i s t ( func1 , func2 , func3 ) p a r L a p p l y ( c l , 1 : length ( f u n c l i s t ) , function ( x ) { o u t < f u n c l i s t [ [ x ] ] ( data ) return ( out ) }

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Object management in snow

snow requires some additional object management: Libraries must be called on all nodes Data objects must be exported to all nodes

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Object management in snow

snow requires some additional object management: Libraries must be called on all nodes Data objects must be exported to all nodes As in: 1 2 3 4 ## Given a c l u s t e r c a l l e d c l m < matrix ( rnorm ( 1 0 0 ) , ncol =10 , nrow =10) c l u s t e r E x p o r t ( c l , m ) c l u s t e r E v a l Q ( c l , l i b r a r y ( Matching ) ) Notice that this doesnt apply to objects created inside a call to the cluster (i.e. parLapply().

Parallel Programming in R PS236b Spring 2010 Parallelization in R foreach

foreach and parallelization

REvolution Computing released the foreach libraries. To use them, you install: foreach doSNOW for using snow clusters doMPI for working directly with MPI doMC for use on multicore machines The basic idea: looks like a for loop, performs like an apply, and portable

Parallel Programming in R PS236b Spring 2010 Parallelization in R foreach

foreach example
1 ## Load t h e l i b r a r i e s . I assume I m on an MPIbased 2 ## c l u s t e r ; o t h e r o p t i o n s are doSNOW and doMC 3 l i b r a r y ( f o r e a c h ) ; l i b r a r y ( doMPI ) 4 5 ## Get t h e c l u s t e r c o n f i g u r a t i o n . 6 c l < s t a r t C l u s t e r ( ) 7 8 ## T e l l doMPI t h a t t h e c l u s t e r e x i s t s 9 re g i st e r D o M P I ( c l ) 10 11 m < matrix ( rnorm ( 1 6 ) , 4 , 4 ) 12 13 ## Run t h e f o r l o o p t o c a l c u l a t e t h e column means 14 f o r e a c h ( i =1: ncol (m) ) %dopar% { 15 mean (m[ , i ] ) # makes m a v a i l a b l e on nodes 16 }

Parallel Programming in R PS236b Spring 2010 Parallelization in R foreach

foreach continued
Notice the important bit: ## Run t h e f o r l o o p t o c a l c u l a t e t h e mean ## o f each column i n m. f o r e a c h ( i =1: ncol (m) ) %dopar% { mean (m[ , i ] ) # makes m a v a i l a b l e on nodes } Here, the foreach term controls the repetition, but the %dopar% term does the actual work. If this werent running on a cluster, you would use %do% instead

Parallel Programming in R PS236b Spring 2010 Parallelization in R multicore

multicore and parallelization

multicore only works on chips with > 1 core (dual-core, quad-core, etc). Its basic function is mclapply: 1 library ( multicore ) 2 m < matrix ( rnorm ( 1 6 ) , 4 , 4 ) 3 4 ## By d e f a u l t t h e f u n c t i o n f i n d s a l l cores on 5 ## t h e c h i p and uses them a l l 6 mclapply ( 1 : n c o l s (m) , function ( x ) { 7 mean (m[ , x ] ) 8 }, 9 mc . cores = getOption ( cores ) 10 ) 11 ## N o t i c e t h e l a s t argument ; can s p e c i f y a 12 ## number o f cores i f d e s i r e d

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

Three benchmarks, three lessons

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

Parallel computing on your laptop

New laptops today will almost certainly come with dual-core chips.

Usual speed gains run around 50%

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

Parallel computing on your laptop

New laptops today will almost certainly come with dual-core chips.
Benchmarked code for MBP 13, 2.2gHz, 4gb RAM

Compute time (s)

Bootstrap (1 core)

Bootstrap (2 cores)

Matching (1 core)

Matching (2 cores)

10 October 2009

Usual speed gains run around 50%

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

Matrix multiplication
Results for:
1 2 3 4 5 6 7 8 9 10 11 12 13 l i b r a r y ( snow ) t e s t m a t < matrix ( rnorm (10000000) , ncol =10000) mm. s e r i a l < system . time ( t e s t m a t %% t ( t e s t m a t ) ) t e s t m a t . t < t ( t e s t m a t ) source ( setupCode . R ) clusterCreate ( ) clusterExport ( cl , c ( testmat , testmat . t ) ) mm. p a r a l l e l < system . time ( parMM ( c l , t e s t m a t , t e s t m a t . t ) ) save . image ( mm. r e s u l t s . RData ) clusterShutdown ( )

mm.parallel \mm.serial = 0.6 for an 8-node cluster

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

Parallelization and speed: an example

So why choose one? Lets look at speed and features. The idea: write the same bootstrap as a serial job, a snow job, and a foreach job, and see what we get. All the code is available at http://pscluster.berkeley.edu

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

Results: timing

Table: Benchmark results for 500 repetitions of a 1000-trial bootstrap for different coding methods; parallel methods use 8 nodes Mean time (s) 176.4 33.4 27.3 27.2 Pct of Serial time 100.0 18.9 15.5 15.4 2.5 pct CI 171.8 32.9 26.8 26.7 97.5 pct CI 181.3 33.5 27.8 27.8

Serial parSapply foreach, snow foreach, dompi

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

Results: distributions
Variation in compute times for bootstrap, 500 repetitions of 1000 trials
parSapply foreach, snow foreach, doMPI

Variation in compute times for the serial bootstrap, 500 repetitions of 1000 trials
0.15

30 Compute time (s)

0.00

0.05

0.10

170

175 Compute time (s)

180

185

Figure: Benchmark distributions for serial and parallel bootstraps

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks

doMPI vs. doSNOW

Given the identical performance, why choose one vs. the other? doMPI is not compatible with an environment that has a snow cluster running in it. Thus use doMPI when running things without snow, and doSNOW when combining code that requires snow with foreach-based routines.

Parallel Programming in R PS236b Spring 2010 Considerations

Parallelization and data structures

Lists dont get much use in generic R but are very helpful for parallelization: For N data sets of the same format, do some identical analysis A on each of them

Parallel Programming in R PS236b Spring 2010 Considerations

Parallelization and data structures

Lists dont get much use in generic R but are very helpful for parallelization: For N data sets of the same format, do some identical analysis A on each of them
Solution:
1. Create some list L of length N , containing all the data sets; 2. then loop across the list N times (using P chips) 3. and apply function A to each data set

Parallel Programming in R PS236b Spring 2010 Considerations

Parallelization and data structures

Can be referred to as a map/reduce problem R has a mapReduce package that claims to do this; but its basically just an parLapply with more overhead.

Parallel Programming in R PS236b Spring 2010 Considerations

Some complications

There are a few more issues to worry about: Memory management:

Processors parallelize but memory does not: the master node still has to hold all the results in RAM If R runs out of RAM, the entire process will die and take your data with it Solutions:
Reduce objects on the nodes as much as possible Run things in recursive loops

Parallel Programming in R PS236b Spring 2010 Considerations

Memory mgmt and object size

Comparative object sizes for the output of a regression, N = 1000, P = 10: Output from lm: 321k Output from lm$coefficients: 0.6k

Parallel Programming in R PS236b Spring 2010 Considerations

Memory mgmt and object size

Comparative object sizes for the output of a regression, N = 1000, P = 10: Output from lm: 321k Output from lm$coefficients: 0.6k Factor of 535 difference!

Parallel Programming in R PS236b Spring 2010 Considerations

Memory mgmt and object size

Comparative object sizes for the output of a regression, N = 1000, P = 10: Output from lm: 321k Output from lm$coefficients: 0.6k Factor of 535 difference! So if you only need the coefcients, you use much less memoryand lm is a simple object compared to, say, MatchBalance

Parallel Programming in R PS236b Spring 2010 Considerations

Recursive loops for memory mgmt: a GenMatch example

Common desire: want to run GenMatch with large population sizes. Common problem: the MemoryMatrix grows very large and the kills off the R session with an out of memory error. Solution: recursion: 1 2 3 4 5 6 7 ## Assuming a gen . o u t from a dummy run o f GenMatch par < gen . o u t for ( i in 1:10){ par < GenMatch ( . . . , s t a r t i n g . v a l u e s = par$ s t a r t i n g . values , hard . g e n e r a t i o n . l i m i t =TRUE, max . g e n e r a t i o n s =25 )} Notice that here, the MemoryMatrix never grows large; but GenMatch still retains the history via recursion