Easy Data Parallelism
Richard Warburton
Raoul-Gabriel Urma
Overview
● Why is parallelism important?
● What is data parallelism?
● Parallelising your Streams
● Performance and Internals
Why is Parallelism Important?
source: http://www.gotw.ca/images/CPU.png
Multicore
What is Data Parallelism?
Concurrency is not Parallelism!
● Concurrency
○ At least two threads are making progress
○ May not run at the same time
○ Eg: Chrome and Eclipse both running
● Parallelism
○ At least two threads are executing simultaneously
○ A specific case of concurrency
○ Eg: servlet container dealing with two users at
once on a multicore machine
Parallelism
● Task
○ Distribute different tasks over different processes
○ Threads and Executors in Java
○ Eg: each thread services a user in JEE App
● Data
○ Distribute data over different processes
○ Support built on top of Streams
○ Eg: process a payroll and give each core 100
employees’ salaries
What are good data parallel problems?
● Big Batch Jobs
○ Transaction Processing
○ Analytics/Reporting
● Web crawlers / parsers
● Maths
○ Monte Carlo Simulations
○ Linear Algebra
What’s a good data parallel problem from your workplace?
Parallelising your Streams
Data Parallelism
● Useful when
○ you have a lot of data
○ you want to process each element in a similar way
● API aims to be explicit, but unobtrusive
○ .parallelStream()
○ .parallel()
● Can flip between sequential and parallel (sketch below)
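A minimal sketch of flipping modes (class name FlipDemo is illustrative): the whole pipeline runs in whichever mode the last parallel()/sequential() call before the terminal operation selected.

import java.util.Arrays;
import java.util.List;

public class FlipDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // start sequential, opt in to parallelism ...
        int parallelSum = numbers.stream()
                                 .parallel()
                                 .mapToInt(i -> i)
                                 .sum();

        // ... or start parallel and drop back to sequential
        int sequentialSum = numbers.parallelStream()
                                   .sequential()
                                   .mapToInt(i -> i)
                                   .sum();

        System.out.println(parallelSum + " " + sequentialSum); // 15 15
    }
}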
Data Parallelism
// Replace stream() with parallelStream()
// (assumes a static import of Collectors.toSet)
Set<String> origins = musicians
    .parallelStream()
    .filter(artist -> artist.getName().startsWith("The"))
    .map(artist -> artist.getNationality())
    .collect(toSet());
Not all serial code works in parallel.
DON’T interfere with data sources
// BROKEN: adds each doubled value to the source list while
// streaming it; can throw ConcurrentModificationException
List<Integer> numbers = getNumbers();
numbers.parallelStream()
    .forEach(i -> numbers.add(i * 2));
Interfering with data sources: fixed
// add each value and its double to the list
// (assumes a static import of Collectors.toList)
List<Integer> numbers = getNumbers();
numbers = numbers.parallelStream()
    .flatMap(i -> Stream.of(i, i * 2))
    .collect(toList());
DON’T misuse reduce
int totalCost(List<Purchase> items) {
    return items.parallelStream()
        .mapToInt(Purchase::getCost)
        // BROKEN in parallel: DELIVERY_FEE is not an identity
        // value, so it is added once per chunk of the data
        .reduce(DELIVERY_FEE,
            (tally, cost) -> tally + cost);
}
Associativity
“you can regroup the operations and the result is unchanged”
(4 + 2) + 1 = 4 + (2 + 1) = 7
(4 * 2) * 1 = 4 * (2 * 1) = 8
Identity
“the do nothing value”
0 + 5 = 5
1 * 5 = 5
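Why the identity matters: in a parallel reduce, each chunk of data starts from the seed, so a non-identity seed gets counted more than once. A minimal sketch (values illustrative):

import java.util.stream.IntStream;

public class ReduceSeedDemo {
    public static void main(String[] args) {
        // Sequential: 100 + (1 + 2 + ... + 10) = 155
        int sequential = IntStream.rangeClosed(1, 10)
                                  .reduce(100, Integer::sum);

        // Parallel: each chunk starts from 100, so the seed is
        // counted once per chunk -- typically prints more than 155
        int parallel = IntStream.rangeClosed(1, 10)
                                .parallel()
                                .reduce(100, Integer::sum);

        System.out.println(sequential + " vs " + parallel);
    }
}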
How to fix reduce
int totalCost(List<Purchase> items) {
    return DELIVERY_FEE
        + items.parallelStream()
            .mapToInt(Purchase::getCost)
            .reduce(0, (tally, cost) -> tally + cost);
}
How to fix reduce (2)
int totalCost(List<Purchase> items) {
return DELIVERY_FEE
+ items.parallelStream()
.mapToInt(Purchase::getCost)
.sum();
}
DON’T hold locks
List<Integer> values = getValues();
CountDownLatch latch = new CountDownLatch(values.size());
values.parallelStream()
    .forEach(i -> {
        try {
            doSomething(i);
            // Potential Deadlock
            latch.countDown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    });
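One way out (a sketch, assuming doSomething is thread-safe and throws no checked exception): forEach on a parallel stream does not return until every element has been processed, so the latch is unnecessary.

// No latch needed: this call blocks until all elements are done
values.parallelStream()
      .forEach(i -> doSomething(i));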
No mutable state!
public static long sideEffectParallelSum(long n) {
    Accumulator accumulator = new Accumulator();
    LongStream.rangeClosed(1, n).parallel()
        .forEach(accumulator::add); // BROKEN: data race on total
    return accumulator.total;
}

public static class Accumulator {
    private long total = 0;
    public void add(long value) {
        total += value; // not atomic: read-modify-write
    }
}
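The safe alternative is to let the stream own the state with a built-in reduction, e.g.:

import java.util.stream.LongStream;

public static long parallelSum(long n) {
    // the reduction is managed internally,
    // so there is no shared mutable state
    return LongStream.rangeClosed(1, n)
                     .parallel()
                     .sum();
}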
Parallel Code Summary
● Very easy to make your code parallel,
but …
● Sometimes you can get away with things
sequentially that you can’t in parallel
○ sources
○ reduce
○ locks
○ unprotected mutable data
Performance and Internals
Under the hood
● Work distributed using Fork/Join framework
● Distributed by data
● New abstraction: Spliterator
Parallel Integer Sums
int sum =
    values.parallelStream()
        .mapToInt(i -> i) // unbox Integers into an IntStream
        .sum();
Spliterator
// Simplified excerpt: the real java.util.Spliterator also declares
// tryAdvance(), estimateSize() and characteristics()
public interface Spliterator<T> {
    /** Carve off a portion of the data
        into a separate Spliterator */
    Spliterator<T> trySplit();

    /** Iterate the data described by this Spliterator */
    void forEachRemaining(Consumer<? super T> action);

    /** The size of the data described
        by this Spliterator, if known */
    long getExactSizeIfKnown();
}
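A small sketch of splitting in action (values illustrative): trySplit() carves off roughly half the data into the returned Spliterator, and the framework applies this recursively to divide work across threads.

import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        Spliterator<Integer> rest = data.spliterator();
        // carve off roughly half the elements for another thread
        Spliterator<Integer> half = rest.trySplit();

        System.out.println(half.getExactSizeIfKnown()); // 4
        System.out.println(rest.getExactSizeIfKnown()); // 4
    }
}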
Always a tradeoff ...
● Parallelism eats more CPU time
○ Thread communication
○ Distributing & Decomposing work
○ Potentially increased memory pressure
○ Competing for the CPU with other processes
● It can reduce wall time
○ Time from beginning to end of the processes’
execution
○ Ideally only need to wait for 1/N of the execution
time
Decomposition Performance
● Data Size
● Source Data Structure
● Packing
● Number of Cores
● Cost per Element
Data Structures
● Good
○ ArrayList / IntStream.range / Stream.of
○ Random Access + Easy to balance
● Meh
○ HashSet / TreeSet
○ Usually good balance
● Bad
○ LinkedList / BufferedReader.lines() / Stream.iterate()
○ Unknown length (sketch below)
○ Poor random access performance
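Why unknown length hurts, in one sketch: a sized source can report its element count up front so the work can be balanced in advance, while an iterated source cannot.

import java.util.Arrays;
import java.util.stream.Stream;

public class SizeDemo {
    public static void main(String[] args) {
        // an array-backed source knows its size: prints 3
        System.out.println(
            Arrays.asList(1, 2, 3).spliterator().getExactSizeIfKnown());

        // Stream.iterate() does not: prints -1 (unknown)
        System.out.println(
            Stream.iterate(0, i -> i + 1).spliterator().getExactSizeIfKnown());
    }
}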
Stateful Operations
● Stateless
○ no need to keep state when evaluated
○ eg: map, reduce
○ superior parallel decomposition
○ bounded amounts of data
● Stateful
○ accumulate state during evaluation
○ eg: sorted
○ may buffer unbounded amounts of data (sketch below)
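A sketch contrasting the two (ranges illustrative): map() handles each element independently, while sorted() must buffer the whole stream before emitting anything downstream.

import java.util.stream.IntStream;

public class StatefulDemo {
    public static void main(String[] args) {
        // stateless: map processes each element independently
        int sum = IntStream.range(0, 1_000)
                           .parallel()
                           .map(i -> i * 2)
                           .sum();

        // stateful: sorted() buffers every element before
        // a single one can flow downstream
        int[] ascending = IntStream.range(0, 1_000)
                                   .parallel()
                                   .map(i -> 999 - i)
                                   .sorted()
                                   .toArray();

        System.out.println(sum + " " + ascending[0]); // 999000 0
    }
}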
Benchmarking and Testing
● Don’t assume parallel = faster, measure it
● Use jmh:
http://openjdk.java.net/projects/code-tools/jmh/
● Best Practices (minimal JMH sketch below)
○ Warmup
○ Repeatability
○ Evade JIT optimisations such as dead-code elimination
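A minimal JMH sketch (class, method, and parameter values are illustrative) comparing sequential and parallel sums; returning the result lets JMH consume it, so the JIT cannot eliminate the work as dead code.

import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)       // let the JIT settle first
@Measurement(iterations = 5)  // repeat for stable numbers
@Fork(1)
@State(Scope.Benchmark)
public class SumBenchmark {

    @Param({"1000", "10000000"}) // small and large inputs
    long n;

    @Benchmark
    public long sequentialSum() {
        return LongStream.rangeClosed(1, n).sum();
    }

    @Benchmark
    public long parallelSum() {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }
}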
Summary
Lesson Summary
● Easy to obtain Data Parallelism
● Pick your situation well
● A lot of performance influencers
● Benchmark your parallel code
The End
Exercise
In: com.java_8_training.problems.data_parallelism
1. Look at OptimisationExample
2. Try to improve the performance of this code
3. Measure performance using the benchmark harness
4. Don’t make the code uglier!
Exercise
In: com.java_8_training.problems.data_parallelism
1. Parallelise the sum of squares method
Question1Test
2. Fix the bug in the "multiplyThrough" method
Question2Test
3. Remove the locks and keep the code safe
Question3Test
Amdahl’s Law
● Defines upper bound for parallel speedup
● Time(n) = Time(1) * (s + 1/n * (1 - s))
○ n = number of cores
○ s = proportion of code that is strictly serial
● Speedup(n) = 1 / (s + 1/n * (1 - s))
● Example
○ 1024 cores, 50% serial
○ 1 / (0.5 + 1/1024 * (1 - 0.5)) ~= 2x speedup