LibreOffice Calc
Spreadsheets on the GPU
Michael Meeks <[email protected]>
mmeeks, #libreoffice-dev, irc.freenode.net
Stand at the crossroads and look; ask for the
ancient paths, ask where the good way is,
and walk in it, and you will find rest for your
souls... - Jeremiah 6:16
Overview
LibreOffice?
A bit about:
GPUs
Spreadsheets
Internal re-factoring
OpenCL optimisation
new calc features
XML / load performance
Calc / GPU questions?
Questions?
LibreOffice Project & Software
Open Source / Free
Software
One million new unique
IPs per week (that we can
track)
Double the weekly
growth one year ago.
Tens of millions of users,
and growing fast.
Hundred+ contributing
coders each month
2500+ commits last
month
Around a thousand
developers (including
QA, Translators, UX etc.)
http://www.libreoffice.org/
[Chart: cumulative unique IPs for updates vs. time, not counting any Linux / vendor versions. Y axis: 0 to 60,000,000.]
Advisory Board Members
This slide's layout is a victim of our success here ...
4 / 41
Event Name | Your Name
Why use the GPU?
APUs: GPU faster than CPU
Tons of un-used Compute Units across your APU
Double precision is unreasonably slower
And precision is non-negotiable for spreadsheets: IEEE 754 required.
Better power usage per flop.
Numbers based on a Kaveri 7850K APU & top-end discrete Graphics card.
[Chart: CPU flops vs. GPU flops, fp32 and fp64, including a FirePro 7990. Flops: note the log scale (1 to 10000) ...]
Developers behind the calc re-work:
Kohei Yoshida:
MDDS maintainer
Heroic calc core re-factorer
Code Ninja etc.
Markus Mohrhard
Calc maintainer,
Chart2 wrestler
Unit tester par
Excellence
etc.
Jagan Lokanatha
Kismat Singh
Matus Kukan
Data Streamer,
G-builder,
Size optimizer ..
A large OpenCL team,
Particularly I-Jui (Ray) Sung
Spreadsheet Geometry
An early Spreadsheet
c. 3000 BC
Aspect ratio: 8:1
Contents:
Victory against
every land
who giveth all life
forever
Excel 2003: 64k x 256, aspect 256:1
Excel 2010: 10^6 x 16k, aspect 64:1
50% of
spreadsheets
used to make
business
decisions.
Columnar data structures
The 'Broom
Handle'
aspect
ratio.
Spreadsheet Core Data Storage
The joy of Object Orientation
ScTable
ScBaseCell
ScDocument
ScColumn
Broadcaster (8 bytes)
Text width (2 bytes)
Cell type (1 byte)
Script type (1 byte)
ScValueCell
ScFormulaCell
ScStringCell
ScEditCell
ScNoteCell*
Abstraction of Cell Value Access
ScBaseCell Usage (Before)
ScDocument
Undo / Redo
RTF Filter
Change Tracking
Quattro Pro Filter
Content Rendering
HTML Filter
Excel Filter (xls, xlsx)
External Reference
Document Iterators
CSV Filter
DIF Filter
UNO API Layer
Conditional Formatting
SYLK Filter
VBA API Layer
Chart Data Provider
DBF Filter
ODF Filter
Cell Validation
CppUnit Test
Abstraction of Cell Value Access
ScBaseCell Usage (After)
ScDocument
Biggest calc core re-factor
in a decade+
Dis-infecting the horrible,
long-term, inherited
structural problems of Calc.
Document Iterators
Lots of new unit tests being
created for the first time for
the calc core.
Moved to using new 'MDDS'
data structures.
2x weeks with no compile ...
Before (ScBaseCell)
ScTable
ScBaseCell
ScDocument
ScColumn
Broadcaster (8 bytes)
Text width (2 bytes)
Cell type (1 byte)
Script type (1 byte)
ScValueCell
ScFormulaCell
ScStringCell
ScEditCell
ScNoteCell*
Scattered
pointer
chasing
walking cells
down a
column ...
After (mdds::multi_type_vector)
ScTable
ScColumn
svl::SharedString block
ScDocument
double block
EditTextObject block
ScFormulaCell block
Broadcasters
Cell notes
Text widths
Script types
Cell values
Iterating over cells (old way)
loop down a column and the inner loop:
double nSum = 0.0;
ScBaseCell* pCell = pCol->maItems[nColRow].pCell;
++nColRow;
switch (pCell->GetCellType())
{
    case CELLTYPE_VALUE:
        nSum += ((ScValueCell*)pCell)->GetValue();
        break;
    case CELLTYPE_FORMULA:
        // something worse ...
    case CELLTYPE_STRING:
    case CELLTYPE_EDIT:
    case CELLTYPE_NOTE:
        break;
}
Iterating over cells (new way)
double nSum = 0.0;
for (size_t i = 0; i < nChunkLength; i++)
nSum += pDoubleChunk[i];
ONO. from a vectoriser ...
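How the tight loop fits into a whole column can be sketched as below. This is not the mdds API: `Block`, `Column` and `sumColumn` are hypothetical stand-ins built on std::variant, only meant to show that the type check moves from once per cell to once per block.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for one typed block of a column: a run of
// adjacent cells that all hold the same type.
using DoubleBlock = std::vector<double>;
using StringBlock = std::vector<std::string>;
struct EmptyBlock { std::size_t nLength; };
using Block = std::variant<DoubleBlock, StringBlock, EmptyBlock>;

// A column is a sequence of typed blocks, not a sequence of cell pointers.
using Column = std::vector<Block>;

// Sum every numeric cell in the column. The type is checked once per
// block; the inner loop runs over a contiguous array of doubles.
double sumColumn(const Column& rColumn)
{
    double nSum = 0.0;
    for (const Block& rBlock : rColumn)
    {
        const DoubleBlock* pDoubles = std::get_if<DoubleBlock>(&rBlock);
        if (!pDoubles)
            continue; // strings and empty runs do not contribute
        const double* pDoubleChunk = pDoubles->data();
        const std::size_t nChunkLength = pDoubles->size();
        for (std::size_t i = 0; i < nChunkLength; i++)
            nSum += pDoubleChunk[i];
    }
    return nSum;
}
```

The inner loop has no branches and no pointer chasing, which is the shape a compiler's vectoriser (or a GPU kernel) wants.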
Shared Formula
Before
[Diagram: every ScFormulaCell has its own ScTokenArray (Tokens, RPN).]
After
[Diagram: the ScFormulaCells point to one ScFormulaCellGroup with a single shared ScTokenArray (Tokens, RPN).]
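The before / after difference can be sketched with shared ownership. The structs below are much simplified, hypothetical stand-ins (not the real ScFormulaCell / ScFormulaCellGroup / ScTokenArray), and the sketch assumes the filled-down formula is stored in a position-relative form so every row can use the same tokens.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a compiled formula.
struct TokenArray
{
    std::vector<std::string> maTokens; // parsed tokens
    std::vector<std::string> maRPN;    // the same formula in RPN order
};

// One group per run of identical formulae in a column.
struct FormulaCellGroup
{
    TokenArray maCode;      // a single copy for the whole run
    std::size_t mnStartRow; // first row the group covers
    std::size_t mnLength;   // number of cells in the run
};

struct FormulaCell
{
    std::shared_ptr<FormulaCellGroup> mxGroup; // shared, not owned per cell
    std::size_t mnRow;
};

// Fill nLength rows with the same formula; all cells point at one group.
std::vector<FormulaCell> fillDown(const TokenArray& rCode,
                                  std::size_t nStartRow, std::size_t nLength)
{
    auto xGroup = std::make_shared<FormulaCellGroup>(
        FormulaCellGroup{rCode, nStartRow, nLength});
    std::vector<FormulaCell> aCells;
    aCells.reserve(nLength);
    for (std::size_t i = 0; i < nLength; ++i)
        aCells.push_back(FormulaCell{xGroup, nStartRow + i});
    return aCells;
}
```

Per cell, the cost drops from a full token array to one pointer plus a row index, which is where the memory saving comes from.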
Memory usage
Heap memory size (MB)
Empty document: 27
Shared formula on: 259
Shared formula off: 372
Test document used:
http://kohei.us/wp-content/uploads/2013/08/shared-formula-memory-test.ods
Shared string re-work
String comparisons were slow
Also not tractable for a GPU
Case-insensitive equality is a hard problem: ICU & heavy lifting.
String comparisons are used a lot in functions, and Pivot Tables.
Shared string storage is useful.
So fix it ...
Concept
svl::SharedStringPool
svl::SharedString
Original string pool
svl::SharedString
Upcased string pool
svl::SharedString
String comparison (old way)
String comparison (new way)
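The two-pool concept can be sketched as follows. `StringPool` here is a hypothetical miniature, not the svl::SharedStringPool API, and it upcases ASCII only; the real thing needs ICU-grade Unicode case mapping, done once at intern time rather than on every comparison.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical miniature of a shared string pool.
class StringPool
{
    // unordered_set never relocates its elements, so pointers stay valid.
    std::unordered_set<std::string> maOriginals;
    std::unordered_set<std::string> maUpcased;

    static std::string toUpperAscii(std::string aStr)
    {
        // ASCII only, to keep the sketch short.
        for (char& c : aStr)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return aStr;
    }

public:
    struct SharedString
    {
        const std::string* mpOriginal = nullptr; // unique per distinct string
        const std::string* mpUpper = nullptr;    // unique per upcased form
    };

    // The expensive work (hashing, upcasing) happens once, here.
    SharedString intern(const std::string& rStr)
    {
        SharedString aRet;
        aRet.mpOriginal = &*maOriginals.insert(rStr).first;
        aRet.mpUpper = &*maUpcased.insert(toUpperAscii(rStr)).first;
        return aRet;
    }
};

// After interning, both comparisons are a single pointer compare.
inline bool equalsExact(const StringPool::SharedString& a,
                        const StringPool::SharedString& b)
{
    return a.mpOriginal == b.mpOriginal;
}

inline bool equalsIgnoreCase(const StringPool::SharedString& a,
                             const StringPool::SharedString& b)
{
    return a.mpUpper == b.mpUpper;
}
```

A pointer (or integer id) compare is also something a GPU kernel can do, which a locale-aware string compare is not.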
OpenCL / calculation ...
Why OpenCL & HSA ...
GPU and CPU optimisation
Why write custom SSE2 / SSE3 etc. assembly, detect arch, and select a backend across platforms?
Instead get OpenCL (from the APU vendor) to generate the best code ...
Heterogeneous System Architecture rocks:
An AMD64-like innovation:
shared Virtual Memory Address space & pointers: GPU <-> CPU.
Avoid wasteful copies, fast dispatch
Great OpenCL 2.0 support.
Use the right Compute Unit for the job.
Auto-compile Formula OpenCL
Formulae compiled idly / on entry in a thread to hide latency.
#pragma OPENCL EXTENSION cl_khr_fp64: enable
int isNan(double a) { return isnan(a); }
double legalize(double a, double b) { return isNan(a)?b:a;}
double tmp0_0_fsum(__global double *tmp0_0_0)
{
    double tmp = 0;
    {
        int i;
        i = 0;
        tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
        i = 1;
        tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
        i = 2;
        tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
    } // to scope the int i declaration
    return tmp;
}
double tmp0_nop(__global double *tmp0_0_0)
{
    double tmp = 0;
    int gid0 = get_global_id(0);
    tmp = tmp0_0_fsum(tmp0_0_0);
    return tmp;
}
__kernel void DynamicKernel_nop_fsum(__global double *result, __global double *tmp0_0_0)
{
    int gid0 = get_global_id(0);
    result[gid0] = tmp0_nop(tmp0_0_0);
}
Kernel generation thanks to:
The same formula for a longer sum
Compiled from standard formula syntax
__kernel void
tmp0_0_0_reduction(__global double* A,
                   __global double *result,
                   int arrayLength, int windowSize)
{
    double tmp, current_result =0;
    int writePos = get_group_id(1);
    int lidx = get_local_id(0);
    __local double shm_buf[256];
    int offset = 0;
    int end = windowSize;
    end = min(end, arrayLength);
    barrier(CLK_LOCAL_MEM_FENCE);
    int loop = arrayLength/512 + 1;
    for (int l=0; l<loop; l++) {
        tmp = 0;
        int loopOffset = l*512;
        if((loopOffset + lidx + offset + 256) < end) {
            tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
            tmp = legalize(((A[loopOffset + lidx + offset + 256])+(tmp)), tmp);
        } else if ((loopOffset + lidx + offset) < end)
            tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
        shm_buf[lidx] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int i = 128; i >0; i/=2) {
            if (lidx < i)
                shm_buf[lidx] = ((shm_buf[lidx])+(shm_buf[lidx + i]));
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lidx == 0)
            current_result =((current_result)+(shm_buf[0]));
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lidx == 0)
        result[writePos] = current_result;
}
double tmp0_0_fsum(__global double *tmp0_0_0) {
    double tmp = 0;
    int gid0 = get_global_id(0);
    tmp = ((tmp0_0_0[gid0])+(tmp));
    return tmp;
}
double tmp0_nop(__global double *tmp0_0_0) {
    double tmp = 0;
    int gid0 = get_global_id(0);
    tmp = tmp0_0_fsum(tmp0_0_0);
    return tmp;
}
__kernel void
DynamicKernel_nop_fsum(__global double *result,
                       __global double *tmp0_0_0)
{
    int gid0 = get_global_id(0);
    result[gid0] = tmp0_nop(tmp0_0_0);
}
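The reduction kernel is easier to follow when replayed on the CPU. The sketch below is a sequential emulation of a single 256-item work-group, written for this note (the name `emulateReduction` is made up); each barrier in the kernel becomes the boundary between two plain loops.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Same role as legalize() in the generated kernel: if the new value is NaN,
// keep the previous one, so NaN-encoded cells are skipped by the sum.
inline double legalize(double a, double b) { return std::isnan(a) ? b : a; }

// Sequential emulation of one 256-item work-group of the reduction kernel.
double emulateReduction(const std::vector<double>& rA, std::size_t nWindowSize)
{
    constexpr std::size_t nLocal = 256; // work-items in the group
    const std::size_t nEnd = std::min(nWindowSize, rA.size());
    const std::size_t nLoops = rA.size() / 512 + 1;
    double aShared[nLocal];             // stands in for __local shm_buf
    double fResult = 0.0;

    for (std::size_t l = 0; l < nLoops; ++l)
    {
        const std::size_t nLoopOffset = l * 512;
        // Phase 1: each work-item adds up to two elements, 256 apart.
        for (std::size_t lidx = 0; lidx < nLocal; ++lidx)
        {
            double tmp = 0.0;
            const std::size_t nPos = nLoopOffset + lidx;
            if (nPos + 256 < nEnd)
            {
                tmp = legalize(rA[nPos] + tmp, tmp);
                tmp = legalize(rA[nPos + 256] + tmp, tmp);
            }
            else if (nPos < nEnd)
                tmp = legalize(rA[nPos] + tmp, tmp);
            aShared[lidx] = tmp;
        }
        // Phase 2: tree reduction, halving the active work-items each step.
        for (std::size_t i = 128; i > 0; i /= 2)
            for (std::size_t lidx = 0; lidx < i; ++lidx)
                aShared[lidx] += aShared[lidx + i];
        // Work-item 0 accumulates this 512-element slice.
        fResult += aShared[0];
    }
    return fResult;
}
```

On the GPU the 256 iterations of each inner loop run side by side, so a 512-element slice costs two loads per work-item plus eight reduction steps instead of 512 sequential additions.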
Performance numbers for sample sheets.
[Chart: calculation time, GPU / OpenCL vs. Software, for the sample sheets min_max_avg_r, destination-workbook, dates-worked, stock-history and ground-water, on Kaveri. Shorter is better.]
30x to 500x faster for these samples vs. the legacy software calculation.
Yet another log plot: milliseconds on the X axis (10 to 100,000) ...
In more detail ...
This is a spreadsheet
Highly spreadsheet geometry dependent
What do you mean: what is the X factor?
Don't like your X factor: add more rows, or complexity.
Representative sheets are important: some based on real-world madness
Functions:
Research shows the vast majority of distinct formulae have very simple functions: SUM, AVERAGE, SUMIF, VLOOKUP, etc.
We optimise those
We don't do e.g. Text functions like UPPER
How that works in practice:
Enabling Custom Calculation
Turn on OpenCL computation: Tools->Options
Enabling OpenCL goodness
Auto-select the best OpenCL device via a micro-benchmark
Or disable that and explicitly select a device.
Big data needs Document Load optimization
Parallelized Loading ...
Desktop CPU cores are often idle.
XML parsing:
The ideal application of parallelism
SAX parsers:
Sucking icAche eXperience parsers
read, parse a tiny piece of XML & emit an event
punch that deep into the core of the APP logic, and return ..
Parse another tiny piece of XML.
Better APIs and impl's needed: Tokenizing, Namespace handling etc.
Luckily easy to retrofit threading ...
Dozens of performance wins in XFastParser.
Utilising your 32 core CPU ...
(boxes are threads).
Split XML Parse & Sheet populate (Thread 1, Thread 2):
Unzip, XML Parse, Tokenize
Populate Sheet Data Structures.
Parallelised Sheet Loading:
Unzip, XML Parse, Tokenize
Populate Sheet Data Structures.
Progress bar thread
Parallel to GPU compilation etc.:
=COVAR(A1:A300,B1:B300) -> OpenCL code -> Ready to execute kernels
Tools->Options->Advanced->Experimental Mode required for parallel loading
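The parse / populate split is a classic producer-consumer pipeline. A minimal sketch under stated assumptions: `CellToken`, `TokenQueue` and `loadColumn` are names invented here, "parsing" is reduced to std::stod on strings assumed to be valid numbers, and none of this is the XFastParser code.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical token: one already-parsed cell value.
struct CellToken { std::size_t mnRow; double mfValue; };

// Minimal blocking queue between the parser and the populator.
class TokenQueue
{
    std::queue<CellToken> maQueue;
    std::mutex maMutex;
    std::condition_variable maCond;
    bool mbDone = false;

public:
    void push(CellToken aTok)
    {
        {
            std::lock_guard<std::mutex> aGuard(maMutex);
            maQueue.push(aTok);
        }
        maCond.notify_one();
    }

    // Called by the producer once there is nothing more to parse.
    void finish()
    {
        {
            std::lock_guard<std::mutex> aGuard(maMutex);
            mbDone = true;
        }
        maCond.notify_one();
    }

    // Returns std::nullopt once the producer has finished and the queue is empty.
    std::optional<CellToken> pop()
    {
        std::unique_lock<std::mutex> aLock(maMutex);
        maCond.wait(aLock, [this] { return !maQueue.empty() || mbDone; });
        if (maQueue.empty())
            return std::nullopt;
        CellToken aTok = maQueue.front();
        maQueue.pop();
        return aTok;
    }
};

// One thread "parses" (here: converts strings), the other fills the column.
std::vector<double> loadColumn(const std::vector<std::string>& rRawCells)
{
    TokenQueue aQueue;
    std::vector<double> aColumn(rRawCells.size(), 0.0);

    std::thread aParser([&] {
        for (std::size_t i = 0; i < rRawCells.size(); ++i)
            aQueue.push(CellToken{i, std::stod(rRawCells[i])});
        aQueue.finish();
    });
    std::thread aPopulator([&] {
        while (std::optional<CellToken> aTok = aQueue.pop())
            aColumn[aTok->mnRow] = aTok->mfValue;
    });

    aParser.join();
    aPopulator.join();
    return aColumn;
}
```

A real loader would hand over tokens in batches rather than one at a time, so the lock is not taken per cell; the single-token queue just keeps the sketch short.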
Does it work? With GPU enabled
Wall-clock time to load a set of large XLSX spreadsheets: 8 thread Intel machine
[Chart: load time per file for Calc 4.1.3, Calc and Reference. Files: num-formula-2-sheets-1m.xlsx, numbers-formula-8-sheets-100k.xlsx, numbers-formula-100k.xlsx, numbers-100k.xlsx, sumifs-testsheet.xlsx, stock-history.xlsm, matrix-inverse.xlsx, mandy.xlsm, mandy-no-macro.xlsx, groundwater-daily.xlsm, dates-worked.xlsx. X axis: log time / seconds. Shorter is better.]
Apologies for another log scale: Average 5X vs. 4.1.3
How does that pan out?
Problems^WOpportunities ...
Picking a good OpenCL driver
White / Black / Any-listing of known good / bad / mixed Hardware / Driver / OS
Which core to pick?
fp64 perf etc. Time vs. Power
Currently micro-benchmark time.
HSA rocks
CL_MEM_USE_HOST_PTR is a royal pain:
Alignment issues currently cause lots of copying in several cases.
OpenCL 2.0's Shared Virtual Memory is awesome
Compiler Performance:
Excel -> RPN -> C string -> IR -> GPU
SPIR sounds great if it can be stable.
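On the CL_MEM_USE_HOST_PTR alignment point: one way to sidestep the copies is to allocate column buffers aligned and padded from the start. The sketch below only shows the host-side allocation; the 4096-byte figure is an assumption made for this example (the real requirement depends on the OpenCL driver and device), and `allocColumn` is a made-up helper, not Calc code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Assumed for this sketch only: a page-sized alignment. Check the
// driver's documentation for the value that actually avoids a copy.
constexpr std::size_t kAssumedAlignment = 4096;

struct FreeDeleter
{
    void operator()(void* p) const { std::free(p); }
};
using AlignedDoubles = std::unique_ptr<double[], FreeDeleter>;

inline std::size_t roundUp(std::size_t n, std::size_t nMultiple)
{
    return (n + nMultiple - 1) / nMultiple * nMultiple;
}

// Bytes to allocate for nCells doubles: padded to a whole number of
// alignment units, and never zero.
inline std::size_t paddedBytes(std::size_t nCells)
{
    const std::size_t nBytes = roundUp(nCells * sizeof(double), kAssumedAlignment);
    return nBytes == 0 ? kAssumedAlignment : nBytes;
}

// Returns an aligned, padded buffer, or an empty pointer on failure.
// std::aligned_alloc wants the size to be a multiple of the alignment,
// which paddedBytes() guarantees.
inline AlignedDoubles allocColumn(std::size_t nCells)
{
    void* p = std::aligned_alloc(kAssumedAlignment, paddedBytes(nCells));
    return AlignedDoubles(static_cast<double*>(p));
}

inline bool isAligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kAssumedAlignment == 0;
}
```

A buffer allocated this way is a candidate for being handed to the driver as a host pointer; whether the driver then really avoids the copy is still implementation-specific.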
Future OpenCL work ...
Volunteers / funders welcome
Kill per-cell dependency graphing
Badly needs to be per-column:
Shrink memory usage, improve load time
Detect independent column calculations
SPIR integration
Enabling parallel execution, wider CSE etc.
Avoid 'NaN' foo by adapting to data shape faster.
Calc as a flow process, 'construct your pipeline in a sheet'
Crazy awesome demos: Mobile vs. PC ...
ZIP LZ77 / OpenCL acceleration or similar
LibreOffice Conclusions
LibreOffice is innovating:
Going interesting places no-one has gone before:
OpenCL in a generic spreadsheet: a first
Why write 5x hand-coded assembler versions and select per platform?
Run your workload on the right Compute Unit to save time & battery.
Refactoring for OpenCL improves performance for all
Faster for CPU and GPU
PCMark 8.2 includes LibreOffice benchmarking.
LibreOffice loves new contributors & features
there is already a tool for that.
Talk to me about getting involved ...
Thanks for all of your help and support!
Oh, that my words were recorded, that they were written on a scroll, that they were
inscribed with an iron tool on lead, or engraved in rock for ever! I know that my Redeemer
lives, and that in the end he will stand upon the earth. And though this body has been
destroyed yet in my flesh I will see God, I myself will see him, with my own eyes - I and not
another. How my heart yearns within me. - Job 19: 23-27