Matching Linear Algebra and Tensor Code To Specialized Hardware Accelerators
CC ’23, February 25–26, 2023, Montréal, QC, Canada P.A. Martínez, J. Woodruff, J. Armengol-Estapé, G. Bernabé, J.M. García, M.F.P. O’Boyle
2.3 Our approach - ATC
Rather than relying on code structure to guide detection, ATC uses behavioral equivalence to determine if a section of code is a linear algebra operation. Firstly, ATC uses neural program classification [18] to detect that the code in Figure 2 is probably a GEMM. It then searches variable matches to determine the potential source and output arrays. As the search space is combinatorially large, we introduce scalable, algorithm-independent heuristics (which we discuss in Section 5) that keep the number of mappings manageable. Next, ATC generates different input values for the arrays and records the output. After generating many randomized inputs, it observes that the code has equivalent behavior to the corresponding API and is able to replace the AVX2 code with the GEMM call at the bottom of Figure 2.

FACC. Behavioral equivalence is also employed in FACC [63]. Unfortunately, it is restricted to FFTs and one-dimensional arrays, and cannot detect the replacement in Figure 1. Therefore, we extended FACC to FACC* to consider GEMMs and multi-dimensional arrays. This, however, exposes its weak variable binding model, which is combinatorial in the number of user array variables and their dimensionality. Furthermore, it relies on program synthesis to determine the length of arrays, which scales poorly to problems with many potential length parameters for arrays, such as GEMM.
FACC also relies on brittle inter-procedural liveness analyses to determine the liveness status of variables. This restricts it to running only at link time, rendering it invalid for use in shared libraries. We will see in Section 8 that the combination of these issues results in excessively large search spaces.

Legality. IO behavioral equivalence is not proof that a section of code is a particular linear algebra operation; similarly, IDL and KernelFaRer do not prove equivalence. For proof, bounded model checking based on Kleene [14] can be deployed. In practice, as demonstrated in our experimental section, IO equivalence gives no false positives. For further guarantees, we can ask for programmer sign-off or employ model checking.

Profitable. Once we have detected and can replace a section of code with an accelerator call, we need to determine if it is profitable to do so. Due to hardware evolution, we do not use a hard-wired heuristic to determine profitability. Instead, we learn, off-line, a simple predictive model to determine if the target accelerator is faster than a CPU implementation. The model is called at runtime, determining if offloading is worthwhile.

3 System overview
Figure 3 gives a system flow overview of ATC. We first detect regions of code that are likely to be linear algebraic operations using a neural program classifier. The classifier is trained ahead of time, based on programs that are equivalent to the accelerator and prior examples of linear algebra code. Once candidate code sections have been identified, we apply program analysis to match user program variables with the particular API formal parameters. Given the combinatorially large search space, we develop novel techniques to make the problem tractable.
For each candidate matching, we generate multiple data inputs, execute the user code section and record the output values. If the input/output pairs correspond to the input/output behavior of the accelerator API, we can say they are behaviorally equivalent and candidates for replacement.
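The randomized IO test at the heart of this pipeline can be sketched in a few lines of C. This is an illustrative reconstruction, not ATC's implementation: user_kernel stands in for the candidate user code and reference_gemm for the accelerator API, and both names are assumptions made for this sketch.

```c
/* Minimal sketch of randomized IO testing: run the candidate code and a
 * reference GEMM on the same random inputs and compare the outputs. */
#include <math.h>
#include <stdlib.h>

#define N 8       /* small matrices for illustration */
#define TRIALS 32 /* number of randomized inputs     */

/* Candidate user code: a naive 3-loop GEMM computing C += A * B. */
static void user_kernel(const float *A, const float *B, float *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Stand-in for the accelerator API (different loop order, same behavior). */
static void reference_gemm(const float *A, const float *B, float *C) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Returns 1 if every randomized trial produces matching outputs. */
static int io_equivalent(void) {
    float A[N * N], B[N * N], C1[N * N], C2[N * N];
    for (int t = 0; t < TRIALS; t++) {
        for (int i = 0; i < N * N; i++) {
            A[i] = (float)rand() / RAND_MAX;
            B[i] = (float)rand() / RAND_MAX;
            C1[i] = C2[i] = 0.0f;
        }
        user_kernel(A, B, C1);
        reference_gemm(A, B, C2);
        for (int i = 0; i < N * N; i++)
            if (fabsf(C1[i] - C2[i]) > 1e-4f)
                return 0; /* behaviors diverge: no match */
    }
    return 1; /* candidate for replacement */
}
```

As the Legality paragraph notes, passing such tests is evidence of equivalence, not proof of it.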
While candidate user code may be replaceable with a call to an accelerator API, it may not be profitable. Therefore, we employ a simple ML classifier, trained offline, and invoked at runtime to see if acceleration is appropriate for the user code for the runtime-known array sizes.

Figure 4. Dimension detection algorithm overview for a target example array called A. [Flowchart: for each Load/StoreInst, if it does not access array A the load/store is redirected to index 0; if it does, an out-of-bounds access exits with a custom error code, otherwise the instruction is performed normally.]

3.1 Neural Program Classification
To detect potentially acceleratable parts of a program, we use prior work in neural program classification [18]. A network is
trained with multiple instances of different program classes. We use the OJClone dataset [44], which includes 105 classes of different programs, and add examples of the programs that we want to detect, e.g. GEMMs and convolutions, gathered from benchmark suite repositories other than GitHub.
At compile time a new candidate program is divided into functions, which are presented to the neural classifier. The classifier assigns each function in the program a probability of belonging to a certain class. We consider the most probable class, which in the case of a GEMM or convolution is then considered for variable matching and eventual code replacement as described in the following sections. Classification is fast (≤ 1.5 sec) and has negligible impact on compilation time (see Section 8.3).

Algorithm 1 Dimension detection algorithm
1:  for arr in function do
2:      fakeLoadAndStoresExcept(arr)
3:      replaceLoadAndStores(arr)
4:      repeat
5:          c = getNextCombination(arr)
6:          ffi_call(A, V)
7:          if not failed then
8:              found = True
9:          end if
10:     until found
11:     Add c to C
12: end for
13: return C
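The out-of-bounds branch of Figure 4 amounts to a tiny runtime helper injected in place of each load or store of the target array. A minimal sketch follows; the names checked_load, candidate_elems and OUT_OF_BOUNDS_EXIT are assumptions made here, since ATC injects the equivalent call directly at the LLVM IR level.

```c
/* Sketch of the runtime bound check that replaces loads/stores of the
 * target array while one size hypothesis is being tested. */
#include <stdlib.h>

#define OUT_OF_BOUNDS_EXIT 42 /* custom error exit code */

/* Size hypothesis (number of elements) currently being tested. */
static long candidate_elems;

/* Injected in place of a load of the target array. */
static float checked_load(const float *arr, long idx) {
    if (idx < 0 || idx >= candidate_elems)
        exit(OUT_OF_BOUNDS_EXIT); /* hypothesis too small: reject it */
    return arr[idx];              /* in bounds: perform the access   */
}
```

A matching checked_store would guard writes in the same way.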
4 Variable Matching
To check if a section of user code is behaviorally equivalent to the API, we have to match up the user program variables with API formal parameters. We first detect which variables are livein/liveout (Section 4.1) and then the dimensions of arrays (Section 4.2).

4.1 Detecting livein and liveout variables
Detecting livein and liveout variables via standard static analysis is straightforward for well-structured programs but fails for more diverse real-world codes, which may use assembly code or intrinsic functions.
ATC uses dynamic analysis to determine which variables are livein and liveout inside a function. In C, variables are passed by value, so non-pointer variables are always livein. In the case of pointers (or arrays), we generate random inputs with arbitrary sizes. If the values in memory change after executing the program, the array is considered liveout.
This allows us to detect which variables are livein or liveout, but not both livein and liveout at the same time. We generate a new random input for liveout variables and re-execute the function. If the output differs from the first execution, it is both livein and liveout. We implement this algorithm as a just-in-time compiler pass in LLVM [38].

4.2 Detecting the dimensions of arrays
Detecting array lengths enables offloading of appropriately-sized regions of code, so it is a critical step in ATC. For some programs, lengths can be found using static analysis (e.g. [50]), but this fails in more complex cases. We use runtime analysis to determine which program variables define array size using a modified form of runtime array bound checking. For each set of variables that could define an array's size (typically, from the argument list), we set such variables to a fixed value. We then execute the user code, which is modified to check runtime array accesses.
First, the compiler selects a target array to find its size. Then, to generate the modified program, we tweak the load and store instructions in the user program, replacing them with custom function calls in the IR. If a load or store does not access the array we are interested in, we modify it to load and store at a constant, safe location. If it does, the instruction is replaced with a function call that will check at runtime if the access is out of bounds. If so, the program exits with a custom error code. If not, we have found a valid array size. The basic idea is depicted in Figure 4. This is used by our JIT analysis as shown in Algorithm 1 and implemented in LLVM.
This way, the compiler can assign different input sizes to a given array and check the exit code. The compiler therefore iterates over all the possible dimension combinations until one of the executions does not end with the custom error exit code. That means the program completed without any illegal access to the target array, which indicates that the right dimensions for the array have been found.
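The exit-code probing loop described above can be sketched as a fork-and-inspect driver. This is an illustrative reconstruction: run_instrumented is a hypothetical stand-in for launching the JIT-compiled, instrumented user code, and the exit-code value is an assumption of this sketch.

```c
/* Sketch of the dimension search driver: re-run the instrumented user code
 * under each size hypothesis and keep the first one that does not exit with
 * the custom out-of-bounds error code. */
#include <sys/wait.h>
#include <unistd.h>

#define OUT_OF_BOUNDS_EXIT 42

/* Hypothetical: runs the instrumented code, exiting with
 * OUT_OF_BOUNDS_EXIT on any illegal access to the target array. */
extern void run_instrumented(long elems);

static long find_size(const long *hypotheses, int n) {
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {                 /* child: test one hypothesis */
            run_instrumented(hypotheses[i]);
            _exit(0);                   /* completed without error    */
        }
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) != OUT_OF_BOUNDS_EXIT)
            return hypotheses[i];       /* valid array size found */
    }
    return -1;                          /* no hypothesis survived */
}
```

Running each hypothesis in a child process mirrors why the exit-code mechanism works: an out-of-bounds probe can terminate the run without corrupting the compiler's own state.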
lev(a, b) =
    |a|                               if |b| = 0,
    |b|                               if |a| = 0,
    lev(tail(a), tail(b))             if a[0] = b[0],
    1 + min( lev(tail(a), b),
             lev(a, tail(b)),
             lev(tail(a), tail(b)) )  otherwise.        (1)

Figure 6. Levenshtein recursive definition

[Listing: the accelerator API signature (gemm_api) and the user GEMM signature (gemm) whose parameters must be matched.]

gemm_api(float *tc_A, float *tc_B, float *tc_C,
         int tc_m, int tc_n, int tc_k,
         int tc_lda, int tc_ldb, int tc_ldc,
         float tc_alpha, float tc_beta) {

gemm(int M, int N, int K, float alpha,
     float *A, int lda, float *B, int ldb,
     float beta, float *C, int ldc) {

floating-point programs [63], verification of such liftings is some way off.
In summary, the key challenges that all competing techniques face are:
• Floating-point numbers often raise challenges in theorem provers as they are difficult to reason about.
• Floating-point functions may have different accuracies in different input ranges, meaning that the obvious checks of correctness (even within bounds) are difficult to apply.
The backend of ATC is not tied to using behavioral equivalence. As we will see, the use of such behavioral equivalence results in no false positives. Further development of theorem prover technologies would mean that the weak behavioral equivalence in ATC could easily be replaced with a theorem prover guaranteeing correctness and enabling automatic transformations.
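The recursive definition of Figure 6 transcribes directly into C. This is a reference version for illustration only: it is exponential in the input length, and practical implementations memoize or use the standard dynamic-programming table.

```c
/* Levenshtein distance, transcribed from the recursive definition in
 * Figure 6; tail(x) is modeled by advancing the pointer. */
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

static int lev(const char *a, const char *b) {
    if (*b == '\0') return (int)strlen(a);   /* |a| if |b| = 0       */
    if (*a == '\0') return (int)strlen(b);   /* |b| if |a| = 0       */
    if (*a == *b)   return lev(a + 1, b + 1);/* heads match          */
    return 1 + min3(lev(a + 1, b),           /* deletion             */
                    lev(a, b + 1),           /* insertion            */
                    lev(a + 1, b + 1));      /* substitution         */
}
```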
the CPU, we used oneDNN v1.96. The TPU system runs Debian 10 with kernel 4.19.0-14.

7.1 User code
We explored GitHub looking for C and C++ GEMM codes, analyzing more than 400 programs, from which we selected 50 programs. We discarded the rest because of wrong implementations, compilation errors or duplicated code. The final list of programs is shown in Figure 8. We categorize the codes as follows: Naive: naive implementations with the traditional 3-loop structure; Naive Parallel: as Naive but with simple outer loop parallelization; Unrolled: naive implementation with unrolled loops; Kernel Calls: implementations that divide the loops into different function calls; Blocked: tiled implementations; Goto: implementations of the Goto algorithm [29]; Strassen: implementations of the Strassen algorithm [56]; Intrinsics: implementations using Intel intrinsics.
In addition, we selected 50 non-GEMM projects to check whether any of the approaches gave false positives.

Convolutions. We explored GitHub looking for C and C++ 4D convolution implementations. We analyzed around 50 programs, from which we selected a list of 15 programs based on the same methodology used for selecting GEMMs. The list of convolution programs is shown in Table 1. We have included codes from the most relevant convolution implementations: Direct: the direct convolution algorithm; im2col+gemm: an algorithm that casts the input as matrices (im2col) and later uses a GEMM, as in Caffe [33]; Winograd: the Winograd algorithm.

7.2 Methods
We evaluate our approach against 4 well-known schemes:
IDL: Idioms are described using an idiom description language [28], which is translated into a set of constraints over LLVM IR.
KernelFaRer: Uses different pattern matching to detect specific code constructs, matching specific matrix-multiplication structures [20].

IDL. The constraint-based scheme [28] only matches 6 out of 50 cases. These programs are largely naive implementations of GEMM, with a simple loop structure. It is able to manage 2 programs containing unrolled loops but fails on anything more complex. Matching more diverse cases would require writing a new IDL constraint description for each sub-class.

KernelFaRer. This code matching approach [20] is more successful, matching 11 GEMMs due to a more robust pattern matcher. For straightforward sequential implementations, it is able to match all but one of the cases. However, any code variation, including loop unrolling, defeats it.

Polly. Although it does not match and replace GEMMs, it can detect SCoPs which may be candidates for replacement with appropriate API calls. It is less successful than KernelFaRer in detecting naive implementations but is more robust across other more complex categories, including one parallel and unrolled version and 2 blocked cases. It slightly outperforms KernelFaRer, matching 13 vs. 11 out of 50 cases.

FACC*. Unlike the other approaches, FACC* performed poorly on naive implementations, but better on others. Here, the size of the mapping search space is the limiting factor. It was able to find 10 cases in the available time (timeout ≤ 10 mins). We examine the reasons for this in Section 8.3.

ATC. Our approach is significantly more robust across all categories, matching 42 out of 50 cases. It is able to detect all naive implementations and the majority within each other category. It detects more naive parallel, unrolled and blocked programs than Polly and is the only technique to detect GEMMs in codes containing kernel calls and intrinsic instructions.

8.1.1 Accuracy. Figure 10 provides a summary of ATC's success and failure by type. In 8 cases ATC failed to detect that the program contained a GEMM. In one case, program 23, this is due to there being too many candidate matches, 280, which is above our timeout threshold of 100 candidates. The remaining cases are due to overly aggressive search
Figure 8. List of GEMM codes

Algorithm      | Code | LoC | Layout       | Sizes                   | Optimizations
Naive          | 1    | 22  | Column-major | Squared                 | None
Naive          | 2    | 127 | Both         | Any                     | None
Naive          | 3    | 18  | Row-major    | Any                     | None
Naive          | 4    | 41  | Column-major | Squared                 | None
Naive          | 5    | 11  | Row-major    | Any                     | None
Naive          | 6    | 11  | Row-major    | Any                     | None
Naive          | 7    | 30  | Row-major    | Any                     | None
Naive          | 8    | 18  | Column-major | Any                     | None
Naive          | 9    | 40  | Column-major | Any                     | None
Naive          | 10   | 39  | Column-major | Any                     | None
Naive          | 11   | 43  | Row-major    | Any                     | None
Naive          | 12   | 11  | Row-major    | Squared                 | None
Naive parallel | 13   | 39  | Row-major    | Squared                 | OpenMP
Naive parallel | 14   | 28  | Column-major | Squared                 | OpenMP
Naive parallel | 15   | 164 | Row-major    | Any                     | OpenMP
Naive parallel | 16   | 22  | Row-major    | Multiple of nthreads    | C++ threads
Naive parallel | 17   | 107 | Row-major    | Squared                 | C++ threads
Unrolled       | 18   | 57  | Row-major    | Any                     | None
Unrolled       | 19   | 50  | Row-major    | Any                     | None
Unrolled       | 20   | 63  | Row-major    | Squared                 | OpenMP
Unrolled       | 21   | 38  | Row-major    | Squared, multiple of bs | None
Kernel Calls   | 22   | 46  | Column-major | Any                     | None
Kernel Calls   | 23   | 115 | Column-major | Any                     | OpenMP
Kernel Calls   | 24   | 61  | Column-major | Any                     | None
Kernel Calls   | 25   | 105 | Column-major | Any                     | Unrolled
Kernel Calls   | 26   | 164 | Column-major | Any                     | Unrolled
Blocked        | 27   | 104 | Row-major    | Any                     | Block
Blocked        | 28   | 30  | Row-major    | Squared                 | OpenMP
Blocked        | 29   | 52  | Column-major | Any                     | None
Blocked        | 30   | 35  | Row-major    | Squared                 | None
Blocked        | 31   | 38  | Column-major | Squared                 | None
Blocked        | 32   | 42  | Row-major    | Multiple of bs          | Unrolled
Blocked        | 33   | 49  | Row-major    | Squared                 | None
Blocked        | 34   | 18  | Row-major    | Squared                 | None
Blocked        | 35   | 21  | Row-major    | Squared                 | None
Goto           | 36   | 247 | Column-major | Squared                 | Intrinsics (SSE)
Goto           | 37   | 89  | Row-major    | Squared                 | None
Goto           | 38   | 210 | Row-major    | Squared                 | None
Strassen       | 39   | 315 | Row-major    | Squared, power of 2     | None
Strassen       | 40   | 162 | Row-major    | Squared                 | None
Intrinsics     | 41   | 102 | Row-major    | Squared                 | Intrinsics (AVX2)
Intrinsics     | 42   | 91  | Row-major    | Multiple of 8           | Intrinsics (AVX2)
Intrinsics     | 43   | 82  | Row-major    | Multiple of 8           | Intrinsics (AVX2)
Intrinsics     | 44   | 58  | Row-major    | Any                     | Intrinsics (SSE)
Intrinsics     | 45   | 112 | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 46   | 136 | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 47   | 120 | Row-major    | Any                     | Intrinsics (AVX2)
Intrinsics     | 48   | 143 | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 49   | 57  | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 50   | 60  | Row-major    | Any                     | Intrinsics (SSE)
[Chart: percentage of matched codes per category (Naive, Naive p., Unrolled, Kernels, Blocked, Goto, Strassen, Intrinsics, All) for IDL, POLLY, KFR, FACC* and ATC; overall, ATC matches 42 of the 50 codes, vs. 13 for Polly, 11 for KFR, 10 for FACC* and 6 for IDL.]
pruning, missing a legal match. Improved search heuristics are likely to improve program coverage.

False positives. None of the methods classified any of the 50 non-GEMMs as a GEMM. Across all methods, there were no false positives.

Figure 10. Percentage of matched GEMM codes by ATC divided by failure reason. [Chart: % of programs that were Matched, had Too many candidates, or had Missed matches.]

8.2 Performance
The performance of each approach is shown in Figure 11. Polly is not included here as, although it can detect SCoPs, it does not explicitly identify them as GEMMs for API replacement. We show two bars for KernelFaRer, which correspond to the strategy of replacing GEMM code with an optimized CPU implementation as described in [20], and KFR (XPU), which is our extension, replacing the CPU library with the optimized XPU implementation. IDL and FACC* directly target the accelerator, while ATC chooses the CPU or accelerator based on its SVM platform predictor. This runtime prediction cost is negligible (≤ 0.3 msec) and is included in Figure 11.
What is immediately clear is that detecting more GEMMs leads to better overall speedup. In the Naive category, KFR and ATC are both able to achieve good performance, with speedups of 726x and 1031x, respectively. The gap is narrowed when using KFR (XPU). However, KFR is unable to detect GEMMs in any other category, leading to just a 6.2x
Figure 11. Geometric mean speedup obtained by IDL, KernelFaRer, FACC* and ATC in GEMM programs with n = 8192. [Chart: speedup (log scale, 1x to 10000x) per category (Naive, Naive p., Unrolled, Kernels, Blocked, Goto, Strassen, Intrinsics, All) for IDL, KFR (CPU), KFR (XPU), FACC* and ATC.]
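At runtime, the offload decision behind these numbers reduces to evaluating the offline-trained predictor on the known array sizes. As a sketch, here is a linear decision function of the kind an SVM yields; the feature choice, weights and bias are invented for illustration and are not ATC's learned values.

```c
/* Illustrative sketch of the runtime profitability check: evaluate a
 * learned linear decision function on the GEMM dimensions. */

/* Decide whether to offload a GEMM of size m x n x k.
 * w and bias come from offline training (hypothetical values here). */
static int offload_to_accelerator(double m, double n, double k,
                                  const double w[3], double bias) {
    double score = w[0] * m + w[1] * n + w[2] * k + bias;
    return score > 0.0; /* 1: call accelerator API, 0: stay on CPU */
}
```

Because the check is a handful of multiply-adds, its cost is easily within the sub-millisecond overhead reported above.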
Figure 12. Comparison of the number of candidates generated for matching GEMM codes: FACC* vs our approach. [Chart: candidates per code (1-50, log scale), with corresponding matching times ranging from 4 s to 12 h.]
Figure 13. Compilation time for different numbers of candidates. [Chart: compilation time (s) vs. number of candidates (1, 3, 8, 48, 280), broken down into Neural Embeddings, Candidates Generation, Tests Generation and IO Testing.]

of FFTs and does not scale to the longer function signatures used for GEMM. To support any accelerator type, the compiler should support multi-dimensional arrays, while FACC only supports 1D arrays. Because in 1D arrays and FFTs the search space in matching the API parameters is small, FACC does not include anything to reduce it. With more complex programs and domains, this limitation makes compiling programs intractable.
Mask [52] uses symbolic execution to prove equivalence, which does not work well for floating-point problems. Fuzzy classification techniques based on code clone detection [41, 58], domain-classification [60], pattern matching [15], code embeddings [2, 3, 21] and identifiers [37, 48] can be used to help compile to accelerators [63]. These classification strategies are able to classify diverse code structures, but do not provide a compilation strategy for using an accelerator
such as OpenCL Kernels [30, 61] and OpenMP [43]. Similar techniques have been applied to FPGAs, by estimating power/performance [26] and tracking actual performance [51].

10 Conclusions
This work presented ATC, a flexible domain-agnostic compiler that matches legacy linear algebra code to accelerators. By using IO behavioral equivalence and smart search space reduction, we are able to match over 80% of challenging real-world programs to accelerator APIs, significantly outperforming all alternative approaches.
Supporting new domains different from GEMM and convolution is easy because ATC focuses on behavior rather than code structure, which makes it very flexible and extensible. Furthermore, to support other accelerators for GEMM or convolution, only the accelerator API is needed: ATC adapts to the new specification automatically.
Future work will examine how to further reduce the search space using online learning and to expand the complexity of user code considered. Longer-term, we wish to automatically target a range of accelerators with diverse functionality, matching and transforming user code to maximize performance.

Acknowledgments
Grant TED2021-129221B-I00 funded by MCIN/AEI/10.13039/501100011033 and by the "European Union NextGenerationEU/PRTR".

References
[1] Maaz Bin Safeer Ahmad, Jonathan Ragan-Kelley, Alvin Cheung, and Shoaib Kamil. 2019. Automatically translating image processing libraries to Halide. ACM Transactions on Graphics 38, 6 (Nov. 2019), 1-13. doi: 10.1145/3355089.3356549.
[2] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). doi: 10.1145/2786805.2786849.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (Jan. 2019), 1-29. doi: 10.1145/3290353.
[4] Muhammad Shoaib Bin Altaf and David A. Wood. 2017. LogCA: A High-Level Performance Model for Hardware Accelerators. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). Association for Computing Machinery, New York, NY, USA, 375-388. doi: 10.1145/3079856.3080216.
[5] Michael Anderson, Benny Chen, Summer Deng, Jordan Fix, Michael Gschwind, Aravind Kalaiah, Changkyu Kim, Jaewon Lee, Jason Liang, Haixin Lui, Arun Montgomery, Jacka dn Moorthy, Satish Nadathur, Sam Naghshineh, Avinash Nayak, Jongsoo Park, Chris Petersen, Martin Schatz, Narayanan Sundaram, Bandsheng Ten, Peter Tang, Amy Yang, Jiecao Yu, Hector Yuen, Ying Zhang, Aravind Anbudarai, Vandana Balan, Harsha Bojja, Joe Boyd, Matthew Breitback, Claudio Caldato, Anna Calvo, Garret Catron, Sneh Chandwani, Panos Christeas, Brad Cottel, Briand Countinho, Arun Dalli, Abhishek Chanotia, Oniel Duncan, Roman Dzhabrov, Simon Elmir, Chunli Fu, Wenyin Fu, Michael Fulthrop, Adi Gangidi, Nick Gibson, Sean Gordon, Beatriz Padilla Hernandez, Daniel Ho, Yu-Cheng Huang, Olof Johansson, Shishir Juluri, Shobhit Kanaujia, Mannli Kesarkar, Jonathan Killinger, Ben Kim, Rohan Kulkarni, Meghan Lele, Hauyi Li, Huamin Li, Christopher Mitchell, Bharath Muthiah, Nitin Nagarkatte, Ashwin Narasimha, Bernard Nguyen, Thiara Ortiz, Soumya Padmanabha, Deng Pan, Ashwin Poojary, Ye Qi, Oliver Raginel, Dward Rajagopal, Tristian Rice, Craig Ross, Nadav Rotem, Scott Russ, Kushal Shsh, Bauhua Shan, Hao Shen, Pavan Shetty, Krish Skandakumaran, Kutta Srinivasan, Roshan Sumbaly, Michael Taubery, Mor Tzur, Hao Wang, Man Wang, Ben Wei, Alex Xia, Chanyu Xu, Martin Yang, Kai Zhang, Ruoxi Zhang, Ming Zhao, Witney Zhao, Rui Zhu, Lin Qiao, Misha Smelyanskiy, Bill Jia, and Vijay Roa. 2021. First-Generation Inference Accelerator Deployment at Facebook. arXiv:2107.04140 [cs.AR].
[6] José M. Andión. 2015. Compilation techniques for automatic extraction of parallelism and locality in heterogeneous architectures. Ph.D. Thesis. Universidade da Coruña. http://hdl.handle.net/2183/15854.
[7] Kevin Angstadt, Jean-Baptiste Jeannin, and Westley Weimer. 2020. Accelerating Legacy String Kernels via Bounded Automata Learning. In ASPLOS. doi: 10.1145/3373376.3378503.
[8] Arm. 2020. Arm Ethos-U55: microNPU. Available at https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55 (Accessed 2022).
[9] Somashekaracharya G. Bhaskaracharya, Julien Demouth, and Vinod Grover. 2020. Automatic Kernel Generation for Volta Tensor Cores. arXiv:2006.12645 [cs.PL].
[10] Gabriel Hjort Blindell. 2018. Universal Instruction Selection. Ph.D. Thesis. KTH Royal Institute of Technology.
[11] Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal. 2021. Progressive Raising in Multi-level IR. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 15-26. doi: 10.1109/CGO51591.2021.9370332.
[12] Jack Choquette, Olivier Giroux, and Denis Foley. 2018. Volta: Performance and Programmability. IEEE Micro 38, 2 (2018), 42-52. doi: 10.1109/MM.2018.022071134.
[13] Jean Coiffier. 2011. Fundamentals of Numerical Weather Prediction. Cambridge University Press. doi: 10.1017/CBO9780511734458.
[14] Bruce Collie. 2022. Practical Synthesis from Real-World Oracles. Ph.D. Thesis. The University of Edinburgh. doi: 10.7488/era/2334.
[15] Bruce Collie, Philip Ginsbach, and Michael F. P. O'Boyle. 2019. Type-Directed Program Synthesis and Constraint Generation for Library Portability. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). 55-67. doi: 10.1109/PACT.2019.00013.
[16] Bruce Collie, Philip Ginsbach, Jackson Woodruff, Ajitha Rajan, and Michael F. P. O'Boyle. 2021. M3: Semantic API Migrations. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (Virtual Event, Australia) (ASE '20). Association for Computing Machinery, New York, NY, USA, 90-102. doi: 10.1145/3324884.3416618.
[17] Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze. 2020. Automatic Generation of High-Performance Quantized Machine Learning Kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (San Diego, CA, USA) (CGO 2020). Association for Computing Machinery, New York, NY, USA, 305-316. doi: 10.1145/3368826.3377912.
[18] Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael F. P. O'Boyle, and Hugh Leather. 2021. ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 2244-2253.
[19] William J. Dally, Yatish Turakhia, and Song Han. 2020. Domain-Specific Hardware Accelerators. Commun. ACM 63, 7 (June 2020), 48-57. doi: 10.1145/3361682.
[20] João P. L. De Carvalho, Braedy Kuzma, Ivan Korostelev, José Nelson Amaral, Christopher Barton, José Moreira, and Guido Araujo. 2021. KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls. ACM Trans. Archit. Code Optim. 18, 3, Article 38 (June 2021), 22 pages. doi: 10.1145/3459010.
[21] Daniel DeFreez, Aditya V. Thakur, and Cindy Rubio-González. 2018. Path-based function embedding and its application to error-handling specification mining. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). doi: 10.1145/3236024.3236059.
[22] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. 2020. Mathematics for Machine Learning. Cambridge University Press. doi: 10.1017/9781108679930.
[23] B. Di Martino and G. Iannello. 1996. PAP Recognizer: a tool for automatic recognition of parallelizable patterns. In WPC '96. 4th Workshop on Program Comprehension. 164-174. doi: 10.1109/WPC.1996.501131.
[24] J. Domke, E. Vatai, A. Drozd, P. ChenT, Y. Oyama, L. Zhang, S. Salaria, D. Mukunoki, A. Podobas, M. WahibT, et al. 2021. Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, Los Alamitos, CA, USA, 1056-1065. doi: 10.1109/IPDPS49936.2021.00114.
[25] Jeremy Fowers, Kalin Ovtcharov, Michael K. Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2019. Inside Project Brainwave's Cloud-Scale, Real-Time AI Processor. IEEE Micro 39, 3 (2019), 20-28. doi: 10.1109/MM.2019.2910506.
[26] Gereon Führ, Seyit Halil Hamurcu, Diego Pala, Thomas Grass, Rainer Leupers, Gerd Ascheid, and Juan Fernando Eusse. 2019. Automatic Energy-Minimized HW/SW Partitioning for FPGA-Accelerated MPSoCs. IEEE Embedded Systems Letters 11, 3 (2019), 93-96. doi: 10.1109/LES.2019.2901224.
[27] gcc documentation. 2022. 26.2 Match and Simplify: The Language. Available at https://gcc.gnu.org/onlinedocs/gccint/The-Language.html (Accessed 2022).
[28] Philip Ginsbach, Toomas Remmelg, Michel Steuwer, Bruno Bodin, Christophe Dubach, and Michael F. P. O'Boyle. 2018. Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach. SIGPLAN Not. 53, 2 (March 2018), 139-153. doi: 10.1145/3296957.3173182.
[29] Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), 25 pages. doi: 10.1145/1356052.1356053.
[30] Dominik Grewe, Zheng Wang, and Michael F. P. O'Boyle. 2013. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 1-10. doi: 10.1109/CGO.2013.6494993.
[31] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. 2011. Polly - Polyhedral optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), Vol. 2011. 1.
[32] Intel. 2022. AI Hardware. Available at https://www.intel.com/content/www/us/en/artificial-intelligence/hardware.html.
[33] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv:1408.5093 [cs.CV].
[34] Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. 2021. Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE Computer Society, Los Alamitos, CA, USA, 1-14. doi: 10.1109/ISCA52012.2021.00010.
[35] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1-12. doi: 10.1145/3140659.3080246.
[36] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014). Association for Computing Machinery, New York, NY, USA, 92-101. doi: 10.1145/2597073.2597074.
[37] Jakapong Klainongsuang, Yusuf Sulistyo Nugroho, Hideaki Hata, Bundit Manaskasemsak, Arnon Rungsawang, Pattara Leelaprute, and Kenichi Matsumoto. 2019. Identifying Algorithm Names in Code Comments. arXiv:1907.04557 [cs.SE].
[38] Chris Lattner and Vikram Adve. 2004. LLVM: a compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004. 75-86. doi: 10.1109/CGO.2004.1281665.
[39] Vladimir I. Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. Soviet Union, 707-710.
[40] Benjamin Livshits, Manu Sridharan, Yannis Smaragdakis, Ondřej Lhoták, J. Nelson Amaral, Bor-Yuh Evan Chang, Samuel Z. Guyer, Uday P. Khedker, Anders Møller, and Dimitrios Vardoulakis. 2015. In defense of soundiness. Commun. ACM 58, 2 (Jan. 2015), 44-46. doi: 10.1145/2644805.
[41] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv:2102.04664 [cs.SE].
[42] Charith Mendis, Jeffrey Bosboom, Kevin Wu, Shoaib Kamil, Jonathan Ragan-Kelley, Sylvain Paris, Qin Zhao, and Saman Amarasinghe. 2015. Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code. In PLDI. doi: 10.1145/2737924.2737974.
[43] Alok Mishra, Abid M. Malik, and Barbara Chapman. 2020. Using Machine Learning for OpenMP GPU Offloading in LLVM. In SC (2020).
[44] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI '16). AAAI Press, 1287-1293. doi: 10.5555/3015812.3016002.
[45] Alastair Colin Murray. 2012. Customising Compilers for Customisable Processors. Ph.D. Thesis. The University of Edinburgh. http://hdl.