Matching Linear Algebra and Tensor Code To Specialized Hardware Accelerators
CC ’23, February 25–26, 2023, Montréal, QC, Canada P.A. Martínez, J. Woodruff, J. Armengol-Estapé, G. Bernabé, J.M. García, M.F.P. O’Boyle
2.3 Our approach - ATC
Rather than relying on code structure to guide detection, ATC uses behavioral equivalence to determine if a section of code is a linear algebra operation. Firstly, ATC uses neural program classification [18] to detect that the code in Figure 2 is probably a GEMM. It then searches variable matches to determine the potential source and output arrays. As the search space is combinatorially large, we introduce scalable, algorithm-independent heuristics (which we discuss in Section 5) that keep the number of mappings manageable. Next, ATC generates different input values for the arrays and records the output. After generating many randomized inputs, it observes that the code has equivalent behavior to the corresponding API and is able to replace the AVX2 code with the GEMM call at the bottom of Figure 2.

FACC. Behavioral equivalence is also employed in FACC [63]. Unfortunately, it is restricted to FFTs and one-dimensional arrays, and cannot detect the replacement in Figure 1. Therefore, we extended FACC to FACC* to consider GEMMs and multi-dimensional arrays. This, however, exposes its weak variable binding model, which is combinatorial in the number of user array variables and their dimensionality. Furthermore, it relies on program synthesis to determine the length of arrays, which scales poorly to problems with many potential length parameters for arrays, such as GEMM.
FACC also relies on brittle inter-procedural liveness analyses to determine the liveness status of variables. This restricts it to running only at link time, rendering it invalid for use in shared libraries. We will see in Section 8 that the combination of these issues results in excessively large search spaces.

Legality. IO behavioral equivalence is not proof that a section of code is a particular linear algebra operation; similarly, IDL and KernelFaRer do not prove equivalence. For proof, bounded model checking based on Kleene [14] can be deployed. In practice, as demonstrated in our experimental section, IO equivalence gives no false positives. For further guarantees, we can ask for programmer sign-off or employ model checking.

Profitable. Once we have detected and can replace a section of code with an accelerator call, we need to determine if it is profitable to do so. Due to hardware evolution, we do not use a hard-wired heuristic to determine profitability. Instead, we learn, off-line, a simple predictive model to determine if the target accelerator is faster than a CPU implementation. The model is called at runtime, determining if offloading is worthwhile.

3 System overview
Figure 3 gives a system flow overview of ATC. We first detect regions of code that are likely to be linear algebraic operations using a neural program classifier. The classifier is trained ahead of time, based on programs that are equivalent to the accelerator and prior examples of linear algebra code. Once candidate code sections have been identified, we apply program analysis to match user program variables with the particular API formal parameters. Given the combinatorially large search space, we develop novel techniques to make the problem tractable.
For each candidate matching, we generate multiple data inputs, execute the user code section and record the output values. If the input/output pairs correspond to the input/output behavior of the accelerator API, we can say they are behaviorally equivalent and candidates for replacement.
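The randomized IO test at the heart of this pipeline can be sketched in a few lines of C. This is an illustrative reconstruction, not ATC's implementation: user_kernel stands in for the candidate user code and reference_gemm for the accelerator API, and both names are assumptions made for this sketch.

```c
/* Minimal sketch of randomized IO testing: run the candidate code and a
 * reference GEMM on the same random inputs and compare the outputs. */
#include <math.h>
#include <stdlib.h>

#define N 8       /* small matrices for illustration */
#define TRIALS 32 /* number of randomized inputs     */

/* Candidate user code: a naive 3-loop GEMM computing C += A * B. */
static void user_kernel(const float *A, const float *B, float *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Stand-in for the accelerator API (different loop order, same behavior). */
static void reference_gemm(const float *A, const float *B, float *C) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Returns 1 if every randomized trial produces matching outputs. */
static int io_equivalent(void) {
    float A[N * N], B[N * N], C1[N * N], C2[N * N];
    for (int t = 0; t < TRIALS; t++) {
        for (int i = 0; i < N * N; i++) {
            A[i] = (float)rand() / RAND_MAX;
            B[i] = (float)rand() / RAND_MAX;
            C1[i] = C2[i] = 0.0f;
        }
        user_kernel(A, B, C1);
        reference_gemm(A, B, C2);
        for (int i = 0; i < N * N; i++)
            if (fabsf(C1[i] - C2[i]) > 1e-4f)
                return 0; /* behaviors diverge: no match */
    }
    return 1; /* candidate for replacement */
}
```

As the Legality paragraph notes, passing such tests is evidence of equivalence, not proof of it.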
While candidate user code may be replaceable with a call to an accelerator API, it may not be profitable. Therefore, we employ a simple ML classifier, trained offline, and invoked at runtime to see if acceleration is appropriate for the user code for the runtime-known array sizes.

Figure 4. Dimension detection algorithm overview for a target example array called A. [Flowchart: for each Load/StoreInst, if it does not access array A the load/store is redirected to index 0; if it does, an out-of-bounds access exits with a custom error code, otherwise the instruction is performed normally.]

3.1 Neural Program Classification
To detect potentially acceleratable parts of a program, we use prior work in neural program classification [18]. A network is
trained with multiple instances of different program classes. We use the OJClone dataset [44], which includes 105 classes of different programs, and add examples of the programs that we want to detect, e.g. GEMMs and convolutions, gathered from benchmark suite repositories other than GitHub.
At compile time a new candidate program is divided into functions, which are presented to the neural classifier. The classifier assigns each function in the program a probability of belonging to a certain class. We consider the most probable class, which in the case of a GEMM or convolution is then considered for variable matching and eventual code replacement as described in the following sections. Classification is fast (≤ 1.5 sec) and has negligible impact on compilation time (see Section 8.3).

Algorithm 1 Dimension detection algorithm
1:  for arr in function do
2:      fakeLoadAndStoresExcept(arr)
3:      replaceLoadAndStores(arr)
4:      repeat
5:          c = getNextCombination(arr)
6:          ffi_call(A, V)
7:          if not failed then
8:              found = True
9:          end if
10:     until found
11:     Add c to C
12: end for
13: return C
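The out-of-bounds branch of Figure 4 amounts to a tiny runtime helper injected in place of each load or store of the target array. A minimal sketch follows; the names checked_load, candidate_elems and OUT_OF_BOUNDS_EXIT are assumptions made here, since ATC injects the equivalent call directly at the LLVM IR level.

```c
/* Sketch of the runtime bound check that replaces loads/stores of the
 * target array while one size hypothesis is being tested. */
#include <stdlib.h>

#define OUT_OF_BOUNDS_EXIT 42 /* custom error exit code */

/* Size hypothesis (number of elements) currently being tested. */
static long candidate_elems;

/* Injected in place of a load of the target array. */
static float checked_load(const float *arr, long idx) {
    if (idx < 0 || idx >= candidate_elems)
        exit(OUT_OF_BOUNDS_EXIT); /* hypothesis too small: reject it */
    return arr[idx];              /* in bounds: perform the access   */
}
```

A matching checked_store would guard writes in the same way.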
4 Variable Matching
To check if a section of user code is behaviorally equivalent to the API, we have to match up the user program variables with API formal parameters. We first detect which variables are livein/liveout (Section 4.1) and then the dimensions of arrays (Section 4.2).

4.1 Detecting livein and liveout variables
Detecting livein and liveout variables via standard static analysis is straightforward for well-structured programs but fails for more diverse real-world codes, which may use assembly code or intrinsic functions.
ATC uses dynamic analysis to determine which variables are livein and liveout inside a function. In C, variables are passed by value, so non-pointer variables are always livein. In the case of pointers (or arrays), we generate random inputs with arbitrary sizes. If the values in memory change after executing the program, the array is considered liveout.
This allows us to detect which variables are livein or liveout, but not both livein and liveout at the same time. We generate a new random input for liveout variables and re-execute the function. If the output differs from the first execution, it is both livein and liveout. We implement this algorithm as a just-in-time compiler pass in LLVM [38].

4.2 Detecting the dimensions of arrays
Detecting array lengths enables offloading of appropriately-sized regions of code, so it is a critical step in ATC. For some programs, lengths can be found using static analysis (e.g. [50]), but this fails in more complex cases. We use runtime analysis to determine which program variables define array size using a modified form of runtime array bound checking. For each set of variables that could define an array's size (typically, from the argument list), we set such variables to a fixed value. We then execute the user code, which is modified to check runtime array accesses.
First, the compiler selects a target array to find its size. Then, to generate the modified program, we tweak the load and store instructions in the user program, replacing them with custom function calls in the IR. If a load or store does not access the array we are interested in, we modify it to load and store at a constant, safe location. If it does, the instruction is replaced with a function call that will check at runtime if the access is out of bounds. If so, the program exits with a custom error code. If not, we have found a valid array size. The basic idea is depicted in Figure 4. This is used by our JIT analysis as shown in Algorithm 1 and implemented in LLVM.
This way, the compiler can assign different input sizes to a given array and check the exit code. The compiler therefore iterates over all the possible dimension combinations until one of the executions does not end with the custom error exit code. That means the program completed without any illegal access to the target array, which indicates that the right dimensions for the array have been found.
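The exit-code probing loop described above can be sketched as a fork-and-inspect driver. This is an illustrative reconstruction: run_instrumented is a hypothetical stand-in for launching the JIT-compiled, instrumented user code, and the exit-code value is an assumption of this sketch.

```c
/* Sketch of the dimension search driver: re-run the instrumented user code
 * under each size hypothesis and keep the first one that does not exit with
 * the custom out-of-bounds error code. */
#include <sys/wait.h>
#include <unistd.h>

#define OUT_OF_BOUNDS_EXIT 42

/* Hypothetical: runs the instrumented code, exiting with
 * OUT_OF_BOUNDS_EXIT on any illegal access to the target array. */
extern void run_instrumented(long elems);

static long find_size(const long *hypotheses, int n) {
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {                 /* child: test one hypothesis */
            run_instrumented(hypotheses[i]);
            _exit(0);                   /* completed without error    */
        }
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) != OUT_OF_BOUNDS_EXIT)
            return hypotheses[i];       /* valid array size found */
    }
    return -1;                          /* no hypothesis survived */
}
```

Running each hypothesis in a child process mirrors why the exit-code mechanism works: an out-of-bounds probe can terminate the run without corrupting the compiler's own state.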
lev(a, b) =
    |a|                               if |b| = 0,
    |b|                               if |a| = 0,
    lev(tail(a), tail(b))             if a[0] = b[0],
    1 + min( lev(tail(a), b),
             lev(a, tail(b)),
             lev(tail(a), tail(b)) )  otherwise.        (1)

Figure 6. Levenshtein recursive definition

[Listing: the accelerator API signature (gemm_api) and the user GEMM signature (gemm) whose parameters must be matched.]

gemm_api(float *tc_A, float *tc_B, float *tc_C,
         int tc_m, int tc_n, int tc_k,
         int tc_lda, int tc_ldb, int tc_ldc,
         float tc_alpha, float tc_beta) {

gemm(int M, int N, int K, float alpha,
     float *A, int lda, float *B, int ldb,
     float beta, float *C, int ldc) {

floating-point programs [63], verification of such liftings is some way off.
In summary, the key challenges that all competing techniques face are:
• Floating-point numbers often raise challenges in theorem provers as they are difficult to reason about.
• Floating-point functions may have different accuracies in different input ranges, meaning that the obvious checks of correctness (even within bounds) are difficult to apply.
The backend of ATC is not tied to using behavioral equivalence. As we will see, the use of such behavioral equivalence results in no false positives. Further development of theorem prover technologies would mean that the weak behavioral equivalence in ATC could easily be replaced with a theorem prover guaranteeing correctness and enabling automatic transformations.
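The recursive definition of Figure 6 transcribes directly into C. This is a reference version for illustration only: it is exponential in the input length, and practical implementations memoize or use the standard dynamic-programming table.

```c
/* Levenshtein distance, transcribed from the recursive definition in
 * Figure 6; tail(x) is modeled by advancing the pointer. */
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

static int lev(const char *a, const char *b) {
    if (*b == '\0') return (int)strlen(a);   /* |a| if |b| = 0       */
    if (*a == '\0') return (int)strlen(b);   /* |b| if |a| = 0       */
    if (*a == *b)   return lev(a + 1, b + 1);/* heads match          */
    return 1 + min3(lev(a + 1, b),           /* deletion             */
                    lev(a, b + 1),           /* insertion            */
                    lev(a + 1, b + 1));      /* substitution         */
}
```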
the CPU, we used oneDNN v1.96. The TPU system runs Debian 10 with kernel 4.19.0-14.

7.1 User code
We explored GitHub looking for C and C++ GEMM codes, analyzing more than 400 programs, from which we selected 50 programs. We discarded the rest because of wrong implementations, compilation errors or duplicated code. The final list of programs is shown in Figure 8. We categorize the codes as follows: Naive: naive implementations with the traditional 3-loop structure; Naive Parallel: as Naive but with simple outer loop parallelization; Unrolled: naive implementation with unrolled loops; Kernel Calls: implementations that divide the loops into different function calls; Blocked: tiled implementations; Goto: implementations of the Goto algorithm [29]; Strassen: implementations of the Strassen algorithm [56]; Intrinsics: implementations using Intel intrinsics.
In addition, we selected 50 non-GEMM projects to check whether any of the approaches gave false positives.

Convolutions. We explored GitHub looking for C and C++ 4D convolution implementations. We analyzed around 50 programs, from which we selected a list of 15 programs based on the same methodology used for selecting GEMMs. The list of convolution programs is shown in Table 1. We have included codes from the most relevant convolution implementations: Direct: the direct convolution algorithm; im2col+gemm: an algorithm that casts the input as matrices (im2col) and later uses a GEMM, as in Caffe [33]; Winograd: the Winograd algorithm.

7.2 Methods
We evaluate our approach against 4 well-known schemes:
IDL: Idioms are described using an idiom description language [28], which is translated into a set of constraints over LLVM IR.
KernelFaRer: Uses different pattern matching to detect specific code constructs, matching specific matrix-multiplication structures [20].

IDL. The constraint-based scheme [28] only matches 6 out of 50 cases. These programs are largely naive implementations of GEMM, with a simple loop structure. It is able to manage 2 programs containing unrolled loops but fails on anything more complex. Matching more diverse cases would require writing a new IDL constraint description for each sub-class.

KernelFaRer. This code matching approach [20] is more successful, matching 11 GEMMs due to a more robust pattern matcher. For straightforward sequential implementations, it is able to match all but one of the cases. However, any code variation, including loop unrolling, defeats it.

Polly. Although it does not match and replace GEMMs, it can detect SCoPs which may be candidates for replacement with appropriate API calls. It is less successful than KernelFaRer in detecting naive implementations but is more robust across other more complex categories, including one parallel and unrolled version and 2 blocked cases. It slightly outperforms KernelFaRer, matching 13 vs. 11 out of 50 cases.

FACC*. Unlike the other approaches, FACC* performed poorly on naive implementations, but better on others. Here, the size of the mapping search space is the limiting factor. It was able to find 10 cases in the available time (timeout ≤ 10 mins). We examine the reasons for this in Section 8.3.

ATC. Our approach is significantly more robust across all categories, matching 42 out of 50 cases. It is able to detect all naive implementations and the majority within each other category. It detects more naive parallel, unrolled and blocked programs than Polly and is the only technique to detect GEMMs in codes containing kernel calls and intrinsic instructions.

8.1.1 Accuracy. Figure 10 provides a summary of ATC's success and failure by type. In 8 cases ATC failed to detect that the program contained a GEMM. In one case, program 23, this is due to there being too many candidate matches, 280, which is above our timeout threshold of 100 candidates. The remaining cases are due to overly aggressive search
Figure 8. List of GEMM codes

Algorithm      | Code | LoC | Layout       | Sizes                   | Optimizations
Naive          | 1    | 22  | Column-major | Squared                 | None
Naive          | 2    | 127 | Both         | Any                     | None
Naive          | 3    | 18  | Row-major    | Any                     | None
Naive          | 4    | 41  | Column-major | Squared                 | None
Naive          | 5    | 11  | Row-major    | Any                     | None
Naive          | 6    | 11  | Row-major    | Any                     | None
Naive          | 7    | 30  | Row-major    | Any                     | None
Naive          | 8    | 18  | Column-major | Any                     | None
Naive          | 9    | 40  | Column-major | Any                     | None
Naive          | 10   | 39  | Column-major | Any                     | None
Naive          | 11   | 43  | Row-major    | Any                     | None
Naive          | 12   | 11  | Row-major    | Squared                 | None
Naive parallel | 13   | 39  | Row-major    | Squared                 | OpenMP
Naive parallel | 14   | 28  | Column-major | Squared                 | OpenMP
Naive parallel | 15   | 164 | Row-major    | Any                     | OpenMP
Naive parallel | 16   | 22  | Row-major    | Multiple of nthreads    | C++ threads
Naive parallel | 17   | 107 | Row-major    | Squared                 | C++ threads
Unrolled       | 18   | 57  | Row-major    | Any                     | None
Unrolled       | 19   | 50  | Row-major    | Any                     | None
Unrolled       | 20   | 63  | Row-major    | Squared                 | OpenMP
Unrolled       | 21   | 38  | Row-major    | Squared, multiple of bs | None
Kernel Calls   | 22   | 46  | Column-major | Any                     | None
Kernel Calls   | 23   | 115 | Column-major | Any                     | OpenMP
Kernel Calls   | 24   | 61  | Column-major | Any                     | None
Kernel Calls   | 25   | 105 | Column-major | Any                     | Unrolled
Kernel Calls   | 26   | 164 | Column-major | Any                     | Unrolled
Blocked        | 27   | 104 | Row-major    | Any                     | Block
Blocked        | 28   | 30  | Row-major    | Squared                 | OpenMP
Blocked        | 29   | 52  | Column-major | Any                     | None
Blocked        | 30   | 35  | Row-major    | Squared                 | None
Blocked        | 31   | 38  | Column-major | Squared                 | None
Blocked        | 32   | 42  | Row-major    | Multiple of bs          | Unrolled
Blocked        | 33   | 49  | Row-major    | Squared                 | None
Blocked        | 34   | 18  | Row-major    | Squared                 | None
Blocked        | 35   | 21  | Row-major    | Squared                 | None
Goto           | 36   | 247 | Column-major | Squared                 | Intrinsics (SSE)
Goto           | 37   | 89  | Row-major    | Squared                 | None
Goto           | 38   | 210 | Row-major    | Squared                 | None
Strassen       | 39   | 315 | Row-major    | Squared, power of 2     | None
Strassen       | 40   | 162 | Row-major    | Squared                 | None
Intrinsics     | 41   | 102 | Row-major    | Squared                 | Intrinsics (AVX2)
Intrinsics     | 42   | 91  | Row-major    | Multiple of 8           | Intrinsics (AVX2)
Intrinsics     | 43   | 82  | Row-major    | Multiple of 8           | Intrinsics (AVX2)
Intrinsics     | 44   | 58  | Row-major    | Any                     | Intrinsics (SSE)
Intrinsics     | 45   | 112 | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 46   | 136 | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 47   | 120 | Row-major    | Any                     | Intrinsics (AVX2)
Intrinsics     | 48   | 143 | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 49   | 57  | Row-major    | Multiple of bs          | Intrinsics (AVX2)
Intrinsics     | 50   | 60  | Row-major    | Any                     | Intrinsics (SSE)
[Chart: percentage of matched codes per category (Naive, Naive p., Unrolled, Kernels, Blocked, Goto, Strassen, Intrinsics, All) for IDL, POLLY, KFR, FACC* and ATC; overall, ATC matches 42 of the 50 codes, vs. 13 for Polly, 11 for KFR, 10 for FACC* and 6 for IDL.]
pruning, missing a legal match. Improved search heuristics are likely to improve program coverage.

False positives. None of the methods classified any of the 50 non-GEMMs as a GEMM. Across all methods, there were no false positives.

Figure 10. Percentage of matched GEMM codes by ATC divided by failure reason. [Chart: % of programs that were Matched, had Too many candidates, or had Missed matches.]

8.2 Performance
The performance of each approach is shown in Figure 11. Polly is not included here as, although it can detect SCoPs, it does not explicitly identify them as GEMMs for API replacement. We show two bars for KernelFaRer, which correspond to the strategy of replacing GEMM code with an optimized CPU implementation as described in [20], and KFR (XPU), which is our extension, replacing the CPU library with the optimized XPU implementation. IDL and FACC* directly target the accelerator, while ATC chooses the CPU or accelerator based on its SVM platform predictor. This runtime prediction cost is negligible (≤ 0.3 msec) and is included in Figure 11.
What is immediately clear is that detecting more GEMMs leads to better overall speedup. In the Naive category, KFR and ATC are both able to achieve good performance, with speedups of 726x and 1031x, respectively. The gap is narrowed when using KFR (XPU). However, KFR is unable to detect GEMMs in any other category, leading to just a 6.2x
Figure 11. Geometric mean speedup obtained by IDL, KernelFaRer, FACC* and ATC in GEMM programs with n = 8192. [Chart: speedup (log scale, 1x to 10000x) per category (Naive, Naive p., Unrolled, Kernels, Blocked, Goto, Strassen, Intrinsics, All) for IDL, KFR (CPU), KFR (XPU), FACC* and ATC.]
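At runtime, the offload decision behind these numbers reduces to evaluating the offline-trained predictor on the known array sizes. As a sketch, here is a linear decision function of the kind an SVM yields; the feature choice, weights and bias are invented for illustration and are not ATC's learned values.

```c
/* Illustrative sketch of the runtime profitability check: evaluate a
 * learned linear decision function on the GEMM dimensions. */

/* Decide whether to offload a GEMM of size m x n x k.
 * w and bias come from offline training (hypothetical values here). */
static int offload_to_accelerator(double m, double n, double k,
                                  const double w[3], double bias) {
    double score = w[0] * m + w[1] * n + w[2] * k + bias;
    return score > 0.0; /* 1: call accelerator API, 0: stay on CPU */
}
```

Because the check is a handful of multiply-adds, its cost is easily within the sub-millisecond overhead reported above.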
Figure 12. Comparison of the number of candidates generated for matching GEMM codes: FACC* vs our approach. [Chart: candidates per code (1-50, log scale), with corresponding matching times ranging from 4 s to 12 h.]
Figure 13. Compilation time for different numbers of candidates. [Chart: compilation time (s) vs. number of candidates (1, 3, 8, 48, 280), broken down into Neural Embeddings, Candidates Generation, Tests Generation and IO Testing.]

of FFTs and does not scale to the longer function signatures used for GEMM. To support any accelerator type, the compiler should support multi-dimensional arrays, while FACC only supports 1D arrays. Because in 1D arrays and FFTs the search space in matching the API parameters is small, FACC does not include anything to reduce it. With more complex programs and domains, this limitation makes compiling programs intractable.
Mask [52] uses symbolic execution to prove equivalence, which does not work well for floating-point problems. Fuzzy classification techniques based on code clone detection [41, 58], domain-classification [60], pattern matching [15], code embeddings [2, 3, 21] and identifiers [37, 48] can be used to help compile to accelerators [63]. These classification strategies are able to classify diverse code structures, but do not provide a compilation strategy for using an accelerator
such as OpenCL Kernels [30, 61] and OpenMP [43]. Similar techniques have been applied to FPGAs, by estimating power/performance [26] and tracking actual performance [51].

10 Conclusions
This work presented ATC, a flexible domain-agnostic compiler that matches legacy linear algebra code to accelerators. By using IO behavioral equivalence and smart search space reduction, we are able to match over 80% of challenging real-world programs to accelerator APIs, significantly outperforming all alternative approaches.
Supporting new domains different from GEMM and convolution is easy because ATC focuses on behavior rather than code structure, which makes it very flexible and extensible. Furthermore, to support other accelerators for GEMM or convolution, only the accelerator API is needed: ATC adapts to the new specification automatically.
Future work will examine how to further reduce the search space using online learning and to expand the complexity of user code considered. Longer-term, we wish to automatically target a range of accelerators with diverse functionality, matching and transforming user code to maximize performance.

Acknowledgments
Grant TED2021-129221B-I00 funded by MCIN/AEI/10.13039/501100011033 and by the "European Union NextGenerationEU/PRTR".

References
[1] Maaz Bin Safeer Ahmad, Jonathan Ragan-Kelley, Alvin Cheung, and Shoaib Kamil. 2019. Automatically translating image processing libraries to Halide. ACM Transactions on Graphics 38, 6 (Nov. 2019), 1-13. doi: 10.1145/3355089.3356549.
[2] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). doi: 10.1145/2786805.2786849.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (Jan. 2019), 1-29. doi: 10.1145/3290353.
[4] Muhammad Shoaib Bin Altaf and David A. Wood. 2017. LogCA: A High-Level Performance Model for Hardware Accelerators. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). Association for Computing Machinery, New York, NY, USA, 375-388. doi: 10.1145/3079856.3080216.
[5] Michael Anderson, Benny Chen, Summer Deng, Jordan Fix, Michael Gschwind, Aravind Kalaiah, Changkyu Kim, Jaewon Lee, Jason Liang, Haixin Lui, Arun Montgomery, Jacka dn Moorthy, Satish Nadathur, Sam Naghshineh, Avinash Nayak, Jongsoo Park, Chris Petersen, Martin Schatz, Narayanan Sundaram, Bandsheng Ten, Peter Tang, Amy Yang, Jiecao Yu, Hector Yuen, Ying Zhang, Aravind Anbudarai, Vandana Balan, Harsha Bojja, Joe Boyd, Matthew Breitback, Claudio Caldato, Anna Calvo, Garret Catron, Sneh Chandwani, Panos Christeas, Brad Cottel, Briand Countinho, Arun Dalli, Abhishek Chanotia, Oniel Duncan, Roman Dzhabrov, Simon Elmir, Chunli Fu, Wenyin Fu, Michael Fulthrop, Adi Gangidi, Nick Gibson, Sean Gordon, Beatriz Padilla Hernandez, Daniel Ho, Yu-Cheng Huang, Olof Johansson, Shishir Juluri, Shobhit Kanaujia, Mannli Kesarkar, Jonathan Killinger, Ben Kim, Rohan Kulkarni, Meghan Lele, Hauyi Li, Huamin Li, Christopher Mitchell, Bharath Muthiah, Nitin Nagarkatte, Ashwin Narasimha, Bernard Nguyen, Thiara Ortiz, Soumya Padmanabha, Deng Pan, Ashwin Poojary, Ye Qi, Oliver Raginel, Dward Rajagopal, Tristian Rice, Craig Ross, Nadav Rotem, Scott Russ, Kushal Shsh, Bauhua Shan, Hao Shen, Pavan Shetty, Krish Skandakumaran, Kutta Srinivasan, Roshan Sumbaly, Michael Taubery, Mor Tzur, Hao Wang, Man Wang, Ben Wei, Alex Xia, Chanyu Xu, Martin Yang, Kai Zhang, Ruoxi Zhang, Ming Zhao, Witney Zhao, Rui Zhu, Lin Qiao, Misha Smelyanskiy, Bill Jia, and Vijay Roa. 2021. First-Generation Inference Accelerator Deployment at Facebook. arXiv:2107.04140 [cs.AR].
[6] José M. Andión. 2015. Compilation techniques for automatic extraction of parallelism and locality in heterogeneous architectures. Ph.D. Thesis. Universidade da Coruña. http://hdl.handle.net/2183/15854.
[7] Kevin Angstadt, Jean-Baptiste Jeannin, and Westley Weimer. 2020. Accelerating Legacy String Kernels via Bounded Automata Learning. In ASPLOS. doi: 10.1145/3373376.3378503.
[8] Arm. 2020. Arm Ethos-U55: microNPU. Available at https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55 (Accessed 2022).
[9] Somashekaracharya G. Bhaskaracharya, Julien Demouth, and Vinod Grover. 2020. Automatic Kernel Generation for Volta Tensor Cores. arXiv:2006.12645 [cs.PL].
[10] Gabriel Hjort Blindell. 2018. Universal Instruction Selection. Ph.D. Thesis. KTH Royal Institute of Technology.
[11] Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal. 2021. Progressive Raising in Multi-level IR. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 15-26. doi: 10.1109/CGO51591.2021.9370332.
[12] Jack Choquette, Olivier Giroux, and Denis Foley. 2018. Volta: Performance and Programmability. IEEE Micro 38, 2 (2018), 42-52. doi: 10.1109/MM.2018.022071134.
[13] Jean Coiffier. 2011. Fundamentals of Numerical Weather Prediction. Cambridge University Press. doi: 10.1017/CBO9780511734458.
[14] Bruce Collie. 2022. Practical Synthesis from Real-World Oracles. Ph.D. Thesis. The University of Edinburgh. doi: 10.7488/era/2334.
[15] Bruce Collie, Philip Ginsbach, and Michael F. P. O'Boyle. 2019. Type-Directed Program Synthesis and Constraint Generation for Library Portability. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). 55-67. doi: 10.1109/PACT.2019.00013.
[16] Bruce Collie, Philip Ginsbach, Jackson Woodruff, Ajitha Rajan, and Michael F. P. O'Boyle. 2021. M3: Semantic API Migrations. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (Virtual Event, Australia) (ASE '20). Association for Computing Machinery, New York, NY, USA, 90-102. doi: 10.1145/3324884.3416618.
[17] Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze. 2020. Automatic Generation of High-Performance Quantized Machine Learning Kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (San Diego, CA, USA) (CGO 2020). Association for Computing Machinery, New York, NY, USA, 305-316. doi: 10.1145/3368826.3377912.
[18] Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael F. P. O'Boyle, and Hugh Leather. 2021. ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 2244-2253.
[19] William J. Dally, Yatish Turakhia, and Song Han. 2020. Domain-Specific Hardware Accelerators. Commun. ACM 63, 7 (June 2020), 48-57. doi: 10.1145/3361682.
[20] João P. L. De Carvalho, Braedy Kuzma, Ivan Korostelev, José Nelson Amaral, Christopher Barton, José Moreira, and Guido Araujo. 2021. KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls. ACM Trans. Archit. Code Optim. 18, 3, Article 38 (June 2021), 22 pages. doi: 10.1145/3459010.
[21] Daniel DeFreez, Aditya V. Thakur, and Cindy Rubio-González. 2018. Path-based function embedding and its application to error-handling specification mining. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). doi: 10.1145/3236024.3236059.
[22] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. 2020. Mathematics for Machine Learning. Cambridge University Press. doi: 10.1017/9781108679930.
[23] B. Di Martino and G. Iannello. 1996. PAP Recognizer: a tool for automatic recognition of parallelizable patterns. In WPC '96. 4th Workshop on Program Comprehension. 164-174. doi: 10.1109/WPC.1996.501131.
[24] J. Domke, E. Vatai, A. Drozd, P. ChenT, Y. Oyama, L. Zhang, S. Salaria, D. Mukunoki, A. Podobas, M. WahibT, et al. 2021. Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, Los Alamitos, CA, USA, 1056-1065. doi: 10.1109/IPDPS49936.2021.00114.
[25] Jeremy Fowers, Kalin Ovtcharov, Michael K. Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2019. Inside Project Brainwave's Cloud-Scale, Real-Time AI Processor. IEEE Micro 39, 3 (2019), 20-28. doi: 10.1109/MM.2019.2910506.
[26] Gereon Führ, Seyit Halil Hamurcu, Diego Pala, Thomas Grass, Rainer Leupers, Gerd Ascheid, and Juan Fernando Eusse. 2019. Automatic Energy-Minimized HW/SW Partitioning for FPGA-Accelerated MPSoCs. IEEE Embedded Systems Letters 11, 3 (2019), 93-96. doi: 10.1109/LES.2019.2901224.
[27] gcc documentation. 2022. 26.2 Match and Simplify: The Language. Available at https://gcc.gnu.org/onlinedocs/gccint/The-Language.html (Accessed 2022).
[28] Philip Ginsbach, Toomas Remmelg, Michel Steuwer, Bruno Bodin, Christophe Dubach, and Michael F. P. O'Boyle. 2018. Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach. SIGPLAN Not. 53, 2 (March 2018), 139-153. doi: 10.1145/3296957.3173182.
[29] Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), 25 pages. doi: 10.1145/1356052.1356053.
[30] Dominik Grewe, Zheng Wang, and Michael F. P. O'Boyle. 2013. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 1-10. doi: 10.1109/CGO.2013.6494993.
[31] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. 2011. Polly - Polyhedral optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), Vol. 2011. 1.
[32] Intel. 2022. AI Hardware. Available at https://www.intel.com/content/www/us/en/artificial-intelligence/hardware.html.
[33] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv:1408.5093 [cs.CV].
[34] Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. 2021. Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE Computer Society, Los Alamitos, CA, USA, 1-14. doi: 10.1109/ISCA52012.2021.00010.
[35] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1-12. doi: 10.1145/3140659.3080246.
[36] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014). Association for Computing Machinery, New York, NY, USA, 92-101. doi: 10.1145/2597073.2597074.
[37] Jakapong Klainongsuang, Yusuf Sulistyo Nugroho, Hideaki Hata, Bundit Manaskasemsak, Arnon Rungsawang, Pattara Leelaprute, and Kenichi Matsumoto. 2019. Identifying Algorithm Names in Code Comments. arXiv:1907.04557 [cs.SE].
[38] Chris Lattner and Vikram Adve. 2004. LLVM: a compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004. 75-86. doi: 10.1109/CGO.2004.1281665.
[39] Vladimir I. Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. Soviet Union, 707-710.
[40] Benjamin Livshits, Manu Sridharan, Yannis Smaragdakis, Ondřej Lhoták, J. Nelson Amaral, Bor-Yuh Evan Chang, Samuel Z. Guyer, Uday P. Khedker, Anders Møller, and Dimitrios Vardoulakis. 2015. In defense of soundiness. Commun. ACM 58, 2 (Jan. 2015), 44-46. doi: 10.1145/2644805.
[41] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv:2102.04664 [cs.SE].
[42] Charith Mendis, Jeffrey Bosboom, Kevin Wu, Shoaib Kamil, Jonathan Ragan-Kelley, Sylvain Paris, Qin Zhao, and Saman Amarasinghe. 2015. Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code. In PLDI. doi: 10.1145/2737924.2737974.
[43] Alok Mishra, Abid M. Malik, and Barbara Chapman. 2020. Using Machine Learning for OpenMP GPU Offloading in LLVM. In SC (2020).
[44] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI '16). AAAI Press, 1287-1293. doi: 10.5555/3015812.3016002.
[45] Alastair Colin Murray. 2012. Customising Compilers for Customisable Processors. Ph.D. Thesis. The University of Edinburgh. http://hdl.