HeteroFlow: A Modern C++ Library for CPU-GPU Parallel Programming
Tsung-Wei Huang∗ and Yibo Lin†
∗ Department of Electrical and Computer Engineering, University of Utah
† Department of Computer Science, Peking University
Abstract—In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of heterogeneous decomposition strategies. Our new CPU-GPU programming model allows users to express a problem in a way that adapts to effective separation of concerns and expertise encapsulation. Compared with existing libraries, Heteroflow is more cost-efficient in performance scaling, programming productivity, and solution generality. We have evaluated Heteroflow on two real applications in VLSI design automation and demonstrated performance scalability across different CPU-GPU numbers and problem sizes. In a particular example of VLSI timing analysis with million-scale tasking, Heteroflow achieved a 7.7× runtime speed-up (99 vs 13 minutes) over a baseline on a machine of 40 CPU cores and 4 GPUs.

I. INTRODUCTION
Modern parallel applications in machine learning, data analytics, and scientific computing typically consist of a heterogeneous use of both central processing units (CPUs) and graphics processing units (GPUs) [1]. Writing a parallel CPU-GPU program is never an easy job, since CPUs and GPUs have fundamentally different architectures and programming logic. To address this challenge, the parallel computing community has investigated many programming libraries to assist developers with quick access to massively parallel and heterogeneous computing resources using minimal programming effort [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. In particular, hybrid multi-CPU multi-GPU systems are driving high demand for new heterogeneous programming techniques in support of more efficient CPU-GPU collaborative computing [12]. However, related research remains nascent, especially on the front of leveraging modern C++ to achieve new programming productivity and performance scalability that were previously out of reach [13].

The Heteroflow project addresses a long-standing question: "how can we make it easier for C++ developers to write efficient CPU-GPU parallel programs?" For many C++ developers, achieving high performance on a hybrid CPU-GPU system can be tedious. Programmers have to overcome complexities arising out of concurrency controls, kernel offloading, scheduling, and load-balancing before diving into the real implementation of a heterogeneous decomposition algorithm. Heteroflow adopts a new task-based programming model using modern C++ to address this challenge. Consider the canonical saxpy (A·X plus Y) example in Figure 1. Each Heteroflow task belongs to one of host, pull, push, and kernel tasks: a host task runs a callable object on any CPU core ("the host"), a pull task copies data from the host to a GPU ("the device"), a push task copies data from a GPU to the host, and a kernel task offloads computation to a GPU. Figure 1 explains the saxpy task graph in Heteroflow's graph language.

Fig. 1: A saxpy ("single-precision A·X plus Y") task graph using two host tasks to create two data vectors, two pull tasks to send data to a GPU, a kernel task to offload the saxpy computation to the GPU, and two push tasks to push data from the GPU to the host.

__global__ void saxpy(int n, int a, int* x, int* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

const int N = 65536;
vector<int> x, y;

hf::Executor executor;
hf::Heteroflow G;

auto host_x = G.host([&](){ x.resize(N, 1); });
auto host_y = G.host([&](){ y.resize(N, 2); });
auto pull_x = G.pull(x);
auto pull_y = G.pull(y);
auto kernel = G.kernel(saxpy, N, 2, pull_x, pull_y)
               .block_x(256)
               .grid_x((N+255)/256);
auto push_x = G.push(pull_x, x);
auto push_y = G.push(pull_y, y);

host_x.precede(pull_x);
host_y.precede(pull_y);
kernel.precede(push_x, push_y)
      .succeed(pull_x, pull_y);

auto future = executor.run(G);

Listing 1: Heteroflow code of Figure 1.

Listing 1 shows the Heteroflow code that implements the saxpy task graph in Figure 1. The code explains itself. The program creates a task dependency graph of two host tasks, two pull tasks, one kernel task, and two push tasks. The kernel task binds to a saxpy kernel written in CUDA [2]. The dependency links form constraints that conform to Figure 1. Heteroflow provides an executor interface to perform automatic parallelization of a task graph scalable to manycore CPUs and GPUs. There is no explicit thread management or fine-grained concurrency control in the code. Our design principle is to let users write simple, expressive, and transparent parallel code. Heteroflow explores a minimum set of core routines that is sufficient for users to implement a broad set of heterogeneous computing algorithms. Our task application programming interface (API) is not only flexible on the user front, but also extensible with the future evolution of C++ standards and heterogeneous architectures. We summarize our contributions as follows:
• Programming model. We develop a new parallel CPU-GPU programming model to assist developers with efficient access to heterogeneous computing resources. Our programming model allows users to express a problem with effective separation of concerns and expertise encapsulation. Developers can work at a level of granularity that is suitable for writing scalable applications and commensurate with their domain knowledge.

• Transparency. Heteroflow is transparent. Developers need not deal with standard concurrency mechanisms such as threads and fine-grained concurrency controls that are often tedious and hard to program correctly. Instead, our system runtime abstracts these problems from developers and tackles many of the hardest parallel and heterogeneous computing details, notably resource allocation, CPU-GPU co-scheduling, kernel offloading, etc.

• Expressiveness. We leverage modern C++ to design an expressive API that empowers users with explicit graph construction and refinement to fully exploit task parallelism in their applications. The expressive power also lets developers perform a large amount of work without writing a large amount of code. Our user experiences lead us to believe that, although it requires some effort to learn, most C++ programmers can master our APIs and apply Heteroflow to their jobs in just a few hours.

We have applied Heteroflow to two real applications, timing analysis and cell placement, in large-scale circuit design automation and demonstrated performance scalability across different numbers of CPUs, GPUs, and problem sizes. We believe Heteroflow stands out as a unique tasking library considering the ensemble of software tradeoffs and architecture decisions we have made. With that being said, different programming libraries and frameworks have their pros and cons, and each deserves a particular reason to exist. Heteroflow aims to be a higher-level alternative in the modern C++ domain.

II. MOTIVATION

Heteroflow is motivated by our research projects to develop efficient computer-aided design (CAD) tools for very large scale integration (VLSI) design automation. CAD has been an immensely successful field in assisting designers in implementing VLSI circuits with billions of transistors. It was at the forefront of computing around 1980 and has fostered many prominent problems and algorithms in computer science. Figure 2 demonstrates a conventional VLSI CAD flow with a highlight on physical design. Due to the ever-increasing design complexity, the CAD community is driving the need for hybrid CPU-GPU computing to keep tool performance up with technology scaling [14], [15].

Fig. 2: A typical VLSI design automation flow with a highlight on the physical design stage. Heteroflow is motivated to address the ever-increasing computational need of modern CAD tools.

A. Challenge 1: Vast and Complex Dependencies

Computational problems in CAD are extremely complex and have many challenges that normal software development does not have. The biggest challenge in developing parallel CAD tools is the vast and complex task dependencies. Before an algorithm can be evaluated, a large amount of logical and physical information must arrive first. These quantities are often dependent on each other and are expensive to compute. The resulting task dependency graph, in terms of encapsulated function calls, can be very large. For example, a million-gate design can produce a graph of billions of tasks and dependencies that takes several days to accomplish [13]. However, such difficulty does not prevent CAD tools from parallelization, but highlights the need for new tasking frameworks to implement efficient parallel decomposition strategies, especially with CPU-GPU collaborative computing [15].

B. Challenge 2: Extensive Domain Knowledge

Developing a parallel CAD algorithm requires deep and broad domain knowledge across circuits, modeling, and programming to fully exploit parallelism. The compute pattern is highly irregular and unbalanced, requiring very strategic collaboration between CPU and GPU. Developers often need direct access to native GPU programming libraries such as CUDA and OpenCL to handcraft the kernels with problem-specific knowledge [2], [3]. Existing frameworks that provide high-level abstraction over kernel programming always come with restricted applicability, preventing CAD engineers from using many new powerful features of the native libraries. Our domain experience concludes that, despite nontrivial GPU kernels, what makes concurrent CPU-GPU programming an enormous challenge is the vast and complex surrounding tasks, most notably the resource controls on multi-GPU cards, CPU-GPU co-scheduling, tasking, and synchronization.
C. Need for a New CPU-GPU Programming Solution

Unfortunately, most parallel CPU-GPU programming solutions in CAD tools are hard-coded [14], [15]. Developers act as "heroic programmers" who handcraft every detail of a heterogeneous decomposition algorithm and explicitly decide which part of the application runs on which CPU and GPU. While the performance is acceptable, it is too expensive to maintain the codebase and scale to new hardware architectures. Some recent solutions adopted directive-driven models such as OpenMP GPU and OpenACC, particularly for data-intensive algorithms [6], [7]. However, these approaches cannot handle dynamic workloads, since compilers have limited knowledge to annotate runtime task parallelism and dynamic dependencies. In fact, frameworks at the functional level are more favorable due to the flexibility in runtime controls and on-demand tasking. Nevertheless, most libraries on this front are disadvantageous from an ease-of-programming standpoint [12]. Users often need to sort out many distinct notations and library details before implementing a heterogeneous algorithm [16]. Also, a lack of support for modern C++ largely inhibits programming productivity and performance scalability [13], [17]. After many years of research, we and our industry partners conclude that the biggest hurdle to programming the power of collaborative CPU-GPU computing is the lack of a suitable task programming library. Whichever model is used, understanding the structure of an application is critical. Developers must explicitly consider possible data or task parallelism of their problems and leverage domain-specific knowledge to design effective decomposition strategies for parallelization. At the same time, the library runtime removes the burden of low-level jobs from developers to improve programming productivity and transparent scalability. To this end, our goal is to address these challenges and develop a general-purpose tasking interface for concurrent CPU-GPU programming.

III. HETEROFLOW

In this section, we discuss the programming model and runtime of Heteroflow. We will cover important technical details that support the software architecture of Heteroflow.

"Heteroflow aims to help C++ developers quickly write CPU-GPU parallel programs and implement efficient heterogeneous decomposition strategies using task-based models."
— Heteroflow's Project Mantra

A. Create a Task Dependency Graph

Heteroflow is object-oriented. Users can create multiple task dependency graph objects, each representing a unique parallel decomposition in an application. A task dependency graph is a directed acyclic graph (DAG) with nodes and edges representing tasks and dependency constraints, respectively. Each task belongs to one of four categories: host, pull, push, and kernel.

1) Host Task: A host task is associated with a callable object which can be a function object, binding expression, functor, or a lambda expression. The callable is invoked at runtime by a CPU thread to run on a CPU core. Listing 2 gives an example of creating a host task. In most applications, the callable is described as a C++ lambda to construct a closure inline in the source code. This property allows host tasks to enable efficient lazy evaluation and capture any data, whether it is declared in a local block or flat in a global scope, largely facilitating the ease of programming.

hf::Heteroflow hf;
auto host = hf.host(
  [](){ cout << "task runs on a CPU core"; }
);

Listing 2: Creates a host task.

Each time users create a task, the heteroflow object adds a node to its task graph and returns a task handle to users. A task handle is a lightweight class object that wraps a pointer to a graph node. The purpose of this extra layer is to provide an extensible mechanism for users to modify the task attributes and, most importantly, to prevent users from directly accessing the internal graph storage, which can easily introduce undefined behaviors. Each node has a general-purpose polymorphic function wrapper to store and invoke different callables according to a task type. A task handle can be empty, often used as a placeholder when it is not associated with a graph node. This is particularly useful when a task's content cannot be decided until a certain point during the program execution, while the task storage needs preallocation at programming time. These properties are applicable to all task types.
a general-purpose tasking interface for concurrent CPU-GPU 2) Pull Task: A pull task lets users pull data from the host
programming. to the device. The exact GPU to perform this memory opera-
tion is decided by the scheduler at runtime. Developers should
think separately which part of their applications runs on which
III. H ETEROFLOW space, and decompose them with explicit task construction.
Since most GPU memory operations are expensive compared
to CPU counterparts, Heteroflow splits the execution of a GPU
In this section, we discuss the programming model and workload into three operations, host-do-device (H2D) input
runtime of Heteroflow. We will cover important technical transfers, launch of a kernel, and device-to-host (D2H) output
details that support the software architecture of Heteroflow. transfers, to enable more task overlaps. Pull task adopts this
strategy to help users manage the tedious details in H2D data overhead and the CUDA stream is a sequenced mechanism for
transfers. At the same time, it presents an effective abstraction interleaving GPU operations [2]. A key motivation behind this
of which the scheduler can take advantage to perform various design is to support multi-GPU computing. Both the memory
optimizations such as automatic GPU mapping, streaming, and allocator and stream are specific to a GPU context which is
memory pooling. decided by the scheduler at runtime. Finally, we create a span
v e c t o r <i n t > d a t a 1 ( 1 0 0 ) ; from the stateful tuple and enqueue the data transfer operation
f l o a t * d a t a 2 = new f l o a t [ 1 0 ] ; to the stream (line 6:12).
auto pull1 = hf . p u l l ( data1 ) ; 3) Push Task: A push task lets users push data associated
auto p u ll2 = hf . p u l l ( data2 , 1 0 ) ;
with a pull task from the device to the host. The code snippet in
Listing 3: Creates two pull tasks. Listing 5 creates two push tasks that operate on the pull tasks
in Listing 3. The arguments consist of two part, a source pull
Listing 3 gives an example of creating two pull tasks to task of device data and the rest to construct a std::span
transfer data from the host to the device. The first pull task object for the target. Similar to Listing 3, the first push task
operates on a C++ vector of integer numbers and the second operates on an integer vector and the second push task operates
pull task operates on a raw data block of real numbers. on a raw data block of floating numbers. Push task is stateful.
Heteroflow employs the C++20 span syntax to implement the Any runtime change on the arguments that were used to
pull interface. The arguments forwarded to the pull method construct a pull task will reflect on its execution context. This
must conform to the constructor of std::span. In fact, property allows users to create stateful Heteroflow graphs for
we have investigated many possible data representations and efficient data management between concurrent CPU and GPU
decided to use span because of its lightweight abstraction tasks.
for describing a contiguous sequence of objects. A span can
auto push1 = hf . push ( pull1 , data1 ) ;
easily convert to a C-style raw data view that is acceptable by auto push2 = hf . push ( pull2 , data2 , 1 0 ) ;
most GPU programming libraries [2], [3], [18]. Sticking with
C++ standard also keeps the core of Heteroflow portable and Listing 5: Creates two push tasks from the two pull tasks in
minimizes the rate of change required for our data interface. Listing 3.
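As a standalone illustration of this convention (independent of Heteroflow), the two pull calls in Listing 3 correspond to the two std::span constructions below: one from a contiguous container and one from a raw pointer plus an element count.

#include <span>
#include <vector>

int main() {
  std::vector<int> data1(100);
  float data2[10];
  std::span<int>   view1{data1};      // what hf.pull(data1) forwards to
  std::span<float> view2{data2, 10};  // what hf.pull(data2, 10) forwards to
  return (view1.size() + view2.size() == 110) ? 0 : 1;
}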
 1 template <typename... ArgsT>
 2 auto PullTask::pull(ArgsT&&... args) {
 3   get_node_handle().work = [
 4     t = StatefulTuple(forward<ArgsT>(args)...)
 5   ] (Allocator& a, cudaStream_t s) mutable {
 6     auto h_span = make_span_from_tuple(t);
 7     auto h_data = h_span.data();
 8     auto h_size = h_span.size_bytes();
 9     auto d_data = a.allocate(h_size);
10     cudaMemcpyAsync(
11       d_data, h_data, h_size, H2D, s
12     );
13   };
14   return *this;
15 }

Listing 4: Implementation details of the pull task.

Listing 4 highlights the core implementation of the pull task based on CUDA.¹ To be concise, we omit details such as error checking and auxiliary functions. The pull task forms a closure that captures the arguments in a custom tuple by which we enable stateful task execution (line 4). For instance, in Listing 1, the change made by the host task host_x on the data vector must be visible to the pull task pull_x. The stateful tuple wraps references in objects to keep state transitions consistent between dependent tasks. Maintaining a stateful transition is a backbone of Heteroflow. Developers can carry out fine-grained concurrency through decomposition and enforce dependency constraints to keep the logical relationship between task data. In terms of arguments, the runtime passes a memory allocator and a CUDA stream to the closure (line 5). The allocator is a pooled resource for reducing GPU memory allocation overhead, and the CUDA stream is a sequenced mechanism for interleaving GPU operations [2]. A key motivation behind this design is to support multi-GPU computing. Both the memory allocator and the stream are specific to a GPU context, which is decided by the scheduler at runtime. Finally, we create a span from the stateful tuple and enqueue the data transfer operation to the stream (line 6:12).

¹ While the current implementation is based on CUDA, our task interface can accept other GPU programming libraries [3].
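The stateful-capture idea can be reproduced in isolation with standard C++20 facilities. The sketch below is a simplified stand-in for StatefulTuple and make_span_from_tuple, written only to show that deferring the span construction to execution time makes a preceding resize visible; it is not Heteroflow's code.

#include <cstdio>
#include <span>
#include <tuple>
#include <utility>
#include <vector>

// Capture the pull arguments by reference in a tuple and defer the span
// construction until the task actually runs.
template <typename... ArgsT>
auto make_deferred_span(ArgsT&&... args) {
  return [t = std::forward_as_tuple(std::forward<ArgsT>(args)...)]() mutable {
    return std::apply([](auto&&... xs) { return std::span(xs...); }, t);
  };
}

int main() {
  std::vector<int> x;                  // empty at graph-construction time
  auto deferred = make_deferred_span(x);
  x.resize(4, 1);                      // what a preceding host task would do
  auto s = deferred();                 // the span sees the resized vector
  std::printf("%zu\n", s.size());      // prints 4
}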
3) Push Task: A push task lets users push data associated with a pull task from the device to the host. The code snippet in Listing 5 creates two push tasks that operate on the pull tasks in Listing 3. The arguments consist of two parts, a source pull task of device data and the rest to construct a std::span object for the target. Similar to Listing 3, the first push task operates on an integer vector and the second push task operates on a raw data block of floating-point numbers. The push task is stateful. Any runtime change on the arguments that were used to construct a push task will reflect on its execution context. This property allows users to create stateful Heteroflow graphs for efficient data management between concurrent CPU and GPU tasks.

auto push1 = hf.push(pull1, data1);
auto push2 = hf.push(pull2, data2, 10);

Listing 5: Creates two push tasks from the two pull tasks in Listing 3.

 1 template <typename... ArgsT>
 2 auto PushTask::push(PullTask p, ArgsT&&... args) {
 3   get_node_handle().work = [
 4     src = p,
 5     t = StatefulTuple(forward<ArgsT>(args)...)
 6   ] (cudaStream_t s) mutable {
 7     auto h_span = make_span_from_tuple(t);
 8     auto h_data = h_span.data();
 9     auto h_size = h_span.size_bytes();
10     auto d_data = src.device_data();
11     cudaMemcpyAsync(
12       h_data, d_data, h_size, D2H, s
13     );
14   };
15   return *this;
16 }

Listing 6: Implementation details of the push task.

Listing 6 highlights the core implementation of the push task based on CUDA. The push task captures the argument list in the same way as the pull task to form a stateful closure (line 5). The execution context creates a span from the target and extracts the device data from the source pull task (line 7:10). Finally, we enqueue the data transfer operation to a CUDA stream passed by the scheduler at runtime (line 11:13). This CUDA stream is guaranteed to live in the same GPU context as the source pull task. In short, Heteroflow uses pull tasks and push tasks to perform H2D and D2H data transfers. Users explicitly specify the data to transfer between CPU and GPU, and encode these tasks in a graph to exploit task parallelism. They never worry about the underlying details of resource allocation and GPU placement.

4) Kernel Task: A kernel task offloads computation from the host to the device. Heteroflow empowers users with explicit kernel programming using native CUDA toolkits. We deliberately do not develop another C++ kernel programming framework, which often comes with restricted applicability and performance portability. Instead, users leverage their domain knowledge with the highest degree of freedom to implement their kernel algorithms, while leaving task parallelism to Heteroflow. Listing 7 gives an example of creating two kernel tasks that offload two given CUDA kernel functions to the device using the pull tasks created in Listing 3. The first kernel task operates on kernel1 with data from pull1. The second kernel task operates on kernel2 with data from pull2. Both tasks configure 256 CUDA threads in a block. Kernel functions are not obligated to take any Heteroflow-specific objects. This largely increases the portability and testability of Heteroflow, especially for applications that heavily use third-party kernel functions written by domain experts.

__global__ void kernel1(int* data, int N);
__global__ void kernel2(float* data, int N);
auto k1 = hf.kernel(kernel1, pull1, 100)
            .grid_x(N/256)
            .block_x(256);
auto k2 = hf.kernel(kernel2, pull2, 10)
            .grid(N/256, 1, 1)
            .block(256, 1, 1);

Listing 7: Creates two kernel tasks that operate on the two pull tasks in Listing 3.

 1 template <typename F, typename... ArgsT>
 2 auto KernelTask::kernel(F&& f, ArgsT&&... args) {
 3   gather_sources(args...);
 4   get_node_handle().work = [
 5     k = *this, f = forward<F>(f),
 6     t = StatefulTuple(forward<ArgsT>(args)...)
 7   ] (cudaStream_t s) mutable {
 8     k.apply_kernel(s, f, t);
 9   };
10   return *this;
11 }
12
13 template <typename... T>
14 auto KernelTask::gather_sources(T&&... tasks) {
15   if constexpr (is_pull_task<T>) {
16     (get_node_handle().add_sources(tasks), ...);
17   }
18 }
19
20 template <typename F, typename T>
21 auto KernelTask::apply_kernel(
22   cudaStream_t s, F f, T t
23 ) {
24   const auto N = tuple_size<T>::value;
25   apply_kernel(s, f, t, make_index_sequence<N>{});
26 }
27
28 template <typename F, typename T, size_t... I>
29 auto KernelTask::apply_kernel(
30   cudaStream_t s, F f, T t, index_sequence<I...>
31 ) {
32   auto& h = get_node_handle();
33   f<<<h.grid, h.block, h.shm, s>>>(
34     convert(get<I>(t))...
35   );
36 }

Listing 8: Implementation details of the kernel task.

Listing 8 highlights the core implementation of the kernel task. The kernel method takes a kernel function written in CUDA and the rest of the arguments to invoke the kernel (line 1:2). The arity must match on both sides. A key difference between Heteroflow and existing models is the way we establish data connections: we use pull tasks as the gateway rather than raw pointers. This abstraction largely improves safety and transparency in scaling graph execution to multiple GPUs. From the input argument list, we gather all pull tasks relevant to this kernel (line 3 and line 13:18) and let the scheduler perform automatic device placement. Similar to push and pull tasks, we capture the argument list in a stateful tuple (line 6) and use two auxiliary functions to invoke the kernel from the tuple (line 20:36). All the runtime changes on the arguments will reflect on the execution context of the kernel.

 1 struct PointerCaster {
 2   void* data {nullptr};
 3   template <typename T>
 4   operator T* () {
 5     return (T*)data;
 6   }
 7 };
 8
 9 template <typename T>
10 auto KernelTask::convert(T&& arg) {
11   if constexpr (is_pull_task<T>) {
12     return PointerCaster{arg.data()};
13   }
14   else {
15     return forward<T>(arg);
16   }
17 }

Listing 9: Implementation details of the data connection between a pull task and a kernel task.

Each argument in the kernel function must undergo another conversion (line 34 in Listing 8) before launching the kernel. The purpose of this conversion is to transform the pull task to the type of the corresponding kernel argument, and to possibly conduct sanity checks at both compile time and runtime. Listing 9 highlights the core implementation of this conversion. The function convert evaluates an argument at compile time (line 9:17). If the argument is a pull task, it returns a cast of the internal GPU data pointer to the target argument type (line 11:13). Otherwise, it forwards the argument in return (line 15). The auxiliary structure PointerCaster (line 1:7) is designed to operate on plain old data (POD) pointers in support of conventional GPU kernel programming syntaxes. The same concept applies to custom data types, depending on a compiler's capability.

5) Add a Dependency Link: After tasks are created, the next step is to add dependency links. A dependency link is a directed edge between two tasks to force one task to run before or after another. Heteroflow defines two very intuitive methods, precede and succeed, to let users create task dependencies. The two methods are symmetrical to each other. A preceding link forces a task to run before another and a succeeding link forces a task to run after another. Heteroflow's task interface is uniform. Users can insert dependencies between tasks of different types as long as no cycles are formed.
Fig. 3: A task graph of eight tasks and seven dependency constraints.

__global__ void k1(int* vec1);
__global__ void k2(int* vec1, int* vec2);

vector<int> vec1, vec2;

hf::Heteroflow hf;
auto host1 = hf.host([](){ vec1.resize(100, 0); });
auto host2 = hf.host([](){ vec2.resize(100, 1); });
auto pull1 = hf.pull(vec1);
auto pull2 = hf.pull(vec2);
auto push1 = hf.push(pull1, vec1);
auto push2 = hf.push(pull2, vec2);
auto kernel1 = hf.kernel(k1, pull1);
auto kernel2 = hf.kernel(k2, pull1, pull2);

host1.precede(pull1);
host2.precede(pull2);
pull1.precede(kernel1);
pull2.precede(kernel2);
kernel1.precede(push1, kernel2);
kernel2.precede(push2);

Listing 10: Creates dependency links to describe Figure 3.

Listing 10 gives an example of using the method precede to describe the dependency graph in Figure 3. Users can precede an arbitrary number of tasks in one call. The overall code to create dependency links in Heteroflow is very simple, concise, and self-explanatory. An important takeaway here is that task dependency is explicit in Heteroflow. Our API never creates implicit dependency links even though they are obvious in certain graphs. Such a concern typically arises when creating a kernel task that requires GPU data from other pull tasks. In this scenario, pull tasks must finish before the kernel task, and users are responsible for this dependency in their graphs. Heteroflow delegates the dependency controls to users so they can tailor graphs to their needs. With careful graph construction and refinement, applications can efficiently reuse data without adding redundant task dependencies. For example, kernel2 in Figure 3 can access the GPU data of pull1 as a result of transitive dependency (pull1 precedes kernel1 and kernel1 precedes kernel2). Listing 10 implements this intent.

6) Inspect a Task Dependency Graph: Another powerful feature of Heteroflow on the user front is the visualization of a task dependency graph using the standard DOT format. Users can find readily available tools such as Python Graphviz and viz.js to draw a graph without extra programming effort. Graph visualization largely facilitates testing and debugging of Heteroflow applications. Listing 11 gives an example of dumping a Heteroflow graph to the standard output.

hf.dump(cout);
cout << hf.dump();

Listing 11: Dumps a Heteroflow graph to the standard output.

B. Execute a Task Dependency Graph

An executor is the basic building block for executing a Heteroflow graph. It manages a set of CPU threads and GPU devices and schedules the tasks to execute on them. When a task is ready, the runtime submits the task to an execution context, which can be either a physical CPU core or a GPU device. In Heteroflow, a task is indeed a callable. When users create a task, Heteroflow marshals all required parameters, along with unique placeholders for runtime arguments, to form a closure that can be run by any CPU thread. Execution of a GPU task will be placed under a GPU context. The scheduler manages all such details to ensure consistent results across multiple GPUs. Listing 12 creates an executor of eight CPU threads and four GPUs and uses it to execute a graph once, 100 times, and repeatedly until a stopping criterion is met. Users can adjust these numbers based on hardware capability to easily scale their graphs across different CPU-GPU configurations. All the run methods in the executor class are non-blocking. Issuing a run on a graph returns immediately with a C++ future object. Users can use it to inspect the execution status of the graph or chain up a continuation for asynchronous controls. The executor class also provides a method wait_for_all that blocks until all running graphs associated with the caller executor finish. Heteroflow's executor interface is thread-safe. Touching an executor from multiple threads is valid. Users can take advantage of this property to explore higher-level parallelism without worrying about races in execution.

hf::Executor executor(8, 4);  // 8 CPU threads, 4 GPUs
hf::Heteroflow graph;
auto future1 = executor.run(graph);
auto future2 = executor.run_n(graph, 100);
auto future3 = executor.run_until(graph, [&](){
  return custom_stopping_criteria();
});
executor.wait_for_all();

Listing 12: Creates an executor to run a Heteroflow graph.

C. Scheduling Algorithm

Another major contribution of Heteroflow is the design of a scheduler on top of our heterogeneous tasking interface. The scheduler is an integral part of the executor for mapping task graphs onto available CPU cores and GPUs. When an executor is created with N CPU threads and M GPUs, we spawn N CPU threads, namely workers, to execute tasks. Unlike existing works [8], [19], we do not dedicate a worker to manage a target GPU, since all tasks are uniformly represented in Heteroflow using polymorphic functional objects (see Listings 4, 6, and 8). This largely facilitates the design of our scheduler in providing efficient resource utilization and flexible runtime optimizations, for instance, GPU memory allocators, asynchronous CUDA streams, and task fusing.
Our scheduler design is motivated by [13]. When a graph is submitted to an executor, a special data structure called topology is created to marshal execution parameters and runtime metadata. Each heteroflow object has a list of topologies to track individual execution status. The executor also maintains a topology counter to signal callers on completion. The communication is based on a shared state managed by a pair of C++ promise and future objects. The first step in scheduling is device placement, mapping each GPU task to a particular GPU device. An advantage of our programming model is the implicit data dependencies between a kernel and its pull tasks (see line 3 in Listing 8), which the scheduler can utilize to place them on the right device. Based on this property, we develop a simple and efficient device placement algorithm using union-find and bin packing, as shown in Algorithm 1. The key idea is to group each kernel with its source pull tasks (line 1:7) and then pack each unique group to a GPU bin with an optimized cost (line 8:14). By default, we minimize the load per GPU bin for maximal concurrency, but this strategy can be exposed through a pluggable interface for custom cost metrics.

Algorithm 1: DevicePlacement
 1 foreach t ∈ tasks do
 2   if t.type() == KERNEL then
 3     foreach p ∈ t.source_pull_tasks() do
 4       set_union(t, p);
 5     end
 6   end
 7 end
 8 foreach t ∈ tasks do
 9   if x ← t.type(); x == KERNEL or x == PULL then
10     if r ← set_find(t); is_set_root(r) then
11       set_bin_packing_with_balanced_load(t);
12     end
13   end
14 end
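To make the two phases of Algorithm 1 concrete, the following is a minimal C++ sketch of the same idea, written for this explanation rather than taken from Heteroflow: union-find groups each kernel with its source pull tasks, and each group root is then greedily packed onto the GPU with the smallest current load (simplified here to a group count).

#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
  int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
  void unite(int a, int b) { parent[find(a)] = find(b); }
};

// kernels: for each kernel task id, the ids of its source pull tasks.
// Returns a GPU id per task; tasks in the same kernel/pull group share one GPU.
std::vector<int> place(int num_tasks, int num_gpus,
                       const std::vector<std::pair<int, std::vector<int>>>& kernels) {
  UnionFind uf(num_tasks);
  for (const auto& [k, pulls] : kernels) {
    for (int p : pulls) uf.unite(k, p);            // group a kernel with its pulls
  }
  using Bin = std::pair<int, int>;                 // (current load, gpu id)
  std::priority_queue<Bin, std::vector<Bin>, std::greater<Bin>> bins;
  for (int g = 0; g < num_gpus; ++g) bins.push({0, g});
  std::vector<int> device(num_tasks, -1);
  for (int t = 0; t < num_tasks; ++t) {
    int r = uf.find(t);
    if (device[r] == -1) {                         // first time this group is seen
      auto [load, gpu] = bins.top(); bins.pop();   // least-loaded GPU bin
      device[r] = gpu;
      bins.push({load + 1, gpu});                  // load simplified to group count
    }
    device[t] = device[r];
  }
  return device;
}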
After device placement, the scheduler enters a work-stealing loop where each worker thread iteratively drains tasks from its local queue and transitions to a thief to steal a task from a randomly selected peer called a victim. The process stops when an executor is destroyed. We employ work-stealing because it has been extensively studied and used in many parallel processing systems for dynamic load-balancing and irregular computations [20], [21]. When a worker thread executes a task, it applies a visitor pattern that invokes a separate method for each task type. Running a host task is trivial, but calling a GPU task must be scoped under the right execution context. Heteroflow provides a resource-acquisition-is-initialization (RAII)-style mechanism on top of the CUDA device API to scope the task execution under its assigned GPU device. Listing 13 gives the implementation details of invoking a pull task from an executor. All GPU tasks are synchronized through CUDA events (line 4 and line 6).

1 void Executor::invoke(unsigned me, Pull& h) {
2   auto [d, s, e] = get_device_stream_event(me, h);
3   ScopedDeviceContext ctx(d);
4   cudaEventRecord(e, s);
5   h.work(get_device_allocator(d), s);
6   cudaStreamWaitEvent(s, e, 0);
7 }

Listing 13: Implementation details of invoking a pull task.
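The RAII scoping in line 3 of Listing 13 can be pictured with the guard below. It is an illustrative sketch of the idea (error checking omitted), not Heteroflow's actual ScopedDeviceContext: the guard switches to the assigned GPU on construction and restores the previous device when the task body leaves the scope.

#include <cuda_runtime.h>

class ScopedDevice {
 public:
  explicit ScopedDevice(int device) {
    cudaGetDevice(&previous_);   // remember the caller's current device
    cudaSetDevice(device);       // switch to the task's assigned device
  }
  ~ScopedDevice() {
    cudaSetDevice(previous_);    // restore the previous device on scope exit
  }
  ScopedDevice(const ScopedDevice&) = delete;
  ScopedDevice& operator=(const ScopedDevice&) = delete;
 private:
  int previous_ {0};
};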
While detailing the scheduler design is out of the scope of this paper, there are a few notable items. First, each worker keeps a per-thread CUDA stream to enable concurrent GPU memory and kernel operations. Second, our executor keeps a memory pool for each GPU device to reduce the scheduling overhead of frequent allocations by pull tasks. We implement the famous buddy allocator algorithm [22]. Third, our work-stealing loop adopts an adaptive strategy to balance working and sleeping threads on top of available task parallelism. The key idea is to ensure that one thief exists as long as an active worker is running a task. At the time of this writing, our scheduler design might not be perfect, but it provides a proof of concept for our programming model and fosters future research opportunities for new algorithms.

IV. EXPERIMENTAL RESULTS

We evaluated the performance of Heteroflow on two real VLSI CAD applications, timing analysis and standard cell placement. Each application represents a unique computation pattern. All experiments ran on an Ubuntu Linux 5.0.0-21-generic x86 64-bit machine with 40 Intel Xeon Gold 6138 CPU cores at 2.00 GHz, 4 GeForce RTX 2080 GPUs, and 256 GB RAM. The timing analysis program is compiled with g++ 8.2 and nvcc (CUDA 10.1) under the C++14 standard (-std=c++14) and optimization flag -O2. The placement program is compiled under the same environment. Both programs are derived from our open-source projects, OpenTimer [23], [24], [25] and DREAMPlace [26], which consist of complex domain-specific algorithms with more than 10K lines of code over years of development.

A. VLSI Timing Analysis

We applied Heteroflow to solve a VLSI timing analysis problem. Timing analysis is a very important component in the overall design flow (see Figure 2). It verifies the expected timing behaviors of a digital circuit to ensure correct functionalities after tape-out. Among various timing analysis problems, one subject is to find the correlation between different timing views. Each view represents a unique combination of a process variation corner (e.g., temperature, voltage) and an analysis mode (e.g., testing, functional). Figure 4 shows that the number of required analysis views increases exponentially as the technology node advances [23], [24]. Timing correlation is not only important for reasoning about the behavior of a timer but also useful for building regression models to reduce the required analysis iterations.

In reality, there are many ways to conduct timing analysis and correlation. In this experiment, we consider a representative three-step flow: a timer generates analysis datasets from a circuit design across multiple views; a hybrid CPU-GPU algorithm extracts timing statistics and generates regression models for each dataset; and a synchronization step combines all assessed quantities into a concrete report. Figure 5 illustrates a fractional task graph of two views. We use the open-source tool, OpenTimer, to generate 1024 different timing reports for a large circuit, netcard, of 1.5M gates [23], [24]. The correlation layer implements a CPU-based algorithm to extract graph information (critical paths [27], [28], CPPR [29], [30], [31]) and a GPU-based algorithm to perform logistic regression with gradient descent. Part of the CPU and GPU tasks are dependent on each other. For demonstration purposes, we pre-generate the analysis data and control the sample size such that each analysis view takes approximately the same runtime.

Fig. 5: A partial task graph of VLSI timing analysis for finding correlation between two views. Each view implements a hybrid CPU-GPU correlation algorithm.
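To make the mapping from this flow onto Heteroflow concrete, the sketch below builds the per-view portion of a task graph like the one in Figure 5, using only the API introduced in Section III (Heteroflow's own header is omitted since its name is not given in this paper). The View structure, the kernel regression_kernel, and the helpers collect_view and combine_view are hypothetical placeholders for the application code, not part of Heteroflow or OpenTimer.

#include <vector>

// Hypothetical per-view data (placeholder names).
struct View {
  std::vector<float> x, y, c;  // samples and regression coefficients
  int n = 0;
};

void collect_view(View&);   // assumed CPU step: prepare the per-view dataset
void combine_view(View&);   // assumed CPU step: merge statistics into the report

__global__ void regression_kernel(float* x, float* y, float* c, int n);  // assumed kernel

// Builds one view's subgraph: host -> pulls -> kernel -> push -> host.
void build_view(hf::Heteroflow& G, View& v) {
  auto collect = G.host([&](){ collect_view(v); });
  auto pull_x  = G.pull(v.x);
  auto pull_y  = G.pull(v.y);
  auto pull_c  = G.pull(v.c);
  auto kernel  = G.kernel(regression_kernel, pull_x, pull_y, pull_c, v.n)
                  .block_x(256)
                  .grid_x((v.n + 255) / 256);
  auto push_c  = G.push(pull_c, v.c);
  auto combine = G.host([&](){ combine_view(v); });
  collect.precede(pull_x, pull_y, pull_c);
  kernel.succeed(pull_x, pull_y, pull_c).precede(push_c);
  push_c.precede(combine);
}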
Fig. 6: Runtimes at different CPU-GPU numbers and problem sizes for analyzing the circuit netcard (1.5M gates and 1.5M nets).

Figure 6 shows the overall runtime performance at different CPU-GPU numbers and problem sizes. In general, we observe a decent scaling when increasing the number of cores and GPUs. The task graph requires 99 minutes to finish at the lowest hardware concurrency of 1 core and 1 GPU. Using all 40 cores and 4 GPUs speeds up the runtime by 7.7×, finishing in 13 minutes. On the slice of 4 GPUs, the runtimes are 51, 23, 18, 15, 14, and 13 minutes for 1, 8, 16, 24, 32, and 40 cores, respectively. The GPU counterparts at 40 cores are 36, 21, 15, and 13 minutes for 1, 2, 3, and 4 GPUs, respectively. The lower side of Figure 6 shows the runtime versus the problem size in terms of six different view counts: 32, 64, 128, 256, 512, and 1024. At any point, increasing the number of CPUs or GPUs reduces the runtime. For this particular workload, the speed-up from multiple GPUs is more remarkable than that from CPUs.

B. VLSI Placement

We applied Heteroflow to solve a VLSI placement problem, a fundamental step in the physical design stage (see Figure 2). The goal is to determine the physical locations of cells (logic gates) in a fixed layout region with minimal interconnect wirelength. Modern placement typically incorporates hundreds of millions of cells and takes several hours to finish. To reduce the long runtime, recent work has started investigating new algorithms using the power of heterogeneous computing [26]. Among various placement techniques, detailed placement is an important step to refine a legalized placement solution for minimal wirelength. Mainstream detailed placement algorithms are combinatorial and iterative. A widely-used matching-based algorithm is shown in Figure 7. The key idea is to extract a maximal independent set (marked in cyan) from a cell set and model the wirelength minimization problem on these non-overlapped cells as a weighted bipartite matching graph. The entire process is very time-consuming, especially for large designs with millions of cells. A practical implementation iterates the following three steps: a parallel maximal independent set finding step using Blelloch's algorithm [32]; a sequential partitioning step to cluster adjacent cells; and a parallel bipartite matching step to find the best permutation of cell locations. Figure 7(c) illustrates the process.

In the experiment, we implemented a hybrid CPU-GPU detailed placement algorithm introduced by DREAMPlace [26]. Among these three steps, finding the maximal independent set takes the most runtime. DREAMPlace developed a new acceleration algorithm that offloads this step to the GPU, and showed a 40× speed-up over a CPU baseline using 20 cores [26]. The other two steps have graph-oriented computation patterns and are implemented on CPUs. Figure 8 shows a partial task graph for the algorithm in two iterations. The algorithm normally converges in 10-50 iterations. To enable task overlaps between iterations, we flatten the task graph for a given iteration number. The task graph in Figure 8 highlights the complexity of the algorithm and its dependent CPU-GPU tasks.
Figure 9 shows the runtime performance at different CPU-GPU numbers and iterations for placing a large circuit, bigblue4, of 2.2M cells and 2.2M nets. It is observed that increasing the number of CPU cores reduces the runtime. For instance, under 1 GPU it takes 58.41s and 14.02s using 1 core and 40 cores, respectively. Maximum concurrency saturates at 14.02s and 13.61s for 1 GPU and 4 GPUs, respectively. In fact, this property is generally true for most optimization algorithms in VLSI CAD, as they are often irregular and dependent [15]. In terms of different problem sizes, which are measured by the iteration count used to construct the task graph, increasing the number of CPU cores can reduce the runtime in most scenarios. For example, the task graph of 5 iterations under 4 GPUs finishes in 6.35s and 1.44s using 1 core and 40 cores, respectively. Due to the nature of the algorithm, such a trend is not observed on the GPU side.

Fig. 9: Runtimes at different CPU-GPU numbers and problem sizes for placing the circuit bigblue4 (2.2M cells and 2.2M nets).

V. RELATED WORK

Heterogeneous programming models have been extensively developed in scientific communities and have enabled vast success in various problem domains [12]. CUDA, OpenCL, OpenGL, C++ AMP, and Brook+ are popular GPU programming frameworks that provide a rich set of low-level APIs for explicit GPU management [2], [3], [18], [33]. These libraries are designed particularly for power users to implement various optimization strategies specific to a GPU architecture. Directive-based models such as hiCUDA, Ompss, OpenMPC, and OpenACC provide high-level abstraction on GPU programming by augmenting program information, for instance, guidance on loop mapping onto GPU and data sharing rules, to designated compilers [4], [5], [6], [7]. These models are good at loop-based parallelism but cannot handle irregular task parallelism well [34]. Functional-level approaches such as StarPU, SYCL, HPX, PaRSEC, QUARK, XKAAPI++, Unicorn, and Taskflow are capable of concurrent CPU-GPU tasking [8], [9], [10], [11], [19], [35], [16], [17], [36], [37]. The offered graph description languages can be complex or expressive, depending on the targeted applications. Other data-structure-driven libraries such as Thrust, VexCL, and Boost.Compute provide C++ STL-style interfaces to program batch CPU-GPU workloads [38], [39], [40]. For concurrent CPU-GPU tasking, users are responsible for scheduling and concurrency controls that are known to be difficult to program correctly.
CPU-GPU co-scheduling is a pivotal component of all heterogeneous programming systems. The parallel computing community has a number of algorithms, including static mapping [41], dynamic work-stealing [20], [21], asymptotic profiling [42], and other system-defined strategies [5], [8], [10], [16]. Vendor-specific features such as CUDA Graph [2], [43] and SYCL [9] offer asynchronous graph scheduling for task parallelism, but their implementation details are unknown. On the other hand, automatic GPU placement has been studied in the machine learning community [44], [45]. The goal is to place operations in a deep neural network onto GPU devices in an optimal way, such that the training process can complete within the shortest amount of time. However, these algorithms are problem-specific and require a unified tensor data structure for performance modeling.

VI. CONCLUSION

In this paper, we have introduced Heteroflow, a new modern C++ tasking library to help developers quickly write CPU-GPU parallel programs and implement efficient heterogeneous decomposition algorithms. We have evaluated Heteroflow on two real design automation problems and shown performance scalability across different CPU-GPU numbers and problem sizes. In a particular VLSI timing analysis example, Heteroflow can reduce a baseline runtime from 99 minutes to 13 minutes (a 7.7× speed-up) on a machine of 40 CPU cores and 4 GPUs. Future work will focus on distributing our scheduler based on [46] and incorporating a broader range of workloads, including machine learning [47], [48] and engineering simulation [49], [50], [51].

REFERENCES

[1] J. S. Vetter et al., "Productive Computational Science in the Era of Extreme Heterogeneity," in DOE ASCR Report, 2018.
[2] "CUDA." [Online]. Available: https://developer.nvidia.com/cuda-zone
[3] "OpenCL." [Online]. Available: https://www.khronos.org/opencl/
[4] T. D. Han et al., "hiCUDA: High-level GPGPU Programming," IEEE TPDS, vol. 22, pp. 78-90, 2011.
[5] A. Duran et al., "Ompss: a proposal for programming heterogeneous multi-core architectures," Parallel Processing Letters, vol. 21, pp. 173-193, 2011.
[6] S. Lee et al., "OpenMPC: Extended OpenMP programming and tuning for GPUs," in IEEE/ACM SC, 2010.
[7] "OpenACC." [Online]. Available: http://www.openacc-standard.org
[8] E. Agullo et al., "Harnessing clusters of hybrid nodes with a sequential task-based programming model," in PMAA, 2014.
[9] "SYCL." [Online]. Available: https://www.khronos.org/sycl/
[10] H. Kaiser et al., "HPX: A task based programming model in a global address space," in PGAS, 2014, pp. 6:1-6:11.
[11] G. Bosilca et al., "PaRSEC: A programming paradigm exploiting heterogeneity for enhancing scalability," 2013.
[12] S. Mittal et al., "A Survey of CPU-GPU Heterogeneous Computing Techniques," ACM Comput. Surv., vol. 47, no. 4, pp. 69:1-69:35, 2015.
[13] T.-W. Huang et al., "Cpp-Taskflow: Fast Task-based Parallel Programming using Modern C++," in IEEE IPDPS, 2019, pp. 974-983.
[14] L. Stok, "Developing Parallel EDA Tools," IEEE Design & Test, vol. 30, no. 1, pp. 65-66, 2013.
[15] Y.-S. Lu et al., "Can Parallel Programming Revolutionize EDA Tools?" Advanced Logic Synthesis, 2018.
[16] T. Beri et al., "The Unicorn Runtime: Efficient Distributed Shared Memory Programming for Hybrid CPU-GPU Clusters," IEEE TPDS, vol. 28, no. 5, pp. 1518-1534, 2017.
[17] T.-W. Huang et al., "Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System," IEEE TPDS, vol. 33, no. 6, pp. 1303-1320, 2022.
[18] "OpenGL." [Online]. Available: https://opengl.org/
[19] "XKAAPI++." [Online]. Available: http://kaapi.gforge.inria.fr/
[20] J. V. Lima et al., "Design and analysis of scheduling strategies for multi-CPU and multi-GPU architectures," Parallel Computing, vol. 44, pp. 37-52, 2015.
[21] C.-X. Lin et al., "An Efficient Work-Stealing Scheduler for Task Dependency Graph," in IEEE ICPADS, 2020, pp. 64-71.
[22] K. C. Knowlton, "A Fast Storage Allocator," Commun. ACM, vol. 8, no. 10, pp. 623-624, 1965.
[23] T.-W. Huang et al., "OpenTimer: A high-performance timing analysis tool," in IEEE/ACM ICCAD, 2015, pp. 895-902.
[24] ——, "OpenTimer 2.0: A New Parallel Incremental Timing Analysis Engine," IEEE TCAD, vol. 40, no. 4, pp. 776-789, 2021.
[25] ——, "OpenTimer v2: A Parallel Incremental Timing Analysis Engine," IEEE DAT, vol. 38, no. 2, pp. 62-68, 2021.
[26] Y. Lin et al., "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement," in IEEE/ACM DAC, 2019, pp. 117:1-117:6.
[27] Y. Zamani et al., "A High-Performance Heterogeneous Critical Path Analysis Framework," in IEEE HPEC, 2021, pp. 1-7.
[28] K. Zhou et al., "Efficient Critical Paths Search Algorithm using Mergeable Heap," in IEEE/ACM ASPDAC, 2022, pp. 190-195.
[29] T.-W. Huang et al., "Fast path-based timing analysis for CPPR," in IEEE/ACM ICCAD, 2014, pp. 596-599.
[30] ——, "UI-Timer 1.0: An ultrafast path-based timing analysis algorithm for CPPR," IEEE TCAD, vol. 35, no. 11, pp. 1862-1875, Nov 2016.
[31] Z. Guo et al., "HeteroCPPR: Accelerating Common Path Pessimism Removal with Heterogeneous CPU-GPU Parallelism," in IEEE/ACM ICCAD, 2021, pp. 1-9.
[32] G. E. Blelloch et al., "Greedy sequential maximal independent set and matching are parallel on average," in ACM SPAA, 2012, pp. 308-317.
[33] I. Buck et al., "Brook for GPUs: Stream Computing on Graphics Hardware," in ACM SIGGRAPH, 2004.
[34] S. Lee et al., "Early Evaluation of Directive-Based GPU Programming Models for Productive Exascale Computing," in IEEE/ACM SC, 2012.
[35] A. Yarkhan, "Dynamic task execution on shared and distributed memory architectures," PhD thesis, 2012.
[36] C.-X. Lin et al., "An Efficient and Composable Parallel Task Programming Library," in IEEE HPEC, 2019, pp. 1-7.
[37] ——, "A Modern C++ Parallel Task Programming Library," in ACM MM, 2019, pp. 2284-2287.
[38] "Thrust." [Online]. Available: https://github.com/thrust/thrust
[39] "VexCL." [Online]. Available: https://zenodo.org/record/571466
[40] "Boost.Compute." [Online]. Available: https://github.com/boostorg/compute
[41] I. Buck et al., "Heterogeneous task scheduling for Accelerated OpenMP," in IEEE IPDPS, 2012.
[42] Z. Wang et al., "CPU+GPU scheduling with asymptotic profiling," Parallel Computing, vol. 40, no. 2, pp. 107-115, 2014.
[43] D.-L. Lin et al., "Efficient GPU Computation using Task Graph Parallelism," in Euro-Par, Springer, 2021, pp. 435-450.
[44] A. Mirhoseini et al., "A Hierarchical Model for Device Placement," in ICLR, 2018.
[45] Y. Gao et al., "Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization," in NIPS, 2018.
[46] T.-W. Huang et al., "DtCraft: A High-performance Distributed Execution Engine at Scale," IEEE TCAD, 2018.
[47] D.-L. Lin et al., "A Novel Inference Algorithm for Large Sparse Neural Network using Task Graph Parallelism," in IEEE HPEC, 2020.
[48] D.-L. Lin and T.-W. Huang, "Accelerating Large Sparse Neural Network Inference using GPU Task Graph Parallelism," IEEE TPDS, 2022.
[49] Z. Guo et al., "GPU-accelerated Static Timing Analysis," in IEEE/ACM ICCAD, 2020, pp. 1-8.
[50] G. Guo et al., "GPU-accelerated Path-based Timing Analysis," in IEEE/ACM DAC, 2021.
[51] ——, "GPU-accelerated Critical Path Generation with Path Constraints," in IEEE/ACM ICCAD, 2021, pp. 1-9.