parallel: Device-Level Parallel Algorithms#
The cuda.cccl.parallel library provides device-level algorithms that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels, portable across different GPU architectures.
Algorithms#
The core functionality provided by the parallel library is algorithms such as reductions, scans, sorts, and transforms.
Here’s a simple example showing how to use the reduce_into algorithm to sum an array of integers.
"""
Sum all values in an array using reduction with PLUS operation.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Perform the reduction.
parallel.reduce_into(d_input, d_output, parallel.OpKind.PLUS, len(d_input), h_init)
# Verify the result.
expected_output = 15
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Sum reduction result: {result}")
Many algorithms, including reduction, require a temporary memory buffer. The library will allocate this buffer for you, but you can also use the object-based API for greater control.
"""
Reduction example using the object API.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Prepare the input and output arrays.
dtype = np.int32
init_value = 5
h_init = np.array([init_value], dtype=dtype)
h_input = np.array([1, 2, 3, 4], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty(1, dtype=dtype)
# Create a reducer object.
reducer = parallel.make_reduce_into(d_input, d_output, parallel.OpKind.PLUS, h_init)
# Get the temporary storage size.
temp_storage_size = reducer(None, d_input, d_output, len(h_input), h_init)
# Allocate temporary storage using any user-defined allocator.
# The result must be an object exposing `__cuda_array_interface__`.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
# Perform the reduction.
reducer(d_temp_storage, d_input, d_output, len(h_input), h_init)
expected_result = np.sum(h_input) + init_value
actual_result = d_output.get()[0]
assert actual_result == expected_result
print("Reduce object example completed successfully")
Iterators#
Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.
Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.
"""
Example showing how to use counting_iterator.
"""
import functools
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Prepare the input and output arrays.
first_item = 10
num_items = 3
# Create the counting iterator.
first_it = parallel.CountingIterator(np.int32(first_item))
# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)
# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)
# Perform the reduction.
parallel.reduce_into(first_it, d_output, parallel.OpKind.PLUS, num_items, h_init)
# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator that negates the even values of a sequence before summing it.
"""
Demonstrate reduction with transform iterator.
"""
import functools
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Negate even values; pass odd values through unchanged.
def transform_op(a):
    return -a if a % 2 == 0 else a
# Prepare the input and output arrays.
first_item = 10
num_items = 100
transform_it = parallel.TransformIterator(
    parallel.CountingIterator(np.int32(first_item)), transform_op
)  # Input sequence
h_init = np.array([0], dtype=np.int64) # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int64) # Storage for output
# Perform the reduction.
parallel.reduce_into(transform_it, d_output, parallel.OpKind.PLUS, num_items, h_init)
# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b,
    [-a if a % 2 == 0 else a for a in range(first_item, first_item + num_items)],
)
assert (d_output == expected_output).all()
print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
Iterators that wrap an array (or another output iterator) may be used as both input and output iterators. Here’s an example showing how to use a TransformIterator to transform the output of a reduction before writing to the underlying array.
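The following is a minimal sketch of that pattern; it assumes a TransformIterator wrapping a device array is accepted wherever an output array is, and the scale_op function is our own illustration.
"""
Double the result of a sum reduction before it is written to the output array.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Applied to the reduction result before it is stored.
def scale_op(a):
    return 2 * a
# Prepare the input and output arrays.
d_input = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)
# Wrap the output array so each written value passes through scale_op.
out_it = parallel.TransformIterator(d_output, scale_op)
# Perform the reduction, writing through the transforming iterator.
parallel.reduce_into(d_input, out_it, parallel.OpKind.PLUS, len(d_input), h_init)
# Verify the result: the sum (15), doubled on output.
assert d_output.get()[0] == 30
print(f"Transformed output result: {d_output.get()[0]}")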
Custom Types#
The parallel library supports defining custom data types using the gpu_struct decorator.
Here are some examples showing how to define and use custom types:
"""
Finding the maximum green value in a sequence of pixels using `reduce_into`
with a custom data type.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Define a custom data type to store the pixel values.
@parallel.gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32

# Define a reduction operation that returns the pixel with the maximum green value.
def max_g_value(x, y):
    return x if x.g > y.g else y
# Prepare the input and output arrays.
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)
# Prepare the initial value for the reduction.
h_init = Pixel(0, 0, 0)
# Perform the reduction.
parallel.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)
# Calculate the expected result.
h_rgb = d_rgb.get()
expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
# Verify the result.
assert expected["g"] == d_out.get()["g"]
result = d_out.get()
print(f"Pixel reduction result: {result}")
Example Collections#
For complete runnable examples and more advanced usage patterns, see our full collection of examples.