parallel: Device-Level Parallel Algorithms#
The cuda.cccl.parallel library provides device-level algorithms that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels, portable across different GPU architectures.
Algorithms#
The core functionality provided by the parallel library is algorithms such as reductions, scans, sorts, and transforms.
Here’s a simple example showing how to use the reduce_into algorithm to sum an array of integers.
"""
Sum all values in an array using reduction with PLUS operation.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Perform the reduction.
parallel.reduce_into(d_input, d_output, parallel.OpKind.PLUS, len(d_input), h_init)
# Verify the result.
expected_output = 15
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Sum reduction result: {result}")
Many algorithms, including reduction, require a temporary memory buffer. The library will allocate this buffer for you, but you can also use the object-based API for greater control.
"""
Reduction example using the object API.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Prepare the input and output arrays.
dtype = np.int32
init_value = 5
h_init = np.array([init_value], dtype=dtype)
h_input = np.array([1, 2, 3, 4], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty(1, dtype=dtype)
# Create a reducer object.
reducer = parallel.make_reduce_into(d_input, d_output, parallel.OpKind.PLUS, h_init)
# Get the temporary storage size.
temp_storage_size = reducer(None, d_input, d_output, len(h_input), h_init)
# Allocate temporary storage using any user-defined allocator.
# The result must be an object exposing `__cuda_array_interface__`.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
# Perform the reduction.
reducer(d_temp_storage, d_input, d_output, len(h_input), h_init)
expected_result = np.sum(h_input) + init_value
actual_result = d_output.get()[0]
assert actual_result == expected_result
print("Reduce object example completed successfully")
Iterators#
Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.
Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.
"""
Example showing how to use counting_iterator.
"""
import functools
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Prepare the input and output arrays.
first_item = 10
num_items = 3
# Create the counting iterator.
first_it = parallel.CountingIterator(np.int32(first_item))
# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)
# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)
# Perform the reduction.
parallel.reduce_into(first_it, d_output, parallel.OpKind.PLUS, num_items, h_init)
# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator that negates the even values of a sequence before summing it.
"""
Demonstrate reduction with transform iterator.
"""
import functools
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Negate even values; pass odd values through unchanged.
def transform_op(a):
    return -a if a % 2 == 0 else a
# Prepare the input and output arrays.
first_item = 10
num_items = 100
transform_it = parallel.TransformIterator(
    parallel.CountingIterator(np.int32(first_item)), transform_op
)  # Input sequence
h_init = np.array([0], dtype=np.int64) # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int64) # Storage for output
# Perform the reduction.
parallel.reduce_into(transform_it, d_output, parallel.OpKind.PLUS, num_items, h_init)
# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b,
    [-a if a % 2 == 0 else a for a in range(first_item, first_item + num_items)],
)
assert (d_output == expected_output).all()
print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
Iterators that wrap an array (or another output iterator) may be used as both input and output iterators. Here’s an example showing how to use a TransformIterator to transform the output of a reduction before writing to the underlying array.
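The following is a minimal sketch of that pattern; it assumes a TransformIterator wrapping a device array is accepted wherever an output array is, and the scale_op function is our own illustration.
"""
Double the result of a sum reduction before it is written to the output array.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Applied to the reduction result before it is stored.
def scale_op(a):
    return 2 * a
# Prepare the input and output arrays.
d_input = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)
# Wrap the output array so each written value passes through scale_op.
out_it = parallel.TransformIterator(d_output, scale_op)
# Perform the reduction, writing through the transforming iterator.
parallel.reduce_into(d_input, out_it, parallel.OpKind.PLUS, len(d_input), h_init)
# Verify the result: the sum (15), doubled on output.
assert d_output.get()[0] == 30
print(f"Transformed output result: {d_output.get()[0]}")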
Custom Types#
The parallel library supports defining custom data types using the gpu_struct decorator.
Here are some examples showing how to define and use custom types:
"""
Finding the maximum green value in a sequence of pixels using `reduce_into`
with a custom data type.
"""
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel
# Define a custom data type to store the pixel values.
@parallel.gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32

# Define a reduction operation that returns the pixel with the maximum green value.
def max_g_value(x, y):
    return x if x.g > y.g else y
# Prepare the input and output arrays.
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)
# Prepare the initial value for the reduction.
h_init = Pixel(0, 0, 0)
# Perform the reduction.
parallel.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)
# Calculate the expected result.
h_rgb = d_rgb.get()
expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
# Verify the result.
assert expected["g"] == d_out.get()["g"]
result = d_out.get()
print(f"Pixel reduction result: {result}")
Example Collections#
For complete runnable examples and more advanced usage patterns, see our full collection of examples.