A high-performance Julia package for solving block-structured linear systems with a focus on block tridiagonal matrices.
BlockStructuredSolvers.jl provides specialized solvers for block-structured linear systems, with a particular focus on block tridiagonal matrices, and offers multiple backends for different hardware architectures:
- CPU: Sequential solver for standard CPUs
- CUDA: High-performance GPU solvers using NVIDIA CUDA
- ROCm: GPU solvers for AMD GPUs using ROCm
Each backend supports both single (Float32) and double (Float64) precision computations, allowing you to balance speed and accuracy based on your application needs.
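For example, assuming the positional initializer interface used in the Quick Start below (with `N` blocks of size `n`), switching precision only changes the element type argument:

```julia
# Hypothetical precision selection via the element type argument
data_f32 = initialize_cpu(N, n, Float32)  # single precision: faster, less accurate
data_f64 = initialize_cpu(N, n, Float64)  # double precision: slower, more accurate
```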
Key features:

- Multiple Backends: CPU, CUDA (NVIDIA), and ROCm (AMD) support
- Precision Options: Both Float32 and Float64 support
- Algorithm Variants:
  - Sequential solver: Traditional block tridiagonal algorithm
  - Batched solver: Optimized for parallel execution on GPUs
- High Performance: Optimized implementations for each backend
- Simple Interface: Consistent API across all backends
Install via the Julia package manager:

```julia
using Pkg
Pkg.add("BlockStructuredSolvers")
```

Here's a simple example of how to use BlockStructuredSolvers to solve a block tridiagonal system:
```julia
using BlockStructuredSolvers
using LinearAlgebra
using Random

# Problem size
N = 100  # Number of blocks
n = 32   # Size of each block

# Choose precision
T = Float64

# CPU solver
function solve_on_cpu()
    # Initialize the solver
    data = initialize_cpu(N, n, T)

    # Create and set up matrices
    A_tensor = zeros(T, n, n, N)
    B_tensor = zeros(T, n, n, N - 1)

    # Fill with your data (example: random positive definite diagonal blocks)
    for i in 1:N
        temp = randn(T, n, n)
        A_tensor[:, :, i] = temp * temp' + n * I
    end
    for i in 1:N-1
        B_tensor[:, :, i] = randn(T, n, n)
    end

    # Copy data to the solver
    for i in 1:N
        copyto!(data.A_list[i], A_tensor[:, :, i])
    end
    for i in 1:N-1
        copyto!(data.B_list[i], B_tensor[:, :, i])
    end

    # Create the right-hand side
    d_list = Vector{Matrix{T}}(undef, N)
    for i in 1:N
        d_list[i] = rand(T, n, 1)
    end

    # Factorize
    factorize!(data)

    # Solve (the solution overwrites d_list)
    solve!(data, d_list)

    return d_list
end
```
The same system can be solved on an NVIDIA GPU with the batched backend. Note that `using CUDA` must appear at the top level, not inside a function:

```julia
# GPU solver using CUDA
using CUDA

function solve_on_gpu()
    # Initialize the batched solver
    data = initialize_batched(N, n, T, CuArray)

    # Build the blocks on the host (example: random positive definite diagonal blocks)
    A_tensor = zeros(T, n, n, N)
    B_tensor = zeros(T, n, n, N - 1)
    for i in 1:N
        temp = randn(T, n, n)
        A_tensor[:, :, i] = temp * temp' + n * I
    end
    for i in 1:N-1
        B_tensor[:, :, i] = randn(T, n, n)
    end

    # Copy data to the solver (host-to-device transfer)
    copyto!(data.A_tensor, A_tensor)
    copyto!(data.B_tensor, B_tensor)

    # Create the right-hand side and copy it to the solver
    d_tensor = rand(T, n, 1, N)
    copyto!(data.d_tensor, d_tensor)

    # Factorize
    CUDA.@sync factorize!(data)

    # Solve
    CUDA.@sync solve!(data)

    # Solution is now in data.d_list
    return data.d_list
end
```
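To sanity-check a solution from the CPU example, one can assemble the dense block tridiagonal matrix and compare against a direct dense solve. This is a minimal sketch, not part of the package, and it assumes the `B` blocks sit on the subdiagonal with their transposes on the superdiagonal:

```julia
using LinearAlgebra

# Assemble the dense block tridiagonal matrix from the Quick Start tensors
# (assumption: B blocks on the subdiagonal, their transposes on the superdiagonal).
function assemble_dense(A_tensor, B_tensor)
    n, _, N = size(A_tensor)
    M = zeros(eltype(A_tensor), n * N, n * N)
    for i in 1:N
        rows = (i - 1) * n + 1 : i * n
        M[rows, rows] = A_tensor[:, :, i]
        if i < N
            next = i * n + 1 : (i + 1) * n
            M[next, rows] = B_tensor[:, :, i]
            M[rows, next] = B_tensor[:, :, i]'
        end
    end
    return M
end

# Keep a copy of the right-hand side before solve! overwrites it; then
#   M = assemble_dense(A_tensor, B_tensor)
#   norm(M * vcat(x_list...) - vcat(d_copy...))
# should be near machine precision.
```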
The basic workflow is the same on every backend. On the CPU:

```julia
# Initialize a CPU solver
data = initialize_cpu(N, n, Float64)

# Factorize the matrix
factorize!(data)

# Solve the system (the solution overwrites d_list)
solve!(data, d_list)
```
With CUDA on NVIDIA GPUs:

```julia
using CUDA

# Initialize a batched CUDA solver
data = initialize_batched(N, n, Float64, CuArray)

# Alternatively, initialize a sequential CUDA solver
data = initialize_seq(N, n, Float64, CuArray)

# Factorize the matrix
CUDA.@sync factorize!(data)

# Solve the system
CUDA.@sync solve!(data)
```
With ROCm on AMD GPUs:

```julia
using AMDGPU

# Initialize a batched ROCm solver
data = initialize_batched(N, n, Float64, ROCArray)

# Alternatively, initialize a sequential ROCm solver
data = initialize_seq(N, n, Float64, ROCArray)

# Factorize the matrix
AMDGPU.@sync factorize!(data)

# Solve the system
AMDGPU.@sync solve!(data)
```

BlockStructuredSolvers.jl is designed for high performance across different hardware architectures. Here are some key performance considerations:
- GPU vs CPU: For large systems, GPU implementations can offer significant speedups over CPU implementations.
- Batched vs Sequential: The batched solver is generally faster on GPUs for large problems; a timing sketch is given after this list.
- Float32 vs Float64: Single precision (Float32) is roughly twice as fast as double precision (Float64), at the cost of reduced numerical accuracy.
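As a starting point for measuring these trade-offs on your own hardware, here is a minimal timing sketch for the two CUDA variants. It assumes the positional initializers shown above and that the solver fields have been filled as in the Quick Start example:

```julia
using CUDA
using BlockStructuredSolvers

N, n, T = 1000, 32, Float64

for init in (initialize_seq, initialize_batched)
    data = init(N, n, T, CuArray)
    # ... fill the solver's blocks and right-hand side as in the Quick Start ...
    t_fact  = CUDA.@elapsed factorize!(data)
    t_solve = CUDA.@elapsed solve!(data)
    println("$(init): factorize $(t_fact) s, solve $(t_solve) s")
end
```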
The package implements a block Cholesky factorization for block tridiagonal matrices. For a symmetric positive definite block tridiagonal matrix with diagonal blocks A_i and off-diagonal blocks B_i, the algorithm:

- Factorizes the first diagonal block with a Cholesky decomposition
- For each subsequent block, computes the off-diagonal factor with a triangular solve and applies the resulting Schur-complement update to the next diagonal block before factorizing it
- Solves the system with block forward and backward substitution using the computed factors
For the batched implementation, the algorithm is reorganized to maximize parallel execution on GPUs.
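For reference, here is a minimal dense-array sketch of the sequential variant of these three steps, written with plain `LinearAlgebra` calls rather than the package's internal data structures (the block layout and names are illustrative only):

```julia
using LinearAlgebra

# Block Cholesky of a block tridiagonal SPD matrix.
# A_list[i]: i-th diagonal block; B_list[i]: subdiagonal block at (i+1, i).
function block_cholesky(A_list, B_list)
    N = length(A_list)
    L_list = Vector{Matrix{Float64}}(undef, N)      # Cholesky factors of the updated diagonal blocks
    C_list = Vector{Matrix{Float64}}(undef, N - 1)  # factored off-diagonal blocks
    L_list[1] = cholesky(Symmetric(A_list[1])).L
    for i in 1:N-1
        # Off-diagonal factor: solve C_i * L_i' = B_i
        C_list[i] = B_list[i] / UpperTriangular(L_list[i]')
        # Schur-complement update of the next diagonal block, then factorize it
        L_list[i + 1] = cholesky(Symmetric(A_list[i + 1] - C_list[i] * C_list[i]')).L
    end
    return L_list, C_list
end

# Forward and backward block substitution with the factors above.
# d_list is a vector of right-hand-side vectors, one per block row.
function block_solve(L_list, C_list, d_list)
    N = length(L_list)
    y = Vector{Vector{Float64}}(undef, N)
    y[1] = LowerTriangular(L_list[1]) \ d_list[1]
    for i in 2:N                                    # forward substitution
        y[i] = LowerTriangular(L_list[i]) \ (d_list[i] - C_list[i - 1] * y[i - 1])
    end
    x = Vector{Vector{Float64}}(undef, N)
    x[N] = UpperTriangular(L_list[N]') \ y[N]
    for i in N-1:-1:1                               # backward substitution
        x[i] = UpperTriangular(L_list[i]') \ (y[i] - C_list[i]' * x[i + 1])
    end
    return x
end

# Usage: x = block_solve(block_cholesky(A_list, B_list)..., d_list)
```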
Contributions are welcome! Please feel free to submit a Pull Request.
This package is licensed under the MIT License - see the LICENSE file for details.
If you use BlockStructuredSolvers.jl in your research, please cite:
```bibtex
@software{BlockStructuredSolvers,
  author = {Jin, David and Montoison, Alexis and Shin, Sungho},
  title  = {BlockStructuredSolvers.jl: A High-Performance Julia Package for Solving Block-Structured Linear Systems},
  year   = {2024},
  url    = {https://github.com/username/BlockStructuredSolvers.jl}
}
```