This simple program implements the SAXPY operation (`Y[i] = a * X[i] + Y[i]`) using rocThrust. It showcases the vector and functor templates together with the `thrust::fill` and `thrust::transform` operations.
- Two host arrays of floats, `x` and `y`, are instantiated, and their contents are printed to the standard output.
- Two `thrust::device_vector<float>`s, `X` and `Y`, are instantiated from the corresponding arrays. The contents are copied to the device.
- The `saxpy_slow` function is invoked next. It is the most straightforward implementation, using a temporary device vector `temp` and two separate transformations, one with `thrust::multiplies` and one with `thrust::plus`. First, the `temp` vector is filled with the value `a` using `thrust::fill`. Then, `temp` is overwritten with the values `a * X[i]` by `thrust::transform` using the `thrust::multiplies` functor. Last, the device vector `Y` is overwritten with `temp[i] + Y[i]` by `thrust::transform` using the `thrust::plus` functor.
- The values of device vector `Y` are printed to the standard output. The `X` and `Y` vectors are destroyed.
- Two new `thrust::device_vector<float>`s, `X` and `Y`, are instantiated from the corresponding arrays. The contents are copied to the device.
- The `saxpy_fast` function is invoked. It implements the same operation with a single transformation and represents "best practice". Device vector `Y` is overwritten with `Y[i] = a * X[i] + Y[i]` by `thrust::transform` using `saxpy_functor`. The functor makes use of the Fused Multiply-Add (FMA) operation.
- The values of device vector `Y` are printed to the standard output. The `X` and `Y` vectors are destroyed.
- rocThrust's device and host vectors implement RAII-style ownership over device and host memory pointers (similarly to `std::vector`). The instances are aware of the requested element count, allocate the required amount of memory, and free it upon destruction. When resized, the memory is reallocated if needed.
- Additionally, using `device_vector` and `host_vector` simplifies transfers between device and host memory to a copy assignment. Note that iterators over device containers can be passed to Thrust algorithms just like host iterators.
- It is suggested that developers use `device_vector` and `host_vector` instead of explicit calls to the `malloc` and `free` functions.
- Like `std::fill`, `thrust::fill(first, last, value)` assigns the prescribed `value` to every element in the range `[first, last)`. It works with both host-side and device-side iterators and supports sequential and parallel execution policies.
- Like `std::transform`, `thrust::transform` can apply both unary and binary functions to its inputs and fills the output range with the resulting values.
- `thrust::binary_function` is a base class that supplies the argument and result type definitions for a general binary functor, while the `thrust::multiplies` and `thrust::plus` functors multiply and add their two arguments, respectively.
- The Fused Multiply-Add (FMA) operation `fma` multiplies its first two arguments and adds the third one to the product. On hardware that supports such an instruction, it is faster and more accurate than a separate multiplication and addition: the addition inside `fma` operates on the full, unrounded product (which is twice as wide as the operands), so the result is rounded only once and the rounding error of the intermediate product cannot be exposed by cancellation in the addition.
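The accuracy claim about FMA can be illustrated on the host with standard C++ alone. The sketch below is not part of the example; the function names are hypothetical, and it assumes IEEE 754 single precision with round-to-nearest-even. The `volatile` store is there only to force the product to be rounded to `float` and to keep the compiler from fusing the two operations on its own.

```cpp
#include <cassert>
#include <cmath>

// Separate multiplication and addition: the product is rounded to float
// before the addition takes place (two roundings in total).
float multiply_then_add(float a, float b, float c)
{
    volatile float product = a * b; // rounded to float here
    return product + c;
}

// Fused multiply-add: the addition uses the exact product and the result
// is rounded only once.
float fused_multiply_add(float a, float b, float c)
{
    return std::fma(a, b, c);
}

// Returns the low-order part of a * a that is lost when the product is
// rounded to float. Only the fused form can recover it.
float lost_bits_of_square(float a)
{
    const float rounded_square = multiply_then_add(a, a, 0.0f);
    assert(multiply_then_add(a, a, -rounded_square) == 0.0f); // cancels completely
    return fused_multiply_add(a, a, -rounded_square);
}
```

For `a = 1 + 2^-12`, the exact square is `1 + 2^-11 + 2^-24`, which does not fit in a `float` and is rounded to `1 + 2^-11`. Subtracting that rounded square with separate operations yields `0`, whereas the fused form returns the discarded `2^-24`.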
- `thrust::host_vector::host_vector`
- `thrust::host_vector::operator[]`
- `thrust::host_vector::begin()`
- `thrust::host_vector::end()`
- `thrust::device_vector::device_vector`
- `thrust::device_vector::operator[]`
- `thrust::device_vector::begin()`
- `thrust::device_vector::end()`
- `thrust::binary_function`
- `thrust::multiplies`
- `thrust::plus`
- `thrust::fill`
- `thrust::transform`