Thinking with iterators in CUDA and HIP

Parallel primitives are the ubiquitous building blocks of GPU programming with CUDA and HIP, and they exist to make your life as a programmer easier. Primitives like scans, reductions, and sorts operate in parallel over large inputs. In the basic use case, input and output reside in device memory as arrays of values. However, the libraries provided by NVIDIA and AMD also accept iterators, which abstract the notions of input and output. An iterator is a type that behaves like a pointer but overrides part of the dereferencing and arithmetic logic. With creative use of iterators, parallel primitives become both simpler to use and more performant. Assuming basic familiarity with the primitives, this article shows two examples of how iterators can be used to write better GPU programs: finding the argument of the maximum of a function using a zip iterator in HIP, and a fast reduction over equally-sized segments in CUDA.
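
To make this concrete, below is a minimal sketch (a toy type, not from any library) of a pointer-like object whose dereference and arithmetic are overridden to generate values on demand instead of loading them from memory. Real library iterators additionally provide the full iterator interface and traits, so in practice you use the factory functions shown throughout this article.

// Toy "counting" iterator: dereferencing computes a value; no memory is read.
struct toy_counting_iterator {
    int value;

    // Dereference returns the current value instead of loading from memory.
    __host__ __device__ int operator*() const { return value; }

    // Pointer arithmetic advances the generated sequence.
    __host__ __device__ toy_counting_iterator operator+(int n) const {
        return toy_counting_iterator{value + n};
    }
    __host__ __device__ int operator[](int n) const { return value + n; }
};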

Leveraging the zip iterator to find the argument of the maximum

A reduction with a binary maximum operator finds the maximum element in the input. But what if we are also interested in where that element is located in the input? For the first example, let us find the argument of the maximum of a function with HIP. To make it a bit more interesting, suppose we want to find the index of the integer value that has the most “set” bits.

A naive approach might be to first perform a reduction over the entire input, and then search for the index that produced that result. However, by making clever use of iterators, we can do it with just a single reduction. This is where the zip iterator comes in, which “zips” two iterators together: the dereferenced type is a tuple of both iterators’ value types. By zipping a counting iterator and an input array, we can enumerate the input elements.

thrust::device_vector<unsigned int> d_in(num_elements);
auto iter = rocprim::make_zip_iterator(
    rocprim::make_tuple(rocprim::make_counting_iterator(0), thrust::raw_pointer_cast(d_in.data())));
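
Dereferencing iter at position i yields the tuple (i, d_in[i]), generated on the fly; with a counting iterator over int and an input of unsigned int values, this matches the tuple_t type defined below.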

Any function that operates on this iterator is now provided with the index of each value. Below is an implementation of a binary operator that selects the tuple whose value has the greater number of set bits.

using tuple_t = rocprim::tuple<int, unsigned int>;

struct arg_max_popc {
    __device__ tuple_t operator()(const tuple_t &lhs, const tuple_t &rhs) const {
        // __popc counts the number of set bits
        return __popc(rocprim::get<1>(lhs)) < __popc(rocprim::get<1>(rhs)) ? rhs : lhs;
    }
};

To finish up, we perform the reduction over the zipped iterator using the custom operator, which yields the index and value of the element with the most set bits. The initial value tuple_t(-1, 0) acts as an identity here: the value 0 has no set bits, and the index -1 signals that no element was selected. As with all device-wide rocPRIM functions, the call is made twice: first with temporary_storage set to nullptr to query storage_size, and again after allocating that many bytes (the query call and allocation are elided in the snippet).

thrust::device_vector<tuple_t> d_out(1);

rocprim::reduce(
  temporary_storage,
  storage_size,
  iter,           // input (our custom iterator)
  thrust::raw_pointer_cast(d_out.data()),   // output
  tuple_t(-1, 0), // initial_value
  num_elements,
  arg_max_popc{}  // reduce_op (our custom operator)
);

thrust::host_vector<tuple_t> h_out = d_out;
printf("(index=%d, value=0x%08x)\n", rocprim::get<0>(h_out[0]), rocprim::get<1>(h_out[0]));

Without iterators, implementing such an operation would require a multi-pass workaround or a custom kernel. By thinking with iterators, the implementation becomes simple and elegant. In fact, the implementations of (hip)CUB’s ArgMin and ArgMax are based on the same idea.

Speeding up reduction for equally-sized segments

Reductions come in many flavors, but the common ones are the input-wide reduction, the by-key reduction, and the segmented reduction. Briefly stated, input-wide computes a single reduction over the whole input, by-key computes one reduction for every consecutive run of identical keys associated with the input, and segmented computes one reduction for each of a set of user-provided segments.
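
As a small hand-worked illustration with a sum operator, the three flavors map the same input to different outputs:

input       1 2 3 4 5 6
input-wide  21
keys        0 0 1 1 1 2
by-key      3 12 6           (one sum per run of identical keys)
offsets     0 2 6
segmented   3 18             (one sum per [begin, end) segment)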

For the second example, we will consider a scenario where a reduction needs to be computed for equally-sized segments in CUDA. The simplest way to implement this is to choose the segmented reduction, which looks something like the following (as before, the two-phase temporary-storage query is elided):

thrust::device_vector<float> d_in(num_segments * num_elements_per_segment);
thrust::device_vector<float> d_out(num_segments);

// Prepare the segment offsets.
thrust::host_vector<int> h_offsets(num_segments + 1);
for (int i = 0; i < num_segments + 1; ++i) {
  h_offsets[i] = num_elements_per_segment * i;
}
thrust::device_vector<int> d_offsets = h_offsets;

cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, 
  temp_storage_bytes,
  thrust::raw_pointer_cast(d_in.data()),
  thrust::raw_pointer_cast(d_out.data()),
  num_segments,
  thrust::raw_pointer_cast(d_offsets.data()),     // d_begin_offsets
  thrust::raw_pointer_cast(d_offsets.data()) + 1, // d_end_offsets
  ::cuda::std::plus{}, // reduction_op
  0.f                  // initial_value
);

While the above implementation works fine, every segment offset introduces additional memory loads. Especially for small segments, this adds significant overhead. With the use of iterators, we can produce the offsets on the fly, without having to store them in memory at all. To achieve this, we combine two iterators: the counting iterator and the transform iterator.

A counting iterator simply represents a range of sequentially increasing values. If we create a counting iterator starting at zero, dereferencing it yields zero, and dereferencing it after incrementing it five times yields five. Since there is no backing storage, no memory loads are needed.
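
For instance, using the cuda::make_counting_iterator factory from CCCL’s <cuda/iterator> header (the same factory used in the snippets below):

auto count = ::cuda::make_counting_iterator(0);
// Values are generated on demand; no memory is read:
// *count == 0, count[5] == 5, *(count + 5) == 5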

The transform iterator applies a unary operator to the values of another iterator. Suppose the operator squares its argument; applying such a transform iterator to a counting iterator starting from zero then yields the values 0, 1, 4, 9, and so on.
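
A small sketch of that squaring example (the __host__ __device__ lambda assumes nvcc’s --extended-lambda flag; a named functor, like the ones below, avoids that requirement):

auto squares = ::cuda::make_transform_iterator(
    ::cuda::make_counting_iterator(0),
    [] __host__ __device__ (int i) { return i * i; });
// squares[0] == 0, squares[1] == 1, squares[2] == 4, squares[3] == 9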

By combining these two basic building blocks, we can formulate an iterator that produces the offsets needed for the segmented reduction over equally-sized segments:

struct multiplier {
    multiplier(int segment_num_elements) : segment_num_elements(segment_num_elements) {}

    __host__ __device__ int operator()(int i) const {
        return i * segment_num_elements;
    }

    int segment_num_elements;
};

auto offsets = ::cuda::make_transform_iterator(
  ::cuda::make_counting_iterator(0), multiplier(segment_num_elements));
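
Now offsets[i] evaluates to i * segment_num_elements on the fly, so offsets and offsets + 1 delimit segment i without a single memory load.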

cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, 
  temp_storage_bytes,
  thrust::raw_pointer_cast(d_in.data()),
  thrust::raw_pointer_cast(d_out.data()),
  num_segments,
  offsets,             // d_begin_offsets (our custom iterator)
  offsets + 1,         // d_end_offsets
  ::cuda::std::plus{}, // reduction_op
  0.f                  // initial_value
);

This should already improve performance, but there is an alternative we could consider instead. Since our segments are of equal size, each element can independently infer which segment it belongs to: for any given element index, its segment index is simply the element index divided by the number of elements per segment. Using this fact, we can use a by-key reduction instead. To leverage reduce-by-key, every element index must be mapped to a segment key. Below is an example with a segment size of three elements.

keys   0 0 0 1 1 1 2 2 2 3 3 3 
values 0 1 2 3 4 5 6 7 8 9 0 1
result     3    12    21    10

A naive method would be to write the keys out to device memory, but, as before, we can construct an iterator that produces the keys on the fly.

struct indexer {
    indexer(int segment_num_elements) : segment_num_elements(segment_num_elements) {}

    __host__ __device__ int operator()(int i) const {
        return i / segment_num_elements;
    }

    int segment_num_elements;
};

auto keys_in = ::cuda::make_transform_iterator(
  ::cuda::make_counting_iterator(0), indexer(segment_num_elements));
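
For a segment size of three, keys_in yields exactly the key sequence from the table above: keys_in[i] evaluates to i / 3 on demand (0, 0, 0, 1, 1, 1, ...), again without touching memory.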

The full reformulated reduction is shown below. Note that ReduceByKey reduces consecutive runs of equal keys; since our generated keys are non-decreasing, each run corresponds to exactly one segment. The snippet also makes use of an iterator that has not yet been mentioned: the discard iterator, an output iterator that simply ignores anything written to it. As we are interested in neither the unique output keys nor the number of runs, these results can be discarded.

cub::DeviceReduce::ReduceByKey(
  d_temp_storage, 
  temp_storage_bytes,
  keys_in,                         // d_keys_in (our custom iterator)
  ::cuda::make_discard_iterator(), // d_unique_out
  thrust::raw_pointer_cast(d_in.data()),  // d_values_in
  thrust::raw_pointer_cast(d_out.data()), // d_aggregates_out
  ::cuda::make_discard_iterator(), // d_num_runs_out
  ::cuda::std::plus{},             // reduction_op
  num_segments * num_elements_per_segment); // num_items (total element count)

The segmented reduction and the by-key reduction operate under different constraints, and it is not immediately clear which version will perform best. In such situations, the best thing to do is simply to measure.
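
A straightforward way is to time each variant with CUDA events; the sketch below is illustrative only (the variant under test is elided), and a real benchmark would also warm up the GPU and average over many repetitions:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... run the reduction variant under test ...
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds

Below are the results of performing the three methods for reduction with different segment sizes, measured with CUDA 13.0 on an RTX 5080, including a non-segmented reduction as a baseline.

[Graph: performance of the three segmented-reduction methods and the non-segmented baseline across segment sizes.]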

As expected, the segmented reduction that uses no iterator performs the worst, and using an iterator makes it strictly better. For small to medium segment sizes, the by-key reduction significantly outperforms the segmented reduction. For very large segment sizes, the segmented reduction performs on par with a non-segmented reduction. From the graph, the by-key reduction appears to carry a constant overhead relative to a non-segmented reduction, except for tiny segments. The segmented reduction has an overhead that scales more strongly with the segment size, but for huge segments the overhead disappears entirely.

Blog Post Author: Nol Moonen, Stream HPC

Published in May 2026