# c++ – Optimizing a diagonal matrix-vector multiplication (?diamv) kernel

For an (completely optional) assignment for an introductory course to programming with C++, I am trying to implement a diagonal matrix-vector multiplication (?diamv) kernel, i.e. mathematically
$$mathbf{y} leftarrow alphamathbf{y} + beta mathbf{M}mathbf{x}$$
for a diagonally clustered matrix $$mathbf{M}$$, dense vectors $$mathbf{x}$$ and $$mathbf{y}$$, and scalars $$alpha$$ and $$beta$$. I believe that I can reasonably motivate the following assumptions:

1. The processors executing the compute threads are capable of executing the SSE4.2 instruction set extension (but not necessarily AVX2),
2. The access scheme of the matrix $$mathbf{M}$$ does not affect the computation and therefore temporal cache locality between kernel calls does not need to be considered,
3. The matrix $$mathbf{M}$$ does not fit in cache, is very diagonally clustered with a diagonal pattern that is known at compile time, and square,
4. The matrix $$mathbf{M}$$ does not contain regularly occurring sequences in its diagonals that would allow for compression along an axis,
5. No reordering function exists for the structure of the matrix $$mathbf{M}$$ that would lead to a cache-oblivious product with a lower cost than an ideal multilevel-memory optimized algorithm,
6. The source data is aligned on an adequate boundary,
7. OpenMP, chosen for its popularity, is available to enable shared-memory parallelism. No distributed memory parallelism is necessary as it is assumed that a domain decomposition algorithm, e.g. DP-FETI, will decompose processing to the node level due to the typical problem size.

Having done a literature review, I have come to the following conclusions on its design and implementation (this is a summary, in increasing granularity, with the extensive literature review being available upon request to save space):

1. “In order to achieve high performance, a parallel implementation of a sparse matrix-vector multiplication must maintain scalability” per White and Sadayappan, 1997.
2. The diagonal matrix storage scheme,
$$DeclareMathOperator{vec}{vec}DeclareMathOperator{val}{val} vecleft(val{(i,j)}equiv a_{i,i+j}right)$$
where $$vec$$ is the matrix vectorization operator, which obtains a vector by stacking the columns of the operand matrix on top of one another. By storing the matrix in this format, I believe the cache locality to be as optimal as possible to allow for row-wise parallelization. Checkerboard partitioning reduces to row-wise for diagonal matrices. Furthermore, this allows for source vector re-use, which is necessary unless the matrix is re-used while still in cache (Frison 2016).
3. I believe that the aforementioned should always hold, before vectorization is even considered? The non-regular padded areas of the matrix, i.e. the top-left and bottom-right, can be handled separately without incurring extra cost in the asymptotic sense (because the matrix is diagonally clustered and very large).
4. Because access to this matrix is linear, software prefetching should not be necessary. I have included it anyways, for code review, at the spot which I considered the most logical.

The following snippet represents my best effort, taking the aforementioned into consideration:

#include <algorithm>
#include <stdint.h>
#include <type_traits>

#include <xmmintrin.h>
#include <emmintrin.h>

#include <omp.h>

#include "tensors.hpp"

#define CEIL_INT_DIV(num, denom)        1 + ((denom - 1) / num)

#if defined(__INTEL_COMPILER)
#define AGNOSTIC_UNROLL(N)              unroll (N)
#elif defined(__CLANG__)
#define AGNOSTIC_UNROLL(N)              clang loop unroll_count(N)
#elif defined(__GNUG__)
#define AGNOSTIC_UNROLL(N)              unroll N
#else
#warning "Compiler not supported"
#endif

/* Computer-specific optimization parameters */
#define PREFETCH                        true
#define OMP_SIZE                        16
#define BLK_I                           8
#define SSE_REG_SIZE                    128
#define SSE_ALIGNMENT                   16
#define SSE_UNROLL_COEF                 3

namespace ranges = std::ranges;

/* Calculate the largest absolute value ..., TODO more elegant? */
template <typename T1, typename T2>
auto static inline largest_abs_val(T1 x, T2 y) {
return std::abs(x) > std::abs(y) ? std::abs(x) : std::abs(y);
}

/* Define intrinsics agnostically; compiler errors thrown automatically */
namespace mm {

/* _mm_store_px - (...) */
inline auto store_px(float *__p, __m128 __a) { return _mm_store_ps(__p, __a); };
inline auto store_px(double *__dp, __m128d __a) { return _mm_store_pd(__dp, __a); };

/* _mm_set1_px - (...) */
inline auto set_px1(float __w) { return _mm_set1_ps(__w);};
inline auto set_px1(double __w) { return _mm_set1_pd(__w); };

/* _mm_mul_px - (...) */
inline auto mul_px(__m128 __a, __m128 __b) { return _mm_mul_ps(__a, __b);};
inline auto mul_px(__m128d __a, __m128d __b) { return _mm_mul_pd(__a, __b); };
}

namespace tensors {
template <typename T1, typename T2>
int diamv(matrix<T1> const &M,
vector<T1> const &x,
vector<T1> &y,
vector<T2> const &d,
T1 alpha, T1 beta) noexcept {
/* Initializations */
/* - Compute the size of an SSE vector */
constexpr size_t sse_size =  SSE_REG_SIZE / (8*sizeof(T1));
/* - Validation of arguments */
static_assert((BLK_I >= sse_size && BLK_I % sse_size == 0), "Cache blocking is invalid");
/* - Reinterpretation of the data as aligned */
auto M_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(M.data(), SSE_ALIGNMENT));
auto x_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(x.data(), SSE_ALIGNMENT));
auto y_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(y.data(), SSE_ALIGNMENT));
auto d_ = reinterpret_cast<T2 *>(__builtin_assume_aligned(d.data(), SSE_ALIGNMENT));
/* - Number of diagonals */
auto n_diags = d.size();
/* - Number of zeroes for padding TODO more elegant? */
/* - No. of rows lower padding needs to be extended with */
/* - Broadcast α and β into vectors outside of the kernel loop */
auto alpha_ = mm::set_px1(alpha);
auto beta_ = mm::set_px1(beta);

/* Compute y := αy + βMx in two steps */
/* - Pre-compute the bounding areas of the two non-vectorizable and single vect. areas */
/* - Non-vectorizable areas (top-left and bottom-right resp.) */
#pragma AGNOSTIC_UNROLL(2)
for (size_t NONVEC_LOOP=0; NONVEC_LOOP<2; NONVEC_LOOP++) {
for (size_t index_M=conds_begin(NONVEC_LOOP); index_M<conds_end(NONVEC_LOOP); index_M++) {
auto index_y = index_M / n_diags;
auto index_x = d(index_M % n_diags) + index_y;
if (index_x >= 0)
y_(index_y) = (alpha * y_(index_y)) + (beta * M_(index_M) * x_(index_x));
}
}
/* - Vectorized area - (parallel) iteration over the x parallelization blocks */
#pragma omp parallel for shared (M_, x_, y_) schedule(static)
for (size_t j_blk=conds_end(0)+1; j_blk<conds_begin(1); j_blk+=BLK_I*n_diags) {
/* Iteration over the x cache blocks */
for (size_t j_bare = 0; j_bare < CEIL_INT_DIV(sse_size, BLK_I); j_bare++) {
size_t j = j_blk + (j_bare*n_diags*sse_size);
/* Perform y = ... for this block, potentially with unrolling */
/* *** microkernel goes here *** */
#if PREFETCH
/* __mm_prefetch() */
#endif
}
}

return 0;
};
}

Some important notes:

1. tensors.hpp is a simple header-only library that I’ve written for the occasion to act as a uniform abstraction layer to tensors of various orders (with the CRTP) having different storage schemes. It also contains aliases to e.g. vectors and dense matrices.

2. For the microkernel, I believe there to be two possibilities

a. Iterate linearly over the vectorized matrix within each cache block; this would amount to row-wise iteration over the matrix $$mathbf{M}$$ within each cache block and therefore a dot product. To the best of my knowledge, dot products are inefficient in dense matrix-vector products due to both data dependencies and how the intrinsics decompose into μops.

b. Iterate over rows in cache blocks in the vectorized matrix, amounting to iteration over diagonals in the matrix $$mathbf{M}$$ within each cache block. Because of the way the matrix $$mathbf{M}$$ is stored, i.e. in its vectorized form, this would incur the cost of broadcasting the floating-point numbers (which, to the best of my knowledge is a complex matter) but allow rows within blocks to be performed in parallel.

I’m afraid that I’ve missed out some other, better, options. This is the primary reason for opening this question. I’m completely stuck. Furthermore, I believe that the differences in how well the source/destination vectors are re-used are too close to call. Does anyone know how I would approach shedding more insight into this?

3. Even if the cache hit rate is high, I’m afraid of the bottleneck shifting to e.g. inadequate instruction scheduling. Is there a way to check this in a machine-independent way other than having to rely on memory bandwidth?

4. Is there a way to make the “ugly” non-vectorizable code more elegant?

Proofreading the above, I feel like a total amateur; all feedback is (very) much appreciated. Thank you in advance.