For an (completely optional) assignment for an introductory course to programming with C++, I am trying to implement a diagonal matrixvector multiplication (?diamv
) kernel, i.e. mathematically
$$mathbf{y} leftarrow alphamathbf{y} + beta mathbf{M}mathbf{x}$$
for a diagonally clustered matrix $mathbf{M}$, dense vectors $mathbf{x}$ and $mathbf{y}$, and scalars $alpha$ and $beta$. I believe that I can reasonably motivate the following assumptions:
 The processors executing the compute threads are capable of executing the SSE4.2 instruction set extension (but not necessarily AVX2),
 The access scheme of the matrix $mathbf{M}$ does not affect the computation and therefore temporal cache locality between kernel calls does not need to be considered,
 The matrix $mathbf{M}$ does not fit in cache, is very diagonally clustered with a diagonal pattern that is known at compile time, and square,
 The matrix $mathbf{M}$ does not contain regularly occurring sequences in its diagonals that would allow for compression along an axis,
 No reordering function exists for the structure of the matrix $mathbf{M}$ that would lead to a cacheoblivious product with a lower cost than an ideal multilevelmemory optimized algorithm,
 The source data is aligned on an adequate boundary,
 OpenMP, chosen for its popularity, is available to enable sharedmemory parallelism. No distributed memory parallelism is necessary as it is assumed that a domain decomposition algorithm, e.g. DPFETI, will decompose processing to the node level due to the typical problem size.
Having done a literature review, I have come to the following conclusions on its design and implementation (this is a summary, in increasing granularity, with the extensive literature review being available upon request to save space):
 “In order to achieve high performance, a parallel implementation of a sparse matrixvector multiplication must maintain scalability” per White and Sadayappan, 1997.
 The diagonal matrix storage scheme,
$$DeclareMathOperator{vec}{vec}DeclareMathOperator{val}{val}
vecleft(val{(i,j)}equiv a_{i,i+j}right)$$
where $vec$ is the matrix vectorization operator, which obtains a vector by stacking the columns of the operand matrix on top of one another. By storing the matrix in this format, I believe the cache locality to be as optimal as possible to allow for rowwise parallelization. Checkerboard partitioning reduces to rowwise for diagonal matrices. Furthermore, this allows for source vector reuse, which is necessary unless the matrix is reused while still in cache (Frison 2016).  I believe that the aforementioned should always hold, before vectorization is even considered? The nonregular padded areas of the matrix, i.e. the topleft and bottomright, can be handled separately without incurring extra cost in the asymptotic sense (because the matrix is diagonally clustered and very large).
 Because access to this matrix is linear, software prefetching should not be necessary. I have included it anyways, for code review, at the spot which I considered the most logical.
The following snippet represents my best effort, taking the aforementioned into consideration:
#include <algorithm>
#include <stdint.h>
#include <type_traits>
#include <xmmintrin.h>
#include <emmintrin.h>
#include <omp.h>
#include "tensors.hpp"
#define CEIL_INT_DIV(num, denom) 1 + ((denom  1) / num)
#if defined(__INTEL_COMPILER)
#define AGNOSTIC_UNROLL(N) unroll (N)
#elif defined(__CLANG__)
#define AGNOSTIC_UNROLL(N) clang loop unroll_count(N)
#elif defined(__GNUG__)
#define AGNOSTIC_UNROLL(N) unroll N
#else
#warning "Compiler not supported"
#endif
/* Computerspecific optimization parameters */
#define PREFETCH true
#define OMP_SIZE 16
#define BLK_I 8
#define SSE_REG_SIZE 128
#define SSE_ALIGNMENT 16
#define SSE_UNROLL_COEF 3
namespace ranges = std::ranges;
/* Calculate the largest absolute value ..., TODO more elegant? */
template <typename T1, typename T2>
auto static inline largest_abs_val(T1 x, T2 y) {
return std::abs(x) > std::abs(y) ? std::abs(x) : std::abs(y);
}
/* Define intrinsics agnostically; compiler errors thrown automatically */
namespace mm {
/* _mm_load_px  (...) */
inline auto load_px(float const *__p) { return _mm_load_ps(__p); };
inline auto load_px(double const *__dp) { return _mm_load_pd(__dp); };
/* _mm_store_px  (...) */
inline auto store_px(float *__p, __m128 __a) { return _mm_store_ps(__p, __a); };
inline auto store_px(double *__dp, __m128d __a) { return _mm_store_pd(__dp, __a); };
/* _mm_set1_px  (...) */
inline auto set_px1(float __w) { return _mm_set1_ps(__w);};
inline auto set_px1(double __w) { return _mm_set1_pd(__w); };
/* _mm_mul_px  (...) */
inline auto mul_px(__m128 __a, __m128 __b) { return _mm_mul_ps(__a, __b);};
inline auto mul_px(__m128d __a, __m128d __b) { return _mm_mul_pd(__a, __b); };
}
namespace tensors {
template <typename T1, typename T2>
int diamv(matrix<T1> const &M,
vector<T1> const &x,
vector<T1> &y,
vector<T2> const &d,
T1 alpha, T1 beta) noexcept {
/* Initializations */
/*  Compute the size of an SSE vector */
constexpr size_t sse_size = SSE_REG_SIZE / (8*sizeof(T1));
/*  Validation of arguments */
static_assert((BLK_I >= sse_size && BLK_I % sse_size == 0), "Cache blocking is invalid");
/*  Reinterpretation of the data as aligned */
auto M_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(M.data(), SSE_ALIGNMENT));
auto x_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(x.data(), SSE_ALIGNMENT));
auto y_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(y.data(), SSE_ALIGNMENT));
auto d_ = reinterpret_cast<T2 *>(__builtin_assume_aligned(d.data(), SSE_ALIGNMENT));
/*  Number of diagonals */
auto n_diags = d.size();
/*  Number of zeroes for padding TODO more elegant? */
auto n_padding_zeroes = largest_abs_val(ranges::min(d), ranges::max(d));
/*  No. of rows lower padding needs to be extended with */
auto n_padding_ext = (y.size()  2*n_padding_zeroes) % sse_size;
/*  Broadcast α and β into vectors outside of the kernel loop */
auto alpha_ = mm::set_px1(alpha);
auto beta_ = mm::set_px1(beta);
/* Compute y := αy + βMx in two steps */
/*  Precompute the bounding areas of the two nonvectorizable and single vect. areas */
size_t conds_begin() = {0, M.size()  (n_padding_ext+n_padding_zeroes)*n_diags};
size_t conds_end() = {n_padding_zeroes*n_diags, M.size()};
/*  Nonvectorizable areas (topleft and bottomright resp.) */
#pragma AGNOSTIC_UNROLL(2)
for (size_t NONVEC_LOOP=0; NONVEC_LOOP<2; NONVEC_LOOP++) {
for (size_t index_M=conds_begin(NONVEC_LOOP); index_M<conds_end(NONVEC_LOOP); index_M++) {
auto index_y = index_M / n_diags;
auto index_x = d(index_M % n_diags) + index_y;
if (index_x >= 0)
y_(index_y) = (alpha * y_(index_y)) + (beta * M_(index_M) * x_(index_x));
}
}
/*  Vectorized area  (parallel) iteration over the x parallelization blocks */
#pragma omp parallel for shared (M_, x_, y_) schedule(static)
for (size_t j_blk=conds_end(0)+1; j_blk<conds_begin(1); j_blk+=BLK_I*n_diags) {
/* Iteration over the x cache blocks */
for (size_t j_bare = 0; j_bare < CEIL_INT_DIV(sse_size, BLK_I); j_bare++) {
size_t j = j_blk + (j_bare*n_diags*sse_size);
/* Perform y = ... for this block, potentially with unrolling */
/* *** microkernel goes here *** */
#if PREFETCH
/* __mm_prefetch() */
#endif
}
}
return 0;
};
}
Some important notes:

tensors.hpp
is a simple headeronly library that I’ve written for the occasion to act as a uniform abstraction layer to tensors of various orders (with the CRTP) having different storage schemes. It also contains aliases to e.g. vectors and dense matrices. 
For the microkernel, I believe there to be two possibilities
a. Iterate linearly over the vectorized matrix within each cache block; this would amount to rowwise iteration over the matrix $mathbf{M}$ within each cache block and therefore a dot product. To the best of my knowledge, dot products are inefficient in dense matrixvector products due to both data dependencies and how the intrinsics decompose into μops.
b. Iterate over rows in cache blocks in the vectorized matrix, amounting to iteration over diagonals in the matrix $mathbf{M}$ within each cache block. Because of the way the matrix $mathbf{M}$ is stored, i.e. in its vectorized form, this would incur the cost of broadcasting the floatingpoint numbers (which, to the best of my knowledge is a complex matter) but allow rows within blocks to be performed in parallel.
I’m afraid that I’ve missed out some other, better, options. This is the primary reason for opening this question. I’m completely stuck. Furthermore, I believe that the differences in how well the source/destination vectors are reused are too close to call. Does anyone know how I would approach shedding more insight into this?

Even if the cache hit rate is high, I’m afraid of the bottleneck shifting to e.g. inadequate instruction scheduling. Is there a way to check this in a machineindependent way other than having to rely on memory bandwidth?

Is there a way to make the “ugly” nonvectorizable code more elegant?
Proofreading the above, I feel like a total amateur; all feedback is (very) much appreciated. Thank you in advance.