calculus – Divergence of a vector-tensor multiplication

I have the following expression:

$\nabla \cdot (\vec{\mathbf{v}} \mathbf{A})$.

Where $\vec{\mathbf{v}}$ is a vector (1×3) and $\mathbf{A}$ a tensor (3×3).

I would like to know if the following identity holds:

$\nabla \cdot (\vec{\mathbf{v}} \mathbf{A}) = \vec{\mathbf{v}} \cdot \nabla \mathbf{A} + (\nabla \cdot \vec{\mathbf{v}})\mathbf{A}$

If not, is there any identity for the divergence of a vector multiplied by a tensor?
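For what it's worth, here is a quick symbolic sanity check with SymPy, assuming $\vec{\mathbf{v}}\mathbf{A}$ is contracted so that its divergence means $\partial_i(v_i A_{jk})$ (one common reading; the component functions `v0`, `a00`, etc. are hypothetical placeholders):

```python
# Checking the product rule d_i(v_i A_jk) = (d_i v_i) A_jk + v_i d_i A_jk
# component-by-component with generic symbolic functions.
import sympy as sp

x, y, z = sp.symbols('x y z')
coords = (x, y, z)
v = [sp.Function(f'v{i}')(*coords) for i in range(3)]
A = [[sp.Function(f'a{i}{j}')(*coords) for j in range(3)] for i in range(3)]

div_v = sum(sp.diff(v[i], coords[i]) for i in range(3))
ok = all(
    sp.simplify(
        sum(sp.diff(v[i] * A[j][k], coords[i]) for i in range(3))      # LHS component
        - (div_v * A[j][k]
           + sum(v[i] * sp.diff(A[j][k], coords[i]) for i in range(3)))  # RHS component
    ) == 0
    for j in range(3) for k in range(3)
)
```

Under this interpretation the identity is just the product rule applied componentwise, with $\vec{\mathbf{v}} \cdot \nabla \mathbf{A}$ read as $(\vec{\mathbf{v}} \cdot \nabla)\mathbf{A}$.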

Best regards

performance tuning – sparse matrix multiplication

Say I have a sparse matrix $M$ with $\sim 50{,}000{,}000$ non-zero elements.

What is the fastest way to perform operations such as:

$$M^k X$$

where $X$ is a given vector.

Is the operation $MX$ parallelizable, for example? I have access to a computing cluster and was wondering if I could make use of it.
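To illustrate the shape of the problem, here is a SciPy sketch (the size and density below are made up for illustration): rather than ever forming $M^k$, which tends to fill in and become dense, one applies $M$ to the vector $k$ times.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 1000
# Hypothetical sparse matrix; CSR is a common format for mat-vec products
M = sparse.random(n, n, density=0.01, format="csr", random_state=0)
x = rng.standard_normal(n)

def apply_power(M, x, k):
    """Compute M^k @ x as k successive sparse mat-vecs: O(k * nnz) work."""
    for _ in range(k):
        x = M @ x
    return x

y = apply_power(M, x, 3)
```

Each individual mat-vec is itself parallelizable across rows, which is what cluster-oriented libraries (e.g. PETSc or Trilinos) exploit.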

Multiplication of two binary numbers in fixed point arithmetic

So I'm performing some operations with fractional numbers on a 16-bit fixed-point processor.

I have to multiply the numbers $x=-6.35$, represented in $Q_{11}$, and $y=-0.1$, represented in $Q_{14}$.

First I represent the numbers in the respective notation in binary. The MSB is the sign bit.

So $x=11001.10100110011$ and $y=11.11100110011001$. I know the binary point is just in our minds and the processor treats these numbers as integers.

Ok, then we multiply the numbers and get $x \cdot y=11001000000100010011111001111011$. We eliminate the repeated sign bit, save the 16 MSBs, and represent the result in the appropriate format $Q_{10}$: $x \cdot y=100100.0000100010$. This number corresponds to $-27.966796875$. But this doesn't make any sense; the result should be $0.635$.

What is going on here? Why is the result different? Am I missing something?
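For reference, the arithmetic can be reproduced in a few lines of Python (a hedged sketch, not processor code): multiplying a $Q_{11}$ word by a $Q_{14}$ word gives a product carrying $11+14=25$ fractional bits, and interpreting the full product as $Q_{25}$ does yield $\approx 0.635$.

```python
def to_q(value, frac_bits, word_bits=16):
    """Round a real number to a two's-complement fixed-point word."""
    return int(round(value * (1 << frac_bits))) & ((1 << word_bits) - 1)

def signed(n, bits):
    """Reinterpret an unsigned word as a signed two's-complement integer."""
    return n - (1 << bits) if n >= (1 << (bits - 1)) else n

xq = to_q(-6.35, 11)   # Q4.11 -> 0b1100110100110011, matching the post
yq = to_q(-0.1, 14)    # Q1.14

# The processor multiplies the words as plain signed integers;
# the product then carries 11 + 14 = 25 fractional bits (Q25).
prod = signed(xq, 16) * signed(yq, 16)
result = prod / (1 << 25)
# result is approximately 0.635
```

The point at which the interpretation of the truncated 16-bit result goes wrong is exactly what the question is asking about, so the sketch only confirms the full-width product.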

linear algebra – Lower bound on vector-matrix-vector multiplication with bound on norm

I have an expression like $a^T M b$ which I want to lower-bound by something like $a^T M a + {?}$ (i.e. $a^T M b \geq a^T M a + {?}$), where $M \in \mathbb{R}^{n \times n}$ is symmetric and positive definite. I can add more assumptions to make this work.

My first idea was to make the sensible assumption $\lVert a-e \rVert \leq \lVert b \rVert \leq \lVert a+e \rVert$ and use this to get a bound which depends in some way on $\lVert e \rVert$, but I am not sure where to start.

How do I use the lower bound on the norm of the vector to bound my expression?

PS: $a$ and $b$ are actually functions of $x$, but that shouldn't matter here.
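In case a starting point helps, one standard move (an assumption on my part, not necessarily the bound you want) is to write $b = a + e$ and apply Cauchy-Schwarz, giving $a^T M b \geq a^T M a - \lVert M a \rVert \, \lVert e \rVert$. A quick numeric check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
G = rng.standard_normal((n, n))
M = G @ G.T + n * np.eye(n)      # symmetric positive definite
a = rng.standard_normal(n)
b = rng.standard_normal(n)
e = b - a

# a^T M b = a^T M a + a^T M e >= a^T M a - |a^T M e|
#         >= a^T M a - ||M a|| * ||e||     (Cauchy-Schwarz; M symmetric)
lhs = a @ M @ b
lower = a @ M @ a - np.linalg.norm(M @ a) * np.linalg.norm(e)
```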

c++ – Optimizing matrix vector multiplication with keyword “register” and unsafe pointer arithmetic

I know this piece of code is quite strange, but it does its job very well performance-wise, reducing the running time of a very computation-intensive operation by 3 to 5 times without using a better algorithm. But since the code uses the controversial keyword register and I use pointer arithmetic quite unsafely (in my opinion), I would like to hear other people's opinions.

A bit about the requirements: since the project is a research project with an emphasis on simplicity, we do not want to use a linear algebra library like Eigen3 or BLAS, or even the C++ standard library at all. It should also not employ any algorithm other than the schoolbook method of matrix-vector multiplication, because we want to focus on the hardware.

My question is: is this kind of code often seen in the wild and considered normal? How can I optimize this code further?

I would appreciate every constructive feedback.

/**
 * Multiply matrix with vector
 * Matrix is transposed. Hence we can do row-wise inner product.
 * @param matrix (n x m)
 * @param vector (1 x m)
 * @param output (1 x n)
 * @param input_height_ n
 * @param input_width_ m
 */
void matrix_vector_multiply(float *matrix, float *vector, float *output, uint32_t input_height_, uint32_t input_width_) {
    /*
     * The functional principle of this code block is very simple. We iterate over 4 rows in parallel.
     * With this trick, we only have to fetch the vector's data once and can effectively reuse it.
     * We use the keyword register to hint to the compiler which variables we would love to keep in CPU registers.
     * Since CPU registers are scarce, we really want to use them only where needed. Because we want to keep everything in registers, we process only 4 rows at once.
     * Also, the register keyword is only a hint to the compiler, which can be completely ignored.
     */
    register uint32_t input_height = input_height_;
    register uint32_t input_width = input_width_;
    // Put the needed data into register variables,
    // giving the compiler a better chance to optimize our code
    register float *output_ptr = output;
    register float *input_ptr = matrix;

    /*
     * Use blocked processing only if we have more than 4 rows and columns; everything else would be
     * a waste of overhead.
     */
    if (input_height > 4 && input_width > 4) {
        uint32_t y = 0;

        // Four rows at once
        for (; y < input_height - 4; y += 4) {
            // Since we advance vector_ptr manually for higher cache locality, we have to reset it every loop
            register float *vector_ptr = vector;

            // Load pointers to four matrix rows
            register float *input_cols_ptr1 = input_ptr;
            input_ptr += input_width;
            register float *input_cols_ptr2 = input_ptr;
            input_ptr += input_width;
            register float *input_cols_ptr3 = input_ptr;
            input_ptr += input_width;
            register float *input_cols_ptr4 = input_ptr;
            input_ptr += input_width;

            // Result for each row
            register float product0 = 0;
            register float product1 = 0;
            register float product2 = 0;
            register float product3 = 0;

            for (uint32_t x = 0; x < input_width; x++) {
                // Pick the value of the vector at this position
                register float vector_val = *vector_ptr++;

                product0 += vector_val * *input_cols_ptr1++;
                product1 += vector_val * *input_cols_ptr2++;
                product2 += vector_val * *input_cols_ptr3++;
                product3 += vector_val * *input_cols_ptr4++;
            }

            // Store the results
            *output_ptr++ += product0;
            *output_ptr++ += product1;
            *output_ptr++ += product2;
            *output_ptr++ += product3;
        }

        // Process the remaining rows
        for (; y < input_height; y++, output_ptr++) {
            register float *vector_ptr = vector;
            for (uint32_t x = 0; x < input_width; x++) {
                *output_ptr += *vector_ptr++ * *input_ptr++;
            }
        }
    } else {
        /*
         * Everything else goes here.
         */
        for (register uint32_t y = 0; y < input_height; y++, output_ptr++) {
            register float *vector_ptr = vector;
            for (register uint32_t x = 0; x < input_width; x++) {
                *output_ptr += *vector_ptr++ * *input_ptr++;
            }
        }
    }
}
The original code looked like this

void gemm(const float *matrix, const float *vector, float *output, uint32_t input_height, uint32_t input_width) {
#pragma omp parallel for
    for (uint32_t y = 0; y < input_height; y++) {
        float sum = 0.0f;
        const float *row = matrix + y * input_width;
        for (uint32_t x = 0; x < input_width; x++) {
            sum += vector[x] * row[x];
        }
        output[y] += sum;
    }
}

abstract algebra – What does the notation $\mathbb{Z}_q[X]/(X^n+1)$ mean and how is multiplication defined there?

I recently encountered notation that is referred to as a ring (in the description of a cryptography scheme, Dilithium, here, page 4):

(…) The key generation algorithm generates a $k \times l$ matrix $\mathbf{A}$ each of whose entries is a polynomial in the ring $R_q = \mathbb{Z}_q[X]/(X^n+1)$

I am not entirely sure how to read this notation:

  • Is this simply the set of polynomials of degree at most $n-1$ where all coefficients are in $\mathbb{Z}_q$?
  • How is the multiplication of two polynomials defined in this ring?
  • Does the matrix $\mathbf{A}$ now consist of $l$ polynomials or $k \times l$ polynomials?
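To make the multiplication question concrete, here is a naive schoolbook sketch (not the NTT-based multiplication that Dilithium actually uses): reducing modulo $X^n + 1$ means substituting $X^n = -1$, so coefficients that overflow past degree $n-1$ wrap around with a sign flip (a "negacyclic" convolution).

```python
def ring_mul(f, g, q, n):
    """Multiply f, g in Z_q[X]/(X^n + 1); f, g are length-n coefficient lists."""
    res = [0] * n
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            k = i + j
            if k < n:                       # ordinary convolution term
                res[k] = (res[k] + fi * gj) % q
            else:                           # X^k = X^(k-n) * X^n = -X^(k-n)
                res[k - n] = (res[k - n] - fi * gj) % q
    return res

# Example: X^3 * X^2 = X^5 = -X in Z_17[X]/(X^4 + 1)
ring_mul([0, 0, 0, 1], [0, 0, 1, 0], 17, 4)   # -> [0, 16, 0, 0]
```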

c++ – Optimizing a diagonal matrix-vector multiplication (?diamv) kernel

For a (completely optional) assignment for an introductory course to programming with C++, I am trying to implement a diagonal matrix-vector multiplication (?diamv) kernel, i.e. mathematically
$$\mathbf{y} \leftarrow \alpha\mathbf{y} + \beta \mathbf{M}\mathbf{x}$$
for a diagonally clustered matrix $\mathbf{M}$, dense vectors $\mathbf{x}$ and $\mathbf{y}$, and scalars $\alpha$ and $\beta$. I believe that I can reasonably motivate the following assumptions:

  1. The processors executing the compute threads are capable of executing the SSE4.2 instruction set extension (but not necessarily AVX2),
  2. The access scheme of the matrix $\mathbf{M}$ does not affect the computation and therefore temporal cache locality between kernel calls does not need to be considered,
  3. The matrix $\mathbf{M}$ does not fit in cache, is very diagonally clustered with a diagonal pattern that is known at compile time, and square,
  4. The matrix $\mathbf{M}$ does not contain regularly occurring sequences in its diagonals that would allow for compression along an axis,
  5. No reordering function exists for the structure of the matrix $\mathbf{M}$ that would lead to a cache-oblivious product with a lower cost than an ideal multilevel-memory optimized algorithm,
  6. The source data is aligned on an adequate boundary,
  7. OpenMP, chosen for its popularity, is available to enable shared-memory parallelism. No distributed memory parallelism is necessary as it is assumed that a domain decomposition algorithm, e.g. DP-FETI, will decompose processing to the node level due to the typical problem size.

Having done a literature review, I have come to the following conclusions on its design and implementation (this is a summary, in increasing granularity, with the extensive literature review being available upon request to save space):

  1. “In order to achieve high performance, a parallel implementation of a sparse matrix-vector multiplication must maintain scalability” per White and Sadayappan, 1997.
  2. The diagonal matrix storage scheme,
    $$\operatorname{vec}\left(\mathrm{val}_{(i,j)}\right) \equiv a_{i,i+j}$$

    where $\operatorname{vec}$ is the matrix vectorization operator, which obtains a vector by stacking the columns of the operand matrix on top of one another. By storing the matrix in this format, I believe the cache locality to be as optimal as possible to allow for row-wise parallelization. Checkerboard partitioning reduces to row-wise partitioning for diagonal matrices. Furthermore, this allows for source vector re-use, which is necessary unless the matrix is re-used while still in cache (Frison 2016).
  3. I believe that the aforementioned should always hold before vectorization is even considered. The non-regular padded areas of the matrix, i.e. the top-left and bottom-right, can be handled separately without incurring extra cost in the asymptotic sense (because the matrix is diagonally clustered and very large).
  4. Because access to this matrix is linear, software prefetching should not be necessary. I have included it anyways, for code review, at the spot which I considered the most logical.

The following snippet represents my best effort, taking the aforementioned into consideration:

#include <algorithm>
#include <cmath>
#include <stdint.h>
#include <type_traits>

#include <xmmintrin.h>
#include <emmintrin.h>

#include <omp.h>

#include "tensors.hpp"

#define CEIL_INT_DIV(num, denom)        (1 + (((denom) - 1) / (num)))

#if defined(__INTEL_COMPILER)
#define AGNOSTIC_UNROLL(N)              unroll (N)
#elif defined(__CLANG__)
#define AGNOSTIC_UNROLL(N)              clang loop unroll_count(N)
#elif defined(__GNUG__)
#define AGNOSTIC_UNROLL(N)              unroll N
#else
#warning "Compiler not supported"
#endif

/* Computer-specific optimization parameters */
#define PREFETCH                        true
#define OMP_SIZE                        16
#define BLK_I                           8
#define SSE_REG_SIZE                    128
#define SSE_ALIGNMENT                   16
#define SSE_UNROLL_COEF                 3

namespace ranges = std::ranges;

/* Calculate the largest absolute value ..., TODO more elegant? */
template <typename T1, typename T2>
auto static inline largest_abs_val(T1 x, T2 y) {
    return std::abs(x) > std::abs(y) ? std::abs(x) : std::abs(y);
}

/* Define intrinsics agnostically; compiler errors thrown automatically */
namespace mm {
    /* _mm_load_px - (...) */
    inline auto load_px(float const *__p) { return _mm_load_ps(__p); };
    inline auto load_px(double const *__dp) { return _mm_load_pd(__dp); };

    /* _mm_store_px - (...) */
    inline auto store_px(float *__p, __m128 __a) { return _mm_store_ps(__p, __a); };
    inline auto store_px(double *__dp, __m128d __a) { return _mm_store_pd(__dp, __a); };

    /* _mm_set1_px - (...) */
    inline auto set_px1(float __w) { return _mm_set1_ps(__w);};
    inline auto set_px1(double __w) { return _mm_set1_pd(__w); };

    /* _mm_mul_px - (...) */
    inline auto mul_px(__m128 __a, __m128 __b) { return _mm_mul_ps(__a, __b); };
    inline auto mul_px(__m128d __a, __m128d __b) { return _mm_mul_pd(__a, __b); };
}

namespace tensors {
    template <typename T1, typename T2>
    int diamv(matrix<T1> const &M, 
              vector<T1> const &x,
              vector<T1> &y,
              vector<T2> const &d,
              T1 alpha, T1 beta) noexcept {
        /* Initializations */
        /* - Compute the size of an SSE vector */
        constexpr size_t sse_size =  SSE_REG_SIZE / (8*sizeof(T1));
        /* - Validation of arguments */
        static_assert((BLK_I >= sse_size && BLK_I % sse_size == 0), "Cache blocking is invalid");
        /* - Reinterpretation of the data as aligned */
        auto M_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(M.data(), SSE_ALIGNMENT));
        auto x_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(x.data(), SSE_ALIGNMENT));
        auto y_ = reinterpret_cast<T1 *>(__builtin_assume_aligned(y.data(), SSE_ALIGNMENT));
        auto d_ = reinterpret_cast<T2 *>(__builtin_assume_aligned(d.data(), SSE_ALIGNMENT));
        /* - Number of diagonals */
        auto n_diags = d.size();
        /* - Number of zeroes for padding TODO more elegant? */
        auto n_padding_zeroes = largest_abs_val(ranges::min(d), ranges::max(d));
        /* - No. of rows lower padding needs to be extended with */
        auto n_padding_ext = (y.size() - 2*n_padding_zeroes) % sse_size;
        /* - Broadcast α and β into vectors outside of the kernel loop */
        auto alpha_ = mm::set_px1(alpha);
        auto beta_ = mm::set_px1(beta);

        /* Compute y := αy + βMx in two steps */
        /* - Pre-compute the bounding areas of the two non-vectorizable and single vect. areas */
        size_t conds_begin[] = {0, M.size() - (n_padding_ext+n_padding_zeroes)*n_diags};
        size_t conds_end[] = {n_padding_zeroes*n_diags, M.size()};
        /* - Non-vectorizable areas (top-left and bottom-right resp.) */
        for (size_t NONVEC_LOOP=0; NONVEC_LOOP<2; NONVEC_LOOP++) {
            for (size_t index_M=conds_begin[NONVEC_LOOP]; index_M<conds_end[NONVEC_LOOP]; index_M++) {
                auto index_y = index_M / n_diags;
                auto index_x = d_[index_M % n_diags] + index_y;
                if (index_x >= 0)
                    y_[index_y] = (alpha * y_[index_y]) + (beta * M_[index_M] * x_[index_x]);
            }
        }
        /* - Vectorized area - (parallel) iteration over the x parallelization blocks */
#pragma omp parallel for shared (M_, x_, y_) schedule(static)
        for (size_t j_blk=conds_end[0]+1; j_blk<conds_begin[1]; j_blk+=BLK_I*n_diags) {
            /* Iteration over the x cache blocks */
            for (size_t j_bare = 0; j_bare < CEIL_INT_DIV(sse_size, BLK_I); j_bare++) {
                size_t j = j_blk + (j_bare*n_diags*sse_size);
                /* Perform y = ... for this block, potentially with unrolling */
                /* *** microkernel goes here *** */
                /* _mm_prefetch() */
            }
        }

        return 0;
    }
}

Some important notes:

  1. tensors.hpp is a simple header-only library that I’ve written for the occasion to act as a uniform abstraction layer to tensors of various orders (with the CRTP) having different storage schemes. It also contains aliases to e.g. vectors and dense matrices.

  2. For the microkernel, I believe there to be two possibilities

    a. Iterate linearly over the vectorized matrix within each cache block; this would amount to row-wise iteration over the matrix $mathbf{M}$ within each cache block and therefore a dot product. To the best of my knowledge, dot products are inefficient in dense matrix-vector products due to both data dependencies and how the intrinsics decompose into μops.

    b. Iterate over rows in cache blocks in the vectorized matrix, amounting to iteration over diagonals in the matrix $mathbf{M}$ within each cache block. Because of the way the matrix $mathbf{M}$ is stored, i.e. in its vectorized form, this would incur the cost of broadcasting the floating-point numbers (which, to the best of my knowledge is a complex matter) but allow rows within blocks to be performed in parallel.

    I'm afraid that I've missed some other, better options. This is the primary reason for opening this question; I'm completely stuck. Furthermore, I believe that the differences in how well the source/destination vectors are re-used are too close to call. Does anyone know how I would approach shedding more insight into this?

  3. Even if the cache hit rate is high, I’m afraid of the bottleneck shifting to e.g. inadequate instruction scheduling. Is there a way to check this in a machine-independent way other than having to rely on memory bandwidth?

  4. Is there a way to make the “ugly” non-vectorizable code more elegant?
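As a PS for reviewers, the intended semantics of the storage scheme and kernel can be pinned down language-neutrally in a few lines of NumPy (scalar reference semantics only, no SIMD/OpenMP; `diamv_ref` and its argument names are my own, not from the C++ above):

```python
import numpy as np

def diamv_ref(val, offsets, x, y, alpha, beta):
    """y <- alpha*y + beta*M@x, with M stored by diagonals:
    val[i, j] = M[i, i + offsets[j]] (zero-padded outside the matrix)."""
    n = y.size
    out = alpha * y
    for j, off in enumerate(offsets):
        # Rows for which column i + off falls inside the matrix
        rows = np.arange(max(0, -off), min(n, n - off))
        out[rows] += beta * val[rows, j] * x[rows + off]
    return out
```

A reference like this is handy for unit-testing the vectorized kernel against known-good results.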

Proofreading the above, I feel like a total amateur; all feedback is (very) much appreciated. Thank you in advance.

recursion – Code and time complexity of multiplication à la française

This references the multiplication algorithm in Chapter 1 of Algorithms by Dasgupta et al.

I am trying to understand how the code for multiplication à la française relates to the multiplication by hand. This is the example given.

by hand

This is the code given for it.
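Roughly, the algorithm is (my transcription in Python; it may differ cosmetically from the book's listing):

```python
def multiply(x, y):
    """Multiplication a la francaise: recurse on floor(y/2),
    doubling the result and adding x back in when y is odd."""
    if y == 0:
        return 0
    z = multiply(x, y // 2)
    if y % 2 == 0:
        return 2 * z
    else:
        return x + 2 * z
```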


I went ahead and did the multiplication of $13 \times 11$ (odd case) and $13 \times 10$ (even case) in a spreadsheet and this is what I got.

spreadsheet example

It seems to me that in the code the rows of the table are flipped with respect to the handwritten example; that is, in $13 \times 11$, the 1 shows up where 13 is and not where 104 is, as in the handwritten example. I still get the correct answer. Where did I make a mistake: in the evaluation, or in understanding the conversion from the handwritten algorithm to the code?

I also have trouble seeing how to get $O(n^2)$ for the time complexity. I understand that there is bit shifting and addition, and so at some point, assuming both $x$ and $y$ are $n$ bits, we end up adding $1 + 2 + 3 + \dots + n = \frac{n}{2}(n+1) = O(n^2)$. But aren't we shifting the bits $n$ times, and therefore need to multiply by another factor of $n$ to get $O(n^3)$?

This is what I get when I think of the equation for the master theorem: $T(n) = T(\frac{n}{2}) + O(n)$, because there is 1 recursive call, the problem is halved, and it takes $O(n)$ time to add $n$-bit numbers. However, this equation would give me $\Theta(n)$, which is wrong. What step did I miss in the recurrence equation?

python – Multiplication algorithm I wrote – slower than I expected

Recently I thought of an algorithm for multiplication and decided to stop dreaming and start writing my ideas down on paper, and even implement them in code (in this case, Python 3.9.1).
I do not know if it resembles Karatsuba’s algorithm, but I glanced at it and it seems to work very differently.

The idea behind this multiplication algorithm (calculating $x \cdot y$) is to represent each number as a power of two plus some remainder, then use the distributive rule of multiplication to get:

$$x = 2^a + K \\ y = 2^b + T$$

$$ x \cdot y = (2^a + K) \cdot (2^b + T) = 2^{a+b} + T \cdot 2^a + K \cdot 2^b + K \cdot T$$

I chose the power to be $2$ as it helps with bit manipulation later on.
Calculating $2^{a+b}$ is easy using bitwise operations, like so: $$ 2^{a+b} = 1 << (a+b)$$

But how would we find $a$ and $b$?

We want $2^a$ (or $2^b$) to be the largest power of $2$ below our $x$ (or $y$ correspondingly), to take as much 'volume' from the original number as possible, thus making the calculations easier with bit manipulations. So I just used the $\lg$ function, which, from what I've read, can run in $O(1)$ time (or at worst $O(\lg \lg n)$). We have:

$$ a = \lfloor \lg(x) \rfloor, \qquad b = \lfloor \lg(y) \rfloor$$

We then need to find $K$, which is just the remainder when we subtract $2^a$ from $x$: $$K = x - 2^a = x - (1 << a)$$

However, maybe subtraction isn't the best idea, maybe it takes too much time, so I thought about another bit manipulation. All I had to do was flip the most significant bit (left-most bit), which represents the greatest power of $2$ the number contains, so I padded exactly $a$ $1$'s and used the $\&$ bitwise operation to clear the MSB. We now have code to find $K$ and $T$ respectively:

$$ K = x ~\&~ \text{int('1' * a, 2)} \\ T = y ~\&~ \text{int('1' * b, 2)}$$

Finally, we can add all the factors together, calling the function recursively to compute $K \cdot T$, to get:

$$ (1 << (a + b)) + (T << a) + (K << b) + \overbrace{\text{mult(K, T)}}^{\text{recursive call}}$$

from math import log2

def mult(x, y):
    if x == 1:
        return y
    elif y == 1:
        return x
    elif x == 0 or y == 0:
        return 0

    base_x = int(log2(x))
    base_y = int(log2(y))

    K = x & int('1' * base_x, 2)
    T = y & int('1' * base_y, 2)

    return (1 << (base_x + base_y)) + (T << base_x) + (K << base_y) + mult(K, T)

But oh! from what I’ve tested, this algorithm does not seem to get near the time it takes to multiply two numbers by just using the plain-old $text{*}$ operation, Sob!

times = []
for _ in range(10000):
    x = random.randint(10 ** 900, 10 ** 1000)
    y = random.randint(10 ** 900, 10 ** 1000)
    start = time.time()
    mult(x, y)
    end = time.time()
    times.append(end - start)

This tests $10{,}000$ multiplications of $900$- to $1000$-digit random integers, then prints the average time. On my machine the average is 0.01391555905342102 seconds. Python's regular multiplication won't even show a number, just 0.0, because it is so fast.

From what I know, Python's algorithm does use Karatsuba's algorithm, which is roughly $O(n^{1.58})$. I did not analyze mine strictly, but in one sense it runs at approximately: $$O(\max(\text{#on-bits in } x, \text{#on-bits in } y))$$
Because on every recursive call we turn off the MSB, the number of recursive calls we make is the maximum number of bits that are on ($=1$) in $x$ and $y$, which is strictly smaller than the numbers themselves. Thus we can surely say it is $O(\max(x, y)) \sim O(n)$, as all the other operations in the function are $O(1)$. So it boils down to the question of 'why?': why is it slower? What have I done wrong in my algorithm, so that it is slower even though at first glance it seems faster?

Thank you!

reference request – Triangular Multiplication Table using Do-while Loop

I want to ask how to make my program use a do-while loop to create a triangular multiplication table. Is there a way to create such a thing without using any other statement?

public class Main {
  static void ssbr(int n) {
    int i = 1;
    do {
      System.out.printf("%4d", n * i);
      i = i + 1;
    } while (i <= 7);
  }

  public static void main(String[] args) {
    int i = 1;
    do {
      ssbr(i);
      System.out.println();
      i = i + 1;
    } while (i <= 7);
  }
}

Output it gave:

1  2  3  4  5  6  7
2  4  6  8 10 11 12
3  6  9 12 15 18 21
4  8 12 16 20 24 30
5 10 15 20 25 30 35
6 12 18 24 30 36 42
7 14 21 28 35 42 49

Output I wanted:

2  4
3  6  9
4  8 12 16
5 10 15 20 25
6 12 18 24 30 36
7 14 21 28 35 42 49