c++ – Obtaining the transpose of an n × n matrix, where n = 2^m, using multithreading

The following is C++ source code to transpose a matrix (a std::vector<std::vector<T>>) in parallel.

The span is $\Theta(\lg^2 n)$ and the work is $\Theta(n^2)$.

Any suggestion for improvement will be appreciated.
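For reference (this is not part of the question's code), the operation being parallelized is the ordinary in-place transpose of a square matrix, which can be sketched sequentially as follows — a minimal baseline useful for checking the parallel version:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential in-place transpose of a square matrix: swap each element above
// the diagonal with its mirror below the diagonal.
template <typename T>
void transposeSequential(std::vector<std::vector<T>>& A) {
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = i + 1; j < A.size(); ++j)
            std::swap(A[i][j], A[j][i]);
}
```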

#include <future>
#include <iomanip>
#include <iostream>
#include <vector>

template <typename T>
void parallelForColsTrans(std::vector<std::vector<T>>& A, size_t fCol, size_t lCol, size_t i, size_t n)
{
    if (fCol == lCol) {
        T temp = A[i][fCol + n];
        A[i][fCol + n] = A[i + n][fCol];
        A[i + n][fCol] = temp;
        return;
    }

    // Keep the futures alive: a discarded std::async temporary blocks in its
    // destructor, which would serialize the two halves.
    auto f1 = std::async(parallelForColsTrans<T>, std::ref(A), fCol, (fCol + lCol) / 2, i, n);
    auto f2 = std::async(parallelForColsTrans<T>, std::ref(A), (fCol + lCol) / 2 + 1, lCol, i, n);
}

template <typename T>
void parallelForRowsTrans(std::vector<std::vector<T>>& A, size_t fRow, size_t lRow, size_t fCol, size_t lCol, size_t n)
{
    if (fRow == lRow) {
        parallelForColsTrans(A, fCol, lCol, fRow, n);
        return;
    }

    auto f1 = std::async(parallelForRowsTrans<T>, std::ref(A), fRow, (fRow + lRow) / 2, fCol, lCol, n);
    auto f2 = std::async(parallelForRowsTrans<T>, std::ref(A), (fRow + lRow) / 2 + 1, lRow, fCol, lCol, n);
}

template <typename T>
void pMatTransposeRecursive(std::vector<std::vector<T>>& A, size_t firstRow, size_t lastRow, size_t firstColumn, size_t lastColumn)
{
    if (firstRow == lastRow)    return;

    auto t1 = std::async(pMatTransposeRecursive<T>, std::ref(A), firstRow, (firstRow + lastRow) / 2, firstColumn, (firstColumn + lastColumn) / 2);
    auto t2 = std::async(pMatTransposeRecursive<T>, std::ref(A), (firstRow + lastRow) / 2 + 1, lastRow, firstColumn, (firstColumn + lastColumn) / 2);
    auto t3 = std::async(pMatTransposeRecursive<T>, std::ref(A), firstRow, (firstRow + lastRow) / 2, (firstColumn + lastColumn) / 2 + 1, lastColumn);
    pMatTransposeRecursive(A, (firstRow + lastRow) / 2 + 1, lastRow, (firstColumn + lastColumn) / 2 + 1, lastColumn);
    t1.wait(); t2.wait(); t3.wait(); // all four quadrants must be transposed before the block swap
    size_t n = (lastColumn - firstColumn + 1) / 2;
    parallelForRowsTrans(A, firstRow, firstRow + n - 1, firstColumn, firstColumn + n - 1, n);
}


template <typename T>
void transpose(std::vector<std::vector<T>>& A) {
    pMatTransposeRecursive(A, 0, A.size() - 1, 0, A[0].size() - 1);
}

int main() {
    std::vector<std::vector<int>> A = {{1,2,3,4,7,7,7,8}, {5,6,7,8,4,5,1,1}, {9,10,5,5,11,12,4,79}, {7,8,13,14,15,16,44,6}, {13,-14,7,-7,15,-16,-44,6}, {13,-14,105,106,404,6,9,9}, {13,-14,7,-7,15,-16,-44,6}, {13,-14,105,106,404,6,9,9}};
    transpose(A);
    for (auto& el : A) {
        for (auto& ele : el) std::cout << std::setw(4) << ele;
        std::cout << "\n";
    }
}


multithreading – A simple thread-safe Deque in C++

I am trying to implement a thread-safe deque in C ++.
ThreadSafeDeque will be used by a FileLogger class.
When threads call the log() function of FileLogger, the messages are push_back()ed to the ThreadSafeDeque and the call returns almost immediately. In a separate thread, the FileLogger pop_front()s messages and writes them to a file at its own pace.
Am I doing things correctly and efficiently below?

#pragma once
#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class ThreadSafeDeque {
public:
    void pop_front_waiting(T &t) {
        // unique_lock can be unlocked, lock_guard can not
        std::unique_lock<std::mutex> lock{ mutex }; // locks
        while (deque.empty()) {
            condition.wait(lock); // unlocks, sleeps and relocks when woken up
        }
        t = deque.front();
        deque.pop_front();
    } // unlocks as lock goes out of scope

    void push_back(const T &t) {
        std::unique_lock<std::mutex> lock{ mutex };
        deque.push_back(t);
        condition.notify_one(); // wakes up pop_front_waiting
    }

private:
    std::deque<T>           deque;
    std::mutex              mutex;
    std::condition_variable condition;
};
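A quick self-contained usage sketch (my own names, with the element type fixed to int so the demo compiles on its own): one producer pushes values while a consumer thread pops them, using the same mutex/condition_variable pattern as the class above, and FIFO order is preserved.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Same locking pattern as ThreadSafeDeque, specialized to int for the demo.
struct IntDeque {
    void push_back(int v) {
        std::unique_lock<std::mutex> lock{mutex_};
        deque_.push_back(v);
        condition_.notify_one();
    }
    void pop_front_waiting(int& v) {
        std::unique_lock<std::mutex> lock{mutex_};
        while (deque_.empty()) condition_.wait(lock);
        v = deque_.front();
        deque_.pop_front();
    }
private:
    std::deque<int> deque_;
    std::mutex mutex_;
    std::condition_variable condition_;
};

// Push 0..count-1 from this thread while a consumer thread pops them;
// returns everything the consumer received, in the order it was received.
std::vector<int> roundTrip(int count) {
    IntDeque q;
    std::vector<int> received;
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) {
            int v = 0;
            q.pop_front_waiting(v);
            received.push_back(v);
        }
    });
    for (int i = 0; i < count; ++i) q.push_back(i);
    consumer.join();
    return received;
}
```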

multithreading – A simple multithreaded FileLogger in C++

In order to learn more about multithreaded programming in C++, I have implemented a basic multithreaded file logger.

I am using a std::deque to store messages in a FileLogger class. Whenever a thread logs a message, the message is pushed to the back of the deque.

In a separate thread, the FileLogger checks for messages in the deque and, if there are any, writes them to the file.

Access to the deque is guarded by a mutex.

To make it easy to log from anywhere, the logger is implemented as a singleton.

Is my code correct? How can it be improved?

// FileLogger.h:
#pragma once
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

class FileLogger {
public:
    static void initialize(const char* filePath) { // called by main thread before any threads are spawned
        instance_ = new FileLogger(filePath);
    }
    static FileLogger* instance() { // called from many threads simultaneously
        return instance_;
    }
    void log(const std::string &msg);

private:
    FileLogger(const char* filePath);
    void writeToFile();

    static FileLogger*      instance_;
    std::deque<std::string> messages;
    std::mutex              messagesMutex; // lock/unlock this each time messages is pushed or popped
    std::ofstream           fout;
    std::thread             writerThread;
};
// FileLogger.cpp:
#include "FileLogger.h"

FileLogger* FileLogger::instance_ = nullptr;

void FileLogger::writeToFile() {
    for (;;) {
        std::string message;
        {
            std::lock_guard<std::mutex> lg(messagesMutex);
            if (messages.empty()) continue; // spins while empty; a condition variable would avoid this
            message = messages.front();
            messages.pop_front();
        }
        fout << message << std::endl << std::flush;
    }
}

FileLogger::FileLogger(const char* filePath) : fout(filePath) {
    std::thread t(&FileLogger::writeToFile, this);
    writerThread = std::move(t);
}

void FileLogger::log(const std::string &msg) {
    std::lock_guard<std::mutex> lg(messagesMutex);
    messages.push_back(msg);
}

multithreading – Parallel MergeSort in C++

I have tried to implement a parallel MergeSort in C++ that also tracks the number of comparisons made and the number of threads it uses:


#include <ctime>
#include <iostream>
#include <mutex>
#include <thread>

int *original_array, *auxiliary_array;
std::mutex protector_of_the_global_counter;
int global_counter = 0;
std::mutex protector_of_the_thread_counter;
int number_of_threads = 0;

template <typename T>
class Counting_Comparator {
    bool was_allocated;
    int *local_counter;
public:
    Counting_Comparator() {
        local_counter = new int(0);
        was_allocated = true;
    }
    Counting_Comparator(int *init) : was_allocated(false), local_counter(init) {}
    Counting_Comparator(const Counting_Comparator &x) : was_allocated(false), local_counter(x.local_counter) {}
    int get_count() { return *local_counter; }
    bool operator() (T first, T second) {
        ++*local_counter;
        return first < second;
    }
    ~Counting_Comparator() {
        if (was_allocated) delete local_counter;
    }
};

struct limits {
    int lower_limit, upper_limit, reccursion_depth;
};

void parallel_merge_sort(limits argument) {
    int lower_limit = argument.lower_limit;
    int upper_limit = argument.upper_limit;
    if (upper_limit - lower_limit < 2) return; //An array of length less than 2 is already sorted.
    int reccursion_depth = argument.reccursion_depth;
    int middle_of_the_array = (upper_limit + lower_limit) / 2;
    limits left_part = {lower_limit, middle_of_the_array, reccursion_depth + 1},
           right_part = {middle_of_the_array, upper_limit, reccursion_depth + 1};
    int local_counter = 0;
    Counting_Comparator<int> comparator_functor(&local_counter);
    // ... sort the two halves (spawning a new thread while reccursion_depth is
    // small, recursing directly otherwise), then merge them into
    // auxiliary_array using comparator_functor and update the global counters
}

int main(void) {
    using std::cout;
    using std::cin;
    using std::endl;
    int n;
    cout << "Enter how many numbers you will input." << endl;
    cin >> n;
    try {
        original_array = new int[n];
        auxiliary_array = new int[n];
    }
    catch (...) {
        std::cerr << "Not enough memory!?" << endl;
        return 1;
    }
    for (int i = 0; i < n; i++) cin >> original_array[i];
    limits entire_array = {0, n, 0};
    clock_t processor_time = clock();
    try {
        std::thread root_of_the_reccursion(parallel_merge_sort, entire_array);
        root_of_the_reccursion.join();
    }
    catch (const std::system_error &error) {
        std::cerr << "Can't create a new thread, error " << error.code() << endl;
        return 1;
    }
}

So what do you think?
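For comparison, here is a minimal self-contained sketch of the same idea — depth-limited parallel merge sort — using std::async rather than raw std::thread. The names and the cut-off depth of 3 are my own choices, not taken from the post:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Sort v[lo, hi). While the recursion is shallow, the left half is sorted on
// another thread via std::async; deeper levels recurse sequentially. The two
// sorted halves are then combined with std::inplace_merge.
void mergeSortParallel(std::vector<int>& v, std::size_t lo, std::size_t hi, int depth) {
    if (hi - lo < 2) return;
    std::size_t mid = lo + (hi - lo) / 2;
    if (depth < 3) { // arbitrary bound: roughly 2^3 concurrent branches at most
        auto left = std::async(std::launch::async,
                               [&] { mergeSortParallel(v, lo, mid, depth + 1); });
        mergeSortParallel(v, mid, hi, depth + 1);
        left.wait();
    } else {
        mergeSortParallel(v, lo, mid, depth + 1);
        mergeSortParallel(v, mid, hi, depth + 1);
    }
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}
```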



multithreading – Run Bash scripts in parallel

I would like to execute a script several times, on 10 files, in parallel. What I need to know is how to structure the argument for each file number. My non-parallel script is:

for i in {1..10};
    do python myscript.py "folder_"$i;
done

I have heard of mpirun but am not sure how to structure the arguments by file number or something similar.
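For what it's worth, plain Bash can do this without mpirun: appending & puts each run in the background, and wait blocks until they have all finished. A sketch, reusing myscript.py and the folder_N naming from the question:

```shell
#!/bin/bash
# Launch all ten invocations concurrently, then wait for every one to exit.
for i in {1..10}; do
    python myscript.py "folder_$i" &
done
wait
```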

multithreading – C++ general-purpose thread pool of fixed size

I have been reading for a few days about thread pools in C++ and decided to implement my own. I mainly intend to use it to learn how to implement parallel algorithms at some point in the future, but before that, I would like to know if there is anything I can do to make it more efficient.

These are all the member variables I use. I decided to put everything in its own namespace and to make the std::condition_variable (responsible for blocking the main thread) static, because there is really no need for each thread_pool object to have its own copy.

#include <array>
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

namespace async {

    static std::condition_variable main_thread_cv;

    class thread_pool {

        private:
            std::mutex mutex_m;
            std::atomic<int> busy_m;
            std::condition_variable pool_cv_m;
            std::array<std::thread, 4> workers_m; // the original size was lost in the excerpt; 4 is a placeholder
            std::queue<std::function<void()>> task_queue_m;
            bool should_terminate_m;
            void thread_loop();

        public:
            thread_pool();
            ~thread_pool();
            thread_pool(const thread_pool& other) = delete;
            thread_pool(thread_pool&& other) = delete;
            thread_pool& operator=(const thread_pool& other) = delete;
            thread_pool& operator=(thread_pool&& other) = delete;
            void wait();
            template <typename Fn, typename... Args> void enqueue(Fn&& function, Args&&... args);
    };
}

This is the loop executed by all of the worker threads.

void thread_pool::thread_loop() {

    thread_local std::function<void()> task;

    for (;;) {
        {
            std::unique_lock<std::mutex> lock(mutex_m);
            pool_cv_m.wait(lock, [this]() { return !task_queue_m.empty() || should_terminate_m; });
            if (should_terminate_m) {
                return;
            }
            task = task_queue_m.front();
            task_queue_m.pop();
            ++busy_m;
        }
        task();
        --busy_m;
        main_thread_cv.notify_one(); // let wait() re-check its predicate
    }
}

ctor and dtor:

thread_pool::thread_pool() {

    busy_m = 0;
    should_terminate_m = false;

    for (auto& thread : workers_m) {
        thread = std::thread([this]() { thread_loop(); });
    }
}

thread_pool::~thread_pool() {

    busy_m = 0;
    {
        std::unique_lock<std::mutex> lock(mutex_m);
        should_terminate_m = true;
    }
    pool_cv_m.notify_all();

    for (auto& thread : workers_m) {
        thread.join();
    }
}

The wait and enqueue functions:

void thread_pool::wait() {

    std::unique_lock<std::mutex> lock(mutex_m);
    main_thread_cv.wait(lock, [this]() { return busy_m == 0 && task_queue_m.empty(); });
}

template <typename Fn, typename... Args>
void thread_pool::enqueue(Fn&& function, Args&& ...args) {

    {
        std::scoped_lock lock(mutex_m);
        task_queue_m.push(std::bind(std::forward<Fn>(function), std::forward<Args>(args)...));
    }
    pool_cv_m.notify_one();
}

design – Would it be possible to abstract multi-threading capability for programs that were not originally designed for this?

Would it be possible to provide multi-core threading capability to programs that were not originally designed for it?

Could this be done by creating a "virtual" processor core (or, for i7s with hyperthreading, virtual "virtual cores") that the program sees as a single core/thread, while next to this virtual core a program/tool/utility spreads the work across multiple real cores/threads on its own? And for programs already designed for multi-core support, the virtual core would allow an increase in the number of usable cores.

I think this would be useful given the trend in recent years of increasing core counts rather than overall processor speed, with CPUs running up against the "ceiling" of Moore's Law, and the seemingly slow, dragging push of software development to take advantage of these growing numbers of processor cores.

I realize that something like this would probably not be simple or easy to accomplish, but most of all I wonder if it would be doable.

multithreading – Java: creating a thread with parameters

I have a program that counts how many times a word occurs in a set of texts.
I want the counting loop to run in a separate thread. How can I pass the articles and stringToSearch parameters to the thread, or define them as global parameters?

public class Main {
    public static void main(String[] args) {
        Scanner s = new Scanner(System.in);
        int numberArticles = s.nextInt();
        articles = new ArrayList<>();
        for (int i = 0; i < numberArticles; i++) {
            String articleName = s.nextLine();
            String content = "";
            File file = new File(articleName + ".txt");
            BufferedReader br;
            try {
                br = new BufferedReader(new FileReader(file));
                String st;
                while ((st = br.readLine()) != null) {
                    content += st;
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            articles.add(new Article(articleName, content));
        }
        String stringToSearch = s.nextLine();
        MyThread myThread = new MyThread();
        myThread.start();
    }
}

public class MyThread extends Thread {
    public void run() {
        for (Article article : articles) {
            int counter = 0;
            String[] words = article.getContent().split(" ");
            for (String word : words) {
                if (word.equals(stringToSearch)) {
                    counter++;
                }
            }
        }
    }
}
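One common answer to this question is to pass the data through the thread's constructor instead of making it global. A sketch (not the poster's code — Article here is a stand-in for the class the question assumes, and SearchThread is my own name):

```java
import java.util.List;

// Stand-in for the Article class assumed by the question.
class Article {
    private final String name;
    private final String content;
    Article(String name, String content) { this.name = name; this.content = content; }
    String getContent() { return content; }
}

// The thread receives its inputs via the constructor, so nothing is global.
class SearchThread extends Thread {
    private final List<Article> articles;
    private final String stringToSearch;
    private int counter = 0;

    SearchThread(List<Article> articles, String stringToSearch) {
        this.articles = articles;
        this.stringToSearch = stringToSearch;
    }

    @Override
    public void run() {
        for (Article article : articles) {
            for (String word : article.getContent().split(" ")) {
                if (word.equals(stringToSearch)) {
                    counter++;
                }
            }
        }
    }

    int getCounter() { return counter; }
}
```

After join() returns, the result is safely visible to the starting thread, so the count can be read back through an accessor like getCounter().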

c# – Multi-threading with TPL – accessing the properties of the inner class

I am using the TPL to parallelize a 2D grid operation. I have extracted a simple example from my actual code to illustrate what I am doing. I get the desired results, and my compute times speed up with the number of processors on my laptop (12).

I would love advice or opinions on my code regarding how my properties are declared. Again, it works as expected, but I wonder whether the design could be better. Thanks in advance.

My simplified code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace Gridding
{
    public abstract class Base
    {
        /// Values representing the mesh values, where each node in the grid is assigned some value.
        public float[,] Values { get; set; }
        public abstract void Compute();
    }

    public class Derived : Base
    {
        /// Make the mesh readonly.  Is this necessary?
        readonly Mesh MyMesh;

        Derived(Mesh mesh)
        {
            MyMesh = mesh;
        }

        public override void Compute()
        {
            Values = new float[MyMesh.NX, MyMesh.NY];

            double[] xn = MyMesh.GetXNodes();
            double[] yn = MyMesh.GetYNodes();

            /// Parallelize along the columns of the grid using the Task Parallel Library (TPL).
            Parallel.For(0, MyMesh.NX, i =>
            {
                Run(i, xn, yn);
            });
        }

        private void Run(int i, double[] xn, double[] yn)
        {
            /// Some long operation that parallelizes along the columns of a mesh/grid
            double x = xn[i];
            for (int j = 0; j < MyMesh.NY; j++)
            {
                /// Again, longer operation here
                double y = yn[j];

                double someValue = Math.Sqrt(x * y);
                Values[i, j] = (float)someValue;
            }
        }

        static void Main(string[] args)
        {
            int nx = 100;
            int ny = 120;
            double x0 = 0.0;
            double y0 = 0.0;
            double width = 100;
            double height = 120;

            Mesh mesh = new Mesh(nx, ny, x0, y0, width, height);

            Base tplTest = new Derived(mesh);
            tplTest.Compute();

            float[,] values = tplTest.Values;
        }



        /// A simple North-South oriented grid.
        class Mesh
        {
            public int NX { get; } = 100;
            public int NY { get; set; } = 150;
            public double XOrigin { get; set; } = 0.0;
            public double YOrigin { get; set; } = 0.0;
            public double Width { get; set; } = 100.0;
            public double Height { get; set; } = 150.0;
            public double DX { get; }
            public double DY { get; }

            public Mesh(int nx, int ny, double xOrigin, double yOrigin, double width, double height)
            {
                NX = nx;
                NY = ny;
                XOrigin = xOrigin;
                YOrigin = yOrigin;
                Width = width;
                Height = height;
                DX = Width / (NX - 1);
                DY = Height / (NY - 1);
            }

            public double[] GetYNodes()
            {
                double[] yNodeLocs = new double[NY];
                for (int i = 0; i < NY; i++)
                    yNodeLocs[i] = YOrigin + i * DY;
                return yNodeLocs;
            }

            public double[] GetXNodes()
            {
                double[] xNodeLocs = new double[NX];
                for (int i = 0; i < NX; i++)
                    xNodeLocs[i] = XOrigin + i * DX;
                return xNodeLocs;
            }
        }
    }
}