Calculating pi with Monte Carlo using OpenMP

This code has a serious problem due to (at least the typical implementation of) rand(). rand() normally has a hidden seed value, and each call to rand() modifies that seed. In most implementations, that means rand() forces serialization.
As a result, calling rand() (a couple of times) in your inner loop will prevent the code from scaling well at all. In fact, in most cases (as Toby Speight showed) a multi-threaded version will run substantially slower than a single-threaded version.

To fix this, you pretty much need to use some other random number generator. Your implementation may provide one (e.g., erand48 is fairly common). If you really need your code to be portable, you could write your own, something on this order:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

typedef unsigned long long rand_state;

// multiplier/addend taken from Knuth Volume 2, Table 1; the modulus is 2^32,
// applied by masking with OMP_RAND_MAX below (0xffffffff doesn't fit in an int,
// so the constants are unsigned long long)
static const unsigned long long multiplier = 314159269;
static const unsigned long long addend = 1;
#define OMP_RAND_MAX 0xffffffffULL

// note that this works differently from srand, returning a seed rather than setting
// a hidden seed.
rand_state omp_srand(void) {
    rand_state state = time(NULL);
    state ^= (unsigned long long)omp_get_thread_num() << 32;
    return state;
}

unsigned omp_rand(rand_state *state) {
    *state = *state * multiplier + addend;
    return (unsigned)(*state & OMP_RAND_MAX);
}

Note that since it’s entirely likely that all the threads get started in the same second, this combines the current time with the thread ID to seed each thread’s generator. Although it’s not 100% guaranteed, this gives a pretty high likelihood that each thread’s generator will start with a unique seed. On the other hand, it also means that we have to start the thread, and then seed the generator inside that thread.
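If POSIX is available, the erand48 mentioned earlier already takes its state as an explicit argument, so each thread can simply own a state array. A minimal sketch of per-thread seeding along the same lines (time mixed with the thread number); seed_erand48 is a hypothetical helper name, and this assumes a POSIX environment:

```cpp
#include <stdlib.h>   // erand48 -- POSIX, not ISO C
#include <time.h>

// erand48's state is just three unsigned shorts. Mix the current time
// with the thread number so concurrent threads get distinct streams.
static void seed_erand48(unsigned short xsubi[3], int thread_num) {
    unsigned long long t = (unsigned long long)time(NULL);
    xsubi[0] = (unsigned short)t;
    xsubi[1] = (unsigned short)(t >> 16);
    xsubi[2] = (unsigned short)(thread_num + 1);
}

// Inside the parallel region, each thread would do something like:
//   unsigned short xsubi[3];
//   seed_erand48(xsubi, omp_get_thread_num());
//   double x = erand48(xsubi);   // uniform in [0, 1)
```

Since erand48 returns a double in [0, 1) directly, this also avoids the divide-by-RAND_MAX step entirely.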

To use this, our code would be something on this general order:

double calculate_in_circle_random_count(void) {

    static const unsigned total = 1000000000;
    unsigned in_circle = 0;

#pragma omp parallel reduction(+:in_circle)
    {
        // do per-thread initialization
        rand_state seed = omp_srand();
        // note: if total isn't a multiple of the thread count, the
        // remainder iterations are simply dropped
        int count = total / omp_get_num_threads();
        int i;

        // then do this thread's portion of the computation:
        for (i = 0;  i < count;  ++i) {

            double x = (double)omp_rand(&seed) / OMP_RAND_MAX;
            double y = (double)omp_rand(&seed) / OMP_RAND_MAX;

            double val = x * x + y * y;
            in_circle += val < 1.0;
        }
    }
    return 4.0 * in_circle / total;
}

int main(int argc, char const *argv[])
{
    int num_threads = 1;

    if (argc > 1)
        num_threads = atoi(argv[1]);

    omp_set_num_threads(num_threads);

    double pi_approx = calculate_in_circle_random_count();

    printf("Pi approximation: %.6f\n", pi_approx);

    return 0;
}

With this modification, the code is at least capable of scaling (when built with OpenMP enabled, e.g. -fopenmp). To get it to scale very well, you’d need to make total rather larger, though: with it as small as you’ve specified, starting up multiple threads takes longer than it saves in calculation. But at least this code can scale well, so if we make total quite a bit larger, it really will run faster. For timing, I added a few more zeros to total and rewrote main a bit, creating a loop that uses 1, 2, 3, and 4 threads and prints the time for each iteration. On my machine, this produced the following:

With 1 threads: Pi approximation: 3.140259, time: 3,652,683 microseconds
With 2 threads: Pi approximation: 3.139389, time: 1,892,415 microseconds
With 3 threads: Pi approximation: 3.138885, time: 1,268,917 microseconds
With 4 threads: Pi approximation: 3.138306, time:   935,579 microseconds

So with this, 4 threads is at least close to 4 times as fast as one thread. I’m pretty sure that if you use rand() inside the loop, you’ll never get it to scale well at all.

For anybody who cares, the version of main that produces that output is written in C++, and looks like this:

#include <chrono>
#include <iostream>
#include <iomanip>
#include <locale>
#include <cstdio>
#include <omp.h>

int main(int argc, char const* argv[]) {
    using namespace std::chrono;

    std::cout.imbue(std::locale(""));

    int processors = omp_get_num_procs();

    for (int num_threads = 1; num_threads <= processors; num_threads++) {
        std::cout << "With " << std::setw(2) << num_threads << " threads: ";
        omp_set_num_threads(num_threads);

        auto start = high_resolution_clock::now();
        float pi_approx = calculate_in_circle_random_count();
        auto stop = high_resolution_clock::now();

        printf(" %.6f, time: ", pi_approx);
        std::cout << std::setw(9) << duration_cast<microseconds>(stop - start).count() << " microseconds\n";
    }
    return 0;
}

This remains pretty much the same, but with some timing code hacked in. In particular, it’s not an attempt at rewriting the code in C++ in general. If I were doing that, I’d probably do a number of things rather differently (starting with the fact that the C++ <random> header already provides a clean way to handle per-thread random number generation, so I’d use that instead of rolling my own).
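For anyone curious what that might look like, here’s a rough sketch (not the original code; estimate_pi is an illustrative name) using one std::mt19937_64 per thread, with fallback stubs so it also builds without OpenMP:

```cpp
#include <random>

// Fallback stubs so the sketch also compiles without OpenMP enabled;
// with OpenMP, the real library functions are used instead.
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num()  { return 0; }
static int omp_get_num_threads() { return 1; }
#endif

double estimate_pi(unsigned long long total) {
    unsigned long long in_circle = 0;

#pragma omp parallel reduction(+:in_circle)
    {
        // Seed each thread's engine from random_device plus the thread
        // number, so the threads start from different states.
        std::mt19937_64 gen(std::random_device{}() + omp_get_thread_num());
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        unsigned long long count = total / omp_get_num_threads();
        for (unsigned long long i = 0; i < count; ++i) {
            double x = dist(gen);
            double y = dist(gen);
            in_circle += x * x + y * y < 1.0;
        }
    }
    return 4.0 * in_circle / total;
}
```

Among other things, std::uniform_real_distribution removes the need for any RAND_MAX-style scaling by hand.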