Lab (5): Shared Memory Programming Using OpenMP

Lab Overview

This lab provides hands-on experience with OpenMP (Open Multi-Processing), an API for shared-memory parallel programming. Students will learn to write parallel programs using OpenMP directives, understand thread management, variable scoping, and synchronization mechanisms.

Lab Objectives

By the end of this lab, students will be able to:

  1. Create and compile OpenMP programs using parallel regions
  2. Control the number of threads and understand thread identification
  3. Implement manual data distribution and result gathering
  4. Apply reduction operations for efficient parallel computations
  5. Understand and manage variable scope (shared vs private)
  6. Use critical sections for thread synchronization
  7. Measure and analyze parallel program performance

Prerequisites

  • Unit 5: OpenMP

Part 1: Getting Started with OpenMP

1.1 Basic Parallel Region

Let's start with a simple "Hello World" program to understand the basic OpenMP structure.

Example 1.1: Basic Hello World

#include <iostream>
#include <omp.h>

int main() {
    std::cout << "Before parallel region" << std::endl;

    #pragma omp parallel
    {
        std::cout << "Hello from OpenMP!" << std::endl;
    }

    std::cout << "After parallel region" << std::endl;
    return 0;
}

Compilation and Execution:

g++ -g -Wall -fopenmp -o hello_omp hello_omp.cpp
./hello_omp

Expected Output: Multiple "Hello from OpenMP!" messages (number depends on your system's default thread count).

1.2 OpenMP Runtime Library Functions

OpenMP provides a comprehensive set of runtime library functions that allow you to query and control the parallel execution environment. These functions are essential for writing effective parallel programs.

1.2.1 OpenMP API Function Naming Convention

All OpenMP runtime library functions follow a consistent naming convention:

omp_<function_name>()

Key Characteristics:

  • Prefix: All functions start with omp_
  • Underscore Separation: Words are separated by underscores (e.g., get_thread_num)
  • Descriptive Names: Function names clearly indicate their purpose
  • Header File: All functions require #include <omp.h>

Common Categories:

  • Thread Information: omp_get_thread_num(), omp_get_num_threads()
  • Thread Control: omp_set_num_threads(), omp_get_max_threads()
  • Environment: omp_in_parallel(), omp_get_nested()
  • Timing: omp_get_wtime(), omp_get_wtick()
  • Locks: omp_init_lock(), omp_set_lock(), omp_unset_lock()
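
A minimal sketch (illustrative only, not part of the lab exercises) that exercises a few of these runtime functions:

#include <iostream>
#include <omp.h>

int main() {
    std::cout << "Max threads available: " << omp_get_max_threads() << std::endl;
    std::cout << "Inside a parallel region? " << omp_in_parallel() << std::endl;  // prints 0 (false)

    double start = omp_get_wtime();  // wall-clock time in seconds
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            std::cout << "Team size: " << omp_get_num_threads() << std::endl;
        }
    }
    double elapsed = omp_get_wtime() - start;
    std::cout << "Parallel region took " << elapsed << " seconds" << std::endl;

    return 0;
}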

1.2.2 Essential Thread Information Functions

  1. omp_get_thread_num()

    • Purpose: Returns the thread number of the calling thread within its team
    • Return Type: int
    • Range: 0 to omp_get_num_threads() - 1
    • Master Thread: Always returns 0
    • Usage: Identifies which thread is executing the current code
    int thread_id = omp_get_thread_num();
    // thread_id will be 0, 1, 2, 3, ... depending on which thread calls this
    
  2. omp_get_num_threads()

    • Purpose: Returns the number of threads currently in the team executing the parallel region
    • Return Type: int
    • Serial Region: Returns 1 (only master thread)
    • Parallel Region: Returns the actual number of threads in the team
    • Usage: Determines the total number of threads available for work distribution
    int total_threads = omp_get_num_threads();
    // total_threads will be the number of threads in the current team
    

Important Notes:

  • These functions are most useful when called from within a parallel region
  • In serial regions, omp_get_thread_num() returns 0 and omp_get_num_threads() returns 1
  • The actual number of threads may be less than requested due to system limitations
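
The following minimal sketch (not part of the lab exercises) demonstrates these notes by calling both functions outside and inside a parallel region:

#include <iostream>
#include <omp.h>

int main() {
    // Serial region: only the master thread exists
    std::cout << "Serial: thread " << omp_get_thread_num()
              << " of " << omp_get_num_threads() << std::endl;  // prints "thread 0 of 1"

    #pragma omp parallel num_threads(2)
    {
        // Parallel region: each thread reports its own ID and the team size
        // (the two output lines may interleave between runs)
        std::cout << "Parallel: thread " << omp_get_thread_num()
                  << " of " << omp_get_num_threads() << std::endl;
    }

    return 0;
}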

1.2.3 Work Distribution Pattern

These functions are commonly used together for manual work distribution:

int thread_id = omp_get_thread_num();    // Which thread am I?
int num_threads = omp_get_num_threads(); // How many threads total?

// Calculate my portion of work
int work_per_thread = total_work / num_threads;
int my_start = thread_id * work_per_thread;
int my_end = (thread_id == num_threads - 1) ? total_work : my_start + work_per_thread;
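
For example, with total_work = 10 and num_threads = 4, integer division gives work_per_thread = 2, so threads 0, 1, and 2 process index ranges [0, 2), [2, 4), and [4, 6), while the last thread picks up the remainder and processes [6, 10).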

1.3 Controlling Thread Count

Example 1.2:

#include <iostream>
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(4)
    {
        int thread_id = omp_get_thread_num();
        int total_threads = omp_get_num_threads();

        std::cout << "Hello from thread " << thread_id 
                << " of " << total_threads << std::endl;
    }
    return 0;
}

Exercise 1:

  • Write the above code and include your name and index number as a comment. Take a screenshot.
  • Run the above code and take a screenshot.
  • Modify the above program to use 16 threads.
  • Replace std::cout with the printf() function.
  • Run the modified code and take a screenshot.

Note: Run it multiple times and note the order of output.
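
Besides the num_threads clause used in Example 1.2, the thread count can also be requested with the runtime function omp_set_num_threads() before the parallel region. A minimal sketch (the count of 8 is just an assumed example; the runtime may grant fewer threads than requested):

#include <cstdio>
#include <omp.h>

int main() {
    omp_set_num_threads(8);  // request 8 threads for subsequent parallel regions

    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}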


Part 2: Data Distribution and Gathering

2.1 Array Sum - Manual Block Distribution

Let's implement a block distribution where we manually distribute the work among threads and gather the results. The example simply computes the sum of all elements in a 1D array; each thread's task is to compute its partial sum.

Example 2.1: Array Sum with Block Distribution

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>

double global_sum = 0.0;

void compute_partial_sum(const std::vector<double>& arr, int n) {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    // Calculate work distribution
    int elements_per_thread = n / num_threads;
    int start = thread_id * elements_per_thread;
    int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;

    // Compute partial sum
    double partial_sum = 0.0;
    for (int i = start; i < end; i++) {
        partial_sum += arr[i];
    }

    /*
    std::cout << "Thread " << thread_id << " computed elements [" 
              << start << ", " << end << "), partial sum = " 
              << partial_sum << std::endl;
    */

    // Add to global sum (requires synchronization)
    #pragma omp critical
    global_sum += partial_sum;
}

int main() {
    const int N = 1000000;
    std::vector<double> arr(N);

    // Initialize array
    for (int i = 0; i < N; i++) {
        arr[i] = i + 1.0;  // 1, 2, 3, ..., N
    }

    // Serial computation for comparison
    auto start_time = std::chrono::high_resolution_clock::now();
    double serial_sum = 0.0;
    for (int i = 0; i < N; i++) {
        serial_sum += arr[i];
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto serial_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);

    std::cout << "Serial sum = " << serial_sum << std::endl;
    std::cout << "Serial time = " << serial_duration.count() << " microseconds" << std::endl;

    // Reset global sum
    global_sum = 0.0;

    // Parallel computation
    start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel num_threads(4)
    {
        compute_partial_sum(arr, N);
    }
    end_time = std::chrono::high_resolution_clock::now();
    auto parallel_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);

    std::cout << "Parallel sum = " << global_sum << std::endl;
    std::cout << "Parallel time = " << parallel_duration.count() << " microseconds" << std::endl;
    std::cout << "Speedup = " << (double)serial_duration.count() / parallel_duration.count() << std::endl;

    return 0;
}

2.1.1 OpenMP vs Pthread Comparison

The beauty of OpenMP lies in its simplicity. Notice how we created 4 threads and distributed work with just two lines:

#pragma omp parallel num_threads(4) 
compute_partial_sum(arr, N); 

OpenMP Advantages:

  • Fork-Join Model: OpenMP automatically creates threads at the parallel region and joins them at the end
  • Implicit Thread Management: No need to explicitly create, manage, or join threads
  • Work Distribution: Each thread automatically gets a unique thread ID and can calculate its work portion
  • Shared Memory: All threads naturally share the same address space

Equivalent Pthread Implementation:

To achieve the same functionality with Pthreads, you would need significantly more code:

#include <pthread.h>
#include <iostream>
#include <vector>

struct ThreadData {
    const std::vector<double>* arr;
    int n;
    int thread_id;
    int num_threads;
    double* global_sum;
    pthread_mutex_t* mutex;
};

void* pthread_compute_partial_sum(void* arg) {
    ThreadData* data = (ThreadData*)arg;

    // Calculate work distribution (same logic as OpenMP version)
    int elements_per_thread = data->n / data->num_threads;
    int start = data->thread_id * elements_per_thread;
    int end = (data->thread_id == data->num_threads - 1) ? data->n : start + elements_per_thread;

    // Compute partial sum
    double partial_sum = 0.0;
    for (int i = start; i < end; i++) {
        partial_sum += (*(data->arr))[i];
    }

    std::cout << "Thread " << data->thread_id << " computed elements [" 
              << start << ", " << end << "), partial sum = " 
              << partial_sum << std::endl;

    // Add to global sum with mutex protection
    pthread_mutex_lock(data->mutex);
    *(data->global_sum) += partial_sum;
    pthread_mutex_unlock(data->mutex);

    return NULL;
}

// In main function - Pthread version:
void compute_with_pthreads(const std::vector<double>& arr, int n, int num_threads) {
    pthread_t threads[num_threads];
    ThreadData thread_data[num_threads];
    pthread_mutex_t mutex;
    double global_sum = 0.0;

    // Initialize mutex
    pthread_mutex_init(&mutex, NULL);

    // Create threads
    for (int i = 0; i < num_threads; i++) {
        thread_data[i].arr = &arr;
        thread_data[i].n = n;
        thread_data[i].thread_id = i;
        thread_data[i].num_threads = num_threads;
        thread_data[i].global_sum = &global_sum;
        thread_data[i].mutex = &mutex;

        pthread_create(&threads[i], NULL, pthread_compute_partial_sum, &thread_data[i]);
    }

    // Join threads
    for (int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL);
    }

    // Clean up
    pthread_mutex_destroy(&mutex);

    std::cout << "Pthread result: " << global_sum << std::endl;
}
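
To build the Pthread version, link the pthread library; a sketch of the compile command (the file name is only an example):

g++ -g -Wall -pthread -o sum_pthread sum_pthread.cpp
./sum_pthread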

Comparison Summary:

| Aspect          | OpenMP                | Pthread                    |
|-----------------|-----------------------|----------------------------|
| Lines of code   | 2 lines               | ~10+ lines                 |
| Thread creation | Automatic             | Manual pthread_create()    |
| Thread joining  | Automatic             | Manual pthread_join()      |
| Data sharing    | Implicit              | Explicit struct passing    |
| Synchronization | #pragma omp critical  | pthread_mutex_t            |
| Thread ID       | omp_get_thread_num()  | Manual tracking            |
| Scalability     | Change num_threads(N) | Modify array sizes         |

Why OpenMP is Simpler:

  1. Declarative vs Imperative: with OpenMP you declare what to parallelize; with Pthreads you must spell out how to parallelize it
  2. Less Error-Prone: No manual thread lifecycle management
  3. Easier Debugging: Cleaner, more readable code
  4. Portable: Same code works across different systems
  5. Incremental Parallelization: Easy to add/remove parallelism

Exercise 2:

  1. Write the above code and include your name and index number as a comment. Take a screenshot.
  2. Run the above program with 2, 4, and 8 threads.
  3. Record the timing results and calculate the efficiency.
  4. Explain why we need the #pragma omp critical directive.
  5. Is this program strongly scalable or weakly scalable? Justify your answer.

Remember: To record the timing results, you MUST run the program several times (at least 5 times) and then record the minimum time.

2.2 Improving with Reduction

Now let's improve the same example using OpenMP's reduction clause.

Attention

👉 Read Unit 5, slide 53, and pay close attention to the Execution Model presented on slide 56.
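
Before the full example, here is a minimal sketch (not part of the exercises) of what reduction(+:var) does: each thread receives a private copy of the variable initialized to the operator's identity (0 for +), and at the end of the region the private copies and the variable's original value are combined into the shared variable.

#include <iostream>
#include <omp.h>

int main() {
    double sum = 0.0;

    #pragma omp parallel num_threads(4) reduction(+:sum)
    {
        // each thread operates on its own private copy of sum (initialized to 0)
        sum += omp_get_thread_num();  // thread i adds i to its copy
    }

    // private copies plus the original value are combined: 0.0 + (0 + 1 + 2 + 3)
    std::cout << "sum = " << sum << std::endl;  // prints 6 (assuming all 4 threads are granted)
    return 0;
}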

Example 2.2: Array Sum with Reduction

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>

double compute_partial_sum_reduction(const std::vector<double>& arr, int n) {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    // Calculate work distribution
    int elements_per_thread = n / num_threads;
    int start = thread_id * elements_per_thread;
    int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;

    // Compute partial sum
    double partial_sum = 0.0;
    for (int i = start; i < end; i++) {
        partial_sum += arr[i];
    }

    /*
    std::cout << "Thread " << thread_id << " computed elements [" 
              << start << ", " << end << "), partial sum = " 
              << partial_sum << std::endl;
    */
    return partial_sum;
}

int main() {
    const int N = 1000000;
    std::vector<double> arr(N);

    // Initialize array
    for (int i = 0; i < N; i++) {
        arr[i] = i + 1.0;
    }

    double total_sum = 0.0;  // must be initialized: the reduction combines this original value with the threads' partial results

    // Parallel computation with reduction
    auto start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel num_threads(4) reduction(+:total_sum)
    {
        total_sum = compute_partial_sum_reduction(arr, N);
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);

    std::cout << "Total sum with reduction = " << total_sum << std::endl;
    std::cout << "Time with reduction = " << duration.count() << " microseconds" << std::endl;

    return 0;
}

Exercise 3:

  1. Compare the performance of the critical-section approach vs the reduction approach.
  2. Modify the code to use different numbers of threads (1, 2, 4, 8) and record the timing results.

Part 3: Different Reduction Operators

3.1 Finding Maximum Value

Let's implement a different example using the max reduction operator.

Example 3.1: Finding Maximum - Manual Approach

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>
#include <random>

double global_max = 0.0;

void find_partial_max(const std::vector<double>& arr, int n) {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    // Calculate work distribution
    int elements_per_thread = n / num_threads;
    int start = thread_id * elements_per_thread;
    int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;

    // Find partial maximum
    double partial_max = arr[start];
    for (int i = start + 1; i < end; i++) {
        if (arr[i] > partial_max) {
            partial_max = arr[i];
        }
    }

    std::cout << "Thread " << thread_id << " found max in range [" 
              << start << ", " << end << ") = " << partial_max << std::endl;

    // Update global maximum (requires synchronization)
    #pragma omp critical
    {
        if (partial_max > global_max) {
            global_max = partial_max;
        }
    }
}

int main() {
    const int N = 1000000;
    std::vector<double> arr(N);

    // Initialize array with random values
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0.0, 1000.0);

    for (int i = 0; i < N; i++) {
        arr[i] = dis(gen);
    }

    // Add a known maximum value for verification
    arr[N/2] = 9999.0;

    // Serial computation
    auto start_time = std::chrono::high_resolution_clock::now();
    double serial_max = *std::max_element(arr.begin(), arr.end());
    auto end_time = std::chrono::high_resolution_clock::now();
    auto serial_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);

    std::cout << "Serial max = " << serial_max << std::endl;
    std::cout << "Serial time = " << serial_duration.count() << " microseconds" << std::endl;

    // Reset global max
    global_max = 0.0;

    // Parallel computation with critical section
    start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel num_threads(4)
    {
        find_partial_max(arr, N);
    }
    end_time = std::chrono::high_resolution_clock::now();
    auto parallel_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);

    std::cout << "Parallel max (critical) = " << global_max << std::endl;
    std::cout << "Parallel time (critical) = " << parallel_duration.count() << " microseconds" << std::endl;

    return 0;
}

Example 3.2: Finding Maximum - Reduction Approach

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>
#include <random>

double find_partial_max_reduction(const std::vector<double>& arr, int n) {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    // Calculate work distribution
    int elements_per_thread = n / num_threads;
    int start = thread_id * elements_per_thread;
    int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;

    // Find partial maximum
    double partial_max = arr[start];
    for (int i = start + 1; i < end; i++) {
        if (arr[i] > partial_max) {
            partial_max = arr[i];
        }
    }

    /*
    std::cout << "Thread " << thread_id << " found max in range [" 
              << start << ", " << end << ") = " << partial_max << std::endl;
    */
    return partial_max;
}

int main() {
    const int N = 1000000;
    std::vector<double> arr(N);

    // Initialize array with random values
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0.0, 1000.0);

    for (int i = 0; i < N; i++) {
        arr[i] = dis(gen);
    }

    // Add a known maximum value
    arr[N/2] = 9999.0;

    double total_max = 0.0;

    // Parallel computation with reduction
    auto start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel num_threads(4) reduction(max:total_max)
    {
        total_max = find_partial_max_reduction(arr, N);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);

    std::cout << "Parallel max with reduction = " << total_max << std::endl;
    std::cout << "Parallel time with reduction = " << duration.count() << " microseconds" << std::endl;

    return 0;
}

Exercise 4:

  1. Compare the performance of the critical section vs reduction for finding the maximum.
  2. Implement a similar program for finding the minimum value using reduction(min:variable).
  3. Test with different array sizes (100,000; 1,000,000; 10,000,000) and record the performance.


Part 4: Variable Scope - Shared vs Private

4.1 Understanding Variable Scope

Example 4.1: Shared Variables Problem

#include <iostream>
#include <omp.h>

int main() {
    int counter = 0;

    std::cout << "Initial counter value: " << counter << std::endl;

    #pragma omp parallel num_threads(4)
    {
        int thread_id = omp_get_thread_num();

        // This is dangerous - race condition!
        for (int i = 0; i < 1000; i++) {
            counter++;
        }

        std::cout << "Thread " << thread_id << " finished" << std::endl;
    }

    std::cout << "Final counter value: " << counter << std::endl;
    std::cout << "Expected value: " << 4 * 1000 << std::endl;

    return 0;
}

Exercise 5:

  1. Run the above program multiple times. Do you get the expected result?
  2. Explain why the results are inconsistent.

Example 4.2: Using Private Variables

#include <iostream>
#include <omp.h>

int main() {
    int shared_counter = 0;
    int x = 100;  // This will be shared by default

    std::cout << "Before parallel region: x = " << x << std::endl;

    #pragma omp parallel num_threads(4) private(x)
    {
        int thread_id = omp_get_thread_num();

        // x is private to each thread, uninitialized
        std::cout << "Thread " << thread_id << " initial x = " << x << std::endl;

        // Each thread sets its own x
        x = thread_id * 10;

        int local_sum = 0;
        for (int i = 0; i < 1000; i++) {
            local_sum++;
        }

        std::cout << "Thread " << thread_id << " final x = " << x 
                  << ", local_sum = " << local_sum << std::endl;

        // Safe update to shared variable
        #pragma omp critical
        shared_counter += local_sum;
    }

    std::cout << "After parallel region: x = " << x << std::endl;
    std::cout << "Final shared_counter = " << shared_counter << std::endl;

    return 0;
}

Example 4.3: Using firstprivate

#include <iostream>
#include <omp.h>

int main() {
    int x = 100;

    std::cout << "Before parallel region: x = " << x << std::endl;

    #pragma omp parallel num_threads(4) firstprivate(x)
    {
        int thread_id = omp_get_thread_num();

        // x is private to each thread, initialized with original value
        std::cout << "Thread " << thread_id << " initial x = " << x << std::endl;

        // Each thread modifies its own copy
        x += thread_id * 10;

        std::cout << "Thread " << thread_id << " final x = " << x << std::endl;
    }

    std::cout << "After parallel region: x = " << x << std::endl;

    return 0;
}

Example 4.4: Using default(none)
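
The default(none) clause removes the default sharing rules: every variable referenced in the parallel region that does not have a predetermined data-sharing attribute must be listed explicitly as shared or private, so forgotten scoping decisions become compile-time errors instead of silent bugs.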

#include <iostream>
#include <omp.h>

int main() {
    int shared_var = 0;
    int private_var = 10;

    #pragma omp parallel num_threads(4) default(none) shared(shared_var) private(private_var)
    {
        int thread_id = omp_get_thread_num();

        // private_var is uninitialized
        private_var = thread_id * 5;

        std::cout << "Thread " << thread_id << " private_var = " 
                  << private_var << std::endl;

        #pragma omp critical
        shared_var += private_var;
    }

    std::cout << "Final shared_var = " << shared_var << std::endl;

    return 0;
}

Exercise 6:

  1. Try compiling Example 4.4 without specifying all variables in the default(none) clause (i.e., remove the shared and private clauses). What does the compiler report?

Part 5: Trapezoidal Rule Implementation

5.1 Serial Implementation
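
The code below implements the composite trapezoidal rule: the interval [a, b] is divided into n trapezoids of width h = (b - a) / n, and the integral is approximated as

$$\int_a^b f(x)\,dx \approx h\left[\frac{f(a) + f(b)}{2} + \sum_{i=1}^{n-1} f(a + ih)\right]$$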

#include <iostream>
#include <chrono>
#include <cmath>

double f(double x) {
    return x * x * x - x + 1;  // f(x) = x³ - x + 1
}

double trapezoidal_rule(double a, double b, long n) {
    double h = (b - a) / n;
    double result = (f(a) + f(b)) / 2.0;

    for (long i = 1; i < n; i++) {
        double x = a + i * h;
        result += f(x);
    }

    return result * h;
}

int main() {
    double a = -1.0, b = 3.0;
    long n = 100000000;  // 100 million trapezoids

    auto start = std::chrono::high_resolution_clock::now();
    double result = trapezoidal_rule(a, b, n);
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "Serial Implementation:" << std::endl;
    std::cout << "Integral approximation = " << result << std::endl;
    std::cout << "Time = " << duration.count() << " milliseconds" << std::endl;

    return 0;
}

5.2 Parallel Implementation with Manual Distribution

#include <iostream>
#include <omp.h>
#include <chrono>
#include <cmath>

double f(double x) {
    return x * x * x - x + 1;
}

double trapezoidal_rule(double a, double b, long n) {
    double h = (b - a) / n;
    double result = (f(a) + f(b)) / 2.0;

    for (long i = 1; i < n; i++) {
        double x = a + i * h;
        result += f(x);
    }

    return result * h;
}

double global_result = 0.0;

void compute_trapezoids(double a, double b, long n) {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    // Calculate work distribution
    long local_n = n / num_threads;
    double local_interval = (b - a) / num_threads;
    double local_a = a + thread_id * local_interval;
    double local_b = local_a + local_interval;

    std::cout << "Thread " << thread_id << " computing interval [" 
              << local_a << ", " << local_b << "] with " 
              << local_n << " trapezoids" << std::endl;

    double local_result = trapezoidal_rule(local_a, local_b, local_n);

    #pragma omp critical
    global_result += local_result;
}

int main() {
    double a = -1.0, b = 3.0;
    long n = 100000000;

    // Reset global result
    global_result = 0.0;

    auto start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel num_threads(4)
    {
        compute_trapezoids(a, b, n);
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "Parallel Implementation (Critical Section):" << std::endl;
    std::cout << "Integral approximation = " << global_result << std::endl;
    std::cout << "Time = " << duration.count() << " milliseconds" << std::endl;

    return 0;
}

5.3 Parallel Implementation with Reduction

#include <iostream>
#include <omp.h>
#include <chrono>
#include <cmath>

double f(double x) {
    return x * x * x - x + 1;
}

double trapezoidal_rule(double a, double b, long n) {
    double h = (b - a) / n;
    double result = (f(a) + f(b)) / 2.0;

    for (long i = 1; i < n; i++) {
        double x = a + i * h;
        result += f(x);
    }

    return result * h;
}

double compute_trapezoids_reduction(double a, double b, long n) {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    // Calculate work distribution
    long local_n = n / num_threads;
    double local_interval = (b - a) / num_threads;
    double local_a = a + thread_id * local_interval;
    double local_b = local_a + local_interval;

    /*
    std::cout << "Thread " << thread_id << " computing interval [" 
              << local_a << ", " << local_b << "] with " 
              << local_n << " trapezoids" << std::endl;
    */
    return trapezoidal_rule(local_a, local_b, local_n);
}

int main() {
    double a = -1.0, b = 3.0;
    long n = 100000000;

    double total_result = 0.0;

    auto start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel num_threads(4) reduction(+:total_result)
    {
        total_result = compute_trapezoids_reduction(a, b, n);
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "Parallel Implementation (Reduction):" << std::endl;
    std::cout << "Integral approximation = " << total_result << std::endl;
    std::cout << "Time = " << duration.count() << " milliseconds" << std::endl;

    return 0;
}

Performance Analysis Template

For each exercise involving performance measurement, record your results in the following format:

Tip

Data size: __

| Number of Threads | Execution Time (ms) | Speedup | Efficiency |
|-------------------|---------------------|---------|------------|
| 1 (Serial)        |                     | 1.0     | 1.0        |
| 2                 |                     |         |            |
| 4                 |                     |         |            |
| 8                 |                     |         |            |

Formulas:

  • Speedup = Serial Time / Parallel Time
  • Efficiency = Speedup / Number of Threads
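
For example (illustrative numbers only): if the serial run takes 120 ms and the 4-thread run takes 40 ms, then Speedup = 120 / 40 = 3.0 and Efficiency = 3.0 / 4 = 0.75 (i.e., 75%).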

Submission Guidelines

  1. Do ALL exercises (1 to 6)
  2. Submit all source code files with proper comments
  3. Include a performance analysis report for exercises requiring timing
  4. Answer all questions posed in the exercises
  5. Demonstrate understanding of OpenMP concepts through code examples
  6. Include compilation commands and sample outputs