Lab (5): Shared Memory Programming Using OpenMP
Lab Overview
This lab provides hands-on experience with OpenMP (Open Multi-Processing), an API for shared-memory parallel programming. Students will learn to write parallel programs using OpenMP directives, understand thread management, variable scoping, and synchronization mechanisms.
Lab Objectives
By the end of this lab, students will be able to:
- Create and compile OpenMP programs using parallel regions
- Control the number of threads and understand thread identification
- Implement manual data distribution and result gathering
- Apply reduction operations for efficient parallel computations
- Understand and manage variable scope (shared vs private)
- Use critical sections for thread synchronization
- Measure and analyze parallel program performance
Prerequisites
- Unit 5: OpenMP
Part 1: Getting Started with OpenMP
1.1 Basic Parallel Region
Let's start with a simple "Hello World" program to understand the basic OpenMP structure.
Example 1.1: Basic Hello World
#include <iostream>
#include <omp.h>
int main() {
std::cout << "Before parallel region" << std::endl;
#pragma omp parallel
{
std::cout << "Hello from OpenMP!" << std::endl;
}
std::cout << "After parallel region" << std::endl;
return 0;
}
Compilation and Execution:
Expected Output: Multiple "Hello from OpenMP!" messages (number depends on your system's default thread count).
1.2 OpenMP Runtime Library Functions
OpenMP provides a comprehensive set of runtime library functions that allow you to query and control the parallel execution environment. These functions are essential for writing effective parallel programs.
1.2.1 OpenMP API Function Naming Convention
All OpenMP runtime library functions follow a consistent naming convention:
Key Characteristics:
- Prefix: All functions start with
omp_
- Underscore Separation: Words are separated by underscores (e.g.,
get_thread_num
) - Descriptive Names: Function names clearly indicate their purpose
- Header File: All functions require
#include <omp.h>
Common Categories:
- Thread Information:
omp_get_thread_num()
,omp_get_num_threads()
- Thread Control:
omp_set_num_threads()
,omp_get_max_threads()
- Environment:
omp_in_parallel()
,omp_get_nested()
- Timing:
omp_get_wtime()
,omp_get_wtick()
- Locks:
omp_init_lock()
,omp_set_lock()
,omp_unset_lock()
1.2.2 Essential Thread Information Functions
-
omp_get_thread_num()
- Purpose: Returns the thread number of the calling thread within its team
- Return Type:
int
- Range: 0 to
omp_get_num_threads() - 1
- Master Thread: Always returns 0
- Usage: Identifies which thread is executing the current code
-
omp_get_num_threads()
- Purpose: Returns the number of threads currently in the team executing the parallel region
- Return Type:
int
- Serial Region: Returns 1 (only master thread)
- Parallel Region: Returns the actual number of threads in the team
- Usage: Determines the total number of threads available for work distribution
Important Notes:
- These functions return meaningful values only when called from within a parallel region
- In serial regions,
omp_get_thread_num()
returns 0 andomp_get_num_threads()
returns 1 - The actual number of threads may be less than requested due to system limitations
1.2.3 Work Distribution Pattern
These functions are commonly used together for manual work distribution:
int thread_id = omp_get_thread_num(); // Which thread am I?
int num_threads = omp_get_num_threads(); // How many threads total?
// Calculate my portion of work
int work_per_thread = total_work / num_threads;
int my_start = thread_id * work_per_thread;
int my_end = (thread_id == num_threads - 1) ? total_work : my_start + work_per_thread;
1.3 Controlling Thread Count
Example 1.2:
#include <iostream>
#include <omp.h>
int main() {
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
int total_threads = omp_get_num_threads();
std::cout << "Hello from thread " << thread_id
<< " of " << total_threads << std::endl;
}
return 0;
}
Exercise 1:
- Write the above code and include your name and index number as comment. Take a screenshot
- Run the above code and take a screenshot.
- Modify the above program to use 16 threads.
- Replace std::cout with printf() function.
- Run the modified code and take a screenshot
Note: Run it multiple times and note the order of output.
Part 2: Data Distribution and Gathering
2.1 Array Sum - Manual Block Distribution
Let's implement a block distribution where we manually distribute work among threads and gather results. The exmaple simply computes the sum of all elements in 1D array. The task of each thread is to compute its partial sum.
Example 2.1: Array Sum with Block Distribution
#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
double global_sum = 0.0;
void compute_partial_sum(const std::vector<double>& arr, int n) {
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
// Calculate work distribution
int elements_per_thread = n / num_threads;
int start = thread_id * elements_per_thread;
int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;
// Compute partial sum
double partial_sum = 0.0;
for (int i = start; i < end; i++) {
partial_sum += arr[i];
}
/*
std::cout << "Thread " << thread_id << " computed elements ["
<< start << ", " << end << "), partial sum = "
<< partial_sum << std::endl;
*/
// Add to global sum (requires synchronization)
#pragma omp critical
global_sum += partial_sum;
}
int main() {
const int N = 1000000;
std::vector<double> arr(N);
// Initialize array
for (int i = 0; i < N; i++) {
arr[i] = i + 1.0; // 1, 2, 3, ..., N
}
// Serial computation for comparison
auto start_time = std::chrono::high_resolution_clock::now();
double serial_sum = 0.0;
for (int i = 0; i < N; i++) {
serial_sum += arr[i];
}
auto end_time = std::chrono::high_resolution_clock::now();
auto serial_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Serial sum = " << serial_sum << std::endl;
std::cout << "Serial time = " << serial_duration.count() << " microseconds" << std::endl;
// Reset global sum
global_sum = 0.0;
// Parallel computation
start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel num_threads(4)
{
compute_partial_sum(arr, N);
}
end_time = std::chrono::high_resolution_clock::now();
auto parallel_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Parallel sum = " << global_sum << std::endl;
std::cout << "Parallel time = " << parallel_duration.count() << " microseconds" << std::endl;
std::cout << "Speedup = " << (double)serial_duration.count() / parallel_duration.count() << std::endl;
return 0;
}
2.1.1 OpenMP vs Pthread Comparison
The beauty of OpenMP lies in its simplicity. Notice how we created 4 threads and distributed work with just two lines:
OpenMP Advantages:
- Fork-Join Model: OpenMP automatically creates threads at the parallel region and joins them at the end
- Implicit Thread Management: No need to explicitly create, manage, or join threads
- Work Distribution: Each thread automatically gets a unique thread ID and can calculate its work portion
- Shared Memory: All threads naturally share the same address space
Equivalent Pthread Implementation:
To achieve the same functionality with Pthreads, you would need significantly more code:
#include <pthread.h>
#include <iostream>
#include <vector>
struct ThreadData {
const std::vector<double>* arr;
int n;
int thread_id;
int num_threads;
double* global_sum;
pthread_mutex_t* mutex;
};
void* pthread_compute_partial_sum(void* arg) {
ThreadData* data = (ThreadData*)arg;
// Calculate work distribution (same logic as OpenMP version)
int elements_per_thread = data->n / data->num_threads;
int start = data->thread_id * elements_per_thread;
int end = (data->thread_id == data->num_threads - 1) ? data->n : start + elements_per_thread;
// Compute partial sum
double partial_sum = 0.0;
for (int i = start; i < end; i++) {
partial_sum += (*(data->arr))[i];
}
std::cout << "Thread " << data->thread_id << " computed elements ["
<< start << ", " << end << "), partial sum = "
<< partial_sum << std::endl;
// Add to global sum with mutex protection
pthread_mutex_lock(data->mutex);
*(data->global_sum) += partial_sum;
pthread_mutex_unlock(data->mutex);
return NULL;
}
// In main function - Pthread version:
void compute_with_pthreads(const std::vector<double>& arr, int n, int num_threads) {
pthread_t threads[num_threads];
ThreadData thread_data[num_threads];
pthread_mutex_t mutex;
double global_sum = 0.0;
// Initialize mutex
pthread_mutex_init(&mutex, NULL);
// Create threads
for (int i = 0; i < num_threads; i++) {
thread_data[i].arr = &arr;
thread_data[i].n = n;
thread_data[i].thread_id = i;
thread_data[i].num_threads = num_threads;
thread_data[i].global_sum = &global_sum;
thread_data[i].mutex = &mutex;
pthread_create(&threads[i], NULL, pthread_compute_partial_sum, &thread_data[i]);
}
// Join threads
for (int i = 0; i < num_threads; i++) {
pthread_join(threads[i], NULL);
}
// Clean up
pthread_mutex_destroy(&mutex);
std::cout << "Pthread result: " << global_sum << std::endl;
}
Comparison Summary:
Aspect | OpenMP | Pthread |
---|---|---|
Lines of code | 2 lines | ~10+ lines |
Thread creation | Automatic | Manual pthread_create() |
Thread joining | Automatic | Manual pthread_join() |
Data sharing | Implicit | Explicit struct passing |
Synchronization | #pragma omp critical |
pthread_mutex_t |
Thread ID | omp_get_thread_num() |
Manual tracking |
Scalability | Change num_threads(N) | Modify array sizes |
Why OpenMP is Simpler:
- Declarative vs Imperative: OpenMP tells what to parallelize, Pthread tells how to parallelize
- Less Error-Prone: No manual thread lifecycle management
- Easier Debugging: Cleaner, more readable code
- Portable: Same code works across different systems
- Incremental Parallelization: Easy to add/remove parallelism
Exercise 2:
- Write the above code and include your name and index number as comment. Take a screenshot
- Run the above program with 2, 4, and 8 threads.
- Record the timing results and calculate efficiency
- Explain why we need the
#pragma omp critical
directive - Is this program strongly scalable or weakly scalable? Show your answer.
Remember: To record the timing results, you MUST run the program several times (at least 5 times) and then record the minimum time.
2.2 Improving with Reduction
Now let's improve the same example using OpenMP's reduction clause.
Attention
Read Unit 5, slide 53, and pay close attention to the Execution Model presented on slide 56.
Example 2.2: Array Sum with Reduction
#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
double compute_partial_sum_reduction(const std::vector<double>& arr, int n) {
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
// Calculate work distribution
int elements_per_thread = n / num_threads;
int start = thread_id * elements_per_thread;
int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;
// Compute partial sum
double partial_sum = 0.0;
for (int i = start; i < end; i++) {
partial_sum += arr[i];
}
/*
std::cout << "Thread " << thread_id << " computed elements ["
<< start << ", " << end << "), partial sum = "
<< partial_sum << std::endl;
*/
return partial_sum;
}
int main() {
const int N = 1000000;
std::vector<double> arr(N);
// Initialize array
for (int i = 0; i < N; i++) {
arr[i] = i + 1.0;
}
double total_sum;
// Parallel computation with reduction
auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel num_threads(4) reduction(+:total_sum)
{
total_sum = compute_partial_sum_reduction(arr, N);
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Total sum with reduction = " << total_sum << std::endl;
std::cout << "Time with reduction = " << duration.count() << " microseconds" << std::endl;
return 0;
}
Exercise 3:
- Compare the performance of critical section vs reduction approach
- Modify the code to use different numbers of threads (1, 2, 4, 8) and record the timing results
Part 3: Different Reduction Operators
3.1 Finding Maximum Value
Let's implement a different example using the max reduction operator.
Example 3.1: Finding Maximum - Manual Approach
#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>
#include <random>
double global_max = 0.0;
void find_partial_max(const std::vector<double>& arr, int n) {
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
// Calculate work distribution
int elements_per_thread = n / num_threads;
int start = thread_id * elements_per_thread;
int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;
// Find partial maximum
double partial_max = arr[start];
for (int i = start + 1; i < end; i++) {
if (arr[i] > partial_max) {
partial_max = arr[i];
}
}
std::cout << "Thread " << thread_id << " found max in range ["
<< start << ", " << end << ") = " << partial_max << std::endl;
// Update global maximum (requires synchronization)
#pragma omp critical
{
if (partial_max > global_max) {
global_max = partial_max;
}
}
}
int main() {
const int N = 1000000;
std::vector<double> arr(N);
// Initialize array with random values
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<> dis(0.0, 1000.0);
for (int i = 0; i < N; i++) {
arr[i] = dis(gen);
}
// Add a known maximum value for verification
arr[N/2] = 9999.0;
// Serial computation
auto start_time = std::chrono::high_resolution_clock::now();
double serial_max = *std::max_element(arr.begin(), arr.end());
auto end_time = std::chrono::high_resolution_clock::now();
auto serial_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Serial max = " << serial_max << std::endl;
std::cout << "Serial time = " << serial_duration.count() << " microseconds" << std::endl;
// Reset global max
global_max = 0.0;
// Parallel computation with critical section
start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel num_threads(4)
{
find_partial_max(arr, N);
}
end_time = std::chrono::high_resolution_clock::now();
auto parallel_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Parallel max (critical) = " << global_max << std::endl;
std::cout << "Parallel time (critical) = " << parallel_duration.count() << " microseconds" << std::endl;
return 0;
}
Example 3.2: Finding Maximum - Reduction Approach
#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>
#include <random>
double find_partial_max_reduction(const std::vector<double>& arr, int n) {
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
// Calculate work distribution
int elements_per_thread = n / num_threads;
int start = thread_id * elements_per_thread;
int end = (thread_id == num_threads - 1) ? n : start + elements_per_thread;
// Find partial maximum
double partial_max = arr[start];
for (int i = start + 1; i < end; i++) {
if (arr[i] > partial_max) {
partial_max = arr[i];
}
}
/*
std::cout << "Thread " << thread_id << " found max in range ["
<< start << ", " << end << ") = " << partial_max << std::endl;
*/
return partial_max;
}
int main() {
const int N = 1000000;
std::vector<double> arr(N);
// Initialize array with random values
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<> dis(0.0, 1000.0);
for (int i = 0; i < N; i++) {
arr[i] = dis(gen);
}
// Add a known maximum value
arr[N/2] = 9999.0;
double total_max = 0.0;
// Parallel computation with reduction
auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel num_threads(4) reduction(max:total_max)
{
total_max = find_partial_max_reduction(arr, N);
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Parallel max with reduction = " << total_max << std::endl;
std::cout << "Parallel time with reduction = " << duration.count() << " microseconds" << std::endl;
return 0;
}
Exercise 4:
1. Compare the performance of critical section vs reduction for finding maximum
2. Implement a similar program for finding the minimum value using reduction(min:variable)
3. Test with different array sizes (100,000, 1,000,000, 10,000,000) and record performance
Part 4: Variable Scope - Shared vs Private
4.1 Understanding Variable Scope
Example 4.1: Shared Variables Problem
#include <iostream>
#include <omp.h>
int main() {
int counter = 0;
std::cout << "Initial counter value: " << counter << std::endl;
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
// This is dangerous - race condition!
for (int i = 0; i < 1000; i++) {
counter++;
}
std::cout << "Thread " << thread_id << " finished" << std::endl;
}
std::cout << "Final counter value: " << counter << std::endl;
std::cout << "Expected value: " << 4 * 1000 << std::endl;
return 0;
}
Exercise 4:
- Run the above program multiple times. Do you get the expected result?
- Explain why the results are inconsistent.
Example 4.2: Using Private Variables
#include <iostream>
#include <omp.h>
int main() {
int shared_counter = 0;
int x = 100; // This will be shared by default
std::cout << "Before parallel region: x = " << x << std::endl;
#pragma omp parallel num_threads(4) private(x)
{
int thread_id = omp_get_thread_num();
// x is private to each thread, uninitialized
std::cout << "Thread " << thread_id << " initial x = " << x << std::endl;
// Each thread sets its own x
x = thread_id * 10;
int local_sum = 0;
for (int i = 0; i < 1000; i++) {
local_sum++;
}
std::cout << "Thread " << thread_id << " final x = " << x
<< ", local_sum = " << local_sum << std::endl;
// Safe update to shared variable
#pragma omp critical
shared_counter += local_sum;
}
std::cout << "After parallel region: x = " << x << std::endl;
std::cout << "Final shared_counter = " << shared_counter << std::endl;
return 0;
}
Example 4.3: Using firstprivate
#include <iostream>
#include <omp.h>
int main() {
int x = 100;
std::cout << "Before parallel region: x = " << x << std::endl;
#pragma omp parallel num_threads(4) firstprivate(x)
{
int thread_id = omp_get_thread_num();
// x is private to each thread, initialized with original value
std::cout << "Thread " << thread_id << " initial x = " << x << std::endl;
// Each thread modifies its own copy
x += thread_id * 10;
std::cout << "Thread " << thread_id << " final x = " << x << std::endl;
}
std::cout << "After parallel region: x = " << x << std::endl;
return 0;
}
Example 4.4: Using default(none)
#include <iostream>
#include <omp.h>
int main() {
int shared_var = 0;
int private_var = 10;
#pragma omp parallel num_threads(4) default(none) shared(shared_var) private(private_var)
{
int thread_id = omp_get_thread_num();
// private_var is uninitialized
private_var = thread_id * 5;
std::cout << "Thread " << thread_id << " private_var = "
<< private_var << std::endl;
#pragma omp critical
shared_var += private_var;
}
std::cout << "Final shared_var = " << shared_var << std::endl;
return 0;
}
Exercise 5:
- Try compiling Example 4.4 without specifying all variables in the default(none) clause (i.e., remove
shared
andprivate
clauses)
Part 5: Trapezoidal Rule Implementation
5.1 Serial Implementation
#include <iostream>
#include <chrono>
#include <cmath>
double f(double x) {
return x * x * x - x + 1; // f(x) = x³ - x + 1
}
double trapezoidal_rule(double a, double b, long n) {
double h = (b - a) / n;
double result = (f(a) + f(b)) / 2.0;
for (long i = 1; i < n; i++) {
double x = a + i * h;
result += f(x);
}
return result * h;
}
int main() {
double a = -1.0, b = 3.0;
long n = 100000000; // 100 million trapezoids
auto start = std::chrono::high_resolution_clock::now();
double result = trapezoidal_rule(a, b, n);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Serial Implementation:" << std::endl;
std::cout << "Integral approximation = " << result << std::endl;
std::cout << "Time = " << duration.count() << " milliseconds" << std::endl;
return 0;
}
5.2 Parallel Implementation with Manual Distribution
#include <iostream>
#include <omp.h>
#include <chrono>
#include <cmath>
double f(double x) {
return x * x * x - x + 1;
}
double trapezoidal_rule(double a, double b, long n) {
double h = (b - a) / n;
double result = (f(a) + f(b)) / 2.0;
for (long i = 1; i < n; i++) {
double x = a + i * h;
result += f(x);
}
return result * h;
}
double global_result = 0.0;
void compute_trapezoids(double a, double b, long n) {
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
// Calculate work distribution
long local_n = n / num_threads;
double local_interval = (b - a) / num_threads;
double local_a = a + thread_id * local_interval;
double local_b = local_a + local_interval;
std::cout << "Thread " << thread_id << " computing interval ["
<< local_a << ", " << local_b << "] with "
<< local_n << " trapezoids" << std::endl;
double local_result = trapezoidal_rule(local_a, local_b, local_n);
#pragma omp critical
global_result += local_result;
}
int main() {
double a = -1.0, b = 3.0;
long n = 100000000;
// Reset global result
global_result = 0.0;
auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel num_threads(4)
{
compute_trapezoids(a, b, n);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Parallel Implementation (Critical Section):" << std::endl;
std::cout << "Integral approximation = " << global_result << std::endl;
std::cout << "Time = " << duration.count() << " milliseconds" << std::endl;
return 0;
}
5.3 Parallel Implementation with Reduction
#include <iostream>
#include <omp.h>
#include <chrono>
#include <cmath>
double f(double x) {
return x * x * x - x + 1;
}
double trapezoidal_rule(double a, double b, long n) {
double h = (b - a) / n;
double result = (f(a) + f(b)) / 2.0;
for (long i = 1; i < n; i++) {
double x = a + i * h;
result += f(x);
}
return result * h;
}
double compute_trapezoids_reduction(double a, double b, long n) {
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
// Calculate work distribution
long local_n = n / num_threads;
double local_interval = (b - a) / num_threads;
double local_a = a + thread_id * local_interval;
double local_b = local_a + local_interval;
/*
std::cout << "Thread " << thread_id << " computing interval ["
<< local_a << ", " << local_b << "] with "
<< local_n << " trapezoids" << std::endl;
*/
return trapezoidal_rule(local_a, local_b, local_n);
}
int main() {
double a = -1.0, b = 3.0;
long n = 100000000;
double total_result = 0.0;
auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel num_threads(4) reduction(+:total_result)
{
total_result = compute_trapezoids_reduction(a, b, n);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Parallel Implementation (Reduction):" << std::endl;
std::cout << "Integral approximation = " << total_result << std::endl;
std::cout << "Time = " << duration.count() << " milliseconds" << std::endl;
return 0;
}
Performance Analysis Template
For each exercise involving performance measurement, record your results in the following format:
Tip
Data size: __
Number of Threads | Execution Time (ms) | Speedup | Efficiency |
---|---|---|---|
1 (Serial) | 1.0 | 1.0 | |
2 | |||
4 | |||
8 |
Formulas: - Speedup = Serial Time / Parallel Time - Efficiency = Speedup / Number of Threads
Submission Guidelines
- Do ALL exercises (1 to 5)
- Submit all source code files with proper comments
- Include a performance analysis report for exercises requiring timing
- Answer all questions posed in the exercises
- Demonstrate understanding of OpenMP concepts through code examples
- Include compilation commands and sample outputs