Lab(1): Introduction to Pthread
Lab Overview
This lab provides a hands-on introduction to concurrent programming in C++ using the POSIX Threads (Pthreads) library. You'll learn the fundamental concepts of creating and managing threads to perform tasks in parallel. The lab will guide you through thread creation and data passing. Moreover, you'll build a program that demonstrates the performance benefits of parallelism by comparing a sequential task to a multithreaded one. In the future lab, you'll learn about the critical problem of race conditions and how to solve it using mutexes for thread synchronization.
Lab Objectives
- Write your first parallel program using Pthread Library
- Learn how to create a thread to execute a job
- Desktop browser and Internet connection to access Online GDB.
Prerequisites
- Basic understanding of C++ (functions, pointers, etc.).
- Familiarity with compiling and running C++ programs in a Unix-like environment (e.g., Linux).
- A multi-core system to observe parallelism benefits.
Lab Submission
- Do exercise 1.
- Write your name and index in the first line of your program, every function and the main function
- take a screenshot (or more) of the code including your name and your index.
- take screenshots of the output screen.
- Put all your work in a word document, convert it to PDF, and upload it to the LMS
1. Thread Creation
What is a Thread?
A thread is the smallest unit of execution within a process. A single process can have multiple threads running concurrently, all sharing the same memory space. This means they share global variables, the heap, and file descriptors. However, each thread has its own private program counter, stack, and set of registers. This shared memory model makes communication between threads easy but also introduces challenges that we will address later.
The main thread of your program starts when main()
is called. This main thread can then create other threads, which we'll call child threads. This process is often referred to as forking.
Why Use Threads?
- Parallelism: Execute multiple tasks simultaneously
- Responsiveness: Keep applications responsive while performing background tasks
- Resource sharing: Efficient communication between concurrent tasks
- Performance: Better utilization of multi-core processors
Thread Forking and Joining
The lifecycle of a thread involves creating it (forking) and then potentially waiting for it to finish (joining).
-
Forking: The
pthread_create
function creates a new thread and makes it executable. The operating system's scheduler will then decide when to run it. After the call, both the original (main) thread and the new (forked) thread are running concurrently. -
Joining: A thread can wait for another thread to complete its execution by calling
pthread_join
. This is a blocking call. The thread that callspthread_join
will pause its execution until the target thread terminates. This is essential for ensuring that results computed by a child thread are available to the parent thread before the program exits.
sequenceDiagram
participant MainThread
participant ThreadA
participant ThreadB
activate MainThread
MainThread->>ThreadA: fork()
activate ThreadA
MainThread->>ThreadB: fork()
activate ThreadB
par Parallel Execution
MainThread->>MainThread: Main Work 1
MainThread->>MainThread: Main Work 2
and
ThreadA->>ThreadA: Work A1
ThreadA->>ThreadA: Work A2
and
ThreadB->>ThreadB: Work B1
ThreadB->>ThreadB: Work B2
end
MainThread->>ThreadA: join()
deactivate MainThread
rect rgb(255, 200, 200)
ThreadA-->>MainThread: Completed
deactivate ThreadA
end
activate MainThread
MainThread->>ThreadB: join()
deactivate MainThread
rect rgb(255, 200, 200)
ThreadB-->>MainThread: Completed
deactivate ThreadB
end
activate MainThread
MainThread->>MainThread: Continues execution
deactivate MainThread
Creating a Thread with Pthreads
The pthread_create
function is used to fork a new thread. For simplicity, we’ll use default attributes (passing nullptr
as the second argument). Here’s a step-by-step guide:
Step-by-Step Example
- Include Headers: Use
<pthread.h>
for Pthreads and<iostream>
for output. - Define a Thread Function: It must return
void *
and take avoid *
argument. - Create the Thread (Fork): Call
pthread_create
with a thread identifier, attributes, function, and argument. - Wait for Completion(Join): Use
pthread_join
to wait for the thread to finish.
Example Code
#include <pthread.h>
#include <iostream>
void* say_hello(void* arg) {
std::cout << "Hello from the thread!\n";
return nullptr;
}
int main() {
pthread_t thread_id;
int result = pthread_create(&thread_id, nullptr, say_hello, nullptr); // Default attributes
if (result != 0) {
std::cerr << "Error creating thread!" << std::endl;
return 1;
}
pthread_join(thread_id, nullptr); // Wait for thread to finish
std::cout << "Main thread finished.\n";
return 0;
}
#include <pthread.h>
: This line includes the necessary header file for Pthread functions.pthread_t thread_id;
: This declares a variable of type pthread_t, which is used to identify the thread.pthread_create(&thread_id, nullptr, say_hello, nullptr);
: This is the core function for creating a new thread.- The first argument is a pointer to the pthread_t variable where the ID of the new thread will be stored.
- The second argument is used to specify thread attributes. We pass
NULL
to use the default attributes. - The third argument is a pointer to the function that the new thread will execute. This function must have a
void * (*)(void *)
signature. - The fourth argument is a pointer to the argument that will be passed to the thread function. In this case, we are not passing any arguments, so it is
NULL
.
pthread_join(thread_id, NULL);
: This function makes the main thread wait for the newly created thread to terminate.
Compile and Run
Sample Output
Thread Creation Flow
sequenceDiagram
participant MainThread
participant HelloThread
activate MainThread
MainThread->>HelloThread: pthread_create()
activate HelloThread
MainThread->>HelloThread: pthread_join()
deactivate MainThread
Note over MainThread: Blocked waiting
HelloThread->>HelloThread: std::cout << "Hello from thread!"
HelloThread-->>MainThread: Thread Exit (return nullptr)
deactivate HelloThread
activate MainThread
MainThread->>MainThread: std::cout << "Main thread finished."
deactivate MainThread
2. Passing Arguments to a Thread
Threads often need data to perform tasks. Since pthread_create
accepts a void *
argument, you can pass any data type by casting it.
Passing a Single Value
You can pass a single value to a thread by casting a pointer to void *
.
#include <iostream>
#include <pthread.h>
void* printMessage(void* msg) {
// Cast the void* argument back to its original type
char* message = (char*) msg;
std::cout << message << std::endl;
return nullptr;
}
int main() {
pthread_t thread1, thread2;
const char* message1 = "This is thread 1.";
const char* message2 = "This is thread 2.";
// Create threads and pass a string literal to each
int result = pthread_create(&thread1, nullptr, printMessage, (void*) message1);
if (result != 0) {
std::cerr << "Error creating thread!" << std::endl;
return 1;
}
result = pthread_create(&thread2, nullptr, printMessage, (void*) message2);
if (result != 0) {
std::cerr << "Error creating thread!" << std::endl;
return 1;
}
// Wait for the threads to finish
pthread_join(thread1, nullptr);
pthread_join(thread2, nullptr);
return 0;
}
- We cast the
const char*
tovoid*
when callingpthread_create
. - Inside the
printMessage
function, we cast thevoid*
argument back tochar*
to use it.
Sample Output:
OrExample: Passing an Integer
#include <pthread.h>
#include <iostream>
void* print_number(void* arg) {
int* number = static_cast<int*>(arg);
std::cout << "Thread received: " << *number << "\n";
return nullptr;
}
int main() {
pthread_t thread_id;
int value = 100;
int result = pthread_create(&thread_id, nullptr, print_number, &value);
if (result != 0) {
std::cerr << "Error creating thread!" << std::endl;
return 1;
}
pthread_join(thread_id, nullptr);
std::cout << "Main thread done.\n";
return 0;
}
Explanation:
- Use
static_cast
to safely convertvoid*
back to the original type. - Pass the address of the data (e.g.,
&value
) topthread_create
.
Compile and Run
Sample Output
Passing Multiple Values
To pass multiple values, you must bundle them into a struct and pass a pointer to an instance of that struct.
#include <iostream>
#include <pthread.h>
#include <string>
// A struct to hold multiple arguments for our thread
struct ThreadArgs {
int id;
std::string message;
};
void* printArgs(void* arg) {
// Cast the void* back to our struct type
ThreadArgs* args = (ThreadArgs*) arg;
std::cout << "Thread ID: " << args->id
<< ", Message: '" << args->message << "'" << std::endl;
return nullptr;
}
int main() {
pthread_t thread1;
// IMPORTANT: Allocate the struct on the heap, not the stack,
// to ensure it still exists when the new thread accesses it.
ThreadArgs* args = new ThreadArgs();
args->id = 101;
args->message = "Passing multiple values is easy!";
// Create the thread, passing a pointer to the struct
int result = pthread_create(&thread1, nullptr, printArgs, (void*) args);
if (result != 0) {
std::cerr << "Error creating thread!" << std::endl;
return 1;
}
pthread_join(thread1, nullptr);
// Free the dynamically allocated memory to avoid leaks
delete args;
return 0;
}
Sample Output:
3. The Power of Parallelism
Now, let's explore why using multiple threads can significantly speed up our programs. We'll do this by writing a program to count prime numbers in a given range, first sequentially and then in parallel.
Measuring Execution Time: Monotonic Clocks
To measure the performance of our sequential and parallel code, we need an accurate timer. For benchmarking, we should use a monotonic clock. A monotonic clock is a clock that can only move forward and is not affected by system time changes (like a user manually changing the time or daylight saving adjustments). This property ensures that the time duration we measure between two points is always positive and accurately reflects the elapsed time. In C++, std::chrono::steady_clock
is guaranteed to be monotonic and is the best choice for measuring time intervals.
Here are the steps to compute the execution time:
- Include the header file
chrono
: - Set the clock to monotonic clock:
- Get the starting time-point before executing the task:
- Get the ending time-point after executing the task:
- Compute the time duration:
- Print out the duration time in (
seconds
,milliseconds
,microseconds
, etc):
The isPrime()
Function
We need a function that does a fair amount of work to properly see the benefits of parallelism. Checking for prime numbers is a good example of a CPU-intensive task.
// A simple function to check if a number is prime
bool isPrime(int n) {
if (n < 2) return false; // no prime less than 2
if (n == 2) return true; // 2 is the only prime that is even
if (n % 2 == 0) return false; // All even numbers are not prime.
// Check odd divisors up to sqrt(n)
for (int i = 3; i * i <= n; i+=2) {
if (n % i == 0) return false;
}
return true;
}
Algorithm Explanation:
- If the number
n
< 2, returnfalse
. - If the number
n
is 2, returntrue
. - If the number
n
is an even number, returnfalse
- For each odd integer
i
from 3 to \(\sqrt n\): - If
n % i == 0
, returnfalse
. - Return
true
if no divisors are found.
We only need to check up to the square root because if a number n
has a divisor larger than its square root, it must also have a divisor smaller than it.
Sequential Prime Counting
Let's establish our baseline with a sequential (single-threaded) implementation:
#include <iostream>
#include <chrono>
// isPrime() function from above
int countPrimesSequential(int start, int end) {
int count = 0;
for (int i = start; i <= end; i++) {
if (isPrime(i)) {
count++;
}
}
return count;
}
int main() {
const int START = 1;
const int END = 5000000;
std::cout << "Counting primes from " << START << " to " << END << " (sequential)" << std::endl;
// Measure execution time
auto start_time = std::chrono::high_resolution_clock::now();
int primeCount = countPrimesSequential(START, END);
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
std::cout << "Found " << primeCount << " prime numbers" << std::endl;
std::cout << "Sequential execution time: " << duration.count() << " milliseconds" << std::endl;
return 0;
}
Counting primes from 1 to 5000000 (sequential)
Found 348513 prime numbers
Sequential execution time: 756 milliseconds
Parallel Prime Counting
Now, let's parallelize this task by dividing the range among multiple threads.
countPrimesSequential() Analysis
Our sequential function processes numbers one by one in a single thread:
- Input: Range of numbers (start to end)
- Process: Check each number individually
- Output: Count of prime numbers
- Limitation: Uses only one CPU core
countPrimesParallel() Components
To parallelize this, we need three key components:
Component 1: Thread Data Structure
struct ThreadData {
int start; // Starting number for this thread
int end; // Ending number for this thread
int count; // Store the count of primes found
int threadId; // For identification and debugging
};
Component 2: Thread Function
void* countPrimesThread(void* arg) {
ThreadData* data = static_cast<ThreadData*>(arg);
std::cout << "Thread " << data->threadId
<< " processing range [" << data->start
<< ", " << data->end << "]" << std::endl;
data->count = 0;
for (int i = data->start; i <= data->end; i++) {
if (isPrime(i)) {
data->count++;
}
}
std::cout << "Thread " << data->threadId
<< " found " << data->count << " primes" << std::endl;
return nullptr;
}
Component 3: Work Distribution Strategy Let us do computation decompsition (data parallelism) according to the lecture:
// Divide the work evenly among threads
int block_length = (END - START + 1) / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
threadData[i].start = START + i * block_length;
threadData[i].end = (i == NUM_THREADS - 1) ? END : threadData[i].start + block_length - 1;
threadData[i].threadId = i;
}
Let assume that START = 0 and END = 999:
The diagram below visually shows how the range [0, 999] is partitioned into 4 equal segments (250 units each), with the last thread taking exactly 250 units in this case (since 1000 is divisible by 4). For uneven divisions (e.g., range 0-1000 with 4 threads), Thread 3 would get [750, 1000] (251 units).
flowchart TD
subgraph MainThread["Main Thread"]
direction TB
A[START = 0] --> B[END = 999]
C[NUM_THREADS = 4]
D["block_length = (999 - 0 + 1) / 4 = 250"]
subgraph Distribution["Work Distribution"]
direction LR
T0["Thread 0: [0, 249]"]:::thread
T1["Thread 1: [250, 499]"]:::thread
T2["Thread 2: [500, 749]"]:::thread
T3["Thread 3: [750, 999]"]:::thread
end
B --> D --> Distribution
end
classDef thread fill:#9f9,stroke:#333,stroke-width:2px;
Key elements explained:
-
Initialization:
START = 0
,END = 999
,NUM_THREADS = 4
block_length = (999 - 0 + 1) / 4 = 250
(integer division)
-
Work Distribution:
- Thread 0:
[0, 249]
\(= (0 + 0×250)\) to \((0 + 0×250 + 250 - 1)\) - Thread 1:
[250, 499]
\(= (0 + 1×250)\) to \((0 + 1×250 + 250 - 1)\) - Thread 2:
[500, 749]
\(= (0 + 2×250)\) to \((0 + 2×250 + 250 - 1)\) - Thread 3:
[750, 999]
\(= (0 + 3×250)\) to END (special last thread case)
- Thread 0:
-
Special Handling for Last Thread:
- The last thread (Thread 3) gets the remaining range
- Ensures all values from START to END are covered
- Handles cases where the range isn't perfectly divisible by thread count
How the Calculation Works:
int block_length = (END - START + 1) / NUM_THREADS; // 1000/4 = 250
// Thread 0:
start = 0 + 0*250 = 0
end = 0 + 250 - 1 = 249
// Thread 1:
start = 0 + 1*250 = 250
end = 250 + 250 - 1 = 499
// Thread 2:
start = 0 + 2*250 = 500
end = 500 + 250 - 1 = 749
// Thread 3 (last thread):
start = 0 + 3*250 = 750
end = END = 999 // Instead of 750+250-1=999
This distribution ensures:
- Equal workload for all threads (except possibly last thread)
- Complete coverage from START to END
- No overlapping ranges
- Efficient division with O(1) calculation per thread
- Handles remainder values automatically in last thread
Full Program
#include <iostream>
#include <pthread.h>
#include <chrono>
struct ThreadData {
int start;
int end;
int count;
int threadId;
};
bool isPrime(int n) {
if (n < 2) return false;
if (n == 2) return true;
if (n % 2 == 0) return false;
for (int i = 3; i * i <= n; i += 2) {
if (n % i == 0) {
return false;
}
}
return true;
}
void* countPrimesThread(void* arg) {
ThreadData* data = static_cast<ThreadData*>(arg);
std::cout << "Thread " << data->threadId
<< " processing range [" << data->start
<< ", " << data->end << "]" << std::endl;
data->count = 0;
for (int i = data->start; i <= data->end; i++) {
if (isPrime(i)) {
data->count++;
}
}
std::cout << "Thread " << data->threadId
<< " found " << data->count << " primes" << std::endl;
return nullptr;
}
int countPrimesParallel(int start, int end, int numThreads) {
pthread_t* threads = new pthread_t[numThreads];
ThreadData* threadData = new ThreadData[numThreads];
// Distribute work among threads
int block_length = (end - start + 1) / numThreads;
std::cout << "Creating " << numThreads << " threads for parallel processing..." << std::endl;
// Create threads
for (int i = 0; i < numThreads; i++) {
threadData[i].start = start + i * block_length;
threadData[i].end = (i == numThreads - 1) ? end : threadData[i].start + block_length - 1;
threadData[i].threadId = i;
threadData[i].count = 0;
int result = pthread_create(&threads[i], nullptr, countPrimesThread, &threadData[i]);
if (result != 0) {
std::cerr << "Error creating thread " << i << std::endl;
return 1;
}
}
// Wait for all threads to complete
for (int i = 0; i < numThreads; i++) {
pthread_join(threads[i], nullptr);
}
// Combine counts from all threads
int totalPrimes = 0;
for (int i = 0; i < numThreads; i++) {
totalPrimes += threadData[i].count;
}
delete[] threads;
delete[] threadData;
return totalPrimes;
}
// Sequential version for comparison
int countPrimesSequential(int start, int end) {
int count = 0;
for (int i = start; i <= end; i++) {
if (isPrime(i)) {
count++;
}
}
return count;
}
int main() {
const int START = 1;
const int END = 100000;
const int NUM_THREADS = 4;
std::cout << "=== Prime Number Counting Performance Comparison ===" << std::endl;
std::cout << "Range: " << START << " to " << END << std::endl << std::endl;
// Sequential execution
std::cout << "1. Sequential Execution:" << std::endl;
auto start_time = std::chrono::high_resolution_clock::now();
int sequentialResult = countPrimesSequential(START, END);
auto end_time = std::chrono::high_resolution_clock::now();
auto sequentialDuration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
std::cout << "Found " << sequentialResult << " prime numbers" << std::endl;
std::cout << "Sequential time: " << sequentialDuration.count() << " ms" << std::endl << std::endl;
// Parallel execution
std::cout << "2. Parallel Execution (" << NUM_THREADS << " threads):" << std::endl;
start_time = std::chrono::high_resolution_clock::now();
int parallelResult = countPrimesParallel(START, END, NUM_THREADS);
end_time = std::chrono::high_resolution_clock::now();
auto parallelDuration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
std::cout << "Found " << parallelResult << " prime numbers" << std::endl;
std::cout << "Parallel time: " << parallelDuration.count() << " ms" << std::endl << std::endl;
// Performance analysis
std::cout << "=== Performance Analysis ===" << std::endl;
std::cout << "Correctness check: " << (sequentialResult == parallelResult ? "PASSED" : "FAILED") << std::endl;
if (parallelDuration.count() > 0) {
double speedup = static_cast<double>(sequentialDuration.count()) / parallelDuration.count();
std::cout << "Speedup: " << speedup << "x" << std::endl;
std::cout << "Efficiency: " << (speedup / NUM_THREADS) * 100 << "%" << std::endl;
}
return 0;
}
The program above will first count primes sequentially and then do it in parallel, finally printing a comparison of the results.
Sample Output (will vary based on your machine):
=== Prime Number Counting Performance Comparison ===
Range: 1 to 500000
1. Sequential Execution:
Found 41538 prime numbers
Sequential time: 29 ms
2. Parallel Execution (4 threads):
Creating 4 threads for parallel processing...
Thread 0 processing range [1, 125000]
Thread 1 processing range [125001, 250000]
Thread 2 processing range [250001, 375000]
Thread 3 processing range [375001, 500000]
Thread 0 found 11734 primes
Thread 2 found 9860 primes
Thread 1 found 10310 primes
Thread 3 found 9634 primes
Found 41538 prime numbers
Parallel time: 15 ms
=== Performance Analysis ===
Correctness check: PASSED
Speedup: 1.93333x
Efficiency: 48.3333%
Exercise (1)
Parallel Vector Summation
Write a C++ program to calculate the sum of all elements in a large std::vector<int>
.
- Create a vector with 20,000,000 integers, all initialized to
1
. - Calculate the sum sequentially and measure the time.
- Calculate the sum in parallel using 4 threads. Each thread should sum its own portion of the vector.
- The main thread must collect the partial sums from each thread to get the final total.
- Verify the sequential and parallel sums are identical. Compare their execution times.