Lab(1): Introduction to Pthread

Lab Overview

This lab provides a hands-on introduction to concurrent programming in C++ using the POSIX Threads (Pthreads) library. You'll learn the fundamental concepts of creating and managing threads to perform tasks in parallel. The lab will guide you through thread creation and data passing. Moreover, you'll build a program that demonstrates the performance benefits of parallelism by comparing a sequential task to a multithreaded one. In the future lab, you'll learn about the critical problem of race conditions and how to solve it using mutexes for thread synchronization.

Lab Objectives

Write your first parallel program using Pthread Library
Learn how to create a thread to execute a job
Desktop browser and Internet connection to access Online GDB.

Prerequisites

Basic understanding of C++ (functions, pointers, etc.).
Familiarity with compiling and running C++ programs in a Unix-like environment (e.g., Linux).
A multi-core system to observe parallelism benefits.

Lab Submission

Do exercise 1.
Write your name and index in the first line of your program, every function and the main function
take a screenshot (or more) of the code including your name and your index.
take screenshots of the output screen.
Put all your work in a word document, convert it to PDF, and upload it to the LMS

1. Thread Creation

What is a Thread?

A thread is the smallest unit of execution within a process. A single process can have multiple threads running concurrently, all sharing the same memory space. This means they share global variables, the heap, and file descriptors. However, each thread has its own private program counter, stack, and set of registers. This shared memory model makes communication between threads easy but also introduces challenges that we will address later.

The main thread of your program starts when main() is called. This main thread can then create other threads, which we'll call child threads. This process is often referred to as forking.

Why Use Threads?

Parallelism: Execute multiple tasks simultaneously
Responsiveness: Keep applications responsive while performing background tasks
Resource sharing: Efficient communication between concurrent tasks
Performance: Better utilization of multi-core processors

Thread Forking and Joining

The lifecycle of a thread involves creating it (forking) and then potentially waiting for it to finish (joining).

Forking: The pthread_create function creates a new thread and makes it executable. The operating system's scheduler will then decide when to run it. After the call, both the original (main) thread and the new (forked) thread are running concurrently.
Joining: A thread can wait for another thread to complete its execution by calling pthread_join. This is a blocking call. The thread that calls pthread_join will pause its execution until the target thread terminates. This is essential for ensuring that results computed by a child thread are available to the parent thread before the program exits.

sequenceDiagram
    participant MainThread
    participant ThreadA
    participant ThreadB

    activate MainThread
    MainThread->>ThreadA: fork()
    activate ThreadA
    MainThread->>ThreadB: fork()
    activate ThreadB

    par Parallel Execution
        MainThread->>MainThread: Main Work 1
        MainThread->>MainThread: Main Work 2
    and
        ThreadA->>ThreadA: Work A1
        ThreadA->>ThreadA: Work A2
    and
        ThreadB->>ThreadB: Work B1
        ThreadB->>ThreadB: Work B2
    end

    MainThread->>ThreadA: join()
    deactivate MainThread
    rect rgb(255, 200, 200)
        ThreadA-->>MainThread: Completed
        deactivate ThreadA
    end
    activate MainThread

    MainThread->>ThreadB: join()
    deactivate MainThread
    rect rgb(255, 200, 200)
        ThreadB-->>MainThread: Completed
        deactivate ThreadB
    end
    activate MainThread 

    MainThread->>MainThread: Continues execution
    deactivate MainThread

Creating a Thread with Pthreads

The pthread_create function is used to fork a new thread. For simplicity, we’ll use default attributes (passing nullptr as the second argument). Here’s a step-by-step guide:

Step-by-Step Example

Include Headers: Use <pthread.h> for Pthreads and <iostream> for output.
Define a Thread Function: It must return void * and take a void * argument.
Create the Thread (Fork): Call pthread_create with a thread identifier, attributes, function, and argument.
Wait for Completion(Join): Use pthread_join to wait for the thread to finish.

Example Code

#include <pthread.h>
#include <iostream>

void* say_hello(void* arg) {
    std::cout << "Hello from the thread!\n";
    return nullptr;
}

int main() {
    pthread_t thread_id;

    int result = pthread_create(&thread_id, nullptr, say_hello, nullptr); // Default attributes
    if (result != 0) {
        std::cerr << "Error creating thread!" << std::endl;
        return 1;
    }

    pthread_join(thread_id, nullptr); // Wait for thread to finish
    std::cout << "Main thread finished.\n";
    return 0;
}

Explanation:

#include <pthread.h>: This line includes the necessary header file for Pthread functions.
pthread_t thread_id;: This declares a variable of type pthread_t, which is used to identify the thread.
pthread_create(&thread_id, nullptr, say_hello, nullptr);: This is the core function for creating a new thread.
- The first argument is a pointer to the pthread_t variable where the ID of the new thread will be stored.
- The second argument is used to specify thread attributes. We pass NULL to use the default attributes.
- The third argument is a pointer to the function that the new thread will execute. This function must have a void * (*)(void *) signature.
- The fourth argument is a pointer to the argument that will be passed to the thread function. In this case, we are not passing any arguments, so it is NULL.
pthread_join(thread_id, NULL);: This function makes the main thread wait for the newly created thread to terminate.

Compile and Run

g++ -o hello hello.cpp -lpthread
./hello

Sample Output

Hello from the thread!
Main thread finished.

Thread Creation Flow

sequenceDiagram
    participant MainThread
    participant HelloThread

    activate MainThread
    MainThread->>HelloThread: pthread_create()
    activate HelloThread
    MainThread->>HelloThread: pthread_join()
    deactivate MainThread
    Note over MainThread: Blocked waiting
    HelloThread->>HelloThread: std::cout << "Hello from thread!"
    HelloThread-->>MainThread: Thread Exit (return nullptr)
    deactivate HelloThread
    activate MainThread
    MainThread->>MainThread: std::cout << "Main thread finished."
    deactivate MainThread

2. Passing Arguments to a Thread

Threads often need data to perform tasks. Since pthread_create accepts a void * argument, you can pass any data type by casting it.

Passing a Single Value

You can pass a single value to a thread by casting a pointer to void *.

#include <iostream>
#include <pthread.h>

void* printMessage(void* msg) {
    // Cast the void* argument back to its original type
    char* message = (char*) msg;
    std::cout << message << std::endl;
    return nullptr;
}

int main() {
    pthread_t thread1, thread2;
    const char* message1 = "This is thread 1.";
    const char* message2 = "This is thread 2.";

    // Create threads and pass a string literal to each
    int result = pthread_create(&thread1, nullptr, printMessage, (void*) message1);
    if (result != 0) {
        std::cerr << "Error creating thread!" << std::endl;
        return 1;
    }

    result = pthread_create(&thread2, nullptr, printMessage, (void*) message2);
    if (result != 0) {
        std::cerr << "Error creating thread!" << std::endl;
        return 1;
    }

    // Wait for the threads to finish
    pthread_join(thread1, nullptr);
    pthread_join(thread2, nullptr);

    return 0;
}

Explanation:

We cast the const char* to void* when calling pthread_create.
Inside the printMessage function, we cast the void* argument back to char* to use it.

Sample Output:

This is thread 1.
This is thread 2.

Or

This is thread 2.
This is thread 1.

Example: Passing an Integer

#include <pthread.h>
#include <iostream>

void* print_number(void* arg) {
    int* number = static_cast<int*>(arg);
    std::cout << "Thread received: " << *number << "\n";
    return nullptr;
}

int main() {
    pthread_t thread_id;
    int value = 100;

    int result = pthread_create(&thread_id, nullptr, print_number, &value);
    if (result != 0) {
        std::cerr << "Error creating thread!" << std::endl;
        return 1;
    }

    pthread_join(thread_id, nullptr);
    std::cout << "Main thread done.\n";
    return 0;
}

Explanation:

Use static_cast to safely convert void* back to the original type.
Pass the address of the data (e.g., &value) to pthread_create.

Compile and Run

g++ -o number number.cpp -lpthread
./number

Sample Output

Thread received: 100
Main thread done.

Passing Multiple Values

To pass multiple values, you must bundle them into a struct and pass a pointer to an instance of that struct.

#include <iostream>
#include <pthread.h>
#include <string>

// A struct to hold multiple arguments for our thread
struct ThreadArgs {
    int id;
    std::string message;
};

void* printArgs(void* arg) {
    // Cast the void* back to our struct type
    ThreadArgs* args = (ThreadArgs*) arg;

    std::cout << "Thread ID: " << args->id
              << ", Message: '" << args->message << "'" << std::endl;

    return nullptr;
}

int main() {
    pthread_t thread1;

    // IMPORTANT: Allocate the struct on the heap, not the stack,
    // to ensure it still exists when the new thread accesses it.
    ThreadArgs* args = new ThreadArgs();
    args->id = 101;
    args->message = "Passing multiple values is easy!";

    // Create the thread, passing a pointer to the struct
    int result = pthread_create(&thread1, nullptr, printArgs, (void*) args);
    if (result != 0) {
        std::cerr << "Error creating thread!" << std::endl;
        return 1;
    }

    pthread_join(thread1, nullptr);

    // Free the dynamically allocated memory to avoid leaks
    delete args;

    return 0;
}

Sample Output:

Thread ID: 101, Message: 'Passing multiple values is easy!'

3. The Power of Parallelism

Now, let's explore why using multiple threads can significantly speed up our programs. We'll do this by writing a program to count prime numbers in a given range, first sequentially and then in parallel.

Measuring Execution Time: Monotonic Clocks

To measure the performance of our sequential and parallel code, we need an accurate timer. For benchmarking, we should use a monotonic clock. A monotonic clock is a clock that can only move forward and is not affected by system time changes (like a user manually changing the time or daylight saving adjustments). This property ensures that the time duration we measure between two points is always positive and accurately reflects the elapsed time. In C++, std::chrono::steady_clock is guaranteed to be monotonic and is the best choice for measuring time intervals.

Here are the steps to compute the execution time:

Include the header file chrono:
```
#include <chrono>
```

Set the clock to monotonic clock:

using clock = std::chrono::steady_clock;

Get the starting time-point before executing the task:
```
clock::time_point start = clock::now();
```
Get the ending time-point after executing the task:
```
clock::time_point end = clock::now();
```

Compute the time duration:

clock::duration execution_time = end - start;

Print out the duration time in (seconds, milliseconds, microseconds, etc):

std::cout << "Execution time is "
        << std::chrono::duration_cast<std::chrono::milliseconds>(execution_time).count() 
        << "ms." << std::endl;

The `isPrime()` Function

We need a function that does a fair amount of work to properly see the benefits of parallelism. Checking for prime numbers is a good example of a CPU-intensive task.

// A simple function to check if a number is prime

bool isPrime(int n) {
    if (n < 2) return false;    // no prime less than 2
    if (n == 2) return true;    // 2 is the only prime that is even
    if (n % 2 == 0) return false; // All even numbers are not prime.

    // Check odd divisors up to sqrt(n)
    for (int i = 3; i * i <= n; i+=2) {
        if (n % i == 0) return false;
    }
    return true;
}

Algorithm Explanation:

If the number n < 2, return false.
If the number n is 2, return true.
If the number n is an even number, return false
For each odd integer i from 3 to \(\sqrt n\):
If n % i == 0, return false.
Return true if no divisors are found.

We only need to check up to the square root because if a number n has a divisor larger than its square root, it must also have a divisor smaller than it.

Sequential Prime Counting

Let's establish our baseline with a sequential (single-threaded) implementation:

#include <iostream>
#include <chrono>

// isPrime() function from above

int countPrimesSequential(int start, int end) {
    int count = 0;
    for (int i = start; i <= end; i++) {
        if (isPrime(i)) {
            count++;
        }
    }
    return count;
}

int main() {
    const int START = 1;
    const int END = 5000000;

    std::cout << "Counting primes from " << START << " to " << END << " (sequential)" << std::endl;

    // Measure execution time
    auto start_time = std::chrono::high_resolution_clock::now();

    int primeCount = countPrimesSequential(START, END);

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    std::cout << "Found " << primeCount << " prime numbers" << std::endl;
    std::cout << "Sequential execution time: " << duration.count() << " milliseconds" << std::endl;

    return 0;
}

Sample Output:

Counting primes from 1 to 5000000 (sequential)
Found 348513 prime numbers
Sequential execution time: 756 milliseconds

Parallel Prime Counting

Now, let's parallelize this task by dividing the range among multiple threads.

countPrimesSequential() Analysis

Our sequential function processes numbers one by one in a single thread:

Input: Range of numbers (start to end)
Process: Check each number individually
Output: Count of prime numbers
Limitation: Uses only one CPU core

countPrimesParallel() Components

To parallelize this, we need three key components:

Component 1: Thread Data Structure

struct ThreadData {
    int start;        // Starting number for this thread
    int end;          // Ending number for this thread
    int count;        // Store the count of primes found
    int threadId;     // For identification and debugging
};

Component 2: Thread Function

void* countPrimesThread(void* arg) {
    ThreadData* data = static_cast<ThreadData*>(arg);

    std::cout << "Thread " << data->threadId 
              << " processing range [" << data->start 
              << ", " << data->end << "]" << std::endl;

    data->count = 0;
    for (int i = data->start; i <= data->end; i++) {
        if (isPrime(i)) {
            data->count++;
        }
    }

    std::cout << "Thread " << data->threadId 
              << " found " << data->count << " primes" << std::endl;

    return nullptr;
}

Component 3: Work Distribution Strategy Let us do computation decompsition (data parallelism) according to the lecture:

// Divide the work evenly among threads
int block_length = (END - START + 1) / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
    threadData[i].start = START + i * block_length;
    threadData[i].end = (i == NUM_THREADS - 1) ? END : threadData[i].start + block_length - 1;
    threadData[i].threadId = i;
}

Let assume that START = 0 and END = 999:

The diagram below visually shows how the range [0, 999] is partitioned into 4 equal segments (250 units each), with the last thread taking exactly 250 units in this case (since 1000 is divisible by 4). For uneven divisions (e.g., range 0-1000 with 4 threads), Thread 3 would get [750, 1000] (251 units).

flowchart TD
    subgraph MainThread["Main Thread"]
        direction TB
        A[START = 0] --> B[END = 999]
        C[NUM_THREADS = 4]
        D["block_length = (999 - 0 + 1) / 4 = 250"]

        subgraph Distribution["Work Distribution"]
            direction LR
            T0["Thread 0: [0, 249]"]:::thread
            T1["Thread 1: [250, 499]"]:::thread
            T2["Thread 2: [500, 749]"]:::thread
            T3["Thread 3: [750, 999]"]:::thread
        end

        B --> D --> Distribution
    end

    classDef thread fill:#9f9,stroke:#333,stroke-width:2px;

Key elements explained:

Initialization:
- START = 0, END = 999, NUM_THREADS = 4
- block_length = (999 - 0 + 1) / 4 = 250 (integer division)
Work Distribution:
- Thread 0: [0, 249] \(= (0 + 0×250)\) to \((0 + 0×250 + 250 - 1)\)
- Thread 1: [250, 499] \(= (0 + 1×250)\) to \((0 + 1×250 + 250 - 1)\)
- Thread 2: [500, 749] \(= (0 + 2×250)\) to \((0 + 2×250 + 250 - 1)\)
- Thread 3: [750, 999] \(= (0 + 3×250)\) to END (special last thread case)
Special Handling for Last Thread:
- The last thread (Thread 3) gets the remaining range
- Ensures all values from START to END are covered
- Handles cases where the range isn't perfectly divisible by thread count

How the Calculation Works:

int block_length = (END - START + 1) / NUM_THREADS;  // 1000/4 = 250

// Thread 0:
start = 0 + 0*250 = 0
end = 0 + 250 - 1 = 249

// Thread 1:
start = 0 + 1*250 = 250
end = 250 + 250 - 1 = 499

// Thread 2:
start = 0 + 2*250 = 500
end = 500 + 250 - 1 = 749

// Thread 3 (last thread):
start = 0 + 3*250 = 750
end = END = 999  // Instead of 750+250-1=999

This distribution ensures:

Equal workload for all threads (except possibly last thread)
Complete coverage from START to END
No overlapping ranges
Efficient division with O(1) calculation per thread
Handles remainder values automatically in last thread

Full Program

#include <iostream>
#include <pthread.h>
#include <chrono>

struct ThreadData {
    int start;
    int end;
    int count;
    int threadId;
};

bool isPrime(int n) {
    if (n < 2) return false;
    if (n == 2) return true;
    if (n % 2 == 0) return false;

    for (int i = 3; i * i <= n; i += 2) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

void* countPrimesThread(void* arg) {
    ThreadData* data = static_cast<ThreadData*>(arg);

    std::cout << "Thread " << data->threadId 
              << " processing range [" << data->start 
              << ", " << data->end << "]" << std::endl;

    data->count = 0;
    for (int i = data->start; i <= data->end; i++) {
        if (isPrime(i)) {
            data->count++;
        }
    }

    std::cout << "Thread " << data->threadId 
              << " found " << data->count << " primes" << std::endl;

    return nullptr;
}

int countPrimesParallel(int start, int end, int numThreads) {
    pthread_t* threads = new pthread_t[numThreads];
    ThreadData* threadData = new ThreadData[numThreads];

    // Distribute work among threads
    int block_length = (end - start + 1) / numThreads;

    std::cout << "Creating " << numThreads << " threads for parallel processing..." << std::endl;

    // Create threads
    for (int i = 0; i < numThreads; i++) {
        threadData[i].start = start + i * block_length;
        threadData[i].end = (i == numThreads - 1) ? end : threadData[i].start + block_length - 1;
        threadData[i].threadId = i;
        threadData[i].count = 0;

        int result = pthread_create(&threads[i], nullptr, countPrimesThread, &threadData[i]);
        if (result != 0) {
            std::cerr << "Error creating thread " << i << std::endl;
            return 1;
        }
    }

    // Wait for all threads to complete
    for (int i = 0; i < numThreads; i++) {
        pthread_join(threads[i], nullptr);
    }

    // Combine counts from all threads
    int totalPrimes = 0;
    for (int i = 0; i < numThreads; i++) {
        totalPrimes += threadData[i].count;
    }

    delete[] threads;
    delete[] threadData;

    return totalPrimes;
}

// Sequential version for comparison
int countPrimesSequential(int start, int end) {
    int count = 0;
    for (int i = start; i <= end; i++) {
        if (isPrime(i)) {
            count++;
        }
    }
    return count;
}

int main() {
    const int START = 1;
    const int END = 100000;
    const int NUM_THREADS = 4;

    std::cout << "=== Prime Number Counting Performance Comparison ===" << std::endl;
    std::cout << "Range: " << START << " to " << END << std::endl << std::endl;

    // Sequential execution
    std::cout << "1. Sequential Execution:" << std::endl;
    auto start_time = std::chrono::high_resolution_clock::now();
    int sequentialResult = countPrimesSequential(START, END);
    auto end_time = std::chrono::high_resolution_clock::now();
    auto sequentialDuration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    std::cout << "Found " << sequentialResult << " prime numbers" << std::endl;
    std::cout << "Sequential time: " << sequentialDuration.count() << " ms" << std::endl << std::endl;

    // Parallel execution
    std::cout << "2. Parallel Execution (" << NUM_THREADS << " threads):" << std::endl;
    start_time = std::chrono::high_resolution_clock::now();
    int parallelResult = countPrimesParallel(START, END, NUM_THREADS);
    end_time = std::chrono::high_resolution_clock::now();
    auto parallelDuration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    std::cout << "Found " << parallelResult << " prime numbers" << std::endl;
    std::cout << "Parallel time: " << parallelDuration.count() << " ms" << std::endl << std::endl;

    // Performance analysis
    std::cout << "=== Performance Analysis ===" << std::endl;
    std::cout << "Correctness check: " << (sequentialResult == parallelResult ? "PASSED" : "FAILED") << std::endl;

    if (parallelDuration.count() > 0) {
        double speedup = static_cast<double>(sequentialDuration.count()) / parallelDuration.count();
        std::cout << "Speedup: " << speedup << "x" << std::endl;
        std::cout << "Efficiency: " << (speedup / NUM_THREADS) * 100 << "%" << std::endl;
    }

    return 0;
}

The program above will first count primes sequentially and then do it in parallel, finally printing a comparison of the results.

Sample Output (will vary based on your machine):

=== Prime Number Counting Performance Comparison ===
Range: 1 to 500000

1. Sequential Execution:
Found 41538 prime numbers
Sequential time: 29 ms

2. Parallel Execution (4 threads):
Creating 4 threads for parallel processing...
Thread 0 processing range [1, 125000]
Thread 1 processing range [125001, 250000]
Thread 2 processing range [250001, 375000]
Thread 3 processing range [375001, 500000]
Thread 0 found 11734 primes
Thread 2 found 9860 primes
Thread 1 found 10310 primes
Thread 3 found 9634 primes
Found 41538 prime numbers
Parallel time: 15 ms

=== Performance Analysis ===
Correctness check: PASSED
Speedup: 1.93333x
Efficiency: 48.3333%

Exercise (1)

Parallel Vector Summation

Write a C++ program to calculate the sum of all elements in a large std::vector<int>.

Create a vector with 20,000,000 integers, all initialized to 1.
Calculate the sum sequentially and measure the time.
Calculate the sum in parallel using 4 threads. Each thread should sum its own portion of the vector.
The main thread must collect the partial sums from each thread to get the final total.
Verify the sequential and parallel sums are identical. Compare their execution times.

Lab(1): Introduction to Pthread

Lab Overview

Lab Objectives

Prerequisites

Lab Submission

1. Thread Creation

What is a Thread?

Why Use Threads?

Thread Forking and Joining

Creating a Thread with Pthreads

Step-by-Step Example

Example Code

Compile and Run

Sample Output

Thread Creation Flow

2. Passing Arguments to a Thread

Passing a Single Value

Passing Multiple Values

3. The Power of Parallelism

Measuring Execution Time: Monotonic Clocks

The isPrime() Function

Sequential Prime Counting

Parallel Prime Counting

countPrimesSequential() Analysis

countPrimesParallel() Components

Key elements explained:

How the Calculation Works:

Full Program

Parallel Vector Summation

The `isPrime()` Function