Lab 3: Data Parallelism
Lab Overview
In this lab, you will explore the fundamental concepts of data parallelism and learn how different data partitioning strategies can dramatically impact the performance of parallel programs. Using the computationally intensive Fibonacci sequence as our test case, you'll discover why naive parallelization sometimes fails and how intelligent data partitioning can unlock the true power of parallel computing.
You'll implement and compare two fundamental data partition strategies:
- Block Partitioning: Assigns contiguous chunks of data to each processing unit.
- Cyclic Partitioning: Distributes data in a round-robin fashion for better load balancing.
Additionally, you will conduct a Performance Analysis to evaluate the effectiveness of each approach.
By the end of this lab, you’ll understand why parallelization isn’t always faster and how to design scalable parallel algorithms optimized for multi-core processors.
Lab Objectives
By completing this lab, you will be able to:
- Understand the critical importance of data partitioning in parallel computing
- Compare block vs. cyclic data partitioning strategies and their performance implications
- Analyze load balancing challenges and explore effective solutions
- Measure and evaluate execution time improvements (or lack thereof) in parallel programs
- Debug common pitfalls in parallel programming, especially uneven workload distribution
- Apply appropriate data partitioning strategies based on problem characteristics
Prerequisites
- Be familiar with POSIX pthreads (covered in Lab 1)
- Read lecture note: "Parallel Hardware Part (2)"
Part 1: Understanding the Problem - Fibonacci Sequence
Mathematical Foundation
The Fibonacci sequence follows this recurrence relation:
\[ fib(n) = fib(n-1) + fib(n-2), \qquad fib(0) = fib(1) = 1 \]
The sequence begins as:
- \(fib(0) = 1\)
- \(fib(1) = 1\)
- \(fib(2) = fib(1) + fib(0) = 1 + 1 = 2\)
- \(fib(3) = fib(2) + fib(1) = 2 + 1 = 3\)
- \(fib(4) = fib(3) + fib(2) = 3 + 2 = 5\)
- and so on.
A simple recursive function computes the Fibonacci numbers; it appears in the test program below. We use a recursive implementation because:
- Exponential time complexity: \(O(2^n)\) - creates significant computational load
- Uneven computation cost: fib(45) takes much longer than fib(20).
Run this test program to see execution time for different Fibonacci values, and observe the exponential growth in execution time:
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
unsigned long long fib(unsigned int n)
{
if (n <= 1)
return 1;
else
return fib(n-1) + fib(n-2);
}
int main()
{
cout << "Observing Fibonacci execution times:\n";
for (int n : {25, 30, 35, 40, 45}) {
auto start = high_resolution_clock::now();
unsigned long long result = fib(n);
auto end = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(end - start);
cout << "fib(" << n << ") = " << result
<< " (Time: " << duration.count() << "ms)\n";
}
return 0;
}
When you run the program, each value appears together with its execution time; the times grow exponentially with \(n\), as a plot of execution time versus \(n\) makes clear.
Sequential Fibonacci Series
Let us print the Fibonacci series from fib(0) up to fib(47) sequentially. See the code at OnlineGDB: click the button, run the code, and observe the execution time.
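Since the full listing is hosted externally, here is a minimal sketch of the sequential version (assumed to match the OnlineGDB code in spirit; details may differ):
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
// Same recursive definition used throughout this lab: fib(0) = fib(1) = 1
unsigned long long fib(unsigned int n)
{
    return (n <= 1) ? 1 : fib(n - 1) + fib(n - 2);
}
int main()
{
    auto start = high_resolution_clock::now();
    // Print the whole series, fib(0) through fib(47), one value at a time
    for (int n = 0; n <= 47; n++)
        cout << "fib(" << n << ") = " << fib(n) << endl;
    auto end = high_resolution_clock::now();
    cout << "Sequential time: "
         << duration_cast<seconds>(end - start).count() << " seconds" << endl;
    return 0;
}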
Part 2: Naive Parallelization - Block Partition
The Obvious Approach
Our first attempt at parallelization uses block partitioning, where each thread processes a contiguous chunk of data:
n: [0, 1, 2, 3, 4, 5, 6, ..., 45, 46, 47]
4 Threads:
Thread 0 -> [0, 1, 2, ..., 11] (12 elements)
Thread 1 -> [12, 13, 14, ..., 23] (12 elements)
Thread 2 -> [24, 25, 26, ..., 35] (12 elements)
Thread 3 -> [36, 37, 38, ..., 47] (12 elements)
Implementation Components
Component 1: Thread Data Structure:
struct ThreadData {
int start; // Starting number for this thread
int end; // Ending number for this thread
int threadId; // For identification and debugging
};
Component 2: Thread Function: A modified function computes and prints Fibonacci numbers in the assigned range:
void* printFibThread(void *arg)
{
ThreadData *p = static_cast<ThreadData *>(arg);
for (int n = p->start; n <= p->end; n += 1) {
unsigned long long x = fib(n);
cerr << "fib(" << n << ") = " << x << endl;
}
return nullptr;
}
Component 3: Work Distribution Logic: We divide the workload evenly among threads:
// compute block size for partitioning
int block_length = (END - START + 1) / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
threadData[i].start = START + i * block_length;
threadData[i].end = (i == NUM_THREADS - 1) ? END : threadData[i].start + block_length - 1;
threadData[i].threadId = i;
}
NUM_THREADS is the number of processors/threads available. If the range size is not evenly divisible by NUM_THREADS, the ternary expression assigns the leftover elements to the last thread.
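With this lab's parameters (START = 0, END = 47, NUM_THREADS = 4), block_length = 48 / 4 = 12, so thread i receives the range [12i, 12i + 11], exactly as in the diagram above.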
Full Program Using Block Partition
#include <iostream>
#include <pthread.h>
#include <chrono>
using namespace std;
using namespace std::chrono;
struct ThreadData {
int start;
int end;
int threadId;
};
unsigned long long fib(unsigned int n)
{
if (n <= 1)
return 1;
else
return fib(n-1) + fib(n-2);
}
void* printFibThread(void *arg)
{
ThreadData *data = static_cast<ThreadData *>(arg);
cout << "Thread " << data->threadId << " starting work on range ["
<< data->start << ", " << data->end << "]" << endl;
for (int n = data->start; n <= data->end; n++) {
unsigned long long result = fib(n);
cout << "Thread " << data->threadId
<< ": fib(" << n << ") = " << result << endl;
}
cout << "Thread " << data->threadId << " finished!" << endl;
return nullptr;
}
int main()
{
const int START = 0;
const int END = 47;
const int NUM_THREADS = 4;
pthread_t threads[NUM_THREADS];
ThreadData threadData[NUM_THREADS];
cout << "=== BLOCK PARTITION PARALLEL FIBONACCI ===" << endl;
cout << "Computing Fibonacci(" << START << ") to Fibonacci(" << END
<< ") using " << NUM_THREADS << " threads" << endl;
auto startTime = high_resolution_clock::now();
// Distribute work in blocks
int blockSize = (END - START + 1) / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
threadData[i].start = START + i * blockSize;
threadData[i].end = (i == NUM_THREADS - 1) ? END : threadData[i].start + blockSize - 1;
threadData[i].threadId = i;
cout << "Creating thread " << i << " for range ["
<< threadData[i].start << ", " << threadData[i].end << "]" << endl;
int result = pthread_create(&threads[i], nullptr, printFibThread, &threadData[i]);
if (result != 0) {
cerr << "Error creating thread " << i << endl;
return -1;
}
}
// Wait for all threads to complete
for (int i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], nullptr);
}
auto endTime = high_resolution_clock::now();
auto duration = duration_cast<seconds>(endTime - startTime);
cout << "\n=== EXECUTION COMPLETED ===" << endl;
cout << "Total parallel execution time: " << duration.count() << " seconds" << endl;
return 0;
}
Block Partitioning: Why It Fails
When running the program, you'll observe:
- Threads 0, 1, and 2 finish quickly, computing smaller Fibonacci numbers.
- Thread 3 takes significantly longer, computing fib(36) through fib(47).
- No speedup over sequential execution, as the workload is unevenly distributed.
This occurs because higher Fibonacci numbers require exponentially more computation: the cost of fib(n) grows roughly like \(\phi^n\) with \(\phi \approx 1.618\), so thread 3's range dominates the total work and the parallel time ends up close to the sequential time.
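To see the imbalance concretely, the following standalone sketch (an illustration only, not part of the lab code) models the cost of fib(n) as \(\phi^n\) and estimates each thread's share of the total work under block partitioning:
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
    const double phi = 1.6180339887;              // golden ratio
    double blockWork[4] = {0.0, 0.0, 0.0, 0.0};
    double total = 0.0;
    // Model the cost of fib(n) as phi^n and accumulate per 12-element block
    for (int n = 0; n <= 47; n++) {
        double cost = pow(phi, n);
        blockWork[n / 12] += cost;
        total += cost;
    }
    for (int i = 0; i < 4; i++)
        cout << "Thread " << i << " estimated share: "
             << 100.0 * blockWork[i] / total << " %" << endl;
    return 0;
}
Running it shows that thread 3 holds well over 99% of the estimated work, which explains why the block version is barely faster than sequential.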
Exercise 1
Tasks: Modify the block partition implementation to calculate key performance metrics:
- Measure sequential execution time for comparison.
- Compute speedup: \[ \text{Speedup} = \frac{\text{Sequential Time}}{\text{Parallel Time}} \]
- Calculate efficiency: \[ \text{Efficiency} = \frac{\text{Speedup}}{\text{Number of Threads}} \]
- Evaluate load-balancing effectiveness (a per-thread timing sketch follows this list): \[ \text{Load Balance} = \frac{\text{Minimum Thread Time}}{\text{Maximum Thread Time}} \]
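One way to obtain per-thread times is to let each thread record its own elapsed time. A minimal sketch, assuming you extend ThreadData with an extra field (the name elapsedSec is illustrative):
#include <pthread.h>
#include <chrono>
unsigned long long fib(unsigned int n);   // as defined in the block program
struct ThreadData {
    int start;
    int end;
    int threadId;
    double elapsedSec;   // written by the thread itself
};
void* printFibThread(void *arg)
{
    ThreadData *data = static_cast<ThreadData *>(arg);
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int n = data->start; n <= data->end; n++)
        fib(n);
    auto t1 = std::chrono::high_resolution_clock::now();
    data->elapsedSec = std::chrono::duration<double>(t1 - t0).count();
    return nullptr;
}
After joining all threads in main(), the load balance is simply the smallest elapsedSec divided by the largest.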
After conducting the experiment, write a brief commentary on your findings. Address the following points:
- Did the block partitioning approach improve execution time? Why or why not?
- Which thread experienced the longest execution time, and what does that indicate?
- Was the workload balanced across threads? How could it be improved?
- How do your speedup and efficiency results compare to ideal parallel scaling?
- If you were to optimize the partitioning strategy, what changes would you make?
Part 3: Cyclic Partition - The Solution
The Round-Robin Approach
Instead of giving each thread contiguous blocks, we distribute work in a round-robin fashion:
n: [0, 1, 2, 3, ..., 46, 47]
4 Threads:
Thread 0: [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44]
Thread 1: [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45]
Thread 2: [2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46]
Thread 3: [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]
Notice that T0 (Thread 0) gets data item 0, T1 gets data item 1, T2 gets data item 2, T3 gets data item 3, T0 gets data item 4, and so on. This distribution is known as cyclic partitioning.
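Formally, with \(T\) threads, thread \(t\) is assigned every index \(n\) in the range satisfying \(n \bmod T = t\).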
Cyclic Distribution Implementation
void* cyclicFibThread(void *arg)
{
ThreadData *data = static_cast<ThreadData *>(arg);
cout << "Thread " << data->threadId << " starting cyclic work" << endl;
// Cyclic partition: thread i gets elements i, i+NUM_THREADS, i+2*NUM_THREADS, ...
for (int n = data->start + data->threadId; n <= data->end; n += NUM_THREADS) {
unsigned long long result = fib(n);
cout << "Thread " << data->threadId
<< ": fib(" << n << ") = " << result << endl;
}
cout << "Thread " << data->threadId << " finished cyclic work!" << endl;
return nullptr;
}
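Because the loop begins at data->start + data->threadId and strides by NUM_THREADS, every thread must receive the same overall range. A sketch of how main() might fill ThreadData for the cyclic version (completing the full program is Exercise 2):
// Every thread is handed the full [START, END] range; the stride
// inside cyclicFibThread selects this thread's elements.
for (int i = 0; i < NUM_THREADS; i++) {
    threadData[i].start = START;
    threadData[i].end = END;
    threadData[i].threadId = i;
    pthread_create(&threads[i], nullptr, cyclicFibThread, &threadData[i]);
}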
Key Learning Outcomes
1. Data Distribution Strategies
| Strategy | Description | Best For | Worst For |
|---|---|---|---|
| Block | Contiguous chunks | Uniform workload | Variable workload |
| Cyclic | Round-robin assignment | Variable workload | Cache-sensitive tasks |
2. Performance Metrics
- Speedup = Sequential Time / Parallel Time
- Efficiency = Speedup / Number of Processors
- Load Balance = Min Thread Time / Max Thread Time
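For instance, with a hypothetical sequential time of 40 seconds and a parallel time of 25 seconds on 4 threads, Speedup = 40 / 25 = 1.6 and Efficiency = 1.6 / 4 = 0.4, well short of the ideal speedup of 4 and efficiency of 1.0.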
3. Choosing the Right Strategy
Block Partitioning:
- Works best when computational cost is uniform
- Retains better cache locality
- Simple to implement
Cyclic Partitioning:
- Ideal for highly variable workloads
- Balances execution time across threads
- Reduces bottlenecks in uneven computations
Exercise 2
Complete the cyclic partition implementation.
Hints
- You may need to declare NUM_THREADS globally.
- Rename printFibThread to blockFibThread.
Requirements:
- Add timing code using std::chrono
- Measure the time from thread creation to all threads completing
- Display the total execution time in seconds
- Compare with your sequential time from Exercise 1
- Add your name and index to the code comments
Expected Output Format:
=== EXECUTION COMPLETED ===
Total parallel execution time: XX seconds
Sequential execution time was: YY seconds
Speedup achieved: ZZ (a value below 1.0 indicates a slowdown)
Efficiency: EE
After completing the implementation, provide a brief analysis:
- Did cyclic partitioning improve execution time? Why or why not?
- How does speedup compare to block partitioning?
- Did cyclic partitioning solve load imbalance issues?
- What further optimizations could enhance performance?
Lab Submission
- Complete Exercise 1 and Exercise 2.
- Include your name and index in all code files.
- Take screenshots of your code and output.
- Compile your work into a PDF document.
- Upload the PDF to the LMS.