Lab 3: Data Parallelism
Lab Overview
In this lab, you will explore the fundamental concepts of data parallelism and learn how different data partitioning strategies can dramatically impact the performance of parallel programs. Using the computationally intensive Fibonacci sequence as our test case, you'll discover why naive parallelization sometimes fails and how intelligent data partitioning can unlock the true power of parallel computing.
You'll implement and compare two fundamental data partition strategies:
- Block Partitioning: Assigns contiguous chunks of data to each processing unit.
- Cyclic Partitioning: Distributes data in a round-robin fashion for better load balancing.
Additionally, you will conduct a Performance Analysis to evaluate the effectiveness of each approach.
By the end of this lab, you’ll understand why parallelization isn’t always faster and how to design scalable parallel algorithms optimized for multi-core processors.
Lab Objectives
By completing this lab, you will be able to:
- Understand the critical importance of data partitioning in parallel computing
- Compare block vs. cyclic data partitioning strategies and their performance implications
- Analyze load balancing challenges and explore effective solutions
- Measure and evaluate execution time improvements (or lack thereof) in parallel programs
- Debug common pitfalls in parallel programming, especially uneven workload distribution
- Apply appropriate data partitioning strategies based on problem characteristics
Prerequisites
- Be familiar with POSIX pthreads (covered in Lab 1)
- Read lecture note: "Parallel Hardware Part (2)"
Part 1: Understanding the Problem - Fibonacci Sequence
Mathematical Foundation
The Fibonacci sequence follows this recurrence relation:
\[ fib(n) = fib(n-1) + fib(n-2), \qquad fib(0) = fib(1) = 1 \]
The sequence begins as:
- \(fib(0) = 1\)
- \(fib(1) = 1\)
- \(fib(2) = fib(1) + fib(0) = 1 + 1 = 2\)
- \(fib(3) = fib(2) + fib(1) = 2 + 1 = 3\)
- \(fib(4) = fib(3) + fib(2) = 3 + 2 = 5\)
- and so on.
A simple recursive function computes the Fibonacci numbers; it appears in the test program below. We use a recursive implementation because:
- Exponential time complexity: \(O(2^n)\) - creates significant computational load
- Uneven computation cost: fib(45) takes much longer than fib(20).
Run this test program to see execution time for different Fibonacci values, and observe the exponential growth in execution time:
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
unsigned long long fib(unsigned int n)
{
if (n <= 1)
return 1;
else
return fib(n-1) + fib(n-2);
}
int main()
{
cout << "Observing Fibonacci execution times:\n";
for (int n : {25, 30, 35, 40, 45}) {
auto start = high_resolution_clock::now();
unsigned long long result = fib(n);
auto end = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(end - start);
cout << "fib(" << n << ") = " << result
<< " (Time: " << duration.count() << "ms)\n";
}
return 0;
}
When you run the program, each value appears together with its execution time; the times grow exponentially with \(n\), as a plot of execution time versus \(n\) makes clear.
Sequential Fibonacci Series
Let us print the Fibonacci series from fib(0) up to fib(47) sequentially. See the code at OnlineGDB: click the button, run the code, and observe the execution time.
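Since the full listing is hosted externally, here is a minimal sketch of the sequential version (assumed to match the OnlineGDB code in spirit; details may differ):
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
// Same recursive definition used throughout this lab: fib(0) = fib(1) = 1
unsigned long long fib(unsigned int n)
{
    return (n <= 1) ? 1 : fib(n - 1) + fib(n - 2);
}
int main()
{
    auto start = high_resolution_clock::now();
    // Print the whole series, fib(0) through fib(47), one value at a time
    for (int n = 0; n <= 47; n++)
        cout << "fib(" << n << ") = " << fib(n) << endl;
    auto end = high_resolution_clock::now();
    cout << "Sequential time: "
         << duration_cast<seconds>(end - start).count() << " seconds" << endl;
    return 0;
}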
Part 2: Naive Parallelization - Block Partition
The Obvious Approach
Our first attempt at parallelization uses block partitioning, where each thread processes a contiguous chunk of data:
n: [0, 1, 2, 3, 4, 5, 6, ..., 45, 46, 47]
4 Threads:
Thread 0 -> [0, 1, 2, ..., 11] (12 elements)
Thread 1 -> [12, 13, 14, ..., 23] (12 elements)
Thread 2 -> [24, 25, 26, ..., 35] (12 elements)
Thread 3 -> [36, 37, 38, ..., 47] (12 elements)
Implementation Components
Component 1: Thread Data Structure:
struct ThreadData {
int start; // Starting number for this thread
int end; // Ending number for this thread
int threadId; // For identification and debugging
};
Component 2: Thread Function: A modified function computes and prints Fibonacci numbers in the assigned range:
void* printFibThread(void *arg)
{
ThreadData *p = static_cast<ThreadData *>(arg);
for (int n = p->start; n <= p->end; n += 1) {
unsigned long long x = fib(n);
cerr << "fib(" << n << ") = " << x << endl;
}
return nullptr;
}
Component 3: Work Distribution Logic: We divide the workload evenly among threads:
// compute block size for partitioning
int block_length = (END - START + 1) / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
threadData[i].start = START + i * block_length;
threadData[i].end = (i == NUM_THREADS - 1) ? END : threadData[i].start + block_length - 1;
threadData[i].threadId = i;
}
NUM_THREADS is the number of processors/threads available. If the range size is not evenly divisible by NUM_THREADS, the ternary expression assigns the leftover elements to the last thread.
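With this lab's parameters (START = 0, END = 47, NUM_THREADS = 4), block_length = 48 / 4 = 12, so thread i receives the range [12i, 12i + 11], exactly as in the diagram above.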
Full Program Using Block Partition
#include <iostream>
#include <pthread.h>
#include <chrono>
using namespace std;
using namespace std::chrono;
struct ThreadData {
int start;
int end;
int threadId;
};
unsigned long long fib(unsigned int n)
{
if (n <= 1)
return 1;
else
return fib(n-1) + fib(n-2);
}
void* printFibThread(void *arg)
{
ThreadData *data = static_cast<ThreadData *>(arg);
cout << "Thread " << data->threadId << " starting work on range ["
<< data->start << ", " << data->end << "]" << endl;
for (int n = data->start; n <= data->end; n++) {
unsigned long long result = fib(n);
cout << "Thread " << data->threadId
<< ": fib(" << n << ") = " << result << endl;
}
cout << "Thread " << data->threadId << " finished!" << endl;
return nullptr;
}
int main()
{
const int START = 0;
const int END = 47;
const int NUM_THREADS = 4;
pthread_t threads[NUM_THREADS];
ThreadData threadData[NUM_THREADS];
cout << "=== BLOCK PARTITION PARALLEL FIBONACCI ===" << endl;
cout << "Computing Fibonacci(" << START << ") to Fibonacci(" << END
<< ") using " << NUM_THREADS << " threads" << endl;
auto startTime = high_resolution_clock::now();
// Distribute work in blocks
int blockSize = (END - START + 1) / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
threadData[i].start = START + i * blockSize;
threadData[i].end = (i == NUM_THREADS - 1) ? END : threadData[i].start + blockSize - 1;
threadData[i].threadId = i;
cout << "Creating thread " << i << " for range ["
<< threadData[i].start << ", " << threadData[i].end << "]" << endl;
int result = pthread_create(&threads[i], nullptr, printFibThread, &threadData[i]);
if (result != 0) {
cerr << "Error creating thread " << i << endl;
return -1;
}
}
// Wait for all threads to complete
for (int i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], nullptr);
}
auto endTime = high_resolution_clock::now();
auto duration = duration_cast<seconds>(endTime - startTime);
cout << "\n=== EXECUTION COMPLETED ===" << endl;
cout << "Total parallel execution time: " << duration.count() << " seconds" << endl;
return 0;
}
Block Partitioning: Why It Fails
When running the program, you'll observe:
- Threads 0, 1, and 2 finish quickly, computing smaller Fibonacci numbers.
- Thread 3 takes significantly longer, computing fib(36) through fib(47).
- No speedup over sequential execution, as the workload is unevenly distributed.
This occurs because higher Fibonacci numbers require exponentially more computation: the cost of fib(n) grows roughly like \(\phi^n\) with \(\phi \approx 1.618\), so thread 3's range dominates the total work and the parallel time ends up close to the sequential time.
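To see the imbalance concretely, the following standalone sketch (an illustration only, not part of the lab code) models the cost of fib(n) as \(\phi^n\) and estimates each thread's share of the total work under block partitioning:
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
    const double phi = 1.6180339887;              // golden ratio
    double blockWork[4] = {0.0, 0.0, 0.0, 0.0};
    double total = 0.0;
    // Model the cost of fib(n) as phi^n and accumulate per 12-element block
    for (int n = 0; n <= 47; n++) {
        double cost = pow(phi, n);
        blockWork[n / 12] += cost;
        total += cost;
    }
    for (int i = 0; i < 4; i++)
        cout << "Thread " << i << " estimated share: "
             << 100.0 * blockWork[i] / total << " %" << endl;
    return 0;
}
Running it shows that thread 3 holds well over 99% of the estimated work, which explains why the block version is barely faster than sequential.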
Exercise 1
Tasks: Modify the block partition implementation to calculate key performance metrics:
- Measure sequential execution time for comparison.
- Compute speedup: \[ \text{Speedup} = \frac{\text{Sequential Time}}{\text{Parallel Time}} \]
- Calculate efficiency: \[ \text{Efficiency} = \frac{\text{Speedup}}{\text{Number of Threads}} \]
- Evaluate load-balancing effectiveness (a per-thread timing sketch follows this list): \[ \text{Load Balance} = \frac{\text{Minimum Thread Time}}{\text{Maximum Thread Time}} \]
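One way to obtain per-thread times is to let each thread record its own elapsed time. A minimal sketch, assuming you extend ThreadData with an extra field (the name elapsedSec is illustrative):
#include <pthread.h>
#include <chrono>
unsigned long long fib(unsigned int n);   // as defined in the block program
struct ThreadData {
    int start;
    int end;
    int threadId;
    double elapsedSec;   // written by the thread itself
};
void* printFibThread(void *arg)
{
    ThreadData *data = static_cast<ThreadData *>(arg);
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int n = data->start; n <= data->end; n++)
        fib(n);
    auto t1 = std::chrono::high_resolution_clock::now();
    data->elapsedSec = std::chrono::duration<double>(t1 - t0).count();
    return nullptr;
}
After joining all threads in main(), the load balance is simply the smallest elapsedSec divided by the largest.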
After conducting the experiment, write a brief commentary on your findings. Address the following points:
- Did the block partitioning approach improve execution time? Why or why not?
- Which thread experienced the longest execution time, and what does that indicate?
- Was the workload balanced across threads? How could it be improved?
- How do your speedup and efficiency results compare to ideal parallel scaling?
- If you were to optimize the partitioning strategy, what changes would you make?
Part 3: Cyclic Partition - The Solution
The Round-Robin Approach
Instead of giving each thread contiguous blocks, we distribute work in a round-robin fashion:
n: [0, 1, 2, 3, ..., 46, 47]
4 Threads:
Thread 0: [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44]
Thread 1: [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45]
Thread 2: [2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46]
Thread 3: [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]
Notice that T0 (Thread 0) gets data item 0, T1 gets data item 1, T2 gets data item 2, T3 gets data item 3, T0 gets data item 4, and so on. This distribution is known as cyclic partitioning.
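Formally, with \(T\) threads, thread \(t\) is assigned every index \(n\) in the range satisfying \(n \bmod T = t\).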
Cyclic Distribution Implementation
void* cyclicFibThread(void *arg)
{
ThreadData *data = static_cast<ThreadData *>(arg);
cout << "Thread " << data->threadId << " starting cyclic work" << endl;
// Cyclic partition: thread i gets elements i, i+NUM_THREADS, i+2*NUM_THREADS, ...
for (int n = data->start + data->threadId; n <= data->end; n += NUM_THREADS) {
unsigned long long result = fib(n);
cout << "Thread " << data->threadId
<< ": fib(" << n << ") = " << result << endl;
}
cout << "Thread " << data->threadId << " finished cyclic work!" << endl;
return nullptr;
}
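Because the loop begins at data->start + data->threadId and strides by NUM_THREADS, every thread must receive the same overall range. A sketch of how main() might fill ThreadData for the cyclic version (completing the full program is Exercise 2):
// Every thread is handed the full [START, END] range; the stride
// inside cyclicFibThread selects this thread's elements.
for (int i = 0; i < NUM_THREADS; i++) {
    threadData[i].start = START;
    threadData[i].end = END;
    threadData[i].threadId = i;
    pthread_create(&threads[i], nullptr, cyclicFibThread, &threadData[i]);
}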
Key Learning Outcomes
1. Data Distribution Strategies
| Strategy | Description | Best For | Worst For |
|---|---|---|---|
| Block | Contiguous chunks | Uniform workload | Variable workload |
| Cyclic | Round-robin assignment | Variable workload | Cache-sensitive tasks |
2. Performance Metrics
- Speedup = Sequential Time / Parallel Time
- Efficiency = Speedup / Number of Processors
- Load Balance = Min Thread Time / Max Thread Time
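For instance, with a hypothetical sequential time of 40 seconds and a parallel time of 25 seconds on 4 threads, Speedup = 40 / 25 = 1.6 and Efficiency = 1.6 / 4 = 0.4, well short of the ideal speedup of 4 and efficiency of 1.0.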
3. Choosing the Right Strategy
Block Partitioning:
- Works best when computational cost is uniform
- Retains better cache locality
- Simple to implement
Cyclic Partitioning:
- Ideal for highly variable workloads
- Balances execution time across threads
- Reduces bottlenecks in uneven computations
Exercise 2
Complete the cyclic partition implementation.
Hints
- You may need to declare NUM_THREADS globally.
- Rename printFibThread to blockFibThread.
Requirements:
- Add timing code using std::chrono
- Measure the time from thread creation to all threads completing
- Display the total execution time in seconds
- Compare with your sequential time from Exercise 1
- Add your name and index to the code comments
Expected Output Format:
=== EXECUTION COMPLETED ===
Total parallel execution time: XX seconds
Sequential execution time was: YY seconds
Speedup achieved: ZZ (a value below 1.0 indicates a slowdown)
Efficiency: EE
After completing the implementation, provide a brief analysis:
- Did cyclic partitioning improve execution time? Why or why not?
- How does speedup compare to block partitioning?
- Did cyclic partitioning solve load imbalance issues?
- What further optimizations could enhance performance?
Lab Submission
- Complete Exercise 1 and Exercise 2.
- Include your name and index in all code files.
- Take screenshots of your code and output.
- Compile your work into a PDF document.
- Upload the PDF to the LMS.