
Lab (4): Cache Locality and Memory Access Optimization

Lab Overview

In previous labs, we explored concurrent programming and thread synchronization. While these techniques help utilize multiple CPU cores, another critical aspect of performance optimization often has a greater impact: cache locality. Modern processors are fast, but memory access is slow. To bridge this gap, processors use a hierarchy of caches (L1, L2, L3) to store frequently accessed data close to the CPU. Optimizing for cache locality can dramatically improve performance—sometimes by orders of magnitude.

This lab teaches you how memory access patterns affect performance, how to write cache-friendly code, and demonstrates these concepts through examples, culminating in optimized matrix multiplication algorithms.

Lab Objectives

  1. Understand the memory hierarchy and cache behavior
  2. Learn the difference between spatial and temporal locality
  3. Write code that exploits cache locality effectively
  4. Compare performance of cache-friendly vs. cache-unfriendly algorithms
  5. Implement and optimize matrix multiplication algorithms
  6. Measure and analyze performance improvements using timing
  7. Apply cache optimization techniques to real-world problems

Prerequisites

  • Understanding of arrays and multi-dimensional arrays in C++
  • Basic knowledge of memory layout and pointers
  • Familiarity with timing measurements (chrono library)
  • Completion of previous pthread labs (helpful but not required)

Lab Submission

  1. Complete the exercise at the end
  2. Write your name and index in the first line of your program
  3. Take screenshots of your code including your name and index
  4. Take screenshots showing performance comparisons and timing results
  5. Include a brief analysis of your results in the document
  6. Put all work in a Word document, convert to PDF, and upload to LMS

1. Understanding Memory Hierarchy and Cache

The Memory Hierarchy

Modern computer systems use a memory hierarchy to balance speed, capacity, and cost. Each level has typical capacities (e.g., 1KB, 32KB, 256KB) and access times:

Memory Level     Access Time     Capacity    Speed
CPU Registers    ~1 cycle        ~1KB        Fastest
L1 Cache         ~3 cycles       32KB
L2 Cache         ~10 cycles      256KB
L3 Cache         ~40 cycles      8MB
Main Memory      ~200 cycles     8GB
SSD/HDD          ~10^6 cycles    ~1TB        Slowest
  • Registers: Fastest, smallest storage for immediate computations.
  • Caches (L1, L2, L3): Store frequently accessed data; sizes like 32KB (L1) or 256KB (L2) are typical for modern CPUs.
  • Main Memory (RAM): Holds program data and instructions.
  • Storage (SSD/HDD): Persistent, slowest storage.

Cache Principles

Caches reduce memory access time by keeping recently used data near the CPU:

  • Cache Line: The smallest unit of data transferred between cache and main memory, typically 64 bytes. When data is fetched, the entire cache line is loaded, not just the requested byte, to exploit spatial locality.
  • Cache Hit: Occurs when requested data is found in the cache, providing fast access (e.g., 3 cycles for L1).
  • Cache Miss: Occurs when data isn’t in the cache, requiring a fetch from slower memory (e.g., 200 cycles from RAM), which is costly.
  • Cache Eviction: When the cache is full, older data is replaced (evicted) to make space, often using policies like Least Recently Used (LRU).

Why It Matters: Efficient cache use minimizes misses, leveraging the speed of cache over RAM.
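
The 64-byte cache line assumed throughout this lab is typical of x86-64 hardware, not a guarantee. As a minimal sketch (assuming a Linux/glibc environment; glibc's _SC_LEVEL1_DCACHE_LINESIZE and C++17's std::hardware_destructive_interference_size are optional extensions that may be absent on other platforms), you can query the line size at run time:

#include <iostream>
#include <new>        // std::hardware_destructive_interference_size (C++17, optional)
#include <unistd.h>   // sysconf; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension

int main() {
    long line = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);   // ask the OS for the L1 data-cache line size
#endif
#ifdef __cpp_lib_hardware_interference_size
    if (line <= 0) line = std::hardware_destructive_interference_size;  // compile-time hint
#endif
    if (line <= 0) line = 64;                     // fall back to the value assumed in this lab
    std::cout << "Cache line size: " << line << " bytes\n";
    return 0;
}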

Why Cache Matters: Understanding Cache Locality

In this lab, we explore cache locality and its significant impact on program performance. Cache locality refers to the principle that programs tend to access a small portion of their address space repeatedly over a short period. This concept is divided into two types:

  • Temporal locality: Accessing the same data multiple times within a short timeframe, such as a variable reused in a loop (e.g., the sum variable in our performance example).
  • Spatial locality: Accessing data that is stored close together in memory, like consecutive elements in an array.

Modern processors use caches—small, fast memory units—to take advantage of these localities. When data is accessed, it’s loaded into the cache along with nearby data, making subsequent accesses faster. Poor access patterns, however, can lead to cache misses, which slow down execution due to the need to fetch data from slower main memory.

To illustrate this, assume we have an array of 12 integers. To compute the sum, we must traverse the array elements, and there are two common access patterns: sequential traversal and strided traversal. Consider the following diagrams:

Cache Access Patterns

Figure 1: Diagram showing sequential memory access pattern: \(a[0] \to a[1] \to a[2] \to \cdots \to a[11]\).


Figure 2: Diagram showing strided memory access pattern: \(a[0] \to a[4] \to a[8] \to a[1] \to a[5] \to a[9] \to \cdots \to a[11]\). The top section shows strided access with a stride of 4, grouping elements into color-coded sets. The bottom section reinterprets the array as a 3x4 matrix, highlighting how strided access aligns with column-wise traversal.

The above diagrams depict two memory access patterns for an array with elements labeled \( a_0 \) to \( a_{11} \):

  1. Sequential traversal: Elements are accessed in order (\( a_0 \to a_1 \to a_2 \to \ldots \to a_{11} \)), shown by black arrows. This maximizes spatial locality because consecutive elements are stored contiguously and often fit within the same cache line. Loading a cache line brings multiple elements into the cache, reducing cache misses during sequential access.

  2. Strided traversal: Elements are accessed with a fixed stride, such as every fourth element (\( a_0 \to a_4 \to a_8 \), then \( a_1 \to a_5 \to a_9 \), etc.), indicated by colored arrows. This can disrupt spatial locality if the stride causes each access to land in a different cache line, increasing cache misses.

The diagram’s bottom section reinterprets the array as a 3x4 matrix with rows (\( a_0, a_1, a_2, a_3 \)), (\( a_4, a_5, a_6, a_7 \)), and (\( a_8, a_9, a_{10}, a_{11} \)). Here, strided access (e.g., \( a_0, a_4, a_8 \)) corresponds to a column-wise traversal. In C++, arrays use row-major order, meaning rows are contiguous in memory. Column-wise access jumps across rows, leading to non-contiguous memory access and potentially more cache misses.

The performance example in this lab brings these concepts to life. By comparing the execution times of sequential (row-wise) and strided (column-wise) access on a large array, you will see the tangible impact of cache locality. In the code, the variable sum exemplifies temporal locality because it is repeatedly updated within the loop as each element is processed; as long as sum stays in a CPU register or in cache, it avoids costly memory fetches. Sequential access runs faster thanks to better spatial locality, while strided access suffers from poor cache utilization, making the effect of memory access patterns directly measurable.

Mastering cache locality is key to writing efficient code, particularly in performance-critical fields like scientific computing, graphics, and machine learning. Optimizing data access patterns to align with locality principles can yield significant speedups without changing the core algorithm.

Now, let’s dive into the performance demonstration.

Performance Demonstration: Sequential vs. Strided Access

#include <iostream>
#include <chrono>
#include <vector>

const int ARRAY_SIZE = 64 * 1024 * 1024; // 64M ints (256 MB), far larger than any cache

// Cache-friendly: sequential access
long long sum_sequential(const std::vector<int>& arr) {
    long long sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
        sum += arr[i];
    }
    return sum;
}

// Cache-unfriendly: strided access
long long sum_strided(const std::vector<int>& arr) {
    long long sum = 0;
    int stride = 1024;
    for (int iteration = 0; iteration < stride; iteration++) {
        for (int i = iteration; i < ARRAY_SIZE; i += stride) {
            sum += arr[i];
        }
    }
    return sum;
}

int main() {
    std::cout << "=== Cache Locality Demonstration ===\n";

    std::vector<int> data(ARRAY_SIZE);
    for (int i = 0; i < ARRAY_SIZE; i++) {
        data[i] = i % 1000;
    }

    auto start = std::chrono::high_resolution_clock::now();
    long long result1 = sum_sequential(data);
    auto end = std::chrono::high_resolution_clock::now();
    auto sequential_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    start = std::chrono::high_resolution_clock::now();
    long long result2 = sum_strided(data);
    end = std::chrono::high_resolution_clock::now();
    auto strided_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "Sequential sum: " << result1 << " (Time: " << sequential_time.count() << " ms)\n";
    std::cout << "Strided sum: " << result2 << " (Time: " << strided_time.count() << " ms)\n";
    std::cout << "Performance ratio: " << (double)strided_time.count() / sequential_time.count() << "x slower\n";

    return 0;
}

Sample Output (varies by system):

=== Cache Locality Demonstration ===
Sequential sum: 33520818816 (Time: 181 ms)
Strided sum: 33520818816 (Time: 881 ms)
Performance ratio: 4.8674x slower

Analysis:

  • Sequential Access: Iterates contiguously, loading cache lines (e.g., 64 bytes, or 16 integers) that are fully utilized, minimizing misses. The sum variable benefits from temporal locality as it’s updated repeatedly in the cache or a register.
  • Strided Access: Jumps by 1024 elements (4096 bytes), likely requiring a new cache line per access, increasing misses. The sum variable still exhibits temporal locality, but the overall performance suffers due to poor spatial locality of the array accesses.
  • Connection to Pthreads: In your Pthreads labs, you’ve split arrays across threads. Poor locality in each thread’s access pattern (e.g., strided access) would amplify cache misses in parallel programs.
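
A rough back-of-the-envelope estimate ties these points to numbers (assuming 64-byte cache lines, 4-byte int, and an array far larger than the last-level cache, with N = ARRAY_SIZE): sequential access incurs roughly one miss per 16 elements, while a 4096-byte stride puts every access on a different cache line:

\[
\text{misses}_{\text{sequential}} \approx \frac{N}{16}, \qquad \text{misses}_{\text{strided}} \approx N
\]

The measured slowdown (about 5x in the sample output) is smaller than this 16x miss ratio would suggest, most likely because hardware prefetching and memory-level parallelism hide part of the miss latency.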

2. Array Access Patterns and Performance

Row-Major vs. Column-Major Access

In C/C++, two-dimensional arrays are stored in row-major order, meaning rows are contiguous in memory. This storage pattern makes row-wise access cache-friendly, while column-wise access is less efficient due to poor cache utilization.
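
Concretely, for a matrix with COLS columns stored in row-major order, element [i][j] lives at linear offset

\[
\text{offset}(i, j) = i \times \text{COLS} + j,
\]

so incrementing j moves to the adjacent element (4 bytes away for int), while incrementing i jumps COLS x 4 bytes.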

Visual Explanation

Consider a 3x4 matrix stored in row-major order, as shown in Figure 3, with a 16-byte cache line.


Figure 3: Illustration of a 3x4 matrix in row-major order and its interaction with a 16-byte cache line.

  • Row-Major Access (Sequential Traversal):

    • Elements are accessed in the order: [0,0] → [0,1] → [0,2] → [0,3] → [1,0] → ....
    • Process:
      • When [0,0] is accessed for the first time, a cache miss occurs, and the CPU fetches 16 bytes from main memory, loading elements [0,0], [0,1], [0,2], and [0,3] into the cache (assuming each element is 4 bytes).
      • When [0,1] is accessed, a cache hit occurs, as it is already in the cache.
      • This pattern continues, with cache misses occurring only when a new row is accessed (e.g., transitioning from [0,3] to [1,0]). For the entire 3x4 matrix, approximately three cache misses are expected—one per row.
    • Outcome: This sequential traversal maximizes spatial locality, efficiently utilizing the cache line and minimizing misses.
  • Column-Major Access (Strided Traversal):

    • Elements are accessed in the order: [0,0] → [1,0] → [2,0] → [0,1] → ....
    • Process:
      • When [0,0] is accessed for the first time, a cache miss occurs, and the CPU fetches 16 bytes from main memory, loading [0,0], [0,1], [0,2], and [0,3] into the cache.
      • When [1,0] is accessed, another cache miss occurs because [1,0] is not contiguous with the previous cache line. The CPU fetches a new 16-byte block, loading [1,0], [1,1], [1,2], and [1,3].
      • When [2,0] is accessed, a third cache miss is triggered, loading a new cache line (e.g., [2,0], [2,1], [2,2], [2,3]). If the cache is small or full, the initial cache line (e.g., [0,0] to [0,3]) may be evicted to accommodate this new data.
      • Observation: The cache line is underutilized, as only one element per 16-byte block is used before the next miss. In a fully occupied or small cache, this leads to frequent cache misses and evictions.
    • Outcome: This strided traversal results in poor spatial locality, causing multiple cache misses and degraded performance.

The following program measures both traversal orders on a 4000x4000 matrix:
#include <iostream>
#include <chrono>
#include <vector>

const int ROWS = 4000;
const int COLS = 4000;

long long sum_row_major(const std::vector<std::vector<int>>& matrix) {
    long long sum = 0;
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

long long sum_column_major(const std::vector<std::vector<int>>& matrix) {
    long long sum = 0;
    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

int main() {
    std::cout << "=== Row-Major vs Column-Major Access ===\n";

    std::vector<std::vector<int>> matrix(ROWS, std::vector<int>(COLS));
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            matrix[i][j] = (i + j) % 100;
        }
    }

    auto start = std::chrono::high_resolution_clock::now();
    long long result1 = sum_row_major(matrix);
    auto end = std::chrono::high_resolution_clock::now();
    auto row_major_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    start = std::chrono::high_resolution_clock::now();
    long long result2 = sum_column_major(matrix);
    end = std::chrono::high_resolution_clock::now();
    auto col_major_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "Row-major sum: " << result1 << " (Time: " << row_major_time.count() << " ms)\n";
    std::cout << "Column-major sum: " << result2 << " (Time: " << col_major_time.count() << " ms)\n";
    std::cout << "Performance ratio: " << (double)col_major_time.count() / row_major_time.count() << "x slower\n";

    return 0;
}
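
One further aside (not required by the lab): std::vector<std::vector<int>> allocates each row as a separate heap block, so consecutive rows are not guaranteed to be adjacent in memory. A flat 1D vector indexed as i * COLS + j keeps the whole matrix contiguous; a minimal sketch:

#include <cstddef>
#include <vector>

// Flat row-major storage: the whole rows x cols matrix is one contiguous block.
long long sum_row_major_flat(const std::vector<int>& m, int rows, int cols) {
    long long sum = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += m[static_cast<std::size_t>(i) * cols + j];  // offset = i * cols + j
    return sum;
}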

3. Cache-Friendly Data Structures

Structure of Arrays (SoA) vs. Array of Structures (AoS)

A particle system is a computational model used to simulate a collection of many individual particles, each with properties like position, velocity, and mass. These systems are widely used in fields such as physics simulations (e.g., molecular dynamics), computer graphics (e.g., smoke, fire, or fluid effects), and game development (e.g., crowd or debris animations). The importance of optimizing particle systems lies in their computational intensity—updating thousands or millions of particles in real-time requires efficient memory access to achieve acceptable performance. Poor cache utilization can lead to significant bottlenecks, making cache-friendly data structures critical for scalability and speed. In this section, we explore two approaches, Array of Structures (AoS) and Structure of Arrays (SoA), to organize particle data, particularly focusing on velocity components (vx, vy, vz), which represent the speed along the x, y, and z axes, respectively.

  • Array of Structures (AoS):

    • Design: Each particle is represented as a single structure containing all its attributes—position (x, y, z), velocity (vx, vy, vz), mass, and a unique identifier (id). These structures are stored in a contiguous array.
    • Memory Layout: For N particles, the memory is organized as [Particle 0 {x, y, z, vx, vy, vz, mass, id}, Particle 1 {x, y, z, vx, vy, vz, mass, id}, ...]. This groups all data for a single particle together.
    • Access Pattern: When updating a particle’s position (e.g., x += vx * dt), the CPU accesses all attributes of one particle sequentially. However, if the operation focuses on a single attribute (e.g., updating all vx values), the access jumps between non-contiguous memory locations across particles, potentially causing cache misses.
    • Advantages: Intuitive for per-particle operations and easier to manage when each particle’s data is processed as a unit (e.g., in object-oriented designs).
    • Disadvantages: Poor spatial locality for component-wise operations, as unused fields (e.g., mass) are loaded into the cache alongside the needed ones, wasting bandwidth.
  • Structure of Arrays (SoA):

    • Design: Each attribute of the particles (e.g., x, vx, mass) is stored in a separate array, and all arrays are of the same length (N). A ParticleSystem structure groups these arrays together.
    • Memory Layout: For N particles, the memory is organized as [x[0], x[1], ...], [y[0], y[1], ...], [z[0], z[1], ...], [vx[0], vx[1], ...], ...]. This separates attributes into contiguous blocks.
    • Access Pattern: When updating positions (e.g., x[i] += vx[i] * dt), the CPU accesses consecutive elements of the x and vx arrays, aligning with cache lines and improving spatial locality. This is especially beneficial for vectorized operations (e.g., SIMD instructions) that process multiple particles’ vx values simultaneously.
    • Advantages: Excellent spatial locality for component-wise updates, reducing cache misses and enabling efficient parallel processing. It also allows selective loading of only the required attributes into the cache.
    • Disadvantages: Less intuitive for per-particle operations, requiring more complex indexing, and may increase memory overhead due to separate allocations.

Performance Impact: The choice between AoS and SoA depends on the workload. For particle system simulations where velocity components (vx, vy, vz) are updated uniformly across all particles (e.g., applying a global force), SoA typically outperforms AoS. This is because SoA ensures that all vx values are contiguous, allowing the cache to load them efficiently, whereas AoS scatters these values across particle structures, leading to more cache line fetches.

Implementation Example:

#include <iostream>
#include <chrono>
#include <vector>

const int NUM_PARTICLES = 1000000;

struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int id;
};

struct ParticleSystem {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;
    std::vector<int> id;

    ParticleSystem(int size) : x(size), y(size), z(size), 
                              vx(size), vy(size), vz(size), 
                              mass(size), id(size) {}
};

void update_positions_aos(std::vector<Particle>& particles, float dt) {
    for (int i = 0; i < particles.size(); i++) {
        particles[i].x += particles[i].vx * dt;
        particles[i].y += particles[i].vy * dt;
        particles[i].z += particles[i].vz * dt;
    }
}

void update_positions_soa(ParticleSystem& system, float dt) {
    for (int i = 0; i < system.x.size(); i++) {
        system.x[i] += system.vx[i] * dt;
        system.y[i] += system.vy[i] * dt;
        system.z[i] += system.vz[i] * dt;
    }
}

int main() {
    std::cout << "=== Array of Structures vs Structure of Arrays ===\n";

    std::vector<Particle> particles(NUM_PARTICLES);
    for (int i = 0; i < NUM_PARTICLES; i++) {
        particles[i] = {static_cast<float>(i), static_cast<float>(i+1), static_cast<float>(i+2),
                        1.0f, 1.0f, 1.0f, 1.0f, i};
    }

    ParticleSystem system(NUM_PARTICLES);
    for (int i = 0; i < NUM_PARTICLES; i++) {
        system.x[i] = static_cast<float>(i);
        system.y[i] = static_cast<float>(i+1);
        system.z[i] = static_cast<float>(i+2);
        system.vx[i] = system.vy[i] = system.vz[i] = 1.0f;
        system.mass[i] = 1.0f;
        system.id[i] = i;
    }

    const float dt = 0.01f;
    const int iterations = 1000;

    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; iter++) {
        update_positions_aos(particles, dt);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto aos_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; iter++) {
        update_positions_soa(system, dt);
    }
    end = std::chrono::high_resolution_clock::now();
    auto soa_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "AoS update time: " << aos_time.count() << " ms\n";
    std::cout << "SoA update time: " << soa_time.count() << " ms\n";
    std::cout << "Performance improvement: " << (double)aos_time.count() / soa_time.count() << "x faster\n";

    return 0;
}

Analysis:

  • In the example, SoA often outperforms AoS because it loads only the necessary velocity components (vx, vy, vz) and positions (x, y, z) into the cache, avoiding the overhead of fetching unused attributes like mass or id. The contiguous memory layout of SoA aligns with cache lines (e.g., 64 bytes), enabling efficient use of spatial locality.
  • The performance difference becomes more pronounced with larger NUM_PARTICLES or when vectorization (e.g., using SIMD instructions) is applied, as SoA’s structure is inherently compatible with such optimizations.
  • However, if the simulation requires frequent per-particle operations (e.g., computing forces based on mass and id for each particle), AoS might be more practical due to its cohesive data grouping, though it may still suffer from cache inefficiency.
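
As a quick, hedged way to see the vectorization mentioned above (assuming GCC on x86-64; the file name particles.cpp is just a placeholder), compile with optimization and request a vectorization report:

g++ -O3 -march=native -fopt-info-vec-optimized particles.cpp -o particles

The contiguous SoA loops are typically easier for the compiler to vectorize than the strided per-particle accesses of the AoS version.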

Practical Considerations:

  • Cache Size: The effectiveness of SoA increases with larger datasets that exceed L1 cache (e.g., 32KB), as it ensures only relevant data is cached.
  • Alignment: Ensure arrays in SoA are properly aligned (e.g., to 64-byte boundaries) to maximize cache line utilization, which can be achieved using compiler directives like __attribute__((aligned(64))) in GCC.
  • Hybrid Approaches: In some cases, a hybrid model (e.g., grouping a few related attributes into small structures within an SoA-like layout) can balance locality and usability, depending on the specific workload.
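
A minimal sketch of the alignment idea, assuming C++17's std::aligned_alloc (the size passed must be a multiple of the alignment):

#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free (C++17)

int main() {
    const std::size_t N = 1000000;                 // number of particles
    const std::size_t bytes = N * sizeof(float);   // 4,000,000 bytes, a multiple of 64
    // Allocate one SoA component (e.g., vx) starting on a 64-byte cache-line boundary.
    float* vx = static_cast<float*>(std::aligned_alloc(64, bytes));
    if (!vx) return 1;
    for (std::size_t i = 0; i < N; i++) vx[i] = 1.0f;
    std::free(vx);
    return 0;
}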

This section equips you with the knowledge to choose the appropriate data structure based on the access patterns of your particle simulation, optimizing for cache performance in real-world applications like physics engines or computational fluid dynamics.


4. Matrix Multiplication: The Ultimate Cache Challenge

Why Matrix Multiplication Matters

Matrix multiplication is key in scientific computing, graphics, machine learning, and more, with O(n³) complexity making cache optimization critical.

Naive Matrix Multiplication (ijk order)

If \(A\) is an \(n \times m\) matrix and \(B\) is an \(m \times p\) matrix, then the matrix \(C = A \times B\) can be computed as follows:

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            c[i][j] = 0;
            for (int k = 0; k < m; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }

This naive algorithm accesses matrix \(A\) with a sequential traversal and matrix \(B\) with a strided traversal. Consequently, the execution time of c[i][j] += a[i][k] * b[k][j] is dominated by the strided accesses to matrix \(B\).

Cache Blocking (Tiling)

Explanation: Cache blocking divides matrices into smaller blocks (e.g., 64x64) that fit in the cache. Instead of processing the entire matrix at once, it computes results for one block at a time, reusing data while it’s still in the cache. This improves temporal locality (reusing data) and spatial locality (accessing contiguous memory), reducing cache misses.

For more explanation, please check this YouTube video
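
A rough rule of thumb for choosing the block size (an estimate based on the typical cache sizes in the table above, not a measurement): pick b so that three b x b blocks of doubles fit in the cache level you are targeting,

\[
3\,b^2 \times 8 \text{ bytes} \le \text{cache size} \quad\Rightarrow\quad b \le \sqrt{\frac{\text{cache size}}{24}}.
\]

For a 256KB L2 cache this gives \(b \lesssim 104\), so the default block_size = 64 used in the code below is a comfortable fit.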

#include <iostream>
#include <chrono>
#include <vector>
#include <random>

class Matrix {
private:
    std::vector<std::vector<double>> data;
    int rows, cols;
public:
    Matrix(int r, int c) : rows(r), cols(c) { data.resize(rows, std::vector<double>(cols, 0.0)); }
    void randomize() {
        std::random_device rd; std::mt19937 gen(rd()); std::uniform_real_distribution<> dis(0.0, 1.0);
        for (int i = 0; i < rows; i++) for (int j = 0; j < cols; j++) data[i][j] = dis(gen);
    }
    double& operator()(int i, int j) { return data[i][j]; }
    const double& operator()(int i, int j) const { return data[i][j]; }
    int getRows() const { return rows; }
    int getCols() const { return cols; }
};

Matrix multiply_naive(const Matrix& A, const Matrix& B) {
    int n = A.getRows();
    Matrix C(n, n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C(i, j) += A(i, k) * B(k, j);
            }
        }
    }
    return C;
}

Matrix multiply_blocked(const Matrix& A, const Matrix& B, int block_size = 64) {
    int n = A.getRows();
    Matrix C(n, n);
    for (int ii = 0; ii < n; ii += block_size) {
        for (int jj = 0; jj < n; jj += block_size) {
            for (int kk = 0; kk < n; kk += block_size) {
                int i_max = std::min(ii + block_size, n);
                int j_max = std::min(jj + block_size, n);
                int k_max = std::min(kk + block_size, n);
                for (int i = ii; i < i_max; i++) {
                    for (int k = kk; k < k_max; k++) {
                        double a_ik = A(i, k);
                        for (int j = jj; j < j_max; j++) {
                            C(i, j) += a_ik * B(k, j);
                        }
                    }
                }
            }
        }
    }
    return C;
}

int main() {
    const int n = 512;
    Matrix A(n, n), B(n, n);
    A.randomize(); B.randomize();

    auto start = std::chrono::high_resolution_clock::now();
    Matrix C1 = multiply_naive(A, B);
    auto end = std::chrono::high_resolution_clock::now();
    auto naive_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    start = std::chrono::high_resolution_clock::now();
    Matrix C2 = multiply_blocked(A, B);
    end = std::chrono::high_resolution_clock::now();
    auto blocked_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "Naive time: " << naive_time.count() << " ms\n";
    std::cout << "Blocked time: " << blocked_time.count() << " ms\n";
    std::cout << "Improvement: " << (double)naive_time.count() / blocked_time.count() << "x faster\n";
    return 0;
}
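
The main function above does not verify that the two products agree. An optional sketch of such a check (add <cmath> to the includes; a tolerance is needed because the two versions sum the partial products in different orders):

// Optional correctness check: call as matrices_match(C1, C2) after both multiplies.
bool matrices_match(const Matrix& X, const Matrix& Y, double eps = 1e-6) {
    for (int i = 0; i < X.getRows(); i++)
        for (int j = 0; j < X.getCols(); j++)
            if (std::abs(X(i, j) - Y(i, j)) > eps) return false;
    return true;
}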

Exercise: Optimizing Strided Array Summation

Implement a program to compute the sum of a large 1D array (ARRAY_SIZE = 64 * 1024 * 1024) using both sequential and strided traversal patterns. Optimize the strided traversal to improve cache performance and compare the results with the sequential approach.

Instructions

  1. Setup:

    • Use a std::vector<int> with ARRAY_SIZE = 64 * 1024 * 1024 to store the array, initialized with values such as i % 1000.
  2. Implementation:

    • Write a sum_sequential function that computes the sum by accessing array elements consecutively (e.g., arr[i] from i = 0 to ARRAY_SIZE - 1).
    • Write a sum_strided function that computes the sum with a stride of 1024 (e.g., arr[i] where i increases by 1024 each step, covering all elements through multiple passes).
    • Optimize the sum_strided function by experimenting with a smaller stride (e.g., 16 or 32) or aligning the array to a 64-byte boundary using __attribute__((aligned(64))) in GCC.
    • Measure execution time for both the original and optimized versions using the chrono library.
  3. Comparison:

    • Run the program and record the time taken for the sequential sum, the original strided sum, and the optimized strided sum.
    • Calculate the performance ratio (e.g., original strided time / optimized strided time) to quantify the improvement.
  4. Analysis:

    • Explain why the optimized strided version (if faster) improves performance, focusing on cache locality (spatial and temporal).
    • Discuss how changing the array size (e.g., to 16 * 1024 * 1024 or 256 * 1024 * 1024) or cache line size might affect the results.

Submission

  • Include your code with your name and index at the top.
  • Take screenshots of your code and performance results, highlighting the timings for all versions.
  • Provide a brief written analysis (1-2 paragraphs) in your submission document.
  • Compile all materials into a Word document, convert to PDF, and upload to the LMS.