CUDA Prefix Sum

Visualizing the process of a hierarchical CUDA inclusive scan using shared & global memory

Overview of Hierarchical Scan Using Shared Memory

  1. Block-Wise Scan (Local Scan):

    • Operation: Each CUDA block performs a scan on a subset of the input data using shared memory.

    • Output:

      • Scanned Data: Each block writes its scanned subset back to global memory.

      • Block Sums: Each block writes the total sum of its subset to an auxiliary array in global memory.

  2. Scan of Block Sums:

    • Operation: Perform an inclusive scan on the array of block sums; entry b - 1 of the result is then the total of all elements in the blocks preceding block b.

    • Output: A scanned block sums array.

  3. Add Block Sums to Scanned Data (Adjustment Phase):

    • Operation: Each block adds its corresponding scanned block sum to its scanned data to obtain the final inclusive scan.

    • Output: Final inclusive scan result in global memory.
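
The three phases above can be driven from the host roughly as follows. This is a hedged sketch: `local_scan`, `scan_block_sums`, and `add_block_sums` are hypothetical kernel names used for illustration, not library functions.

```cuda
// Sketch of the host-side driver for the three phases described above.
// The three kernels are hypothetical names for this walkthrough's phases.
void hierarchical_scan(const int *d_input, int *d_output, int n)
{
    const int BLOCK = 256;                      // threads per block
    int numBlocks = (n + BLOCK - 1) / BLOCK;

    int *d_block_sums;
    cudaMalloc(&d_block_sums, numBlocks * sizeof(int));

    // Phase 1: each block scans its own tile in shared memory.
    local_scan<<<numBlocks, BLOCK, BLOCK * sizeof(int)>>>(
        d_input, d_output, d_block_sums, n);

    // Phase 2: scan the per-block sums (one block suffices while
    // numBlocks fits in a block; recurse or iterate for larger inputs).
    scan_block_sums<<<1, numBlocks, numBlocks * sizeof(int)>>>(
        d_block_sums, numBlocks);

    // Phase 3: add the preceding blocks' total to every element.
    add_block_sums<<<numBlocks, BLOCK>>>(d_output, d_block_sums, n);

    cudaFree(d_block_sums);
}
```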


Step-by-Step Visualization

Let's consider an input array of 8 elements for simplicity:

Input Array (Global Memory):
Index: 0  1  2  3  4  5  6  7
Value: 3, 1, 7, 0, 4, 1, 6, 3

Assume we are using 2 CUDA blocks with 4 threads per block. Here's how the process unfolds:

1. Block-Wise Scan (Local Scan) Using Shared Memory

Each block processes a subset of the input array. Shared memory is used within each block for fast data access.

Block 0: Processing Indices 0-3

  • Shared Memory (s_data):

    Before Scan:
    Thread 0: s_data[0] = 3
    Thread 1: s_data[1] = 1
    Thread 2: s_data[2] = 7
    Thread 3: s_data[3] = 0
  • Scan Operation (Inclusive Scan):

    After Scan in Shared Memory:
    s_data: 3, 4, 11, 11
    • Explanation:

      • s_data[0] = 3

      • s_data[1] = 3 + 1 = 4

      • s_data[2] = 4 + 7 = 11

      • s_data[3] = 11 + 0 = 11

  • Write Back to Global Memory:

    Scanned Data (Global Memory):
    Index: 0  1  2  3
    Value: 3, 4, 11, 11
  • Block Sum:

    Block Sums Array:
    Block 0 Sum: 11

Block 1: Processing Indices 4-7

  • Shared Memory (s_data):

    Before Scan:
    Thread 0: s_data[0] = 4
    Thread 1: s_data[1] = 1
    Thread 2: s_data[2] = 6
    Thread 3: s_data[3] = 3
  • Scan Operation (Inclusive Scan):

    After Scan in Shared Memory:
    s_data: 4, 5, 11, 14
    • Explanation:

      • s_data[0] = 4

      • s_data[1] = 4 + 1 = 5

      • s_data[2] = 5 + 6 = 11

      • s_data[3] = 11 + 3 = 14

  • Write Back to Global Memory:

    Scanned Data (Global Memory):
    Index: 4  5  6  7
    Value: 4, 5, 11, 14
  • Block Sum:

    Block Sums Array:
    Block 0 Sum: 11
    Block 1 Sum: 14

2. Scan of Block Sums

Now, perform a scan on the Block Sums Array to determine the total sums preceding each block.

Block Sums Array (Global Memory):

Index: 0  1
Value: 11, 14
  • Scan Operation (Inclusive Scan):

    After Scan of Block Sums:
    Scanned Block Sums Array:
    Index: 0  1
    Value: 11, 25
    • Explanation:

      • Scanned_Block_Sums[0] = 11

      • Scanned_Block_Sums[1] = 11 + 14 = 25


3. Add Scanned Block Sums to Scanned Data (Adjustment Phase)

Each block adds the sum of all previous blocks to its scanned data to obtain the final inclusive scan.

Adjustment Phase:

  • Block 0:

    • No Adjustment Needed (First block).

  • Block 1:

    • Adjustment Value: Scanned_Block_Sums[0] = 11

    • Adjusted Scanned Data:

      Original Scanned Data (Block 1):
      Index: 4  5  6  7
      Value: 4, 5, 11, 14
      
      After Adjustment:
      Index: 4  5   6   7
      Value: 15, 16, 22, 25
      • Explanation:

        • Index 4: 4 + 11 = 15

        • Index 5: 5 + 11 = 16

        • Index 6: 11 + 11 = 22

        • Index 7: 14 + 11 = 25


Final Inclusive Scan Output

Combining the adjusted scanned data from both blocks:

Final Inclusive Scan Output (Global Memory):
Index: 0  1  2  3  4  5  6  7
Value: 3, 4, 11, 11, 15, 16, 22, 25

Visual Summary

Below is a visual summary of the entire process:

Initial Input Array

Index: 0  1  2  3  4  5  6  7
Value: 3, 1, 7, 0, 4, 1, 6, 3

1. Block-Wise Scan Using Shared Memory

Block 0 (Indices 0-3)

Shared Memory (Before Scan):
3, 1, 7, 0

After Inclusive Scan in Shared Memory:
3, 4, 11, 11

Write Back to Global Memory:
Index: 0  1  2  3
Value: 3, 4, 11, 11

Block Sum:
11

Block 1 (Indices 4-7)

Shared Memory (Before Scan):
4, 1, 6, 3

After Inclusive Scan in Shared Memory:
4, 5, 11, 14

Write Back to Global Memory:
Index: 4  5  6  7
Value: 4, 5, 11, 14

Block Sum:
14

2. Scan of Block Sums

Block Sums Array:
Index: 0  1
Value: 11, 14

After Inclusive Scan:
Index: 0  1
Value: 11, 25

3. Adjustment Phase

Final Adjusted Scanned Data:

Block 0:
Index: 0  1  2  3
Value: 3, 4, 11, 11

Block 1:
Index: 4  5  6  7
Original Value: 4, 5, 11, 14
Adjustment: +11
Adjusted Value: 15, 16, 22, 25

Final Output

Final Inclusive Scan Output:
Index: 0  1  2  3  4  5  6  7
Value: 3, 4, 11, 11, 15, 16, 22, 25

Detailed Walkthrough with Shared and Global Memory Interaction

1. Block-Wise Scan Kernel Execution

  • Global Memory: Stores the Input Array and the Scanned Data.

  • Shared Memory: Temporarily holds a block's subset of the input data for fast access during the scan.

For Each Block:

  1. Load Data into Shared Memory:

    • Each thread reads one element from the input array (global memory) into shared memory.

    • Example:

      • Thread 0: Reads d_input[0] = 3 → s_data[0] = 3

      • Thread 1: Reads d_input[1] = 1 → s_data[1] = 1

      • Thread 2: Reads d_input[2] = 7 → s_data[2] = 7

      • Thread 3: Reads d_input[3] = 0 → s_data[3] = 0

  2. Perform Scan in Shared Memory:

    • The scan is performed within shared memory, ensuring all threads within the block collaborate efficiently.

    • Synchronization: __syncthreads() is used to synchronize threads after loading data and after each scan phase.

  3. Write Scanned Data and Block Sum to Global Memory:

    • After the scan:

      • Each thread writes its scanned value back to the Scanned Data array in global memory.

      • The last thread writes the block's total sum to the Block Sums Array in global memory.
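
The three steps above can be sketched as a single kernel. The scan shown here is a Hillis-Steele inclusive scan, one of several possible shared-memory scan algorithms; the kernel and array names (`local_scan`, `d_scanned`, `d_block_sums`) are our own, used only for illustration.

```cuda
// Minimal per-block Hillis-Steele inclusive scan (a sketch, not the only
// possible algorithm). Launch with dynamic shared memory of
// blockDim.x * sizeof(int).
__global__ void local_scan(const int *d_input, int *d_scanned,
                           int *d_block_sums, int n)
{
    extern __shared__ int s_data[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // 1. Load one element per thread into shared memory (pad with 0 past n,
    //    since 0 is the identity for addition).
    s_data[tid] = (gid < n) ? d_input[gid] : 0;
    __syncthreads();

    // 2. Inclusive scan: each step adds the value 'offset' positions back.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int val = (tid >= offset) ? s_data[tid - offset] : 0;
        __syncthreads();              // all reads finish before any write
        s_data[tid] += val;
        __syncthreads();              // all writes finish before next read
    }

    // 3. Write the scanned values back; the last thread also writes the
    //    block's total sum to the block sums array.
    if (gid < n) d_scanned[gid] = s_data[tid];
    if (tid == blockDim.x - 1) d_block_sums[blockIdx.x] = s_data[tid];
}
```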

2. Scan Block Sums Kernel Execution

  • Global Memory: Holds the Block Sums Array.

  • Shared Memory: Used to perform a scan on the block sums.

Process:

  1. Load Block Sums into Shared Memory:

    • Each thread reads one block sum into shared memory.

  2. Perform Scan on Block Sums:

    • An inclusive scan is performed on the block sums within shared memory.

    • This results in a scanned block sums array.

  3. Write Scanned Block Sums Back to Global Memory:

    • The scanned block sums are written back to global memory for the adjustment phase.
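
A single-block kernel is sufficient for this phase as long as the number of block sums fits in one block. A hedged sketch (the kernel name `scan_block_sums` is our own, and the in-place update is one of several reasonable designs):

```cuda
// Single-block inclusive scan over the block sums array, in place.
// Launch with numBlocks threads and numBlocks * sizeof(int) shared memory.
// For inputs whose block count exceeds one block, apply the whole
// hierarchical scheme recursively instead.
__global__ void scan_block_sums(int *d_block_sums, int numBlocks)
{
    extern __shared__ int s_sums[];
    int tid = threadIdx.x;

    s_sums[tid] = (tid < numBlocks) ? d_block_sums[tid] : 0;
    __syncthreads();

    // Same Hillis-Steele inclusive scan pattern as the block-wise phase.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int val = (tid >= offset) ? s_sums[tid - offset] : 0;
        __syncthreads();
        s_sums[tid] += val;
        __syncthreads();
    }

    if (tid < numBlocks) d_block_sums[tid] = s_sums[tid];
}
```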

3. Adjustment Phase Kernel Execution

  • Global Memory: Contains both the Scanned Data and the Scanned Block Sums.

  • Shared Memory: Not used in this phase; data is directly adjusted in global memory.

Process:

  1. Add Scanned Block Sums to Scanned Data:

    • Each thread adds the scanned block sum corresponding to its block to its scanned data.

    • Example:

      • Block 0: No addition needed.

      • Block 1: Adds 11 (Scanned_Block_Sums[0]) to each of its scanned values.
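
The adjustment step above is a one-liner per thread. A sketch, assuming `d_block_sums` already holds the inclusive scan of the per-block totals (names are our own):

```cuda
// Adjustment phase: every block except block 0 adds the inclusive sum of
// all preceding blocks, which is d_block_sums[blockIdx.x - 1] after the
// block sums have been scanned.
__global__ void add_block_sums(int *d_scanned, const int *d_block_sums, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x > 0 && gid < n)
        d_scanned[gid] += d_block_sums[blockIdx.x - 1];
}
```

In the 8-element example, block 1's threads each add d_block_sums[0] = 11, turning 4, 5, 11, 14 into 15, 16, 22, 25.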


Illustrative Diagrams

While text-based diagrams have limitations, the following representations can help clarify the data flow and memory interactions.

Diagram 1: Initial Data Partitioning

Global Memory:
+---+---+---+---+---+---+---+---+
| 3 | 1 | 7 | 0 | 4 | 1 | 6 | 3 |
+---+---+---+---+---+---+---+---+
  |   |   |   |   |   |   |   |
  v   v   v   v   v   v   v   v
Block 0            Block 1
(Indices 0-3)      (Indices 4-7)

Diagram 2: Block-Wise Scan Using Shared Memory

Block 0

Shared Memory (s_data):
+---+---+---+---+
| 3 | 1 | 7 | 0 |
+---+---+---+---+
      |
      v
After Scan:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Write Back to Global Memory:
Scanned Data:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Block 1

Shared Memory (s_data):
+---+---+---+---+
| 4 | 1 | 6 | 3 |
+---+---+---+---+
      |
      v
After Scan:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+

Write Back to Global Memory:
Scanned Data:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+

Diagram 3: Scanning Block Sums

Block Sums Array:
+----+----+
| 11 | 14 |
+----+----+

After Scan:
Scanned Block Sums Array:
+----+----+
| 11 | 25 |
+----+----+

Diagram 4: Adjustment Phase

Final Scanned Data Before Adjustment:
Block 0:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Block 1:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+

Scanned Block Sums:
+----+----+
| 11 | 25 |
+----+----+

Adjustment:
Block 0: No adjustment
Block 1: Add 11 to each element

Final Scanned Data:
Block 0:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Block 1:
+----+----+----+----+
| 15 | 16 | 22 | 25 |
+----+----+----+----+

Final Output:
+---+---+----+----+----+----+----+----+
| 3 | 4 | 11 | 11 | 15 | 16 | 22 | 25 |
+---+---+----+----+----+----+----+----+

Memory Interaction Summary

  1. Shared Memory:

    • Used During:

      • Block-Wise Scan: Each block loads its subset of data into shared memory, performs the scan, and writes back the scanned data.

    • Benefits:

      • Fast Access: Shared memory allows for low-latency access during the scan operations within a block.

      • Synchronization: Threads within a block can synchronize their operations efficiently.

  2. Global Memory:

    • Used For:

      • Input and Output Data: The original input array and the final scanned output reside in global memory.

      • Block Sums Array: After each block performs its local scan, the block sums are stored in a separate global memory array.

      • Scanned Block Sums: The scanned block sums are also stored in global memory for the adjustment phase.

    • Benefits:

      • Persistence: Data persists across kernel launches, allowing different kernels to access and modify it.

      • Scalability: Can handle large datasets that exceed the capacity of shared memory.

  3. Interaction Between Shared and Global Memory:

    • Within a Kernel:

      • Loading Data: Threads load data from global memory to shared memory for fast access.

      • Writing Results: After computations in shared memory, results are written back to global memory.

    • Across Kernels:

      • Block Sums: The block sums are written to global memory by one kernel and read by another kernel during the scan of block sums.

      • Adjustment Phase: The scanned block sums are read from global memory by the adjustment phase kernel to modify the scanned data.


Final Notes

  • Block Size Selection:

    • Choose a block size that is a multiple of the warp size (32 on NVIDIA GPUs) for optimal performance.

    • Common choices are 128, 256, or 512 threads per block, depending on the GPU architecture and input size.

  • Handling Non-Power-of-Two Sizes:

    • Pad the input data with zeros if necessary to ensure each block has a complete set of data.

    • This avoids out-of-bounds memory accesses during scan operations.

  • Synchronization:

    • Use __syncthreads() within blocks to synchronize threads during shared memory operations.

    • Ensure that all threads have completed their work before proceeding to avoid race conditions.

  • Recursive vs. Iterative Scanning:

    • For very large input sizes, recursive scanning of block sums may be necessary.

    • Alternatively, an iterative approach with multiple kernel launches can be employed.

  • Error Handling and Optimization:

    • Always check for CUDA errors after kernel launches and memory operations.

    • Profile your implementation using tools like NVIDIA Nsight Compute to identify and eliminate bottlenecks.
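
For the error-handling point above, a common pattern is to wrap every CUDA call in a checking macro. A sketch (the macro name `CUDA_CHECK` is our own convention, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Typical usage after a kernel launch (kernels launch asynchronously,
// so check both the launch itself and the subsequent execution):
//   my_kernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors
```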
