CUDA Prefix Sum

Visualizing the process of a hierarchical CUDA inclusive scan using shared & global memory

Overview of Hierarchical Scan Using Shared Memory

  1. Block-Wise Scan (Local Scan):

    • Operation: Each CUDA block performs a scan on a subset of the input data using shared memory.

    • Output:

      • Scanned Data: Each block writes its scanned subset back to global memory.

      • Block Sums: Each block writes the total sum of its subset to an auxiliary array in global memory.

  2. Scan of Block Sums:

    • Operation: Perform an inclusive scan on the array of block sums; entry b - 1 of the result is then the total of all elements in the blocks preceding block b.

    • Output: A scanned block sums array.

  3. Add Block Sums to Scanned Data (Adjustment Phase):

    • Operation: Each block adds its corresponding scanned block sum to its scanned data to obtain the final inclusive scan.

    • Output: Final inclusive scan result in global memory.
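
The three phases above can be driven from the host roughly as follows. This is a hedged sketch: `local_scan`, `scan_block_sums`, and `add_block_sums` are hypothetical kernel names used for illustration, not library functions.

```cuda
// Sketch of the host-side driver for the three phases described above.
// The three kernels are hypothetical names for this walkthrough's phases.
void hierarchical_scan(const int *d_input, int *d_output, int n)
{
    const int BLOCK = 256;                      // threads per block
    int numBlocks = (n + BLOCK - 1) / BLOCK;

    int *d_block_sums;
    cudaMalloc(&d_block_sums, numBlocks * sizeof(int));

    // Phase 1: each block scans its own tile in shared memory.
    local_scan<<<numBlocks, BLOCK, BLOCK * sizeof(int)>>>(
        d_input, d_output, d_block_sums, n);

    // Phase 2: scan the per-block sums (one block suffices while
    // numBlocks fits in a block; recurse or iterate for larger inputs).
    scan_block_sums<<<1, numBlocks, numBlocks * sizeof(int)>>>(
        d_block_sums, numBlocks);

    // Phase 3: add the preceding blocks' total to every element.
    add_block_sums<<<numBlocks, BLOCK>>>(d_output, d_block_sums, n);

    cudaFree(d_block_sums);
}
```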


Step-by-Step Visualization

Let's consider an input array of 8 elements for simplicity:

Input Array (Global Memory):
Index: 0  1  2  3  4  5  6  7
Value: 3, 1, 7, 0, 4, 1, 6, 3

Assume we are using 2 CUDA blocks with 4 threads per block. Here's how the process unfolds:

1. Block-Wise Scan (Local Scan) Using Shared Memory

Each block processes a subset of the input array. Shared memory is used within each block for fast data access.

Block 0: Processing Indices 0-3

  • Shared Memory (s_data):

    Before Scan:
    Thread 0: s_data[0] = 3
    Thread 1: s_data[1] = 1
    Thread 2: s_data[2] = 7
    Thread 3: s_data[3] = 0
  • Scan Operation (Inclusive Scan):

    After Scan in Shared Memory:
    s_data: 3, 4, 11, 11
    • Explanation:

      • s_data[0] = 3

      • s_data[1] = 3 + 1 = 4

      • s_data[2] = 4 + 7 = 11

      • s_data[3] = 11 + 0 = 11

  • Write Back to Global Memory:

    Scanned Data (Global Memory):
    Index: 0  1  2  3
    Value: 3, 4, 11, 11
  • Block Sum:

    Block Sums Array:
    Block 0 Sum: 11

Block 1: Processing Indices 4-7

  • Shared Memory (s_data):

    Before Scan:
    Thread 0: s_data[0] = 4
    Thread 1: s_data[1] = 1
    Thread 2: s_data[2] = 6
    Thread 3: s_data[3] = 3
  • Scan Operation (Inclusive Scan):

    After Scan in Shared Memory:
    s_data: 4, 5, 11, 14
    • Explanation:

      • s_data[0] = 4

      • s_data[1] = 4 + 1 = 5

      • s_data[2] = 5 + 6 = 11

      • s_data[3] = 11 + 3 = 14

  • Write Back to Global Memory:

    Scanned Data (Global Memory):
    Index: 4  5  6  7
    Value: 4, 5, 11, 14
  • Block Sum:

    Block Sums Array:
    Block 0 Sum: 11
    Block 1 Sum: 14

2. Scan of Block Sums

Now, perform a scan on the Block Sums Array to determine the total sums preceding each block.

Block Sums Array (Global Memory):

Index: 0  1
Value: 11, 14
  • Scan Operation (Inclusive Scan):

    After Scan of Block Sums:
    Scanned Block Sums Array:
    Index: 0  1
    Value: 11, 25
    • Explanation:

      • Scanned_Block_Sums[0] = 11

      • Scanned_Block_Sums[1] = 11 + 14 = 25


3. Add Scanned Block Sums to Scanned Data (Adjustment Phase)

Each block adds the sum of all previous blocks to its scanned data to obtain the final inclusive scan.

Adjustment Phase:

  • Block 0:

    • No Adjustment Needed (First block).

  • Block 1:

    • Adjustment Value: Scanned_Block_Sums[0] = 11

    • Adjusted Scanned Data:

      Original Scanned Data (Block 1):
      Index: 4  5  6  7
      Value: 4, 5, 11, 14
      
      After Adjustment:
      Index: 4  5   6   7
      Value: 15, 16, 22, 25
      • Explanation:

        • Index 4: 4 + 11 = 15

        • Index 5: 5 + 11 = 16

        • Index 6: 11 + 11 = 22

        • Index 7: 14 + 11 = 25


Final Inclusive Scan Output

Combining the adjusted scanned data from both blocks:

Final Inclusive Scan Output (Global Memory):
Index: 0  1  2  3  4  5  6  7
Value: 3, 4, 11, 11, 15, 16, 22, 25

Visual Summary

Below is a visual summary of the entire process:

Initial Input Array

Index: 0  1  2  3  4  5  6  7
Value: 3, 1, 7, 0, 4, 1, 6, 3

1. Block-Wise Scan Using Shared Memory

Block 0 (Indices 0-3)

Shared Memory (Before Scan):
3, 1, 7, 0

After Inclusive Scan in Shared Memory:
3, 4, 11, 11

Write Back to Global Memory:
Index: 0  1  2  3
Value: 3, 4, 11, 11

Block Sum:
11

Block 1 (Indices 4-7)

Shared Memory (Before Scan):
4, 1, 6, 3

After Inclusive Scan in Shared Memory:
4, 5, 11, 14

Write Back to Global Memory:
Index: 4  5  6  7
Value: 4, 5, 11, 14

Block Sum:
14

2. Scan of Block Sums

Block Sums Array:
Index: 0  1
Value: 11, 14

After Inclusive Scan:
Index: 0  1
Value: 11, 25

3. Adjustment Phase

Final Adjusted Scanned Data:

Block 0:
Index: 0  1  2  3
Value: 3, 4, 11, 11

Block 1:
Index: 4  5  6  7
Original Value: 4, 5, 11, 14
Adjustment: +11
Adjusted Value: 15, 16, 22, 25

Final Output

Final Inclusive Scan Output:
Index: 0  1  2  3  4  5  6  7
Value: 3, 4, 11, 11, 15, 16, 22, 25

Detailed Walkthrough with Shared and Global Memory Interaction

1. Block-Wise Scan Kernel Execution

  • Global Memory: Stores the Input Array and the Scanned Data.

  • Shared Memory: Temporarily holds a block's subset of the input data for fast access during the scan.

For Each Block:

  1. Load Data into Shared Memory:

    • Each thread reads one element from the input array (global memory) into shared memory.

    • Example:

      • Thread 0: Reads d_input[0] = 3 → s_data[0] = 3

      • Thread 1: Reads d_input[1] = 1 → s_data[1] = 1

      • Thread 2: Reads d_input[2] = 7 → s_data[2] = 7

      • Thread 3: Reads d_input[3] = 0 → s_data[3] = 0

  2. Perform Scan in Shared Memory:

    • The scan is performed within shared memory, ensuring all threads within the block collaborate efficiently.

    • Synchronization: __syncthreads() is used to synchronize threads after loading data and after each scan phase.

  3. Write Scanned Data and Block Sum to Global Memory:

    • After the scan:

      • Each thread writes its scanned value back to the Scanned Data array in global memory.

      • The last thread writes the block's total sum to the Block Sums Array in global memory.
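
The three steps above can be sketched as a single kernel. The scan shown here is a Hillis-Steele inclusive scan, one of several possible shared-memory scan algorithms; the kernel and array names (`local_scan`, `d_scanned`, `d_block_sums`) are our own, used only for illustration.

```cuda
// Minimal per-block Hillis-Steele inclusive scan (a sketch, not the only
// possible algorithm). Launch with dynamic shared memory of
// blockDim.x * sizeof(int).
__global__ void local_scan(const int *d_input, int *d_scanned,
                           int *d_block_sums, int n)
{
    extern __shared__ int s_data[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // 1. Load one element per thread into shared memory (pad with 0 past n,
    //    since 0 is the identity for addition).
    s_data[tid] = (gid < n) ? d_input[gid] : 0;
    __syncthreads();

    // 2. Inclusive scan: each step adds the value 'offset' positions back.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int val = (tid >= offset) ? s_data[tid - offset] : 0;
        __syncthreads();              // all reads finish before any write
        s_data[tid] += val;
        __syncthreads();              // all writes finish before next read
    }

    // 3. Write the scanned values back; the last thread also writes the
    //    block's total sum to the block sums array.
    if (gid < n) d_scanned[gid] = s_data[tid];
    if (tid == blockDim.x - 1) d_block_sums[blockIdx.x] = s_data[tid];
}
```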

2. Scan Block Sums Kernel Execution

  • Global Memory: Holds the Block Sums Array.

  • Shared Memory: Used to perform a scan on the block sums.

Process:

  1. Load Block Sums into Shared Memory:

    • Each thread reads one block sum into shared memory.

  2. Perform Scan on Block Sums:

    • An inclusive scan is performed on the block sums within shared memory.

    • This results in a scanned block sums array.

  3. Write Scanned Block Sums Back to Global Memory:

    • The scanned block sums are written back to global memory for the adjustment phase.
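
A single-block kernel is sufficient for this phase as long as the number of block sums fits in one block. A hedged sketch (the kernel name `scan_block_sums` is our own, and the in-place update is one of several reasonable designs):

```cuda
// Single-block inclusive scan over the block sums array, in place.
// Launch with numBlocks threads and numBlocks * sizeof(int) shared memory.
// For inputs whose block count exceeds one block, apply the whole
// hierarchical scheme recursively instead.
__global__ void scan_block_sums(int *d_block_sums, int numBlocks)
{
    extern __shared__ int s_sums[];
    int tid = threadIdx.x;

    s_sums[tid] = (tid < numBlocks) ? d_block_sums[tid] : 0;
    __syncthreads();

    // Same Hillis-Steele inclusive scan pattern as the block-wise phase.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int val = (tid >= offset) ? s_sums[tid - offset] : 0;
        __syncthreads();
        s_sums[tid] += val;
        __syncthreads();
    }

    if (tid < numBlocks) d_block_sums[tid] = s_sums[tid];
}
```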

3. Adjustment Phase Kernel Execution

  • Global Memory: Contains both the Scanned Data and the Scanned Block Sums.

  • Shared Memory: Not used in this phase; data is directly adjusted in global memory.

Process:

  1. Add Scanned Block Sums to Scanned Data:

    • Each thread adds the scanned block sum corresponding to its block to its scanned data.

    • Example:

      • Block 0: No addition needed.

      • Block 1: Adds 11 (Scanned_Block_Sums[0]) to each of its scanned values.
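
The adjustment step above is a one-liner per thread. A sketch, assuming `d_block_sums` already holds the inclusive scan of the per-block totals (names are our own):

```cuda
// Adjustment phase: every block except block 0 adds the inclusive sum of
// all preceding blocks, which is d_block_sums[blockIdx.x - 1] after the
// block sums have been scanned.
__global__ void add_block_sums(int *d_scanned, const int *d_block_sums, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x > 0 && gid < n)
        d_scanned[gid] += d_block_sums[blockIdx.x - 1];
}
```

In the 8-element example, block 1's threads each add d_block_sums[0] = 11, turning 4, 5, 11, 14 into 15, 16, 22, 25.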


Illustrative Diagrams

While text-based diagrams have limitations, the following representations can help clarify the data flow and memory interactions.

Diagram 1: Initial Data Partitioning

Global Memory:
+---+---+---+---+---+---+---+---+
| 3 | 1 | 7 | 0 | 4 | 1 | 6 | 3 |
+---+---+---+---+---+---+---+---+
  |   |   |   |   |   |   |   |
  v   v   v   v   v   v   v   v
Block 0            Block 1
(Indices 0-3)      (Indices 4-7)

Diagram 2: Block-Wise Scan Using Shared Memory

Block 0

Shared Memory (s_data):
+---+---+---+---+
| 3 | 1 | 7 | 0 |
+---+---+---+---+
      |
      v
After Scan:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Write Back to Global Memory:
Scanned Data:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Block 1

Shared Memory (s_data):
+---+---+---+---+
| 4 | 1 | 6 | 3 |
+---+---+---+---+
      |
      v
After Scan:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+

Write Back to Global Memory:
Scanned Data:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+

Diagram 3: Scanning Block Sums

Block Sums Array:
+----+----+
| 11 | 14 |
+----+----+

After Scan:
Scanned Block Sums Array:
+----+----+
| 11 | 25 |
+----+----+

Diagram 4: Adjustment Phase

Final Scanned Data Before Adjustment:
Block 0:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Block 1:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+

Scanned Block Sums:
+----+----+
| 11 | 25 |
+----+----+

Adjustment:
Block 0: No adjustment
Block 1: Add 11 to each element

Final Scanned Data:
Block 0:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+

Block 1:
+----+----+----+----+
| 15 | 16 | 22 | 25 |
+----+----+----+----+

Final Output:
+---+---+----+----+----+----+----+----+
| 3 | 4 | 11 | 11 | 15 | 16 | 22 | 25 |
+---+---+----+----+----+----+----+----+

Memory Interaction Summary

  1. Shared Memory:

    • Used During:

      • Block-Wise Scan: Each block loads its subset of data into shared memory, performs the scan, and writes back the scanned data.

    • Benefits:

      • Fast Access: Shared memory allows for low-latency access during the scan operations within a block.

      • Synchronization: Threads within a block can synchronize their operations efficiently.

  2. Global Memory:

    • Used For:

      • Input and Output Data: The original input array and the final scanned output reside in global memory.

      • Block Sums Array: After each block performs its local scan, the block sums are stored in a separate global memory array.

      • Scanned Block Sums: The scanned block sums are also stored in global memory for the adjustment phase.

    • Benefits:

      • Persistence: Data persists across kernel launches, allowing different kernels to access and modify it.

      • Scalability: Can handle large datasets that exceed the capacity of shared memory.

  3. Interaction Between Shared and Global Memory:

    • Within a Kernel:

      • Loading Data: Threads load data from global memory to shared memory for fast access.

      • Writing Results: After computations in shared memory, results are written back to global memory.

    • Across Kernels:

      • Block Sums: The block sums are written to global memory by one kernel and read by another kernel during the scan of block sums.

      • Adjustment Phase: The scanned block sums are read from global memory by the adjustment phase kernel to modify the scanned data.


Final Notes

  • Block Size Selection:

    • Choose a block size that is a multiple of the warp size (32 on NVIDIA GPUs) for optimal performance.

    • Common choices are 128, 256, or 512 threads per block, depending on the GPU architecture and input size.

  • Handling Non-Power-of-Two Sizes:

    • Pad the input data with zeros if necessary to ensure each block has a complete set of data.

    • This avoids out-of-bounds memory accesses during scan operations.

  • Synchronization:

    • Use __syncthreads() within blocks to synchronize threads during shared memory operations.

    • Ensure that all threads have completed their work before proceeding to avoid race conditions.

  • Recursive vs. Iterative Scanning:

    • For very large input sizes, recursive scanning of block sums may be necessary.

    • Alternatively, an iterative approach with multiple kernel launches can be employed.

  • Error Handling and Optimization:

    • Always check for CUDA errors after kernel launches and memory operations.

    • Profile your implementation using tools like NVIDIA Nsight Compute to identify and eliminate bottlenecks.
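
For the error-handling point above, a common pattern is to wrap every CUDA call in a checking macro. A sketch (the macro name `CUDA_CHECK` is our own convention, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Typical usage after a kernel launch (kernels launch asynchronously,
// so check both the launch itself and the subsequent execution):
//   my_kernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors
```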
