CUDA Prefix Sum
Visualizing the process of a hierarchical CUDA inclusive scan using shared & global memory
Overview of Hierarchical Scan Using Shared Memory
Block-Wise Scan (Local Scan):
Operation: Each CUDA block performs a scan on a subset of the input data using shared memory.
Output:
Scanned Data: Each block writes its scanned subset back to global memory.
Block Sums: Each block writes the total sum of its subset to an auxiliary array in global memory.
Scan of Block Sums:
Operation: Perform an inclusive scan on the array of block sums; block i can then read entry i - 1 to find the total of all elements in the preceding blocks.
Output: A scanned block sums array.
Add Block Sums to Scanned Data (Adjustment Phase):
Operation: Each block (other than the first) adds the scanned sum of all preceding blocks to its scanned data to obtain the final inclusive scan.
Output: Final inclusive scan result in global memory.
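At the host level, these three phases map naturally onto three kernel launches. The sketch below uses hypothetical kernel and buffer names (block_scan, scan_block_sums, add_block_sums, d_input, d_scanned, d_block_sums) and assumes the block sums fit in a single block; error checking is omitted.

```cuda
// Host-side orchestration sketch (hypothetical names; error checking omitted).
// Phase 1: each block scans its tile and emits a per-block total.
block_scan<<<numBlocks, blockSize, blockSize * sizeof(int)>>>(d_input, d_scanned, d_block_sums, n);
// Phase 2: a single block scans the per-block totals.
scan_block_sums<<<1, numBlocks, numBlocks * sizeof(int)>>>(d_block_sums, numBlocks);
// Phase 3: every block except block 0 adds the total of all preceding blocks.
add_block_sums<<<numBlocks, blockSize>>>(d_scanned, d_block_sums, n);
```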
Step-by-Step Visualization
Let's consider an input array of 8 elements for simplicity:
Input Array (Global Memory):
Index: 0 1 2 3 4 5 6 7
Value: 3, 1, 7, 0, 4, 1, 6, 3
Assume we are using 2 CUDA blocks with 4 threads per block. Here's how the process unfolds:
1. Block-Wise Scan (Local Scan) Using Shared Memory
Each block processes a subset of the input array. Shared memory is used within each block for fast data access.
Block 0: Processing Indices 0-3
Shared Memory (s_data):
Before Scan:
Thread 0: s_data[0] = 3
Thread 1: s_data[1] = 1
Thread 2: s_data[2] = 7
Thread 3: s_data[3] = 0
Scan Operation (Inclusive Scan):
After Scan in Shared Memory: s_data: 3, 4, 11, 11
Explanation:
s_data[0] = 3
s_data[1] = 3 + 1 = 4
s_data[2] = 4 + 7 = 11
s_data[3] = 11 + 0 = 11
Write Back to Global Memory:
Scanned Data (Global Memory):
Index: 0 1 2 3
Value: 3, 4, 11, 11
Block Sum:
Block Sums Array: Block 0 Sum: 11
Block 1: Processing Indices 4-7
Shared Memory (s_data):
Before Scan:
Thread 0: s_data[0] = 4
Thread 1: s_data[1] = 1
Thread 2: s_data[2] = 6
Thread 3: s_data[3] = 3
Scan Operation (Inclusive Scan):
After Scan in Shared Memory: s_data: 4, 5, 11, 14
Explanation:
s_data[0] = 4
s_data[1] = 4 + 1 = 5
s_data[2] = 5 + 6 = 11
s_data[3] = 11 + 3 = 14
Write Back to Global Memory:
Scanned Data (Global Memory):
Index: 4 5 6 7
Value: 4, 5, 11, 14
Block Sum:
Block Sums Array: Block 0 Sum: 11 Block 1 Sum: 14
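The block-wise scan described above can be sketched as a single kernel. This is a minimal sketch, assuming a power-of-two block size, one element per thread, and a Hillis-Steele style inclusive scan; the kernel and buffer names are illustrative, not from the original text.

```cuda
__global__ void block_scan(const int *d_input, int *d_scanned,
                           int *d_block_sums, int n)
{
    extern __shared__ int s_data[];          // one element per thread
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Load this block's tile from global into shared memory.
    s_data[tid] = (gid < n) ? d_input[gid] : 0;
    __syncthreads();

    // Hillis-Steele inclusive scan: the stride doubles each step.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int val = (tid >= stride) ? s_data[tid - stride] : 0;
        __syncthreads();                     // all reads finish before writes
        s_data[tid] += val;
        __syncthreads();
    }

    // Write the scanned tile back to global memory.
    if (gid < n) d_scanned[gid] = s_data[tid];

    // The last thread publishes the block's total sum.
    if (tid == blockDim.x - 1)
        d_block_sums[blockIdx.x] = s_data[tid];
}
```

For the example above, block 0 transforms {3, 1, 7, 0} into {3, 4, 11, 11} in shared memory and publishes 11 as its block sum.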
2. Scan of Block Sums
Now, perform a scan on the Block Sums Array to determine the total sums preceding each block.
Block Sums Array (Global Memory):
Index: 0 1
Value: 11, 14
Scan Operation (Inclusive Scan):
After Scan of Block Sums:
Scanned Block Sums Array:
Index: 0 1
Value: 11, 25
Explanation:
Scanned_Block_Sums[0] = 11
Scanned_Block_Sums[1] = 11 + 14 = 25
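Because the block-sums array holds only one entry per block, the same shared-memory scan pattern can be applied to it by a single block. A minimal sketch, assuming the number of blocks fits within one block's thread count (names are illustrative):

```cuda
__global__ void scan_block_sums(int *d_block_sums, int numBlocks)
{
    extern __shared__ int s_sums[];
    int tid = threadIdx.x;

    // Load the per-block totals; out-of-range lanes hold the identity (0).
    s_sums[tid] = (tid < numBlocks) ? d_block_sums[tid] : 0;
    __syncthreads();

    // Inclusive Hillis-Steele scan over the per-block totals.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int val = (tid >= stride) ? s_sums[tid - stride] : 0;
        __syncthreads();
        s_sums[tid] += val;
        __syncthreads();
    }

    // Write the scanned totals back in place.
    if (tid < numBlocks) d_block_sums[tid] = s_sums[tid];
}
```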
3. Add Scanned Block Sums to Scanned Data (Adjustment Phase)
Each block adds the sum of all previous blocks to its scanned data to obtain the final inclusive scan.
Adjustment Phase:
Block 0:
No Adjustment Needed (First block).
Block 1:
Adjustment Value: Scanned_Block_Sums[0] = 11
Adjusted Scanned Data:
Original Scanned Data (Block 1):
Index: 4 5 6 7
Value: 4, 5, 11, 14
After Adjustment:
Index: 4 5 6 7
Value: 15, 16, 22, 25
Explanation:
Index 4: 4 + 11 = 15
Index 5: 5 + 11 = 16
Index 6: 11 + 11 = 22
Index 7: 14 + 11 = 25
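The adjustment phase can be sketched as a kernel in which every thread of block i (for i > 0) adds the inclusive sum of blocks 0 through i - 1 to its element. Names are illustrative:

```cuda
__global__ void add_block_sums(int *d_scanned,
                               const int *d_scanned_block_sums, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Block 0 needs no adjustment; block i adds the inclusive
    // scanned sum of all preceding blocks, i.e. entry i - 1.
    if (blockIdx.x > 0 && gid < n)
        d_scanned[gid] += d_scanned_block_sums[blockIdx.x - 1];
}
```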
Final Inclusive Scan Output
Combining the adjusted scanned data from both blocks:
Final Inclusive Scan Output (Global Memory):
Index: 0 1 2 3 4 5 6 7
Value: 3, 4, 11, 11, 15, 16, 22, 25
Visual Summary
Below is a visual summary of the entire process:
Initial Input Array
Index: 0 1 2 3 4 5 6 7
Value: 3, 1, 7, 0, 4, 1, 6, 3
1. Block-Wise Scan Using Shared Memory
Block 0 (Indices 0-3)
Shared Memory (Before Scan):
3, 1, 7, 0
After Inclusive Scan in Shared Memory:
3, 4, 11, 11
Write Back to Global Memory:
Index: 0 1 2 3
Value: 3, 4, 11, 11
Block Sum:
11
Block 1 (Indices 4-7)
Shared Memory (Before Scan):
4, 1, 6, 3
After Inclusive Scan in Shared Memory:
4, 5, 11, 14
Write Back to Global Memory:
Index: 4 5 6 7
Value: 4, 5, 11, 14
Block Sum:
14
2. Scan of Block Sums
Block Sums Array:
Index: 0 1
Value: 11, 14
After Inclusive Scan:
Index: 0 1
Value: 11, 25
3. Adjustment Phase
Final Adjusted Scanned Data:
Block 0:
Index: 0 1 2 3
Value: 3, 4, 11, 11
Block 1:
Index: 4 5 6 7
Original Value: 4, 5, 11, 14
Adjustment: +11
Adjusted Value: 15, 16, 22, 25
Final Output
Final Inclusive Scan Output:
Index: 0 1 2 3 4 5 6 7
Value: 3, 4, 11, 11, 15, 16, 22, 25
Detailed Walkthrough with Shared and Global Memory Interaction
1. Block-Wise Scan Kernel Execution
Global Memory: Stores the Input Array and the Scanned Data.
Shared Memory: Temporarily holds a block's subset of the input data for fast access during the scan.
For Each Block:
Load Data into Shared Memory:
Each thread reads one element from the input array (global memory) into shared memory.
Example:
Thread 0: Reads d_input[0] = 3 → s_data[0] = 3
Thread 1: Reads d_input[1] = 1 → s_data[1] = 1
Thread 2: Reads d_input[2] = 7 → s_data[2] = 7
Thread 3: Reads d_input[3] = 0 → s_data[3] = 0
Perform Scan in Shared Memory:
The scan is performed within shared memory, ensuring all threads within the block collaborate efficiently.
Synchronization:
__syncthreads() is used to synchronize threads after loading data and after each scan phase.
Write Scanned Data and Block Sum to Global Memory:
After the scan:
Each thread writes its scanned value back to the Scanned Data array in global memory.
The last thread writes the block's total sum to the Block Sums Array in global memory.
2. Scan Block Sums Kernel Execution
Global Memory: Holds the Block Sums Array.
Shared Memory: Used to perform a scan on the block sums.
Process:
Load Block Sums into Shared Memory:
Each thread reads one block sum into shared memory.
Perform Scan on Block Sums:
An inclusive scan is performed on the block sums within shared memory.
This results in a scanned block sums array.
Write Scanned Block Sums Back to Global Memory:
The scanned block sums are written back to global memory for the adjustment phase.
3. Adjustment Phase Kernel Execution
Global Memory: Contains both the Scanned Data and the Scanned Block Sums.
Shared Memory: Not used in this phase; data is directly adjusted in global memory.
Process:
Add Scanned Block Sums to Scanned Data:
Each thread in block i (for i > 0) adds Scanned_Block_Sums[i - 1] to its scanned value.
Example:
Block 0: No addition needed.
Block 1: Adds 11 (Scanned_Block_Sums[0]) to each of its scanned values.
Illustrative Diagrams
While text-based diagrams have limitations, the following representations can help clarify the data flow and memory interactions.
Diagram 1: Initial Data Partitioning
Global Memory:
+---+---+---+---+---+---+---+---+
| 3 | 1 | 7 | 0 | 4 | 1 | 6 | 3 |
+---+---+---+---+---+---+---+---+
| | | | | | | |
v v v v v v v v
Block 0 Block 1
(Indices 0-3) (Indices 4-7)
Diagram 2: Block-Wise Scan Using Shared Memory
Block 0
Shared Memory (s_data):
+---+---+---+---+
| 3 | 1 | 7 | 0 |
+---+---+---+---+
|
v
After Scan:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+
Write Back to Global Memory:
Scanned Data:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+
Block 1
Shared Memory (s_data):
+---+---+---+---+
| 4 | 1 | 6 | 3 |
+---+---+---+---+
|
v
After Scan:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+
Write Back to Global Memory:
Scanned Data:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+
Diagram 3: Scanning Block Sums
Block Sums Array:
+----+----+
| 11 | 14 |
+----+----+
After Scan:
Scanned Block Sums Array:
+----+----+
| 11 | 25 |
+----+----+
Diagram 4: Adjustment Phase
Final Scanned Data Before Adjustment:
Block 0:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+
Block 1:
+---+---+----+----+
| 4 | 5 | 11 | 14 |
+---+---+----+----+
Scanned Block Sums:
+----+----+
| 11 | 25 |
+----+----+
Adjustment:
Block 0: No adjustment
Block 1: Add 11 to each element
Final Scanned Data:
Block 0:
+---+---+----+----+
| 3 | 4 | 11 | 11 |
+---+---+----+----+
Block 1:
+----+----+----+----+
| 15 | 16 | 22 | 25 |
+----+----+----+----+
Final Output:
+---+---+----+----+----+----+----+----+
| 3 | 4 | 11 | 11 | 15 | 16 | 22 | 25 |
+---+---+----+----+----+----+----+----+
Memory Interaction Summary
Shared Memory:
Used During:
Block-Wise Scan: Each block loads its subset of data into shared memory, performs the scan, and writes back the scanned data.
Benefits:
Fast Access: Shared memory allows for low-latency access during the scan operations within a block.
Synchronization: Threads within a block can synchronize their operations efficiently.
Global Memory:
Used For:
Input and Output Data: The original input array and the final scanned output reside in global memory.
Block Sums Array: After each block performs its local scan, the block sums are stored in a separate global memory array.
Scanned Block Sums: The scanned block sums are also stored in global memory for the adjustment phase.
Benefits:
Persistence: Data persists across kernel launches, allowing different kernels to access and modify it.
Scalability: Can handle large datasets that exceed the capacity of shared memory.
Interaction Between Shared and Global Memory:
Within a Kernel:
Loading Data: Threads load data from global memory to shared memory for fast access.
Writing Results: After computations in shared memory, results are written back to global memory.
Across Kernels:
Block Sums: The block sums are written to global memory by one kernel and read by another kernel during the scan of block sums.
Adjustment Phase: The scanned block sums are read from global memory by the adjustment phase kernel to modify the scanned data.
Final Notes
Block Size Selection:
Choose a block size that is a multiple of the warp size (typically 32) for optimal performance.
Common choices are 128, 256, or 512 threads per block, depending on the GPU architecture and input size.
Handling Non-Power-of-Two Sizes:
Pad the input data with zeros if necessary to ensure each block has a complete set of data.
This avoids out-of-bounds memory accesses during scan operations.
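Rather than physically padding the input array, the out-of-range lanes can simply load the identity element (0 for addition) inside the block-scan kernel. A guarded-load sketch (variable names are illustrative):

```cuda
// Guarded load inside the block-scan kernel: lanes past the end of the
// input contribute the additive identity, so no separate padding pass
// or out-of-bounds read is possible.
int gid = blockIdx.x * blockDim.x + threadIdx.x;
s_data[threadIdx.x] = (gid < n) ? d_input[gid] : 0;
```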
Synchronization:
Use __syncthreads() within blocks to synchronize threads during shared memory operations.
Ensure that all threads have completed their work before proceeding to avoid race conditions.
Recursive vs. Iterative Scanning:
For very large input sizes, recursive scanning of block sums may be necessary.
Alternatively, an iterative approach with multiple kernel launches can be employed.
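A recursive host-side driver can be sketched as follows: each level scans the block sums of the level below until a single block suffices. The kernel names (block_scan, add_block_sums) and the in-place use of d_data are assumptions for illustration; error checking is omitted.

```cuda
// Recursive host driver sketch: inclusive-scan d_data of length n in place.
void scan_recursive(int *d_data, int n, int blockSize)
{
    int numBlocks = (n + blockSize - 1) / blockSize;
    int *d_block_sums;
    cudaMalloc(&d_block_sums, numBlocks * sizeof(int));

    // Phase 1: per-block scan, emitting one total per block.
    block_scan<<<numBlocks, blockSize, blockSize * sizeof(int)>>>(
        d_data, d_data, d_block_sums, n);

    if (numBlocks > 1) {
        // Phase 2: recurse until the block sums fit in a single block.
        scan_recursive(d_block_sums, numBlocks, blockSize);
        // Phase 3: fold the scanned block sums back into the data.
        add_block_sums<<<numBlocks, blockSize>>>(d_data, d_block_sums, n);
    }
    cudaFree(d_block_sums);
}
```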
Error Handling and Optimization:
Always check for CUDA errors after kernel launches and memory operations.
Profile your implementation using tools like NVIDIA Nsight Compute to identify and eliminate bottlenecks.
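A common pattern for the error checking mentioned above is to wrap every CUDA runtime call in a checking macro; the macro name here is a convention, not a library API.

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
// After a kernel launch, check the launch itself:
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());
```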