I bet you can easily accelerate your program by 10x by adopting CUDA. But that 10x is far from the end of the story: fully optimized CUDA code could give you a 100x boost. To write highly optimized CUDA kernels, you need to understand some GPU concepts well. However, I have found that some of these concepts are not explained well on the internet and can easily confuse people.
These concepts are confusing because:
- Some terms were borrowed from CPU programming, but they do not mean the same thing as their CPU originals.
- Some terms were invented from the hardware’s point of view, i.e. to describe an actual hardware module or component. Other terms were invented for the software side; they are abstract concepts that don’t exist physically. These two kinds of terms are often mixed together, and you do need to understand both the software and hardware sides to optimize CUDA code.
I hope this post clarifies some of these CUDA concepts.
A GPU is made up of multiple units called SMs (Streaming Multiprocessors). As a concrete example, the Titan V GPU has 80 SMs. Each SM can execute many threads concurrently; for the Titan V, the maximum number of concurrent threads on a single SM is 2048. But these threads are not exactly the same as the threads run by a CPU.
These threads are grouped. A group of threads is called a warp, and it contains 32 threads. So a Titan V SM can execute 2048 threads, but those 2048 threads are grouped into 2048 / 32 = 64 warps.
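
To make the grouping concrete, here is a minimal Python sketch (plain host-side arithmetic, not GPU code). The helper name `split_into_warps` is made up for illustration, and numbering the threads 0 to 2047 is a simplifying assumption of mine; only the 2048 and 32 figures come from the text.

```python
WARP_SIZE = 32              # threads per warp
MAX_THREADS_PER_SM = 2048   # Titan V figure quoted above


def split_into_warps(num_threads, warp_size=WARP_SIZE):
    """Chop consecutive thread ids into groups of `warp_size` (hypothetical helper)."""
    thread_ids = list(range(num_threads))
    return [thread_ids[i:i + warp_size] for i in range(0, num_threads, warp_size)]


warps = split_into_warps(MAX_THREADS_PER_SM)
print(len(warps))     # 64 warps per SM
print(warps[1][:3])   # [32, 33, 34]: the second warp starts at thread 32
```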
What makes these threads different from CPU threads is that CPU threads can each execute a different task at the same time, whereas the GPU threads in a single warp can only execute the same task. For example, suppose you want to perform 2 operations, c = a + b and d = a * b, on lots of data. You can’t assign one warp to perform both calculations at the same time. All 32 threads must work on the same calculation before moving on to the next one. The data each thread processes can be different, for example 1 + 2 and 45 + 17, but they all have to work on the addition before moving on to the multiplication.
This is different from CPU threads: if your CPU is powerful enough to run 32 threads simultaneously, each of them can work on a separate calculation. Therefore, the concept of a GPU thread is akin to the SIMD (Single Instruction, Multiple Data) feature of a CPU.
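
The lockstep behaviour can be imitated with a toy Python model. This is only a sketch of the idea, not how the hardware is implemented: `warp_execute` is a hypothetical function that applies one instruction to all 32 lanes, and the input values are arbitrary.

```python
import operator

WARP_SIZE = 32


def warp_execute(instruction, a_values, b_values):
    """Apply ONE instruction across every lane of the warp (toy model)."""
    assert len(a_values) == len(b_values) == WARP_SIZE
    return [instruction(a, b) for a, b in zip(a_values, b_values)]


# Each lane holds its own data...
a = list(range(WARP_SIZE))
b = [2] * WARP_SIZE

# ...but the whole warp performs the addition first, then the multiplication.
c = warp_execute(operator.add, a, b)
d = warp_execute(operator.mul, a, b)
print(c[:4], d[:4])   # [2, 3, 4, 5] [0, 2, 4, 6]
```

There is no way in this model to ask half of the lanes to add while the other half multiplies in the same step, which is the point of the comparison with CPU threads.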
The best real-life analogy I have found for GPU threads in a warp is the pen trick from school. I don’t know if you have had a similar experience, but in high school, when we were punished by having to copy words, for not finishing homework for example, we would bind a group of pens together in a vertical row so that we could write multiple copies of the same content at the same time. That helped us finish quicker.
As powerful as the school trick is, holding multiple pens doesn’t turn you into a mighty octopus with many tentacles that can perform multiple tasks at the same time. This is the major difference between CPU threads and GPU threads.
Why is this important? Because when you launch a GPU program, you need to specify the thread organization you want, and a careless configuration can easily hurt performance or waste GPU resources.
From the software’s point of view, GPU threads are organized into blocks. A block is a pure software concept that doesn’t exist in the hardware design. Unlike the warp, which is the physical thread organization, a block doesn’t have a fixed number of threads. You can specify any number of threads up to 1024 per block, but that doesn’t mean every thread count will perform the same.
Suppose we want to perform 330 calculations. One natural way is to launch 10 blocks, each working on 33 calculations with 33 threads. But because threads are grouped into warps of 32, finishing 33 calculations involves 2 warps, i.e. 64 threads. In total, we will be using 640 threads.
Another way is to launch 11 blocks of 32 threads. This time, each block fits into a single warp, so in total 11 warps, i.e. 352 threads, will be launched. There will still be some waste, but not as much as with the first option.
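
The two options can be compared with a few lines of Python. This is just the arithmetic from the example, with a hypothetical helper `lanes_occupied` that rounds each block up to whole warps.

```python
import math

WARP_SIZE = 32


def lanes_occupied(num_blocks, threads_per_block, warp_size=WARP_SIZE):
    """Thread slots consumed once every block is rounded up to whole warps."""
    warps_per_block = math.ceil(threads_per_block / warp_size)
    return num_blocks * warps_per_block * warp_size


work_items = 330
option_a = lanes_occupied(num_blocks=10, threads_per_block=33)
option_b = lanes_occupied(num_blocks=11, threads_per_block=32)

print(option_a, option_a - work_items)   # 640 310 (310 slots wasted)
print(option_b, option_b - work_items)   # 352 22  (22 slots wasted)
```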
Another thing to consider is the number of SMs, because each block is executed within exactly one SM; a block can’t be split across SMs. If the workload is very large, i.e. we have lots of blocks, the blocks can fill all available SMs and still leave work remaining. In that case, the GPU runs part of the work as a first batch and then finishes the remaining blocks in following batches. For example, the Titan V has 80 SMs. Suppose we have a complex job split into 90 blocks, each heavy enough to occupy a whole SM. The first batch runs 80 blocks on 80 SMs, and the second batch runs the remaining 10 blocks, during which 70 SMs are idle. A better way is to adjust the workload so that each block does less work and finishes sooner, for example by splitting the same job into 160 smaller blocks. You still need 2 batches, but both batches now keep all 80 SMs busy, and because each block does only 90 / 160 ≈ 56% of the original per-block work, each batch finishes quicker and the overall run time goes down.
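
The batch arithmetic can be sketched in Python. This is a deliberately simplified model, not GPU code: it assumes each SM works on one unit of work at a time, that all units take the same time, and that the 160 smaller units split the same total work evenly. The function name `run_time` and the time units are made up for illustration.

```python
import math

NUM_SMS = 80   # Titan V


def run_time(num_units, time_per_unit, num_sms=NUM_SMS):
    """Total time when units run in batches of at most `num_sms` (toy model)."""
    batches = math.ceil(num_units / num_sms)
    return batches * time_per_unit


# 90 big units: 2 batches, and the second one uses only 10 of the 80 SMs.
original = run_time(num_units=90, time_per_unit=1.0)

# Same total work spread over 160 smaller units: 2 full batches.
rebalanced = run_time(num_units=160, time_per_unit=90 / 160)

print(original, rebalanced)   # 2.0 1.125
```

Under these assumptions the rebalanced version takes a little over half the time, even though the number of batches has not changed.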
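Lastly, if you are familiar with NVIDIA’s marketing terms, a GPU’s power is often measured by its number of CUDA cores. But when you learn CUDA programming, you seldom see it as a programming concept. That is because a CUDA core is a hardware unit, not a thread or a warp: it is one of the arithmetic units inside an SM that executes an FP32 instruction for one thread of a warp. A Titan V SM has 64 of them, so the card has 80 (SMs) * 64 (CUDA cores / SM) = 5120 CUDA cores. That 64 happens to equal the 2048 / 32 = 64 warps an SM can hold, but the two numbers do not match on every GPU, so don’t read one as the definition of the other.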
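
The 5120 figure can be reproduced in a couple of lines of Python. The sketch gets there two ways: through the 64 warps per SM derived earlier, and through 64 FP32 units per SM. The latter number comes from NVIDIA’s published Volta specifications rather than from this post, so treat it as an assumption here.

```python
NUM_SMS = 80
MAX_THREADS_PER_SM = 2048
WARP_SIZE = 32
FP32_UNITS_PER_SM = 64   # assumption: NVIDIA's published Volta figure

warps_per_sm = MAX_THREADS_PER_SM // WARP_SIZE

print(NUM_SMS * warps_per_sm)        # 5120
print(NUM_SMS * FP32_UNITS_PER_SM)   # 5120, the same total on this card
```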