4. Hardware Implementation
The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread) that is described in SIMT Architecture. The instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level parallelism extensively through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores, however, instructions are issued in order, and there is no branch prediction and no speculative execution.
SIMT Architecture and Hardware Multithreading describe the architecture features of the streaming multiprocessor that are common to all devices. Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x provide the specifics for devices of compute capabilities 3.x, 5.x, 6.x, and 7.x respectively.
The NVIDIA GPU architecture uses a little-endian representation.
4.1. SIMT Architecture
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.
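As an illustration of this fixed partitioning, here is a minimal sketch (kernel and array names are illustrative, not from the guide) of how a thread in a 1D block can derive which warp, and which lane within that warp, it belongs to:

    __global__ void show_warp_layout(int *warp_of, int *lane_of)
    {
        int tid    = threadIdx.x;        // linear thread ID within the block
        int warpId = tid / warpSize;     // warp 0 holds threads 0..31, warp 1 holds 32..63, ...
        int laneId = tid % warpSize;     // this thread's position inside its warp (0..31)

        int gid = blockIdx.x * blockDim.x + tid;
        warp_of[gid] = warpId;
        lane_of[gid] = laneId;
    }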
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.
Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with Volta and beyond. See Compute Capability 7.x for further details.
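As an example of what revisiting warp-synchronous code can look like, here is a minimal sketch (assuming CUDA 9 or later and that all 32 lanes of the warp are active; the helper name warp_reduce_sum is illustrative) of an intra-warp sum reduction written with the explicit *_sync shuffle intrinsics instead of relying on implicit warp-synchronicity:

    __inline__ __device__ float warp_reduce_sum(float val)
    {
        // Each step names its participating lanes explicitly through the full mask.
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffffu, val, offset);
        return val;   // lane 0 ends up holding the sum of all 32 lanes
    }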
Notes
The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled). Threads can be inactive for a variety of reasons including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.
If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x), and which thread performs the final write is undefined.
If an atomic instruction executed by a warp reads, modifies, and writes to the same location in global memory for more than one of the threads of the warp, each read/modify/write to that location occurs and they are all serialized, but the order in which they occur is undefined.
Annotations and experience sharing:
CUDA can execute tens of thousands of threads at the same time, while a traditional CPU can only run a few dozen threads at once, a few hundred at most. How is this difference in execution capability achieved? An NVIDIA card is further divided into SMs (Streaming Multiprocessors) and the SPs inside them, and individual threads are mapped onto SMs and SPs for execution; that is how tens of thousands of threads can run simultaneously. We often say a GTX 1080 has 2560 SPs. Concretely, groups of 128 SPs form one SM, and the GTX 1080 chip has 20 SMs, giving 2560 SPs in total. More precisely, an SM is closer to what we traditionally call a CPU core, and the 128 SPs inside it simply mean that it can execute 128 lanes of a particular computation at the same time. We all know today's CPUs have SSE/AVX/AVX-512 vector capability; for example, many machines (ASUS servers, for instance) have upgraded to Skylake server CPUs, where each core has 2 AVX-512 ports and each port can execute 16 floating-point operations at once, so a core can execute 32 floating-point operations simultaneously. That already has a bit of the flavor of the SPs inside an SM. The difference is that on a common compute capability 6.1 Pascal part, the 128 SPs in one SM can execute 128 floating-point operations at once, far more than even the newest CPU's 32, and an ordinary GTX 1080 has 20 such SMs, while 20-core CPUs are rare.
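If you want to check these numbers on your own card, here is a minimal sketch using the CUDA runtime's device query (the GTX 1080 values in the comments are expectations to verify, not something this code guarantees):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0
        printf("SMs: %d\n", prop.multiProcessorCount);                              // 20 on a GTX 1080
        printf("Resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);  // 2048
        printf("Warp size: %d\n", prop.warpSize);                                   // 32
        return 0;
    }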
Beyond the SM/SP structure, NVIDIA cards have many other mechanisms to keep this huge number of SPs and SMs busy. One is the difference in how threads are mapped. The GPU maps each thread onto one SP, so every thread looks independent and processes one scalar operation at a time (for example one fused multiply-add, D = A * B + C). The CPU instead maps each thread onto vector execution units, so every CPU thread looks vectorized, and one thread executes several lanes of work at once (for example 32 D = A * B + C operations at the same time).
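A minimal sketch of the GPU side of this mapping (kernel and array names are illustrative): each thread computes a single scalar D = A * B + C, and the parallelism comes from launching one thread per element rather than from vector lanes inside the thread.

    __global__ void fma_kernel(float *d, const float *a, const float *b,
                               const float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // the one element this thread owns
        if (i < n)
            d[i] = a[i] * b[i] + c[i];                  // one fused multiply-add per thread
    }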
There is no fundamental difference between the two approaches, but the GPU's is easier to use. Why? Because each thread only appears to compute one simple piece of data, and performance comes from the sheer number of threads.
The CPU's vector extensions, on the other hand, force the programmer to arrange and coordinate N lanes of identical data at once: how they can be computed together, which lanes of the vector result are useful, which must be masked off, and so on. So the GPU offers a simplified model that makes it easier to extract performance. This is also why many people use OpenCL even on CPUs, especially multi-core CPUs: OpenCL uses the same simplified model as CUDA, a single thread computing a simple task plus a massive number of threads, which is much simpler than using vector instructions on the CPU directly. This simple and pleasant model is called SIMT (the CPU's is called SIMD).
SIMT sounds simple to describe. You have seen some kernels, in particular the image-rotation kernel we read together a few days ago: the kernel itself is very short, only a few lines, yet it rotates an entire image (picture that kernel's code in your mind here),
Looking at the kernel directly, it just executes its few lines in order: (1) determine this thread's current coordinate; (2) from the current coordinate compute the rotated coordinate; (3) read the value of the pixel at the rotated position; (4) write that value into the pixel of the destination image this thread is responsible for. A placeholder sketch follows below.
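The original rotation kernel is not reproduced in this post, so the following is only a placeholder sketch of steps (1)-(4) for a 90-degree clockwise rotation (all names and the choice of rotation angle are assumptions for illustration). Here src is H rows by W columns and dst is W rows by H columns.

    __global__ void rotate90_cw(unsigned char *dst, const unsigned char *src,
                                int W, int H)
    {
        // (1) this thread's coordinate in the destination image
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in dst, 0..H-1
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in dst,    0..W-1
        if (x >= H || y >= W) return;

        // (2) the corresponding (rotated) coordinate in the source image
        int srcRow = H - 1 - x;
        int srcCol = y;

        // (3) read the pixel at the rotated position, (4) write it to this thread's pixel
        dst[y * H + x] = src[srcRow * W + srcCol];
    }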
You will see that this kernel actually processes only a single pixel. That is SIMT: the code compiles into instructions, and every pixel is processed by the same instructions, while an enormous number of threads (as many threads as there are pixels in the image) execute at the same time. The practical effect is that one instruction is executed by many, many threads: Single Instruction, Multiple Threads, abbreviated SIMT. It looks simple on the surface, but the hardware has a few small details.
(1) To save hardware resources (transistors) for control logic, these many threads are grouped 32 at a time, and one instruction is issued to be shared by the whole group. This saves a lot of hardware (for example, there is no need to fetch the instruction 32 times for 32 threads; fetching it once is enough). This packaging is called a warp.
On many NVIDIA GPUs a warp maps directly onto an array of 32 SPs. For example, the 128 SPs in a GTX 1080 SM form 4 arrays of 32 SPs each, so in theory 4 warps of 32 threads can be executing at the same moment (the actual details differ slightly, but that does not hinder understanding). As a result, the 32 threads inside a warp have a much tighter relationship, which leads to possible optimizations, or things to keep in mind while writing code; the manual will get to those later.
(2) CPU cores have hyper-threading. Among the Intel CPUs we commonly see, a core may run 2 threads, some run 4, multiplying the number of threads a core can keep in flight by 2x-4x.
Back to the GPU. Although a GTX 1080 SM has only 128 SPs, the number of threads that SM can have in flight is far more than 128: it can hold 2048 threads at the same time, which amounts to 16-way "hyper-threading". Of course hyper-threading is Intel's term; in CUDA these are called resident threads. The resident threads all live on the SM at the same time, and whichever is ready gets executed (for example, if one thread suddenly stalls because it needs to access memory, a slow, long-latency operation, other resident threads can be executed in the meantime). Whether it is the CPU's hyper-threading or the GPU's per-SM resident threads, the formal name is simultaneous multithreading (SMT). Simultaneous multithreading improves the utilization of the execution units of a CPU core or a GPU SM, because once one of these threads stalls, one or more of the others can slip in and execute. With each SM able to hold 2048 threads and 20 SMs on the chip (for the 1080), the GPU can truly have 2048 × 20 = 40960 threads executing at once, roughly forty thousand threads actually in flight, which is a staggering number. And that is not all: CUDA also lets you launch far more threads at once, using the <<<>>> syntax to specify a group (block) of threads times however many groups you want, forming one grid launch. For example, I can request 512 threads per block × 10000000 blocks. Although the 1080 can only have about forty thousand threads, roughly 80 such blocks, executing at the same time, the hardware automatically frees a slot as soon as some block on some SM finishes, and automatically moves the next block from the waiting groups into the vacancy. That is exactly what you just saw:
As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
As soon as blocks finish and slots open up on the multiprocessors (SMs), other blocks that have not yet run automatically move up and execute. This is how the fully automatic, massive-thread execution capability is built; that is point (2).
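A minimal sketch of such an over-subscribed launch (kernel and variable names are illustrative): far more blocks are requested than can ever be resident, and the hardware feeds waiting blocks onto SMs as earlier ones finish.

    __global__ void big_kernel(float *data, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                                  // some per-element work
    }

    // host side:
    //   int threads = 512;
    //   int blocks  = (int)((n + threads - 1) / threads);    // may be millions of blocks
    //   big_kernel<<<blocks, threads>>>(d_data, n);          // only ~80 blocks resident at a time on a GTX 1080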
Through (1), many threads (a warp) sharing instructions, which saves transistors (for the same core area you get far stronger execution capability than a CPU core, because the control logic is simplified and the transistors go into execution units instead), and through (2), the massive-thread execution capability just described, we get the GPU's power.
But that is still not the end. The two points above also cause some problems. Problem (A): I want a global synchronization; how do I do that in CUDA? If a block has 512 threads and you ask for 100000000 blocks, even a 20-SM 1080 cannot execute them all simultaneously; earlier blocks must finish and leave vacancies before the next block can get onto an SM. So although logically you requested 100000000 blocks at the same time, in reality it cannot happen (they go up in batches). Therefore, asking all of these blocks to synchronize once when they reach, say, the one-third point of the kernel's code is impossible, because some blocks have not even started executing, and for them to start, the earlier blocks must finish; everyone cannot collectively pause at some line (the hypothetical one-third point) and then continue. This is the legendary difficulty of global synchronization in CUDA. Fortunately, Pascal and CUDA 9 allow global synchronization for kernels small enough that all of their blocks can be resident on the GPU's SMs at once, but more on that later. Besides this, points (1) and (2), mainly (1), cause another problem: because many threads share one instruction, executing different branches forces the hardware to emulate the branch, which lowers performance, for example:
if (a[i] > 3) { operation 1; } else { operation 2; } Suppose i here is the thread index. We know threads always come 32 at a time, sharing one instruction. So when 16 of them satisfy the if and should execute operation 1, while the other 16 do not satisfy the condition and should execute operation 2 in the else, and, as just said, the same instruction has to be shared by all 32, what do they do? To handle this, SIMT GPUs have several hardware mechanisms, one of which is the execution mask; in effect the GPU understands this code roughly as (not exactly, but a close correspondence):
Everyone executes the a[i] > 3 test; then everyone executes operation 1 and the results of 16 threads are thrown away (this is also called predicated execution or the like); then everyone executes operation 2 and the results of the other 16 threads are thrown away. Within this 32-thread warp, 50% of the execution capability is wasted; this is one of the costs of SIMT, because in the two passes half of the warp's results are meaningless. And yet, despite this, NVIDIA gives you the illusion that all threads can progress freely. That is the cost of SIMT's concrete implementation, and the benefit it brings to the programmer. Compared with the GPU's SIMT, the CPU's SIMD always requires the lanes to move in lockstep and cannot give the illusion that each lane progresses freely, which causes a lot of (mental) inconvenience.
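A minimal sketch of such a divergent branch (kernel and array names are illustrative; "operation 1" and "operation 2" are stand-ins for real work): whenever the lanes of a warp disagree on a[i] > 3, both statements are issued to the whole warp with the non-participating lanes masked off.

    __global__ void divergent(const int *a, int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (a[i] > 3)
            out[i] = a[i] * 2;     // "operation 1": only lanes with a[i] > 3 keep this result
        else
            out[i] = a[i] + 100;   // "operation 2": the remaining lanes keep this one
        // Up to half of each pass's work is thrown away when a warp is split 16/16.
    }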
This is also what the manual says:
A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread.
SIMD exposes the SIMD width (for example 16 lanes, 8 lanes, 4 lanes) to the person writing the software, while SIMT only requires you to describe the execution and branching of a single thread. (Branching can be handled in several ways, for example by masking off half of the results as described above.) SIMT automatically gives you the flexibility of every thread appearing to execute independently; of course, on some hardware this has a cost (the manual will later cover divergent branches, the cost of branching inside a warp), and Volta alleviates it somewhat. Volta (compute capability 7.x) is said to allow individual threads to make genuinely independent forward progress, but the manual does not say what hardware mechanism enables this. Since I have not tested it yet, I cannot evaluate or describe the performance gain it brings (compared with 6.x or lower). What we do need to know is that Volta brings a stronger SIMT, and better performance in special cases (such as branching inside a warp).
Note that the end of this section introduces a concept:
The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled).
On ordinary NVIDIA cards (leaving Volta aside for now, since it is untested), try to keep execution inside a warp uniform (non-uniform execution is logically fine, but may cost performance). This is where active threads and inactive threads come in: for example, the threads masked off during the branch above are the inactive ones. Inactive threads occupy SPs but contribute no useful computation, so effective performance is wasted.
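A minimal sketch of observing active versus inactive lanes (assumes CUDA 9 or later; kernel and array names are illustrative): inside the branch, __activemask() reports which lanes actually took it.

    __global__ void count_active(const int *flags, int *active_count)
    {
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % warpSize;
        if (flags[i] != 0) {
            unsigned mask = __activemask();   // one bit per lane currently executing this line
            if (lane == __ffs(mask) - 1)      // lowest active lane reports for its warp
                active_count[i / warpSize] = __popc(mask);   // number of active lanes
        }
    }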
Finally, the section talks about atomic operations versus ordinary read-modify-writes to the same memory location. Without atomic operations, if, say, all the threads of a warp modify the value at the same address, the outcome is unstable and unsafe; which thread's write ends up there last is undefined. For example, suppose we need to do a +1 on &p[100], the element at index 100 of p, and all 32 threads of one warp, or even all 3200 threads of 100 warps, perform this operation. Because these threads execute concurrently (they may be spread over 20 SMs), the final result of the modification is indeterminate. If an atomic operation is used instead, the result is guaranteed (p[100] ends up holding the original value plus 3200): the atomic gives you the illusion that each read-old-value, add-1, write-new-value sequence runs undisturbed, and that the 3200 threads performed it one after another in order. (What actually happens depends on the hardware, but the surface illusion is maintained.) That is safe.
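A minimal sketch of the contrast (kernel and pointer names are illustrative; p[100] follows the example above):

    __global__ void counter_kernel(int *p)
    {
        // UNSAFE: a plain read-modify-write; updates from different threads overwrite
        // each other, and the final value of p[100] is undefined even if 3200 threads run it.
        // p[100] = p[100] + 1;

        // SAFE: the read-add-write happens as one indivisible operation, so after
        // 3200 threads p[100] has grown by exactly 3200.
        atomicAdd(&p[100], 1);
    }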
This section also states a few concepts that mean just what they say; for example, the GPU has no branch prediction and no speculative execution, it simply executes instructions in order, one after another. It relies on switching among its massive number of threads (2048 per SM) to hide the cost of certain operations (for example latency); that is TLP (thread-level parallelism, search for more). Consecutive independent instructions within a thread can also cover each other's costs; that is ILP (instruction-level parallelism). TLP and ILP are both worth looking up and understanding on your own.
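A minimal sketch of ILP within one thread (kernel and array names are illustrative; n is assumed even, and the two-elements-per-thread layout is chosen only to show independent instructions, not as a tuned access pattern):

    __global__ void ilp_kernel(const float *a, const float *b, float *out, int n)
    {
        int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);  // two elements per thread
        if (i + 1 < n) {
            float x0 = a[i];          // the four loads are independent of each other,
            float x1 = a[i + 1];      // so their latencies can overlap
            float y0 = b[i];
            float y1 = b[i + 1];
            out[i]     = x0 + y0;     // dependent arithmetic only starts afterwards
            out[i + 1] = x1 + y1;
        }
    }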
Similarly, what happens when an SM is not filled with 2048 threads, for example only 1536 threads (75% occupancy), is something the manual covers later. Not being able to fill the SM (for particular reasons, for example each thread using too many registers) may lower performance, but not necessarily; it depends on the situation.
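A minimal sketch of asking the runtime about occupancy (my_kernel and the 256-thread block size are illustrative assumptions): cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks of a given kernel can be resident per SM, from which the occupancy fraction follows.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *p) { /* per-thread work */ }

    int main()
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, 256, 0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        float occupancy = (float)(blocksPerSM * 256) / prop.maxThreadsPerMultiProcessor;
        printf("Resident blocks per SM: %d, occupancy: %.0f%%\n", blocksPerSM, occupancy * 100.0f);
        return 0;
    }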
If anything is unclear, please leave a comment below this post,
or start a thread on our technical forum, bbs.gpuworld.cn.