Intel OpenCL Pipelines (Part 4): Cases Where a Pipeline Is Not Ideal

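A pipeline can fall short of ideal in two main ways: cases that affect the initiation interval (II) and cases that do not. The former push the II above 1; the latter are called serial regions.

As a rule, the critical loop must have an II of 1. For a non-critical loop, such as an outer loop, the II requirement can be relaxed under some conditions.

A non-ideal pipeline is usually caused by a loop-carried dependency, so this article introduces loop-carried dependencies first and then the two non-ideal cases. This is only a brief overview; the specific optimizations are covered in detail later.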

loop-carried dependency

Loop-carried dependencies fall into two categories:

data dependency
memory dependency

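Data dependency:

In the example below, the variable sum carries a data dependency, which results in a serial region.

Here the serial region is the inner loop body. To keep the result correct, the compiler makes sure that the inner loop of the current outer iteration has finished, that is, has finished updating sum, before a new outer iteration enters the inner loop.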


kernel void serially_execute (global int * restrict A,
                            global int * restrict B,
                            global int * restrict result, unsigned N)
{
    int sum = 0;
    for (unsigned i = 0; i < N; i++) {
        int res;
        for (int j = 0; j < N; j++) {
            sum += A[i*N + j];   // sum is carried across inner and outer iterations
        }
        sum += B[i];
    }
    *result = sum;
}


Loop Report:
-  Loop "Block1" (file k.cl line 9)
| Pipelined with successive iterations launched every 2 cycles due to:
|
|     Pipeline structure: every terminating loop with subloops has iterations launched at least 2 cycles apart.
|     Having successive iterations launched every two cycles should still lead to good performance if the inner loop is pipelined well and has sufficiently high number of iterations.
|
| Iterations executed serially across the region listed below.
| Only a single loop iteration will execute inside the listed region.
| This will cause performance degradation unless the region is pipelined well (can process an iteration every cycle).
|
|     Loop "Block2" (file k.cl line 10)
|     due to:
|     Data dependency on variable sum (file k.cl line 7)
|
|
|-  Loop "Block2" (file k.cl line 10)
    Pipelined well. Successive iterations are launched every cycle.

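Memory dependency:

In the example below, the accesses to global memory depend on each other, which pushes the II above 1.

A RAM has only one read port and one write port, so it can service at most one read and one write at the same time. Intel FPGAs can of course build a RAM with three read ports and one write port, but even that cannot be read and written at arbitrary locations the way registers can. To keep a variable stored in RAM correct, when the same variable (by name) is both loaded and stored inside one for loop, the compiler forces an ordering between those operations. This is where memory dependencies come from.

Also keep in mind that accesses to global memory have a large latency.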


#define N 128

__kernel void unoptimized( __global int* restrict A )
{
  for(unsigned i = 0; i < N; i++)
    A[N-i] = A[i];
}


Loop Report:
-  Loop "Block1" (file unoptimized4.cl line 5)
 |  Pipelined with successive iterations launched every 324 cycles due to:
 |
 |  Memory dependency on Load Operation from: (file unoptimized4.cl line 6)
 |     Store Operation (file unoptimized4.cl line 6)
 |  Largest Critical Path Contributors:
 |     49%: Load Operation (file unoptimized4.cl line 6)
 |     49%: Store Operation (file unoptimized4.cl line 6)
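
To see where the memory dependency in this kernel comes from, the loop's index pattern can be replayed on the host. The sketch below is plain C rather than kernel code, and the function name count_loop_carried_loads is made up for illustration. It counts the iterations whose load A[i] reads an element that an earlier iteration stored through A[n - i].

```c
/* Host-side model (not kernel code) of the loop
 *     for (i = 0; i < n; i++) A[n - i] = A[i];
 * Returns how many iterations load an element that an EARLIER
 * iteration of the same loop has already stored. */
unsigned count_loop_carried_loads(unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++) {
        /* Iteration k stores to A[n - k], so A[i] is written by k = n - i. */
        unsigned k = n - i;
        /* The load is loop-carried only if that store ran in an earlier
         * iteration. k < i also guarantees k < n, because i < n. */
        if (k < i) {
            count++;
        }
    }
    return count;
}
```

For n = 128 the function returns 63: iterations 65 through 127 each read a value written by an earlier iteration, so a load cannot be freely overlapped with the stores of the iterations before it.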

Initiation Interval (II) Greater Than 1

Ideally the II is 1, in which case the optimization report says:


Pipelined well. Successive iterations are launched every cycle.

When the II is greater than 1, the optimization report shows the following message:


Successive iterations launched every <N> cycles due to:

The text after "due to" identifies the location of the loop-carried dependency.
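
As a rough illustration of why the II matters, a common simplified model puts the cycle count of a pipelined loop at latency + II * (N - 1): the first iteration needs the full pipeline latency, and each further iteration is launched II cycles after the previous one. The C sketch below encodes that model. The function name and the latency parameter are assumptions for illustration and are not taken from the compiler's report.

```c
/* Simplified cycle model of a pipelined loop (an assumption, not a
 * figure reported by the compiler):
 *   n       - number of loop iterations
 *   ii      - initiation interval, cycles between successive launches
 *   latency - cycles one iteration needs to pass through the pipeline */
unsigned long long pipelined_loop_cycles(unsigned long long n,
                                         unsigned long long ii,
                                         unsigned long long latency)
{
    if (n == 0) {
        return 0;  /* an empty loop costs nothing in this model */
    }
    /* The last iteration is launched at cycle ii * (n - 1) and then
     * needs `latency` more cycles to complete. */
    return latency + ii * (n - 1);
}
```

With 128 iterations and a hypothetical latency of 324 cycles, the model gives 451 cycles at II = 1 and 41472 cycles at II = 324.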

Serial Regions

When serial regions exist, the optimization report gives the following information:


Loop Report:
-  Loop "Block1" (file k.cl line 9)
| Pipelined with successive iterations launched every 2 cycles.
|
| Iterations executed serially across the region listed below.
| Only a single loop iteration will execute inside the listed region.
| This will cause performance degradation unless the region is pipelined well (can process an iteration every cycle).
|
|     Loop "Block2" (file k.cl line 10)
|     due to:
|     Data dependency on variable sum (file k.cl line 7)

A serial region makes the loop execute as shown in the figure below:

[Figure: pipeline-41]

By contrast, a loop that pipelines well executes like this:

[Figure: pipeline-42]

As this shows, if the inner loop's iteration count N is large enough (much larger than the inner loop's latency), serial regions do not hurt performance much.
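
The claim above can be checked with a back-of-the-envelope model. The C sketch below is a simplification under stated assumptions: the inner loop runs at II = 1, a serial region forces each outer iteration to drain the inner pipeline before the next one enters, and the ideal case streams all inner iterations back to back. Both function names and the latency parameter are hypothetical.

```c
/* Cycles when the inner loop is a serial region: every outer iteration
 * pays the inner pipeline latency again before the next one may enter. */
unsigned long long serial_region_cycles(unsigned long long outer,
                                        unsigned long long inner,
                                        unsigned long long latency)
{
    if (outer == 0 || inner == 0) {
        return 0;
    }
    /* One pass over the inner loop at II = 1: latency + (inner - 1). */
    return outer * (latency + inner - 1);
}

/* Cycles in the ideal case: inner iterations of successive outer
 * iterations overlap, so the latency is paid only once. */
unsigned long long ideal_nested_cycles(unsigned long long outer,
                                       unsigned long long inner,
                                       unsigned long long latency)
{
    if (outer == 0 || inner == 0) {
        return 0;
    }
    return latency + outer * inner - 1;
}
```

With 100 outer iterations, an inner iteration count of 1000 and a latency of 10 cycles, the model gives 100900 cycles with the serial region against 100009 without it, an overhead of under 1%. With an inner iteration count of 10 and a latency of 100 cycles the figures are 10900 against 1099, roughly ten times slower.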

References

[Intel FPGA SDK for OpenCL Best Practices Guide]
