Starting from a matrix-multiplication example, this post walks step by step through functional design and performance optimization.
mmult implementation and optimization steps
Matrix multiplication optimization steps:

Step | Function | Key Concepts / Keywords |
---|---|---|
1. CPU implementation | A simple matrix multiply on the host, used to verify results and compare performance (see the sketch after this table) | — |
2. OpenCL implementation | An OpenCL-based FPGA matrix-multiply hardware design on the device | OpenCL API functions |
3. Add Local Memory | Use local memory to reduce the number of memory accesses | Kernel optimization; local memory |
4. Burst read/write | Use burst transfers for more efficient data movement between DDR and local memory | Kernel optimization; burst read/write |
5. Array partitioning | Use loop unrolling and array partitioning to achieve better compute performance | Array partitioning; loop unrolling; pipelining |
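As a reference for step 1, a minimal host-side implementation might look like the following. This is an assumed sketch (the post's actual host code is not shown here); it computes out = in1 × in2 for dim × dim row-major int matrices:

```c
// Hypothetical step-1 CPU reference: a straightforward triple loop,
// used to verify the FPGA results and as the performance baseline.
void mmult_sw(const int *in1, const int *in2, int *out, int dim)
{
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j++) {
            int acc = 0;
            for (int k = 0; k < dim; k++)
                acc += in1[i * dim + k] * in2[k * dim + j];
            out[i * dim + j] = acc;
        }
    }
}
```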
Design analysis and optimization idea 3 (pragma optimization)
After the previous two rounds of optimization, the structure of the code is no longer the problem. The key remaining issue is that only the innermost loop of the nested matrix-multiply loops is pipelined: the outer for loops cannot be flattened into the inner one, so the two outer levels are not pipelined. The most direct fix is to fully unroll the innermost for loop, further increasing the parallelism of the computation. Unrolling, however, requires partitioning the arrays accessed in the inner loop; otherwise the limited ports of a single memory block make the unroll impossible. We therefore use three pragmas: unroll the innermost for loop (unroll), partition the arrays used in the parallel computation (array partition), and pipeline the next-outer for loop (pipeline).
Code implementation

```c
#define MAX_SIZE 64

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1, // Read-only input matrix 1
            __global int* in2, // Read-only input matrix 2
            __global int* out, // Output matrix
            int dim            // One dimension of the matrix
          )
{
    // Local memory to store the input matrices.
    // Local memory is implemented as BRAM memory blocks.
    // Complete partition on dim 2 for in1, dim 1 for in2, and dim 2 for out.
    __local int local_in1[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));
    __local int local_in2[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 1)));
    __local int local_out[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));

    // Burst reads of the input matrices from DDR into local_in1 and local_in2
    read_in1: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in1[i][j] = in1[iter];
    }
    read_in2: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in2[i][j] = in2[iter];
    }

    // Functionally, "loop_3" only needs "dim" iterations, but for "loop_2"
    // to pipeline, "loop_3" must be unrolled, and an unrolled loop cannot
    // have a dynamic trip count. Iterating to MAX_SIZE gives better
    // throughput at the cost of additional resources.
    loop_1: for(int i = 0; i < dim; i++){
        __attribute__((xcl_pipeline_loop))
        loop_2: for(int j = 0; j < dim; j++){
            local_out[i][j] = 0;
            __attribute__((opencl_unroll_hint))
            loop_3: for(int k = 0; k < MAX_SIZE; k++){
                local_out[i][j] += local_in1[i][k] * local_in2[k][j];
            }
        }
    }

    // Burst write from local_out back to DDR memory
    write_out: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        out[iter] = local_out[i][j];
    }
}
```
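For completeness, here is a hedged sketch of how this kernel might be launched from the host. The function and variable names are assumptions (the post does not show its host code), and setup and error handling are omitted; because of reqd_work_group_size(1, 1, 1), the kernel runs as a single work-item task:

```c
#include <CL/cl.h>

// Hypothetical host-side launch; assumes context, queue, program, and the
// DDR-backed cl_mem buffers d_in1/d_in2/d_out have already been created.
void run_mmult(cl_command_queue queue, cl_program program,
               cl_mem d_in1, cl_mem d_in2, cl_mem d_out,
               int dim, int *h_out)
{
    cl_int err;
    cl_kernel kernel = clCreateKernel(program, "mmult", &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in1);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_in2);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_out);
    clSetKernelArg(kernel, 3, sizeof(int),    &dim);

    // Single work-item launch, matching reqd_work_group_size(1, 1, 1)
    clEnqueueTask(queue, kernel, 0, NULL, NULL);

    // Blocking read of the dim x dim result matrix back to the host
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0,
                        (size_t)dim * dim * sizeof(int), h_out,
                        0, NULL, NULL);
    clReleaseKernel(kernel);
}
```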
- xcl_pipeline_loop
Loop pipelining allows multiple operations of a loop to execute concurrently rather than strictly in sequence. Part A of the figure shows the default sequential execution: consecutive reads are 3 clock cycles apart (II = 3), and the final write completes 8 clock cycles after the first read. Part B shows the pipelined version: consecutive reads are only 1 cycle apart (II = 1), and the final write completes 4 cycles after the first read. With the same resources, pipelining thus improves both the initiation interval and the overall latency (a latency model is sketched after this list).
- xcl_array_partition
The figure shows the three ways an array can be partitioned: block, cyclic, and complete. For block and cyclic, the factor parameter determines how many independent RAMs the array is split into (factor = 2 in the figure). Block partitioning splits the original array into factor units of contiguous elements; cyclic partitioning splits it into factor units by interleaving the elements; complete partitioning decomposes the array into individual elements (registers). A multi-dimensional array can be partitioned along a chosen dimension (see the sketch after this list).
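The cycle counts described for xcl_pipeline_loop follow the usual pipelined-loop latency model: for a loop with N iterations, initiation interval II, and pipeline depth D, the total latency is approximately (N − 1) × II + D. As an illustrative example (numbers assumed, not from the report): with N = 64 and D = 4, a pipelined loop at II = 1 finishes in about 67 cycles, while sequential execution at one iteration per D cycles takes about 256.

For xcl_array_partition, a minimal sketch of the three modes on an 8-element array (array names are assumptions; the attribute syntax follows UG1253):

```c
// block:    elements {0,1,2,3} and {4,5,6,7} -> 2 banks of contiguous elements
__local int a[8] __attribute__((xcl_array_partition(block, 2, 1)));

// cyclic:   elements {0,2,4,6} and {1,3,5,7} -> 2 interleaved banks
__local int b[8] __attribute__((xcl_array_partition(cyclic, 2, 1)));

// complete: 8 independent registers, one per element
__local int c[8] __attribute__((xcl_array_partition(complete, 1)));
```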
Experimental result analysis
- Vivado HLS log file analysis
```
WARNING: [XFORM 203-104] Completely partitioning array 'local_in1' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in1' in dimension 2 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_in2' accessed through non-constant indices on dimension 1 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in2' in dimension 1 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_out' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_out' in dimension 2 completely.
INFO: [XFORM 203-11] Balancing expressions in function 'mmult' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:44:41)...125 expression(s) balanced.
INFO: [HLS 200-111] Finished Pre-synthesis Time (s): cpu = 00:00:06 ; elapsed = 00:00:06 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45173
INFO: [XFORM 203-541] Flattening a loop nest 'loop_1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:182:42) in function 'mmult'.
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9).
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9).
INFO: [XFORM 203-811] Inferring bus burst write of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9).
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:07 ; elapsed = 00:00:07 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45174
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'read_in1'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'read_in2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'loop_1_loop_2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 12.
INFO: [SCHED 204-61] Pipelining loop 'write_out'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 5.
INFO: [SCHED 204-11] Finished scheduling.
```
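As a rough sanity check of these numbers (an estimate, not taken from the report): the flattened 'loop_1_loop_2' executes dim × dim iterations, so for dim = 64 at II = 1 with depth 12, the nest should take about (64 × 64 − 1) × 1 + 12 = 4107 cycles.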
- HLS Report
- Synthesis result analysis
* First, the log confirms that the pragmas took effect: the three arrays are partitioned along the specified dimensions, and the innermost loop is successfully unrolled.
* Second, compared with the Burst Read/Write version of the matrix multiply, this version adds two optimization pragmas to parallelize the inner loops. From the pipelining point of view: the first group of for loops (the burst reads) pipelines successfully; in the second loop nest, the innermost for loop is unrolled in parallel (see the sketch after this list), the enclosing for loop is then flattened with its outer loop (reported as 'loop_1_loop_2'), and the whole nest is pipelined; the third for loop (the burst write) also pipelines successfully.
* Looking at the initiation interval after pipelining: every pipelined loop achieves II = 1.
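The comment in the kernel about loop_3 is worth illustrating: a loop can only be fully unrolled when its trip count is known at compile time, which is why loop_3 runs to MAX_SIZE rather than dim. A minimal sketch of the two forms (fragment for illustration only; a, b, and acc are assumed names):

```c
// Not fully unrollable: the trip count depends on the runtime value 'dim'
for (int k = 0; k < dim; k++)
    acc += a[k] * b[k];

// Fully unrollable: the trip count is fixed at compile time (MAX_SIZE),
// trading extra iterations and resources for throughput
__attribute__((opencl_unroll_hint))
for (int k = 0; k < MAX_SIZE; k++)
    acc += a[k] * b[k];
```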
- Hardware emulation results
- Hardware implementation results
(The original post shows the implementation reports for the four versions, mmult1 through mmult4, as figures here.)
Summary
At this point we have walked through a complete SDAccel example, from the CPU implementation to the FPGA implementation, optimized step by step. FPGA optimization falls into two main directions: one targets bandwidth and data throughput, the other targets compute performance. Newcomers typically consider only compute performance and keep increasing parallelism, but higher parallelism often runs into a severe bandwidth shortage: data throughput drops, the parallel logic is starved, and the expected speedup never materializes. In this matrix-multiply example we first implemented the basic functionality; the Local Memory and Burst Read/Write optimizations then improved memory-access efficiency and hence data throughput; only the final pragma optimization targets compute performance itself. The results are compared in the table below:
Version | Runtime (ns) | Speedup |
---|---|---|
Unoptimized (mmult1) | 365459518 | 1 |
Local Memory (mmult2) | 2106545 | 173.5 |
Burst R/W (mmult3) | 1616897 | 226.0 |
Pipeline (mmult4) | 127392 | 2868.8 |
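As a quick check, the speedup column is simply the ratio to the unoptimized runtime, e.g. 365459518 / 127392 ≈ 2868.8 for the pipelined version.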
References
- Xilinx GitHub: Xilinx/SDAccel_Examples, cpu_to_fpga
- UG1253: SDx Pragma Reference Guide, 2017.2
- UG1207: SDAccel Environment Optimization Guide