Starting from a matrix multiplication example, this article walks through functional design and performance optimization step by step.
mmult implementation and optimization steps
Matrix multiplication optimization steps:

| Step | Function | Key concepts / Keywords |
| --- | --- | --- |
| 1. CPU implementation | Implement a simple matrix multiplication on the host side, used for result verification and performance comparison. | |
| 2. OpenCL implementation | Implement the OpenCL-based FPGA matrix multiplication hardware design on the device side. | OpenCL API functions |
| 3. Add local memory | Use local memory to reduce the number of data accesses to DDR. | Kernel optimization; Local memory |
| 4. Burst read/write | Use burst transfers for more efficient reads and writes between DDR and local memory. | Kernel optimization; Burst read/write |
| 5. Array partitioning | Achieve better compute performance through loop unrolling and array partitioning. | Array partitioning; Loop unrolling; Pipeline design |
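This article covers steps 1 and 2 in detail. As a rough preview of where step 3 ("add local memory") is heading, here is a minimal sketch of an SDAccel OpenCL kernel that stages both operands on chip. The kernel name `mmult_local` and the fixed `MAX_DIM` of 64 (matching `DATA_SIZE` used later) are my assumptions for illustration; this is not the code of the later steps in the series.

```c
// Sketch only: stage both operands in on-chip local memory so each element of
// in1/in2 is fetched from DDR exactly once, then compute out of local buffers.
#define MAX_DIM 64  // assumed upper bound, matching DATA_SIZE below

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult_local( __global int* in1, __global int* in2,
                  __global int* out, int dim)
{
    local int A[MAX_DIM * MAX_DIM];
    local int B[MAX_DIM * MAX_DIM];
    local int C[MAX_DIM * MAX_DIM];

    // Copy DDR -> local memory (simple sequential loops map well to bursts).
    READ_A: for (int i = 0; i < dim * dim; i++) A[i] = in1[i];
    READ_B: for (int i = 0; i < dim * dim; i++) B[i] = in2[i];

    // Compute entirely out of local memory.
    ROW: for (int i = 0; i < dim; i++) {
        COL: for (int j = 0; j < dim; j++) {
            int acc = 0;
            PROD: for (int k = 0; k < dim; k++) {
                acc += A[i * dim + k] * B[k * dim + j];
            }
            C[i * dim + j] = acc;
        }
    }

    // Copy local memory -> DDR.
    WRITE_C: for (int i = 0; i < dim * dim; i++) out[i] = C[i];
}
```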
mmult implementation on the CPU side

```cpp
// Reference implementation: performs matrix multiplication out = in1 x in2
void mmult_cpu( int *in1, // Input matrix 1
                int *in2, // Input matrix 2
                int *out, // Output matrix (out = A x B)
                int dim   // Matrix size of one dimension
              )
{
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j++) {
            for (int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}
```
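As a small illustration of the row-major indexing (my example, not part of the original article), multiplying two 2 x 2 matrices with the reference function gives the textbook result. Note that mmult_cpu accumulates into out, so the output must be zero-initialized first, exactly as main() does further below.

```cpp
#include <cstdio>

// Assumes mmult_cpu from above is visible in this translation unit.
// Values are arbitrary and only serve to check the i * dim + j indexing.
int main() {
    int in1[4] = {1, 2, 3, 4};   // [[1 2],[3 4]]
    int in2[4] = {5, 6, 7, 8};   // [[5 6],[7 8]]
    int out[4] = {0, 0, 0, 0};   // must start at zero: mmult_cpu uses +=
    mmult_cpu(in1, in2, out, 2);
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 19 22 43 50
    return 0;
}
```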
mmult implementation on the FPGA side
OpenCL host-side initialization flow
Host-side code implementation
```cpp
//OpenCL utility layer include
#include "xcl2.hpp"
#include <vector>
#include <iostream>
#include <cstdlib>

//Array Size to access
#define DATA_SIZE 64

uint64_t get_duration_ns (const cl::Event &event) {
    uint64_t nstimestart, nstimeend;
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_START, &nstimestart);
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_END, &nstimeend);
    return (nstimeend - nstimestart);
}

//CPU implementation of Matrix Multiplication
//The inputs are of the size (DATA_SIZE x DATA_SIZE)
void mmult_cpu (
    int *in1, //Input Matrix 1
    int *in2, //Input Matrix 2
    int *out, //Output Matrix
    int dim   //One dimension of matrix
)
{
    //Performs Matrix multiply Out = In1 x In2
    for(int i = 0; i < dim; i++) {
        for(int j = 0; j < dim; j++) {
            for(int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}

//Functionality to setup OpenCL context and trigger the Kernel
uint64_t mmult_fpga (
    std::vector<int,aligned_allocator<int>>& source_in1,          //Input Matrix 1
    std::vector<int,aligned_allocator<int>>& source_in2,          //Input Matrix 2
    std::vector<int,aligned_allocator<int>>& source_fpga_results, //Output Matrix
    int dim                                                       //One dimension of matrix
)
{
    int size = dim;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //The get_xil_devices will return vector of Xilinx Devices
    std::vector<cl::Device> devices = xcl::get_xil_devices();
    cl::Device device = devices[0];

    //Creating Context and Command Queue for selected Device
    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    std::string device_name = device.getInfo<CL_DEVICE_NAME>();

    //find_binary_file() locates the OpenCL binary created with the xocc
    //compiler, and import_binary_file() loads it so it can be programmed onto
    //the device. A binary can contain many kernels that execute on the device.
    std::string binaryFile = xcl::find_binary_file(device_name, "mmult");
    cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
    devices.resize(1);
    cl::Program program(context, devices, bins);

    //This call will extract a kernel out of the program we loaded in the
    //previous line. A kernel is an OpenCL function that is executed on the
    //FPGA. This function is defined in the src/mmult.cl file.
    cl::Kernel kernel(program, "mmult");

    //These commands will allocate memory on the FPGA. The cl::Buffer
    //objects can be used to reference the memory locations on the device.
    //The cl::Buffer object cannot be referenced directly and must be passed
    //to other OpenCL functions.
    cl::Buffer buffer_in1(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                          matrix_size_bytes, source_in1.data());
    cl::Buffer buffer_in2(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                          matrix_size_bytes, source_in2.data());
    cl::Buffer buffer_output(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
                             matrix_size_bytes, source_fpga_results.data());

    //These commands will load the source_in1 and source_in2 vectors from the host
    //application into the buffer_in1 and buffer_in2 cl::Buffer objects. The data
    //will be transferred from system memory over PCIe to the FPGA on-board
    //DDR memory.
    q.enqueueMigrateMemObjects({buffer_in1, buffer_in2}, 0 /* 0 means from host */);

    //Set the kernel arguments
    int narg = 0;
    kernel.setArg(narg++, buffer_in1);
    kernel.setArg(narg++, buffer_in2);
    kernel.setArg(narg++, buffer_output);
    kernel.setArg(narg++, size);

    cl::Event event;
    uint64_t kernel_duration = 0;

    //Launch the kernel
    q.enqueueTask(kernel, NULL, &event);

    //The result of the previous kernel execution will need to be retrieved in
    //order to view the results. This call will write the data from the
    //buffer_output cl_mem object to the source_fpga_results vector
    q.enqueueMigrateMemObjects({buffer_output}, CL_MIGRATE_MEM_OBJECT_HOST);
    q.finish();

    kernel_duration = get_duration_ns(event);
    return kernel_duration;
}

int main(int argc, char** argv)
{
    //Allocate Memory in Host Memory
    int size = DATA_SIZE;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //When creating a buffer with a user pointer, the user pointer is used under
    //the hood if and only if it is properly aligned (page aligned). When it is
    //not aligned, the runtime has no choice but to create its own host-side
    //buffer that backs the user pointer. This in turn implies that all operations
    //that move data to/from the device incur an extra memcpy between the runtime's
    //own host buffer and the user pointer. It is therefore recommended to use
    //aligned_allocator when creating Buffer/Memory Objects, so that the user
    //buffer is page aligned and used directly.
    std::vector<int,aligned_allocator<int>> source_in1(matrix_size_bytes);
    std::vector<int,aligned_allocator<int>> source_in2(matrix_size_bytes);
    std::vector<int,aligned_allocator<int>> source_fpga_results(matrix_size_bytes);
    std::vector<int,aligned_allocator<int>> source_cpu_results(matrix_size_bytes);

    //Create the test data
    for(int i = 0; i < DATA_SIZE * DATA_SIZE; i++) {
        source_in1[i] = i;
        source_in2[i] = i * i;
        source_cpu_results[i] = 0;
        source_fpga_results[i] = 0;
    }

    uint64_t kernel_duration = 0;

    //Compute CPU Results
    mmult_cpu(source_in1.data(), source_in2.data(), source_cpu_results.data(), size);
    //Compute FPGA Results
    kernel_duration = mmult_fpga(source_in1, source_in2, source_fpga_results, size);

    //Compare the results of FPGA to CPU
    bool match = true;
    for (int i = 0; i < size * size; i++) {
        if (source_fpga_results[i] != source_cpu_results[i]) {
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << source_cpu_results[i]
                      << " FPGA result = " << source_fpga_results[i] << std::endl;
            match = false;
            break;
        }
    }

    std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl;
    std::cout << "Wall Clock Time (Kernel execution) in ns: " << kernel_duration << std::endl;
    std::cout << "Note: Wall Clock Time is meaningful for real hardware execution only, "
              << "not for emulation." << std::endl;
    return (match ? EXIT_SUCCESS : EXIT_FAILURE);
}
```
Device-side code (a naive implementation of the mmult logic)

```c
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1, //Read-only input matrix 1
            __global int* in2, //Read-only input matrix 2
            __global int* out, //Output matrix
            int dim            //One dimension of the matrix
          )
{
    //Reads the data from DDR, performs the computation
    //and writes back the result to DDR.
    LOOP1: for (int i = 0; i < dim; i++) {
        LOOP2: for (int j = 0; j < dim; j++) {
            out[i * dim + j] = 0;
            LOOP3: for (int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}
```
Analysis of experimental results
- Vivado HLS log file analysis (pay particular attention to the WARNING messages)
WARNING: [XFORM 203-542] Cannot flatten a loop nest 'LOOP2' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:47:44) in function 'mmult' :
WARNING: [XFORM 203-542] the outer loop is not a perfect loop.
INFO: [XFORM 203-541] Flattening a loop nest 'LOOP1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:45:43) in function 'mmult'.
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:00.77 ; elapsed = 00:00:00.88 . Memory (MB): peak = 494.320 ; gain = 156.758 ; free physical = 19872 ; free virtual = 45217
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'LOOP3'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 2, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 3, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 4, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 130, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 193, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 225, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 241, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 249, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 253, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 255, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 256, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
INFO: [SCHED 204-61] Unable to satisfy pipeline directive: Unable to pipeline the region.
INFO: [SCHED 204-11] Finished scheduling.
- HLS Report
- Synthesis result analysis
How to analyze the synthesis results:
* First, check whether the optimization directives that were added are actually implemented in synthesis; if they are not, find out why.
* Then analyze how the code is pipelined. For nested for loops, SDAccel fully unrolls every loop inside a pipelined loop, and tries to flatten the loop levels above the pipelined loop; if flattening succeeds, they are merged into a single pipeline. (A minimal sketch contrasting a flattenable and a non-flattenable nest follows this list.)
* For the pipelined loop, further check what II (initiation interval) is achieved and how low it could theoretically be pushed.
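To make the flattening condition concrete, here is a small illustrative fragment of my own (plain C, not the article's kernel code); the only difference between the two functions is whether a statement sits between the two loop levels.

```c
// Illustrative only: the shape of a loop nest decides whether HLS can flatten it.
#define N 64

// Perfect nest: nothing between the two loop levels, so the outer and inner
// loops can be flattened into a single loop of N*N iterations.
void copy_flattenable(const int src[N * N], int dst[N * N]) {
    OUTER_OK: for (int i = 0; i < N; i++) {
        INNER_OK: for (int j = 0; j < N; j++) {
            dst[i * N + j] = src[i * N + j];
        }
    }
}

// Imperfect nest: the zeroing of row_sum[i] between OUTER_KO and INNER_KO plays
// the same role as "out[i * dim + j] = 0;" in LOOP2 of the kernel above, and
// prevents the two levels from being flattened.
void rowsum_not_flattenable(const int src[N * N], int row_sum[N]) {
    OUTER_KO: for (int i = 0; i < N; i++) {
        row_sum[i] = 0;
        INNER_KO: for (int j = 0; j < N; j++) {
            row_sum[i] += src[i * N + j];
        }
    }
}
```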
From the log above we can see that this hardware implementation has several problems:
* First, the kernel code contains no optimization directives, so there is nothing to check regarding whether directives were implemented.
* Second, of the three nested for loops, only the innermost loop LOOP3 gets a pipeline attempt. The middle loop is not flattened, and the reported reason is: "the outer loop is not a perfect loop." LOOP2 then continues the flattening attempt outward toward LOOP1; if that succeeds, LOOP2 and LOOP1 are merged into LOOP1_LOOP2 (the INFO line in the log shows this flattening did succeed). In general, flattening fails for one of two reasons: either the outer for loop contains statements in addition to the inner for loop (an imperfect nest), or the inner loop's bound is a variable; these are essentially the two nest shapes contrasted in the sketch above. In this example, LOOP2 cannot be flattened with LOOP3 for the first reason: the body of LOOP2 contains the statement out[i * dim + j] = 0;, while the out array is also updated inside LOOP3. Put the other way around, if the compiler did flatten LOOP2 with LOOP3, it would have no way to merge the out[i * dim + j] = 0; statement into the body of the single resulting loop.
* Finally, consider the II of LOOP3, which the tool tries to pipeline. The log shows the scheduler backing off from II = 1 all the way to II = 256 and still failing ("Unable to satisfy pipeline directive"), because of a carried dependence on the gmem interface: each iteration both reads and writes out[i * dim + j] through the same global-memory port. As a result, none of the loops end up pipelined.

For more on the gmem carried dependence issue, see my other article analyzing the gmem carry dependency.
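As a minimal sketch of one possible fix (my illustration under the analysis above, not the article's step-3 code; the kernel name mmult_acc is hypothetical): accumulating into a private register instead of out[] removes the read-modify-write on gmem that causes the carried dependence, leaving a single write to out per (i, j).

```c
// Sketch only: same interface as the naive kernel, but LOOP3 accumulates into a
// private register. out[] is no longer read inside LOOP3, so the gmem carried
// dependence reported in the log goes away. (LOOP2 still cannot be flattened
// with LOOP3, because the accumulator initialization and the final store sit
// between the two loop levels; step 3's local-memory approach tackles the DDR
// access pattern more broadly.)
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult_acc( __global int* in1, //Read-only input matrix 1
                __global int* in2, //Read-only input matrix 2
                __global int* out, //Output matrix
                int dim            //One dimension of the matrix
              )
{
    LOOP1: for (int i = 0; i < dim; i++) {
        LOOP2: for (int j = 0; j < dim; j++) {
            int acc = 0;                        // private accumulator instead of out[]
            LOOP3: for (int k = 0; k < dim; k++) {
                acc += in1[i * dim + k] * in2[k * dim + j];
            }
            out[i * dim + j] = acc;             // single write per (i, j)
        }
    }
}
```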
- Hardware emulation results
- Hardware implementation results
References
- Xilinx GitHub: Xilinx/SDAccel_Examples, cpu_to_fpga examples
- UG1253: SDx Pragma Reference Guide, 2017.2
- UG1207: SDAccel Environment Optimization Guide