Starting from a matrix multiplication example, this article walks through functional design and performance optimization step by step.
mmult implementation and optimization steps
Matrix multiplication optimization steps:

| Step | Function | Key concepts / Keywords |
| --- | --- | --- |
| 1. CPU implementation | Implement a simple matrix multiplication on the host side, used for result verification and performance comparison. | |
| 2. OpenCL implementation | Implement the OpenCL-based FPGA matrix multiplication hardware design on the device side. | OpenCL API functions |
| 3. Add local memory | Use local memory to reduce the number of data accesses to DDR. | Kernel optimization; Local memory |
| 4. Burst read/write | Use burst transfers for more efficient reads and writes between DDR and local memory. | Kernel optimization; Burst read/write |
| 5. Array partitioning | Achieve better compute performance through loop unrolling and array partitioning. | Array partitioning; Loop unrolling; Pipeline design |
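This article covers steps 1 and 2 in detail. As a rough preview of where step 3 ("add local memory") is heading, here is a minimal sketch of an SDAccel OpenCL kernel that stages both operands on chip. The kernel name `mmult_local` and the fixed `MAX_DIM` of 64 (matching `DATA_SIZE` used later) are my assumptions for illustration; this is not the code of the later steps in the series.

```c
// Sketch only: stage both operands in on-chip local memory so each element of
// in1/in2 is fetched from DDR exactly once, then compute out of local buffers.
#define MAX_DIM 64  // assumed upper bound, matching DATA_SIZE below

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult_local( __global int* in1, __global int* in2,
                  __global int* out, int dim)
{
    local int A[MAX_DIM * MAX_DIM];
    local int B[MAX_DIM * MAX_DIM];
    local int C[MAX_DIM * MAX_DIM];

    // Copy DDR -> local memory (simple sequential loops map well to bursts).
    READ_A: for (int i = 0; i < dim * dim; i++) A[i] = in1[i];
    READ_B: for (int i = 0; i < dim * dim; i++) B[i] = in2[i];

    // Compute entirely out of local memory.
    ROW: for (int i = 0; i < dim; i++) {
        COL: for (int j = 0; j < dim; j++) {
            int acc = 0;
            PROD: for (int k = 0; k < dim; k++) {
                acc += A[i * dim + k] * B[k * dim + j];
            }
            C[i * dim + j] = acc;
        }
    }

    // Copy local memory -> DDR.
    WRITE_C: for (int i = 0; i < dim * dim; i++) out[i] = C[i];
}
```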
mmult implementation on the CPU side

```cpp
// Reference implementation: performs matrix multiplication out = in1 x in2
void mmult_cpu( int *in1, // Input matrix 1
                int *in2, // Input matrix 2
                int *out, // Output matrix (out = A x B)
                int dim   // Matrix size of one dimension
              )
{
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j++) {
            for (int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}
```
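As a small illustration of the row-major indexing (my example, not part of the original article), multiplying two 2 x 2 matrices with the reference function gives the textbook result. Note that mmult_cpu accumulates into out, so the output must be zero-initialized first, exactly as main() does further below.

```cpp
#include <cstdio>

// Assumes mmult_cpu from above is visible in this translation unit.
// Values are arbitrary and only serve to check the i * dim + j indexing.
int main() {
    int in1[4] = {1, 2, 3, 4};   // [[1 2],[3 4]]
    int in2[4] = {5, 6, 7, 8};   // [[5 6],[7 8]]
    int out[4] = {0, 0, 0, 0};   // must start at zero: mmult_cpu uses +=
    mmult_cpu(in1, in2, out, 2);
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 19 22 43 50
    return 0;
}
```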
mmult implementation on the FPGA side
OpenCL host-side initialization flow
Host-side code implementation
```cpp
//OpenCL utility layer include
#include "xcl2.hpp"
#include <vector>
#include <iostream>
#include <cstdlib>

//Array Size to access
#define DATA_SIZE 64

uint64_t get_duration_ns (const cl::Event &event) {
    uint64_t nstimestart, nstimeend;
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_START, &nstimestart);
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_END, &nstimeend);
    return (nstimeend - nstimestart);
}

//CPU implementation of Matrix Multiplication
//The inputs are of the size (DATA_SIZE x DATA_SIZE)
void mmult_cpu (
    int *in1, //Input Matrix 1
    int *in2, //Input Matrix 2
    int *out, //Output Matrix
    int dim   //One dimension of matrix
)
{
    //Performs Matrix multiply Out = In1 x In2
    for(int i = 0; i < dim; i++) {
        for(int j = 0; j < dim; j++) {
            for(int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}

//Functionality to setup OpenCL context and trigger the Kernel
uint64_t mmult_fpga (
    std::vector<int,aligned_allocator<int>>& source_in1,          //Input Matrix 1
    std::vector<int,aligned_allocator<int>>& source_in2,          //Input Matrix 2
    std::vector<int,aligned_allocator<int>>& source_fpga_results, //Output Matrix
    int dim                                                       //One dimension of matrix
)
{
    int size = dim;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //The get_xil_devices will return vector of Xilinx Devices
    std::vector<cl::Device> devices = xcl::get_xil_devices();
    cl::Device device = devices[0];

    //Creating Context and Command Queue for selected Device
    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    std::string device_name = device.getInfo<CL_DEVICE_NAME>();

    //find_binary_file() locates the OpenCL binary created with the xocc
    //compiler, and import_binary_file() loads it so it can be programmed onto
    //the device. A binary can contain many kernels that execute on the device.
    std::string binaryFile = xcl::find_binary_file(device_name, "mmult");
    cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
    devices.resize(1);
    cl::Program program(context, devices, bins);

    //This call will extract a kernel out of the program we loaded in the
    //previous line. A kernel is an OpenCL function that is executed on the
    //FPGA. This function is defined in the src/mmult.cl file.
    cl::Kernel kernel(program, "mmult");

    //These commands will allocate memory on the FPGA. The cl::Buffer
    //objects can be used to reference the memory locations on the device.
    //The cl::Buffer object cannot be referenced directly and must be passed
    //to other OpenCL functions.
    cl::Buffer buffer_in1(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                          matrix_size_bytes, source_in1.data());
    cl::Buffer buffer_in2(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                          matrix_size_bytes, source_in2.data());
    cl::Buffer buffer_output(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
                             matrix_size_bytes, source_fpga_results.data());

    //These commands will load the source_in1 and source_in2 vectors from the host
    //application into the buffer_in1 and buffer_in2 cl::Buffer objects. The data
    //will be transferred from system memory over PCIe to the FPGA on-board
    //DDR memory.
    q.enqueueMigrateMemObjects({buffer_in1, buffer_in2}, 0 /* 0 means from host */);

    //Set the kernel arguments
    int narg = 0;
    kernel.setArg(narg++, buffer_in1);
    kernel.setArg(narg++, buffer_in2);
    kernel.setArg(narg++, buffer_output);
    kernel.setArg(narg++, size);

    cl::Event event;
    uint64_t kernel_duration = 0;

    //Launch the kernel
    q.enqueueTask(kernel, NULL, &event);

    //The result of the previous kernel execution will need to be retrieved in
    //order to view the results. This call will write the data from the
    //buffer_output cl_mem object to the source_fpga_results vector
    q.enqueueMigrateMemObjects({buffer_output}, CL_MIGRATE_MEM_OBJECT_HOST);
    q.finish();

    kernel_duration = get_duration_ns(event);
    return kernel_duration;
}

int main(int argc, char** argv)
{
    //Allocate Memory in Host Memory
    int size = DATA_SIZE;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //When creating a buffer with a user pointer, the user pointer is used under
    //the hood if and only if it is properly aligned (page aligned). When it is
    //not aligned, the runtime has no choice but to create its own host-side
    //buffer that backs the user pointer. This in turn implies that all operations
    //that move data to/from the device incur an extra memcpy between the runtime's
    //own host buffer and the user pointer. It is therefore recommended to use
    //aligned_allocator when creating Buffer/Memory Objects, so that the user
    //buffer is page aligned and used directly.
    std::vector<int,aligned_allocator<int>> source_in1(matrix_size_bytes);
    std::vector<int,aligned_allocator<int>> source_in2(matrix_size_bytes);
    std::vector<int,aligned_allocator<int>> source_fpga_results(matrix_size_bytes);
    std::vector<int,aligned_allocator<int>> source_cpu_results(matrix_size_bytes);

    //Create the test data
    for(int i = 0; i < DATA_SIZE * DATA_SIZE; i++) {
        source_in1[i] = i;
        source_in2[i] = i * i;
        source_cpu_results[i] = 0;
        source_fpga_results[i] = 0;
    }

    uint64_t kernel_duration = 0;

    //Compute CPU Results
    mmult_cpu(source_in1.data(), source_in2.data(), source_cpu_results.data(), size);
    //Compute FPGA Results
    kernel_duration = mmult_fpga(source_in1, source_in2, source_fpga_results, size);

    //Compare the results of FPGA to CPU
    bool match = true;
    for (int i = 0; i < size * size; i++) {
        if (source_fpga_results[i] != source_cpu_results[i]) {
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << source_cpu_results[i]
                      << " FPGA result = " << source_fpga_results[i] << std::endl;
            match = false;
            break;
        }
    }

    std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl;
    std::cout << "Wall Clock Time (Kernel execution) in ns: " << kernel_duration << std::endl;
    std::cout << "Note: Wall Clock Time is meaningful for real hardware execution only, "
              << "not for emulation." << std::endl;
    return (match ? EXIT_SUCCESS : EXIT_FAILURE);
}
```
Device-side code (a naive implementation of the mmult logic)

```c
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1, //Read-only input matrix 1
            __global int* in2, //Read-only input matrix 2
            __global int* out, //Output matrix
            int dim            //One dimension of the matrix
          )
{
    //Reads the data from DDR, performs the computation
    //and writes back the result to DDR.
    LOOP1: for (int i = 0; i < dim; i++) {
        LOOP2: for (int j = 0; j < dim; j++) {
            out[i * dim + j] = 0;
            LOOP3: for (int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}
```
Analysis of experimental results
- Vivado HLS log file analysis (pay particular attention to the WARNING messages)
WARNING: [XFORM 203-542] Cannot flatten a loop nest 'LOOP2' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:47:44) in function 'mmult' :
WARNING: [XFORM 203-542] the outer loop is not a perfect loop.
INFO: [XFORM 203-541] Flattening a loop nest 'LOOP1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:45:43) in function 'mmult'.
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:00.77 ; elapsed = 00:00:00.88 . Memory (MB): peak = 494.320 ; gain = 156.758 ; free physical = 19872 ; free virtual = 45217
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'LOOP3'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 2, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 3, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 4, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 130, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 193, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 225, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 241, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 249, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 253, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 255, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 256, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
INFO: [SCHED 204-61] Unable to satisfy pipeline directive: Unable to pipeline the region.
INFO: [SCHED 204-11] Finished scheduling.
- HLS Report
- Synthesis result analysis
How to analyze the synthesis results:
* First, check whether the optimization directives that were added are actually implemented in synthesis; if they are not, find out why.
* Then analyze how the code is pipelined. For nested for loops, SDAccel fully unrolls every loop inside a pipelined loop, and tries to flatten the loop levels above the pipelined loop; if flattening succeeds, they are merged into a single pipeline. (A minimal sketch contrasting a flattenable and a non-flattenable nest follows this list.)
* For the pipelined loop, further check what II (initiation interval) is achieved and how low it could theoretically be pushed.
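To make the flattening condition concrete, here is a small illustrative fragment of my own (plain C, not the article's kernel code); the only difference between the two functions is whether a statement sits between the two loop levels.

```c
// Illustrative only: the shape of a loop nest decides whether HLS can flatten it.
#define N 64

// Perfect nest: nothing between the two loop levels, so the outer and inner
// loops can be flattened into a single loop of N*N iterations.
void copy_flattenable(const int src[N * N], int dst[N * N]) {
    OUTER_OK: for (int i = 0; i < N; i++) {
        INNER_OK: for (int j = 0; j < N; j++) {
            dst[i * N + j] = src[i * N + j];
        }
    }
}

// Imperfect nest: the zeroing of row_sum[i] between OUTER_KO and INNER_KO plays
// the same role as "out[i * dim + j] = 0;" in LOOP2 of the kernel above, and
// prevents the two levels from being flattened.
void rowsum_not_flattenable(const int src[N * N], int row_sum[N]) {
    OUTER_KO: for (int i = 0; i < N; i++) {
        row_sum[i] = 0;
        INNER_KO: for (int j = 0; j < N; j++) {
            row_sum[i] += src[i * N + j];
        }
    }
}
```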
From the log above we can see that this hardware implementation has several problems:
* First, the kernel code contains no optimization directives, so there is nothing to check regarding whether directives were implemented.
* Second, of the three nested for loops, only the innermost loop LOOP3 gets a pipeline attempt. The middle loop is not flattened, and the reported reason is: "the outer loop is not a perfect loop." LOOP2 then continues the flattening attempt outward toward LOOP1; if that succeeds, LOOP2 and LOOP1 are merged into LOOP1_LOOP2 (the INFO line in the log shows this flattening did succeed). In general, flattening fails for one of two reasons: either the outer for loop contains statements in addition to the inner for loop (an imperfect nest), or the inner loop's bound is a variable; these are essentially the two nest shapes contrasted in the sketch above. In this example, LOOP2 cannot be flattened with LOOP3 for the first reason: the body of LOOP2 contains the statement out[i * dim + j] = 0;, while the out array is also updated inside LOOP3. Put the other way around, if the compiler did flatten LOOP2 with LOOP3, it would have no way to merge the out[i * dim + j] = 0; statement into the body of the single resulting loop.
* Finally, consider the II of LOOP3, which the tool tries to pipeline. The log shows the scheduler backing off from II = 1 all the way to II = 256 and still failing ("Unable to satisfy pipeline directive"), because of a carried dependence on the gmem interface: each iteration both reads and writes out[i * dim + j] through the same global-memory port. As a result, none of the loops end up pipelined.

For more on the gmem carried dependence issue, see my other article analyzing the gmem carry dependency.
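As a minimal sketch of one possible fix (my illustration under the analysis above, not the article's step-3 code; the kernel name mmult_acc is hypothetical): accumulating into a private register instead of out[] removes the read-modify-write on gmem that causes the carried dependence, leaving a single write to out per (i, j).

```c
// Sketch only: same interface as the naive kernel, but LOOP3 accumulates into a
// private register. out[] is no longer read inside LOOP3, so the gmem carried
// dependence reported in the log goes away. (LOOP2 still cannot be flattened
// with LOOP3, because the accumulator initialization and the final store sit
// between the two loop levels; step 3's local-memory approach tackles the DDR
// access pattern more broadly.)
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult_acc( __global int* in1, //Read-only input matrix 1
                __global int* in2, //Read-only input matrix 2
                __global int* out, //Output matrix
                int dim            //One dimension of the matrix
              )
{
    LOOP1: for (int i = 0; i < dim; i++) {
        LOOP2: for (int j = 0; j < dim; j++) {
            int acc = 0;                        // private accumulator instead of out[]
            LOOP3: for (int k = 0; k < dim; k++) {
                acc += in1[i * dim + k] * in2[k * dim + j];
            }
            out[i * dim + j] = acc;             // single write per (i, j)
        }
    }
}
```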
- Hardware emulation results
- Hardware implementation results
References
- Xilinx GitHub: Xilinx/SDAccel_Examples, cpu_to_fpga examples
- UG1253: SDx Pragma Reference Guide, 2017.2
- UG1207: SDAccel Environment Optimization Guide