Using cv::cuda::GpuMat with thrust
Goal
Thrust is an extremely powerful library for a wide variety of cuda accelerated algorithms. However thrust is designed to work with vectors, not pitched matrices. The following tutorial will discuss wrapping cv::cuda::GpuMat's into thrust iterators that can be used with thrust algorithms.
This tutorial will show you how to:
- Wrap a GpuMat into a thrust iterator
- Fill a GpuMat with random numbers
- Sort a column of a GpuMat in place
- Copy values greater than 0 to a new gpu matrix
- Use streams with thrust
Wrapping a GpuMat into a thrust iterator
The following code will produce an iterator for a GpuMat:
/*
@brief GpuMatBeginItr returns a thrust compatible iterator to the beginning of a GPU mat's memory.
@param mat is the input matrix
@param channel is the channel of the matrix that the iterator is accessing. If set to -1, the iterator will access every element in sequential order
*/
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>> GpuMatBeginItr(cv::cuda::GpuMat mat, int channel = 0)
{
if (channel == -1)
{
mat = mat.reshape(1);
channel = 0;
}
CV_Assert(mat.depth() == cv::DataType<T>::depth);
CV_Assert(channel < mat.channels());
return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.ptr<T>(0) + channel),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), step_functor<T>(mat.cols, mat.step / sizeof(T), mat.channels())));
}
/*
@brief GpuMatEndItr returns a thrust compatible iterator to the end of a GPU mat's memory.
@param mat is the input matrix
@param channel is the channel of the matrix that the iterator is accessing. If set to -1, the iterator will access every element in sequential order
*/
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>> GpuMatEndItr(cv::cuda::GpuMat mat, int channel = 0)
{
if (channel == -1)
{
mat = mat.reshape(1);
channel = 0;
}
CV_Assert(mat.depth() == cv::DataType<T>::depth);
CV_Assert(channel < mat.channels());
return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.ptr<T>(0) + channel),
thrust::make_transform_iterator(thrust::make_counting_iterator(mat.rows*mat.cols), step_functor<T>(mat.cols, mat.step / sizeof(T), mat.channels())));
}
Our goal is to have an iterator that starts at the beginning of the matrix and increments correctly to access continuous matrix elements. This is trivial for a continuous row, but what about a column of a pitched matrix? To do this we need the iterator to be aware of the matrix dimensions and step. This information is embedded in the step_functor.
template<typename T> struct step_functor : public thrust::unary_function<int, int>
{
int columns;
int step;
int channels;
__host__ __device__ step_functor(int columns_, int step_, int channels_ = 1) : columns(columns_), step(step_), channels(channels_) { };
__host__ step_functor(cv::cuda::GpuMat& mat)
{
CV_Assert(mat.depth() == cv::DataType<T>::depth);
columns = mat.cols;
step = mat.step / sizeof(T);
channels = mat.channels();
}
__host__ __device__
int operator()(int x) const
{
int row = x / columns;
int idx = (row * step) + (x % columns)*channels;
return idx;
}
};
The step functor takes in an index value and returns the appropriate offset from the beginning of the matrix. The counting iterator simply increments over the range of pixel elements. Combined into a transform_iterator, we have an iterator that counts from 0 to M*N and increments correctly to account for the pitched memory layout of a GpuMat. Unfortunately this does not include any memory location information; for that we need a thrust::device_ptr. By combining a device pointer with the transform_iterator we can point thrust at the first element of our matrix and have it step through the memory accordingly.
Fill a GpuMat with random numbers
Now that we have some nice functions for creating thrust iterators, let's use them to do some things OpenCV can't. Unfortunately, at the time of this writing, OpenCV doesn't have any GPU random number generation. Thankfully thrust does, and interop between the two is now trivial. Example adapted from http://stackoverflow.com/questions/12614164/generating-a-random-number-vector-between-0-and-1-0-using-thrust
First we need to write a functor that will produce our random values.
struct prg
{
float a, b;
__host__ __device__
prg(float _a = 0.f, float _b = 1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
This takes an integer value and outputs a value between a and b. Now we will populate our matrix with values between -1 and 1 using a thrust transform.
{
cv::cuda::GpuMat d_value(1, 100, CV_32F);
auto valueBegin = GpuMatBeginItr<float>(d_value);
auto valueEnd = GpuMatEndItr<float>(d_value);
thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
cv::Mat h_value(d_value);
}
Sort a column of a GpuMat in place
Next we will sort a column of a GpuMat in place. We fill the matrix elements with random values and an index; afterwards we sort the random numbers and the indices together.
{
cv::cuda::GpuMat d_data(1, 100, CV_32SC2);
// Thrust compatible begin and end iterators to channel 1 of this matrix
auto keyBegin = GpuMatBeginItr<int>(d_data, 1);
auto keyEnd = GpuMatEndItr<int>(d_data, 1);
// Thrust compatible begin and end iterators to channel 0 of this matrix
auto idxBegin = GpuMatBeginItr<int>(d_data, 0);
auto idxEnd = GpuMatEndItr<int>(d_data, 0);
// Fill the index channel with a sequence of numbers from 0 to 100
thrust::sequence(idxBegin, idxEnd);
// Fill the key channel with random numbers between 0 and 10. A counting iterator is used here to give an integer value for each location as an input to prg::operator()
thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_data.cols), keyBegin, prg(0, 10));
// Sort the key channel and index channel such that the keys and indices stay together
thrust::sort_by_key(keyBegin, keyEnd, idxBegin);
cv::Mat h_idx(d_data);
}
Copy values greater than 0 to a new gpu matrix while using streams
In this example we are going to see how to use cv::cuda::Stream. Unfortunately, this specific example uses functions that must return results to the CPU, so it is not the optimal use of streams.
{
cv::cuda::GpuMat d_value(1, 100, CV_32F);
auto valueBegin = GpuMatBeginItr<float>(d_value);
auto valueEnd = GpuMatEndItr<float>(d_value);
cv::cuda::Stream stream;
//! [random_gen_stream]
// Same as the random generation code from before except now the transformation is being performed on a stream
thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
//! [random_gen_stream]
// Count the number of values we are going to copy
int count = thrust::count_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, pred_greater<float>(0.0));
// Allocate a destination for copied values
cv::cuda::GpuMat d_valueGreater(1, count, CV_32F);
// Copy values that satisfy the predicate.
thrust::copy_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, GpuMatBeginItr<float>(d_valueGreater), pred_greater<float>(0.0));
cv::Mat h_greater(d_valueGreater);
}
First we will populate our GpuMat with randomly generated data between -1 and 1 on a stream.
// Same as the random generation code from before except now the transformation is being performed on a stream
thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
Notice the use of thrust::system::cuda::par.on(...); this creates an execution policy for executing thrust code on a stream. There is a bug in the version of thrust distributed with the cuda toolkit that, as of version 7.5, has not been fixed. This bug causes code not to execute on streams. The bug can however be fixed by using the newest version of thrust from the git repository (http://github.com/thrust/thrust.git). Next we will determine how many values are greater than 0 by using thrust::count_if with the following predicate:
template<typename T> struct pred_greater
{
T value;
__host__ __device__ pred_greater(T value_) : value(value_){}
__host__ __device__ bool operator()(const T& val) const
{
return val > value;
}
};
We will use those results to create an output buffer for storing the copied values; we will then use copy_if with the same predicate to populate the output buffer. Lastly we will download the values into a CPU mat for viewing.