nvidia-rapids︱cuML机器学习加速库

2020-03-27 15:56:55 浏览数 (1)

cuML是一套用于实现与其他RAPIDS项目共享兼容API的机器学习算法和数学原语函数。

cuML使数据科学家、研究人员和软件工程师能够在GPU上运行传统的表格ML任务,而无需深入了解CUDA编程的细节。 在大多数情况下,cuML的Python API与来自scikit-learn的API相匹配。

对于大型数据集,这些基于GPU的实现可以比其CPU等效完成10-50倍。 有关性能的详细信息,请参阅cuML基准测试笔记本。

官方文档:

rapidsai/cuml

cuML API Reference

官方案例还是蛮多的:

来看看有啥模型:

关联文章:

nvidia-rapids︱cuDF与pandas一样的DataFrame库

NVIDIA的python-GPU算法生态 ︱ RAPIDS 0.10

nvidia-rapids︱cuML机器学习加速库

nvidia-rapids︱cuGraph(NetworkX-like)关系图模型


文章目录

  • 1 安装与背景
    • 1.1 安装
    • 1.2 背景
  • 2 DBSCAN
  • 3 TSNE算法在Fashion MNIST的使用
  • 4 XGBoosting
  • 5 利用KNN进行图像检索

1 安装与背景

1.1 安装

参考:https://github.com/rapidsai/cuml/blob/branch-0.13/BUILD.md

conda env create -n cuml_dev python=3.7 --file=conda/environments/cuml_dev_cuda10.0.yml

docker版本,可参考:https://rapids.ai/start.html#prerequisites

代码语言:javascript复制
docker pull rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 
    rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7

1.2 背景

不仅是训练,要想真正在GPU上扩展数据科学,也需要加速端到端的应用程序。cuML 0.9 为我们带来了基于GPU的树模型支持的下一个发展,包括新的森林推理库(FIL)。FIL是一个轻量级的GPU加速引擎,它对基于树形模型进行推理,包括梯度增强决策树和随机森林。使用单个V100 GPU和两行Python代码,用户就可以加载一个已保存的XGBoost或LightGBM模型,并对新数据执行推理,速度比双20核CPU节点快36倍。在开源Treelite软件包的基础上,下一个版本的FIL还将添加对scikit-learn和cuML随机森林模型的支持。

图3:推理速度对比,XGBoost CPU vs 森林推理库 (FIL) GPU


2 DBSCAN

The DBSCAN algorithm is a clustering algorithm that works really well for datasets that have regions of high density.

The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames.

代码语言:javascript复制
import cudf
import matplotlib.pyplot as plt
import numpy as np
from cuml.datasets import make_blobs
from cuml.cluster import DBSCAN as cuDBSCAN
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.metrics import adjusted_rand_score

%matplotlib inline

# 定义参数
n_samples = 10**4
n_features = 2

eps = 0.15
min_samples = 3
random_state = 23

#Generate Data
%%time
device_data, device_labels = make_blobs(n_samples=n_samples, 
                                        n_features=n_features,
                                        centers=5,
                                        cluster_std=0.1,
                                        random_state=random_state)

device_data = cudf.DataFrame.from_gpu_matrix(device_data)
device_labels = cudf.Series(device_labels)
# Copy dataset from GPU memory to host memory.
# This is done to later compare CPU and GPU results.
host_data = device_data.to_pandas()
host_labels = device_labels.to_pandas()

# sklearn 模型拟合
%%time
clustering_sk = skDBSCAN(eps=eps,
                         min_samples=min_samples,
                         algorithm="brute",
                         n_jobs=-1)

clustering_sk.fit(host_data)

# cuML 模型拟合
%%time
clustering_cuml = cuDBSCAN(eps=eps,
                           min_samples=min_samples,
                           verbose=True,
                           max_mbytes_per_batch=13e3)

clustering_cuml.fit(device_data, out_dtype="int32")

# 可视化
fig = plt.figure(figsize=(16, 10))

X = np.array(host_data)
labels = clustering_cuml.labels_

n_clusters_ = len(labels)

# Black removed and is used for noise instead.
unique_labels = labels.unique()
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markersize=5, markeredgecolor=tuple(col))

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

结果评估:

代码语言:javascript复制
%%time
sk_score = adjusted_rand_score(host_labels, clustering_sk.labels_)
cuml_score = adjusted_rand_score(host_labels, clustering_cuml.labels_)

>>> (0.9998750031236718, 0.9998750031236718)

两个结果是一模一样的,也就是skearn和cuML的结果一致。


3 TSNE算法在Fashion MNIST的使用

TSNE (T-Distributed Stochastic Neighborhood Embedding) is a fantastic dimensionality reduction algorithm used to visualize large complex datasets including medical scans, neural network weights, gene expressions and much more.

cuML’s TSNE algorithm supports both the faster Barnes Hut $ n logn $ algorithm and also the slower Exact $ n^2 $ .

The model can take array-like objects, either in host as NumPy arrays as well as cuDF DataFrames as the input.

代码语言:javascript复制
import gzip
import matplotlib.pyplot as plt
import numpy as np
import os
from cuml.manifold import TSNE

%matplotlib inline

# https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
def load_mnist_train(path):
    """Load MNIST data from path"""
    labels_path = os.path.join(path, 'train-labels-idx1-ubyte.gz')
    images_path = os.path.join(path, 'train-images-idx3-ubyte.gz')

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)
    return images, labels


# 加载数据
images, labels = load_mnist_train("data/fashion")

plt.figure(figsize=(5,5))
plt.imshow(images[100].reshape((28, 28)), cmap = 'gray')
代码语言:javascript复制
# 建模
tsne = TSNE(n_components = 2, method = 'barnes_hut', random_state=23)
%time embedding = tsne.fit_transform(images)

print(embedding[:10], embedding.shape)



CPU times: user 2.41 s, sys: 2.57 s, total: 4.98 s
Wall time: 4.98 s
[[-13.577632    39.87483   ]
 [ 26.136728   -17.68164   ]
 [ 23.164072    22.151243  ]
 [ 28.361032    11.134571  ]
 [ 35.419216     5.6633983 ]
 [ -0.15575314 -11.143476  ]
 [-24.30308     -1.584903  ]
 [ -5.9438944  -27.522072  ]
 [  2.0439444   29.574451  ]
 [ -3.0801039   27.079374  ]] (60000, 2)

可视化Visualize Embedding:

代码语言:javascript复制
# Visualize Embedding


classes = [
    'T-shirt/top',
    'Trouser',
    'Pullover',
    'Dress',
    'Coat',
    'Sandal',
    'Shirt',
    'Sneaker',
    'Bag',
    'Ankle boot'
]

fig, ax = plt.subplots(1, figsize = (14, 10))
plt.scatter(embedding[:,1], embedding[:,0], s = 0.3, c = labels, cmap = 'Spectral')
plt.setp(ax, xticks = [], yticks = [])
cbar = plt.colorbar(boundaries = np.arange(11)-0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)
plt.title('Fashion MNIST Embedded via TSNE');

4 XGBoosting

代码语言:javascript复制
import numpy as np; print('numpy Version:', np.__version__)
import pandas as pd; print('pandas Version:', pd.__version__)
import xgboost as xgb; print('XGBoost Version:', xgb.__version__)


# helper function for simulating data
def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)
    else:
        features = np.random.randint(2, size=(m, n))
    labels = np.random.randint(k, size=m)
    return np.c_[labels, features].astype(np.float32)


# helper function for loading data
def load_data(filename, n_rows):
    if n_rows >= 1e9:
        df = pd.read_csv(filename)
    else:
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)

# settings
LOAD = False
n_rows = int(1e5)
n_columns = int(100)
n_categories = 2

# 加载数据
%%time

if LOAD:
    dataset = load_data('/tmp', n_rows)
else:
    dataset = simulate_data(n_rows, n_columns, n_categories)
print(dataset.shape)

# 训练集切分
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

# split validation data
X_validation, y_validation = X[train_index:, :], y[train_index:]

# 检验
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_validation', X_validation.shape, X_validation.dtype, 'y_validation: ', y_validation.shape, y_validation.dtype)

# check the proportions
total = X_train.shape[0]   X_validation.shape[0]
print('X_train proportion:', X_train.shape[0] / total)
print('X_validation proportion:', X_validation.shape[0] / total)

# Convert NumPy data to DMatrix format
%%time

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_validation, label=y_validation)

# 设置参数
# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
n_gpus = 1
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus
params.update(booster_params)

# learning task params
learning_task_params = {'eval_metric': 'auc', 'objective': 'binary:logistic'}
params.update(learning_task_params)
print(params)


# 模型训练
# model training settings
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
num_round = 10

%%time

bst = xgb.train(params, dtrain, num_round, evallist)

输出:

代码语言:javascript复制
[0]	validation-auc:0.504014	train-auc:0.542211
[1]	validation-auc:0.506166	train-auc:0.559262
[2]	validation-auc:0.501638	train-auc:0.570375
[3]	validation-auc:0.50275	train-auc:0.580726
[4]	validation-auc:0.503445	train-auc:0.589701
[5]	validation-auc:0.503413	train-auc:0.598342
[6]	validation-auc:0.504258	train-auc:0.605253
[7]	validation-auc:0.503157	train-auc:0.611937
[8]	validation-auc:0.502372	train-auc:0.617561
[9]	validation-auc:0.501949	train-auc:0.62333
CPU times: user 1.12 s, sys: 195 ms, total: 1.31 s
Wall time: 360 ms

相关参考:

  • Open Source Website
  • GitHub
  • Press Release
  • NVIDIA Blog
  • Developer Blog
  • NVIDIA Data Science Webpage

5 利用KNN进行图像检索

参考:在GPU实例上使用RAPIDS加速图像搜索任务

阿里云文档中有专门的介绍,所以不做太多赘述。

使用开源框架Tensorflow和Keras提取图片特征,其中模型为基于ImageNet数据集的ResNet50(notop)预训练模型。

连接公网下载模型(大小约91M),下载完成后默认保存到/root/.keras/models/目录

数据下载:

代码语言:javascript复制
import os
import tarfile
import numpy as np
from urllib.request import urlretrieve


def download_and_extract(data_dir):
    """doc"""
    def _progress(count, block_size, total_size):
        print('r>>> Downloading %s  (total:%.0fM) %.1f%%' % (
            filename, total_size / 1024 / 1024, 100.0 * count * block_size / total_size), end='')
    
    url = 'http://ai.stanford.edu/~acoates/stl10/stl10_binary.tar.gz'
    filename = url.split('/')[-1]
    filepath = os.path.join(data_dir, filename)
    decom_dir = os.path.join(data_dir, filename.split('.')[0])
    
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    if os.path.exists(filepath):
        print('>>> {} has exist in current directory.'.format(filename))
    else:
        urlretrieve(url, filepath, _progress)
        print("nSuccessfully downloaded.")
    if not os.path.exists(decom_dir):
        # Decompress
        print(">>> Decompressing from {}....".format(filepath))
        tar = tarfile.open(filepath, 'r')
        tar.extractall(data_dir)
        print("Successfully decompressed")
        tar.close()
    else:
        print('>>> Directory "{}" has exist. '.format(decom_dir))


def read_all_images(path_to_data):
    """get all images from binary path"""
    with open(path_to_data, 'rb') as f:
        everything = np.fromfile(f, dtype=np.uint8)
        images = np.reshape(everything, (-1, 3, 96, 96))
        images = np.transpose(images, (0, 3, 2, 1))
    return images


# the directory to save data
data_dir = './data'
# download and decompression
download_and_extract(data_dir)

# 读入数据
# the path of unlabeled data
path_unlabeled = os.path.join(data_dir, 'stl10_binary/unlabeled_X.bin')
# get images from binary
images = read_all_images(path_unlabeled)
print('>>> images shape: ', images.shape)

# 看图
import random
import matplotlib.pyplot as plt
%matplotlib inline

def show_image(image):
    """show image"""
    fig = plt.figure(figsize=(3, 3))
    plt.imshow(image)
    plt.show()
    fig.clear()


# random show a image
rand_image_index = random.randint(0, images.shape[0])
show_image(images[rand_image_index])
代码语言:javascript复制
# 分割数据
from sklearn.model_selection import train_test_split

train_images, query_images = train_test_split(images, test_size=0.1, random_state=123)
print('train_images shape: ', train_images.shape)
print('query_images shape: ', query_images.shape)

# 图片特征
# set tensorflow params to adjust GPU memory usage, if use default params, tensorflow would use
# nearly all of the gpu memory, we need reserve some gpu memory for cuml.
import os
# only use device 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
# method 1: allocate gpu memory base on runtime allocations
# config.gpu_options.allow_growth = True
# method 2: determines the fraction of the onerall amount of memory 
# that each visibel GPU should be allocated.
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))

# 特征抽取
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input

# download resnet50(notop) model(first running) and load model
model = ResNet50(weights='imagenet', include_top=False, input_shape=(96, 96, 3), pooling='max')
# network summary
model.summary()

%%time
train_features = model.predict(train_images)
print('train features shape: ', train_features.shape)

%%time
query_features = model.predict(query_images)
print('query features shape: ', query_features.shape)

然后是KNN阶段,包括了sklear-KNN,和CUML-KNN:

代码语言:javascript复制
from cuml.neighbors import NearestNeighbors

%%time
knn_cuml = NearestNeighbors()
knn_cuml.fit(train_features)

%%time
distances_cuml, indices_cuml = knn_cuml.kneighbors(query_features, k=3)

from sklearn.neighbors import NearestNeighbors
%%time
knn_sk = NearestNeighbors(n_neighbors=3, metric='sqeuclidean', n_jobs=-1)
knn_sk.fit(train_features)

%%time
distances_sk, indices_sk = knn_sk.kneighbors(query_features, 3)
# compare the distance obtained while using sklearn and cuml models
(np.abs(distances_cuml - distances_sk) < 1).all()

# 展示结果
def show_images(query, sim_images, sim_dists):
    """doc"""
    simi_num = len(sim_images)
    fig = plt.figure(figsize=(3 * (simi_num   1), 3))

    axes = fig.subplots(1, simi_num   1)
    for index, ax in enumerate(axes):
        if index == 0:
            ax.imshow(query)
            ax.set_title('query')
        else:
            ax.imshow(sim_images[index - 1])
            ax.set_title('dist: %.1f' % (sim_dists[index - 1]))
    plt.show()
    fig.clear()

# get random indices
random_show_index = np.random.randint(0, query_images.shape[0], size=5)
random_query = query_images[random_show_index]
random_indices = indices_cuml[random_show_index].astype(np.int)
random_distances = distances_cuml[random_show_index]

# show result images
for query_image, sim_indices, sim_dists in zip(random_query, random_indices, random_distances):
    sim_images = train_images[sim_indices]
    show_images(query_image, sim_images, sim_dists)

用到后再追加..

0 人点赞