Classifying data with support vector machines

Support vector machines (SVMs) are one of the techniques we will use that doesn't have an easy probabilistic interpretation. The idea behind SVMs is that we find the plane that separates the groups in the dataset "best". Here, separation means that the chosen plane maximizes the margin to the closest points on either side of it. These points are called the support vectors.

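For the linearly separable case, this idea is usually written as the following optimization problem (a standard textbook formulation rather than anything taken from this recipe), with the labels y_i taken as -1 or +1:

\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w \cdot x_i + b \right) \ge 1 \quad \text{for all } i

The resulting margin has width 2 / \lVert w \rVert, and the support vectors are exactly the training points for which the constraint holds with equality.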

Getting ready

SVM is one of my favorite machine learning algorithms. It was one of the first machine learning algorithms I learned in school. So, let's get some data and get started:

from sklearn import datasets
X, y = datasets.make_classification()
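
make_classification uses sensible defaults here (100 samples, 20 features, 2 classes); a quick sanity check of what we got:

print(X.shape, y.shape)  # (100, 20) and (100,) with the default settings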

How to do it…

The mechanics of creating a support vector classifier are very simple; there are a few options available. Therefore, we'll do the following:

1. Create an SVC object and fit it to some fake data.

2. Fit the SVC object to some example data.

3. Talk a little about the SVC options.

Import the support vector classifier (SVC) from the support vector machine module:

from sklearn.svm import SVC
base_svm = SVC()
base_svm.fit(X, y)
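
Tying back to the support vectors mentioned in the introduction, the fitted classifier exposes them directly; this is a minimal sketch of inspecting them through the standard scikit-learn attributes:

# The training points that ended up defining the margin
print(base_svm.support_vectors_.shape)

# Their indices in X, and how many support vectors each class contributed
print(base_svm.support_)
print(base_svm.n_support_)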

Let's look at some of the options:

1. C: In cases where we don't have a well-separated set, C scales the penalty for points that end up on the wrong side of the margin. As C gets larger, that penalty grows and the SVM will accept a narrower margin in order to classify more of the training points correctly; with a smaller C the margin is wider, but more misclassifications are tolerated (see the sketch after this list).

2. class_weight: This denotes how much weight to give to each class in the problem. It is given as a dictionary where the classes are the keys and the values are the weights associated with those classes.

3. gamma: This is the gamma parameter for kernels and is supported by the rbf, sigmoid, and poly kernels.

4. kernel: This is the kernel to use; we'll use linear in the following How it works... section, but rbf is the popular and default choice.

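To make these options concrete, here is a minimal sketch that spells all four out on the data we generated earlier; the particular values (C=10.0, a 5:1 class weighting, gamma=0.1) are arbitrary and chosen purely for illustration:

from sklearn.svm import SVC

configured_svm = SVC(C=10.0,                     # penalty on margin violations
                     class_weight={0: 1, 1: 5},  # errors on class 1 cost five times as much
                     kernel='rbf',               # the default kernel
                     gamma=0.1)                  # only meaningful for the rbf, poly, and sigmoid kernels
configured_svm.fit(X, y)

# Larger C values usually leave fewer support vectors than smaller ones
print(SVC(C=0.1).fit(X, y).n_support_, SVC(C=100.0).fit(X, y).n_support_)
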
How it works…

Like we talked about in the Getting ready section, SVM will try to find the plane that best bifurcates the two classes. Let's look at a simple example with two features and a well-separated outcome.

First, let's fit the dataset, and then we'll plot what's going on:

X, y = datasets.make_blobs(n_features=2, centers=2)
from sklearn.svm import LinearSVC
svm = LinearSVC()
svm.fit(X, y)
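
Before plotting, a quick way to see the plane the classifier found is to read off its learned coefficients; the decision boundary is the line where coef_ · x + intercept_ equals zero:

# The separating line is w0 * x0 + w1 * x1 + b = 0
print(svm.coef_)       # w, one row for the single binary decision
print(svm.intercept_)  # b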

Now that we've fit the support vector machine, we'll plot its outcome at each point in the graph. This will show us the approximate decision boundary:

from itertools import product
from collections import namedtuple
import numpy as np
import matplotlib.pyplot as plt

Point = namedtuple('Point', ['x', 'y', 'outcome'])
decision_boundary = []
xmin, xmax = np.percentile(X[:, 0], [0, 100])
ymin, ymax = np.percentile(X[:, 1], [0, 100])

# Evaluate the classifier on a 20 x 20 grid that extends a little past the data
for xpt, ypt in product(np.linspace(xmin - 2.5, xmax + 2.5, 20),
                        np.linspace(ymin - 2.5, ymax + 2.5, 20)):
    p = Point(xpt, ypt, svm.predict([[xpt, ypt]]))
    decision_boundary.append(p)

f, ax = plt.subplots(figsize=(7, 5))
colors = np.array(['r', 'b'])

# Faint points show the predicted class at each grid location;
# the solid points are the original dataset
for xpt, ypt, pt in decision_boundary:
    ax.scatter(xpt, ypt, color=colors[pt[0]], alpha=.15)
ax.scatter(X[:, 0], X[:, 1], color=colors[y], s=30)

ax.set_ylim(ymin, ymax)
ax.set_xlim(xmin, xmax)
ax.set_title("A well separated dataset")
plt.show()

The following is the output:

Let's look at another example, but this time the decision boundary will not be so clear:

X, y = datasets.make_classification(n_features=2, n_classes=2, n_informative=2, n_redundant=0)
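
To see why this is harder, it helps to plot the raw points by themselves before fitting anything; a minimal sketch, reusing the color convention from above:

f, ax = plt.subplots(figsize=(7, 5))
ax.scatter(X[:, 0], X[:, 1], color=np.array(['r', 'b'])[y])
ax.set_title("The new, harder dataset")
plt.show()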

As we can see, this is not a problem that will easily be solved by a linear classification rule. While we will not use this in practice, let's have a look at the decision boundary. First, let's retrain the classifier with the new datapoints:

svm.fit(X, y)
xmin, xmax = np.percentile(X[:, 0], [0, 100])
ymin, ymax = np.percentile(X[:, 1], [0, 100])

# Predict the class at every point of a grid covering the data range
test_points = np.array([[xx, yy] for xx, yy in
                        product(np.linspace(xmin, xmax),
                                np.linspace(ymin, ymax))])
test_preds = svm.predict(test_points)

f, ax = plt.subplots(figsize=(7, 5))
colors = np.array(['r', 'b'])
ax.scatter(test_points[:, 0], test_points[:, 1],
           color=colors[test_preds], alpha=.25)
ax.scatter(X[:, 0], X[:, 1], color=colors[y])
ax.set_title("A poorly separated dataset")
plt.show()

The following is the output:

As we saw, the decision line isn't perfect, but at the end of the day, this is the best Linear SVM we will get.

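If you want to put a rough number on "isn't perfect", one quick check (optimistic, since it is measured on the training data itself) is the classifier's mean accuracy:

# Fraction of the training points the linear SVM classifies correctly
print(svm.score(X, y))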

There's more…

While we might not be able to do much better with a linear SVM, the SVC classifier in scikit-learn uses the radial basis function (RBF) kernel by default. We've seen this function before, but let's take a look and see what it does to the decision boundaries of the dataset we just fit:

radial_svm = SVC(kernel='rbf')
radial_svm.fit(X, y)
xmin, xmax = np.percentile(X[:, 0], [0, 100])
ymin, ymax = np.percentile(X[:, 1], [0, 100])

# Same grid of test points as before, classified with the RBF-kernel SVM
test_points = np.array([[xx, yy] for xx, yy in
                        product(np.linspace(xmin, xmax),
                                np.linspace(ymin, ymax))])
test_preds = radial_svm.predict(test_points)

f, ax = plt.subplots(figsize=(7, 5))
colors = np.array(['r', 'b'])
ax.scatter(test_points[:, 0], test_points[:, 1],
           color=colors[test_preds], alpha=.25)
ax.scatter(X[:, 0], X[:, 1], color=colors[y])
ax.set_title("SVM with a radial basis function")
plt.show()

The following is the output:

As we can see, the decision boundary has been altered. We can even pass in our own kernel function, if needed:

def test_kernel(X, y):
    """Test kernel that returns the exponentiation of the dot product of the X
    and y matrices. This looks a lot like the log hazards if you're familiar
    with survival analysis.
    """
    return np.exp(np.dot(X, y.T))

test_svc = SVC(kernel=test_kernel)
test_svc.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel=<function test_kernel at 0x1a1ab308c0>, max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
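
As a quick sanity check that the custom kernel produces a usable classifier, we can look at its training accuracy in the same way:

# Mean accuracy on the training data, with predictions also going through the custom kernel
print(test_svc.score(X, y))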
