Creating sample data for toy analysis
I will again implore you to use some of your own data for this book, but in the event you cannot, we'll learn how to use scikit-learn to create toy data.
Getting ready
Much like the functions for getting built-in datasets and fetching new ones, the functions that create sample datasets follow a naming convention: make_<the data set>. Just to be clear, this data is purely artificial:
datasets.make_*?
datasets.make_biclusters
datasets.make_blobs
datasets.make_checkerboard
datasets.make_circles
datasets.make_classification
...
To save typing, import the datasets module as d, and numpy as np:
import sklearn.datasets as d
import numpy as np
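The `datasets.make_*?` listing above relies on IPython's wildcard lookup. Outside IPython, a rough equivalent is to filter the module's attribute names; the sketch below assumes only that the generators keep the `make_` prefix described above, and the variable name `make_functions` is chosen here for illustration.

```python
import sklearn.datasets as d

# Collect every public name in the module that follows the make_<dataset> convention.
make_functions = sorted(name for name in dir(d) if name.startswith("make_"))

for name in make_functions:
    print(name)
```

The exact list depends on the installed scikit-learn version, so no output is shown here.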
How to do it...
This section walks you through the creation of several datasets; the following How it works... section confirms the purported characteristics of the datasets. In addition to the sample datasets, these functions will be used throughout the book to create data with the characteristics needed by the algorithms on display.
First, the stalwart, regression:
reg_data = d.make_regression()
By default, this generates a tuple whose first member is a 100 x 100 matrix: 100 samples by 100 features. However, by default, only 10 features are responsible for the target data generation. The second member of the tuple is the target variable. It is also possible to get more involved. For example, to generate a 1000 x 10 matrix with five features responsible for the target creation, an underlying bias factor of 1.0, and 2 targets, run the following command:
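# Keyword arguments are used because recent scikit-learn releases accept only n_samples and n_features positionally.
complex_reg_data = d.make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, bias=1.0)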
complex_reg_data[0].shape
(1000, 10)
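To check these claims rather than take them on faith, the following sketch inspects the shapes of both regression datasets. It uses the `coef=True` option of `make_regression` to return the coefficients of the underlying linear model; `random_state=0` and the variable names are choices made here for reproducibility, not part of the original recipe.

```python
import numpy as np
import sklearn.datasets as d

# Default call: 100 samples by 100 features, with a single target.
X_default, y_default = d.make_regression(random_state=0)
print(X_default.shape, y_default.shape)

# The more involved call, also asking for the generating coefficients.
X, y, coef = d.make_regression(
    n_samples=1000, n_features=10, n_informative=5,
    n_targets=2, bias=1.0, coef=True, random_state=0,
)
print(X.shape, y.shape, coef.shape)

# A feature is informative if its row of coefficients has a nonzero entry.
n_informative_found = int(np.count_nonzero(np.any(coef != 0, axis=1)))
print(n_informative_found)

# With the default noise of 0.0, y is an exact linear function of X plus the bias.
is_linear = bool(np.allclose(y, X @ coef + 1.0))
print(is_linear)
```

Only five of the ten coefficient rows should be nonzero, matching the five informative features requested.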
Classification datasets are also very simple to create. The balanced base case, however, is rarely experienced in practice: most users don't convert, most transactions aren't fraudulent, and so on. Therefore, it's useful to explore classification on unbalanced datasets:
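classification_set = d.make_classification(weights=[0.1])  # defaults: 100 x 20 feature matrix, 100 target values
np.bincount(classification_set[1])  # count how many samples fall into each class
array([10, 90])  # 10 zeros and 90 ones (exact counts can vary slightly because of the default label noise)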
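The class balance is easier to see on a larger sample. The sketch below is one way to do that, assuming a 10/90 split is wanted: it sets `flip_y=0` to switch off the small fraction of randomly reassigned labels that `make_classification` applies by default, and fixes `random_state` so the run is repeatable. The sample size and variable names are illustrative choices.

```python
import numpy as np
import sklearn.datasets as d

# 1000 samples, roughly 10% in class 0 and the remainder in class 1.
X_unbalanced, y_unbalanced = d.make_classification(
    n_samples=1000, weights=[0.1], flip_y=0, random_state=0
)

class_counts = np.bincount(y_unbalanced)
class_share = class_counts / class_counts.sum()
print(class_counts)
print(class_share)
```

A classifier that always predicts class 1 would already be about 90% accurate on this data, which is why accuracy alone is a poor yardstick for unbalanced problems.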
Clusters will also be covered. There are actually several functions that create datasets suited to different clustering algorithms. For example, blobs are very easy to create and can be modeled by K-Means:
blobs = d.make_blobs()
By default, this generates a 100 x 2 array of points and 100 target values (the cluster labels).
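As a quick illustration of blobs being modeled by K-Means, here is a minimal sketch that fits `sklearn.cluster.KMeans` to the generated points. The choice of three centers, `n_init=10`, and the `random_state` values are assumptions made for this example, not settings taken from the text.

```python
import sklearn.datasets as d
from sklearn.cluster import KMeans

# Three blobs in two dimensions, with the true blob membership in true_labels.
points, true_labels = d.make_blobs(n_samples=100, centers=3, random_state=0)

# Fit K-Means with the same number of clusters as blobs.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
predicted_labels = kmeans.fit_predict(points)

print(points.shape)
print(kmeans.cluster_centers_)
```

K-Means numbers its clusters arbitrarily, so `predicted_labels` will not necessarily use the same integers as `true_labels` even when the grouping is the same.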
How it works...
Let's walk through how scikit-learn produces the regression dataset by taking a look at the source code (with some modifications for clarity). Any undefined variables are assumed to have the default value from make_regression. It's actually surprisingly simple to follow.
First, a random array is generated with the size specified when the function is called:
X = np.random.randn(n_samples, n_features)
Given the basic dataset, the target dataset is then generated:
ground_truth = np.zeros((n_features, n_targets))
ground_truth[:n_informative, :] = 100 * np.random.rand(n_informative, n_targets)
The dot product of X and ground_truth is taken to get the final target values. Bias, if any, is added at this time:
y = np.dot(X, ground_truth) + bias
Due to NumPy's broadcasting, bias can be a scalar value, and this value will be added to every sample.
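A tiny worked example makes the broadcasting concrete. The numbers below are made up for illustration. The second half, with one bias per target, goes beyond what the text describes and is included only to show that the same NumPy rule applies to a vector.

```python
import numpy as np

X = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])            # 3 samples, 2 features
ground_truth = np.array([[1.0],
                         [2.0]])      # 2 features, 1 target

# The scalar bias is broadcast across all three rows of the product.
y_scalar_bias = np.dot(X, ground_truth) + 1.0
print(y_scalar_bias)                  # [[ 3.] [ 9.] [15.]]

# Illustration only: a length-2 bias vector is broadcast across rows, one value per target column.
ground_truth_two = np.array([[1.0, 0.0],
                             [2.0, 1.0]])
y_vector_bias = np.dot(X, ground_truth_two) + np.array([1.0, -1.0])
print(y_vector_bias)                  # [[ 3.  0.] [ 9.  2.] [15.  4.]]
```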
Finally, it's a simple matter of adding any noise and shuffling the dataset. Voilà, we have a dataset perfect for testing regression.
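Putting the steps together, the whole procedure fits in a short function. This is a simplified sketch of the walkthrough above, not scikit-learn's actual source: the function name `toy_make_regression`, the `seed` parameter, and the use of `np.random.default_rng` are choices made here, and only the rows (samples) are shuffled.

```python
import numpy as np

def toy_make_regression(n_samples=100, n_features=100, n_informative=10,
                        n_targets=1, bias=0.0, noise=0.0, seed=None):
    """Simplified, hypothetical re-creation of the steps described above."""
    rng = np.random.default_rng(seed)

    # Step 1: random feature matrix of the requested size.
    X = rng.standard_normal((n_samples, n_features))

    # Step 2: coefficients, nonzero only for the informative features.
    ground_truth = np.zeros((n_features, n_targets))
    ground_truth[:n_informative, :] = 100 * rng.random((n_informative, n_targets))

    # Step 3: linear targets plus the (broadcast) bias.
    y = np.dot(X, ground_truth) + bias

    # Step 4: optional Gaussian noise, then shuffle the samples.
    if noise > 0.0:
        y = y + rng.normal(scale=noise, size=y.shape)
    order = rng.permutation(n_samples)
    X, y = X[order], y[order]

    return X, np.squeeze(y), ground_truth

X_toy, y_toy, coef_toy = toy_make_regression(50, 8, 3, 2, bias=1.0, seed=0)
print(X_toy.shape, y_toy.shape, coef_toy.shape)
```

Returning `ground_truth` alongside the data makes it easy to confirm that a fitted regression recovers the coefficients that generated the targets.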