使用主要协变量回归改进样本和特征选择(CS)

2020-12-30 15:44:50 浏览数 (1)

罗斯·克森斯基,本杰明·赫尔弗雷希特,埃德加·恩格尔,米歇尔·塞里奥蒂

从大量候选项中选择最相关的功能和示例是一项在自动数据分析文本中经常发生的任务,它可用于提高模型的计算性能,而且通常也具有可传输性。在这里,我们重点介绍两个流行的子选择方案,它们已应用于此目的:CUR 分解,它基于要素矩阵的低级近似值和最远点采样,它依赖于最多样化的样本和区分特征的迭代标识。我们修改这些不受监督的方法,按照与主体共变量回归(PCovR)方法相同的精神,纳入受监督的组件。我们表明,合并目标信息可提供在监督任务中性能更好的选择,我们用山脊回归、内核脊回归和稀疏内核回归来演示这些选择。我们还表明,结合简单的监督学习模型可以提高更复杂的模型(如前馈神经网络)的准确性。我们提出进行调整,以尽量减少执行无人监督的任务时任何子选择可能产生的影响。我们演示了使用 PCov-CUR和 PCov-FPS在化学和材料科学应用上的显著改进,通常将实现给定回归精度水平所需的特征和样本数减少 2 个因子和样本数。

Improving Sample and Feature Selection with Principal Covariates Regression

Rose K. Cersonsky, Benjamin A. Helfrecht, Edgar A. Engel, Michele Ceriotti

Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it can be used to improve the computational performance, and also often the transferability, of a model. Here we focus on two popular sub-selection schemes which have been applied to this end: CUR decomposition, that is based on a low-rank approximation of the feature matrix and Farthest Point Sampling, that relies on the iterative identification of the most diverse samples and discriminating features. We modify these unsupervised approaches, incorporating a supervised component following the same spirit as the Principal Covariates Regression (PCovR) method. We show that incorporating target information provides selections that perform better in supervised tasks, which we demonstrate with ridge regression, kernel ridge regression, and sparse kernel regression. We also show that incorporating aspects of simple supervised learning models can improve the accuracy of more complex models, such as feed-forward neural networks. We present adjustments to minimize the impact that any subselection may incur when performing unsupervised tasks. We demonstrate the significant improvements associated with the use of PCov-CUR and PCov-FPS selections for applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples which are required to achieve a given level of regression accuracy.

0 人点赞