Visit www.arxivdaily.com for daily paper digests with abstracts, covering CS | Physics | Math | Economics | Statistics | Finance | Biology | Electrical Engineering, plus search, favorites, and posting features!
stat (Statistics): 46 papers in total
【1】 Analysis of the evolution of agroclimatic risks in a context of climate variability in the region of Segou in Mali
Authors: Diop Amadou, Barro Diakarya Comments: 25 pages, 31 figures Link: https://arxiv.org/abs/2106.12571 Abstract: In the Sahel region the population depends largely on rain-fed agriculture. In West Africa in particular, climate models turn out to be unable to capture some basic features of present-day climate variability. This study proposes a contribution to the analysis of the evolution of agro-climatic risks in a context of climate variability. Statistical tests are applied to the main variables of the rainy season to determine trends, and the variabilities are described through the data series. The paper thus provides a statistical modeling of the different agro-climatic risks, while the seasonal variability of agro-climatic parameters is analyzed along with their inter-annual variability. The study identifies the probability distributions of agroclimatic risks and clarifies the characterization of the rainy season.
【2】 Approximate Bayesian Computation with Path Signatures
Authors: Joel Dyer, Patrick Cannon, Sebastian M Schmon Comments: 27 pages, 11 figures Link: https://arxiv.org/abs/2106.12555 Abstract: Simulation models of scientific interest often lack a tractable likelihood function, precluding standard likelihood-based statistical inference. A popular likelihood-free method for inferring simulator parameters is approximate Bayesian computation, where an approximate posterior is sampled by comparing simulator output and observed data. However, effective measures of closeness between simulated and observed data are generally difficult to construct, particularly for time series data, which are often high-dimensional and structurally complex. Existing approaches typically involve manually constructing summary statistics, requiring substantial domain expertise and experimentation, or rely on unrealistic assumptions such as iid data. Others are inappropriate in more complex settings like multivariate or irregularly sampled time series data. In this paper, we introduce the use of path signatures as a natural candidate feature set for constructing distances between time series data for use in approximate Bayesian computation algorithms. Our experiments show that such an approach can generate more accurate approximate Bayesian posteriors than existing techniques for time series models.
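To make the core idea concrete, here is a minimal sketch (not the authors' implementation): a depth-2 truncated path signature computed directly from path increments, plugged into a plain ABC rejection sampler as the summary statistic. The `simulate` and `prior_sample` callables are hypothetical placeholders; real applications would typically use a signature library and higher truncation depths, and a scalar series would first be lifted to a 2-D path (e.g. time and value as two channels).

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 truncated signature of a (T, d) path.

    Level 1: total increments S^(i) = x_T^i - x_0^i.
    Level 2: discrete iterated integrals
             S^(i,j) ~= sum_t (x_t^i - x_0^i) * (x_{t+1}^j - x_t^j).
    """
    dx = np.diff(path, axis=0)            # (T-1, d) increments
    level1 = dx.sum(axis=0)
    centered = path[:-1] - path[0]        # x_t - x_0 at the left endpoints
    level2 = centered.T @ dx              # (d, d)
    return np.concatenate([level1, level2.ravel()])

def abc_rejection(observed, simulate, prior_sample, n_sims=10_000, keep=0.01):
    """ABC rejection with Euclidean distance between truncated signatures."""
    s_obs = signature_depth2(observed)
    draws = [prior_sample() for _ in range(n_sims)]
    dists = [np.linalg.norm(signature_depth2(simulate(th)) - s_obs) for th in draws]
    eps = np.quantile(dists, keep)        # tolerance = empirical distance quantile
    return [th for th, d in zip(draws, dists) if d <= eps]
```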
【3】 Sampling with Mirrored Stein Operators
Authors: Jiaxin Shi, Chang Liu, Lester Mackey Affiliations: Microsoft Research, Cambridge, MA, Beijing Comments: 23 pages Link: https://arxiv.org/abs/2106.12506 Abstract: We introduce a new family of particle evolution samplers suitable for constrained domains and non-Euclidean geometries. Stein Variational Mirror Descent and Mirrored Stein Variational Gradient Descent minimize the Kullback-Leibler (KL) divergence to constrained target distributions by evolving particles in a dual space defined by a mirror map. Stein Variational Natural Gradient exploits non-Euclidean geometry to more efficiently minimize the KL divergence to unconstrained targets. We derive these samplers from a new class of mirrored Stein operators and adaptive kernels developed in this work. We demonstrate that these new samplers yield accurate approximations to distributions on the simplex, deliver valid confidence intervals in post-selection inference, and converge more rapidly than prior methods in large-scale unconstrained posterior inference. Finally, we establish the convergence of our new procedures under verifiable conditions on the target distribution.
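For orientation, the mirrored samplers build on the standard Stein variational gradient descent (SVGD) update, sketched below in numpy with an RBF kernel and assuming the score function of the (unnormalized) target is available. This is the vanilla ingredient only; the paper's mirrored variants apply such updates to particles in a dual space defined by a mirror map, which this sketch does not implement.

```python
import numpy as np

def svgd_step(X, score, h=1.0, step=1e-2):
    """One vanilla SVGD update with RBF kernel k(x, y) = exp(-|x-y|^2 / 2h^2):
      phi(x_i) = (1/n) sum_j [ k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i) ].
    The first term pulls particles toward high density, the second repels them.
    X: (n, d) particles; score: maps (n, d) -> (n, d), rows grad log p(x_j)."""
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                  # diffs[j, i] = x_j - x_i
    K = np.exp(-(diffs ** 2).sum(-1) / (2 * h ** 2))       # symmetric kernel matrix
    attract = K @ score(X)                                 # sum_j k(x_j, x_i) s(x_j)
    repel = -(K[:, :, None] * diffs).sum(axis=0) / h ** 2  # sum_j grad_{x_j} k(x_j, x_i)
    return X + step * (attract + repel) / n
```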
【4】 The SKIM-FA Kernel: High-Dimensional Variable Selection and Nonlinear Interaction Discovery in Linear Time
Authors: Raj Agrawal, Tamara Broderick Affiliations: Massachusetts Institute of Technology Link: https://arxiv.org/abs/2106.12408 Abstract: Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. The Bayesian framework makes it straightforward to simultaneously express sparsity, nonlinearity, and interactions in a hierarchical model. But, as for the few other methods that handle this trifecta, inference is computationally intractable - with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We first show that suitable Bayesian models can be represented as Gaussian processes (GPs). We then demonstrate how a kernel trick can reduce computation with these GPs to O(# covariates) time for both variable selection and estimation. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real datasets, our approach outperforms existing methods used for large, high-dimensional datasets while remaining competitive (or being orders of magnitude faster) in runtime.
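The linear-time claim rests on a classical identity for pairwise-interaction kernels: the sum over all feature pairs of products of univariate kernels follows from power sums of the univariate kernel values, in O(d) instead of the naive O(d^2). A minimal sketch with illustrative RBF base kernels (the full SKIM-FA kernel additionally carries selection weights and higher-order terms):

```python
import numpy as np

def additive_plus_pairwise_kernel(x, z, h=1.0):
    """k(x, z) = sum_i k_i(x_i, z_i) + sum_{i<j} k_i(x_i, z_i) k_j(x_j, z_j),
    computed in O(d) via  sum_{i<j} a_i a_j = ((sum_i a_i)^2 - sum_i a_i^2) / 2."""
    a = np.exp(-(x - z) ** 2 / (2 * h ** 2))   # univariate kernel values a_i
    s1, s2 = a.sum(), (a ** 2).sum()
    return s1 + 0.5 * (s1 ** 2 - s2)

x, z = np.random.randn(10_000), np.random.randn(10_000)
print(additive_plus_pairwise_kernel(x, z))     # linear in d even for d = 10,000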
【5】 Integrating relative survival in multi-state models -- a non-parametric approach
Authors: D. Manevski, H. Putter, M. Pohar Perme, E. F. Bonneville, J. Schetelig, L. C. de Wreede Link: https://arxiv.org/abs/2106.12399 Abstract: Multi-state models provide an extension of the usual survival/event-history analysis setting. In the medical domain, multi-state models give the possibility of further investigating intermediate events such as relapse and remission. In this work, a further extension is proposed using relative survival, where mortality due to population causes (i.e. non-disease-related mortality) is evaluated. The objective is to split all mortality into disease-related and non-disease-related mortality, with and without intermediate events, in datasets where the cause of death is not recorded or is uncertain. To this end, population mortality tables are integrated into the estimation process, while using the basic relative survival idea that the overall mortality hazard can be written as a sum of a population and an excess part. Hence, we propose an upgraded non-parametric approach to estimation, where population mortality is taken into account. Precise definitions and suitable estimators are given for both the transition hazards and probabilities. Variance estimating techniques and confidence intervals are introduced and the behaviour of the new method is investigated through simulations. The newly developed methodology is illustrated by the analysis of a cohort of patients followed after an allogeneic hematopoietic stem cell transplantation. The work is also implemented in the R package mstate.
【6】 Innovations Autoencoder and its Application in Real-Time Anomaly Detection
Authors: Xinyi Wang, Lang Tong Affiliations: School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA Link: https://arxiv.org/abs/2106.12382 Abstract: An innovations sequence of a time series is a sequence of independent and identically distributed random variables with which the original time series has a causal representation. The innovation at a time is statistically independent of the prior history of the time series. As such, it represents the new information contained at present but not in the past. Because of its simple probability structure, an innovations sequence is the most efficient signature of the original. Unlike principal or independent component analysis (PCA/ICA) representations, an innovations sequence preserves not only the complete statistical properties but also the temporal order of the original time series. A long-standing open problem is to find a computationally tractable way to extract an innovations sequence of non-Gaussian processes. This paper presents a deep learning approach, referred to as the Innovations Autoencoder (IAE), that extracts innovations sequences using a causal convolutional neural network. An application of IAE to nonparametric anomaly detection with unknown anomaly and anomaly-free models is also presented.
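The building block behind the IAE is the causal convolution, whose output at time t depends only on inputs up to t. A minimal, dependency-free sketch of one such layer (the actual IAE stacks learned causal layers inside an autoencoder with a trained objective, not shown here):

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] = sum_k w[k] * x[t - k].
    Left-padding with zeros guarantees y[t] never sees future samples,
    which is what makes the output a function of the history only."""
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])
    return np.convolve(x_padded, w, mode="valid")

x = np.random.randn(100)
y = causal_conv1d(x, np.array([0.5, 0.3, 0.2]))
assert y.shape == x.shape           # same length, strictly causal
```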
【7】 Online Updating Statistics for Heterogenous Updating Regressions via Homogenization Techniques
Authors: Lin Lu, Lu Jun, Hu Qinqin, Weiyu Li Affiliations: Zhongtai Securities Institute for Financial Studies, Shandong University, China; School of Statistics and Mathematics, Zhejiang Gongshang University Link: https://arxiv.org/abs/2106.12370 Abstract: Under the environment of big data streams, it is a common situation where the variable set of a model may change according to the condition of data streams. In this paper, we propose a homogenization strategy to represent the heterogenous models that are gradually updated in the process of data streams. With the homogenized representations, we can easily construct various online updating statistics such as parameter estimation, residual sum of squares and $F$-statistic for the heterogenous updating regression models. The main difference from the classical scenarios is that the artificial covariates in the homogenized models are not identically distributed as the natural covariates in the original models; consequently, the related theoretical properties are distinct from the classical ones. The asymptotic properties of the online updating statistics are established, showing that the new method achieves estimation efficiency and the oracle property, without any constraint on the number of data batches. The behavior of the method is further illustrated by various numerical examples from simulation experiments.
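In the homogeneous special case, online updating of regression statistics amounts to accumulating the sufficient statistics X'X, X'y, and y'y batch by batch; the paper's contribution is to homogenize heterogeneous batches (changing variable sets) so that updates of this form still apply. A minimal sketch of the homogeneous case only:

```python
import numpy as np

class OnlineOLS:
    """Streaming least squares: accumulate X'X, X'y and y'y over data batches."""
    def __init__(self, d):
        self.XtX = np.zeros((d, d))
        self.Xty = np.zeros(d)
        self.yty = 0.0
        self.n = 0

    def update(self, X, y):
        """Fold in one batch; no raw data needs to be retained."""
        self.XtX += X.T @ X
        self.Xty += X.T @ y
        self.yty += y @ y
        self.n += len(y)

    def coef(self):
        return np.linalg.solve(self.XtX, self.Xty)

    def rss(self):
        beta = self.coef()
        # ||y - X beta||^2 = y'y - 2 beta'X'y + beta'X'X beta
        return self.yty - 2 * beta @ self.Xty + beta @ self.XtX @ beta
```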
【8】 Identifiability and estimation of meta-elliptical copula generators
Authors: Alexis Derumigny, Jean-David Fermanian Affiliations: Department of Applied Mathematics, Delft University of Technology, Delft, Netherlands; CREST-ENSAE, Palaiseau, France Link: https://arxiv.org/abs/2106.12367 Abstract: Meta-elliptical copulas are often proposed to model dependence between the components of a random vector. They are specified by a correlation matrix and a map $g$, called a density generator. While the correlation matrix can easily be estimated from pseudo-samples of observations, this is not the case for the density generator when it does not belong to a parametric family. We state sufficient conditions to non-parametrically identify this generator. Several nonparametric estimators of $g$ are then proposed, by M-estimation, simulation-based inference, or by an iterative procedure available in an R package. Some simulations illustrate the relevance of the latter method.
【9】 Regularised B-splines projected Gaussian Process priors to estimate the age profile of COVID-19 deaths before and after vaccine roll-out
Authors: Mélodie Monod, Alexandra Blenkinsop, Andrea Brizzi, Yu Chen, Vidoushee Jogarah, Yuanrong Wang, Seth Flaxman, Samir Bhatt, Oliver Ratmann Link: https://arxiv.org/abs/2106.12360 Abstract: The COVID-19 pandemic has caused severe public health consequences in the United States. The United States began a vaccination campaign at the end of 2020 targeting primarily elderly residents before extending access to younger individuals. With both COVID-19 infection fatality ratios and vaccine uptake being heterogeneous across ages, an important consideration is whether the age contribution to deaths shifted over time towards younger age groups. In this study, we use a Bayesian non-parametric spatial approach to estimate the age-specific contribution to COVID-19 attributable deaths over time. The proposed spatial approach is a low-rank Gaussian Process projected by regularised B-splines. Simulation analyses and benchmark results show that the spatial approach performs better than a standard B-splines approach and equivalently well as a standard Gaussian Process, for considerably lower runtimes. We find that COVID-19 has been especially deadly in the United States. The mortality rates among individuals aged 85+ ranged from 1% to 5% across the US states. Since the beginning of the vaccination campaign, the number of weekly deaths has been reduced in every US state, with a faster decrease among individuals aged 75+ than among individuals aged 0-74. Simultaneously with this reduction, the contribution of individuals aged 75+ to deaths decreased, with important disparities in the timing and rapidity of this decrease across the country.
【10】 Calibrating the Lee-Carter and the Poisson Lee-Carter models via Neural Networks
Authors: Salvatore Scognamiglio Link: https://arxiv.org/abs/2106.12312 Abstract: This paper introduces a neural network approach for fitting the Lee-Carter and the Poisson Lee-Carter model on multiple populations. We develop some neural networks that replicate the structure of the individual LC models and allow their joint fitting by analysing the mortality data of all the considered populations simultaneously. The neural network architecture is specifically designed to calibrate each individual model using all available information instead of using a population-specific subset of data as in the traditional estimation schemes. A large set of numerical experiments performed on all the countries of the Human Mortality Database (HMD) shows the effectiveness of our approach. In particular, the resulting parameter estimates appear smooth and less sensitive to the random fluctuations often present in the mortality rates' data, especially for low-population countries. In addition, the forecasting performance is significantly improved.
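For reference, the classical single-population Lee-Carter fit that the networks replicate estimates log m_{x,t} = a_x + b_x k_t by row-centering the log mortality matrix and taking a rank-1 SVD. A minimal sketch (the neural approach instead calibrates all populations jointly):

```python
import numpy as np

def fit_lee_carter(log_m):
    """Classical Lee-Carter fit on a (ages x years) matrix of log death rates.

    Model: log m_{x,t} = a_x + b_x * k_t, with the usual identifiability
    constraints sum_x b_x = 1 and sum_t k_t = 0."""
    a = log_m.mean(axis=1)                        # a_x: average age profile
    centered = log_m - a[:, None]
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    b, k = U[:, 0], s[0] * Vt[0]
    # each centered row sums to zero, so the leading right singular vector is
    # orthogonal to the ones vector and sum_t k_t = 0 holds automatically
    scale = b.sum()
    return a, b / scale, k * scale                # rescale so sum_x b_x = 1
```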
【11】 A Note on the Accuracy of Variational Bayes in State Space Models: Inference and Prediction
Authors: David T. Frazier, Ruben Loaiza-Maya, Gael M. Martin Link: https://arxiv.org/abs/2106.12262 Abstract: Using theoretical and numerical results, we document the accuracy of commonly applied variational Bayes methods across a broad range of state space models. The results demonstrate that, in terms of accuracy on fixed parameters, there is a clear hierarchy in terms of the methods, with approaches that do not approximate the states yielding superior accuracy over methods that do. We also document numerically that the inferential discrepancies between the various methods often yield only small discrepancies in predictive accuracy over small out-of-sample evaluation periods. Nevertheless, in certain settings, these predictive discrepancies can become marked over longer out-of-sample periods. This finding indicates that the invariance of predictive results to inferential inaccuracy, which has been an oft-touted point made by practitioners seeking to justify the use of variational inference, is not ubiquitous and must be assessed on a case-by-case basis.
【12】 ParK: Sound and Efficient Kernel Ridge Regression by Feature Space Partitions
Authors: Luigi Carratino, Stefano Vigogna, Daniele Calandriello, Lorenzo Rosasco Affiliations: MaLGa - DIBRIS, University of Genova, Italy; DeepMind Paris, France; Istituto Italiano di Tecnologia, Genova, Italy; CBMM - MIT, Cambridge, MA, USA Link: https://arxiv.org/abs/2106.12231 Abstract: We introduce ParK, a new large-scale solver for kernel ridge regression. Our approach combines partitioning with random projections and iterative optimization to reduce space and time complexity while provably maintaining the same statistical accuracy. In particular, constructing suitable partitions directly in the feature space rather than in the input space, we promote orthogonality between the local estimators, thus ensuring that key quantities such as local effective dimension and bias remain under control. We characterize the statistical-computational tradeoff of our model, and demonstrate the effectiveness of our method by numerical experiments on large-scale datasets.
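The baseline ParK accelerates is exact kernel ridge regression, whose closed form costs O(n^3) time and O(n^2) memory; the sketch below shows that baseline only (partitioning in feature space, random projections, and iterative optimization, which the paper adds, are what bring these costs down):

```python
import numpy as np

def rbf(A, B, h=1.0):
    """RBF kernel matrix between row sets A (n, d) and B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * h ** 2))

def krr_fit(X, y, lam=1e-3, h=1.0):
    """Exact kernel ridge regression: alpha = (K + n*lam*I)^{-1} y."""
    n = len(y)
    K = rbf(X, X, h)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda Xnew: rbf(Xnew, X, h) @ alpha   # predictor on new points
```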
【13】 groupShapley: Efficient prediction explanation with Shapley values for feature groups
Authors: Martin Jullum, Annabelle Redelmeier, Kjersti Aas Link: https://arxiv.org/abs/2106.12228 Abstract: Shapley values have established themselves as one of the most appropriate and theoretically sound frameworks for explaining predictions from complex machine learning models. The popularity of Shapley values in the explanation setting is probably due to their unique theoretical properties. The main drawback with Shapley values, however, is that their computational complexity grows exponentially in the number of input features, making them unfeasible in many real-world situations where there could be hundreds or thousands of features. Furthermore, with many (dependent) features, presenting/visualizing and interpreting the computed Shapley values also becomes challenging. The present paper introduces groupShapley: a conceptually simple approach for dealing with the aforementioned bottlenecks. The idea is to group the features, for example by type or dependence, and then compute and present Shapley values for these groups instead of for all individual features. Reducing hundreds or thousands of features to half a dozen or so makes precise computations practically feasible and the presentation and knowledge extraction greatly simplified. We prove that under certain conditions, groupShapley is equivalent to summing the feature-wise Shapley values within each feature group. Moreover, we provide a simulation study exemplifying the differences when these conditions are not met. We illustrate the usability of the approach in a real-world car insurance example, where groupShapley is used to provide simple and intuitive explanations.
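Under the stated conditions, groupShapley reduces to summing feature-wise Shapley values within each group, so given any explainer's per-feature output the group scores are one aggregation away. A hedged sketch with hypothetical feature names and groupings:

```python
import numpy as np

def group_shapley_from_featurewise(shap_values, groups, feature_names):
    """Sum feature-wise Shapley values within each feature group.

    shap_values: (n_samples, n_features) array from any Shapley explainer;
    valid as groupShapley only under the paper's conditions."""
    col = {f: i for i, f in enumerate(feature_names)}
    return {
        g: shap_values[:, [col[f] for f in feats]].sum(axis=1)
        for g, feats in groups.items()
    }

# hypothetical grouping of car-insurance features by type
groups = {"driver": ["age", "years_licensed"], "vehicle": ["value", "engine_cc"]}
```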
【14】 Bayesian Chance Constrained Optimization: Approximations and Statistical Consistency
Authors: Prateek Jaiswal, Harsha Honnappa, Vinayak A. Rao Link: https://arxiv.org/abs/2106.12199 Abstract: This paper considers data-driven chance-constrained stochastic optimization problems in a Bayesian framework. Bayesian posteriors afford a principled mechanism to incorporate data and prior knowledge into stochastic optimization problems. However, the computation of Bayesian posteriors is typically an intractable problem, and has spawned a large literature on approximate Bayesian computation. Here, in the context of chance-constrained optimization, we focus on the question of statistical consistency (in an appropriate sense) of the optimal value, computed using an approximate posterior distribution. To this end, we rigorously prove a frequentist consistency result demonstrating the weak convergence of the optimal value to the optimal value of a fixed, parameterized constrained optimization problem. We augment this by also establishing a probabilistic rate of convergence of the optimal value. We also prove the convex feasibility of the approximate Bayesian stochastic optimization problem. Finally, we demonstrate the utility of our approach on an optimal staffing problem for an M/M/c queueing model.
【15】 Closed-Form, Provable, and Robust PCA via Leverage Statistics and Innovation Search
Authors: Mostafa Rahmani, Ping Li Affiliations: Cognitive Computing Lab, Baidu Research, Bellevue, WA, USA Comments: Published in IEEE Transactions on Signal Processing. arXiv admin note: text overlap with arXiv:1912.12988 Link: https://arxiv.org/abs/2106.12190 Abstract: The idea of Innovation Search, which was initially proposed for data clustering, was recently used for outlier detection. In the application of Innovation Search for outlier detection, the directions of innovation were utilized to measure the innovation of the data points. We study the Innovation Values computed by the Innovation Search algorithm under a quadratic cost function, and it is proved that Innovation Values with the new cost function are equivalent to Leverage Scores. This interesting connection is utilized to establish several theoretical guarantees for a Leverage Score based robust PCA method and to design a new robust PCA method. The theoretical results include performance guarantees with different models for the distribution of outliers and the distribution of inliers. In addition, we demonstrate the robustness of the algorithms against the presence of noise. The numerical and theoretical studies indicate that while the presented approach is fast and closed-form, it can outperform most of the existing algorithms.
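Leverage scores, which the paper shows coincide with Innovation Values under a quadratic cost, are the diagonal entries of the projection (hat) matrix; computing them through the thin SVD is both cheap and numerically stable. A minimal sketch:

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores: diagonal of the projection X (X'X)^{-1} X'.

    Computed as squared row norms of the left singular vectors. Large
    scores flag rows poorly explained by the rest of the data, which is
    how they serve as outlier indicators in robust PCA."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return (U ** 2).sum(axis=1)
```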
【16】 Changepoint Detection: An Analysis of the Central England Temperature Series
Authors: Xueheng Shi, Claudie Beaulieu, Rebecca Killick, Robert Lund Affiliations: Department of Statistics, University of California, Santa Cruz; Department of Ocean Sciences; Lancaster University Link: https://arxiv.org/abs/2106.12180 Abstract: This paper presents a statistical analysis of structural changes in the Central England temperature series. This series contains one of the longest surface temperature records available and a changepoint analysis of it reveals several interesting aspects. Regression functions with structural breaks, including mean and trend shifts, are fitted to the series and compared via two commonly used multiple changepoint penalized likelihood criteria. In the end, the optimal model is judged to be one containing three location and trend shifts, with a transition to a rapidly warming regime circa 1989. The variability of the series is not found to be significantly changing, and shift features are judged to be more plausible than short- or long-memory autocorrelations. The analysis serves as a walk-through tutorial of different changepoint techniques, illustrating what can statistically be inferred from different models.
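The simplest ingredient of such a penalized-likelihood changepoint analysis is the single mean-shift scan: accept the split that most reduces the residual sum of squares only if the reduction exceeds a penalty (e.g. a BIC-type term), and recurse on both halves to find multiple shifts. A minimal sketch of that scan (the paper also compares trend shifts and autocorrelation models, which this omits):

```python
import numpy as np

def best_mean_shift(y, penalty):
    """Scan all single mean-shift locations; return the changepoint index
    if the SSE improvement over the no-change model exceeds `penalty`,
    else None. Recursing on both halves gives binary segmentation."""
    n = len(y)
    sse0 = ((y - y.mean()) ** 2).sum()
    best_tau, best_sse = None, sse0
    for tau in range(2, n - 1):            # keep at least 2 points per segment
        left, right = y[:tau], y[tau:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_tau, best_sse = tau, sse
    return best_tau if best_tau is not None and sse0 - best_sse > penalty else None
```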
【17】 COVID-19 Forecasts Using Internet Search Information in the United States
Authors: Simin Ma, Shihao Yang Comments: 16 pages, 4 figures Link: https://arxiv.org/abs/2106.12160 Abstract: As COVID-19 ravages the globe, accurate forecasts of the disease's spread are crucial for situational awareness, resource allocation, and public health decision-making. As an alternative to the traditional disease surveillance data collected by the United States (US) Centers for Disease Control and Prevention (CDC), big data from the Internet, such as online search volumes, have previously been shown to contain valuable information for tracking infectious disease dynamics. In this study, we evaluate the feasibility of using Internet search volumes of relevant queries to track and predict the COVID-19 pandemic. We find a strong association between the COVID-19 death trend and the search volume of symptom-related queries such as "loss of taste". We then develop an influenza-tracking-style model to predict future 2-week COVID-19 deaths at the US national level, by combining search volume information with COVID-19 time series information. Encouraged by the 45% error reduction at the national level compared to the baseline time series model, we additionally build state-level COVID-19 death models, leveraging a cross-state, cross-resolution spatio-temporal framework that pools information from search volumes and COVID-19 reports across states, regions, and the nation. These ARGOX variants are then aggregated in a winner-takes-all ensemble fashion to produce the final state-level 2-week forecasts. Numerical experiments demonstrate that our method steadily outperforms time series baseline models and achieves state-of-the-art performance among the publicly available benchmark models. Overall, we show that disease dynamics and relevant public search behaviors co-evolve during the COVID-19 pandemic, and capturing their dependencies while leveraging historical cases/deaths as well as spatio-temporal cross-region information enables stable and accurate US national and state-level forecasts.
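Stripped to its core, a model of this kind regresses deaths h weeks ahead on recent lags of deaths and of search volumes. A bare-bones least-squares sketch with hypothetical inputs (the paper adds regularization, cross-region pooling, and winner-takes-all ensembling):

```python
import numpy as np

def lagged_search_forecast(deaths, search, n_lags=3, horizon=2):
    """Fit y_{t+horizon} ~ (lags of deaths, lags of search volume, intercept)
    by least squares, then forecast from the most recent lags.
    deaths, search: aligned 1-D weekly series of equal length."""
    T = len(deaths)
    rows, targets = [], []
    for t in range(n_lags - 1, T - horizon):
        rows.append(np.concatenate([deaths[t - n_lags + 1: t + 1],
                                    search[t - n_lags + 1: t + 1], [1.0]]))
        targets.append(deaths[t + horizon])
    beta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    latest = np.concatenate([deaths[-n_lags:], search[-n_lags:], [1.0]])
    return latest @ beta          # point forecast, `horizon` weeks ahead
```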
【18】 Bounds on Causal Effects and Application to High Dimensional Data
Authors: Ang Li, Judea Pearl Affiliations: University of California, Los Angeles, Computer Science Department Link: https://arxiv.org/abs/2106.12121 Abstract: This paper addresses the problem of estimating causal effects when adjustment variables in the back-door or front-door criterion are partially observed. For such scenarios, we derive bounds on the causal effects by solving two non-linear optimization problems, and demonstrate that the bounds are sufficient. Using this optimization method, we propose a framework for dimensionality reduction that allows one to trade bias for estimation power, and demonstrate its performance using simulation studies.
【19】 Generalised Kernel Stein Discrepancy (GKSD): A Unifying Approach for Non-parametric Goodness-of-fit Testing
Authors: Wenkai Xu Link: https://arxiv.org/abs/2106.12105 Abstract: Non-parametric goodness-of-fit testing procedures based on kernel Stein discrepancies (KSD) are promising approaches to validate general unnormalised distributions in various scenarios. Existing works have focused on studying optimal kernel choices to boost test performances. However, Stein operators are generally non-unique, and different choices of Stein operators can also have considerable effects on test performance. In this work, we propose a unifying framework, the generalised kernel Stein discrepancy (GKSD), to theoretically compare and interpret different Stein operators in performing KSD-based goodness-of-fit tests. We derive explicitly how the proposed GKSD framework generalises existing Stein operators and their corresponding tests. In addition, we show that the GKSD framework can be used as a guide to develop kernel-based non-parametric goodness-of-fit tests for complex new data scenarios, e.g. truncated distributions or compositional data. Experimental results demonstrate that the proposed tests control type-I error well and achieve higher test power than existing approaches, including the test based on maximum mean discrepancy (MMD).
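The quantity GKSD generalises is the kernel Stein discrepancy; for a one-dimensional model with score function s(x) and an RBF kernel under the Langevin Stein operator, the V-statistic estimate of the squared KSD has a short closed form. A minimal sketch, testing samples against a standard normal (score s(x) = -x):

```python
import numpy as np

def ksd_squared(x, score, h=1.0):
    """V-statistic estimate of the squared KSD (1-D, RBF kernel k(x,y)):
      u(x, y) = s(x)s(y)k + s(x) d_y k + s(y) d_x k + d_x d_y k."""
    s = score(x)                               # model score at the samples
    d = x[:, None] - x[None, :]
    K = np.exp(-d ** 2 / (2 * h ** 2))
    dKdx = -d / h ** 2 * K                     # dk/dx
    dKdy = d / h ** 2 * K                      # dk/dy
    dKdxdy = (1.0 / h ** 2 - d ** 2 / h ** 4) * K
    U = np.outer(s, s) * K + s[:, None] * dKdy + s[None, :] * dKdx + dKdxdy
    return U.mean()

x = np.random.randn(500)
print(ksd_squared(x, score=lambda x: -x))      # small for well-fitting samples
```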
【20】 Assessing epidemic curves for evidence of superspreading
Authors: Joe Meagher, Nial Friel Link: https://arxiv.org/abs/2106.12064 Abstract: The expected number of secondary infections arising from each index case, the reproduction number, or $R$ number, is a vital summary statistic for understanding and managing epidemic diseases. There are many methods for estimating $R$; however, few of these explicitly model heterogeneous disease reproduction, which gives rise to superspreading within the population. Here we propose a parsimonious discrete-time branching process model for epidemic curves that incorporates heterogeneous individual reproduction numbers. Our Bayesian approach to inference illustrates that this heterogeneity results in less certainty on estimates of the time-varying cohort reproduction number $R_t$. Leave-future-out cross-validation evaluates the predictive performance of the proposed model, allowing us to assess epidemic curves for evidence of superspreading. We apply these methods to a COVID-19 epidemic curve for the Republic of Ireland and find some support for heterogeneous disease reproduction. We conclude that the 10% most infectious index cases account for approximately 40-80% of the expected secondary infections. Our analysis highlights the difficulties in identifying heterogeneous disease reproduction from epidemic curves and that heterogeneity is a vital consideration when estimating $R_t$.
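The flavor of a heterogeneous-reproduction branching process can be conveyed in a few lines: each case draws an individual reproduction number from a Gamma distribution with mean R and dispersion k, giving negative binomial offspring counts overall; small k concentrates transmission in a few superspreaders. A hedged simulation sketch (not the paper's inference machinery; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_epidemic_curve(R=1.3, k=0.2, n_generations=12, seed_cases=10):
    """Discrete-time branching process with heterogeneous reproduction.

    Each case draws nu ~ Gamma(shape=k, scale=R/k) (mean R, dispersion k)
    and produces Poisson(nu) offspring, i.e. negative binomial overall."""
    curve = [seed_cases]
    for _ in range(n_generations - 1):
        nu = rng.gamma(shape=k, scale=R / k, size=curve[-1])
        curve.append(int(rng.poisson(nu).sum()))
        if curve[-1] == 0:                 # epidemic has died out
            break
    return np.array(curve)

print(simulate_epidemic_curve())
```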
【21】 Pure Exploration in Kernel and Neural Bandits
Authors: Yinglun Zhu, Dongruo Zhou, Ruoxi Jiang, Quanquan Gu, Rebecca Willett, Robert Nowak Affiliations: University of Wisconsin-Madison; University of California, Los Angeles; University of Chicago Link: https://arxiv.org/abs/2106.12034 Abstract: We study pure exploration in bandits, where the dimension of the feature representation can be much larger than the number of arms. To overcome the curse of dimensionality, we propose to adaptively embed the feature representation of each arm into a lower-dimensional space and carefully deal with the induced model misspecifications. Our approach is conceptually very different from existing works that can either only handle low-dimensional linear bandits or passively deal with model misspecifications. We showcase the application of our approach to two pure exploration settings that were previously under-studied: (1) the reward function belongs to a possibly infinite-dimensional Reproducing Kernel Hilbert Space, and (2) the reward function is nonlinear and can be approximated by neural networks. Our main results provide sample complexity guarantees that only depend on the effective dimension of the feature spaces in the kernel or neural representations. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the efficacy of our methods.
【22】 Inference in High-dimensional Linear Regression
Authors: Heather S. Battey, Nancy Reid Affiliations: Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada Comments: 27 pages, 1 figure, 5 tables Link: https://arxiv.org/abs/2106.12001 Abstract: We develop an approach to inference in a linear regression model when the number of potential explanatory variables is larger than the sample size. Our approach treats each regression coefficient in turn as the interest parameter, the remaining coefficients being nuisance parameters, and seeks an optimal interest-respecting transformation. The role of this transformation is to allow a marginal least squares analysis for each variable, as in a factorial experiment. One parameterization of the problem is found to be particularly convenient, both computationally and mathematically. In particular, it permits an analytic solution to the optimal transformation problem, facilitating comparison to other work. In contrast to regularized regression such as the lasso (Tibshirani, 1996) and its extensions, neither adjustment for selection, nor rescaling of the explanatory variables is needed, ensuring the physical interpretation of regression coefficients is retained. We discuss the use of such confidence intervals as part of a broader set of inferential statements, so as to reflect uncertainty over the model as well as over the parameters. The considerations involved in extending the work to other regression models are briefly discussed.
【23】 Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding
Authors: Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu Affiliations: Peking University; Princeton University; University of Science and Technology of China; Microsoft Research Comments: Preprint. Work in Progress Link: https://arxiv.org/abs/2106.12566 Abstract: The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since in many state-of-the-art models, relative positional encoding is used as default, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n \log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.
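The computational heart of the method is the classical fact that a Toeplitz matrix-vector product can be computed in O(n log n): embed the n x n Toeplitz matrix in a 2n x 2n circulant, whose action is diagonalized by the FFT. A minimal numpy sketch of that primitive, with a dense-product check (for RPE, the Toeplitz entries would be the relative-position biases):

```python
import numpy as np

def toeplitz_matvec_fft(col, row, v):
    """Multiply the n x n Toeplitz matrix defined by first column `col` and
    first row `row` (col[0] == row[0]) by v in O(n log n), via a 2n-size
    circulant embedding whose eigenvalues are the FFT of its first column."""
    n = len(v)
    c = np.concatenate([col, [0.0], row[1:][::-1]])   # circulant first column
    fft_c = np.fft.fft(c)
    fft_v = np.fft.fft(np.concatenate([v, np.zeros(n)]))
    return np.real(np.fft.ifft(fft_c * fft_v)[:n])

# check against the dense product
n = 8
col, row = np.random.randn(n), np.random.randn(n)
row[0] = col[0]
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
v = np.random.randn(n)
assert np.allclose(T @ v, toeplitz_matvec_fft(col, row, v))
```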
【24】 Multi-Class Classification of Blood Cells - End to End Computer Vision based diagnosis case study
Authors: Sai Sukruth Bezugam Affiliations: Electrical Engineering Department, Indian Institute of Technology Delhi Comments: 18 pages, 10 figures Link: https://arxiv.org/abs/2106.12548 Abstract: The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples, so automated methods to detect and classify blood cell subtypes have important medical applications: automated medical image processing and analysis offers a powerful tool for medical diagnosis. In this work we tackle the problem of white blood cell classification based on the morphological characteristics of their outer contour and color. We explore a set of preprocessing and segmentation algorithms (color-based segmentation, morphological processing, contouring) along with feature extraction methods (corner detection algorithms and Histogram of Oriented Gradients (HOG)) and dimensionality reduction algorithms (Principal Component Analysis (PCA)), which enable various unsupervised (k-nearest neighbors) and supervised (Support Vector Machine, Decision Trees, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes) algorithms to recognize and classify white blood cells into the Eosinophil, Lymphocyte, Monocyte, and Neutrophil categories. We take a step further and explore various deep convolutional neural network architectures (SqueezeNet, MobileNetV1, MobileNetV2, InceptionNet, etc.) both with and without preprocessing/segmentation. Across these many algorithms, we aim to identify a robust algorithm with the lowest time complexity and low resource requirements. The outcome of this work can guide the selection of algorithms for automated blood cell classification, as per requirements.
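As a small illustration of one stage of the classical pipeline described above, the sketch below extracts HOG descriptors with scikit-image and hands them to an SVM; all parameter values are illustrative, not those of the study:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def extract_hog_features(images):
    """HOG descriptors for a batch of grayscale cell crops of shape (n, H, W)."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

# train_images: (n, H, W) grayscale crops; train_labels: cell subtype per crop
# clf = SVC(kernel="rbf").fit(extract_hog_features(train_images), train_labels)
```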
【25】 Synthetic Benchmarks for Scientific Research in Explainable Machine Learning
Authors: Yang Liu, Sujay Khandagale, Colin White, Willie Neiswanger Link: https://arxiv.org/abs/2106.12543 Abstract: As machine learning models grow more complex and their applications become more high-stakes, tools for explaining model predictions have become increasingly important. Despite the widespread use of explainability techniques, evaluating and comparing different feature attribution methods remains challenging: evaluations ideally require human studies, and empirical evaluation metrics are often computationally prohibitive on real-world datasets. In this work, we address this issue by releasing XAI-Bench: a suite of synthetic datasets along with a library for benchmarking feature attribution algorithms. Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values that are needed to evaluate ground-truth Shapley values and other metrics. The synthetic datasets we release offer a wide variety of parameters that can be configured to simulate real-world data. We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers. The efficiency of our library will help bring new explainability methods from development to deployment.
【26】 Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound
Authors: Valentina Zantedeschi, Paul Viallard, Emilie Morvant, Rémi Emonet, Amaury Habrard, Pascal Germain, Benjamin Guedj Affiliations: Inria, Lille - Nord Europe; The Inria London Programme; University College London, Centre for Artificial Intelligence; Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien, Saint-Etienne, France Link: https://arxiv.org/abs/2106.12535 Abstract: We investigate a stochastic counterpart of majority votes over finite ensembles of classifiers, and study its generalization properties. While our approach holds for arbitrary distributions, we instantiate it with Dirichlet distributions: this allows for a closed-form and differentiable expression for the expected risk, which then turns the generalization bound into a tractable training objective. The resulting stochastic majority vote learning algorithm achieves state-of-the-art accuracy and benefits from (non-vacuous) tight generalization bounds, in a series of numerical experiments when compared to competing algorithms which also minimize PAC-Bayes objectives -- both with uninformed (data-independent) and informed (data-dependent) priors.
【27】 Bayesian Deep Learning Hyperparameter Search for Robust Function Mapping to Polynomials with Noise
Authors: Nidhin Harilal, Udit Bhatia, Auroop R. Ganguly Affiliations: Indian Institute of Technology Gandhinagar, Gujarat, India; Northeastern University, Boston, MA, USA; Pacific Northwest National Laboratory, Richland, WA, USA Link: https://arxiv.org/abs/2106.12532 Abstract: Advances in neural architecture search, as well as explainability and interpretability of connectionist architectures, have been reported in the recent literature. However, our understanding of how to design Bayesian Deep Learning (BDL) hyperparameters, specifically, the depth, width and ensemble size, for robust function mapping with uncertainty quantification, is still emerging. This paper attempts to further our understanding by mapping Bayesian connectionist representations to polynomials of different orders with varying noise types and ratios. We examine the noise-contaminated polynomials to search for the combination of hyperparameters that can extract the underlying polynomial signals while quantifying uncertainties based on the noise attributes. Specifically, we study whether an appropriate neural architecture and ensemble configuration can be found to detect the signal of any n-th order polynomial contaminated with noise having different distributions, signal-to-noise ratios (SNR), and varying noise attributes. Our results suggest the possible existence of an optimal network depth as well as an optimal number of ensembles for prediction skills and uncertainty quantification, respectively. However, optimality is not discernible for width, even though the performance gain reduces with increasing width at high values of width. Our experiments and insights can be directional to understand theoretical properties of BDL representations and to design practical solutions.
【28】 Training Data Subset Selection for Regression with Controlled Generalization Error
Authors: Durga Sivasubramanian, Rishabh Iyer, Ganesh Ramakrishnan, Abir De Link: https://arxiv.org/abs/2106.12491 Abstract: Data subset selection from a large number of training instances has been a successful approach toward efficient and cost-effective machine learning. However, models trained on a smaller subset may show poor generalization ability. In this paper, our goal is to design an algorithm for selecting a subset of the training data, so that the model can be trained quickly, without significantly sacrificing on accuracy. More specifically, we focus on data subset selection for L2 regularized regression problems and provide a novel problem formulation which seeks to minimize the training loss with respect to both the trainable parameters and the subset of training data, subject to error bounds on the validation set. We tackle this problem using several technical innovations. First, we represent this problem with simplified constraints using the dual of the original training problem and show that the objective of this new representation is a monotone and alpha-submodular function, for a wide variety of modeling choices. Such properties lead us to develop SELCON, an efficient majorization-minimization algorithm for data subset selection, that admits an approximation guarantee even when the training provides an imperfect estimate of the trained model. Finally, our experiments on several datasets show that SELCON trades off accuracy and efficiency more effectively than the current state-of-the-art.
【29】 Portfolio Allocation under Asymmetric Dependence in Asset Returns using Local Gaussian Correlations
Authors: Anders D. Sleire, Bård Støve, Håkon Otneim, Geir Drage Berentsen, Dag Tjøstheim, Sverre Hauso Haugen Link: https://arxiv.org/abs/2106.12425 Abstract: It is well known that there are asymmetric dependence structures between financial returns. In this paper we use a new nonparametric measure of local dependence, the local Gaussian correlation, to improve portfolio allocation. We extend the classical mean-variance framework, and show that the portfolio optimization is straightforward using our new approach, only relying on a tuning parameter (the bandwidth). The new method is shown to outperform the equally weighted (1/N) portfolio and the classical Markowitz portfolio for monthly asset returns data.
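For context, the classical mean-variance weights the paper extends have a closed form once a mean vector and covariance matrix are fixed; the local Gaussian correlation approach replaces the single global covariance with a locally estimated, bandwidth-tuned one. A sketch of the classical baseline only:

```python
import numpy as np

def markowitz_weights(mu, Sigma, risk_aversion=5.0):
    """Classical mean-variance weights: maximize w'mu - (gamma/2) w'Sigma w
    subject to sum(w) = 1 (short positions allowed)."""
    inv = np.linalg.inv(Sigma)
    ones = np.ones(len(mu))
    w = inv @ (mu / risk_aversion)                            # unconstrained optimum
    w += inv @ ones * (1 - ones @ w) / (ones @ inv @ ones)    # enforce sum(w) = 1
    return w
```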
【30】 Alias-Free Generative Adversarial Networks
Authors: Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, Timo Aila Affiliations: Aalto University and NVIDIA; NVIDIA and Aalto University Link: https://arxiv.org/abs/2106.12423 Abstract: We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, e.g., detail appearing to be glued to image coordinates instead of the surfaces of depicted objects. We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process. The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
【31】 False perfection in machine prediction: Detecting and assessing circularity problems in machine learning
Authors: Michael Hagmann, Stefan Riezler Affiliations: Computational Linguistics, Heidelberg University, Heidelberg, Germany; Interdisciplinary Center for Scientific Computing (IWR), Heidelberg. (These authors contributed equally to this work.) Link: https://arxiv.org/abs/2106.12417 Abstract: Machine learning algorithms train models from patterns of input data and target outputs, with the goal of predicting correct outputs for unseen test inputs. Here we demonstrate a problem of machine learning in vital application areas such as medical informatics or patent law that consists of the inclusion of measurements on which target outputs are deterministically defined in the representations of input data. This leads to perfect, but circular predictions based on a machine reconstruction of the known target definition, but fails on real-world data where the defining measurements may not or only incompletely be available. We present a circularity test that shows, for given datasets and black-box machine learning models, whether the target functional definition can be reconstructed and has been used in training. We argue that a transfer of research results to real-world applications requires avoiding circularity by separating measurements that define target outcomes from data representations in machine learning.
【32】 Gaussian and Hermite Ornstein-Uhlenbeck processes
Authors: Khalifa Es-Sebaiy Link: https://arxiv.org/abs/2106.12311 Abstract: In the present paper we study the asymptotic behavior of the auto-covariance function for Ornstein-Uhlenbeck (OU) processes driven by Gaussian noises with stationary and non-stationary increments and for Hermite OU processes. Our results are generalizations of the corresponding results of Cheridito et al. [CKM] and Kaarakka and Salminen [KS].
【33】 Should You Go Deeper? Optimizing Convolutional Neural Network Architectures without Training by Receptive Field Analysis
Authors: Mats L. Richter, Julius Schöning, Ulf Krumnack Affiliations: Osnabrück, Germany; Osnabrück University of Applied Sciences Comments: Preprint Link: https://arxiv.org/abs/2106.12307 Abstract: Applying artificial neural networks (ANN) to specific tasks, researchers, programmers, and other specialists usually overshoot the number of convolutional layers in their designs. By implication, these ANNs hold too many parameters, which need to be trained unnecessarily without impacting the result. The features a convolutional layer can process are strictly limited by its receptive field. By layer-wise analyzing the expansion of the receptive fields, we can reliably predict sequences of layers that will not contribute qualitatively to the inference in the given ANN architecture. Based on these analyses, we propose design strategies to resolve these inefficiencies, optimizing the explainability and the computational performance of ANNs. Since neither the strategies nor the analysis requires training of the actual model, these insights allow for a very efficient design process of ANN architectures, which might be automated in the future.
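The layer-wise receptive field expansion underlying such an analysis follows the standard recursion r_l = r_{l-1} + (k_l - 1) * (product of strides of earlier layers); once r_l covers the input resolution, deeper convolutions can no longer see qualitatively new spatial context. A minimal sketch for a hypothetical layer stack:

```python
def receptive_fields(layers):
    """Per-layer receptive field via r_l = r_{l-1} + (k_l - 1) * prod(earlier strides).

    `layers` is a list of (kernel_size, stride) tuples. Layers whose receptive
    field already reaches the input resolution are candidates for removal."""
    r, jump, out = 1, 1, []
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
        out.append(r)
    return out

# e.g. three 3x3 convs, the second with stride 2 (hypothetical stack)
print(receptive_fields([(3, 1), (3, 2), (3, 1)]))  # [3, 5, 9]
```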
【34】 Behavior Mimics Distribution: Combining Individual and Group Behaviors for Federated Learning
Authors: Hua Huang, Fanhua Shang, Yuanyuan Liu, Hongying Liu Affiliations: Key Lab of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, China; Peng Cheng Lab, Shenzhen, China Comments: This paper has been accepted by the International Joint Conference on Artificial Intelligence (IJCAI) 2021 Link: https://arxiv.org/abs/2106.12300 Abstract: Federated Learning (FL) has become an active and promising distributed machine learning paradigm. As a result of statistical heterogeneity, recent studies clearly show that the performance of popular FL methods (e.g., FedAvg) deteriorates dramatically due to the client drift caused by local updates. This paper proposes a novel Federated Learning algorithm (called IGFL), which leverages both Individual and Group behaviors to mimic distribution, thereby improving the ability to deal with heterogeneity. Unlike existing FL methods, our IGFL can be applied to both client and server optimization. As a by-product, we propose a new attention-based federated learning in the server optimization of IGFL. To the best of our knowledge, this is the first time to incorporate attention mechanisms into federated optimization. We conduct extensive experiments and show that IGFL can significantly improve the performance of existing federated learning methods. Especially when the distributions of data among individuals are diverse, IGFL can improve the classification accuracy by about 13% compared with prior baselines.
【35】 ADAVI: Automatic Dual Amortized Variational Inference Applied To Pyramidal Bayesian Models 标题:ADAVI:应用于金字塔贝叶斯模型的自动对偶摊销变分推理
作者:Louis Rouillard,Demian Wassermann 机构:Université Paris-Saclay, Inria, CEA, Palaiseau, France 备注:None 链接:https://arxiv.org/abs/2106.12248 摘要:人口研究通常使用带有板(plate)结构的层次贝叶斯模型(HBM)来表示金字塔式组织的数据。这些模型在神经成像等环境中可能会变得异常庞大,其中一个样本由一个功能性MRI信号组成,该信号在4个测量环节中,在6.4万个大脑位置上测量,涉及至少数十名受试者。即使是在300个大脑位置的特定皮层区域上的一个简化例子,也有大约100万个参数,这妨碍了基于模拟的推理(SBI)等现代密度估计技术的使用。为了在这类具有挑战性的问题中推断参数的后验分布,我们设计了一种新方法,自动产生与目标HBM对偶的变分族。该变分族表示为一个神经网络,由一个基于注意力的层次编码器组成,它将摘要统计量馈入一组规范化流。我们自动导出的神经网络利用了带板结构HBM中的可交换性,并对其参数空间进行因子分解。由此产生的体系结构相对于典型的SBI表示,其参数量减少了几个数量级,同时保持了表达能力。我们的方法在摊销设置中对指定的HBM进行推断:一旦训练完成,它可以很容易地应用于新的数据样本来计算参数的完整后验。我们在模拟数据以及一个具有挑战性的高维脑区划分实验上证明了我们方法的能力。我们还提出了几个位于SBI技术和结构化变分推理交叉点上的开放问题。 摘要:Frequently, population studies feature pyramidally-organized data represented using Hierarchical Bayesian Models (HBM) enriched with plates. These models can become prohibitively large in settings such as neuroimaging, where a sample is composed of a functional MRI signal measured on 64 thousand brain locations, across 4 measurement sessions, and at least tens of subjects. Even a reduced example on a specific cortical region of 300 brain locations features around 1 million parameters, hampering the usage of modern density estimation techniques such as Simulation-Based Inference (SBI). To infer parameter posterior distributions in this challenging class of problems, we designed a novel methodology that automatically produces a variational family dual to a target HBM. This variational family, represented as a neural network, consists of the combination of an attention-based hierarchical encoder feeding summary statistics to a set of normalizing flows. Our automatically-derived neural network exploits exchangeability in the plate-enriched HBM and factorizes its parameter space. The resulting architecture reduces by orders of magnitude its parameterization with respect to that of a typical SBI representation, while maintaining expressivity. Our method performs inference on the specified HBM in an amortized setup: once trained, it can readily be applied to a new data sample to compute the parameters' full posterior. We demonstrate the capability of our method on simulated data, as well as a challenging high-dimensional brain parcellation experiment. We also open up several questions that lie at the intersection between SBI techniques and structured Variational Inference.
【36】 A Unified Approach to Fair Online Learning via Blackwell Approachability 标题:通过Blackwell可接近性实现公平在线学习的统一方法
作者:Evgenii Chzhen,Christophe Giraud,Gilles Stoltz 机构:Université Paris-Saclay, CNRS, Laboratoire de mathématiques d’Orsay, Orsay, France 链接:https://arxiv.org/abs/2106.12242 摘要:我们为具有随机敏感和非敏感上下文的公平在线学习提供了一个设定和一般方法。该设定是玩家和自然之间的重复博弈,在每个阶段,双方都根据上下文选择动作。受"不知情"(unawareness)概念的启发,我们假设玩家在做出决定之前只能访问非敏感上下文,同时我们讨论了自然可以访问敏感上下文和自然不知道敏感上下文的两种情况。通过调整Blackwell的可接近性理论以处理上下文分布未知的情况,我们给出了学习目标与某些公平约束相容的一般充要条件。这一条件在(分组)无遗憾和(分组)校准目标以及作为附加约束的人口均等上被实例化。当目标与约束不相容时,所提供的框架允许刻画两者之间的最优权衡。 摘要:We provide a setting and a general approach to fair online learning with stochastic sensitive and non-sensitive contexts. The setting is a repeated game between the Player and Nature, where at each stage both pick actions based on the contexts. Inspired by the notion of unawareness, we assume that the Player can only access the non-sensitive context before making a decision, while we discuss both cases of Nature accessing the sensitive contexts and Nature unaware of the sensitive contexts. Adapting Blackwell's approachability theory to handle the case of an unknown context distribution, we provide a general necessary and sufficient condition for learning objectives to be compatible with some fairness constraints. This condition is instantiated on (group-wise) no-regret and (group-wise) calibration objectives, and on demographic parity as an additional constraint. When the objective is not compatible with the constraint, the provided framework permits characterising the optimal trade-off between the two.
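作为背景,凸目标集情形下Blackwell可接近性的教科书式对偶条件可写为(此处记号为本文补充,并非论文原文):

```latex
% Blackwell's condition for a closed convex target set C in a repeated game
% with vector-valued payoff m(p, q) (textbook statement, for background):
% C is approachable if and only if
\[
  \forall\, q \in \Delta(\mathcal{B}) \;\; \exists\, p \in \Delta(\mathcal{A}) :
  \quad m(p, q) \in \mathcal{C}.
\]
```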
【37】 Random Effect Bandits 标题:随机效应多臂老虎机
作者:Rong Zhu,Branislav Kveton 机构:Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, China, MOE Frontiers Center for Brain Science, Fudan University, Google Research, Mountain View, CA, USA 链接:https://arxiv.org/abs/2106.12200 摘要:本文研究了一个经典的在线学习问题:多臂老虎机中的遗憾最小化。为了设计统计上更高效的算法,我们建议采用随机效应模型的假设。在该模型中,各臂的平均奖励独立地从一个未知分布中抽取,我们估计该分布的参数。我们给出了该模型中臂均值的估计量,并分析了其不确定性。基于这些结果,我们设计了一个UCB算法,称为ReUCB。我们分析了ReUCB,并证明了其$n$轮Bayes遗憾界与现有下界相匹配。我们的实验表明,在无需假设臂均值的先验分布已知的情况下,ReUCB在各种情形下都能优于Thompson抽样。 摘要:This paper studies regret minimization in multi-armed bandits, a classical online learning problem. To develop more statistically-efficient algorithms, we propose to use the assumption of a random-effect model. In this model, the mean rewards of arms are drawn independently from an unknown distribution, whose parameters we estimate. We provide an estimator of the arm means in this model and also analyze its uncertainty. Based on these results, we design a UCB algorithm, which we call ReUCB. We analyze ReUCB and prove a Bayes regret bound on its $n$-round regret, which matches an existing lower bound. Our experiments show that ReUCB can outperform Thompson sampling in various scenarios, without assuming that the prior distribution of arm means is known.
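上述随机效应思想可示意为一种UCB型指标:把每个臂的经验均值向估计的总体均值收缩。以下为该建模假设的示意草图(收缩形式与探索项均为假设,并非论文中ReUCB的精确指标):

```python
import numpy as np

# Sketch of a random-effect UCB index (illustrative; the exact ReUCB estimator
# and confidence width are derived in the paper). Arm means are modelled as
# draws from a common distribution, so each empirical mean is shrunk toward
# the cross-arm average, with shrinkage depending on pull counts.

def random_effect_ucb_index(sums, counts, t, sigma2=1.0, tau2=1.0):
    n = np.maximum(counts, 1)
    means = sums / n
    mu_hat = means.mean()                          # estimated population mean
    shrink = tau2 / (tau2 + sigma2 / n)            # empirical-Bayes shrinkage
    post_mean = shrink * means + (1 - shrink) * mu_hat
    post_var = shrink * sigma2 / n
    return post_mean + np.sqrt(2 * post_var * np.log(t + 1))  # UCB bonus
```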
【38】 Fairness for Image Generation with Uncertain Sensitive Attributes 标题:具有不确定敏感属性的图像生成公平性研究
作者:Ajil Jalal,Sushrut Karmalkar,Jessica Hoffmann,Alexandros G. Dimakis,Eric Price 机构:The University of Texas at Austin 链接:https://arxiv.org/abs/2106.12182 摘要:这项工作处理图像超分辨率等生成过程背景下的公平性问题,这需要与标准分类设定不同的定义。此外,虽然传统的群体公平性定义通常是针对特定的受保护群体来定义的——掩盖了这些分组是人为的、带有历史和政治动机这一事实——但我们强调,并不存在真实的身份标准。例如,南亚人和东亚人应该被视为一个群体还是不同的群体?我们应该把一个种族看作一个整体,还是按性别进一步划分?选择哪些群体是有效的、谁属于这些群体,是一个无法解决的两难问题,对亚洲人"公平"可能要求对南亚人"不公平"。这推动了允许算法对相关分组保持"不知情"(oblivious)的定义的引入。我们定义了若干直观的群体公平概念,并研究了它们的不相容性和权衡。我们证明了人口均等的自然扩展强烈依赖于分组,并且不可能在不知情的情况下实现。另一方面,我们引入的概念上全新的定义"条件比例表示"(Conditional Proportional Representation)可以通过后验抽样在不知情的情况下实现。我们的实验验证了我们的理论结果,并使用最先进的生成模型实现了公平的图像重建。 摘要:This work tackles the issue of fairness in the context of generative procedures, such as image super-resolution, which entail different definitions from the standard classification setting. Moreover, while traditional group fairness definitions are typically defined with respect to specified protected groups -- camouflaging the fact that these groupings are artificial and carry historical and political motivations -- we emphasize that there are no ground truth identities. For instance, should South and East Asians be viewed as a single group or separate groups? Should we consider one race as a whole or further split by gender? Choosing which groups are valid and who belongs in them is an impossible dilemma and being "fair" with respect to Asians may require being "unfair" with respect to South Asians. This motivates the introduction of definitions that allow algorithms to be oblivious to the relevant groupings. We define several intuitive notions of group fairness and study their incompatibilities and trade-offs. We show that the natural extension of demographic parity is strongly dependent on the grouping, and impossible to achieve obliviously. On the other hand, the conceptually new definition we introduce, Conditional Proportional Representation, can be achieved obliviously through Posterior Sampling. Our experiments validate our theoretical results and achieve fair image reconstruction using state-of-the-art generative models.
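作为参照,论文所推广的经典人口均等(demographic parity)定义如下(教科书式表述,记号为本文补充):

```latex
% Classical demographic parity (the baseline notion extended above): a
% predictor \hat{Y} satisfies demographic parity w.r.t. a sensitive
% attribute A if
\[
  \mathbb{P}(\hat{Y} = y \mid A = a) = \mathbb{P}(\hat{Y} = y \mid A = a')
  \qquad \text{for all } y,\ a,\ a'.
\]
```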
【39】 Deep Gaussian Processes: A Survey 标题:深高斯过程:综述
作者:Kalvik Jakkala 机构:Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC, USA 备注:23 pages, 5 figures 链接:https://arxiv.org/abs/2106.12135 摘要:高斯过程是贝叶斯学习的主要方法之一。虽然这种方法已成功地应用于许多问题,但它有一些基本的局限性。文献中的多种方法解决了这些局限性,然而,到目前为止,还没有对这些主题进行全面的综述。现有的大多数综述只关注高斯过程的某一种特定变体及其衍生方法。本综述详细说明了使用高斯过程的核心动机、其数学公式、局限性,以及多年来为解决上述局限性而蓬勃发展的研究主题。此外,深高斯过程(DGPs)作为一个特定的研究领域,在过去十年中取得了长足进展;本综述概述了推动这一研究领域前沿的重要出版物。最后,对开放问题和未来研究方向进行了简要讨论。 摘要:Gaussian processes are one of the dominant approaches in Bayesian learning. Although the approach has been applied to numerous problems with great success, it has a few fundamental limitations. Multiple methods in literature have addressed these limitations. However, there has not been a comprehensive survey of these topics as of yet. Most existing surveys focus on only one particular variant of Gaussian processes and their derivatives. This survey details the core motivations for using Gaussian processes, their mathematical formulations, limitations, and research themes that have flourished over the years to address said limitations. Furthermore, one particular research area, Deep Gaussian Processes (DGPs), has improved substantially in the past decade. The significant publications that advanced the forefront of this research area are outlined in this survey. Finally, a brief discussion on open problems and research directions for future work is presented at the end.
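作为本综述所涉公式的背景,标准高斯过程回归的预测后验为(教科书结果,列此备查):

```latex
% Standard GP regression: given training inputs X, targets y with noise
% variance \sigma_n^2, kernel k, and test input x_*, the predictive posterior is
\begin{align*}
  \mu_*      &= k(x_*, X)\,\bigl[K(X, X) + \sigma_n^2 I\bigr]^{-1} y, \\
  \sigma_*^2 &= k(x_*, x_*) - k(x_*, X)\,\bigl[K(X, X) + \sigma_n^2 I\bigr]^{-1} k(X, x_*).
\end{align*}
```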
【40】 Near-Optimal Linear Regression under Distribution Shift 标题:分布偏移下的近最优线性回归
作者:Qi Lei,Wei Hu,Jason D. Lee 备注:ICML 2021 链接:https://arxiv.org/abs/2106.12108 摘要:当源域数据充足而目标域标记数据稀缺时,迁移学习至关重要。我们针对分布偏移下的线性回归问题,构造了达到minimax线性风险的估计量。我们的算法涵盖了包括协变量偏移和模型偏移在内的不同迁移学习设定。我们还考虑了数据由线性模型或一般非线性模型生成的情形。我们证明了对于各种源/目标分布,即使与非线性估计量相比,线性minimax估计量也在minimax风险的绝对常数倍以内。 摘要:Transfer learning is essential when sufficient data comes from the source domain, with scarce labeled data from the target domain. We develop estimators that achieve minimax linear risk for linear regression problems under distribution shift. Our algorithms cover different transfer learning settings including covariate shift and model shift. We also consider when data are generated from either linear or general nonlinear models. We show that linear minimax estimators are within an absolute constant of the minimax risk even among nonlinear estimators for various source/target distributions.
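为使协变量偏移设定更具体,下面给出标准的重要性加权最小二乘基线(常见教科书方法,并非论文构造的minimax最优估计量;示意中假设密度比已知):

```python
import numpy as np

# Importance-weighted least squares under covariate shift (illustrative
# baseline only; the paper derives minimax-optimal linear estimators).
# Assumes the density ratio w(x) = p_target(x) / p_source(x) is available.

def weighted_least_squares(X_src, y_src, ratio):
    Xw = X_src * ratio[:, None]              # per-sample importance weights
    XtWX = Xw.T @ X_src
    XtWy = Xw.T @ y_src
    return np.linalg.solve(XtWX, XtWy)       # weighted normal equations
```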
【41】 Learning Identity-Preserving Transformations on Data Manifolds 标题:学习数据流形上的保身份变换
作者:Marissa Connor,Kion Fallah,Christopher Rozell 机构:School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 链接:https://arxiv.org/abs/2106.12096 摘要:许多机器学习技术在模型中加入保持身份的变换,以将其性能推广到以前未见的数据。这些变换通常选自一组已知的、应用后能保持输入身份的函数(例如旋转、平移、翻转和缩放)。然而,还有许多自然变化无法被标注以供监督,也无法通过检查数据来定义。正如流形假设所表明的,许多这类自然变化存在于低维非线性流形上或其附近。有几种技术通过一组学习到的李群算子来表示流形变化,这些算子定义了流形上的运动方向。然而,这些方法是有限的,因为它们在训练模型时需要变换标签,并且缺乏一种方法来确定流形的哪些区域适合应用每个特定算子。我们通过引入一种不需要变换标签的学习策略来解决这些限制,并开发了一种方法,在保持输入身份的同时学习每个算子可能被使用的局部区域。在MNIST和Fashion-MNIST上的实验突出了我们的模型在多类数据集上学习保持身份变换的能力。此外,我们在CelebA上进行训练,以展示我们的模型以无监督方式在复杂数据集上学习语义上有意义的变换的能力。 摘要:Many machine learning techniques incorporate identity-preserving transformations into their models to generalize their performance to previously unseen data. These transformations are typically selected from a set of functions that are known to maintain the identity of an input when applied (e.g., rotation, translation, flipping, and scaling). However, there are many natural variations that cannot be labeled for supervision or defined through examination of the data. As suggested by the manifold hypothesis, many of these natural variations live on or near a low-dimensional, nonlinear manifold. Several techniques represent manifold variations through a set of learned Lie group operators that define directions of motion on the manifold. However, these approaches are limited because they require transformation labels when training their models and they lack a method for determining which regions of the manifold are appropriate for applying each specific operator. We address these limitations by introducing a learning strategy that does not require transformation labels and developing a method that learns the local regions where each operator is likely to be used while preserving the identity of inputs. Experiments on MNIST and Fashion-MNIST highlight our model's ability to learn identity-preserving transformations on multi-class datasets. Additionally, we train on CelebA to showcase our model's ability to learn semantically meaningful transformations on complex datasets in an unsupervised manner.
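学习到的李群算子所定义的"运动方向"可用矩阵指数示意如下(其中算子A为假设的旋转生成元,代替实际学习得到的算子):

```python
import numpy as np
from scipy.linalg import expm

# Moving a point along a manifold direction defined by a (learned) Lie group
# operator A: x(c) = expm(c * A) @ x0. Here A is a hypothetical 2-D rotation
# generator; in the paper's setting such operators are learned from data.

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])      # generator of planar rotations
x0 = np.array([1.0, 0.0])
for c in (0.0, 0.5, 1.0):        # transport coefficient along the operator
    print(c, expm(c * A) @ x0)   # traces an identity-preserving path (a circle)
```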
【42】 A Simple Baseline for Batch Active Learning with Stochastic Acquisition Functions 标题:具有随机获取函数的批处理主动学习的一种简单基线
作者:Andreas Kirsch,Sebastian Farquhar,Yarin Gal 机构:Department of Computer Science, University of Oxford 链接:https://arxiv.org/abs/2106.12059 摘要:在主动学习中,新标签通常是成批获取的。然而,常见的获取函数一次只针对单样本获取轮设计,当它们的分数被简单地用于批量获取时,得到的批次缺乏多样性,从而降低性能。另一方面,最先进的批量获取函数的计算成本很高。在本文中,我们提出了一类新的随机获取函数,通过观察单样本获取分数如何随额外样本的获取而变化,并为额外的批次样本对这种差异进行建模,将单样本获取函数扩展到批量设定。我们只需根据获取分数,使用Gibbs分布从池集中采样来获取新样本。我们的获取函数不仅计算成本远低于其他批量获取函数,而且性能更优。 摘要:In active learning, new labels are commonly acquired in batches. However, common acquisition functions are only meant for one-sample acquisition rounds at a time, and when their scores are used naively for batch acquisition, they result in batches lacking diversity, which deteriorates performance. On the other hand, state-of-the-art batch acquisition functions are costly to compute. In this paper, we present a novel class of stochastic acquisition functions that extend one-sample acquisition functions to the batch setting by observing how one-sample acquisition scores change as additional samples are acquired and modelling this difference for additional batch samples. We simply acquire new samples by sampling from the pool set using a Gibbs distribution based on the acquisition scores. Our acquisition functions are both vastly cheaper to compute and outperform other batch acquisition functions.
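上文所述的Gibbs分布采样步骤可以直观地示意如下(分数向量与超参数均为假设示例):

```python
import numpy as np

# Batch selection by sampling from a Gibbs (softmax) distribution over
# per-sample acquisition scores, instead of deterministically taking the
# top-k. beta is an inverse temperature: large beta approaches top-k.

def stochastic_batch(scores, batch_size, beta=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = beta * (scores - scores.max())          # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(scores), size=batch_size, replace=False, p=probs)

scores = np.random.rand(1000)                        # hypothetical acquisition scores
batch = stochastic_batch(scores, batch_size=10, beta=5.0)
```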
【43】 Fluctuations of water quality time series in rivers follow superstatistics 标题:河流水质时间序列的波动遵循超统计
作者:Benjamin Schäfer,Catherine M. Heppell,Hefin Rhys,Christian Beck 机构:School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom, Norwegian University of Life Sciences, Ås, Norway, Queen Mary University of London, School of Geography, Mile End Road, London E1 4NS, UK 链接:https://arxiv.org/abs/2106.12047 摘要:超统计是非平衡统计物理中的一种通用方法,已被广泛应用于各种复杂系统,从流体动力湍流到交通延误和空气污染动力学。在这里,我们考察了河流中测量的水质时间序列(如溶解氧浓度和电导率),并提供了它们表现出超统计行为的证据。我们的主要例子是在英格兰东南部切斯河(River Chess)中记录的时间序列。具体来说,我们使用季节性去趋势和经验模态分解(EMD)将测量数据的趋势与波动分离。无论采用哪种去趋势方法,我们都观察到重尾波动分布,对于溶解氧,这可以用对数正态超统计很好地描述。相反,我们发现电导率数据呈双峰的非标准超统计,我们使用两个组合的$\chi^2$-分布对其建模。 摘要:Superstatistics is a general method from nonequilibrium statistical physics which has been applied to a variety of complex systems, ranging from hydrodynamic turbulence to traffic delays and air pollution dynamics. Here, we investigate water quality time series (such as dissolved oxygen concentrations and electrical conductivity) as measured in rivers, and provide evidence that they exhibit superstatistical behaviour. Our main example is the time series recorded in the River Chess in South East England. Specifically, we use seasonal detrending and empirical mode decomposition (EMD) to separate trends from fluctuations for the measured data. With either detrending method, we observe heavy-tailed fluctuation distributions, which are well described by a log-normal superstatistics for dissolved oxygen. Contrarily, we find a double-peaked non-standard superstatistics for the electrical conductivity data, which we model using two combined $\chi^2$-distributions.
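上述"去趋势后考察波动分布"的流程可示意如下(仅示意季节去趋势一步;论文还使用了EMD。文件名与列名均为假设):

```python
import pandas as pd

# Minimal superstatistics-style workflow: remove a seasonal trend, then study
# the distribution of the residual fluctuations. (The paper additionally uses
# empirical mode decomposition; 'oxygen.csv' and its columns are hypothetical.)

df = pd.read_csv("oxygen.csv", parse_dates=["time"]).set_index("time")
daily_cycle = df["dissolved_oxygen"].groupby(df.index.hour).transform("mean")
fluct = df["dissolved_oxygen"] - daily_cycle      # seasonally detrended series

# Heavy tails show up as excess kurtosis relative to a Gaussian (which has 3):
z = (fluct - fluct.mean()) / fluct.std()
print(f"kurtosis of fluctuations: {(z ** 4).mean():.2f} (Gaussian = 3)")
```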
【44】 KrigR -- A tool for downloading and statistically downscaling climate reanalysis data 标题:KrigR--气候再分析数据下载与统计降尺度工具
作者:Erik Kusch,Richard Davy 机构:Center for Biodiversity Dynamics in a Changing World (BIOCHANGE), Section for Ecoinformatics & Biodiversity, Department of Biology, Aarhus University, Nansen Environmental and Remote Sensing Center, Thormøhlensgate, Bergen, Norway 备注:submitted to Scientific Data 链接:https://arxiv.org/abs/2106.12046 摘要:我们介绍KrigR,一个利用kriging获取并统计降尺度最新气候数据的R软件包。KrigR允许R用户(1)下载用户指定区域和时间长度的ERA5和ERA5-Land气候再分析数据,(2)将这些气候产品聚合到所需的时间分辨率和度量,(3)获取地形协变量,以及(4)通过kriging利用协变量数据将空间数据统计降尺度到用户指定的分辨率。KrigR可以在一次函数调用中执行所有这些任务,从而使用户能够用一条R命令以高时空分辨率获得83个(ERA5)/50个(ERA5-Land)气候变量中的任何一个。因此,KrigR提供了一个工具箱,以前所未有的高时间和空间分辨率组合获得大量定制的气候数据。此外,我们还演示了KrigR的模块化设计如何允许在kriging步骤中使用任意给定的气候数据集和第三方/用户提供的协变量,并通过提供降尺度不确定性而优于其他高分辨率数据集,这种不确定性可以解释现有几种高分辨率气候产品之间的差异。 摘要:We present KrigR, an R package for acquiring and statistically downscaling state-of-the-art climate data using kriging. KrigR allows R-users to (1) download ERA5 and ERA5-Land climate reanalysis data for a user-specified region and time-length, (2) aggregate these climate products to desired temporal resolutions and metrics, (3) acquire topographical co-variates, and (4) statistically downscale spatial data to a user-specified resolution using co-variate data via kriging. KrigR can execute all of these tasks in a single function call, thus enabling the user to obtain any of 83 (ERA5) / 50 (ERA5-Land) climate variables at high spatial and temporal resolution with a single R-command. Thereby, KrigR provides a toolbox to obtain a wide range of tailored climate data at unprecedented combinations of high temporal and spatial resolutions. Additionally, we demonstrate how KrigR is compartmentalised to enable use of any given climate dataset and third-party/user-provided co-variates for the kriging step and brings an advantage over other high-resolution datasets by providing downscaling uncertainties, which can explain the difference between several existing high-resolution climate products.
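KrigR本身是R软件包;作为对其第(4)步"带协变量的kriging降尺度"思想的语言无关示意,下面用Python中的高斯过程(以海拔为协变量)给出一个回归kriging风格的草图(数据与网格均为假设,并非KrigR的API):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Illustration of statistical downscaling with a topographic co-variate
# (not the KrigR API): fit a GP to coarse-grid temperature as a function of
# (lon, lat, elevation), then predict on a finer grid with fine elevation.

rng = np.random.default_rng(0)
coarse_xy = rng.uniform(0, 1, size=(100, 2))            # coarse cell centres
coarse_elev = rng.uniform(0, 2000, size=(100, 1))       # metres (hypothetical)
temp = 15 - 0.0065 * coarse_elev[:, 0] + rng.normal(0, 0.3, 100)  # lapse-rate toy data

X = np.hstack([coarse_xy, coarse_elev / 1000.0])        # rescale elevation to km
gp = GaussianProcessRegressor(RBF(length_scale=[0.2, 0.2, 0.5]) + WhiteKernel(),
                              normalize_y=True).fit(X, temp)

fine_xy = rng.uniform(0, 1, size=(1000, 2))             # fine-grid centres
fine_elev = rng.uniform(0, 2000, size=(1000, 1)) / 1000.0
pred, std = gp.predict(np.hstack([fine_xy, fine_elev]), return_std=True)
# 'std' plays the role of the downscaling uncertainty that KrigR reports.
```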
【45】 Test-time Collective Prediction 标题:测试时的集体预测
作者:Celestine Mendler-Dünner,Wenshuo Guo,Stephen Bates,Michael I. Jordan 机构:University of California, Berkeley 链接:https://arxiv.org/abs/2106.12012 摘要:机器学习中一个越来越常见的设定涉及多方,每一方都有自己的数据,他们希望共同对未来的测试点进行预测。各代理希望从全体代理的集体专业知识中获益,以做出比单个代理更好的预测,但可能不愿意公开其数据或模型参数。在这项工作中,我们探索了一种在测试时进行集体预测的去中心化机制,利用每个代理预先训练的模型,而不依赖外部验证、模型再训练或数据池化。我们的方法从社会科学中关于人类共识形成的文献中获得启发。我们从理论上分析了该机制,表明它在大样本极限下收敛于逆均方误差(MSE)加权。为了计算集体预测的误差线,我们提出了一个去中心化的Jackknife过程,用于评估我们的机制对单个代理预测的敏感性。在经验上,我们证明了我们的方案能有效地组合在输入空间中质量各异的模型。所提出的共识预测比经典的模型平均方法获得了显著增益,甚至优于可以获得额外验证数据的加权平均方案。 摘要:An increasingly common setting in machine learning involves multiple parties, each with their own data, who want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents to make better predictions than they would individually, but may not be willing to release their data or model parameters. In this work, we explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model without relying on external validation, model retraining, or data pooling. Our approach takes inspiration from the literature in social science on human consensus-making. We analyze our mechanism theoretically, showing that it converges to inverse mean-squared-error (MSE) weighting in the large-sample limit. To compute error bars on the collective predictions we propose a decentralized Jackknife procedure that evaluates the sensitivity of our mechanism to a single agent's prediction. Empirically, we demonstrate that our scheme effectively combines models with differing quality across the input space. The proposed consensus prediction achieves significant gains over classical model averaging, and even outperforms weighted averaging schemes that have access to additional validation data.
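上文提到的大样本极限下的逆均方误差加权可直接写成代码(示意;在实际机制中权重由去中心化共识过程涌现,而非如此集中计算):

```python
import numpy as np

# Inverse-MSE weighting of agent predictions (the large-sample limit of the
# consensus mechanism; computed directly here for illustration).

def inverse_mse_combine(predictions, mse):
    """predictions: (n_agents, n_points); mse: per-agent error estimates."""
    w = 1.0 / np.asarray(mse)
    w = w / w.sum()                      # normalize weights across agents
    return w @ np.asarray(predictions)   # weighted collective prediction
```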
【46】 Learned Interpretable Residual Extragradient ISTA for Sparse Coding 标题:稀疏编码的学习可解释残差外梯度ISTA
作者:Lin Kong,Wei Sun,Fanhua Shang,Yuanyuan Liu,Hongying Liu 机构:School of Artificial Intelligence, Xidian University 备注:Accepted for presentation at the ICML Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI 链接:https://arxiv.org/abs/2106.11970 摘要:近年来,学习型迭代收缩阈值算法(LISTA)的研究受到越来越多的关注。大量实验和一些理论已经证明了LISTA在求解稀疏编码问题上的高效性。然而,现有的LISTA方法都是串行连接的。为了解决这个问题,我们提出了一种新的基于外梯度的LISTA(ELISTA),它具有残差结构和理论保证。特别地,我们的算法还能在一定程度上为ResNet提供可解释性。从理论角度,我们证明了该方法达到线性收敛。在实践中,大量实证结果验证了我们方法的优势。 摘要:Recently, the study of the learned iterative shrinkage thresholding algorithm (LISTA) has attracted increasing attention. A large number of experiments as well as some theories have proved the high efficiency of LISTA for solving sparse coding problems. However, existing LISTA methods are all serial connections. To address this issue, we propose a novel extragradient-based LISTA (ELISTA), which has a residual structure and theoretical guarantees. In particular, our algorithm can also provide interpretability for ResNets to a certain extent. From a theoretical perspective, we prove that our method attains linear convergence. In practice, extensive empirical results verify the advantages of our method.
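为不熟悉迭代收缩阈值算法家族的读者,下面给出教科书式的ISTA迭代示意(这是LISTA/ELISTA所学习和扩展的基础迭代,并非ELISTA本身,后者另加入了残差结构与外梯度步骤):

```python
import numpy as np

# Textbook ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1 (background for
# LISTA/ELISTA, which learn and extend this iteration).

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=200):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (A.T @ (A @ x - y)) / L, lam / L)
    return x
```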