Statistics arXiv Daily [8.23]

2021-08-24 16:39:07


stat (Statistics), 25 papers in total

【1】 A New Asymmetric Copula with Reversible Correlations and Its Application to the EU Sovereign Debt Crisis  Link: https://arxiv.org/abs/2108.09278

Authors: Masahito Kobayashi, Jinghui Chen
Affiliations: Meiji Gakuin University and Yokohama National University
Abstract: This paper proposes a novel asymmetric copula based upon the bivariate split normal distribution. This copula can change the correlation signs of its upper and lower distribution tails independently. As an application, it is shown by rolling maximum likelihood estimation that the EU periphery countries changed the sign of the lower-tail correlation coefficient from negative to positive after the sovereign debt crisis started. In contrast, Germany had a negative stock-bond correlation before and after the crisis.
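The split normal underlying the proposed copula is not in standard libraries; as a minimal sketch (parameter names are mine, not the paper's), a univariate split normal can be sampled by gluing two half-normals of different scales at the mode:

```python
import numpy as np

def rsplit_normal(n, mode=0.0, sigma_lo=1.0, sigma_hi=2.0, rng=None):
    """Sample a split normal: scale sigma_lo below the mode, sigma_hi above.
    The halves are mixed with weights proportional to their scales so the
    density is continuous at the mode."""
    rng = np.random.default_rng(rng)
    z = np.abs(rng.standard_normal(n))                 # half-normal magnitudes
    upper = rng.random(n) < sigma_hi / (sigma_lo + sigma_hi)
    return mode + np.where(upper, sigma_hi * z, -sigma_lo * z)

x = rsplit_normal(50_000, mode=0.0, sigma_lo=1.0, sigma_hi=2.0, rng=0)
# With sigma_hi > sigma_lo the law is right-skewed:
# E[X] = mode + sqrt(2/pi) * (sigma_hi - sigma_lo) ≈ 0.798
```

The paper's bivariate construction additionally lets the two tails carry correlations of opposite sign; this sketch only illustrates the marginal asymmetry.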

【2】 Optimal Order Simple Regret for Gaussian Process Bandits  Link: https://arxiv.org/abs/2108.09262

Authors: Sattar Vakili, Nacime Bouziani, Sepehr Jalali, Alberto Bernacchia, Da-shan Shiu
Affiliations: MediaTek Research; Imperial College London
Abstract: Consider the sequential optimization of a continuous, possibly non-convex, and expensive-to-evaluate objective function $f$. The problem can be cast as a Gaussian Process (GP) bandit where $f$ lives in a reproducing kernel Hilbert space (RKHS). The state-of-the-art analysis of several learning algorithms shows a significant gap between the lower and upper bounds on the simple regret performance. When $N$ is the number of exploration trials and $\gamma_N$ is the maximal information gain, we prove an $\tilde{\mathcal{O}}(\sqrt{\gamma_N/N})$ bound on the simple regret performance of a pure exploration algorithm that is significantly tighter than the existing bounds. We show that this bound is order optimal up to logarithmic factors for the cases where a lower bound on regret is known. To establish these results, we prove novel and sharp confidence intervals for GP models applicable to RKHS elements, which may be of broader interest.
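The pure-exploration template behind the simple-regret bound can be sketched in plain numpy (this is a generic illustration, not the paper's algorithm: explore with $N$ noisy evaluations, fit a GP surrogate, then recommend the posterior-mean maximizer; kernel, lengthscale and the toy objective are my choices):

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.5 * np.cos(7 * x)     # toy objective
grid = np.linspace(0, 1, 200)

# Exploration phase: N noisy evaluations at (here: uniformly random) inputs.
N, noise = 60, 0.1
X = rng.random(N)
y = f(X) + noise * rng.standard_normal(N)

# GP posterior mean on the grid; recommend its maximizer.
K = rbf(X, X) + noise**2 * np.eye(N)
mu = rbf(grid, X) @ np.linalg.solve(K, y)
x_hat = grid[np.argmax(mu)]

simple_regret = f(grid).max() - f(x_hat)   # non-negative by construction
```

Simple regret here is the gap between the best grid value of $f$ and the value at the recommended point; the paper's result says this quantity decays at rate $\tilde{\mathcal{O}}(\sqrt{\gamma_N/N})$.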

【3】 Signal Detection in Degree Corrected ERGMs  Link: https://arxiv.org/abs/2108.09255

Authors: Yuanzhe Xu, Sumit Mukherjee
Note: 37 pages, 1 figure
Abstract: In this paper, we study sparse signal detection problems in Degree Corrected Exponential Random Graph Models (ERGMs). We study the performance of two tests, based on the conditionally centered sum of degrees and the conditionally centered maximum of degrees, for a wide class of such ERGMs. The performance of these tests matches the performance of the corresponding uncentered tests in the $\beta$ model. Focusing on the degree corrected two-star ERGM, we show that improved detection is possible at criticality using a test based on the (unconditional) sum of degrees. In this setting we provide matching lower bounds in all parameter regimes, based on correlation estimates between degrees under the alternative, which may be of independent interest.

【4】 A comparison of different clustering approaches for high-dimensional presence-absence data  Link: https://arxiv.org/abs/2108.09243

Authors: Gabriele d'Angella, Christian Hennig
Note: 17 pages, 6 figures
Abstract: Presence-absence data is defined by vectors or matrices of zeroes and ones, where the ones usually indicate a "presence" in a certain place. Presence-absence data occur, for example, when investigating geographical species distributions, genetic information, or the occurrence of certain terms in texts. There are many applications for clustering such data; one example is to find so-called biotic elements, i.e., groups of species that tend to occur together geographically. Presence-absence data can be clustered in various ways, namely using a latent class mixture approach with local independence, distance-based hierarchical clustering with the Jaccard distance, or clustering methods for continuous data applied to a multidimensional scaling representation of the distances. These methods are conceptually very different and can therefore not easily be compared theoretically. We compare their performance in a comprehensive simulation study based on models for species distributions.
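The Jaccard-based hierarchical clustering route mentioned above is directly available in scipy; a minimal sketch on planted presence-absence data (the two-block layout is a toy example of mine, loosely mimicking two "biotic elements"):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Rows are species, columns are sites; two planted groups with disjoint ranges.
X = np.zeros((40, 60), dtype=bool)
X[:20, :30] = rng.random((20, 30)) < 0.7    # group 1 occupies the first 30 sites
X[20:, 30:] = rng.random((20, 30)) < 0.7    # group 2 occupies the last 30 sites

# Jaccard distance between presence-absence rows, then average linkage.
D = pdist(X, metric="jaccard")
labels = fcluster(linkage(D, method="average"), t=2, criterion="maxclust")
```

With disjoint ranges the between-group Jaccard distance is 1 while within-group distances are well below 1, so the two planted groups are recovered exactly.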

【5】 Parameters not identifiable or distinguishable from data, including correlation between Gaussian observations  Link: https://arxiv.org/abs/2108.09227

Authors: Christian Hennig
Affiliations: Università di Bologna, Via delle Belle Arti, Bologna
Note: 25 pages, no figures
Abstract: It is shown that some theoretically identifiable parameters cannot be identified from data, meaning that no consistent estimator of them can exist. An important example is a constant correlation between Gaussian observations (in the presence of such correlation, not even the mean can be identified from data). Identifiability and three versions of distinguishability from data are defined. Two different constant correlations between Gaussian observations cannot even be distinguished from data. A further example is the cluster membership parameters in $k$-means clustering. Several existing results in the literature are connected to the new framework.
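The "mean cannot be identified" phenomenon is easy to see numerically: under constant correlation $\rho$, each data set shares one common factor that never averages out, so the sample mean's standard deviation stays near $\sqrt{\rho}$ however large $n$ gets. A small simulation (my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, reps = 0.3, 1000, 4000

# Equicorrelated Gaussians with mean 0: X_i = sqrt(rho)*Z + sqrt(1-rho)*eps_i.
# The shared factor Z is drawn once per data set and never averages out,
# so Var(sample mean) -> rho instead of 0 as n grows.
Z = rng.standard_normal(reps)              # one shared factor per data set
eps = rng.standard_normal((reps, n))
xbar = np.sqrt(rho) * Z + np.sqrt(1 - rho) * eps.mean(axis=1)

print(xbar.std())   # ≈ sqrt(rho) ≈ 0.548, regardless of how large n is
```

Increasing `n` leaves `xbar.std()` essentially unchanged, which is exactly the non-identifiability the abstract describes: no estimator of the mean can be consistent here.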

【6】 Detecting changes in the trend function of heteroscedastic time series  Link: https://arxiv.org/abs/2108.09206

Authors: Sara Kristin Schmidt
Affiliations: Department of Mathematics, Ruhr-Universität Bochum, Bochum, Germany
Abstract: We propose a new asymptotic test to assess the stationarity of a time series' mean that is applicable in the presence of both heteroscedasticity and short-range dependence. Our test statistic is composed of Gini's mean difference of local sample means. To analyse its asymptotic behaviour, we develop new limit theory for U-statistics of strongly mixing triangular arrays under non-stationarity. Most importantly, we show asymptotic normality of the test statistic under the hypothesis of a constant mean and prove the test's consistency against a very general class of alternatives, including both smooth and abrupt changes in the mean. We propose estimators for all parameters involved, including an adapted subsampling estimator for the long-run variance, and show their consistency. Our procedure is evaluated practically in an extensive simulation study and in two data examples.
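The raw building block, Gini's mean difference of local sample means, can be sketched directly (illustrative only: the paper's actual test additionally studentizes by an estimated long-run variance to obtain an asymptotically pivotal statistic; block count and data are my choices):

```python
import numpy as np

def gini_mean_difference_stat(x, n_blocks=10):
    """Average absolute difference between block means: large values
    suggest a non-constant mean function."""
    blocks = np.array_split(np.asarray(x, dtype=float), n_blocks)
    m = np.array([b.mean() for b in blocks])
    k = len(m)
    diffs = np.abs(m[:, None] - m[None, :])
    return diffs.sum() / (k * (k - 1))    # mean over ordered pairs i != j

rng = np.random.default_rng(3)
flat = rng.standard_normal(1000)                              # constant mean
trend = rng.standard_normal(1000) + np.linspace(0, 3, 1000)   # smooth change

s_flat = gini_mean_difference_stat(flat)
s_trend = gini_mean_difference_stat(trend)
```

The statistic is markedly larger under the smooth trend than under a constant mean, which is the behaviour the consistency result formalizes.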

【7】 Advanced models for predicting event occurrence in event-driven clinical trials accounting for patient dropout, cure and ongoing recruitment  Link: https://arxiv.org/abs/2108.09196

Authors: Vladimir Anisimov, Stephen Gormley, Rosalind Baverstock, Cynthia Kineza
Affiliations: Data Science, Center for Design & Analysis, Amgen, London, UK; Data Science, Center for Design & Analysis, Amgen, Thousand Oaks, CA, US
Note: 17 pages, 8 figures
Abstract: We consider event-driven clinical trials, where the analysis is performed once a pre-determined number of clinical events has been reached. For example, these events could be progression in oncology or a stroke in cardiovascular trials. At the interim stage, one of the main tasks is predicting the number of events over time and the time to reach specific milestones, where we need to account for events that may occur not only in patients already recruited and followed up but also in patients yet to be recruited. Therefore, in such trials we need to model patient recruitment and event counts together. In the paper we develop a new analytic approach which accounts for the opportunity of patients to be cured, as well as for them to drop out and be lost to follow-up. Recruitment is modelled using a Poisson-gamma model developed in previous publications. When considering the occurrence of events, we assume that the time to the main event and the time to dropout are independent random variables, and we have developed a few advanced models with cure using exponential, Weibull and log-normal distributions. This technique is supported by well-developed, tested and documented software. The results are illustrated using simulation and a real dataset with reference to the developed software.
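The simplest member of this model family, exponential event time with an independent exponential dropout and a cure fraction, admits a closed-form event probability that a Monte Carlo sketch can check (all parameter values here are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n, pi_cure, lam_event, mu_drop, t = 1000, 0.2, 1.0, 0.5, 1.0
sims = 200_000

cured = rng.random(sims) < pi_cure
T = rng.exponential(1 / lam_event, sims)    # time to event (if not cured)
D = rng.exponential(1 / mu_drop, sims)      # time to dropout
event_by_t = (~cured) & (T < t) & (T < D)   # event observed before dropout

p_mc = event_by_t.mean()
# Closed form: (1 - pi_cure) * lam/(lam+mu) * (1 - exp(-(lam+mu)*t))
p_exact = (1 - pi_cure) * lam_event / (lam_event + mu_drop) \
    * (1 - np.exp(-(lam_event + mu_drop) * t))

expected_events = n * p_mc   # predicted event count among n followed patients
```

The paper extends this kernel to Weibull and log-normal event times and couples it with a Poisson-gamma recruitment model for patients not yet enrolled.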

【8】 latentcor: An R Package for estimating latent correlations from mixed data types  Link: https://arxiv.org/abs/2108.09180

Authors: Mingze Huang, Christian L. Müller, Irina Gaynanova
Affiliations: Department of Statistics, Texas A&M University, College Station, TX; Department of Economics, Texas A&M University, College Station, TX; Ludwig-Maximilians-Universität München, Germany; Helmholtz Zentrum München, Germany; Flatiron Institute, New York
Abstract: We present `latentcor`, an R package for correlation estimation from data with mixed variable types. Mixed variable types, including continuous, binary, ordinal, zero-inflated, or truncated data, are routinely collected in many areas of science. Accurate estimation of correlations among such variables is often the first critical step in statistical analysis workflows. Pearson correlation as the default choice is not well suited for mixed data types, as the underlying normality assumption is violated. The concept of semi-parametric latent Gaussian copula models, on the other hand, provides a unifying way to estimate correlations between mixed data types. The R package `latentcor` comprises a comprehensive list of these models, enabling the estimation of correlations between any of continuous/binary/ternary/zero-inflated (truncated) variable types. The underlying implementation takes advantage of a fast multi-linear interpolation scheme with an efficient choice of interpolation grid points, thus giving the package a small memory footprint without compromising estimation accuracy. This makes latent correlation estimation readily available for modern high-throughput data analysis.
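The package itself is R, but the core idea of the latent Gaussian copula approach can be sketched in Python for the simplest (continuous-continuous) case, where the bridge between Kendall's tau and the latent correlation is $r = \sin(\frac{\pi}{2}\tau)$; `latentcor` generalizes this bridge to binary/ternary/truncated types:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(5)
r_true = 0.5
cov = [[1.0, r_true], [r_true, 1.0]]
x, y = rng.multivariate_normal([0, 0], cov, size=4000).T
y = np.exp(y)   # monotone marginal transform: tau (hence r_hat) is unaffected

# Rank-based estimate of the latent correlation via the bridge function.
tau, _ = kendalltau(x, y)
r_hat = np.sin(np.pi / 2 * tau)
```

Because the estimator only uses ranks, the (unknown) monotone marginal transformations are irrelevant, which is exactly why this approach handles mixed data where Pearson correlation fails.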

【9】 Irish Property Price Estimation Using A Flexible Geo-spatial Smoothing Approach: What is the Impact of an Address?  Link: https://arxiv.org/abs/2108.09175

Authors: Aoife K. Hurley, James Sweeney
Affiliations: University of Limerick
Abstract: Accurate and efficient valuation of property is of utmost importance in a variety of settings, such as when securing mortgage finance to purchase a property, or where residential property taxes are set as a percentage of a property's resale value. Internationally, resale-based property taxes are most common due to ease of implementation and the difficulty of establishing site values. In an Irish context, property valuations are currently based on comparison to recently sold neighbouring properties; however, this approach is limited by low property turnover. National property taxes based on property value, as opposed to site value, also act as a disincentive to improvement works due to the ensuing increased tax burden. In this article we develop a spatial hedonic regression model to separate the spatial and non-spatial contributions of property features to resale value. We mitigate the issue of low property turnover through geographic correlation, borrowing information across multiple property types and finishes. We investigate the impact of address mislabelling on predictive performance, where vendors erroneously supply a more affluent postcode, and evaluate the contribution of improvement works to increased values. Our flexible geo-spatial model outperforms all competitors across a number of different evaluation metrics, including the accuracy of both price prediction and associated uncertainty intervals. While our models are applied in an Irish context, the ability to accurately value properties in markets with low property turnover and to quantify the value contributions of specific property features has widespread application. The ability to separate spatial and non-spatial contributions to a property's value also provides an avenue to site-value based property taxes.

【10】 State-Of-The-Art Algorithms For Low-Rank Dynamic Mode Decomposition  Link: https://arxiv.org/abs/2108.09160

Authors: Patrick Heas, Cedric Herzet
Affiliations: Campus universitaire de Beaulieu
Note: arXiv admin note: substantial text overlap with arXiv:1610.02962
Abstract: This technical note reviews state-of-the-art algorithms for linear approximation of high-dimensional dynamical systems using low-rank dynamic mode decomposition (DMD). While repeating several parts of our article "Low-rank dynamic mode decomposition: an exact and tractable solution", this work provides additional details useful for building a comprehensive picture of state-of-the-art methods.
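As background for the note, the standard (projected) DMD baseline is a few lines of numpy: project the best-fit linear operator onto the rank-$r$ POD subspace of the snapshots and read off its eigenvalues. A sketch on a toy linear system whose spectrum is known exactly:

```python
import numpy as np

def dmd_eigs(X, Y, r):
    """Eigenvalues of the best-fit linear operator A with Y ≈ A X,
    computed in the rank-r POD subspace of the snapshot matrix X."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]
    A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
    return np.linalg.eigvals(A_tilde)

# Toy linear system: a decaying rotation with eigenvalues 0.9 * exp(±0.5i).
theta, decay = 0.5, 0.9
A = decay * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
snaps = [np.array([1.0, 0.5])]
for _ in range(50):
    snaps.append(A @ snaps[-1])
S = np.array(snaps).T                      # 2 x 51 snapshot matrix
lam = dmd_eigs(S[:, :-1], S[:, 1:], r=2)   # recovers the eigenvalues of A
```

On exactly linear, full-rank data this recovers the true spectrum; the algorithms reviewed in the note concern the harder question of the *optimal* low-rank approximation when $r$ is smaller than the data rank.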

【11】 A multi-level model for estimating region-age-time-type specific male circumcision coverage from household survey and health system data in South Africa  Link: https://arxiv.org/abs/2108.09142

Authors: Matthew L. Thomas, Khangelani Zuma, Dayanund Loykissoonlal, Bridget Dube, Peter Vranken, Sarah E. Porter, Katharine Kripke, Thapelo Seatlhodi, Gesine Meyer-Rath, Leigh F. Johnson, Jeffrey W. Eaton
Affiliations: Joint Centre for Excellence in Environmental Intelligence, University of Exeter & Met Office, Exeter, United Kingdom; MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, United Kingdom
Abstract: Voluntary medical male circumcision (VMMC) reduces the risk of male HIV acquisition by 60%. Programmes to provide male circumcision (MC) to prevent HIV infection have been introduced in sub-Saharan African countries with high HIV burden. While large-scale provision of MMC is recent, traditional MC has long been conducted as part of male coming-of-age practices. How and at what age traditional MC occurs varies by ethnic group within countries. Accurate estimates of MC coverage by age and type of circumcision (traditional or medical) over time at sub-national levels are essential for planning and delivering VMMCs to meet targets and for evaluating their impacts on HIV incidence. In this paper, we developed a Bayesian competing-risks time-to-event model to produce region-age-time-type specific probabilities and coverage of MC with probabilistic uncertainty. The model jointly synthesises data from household surveys and health system data on the number of VMMCs conducted. We demonstrated the model using data from five household surveys and VMMC programme data to produce estimates of MC coverage for 52 districts in South Africa between 2008 and 2019. Nationally in 2008, 24.1% (CI: 23.4-24.8%) of men aged 15-49 were traditionally circumcised and 19.4% (CI: 18.9-20.0%) were medically circumcised. Between 2008 and 2019, five million VMMCs were conducted, and MC coverage among men aged 15-49 increased to 64.0% (CI: 63.2-64.9%), with medical MC coverage at 42% (CI: 41.3-43.0%). MC coverage varied widely across districts, ranging from 13.4% to 86.3%. The average age of traditional MC ranged between 13 and 19 years, depending on local cultural practices.

【12】 A Theoretical Analysis of the Stationarity of an Unrestricted Autoregression Process  Link: https://arxiv.org/abs/2108.09083

Authors: Varsha S. Kulkarni
Affiliations: Harvard University
Abstract: Higher-dimensional autoregressive models describe some econometric processes relatively generically if they incorporate heterogeneity in dependence on times. This paper analyzes the stationarity of an autoregressive process of dimension $k>1$ having a sequence of coefficients $\beta$ multiplied by successively increasing powers of $\delta$, where $0<\delta<1$. The theorem gives the conditions for stationarity as simple relations between the coefficients and $k$ in terms of $\delta$. Computationally, the evidence of stationarity depends on the parameters. The choice of $\delta$ sets the bounds on $\beta$ and the number of time lags for prediction by the model.
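For a concrete coefficient structure of this kind, stationarity can always be checked numerically via the companion-matrix criterion. A sketch assuming (my reading, not necessarily the paper's exact parametrization) lag coefficients $\phi_j = \beta\,\delta^j$ for $j = 1,\dots,k$:

```python
import numpy as np

def is_stationary(beta, delta, k):
    """AR(k) with geometric coefficients phi_j = beta * delta**j (j=1..k).
    Stationary iff all eigenvalues of the companion matrix lie strictly
    inside the unit circle."""
    phi = np.array([beta * delta**j for j in range(1, k + 1)])
    C = np.zeros((k, k))
    C[0] = phi                      # first row: the AR coefficients
    C[1:, :-1] = np.eye(k - 1)      # sub-diagonal shift structure
    return bool(np.max(np.abs(np.linalg.eigvals(C))) < 1.0)

print(is_stationary(beta=0.5, delta=0.5, k=6))   # True: sum of |phi_j| < 1
print(is_stationary(beta=3.0, delta=0.9, k=6))   # False: coefficients too large
```

The geometric decay makes the sufficient condition $\sum_j |\beta|\,\delta^j < 1$ easy to read off, which is the kind of simple relation between $\beta$, $\delta$ and $k$ the theorem formalizes.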

【13】 Quadratic Discriminant Analysis by Projection  Link: https://arxiv.org/abs/2108.09005

Authors: Ruiyang Wu, Ning Hao
Affiliations: Department of Mathematics, University of Arizona
Abstract: Discriminant analysis, including linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), is a popular approach to classification problems. It is well known that LDA is suboptimal for analyzing heteroscedastic data, for which QDA would be an ideal tool. However, QDA is less helpful when the number of features in a data set is moderate or high, and LDA and its variants often perform better due to their robustness against dimensionality. In this work, we introduce a new dimension reduction and classification method based on QDA. In particular, we define and estimate the optimal one-dimensional (1D) subspace for QDA, which is a novel hybrid approach to discriminant analysis. The new method can handle data heteroscedasticity with the number of parameters equal to that of LDA. Therefore, it is more stable than standard QDA and works well for data in moderate dimensions. We show an estimation consistency property of our method and compare it with LDA, QDA, regularized discriminant analysis (RDA) and a few other competitors in simulated and real data examples.
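The LDA-versus-QDA gap on heteroscedastic data is easy to reproduce with scikit-learn (this demo is mine, not the paper's method: classes with identical means but very different covariances leave LDA near chance while QDA separates them):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(6)
n = 2000

# Identical class means, strongly different covariances.
X0 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], n)
X1 = rng.multivariate_normal([0, 0], [[9, 0], [0, 0.1]], n)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

idx = rng.permutation(2 * n)
train, test = idx[:3000], idx[3000:]

lda_acc = LinearDiscriminantAnalysis().fit(X[train], y[train]).score(X[test], y[test])
qda_acc = QuadraticDiscriminantAnalysis().fit(X[train], y[train]).score(X[test], y[test])
```

The paper's contribution sits between these two extremes: it searches for a single QDA-optimal direction, keeping QDA's ability to exploit covariance differences while spending only LDA's parameter budget.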

【14】 Distributionally Robust Learning  Link: https://arxiv.org/abs/2108.08993

Authors: Ruidi Chen, Ioannis Ch. Paschalidis
Affiliations: Boston University
Abstract: This monograph develops a comprehensive statistical learning framework that is robust to (distributional) perturbations in the data, using Distributionally Robust Optimization (DRO) under the Wasserstein metric. Beginning with fundamental properties of the Wasserstein metric and the DRO formulation, we explore duality to arrive at tractable formulations and develop finite-sample as well as asymptotic performance guarantees. We consider a series of learning problems, including (i) distributionally robust linear regression; (ii) distributionally robust regression with group structure in the predictors; (iii) distributionally robust multi-output regression and multiclass classification; (iv) optimal decision making that combines distributionally robust regression with nearest-neighbor estimation; (v) distributionally robust semi-supervised learning; and (vi) distributionally robust reinforcement learning. A tractable DRO relaxation for each problem is derived, establishing a connection between robustness and regularization, and obtaining bounds on the prediction and estimation errors of the solution. Beyond theory, we include numerical experiments and case studies using synthetic and real data. The real data experiments are all associated with various health informatics problems, an application area which provided the initial impetus for this work.

【15】 A Novel Approach to Handling the Non-Central Dirichlet Distribution  Link: https://arxiv.org/abs/2108.08947

Authors: Carlo Orsi
Affiliations: Ph.D. in Statistics, University of Milano-Bicocca (Milan, Italy); Fermo (Italy)
Note: Submitted to an unspecified peer-reviewed journal. arXiv admin note: text overlap with arXiv:2107.14392
Abstract: In the present paper, new insights into the study of the Non-central Dirichlet distribution are provided. The latter is the analogue of the Dirichlet distribution obtained by replacing the Chi-Squared random variables involved in its definition by as many non-central ones. Specifically, a novel approach to tackling the analysis of this model is introduced, based on a simple conditional density together with a suitable transposition into the non-central framework of a characterizing property of independent Chi-Squared random variables. This approach thus makes it possible to remedy the undeniable mathematical complexity of the joint density function of this distribution, paving the way towards a new attractive stochastic representation as well as a surprisingly simple closed-form expression for its mixed raw moments.
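The definitional construction described above is itself directly usable for simulation: normalize independent non-central chi-squared variables. A minimal sketch (the mapping of shape $\alpha_i$ to $2\alpha_i$ degrees of freedom follows the usual Dirichlet-from-chi-squared construction; no claim is made about the paper's notation):

```python
import numpy as np

def sample_noncentral_dirichlet(alpha, noncent, size, rng=None):
    """Sample the non-central Dirichlet by normalizing independent
    non-central chi-squared variables: component i uses 2*alpha_i degrees
    of freedom and non-centrality noncent_i (all zeros recovers the
    ordinary Dirichlet)."""
    rng = np.random.default_rng(rng)
    w = np.stack(
        [rng.noncentral_chisquare(2 * a, nc, size) if nc > 0
         else rng.chisquare(2 * a, size)
         for a, nc in zip(alpha, noncent)],
        axis=-1,
    )
    return w / w.sum(axis=-1, keepdims=True)

X = sample_noncentral_dirichlet([1.0, 2.0, 3.0], [0.5, 0.0, 2.0],
                                size=10_000, rng=7)
```

Sampling is straightforward; it is the joint *density* and the moments that are analytically awkward, which is what the paper's conditional-density approach addresses.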

【16】 Robust Designs for Prospective Randomized Trials Surveying Sensitive Topics  Link: https://arxiv.org/abs/2108.08944

Authors: Evan T. R. Rosenman, Rina Friedberg, Mike Baiocchi
Affiliations: Harvard University Data Science Initiative; LinkedIn Data Science and Applied Research; Stanford University School of Medicine
Abstract: We consider the problem of designing a prospective randomized trial in which the outcome data will be self-reported and will involve sensitive topics. Our interest is in misreporting behavior, and how respondents' tendency to under- or over-report a binary outcome might affect the power of the experiment. We model the problem by assuming each individual in our study is a member of one "reporting class": a truth-teller, under-reporter, over-reporter, or false-teller. We show that the joint distribution of reporting classes and "response classes" (characterizing individuals' responses to the treatment) exactly defines the bias and variance of the causal estimate in our experiment. We then propose a novel procedure for deriving sample sizes under the worst-case power corresponding to a given level of misreporting. Our problem is motivated by prior experience implementing a randomized controlled trial of a sexual violence prevention program among adolescent girls in Nairobi, Kenya.
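A small simulation (entirely illustrative, with made-up rates) shows the mechanism: if one reporting class, under-reporters of a positive outcome, is concentrated in the treatment arm, the difference-in-means estimate is biased away from the true effect:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000                      # per arm; large so the bias dominates noise

# True outcome rates: treatment lowers the rate of a sensitive binary outcome.
p_treat, p_ctrl = 0.30, 0.50
y_t = rng.random(n) < p_treat
y_c = rng.random(n) < p_ctrl

# Hypothetical reporting classes: 20% of treated individuals under-report a
# positive outcome (e.g. social-desirability pressure); controls report truly.
underreport = rng.random(n) < 0.20
y_t_reported = y_t & ~underreport

true_ate = y_t.mean() - y_c.mean()                 # ≈ -0.20
observed_ate = y_t_reported.mean() - y_c.mean()    # ≈ -0.26: effect overstated
```

The paper's worst-case sample-size procedure sizes the trial so that power survives the least favourable joint distribution of such reporting and response classes.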

【17】 Local Latin Hypercube Refinement for Multi-objective Design Uncertainty Optimization  Link: https://arxiv.org/abs/2108.08890

Authors: Can Bogoclu, Dirk Roos, Tamara Nestorović
Affiliations: Ruhr-Universität Bochum, Institute of Computational Engineering, Mechanics of Adaptive Systems, Bochum; Niederrhein University of Applied Sciences, Institute of Modelling
Note: The code repository can be found at this https URL
Abstract: Optimizing the reliability and the robustness of a design is important but often unaffordable due to high sample requirements. Surrogate models based on statistical and machine learning methods are used to increase the sample efficiency. However, for higher dimensional or multi-modal systems, surrogate models may also require a large number of samples to achieve good results. We propose a sequential sampling strategy for the surrogate-based solution of multi-objective reliability-based robust design optimization problems. The proposed local Latin hypercube refinement (LoLHR) strategy is model-agnostic and can be combined with any surrogate model, because there is no free lunch but possibly a budget one. The proposed method is compared to stationary sampling as well as other strategies proposed in the literature. Gaussian process and support vector regression are both used as surrogate models. Empirical evidence is presented, showing that LoLHR achieves on average better results compared to other surrogate-based strategies on the tested examples.
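The Latin hypercube building block that LoLHR refines locally is available in scipy; a minimal sketch demonstrating the defining stratification property (one point per equal-width bin in every one-dimensional projection):

```python
import numpy as np
from scipy.stats import qmc

# n points in d dimensions such that each 1-D projection hits every one
# of the n equal-width bins exactly once.
sampler = qmc.LatinHypercube(d=2, seed=9)
pts = sampler.random(n=8)

# Stratification check: binning each column into n bins gives a permutation.
bins = np.floor(pts * 8).astype(int)
```

LoLHR's twist, per the abstract, is to place such stratified designs *sequentially and locally* around promising regions rather than once over the whole domain.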

【18】 Structure Learning for Directed Trees  Link: https://arxiv.org/abs/2108.08871

Authors: Martin Emil Jakobsen, Rajen D. Shah, Peter Bühlmann, Jonas Peters
Affiliations: University of Copenhagen, Denmark; University of Cambridge, United Kingdom; ETH Zurich, Switzerland
Note: 84 pages, 17 figures
Abstract: Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large, nonlinear models these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method based on Chu-Liu-Edmonds' algorithm, which we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce a method for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods.
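The combinatorial core, finding the highest-scoring directed spanning tree with Chu-Liu-Edmonds' algorithm, is available off the shelf in networkx. A sketch with random edge scores standing in for the paper's (fitted) local scores:

```python
import numpy as np
import networkx as nx

# Build a complete directed graph whose edge weights play the role of
# hypothetical pairwise scores; Chu-Liu-Edmonds (as implemented in networkx)
# returns the maximum-weight directed spanning tree (arborescence).
rng = np.random.default_rng(10)
nodes = ["A", "B", "C", "D"]
G = nx.DiGraph()
for u in nodes:
    for v in nodes:
        if u != v:
            G.add_edge(u, v, weight=float(rng.random()))

tree = nx.maximum_spanning_arborescence(G)
```

The result spans all nodes with exactly one incoming edge per non-root node; CAT's statistical work lies in constructing edge scores such that this combinatorial optimum converges to the true causal tree.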

【19】 Joint Estimation of Robin Coefficient and Domain Boundary for the Poisson Problem  Link: https://arxiv.org/abs/2108.09152

Authors: Ruanui Nicholson, Matti Niskanen
Affiliations: Department of Engineering Science, University of Auckland
Abstract: We consider the problem of simultaneously inferring the heterogeneous coefficient field for a Robin boundary condition on an inaccessible part of the boundary, along with the shape of the boundary, for the Poisson problem. Such a problem arises in, for example, corrosion detection and thermal parameter estimation. We carry out both linearised uncertainty quantification, based on a local Gaussian approximation, and full exploration of the joint posterior using Markov chain Monte Carlo (MCMC) sampling. By exploiting a known invariance property of the Poisson problem, we are able to circumvent the need to re-mesh as the shape of the boundary changes. The linearised uncertainty analysis presented here relies on a local linearisation of the parameter-to-observable map, with respect to both the Robin coefficient and the boundary shape, evaluated at the maximum a posteriori (MAP) estimates. Computation of the MAP estimate is carried out using the Gauss-Newton method. On the other hand, to explore the full joint posterior we use the Metropolis-adjusted Langevin algorithm (MALA), which requires the gradient of the log-posterior. We thus derive both the Fréchet derivative of the solution to the Poisson problem with respect to the Robin coefficient and the boundary shape, and the gradient of the log-posterior, which is efficiently computed using the so-called adjoint approach. The performance of the approach is demonstrated via several numerical experiments with simulated data.
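The MALA sampler used here is generic once a log-posterior gradient is available (in the paper, via the adjoint approach). A self-contained sketch on a toy standard-normal target standing in for the actual log-posterior:

```python
import numpy as np

def mala(log_p, grad_log_p, x0, eps=0.8, n_steps=20_000, rng=None):
    """Metropolis-adjusted Langevin algorithm: Langevin proposal
    x' = x + (eps^2/2)*grad log p(x) + eps*N(0, I), corrected by a
    Metropolis-Hastings step for the asymmetric proposal."""
    rng = np.random.default_rng(rng)
    x, chain = np.asarray(x0, dtype=float), []

    def log_q(x_to, x_from):   # log proposal density, up to a constant
        mean = x_from + 0.5 * eps**2 * grad_log_p(x_from)
        return -np.sum((x_to - mean) ** 2) / (2 * eps**2)

    for _ in range(n_steps):
        prop = x + 0.5 * eps**2 * grad_log_p(x) \
            + eps * rng.standard_normal(x.shape)
        log_alpha = log_p(prop) + log_q(x, prop) - log_p(x) - log_q(prop, x)
        if np.log(rng.random()) < log_alpha:
            x = prop
        chain.append(x.copy())
    return np.array(chain)

# Toy target: standard normal (stand-in for the paper's log-posterior).
chain = mala(lambda x: -0.5 * np.sum(x**2), lambda x: -x,
             x0=np.zeros(1), rng=11)
```

The gradient term steers proposals toward high-posterior regions, which is why MALA typically mixes much faster than a random-walk Metropolis sampler on smooth targets like this one.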

【20】 Identifying Aggregation Artery Architecture of constrained Origin-Destination flows using Manhattan L-function  Link: https://arxiv.org/abs/2108.09042

作者:Zidong Fang,Hua Shu,Ci Song,Jie Chen,Tianyu Liu,Xiaohan Liu,Tao Pei 机构:a State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, b University of Chinese Academy of Sciences, Beijing, China 备注:29 pages, 12 figures 摘要:城市中人和货物的运动可以用约束流来表示,约束流定义为物体在道路网络中起点与终点之间的运动。流聚集,即起点和终点同时聚集,是最常见的模式之一,例如两个交通枢纽之间聚集的起点到终点流可能表明两个站点之间存在巨大的交通需求。开发针对约束流的聚类方法对于确定城市流聚集至关重要。在现有识别流聚集的方法中,流的L函数是主要方法。然而,该方法依赖于聚集尺度这一由欧几里得L函数检测的关键参数,并不适应道路网络,提取出的聚集可能被高估且分散。因此,我们提出一种基于曼哈顿空间L函数的聚类方法,包含三个主要步骤:第一步用曼哈顿L函数检测聚集尺度;第二步确定在不同尺度下局部L函数值最高的核心流;最后一步取核心流邻域的交集,邻域范围取决于相应尺度。通过设定核心流的数量,我们可以使聚集更集中,从而突出聚集干线结构(Aggregation Artery Architecture,AAA),它刻画了包含关键流簇在道路网络上投影的路段。使用出租车流的实验表明,AAA能够阐明已识别聚集流对应的居民出行类型。我们的方法还有助于为配送站点选址,从而支持对城市交互的精确分析。 摘要:The movement of humans and goods in cities can be represented by constrained flow, which is defined as the movement of objects between origin and destination in road networks. Flow aggregation, namely origins and destinations aggregated simultaneously, is one of the most common patterns; for example, aggregated origin-to-destination flows between two transport hubs may indicate great traffic demand between the two sites. Developing a clustering method for constrained flows is crucial for determining urban flow aggregation. Among existing methods for identifying flow aggregation, the L-function of flows is the major one. Nevertheless, this method depends on the aggregation scale, a key parameter detected by the Euclidean L-function, and does not adapt to road networks; the extracted aggregation may be overestimated and dispersed. Therefore, we propose a clustering method based on the L-function of Manhattan space, which consists of three major steps. The first is to detect aggregation scales by the Manhattan L-function. The second is to determine core flows possessing the highest local L-function values at different scales. The final step is to take the intersection of core flows' neighbourhoods, the extent of which depends on the corresponding scale. By setting the number of core flows, we can concentrate the aggregation and thus highlight the Aggregation Artery Architecture (AAA), which depicts road sections that contain the projection of key flow clusters on the road network. Experiments using taxi flows showed that AAA can clarify the resident movement types of identified aggregated flows. Our method also helps in selecting locations for distribution sites, thereby supporting accurate analysis of urban interactions.
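曼哈顿 L 函数的思想可用下面的玩具示例体会(Python,假设性示例,并非论文实现):对平面点模式(而非论文中的 OD 流)在 L1 度量下计算经验 K/L 函数。半径为 r 的 L1 球(菱形)面积为 2r²,故在完全随机模式下 L(r)=√(K(r)/2) 近似等于 r,而聚集模式下明显大于 r;此处未做边界校正,`manhattan_l_function` 为示例函数名。

```python
import numpy as np

def manhattan_l_function(pts, radii, area):
    """Empirical L-function under the Manhattan (L1) metric on the plane.
    An L1 ball of radius r has area 2*r**2, so under complete spatial
    randomness K(r) ~ 2*r**2 and L(r) = sqrt(K(r) / 2) ~ r."""
    n = len(pts)
    lam = n / area                                    # intensity estimate
    d = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=-1)
    np.fill_diagonal(d, np.inf)                       # exclude self-pairs
    k = np.array([(d <= r).sum() / (lam * n) for r in radii])
    return np.sqrt(k / 2.0)

rng = np.random.default_rng(1)
radii = np.linspace(0.05, 0.3, 6)
uniform_pts = rng.uniform(0.0, 1.0, size=(400, 2))    # CSR-like pattern
clustered_pts = rng.normal(0.5, 0.03, size=(400, 2))  # strongly aggregated pattern
l_uniform = manhattan_l_function(uniform_pts, radii, 1.0)
l_clustered = manhattan_l_function(clustered_pts, radii, 1.0)
```

对流(OD 对)而言,一种常见做法是把两条流之间的距离定义为起点间与终点间曼哈顿距离之和,再套用同样的框架。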

【21】 Federated Distributionally Robust Optimization for Phase Configuration of RISs 标题:RIS相位配置的联邦分布鲁棒优化 链接:https://arxiv.org/abs/2108.09026

作者:Chaouki Ben Issaid,Sumudu Samarakoon,Mehdi Bennis,H. Vincent Poor 机构:Centre for Wireless Communications (CWC), University of Oulu, Finland, †Electrical Engineering Department, Princeton University, Princeton, USA 备注:6 pages, 2 figures 摘要:在这篇文章中,我们研究了监督学习设定下、异构RIS类型上的鲁棒可重构智能表面(RIS)辅助下行链路通信问题。通过将异构RIS设计上的下行链路通信建模为以分布式方式学习如何优化相位配置的不同工作者,我们以通信高效的方式用分布鲁棒形式求解该分布式学习问题,并确立了其收敛速度。由此,我们确保最坏情况工作者的全局模型性能接近其他工作者的性能。仿真结果表明,与有竞争力的基线相比,我们提出的算法达到相同的最坏情况分布测试精度所需的通信轮数更少(约少50%)。 摘要:In this article, we study the problem of robust reconfigurable intelligent surface (RIS)-aided downlink communication over heterogeneous RIS types in the supervised learning setting. By modeling downlink communication over heterogeneous RIS designs as different workers that learn how to optimize phase configurations in a distributed manner, we solve this distributed learning problem using a distributionally robust formulation in a communication-efficient manner, while establishing its rate of convergence. By doing so, we ensure that the global model performance of the worst-case worker is close to the performance of other workers. Simulation results show that our proposed algorithm requires fewer communication rounds (about 50% fewer) to achieve the same worst-case distribution test accuracy compared to competitive baselines.
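下面的 Python 草图示意"分布鲁棒优化"的一种常见套路(假设性示例,与论文算法并非一一对应):用三个二次损失代表异构工作者,对工作者权重做指数加权(镜像上升)以偏向表现最差者,对模型参数做加权梯度下降,从而逼近最坏情况意义下的最优;`centers`、各步长均为示例设定。

```python
import numpy as np

# Each "worker" (standing in for one RIS design) has its own quadratic
# loss f_i(x) = ||x - c_i||^2; the robust goal is to minimize the worst one.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

def worker_losses(x):
    return np.sum((x - centers) ** 2, axis=1)

def worker_grads(x):
    return 2.0 * (x - centers)

x = np.zeros(2)
w = np.ones(len(centers)) / len(centers)   # distribution over workers
eta_w = 0.2                                # mirror-ascent step on the weights
for t in range(500):
    losses = worker_losses(x)
    # Exponentiated-gradient ascent: shift weight toward badly-served workers
    w = w * np.exp(eta_w * losses)
    w /= w.sum()
    # Gradient descent on the weighted (distributionally robust) objective
    eta_x = 1.0 / (10.0 + t)
    x = x - eta_x * (w[:, None] * worker_grads(x)).sum(axis=0)

robust_losses = worker_losses(x)
avg_x = centers.mean(axis=0)   # plain-average (non-robust) solution for contrast
avg_losses = np.sum((avg_x - centers) ** 2, axis=1)
```

作为对比,简单平均解 `avg_x` 的最坏工作者损失更大,这正是分布鲁棒目标所要避免的情形。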

【22】 Uniformity Testing in the Shuffle Model: Simpler, Better, Faster 标题:混洗模型中的均匀性测试:更简单、更好、更快 链接:https://arxiv.org/abs/2108.08987

作者:Clément L. Canonne,Hongyi Lyu 摘要:均匀性检验,或检验独立观测值是否均匀分布,是分布检验中的典型问题。在过去几年中,一系列工作一直专注于数据隐私约束下的均匀性检验,并获得了各种隐私模型下的私有且数据高效的算法,如中心差分隐私(DP)、本地差分隐私(LDP)、泛隐私,以及最近的差分隐私洗牌模型。在这项工作中,我们大大简化了对洗牌模型中已知均匀性检验算法的分析,并利用最近关于"通过洗牌进行隐私放大"的结果,提供了一种替代算法,该算法通过初等且简洁的论证获得相同的保证。 摘要:Uniformity testing, or testing whether independent observations are uniformly distributed, is the prototypical question in distribution testing. Over the past years, a line of work has been focusing on uniformity testing under privacy constraints on the data, and obtained private and data-efficient algorithms under various privacy models such as central differential privacy (DP), local privacy (LDP), pan-privacy, and, very recently, the shuffle model of differential privacy. In this work, we considerably simplify the analysis of the known uniformity testing algorithm in the shuffle model, and, using a recent result on "privacy amplification via shuffling," provide an alternative algorithm attaining the same guarantees with an elementary and streamlined argument.
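作为背景,非隐私版的碰撞均匀性检验可以用几行 Python 写出(假设性示例):在字母表大小为 k 时,均匀分布的碰撞概率恰为 1/k,偏离均匀时碰撞概率更大;论文所讨论的洗牌模型版本还需对样本施加随机化响应并经洗牌器匿名汇总,此处从略,阈值的选取也只是示意。

```python
import numpy as np

def collision_uniformity_test(samples, k, eps_dist):
    """Collision tester: the collision probability of p equals ||p||_2^2,
    which is 1/k exactly when p is uniform and strictly larger otherwise."""
    counts = np.bincount(samples, minlength=k)
    n = len(samples)
    coll = np.sum(counts * (counts - 1)) / (n * (n - 1))  # unbiased estimate
    threshold = (1.0 + eps_dist ** 2 / 2.0) / k           # illustrative cutoff
    return "uniform" if coll <= threshold else "not uniform"

rng = np.random.default_rng(2)
k, n = 20, 20000
uniform_samples = rng.integers(0, k, size=n)
# Far from uniform: roughly half of the probability mass moved onto symbol 0
biased_samples = np.where(rng.uniform(size=n) < 0.5, 0, rng.integers(0, k, size=n))
verdict_uniform = collision_uniformity_test(uniform_samples, k, eps_dist=0.5)
verdict_biased = collision_uniformity_test(biased_samples, k, eps_dist=0.5)
```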

【23】 Assessing Cerebellar Disorders With Wearable Inertial Sensor Data Using Time-Frequency and Autoregressive Hidden Markov Model Approaches 标题:基于时频和自回归隐马尔可夫模型的可穿戴惯性传感器小脑功能障碍评估 链接:https://arxiv.org/abs/2108.08975

作者:Karin C. Knudson,Anoopum S. Gupta 机构:Data Intensive Studies Center, Tufts University, Medford, MA , Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 备注:12 pages, 7 figures 摘要:我们使用自回归隐马尔可夫模型和时频方法从运动过程中收集的可穿戴惯性传感器数据中创建小脑共济失调行为特征的有意义的定量描述。可穿戴传感器数据相对容易收集,并提供运动的直接测量,可用于开发有用的行为生物标记物。神经退行性疾病的敏感和特异的行为生物标记物对于支持早期发现、药物开发和靶向治疗至关重要。当参与者执行临床评估任务时,我们从可穿戴传感器收集的加速计和陀螺仪数据中创建一组灵活且描述性的特征,并用这些特征估计疾病状态和严重程度。短时间的数据收集($<$5分钟)产生足够的信息,以非常高的准确性有效地将共济失调患者与健康对照组分开,将共济失调与其他神经退行性疾病(如帕金森病)分开,并给出疾病严重性的估计。 摘要:We use autoregressive hidden Markov models and a time-frequency approach to create meaningful quantitative descriptions of behavioral characteristics of cerebellar ataxias from wearable inertial sensor data gathered during movement. Wearable sensor data is relatively easily collected and provides direct measurements of movement that can be used to develop useful behavioral biomarkers. Sensitive and specific behavioral biomarkers for neurodegenerative diseases are critical to supporting early detection, drug development efforts, and targeted treatments. We create a flexible and descriptive set of features derived from accelerometer and gyroscope data collected from wearable sensors while participants perform clinical assessment tasks, and with them estimate disease status and severity. A short period of data collection ($<$ 5 minutes) yields enough information to effectively separate patients with ataxia from healthy controls with very high accuracy, to separate ataxia from other neurodegenerative diseases such as Parkinson's disease, and to give estimates of disease severity.
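从惯性信号中提取自回归系数与时频特征的思路可用如下 Python 草图说明(假设性示例,并非论文流程,也未涉及隐马尔可夫部分):用最小二乘拟合 AR(p) 系数作为紧凑的特征向量,并用周期图求主频;采样率 `fs` 与两段模拟信号均为示例设定。

```python
import numpy as np

def ar_coefficients(x, order):
    """Least-squares fit of an AR(order) model
    x[t] = a_1 x[t-1] + ... + a_p x[t-p] + noise;
    the coefficient vector serves as a compact movement feature."""
    X = np.column_stack([x[order - j - 1 : len(x) - j - 1] for j in range(order)])
    y = x[order:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

fs = 50.0                        # Hz; an assumed wearable-IMU sampling rate
t = np.arange(0.0, 20.0, 1.0 / fs)
rng = np.random.default_rng(3)
# Simulated 1-axis accelerometer traces: slow voluntary movement vs. ~5 Hz tremor
smooth = np.sin(2 * np.pi * 0.5 * t) + 0.05 * rng.standard_normal(t.size)
tremor = np.sin(2 * np.pi * 5.0 * t) + 0.05 * rng.standard_normal(t.size)

feat_smooth = ar_coefficients(smooth, order=4)
feat_tremor = ar_coefficients(tremor, order=4)
# Time-frequency counterpart: dominant frequency from the periodogram
freqs = np.fft.rfftfreq(t.size, 1.0 / fs)
peak_tremor = freqs[np.argmax(np.abs(np.fft.rfft(tremor)))]
```

两段信号的 AR 系数明显不同,这类特征随后可作为分类器或自回归 HMM 的观测输入。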

【24】 Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring 标题:将领域无关数据质量评分付诸实施的统计学习 链接:https://arxiv.org/abs/2108.08905

作者:Sezal Chug,Priya Kaushal,Ponnurangam Kumaraguru,Tavpritesh Sethi 机构:Department of Computer Science, Indraprastha Institute of Information Technology, New Delhi, Delhi, India 备注:20 Pages, 8 Figures, 1 Table 摘要:数据正以难以想象的速度扩张,随之而来的是对数据质量的责任。数据质量是指现有信息的相关性,有助于特定组织中决策和规划等各种操作。数据质量大多是临时性(ad hoc)度量的,因此已有概念均未提供任何实际应用。本实证研究旨在构建一个具体的自动化数据质量平台,以评估传入数据集的质量,并生成质量标签、评分和综合报告。我们利用来自healthdata.gov、opendata.nhs和人口与健康调查(DHS)项目的各种数据集,观察质量分数的变化,并使用主成分分析(PCA)构建标签。本实证研究的结果揭示了一个包含九个质量成分的指标,即出处、数据集特征、均匀性、元数据耦合、缺失单元格和重复行的百分比、数据的偏度、分类列不一致的比率,以及这些属性之间的相关性。该研究还提供了一个说明性案例研究,并按照突变测试(Mutation Testing)方法对该指标进行了验证。本研究提供了一个自动化平台,该平台接收传入的数据集和元数据,给出DQ分数、报告和标签。这项研究的结果将对数据科学家有用,因为该质量标签能够在将数据部署到各自实际应用之前增强信心。 摘要:Data is expanding at an unimaginable rate, and with this development comes the responsibility of the quality of data. Data Quality refers to the relevance of the information present and helps in various operations like decision making and planning in a particular organization. Mostly data quality is measured on an ad-hoc basis, and hence none of the developed concepts provide any practical application. The current empirical study was undertaken to formulate a concrete automated data quality platform to assess the quality of incoming dataset and generate a quality label, score and comprehensive report. We utilize various datasets from healthdata.gov, opendata.nhs and Demographics and Health Surveys (DHS) Program to observe the variations in the quality score and formulate a label using Principal Component Analysis (PCA). The results of the current empirical study revealed a metric that encompasses nine quality ingredients, namely provenance, dataset characteristics, uniformity, metadata coupling, percentage of missing cells and duplicate rows, skewness of data, the ratio of inconsistencies of categorical columns, and correlation between these attributes.
The study also provides an illustrative case study and validation of the metric following Mutation Testing approaches. This research study provides an automated platform which takes an incoming dataset and metadata to provide the DQ score, report and label. The results of this study would be useful to data scientists as the value of this quality label would instill confidence before deploying the data for his/her respective practical application.
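下面的 Python 草图演示"质量成分 + PCA 打分"的流程(假设性示例,仅取九个成分中的三个:缺失率、重复行比例、偏度;`quality_features` 等名称与常数均为示例设定):对一组人为退化程度递增的数据集计算成分特征,标准化后取第一主成分作为单一 DQ 分数。

```python
import numpy as np

def quality_features(table):
    """Three illustrative 'quality ingredients':
    fraction of missing cells, fraction of duplicate rows, mean |skewness|."""
    missing = float(np.mean(np.isnan(table)))
    filled = np.nan_to_num(table)
    dup = 1.0 - len(np.unique(filled, axis=0)) / len(filled)
    centered = filled - filled.mean(axis=0)
    std = centered.std(axis=0) + 1e-9
    skew = float(np.mean(np.abs((centered ** 3).mean(axis=0) / std ** 3)))
    return np.array([missing, dup, skew])

rng = np.random.default_rng(4)
qualities = np.linspace(0.0, 1.0, 8)       # 0 = clean, 1 = badly degraded
datasets = []
for q in qualities:
    d = rng.standard_normal((200, 5))
    d[rng.uniform(size=d.shape) < 0.4 * q] = np.nan   # inject missing cells
    k = int(60 * q)
    if k:
        d[-k:] = d[0]                                  # inject duplicate rows
    datasets.append(d)

F = np.array([quality_features(d) for d in datasets])
Z = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-9)      # standardize ingredients
_, _, Vt = np.linalg.svd(Z, full_matrices=False)       # PCA via SVD
scores = Z @ Vt[0]                                     # first-PC quality score
```

得到的第一主成分分数与人为设定的退化程度高度相关,可在此基础上再划分质量标签区间。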

【25】 Risk Bounds and Calibration for a Smart Predict-then-Optimize Method 标题:一种智能先预测后优化方法的风险界与校准 链接:https://arxiv.org/abs/2108.08887

作者:Heyuan Liu,Paul Grigas 机构:Department of Industrial Engineering and Operations Research, University of California, Berkeley∗ 摘要:预测-然后优化框架是实际随机决策问题的基础:首先预测优化模型的未知参数,然后使用预测值求解问题。该设定中一个自然的损失函数通过度量预测参数所引起的决策误差来定义,Elmachtoub和Grigas将其命名为Smart Predict-then-Optimize(SPO)损失[arXiv:1710.08005]。由于SPO损失通常是非凸的,并且可能不连续,Elmachtoub和Grigas[arXiv:1710.08005]引入了一种称为SPO+损失的凸替代损失,其重要之处在于考虑了优化模型的底层结构。在本文中,我们大大扩展了Elmachtoub和Grigas[arXiv:1710.08005]给出的SPO+损失的一致性结果。我们建立了SPO+损失相对于SPO损失的风险界和一致校准结果,从而提供了一种将超额替代风险定量转化为超额真实风险的途径。通过将我们的风险界与泛化界相结合,我们证明了SPO+损失的经验风险最小化解以高概率实现低的超额真实风险。我们首先在底层优化问题的可行域为多面体的情形下证明这些结果,随后证明当可行域是强凸函数的水平集时,这些结果可以得到实质性加强。我们通过实验,在投资组合分配和成本敏感多类分类问题上,实证展示了SPO+替代损失相较于标准$\ell_1$和平方$\ell_2$预测误差损失的优势。 摘要:The predict-then-optimize framework is fundamental in practical stochastic decision-making problems: first predict unknown parameters of an optimization model, then solve the problem using the predicted values. A natural loss function in this setting is defined by measuring the decision error induced by the predicted parameters, which was named the Smart Predict-then-Optimize (SPO) loss by Elmachtoub and Grigas [arXiv:1710.08005]. Since the SPO loss is typically nonconvex and possibly discontinuous, Elmachtoub and Grigas [arXiv:1710.08005] introduced a convex surrogate, called the SPO+ loss, that importantly accounts for the underlying structure of the optimization model. In this paper, we greatly expand upon the consistency results for the SPO+ loss provided by Elmachtoub and Grigas [arXiv:1710.08005]. We develop risk bounds and uniform calibration results for the SPO+ loss relative to the SPO loss, which provide a quantitative way to transfer the excess surrogate risk to excess true risk. By combining our risk bounds with generalization bounds, we show that the empirical minimizer of the SPO+ loss achieves low excess true risk with high probability. We first demonstrate these results in the case when the feasible region of the underlying optimization problem is a polyhedron, and then we show that the results can be strengthened substantially when the feasible region is a level set of a strongly convex function. We perform experiments to empirically demonstrate the strength of the SPO+ surrogate, as compared to standard $\ell_1$ and squared $\ell_2$ prediction error losses, on portfolio allocation and cost-sensitive multi-class classification problems.
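SPO 损失与 SPO+ 替代损失的定义可以在单纯形可行域上用几行 Python 演示(玩具示例;单纯形上线性优化的 oracle 就是取最小成本坐标):SPO 损失是按预测成本决策所产生的真实决策遗憾,SPO+ 损失为 max_w{(c−2ĉ)ᵀw}+2ĉᵀw*(c)−cᵀw*(c),并且总不小于 SPO 损失。

```python
import numpy as np

def oracle(c):
    """Linear optimization oracle over the unit simplex: pick the cheapest action."""
    w = np.zeros_like(c)
    w[np.argmin(c)] = 1.0
    return w

def spo_loss(c_hat, c):
    """Decision regret of acting on the prediction c_hat when true costs are c."""
    return c @ oracle(c_hat) - c @ oracle(c)

def spo_plus_loss(c_hat, c):
    """Convex SPO+ surrogate: max_w (c - 2 c_hat)^T w + 2 c_hat^T w*(c) - z*(c)."""
    w_star = oracle(c)
    support = np.max(c - 2.0 * c_hat)   # support function over simplex vertices
    return support + 2.0 * (c_hat @ w_star) - c @ w_star

c_true = np.array([1.0, 2.0, 3.0])
c_good = np.array([1.1, 1.9, 2.5])      # ranks the actions correctly
c_bad = np.array([3.0, 2.0, 1.0])       # reversed ranking

loss_good = spo_loss(c_good, c_true)    # same optimal decision: zero regret
loss_bad = spo_loss(c_bad, c_true)      # picks the worst action instead
surr_good = spo_plus_loss(c_good, c_true)
surr_bad = spo_plus_loss(c_bad, c_true)
```

示例体现了 SPO 损失对成本数值误差不敏感、只看决策是否正确,而凸的 SPO+ 上界则给出可优化的替代目标。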
