Statistics Academic Digest [11.9]

2021-11-17 10:53:26

stat (Statistics): 63 papers in total

【1】 On the Finite-Sample Performance of Measure Transportation-Based Multivariate Rank Tests
Link: https://arxiv.org/abs/2111.04705

Authors: Marc Hallin, Gilles Mordant
Affiliations: Université libre de Bruxelles; Universität Göttingen
Abstract: Extending to dimension 2 and higher the dual univariate concepts of ranks and quantiles has remained an open problem for more than half a century. Based on measure transportation results, a solution has been proposed recently under the name center-outward ranks and quantiles which, contrary to previous proposals, enjoys all the properties that make univariate ranks a successful tool for statistical inference. Just as their univariate counterparts (to which they reduce in dimension one), center-outward ranks allow for the construction of distribution-free and asymptotically efficient tests for a variety of problems where the density of some noise or innovation remains unspecified. The actual implementation of these tests involves the somewhat arbitrary choice of a grid. While the asymptotic impact of that choice is nil, its finite-sample consequences are not. In this note, we investigate the finite-sample impact of that choice in the typical context of the multivariate two-sample location problem.

【2】 Smooth tensor estimation with unknown permutations
Link: https://arxiv.org/abs/2111.04681

Authors: Chanwoo Lee, Miaoyan Wang
Affiliations: University of Wisconsin – Madison
Comments: 37 pages, 10 figures, 10 tables
Abstract: We consider the problem of structured tensor denoising in the presence of unknown permutations. Such data problems arise commonly in recommendation systems, neuroimaging, community detection, and multiway comparison applications. Here, we develop a general family of smooth tensor models up to arbitrary index permutations; the model incorporates the popular tensor block models and Lipschitz hypergraphon models as special cases. We show that a constrained least-squares estimator in the block-wise polynomial family achieves the minimax error bound. A phase transition phenomenon is revealed with respect to the smoothness threshold needed for optimal recovery. In particular, we find that a polynomial of degree up to $(m-2)(m+1)/2$ is sufficient for accurate recovery of order-$m$ tensors, whereas higher degree exhibits no further benefits. This phenomenon reveals the intrinsic distinction for smooth tensor estimation problems with and without unknown permutations. Furthermore, we provide an efficient polynomial-time Borda count algorithm that provably achieves optimal rate under monotonicity assumptions. The efficacy of our procedure is demonstrated through both simulations and Chicago crime data analysis.
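
As an illustration of the Borda-count idea, here is a minimal Python sketch for the order-2 (matrix) case: indices are ranked by their marginal sums, and the sorted array is denoised by simple block averaging. The block averaging stands in for the paper's block-wise polynomial estimator; the function name and block choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def borda_count_denoise(Y, num_blocks):
    """Sketch: rank indices by marginal sums (Borda count), sort, then
    denoise the sorted matrix by block averaging."""
    n = Y.shape[0]
    order = np.argsort(Y.sum(axis=1))        # Borda-count ranking of indices
    Y_sorted = Y[np.ix_(order, order)]
    edges = np.linspace(0, n, num_blocks + 1).astype(int)
    Theta = np.zeros_like(Y_sorted, dtype=float)
    for a in range(num_blocks):
        for b in range(num_blocks):
            blk = Y_sorted[edges[a]:edges[a+1], edges[b]:edges[b+1]]
            Theta[edges[a]:edges[a+1], edges[b]:edges[b+1]] = blk.mean()
    inv = np.argsort(order)                  # undo the sorting
    return Theta[np.ix_(inv, inv)]
```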

【3】 Consistent Sufficient Explanations and Minimal Local Rules for explaining regression and classification models
Link: https://arxiv.org/abs/2111.04658

Authors: Salim I. Amoukou, Nicolas J. B. Brunel
Comments: 8 pages, 2 figures, 1 table
Abstract: To explain the decision of any model, we extend the notion of probabilistic Sufficient Explanations (P-SE). For each instance, this approach selects the minimal subset of features that is sufficient to yield the same prediction with high probability, while removing other features. The crux of P-SE is to compute the conditional probability of maintaining the same prediction. Therefore, we introduce an accurate and fast estimator of this probability via random forests for any data $(\boldsymbol{X}, Y)$ and show its efficiency through a theoretical analysis of its consistency. As a consequence, we extend the P-SE to regression problems. In addition, we deal with non-binary features, without learning the distribution of $X$ nor having the model for making predictions. Finally, we introduce local rule-based explanations for regression/classification based on the P-SE and compare our approaches with other explainable AI methods. These methods are publicly available as a Python package at www.github.com/salimamoukou/acv00.

【4】 Optimal convex lifted sparse phase retrieval and PCA with an atomic matrix norm regularizer
Link: https://arxiv.org/abs/2111.04652

Authors: Andrew D. McRae, Justin K. Romberg, Mark A. Davenport
Abstract: We present novel analysis and algorithms for solving sparse phase retrieval and sparse principal component analysis (PCA) with convex lifted matrix formulations. The key innovation is a new mixed atomic matrix norm that, when used as regularization, promotes low-rank matrices with sparse factors. We show that convex programs with this atomic norm as a regularizer provide near-optimal sample complexity and error rate guarantees for sparse phase retrieval and sparse PCA. While we do not know how to solve the convex programs exactly with an efficient algorithm, for the phase retrieval case we carefully analyze the program and its dual and thereby derive a practical heuristic algorithm. We show empirically that this practical algorithm performs similarly to existing state-of-the-art algorithms.

【5】 Bayesian modelling of statistical region- and family-level clustered ordinal outcome data from Turkey
Link: https://arxiv.org/abs/2111.04645

Authors: Ozgur Asar
Affiliations: Department of Biostatistics and Medical Informatics, Acıbadem Mehmet Ali Aydınlar University
Abstract: This study is concerned with the analysis of three-level ordinal outcome data with polytomous logistic regression in the presence of random-effects. It is assumed that the random-effects follow a Bridge distribution for the logit link, which allows one to obtain marginal interpretations of the regression coefficients. The data are obtained from the Turkish Income and Living Conditions Study, where the outcome variable is self-rated health (SRH), which is ordinal in nature. The analysis of these data is to compare covariate sub-groups and draw region- and family-level inferences in terms of SRH. Parameters and random-effects are sampled from the joint posterior densities following a Bayesian paradigm. Three criteria are used for model selection: the Watanabe information criterion, log pseudo marginal likelihood, and deviance information criterion. All three suggest that we need to account for both region- and family-level variabilities in order to model SRH. The extent to which the models replicate the observed data is examined by posterior predictive checks. Differences in SRH are found between levels of economic and demographic variables, regions of Turkey, and families who participated in the survey. Some of the interesting findings are that unemployed people are 19% more likely to report poorer health than employed people, and rural Aegean is the region that has the least probability of reporting poorer health.

【6】 A Private and Computationally-Efficient Estimator for Unbounded Gaussians
Link: https://arxiv.org/abs/2111.04609

Authors: Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, Jonathan Ullman
Affiliations: University of Waterloo; Northeastern University
Abstract: We give the first polynomial-time, polynomial-sample, differentially private estimator for the mean and covariance of an arbitrary Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $\mathcal{N}(0,\Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.

【7】 Neyman-Pearson Multi-class Classification via Cost-sensitive Learning
Link: https://arxiv.org/abs/2111.04597

Authors: Ye Tian, Yang Feng
Affiliations: Department of Statistics, Columbia University; Department of Biostatistics, School of Global Public Health, New York University
Comments: 44 pages, 6 figures
Abstract: Most existing classification methods aim to minimize the overall misclassification error rate; however, in applications, different types of errors can have different consequences. To take this asymmetry issue into account, two popular paradigms have been developed, namely the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem, and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, and show that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package "npcs" on CRAN.
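
The paper's algorithms live in the R package "npcs"; the Python sketch below only illustrates the cost-sensitive reduction that connects the two paradigms. Per-class costs are tuned (here by a crude doubling search, a hypothetical choice, on synthetic data) until a protected class's empirical error falls below an NP-type target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = rng.integers(0, 3, size=600)
alpha0 = 0.05                      # hypothetical NP-type bound on class-0 error

w0 = 1.0
for _ in range(20):                # crude search over the class-0 cost
    clf = LogisticRegression(class_weight={0: w0, 1: 1.0, 2: 1.0},
                             max_iter=1000).fit(X, y)
    err0 = np.mean(clf.predict(X[y == 0]) != 0)
    if err0 <= alpha0:
        break
    w0 *= 2.0                      # raise the cost of class-0 mistakes, refit
```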

【8】 Fast and Scalable Spike and Slab Variable Selection in High-Dimensional Gaussian Processes
Link: https://arxiv.org/abs/2111.04558

Authors: Hugh Dance, Brooks Paige
Affiliations: University College London
Abstract: Variable selection in Gaussian processes (GPs) is typically undertaken by thresholding the inverse lengthscales of 'automatic relevance determination' kernels, but in high-dimensional datasets this approach can be unreliable. A more probabilistically principled alternative is to use spike and slab priors and infer a posterior probability of variable inclusion. However, existing implementations in GPs are extremely costly to run in both high-dimensional and large-$n$ datasets, or are intractable for most kernels. As such, we develop a fast and scalable variational inference algorithm for the spike and slab GP that is tractable with arbitrary differentiable kernels. We improve our algorithm's ability to adapt to the sparsity of relevant variables by Bayesian model averaging over hyperparameters, and achieve substantial speed ups using zero temperature posterior restrictions, dropout pruning and nearest neighbour minibatching. In experiments our method consistently outperforms vanilla and sparse variational GPs whilst retaining similar runtimes (even when $n=10^6$) and performs competitively with a spike and slab GP using MCMC but runs up to $1000$ times faster.

【9】 Bayesian profile regression for clustering analysis involving a longitudinal response and explanatory variables
Link: https://arxiv.org/abs/2111.04518

Authors: Anaïs Rouanet, Rob Johnson, Magdalena E. Strauss, Sylvia Richardson, Brian D. Tom, Simon R. White, Paul D. W. Kirk
Affiliations: MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, UK; EMBL-EBI, Wellcome Genome Campus, Hinxton, UK; Department of Psychiatry, University of Cambridge, UK
Comments: 39 pages, 27 figures. Accompanying code is available from this https URL
Abstract: The identification of sets of co-regulated genes that share a common function is a key question of modern genomics. Bayesian profile regression is a semi-supervised mixture modelling approach that makes use of a response to guide inference toward relevant clusterings. Previous applications of profile regression have considered univariate continuous, categorical, and count outcomes. In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression. We consider multivariate normal and Gaussian process regression response models and provide proof-of-principle applications to four simulation studies. The model is applied on budding yeast data to identify groups of genes co-regulated during the Saccharomyces cerevisiae cell cycle. We identify 4 distinct groups of genes associated with specific patterns of gene expression trajectories, along with the bound transcriptional factors, likely involved in their co-regulation process.

【10】 Clustering and Structural Robustness in Causal Diagrams
Link: https://arxiv.org/abs/2111.04513

Authors: Santtu Tikka, Jouni Helske, Juha Karvanen
Abstract: Graphs are commonly used to represent and visualize causal relations. For a small number of variables, this approach provides a succinct and clear view of the scenario at hand. As the number of variables under study increases, the graphical approach may become impractical, and the clarity of the representation is lost. Clustering of variables is a natural way to reduce the size of the causal diagram, but it may erroneously change the essential properties of the causal relations if implemented arbitrarily. We define a specific type of cluster, called transit cluster, that is guaranteed to preserve the identifiability properties of causal effects under certain conditions. We provide a sound and complete algorithm for finding all transit clusters in a given graph and demonstrate how clustering can simplify the identification of causal effects. We also study the inverse problem, where one starts with a clustered graph and looks for extended graphs where the identifiability properties of causal effects remain unchanged. We show that this kind of structural robustness is closely related to transit clusters.

【11】 Assessing the Uncertainty of Epidemiological Forecasts with Normalised Estimation Error Squared
Link: https://arxiv.org/abs/2111.04498

Authors: R. E. Moore, C. Rosato, S. Maskell
Affiliations: Department of Electrical Engineering and Electronics, University of Liverpool
Comments: 14 pages, 2 figures, 2 tables
Abstract: Estimates from infectious disease models have constituted a significant part of the scientific evidence used to inform the response to the COVID-19 pandemic in the UK. These estimates can vary strikingly in their precision, with some being over-confident and others over-cautious. The uncertainty in epidemiological forecasts should be commensurate with the errors in their predictions. We propose Normalised Estimation Error Squared (NEES) as a metric for assessing the consistency between forecasts and future observations. We introduce a novel infectious disease model for COVID-19 and use it to demonstrate the usefulness of NEES for diagnosing over-confident and over-cautious predictions resulting from different values of a regularization parameter.
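
NEES is a standard consistency metric in the tracking literature; a minimal sketch, assuming a Gaussian forecast with mean `x_hat` and covariance `P` (the paper's epidemiological application wraps this basic quantity):

```python
import numpy as np
from scipy import stats

def nees(x, x_hat, P):
    """Normalised Estimation Error Squared of one forecast: e' P^{-1} e,
    with e the forecast error and P the forecast covariance."""
    e = np.atleast_1d(x - x_hat)
    return float(e @ np.linalg.solve(np.atleast_2d(P), e))

# For a well-calibrated d-dimensional Gaussian forecast, NEES ~ chi2(d):
d = 2
lo, hi = stats.chi2.ppf([0.025, 0.975], df=d)
# Persistent NEES values above hi suggest over-confidence;
# values below lo suggest over-caution.
```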

【12】 Is the group structure important in grouped functional time series?
Link: https://arxiv.org/abs/2111.04390

Authors: Yang Yang, Han Lin Shang
Affiliations: Department of Econometrics and Business Statistics, Monash University; Department of Actuarial Studies and Business Analytics, Macquarie University
Comments: 29 pages, 8 figures, to appear in the Journal of Data Science
Abstract: We study the importance of group structure in grouped functional time series. Due to the non-uniqueness of group structure, we investigate different disaggregation structures in grouped functional time series. We address a practical question on whether or not the group structure can affect forecast accuracy. Using a dynamic multivariate functional time series method, we consider joint modeling and forecasting multiple series. Illustrated by Japanese sub-national age-specific mortality rates from 1975 to 2016, we investigate one- to 15-step-ahead point and interval forecast accuracies for the two group structures.

【13】 A Bayesian Analysis of Migration Pathways using Chain Event Graphs of Agent Based Models
Link: https://arxiv.org/abs/2111.04368

Authors: Peter Strong, Alys McAlpine, Jim Q. Smith
Abstract: Agent-Based Models (ABMs) are often used to model migration and are increasingly used to simulate individual migrant decision-making and unfolding events through a sequence of heuristic if-then rules. However, ABMs lack methods for embedding more principled strategies of performing inference to estimate and validate the models, both of which are of significant importance for real-world case studies. Chain Event Graphs (CEGs) can fill this need: they can be used to provide a Bayesian framework which represents an ABM accurately. Through the use of the CEG, we illustrate how to transform an elicited ABM into a Bayesian framework and outline the benefits of this approach.

【14】 The Weighted Generalised Covariance Measure
Link: https://arxiv.org/abs/2111.04361

Authors: Cyrill Scheidegger, Julia Hörrmann, Peter Bühlmann
Affiliations: Seminar for Statistics, ETH Zürich; Department of Computer Science, ETH Zürich
Abstract: We introduce a new test for conditional independence which is based on what we call the weighted generalised covariance measure (WGCM). It is an extension of the recently introduced generalised covariance measure (GCM). To test the null hypothesis of X and Y being conditionally independent given Z, our test statistic is a weighted form of the sample covariance between the residuals of nonlinearly regressing X and Y on Z. We propose different variants of the test for both univariate and multivariate X and Y. We give conditions under which the tests yield the correct type I error rate. Finally, we compare our novel tests to the original GCM using simulation and on real data sets. Typically, our tests have power against a wider class of alternatives compared to the GCM. This comes at the cost of having less power against alternatives for which the GCM already works well.
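
A minimal sketch of the (weighted) GCM statistic for univariate X and Y given covariates Z (an n x d array), using random forests as the nonlinear regressions; the weight argument and normalization follow the generic GCM recipe and are not the authors' exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def wgcm_stat(X, Y, Z, weight=None):
    """(Weighted) GCM statistic: regress X and Y on Z nonlinearly, then test
    whether the (weighted) product of residuals has mean zero; approximately
    N(0,1) under conditional independence."""
    rX = X - RandomForestRegressor(random_state=0).fit(Z, X).predict(Z)
    rY = Y - RandomForestRegressor(random_state=0).fit(Z, Y).predict(Z)
    R = rX * rY
    if weight is not None:          # WGCM: reweight residual products by w(Z)
        R = weight(Z) * R
    return np.sqrt(len(R)) * R.mean() / R.std()
```

Under the null, |wgcm_stat| > 1.96 rejects conditional independence at the 5% level.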

【15】 Assessing the effectiveness of empirical calibration under different bias scenarios
Link: https://arxiv.org/abs/2111.04233

Authors: Hon Hwang, Juan C. Quiroz, Blanca Gallego
Affiliations: Centre for Big Data Research in Health (CBDRH), University of New South Wales, Sydney, Australia
Abstract: Background: Estimations of causal effects from observational data are subject to various sources of bias. These biases can be adjusted by using negative control outcomes not affected by the treatment. The empirical calibration procedure uses negative controls to calibrate p-values and both negative and positive controls to calibrate coverage of the 95% confidence interval of the outcome of interest. Although empirical calibration has been used in several large observational studies, there is no systematic examination of its effect under different bias scenarios. Methods: The effect of empirical calibration of confidence intervals was analyzed using simulated datasets with known treatment effects. The simulations were for binary treatment and binary outcome, with simulated biases resulting from unmeasured confounder, model misspecification, measurement error, and lack of positivity. The performance of empirical calibration was evaluated by determining the change of the confidence interval coverage and bias of the outcome of interest. Results: Empirical calibration increased coverage of the outcome of interest by the 95% confidence interval under most settings but was inconsistent in adjusting the bias of the outcome of interest. Empirical calibration was most effective when adjusting for unmeasured confounding bias. Suitable negative controls had a large impact on the adjustment made by empirical calibration, but small improvements in the coverage of the outcome of interest were also observable when using unsuitable negative controls. Conclusions: This work adds evidence to the efficacy of empirical calibration in calibrating the confidence intervals of treatment effects in observational studies. We recommend empirical calibration of confidence intervals, especially when there is a risk of unmeasured confounding.
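
A minimal sketch of the p-value calibration step, assuming the empirical null is a normal distribution fitted to effect estimates from negative controls (whose true effect is zero); the full procedure also uses the estimates' standard errors and positive controls for interval calibration:

```python
import numpy as np
from scipy import stats

def calibrated_pvalue(beta_hat, negative_control_estimates):
    """Calibrate a two-sided p-value against an empirical null fitted to
    the effect estimates of negative controls."""
    mu = np.mean(negative_control_estimates)
    sd = np.std(negative_control_estimates, ddof=1)
    return 2 * stats.norm.sf(abs((beta_hat - mu) / sd))
```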

【16】 Rate-Optimal Cluster-Randomized Designs for Spatial Interference
Link: https://arxiv.org/abs/2111.04219

Authors: Michael P. Leung
Abstract: We consider a potential outcomes model in which interference may be present between any two units but the extent of interference diminishes with spatial distance. The causal estimand is the global average treatment effect, which compares counterfactual outcomes when all units are treated to outcomes when none are. We study a class of designs in which space is partitioned into clusters that are randomized into treatment and control. For each design, we estimate the treatment effect using a Horvitz-Thompson estimator that compares the average outcomes of units with all neighbors treated to units with no neighbors treated, where the neighborhood radius is of the same order as the cluster size dictated by the design. We derive the estimator's rate of convergence as a function of the design and degree of interference and use this to obtain estimator-design pairs in this class that achieve near-optimal rates of convergence under relatively minimal assumptions on interference. We prove that the estimators are asymptotically normal and provide a variance estimator. Finally, we discuss practical implementation of the designs by partitioning space using clustering algorithms.
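
A minimal sketch of the Horvitz-Thompson contrast described in the abstract, assuming the design probabilities that a unit's whole neighborhood is treated (`p_all`) or untreated (`p_none`) are known; the names and the simple loop are illustrative:

```python
import numpy as np

def ht_global_effect(Y, D, neighbors, p_all, p_none):
    """Horvitz-Thompson contrast: units with the whole neighborhood treated
    vs. units with the whole neighborhood untreated, inverse-weighted by the
    design probabilities of those events.
    Y: outcomes (n,); D: 0/1 treatments (n,); neighbors[i]: index array of
    unit i's neighborhood (including i itself)."""
    n = len(Y)
    est = 0.0
    for i in range(n):
        nb = neighbors[i]
        est += Y[i] * (D[nb].all() / p_all[i] - (D[nb] == 0).all() / p_none[i])
    return est / n
```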

【17】 Analysis of Least square estimator for simple Linear Regression with a uniform distribution error
Link: https://arxiv.org/abs/2111.04200

Authors: M. Jlibene, S. Taoufik, S. Benjelloun
Affiliations: Université Mohammed VI Polytechnique
Abstract: We study the least squares estimator, in the framework of simple linear regression, when the deviance term $\varepsilon$ with respect to the linear model is modeled by a uniform distribution. In particular, we give the law of this estimator, and prove some convergence properties.
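
A minimal simulation of the model studied, with a uniform deviance term and the least-squares fit computed from the design matrix; all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, a, b = 200, 1.0, 2.0
x = rng.uniform(0.0, 1.0, n)
eps = rng.uniform(-0.5, 0.5, n)            # uniform deviance term epsilon
y = a + b * x + eps
X = np.column_stack([np.ones(n), x])       # design matrix with intercept
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```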

【18】 Gene regulatory network in single cells based on the Poisson log-normal model
Link: https://arxiv.org/abs/2111.04037

Authors: Feiyi Xiao, Junjie Tang, Huaying Fang, Ruibin Xi
Affiliations: School of Mathematical Sciences, Peking University; Capital Normal University
Comments: 34 pages, 8 figures
Abstract: Gene regulatory network inference is crucial for understanding the complex molecular interactions in various genetic and environmental conditions. The rapid development of single-cell RNA sequencing (scRNA-seq) technologies unprecedentedly enables gene regulatory network inference at the single-cell resolution. However, traditional graphical models for continuous data, such as Gaussian graphical models, are inappropriate for network inference from scRNA-seq count data. Here, we model the scRNA-seq data using the multivariate Poisson log-normal (PLN) distribution and represent the precision matrix of the latent normal distribution as the regulatory network. We propose to first estimate the latent covariance matrix using a moment estimator and then estimate the precision matrix by minimizing the lasso-penalized D-trace loss function. We establish the convergence rate of the covariance matrix estimator and further establish the convergence rates and the sign consistency of the proposed PLNet estimator of the precision matrix in the high dimensional setting. The performance of PLNet is evaluated and compared with available methods using simulation and gene regulatory network analysis of scRNA-seq data.
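
A sketch of the first stage only (the moment estimator of the latent covariance), assuming the standard PLN moment identities $E[Y_j]=e^{\mu_j+\Sigma_{jj}/2}$, $E[Y_jY_k]=E[Y_j]E[Y_k]e^{\Sigma_{jk}}$ for $j\neq k$, and $E[Y_j(Y_j-1)]=E[Y_j]^2e^{\Sigma_{jj}}$; the lasso-penalized D-trace stage is omitted:

```python
import numpy as np

def pln_latent_cov(Y):
    """Moment estimator of the latent covariance Sigma in a Poisson log-normal
    model from an n x p count matrix Y (assumes the sample moments below are
    strictly positive so the logarithms are defined)."""
    n, p = Y.shape
    m = Y.mean(axis=0)                    # estimates E[Y_j]
    cross = (Y.T @ Y) / n                 # estimates E[Y_j Y_k]
    fact = (Y * (Y - 1)).mean(axis=0)     # estimates E[Y_j (Y_j - 1)]
    S = np.log(cross / np.outer(m, m))    # off-diagonal identity
    np.fill_diagonal(S, np.log(fact / m ** 2))  # diagonal identity
    return S
```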

【19】 Kernel Methods for Multistage Causal Inference: Mediation Analysis and Dynamic Treatment Effects
Link: https://arxiv.org/abs/2111.03950

Authors: Rahul Singh, Liyuan Xu, Arthur Gretton
Affiliations: Department of Economics, Massachusetts Institute of Technology; Gatsby Computational Neuroscience Unit, University College London
Comments: 66 pages. Material in this draft previously appeared in a working paper presented at the 2020 NeurIPS Workshop on ML for Economic Policy (arXiv:2010.04855v1). We have divided the original working paper (arXiv:2010.04855v1) into two projects: one paper focusing on static settings (arXiv:2010.04855) and this paper focusing on dynamic settings
Abstract: We propose kernel ridge regression estimators for mediation analysis and dynamic treatment effects over short horizons. We allow treatments, covariates, and mediators to be discrete or continuous, and low, high, or infinite dimensional. We propose estimators of means, increments, and distributions of counterfactual outcomes with closed form solutions in terms of kernel matrix operations. For the continuous treatment case, we prove uniform consistency with finite sample rates. For the discrete treatment case, we prove root-n consistency, Gaussian approximation, and semiparametric efficiency. We conduct simulations then estimate mediated and dynamic treatment effects of the US Job Corps program for disadvantaged youth.
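
The closed-form building block behind these estimators is kernel ridge regression; a minimal sketch with an RBF kernel (the paper's mediation and dynamic-treatment estimators compose several such regressions, and the hyperparameters here are placeholders):

```python
import numpy as np

def krr_fit_predict(Z, y, Z_new, lam=1e-2, gamma=1.0):
    """Kernel ridge regression: alpha = (K + n*lam*I)^{-1} y, and the
    prediction at new points is K(Z_new, Z) @ alpha."""
    sqdist = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sqdist(Z, Z))
    alpha = np.linalg.solve(K + len(y) * lam * np.eye(len(y)), y)
    return np.exp(-gamma * sqdist(Z_new, Z)) @ alpha
```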

【20】 Deep Neyman-Scott Processes
Link: https://arxiv.org/abs/2111.03949

Authors: Chengkuan Hong, Christian R. Shelton
Affiliations: Department of Computer Science and Engineering, University of California, Riverside
Abstract: A Neyman-Scott process is a special case of a Cox process. The latent and observable stochastic processes are both Poisson processes. We consider a deep Neyman-Scott process in this paper, for which the building components of a network are all Poisson processes. We develop an efficient posterior sampling via Markov chain Monte Carlo and use it for likelihood-based inference. Our method opens up room for the inference in sophisticated hierarchical point processes. We show in the experiments that more hidden Poisson processes bring better performance for likelihood fitting and event-type prediction. We also compare our method with state-of-the-art models for temporal real-world datasets and demonstrate competitive abilities for both data fitting and prediction, using far fewer parameters.

【21】 An Online Sequential Test for Qualitative Treatment Effects
Link: https://arxiv.org/abs/2111.03908

Authors: Chengchun Shi, Shikai Luo, Hongtu Zhu, Rui Song
Affiliations: Department of Statistics, London School of Economics and Political Science; Tencent PCG; Department of Biostatistics, University of North Carolina; Department of Statistics, North Carolina State University
Abstract: Tech companies (e.g., Google or Facebook) often use randomized online experiments and/or A/B testing primarily based on the average treatment effects to compare their new product with an old one. However, it is also critically important to detect qualitative treatment effects such that the new one may significantly outperform the existing one only under some specific circumstances. The aim of this paper is to develop a powerful testing procedure to efficiently detect such qualitative treatment effects. We propose a scalable online updating algorithm to implement our test procedure. It has three novelties including adaptive randomization, sequential monitoring, and online updating with guaranteed type-I error control. We also thoroughly examine the theoretical properties of our testing procedure including the limiting distribution of test statistics and the justification of an efficient bootstrap method. Extensive empirical studies are conducted to examine the finite sample performance of our test procedure.

【22】 Causal Mediation and Sensitivity Analysis for Mixed-Scale Data
Link: https://arxiv.org/abs/2111.03907

Authors: Lexi Rene, Antonio R. Linero, Elizabeth Slate
Abstract: The goal of causal mediation analysis, often described within the potential outcomes framework, is to decompose the effect of an exposure on an outcome of interest along different causal pathways. Using the assumption of sequential ignorability to attain non-parametric identification, Imai et al. (2010) proposed a flexible approach to measuring mediation effects, focusing on parametric and semiparametric normal/Bernoulli models for the outcome and mediator. Less attention has been paid to the case where the outcome and/or mediator model are mixed-scale, ordinal, or otherwise fall outside the normal/Bernoulli setting. We develop a simple, but flexible, parametric modeling framework to accommodate the common situation where the responses are mixed continuous and binary, and apply it to a zero-one inflated beta model for the outcome and mediator. Applying our proposed methods to a publicly-available JOBS II dataset, we (i) argue for the need for non-normal models, (ii) show how to estimate both average and quantile mediation effects for boundary-censored data, and (iii) show how to conduct a meaningful sensitivity analysis by introducing unidentified, scientifically meaningful, sensitivity parameters.

【23】 The How and Why of Bayesian Nonparametric Causal Inference
Link: https://arxiv.org/abs/2111.03897

Authors: Antonio R. Linero, Joseph L. Antonelli
Affiliations: Department of Statistics and Data Sciences, University of Texas at Austin; Department of Statistics, University of Florida
Abstract: Spurred on by recent successes in causal inference competitions, Bayesian nonparametric (and high-dimensional) methods have recently seen increased attention in the causal inference literature. In this paper, we present a comprehensive overview of Bayesian nonparametric applications to causal inference. Our aims are to (i) introduce the fundamental Bayesian nonparametric toolkit; (ii) discuss how to determine which tool is most appropriate for a given problem; and (iii) show how to avoid common pitfalls in applying Bayesian nonparametric methods in high-dimensional settings. Unlike standard fixed-dimensional parametric problems, where outcome modeling alone can sometimes be effective, we argue that most of the time it is necessary to model both the selection and outcome processes.

【24】 Empirical Bayes Control of the False Discovery Exceedance
Link: https://arxiv.org/abs/2111.03885

Authors: Pallavi Basu, Luella Fu, Alessio Saretto, Wenguang Sun
Affiliations: Indian School of Business; San Francisco State University; Tel Aviv University; Federal Reserve Bank of Dallas; USC Marshall School of Business
Abstract: In sparse large-scale testing problems where the false discovery proportion (FDP) is highly variable, the false discovery exceedance (FDX) provides a valuable alternative to the widely used false discovery rate (FDR). We develop an empirical Bayes approach to controlling the FDX. We show that for independent hypotheses from a two-group model and dependent hypotheses from a Gaussian model fulfilling the exchangeability condition, an oracle decision rule based on ranking and thresholding the local false discovery rate (lfdr) is optimal in the sense that the power is maximized subject to the FDX constraint. We propose a data-driven FDX procedure that emulates the oracle via carefully designed computational shortcuts. We investigate the empirical performance of the proposed method using simulations and illustrate the merits of FDX control through an application for identifying abnormal stock trading strategies.
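
A hedged sketch of the lfdr ranking-and-thresholding idea: treating each sorted lfdr value as the probability that the corresponding discovery is false, a Monte Carlo check picks the largest rejection set whose estimated P(FDP > gamma) stays below alpha. The paper's data-driven procedure uses purpose-built computational shortcuts rather than this brute-force emulation:

```python
import numpy as np

def fdx_threshold(lfdr, gamma=0.1, alpha=0.05, n_mc=2000, seed=0):
    """Return an lfdr cutoff for approximate FDX control, or None if no
    nonempty rejection set passes the Monte Carlo exceedance check."""
    rng = np.random.default_rng(seed)
    s = np.sort(lfdr)
    k_best = 0
    for k in range(1, len(s) + 1):
        false = rng.random((n_mc, k)) < s[:k]     # Bernoulli(lfdr) draws
        fdp = false.sum(axis=1) / k
        if (fdp > gamma).mean() <= alpha:
            k_best = k
        else:
            break
    return s[k_best - 1] if k_best else None
```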

【25】 Metric Distributional Discrepancy in Metric Space
Link: https://arxiv.org/abs/2111.03851

Authors: Wenliang Pan, Yujue Li, Jianwu Liu, Weixiong Mai
Affiliations: Department of Mathematics, Sun Yat-sen University; Macao Center for Mathematical Sciences, Macau University of Science and Technology
Abstract: Independence analysis is an indispensable step before regression analysis to find out essential factors that influence the objects. With many applications in machine learning, medical learning and a variety of disciplines, statistical methods of measuring the relationship between random variables have been well studied in vector spaces. However, there are few methods developed to verify the relation between random elements in metric spaces. In this paper, we present a novel index called metric distributional discrepancy (MDD) to measure the dependence between a random element $X$ and a categorical variable $Y$, which is applicable to medical image and genetic data. The metric distributional discrepancy statistic can be considered as the distance between the conditional distribution of $X$ given each class of $Y$ and the unconditional distribution of $X$. MDD enjoys some significant merits compared to other dependence measures. For instance, MDD is zero if and only if $X$ and $Y$ are independent. The MDD test is a distribution-free test since there is no assumption on the distribution of random elements. Furthermore, the MDD test is robust to data with heavy-tailed distributions and potential outliers. We demonstrate the validity of our theory and the properties of the MDD test by several numerical experiments and real data analysis.

【26】 Parametric and nonparametric probability distribution estimators of sample maximum
Link: https://arxiv.org/abs/2111.03765

Authors: Taku Moriyama
Affiliations: Department of Management of Social Systems and Civil Engineering, Tottori University
Abstract: This study concerns probability distribution estimation of the sample maximum. The traditional approach is the parametric fitting to the limiting distribution - the generalized extreme value distribution; however, the model in finite cases is misspecified to a certain extent. We propose a plug-in type of nonparametric estimator which does not need model specification. It is proved that both asymptotic convergence rates depend on the tail index and the second order parameter. As the tail gets light, the degree of misspecification of the parametric fitting becomes large, which means the convergence rate becomes slow. In the Weibull cases, which can be seen as the limit of tail-lightness, only the nonparametric distribution estimator keeps its consistency. Finally, we report the results of numerical experiments.
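
One simple plug-in estimator of the distribution of the maximum of m i.i.d. draws is the empirical CDF raised to the power m; a sketch of that idea (the paper's estimator and its rate analysis are more refined):

```python
import numpy as np

def sample_max_cdf(data, m, x):
    """Plug-in estimate of P(max of m i.i.d. draws <= x): the empirical CDF
    of the data raised to the power m, evaluated at the 1-D grid x."""
    data, x = np.asarray(data), np.asarray(x)
    F_hat = (data[:, None] <= x[None, :]).mean(axis=0)
    return F_hat ** m
```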

【27】 A probabilistic formalization of contextual bias in forensic analysis: Evidence that examiner bias leads to systemic bias in the criminal justice system
Link: https://arxiv.org/abs/2111.03762

Authors: Maria Cuellar, Jacqueline Mauro, Amanda Luby
Affiliations: Department of Criminology, University of Pennsylvania; Google, Inc.; Department of Mathematics and Statistics, Swarthmore College
Comments: 30 pages, 9 figures
Abstract: Although researchers have found evidence of contextual bias in forensic science, the discussion of contextual bias is currently qualitative. We formalize years of empirical research and extend this research by showing quantitatively how biases can be propagated throughout the legal system, all the way up to the final determination of guilt in a criminal trial. We provide a probabilistic framework for describing how information is updated in a forensic analysis setting by using the ratio form of Bayes' rule. We analyze results from empirical studies using our framework and use simulations to demonstrate how bias can be compounded where experiments do not exist. We find that even minor biases in the earlier stages of forensic analysis lead to large, compounded biases in the final determination of guilt in a criminal trial.
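
The ratio form of Bayes' rule referenced here is posterior odds = likelihood ratio x prior odds, so a multiplicative bias applied at each stage compounds geometrically; a toy numerical illustration with hypothetical numbers:

```python
# Chaining the ratio form of Bayes' rule across analysis stages shows how a
# modest multiplicative bias in each stage's likelihood ratio compounds.
prior_odds = 0.2
stage_lrs = [3.0, 2.0, 4.0]      # unbiased evidence strength at each stage
bias = 1.3                       # each stage inflates its LR by 30%

odds_fair, odds_biased = prior_odds, prior_odds
for lr in stage_lrs:
    odds_fair *= lr
    odds_biased *= lr * bias

to_prob = lambda o: o / (1.0 + o)
print(to_prob(odds_fair), to_prob(odds_biased))   # ~0.83 vs ~0.91
```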

【28】 The Role of SARS-CoV-2 Testing on Hospitalizations in California
Link: https://arxiv.org/abs/2111.03761

Authors: J. Cricelio Montesinos-López, Maria L. Daza-Torres, Yury E. García, Luis A. Barboza, Fabio Sanchez, Alec J. Schmidt, Brad H. Pollock, Miriam Nuño
Comments: 17 pages, 6 figures
Abstract: The rapid spread of the new SARS-CoV-2 virus triggered a global health crisis disproportionately impacting people with pre-existing health conditions and particular demographic and socioeconomic characteristics. One of the main concerns of governments has been to avoid overwhelming health systems. For this reason, they have implemented a series of non-pharmaceutical measures to control the spread of the virus, with mass testing being one of the most effective controls. To date, public health officials continue to promote some of these measures, mainly due to delays in mass vaccination and the emergence of new virus strains. In this study, we studied the association between the COVID-19 positivity rate and hospitalization rates at the county level in California using a mixed linear model. The analysis was performed for the three waves of confirmed COVID-19 cases registered in the state up to September 2021. Our findings suggest that the test positivity rate is consistently associated with hospitalization rates at the county level for all waves of study. Demographic factors that seem to be related to higher hospitalization rates changed over time, as the profile of the pandemic impacted different fractions of the population in counties across California.
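
A sketch of a county-level mixed linear model of the kind described, using statsmodels; the file and column names are hypothetical placeholders, not the authors' data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per county-week with hospitalization rate,
# test-positivity rate, and demographic covariates; county as random effect.
df = pd.read_csv("county_weeks.csv")
model = smf.mixedlm("hosp_rate ~ positivity_rate + median_age",
                    data=df, groups=df["county"])
print(model.fit().summary())
```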

【29】 Transformed-linear prediction for extremes
Link: https://arxiv.org/abs/2111.03754

Authors: Jeongjin Lee, Daniel Cooley
Affiliations: Department of Statistics, Colorado State University
Abstract: We consider the problem of performing prediction when observed values are at their highest levels. We construct an inner product space of nonnegative random variables from transformed-linear combinations of independent regularly varying random variables. The matrix of inner products corresponds to the tail pairwise dependence matrix, which summarizes tail dependence. The projection theorem yields the optimal transformed-linear predictor, which has the same form as the best linear unbiased predictor in non-extreme prediction. We also construct prediction intervals based on the geometry of regular variation. We show that these intervals have good coverage in a simulation study as well as in two applications: prediction of high pollution levels, and prediction of large financial losses.
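
A sketch of the projection-theorem predictor form, which mirrors the BLUP weights $b=\Sigma_{oo}^{-1}\sigma_{op}$ computed here from the tail pairwise dependence matrix (assumed given); the paper applies these weights within transformed-linear, nonnegative combinations, which this sketch does not reproduce:

```python
import numpy as np

def transformed_linear_weights(Sigma, obs_idx, pred_idx):
    """Predictor weights of BLUP form b = Sigma_oo^{-1} sigma_op, with Sigma
    playing the role of the tail pairwise dependence matrix."""
    Soo = Sigma[np.ix_(obs_idx, obs_idx)]
    sop = Sigma[np.ix_(obs_idx, [pred_idx])]
    return np.linalg.solve(Soo, sop).ravel()
```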

【30】 A Non-Markovian Model to Assess Contact Tracing for the Containment of COVID-19
Link: https://arxiv.org/abs/2111.03722

Authors: Aram Vajdi, Lee W. Cohnstaedt, Leela E. Noronha, Dana N. Mitzel, William C. Wilson, Caterina M. Scoglio
Abstract: COVID-19 remains a challenging global threat, with ongoing waves of infections and clinical disease which have resulted in millions of deaths and an enormous strain on health systems worldwide. Effective vaccines have been developed for the SARS-CoV-2 virus and administered to billions of people; however, the virus continues to circulate and evolve into new variants for which vaccines may ultimately be less effective. Non-pharmaceutical interventions, such as social distancing, wearing face coverings, and contact tracing, remain important tools, especially at the onset of an outbreak. In this paper, we assess the effectiveness of contact tracing using a non-Markovian, network-based mathematical model. To improve the reliability of the novel model, empirically determined distributions were incorporated for the transition time of model state pairs, such as from exposed to infected states. The first-order closure approximation was used to derive an expression for the epidemic threshold, which is dependent on the number of close contacts. Using survey contact data collected during the 2020 fall academic semester from a university population, we determined that even four to five contacts are sufficient to maintain viral transmission. Additionally, our model reveals that contact tracing can be an effective outbreak mitigation measure by reducing the epidemic size by more than three-fold. Increasing the reliability of epidemic models is critical for accurate public health planning and use as decision support tools. Moving toward more accurate non-Markovian models built upon empirical data is important.

【31】 Compressed spectral screening for large-scale differential correlation analysis with application in selecting Glioblastoma gene modules
Link: https://arxiv.org/abs/2111.03721

Authors: Tianxi Li, Xiwei Tang, Ajay Chatrath
Affiliations: Department of Statistics, University of Virginia; Department of Biochemistry and Molecular Genetics, University of Virginia
Abstract: Differential co-expression analysis has been widely applied by scientists in understanding the biological mechanisms of diseases. However, the unknown differential patterns are often complicated; thus, models based on simplified parametric assumptions can be ineffective in identifying the differences. Meanwhile, the gene expression data involved in such analysis are in extremely high dimensions by nature, whose correlation matrices may not even be computable. Such a large scale seriously limits the application of most well-studied statistical methods. This paper introduces a simple yet powerful approach to the differential correlation analysis problem called compressed spectral screening. By leveraging spectral structures and random sampling techniques, our approach could achieve a highly accurate screening of features with complicated differential patterns while maintaining the scalability to analyze correlation matrices of $10^4$--$10^5$ variables within a few minutes on a standard personal computer. We have applied this screening approach in comparing a TCGA data set about Glioblastoma with normal subjects. Our analysis successfully identifies multiple functional modules of genes that exhibit different co-expression patterns. The findings reveal new insights about Glioblastoma's evolving mechanism. The validity of our approach is also justified by a theoretical analysis, showing that the compressed spectral analysis can achieve variable screening consistency.

【32】 Strong Recovery In Group Synchronization
Link: https://arxiv.org/abs/2111.03705

Authors: Bradley Stich
Abstract: The group synchronization problem is to estimate unknown group elements at the vertices of a graph when given a set of possibly noisy observations of group differences at the edges. We consider the group synchronization problem on finite graphs with size tending to infinity, and we focus on the question of whether the true edge differences can be exactly recovered from the observations (i.e., strong recovery). We prove two main results, one positive and one negative. In the positive direction, we prove that for a sequence of synchronization problems containing the complete digraph along with a relatively well-behaved prior distribution and observation kernel, with high probability we can recover the correct edge labeling. Our negative result provides conditions on a sequence of sparse graphs under which it is impossible to recover the correct edge labeling with high probability.

【33】 The Ball Pit Algorithm: A Markov Chain Monte Carlo Method Based on Path Integrals
Link: https://arxiv.org/abs/2111.03691

Authors: Miguel Fudolig, Reka Howard
Affiliations: Department of Statistics, University of Nebraska-Lincoln
Abstract: The Ball Pit Algorithm (BPA) is a novel Markov chain Monte Carlo (MCMC) algorithm for sampling marginal posterior distributions developed from the path integral formulation of the Bayesian analysis for Markov chains. The BPA yielded comparable results to the Hamiltonian Monte Carlo as implemented by the adaptive No-U-Turn Sampler (NUTS) in sampling posterior distributions for simulated data from Bernoulli and Poisson likelihoods. One major advantage of the BPA is its significantly lower computational time, which was measured to be at least 95% faster than NUTS in analyzing single parameter models. The BPA was also applied to a multi-parameter Cauchy model using real data on the height differences of cross- and self-fertilized plants. The posterior medians for the location parameter were consistent with other Bayesian sampling methods. Additionally, the posterior median for the logarithm of the scale parameter obtained from the BPA was close to the estimated posterior median calculated using the Laplace normal approximation. The computational time of the BPA implementation of the Cauchy analysis is 55% faster compared to that for NUTS. Overall, we have found that the BPA is a highly efficient alternative to the Hamiltonian Monte Carlo and other standard MCMC methods.

【34】 Estimating High Order Gradients of the Data Distribution by Denoising
Link: https://arxiv.org/abs/2111.04726

Authors: Chenlin Meng, Yang Song, Wenzhe Li, Stefano Ermon
Affiliations: Stanford University; Tsinghua University
Comments: NeurIPS 2021
Abstract: The first order derivative of a data density can be estimated efficiently by denoising score matching, and has become an important component in many applications, such as image generation and audio synthesis. Higher order derivatives provide additional local information about the data distribution and enable new applications. Although they can be estimated via automatic differentiation of a learned density model, this can amplify estimation errors and is expensive in high dimensional settings. To overcome these limitations, we propose a method to directly estimate high order derivatives (scores) of a data density from samples. We first show that denoising score matching can be interpreted as a particular case of Tweedie's formula. By leveraging Tweedie's formula on higher order moments, we generalize denoising score matching to estimate higher order derivatives. We demonstrate empirically that models trained with the proposed method can approximate second order derivatives more efficiently and accurately than via automatic differentiation. We show that our models can be used to quantify uncertainty in denoising and to improve the mixing speed of Langevin dynamics via Ozaki discretization for sampling synthetic data and natural images.
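
The first-order case referenced here is ordinary denoising score matching, whose minimizer recovers the score of the noised density via Tweedie's formula; a minimal PyTorch sketch of that loss (the paper's higher-order generalization is not shown):

```python
import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching: perturb x with Gaussian noise of level sigma
    and regress the network onto the score of the perturbation kernel,
    -(x_t - x)/sigma^2 = -noise/sigma."""
    noise = torch.randn_like(x)
    x_t = x + sigma * noise
    return ((score_net(x_t) + noise / sigma) ** 2).mean()
```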

【35】 Universal and data-adaptive algorithms for model selection in linear contextual bandits
Link: https://arxiv.org/abs/2111.04688

Authors: Vidya Muthukumar, Akshay Krishnamurthy
Affiliations: Electrical and Computer Engineering and Industrial and Systems Engineering, Georgia Institute of Technology; Microsoft Research, New York City
Comments: 27 pages
Abstract: Model selection in contextual bandits is an important complementary problem to regret minimization with respect to a fixed model class. We consider the simplest non-trivial instance of model selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem. Even in this instance, current state-of-the-art methods explore in a suboptimal manner and require strong "feature-diversity" conditions. In this paper, we introduce new algorithms that a) explore in a data-adaptive manner, and b) provide model selection guarantees of the form $\mathcal{O}(d^{\alpha} T^{1-\alpha})$ with no feature diversity conditions whatsoever, where $d$ denotes the dimension of the linear model and $T$ denotes the total number of rounds. The first algorithm enjoys a "best-of-both-worlds" property, recovering two prior results that hold under distinct distributional assumptions, simultaneously. The second removes distributional assumptions altogether, expanding the scope for tractable model selection. Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.

【36】 Approximate Neural Architecture Search via Operation Distribution Learning 标题:基于操作分布学习的近似神经结构搜索 链接:https://arxiv.org/abs/2111.04670

作者:Xingchen Wan,Binxin Ru,Pedro M. Esperança,Fabio M. Carlucci 机构:Pedro M. Esperança, Huawei Noah's Ark Lab, London, UK, Machine Learning Research Group, University of Oxford, Oxford, UK 备注:WACV 2022. 10 pages, 3 figures and 5 tables (15 pages, 7 figures and 6 tables including appendices) 摘要:神经架构搜索(NAS)的标准范式是搜索具有特定操作和连接的完全确定性架构。在这项工作中,我们转而提出搜索最优的操作分布,从而提供一个随机的近似解,可用于采样任意长度的架构。我们提出并证明:给定一个架构单元,其性能在很大程度上取决于所用操作的比例,而不是典型搜索空间中的任何特定连接模式;也就是说,操作顺序的微小变化通常无关紧要。这一直觉与任何特定的搜索策略正交,可以应用于多种NAS算法。通过在4个数据集和4种NAS技术(贝叶斯优化、可微搜索、局部搜索和随机搜索)上的广泛验证,我们表明操作分布(1)具有足够的判别能力来可靠地识别解决方案,(2)比传统编码更容易优化,在几乎不损失性能的情况下实现大幅加速。事实上,这一简单的直觉显著降低了现有方法的成本,并有可能使NAS被用于更广泛的应用。 摘要:The standard paradigm in Neural Architecture Search (NAS) is to search for a fully deterministic architecture with specific operations and connections. In this work, we instead propose to search for the optimal operation distribution, thus providing a stochastic and approximate solution, which can be used to sample architectures of arbitrary length. We propose and show, that given an architectural cell, its performance largely depends on the ratio of used operations, rather than any specific connection pattern in typical search spaces; that is, small changes in the ordering of the operations are often irrelevant. This intuition is orthogonal to any specific search strategy and can be applied to a diverse set of NAS algorithms. Through extensive validation on 4 data-sets and 4 NAS techniques (Bayesian optimisation, differentiable search, local search and random search), we show that the operation distribution (1) holds enough discriminating power to reliably identify a solution and (2) is significantly easier to optimise than traditional encodings, leading to large speed-ups at little to no cost in performance. Indeed, this simple intuition significantly reduces the cost of current approaches and potentially enable NAS to be used in a broader range of applications.
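下面是"按操作分布(而非固定连接)采样架构单元"这一思路的极简示意(Python/numpy;操作列表、边数与分布均为假设,并非原文实现):

```python
import numpy as np

# 假设的搜索空间:单元中每条边从这些操作中选择一个
OPS = ["conv3x3", "conv5x5", "max_pool", "skip_connect", "zero"]

def sample_cell(op_probs, num_edges, rng):
    """按给定的操作分布采样一个架构单元;待优化对象是分布本身。"""
    return [OPS[rng.choice(len(OPS), p=op_probs)] for _ in range(num_edges)]

rng = np.random.default_rng(0)
op_probs = np.array([0.4, 0.1, 0.2, 0.25, 0.05])  # 示例分布
cell = sample_cell(op_probs, num_edges=8, rng=rng)
print(cell)  # 操作比例大致服从op_probs;按摘要的论点,顺序的微小变化无关紧要
```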

【37】 Statistical properties of large data sets with linear latent features 标题:具有线性潜在特征的大数据集的统计特性 链接:https://arxiv.org/abs/2111.04641

作者:Philipp Fleig,Ilya Nemenman 机构:Department of Physics & Astronomy, University of Pennsylvania, Philadelphia, PA , USA, Department of Physics, Emory University, Atlanta, GA , USA, Department of Biology, Emory University, Atlanta, GA , USA and 备注:20 pages, 6 figures 摘要:对低维潜在特征如何在大维数据中显示自己的分析理解仍然缺乏。我们通过定义一个由概率矩阵构造的具有加性噪声的线性潜在特征模型,并通过分析和数值计算相关矩阵的成对相关和特征值的统计分布来研究这一点。这使我们能够通过记录变量的数量、观测值、潜在特征和信噪比,在广泛的数据区域内解析潜在特征结构。我们在相关性和特征值的分布中发现了潜在特征的特征印记,并对信号和噪声之间的边界提供了分析估计,即使在没有明显的光谱间隙的情况下。 摘要:Analytical understanding of how low-dimensional latent features reveal themselves in large-dimensional data is still lacking. We study this by defining a linear latent feature model with additive noise constructed from probabilistic matrices, and analytically and numerically computing the statistical distributions of pairwise correlations and eigenvalues of the correlation matrix. This allows us to resolve the latent feature structure across a wide range of data regimes set by the number of recorded variables, observations, latent features and the signal-to-noise ratio. We find a characteristic imprint of latent features in the distribution of correlations and eigenvalues and provide an analytic estimate for the boundary between signal and noise even in the absence of a clear spectral gap.
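下面用numpy给出此类线性潜在特征模型的一个简化模拟示意(载荷用高斯矩阵代替原文的概率矩阵,各参数均为假设),可直观看到潜在特征在相关矩阵特征值分布中的印记:

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_var, n_latent, snr = 2000, 200, 5, 3.0  # 假设的数据区域参数

# 线性潜在特征模型:X = snr * F @ W + 加性噪声
F = rng.normal(size=(n_obs, n_latent))             # 潜在特征
W = rng.normal(size=(n_latent, n_var))             # 载荷(简化替代概率矩阵)
X = snr * F @ W + rng.normal(size=(n_obs, n_var))

# 相关矩阵的特征值分布:潜在特征表现为脱离噪声主体的大特征值
C = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(C)[::-1]
print(eigvals[:8])  # 前n_latent个特征值明显偏大,其余近似噪声谱
```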

【38】 Nonnegative Tensor Completion via Integer Optimization 标题:基于整数优化的非负张量补全 链接:https://arxiv.org/abs/2111.04580

作者:Caleb Bugg,Chen Chen,Anil Aswani 机构:University of California, Berkeley, The Ohio State University 摘要:与矩阵补全不同,迄今为止,张量补全问题还没有任何算法被证明能够达到信息论样本复杂度速率。本文针对非负张量补全这一特殊情形提出了一种新算法。我们证明了该算法在达到信息论速率的同时,以关于数值容差为线性的oracle步数收敛。我们的方法是利用我们所构造的一个特定0-1多面体的规范函数(gauge)来定义非负张量的新范数。由于该范数是用0-1多面体定义的,这意味着我们可以使用整数线性规划来求解多面体上的线性分离问题。我们将这一见解与Frank-Wolfe算法的一个变体相结合来构造数值算法,并通过实验证明了其有效性和可扩展性。 摘要:Unlike matrix completion, no algorithm for the tensor completion problem has so far been shown to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a specific 0-1 polytope that we construct. Because the norm is defined using a 0-1 polytope, this means we can use integer linear programming to solve linear separation problems over the polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through experiments.
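下面给出Frank-Wolfe迭代的一个通用示意(Python;线性最小化oracle以抽象函数代替,并非原文构造的0-1多面体上的整数规划oracle):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, steps=100):
    """通用Frank-Wolfe迭代:lmo是可行域上的线性最小化oracle。

    在本文设定中,lmo对应所构造0-1多面体上用整数线性规划
    求解的线性分离问题;此处以抽象函数代替。"""
    x = x0.copy()
    for t in range(steps):
        s = lmo(grad(x))               # argmin_{s in polytope} <grad, s>
        gamma = 2.0 / (t + 2.0)        # 标准步长
        x = (1 - gamma) * x + gamma * s
    return x

# 玩具用法:在单位单纯形上最小化 ||x - b||^2(单纯形的lmo取最小坐标的顶点)
b = np.array([0.2, 0.5, 0.3])
grad = lambda x: 2 * (x - b)
lmo = lambda g: np.eye(len(g))[np.argmin(g)]
print(frank_wolfe(grad, lmo, x0=np.ones(3) / 3))  # 收敛到b(b在单纯形内)
```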

【39】 Information-Theoretic Bayes Risk Lower Bounds for Realizable Models 标题:可实现模型的信息论Bayes风险下界 链接:https://arxiv.org/abs/2111.04579

作者:Matthew Nokleby,Ahmad Beirami 机构:Lily AI, Mountain View, CA, Facebook AI, Menlo Park, CA 摘要:我们推导了可实现机器学习模型的贝叶斯风险和泛化误差的信息论下界。特别是,我们采用了一种分析方法,其中模型参数的率失真函数给出了为将模型学习到满足贝叶斯风险约束、训练样本与模型参数之间所需互信息的下界。对于可实现的模型,我们证明了率失真函数和互信息都具有便于分析的表达式。对于参数上(大致)满足下Lipschitz条件的模型,我们给出率失真函数的下界;而对于VC类,互信息的上界为$d_{\mathrm{vc}}\log(n)$。当这些条件匹配时,关于零一损失的Bayes风险的尺度不快于$\Omega(d_{\mathrm{vc}}/n)$,这在对数因子内与已知的外界和极大极小下界相匹配。我们还考虑标签噪声的影响,给出训练和/或测试样本被污染时的下界。 摘要:We derive information-theoretic lower bounds on the Bayes risk and generalization error of realizable machine learning models. In particular, we employ an analysis in which the rate-distortion function of the model parameters bounds the required mutual information between the training samples and the model parameters in order to learn a model up to a Bayes risk constraint. For realizable models, we show that both the rate distortion function and mutual information admit expressions that are convenient for analysis. For models that are (roughly) lower Lipschitz in their parameters, we bound the rate distortion function from below, whereas for VC classes, the mutual information is bounded above by $d_{\mathrm{vc}}\log(n)$. When these conditions match, the Bayes risk with respect to the zero-one loss scales no faster than $\Omega(d_{\mathrm{vc}}/n)$, which matches known outer bounds and minimax lower bounds up to logarithmic factors. We also consider the impact of label noise, providing lower bounds when training and/or test samples are corrupted.

【40】 Improved Regularization and Robustness for Fine-tuning in Neural Networks 标题:神经网络精调的改进正则化和鲁棒性 链接:https://arxiv.org/abs/2111.04578

作者:Dongyue Li,Hongyang R. Zhang 备注:22 pages, 6 figures, 11 tables 摘要:一种广泛使用的迁移学习算法是微调,即在目标任务上使用少量标记数据对预先训练的模型进行微调。当预训练模型的容量远远大于目标数据集的大小时,微调容易过拟合并"记忆"训练标签。因此,一个重要的问题是正则化微调并确保其对噪声的鲁棒性。为了解决这个问题,我们首先分析微调的泛化特性。我们提出了一个PAC-Bayes泛化界,它取决于微调过程中每一层参数的移动距离和微调模型的噪声稳定性,并对这些量进行了实证测量。在此分析的基础上,我们提出了正则化自标记——正则化和自标记方法之间的插值,包括(i)分层正则化,以约束每层参数的移动距离;(ii)自标签校正和标签重新加权,以纠正模型有信心的错误标记数据点,并对置信度较低的数据点重新加权。我们使用多种预训练模型架构,在大量图像和文本数据集上验证了我们的方法。对于七个图像分类任务,我们的方法将基线方法平均提高了1.76%,对于少样本分类任务提高了0.75%。当目标数据集包含噪声标签时,在两种噪声设定下,我们的方法比基线方法平均高出3.56%。 摘要:A widely used algorithm for transfer learning is fine-tuning, where a pre-trained model is fine-tuned on a target task with a small amount of labeled data. When the capacity of the pre-trained model is much larger than the size of the target data set, fine-tuning is prone to overfitting and "memorizing" the training labels. Hence, an important question is to regularize fine-tuning and ensure its robustness to noise. To address this question, we begin by analyzing the generalization properties of fine-tuning. We present a PAC-Bayes generalization bound that depends on the distance traveled in each layer during fine-tuning and the noise stability of the fine-tuned model. We empirically measure these quantities. Based on the analysis, we propose regularized self-labeling -- the interpolation between regularization and self-labeling methods, including (i) layer-wise regularization to constrain the distance traveled in each layer; (ii) self label-correction and label-reweighting to correct mislabeled data points (that the model is confident) and reweight less confident data points. We validate our approach on an extensive collection of image and text data sets using multiple pre-trained model architectures. Our approach improves baseline methods by 1.76% (on average) for seven image classification tasks and 0.75% for a few-shot classification task. When the target data set includes noisy labels, our approach outperforms baseline methods by 3.56% on average in two noisy settings.
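下面给出正则化自标记思路的一个极简PyTorch示意(置信度阈值、权重方案与正则化系数均为假设,并非原文的具体实现):

```python
import torch
import torch.nn.functional as F

def regularized_selflabel_loss(model, pretrained, x, y, lam=0.01, conf_thresh=0.9):
    """正则化自标记损失的示意:分层正则化 + 自标签校正/重加权。"""
    logits = model(x)
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)

    # (ii) 自标签校正:高置信样本用模型预测替换可能带噪的标签;
    #      低置信样本按其置信度降权
    labels = torch.where(conf > conf_thresh, pred, y)
    weights = torch.where(conf > conf_thresh, torch.ones_like(conf), conf).detach()
    ce = F.cross_entropy(logits, labels, reduction="none")
    loss = (weights * ce).mean()

    # (i) 分层正则化:约束每一层相对预训练参数的移动距离
    for (_, p), (_, p0) in zip(model.named_parameters(),
                               pretrained.named_parameters()):
        loss = loss + lam * (p - p0.detach()).pow(2).sum()
    return loss
```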

【41】 Mixed-Integer Optimization with Constraint Learning 标题:带约束学习的混合整数优化 链接:https://arxiv.org/abs/2111.04469

作者:Donato Maragno,Holly Wiberg,Dimitris Bertsimas,S. Ilker Birbil,Dick den Hertog,Adejuyigbe Fajemisin 机构:Amsterdam Business School, University of Amsterdam, TV Amsterdam, Netherlands 摘要:我们为带有学习约束的混合整数优化建立了广泛的方法学基础。我们提出了一种用于数据驱动决策的端到端管道,其中使用机器学习直接从数据中学习约束和目标,并将训练好的模型嵌入优化公式中。我们利用了许多机器学习方法(包括线性模型、决策树、集成方法和多层感知器)的混合整数优化可表示性。考虑多种方法使我们能够捕捉决策、上下文变量和结果之间的各种潜在关系。我们还使用观测的凸包来刻画决策信任区域,以确保可信的建议并避免外推,并通过列生成和聚类高效地引入这种表示。结合领域驱动的约束和目标项,嵌入的模型和信任区域定义了一个用于处方生成的混合整数优化问题。我们将此框架实现为一个Python包(OptiCL)供实践者使用。我们在化疗优化和世界粮食计划署规划中演示了该方法。案例研究说明了该框架在生成高质量处方方面的优势、信任区域带来的价值、对多种机器学习方法的融合以及对多个学习约束的纳入。 摘要:We establish a broad methodological foundation for mixed-integer optimization with learned constraints. We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning, and the trained models are embedded in an optimization formulation. We exploit the mixed-integer optimization-representability of many machine learning methods, including linear models, decision trees, ensembles, and multi-layer perceptrons. The consideration of multiple methods allows us to capture various underlying relationships between decisions, contextual variables, and outcomes. We also characterize a decision trust region using the convex hull of the observations, to ensure credible recommendations and avoid extrapolation. We efficiently incorporate this representation using column generation and clustering. In combination with domain-driven constraints and objective terms, the embedded models and trust region define a mixed-integer optimization problem for prescription generation. We implement this framework as a Python package (OptiCL) for practitioners. We demonstrate the method in both chemotherapy optimization and World Food Programme planning. The case studies illustrate the benefit of the framework in generating high-quality prescriptions, the value added by the trust region, the incorporation of multiple machine learning methods, and the inclusion of multiple learned constraints.
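下面用PuLP给出"把学到的约束嵌入混合整数优化"这一思路的极简示意(线性模型系数、阈值与建模方式均为假设,并非OptiCL的真实接口):

```python
import numpy as np
import pulp

# 假设由机器学习拟合得到的线性结果模型 y ≈ w·x + b
w, b = np.array([0.5, -0.2, 0.8]), 0.1
prob = pulp.LpProblem("prescription", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", lowBound=0, upBound=1) for i in range(3)]

prob += pulp.lpSum(x)  # 领域驱动的目标项
# 学到的结果约束:要求预测结果不超过给定阈值
prob += pulp.lpSum(float(w[i]) * x[i] for i in range(3)) + b <= 0.6
# 信任区域(观测凸包)在原文中经列生成与聚类高效引入,此处从略

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([v.value() for v in x], pulp.LpStatus[prob.status])
```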

【42】 There is no Double-Descent in Random Forests 标题:随机森林中不存在双下降现象 链接:https://arxiv.org/abs/2111.04409

作者:Sebastian Buschjäger,Katharina Morik 机构:Artificial Intelligence Group, TU Dortmund, Germany 备注:11 pages, 3 figures, 3 algorithms 摘要:随机森林(RF)是机器学习领域最先进的方法之一,几乎无需参数调整即可提供优异的性能。值得注意的是,RF似乎不受过拟合影响,即使其基本构建块是众所周知容易过拟合的。最近,一项受到广泛关注的研究认为RF呈现所谓的双下降曲线:首先,模型以U形曲线过拟合数据,然后,一旦达到某个模型复杂度,它会突然再次提升性能。在本文中,我们质疑"模型容量是解释RF成功的正确工具"这一观念,并认为训练模型的算法比以前认为的更重要。我们证明RF不呈现双下降曲线,而是呈现单下降曲线,因此它并没有经典意义上的过拟合。我们进一步提出了一种RF变体,尽管其决策边界近似于过拟合决策树(DT)的决策边界,它同样不会过拟合;类似地,我们表明近似RF决策边界的DT仍然会过拟合。最后,我们研究了集成的多样性,以此作为估计其性能的工具。为此,我们引入了负相关森林(NCForest),它允许对集成中的多样性进行精确控制。我们发现,多样性和偏差确实对RF的性能有至关重要的影响。多样性过低会使RF的性能退化为一棵树,而多样性过高则意味着大多数树不再产生正确的输出。然而,在这两个极端之间,我们发现了大量性能大致相当的不同折衷方案。因此,只要算法达到这种良好的折衷区域,偏差与多样性之间的具体折衷并不重要。 摘要:Random Forests (RFs) are among the state-of-the-art in machine learning and offer excellent performance with nearly zero parameter tuning. Remarkably, RFs seem to be impervious to overfitting even though their basic building blocks are well-known to overfit. Recently, a broadly received study argued that a RF exhibits a so-called double-descent curve: First, the model overfits the data in a u-shaped curve and then, once a certain model complexity is reached, it suddenly improves its performance again. In this paper, we challenge the notion that model capacity is the correct tool to explain the success of RF and argue that the algorithm which trains the model plays a more important role than previously thought. We show that a RF does not exhibit a double-descent curve but rather has a single descent. Hence, it does not overfit in the classic sense. We further present a RF variation that also does not overfit although its decision boundary approximates that of an overfitted DT. Similarly, we show that a DT which approximates the decision boundary of a RF will still overfit. Last, we study the diversity of an ensemble as a tool to estimate its performance. To do so, we introduce Negative Correlation Forest (NCForest) which allows for precise control over the diversity in the ensemble. We show, that the diversity and the bias indeed have a crucial impact on the performance of the RF. Having too low diversity collapses the performance of the RF into a single tree, whereas having too much diversity means that most trees do not produce correct outputs anymore. However, in-between these two extremes we find a large range of different trade-offs with all roughly equal performance. Hence, the specific trade-off between bias and diversity does not matter as long as the algorithm reaches this good trade-off regime.
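下面是复现"单下降"现象的一个简单实验示意(sklearn;数据生成与容量扫描方式均为假设):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 沿模型容量(此处用树的数量)扫描测试误差:
# 若不存在双下降,误差曲线应单调下降后趋平,而非先升后再次下降
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

for n_trees in [1, 2, 5, 10, 50, 100, 300]:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(Xtr, ytr)
    print(n_trees, round(1 - rf.score(Xte, yte), 4))  # 测试误差:单下降
```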

【43】 Robust and Information-theoretically Safe Bias Classifier against Adversarial Attacks 标题:抗敌意攻击的稳健且信息论安全的偏差分类器 链接:https://arxiv.org/abs/2111.04404

作者:Lijia Yu,Xiao-Shan Gao 机构:Academy of Mathematics and Systems Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing , China 摘要:本文介绍了偏差分类器,即以ReLU作为激活函数的DNN的偏差部分被用作分类器。这项工作的动机在于,偏差部分是梯度为零的分段常数函数,因此不能被FGSM等基于梯度的对抗样本生成方法直接攻击。我们证明了偏差分类器的存在性,并提出了一种有效的偏差分类器训练方法。我们证明,通过在偏差分类器中加入适当的随机一次项部分,可以得到一个针对原模型梯度攻击在信息论意义上安全的分类器,其含义是该攻击只能产生完全随机的对抗方向。这似乎是首次提出信息论安全分类器的概念。我们还提出了几种针对偏差分类器的攻击方法,数值实验表明,在大多数情况下,偏差分类器对这些攻击比DNN更鲁棒。 摘要:In this paper, the bias classifier is introduced, that is, the bias part of a DNN with Relu as the activation function is used as a classifier. The work is motivated by the fact that the bias part is a piecewise constant function with zero gradient and hence cannot be directly attacked by gradient-based methods to generate adversaries such as FGSM. The existence of the bias classifier is proved and an effective training method for the bias classifier is proposed. It is proved that by adding a proper random first-degree part to the bias classifier, an information-theoretically safe classifier against the original-model gradient-based attack is obtained in the sense that the attack generates a totally random direction for generating adversaries. This seems to be the first time that the concept of information-theoretically safe classifier is proposed. Several attack methods for the bias classifier are proposed and numerical experiments are used to show that the bias classifier is more robust than DNNs against these attacks in most cases.
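下面给出"偏差部分"提取的一个PyTorch示意(网络结构为假设):对ReLU网络,在每个线性区域内 f(x)=J(x)x+b(x),其中 b(x)=f(x)-J(x)x 即梯度为零的分段常数偏差部分:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))

def bias_part(x):
    """提取单个样本x处的偏差部分(示意实现,非原文训练方法)。"""
    J = torch.autograd.functional.jacobian(net, x)  # 形状 (3, 10)
    return net(x) - J @ x                           # 分段常数、梯度为零

x = torch.randn(10)
print(bias_part(x).argmax().item())  # 以偏差部分的最大分量作为分类输出
```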

【44】 Lattice gauge symmetry in neural networks 标题:神经网络中的格点规范对称性 链接:https://arxiv.org/abs/2111.04389

作者:Matteo Favoni,Andreas Ipp,David I. Müller,Daniel Schuh 机构:Institute for Theoretical Physics, TU Wien, Wiedner Hauptstrasse, Tower B, Wien, Austria 备注:10 pages, 3 figures, proceedings for the 38th International Symposium on Lattice Field Theory (LATTICE21) 摘要:我们回顾了一种新的神经网络结构,称为格点规范等变卷积神经网络(L-CNN),它可以应用于格点规范理论中的一般机器学习问题,同时严格保持规范对称性。我们讨论了规范等变性的概念,并利用它显式构造了规范等变卷积层和双线性层。我们使用看似简单的非线性回归任务比较了L-CNN和非等变CNN的性能:与非等变CNN相比,L-CNN表现出良好的泛化能力,并在预测中达到很高的准确性。 摘要:We review a novel neural network architecture called lattice gauge equivariant convolutional neural networks (L-CNNs), which can be applied to generic machine learning problems in lattice gauge theory while exactly preserving gauge symmetry. We discuss the concept of gauge equivariance which we use to explicitly construct a gauge equivariant convolutional layer and a bilinear layer. The performance of L-CNNs and non-equivariant CNNs is compared using seemingly simple non-linear regression tasks, where L-CNNs demonstrate generalizability and achieve a high degree of accuracy in their predictions compared to their non-equivariant counterparts.

【45】 Extension of Correspondence Analysis to multiway data-sets through High Order SVD: a geometric framework 标题:通过高阶奇异值分解将对应分析扩展到多路数据集:一个几何框架 链接:https://arxiv.org/abs/2111.04343

作者:Olivier Coulaud,Alain Franc,Martina Iannacito 摘要:本文从几何角度出发,通过高阶奇异值分解(HOSVD)将对应分析(CA)扩展到张量。对应分析是从主成分分析发展而来的研究列联表的著名工具。多年来,人们提出了不同的CA到多向表的代数扩展,但忽略了它的几何意义。基于Tucker模型和HOSVD,我们提出了一种直接将每个张量模式关联为点云的方法。我们证明了点云是相互关联的。特别是使用CA度量,我们表明重心关系在张量框架中仍然成立。最后,使用两个数据集强调我们的策略相对于经典矩阵方法的优点和缺点。 摘要:This paper presents an extension of Correspondence Analysis (CA) to tensors through High Order Singular Value Decomposition (HOSVD) from a geometric viewpoint. Correspondence analysis is a well-known tool, developed from principal component analysis, for studying contingency tables. Different algebraic extensions of CA to multi-way tables have been proposed over the years, nevertheless neglecting its geometric meaning. Relying on the Tucker model and the HOSVD, we propose a direct way to associate with each tensor mode a point cloud. We prove that the point clouds are related to each other. Specifically using the CA metrics we show that the barycentric relation is still true in the tensor framework. Finally two data sets are used to underline the advantages and the drawbacks of our strategy with respect to the classical matrix approaches.
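下面用numpy给出基础HOSVD的一个示意实现(通用做法,并非原文代码;CA度量加权从略):

```python
import numpy as np

def unfold(T, mode):
    """张量沿指定模式展开为矩阵。"""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T):
    """基础HOSVD:每个模式展开的左奇异向量给出该模式的因子。"""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0]
               for m in range(T.ndim)]
    core = T.copy()
    for m, U in enumerate(factors):  # 核张量 = T ×_1 U1^T ×_2 U2^T ×_3 U3^T
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, m)), 0, m)
    return core, factors

# 在CA的张量推广中,各模式因子经适当度量加权后即给出该模式的点云
T = np.random.default_rng(0).random((4, 5, 6))
core, factors = hosvd(T)
print([U.shape for U in factors], core.shape)
```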

【46】 The Hardness Analysis of Thompson Sampling for Combinatorial Semi-bandits with Greedy Oracle 标题:贪婪oracle下组合半bandit的Thompson采样难度分析 链接:https://arxiv.org/abs/2111.04295

作者:Fang Kong,Yueran Yang,Wei Chen,Shuai Li 机构:Shanghai Jiao Tong University, Microsoft Research 备注:Accepted in NeurIPS, 2021 摘要:汤普森采样(TS)在bandit领域引起了很多兴趣。它于20世纪30年代被提出,但直到最近几年才得到理论上的证明。它在组合多臂bandit(CMAB)设定下的所有分析都需要一个精确oracle来对任何输入给出最优解。然而,这样的oracle通常不可行,因为许多组合优化问题是NP难的,只有近似oracle可用。一个例子(Wang和Chen,2018)表明TS无法使用近似oracle进行学习;但这种oracle并不常见,且只针对特定的问题实例而设计。TS的收敛性分析能否推广到CMAB中精确oracle之外,仍是一个悬而未决的问题。在本文中,我们在贪婪oracle下研究这个问题;贪婪oracle是一种常用的(近似)oracle,对许多(离线)组合优化问题具有理论保证。我们给出了一个与问题相关的、阶为$\Omega(\log T/\Delta^2)$的遗憾下界,以量化TS在贪婪oracle下求解CMAB问题的难度,其中$T$是时间范围,$\Delta$是某个奖励差距。我们还给出了几乎匹配的遗憾上界。这是TS使用常见近似oracle求解CMAB的首个理论结果,并打破了"TS不能与近似oracle一起工作"的误解。 摘要:Thompson sampling (TS) has attracted a lot of interest in the bandit area. It was introduced in the 1930s but has not been theoretically proven until recent years. All of its analysis in the combinatorial multi-armed bandit (CMAB) setting requires an exact oracle to provide optimal solutions with any input. However, such an oracle is usually not feasible since many combinatorial optimization problems are NP-hard and only approximation oracles are available. An example (Wang and Chen, 2018) has shown the failure of TS to learn with an approximation oracle. However, this oracle is uncommon and is designed only for a specific problem instance. It is still an open question whether the convergence analysis of TS can be extended beyond the exact oracle in CMAB. In this paper, we study this question under the greedy oracle, which is a common (approximation) oracle with theoretical guarantees to solve many (offline) combinatorial optimization problems. We provide a problem-dependent regret lower bound of order $\Omega(\log T/\Delta^2)$ to quantify the hardness of TS to solve CMAB problems with greedy oracle, where $T$ is the time horizon and $\Delta$ is some reward gap. We also provide an almost matching regret upper bound. These are the first theoretical results for TS to solve CMAB with a common approximation oracle and break the misconception that TS cannot work with approximation oracles.
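下面给出贪婪oracle下组合半bandit的Thompson采样极简示意(Python;此处的top-k玩具问题使贪婪选择恰为最优,原文关注的是贪婪仅为近似oracle的一般情形):

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, k, T = 10, 3, 5000
mu = rng.uniform(0.1, 0.9, n_arms)       # 真实均值(对学习者未知)
S, F = np.ones(n_arms), np.ones(n_arms)  # 各基础臂的Beta后验参数

for t in range(T):
    theta = rng.beta(S, F)                # TS:从后验采样均值
    action = np.argsort(theta)[-k:]       # 贪婪oracle:逐个选取当前最优臂
    reward = rng.random(k) < mu[action]   # 半bandit反馈:观测每个基础臂
    S[action] += reward
    F[action] += 1 - reward

print(sorted(np.argsort(mu)[-k:]), sorted(action))  # 后验集中于最优超臂
```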

【47】 Identifying Best Fair Intervention 标题:确定最佳公平干预 链接:https://arxiv.org/abs/2111.04272

作者:Ruijiang Gao,Han Feng 摘要:在给定的因果模型下,研究了具有公平约束的最优arm识别问题。其目标是在给定节点上找到一种软干预,在满足公平性约束的同时,仅使用因果模型的部分知识进行反事实估计,从而使结果最大化。这个问题的动机是确保在线市场的公平性。我们提供了错误概率的理论保证,并用两阶段基线实证检验了我们算法的有效性。 摘要:We study the problem of best arm identification with a fairness constraint in a given causal model. The goal is to find a soft intervention on a given node to maximize the outcome while meeting a fairness constraint by counterfactual estimation with only partial knowledge of the causal model. The problem is motivated by ensuring fairness on an online marketplace. We provide theoretical guarantees on the probability of error and empirically examine the effectiveness of our algorithm with a two-stage baseline.

【48】 Exponential GARCH-Ito Volatility Models 标题:指数GARCH-Ito波动率模型 链接:https://arxiv.org/abs/2111.04267

作者:Donggyu Kim 机构:College of Business, Korea Advanced Institute of Science and Technology (KAIST) 备注:36 pages, 7 Figures 摘要:本文引入了一种新的伊藤扩散过程来模拟高频金融数据,该过程通过在连续瞬时波动过程中嵌入具有对数积分波动率的离散时间非线性指数GARCH结构来适应低频波动动力学。该模型的主要特点是,与现有的GARCH-Ito模型不同,瞬时波动过程具有非线性结构,这确保了对数积分波动率具有已实现的GARCH结构。我们称之为指数实现的GARCH-Ito(ERGI)模型。鉴于对数积分波动率的自回归结构,我们提出了参数估计的拟似然估计方法,并建立了其渐近性质。我们进行了模拟研究,以检验所提出模型的有限样本绩效,并对标准普尔500指数中的50项资产进行了实证研究。数值研究表明了新模型的优越性。 摘要:This paper introduces a novel Ito diffusion process to model high-frequency financial data, which can accommodate low-frequency volatility dynamics by embedding the discrete-time non-linear exponential GARCH structure with log-integrated volatility in a continuous instantaneous volatility process. The key feature of the proposed model is that, unlike existing GARCH-Ito models, the instantaneous volatility process has a non-linear structure, which ensures that the log-integrated volatilities have the realized GARCH structure. We call this the exponential realized GARCH-Ito (ERGI) model. Given the auto-regressive structure of the log-integrated volatility, we propose a quasi-likelihood estimation procedure for parameter estimation and establish its asymptotic properties. We conduct a simulation study to check the finite sample performance of the proposed model and an empirical study with 50 assets among the S&P 500 compositions. The numerical studies show the advantages of the new proposed model.

【49】 Representation Learning via Quantum Neural Tangent Kernels 标题:基于量子神经正切核的表示学习 链接:https://arxiv.org/abs/2111.04225

作者:Junyu Liu,Francesco Tacchino,Jennifer R. Glick,Liang Jiang,Antonio Mezzacapo 机构:Pritzker School of Molecular Engineering, The University of Chicago, Chicago, IL, USA, Chicago Quantum Exchange, Chicago, IL, USA, Kadanoff Center for Theoretical Physics, The University of Chicago, Chicago, IL, USA 备注:40=11+29 pages, many figures 摘要:变分量子电路被用于量子机器学习和变分量子模拟任务。如何设计好的变分电路,或预测它们在给定学习或优化任务中的表现,目前仍不清楚。在这里,我们利用神经切线核理论分析变分量子电路,讨论这些问题。我们定义了量子神经切线核,并推导了优化和学习任务中相关损失函数的动力学方程。我们解析地求解了冻结极限(即惰性训练区域)中的动力学,此时变分角变化缓慢,线性微扰已足够好。我们将分析扩展到动力学设定,包括变分角的二次修正。然后我们考虑混合量子-经典体系结构,并为混合核定义了大宽度极限,表明混合量子-经典神经网络可以近似为高斯的。本文给出的结果表明了在哪些极限下,可以对用于量子机器学习和优化问题的变分量子电路的训练动力学进行解析理解。这些解析结果得到了量子机器学习实验数值模拟的支持。 摘要:Variational quantum circuits are used in quantum machine learning and variational quantum simulation tasks. Designing good variational circuits or predicting how well they perform for given learning or optimization tasks is still unclear. Here we discuss these problems, analyzing variational quantum circuits using the theory of neural tangent kernels. We define quantum neural tangent kernels, and derive dynamical equations for their associated loss function in optimization and learning tasks. We analytically solve the dynamics in the frozen limit, or lazy training regime, where variational angles change slowly and a linear perturbation is good enough. We extend the analysis to a dynamical setting, including quadratic corrections in the variational angles. We then consider hybrid quantum-classical architecture and define a large width limit for hybrid kernels, showing that a hybrid quantum-classical neural network can be approximately Gaussian. The results presented here show limits for which analytical understandings of the training dynamics for variational quantum circuits, used for quantum machine learning and optimization problems, are possible. These analytical results are supported by numerical simulations of quantum machine learning experiments.

【50】 Uncertainty Calibration for Ensemble-Based Debiasing Methods 标题:基于集成的去偏方法的不确定度校正 链接:https://arxiv.org/abs/2111.04104

作者:Ruibin Xiong,Yimeng Chen,Liang Pang,Xueqi Chen,Yanyan Lan 机构:CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences ,Baidu Inc., Academy of Mathematics and Systems Science, Chinese Academy of Sciences 备注:None 摘要:基于集成的去偏(debiasing)方法已被证明可以通过利用仅偏差模型(bias-only model)的输出来调整学习目标,从而有效减少分类器对特定数据集偏差的依赖。在本文中,我们重点研究这些基于集成的方法中的仅偏差模型,它起着重要作用,但在现有文献中尚未得到太多关注。理论上,我们证明了仅偏差模型的不确定性估计不准确会损害去偏性能。经验上,我们表明现有的仅偏差模型在产生准确的不确定性估计方面存在不足。基于这些发现,我们建议对仅偏差模型进行校准,从而得到一个基于集成的三阶段去偏框架,包括偏差建模、模型校准和去偏。在NLI和事实验证任务上的实验结果表明,我们提出的三阶段去偏框架在分布外(out-of-distribution)精度上始终优于传统的两阶段框架。 摘要:Ensemble-based debiasing methods have been shown effective in mitigating the reliance of classifiers on specific dataset bias, by exploiting the output of a bias-only model to adjust the learning target. In this paper, we focus on the bias-only model in these ensemble-based methods, which plays an important role but has not gained much attention in the existing literature. Theoretically, we prove that the debiasing performance can be damaged by inaccurate uncertainty estimations of the bias-only model. Empirically, we show that existing bias-only models fall short in producing accurate uncertainty estimations. Motivated by these findings, we propose to conduct calibration on the bias-only model, thus achieving a three-stage ensemble-based debiasing framework, including bias modeling, model calibrating, and debiasing. Experimental results on NLI and fact verification tasks show that our proposed three-stage debiasing framework consistently outperforms the traditional two-stage one in out-of-distribution accuracy.
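下面给出对仅偏差模型进行校准的一种常见做法——温度缩放——的示意(Python/scipy;原文摘要未指明具体校准器,此处仅作演示,数据为随机生成):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def calibrate_temperature(logits, labels):
    """温度缩放:在验证集上最小化NLL求一个标量温度。"""
    def nll(T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)  # 数值稳定的log-softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

logits = np.random.default_rng(0).normal(size=(100, 3)) * 5  # 假设为过自信输出
labels = np.random.default_rng(1).integers(0, 3, size=100)
print(calibrate_temperature(logits, labels))  # T>1 表示需软化过自信的估计
```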

【51】 Iterative Causal Discovery in the Possible Presence of Latent Confounders and Selection Bias 标题:潜在混杂因素和选择偏差可能存在的迭代因果发现 链接:https://arxiv.org/abs/2111.04095

作者:Raanan Y. Rohekar,Shami Nisimov,Yaniv Gurwicz,Gal Novik 机构:Intel Labs 备注:35th Conference on Neural Information Processing Systems (NeurIPS 2021). arXiv admin note: text overlap with arXiv:2012.07513 摘要:我们提出了一个可靠且完备的算法,称为迭代因果发现(ICD),用于在存在潜在混杂因素和选择偏差的情况下恢复因果图。ICD依赖因果马尔可夫假设和忠实性假设,恢复底层因果图的等价类。它从一个完全图开始,由单个迭代阶段组成,该阶段通过识别相连节点之间的条件独立性(CI)来逐步细化该图。任何一轮迭代后得到的独立性和因果关系都是正确的,这使ICD具有anytime性质。本质上,我们将CI条件集的大小与其在图上到被检验节点的距离联系起来,并在后续迭代中增大该值。因此,每次迭代都细化由先前使用更小条件集(具有更高统计功效)的迭代所恢复的图,这有助于提高稳定性。我们通过实验证明,与FCI、FCI+和RFCI算法相比,ICD需要的CI检验显著更少,并能学习到更准确的因果图。 摘要:We present a sound and complete algorithm, called iterative causal discovery (ICD), for recovering causal graphs in the presence of latent confounders and selection bias. ICD relies on the causal Markov and faithfulness assumptions and recovers the equivalence class of the underlying causal graph. It starts with a complete graph, and consists of a single iterative stage that gradually refines this graph by identifying conditional independence (CI) between connected nodes. Independence and causal relations entailed after any iteration are correct, rendering ICD anytime. Essentially, we tie the size of the CI conditioning set to its distance on the graph from the tested nodes, and increase this value in the successive iteration. Thus, each iteration refines a graph that was recovered by previous iterations having smaller conditioning sets -- a higher statistical power -- which contributes to stability. We demonstrate empirically that ICD requires significantly fewer CI tests and learns more accurate causal graphs compared to FCI, FCI+, and RFCI algorithms.
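下面给出"从完全图出发、逐轮增大条件集来细化骨架"这一思想的简化示意(Python;使用高斯假设下的Fisher-z偏相关CI检验,省略了ICD按图上距离限制条件集以及处理潜在混杂与选择偏差的部分):

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def ci_test(data, i, j, S, alpha=0.01):
    """基于偏相关的Fisher-z条件独立性检验;返回True表示判为独立。"""
    idx = [i, j] + list(S)
    P = np.linalg.inv(np.cov(data[:, idx], rowvar=False))  # 精度矩阵
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])              # 偏相关系数
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(S) - 3)
    return 2 * (1 - norm.cdf(abs(z))) > alpha

def iterative_skeleton(data, max_r=3):
    """从完全图出发,每轮用更大的条件集(更高统计功效)细化骨架。"""
    n = data.shape[1]
    adj = {(i, j) for i in range(n) for j in range(n) if i < j}
    for r in range(max_r + 1):
        for (i, j) in sorted(adj):
            others = [k for k in range(n) if k not in (i, j)]
            if any(ci_test(data, i, j, S) for S in combinations(others, r)):
                adj.discard((i, j))
        # 每轮结束后保留的关系在后续轮次中保持正确(anytime性质)
    return adj

X = np.random.default_rng(0).normal(size=(2000, 4))
X[:, 1] += X[:, 0]; X[:, 2] += X[:, 1]   # 链式结构 0→1→2,变量3独立
print(iterative_skeleton(X, max_r=2))     # 期望恢复 {(0, 1), (1, 2)}
```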

【52】 Sampling from Log-Concave Distributions with Infinity-Distance Guarantees and Applications to Differentially Private Optimization 标题:具有无穷距离保证的对数凹分布采样及其在差分隐私优化中的应用 链接:https://arxiv.org/abs/2111.04089

作者:Oren Mangoubi,Nisheeth K. Vishnoi 机构:Worcester Polytechnic Institute, Yale University 摘要:对于多面体$K$上的$d$维对数凹分布$\pi(\theta)\propto e^{-f(\theta)}$,我们考虑输出来自某分布$\nu$的样本的问题,其中$\nu$在无穷距离$\sup_{\theta\in K}|\log\frac{\nu(\theta)}{\pi(\theta)}|$意义下$O(\varepsilon)$-接近$\pi$。这种具有无穷距离保证的采样器对差分隐私优化尤为重要,因为带有总变差距离或KL散度界的传统采样算法不足以保证差分隐私。我们的主要结果是一个算法,它输出一个来自在无穷距离上$O(\varepsilon)$-接近$\pi$的分布的点,需要$O((md+dL^2R^2)\times(LR+d\log(\frac{Rd+LRd}{\varepsilon r}))\times md^{\omega-1})$次算术运算,其中$f$是$L$-Lipschitz的,$K$由$m$个不等式定义、包含在半径为$R$的球中并包含一个半径较小的球$r$,$\omega$是矩阵乘法常数。特别地,该运行时间关于$\frac{1}{\varepsilon}$是对数的,并显著改进了先前的工作。在技术上,我们不同于以往在$K$的$\frac{1}{\varepsilon^2}$-离散化上构造马尔可夫链以获得$O(\varepsilon)$无穷距离误差样本的做法,而是提出一种把$K$上具有总变差界的连续样本转换为具有无穷距离界样本的方法。为了改善对$d$的依赖,我们提出了Dikin walk的一个"软阈值"版本,它可能具有独立的意义。将我们的算法嵌入指数机制的框架,可以在Lipschitz凸函数的经验风险最小化和低秩近似等优化问题的$\varepsilon$-纯差分隐私算法的运行时间上获得类似的改进,同时仍然达到已知的最紧效用界。 摘要:For a $d$-dimensional log-concave distribution $\pi(\theta)\propto e^{-f(\theta)}$ on a polytope $K$, we consider the problem of outputting samples from a distribution $\nu$ which is $O(\varepsilon)$-close in infinity-distance $\sup_{\theta\in K}|\log\frac{\nu(\theta)}{\pi(\theta)}|$ to $\pi$. Such samplers with infinity-distance guarantees are specifically desired for differentially private optimization as traditional sampling algorithms which come with total-variation distance or KL divergence bounds are insufficient to guarantee differential privacy. Our main result is an algorithm that outputs a point from a distribution $O(\varepsilon)$-close to $\pi$ in infinity-distance and requires $O((md+dL^2R^2)\times(LR+d\log(\frac{Rd+LRd}{\varepsilon r}))\times md^{\omega-1})$ arithmetic operations, where $f$ is $L$-Lipschitz, $K$ is defined by $m$ inequalities, is contained in a ball of radius $R$ and contains a ball of smaller radius $r$, and $\omega$ is the matrix-multiplication constant. In particular this runtime is logarithmic in $\frac{1}{\varepsilon}$ and significantly improves on prior works. Technically, we depart from the prior works that construct Markov chains on a $\frac{1}{\varepsilon^2}$-discretization of $K$ to achieve a sample with $O(\varepsilon)$ infinity-distance error, and present a method to convert continuous samples from $K$ with total-variation bounds to samples with infinity bounds. To achieve improved dependence on $d$, we present a "soft-threshold" version of the Dikin walk which may be of independent interest. Plugging our algorithm into the framework of the exponential mechanism yields similar improvements in the running time of $\varepsilon$-pure differentially private algorithms for optimization problems such as empirical risk minimization of Lipschitz-convex functions and low-rank approximation, while still achieving the tightest known utility bounds.

【53】 Open-Set Crowdsourcing using Multiple-Source Transfer Learning 标题:基于多源迁移学习的开集众包 链接:https://arxiv.org/abs/2111.04073

作者:Guangyang Han,Guoxian Yu,Lei Liu,Lizhen Cui,Carlotta Domeniconi,Xiangliang Zhang 机构:College of Computer and Information Sciences, Southwest University, China, School of Software, Shandong University, China, Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, Jinan, China 备注:8 pages, 1 figures 摘要:我们提出并定义了一个新的众包场景——开集众包:我们只知道一个不熟悉的众包项目的总体主题,而不知道它的标签空间,即可能标签的集合。这仍然是一个任务标注问题,但对任务和标签空间的不熟悉妨碍了对任务和工作者的建模以及真值推断。我们提出了一个直观的解决方案OSCrowd。首先,OSCrowd将与众包主题相关的数据集整合为一个大的源域,以便通过部分迁移学习来近似推断这些任务的标签空间。接下来,它根据类别相关性为每个源域分配权重。在此之后,它使用多源开集迁移学习来建模众包任务并给出可能的标注。迁移学习给出的标签空间和标注将用于指导和规范众包工作者的标注。我们在一个在线场景中验证了OSCrowd,并证明OSCrowd解决了开集众包问题,比相关的众包解决方案效果更好。 摘要:We raise and define a new crowdsourcing scenario, open set crowdsourcing, where we only know the general theme of an unfamiliar crowdsourcing project, and we don't know its label space, that is, the set of possible labels. This is still a task annotating problem, but the unfamiliarity with the tasks and the label space hampers the modelling of the task and of workers, and also the truth inference. We propose an intuitive solution, OSCrowd. First, OSCrowd integrates crowd theme related datasets into a large source domain to facilitate partial transfer learning to approximate the label space inference of these tasks. Next, it assigns weights to each source domain based on category correlation. After this, it uses multiple-source open set transfer learning to model crowd tasks and assign possible annotations. The label space and annotations given by transfer learning will be used to guide and standardize crowd workers' annotations. We validate OSCrowd in an online scenario, and prove that OSCrowd solves the open set crowdsourcing problem, works better than related crowdsourcing solutions.

【54】 Positivity Validation Detection and Explainability via Zero Fraction Multi-Hypothesis Testing and Asymmetrically Pruned Decision Trees 标题:基于零分式多假设检验和非对称修剪决策树的正性验证检测与可解释性 链接:https://arxiv.org/abs/2111.04033

作者:Guy Wolf,Gil Shabat,Hanan Shteingart 机构:Vianai Systems 备注:Talk accepted to Causal Data Science Meeting, 2021 摘要:正性是从观测数据进行因果推断的三个条件之一。验证正性的标准方法是分析倾向得分的分布。然而,为了让非专家也能进行因果推断,需要设计一种算法来(i)检验正性,并(ii)解释协变量空间中哪些地方缺乏正性。后者可用于提示进一步因果分析的局限性,和/或鼓励在违反正性处开展实验。本文的贡献首先是提出自动正性分析问题,其次是提出一种基于两步过程的算法。第一步对以协变量为条件的倾向得分建模,然后使用多重假设检验分析该分布,以生成正性违反标签。第二步使用非对称修剪的决策树进行解释,并进一步将其转写为非专家能够理解的可读文本。我们在一家大型软件企业的专有数据集上演示了我们的方法。 摘要:Positivity is one of the three conditions for causal inference from observational data. The standard way to validate positivity is to analyze the distribution of propensity. However, to democratize the ability to do causal inference by non-experts, it is required to design an algorithm to (i) test positivity and (ii) explain where in the covariate space positivity is lacking. The latter could be used to either suggest the limitation of further causal analysis and/or encourage experimentation where positivity is violated. The contribution of this paper is to first present the problem of automatic positivity analysis and second to propose an algorithm based on a two-step process. The first step models the propensity conditioned on the covariates and then analyzes the latter distribution using multiple hypothesis testing to create positivity violation labels. The second step uses asymmetrically pruned decision trees for explainability. The latter is further converted into readable text that a non-expert can understand. We demonstrate our method on a proprietary data-set of a large software enterprise.
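下面给出该两步过程的一个极简示意(sklearn;用简单阈值代替原文的零比例多假设检验,决策树也未做非对称修剪,数据与阈值均为假设):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
t = (rng.random(5000) < 1 / (1 + np.exp(-3 * X[:, 0]))).astype(int)  # 处理分配

# 第一步:建模以协变量为条件的倾向得分,并标记违反正性的样本
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
violation = ((ps < 0.05) | (ps > 0.95)).astype(int)

# 第二步:用浅决策树解释违反发生在协变量空间的哪里,可转写为可读文本
tree = DecisionTreeClassifier(max_depth=2).fit(X, violation)
print(export_text(tree, feature_names=["x0", "x1", "x2"]))  # 规则集中在x0的极端区域
```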

【55】 Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis 标题:用谱分析理解深度神经网络中的分层贡献 链接:https://arxiv.org/abs/2111.03972

作者:Yatin Dandi,Arthur Jacot 机构:IIT Kanpur, India, Ecole Polytechnique Fédérale de Lausanne, Switzerland, Chair of Statistical Field Theory 摘要:谱分析是一种强大的工具,可以将任何函数分解为更简单的部分。在机器学习中,Mercer定理推广了这一思想,为任何核和输入分布提供了一组由频率递增的函数构成的自然基。最近,一些工作通过神经切线核的框架将这种分析扩展到深度神经网络。在这项工作中,我们分析了深度神经网络的分层谱偏差,并将其与不同层在降低给定目标函数的泛化误差中的贡献联系起来。我们利用Hermite多项式和球谐函数的性质证明,初始层对定义在单位球面上的高频函数表现出更大的偏差。我们进一步在高维数据集上给出了验证我们理论的深度神经网络实证结果。 摘要:Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and spherical harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks.

【56】 Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning 标题:风险敏感强化学习的指数Bellman方程和改进的遗憾界 链接:https://arxiv.org/abs/2111.03947

作者:Yingjie Fei,Zhuoran Yang,Yudong Chen,Zhaoran Wang 机构: Northwestern University, Princeton University, University of Wisconsin-Madison 摘要:我们研究了基于熵风险测度的风险敏感强化学习(RL)。虽然现有的工作已经为这个问题建立了非渐近后悔保证,但它们在上界和下界之间留下了一个指数差距。我们确定了现有算法及其分析中的缺陷,这些缺陷导致了这种差距。为了弥补这些不足,我们研究了风险敏感Bellman方程的简单变换,我们称之为指数Bellman方程。指数Bellman方程激励我们对风险敏感RL算法中的Bellman备份过程进行新的分析,并进一步推动新探索机制的设计。我们表明,这些分析和算法创新一起导致改进的后悔上界。 摘要:We study risk-sensitive reinforcement learning (RL) based on the entropic risk measure. Although existing works have established non-asymptotic regret guarantees for this problem, they leave open an exponential gap between the upper and lower bounds. We identify the deficiencies in existing algorithms and their analysis that result in such a gap. To remedy these deficiencies, we investigate a simple transformation of the risk-sensitive Bellman equations, which we call the exponential Bellman equation. The exponential Bellman equation inspires us to develop a novel analysis of Bellman backup procedures in risk-sensitive RL algorithms, and further motivates the design of a novel exploration mechanism. We show that these analytic and algorithmic innovations together lead to improved regret upper bounds over existing ones.
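下面以LaTeX给出熵风险值函数及其"指数Bellman方程"变换的示意(有限期设定,记号为本条笔记的假设):

```latex
% 熵风险测度下的值函数(风险参数 \beta \neq 0):
\begin{align}
V_h(s) &= \frac{1}{\beta}\log \mathbb{E}\Big[\exp\big(\beta\,(r_h(s,a) + V_{h+1}(s'))\big)\Big].
\end{align}
% 两边取指数即得指数Bellman方程,备份作用于 \exp(\beta V) 而非 V:
\begin{align}
\exp\big(\beta V_h(s)\big) &= \mathbb{E}\Big[\exp\big(\beta\,(r_h(s,a) + V_{h+1}(s'))\big)\Big].
\end{align}
```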

【57】 A Probit Tensor Factorization Model For Relational Learning 标题:一种用于关系学习的Probit张量分解模型 链接:https://arxiv.org/abs/2111.03943

作者:Ye Liu,Rui Song,Wenbin Lu 机构:Department of Statistics, North Carolina State University, and 备注:30 pages 摘要:随着知识图的激增,具有复杂多关系结构的数据建模在统计关系学习领域受到越来越多的关注。统计关系学习最重要的目标之一是链接预测,即预测知识图中是否存在某些关系。大量的模型和算法被提出来进行链路预测,其中张量因子分解方法在计算效率和预测精度方面达到了最先进的水平。然而,现有张量因子分解模型的一个共同缺点是,缺失关系和不存在关系被以相同的方式处理,这导致信息丢失。为了解决这个问题,我们提出了一个带有probit链接的二元张量分解模型,它不仅继承了经典张量分解模型的计算效率,而且还考虑了关系数据的二元性。我们提出的probit张量因子分解(PTF)模型在预测精度和可解释性方面都显示出优势 摘要:With the proliferation of knowledge graphs, modeling data with complex multirelational structure has gained increasing attention in the area of statistical relational learning. One of the most important goals of statistical relational learning is link prediction, i.e., predicting whether certain relations exist in the knowledge graph. A large number of models and algorithms have been proposed to perform link prediction, among which tensor factorization method has proven to achieve state-of-the-art performance in terms of computation efficiency and prediction accuracy. However, a common drawback of the existing tensor factorization models is that the missing relations and non-existing relations are treated in the same way, which results in a loss of information. To address this issue, we propose a binary tensor factorization model with probit link, which not only inherits the computation efficiency from the classic tensor factorization model but also accounts for the binary nature of relational data. Our proposed probit tensor factorization (PTF) model shows advantages in both the prediction accuracy and interpretability

【58】 AGGLIO: Global Optimization for Locally Convex Functions 标题:AGGLIO:局部凸函数的全局优化 链接:https://arxiv.org/abs/2111.03932

作者:Debojyoti Dey,Bhaskar Mukhoty,Purushottam Kar 机构:IIT Kanpur 备注:33 pages, 7 figures, to appear at 9th ACM IKDD Conference on Data Science (CODS) 2022. Code for AGGLIO is available at this https URL 摘要:本文介绍了AGGLIO(加速分级广义线性模型优化),这是一种分阶段的分级优化技术,为目标仅具有局部凸性、在全局尺度上甚至可能不具有拟凸性的非凸优化问题提供全局收敛保证。特别地,这包括使用sigmoid、softplus和SiLU等流行激活函数、从而产生非凸训练目标的学习问题。AGGLIO可以很容易地使用逐点及小批量SGD更新实现,并在一般条件下提供可证明的全局最优收敛性。在实验中,AGGLIO在收敛速度和收敛精度方面均优于最近提出的几种针对非凸和局部凸目标的优化技术。AGGLIO依赖于针对广义线性模型的分级技术以及一种新颖的证明策略,两者可能都具有独立的意义。 摘要:This paper presents AGGLIO (Accelerated Graduated Generalized LInear-model Optimization), a stage-wise, graduated optimization technique that offers global convergence guarantees for non-convex optimization problems whose objectives offer only local convexity and may fail to be even quasi-convex at a global scale. In particular, this includes learning problems that utilize popular activation functions such as sigmoid, softplus and SiLU that yield non-convex training objectives. AGGLIO can be readily implemented using point as well as mini-batch SGD updates and offers provable convergence to the global optimum in general conditions. In experiments, AGGLIO outperformed several recently proposed optimization techniques for non-convex and locally convex objectives in terms of convergence rate as well as convergent accuracy. AGGLIO relies on a graduation technique for generalized linear models, as well as a novel proof strategy, both of which may be of independent interest.
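下面给出分阶段分级优化思想的一个通用示意(Python;平滑方式、阶段划分与玩具目标均为假设,并非AGGLIO的原始算法):

```python
import numpy as np

def graduated_descent(grad_smoothed, x0, sigmas=(8.0, 4.0, 2.0, 1.0, 0.5),
                      lr=0.1, steps=500):
    """分级优化示意:从强平滑(大sigma,目标近似凸)开始做梯度下降,
    逐级减小平滑度并用上一阶段的解热启动。"""
    x = np.asarray(x0, dtype=float)
    for sigma in sigmas:          # 逐级"毕业"到原始(非凸)目标
        for _ in range(steps):
            x -= lr * grad_smoothed(x, sigma)
    return x

# 玩具例子:sigmoid链路的广义线性模型,高斯平滑用扰动梯度的抽样近似
rng = np.random.default_rng(0)
w_star = np.array([1.5, -2.0])
X = rng.normal(size=(1000, 2)); y = 1 / (1 + np.exp(-X @ w_star))

def grad_smoothed(w, sigma, m=8):
    ws = w + sigma * rng.normal(size=(m, 2))   # 平滑:对扰动参数取平均
    preds = 1 / (1 + np.exp(-X @ ws.T))        # 形状 (1000, m)
    g = X.T @ ((preds - y[:, None]) * preds * (1 - preds)) / len(X)
    return g.mean(axis=1)

print(graduated_descent(grad_smoothed, x0=np.zeros(2)))  # 大致收敛到w_star附近
```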

【59】 Geodesic curves in Gaussian random field manifolds 标题:高斯随机场流形中的测地曲线 链接:https://arxiv.org/abs/2111.03905

作者:Alexandre L. M. Levada 机构:Computing Department, Federal Universisty of S˜ao Carlos, SP, Brazil 备注:28 pages, 7 figures. arXiv admin note: text overlap with arXiv:2109.09204 摘要:随机场是一种数学结构,用于模拟随时间变化的随机变量的空间相互作用,其应用范围从统计物理和热力学到系统生物学和复杂系统的模拟。尽管自19世纪以来一直在研究随机场,但对于随机场的动力学如何与其参数空间的几何特性相关,我们知之甚少。例如,我们如何使用内在度量来量化在不同区域中运行的两个随机场之间的相似性?本文提出了一种计算高斯随机场流形测地距离的数值方法。首先,我们推导基本参数空间(3 x 3一阶Fisher信息矩阵)的度量张量,然后我们推导非线性微分方程组定义中所需的27个Christoffel符号,其解是从初始条件开始的测地曲线。采用四阶Runge-Kutta方法对非线性系统进行迭代求解。结果表明,该方法可以在几种不同的初始条件下估计测地距离。此外,结果还揭示了一个有趣的模式:在某些情况下,通过对微分方程组进行时间反转获得的测地曲线与原始曲线不匹配,这表明沿测地曲线移动的运动参考轨迹中存在不可逆几何变形。 摘要:Random fields are mathematical structures used to model the spatial interaction of random variables along time, with applications ranging from statistical physics and thermodynamics to system's biology and the simulation of complex systems. Despite being studied since the 19th century, little is known about how the dynamics of random fields are related to the geometric properties of their parametric spaces. For example, how can we quantify the similarity between two random fields operating in different regimes using an intrinsic measure? In this paper, we propose a numerical method for the computation of geodesic distances in Gaussian random field manifolds. First, we derive the metric tensor of the underlying parametric space (the 3 x 3 first-order Fisher information matrix), then we derive the 27 Christoffel symbols required in the definition of the system of non-linear differential equations whose solution is a geodesic curve starting at the initial conditions. The fourth-order Runge-Kutta method is applied to numerically solve the non-linear system through an iterative approach. The obtained results show that the proposed method can estimate the geodesic distances for several different initial conditions. Besides, the results reveal an interesting pattern: in several cases, the geodesic curve obtained by reversing the system of differential equations in time does not match the original curve, suggesting the existence of irreversible geometric deformations in the trajectory of a moving reference traveling along a geodesic curve.
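下面给出用四阶Runge-Kutta数值求解测地方程的通用示意(Python;Christoffel符号以抽象函数代替,原文中它们由高斯随机场的3x3 Fisher信息度量推导):

```python
import numpy as np

def geodesic_rhs(state, christoffel):
    """测地方程 d^2θ/dt^2 = -Γ^k_{ij}(θ) dθ^i dθ^j 化为一阶系统。"""
    theta, vel = state
    G = christoffel(theta)                     # 形状 (d, d, d) 的 Γ^k_{ij}
    acc = -np.einsum("kij,i,j->k", G, vel, vel)
    return (vel, acc)

def rk4_geodesic(theta0, vel0, christoffel, h=0.01, steps=1000):
    """四阶Runge-Kutta积分,从初始条件出发生成测地曲线。"""
    state = (np.asarray(theta0, float), np.asarray(vel0, float))
    path = [state[0].copy()]
    for _ in range(steps):
        k1 = geodesic_rhs(state, christoffel)
        k2 = geodesic_rhs((state[0] + h/2*k1[0], state[1] + h/2*k1[1]), christoffel)
        k3 = geodesic_rhs((state[0] + h/2*k2[0], state[1] + h/2*k2[1]), christoffel)
        k4 = geodesic_rhs((state[0] + h*k3[0], state[1] + h*k3[1]), christoffel)
        state = (state[0] + h/6*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0]),
                 state[1] + h/6*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1]))
        path.append(state[0].copy())
    return np.array(path)

# 平坦度量(Γ=0)下测地线是直线,可作正确性检查
flat = lambda theta: np.zeros((len(theta),) * 3)
print(rk4_geodesic([0.0, 0.0], [1.0, 0.5], flat, steps=3))
```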

【60】 Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems 标题:非定常线性动态系统控制的动态遗憾最小化 链接:https://arxiv.org/abs/2111.03772

作者:Yuwei Luo,Varun Gupta,Mladen Kolar 摘要:我们考虑在有限时间范围$T$内控制线性二次调节器(LQR)系统的问题,其成本矩阵$Q,R$固定且已知,而动力学$\{A_t,B_t\}$未知且非平稳。动力学矩阵序列可以是任意的,但其总变差$V_T$假设为$o(T)$且对控制器未知。在假设对所有$t$都存在一列稳定(但可能次优)控制器的情况下,我们提出了一种算法,实现$\tilde{\mathcal{O}}\left(V_T^{2/5}T^{3/5}\right)$的最优动态遗憾。对于分段常数动力学,我们的算法实现$\tilde{\mathcal{O}}(\sqrt{ST})$的最优遗憾,其中$S$是切换次数。我们算法的关键是一种自适应非平稳性检测策略,它建立在最近为上下文多臂bandit问题开发的方法之上。我们还论证了,对于LQR问题,非自适应遗忘(例如重启,或使用固定窗口大小的滑动窗口学习)可能不是遗憾最优的,即使窗口大小借助$V_T$的知识进行了最优调节。分析我们算法的主要技术挑战是证明当待估计参数非平稳时,普通最小二乘(OLS)估计量具有小偏差。我们的分析还强调,驱动遗憾的关键主题是:LQR问题本质上是一个具有线性反馈和局部二次成本的bandit问题。这一主题比LQR问题本身更具普遍性,因此我们相信我们的结果应有更广泛的应用。 摘要:We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q,R$, but unknown and non-stationary dynamics $\{A_t, B_t\}$. The sequence of dynamics matrices can be arbitrary, but with a total variation, $V_T$, assumed to be $o(T)$ and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $\tilde{\mathcal{O}}\left(V_T^{2/5}T^{3/5}\right)$. With piece-wise constant dynamics, our algorithm achieves the optimal regret of $\tilde{\mathcal{O}}(\sqrt{ST})$ where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.

【61】 Sharp Bounds for Federated Averaging (Local SGD) and Continuous Perspective 标题:联合平均(局部SGD)和连续透视的锐界 链接:https://arxiv.org/abs/2111.03741

作者:Margalit Glasgow,Honglin Yuan,Tengyu Ma 机构:Stanford University 备注:47 pages. First two authors contributed equally 摘要:联邦平均(FedAvg),也称为局部SGD,是联邦学习(FL)中最流行的算法之一。尽管FedAvg简单且受欢迎,但其收敛速度至今尚未确定。即使在最简单的假设(凸、光滑、齐次和有界协方差)下,最著名的上界和下界也不匹配,并且不清楚现有分析是否捕获了算法的容量。在这项工作中,我们首先通过提供与现有上界匹配的FedAvg下界来解决这个问题,这表明现有的FedAvg上界分析是不可改进的。此外,我们在异构环境中建立了一个与现有上限几乎匹配的下限。虽然我们的下界显示了FedAvg的局限性,但在附加的三阶光滑性假设下,我们证明了在凸和非凸环境下更乐观的最新收敛结果。我们的分析源自一个我们称之为迭代偏差的概念,该概念由SGD轨迹的期望值与相同初始化的无噪声梯度下降轨迹的偏差来定义。我们证明了这个量的新的锐界,并直观地展示了如何从随机微分方程(SDE)的角度分析这个量。 摘要:Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has thus far been undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is not clear whether the existing analysis captures the capacity of the algorithm. In this work, we first resolve this question by providing a lower bound for FedAvg that matches the existing upper bound, which shows the existing FedAvg upper bound analysis is not improvable. Additionally, we establish a lower bound in a heterogeneous setting that nearly matches the existing upper bound. While our lower bounds show the limitations of FedAvg, under an additional assumption of third-order smoothness, we prove more optimistic state-of-the-art convergence results in both convex and non-convex settings. Our analysis stems from a notion we call iterate bias, which is defined by the deviation of the expectation of the SGD trajectory from the noiseless gradient descent trajectory with the same initialization. We prove novel sharp bounds on this quantity, and show intuitively how to analyze this quantity from a Stochastic Differential Equation (SDE) perspective.
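下面给出FedAvg/局部SGD的极简模拟示意(Python/numpy;同质二次损失玩具设定,客户端数、局部步数与学习率均为假设):

```python
import numpy as np

def fedavg(clients, w0, rounds=50, local_steps=10, lr=0.1):
    """FedAvg/局部SGD:各客户端做若干步局部SGD后由服务器取平均。"""
    w = np.asarray(w0, float)
    for _ in range(rounds):
        local = []
        for X, y in clients:
            wk = w.copy()
            for _ in range(local_steps):       # 局部更新引入"迭代偏差"
                i = np.random.randint(len(y))  # 单样本随机梯度(二次损失)
                wk -= lr * (X[i] @ wk - y[i]) * X[i]
            local.append(wk)
        w = np.mean(local, axis=0)             # 服务器平均
    return w

# 同质设定的玩具例子:所有客户端共享同一线性模型
rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0])
clients = []
for _ in range(4):
    X = rng.normal(size=(200, 2))
    clients.append((X, X @ w_star + 0.1 * rng.normal(size=200)))
print(fedavg(clients, w0=np.zeros(2)))  # 接近w_star
```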

【62】 Tradeoffs of Linear Mixed Models in Genome-wide Association Studies 标题:全基因组关联研究中线性混合模型的权衡 链接:https://arxiv.org/abs/2111.03739

作者:Haohan Wang,Bryon Aragam,Eric Xing 机构:School of Computer Science, Carnegie Mellon University, Booth School of Business, University of Chicago 备注:in final revision of Journal of Computational Biology 摘要:基于全基因组关联研究(GWAS)文献中众所周知的经验论点,我们研究了应用于GWAS的线性混合模型(LMM)的统计特性。首先,我们研究了LMMs对在亲属关系矩阵中包含候选SNP的敏感性,这在实践中经常被用来加速计算。我们的结果揭示了包含候选SNP所产生的误差的大小,为该技术提供了一个理由,以权衡速度和准确性。其次,我们研究了混合模型如何纠正GWAS中的混杂因素,这被广泛认为是LMMs优于传统方法的优势。我们考虑了混杂因素、人口分层和环境混杂因素的两个来源,并研究了在实践中常用的不同方法如何权衡这两个混杂因素。 摘要:Motivated by empirical arguments that are well-known from the genome-wide association studies (GWAS) literature, we study the statistical properties of linear mixed models (LMMs) applied to GWAS. First, we study the sensitivity of LMMs to the inclusion of a candidate SNP in the kinship matrix, which is often done in practice to speed up computations. Our results shed light on the size of the error incurred by including a candidate SNP, providing a justification to this technique in order to trade-off velocity against veracity. Second, we investigate how mixed models can correct confounders in GWAS, which is widely accepted as an advantage of LMMs over traditional methods. We consider two sources of confounding factors, population stratification and environmental confounding factors, and study how different methods that are commonly used in practice trade-off these two confounding factors differently.

【63】 Differential Privacy Over Riemannian Manifolds 标题:黎曼流形上的微分隐私 链接:https://arxiv.org/abs/2111.02516

作者:Matthew Reimherr,Karthik Bharath,Carlos Soto 机构:Department of Statistics, Pennsylvania State University, University Park, PA, School of Mathematical Sciences, University of Nottingham, Nottingham, UK 备注:15 pages (including supplementary material and references), 2 figures (including supplementary material), published in NeurIPS 摘要:在这项工作中,我们考虑发布位于黎曼流形上的差分隐私统计摘要的问题。我们提出了拉普拉斯或K-范数机制的一个扩展,该机制利用流形上的内蕴距离和体积。我们还详细考虑了摘要为流形上数据的Fréchet均值这一特定情形。我们证明了我们的机制是速率最优的,并且只取决于流形的维度,而不取决于任何环境空间的维度,同时也展示了忽略流形结构会如何降低脱敏后摘要的效用。我们用统计学中特别受关注的两个例子来说明我们的框架:对称正定矩阵空间(用于协方差矩阵)和球面(可用作离散分布建模的空间)。 摘要:In this work we consider the problem of releasing a differentially private statistical summary that resides on a Riemannian manifold. We present an extension of the Laplace or K-norm mechanism that utilizes intrinsic distances and volumes on the manifold. We also consider in detail the specific case where the summary is the Fréchet mean of data residing on a manifold. We demonstrate that our mechanism is rate optimal and depends only on the dimension of the manifold, not on the dimension of any ambient space, while also showing how ignoring the manifold structure can decrease the utility of the sanitized summary. We illustrate our framework in two examples of particular interest in statistics: the space of symmetric positive definite matrices, which is used for covariance matrices, and the sphere, which can be used as a space for modeling discrete distributions.
