Visit www.arxivdaily.com for daily digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, favorites, and posting features. Click "Read the original" to visit.
stat (Statistics): 39 papers in total
【1】 Inner spike and slab Bayesian nonparametric models
Authors: Antonio Canale, Antonio Lijoi, Bernardo Nipoti, Igor Prünster Comments: 19 pages, 3 figures Link: https://arxiv.org/abs/2107.10223 Abstract: Discrete Bayesian nonparametric models whose expectation is a convex linear combination of a point mass at some point of the support and a diffuse probability distribution allow one to incorporate strong prior information while still being extremely flexible. Recent contributions in the statistical literature have successfully implemented such a modelling strategy in a variety of applications, including density estimation, nonparametric regression and model-based clustering. We provide a thorough study of a large class of nonparametric models we call inner spike and slab hNRMI models, which are obtained by considering homogeneous normalized random measures with independent increments (hNRMI) with base measure given by a convex linear combination of a point mass and a diffuse probability distribution. In this paper we investigate the distributional properties of these models and our results include: i) the exchangeable partition probability function they induce, ii) the distribution of the number of distinct values in an exchangeable sample, iii) the posterior predictive distribution, and iv) the distribution of the number of elements that coincide with the only point of the support with positive probability. Our findings are the main building block for an actual implementation of Bayesian inner spike and slab hNRMI models by means of a generalized Pólya urn scheme.
【2】 Differentiable Annealed Importance Sampling and the Perils of Gradient Noise
Authors: Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, Roger Grosse Affiliations: University of Toronto, Vector Institute, Stanford University Comments: 22 pages Link: https://arxiv.org/abs/2107.10211 Abstract: Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation, but are not fully differentiable due to the use of Metropolis-Hastings (MH) correction steps. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective using gradient-based methods. To this end, we propose a differentiable AIS algorithm by abandoning MH steps, which further unlocks mini-batch computation. We provide a detailed convergence analysis for Bayesian linear regression which goes beyond previous analyses by explicitly accounting for non-perfect transitions. Using this analysis, we prove that our algorithm is consistent in the full-batch setting and provide a sublinear convergence rate. However, we show that the algorithm is inconsistent when mini-batch gradients are used due to a fundamental incompatibility between the goals of last-iterate convergence to the posterior and elimination of the pathwise stochastic error. This result is in stark contrast to our experience with stochastic optimization and stochastic gradient Langevin dynamics, where the effects of gradient noise can be washed out by taking more steps of a smaller size. Our negative result relies crucially on our explicit consideration of convergence to the stationary distribution, and it helps explain the difficulty of developing practically effective AIS-like algorithms that exploit mini-batch gradients.
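As a point of reference, the classic (non-differentiable) AIS estimator that the paper modifies fits in a few lines. The 1D Gaussian target, the linear annealing schedule, and every tuning constant below are illustrative assumptions rather than details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_f0(x):
    # normalized N(0,1) log-density: the tractable starting distribution
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_f1(x):
    # unnormalized target exp(-(x-1)^2 / (2 * 0.5^2)); true Z = 0.5*sqrt(2*pi)
    return -0.5 * ((x - 1.0) / 0.5) ** 2

def ais_log_weights(n_chains=2000, n_betas=100, mh_steps=5, step=0.5):
    betas = np.linspace(0.0, 1.0, n_betas)
    x = rng.standard_normal(n_chains)  # exact draws from f0
    logw = np.zeros(n_chains)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # importance-weight update along the geometric path f0^(1-b) * f1^b
        logw += (b - b_prev) * (log_f1(x) - log_f0(x))
        # Metropolis-Hastings moves targeting the current intermediate
        # distribution: exactly the step a differentiable AIS would drop
        for _ in range(mh_steps):
            prop = x + step * rng.standard_normal(n_chains)
            log_acc = ((1 - b) * (log_f0(prop) - log_f0(x))
                       + b * (log_f1(prop) - log_f1(x)))
            accept = np.log(rng.random(n_chains)) < log_acc
            x = np.where(accept, prop, x)
    return logw

log_Z = np.log(np.mean(np.exp(ais_log_weights())))  # estimates log(0.5*sqrt(2*pi))
```

Averaging the exponentiated weights gives an unbiased estimate of the normalizing constant; the spread of `logw` is a quick diagnostic of how demanding the annealing path is.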
【3】 Bayesian iterative screening in ultra-high dimensional settings
Authors: Run Wang, Somak Dutta, Vivekananda Roy Affiliations: Department of Statistics, Iowa State University, USA Link: https://arxiv.org/abs/2107.10175 Abstract: Variable selection in ultra-high dimensional linear regression is often preceded by a screening step to significantly reduce the dimension. Here a Bayesian variable screening method (BITS) is developed. BITS can successfully integrate prior knowledge, if any, on effect sizes and the number of true variables. BITS iteratively includes potential variables with the highest posterior probability accounting for the already selected variables. It is implemented by a fast Cholesky update algorithm and is shown to have the screening consistency property. BITS is built based on a model with Gaussian errors, yet the screening consistency is proved to hold under more general tail conditions. The notion of posterior screening consistency allows the resulting model to provide a good starting point for further Bayesian variable selection methods. A new screening consistent stopping rule based on posterior probability is developed. Simulation studies and real data examples are used to demonstrate scalability and fine screening performance.
【4】 Decoupling Systemic Risk into Endopathic and Exopathic Competing Risks Through Autoregressive Conditional Accelerated Fréchet Model
Authors: Jingyu Ji, Deyuan Li, Zhengjun Zhang Affiliations: School of Data Science, Fudan University, Shanghai, China; School of Management, Fudan University, Shanghai, China; Department of Statistics, University of Wisconsin, Madison, WI, USA Comments: 27 pages, 10 figures Link: https://arxiv.org/abs/2107.10148 Abstract: Identifying systemic risk patterns in geopolitical, economic, financial, environmental, transportation, and epidemiological systems, and their impacts, is the key to risk management. This paper introduces two new endopathic and exopathic competing risks. The paper integrates the new extreme value theory for maxima of maxima and the autoregressive conditional Fréchet model for systemic risk into a new autoregressive conditional accelerated Fréchet (AcAF) model, which enables decoupling systemic risk into endopathic and exopathic competing risks. The paper establishes the probabilistic properties of stationarity and ergodicity of the AcAF model. Statistical inference is developed through conditional maximum likelihood estimation. The consistency and asymptotic normality of the estimators are derived. Simulation demonstrates the efficiency of the proposed estimators and the AcAF model's flexibility in modeling heterogeneous data. Empirical studies on the stock returns in the S&P 500 and cryptocurrency trading show the superior performance of the proposed model in terms of the identified risk patterns, endopathic and exopathic competing risks, being informative with greater interpretability, enhancing the understanding of the systemic risks of a market and their causes, and making better risk management possible.
【5】 Extracting Governing Laws from Sample Path Data of Non-Gaussian Stochastic Dynamical Systems
Authors: Yang Li, Jinqiao Duan Comments: arXiv admin note: substantial text overlap with arXiv:2005.03769 Link: https://arxiv.org/abs/2107.10127 Abstract: Advances in data science are leading to new progress in the analysis and understanding of complex dynamics for systems with experimental and observational data. With numerous physical phenomena exhibiting bursting, flights, hopping, and intermittent features, stochastic differential equations with non-Gaussian Lévy noise are suitable to model these systems. Thus it is desirable and essential to infer such equations from available data to reasonably predict dynamical behaviors. In this work, we consider a data-driven method to extract stochastic dynamical systems with non-Gaussian asymmetric (rather than symmetric) Lévy processes, as well as Gaussian Brownian motion. We establish a theoretical framework and design a numerical algorithm to compute the asymmetric Lévy jump measure, drift, and diffusion (i.e., nonlocal Kramers-Moyal formulas), hence obtaining the stochastic governing law from noisy data. Numerical experiments on several prototypical examples confirm the efficacy and accuracy of this method. This method will become an effective tool in discovering governing laws from available data sets and in understanding the mechanisms underlying complex random phenomena.
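In the purely Gaussian special case (no Lévy jumps), the Kramers-Moyal idea reduces to estimating conditional moments of path increments. The Ornstein-Uhlenbeck example and all constants below are illustrative assumptions; the paper's actual contribution is the nonlocal extension to asymmetric Lévy jump measures:

```python
import numpy as np

rng = np.random.default_rng(1)

# simulate an Ornstein-Uhlenbeck path dX = -theta*X dt + sigma dW
# via Euler-Maruyama, standing in for observed "sample path data"
theta, sigma, dt, n = 1.0, 0.5, 1e-3, 500_000
x = np.empty(n)
x[0] = 0.0
noise = np.sqrt(dt) * rng.standard_normal(n - 1)
for i in range(n - 1):
    x[i + 1] = x[i] - theta * x[i] * dt + sigma * noise[i]

# classical Kramers-Moyal estimates from binned increments:
#   drift(c)     ~ E[dX   | X = c] / dt   (should recover -theta*c)
#   diffusion(c) ~ E[dX^2 | X = c] / dt   (should recover sigma^2)
dx = np.diff(x)
centers, half = np.array([-0.3, 0.0, 0.3]), 0.05
drift_hat, diff_hat = [], []
for c in centers:
    mask = np.abs(x[:-1] - c) < half
    drift_hat.append(dx[mask].mean() / dt)
    diff_hat.append((dx[mask] ** 2).mean() / dt)
```

With jumps present, these two moments alone are no longer sufficient, which is exactly why the paper works with the nonlocal formulas and an estimated jump measure.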
【6】 A variational approximate posterior for the deep Wishart process
Authors: Sebastian W. Ober, Laurence Aitchison Affiliations: Department of Engineering, University of Cambridge, Cambridge, UK; Department of Computer Science, University of Bristol, Bristol, UK Comments: 20 pages Link: https://arxiv.org/abs/2107.10125 Abstract: Recent work introduced deep kernel processes as an entirely kernel-based alternative to NNs (Aitchison et al., 2020). Deep kernel processes flexibly learn good top-layer representations by alternately sampling the kernel from a distribution over positive semi-definite matrices and performing nonlinear transformations. One deep kernel process, the deep Wishart process (DWP), is of particular interest because its prior is equivalent to deep Gaussian process (DGP) priors. However, inference in DWPs has not yet been possible due to the lack of sufficiently flexible distributions over positive semi-definite matrices. Here, we give a novel approach to obtaining flexible distributions over positive semi-definite matrices by generalising the Bartlett decomposition of the Wishart probability density. We use this new distribution to develop an approximate posterior for the DWP that includes dependency across layers. We develop a doubly-stochastic inducing-point inference scheme for the DWP and show experimentally that inference in the DWP gives improved performance over doing inference in a DGP with the equivalent prior.
【7】 Tracking the Transmission Dynamics of COVID-19 with a Time-Varying Coefficient State-Space Model
Authors: Joshua P. Keller, Tianjian Zhou, Andee Kaplan, G. Brooke Anderson, Wen Zhou Link: https://arxiv.org/abs/2107.10118 Abstract: The spread of COVID-19 has been greatly impacted by regulatory policies and behavior patterns that vary across counties, states, and countries. Population-level dynamics of COVID-19 can generally be described using a set of ordinary differential equations, but these deterministic equations are insufficient for modeling the observed case rates, which can vary due to local testing and case reporting policies and non-homogeneous behavior among individuals. To assess the impact of population mobility on the spread of COVID-19, we have developed a novel Bayesian time-varying coefficient state-space model for infectious disease transmission. The foundation of this model is a time-varying coefficient compartment model to recapitulate the dynamics among susceptible, exposed, undetected infectious, detected infectious, undetected removed, detected non-infectious, detected recovered, and detected deceased individuals. The infectiousness and detection parameters are modeled to vary by time, and the infectiousness component in the model incorporates information on multiple sources of population mobility. Along with this compartment model, a multiplicative process model is introduced to allow for deviation from the deterministic dynamics. We apply this model to observed COVID-19 cases and deaths in several US states and Colorado counties. We find that population mobility measures are highly correlated with transmission rates and can explain complicated temporal variation in infectiousness in these regions. Additionally, the inferred connections between mobility and epidemiological parameters, varying across locations, reveal the heterogeneous effects of different policies on the dynamics of COVID-19.
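The compartment-model backbone of such approaches can be illustrated with a much smaller SEIR system under a time-varying transmission rate. The four-compartment reduction, the rate constants, and the lockdown date below are all illustrative assumptions; the paper's model has eight compartments plus mobility-driven infectiousness:

```python
import numpy as np

# minimal SEIR integration (forward Euler) with time-varying beta(t)
def seir(beta, kappa=0.2, gamma=0.1, days=120, dt=0.1, i0=1e-4):
    S, E, I, R = 1.0 - i0, 0.0, i0, 0.0
    traj = []
    for k in range(int(round(days / dt))):
        t = k * dt
        new_inf = beta(t) * S * I            # force of infection
        dS, dE = -new_inf, new_inf - kappa * E
        dI, dR = kappa * E - gamma * I, gamma * I
        S, E, I, R = S + dS * dt, E + dE * dt, I + dI * dt, R + dR * dt
        traj.append((t, S, E, I, R))
    return np.array(traj)                    # columns: t, S, E, I, R

# transmission rate halves after a hypothetical lockdown on day 30
traj = seir(lambda t: 0.4 if t < 30 else 0.2)
```

Comparing `traj` against a run with constant `beta` shows a flattened infectious peak, which is the kind of intervention effect the state-space model infers from mobility data.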
【8】 On the Convergence of Prior-Guided Zeroth-Order Optimization Algorithms
Authors: Shuyu Cheng, Guoqiang Wu, Jun Zhu Affiliations: Dept. of Comp. Sci. and Tech., BNRist Center, State Key Lab for Intell. Tech. & Sys., Institute for AI, THBI Lab, Tsinghua University, Beijing, China Comments: Code available at this https URL Link: https://arxiv.org/abs/2107.10110 Abstract: Zeroth-order (ZO) optimization is widely used to handle challenging tasks, such as query-based black-box adversarial attacks and reinforcement learning. Various attempts have been made to integrate prior information into the gradient estimation procedure based on finite differences, with promising empirical results. However, their convergence properties are not well understood. This paper makes an attempt to fill this gap by analyzing the convergence of prior-guided ZO algorithms under a greedy descent framework with various gradient estimators. We provide a convergence guarantee for the prior-guided random gradient-free (PRGF) algorithms. Moreover, to further accelerate over greedy descent methods, we present a new accelerated random search (ARS) algorithm that incorporates prior information, together with a convergence analysis. Finally, our theoretical results are confirmed by experiments on several numerical benchmarks as well as adversarial attacks.
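The basic random gradient-free estimator that such ZO methods build on (here without the prior-guided term the paper adds) can be sketched as follows; the quadratic test function and all constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def rgf_grad(f, x, q=2000, mu=1e-4):
    """Average forward finite differences along q random unit directions.

    For u uniform on the unit sphere, E[u u^T] = I/d, so scaling the
    average by d makes the estimator unbiased up to O(mu) smoothing error.
    """
    d = x.size
    g = np.zeros_like(x)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (f(x + mu * u) - f(x)) / mu * u
    return g * (d / q)

f = lambda x: 0.5 * np.sum(x**2)   # query-only objective; true gradient is x
x = np.array([1.0, -2.0, 0.5])
g = rgf_grad(f, x)                  # uses only function evaluations, no gradients
```

A prior-guided variant would mix such random directions with a transfer or surrogate gradient direction, which is the design the paper's PRGF analysis covers.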
【9】 Discovering Latent Causal Variables via Mechanism Sparsity: A New Principle for Nonlinear ICA
Authors: Sébastien Lachapelle, Pau Rodríguez López, Rémi Le Priol, Alexandre Lacoste, Simon Lacoste-Julien Affiliations: Université de Montréal; Element AI Comments: Appears in: Workshop on the Neglected Assumptions in Causal Inference (NACI) at the 38th International Conference on Machine Learning, 2021. 19 pages Link: https://arxiv.org/abs/2107.10098 Abstract: It can be argued that finding an interpretable low-dimensional representation of a potentially high-dimensional phenomenon is central to the scientific enterprise. Independent component analysis (ICA) refers to an ensemble of methods which formalize this goal and provide estimation procedures for practical application. This work proposes mechanism sparsity regularization as a new principle to achieve nonlinear ICA when latent factors depend sparsely on observed auxiliary variables and/or past latent factors. We show that the latent variables can be recovered up to a permutation if one regularizes the latent mechanisms to be sparse and if some graphical criterion is satisfied by the data generating process. As a special case, our framework shows how one can leverage unknown-target interventions on the latent factors to disentangle them, thus drawing further connections between ICA and causality. We validate our theoretical results with toy experiments.
【10】 Optimal Rates for Nonparametric Density Estimation under Communication Constraints
Authors: Jayadev Acharya, Clément L. Canonne, Aditya Vikram Singh, Himanshu Tyagi Link: https://arxiv.org/abs/2107.10078 Abstract: We consider density estimation for Besov spaces when each sample is quantized to only a limited number of bits. We provide a noninteractive adaptive estimator that exploits the sparsity of wavelet bases, along with a simulate-and-infer technique from parametric estimation under communication constraints. We show that our estimator is nearly rate-optimal by deriving minimax lower bounds that hold even when interactive protocols are allowed. Interestingly, while our wavelet-based estimator is almost rate-optimal for Sobolev spaces as well, it is unclear whether the standard Fourier basis, which arises naturally for those spaces, can be used to achieve the same performance.
【11】 Adaptive Inducing Points Selection For Gaussian Processes
Authors: Théo Galy-Fajou, Manfred Opper Affiliations: Technical University of Berlin Comments: Accepted at Continual Learning Workshop - ICML 2020: this https URL Link: https://arxiv.org/abs/2107.10066 Abstract: Gaussian Processes (GPs) are flexible non-parametric models with a strong probabilistic interpretation. While being a standard choice for performing inference on time series, GPs have few techniques to work in a streaming setting. Bui et al. (2017) developed an efficient variational approach to train online GPs by using sparsity techniques: the whole set of observations is approximated by a smaller set of inducing points (IPs) and moved around with new data. Both the number and the locations of the IPs greatly affect the performance of the algorithm. In addition to optimizing their locations, we propose to adaptively add new points, based on the properties of the GP and the structure of the data.
【12】 The impact of increasing COVID-19 cases/deaths on the number of uncivil tweets directed at governments
Authors: Kohei Nishi Link: https://arxiv.org/abs/2107.10041 Abstract: Political expression through social media such as Twitter has already taken root as a form of political participation. However, it is less clear what kind of political messages people send out on social media and under what circumstances they do so. This study theorizes that when government policy performance worsens, people get angry or frustrated and send uncivil messages to the government. To test this theory, the current study classifies tweets directed at U.S. state governors as uncivil or not, using a neural network machine-learning model, and examines the impact of worsening state-level COVID-19 indicators on the number of uncivil tweets directed at the state governors. The results show that increasing state-level COVID-19 cases and deaths lead to higher numbers of uncivil tweets directed at state governors. This suggests that people evaluate the government's performance through actions other than voting.
【13】 Linear spectral statistics of sequential sample covariance matrices
Authors: Nina Dörnemann, Holger Dette Affiliations: Fakultät für Mathematik, Ruhr-Universität Bochum, Bochum, Germany Link: https://arxiv.org/abs/2107.10036 Abstract: Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be independent $p$-dimensional vectors with independent complex or real valued entries such that $\mathbb{E}[\mathbf{x}_i] = \mathbf{0}$ and ${\rm Var}(\mathbf{x}_i) = \mathbf{I}_p$, $i = 1, \ldots, n$, let $\mathbf{T}_n$ be a $p \times p$ Hermitian nonnegative definite matrix, and let $f$ be a given function. We prove that an appropriately standardized version of the stochastic process $\big( \operatorname{tr}( f(\mathbf{B}_{n,t}) ) \big)_{t \in [t_0, 1]}$ corresponding to a linear spectral statistic of the sequential empirical covariance estimator $$\big( \mathbf{B}_{n,t} \big)_{t \in [t_0, 1]} = \Big( \frac{1}{n} \sum_{i=1}^{\lfloor nt \rfloor} \mathbf{T}_n^{1/2} \mathbf{x}_i \mathbf{x}_i^\star \mathbf{T}_n^{1/2} \Big)_{t \in [t_0, 1]}$$ converges weakly to a non-standard Gaussian process for $n, p \to \infty$. As an application we use these results to develop a novel approach for monitoring the sphericity assumption in a high-dimensional framework, even if the dimension of the underlying data is larger than the sample size.
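The sequential statistic itself is cheap to compute directly; below is a numpy sketch with $\mathbf{T}_n = \mathbf{I}_p$ and $f(x) = \log(1+x)$, both illustrative choices rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)

n, p, t0 = 400, 100, 0.25
X = rng.standard_normal((n, p))       # rows x_i with mean 0, Var = I_p

def lss_trajectory(X, f, t_grid):
    # tr f(B_{n,t}) with B_{n,t} = (1/n) * sum_{i <= floor(n*t)} x_i x_i^T
    n = X.shape[0]
    out = []
    for t in t_grid:
        k = int(np.floor(n * t))
        B = X[:k].T @ X[:k] / n
        out.append(np.sum(f(np.linalg.eigvalsh(B))))
    return np.array(out)

t_grid = np.linspace(t0, 1.0, 16)
traj = lss_trajectory(X, np.log1p, t_grid)
```

Since $\mathbf{B}_{n,t}$ only gains positive semi-definite terms as $t$ grows, the trajectory is monotone for increasing $f$; the monitoring procedure standardizes such trajectories and tracks their fluctuations.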
【14】 Frequentist inference for cluster randomised trials with multiple primary outcomes
Authors: Samuel I. Watson, Joshua Akinyemi, Karla Hemming Link: https://arxiv.org/abs/2107.10017 Abstract: The use of a single primary outcome is generally either recommended or required by many influential randomised trial guidelines to avoid the problem of "multiple testing". Without correction, the probability of rejecting at least one of a set of null hypotheses (the family-wise error rate) is often much greater than the nominal rate of any single test, so that statistics like p-values and confidence intervals have no reliable interpretation. Cluster randomised trials, though, may require multiple outcomes to adequately describe the effects of often complex and multi-faceted interventions. We propose a method of inference for cluster randomised trials with multiple outcomes that ensures a nominal family-wise error rate and produces simultaneous confidence intervals with nominal "family-wise" coverage. We adapt the resampling-based stepdown procedure of Romano and Wolf (2005) using a randomisation-test approach within a generalised linear model framework. We then adapt the Robbins-Monro search procedure for confidence interval limits proposed by Garthwaite and Buckland (1996) to this stepdown process to produce a set of confidence intervals. We show that this procedure has nominal error rates and coverage in a simulation-based study of parallel and stepped-wedge cluster randomised studies, and compare results from the analysis of a real-world stepped-wedge trial under both the proposed and more standard analyses.
【15】 Delving Into Deep Walkers: A Convergence Analysis of Random-Walk-Based Vertex Embeddings
Authors: Dominik Kloepfer, Angelica I. Aviles-Rivero, Daniel Heydecker Affiliations: Department of Applied Mathematics and Theoretical Physics Link: https://arxiv.org/abs/2107.10014 Abstract: Graph vertex embeddings based on random walks have become increasingly influential in recent years, showing good performance in several tasks as they efficiently transform a graph into a more computationally digestible format while preserving relevant information. However, the theoretical properties of such algorithms, in particular the influence of hyperparameters and of the graph structure on their convergence behaviour, have so far not been well understood. In this work, we provide a theoretical analysis for random-walk-based embedding techniques. Firstly, we prove that, under some weak assumptions, vertex embeddings derived from random walks do indeed converge, both in the single limit of the number of random walks $N \to \infty$ and in the double limit of both $N$ and the length of each random walk $L \to \infty$. Secondly, we derive concentration bounds quantifying the convergence rate of the corpora for the single and double limits. Thirdly, we use these results to derive a heuristic for choosing the hyperparameters $N$ and $L$. We validate and illustrate the practical importance of our findings with a range of numerical and visual experiments on several graphs drawn from real-world applications.
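The object whose convergence is analysed, a corpus of $N$ random walks of length $L$ from each vertex (DeepWalk-style), takes only a few lines to generate; the toy 4-cycle graph below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

def random_walk_corpus(adj, num_walks, walk_len):
    # num_walks uniform random walks of length walk_len from every vertex
    corpus = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_len - 1):
                walk.append(int(rng.choice(adj[walk[-1]])))
            corpus.append(walk)
    return corpus

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}   # a 4-cycle
corpus = random_walk_corpus(adj, num_walks=10, walk_len=5)
```

The resulting `corpus` is what skip-gram training consumes; the paper's bounds describe how co-occurrence statistics of this corpus concentrate as `num_walks` and `walk_len` grow.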
【16】 On ageing properties of lifetime distributions
Authors: Anakha K K, V M Chacko Affiliations: Department of Statistics, St. Thomas' College (Autonomous), Thrissur, Kerala, India Comments: 25 pages Link: https://arxiv.org/abs/2107.09921 Abstract: A sizeable segment of reliability theory is devoted to the study of failure rates, their properties, connections, and applications. The present study focuses on failure rate distributions and their shape properties. Failure rates of various generalizations of the Lindley distribution are discussed, and the distinctive role of the DUS transformation is also examined.
【17】 Improving the Power to Detect Indirect Effects in Mediation Analysis
Authors: John Kidd, Dan-Yu Lin Comments: 15 pages, 3 figures, 2 tables Link: https://arxiv.org/abs/2107.09812 Abstract: Causal mediation analysis seeks to determine whether an independent variable affects a response variable directly or whether it does so indirectly, by way of a mediator. The existing statistical tests to determine the existence of an indirect effect are either overly conservative or have inflated type I error. In this article, we propose two methods based on the principle of intersection-union tests that offer improvements in power while controlling the type I error. We demonstrate the advantages of the proposed methods through extensive simulation. Finally, we provide an application to a large proteomic study.
【18】 A Stochastic Version of the EM Algorithm for Mixture Cure Rate Model with Exponentiated Weibull Family of Lifetimes
Authors: Sandip Barui, Suvra Pal, Nutan Mishra, Katherine Davies Affiliations: University of Texas at Arlington; Department of Mathematics and Statistics, University of South Alabama Comments: 33 pages Link: https://arxiv.org/abs/2107.09810 Abstract: Handling missing values plays an important role in the analysis of survival data, especially data marked by a cure fraction. In this paper, we discuss the properties and implementation of the stochastic EM (SEM) algorithm, a stochastic approximation to the expectation-maximization (EM) algorithm, to obtain maximum likelihood (ML) type estimates in situations where missing data arise naturally due to right censoring and a proportion of individuals are immune to the event of interest. A flexible family of three-parameter exponentiated-Weibull (EW) distributions is assumed to characterize lifetimes of the non-immune individuals, as it accommodates both monotone (increasing and decreasing) and non-monotone (unimodal and bathtub) hazard functions. To evaluate the performance of the SEM algorithm, an extensive simulation study is carried out under various parameter settings. Using a likelihood ratio test, we also carry out model discrimination within the EW family of distributions. Furthermore, we study the robustness of the SEM algorithm with respect to outliers and algorithm starting values. A few scenarios where the SEM algorithm outperforms the well-studied EM algorithm are also examined in the given context. For further demonstration, a real survival data set on cutaneous melanoma is analyzed using the proposed cure rate model with the EW lifetime distribution and the proposed estimation technique. Through these data, we illustrate the applicability of the likelihood ratio test towards rejecting several well-known lifetime distributions that are nested within the wider class of EW distributions.
【19】 A conditional independence test for causality in econometrics
Authors: Jaime Sevilla, Alexandra Mayn Link: https://arxiv.org/abs/2107.09765 Abstract: The Y-test is a useful tool for detecting missing confounders in the context of a multivariate regression. However, it is rarely used in practice since it requires identifying multiple conditionally independent instruments, which is often impossible. We propose a heuristic test which relaxes the independence requirement. We then show how to apply this heuristic test to a price-demand problem and a firm loan-productivity problem. We conclude that the test is informative when the variables are linearly related with Gaussian additive noise, but it can be misleading in other contexts. Still, we believe that the test can be a useful concept for falsifying a proposed control set.
【20】 Log-symmetric models with cure fraction with application to leprosy reactions data
Authors: Joyce B. Rocha, Francisco M. C. Medeiros, Dione M. Valença Affiliations: Department of Statistics, Federal University of Rio Grande do Norte, Natal, Brazil Link: https://arxiv.org/abs/2107.09757 Abstract: In this paper, we propose a log-symmetric survival model with cure fraction, considering that the distributions of lifetimes for susceptible individuals belong to the log-symmetric class of distributions. This class has continuous, strictly positive, and asymmetric distributions, including the log-normal, log-$t$-Student, Birnbaum-Saunders, log-logistic I, log-logistic II, log-normal-contaminated, log-exponential-power, and log-slash distributions. The log-symmetric class is quite flexible and allows for including bimodal distributions and outliers. The model includes explanatory variables through the parameter associated with the cure fraction. We evaluate the performance of the proposed model through extensive simulation studies and consider a real data application to evaluate the effect of factors on the immunity to leprosy reactions in patients with Hansen's disease.
【21】 Evaluating Effectiveness of Public Health Intervention Strategies for Mitigating COVID-19 Pandemic 标题:缓解冠状病毒大流行的公共卫生干预策略效果评价
作者:Shanghong Xie,Wenbo Wang,Qinxia Wang,Yuanjia Wang,Donglin Zeng 机构:. Department of Biostatistics, Columbia University, New York, NY, U.S.A., . Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 链接:https://arxiv.org/abs/2107.09749 摘要:2019年冠状病毒病(COVID-19)大流行是前所未有的全球公共卫生挑战。在美国(US),州政府为应对COVID-19的快速传播,实施了各种非药物干预措施(NPI),如物理距离关闭(封锁)、居家令、公共场所强制佩戴口罩。我们提出了一个嵌套的病例对照设计,在准实验框架下用倾向评分加权来估计疾病跨州传播的平均干预效果。我们进一步发展了一种方法来检验调节干预效果的因素,以协助精准的公共卫生干预。我们的方法考虑了疾病传播的潜在动态,并平衡州层面的干预前特征。我们证明了在一定假设下我们的估计量给出因果干预效应。我们应用此方法分析美国COVID-19发病率病例,以评估六种干预措施的效果。我们发现,封锁对减少传播的影响最大,而重新开放酒吧显著增加传播。非白人人口比例较高的州,因重新开放酒吧而增加$R_t$的风险更大。 摘要:Coronavirus disease 2019 (COVID-19) pandemic is an unprecedented global public health challenge. In the United States (US), state governments have implemented various non-pharmaceutical interventions (NPIs), such as physical distance closure (lockdown), stay-at-home orders, and mandatory facial masks in public, in response to the rapid spread of COVID-19. To evaluate the effectiveness of these NPIs, we propose a nested case-control design with propensity score weighting under the quasi-experiment framework to estimate the average intervention effect on disease transmission across states. We further develop a method to test for factors that moderate intervention effect to assist precision public health intervention. Our method takes account of the underlying dynamics of disease transmission and balances state-level pre-intervention characteristics. We prove that our estimator provides the causal intervention effect under certain assumptions. We apply this method to analyze US COVID-19 incidence cases to estimate the effects of six interventions. We show that lockdown has the largest effect on reducing transmission and reopening bars significantly increases transmission. States with a higher percentage of non-white population are at greater risk of increased $R_t$ associated with reopening bars.
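The propensity-score weighting step can be illustrated with a minimal inverse-propensity-weighted (IPW) estimator of an average effect (a generic textbook sketch, not the paper's nested case-control estimator):

```python
import numpy as np

def ipw_effect(y, treated, propensity):
    """Inverse-propensity-weighted estimate of an average intervention effect.

    y          : observed outcomes
    treated    : 1 if the unit received the intervention, else 0
    propensity : estimated probability of receiving the intervention
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(treated, dtype=float)
    e = np.asarray(propensity, dtype=float)
    # Weight treated outcomes by 1/e and control outcomes by 1/(1-e)
    return float(np.mean(t * y / e - (1 - t) * y / (1 - e)))
```

With propensities of 0.5 everywhere the estimator reduces to a plain difference in (rescaled) group means, which makes it easy to sanity-check.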
【22】 Strategies for variable selection in large-scale healthcare database studies with missing covariate and outcome data 标题:具有缺失协变量和结果数据的大规模医疗保健数据库研究中变量选择的策略
作者:Jung-Yi Joyce Lin,Liangyuan Hu,Chuyue Huang,Steven Lawrence,Usha Govindarajulu 机构:Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, Madison Ave, New York, New York; Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, New Jersey 备注:18 pages, 3 figures, 4 tables 链接:https://arxiv.org/abs/2107.09730 摘要:以往的研究表明,当协变量和结果数据随机缺失时,将bootstrap插补与基于树的机器学习变量选择方法相结合,可以恢复完全观测数据的良好性能。然而,这种方法在计算上是昂贵的,特别是在大规模数据集上。我们提出了一种基于推断的方法RR-BART,该方法利用基于似然的贝叶斯机器学习技术(贝叶斯加性回归树),并使用Rubin规则将多重插补数据集上变量重要性度量的估计和方差结合起来,以便在存在缺失数据的情况下进行变量选择。一项有代表性的仿真研究表明,RR-BART的性能至少与将bootstrap与BART相结合的BI-BART相当,但即使在非线性、非可加且MAR机制下总体缺失率很高的复杂条件下,也能节省大量计算量。 摘要:Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the good performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach however is computationally expensive, especially on large-scale datasets. We propose an inference-based method RR-BART, that leverages the likelihood-based Bayesian machine learning technique, Bayesian Additive Regression Trees, and uses Rubin's rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of missing data. A representative simulation study suggests that RR-BART performs at least as well as BI-BART (bootstrap imputation combined with BART), but offers substantial computational savings, even in complex conditions of nonlinearity and nonadditivity with a large percentage of overall missingness under the MAR mechanism.
RR-BART is also less sensitive to the end-node prior set via the hyperparameter $k$ than BI-BART, and does not depend on the selection threshold value $\pi$ as required by BI-BART. Our simulation studies also suggest that encoding the missing values of a binary predictor as a separate category significantly improves the power of selecting the binary predictor for BI-BART. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women's Health Across the Nation.
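The Rubin's-rule pooling that RR-BART applies to the multiply imputed estimates can be sketched in a few lines (the generic pooling formulas, not tied to BART itself):

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool m per-imputation estimates and their variances via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w = variances.mean()                # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / m) * b             # total variance
    return q_bar, t
```

The between-imputation term b is what inflates the pooled variance to reflect uncertainty due to the missing data.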
【23】 Machine Learning for Real-World Evidence Analysis of COVID-19 Pharmacotherapy 标题:机器学习在冠状病毒药物治疗实证分析中的应用
作者:Aurelia Bustos,Patricio Mas-Serrano,Mari L. Boquera,Jose M. Salinas 机构:AI Medical Research Unit, MedBravo; Pharmacy Department, HGUA; ISABIAL; IT Department, San Juan University Hospital 备注:22 pages, 7 tables, 11 figures 链接:https://arxiv.org/abs/2107.10239 摘要:引言:临床实践中产生的真实世界数据可用于分析COVID-19药物治疗的真实世界证据(RWE)和验证随机临床试验(RCTs)的结果。机器学习(ML)方法在RWE中得到了广泛的应用,是一种很有前途的精准医学工具。在这项研究中,ML方法用于研究西班牙巴伦西亚地区COVID-19住院治疗的疗效。方法:采用10个卫生部门2020年1月至2021年1月的5244例和1312例COVID-19住院病例,分别对remdesivir、皮质类固醇、tocilizumab、lopinavir-ritonavir、阿奇霉素和氯喹/羟基氯喹的治疗效果模型(TE-ML)进行训练和验证。另外两个卫生部门的2390名住院患者被保留作为一项独立测试,以使用Cox比例风险模型回顾性分析TE-ML模型所选人群中各治疗的生存获益。使用治疗倾向评分调整TE-ML模型,以控制与结果相关的治疗前混杂变量,并进一步评估其无效性。ML架构基于增强的决策树。结果:在TE-ML模型确定的人群中,只有Remdesivir和Tocilizumab与生存时间增加显著相关,危险比分别为0.41(P=0.04)和0.21(P=0.001)。氯喹衍生物、洛匹那韦、利托那韦和阿奇霉素对存活率无影响。解释TE-ML模型预测的工具在患者层面被探索为个性化决策和精准医学的潜在工具。结论:ML法适用于COVID-19药物治疗的RWE分析。所得结果重现了RWE上已发表的结果,并验证了RCT的结果。 摘要:Introduction: Real-world data generated from clinical practice can be used to analyze the real-world evidence (RWE) of COVID-19 pharmacotherapy and validate the results of randomized clinical trials (RCTs). Machine learning (ML) methods are being used in RWE and are promising tools for precision-medicine. In this study, ML methods are applied to study the efficacy of therapies on COVID-19 hospital admissions in the Valencian Region in Spain. Methods: 5244 and 1312 COVID-19 hospital admissions, dated between January 2020 and January 2021 from 10 health departments, were used respectively for training and validation of separate treatment-effect models (TE-ML) for remdesivir, corticosteroids, tocilizumab, lopinavir-ritonavir, azithromycin and chloroquine/hydroxychloroquine. 2390 admissions from 2 additional health departments were reserved as an independent test to analyze retrospectively the survival benefits of therapies in the population selected by the TE-ML models using Cox proportional hazards models.
TE-ML models were adjusted using treatment propensity scores to control for pre-treatment confounding variables associated with the outcome and further evaluated for futility. ML architecture was based on boosted decision-trees. Results: In the populations identified by the TE-ML models, only Remdesivir and Tocilizumab were significantly associated with an increase in survival time, with hazard ratios of 0.41 (P = 0.04) and 0.21 (P = 0.001), respectively. No survival benefits from chloroquine derivatives, lopinavir-ritonavir and azithromycin were demonstrated. Tools to explain the predictions of TE-ML models are explored at patient-level as potential tools for personalized decision making and precision medicine. Conclusion: ML methods are suitable tools for RWE analysis of COVID-19 pharmacotherapies. Results obtained reproduce published results on RWE and validate the results from RCTs.
【24】 Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations 标题:具有一般ReLU激活的深度2神经网络的高效学习算法
作者:Pranjal Awasthi,Alex Tang,Aravindan Vijayaraghavan 机构:Google Research, Northwestern University 备注:36 pages (including appendix) 链接:https://arxiv.org/abs/2107.10209 摘要:在温和的非简并假设下,我们提出多项式时间和样本有效的算法来学习具有一般ReLU激活的未知深度2前馈神经网络。特别地,我们考虑学习一个形式为$f(x) = a^{\mathsf{T}}\sigma(W^{\mathsf{T}}x + b)$的未知网络,其中$x$服从高斯分布,$\sigma(t) := \max(t,0)$是ReLU激活。以前学习ReLU激活网络的工作假设偏置$b$为零。为了处理偏置项的存在,我们提出的算法包括对函数$f(x)$的Hermite展开所产生的多个高阶张量进行鲁棒分解。利用这些思想,我们还建立了最小假设下网络参数的可辨识性。 摘要:We present polynomial time and sample efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. In particular, we consider learning an unknown network of the form $f(x) = a^{\mathsf{T}}\sigma(W^{\mathsf{T}}x + b)$, where $x$ is drawn from the Gaussian distribution, and $\sigma(t) := \max(t,0)$ is the ReLU activation. Prior works for learning networks with ReLU activations assume that the bias $b$ is zero. In order to deal with the presence of the bias terms, our proposed algorithm consists of robustly decomposing multiple higher order tensors arising from the Hermite expansion of the function $f(x)$. Using these ideas we also establish identifiability of the network parameters under minimal assumptions.
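The target class $f(x) = a^{\mathsf{T}}\sigma(W^{\mathsf{T}}x + b)$ is easy to write down directly (a sketch of the function being learned, not of the paper's tensor-decomposition algorithm):

```python
import numpy as np

def depth2_relu(x, W, a, b):
    """f(x) = a^T ReLU(W^T x + b): a depth-2 network with general (nonzero) biases."""
    return float(a @ np.maximum(W.T @ x + b, 0.0))
```

The nonzero bias b is exactly the ingredient earlier work assumed away; it shifts where each hidden unit turns on.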
【25】 Distribution of Classification Margins: Are All Data Equal? 标题:分类边距的分布:所有数据都相等吗?
作者:Andrzej Banburski,Fernanda De La Torre,Nishka Pant,Ishana Shastri,Tomaso Poggio 机构:Brown University 备注:Previously online as CBMM Memo 115 on the CBMM MIT site 链接:https://arxiv.org/abs/2107.10199 摘要:最近的理论结果表明,在指数损失函数下,深度神经网络的梯度下降使分类裕度局部最大,这相当于在裕度约束下使权重矩阵的范数最小化。然而,解的这一性质并不能完全描述其泛化性能。我们从理论上证明了训练集上边缘分布曲线下的面积实际上是一个很好的泛化度量。然后,我们证明,在实现数据分离后,可以动态地将训练集减少99%以上,而不会显著降低性能。有趣的是,得到的"高容量"特征子集在不同的训练运行中并不一致,这与理论上的观点一致,即在SGD下,以及在存在批量归一化和权重衰减的情况下,所有训练点都应收敛到相同的渐近边界。 摘要:Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution however does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of "high capacity" features is not consistent across different training runs, which is consistent with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.
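The per-example classification margin underlying the margin-distribution curve can be computed as follows (the standard definition, sketched under the assumption that the network outputs raw logits):

```python
import numpy as np

def classification_margins(logits, labels):
    """Margin per example: correct-class logit minus the largest other logit.

    Positive margin means the example is classified correctly."""
    logits = np.asarray(logits, dtype=float)
    idx = np.arange(len(labels))
    correct = logits[idx, labels]
    rest = logits.copy()
    rest[idx, labels] = -np.inf   # mask out the correct class
    return correct - rest.max(axis=1)
```

Sorting these margins and integrating gives the margin-distribution curve whose area the paper proposes as a generalization measure.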
【26】 On the Memorization Properties of Contrastive Learning 标题:论对比学习的记忆特性
作者:Ildus Sadrtdinov,Nadezhda Chirkova,Ekaterina Lobacheva 机构:HSE University 备注:Published in Workshop on Overparameterization: Pitfalls & Opportunities at ICML 2021 链接:https://arxiv.org/abs/2107.10143 摘要:对深层神经网络(DNNs)的记忆研究有助于理解DNNs学习的模式和方式,并促进DNN训练方法的改进。在这项工作中,我们研究了一种广泛使用的对比自监督学习方法SimCLR的记忆特性,并将其与监督学习和随机标签训练的记忆进行了比较。我们发现,训练对象和增广在SimCLR如何学习它们的意义上可能具有不同的复杂性。此外,我们还证明了SimCLR在训练对象复杂度分布上类似于随机标签训练。 摘要:Memorization studies of deep neural networks (DNNs) help to understand what patterns and how do DNNs learn, and motivate improvements to DNN training approaches. In this work, we investigate the memorization properties of SimCLR, a widely used contrastive self-supervised learning approach, and compare them to the memorization of supervised learning and random labels training. We find that both training objects and augmentations may have different complexity in the sense of how SimCLR learns them. Moreover, we show that SimCLR is similar to random labels training in terms of the distribution of training objects complexity.
【27】 Interpreting diffusion score matching using normalizing flow 标题:用归一化流程解释扩散分数匹配
作者:Wenbo Gong,Yingzhen Li 机构:Department of Engineering, University of Cambridge, UK; Department of Computing 备注:8 pages, International Conference on Machine Learning (ICML) INNF 2021 Workshop Spotlight 链接:https://arxiv.org/abs/2107.10072 摘要:评分匹配(SM)及其对应的Stein差异(SD)在模型训练和评价中取得了巨大的成功。然而,最近的研究显示了它们在处理某些类型的分布时的局限性。一种可能的解决方法是将原始分数匹配(或Stein差异)与扩散矩阵相结合,这称为扩散分数匹配(DSM)(或扩散Stein差异(DSD))。然而,缺乏对扩散的解释限制了它在简单分布和人工选择矩阵中的应用。在这项工作中,我们计划通过使用标准化流解释扩散矩阵来填补这一空白。具体地说,我们从理论上证明了DSM(或DSD)等价于正规化流定义的变换空间中的原始分数匹配(或Stein差异),其中扩散矩阵是流的Jacobian矩阵的逆。此外,我们还建立了它与黎曼流形的联系,并进一步推广到连续流,其中DSM的变化以常微分方程为特征。 摘要:Score matching (SM) and its related counterpart, Stein discrepancy (SD), have achieved great success in model training and evaluations. However, recent research shows their limitations when dealing with certain types of distributions. One possible fix is incorporating the original score matching (or Stein discrepancy) with a diffusion matrix, which is called diffusion score matching (DSM) (or diffusion Stein discrepancy (DSD)). However, the lack of interpretation of the diffusion limits its usage within simple distributions and manually chosen matrix. In this work, we plan to fill this gap by interpreting the diffusion matrix using normalizing flows. Specifically, we theoretically prove that DSM (or DSD) is equivalent to the original score matching (or Stein discrepancy) evaluated in the transformed space defined by the normalizing flow, where the diffusion matrix is the inverse of the flow's Jacobian matrix. In addition, we also build its connection to Riemannian manifolds and further extend it to continuous flows, where the change of DSM is characterized by an ODE.
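As a rough sketch of why the transformed-space view works (our restatement of the standard change-of-variables identity, with $y = T(x)$ an invertible flow and $J_T(x)$ its Jacobian; the notation is ours, not the paper's):

```latex
\nabla_y \log p_Y(y)
  = J_T(x)^{-\mathsf{T}}
    \Big( \nabla_x \log p_X(x) - \nabla_x \log \left| \det J_T(x) \right| \Big)
```

Reading the prefactor $J_T(x)^{-\mathsf{T}}$ as the (transposed) diffusion matrix recovers the stated equivalence: score matching evaluated in the flow's transformed space coincides with diffusion score matching with diffusion matrix $J_T(x)^{-1}$.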
【28】 KalmanNet: Neural Network Aided Kalman Filtering for Partially Known Dynamics 标题:KalmanNet:神经网络辅助部分已知动力学的卡尔曼滤波
作者:Guy Revach,Nir Shlezinger,Xiaoyong Ni,Adria Lopez Escoriza,Ruud J. G. van Sloun,Yonina C. Eldar 机构: Ben-Gurion University of the Negev 链接:https://arxiv.org/abs/2107.10043 摘要:动态系统的实时状态估计是信号处理与控制中的一项基本任务。对于由完全已知的线性高斯状态空间(SS)模型表示的系统,著名的Kalman滤波器(KF)是一种低复杂度的最优解。然而,在实践中,基本SS模型的线性度和对它的准确认识往往是不存在的。在这里,我们提出了KalmanNet,一种实时状态估计器,它从数据中学习,在具有部分信息的非线性动态下进行Kalman滤波。通过在KF流中加入结构SS模型和专用的递归神经网络模块,我们保持了经典算法的数据效率和可解释性,同时隐式地从数据中学习复杂动力学。数值计算表明,KalmanNet方法克服了非线性和模型失配的缺点,优于经典的滤波方法。 摘要:Real-time state estimation of dynamical systems is a fundamental task in signal processing and control. For systems that are well-represented by a fully known linear Gaussian state space (SS) model, the celebrated Kalman filter (KF) is a low complexity optimal solution. However, both linearity of the underlying SS model and accurate knowledge of it are often not encountered in practice. Here, we present KalmanNet, a real-time state estimator that learns from data to carry out Kalman filtering under non-linear dynamics with partial information. By incorporating the structural SS model with a dedicated recurrent neural network module in the flow of the KF, we retain data efficiency and interpretability of the classic algorithm while implicitly learning complex dynamics from data. We numerically demonstrate that KalmanNet overcomes nonlinearities and model mismatch, outperforming classic filtering methods operating with both mismatched and accurate domain knowledge.
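For reference, one predict/update cycle of the classic KF that KalmanNet augments looks as follows (the textbook linear-Gaussian form; the recurrent neural module that replaces the gain computation is not shown):

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of the classic Kalman filter.

    x, P : prior state mean and covariance
    z    : new observation
    F, H : state-transition and observation matrices
    Q, R : process and observation noise covariances
    """
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

KalmanNet's idea is to keep this overall flow but learn the gain K from data, so that exact knowledge of F, H, Q, R is no longer required.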
【29】 Differentiable Feature Selection, a Reparameterization Approach 标题:可微特征选择--一种再参数化方法
作者:Jérémie Dona,Patrick Gallinari 机构:Sorbonne Université, CNRS, LIP, Paris, France; Criteo AI Labs, Paris, France 备注:None 链接:https://arxiv.org/abs/2107.10030 摘要:我们考虑重建的特征选择的任务,它包括从整个数据实例中选择可以重构的特征的小子集。这在涉及例如昂贵的物理测量、传感器放置或信息压缩的若干环境中尤其重要。为了打破这个问题固有的组合性质,我们制定的任务是优化二元掩模分布,使准确的重建。然后,我们面临两个主要挑战。一个是由于二元分布引起的可微性问题。第二种方法是通过以相关方式选择变量来消除冗余信息,这需要对二进制分布的协方差进行建模。我们通过引入一种新的对数正态分布的重参数化来解决这两个问题。通过对多个高维图像基准点的评价,证明了该方法提供了一种有效的探测方案,并能有效地进行特征选择。我们表明,该方法利用了数据的内在几何结构,有利于重建。 摘要:We consider the task of feature selection for reconstruction which consists in choosing a small subset of features from which whole data instances can be reconstructed. This is of particular importance in several contexts involving for example costly physical measurements, sensor placement or information compression. To break the intrinsic combinatorial nature of this problem, we formulate the task as optimizing a binary mask distribution enabling an accurate reconstruction. We then face two main challenges. One concerns differentiability issues due to the binary distribution. The second one corresponds to the elimination of redundant information by selecting variables in a correlated fashion which requires modeling the covariance of the binary distribution. We address both issues by introducing a relaxation of the problem via a novel reparameterization of the logitNormal distribution. We demonstrate that the proposed method provides an effective exploration scheme and leads to efficient feature selection for reconstruction through evaluation on several high dimensional image benchmarks. We show that the method leverages the intrinsic geometry of the data, facilitating reconstruction.
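The logitNormal reparameterization at the core of this relaxation can be sketched as a sigmoid-of-Gaussian sample (a simplified independent version; the paper additionally models the covariance across features to select variables in a correlated fashion):

```python
import numpy as np

def logit_normal_mask(mu, sigma, rng):
    """Relaxed binary mask: pass a Gaussian sample through a sigmoid.

    The sample stays differentiable w.r.t. (mu, sigma), unlike a hard
    Bernoulli draw, which is the point of the reparameterization."""
    eps = rng.standard_normal(np.shape(mu))
    z = np.asarray(mu) + np.asarray(sigma) * eps
    return 1.0 / (1.0 + np.exp(-z))
```

As sigma grows the mask entries concentrate near 0 and 1, approaching a binary selection while keeping gradients usable during training.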
【30】 Optimal Operation of Power Systems with Energy Storage under Uncertainty: A Scenario-based Method with Strategic Sampling 标题:不确定条件下储能电力系统最优运行:一种基于情景的策略抽样方法
作者:Ren Hu,Qifeng Li 链接:https://arxiv.org/abs/2107.10013 摘要:储能系统的多周期动态特性、间歇性可再生能源发电以及电力负荷的不可控性,使得电力系统运行优化具有挑战性。采用机会约束优化(CCO)建模方法,建立了不确定条件下的多周期最优PSO模型,其中约束条件包括非线性储能模型和交流潮流模型。针对这一具有挑战性的CCO问题,提出了一种新的求解方法。提出的方法在计算上是有效的,主要有两个原因。首先,利用一组基于广义最小绝对收缩和选择算子的学习辅助二次凸不等式来逼近交流潮流约束。其次,考虑到数据的物理模式,以基于学习的抽样为动力,提出了策略抽样方法,通过不同的抽样策略显著减少了所需的场景数。在IEEE标准系统上的仿真结果表明:1)所提出的策略抽样方法显著提高了基于情景的机会约束最优PSO问题求解方法的计算效率,2)数据驱动的潮流凸逼近是非线性、非凸交流潮流的一种很有前途的替代方法。 摘要:The multi-period dynamics of energy storage (ES), intermittent renewable generation and uncontrollable power loads, make the optimization of power system operation (PSO) challenging. A multi-period optimal PSO under uncertainty is formulated using the chance-constrained optimization (CCO) modeling paradigm, where the constraints include the nonlinear energy storage and AC power flow models. Based on the emerging scenario optimization method which does not rely on pre-known probability distribution functions, this paper develops a novel solution method for this challenging CCO problem. The proposed method is computationally effective for mainly two reasons. First, the original AC power flow constraints are approximated by a set of learning-assisted quadratic convex inequalities based on a generalized least absolute shrinkage and selection operator. Second, considering the physical patterns of data and motivated by learning-based sampling, the strategic sampling method is developed to significantly reduce the required number of scenarios through different sampling strategies. The simulation results on IEEE standard systems indicate that 1) the proposed strategic sampling significantly improves the computational efficiency of the scenario-based approach for solving the chance-constrained optimal PSO problem, 2) the data-driven convex approximation of power flow can be a promising alternative to nonlinear and nonconvex AC power flow.
【31】 Memorization in Deep Neural Networks: Does the Loss Function matter? 标题:深度神经网络中的记忆:损失函数重要吗?
作者:Deep Patel,P. S. Sastry 机构:Indian Institute of Science, Bangalore, India - 备注:Accepted at PAKDD 2021. 12 pages and 5 figures 链接:https://arxiv.org/abs/2107.09957 摘要:深度神经网络,往往由于过度参数化,显示出能够准确记忆甚至随机标记的数据。实证研究也表明,没有一种标准的正则化技术能够缓解这种过度拟合。我们研究损失函数的选择是否会影响这种记忆。我们用MNIST和CIFAR-10这两个基准数据集进行了实证研究,结果表明,相对于交叉熵或平方误差损失,对称损失函数显著提高了网络抵抗这种过度拟合的能力。然后,我们给出了记忆鲁棒性的形式化定义,并从理论上解释了为什么对称损失提供了这种鲁棒性。我们的结果清楚地表明,在这种记忆现象中,损失函数单独起作用。 摘要:Deep Neural Networks, often owing to the overparameterization, are shown to be capable of exactly memorizing even randomly labelled data. Empirical studies have also shown that none of the standard regularization techniques mitigate such overfitting. We investigate whether the choice of the loss function can affect this memorization. We empirically show, with benchmark data sets MNIST and CIFAR-10, that a symmetric loss function, as opposed to either cross-entropy or squared error loss, results in significant improvement in the ability of the network to resist such overfitting. We then provide a formal definition for robustness to memorization and provide a theoretical explanation as to why the symmetric losses provide this robustness. Our results clearly bring out the role loss functions alone can play in this phenomenon of memorization.
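The symmetric-loss property the paper studies is easy to check numerically: a loss is symmetric when its sum over all candidate labels is a constant, independent of the prediction. MAE against one-hot labels has this property, cross-entropy does not (a generic illustration of the definition, not the paper's exact experimental setup):

```python
import numpy as np

def mae_loss(probs, label):
    """Mean absolute error between predicted class probabilities and a one-hot label."""
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    return float(np.abs(probs - one_hot).sum())

# For K classes and any probability vector, summing mae_loss over all K
# candidate labels always gives 2*(K - 1); this constancy is the symmetry.
```

Intuitively, symmetry bounds how much any single (possibly wrong or random) label can shift the total loss, which is the mechanism behind the robustness to memorization.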
【32】 Boundary of Distribution Support Generator (BDSG): Sample Generation on the Boundary 标题:分布支持生成器(BDSG)的边界:边界上的样本生成
作者:Nikolaos Dionelis 机构:The University of Edinburgh, Edinburgh, UK 备注:None 链接:https://arxiv.org/abs/2107.09950 摘要:生成性模型,如生成性对抗网络(GANs),已被用于无监督异常检测。在性能不断提高的同时,仍存在一些局限,特别是难以刻画多模态支撑,以及难以在尾部(即分布支撑的边界)附近逼近基础分布。本文提出了一种试图减轻这些缺点的方法。提出了一种基于可逆残差网络的分布支持发生器(BDSG)边界模型。GANs一般不保证概率分布的存在,在这里,我们使用最近发展的可逆残差网络(IResNet)和残差流(ResFlow)进行密度估计。这些模型尚未用于异常检测。我们利用IResNet和ResFlow来进行非分布(OoD)样本检测,并使用复合损失函数来生成边界上的样本,该复合损失函数迫使样本位于边界上。BDSG解决了非凸支持、不相交分量和多峰分布。合成数据和来自多峰分布(如MNIST和CIFAR-10)的数据的结果表明,与文献中的方法相比具有竞争力。 摘要:Generative models, such as Generative Adversarial Networks (GANs), have been used for unsupervised anomaly detection. While performance keeps improving, several limitations exist, particularly difficulties in capturing multimodal supports and in approximating the underlying distribution near the tails, i.e. the boundary of the distribution's support. This paper proposes an approach that attempts to alleviate such shortcomings. We propose an invertible-residual-network-based model, the Boundary of Distribution Support Generator (BDSG). GANs generally do not guarantee the existence of a probability distribution and here, we use the recently developed Invertible Residual Network (IResNet) and Residual Flow (ResFlow), for density estimation. These models have not yet been used for anomaly detection. We leverage IResNet and ResFlow for Out-of-Distribution (OoD) sample detection and for sample generation on the boundary using a compound loss function that forces the samples to lie on the boundary. The BDSG addresses non-convex support, disjoint components, and multimodal distributions. Results on synthetic data and data from multimodal distributions, such as MNIST and CIFAR-10, demonstrate competitive performance compared to methods from the literature.
【33】 Online structural kernel selection for mobile health 标题:面向移动健康的在线结构内核选择
作者:Eura Shin,Pedja Klasnja,Susan Murphy,Finale Doshi-Velez 备注:Workshop paper in ICML IMLH 2021 链接:https://arxiv.org/abs/2107.09949 摘要:基于移动健康中高效个性化学习的需求,研究了多任务环境下高斯过程回归的在线核选择问题。为此,我们提出了一种新的生成过程。我们的方法证明了核演化的轨迹可以在用户之间传递以提高学习效率,并且核本身对于健康预测目标是有意义的。 摘要:Motivated by the need for efficient and personalized learning in mobile health, we investigate the problem of online kernel selection for Gaussian Process regression in the multi-task setting. We propose a novel generative process on the kernel composition for this purpose. Our method demonstrates that trajectories of kernel evolutions can be transferred between users to improve learning and that the kernels themselves are meaningful for an mHealth prediction goal.
【34】 A Statistical Model of Word Rank Evolution 标题:词排名演变的统计模型
作者:Alex John Quijano,Rick Dale,Suzanne Sindi 机构: Applied Mathematics, University of California Merced, Merced, California, USA, Department of Communications, University of California Los Angeles, Los Angeles 备注:This manuscript - with 53 pages and 28 figures - is a draft for a journal research article submission 链接:https://arxiv.org/abs/2107.09948 摘要:大量语言数据集的可用性使得数据驱动的方法能够研究语言变化。本文通过对googlebooks语料库单字频率数据集的研究,探讨了八种语言的词序动态。我们观察了从1900年到2008年unigrams的排名变化,并将其与我们为分析开发的Wright-Fisher启发模型进行了比较。该模型模拟了一个中立的进化过程,并以不存在消失词为限制条件。这项工作解释了数学框架的模型-写为一个马尔可夫链与多项式转移概率-显示如何频率的话在时间上的变化。从我们在数据和模型中的观察来看,词的秩稳定性表现出两种类型的特征:(1)秩的增加/减少是单调的,或(2)平均秩保持不变。根据我们的模型,排名高的词往往更稳定,而排名低的词往往更不稳定。有些词的等级变化有两种方式:(a)随着时间的推移,等级的小幅度增加或减少的累积;(b)等级增加或减少的冲击。大多数停止词和斯瓦德什语词在八种语言中都是稳定的。这些特征表明,所有语言中的单字频率都以一种与纯粹中性进化过程不一致的方式发生了变化。 摘要:The availability of large linguistic data sets enables data-driven approaches to study linguistic change. This work explores the word rank dynamics of eight languages by investigating the Google Books corpus unigram frequency data set. We observed the rank changes of the unigrams from 1900 to 2008 and compared it to a Wright-Fisher inspired model that we developed for our analysis. The model simulates a neutral evolutionary process with the restriction of having no disappearing words. This work explains the mathematical framework of the model - written as a Markov Chain with multinomial transition probabilities - to show how frequencies of words change in time. From our observations in the data and our model, word rank stability shows two types of characteristics: (1) the increase/decrease in ranks are monotonic, or (2) the average rank stays the same. Based on our model, high-ranked words tend to be more stable while low-ranked words tend to be more volatile. Some words change in ranks in two ways: (a) by an accumulation of small increasing/decreasing rank changes in time and (b) by shocks of increase/decrease in ranks. 
Most of the stopwords and Swadesh words are observed to be stable in ranks across eight languages. These signatures suggest unigram frequencies in all languages have changed in a manner inconsistent with a purely neutral evolutionary process.
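One generation of the neutral resampling process described above can be simulated directly (a minimal sketch: plain multinomial Wright-Fisher resampling; the paper's model additionally forbids words from disappearing, which this bare step does not enforce):

```python
import numpy as np

def wright_fisher_step(counts, rng):
    """One neutral Wright-Fisher generation: resample a fixed-size corpus
    multinomially from the current word frequencies."""
    counts = np.asarray(counts)
    n = counts.sum()
    return rng.multinomial(n, counts / n)
```

Iterating this step and re-ranking the counts each generation produces the neutral rank trajectories against which the empirical Google Books ranks are compared.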
【35】 Preventing dataset shift from breaking machine-learning biomarkers 标题:防止数据集移动破坏机器学习生物标记物
作者:Jérôme Dockès,Gaël Varoquaux,Jean-Baptiste Poline 机构:McGill University, INRIA. ∗Corresponding author. JB Poline and Gaël Varoquaux contributed equally to this work. 备注:GigaScience, BioMed Central, In press 链接:https://arxiv.org/abs/2107.09947 摘要:机器学习带来了从具有丰富生物医学测量数据的队列中提取新的生物标记物的希望。一个好的生物标志物是一个可靠的检测相应的条件。然而,生物标志物通常是从与目标人群不同的队列中提取的。这种不匹配,称为数据集偏移,可能会破坏生物标记物在新个体中的应用。在生物医学研究中,数据集的变化是经常发生的,例如由于招聘偏见。当数据集发生变化时,标准的机器学习技术不足以提取和验证生物标记。本文概述了数据集偏移何时以及如何破坏机器学习提取的生物标志物,以及相应的检测和校正策略。 摘要:Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g. because of recruitment biases. When a dataset shift occurs, standard machine-learning techniques do not suffice to extract and validate biomarkers. This article provides an overview of when and how dataset shifts break machine-learning-extracted biomarkers, as well as detection and correction strategies.
【36】 Design of Experiments for Stochastic Contextual Linear Bandits 标题:随机上下文线性老虎机的实验设计
作者:Andrea Zanette,Kefan Dong,Jonathan Lee,Emma Brunskill 机构:Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, Department of Computer Science 备注:Initial submission 链接:https://arxiv.org/abs/2107.09912 摘要:在随机线性背景bandit设置中,存在多个minimax程序用于策略的探索,这些策略对所获取的数据是反应性的。在实践中,部署这些算法可能会有很大的工程开销,特别是当数据集以分布式方式收集时,或者当需要人在回路中实现不同的策略时。在这种情况下,使用单一的非反应性策略进行探索是有益的。假设某些批处理上下文是可用的,我们设计一个单一的随机策略来收集一个好的数据集,从中可以提取一个接近最优的策略。我们提出了一个理论分析以及数值实验的合成和现实世界的数据集。 摘要:In the stochastic linear contextual bandit setting there exist several minimax procedures for exploration with policies that are reactive to the data being acquired. In practice, there can be a significant engineering overhead to deploy these algorithms, especially when the dataset is collected in a distributed fashion or when a human in the loop is needed to implement a different policy. Exploring with a single non-reactive policy is beneficial in such cases. Assuming some batch contexts are available, we design a single stochastic policy to collect a good dataset from which a near-optimal policy can be extracted. We present a theoretical analysis as well as numerical experiments on both synthetic and real-world datasets.
【37】 EMG Pattern Recognition via Bayesian Inference with Scale Mixture-Based Stochastic Generative Models 标题:基于尺度混合随机生成模型的贝叶斯推理肌电模式识别
作者:Akira Furui,Takuya Igaue,Toshio Tsuji 机构:Graduate School of Advanced Science and Engineering, Hiroshima University, Higashi-hiroshima ,-, Japan, Graduate School of Engineering, The University of Tokyo, Bunkyo-ku ,-, Japan 备注:This paper is accepted for publication in Expert Systems with Applications 链接:https://arxiv.org/abs/2107.09853 摘要:肌电图(EMG)由于能够反映人体运动意图,已被广泛应用于假手和信息设备的信号接口。虽然EMG分类方法已经被引入到基于EMG的控制系统中,但是它们没有完全考虑EMG信号的随机特性。提出了一种基于尺度混合生成模型的肌电模式分类方法。比例混合模型是一种随机肌电模型,它将肌电方差看作一个随机变量,使得方差中的不确定性得以表示。将该模型进行了扩展,并将其应用于肌电信号的模式分类。该方法通过变分贝叶斯学习进行训练,实现了模型复杂度的自动确定。此外,为了用部分判别法优化该方法的超参数,提出了一种基于互信息的确定方法。仿真和肌电分析实验验证了超参数与分类精度的关系以及该方法的有效性。使用公共肌电数据集进行的比较表明,该方法优于各种传统的分类器。这些结果表明了所提方法的有效性及其对肌电控制系统的适用性。在肌电模式识别中,基于能反映肌电信号随机特征的生成模型的分类器比传统的通用分类器具有更好的识别效果。 摘要:Electromyogram (EMG) has been utilized to interface signals for prosthetic hands and information devices owing to its ability to reflect human motion intentions. Although various EMG classification methods have been introduced into EMG-based control systems, they do not fully consider the stochastic characteristics of EMG signals. This paper proposes an EMG pattern classification method incorporating a scale mixture-based generative model. A scale mixture model is a stochastic EMG model in which the EMG variance is considered as a random variable, enabling the representation of uncertainty in the variance. This model is extended in this study and utilized for EMG pattern classification. The proposed method is trained by variational Bayesian learning, thereby allowing the automatic determination of the model complexity. Furthermore, to optimize the hyperparameters of the proposed method with a partial discriminative approach, a mutual information-based determination method is introduced. Simulation and EMG analysis experiments demonstrated the relationship between the hyperparameters and classification accuracy of the proposed method as well as the validity of the proposed method. 
The comparison using public EMG datasets revealed that the proposed method outperformed the various conventional classifiers. These results indicated the validity of the proposed method and its applicability to EMG-based control systems. In EMG pattern recognition, a classifier based on a generative model that reflects the stochastic characteristics of EMG signals can outperform the conventional general-purpose classifier.
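The scale-mixture idea (treating the EMG variance itself as a random variable rather than a constant) can be sketched as follows (an illustrative gamma-mixed precision, which yields a Student-t marginal; this is a generic scale mixture, not the paper's exact model or learning procedure):

```python
import numpy as np

def scale_mixture_samples(n, nu, scale, rng):
    """Draw n samples whose precision is itself random:
    precision ~ Gamma(nu/2, rate nu/2), then Gaussian given that precision.
    The marginal is heavier-tailed (Student-t) than a fixed-variance Gaussian."""
    precision = rng.gamma(nu / 2.0, 2.0 / nu, size=n)  # shape, scale parametrization
    return scale * rng.standard_normal(n) / np.sqrt(precision)
```

Because the variance is random, the model can express uncertainty in the EMG amplitude itself, which is the property the proposed classifier exploits.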
【38】 Private Alternating Least Squares: Practical Private Matrix Completion with Tighter Rates 标题:私有交替最小二乘:具有更紧速率的实用私有矩阵完成
作者:Steve Chien,Prateek Jain,Walid Krichene,Steffen Rendle,Shuang Song,Abhradeep Thakurta,Li Zhang 链接:https://arxiv.org/abs/2107.09802 摘要:研究了用户级隐私下的差分私有矩阵完备问题。我们设计了一种流行的交替最小二乘(ALS)方法的联合差分私有变体,该方法实现了:(i)矩阵完成的样本复杂度(以项目数、用户数为单位)接近最优,以及(ii)理论上和基准数据集上最著名的隐私/效用权衡。特别是,我们首次对引入噪声以确保DP的ALS进行了全局收敛性分析,并表明,与最著名的替代方案(Jain et al.(2018)提出的私有Frank Wolfe算法)相比,我们的误差界限在项目和用户数量方面具有更好的伸缩性,这在实际问题中是至关重要的。在标准基准上的广泛验证表明,该算法与精心设计的采样程序相结合,比现有的技术具有更高的精度,有望成为第一个实用的DP嵌入模型。 摘要:We study the problem of differentially private (DP) matrix completion under user-level privacy. We design a joint differentially private variant of the popular Alternating-Least-Squares (ALS) method that achieves: i) (nearly) optimal sample complexity for matrix completion (in terms of number of items, users), and ii) the best known privacy/utility trade-off both theoretically, as well as on benchmark data sets. In particular, we provide the first global convergence analysis of ALS with noise introduced to ensure DP, and show that, in comparison to the best known alternative (the Private Frank-Wolfe algorithm by Jain et al. (2018)), our error bounds scale significantly better with respect to the number of items and users, which is critical in practical problems. Extensive validation on standard benchmarks demonstrate that the algorithm, in combination with carefully designed sampling procedures, is significantly more accurate than existing techniques, thus promising to be the first practical DP embedding model.
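One noisy ALS half-step conveys the flavor of the mechanism (a sketch only: Gaussian noise added to the least-squares sufficient statistics; the paper's actual noise calibration, clipping, and joint-DP accounting are considerably more involved):

```python
import numpy as np

def private_user_update(V, ratings, item_idx, reg, noise_scale, rng):
    """Solve the regularized least-squares problem for one user's factor,
    with Gaussian noise injected into the sufficient statistics."""
    Vi = V[item_idx]                                  # factors of items this user rated
    A = Vi.T @ Vi + reg * np.eye(V.shape[1])
    b = Vi.T @ np.asarray(ratings, dtype=float)
    # Noise on (A, b) masks any single user's contribution
    A = A + noise_scale * rng.standard_normal(A.shape)
    b = b + noise_scale * rng.standard_normal(b.shape)
    return np.linalg.solve(A, b)
```

With `noise_scale = 0` this reduces to the ordinary ALS user update, which makes the privacy/utility trade-off explicit: larger noise gives stronger privacy but a noisier factor.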
【39】 Statistical Estimation from Dependent Data 标题:相依数据的统计估计
作者:Yuval Dagan,Constantinos Daskalakis,Nishanth Dikkala,Surbhi Goel,Anthimos Vardis Kandiros 机构:EECS & CSAIL, MIT, GOOGLE RESEARCH, Microsoft Research NYC 备注:41 pages, ICML 2021 链接:https://arxiv.org/abs/2107.09773 摘要:我们考虑一个一般的统计估计问题,其中跨不同观察的二进制标签不是独立地对它们的特征向量进行调节,而是依赖的捕获设置,例如,这些观测被收集在空间域、时间域或社交网络上,从而引起依赖性。我们用马尔可夫随机场的语言来建模这些依赖关系,重要的是,允许这些依赖关系是实质性的,也就是说,不假设捕获这些依赖关系的马尔可夫随机场处于高温状态。作为我们的主要贡献,我们为这个模型提供了算法和统计上有效的估计率,给出了logistic回归、稀疏logistic回归和依赖数据的神经网络设置的一些例子。我们的估计保证遵循从单个样本估计Ising模型参数(即外场和相互作用强度)的新结果。我们在真实的网络数据上评估了我们的估计方法,结果表明,在三个文本分类数据集(Cora、Citeseer和Pubmed)上,它优于忽略依赖关系的标准回归方法。 摘要:We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors, but dependent, capturing settings where e.g. these observations are collected on a spatial domain, a temporal domain, or a social network, which induce dependencies. We model these dependencies in the language of Markov Random Fields and, importantly, allow these dependencies to be substantial, i.e. do not assume that the Markov Random Field capturing these dependencies is in high temperature. As our main contribution we provide algorithms and statistically efficient estimation rates for this model, giving several instantiations of our bounds in logistic regression, sparse logistic regression, and neural network settings with dependent data. Our estimation guarantees follow from novel results for estimating the parameters (i.e. external fields and interaction strengths) of Ising models from a single sample. We evaluate our estimation approach on real networked data, showing that it outperforms standard regression approaches that ignore dependencies, across three text classification datasets: Cora, Citeseer and Pubmed.