Statistics Academic Digest [7.15]

2021-07-27 10:57:20

Visit www.arxivdaily.com for digests with abstracts, covering CS | physics | mathematics | economics | statistics | finance | biology | electrical engineering, plus search, bookmarking, and posting features!

stat (Statistics): 31 papers in total

【1】 Comparison of Canonical Correlation and Partial Least Squares analyses of simulated and empirical data

Authors: Anthony R McIntosh
Affiliations: Rotman Research Institute, Baycrest Centre; Dept of Psychology, University of Toronto, Toronto ON, Canada
Comments: 40 pages, 12 figures, 14 tables
Link: https://arxiv.org/abs/2107.06867
Abstract: In this paper, we compared the general forms of CCA and PLS on three simulated and two empirical datasets, all having large sample sizes. We took successively smaller subsamples of these data to evaluate sensitivity, reliability, and reproducibility. In null data having no correlation within or between blocks, both methods showed equivalent false positive rates across sample sizes. Both methods also showed equivalent detection in data with weak but reliable effects until sample sizes dropped below n=50. In the case of strong effects, both methods showed similar performance unless the correlations of items within one data block were high. For PLS, the results were reproducible across sample sizes for strong effects, except at the smallest sample sizes. On the contrary, the reproducibility for CCA declined when the within-block correlations were high. This was ameliorated if a principal components analysis (PCA) was performed and the component scores used to calculate the cross-block matrix. The outcome of our examination gives three messages. First, for data with reasonable within- and between-block structure, CCA and PLS give comparable results. Second, high correlations within either block can compromise the reliability of CCA results. This known issue of CCA can be remedied with PCA before the cross-block calculation. This, however, assumes that the PCA structure is stable for a given sample. Third, null hypothesis testing does not guarantee that the results are reproducible, even with large sample sizes. This final outcome suggests that both statistical significance and reproducibility be assessed for any data.
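
To make the comparison concrete, here is a minimal sketch (not the paper's code) using scikit-learn's CCA and PLSCanonical on simulated two-block data; the final lines show the PCA-on-component-scores variant discussed in the abstract. All simulation settings are illustrative.

```python
# Minimal sketch (assumed setup, not the paper's code): compare CCA and PLS
# on simulated two-block data, with optional PCA before the cross-block step.
import numpy as np
from sklearn.cross_decomposition import CCA, PLSCanonical
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 1))                 # shared signal across blocks
X = latent @ rng.normal(size=(1, 10)) + rng.normal(size=(n, 10))
Y = latent @ rng.normal(size=(1, 8)) + rng.normal(size=(n, 8))

for name, model in [("CCA", CCA(n_components=1)),
                    ("PLS", PLSCanonical(n_components=1))]:
    Xs, Ys = model.fit_transform(X, Y)
    r = np.corrcoef(Xs[:, 0], Ys[:, 0])[0, 1]
    print(f"{name}: first-pair correlation = {r:.3f}")

# PCA-based variant: replace each block by its component scores first,
# which stabilizes CCA when within-block correlations are high.
Xp = PCA(n_components=5).fit_transform(X)
Yp = PCA(n_components=5).fit_transform(Y)
Xs, Ys = CCA(n_components=1).fit_transform(Xp, Yp)
print("CCA on PCA scores:", np.corrcoef(Xs[:, 0], Ys[:, 0])[0, 1])
```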

【2】 High-dimensional Precision Matrix Estimation with a Known Graphical Structure

Authors: Thien-Minh Le, Ping-Shou Zhong
Affiliations: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, U.S.A.; Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, Illinois, U.S.A.
Link: https://arxiv.org/abs/2107.06815
Abstract: A precision matrix is the inverse of a covariance matrix. In this paper, we study the problem of estimating the precision matrix with a known graphical structure under high-dimensional settings. We propose a simple estimator of the precision matrix based on the connection between the known graphical structure and the precision matrix. We obtain the rates of convergence of the proposed estimator and derive its asymptotic normality in the high-dimensional setting where the data dimension grows with the sample size. Numerical simulations are conducted to demonstrate the performance of the proposed method. We also show that the proposed method outperforms some existing methods that do not utilize the graphical structure information.
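
As a toy illustration of the connection the abstract builds on (zeros of the precision matrix encode missing edges of the graph), here is a short numpy sketch; it shows only the naive inverse-sample-covariance estimate, not the paper's structure-aware estimator.

```python
# Illustrative only: the zero pattern of a precision matrix Omega encodes the
# graphical structure -- Omega[i, j] == 0 iff there is no edge {i, j}.
import numpy as np

p = 5
Omega = np.eye(p)
Omega[0, 1] = Omega[1, 0] = 0.4   # edge {0, 1}
Omega[2, 3] = Omega[3, 2] = -0.3  # edge {2, 3}

Sigma = np.linalg.inv(Omega)                      # covariance implied by Omega
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=2000)

# Naive estimate: invert the sample covariance, ignoring the known graph;
# the paper's estimator instead exploits the known zero pattern directly.
Omega_hat = np.linalg.inv(np.cov(X, rowvar=False))
print(np.round(Omega_hat, 2))
```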

【3】 Correlated Stochastic Block Models: Exact Graph Matching with Applications to Recovering Communities

Authors: Miklos Z. Racz, Anirudh Sridhar
Comments: 42 pages, 4 figures
Link: https://arxiv.org/abs/2107.06767
Abstract: We consider the task of learning latent community structure from multiple correlated networks. First, we study the problem of learning the latent vertex correspondence between two edge-correlated stochastic block models, focusing on the regime where the average degree is logarithmic in the number of vertices. We derive the precise information-theoretic threshold for exact recovery: above the threshold there exists an estimator that outputs the true correspondence with probability close to 1, while below it no estimator can recover the true correspondence with probability bounded away from 0. As an application of our results, we show how one can exactly recover the latent communities using multiple correlated graphs in parameter regimes where it is information-theoretically impossible to do so using just a single graph.

【4】 New Developments on the Non-Central Chi-Squared and Beta Distributions

Authors: Carlo Orsi
Comments: accepted for publication at Austrian Journal of Statistics on 06 May 2021
Link: https://arxiv.org/abs/2107.06689
Abstract: New formulas for the moments about zero of the Non-central Chi-Squared and the Non-central Beta distributions are achieved by means of novel approaches. The mixture representation of the former model and a new expansion of the ascending factorial of a binomial are the main ingredients of the first approach, whereas the second one hinges on an interesting relationship of conditional independence and a simple conditional density of the latter model. Then, a simulation study is carried out in order to pursue a twofold purpose: providing numerical validations of the derived moment formulas on one side and discussing the advantages of the new formulas over the existing ones on the other.
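
In the spirit of the numerical validation described, one can check the classical closed-form raw moments of the non-central chi-squared against scipy (the formulas below are the standard ones, not the paper's new ones):

```python
# Sketch: verify standard raw moments of the non-central chi-squared numerically.
from scipy.stats import ncx2

df, nc = 4.0, 2.5                          # degrees of freedom k, non-centrality lambda
m1 = df + nc                               # E[X]   = k + lambda
m2 = (df + nc) ** 2 + 2 * (df + 2 * nc)    # E[X^2] = (k + lambda)^2 + 2(k + 2 lambda)

print(m1, ncx2.moment(1, df, nc))          # both ~ 6.5
print(m2, ncx2.moment(2, df, nc))          # both ~ 60.25
```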

【5】 M5 Competition Uncertainty: Overdispersion, distributional forecasting, GAMLSS and beyond

Authors: Florian Ziel
Affiliations: University of Duisburg-Essen
Link: https://arxiv.org/abs/2107.06675
Abstract: The M5 competition uncertainty track aims for probabilistic forecasting of sales of thousands of Walmart retail goods. We show that the M5 competition data face strong overdispersion and sporadic demand, especially zero demand. We discuss the resulting modeling issues concerning adequate probabilistic forecasting of such count-data processes. Unfortunately, the majority of popular prediction methods used in the M5 competition (e.g. lightgbm and xgboost GBMs) fail to address the data characteristics, due to the objective functions they consider. Distributional forecasting provides a suitable modeling approach to overcome those problems. The GAMLSS framework allows flexible probabilistic forecasting using low-dimensional distributions. We illustrate how the GAMLSS approach can be applied to the M5 competition data by modeling the location and scale parameters of various distributions, e.g. the negative binomial distribution. Finally, we discuss software packages for distributional modeling and their drawbacks, like the R package gamlss with its package extensions, and (deep) distributional forecasting libraries such as TensorFlow Probability.
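
The paper works with R's gamlss; as a rough Python analogue (an assumption, not the author's setup), the sketch below shows how a negative binomial GLM accommodates overdispersed counts that a Poisson model cannot. Note that GAMLSS additionally models the scale parameter as a function of covariates, which a plain GLM does not.

```python
# Hedged sketch: overdispersed counts fit with Poisson vs. negative binomial GLMs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x)
alpha = 1.5                                     # NB2 overdispersion parameter
y = rng.negative_binomial(1 / alpha, 1 / (1 + alpha * mu))  # mean mu, var mu(1+alpha*mu)

X = sm.add_constant(x)
poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
negbin = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=alpha)).fit()

# Pearson chi2 / df near 1 indicates an adequate variance model.
print("Poisson:", round(poisson.pearson_chi2 / poisson.df_resid, 2))   # >> 1
print("NegBin: ", round(negbin.pearson_chi2 / negbin.df_resid, 2))     # ~ 1
```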

【6】 Time Series Estimation of the Dynamic Effects of Disaster-Type Shock

Authors: Richard Davis, Serena Ng
Link: https://arxiv.org/abs/2107.06663
Abstract: The paper provides three results for SVARs under the assumption that the primitive shocks are mutually independent. First, a framework is proposed to study the dynamic effects of disaster-type shocks with infinite variance. We show that the least squares estimates of the VAR are consistent but have non-standard properties. Second, it is shown that the restrictions imposed on a SVAR can be validated by testing independence of the identified shocks. The test can be applied whether the data have fat or thin tails, and to over- as well as exactly-identified models. Third, the disaster shock is identified as the component with the largest kurtosis, where the mutually independent components are estimated using an estimator that is valid even in the presence of an infinite variance shock. Two applications are considered. In the first, the independence test is used to shed light on the conflicting evidence regarding the role of uncertainty in economic fluctuations. In the second, disaster shocks are shown to have short-term economic impact arising mostly from feedback dynamics.
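
A hedged sketch of the identification idea, using sklearn's FastICA as a stand-in for the paper's estimator (which, unlike FastICA, is shown to remain valid under infinite variance): recover mutually independent components from residuals and label the maximum-kurtosis component as the disaster shock.

```python
# Sketch under assumptions: ICA stands in for the paper's component estimator.
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 5000
shocks = np.column_stack([
    rng.standard_normal(n),            # thin-tailed shock
    rng.standard_t(df=3, size=n),      # heavy-tailed "disaster" shock
])
residuals = shocks @ np.array([[1.0, 0.4], [0.3, 1.0]]).T   # mixed VAR residuals

components = FastICA(n_components=2, random_state=0).fit_transform(residuals)
k = kurtosis(components, axis=0)                            # excess kurtosis
print("excess kurtosis per component:", np.round(k, 1))
print("disaster shock = component", int(np.argmax(k)))
```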

【7】 Bayesian Lifetime Regression with Multi-type Group-shared Latent Heterogeneity

Authors: Xuxue Sun, Mingyang Li
Affiliations: Department of Industrial and Management Systems Engineering, University of South Florida, USA
Comments: 22 pages
Link: https://arxiv.org/abs/2107.06539
Abstract: Products manufactured from the same batch or utilized in the same region often exhibit correlated lifetime observations due to the latent heterogeneity caused by the influence of shared but unobserved covariates. The unavailable group-shared covariates involve multiple different types (e.g., discrete, continuous, or mixed-type) and induce different structures of indispensable group-shared latent heterogeneity. Without carefully capturing such latent heterogeneity, the lifetime modeling accuracy will be significantly undermined. In this work, we propose a generic Bayesian lifetime modeling approach by comprehensively investigating the structures of group-shared latent heterogeneity caused by different types of group-shared unobserved covariates. The proposed approach is flexible enough to characterize multi-type group-shared latent heterogeneity in lifetime data. Besides, it can handle the case of missing group membership information and address the issue of limited sample size. A Bayesian sampling algorithm with a data augmentation technique is further developed to jointly quantify the influence of observed covariates and group-shared latent heterogeneity. We then conduct a comprehensive numerical study to demonstrate the improved performance of the proposed modeling approach via comparison with alternative models. We also present empirical study results to investigate the impacts of group number and sample size per group on estimating the group-shared latent heterogeneity, and to demonstrate model identifiability of the proposed approach for different structures of unobserved group-shared covariates. We also present a real case study to illustrate the effectiveness of the proposed approach.

【8】 Method of Moments Confidence Intervals for a Semi-Supervised Two-Component Mixture Model

Authors: Bradley Lubich, Daniel Jeske, Weixin Yao
Link: https://arxiv.org/abs/2107.06503
Abstract: A mixture of a distribution of responses from untreated patients and a shift of that distribution is a useful model for the responses from a group of treated patients. The mixture model accounts for the fact that not all the patients in the treated group will respond to the treatment and consequently their responses follow the same distribution as the responses from untreated patients. The treatment effect in this context consists of both the fraction of the treated patients that are responders and the magnitude of the shift in the distribution for the responders. In this paper, we investigate properties of the method of moments estimators for the treatment effect and demonstrate their usefulness for obtaining approximate confidence intervals without any parametric assumptions about the distribution of responses.

【9】 Spectrum Gaussian Processes Based On Tunable Basis Functions

Authors: Wenqi Fang, Guanlin Wu, Jingjing Li, Zheng Wang, Jiang Cao, Yang Ping
Affiliations: Nanhu Laboratory, Jiaxing, P.R. China; National University of Defense Technology, Changsha, P.R. China; Tongzhu Technology Co., Ltd, Jiaxing, P.R. China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, P.R. China
Comments: 10 figures
Link: https://arxiv.org/abs/2107.06473
Abstract: Spectral approximation and variational inducing learning for the Gaussian process are two popular methods to reduce computational complexity. However, in previous research, those methods always tend to adopt orthonormal basis functions, such as eigenvectors in the Hilbert space in the spectrum method, or decoupled orthogonal components in the variational framework. In this paper, inspired by quantum physics, we introduce a novel basis function, which is tunable, local and bounded, to approximate the kernel function in the Gaussian process. There are two adjustable parameters in these functions, which control their orthogonality to each other and limit their boundedness. We conduct extensive experiments on open-source datasets to test its performance. Compared to several state-of-the-art methods, it turns out that the proposed method can obtain satisfactory or even better results, especially with poorly chosen kernel functions.

【10】 Survey data integration for regression analysis using model calibration

Authors: Hang J. Kim, Zhonglei Wang, Jae Kwang Kim
Affiliations: Division of Statistics and Data Science, University of Cincinnati; Wang Yanan Institute for Studies in Economics, Xiamen University; Department of Statistics, Iowa State University
Link: https://arxiv.org/abs/2107.06448
Abstract: We consider regression analysis in the context of data integration. To combine partial information from external sources, we employ the idea of model calibration, which introduces a "working" reduced model based on the observed covariates. The working reduced model is not necessarily correctly specified but can be a useful device to incorporate the partial information from the external data. The actual implementation is based on a novel application of the empirical likelihood method. The proposed method is particularly attractive for combining information from several sources with different missing patterns. The proposed method is applied to a real data example combining survey data from the Korean National Health and Nutrition Examination Survey and big data from the National Health Insurance Sharing Service in Korea.

【11】 Bayesian Semiparametric Multivariate Density Deconvolution via Stochastic Rotation of Replicates

Authors: Arkaprava Roy, Abhra Sarkar
Affiliations: Department of Biostatistics, University of Florida, Gainesville, FL, USA; Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX, USA
Comments: arXiv admin note: text overlap with arXiv:1912.05084
Link: https://arxiv.org/abs/2107.06436
Abstract: We consider the problem of multivariate density deconvolution, where the distribution of a random vector needs to be estimated from replicates contaminated with conditionally heteroscedastic measurement errors. We propose a conceptually straightforward yet fundamentally novel and highly robust approach to multivariate density deconvolution by stochastically rotating the replicates toward the corresponding true latent values. We also address the additional, significantly challenging problem of accommodating conditionally heteroscedastic measurement errors in this newly introduced framework. We take a Bayesian route to estimation and inference, implemented via an efficient Markov chain Monte Carlo algorithm, appropriately accommodating uncertainty in all aspects of our analysis. Asymptotic convergence guarantees for the method are also established. We illustrate the method's empirical efficacy through simulation experiments and its practical utility in estimating the long-term joint average intakes of different dietary components from their measurement-error-contaminated 24-hour dietary recalls.

【12】 For high-dimensional hierarchical models, consider exchangeability of effects across covariates instead of across datasets

Authors: Brian L. Trippe, Hilary K. Finucane, Tamara Broderick
Affiliations: MIT CSAIL; Broad Institute
Comments: 10 pages plus supplementary material
Link: https://arxiv.org/abs/2107.06428
Abstract: Hierarchical Bayesian methods enable information sharing across multiple related regression problems. While standard practice is to model regression parameters (effects) as (1) exchangeable across datasets and (2) correlated to differing degrees across covariates, we show that this approach exhibits poor statistical performance when the number of covariates exceeds the number of datasets. For instance, in statistical genetics, we might regress dozens of traits (defining datasets) for thousands of individuals (responses) on up to millions of genetic variants (covariates). When an analyst has more covariates than datasets, we argue that it is often more natural to instead model effects as (1) exchangeable across covariates and (2) correlated to differing degrees across datasets. To this end, we propose a hierarchical model expressing our alternative perspective. We devise an empirical Bayes estimator for learning the degree of correlation between datasets. We develop theory that demonstrates that our method outperforms the classic approach when the number of covariates dominates the number of datasets, and corroborate this result empirically on several high-dimensional multiple regression and classification problems.

【13】 Planning a method for covariate adjustment in individually-randomised trials: a practical guide

Authors: Tim P. Morris, A. Sarah Walker, Elizabeth J. Williamson, Ian R. White
Affiliations: MRC Clinical Trials Unit at UCL, London, UK
Comments: 1 figure, 5 main tables, 2 appendices, 2 appendix tables
Link: https://arxiv.org/abs/2107.06398
Abstract: Background: It has long been advised to account for baseline covariates in the analysis of confirmatory randomised trials, with the main statistical justifications being that this increases power and, when a randomisation scheme balanced covariates, permits a valid estimate of experimental error. There are various methods available to account for covariates. Methods: We consider how, at the point of writing a statistical analysis plan, to choose between three broad approaches: direct adjustment, standardisation and inverse-probability-of-treatment weighting (IPTW), which are in our view the most promising methods. Using the GetTested trial, a randomised trial designed to assess the effectiveness of an electronic STI (sexually transmitted infection) testing and results service, we illustrate how a method might be chosen in advance and show some of the anticipated issues in action. Results: The choice of approach is not straightforward, particularly with models for binary outcome measures, where we focus most of our attention. We compare the properties of the three broad approaches in terms of the quantity they target (estimand), how a method performs under model misspecification, convergence issues, handling designed balance, precision of estimators, estimation of standard errors, and finally clarify some issues around handling of missing data. Conclusions: We conclude that no single approach is always best and explain why the choice will depend on the trial context, but encourage trialists to consider the three methods more routinely.
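
To fix ideas, here is a toy sketch of the three approaches on simulated trial data (variable names and parameters are hypothetical, not from GetTested):

```python
# Illustrative sketch of direct adjustment, standardisation, and IPTW
# for a binary outcome in an individually randomised trial.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
baseline = rng.normal(size=n)                         # baseline covariate
treat = rng.integers(0, 2, size=n).astype(float)      # randomised arm
logit = -1.0 + 0.8 * treat + 0.6 * baseline
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# 1) Direct adjustment: logistic regression including the covariate.
X = np.column_stack([np.ones(n), treat, baseline])
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# 2) Standardisation: average predictions with everyone set to each arm.
X1 = np.column_stack([np.ones(n), np.ones(n), baseline])
X0 = np.column_stack([np.ones(n), np.zeros(n), baseline])
risk_diff = fit.predict(X1).mean() - fit.predict(X0).mean()

# 3) IPTW: weight each patient by the inverse probability of the arm received.
ps_fit = sm.GLM(treat, np.column_stack([np.ones(n), baseline]),
                family=sm.families.Binomial()).fit()
ps = ps_fit.fittedvalues
w = treat / ps + (1 - treat) / (1 - ps)
iptw_diff = (np.average(y[treat == 1], weights=w[treat == 1])
             - np.average(y[treat == 0], weights=w[treat == 0]))

print("standardised risk difference:", round(risk_diff, 3))
print("IPTW risk difference:        ", round(iptw_diff, 3))
```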

【14】 Whiteout: when do fixed-X knockoffs fail?

Authors: Xiao Li, William Fithian
Affiliations: Department of Statistics, UC Berkeley
Link: https://arxiv.org/abs/2107.06388
Abstract: A core strength of knockoff methods is their virtually limitless customizability, allowing an analyst to exploit machine learning algorithms and domain knowledge without threatening the method's robust finite-sample false discovery rate control guarantee. While several previous works have investigated regimes where specific implementations of knockoffs are provably powerful, general negative results are more difficult to obtain for such a flexible method. In this work we recast the fixed-$X$ knockoff filter for the Gaussian linear model as a conditional post-selection inference method. It adds user-generated Gaussian noise to the ordinary least squares estimator $\hat\beta$ to obtain a "whitened" estimator $\widetilde\beta$ with uncorrelated entries, and performs inference using $\text{sgn}(\widetilde\beta_j)$ as the test statistic for $H_j: \beta_j = 0$. We prove equivalence between our whitening formulation and the more standard formulation involving negative control predictor variables, showing how the fixed-$X$ knockoffs framework can be used for multiple testing on any problem with (asymptotically) multivariate Gaussian parameter estimates. Relying on this perspective, we obtain the first negative results that universally upper-bound the power of all fixed-$X$ knockoff methods, without regard to choices made by the analyst. Our results show roughly that, if the leading eigenvalues of $\text{Var}(\hat\beta)$ are large with dense leading eigenvectors, then there is no way to whiten $\hat\beta$ without irreparably erasing nearly all of the signal, rendering $\text{sgn}(\widetilde\beta_j)$ too uninformative for accurate inference. We give conditions under which the true positive rate (TPR) for any fixed-$X$ knockoff method must converge to zero even while the TPR of Bonferroni-corrected multiple testing tends to one, and we explore several examples illustrating this phenomenon.
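
A small numerical sketch of the whitening construction described above (illustrative, with $\sigma^2$ treated as known): add Gaussian noise with covariance $\lambda I - \text{Var}(\hat\beta)$, $\lambda \ge \lambda_{\max}$, so the whitened entries are uncorrelated.

```python
# Sketch of the whitening view: beta_tilde = beta_hat + noise with Var = lam*I.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 20, 1.0
X = rng.normal(size=(n, d))
beta = np.zeros(d); beta[:3] = 0.5
y = X @ beta + sigma * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
Sigma = sigma**2 * XtX_inv                    # Var(beta_hat), sigma^2 known here

# The whitening noise covariance lam*I - Sigma must be PSD, so lam >= lambda_max.
lam = 1.001 * np.linalg.eigvalsh(Sigma).max()
noise = rng.multivariate_normal(np.zeros(d), lam * np.eye(d) - Sigma)
beta_tilde = beta_hat + noise                 # entries now uncorrelated
print(np.sign(beta_tilde))                    # sgn(beta_tilde_j): the test statistics
```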

【15】 Extended L-ensembles: a new representation for Determinantal Point Processes

Authors: Nicolas Tremblay, Simon Barthelmé, Konstantin Usevich, Pierre-Olivier Amblard
Affiliations: CNRS, Univ. Grenoble Alpes, Grenoble INP, GIPSA-lab; Université de Lorraine and CNRS, CRAN (Centre de Recherche en Automatique en Nancy)
Comments: Most of this material appeared in a previous arXiv submission (arXiv:2007.04117); two sections are new, and some things have been rephrased
Link: https://arxiv.org/abs/2107.06345
Abstract: Determinantal point processes (DPPs) are a class of repulsive point processes, popular for their relative simplicity. They are traditionally defined via their marginal distributions, but a subset of DPPs called "L-ensembles" have tractable likelihoods and are thus particularly easy to work with. Indeed, in many applications, DPPs are more naturally defined based on the L-ensemble formulation rather than through the marginal kernel. The fact that not all DPPs are L-ensembles is unfortunate, but there is a unifying description. We introduce here extended L-ensembles, and show that all DPPs are extended L-ensembles (and vice-versa). Extended L-ensembles have very simple likelihood functions, and contain L-ensembles and projection DPPs as special cases. From a theoretical standpoint, they fix some pathologies in the usual formalism of DPPs, for instance the fact that projection DPPs are not L-ensembles. From a practical standpoint, they extend the set of kernel functions that may be used to define DPPs: we show that conditional positive definite kernels are good candidates for defining DPPs, including DPPs that need no spatial scale parameter. Finally, extended L-ensembles are based on so-called "saddle-point matrices", and we prove an extension of the Cauchy-Binet theorem for such matrices that may be of independent interest.
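
For reference, a minimal sketch of the standard L-ensemble facts that the paper generalizes: the marginal kernel is $K = L(I+L)^{-1}$, and inclusion probabilities are principal minors of $K$.

```python
# Standard L-ensemble DPP facts (not the paper's extended construction).
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 3))
L = B @ B.T                                  # PSD L-ensemble kernel
K = L @ np.linalg.inv(np.eye(5) + L)         # marginal kernel K = L (I + L)^{-1}

S = [0, 2]
# P(S is contained in a draw) = det(K_S), the principal minor indexed by S.
print("P(items 0 and 2 both included) =", np.linalg.det(K[np.ix_(S, S)]))
```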

【16】 Scalable Optimal Transport in High Dimensions for Graph Distances, Embedding Alignment, and More

Authors: Johannes Klicpera, Marten Lienen, Stephan Günnemann
Affiliations: Technical University of Munich
Comments: Published as a conference paper at ICML 2021
Link: https://arxiv.org/abs/2107.06876
Abstract: The current best practice for computing optimal transport (OT) is via entropy regularization and Sinkhorn iterations. This algorithm runs in quadratic time as it requires the full pairwise cost matrix, which is prohibitively expensive for large sets of objects. In this work we propose two effective log-linear time approximations of the cost matrix: First, a sparse approximation based on locality-sensitive hashing (LSH) and, second, a Nyström approximation with LSH-based sparse corrections, which we call locally corrected Nyström (LCN). These approximations enable general log-linear time algorithms for entropy-regularized OT that perform well even for the complex, high-dimensional spaces common in deep learning. We analyse these approximations theoretically and evaluate them experimentally both directly and end-to-end as a component for real-world applications. Using our approximations for unsupervised word embedding alignment enables us to speed up a state-of-the-art method by a factor of 3 while also improving the accuracy by 3.1 percentage points without any additional model changes. For graph distance regression we propose the graph transport network (GTN), which combines graph neural networks (GNNs) with enhanced Sinkhorn. GTN outcompetes previous models by 48% and still scales log-linearly in the number of nodes.
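
For context, here is the quadratic-time Sinkhorn baseline that the LSH/LCN approximations accelerate (a standard reference implementation, not the paper's code):

```python
# Dense entropy-regularized OT via Sinkhorn iterations (quadratic-time baseline).
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 100, 120, 0.05
x, y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)    # full pairwise cost matrix
K = np.exp(-C / eps)                                  # Gibbs kernel
a, b = np.full(n, 1 / n), np.full(m, 1 / m)           # uniform marginals

u, v = np.ones(n), np.ones(m)
for _ in range(500):                                  # Sinkhorn fixed-point updates
    u = a / (K @ v)
    v = b / (K.T @ u)

P = u[:, None] * K * v[None, :]                       # transport plan
print("entropic OT cost ~", (P * C).sum())
```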

【17】 A Particle Filter Approach to Power System Line Outage Detection Using Load and Generator Bus Dynamics

Authors: Xiaozhou Yang, Nan Chen, Chao Zhai
Comments: 9 pages, 7 figures
Link: https://arxiv.org/abs/2107.06754
Abstract: Limited phasor measurement unit (PMU) coverage and varying signal strength levels make fast real-time transmission-line outage detection challenging. Existing approaches focus on monitoring nodal algebraic variables, i.e., voltage phase angle and magnitude. Their effectiveness is predicated on both strong outage signals in voltage and PMUs in the vicinity of the outage location. We propose a unified detection framework that utilizes both generator dynamic states and nodal voltage information. The inclusion of generator dynamics makes detection faster and more robust to a priori unknown outage locations, which we demonstrate using the IEEE 39-bus test system. In particular, the scheme achieves an over 80% detection rate for 80% of the lines, and most outages are detected within 0.2 seconds. The new approach could be implemented to improve system operators' real-time situational awareness by detecting outages faster and providing a breakdown of outage signals for diagnostic purposes, making power systems more resilient.

【18】 Financial Return Distributions: Past, Present, and COVID-19

Authors: Marcin Wątorek, Jarosław Kwapień, Stanisław Drożdż
Link: https://arxiv.org/abs/2107.06659
Abstract: We analyze the price return distributions of currency exchange rates, cryptocurrencies, and contracts for differences (CFDs) representing stock indices, stock shares, and commodities. Based on recent data from the years 2017-2020, we model tails of the return distributions at different time scales by using power-law, stretched exponential, and $q$-Gaussian functions. We focus on the fitted function parameters and how they change over the years by comparing our results with those from earlier studies, and find that, on time horizons of up to a few minutes, the so-called "inverse-cubic power-law" still constitutes an appropriate global reference. However, we no longer observe the hypothesized universal constant acceleration of the market time flow that was manifested before in an ever faster convergence of empirical return distributions towards the normal distribution. Our results do not exclude such a scenario but, rather, suggest that some other short-term processes related to a current market situation alter market dynamics and may mask this scenario. Real market dynamics is associated with a continuous alternation of different regimes with different statistical properties. An example is the COVID-19 pandemic outburst, which had an enormous yet short-time impact on financial markets. We also point out that two factors -- speed of the market time flow and the asset cross-correlation magnitude -- while related (the larger the speed, the larger the cross-correlations on a given time scale), act in opposite directions with regard to the return distribution tails, which can affect the expected distribution convergence to the normal distribution.

【19】 A Framework for Machine Learning of Model Error in Dynamical Systems

Authors: Matthew E. Levine, Andrew M. Stuart
Affiliations: California Institute of Technology
Link: https://arxiv.org/abs/2107.06658
Abstract: The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from data. We compare pure data-driven learning with hybrid models which incorporate imperfect domain knowledge. We cast the problem in both continuous- and discrete-time, for problems in which the model error is memoryless and in which it has significant memory, and we compare data-driven and hybrid approaches experimentally. Our formulation is agnostic to the chosen machine learning model. Using Lorenz '63 and Lorenz '96 multiscale systems, we find that hybrid methods substantially outperform solely data-driven approaches in terms of data hunger, demands for model complexity, and overall predictive performance. We also find that, while a continuous-time framing allows for robustness to irregular sampling and desirable domain-interpretability, a discrete-time framing can provide similar or better predictive performance, especially when data are undersampled and the vector field cannot be resolved. We study model error from the learning theory perspective, defining excess risk and generalization error; for a linear model of the error used to learn about ergodic dynamical systems, both errors are bounded by terms that diminish with the square-root of T. We also illustrate scenarios that benefit from modeling with memory, proving that continuous-time recurrent neural networks (RNNs) can, in principle, learn memory-dependent model error and reconstruct the original system arbitrarily well; numerical results depict challenges in representing memory by this approach. We also connect RNNs to reservoir computing and thereby relate the learning of memory-dependent error to recent work on supervised learning between Banach spaces using random features.

【20】 An Efficient Deep Distribution Network for Bid Shading in First-Price Auctions

Authors: Tian Zhou, Hao He, Shengjun Pan, Niklas Karlsson, Bharatbhushan Shetty, Brendan Kitts, Djordje Gligorijevic, San Gultekin, Tingyu Mao, Junwei Pan, Jianlong Zhang, Aaron Flores
Affiliations: Yahoo Research, Verizon Media, Sunnyvale, CA, USA
Link: https://arxiv.org/abs/2107.06650
Abstract: Since 2019, most ad exchanges and sell-side platforms (SSPs) in the online advertising industry have shifted from second- to first-price auctions. Due to the fundamental difference between these auctions, demand-side platforms (DSPs) have had to update their bidding strategies to avoid bidding unnecessarily high and hence overpaying. Bid shading was proposed to adjust the bid price intended for second-price auctions, in order to balance cost and winning probability in a first-price auction setup. In this study, we introduce a novel deep distribution network for optimal bidding in both open (non-censored) and closed (censored) online first-price auctions. Offline and online A/B testing results show that our algorithm outperforms previous state-of-the-art algorithms in terms of both surplus and effective cost per action (eCPX) metrics. Furthermore, the algorithm is optimized in run-time and has been deployed into the Verizon Media DSP as a production algorithm, serving hundreds of billions of bid requests per day. Online A/B tests show that advertiser ROI improved by 2.4%, 2.4%, and 8.6% for impression-based (CPM), click-based (CPC), and conversion-based (CPA) campaigns, respectively.

【21】 Rough McKean-Vlasov dynamics for robust ensemble Kalman filtering

Authors: Michele Coghi, Torstein Nilssen, Nikolas Nüsken
Comments: 41 pages, 7 figures
Link: https://arxiv.org/abs/2107.06621
Abstract: Motivated by the challenge of incorporating data into misspecified and multiscale dynamical models, we study a McKean-Vlasov equation that contains the data stream as a common driving rough path. This setting allows us to prove well-posedness as well as continuity with respect to the driver in an appropriate rough-path topology. The latter property is key in our subsequent development of a robust data assimilation methodology: We establish propagation of chaos for the associated interacting particle system, which in turn is suggestive of a numerical scheme that can be viewed as an extension of the ensemble Kalman filter to a rough-path framework. Finally, we discuss a data-driven method based on subsampling to construct suitable rough path lifts and demonstrate the robustness of our scheme in a number of numerical experiments related to parameter estimation problems in multiscale contexts.
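
As background, a sketch of a vanilla ensemble Kalman filter analysis step, the classical scheme that the rough-path construction extends (toy values throughout):

```python
# Perturbed-observation EnKF analysis step (classical scheme, illustrative values).
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 100                                  # state dimension, ensemble size
ens = rng.normal(size=(N, d))                  # forecast ensemble
H = np.array([[1.0, 0.0]])                     # observe the first coordinate
R = np.array([[0.1]])                          # observation noise covariance
y_obs = np.array([1.5])

Xa = ens - ens.mean(axis=0)                    # ensemble anomalies
C = Xa.T @ Xa / (N - 1)                        # empirical forecast covariance
Kgain = C @ H.T @ np.linalg.inv(H @ C @ H.T + R)

# Each member assimilates a perturbed copy of the observation.
perturbed = y_obs + rng.normal(scale=np.sqrt(R[0, 0]), size=(N, 1))
analysis = ens + (perturbed - ens @ H.T) @ Kgain.T
print("analysis mean:", analysis.mean(axis=0))
```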

【22】 Oblivious sketching for logistic regression

Authors: Alexander Munteanu, Simon Omlor, David Woodruff
Comments: ICML 2021
Link: https://arxiv.org/abs/2107.06615
Abstract: What guarantees are possible for solving logistic regression in one pass over a data stream? To answer this question, we present the first data-oblivious sketch for logistic regression. Our sketch can be computed in input sparsity time over a turnstile data stream and reduces the size of a $d$-dimensional data set from $n$ to only $\operatorname{poly}(\mu d \log n)$ weighted points, where $\mu$ is a useful parameter which captures the complexity of compressing the data. Solving (weighted) logistic regression on the sketch gives an $O(\log n)$-approximation to the original problem on the full data set. We also show how to obtain an $O(1)$-approximation with slight modifications. Our sketches are fast, simple, easy to implement, and our experiments demonstrate their practicality.

【23】 MESS: Manifold Embedding Motivated Super Sampling

Authors: Erik Thordsen, Erich Schubert
Link: https://arxiv.org/abs/2107.06566
Abstract: Many approaches in the field of machine learning and data analysis rely on the assumption that the observed data lies on lower-dimensional manifolds. This assumption has been verified empirically for many real data sets. To make use of this manifold assumption, one generally requires the manifold to be locally sampled to a certain density such that features of the manifold can be observed. However, with increasing intrinsic dimensionality of a data set, the required data density introduces the need for very large data sets, resulting in one of the many faces of the curse of dimensionality. To combat the increased requirement for local data density, we propose a framework to generate virtual data points that are faithful to an approximate embedding function underlying the manifold observable in the data.

【24】 Total Effect Analysis of Vaccination on Household Transmission in the Office for National Statistics COVID-19 Infection Survey

Authors: Thomas House, Lorenzo Pellis, Emma Pritchard, Angela R. McLean, A. Sarah Walker
Affiliations: Department of Mathematics, University of Manchester, Manchester, UK; IBM Research, Hartree Centre, Daresbury, UK; The Alan Turing Institute for Data Science and Artificial Intelligence, London, UK
Comments: 5 pages, 2 figures
Link: https://arxiv.org/abs/2107.06545
Abstract: We investigate the distribution of numbers of secondary cases in households in the Office for National Statistics COVID-19 Infection Survey (ONS CIS), stratified by timing of vaccination and infection in the households. This shows a total effect of a statistically significant approximate halving of the secondary attack rate in households following vaccination.

【25】 Zeroth and First Order Stochastic Frank-Wolfe Algorithms for Constrained Optimization

Authors: Zeeshan Akhtar, Ketan Rajawat
Affiliations: Department of Electrical Engineering, Indian Institute of Technology Kanpur
Link: https://arxiv.org/abs/2107.06534
Abstract: This paper considers stochastic convex optimization problems with two sets of constraints: (a) deterministic constraints on the domain of the optimization variable, which are difficult to project onto; and (b) deterministic or stochastic constraints that admit efficient projection. Problems of this form arise frequently in the context of semidefinite programming as well as when various NP-hard problems are solved approximately via semidefinite relaxation. Since projection onto the first set of constraints is difficult, it becomes necessary to explore projection-free algorithms, such as the stochastic Frank-Wolfe (FW) algorithm. On the other hand, the second set of constraints cannot be handled in the same way, and must be incorporated as an indicator function within the objective function, thereby complicating the application of FW methods. Similar problems have been studied before, and solved using first-order stochastic FW algorithms by applying homotopy and Nesterov's smoothing techniques to the indicator function. This work improves upon these existing results and puts forth momentum-based first-order methods that yield improved convergence rates, on par with the best known rates for problems without the second set of constraints. Zeroth-order variants of the proposed algorithms are also developed and again improve upon the state-of-the-art rate results. The efficacy of the proposed algorithms is tested on relevant applications of sparse matrix estimation, clustering via semidefinite relaxation, and the uniform sparsest cut problem.
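
For orientation, a sketch of the classical deterministic Frank-Wolfe iteration on the probability simplex, showing the projection-free linear-minimization oracle that the stochastic variants above build on; the objective and dimensions are made up.

```python
# Classic Frank-Wolfe on the simplex: each step calls a linear minimization
# oracle (LMO) instead of a projection, here minimizing <g, s> over vertices.
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
Q = A.T @ A / d                                       # PSD matrix -> convex quadratic
c = rng.normal(size=d)
grad = lambda x: Q @ x + c                            # gradient of 0.5 x'Qx + c'x

x = np.full(d, 1.0 / d)                               # start inside the simplex
for t in range(1, 201):
    g = grad(x)
    s = np.zeros(d)
    s[np.argmin(g)] = 1.0                             # LMO solution: a simplex vertex
    x += (2.0 / (t + 2)) * (s - x)                    # classic step size 2/(t+2)

print("objective value:", 0.5 * x @ Q @ x + c @ x)
```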

【26】 Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers

Authors: Patryk Orzechowski, Jason H. Moore
Affiliations: Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA; Department of Automatics and Robotics, AGH University of Science and Technology, Krakow, Poland
Comments: 12 pages, 3 figures with subfigures
Link: https://arxiv.org/abs/2107.06475
Abstract: Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN) - a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions which map continuous features to discrete endpoints for creating synthetic datasets. These 40 functions were discovered using a heuristic algorithm designed to maximize the diversity of performance among multiple popular machine learning algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement. The resource, with extensive documentation and analyses, is open-source and available on GitHub.

【27】 Going Beyond Linear RL: Sample Efficient Neural Function Approximation

Authors: Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang
Affiliations: University of Washington; Princeton University
Link: https://arxiv.org/abs/2107.06466
Abstract: Deep Reinforcement Learning (RL) powered by neural net approximation of the Q function has had enormous empirical success. While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions. This is the focus of this work, where we study function approximation with two-layer neural networks (considering both ReLU and polynomial activation functions). Our first result is a computationally and statistically efficient algorithm in the generative model setting under completeness for two-layer neural networks. Our second result considers this setting but under only realizability of the neural net function class. Here, assuming deterministic dynamics, the sample complexity scales linearly in the algebraic dimension. In all cases, our results significantly improve upon what can be attained with linear (or eluder dimension) methods.

【28】 A Smoothed Impossibility Theorem on Condorcet Criterion and Participation

Authors: Lirong Xia
Affiliations: Rensselaer Polytechnic Institute, Troy NY, USA
Link: https://arxiv.org/abs/2107.06435
Abstract: In 1988, Moulin proved an insightful and surprising impossibility theorem that reveals a fundamental incompatibility between two commonly-studied axioms of voting: no resolute voting rule (which outputs a single winner) satisfies Condorcet Criterion and Participation simultaneously when the number of alternatives m is at least four. In this paper, we prove an extension of this impossibility theorem using smoothed analysis: for any fixed $m \ge 4$ and any voting rule r, under mild conditions, the smoothed likelihood for both Condorcet Criterion and Participation to be satisfied is at most $1-\Omega(n^{-3})$, where n is the number of voters and is sufficiently large. Our theorem immediately implies a quantitative version of the theorem for i.i.d. uniform distributions, known as the Impartial Culture in social choice theory.

【29】 Spectral Recovery of Binary Censored Block Models

Authors: Souvik Dhara, Julia Gaudio, Elchanan Mossel, Colin Sandon
Comments: 28 pages, 3 figures
Link: https://arxiv.org/abs/2107.06338
Abstract: Community detection is the problem of identifying community structure in graphs. Often the graph is modeled as a sample from the Stochastic Block Model, in which each vertex belongs to a community. The probability that two vertices are connected by an edge depends on the communities of those vertices. In this paper, we consider a model of censored community detection with two communities, where most of the data is missing as the status of only a small fraction of the potential edges is revealed. In this model, vertices in the same community are connected with probability $p$ while vertices in opposite communities are connected with probability $q$. The connectivity status of a given pair of vertices ${u,v}$ is revealed with probability $\alpha$, independently across all pairs, where $\alpha = \frac{t \log(n)}{n}$. We establish the information-theoretic threshold $t_c(p,q)$, such that no algorithm succeeds in recovering the communities exactly when $t < t_c(p,q)$. We show that when $t > t_c(p,q)$, a simple spectral algorithm based on a weighted, signed adjacency matrix succeeds in recovering the communities exactly. While spectral algorithms are shown to have near-optimal performance in the symmetric case, we show that they may fail in the asymmetric case where the connection probabilities inside the two communities are allowed to be different. In particular, we show the existence of a parameter regime where a simple two-phase algorithm succeeds but any algorithm based on thresholding a linear combination of the top two eigenvectors of the weighted, signed adjacency matrix fails.
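
A toy sketch of the spectral idea in the symmetric case (the signed weight and parameters below are heuristic choices, not the paper's exact statistic): assign +1 to a revealed edge, a negative weight to a revealed non-edge, 0 if unrevealed, and read communities off the signs of a leading eigenvector.

```python
# Weighted, signed adjacency spectral recovery on a censored two-community SBM.
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 400, 0.9, 0.1
alpha = 6 * np.log(n) / n                      # reveal probability t*log(n)/n
labels = np.repeat([1, -1], n // 2)

same = np.equal.outer(labels, labels)
edge = np.triu(rng.random((n, n)) < np.where(same, p, q), 1)   # SBM edges
seen = np.triu(rng.random((n, n)) < alpha, 1)                  # revealed pairs

w = 1.0                                        # heuristic weight for revealed non-edges
A = seen * (edge.astype(float) - w * (~edge))
A = A + A.T                                    # weighted, signed adjacency matrix

vals, vecs = np.linalg.eigh(A)
guess = np.sign(vecs[:, -1])                   # leading eigenvector carries the split
acc = max(np.mean(guess == labels), np.mean(guess == -labels))
print("fraction of vertices recovered:", acc)
```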

【30】 Efficient exact computation of the conjunctive and disjunctive decompositions of D-S Theory for information fusion: Translation and extension

Authors: Maxime Chaveroche, Franck Davoine, Véronique Cherfaoui
Affiliations: Sorbonne University Alliance, Université de technologie de Compiègne, CNRS, Heudiasyc, Compiègne, France
Comments: Extension of an article published in the proceedings of the French conference GRETSI 2019
Link: https://arxiv.org/abs/2107.06329
Abstract: Dempster-Shafer Theory (DST) generalizes Bayesian probability theory, offering useful additional information, but suffers from a high computational burden. A lot of work has been done to reduce the complexity of computations used in information fusion with Dempster's rule. Yet, little research has been conducted to reduce the complexity of computations for the conjunctive and disjunctive decompositions of evidence, which are at the core of other important methods of information fusion. In this paper, we propose a method designed to exploit the actual evidence (information) contained in these decompositions in order to compute them. It is based on a new notion that we call focal point, derived from the notion of focal set. With it, we are able to reduce these computations down to a complexity linear in the number of focal sets in some cases. In a broader perspective, our formulas have the potential to be tractable when the size of the frame of discernment exceeds a few dozen possible states, contrary to the existing literature. This article extends (and translates) our work published at the French conference GRETSI in 2019.

【31】 Inverse Contextual Bandits: Learning How Behavior Evolves over Time

Authors: Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar
Affiliations: University of Cambridge; UCLA; The Alan Turing Institute
Link: https://arxiv.org/abs/2107.06317
Abstract: Understanding an agent's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. While conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving, and clinical professionals are constantly fine-tuning their priorities. We desire an approach to policy learning that provides (1) interpretable representations of decision-making, accounts for (2) non-stationarity in behavior, as well as operating in an (3) offline manner. First, we model the behavior of learning agents in terms of contextual bandits, and formalize the problem of inverse contextual bandits (ICB). Second, we propose two algorithms to tackle ICB, each making varying degrees of assumptions regarding the agent's learning strategy. Finally, through both real and simulated data for liver transplantations, we illustrate the applicability and explainability of our method, as well as validating its accuracy.
