stat (Statistics): 53 papers in total
【1】 Dependent Bayesian nonparametric modeling of compositional data using random Bernstein polynomials Link: https://arxiv.org/abs/2108.13403
Authors: Claudia Wehrhahn, Andrés F. Barrientos, Alejandro Jara Abstract: We discuss Bayesian nonparametric procedures for the regression analysis of compositional responses, that is, data supported on a multivariate simplex. The procedures are based on a modified class of multivariate Bernstein polynomials and on the use of dependent stick-breaking processes. A general model and two simplified versions of the general model are discussed. Appealing theoretical properties such as continuity, association structure, support, and consistency of the posterior distribution are established. Additionally, we exploit the use of spike-and-slab priors for choosing the version of the model that best adapts to the complexity of the underlying true data-generating distribution. The performance of the proposed model is illustrated in a simulation study and in an application to solid waste data from Colombia.
【2】 A practical guide to causal discovery with cohort data Link: https://arxiv.org/abs/2108.13395
Authors: Ryan M. Andrews, Ronja Foraita, Vanessa Didelez, Janine Witte Affiliations: Boston University School of Public Health; Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen; University of Bremen Abstract: In this guide, we present how to perform constraint-based causal discovery using three popular software packages: pcalg (with add-ons tpc and micd), bnlearn, and TETRAD. We focus on how these packages can be used with observational data and in the presence of mixed data (i.e., data where some variables are continuous, while others are categorical), a known time ordering between variables, and missing data. Throughout, we point out the relative strengths and limitations of each package, as well as give practical recommendations. We hope this guide helps anyone who is interested in performing constraint-based causal discovery on their data.
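The conditional independence testing at the core of these packages can be illustrated with a minimal, hedged sketch of the skeleton phase of the PC algorithm, assuming Gaussian data and a Fisher-z partial correlation test; the packages covered in the guide implement far more complete and robust versions of this idea.

```python
# Minimal PC-skeleton sketch: start from the complete graph and delete an
# edge whenever a conditional independence test fails to reject.
import itertools
import numpy as np
from scipy import stats

def partial_corr(X, i, j, S):
    """Partial correlation of columns i and j given the columns in S."""
    sub = X[:, [i, j] + list(S)]
    prec = np.linalg.pinv(np.cov(sub, rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def pc_skeleton(X, alpha=0.05, max_cond=2):
    n, p = X.shape
    adj = {(i, j) for i in range(p) for j in range(p) if i < j}
    for size in range(max_cond + 1):
        for (i, j) in list(adj):
            others = [k for k in range(p) if k not in (i, j)]
            for S in itertools.combinations(others, size):
                r = np.clip(partial_corr(X, i, j, S), -0.9999, 0.9999)
                z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - size - 3)
                pval = 2 * (1 - stats.norm.cdf(abs(z)))
                if pval > alpha:          # cannot reject independence
                    adj.discard((i, j))   # remove the edge
                    break
    return adj

rng = np.random.default_rng(0)
x = rng.normal(size=1000); y = x + rng.normal(size=1000); z = y + rng.normal(size=1000)
print(pc_skeleton(np.column_stack([x, y, z])))  # expect edges (0,1) and (1,2)
```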
【3】 Statistical Challenges in Tracking the Evolution of SARS-CoV-2 Link: https://arxiv.org/abs/2108.13362
Authors: Lorenzo Cappello, Jaehee Kim, Sifan Liu, Julia A. Palacios Affiliations: Jaehee Kim is a Postdoctoral Research Fellow in the Department of Biology, Stanford University Abstract: Genomic surveillance of SARS-CoV-2 has been instrumental in tracking the spread and evolution of the virus during the pandemic. The availability of SARS-CoV-2 molecular sequences isolated from infected individuals, coupled with phylodynamic methods, has provided insights into the origin of the virus, its evolutionary rate, the timing of introductions, the patterns of transmission, and the rise of novel variants that have spread through populations. Despite enormous global efforts of governments, laboratories, and researchers to collect and sequence molecular data, many challenges remain in analyzing and interpreting the data collected. Here, we describe the models and methods currently used to monitor the spread of SARS-CoV-2, discuss long-standing and new statistical challenges, and propose a method for tracking the rise of novel variants during the epidemic.
【4】 Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data Link: https://arxiv.org/abs/2108.13331
Authors: Janine Witte, Ronja Foraita, Vanessa Didelez Affiliations: Leibniz Institute for Prevention Research and Epidemiology - BIPS; University of Bremen Notes: 38 pages, 11 figures Abstract: Causal discovery algorithms estimate causal graphs from observational data. This can provide a valuable complement to analyses focussing on the causal relation between individual treatment-outcome pairs. Constraint-based causal discovery algorithms rely on conditional independence testing when building the graph. Until recently, these algorithms have been unable to handle missing values. In this paper, we investigate two alternative solutions: Test-wise deletion and multiple imputation. We establish necessary and sufficient conditions for the recoverability of causal structures under test-wise deletion, and argue that multiple imputation is more challenging in the context of causal discovery than for estimation. We conduct an extensive comparison by simulating from benchmark causal graphs: As one might expect, we find that test-wise deletion and multiple imputation both clearly outperform list-wise deletion and single imputation. Crucially, our results further suggest that multiple imputation is especially useful in settings with a small number of either Gaussian or discrete variables, but when the dataset contains a mix of both neither method is uniformly best. The methods we compare include random forest imputation and a hybrid procedure combining test-wise deletion and multiple imputation. An application to data from the IDEFICS cohort study on diet- and lifestyle-related diseases in European children serves as an illustrating example.
【5】 Lagged couplings diagnose Markov chain Monte Carlo phylogenetic inference Link: https://arxiv.org/abs/2108.13328
Authors: Luke J. Kelly, Robin J. Ryder, Grégoire Clarté Affiliations: Université Paris-Dauphine, PSL University Abstract: Phylogenetic inference is an intractable statistical problem on a complex sample space. Markov chain Monte Carlo methods are the primary tool for Bayesian phylogenetic inference, but it is challenging to construct efficient schemes to explore the associated posterior distribution and to then assess their convergence. Building on recent work developing couplings of Monte Carlo algorithms, we describe a procedure to couple Markov chains targeting a posterior distribution over a space of phylogenetic trees with ages, scalar parameters and latent variables. We demonstrate how to use these couplings to check convergence and mixing time of the chains.
【6】 Eliminating Systematic Bias from Difference-in-Differences Design: A Permutational Detrending Strategy Link: https://arxiv.org/abs/2108.13311
Authors: Xiaoming Wang, Sukun Wang Affiliations: Health Services Statistical and Analytic Methods, Alberta Health Services, Edmonton, AB, Canada; Financial Crime Unit, Bank of Montreal, Toronto, ON, Canada Notes: 15 pages, 2 figures Abstract: Since the initial work by Ashenfelter and Card in 1985, the use of difference-in-differences (DID) study design has become widespread. However, as pointed out in the literature, this popular quasi-experimental design also suffers estimation bias and inference bias, which could be very serious in some circumstances. In this study, we start by investigating potential sources of systemic bias from the DID design. Via analyzing their impact on statistical estimation and inference, we propose a remedy -- a permutational detrending (PD) strategy -- to overcome the challenges in both the estimation bias and the inference bias. We prove that the proposed PD DID method provides unbiased point estimates, confidence interval estimates, and significance tests. We illustrate its statistical properties using simulation experiments. We demonstrate its practical utility by applying it to the clinical data EASE (Elder-Friendly Approaches to the Surgical Environment) and the socio-economic data CPS (Current Population Survey). We discuss the strengths and limitations of the proposed approach.
【7】 A principled stopping rule for importance sampling Link: https://arxiv.org/abs/2108.13289
Authors: Medha Agarwal, Dootika Vats, Víctor Elvira Affiliations: Department of Statistics, University of Washington; Department of Mathematics and Statistics, IIT Kanpur; School of Mathematics, University of Edinburgh Abstract: Importance sampling (IS) is a Monte Carlo technique that relies on weighted samples, simulated from a proposal distribution, to estimate intractable integrals. The quality of the estimators improves with the number of samples. However, for achieving a desired quality of estimation, the required number of samples is unknown, and depends on the quantity of interest, the estimator, and the chosen proposal. We present a sequential stopping rule that terminates simulation when the overall variability in estimation is relatively small. The proposed methodology closely connects to the idea of an effective sample size in IS and overcomes crucial shortcomings of existing metrics, e.g., it acknowledges multivariate estimation problems. Our stopping rule retains asymptotic guarantees and provides users a clear guideline on when to stop the simulation in IS.
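As a hedged illustration of the idea (a simplified relative-error criterion of our own, not necessarily the authors' exact rule), the following sketch draws importance samples in batches and stops once the estimated standard error of the self-normalized estimator is small relative to the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
log_target = lambda x: -0.5 * x**2                       # N(0,1), unnormalized
log_prop = lambda x: -0.5 * (x / 2.0)**2 - np.log(2.0)   # N(0, 2^2), unnormalized
h = lambda x: x**2                                       # estimate E[h(X)] = 1
eps, batch = 0.01, 1000
x = np.empty(0)

while True:
    x = np.concatenate([x, rng.normal(0.0, 2.0, size=batch)])
    lw = log_target(x) - log_prop(x)
    w = np.exp(lw - lw.max()); w /= w.sum()              # self-normalized weights
    est = np.sum(w * h(x))
    se = np.sqrt(np.sum(w**2 * (h(x) - est)**2))         # delta-method std. error
    if se < eps * abs(est):                              # stop at 1% relative error
        break

print(x.size, est)                                       # est should be close to 1.0
```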
【8】 Bayesian Sensitivity Analysis for Missing Data Using the E-value Link: https://arxiv.org/abs/2108.13286
Authors: Wu Xue, Abbas Zaidi Affiliations: Department of Statistics, The George Washington University; Statistics & Privacy, Facebook Abstract: Sensitivity Analysis is a framework to assess how conclusions drawn from missing outcome data may be vulnerable to departures from untestable underlying assumptions. We extend the E-value, a popular metric for quantifying robustness of causal conclusions, to the setting of missing outcomes. With motivating examples from partially-observed Facebook conversion events, we present methodology for conducting Sensitivity Analysis at scale with three contributions. First, we develop a method for the Bayesian estimation of sensitivity parameters leveraging noisy benchmarks (e.g., aggregated reports for protecting unit-level privacy); both empirically derived subjective and objective priors are explored. Second, utilizing the Bayesian estimation of the sensitivity parameters we propose a mechanism for posterior inference of the E-value via simulation. Finally, closed form distributions of the E-value are constructed to make direct inference possible when posterior simulation is infeasible due to computational constraints. We demonstrate gains in performance over asymptotic inference of the E-value using data-based simulations, supplemented by a case-study of Facebook conversion events.
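For context, the classical E-value of VanderWeele and Ding has a closed form for a risk ratio RR; the paper extends inference on this quantity to missing-outcome settings, but the deterministic point version is simply the following.

```python
# Classical E-value for a risk ratio RR >= 1: E = RR + sqrt(RR * (RR - 1)).
import math

def e_value(rr):
    rr = max(rr, 1.0 / rr)            # use the direction away from the null
    return rr + math.sqrt(rr * (rr - 1.0))

print(e_value(2.0))                    # about 3.41
```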
【9】 Multi-Resolution Spatio-Temporal Prediction with Application to Wind Power Generation Link: https://arxiv.org/abs/2108.13285
Authors: Shixiang Zhu, Hanyu Zhang, Yao Xie, Pascal Van Hentenryck Abstract: This paper proposes a spatio-temporal model for wind speed prediction which can be run at different resolutions. The model assumes that the wind prediction of a cluster is correlated to its upstream influences in recent history, and the correlation between clusters is represented by a directed dynamic graph. A Bayesian approach is also described in which prior beliefs about the predictive errors at different data resolutions are represented in a form of Gaussian processes. The joint framework enhances the predictive performance by combining results from predictions at different data resolution and provides reasonable uncertainty quantification. The model is evaluated on actual wind data from the Midwest U.S. and shows a superior performance compared to traditional baselines.
【10】 Algorithm for the product of Jack polynomials and its application to the sphericity test Link: https://arxiv.org/abs/2108.13283
Authors: Koki Shimizu, Hiroki Hashiguchi Affiliations: Tokyo University of Science, Kagurazaka, Shinjuku-ku, Tokyo, Japan Abstract: In this study, we derive the density and distribution function of a ratio of the largest and smallest eigenvalues of a singular beta-Wishart matrix for the sphericity test. These functions can be expressed in terms of the product of Jack polynomials. We propose an algorithm that expands the product of Jack polynomials by a linear combination of Jack polynomials. Numerical computation for the derived distributions is performed using the algorithm.
【11】 Optimal Multi-Wave Validation of Secondary Use Data with Outcome and Exposure Misclassification Link: https://arxiv.org/abs/2108.13263
Authors: Sarah C. Lotspeich, Gustavo G. C. Amorim, Pamela A. Shaw, Ran Tao, Bryan E. Shepherd Affiliations: Vanderbilt University Medical Center, Nashville, TN, U.S.A.; Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, U.S.A.; Vanderbilt Genetics Institute Notes: 15 pages, 3 tables, 3 figures; supplemental materials can be found in ancillary file supplement.pdf Abstract: The growing availability of observational databases like electronic health records (EHR) provides unprecedented opportunities for secondary use of such data in biomedical research. However, these data can be error-prone and need to be validated before use. It is usually unrealistic to validate the whole database due to resource constraints. A cost-effective alternative is to implement a two-phase design that validates a subset of patient records that are enriched for information about the research question of interest. Herein, we consider odds ratio estimation under differential outcome and exposure misclassification. We propose optimal designs that minimize the variance of the maximum likelihood odds ratio estimator. We develop a novel adaptive grid search algorithm that can locate the optimal design in a computationally feasible and numerically accurate manner. Because the optimal design requires specification of unknown parameters at the outset and thus is unattainable without prior information, we introduce a multi-wave sampling strategy to approximate it in practice. We demonstrate the efficiency gains of the proposed designs over existing ones through extensive simulations and two large observational studies. We provide an R package and Shiny app to facilitate the use of the optimal designs.
【12】 Functional Data Representation with Merge Trees Link: https://arxiv.org/abs/2108.13147
Authors: Matteo Pegoraro, Piercesare Secchi Abstract: In this paper we face the problem of representation of functional data with the tools of algebraic topology. We represent functions by means of merge trees and this representation is compared with that offered by persistence diagrams. We show that these two tree structures, although not equivalent, are both invariant under homeomorphic re-parametrizations of the functions they represent, thus allowing for a statistical analysis which is indifferent to functional misalignment. We employ a novel metric for merge trees and we prove a few theoretical results related to its specific implementation when merge trees represent functions. To showcase the good properties of our topological approach to functional data analysis, we first go through a few examples using data generated in silico, employed to illustrate and compare the different representations provided by merge trees and persistence diagrams, and then we test it on the Aneurisk65 dataset, replicating, from our different perspective, the supervised classification analysis which contributed to making this dataset a benchmark for methods dealing with misaligned functional data.
【13】 A fast point solver for deep nonlinear function approximators Link: https://arxiv.org/abs/2108.13097
Authors: Laurence Aitchison Affiliations: Department of Computer Science, University of Bristol, Bristol, UK Abstract: Deep kernel processes (DKPs) generalise Bayesian neural networks, but do not require us to represent either features or weights. Instead, at each hidden layer they represent and optimize a flexible kernel. Here, we develop a Newton-like method for DKPs that converges in around 10 steps, exploiting matrix solvers initially developed in the control theory literature. These are many times faster than the usual gradient descent approach. We generalise to arbitrary DKP architectures, by developing "kernel backprop", and algorithms for "kernel autodiff". While these methods currently are not Bayesian as they give point estimates and scale poorly as they are cubic in the number of datapoints, we hope they will form the basis of a new class of much more efficient approaches to optimizing deep nonlinear function approximators.
【14】 Generalized nearly isotonic regression Link: https://arxiv.org/abs/2108.13010
Authors: Takeru Matsuda, Yuto Miyatake Affiliations: Osaka University Abstract: The problem of estimating a piecewise monotone sequence of normal means is called nearly isotonic regression. For this problem, an efficient algorithm has been devised by modifying the pool adjacent violators algorithm (PAVA). In this study, we extend nearly isotonic regression to general one-parameter exponential families such as binomial, Poisson and chi-square. We consider estimation of a piecewise monotone parameter sequence and develop an efficient algorithm based on the modified PAVA, which utilizes the duality between the natural and expectation parameters. We also provide a method for selecting the regularization parameter by using an information criterion. Simulation results demonstrate that the proposed method detects change-points in piecewise monotone parameter sequences in a data-driven manner. Applications to spectrum estimation, causal inference and discretization error quantification of ODE solvers are also presented.
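The classical building block here is the pool-adjacent-violators algorithm (PAVA) for isotonic least squares; a minimal sketch is below, while the paper's modified PAVA handles nearly isotonic penalties and general one-parameter exponential families.

```python
# Classical PAVA: merge adjacent blocks (weighted averages) whenever the
# monotonicity constraint is violated.
import numpy as np

def pava(y, w=None):
    w = np.ones_like(y, dtype=float) if w is None else w.astype(float)
    vals, wts, sizes = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(wi); sizes.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:   # violation: pool blocks
            v = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / (wts[-2] + wts[-1])
            wts[-2] += wts[-1]; sizes[-2] += sizes[-1]; vals[-2] = v
            vals.pop(); wts.pop(); sizes.pop()
    return np.repeat(vals, sizes)

print(pava(np.array([1.0, 3.0, 2.0, 4.0])))  # [1.0, 2.5, 2.5, 4.0]
```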
【15】 Accuracy, precision, and agreement statistical tests for Bland-Altman method Link: https://arxiv.org/abs/2108.12937
Authors: P. S. P. Silveira, J. E. Vieira, A. A. Ferraro, J. O. Siqueira Affiliations: Department of Surgery, University of Sao Paulo Medical School; Department of Legal Medicine, University of Sao Paulo Medical School Notes: 16 pages, 5 figures. Submitted to Clinics (this https URL). Clinics is an electronic journal that publishes peer-reviewed articles in continuous flow, of interest to clinicians and researchers in the medical sciences Abstract: The Bland and Altman plot method is a graphical approach for comparing related data sets, supporting the eventual replacement of one measurement method by another. Perhaps due to its easy graphical output it has been widely applied, yet it is often misinterpreted. We provide three nested tests: accuracy, precision and agreement, as a means to reach statistical support for the equivalence of measurements. These are based on structural regressions added to the method, converting it into inferential statistical criteria that verify mean equality (accuracy), homoscedasticity (precision), and concordance with a bisector line (agreement). A graphical output illustrating these three tests was added to follow Bland and Altman's principles. Five pairs of data sets from previously published articles that applied Bland and Altman's principles illustrate this statistical approach. In one case strict equivalence was demonstrated, three cases showed partial equivalence, and there was one case without equivalence. Here we show a statistical approach, added to the graphical outputs, that turns the otherwise subjective graphical interpretation of the Bland-Altman method into a clear and objective result with a significance value, for a reliable and better communicable decision.
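The basic Bland-Altman statistics (bias and 95% limits of agreement), together with a simple paired t-test of mean equality as a crude accuracy check, can be sketched as follows; the paper's nested accuracy/precision/agreement tests are based on structural regressions and are more elaborate.

```python
import numpy as np
from scipy import stats

def bland_altman(a, b):
    diff = a - b
    bias = diff.mean()                 # mean difference between methods
    sd = diff.std(ddof=1)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # 95% limits of agreement
    t, p = stats.ttest_rel(a, b)       # H0: mean difference is zero
    return bias, loa, p

rng = np.random.default_rng(2)
a = rng.normal(10, 1, 100)
b = a + rng.normal(0.1, 0.2, 100)      # second method, slight offset
print(bland_altman(a, b))
```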
【16】 Inequality in Education: A Comparison of Australian Indigenous and Nonindigenous Populations Link: https://arxiv.org/abs/2108.12830
Authors: David Gunawan, William Griffiths, Duangkamon Chotikapanich Affiliations: University of Wollongong; University of Melbourne; Monash University Abstract: Educational achievement distributions for Australian indigenous and nonindigenous populations in the years 2001, 2006, 2014 and 2017 are considered. Bayesian inference is used to analyse how these ordinal categorical distributions have changed over time and to compare indigenous and nonindigenous distributions. Both the level of educational achievement and inequality in educational achievement are considered. To compare changes in levels over time, as well as inequality between the two populations, first order stochastic dominance and an index of educational poverty are used. To examine changes in inequality over time, two inequality indices and generalised Lorenz dominance are considered. Results are presented in terms of posterior densities for the indices and posterior probabilities for dominance for the dominance comparisons. We find some evidence of improvement over time, especially in the lower parts of the indigenous distribution and that inequality has significantly increased from 2001 to 2017.
【17】 Survival Analysis with Graph-Based Regularization for Predictors Link: https://arxiv.org/abs/2108.12827
Authors: Xi He, Liyan Xie, Yao Xie, Pinar Keskinocak Affiliations: H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology; School of Data Science, The Chinese University of Hong Kong, Shenzhen Abstract: We study the variable selection problem in survival analysis to identify the most important factors affecting the survival time when the variables have prior knowledge that they have a mutual correlation through a graph structure. We consider the Cox proportional hazard model with a graph-based regularizer for variable selection. A computationally efficient algorithm is developed to solve the graph regularized maximum likelihood problem by connecting to group lasso. We provide theoretical guarantees about the recovery error and asymptotic distribution of the proposed estimators. The good performance and benefit of the proposed approach compared with existing methods are demonstrated in both synthetic and real data examples.
【18】 A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures Link: https://arxiv.org/abs/2108.12814
Authors: Robert Taggart, Nicholas Loveday, Deryn Griffiths Affiliations: Bureau of Meteorology Notes: 22 pages, 4 figures, 1 table Abstract: The use of tiered warnings and multicategorical forecasts is ubiquitous in meteorological operations. Here, a flexible family of scoring functions is presented for evaluating the performance of ordered multicategorical forecasts. Each score has a risk parameter $\alpha$, selected for the specific use case, so that it is consistent with a forecast directive based on the fixed threshold probability $1-\alpha$ (equivalently, a fixed $\alpha$-quantile mapping). Each score also has use-case specific weights so that forecasters who accurately discriminate between categorical thresholds are rewarded in proportion to the weight for that threshold. A variation is presented where the penalty assigned to near misses or close false alarms is discounted, which again is consistent with directives based on fixed risk measures. The scores presented provide an alternative to many performance measures currently in use, whose optimal threshold probabilities for forecasting an event typically vary with each forecast case, and in the case of equitable scores are based around sample base rates rather than risk measures suitable for users.
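One plausible, heavily hedged sketch of such a score (our own simplification; the paper's family, including the near-miss discount, is richer): for each categorical threshold with weight $w$, charge $w(1-\alpha)$ for a false alarm and $w\alpha$ for a miss, which is consistent with warning whenever the forecast probability of exceedance is at least $1-\alpha$.

```python
def tiered_score(fcst_cat, obs_cat, weights, alpha):
    """Score one forecast/observation pair of ordinal categories 0..K.

    weights[t-1] is the weight of the threshold between category t-1 and t.
    """
    s = 0.0
    for t, w in enumerate(weights, start=1):   # thresholds between tiers
        fa = fcst_cat >= t > obs_cat           # warned at tier t, no event
        miss = obs_cat >= t > fcst_cat         # event at tier t, no warning
        s += w * ((1 - alpha) * fa + alpha * miss)
    return s

# forecast tier 2, observed tier 0, with alpha = 0.1 (warn when P >= 0.9)
print(tiered_score(fcst_cat=2, obs_cat=0, weights=[1.0, 2.0], alpha=0.1))
```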
【19】 Density estimation in RKHS with application to Korobov spaces in high dimensions Link: https://arxiv.org/abs/2108.12699
Authors: Yoshihito Kazashi, Fabio Nobile Abstract: A kernel method for estimating a probability density function (pdf) from an i.i.d. sample drawn from such density is presented. Our estimator is a linear combination of kernel functions, the coefficients of which are determined by a linear equation. An error analysis for the mean integrated squared error is established in a general reproducing kernel Hilbert space setting. The theory developed is then applied to estimate pdfs belonging to weighted Korobov spaces, for which a dimension independent convergence rate is established. Under a suitable smoothness assumption, our method attains a rate arbitrarily close to the optimal rate. Numerical results support our theory.
【20】 Feature Selection in High-dimensional Space Using Graph-Based Methods Link: https://arxiv.org/abs/2108.12682
Authors: Swarnadip Ghosh, Somabha Mukherjee, Divyansh Agarwal Notes: 16 pages, 8 figures, 3 tables Abstract: High-dimensional feature selection is a central problem in a variety of application domains such as machine learning, image analysis, and genomics. In this paper, we propose graph-based tests as a useful basis for feature selection. We describe an algorithm for selecting informative features in high-dimensional data, where each observation comes from one of $K$ different distributions. Our algorithm can be applied in a completely nonparametric setup without any distributional assumptions on the data, and it aims at outputting those features in the data, that contribute the most to the overall distributional variation. At the heart of our method is the recursive application of distribution-free graph-based tests on subsets of the feature set, located at different depths of a hierarchical clustering tree constructed from the data. Our algorithm recovers all truly contributing features with high probability, while ensuring optimal control on false-discovery. Finally, we show the superior performance of our method over other existing ones through synthetic data, and also demonstrate the utility of the method on a real-life dataset from the domain of climate change.
【21】 Convergence of position-dependent MALA with application to conditional simulation in GLMMs Link: https://arxiv.org/abs/2108.12662
Authors: Vivekananda Roy, Lijin Zhang Affiliations: Department of Statistics, Iowa State University, USA Abstract: We establish verifiable conditions under which Metropolis-Hastings (MH) algorithms with position-dependent proposal covariance matrix will or will not have a geometric rate of convergence. Some of the diffusion-based MH algorithms like the Metropolis adjusted Langevin algorithm (MALA) and pre-conditioned MALA (PCMALA) have position-independent proposal variance. Whereas, for other variants of MALA like manifold MALA (MMALA), the proposal covariance matrix changes in every iteration. Thus, we provide conditions for geometric ergodicity of different variations of Langevin algorithms. These conditions are verified in the context of conditional simulation from the two most popular generalized linear mixed models (GLMMs), namely the binomial GLMM with logit link and the Poisson GLMM with log link. Empirical comparison in the framework of some spatial GLMMs shows that computationally less expensive PCMALA with an appropriately chosen pre-conditioning matrix may outperform MMALA.
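For reference, a generic PCMALA-style step with a fixed pre-conditioning matrix $M$ can be sketched as follows (MMALA would recompute a position-dependent matrix at every iteration); this is a textbook sketch, not the paper's implementation.

```python
import numpy as np

def mala(logpi, grad, x0, M, h, n_iter, rng):
    """MALA with proposal y ~ N(x + (h/2) M grad(x), h M)."""
    L, Minv = np.linalg.cholesky(M), np.linalg.inv(M)
    x, lp, d = np.asarray(x0, float), logpi(x0), len(x0)
    out = np.empty((n_iter, d))
    for t in range(n_iter):
        mu = x + 0.5 * h * M @ grad(x)
        y = mu + np.sqrt(h) * L @ rng.standard_normal(d)
        mu_y = y + 0.5 * h * M @ grad(y)
        lq_rev = -0.5 / h * (x - mu_y) @ Minv @ (x - mu_y)  # log q(x|y)
        lq_fwd = -0.5 / h * (y - mu) @ Minv @ (y - mu)      # log q(y|x)
        lp_y = logpi(y)
        if np.log(rng.uniform()) < lp_y - lp + lq_rev - lq_fwd:
            x, lp = y, lp_y                                 # accept
        out[t] = x
    return out

rng = np.random.default_rng(3)
draws = mala(lambda x: -0.5 * x @ x, lambda x: -x,
             np.zeros(2), np.eye(2), 0.5, 5000, rng)
print(draws.mean(axis=0), draws.var(axis=0))   # roughly (0, 0) and (1, 1)
```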
【22】 Maximum Likelihood Estimation of Diffusions by Continuous Time Markov Chain Link: https://arxiv.org/abs/2108.12649
Authors: J. L. Kirkby, Dang Nguyen, Duy Nguyen, Nhu Nguyen Abstract: In this paper we present a novel method for estimating the parameters of a parametric diffusion process. Our approach is based on a closed-form Maximum Likelihood estimator for an approximating Continuous Time Markov Chain (CTMC) of the diffusion process. Unlike typical time discretization approaches, such as pseudo-likelihood approximations with Shoji-Ozaki or Kessler's method, the CTMC approximation introduces no time-discretization error during parameter estimation, and is thus well-suited for typical econometric situations with infrequently sampled data. Due to the structure of the CTMC, we are able to obtain closed-form approximations for the sample likelihood which hold for general univariate diffusions. Comparisons of the state-discretization approach with approximate MLE (time-discretization) and Exact MLE (when applicable) demonstrate favorable performance of the CTMC estimator. Simulated examples are provided in addition to real data experiments with FX rates and constant maturity interest rates.
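The flavor of the approach can be sketched as follows, under our own simplifying assumptions (a birth-death generator on a uniform grid, nearest-grid-point binning, and an Ornstein-Uhlenbeck example); the paper's grid design and estimator are more careful.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize_scalar

def generator(grid, mu, sig):
    """Birth-death CTMC generator matching the diffusion's drift/variance."""
    k = grid[1] - grid[0]
    Q = np.zeros((len(grid), len(grid)))
    for i, x in enumerate(grid[1:-1], start=1):
        up = sig(x)**2 / (2 * k**2) + max(mu(x), 0) / k
        dn = sig(x)**2 / (2 * k**2) + max(-mu(x), 0) / k
        Q[i, i + 1], Q[i, i - 1], Q[i, i] = up, dn, -(up + dn)
    return Q

def neg_loglik(theta, data, grid, dt):
    Q = generator(grid, mu=lambda x: theta * (1.0 - x), sig=lambda x: 0.3)
    P = expm(Q * dt)                              # exact CTMC transition matrix
    idx = np.clip(np.searchsorted(grid, data), 0, len(grid) - 1)
    return -np.sum(np.log(P[idx[:-1], idx[1:]] + 1e-300))

# simulate an Ornstein-Uhlenbeck path, then recover its mean-reversion rate
rng = np.random.default_rng(4)
dt, x = 0.1, [1.5]
for _ in range(2000):
    x.append(x[-1] + 2.0 * (1.0 - x[-1]) * dt + 0.3 * np.sqrt(dt) * rng.standard_normal())
data, grid = np.array(x), np.linspace(-0.5, 3.0, 120)
res = minimize_scalar(neg_loglik, bounds=(0.1, 5.0), args=(data, grid, dt),
                      method="bounded")
print(res.x)   # should be near the true rate 2.0
```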
【23】 Generalized Huber Loss for Robust Learning and its Efficient Minimization for a Robust Statistics Link: https://arxiv.org/abs/2108.12627
Authors: Kaan Gokcesu, Hakan Gokcesu Abstract: We propose a generalized formulation of the Huber loss. We show that with a suitable function of choice, specifically the log-exp transform, we can achieve a loss function which combines the desirable properties of both the absolute and the quadratic loss. We provide an algorithm to find the minimizer of such loss functions and show that finding a centralizing metric is not that much harder than the traditional mean and median.
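One familiar loss in this spirit is log-cosh, which is quadratic near zero and asymptotically linear; the sketch below (an assumption on our part, not necessarily the paper's exact log-exp form) finds the induced robust centre of a sample with a standard solver.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_cosh_loss(c, x, delta=1.0):
    u = (x - c) / delta
    # log(cosh(u)) computed stably: |u| + log1p(exp(-2|u|)) - log 2
    return np.sum(np.abs(u) + np.log1p(np.exp(-2 * np.abs(u))) - np.log(2.0))

x = np.concatenate([np.random.default_rng(5).normal(0, 1, 500), [50.0]])
center = minimize_scalar(lambda c: log_cosh_loss(c, x)).x
print(np.mean(x), np.median(x), center)  # the centre resists the outlier at 50
```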
【24】 ZAP: Z-value Adaptive Procedures for False Discovery Rate Control with Side Information Link: https://arxiv.org/abs/2108.12623
Authors: Dennis Leung, Wenguang Sun Abstract: Adaptive multiple testing with covariates is an important research direction that has gained major attention in recent years. It has been widely recognized that leveraging side information provided by auxiliary covariates can improve the power of false discovery rate (FDR) procedures. Currently, most such procedures are devised with $p$-values as their main statistics. However, for two-sided hypotheses, the usual data processing step that transforms the primary statistics, known as $z$-values, into $p$-values not only leads to a loss of information carried by the main statistics, but can also undermine the ability of the covariates to assist with the FDR inference. We develop a $z$-value based covariate-adaptive (ZAP) methodology that operates on the intact structural information encoded jointly by the $z$-values and covariates. It seeks to emulate the oracle $z$-value procedure via a working model, and its rejection regions significantly depart from those of the $p$-value adaptive testing approaches. The key strength of ZAP is that the FDR control is guaranteed with minimal assumptions, even when the working model is misspecified. We demonstrate the state-of-the-art performance of ZAP using both simulated and real data, which shows that the efficiency gain can be substantial in comparison with $p$-value based methods. Our methodology is implemented in the R package zap.
【25】 Nonparametric estimation of the incubation time distribution Link: https://arxiv.org/abs/2108.12606
Authors: Piet Groeneboom Affiliations: Delft University of Technology, Delft, The Netherlands Notes: 24 pages, 6 figures Abstract: We discuss nonparametric estimators of the distribution of the incubation time of a disease. The classical approach in these models is to use parametric families like Weibull, log-normal or gamma in the estimation procedure. We analyze instead the nonparametric maximum likelihood estimator (MLE) and show that, under some conditions, its rate of convergence is cube root $n$ and that its limit behavior is given by Chernoff's distribution. We also study smooth estimates, based on the MLE. The density estimates, based on the MLE, are capable of catching finer or unexpected aspects of the density, in contrast with the classical parametric methods. R scripts are provided for the nonparametric methods.
【26】 A robust fusion-extraction procedure with summary statistics in the presence of biased sources Link: https://arxiv.org/abs/2108.12600
Authors: Ruoyu Wang, Qihua Wang, Wang Miao Affiliations: Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; School of Mathematical Sciences, Peking University, Beijing Abstract: Information from various data sources is increasingly available nowadays. However, some of the data sources may produce biased estimation due to commonly encountered biased sampling, population heterogeneity, or model misspecification. This calls for statistical methods to combine information in the presence of biased sources. In this paper, a robust data fusion-extraction method is proposed. The method can produce a consistent estimator of the parameter of interest even if many of the data sources are biased. The proposed estimator is easy to compute and only employs summary statistics, and hence can be applied to many different fields, e.g. meta-analysis, Mendelian randomisation and distributed systems. Moreover, the proposed estimator is asymptotically equivalent to the oracle estimator that only uses data from unbiased sources under some mild conditions. Asymptotic normality of the proposed estimator is also established. In contrast to the existing meta-analysis methods, the theoretical properties are guaranteed even if both the number of data sources and the dimension of the parameter diverge as the sample size increases, which ensures the performance of the proposed method over a wide range. The robustness and oracle property are also evaluated via simulation studies. The proposed method is applied to a meta-analysis data set to evaluate the surgical treatment for moderate periodontal disease, and a Mendelian randomization data set to study the risk factors of head and neck cancer.
【27】 The promise and perils of point process models of political events Link: https://arxiv.org/abs/2108.12566
Authors: Lin Zhu, Scott J. Cook, Mikyoung Jun Abstract: Event data are increasingly common in applied political science research. While these data are inherently locational, political scientists predominately analyze aggregate summaries of these events as areal data. In so doing, they lose much of the information inherent to these point pattern data, and much of the flexibility that comes from analyzing events using point process models. Recognizing these advantages, applied statisticians have increasingly utilized point process models in the analysis of political events. Yet, this work often neglects inherent limitations of political event data (e.g., geolocation accuracy), which can complicate the direct application of point process models. In this paper, we attempt to bridge this divide: introducing the benefits of point process modeling for political science research, and highlighting the unique challenges political science data pose for these approaches. To ground our discussion, we focus on the Global Terrorism Database, using univariate and bivariate log-Gaussian Cox process (LGCP) models to analyze terror attacks in Nigeria during 2014.
【28】 Convergence Rates for Learning Linear Operators from Noisy Data Link: https://arxiv.org/abs/2108.12515
Authors: Maarten V. de Hoop, Nikola B. Kovachki, Nicholas H. Nelsen, Andrew M. Stuart Affiliations: Rice University Notes: 33 pages, 4 figures, 3 tables Abstract: We study the Bayesian inverse problem of learning a linear operator on a Hilbert space from its noisy pointwise evaluations on random input data. Our framework assumes that this target operator is self-adjoint and diagonal in a basis shared with the Gaussian prior and noise covariance operators arising from the imposed statistical model and is able to handle target operators that are compact, bounded, or even unbounded. We establish posterior contraction rates with respect to a family of Bochner norms as the number of data tends to infinity and derive related lower bounds on the estimation error. In the large data limit, we also provide asymptotic convergence rates of suitably defined excess risk and generalization gap functionals associated with the posterior mean point estimator. In doing so, we connect the posterior consistency results to nonparametric learning theory. Furthermore, these convergence rates highlight and quantify the difficulty of learning unbounded linear operators in comparison with the learning of bounded or compact ones. Numerical experiments confirm the theory and demonstrate that similar conclusions may be expected in more general problem settings.
【29】 PanelPRO: a general framework for multi-gene, multi-cancer Mendelian risk prediction models Link: https://arxiv.org/abs/2108.12504
Authors: Jane W. Liang, Gregory E. Idos, Christine Hong, Stephen B. Gruber, Giovanni Parmigiani, Danielle Braun Affiliations: Department of Biostatistics, Harvard T.H. Chan School of Public Health; Department of Data Science, Dana-Farber Cancer Institute Notes: 23 pages, 7 figures Abstract: Risk evaluation to identify individuals who are at greater risk of cancer as a result of heritable pathogenic variants is a valuable component of individualized clinical management. Using principles of Mendelian genetics, Bayesian probability theory, and variant-specific knowledge, Mendelian models derive the probability of carrying a pathogenic variant and developing cancer in the future, based on family history. Existing Mendelian models are widely employed, but are generally limited to specific genes and syndromes. However, the upsurge of multi-gene panel germline testing has spurred the discovery of many new gene-cancer associations that are not presently accounted for in these models. We have developed PanelPRO, a flexible, efficient Mendelian risk prediction framework that can incorporate an arbitrary number of genes and cancers, overcoming the computational challenges that arise because of the increased model complexity. We implement an eleven-gene, eleven-cancer model, the largest Mendelian model created thus far, based on this framework. Using simulations and a clinical cohort with germline panel testing data, we evaluate model performance, validate the reverse-compatibility of our approach with existing Mendelian models, and illustrate its usage. Our implementation is freely available for research use in the PanelPRO R package.
【30】 Joint modelling of longitudinal measurements and survival times via a copula approach Link: https://arxiv.org/abs/2108.12478
Authors: Zili Zhang, Christiana Charalambous, Peter Foster Affiliations: Department of Mathematics, University of Manchester, Manchester, UK Abstract: Joint modelling of longitudinal and time-to-event data is usually described by a random effects joint model which uses shared or correlated latent effects to capture associations between the two processes. Under this framework, the joint distribution of the two processes can be derived straightforwardly by assuming conditional independence given the latent effects. Alternative approaches to induce interdependency into sub-models have also been considered in the literature, and one such approach is using copulas to introduce non-linear correlation between the marginal distributions of the longitudinal and time-to-event processes. A Gaussian copula joint model has been proposed in the literature to fit joint data by applying a Monte Carlo expectation-maximisation algorithm. Enlightening as it is, its original estimation procedure comes with some limitations. In the original approach, the log-likelihood function cannot be derived analytically and thus requires Monte Carlo integration, which not only comes with intensive computation but also introduces extra variation/noise into the estimation. The combination with the EM algorithm slows down the computation further and convergence to the maximum likelihood estimators cannot always be guaranteed. In addition, the assumption that the length of planned measurements is uniform and balanced across all subjects is not suitable when subjects have varying numbers of observations. In this paper, we relax this restriction and propose an exact likelihood estimation approach to replace the more computationally expensive Monte Carlo expectation-maximisation algorithm. We also provide a straightforward way to compute dynamic predictions of survival probabilities, showing that our proposed model is comparable in prediction performance to the shared random effects joint model.
【31】 Point forecasting and forecast evaluation with generalized Huber loss Link: https://arxiv.org/abs/2108.12426
Authors: Robert J. Taggart Affiliations: Bureau of Meteorology Notes: 26 pages, 4 figures. An earlier version of this manuscript was published in December 2020 as a Bureau Research Report at this http URL Abstract: Huber loss, its asymmetric variants and their associated functionals (here named "Huber functionals") are studied in the context of point forecasting and forecast evaluation. The Huber functional of a distribution is the set of minimizers of the expected (asymmetric) Huber loss, is an intermediary between a quantile and corresponding expectile, and also arises in M-estimation. Each Huber functional is elicitable, generating the precise set of minimizers of an expected score, subject to weak regularity conditions on the class of probability distributions, and has a complete characterization of its consistent scoring functions. Such scoring functions admit a mixture representation as a weighted average of elementary scoring functions. Each elementary score can be interpreted as the relative economic loss of using a particular forecast for a class of investment decisions where profits and losses are capped. Finally, synthetic examples illustrate that in forecast evaluation Huber loss serves as a robust scoring alternative to squared error for expectation point forecasts when some measurements of the realization are faulty.
【32】 Ovarian Cancer Prediction from Ovarian Cysts Based on TVUS Using Machine Learning Algorithms Link: https://arxiv.org/abs/2108.13387
Authors: Laboni Akter, Nasrin Akhter Affiliations: Department of Biomedical Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh Notes: This paper has been published in International Conference on Big Data, IoT and Machine Learning 2021 (BIM 2021) Abstract: Ovarian cancer (OC) is a malignancy of the female reproductive system that can be found in young girls and, mostly, in women of fertile or reproductive age. A small number of cysts are dangerous and may cause cancer, so early prediction is very important; detection can be based on Transvaginal Ultrasonography (TVUS) screening. In this research, we employed an actual dataset called PLCO with TVUS screening and three machine learning (ML) techniques, namely Random Forest, KNN, and XGBoost, with three target variables. We obtained the best performance from these algorithms in terms of accuracy, recall, F1 score and precision, with approximate values of 99.50%, 99.50%, 99.49% and 99.50%, respectively. AUC scores of 99.87%, 98.97% and 99.88% are observed for the Random Forest, KNN and XGB algorithms, respectively. This approach helps physicians and suspected patients identify ovarian risks early on, reducing ovarian malignancy-related complications and deaths.
【33】 Survival Prediction of Heart Failure Patients using Stacked Ensemble Machine Learning Algorithm Link: https://arxiv.org/abs/2108.13367
Authors: S. M. Mehedi Zaman, Wasay Mahmood Qureshi, Md. Mohsin Sarker Raihan, Ocean Monjur, Abdullah Bin Shams Affiliations: Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur, Bangladesh; Department of Biomedical Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh Notes: This article has been submitted for publication in Biomedical Physics & Engineering Express Abstract: Cardiovascular disease, especially heart failure, is one of the major health hazards of our time and a leading cause of death worldwide. Advancement in data mining techniques using machine learning (ML) models is paving the way for promising prediction approaches. Data mining is the process of converting massive volumes of raw data created by healthcare institutions into meaningful information that can aid in making predictions and crucial decisions. Collecting various follow-up data from patients who have had heart failure, analyzing those data, and utilizing several ML models to predict the survival possibility of cardiovascular patients is the key aim of this study. Due to the imbalance of the classes in the dataset, the Synthetic Minority Oversampling Technique (SMOTE) has been implemented. Two unsupervised models (K-Means and Fuzzy C-Means clustering) and three supervised classifiers (Random Forest, XGBoost and Decision Tree) have been used in our study. After thorough investigation, our results demonstrate a superior performance of the supervised ML algorithms over unsupervised models. Moreover, we designed and propose a supervised stacked ensemble learning model that can achieve an accuracy, precision, recall and F1 score of 99.98%. Our study shows that only certain attributes collected from the patients are imperative to successfully predict the possibility of survival post heart failure, using supervised ML algorithms.
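A hedged sketch of this kind of pipeline on synthetic data is shown below: SMOTE rebalances the training split and tree-based learners are stacked. The paper's features, tuning and XGBoost component are not reproduced here (GradientBoostingClassifier stands in for XGBoost).

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# imbalanced toy data: roughly 90% negative / 10% positive
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression())
stack.fit(X_bal, y_bal)
print(f1_score(y_te, stack.predict(X_te)))
```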
【34】 On the effects of biased quantum random numbers on the initialization of artificial neural networks Link: https://arxiv.org/abs/2108.13329
Authors: Raoul Heese, Moritz Wolter, Sascha Mücke, Lukas Franken, Nico Piatkowski Affiliations: Fraunhofer Institute for Industrial Mathematics ITWM; Fraunhofer Institute for Algorithms and Scientific Computing SCAI; Artificial Intelligence Group, TU Dortmund University; Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS Notes: 20 pages, 10 figures, 3 tables Abstract: Recent advances in practical quantum computing have led to a variety of cloud-based quantum computing platforms that allow researchers to evaluate their algorithms on noisy intermediate-scale quantum (NISQ) devices. A common property of quantum computers is that they exhibit instances of true randomness as opposed to pseudo-randomness obtained from classical systems. Investigating the effects of such true quantum randomness in the context of machine learning is appealing, and recent results vaguely suggest that benefits can indeed be achieved from the use of quantum random numbers. To shed some more light on this topic, we empirically study the effects of hardware-biased quantum random numbers on the initialization of artificial neural network weights in numerical experiments. We find no statistically significant difference in comparison with unbiased quantum random numbers as well as biased and unbiased random numbers from a classical pseudo-random number generator. The quantum random numbers for our experiments are obtained from real quantum hardware.
【35】 Deep Reinforcement Learning at the Edge of the Statistical Precipice Link: https://arxiv.org/abs/2108.13264
Authors: Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare Affiliations: Google Research, Brain Team; MILA, Université de Montréal; CIFAR Fellow; MILA, McGill University Abstract: Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library, rliable, to prevent unreliable results from stagnating the field.
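The recommended aggregate metric and its interval estimate can be sketched in a few lines of numpy (the open-source rliable library provides polished versions): the interquartile mean over runs and tasks, with a stratified bootstrap over runs within each task.

```python
import numpy as np
from scipy import stats

def iqm(scores):
    """Interquartile mean: trim 25% from each end of the flattened scores."""
    return stats.trim_mean(scores.reshape(-1), proportiontocut=0.25)

def stratified_bootstrap_ci(scores, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n_runs, n_tasks = scores.shape
    boots = np.empty(n_boot)
    for b in range(n_boot):
        # resample runs independently within each task (stratified)
        idx = rng.integers(n_runs, size=(n_runs, n_tasks))
        boots[b] = iqm(scores[idx, np.arange(n_tasks)])
    return np.percentile(boots, [2.5, 97.5])

scores = np.random.default_rng(1).normal(1.0, 0.3, size=(5, 26))  # 5 runs, 26 tasks
print(iqm(scores), stratified_bootstrap_ci(scores))
```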
【36】 Representation of binary classification trees with binary features by quantum circuits 标题:具有二元特征的二元分类树的量子电路表示 链接:https://arxiv.org/abs/2108.13207
作者:Raoul Heese,Patricia Bickert,Astrid Elisa Niederle 机构:Fraunhofer ITWM, Kaiserslautern, Germany, BASF SE, Ludwigshafen, Germany 备注:41 pages, 20 figures, 3 tables 摘要:我们提出了一种基于概率方法的具有二元特征的二元分类树的量子表示。通过使用量子计算机作为概率分布处理器,可以通过测量量子电路来实现决策树的概率遍历。我们描述了如何将树归纳和查询数据的类标签预测集成到这个框架中。一种按需采样方法可以使用恒定数量的经典内存插槽进行预测,与树深度无关。我们使用量子计算模拟器和实际的IBM量子硬件对我们的方法进行了实验研究。据我们所知,这是第一次在量子设备上实现决策树分类器。 摘要:We propose a quantum representation of binary classification trees with binary features based on a probabilistic approach. By using the quantum computer as a processor for probability distributions, a probabilistic traversal of the decision tree can be realized via measurements of a quantum circuit. We describe how tree inductions and the prediction of class labels of query data can be integrated into this framework. An on-demand sampling method enables predictions with a constant number of classical memory slots, independent of the tree depth. We experimentally study our approach using both a quantum computing simulator and actual IBM quantum hardware. To our knowledge, this is the first realization of a decision tree classifier on a quantum device.
【37】 On the approximation of a matrix 标题:关于矩阵的逼近 链接:https://arxiv.org/abs/2108.13195
作者:Samriddha Sanyal 机构:Indian Statistical Institute, B.T. road, Kolkata, India 摘要:设$F^{*}$是由非随机化方法导出的给定$(a \times b)$矩阵$F$的近似值。我们证明了对于给定的$F$和$F^{*}$,可以通过随机化算法计算$H$和$T$,使得$(HT)$是比$F^{*}$更接近$F$的近似。 摘要:Let $F^{*}$ be an approximation of a given $(a \times b)$ matrix $F$ derived by methods that are not randomized. We prove that for a given $F$ and $F^{*}$, $H$ and $T$ can be computed by a randomized algorithm such that $(HT)$ is an approximation of $F$ better than $F^{*}$.
【38】 MonTrees: Automated Detection and Classification of Networking Anomalies in Cellular Networks 标题:MonTrees:蜂窝网络中网络异常的自动检测和分类 链接:https://arxiv.org/abs/2108.13156
作者:Mohamed Moulay,Rafael Garcia Leiva,Pablo J. Rojo Maroni,Vincenzo Mancuso,Antonio Fernandez Anta,Ali Safari Khatouni 备注:15 pages, 15 figures 摘要:蜂窝网络的活跃增长和动态特性使得网络故障排除具有挑战性。在过去的几年中,利用机器学习识别网络问题已经获得了很大的知名度,从而大大改善了蜂窝网络服务。在本文中,我们提出了一种将有监督和无监督机器学习算法相结合的新方法来自动化蜂窝网络中的故障识别过程并对网络异常进行分类。我们使用通过路试测量以及通过MONROE平台获得的运营商业移动网络的真实数据进行的实验表明,我们的方法能够自动识别和分类网络异常,从而能够及时准确地进行故障排除。 摘要:The active growth and dynamic nature of cellular networks makes network troubleshooting challenging. Identification of network problems leveraging on machine learning has gained a lot of visibility in the past few years, resulting in dramatically improved cellular network services. In this paper, we present a novel methodology to automate the fault identification process in a cellular network and to classify network anomalies, which combines supervised and unsupervised machine learning algorithms. Our experiments using real data from operational commercial mobile networks obtained through drive-test measurements as well as via the MONROE platform show that our method can automatically identify and classify networking anomalies, thus enabling timely and precise troubleshooting actions.
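论文具体的MonTrees流水线未在摘要中展开;下面仅给出"无监督检测加有监督分类"这一组合思路的通用示意(数据、特征与故障标签均为虚构占位,并非真实网络KPI)。

```python
# Generic sketch of combining unsupervised detection with supervised
# classification, as described in the abstract. The concrete MonTrees
# pipeline is not specified here; data and fault labels are synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)
kpi = rng.normal(size=(2000, 8))                  # "normal" KPI windows
kpi[:100] += rng.normal(loc=4.0, size=(100, 8))   # injected anomalies

# Stage 1 (unsupervised): flag anomalous measurement windows.
detector = IsolationForest(random_state=0).fit(kpi)
flagged = kpi[detector.predict(kpi) == -1]        # -1 marks outliers

# Stage 2 (supervised): classify flagged windows into known fault types,
# assuming a labeled archive of past incidents exists (placeholder labels).
labels = rng.integers(0, 3, size=len(flagged))
clf = RandomForestClassifier(random_state=0).fit(flagged, labels)
print(clf.predict(flagged[:5]))                   # predicted fault classes
```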
【39】 Individual and population approaches for calibrating division rates in population dynamics: Application to the bacterial cell cycle 标题:在种群动力学中校准分裂速率的个体和种群方法:在细菌细胞周期中的应用 链接:https://arxiv.org/abs/2108.13155
作者:Marie Doumic,Marc Hoffmann 机构:Sorbonne Universités, Inria, UPMC Univ Paris, Lab. J.L. Lions UMR CNRS, Paris, France, University Paris-Dauphine, CEREMADE, Place du Maréchal De Lattre de Tassigny, Paris, France 摘要:建模、分析和推断种群繁殖中的触发机制是许多生物学应用的基础。它也是数学生物学中一个活跃且不断发展的研究领域。在这一章中,我们回顾了过去十年来针对稳定环境中增长与分裂种群的分裂率估计所发展的主要结果。这些方法结合了从偏微分方程和随机过程中借用的工具,以及从数理统计中产生的某种观点。关注细菌细胞分裂周期的应用提供了一个具体的介绍,并可能帮助读者确定该领域的主要新挑战。 摘要:Modelling, analysing and inferring triggering mechanisms in population reproduction is fundamental in many biological applications. It is also an active and growing research domain in mathematical biology. In this chapter, we review the main results developed over the last decade for the estimation of the division rate in growing and dividing populations in a steady environment. These methods combine tools borrowed from PDE's and stochastic processes, with a certain view that emerges from mathematical statistics. A focus on the application to the bacterial cell division cycle provides a concrete presentation, and may help the reader to identify major new challenges in the field.
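作为背景对照(非本章原文):这一方向的经典对象之一是带生长与二元等分裂的增长-碎裂方程,其中分裂率 $B$ 正是待估计的量:

$$
\partial_t n(t,x) + \partial_x\big(\tau(x)\,n(t,x)\big) + B(x)\,n(t,x) = 4\,B(2x)\,n(t,2x),
$$

其中 $n(t,x)$ 为时刻 $t$ 尺寸为 $x$ 的个体密度,$\tau$ 为个体生长速率;尺寸为 $2x$ 的细胞分裂产生两个尺寸为 $x$ 的子细胞,系数 $4$ 来自计数与变量替换的雅可比因子。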
【40】 Investigating Vulnerabilities of Deep Neural Policies 标题:调查深层神经策略的脆弱性 链接:https://arxiv.org/abs/2108.13093
作者:Ezgi Korkmaz 机构:KTH Royal Institute of Technology , Stockholm, Sweden. 备注:Presented at the Conference on Uncertainty in Artificial Intelligence (UAI) 2021 摘要:基于深度神经网络的强化学习策略很容易受到其输入的不可察觉的对抗性扰动的影响,这与神经网络图像分类器非常相似。最近的工作提出了几种方法,以提高深度强化学习代理在存在这些不可察觉干扰(即对抗性训练)的情况下基于训练的对抗性干扰的鲁棒性。在本文中,我们研究了对抗性训练对agent学习的神经策略的影响。特别是,我们采用两种不同的平行方法,研究基于最坏情况分布转移和特征敏感性的深层神经策略对抗性训练的结果。对于第一种方法,我们比较了针对对抗训练和普通训练神经策略计算的最小扰动的傅里叶谱。通过在OpenAI Atari环境中的实验,我们表明,针对敌对训练策略计算的最小扰动更集中于傅立叶域中的低频,这表明这些策略对低频扰动的敏感性更高。对于第二种方法,我们提出了一种新的方法来测量深度神经策略的特征敏感性,并比较了最先进的对抗训练深度神经策略和普通训练深度神经策略的这些特征敏感性差异。我们相信,我们的研究结果可以作为了解对抗性训练与神经策略鲁棒性不同概念之间关系的第一步。 摘要:Reinforcement learning policies based on deep neural networks are vulnerable to imperceptible adversarial perturbations to their inputs, in much the same way as neural network image classifiers. Recent work has proposed several methods to improve the robustness of deep reinforcement learning agents to adversarial perturbations based on training in the presence of these imperceptible perturbations (i.e. adversarial training). In this paper, we study the effects of adversarial training on the neural policy learned by the agent. In particular, we follow two distinct parallel approaches to investigate the outcomes of adversarial training on deep neural policies based on worst-case distributional shift and feature sensitivity. For the first approach, we compare the Fourier spectrum of minimal perturbations computed for both adversarially trained and vanilla trained neural policies. Via experiments in the OpenAI Atari environments we show that minimal perturbations computed for adversarially trained policies are more focused on lower frequencies in the Fourier domain, indicating a higher sensitivity of these policies to low frequency perturbations. For the second approach, we propose a novel method to measure the feature sensitivities of deep neural policies and we compare these feature sensitivity differences in state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies. We believe our results can be an initial step towards understanding the relationship between adversarial training and different notions of robustness for neural policies.
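下面的草图(非论文代码)演示摘要中"比较最小扰动的傅里叶谱"所需的基本度量:对一个扰动图像做二维FFT,并计算低频圆盘内的能量占比。delta在此用随机噪声占位,真实分析中应为针对策略计算出的最小对抗扰动。

```python
# Sketch (not the paper's code): measure the fraction of a perturbation's
# energy in low spatial frequencies via the 2D FFT. `delta` is a random
# placeholder; in the actual analysis it would be a minimal adversarial
# perturbation of an 84x84 Atari observation.
import numpy as np

rng = np.random.default_rng(0)
delta = rng.normal(size=(84, 84))

spectrum = np.fft.fftshift(np.fft.fft2(delta))   # move DC to the centre
power = np.abs(spectrum) ** 2

h, w = power.shape
yy, xx = np.mgrid[:h, :w]
radius = np.hypot(yy - h // 2, xx - w // 2)      # distance from DC

low = power[radius <= 10].sum()                  # energy in a low-freq disc
print("low-frequency energy fraction:", low / power.sum())
```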
【41】 An Introduction to Variational Inference 标题:变分推理导论 链接:https://arxiv.org/abs/2108.13083
作者:Ankush Ganguly,Samuel W. F. Earp 机构:Sertis Vision Lab 备注:13 pages, 9 figures 摘要:近似复杂的概率密度是现代统计学中的一个核心问题。在本文中,我们引入了变分推理(VI)的概念,这是机器学习中一种流行的方法,它使用优化技术来估计复杂的概率密度。这种特性使得VI比经典方法(如马尔可夫链蒙特卡罗抽样)收敛更快。从概念上讲,VI的工作原理是选择一族概率密度函数,然后找到其中最接近实际概率密度的一个——通常使用Kullback-Leibler(KL)散度作为优化指标。我们引入证据下界(ELBO)来以可处理的方式计算近似概率密度,并回顾了平均场变分推理背后的思想。最后,我们讨论了VI在变分自动编码器(VAE)和VAE生成对抗网络(VAE-GAN)中的应用。通过本文,我们旨在解释VI的概念,并为采用这种方法的后续研究提供帮助。 摘要:Approximating complex probability densities is a core problem in modern statistics. In this paper, we introduce the concept of Variational Inference (VI), a popular method in machine learning that uses optimization techniques to estimate complex probability densities. This property allows VI to converge faster than classical methods, such as Markov Chain Monte Carlo sampling. Conceptually, VI works by choosing a family of probability density functions and then finding the one closest to the actual probability density -- often using the Kullback-Leibler (KL) divergence as the optimization metric. We introduce the Evidence Lower Bound to tractably compute the approximated probability density and we review the ideas behind mean-field variational inference. Finally, we discuss the applications of VI to variational auto-encoders (VAE) and VAE-Generative Adversarial Network (VAE-GAN). With this paper, we aim to explain the concept of VI and assist in future research with this approach.
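为把ELBO与KL散度落到实处,下面给出一个一维共轭高斯模型的教学性示例(纯numpy,非论文代码):先验 $z\sim N(0,1)$、观测 $x_i\mid z\sim N(z,1)$,用重参数化技巧对ELBO做蒙特卡洛梯度上升来拟合 $q(z)=N(m,s^2)$,并与闭式后验对照。

```python
# Teaching sketch of variational inference (not from the paper): fit
# q(z) = N(m, s^2) to the posterior of a conjugate Gaussian model
# (prior z ~ N(0,1), x_i | z ~ N(z,1)) by Monte-Carlo gradient ascent
# on the ELBO using the reparameterization trick.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, size=20)   # observed data
n = len(x)

m, log_s = 0.0, 0.0                # variational parameters
for _ in range(2000):
    s = np.exp(log_s)
    eps = rng.normal(size=64)
    z = m + s * eps                # reparameterized samples from q
    # d/dz [log p(x|z) + log p(z)] for this Gaussian model:
    dlogp = (x.sum() - n * z) - z
    m += 0.01 * dlogp.mean()
    log_s += 0.01 * ((dlogp * s * eps).mean() + 1.0)  # +1 = entropy gradient

# closed-form posterior N(sum(x)/(n+1), 1/(n+1)) for comparison
print("VI:", m, np.exp(log_s) ** 2)
print("exact:", x.sum() / (n + 1), 1.0 / (n + 1))
```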
【42】 Communication-Computation Efficient Device-Edge Co-Inference via AutoML 标题:基于AutoML的通信计算高效设备边缘协同推理 链接:https://arxiv.org/abs/2108.13009
作者:Xinjie Zhang,Jiawei Shao,Yuyi Mao,Jun Zhang 机构:Dept. of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong 摘要:设备边缘协同推理(Device-edge co-inference)是一种在资源受限的移动设备和边缘服务器之间划分深度神经网络的方法,最近成为支持智能移动应用的一种很有前途的范例。为了加快推理过程,设备模型稀疏化和中间特征压缩被认为是两种重要的技术。然而,由于设备模型稀疏度水平和中间特征压缩比分别直接影响计算量和通信开销,并且两者都影响推理精度,因此,由于搜索空间大,寻找这些超参数的最优值带来了重大挑战。在本文中,我们致力于开发一种有效的算法来确定这些超参数。通过选择合适的模型分割点和一对用于中间特征向量的编码器/解码器,该问题被转化为一个序列决策问题,并为此提出了一种基于深度强化学习(DRL)的新型自动机器学习(AutoML)框架。在一个图像分类任务上的实验结果证明了该框架的有效性:相对于各种基线方案,该框架实现了更好的通信计算权衡和显著的推理加速。 摘要:Device-edge co-inference, which partitions a deep neural network between a resource-constrained mobile device and an edge server, has recently emerged as a promising paradigm to support intelligent mobile applications. To accelerate the inference process, on-device model sparsification and intermediate feature compression are regarded as two prominent techniques. However, as the on-device model sparsity level and intermediate feature compression ratio have direct impacts on computation workload and communication overhead respectively, and both of them affect the inference accuracy, finding the optimal values of these hyper-parameters brings a major challenge due to the large search space. In this paper, we endeavor to develop an efficient algorithm to determine these hyper-parameters. By selecting a suitable model split point and a pair of encoder/decoder for the intermediate feature vector, this problem is cast as a sequential decision problem, for which a novel automated machine learning (AutoML) framework is proposed based on deep reinforcement learning (DRL). Experimental results on an image classification task demonstrate the effectiveness of the proposed framework in achieving a better communication-computation trade-off and significant inference speedup against various baseline schemes.
【43】 Optimal testing strategies to monitor COVID-19 traced contacts 标题:监测冠状病毒追踪接触者的最佳检测策略 链接:https://arxiv.org/abs/2108.12938
作者:Patricio Foncea,Susana Mondschein,Marcelo Olivares 机构:Operations Research Center, MIT, Department of Industrial Engineering, University of Chile, Instituto Sistemas Complejos de Ingeniería 备注:16 pages, 6 pages of appendix, 6 figures, 3 tables 摘要:隔离已确定的密切接触者对于降低传播率和避免症状出现前和无症状病例的二次感染风险至关重要。这种减少传播的接触追踪策略的有效性对检疫的遵守情况很敏感,检疫时间越长,或在接种疫苗的人群中(风险感知降低),检疫的遵守情况可能越低。本研究开发了一个模拟模型,以评估接触追踪策略,该模型基于暴露后已识别接触者的顺序测试,作为隔离的替代方案,其中接触者只有在通过阳性测试确认后才被隔离。该分析考虑了不同数量和类型的检测(PCR和侧向流抗原检测(LFA)),以确定成本效益高的检测政策,从而最大限度地减少暴露后的预期感染天数,同时考虑到不同水平的检测能力。该分析表明,即使是有限数量的检测也能有效降低二次感染风险:两次LFA检测(具有最佳时间)可避免感染,其水平相当于90%依从性的14天隔离;添加第三次检测(PCR或LFA)可达到95%的检疫符合率。这些结果对接触者的暴露日期具有鲁棒性,这表明,在难以实施严格隔离的环境中,简单的测试规则可以有效地改进接触者追踪。 摘要:The quarantine of identified close contacts has been vital to reducing transmission rates and averting secondary infection risk before symptom onset and by asymptomatic cases. The effectiveness of this contact tracing strategy to mitigate transmission is sensitive to the adherence to quarantines, which may be lower for longer quarantine periods or in vaccinated populations (where perceptions of risk are reduced). This study develops a simulation model to evaluate contact tracing strategies based on the sequential testing of identified contacts after exposure as an alternative to quarantines, in which contacts are isolated only after confirmation by a positive test. The analysis considers different number and types of tests (PCR and lateral flow antigen tests (LFA)) to identify the cost-effective testing policies that minimize the expected infecting days post-exposure considering different levels of testing capacity. This analysis suggests that even a limited number of tests can be effective at reducing secondary infection risk: two LFA tests (with optimal timing) avert infectiousness at a level that is comparable to 14-day quarantine with 90% adherence; adding a third test (PCR or LFA) reaches the efficiency of a 95% quarantine adherence. These results are robust to the exposure dates of the contact, which suggests that simple testing rules can be effective for improving contact tracing in settings where strict quarantine adherence is difficult to implement.
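摘要中的校准参数(潜伏期、检测灵敏度随时间的变化等)并未给出;下面是一个参数纯属假设的玩具蒙特卡洛,仅用于说明"暴露后第d1、d2天检测、阳性即隔离"这类策略如何通过模拟比较其减少的社区内传染天数。

```python
# Toy Monte-Carlo sketch of sequential post-exposure testing. All parameters
# (infectiousness onset, window length, test sensitivity) are invented for
# illustration only and are NOT the paper's calibrated values.
import numpy as np

rng = np.random.default_rng(0)

def community_infectious_days(test_days, sens=0.7, n_sim=20_000):
    onset = rng.gamma(shape=5.0, scale=1.0, size=n_sim)  # infectiousness start
    end = onset + 7.0                                    # assumed 7-day window
    isolated_at = np.full(n_sim, np.inf)
    for d in sorted(test_days):
        detectable = (d >= onset) & (d <= end)           # crude detectability
        positive = detectable & (rng.random(n_sim) < sens)
        isolated_at = np.where(positive & (d < isolated_at), d, isolated_at)
    # infectious days spent in the community before isolation (if any)
    days = np.clip(np.minimum(end, isolated_at) - onset, 0.0, None)
    return days.mean()

print("no tests     :", community_infectious_days([]))
print("tests day 3,5:", community_infectious_days([3, 5]))
```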
【44】 Photonic Quantum Policy Learning in OpenAI Gym 标题:OpenAI Gym中的光子量子策略学习 链接:https://arxiv.org/abs/2108.12926
作者:Dániel Nagy,Zsolt Tabi,Péter Hága,Zsófia Kallus,Zoltán Zimborás 机构:Ericsson Hungary and, Eötvös Loránd University, Budapest, Hungary, Péter Hága, Ericsson Research, Zsófia Kallus, Zoltán Zimborás, Wigner Research Centre for Physics and, MTA-BME Lendület QIT Research Group 备注:7 pages 摘要:近年来,近期可用的噪声中尺度量子(NISQ)计算设备已经问世。利用NISQ量子计算机原型最有前途的应用领域之一是量子机器学习。虽然量子神经网络在监督学习中被广泛研究,但量子强化学习仍是该领域的一个新兴方向。为了解决一个经典的连续控制问题,我们使用了连续变量量子机器学习方法。我们介绍了光子变分量子代理的近端策略优化,并研究了数据重新上传的效果。我们通过实证研究给出性能评估,使用了Strawberry Fields(光子模拟器的Fock后端)以及连接到OpenAI Gym环境和TensorFlow的混合训练框架。对于受限CartPole问题,光子策略学习的两种变体实现了相近的性能水平,并且比具有相同可训练参数数量的基线经典神经网络收敛更快。 摘要:In recent years, near-term noisy intermediate scale quantum (NISQ) computing devices have become available. One of the most promising application areas to leverage such NISQ quantum computer prototypes is quantum machine learning. While quantum neural networks are widely studied for supervised learning, quantum reinforcement learning is still just an emerging field of this area. To solve a classical continuous control problem, we use a continuous-variable quantum machine learning approach. We introduce proximal policy optimization for photonic variational quantum agents and also study the effect of the data re-uploading. We present performance assessment via empirical study using Strawberry Fields, a photonic simulator Fock backend and a hybrid training framework connected to an OpenAI Gym environment and TensorFlow. For the restricted CartPole problem, the two variations of the photonic policy learning achieve comparable performance levels and a faster convergence than the baseline classical neural network of same number of trainable parameters.
【45】 Avoiding unwanted results in locally linear embedding: A new understanding of regularization 标题:避免局部线性嵌入中不需要的结果:对正则化的新理解 链接:https://arxiv.org/abs/2108.12680
作者:Liren Lin 机构:Department of Applied Mathematics, National Sun Yat-sen University, Taiwan 备注:11 pages 摘要:我们证明了当不使用正则化时,局部线性嵌入(LLE)本质上会产生一些不需要的结果,即使在原始算法中被认为不需要正则化的情形下也是如此。在数据的每个邻域都实现精确局部线性关系的情况下,我们从数学上证明了一类特殊结果的存在性,我们称之为"投影模式"。通过在嵌入高维空间、带孔的瑞士卷上的数值例子,展示了这些特殊模式以及在更一般情况下可能出现的其他奇异结果。可以观察到,使用正则化能够有效防止所有这些不良结果。 摘要:We demonstrate that locally linear embedding (LLE) inherently admits some unwanted results when no regularization is used, even for cases in which regularization is not supposed to be needed in the original algorithm. The existence of one special type of result, which we call "projection pattern", is mathematically proved in the situation that an exact local linear relation is achieved in each neighborhood of the data. These special patterns as well as some other bizarre results that may occur in more general situations are shown by numerical examples on the Swiss roll with a hole embedded in a high dimensional space. It is observed that all these bad results can be effectively prevented by using regularization.
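scikit-learn中LLE的reg参数正对应摘要讨论的正则化;下面的小例子(非论文代码)在标准瑞士卷上对比默认正则化与几乎关闭正则化时的嵌入(论文中的实验使用的是嵌入高维空间、带孔的瑞士卷,此处仅作示意)。

```python
# Small sketch (not the paper's code): the `reg` argument of scikit-learn's
# LocallyLinearEmbedding is the regularization discussed above. The paper
# uses a Swiss roll with a hole embedded in high dimension; a plain Swiss
# roll stands in here.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1500, random_state=0)

for reg in (1e-3, 1e-9):  # default regularization vs. effectively none
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                                 reg=reg, random_state=0)
    Y = lle.fit_transform(X)
    print(f"reg={reg:g}  per-axis spread of the embedding: {Y.std(axis=0)}")
```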
【46】 Variational Inference with NoFAS: Normalizing Flow with Adaptive Surrogate for Computationally Expensive Models 标题:使用NoFAS的变分推理:面向计算昂贵模型的自适应代理归一化流 链接:https://arxiv.org/abs/2108.12657
作者:Yu Wang,Fang Liu,Daniele E. Schiavazzi 机构:Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA 摘要:从数据中快速推断数值模型参数是生成广泛应用的预测模型的重要前提。当每个似然评估的计算成本很高时,使用基于抽样的方法(如马尔可夫链蒙特卡罗)可能会变得很困难。将变分推理与归一化流相结合的新方法的特点是计算量仅随潜在变量空间的维数线性增长,并且依赖基于梯度的优化而不是采样,从而为模型参数的贝叶斯推理提供了一种更有效的方法。此外,通过使用离线训练的替代模型(如神经网络)替换真实模型,可以降低频繁评估昂贵可能性的成本。然而,当替代物在后验模态周围不够精确时,这种方法可能会产生显著的偏差。为了在不牺牲推理精度的情况下降低计算成本,我们提出了一种使用自适应代理(NoFAS)的归一化流优化策略,该策略交替更新归一化流参数和神经网络代理模型的权重。我们还提出了一种用于代理模型训练的有效样本加权方案,该方案在捕获产生观测数据的参数的可能区域的同时,确保了代理模型的某些全局准确性。我们展示了NoFAS相对于各种基准的推理和计算优势,包括底层模型缺乏可识别性的情况。本研究所用的源代码和数值实验可在https://github.com/cedricwangyu/NoFAS. 摘要:Fast inference of numerical model parameters from data is an important prerequisite to generate predictive models for a wide range of applications. Use of sampling-based approaches such as Markov chain Monte Carlo may become intractable when each likelihood evaluation is computationally expensive. New approaches combining variational inference with normalizing flow are characterized by a computational cost that grows only linearly with the dimensionality of the latent variable space, and rely on gradient-based optimization instead of sampling, providing a more efficient approach for Bayesian inference about the model parameters. Moreover, the cost of frequently evaluating an expensive likelihood can be mitigated by replacing the true model with an offline trained surrogate model, such as neural networks. However, this approach might generate significant bias when the surrogate is insufficiently accurate around the posterior modes. To reduce the computational cost without sacrificing inferential accuracy, we propose Normalizing Flow with Adaptive Surrogate (NoFAS), an optimization strategy that alternatively updates the normalizing flow parameters and the weights of a neural network surrogate model. We also propose an efficient sample weighting scheme for surrogate model training that ensures some global accuracy of the surrogate while capturing the likely regions of the parameters that yield the observed data. We demonstrate the inferential and computational superiority of NoFAS against various benchmarks, including cases where the underlying model lacks identifiability. The source code and numerical experiments used for this study are available at https://github.com/cedricwangyu/NoFAS.
【47】 Stochastic Approximation with Discontinuous Dynamics, Differential Inclusions, and Applications 标题:间断动力学、微分包含的随机逼近及其应用 链接:https://arxiv.org/abs/2108.12652
作者:Nhu Nguyen,George Yin 摘要:这项工作发展了随机逼近算法的新结果。重点是处理算法和具有不连续性的极限。主要成分包括微分包含、集值分析、非光滑分析和随机微分包含的使用。在广泛的条件下,证明了适当缩放的迭代序列具有微分包含极限。此外,还首次证明了一个中心的、有尺度的迭代序列弱收敛于一个随机微分包含极限。然后将结果用于处理几个应用示例,包括马尔可夫决策过程、Lasso算法、Pegasos算法、支持向量机分类和学习。还提供了一些数值演示。 摘要:This work develops new results for stochastic approximation algorithms. The emphases are on treating algorithms and limits with discontinuities. The main ingredients include the use of differential inclusions, set-valued analysis, and non-smooth analysis, and stochastic differential inclusions. Under broad conditions, it is shown that a suitably scaled sequence of the iterates has a differential inclusion limit. In addition, it is shown for the first time that a centered and scaled sequence of the iterates converges weakly to a stochastic differential inclusion limit. The results are then used to treat several application examples including Markov decision process, Lasso algorithms, Pegasos algorithms, support vector machine classification, and learning. Some numerical demonstrations are also provided.
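作为摘要所述框架的经典出发点(教科书内容,非本文新结果):随机逼近迭代在 $f$ 不连续时,其极限动态不再由常微分方程而由微分包含刻画,集值映射取Filippov凸化:

$$
x_{n+1} = x_n + a_n\big[f(x_n) + \xi_n\big], \qquad
\dot{x}(t) \in F\big(x(t)\big), \quad
F(x) = \bigcap_{\delta > 0} \overline{\operatorname{co}}\, f\big(B_\delta(x)\big),
$$

其中 $a_n$ 为步长、$\xi_n$ 为噪声,$\overline{\operatorname{co}}$ 表示闭凸包,$B_\delta(x)$ 为 $x$ 的 $\delta$ 邻域。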
【48】 Extracting Stochastic Governing Laws by Nonlocal Kramers-Moyal Formulas 标题:利用非局部Kramers-Moyal公式提取随机控制律 链接:https://arxiv.org/abs/2108.12570
作者:Yubin Lu,Yang Li,Jinqiao Duan 机构:School of Mathematics and Statistics & Center for Mathematical Sciences, Huazhong University of Science and Technology, Wuhan, China, School of Automation, Nanjing University of Science and Technology, Nanjing, China 摘要:随着计算技术和科学工具的迅速发展,数据驱动分析在从数据中提取动力系统控制规律方面取得了很大进展。尽管非高斯涨落广泛存在,但到目前为止,识别具有非高斯Lévy噪声的随机微分方程的有效数据驱动方法相对较少。在这项工作中,我们提出了一种数据驱动的方法,从短时间的模拟数据中提取(高斯)布朗运动和(非高斯)Lévy运动的随机控制律。具体地说,我们使用归一化流技术从数据中估计转移概率密度函数(非局部福克-普朗克方程的解),然后将其代入最近提出的非局部Kramers-Moyal公式,以近似Lévy跳跃测度、漂移系数和扩散系数。我们证明了这种方法可以学习具有Lévy运动的随机微分方程。我们给出了一维和二维、解耦和耦合系统的例子来说明我们的方法。该方法将成为发现随机控制规律和理解复杂动力学行为的有效工具。 摘要:With the rapid development of computational techniques and scientific tools, great progress of data-driven analysis has been made to extract governing laws of dynamical systems from data. Despite the wide occurrences of non-Gaussian fluctuations, the effective data-driven methods to identify stochastic differential equations with non-Gaussian Lévy noise are relatively few so far. In this work, we propose a data-driven approach to extract stochastic governing laws with both (Gaussian) Brownian motion and (non-Gaussian) Lévy motion, from short bursts of simulation data. Specifically, we use the normalizing flows technology to estimate the transition probability density function (solution of nonlocal Fokker-Planck equation) from data, and then substitute it into the recently proposed nonlocal Kramers-Moyal formulas to approximate Lévy jump measure, drift coefficient and diffusion coefficient. We demonstrate that this approach can learn the stochastic differential equation with Lévy motion. We present examples with one- and two-dimensional, decoupled and coupled systems to illustrate our method. This approach will become an effective tool for discovering stochastic governing laws and understanding complex dynamical behaviors.
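作为对照,经典(纯扩散情形的)Kramers-Moyal公式从短时转移行为中恢复漂移与扩散系数;论文的非局部版本在此基础上进一步估计Lévy跳跃测度(下式仅为经典情形):

$$
a(x) = \lim_{t \to 0} \frac{1}{t}\,\mathbb{E}\big[X_t - x \mid X_0 = x\big], \qquad
b^2(x) = \lim_{t \to 0} \frac{1}{t}\,\mathbb{E}\big[(X_t - x)^2 \mid X_0 = x\big].
$$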
【49】 Self-fulfilling Bandits: Endogeneity Spillover and Dynamic Selection in Algorithmic Decision-making 标题:自我实现的强盗:算法决策中的内生性溢出与动态选择 链接:https://arxiv.org/abs/2108.12547
作者:Jin Li,Ye Luo,Xiaowei Zhang 机构: The University of Hong Kong 备注:Main Body: 29 pages, 6 figures; Supplemental Material: 25 pages 摘要:在本文中,我们研究了数据和行为相互依赖的算法决策中的内生性问题。当背景多臂bandit模型中存在内生性协变量时,由于协变量的内生性溢出到行为中,会产生一种新的偏差(自我实现偏差)。我们提出了一类算法,通过将工具变量纳入领先的在线学习算法来纠正偏差。这些算法还获得了与无内生性的情况下最著名的下限相匹配的遗憾水平。为了建立理论属性,我们开发了一种通用技术,可以解开数据和动作之间的相互依赖关系。 摘要:In this paper, we study endogeneity problems in algorithmic decision-making where data and actions are interdependent. When there are endogenous covariates in a contextual multi-armed bandit model, a novel bias (self-fulfilling bias) arises because the endogeneity of the covariates spills over to the actions. We propose a class of algorithms to correct for the bias by incorporating instrumental variables into leading online learning algorithms. These algorithms also attain regret levels that match the best known lower bound for the cases without endogeneity. To establish the theoretical properties, we develop a general technique that untangles the interdependence between data and actions.
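摘要的关键机制是用工具变量(IV)修正内生性偏差;下面用一个与bandit无关的最小线性示例(纯属示意)演示:当协变量与误差相关时OLS有偏,而IV估计量一致。

```python
# Minimal sketch of the instrumental-variable correction the paper builds on:
# with an endogenous regressor, OLS is biased while the IV estimator is
# consistent. Toy linear model only, not the contextual-bandit setup.
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0
z = rng.normal(size=n)                 # instrument: drives x, independent of u
u = rng.normal(size=n)                 # structural error
x = z + 0.8 * u + rng.normal(size=n)   # endogenous regressor: depends on u
y = beta * x + u

ols = (x @ y) / (x @ x)                # biased, since E[x u] != 0
iv = (z @ y) / (z @ x)                 # consistent, since E[z u] = 0
print(f"OLS: {ols:.3f}   IV: {iv:.3f}   true beta: {beta}")
```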
【50】 Approximate Bayesian Optimisation for Neural Networks 标题:神经网络的近似贝叶斯优化 链接:https://arxiv.org/abs/2108.12461
作者:Nadhir Hassen,Irina Rish 机构:MILA, Quebec, Canada 摘要:已有大量工作致力于机器学习算法的自动化,以凸显模型选择的重要性。自动选择最佳预测模型及其相应参数的过程可以改善广泛的实际应用。贝叶斯优化(BO)使用黑盒优化方法,通过采集函数按照探索-利用权衡准则提出候选解。BO框架包含两个关键要素:一是概率代理模型,由对未知(数据相关)目标函数的先验信念构成;二是描述模型拟合最优程度的目标函数。选择最佳模型及其相关超参数可能非常昂贵,代理模型通常用高斯过程(GPs)拟合,并在某些情况下因其难处理性而采用近似推理。然而,由于GPs的计算量随观测数量呈立方增长,处理那些需要大量评估才能优化的目标一直是一个挑战。此外,大多数真实数据集是非平稳的,这使得代理模型所依赖的理想化假设难以成立。以随机方式解决解析可处理性与计算可行性问题,才能确保贝叶斯优化的效率和适用性。在本文中,我们探讨了用神经网络代替GPs来对函数上的分布建模,并提供了密度比估计与基于近似推理的类概率估计之间的联系;这一重新表述带来了算法上的效率与可处理性。 摘要:A body of work has been done to automate machine learning algorithms to highlight the importance of model choice. Automating the process of choosing the best forecasting model and its corresponding parameters can improve a wide range of real-world applications. Bayesian optimisation (BO) uses a blackbox optimisation method to propose solutions according to an exploration-exploitation trade-off criterion through acquisition functions. The BO framework involves two key ingredients: a probabilistic surrogate model that consists of a prior belief about the unknown (data-dependent) objective function, and an objective function that describes how optimal the model fit is. Choosing the best model and its associated hyperparameters can be very expensive, and the surrogate is typically fit using Gaussian processes (GPs), in some cases relying on approximate inference due to intractability. However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires many evaluations. In addition, most real-world datasets are non-stationary, which undermines the idealized assumptions placed on surrogate models. Addressing analytical tractability and computational feasibility in a stochastic fashion ensures the efficiency and applicability of Bayesian optimisation. In this paper we explore the use of neural networks as an alternative to GPs to model distributions over functions, and we provide a link between density-ratio estimation and class probability estimation based on approximate inference; this reformulation provides algorithmic efficiency and tractability.
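摘要提到的"密度比估计与类概率估计之间的联系"本身是一个标准结论:用均衡的两类样本训练概率分类器 $\pi(x)$ 区分来自 $p$ 与 $q$ 的点,则 $p(x)/q(x)=\pi(x)/(1-\pi(x))$。下面用逻辑回归给出最小演示(非论文实现)。

```python
# Minimal demo (not the paper's implementation) of the standard link between
# density-ratio estimation and class-probability estimation: with balanced
# samples from p and q, a probabilistic classifier pi(x) yields
# p(x)/q(x) = pi(x) / (1 - pi(x)).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
xp = rng.normal(loc=1.0, size=(5000, 1))   # samples from p = N(1, 1)
xq = rng.normal(loc=0.0, size=(5000, 1))   # samples from q = N(0, 1)

X = np.vstack([xp, xq])
y = np.r_[np.ones(5000), np.zeros(5000)]   # 1 = "drawn from p"
clf = LogisticRegression().fit(X, y)

x0 = np.array([[0.5]])
pi = clf.predict_proba(x0)[0, 1]
print("estimated ratio:", pi / (1 - pi))
print("true ratio     :", norm.pdf(0.5, loc=1.0) / norm.pdf(0.5, loc=0.0))
```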
【51】 Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via Generative Models 标题:基于产生式模型的高维异构数据集多模态数据融合 链接:https://arxiv.org/abs/2108.12445
作者:Yasin Yilmaz,Mehmet Aktukmak,Alfred O. Hero 摘要:常用的潜在空间嵌入技术,如主成分分析、因子分析和流形学习技术,通常用于学习同质数据的有效表示。然而,它们不容易扩展到由数字和分类变量组合而成的异构数据,例如,由链接GPS和文本数据产生的数据。在本文中,我们感兴趣的是以无监督的方式从高维异构数据中学习概率生成模型。学习的生成模型提供潜在的统一表示,捕捉数据多维度的共同因素,从而能够为各种机器学习任务融合多模态数据。根据贝叶斯方法,我们提出了一个通用框架,通过指数分布族的自然参数化,将不同的数据类型组合在一起。为了将模型推理扩展到具有数千个特征的数百万实例,我们使用拉普拉斯-伯恩斯坦近似进行涉及非线性连接函数的后验计算。针对具有实值(高斯)和分类(多项式)特征的常见异构数据集,详细介绍了该算法。在两个高维异构数据集(NYC-Taxi和MovieLens-10M)上的实验表明,该算法在不同的机器学习任务(如异常检测、数据插补和推荐系统)上具有可扩展性和竞争力。 摘要:The commonly used latent space embedding techniques, such as Principal Component Analysis, Factor Analysis, and manifold learning techniques, are typically used for learning effective representations of homogeneous data. However, they do not readily extend to heterogeneous data that are a combination of numerical and categorical variables, e.g., arising from linked GPS and text data. In this paper, we are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion. The learned generative model provides latent unified representations that capture the factors common to the multiple dimensions of the data, and thus enable fusing multimodal data for various machine learning tasks. Following a Bayesian approach, we propose a general framework that combines disparate data types through the natural parameterization of the exponential family of distributions. To scale the model inference to millions of instances with thousands of features, we use the Laplace-Bernstein approximation for posterior computations involving nonlinear link functions. The proposed algorithm is presented in detail for the commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features. Experiments on two high-dimensional and heterogeneous datasets (NYC Taxi and MovieLens-10M) demonstrate the scalability and competitive performance of the proposed algorithm on different machine learning tasks such as anomaly detection, data imputation, and recommender systems.
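摘要中"指数分布族的自然参数化"指的是标准形式(高斯与多项分布均为其特例),列出以供对照:

$$
p(x \mid \eta) = h(x)\,\exp\!\big(\eta^{\top} T(x) - A(\eta)\big),
$$

其中 $\eta$ 为自然参数,$T(x)$ 为充分统计量,$A(\eta)$ 为对数配分函数,$h(x)$ 为基测度;不同数据类型各自取相应的 $(T, h)$,从而在同一潜变量模型中耦合。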
【52】 Identifying rote learning and the supporting effects of hints in drills 标题:识别死记硬背和提示在训练中的支持作用 链接:https://arxiv.org/abs/2108.12288
作者:Gunnar Stefansson,Anna Helga Jonsdottir,Thorarinn Jonmundsson,Gylfi Snaer Sigurdsson,Ingunn Lilja Bergsdottir 机构:University of Iceland, Science Institute (ICELAND) 摘要:每当学生使用任何训练系统时,都会产生这样的问题:他们的学习有多少是有意义的学习,而不是通过重复或死记硬背来记忆。尽管这两种类型的学习在教育系统中都有自己的位置,但重要的是能够区分这两种学习方式,并找到能使学生摆脱死记硬背、激励他们进行有意义学习的选项。tutor-web是一个在线训练系统,其设计目标是学习而非评估。系统向学生呈现多项选择题,这些题目随机选取,但与学生的表现相关联。题目本身可以针对特定主题生成:从与一个总的问题陈述(题干)相关联的集合中抽取正确与错误的答案。在这种生成方式下,学生可能会两次看到同一题干,但看到的是全新的答案选项,或新旧答案选项的混合。我们分析了COVID-19期间讲授的一门概率论与统计学课程的数据,以区分死记硬背与有意义的学习。分析表明,学生的学习并非死记硬背,但即使题库很大,当学生看到之前见过的答案选项时,他们的表现也会更好。这既显示出死记硬背的成分,也表明存在更深层次的学习。题库中已经植入了提示,使一些问题包含指向正确答案的线索。这与有意义学习与死记硬背学习的问题相关,因为我们希望新的提示能够引导学生更认真地思考问题,而不是继续死记硬背。初步结果表明,提示对成绩较差的学生尤其有用。 摘要:Whenever students use any drilling system the question arises how much of their learning is meaningful learning vs memorisation through repetition or rote learning. Although both types of learning have their place in an educational system it is important to be able to distinguish between these two approaches to learning and identify options which can dislodge students from rote learning and motivate them towards meaningful learning. The tutor-web is an online drilling system. The design aim of the system is learning rather than evaluation. This is done by presenting students with multiple-choice questions which are selected randomly but linked to the students' performance. The questions themselves can be generated for a specific topic by drawing correct and incorrect answers from a collection associated with a general problem statement or heading. With this generating process students may see the same question heading twice but be presented with all new answer options or a mixture of new and old answer options. Data from a course on probability theory and statistics, taught during COVID-19, are analysed to separate rote learning from meaningful learning. The analyses show non-rote learning, but even with large question databases, students' performance is better when they are presented with an answer option they have seen before. An element of rote learning is thus exhibited but a deeper learning is also demonstrated. The item database has been seeded with hints such that some questions contain clues to cue the students towards the correct answer. This ties in with the issue of meaningful learning versus rote learning since the hope is that a new hint will work as a cue to coax the student to think harder about the question rather than continue to employ rote learning. Preliminary results indicate that hints are particularly useful for students with poor performance metrics.
【53】 Provable Tensor-Train Format Tensor Completion by Riemannian Optimization 标题:基于黎曼优化的可证明张量列(TT)格式张量补全 链接:https://arxiv.org/abs/2108.12163
作者:Jian-Feng Cai,Jingyang Li,Dong Xia 机构:Hong Kong University of Science and Technology 备注:71 pages, 5 figures 摘要:张量列(TT)格式在处理结构化高阶张量方面具有诱人的优势。近十年来,TT格式张量在不同学科中得到了广泛应用,其中张量补全引起了广泛关注。许多快速算法,包括黎曼梯度下降(RGrad)算法,已被提出用于TT格式的张量补全。然而,这些算法的理论保证大多缺失或次优,部分原因是TT格式分解中复杂的递归代数运算。此外,为其他格式的张量(例如Tucker和CP)建立的现有结果并不适用,因为处理TT格式张量的算法有很大不同且更为复杂。在本文中,据我们所知,我们首次在近乎最优的样本量条件下,为TT格式张量补全的RGrad算法建立了收敛性理论保证:RGrad算法以不依赖张量条件数的常数收缩率线性收敛,且无需重新条件化。我们还提出了一种称为序列二阶矩方法的新方法,以在相似的样本量要求下获得热启动初始化。作为副产品,我们的结果甚至显著改进了先前关于矩阵补全的RGrad算法的研究。数值实验证实了我们的理论发现,并展示了TT格式分解带来的计算加速。 摘要:The tensor train (TT) format enjoys appealing advantages in handling structural high-order tensors. The recent decade has witnessed the wide applications of TT-format tensors from diverse disciplines, among which tensor completion has drawn considerable attention. Numerous fast algorithms, including the Riemannian gradient descent (RGrad) algorithm, have been proposed for the TT-format tensor completion. However, the theoretical guarantees of these algorithms are largely missing or sub-optimal, partly due to the complicated and recursive algebraic operations in TT-format decomposition. Moreover, existing results established for the tensors of other formats, for example, Tucker and CP, are inapplicable because the algorithms treating TT-format tensors are substantially different and more involved. In this paper, we provide, to our best knowledge, the first theoretical guarantees of the convergence of RGrad algorithm for TT-format tensor completion, under a nearly optimal sample size condition. The RGrad algorithm converges linearly with a constant contraction rate that is free of tensor condition number without the necessity of re-conditioning. We also propose a novel approach, referred to as the sequential second-order moment method, to attain a warm initialization under a similar sample size requirement. As a byproduct, our result even significantly refines the prior investigation of RGrad algorithm for matrix completion. Numerical experiments confirm our theoretical discovery and showcase the computational speedup gained by the TT-format decomposition.
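作为TT格式本身的背景示意(经典TT-SVD分解,而非论文的RGrad补全算法),下面用numpy给出最小实现,用以说明TT核的形状与重构方式。

```python
# Minimal NumPy implementation of the classical TT-SVD decomposition. This
# illustrates the TT format itself; it is NOT the paper's RGrad completion
# algorithm.
import numpy as np

def tt_svd(A, ranks):
    """Decompose tensor A into TT cores with prescribed maximal TT ranks."""
    dims = A.shape
    cores, r_prev = [], 1
    C = A.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        C = C.reshape(r_prev * dims[k], -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(ranks[k], len(S))
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        C = S[:r, None] * Vt[:r]          # carry the remainder forward
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores

def tt_to_full(cores):
    full = cores[0]
    for G in cores[1:]:
        full = np.tensordot(full, G, axes=([-1], [0]))
    return full.squeeze(axis=(0, -1))

A = np.random.default_rng(0).normal(size=(4, 5, 6, 7))
cores = tt_svd(A, ranks=[4, 20, 7])       # full ranks: exact reconstruction
print("max reconstruction error:", np.abs(tt_to_full(cores) - A).max())
```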
机器翻译,仅供参考