Statistics Academic Digest [10.19]

2021-10-21 16:17:31


stat (Statistics): 73 papers in total

【1】 Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple descent Link: https://arxiv.org/abs/2110.09502

Authors: Yue Li, Yuting Wei Abstract: An evolving line of machine learning works observes empirical evidence that suggests interpolating estimators -- the ones that achieve zero training error -- may not necessarily be harmful. This paper pursues theoretical understanding for an important type of interpolators: the minimum $\ell_{1}$-norm interpolator, which is motivated by the observation that several learning algorithms favor low $\ell_1$-norm solutions in the over-parameterized regime. Concretely, we consider the noisy sparse regression model under Gaussian design, focusing on linear sparsity and high-dimensional asymptotics (so that both the number of features and the sparsity level scale proportionally with the sample size). We observe, and provide rigorous theoretical justification for, a curious multi-descent phenomenon; that is, the generalization risk of the minimum $\ell_1$-norm interpolator undergoes multiple (and possibly more than two) phases of descent and ascent as one increases the model capacity. This phenomenon stems from the special structure of the minimum $\ell_1$-norm interpolator as well as the delicate interplay between the over-parameterized ratio and the sparsity, thus unveiling a fundamental distinction in geometry from the minimum $\ell_2$-norm interpolator. Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.
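
For readers who want to experiment, the minimum $\ell_1$-norm interpolator itself is computable as a linear program; below is a minimal Python sketch (our illustration, not code from the paper), assuming an over-parameterized Gaussian design:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolator(X, y):
    """Solve min ||b||_1 subject to X b = y by splitting b = b_pos - b_neg,
    both nonnegative, and minimizing sum(b_pos + b_neg) as a linear program."""
    n, p = X.shape
    c = np.ones(2 * p)
    A_eq = np.hstack([X, -X])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))        # over-parameterized: p > n
y = rng.standard_normal(20)
b = min_l1_interpolator(X, y)
print(np.max(np.abs(X @ b - y)))          # ~0: exact interpolation
```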

【2】 Gradient boosting with extreme-value theory for wildfire prediction Link: https://arxiv.org/abs/2110.09497

Authors: Jonathan Koh Affiliations: Institute of Mathematics, EPFL; Institute of Mathematical Statistics and Actuarial Science, University of Bern Abstract: This paper details the approach of the team $\textit{Kohrrelation}$ in the 2021 Extreme Value Analysis data challenge, dealing with the prediction of wildfire counts and sizes over the contiguous US. Our approach uses ideas from extreme-value theory in a machine learning context with theoretically justified loss functions for gradient boosting. We devise a spatial cross-validation scheme and show that in our setting it provides a better proxy for test set performance than naive cross-validation. The predictions are benchmarked against boosting approaches with different loss functions, and perform competitively in terms of the score criterion, finally placing second in the competition ranking.
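
The spatial cross-validation idea can be sketched generically: cluster the observation coordinates into contiguous blocks and treat the blocks as folds, so spatial dependence does not leak from training into validation. A minimal sketch (our illustration; the authors' exact scheme may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def spatial_cv_folds(coords, n_folds=5, seed=0):
    """Assign each observation to a fold by clustering its (lon, lat)
    coordinates, so that folds are spatially contiguous blocks."""
    km = KMeans(n_clusters=n_folds, n_init=10, random_state=seed)
    return km.fit_predict(coords)

coords = np.random.default_rng(1).uniform(size=(500, 2))
folds = spatial_cv_folds(coords)
for f in range(5):
    train, valid = folds != f, folds == f   # boolean masks for each split
```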

【3】 Frequentist-Bayes Hybrid Covariance Estimation for Unfolding Problems Link: https://arxiv.org/abs/2110.09382

Authors: Pim Jordi Verschuuren Abstract: In this paper we present a frequentist-Bayesian hybrid method for estimating covariances of unfolded distributions using pseudo-experiments. The method is compared with other covariance estimation methods using the unbiased Rao-Cramer bound (RCB) and frequentist pseudo-experiments. We show that the unbiased RCB method diverges from the other two methods when regularization is introduced. The new hybrid method agrees well with the frequentist pseudo-experiment method for various amounts of regularization. However, the hybrid method has the added advantage of not requiring a clear likelihood definition and can be used in combination with any unfolding algorithm that uses a response matrix to model the detector response.

【4】 Efficient Exploration in Binary and Preferential Bayesian Optimization Link: https://arxiv.org/abs/2110.09361

Authors: Tristan Fauvel, Matthew Chalk Affiliations: Sorbonne Université, INSERM, CNRS, Institut de la Vision, Paris, France Abstract: Bayesian optimization (BO) is an effective approach to optimize expensive black-box functions, that seeks to trade off between exploitation (selecting parameters where the maximum is likely) and exploration (selecting parameters where we are uncertain about the objective function). In many real-world situations, direct measurements of the objective function are not possible, and only binary measurements such as success/failure or pairwise comparisons are available. To perform efficient exploration in this setting, we show that it is important for BO algorithms to distinguish between different types of uncertainty: epistemic uncertainty, about the unknown objective function, and aleatoric uncertainty, which comes from noisy observations and cannot be reduced. In effect, only the former is important for efficient exploration. Based on this, we propose several new acquisition functions that outperform state-of-the-art heuristics in binary and preferential BO, while being fast to compute and easy to implement. We then generalize these acquisition rules to batch learning, where multiple queries are performed simultaneously.
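
The epistemic/aleatoric split has a simple Monte Carlo illustration for a Bernoulli outcome with success probability p = Phi(f) and a GP posterior on the latent f. In this sketch (ours, not the paper's code), mu and sigma stand for the GP posterior mean and standard deviation at a candidate point:

```python
import numpy as np
from scipy.stats import norm

def uncertainty_split(mu, sigma, n_samples=10_000, rng=None):
    """Epistemic uncertainty: posterior variance of p = Phi(f), reducible by
    observing more data. Aleatoric uncertainty: E[p(1-p)], the irreducible
    Bernoulli observation noise."""
    rng = rng or np.random.default_rng(0)
    f = rng.normal(mu, sigma, size=n_samples)   # draws from the GP posterior
    p = norm.cdf(f)
    return p.var(), (p * (1 - p)).mean()

epistemic, aleatoric = uncertainty_split(mu=0.3, sigma=1.2)
```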

【5】 Prediction of liquid fuel properties using machine learning models with Gaussian processes and probabilistic conditional generative learning Link: https://arxiv.org/abs/2110.09360

Authors: Rodolfo S. M. Freitas, Ágatha P. F. Lima, Cheng Chen, Fernando A. Rochinha, Daniel Mira, Xi Jiang Affiliations: Dept. Mechanical Engineering, COPPE, Federal University of Rio de Janeiro, Brazil; School of Engineering and Materials Science, Queen Mary University of London, UK; Barcelona Supercomputing Center, Barcelona, Spain Note: 22 pages, 13 figures Abstract: Accurate determination of fuel properties of complex mixtures over a wide range of pressure and temperature conditions is essential to utilizing alternative fuels. The present work aims to construct cheap-to-compute machine learning (ML) models to act as closure equations for predicting the physical properties of alternative fuels. Those models can be trained using the database from MD simulations and/or experimental measurements in a data-fusion-fidelity approach. Here, Gaussian process (GP) and probabilistic generative models are adopted. GP is a popular non-parametric Bayesian approach to build surrogate models mainly due to its capacity to handle aleatoric and epistemic uncertainties. Generative models have shown the ability of deep neural networks employed with the same intent. In this work, ML analysis is focused on a particular property, the fuel density, but it can also be extended to other physicochemical properties. This study explores the versatility of the ML models to handle multi-fidelity data. The results show that ML models can accurately predict fuel properties over a wide range of pressure and temperature conditions.
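
A minimal sketch of the GP surrogate idea with scikit-learn, mapping (pressure, temperature) to density; the data here are toy placeholders for the MD/experimental databases mentioned in the abstract:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder training data: columns are (pressure, temperature); y is density.
rng = np.random.default_rng(2)
X = rng.uniform([1, 300], [100, 600], size=(50, 2))
y = 800 - 0.5 * X[:, 1] + 0.8 * X[:, 0]   # toy stand-in for MD/experimental data

kernel = RBF(length_scale=[10.0, 50.0]) + WhiteKernel(1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
mean, std = gp.predict([[50, 450]], return_std=True)  # prediction + uncertainty
```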

【6】 Regression with Missing Data, a Comparison Study of Techniques Based on Random Forests Link: https://arxiv.org/abs/2110.09333

Authors: Irving Gómez-Méndez, Emilien Joly Affiliations: Centro de Investigación en Matemáticas, AC (CIMAT) Abstract: In this paper we present the practical benefits of a new random forest algorithm to deal with missing values in the sample. The purpose of this work is to compare the different solutions to deal with missing values with random forests and describe our new algorithm performance as well as its algorithmic complexity. A variety of missing value mechanisms (such as MCAR, MAR, MNAR) are considered and simulated. We study the quadratic errors and the bias of our algorithm and compare it to the most popular missing values random forests algorithms in the literature. In particular, we compare those techniques for both a regression and prediction purpose. This work follows a first paper Gómez-Méndez and Joly (2020) on the consistency of this new algorithm.

【7】 Double Robust Mass-Imputation with Matching Estimators Link: https://arxiv.org/abs/2110.09275

Authors: Ali Furkan Kalay Abstract: This paper proposes using a method named Double Score Matching (DSM) to do mass-imputation and presents an application to make inferences with a nonprobability sample. DSM is a $k$-Nearest Neighbors algorithm that uses two balance scores instead of covariates to reduce the dimension of the distance metric and thus to achieve a faster convergence rate. DSM mass-imputation and population inference are consistent if one of two balance score models is correctly specified. Simulation results show that the DSM performs better than recently developed double robust estimators when the data generating process has nonlinear confounders. The nonlinearity of the DGP is a major concern because it cannot be tested, and it leads to a violation of the assumptions required to achieve consistency. Even if the consistency of the DSM relies on the two modeling assumptions, it prevents bias from inflating under such cases because DSM is a semiparametric estimator. The confidence intervals are constructed using a wild bootstrapping approach. The proposed bootstrapping method generates valid confidence intervals as long as DSM is consistent.
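
A rough sketch of the matching step as we read it (hypothetical score models; the paper's estimator and bootstrap inference are more involved): fit two balance scores, then impute each target unit's outcome from its k nearest neighbors in the 2-D score space.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import NearestNeighbors

def dsm_impute(X_s, y_s, X_t, k=5):
    """Double-score matching sketch: score 1 is a (sampling) propensity score,
    score 2 a prognostic score; match in the 2-D score space instead of X."""
    n_s = len(X_s)
    lbl = np.r_[np.ones(n_s), np.zeros(len(X_t))]
    Xall = np.vstack([X_s, X_t])
    ps = LogisticRegression().fit(Xall, lbl).predict_proba(Xall)[:, 1]
    prog = LinearRegression().fit(X_s, y_s).predict(Xall)
    S = np.column_stack([ps, prog])
    nn = NearestNeighbors(n_neighbors=k).fit(S[:n_s])
    _, idx = nn.kneighbors(S[n_s:])
    return y_s[idx].mean(axis=1)          # imputed outcomes for target units
```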

【8】 RKHS-SHAP: Shapley Values for Kernel Methods Link: https://arxiv.org/abs/2110.09167

Authors: Siu Lun Chau, Javier Gonzalez, Dino Sejdinovic Affiliations: Department of Statistics, University of Oxford, United Kingdom; Microsoft Research Cambridge, United Kingdom Note: 11 pages, 4 figures Abstract: Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values, a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing Shapley values from a functional perspective, we propose \textsc{RKHS-SHAP}, an attribution method for kernel machines that can efficiently compute both \emph{Interventional} and \emph{Observational} Shapley values using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations - a key yet often overlooked desideratum for interpretability. Further, we propose a \emph{Shapley regulariser}, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific features' contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the Shapley values of sensitive features.

【9】 Variance Reduction in Stochastic Reaction Networks using Control Variates Link: https://arxiv.org/abs/2110.09143

Authors: Michael Backenköhler, Luca Bortolussi, Verena Wolf Affiliations: Saarland University, Germany; Saarbrücken Graduate School of Computer Science; University of Trieste, Italy Note: arXiv admin note: substantial text overlap with arXiv:1905.00854 Abstract: Monte Carlo estimation plays a crucial role in stochastic reaction networks. However, reducing the statistical uncertainty of the corresponding estimators requires sampling a large number of trajectories. We propose control variates based on the statistical moments of the process to reduce the estimators' variances. We develop an algorithm that selects an efficient subset of infinitely many control variates. To this end, the algorithm uses resampling and a redundancy-aware greedy selection. We demonstrate the efficiency of our approach in several case studies.
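
The core control-variate construction is compact; here is a generic Monte Carlo sketch (the paper's variates are built from moment equations of the reaction network, which this illustration does not attempt):

```python
import numpy as np

def cv_estimate(y, z, z_mean):
    """Control-variate estimator: subtract c*(zbar - E[z]) from ybar, with the
    variance-optimal coefficient c = Cov(y, z) / Var(z)."""
    c = np.cov(y, z)[0, 1] / np.var(z, ddof=1)
    return y.mean() - c * (z.mean() - z_mean)

rng = np.random.default_rng(3)
z = rng.exponential(1.0, size=10_000)      # control with known mean E[z] = 1
y = np.sqrt(z) + rng.normal(0, 0.1, size=z.size)
print(cv_estimate(y, z, 1.0), y.mean())    # lower-variance vs naive estimate
```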

【10】 Optimal designs for experiments for scalar-on-function linear models Link: https://arxiv.org/abs/2110.09115

Authors: Damianos Michaelides, Maria Adamou, David C. Woods, Antony M. Overstall Affiliations: University of Southampton, UK Abstract: The aim of this work is to extend the usual optimal experimental design paradigm to experiments where the settings of one or more factors are functions. For these new experiments, a design consists of combinations of functions for each run of the experiment along with settings for non-functional variables. After briefly introducing the class of functional variables, basis function systems are described. Basis function expansion is applied to a functional linear model consisting of both functional and scalar factors, reducing the problem to an optimisation problem of a single design matrix.
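
The basis-expansion step can be sketched as follows, with a generic Fourier system standing in for whatever basis the authors use: each run's functional factor is reduced to a coefficient vector, and those coefficients become columns of an ordinary design matrix.

```python
import numpy as np

def fourier_coefficients(x_profiles, n_basis=5):
    """Project each run's functional factor x(t), sampled on a grid in [0, 1],
    onto 1, cos(2*pi*j*t), sin(2*pi*j*t); the coefficients form design columns."""
    n_grid = x_profiles.shape[1]
    t = np.linspace(0, 1, n_grid)
    basis = [np.ones_like(t)]
    for j in range(1, (n_basis - 1) // 2 + 1):
        basis += [np.cos(2 * np.pi * j * t), np.sin(2 * np.pi * j * t)]
    B = np.stack(basis, axis=1)            # (n_grid, n_basis)
    return x_profiles @ B / n_grid         # approximate L2 inner products

runs = np.random.default_rng(4).standard_normal((8, 200))  # 8 runs, 200 grid pts
design = fourier_coefficients(runs)        # one row of design columns per run
```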

【11】 Kernel-based estimation for partially functional linear model: Minimax rates and randomized sketches Link: https://arxiv.org/abs/2110.09042

Authors: Shaogao Lv, Xin He, Junhui Wang Affiliations: College of Statistics and Mathematics, Nanjing Audit University, Nanjing, China; School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China; School of Data Science, City University of Hong Kong Abstract: This paper considers the partially functional linear model (PFLM) where all predictive features consist of a functional covariate and a high dimensional scalar vector. Over an infinite dimensional reproducing kernel Hilbert space, the proposed estimation for PFLM is a least square approach with two mixed regularizations of a function-norm and an $\ell_1$-norm. Our main task in this paper is to establish the minimax rates for PFLM under a high dimensional setting, and the optimal minimax rates of estimation are established by using various techniques in empirical process theory for analyzing kernel classes. In addition, we propose an efficient numerical algorithm based on randomized sketches of the kernel matrix. Several numerical experiments are implemented to support our method and optimization strategy.
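
Randomized sketching for kernel ridge regression can be illustrated in a few lines (a generic Gaussian-sketch version under our own assumptions; the paper's estimator additionally handles the functional covariate and the $\ell_1$ penalty):

```python
import numpy as np

def sketched_krr(K, y, lam, m, seed=0):
    """Restrict the KRR coefficients to alpha = S.T @ w with a random m x n
    sketch S; solving an m-dimensional system replaces the n x n solve."""
    n = K.shape[0]
    S = np.random.default_rng(seed).standard_normal((m, n)) / np.sqrt(m)
    SK = S @ K
    w = np.linalg.solve(SK @ K @ S.T + n * lam * (SK @ S.T), SK @ y)
    return S.T @ w                          # alpha; predictions are K @ alpha

rng = np.random.default_rng(5)
n = 300
X = rng.standard_normal((n, 3))
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))  # RBF Gram matrix
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
alpha = sketched_krr(K, y, lam=1e-3, m=50)
```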

【12】 A Bayesian approach to multi-task learning with network lasso Link: https://arxiv.org/abs/2110.09040

Authors: Kaito Shimamura, Shuichi Kawano Abstract: Network lasso is a method for solving a multi-task learning problem through the regularized maximum likelihood method. A characteristic of network lasso is setting a different model for each sample. The relationships among the models are represented by relational coefficients. A crucial issue in network lasso is to provide appropriate values for these relational coefficients. In this paper, we propose a Bayesian approach to solve multi-task learning problems by network lasso. This approach allows us to objectively determine the relational coefficients by Bayesian estimation. The effectiveness of the proposed method is shown in a simulation study and a real data analysis.

【13】 A Space-time Model for Inferring A Susceptibility Map for An Infectious Disease Link: https://arxiv.org/abs/2110.09013

Authors: Xiaoxiao Li, Matthew Ferrari, Michael J. Tildesley, Murali Haran Affiliations: Department of Statistics, Pennsylvania State University, University Park, PA, USA; Department of Biology and the Center for Infectious Disease Dynamics, Penn State University Abstract: Motivated by foot-and-mouth disease (FMD) outbreak data from Turkey, we develop a model to estimate disease risk based on a space-time record of outbreaks. The spread of infectious disease in geographical units depends on both transmission between neighbouring units and the intrinsic susceptibility of each unit to an outbreak. Spatially correlated susceptibility may arise from known factors, such as population density, or unknown (or unmeasured) factors such as commuter flows, environmental conditions, or health disparities. Our framework accounts for both space-time transmission and susceptibility. We model the unknown spatially correlated susceptibility as a Gaussian process. We show that the susceptibility surface can be estimated from observed, geo-located time series of infection events and use a projection-based dimension reduction approach which improves computational efficiency. In addition to identifying high risk regions from the Turkey FMD data, we also study how our approach works on the well known England-Wales measles outbreaks data; our latter study results in an estimated susceptibility surface that is strongly correlated with population size, consistent with prior analyses.

【14】 Valid and Exact Statistical Inference for Multi-dimensional Multiple Change-Points by Selective Inference Link: https://arxiv.org/abs/2110.08989

Authors: Ryota Sugiyama, Hiroki Toda, Vo Nguyen Le Duy, Yu Inatsu, Ichiro Takeuchi Affiliations: Nagoya Institute of Technology; RIKEN Abstract: In this paper, we study statistical inference of change-points (CPs) in multi-dimensional sequence. In CP detection from a multi-dimensional sequence, it is often desirable not only to detect the location, but also to identify the subset of the components in which the change occurs. Several algorithms have been proposed for such problems, but no valid exact inference method has been established to evaluate the statistical reliability of the detected locations and components. In this study, we propose a method that can guarantee the statistical reliability of both the location and the components of the detected changes. We demonstrate the effectiveness of the proposed method by applying it to the problems of genomic abnormality identification and human behavior analysis.

【15】 Sample size calculations for n-of-1 trials Link: https://arxiv.org/abs/2110.08970

Authors: Jiabei Yang, Jon A. Steingrimsson, Christopher H. Schmid Affiliations: Department of Biostatistics, School of Public Health, Brown University; Center for Evidence Synthesis in Health, School of Public Health, Brown University Abstract: N-of-1 trials, single participant trials in which multiple treatments are sequentially randomized over the study period, can give direct estimates of individual-specific treatment effects. Combining n-of-1 trials gives extra information for estimating the population average treatment effect compared with randomized controlled trials and increases precision for individual-specific treatment effect estimates. In this paper, we present a procedure for designing n-of-1 trials. We formally define the design components for determining the sample size of a series of n-of-1 trials, present models for analyzing these trials and use them to derive the sample size formula for estimating the population average treatment effect and the standard error of the individual-specific treatment effect estimates. We recommend first finding the possible designs that will satisfy the power requirement for estimating the population average treatment effect and then, if of interest, finalizing the design to also satisfy the standard error requirements for the individual-specific treatment effect estimates. The procedure is implemented and illustrated in the paper and through a Shiny app.
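
A back-of-the-envelope version of the population-level calculation, under assumptions we supply for illustration (not the paper's formulas): with K periods per participant split evenly between two treatments, within-person SD sigma_w and between-person effect SD sigma_b, one participant's effect estimate has variance roughly sigma_b^2 + 4 sigma_w^2 / K, and a standard normal-approximation power calculation gives the number of participants.

```python
import numpy as np
from scipy.stats import norm

def n_of_1_sample_size(delta, sigma_w, sigma_b, K, alpha=0.05, power=0.8):
    """Participants needed to detect a population-average effect delta with a
    two-sided level-alpha test, assuming K periods per person (half per arm)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var_ind = sigma_b**2 + 4 * sigma_w**2 / K   # variance of one person's estimate
    return int(np.ceil(z**2 * var_ind / delta**2))

print(n_of_1_sample_size(delta=0.5, sigma_w=1.0, sigma_b=0.5, K=6))
```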

【16】 On completing a measurement model by symmetry Link: https://arxiv.org/abs/2110.08969

Authors: Richard E. Danielson Affiliations: Bedford Institute of Oceanography, Fisheries and Oceans Canada Note: 4 pages Abstract: An appeal for symmetry is made to build established notions of specific representation and specific nonlinearity of measurement (often called model error) into a canonical linear regression model. Additive components are derived from the trivially complete model M = m. Factor analysis and equation error motivate corresponding notions of representation and nonlinearity in an errors-in-variables framework, with a novel interpretation of terms. It is suggested that a modern interpretation of correlation involves both linear and nonlinear association.

【17】 Assessing Ecosystem State Space Models: Identifiability and Estimation Link: https://arxiv.org/abs/2110.08967

Authors: John W. Smith, Leah R. Johnson, Robert Q. Thomas Affiliations: Department of Statistics, Virginia Tech; Department of Forest Resources and Environmental Conservation Abstract: Bayesian methods are increasingly being applied to parameterize mechanistic process models used in environmental prediction and forecasting. In particular, models describing ecosystem dynamics with multiple states that are linear and autoregressive at each step in time can be treated as statistical state space models. In this paper we examine this subset of ecosystem models, giving closed form Gibbs sampling updates for latent states and process precision parameters when process and observation errors are normally distributed. We use simulated data from an example model (DALECev) to assess the performance of parameter estimation and identifiability under scenarios of gaps in observations. We show that process precision estimates become unreliable as temporal gaps between observed state data increase. To improve estimates, particularly precisions, we introduce a method of tuning the timestep of the latent states to leverage higher-frequency driver information. Further, we show that data cloning is a suitable method for assessing parameter identifiability in this class of models. Overall, our study helps inform the application of state space models to ecological forecasting applications where 1) data are not available for all states and transfers at the operational timestep for the ecosystem model and 2) process uncertainty estimation is desired.
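
The closed-form precision update the authors exploit is the standard Normal-Gamma conjugate step; a minimal sketch with hypothetical hyperparameters a0 and b0:

```python
import numpy as np

def sample_process_precision(residuals, a0=0.1, b0=0.1, rng=None):
    """Gibbs update for a precision tau with Gamma(a0, b0) prior and normal
    residuals: tau | rest ~ Gamma(a0 + n/2, rate = b0 + 0.5 * sum(resid^2))."""
    rng = rng or np.random.default_rng(0)
    shape = a0 + residuals.size / 2
    rate = b0 + 0.5 * np.sum(residuals**2)
    return rng.gamma(shape, scale=1.0 / rate)  # numpy parameterizes by scale = 1/rate
```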

【18】 Rejoinder: Learning Optimal Distributionally Robust Individualized Treatment Rules Link: https://arxiv.org/abs/2110.08936

Authors: Weibin Mo, Zhengling Qi, Yufeng Liu Abstract: We thank the editors for the opportunity for this discussion and the discussants for their insightful comments and thoughtful contributions. We also want to congratulate Kallus (2020) for his inspiring work in improving the efficiency of policy learning by retargeting. Motivated from the discussion in Dukes and Vansteelandt (2020), we first point out interesting connections and distinctions between our work and Kallus (2020) in Section 1. In particular, the assumptions and sources of variation for consideration in these two papers lead to different research problems with different scopes and focuses. In Section 2, following the discussions in Li et al. (2020) and Liang and Zhao (2020), we also consider the efficient policy evaluation problem when we have some data from the testing distribution available at the training stage. We show that under the assumption that the sample sizes from training and testing are growing in the same order, efficient value function estimates can deliver competitive performance. We further show some connections of these estimates with existing literature. However, when the growth of testing sample size available for training is in a slower order, efficient value function estimates may not perform well anymore. In contrast, the requirement of the testing sample size for DRITR is not as strong as that of efficient policy evaluation using the combined data. Finally, we highlight the general applicability and usefulness of DRITR in Section 3.

【19】 Exploitation of error correlation in a large analysis validation: GlobCurrent case study Link: https://arxiv.org/abs/2110.08905

Authors: Richard E. Danielson, Johnny A. Johannessen, Graham D. Quartly, Marie-Hélène Rio, Bertrand Chapron, Fabrice Collard, Craig Donlon Affiliations: Nansen Environmental and Remote Sensing Center, Bergen, Norway; Plymouth Marine Laboratory, Plymouth, United Kingdom; Collecte Localisation Satellites, Ramonville Saint-Agne, France; Ifremer, Plouzané, France; OceanDataLab, Locmaria-Plouzané, France Abstract: An assessment of variance in ocean current signal and noise shared by in situ observations (drifters) and a large gridded analysis (GlobCurrent) is sought as a function of day of the year for 1993-2015 and across a broad spectrum of current speed. Regardless of the division of collocations, it is difficult to claim that any synoptic assessment can be based on independent observations. Instead, a measurement model that departs from ordinary linear regression by accommodating error correlation is proposed. The interpretation of independence is explored by applying Fuller's (1987) concept of equation and measurement error to a division of error into shared (correlated) and unshared (uncorrelated) components, respectively. The resulting division of variance in the new model favours noise. Ocean current shared (equation) error is of comparable magnitude to unshared (measurement) error and the latter is, for GlobCurrent and drifters respectively, comparable to ordinary and reverse linear regression. Although signal variance appears to be small, its utility as a measure of agreement between two variates is highlighted. Sparse collocations that sample a dense grid permit a first order autoregressive form of measurement model to be considered, including parameterizations of analysis-in situ error cross-correlation and analysis temporal error autocorrelation. The former (cross-correlation) is an equation error term that accommodates error shared by both GlobCurrent and drifters. The latter (autocorrelation) facilitates an identification and retrieval of all model parameters. Solutions are sought using a prescribed calibration between GlobCurrent and drifters (by variance matching). Because the true current variance of GlobCurrent and drifters is small, signal to noise ratio is near zero at best. This is particularly evident for moderate current speed and meridional current component.

【20】 Persuasion by Dimension Reduction Link: https://arxiv.org/abs/2110.08884

Authors: Semyon Malamud, Andreas Schrimpf Note: arXiv admin note: text overlap with arXiv:2102.10909 Abstract: How should an agent (the sender) observing multi-dimensional data (the state vector) persuade another agent to take the desired action? We show that it is always optimal for the sender to perform a (non-linear) dimension reduction by projecting the state vector onto a lower-dimensional object that we call the "optimal information manifold." We characterize geometric properties of this manifold and link them to the sender's preferences. Optimal policy splits information into "good" and "bad" components. When the sender's marginal utility is linear, revealing the full magnitude of good information is always optimal. In contrast, with concave marginal utility, optimal information design conceals the extreme realizations of good information and only reveals its direction (sign). We illustrate these effects by explicitly solving several multi-dimensional Bayesian persuasion problems.

【21】 Building Degradation Index with Variable Selection for Multivariate Sensory Data Link: https://arxiv.org/abs/2110.08882

Authors: Yueyao Wang, I-Chen Lee, Yili Hong, Xinwei Deng Affiliations: Department of Statistics, Virginia Tech, Blacksburg, VA; Department of Statistics, National Cheng Kung University, Tainan, Taiwan Note: 28 pages Abstract: The modeling and analysis of degradation data have been an active research area in reliability and system health management. As sensor technology advances, multivariate sensory data are commonly collected for the underlying degradation process. However, most existing research on degradation modeling requires a univariate degradation index to be provided. Thus, constructing a degradation index for multivariate sensory data is a fundamental step in degradation modeling. In this paper, we propose a novel degradation index building method for multivariate sensory data. Based on an additive nonlinear model with variable selection, the proposed method can automatically select the most informative sensor signals to be used in the degradation index. The penalized likelihood method with adaptive group penalty is developed for parameter estimation. We demonstrate that the proposed method outperforms existing methods via both simulation studies and analyses of the NASA jet engine sensor data.

【22】 A Bayesian Selection Model for Correcting Outcome Reporting Bias With Application to a Meta-analysis on Heart Failure Interventions Link: https://arxiv.org/abs/2110.08849

Authors: Ray Bai, Xiaokang Liu, Lifeng Lin, Yulun Liu, Stephen E. Kimmel, Haitao Chu, Yong Chen Note: 26 pages, 5 tables, 8 figures Abstract: Multivariate meta-analysis (MMA) is a powerful tool for jointly estimating multiple outcomes' treatment effects. However, the validity of results from MMA is potentially compromised by outcome reporting bias (ORB), or the tendency for studies to selectively report outcomes. Until recently, ORB has been understudied. Since ORB can lead to biased conclusions, it is crucial to correct the estimates of effect sizes and quantify their uncertainty in the presence of ORB. With this goal, we develop a Bayesian selection model to adjust for ORB in MMA. We further propose a measure for quantifying the impact of ORB on the results from MMA. We evaluate our approaches through a meta-evaluation of 748 bivariate meta-analyses from the Cochrane Database of Systematic Reviews. Our model is motivated by and applied to a meta-analysis of interventions on hospital readmission and quality of life for heart failure patients. In our analysis, the relative risk (RR) of hospital readmission for the intervention group changes from a significant decrease (RR: 0.931, 95% confidence interval [CI]: 0.862-0.993) to a statistically nonsignificant effect (RR: 0.955, 95% CI: 0.876-1.051) after adjusting for ORB. This study demonstrates that failing to account for ORB can lead to different conclusions in a meta-analysis.

【23】 On minimax estimation problem for stationary stochastic sequences from observations in special sets of points Link: https://arxiv.org/abs/2110.08766

Authors: Oleksandr Masyutka, Mikhail Moklyachuk Affiliations: Department of Mathematics and Theoretical Radiophysics and Department of Probability Theory, Taras Shevchenko National University of Kyiv Note: arXiv admin note: text overlap with arXiv:1804.08408 Abstract: The problem of the mean-square optimal estimation of the linear functionals which depend on the unknown values of a stochastic stationary sequence from observations of the sequence in special sets of points is considered. Formulas for calculating the mean-square error and the spectral characteristic of the optimal linear estimate of the functionals are derived under the condition of spectral certainty, where the spectral density of the sequence is exactly known. The minimax (robust) method of estimation is applied in the case where the spectral density of the sequence is not known exactly while some sets of admissible spectral densities are given. Formulas that determine the least favourable spectral densities and the minimax spectral characteristics are derived for some special sets of admissible densities.

【24】 JEL ratio test for independence of time to failure and cause of failure in competing risks Link: https://arxiv.org/abs/2110.08747

Authors: Sreelakshmy N., Sreedevi E. P. Affiliations: Prajyothi Nikethan College, Pudukkad, India; SNGS College, Pattambi, India Abstract: In the present article, we propose a jackknife empirical likelihood (JEL) ratio test for testing the independence of time to failure and cause of failure in competing risks data. We use U-statistic theory to derive the JEL ratio test. The asymptotic distribution of the test statistic is shown to be chi-square with one degree of freedom. A Monte Carlo simulation study is carried out to assess the finite sample behaviour of the proposed test. The performance of the proposed JEL test is compared with the test given in Dewan et al. (2004). Finally we illustrate our test procedure using various real data sets.

【25】 Noise-Augmented Privacy-Preserving Empirical Risk Minimization with Dual-purpose Regularizer and Privacy Budget Retrieval and Recycling Link: https://arxiv.org/abs/2110.08676

Authors: Yinan Li, Fang Liu Affiliations: Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, Indiana, USA Abstract: We propose Noise-Augmented Privacy-Preserving Empirical Risk Minimization (NAPP-ERM) that solves ERM with differential privacy guarantees. Existing privacy-preserving ERM approaches may be subject to over-regularization with the employment of an l2 term to achieve strong convexity on top of the target regularization. NAPP-ERM improves over the current approaches and mitigates over-regularization by iteratively realizing target regularization through appropriately designed augmented data and delivering strong convexity via a single adaptively weighted dual-purpose l2 regularizer. When the target regularization is for variable selection, we propose a new regularizer that achieves both privacy and sparsity guarantees simultaneously. Finally, we propose a strategy to retrieve privacy budget when the strong convexity requirement is met, which can be returned to users such that the DP of ERM is guaranteed at a lower privacy cost than originally planned, or be recycled to the ERM optimization procedure to reduce the injected DP noise and improve the utility of DP-ERM. From an implementation perspective, NAPP-ERM can be achieved by optimizing a non-perturbed objective function given noise-augmented data and can thus leverage existing tools for non-private ERM optimization. We illustrate through extensive experiments the mitigation effect of the over-regularization and private budget retrieval by NAPP-ERM on variable selection and prediction.

【26】 Quantile Regression by Dyadic CART Link: https://arxiv.org/abs/2110.08665

Authors: Oscar Hernan Madrid Padilla, Sabyasachi Chatterjee Affiliations: Department of Statistics, University of California, Los Angeles; Department of Statistics, University of Illinois at Urbana-Champaign Abstract: In this paper we propose and study a version of the Dyadic Classification and Regression Trees (DCART) estimator from Donoho (1997) for (fixed design) quantile regression in general dimensions. We refer to this proposed estimator as the QDCART estimator. Just like the mean regression version, we show that a) a fast dynamic programming based algorithm with computational complexity $O(N \log N)$ exists for computing the QDCART estimator and b) an oracle risk bound (trading off squared error and a complexity parameter of the true signal) holds for the QDCART estimator. This oracle risk bound then allows us to demonstrate that the QDCART estimator enjoys adaptively rate optimal estimation guarantees for piecewise constant and bounded variation function classes. In contrast to existing results for the DCART estimator which require subgaussianity of the error distribution, for our estimation guarantees to hold we do not need any restrictive tail decay assumptions on the error distribution. For instance, our results hold even when the error distribution has no first moment, such as the Cauchy distribution. Apart from the Dyadic CART method, we also consider other variant methods such as the Optimal Regression Tree (ORT) estimator introduced in Chatterjee and Goswami (2019). In particular, we also extend the ORT estimator to the quantile setting and establish that it enjoys analogous guarantees. Thus, this paper extends the scope of these globally optimal regression tree based methodologies to be applicable for heavy tailed data. We then perform extensive numerical experiments on both simulated and real data which illustrate the usefulness of the proposed methods.
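
The dynamic program over dyadic splits is easy to sketch in one dimension (our illustration of the idea; the paper treats general dimensions and proves the risk bounds). The best constant fit under pinball loss on a segment is the empirical tau-quantile, and a split is kept only when it lowers the penalized loss; lam is a hypothetical per-piece penalty:

```python
import numpy as np

def pinball(u, tau):
    """Quantile (pinball) loss: u * (tau - 1{u < 0}), summed."""
    return np.sum(u * (tau - (u < 0)))

def dyadic_quantile_fit(y, tau=0.5, lam=2.0):
    """Recursively halve the index range; keep a split only if the children's
    total penalized pinball loss beats fitting one constant (the tau-quantile)."""
    q = np.quantile(y, tau)
    leaf_loss = pinball(y - q, tau) + lam          # lam penalizes each piece
    if y.size < 2:
        return leaf_loss, np.full(y.size, q)
    mid = y.size // 2
    ll, fl = dyadic_quantile_fit(y[:mid], tau, lam)
    lr, fr = dyadic_quantile_fit(y[mid:], tau, lam)
    if ll + lr < leaf_loss:
        return ll + lr, np.concatenate([fl, fr])
    return leaf_loss, np.full(y.size, q)

y = np.r_[np.zeros(64), np.full(64, 3.0)] + np.random.default_rng(7).standard_normal(128)
loss, fit = dyadic_quantile_fit(y, tau=0.5)
```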

【27】 Minding non-collapsibility of odds ratios when recalibrating risk prediction models Link: https://arxiv.org/abs/2110.08648

Authors: Mohsen Sadatsafavi, Hamid Tavakoli, Abdollah Safari Affiliations: Respiratory Evaluation Sciences Program, Collaboration for Outcomes Research and Evaluation, The University of British Columbia; Department of Statistics, Tehran University Note: 10 pages, 1 figure, 1 appendix Abstract: In clinical prediction modeling, model updating refers to the practice of modifying a prediction model before it is used in a new setting. In the context of logistic regression for a binary outcome, one of the simplest updating methods is a fixed odds-ratio transformation of predicted risks to improve calibration-in-the-large. Previous authors have proposed equations for calculating this odds-ratio based on the discrepancy between the prevalence in the original and the new population, or between the average of predicted and observed risks. We show that this method fails to consider the non-collapsibility of odds-ratio. Consequently, it under-corrects predicted risks, especially when predicted risks are more dispersed (i.e., for models with good discrimination). We suggest an approximate equation for recovering the conditional odds-ratio from the mean and variance of predicted risks. Brief simulations and a case study show that this approach reduces such under-correction. R code for implementation is provided.
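
The fixed odds-ratio update itself is a one-line logit shift. The sketch below shows the naive practice being critiqued (our illustration; the paper's proposed mean/variance-based correction is not reproduced here):

```python
import numpy as np
from scipy.special import expit, logit

def recalibrate(p, log_or):
    """Fixed odds-ratio update: shift every predicted risk by a common
    log odds-ratio on the logit scale."""
    return expit(logit(p) + log_or)

# Naive (marginal) log-OR chosen to reconcile average predicted risk with the
# observed rate in the new setting; because odds-ratios are non-collapsible,
# this marginal value understates the conditional log-OR when the predicted
# risks are dispersed, which is the under-correction the paper addresses.
p = np.random.default_rng(8).beta(2, 5, size=1000)   # dispersed predicted risks
obs_rate = 0.45
naive_log_or = logit(obs_rate) - logit(p.mean())
p_updated = recalibrate(p, naive_log_or)
```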

【28】 A Reduced-Bias Weighted least square estimation of the Extreme Value Index Link: https://arxiv.org/abs/2110.08570

Authors: E. Ocran, R. Minkah, K. Doku-Amponsah Affiliations: University of Ghana Note: 24 pages Abstract: In this paper, we propose a reduced-bias estimator of the EVI for Pareto-type tails (heavy-tailed distributions). This is derived using the weighted least squares method. It is shown that the estimator is unbiased, consistent and asymptotically normal under the second-order conditions on the underlying distribution of the data. The finite sample properties of the proposed estimator are studied through a simulation study. The results show that it is competitive to the existing estimators of the extreme value index in terms of bias and Mean Square Error. In addition, it yields estimates of $\gamma>0$ that are less sensitive to the number of top-order statistics, and hence, can be used for selecting an optimal tail fraction. The proposed estimator is further illustrated using practical datasets from pedochemistry and insurance.
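
The general shape of such estimators can be sketched via the Pareto quantile plot (this shows the generic weighted-least-squares idea only; the authors' choice of weights and bias correction are what make their estimator reduced-bias):

```python
import numpy as np

def wls_evi(x, k, w=None):
    """Weighted least-squares slope of the Pareto quantile plot built from the
    top k order statistics; the slope estimates the extreme value index gamma.
    Uniform weights are the placeholder default here."""
    xs = np.sort(x)[::-1]
    i = np.arange(1, k + 1)
    u = np.log((k + 1) / i)               # horizontal coordinates of the plot
    v = np.log(xs[:k])                    # log of the k largest observations
    w = np.ones(k) if w is None else w
    A = np.column_stack([np.ones(k), u])
    beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * v))
    return beta[1]

x = np.random.default_rng(9).pareto(2.0, size=5000) + 1.0   # true gamma = 0.5
print(wls_evi(x, k=200))
```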

【29】 Spectral measures of empirical autocovariance matrices of high dimensional Gaussian stationary processes Link: https://arxiv.org/abs/2110.08523

Authors: Arup Bose, Walid Hachem Abstract: Consider the empirical autocovariance matrix at a given non-zero time lag based on observations from a multivariate complex Gaussian stationary time series. The spectral analysis of these autocovariance matrices can be useful in certain statistical problems, such as those related to testing for white noise. We study the behavior of their spectral measures in the asymptotic regime where the time series dimension and the observation window length both grow to infinity, and at the same rate. Following a general framework in the field of the spectral analysis of large random non-Hermitian matrices, at first the probabilistic behavior of the small singular values of the shifted versions of the autocovariance matrix is obtained. This is then used to infer about the large sample behaviour of the empirical spectral measure of the autocovariance matrices at any lag. Matrix orthogonal polynomials on the unit circle play a crucial role in our study.

【30】 Mode and Ridge Estimation in Euclidean and Directional Product Spaces: A Mean Shift Approach Link: https://arxiv.org/abs/2110.08505

Authors: Yikun Zhang, Yen-Chi Chen Affiliations: Department of Statistics, University of Washington, Seattle, WA, USA Note: 51 pages, 10 figures Abstract: The set of local modes and the ridge lines estimated from a dataset are important summary characteristics of the data-generating distribution. In this work, we consider estimating the local modes and ridges from point cloud data in a product space with two or more Euclidean/directional metric spaces. Specifically, we generalize the well-known (subspace constrained) mean shift algorithm to the product space setting and illuminate some pitfalls in such generalization. We derive the algorithmic convergence of the proposed method, provide practical guidelines on the implementation, and demonstrate its effectiveness on both simulated and real datasets.
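
For the Euclidean component, the underlying iteration is classical Gaussian-kernel mean shift; a minimal sketch (the paper's contribution is the generalization to products with directional spaces, which this omits):

```python
import numpy as np

def mean_shift(x, data, h=0.3, n_iter=100):
    """Gaussian-kernel mean shift: repeatedly move x to the kernel-weighted
    average of the data, converging to a local mode of the KDE."""
    for _ in range(n_iter):
        w = np.exp(-0.5 * np.sum((data - x) ** 2, axis=1) / h**2)
        x = w @ data / w.sum()
    return x

rng = np.random.default_rng(10)
data = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(2, 0.2, (100, 2))])
print(mean_shift(np.array([1.6, 1.8]), data))   # converges near the (2, 2) mode
```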

【31】 On Model Selection Consistency of Lasso for High-Dimensional Ising Models on Tree-like Graphs Link: https://arxiv.org/abs/2110.08500

Authors: Xiangming Meng, Tomoyuki Obuchi, Yoshiyuki Kabashima Affiliations: Institute for Physics of Intelligence and Department of Physics, Graduate School of Science, The University of Tokyo, Tokyo, Japan; Department of Systems Science, Graduate School of Informatics, Kyoto University, Kyoto, Japan Note: 30 pages, 4 figures Abstract: We consider the problem of high-dimensional Ising model selection using neighborhood-based least absolute shrinkage and selection operator (Lasso). It is rigorously proved that under some mild coherence conditions on the population covariance matrix of the Ising model, consistent model selection can be achieved with sample sizes $n=\Omega(d^3\log{p})$ for any tree-like graph in the paramagnetic phase, where $p$ is the number of variables and $d$ is the maximum node degree. When the same conditions are imposed directly on the sample covariance matrices, it is shown that a reduced sample size $n=\Omega(d^2\log{p})$ suffices. The obtained sufficient conditions for consistent model selection with Lasso are the same in the scaling of the sample complexity as that of $\ell_1$-regularized logistic regression. Given the popularity and efficiency of Lasso, our rigorous analysis provides a theoretical backing for its practical use in Ising model selection.
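
Neighborhood-based Lasso selection itself is straightforward to sketch: for each node, fit $\ell_1$-regularized logistic regression of its spin on all other spins and read the neighborhood off the nonzero coefficients (illustrative; the paper's contribution is the sample-complexity analysis, and C is a hypothetical regularization setting):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhoods(S, C=0.25):
    """S is an (n, p) integer matrix of +/-1 spins. For each node j, fit an
    L1-penalized logistic regression of spin j on the remaining spins; the
    estimated neighborhood is the support of the coefficient vector."""
    n, p = S.shape
    nbrs = []
    for j in range(p):
        others = np.delete(np.arange(p), j)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(S[:, others], (S[:, j] + 1) // 2)   # map spins {-1, +1} to {0, 1}
        nbrs.append(others[np.abs(clf.coef_[0]) > 1e-8])
    return nbrs
```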

【32】 Adversarial Attacks on Gaussian Process Bandits Link: https://arxiv.org/abs/2110.08449

Authors: Eric Han, Jonathan Scarlett Affiliations: School of Computing, National University of Singapore; Department of Mathematics & Institute of Data Science, National University of Singapore Abstract: Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. In this paper, we study this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from both a theoretical and practical perspective. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we compare our attacks' performance and efficiency on several real and synthetic functions.
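
The flavor of a targeted perturbation is easy to illustrate with a toy subtraction-style attack (our simplification, not the paper's exact construction): subtract a budgeted penalty outside the target region so the perturbed optimum lands inside it.

```python
import numpy as np

def attacked(f_vals, x, lo, hi, c):
    """Return adversarially perturbed evaluations: values outside the target
    region [lo, hi] are pushed down by c, moving the optimum into the region."""
    outside = (x < lo) | (x > hi)
    return f_vals - c * outside

x = np.linspace(0, 1, 200)
f = np.sin(6 * x)                       # true optimum near x ~ 0.26
f_adv = attacked(f, x, lo=0.7, hi=0.9, c=2.5)
print(x[np.argmax(f)], x[np.argmax(f_adv)])   # optimum moved into [0.7, 0.9]
```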

【33】 Exact Bias Correction for Linear Adjustment of Randomized Controlled Trials Link: https://arxiv.org/abs/2110.08425

Authors: Haoge Chang, Joel Middleton, Peter Aronow Abstract: In an influential critique of empirical practice, Freedman \cite{freedman2008A, freedman2008B} showed that the linear regression estimator was biased for the analysis of randomized controlled trials under the randomization model. Under Freedman's assumptions, we derive exact closed-form bias corrections for the linear regression estimator with and without treatment-by-covariate interactions. We show that the limiting distribution of the bias-corrected estimator is identical to the uncorrected estimator, implying that the asymptotic gains from adjustment can be attained without introducing any risk of bias. Taken together with results from Lin \cite{lin2013agnostic}, our results show that Freedman's theoretical arguments against the use of regression adjustment can be completely resolved with minor modifications to practice.
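
The interacted adjustment in question is the Lin (2013) regression; a minimal sketch of the point estimate (the paper's exact bias correction would then be applied on top):

```python
import numpy as np

def lin_adjusted_ate(y, t, X):
    """OLS of y on treatment t, centered covariates, and their interactions;
    with centering, the coefficient on t is the adjusted ATE estimate."""
    Xc = X - X.mean(axis=0)
    D = np.column_stack([np.ones_like(y), t, Xc, t[:, None] * Xc])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(11)
X = rng.standard_normal((200, 2))
t = rng.integers(0, 2, 200)
y = 1.0 * t + X @ np.array([0.5, -0.3]) + rng.standard_normal(200)
print(lin_adjusted_ate(y, t, X))        # close to the true effect of 1.0
```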

【34】 Nuances in Margin Conditions Determine Gains in Active Learning Link: https://arxiv.org/abs/2110.08418

Authors: Samory Kpotufe, Gan Yuan, Yunfan Zhao Affiliations: Columbia University Abstract: We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in $E[Y|X]$ determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction between the two settings. Namely, we show that some seemingly benign nuances in notions of margin -- involving the uniqueness of the Bayes classifier, and which have no apparent effect on rates in passive learning -- determine whether or not any active learner can outperform passive learning rates. In particular, for Audibert-Tsybakov's margin condition (allowing general situations with non-unique Bayes classifiers), no active learner can gain over passive learning in commonly studied settings where the marginal on $X$ is near uniform. Our results thus negate the usual intuition from past literature that active rates should improve over passive rates in nonparametric settings.

【35】 Multi-group Gaussian Processes Link: https://arxiv.org/abs/2110.08411

Authors: Didong Li, Andrew Jones, Sudipto Banerjee, Barbara E. Engelhardt Affiliations: Department of Computer Science, Princeton University; Department of Biostatistics, University of California, Los Angeles; Gladstone Institutes Abstract: Gaussian processes (GPs) are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Modern scientific data sets are typically heterogeneous and often contain multiple known discrete subgroups of samples. For example, in genomics applications samples may be grouped according to tissue type or drug exposure. In the modeling process it is desirable to leverage the similarity among groups while accounting for differences between them. While a substantial literature exists for GPs over Euclidean domains $\mathbb{R}^p$, GPs on domains suitable for multi-group data remain less explored. Here, we develop a multi-group Gaussian process (MGGP), which we define on $\mathbb{R}^p \times \mathscr{C}$, where $\mathscr{C}$ is a finite set representing the group label. We provide general methods to construct valid (positive definite) covariance functions on this domain, and we describe algorithms for inference, estimation, and prediction. We perform simulation experiments and apply MGGP to gene expression data to illustrate the behavior and advantages of the MGGP in the joint modeling of continuous and categorical variables.
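
One simple valid construction on $\mathbb{R}^p \times \mathscr{C}$ is a separable covariance: a kernel on the inputs times a positive semidefinite matrix over group labels. A sketch of this recipe (one member of the broader families the paper develops):

```python
import numpy as np

def mggp_kernel(X1, g1, X2, g2, ls, L):
    """Separable multi-group covariance k((x,c), (x',c')) = RBF(x, x') * B[c, c'],
    where B = L @ L.T is PSD over the finite group set, so k is positive definite."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    B = L @ L.T
    return np.exp(-0.5 * sq / ls**2) * B[np.ix_(g1, g2)]

rng = np.random.default_rng(12)
X = rng.standard_normal((6, 2))
g = np.array([0, 0, 1, 1, 2, 2])                  # group labels in C = {0, 1, 2}
L = rng.standard_normal((3, 2))                   # parameterizes between-group covariance
K = mggp_kernel(X, g, X, g, ls=1.0, L=L)
print(np.all(np.linalg.eigvalsh(K) > -1e-10))     # PSD check
```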

【36】 Covariate Adjustment in Regression Discontinuity Designs 标题:回归断点设计中的协变量调整 链接:https://arxiv.org/abs/2110.08410

作者:Matias D. Cattaneo,Luke Keele,Rocío Titiunik 摘要:回归断点(RD)设计是因果推断和项目评估中广泛使用的非实验方法。虽然其规范形式只需要一个分数变量和一个结果变量,但在实证工作中经常会遇到使用额外变量进行调整的RD实现。这种做法从方法论和实证两个角度导致了对RD分析中协变量调整作用的误解。在本章中,我们回顾了协变量调整在RD设计中的不同作用,并为其在应用中的正确使用提供方法学指导。 摘要:The Regression Discontinuity (RD) design is a widely used non-experimental method for causal inference and program evaluation. While its canonical formulation only requires a score and an outcome variable, it is common in empirical work to encounter RD implementations where additional variables are used for adjustment. This practice has led to misconceptions about the role of covariate adjustment in RD analysis, from both methodological and empirical perspectives. In this chapter, we review the different roles of covariate adjustment in RD designs, and offer methodological guidance for its correct use in applications.

【37】 Spatio-temporal extreme event modeling of terror insurgencies 标题:恐怖叛乱的时空极端事件建模 链接:https://arxiv.org/abs/2110.08363

作者:Lekha Patel,Lyndsay Shand,J. Derek Tucker,Gabriel Huerta 机构:Statistical Sciences, Sandia National Laboratories, Albuquerque, NM 摘要:具有潜在致命后果的极端事件(例如由恐怖组织策划的袭击)本质上极不可预测,对社会构成迫在眉睫的威胁。具体而言,量化在任意时空区域发生恐怖袭击的可能性及其相对社会风险,将有助于采取知情措施加强国家安全。本文提出了一种新的自激标记时空模型,其非齐次基线强度被写成协变量的函数。其触发强度用高斯过程先验分布简洁地建模,以灵活地捕捉任意一次袭击与先前恐怖事件之间复杂的时空依赖关系。通过推断该模型的参数,我们突出了可能发生袭击的特定时空区域。此外,通过以伤亡人数衡量袭击的后果,我们为伤亡人数引入了一种新的混合分布。该分布通过一个{\it Generalized ZipF}(广义Zipf)分布灵活地处理低伤亡与高伤亡情形以及数据的离散性。我们依靠定制的马尔可夫链蒙特卡罗(MCMC)方法来估计模型参数。我们使用来自开源全球恐怖主义数据库(GTD)的数据来说明该方法,这些数据对应于2013-2018年阿富汗境内的袭击事件。我们表明,在考虑人口密度、地区语言数量以及支持反政府一方的人口密度等相关协变量的情况下,我们的模型能够预测2019-2021年未来袭击的强度。 摘要:Extreme events with potential deadly outcomes, such as those organized by terror groups, are highly unpredictable in nature and an imminent threat to society. In particular, quantifying the likelihood of a terror attack occurring in an arbitrary space-time region and its relative societal risk, would facilitate informed measures that would strengthen national security. This paper introduces a novel self-exciting marked spatio-temporal model for attacks whose inhomogeneous baseline intensity is written as a function of covariates. Its triggering intensity is succinctly modeled with a Gaussian Process prior distribution to flexibly capture intricate spatio-temporal dependencies between an arbitrary attack and previous terror events. By inferring the parameters of this model, we highlight specific space-time areas in which attacks are likely to occur. Furthermore, by measuring the outcome of an attack in terms of the number of casualties it produces, we introduce a novel mixture distribution for the number of casualties. This distribution flexibly handles low and high number of casualties and the discrete nature of the data through a {\it Generalized ZipF} distribution. We rely on a customized Markov chain Monte Carlo (MCMC) method to estimate the model parameters. We illustrate the methodology with data from the open source Global Terrorism Database (GTD) that correspond to attacks in Afghanistan from 2013-2018. We show that our model is able to predict the intensity of future attacks for 2019-2021 while considering various covariates of interest such as population density, number of regional languages spoken, and the density of population supporting the opposing government.
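示例:为直观说明"自激"强度,下面给出经典一维 Hawkes 过程强度函数的最简示意(Python)。论文中的模型是带标记的时空版本,基线强度依赖协变量、触发强度带高斯过程先验;此处的指数触发核与参数均为示意假设。

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.2):
    """lambda(t) = mu + alpha * sum_i exp(-beta * (t - t_i)),仅对 t_i < t 求和。"""
    past = np.asarray([ti for ti in history if ti < t])
    return mu + alpha * np.sum(np.exp(-beta * (t - past)))

# 用法示意:历史事件越近、越密集,当前强度越高
print(hawkes_intensity(5.0, [1.0, 4.5, 4.8]))
```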

【38】 Discovering and Achieving Goals via World Models 标题:通过世界模型发现和实现目标 链接:https://arxiv.org/abs/2110.09514

作者:Russell Mendonca,Oleh Rybkin,Kostas Daniilidis,Danijar Hafner,Deepak Pathak 机构:Carnegie Mellon University, University of Pennsylvania, University of Toronto 备注:NeurIPS 2021. First two authors contributed equally. Website at this https URL 摘要:人工智能体如何在没有任何监督的情况下,学会在复杂的视觉环境中解决许多不同的任务?我们将这个问题分解为两个子问题:发现新的目标,以及学习可靠地实现它们。我们介绍了潜在探索者-成就者(LEXA),这是一个统一的解决方案,它从图像输入中学习世界模型,并利用想象轨迹(imagined rollouts)训练探索者策略和成就者策略。与之前通过到达先前访问过的状态来探索的方法不同,探索者通过前瞻规划来发现未见过的、令人惊讶的状态,然后将这些状态作为多样化的目标供成就者练习。在无监督阶段之后,LEXA能够零样本(zero-shot)地解决以目标图像形式指定的任务,无需任何额外学习。在先前的基准和一个新的具有挑战性的基准(共40项测试任务,横跨四个标准机器人操作与移动领域)上,LEXA都大大优于以前的无监督目标达成方法。LEXA还能实现需要按顺序与多个对象交互的目标。最后,为了展示LEXA的可扩展性和通用性,我们在四个不同的环境中训练了一个通用智能体。代码和视频见 https://orybkin.github.io/lexa/ 摘要:How can artificial agents learn to solve many diverse tasks in complex visual environments in the absence of any supervision? We decompose this question into two problems: discovering new goals and learning to reliably achieve them. We introduce Latent Explorer Achiever (LEXA), a unified solution to these that learns a world model from image inputs and uses it to train an explorer and an achiever policy from imagined rollouts. Unlike prior methods that explore by reaching previously visited states, the explorer plans to discover unseen surprising states through foresight, which are then used as diverse targets for the achiever to practice. After the unsupervised phase, LEXA solves tasks specified as goal images zero-shot without any additional learning. LEXA substantially outperforms previous approaches to unsupervised goal-reaching, both on prior benchmarks and on a new challenging benchmark with a total of 40 test tasks spanning across four standard robotic manipulation and locomotion domains. LEXA further achieves goals that require interacting with multiple objects in sequence. Finally, to demonstrate the scalability and generality of LEXA, we train a single general agent across four distinct environments. Code and videos at https://orybkin.github.io/lexa/

【39】 Provable Hierarchy-Based Meta-Reinforcement Learning 标题:基于可证明层次的元强化学习 链接:https://arxiv.org/abs/2110.09507

作者:Kurtland Chua,Qi Lei,Jason D. Lee 机构:Princeton University, Princeton, NJ, USA 摘要:层次强化学习(HRL)作为一种可处理地学习复杂模块化行为的方法,受到了广泛关注。然而,现有的工作要么假设可以访问专家构建的层次结构,要么使用没有可证明保证的层次结构学习启发式方法。为了弥补这一差距,我们在元RL设置中分析了HRL:学习者在元训练期间学习潜在的层次结构,以用于下游任务。我们考虑一个在转移动态中嵌入自然层次结构的表格型设置。与有监督元学习理论类似,我们给出了"多样性条件",这些条件与一个可处理的基于乐观主义的算法一起,保证了这种自然层次结构的样本高效恢复。此外,我们为使用恢复出的层次结构求解元测试任务的学习者给出了遗憾界。我们的界涵盖了HRL文献中的常见概念,如时间抽象和状态/动作抽象,这表明我们的设置和分析捕捉了实践中HRL的重要特征。 摘要:Hierarchical reinforcement learning (HRL) has seen widespread interest as an approach to tractable learning of complex modular behaviors. However, existing work either assume access to expert-constructed hierarchies, or use hierarchy-learning heuristics with no provable guarantees. To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task. We consider a tabular setting where natural hierarchical structure is embedded in the transition dynamics. Analogous to supervised meta-learning theory, we provide "diversity conditions" which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy. Furthermore, we provide regret bounds on a learner using the recovered hierarchy to solve a meta-test task. Our bounds incorporate common notions in HRL literature such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.

【40】 Recovery Guarantees for Kernel-based Clustering under Non-parametric Mixture Models 标题:非参数混合模型下基于核的聚类的恢复保证 链接:https://arxiv.org/abs/2110.09476

作者:Leena Chennuru Vankadara,Sebastian Bordt,Ulrike von Luxburg,Debarghya Ghoshdastidar 机构:University of Tübingen, Max Planck Institute for Intelligent Systems, Tübingen, Technical University of Munich 摘要:尽管基于核的聚类无处不在,但在对数据生成过程施加强结构假设的设置之外,统计保证却出奇地少。在这项工作中,我们通过研究非参数混合模型下基于核的聚类算法的统计性能,朝着缩小这一差距迈出了一步。我们给出了充分必要的可分性条件,在这些条件下,这些算法可以一致地恢复潜在的真实聚类。我们的分析在不对分量分布的形式作结构性假设的情况下,为核聚类方法提供了保证。此外,我们在基于核的数据聚类与基于核密度的聚类之间建立了一个关键的等价关系。这使我们能够为非参数混合模型的基于核的估计量提供一致性保证。除理论意义之外,这种联系还可能具有实际意义,包括在聚类背景下系统地选择高斯核的带宽。 摘要:Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that consider strong structural assumptions on the data generation process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis provides guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data-clustering and kernel density-based clustering. This enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Along with theoretical implications, this connection could have practical implications, including in the systematic choice of the bandwidth of the Gaussian kernel in the context of clustering.

【41】 Improving Robustness using Generated Data 标题:使用生成的数据提高稳健性 链接:https://arxiv.org/abs/2110.09468

作者:Sven Gowal,Sylvestre-Alvise Rebuffi,Olivia Wiles,Florian Stimberg,Dan Andrei Calian,Timothy Mann 机构:DeepMind, London 备注:Accepted at NeurIPS 2021 摘要:最近的研究表明,稳健的训练需要比标准分类所需的数据集大得多的数据集。在CIFAR-10和CIFAR-100上,这转化为仅根据原始训练集的数据训练的模型与使用从"8000万微小图像"数据集(TI-80M)提取的附加数据训练的模型之间存在相当大的鲁棒精度差距。在本文中,我们探讨了如何利用仅在原始训练集上训练的生成模型来人为地增加原始训练集的大小,并提高对抗 $\ell_p$ 范数有界扰动的鲁棒性。我们确定了加入额外生成的数据可以提高鲁棒性的充分条件,并证明了与使用额外真实数据训练的模型相比,可以显著减小鲁棒精度差距。令人惊讶的是,我们甚至表明,即使添加非真实随机数据(由高斯采样生成)也可以提高鲁棒性。我们分别针对大小为 $\epsilon=8/255$ 和 $\epsilon=128/255$ 的 $\ell_\infty$ 和 $\ell_2$ 范数有界扰动,在CIFAR-10、CIFAR-100、SVHN和TinyImageNet上评估了我们的方法。与以前最先进的方法相比,我们在鲁棒精度方面有了很大的绝对改进。针对大小为 $\epsilon=8/255$ 的 $\ell_\infty$ 范数有界扰动,我们的模型在CIFAR-10和CIFAR-100上分别达到66.10%和33.49%的鲁棒精度(较最先进水平分别提高 +8.96% 和 +3.29%)。对于大小为 $\epsilon=128/255$ 的 $\ell_2$ 范数有界扰动,我们的模型在CIFAR-10上达到78.31%(+3.81%)。这些结果超过了以前大多数使用外部数据的工作。 摘要:Recent work argues that robust training requires substantially larger datasets than those required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a sizable robust-accuracy gap between models trained solely on data from the original training set and those trained with additional data extracted from the "80 Million Tiny Images" dataset (TI-80M). In this paper, we explore how generative models trained solely on the original training set can be leveraged to artificially increase the size of the original training set and improve adversarial robustness to $\ell_p$ norm-bounded perturbations. We identify the sufficient conditions under which incorporating additional generated data can improve robustness, and demonstrate that it is possible to significantly reduce the robust-accuracy gap to models trained with additional real data. Surprisingly, we even show that even the addition of non-realistic random data (generated by Gaussian sampling) can improve robustness. We evaluate our approach on CIFAR-10, CIFAR-100, SVHN and TinyImageNet against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements in robust accuracy compared to previous state-of-the-art methods. Against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our models achieve 66.10% and 33.49% robust accuracy on CIFAR-10 and CIFAR-100, respectively (improving upon the state-of-the-art by +8.96% and +3.29%). Against $\ell_2$ norm-bounded perturbations of size $\epsilon = 128/255$, our model achieves 78.31% on CIFAR-10 (+3.81%). These results beat most prior works that use external data.

【42】 Beltrami Flow and Neural Diffusion on Graphs 标题:图上的Beltrami流与神经扩散 链接:https://arxiv.org/abs/2110.09443

作者:Benjamin Paul Chamberlain,James Rowbottom,Davide Eynard,Francesco Di Giovanni,Xiaowen Dong,Michael M Bronstein 机构:University of Oxford, Michael M. Bronstein, Twitter Inc. and Imperial College London 备注:21 pages, 5 figures. Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) 2021 摘要:我们提出了一类新的基于离散Beltrami流(一种非欧几里德扩散偏微分方程)的图神经网络。在我们的模型中,节点特征辅以从图拓扑派生的位置编码,并由Beltrami流共同演化,同时产生连续的特征学习和拓扑演化。由此产生的模型推广了许多流行的图神经网络,并在多个基准上实现了最先进的结果。 摘要:We propose a novel class of graph neural networks based on the discretised Beltrami flow, a non-Euclidean diffusion PDE. In our model, node features are supplemented with positional encodings derived from the graph topology and jointly evolved by the Beltrami flow, producing simultaneously continuous feature learning and topology evolution. The resulting model generalises many popular graph neural networks and achieves state-of-the-art results on several benchmarks.

【43】 Understanding jumps in high frequency digital asset markets 标题:了解高频数字资产市场的跳跃 链接:https://arxiv.org/abs/2110.09429

作者:Danial Saef,Odett Nagy,Sergej Sizov,Wolfgang Karl Härdle 机构:Wolfgang Karl Härdle¶ 摘要:虽然关注度是数字资产价格的一个预测因子,比特币价格中的跳跃也广为人知,但我们对其替代币种知之甚少。研究高频加密数据为我们提供了一种独特的可能性,可以确认跨市场数字资产收益由围绕黑天鹅事件聚集的高频跳跃驱动,类似于波动率和交易量的季节性。回归显示,日内跳跃显著影响日终收益的大小和方向。这为加密期权定价模型提供了基础研究。然而,我们需要更好的计量经济学方法来捕捉加密资产特有的市场微观结构。所有计算均可通过 quantlet.com 技术复现。 摘要:While attention is a predictor for digital asset prices, and jumps in Bitcoin prices are well-known, we know little about its alternatives. Studying high frequency crypto data gives us the unique possibility to confirm that cross market digital asset returns are driven by high frequency jumps clustered around black swan events, resembling volatility and trading volume seasonalities. Regressions show that intra-day jumps significantly influence end of day returns in size and direction. This provides fundamental research for crypto option pricing models. However, we need better econometric methods for capturing the specific market microstructure of cryptos. All calculations are reproducible via the quantlet.com technology.

【44】 Towards Federated Bayesian Network Structure Learning with Continuous Optimization 标题:基于连续优化的联邦贝叶斯网络结构学习 链接:https://arxiv.org/abs/2110.09356

作者:Ignavier Ng,Kun Zhang 机构:Carnegie Mellon University 备注:16 pages; 5 figures 摘要:传统上,贝叶斯网络结构学习通常在一个收集了所有数据的中心站点上进行。然而在实践中,数据可能分布在不同的参与方(例如公司、设备)之间,这些参与方希望共同学习一个贝叶斯网络,但出于隐私或安全考虑,不愿意披露与其数据相关的信息。在这项工作中,我们提出了一种跨孤岛(cross-silo)联邦学习方法,用于从横向划分在不同参与方之间的数据中估计贝叶斯网络的结构。我们开发了一种基于连续优化的分布式结构学习方法,使用交替方向乘子法(ADMM),使得在优化过程中只需交换模型参数。我们通过将该方法应用于线性和非线性两种情形,证明了其灵活性。在合成数据集和真实数据集上的实验结果表明,与其他方法相比,该方法取得了更好的性能,尤其是当客户端数量相对较多且每个客户端的样本量有限时。 摘要:Traditionally, Bayesian network structure learning is often carried out at a central site, in which all data is gathered. However, in practice, data may be distributed across different parties (e.g., companies, devices) who intend to collectively learn a Bayesian network, but are not willing to disclose information related to their data owing to privacy or security concerns. In this work, we present a cross-silo federated learning approach to estimate the structure of Bayesian network from data that is horizontally partitioned across different parties. We develop a distributed structure learning method based on continuous optimization, using the alternating direction method of multipliers (ADMM), such that only the model parameters have to be exchanged during the optimization process. We demonstrate the flexibility of our approach by adopting it for both linear and nonlinear cases. Experimental results on synthetic and real datasets show that it achieves an improved performance over the other methods, especially when there is a relatively large number of clients and each has a limited sample size.
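示例:下面用分布式最小二乘给出共识 ADMM 的最简示意(Python),仅用于说明"优化过程中只交换模型参数"这一机制;它并非论文中贝叶斯网络结构学习的目标函数,数据划分与超参数均为示意假设。

```python
import numpy as np

def consensus_admm(X_parts, y_parts, rho=1.0, n_iter=100):
    """各方本地求解岭回归型子问题,只向协调方交换参数与对偶变量。"""
    d, K = X_parts[0].shape[1], len(X_parts)
    W = np.zeros((K, d))   # 各方本地参数
    U = np.zeros((K, d))   # 尺度化对偶变量
    z = np.zeros(d)        # 全局共识参数
    for _ in range(n_iter):
        for k in range(K):
            # 本地更新:min 0.5*||X_k w - y_k||^2 + rho/2*||w - z + u_k||^2
            A = X_parts[k].T @ X_parts[k] + rho * np.eye(d)
            b = X_parts[k].T @ y_parts[k] + rho * (z - U[k])
            W[k] = np.linalg.solve(A, b)
        z = (W + U).mean(axis=0)   # 全局共识更新
        U += W - z                 # 对偶变量更新
    return z
```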

【45】 A portfolio approach to massively parallel Bayesian optimization 标题:大规模并行贝叶斯优化的投资组合方法 链接:https://arxiv.org/abs/2110.09334

作者:Mickael Binois,Nicholson Collier,Jonathan Ozik 摘要:减少优化研究耗时的一种方法是并行评估设计,而不是一次评估一个。针对评估代价高昂的黑盒,已经提出了批量版本的贝叶斯优化。它们通过构建黑盒的代理模型来工作,该模型可用于借助填充准则高效地选择待评估的设计。然而,随着更高的并行度变得可用,适用于几十个并行评估的策略变得受限,特别是由于选择更多评估点的复杂性。当黑盒有噪声时,这一点更为关键,因为这需要更多的评估以及重复实验。在这里,我们提出了一种可扩展的策略,能够原生支持大规模批量评估,其重点在于探索/利用权衡和一种投资组合式分配。对于单目标和多目标优化任务,我们在确定性函数和噪声函数上将该方法与相关方法进行了比较。这些实验显示出与现有方法相似或更好的性能,同时速度快了几个数量级。 摘要:One way to reduce the time of conducting optimization studies is to evaluate designs in parallel rather than just one-at-a-time. For expensive-to-evaluate black-boxes, batch versions of Bayesian optimization have been proposed. They work by building a surrogate model of the black-box that can be used to select the designs to evaluate efficiently via an infill criterion. Still, with higher levels of parallelization becoming available, the strategies that work for a few tens of parallel evaluations become limiting, in particular due to the complexity of selecting more evaluations. It is even more crucial when the black-box is noisy, necessitating more evaluations as well as repeating experiments. Here we propose a scalable strategy that can keep up with massive batching natively, focused on the exploration/exploitation trade-off and a portfolio allocation. We compare the approach with related methods on deterministic and noisy functions, for mono and multiobjective optimization tasks. These experiments show similar or better performance than existing methods, while being orders of magnitude faster.

【46】 Self-Supervised Representation Learning: Introduction, Advances and Challenges 标题:自我监督表征学习:介绍、进展与挑战 链接:https://arxiv.org/abs/2110.09327

作者:Linus Ericsson,Henry Gouk,Chen Change Loy,Timothy M. Hospedales 摘要:自监督表示学习方法旨在在不需要大型标注数据集的情况下提供强大的深度特征学习,从而缓解标注瓶颈(这是当前深度学习实际部署的主要障碍之一)。近年来,这些方法发展迅速,其效果已接近、有时甚至超过各种数据模态(包括图像、视频、声音、文本和图)中的全监督预训练方案。本文介绍了这一充满活力的领域,包括关键概念、四大方法家族及相应的最新技术,以及自监督方法如何应用于不同的数据模态。我们进一步讨论实际考虑因素,包括工作流、表示可迁移性和计算成本。最后,我们梳理了该领域中为未来工作提供肥沃土壤的主要开放挑战。 摘要:Self-supervised representation learning methods aim to provide powerful deep feature learning without the requirement of large annotated datasets, thus alleviating the annotation bottleneck that is one of the main barriers to practical deployment of deep learning today. These methods have advanced rapidly in recent years, with their efficacy approaching and sometimes surpassing fully supervised pre-training alternatives across a variety of data modalities including image, video, sound, text and graphs. This article introduces this vibrant area including key concepts, the four main families of approach and associated state of the art, and how self-supervised methods are applied to diverse modalities of data. We further discuss practical considerations including workflows, representation transferability, and compute cost. Finally, we survey the major open challenges in the field that provide fertile ground for future work.

【47】 Multi-Objective Allocation of COVID-19 Testing Centers: Improving Coverage and Equity in Access 标题:冠状病毒检测中心的多目标配置:提高接入的覆盖率和公平性 链接:https://arxiv.org/abs/2110.09272

作者:Zhen Zhong,Ribhu Sengupta,Kamran Paynabar,Lance A. Waller 摘要:在撰写本文时,新冠病毒-19已经传播给4200多万人,并在全美造成673000多人死亡。在整个大流行期间,公共卫生当局一直在监测诊断测试结果,以确定传播热点。这些信息有助于减少或阻断新冠病毒-19的传播途径,并帮助感染患者接受早期治疗。然而,目前大多数试验场地分配方案都是基于经验或便利性,往往导致效率低下和非最优分配。此外,城市内人口的历史社会人口模式可能导致不同种族和收入群体之间在获得测试方面存在可衡量的不平等。为了解决这些紧迫问题,我们提出了一个新的试验场地分配方案,以(a)最大限度地扩大人口覆盖率,(b)最大限度地减少与疫情轨迹预测相关的预测不确定性,以及(c)减少获取方面的不平等。我们通过案例研究来说明我们的方法,将我们的分配方案与佐治亚州记录的测试点分配进行比较,揭示了与当前实践相比,人口覆盖率的增加和访问公平性的改善。 摘要:At the time of this article, COVID-19 has been transmitted to more than 42 million people and resulted in more than 673,000 deaths across the United States. Throughout this pandemic, public health authorities have monitored the results of diagnostic testing to identify hotspots of transmission. Such information can help reduce or block transmission paths of COVID-19 and help infected patients receive early treatment. However, most current schemes of test site allocation have been based on experience or convenience, often resulting in low efficiency and non-optimal allocation. In addition, the historical sociodemographic patterns of populations within cities can result in measurable inequities in access to testing between various racial and income groups. To address these pressing issues, we propose a novel test site allocation scheme to (a) maximize population coverage, (b) minimize prediction uncertainties associated with projections of outbreak trajectories, and (c) reduce inequities in access. We illustrate our approach with case studies comparing our allocation scheme with recorded allocation of testing sites in Georgia, revealing increases in both population coverage and improvements in equity of access over current practice.

【48】 A Sociotechnical View of Algorithmic Fairness 标题:算法公平性的社会技术观 链接:https://arxiv.org/abs/2110.09253

作者:Mateusz Dolata,Stefan Feuerriegel,Gerhard Schwabe 机构:Please, refer to the journal publication:, Dolata, M., Feuerriegel, S., & Schwabe, G. (to appear) A Sociotechnical View of Algorithmic, Fairness. Information Systems Journal. 备注:Accepted at Information Systems Journal 摘要:算法公平性被认为是一种新兴技术,可以减轻自动化决策中的系统性歧视,为提高信息系统(IS)的公平性提供机会。然而,基于最新的文献综述,我们认为公平是一个固有的社会概念,因此,算法公平技术应该从社会技术的角度来看待。我们推进了将算法公平性视为一种社会技术现象的论述。我们的研究目标是将算法公平性(AF)嵌入信息系统(IS)的社会技术观点中。具体而言,我们阐述了为什么使用算法手段确保公平的系统的结果取决于技术结构与社会结构之间的相互影响。这种观点可以产生新的见解,整合技术领域和社会研究的知识。此外,它还为IS辩论开辟了新的方向。我们的贡献如下:首先,基于对310篇文章的系统分析,我们对当前关于算法公平性的论述中的基本假设提出了质疑。其次,我们通过将算法公平性理论化为一种社会技术建构来回应这些假设。第三,我们为IS研究人员提出了方向,通过追求对社会技术算法公平性的独特理解来增强其影响力。我们呼吁并采取一种全面的方法来处理AF。从社会技术角度看待算法公平性可以为系统性偏见和歧视提供全面的解决方案。 摘要:Algorithmic fairness has been framed as a newly emerging technology that mitigates systemic discrimination in automated decision-making, providing opportunities to improve fairness in information systems (IS). However, based on a state-of-the-art literature review, we argue that fairness is an inherently social concept and that technologies for algorithmic fairness should therefore be approached through a sociotechnical lens. We advance the discourse on algorithmic fairness as a sociotechnical phenomenon. Our research objective is to embed AF in the sociotechnical view of IS. Specifically, we elaborate on why outcomes of a system that uses algorithmic means to assure fairness depends on mutual influences between technical and social structures. This perspective can generate new insights that integrate knowledge from both technical fields and social studies. Further, it spurs new directions for IS debates. We contribute as follows: First, we problematize fundamental assumptions in the current discourse on algorithmic fairness based on a systematic analysis of 310 articles. Second, we respond to these assumptions by theorizing algorithmic fairness as a sociotechnical construct. Third, we propose directions for IS researchers to enhance their impacts by pursuing a unique understanding of sociotechnical algorithmic fairness. We call for and undertake a holistic approach to AF. A sociotechnical perspective on algorithmic fairness can yield holistic solutions to systemic biases and discrimination.

【49】 Impact of COVID-19 Policies and Misinformation on Social Unrest 标题:冠状病毒政策和错误信息对社会动荡的影响 链接:https://arxiv.org/abs/2110.09234

作者:Martha Barnard,Radhika Iyer,Sara Y. Del Valle,Ashlynn R. Daughton 机构: A-, Information Systems and Modeling, Los Alamos National Lab, Los Alamos, NM, USA, Department of Political Science and Department of Computing, Data Science, and Society, University, of California, Berkeley, Berkeley, CA, USA 备注:21 pages, 9 figures 摘要:新型冠状病毒病(COVID-19)大流行影响了地球的每一个角落,扰乱了政府,导致了社会经济的不稳定。这场危机引发了关于社会不同部门在变革和压力时期如何互动和相互影响的问题。鉴于这一流行病前所未有的经济和社会影响,许多新的数据来源已经可用,使我们能够定量地探讨这些关联。了解这些关系可以帮助我们更好地为未来的灾难做好准备,并减轻影响。在这里,我们关注西欧八个国家和美国四个地区的社会动荡(抗议)、健康结果、公共卫生命令和错误信息之间的相互作用。我们对两类目标构建了提前1-3周的预测:一个用于识别高抗议活动时期的二元抗议指标,以及随时间变化的总体抗议计数。我们发现,除比利时外,对所有地区而言,我们各数据流中至少有一个特征对抗议具有预测力。然而,抗议预测的准确性因国家而异:在所分析的大约一半国家中,我们的预测优于朴素(naïve)模型。这些喜忧参半的结果既展示了多样化数据流预测像抗议这样多变的主题的潜力,也体现了预测像大流行这样快速演变的局势的困难。 摘要:The novel coronavirus disease (COVID-19) pandemic has impacted every corner of earth, disrupting governments and leading to socioeconomic instability. This crisis has prompted questions surrounding how different sectors of society interact and influence each other during times of change and stress. Given the unprecedented economic and societal impacts of this pandemic, many new data sources have become available, allowing us to quantitatively explore these associations. Understanding these relationships can help us better prepare for future disasters and mitigate the impacts. Here, we focus on the interplay between social unrest (protests), health outcomes, public health orders, and misinformation in eight countries of Western Europe and four regions of the United States. We created 1-3 week forecasts of both a binary protest metric for identifying times of high protest activity and the overall protest counts over time. We found that for all regions, except Belgium, at least one feature from our various data streams was predictive of protests. However, the accuracy of the protest forecasts varied by country, that is, for roughly half of the countries analyzed, our forecasts outperform a naïve model. These mixed results demonstrate the potential of diverse data streams to predict a topic as volatile as protests as well as the difficulties of predicting a situation that is as rapidly evolving as a pandemic.

【50】 Learning Optimal Conformal Classifiers 标题:学习最优共形分类器 链接:https://arxiv.org/abs/2110.09192

作者:David Stutz,Krishnamurthy Dvijotham,Ali Taylan Cemgil,Arnaud Doucet 机构: DeepMind, Max Planck Institute for Informatics, Saarland Informatics Campus 摘要:现代基于深度学习的分类器在测试数据上表现出非常高的准确率,但这并不能为安全部署提供足够的保证,特别是在医疗诊断等高风险AI应用中。预测通常是在没有可靠的不确定性估计或形式化保证的情况下获得的。保形预测(CP)通过使用分类器的概率估计,来预测以用户指定的概率包含真实类别的置信集,从而解决这些问题。然而,在训练后将CP用作单独的处理步骤,会使底层模型无法适应置信集的预测。因此,本文探索在训练期间对CP进行求导(可微化)的策略,目标是端到端地训练带保形包装(conformal wrapper)的模型。在我们的方法——保形训练(ConfTr)中,我们在训练期间对小批量专门"模拟"保形化过程。我们表明,ConfTr通过减小平均置信集大小(即低效度)而优于最先进的CP分类方法。此外,它允许"塑造"测试时预测的置信集,而这对标准CP来说是困难的。在多个数据集的实验中,我们表明,ConfTr可以在保留CP所提供保证的同时,影响低效度在类别间的分布,或根据所包含的类别引导置信集的组成。 摘要:Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's probability estimates to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. We show that CT outperforms state-of-the-art CP methods for classification by reducing the average confidence set size (inefficiency). Moreover, it allows to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.
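示例:作为背景,下面给出标准分裂保形预测构造置信集的最简示意(Python)。注意这只是训练后单独使用 CP 的经典做法(即论文希望通过 ConfTr 在训练中加以改进的基线),分数函数与 alpha 取值均为示意假设。

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """cal_probs: (n, C) 校准集 softmax 概率;返回测试集上的布尔置信集矩阵。"""
    n = len(cal_labels)
    # 非一致性分数:1 减去真实类别的预测概率
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # 有限样本校正的分位数水平,保证约 1-alpha 的覆盖率
    level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(scores, min(level, 1.0), method="higher")
    # 每一行:预测概率足够高的类别进入该样本的置信集
    return test_probs >= 1.0 - q
```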

【51】 Measuring the influence of beliefs in belief networks 标题:测量信念网络中信念的影响 链接:https://arxiv.org/abs/2110.09154

作者:Aleksandar Tomašević 机构:Department of Sociology, University of Novi Sad, Novi Sad 备注:19 pages, 4 figures. Earlier version of this work was presented at Networks 2021 conference 摘要:有影响力的信念对于我们理解人们如何思考政治问题和做出政治决策至关重要。本研究基于心理测量网络方法和网络影响研究的进展,提出了一种在更大的信念系统网络背景下测量政治信念影响的新方法。利用最新一轮的欧洲社会调查数据,我们在一个信念网络上展示了这一方法,该网络表达了29个欧洲国家对政权的支持,并捕获了与支持政权表现、原则、制度和政治行为者相关的信念。我们的结果表明,信念的平均影响可能与信念网络的一致性和连通性有关,并且特定信念(如民主满意度)在国家层面的影响与来自同一领域的外部指标(如自由民主指数)显著负相关,这表明极具影响力的信念与紧迫的政治问题有关。这些发现表明,根据大规模调查数据估计的基于网络的信念影响指标可以作为比较政治研究中的一种新型指标,为将心理测量网络分析方法整合到政治学方法论中开辟了新的途径。 摘要:Influential beliefs are crucial for our understanding of how people reason about political issues and make political decisions. This research proposes a new method for measuring the influence of political beliefs within larger context of belief system networks, based on the advances in psychometric network methods and network influence research. Using the latest round of the European Social Survey data, we demonstrate this approach on a belief network expressing support for the regime in 29 European countries and capturing beliefs related to support for regime performance, principles, institutions, and political actors. Our results show that the average influence of beliefs can be related to the consistency and connectivity of the belief network and that the influence of specific beliefs (e.g. Satisfaction with Democracy) on a country level has a significant negative correlation with external indicators from the same domain (e.g. Liberal Democracy index), which suggests that highly influential beliefs are related to pressing political issues. These findings suggest that network-based belief influence metrics estimated from large-scale survey data can be used as a new type of indicator in comparative political research, which opens new avenues for integrating psychometric network analysis methods into political science methodology.

【52】 Learning Prototype-oriented Set Representations for Meta-Learning 标题:学习面向原型的集合表示以用于元学习 链接:https://arxiv.org/abs/2110.09140

作者:Dandan Guo,Long Tian,Minghe Zhang,Mingyuan Zhou,Hongyuan Zha 机构:The Chinese University of Hong Kong, Shenzhen, Xidian University, Georgia Institute of Technology, McCombs School of Business, The University of Texas at Austin., School of Data Science, Shenzhen Research Institute of Big Data 摘要:从集合结构化数据中学习是最近引起越来越多关注的一个基本问题,其中引入了一系列摘要网络来处理集合输入。事实上,许多元学习问题可以被视为集合输入任务。大多数现有的摘要网络旨在为输入集设计不同的体系结构,以实现排列不变性。然而,对于元分布中的不同集合密切相关并共享某些统计特性的常见情况关注甚少。将每个集合视为一组全局原型上的分布,本文提供了一种新的基于最优传输(OT)的方法来改进现有的摘要网络。为了学习全局原型上的分布,我们将其与数据点上的经验分布之间的OT距离最小化,从而提供一种自然的无监督方法来改进摘要网络。由于我们的即插即用框架可以应用于许多元学习问题,我们进一步将其实例化为小样本(few-shot)分类和隐式元生成建模两种情形。大量实验表明,我们的框架显著改进了现有摘要网络从集合中学习更强大汇总统计量的能力,并且可以成功地集成到基于度量的小样本分类和生成建模应用中,为解决集合输入和元学习问题提供了一个有前途的工具。 摘要:Learning from set-structured data is a fundamental problem that has recently attracted increasing attention, where a series of summary networks are introduced to deal with the set input. In fact, many meta-learning problems can be treated as set-input tasks. Most existing summary networks aim to design different architectures for the input set in order to enforce permutation invariance. However, scant attention has been paid to the common cases where different sets in a meta-distribution are closely related and share certain statistical properties. Viewing each set as a distribution over a set of global prototypes, this paper provides a novel optimal transport (OT) based way to improve existing summary networks. To learn the distribution over the global prototypes, we minimize its OT distance to the set empirical distribution over data points, providing a natural unsupervised way to improve the summary network. Since our plug-and-play framework can be applied to many meta-learning problems, we further instantiate it to the cases of few-shot classification and implicit meta generative modeling. Extensive experiments demonstrate that our framework significantly improves the existing summary networks on learning more powerful summary statistics from sets and can be successfully integrated into metric-based few-shot classification and generative modeling applications, providing a promising tool for addressing set-input and meta-learning problems.

【53】 Natural Image Reconstruction from fMRI using Deep Learning: A Survey 标题:基于深度学习的fMRI自然图像重建研究综述 链接:https://arxiv.org/abs/2110.09006

作者:Zarina Rakhimberdina,Quentin Jodelet,Xin Liu,Tsuyoshi Murata 机构:Department of Computer Science, Tokyo Institute of Technology, Tokyo ,-, Artificial Intelligence Research Center, National Institute of Advanced Industrial, Science and Technology, Tokyo ,-, Japan 摘要:随着脑成像技术和机器学习工具的出现,人们致力于建立计算模型,以捕获人脑中视觉信息的编码。最具挑战性的大脑解码任务之一是从功能磁共振成像(fMRI)测量的大脑活动中准确重建感知的自然图像。在这项工作中,我们调查了最新的深度学习方法的自然图像重建功能磁共振成像。我们从体系结构设计、基准数据集和评估指标方面对这些方法进行了检查,并提出了跨标准化评估指标的公平性能评估。最后,我们讨论了现有研究的优势和局限性以及未来可能的发展方向。 摘要:With the advent of brain imaging techniques and machine learning tools, much effort has been devoted to building computational models to capture the encoding of visual information in the human brain. One of the most challenging brain decoding tasks is the accurate reconstruction of the perceived natural images from brain activities measured by functional magnetic resonance imaging (fMRI). In this work, we survey the most recent deep learning methods for natural image reconstruction from fMRI. We examine these methods in terms of architectural design, benchmark datasets, and evaluation metrics and present a fair performance evaluation across standardized evaluation metrics. Finally, we discuss the strengths and limitations of existing studies and present potential future directions.

【54】 StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis 标题:StyleNeRF:一种基于样式的高分辨率图像合成3D感知生成器 链接:https://arxiv.org/abs/2110.08985

作者:Jiatao Gu,Lingjie Liu,Peng Wang,Christian Theobalt 机构:†Facebook AI, ‡Max Planck Institute for Informatics, ⋄The University of Hong Kong 备注:24 pages, 19 figures. Project page: this http URL 摘要:我们提出了StyleNeRF,一种三维感知生成模型,用于具有高多视图一致性的照片真实感高分辨率图像合成,可在非结构化二维图像上进行训练。现有的方法要么不能合成具有精细细节的高分辨率图像,要么产生明显的3D不一致伪影。此外,它们中的许多缺乏对样式属性和显式3D摄影机姿势的控制。StyleNeRF将neural radiance field(NeRF)集成到基于样式的生成器中,以解决上述挑战,即提高渲染效率和高分辨率图像生成的3D一致性。我们执行体绘制只是为了生成一个低分辨率的特征映射,并逐步应用二维上采样来解决第一个问题。为了缓解二维上采样造成的不一致性,我们提出了多种设计,包括更好的上采样器和新的正则化损耗。通过这些设计,StyleNeRF可以以交互速率合成高分辨率图像,同时保持高质量的3D一致性。StyleNeRF还可以控制相机姿势和不同级别的样式,这可以概括为看不见的视图。它还支持具有挑战性的任务,包括放大和缩小、样式混合、反转和语义编辑。 摘要:We propose StyleNeRF, a 3D-aware generative model for photo-realistic high-resolution image synthesis with high multi-view consistency, which can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or yield noticeable 3D-inconsistent artifacts. In addition, many of them lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates the neural radiance field (NeRF) into a style-based generator to tackle the aforementioned challenges, i.e., improving rendering efficiency and 3D consistency for high-resolution image generation. We perform volume rendering only to produce a low-resolution feature map and progressively apply upsampling in 2D to address the first issue. To mitigate the inconsistencies caused by 2D upsampling, we propose multiple designs, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera poses and different levels of styles, which can generalize to unseen views. It also supports challenging tasks, including zoom-in and-out, style mixing, inversion, and semantic editing.

【55】 Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs 标题:乐观策略优化在非平稳MDP中被证明是有效的 链接:https://arxiv.org/abs/2110.08984

作者:Han Zhong,Zhuoran Yang,Zhaoran Wang,Csaba Szepesvári 机构:Zhaoran Wang‡, Csaba Szepesvári§ 摘要:我们研究非平稳线性核马尔可夫决策过程(MDP)中的回合制(episodic)强化学习(RL)。在该设置中,奖励函数和转移核相对于给定的特征映射都是线性的,并且允许随时间变化,只要它们各自的参数变化不超过特定的变化预算。我们提出了周期性重启乐观策略优化算法($\underline{\text{P}}$eriodically $\underline{\text{R}}$estarted $\underline{\text{O}}$ptimistic $\underline{\text{P}}$olicy $\underline{\text{O}}$ptimization, PROPO),这是一种具有线性函数近似的乐观策略优化算法。PROPO包含两种机制:基于滑动窗口的策略评估和基于周期性重启的策略改进,这两种机制都是为非平稳环境中的策略优化量身定制的。此外,仅利用滑动窗口技术,我们还提出了一种值迭代算法。我们为所提方法建立了动态上界以及一个相匹配的极小极大下界,后者表明了所提方法的(近)最优性。据我们所知,PROPO是第一个可证明有效地处理非平稳性的策略优化算法。 摘要:We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the $\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted $\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy $\underline{\text{o}}$ptimization algorithm (PROPO), which is an optimistic policy optimization algorithm with linear function approximation. PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement, which are tailored for policy optimization in a non-stationary environment. In addition, only utilizing the technique of sliding window, we propose a value-iteration algorithm. We establish dynamic upper bounds for the proposed methods and a matching minimax lower bound which shows the (near-) optimality of the proposed methods. To our best knowledge, PROPO is the first provably efficient policy optimization algorithm that handles non-stationarity.

【56】 Explaining generalization in deep learning: progress and fundamental limits 标题:解释深度学习中的泛化:进展与基本局限 链接:https://arxiv.org/abs/2110.08922

作者:Vaishnavh Nagarajan 机构:CMU-CS-,-, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA , Thesis Committee:, J. Zico Kolter, Chair, Andrej Risteski, Ameet Talwalkar, Nathan Srebro, Toyota Technological Institute at Chicago, Submitted in partial fulfillment of the requirements 备注:arXiv admin note: text overlap with arXiv:1902.04742 摘要:本论文研究深度学习理论中一个基本的开放性挑战:为什么深度网络在过参数化、未正则化并将训练数据拟合到零误差的情况下,仍能很好地泛化?在论文的第一部分,我们将实证研究通过随机梯度下降训练深度网络如何隐式地控制网络的容量。随后,为了说明这如何导致更好的泛化,我们将导出数据相关的、基于一致收敛的泛化界,并改进其对参数数目的依赖。由于其简单性和通用性,一致收敛实际上是深度学习文献中使用最广泛的工具。鉴于一致收敛的流行,在本论文中我们还将退一步,确定一致收敛作为解释泛化的工具的基本极限。特别是,我们将证明,在一些过参数化设置的例子中,任何一致收敛界都只能给出一个空洞的泛化界。认识到这一点之后,在论文的最后一部分,我们将改变方向,引入一种经验性(empirical)技术,利用未标注数据估计泛化。我们的技术不依赖于任何基于一致收敛的复杂度概念,并且非常精确。我们将从理论上说明为什么我们的技术具有如此高的精度。最后,我们将讨论未来的工作如何探索将分布假设纳入泛化界的新方法(例如以未标注数据的形式),并探索推导界的其他工具,或许通过修改一致收敛,或者开发全新的工具。 摘要:This dissertation studies a fundamental open challenge in deep learning theory: why do deep networks generalize well even while being overparameterized, unregularized and fitting the training data to zero error? In the first part of the thesis, we will empirically study how training deep networks via stochastic gradient descent implicitly controls the networks' capacity. Subsequently, to show how this leads to better generalization, we will derive {\em data-dependent} {\em uniform-convergence-based} generalization bounds with improved dependencies on the parameter count. Uniform convergence has in fact been the most widely used tool in deep learning literature, thanks to its simplicity and generality. Given its popularity, in this thesis, we will also take a step back to identify the fundamental limits of uniform convergence as a tool to explain generalization. In particular, we will show that in some example overparameterized settings, {\em any} uniform convergence bound will provide only a vacuous generalization bound. With this realization in mind, in the last part of the thesis, we will change course and introduce an {\em empirical} technique to estimate generalization using unlabeled data. Our technique does not rely on any notion of uniform-convergence-based complexity and is remarkably precise. We will theoretically show why our technique enjoys such precision. We will conclude by discussing how future work could explore novel ways to incorporate distributional assumptions in generalization bounds (such as in the form of unlabeled data) and explore other tools to derive bounds, perhaps by modifying uniform convergence or by developing completely new tools altogether.

【57】 Noise-robust Clustering 标题:抗噪声聚类 链接:https://arxiv.org/abs/2110.08871

作者:Rahmat Adesunkanmi,Ratnesh Kumar 摘要:本文提出无监督机器学习中的噪声鲁棒聚类技术。噪声、不一致性和其他模糊性带来的不确定性可能成为数据分析中的严重障碍。因此,在处理大数据时,数据质量、清洗、管理和治理仍然是至关重要的原则。在这种复杂性下,像经典设置中那样以确定性方式处理数据已不再足够,考虑噪声分布及其对数据样本值的影响变得有意义。经典聚类方法根据数据在底层空间中的相对距离或相似性,将其分组为"相似类"。本文通过将经典的 $K$-均值($K$-means)和 $K$-中心点($K$-medoids)聚类推广到数据分布(而非原始数据)上来解决这一问题。这涉及用两类度量来衡量分布之间的距离:最优质量传输(也称Wasserstein距离,记作 $W_2$),以及本文提出的一种新的距离度量——随机变量距离的期望值(记作ED)。本文提出的基于分布的 $K$-means 和 $K$-medoids 算法首先对数据分布进行聚类,然后将每个原始数据点分配到其分布所在的簇。 摘要:This paper presents noise-robust clustering techniques in unsupervised machine learning. The uncertainty about the noise, consistency, and other ambiguities can become severe obstacles in data analytics. As a result, data quality, cleansing, management, and governance remain critical disciplines when working with Big Data. With this complexity, it is no longer sufficient to treat data deterministically as in a classical setting, and it becomes meaningful to account for noise distribution and its impact on data sample values. Classical clustering methods group data into "similarity classes" depending on their relative distances or similarities in the underlying space. This paper addressed this problem via the extension of classical $K$-means and $K$-medoids clustering over data distributions (rather than the raw data). This involves measuring distances among distributions using two types of measures: the optimal mass transport (also called Wasserstein distance, denoted $W_2$) and a novel distance measure proposed in this paper, the expected value of random variable distance (denoted ED). The presented distribution-based $K$-means and $K$-medoids algorithms cluster the data distributions first and then assign each raw data to the cluster of data's distribution.
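示例:下面给出"对分布而非原始数据聚类"思路的最简示意(Python):对一组一维经验分布做 $K$-medoids,距离采用 SciPy 提供的一维 Wasserstein 距离(即 $W_1$;论文使用 $W_2$ 与 ED,此处仅作说明)。

```python
import numpy as np
from scipy.stats import wasserstein_distance

def kmedoids_over_distributions(samples, k=2, n_iter=20, seed=0):
    """samples: 长度为 n 的列表,每个元素是一个一维样本数组(一个经验分布)。"""
    rng = np.random.default_rng(seed)
    n = len(samples)
    D = np.zeros((n, n))                        # 分布两两之间的距离矩阵
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = wasserstein_distance(samples[i], samples[j])
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):                      # 每簇内选总距离最小的点为新 medoid
            members = np.where(assign == c)[0]
            if len(members):
                new_medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)
```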

【58】 Understanding the network formation pattern for better link prediction 标题:了解网络形成模式,以便更好地预测链路 链接:https://arxiv.org/abs/2110.08850

作者:Jiating Yu,Ling-Yun Wu 机构:IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China, School of Mathematical Sciences, University of Chinese Academy of Sciences 备注:21 pages, 3 figures, 18 tables, and 29 references 摘要:链路预测作为复杂网络领域中的一个经典问题,已经引起了研究者的广泛关注,它对于理解网络的演化和动态发展机制具有重要意义。虽然已经提出了各种针对特定网络类型的算法来解决链路预测问题,但其中大多数都假设网络结构由三元闭包原则主导。我们仍然缺乏对网络形成模式的自适应且全面的理解来预测潜在链路。此外,研究如何更好地利用网络局部信息也很有价值。为此,我们提出了一种名为"利用多阶局部信息的链路预测"(MOLI)的新方法,它利用来自不同距离邻居的局部信息;其参数既可以基于先验知识先验地给定,也可以通过在观测网络上求解优化问题、以数据驱动的方式学习。MOLI通过图上的随机游走定义了一个局部网络扩散过程,从而更好地利用了网络信息。我们表明,在11种不同类型的模拟和真实网络上,MOLI优于其他11种广泛使用的链路预测算法。我们还得出结论:不同的网络(包括社交网络、通信网络、生物网络等)具有不同的局部信息利用模式。特别是,经典的基于共同邻居的算法并不像人们认为的那样适用于所有社交网络;相反,其中一些社交网络遵循四边形闭包原则,优先连接长度为3的路径。 摘要:As a classical problem in the field of complex networks, link prediction has attracted much attention from researchers, which is of great significance to help us understand the evolution and dynamic development mechanisms of networks. Although various network type-specific algorithms have been proposed to tackle the link prediction problem, most of them suppose that the network structure is dominated by the Triadic Closure Principle. We still lack an adaptive and comprehensive understanding of network formation patterns for predicting potential links. In addition, it is valuable to investigate how network local information can be better utilized. To this end, we proposed a novel method named Link prediction using Multiple Order Local Information (MOLI) that exploits the local information from the neighbors of different distances, with parameters that can be a prior-driven based on prior knowledge, or data-driven by solving an optimization problem on observed networks. MOLI defined a local network diffusion process via random walks on the graph, resulting in better use of network information. We show that MOLI outperforms the other 11 widely used link prediction algorithms on 11 different types of simulated and real-world networks. We also conclude that there are different patterns of local information utilization for different networks, including social networks, communication networks, biological networks, etc. In particular, the classical common neighbor-based algorithm is not as adaptable to all social networks as it is perceived to be; instead, some of the social networks obey the Quadrilateral Closure Principle which preferentially connects paths of length three.
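示例:下面给出"对随机游走各阶局部信息加权"这一思路的极简示意(Python)。它并非 MOLI 的精确定义;权重 betas 为示意假设,论文中这类参数可由先验给定或在观测网络上学习。

```python
import numpy as np

def multi_order_scores(A, betas=(1.0, 0.5, 0.25)):
    """A: 邻接矩阵;返回对随机游走转移矩阵各阶幂加权求和得到的链路得分。"""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1)  # 行归一化转移矩阵
    S = np.zeros_like(P, dtype=float)
    Pk = np.eye(len(A))
    for b in betas:
        Pk = Pk @ P        # 提升一阶:k 步随机游走的到达概率
        S += b * Pk        # 距离越近的邻居贡献的权重越大
    return S
```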

【59】 Centroid Approximation for Bootstrap 标题:Bootstrap的质心逼近 链接:https://arxiv.org/abs/2110.08720

作者:Mao Ye,Qiang Liu 机构:University of Texas at Austin 摘要:Bootstrap(自举)是一种原则性且强大的频率学派不确定性量化统计工具。不幸的是,标准的自举方法计算量很大,因为需要抽取一个大的 i.i.d. 自举样本来近似理想的自举分布;这在很大程度上阻碍了它们在大规模机器学习、特别是深度学习问题中的应用。在这项工作中,我们提出了一种高效的方法,显式地优化一小组高质量的"质心"点,以更好地逼近理想的自举分布。我们通过最小化一个简单的目标函数来实现这一点,该目标函数渐近等价于到理想自举分布的Wasserstein距离。这使我们能够用少量自举质心给出准确的不确定性估计,优于朴素的 i.i.d. 抽样方法。经验上,我们表明我们的方法可以在各种应用中提升自举的性能。 摘要:Bootstrap is a principled and powerful frequentist statistical tool for uncertainty quantification. Unfortunately, standard bootstrap methods are computationally intensive due to the need of drawing a large i.i.d. bootstrap sample to approximate the ideal bootstrap distribution; this largely hinders their application in large-scale machine learning, especially deep learning problems. In this work, we propose an efficient method to explicitly \emph{optimize} a small set of high quality "centroid" points to better approximate the ideal bootstrap distribution. We achieve this by minimizing a simple objective function that is asymptotically equivalent to the Wasserstein distance to the ideal bootstrap distribution. This allows us to provide an accurate estimation of uncertainty with a small number of bootstrap centroids, outperforming the naive i.i.d. sampling approach. Empirically, we show that our method can boost the performance of bootstrap in a variety of applications.
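示例:作为对照,下面给出标准 i.i.d. 自举估计均值置信区间的最简示意(Python);正是这类需要大量重采样的流程,构成了论文试图用少量优化"质心"替代的计算开销来源。

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=10000, alpha=0.05, seed=0):
    """百分位自举:重采样 n_boot 次,取统计量经验分布的分位数作为区间端点。"""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([stat(rng.choice(data, size=n, replace=True))
                      for _ in range(n_boot)])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# 用法示意:n_boot 很大正是标准自举计算开销的来源
print(bootstrap_ci(np.random.default_rng(1).normal(size=200)))
```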

【60】 NeuralArTS: Structuring Neural Architecture Search with Type Theory 标题:NeuralArTS:用类型理论构建神经架构搜索 链接:https://arxiv.org/abs/2110.08710

作者:Robert Wu,Nayan Saxena,Rohan Jain 机构:University of Toronto, King’s College Cir, Toronto, Ontario M,S 摘要:神经架构搜索(NAS)算法在给定可能操作的初始搜索空间的情况下,自动完成寻找最佳深度学习架构的任务。开发这些搜索空间通常是手动的,预先优化的搜索空间比从头开始的搜索效率更高。在本文中,我们提出了一个新的框架,称为神经体系结构类型系统(NeuralArTS),它将结构化类型系统中的无限组网络操作进行分类。我们进一步展示了如何将NeuralArTS应用于卷积层,并提出了几个未来的方向。 摘要:Neural Architecture Search (NAS) algorithms automate the task of finding optimal deep learning architectures given an initial search space of possible operations. Developing these search spaces is usually a manual affair with pre-optimized search spaces being more efficient, rather than searching from scratch. In this paper we present a new framework called Neural Architecture Type System (NeuralArTS) that categorizes the infinite set of network operations in a structured type system. We further demonstrate how NeuralArTS can be applied to convolutional layers and propose several future directions.

【61】 Towards Instance-Optimal Offline Reinforcement Learning with Pessimism 标题:面向实例最优的悲观离线强化学习 链接:https://arxiv.org/abs/2110.08695

作者:Ming Yin,Yu-Xiang Wang 机构:Department of Computer Science, UC Santa Barbara, Department of Statistics and Applied Probability, UC Santa Barbara 备注:NeurIPS, 2021 摘要:我们研究离线强化学习(offline RL)问题,其目标是利用来自行为策略 $\mu$ 的数据,在未知的马尔可夫决策过程(MDP)中学习一个最大化奖励的策略。特别地,我们考虑有限时间跨度(finite-horizon)MDP下离线RL的样本复杂度问题。以往的工作基于不同的数据覆盖假设来研究这个问题,其学习保证用覆盖系数表示,而覆盖系数缺乏对系统量的明确刻画。在这项工作中,我们分析了自适应悲观值迭代(Adaptive Pessimistic Value Iteration, APVI)算法,并推导出近乎匹配如下形式的次优性上界 \[ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}(V^\star_{h+1}+r_h)}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). \] 作为补充,我们还在弱假设(只要 $d^{\pi^\star}_h(s_h,a_h)>0$ 就有 $d^\mu_h(s_h,a_h)>0$)下证明了逐实例的信息论下界。与以往的极小极大下界不同,逐实例下界(通过局部极小极大性)是一个强得多的标准,因为它分别适用于每个单独的实例。这里 $\pi^\star$ 是一个最优策略,$\mu$ 是行为策略,$d_h^\mu$ 是边际状态-动作概率。我们将上式称为内在离线强化学习界,因为它直接蕴含了所有现有的最优结果:一致数据覆盖假设下的极小极大速率、无时间跨度(horizon-free)设置、单策略集中性,以及紧的问题相关结果。随后,我们将结果推广到无假设的区域(在该区域中我们不对 $\mu$ 作任何假设),并获得无假设的内在界。由于其一般形式,我们相信内在界有助于阐明是什么使特定问题变得困难,并揭示离线RL的基本挑战。 摘要:We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works study this problem based on different data-coverage assumptions, and their learning guarantees are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches \[ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}(V^\star_{h+1}+r_h)}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). \] In complementary, we also prove a per-instance information-theoretical lower bound under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Different from the previous minimax lower bounds, the per-instance lower bound (via local minimaxity) is a much stronger criterion as it applies to individual instances separately. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy and $d_h^\mu$ is the marginal state-action probability. We call the above equation the intrinsic offline reinforcement learning bound since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the assumption-free regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.

【62】 On the Statistical Analysis of Complex Tree-shaped 3D Objects 标题:关于复杂树形三维物体的统计分析 链接:https://arxiv.org/abs/2110.08693

作者:Guan Wang,Hamid Laga,Anuj Srivastava 机构:au•Anuj Srivastava is with the Department of Statistics 摘要:人们如何分析表现出复杂几何和拓扑变化的详细3D生物对象,如神经元和植物树?在本文中,我们开发了一个新的数学框架,用于表示、比较和计算此类树状三维对象形状之间的测地变形。子树的层次结构是这些对象的特征——每个子树都有主分支和一些附加的分支——并且需要跨对象匹配这些结构以进行有意义的比较。我们提出了一种新的表示方法,将最初为欧几里德曲线开发的平方根速度函数(SRVF)扩展到树形3D对象。然后,我们定义了一个新的度量,用于量化将一个树状对象变形为另一个所需的弯曲、拉伸和分支滑动。与当前的度量(如商欧几里德距离(QED)和树编辑距离(TED)相比,所提出的表示和度量捕获了分支的全部弹性(即弯曲和拉伸)以及拓扑变化(即分支死亡/出生和滑动)。它完全避免了QED和TED度量的边折叠和节点拆分操作所导致的收缩。我们演示了该框架在比较、匹配和计算生物对象(如神经元和植物树)之间的测地线方面的实用性。该框架还适用于各种形状分析任务:(i)树形3D对象的对称性分析和对称化,(ii)计算树形3D对象总体的汇总统计(变化的平均值和模式),(iii)将参数概率分布拟合到此类总体,以及(iv)最后,根据估计的概率分布通过随机抽样合成新的树形3D对象。 摘要:How can one analyze detailed 3D biological objects, such as neurons and botanical trees, that exhibit complex geometrical and topological variation? In this paper, we develop a novel mathematical framework for representing, comparing, and computing geodesic deformations between the shapes of such tree-like 3D objects. A hierarchical organization of subtrees characterizes these objects -- each subtree has the main branch with some side branches attached -- and one needs to match these structures across objects for meaningful comparisons. We propose a novel representation that extends the Square-Root Velocity Function (SRVF), initially developed for Euclidean curves, to tree-shaped 3D objects. We then define a new metric that quantifies the bending, stretching, and branch sliding needed to deform one tree-shaped object into the other. Compared to the current metrics, such as the Quotient Euclidean Distance (QED) and the Tree Edit Distance (TED), the proposed representation and metric capture the full elasticity of the branches (i.e., bending and stretching) as well as the topological variations (i.e., branch death/birth and sliding). It completely avoids the shrinkage that results from the edge collapse and node split operations of the QED and TED metrics. We demonstrate the utility of this framework in comparing, matching, and computing geodesics between biological objects such as neurons and botanical trees. The framework is also applied to various shape analysis tasks: (i) symmetry analysis and symmetrization of tree-shaped 3D objects, (ii) computing summary statistics (means and modes of variations) of populations of tree-shaped 3D objects, (iii) fitting parametric probability distributions to such populations, and (iv) finally synthesizing novel tree-shaped 3D objects through random sampling from estimated probability distributions.
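示例:下面给出欧氏曲线上经典 SRVF 的数值计算示意(Python),即 $q(t)=\dot f(t)/\sqrt{\lVert\dot f(t)\rVert}$;论文将该表示推广到树状 3D 对象,这里仅为其出发点的定义,采样与曲线均为示意假设。

```python
import numpy as np

def srvf(curve, t):
    """curve: (T, 3) 采样点;t: (T,) 参数;返回同形状的 SRVF 采样。"""
    deriv = np.gradient(curve, t, axis=0)                     # 数值导数 f'(t)
    speed = np.maximum(np.linalg.norm(deriv, axis=1), 1e-12)  # |f'(t)|,避免除零
    return deriv / np.sqrt(speed)[:, None]

# 用法示意:一条螺旋线的 SRVF
t = np.linspace(0, 2 * np.pi, 100)
helix = np.stack([np.cos(t), np.sin(t), t], axis=1)
q = srvf(helix, t)
```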

【63】 Terminal Embeddings in Sublinear Time 标题:次线性时间中的终端嵌入 链接:https://arxiv.org/abs/2110.08691

作者:Yeshwanth Cherapanamjeri,Jelani Nelson 备注:Accepted to FOCS 2021 摘要:最近,(Elkin, Filtser, Neiman 2017) 引入了带有一组指定终端 $T\subset X$ 的、从一个度量空间 $(X,d_X)$ 到另一个度量空间 $(Y,d_Y)$ 的{\it terminal embedding}(终端嵌入)概念。若 $\rho$ 是使得存在常数 $C>0$ 满足 \begin{equation*} \forall x\in T\ \forall q\in X,\quad C\, d_X(x,q) \le d_Y(f(x),f(q)) \le C\rho\, d_X(x,q) \end{equation*} 的最小值,则称嵌入 $f$ 具有失真 $\rho\ge 1$。在 $X,Y$ 均为欧氏度量且 $Y$ 为 $m$ 维的情形下,(Narayanan, Nelson 2019) 继 (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018) 的工作之后最近表明,取 $m=O(\epsilon^{-2}\log n)$(其中 $n:=|T|$)的终端嵌入可以实现失真 $1+\epsilon$。这推广了Johnson-Lindenstrauss引理,后者只保持 $T$ 内部的距离,而不保持空间其余部分到 $T$ 的距离。其缺点是,在某个 $q\in\mathbb{R}^d$ 上计算该嵌入需要求解一个含 $m$ 个变量、$\Theta(n)$ 个约束的半定规划,因此需要某个超线性的 $\mathrm{poly}(n)$ 运行时间。我们这项工作的主要贡献是给出一种计算终端嵌入的新数据结构。我们展示了如何预处理 $T$ 以获得一个近线性空间的数据结构,它支持在次线性时间 $n^{1-\Theta(\epsilon^2)+o(1)}+dn^{o(1)}$ 内计算任意 $q\in\mathbb{R}^d$ 的终端嵌入像。为实现这一点,我们利用了在近似最近邻搜索背景下发展起来的工具。 摘要:Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\quad C\, d_X(x, q) \le d_Y(f(x), f(q)) \le C\rho\, d_X(x, q). \end{equation*} In the case that $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside is that evaluating the embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in $m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $n^{1-\Theta(\epsilon^2)+o(1)}+dn^{o(1)}$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.

【64】 Transformer with a Mixture of Gaussian Keys 标题:使用高斯混合键的Transformer 链接:https://arxiv.org/abs/2110.08678

作者:Tam Nguyen,Tan M. Nguyen,Dung Le,Khuong Nguyen,Anh Tran,Richard G. Baraniuk,Nhat Ho,Stanley J. Osher 机构:FPT Software, Vietnam†, University of California, Los Angeles, USA‡, Rice University, Houston, USA⋄, University of Texas, Austin, USA◦ 备注:21 pages, 8 figures, 4 tables 摘要:多头注意力是最先进的Transformer背后的驱动力,这些Transformer在各种自然语言处理(NLP)和计算机视觉任务中取得了卓越的性能。据观察,在许多应用中,这些注意力头会学习到冗余的嵌入,并且其中大多数都可以在不降低模型性能的情况下被移除。受这一观察的启发,我们提出了使用高斯混合键的Transformer(Transformer-MGK),这是一种新的Transformer架构,它用每个头上的混合键替换Transformer中的冗余头。这些键的混合遵循高斯混合模型,使每个注意力头能够高效地关注输入序列的不同部分。与传统Transformer相比,Transformer-MGK加快了训练和推理速度,参数更少,所需计算量(FLOPs)更少,同时在各任务上实现了相当或更高的精度。Transformer-MGK也可以很容易地扩展到线性注意力。我们在一系列实际应用(包括语言建模和涉及超长序列的任务)中以实验证明了Transformer-MGK的优势。在Wikitext-103和Long Range Arena基准测试中,使用4个头的Transformer-MGK取得了与使用8个头的基线Transformer相当或更好的性能。 摘要:Multi-head attention is a driving force behind state-of-the-art transformers which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires less FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attentions. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.

【65】 Towards Robust Waveform-Based Acoustic Models 标题:走向稳健的基于波形的声学模型 链接:https://arxiv.org/abs/2110.08634

作者:Dino Oglic,Zoran Cvetkovic,Peter Sollich,Steve Renals,Bin Yu 摘要:我们提出了一种在不利环境中学习鲁棒声学模型的方法;这类环境的特点是训练条件与测试条件之间存在显著不匹配。这个问题对于部署需要在未见环境中表现良好的语音识别系统至关重要。我们的方法是邻域风险最小化(vicinal risk minimization)的一个实例,其目的是在训练期间改进风险估计:将定义输入空间上经验密度的狄拉克δ函数,替换为训练样本邻域内边缘总体密度的一个近似。更具体地,我们假设以训练样本为中心的局部邻域可以用高斯混合来近似,并从理论上证明这可以将鲁棒的归纳偏置纳入学习过程。我们通过数据增强方案隐式地刻画各个混合成分,这些方案旨在应对声学模型中常见的伪相关来源。为了避免信息丢失(这一问题常与FBANK、MFCC等标准特征提取技术相关)对稳健性评估造成的潜在混淆影响,我们将评估重点放在基于波形的设置上。我们的实验结果表明,所提出的方法可以泛化到未见过的噪声条件:与使用标准风险最小化原则的训练相比,分布外泛化相对提升150%。此外,结果表明,相对于使用专门设计为匹配测试话语声学条件的训练样本(即最优邻域密度)学习的模型,该方法的性能具有竞争力。 摘要:We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).
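示例:下面给出邻域风险最小化中"高斯邻域"这一特例的最简示意(Python)。论文通过结构化的数据增强隐式刻画各高斯混合成分;此处的各向同性噪声与 sigma 取值均为示意假设。

```python
import numpy as np

def vicinal_batch(x, y, sigma=0.01, n_aug=4, seed=0):
    """以每个训练样本为中心、从各向同性高斯邻域采样得到增强批次。
    x: (n, d) 输入(例如波形帧);y: (n,) 标签。"""
    rng = np.random.default_rng(seed)
    xs = np.repeat(x, n_aug, axis=0)
    ys = np.repeat(y, n_aug, axis=0)
    return xs + rng.normal(scale=sigma, size=xs.shape), ys
```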

【66】 On the Pareto Frontier of Regret Minimization and Best Arm Identification in Stochastic Bandits Link: https://arxiv.org/abs/2110.08627

Authors: Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan
Affiliations: 1 Department of Mathematics, National University of Singapore; 2 Department of Electrical and Computer Engineering, National University of Singapore; 3 Department of Industrial Systems and Management, National University of Singapore
Comments: 27 pages, 8 figures
Abstract: We study the Pareto frontier of two archetypal objectives in stochastic bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To make this precise, we first design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm, which achieves order-wise optimal performance for RM or BAI under different values of $\gamma$. Complementarily, we show that no algorithm can simultaneously perform optimally for both the RM and BAI objectives. More precisely, we establish non-trivial lower bounds on the regret achievable by any algorithm with a given BAI failure probability. This analysis shows that in some regimes BoBW-lil'UCB$(\gamma)$ achieves Pareto-optimality up to constant or small terms. Numerical experiments further demonstrate that when applied to difficult instances, BoBW-lil'UCB outperforms a close competitor UCB$_{\alpha}$ (Degenne et al., 2019), which is designed for RM and BAI with a fixed confidence.
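Since the abstract turns on a single exploration-scale parameter, here is a schematic, heavily simplified sketch of a lil'UCB-flavoured index policy in which `gamma` scales the confidence bonus and the empirical best arm is recommended at the horizon. The index formula, constants, and names below are our guesses at the flavour of the algorithm, not the paper's BoBW-lil'UCB$(\gamma)$ specification.

```python
import numpy as np

def lil_ucb_style(arms, T, gamma=1.0, delta=0.01, seed=0):
    """Generic lil'UCB-flavoured policy; `arms` is a list of callables
    returning stochastic rewards in [0, 1]. Larger gamma means more
    exploration (helping BAI), smaller gamma less (helping RM)."""
    rng = np.random.default_rng(seed)
    K = len(arms)
    counts = np.zeros(K, dtype=int)
    means = np.zeros(K)
    history = []
    for t in range(T):
        if t < K:
            a = t                                    # pull each arm once
        else:
            bonus = gamma * np.sqrt(
                2 * np.log(np.log(np.maximum(counts, 2)) / delta) / counts)
            a = int(np.argmax(means + bonus))
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]       # running mean update
        history.append(a)
    best_arm_guess = int(np.argmax(means))           # BAI recommendation at horizon
    return best_arm_guess, history
```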

【67】 Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty Link: https://arxiv.org/abs/2110.08607

Authors: Wei Liu, Zhilu Lai, Kiran Bacsa, Eleni Chatzi
Affiliations: Chair of Structural Mechanics and Monitoring, Department of Civil, Environmental and Geomatic Engineering, ETH Zürich, Zürich, Switzerland; Future Resilient Systems, Singapore-ETH Centre, Singapore
Abstract: In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework is especially targeted to the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where it is typically intractable to perform exact inference of latent variables. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space is often prone to lack physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification of nonlinear dynamical systems. Specifically, the transition process can be modeled as a physics-based model enhanced with an additive neural network component, which aims to learn the discrepancy between the physics-based model and the actual dynamical system being monitored. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential for generalization and prediction capabilities.
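The core modeling move — a physics-based transition augmented with an additive neural discrepancy term — can be sketched in a few lines. The damped-oscillator prior, the tiny MLP, and all shapes below are illustrative assumptions; in the paper the residual network is trained jointly with an emission model by variational inference.

```python
import numpy as np

def physics_transition(z, dt=0.01, omega=2.0, zeta=0.05):
    """Known physics prior: a damped linear oscillator discretised with
    forward Euler. State z = (displacement, velocity)."""
    x, v = z
    return np.array([x + dt * v,
                     v + dt * (-2 * zeta * omega * v - omega**2 * x)])

class ResidualMLP:
    """Tiny MLP standing in for the learned discrepancy term; in the actual
    framework its weights would be learned by variational inference."""
    def __init__(self, dim=2, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((hidden, dim))
        self.W2 = 0.1 * rng.standard_normal((dim, hidden))
    def __call__(self, z):
        return self.W2 @ np.tanh(self.W1 @ z)

def pgdmm_transition_mean(z, residual):
    """PgDMM-style transition mean: physics model plus additive neural residual."""
    return physics_transition(z) + residual(z)

# Toy usage: propagate one state through the hybrid transition.
z_next = pgdmm_transition_mean(np.array([1.0, 0.0]), ResidualMLP())
```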

【68】 Statistics in everyone's backyard: an impact study via citation network analysis Link: https://arxiv.org/abs/2110.08605

Authors: Lijia Wang, Xin Tong, Y. X. Rachel Wang
Affiliations: Department of Mathematics, University of Southern California; Department of Data Sciences and Operations, University of Southern California; School of Mathematics and Statistics, University of Sydney
Abstract: The increasing availability of curated citation data provides a wealth of resources for analyzing and understanding the intellectual influence of scientific publications. In the field of statistics, current studies of citation data have mostly focused on the interactions between statistical journals and papers, limiting the measure of influence to mainly within statistics itself. In this paper, we take the first step towards understanding the impact statistics has made on other scientific fields in the era of Big Data. By collecting comprehensive bibliometric data from the Web of Science database for selected statistical journals, we investigate the citation trends and compositions of citing fields over time to show that their diversity has been increasing. Furthermore, we use the local clustering technique involving personalized PageRank with conductance for size selection to find the most relevant statistical research area for a given external topic of interest. We provide theoretical guarantees for the procedure and, through a number of case studies, show the results from our citation data align well with our knowledge and intuition about these external topics. Overall, we have found that the statistical theory and methods recently invented by the statistics community have made increasing impact on other scientific fields.
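The local clustering primitive the abstract names — personalized PageRank with a conductance-based sweep for size selection — is standard enough to sketch directly. The dense-matrix power iteration and default parameters below are our choices, not the paper's.

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.15, n_iter=200):
    """Power iteration for the personalized PageRank vector of `seed` on an
    undirected graph with (dense) adjacency matrix A."""
    n = A.shape[0]
    d = A.sum(1)
    P = A / d[:, None]                       # row-stochastic transition matrix
    s = np.zeros(n); s[seed] = 1.0
    pi = s.copy()
    for _ in range(n_iter):
        pi = alpha * s + (1 - alpha) * (P.T @ pi)
    return pi

def sweep_cut(A, pi):
    """Sweep over the PPR ordering and return the prefix set with the lowest
    conductance (assumes no self-loops)."""
    d = A.sum(1); vol_total = d.sum()
    order = np.argsort(-pi / d)              # degree-normalized ranking
    in_set = np.zeros(len(pi), dtype=bool)
    vol = cut = 0.0
    best, best_phi = None, np.inf
    for i, v in enumerate(order[:-1]):
        e_in = A[v, in_set].sum()            # edges from v into the current set
        in_set[v] = True
        vol += d[v]
        cut += d[v] - 2.0 * e_in
        phi = cut / min(vol, vol_total - vol)
        if phi < best_phi:
            best_phi, best = phi, order[:i + 1].copy()
    return best, best_phi
```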

【69】 PDMM: A novel Primal-Dual Majorization-Minimization algorithm for Poisson Phase-Retrieval problem Link: https://arxiv.org/abs/2110.08600

Authors: Ghania Fatima, Zongyu Li, Aakash Arora, Prabhu Babu
Affiliations: Zongyu Li is with the Department of Electrical Engineering and Computer Science, University of Michigan; University of Luxembourg
Abstract: In this paper, we introduce a novel iterative algorithm for the phase-retrieval problem, where the measurements consist of only the magnitude of a linear function of the unknown signal, and the noise in the measurements follows a Poisson distribution. The proposed algorithm is based on the principle of majorization-minimization (MM); however, the application of MM here is very novel and distinct from the way MM has usually been used to solve optimization problems in the literature. More precisely, we reformulate the original minimization problem into a saddle point problem by invoking the Fenchel dual representation of the $\log(\cdot)$ term in the Poisson likelihood function. We then propose tighter surrogate functions over both primal and dual variables, resulting in a double-loop MM algorithm, which we have named the Primal-Dual Majorization-Minimization (PDMM) algorithm. The iterative steps of the resulting algorithm are simple to implement and involve only computing matrix-vector products. We also extend our algorithm to handle various L1-regularized Poisson phase-retrieval problems (which exploit sparsity). The proposed algorithm is compared with previously proposed algorithms such as Wirtinger flow (WF), MM (conventional), and the alternating direction method of multipliers (ADMM) for the Poisson data model. The simulation results under different experimental settings show that PDMM is faster than the competing methods, and its performance in recovering the original signal is on par with the state-of-the-art algorithms.
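PDMM itself is a double-loop MM scheme on a saddle-point reformulation that we will not attempt to reproduce here; the sketch below only sets up the Poisson phase-retrieval data model and one Wirtinger-flow-style gradient step (the WF baseline the paper compares against), for the real-valued case. Step size and names are ours.

```python
import numpy as np

def poisson_nll(x, A, y, eps=1e-12):
    """Negative log-likelihood (up to constants) for y_i ~ Poisson(|a_i^T x|^2)."""
    u = (A @ x) ** 2 + eps
    return float(np.sum(u - y * np.log(u)))

def wf_step(x, A, y, lr=1e-3, eps=1e-12):
    """One Wirtinger-flow-style gradient step on the Poisson NLL, real case.
    Gradient: sum_i 2 (1 - y_i / u_i) (a_i^T x) a_i with u_i = (a_i^T x)^2."""
    Ax = A @ x
    u = Ax ** 2 + eps
    grad = 2.0 * A.T @ ((1.0 - y / u) * Ax)
    return x - lr * grad

# Toy usage: recover a random signal from Poisson magnitude-squared data.
rng = np.random.default_rng(0)
A = rng.standard_normal((400, 50))
x_true = rng.standard_normal(50)
y = rng.poisson((A @ x_true) ** 2).astype(float)
x = rng.standard_normal(50)
for _ in range(200):
    x = wf_step(x, A, y)
```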

【70】 Nys-Curve: Nyström-Approximated Curvature for Stochastic Optimization Link: https://arxiv.org/abs/2110.08577

Authors: Hardik Tankaria, Dinesh Singh, Makoto Yamada
Affiliations: RIKEN AIP; Kyoto University
Abstract: The quasi-Newton methods generally provide curvature information by approximating the Hessian using the secant equation. However, the secant equation becomes insipid in approximating the Newton step owing to its use of first-order derivatives. In this study, we propose an approximate Newton step-based stochastic optimization algorithm for large-scale empirical risk minimization of convex functions with linear convergence rates. Specifically, we compute a partial column Hessian of size ($d \times k$) with $k \ll d$ randomly selected variables, then use the Nyström method to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step ($\Delta\boldsymbol{w}$) without computing and storing the full Hessian or its inverse. Furthermore, to address large-scale scenarios in which even computing a partial Hessian may require significant time, we use distribution-preserving (DP) sub-sampling to compute a partial Hessian. The DP sub-sampling generates $p$ sub-samples with similar first- and second-order distribution statistics and selects a single sub-sample at each epoch in a round-robin manner to compute the partial Hessian. We integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradients to solve the logistic regression problem. The numerical experiments show that the proposed approach is able to obtain a better approximation of Newton's method, with performance competitive with the state-of-the-art first-order and stochastic quasi-Newton methods.
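To illustrate the flavour of the update, here is a sketch that builds the partial column Hessian from $k$ Hessian-vector products, forms the Nyström approximation $C W^{-1} C^\top$ implicitly, and solves a damped system with the Woodbury identity so the full $d \times d$ Hessian is never stored. The damping `rho`, uniform column sampling, and function names are our assumptions; the paper's exact step and its DP sub-sampling are not reproduced.

```python
import numpy as np

def nystrom_newton_step(hvp_fn, w, g, k=20, rho=1e-2, seed=0):
    """Approximate Newton step -(H_nys + rho*I)^{-1} g, where
    H_nys = C W^{-1} C^T is the Nystrom approximation built from the partial
    column Hessian C = H[:, S] and its principal submatrix W = H[S, S].
    hvp_fn(w, v) must return the Hessian-vector product H(w) v."""
    d = w.size
    rng = np.random.default_rng(seed)
    S = rng.choice(d, size=k, replace=False)
    C = np.empty((d, k))
    for i, j in enumerate(S):
        e = np.zeros(d); e[j] = 1.0
        C[:, i] = hvp_fn(w, e)               # column j of the Hessian
    W = C[S, :]                              # k x k principal submatrix
    # Woodbury: (rho*I + C W^{-1} C^T)^{-1} g
    #         = (g - C (rho*W + C^T C)^{-1} C^T g) / rho  -- only k x k solves
    small = rho * W + C.T @ C
    return -(g - C @ np.linalg.solve(small, C.T @ g)) / rho

# Toy check on a quadratic f(w) = 0.5 w^T H w, where the HVP is exact.
rng = np.random.default_rng(1)
d = 50
B = rng.standard_normal((d, d)); H = B @ B.T / d + np.eye(d)
w = rng.standard_normal(d)
dw = nystrom_newton_step(lambda w, v: H @ v, w, H @ w, k=10)
```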

【71】 Estimating individual admixture from finite reference databases Link: https://arxiv.org/abs/2110.08348

Authors: Peter Pfaffelhuber, Angelika Rohde
Comments: 17 pages, 3 figures
Abstract: The concept of individual admixture (IA) assumes that the genome of an individual is composed of alleles inherited from $K$ ancestral populations. Each copy of each allele has the same chance $q_k$ to originate from population $k$, and together with the allele frequencies $p$ in all populations this comprises the admixture model, which is the basis for software such as STRUCTURE and ADMIXTURE. Here, we assume that $p$ is given through a finite reference database, and $q$ is estimated via maximum likelihood. Above all, we are interested in efficient estimation of $q$, and in the variance of the estimator which originates from the finiteness of the reference database, i.e. a variance in $p$. We provide a central limit theorem for the maximum-likelihood estimator, give simulation results, and discuss applications in forensic genetics.
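Maximum-likelihood estimation of $q$ for fixed $p$ admits a standard EM iteration (the same update used by ADMIXTURE-style software); the sketch below implements it for one individual with biallelic genotypes in {0, 1, 2}. EM is our choice of optimizer here, and the variance analysis that is the paper's main contribution is not sketched.

```python
import numpy as np

def em_admixture(g, p, n_iter=500, tol=1e-10):
    """EM for admixture proportions q of one individual, given fixed reference
    allele frequencies p (K x L) and genotypes g (L,) counting reference alleles."""
    K, L = p.shape
    q = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        r = q @ p                                  # per-copy reference-allele prob.
        a = q[:, None] * p / r                     # responsibilities, reference copies
        b = q[:, None] * (1 - p) / (1 - r)         # responsibilities, alternate copies
        q_new = (a @ g + b @ (2 - g)) / (2 * L)    # average over the 2L allele copies
        if np.abs(q_new - q).max() < tol:
            q = q_new
            break
        q = q_new
    return q

# Toy usage: K = 3 reference populations, L = 200 loci.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=(3, 200))
q_true = np.array([0.6, 0.3, 0.1])
g = rng.binomial(2, q_true @ p)
q_hat = em_admixture(g, p)
```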

【72】 A New Approach for Interpretability and Reliability in Clinical Risk Prediction: Acute Coronary Syndrome Scenario Link: https://arxiv.org/abs/2110.08331

Authors: Francisco Valente, Jorge Henriques, Simão Paredes, Teresa Rocha, Paulo de Carvalho, João Morais
Affiliations: Center for Informatics and Systems of the University of Coimbra, University of Coimbra, Pólo II, Coimbra, Portugal; Polytechnic of Coimbra, Department of Systems and Computer Engineering, Rua Pedro Nunes - Quinta da Nora, Coimbra, Portugal
Abstract: We intend to create a new risk assessment methodology that combines the best characteristics of both risk score and machine learning models. More specifically, we aim to develop a method that, besides having a good performance, offers a personalized model and outcome for each patient, presents high interpretability, and incorporates an estimation of the prediction reliability which is not usually available. By combining these features in the same approach we expect that it can boost the confidence of physicians to use such a tool in their daily activity. In order to achieve the mentioned goals, a three-step methodology was developed: several rules were created by dichotomizing risk factors; such rules were trained with a machine learning classifier to predict the acceptance degree of each rule (the probability that the rule is correct) for each patient; that information was combined and used to compute the risk of mortality and the reliability of such prediction. The methodology was applied to a dataset of patients admitted with any type of acute coronary syndromes (ACS), to assess the 30-days all-cause mortality risk. The performance was compared with state-of-the-art approaches: logistic regression (LR), artificial neural network (ANN), and clinical risk score model (Global Registry of Acute Coronary Events - GRACE). The proposed approach achieved testing results identical to the standard LR, but offers superior interpretability and personalization; it also significantly outperforms the GRACE risk model and the standard ANN model. The calibration curve also suggests a very good generalization ability of the obtained model as it approaches the ideal curve. Finally, the reliability estimation of individual predictions showed a strong correlation with the misclassification rate. Those properties may have a beneficial application in other clinical scenarios as well. [abridged]
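A toy rendering of the three-step pipeline may help fix ideas: dichotomized risk factors become rules, a classifier (not shown) predicts each rule's acceptance degree per patient, and the pieces are combined into a risk and a reliability score. The thresholds, the weighted-vote combination, and the plugged-in acceptance probabilities below are all invented for illustration; they are not the paper's rules or aggregation.

```python
import numpy as np

# Step 1: rules from dichotomized risk factors (thresholds are illustrative).
RULES = [
    ("age >= 75",         lambda x: x["age"] >= 75),
    ("systolic_bp < 100", lambda x: x["systolic_bp"] < 100),
    ("heart_rate > 110",  lambda x: x["heart_rate"] > 110),
    ("creatinine > 2.0",  lambda x: x["creatinine"] > 2.0),
]

def rule_vector(patient):
    """Binary vector saying which rules fire for this patient."""
    return np.array([float(fn(patient)) for _, fn in RULES])

def combine(fired, acceptance):
    """Step 3: combine rule outputs weighted by their predicted acceptance
    degrees into a mortality risk, plus a reliability score (mean acceptance).
    The weighted vote here is our simplification of the paper's aggregation."""
    risk = float(fired @ acceptance) / len(RULES)
    reliability = float(acceptance.mean())
    return risk, reliability

patient = {"age": 81, "systolic_bp": 95, "heart_rate": 88, "creatinine": 1.1}
fired = rule_vector(patient)
# Step 2 would train a classifier to predict each rule's acceptance degree for
# this patient; here we plug in made-up probabilities in its place.
acceptance = np.array([0.9, 0.8, 0.7, 0.95])
print(combine(fired, acceptance))
```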

【73】 Structure Learning in Inverse Ising Problems Using $\ell_2$-Regularized Linear Estimator Link: https://arxiv.org/abs/2008.08342

Authors: Xiangming Meng, Tomoyuki Obuchi, Yoshiyuki Kabashima
Comments: 35 pages, 8 figures
Abstract: The inference performance of the pseudolikelihood method is discussed in the framework of the inverse Ising problem when the $\ell_2$-regularized (ridge) linear regression is adopted. This setup is introduced for theoretically investigating the situation where the data generation model is different from the inference one, namely the model mismatch situation. In the teacher-student scenario under the assumption that the teacher couplings are sparse, the analysis is conducted using the replica and cavity methods, with a special focus on whether the presence/absence of teacher couplings is correctly inferred or not. The result indicates that despite the model mismatch, one can perfectly identify the network structure using naive linear regression without regularization when the number of spins $N$ is smaller than the dataset size $M$, in the thermodynamic limit $N \to \infty$. Further, to access the underdetermined region $M < N$, we examine the effect of the $\ell_2$ regularization, and find that biases appear in all the coupling estimates, preventing the perfect identification of the network structure. We, however, find that the biases decay exponentially fast as the distance from the center spin chosen in the pseudolikelihood method grows. Based on this finding, we propose a two-stage estimator: in the first stage, ridge regression is used and the estimates are pruned by a relatively small threshold; in the second stage, naive linear regression is conducted only on the remaining couplings, and the resultant estimates are again pruned by another, relatively large threshold. This estimator, with an appropriate regularization coefficient and thresholds, is shown to achieve perfect identification of the network structure even in the underdetermined regime $0 < M/N < 1$.
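The two-stage estimator described at the end of the abstract is easy to sketch on a linearized stand-in for the pseudolikelihood fit: ridge regression, prune with a small threshold, refit the survivors with plain least squares, prune again with a larger threshold. The regularization strength and thresholds below are illustrative, not the calibrated values from the paper.

```python
import numpy as np

def two_stage_couplings(S, y, lam=0.1, tau1=0.05, tau2=0.2):
    """Two-stage estimate of one row of the Ising coupling matrix.

    S: (M, N-1) neighbour-spin data, y: (M,) center-spin data; we use a plain
    linear-regression stand-in for the pseudolikelihood objective."""
    M, N1 = S.shape
    # Stage 1: ridge regression, then a relatively small threshold.
    J1 = np.linalg.solve(S.T @ S + lam * M * np.eye(N1), S.T @ y)
    keep = np.abs(J1) > tau1
    # Stage 2: naive (unregularized) least squares on the surviving couplings,
    # then a relatively large threshold.
    J = np.zeros(N1)
    if keep.any():
        J_sub, *_ = np.linalg.lstsq(S[:, keep], y, rcond=None)
        J_sub[np.abs(J_sub) <= tau2] = 0.0
        J[keep] = J_sub
    return J
```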

