Statistics Academic Digest [8.30]

2021-09-16 14:29:58


stat (Statistics): 19 papers in total

【1】 Bayesian Sparse Blind Deconvolution Using MCMC Methods Based on Normal-Inverse-Gamma Prior Link: https://arxiv.org/abs/2108.12398

Authors: Burak Cevat Civek, Emre Ertin Affiliations: Department of Electrical and Computer Engineering, The Ohio State University Abstract: Bayesian estimation methods for sparse blind deconvolution problems conventionally employ a Bernoulli-Gaussian (BG) prior for modeling sparse sequences and utilize Markov Chain Monte Carlo (MCMC) methods for the estimation of unknowns. However, the discrete nature of the BG model creates computational bottlenecks, preventing efficient exploration of the probability space even with the recently proposed enhanced sampler schemes. To address this issue, we propose an alternative MCMC method by modeling the sparse sequences using the Normal-Inverse-Gamma (NIG) prior. We derive effective Gibbs samplers for this prior and illustrate that the computational burden associated with the BG model can be eliminated by transferring the problem into a completely continuous-valued framework. In addition to sparsity, we also incorporate time and frequency domain constraints on the convolving sequences. We demonstrate the effectiveness of the proposed methods via extensive simulations and characterize computational gains relative to the existing methods that utilize BG modeling.

【2】 A class of dependent Dirichlet processes via latent multinomial processes Link: https://arxiv.org/abs/2108.12396

Authors: Luis E. Nieto-Barajas Affiliations: Department of Statistics, ITAM, Mexico Abstract: We describe a procedure to introduce general dependence structures on a set of Dirichlet processes. Dependence can be in one direction to define a time series or in two directions to define spatial dependencies. More directions can also be considered. Dependence is induced via a set of latent processes, and we exploit the conjugacy property between the Dirichlet and the multinomial processes to ensure that the marginal law for each element of the set is a Dirichlet process. Dependence is characterised through the correlation between any two elements. Posterior distributions are obtained when we use the set of Dirichlet processes as prior distributions in a Bayesian nonparametric context. Posterior predictive distributions induce partially exchangeable sequences defined by generalised Pólya urns. A numerical example is also included as an illustration.

【3】 A Parameter Estimation Method for Multivariate Aggregated Hawkes Processes Link: https://arxiv.org/abs/2108.12357

Authors: Leigh Shlomovich, Edward A. K. Cohen, Niall Adams Comments: 14 pages, 5 figures Abstract: It is often assumed that events cannot occur simultaneously when modelling data with point processes. This raises a problem, as real-world data often contain synchronous observations due to aggregation or rounding, resulting from limitations on recording capabilities and the expense of storing high volumes of precise data. In order to gain a better understanding of the relationships between processes, we consider modelling the aggregated event data using multivariate Hawkes processes, which offer a description of mutually-exciting behaviour and have found wide applications in areas including seismology and finance. Here we generalise existing methodology on parameter estimation of univariate aggregated Hawkes processes to the multivariate case using a Monte Carlo Expectation Maximization (MC-EM) algorithm. Through a simulation study, we illustrate that alternative approaches to this problem can be severely biased, with the multivariate MC-EM method outperforming them in terms of MSE in all considered cases.

【4】 Correcting spatial Gaussian process parameter and prediction variance estimation under informative sampling Link: https://arxiv.org/abs/2108.12354

Authors: Erin M. Schliep, Christopher K. Wikle, Ranadeep Daw Comments: 21 pages, 7 figures Abstract: Informative sampling designs can impact spatial prediction, or kriging, in two important ways. First, the sampling design can bias spatial covariance parameter estimation, which in turn can bias spatial kriging estimates. Second, even with unbiased estimates of the spatial covariance parameters, since the kriging variance is a function of the observation locations, these estimates will vary based on the sample and overestimate the population-based estimates. In this work, we develop a weighted composite likelihood approach to improve spatial covariance parameter estimation under informative sampling designs. Then, given these parameter estimates, we propose three approaches to quantify the effects of the sampling design on the variance estimates in spatial prediction. These results can be used to make informed decisions for population-based inference. We illustrate our approaches using a comprehensive simulation study. Then, we apply our methods to perform spatial prediction on nitrate concentration in wells located throughout central California.

【5】 Statistical Inference for Linear Mediation Models with High-dimensional Mediators and Application to Studying Stock Reaction to COVID-19 Pandemic Link: https://arxiv.org/abs/2108.12329

Authors: Xu Guo, Runze Li, Jingyuan Liu, Mudong Zeng Affiliations: School of Statistics, Beijing Normal University, Beijing, China; Department of Statistics, The Pennsylvania State University, University Park, PA, USA; MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics Abstract: Mediation analysis draws increasing attention in many scientific areas such as genomics, epidemiology and finance. In this paper, we propose new statistical inference procedures for high dimensional mediation models, in which both the outcome model and the mediator model are linear with high dimensional mediators. Traditional procedures for mediation analysis cannot be used to make statistical inference for high dimensional linear mediation models due to the high dimensionality of the mediators. We propose an estimation procedure for the indirect effects of the models via a partial penalized least squares method, and further establish its theoretical properties. We further develop a partial penalized Wald test on the indirect effects, and prove that the proposed test has a $\chi^2$ limiting null distribution. We also propose an $F$-type test for direct effects and show that the proposed test asymptotically follows a $\chi^2$-distribution under the null hypothesis and a noncentral $\chi^2$-distribution under local alternatives. Monte Carlo simulations are conducted to examine the finite sample performance of the proposed tests and compare their performance with existing ones. We further apply the newly proposed statistical inference procedures to study stock reaction to the COVID-19 pandemic via an empirical analysis of the mediation effects of financial metrics that bridge a company's sector and its stock return.

【6】 A comparison of approaches to improve worst-case predictive model performance over patient subpopulations Link: https://arxiv.org/abs/2108.12250

Authors: Stephen R. Pfohl, Haoran Zhang, Yizhe Xu, Agata Foryciarz, Marzyeh Ghassemi, Nigam H. Shah Affiliations: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford; Department of Computer Science, University of Toronto, Toronto, Ontario, Canada; Department of Computer Science, Stanford University, Stanford, California, USA Abstract: Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally robust optimization (DRO), attempt to address this problem without introducing additional harms. We conduct a large-scale empirical study of DRO and several variations of standard learning procedures to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations compared to standard approaches for learning predictive models from electronic health records data. In the course of our evaluation, we introduce an extension to DRO approaches that allows for specification of the metric used to assess worst-case performance. We conduct the analysis for models that predict in-hospital mortality, prolonged length of stay, and 30-day readmission for inpatient admissions, and predict in-hospital mortality using intensive care data. We find that, with relatively few exceptions, no approach performs better, for each patient subpopulation examined, than standard learning procedures using the entire training dataset. These results imply that when it is of interest to improve model performance for patient subpopulations beyond what can be achieved with standard practices, it may be necessary to do so via techniques that implicitly or explicitly increase the effective sample size.

【7】 Limit theorems for dependent combinatorial data, with applications in statistical inference Link: https://arxiv.org/abs/2108.12233

Author: Somabha Mukherjee Affiliations: University of Pennsylvania (doctoral dissertation, Graduate Group in Managerial Science and Applied Economics) Abstract: The Ising model is a celebrated example of a Markov random field, introduced in statistical physics to model ferromagnetism. This is a discrete exponential family with binary outcomes, where the sufficient statistic involves a quadratic term designed to capture correlations arising from pairwise interactions. However, in many situations the dependencies in a network arise not just from pairs, but from peer-group effects. A convenient mathematical framework for capturing higher-order dependencies is the $p$-tensor Ising model, where the sufficient statistic consists of a multilinear polynomial of degree $p$. This thesis develops a framework for statistical inference of the natural parameters in $p$-tensor Ising models. We begin with the Curie-Weiss Ising model, where we unearth various non-standard phenomena in the asymptotics of the maximum-likelihood (ML) estimates of the parameters, such as the presence of a critical curve in the interior of the parameter space on which these estimates have a limiting mixture distribution, and a surprising superefficiency phenomenon at the boundary point(s) of this curve. ML estimation fails in more general $p$-tensor Ising models due to the presence of a computationally intractable normalizing constant. To overcome this issue, we use the popular maximum pseudo-likelihood (MPL) method, which avoids computing the inexplicit normalizing constant based on conditional distributions. We derive general conditions under which the MPL estimate is $\sqrt{N}$-consistent, where $N$ is the size of the underlying network. Finally, we consider a more general Ising model, which incorporates high-dimensional covariates at the nodes of the network, and which can also be viewed as a logistic regression model with dependent observations. In this model, we show that the parameters can be estimated consistently under sparsity assumptions on the true covariate vector.

【8】 Parametric G-computation for Compatible Indirect Treatment Comparisons with Limited Individual Patient Data Link: https://arxiv.org/abs/2108.12208

Authors: Antonio Remiro-Azócar, Anna Heath, Gianluca Baio Affiliations: Department of Statistical Science, University College London, London; Quantitative Research, Statistical Outcomes Research & Analytics (SORA) Ltd; Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Canada; Dalla Lana School of Public Health Comments: 29 pages, 4 figures (19 additional pages in the Supplementary Material). This is the journal version of some of the research in the working paper arXiv:2008.05951. Submitted to Research Synthesis Methods. Note: text overlap with arXiv:2008.05951 Abstract: Population adjustment methods such as matching-adjusted indirect comparison (MAIC) are increasingly used to compare marginal treatment effects when there are cross-trial differences in effect modifiers and limited patient-level data. MAIC is based on propensity score weighting, which is sensitive to poor covariate overlap and cannot extrapolate beyond the observed covariate space. Current outcome regression-based alternatives can extrapolate but target a conditional treatment effect that is incompatible in the indirect comparison. When adjusting for covariates, one must integrate or average the conditional estimate over the relevant population to recover a compatible marginal treatment effect. We propose a marginalization method based on parametric G-computation that can be easily applied where the outcome regression is a generalized linear model or a Cox model. The approach views the covariate adjustment regression as a nuisance model and separates its estimation from the evaluation of the marginal treatment effect of interest. The method can accommodate a Bayesian statistical framework, which naturally integrates the analysis into a probabilistic framework. A simulation study provides proof-of-principle and benchmarks the method's performance against MAIC and the conventional outcome regression. Parametric G-computation achieves more precise and more accurate estimates than MAIC, particularly when covariate overlap is poor, and yields unbiased marginal treatment effect estimates under no failures of assumptions. Furthermore, the marginalized covariate-adjusted estimates provide greater precision and accuracy than the conditional estimates produced by the conventional outcome regression, which are systematically biased because the measure of effect is non-collapsible.
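The marginalization step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the logistic coefficients and the covariate sample below are hypothetical values chosen for illustration, and no model fitting is done. The sketch only averages predicted risks over a covariate sample under each treatment arm and shows that the resulting marginal log odds ratio differs from the conditional coefficient (non-collapsibility).

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_log_odds_ratio(beta0, beta_trt, beta_x, covariates):
    """Average predicted risks over the covariate sample under each arm,
    then contrast the two averaged risks on the log-odds scale."""
    n = len(covariates)
    p_treated = sum(expit(beta0 + beta_trt + beta_x * x) for x in covariates) / n
    p_control = sum(expit(beta0 + beta_x * x) for x in covariates) / n

    def logit(p):
        return math.log(p / (1.0 - p))

    return logit(p_treated) - logit(p_control)


# Hypothetical inputs: a conditional log odds ratio of 1.0 for treatment
# and a strongly prognostic covariate (coefficient 2.0).
covariates = [-2.0, -1.0, 0.0, 1.0, 2.0]
conditional = 1.0
marginal = marginal_log_odds_ratio(0.0, conditional, 2.0, covariates)
print(f"conditional log OR = {conditional}, marginal log OR = {marginal:.3f}")
```

With a prognostic covariate the marginal log odds ratio is pulled toward zero relative to the conditional one; when the covariate coefficient is zero the two coincide.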

【9】 Targeting Underrepresented Populations in Precision Medicine: A Federated Transfer Learning Approach Link: https://arxiv.org/abs/2108.12112

Authors: Sai Li, Tianxi Cai, Rui Duan Affiliations: Institute of Statistics and Big Data, Renmin University of China; Department of Biostatistics, Harvard University Abstract: The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research has become a barrier to translating precision medicine research into practice. Due to heterogeneity across populations, risk prediction models are often found to underperform in these underrepresented populations, and therefore may further exacerbate known health disparities. In this paper, we propose a two-way data integration strategy that integrates heterogeneous data from diverse populations and from multiple healthcare institutions via a federated transfer learning approach. The proposed method can handle the challenging setting where sample sizes from different populations are highly unbalanced. With only a small number of communications across participating sites, the proposed method can achieve performance comparable to the pooled analysis where individual-level data are directly pooled together. We show that the proposed method improves the estimation and prediction accuracy in underrepresented populations, and reduces the gap of model performance across populations. Our theoretical analysis reveals how estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We demonstrate the feasibility and validity of our methods through numerical experiments and a real application to a multi-center study, in which we construct polygenic risk prediction models for Type II diabetes in AA population.

【10】 Selection of inverse gamma and half-t priors for hierarchical models: sensitivity and recommendations Link: https://arxiv.org/abs/2108.12045

Authors: Zachary Brehm, Aaron Wagner, Erik VonKaenel, David Burton, Samuel J. Weisenthal, Martin Cole, Yiping Pang, Sally W. Thurston Affiliations: University of Rochester, Department of Biostatistics and Computational Biology Comments: 24 pages, 2 figures, submitted to Bayesian Analysis Abstract: While the importance of prior selection is well understood, establishing guidelines for selecting priors in hierarchical models has remained an active, and sometimes contentious, area of Bayesian methodology research. Choices of hyperparameters for individual families of priors are often discussed in the literature, but rarely are different families of priors compared under similar models and hyperparameters. Using simulated data, we evaluate the performance of inverse gamma and half-$t$ priors for estimating the standard deviation of random effects in three hierarchical models: the 8-schools model, a random intercepts longitudinal model, and a simple multiple outcomes model. We compare the performance of the two prior families using a range of prior hyperparameters, some of which have been suggested in the literature, and others that allow for a direct comparison of pairs of half-$t$ and inverse-gamma priors. Estimation of very small values of the random effect standard deviation led to convergence issues, especially for the half-$t$ priors. For most settings, we found that the posterior distribution of the standard deviation had smaller bias under half-$t$ priors than under their inverse-gamma counterparts. Inverse gamma priors generally gave similar coverage but had smaller interval lengths than their half-$t$ prior counterparts. Our results for these two prior families will inform prior specification for hierarchical models, allowing practitioners to better align their priors with their respective models and goals.

【11】 Contaminated Gibbs-type priors Link: https://arxiv.org/abs/2108.11997

Authors: Federico Camerlenghi, Riccardo Corradin, Andrea Ongaro Affiliations: Department of Economics, University of Milano-Bicocca Comments: 43 pages, 14 figures, 10 tables Abstract: Gibbs-type priors are widely used as key components in several Bayesian nonparametric models. By virtue of their flexibility and mathematical tractability, they turn out to be predominant priors in species sampling problems, clustering and mixture modelling. We introduce a new family of processes which extends the Gibbs-type one by including a contaminant component in the model to account for the presence of anomalies (outliers) or an excess of observations with frequency one. We first investigate the induced random partition and the associated predictive distribution, and we characterize the asymptotic behaviour of the number of clusters. All the results we obtain are in closed form and easily interpretable; as a noteworthy example, we focus on the contaminated version of the Pitman-Yor process. Finally, we pinpoint the advantage of our construction in different applied problems: we show how the contaminant component helps to perform outlier detection for an astronomical clustering problem and to improve predictive inference in a species-related dataset exhibiting a high number of species with frequency one.
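For readers unfamiliar with the predictive distribution of the Pitman-Yor process mentioned in this abstract, the sketch below implements the standard (uncontaminated) rule: with $n$ observations in $k$ clusters of sizes $n_j$, a new observation joins cluster $j$ with probability $(n_j - \sigma)/(\theta + n)$ and opens a new cluster with probability $(\theta + \sigma k)/(\theta + n)$. The contaminant component proposed in the paper is not modelled here, and the function names are illustrative.

```python
import random


def pitman_yor_predictive(cluster_sizes, sigma, theta):
    """Return (probabilities of joining each existing cluster,
    probability of opening a new cluster) for a Pitman-Yor process
    with discount 0 <= sigma < 1 and strength theta > -sigma."""
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    if n == 0:
        return [], 1.0  # the first observation always opens a cluster
    existing = [(size - sigma) / (theta + n) for size in cluster_sizes]
    new = (theta + sigma * k) / (theta + n)
    return existing, new


def sample_partition(n, sigma, theta, rng):
    """Sequentially sample cluster sizes for n observations."""
    sizes = []
    for _ in range(n):
        existing, _new = pitman_yor_predictive(sizes, sigma, theta)
        u = rng.random()
        acc = 0.0
        for j, p in enumerate(existing):
            acc += p
            if u < acc:
                sizes[j] += 1
                break
        else:
            sizes.append(1)  # no existing cluster chosen: open a new one
    return sizes


sizes = sample_partition(200, sigma=0.5, theta=1.0, rng=random.Random(0))
print(len(sizes), "clusters;", sum(1 for s in sizes if s == 1), "singletons")
```

Counting the singleton clusters in the sampled partition gives a feel for the "frequency one" observations that the contaminated model is designed to handle more flexibly.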

【12】 Chi-squared test for hypothesis testing of homogeneity Link: https://arxiv.org/abs/2108.11980

Author: Mikhail Ermakov Keywords: goodness of fit tests, consistency, chi-squared test, maxisets. Comments: 18 pages Abstract: We provide necessary and sufficient conditions for uniform consistency of nonparametric sets of alternatives of the chi-squared test for testing the hypothesis of homogeneity. The number of cells of the chi-squared test increases as the sample size grows. Nonparametric sets of alternatives can be defined both in terms of densities and distribution functions.
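As background for this entry, the classical two-sample chi-squared statistic for homogeneity can be computed from cell counts as sketched below. This is the textbook statistic with a fixed number of cells, not the growing-cell regime analysed in the paper; the function name is illustrative.

```python
def chi2_homogeneity(counts_a, counts_b):
    """Pearson chi-squared statistic for homogeneity of two samples
    binned into the same cells. Returns (statistic, degrees of freedom).
    Cells that are empty in both samples are skipped."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples must use the same cells")
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    stat = 0.0
    used_cells = 0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue
        used_cells += 1
        for observed, row_total in ((a, n_a), (b, n_b)):
            expected = row_total * col / n
            stat += (observed - expected) ** 2 / expected
    return stat, used_cells - 1


stat, dof = chi2_homogeneity([10, 20], [20, 10])
print(f"chi2 = {stat:.4f} on {dof} degree(s) of freedom")
```

The statistic would then be compared with a chi-squared quantile with the returned degrees of freedom to decide whether to reject homogeneity.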

【13】 Multiple Hypothesis Testing Framework for Spatial Signals Link: https://arxiv.org/abs/2108.12314

Authors: Martin Gölz, Abdelhak M. Zoubir, Visa Koivunen Affiliations: Department of Signal Processing and Acoustics, Aalto University Comments: Submitted to IEEE Transactions on Signal and Information Processing over Networks Abstract: The problem of identifying regions of spatially interesting, different or adversarial behavior is inherent to many practical applications involving distributed multisensor systems. In this work, we develop a general framework stemming from multiple hypothesis testing to identify such regions. A discrete spatial grid is assumed for the monitored environment. The spatial grid points associated with different hypotheses are identified while controlling the false discovery rate at a pre-specified level. Measurements are acquired using a large-scale sensor network. We propose a novel, data-driven method to estimate local false discovery rates based on the spectral method of moments. Our method is agnostic to specific spatial propagation models of the underlying physical phenomenon. It relies on a broadly applicable density model for local summary statistics. In between sensors, locations are assigned to regions associated with different hypotheses based on interpolated local false discovery rates. The benefits of our method are illustrated by applications to spatially propagating radio waves.
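To make "controlling the false discovery rate at a pre-specified level" concrete, here is a sketch of the classical Benjamini-Hochberg step-up procedure applied to per-grid-point p-values. This is a stand-in technique, not the paper's method, which estimates local false discovery rates with a spectral method of moments; the p-values below are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at nominal FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values, one per grid point.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print("grid points flagged:", rejected)
```

Only the two smallest p-values fall below their step-up thresholds here, so two grid points are flagged.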

【14】 Evaluation of individual attributes associated with shared HIV risk behaviors among two network-based studies of people who inject drugs Link: https://arxiv.org/abs/2108.12287

Authors: Valerie Ryan, TingFang Lee, Ashley L. Buchanan, Natallia V. Katenka, Samuel R. Friedman, Georgios Nikolopoulos Affiliations: Department of Computer Science and Statistics, University of Rhode Island; Department of Pharmacy Practice; Department of Population Health, NYU Grossman School of Medicine; Medical School, University of Cyprus Comments: 19 pages Abstract: Social context plays an important role in perpetuating or reducing HIV risk behaviors. This study analyzed the network and individual attributes that were associated with the likelihood that people who inject drugs (PWID) will engage in HIV risk behaviors with one another. We analyze data collected in the Social Risk Factors and HIV Risk Study (SFHR) and the Transmission Reduction Intervention Project (TRIP). Exponential random graph models were used to determine which attributes were associated with the likelihood of people engaging in HIV risk behaviors, such as injection behaviors that are associated with one another, among PWID. Results across all models and across both data sets indicated that people were more likely to engage in risk behaviors with others who were similar to them in some way (e.g., were the same sex, race/ethnicity, living conditions). In both SFHR and TRIP, we explore the effects of missingness at individual and network levels on the likelihood of individuals to engage in HIV risk behaviors among PWID. In this study, we found that known individual-level risk factors, including housing instability and race/ethnicity, are also important factors in determining the structure of the observed network among PWID. Future development of interventions should consider not only individual risk factors, but also communities and social influences that leave individuals vulnerable to HIV risk.

【15】 Quantum Sub-Gaussian Mean Estimator Link: https://arxiv.org/abs/2108.12172

Author: Yassine Hamoudi Affiliations: Université de Paris, IRIF, CNRS, F-, Paris, France. Comments: 20 pages Abstract: We present a new quantum algorithm for estimating the mean of a real-valued random variable obtained as the output of a quantum computation. Our estimator achieves a nearly-optimal quadratic speedup over the number of classical i.i.d. samples needed to estimate the mean of a heavy-tailed distribution with a sub-Gaussian error rate. This result subsumes (up to logarithmic factors) earlier works on the mean estimation problem that were not optimal for heavy-tailed distributions [BHMT02,BDGT11], or that require prior information on the variance [Hein02,Mon15,HM19]. As an application, we obtain new quantum algorithms for the $(\epsilon,\delta)$-approximation problem with an optimal dependence on the coefficient of variation of the input random variable.

【16】 An Introduction to Hamiltonian Monte Carlo Method for Sampling Link: https://arxiv.org/abs/2108.12107

Author: Nisheeth K. Vishnoi Comments: This exposition is to supplement the talk by the author at the Bootcamp in the semester on Geometric Methods for Optimization and Sampling at the Simons Institute for the Theory of Computing Abstract: The goal of this article is to introduce the Hamiltonian Monte Carlo (HMC) method, a Hamiltonian dynamics-inspired algorithm for sampling from a Gibbs density $\pi(x) \propto e^{-f(x)}$. We focus on the "idealized" case, where one can compute continuous trajectories exactly. We show that idealized HMC preserves $\pi$ and we establish its convergence when $f$ is strongly convex and smooth.
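The "idealized" setting can be tried out directly in one special case. For $f(x) = x^2/2$ (a standard Gaussian target), Hamilton's equations $\dot{x} = p$, $\dot{p} = -x$ have the exact solution $x(t) = x_0 \cos t + p_0 \sin t$, so no numerical integrator is needed. The sketch below is an illustration under that assumption only; the function name and the integration time are arbitrary choices, not taken from the article.

```python
import math
import random


def idealized_hmc_gaussian(x0, n_steps, t, rng):
    """Idealized HMC for the target pi(x) proportional to exp(-x^2 / 2).

    Each step draws a fresh momentum p ~ N(0, 1) and follows the exact
    Hamiltonian flow for time t, which for this target is a rotation
    in (x, p) phase space."""
    x = x0
    samples = []
    for _ in range(n_steps):
        p = rng.gauss(0.0, 1.0)
        x = x * math.cos(t) + p * math.sin(t)
        samples.append(x)
    return samples


samples = idealized_hmc_gaussian(x0=3.0, n_steps=20000, t=1.0, rng=random.Random(0))
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

Because the flow is exact, there is no accept/reject step; the sample mean and variance should settle near 0 and 1, consistent with the chain preserving the Gaussian target.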

【17】 Statistical Quantification of Differential Privacy: A Local Approach Link: https://arxiv.org/abs/2108.09528

Authors: Önder Askin, Tim Kutta, Holger Dette Affiliations: Ruhr-University Bochum Abstract: In this work we introduce a new approach for statistical quantification of differential privacy in a black box setting. We present estimators and confidence intervals for the optimal privacy parameter of a randomized algorithm A, as well as other key variables (such as the novel "data-centric privacy level"). Our estimators are based on a local characterization of privacy and, in contrast to the related literature, avoid the process of "event selection", a major obstacle to privacy validation. This makes our methods easy to implement and user-friendly. We show fast convergence rates of the estimators and asymptotic validity of the confidence intervals. An experimental study of various algorithms confirms the efficacy of our approach.

【18】 Rank Energy Statistics in the Context of Change Point Detection Link: https://arxiv.org/abs/2108.04903

Author: Amanda Ng Affiliations: The Bronx High School of Science; mentor: Bodhisattva Sen, Department of Statistics Comments: 5 pages and 1 figure Abstract: In this paper, I propose a general procedure for multivariate distribution-free nonparametric testing derived from the concept of ranks that are based upon measure transportation in the context of multiple change point analysis. I will use this algorithm to estimate both the number of change points and their locations within an observed multivariate time series. In this paper, the change point problem is observed in a general setting in which both the given distribution and number of change points are unknown, rather than assuming that the observed time series follows a specific distribution or contains only one change point, as many works in this area of study assume. The intention of this is to develop a technique for accurately identifying the changes in a distribution while making as few suppositions as possible. The rank energy statistic used here is based on energy statistics and has the potential to detect any change in a distribution. I present the properties of this new algorithm, which can be used to analyze various datasets, including hierarchical clustering, testing multivariate normality, gene selection, and microarray data analysis. This algorithm has also been implemented in the R package recp, which is available on GitHub.
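The energy statistic underlying this entry can be sketched in a few lines. The example below uses the plain (V-statistic) energy distance $2\,E|X-Y| - E|X-X'| - E|Y-Y'|$ on univariate data and scans for a single split point; it does not use the measure-transportation ranks proposed in the paper, does not handle multiple change points, and is unrelated to the code of the recp package. The weighting $t(n-t)/n$ and the function names are illustrative choices.

```python
def mean_abs_diff(xs, ys):
    """Average |x - y| over all pairs (x in xs, y in ys)."""
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Sample energy distance between two univariate samples."""
    return 2.0 * mean_abs_diff(xs, ys) - mean_abs_diff(xs, xs) - mean_abs_diff(ys, ys)


def best_single_split(series, min_size=2):
    """Scan all split points and return (index, statistic) of the split
    that maximises the size-weighted energy distance between segments."""
    n = len(series)
    best_t, best_stat = None, float("-inf")
    for t in range(min_size, n - min_size + 1):
        stat = t * (n - t) / n * energy_distance(series[:t], series[t:])
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat


series = [0.0] * 10 + [5.0] * 10
split, stat = best_single_split(series)
print(f"estimated change point at index {split} (statistic {stat:.1f})")
```

In practice the maximised statistic would be calibrated, for example by a permutation test, before declaring a change point.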

【19】 The ICSCREAM methodology: Identification of penalizing configurations in computer experiments using screening and metamodel -- Applications in thermal-hydraulics Link: https://arxiv.org/abs/2004.04663

Authors: A. Marrel, Bertrand Iooss, V. Chabridon Affiliations: CEA, DES, IRESNE, DER, F-, Saint-Paul-lez-Durance, France; B. Iooss and V. Chabridon: EDF R&D, Quai Watier, Chatou, France Abstract: In the framework of risk assessment in nuclear accident analysis, best-estimate computer codes, associated with a probabilistic modeling of the uncertain input variables, are used to estimate safety margins. A first step in such uncertainty quantification studies is often to identify the critical configurations (or penalizing configurations, in the sense of a prescribed safety margin) of several input parameters (called "scenario inputs"), under the uncertainty on the other input parameters. However, the large CPU-time cost of most of the computer codes used in nuclear engineering, such as those related to thermal-hydraulic accident scenario simulations, requires the development of highly efficient strategies. This work focuses on machine learning algorithms by way of the metamodel-based approach (i.e., a mathematical model fitted to a small sample of simulations). To achieve this with a very large number of inputs, a specific and original methodology, called ICSCREAM (Identification of penalizing Configurations using SCREening And Metamodel), is proposed. The screening of influential inputs is based on an advanced global sensitivity analysis tool (HSIC importance measures). A Gaussian process metamodel is then sequentially built and used to estimate, within a Bayesian framework, the conditional probabilities of exceeding a high-level threshold, according to the scenario inputs. The efficiency of this methodology is illustrated on two high-dimensional (around a hundred inputs) thermal-hydraulic industrial cases simulating an accident of primary coolant loss in a pressurized water reactor. For both use cases, the study focuses on the peak cladding temperature (PCT), and critical configurations are defined by exceeding the 90% quantile of PCT. In both cases, the ICSCREAM methodology makes it possible to estimate, using only around one thousand code simulations, the impact of the scenario inputs and their critical ranges of values.
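The last estimation step can be pictured with a small sketch: if a metamodel returns a Gaussian predictive distribution at each input point, the probability of exceeding a threshold has a closed form, and averaging it over sampled values of the other uncertain inputs gives a probability conditional on the scenario inputs. The code assumes the predictive means and standard deviations are already available. It does not fit a Gaussian process, and every number and function name in it is hypothetical rather than taken from the paper.

```python
from statistics import NormalDist

def exceedance_probability(pred_mean, pred_std, threshold):
    """P(Y > threshold) for Y ~ Normal(pred_mean, pred_std**2)."""
    if pred_std <= 0:
        raise ValueError("pred_std must be positive")
    return 1.0 - NormalDist(pred_mean, pred_std).cdf(threshold)

def conditional_exceedance(predictions, threshold):
    """Average the pointwise probabilities over (mean, std) predictions.

    Each pair is assumed to come from one sampled value of the other
    uncertain inputs, with the scenario inputs held fixed.
    """
    total = sum(exceedance_probability(m, s, threshold) for m, s in predictions)
    return total / len(predictions)

# Made-up predictive (mean, std) pairs for two scenario configurations,
# in arbitrary units, with a made-up threshold.
threshold = 1000.0
scenario_a = [(900.0, 40.0), (950.0, 50.0), (980.0, 60.0)]
scenario_b = [(1000.0, 40.0), (1050.0, 50.0), (1100.0, 60.0)]
prob_a = conditional_exceedance(scenario_a, threshold)
prob_b = conditional_exceedance(scenario_b, threshold)
print(round(prob_a, 3), round(prob_b, 3))
```

Under these assumptions, a configuration with a higher conditional probability (here scenario_b) would be the more penalizing one.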
