stat (Statistics): 29 papers in total
【1】 Compositional Active Inference I: Bayesian Lenses. Statistical Games Link: https://arxiv.org/abs/2109.04461
Authors: Toby St. Clere Smithe Institutions: University of Oxford & Topos Institute Note: 29 pages; comments welcome Abstract: We introduce the concepts of Bayesian lens, characterizing the bidirectional structure of exact Bayesian inference, and statistical game, formalizing the optimization objectives of approximate inference problems. We prove that Bayesian inversions compose according to the compositional lens pattern, and exemplify statistical games with a number of classic statistical concepts, from maximum likelihood estimation to generalized variational Bayesian methods. This paper is the first in a series laying the foundations for a compositional account of the theory of active inference, and we therefore pay particular attention to statistical games with a free-energy objective.
【2】 Modeling Massive Spatial Datasets Using a Conjugate Bayesian Linear Regression Framework Link: https://arxiv.org/abs/2109.04447
Authors: Sudipto Banerjee Institutions: UCLA Department of Biostatistics, Los Angeles, CA Abstract: Geographic Information Systems (GIS) and related technologies have generated substantial interest among statisticians with regard to scalable methodologies for analyzing large spatial datasets. A variety of scalable spatial process models have been proposed that can be easily embedded within a hierarchical modeling framework to carry out Bayesian inference. While the focus of statistical research has mostly been directed toward innovative and more complex model development, relatively limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article discusses how point-referenced spatial process models can be cast as a conjugate Bayesian linear regression that can rapidly deliver inference on spatial processes. The approach allows exact sampling directly (avoids iterative algorithms such as Markov chain Monte Carlo) from the joint posterior distribution of regression parameters, the latent process and the predictive random variables, and can be easily implemented on statistical programming environments such as R.
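The computational point in 【2】 is that a conjugate formulation makes the joint posterior available in closed form, so draws are exact rather than MCMC-based. Below is a minimal, hedged sketch of exact composition sampling from a conjugate Normal-Inverse-Gamma Bayesian linear regression; the function name, priors and toy data are illustrative assumptions, and the paper's actual spatial-process construction (latent process, predictive variables) is not reproduced here.

```python
import numpy as np

def sample_conjugate_blr(X, y, mu0, V0, a0, b0, n_draws=1000, rng=None):
    """Exact draws from the Normal-Inverse-Gamma posterior of a conjugate
    Bayesian linear regression: y ~ N(X beta, sigma^2 I),
    beta | sigma^2 ~ N(mu0, sigma^2 V0), sigma^2 ~ IG(a0, b0)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    V0_inv = np.linalg.inv(V0)
    Vn_inv = V0_inv + X.T @ X
    Vn = np.linalg.inv(Vn_inv)
    mun = Vn @ (V0_inv @ mu0 + X.T @ y)
    an = a0 + n / 2.0
    bn = b0 + 0.5 * (y @ y + mu0 @ V0_inv @ mu0 - mun @ Vn_inv @ mun)
    # composition sampling: sigma^2 ~ IG(an, bn), then beta | sigma^2 ~ N(mun, sigma^2 Vn)
    sigma2 = 1.0 / rng.gamma(shape=an, scale=1.0 / bn, size=n_draws)
    L = np.linalg.cholesky(Vn)
    z = rng.standard_normal((n_draws, p))
    beta = mun + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

# toy usage
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=50)
beta, sigma2 = sample_conjugate_blr(X, y, mu0=np.zeros(2), V0=100 * np.eye(2), a0=2.0, b0=1.0)
print(beta.mean(axis=0), sigma2.mean())
```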
【3】 Extreme Bandits using Robust Statistics Link: https://arxiv.org/abs/2109.04433
Authors: Sujay Bhatt, Ping Li, Gennady Samorodnitsky Institutions: Cognitive Computing Lab, Baidu Research, Bellevue, WA, USA; School of ORIE, Cornell University, Ithaca, NY, USA Abstract: We consider a multi-armed bandit problem motivated by situations where only the extreme values, as opposed to expected values in the classical bandit setting, are of interest. We propose distribution free algorithms using robust statistics and characterize the statistical properties. We show that the provided algorithms achieve vanishing extremal regret under weaker conditions than existing algorithms. Performance of the algorithms is demonstrated for the finite-sample setting using numerical experiments. The results show superior performance of the proposed algorithms compared to the well known algorithms.
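To give a feel for the extreme-bandit setting in 【3】 (the objective is the largest observed reward, not the largest mean), the sketch below scores each arm with a simple robust statistic, the median of block maxima of its past rewards, and pulls the arm with the largest score. This is only an illustrative heuristic on assumed Pareto reward distributions, not the paper's algorithm or its regret-optimal index.

```python
import numpy as np

def extreme_bandit_blockmax(arms, horizon, block=20, rng=None):
    """Toy extreme-bandit heuristic: score each arm by the median of the maxima
    of fixed-size blocks of its observed rewards (a robust proxy for the heaviness
    of the right tail), and pull the arm with the largest score."""
    rng = np.random.default_rng(rng)
    history = [[] for _ in arms]
    for _ in range(horizon):
        scores = []
        for h in history:
            if len(h) < block:                      # force some initial exploration
                scores.append(np.inf)
            else:
                blocks = np.array(h[: len(h) // block * block]).reshape(-1, block)
                scores.append(np.median(blocks.max(axis=1)))
        a = int(np.argmax(scores))
        history[a].append(arms[a](rng))
    return max(max(h) for h in history if h)        # largest reward ever observed

# two arms with different tail heaviness (smaller Pareto shape = heavier tail)
arms = [lambda r: r.pareto(2.5), lambda r: r.pareto(1.5)]
print(extreme_bandit_blockmax(arms, horizon=2000))
```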
【4】 Goodness-of-Fit Testing for Hölder-Continuous Densities: Sharp Local Minimax Rates Link: https://arxiv.org/abs/2109.04346
Authors: Julien Chhor, Alexandra Carpentier Institutions: CREST-ENSAE; OvGU Magdeburg Note: 76 pages Abstract: We consider the goodness-of-fit testing problem for Hölder smooth densities over $\mathbb{R}^d$: given $n$ iid observations with unknown density $p$ and given a known density $p_0$, we investigate how large $\rho$ should be to distinguish, with high probability, the case $p=p_0$ from the composite alternative of all Hölder-smooth densities $p$ such that $\|p-p_0\|_t \geq \rho$ where $t \in [1,2]$. The densities are assumed to be defined over $\mathbb{R}^d$ and to have Hölder smoothness parameter $\alpha>0$. In the present work, we solve the case $\alpha \leq 1$ and handle the case $\alpha>1$ using an additional technical restriction on the densities. We identify matching upper and lower bounds on the local minimax rates of testing, given explicitly in terms of $p_0$. We propose novel test statistics which we believe could be of independent interest. We also establish the first definition of an explicit cutoff $u_B$ allowing us to split $\mathbb{R}^d$ into a bulk part (defined as the subset of $\mathbb{R}^d$ where $p_0$ takes only values greater than or equal to $u_B$) and a tail part (defined as the complementary of the bulk), each part involving fundamentally different contributions to the local minimax rates of testing.
【5】 Adaptive importance sampling for seismic fragility curve estimation Link: https://arxiv.org/abs/2109.04323
Authors: Clement Gauchy, Cyril Feau, Josselin Garnier Abstract: As part of Probabilistic Risk Assessment studies, it is necessary to study the fragility of mechanical and civil engineered structures when subjected to seismic loads. This risk can be measured with fragility curves, which express the probability of failure of the structure conditionally to a seismic intensity measure. The estimation of fragility curves relies on time-consuming numerical simulations, so that careful experimental design is required in order to gain the maximum information on the structure's fragility with a limited number of code evaluations. We propose and implement an active learning methodology based on adaptive importance sampling in order to reduce the variance of the training loss. The efficiency of the proposed method in terms of bias, standard deviation and prediction interval coverage are theoretically and numerically characterized.
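The fragility curve in 【5】 is commonly parameterized as a lognormal CDF in the intensity measure, $P(\text{failure} \mid IM = a) = \Phi((\ln a - \ln\alpha)/\beta)$. The sketch below fits that curve by maximum likelihood from binary failure outcomes on synthetic data; it illustrates the estimand only, not the paper's adaptive importance-sampling design, and all names and values are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_lognormal_fragility(im, failed):
    """MLE of the lognormal fragility curve P(failure | IM = a) = Phi((ln a - ln alpha)/beta)
    from intensity measures `im` and binary failure indicators `failed`."""
    def nll(theta):
        log_alpha, log_beta = theta
        p = norm.cdf((np.log(im) - log_alpha) / np.exp(log_beta))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(failed * np.log(p) + (1 - failed) * np.log1p(-p))
    res = minimize(nll, x0=np.zeros(2), method="Nelder-Mead")
    return np.exp(res.x[0]), np.exp(res.x[1])   # (alpha, beta)

# synthetic data with true alpha = 1.5, beta = 0.4
rng = np.random.default_rng(1)
im = rng.lognormal(mean=0.0, sigma=0.6, size=400)
failed = (rng.random(400) < norm.cdf((np.log(im) - np.log(1.5)) / 0.4)).astype(float)
print(fit_lognormal_fragility(im, failed))      # should be close to (1.5, 0.4)
```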
【6】 Optimizing Precision and Power by Machine Learning in Randomized Trials, with an Application to COVID-19 Link: https://arxiv.org/abs/2109.04294
Authors: Nicholas Williams, Michael Rosenblum, Iván Díaz Institutions: Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health Abstract: The rapid finding of effective therapeutics requires the efficient use of available resources in clinical trials. The use of covariate adjustment can yield statistical estimates with improved precision, resulting in a reduction in the number of participants required to draw futility or efficacy conclusions. We focus on time-to-event and ordinal outcomes. A key question for covariate adjustment in randomized studies is how to fit a model relating the outcome and the baseline covariates to maximize precision. We present a novel theoretical result establishing conditions for asymptotic normality of a variety of covariate-adjusted estimators that rely on machine learning (e.g., l1-regularization, Random Forests, XGBoost, and Multivariate Adaptive Regression Splines), under the assumption that outcome data is missing completely at random. We further present a consistent estimator of the asymptotic variance. Importantly, the conditions do not require the machine learning methods to converge to the true outcome distribution conditional on baseline variables, as long as they converge to some (possibly incorrect) limit. We conducted a simulation study to evaluate the performance of the aforementioned prediction methods in COVID-19 trials using longitudinal data from over 1,500 patients hospitalized with COVID-19 at Weill Cornell Medicine New York Presbyterian Hospital. We found that using l1-regularization led to estimators and corresponding hypothesis tests that control type 1 error and are more precise than an unadjusted estimator across all sample sizes tested. We also show that when covariates are not prognostic of the outcome, l1-regularization remains as precise as the unadjusted estimator, even at small sample sizes (n = 100). We give an R package adjrct that performs model-robust covariate adjustment for ordinal and time-to-event outcomes.
【7】 Posterior Concentration Rates for Bayesian O'Sullivan Penalized Splines Link: https://arxiv.org/abs/2109.04288
Authors: Paul Bach, Nadja Klein Abstract: The O'Sullivan penalized splines approach is a popular frequentist approach for nonparametric regression. Thereby, the unknown regression function is expanded in a rich spline basis and a roughness penalty based on the integrated squared $q$th derivative is used for regularization. While the asymptotic properties of O'Sullivan penalized splines in a frequentist setting have been investigated extensively, the theoretical understanding of the Bayesian counterpart has been missing so far. In this paper, we close this gap and study the asymptotics of the Bayesian counterpart of the frequentist O-splines approach. We derive sufficient conditions for the entire posterior distribution to concentrate around the true regression function at near optimal rate. Our results show that posterior concentration at near optimal rate can be achieved with a faster rate for the number of spline knots than the slow regression spline rate that is commonly being used. Furthermore, posterior concentration at near optimal rate can be achieved with several different hyperpriors on the smoothing variance such as a Gamma and a Weibull hyperprior.
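For orientation, the frequentist O'Sullivan estimator that 【7】 builds its Bayesian counterpart on minimizes a penalized least-squares criterion over a rich B-spline expansion (a standard formulation, stated here for concreteness):

$$\hat f \;=\; \operatorname*{arg\,min}_{f = \sum_{k=1}^{K} \beta_k B_k} \; \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 \;+\; \lambda \int \bigl( f^{(q)}(x) \bigr)^2 \, dx .$$

In the Bayesian version studied in the paper, this roughness penalty corresponds (up to its partially improper part) to a zero-mean Gaussian prior on the spline coefficients, with a smoothing variance playing essentially the role of $1/\lambda$ and carrying the Gamma or Weibull hyperprior mentioned in the abstract.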
【8】 Supervised Linear Dimension-Reduction Methods: Review, Extensions, and Comparisons Link: https://arxiv.org/abs/2109.04244
Authors: Shaojie Xu, Joel Vaughan, Jie Chen, Agus Sudjianto, Vijayan Nair Institutions: Wells Fargo & Company Abstract: Principal component analysis (PCA) is a well-known linear dimension-reduction method that has been widely used in data analysis and modeling. It is an unsupervised learning technique that identifies a suitable linear subspace for the input variable that contains maximal variation and preserves as much information as possible. PCA has also been used in prediction models where the original, high-dimensional space of predictors is reduced to a smaller, more manageable, set before conducting regression analysis. However, this approach does not incorporate information in the response during the dimension-reduction stage and hence can have poor predictive performance. To address this concern, several supervised linear dimension-reduction techniques have been proposed in the literature. This paper reviews selected techniques, extends some of them, and compares their performance through simulations. Two of these techniques, partial least squares (PLS) and least-squares PCA (LSPCA), consistently outperform the others in this study.
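A minimal sketch of the contrast drawn in 【8】 between unsupervised reduction (principal component regression) and supervised reduction (PLS), using scikit-learn; the simulated design, in which the signal sits in low-variance directions that PCA tends to discard, is an illustrative assumption.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 300, 50
X = rng.normal(size=(n, p)) * np.linspace(3.0, 0.3, p)   # columns with decaying variance
beta = np.zeros(p); beta[-3:] = 2.0                        # signal in low-variance directions
y = X @ beta + rng.normal(scale=0.5, size=n)

# unsupervised reduction: PCA picks the high-variance directions, ignoring y
pcr = make_pipeline(PCA(n_components=3), LinearRegression())
# supervised reduction: PLS picks directions with high covariance with y
pls = PLSRegression(n_components=3)

print("PCR R^2:", cross_val_score(pcr, X, y, cv=5).mean())
print("PLS R^2:", cross_val_score(pls, X, y, cv=5).mean())
```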
【9】 Evaluation of imputation techniques with varying percentage of missing data Link: https://arxiv.org/abs/2109.04227
Authors: Seema Sangari, Herman E. Ray Note: 16 pages, 21 figures, 3 tables Abstract: Missing data is a common problem which has consistently plagued statisticians and applied analytical researchers. While replacement methods like mean-based or hot deck imputation have been well researched, emerging imputation techniques enabled through improved computational resources have had limited formal assessment. This study formally considers five more recently developed imputation methods: Amelia, Mice, mi, Hmisc and missForest - compares their performances using RMSE against actual values and against the well-established mean-based replacement approach. The RMSE measure was consolidated by method using a ranking approach. Our results indicate that the missForest algorithm performed best and the mi algorithm performed worst.
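The evaluation protocol in 【9】 (mask known values, impute, compare imputed values with the truth by RMSE) takes only a few lines to reproduce. The sketch below uses scikit-learn imputers as stand-ins, since the R packages compared in the paper (Amelia, Mice, mi, Hmisc, missForest) are not called here; the dataset and missingness rate are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = load_diabetes().data.copy()
mask = rng.random(X.shape) < 0.20                 # 20% missing completely at random
X_miss = X.copy(); X_miss[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    # random-forest-based iterative imputation, in the spirit of missForest
    "rf-iterative": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        max_iter=5, random_state=0),
}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"{name:>12s}  RMSE = {rmse:.4f}")
```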
【10】 Kurtosis-based projection pursuit for matrix-valued data Link: https://arxiv.org/abs/2109.04167
Authors: Una Radojicic, Klaus Nordhausen, Joni Virta Institutions: Vienna University of Technology, University of Jyväskylä, University of Turku Abstract: We develop projection pursuit for data that admit a natural representation in matrix form. For projection indices, we propose extensions of the classical kurtosis and Mardia's multivariate kurtosis. The first index estimates projections for both sides of the matrices simultaneously, while the second index finds the two projections separately. Both indices are shown to recover the optimally separating projection for two-group Gaussian mixtures in the full absence of any label information. We further establish the strong consistency of the corresponding sample estimators. Simulations and a real data example on hand-written postal code data are used to demonstrate the method.
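To convey the idea behind the projection index in 【10】 in the simpler vector-valued case: for a balanced two-group Gaussian mixture, the group-separating direction is the one whose projection has the most extreme (here, minimal) kurtosis, so projection pursuit can recover it without labels. The sketch below is an illustrative vector-data version with random restarts, not the matrix-valued procedure of the paper; labels are used only for the final check.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n, d = 1000, 5
labels = rng.random(n) < 0.5
mu = np.zeros(d); mu[0] = 3.0
X = rng.normal(size=(n, d)) + np.where(labels[:, None], mu, -mu) / 2   # balanced mixture

# whiten, then search for the unit direction with minimal projection kurtosis
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T / s * np.sqrt(n)                    # whitened data (identity covariance)

def proj_kurtosis(u):
    u = u / np.linalg.norm(u)
    return kurtosis(Z @ u, fisher=True)           # excess kurtosis of the 1D projection

best = min((minimize(proj_kurtosis, rng.normal(size=d), method="Nelder-Mead")
            for _ in range(5)), key=lambda r: r.fun)
u = best.x / np.linalg.norm(best.x)
print("alignment with group labels:",
      abs(np.corrcoef(Z @ u, labels.astype(float))[0, 1]))
```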
【11】 Multi-population mortality forecasting using high-dimensional functional factor models Link: https://arxiv.org/abs/2109.04146
Authors: Chen Tang, Han Lin Shang, Yanrong Yang Abstract: This paper proposes a two-fold factor model for high-dimensional functional time series (HDFTS), which enables the modeling and forecasting of multi-population mortality under the functional data framework. The proposed model first decomposes the HDFTS into functional time series with lower dimensions (common feature) and a system of basis functions specific to different cross-sections (heterogeneity). Then the lower-dimensional common functional time series are further reduced into low-dimensional scalar factor matrices. The dimensionally reduced factor matrices can reasonably convey useful information in the original HDFTS. All the temporal dynamics contained in the original HDFTS are extracted to facilitate forecasting. The proposed model can be regarded as a general case of several existing functional factor models. Through a Monte Carlo simulation, we demonstrate the performance of the proposed method in model fitting. In an empirical study of the Japanese subnational age-specific mortality rates, we show that the proposed model produces more accurate point and interval forecasts in modeling multi-population mortality than those existing functional factor models. The financial impact of the improvements in forecasts is demonstrated through comparisons in life annuity pricing practices.
【12】 Popularity Adjusted Block Models are Generalized Random Dot Product Graphs Link: https://arxiv.org/abs/2109.04010
Authors: John Koo, Minh Tang, Michael W. Trosset Institutions: Department of Statistics, Indiana University; Department of Statistics, North Carolina State University Note: 33 pages, 7 figures Abstract: We connect two random graph models, the Popularity Adjusted Block Model (PABM) and the Generalized Random Dot Product Graph (GRDPG), by demonstrating that the PABM is a special case of the GRDPG in which communities correspond to mutually orthogonal subspaces of latent vectors. This insight allows us to construct new algorithms for community detection and parameter estimation for the PABM, as well as improve an existing algorithm that relies on Sparse Subspace Clustering. Using established asymptotic properties of Adjacency Spectral Embedding for the GRDPG, we derive asymptotic properties of these algorithms. In particular, we demonstrate that the absolute number of community detection errors tends to zero as the number of graph vertices tends to infinity. Simulation experiments illustrate these properties.
【13】 Differentially private methods for managing model uncertainty in linear regression models Link: https://arxiv.org/abs/2109.03949
Authors: Víctor Peña, Andrés F. Barrientos Institutions: The City University of New York; Florida State University Abstract: Statistical methods for confidential data are in high demand due to an increase in computational power and changes in privacy law. This article introduces differentially private methods for handling model uncertainty in linear regression models. More precisely, we provide differentially private Bayes factors, posterior probabilities, likelihood ratio statistics, information criteria, and model-averaged estimates. Our methods are asymptotically consistent and easy to run with existing implementations of non-private methods.
【14】 Sharp regret bounds for empirical Bayes and compound decision problems Link: https://arxiv.org/abs/2109.03943
Authors: Yury Polyanskiy, Yihong Wu Abstract: We consider the classical problems of estimating the mean of an $n$-dimensional normally (with identity covariance matrix) or Poisson distributed vector under the squared loss. In a Bayesian setting the optimal estimator is given by the prior-dependent conditional mean. In a frequentist setting various shrinkage methods were developed over the last century. The framework of empirical Bayes, put forth by Robbins (1956), combines Bayesian and frequentist mindsets by postulating that the parameters are independent but with an unknown prior and aims to use a fully data-driven estimator to compete with the Bayesian oracle that knows the true prior. The central figure of merit is the regret, namely, the total excess risk over the Bayes risk in the worst case (over the priors). Although this paradigm was introduced more than 60 years ago, little is known about the asymptotic scaling of the optimal regret in the nonparametric setting. We show that for the Poisson model with compactly supported and subexponential priors, the optimal regret scales as $\Theta((\frac{\log n}{\log\log n})^2)$ and $\Theta(\log^3 n)$, respectively, both attained by the original estimator of Robbins. For the normal mean model, the regret is shown to be at least $\Omega((\frac{\log n}{\log\log n})^2)$ and $\Omega(\log^2 n)$ for compactly supported and subgaussian priors, respectively, the former of which resolves the conjecture of Singh (1979) on the impossibility of achieving bounded regret; before this work, the best regret lower bound was $\Omega(1)$. In addition to the empirical Bayes setting, these results are shown to hold in the compound setting where the parameters are deterministic. As a side application, the construction in this paper also leads to improved or new lower bounds for mixture density estimation.
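The "original estimator of Robbins" that attains these Poisson rates is simple enough to state in full: it replaces the unknown marginal pmf in the Bayes rule $\mathbb{E}[\theta \mid X = x] = (x+1) f(x+1)/f(x)$ by empirical frequencies. A minimal sketch follows; the Gamma prior in the toy check is an arbitrary choice that the estimator never sees.

```python
import numpy as np

def robbins_poisson(x):
    """Robbins' (1956) empirical Bayes estimator for Poisson means:
    E[theta | X = x] = (x + 1) * f(x + 1) / f(x), with the unknown marginal
    pmf f replaced by the empirical frequencies of the observed counts."""
    x = np.asarray(x)
    freq = np.bincount(x, minlength=x.max() + 2) / len(x)
    return (x + 1) * freq[x + 1] / freq[x]   # freq[x] > 0 for every observed count x

# toy check: means drawn from a prior unknown to the estimator, one Poisson draw each
rng = np.random.default_rng(0)
theta = rng.gamma(shape=2.0, scale=1.5, size=5000)
x = rng.poisson(theta)
print("MLE risk     :", np.mean((x - theta) ** 2))
print("Robbins risk :", np.mean((robbins_poisson(x) - theta) ** 2))
```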
【15】 Estimation for recurrent events through conditional estimating equations Link: https://arxiv.org/abs/2109.03932
Authors: Ioana Schiopu-Kratina, Hai Yan Liu, Mayer Alvo, Pierre-Jerome Bergeron Abstract: We present new estimators for the statistical analysis of the dependence of the mean gap time length between consecutive recurrent events, on a set of explanatory random variables and in the presence of right censoring. The dependence is expressed through regression-like and overdispersion parameters, estimated via conditional estimating equations. The mean and variance of the length of each gap time, conditioned on the observed history of prior events and other covariates, are known functions of parameters and covariates. Under certain conditions on censoring, we construct normalized estimating functions that are asymptotically unbiased and contain only observed data. We discuss the existence, consistency and asymptotic normality of a sequence of estimators of the parameters, which are roots of these estimating equations. Simulations suggest that our estimators could be used successfully with a relatively small sample size in a study of short duration.
【16】 Learning the hypotheses space from data through a U-curve algorithm: a statistically consistent complexity regularizer for Model Selection Link: https://arxiv.org/abs/2109.03866
Authors: Diego Marcondes, Adilson Simonis, Junior Barrera Note: This work is a merger of arXiv:2001.09532 and arXiv:2001.11578 Abstract: This paper proposes a data-driven systematic, consistent and non-exhaustive approach to Model Selection, that is an extension of the classical agnostic PAC learning model. In this approach, learning problems are modeled not only by a hypothesis space $\mathcal{H}$, but also by a Learning Space $\mathbb{L}(\mathcal{H})$, a poset of subspaces of $\mathcal{H}$, which covers $\mathcal{H}$ and satisfies a property regarding the VC dimension of related subspaces, that is a suitable algebraic search space for Model Selection algorithms. Our main contributions are a data-driven general learning algorithm to perform regularized Model Selection on $\mathbb{L}(\mathcal{H})$ and a framework under which one can, theoretically, better estimate a target hypothesis with a given sample size by properly modeling $\mathbb{L}(\mathcal{H})$ and employing high computational power. A remarkable consequence of this approach are conditions under which a non-exhaustive search of $\mathbb{L}(\mathcal{H})$ can return an optimal solution. The results of this paper lead to a practical property of Machine Learning, that the lack of experimental data may be mitigated by a high computational capacity. In a context of continuous popularization of computational power, this property may help understand why Machine Learning has become so important, even where data is expensive and hard to get.
【17】 Simplifying small area estimation with rFIA: a demonstration of tools and techniques Link: https://arxiv.org/abs/2109.03863
Authors: Hunter Stanke, Andrew O. Finley, Grant M. Domke Institutions: Department of Forestry, Michigan State University, East Lansing, MI, USA; School of Environmental and Forest Sciences, University of Washington, Seattle, WA, USA; Forest Service, Northern Research Station, US Department of Agriculture, St Paul, MN, USA Abstract: The United States (US) Forest Service Forest Inventory and Analysis (FIA) program operates the national forest inventory of the US. Traditionally, the FIA program has relied on sample-based approaches -- permanent plot networks and associated design-based estimators -- to estimate forest variables across large geographic areas and long periods of time. These approaches generally offer unbiased inference on large domains but fail to provide reliable estimates for small domains due to low sample sizes. Rising demand for small domain estimates will thus require the FIA program to adopt non-traditional estimation approaches that are capable of delivering defensible estimates of forest variables at increased spatial and temporal resolution, without the expense of collecting additional field data. In light of this challenge, the development of small area estimation (SAE) methods for FIA data has become an active and highly productive area of research. Yet, SAE methods remain difficult to apply to FIA data, due in part to the complex data structures and inventory design used by the FIA program. Thus, we argue that a new suite of estimation tools (i.e., software) will be required to accommodate shifts in demand for inference on large geographic areas and long time periods to inference on small spatial and/or temporal domains. Herein, we present rFIA, an open-source R package designed to increase the accessibility of FIA data, as one such tool. Specifically, we present two case studies chosen to demonstrate rFIA's potential to simplify the application of a broad suite of SAE methods to FIA data: (1) estimation of contemporary county-level forest carbon stocks across the conterminous US using a spatial Fay-Herriot model; and (2) temporally-explicit estimation of multi-decadal trends in merchantable wood volume in Washington County, Maine using a Bayesian mixed-effects model.
【18】 On a quantile autoregressive conditional duration model applied to high-frequency financial data Link: https://arxiv.org/abs/2109.03844
Authors: Helton Saulo, Narayanaswamy Balakrishnan, Roberto Vila Note: 29 pages, 5 figures Abstract: Autoregressive conditional duration (ACD) models are primarily used to deal with data arising from times between two successive events. These models are usually specified in terms of a time-varying conditional mean or median duration. In this paper, we relax this assumption and consider a conditional quantile approach to facilitate the modeling of different percentiles. The proposed ACD quantile model is based on a skewed version of Birnbaum-Saunders distribution, which provides better fitting of the tails than the traditional Birnbaum-Saunders distribution, in addition to advancing the implementation of an expectation conditional maximization (ECM) algorithm. A Monte Carlo simulation study is performed to assess the behavior of the model as well as the parameter estimation method and to evaluate a form of residual. A real financial transaction data set is finally analyzed to illustrate the proposed approach.
【19】 Detecting and Mitigating Test-time Failure Risks via Model-agnostic Uncertainty Learning Link: https://arxiv.org/abs/2109.04432
Authors: Preethi Lahoti, Krishna P. Gummadi, Gerhard Weikum Institutions: Max Planck Institute for Informatics; Max Planck Institute for Software Systems Note: To appear in the 21st IEEE International Conference on Data Mining (ICDM 2021), Auckland, New Zealand Abstract: Reliably predicting potential failure risks of machine learning (ML) systems when deployed with production data is a crucial aspect of trustworthy AI. This paper introduces Risk Advisor, a novel post-hoc meta-learner for estimating failure risks and predictive uncertainties of any already-trained black-box classification model. In addition to providing a risk score, the Risk Advisor decomposes the uncertainty estimates into aleatoric and epistemic uncertainty components, thus giving informative insights into the sources of uncertainty inducing the failures. Consequently, Risk Advisor can distinguish between failures caused by data variability, data shifts and model limitations and advise on mitigation actions (e.g., collecting more data to counter data shift). Extensive experiments on various families of black-box classification models and on real-world and synthetic datasets covering common ML failure scenarios show that the Risk Advisor reliably predicts deployment-time failure risks in all the scenarios, and outperforms strong baselines.
【20】 Assessing Machine Learning Approaches to Address IoT Sensor Drift Link: https://arxiv.org/abs/2109.04356
Authors: Haining Zheng, Antonio Paiva Institutions: ExxonMobil Research and Engineering Company, Annandale, NJ Note: 6 pages, The 4th International Workshop on Artificial Intelligence of Things, in conjunction with the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2021), virtual conference, Aug. 14-18 Abstract: The proliferation of IoT sensors and their deployment in various industries and applications has brought about numerous analysis opportunities in this Big Data era. However, drift of those sensor measurements poses major challenges to automate data analysis and the ability to effectively train and deploy models on a continuous basis. In this paper we study and test several approaches from the literature with regard to their ability to cope with and adapt to sensor drift under realistic conditions. Most of these approaches are recent and thus are representative of the current state-of-the-art. The testing was performed on a publicly available gas sensor dataset exhibiting drift over time. The results show substantial drops in sensing performance due to sensor drift in spite of the approaches. We then discuss several issues identified with current approaches and outline directions for future research to tackle them.
【21】 COLUMBUS: Automated Discovery of New Multi-Level Features for Domain Generalization via Knowledge Corruption Link: https://arxiv.org/abs/2109.04320
Authors: Ahmed Frikha, Denis Krompaß, Volker Tresp Institutions: Siemens AI Lab, Siemens Technology; Ludwig Maximilian University of Munich Abstract: Machine learning models that can generalize to unseen domains are essential when applied in real-world scenarios involving strong domain shifts. We address the challenging domain generalization (DG) problem, where a model trained on a set of source domains is expected to generalize well in unseen domains without any exposure to their data. The main challenge of DG is that the features learned from the source domains are not necessarily present in the unseen target domains, leading to performance deterioration. We assume that learning a richer set of features is crucial to improve the transfer to a wider set of unknown domains. For this reason, we propose COLUMBUS, a method that enforces new feature discovery via a targeted corruption of the most relevant input and multi-level representations of the data. We conduct an extensive empirical evaluation to demonstrate the effectiveness of the proposed approach which achieves new state-of-the-art results by outperforming 18 DG algorithms on multiple DG benchmark datasets in the DomainBed framework.
【22】 Estimation of Corporate Greenhouse Gas Emissions via Machine Learning Link: https://arxiv.org/abs/2109.04318
Authors: You Han, Achintya Gopal, Liwen Ouyang, Aaron Key Note: Accepted for the Tackling Climate Change with Machine Learning Workshop at ICML 2021 Abstract: As an important step to fulfill the Paris Agreement and achieve net-zero emissions by 2050, the European Commission adopted the most ambitious package of climate impact measures in April 2021 to improve the flow of capital towards sustainable activities. For these and other international measures to be successful, reliable data is key. The ability to see the carbon footprint of companies around the world will be critical for investors to comply with the measures. However, with only a small portion of companies volunteering to disclose their greenhouse gas (GHG) emissions, it is nearly impossible for investors to align their investment strategies with the measures. By training a machine learning model on disclosed GHG emissions, we are able to estimate the emissions of other companies globally who do not disclose their emissions. In this paper, we show that our model provides accurate estimates of corporate GHG emissions to investors such that they are able to align their investments with the regulatory measures and achieve net-zero goals.
【23】 NTS-NOTEARS: Learning Nonparametric Temporal DAGs With Time-Series Data and Prior Knowledge Link: https://arxiv.org/abs/2109.04286
Authors: Xiangyu Sun, Guiliang Liu, Pascal Poupart, Oliver Schulte Institutions: Simon Fraser University; University of Waterloo Note: Preprint, under review Abstract: We propose a score-based DAG structure learning method for time-series data that captures linear, nonlinear, lagged and instantaneous relations among variables while ensuring acyclicity throughout the entire graph. The proposed method extends nonparametric NOTEARS, a recent continuous optimization approach for learning nonparametric instantaneous DAGs. The proposed method is faster than constraint-based methods using nonlinear conditional independence tests. We also promote the use of optimization constraints to incorporate prior knowledge into the structure learning process. A broad set of experiments with simulated data demonstrates that the proposed method discovers better DAG structures than several recent comparison methods. We also evaluate the proposed method on complex real-world data acquired from NHL ice hockey games containing a mixture of continuous and discrete variables. The code is available at https://github.com/xiangyu-sun-789/NTS-NOTEARS/.
【24】 Almost sure convergence of the accelerated weight histogram algorithm Link: https://arxiv.org/abs/2109.04265
Authors: Henrik Hult, Guo-Jhen Wu Note: 35 pages Abstract: The accelerated weight histogram (AWH) algorithm is an iterative extended ensemble algorithm, developed for statistical physics and computational biology applications. It is used to estimate free energy differences and expectations with respect to Gibbs measures. The AWH algorithm is based on iterative updates of a design parameter, which is closely related to the free energy, obtained by matching a weight histogram with a specified target distribution. The weight histogram is constructed from samples of a Markov chain on the product of the state space and parameter space. In this paper almost sure convergence of the AWH algorithm is proved, for estimating free energy differences as well as estimating expectations with adaptive ergodic averages. The proof is based on identifying the AWH algorithm as a stochastic approximation and studying the properties of the associated limit ordinary differential equation.
【25】 Relating Graph Neural Networks to Structural Causal Models Link: https://arxiv.org/abs/2109.04173
Authors: Matej Zečević, Devendra Singh Dhami, Petar Veličković, Kristian Kersting Institutions: Computer Science Department, TU Darmstadt; DeepMind Note: Main paper: 7 pages, References: 2 pages, Appendix: 10 pages; Main paper: 5 figures, Appendix: 3 figures Abstract: Causality can be described in terms of a structural causal model (SCM) that carries information on the variables of interest and their mechanistic relations. For most processes of interest the underlying SCM will only be partially observable, thus causal inference tries to leverage any exposed information. Graph neural networks (GNN) as universal approximators on structured input pose a viable candidate for causal learning, suggesting a tighter integration with SCM. To this effect we present a theoretical analysis from first principles that establishes a novel connection between GNN and SCM while providing an extended view on general neural-causal models. We then establish a new model class for GNN-based causal inference that is necessary and sufficient for causal effect identification. Our empirical illustration on simulations and standard benchmarks validate our theoretical proofs.
【26】 SPLICE: A Synthetic Paid Loss and Incurred Cost Experience Simulator Link: https://arxiv.org/abs/2109.04058
Authors: Benjamin Avanzi, Gregory Clive Taylor, Melantha Wang Institutions: Centre for Actuarial Studies, Department of Economics, University of Melbourne, VIC, Australia; School of Risk and Actuarial Studies, UNSW Australia Business School, UNSW Sydney, NSW, Australia Note: arXiv admin note: text overlap with arXiv:2008.05693 Abstract: In this paper, we first introduce a simulator of case estimates of incurred losses, called `SPLICE` (Synthetic Paid Loss and Incurred Cost Experience). In three modules, case estimates are simulated in continuous time, and a record is output for each individual claim. Revisions for the case estimates are also simulated as a sequence over the lifetime of the claim, in a number of different situations. Furthermore, some dependencies in relation to case estimates of incurred losses are incorporated, particularly recognizing certain properties of case estimates that are found in practice. For example, the magnitude of revisions depends on ultimate claim size, as does the distribution of the revisions over time. Some of these revisions occur in response to occurrence of claim payments, and so `SPLICE` requires input of simulated per-claim payment histories. The claim data can be summarized by accident and payment "periods" whose duration is an arbitrary choice (e.g. month, quarter, etc.) available to the user. `SPLICE` is built on an existing simulator of individual claim experience called `SynthETIC` available on CRAN (Avanzi et al., 2021a,b), which offers flexible modelling of occurrence, notification, as well as the timing and magnitude of individual partial payments. This is in contrast with the incurred losses, which constitute the additional contribution of `SPLICE`. The inclusion of incurred loss estimates provides a facility that almost no other simulators do.
【27】 Application of the Singular Spectrum Analysis on electroluminescence images of thin-film photovoltaic modules Link: https://arxiv.org/abs/2109.04048
Authors: Evgenii Sovetkin, Bart E. Pieters Abstract: This paper discusses an application of the singular spectrum analysis method (SSA) in the context of electroluminescence (EL) images of thin-film photovoltaic (PV) modules. We propose an EL image decomposition as a sum of three components: global intensity, cell, and aperiodic components. A parametric model of the extracted signal is used to perform several image processing tasks. The cell component is used to identify interconnection lines between PV cells at sub-pixel accuracy, as well as to correct incorrect stitching of EL images. Furthermore, an explicit expression of the cell component signal is used to estimate the inverse characteristic length, a physical parameter related to the resistances in a PV module.
【28】 LSB: Local Self-Balancing MCMC in Discrete Spaces Link: https://arxiv.org/abs/2109.03867
Authors: Emanuele Sansone Institutions: Department of Computer Science, KU Leuven Abstract: Markov Chain Monte Carlo (MCMC) methods are promising solutions to sample from target distributions in high dimensions. While MCMC methods enjoy nice theoretical properties, like guaranteed convergence and mixing to the true target, in practice their sampling efficiency depends on the choice of the proposal distribution and the target at hand. This work considers using machine learning to adapt the proposal distribution to the target, in order to improve the sampling efficiency in the purely discrete domain. Specifically, (i) it proposes a new parametrization for a family of proposal distributions, called locally balanced proposals, (ii) it defines an objective function based on mutual information and (iii) it devises a learning procedure to adapt the parameters of the proposal to the target, thus achieving fast convergence and fast mixing. We call the resulting sampler as the Locally Self-Balancing Sampler (LSB). We show through experimental analysis on the Ising model and Bayesian networks that LSB is indeed able to improve the efficiency over a state-of-the-art sampler based on locally balanced proposals, thus reducing the number of iterations required to converge, while achieving comparable mixing performance.
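For orientation, a locally balanced single-flip proposal of the kind 【28】 builds on, with the fixed balancing function $g(t)=\sqrt{t}$ and a Metropolis-Hastings correction, looks as follows on a small 1D Ising chain. The self-balancing part of LSB, i.e. learning the balancing function by maximizing a mutual-information objective, is not reproduced here, and the model size and parameters are illustrative.

```python
import numpy as np

def ising_logp(x, J=0.4, h=0.1):
    """Unnormalized log-probability of a 1D Ising chain with spins in {-1, +1}."""
    return J * np.sum(x[:-1] * x[1:]) + h * np.sum(x)

def flip_weights(x, logp):
    """Locally balanced single-flip proposal q(y|x) proportional to g(p(y)/p(x)) with
    g(t) = sqrt(t): weight exp(0.5 * (logp(y) - logp(x))) for each single-spin flip y."""
    base = logp(x)
    w = np.empty(len(x))
    for i in range(len(x)):
        y = x.copy(); y[i] = -y[i]
        w[i] = np.exp(0.5 * (logp(y) - base))
    return w

def lb_mh_sampler(logp, d=20, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=d)
    chain = np.empty((n_iter, d))
    for t in range(n_iter):
        w = flip_weights(x, logp)
        i = rng.choice(d, p=w / w.sum())
        y = x.copy(); y[i] = -y[i]
        w_back = flip_weights(y, logp)
        # MH correction: accept with probability min(1, p(y) q(x|y) / (p(x) q(y|x)))
        acc = np.exp(logp(y) - logp(x)) * (w_back[i] / w_back.sum()) / (w[i] / w.sum())
        if rng.random() < acc:
            x = y
        chain[t] = x
    return chain

chain = lb_mh_sampler(ising_logp)
print("mean magnetization:", chain[1000:].mean())
```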
【29】 Mean-Square Analysis with An Application to Optimal Dimension Dependence of Langevin Monte Carlo Link: https://arxiv.org/abs/2109.03839
Authors: Ruilin Li, Hongyuan Zha, Molei Tao Institutions: Georgia Institute of Technology; The Chinese University of Hong Kong, Shenzhen Note: Submitted to NeurIPS 2021 on May 28, 2021 (the submission deadline) Abstract: Sampling algorithms based on discretizations of Stochastic Differential Equations (SDEs) compose a rich and popular subset of MCMC methods. This work provides a general framework for the non-asymptotic analysis of sampling error in 2-Wasserstein distance, which also leads to a bound of mixing time. The method applies to any consistent discretization of contractive SDEs. When applied to Langevin Monte Carlo algorithm, it establishes $\tilde{\mathcal{O}}\left(\frac{\sqrt{d}}{\epsilon}\right)$ mixing time, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the 3rd-order derivative of the potential of target measures at infinity. This bound improves the best previously known $\tilde{\mathcal{O}}\left(\frac{d}{\epsilon}\right)$ result and is optimal (in terms of order) in both dimension $d$ and accuracy tolerance $\epsilon$ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.
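The sampler analyzed in 【29】 is the (unadjusted) Langevin Monte Carlo iteration $x_{k+1} = x_k - h \nabla U(x_k) + \sqrt{2h}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I_d)$, targeting $\pi \propto e^{-U}$. A minimal sketch on a log-strongly-convex toy target follows; the step size and target are arbitrary choices for illustration.

```python
import numpy as np

def langevin_monte_carlo(grad_U, x0, step, n_iter, rng=None):
    """Unadjusted Langevin Monte Carlo: x_{k+1} = x_k - step * grad_U(x_k) + sqrt(2*step) * xi,
    targeting (approximately, due to discretization bias) the density proportional to exp(-U)."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    out = np.empty((n_iter, x.size))
    for k in range(n_iter):
        x = x - step * grad_U(x) + np.sqrt(2 * step) * rng.standard_normal(x.size)
        out[k] = x
    return out

# toy target: standard Gaussian in d dimensions, U(x) = ||x||^2 / 2, so grad U(x) = x
d = 10
chain = langevin_monte_carlo(grad_U=lambda x: x, x0=np.full(d, 5.0), step=0.1, n_iter=20000)
print("sample mean norm        :", np.linalg.norm(chain[5000:].mean(axis=0)))
print("sample marginal variance:", chain[5000:].var(axis=0).mean())
```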