stat (Statistics): 29 papers in total
【1】 Desiderata for Representation Learning: A Causal Perspective
Link: https://arxiv.org/abs/2109.03795
Authors: Yixin Wang, Michael I. Jordan
Affiliation: UC Berkeley, EECS and Statistics
Note: 67 pages
Abstract: Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious, efficient, or disentangled. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing non-spuriousness and efficiency (in supervised representation learning) and disentanglement (in unsupervised representation learning) using counterfactual quantities and observable consequences of causal assertions. This yields computable metrics that can be used to assess the degree to which representations satisfy the desiderata of interest, and to learn non-spurious and disentangled representations from single observational datasets.
【2】 Emergency Equity: Access and Emergency Medical Services in San Francisco
Link: https://arxiv.org/abs/2109.03789
Authors: Robert Newton, Soundar Kumara, Paul Griffin
Affiliation: Harold and Inge Marcus Department of Industrial and Manufacturing Engineering
Note: 11 pages, 4 figures, 7 tables; submitted to Socio-Economic Planning Sciences
Abstract: In 2020, California required San Francisco to consider equity in access to resources such as housing, transportation, and emergency services as it re-opened its economy post-pandemic. Using a public dataset maintained by the San Francisco Fire Department of every call received related to emergency response from January 2003 to April 2021, we calculated the response times and distances to the closest of 48 fire stations and 14 local emergency rooms. We used logistic regression to determine the probability of meeting the averages of response time, distance from a fire station, and distance to an emergency room, based on the median income bracket of a ZIP code derived from IRS statement-of-income data. ZIP codes in the lowest bracket ($25,000-$50,000 annually) consistently had the lowest probability of meeting average response metrics. This was most notable for distances to emergency rooms, where calls from ZIP codes in the lowest income bracket had an 11.5% chance of being within the city's average distance (1 mile) of an emergency room, while the next lowest probability (for the income bracket of $100,000-$200,000 annually) was 75.9%. As San Francisco considers equity as a part of California's "Blueprint for a Safer Economy," it should evaluate the distribution of access to emergency services. Keywords: fire department, emergency medical services, emergency rooms, equity, logistic regression
【3】 Dyadic Clustering in International Relations
Link: https://arxiv.org/abs/2109.03774
Authors: Jacob Carlson, Trevor Incerti, P. M. Aronow
Abstract: Quantitative empirical inquiry in international relations often relies on dyadic data. Standard analytic techniques do not account for the fact that dyads are not generally independent of one another. That is, when dyads share a constituent member (e.g., a common country), they may be statistically dependent, or "clustered." Recent work has developed dyadic clustering robust standard errors (DCRSEs) that account for this dependence. Using these DCRSEs, we reanalyzed all empirical articles published in International Organization between January 2014 and January 2020 that feature dyadic data. We find that published standard errors for key explanatory variables are, on average, approximately half as large as DCRSEs, suggesting that dyadic clustering is leading researchers to severely underestimate uncertainty. However, most (67%) statistically significant findings remain statistically significant when using DCRSEs. We conclude that accounting for dyadic clustering is both important and feasible, and offer software in R and Stata to facilitate the use of DCRSEs in future research.
【4】 Grid-Uniform Copulas and Rectangle Exchanges: Bayesian Model and Inference for a Rich Class of Copula Functions
Link: https://arxiv.org/abs/2109.03768
Authors: Nicolás Kuschinski, Alejandro Jara
Note: 33 pages, 7 figures, 1 table
Abstract: Copula-based models provide a great deal of flexibility in modelling multivariate distributions, allowing the specification of models for the marginal distributions separately from the dependence structure (copula) that links them to form a joint distribution. Choosing a class of copula models is not a trivial task and its misspecification can lead to wrong conclusions. We introduce a novel class of grid-uniform copula functions, which is dense in the space of all continuous copula functions in a Hellinger sense. We propose a Bayesian model based on this class and develop an automatic Markov chain Monte Carlo algorithm for exploring the corresponding posterior distribution. The methodology is illustrated by means of simulated data and compared to the main existing approach.
【5】 Causal Inference for Quantile Treatment Effects
Link: https://arxiv.org/abs/2109.03757
Authors: Shuo Sun, Erica E. M. Moodie, Johanna G. Nešlehová
Affiliation: Department of Epidemiology, Biostatistics and Occupational Health, and Department of Mathematics and Statistics, McGill University, Montréal, QC, Canada
Abstract: Analyses of environmental phenomena often are concerned with understanding unlikely events such as floods, heatwaves, droughts or high concentrations of pollutants. Yet the majority of the causal inference literature has focused on modelling means, rather than (possibly high) quantiles. We define a general estimator of the population quantile treatment (or exposure) effects (QTE) -- the weighted QTE (WQTE) -- of which the population QTE is a special case, along with a general class of balancing weights incorporating the propensity score. Asymptotic properties of the proposed WQTE estimators are derived. We further propose and compare propensity score regression and two weighted methods based on these balancing weights to understand the causal effect of an exposure on quantiles, allowing for the exposure to be binary, discrete or continuous. Finite sample behavior of the three estimators is studied in simulation. The proposed methods are applied to data taken from the Bavarian Danube catchment area to estimate the 95% QTE of phosphorus on copper concentration in the river.
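The WQTE estimators in the paper build on balancing weights that incorporate the propensity score. Below is a minimal sketch, under illustrative simulated data, of the simplest member of that family: an inverse-propensity-weighted estimate of a quantile treatment effect for a binary exposure. The data-generating process, function names, and the choice of the 95th percentile are assumptions for the example, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))                        # confounders
p_true = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))
a = rng.binomial(1, p_true)                        # binary exposure
y = 1.0 + a * (0.8 + 0.5 * x[:, 0]) + x.sum(axis=1) + rng.normal(size=n)

e_hat = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]  # propensity

def weighted_quantile(values, weights, q):
    """Quantile of a weighted empirical distribution."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, q)]

tau = 0.95
w1, w0 = a / e_hat, (1 - a) / (1 - e_hat)          # IPW balancing weights
qte = (weighted_quantile(y[a == 1], w1[a == 1], tau)
       - weighted_quantile(y[a == 0], w0[a == 0], tau))
print(f"estimated {tau:.0%} QTE: {qte:.3f}")
```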
【6】 Quantile-based fuzzy clustering of multivariate time series in the frequency domain
Link: https://arxiv.org/abs/2109.03728
Authors: Ángel López-Oriona, José A. Vilar, Pierpaolo D'Urso
Abstract: A novel procedure to perform fuzzy clustering of multivariate time series generated from different dependence models is proposed. Different amounts of dissimilarity between the generating models or changes in the dynamic behaviours over time are some arguments justifying a fuzzy approach, where each series is associated with all the clusters with specific membership levels. Our procedure considers quantile-based cross-spectral features and consists of three stages: (i) each element is characterized by a vector of proper estimates of the quantile cross-spectral densities, (ii) principal component analysis is carried out to capture the main differences while reducing the effects of the noise, and (iii) the squared Euclidean distance between the first retained principal components is used to perform clustering through the standard fuzzy C-means and fuzzy C-medoids algorithms. The performance of the proposed approach is evaluated in a broad simulation study where several types of generating processes are considered, including linear, nonlinear and dynamic conditional correlation models. Assessment is done in two different ways: by directly measuring the quality of the resulting fuzzy partition and by taking into account the ability of the technique to determine the overlapping nature of series located equidistant from well-defined clusters. The procedure is compared with the few alternatives suggested in the literature, substantially outperforming all of them whatever the underlying process and the evaluation scheme. Two specific applications involving air quality and financial databases illustrate the usefulness of our approach.
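Stages (ii) and (iii) of the procedure are straightforward to sketch. Below, a generic feature matrix stands in for stage (i)'s quantile cross-spectral density estimates (whose estimation is more involved), followed by PCA and a standard fuzzy C-means pass; all sizes and parameters are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy C-means: returns membership matrix U and centroids V."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))       # random initial memberships
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]     # weighted centroids
        d = np.linalg.norm(X[:, None, :] - V[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)) *
                   np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
    return U, V

rng = np.random.default_rng(1)
features = rng.normal(size=(40, 60))                 # stage (i) stand-in
features[:20] += 1.0                                 # two latent groups
features -= features.mean(axis=0)
_, _, Vt = np.linalg.svd(features, full_matrices=False)  # stage (ii): PCA
scores = features @ Vt[:3].T                         # first retained components
U, _ = fuzzy_c_means(scores, c=2)                    # stage (iii)
print(U.round(3))                                    # membership levels per series
```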
【7】 Dependent Dirichlet Processes for Analysis of a Generalized Shared Frailty Model
Link: https://arxiv.org/abs/2109.03713
Authors: Chong Zhong, Zhihua Ma, Junshan Shen, Catherine Liu
Abstract: The Bayesian paradigm takes advantage of well-fitting complicated survival models and feasible computing in survival analysis, owing to its superiority over the frequentist paradigm in tackling complex censoring schemes. In this chapter, we aim to display the latest tendency in Bayesian computing, in the sense of automating the posterior sampling, through a Bayesian analysis of survival modeling for multivariate survival outcomes with a complicated data structure. Motivated by relaxing the strong assumption of proportionality and the restriction of a common baseline population, we propose a generalized shared frailty model which includes both parametric and nonparametric frailty random effects, so as to incorporate both treatment-wise and temporal variation for multiple events. We develop a survival-function version of the ANOVA dependent Dirichlet process to model the dependency among the baseline survival functions. The posterior sampling is implemented automatically by the No-U-Turn sampler in Stan, a contemporary Bayesian computing tool. The proposed model is validated by an analysis of the bladder cancer recurrence data, and the estimates are consistent with existing results. Our model and Bayesian inference provide evidence that the Bayesian paradigm fosters complex modeling and feasible computing in survival analysis, and that Stan eases posterior inference.
【8】 Parameterizing and Simulating from Causal Models
Link: https://arxiv.org/abs/2109.03694
Authors: Robin J. Evans, Vanessa Didelez
Note: 33 pages, 5 figures
Abstract: Many statistical problems in causal inference involve a probability distribution other than the one from which data are actually observed; as an additional complication, the object of interest is often a marginal quantity of this other probability distribution. This creates many practical complications for statistical inference, even where the problem is non-parametrically identified. Naïve attempts to specify a model parametrically can lead to unwanted consequences such as incompatible parametric assumptions or the so-called 'g-null paradox'. As a consequence it is difficult to perform likelihood-based inference, or even to simulate from the model in a general way. We introduce the 'frugal parameterization', which places the causal effect of interest at its centre, and then build the rest of the model around it. We do this in a way that provides a recipe for constructing a regular, non-redundant parameterization using causal quantities of interest. In the case of discrete variables we use odds ratios to complete the parameterization, while in the continuous case we use copulas. Our methods allow us to construct and simulate from models with parametrically specified causal distributions, and fit them using likelihood-based methods, including fully Bayesian approaches. Models we can fit and simulate from exactly include marginal structural models and structural nested models. Our proposal includes parameterizations for the average causal effect and effect of treatment on the treated, as well as other causal quantities of interest. Our results will allow practitioners to assess their methods against the best possible estimators for correctly specified models, in a way which has previously been impossible.
【9】 Confidence surfaces for the mean of locally stationary functional time series
Link: https://arxiv.org/abs/2109.03641
Authors: Holger Dette, Weichi Wu
Affiliation: Ruhr-Universität Bochum, Fakultät für Mathematik, Germany; Tsinghua University, Center for Statistical Science, Department of Industrial Engineering, Beijing, China
Abstract: The problem of constructing a simultaneous confidence band for the mean function of a locally stationary functional time series $\{X_{i,n}(t)\}_{i=1,\ldots,n}$ is challenging, as these bands cannot be built on classical limit theory. On the one hand, for a fixed argument $t$ of the functions $X_{i,n}$, the maximum absolute deviation between an estimate and the time-dependent regression function exhibits (after appropriate standardization) an extreme value behaviour with a Gumbel distribution in the limit. On the other hand, for stationary functional data, simultaneous confidence bands can be built on classical central limit theorems for Banach space valued random variables, and the limit distribution of the maximum absolute deviation is given by the sup-norm of a Gaussian process. As both limit theorems have different rates of convergence, they are not compatible, and a weak convergence result, which could be used for the construction of a confidence surface in the locally stationary case, does not exist. In this paper we propose new bootstrap methodology to construct a simultaneous confidence band for the mean function of a locally stationary functional time series, which is motivated by a Gaussian approximation for the maximum absolute deviation. We prove the validity of our approach by asymptotic theory, demonstrate good finite sample properties by means of a simulation study, and illustrate its applicability by analyzing a data example.
【10】 Estimating causal effects in the presence of competing events using regression standardisation with the Stata command standsurv
Link: https://arxiv.org/abs/2109.03628
Authors: Elisavet Syriopoulou, Sarwar I Mozumder, Mark J Rutherford, Paul C Lambert
Abstract: When interested in a time-to-event outcome, competing events that prevent the occurrence of the event of interest may be present. In the presence of competing events, various statistical estimands have been suggested for defining the causal effect of treatment on the event of interest. Depending on the estimand, the competing events are either accommodated or eliminated, resulting in causal effects with different interpretation. The former approach captures the total effect of treatment on the event of interest while the latter approach captures the direct effect of treatment on the event of interest that is not mediated by the competing event. Separable effects have also been defined for settings where the treatment effect can be partitioned into its effect on the event of interest and its effect on the competing event through different causal pathways. We outline various causal effects that may be of interest in the presence of competing events, including total, direct and separable effects, and describe how to obtain estimates using regression standardisation with the Stata command standsurv. Regression standardisation is applied by obtaining the average of individual estimates across all individuals in a study population after fitting a survival model. With standsurv several contrasts of interest can be calculated including differences, ratios and other user-defined functions. Confidence intervals can also be obtained using the delta method. Throughout we use an example analysing a publicly available dataset on prostate cancer to allow the reader to replicate the analysis and further explore the different effects of interest.
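standsurv is a Stata command; as a rough Python analogue, the sketch below illustrates the regression standardisation idea it implements: fit a survival model, predict every individual's survival with the exposure set to each level, and average. It uses the lifelines package and its Rossi dataset, treats the binary 'fin' column as the exposure, and ignores competing events for brevity; all of these are illustrative assumptions, not the paper's analysis.

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()
cph = CoxPHFitter().fit(df, duration_col="week", event_col="arrest")

covs = df.drop(columns=["week", "arrest"])
t = 52                                      # time horizon in weeks
surv = {}
for a in (0, 1):
    X = covs.assign(fin=a)                  # set the exposure for everyone
    # average of individual predicted survival curves = standardised survival
    surv[a] = cph.predict_survival_function(X, times=[t]).mean(axis=1).iloc[0]

print(f"standardised S({t}) difference (fin=1 vs fin=0): {surv[1] - surv[0]:+.3f}")
```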
【11】 Higher Order Kernel Mean Embeddings to Capture Filtrations of Stochastic Processes
Link: https://arxiv.org/abs/2109.03582
Authors: Cristopher Salvi, Maud Lemercier, Chong Liu, Blanka Horvath, Theodoros Damoulas, Terry Lyons
Abstract: Stochastic processes are random variables with values in some space of paths. However, reducing a stochastic process to a path-valued random variable ignores its filtration, i.e. the flow of information carried by the process through time. By conditioning the process on its filtration, we introduce a family of higher order kernel mean embeddings (KMEs) that generalizes the notion of KME and captures additional information related to the filtration. We derive empirical estimators for the associated higher order maximum mean discrepancies (MMDs) and prove consistency. We then construct a filtration-sensitive kernel two-sample test able to pick up information that gets missed by the standard MMD test. In addition, leveraging our higher order MMDs, we construct a family of universal kernels on stochastic processes that allows us to solve real-world calibration and optimal stopping problems in quantitative finance (such as the pricing of American options) via classical kernel-based regression methods. Finally, adapting existing tests for conditional independence to the case of stochastic processes, we design a causal-discovery algorithm to recover the causal graph of structural dependencies among interacting bodies solely from observations of their multidimensional trajectories.
【12】 Multivariate, Heteroscedastic Empirical Bayes via Nonparametric Maximum Likelihood
Link: https://arxiv.org/abs/2109.03466
Authors: Jake A. Soloff, Adityanand Guntuboyina, Bodhisattva Sen
Affiliation: Department of Statistics, University of California, Berkeley; Department of Statistics, Columbia University
Abstract: Multivariate, heteroscedastic errors complicate statistical inference in many large-scale denoising problems. Empirical Bayes is attractive in such settings, but standard parametric approaches rest on assumptions about the form of the prior distribution which can be hard to justify and which introduce unnecessary tuning parameters. We extend the nonparametric maximum likelihood estimator (NPMLE) for Gaussian location mixture densities to allow for multivariate, heteroscedastic errors. NPMLEs estimate an arbitrary prior by solving an infinite-dimensional, convex optimization problem; we show that this convex optimization problem can be tractably approximated by a finite-dimensional version. We introduce a dual mixture density whose modes contain the atoms of every NPMLE, and we leverage the dual both to show non-uniqueness in multivariate settings as well as to construct explicit bounds on the support of the NPMLE. The empirical Bayes posterior means based on an NPMLE have low regret, meaning they closely target the oracle posterior means one would compute with the true prior in hand. We prove an oracle inequality implying that the empirical Bayes estimator performs at nearly the optimal level (up to logarithmic factors) for denoising without prior knowledge. We provide finite-sample bounds on the average Hellinger accuracy of an NPMLE for estimating the marginal densities of the observations. We also demonstrate the adaptive and nearly-optimal properties of NPMLEs for deconvolution. We apply the method to two astronomy datasets, constructing a fully data-driven color-magnitude diagram of 1.4 million stars in the Milky Way and investigating the distribution of chemical abundance ratios for 27 thousand stars in the red clump.
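The abstract's finite-dimensional approximation of the NPMLE can be sketched concretely: fix a grid of candidate atoms and maximize the mixture likelihood over the simplex of weights. The toy below does this by EM for a one-dimensional heteroscedastic Gaussian model; the grid, the data-generating process, and the use of EM (rather than a dedicated convex solver) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta = rng.choice([-2.0, 0.0, 3.0], size=500)   # unknown means (true prior)
sigma = rng.uniform(0.5, 1.5, size=500)          # known, heteroscedastic noise
x = theta + sigma * rng.normal(size=500)

grid = np.linspace(x.min(), x.max(), 200)        # candidate atoms
lik = norm.pdf(x[:, None], loc=grid[None, :], scale=sigma[:, None])  # n x m

w = np.full(len(grid), 1.0 / len(grid))
for _ in range(500):                             # EM on the mixture weights
    post = lik * w                               # responsibilities (unnormalized)
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)

# Empirical Bayes posterior means of theta_i under the estimated prior.
post = lik * w
eb_mean = (post * grid).sum(axis=1) / post.sum(axis=1)
print(np.round(eb_mean[:5], 3))
```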
【13】 Uncertainty Quantification and Experimental Design for large-scale linear Inverse Problems under Gaussian Process Priors
Link: https://arxiv.org/abs/2109.03457
Authors: Cédric Travelletti, David Ginsbourger, Niklas Linde
Affiliation: Institute of Mathematical Statistics and Actuarial Science, University of Bern
Note: under review
Abstract: We consider the use of Gaussian process (GP) priors for solving inverse problems in a Bayesian framework. As is well known, the computational complexity of GPs scales cubically in the number of datapoints. We here show that in the context of inverse problems involving integral operators, one faces additional difficulties that hinder inversion on large grids. Furthermore, in that context, covariance matrices can become too large to be stored. By leveraging results about sequential disintegrations of Gaussian measures, we are able to introduce an implicit representation of posterior covariance matrices that reduces the memory footprint by only storing low rank intermediate matrices, while allowing individual elements to be accessed on-the-fly without needing to build full posterior covariance matrices. Moreover, it allows for fast sequential inclusion of new observations. These features are crucial when considering sequential experimental design tasks. We demonstrate our approach by computing sequential data collection plans for excursion set recovery for a gravimetric inverse problem, where the goal is to provide fine resolution estimates of high density regions inside the Stromboli volcano, Italy. Sequential data collection plans are computed by extending the weighted integrated variance reduction (wIVR) criterion to inverse problems. Our results show that this criterion is able to significantly reduce the uncertainty on the excursion volume, reaching close to minimal levels of residual uncertainty. Overall, our techniques allow the advantages of probabilistic models to be brought to bear on large-scale inverse problems arising in the natural sciences.
【14】 Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning
Link: https://arxiv.org/abs/2109.03445
Authors: Rajeeva L. Karandikar, M. Vidyasagar
Note: 11 pages
Abstract: The stochastic approximation (SA) algorithm is a widely used probabilistic method for finding a solution to an equation of the form $\mathbf{f}(\boldsymbol{\theta}) = \mathbf{0}$ where $\mathbf{f} : \mathbb{R}^d \rightarrow \mathbb{R}^d$, when only noisy measurements of $\mathbf{f}(\cdot)$ are available. In the literature to date, one can make a distinction between "synchronous" updating, whereby the entire vector of the current guess $\boldsymbol{\theta}_t$ is updated at each time, and "asynchronous" updating, whereby only one component of $\boldsymbol{\theta}_t$ is updated. In convex and nonconvex optimization, there is also the notion of "batch" updating, whereby some but not all components of $\boldsymbol{\theta}_t$ are updated at each time $t$. In addition, there is also a distinction between using a "local" clock versus a "global" clock. In the literature to date, convergence proofs when a local clock is used make the assumption that the measurement noise is an i.i.d. sequence, an assumption that does not hold in Reinforcement Learning (RL). In this note, we provide a general theory of convergence for batch asynchronous stochastic approximation (BASA) that works whether the updates use a local clock or a global clock, for the case where the measurement noises form a martingale difference sequence. This is the most general result to date and encompasses all others.
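A minimal sketch of the BASA scheme described above: at each step a random subset of coordinates is updated from a noisy measurement of $\mathbf{f}(\boldsymbol{\theta})$, with step sizes driven either by a global clock or by per-coordinate local clocks. The linear target and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_star = np.arange(d, dtype=float)
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))

def f_noisy(theta):
    # noisy measurement of f(theta) = A (theta* - theta); the additive noise
    # forms a martingale difference sequence, as assumed in the paper
    return A @ (theta_star - theta) + rng.normal(size=d)

theta = np.zeros(d)
counts = np.zeros(d)                       # per-coordinate "local clocks"
use_local_clock = True
for t in range(1, 20001):
    batch = rng.random(d) < 0.5            # batch: random subset of coordinates
    counts[batch] += 1
    steps = (1.0 / np.maximum(counts, 1.0) if use_local_clock
             else np.full(d, 1.0 / t))     # global-clock alternative
    g = f_noisy(theta)
    theta[batch] += steps[batch] * g[batch]

print(np.round(theta, 2))                  # should be close to theta_star
```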
【15】 Metrics to find a surrogate endpoint of OS in metastatic oncology trials: a simulation study
Link: https://arxiv.org/abs/2109.03421
Authors: Wei Zou
Affiliation: Genentech, S. San Francisco, California, USA
Note: the main body has 21 pages and 8 figures; submitted to BMC Medical Research Methodology; the abstract was modified to comply with arXiv's length limit
Abstract: A surrogate endpoint (SE) for overall survival (OS) in cancer patients is essential to improving the efficiency of oncology drug development. In practice, we may discover a new patient-level association with OS in a discovery cohort, and then measure the trial-level association across studies in a meta-analysis to validate the SE. In this work, we simulated pairs of metrics to quantify surrogacy at the patient level and the trial level, evaluated their association, and examined how well various patient-level metrics from the initial discovery indicate the eventual utility as a SE. Across all the simulation scenarios, we found tight correlation among all the patient-level metrics, including the C-index, integrated Brier score, and log hazard ratio between SE values and OS, and similar correlation between any of them and the trial-level association metric. Despite the continual increase in the true biological link between SE and OS, both patient- and trial-level metrics often plateaued coincidentally in many scenarios; their association always decreased quickly. Under the SE development framework and data generation models considered here, all patient-level metrics are similar in ranking a candidate SE according to its eventual trial-level association; incorporating additional biological factors into a SE is likely to have diminished return in improving both patient-level and trial-level association.
【16】 Functional Principal Subspace Sampling for Large Scale Functional Data Analysis
Link: https://arxiv.org/abs/2109.03397
Authors: Shiyuan He, Xiaomeng Yan
Affiliation: Institute of Statistics and Big Data, Renmin University of China, Beijing, China; Department of Statistics, Texas A&M University, College Station, TX, USA
Abstract: Functional data analysis (FDA) methods have computational and theoretical appeals for some high dimensional data, but lack the scalability to modern large sample datasets. To tackle the challenge, we develop randomized algorithms for two important FDA methods: functional principal component analysis (FPCA) and functional linear regression (FLR) with scalar response. The two methods are connected as they both rely on the accurate estimation of the functional principal subspace. The proposed algorithms draw subsamples from the large dataset at hand and apply FPCA or FLR over the subsamples to reduce the computational cost. To effectively preserve subspace information in the subsamples, we propose a functional principal subspace sampling probability, which removes the eigenvalue scale effect inside the functional principal subspace and properly weights the residual. Based on the operator perturbation analysis, we show the proposed probability has precise control over the first order error of the subspace projection operator and can be interpreted as an importance sampling for functional subspace estimation. Moreover, concentration bounds for the proposed algorithms are established to reflect the low intrinsic dimension nature of functional data in an infinite dimensional space. The effectiveness of the proposed algorithms is demonstrated upon synthetic and real datasets.
【17】 AWGAN: Empowering High-Dimensional Discriminator Output for Generative Adversarial Networks
Link: https://arxiv.org/abs/2109.03378
Authors: Mengyu Dai, Haibin Hang, Anuj Srivastava
Affiliation: Microsoft; University of Delaware; Florida State University, Department of Statistics
Abstract: Empirically, multidimensional discriminator (critic) output can be advantageous, yet a solid explanation for this has been lacking. In this paper, (i) we rigorously prove that high-dimensional critic output has an advantage in distinguishing real and fake distributions; (ii) we also introduce a square-root velocity transformation (SRVT) block which further magnifies this advantage. The proof is based on our proposed maximal p-centrality discrepancy, which is bounded above by the p-Wasserstein distance and fits perfectly into the Wasserstein GAN framework with high-dimensional critic output n. We also show that when n = 1, the proposed discrepancy is equivalent to the 1-Wasserstein distance. The SRVT block is applied to break the symmetric structure of high-dimensional critic output and improve the generalization capability of the discriminator network. In terms of implementation, the proposed framework does not require additional hyper-parameter tuning, which largely facilitates its usage. Experiments on image generation tasks show performance improvement on benchmark datasets.
【18】 SIHR: An R Package for Statistical Inference in High-dimensional Linear and Logistic Regression Models
Link: https://arxiv.org/abs/2109.03365
Authors: Prabrisha Rakshit, T. Tony Cai, Zijian Guo
Affiliation: Rutgers University; University of Pennsylvania
Abstract: We introduce and illustrate through numerical examples the R package SIHR, which handles statistical inference for (1) linear and quadratic functionals in high-dimensional linear regression and (2) linear functionals in high-dimensional logistic regression. The focus of the proposed algorithms is on point estimation, confidence interval construction and hypothesis testing. The inference methods are extended to multiple regression models. We include real data applications to demonstrate the package's performance and practicality.
【19】 On the CUSUM procedure for phase-type distributions: a Lévy fluctuation theory approach
Link: https://arxiv.org/abs/2109.03361
Authors: Jevgenijs Ivanovs, Kazutoshi Yamazaki
Note: 25 pages, 4 figures
Abstract: We introduce a new method for analyzing the cumulative sum (CUSUM) procedure in sequential change-point detection. When observations are phase-type distributed and the post-change distribution is given by exponential tilting of the pre-change distribution, the first passage analysis of the CUSUM statistic is reduced to that of a certain Markov additive process. By using the theory of the so-called scale matrix and further developing it, we derive exact expressions for the average run length, average detection delay, and false alarm probability under the CUSUM procedure. The proposed method is robust and applicable in a general setting with non-i.i.d. observations. Numerical results are also given.
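For intuition, the CUSUM statistic analyzed in the paper is a log-likelihood-ratio random walk reflected at zero, with an alarm raised at a fixed threshold. The sketch below uses exponential pre- and post-change densities (a special case of phase-type, related by exponential tilting); all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
lam0, lam1, change, b = 1.0, 2.0, 300, 5.0   # pre/post rates, change point, threshold
obs = np.concatenate([rng.exponential(1 / lam0, change),
                      rng.exponential(1 / lam1, 700)])

s, alarm = 0.0, None
for t, x in enumerate(obs, start=1):
    llr = np.log(lam1 / lam0) - (lam1 - lam0) * x   # log-likelihood ratio
    s = max(0.0, s + llr)                           # CUSUM recursion
    if s >= b:                                      # alarm at threshold b
        alarm = t
        break

print(f"change at {change}, alarm at {alarm}")      # detection delay = alarm - change
```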
【20】 Latent Space Network Modelling with Continuous and Discrete Geometries
Link: https://arxiv.org/abs/2109.03343
Authors: Marios Papamichalis, Kathryn Turnbull, Simon Lunagomez, Edoardo Airoldi
Affiliation: Fox School of Business, Temple University, Philadelphia, PA; Department of Mathematics and Statistics, Lancaster University, Lancaster
Note: 42 pages, 14 figures
Abstract: A rich class of network models associate each node with a low-dimensional latent coordinate that controls the propensity for connections to form. Models of this type are well established in the literature, where it is typical to assume that the underlying geometry is Euclidean. Recent work has explored the consequences of this choice and has motivated the study of models which rely on non-Euclidean latent geometries, with a primary focus on spherical and hyperbolic geometry. In this paper\footnote{This is the first version of this work. Any potential mistakes belong to the first author.}, we examine to what extent latent features can be inferred from the observable links in the network, considering network models which rely on spherical, hyperbolic and lattice geometries. For each geometry, we describe a latent network model, detail constraints on the latent coordinates which remove the well-known identifiability issues, and present schemes for Bayesian estimation. Thus, we develop computational procedures to perform inference for network models in which the properties of the underlying geometry play a vital role. Furthermore, we assess the validity of these models with real-data applications.
【21】 A note on the permutation distribution of generalized correlation coefficients
Link: https://arxiv.org/abs/2109.03342
Authors: Yejiong Zhu, Hao Chen
Affiliation: University of California, Davis
Abstract: We provide sufficient conditions for the asymptotic normality of the generalized correlation coefficient $\sum a_{ij}b_{ij}$ under the permutation null distribution when the $a_{ij}$'s and $b_{ij}$'s are symmetric.
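The permutation null distribution in question is easy to simulate: relabel the indices of one coefficient array by a uniform random permutation and recompute $\sum a_{ij}b_{ij}$. A minimal sketch with illustrative symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
a = rng.normal(size=(n, n)); a = (a + a.T) / 2; np.fill_diagonal(a, 0)
b = rng.normal(size=(n, n)); b = (b + b.T) / 2; np.fill_diagonal(b, 0)

stat = np.sum(a * b)                           # observed generalized correlation
null = np.empty(2000)
for k in range(2000):
    pi = rng.permutation(n)
    null[k] = np.sum(a[np.ix_(pi, pi)] * b)    # relabel indices of a

z = (stat - null.mean()) / null.std()          # compare against asymptotic normality
print(f"observed {stat:.2f}, permutation z-score {z:.2f}")
```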
【22】 C-MinHash: Rigorously Reducing K Permutations to Two
Link: https://arxiv.org/abs/2109.03337
Authors: Xiaoyun Li, Ping Li
Affiliation: Cognitive Computing Lab, Baidu Research, Bellevue, WA, USA
Abstract: Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of MinHash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset, in order to obtain reliable results for (e.g.) building large-scale learning models or approximate near-neighbor search in massive data. In this paper, we propose Circulant MinHash (C-MinHash) and provide the surprising theoretical result that we just need two independent random permutations. For C-MinHash, we first conduct an initial permutation on the data vector, then we use a second permutation to generate hash values. Basically, the second permutation is re-used $K$ times via circulant shifting to produce $K$ hashes. Unlike classical MinHash, these $K$ hashes are obviously correlated, but we are able to provide rigorous proofs that we still obtain an unbiased estimate of the Jaccard similarity, and that the theoretical variance is uniformly smaller than that of the classical MinHash with $K$ independent permutations. The theoretical proofs of C-MinHash require some non-trivial efforts. Numerical experiments are conducted to justify the theory and demonstrate the effectiveness of C-MinHash.
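A minimal sketch of the scheme as the abstract describes it: one initial permutation $\sigma$ shuffles the data, and a single second permutation $\pi$ is re-used via circulant shifts to produce all $K$ hashes. The dimension, $K$, and the example sets are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 1024, 64
sigma = rng.permutation(D)        # initial permutation
pi = rng.permutation(D)           # second permutation, re-used K times

def c_minhash(nonzero_idx):
    s = sigma[np.asarray(list(nonzero_idx))]                   # initial shuffle
    return np.array([pi[(s + k) % D].min() for k in range(K)]) # circulant shifts

x = set(rng.choice(D, 80, replace=False))
y = set(rng.choice(D, 80, replace=False)) | set(list(x)[:40])  # overlapping set
jaccard = len(x & y) / len(x | y)
est = (c_minhash(x) == c_minhash(y)).mean()   # collision rate estimates Jaccard
print(f"true Jaccard {jaccard:.3f}, C-MinHash estimate {est:.3f}")
```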
【23】 Highly Parallel Autoregressive Entity Linking with Discriminative Correction
Link: https://arxiv.org/abs/2109.03792
Authors: Nicola De Cao, Wilker Aziz, Ivan Titov
Affiliation: University of Amsterdam; University of Edinburgh
Note: Accepted at EMNLP 2021 (Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing). Code at this https URL. 8 pages, 1 figure, 3 tables
Abstract: Generative approaches have been recently shown to be effective for both Entity Disambiguation and Entity Linking (i.e., joint mention detection and disambiguation). However, the previously proposed autoregressive formulation for EL suffers from i) high computational cost due to a complex (deep) decoder, ii) non-parallelizable decoding that scales with the source sequence length, and iii) the need for training on a large amount of data. In this work, we propose a very efficient approach that parallelizes autoregressive linking across all potential mentions and relies on a shallow and efficient decoder. Moreover, we augment the generative objective with an extra discriminative component, i.e., a correction term which lets us directly optimize the generator's ranking. When taken together, these techniques tackle all the above issues: our model is >70 times faster and more accurate than the previous generative method, outperforming state-of-the-art approaches on the standard English dataset AIDA-CoNLL. Source code available at https://github.com/nicola-decao/efficient-autoregressive-EL
【24】 FedZKT: Zero-Shot Knowledge Transfer towards Heterogeneous On-Device Models in Federated Learning
Link: https://arxiv.org/abs/2109.03775
Authors: Lan Zhang, Xiaoyong Yuan
Affiliation: Michigan Technological University
Note: 13 pages
Abstract: Federated learning enables distributed devices to collaboratively learn a shared prediction model without centralizing on-device training data. Most of the current algorithms require comparable individual efforts to train on-device models with the same structure and size, impeding participation from resource-constrained devices. Given the widespread yet heterogeneous devices nowadays, this paper proposes a new framework supporting federated learning across heterogeneous on-device models via Zero-shot Knowledge Transfer, named FedZKT. Specifically, FedZKT allows participating devices to independently determine their on-device models. To transfer knowledge across on-device models, FedZKT develops a zero-shot distillation approach, in contrast to certain prior research based on a public dataset or a pre-trained data generator. To minimize the on-device workload, the resource-intensive distillation task is assigned to the server, which constructs a generator to adversarially train with the ensemble of the received heterogeneous on-device models. The distilled central knowledge is then sent back in the form of the corresponding on-device model parameters, which can be easily absorbed on the device side. Experimental studies demonstrate the effectiveness and the robustness of FedZKT towards heterogeneous on-device models and challenging federated learning scenarios, such as non-i.i.d. data distribution and straggler effects.
【25】 Approximate Factor Models with Weaker Loadings
Link: https://arxiv.org/abs/2109.03773
Authors: Jushan Bai, Serena Ng
Abstract: Pervasive cross-section dependence is increasingly recognized as an appropriate characteristic of economic data, and the approximate factor model provides a useful framework for analysis. Assuming a strong factor structure, early work established convergence of the principal component estimates of the factors and loadings up to a rotation matrix. This paper shows that the estimates are still consistent and asymptotically normal for a broad range of weaker factor loadings, albeit at slower rates and under additional assumptions on the sample size. Standard inference procedures can be used except in the case of extremely weak loadings, which has encouraging implications for empirical work. The simplified proofs are of independent interest.
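For reference, the principal-components (PC) estimator discussed in the abstract can be sketched in a few lines; the weak-loadings data-generating process (loadings scaled by $N^{-1/4}$) is an illustrative assumption, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, r = 200, 500, 2
F = rng.normal(size=(T, r))                  # latent factors
L = rng.normal(size=(N, r)) * N ** -0.25     # weaker loadings
X = F @ L.T + rng.normal(size=(T, N))        # observed panel

# PC estimator: top-r eigenvectors of XX'/(TN), normalized so F_hat'F_hat/T = I.
vals, vecs = np.linalg.eigh(X @ X.T / (T * N))
F_hat = np.sqrt(T) * vecs[:, -r:][:, ::-1]
L_hat = X.T @ F_hat / T

# Factors are identified up to rotation: check the fit of F on span(F_hat).
beta = np.linalg.lstsq(F_hat, F, rcond=None)[0]
resid = F - F_hat @ beta
print((1 - resid.var(axis=0) / F.var(axis=0)).round(3))  # R^2 per true factor
```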
【26】 Priming PCA with EigenGame
Link: https://arxiv.org/abs/2109.03709
Authors: Bálint Máté, François Fleuret
Affiliation: University of Geneva
Abstract: We introduce primed-PCA (pPCA), an extension of the recently proposed EigenGame algorithm for computing principal components in a large-scale setup. Our algorithm first runs EigenGame to get an approximation of the principal components, and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use of EigenGame, this second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of EigenGame is to narrow down the search space and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon EigenGame under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we achieve improvements in convergence speed by factors of 5-25 on the datasets of the original EigenGame paper.
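A minimal sketch of the two-step idea: obtain an approximate basis for the top-k principal subspace (a few subspace iterations stand in for EigenGame here, which is not reimplemented), then run an exact, cheap PCA restricted to that span. All dimensions and the surrogate first step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 300)) @ rng.normal(size=(300, 300))
X -= X.mean(axis=0)                         # centered data matrix
k = 10

# Step 1: cheap approximation of the top-k subspace (EigenGame stand-in).
Q, _ = np.linalg.qr(rng.normal(size=(300, k)))
for _ in range(2):
    Q, _ = np.linalg.qr(X.T @ (X @ Q))      # subspace (power) iteration

# Step 2: exact PCA restricted to span(Q) -- a cheap k-dimensional problem.
Y = X @ Q                                   # coordinates in the subspace
_, _, Wt = np.linalg.svd(Y, full_matrices=False)
components = Q @ Wt.T                       # refined principal directions

exact = np.linalg.svd(X, full_matrices=False)[2][:k]  # ground truth
print(np.abs(np.diag(exact @ components)).round(3))   # alignment per component
```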
【27】 Self-explaining variational posterior distributions for Gaussian Process models
Link: https://arxiv.org/abs/2109.03708
Authors: Sarem Seitz
Affiliation: Department of Information Systems and Applied Computer Science, University of Bamberg, Bamberg, Germany
Abstract: Bayesian methods have become a popular way to incorporate prior knowledge and a notion of uncertainty into machine learning models. At the same time, the complexity of modern machine learning makes it challenging to comprehend a model's reasoning process, let alone express specific prior assumptions in a rigorous manner. While primarily interested in the former issue, recent developments in transparent machine learning could also broaden the range of prior information that we can provide to complex Bayesian models. Inspired by the idea of self-explaining models, we introduce a corresponding concept for variational Gaussian Processes. On the one hand, our contribution improves transparency for these types of models. More importantly though, our proposed self-explaining variational posterior distribution allows us to incorporate both general prior knowledge about a target function as a whole and prior knowledge about the contribution of individual features.
【28】 YAHPO Gym -- Design Criteria and a new Multifidelity Benchmark for Hyperparameter Optimization
Link: https://arxiv.org/abs/2109.03670
Authors: Florian Pfisterer, Lennart Schneider, Julia Moosbauer, Martin Binder, Bernd Bischl
Affiliation: Ludwig Maximilian University of Munich
Note: Preprint, under review; 17 pages, 4 tables, 5 figures
Abstract: When developing and analyzing new hyperparameter optimization (HPO) methods, it is vital to empirically evaluate and compare them on well-curated benchmark suites. In this work, we list desirable properties and requirements for such benchmarks and propose a new set of challenging and relevant multifidelity HPO benchmark problems motivated by these requirements. For this, we revisit the concept of surrogate-based benchmarks and empirically compare them to more widely used tabular benchmarks, showing that the latter may induce bias in performance estimation and ranking of HPO methods. We present a new surrogate-based benchmark suite for multifidelity HPO methods consisting of 9 benchmark collections that constitute over 700 multifidelity HPO problems in total. All our benchmarks also allow for querying of multiple optimization targets, enabling the benchmarking of multi-objective HPO. We examine and compare our benchmark suite with respect to the defined requirements and show that our benchmarks provide viable additions to existing suites.
【29】 Entangled Datasets for Quantum Machine Learning
Link: https://arxiv.org/abs/2109.03400
Authors: Louis Schatzki, Andrew Arrasmith, Patrick J. Coles, M. Cerezo
Affiliation: Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA; Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Note: 12 pages, 10 figures, 1 table
Abstract: High-quality, large-scale datasets have played a crucial role in the development and success of classical machine learning. Quantum Machine Learning (QML) is a new field that aims to use quantum computers for data analysis, with the hope of obtaining a quantum advantage of some sort. While most proposed QML architectures are benchmarked using classical datasets, there is still doubt whether QML on classical datasets will achieve such an advantage. In this work, we argue that one should instead employ quantum datasets composed of quantum states. For this purpose, we introduce the NTangled dataset composed of quantum states with different amounts and types of multipartite entanglement. We first show how a quantum neural network can be trained to generate the states in the NTangled dataset. Then, we use the NTangled dataset to benchmark QML models for supervised learning classification tasks. We also consider an alternative entanglement-based dataset, which is scalable and is composed of states prepared by quantum circuits with different depths. As a byproduct of our results, we introduce a novel method for generating multipartite entangled states, providing a use-case of quantum neural networks for quantum entanglement theory.