Visit www.arxivdaily.com for daily digests with abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, favorites, and posting features.
stat (Statistics): 39 papers in total
【1】 A New Robust Multivariate Mode Estimator for Eye-tracking Calibration
Authors: Adrien Brilhault, Sergio Neuenschwander, Ricardo Araujo Rios
Link: https://arxiv.org/abs/2107.08030
Abstract: We propose in this work a new method for estimating the main mode of multivariate distributions, with application to eye-tracking calibrations. When performing eye-tracking experiments with poorly cooperative subjects, such as infants or monkeys, the calibration data generally suffer from high contamination. Outliers are typically organized in clusters, corresponding to the time intervals when subjects were not looking at the calibration points. In this type of multimodal distribution, most central tendency measures fail at estimating the principal fixation coordinates (the first mode), resulting in errors and inaccuracies when mapping the gaze to the screen coordinates. Here, we develop a new algorithm to identify the first mode of multivariate distributions, named BRIL, which relies on recursive depth-based filtering. This novel approach was tested on artificial mixtures of Gaussian and Uniform distributions, and compared to existing methods (conventional depth medians, robust estimators of location and scatter, and clustering-based approaches). We obtained outstanding performance, even for distributions containing very high proportions of outliers, both grouped in clusters and randomly distributed. Finally, we demonstrate the strength of our method in a real-world scenario using experimental data from eye-tracking calibrations with Capuchin monkeys, especially for distributions where other algorithms typically lack accuracy.
【2】 A study on reliability of a k-out-of-n system equipped with a cold standby component based on copula
Authors: Achintya Roy, Nitin Gupta
Affiliations: Department of Mathematics, Indian Institute of Technology Kharagpur, West Bengal, India
Comments: 14 pages, 5 figures
Link: https://arxiv.org/abs/2107.08023
Abstract: A $k$-out-of-$n$ system consisting of $n$ exchangeable components equipped with a single cold standby component is considered. All the components including the standby are assumed to be dependent, and this dependence is modeled by a copula function. An exact expression for the reliability function of the considered system has been obtained. We also compute three different mean residual life functions of the system. Finally, some numerical examples are provided to illustrate the theoretical results. Our studies subsume some of the initial studies in the literature.
【3】 Online Graph Topology Learning from Matrix-valued Time Series
Authors: Yiye Jiang, Jérémie Bigot, Sofian Maabout
Affiliations: Institut de Mathématiques de Bordeaux, Université de Bordeaux; Laboratoire Bordelais de Recherche en Informatique, Université de Bordeaux
Link: https://arxiv.org/abs/2107.08020
Abstract: This paper is concerned with the statistical analysis of matrix-valued time series. These are data collected over a network of sensors (typically a set of spatial locations), recording, over time, observations of multiple measurements. From such data, we propose to learn, in an online fashion, a graph that captures two aspects of dependency: one describing the sparse spatial relationship between sensors, and the other characterizing the measurement relationship. To this end, we introduce a novel multivariate autoregressive model to infer the graph topology encoded in the coefficient matrix, which captures the sparse Granger causality dependency structure present in such matrix-valued time series. We decompose the graph by imposing a Kronecker sum structure on the coefficient matrix. We develop two online approaches to learn the graph in a recursive way. The first one uses a Wald test for the projected OLS estimation, where we derive the asymptotic distribution for the estimator. For the second one, we formalize a Lasso-type optimization problem. We rely on homotopy algorithms to derive updating rules for estimating the coefficient matrix. Furthermore, we provide an adaptive tuning procedure for the regularization parameter. Numerical experiments using both synthetic and real data are performed to support the effectiveness of the proposed learning approaches.
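For readers unfamiliar with the Kronecker-sum decomposition above, a minimal numpy sketch may help. The dimensions and coefficient matrices below are invented for illustration, and plain offline OLS stands in for the paper's online Wald/Lasso procedures: it builds a coefficient matrix C = A ⊗ I_m + I_n ⊗ B for a vectorized matrix-valued VAR(1), simulates the process, and recovers C from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                      # n sensors, m measurements per sensor
# A: (sparse) spatial relations between sensors; B: relations between measurements
A = 0.10 * np.eye(n); A[0, 1] = A[1, 0] = 0.05
B = 0.10 * np.eye(m); B[0, 2] = 0.05
# Kronecker-sum coefficient matrix acting on vec(X_t)
C = np.kron(A, np.eye(m)) + np.kron(np.eye(n), B)

# simulate the vectorized matrix-valued VAR(1): y_t = C y_{t-1} + noise
T = 500
Y = np.zeros((T, n * m))
for t in range(1, T):
    Y[t] = C @ Y[t - 1] + rng.standard_normal(n * m)

# offline OLS estimate of C (the paper instead uses recursive/online estimators)
C_hat = np.linalg.lstsq(Y[:-1], Y[1:], rcond=None)[0].T
```

The Kronecker-sum structure keeps the number of free parameters at n² + m² rather than (nm)², which is what makes the graph decomposition into a sensor graph (A) and a measurement graph (B) tractable.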
【4】 Bootstrapping Through Discrete Convolutional Methods
Authors: Jared M. Clark, Richard L. Warr
Affiliations: Department of Statistics, Brigham Young University, Provo, UT, USA
Link: https://arxiv.org/abs/2107.08019
Abstract: Bootstrapping was designed to randomly resample data from a fixed sample using Monte Carlo techniques. However, the original sample itself defines a discrete distribution. Convolutional methods are well suited for discrete distributions, and we show the advantages of utilizing these techniques for bootstrapping. The discrete convolutional approach can provide exact numerical solutions for bootstrap quantities, or at least mathematical error bounds. In contrast, Monte Carlo bootstrap methods can only provide confidence intervals which converge slowly. Additionally, for some problems the computation time of the convolutional approach can be dramatically less than that of Monte Carlo resampling. This article provides several examples of bootstrapping using the proposed convolutional technique and compares the results to those of the Monte Carlo bootstrap and of the competing saddlepoint method.
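The convolutional idea is easy to demonstrate for the bootstrap distribution of a sample sum: the empirical pmf is convolved with itself n times, giving an exact answer that Monte Carlo resampling can only approximate. A minimal sketch with a made-up toy sample (not data from the article):

```python
import numpy as np

data = np.array([0, 1, 1, 2, 4])        # a small sample on an integer grid
n = len(data)
# empirical pmf on 0..max(data): the discrete distribution the bootstrap resamples from
pmf = np.bincount(data) / n
# exact pmf of the bootstrap *sum* of n draws: n-fold self-convolution
sum_pmf = np.array([1.0])
for _ in range(n):
    sum_pmf = np.convolve(sum_pmf, pmf)
# exact bootstrap probability that the resampled mean exceeds the observed mean
support = np.arange(len(sum_pmf))
exact_tail = sum_pmf[support / n > data.mean()].sum()

# Monte Carlo bootstrap approximation of the same tail probability
rng = np.random.default_rng(1)
mc = rng.choice(data, size=(100_000, n)).mean(axis=1)
mc_tail = (mc > data.mean()).mean()
```

With 100,000 Monte Carlo resamples the tail probability still carries sampling error of order 10⁻³, whereas the convolution result is exact up to floating point, illustrating the contrast the abstract draws.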
【5】 Efficient Bayesian Sampling Using Normalizing Flows to Assist Markov Chain Monte Carlo Methods
Authors: Marylou Gabrié, Grant M. Rotskoff, Eric Vanden-Eijnden
Affiliations: Stanford University, Stanford, CA 94305; Courant Institute, New York University
Link: https://arxiv.org/abs/2107.08001
Abstract: Normalizing flows can generate complex target distributions and thus show promise in many applications in Bayesian statistics as an alternative or complement to MCMC for sampling posteriors. Since no data set from the target posterior distribution is available beforehand, the flow is typically trained using the reverse Kullback-Leibler (KL) divergence that only requires samples from a base distribution. This strategy may perform poorly when the posterior is complicated and hard to sample with an untrained normalizing flow. Here we explore a distinct training strategy, using the direct KL divergence as loss, in which samples from the posterior are generated by (i) assisting a local MCMC algorithm on the posterior with a normalizing flow to accelerate its mixing rate and (ii) using the data generated this way to train the flow. The method only requires a limited amount of a priori input about the posterior, and can be used to estimate the evidence required for model validation, as we illustrate on examples.
【6】 Clarifying Selection Bias in Cluster Randomized Trials: Estimands and Estimation
Authors: Fan Li, Zizhong Tian, Jennifer Bobb, Georgia Papadogeorgou, Fan Li
Affiliations: Department of Biostatistics, Yale University School of Public Health, New Haven, Connecticut; Department of Statistics, University of Florida, Gainesville, Florida
Comments: Keywords: causal inference, cluster randomized trial, intention-to-treat, heterogeneous treatment effect, post-randomization, principal stratification
Link: https://arxiv.org/abs/2107.07967
Abstract: In cluster randomized trials (CRTs), patients are typically recruited after clusters are randomized, and the recruiters and patients are not blinded to the assignment. This leads to a differential recruitment process and systematic differences in the baseline characteristics of the recruited patients between intervention and control arms, inducing post-randomization selection bias. We rigorously define the causal estimands in the presence of post-randomization confounding. We elucidate the conditions under which standard covariate adjustment methods can validly estimate these estimands. We discuss the additional data and assumptions necessary for estimating the causal effects when such conditions are not met. Adopting the principal stratification framework in causal inference, we clarify that there are two intention-to-treat (ITT) causal estimands in CRTs: one for the overall population and one for the recruited population. We derive the analytical formula of the two estimands in terms of principal-stratum-specific causal effects. We assess the empirical performance of two common covariate adjustment methods, multivariate regression and propensity score weighting, under different data generating processes. When treatment effects are heterogeneous across principal strata, the ITT effect on the overall population differs from the ITT effect on the recruited population. A naive ITT analysis of the recruited sample leads to biased estimates of both ITT effects. In the presence of post-randomization selection and without additional data on the non-recruited subjects, the ITT effect on the recruited population is estimable only when the treatment effects are homogeneous between principal strata, and the ITT effect on the overall population is generally not estimable. The extent to which covariate adjustment can remove selection bias depends on the degree of effect heterogeneity across principal strata.
【7】 Nearly Unstable Integer-Valued ARCH Process and Unit Root Testing
Authors: Wagner Barreto-Souza, Ngai Hang Chan
Affiliations: Statistics Program, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Department of Statistics, The Chinese University of Hong Kong, Hong Kong
Comments: Paper submitted for publication
Link: https://arxiv.org/abs/2107.07963
Abstract: This paper introduces a Nearly Unstable INteger-valued AutoRegressive Conditional Heteroskedasticity (NU-INARCH) process for dealing with count time series data. It is proved that a proper normalization of the NU-INARCH process endowed with a Skorohod topology weakly converges to a Cox-Ingersoll-Ross diffusion. The asymptotic distribution of the conditional least squares estimator of the correlation parameter is established as a functional of certain stochastic integrals. Numerical experiments based on Monte Carlo simulations are provided to verify the behavior of the asymptotic distribution under finite samples. These simulations reveal that the nearly unstable approach provides satisfactory results and outperforms those based on the stationarity assumption even when the true process is not that close to non-stationarity. A unit root test is proposed and its Type-I error and power are examined via Monte Carlo simulations. As an illustration, the proposed methodology is applied to the daily number of deaths due to COVID-19 in the United Kingdom.
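As a point of reference for the model being studied: an INARCH(1) process sets X_t | past ~ Poisson(ω + α X_{t-1}), and the conditional least squares estimator of (ω, α) reduces to a linear regression of X_t on (1, X_{t-1}). A sketch with made-up parameters in the nearly unstable regime (α close to 1); this is an illustration, not the paper's simulation design:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_inarch1(omega, alpha, T, rng):
    """Simulate an INARCH(1) count series: X_t | past ~ Poisson(omega + alpha * X_{t-1})."""
    x = np.empty(T, dtype=int)
    x[0] = rng.poisson(omega)
    for t in range(1, T):
        x[t] = rng.poisson(omega + alpha * x[t - 1])
    return x

# nearly unstable regime: alpha close to (but below) one
x = simulate_inarch1(omega=2.0, alpha=0.95, T=5000, rng=rng)

# conditional least squares: regress X_t on (1, X_{t-1})
R = np.column_stack([np.ones(len(x) - 1), x[:-1]])
omega_hat, alpha_hat = np.linalg.lstsq(R, x[1:], rcond=None)[0]
```

As α approaches 1 the standard stationary asymptotics degrade, which is what motivates the paper's Cox-Ingersoll-Ross limit theory for the normalized process.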
【8】 Estimation of Li-ion degradation test sample sizes required to understand cell-to-cell variability
Authors: Philipp Dechent, Samuel Greenbank, Felix Hildenbrand, Saad Jbabdi, Dirk Uwe Sauer, David A. Howey
Affiliations: Department of Engineering, University of Oxford, UK; Helmholtz Institute Münster (HI MS)
Comments: 13 pages, 9 figures
Link: https://arxiv.org/abs/2107.07881
Abstract: Ageing of lithium-ion batteries results in irreversible reduction in performance. Intrinsic variability between cells, caused by manufacturing differences, occurs at all stages of life and increases with age. Researchers need to know the minimum number of cells they should test to give an accurate representation of population variability, since testing many cells is expensive. In this paper, empirical capacity-versus-time ageing models were fitted to various degradation datasets, assuming that the model parameters could be drawn from a distribution describing a larger population. Using a hierarchical Bayesian approach, we estimated the number of cells required to be tested. Depending on the complexity of the ageing model, models with 1, 2 or 3 parameters respectively required data from at least 9, 11 or 13 cells for a consistent fit. This implies that researchers will need to test at least these numbers of cells at each test point in their experiment to capture manufacturing variability.
【9】 A Penalized Shared-parameter Algorithm for Estimating Optimal Dynamic Treatment Regimens
Authors: Trikay Nalamada, Shruti Agarwal, Maria Jahja, Bibhas Chakraborty, Palash Ghosh
Affiliations: Department of Mathematics, Indian Institute of Technology Guwahati, Assam, India; Department of Statistics, North Carolina State University, USA; Centre for Quantitative Medicine, Duke-NUS Medical School, National University of Singapore, Singapore
Link: https://arxiv.org/abs/2107.07875
Abstract: A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition in which Q-shared fails. Leveraging properties from expansion-constrained ordinary least-squares, we give a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.
【10】 A Causal Perspective on Meaningful and Robust Algorithmic Recourse
Authors: Gunnar König, Timo Freiesleben, Moritz Grosse-Wentrup
Affiliations: Institute for Statistics, University of Vienna; Munich Center for Mathematical Philosophy
Comments: ICML (International Conference on Machine Learning) Workshop on Algorithmic Recourse
Link: https://arxiv.org/abs/2107.07853
Abstract: Algorithmic recourse explanations inform stakeholders on how to act to revert unfavorable predictions. However, in general, ML models do not predict well in interventional distributions. Thus, an action that changes the prediction in the desired way may not lead to an improvement of the underlying target. Such recourse is neither meaningful nor robust to model refits. Extending the work of Karimi et al. (2021), we propose meaningful algorithmic recourse (MAR) that only recommends actions that improve both prediction and target. We justify this selection constraint by highlighting the differences between model audit and meaningful, actionable recourse explanations. Additionally, we introduce a relaxation of MAR called effective algorithmic recourse (EAR), which, under certain assumptions, yields meaningful recourse by only allowing interventions on causes of the target.
【11】 Aggregating estimates by convex optimization
Authors: Anatoli Juditsky, Arkadi Nemirovski
Affiliations: Georgia Institute of Technology
Link: https://arxiv.org/abs/2107.07836
Abstract: We discuss the approach to estimate aggregation and adaptive estimation based upon (nearly optimal) testing of convex hypotheses. We show that in the situation where the observations stem from "simple observation schemes" and where the set of unknown signals is a finite union of convex and compact sets, the proposed approach leads to aggregation and adaptation routines with nearly optimal performance. As an illustration, we consider application of the proposed estimates to the problem of recovery of an unknown signal known to belong to a union of ellitopes in the Gaussian observation scheme. The proposed approach can be implemented efficiently when the number of sets in the union is "not very large." We conclude the paper with a small simulation study illustrating the practical performance of the proposed procedures in the problem of signal estimation in the single-index model.
【12】 Chi-square and normal inference in high-dimensional multi-task regression
Authors: Pierre C. Bellec, Gabriel Romon
Link: https://arxiv.org/abs/2107.07828
Abstract: The paper proposes chi-square and normal inference methodologies for the unknown coefficient matrix $B^*$ of size $p\times T$ in a Multi-Task (MT) linear model with $p$ covariates, $T$ tasks and $n$ observations, under a row-sparse assumption on $B^*$. The row-sparsity $s$, dimension $p$ and number of tasks $T$ are allowed to grow with $n$. In the high-dimensional regime $p\ggg n$, in order to leverage row-sparsity, the MT Lasso is considered. We build upon the MT Lasso with a de-biasing scheme to correct for the bias induced by the penalty. This scheme requires the introduction of a new data-driven object, coined the interaction matrix, that captures effective correlations between the noise vector and residuals on different tasks. This matrix is psd, of size $T\times T$, and can be computed efficiently. The interaction matrix lets us derive asymptotic normal and $\chi^2_T$ results under Gaussian design and $\frac{sT+s\log(p/s)}{n}\to 0$, which corresponds to consistency in Frobenius norm. These asymptotic distribution results yield valid confidence intervals for single entries of $B^*$ and valid confidence ellipsoids for single rows of $B^*$, for both known and unknown design covariance $\Sigma$. While previous proposals in grouped-variables regression require row-sparsity $s\lesssim\sqrt n$ up to constants depending on $T$ and logarithmic factors in $n,p$, the de-biasing scheme using the interaction matrix provides confidence intervals and $\chi^2_T$ confidence ellipsoids under the conditions $\min(T^2,\log^8 p)/n\to 0$ and $$\frac{sT+s\log(p/s)+\|\Sigma^{-1}e_j\|_0\log p}{n}\to 0, \quad \frac{\min(s,\|\Sigma^{-1}e_j\|_0)}{\sqrt n}\sqrt{[T+\log(p/s)]\log p}\to 0,$$ allowing row-sparsity $s\ggg\sqrt n$ when $\|\Sigma^{-1}e_j\|_0\sqrt T\lll\sqrt n$ up to logarithmic factors.
【13】 Intrinsic Dimension Adaptive Partitioning for Kernel Methods
Authors: Thomas Hamm, Ingo Steinwart
Affiliations: Institute for Stochastics and Applications, Faculty of Mathematics and Physics, University of Stuttgart, Stuttgart, Germany
Comments: 36 pages, 5 figures, 2 tables
Link: https://arxiv.org/abs/2107.07750
Abstract: We prove minimax optimal learning rates for kernel ridge regression, resp. support vector machines, based on a data-dependent partition of the input space, where the dependence on the dimension of the input space is replaced by the fractal dimension of the support of the data-generating distribution. We further show that these optimal rates can be achieved by a training-validation procedure without any prior knowledge on this intrinsic dimension of the data. Finally, we conduct extensive experiments which demonstrate that our considered learning methods are actually able to generalize from a dataset that is non-trivially embedded in a much higher dimensional space just as well as from the original dataset.
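As a rough illustration of the partitioning idea (a 1-D toy with a median split standing in for the paper's data-dependent partitions, with arbitrary kernel and ridge parameters), one can fit an independent Gaussian-kernel ridge regression on each cell:

```python
import numpy as np

rng = np.random.default_rng(0)

def krr_fit(X, y, gamma, lam):
    """Gaussian-kernel ridge regression on one cell; returns training inputs and dual coefficients."""
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    return X, np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(model, Xnew, gamma):
    Xtr, alpha = model
    return np.exp(-gamma * (Xnew[:, None] - Xtr[None, :]) ** 2) @ alpha

# toy 1-D data; partition the input space at the median and fit one local KRR per cell
X = rng.uniform(-1, 1, 400)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(400)
split = np.median(X)
left, right = X <= split, X > split
models = {"left": krr_fit(X[left], y[left], 20.0, 1e-2),
          "right": krr_fit(X[right], y[right], 20.0, 1e-2)}

# predict on a test grid, routing each point to its cell's local model
Xtest = np.linspace(-0.9, 0.9, 200)
pred = np.empty_like(Xtest)
mask = Xtest <= split
pred[mask] = krr_predict(models["left"], Xtest[mask], 20.0)
pred[~mask] = krr_predict(models["right"], Xtest[~mask], 20.0)
mse = np.mean((pred - np.sin(3 * Xtest)) ** 2)
```

Each cell only ever solves a linear system of its own size, which is what makes partition-based kernel methods scale; the paper's contribution is showing that such schemes adapt to the intrinsic (fractal) dimension of the data.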
【14】 Nearest-Neighbor Geostatistical Models for Non-Gaussian Data
Authors: Xiaotian Zheng, Athanasios Kottas, Bruno Sansó
Affiliations: Department of Statistics, University of California Santa Cruz
Link: https://arxiv.org/abs/2107.07736
Abstract: We develop a class of nearest neighbor mixture transition distribution process (NNMP) models that provides flexibility and scalability for non-Gaussian geostatistical data. We use a directed acyclic graph to define a proper spatial process with finite-dimensional distributions given by finite mixtures. We develop conditions to construct general NNMP models with pre-specified stationary marginal distributions. We also establish lower bounds for the strength of the tail dependence implied by NNMP models, demonstrating the flexibility of the proposed methodology for modeling multivariate dependence through bivariate distribution specification. To implement inference and prediction, we formulate a Bayesian hierarchical model for the data, using the NNMP prior model for the spatial random effects process. From an inferential point of view, the NNMP model lays out a new computational approach to handling large spatial data sets, leveraging the mixture model structure to avoid computational issues that arise from large matrix operations. We illustrate the benefits of the NNMP modeling framework using synthetic data examples and through analysis of sea surface temperature data from the Mediterranean Sea.
【15】 Auto-differentiable Ensemble Kalman Filters
Authors: Yuming Chen, Daniel Sanz-Alonso, Rebecca Willett
Affiliations: University of Chicago
Link: https://arxiv.org/abs/2107.07687
Abstract: Data assimilation is concerned with sequentially estimating a temporally-evolving state. This task, which arises in a wide range of scientific and engineering applications, is particularly challenging when the state is high-dimensional and the state-space dynamics are unknown. This paper introduces a machine learning framework for learning dynamical systems in data assimilation. Our auto-differentiable ensemble Kalman filters (AD-EnKFs) blend ensemble Kalman filters for state recovery with machine learning tools for learning the dynamics. In doing so, AD-EnKFs leverage the ability of ensemble Kalman filters to scale to high-dimensional states and the power of automatic differentiation to train high-dimensional surrogate models for the dynamics. Numerical results using the Lorenz-96 model show that AD-EnKFs outperform existing methods that use expectation-maximization or particle filters to merge data assimilation and machine learning. In addition, AD-EnKFs are easy to implement and require minimal tuning.
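For context, the analysis step of a stochastic ensemble Kalman filter, the building block that AD-EnKFs differentiate through, fits in a few lines of numpy. The sketch below uses a toy 2-D state with invented noise levels and observation operator, and is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(7)

def enkf_update(ens, y_obs, H, R, rng):
    """Stochastic EnKF analysis step; ens is the (N, d) forecast ensemble."""
    N = ens.shape[0]
    X = ens - ens.mean(axis=0)              # ensemble anomalies
    P = X.T @ X / (N - 1)                   # ensemble covariance estimate
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
    # perturbed observations, one per ensemble member
    y_pert = y_obs + rng.multivariate_normal(np.zeros(len(y_obs)), R, size=N)
    return ens + (y_pert - ens @ H.T) @ K.T

# toy example: 2-D state, observe only the first coordinate
d, N = 2, 500
truth = np.array([1.0, -0.5])
H = np.array([[1.0, 0.0]])
R = np.array([[0.1]])
prior = truth + rng.standard_normal((N, d))  # forecast ensemble scattered around the truth
y = H @ truth + rng.multivariate_normal([0.0], R)
post = enkf_update(prior, y, H, R, rng)
```

The update contracts the ensemble spread in the observed coordinate toward the observation; because every operation is a plain matrix computation, automatic differentiation can propagate gradients through many such steps, which is the mechanism AD-EnKFs exploit.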
【16】 Predicting Drought and Subsidence Risks in France
Authors: Arthur Charpentier, Molly James, Hani Ali
Affiliations: Université du Québec à Montréal (UQAM), Montréal (Québec), Canada; EURo Institut d'Actuariat (EURIA), Université de Brest, France; Willis Re, Paris, France
Link: https://arxiv.org/abs/2107.07668
Abstract: The economic consequences of drought episodes are increasingly important, although they are often difficult to apprehend, in part because of the complexity of the underlying mechanisms. In this article, we will study one of the consequences of drought, namely the risk of subsidence (or more specifically clay shrinkage induced subsidence), for which insurance has been mandatory in France for several decades. Using data obtained from several insurers, representing about a quarter of the household insurance market, over the past twenty years, we propose some statistical models to predict the frequency but also the intensity of these droughts, for insurers, showing that climate change will probably have major economic consequences on this risk. But even if we use more advanced models than standard regression-type models (here random forests to capture non-linearity and cross effects), it is still difficult to predict the economic cost of subsidence claims, even if all geophysical and climatic information is available.
【17】 Bayesian Markov Renewal Mixed Models for Vocalization Syntax
Authors: Yutong Wu, Erich D. Jarvis, Abhra Sarkar
Comments: 34 pages, 10 figures
Link: https://arxiv.org/abs/2107.07648
Abstract: Studying the neurological, genetic and evolutionary basis of human vocal communication mechanisms is an important field of neuroscience. In the absence of high quality data on humans, mouse vocalization experiments in laboratory settings have been proven to be useful in providing valuable insights into mammalian vocal development and evolution, including especially the impact of certain genetic mutations. Data sets from mouse vocalization experiments usually consist of categorical syllable sequences along with continuous inter-syllable interval times for mice of different genotypes vocalizing under various contexts. Few statistical models have considered the inference for both transition probabilities and inter-state intervals. The latter is of particular importance as increased inter-state intervals can be an indication of possible vocal impairment. In this paper, we propose a class of novel Markov renewal mixed models that capture the stochastic dynamics of both state transitions and inter-state interval times. Specifically, we model the transition dynamics and the inter-state intervals using Dirichlet and gamma mixtures, respectively, allowing the mixture probabilities in both cases to vary flexibly with fixed covariate effects as well as random individual-specific effects. We apply our model to analyze the impact of a mutation in the Foxp2 gene on mouse vocal behavior. We find that genotypes and social contexts significantly affect the inter-state interval times but, compared to previous analyses, the influences of genotype and social context on the syllable transition dynamics are weaker.
【18】 Obtaining Causal Information by Merging Datasets with MAXENT
Authors: Sergio Hernan Garrido Mejia, Elke Kirschbaum, Dominik Janzing
Affiliations: Amazon Research, Tübingen, Germany; Danmarks Tekniske Universitet, Lyngby, Denmark
Link: https://arxiv.org/abs/2107.07640
Abstract: The investigation of the question "which treatment has a causal effect on a target variable?" is of particular relevance in a large number of scientific disciplines. This challenging task becomes even more difficult if not all treatment variables were or even cannot be observed jointly with the target variable. Another similarly important and challenging task is to quantify the causal influence of a treatment on a target in the presence of confounders. In this paper, we discuss how causal knowledge can be obtained without having observed all variables jointly, but by merging the statistical information from different datasets. We first show how the maximum entropy principle can be used to identify edges among random variables when assuming causal sufficiency and an extended version of faithfulness. Additionally, we derive bounds on the interventional distribution and the average causal effect of a treatment on a target variable in the presence of confounders. In both cases we assume that only subsets of the variables have been observed jointly.
【19】 Quantifying the economic response to COVID-19 mitigations and death rates via forecasting Purchasing Managers' Indices using Generalised Network Autoregressive models with exogenous variables 标题:用带外生变量的广义网络自回归模型预测采购经理指数,量化冠状病毒缓解和死亡率的经济反应
作者:Guy P Nason,James L Wei 备注:To be read before the Royal Statistical Society at the Society's 2021 annual conference held in Manchester on Wednesday, September 8th 2021, the President, Professor Sylvia Richardson, in the Chair. Accepted by the Journal of the Royal Statistical Society, Series A 链接:https://arxiv.org/abs/2107.07605 摘要:了解各经济体的当前状况、它们对COVID-19缓解措施和指标的反应,以及它们未来可能的走向,是十分重要的。我们使用最近发展的广义网络自回归(GNAR)模型,使用贸易决定的网络,对一些国家的采购经理指数进行建模和预测。我们使用连接国家的网络,这些国家的连接本身或其权重由国家之间的出口贸易程度决定。我们将这些模型扩展到包括节点特定的时间序列外生变量(GNARX模型),利用此模型将COVID-19缓解严格性指数和COVID-19死亡率纳入我们的分析。在均方预测误差方面,高度简约的GNAR模型明显优于向量自回归模型,我们的GNARX模型本身也优于GNAR模型。进一步的混合频率模型预测了英国经济将在多大程度上受到更严厉、更弱或没有干预措施的影响。 摘要:Knowledge of the current state of economies, how they respond to COVID-19 mitigations and indicators, and what the future might hold for them is important. We use recently-developed generalised network autoregressive (GNAR) models, using trade-determined networks, to model and forecast the Purchasing Managers' Indices for a number of countries. We use networks that link countries where the links themselves, or their weights, are determined by the degree of export trade between the countries. We extend these models to include node-specific time series exogenous variables (GNARX models), using this to incorporate COVID-19 mitigation stringency indices and COVID-19 death rates into our analysis. The highly parsimonious GNAR models considerably outperform vector autoregressive models in terms of mean-squared forecasting error and our GNARX models themselves outperform GNAR ones. Further mixed frequency modelling predicts the extent to which the UK economy will be affected by harsher, weaker or no interventions.
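The GNAR mechanism is easy to sketch: each node's series is driven by its own lag plus an average of its network neighbours' lags. Below is a toy three-node network with assumed coefficients, simulated and then re-estimated by ordinary least squares; this illustrates the model class only and is not the authors' implementation.

```python
import random

random.seed(0)

# Toy trade network on 3 countries: country 1 is linked to countries 0 and 2.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
alpha, beta = 0.4, 0.3          # own-lag and neighbour-average coefficients (assumed)

T = 500
x = {i: [0.0] for i in neighbours}
for t in range(1, T):
    for i in neighbours:
        nb = neighbours[i]
        nb_avg = sum(x[j][t - 1] for j in nb) / len(nb)
        x[i].append(alpha * x[i][t - 1] + beta * nb_avg + random.gauss(0, 1))

# Re-estimate (alpha, beta) for node 1 by least squares on its two regressors.
ys = [x[1][t] for t in range(1, T)]
z1 = [x[1][t - 1] for t in range(1, T)]                      # own lag
z2 = [(x[0][t - 1] + x[2][t - 1]) / 2 for t in range(1, T)]  # neighbour average

# Solve the 2x2 normal equations by Cramer's rule.
a11 = sum(v * v for v in z1)
a12 = sum(u * v for u, v in zip(z1, z2))
a22 = sum(v * v for v in z2)
b1 = sum(u * v for u, v in zip(z1, ys))
b2 = sum(u * v for u, v in zip(z2, ys))
det = a11 * a22 - a12 * a12
alpha_hat = (b1 * a22 - a12 * b2) / det
beta_hat = (a11 * b2 - a12 * b1) / det
print(round(alpha_hat, 2), round(beta_hat, 2))   # close to the true (0.4, 0.3)
```

The GNARX extension of the paper would add exogenous regressors (stringency indices, death rates) to each node's equation in the same least-squares form.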
【20】 Optimal-Design Domain-Adaptation for Exposure Prediction in Two-Stage Epidemiological Studies 标题:两阶段流行病学研究中暴露预测的最优设计域适应
作者:Ron Sarafian,Itai Kloog,Jonathan D. Rosenblatt 链接:https://arxiv.org/abs/2107.07602 摘要:在两阶段研究的第一阶段,研究者使用统计模型来估算未观察到的暴露。在第二阶段,估算暴露作为流行病学模型中的协变量。第一阶段的插补误差与第二阶段的测量误差相同,从而导致暴露效应估计的偏差。本研究旨在透过第一阶段与第二阶段的资讯分享,以改善暴露效果的评估。我们估计量的核心是观察到并非所有的第二阶段观察都同样重要。因此,我们借鉴了最优实验设计理论的思想,来识别更重要的个体。然后,我们利用机器学习领域适应文献中的观点改进了这些个体的插补。我们的模拟证实,暴露效应估计比当前的最佳实践更准确。实证结果显示,PM对高血糖风险的影响较小,置信区间较紧。在环境科学家和流行病学家之间共享信息可以提高对健康影响的估计。我们的估计器是利用这种信息交换的原则性方法,可以应用于任何两阶段的研究。 摘要:In the first stage of a two-stage study, the researcher uses a statistical model to impute the unobserved exposures. In the second stage, imputed exposures serve as covariates in epidemiological models. Imputation error in the first stage operate as measurement errors in the second stage, and thus bias exposure effect estimates. This study aims to improve the estimation of exposure effects by sharing information between the first and second stage. At the heart of our estimator is the observation that not all second-stage observations are equally important to impute. We thus borrow ideas from the optimal-experimental-design theory, to identify individuals of higher importance. We then improve the imputation of these individuals using ideas from the machine-learning literature of domain-adaptation. Our simulations confirm that the exposure effect estimates are more accurate than the current best practice. An empirical demonstration yields smaller estimates of PM effect on hyperglycemia risk, with tighter confidence bands. Sharing information between environmental scientist and epidemiologist improves health effect estimates. Our estimator is a principled approach for harnessing this information exchange, and may be applied to any two stage study.
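The idea that not all observations are equally important echoes the classical optimal-design quantity of leverage, $h_{ii} = x_i'(X'X)^{-1}x_i$. A small numpy sketch; the design matrix and the use of raw leverage as an importance score are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # n=100, p=3 design

# Leverage of observation i: h_ii = x_i' (X'X)^{-1} x_i, the hat-matrix diagonal.
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Classical facts: 0 <= h_ii <= 1 and the leverages sum to p (here 3).
most_important = np.argsort(h)[-10:]   # candidates for improved imputation effort
print(h.sum(), h.max())
```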
【21】 Ranked Sparsity: A Cogent Regularization Framework for Selecting and Estimating Feature Interactions and Polynomials 标题:排序稀疏性:一个用于选择和估计特征交互项与多项式的有说服力的正则化框架
作者:Ryan A. Peterson,Joseph E. Cavanaugh 链接:https://arxiv.org/abs/2107.07594 摘要:我们探讨并说明了排序稀疏性的概念,当不同特征集之间的信息质量存在预期差异时,这种现象通常会在建模应用程序中自然发生。它的存在会导致传统和现代的模型选择方法失败,因为这样的程序通常假定每个潜在参数都同样值得进入最终模型——我们称这种假设为“协变量均衡”。然而,这种假设并不总是成立的,特别是在衍生变量的存在。例如,当所有可能的交互作用都被视为候选预测因子时,协变量均衡的前提往往会产生过度指定和不透明的模型。额外的候选变量的数量极大地增加了交互作用中错误发现的数量,导致不必要的复杂和难以解释的模型与许多(真正虚假的)交互作用。我们建议采用一种建模策略,该策略需要更强的证据水平,以便在最终模型中选择某些变量(例如交互作用)。这种排序稀疏范式可以通过稀疏排序套索(SRL)实现。在一系列的仿真研究中,我们比较了SRL相对于竞争方法的性能,结果表明SRL是一种非常有吸引力的方法,因为它快速、准确,并且生成更透明的模型(具有更少的虚假交互)。我们用一组基因表达测量和临床协变量来预测肺癌患者的生存率,特别是寻找基因与环境的相互作用。 摘要:We explore and illustrate the concept of ranked sparsity, a phenomenon that often occurs naturally in modeling applications when an expected disparity exists in the quality of information between different feature sets. Its presence can cause traditional and modern model selection methods to fail because such procedures commonly presume that each potential parameter is equally worthy of entering into the final model - we call this presumption "covariate equipoise". However, this presumption does not always hold, especially in the presence of derived variables. For instance, when all possible interactions are considered as candidate predictors, the premise of covariate equipoise will often produce over-specified and opaque models. The sheer number of additional candidate variables grossly inflates the number of false discoveries in the interactions, resulting in unnecessarily complex and difficult-to-interpret models with many (truly spurious) interactions. We suggest a modeling strategy that requires a stronger level of evidence in order to allow certain variables (e.g. interactions) to be selected in the final model. This ranked sparsity paradigm can be implemented with the sparsity-ranked lasso (SRL). 
We compare the performance of SRL relative to competing methods in a series of simulation studies, showing that the SRL is a very attractive method because it is fast, accurate, and produces more transparent models (with fewer false interactions). We illustrate its utility in an application to predict the survival of lung cancer patients using a set of gene expression measurements and clinical covariates, searching in particular for gene-environment interactions.
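The ranked-sparsity idea, demanding stronger evidence before a derived term enters the model, can be mimicked by a weighted lasso in which interactions carry larger penalty factors, solved here by a small proximal-gradient (ISTA) loop. The data, weights, and tuning values are illustrative and not the SRL's exact weighting scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([x1, x2, x1 * x2])        # two main effects + their interaction
y = 1.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)   # interaction truly absent

w = np.array([1.0, 1.0, 2.0])   # ranked sparsity: heavier penalty on the derived term
lam = 0.1
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()

beta = np.zeros(3)
for _ in range(500):            # ISTA: gradient step, then weighted soft-threshold
    grad = X.T @ (X @ beta - y) / n
    z = beta - step * grad
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)

print(np.round(beta, 3))        # the (spurious) interaction is driven to zero
```

With equal weights the interaction would face the same threshold as the main effects; the larger weight encodes the "stronger evidence required" principle.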
【22】 A multiple criteria approach for ship risk classification: An alternative to the Paris MoU Ship Risk Profile 标题:船舶风险分类的多准则方法:巴黎谅解备忘录船舶风险剖面的替代方案
作者:Duarte Caldeira Dinis,José Rui Figueira,Ângelo Palos Teixeira 机构:ˆAngelo Palos Teixeirab, CEG-IST, Instituto Superior T´ecnico, Universidade de Lisboa, Av. Rovisco Pais ,-, Lisboa, Portugal, CENTEC, Instituto Superior T´ecnico, Universidade de Lisboa 备注:32 pages, 2 figures 链接:https://arxiv.org/abs/2107.07581 摘要:《关于港口国管制的巴黎谅解备忘录》(巴黎谅解备忘录)负责控制欧洲水域的不合格航运,从而提高安全、防止污染以及船上生活和工作条件的标准。自2011年以来,该备忘录采用了一种计分制,称为“船舶风险状况”(SRP),根据该计分制,每艘船舶根据其在一组标准上的得分被分配一个风险状况。作为一个以多准则决策辅助(MCDA)为核心的工具,包括准则、权重和风险类别,从MCDA的角度对SRP进行了有限的研究。本文提出了一种基于卡片组方法(DCM)的船舶风险分类MCDA方法。DCM特别适合于这种情况,因为它允许决策者直观地在标准量表上对不同标准和不同水平之间的偏好进行建模。首先,根据SRP中建立的标准,构建一个框架。其次,利用DCM建立MCDA模型,包括准则值函数和准则权重的定义。最后,将本文提出的MCDA模型应用于船舶数据集,并对结果进行了讨论。所提出的方法得到了稳健的结果,使其成为当前SRP的一个潜在替代方案。 摘要:The Paris Memorandum of Understanding on Port State Control (Paris MoU) is responsible for controlling substandard shipping in European waters and, consequently, increasing the standards of safety, pollution prevention, and onboard living and working conditions. Since 2011, the Memorandum adopted a system of points, named the "Ship Risk Profile" (SRP), under which each ship is assigned a risk profile according to its score on a set of criteria. Being a multiple criteria decision aiding (MCDA) tool at its core, comprising criteria, weights, and risk categories, limited research has been performed on the SRP from an MCDA perspective. The purpose of this paper is to propose an MCDA approach for ship risk classification through the Deck of Cards Method (DCM). The DCM is particularly suitable within this context as it allows, intuitively for the decision-maker, to model preference among different criteria and among different levels on criteria scales. First, a framework is built, based on the criteria established in the SRP. Second, the DCM is used to build the MCDA model, including the definition of the criteria value functions and criteria weights. Finally, the proposed MCDA model is applied to a dataset of ships and the results are discussed. 
Robust results have been obtained with the proposed approach, making it a potential alternative to the current SRP.
【23】 Optimal tests of the composite null hypothesis arising in mediation analysis 标题:中介分析中复合零假设的最优检验
作者:Caleb H. Miles,Antoine Chambaz 机构: Department of Biostatistics, Columbia University, Universit´e de Paris 备注:40 pages, 7 figures 链接:https://arxiv.org/abs/2107.07575 摘要:在某些因果和回归模型假设下,通过中间变量,暴露对结果的间接影响可以通过回归系数的乘积来确定。因此,没有间接影响的零假设是一个复合零假设,因为如果任一回归系数为零,零就成立。结果是,现有的假设检验要么在原点附近严重不足(即,当两个系数相对于标准误差都很小时),要么在零假设空间上不一致地保留类型1误差。我们提出假设检验,即(i)保持水平α1型错误,(ii)当两个真实的潜在影响相对于样本量都很小时,有意义地提高功率,以及(iii)当至少一个不存在时,保持功率。一种方法给出了一个闭式检验,该检验对于替代参数空间上的局部幂是minimax最优的。另一种方法使用稀疏线性规划产生一个近似最优的Bayes风险准则检验。我们提供了一个实现minimax最优测试的R包。 摘要:The indirect effect of an exposure on an outcome through an intermediate variable can be identified by a product of regression coefficients under certain causal and regression modeling assumptions. Thus, the null hypothesis of no indirect effect is a composite null hypothesis, as the null holds if either regression coefficient is zero. A consequence is that existing hypothesis tests are either severely underpowered near the origin (i.e., when both coefficients are small with respect to standard errors) or do not preserve type 1 error uniformly over the null hypothesis space. We propose hypothesis tests that (i) preserve level alpha type 1 error, (ii) meaningfully improve power when both true underlying effects are small relative to sample size, and (iii) preserve power when at least one is not. One approach gives a closed-form test that is minimax optimal with respect to local power over the alternative parameter space. Another uses sparse linear programming to produce an approximately optimal test for a Bayes risk criterion. We provide an R package that implements the minimax optimal test.
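For context, the classical product-of-coefficients (Sobel) test is the delta-method benchmark that loses power near the origin: two coefficients that are each individually borderline-significant can yield a clearly insignificant product statistic. A sketch of that baseline, not of the authors' optimal tests:

```python
import math

def sobel_z(a, se_a, b, se_b):
    """First-order delta-method z-statistic for H0: a * b = 0."""
    return (a * b) / math.sqrt(a * a * se_b ** 2 + b * b * se_a ** 2)

# Each coefficient alone has z = 2, yet the Sobel z is only sqrt(2) ~ 1.41,
# illustrating the loss of power when both effects are small.
z_small = sobel_z(0.10, 0.05, 0.10, 0.05)
z_large = sobel_z(1.00, 0.05, 1.00, 0.05)
print(z_small, z_large)
```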
【24】 Multivariate Conway-Maxwell-Poisson Distribution: Sarmanov Method and Doubly-Intractable Bayesian Inference 标题:多元Conway-Maxwell-Poisson分布:Sarmanov方法和双难解贝叶斯推断
作者:Luiza S. C. Piancastelli,Nial Friel,Wagner Barreto-Souza,Hernando Ombao 机构:School of Mathematics and Statistics, University College Dublin, Dublin, Ireland, Insight Centre for Data Analytics, Ireland, Statistics Program, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia 备注:Paper submitted for publication 链接:https://arxiv.org/abs/2107.07561 摘要:本文提出了一种具有康威-麦克斯韦(COM)-泊松边界的多元计数分布。为此,我们对Sarmanov方法进行了改进,以构造多元分布。我们的多元COM-Poisson(MultCOMP)模型具有如下特点:(i)它允许一个灵活的协方差矩阵同时允许负的和正的非对角项(ii)它克服了文献中现有的二元COM-Poisson分布没有COM-Poisson边值的局限性(iii)它允许多变量计数的分析,而不仅仅限于双变量计数。由于似然规范依赖于涉及模型参数的许多难以处理的规范化常数,因此它提出了推理挑战。这些障碍促使我们提出了一种贝叶斯推理方法,通过交换算法和分组独立的Metropolis-Hastings算法来处理由此产生的双困难后验概率。数值实验的基础上提出了模拟来说明所提出的贝叶斯方法。我们通过对2018年至2021年英超主客场球队进球数的实际数据应用,分析MultCOMP模型的潜力。这里,我们的兴趣是评估在COVID-19大流行期间缺乏人群对众所周知的主队优势的影响。MultCOMP模型的拟合表明,有证据表明主队进球数减少,而不是伴随着对手得分的减少。因此,我们的分析表明,在没有人群的情况下,主队优势较小,这与一些足球专家的观点一致。 摘要:In this paper, a multivariate count distribution with Conway-Maxwell (COM)-Poisson marginals is proposed. To do this, we develop a modification of the Sarmanov method for constructing multivariate distributions. Our multivariate COM-Poisson (MultCOMP) model has desirable features such as (i) it admits a flexible covariance matrix allowing for both negative and positive non-diagonal entries; (ii) it overcomes the limitation of the existing bivariate COM-Poisson distributions in the literature that do not have COM-Poisson marginals; (iii) it allows for the analysis of multivariate counts and is not just limited to bivariate counts. Inferential challenges are presented by the likelihood specification as it depends on a number of intractable normalizing constants involving the model parameters. These obstacles motivate us to propose a Bayesian inferential approach where the resulting doubly-intractable posterior is dealt with via the exchange algorithm and the Grouped Independence Metropolis-Hastings algorithm. 
Numerical experiments based on simulations are presented to illustrate the proposed Bayesian approach. We analyze the potential of the MultCOMP model through a real data application on the numbers of goals scored by the home and away teams in the Premier League from 2018 to 2021. Here, our interest is to assess the effect of a lack of crowds during the COVID-19 pandemic on the well-known home team advantage. A MultCOMP model fit shows that there is evidence of a decreased number of goals scored by the home team, not accompanied by a reduced score from the opponent. Hence, our analysis suggests a smaller home team advantage in the absence of crowds, which agrees with the opinion of several football experts.
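The intractable normalizing constant behind the doubly-intractable posterior is $Z(\lambda,\nu)=\sum_j \lambda^j/(j!)^\nu$, finite but with no closed form except in special cases such as $\nu=1$ (ordinary Poisson). A log-space truncated-sum sketch of the marginal pmf; the truncation length is an illustrative choice:

```python
import math

def com_poisson_pmf(y, lam, nu, truncation=200):
    """P(Y = y) for COM-Poisson(lam, nu); the normalizing constant is computed
    by a truncated log-sum-exp to avoid overflow of (j!)^nu."""
    logw = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(truncation)]
    m = max(logw)
    log_z = m + math.log(sum(math.exp(lw - m) for lw in logw))
    return math.exp(y * math.log(lam) - nu * math.lgamma(y + 1) - log_z)

# nu = 1 recovers the ordinary Poisson pmf; nu > 1 gives under-dispersion.
p_com = com_poisson_pmf(3, lam=2.0, nu=1.0)
p_poisson = math.exp(-2.0) * 2.0 ** 3 / math.factorial(3)
print(p_com, p_poisson)
```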
【25】 POS tagging, lemmatization and dependency parsing of West Frisian 标题:西弗里斯语的词性标注、词汇化和依存句法分析
作者:Wilbert Heeringa,Gosse Bouma,Martha Hofman,Eduard Drenth,Jan Wijffels,Hans Van de Velde 机构:Fryske Akademy,University of Groningen,BNOSAC,Utrecht University, Leeuwarden,Groningen,Brussels,Utrecht 备注:6 pages, 2 figures, 6 tables 链接:https://arxiv.org/abs/2107.07974 摘要:我们提出了一个用于西弗里斯语的lemmatizer/POS-tagger/dependency解析器,它使用了一个语料库,共有44714个单词,3126个句子,根据通用依赖版本2的指导原则进行了注释。POS-tagger通过使用荷兰语POS-tagger来分配给单词,而荷兰语POS-tagger应用于逐字翻译,或荷兰语平行文本的句子。使用弗里斯翻译程序Oersetter创建的直译可以获得最佳效果。形态学和句法注释是在荷兰语直译的基础上产生的。将lemmatizer/tagger/annotator在使用默认参数进行训练时的性能与使用用于训练lassymall UD 2.5语料库的参数值时获得的性能进行比较。“引理”有显著的改进。Frisian lemmatizer/PoS tagger/dependency解析器作为web应用程序和web服务发布。 摘要:We present a lemmatizer/POS-tagger/dependency parser for West Frisian using a corpus of 44,714 words in 3,126 sentences that were annotated according to the guidelines of Universal Dependency version 2. POS tags were assigned to words by using a Dutch POS tagger that was applied to a literal word-by-word translation, or to sentences of a Dutch parallel text. Best results were obtained when using literal translations that were created by using the Frisian translation program Oersetter. Morphologic and syntactic annotations were generated on the basis of a literal Dutch translation as well. The performance of the lemmatizer/tagger/annotator when it was trained using default parameters was compared to the performance that was obtained when using the parameter values that were used for training the LassySmall UD 2.5 corpus. A significant improvement was found for `lemma'. The Frisian lemmatizer/PoS tagger/dependency parser is released as a web app and as a web service.
【26】 Self-normalized Cramer moderate deviations for a supercritical Galton-Watson process 标题:超临界Galton-Watson过程的自归一化Cramer中偏差
作者:Xiequan Fan,Qi-Man Shao 机构:Center for Applied Mathematics, Tianjin University, Tianjin, China, Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China 链接:https://arxiv.org/abs/2107.07965 摘要:设$(Z_n)_{n \geq 0}$为超临界Galton-Watson过程。考虑子代均值的Lotka-Nagaev估计量。本文建立了Lotka-Nagaev估计量的自正则化Cramér型中偏差和Berry-Esseen界。所得结果被认为是最优或接近最优的。 摘要:Let $(Z_n)_{n \geq 0}$ be a supercritical Galton-Watson process. Consider the Lotka-Nagaev estimator for the offspring mean. In this paper, we establish self-normalized Cramér-type moderate deviations and Berry-Esseen bounds for the Lotka-Nagaev estimator. The results are believed to be optimal or near-optimal.
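The Lotka-Nagaev estimator itself is simply the ratio of successive generation sizes, $Z_{n+1}/Z_n$. A seeded toy simulation with Binomial(4, 1/2) offspring (mean $m = 2$, hence supercritical); the offspring law is an illustrative choice:

```python
import random

random.seed(42)

def offspring():
    # Binomial(4, 1/2) offspring distribution: mean m = 2 > 1, so supercritical.
    return sum(random.random() < 0.5 for _ in range(4))

z = [5]                          # start from Z_0 = 5 individuals
for n in range(12):
    nxt = sum(offspring() for _ in range(z[-1]))
    z.append(nxt)
    if nxt == 0:                 # extinction is possible, just very unlikely here
        break

lotka_nagaev = z[-1] / z[-2]     # Lotka-Nagaev estimate of the offspring mean m
print(z[-1], lotka_nagaev)
```

Since $Z_n$ grows geometrically, the conditional standard error of the ratio shrinks like $Z_n^{-1/2}$, which is why the estimate lands close to 2 after a dozen generations.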
【27】 Flexible Covariate Adjustments in Regression Discontinuity Designs 标题:回归间断设计中的灵活协变量调整
作者:Claudia Noack,Tomasz Olma,Christoph Rothe 机构: Department of Economics, University of Mannheim 链接:https://arxiv.org/abs/2107.07942 摘要:经验回归不连续性(RD)研究通常使用协变量来提高其估计的精度。在本文中,我们提出了一类新的估计器,它比目前广泛使用的线性调整估计器更有效地利用这些协变量信息。我们的方法可以容纳大量的离散或连续的协变量。它涉及到用一个适当修改的结果变量进行标准的RD分析,这个变量的形式是原始结果和协变量函数之间的差异。我们描述了导致具有最小渐近方差的估计量的函数,并展示了如何通过现代机器学习、非参数回归或经典参数方法来估计它。所得到的估计器易于实现,因为调谐参数可以像在常规RD分析中那样选择。一个广泛的仿真研究说明了我们的方法的性能。 摘要:Empirical regression discontinuity (RD) studies often use covariates to increase the precision of their estimates. In this paper, we propose a novel class of estimators that use such covariate information more efficiently than the linear adjustment estimators that are currently used widely in practice. Our approach can accommodate a possibly large number of either discrete or continuous covariates. It involves running a standard RD analysis with an appropriately modified outcome variable, which takes the form of the difference between the original outcome and a function of the covariates. We characterize the function that leads to the estimator with the smallest asymptotic variance, and show how it can be estimated via modern machine learning, nonparametric regression, or classical parametric methods. The resulting estimator is easy to implement, as tuning parameters can be chosen as in a conventional RD analysis. An extensive simulation study illustrates the performance of our approach.
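The modified-outcome recipe can be sketched directly: subtract a function of the covariates from the outcome and run the same RD comparison. The toy below uses a crude difference-in-means within a fixed bandwidth and cheats by using the true (oracle) covariate coefficient, which the paper would instead estimate; everything here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def rd_estimates(n=500, h=0.2, tau=1.0, gamma=2.0):
    """Naive vs covariate-adjusted difference-in-means at the cutoff x = 0."""
    x = rng.uniform(-1.0, 1.0, n)        # running variable, cutoff at 0
    w = rng.normal(size=n)               # covariate predictive of the outcome
    y = tau * (x >= 0) + gamma * w + rng.normal(scale=0.5, size=n)
    near = np.abs(x) < h                 # fixed-bandwidth window around the cutoff
    def jump(outcome):
        return outcome[near & (x >= 0)].mean() - outcome[near & (x < 0)].mean()
    return jump(y), jump(y - gamma * w)  # second: same RD on the modified outcome

est = np.array([rd_estimates() for _ in range(300)])
print(est.mean(axis=0), est.var(axis=0))   # both ~unbiased; adjusted variance smaller
```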
【28】 A DoE-based approach for the implementation of structural surrogate models in the early stage design of box-wing aircraft 标题:基于DOE的箱翼飞机早期设计结构代理模型实现方法
作者:Vittorio Cipolla,Vincenzo Binante,Karim Abu Salem,Giuseppe Palaia,Davide Zanetti 机构:University of Pisa, Department of Civil and Industrial Engineering 链接:https://arxiv.org/abs/2107.07865 摘要:在不限制航空运输增长的情况下,应对减少航空对环境影响的挑战的可能方法之一是采用更高效、完全不同的飞机结构。其中,箱形翼飞机是一种很有前途的解决方案,至少在中短途飞机的应用中是如此,根据H2020项目"PARSIFAL"的成果,每乘客公里的二氧化碳排放量将减少20%。本论文面临的问题是在设计的早期阶段估算这种破坏性配置的结构质量,强调了文献中可用方法的这种能力的局限性,并提出了一种基于DoE的方法来定义适合这种目的的替代模型。"PARSIFAL"项目中的一个测试用例用于该方法的第一个概念,从有限元模型参数化开始,然后构建有限元结果数据库,从而引入回归模型并在优化框架中实现它们。为了验证机翼尺寸和优化程序,对所获得的结果进行了研究。最后,简要介绍了在意大利研究项目"PROSIB"中将箱形机翼布局应用于区域飞机类别所产生的一个附加测试案例,以进一步评估所提出方法的能力。 摘要:One of the possible ways to face the challenge of reducing the environmental impact of aviation, without limiting the growth of air transport, is the introduction of more efficient, radically different aircraft architectures. Among these, the box-wing one represents a promising solution, at least in the case of its application to short-to-medium haul aircraft, which, according to the achievement of the H2020 project "PARSIFAL", would bring to a 20% reduction in terms of emitted CO2 per passenger-kilometre. The present paper faces the problem of estimating the structural mass of such a disruptive configuration in the early stages of the design, underlining the limitations in this capability of the approaches available by literature and proposing a DoE-based approach to define surrogate models suitable for such purpose. A test case from the project "PARSIFAL" is used for the first conception of the approach, starting from the Finite Element Model parametrization, then followed by the construction of a database of FEM results, hence introducing the regression models and implementing them in an optimization framework. Results achieved are investigated in order to validate both the wing sizing and the optimization procedure. 
Finally, an additional test case resulting from the application of the box-wing layout to the regional aircraft category within the Italian research project "PROSIB", is briefly presented to further assess the capabilities of the proposed approach.
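The DoE-to-surrogate-to-optimization pipeline can be sketched with a cheap stand-in for the finite element model: sample a design of experiments, evaluate the expensive model, fit a regression surrogate, and optimize the surrogate instead. The 1-D quadratic stand-in below is purely illustrative.

```python
import numpy as np

def expensive_model(t):              # stand-in for a costly FEM structural-mass run
    return 100.0 + 30.0 * (t - 0.4) ** 2

design = np.linspace(0.0, 1.0, 7)    # the DoE: 7 training points in [0, 1]
observations = expensive_model(design)

surrogate = np.poly1d(np.polyfit(design, observations, deg=2))  # regression surrogate

# Optimization then runs against the cheap surrogate instead of the FEM.
grid = np.linspace(0.0, 1.0, 1001)
t_opt = grid[np.argmin(surrogate(grid))]
print(t_opt)                         # close to the true minimizer, 0.4
```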
【29】 Subspace Shrinkage in Conjugate Bayesian Vector Autoregressions 标题:共轭贝叶斯向量自回归的子空间收缩
作者:Florian Huber,Gary Koop 机构:University of Salzburg, University of Strathclyde 链接:https://arxiv.org/abs/2107.07804 摘要:使用大型数据集的宏观经济学家通常面临着要么使用大型向量自回归(VAR)要么使用因子模型的选择。在这篇论文中,我们发展了使用子空间收缩先验将两者结合起来的方法。子空间先验向一类函数收缩,而不是直接将模型的参数向某个预先指定的位置强迫。我们发展了一个共轭VAR先验,它向因子模型定义的子空间收缩。我们的方法允许估计收缩的强度以及因素的数量。在建立了我们提出的先验知识的理论性质之后,我们进行了模拟并将其应用于美国宏观经济数据。通过仿真,我们证明了我们的框架能够成功地检测到因子的数量。在一个涉及大量宏观经济数据集的预测练习中,我们发现使用我们的先验知识将变量与因子模型相结合可以导致预测的改进。 摘要:Macroeconomists using large datasets often face the choice of working with either a large Vector Autoregression (VAR) or a factor model. In this paper, we develop methods for combining the two using a subspace shrinkage prior. Subspace priors shrink towards a class of functions rather than directly forcing the parameters of a model towards some pre-specified location. We develop a conjugate VAR prior which shrinks towards the subspace which is defined by a factor model. Our approach allows for estimating the strength of the shrinkage as well as the number of factors. After establishing the theoretical properties of our proposed prior, we carry out simulations and apply it to US macroeconomic data. Using simulations we show that our framework successfully detects the number of factors. In a forecasting exercise involving a large macroeconomic data set we find that combining VARs with factor models using our prior can lead to forecast improvements.
【30】 Efficient proximal gradient algorithms for joint graphical lasso 标题:关节图形套索的高效近似梯度算法
作者:Jie Chen,Ryosuke Shimmura,Joe Suzuki 机构:Graduate School of Engineering Science, Osaka University, Japan. 备注:23 pages, 5 figures 链接:https://arxiv.org/abs/2107.07799 摘要:我们考虑从稀疏数据学习一个无向图模型。目前已有多种有效的图形套索(GL)算法被提出,而交替方向乘子法(ADMM)是求解联合图形套索(JGL)的主要方法。我们为JGL提出了有无回溯选项的近端梯度程序。这些程序是一阶的,相对简单,子问题以封闭形式有效地求解。进一步证明了JGL问题解的有界性和算法的迭代次数。数值结果表明,该算法具有较高的精度和精度,其效率与现有算法相当。 摘要:We consider learning an undirected graphical model from sparse data. While several efficient algorithms have been proposed for graphical lasso (GL), the alternating direction method of multipliers (ADMM) is the main approach taken concerning for joint graphical lasso (JGL). We propose proximal gradient procedures with and without a backtracking option for the JGL. These procedures are first-order and relatively simple, and the subproblems are solved efficiently in closed form. We further show the boundedness for the solution of the JGL problem and the iterations in the algorithms. The numerical results indicate that the proposed algorithms can achieve high accuracy and precision, and their efficiency is competitive with state-of-the-art algorithms.
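The closed-form subproblem in such proximal-gradient schemes is soft-thresholding, the proximal operator of the $\ell_1$ penalty (the JGL's fused and group penalties admit analogous closed forms). A minimal sketch:

```python
def soft_threshold(z, t):
    """prox of t*|.|: the closed-form minimizer of (b - z)^2 / 2 + t * |b|."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

# One proximal-gradient step for an l1-penalized objective f(b) + lam * |b|:
#   b_new = soft_threshold(b - step * grad_f(b), step * lam)
print(soft_threshold(3.0, 1.0), soft_threshold(-0.4, 1.0))   # 2.0 0.0
```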
【31】 Active learning for online training in imbalanced data streams under cold start 标题:冷启动下不平衡数据流在线训练的主动学习
作者:Ricardo Barata,Miguel Leite,Ricardo Pacheco,Marco O. P. Sampaio,João Tiago Ascensão,Pedro Bizarro 机构:Feedzai 备注:9 pages, 6 figures, 2 tables 链接:https://arxiv.org/abs/2107.07724 摘要:在依赖机器学习(ML)进行预测建模的现代系统中,标记数据是必不可少的。这样的系统可能会遇到冷启动问题:有监督的模型工作得很好,但最初,没有标签,这是昂贵的或缓慢获得。在数据不平衡的情况下,这个问题更为严重。在线金融欺诈检测就是这样一个例子:i)费用高昂,或者ii)如果依赖于受害者提出投诉,它会遭受长时间的延迟。如果必须立即建立模型,则后者可能不可行,因此一种选择是要求分析人员标记事件,同时最小化注释数量以控制成本。我们提出了一个主动学习(AL)注释系统的数据集与数量级的类不平衡,在冷启动流的情况下。我们提出了一种计算效率高的基于离群点的判别AL方法(ODAL),并设计了一种新的三阶段AL标记策略序列,用于预热。然后,我们在四个真实世界的数据集进行实证研究,不同程度的阶级失衡。结果表明,与标准的AL策略相比,该方法能更快地获得高性能模型。与随机抽样相比,其观察到的收益可以达到80%,并且可以与具有无限注释预算或附加历史数据(标签的1/10到1/50)的策略相竞争。 摘要:Labeled data is essential in modern systems that rely on Machine Learning (ML) for predictive modelling. Such systems may suffer from the cold-start problem: supervised models work well but, initially, there are no labels, which are costly or slow to obtain. This problem is even worse in imbalanced data scenarios. Online financial fraud detection is an example where labeling is: i) expensive, or ii) it suffers from long delays, if relying on victims filing complaints. The latter may not be viable if a model has to be in place immediately, so an option is to ask analysts to label events while minimizing the number of annotations to control costs. We propose an Active Learning (AL) annotation system for datasets with orders of magnitude of class imbalance, in a cold start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies where it is used as warm-up. Then, we perform empirical studies in four real world datasets, with various magnitudes of class imbalance. The results show that our method can more quickly reach a high performance model than standard AL policies. 
Its observed gains over random sampling can reach 80% and be competitive with policies with an unlimited annotation budget or additional historical data (with 1/10 to 1/50 of the labels).
【32】 Estimation from Partially Sampled Distributed Traces 标题:从部分采样的分布道进行估计
作者:Otmar Ertl 机构:Dynatrace Research, Linz, Austria 链接:https://arxiv.org/abs/2107.07703 摘要:为了降低分布式跟踪的处理和存储成本,采样通常是必要的。在这项工作中,我们描述了一种可伸缩的自适应采样方法,它可以比广泛使用的基于头部的采样方法更好地保存感兴趣的事件。可以为每个跨度单独选择采样率,允许考虑跨度属性和本地资源约束。产生的记录道通常只是部分采样,而不是完全采样,这使统计分析变得复杂。为了利用给定的信息,提出了一种无偏估计算法。尽管它不需要知道记录道是否完整,但与只考虑完整记录道相比,它在许多情况下减少了估计误差。 摘要:Sampling is often a necessary evil to reduce the processing and storage costs of distributed tracing. In this work, we describe a scalable and adaptive sampling approach that can preserve events of interest better than the widely used head-based sampling approach. Sampling rates can be chosen individually and independently for every span, allowing to take span attributes and local resource constraints into account. The resulting traces are often only partially and not completely sampled which complicates statistical analysis. To exploit the given information, an unbiased estimation algorithm is presented. Even though it does not need to know whether the traces are complete, it reduces the estimation error in many cases compared to considering only complete traces.
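Inverse-probability (Horvitz-Thompson) weighting is the standard route to unbiasedness with per-item sampling rates: every surviving span contributes $1/p_i$ to the estimate. A seeded sketch of the principle only; the paper's estimator is more elaborate, and the rates and data here are illustrative.

```python
import random

random.seed(7)

# 1000 spans; span i is kept independently with its own sampling rate p_i.
p = [0.9 if i % 10 == 0 else 0.2 for i in range(1000)]   # rarer events kept more often

def estimate_total():
    # Horvitz-Thompson: each surviving span contributes 1 / p_i.
    return sum(1.0 / pi for pi in p if random.random() < pi)

estimates = [estimate_total() for _ in range(2000)]
mean_est = sum(estimates) / len(estimates)
print(mean_est)   # unbiased for the true span count, 1000
```

The same weighting works for any per-span quantity (latency, error counts) by replacing the 1 in the numerator with that span's value.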
【33】 The Application of Active Query K-Means in Text Classification 标题:主动查询K-Means算法在文本分类中的应用
作者:Yukun Jiang 机构:Department of Computer Science, New York University, New York, United States 备注:None 链接:https://arxiv.org/abs/2107.07682 摘要:主动学习是一种处理大量未标记数据的先进机器学习方法。在自然语言处理领域,通常对所有的数据进行注释既费钱又费时。这种低效性启发我们将主动学习应用于文本分类。本文首先将传统的无监督k-均值聚类算法改进为半监督聚类算法。然后,将该算法进一步扩展到具有惩罚最小最大选择的主动学习场景中,使得有限的查询产生更稳定的初始质心。该方法利用了用户的交互查询结果和底层的距离表示。在一个中文新闻数据集上进行测试后,它显示了在降低训练成本的同时,准确率的持续提高。 摘要:Active learning is a state-of-art machine learning approach to deal with an abundance of unlabeled data. In the field of Natural Language Processing, typically it is costly and time-consuming to have all the data annotated. This inefficiency inspires out our application of active learning in text classification. Traditional unsupervised k-means clustering is first modified into a semi-supervised version in this research. Then, a novel attempt is applied to further extend the algorithm into active learning scenario with Penalized Min-Max-selection, so as to make limited queries that yield more stable initial centroids. This method utilizes both the interactive query results from users and the underlying distance representation. After tested on a Chinese news dataset, it shows a consistent increase in accuracy while lowering the cost in training.
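Min-max selection greedily queries the point whose nearest already-chosen centroid is farthest away; the paper's penalized variant additionally discourages outliers. A plain (unpenalized) farthest-point sketch on 1-D data:

```python
def farthest_point_seeds(points, k):
    """Greedy min-max selection: repeatedly pick the point whose nearest
    already-chosen seed is farthest away (1-D for brevity)."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(abs(p - s) for s in seeds)))
    return seeds

data = [0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 5.0]   # two clumps plus a middle point
print(farthest_point_seeds(data, 3))           # [0.0, 10.0, 5.0]
```

Without a penalty, an extreme outlier would always be chosen first, which is exactly what the penalized variant is designed to temper.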
【34】 Correlation detection in trees for partial graph alignment 标题:用于部分图对齐的树的相关性检测
作者:Luca Ganassali,Laurent Massoulié,Marc Lelarge 备注:22 pages, 1 figure. Preliminary version 链接:https://arxiv.org/abs/2107.07623 摘要:我们考虑稀疏图的对齐,它包括在两个图的节点之间找到映射,保留了大部分的边。我们的方法是比较两个图中的局部结构,如果两个节点的邻域"足够接近",则匹配两个节点:对于相关Erdős-Rényi随机图,这个问题可以通过测试一对分支树是从乘积分布还是从相关分布中提取来局部地重新表述。我们为此问题设计了一个最优测试,给出了一个用于图对齐的消息传递算法,该算法可证明在多项式时间内返回正确匹配顶点的正分数和不匹配的消失分数。当图的平均度数为$\lambda = O(1)$、相关参数为$s \in [0,1]$时,只要$\lambda s$足够大且$1-s$足够小,这一结果即成立,从而补全了现有最新结果的版图。利用Kullback-Leibler散度给出了多项式时间内部分图对齐(或树中相关检测)是否可行的更严格条件。 摘要:We consider alignment of sparse graphs, which consists in finding a mapping between the nodes of two graphs which preserves most of the edges. Our approach is to compare local structures in the two graphs, matching two nodes if their neighborhoods are 'close enough': for correlated Erdős-Rényi random graphs, this problem can be locally rephrased in terms of testing whether a pair of branching trees is drawn from either a product distribution, or a correlated distribution. We design an optimal test for this problem which gives rise to a message-passing algorithm for graph alignment, which provably returns in polynomial time a positive fraction of correctly matched vertices, and a vanishing fraction of mismatches. With an average degree $\lambda = O(1)$ in the graphs, and a correlation parameter $s \in [0,1]$, this result holds with $\lambda s$ large enough, and $1-s$ small enough, completing the recent state-of-the-art diagram. Tighter conditions for determining whether partial graph alignment (or correlation detection in trees) is feasible in polynomial time are given in terms of Kullback-Leibler divergences.
【35】 Adversarial Attack for Uncertainty Estimation: Identifying Critical Regions in Neural Networks 标题:不确定性估计的对抗性攻击:识别神经网络中的关键区域
作者:Ismail Alarab,Simant Prakoonwit 机构: Bournemouth University, United Kingdom 备注:15 pages, 6 figures, Submitted to Neural Processing Letters Journal 链接:https://arxiv.org/abs/2107.07618 摘要:我们提出了一种新的方法来捕捉神经网络中决策边界附近的数据点,这些数据点通常涉及到特定类型的不确定性。在我们的方法中,我们试图基于对抗攻击方法的思想来进行不确定性估计。在本文中,不确定性估计是从输入扰动中导出的,这与以往的研究不同,在贝叶斯方法中,这些研究提供了对模型参数的扰动。我们能够通过对输入的几次扰动来产生不确定性。有趣的是,我们将提出的方法应用于来自区块链的数据集。我们比较了模型不确定性与最新的不确定性方法的性能。结果表明,与其他方法相比,本文提出的方法具有显著的优越性,并且在机器学习中捕获模型不确定性的风险较小。 摘要:We propose a novel method to capture data points near decision boundary in neural network that are often referred to a specific type of uncertainty. In our approach, we sought to perform uncertainty estimation based on the idea of adversarial attack method. In this paper, uncertainty estimates are derived from the input perturbations, unlike previous studies that provide perturbations on the model's parameters as in Bayesian approach. We are able to produce uncertainty with couple of perturbations on the inputs. Interestingly, we apply the proposed method to datasets derived from blockchain. We compare the performance of model uncertainty with the most recent uncertainty methods. We show that the proposed method has revealed a significant outperformance over other methods and provided less risk to capture model uncertainty in machine learning.
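The core recipe, deriving uncertainty from perturbations of the input rather than of the model parameters, can be sketched with any fixed classifier: perturb the input a few times and report the spread of the predictions. The toy logistic model below and all its settings are illustrative assumptions.

```python
import math, random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def perturbation_uncertainty(x, predict, eps=0.1, n_draws=200, seed=0):
    """Standard deviation of predictions under small Gaussian input perturbations."""
    rng = random.Random(seed)
    preds = [predict(x + rng.gauss(0.0, eps)) for _ in range(n_draws)]
    mu = sum(preds) / n_draws
    return math.sqrt(sum((q - mu) ** 2 for q in preds) / n_draws)

model = lambda v: sigmoid(4.0 * v)              # fixed classifier, boundary at v = 0
u_near = perturbation_uncertainty(0.0, model)   # input on the decision boundary
u_far = perturbation_uncertainty(2.0, model)    # input deep inside one class
print(u_near, u_far)                            # boundary point is far more uncertain
```

Points near the boundary sit where the classifier's output is steepest, so the same input noise produces a much larger prediction spread, which is the signal the paper exploits.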
【36】 Measuring inter-cluster similarities with Alpha Shape TRIangulation in loCal Subspaces (ASTRICS) facilitates visualization and clustering of high-dimensional data 标题:使用局部子空间中的Alpha形状三角剖分(ASTRICS)测量簇间相似性有助于高维数据的可视化和聚类
作者:Joshua M. Scurll 机构:Department of Mathematics and Institute of Applied Mathematics, University of British Columbia, Vancouver, British Columbia, Canada 备注:35 pages, 7 figures 链接:https://arxiv.org/abs/2107.07603 摘要:高维数据的聚类和可视化是许多领域的重要任务。例如,在生物信息学中,它们对于单细胞数据的分析是至关重要的,例如质谱流式细胞术(CyTOF)数据。对HD数据进行聚类的一些最有效的算法是通过图中的节点来表示数据,边根据相似性或距离的度量来连接相邻的节点。然而,基于图的算法的用户通常面临着选择输入参数的值的关键但具有挑战性的任务,该输入参数设置图中邻域的大小,例如连接每个节点的最近邻居的数目或连接节点的阈值距离。用户的负担可以通过节点间相似性的度量来减轻,该度量对于不相似的节点可以具有值0,而不需要任何用户定义的参数或阈值。这将自动确定邻域,同时仍然生成稀疏图。为此,本文提出了一种基于局部降维和关键alpha形状三角剖分的HD数据点簇间相似度度量方法ASTRICS。 摘要:Clustering and visualizing high-dimensional (HD) data are important tasks in a variety of fields. For example, in bioinformatics, they are crucial for analyses of single-cell data such as mass cytometry (CyTOF) data. Some of the most effective algorithms for clustering HD data are based on representing the data by nodes in a graph, with edges connecting neighbouring nodes according to some measure of similarity or distance. However, users of graph-based algorithms are typically faced with the critical but challenging task of choosing the value of an input parameter that sets the size of neighbourhoods in the graph, e.g. the number of nearest neighbours to which to connect each node or a threshold distance for connecting nodes. The burden on the user could be alleviated by a measure of inter-node similarity that can have value 0 for dissimilar nodes without requiring any user-defined parameters or thresholds. This would determine the neighbourhoods automatically while still yielding a sparse graph. To this end, I propose a new method called ASTRICS to measure similarity between clusters of HD data points based on local dimensionality reduction and triangulation of critical alpha shapes. 
I show that my ASTRICS similarity measure can facilitate both clustering and visualization of HD data by using it in Stage 2 of a three-stage pipeline: Stage 1 = perform an initial clustering of the data by any method; Stage 2 = let graph nodes represent initial clusters instead of individual data points and use ASTRICS to automatically define edges between nodes; Stage 3 = use the graph for further clustering and visualization. This trades the critical task of choosing a graph neighbourhood size for the easier task of essentially choosing a resolution at which to view the data. The graph and consequently downstream clustering and visualization are then automatically adapted to the chosen resolution.
【37】 Prediction of Blood Lactate Values in Critically Ill Patients: A Retrospective Multi-center Cohort Study 标题:危重患者血乳酸值的预测:一项回顾性多中心队列研究
作者:Behrooz Mamandipoor,Wesley Yeung,Louis Agha-Mir-Salim,David J. Stone,Venet Osmani,Leo Anthony Celi 机构:Affiliations:, Fondazione Bruno Kessler Research Institute, Trento, Italy, Laboratory for Computational Physiology, Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA, USA 备注:None 链接:https://arxiv.org/abs/2107.07582 摘要:目的。最初测得的血清乳酸水平升高是危重病人死亡率的有力预测因素。识别血清乳酸水平更可能升高的患者,可以提醒医生加强护理,并指导其安排血液检测的频率。我们研究机器学习模型能否预测随后的血清乳酸变化。方法。我们使用MIMIC-III和eICU-CRD数据集研究血清乳酸变化的预测,既进行内部验证,也在MIMIC-III队列上对eICU队列进行外部验证。根据初始乳酸水平划分三个亚组:i)正常组(<2 mmol/L),ii)轻度组(2-4 mmol/L),以及iii)重度组(>4 mmol/L)。结局根据血清乳酸水平在各组之间的升高或降低来定义。我们还通过将结局定义为乳酸变化>10%进行了敏感性分析,并进一步研究了相邻两次乳酸测量之间的时间间隔对预测性能的影响。结果。LSTM模型能够预测MIMIC-III患者血清乳酸值的恶化,正常组AUC为0.77(95%CI 0.762-0.771),轻度组为0.77(95%CI 0.768-0.772),重度组为0.85(95%CI 0.840-0.851),外部验证的表现稍低。结论。LSTM对血清乳酸水平恶化的患者具有良好的判别能力。需要开展临床研究,以评估基于这些结果的临床决策支持工具能否对决策和患者结局产生积极影响。 摘要:Purpose. Elevations in initially obtained serum lactate levels are strong predictors of mortality in critically ill patients. Identifying patients whose serum lactate levels are more likely to increase can alert physicians to intensify care and guide them in the frequency of ordering blood tests. We investigate whether machine learning models can predict subsequent serum lactate changes. Methods. We investigated serum lactate change prediction using the MIMIC-III and eICU-CRD datasets, with internal validation as well as external validation of the eICU cohort on the MIMIC-III cohort. Three subgroups were defined based on the initial lactate levels: i) normal group (<2 mmol/L), ii) mild group (2-4 mmol/L), and iii) severe group (>4 mmol/L). Outcomes were defined based on increase or decrease of serum lactate levels between the groups. We also performed sensitivity analysis by defining the outcome as lactate change of >10% and furthermore investigated the influence of the time interval between subsequent lactate measurements on predictive performance. Results.
The LSTM models were able to predict deterioration of serum lactate values of MIMIC-III patients with an AUC of 0.77 (95% CI 0.762-0.771) for the normal group, 0.77 (95% CI 0.768-0.772) for the mild group, and 0.85 (95% CI 0.840-0.851) for the severe group, with a slightly lower performance in the external validation. Conclusion. The LSTM demonstrated good discrimination of patients who had deterioration in serum lactate levels. Clinical studies are needed to evaluate whether utilization of a clinical decision support tool based on these results could positively impact decision-making and patient outcomes.
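摘要中按初始乳酸水平划分亚组并据此定义结局(恶化/改善)的规则,可以用如下极简 Python 草图表示。阈值取自摘要(<2、2-4、>4 mmol/L);边界值恰为 2 或 4 mmol/L 时的归组方式是此处的假设,函数名亦为示意:

```python
def lactate_group(value_mmol_per_l):
    """摘要中的亚组:正常 (<2)、轻度 (2-4)、重度 (>4),单位 mmol/L。"""
    if value_mmol_per_l < 2.0:
        return "normal"
    if value_mmol_per_l <= 4.0:   # 边界归组方式为本示例的假设
        return "mild"
    return "severe"

GROUP_ORDER = {"normal": 0, "mild": 1, "severe": 2}

def outcome(first, second):
    """结局按亚组之间的变动定义:升组 = 恶化,降组 = 改善,同组 = 稳定。"""
    a = GROUP_ORDER[lactate_group(first)]
    b = GROUP_ORDER[lactate_group(second)]
    if b > a:
        return "deterioration"
    if b < a:
        return "improvement"
    return "stable"

print(outcome(1.5, 3.2))  # 正常 -> 轻度:deterioration
```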
【38】 Optimal Stopping Methodology for the Secretary Problem with Random Queries 标题:随机查询秘书问题的最优停止方法
作者:George V. Moustakides,Xujun Liu,Olgica Milenkovic 机构: Department of ECE, University of Patras, Greece; Department of ECE, University of Illinois 链接:https://arxiv.org/abs/2107.07513 摘要:候选人依次到达并接受面试,面试结果给出其相对于此前候选人的排名。根据每一时刻可得的排名,必须设计一个决策机制来选择或放弃当前候选人,以尽量提高选中最佳人选的机会。这一经典的"秘书问题"及其众多变种已被深入研究,所用方法以组合方法为主。在这项工作中,我们考虑一个新的版本:在审查过程中允许查询外部专家,以提高做出正确决策的概率。与现有的表述不同,我们认为专家不一定绝对可靠,其提供的建议可能是错误的。为了求解这个问题,我们采用概率方法,将查询时刻视为连续的停时,并借助最优停止理论对其进行优化。对于每个查询时刻,我们还必须设计一种机制,决定是否应在该时刻终止搜索。在专家无误这一通常假设下,这一决策是直截了当的;但当专家可能出错时,其结构要复杂得多。 摘要:Candidates arrive sequentially for an interview process which results in them being ranked relative to their predecessors. Based on the ranks available at each time, one must develop a decision mechanism that selects or dismisses the current candidate in an effort to maximize the chance to select the best. This classical version of the "Secretary problem" has been studied in depth using mostly combinatorial approaches, along with numerous other variants. In this work we consider a particular new version where during reviewing one is allowed to query an external expert to improve the probability of making the correct decision. Unlike existing formulations, we consider experts that are not necessarily infallible and may provide suggestions that can be faulty. For the solution of our problem we adopt a probabilistic methodology and view the querying times as consecutive stopping times which we optimize with the help of optimal stopping theory. For each querying time we must also design a mechanism to decide whether we should terminate the search at the querying time or not. This decision is straightforward under the usual assumption of infallible experts but, when experts are faulty, it has a far more intricate structure.
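作为背景,经典秘书问题(即不含专家查询的版本)的最优截断策略可以用如下 Python 模拟来说明:先观察前约 n/e 名候选人而不作选择,之后接受第一个优于此前所有人的候选人,成功概率趋于 1/e。以下仅为示意性草图(函数名、参数均为假设),并非论文中带查询的方法:

```python
import random

def secretary_select(qualities, lookahead):
    """经典截断规则:先观察前 lookahead 名候选人,再接受首个超越者。"""
    best_seen = None
    for i, quality in enumerate(qualities):
        if i < lookahead:
            best_seen = quality if best_seen is None else max(best_seen, quality)
        elif best_seen is None or quality > best_seen:
            return i
    return len(qualities) - 1  # 无人超越时被迫选择最后一名候选人

def success_rate(n=20, trials=20000, seed=0):
    """蒙特卡洛估计截断规则选中最佳候选人的概率。"""
    rng = random.Random(seed)
    cutoff = round(n / 2.718281828)  # 约 n/e,渐近最优的截断点
    wins = 0
    for _ in range(trials):
        qualities = list(range(n))
        rng.shuffle(qualities)
        chosen = secretary_select(qualities, cutoff)
        wins += qualities[chosen] == n - 1  # 是否选中了最佳者?
    return wins / trials

print(success_rate())  # 对最优截断点,成功概率接近 1/e
```

论文所研究的变体在此基础上增加了可能出错的外部专家查询,并把查询时刻本身作为停时来优化;上述模拟只给出被比较的经典基准。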
【39】 Towards quantifying information flows: relative entropy in deep neural networks and the renormalization group 标题:迈向信息流的量化:深度神经网络和重整化群中的相对熵
作者:Johanna Erdmenger,Kevin T. Grosvenor,Ro Jefferson 机构:Institute for Theoretical Physics and Astrophysics and Würzburg-Dresden Cluster of Excellence ct.qmat, Julius-Maximilians-Universität Würzburg, Am Hubland, Würzburg, Germany 备注:41 pages, 8 figures; code available at this https URL 链接:https://arxiv.org/abs/2107.06898 摘要:我们研究了重整化群(RG)和深层神经网络之间的相似性,其中后续的神经元层类似于沿着RG的连续步骤。特别地,我们通过在抽取RG下显式计算一维和二维Ising模型中的相对熵(即Kullback-Leibler散度),以及在前馈神经网络中将其作为深度的函数进行计算,来量化信息流。我们观察到性质相同的行为,其特征是单调增加到一个依赖于参数的渐近值。在量子场论方面,这种单调增长证实了相对熵与c定理之间的联系。对于神经网络,渐近行为可能对机器学习中的各种信息最大化方法,以及厘清紧致性与可推广性之间的纠缠都有影响。此外,尽管我们所考虑的二维Ising模型和随机神经网络都表现出非平凡的临界点,但相对熵似乎对任一系统的相结构都不敏感。从这个意义上说,为了充分阐明这些模型中的信息流,需要更精细的探针。 摘要:We investigate the analogy between the renormalization group (RG) and deep neural networks, wherein subsequent layers of neurons are analogous to successive steps along the RG. In particular, we quantify the flow of information by explicitly computing the relative entropy or Kullback-Leibler divergence in both the one- and two-dimensional Ising models under decimation RG, as well as in a feedforward neural network as a function of depth. We observe qualitatively identical behavior characterized by the monotonic increase to a parameter-dependent asymptotic value. On the quantum field theory side, the monotonic increase confirms the connection between the relative entropy and the c-theorem. For the neural networks, the asymptotic behavior may have implications for various information maximization methods in machine learning, as well as for disentangling compactness and generalizability. Furthermore, while both the two-dimensional Ising model and the random neural networks we consider exhibit non-trivial critical points, the relative entropy appears insensitive to the phase structure of either system. In this sense, more refined probes are required in order to fully elucidate the flow of information in these models.
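相对熵(KL 散度)本身的计算可以用如下极简 Python 草图说明:对一维 Ising 链在耦合 J 下的 Boltzmann 分布 p,相对于 J=0(无穷高温,均匀)的参考分布 q 计算 D(p||q)。这仅是一个通用示意(链长、耦合与参考分布均为此处的假设),并非论文中沿抽取 RG 逐步计算的具体设置:

```python
import math
from itertools import product

def kl_divergence(p, q):
    """相对熵 D(p || q) = sum_x p(x) * log(p(x)/q(x)),单位为 nat。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ising_distribution(n_spins, coupling):
    """开放边界一维 Ising 链的 Boltzmann 分布,按自旋构型穷举归一化。"""
    states = list(product([-1, 1], repeat=n_spins))
    weights = [math.exp(coupling * sum(s[i] * s[i + 1]
                                       for i in range(n_spins - 1)))
               for s in states]
    z = sum(weights)  # 配分函数
    return [w / z for w in weights]

# J=1 的链相对于 J=0(均匀)参考分布的相对熵:恒为非负。
p = ising_distribution(4, coupling=1.0)
q = ising_distribution(4, coupling=0.0)
print(round(kl_divergence(p, q), 4))
```

论文中的计算是在抽取 RG 的连续步骤之间(以及神经网络的逐层深度上)追踪这一量,并观察其单调增加至依赖参数的渐近值。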