stat (Statistics): 28 papers in total
【1】 Probabilistic Forecast Combination for Anomaly Detection in Building Heat Load Time Series
Authors: Mario Beykirch, Tim Janke, Imed Tayeche, Florian Steinke
Affiliations: Energy Information Networks and Systems, TU Darmstadt, Germany
Comments: Accepted in the proceedings of ISGT-Europe 2021, to be published by IEEE
Link: https://arxiv.org/abs/2107.10828
Abstract: We consider the problem of automated anomaly detection for building-level heat load time series. An anomaly detection model must be applicable to a diverse group of buildings and provide robust results on heat load time series with low signal-to-noise ratios, several seasonalities, and significant exogenous effects. We propose to employ a probabilistic forecast combination approach based on an ensemble of deterministic forecasts in an anomaly detection scheme that classifies observed values based on their probability under a predictive distribution. We show empirically that forecast-based anomaly detection provides improved accuracy when employing a forecast combination approach.
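As a concrete illustration of the detection scheme described in the abstract, here is a minimal sketch of our own, assuming a Gaussian predictive distribution assembled from the ensemble of point forecasts (the paper's actual combination method may differ):

```python
import numpy as np
from scipy.stats import norm

def anomaly_flags(observations, ensemble_forecasts, alpha=0.01):
    """Flag observations that are improbable under a combined predictive
    distribution. Sketch: the ensemble of deterministic forecasts
    (shape: n_models x n_timesteps) is combined into a Gaussian predictive
    distribution per time step; observations falling in either alpha/2
    tail are flagged as anomalies."""
    mu = ensemble_forecasts.mean(axis=0)           # combined point forecast
    sigma = ensemble_forecasts.std(axis=0) + 1e-6  # spread as predictive scale
    # two-sided tail probability of each observation under N(mu, sigma)
    p = 2 * norm.sf(np.abs(observations - mu) / sigma)
    return p < alpha

# toy example: 3 forecast models, 100 time steps, one injected anomaly
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 8 * np.pi, 100))
forecasts = truth + 0.1 * rng.standard_normal((3, 100))
obs = truth + 0.1 * rng.standard_normal(100)
obs[42] += 2.0  # injected anomaly
print(np.where(anomaly_flags(obs, forecasts))[0])
```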
【2】 On the Scalability of Informed Importance Tempering
Authors: Quan Zhou
Affiliations: Department of Statistics, Texas A&M University
Link: https://arxiv.org/abs/2107.10827
Abstract: Informed MCMC methods have been proposed as scalable solutions to Bayesian posterior computation on high-dimensional discrete state spaces. We study a class of MCMC schemes called informed importance tempering (IIT), which combine importance sampling and informed local proposals. Spectral gap bounds for IIT estimators are obtained, which demonstrate the remarkable scalability of IIT samplers for unimodal target distributions. The theoretical insights acquired in this note provide guidance on the choice of informed proposals in model selection and the use of importance sampling in MCMC methods.
【3】 Dimension-Free Anticoncentration Bounds for Gaussian Order Statistics with Discussion of Applications to Multiple Testing
Authors: Damian Kozbur
Affiliations: Department of Economics, University of Zürich, Schönberggasse, Zürich, Switzerland
Link: https://arxiv.org/abs/2107.10766
Abstract: The following anticoncentration property is proved. The probability that the $k$-th order statistic of an arbitrarily correlated jointly Gaussian random vector $X$ with unit variance components lies within an interval of length $\varepsilon$ is bounded above by $2\varepsilon k\,(1 + \mathrm{E}[\|X\|_\infty])$. This bound has implications for generalized error rate control in statistical high-dimensional multiple hypothesis testing problems, which are discussed subsequently.
【4】 A network Poisson model for weighted directed networks with covariates
Authors: Meng Xu, Qiuping Wang
Affiliations: Central China Normal University
Comments: 24 pages, 5 tables, 2 figures
Link: https://arxiv.org/abs/2107.10735
Abstract: The edges in networks are not only binary, either present or absent, but also take weighted values in many scenarios (e.g., the number of emails between two users). The covariate-$p_0$ model has been proposed to model binary directed networks with degree heterogeneity and covariates. However, it may cause information loss when applied to weighted networks. In this paper, we propose to use the Poisson distribution to model weighted directed networks, which admits the sparsity of networks, the degree heterogeneity, and the homophily caused by covariates of nodes. We call it the network Poisson model. The model contains a density parameter $\mu$, a $2n$-dimensional node parameter $\theta$, and a fixed-dimensional regression coefficient $\gamma$ of covariates. Since the number of parameters increases with $n$, the asymptotic theory is nonstandard. When the number $n$ of nodes goes to infinity, we establish the $\ell_\infty$-errors for the maximum likelihood estimators (MLEs), $\hat{\theta}$ and $\hat{\gamma}$, which are $O_p((\log n/n)^{1/2})$ for $\hat{\theta}$ and $O_p(\log n/n)$ for $\hat{\gamma}$, up to an additional factor. We also obtain the asymptotic normality of the MLEs. Numerical studies and a data analysis demonstrate our theoretical findings.
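One plausible reading of the model structure described above (the exact parameterization below is our assumption for illustration, not quoted from the paper): each directed edge weight $a_{ij}$ is Poisson with a log-linear rate,

```latex
a_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \mu + \theta_i^{\mathrm{out}} + \theta_j^{\mathrm{in}} + \gamma^{\top} z_{ij},
```

where $\mu$ controls the overall density, the $2n$ parameters $\theta_i^{\mathrm{out}}, \theta_j^{\mathrm{in}}$ capture out- and in-degree heterogeneity, and $z_{ij}$ are edge covariates driving homophily.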
【5】 Reproducibility of COVID-19 pre-prints
Authors: Annie Collins, Rohan Alexander
Affiliations: University of Toronto
Comments: 14 pages, 6 tables, 4 figures
Link: https://arxiv.org/abs/2107.10724
Abstract: To examine the reproducibility of COVID-19 research, we create a dataset of pre-prints posted to arXiv, bioRxiv, medRxiv, and SocArXiv between 28 January 2020 and 30 June 2021 that are related to COVID-19. We extract the text from these pre-prints and parse them looking for keyword markers signalling the availability of the data and code underpinning the pre-print. For the pre-prints that are in our sample, we are unable to find markers of either open data or open code for 75 per cent of those on arXiv, 67 per cent of those on bioRxiv, 79 per cent of those on medRxiv, and 85 per cent of those on SocArXiv. We conclude that there may be value in having authors categorize the degree of openness of their pre-print as part of the pre-print submissions process, and more broadly, there is a need to better integrate open science training into a wide range of fields.
【6】 Fed-ensemble: Improving Generalization through Model Ensembling in Federated Learning
Authors: Naichen Shi, Fan Lai, Raed Al Kontar, Mosharaf Chowdhury
Link: https://arxiv.org/abs/2107.10663
Abstract: In this paper we propose Fed-ensemble: a simple approach that brings model ensembling to federated learning (FL). Instead of aggregating local models to update a single global model, Fed-ensemble uses random permutations to update a group of K models and then obtains predictions through model averaging. Fed-ensemble can be readily utilized within established FL methods and does not impose a computational overhead as it only requires one of the K models to be sent to a client in each communication round. Theoretically, we show that predictions on new data from all K models belong to the same predictive posterior distribution under a neural tangent kernel regime. This result in turn sheds light on the generalization advantages of model averaging. We also illustrate that Fed-ensemble has an elegant Bayesian interpretation. Empirical results show that our model has superior performance over several FL algorithms, on a wide range of data sets, and excels in heterogeneous settings often encountered in FL applications.
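A minimal sketch of the round structure described above, using linear models as stand-ins for the client networks (our own toy rendering, not the authors' code; `local_update` and the gradient-step settings are illustrative):

```python
import numpy as np

def fed_ensemble_round(models, clients, local_update, rng):
    """One communication round of Fed-ensemble (sketch). Each of the K
    models is sent to a disjoint subset of clients chosen by a random
    permutation; each model is then replaced by the average of the local
    updates it received."""
    K = len(models)
    order = rng.permutation(len(clients))
    assignment = np.array_split(order, K)  # model k -> one stratum of clients
    new_models = []
    for k, idx in enumerate(assignment):
        updates = [local_update(models[k], clients[i]) for i in idx]
        new_models.append(np.mean(updates, axis=0))
    return new_models

def predict(models, x):
    """Prediction by model averaging over the ensemble of K models."""
    return np.mean([m @ x for m in models], axis=0)

# toy example: K=3 linear models, 12 clients, each holding (X, y) pairs
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(12):
    X = rng.standard_normal((20, 2))
    clients.append((X, X @ w_true + 0.1 * rng.standard_normal(20)))

def local_update(w, data, lr=0.1, steps=10):
    X, y = data
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)  # local gradient steps
    return w

models = [rng.standard_normal(2) for _ in range(3)]
for _ in range(30):
    models = fed_ensemble_round(models, clients, local_update, rng)
print(predict(models, np.array([1.0, 1.0])))  # ~ 1.0 (= 2 - 1)
```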
【7】 Kernel-Matrix Determinant Estimates from stopped Cholesky Decomposition
Authors: Simon Bartels, Wouter Boomsma, Jes Frellsen, Damien Garreau
Affiliations: University of Copenhagen, Universitetsparken, København, Denmark; Technical University of Denmark, Richard Petersens Plads, Kgs. Lyngby, Denmark; Université Côte d'Azur, Inria, CNRS, LJAD, Parc Valrose, Nice Cedex, France
Link: https://arxiv.org/abs/2107.10587
Abstract: Algorithms involving Gaussian processes or determinantal point processes typically require computing the determinant of a kernel matrix. Frequently, the latter is computed from the Cholesky decomposition, an algorithm of cubic complexity in the size of the matrix. We show that, under mild assumptions, it is possible to estimate the determinant from only a sub-matrix, with a probabilistic guarantee on the relative error. We present an augmentation of the Cholesky decomposition that stops under certain conditions before processing the whole matrix. Experiments demonstrate that this can save a considerable amount of time while having an overhead of less than 5% when not stopping early. More generally, we present a probabilistic stopping strategy for the approximation of a sum of known length where addends are revealed sequentially. We do not assume independence between addends, only that they are bounded from below and decrease in conditional expectation.
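To make the idea concrete, here is a sketch of a Cholesky decomposition that stops early and estimates the log-determinant from the sub-matrix processed so far. The stopping rule below (extrapolating the current pivot over the remaining rows) is our own simple heuristic, not the paper's probabilistic criterion:

```python
import numpy as np

def stopped_cholesky_logdet(K, rel_tol=1e-2):
    """Estimate log det(K) from a partial Cholesky decomposition.
    log det(K) = 2 * sum(log L_ii); since the pivots L_ii of a kernel
    matrix tend to decrease, we stop once extrapolating the current pivot
    over the remaining rows changes the estimate by less than rel_tol."""
    n = K.shape[0]
    L = np.zeros_like(K, dtype=float)
    logdet = 0.0
    for i in range(n):
        L[i, i] = np.sqrt(K[i, i] - L[i, :i] @ L[i, :i])
        L[i+1:, i] = (K[i+1:, i] - L[i+1:, :i] @ L[i, :i]) / L[i, i]
        logdet += 2.0 * np.log(L[i, i])
        remainder = 2.0 * (n - i - 1) * np.log(L[i, i])  # extrapolated tail
        if i > 0 and abs(remainder) < rel_tol * abs(logdet):
            return logdet + remainder, i + 1   # estimate, rows processed
    return logdet, n

# toy example: RBF kernel matrix with jitter
rng = np.random.default_rng(2)
x = rng.uniform(size=(500, 1))
K = np.exp(-((x - x.T) ** 2) / 0.1) + 1e-6 * np.eye(500)
est, rows = stopped_cholesky_logdet(K)
print(est, rows, np.linalg.slogdet(K)[1])  # estimate vs. exact
```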
【8】 One-parameter generalised Fisher information
Authors: Worachet Bukaew, Sikarin Yoo-Kong
Affiliations: The Institute for Fundamental Study (IF), Naresuan University, Phitsanulok, Thailand; Center of Excellence in Theoretical and Computational Science (TaCS-CoE), King Mongkut's University of Technology Thonburi (KMUTT)
Comments: 10 pages, 2 tables
Link: https://arxiv.org/abs/2107.10578
Abstract: We introduce the generalised Fisher information, a one-parameter extended class of the Fisher information. This new form of the Fisher information is obtained from the intriguing connection between the standard Fisher information and the variational principle, together with the non-uniqueness property of the Lagrangian. Furthermore, one can treat this one-parameter Fisher information as a generating function for obtaining what is called the Fisher information hierarchy. The generalised Cramér-Rao inequality is also derived. Interestingly, the whole Fisher information hierarchy, except for the standard Fisher information, does not follow the additive rule. This could suggest an indirect connection between the Tsallis entropy and the one-parameter Fisher information. Furthermore, the whole Fisher information hierarchy is also obtained from the two-parameter Kullback-Leibler divergence.
【9】 Graphical Influence Diagnostics for Changepoint Models
Authors: Ines Wilms, Rebecca Killick, David S. Matteson
Affiliations: Department of Quantitative Economics, Maastricht University; Department of Mathematics and Statistics, Lancaster University; Department of Statistics and Data Science, Cornell University
Link: https://arxiv.org/abs/2107.10572
Abstract: Changepoint models enjoy a wide appeal in a variety of disciplines to model the heterogeneity of ordered data. Graphical influence diagnostics to characterize the influence of single observations on changepoint models are, however, lacking. We address this gap by developing a framework for investigating instabilities in changepoint segmentations and assessing the influence of single observations on various outputs of a changepoint analysis. We construct graphical diagnostic plots that allow practitioners to assess whether instabilities occur; how and where they occur; and to detect influential individual observations triggering instability. We analyze well-log data to illustrate how such influence diagnostic plots can be used in practice to reveal features of the data that may otherwise remain hidden.
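The paper builds dedicated graphical diagnostics; as a minimal flavour of the underlying idea (our own sketch, not the paper's method), one can recompute a simple single-changepoint estimate with each observation deleted in turn and inspect how far the estimated changepoint moves:

```python
import numpy as np

def cusum_changepoint(y):
    """Location maximizing the CUSUM-type statistic for a single mean shift."""
    n = len(y)
    stats = [abs(y[:t].mean() - y[t:].mean()) * np.sqrt(t * (n - t) / n)
             for t in range(2, n - 1)]
    return int(np.argmax(stats)) + 2

def influence_on_changepoint(y):
    """Leave-one-out influence: how far the estimated changepoint moves
    when observation i is deleted."""
    base = cusum_changepoint(y)
    return np.array([abs(cusum_changepoint(np.delete(y, i)) - base)
                     for i in range(len(y))])

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1, 50)])
y[10] = 8.0  # an outlier well before the true changepoint at t=50
infl = influence_on_changepoint(y)
print(cusum_changepoint(y), np.argsort(infl)[-3:])  # most influential points
```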
【10】 An overcome of far-distance limitation on tunnel CCTV-based accident detection in AI deep-learning frameworks
Authors: Kyu-Beom Lee, Hyu-Soung Shin
Affiliations: Smart City & Construction Engineering, UST, Gyeonggi Province, Republic of Korea; Department of Future & Smart Construction Research, KICT, Gyeonggi Province, Republic of Korea
Comments: 6 pages, 3 figures, to be presented at the 2021 International Conference on Tunnels and Underground Spaces
Link: https://arxiv.org/abs/2107.10567
Abstract: Tunnel CCTVs are installed at low heights and long-distance intervals. However, because of the limitation on installation height, a severe perspective effect arises with distance, and it is almost impossible to detect vehicles far from the CCTV in the existing tunnel CCTV-based accident detection system (Pflugfelder 2005). To overcome this limitation, vehicle objects are detected through an object detection algorithm based on an inverse perspective transform, by re-setting the region of interest (ROI); this makes it possible to detect vehicles that are far away from the CCTV. To verify this process, this paper creates datasets consisting of images and bounding boxes based on the original and warped images of the same CCTV footage, and then compares the performance of deep learning object detection models trained on the two datasets. As a result, the model trained on the warped images was able to detect vehicle objects more accurately at positions far from the CCTV compared to the model trained on the original images.
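The ROI-re-setting plus inverse perspective transform step can be sketched with standard OpenCV calls (the file name and corner coordinates below are illustrative, not taken from the paper):

```python
import cv2
import numpy as np

img = cv2.imread("tunnel_cctv_frame.jpg")  # hypothetical input frame

# four corners of the road ROI in the original (perspective-distorted) image,
# ordered top-left, top-right, bottom-right, bottom-left
src = np.float32([[420, 300], [860, 300], [1230, 720], [50, 720]])
# corresponding corners in the warped, bird's-eye-view image
dst = np.float32([[0, 0], [640, 0], [640, 960], [0, 960]])

M = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(img, M, (640, 960))
cv2.imwrite("warped_frame.jpg", warped)
# A detector trained on such warped frames sees distant vehicles at a
# roughly uniform scale, which is the effect the paper exploits.
```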
【11】 Robust low-rank covariance matrix estimation with a general pattern of missing values
Authors: Alexandre Hippert-Ferrer, Mohammed Nabil El Korso, Arnaud Breloy, Guillaume Ginolhac
Affiliations: Paris-Saclay University; Paris-Nanterre University; Savoie Mont Blanc University
Comments: 32 pages, 10 figures, submitted to Elsevier Signal Processing
Link: https://arxiv.org/abs/2107.10505
Abstract: This paper tackles the problem of robust covariance matrix estimation when the data is incomplete. Classical statistical estimation methodologies are usually built upon the Gaussian assumption, whereas existing robust estimation ones assume unstructured signal models. The former can be inaccurate in real-world data sets in which heterogeneity causes heavy-tail distributions, while the latter does not profit from the usual low-rank structure of the signal. Taking advantage of both worlds, a covariance matrix estimation procedure is designed on a robust (compound Gaussian) low-rank model by leveraging the observed-data likelihood function within an expectation-maximization algorithm. It is also designed to handle general patterns of missing values. The proposed procedure is first validated on simulated data sets. Then, its interest for classification and clustering applications is assessed on two real data sets with missing values, which include multispectral and hyperspectral time series.
【12】 Scale-mixture Birnbaum-Saunders quantile regression models applied to personal accident insurance data
Authors: Alan Dasilva, Helton Saulo
Affiliations: Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil; Department of Statistics, University of Brasília, Brasília, Brazil
Link: https://arxiv.org/abs/2107.10365
Abstract: The modeling of personal accident insurance data has been a topic of extreme relevance in the insurance literature. In general, the data often exhibit positive asymmetry and heavy tails, and non-quantile Birnbaum-Saunders regression models have been used in the modeling strategy. In this work, we propose a new quantile regression model based on the scale-mixture Birnbaum-Saunders distribution, which is reparametrized by inserting a quantile parameter. The maximum likelihood estimates of the model parameters are obtained via the EM algorithm. Two Monte Carlo simulation studies were performed using the R software. The first study aims to analyze the performance of the maximum likelihood estimates, the information criteria AIC, AICc, BIC, HIC, the root mean square error, and the randomized quantile and generalized Cox-Snell residuals. In the second simulation study, the size and power of the Wald, likelihood ratio, score and gradient tests are evaluated. The two simulation studies were conducted considering different quantiles of interest and sample sizes. Finally, a real insurance data set is analyzed to illustrate the proposed approach.
【13】 Robust Nonparametric Regression with Deep Neural Networks
Authors: Guohao Shen, Yuling Jiao, Yuanyuan Lin, Jian Huang
Affiliations: Department of Statistics, The Chinese University of Hong Kong; School of Mathematics and Statistics, Wuhan University; Department of Statistics and Actuarial Science, University of Iowa
Comments: Guohao Shen and Yuling Jiao contributed equally to this work. Corresponding authors: Yuanyuan Lin (Email: ylin@sta.cuhk.edu.hk) and Jian Huang (Email: jian-huang@uiowa.edu). arXiv admin note: substantial text overlap with arXiv:2104.06708
Link: https://arxiv.org/abs/2107.10343
Abstract: In this paper, we study the properties of robust nonparametric estimation using deep neural networks for regression models with heavy-tailed error distributions. We establish the non-asymptotic error bounds for a class of robust nonparametric regression estimators using deep neural networks with ReLU activation under suitable smoothness conditions on the regression function and mild conditions on the error term. In particular, we only assume that the error distribution has a finite p-th moment with p greater than one. We also show that the deep robust regression estimators are able to circumvent the curse of dimensionality when the distribution of the predictor is supported on an approximate lower-dimensional set. An important feature of our error bound is that, for ReLU neural networks with network width and network size (number of parameters) no more than the order of the square of the dimensionality d of the predictor, our excess risk bounds depend sub-linearly on d. Our assumption relaxes the exact manifold support assumption, which could be restrictive and unrealistic in practice. We also relax several crucial assumptions on the data distribution, the target regression function and the neural networks required in the recent literature. Our simulation studies demonstrate that the robust methods can significantly outperform the least squares method when the errors have heavy-tailed distributions and illustrate that the choice of loss function is important in the context of deep nonparametric regression.
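The effect of the loss choice under heavy-tailed errors is easy to reproduce on a toy scale. Below is a minimal sketch of our own, with a linear model standing in for the network and a Huber loss as the robust loss (the paper studies deep ReLU networks and a broader class of robust losses):

```python
import numpy as np

def fit(X, y, grad_loss, lr=0.05, steps=2000):
    """Gradient descent on a linear model; grad_loss maps residuals to the
    derivative of the per-sample loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ grad_loss(X @ w - y) / len(y)
    return w

grad_l2 = lambda r: r                               # least squares
grad_huber = lambda r, d=1.0: np.clip(r, -d, d)     # Huber (robust)

# heavy-tailed errors: Student-t with 1.5 degrees of freedom, so only
# moments of order p < 1.5 are finite, matching the weak moment assumption
rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.standard_t(df=1.5, size=500)

print("L2   :", fit(X, y, grad_l2))      # can be badly pulled by outliers
print("Huber:", fit(X, y, grad_huber))   # typically close to w_true
```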
【14】 Recovering lost and absent information in temporal networks
Authors: James P. Bagrow, Sune Lehmann
Affiliations: Mathematics & Statistics, University of Vermont, Burlington, VT, United States; Vermont Complex Systems Center, University of Vermont, Burlington, VT, United States; DTU Compute, Technical University of Denmark, Kgs. Lyngby, Denmark
Comments: 19 pages, 5 figures, 1 table, plus supporting information
Link: https://arxiv.org/abs/2107.10835
Abstract: The full range of activity in a temporal network is captured in its edge activity data -- time series encoding the tie strengths or on-off dynamics of each edge in the network. However, in many practical applications, edge-level data are unavailable, and the network analyses must rely instead on node activity data which aggregates the edge-activity data and thus is less informative. This raises the question: Is it possible to use the static network to recover the richer edge activities from the node activities? Here we show that recovery is possible, often with a surprising degree of accuracy given how much information is lost, and that the recovered data are useful for subsequent network analysis tasks. Recovery is more difficult when network density increases, either topologically or dynamically, but exploiting dynamical and topological sparsity enables effective solutions to the recovery problem. We formally characterize the difficulty of the recovery problem both theoretically and empirically, proving the conditions under which recovery errors can be bounded and showing that, even when these conditions are not met, good quality solutions can still be derived. Effective recovery carries both promise and peril, as it enables deeper scientific study of complex systems but in the context of social systems also raises privacy concerns when social information can be aggregated across multiple data sources.
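A minimal formulation of the recovery idea (our own sketch, not the paper's algorithm): if node activity is the sum of activities on incident edges, then with unsigned incidence matrix B (nodes x edges) the edge activities solve B e = n per time step, and nonnegativity acts as a simple sparsity-exploiting constraint:

```python
import numpy as np
from scipy.optimize import nnls

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n_nodes = 4
B = np.zeros((n_nodes, len(edges)))
for j, (u, v) in enumerate(edges):
    B[u, j] = B[v, j] = 1.0            # edge j is incident to nodes u and v

rng = np.random.default_rng(5)
true_edge_activity = rng.poisson(1.0, size=(len(edges), 50)).astype(float)
node_activity = B @ true_edge_activity  # what we actually observe

# recover edge activities time step by time step via nonnegative least squares
recovered = np.column_stack(
    [nnls(B, node_activity[:, t])[0] for t in range(50)])
print(np.abs(recovered - true_edge_activity).mean())
```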
【15】 Neural Variational Gradient Descent
Authors: Lauro Langosco di Langosco, Vincent Fortuin, Heiko Strathmann
Link: https://arxiv.org/abs/2107.10731
Abstract: Particle-based approximate Bayesian inference approaches such as Stein Variational Gradient Descent (SVGD) combine the flexibility and convergence guarantees of sampling methods with the computational benefits of variational inference. In practice, SVGD relies on the choice of an appropriate kernel function, which impacts its ability to model the target distribution -- a challenging problem with only heuristic solutions. We propose Neural Variational Gradient Descent (NVGD), which is based on parameterizing the witness function of the Stein discrepancy by a deep neural network whose parameters are learned in parallel to the inference, mitigating the necessity to make any kernel choices whatsoever. We empirically evaluate our method on popular synthetic inference problems, real-world Bayesian linear regression, and Bayesian neural network inference.
【16】 Typing assumptions improve identification in causal discovery
Authors: Philippe Brouillard, Perouz Taslakian, Alexandre Lacoste, Sebastien Lachapelle, Alexandre Drouin
Comments: Accepted for presentation as a contributed talk at the Workshop on the Neglected Assumptions in Causal Inference (NACI) at the 38th International Conference on Machine Learning, 2021
Link: https://arxiv.org/abs/2107.10703
Abstract: Causal discovery from observational data is a challenging task to which an exact solution cannot always be identified. Under assumptions about the data-generative process, the causal graph can often be identified up to an equivalence class. Proposing new realistic assumptions to circumscribe such equivalence classes is an active field of research. In this work, we propose a new set of assumptions that constrain possible causal relationships based on the nature of the variables. We thus introduce typed directed acyclic graphs, in which variable types are used to determine the validity of causal relationships. We demonstrate, both theoretically and empirically, that the proposed assumptions can result in significant gains in the identification of the causal graph.
【17】 Differentially Private Algorithms for 2020 Census Detailed DHC Race & Ethnicity
Authors: Sam Haney, William Sexton, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau
Affiliations: Michael Hay and Gerome Miklau are faculty members at Duke University, Colgate University, and University of Massachusetts
Comments: Presented at Theory and Practice of Differential Privacy Workshop (TPDP) 2021
Link: https://arxiv.org/abs/2107.10659
Abstract: This article describes proposed differentially private (DP) algorithms that the US Census Bureau is considering in order to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US, crossed with detailed races and tribes at varying levels of geography. We describe two differentially private algorithmic strategies, one based on adding noise drawn from a two-sided geometric distribution that satisfies "pure" DP, and another based on adding noise from a discrete Gaussian distribution that satisfies a well-studied variant of differential privacy called zero-concentrated differential privacy (zCDP). We analytically estimate the privacy loss parameters ensured by the two algorithms for comparable levels of error introduced in the statistics.
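The two-sided geometric mechanism is simple to sketch: the difference of two i.i.d. geometric variables on {0, 1, ...} is two-sided geometric (a "discrete Laplace"), and adding it to a sensitivity-1 count yields pure eps-DP. A minimal illustration of the general mechanism, not the Bureau's implementation (the counts are made up):

```python
import numpy as np

def two_sided_geometric_noise(eps, size, rng):
    """Two-sided geometric noise for an eps-DP counting query with
    sensitivity 1: P(Z = z) is proportional to exp(-eps)**|z|."""
    p = 1.0 - np.exp(-eps)
    g1 = rng.geometric(p, size) - 1  # numpy's geometric is supported on {1,2,...}
    g2 = rng.geometric(p, size) - 1
    return g1 - g2

rng = np.random.default_rng(6)
true_counts = np.array([1203, 87, 0, 45612])   # e.g. detailed race counts
noisy = true_counts + two_sided_geometric_noise(eps=1.0, size=4, rng=rng)
print(noisy)  # a pure eps-DP release of the counts
```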
【18】 Fast Low-Rank Tensor Decomposition by Ridge Leverage Score Sampling
Authors: Matthew Fahrbach, Mehrdad Ghadiri, Thomas Fu
Affiliations: Google Research; Georgia Tech
Comments: 29 pages, 1 figure
Link: https://arxiv.org/abs/2107.10654
Abstract: Low-rank tensor decomposition generalizes low-rank matrix approximation and is a powerful technique for discovering low-dimensional structure in high-dimensional data. In this paper, we study Tucker decompositions and use tools from randomized numerical linear algebra called ridge leverage scores to accelerate the core tensor update step in the widely-used alternating least squares (ALS) algorithm. Updating the core tensor, a severe bottleneck in ALS, is a highly-structured ridge regression problem where the design matrix is a Kronecker product of the factor matrices. We show how to use approximate ridge leverage scores to construct a sketched instance for any ridge regression problem such that the solution vector for the sketched problem is a $(1+\varepsilon)$-approximation to the original instance. Moreover, we show that classical leverage scores suffice as an approximation, which then allows us to exploit the Kronecker structure and update the core tensor in time that depends predominantly on the rank and the sketching parameters (i.e., sublinear in the size of the input tensor). We also give upper bounds for ridge leverage scores as rows are removed from the design matrix (e.g., if the tensor has missing entries), and we demonstrate the effectiveness of our approximate ridge regression algorithm for large, low-rank Tucker decompositions on both synthetic and real-world data.
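A generic sketch of the tool the paper builds on, ridge leverage score row sampling for ridge regression (our own illustration; the paper's contribution is doing this implicitly for Kronecker-structured design matrices without ever forming A):

```python
import numpy as np

def ridge_leverage_scores(A, lam):
    """tau_i = a_i^T (A^T A + lam I)^{-1} a_i for each row a_i of A."""
    G = A.T @ A + lam * np.eye(A.shape[1])
    return np.einsum('ij,jk,ik->i', A, np.linalg.inv(G), A)

def sketch_ridge_regression(A, b, lam, m, rng):
    """Sample m rows with probability proportional to their ridge leverage
    scores, rescale, and solve the smaller ridge problem."""
    tau = ridge_leverage_scores(A, lam)
    p = tau / tau.sum()
    idx = rng.choice(len(p), size=m, p=p)
    w = 1.0 / np.sqrt(m * p[idx])          # importance-sampling rescaling
    SA, Sb = A[idx] * w[:, None], b[idx] * w
    return np.linalg.solve(SA.T @ SA + lam * np.eye(A.shape[1]), SA.T @ Sb)

rng = np.random.default_rng(7)
A = rng.standard_normal((20000, 30))
b = A @ rng.standard_normal(30) + 0.1 * rng.standard_normal(20000)
x_full = np.linalg.solve(A.T @ A + np.eye(30), A.T @ b)
x_sketch = sketch_ridge_regression(A, b, lam=1.0, m=2000, rng=rng)
print(np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full))
```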
【19】 Análisis de Canasta de mercado en supermercados mediante mapas auto-organizados (Market basket analysis in supermarkets using self-organizing maps)
Authors: Joaquín Cordero, Alfredo Bolt, Mauricio Valle
Comments: 18 pages, in Spanish, 7 figures, 5 tables
Link: https://arxiv.org/abs/2107.10647
Abstract: Introduction: An important supermarket chain in the western zone of the capital of Chile needs key information to make decisions. This information is available in its databases but needs to be processed, since the complexity and quantity of the data make it difficult to visualize. Method: For this purpose, an algorithm was developed using artificial neural networks, applying Kohonen's SOM method. Certain key steps must be followed to develop it, such as data mining, which filters the transactions so that only the data relevant to market basket analysis is used. After filtering the information, the data must be prepared. After data preparation, we set up the Python programming environment, adapted it to the sample data, and then trained the SOM with its parameters set according to test results. Result: the trained SOM captures the relationships between the most frequently purchased products by positioning them topologically close to one another, suggesting promotions, packs, and bundles for the retail manager to consider, because these relationships were obtained from customers' real transactions. Conclusion: Based on this, recommendations on frequent shopping baskets have been made to the supermarket chain that provided the data used in the research.
【20】 Bandit Quickest Changepoint Detection
Authors: Aditya Gopalan, Venkatesh Saligrama, Braghadeesh Lakshminarayanan
Affiliations: Dept. of Electrical Communication Engineering, Indian Institute of Science, Bengaluru, India; Dept. of Electrical and Computer Engineering, Boston University, Boston, MA, USA
Comments: 26 pages including appendices
Link: https://arxiv.org/abs/2107.10492
Abstract: Detecting abrupt changes in temporal behavior patterns is of interest in many industrial and security applications. Abrupt changes are often local and observable primarily through a well-aligned sensing action (e.g., a camera with a narrow field-of-view). Due to resource constraints, continuous monitoring of all of the sensors is impractical. We propose the bandit quickest changepoint detection framework as a means of balancing sensing cost with detection delay. In this framework, sensing actions (or sensors) are sequentially chosen, and only measurements corresponding to chosen actions are observed. We derive an information-theoretic lower bound on the detection delay for a general class of finitely parameterized probability distributions. We then propose a computationally efficient online sensing scheme, which seamlessly balances the need for exploration of different sensing options with exploitation of querying informative actions. We derive expected delay bounds for the proposed scheme and show that these bounds match our information-theoretic lower bounds at low false alarm rates, establishing optimality of the proposed method. We then perform a number of experiments on synthetic and real datasets demonstrating the efficacy of our proposed method.
【21】 Efficient Neural Causal Discovery without Acyclicity Constraints
Authors: Phillip Lippe, Taco Cohen, Efstratios Gavves
Affiliations: University of Amsterdam, QUVA Lab; Qualcomm AI Research
Comments: 8th Causal Inference Workshop at UAI 2021 (contributed talk). 34 pages, 12 figures
Link: https://arxiv.org/abs/2107.10483
Abstract: Learning the structure of a causal graphical model using both observational and interventional data is a fundamental problem in many scientific fields. A promising direction is continuous optimization for score-based methods, which efficiently learn the causal graph in a data-driven manner. However, to date, those methods require constrained optimization to enforce acyclicity or lack convergence guarantees. In this paper, we present ENCO, an efficient structure learning method for directed, acyclic causal graphs leveraging observational and interventional data. ENCO formulates the graph search as an optimization of independent edge likelihoods, with the edge orientation being modeled as a separate parameter. Consequently, we can provide convergence guarantees of ENCO under mild conditions without constraining the score function with respect to acyclicity. In experiments, we show that ENCO can efficiently recover graphs with hundreds of nodes, an order of magnitude larger than what was previously possible, while handling deterministic variables and latent confounders.
【22】 Learning Sparse Fixed-Structure Gaussian Bayesian Networks
Authors: Arnab Bhattacharyya, Davin Choo, Rishikesh Gajjala, Sutanu Gayen, Yuhao Wang
Affiliations: National University of Singapore; Indian Institute of Science, Bangalore
Comments: 30 pages, 11 figures
Link: https://arxiv.org/abs/2107.10450
Abstract: Gaussian Bayesian networks (a.k.a. linear Gaussian structural equation models) are widely used to model causal interactions among continuous variables. In this work, we study the problem of learning a fixed-structure Gaussian Bayesian network up to a bounded error in total variation distance. We analyze the commonly used node-wise least squares regression (LeastSquares) and prove that it has near-optimal sample complexity. We also study a couple of new algorithms for the problem:
- BatchAvgLeastSquares takes the average of several batches of least squares solutions at each node, so that one can interpolate between the batch size and the number of batches. We show that BatchAvgLeastSquares also has near-optimal sample complexity.
- CauchyEst takes the median of solutions to several batches of linear systems at each node. We show that the algorithm specialized to polytrees, CauchyEstTree, has near-optimal sample complexity.
Experimentally, we show that for uncontaminated, realizable data, the LeastSquares algorithm performs best, but in the presence of contamination or DAG misspecification, CauchyEst/CauchyEstTree and BatchAvgLeastSquares respectively perform better. A sketch of the LeastSquares baseline follows below.
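A minimal sketch of the node-wise least squares baseline (the batching/median variants described above replace the single solve per node with averages or medians over batches):

```python
import numpy as np

def nodewise_least_squares(X, parents):
    """Fit a fixed-structure Gaussian Bayesian network by regressing each
    node on its known parents; returns edge coefficients and residual
    noise variances."""
    n, d = X.shape
    coeffs, noise_vars = {}, {}
    for j in range(d):
        pa = parents[j]
        if pa:
            w, *_ = np.linalg.lstsq(X[:, pa], X[:, j], rcond=None)
            resid = X[:, j] - X[:, pa] @ w
        else:
            w, resid = np.array([]), X[:, j]
        coeffs[j], noise_vars[j] = w, resid.var()
    return coeffs, noise_vars

# toy chain 0 -> 1 -> 2
rng = np.random.default_rng(8)
n = 5000
x0 = rng.standard_normal(n)
x1 = 0.8 * x0 + 0.5 * rng.standard_normal(n)
x2 = -1.2 * x1 + 0.3 * rng.standard_normal(n)
coeffs, _ = nodewise_least_squares(np.column_stack([x0, x1, x2]),
                                   parents={0: [], 1: [0], 2: [1]})
print(coeffs[1], coeffs[2])  # ~ [0.8], [-1.2]
```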
【23】 Species Distribution Modeling for Machine Learning Practitioners: A Review
Authors: Sara Beery, Elijah Cole, Joseph Parker, Pietro Perona, Kevin Winner
Affiliations: California Institute of Technology; Yale University
Comments: ACM COMPASS 2021
Link: https://arxiv.org/abs/2107.10400
Abstract: Conservation science depends on an accurate understanding of what's happening in a given ecosystem. How many species live there? What is the makeup of the population? How is that changing over time? Species Distribution Modeling (SDM) seeks to predict the spatial (and sometimes temporal) patterns of species occurrence, i.e. where a species is likely to be found. The last few years have seen a surge of interest in applying powerful machine learning tools to challenging problems in ecology. Despite its considerable importance, SDM has received relatively little attention from the computer science community. Our goal in this work is to provide computer scientists with the necessary background to read the SDM literature and develop ecologically useful ML-based SDM algorithms. In particular, we introduce key SDM concepts and terminology, review standard models, discuss data availability, and highlight technical challenges and pitfalls.
【24】 On the Use of Time Series Kernel and Dimensionality Reduction to Identify the Acquisition of Antimicrobial Multidrug Resistance in the Intensive Care Unit
Authors: Óscar Escudero-Arnanz, Joaquín Rodríguez-Álvarez, Karl Øyvind Mikalsen, Robert Jenssen, Cristina Soguero-Ruiz
Affiliations: Rey Juan Carlos University, Fuenlabrada, Madrid, Spain; University Hospital of Fuenlabrada; University Hospital of North Norway; UiT The Arctic University of Norway, Tromsø, Norway
Link: https://arxiv.org/abs/2107.10398
Abstract: The acquisition of Antimicrobial Multidrug Resistance (AMR) in patients admitted to the Intensive Care Units (ICU) is a major global concern. This study analyses data in the form of multivariate time series (MTS) from 3476 patients recorded at the ICU of University Hospital of Fuenlabrada (Madrid) from 2004 to 2020. 18% of the patients acquired AMR during their stay in the ICU. The goal of this paper is an early prediction of the development of AMR. Towards that end, we leverage the time-series cluster kernel (TCK) to learn similarities between MTS. To evaluate the effectiveness of TCK as a kernel, we applied several dimensionality reduction techniques for visualization and classification tasks. The experimental results show that TCK allows identifying a group of patients that acquire the AMR during the first 48 hours of their ICU stay, and it also provides good classification capabilities.
【25】 Improving COVID-19 Forecasting using eXogenous Variables
Authors: Mohammadhossein Toutiaee, Xiaochuan Li, Yogesh Chaudhari, Shophine Sivaraja, Aishwarya Venkataraj, Indrajeet Javeri, Yuan Ke, Ismailcem Arpinar, Nicole Lazar, John Miller
Affiliations: Department of Computer Science, The University of Georgia, Athens, GA; Department of Statistics, The Pennsylvania State University, University Park, PA
Link: https://arxiv.org/abs/2107.10397
Abstract: In this work, we study the pandemic course in the United States by considering national and state level data. We propose and compare multiple time-series prediction techniques which incorporate auxiliary variables. One type of approach is based on spatio-temporal graph neural networks, which forecast the pandemic course by utilizing a hybrid deep learning architecture and human mobility data. Nodes in this graph represent the state-level deaths due to COVID-19, edges represent the human mobility trend, and temporal edges correspond to node attributes across time. The second approach is based on a statistical technique for COVID-19 mortality prediction in the United States that uses the SARIMA model and eXogenous variables. We evaluate these techniques on both state- and national-level COVID-19 data in the United States and claim that the SARIMA and MCP models, with forecast values generated from the eXogenous variables, can enrich the underlying model to capture the complexity of the national- and state-level data, respectively. We demonstrate significant enhancement in the forecasting accuracy for a COVID-19 dataset, with a maximum improvement in forecasting accuracy of 64.58% and 59.18% (on average) over the GCN-LSTM model on the national-level data, and 58.79% and 52.40% (on average) over the GCN-LSTM model on the state-level data. Additionally, our proposed model outperforms a parallel study (AUG-NN) by a 27.35% improvement in accuracy on average.
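The SARIMA-with-eXogenous-variables component can be sketched with statsmodels' SARIMAX (the (p, d, q) order, the lag, and the synthetic mobility series below are illustrative; the paper selects its own orders and uses real mobility data):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(9)
T = 200
mobility = np.clip(np.cumsum(rng.normal(0, 1, T)), -20, 20)        # exogenous driver
deaths = 50 + 2.0 * np.roll(mobility, 14) + rng.normal(0, 3, T)    # lagged effect

endog = pd.Series(deaths[:180])
exog = pd.Series(mobility[:180])
res = SARIMAX(endog, exog=exog, order=(1, 1, 1)).fit(disp=False)
forecast = res.forecast(steps=20, exog=pd.Series(mobility[180:]))  # needs future exog
print(forecast.values[:5])
```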
【26】 Identification of Tissue Optical Properties During Thermal Laser-Tissue Interactions: Approach and Preliminary Evaluation
Authors: Andrea Arnold, Loris Fichera
Affiliations: Department of Mathematical Sciences, Worcester Polytechnic Institute, Worcester, MA, USA; Department of Robotics Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
Comments: 17 pages, 5 figures
Link: https://arxiv.org/abs/2107.10340
Abstract: In this paper, we propose a computational framework to estimate the physical tissue properties that govern the thermal response of laser-irradiated tissue. We focus in particular on two quantities, the absorption and scattering coefficients, which describe the optical absorption of light in the tissue and whose knowledge is vital to correctly plan medical laser treatments. To perform the estimation, we utilize an implementation of the Ensemble Kalman Filter (EnKF), a type of Bayesian filtering algorithm for data assimilation. Unlike prior approaches, in this work we estimate the tissue optical properties based on the observed surface thermal response to laser irradiation. This method has the potential for straightforward implementation in a clinical setup, as it would only require a simple thermal sensor, e.g., a miniaturized infrared camera. Because the optical properties of tissue can undergo shifts during laser exposure, we employ a variant of EnKF capable of tracking time-varying parameters. Through preliminary evaluation in simulation, we demonstrate the ability of the proposed technique to identify the tissue optical properties and track their dynamic changes during laser exposure, while simultaneously tracking changes in the tissue temperature at locations beneath the surface.
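A toy Ensemble Kalman Filter with state augmentation, the general technique behind the paper (our own illustration; the paper couples EnKF to a laser-tissue heat model and estimates absorption and scattering coefficients, whereas the forward model and parameter below are made up):

```python
import numpy as np

rng = np.random.default_rng(10)
dt, n_steps, n_ens = 0.1, 200, 100
a_true = 0.5                      # unknown parameter (absorption-like)
u = 1.0                           # forcing (laser input)

def step(T, a):                   # forward model: dT/dt = a * (u - T)
    return T + dt * a * (u - T)

# synthetic truth and noisy "surface" observations
T_true, obs = 0.0, []
for _ in range(n_steps):
    T_true = step(T_true, a_true)
    obs.append(T_true + 0.02 * rng.standard_normal())

# ensemble over the augmented state (T, a); the artificial jitter on `a`
# is what lets the filter track parameters that drift over time
ens = np.column_stack([np.zeros(n_ens), rng.uniform(0.1, 1.0, n_ens)])
for y in obs:
    ens[:, 0] = step(ens[:, 0], ens[:, 1])        # propagate each member
    ens[:, 1] += 0.005 * rng.standard_normal(n_ens)
    C = np.cov(ens.T)                             # ensemble covariance
    K = C[:, 0] / (C[0, 0] + 0.02 ** 2)           # Kalman gain, scalar obs of T
    perturbed = y + 0.02 * rng.standard_normal(n_ens)
    ens += np.outer(perturbed - ens[:, 0], K)     # stochastic EnKF update
print(ens[:, 1].mean())  # ~ a_true
```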
【27】 A Sparsity Algorithm with Applications to Corporate Credit Rating
Authors: Dan Wang, Zhi Chen, Ionut Florescu
Comments: 16 pages, 11 tables, 3 figures
Link: https://arxiv.org/abs/2107.10306
Abstract: In Artificial Intelligence, interpreting the results of a Machine Learning technique, often termed a black box, is a difficult task. A counterfactual explanation of a particular "black box" attempts to find the smallest change to the input values that modifies the prediction to a particular output, other than the original one. In this work we formulate the problem of finding a counterfactual explanation as an optimization problem. We propose a new "sparsity algorithm" which solves the optimization problem while also maximizing the sparsity of the counterfactual explanation. We apply the sparsity algorithm to provide a simple suggestion to publicly traded companies in order to improve their credit ratings. We validate the sparsity algorithm with a synthetically generated dataset and we further apply it to quarterly financial statements from companies in the financial, healthcare and IT sectors of the US market. We provide evidence that the counterfactual explanation can capture the nature of the real statement features that changed between the current quarter and the following quarter when ratings improved. The empirical results show that the higher the rating of a company, the greater the "effort" required to further improve its credit rating.
【28】 LeanML: A Design Pattern To Slash Avoidable Wastes in Machine Learning Projects
Authors: Yves-Laurent Kom Samo
Affiliations: KXY Technologies, Inc., Technology Dr., San Jose, CA, USA
Link: https://arxiv.org/abs/2107.08066
Abstract: We introduce the first application of the lean methodology to machine learning projects. Similar to lean startups and lean manufacturing, we argue that lean machine learning (LeanML) can drastically slash avoidable wastes in commercial machine learning projects, reduce the business risk in investing in machine learning capabilities and, in so doing, further democratize access to machine learning. The lean design pattern we propose in this paper is based on two realizations. First, it is possible to estimate the best performance one may achieve when predicting an outcome $y \in \mathcal{Y}$ using a given set of explanatory variables $x \in \mathcal{X}$, for a wide range of performance metrics, and without training any predictive model. Second, doing so is considerably easier, faster, and cheaper than learning the best predictive model. We derive formulae expressing the best $R^2$, MSE, classification accuracy, and log-likelihood per observation achievable when using $x$ to predict $y$ as a function of the mutual information $I(y; x)$, and possibly a measure of the variability of $y$ (e.g., its Shannon entropy in the case of classification accuracy, and its variance in the case of regression MSE). We illustrate the efficacy of the LeanML design pattern on a wide range of regression and classification problems, synthetic and real-life.
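To give the flavor of such formulae for the regression case (a sketch consistent with the abstract, with mutual information measured in nats; the paper derives the exact statements and the analogues for the other metrics), the best achievable $R^2$ and MSE relate to $I(y; x)$ as

```latex
\bar{R}^2 = 1 - e^{-2 I(y;x)}, \qquad
\overline{\mathrm{MSE}} = \mathrm{Var}(y)\, e^{-2 I(y;x)}.
```

In particular, when $I(y; x) = 0$ no model can beat predicting the mean, and the achievable error floor shrinks exponentially as the mutual information grows.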