stat (Statistics): 44 papers in total
【1】 MIxBN: library for learning Bayesian networks from mixed data
Authors: Anna V. Bubnova, Irina Deeva, Anna V. Kalyuzhnaya
Affiliations: ITMO University, Saint-Petersburg, Russia
Link: https://arxiv.org/abs/2106.13194
Abstract: This paper describes a new library for learning Bayesian networks from data containing discrete and continuous variables (mixed data). In addition to the classical learning methods on discretized data, this library proposes its own algorithm that allows structural learning and parameter learning from mixed data without discretization, since data discretization leads to information loss. This algorithm is based on a mixed MI score function for structural learning, and on linear regression and Gaussian distribution approximation for parameter learning. The library also offers two algorithms for enumerating graph structures: the greedy Hill-Climbing algorithm and the evolutionary algorithm. Thus the key capabilities of the proposed library are as follows: (1) structural and parameter learning of a Bayesian network on discretized data, (2) structural and parameter learning of a Bayesian network on mixed data using the MI mixed score function and Gaussian approximation, (3) launching learning algorithms on one of two algorithms for enumerating graph structures, Hill-Climbing and the evolutionary algorithm. Since the need for mixed data representation comes from practical necessity, the advantages of our implementations are evaluated in the context of solving approximation and gap recovery problems on synthetic data and real datasets.
【2】 Practical strategies for GEV-based regression models for extremes
Authors: Daniela Castro-Camilo, Raphaël Huser, Håvard Rue
Comments: 19 pages, 3 figures
Link: https://arxiv.org/abs/2106.13110
Abstract: The generalised extreme value (GEV) distribution is a three parameter family that describes the asymptotic behaviour of properly renormalised maxima of a sequence of independent and identically distributed random variables. If the shape parameter $\xi$ is zero, the GEV distribution has unbounded support, whereas if $\xi$ is positive, the limiting distribution is heavy-tailed with infinite upper endpoint but finite lower endpoint. In practical applications, we assume that the GEV family is a reasonable approximation for the distribution of maxima over blocks, and we fit it accordingly. This implies that GEV properties, such as finite lower endpoint in the case $\xi>0$, are inherited by the finite-sample maxima, which might not have bounded support. This is particularly problematic when predicting extreme observations based on multiple and interacting covariates. To tackle this usually overlooked issue, we propose a blended GEV distribution, which smoothly combines the left tail of a Gumbel distribution (GEV with $\xi=0$) with the right tail of a Fréchet distribution (GEV with $\xi>0$) and, therefore, has unbounded support. Using a Bayesian framework, we reparametrise the GEV distribution to offer a more natural interpretation of the (possibly covariate-dependent) model parameters. Independent priors over the new location and spread parameters induce a joint prior distribution for the original location and scale parameters. We introduce the concept of property-preserving penalised complexity (P$^3$C) priors and apply it to the shape parameter to preserve first and second moments. We illustrate our methods with an application to NO$_2$ pollution levels in California, which reveals the robustness of the bGEV distribution, as well as the suitability of the new parametrisation and the P$^3$C prior framework.
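The support issue described in this abstract can be made concrete with the textbook GEV distribution function. The sketch below is an illustration only: it uses the standard parametrisation with location `mu`, scale `sigma` and shape `xi`, the function names are hypothetical, and it does not implement the blended (bGEV) distribution or the reparametrisation proposed in the paper.

```python
import math


def gev_cdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """CDF of the GEV distribution in its standard parametrisation."""
    z = (x - mu) / sigma
    if xi == 0.0:
        # Gumbel case: the support is the whole real line.
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # x lies outside the support: below the lower endpoint when xi > 0,
        # above the upper endpoint when xi < 0.
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def gev_lower_endpoint(mu, sigma, xi):
    """Lower endpoint of the support: mu - sigma/xi if xi > 0, else -infinity."""
    return mu - sigma / xi if xi > 0.0 else -math.inf


# With xi = 0.5 the support starts at -2, so x = -3 has probability zero,
# whereas the Gumbel case (xi = 0) still assigns it a small positive mass.
print(gev_lower_endpoint(0.0, 1.0, 0.5))
print(gev_cdf(-3.0, xi=0.5), gev_cdf(-3.0, xi=0.0))
```

A fitted model with positive shape therefore treats any future observation below `mu - sigma/xi` as impossible, which is the behaviour the blended distribution is designed to avoid.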
【3】 Optimal sequential sampling design for environmental extremes
Authors: Raphaël de Fondeville, Matthieu Wilhelm
Link: https://arxiv.org/abs/2106.13077
Abstract: The Sihl river, located near the city of Zurich in Switzerland, is under continuous and tight surveillance as it flows directly under the city's main railway station. To issue early warnings and conduct accurate risk quantification, a dense network of monitoring stations is necessary inside the river basin. However, as of 2021 only three automatic stations are operated in this region, naturally raising the question: how to extend this network for optimal monitoring of extreme rainfall events? So far, existing methodologies for station network design have mostly focused on maximizing interpolation accuracy or minimizing the uncertainty of some model's parameters estimates. In this work, we propose new principles inspired from extreme value theory for optimal monitoring of extreme events. For stationary processes, we study the theoretical properties of the induced sampling design that yields non-trivial point patterns resulting from a compromise between a boundary effect and the maximization of inter-location distances. For general applications, we propose a theoretically justified functional peak-over-threshold model and provide an algorithm for sequential station selection. We then issue recommendations for possible extensions of the Sihl river monitoring network, by efficiently leveraging both station and radar measurements available in this region.
【4】 Cauchy or not Cauchy? New goodness-of-fit tests for the Cauchy distribution
Authors: Bruno Ebner, Lena Eid, Bernhard Klar
Affiliations: Institute of Stochastics, Karlsruhe Institute of Technology (KIT), Englerstr. , D-, Karlsruhe.
Comments: 21 pages, 1 figure, 6 tables
Link: https://arxiv.org/abs/2106.13073
Abstract: We introduce a new characterization of the Cauchy distribution and propose a class of goodness-of-fit tests to the Cauchy family. The limit distribution is derived in a Hilbert space framework under the null hypothesis and under fixed alternatives. The new tests are consistent against a large class of alternatives. A comparative Monte Carlo simulation study shows that the test is competitive to the state of the art procedures, and we apply the tests to log-returns of cryptocurrencies.
【5】 Understanding Uncertainty in Bayesian Deep Learning
Authors: Cooper Lorsung
Affiliations: School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts
Comments: 97 pages, 32 figures, Master of Engineering Thesis
Link: https://arxiv.org/abs/2106.13055
Abstract: Neural Linear Models (NLM) are deep Bayesian models that produce predictive uncertainty by learning features from the data and then performing Bayesian linear regression over these features. Despite their popularity, few works have focused on formally evaluating the predictive uncertainties of these models. Furthermore, existing works point out the difficulties of encoding domain knowledge in models like NLMs, making them unsuitable for applications where interpretability is required. In this work, we show that traditional training procedures for NLMs can drastically underestimate uncertainty in data-scarce regions. We identify the underlying reasons for this behavior and propose a novel training method that can both capture useful predictive uncertainties as well as allow for incorporation of domain knowledge.
【6】 Quantization Aware Training, ERNIE and Kurtosis Regularizer: a short empirical study
Authors: Andrea Zanetti
Comments: 13 pages, 8 figures
Link: https://arxiv.org/abs/2106.13035
Abstract: Pre-trained language models like Ernie or Bert are currently used in many applications. These models come with a set of pre-trained weights typically obtained in unsupervised/self-supervised modality on a huge amount of data. After that, they are fine-tuned on a specific task. Applications then use these models for inference, and often some additional constraints apply, like low power-budget or low latency between input and output. The main avenue to meet these additional requirements for the inference settings is to use low precision computation (e.g. INT8 rather than FP32), but this comes with a cost of deteriorating the functional performance (e.g. accuracy) of the model. Some approaches have been developed to tackle the problem and go beyond the limitations of the PTO (Post-Training Quantization), more specifically the QAT (Quantization Aware Training, see [4]) is a procedure that interferes with the training process in order to make it affected (or simply disturbed) by the quantization phase during the training itself. Besides QAT, recently Intel-Habana Labs have proposed an additional and more direct way to make the training results more robust to subsequent quantization which uses a regularizer, therefore changing the loss function that drives the training procedure. But their proposal does not work out-of-the-box for pre-trained models like Ernie, for example. In this short paper we show why this is not happening (for the Ernie case) and we propose a very basic way to deal with it, sharing as well some initial results (increase in final INT8 accuracy) that might be of interest to practitioners willing to use Ernie in their applications, in low precision regime.
【7】 Multi-Reference Alignment for sparse signals, Uniform Uncertainty Principles and the Beltway Problem
Authors: Subhro Ghosh, Philippe Rigollet
Affiliations: National University of Singapore
Link: https://arxiv.org/abs/2106.12996
Abstract: Motivated by cutting-edge applications like cryo-electron microscopy (cryo-EM), the Multi-Reference Alignment (MRA) model entails the learning of an unknown signal from repeated measurements of its images under the latent action of a group of isometries and additive noise of magnitude $\sigma$. Despite significant interest, a clear picture for understanding rates of estimation in this model has emerged only recently, particularly in the high-noise regime $\sigma \gg 1$ that is highly relevant in applications. Recent investigations have revealed a remarkable asymptotic sample complexity of order $\sigma^6$ for certain signals whose Fourier transforms have full support, in stark contrast to the traditional $\sigma^2$ that arise in regular models. Often prohibitively large in practice, these results have prompted the investigation of variations around the MRA model where better sample complexity may be achieved. In this paper, we show that \emph{sparse} signals exhibit an intermediate $\sigma^4$ sample complexity even in the classical MRA model. Our results explore and exploit connections of the MRA estimation problem with two classical topics in applied mathematics: the \textit{beltway problem} from combinatorial optimization, and \textit{uniform uncertainty principles} from harmonic analysis.
【8】 Regression Trees and Ensembles for Cumulative Incidence Functions
Authors: Youngjoo Cho, Annette M. Molinaro, Chen Hu, Robert L. Strawderman
Link: https://arxiv.org/abs/2106.12948
Abstract: The use of cumulative incidence functions for characterizing the risk of one type of event in the presence of others has become increasingly popular over the past decade. The problems of modeling, estimation and inference have been treated using parametric, nonparametric and semi-parametric methods. Efforts to develop suitable extensions of machine learning methods, such as regression trees and related ensemble methods, have begun comparatively recently. In this paper, we propose a novel approach to estimating cumulative incidence curves in a competing risks setting using regression trees and associated ensemble estimators. The proposed methods employ augmented estimators of the Brier score risk as the primary basis for building and pruning trees, and lead to methods that are easily implemented using existing R packages. Data from the Radiation Therapy Oncology Group (trial 9410) is used to illustrate these new methods.
【9】 Fundamental limits for learning hidden Markov model parameters
Authors: Kweku Abraham, Zacharie Naulet, Elisabeth Gassiat
Affiliations: Université Paris-Saclay, CNRS, Laboratoire de Mathématiques d'Orsay, Orsay, France
Link: https://arxiv.org/abs/2106.12936
Abstract: We study the frontier between learnable and unlearnable hidden Markov models (HMMs). HMMs are flexible tools for clustering dependent data coming from unknown populations. The model parameters are known to be identifiable as soon as the clusters are distinct and the hidden chain is ergodic with a full rank transition matrix. In the limit as any one of these conditions fails, it becomes impossible to identify parameters. For a chain with two hidden states we prove nonasymptotic minimax upper and lower bounds, matching up to constants, which exhibit thresholds at which the parameters become learnable.
【10】 Identifying Hidden Visits from Sparse Call Detail Record Data
Authors: Zhan Zhao, Haris N. Koutsopoulos, Jinhua Zhao
Affiliations: Department of Urban Planning and Design, The University of Hong Kong, Hong Kong SAR, China; Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, United States; Department of Urban Studies and Planning
Link: https://arxiv.org/abs/2106.12885
Abstract: Despite a large body of literature on trip inference using call detail record (CDR) data, a fundamental understanding of their limitations is lacking. In particular, because of the sparse nature of CDR data, users may travel to a location without being revealed in the data, which we refer to as a "hidden visit". The existence of hidden visits hinders our ability to extract reliable information about human mobility and travel behavior from CDR data. In this study, we propose a data fusion approach to obtain labeled data for statistical inference of hidden visits. In the absence of complementary data, this can be accomplished by extracting labeled observations from more granular cellular data access records, and extracting features from voice call and text messaging records. The proposed approach is demonstrated using a real-world CDR dataset of 3 million users from a large Chinese city. Logistic regression, support vector machine, random forest, and gradient boosting are used to infer whether a hidden visit exists during a displacement observed from CDR data. The test results show significant improvement over the naive no-hidden-visit rule, which is an implicit assumption adopted by most existing studies. Based on the proposed model, we estimate that over 10% of the displacements extracted from CDR data involve hidden visits. The proposed data fusion method offers a systematic statistical approach to inferring individual mobility patterns based on telecommunication records.
【11】 Bootstrap confidence intervals for multiple change points based on moving sum procedures
Authors: Haeran Cho, Claudia Kirch
Link: https://arxiv.org/abs/2106.12844
Abstract: In this paper, we address the problem of quantifying uncertainty about the locations of multiple change points. We first establish the asymptotic distribution of the change point estimators obtained as the local maximisers of moving sum statistics, where the limit distributions differ depending on whether the corresponding size of changes is local, i.e. tends to zero as the sample size increases, or fixed. Then, we propose a bootstrap procedure for confidence interval generation which adapts to the unknown size of changes and guarantees asymptotic validity both for local and fixed changes. Simulation studies show good performance of the proposed bootstrap procedure, and we provide some discussions about how it can be extended to serially dependent errors.
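The change point estimators in this abstract are local maximisers of a moving sum (MOSUM) statistic. The sketch below shows one common form of that statistic for a change in mean, as an illustration only: the bandwidth `G`, the known noise level `sigma` and the function name are assumptions, and the bootstrap confidence intervals proposed in the paper are not implemented.

```python
import math


def mosum(x, G, sigma=1.0):
    """Return {k: T(k)} for k = G, ..., len(x) - G.

    T(k) is the absolute difference between the sum of the G observations
    x[k], ..., x[k+G-1] and the sum of the G observations x[k-G], ..., x[k-1],
    scaled by sigma * sqrt(2 * G).
    """
    n = len(x)
    stats = {}
    for k in range(G, n - G + 1):
        right = sum(x[k:k + G])
        left = sum(x[k - G:k])
        stats[k] = abs(right - left) / (sigma * math.sqrt(2 * G))
    return stats


# Noise-free toy series: the mean jumps from 0 to 1 after 50 observations.
x = [0.0] * 50 + [1.0] * 50
stats = mosum(x, G=10)
k_hat = max(stats, key=stats.get)  # location of the largest statistic
print(k_hat, stats[k_hat])
```

On noisy data with several changes, one would instead look for local maximisers among the points where the statistic exceeds a threshold; the choice of threshold and bandwidth is outside the scope of this sketch.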
【12】 Three rates of convergence or separation via U-statistics in a dependent framework
Authors: Quentin Duchemin, Yohann De Castro, Claire Lacour
Affiliations: LAMA, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France; Institut Camille Jordan, École Centrale de Lyon, Lyon, France
Comments: This submission completes the submission arXiv:2011.11435v1 which has been split into two pieces: concentration inequalities arXiv:2011.11435v3 and further versions, and this submission about three applications
Link: https://arxiv.org/abs/2106.12796
Abstract: Despite the ubiquity of U-statistics in modern Probability and Statistics, their non-asymptotic analysis in a dependent framework may have been overlooked. In a recent work, a new concentration inequality for U-statistics of order two for uniformly ergodic Markov chains has been proved. In this paper, we put this theoretical breakthrough into action by pushing further the current state of knowledge in three different active fields of research. First, we establish a new exponential inequality for the estimation of spectra of trace class integral operators with MCMC methods. The novelty is that this result holds for kernels with positive and negative eigenvalues, which is new as far as we know. In addition, we investigate generalization performance of online algorithms working with pairwise loss functions and Markov chain samples. We provide an online-to-batch conversion result by showing how we can extract a low risk hypothesis from the sequence of hypotheses generated by any online learner. We finally give a non-asymptotic analysis of a goodness-of-fit test on the density of the invariant measure of a Markov chain. We identify some classes of alternatives over which our test based on the $L_2$ distance has a prescribed power.
【13】 Item Response Thresholds Models
Authors: Gerhard Tutz
Affiliations: Ludwig-Maximilians-Universität München, Akademiestraße , München
Link: https://arxiv.org/abs/2106.12784
Abstract: A comprehensive class of models is proposed that can be used for continuous, binary, ordered categorical and count type responses. The difficulty of items is described by difficulty functions, which replace the item difficulty parameters that are typically used in item response models. They crucially determine the response distribution and make the models very flexible with regard to the range of distributions that are covered. The model class contains several widely used models as the binary Rasch model and the graded response model as special cases, allows for simplifications, and offers a distribution free alternative to count type items. A major strength of the models is that they can be used for mixed item formats, when different types of items are combined to measure abilities or attitudes. It is an immediate consequence of the comprehensive modeling approach that allows that difficulty functions automatically adapt to the response distribution. Basic properties of the model class are shown. Several real data sets are used to illustrate the flexibility of the models.
【14】 Two-sample tests for repeated measurements of histogram objects with applications to wearable device data
Authors: Jingru Zhang, Kathleen R. Merikangas, Hongzhe Li, Haochang Shou
Affiliations: University of Pennsylvania, National Institutes of Health
Link: https://arxiv.org/abs/2106.12768
Abstract: Repeated observations have become increasingly common in biomedical research and longitudinal studies. For instance, wearable sensor devices are deployed to continuously track physiological and biological signals from each individual over multiple days. It remains of great interest to appropriately evaluate how the daily distribution of biosignals might differ across disease groups and demographics. Hence these data could be formulated as multivariate complex object data such as probability densities, histograms, and observations on a tree. Traditional statistical methods would often fail to apply as they are sampled from an arbitrary non-Euclidean metric space. In this paper, we propose novel non-parametric graph-based two-sample tests for object data with repeated measures. A set of test statistics are proposed to capture various possible alternatives. We derive their asymptotic null distributions under the permutation null. These tests exhibit substantial power improvements over the existing methods while controlling the type I errors under finite samples as shown through simulation studies. The proposed tests are demonstrated to provide additional insights on the location, inter- and intra-individual variability of the daily physical activity distributions in a sample of studies for mood disorders.
【15】 Label Disentanglement in Partition-based Extreme Multilabel Classification
Authors: Xuanqing Liu, Wei-Cheng Chang, Hsiang-Fu Yu, Cho-Jui Hsieh, Inderjit S. Dhillon
Affiliations: UCLA, Amazon, UT Austin
Link: https://arxiv.org/abs/2106.12751
Abstract: Partition-based methods are increasingly used in extreme multi-label classification (XMC) problems due to their scalability to large output spaces (e.g., millions or more). However, existing methods partition the large label space into mutually exclusive clusters, which is sub-optimal when labels have multi-modality and rich semantics. For instance, the label "Apple" can be the fruit or the brand name, which leads to the following research question: can we disentangle these multi-modal labels with non-exclusive clustering tailored for downstream XMC tasks? In this paper, we show that the label assignment problem in partition-based XMC can be formulated as an optimization problem, with the objective of maximizing precision rates. This leads to an efficient algorithm to form flexible and overlapped label clusters, and a method that can alternately optimize the cluster assignments and the model parameters for partition-based XMC. Experimental results on synthetic and real datasets show that our method can successfully disentangle multi-modal labels, leading to state-of-the-art (SOTA) results on four XMC benchmarks.
【16】 Multiple Testing for Composite Null with FDR Control Guarantee
Authors: Ran Dai, Cheng Zheng
Link: https://arxiv.org/abs/2106.12719
Abstract: False discovery rate (FDR) controlling procedures provide important statistical guarantees for reproducibility in signal identification experiments with multiple hypotheses testing. In many recent applications, the same set of candidate features are studied in multiple independent experiments. For example, experiments repeated at different facilities and with different cohorts, and association studies with the same candidate features but different outcomes of interest. These studies provide us opportunities to identify signals by considering the experiments jointly. We study the question of how to provide reproducibility guarantees when we test composite null hypotheses on multiple features. Specifically, we test the unions of the null hypotheses from multiple experiments. We present a knockoff-based variable selection method to identify mutual signals from multiple independent experiments, with a finite sample size FDR control guarantee. We demonstrate the performance of this method with numerical studies and applications in analyzing crime data and TCGA data.
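For readers unfamiliar with knockoff-based selection, the sketch below shows the generic single-experiment "knockoff+" selection rule, in which features are chosen by thresholding signed importance statistics `W`. It is an illustration under stated assumptions: the statistics are taken as given, the function names are hypothetical, and the multi-experiment procedure for composite nulls proposed in the paper is not reproduced here.

```python
def knockoff_plus_threshold(W, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns infinity when no such t exists (nothing is selected)."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(W, q):
    """Indices of features whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]


# Toy statistics: large positive values suggest signal, negative values
# indicate that the knockoff copy looked more important than the feature.
W = [5, 4, 3, 2, -1, 6, 7, 8, 9, 10]
print(knockoff_plus_threshold(W, q=0.2))
print(knockoff_select(W, q=0.2))
```

The count of negative statistics acts as an estimate of the number of false positives among the selected features, which is what ties the threshold to the target FDR level `q`.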
【17】 Optimal estimation of coarse structural nested mean models with application to initiating ART in HIV infected patients
Authors: Judith J. Lok
Affiliations: Department of Mathematics and Statistics, Boston University, Cummington Mall, Boston, Massachusetts, U.S.A.
Link: https://arxiv.org/abs/2106.12677
Abstract: Coarse structural nested mean models are used to estimate treatment effects from longitudinal observational data. Coarse structural nested mean models lead to a large class of estimators. It turns out that estimates and standard errors may differ considerably within this class. We prove that, under additional assumptions, there exists an explicit solution for the optimal estimator within the class of coarse structural nested mean models. Moreover, we show that even if the additional assumptions do not hold, this optimal estimator is doubly-robust: it is consistent and asymptotically normal not only if the model for treatment initiation is correct, but also if a certain outcome-regression model is correct. We compare the optimal estimator to some naive choices within the class of coarse structural nested mean models in a simulation study. Furthermore, we apply the optimal and naive estimators to study how the CD4 count increase due to one year of antiretroviral treatment (ART) depends on the time between HIV infection and ART initiation in recently infected HIV infected patients. Both in the simulation study and in the application, the use of optimal estimators leads to substantial increases in precision.
【18】 Black Box Variational Bayes Model Averaging
Authors: Vojtech Kejzlar, Shrijita Bhattacharya, Mookyong Son, Tapabrata Maiti
Affiliations: Department of Mathematics and Statistics, Skidmore College; Department of Statistics and Probability, Michigan State University
Link: https://arxiv.org/abs/2106.12652
Abstract: For many decades now, Bayesian Model Averaging (BMA) has been a popular framework to systematically account for model uncertainty that arises in situations when multiple competing models are available to describe the same or similar physical process. The implementation of this framework, however, comes with a multitude of practical challenges including posterior approximation via Markov Chain Monte Carlo and numerical integration. We present a Variational Bayes Inference approach to BMA as a viable alternative to the standard solutions which avoids many of the aforementioned pitfalls. The proposed method is 'black box' in the sense that it can be readily applied to many models with little to no model-specific derivation. We illustrate the utility of our variational approach on a suite of standard examples and discuss all the necessary implementation details. Fully documented Python code with all the examples is provided as well.
【19】 Multi-objective Asynchronous Successive Halving
Authors: Robin Schmucker, Michele Donini, Muhammad Bilal Zafar, David Salinas, Cédric Archambeau
Affiliations: Machine Learning Department, Carnegie Mellon University; Amazon, Berlin, Germany
Link: https://arxiv.org/abs/2106.12639
Abstract: Hyperparameter optimization (HPO) is increasingly used to automatically tune the predictive performance (e.g., accuracy) of machine learning models. However, in a plethora of real-world applications, accuracy is only one of the multiple -- often conflicting -- performance criteria, necessitating the adoption of a multi-objective (MO) perspective. While the literature on MO optimization is rich, few prior studies have focused on HPO. In this paper, we propose algorithms that extend asynchronous successive halving (ASHA) to the MO setting. Considering multiple evaluation metrics, we assess the performance of these methods on three real world tasks: (i) Neural architecture search, (ii) algorithmic fairness and (iii) language model optimization. Our empirical analysis shows that MO ASHA enables to perform MO HPO at scale. Further, we observe that taking the entire Pareto front into account for candidate selection consistently outperforms multi-fidelity HPO based on MO scalarization in terms of wall-clock time. Our algorithms (to be open-sourced) establish new baselines for future research in the area.
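The abstract contrasts selection based on the whole Pareto front with scalarization of the objectives. The sketch below shows only the basic building block, extracting the non-dominated configurations from a set of objective vectors, assuming every objective is to be minimised. The function names and the toy numbers are hypothetical, and this is not the authors' candidate selection rule or their ASHA extension.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better
    in at least one (all objectives are minimised)."""
    no_worse = all(ai <= bi for ai, bi in zip(a, b))
    strictly_better = any(ai < bi for ai, bi in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each tuple is (validation error, unfairness) for one hypothetical configuration.
results = [(1, 5), (2, 2), (3, 3), (5, 1)]
print(pareto_front(results))  # (3, 3) is dominated by (2, 2) and dropped
```

A scalarized search would instead collapse each tuple into a single number, for example a weighted sum, and keep only the best value, so configurations that are optimal for other trade-offs can be discarded early.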
【20】 Sharp Convergence Rates for Empirical Optimal Transport with Smooth Costs 标题:具有光滑费用的经验最优运输问题的强收敛速度
Authors: Tudor Manole, Jonathan Niles-Weed Affiliations: Department of Statistics and Data Science, Carnegie Mellon University; Courant Institute of Mathematical Sciences, New York University Link: https://arxiv.org/abs/2106.13181 Abstract: We revisit the question of characterizing the convergence rate of plug-in estimators of optimal transport costs. It is well known that an empirical measure comprising independent samples from an absolutely continuous distribution on $\mathbb{R}^d$ converges to that distribution at the rate $n^{-1/d}$ in Wasserstein distance, which can be used to prove that plug-in estimators of many optimal transport costs converge at this same rate. However, we show that when the cost is smooth, this analysis is loose: plug-in estimators based on empirical measures converge quadratically faster, at the rate $n^{-2/d}$. As a corollary, we show that the Wasserstein distance between two distributions is significantly easier to estimate when the measures are far apart. We also prove lower bounds, showing not only that our analysis of the plug-in estimator is tight, but also that no other estimator can enjoy significantly faster rates of convergence uniformly over all pairs of measures. Our proofs rely on empirical process theory arguments based on tight control of $L^2$ covering numbers for locally Lipschitz and semi-concave functions. As a byproduct of our proofs, we derive $L^\infty$ estimates on the displacement induced by the optimal coupling between any two measures satisfying suitable moment conditions, for a wide range of cost functions.
【21】 The leading coefficient of Lascoux polynomials 标题:LasCoux多项式的先导系数
Authors: Alessio Borzí, Xiangying Chen, Harshit J. Motwani, Lorenzo Venturello, Martin Vodička Link: https://arxiv.org/abs/2106.13104 Abstract: Lascoux polynomials have recently been introduced to prove polynomiality of the maximum-likelihood degree of linear concentration models. We find the leading coefficient of the Lascoux polynomials (type C) and their generalizations to the case of general matrices (type A) and skew-symmetric matrices (type D). In particular, we determine the degrees of such polynomials. As an application, we find the degree of the polynomial $\delta(m,n,n-s)$ of the algebraic degree of semidefinite programming, and when $s=1$ we find its leading coefficient for types C, A and D.
【22】 Understanding the Spread of COVID-19 Epidemic: A Spatio-Temporal Point Process View 标题:认识冠状病毒流行的传播:时空点过程观
Authors: Shuang Li, Lu Wang, Xinyun Chen, Yixiang Fang, Yan Song Affiliations: The Chinese University of Hong Kong Link: https://arxiv.org/abs/2106.13097 Abstract: Since the first coronavirus case was identified in the U.S. on Jan. 21, more than 1 million people in the U.S. have confirmed cases of COVID-19. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states in the U.S. and has exhibited evolutionary clustering and complex triggering patterns. It is essential to understand the complex, intertwined spatio-temporal propagation of this disease so that accurate prediction or smart external intervention can be carried out. In this paper, we model the propagation of COVID-19 as spatio-temporal point processes and propose a generative and intensity-free model to track the spread of the disease. We further adopt a generative adversarial imitation learning framework to learn the model parameters. In comparison with traditional likelihood-based learning methods, this imitation learning framework does not need to prespecify an intensity function, which alleviates model misspecification. Moreover, the adversarial learning procedure bypasses the difficult-to-evaluate integral involved in the likelihood evaluation, which makes the model inference more scalable with the data and variables. We showcase the dynamic learning performance on the COVID-19 confirmed cases in the U.S. and evaluate the social distancing policy based on the learned generative model.
【23】 Unsupervised Learning of Depth and Depth-of-Field Effect from Natural Images with Aperture Rendering Generative Adversarial Networks 标题:基于孔径绘制生成性对抗网络的自然图像景深和景深效应的无监督学习
Authors: Takuhiro Kaneko Affiliations: NTT Communication Science Laboratories, NTT Corporation Comments: Accepted to CVPR 2021 (Oral). Project page: this https URL Link: https://arxiv.org/abs/2106.13041 Abstract: Understanding the 3D world from 2D projected natural images is a fundamental challenge in computer vision and graphics. Recently, an unsupervised learning approach has garnered considerable attention owing to its advantages in data collection. However, to mitigate training limitations, typical methods need to impose assumptions for viewpoint distribution (e.g., a dataset containing various viewpoint images) or object shape (e.g., symmetric objects). These assumptions often restrict applications; for instance, the application to non-rigid objects or images captured from similar viewpoints (e.g., flower or bird images) remains a challenge. To complement these approaches, we propose aperture rendering generative adversarial networks (AR-GANs), which equip aperture rendering on top of GANs, and adopt focus cues to learn the depth and depth-of-field (DoF) effect of unlabeled natural images. To address the ambiguities triggered by the unsupervised setting (i.e., ambiguities between smooth texture and out-of-focus blurs, and between foreground and background blurs), we develop DoF mixture learning, which enables the generator to learn the real image distribution while generating diverse DoF images. In addition, we devise a center focus prior to guide the learning direction. In the experiments, we demonstrate the effectiveness of AR-GANs in various datasets, such as flower, bird, and face images, demonstrate their portability by incorporating them into other 3D representation learning GANs, and validate their applicability in shallow DoF rendering.
【24】 A Stochastic Sequential Quadratic Optimization Algorithm for Nonlinear Equality Constrained Optimization with Rank-Deficient Jacobians 标题:降秩雅可比非线性等式约束优化的随机序列二次优化算法
Authors: Albert S. Berahas, Frank E. Curtis, Michael J. O'Neill, Daniel P. Robinson Affiliations: Department of Industrial and Operations Engineering, University of Michigan; Department of Industrial and Systems Engineering, Lehigh University Link: https://arxiv.org/abs/2106.13015 Abstract: A sequential quadratic optimization algorithm is proposed for solving smooth nonlinear equality constrained optimization problems in which the objective function is defined by an expectation of a stochastic function. The algorithmic structure of the proposed method is based on a step decomposition strategy that is known in the literature to be widely effective in practice, wherein each search direction is computed as the sum of a normal step (toward linearized feasibility) and a tangential step (toward objective decrease in the null space of the constraint Jacobian). However, the proposed method differs from others in the literature in that it both allows the use of stochastic objective gradient estimates and possesses convergence guarantees even in the setting in which the constraint Jacobians may be rank deficient. The results of numerical experiments demonstrate that the algorithm offers superior performance when compared to popular alternatives.
【25】 Graphlets in multilayer networks 标题:多层网络中的Graphlet
Authors: Sallamari Sallmen, Tarmo Nurmi, Mikko Kivelä Comments: 58 pages, 45 figures Link: https://arxiv.org/abs/2106.13011 Abstract: Representing various networked data as multiplex networks, networks of networks and other multilayer networks can reveal completely new types of structures in these systems. We introduce a general and principled graphlet framework for multilayer networks which allows one to break any multilayer network into small multilayered building blocks. These multilayer graphlets can be either analyzed themselves or used for tasks such as comparing different systems. The method is flexible in terms of multilayer isomorphism, automorphism orbit definition, and the type of multilayer network. We illustrate our method for multiplex networks and show how it can be used to distinguish networks produced with multiple models from each other in an unsupervised way. In addition, we include an automatic way of generating the hundreds of dependency equations between the orbit counts needed to remove redundant orbit counts. The framework introduced here allows one to analyze multilayer networks with versatile semantics, and these methods can thus be used to analyze the structural building blocks of myriad multilayer networks.
【26】 Bayesian Optimization with High-Dimensional Outputs 标题:高维输出的贝叶斯优化
Authors: Wesley J. Maddox, Maximilian Balandat, Andrew Gordon Wilson, Eytan Bakshy Affiliations: New York University; Facebook Link: https://arxiv.org/abs/2106.12997 Abstract: Bayesian Optimization is a sample-efficient black-box optimization procedure that is typically applied to problems with a small number of independent objectives. However, in practice we often wish to optimize objectives defined over many correlated outcomes (or "tasks"). For example, scientists may want to optimize the coverage of a cell tower network across a dense grid of locations. Similarly, engineers may seek to balance the performance of a robot across dozens of different environments via constrained or robust optimization. However, the Gaussian Process (GP) models typically used as probabilistic surrogates for multi-task Bayesian Optimization scale poorly with the number of outcomes, greatly limiting applicability. We devise an efficient technique for exact multi-task GP sampling that combines exploiting Kronecker structure in the covariance matrices with Matheron's identity, allowing us to perform Bayesian Optimization using exact multi-task GP models with tens of thousands of correlated outputs. In doing so, we achieve substantial improvements in sample efficiency compared to existing approaches that only model aggregate functions of the outcomes. We demonstrate how this unlocks a new class of applications for Bayesian Optimization across a range of tasks in science and engineering, including optimizing interference patterns of an optical interferometer with more than 65,000 outputs.
【27】 Relationship between pulmonary nodule malignancy and surrounding pleurae, airways and vessels: a quantitative study using the public LIDC-IDRI dataset 标题:肺结节恶性肿瘤与周围胸膜、气道和血管的关系:使用公共LIDC-IDRI数据集的定量研究
Authors: Yulei Qin, Yun Gu, Hanxiao Zhang, Jie Yang, Lihui Wang, Feng Yao, Yue-Min Zhu Affiliations: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China; CREATIS, INSA Lyon, CNRS UMR, INSERM U, Université de Lyon, Villeurbanne, France Comments: 33 pages, 3 figures, submitted for review Link: https://arxiv.org/abs/2106.12991 Abstract: The aim is to investigate whether the pleurae, airways and vessels surrounding a nodule on non-contrast computed tomography (CT) can discriminate benign and malignant pulmonary nodules. The LIDC-IDRI dataset, one of the largest publicly available CT databases, was used for the study. A total of 1556 nodules from 694 patients were involved in the statistical analysis, where nodules with average scorings <3 and >3 were respectively denoted as benign and malignant. In addition, 339 nodules from 113 patients with diagnosis ground truth were independently evaluated. Computer algorithms were developed to segment pulmonary structures and quantify the distances to the pleural surface, airways and vessels, as well as the count and normalized volume of airways and vessels near a nodule. Odds ratio (OR) and chi-square ($\chi^2$) testing were performed to demonstrate the correlation between features of surrounding structures and nodule malignancy. A non-parametric receiver operating characteristic (ROC) analysis was conducted in logistic regression to evaluate the discrimination ability of each structure. For the benign and malignant groups, the average distances from nodules to the pleural surface, airways and vessels are respectively (6.56, 5.19), (37.08, 26.43) and (1.42, 1.07) mm. The correlations between nodules and the counts of airways and vessels that contact or project towards nodules are respectively (OR=22.96, $\chi^2$=105.04) and (OR=7.06, $\chi^2$=290.11). The correlations between nodules and the volumes of airways and vessels are (OR=9.19, $\chi^2$=159.02) and (OR=2.29, $\chi^2$=55.89). The areas under the curve (AUCs) for pleurae, airways and vessels are respectively 0.5202, 0.6943 and 0.6529. Our results show that malignant nodules are often surrounded by more pulmonary structures compared with benign ones, suggesting that features of these structures could be viewed as lung cancer biomarkers.
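The abstract above reports each association as an odds ratio together with a chi-square statistic. For readers unfamiliar with how the two numbers relate to a 2x2 contingency table, the following sketch computes both. It assumes Pearson's chi-square without continuity correction (the abstract does not say which variant was used), the function name is hypothetical, and the counts are made up for illustration; they are not taken from the study.

```python
from typing import Tuple


def odds_ratio_and_chi2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Odds ratio and Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Rows: feature present / absent. Columns: malignant / benign.
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi2


# Made-up counts for illustration only; they are not from the LIDC-IDRI analysis.
or_value, chi2_value = odds_ratio_and_chi2(30, 10, 20, 40)
```

With one degree of freedom, a chi-square value above roughly 3.84 corresponds to significance at the 5% level, so values in the hundreds, as reported in the abstract, indicate very strong associations.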
【28】 Fund2Vec: Mutual Funds Similarity using Graph Learning 标题:Fund2Vec:基于图学习的共同基金相似度
Authors: Vipul Satone, Dhruv Desai, Dhagash Mehta Affiliations: The Vanguard Group, Inc. Comments: 2-column format, 8 pages, 8 figures, 5 tables Link: https://arxiv.org/abs/2106.12987 Abstract: Identifying similar mutual funds with respect to the underlying portfolios has found many applications in financial services, including fund recommender systems, competitor analysis, portfolio analytics, and marketing and sales. The traditional methods are either qualitative, and hence prone to biases and often not reproducible, or are known not to capture all the nuances (non-linearities) among the portfolios from the raw data. We propose a radically new approach to identify similar funds based on the weighted bipartite network representation of funds and their underlying assets data, using a sophisticated machine learning method called Node2Vec, which learns an embedded low-dimensional representation of the network. We call the embedding Fund2Vec. Ours is the first ever study of the weighted bipartite network representation of the funds-assets network in its original form that identifies structural similarity among portfolios as opposed to merely portfolio overlaps.
【29】 Tensor networks for unsupervised machine learning 标题:张量网络在无监督机器学习中的应用
Authors: Jing Liu, Sujie Li, Jiang Zhang, Pan Zhang Affiliations: School of Systems Science, Beijing Normal University; CAS Key Laboratory for Theoretical Physics, Institute of Theoretical Physics; School of Physical Sciences, University of Chinese Academy of Sciences, Beijing, China Comments: 12 pages, 9 figures, for Github page, see this https URL Link: https://arxiv.org/abs/2106.12974 Abstract: Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, much interest has been attracted to developing learning models based on tensor networks, which have the advantages of a theoretical understanding of the expressive power using entanglement properties, and of serving as a bridge connecting classical computation and quantum computation. Despite the great potential, however, existing tensor-network-based unsupervised models only work as a proof of principle, as their performance is much worse than that of standard models such as restricted Boltzmann machines and neural networks. In this work, we present the Autoregressive Matrix Product States (AMPS), a tensor-network-based model combining the matrix product states from quantum many-body physics and the autoregressive models from machine learning. The model enjoys exact calculation of normalized probability and unbiased sampling, as well as a clear theoretical understanding of expressive power. We demonstrate the performance of our model using two applications: generative modeling on synthetic and real-world data, and reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor-network-based models and the restricted Boltzmann machines, and is competitive with the state-of-the-art neural network models.
【30】 Partial Wasserstein and Maximum Mean Discrepancy distances for bridging the gap between outlier detection and drift detection 标题:用于弥合孤立点检测和漂移检测之间差距的部分Wasserstein和最大平均差异距离
Authors: Thomas Viehmann Link: https://arxiv.org/abs/2106.12893 Abstract: With the rise of machine learning and deep learning based applications in practice, monitoring, i.e. verifying that these operate within specification, has become an important practical problem. An important aspect of this monitoring is to check whether the inputs (or intermediates) have strayed from the distribution they were validated for, which can void the performance assurances obtained during testing. There are two common approaches for this. The perhaps more classical one is outlier detection or novelty detection, where, for a single input, we ask whether it is an outlier, i.e. exceedingly unlikely to have originated from a reference distribution. The second, perhaps more recent, approach is to consider a larger number of inputs and compare their distribution to a reference distribution (e.g. sampled during testing). This is done under the label drift detection. In this work, we bridge the gap between outlier detection and drift detection by comparing a given number of inputs to an automatically chosen part of the reference distribution.
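For context on the drift-detection side of the abstract above, the sketch below computes the standard (full, not partial) squared Maximum Mean Discrepancy between two one-dimensional samples with a Gaussian kernel. It uses the biased V-statistic estimator; the bandwidth, the function names and the sample values are assumptions of this illustration. The partial variants proposed in the paper, which compare against an automatically chosen part of the reference sample, are not implemented here.

```python
import math
from typing import Sequence


def rbf(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_biased(xs: Sequence[float], ys: Sequence[float], bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""

    def mean_kernel(us: Sequence[float], vs: Sequence[float]) -> float:
        total = sum(rbf(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical reference sample, an identical batch, and a batch shifted by 3.
reference = [-1.0, -0.5, 0.0, 0.5, 1.0]
same_batch = list(reference)
shifted_batch = [x + 3.0 for x in reference]

no_drift = mmd2_biased(reference, same_batch)
drift = mmd2_biased(reference, shifted_batch)
```

In a monitoring setting, the statistic for an incoming batch would typically be compared against a threshold calibrated on held-out reference data, for example by a permutation test.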
【31】 COVID-19 cases prediction using regression and novel SSM model for non-converged countries 标题:基于回归和新SSM模型的非融合国家冠状病毒病例预测
Authors: Tushar Sarkar, Umang Patel, Rupali Patil Link: https://arxiv.org/abs/2106.12888 Abstract: Predicting the number of new suspected or confirmed cases of novel coronavirus disease 2019 (COVID-19) is critical for the prevention and control of the COVID-19 outbreak. Data on new suspected COVID-19 cases were gathered from 20 January 2020 to 21 July 2020. We filtered out the countries which are converging and used those for training the network. We used SARIMAX and linear regression models to predict new suspected COVID-19 cases for the countries which have not yet converged. We predict the curve of non-converged countries with the help of the proposed Statistical SARIMAX model (SSM). We present new data-analysis-based forecast results that can help governments plan their future activities and help clinical services be better prepared for what is to come. Our framework can predict peak coronavirus cases with an R-squared value of 0.986 using linear regression, and the decline of this pandemic at various levels for countries such as India, the US, and Brazil. We found that considering more countries for training degrades the prediction process, as constraints vary from nation to nation. Thus, we expect that the results presented in this work will help individuals better understand the possible course of this pandemic.
【32】 A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models 标题:一种用于训练机器学习模型去偏的近似最优算法
Authors: Ibrahim Alabdulmohsin, Mario Lucic Affiliations: Google Research, Brain Team, Zürich, Switzerland Comments: 21 pages, 5 figures Link: https://arxiv.org/abs/2106.12887 Abstract: We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms and modern DNN architectures, and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale, where post-processing is a natural and practical choice.
【33】 Constrained Classification and Policy Learning 标题:受限分类与策略学习
Authors: Toru Kitagawa, Shosei Sakaguchi, Aleksey Tetenov Link: https://arxiv.org/abs/2106.12886 Abstract: Modern machine learning approaches to classification, including AdaBoost, support vector machines, and deep neural networks, utilize surrogate loss techniques to circumvent the computational complexity of minimizing empirical classification risk. These techniques are also useful for causal policy learning problems, since estimation of individualized treatment rules can be cast as a weighted (cost-sensitive) classification problem. Consistency of the surrogate loss approaches studied in Zhang (2004) and Bartlett et al. (2006) crucially relies on the assumption of correct specification, meaning that the specified set of classifiers is rich enough to contain a first-best classifier. This assumption is, however, less credible when the set of classifiers is constrained by interpretability or fairness, leaving the applicability of surrogate loss based algorithms unknown in such second-best scenarios. This paper studies consistency of surrogate loss procedures under a constrained set of classifiers without assuming correct specification. We show that in the setting where the constraint restricts the classifier's prediction set only, hinge losses (i.e., $\ell_1$-support vector machines) are the only surrogate losses that preserve consistency in second-best scenarios. If the constraint additionally restricts the functional form of the classifier, consistency of a surrogate loss approach is not guaranteed even with hinge loss. We therefore characterize conditions for the constrained set of classifiers that can guarantee consistency of hinge risk minimizing classifiers. Exploiting our theoretical results, we develop robust and computationally attractive hinge loss based procedures for a monotone classification problem.
【34】 DCoM: A Deep Column Mapper for Semantic Data Type Detection 标题:DCOM:一种用于语义数据类型检测的深列映射器
Authors: Subhadip Maji, Swapna Sourav Rout, Sudeep Choudhary Affiliations: Optum Global Solutions, Bangalore, India Comments: 9 pages, 2 figures, 7 tables Link: https://arxiv.org/abs/2106.12871 Abstract: Detection of semantic data types is a crucial task in data science for automated data cleaning, schema matching, data discovery, semantic data type normalization and sensitive data identification. Existing methods include regular expression-based or dictionary lookup-based methods that are not robust to dirty or unseen data and can predict only a very small number of semantic data types. Existing machine learning methods extract a large number of engineered features from data and build a logistic regression, random forest or feedforward neural network for this purpose. In this paper, we introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types where, instead of extracting a large number of features from the data, we feed the raw values of columns (or instances) to the model as texts. We train DCoM on 686,765 data columns extracted from the VizNet corpus with 78 different semantic data types. DCoM outperforms other contemporary results by a significant margin on the same dataset.
【35】 The gig economy in Poland: evidence based on mobile big data 标题:波兰的零工经济:基于移动大数据的证据
Authors: Beręsewicz Maciej, Nikulin Dagmara, Szymkowiak Marcin, Wilak Kamil Affiliations: Poznań University of Economics and Business Comments: 44 pages, 20 figures Link: https://arxiv.org/abs/2106.12827 Abstract: In this article we address the question of how to measure the size and characteristics of the platform economy. We propose an approach that differs from sample surveys and is based on smartphone data, which are passively collected through programmatic systems as part of online marketing. In particular, in our study we focus on two types of services: food delivery (Bolt Courier, Takeaway, Glover, Wolt) and transport services (Bolt Driver, Free Now, iTaxi and Uber). Our results show that the platform economy in Poland is growing. In particular, with respect to food delivery and transportation services performed by means of applications, we observed a growing trend between January 2018 and December 2020. Taking into account the demographic structure of app users, our results confirm findings from past studies: the majority of platform workers are young men, but the age structure of app users is different for each of the two categories of services. Another surprising finding is that foreigners do not account for the majority of gig workers in Poland. When the number of platform workers is compared with corresponding working populations, the estimated share of active app users accounts for about 0.5-2% of working populations in the 9 largest Polish cities.
【36】 On the detection of low-rank signal in the presence of spatially uncorrelated noise: a frequency domain approach 标题:空间不相关噪声下的低秩信号检测:频域方法
Authors: Alexis Rosuel, Philippe Loubaton, Pascal Vallet, Xavier Mestre Link: https://arxiv.org/abs/2106.12815 Abstract: This paper analyzes the detection of an M-dimensional useful signal modeled as the output of an M × K MIMO filter driven by a K-dimensional white Gaussian noise, and corrupted by an M-dimensional Gaussian noise with mutually uncorrelated components. The study is focused on frequency domain test statistics based on the eigenvalues of an estimate of the spectral coherence matrix (SCM), obtained as a renormalization of the frequency-smoothed periodogram of the observed signal. If N denotes the sample size and B the smoothing span, it is proved that in the high-dimensional regime where M, B, N converge to infinity while K remains fixed, the SCM behaves as a certain correlated Wishart matrix. Exploiting well-known results on the behaviour of the eigenvalues of such matrices, it is deduced that the standard tests based on linear spectral statistics of the SCM fail to detect the presence of the useful signal in the high-dimensional regime. A new test based on the SCM, which is proved to be consistent, is also proposed, and its statistical performance is evaluated through numerical simulations.
【37】 Task-agnostic Continual Learning with Hybrid Probabilistic Models 标题:基于混合概率模型的任务不可知性持续学习
Authors: Polina Kirichenko, Mehrdad Farajtabar, Dushyant Rao, Balaji Lakshminarayanan, Nir Levine, Ang Li, Huiyi Hu, Andrew Gordon Wilson, Razvan Pascanu Affiliations: New York University; DeepMind; Google Research Link: https://arxiv.org/abs/2106.12772 Abstract: Learning new tasks continuously without forgetting on a constantly changing data distribution is essential for real-world problems but extremely challenging for modern deep learning. In this work we propose HCL, a Hybrid generative-discriminative approach to Continual Learning for classification. We model the distribution of each task and each class with a normalizing flow. The flow is used to learn the data distribution, perform classification, identify task changes, and avoid forgetting, all leveraging the invertibility and exact likelihood which are uniquely enabled by the normalizing flow model. We use the generative capabilities of the flow to avoid catastrophic forgetting through generative replay and a novel functional regularization technique. For task identification, we use state-of-the-art anomaly detection techniques based on measuring the typicality of the model's statistics. We demonstrate the strong performance of HCL on a range of continual learning benchmarks such as split-MNIST, split-CIFAR, and SVHN-MNIST.
【38】 Factors affecting the COVID-19 risk in the US counties: an innovative approach by combining unsupervised and supervised learning
Authors: Samira Ziyadidegan, Moein Razavi, Homa Pesarakli, Amir Hossein Javid, Madhav Erraguntla Affiliations: Department of Computer Science and Engineering, Texas A&M University, College Station, TX; Department of Industrial and Systems Engineering, Texas A&M University, College Station, TX Comments: 15 pages, 8 figures, 5 tables Link: https://arxiv.org/abs/2106.12766 Abstract: The COVID-19 disease spreads swiftly, and nearly three months after the first positive case was confirmed in China, the coronavirus started to spread all over the United States. Some states and counties reported high numbers of positive cases and deaths, while others reported lower COVID-19-related cases and mortality. In this paper, the factors that could affect the risk of COVID-19 infection and mortality are analyzed at the county level. An innovative method combining K-means clustering and several classification models is used to determine the most critical factors. Results show that mean temperature, percent of people below poverty, percent of adults with obesity, air pressure, population density, wind speed, longitude, and percent of uninsured people are the most significant attributes.
【39】 Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
Authors: Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam Affiliations: Georgia Institute of Technology, The University of Texas at Austin Link: https://arxiv.org/abs/2106.12729 Abstract: In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation (including multi-step off-policy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g., $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, Retrace$(\lambda)$, and $Q$-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds for these algorithms. In particular, we provide the first known finite-sample guarantees for $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, and Retrace$(\lambda)$, and improve the best known bounds for $Q$-trace in [19]. Moreover, we show the bias-variance trade-offs in each of these algorithms.
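The abstract's remark that off-policy TD-learning suffers from high variance due to the product of importance sampling ratios can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes, purely for illustration, state-independent target and behavior policies over two actions (the names `target` and `behavior` and their values are hypothetical). Each ratio then has mean 1 under the behavior policy, so the variance of an n-step product is E[ρ²]ⁿ − 1, which grows geometrically with n.

```python
from itertools import product


def second_moment_of_ratio(target, behavior):
    """E[rho^2] for rho = target(a) / behavior(a), with a drawn from behavior."""
    return sum(t * t / b for t, b in zip(target, behavior))


def variance_of_ratio_product(target, behavior, n_steps):
    """Exact variance of the product of n_steps i.i.d. importance sampling ratios.

    Each ratio has mean 1 under the behavior policy, so the product also has
    mean 1 and its variance is E[rho^2] ** n_steps - 1.
    """
    return second_moment_of_ratio(target, behavior) ** n_steps - 1.0


def brute_force_variance(target, behavior, n_steps):
    """Same quantity, computed by enumerating every action sequence."""
    mean = 0.0
    second = 0.0
    for seq in product(range(len(target)), repeat=n_steps):
        prob = 1.0   # probability of the sequence under the behavior policy
        ratio = 1.0  # product of importance sampling ratios along the sequence
        for a in seq:
            prob *= behavior[a]
            ratio *= target[a] / behavior[a]
        mean += prob * ratio
        second += prob * ratio * ratio
    return second - mean * mean


# Hypothetical policies over two actions (assumed values, for illustration only).
target = [0.9, 0.1]
behavior = [0.5, 0.5]
variances = [variance_of_ratio_product(target, behavior, n) for n in (1, 2, 5, 10)]
print(variances)
```

In this toy setting the variance rises from 0.64 for a single step to well over 100 for ten steps, which is the kind of growth that truncated or reweighted ratios in algorithms such as Retrace$(\lambda)$ are meant to control.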
【40】 Long short-term relevance learning
Authors: Bram van de Weg, Lars Greve, Bojana Rosic Affiliations: University of Twente Link: https://arxiv.org/abs/2106.12694 Abstract: To incorporate prior knowledge as well as measurement uncertainties into traditional long short-term memory (LSTM) neural networks, an efficient sparse Bayesian training algorithm is introduced to the network architecture. In contrast to the classical LSTM solution, the proposed scheme automatically determines relevant neural connections and adapts accordingly. Due to its flexibility, the new LSTM scheme is less prone to overfitting and hence can approximate time-dependent solutions using a smaller data set. On a structural nonlinear finite element application, we show that the self-regulating framework does not require prior knowledge of a suitable network architecture and size, while ensuring satisfactory accuracy at reasonable computational cost.
【41】 Fairness via Representation Neutralization
Authors: Mengnan Du, Subhabrata Mukherjee, Guanchu Wang, Ruixiang Tang, Ahmed Hassan Awadallah, Xia Hu Affiliations: Texas A&M University, Microsoft Research Link: https://arxiv.org/abs/2106.12674 Abstract: Existing bias mitigation methods for DNN models primarily work on learning debiased encoders. This process not only requires a lot of instance-level annotations for sensitive attributes, but also does not guarantee that all fairness-sensitive information has been removed from the encoder. To address these limitations, we explore the following research question: Can we reduce the discrimination of DNN models by only debiasing the classification head, even with biased representations as inputs? To this end, we propose a new mitigation technique, namely Representation Neutralization for Fairness (RNF), that achieves fairness by debiasing only the task-specific classification head of DNN models. Specifically, we leverage samples with the same ground-truth label but different sensitive attributes, and use their neutralized representations to train the classification head of the DNN model. The key idea of RNF is to discourage the classification head from capturing spurious correlations between fairness-sensitive information in encoder representations and specific class labels. To address low-resource settings with no access to sensitive attribute annotations, we leverage a bias-amplified model to generate proxy annotations for sensitive attributes. Experimental results over several benchmark datasets demonstrate that our RNF framework effectively reduces discrimination of DNN models with minimal degradation in task-specific performance.
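The pairing step described in this abstract (samples with the same ground-truth label but different sensitive attributes, combined into neutralized representations) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the abstract does not say how representations are neutralized, so the midpoint average used here is an assumption, and the function names `neutralize` and `neutralized_training_set` and the toy data are hypothetical.

```python
from collections import defaultdict


def neutralize(rep_a, rep_b, weight=0.5):
    """Convex combination of two representation vectors (midpoint by default).

    Assumption: averaging is one plausible way to "neutralize" a pair.
    """
    return [weight * a + (1.0 - weight) * b for a, b in zip(rep_a, rep_b)]


def neutralized_training_set(samples):
    """Pair samples that share a label but differ in sensitive attribute.

    `samples` is a list of (representation, label, sensitive_attribute) tuples.
    Returns (neutralized_representation, label) pairs on which a
    classification head could be trained.
    """
    by_label = defaultdict(list)
    for rep, label, sensitive in samples:
        by_label[label].append((rep, sensitive))

    pairs = []
    for label, group in by_label.items():
        for i, (rep_i, s_i) in enumerate(group):
            for rep_j, s_j in group[i + 1:]:
                if s_i != s_j:  # only mix across sensitive groups
                    pairs.append((neutralize(rep_i, rep_j), label))
    return pairs


# Toy encoder outputs: (representation, label, sensitive attribute).
samples = [
    ([1.0, 0.0], 1, "a"),
    ([0.0, 1.0], 1, "b"),
    ([2.0, 2.0], 0, "a"),
    ([4.0, 0.0], 0, "b"),
    ([6.0, 6.0], 0, "a"),
]
training_pairs = neutralized_training_set(samples)
print(training_pairs)
```

The encoder is left untouched in this sketch; only the inputs to the classification head change, which matches the abstract's framing of debiasing the head alone.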
【42】 Leveraging semantically similar queries for ranking via combining representations
Authors: Hayden S. Helm, Marah Abdin, Benjamin D. Pedigo, Shweti Mahajan, Vince Lyzinski, Youngser Park, Amitabh Basu, Piali Choudhury, Christopher M. White, Weiwei Yang, Carey E. Priebe Affiliations: Microsoft Research, Johns Hopkins University, University of Maryland Link: https://arxiv.org/abs/2106.12621 Abstract: In modern ranking problems, different and disparate representations of the items to be ranked are often available. It is sensible, then, to try to combine these representations to improve ranking. Indeed, learning to rank via combining representations is both principled and practical for learning a ranking function for a particular query. In extremely data-scarce settings, however, the amount of labeled data available for a particular query can lead to a highly variable and ineffective ranking function. One way to mitigate the effect of the small amount of data is to leverage information from semantically similar queries. Indeed, as we demonstrate in simulation settings and real data examples, when semantically similar queries are available, it is possible to gainfully use them when ranking with respect to a particular query. We describe and explore this phenomenon in the context of the bias-variance trade-off and apply it to the data-scarce settings of a Bing navigational graph and the Drosophila larva connectome.
【43】 Adversarial Examples in Multi-Layer Random ReLU Networks
Authors: Peter L. Bartlett, Sébastien Bubeck, Yeshwanth Cherapanamjeri Affiliations: Department of Electrical Engineering and Computer Science, Department of Statistics, UC Berkeley, Microsoft Research Redmond Link: https://arxiv.org/abs/2106.12611 Abstract: We consider the phenomenon of adversarial examples in ReLU networks with independent Gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al. (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
【44】 FLEA: Provably Fair Multisource Learning from Unreliable Training Data
Authors: Eugenia Iofinova, Nikola Konstantinov, Christoph H. Lampert Affiliations: IST Austria Link: https://arxiv.org/abs/2106.11732 Abstract: Fairness-aware learning aims at constructing classifiers that not only make accurate predictions but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work, we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach through a diverse range of experiments on multiple datasets. Additionally, we prove formally that, given enough data, FLEA protects the learner against unreliable data as long as the fraction of affected data sources is less than half.