Statistics Academic Express [12.14]

2021-12-17 16:07:40

stat (Statistics): 56 papers in total

【1】 Sequential Break-Point Detection in Stationary Time Series: An Application to Monitoring Economic Indicators Link: https://arxiv.org/abs/2112.06889

Authors: Christis Katsouris Abstract: Monitoring economic conditions and financial stability with an early warning system serves as a prevention mechanism for unexpected economic events. In this paper, we investigate the statistical performance of sequential break-point detectors for stationary time series regression models with extensive simulation experiments. We employ an online sequential scheme for monitoring economic indicators from the European as well as the American financial markets that span the period during the 2008 financial crisis. Our results show that the performance of these tests applied to stationary time series regressions such as the AR(1) as well as the AR(1)-GARCH(1,1) depends on the severity of the break as well as the location of the break-point within the out-of-sample period. Consequently, our study provides some useful insights to practitioners for sequential break-point detection in economic and financial conditions.

【2】 Optimal friction matrix for underdamped Langevin sampling Link: https://arxiv.org/abs/2112.06844

Authors: Martin Chak, Nikolas Kantas, Tony Lelièvre, Grigorios A. Pavliotis Notes: 44 pages, 6 figures Abstract: A systematic procedure for optimising the friction coefficient in underdamped Langevin dynamics as a sampling tool is given by taking the gradient of the associated asymptotic variance with respect to friction. We give an expression for this gradient in terms of the solution to an appropriate Poisson equation and show that it can be approximated by short simulations of the associated first variation/tangent process under concavity assumptions on the log density. Our algorithm is applied to the estimation of posterior means in Bayesian inference problems and reduced variance is demonstrated when compared to the original underdamped and overdamped Langevin dynamics in both full and stochastic gradient cases.
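
As context for what is being tuned, here is a minimal Python sketch of underdamped Langevin dynamics with a scalar friction parameter gamma and a standard Gaussian target (both illustrative assumptions; the paper optimises a full friction matrix via the gradient of the asymptotic variance, which is not reproduced here):

```python
import numpy as np

def grad_log_density(q):
    return -q                     # standard Gaussian target (illustrative)

def underdamped_langevin(n_steps=100_000, dt=1e-2, gamma=1.0, seed=0):
    """Euler-Maruyama discretisation of underdamped Langevin dynamics:
    dq = p dt,  dp = (grad log pi(q) - gamma p) dt + sqrt(2 gamma) dW."""
    rng = np.random.default_rng(seed)
    q = p = 0.0
    out = np.empty(n_steps)
    for i in range(n_steps):
        p += dt * (grad_log_density(q) - gamma * p) \
             + np.sqrt(2.0 * gamma * dt) * rng.standard_normal()
        q += dt * p
        out[i] = q
    return out

# The variance of Monte Carlo averages depends on the friction; the paper
# differentiates this asymptotic variance in the friction to optimise it.
for gamma in (0.1, 1.0, 10.0):
    s = underdamped_langevin(gamma=gamma)
    bm = s.reshape(100, -1).mean(axis=1)              # batch means
    print(gamma, s.mean(), s.size // 100 * bm.var())  # crude asymptotic variance
```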

【3】 Accounting for survey design in Bayesian disaggregation of survey-based areal estimates of proportions: an application to the American Community Survey Link: https://arxiv.org/abs/2112.06802

Authors: Marco H. Benedetti, Veronica J. Berrocal, Roderick J. Little Affiliations: Nationwide Children's Hospital, Center for Injury Research and Policy, Columbus, OH; Department of Statistics, School of Information and Computer Sciences, University of California, Irvine, CA; Department of Biostatistics Abstract: Understanding the effects of social determinants of health on health outcomes requires data on characteristics of the neighborhoods in which subjects live. However, estimates of these characteristics are often aggregated over space and time in a fashion that diminishes their utility. Take, for example, estimates from the American Community Survey (ACS), a multi-year nationwide survey administered by the U.S. Census Bureau: estimates for small municipal areas are aggregated over 5-year periods, whereas 1-year estimates are only available for municipal areas with populations > 65,000. Researchers may wish to use ACS estimates in studies of population health to characterize neighborhood-level exposures. However, 5-year estimates may not properly characterize temporal changes or align temporally with other data in the study, while the coarse spatial resolution of the 1-year estimates diminishes their utility in characterizing neighborhood exposure. To circumvent this issue, in this paper we propose a modeling framework to disaggregate estimates of proportions derived from sampling surveys which explicitly accounts for the survey design effect. We illustrate the utility of our model by applying it to the ACS data, generating estimates of poverty for the state of Michigan at fine spatio-temporal resolution.

【4】 Robust factored principal component analysis for matrix-valued outlier accommodation and detection Link: https://arxiv.org/abs/2112.06760

Authors: Xuan Ma, Jianhua Zhao, Yue Wang Affiliations: School of Statistics and Mathematics, Yunnan University of Finance and Economics Notes: 37 pages Abstract: Principal component analysis (PCA) is a popular dimension reduction technique for vector data. Factored PCA (FPCA) is a probabilistic extension of PCA for matrix data, which can substantially reduce the number of parameters in PCA while yielding satisfactory performance. However, FPCA is based on the Gaussian assumption and is thereby susceptible to outliers. Although the multivariate $t$ distribution as a robust modeling tool for vector data has a very long history, its application to matrix data is very limited. The main reason is that the dimension of the vectorized matrix data is often very high, and the higher the dimension, the lower the breakdown point that measures the robustness. To solve the robustness problem suffered by FPCA and make it applicable to matrix data, in this paper we propose a robust extension of FPCA (RFPCA), which is built upon a $t$-type distribution called the matrix-variate $t$ distribution. Like the multivariate $t$ distribution, the matrix-variate $t$ distribution can adaptively down-weight outliers and yield robust estimates. We develop a fast EM-type algorithm for parameter estimation. Experiments on synthetic and real-world datasets reveal that RFPCA compares favorably with several related methods and that RFPCA is a simple but powerful tool for matrix-valued outlier detection.

【5】 Real-Time Estimation of COVID-19 Infections via Deconvolution and Sensor Fusion Link: https://arxiv.org/abs/2112.06697

作者:Maria Jahja,Andrew Chin,Ryan J. Tibshirani 机构: Carnegie Mellon University 摘要:我们建议2019冠状病毒疾病2019冠状病毒疾病的发病率估计病例报告延迟分布的每日报告的COVID-19病例计数,以估计每一个新的症状性COVID-19感染的日数。重要的是,我们专注于实时(而非回顾性)估计感染,这带来了许多挑战。为了解决这些问题,我们开发了分布估计和反褶积步骤的新方法,并采用了传感器融合层(该层融合了来自模型的预测,这些模型经过训练以跟踪基于辅助监测流的感染),以提高准确性和稳定性。 摘要:We propose, implement, and evaluate a method to estimate the daily number of new symptomatic COVID-19 infections, at the level of individual U.S. counties, by deconvolving daily reported COVID-19 case counts using an estimated symptom-onset-to-case-report delay distribution. Importantly, we focus on estimating infections in real-time (rather than retrospectively), which poses numerous challenges. To address these, we develop new methodology for both the distribution estimation and deconvolution steps, and we employ a sensor fusion layer (which fuses together predictions from models that are trained to track infections based on auxiliary surveillance streams) in order to improve accuracy and stability.
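
The deconvolution step can be illustrated in a few lines. Below is a toy Python sketch that recovers an infection curve from reported counts given a known delay distribution via nonnegative least squares; the delay values and the least-squares formulation are illustrative assumptions, and the paper's real-time version additionally estimates the delay distribution and handles the right-truncated edge of the series:

```python
import numpy as np
from scipy.optimize import nnls

# Toy, front-loaded delay distribution from symptom onset to case report
# (assumed known here; estimating it is part of the paper's method).
delay = np.array([0.5, 0.3, 0.15, 0.05])

# A smooth infection curve and the reported cases it induces.
t = np.arange(60)
infections = 100.0 * np.exp(-0.5 * ((t - 30.0) / 10.0) ** 2)
reported = np.convolve(infections, delay)[: len(t)]

# reported = C @ infections for the banded convolution matrix C; recover
# infections by nonnegative least squares (a simple stand-in for the
# paper's regularised, real-time deconvolution).
C = np.zeros((len(t), len(t)))
for i in range(len(t)):
    for d, w in enumerate(delay):
        if i + d < len(t):
            C[i + d, i] = w
estimate, _ = nnls(C, reported)
print(np.abs(estimate - infections).max())   # near zero in this noiseless toy
```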

【6】 Recalibration of Predictive Models as Approximate Probabilistic Updates Link: https://arxiv.org/abs/2112.06674

Authors: Evan T. R. Rosenman, Santiago Olivella Affiliations: Data Science Initiative, Harvard University; Department of Political Science, University of North Carolina, Chapel Hill Abstract: The output of predictive models is routinely recalibrated by reconciling low-level predictions with known derived quantities defined at higher levels of aggregation. For example, models predicting turnout probabilities at the individual level in U.S. elections can be adjusted so that their aggregation matches the observed vote totals in each state, thus producing better calibrated predictions. In this research note, we provide theoretical grounding for one of the most commonly used recalibration strategies, known colloquially as the "logit shift." Typically cast as a heuristic optimization problem (whereby an adjustment is found such that it minimizes the difference between aggregated predictions and the target totals), we show that the logit shift in fact offers a fast and accurate approximation to a principled, but often computationally impractical adjustment strategy: computing the posterior prediction probabilities, conditional on the target totals. After deriving analytical bounds on the quality of the approximation, we illustrate the accuracy of the approach using Monte Carlo simulations. The simulations also confirm analytical results regarding scenarios in which users of the simple logit shift can expect it to perform best, namely when the aggregated targets comprise many individual predictions, and when the distribution of true probabilities is symmetric and tight around 0.5.
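
The logit shift itself is simple enough to state in code. The sketch below applies a single constant shift on the logit scale, chosen by root-finding so that the shifted probabilities sum to the known aggregate; the function name, bracketing interval, and data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import brentq

def logit_shift(p, target_total):
    """Shift all predictions by a constant on the logit scale so that
    their sum matches the known aggregate total."""
    logits = np.log(p) - np.log1p(-p)
    def gap(delta):
        shifted = 1.0 / (1.0 + np.exp(-(logits + delta)))
        return shifted.sum() - target_total
    delta = brentq(gap, -20.0, 20.0)   # gap is monotone increasing in delta
    return 1.0 / (1.0 + np.exp(-(logits + delta)))

rng = np.random.default_rng(1)
p = rng.uniform(0.2, 0.8, size=10_000)   # individual turnout probabilities
observed_total = 5_600                    # known vote total for the state
q = logit_shift(p, observed_total)
print(q.sum())                            # matches observed_total
```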

【7】 An Introduction to Quantum Computing for Statisticians Link: https://arxiv.org/abs/2112.06587

Authors: Anna Lopatnikova, Minh-Ngoc Tran Affiliations: University of Sydney Business School Notes: 80 pages Abstract: Quantum computing has the potential to revolutionise and change the way we live and understand the world. This review aims to provide an accessible introduction to quantum computing with a focus on applications in statistics and data analysis. We start with an introduction to the basic concepts necessary to understand quantum computing and the differences between quantum and classical computing. We describe the core quantum subroutines that serve as the building blocks of quantum algorithms. We then review a range of quantum algorithms expected to deliver a computational advantage in statistics and machine learning. We highlight the challenges and opportunities in applying quantum computing to problems in statistics and discuss potential future research directions.

【8】 Inference via Randomized Test Statistics Link: https://arxiv.org/abs/2112.06583

Authors: Nikita Puchkin, Vladimir Ulyanov Affiliations: HSE University and Institute for Information Transmission Problems RAS; HSE University and Lomonosov Moscow State University Notes: 24 pages Abstract: We show that external randomization may enforce the convergence of test statistics to their limiting distributions in particular cases. This results in a sharper inference. Our approach is based on a central limit theorem for weighted sums. We apply our method to a family of rank-based test statistics and a family of phi-divergence test statistics and prove that, with overwhelming probability with respect to the external randomization, the randomized statistics converge at the rate $O(1/n)$ (up to some logarithmic factors) to the limiting chi-square distribution in the Kolmogorov metric.

【9】 Simulation of Gaussian random field in a ball Link: https://arxiv.org/abs/2112.06579

Authors: D. Kolyukhin, A. Minakov Affiliations: Trofimuk Institute of Petroleum Geology and Geophysics, SB RAS, Novosibirsk, Russia; Centre for Earth Evolution and Dynamics (CEED), University of Oslo, Oslo, Norway Notes: 11 pages, 4 figures Abstract: The presented paper is devoted to statistical modeling of Gaussian scalar real random fields inside a three-dimensional sphere (ball). We propose a statistical model describing the spatial heterogeneity in a unit ball and a numerical procedure for generating an ensemble of corresponding random realizations. The accuracy of the presented approach is corroborated by the numerical comparison of the estimated and analytical covariance functions. Our approach is flexible with respect to the assumed radial and angular covariance function. We illustrate the effect of the covariance model parameters based on numerical examples of random field realizations. The presented statistical simulation technique can be applied, for example, to the inference of the 3D spatial heterogeneity in the Earth and other planets.

【10】 On model-based time trend adjustments in platform trials with non-concurrent controls Link: https://arxiv.org/abs/2112.06574

Authors: Marta Bofill Roig, Pavla Krotka, Carl-Fredrik Burman, Ekkehard Glimm, Katharina Hees, Peter Jacko, Franz Koenig, Dominic Magirr, Peter Mesenbrink, Kert Viele, Martin Posch Affiliations: EU-PEARL (EU Patient-cEntric clinicAl tRial pLatforms) Consortium; Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria Abstract: Platform trials can evaluate the efficacy of several treatments compared to a control. The number of treatments is not fixed, as arms may be added or removed as the trial progresses. Platform trials are more efficient than independent parallel-group trials because of using shared control groups. For arms entering the trial later, not all patients in the control group are randomised concurrently. The control group is then divided into concurrent and non-concurrent controls. Using non-concurrent controls (NCC) can improve the trial's efficiency, but can introduce bias due to time trends. We focus on a platform trial with two treatment arms and a common control arm. Assuming that the second treatment arm is added later, we assess the robustness of model-based approaches to adjust for time trends when using NCC. We consider approaches where time trends are modeled as linear or as a step function, with steps at times where arms enter or leave the trial. For trials with continuous or binary outcomes, we investigate the type 1 error (t1e) rate and power of testing the efficacy of the newly added arm under a range of scenarios. In addition to scenarios where time trends are equal across arms, we investigate settings with trends that are different or not additive in the model scale. A step function model fitted on data from all arms gives increased power while controlling the t1e, as long as the time trends are equal for the different arms and additive on the model scale. This holds even if the trend's shape deviates from a step function if block randomisation is used. But if trends differ between arms or are not additive on the model scale, t1e control may be lost. The efficiency gained by using step function models to incorporate NCC can outweigh potential biases. However, the specifics of the trial, plausibility of different time trends, and robustness of results should be considered.
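
A minimal sketch of the step-function adjustment in the favourable scenario (equal, additive step trends), using plain least squares; the design, period effects, and effect sizes are illustrative assumptions rather than the paper's simulation settings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Periods 0-2; arm 2 enters at period 1, so its period-0 controls are
# non-concurrent. The time trend is a step function across periods,
# equal across arms and additive: the favourable scenario.
n = 3_000
period = rng.integers(0, 3, size=n)
arm = np.where(period == 0, rng.integers(0, 2, size=n),
               rng.integers(0, 3, size=n))
y = (1.0 + np.array([0.0, 0.5, 1.0])[period]   # step time trend
         + np.array([0.0, 0.3, 0.6])[arm]      # true arm effects
         + rng.normal(size=n))

# Step-function adjustment: arm indicators plus period dummies, fitted
# on data from all arms, non-concurrent controls included.
X = np.column_stack([np.ones(n), arm == 1, arm == 2,
                     period == 1, period == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[2])   # estimate of the arm-2 effect (true value 0.6)
```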

【11】 Prediction in functional regression with discretely observed and noisy covariates Link: https://arxiv.org/abs/2112.06486

Authors: Siegfried Hörmann, Fatima Jammoul Affiliations: Institute of Statistics, Graz University of Technology Abstract: In practice, functional data are sampled on a discrete set of observation points and are often susceptible to noise. We consider in this paper the setting where such data are used as explanatory variables in a regression problem. If the primary goal is prediction, we show that the gain from embedding the problem into a scalar-on-function regression is limited. Instead, we impose a factor model on the predictors and suggest regressing the response on an appropriate number of factor scores. This approach is shown to be consistent under mild technical assumptions, numerically efficient, and to give good practical performance in both simulations as well as real data settings.
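
A toy sketch of the proposed route: estimate factor scores of the discretely observed, noisy curves by PCA and regress the response on them. The factor model, grid, and number of factors k are illustrative assumptions; the paper also treats the choice of k and gives consistency conditions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 300, 50, 3          # curves, observation points, factors

# Discretely observed, noisy functional predictors driven by k factors.
grid = np.linspace(0, 1, p)
loadings = np.column_stack([np.sin((j + 1) * np.pi * grid) for j in range(k)])
scores = rng.normal(size=(n, k))
X = scores @ loadings.T + 0.3 * rng.normal(size=(n, p))
y = scores @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=n)

# Estimate factor scores by PCA and regress the response on them,
# instead of a full scalar-on-function regression.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
est_scores = Xc @ Vt[:k].T
beta, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(n), est_scores]), y, rcond=None)
resid = y - np.column_stack([np.ones(n), est_scores]) @ beta
print(1 - resid.var() / y.var())   # in-sample R^2 close to 1
```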

【12】 A unified treatment of characteristic functions of symmetric multivariate and related distributions Link: https://arxiv.org/abs/2112.06472

Authors: Hua Dong, Chuancun Yin Affiliations: School of Statistics, Qufu Normal University, Shandong, China Notes: 17 pages Abstract: The purpose of the present paper is to give unified expressions for the characteristic functions of all elliptical and related distributions. These distributions include the multivariate elliptically symmetric distributions and some asymmetric distributions such as skew-elliptical distributions and their location-scale mixtures. In particular, we obtain simple closed forms of the characteristic functions for important cases such as the multivariate Student-$t$, Cauchy, logistic, Laplace, and symmetric stable distributions. The expressions of the characteristic functions involve Bessel-type functions or generalized hypergeometric series.

【13】 Scalable subsampling: computation, aggregation and inference Link: https://arxiv.org/abs/2112.06434

Authors: Dimitris N. Politis Affiliations: Department of Mathematics and Halicioglu Data Science Institute, University of California, San Diego, La Jolla, CA, USA Abstract: Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat\theta_n$ in order to conduct nonparametric inference such as the construction of confidence intervals and hypothesis tests. Subsampling has seen a resurgence in the Big Data era where the standard, full-resample-size bootstrap can be infeasible to compute. Nevertheless, even choosing a single random subsample of size $b$ can be computationally challenging with both $b$ and the sample size $n$ being very large. In the paper at hand, we show how a set of appropriately chosen, non-random subsamples can be used to conduct effective -- and computationally feasible -- distribution estimation via subsampling. Further, we show how the same set of subsamples can be used to yield a procedure for subsampling aggregation -- also known as subagging -- that is scalable with big data. Interestingly, the scalable subagging estimator can be tuned to have the same (or better) rate of convergence as compared to $\hat\theta_n$. The paper is concluded by showing how to conduct inference, e.g., confidence intervals, based on the scalable subagging estimator instead of the original $\hat\theta_n$.
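
For orientation, here is a textbook subsampling confidence interval computed from non-random, consecutive blocks of size b, the kind of deterministic subsample choice the paper exploits; the sqrt(n) rate and the mean statistic are the standard i.i.d. defaults, not the paper's general tuning:

```python
import numpy as np

def subsampling_ci(x, b, statistic=np.mean, alpha=0.05):
    """Confidence interval from non-random, consecutive subsamples of
    size b (deterministic block choice; textbook sqrt(n) scaling)."""
    n = len(x)
    theta_n = statistic(x)
    blocks = x[: (n // b) * b].reshape(-1, b)
    theta_b = np.array([statistic(blk) for blk in blocks])
    roots = np.sqrt(b) * (theta_b - theta_n)   # subsampling "roots"
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return theta_n - hi / np.sqrt(n), theta_n - lo / np.sqrt(n)

x = np.random.default_rng(4).normal(loc=2.0, size=100_000)
print(subsampling_ci(x, b=1_000))   # should cover the true mean 2.0
```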

【14】 How Good are Low-Rank Approximations in Gaussian Process Regression? Link: https://arxiv.org/abs/2112.06410

Authors: Constantinos Daskalakis, Petros Dellaportas, Aristeidis Panos Affiliations: CSAIL, Massachusetts Institute of Technology, USA; University College London, UK; University of Warwick, UK; Athens University of Economics and Business, Greece; The Alan Turing Institute, UK Abstract: We provide guarantees for approximate Gaussian Process (GP) regression resulting from two common low-rank kernel approximations: based on random Fourier features, and based on truncating the kernel's Mercer expansion. In particular, we bound the Kullback-Leibler divergence between an exact GP and one resulting from one of the afore-described low-rank approximations to its kernel, as well as between their corresponding predictive densities, and we also bound the error between predictive mean vectors and between predictive covariance matrices computed using the exact versus using the approximate GP. We provide experiments on both simulated data and standard benchmarks to evaluate the effectiveness of our theoretical bounds.
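
Of the two approximations studied, random Fourier features are easy to sketch. The snippet below builds the feature map for a squared-exponential kernel and checks the kernel approximation error; sizes and the lengthscale are illustrative assumptions, and the paper's bounds concern how such kernel error propagates to the GP posterior:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, D = 500, 2, 200               # data points, input dim, feature count
X = rng.normal(size=(n, d))
ls = 1.0                            # kernel lengthscale

# Random Fourier features for the squared-exponential kernel
# k(x, y) = exp(-||x - y||^2 / (2 ls^2)) ~ z(x) @ z(y).
W = rng.normal(scale=1.0 / ls, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * D2 / ls ** 2)
print(np.abs(K_exact - Z @ Z.T).max())   # error decays as D grows
```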

【15】 Detection and Estimation of Multiple Transient Changes Link: https://arxiv.org/abs/2112.06308

Authors: Michael Baron, Sergey V. Malov Affiliations: American University; St. Petersburg State University Notes: 25 pages, 3 figures Abstract: Change-point detection methods are proposed for the case of temporary failures, or transient changes, when an unexpected disorder is ultimately followed by a readjustment and return to the initial state. A base distribution of the "in-control" state changes to an "out-of-control" distribution for unknown periods of time. Likelihood-based sequential and retrospective tools are proposed for the detection and estimation of each pair of change-points. The accuracy of the obtained change-point estimates is assessed. The proposed methods offer simultaneous control of the familywise false alarm and false readjustment rates at pre-chosen levels.

【16】 The R Package knnwtsim: Nonparametric Forecasting With a Tailored Similarity Measure Link: https://arxiv.org/abs/2112.06266

Authors: Matthew Trupiano Affiliations: Rochester Institute of Technology Notes: knnwtsim package version 0.1.0 Abstract: The R package knnwtsim provides functions to implement k nearest neighbors (KNN) forecasting using a similarity metric tailored to the forecasting problem of predicting future observations of a response series where recent observations, seasonal or periodic patterns, and the values of one or more exogenous predictors all have predictive value in forecasting new response points. This paper introduces the similarity measure of interest, and the functions in knnwtsim used to calculate, tune, and ultimately utilize it in KNN forecasting. This package may be of particular value in forecasting problems where the functional relationships between response and predictors are non-constant or piece-wise and thus can violate the assumptions of popular alternatives. In addition, both real-world and simulated time series datasets used in the development and testing of this approach have been made available with the package.
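
The sketch below conveys the flavour of a tailored similarity combining recency, seasonal position, and an exogenous predictor with tunable weights. It is written in Python with invented names; it is not the knnwtsim API, whose actual functions and weighting scheme are documented with the package:

```python
import numpy as np

def knn_forecast(y, x, season, alpha, k=5):
    """Forecast the next value of y from the k most similar past points,
    blending recency, seasonal position, and an exogenous predictor x
    with weights alpha = (a_recency, a_season, a_exog)."""
    t = len(y)                             # index of the point to forecast
    idx = np.arange(t - 1)                 # candidates with a known successor
    s_recency = 1.0 / (t - idx)
    s_season = (season[idx] == season[t]).astype(float)
    s_exog = 1.0 / (1.0 + np.abs(x[idx] - x[t]))
    sim = alpha[0] * s_recency + alpha[1] * s_season + alpha[2] * s_exog
    nn = idx[np.argsort(sim)[-k:]]         # k most similar past points
    return y[nn + 1].mean()                # average of their successors

rng = np.random.default_rng(6)
n = 120
season = np.arange(n + 1) % 12
x = rng.normal(size=n + 1)
y = 10 + np.sin(2 * np.pi * season[:n] / 12) + 0.5 * x[:n] \
    + 0.1 * rng.normal(size=n)
print(knn_forecast(y, x, season, alpha=(0.2, 0.5, 0.3)))
```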

【17】 A multi-arm multi-stage platform design that allows pre-planned addition of arms while still controlling the family-wise error Link: https://arxiv.org/abs/2112.06195

Authors: Peter Greenstreet, Thomas Jaki, Alun Bedding, Chris Harbron, Pavel Mozgunov Notes: 30 pages, 7 figures Abstract: There is growing interest in platform trials that allow for adding of new treatment arms as the trial progresses, as well as being able to stop treatments part way through the trial for either lack of benefit/futility or for superiority. In some situations, platform trials need to guarantee that error rates are controlled. This paper presents a multi-stage design that allows additional arms to be added in a platform trial in a pre-planned fashion, while still controlling the family-wise error rate. A method is given to compute the sample size required to achieve a desired level of power, and we show how the distribution of the sample size and the expected sample size can be found. A motivating trial is presented which focuses on two settings, with the first being a set number of stages per active treatment arm and the second being a set total number of stages, with treatments that are added later getting fewer stages. Through this example we show that the proposed method results in a smaller sample size while still controlling the errors compared to running multiple separate trials.

【18】 Characterizations of the Normal Distribution via the Independence of the Sample Mean and the Feasible Definite Statistics with Ordered Arguments Link: https://arxiv.org/abs/2112.06152

Authors: Chin-Yuan Hu, Gwo Dong Lin Affiliations: National Changhua University of Education, Taiwan; Hwa-Kang Xing-Ye Foundation, Taiwan; Academia Sinica, Taiwan Notes: 18 pages Abstract: It is well known that the independence of the sample mean and the sample variance characterizes the normal distribution. By using Anosov's theorem, we further investigate the analogous characteristic properties in terms of the sample mean and some feasible definite statistics. The latter statistics, introduced in this paper for the first time, are based on nonnegative, definite and continuous functions of ordered arguments with positive degree of homogeneity. The proposed approach seems to be natural and can be used to derive easily characterization results for many feasible definite statistics, such as known characterizations involving the sample variance, sample range as well as Gini's mean difference.

【19】 Markov Subsampling Based on Huber Criterion Link: https://arxiv.org/abs/2112.06134

Authors: Tieliang Gong, Yuxin Dong, Hong Chen, Bo Dong, Chen Li Affiliations: School of Continuing Education, Xi'an Jiaotong University Abstract: Subsampling is an important technique to tackle the computational challenges brought by big data. Many subsampling procedures fall within the framework of importance sampling, which assigns high sampling probabilities to the samples appearing to have big impacts. When the noise level is high, those sampling procedures tend to pick many outliers and thus often do not perform satisfactorily in practice. To tackle this issue, we design a new Markov subsampling strategy based on the Huber criterion (HMS) to construct an informative subset from the noisy full data; the constructed subset then serves as a refined working data for efficient processing. HMS is built upon a Metropolis-Hastings procedure, where the inclusion probability of each sampling unit is determined using the Huber criterion to prevent over-scoring the outliers. Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent with a sub-Gaussian deviation bound. The promising performance of HMS is demonstrated by extensive studies on large-scale simulations and real data examples.
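
A rough sketch of the general mechanism: a Metropolis-Hastings walk over sample indices whose stationary weights decay with the Huber loss of pilot-fit residuals, so heavy-tailed outliers are rarely retained. The pilot fit, the weight form, and all constants are illustrative assumptions and differ from the paper's HMS construction:

```python
import numpy as np

def huber(r, delta=1.35):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def markov_subsample(y, x, m, seed=7):
    """Metropolis-Hastings walk over indices, favouring units whose
    pilot residuals score well under the Huber criterion."""
    rng = np.random.default_rng(seed)
    resid = y - np.polyval(np.polyfit(x, y, 1), x)   # pilot linear fit
    weight = np.exp(-huber(resid / np.std(resid)))
    current = rng.integers(len(y))
    chosen = np.empty(m, dtype=int)
    for s in range(m):
        proposal = rng.integers(len(y))
        if rng.uniform() < min(1.0, weight[proposal] / weight[current]):
            current = proposal
        chosen[s] = current
    return chosen

rng = np.random.default_rng(9)
x = rng.uniform(-1, 1, 5_000)
y = 2.0 * x + rng.standard_t(df=1.5, size=5_000)     # heavy-tailed noise
sub = markov_subsample(y, x, m=500)
print(np.abs(y[sub] - 2.0 * x[sub]).mean(),          # subsample residuals are
      np.abs(y - 2.0 * x).mean())                    # less extreme than the full data's
```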

【20】 Confidence intervals for the random forest generalization error Link: https://arxiv.org/abs/2112.06101

Authors: Paulo C. Marques F. Affiliations: Insper Institute of Education and Research, São Paulo, Brazil Notes: 8 pages Abstract: We show that underneath the training process of a random forest there lies not only the well-known and almost computationally free out-of-bag point estimate of its generalization error, but also a path to compute a confidence interval for the generalization error which does not demand a retraining of the forest or any form of data splitting. Besides the low computational cost involved in its construction, this confidence interval is shown through simulations to have good coverage and an appropriate shrinking rate of its width in terms of the training sample size.
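
The out-of-bag point estimate is available directly in common implementations; the sketch below computes it with scikit-learn and adds a naive normal-approximation interval over per-observation OOB errors as a stand-in. The paper's interval is derived differently, though likewise without retraining or data splitting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2_000, random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1.0 - rf.oob_score_      # the "computationally free" estimate

# Naive stand-in for an interval: normal approximation over the
# per-observation out-of-bag misclassification indicators.
oob_pred = rf.classes_[rf.oob_decision_function_.argmax(axis=1)]
miss = (oob_pred != y).astype(float)
half = 1.96 * miss.std(ddof=1) / np.sqrt(len(y))
print(oob_error, (oob_error - half, oob_error + half))
```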

【21】 Latent Community Adaptive Network Regression Link: https://arxiv.org/abs/2112.06097

Authors: Heather Mathews, Alexander Volfovsky Affiliations: Department of Statistical Science, Duke University Abstract: The study of network data in the social and health sciences frequently concentrates on two distinct tasks: (1) detecting community structures among nodes and (2) associating covariate information to edge formation. In much of this data, it is likely that the effects of covariates on edge formation differ between communities (e.g. age might play a different role in friendship formation in communities across a city). In this work, we introduce a latent space network model where coefficients associated with certain covariates can depend on latent community membership of the nodes. We show that ignoring such structure can lead to either over- or under-estimation of covariate importance to edge formation and propose a Markov chain Monte Carlo approach for simultaneously learning the latent community structure and the community-specific coefficients. We leverage efficient spectral methods to improve the computational tractability of our approach.

【22】 Determinantal point processes based on orthogonal polynomials for sampling minibatches in SGD Link: https://arxiv.org/abs/2112.06007

Authors: Rémi Bardenet, Subhro Ghosh, Meixia Lin Notes: Accepted at NeurIPS 2021 (Spotlight Paper). Authors are listed in alphabetical order. Abstract: Stochastic gradient descent (SGD) is a cornerstone of machine learning. When the number N of data items is large, SGD relies on constructing an unbiased estimator of the gradient of the empirical risk using a small subset of the original dataset, called a minibatch. Default minibatch construction involves uniformly sampling a subset of the desired size, but alternatives have been explored for variance reduction. In particular, experimental evidence suggests drawing minibatches from determinantal point processes (DPPs), distributions over minibatches that favour diversity among selected items. However, like in recent work on DPPs for coresets, providing a systematic and principled understanding of how and why DPPs help has been difficult. In this work, we contribute an orthogonal polynomial-based DPP paradigm for minibatch sampling in SGD. Our approach leverages the specific data distribution at hand, which endows it with greater sensitivity and power over existing data-agnostic methods. We substantiate our method via a detailed theoretical analysis of its convergence properties, interweaving between the discrete data set and the underlying continuous domain. In particular, we show how specific DPPs and a string of controlled approximations can lead to gradient estimators with a variance that decays faster with the batch size than under uniform sampling. Coupled with existing finite-time guarantees for SGD on convex objectives, this entails that DPP minibatches lead to a smaller bound on the mean square approximation error than uniform minibatches. Moreover, our estimators are amenable to a recent algorithm that directly samples linear statistics of DPPs (i.e., the gradient estimator) without sampling the underlying DPP (i.e., the minibatch), thereby reducing computational overhead. We provide detailed synthetic as well as real data experiments to substantiate our theoretical claims.
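
For readers who want to experiment, here is a self-contained sketch of the classical spectral sampler for an L-ensemble DPP together with the inclusion-probability reweighting that keeps a minibatch gradient estimator unbiased. The RBF kernel and the stand-in gradients are illustrative assumptions; the paper's orthogonal-polynomial DPPs and its faster linear-statistics sampler are not reproduced here:

```python
import numpy as np

def sample_dpp(L, rng):
    """Exact spectral sampler for an L-ensemble DPP (Hough et al. scheme)."""
    lam, U = np.linalg.eigh(L)
    V = U[:, rng.uniform(size=len(lam)) < lam / (1.0 + lam)]
    items = []
    while V.shape[1] > 0:
        # pick an item with probability proportional to its squared row norm
        p = (V ** 2).sum(axis=1) / V.shape[1]
        i = rng.choice(len(p), p=p / p.sum())
        items.append(i)
        # project the remaining columns orthogonally to coordinate i
        j = np.argmax(np.abs(V[i]))
        Vj = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V = V - np.outer(Vj, V[i] / Vj[i])
            V, _ = np.linalg.qr(V)
    return np.array(items)

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = np.exp(-D2 / 2.0)                         # diversity-inducing kernel
K = L @ np.linalg.inv(np.eye(len(X)) + L)     # marginal kernel, P(i in S) = K_ii
batch = sample_dpp(L, rng)
g = X.sum(axis=1)                             # stand-in per-item gradients
# Reweighting by inclusion probabilities keeps the mean-gradient
# estimator unbiased: E[sum_{i in S} g_i / K_ii] = sum_i g_i.
print(len(batch), (g[batch] / np.diag(K)[batch]).sum() / len(X), g.mean())
```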

【23】 Multiply robust estimators in longitudinal studies with missing data under control-based imputation Link: https://arxiv.org/abs/2112.06000

Authors: Siyi Liu, Shu Yang, Yilong Zhang, Guanghan Liu Affiliations: Department of Statistics, North Carolina State University; Merck & Co., Inc. Abstract: Longitudinal studies are often subject to missing data. The ICH E9(R1) addendum addresses the importance of defining a treatment effect estimand with the consideration of intercurrent events. Jump-to-reference (J2R) is one classically envisioned control-based scenario for the treatment effect evaluation using the hypothetical strategy, where the participants in the treatment group after intercurrent events are assumed to have the same disease progress as those with identical covariates in the control group. We establish new estimators to assess the average treatment effect based on a proposed potential outcomes framework under J2R. Various identification formulas are constructed under the assumptions addressed by J2R, motivating estimators that rely on different parts of the observed data distribution. Moreover, we obtain a novel estimator inspired by the efficient influence function, with multiple robustness in the sense that it achieves $n^{1/2}$-consistency if any pairs of multiple nuisance functions are correctly specified, or if the nuisance functions converge at a rate not slower than $n^{-1/4}$ when using flexible modeling approaches. The finite-sample performance of the proposed estimators is validated in simulation studies and an antidepressant clinical trial.

【24】 Test Set Sizing Via Random Matrix Theory Link: https://arxiv.org/abs/2112.05977

Authors: Alexander Dubbs Abstract: This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise, and thus fairly reflects the value or lack of same of the model. This paper is the first to solve for the training and test size for any model in a way that is truly optimal. The number of data points in the training set is the root of a quartic polynomial, derived in Theorem 1, which depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise drop out of the calculations. The critical mathematical difficulties were realizing that the problems herein were discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. Mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.

【25】 A Note on the Moments of Special Mixture Distributions, with Applications for Control Charts Link: https://arxiv.org/abs/2112.05940

Authors: Balázs Dobi, András Zempléni Notes: 9 pages Abstract: Control charts can be applied in a wide range of areas; this paper focuses on generalisations suitable for healthcare applications. We concentrate on the effect of using mixture distributions as the possible shifts in the process mean value. We give a closed formula for the expected squared deviation from the target value between samplings. This observation can be utilised in finding the cost-optimal parameters of the control chart.

【26】 A Sparse Expansion For Deep Gaussian Processes Link: https://arxiv.org/abs/2112.05888

Authors: Liang Ding, Rui Tuo, Shahin Shahrampour Affiliations: Industrial & Systems Engineering, Texas A&M University, College Station, TX; Mechanical & Industrial Engineering, Northeastern University, Boston, MA Abstract: Deep Gaussian Processes (DGP) enable a non-parametric approach to quantify the uncertainty of complex deep machine learning models. Conventional inferential methods for DGP models can suffer from high computational complexity, as they require large-scale operations with kernel matrices for training and inference. In this work, we propose an efficient scheme for accurate inference and prediction based on a range of Gaussian Processes, called the Tensor Markov Gaussian Processes (TMGP). We construct an induced approximation of TMGP referred to as the hierarchical expansion. Next, we develop a deep TMGP (DTMGP) model as the composition of multiple hierarchical expansions of TMGPs. The proposed DTMGP model has the following properties: (1) the outputs of each activation function are deterministic while the weights are chosen independently from the standard Gaussian distribution; (2) in training or prediction, only O(polylog(M)) (out of M) activation functions have non-zero outputs, which significantly boosts the computational efficiency. Our numerical experiments on real datasets show the superior computational efficiency of DTMGP versus other DGP models.

【27】 The Past as a Stochastic Process Link: https://arxiv.org/abs/2112.05876

Authors: David H. Wolpert, Michael H. Price, Stefani A. Crabtree, Timothy A. Kohler, Jürgen Jost, James Evans, Peter F. Stadler, Hajime Shimao, Manfred D. Laubichler Affiliations: Santa Fe Institute; Complexity Science Hub Vienna; Arizona State University; Utah State University; Washington State University; Crow Canyon Archaeological Center; Max Planck Institute for Mathematics in the Sciences; University of Chicago; Leipzig University Notes: 20 pages, 4 figures Abstract: Historical processes manifest remarkable diversity. Nevertheless, scholars have long attempted to identify patterns and categorize historical actors and influences with some success. A stochastic process framework provides a structured approach for the analysis of large historical datasets that allows for detection of sometimes surprising patterns, identification of relevant causal actors both endogenous and exogenous to the process, and comparison between different historical cases. The combination of data, analytical tools and the organizing theoretical framework of stochastic processes complements traditional narrative approaches in history and archaeology.

【28】 Association study between gene expression and multiple phenotypes in omics applications of complex diseases Link: https://arxiv.org/abs/2112.05818

Authors: Yujia Li, Yusi Fang, Peng Liu, George C. Tseng Affiliations: Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA Abstract: Studying phenotype-gene association can uncover mechanisms of diseases and help develop efficient treatments. In complex diseases where multiple phenotypes are available and correlated, analyzing and interpreting associated genes for each phenotype separately may decrease statistical power and lose interpretation due to not considering the correlation between phenotypes. The typical approaches are global testing methods, such as multivariate analysis of variance (MANOVA), which test the overall association between phenotypes and each gene, without considering the heterogeneity among phenotypes. In this paper, we extend and evaluate two p-value combination methods, the adaptive weighted Fisher's method (AFp) and the adaptive Fisher's method (AFz), to tackle this problem, where AFp stands out as our final proposed method, based on extensive simulations and a real application. Our proposed AFp method has three advantages over traditional global testing methods. Firstly, it can consider the heterogeneity of phenotypes and determine which specific phenotypes a gene is associated with, using phenotype-specific 0-1 weights. Secondly, AFp takes the p-values from the test of association of each phenotype as input, and thus can accommodate different types of phenotypes (continuous, binary and count). Thirdly, we also apply bootstrapping to construct a variability index for the weight estimator of AFp and generate a co-membership matrix to categorize (cluster) genes based on their association patterns for intuitive biological investigations. Through extensive simulations, AFp shows superior performance over global testing methods in terms of type I error control and statistical power, as well as higher accuracy of 0-1 weight estimation over AFz. A real omics application with transcriptomic and clinical data of complex lung diseases demonstrates insightful biological findings of AFp.
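
The adaptive 0-1-weighted Fisher idea can be sketched compactly: scan subsets of phenotypes (here, the top-k by contribution) and keep the weights maximising a standardised weighted Fisher statistic. The standardisation and the top-k search are illustrative assumptions; the paper calibrates significance by permutation rather than a plain chi-square reference:

```python
import numpy as np

def afp_stat(pvals):
    """Search over 0-1 weights for the weighted Fisher statistic,
    keeping the top-k contributions for each k and standardising so
    that different k are comparable (illustrative sketch only)."""
    contrib = -2.0 * np.log(pvals)      # each term ~ chi2(2) under the null
    order = np.argsort(-contrib)
    best, best_w = -np.inf, None
    for k in range(1, len(pvals) + 1):
        w = np.zeros(len(pvals), dtype=int)
        w[order[:k]] = 1
        s = (contrib[order[:k]].sum() - 2 * k) / np.sqrt(4 * k)
        if s > best:
            best, best_w = s, w
    return best, best_w

p = np.array([1e-6, 2e-5, 0.4, 0.8])    # gene tied to phenotypes 1 and 2
stat, weights = afp_stat(p)
print(stat, weights)                    # weights flag phenotypes 1 and 2
```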

【29】 On the identification of the riskiest directional components from multivariate heavy-tailed data Link: https://arxiv.org/abs/2112.05759

Authors: Miriam Hägele, Jaakko Lehtomaa Notes: 22 pages, 3 figures Abstract: In univariate data, there exist standard procedures for identifying dominating features that produce the largest observations. However, in the multivariate setting, the situation is quite different. This paper aims to provide tools and algorithms for detecting dominating directional components in multivariate data. We study general heavy-tailed multivariate random vectors in dimension $d \geq 2$ and present consistent estimators which can be used to evaluate why the data is heavy-tailed. This is done by identifying the set of the riskiest directional components. The results are of particular interest in insurance when setting reinsurance policies, and in finance when hedging a portfolio of multiple assets.

【30】 Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias Link: https://arxiv.org/abs/2112.06868

Authors: Frederic Koehler, Viraj Mehta, Andrej Risteski, Chenghui Zhou Abstract: Variational Autoencoders (VAEs) are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2019) suggests that on low-dimensional data, the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. In this paper, via a combination of theoretical and empirical results, we show that the story is more subtle. Precisely, we show that for linear encoders/decoders, the story is mostly true and VAE training does recover a generator with support equal to the ground truth manifold, but this is due to the implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold.

【31】 The whole and the parts: the MDL principle and the a-contrario framework Link: https://arxiv.org/abs/2112.06853

Authors: Rafael Grompone von Gioi, Ignacio Ramírez Paulino, Gregory Randall Affiliations: Université Paris-Saclay; Universidad de la República Notes: Submitted to SIAM Journal on Imaging Sciences (SIIMS) Abstract: This work explores the connections between the Minimum Description Length (MDL) principle as developed by Rissanen, and the a-contrario framework for structure detection proposed by Desolneux, Moisan and Morel. The MDL principle focuses on the best interpretation for the whole data while the a-contrario approach concentrates on detecting parts of the data with anomalous statistics. Although framed in different theoretical formalisms, we show that both methodologies share many common concepts and tools in their machinery and yield very similar formulations in a number of interesting scenarios ranging from simple toy examples to practical applications such as polygonal approximation of curves and line segment detection in images. We also formulate the conditions under which both approaches are formally equivalent.

【32】 Incorporating Measurement Error in Astronomical Object Classification Link: https://arxiv.org/abs/2112.06831

Authors: Sarah Shy, Hyungsuk Tak, Eric D. Feigelson, John D. Timlin, G. Jogesh Babu Affiliations: Department of Statistics, Pennsylvania State University, University Park, PA, USA; Department of Astronomy and Astrophysics, Pennsylvania State University, University Park, PA, USA Abstract: Most general-purpose classification methods, such as support-vector machine (SVM) and random forest (RF), fail to account for an unusual characteristic of astronomical data: known measurement error uncertainties. In astronomical data, this information is often given in the data but discarded because popular machine learning classifiers cannot incorporate it. We propose a simulation-based approach that incorporates heteroscedastic measurement error into any existing classification method to better quantify uncertainty in classification. The proposed method first simulates perturbed realizations of the data from a Bayesian posterior predictive distribution of a Gaussian measurement error model. Then, a chosen classifier is fit to each simulation. The variation across the simulations naturally reflects the uncertainty propagated from the measurement errors in both labeled and unlabeled data sets. We demonstrate the use of this approach via two numerical studies. The first is a thorough simulation study applying the proposed procedure to SVM and RF, which are well-known hard and soft classifiers, respectively. The second study is a realistic classification problem of identifying high-$z$ $(2.9 \leq z \leq 5.1)$ quasar candidates from photometric data. The data were obtained from merged catalogs of the Sloan Digital Sky Survey, the $Spitzer$ IRAC Equatorial Survey, and the $Spitzer$-HETDEX Exploratory Large-Area Survey. The proposed approach reveals that out of 11,847 high-$z$ quasar candidates identified by a random forest without incorporating measurement error, 3,146 are potential misclassifications. Additionally, out of $\sim$1.85 million objects not identified as high-$z$ quasars without measurement error, 936 can be considered candidates when measurement error is taken into account.
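
A minimal sketch of the simulation-based propagation: redraw the observed features according to their measurement errors, refit the classifier each time, and flag objects whose predicted class is unstable across draws. The data-generating process, the plain additive redraw (a crude stand-in for the paper's posterior predictive draws), and the vote thresholds are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(11)
n = 1_000
truth = rng.normal(size=(n, 3))                  # latent true features
sigma = rng.uniform(0.1, 1.0, size=(n, 3))       # known heteroscedastic errors
X = truth + sigma * rng.normal(size=(n, 3))      # observed noisy photometry
y = (truth[:, 0] + truth[:, 1] > 0).astype(int)
labeled = np.arange(n) < n // 2                  # half labeled, half unlabeled

votes = np.zeros(n - n // 2)
n_sim = 20
for _ in range(n_sim):
    Xp = X + sigma * rng.normal(size=(n, 3))     # crude perturbed realization
    clf = RandomForestClassifier(n_estimators=100).fit(Xp[labeled], y[labeled])
    votes += clf.predict(Xp[~labeled])
votes /= n_sim
print(((votes > 0.2) & (votes < 0.8)).sum(),
      "objects switch class across simulations")
```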

【33】 BScNets: Block Simplicial Complex Neural Networks Link: https://arxiv.org/abs/2112.06826

Authors: Yuzhou Chen, Yulia R. Gel, H. Vincent Poor Affiliations: Department of Electrical and Computer Engineering, Princeton University; Department of Mathematical Sciences, University of Texas at Dallas Abstract: Simplicial neural networks (SNN) have recently emerged as the newest direction in graph learning, expanding the idea of convolutional architectures from node space to simplicial complexes on graphs. Instead of predominantly assessing pairwise relations among nodes as in the current practice, simplicial complexes allow us to describe higher-order interactions and multi-node graph structures. By building upon the connection between the convolution operation and the new block Hodge-Laplacian, we propose the first SNN for link prediction. Our new Block Simplicial Complex Neural Networks (BScNets) model generalizes the existing graph convolutional network (GCN) frameworks by systematically incorporating salient interactions among multiple higher-order graph structures of different dimensions. We discuss the theoretical foundations behind BScNets and illustrate its utility for link prediction on eight real-world and synthetic datasets. Our experiments indicate that BScNets outperforms the state-of-the-art models by a significant margin while maintaining low computation costs. Finally, we show the utility of BScNets as a new promising alternative for tracking the spread of infectious diseases such as COVID-19 and measuring the effectiveness of healthcare risk mitigation strategies.

【34】 Multi-Asset Spot and Option Market Simulation Link: https://arxiv.org/abs/2112.06823

Authors: Magnus Wiese, Ben Wood, Alexandre Pachoud, Ralf Korn, Hans Buehler, Phillip Murray, Lianjun Bai Affiliations: J.P. Morgan, Bank Street, London, United Kingdom; University of Kaiserslautern, Gottlieb-Daimler-Straße, Kaiserslautern, Germany Abstract: We construct realistic spot and equity option market simulators for a single underlying on the basis of normalizing flows. We address the high dimensionality of market-observed call prices through an arbitrage-free autoencoder that approximates efficient low-dimensional representations of the prices while maintaining no static arbitrage in the reconstructed surface. Given a multi-asset universe, we leverage the conditional invertibility property of normalizing flows and introduce a scalable method to calibrate the joint distribution of a set of independent simulators while preserving the dynamics of each simulator. Empirical results highlight the goodness of the calibrated simulators and their fidelity.

【35】 Hedging Cryptocurrency Options Link: https://arxiv.org/abs/2112.06807

Authors: Jovanka Lili Matic, Natalie Packham, Wolfgang Karl Härdle Abstract: The cryptocurrency (CC) market is volatile, non-stationary and non-continuous. This poses unique challenges for pricing and hedging CC options. We study the hedge behaviour and effectiveness for a wide range of models. First, we calibrate market data to SVI-implied volatility surfaces to price options. To cover a wide range of market dynamics, we generate price paths using two types of Monte Carlo simulations. In the first approach, price paths follow an SVCJ model (stochastic volatility with correlated jumps). The second approach simulates paths from a GARCH-filtered kernel density estimation. In these two markets, options are hedged with models from the class of affine jump diffusions and infinite-activity Lévy processes. Including a wide range of market models allows one to understand the trade-off in hedge performance between complete, but overly parsimonious models, and more complex, but incomplete models. Dynamic Delta, Delta-Gamma, Delta-Vega and minimum variance hedge strategies are applied. The calibration results reveal a strong indication of stochastic volatility, low jump intensity and evidence of infinite activity. With the exception of short-dated options, a consistently good performance is achieved with Delta-Vega hedging in stochastic volatility models. Judging by the calibration and hedging results, the study provides evidence that stochastic volatility is the driving force in CC markets.

【36】 Depth Uncertainty Networks for Active Learning Link: https://arxiv.org/abs/2112.06796

Authors: Chelsea Murray, James U. Allingham, Javier Antorán, José Miguel Hernández-Lobato Affiliations: Department of Engineering, University of Cambridge Abstract: In active learning, the size and complexity of the training dataset change over time. Simple models that are well specified by the amount of data available at the start of active learning might suffer from bias as more points are actively sampled. Flexible models that might be well suited to the full dataset can suffer from overfitting towards the start of active learning. We tackle this problem using Depth Uncertainty Networks (DUNs), a BNN variant in which the depth of the network, and thus its complexity, is inferred. We find that DUNs outperform other BNN variants on several active learning tasks. Importantly, we show that on the tasks in which DUNs perform best they present notably less overfitting than baselines.

【37】 Data-driven modelling of nonlinear dynamics by polytope projections and memory 标题:基于多面体投影和记忆的非线性动力学数据驱动建模 链接:https://arxiv.org/abs/2112.06742

作者:Niklas Wulkow,Péter Koltai,Vikram Sunkara,Christof Schütte 机构:Department of Mathematics and Computer Science, Freie Universität Berlin, Zuse Institute Berlin, Germany 摘要:我们提出了一种从数据中构建动力系统模型的数值方法。我们使用最近提出的可扩展概率近似(Scalable Probabilistic Approximation, SPA)方法,将欧几里德空间中的点投影到凸多面体上,并用表示其在多面体中位置的新的低维坐标来描述系统的投影状态。随后,我们引入一种特定的非线性变换,在多面体内构建动力学模型,并将其变换回原始状态空间。为克服投影到低维多面体可能造成的信息损失,我们使用Takens时滞嵌入定理意义下的记忆。由此构造的方法必然产生稳定的模型。我们通过多个例子说明了该方法再现混沌动力学以及具有多个连通分量的吸引子的能力。 摘要:We present a numerical method to model dynamical systems from data. We use the recently introduced method Scalable Probabilistic Approximation (SPA) to project points from a Euclidean space to convex polytopes and represent these projected states of a system in new, lower-dimensional coordinates denoting their position in the polytope. We then introduce a specific nonlinear transformation to construct a model of the dynamics in the polytope and to transform back into the original state space. To overcome the potential loss of information from the projection to a lower-dimensional polytope, we use memory in the sense of the delay-embedding theorem of Takens. By construction, our method produces stable models. We illustrate the capacity of the method to reproduce even chaotic dynamics and attractors with multiple connected components on various examples.
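
文中用到的 Takens 时滞嵌入可以用几行 NumPy 构造,如下为一个示意性实现(嵌入维数 dim 与时滞 tau 的取值依赖具体系统,此处仅为示例):

```python
import numpy as np

def delay_embedding(x, dim, tau=1):
    """按 Takens 嵌入定理构造时滞嵌入矩阵,
    返回形状 (len(x) - (dim-1)*tau, dim),
    每行为 (x_t, x_{t+tau}, ..., x_{t+(dim-1)*tau})。"""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

x = np.sin(0.1 * np.arange(500))        # 示例标量观测序列
X = delay_embedding(x, dim=3, tau=5)
```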

【38】 DriPP: Driven Point Processes to Model Stimuli Induced Patterns in M/EEG Signals 标题:DRIPP:模拟M/EEG信号中刺激诱导模式的驱动点过程 链接:https://arxiv.org/abs/2112.06652

作者:Cédric Allain,Alexandre Gramfort,Thomas Moreau 机构:Inria, Parietal team, Université Paris-Saclay, Saclay, France 摘要:对来自脑电图(EEG)和脑磁图(MEG)的非侵入性电生理信号进行定量分析,归结为识别时间模式,例如诱发反应、短暂的神经振荡爆发,以及用于数据清理的眨眼或心跳。一些工作表明,这些模式可以在无监督的情况下有效提取,例如使用卷积字典学习,从而得到对数据的基于事件的描述。给定这些事件,一个自然的问题是估计它们的发生如何被某些认知任务和实验操作所调节。为解决这一问题,我们提出了一种点过程方法。虽然点过程过去已在神经科学中使用,特别是用于单细胞记录(尖峰序列),但卷积字典学习等技术使其适用于基于EEG/MEG信号的人类研究。我们开发了一种新的统计点过程模型——驱动时间点过程(DriPP)——其中点过程的强度函数与一组对应于刺激事件的点过程相关联。我们推导了一种快速且有原则的期望最大化(EM)算法来估计该模型的参数。仿真结果表明,模型参数可以从足够长的信号中识别。标准MEG数据集上的结果表明,我们的方法能够揭示诱发(evoked)和诱导(induced)的事件相关神经反应,并分离出与任务无关的时间模式。 摘要:The quantitative analysis of non-invasive electrophysiology signals from electroencephalography (EEG) and magnetoencephalography (MEG) boils down to the identification of temporal patterns such as evoked responses, transient bursts of neural oscillations but also blinks or heartbeats for data cleaning. Several works have shown that these patterns can be extracted efficiently in an unsupervised way, e.g., using Convolutional Dictionary Learning. This leads to an event-based description of the data. Given these events, a natural question is to estimate how their occurrences are modulated by certain cognitive tasks and experimental manipulations. To address it, we propose a point process approach. While point processes have been used in neuroscience in the past, in particular for single cell recordings (spike trains), techniques such as Convolutional Dictionary Learning make them amenable to human studies based on EEG/MEG signals. We develop a novel statistical point process model, called driven temporal point processes (DriPP), where the intensity function of the point process model is linked to a set of point processes corresponding to stimulation events. We derive a fast and principled expectation-maximization (EM) algorithm to estimate the parameters of this model. Simulations reveal that model parameters can be identified from long enough signals. Results on standard MEG datasets demonstrate that our methodology reveals event-related neural responses, both evoked and induced, and isolates non-task specific temporal patterns.
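
DriPP 的核心对象是一条由刺激事件驱动的强度函数,形如 λ(t) = μ + α·Σ_i κ(t − t_i)。下面给出其示意性实现;核 κ 取截断高斯形式只是此处的假设,基线 μ、增益 α 与核参数在原文中由 EM 算法估计:

```python
import numpy as np

def intensity(t, mu, alpha, drivers, m=0.2, sigma=0.05):
    """λ(t) = mu + alpha * Σ_i κ(t - t_i), κ 为只作用于事件之后的高斯型核(示意)。"""
    dt = t[:, None] - np.asarray(drivers)[None, :]
    kappa = np.exp(-0.5 * ((dt - m) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    kappa *= (dt > 0)                       # 因果性: 刺激之前不产生贡献
    return mu + alpha * kappa.sum(axis=1)

t = np.linspace(0, 10, 2000)
lam = intensity(t, mu=0.5, alpha=1.0, drivers=[1.0, 4.0, 7.5])
```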

【39】 Bayesian Nonparametric View to Spawning 标题:目标衍生(Spawning)的贝叶斯非参数视角 链接:https://arxiv.org/abs/2112.06640

作者:Bahman Moraffah 机构:School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 摘要:在跟踪多个对象时,通常假设每个观测(测量)来自一个且仅来自一个对象。然而,我们可能遇到这样的情形:在每个时间步,每个测量可能与多个对象相关联,也可能不关联——即目标衍生(spawning)。因此,为了跟踪具有出生与死亡的多个对象,将每个测量关联到多个对象是一项关键任务。在本文中,我们提出了一种新的贝叶斯非参数方法,对每个观测可能来自未知数量对象的场景进行建模,并给出一种可行的马尔可夫链蒙特卡罗(MCMC)方法从后验分布中采样。每个时间步的对象数量本身也假定未知。随后,我们通过实验展示了非参数建模在含目标衍生事件场景中的优势。实验结果也证明了我们的框架相对于现有方法的优势。 摘要:In tracking multiple objects, it is often assumed that each observation (measurement) is originated from one and only one object. However, we may encounter a situation that each measurement may or may not be associated with multiple objects at each time step -- spawning. Therefore, the association of each measurement to multiple objects is a crucial task to perform in order to track multiple objects with birth and death. In this paper, we introduce a novel Bayesian nonparametric approach that models a scenario where each observation may be drawn from an unknown number of objects and provides a tractable Markov chain Monte Carlo (MCMC) approach to sample from the posterior distribution. The number of objects at each time step, itself, is also assumed to be unknown. We then show through experiments the advantage of nonparametric modeling in scenarios with spawning events. Our experiment results also demonstrate the advantages of our framework over the existing methods.

【40】 Orthogonal Group Synchronization with Incomplete Measurements: Error Bounds and Linear Convergence of the Generalized Power Method 标题:不完全测量的正交群同步:广义幂法的误差界和线性收敛性 链接:https://arxiv.org/abs/2112.06556

作者:Linglingzhi Zhu,Jinxin Wang,Anthony Man-Cho So 摘要:群同步是指根据带噪声的成对测量估计一组群元素。这一非凸问题受到计算机视觉、机器人和冷冻电子显微镜等众多科学领域的关注。在本文中,我们研究不完全测量下一般加性噪声模型的正交群同步问题,它比通常考虑的完全测量设定更为一般。我们从最优性条件以及投影梯度上升法——也称广义幂法(GPM)——不动点的角度刻画了正交群同步问题。值得注意的是,即使没有生成模型,这些结果依然成立。同时,我们推导了正交群同步问题的局部误差界性质,它对不同算法的收敛速度分析很有用,本身也具有独立的研究价值。最后,基于所建立的局部误差界性质,我们证明了GPM在一般加性噪声模型下线性收敛到全局最大值点。我们的理论收敛结果在若干确定性条件下成立,这些条件可以涵盖某些对抗性噪声情形;作为例子,我们将其特化到Erdős-Rényi测量图与高斯噪声的设定。 摘要:Group synchronization refers to estimating a collection of group elements from the noisy pairwise measurements. Such a nonconvex problem has received much attention from numerous scientific fields including computer vision, robotics, and cryo-electron microscopy. In this paper, we focus on the orthogonal group synchronization problem with general additive noise models under incomplete measurements, which is much more general than the commonly considered setting of complete measurements. Characterizations of the orthogonal group synchronization problem are given from perspectives of optimality conditions as well as fixed points of the projected gradient ascent method which is also known as the generalized power method (GPM). It is well worth noting that these results still hold even without generative models. In the meantime, we derive the local error bound property for the orthogonal group synchronization problem which is useful for the convergence rate analysis of different algorithms and can be of independent interest. Finally, we prove the linear convergence result of the GPM to a global maximizer under a general additive noise model based on the established local error bound property. Our theoretical convergence result holds under several deterministic conditions which can cover certain cases with adversarial noise, and as an example we specialize it to the setting of the Erdős-Rényi measurement graph and Gaussian noise.
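
广义幂法(GPM)的迭代结构非常紧凑:先做一步幂迭代,再把每个 d×d 块经极分解投影回正交群。下面是示意性的 NumPy 草图;成对测量块矩阵 C 的构造与噪声模型见原文,随机初始化仅为演示:

```python
import numpy as np

def proj_orthogonal(M):
    """极分解投影: 经 SVD 把奇异值置 1, 得到最近的正交矩阵。"""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

def gpm(C, n, d, iters=100, seed=0):
    """C: (n*d, n*d) 成对测量块矩阵; X 的第 i 个 d×d 块是第 i 个群元素的估计。"""
    rng = np.random.default_rng(seed)
    X = np.vstack([proj_orthogonal(rng.standard_normal((d, d))) for _ in range(n)])
    for _ in range(iters):
        Y = C @ X                                                   # 幂迭代步
        X = np.vstack([proj_orthogonal(Y[i*d:(i+1)*d]) for i in range(n)])
    return X
```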

【41】 Cryptocurrency Market Consolidation in 2020--2021 标题:2020--2021年加密货币市场整合 链接:https://arxiv.org/abs/2112.06552

作者:Jarosław Kwapień,Marcin Wątorek,Stanisław Drożdż 机构: Institute of Nuclear Physics, Cracow University of Technology 备注:None 摘要:研究了Binance上列出的80种最具流动性的加密货币的价格收益时间序列是否存在去趋势互相关。对于移动窗口的不同位置,应用去趋势相关矩阵的谱分析和基于该矩阵计算的最小生成树的拓扑分析。加密货币之间的相互关联性比以前更强。在特定的时间尺度上,平均互相关随着时间的推移而增加,其方式类似于从过去到现在的Epps效应放大。最小生成树也改变了它们的拓扑结构,对于短时间尺度,它们随着最大节点度的增加而变得更加集中,而对于长时间尺度,它们变得更加分散,但同时也更加相关。除了市场间的依赖性之外,还分析了加密货币市场与一些传统市场(如股票市场、商品市场和外汇市场)之间的去趋势交叉相关性。在同一动荡时期,加密货币市场与其他市场表现出更高水平的交叉相关性,而在这一时期,加密货币市场本身具有较强的交叉相关性。 摘要:Time series of price returns for 80 of the most liquid cryptocurrencies listed on Binance are investigated for the presence of detrended cross-correlations. A spectral analysis of the detrended correlation matrix and a topological analysis of the minimal spanning trees calculated based on this matrix are applied for different positions of a moving window. The cryptocurrencies become more strongly cross-correlated among themselves than they used to be before. The average cross-correlations increase with time on a specific time scale in a way that resembles the Epps effect amplification when going from past to present. The minimal spanning trees also change their topology and, for the short time scales, they become more centralized with increasing maximum node degrees, while for the long time scales they become more distributed, but also more correlated at the same time. Apart from the inter-market dependencies, the detrended cross-correlations between the cryptocurrency market and some traditional markets, like the stock markets, commodity markets, and Forex, are also analyzed. The cryptocurrency market shows higher levels of cross-correlations with the other markets during the same turbulent periods, in which it is strongly cross-correlated itself.
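
摘要中"由相关矩阵计算最小生成树并跟踪最大节点度"的流程可以用 SciPy 直接实现。下面的草图以普通相关阵代替去趋势相关阵(简化假设),距离取常用的 d = sqrt(2(1−ρ)):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def correlation_mst(rho):
    """由相关矩阵构造最小生成树, 返回边列表与各节点的度。"""
    d = np.sqrt(2.0 * (1.0 - rho))
    np.fill_diagonal(d, 0.0)
    mst = minimum_spanning_tree(d).toarray()
    edges = np.transpose(np.nonzero(mst))
    degrees = np.bincount(edges.ravel(), minlength=len(rho))
    return edges, degrees

rng = np.random.default_rng(1)
rho = np.corrcoef(rng.standard_normal((80, 200)))   # 伪造 80 种资产的相关阵
edges, degrees = correlation_mst(rho)
print(degrees.max())                                # 最大节点度: 衡量树的集中程度
```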

【42】 A Complete Characterisation of ReLU-Invariant Distributions 标题:ReLU不变分布的完全刻画 链接:https://arxiv.org/abs/2112.06532

作者:Jan Macdonald,Stephan Wäldchen 机构:Institut für Mathematik, Technische Universität Berlin 备注:39 pages, 9 Figures 摘要:我们给出了在ReLU神经网络层作用下不变的概率分布族的完整特征。在训练贝叶斯网络或分析训练过的神经网络期间,例如在不确定性量化(UQ)或可解释人工智能(XAI)的背景下,出现了对此类家族的需求。我们证明了不存在不变的参数化分布族,除非至少满足以下三个限制条件之一:第一,网络层的宽度为1,这对于实际的神经网络是不合理的。其次,族中的概率测度具有有限的支持度,基本上相当于抽样分布。第三,族的参数化不是局部Lipschitz连续的,这排除了所有计算上可行的族。最后,我们证明了这些限制是单独必要的。对于这三种情况中的每一种,我们都可以构造一个不变族,只利用其中一种限制,而不利用另外两种限制。 摘要:We give a complete characterisation of families of probability distributions that are invariant under the action of ReLU neural network layers. The need for such families arises during the training of Bayesian networks or the analysis of trained neural networks, e.g., in the context of uncertainty quantification (UQ) or explainable artificial intelligence (XAI). We prove that no invariant parametrised family of distributions can exist unless at least one of the following three restrictions holds: First, the network layers have a width of one, which is unreasonable for practical neural networks. Second, the probability measures in the family have finite support, which basically amounts to sampling distributions. Third, the parametrisation of the family is not locally Lipschitz continuous, which excludes all computationally feasible families. Finally, we show that these restrictions are individually necessary. For each of the three cases we can construct an invariant family exploiting exactly one of the restrictions but not the other two.

【43】 Top K Ranking for Multi-Armed Bandit with Noisy Evaluations 标题:具有噪声评价的多臂强盗的Top-K排名 链接:https://arxiv.org/abs/2112.06517

作者:Evrard Garcelon,Vashist Avadhanula,Alessandro Lazaric,Matteo Pirotta 机构:Meta AI, CREST, ENSAE, Core Data Science, Meta 摘要:我们考虑这样一种多臂强盗设定:在每个回合开始时,学习者接收关于每个臂真实奖励的有噪声、相互独立且可能有偏的评价,然后选择$K$个臂,目标是在$T$个回合内累积尽可能多的奖励。在每个回合中各臂真实奖励服从固定分布的假设下,我们根据评价的生成方式给出不同的算法方案和理论保证。首先,在观测函数为真实奖励的广义线性函数的一般情形下,我们证明了$\widetilde{O}(T^{2/3})$的遗憾界。另一方面,我们证明当观测函数是真实奖励的带噪线性函数时,可以得到改进的$\widetilde{O}(\sqrt{T})$遗憾界。最后,我们报告了验证上述理论结果的实证研究,与其他方法进行了全面比较,并进一步支持了该设定在实践中的意义。 摘要:We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent, and possibly biased, evaluations of the true reward of each arm and it selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are a generalized linear function of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
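
下面用一个贪心基线示意"带噪线性评价"情形下的 Top-K 选择:在线最小二乘估计评价与真实奖励之间的线性映射,再按预测奖励选出 K 个臂。原文算法还包含置信界形式的探索项与相应的遗憾分析;这里的噪声与参数设定均为假设:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, T = 20, 5, 2000
mu = rng.uniform(0, 1, n)            # 各臂真实期望奖励(对学习者未知)
a, b = 2.0, -0.5                     # 评价 = a*奖励 + b + 噪声(带噪线性情形的假设)

A, y = np.eye(2) * 1e-3, np.zeros(2) # 在线最小二乘的充分统计量
theta, regret = np.zeros(2), 0.0
best = mu.argsort()[-K:]
for t in range(T):
    evals = a * mu + b + 0.3 * rng.standard_normal(n)
    pred = theta[0] * evals + theta[1]            # 把评价映射回奖励尺度
    chosen = pred.argsort()[-K:] if t > 0 else rng.choice(n, K, replace=False)
    rewards = mu[chosen] + 0.1 * rng.standard_normal(K)
    for e_i, r_i in zip(evals[chosen], rewards):  # 用 (评价, 奖励) 对更新回归
        x = np.array([e_i, 1.0])
        A += np.outer(x, x); y += r_i * x
    theta = np.linalg.solve(A, y)
    regret += mu[best].sum() - mu[chosen].sum()
print(regret)
```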

【44】 GM Score: Incorporating inter-class and intra-class generator diversity, discriminability of disentangled representation, and sample fidelity for evaluating GANs 标题:GM评分:结合类间和类内生成器多样性、解缠表示的可区分性和样本保真度来评估GAN 链接:https://arxiv.org/abs/2112.06431

作者:Harshvardhan GM,Aanchal Sahu,Mahendra Kumar Gourisaria 备注:21 pages, 9 figures 摘要:与变分自动编码器(VAE)和玻尔兹曼机等其他生成模型相比,生成对抗网络(GAN)因样本质量更高而广受欢迎,但它们同样面临生成样本难以评估的问题。评估时必须兼顾多个方面,如生成样本的质量、类别多样性(类内与类间)、解缠潜在空间的使用、评估指标与人类感知的一致性等。在本文中,我们提出了一个新的分数——GM分数,它综合考虑样本质量、解缠表示、类内与类间多样性等多种因素,并采用精确率、召回率和F1分数等指标来衡量深度信念网络(DBN)与受限玻尔兹曼机(RBM)潜在空间的可判别性。我们对在基准MNIST数据集上训练的多种GAN(GAN、DCGAN、BiGAN、CGAN、CoupledGAN、LSGAN、SGAN、WGAN和WGAN Improved)进行了评估。 摘要:While generative adversarial networks (GAN) are popular for their higher sample quality as opposed to other generative models like the variational autoencoders (VAE) and Boltzmann machines, they suffer from the same difficulty of the evaluation of generated samples. Various aspects must be kept in mind, such as the quality of generated samples, the diversity of classes (within a class and among classes), the use of disentangled latent spaces, agreement of said evaluation metric with human perception, etc. In this paper, we propose a new score, namely, GM Score, which takes into account various factors such as sample quality, disentangled representation, and intra-class and inter-class diversity; other metrics such as precision, recall, and F1 score are employed for discriminability of the latent space of deep belief network (DBN) and restricted Boltzmann machine (RBM). The evaluation is done for different GANs (GAN, DCGAN, BiGAN, CGAN, CoupledGAN, LSGAN, SGAN, WGAN, and WGAN Improved) trained on the benchmark MNIST dataset.

【45】 Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits 标题:高斯多臂土匪UCB规则极限描述的随机微分方程 链接:https://arxiv.org/abs/2112.06423

作者:Sergey Garbar 机构:Yaroslav-the-Wise Novgorod State University, Velikiy Novgorod, Russia 备注:9 pages, 2 figures 摘要:我们考虑控制视界大小$N$已知的高斯多臂土匪的上置信界(UCB)策略,并用一个随机微分方程与常微分方程组建立其极限描述。假设各臂奖励的期望值未知而方差已知。针对奖励分布彼此接近的情形——平均奖励相差$N^{-1/2}$量级,此时归一化遗憾最大——我们进行了一组蒙特卡罗模拟,以验证所得极限描述的有效性。我们还估计了使归一化遗憾不明显超过最大可能值的控制视界的最小规模。 摘要:We consider the upper confidence bound strategy for Gaussian multi-armed bandits with known control horizon sizes $N$ and build its limiting description with a system of stochastic differential equations and ordinary differential equations. Rewards for the arms are assumed to have unknown expected values and known variances. A set of Monte-Carlo simulations was performed for the case of close distributions of rewards, when mean rewards differ by the magnitude of order $N^{-1/2}$, as it yields the highest normalized regret, to verify the validity of the obtained description. The minimal size of the control horizon for which the normalized regret is not noticeably larger than the maximum possible value was estimated.
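
文中分析的 UCB 策略本身只需几行代码。下面是已知方差高斯臂情形的一个常见实现(置信项的常数因版本而异,此处取 sigma*sqrt(2 ln t / n_i) 仅为示意),并按摘要设定让平均奖励相差 N^{-1/2} 量级:

```python
import numpy as np

def ucb_gaussian(mu, sigma, N, seed=0):
    """已知方差的高斯多臂土匪 UCB 策略, 返回累积遗憾。"""
    rng = np.random.default_rng(seed)
    J = len(mu)
    counts = np.ones(J)
    sums = mu + sigma * rng.standard_normal(J)   # 每臂先各拉一次
    regret = 0.0
    for t in range(J + 1, N + 1):
        ucb = sums / counts + sigma * np.sqrt(2 * np.log(t) / counts)
        j = int(ucb.argmax())
        sums[j] += mu[j] + sigma * rng.standard_normal()
        counts[j] += 1
        regret += mu.max() - mu[j]
    return regret

N = 10_000
print(ucb_gaussian(mu=np.array([0.0, 2 / np.sqrt(N)]), sigma=1.0, N=N))
```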

【46】 COVID-19 Forecasts via Stock Market Indicators 标题:利用股市指标进行COVID-19预测 链接:https://arxiv.org/abs/2112.06393

作者:Yi Liang,James Unwin 机构:⋆ PRIMES, Massachusetts Institute of Technology, Cambridge, MA, USA, † Department of Physics, University of Illinois at Chicago, Chicago, IL, USA 备注:11 pages, 9 figures 摘要:可靠的短期预测可以为后勤规划提供可能挽救生命的洞见,尤其有助于医院人员和设备等资源的优化配置。通过把COVID-19每日新增病例重新解释为烛台图(K线图),我们得以应用一些最流行的股市技术指标,从而获得对疫情进程的预测能力。我们对MACD、RSI和烛台分析给出定量评估,展示了它们在预测股票市场数据和WHO COVID-19数据方面的统计显著性。特别地,我们通过识别疫情后续波次的起点,展示了这一新方法的实用性。最后,我们用这些新方法评估当前的卫生政策是否正在影响新增COVID-19病例的增长。 摘要:Reliable short term forecasting can provide potentially lifesaving insights into logistical planning, and in particular, into the optimal allocation of resources such as hospital staff and equipment. By reinterpreting COVID-19 daily cases in terms of candlesticks, we are able to apply some of the most popular stock market technical indicators to obtain predictive power over the course of the pandemics. By providing a quantitative assessment of MACD, RSI, and candlestick analyses, we show their statistical significance in making predictions for both stock market data and WHO COVID-19 data. In particular, we show the utility of this novel approach by considering the identification of the beginnings of subsequent waves of the pandemic. Finally, our new methods are used to assess whether current health policies are impacting the growth in new COVID-19 cases.
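
以 MACD 为例,把股市技术指标移植到每日新增病例序列只需对指数加权移动平均做差。下面草图中的 daily_cases 是假设的以日期为索引的 pandas Series,参数 12/26/9 为股市常用默认值:

```python
import pandas as pd

def macd(series, fast=12, slow=26, signal=9):
    """对每日新增病例序列计算 MACD 线、信号线与柱状图。"""
    ema_fast = series.ewm(span=fast, adjust=False).mean()
    ema_slow = series.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line, signal_line, macd_line - signal_line

# macd_line, signal_line, hist = macd(daily_cases)
# cross_up = (hist > 0) & (hist.shift(1) <= 0)  # 柱状图上穿零轴: 新一波疫情起点的候选信号
```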

【47】 WOOD: Wasserstein-based Out-of-Distribution Detection 标题:Wood:基于Wasserstein的分布外检测 链接:https://arxiv.org/abs/2112.06384

作者:Yinan Wang,Wenbo Sun,Jionghua "Judy" Jin,Zhenyu "James" Kong,Xiaowei Yue 机构:Department of Industrial and Operations Engineering, University of Michigan 摘要:基于深度神经网络的分类器通常假设训练数据和测试数据来自同一分布。当部分测试样本来自与训练样本分布相距足够远的分布(即分布外(OOD)样本)时,训练好的神经网络往往会对这些OOD样本给出高置信度的预测。在训练用于图像分类、目标检测等任务的神经网络时,OOD样本的检测至关重要:它可以增强分类器对无关输入的鲁棒性,并提高系统在不同形式攻击下的恢复能力和安全性。OOD样本的检测有三个主要挑战:(i)所提出的OOD检测方法应兼容各种分类器体系结构(如DenseNet、ResNet),且不显著增加模型复杂度和计算资源需求;(ii)OOD样本可能来自多个分布,其类别标签通常不可得;(iii)需要定义一个评分函数,以有效地将OOD样本与分布内(InD)样本区分开。为克服这些挑战,我们提出了一种基于Wasserstein距离的分布外检测(WOOD)方法。其基本思想是定义一个基于Wasserstein距离的分数,用于评估测试样本与InD样本分布之间的差异,然后基于所提出的评分函数构造并求解一个优化问题。我们研究了该方法的统计学习界,以保证经验优化器取得的损失值逼近全局最优值。对比研究结果表明,所提出的WOOD方法一致优于其他现有的OOD检测方法。 摘要:The training and test data for deep-neural-network-based classifiers are usually assumed to be sampled from the same distribution. When part of the test samples are drawn from a distribution that is sufficiently far away from that of the training samples (a.k.a. out-of-distribution (OOD) samples), the trained neural network has a tendency to make high confidence predictions for these OOD samples. Detection of the OOD samples is critical when training a neural network used for image classification, object detection, etc. It can enhance the classifier's robustness to irrelevant inputs, and improve the system resilience and security under different forms of attacks. Detection of OOD samples has three main challenges: (i) the proposed OOD detection method should be compatible with various architectures of classifiers (e.g., DenseNet, ResNet), without significantly increasing the model complexity and requirements on computational resources; (ii) the OOD samples may come from multiple distributions, whose class labels are commonly unavailable; (iii) a score function needs to be defined to effectively separate OOD samples from in-distribution (InD) samples. To overcome these challenges, we propose a Wasserstein-based out-of-distribution detection (WOOD) method. The basic idea is to define a Wasserstein-distance-based score that evaluates the dissimilarity between a test sample and the distribution of InD samples. An optimization problem is then formulated and solved based on the proposed score function. The statistical learning bound of the proposed method is investigated to guarantee that the loss value achieved by the empirical optimizer approximates the global optimum. The comparison study results demonstrate that the proposed WOOD consistently outperforms other existing OOD detection methods.
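
WOOD 的基本思想是"用 Wasserstein 距离度量测试样本与 InD 分布的差异"。下面给出这一思想的极简示意:对特征逐维计算与 InD 训练特征的一维 Wasserstein 距离再求和;原文的分数定义在分类器输出上并伴随一个优化问题,此处仅为简化近似:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_ood_score(test_feats, ind_feats):
    """逐特征维的一维 Wasserstein 距离之和, 作为示意性 OOD 分数。"""
    return sum(
        wasserstein_distance(test_feats[:, j], ind_feats[:, j])
        for j in range(ind_feats.shape[1])
    )

rng = np.random.default_rng(3)
ind = rng.standard_normal((1000, 8))
ood = rng.standard_normal((200, 8)) + 2.0   # 均值漂移后的分布外样本
print(wasserstein_ood_score(ind[:200], ind), wasserstein_ood_score(ood, ind))
```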

【48】 Robust Voting Rules from Algorithmic Robust Statistics 标题:算法稳健统计中的稳健投票规则 链接:https://arxiv.org/abs/2112.06380

作者:Allen Liu,Ankur Moitra 摘要:本文研究Mallows模型的鲁棒学习问题。我们给出一个算法,即使其样本中有常数比例被任意破坏,也能准确估计中心排名。此外,我们的鲁棒性保证与维度无关:总体精度不依赖于被排序备选项的数量。我们的工作可以看作把算法鲁棒统计的视角自然地引入投票与信息聚合中的一个核心推断问题。具体而言,我们的投票规则可被高效计算,且其结果无法被一大群合谋的投票者显著改变。 摘要:In this work we study the problem of robustly learning a Mallows model. We give an algorithm that can accurately estimate the central ranking even when a constant fraction of its samples are arbitrarily corrupted. Moreover our robustness guarantees are dimension-independent in the sense that our overall accuracy does not depend on the number of alternatives being ranked. Our work can be thought of as a natural infusion of perspectives from algorithmic robust statistics into one of the central inference problems in voting and information-aggregation. Specifically, our voting rule is efficiently computable and its outcome cannot be changed by much by a large group of colluding voters.

【49】 Machine Learning Calabi-Yau Hypersurfaces 标题:机器学习Calabi-Yau超曲面 链接:https://arxiv.org/abs/2112.06350

作者:David S. Berman,Yang-Hui He,Edward Hirst 机构:Centre for Theoretical Physics, School of Physics and Astronomy, Queen Mary University of London, Mile End Road, London, UK, London Institute for Mathematical Sciences, Royal Institution, London, UK 备注:32 pages, 45 figures 摘要:我们借助机器学习工具箱中的多种工具,重新审视了容许Calabi-Yau三维超曲面的加权P4经典数据库。无监督技术识别出拓扑数据对权重的一种出人意料的近线性依赖,这使我们得以在Calabi-Yau数据中发现此前未被注意到的聚类结构。监督技术能够从超曲面的权重成功预测其拓扑参数,准确率达R^2 > 95%。借助该聚类行为所支持的划分,监督学习还能以100%的准确率识别容许Calabi-Yau超曲面的加权P4。 摘要:We revisit the classic database of weighted-P4s which admit Calabi-Yau 3-fold hypersurfaces equipped with a diverse set of tools from the machine-learning toolbox. Unsupervised techniques identify an unanticipated almost linear dependence of the topological data on the weights. This then allows us to identify a previously unnoticed clustering in the Calabi-Yau data. Supervised techniques are successful in predicting the topological parameters of the hypersurface from its weights with an accuracy of R^2 > 95%. Supervised learning also allows us to identify weighted-P4s which admit Calabi-Yau hypersurfaces to 100% accuracy by making use of partitioning supported by the clustering behaviour.

【50】 Fairness for Robust Learning to Rank 标题:稳健学习对排名的公平性 链接:https://arxiv.org/abs/2112.06288

作者:Omid Memarrast,Ashkan Rezaei,Rizal Fathony,Brian Ziebart 机构:Department of Computer Science, University of Illinois at Chicago, Chicago, IL, Bosch Center for Artificial Intelligence, Carnegie Mellon University, Pittsburgh, PA 摘要:传统的排名系统只关注最大化排名项目对用户的效用,而公平性感知的排名系统还试图平衡不同受保护属性(如性别或种族)群体的曝光度。为实现这种排名中的群体公平,我们基于分布鲁棒性的第一性原理推导出一个新的排名系统。我们构造了一个极小极大博弈:一方在满足公平约束的前提下选择排名上的分布以最大化效用,而对手方在匹配训练数据统计量的约束下寻求最小化效用。我们表明,与现有基线方法相比,我们的方法在高度公平的排名上提供了更好的效用。 摘要:While conventional ranking systems focus solely on maximizing the utility of the ranked items to users, fairness-aware ranking systems additionally try to balance the exposure for different protected attributes such as gender or race. To achieve this type of group fairness for ranking, we derive a new ranking system based on the first principles of distributional robustness. We formulate a minimax game between a player choosing a distribution over rankings to maximize utility while satisfying fairness constraints against an adversary seeking to minimize utility while matching statistics of the training data. We show that our approach provides better utility for highly fair rankings than existing baseline methods.

【51】 Spatial-Temporal-Fusion BNN: Variational Bayesian Feature Layer 标题:时空融合BNN:变分贝叶斯特征层 链接:https://arxiv.org/abs/2112.06281

作者:Shiye Lei,Zhuozhuo Tu,Leszek Rutkowski,Feng Zhou,Li Shen,Fengxiang He,Dacheng Tao 机构:Department of Artificial Intelligence, University of Social Sciences 摘要:贝叶斯神经网络(BNN)已成为缓解深度学习中过度自信预测的主要方法,但由于分布参数数量庞大,BNN常常面临可扩展性问题。在本文中,我们发现当深度网络的第一层被单独重新训练时,它拥有多个彼此不同的最优解。这表明当第一层被替换为贝叶斯层时后验方差较大,这促使我们设计时空融合BNN(STF-BNN),以便高效地将BNN扩展到大型模型:(1)首先从头正常训练一个神经网络,以实现快速训练;(2)将第一层转换为贝叶斯层并用随机变分推理进行推断,其余各层保持固定。与普通BNN相比,该方法可以大大减少训练时间和参数数量,从而高效地扩展BNN。我们进一步为STF-BNN的泛化能力及其缓解过度自信的能力提供了理论保证。综合实验表明,STF-BNN(1)在预测和不确定性量化方面达到了最新水平;(2)显著提高对抗鲁棒性和隐私保护;(3)大幅降低训练时间和内存开销。 摘要:Bayesian neural networks (BNNs) have become a principal approach to alleviate overconfident predictions in deep learning, but they often suffer from scaling issues due to a large number of distribution parameters. In this paper, we discover that the first layer of a deep network possesses multiple disparate optima when solely retrained. This indicates a large posterior variance when the first layer is altered by a Bayesian layer, which motivates us to design a spatial-temporal-fusion BNN (STF-BNN) for efficiently scaling BNNs to large models: (1) first normally train a neural network from scratch to realize fast training; and (2) the first layer is converted to Bayesian and inferred by employing stochastic variational inference, while other layers are fixed. Compared to vanilla BNNs, our approach can greatly reduce the training time and the number of parameters, which contributes to scale BNNs efficiently. We further provide theoretical guarantees on the generalizability and the capability of mitigating overconfidence of STF-BNN. Comprehensive experiments demonstrate that STF-BNN (1) achieves the state-of-the-art performance on prediction and uncertainty quantification; (2) significantly improves adversarial robustness and privacy preservation; and (3) considerably reduces training time and memory costs.

【52】 Learning with Subset Stacking 标题:子集堆叠学习 链接:https://arxiv.org/abs/2112.06251

作者:S. Ilker Birbil,Sinan Yildirim,Kaya Gokalp,Hakan Akyuz 机构:Amsterdam Business School, University of Amsterdam; Sabancı University, Turkey; Econometric Institute, Erasmus University Rotterdam 摘要:我们提出了一种从一组输入-输出对中学习的新算法。该算法面向输入变量与输出变量之间的关系在预测空间上表现出异质行为的总体。算法首先生成集中在输入空间随机点周围的子集,随后为每个子集训练一个局部预测器,再以一种新颖的方式组合这些预测器,得到整体预测器。由于与堆叠回归方法相似,我们称该算法为"LEarning with Subset Stacking"(LESS,子集堆叠学习)。我们在多个数据集上将LESS与最先进方法的测试性能进行了比较,结果表明LESS是一种有竞争力的监督学习方法。此外,我们观察到LESS在计算时间上也很高效,并且可以直接并行实现。 摘要:We propose a new algorithm that learns from a set of input-output pairs. Our algorithm is designed for populations where the relation between the input variables and the output variable exhibits a heterogeneous behavior across the predictor space. The algorithm starts with generating subsets that are concentrated around random points in the input space. This is followed by training a local predictor for each subset. Those predictors are then combined in a novel way to yield an overall predictor. We call this algorithm "LEarning with Subset Stacking" or LESS, due to its resemblance to method of stacking regressors. We compare the testing performance of LESS with the state-of-the-art methods on several datasets. Our comparison shows that LESS is a competitive supervised learning method. Moreover, we observe that LESS is also efficient in terms of computation time and it allows a straightforward parallel implementation.
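
LESS 的"随机锚点 → 近邻子集 → 局部预测器 → 组合"流程可以用几十行代码勾勒。下面是一个示意性草图:局部模型取 Ridge 回归,组合权重取与锚点距离的高斯函数;原文的组合方式(堆叠)更精细,这些选择均为此处假设:

```python
import numpy as np
from sklearn.linear_model import Ridge

def less_fit_predict(X, y, X_test, n_subsets=20, frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = max(2, int(frac * len(X)))
    anchors = X[rng.choice(len(X), n_subsets, replace=False)]
    preds, weights = [], []
    for c in anchors:
        idx = np.argsort(((X - c) ** 2).sum(axis=1))[:k]   # 锚点附近的子集
        local = Ridge(alpha=1.0).fit(X[idx], y[idx])       # 局部预测器
        preds.append(local.predict(X_test))
        d2 = ((X_test - c) ** 2).sum(axis=1)
        weights.append(np.exp(-d2 / (d2.mean() + 1e-12)))  # 距离越近权重越大
    P, W = np.array(preds), np.array(weights)
    return (P * W).sum(axis=0) / (W.sum(axis=0) + 1e-12)
```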

【53】 Measuring Complexity of Learning Schemes Using Hessian-Schatten Total-Variation 标题:用Hessian-Schatten全变差度量学习方案的复杂性 链接:https://arxiv.org/abs/2112.06209

作者:Shayan Aziznejad,Joaquim Campos,Michael Unser 摘要:在本文中,我们介绍了Hessian-Schatten全变差(HTV)——一种量化多元函数总"粗糙度"的新半范数。我们定义HTV的动机是评估监督学习方案的复杂性。我们首先给出配备适当混合范数类的矩阵值Banach空间,然后证明HTV对旋转、缩放和平移不变。此外,它的最小值在线性映射处取得,这支持了"线性回归是最不复杂的学习模型"这一常见直觉。接下来,我们给出计算两类一般函数的HTV的闭式表达式。第一类是具有一定正则性的Sobolev函数,我们证明对这一类函数,HTV与有时被用作图像重建正则化子的Hessian-Schatten半范数一致。第二类是连续分段线性(CPWL)函数。在这种情形下,我们表明HTV反映了具有公共面的线性区域之间斜率的总变化,因此可以把它看作CPWL映射线性区域数量(l0型)的凸松弛(l1型)。最后,我们用一些具体例子说明了所提半范数的使用。 摘要:In this paper, we introduce the Hessian-Schatten total-variation (HTV) -- a novel seminorm that quantifies the total "rugosity" of multivariate functions. Our motivation for defining HTV is to assess the complexity of supervised learning schemes. We start by specifying the adequate matrix-valued Banach spaces that are equipped with suitable classes of mixed-norms. We then show that HTV is invariant to rotations, scalings, and translations. Additionally, its minimum value is achieved for linear mappings, supporting the common intuition that linear regression is the least complex learning model. Next, we present closed-form expressions for computing the HTV of two general classes of functions. The first one is the class of Sobolev functions with a certain degree of regularity, for which we show that HTV coincides with the Hessian-Schatten seminorm that is sometimes used as a regularizer for image reconstruction. The second one is the class of continuous and piecewise linear (CPWL) functions. In this case, we show that the HTV reflects the total change in slopes between linear regions that have a common facet. Hence, it can be viewed as a convex relaxation (l1-type) of the number of linear regions (l0-type) of CPWL mappings. Finally, we illustrate the use of our proposed seminorm with some concrete examples.
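
按摘要所述的两类闭式表达式,HTV 可以示意性地写成下面的形式(严格的函数空间设定与常数约定以原文为准):

```latex
% 对具有一定正则性的函数 f: Hessian 的 Schatten-1 范数在全空间的积分
\mathrm{HTV}(f) = \int_{\mathbb{R}^d} \bigl\| \mathrm{H}f(\mathbf{x}) \bigr\|_{S_1} \, \mathrm{d}\mathbf{x},
\qquad
\bigl\| \mathrm{H}f(\mathbf{x}) \bigr\|_{S_1} = \sum_{i=1}^{d} \sigma_i\bigl(\mathrm{H}f(\mathbf{x})\bigr).

% 对 CPWL 函数: 相邻线性区域在公共面 F 上的斜率变化量, 按面的 (d-1) 维测度加权求和
\mathrm{HTV}(f) = \sum_{F} \bigl\| \nabla f_{+} - \nabla f_{-} \bigr\|_{2} \, \mathcal{H}^{d-1}(F).
```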

【54】 Convergence of Generalized Belief Propagation Algorithm on Graphs with Motifs 标题:含模体图上广义信念传播算法的收敛性 链接:https://arxiv.org/abs/2112.06087

作者:Yitao Chen,Deepanshu Vasal 机构:Qualcomm, USA, Northwestern University 备注:10 pages, 2 figures 摘要:信念传播是机器学习中众多应用的基础消息传递算法。众所周知,信念传播算法在树图上是精确的。然而在大多数应用中,信念传播是在有环图上运行的,因此理解有环图上信念传播的行为一直是各领域研究人员的重要课题。本文研究含模体(三角形、环等)的图上广义信念传播算法的收敛行为,并证明在某种初始化下,对于含模体图上的铁磁Ising模型,广义信念传播收敛到Bethe自由能的全局最优解。 摘要:Belief propagation is a fundamental message-passing algorithm for numerous applications in machine learning. It is known that belief propagation algorithm is exact on tree graphs. However, belief propagation is run on loopy graphs in most applications. So, understanding the behavior of belief propagation on loopy graphs has been a major topic for researchers in different areas. In this paper, we study the convergence behavior of generalized belief propagation algorithm on graphs with motifs (triangles, loops, etc.) We show under a certain initialization, generalized belief propagation converges to the global optimum of the Bethe free energy for ferromagnetic Ising models on graphs with motifs.

【55】 On spectral distribution of sample covariance matrices from large dimensional and large k-fold tensor products 标题:关于高维大$k$重张量积样本协方差矩阵的谱分布

作者:Benoît Collins,Jianfeng Yao,Wangjun Yuan 备注:21 pages, 6 figures 摘要:我们研究大$n$维向量的独立秩一$k$重张量积之和的特征值分布。文献中已有的结果假设$k=o(n)$,并表明在基向量的适当矩条件下,特征值分布收敛于著名的Marčenko-Pastur定律。本文受量子信息论的启发,研究$k$增长更快的情形,即$k=O(n)$。我们证明特征值分布的矩序列存在极限,且该极限不同于Marčenko-Pastur定律。作为副产品,我们证明对该张量模型,Marčenko-Pastur极限成立当且仅当$k=o(n)$。该方法基于矩量法。 摘要:We study the eigenvalue distributions for sums of independent rank-one $k$-fold tensor products of large $n$-dimensional vectors. Previous results in the literature assume that $k=o(n)$ and show that the eigenvalue distributions converge to the celebrated Marčenko-Pastur law under appropriate moment conditions on the base vectors. In this paper, motivated by quantum information theory, we study the regime where $k$ grows faster, namely $k=O(n)$. We show that the moment sequences of the eigenvalue distributions have a limit, which is different from the Marčenko-Pastur law. As a byproduct, we show that the Marčenko-Pastur law limit holds if and only if $k=o(n)$ for this tensor model. The approach is based on the method of moments.
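
该模型的经验谱分布可以在很小的规模上直接数值模拟(维度 n^k 随 k 指数增长,只能取很小的 n、k;基向量的归一化与矩条件以原文为准,此处取法仅为示意):

```python
import numpy as np
from functools import reduce

def tensor_sample_spectrum(n, k, m, seed=0):
    """m 个独立秩一 k 重张量积之和 (n^k 维样本协方差型矩阵) 的特征值。"""
    rng = np.random.default_rng(seed)
    S = np.zeros((n**k, n**k))
    for _ in range(m):
        xs = rng.standard_normal((k, n)) / np.sqrt(n)   # k 个独立基向量(示意性归一化)
        v = reduce(np.kron, xs)                         # k 重张量积(Kronecker 积)
        S += np.outer(v, v)
    return np.linalg.eigvalsh(S)

eig = tensor_sample_spectrum(n=4, k=4, m=256)           # 维度 4**4 = 256
```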

【56】 U.S. Long-Term Earnings Outcomes by Sex, Race, Ethnicity, and Place of Birth 标题:按性别、种族、民族和出生地划分的美国长期收入结果 链接:https://arxiv.org/abs/2112.05822

作者:Kevin L. McKinney,John M. Abowd,Hubert P. Janicki 备注:77 pages, 42 figures 摘要:本文是"全球收入动态项目"关于收入不平等、波动性与流动性跨国比较的一部分。利用美国人口普查局纵向雇主-家庭动态(LEHD)基础设施文件中的数据,我们为美国编制了一套统一的收入统计数据。我们发现,1998年至2019年间,美国收入不平等有所加剧而收入波动性有所下降;二者的组合表明不同人口群体的收入增长存在显著差异。我们通过估计25-54岁合格工人单一队列的12年平均收入进一步探究这一点。研究发现,劳动力供给(带薪小时数与工作季度数)的差异可以解释近90%的工人收入差异,不过即便控制了劳动力供给,人口群体之间仍存在无法解释的显著收入差距。利用分位数回归方法,我们估计了每个人口群体的反事实收入分布。我们发现,在收入分布的底部,带薪小时数、地理区划、行业和教育等特征的差异几乎可以解释全部收入差距;但在中位数以上,特征回报差异的贡献成为主导成分。 摘要:This paper is part of the Global Income Dynamics Project cross-country comparison of earnings inequality, volatility, and mobility. Using data from the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files we produce a uniform set of earnings statistics for the U.S. From 1998 to 2019, we find U.S. earnings inequality has increased and volatility has decreased. The combination of increased inequality and reduced volatility suggest earnings growth differs substantially across different demographic groups. We explore this further by estimating 12-year average earnings for a single cohort of age 25-54 eligible workers. Differences in labor supply (hours paid and quarters worked) are found to explain almost 90% of the variation in worker earnings, although even after controlling for labor supply substantial earnings differences across demographic groups remain unexplained. Using a quantile regression approach, we estimate counterfactual earnings distributions for each demographic group. We find that at the bottom of the earnings distribution differences in characteristics such as hours paid, geographic division, industry, and education explain almost all the earnings gap, however above the median the contribution of the differences in the returns to characteristics becomes the dominant component.
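
摘要中"分位数回归 + 反事实分布"的做法可以用 statsmodels 勾勒如下。变量与数据均为虚构示例(非 LEHD 数据),仅演示按分位点拟合系数、再把一组人群的特征代入另一组系数的思路:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "log_earn": rng.normal(10.0, 1.0, 2000),   # 对数收入(虚构)
    "hours": rng.normal(1800, 300, 2000),      # 带薪小时数(虚构)
    "educ": rng.integers(10, 21, 2000),        # 受教育年限(虚构)
})
fits = {q: smf.quantreg("log_earn ~ hours + educ", df).fit(q=q)
        for q in (0.1, 0.5, 0.9)}
# 反事实: 用 A 组的特征乘以 B 组在各分位点的系数,
# 即得 "A 组特征 + B 组回报" 的反事实收入分布
```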
