stat (Statistics), 30 papers in total
【1】 Kalman Filtering with Adversarial Corruptions Link: https://arxiv.org/abs/2111.06395
Authors: Sitan Chen, Frederic Koehler, Ankur Moitra, Morris Yau Affiliations: UC Berkeley, MIT Comments: 57 pages, comments welcome Abstract: Here we revisit the classic problem of linear quadratic estimation, i.e. estimating the trajectory of a linear dynamical system from noisy measurements. The celebrated Kalman filter gives an optimal estimator when the measurement noise is Gaussian, but is widely known to break down when one deviates from this assumption, e.g. when the noise is heavy-tailed. Many ad hoc heuristics have been employed in practice for dealing with outliers. In a pioneering work, Schick and Mitter gave provable guarantees when the measurement noise is a known infinitesimal perturbation of a Gaussian and raised the important question of whether one can get similar guarantees for large and unknown perturbations. In this work we give a truly robust filter: we give the first strong provable guarantees for linear quadratic estimation when even a constant fraction of measurements have been adversarially corrupted. This framework can model heavy-tailed and even non-stationary noise processes. Our algorithm robustifies the Kalman filter in the sense that it competes with the optimal algorithm that knows the locations of the corruptions. Our work is in a challenging Bayesian setting where the number of measurements scales with the complexity of what we need to estimate. Moreover, in linear dynamical systems past information decays over time.
We develop a suite of new techniques to robustly extract information across different time steps and over varying time scales.
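For readers who want the baseline in concrete form, the following is a minimal sketch of the classical (non-robust) scalar Kalman filter that the abstract takes as its starting point. It is not the robust algorithm of the paper. The function name `kalman_filter_1d`, the scalar model, and all parameter names are assumptions chosen for illustration.

```python
def kalman_filter_1d(measurements, a, c, q, r, x0, p0):
    """Classical scalar Kalman filter (illustrative sketch, not the paper's method).

    Assumed model: x_{t+1} = a * x_t + w_t,  y_t = c * x_t + v_t,
    with w_t ~ N(0, q), v_t ~ N(0, r), and prior x_0 ~ N(x0, p0).
    Returns the filtered means E[x_t | y_0, ..., y_t].
    """
    x, p = x0, p0
    estimates = []
    for y in measurements:
        # Update step: fold the new measurement into the current belief.
        gain = p * c / (c * c * p + r)
        x = x + gain * (y - c * x)
        p = (1.0 - gain * c) * p
        estimates.append(x)
        # Predict step: push the belief through the linear dynamics.
        x = a * x
        p = a * a * p + q
    return estimates


if __name__ == "__main__":
    # A static state (a = 1, q = 0) observed directly with unit noise.
    print(kalman_filter_1d([1.0, 2.0, 3.0], a=1.0, c=1.0, q=0.0, r=1.0, x0=0.0, p0=1e6))
```

Because each update is linear in the measurement `y`, a single grossly corrupted measurement shifts the estimate in proportion to its size, which is the sensitivity to outliers that the abstract describes.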
【2】 Full Characterization of Adaptively Strong Majority Voting in Crowdsourcing Link: https://arxiv.org/abs/2111.06390
Authors: Margarita Boyarskaya, Panos Ipeirotis Affiliations: New York University Abstract: A commonly used technique for quality control in crowdsourcing is to task the workers with examining an item and voting on whether the item is labeled correctly. To counteract possible noise in worker responses, one solution is to keep soliciting votes from more workers until the difference between the numbers of votes for the two possible outcomes exceeds a pre-specified threshold $\delta$. We show a way to model such a $\delta$-margin voting consensus aggregation process using absorbing Markov chains. We provide closed-form equations for the key properties of this voting process -- namely, for the quality of the results, the expected number of votes to completion, the variance of the required number of votes, and other moments of the distribution. Using these results, we show further that one can adapt the value of the threshold $\delta$ to achieve quality-equivalence across voting processes that employ workers of different accuracy levels. We then use this result to provide efficiency-equalizing payment rates for groups of workers characterized by different levels of response accuracy. Finally, we perform a set of simulated experiments using both fully synthetic data and real-life crowdsourced votes. We show that our theoretical model characterizes the outcomes of the consensus aggregation process well.
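The voting process above can be illustrated with the textbook gambler's-ruin formulas for a random walk with two absorbing barriers. The sketch assumes independent votes, each correct with the same probability `p`, and stops once one label leads by `delta`. These are standard results and may differ in form from the paper's own expressions; the function names are hypothetical.

```python
def delta_margin_quality(p, delta):
    """Probability that the correct label is the first to lead by `delta` votes.

    Assumes independent votes, each correct with probability p (0 < p < 1).
    """
    q = 1.0 - p
    return p ** delta / (p ** delta + q ** delta)


def delta_margin_expected_votes(p, delta):
    """Expected number of votes until one label leads by `delta`."""
    if abs(p - 0.5) < 1e-12:
        # Symmetric random walk started midway between the two barriers.
        return float(delta * delta)
    quality = delta_margin_quality(p, delta)
    return delta * (2.0 * quality - 1.0) / (2.0 * p - 1.0)


if __name__ == "__main__":
    for accuracy in (0.6, 0.7, 0.8):
        print(accuracy,
              delta_margin_quality(accuracy, 3),
              delta_margin_expected_votes(accuracy, 3))
```

Under these assumptions, raising `delta` for a less accurate worker pool can match the quality of a more accurate pool at the price of more votes, which is the kind of trade-off the abstract refers to as quality-equivalence.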
【3】 Design of porthole aluminium extrusion dies through mathematical formulation Link: https://arxiv.org/abs/2111.06370
Authors: Juan Llorca-Schenk, Irene Sentana-Gadea, Miguel Sanchez-Lozano Affiliations: Department of Graphic Expression, Composition and Projects, University of Alicante, Alicante, Spain; Department of Mechanical Engineering and Energy, Miguel Hernandez University, Elche, Spain Abstract: A mathematical approach to solve the porthole die design problem is achieved by statistical analysis of a large amount of geometric data of successful porthole die designs. Linear and logarithmic regression are used to analyse geometrical data of 596 different ports from 88 first trial dies. Non-significant variables or highly correlated variables are discarded according to knowledge of the extrusion process and statistical criteria. This paper focuses on a validation model for a typical case of porthole dies: dies with four cavities and four ports per cavity. This mathematical formulation is a way of summarizing in a single expression the experience accumulated in a large number of designs over time. A broad avenue of research remains open: generalising this model or extending it to other types of porthole dies.
【4】 Simulating High-Dimensional Multivariate Data using the bigsimr R Package Link: https://arxiv.org/abs/2111.06327
Authors: A. Grant Schissler, Edward J. Bedrick, Alexander D. Knudson, Tomasz J. Kozubowski, Tin Nguyen, Juli Petereit, Walter W. Piegorsch, Duc Tran Affiliations: Department of Mathematics & Statistics, University of Nevada, Reno, Epidemiology and Biostatistics Department, University of Arizona, Department of Computer Science & Engineering, Anna K. Panorska, Nevada Bioinformatics Center, Mathematics Department Comments: 22 pages, 10 figures Abstract: It is critical to accurately simulate data when employing Monte Carlo techniques and evaluating statistical methodology. Measurements are often correlated and high dimensional in this era of big data, such as data obtained in high-throughput biomedical experiments. Due to the computational complexity and a lack of user-friendly software available to simulate these massive multivariate constructions, researchers resort to simulation designs that posit independence or perform arbitrary data transformations. To close this gap, we developed the Bigsimr Julia package with R and Python interfaces. This paper focuses on the R interface. These packages empower high-dimensional random vector simulation with arbitrary marginal distributions and dependency via a Pearson, Spearman, or Kendall correlation matrix. bigsimr contains high-performance features, including multi-core and graphical-processing-unit-accelerated algorithms to estimate correlation and compute the nearest correlation matrix. Monte Carlo studies quantify the accuracy and scalability of our approach, up to $d=10,000$.
We describe example workflows and apply them to a high-dimensional data set -- RNA-sequencing data obtained from breast cancer tumor samples.
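The idea of simulating correlated variables with prescribed marginal distributions can be sketched in two dimensions with a Gaussian copula. This is a generic illustration in plain Python and does not use or reproduce the bigsimr API. The function names, the choice of exponential margins, and the parameter `rho` (the correlation of the latent normals) are assumptions.

```python
import math
import random


def upper_tail(z):
    """P(Z > z) for a standard normal Z, computed stably in the far tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def simulate_correlated_exponentials(n, rho, rate_x, rate_y, seed=0):
    """Draw n pairs with exponential margins coupled through a Gaussian copula.

    rho is the correlation of the latent normals, not of the output pairs.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Correlated standard normals with latent correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # With u = Phi(z), the inverse exponential CDF is -log(1 - u) / rate,
        # and 1 - u is exactly the upper tail probability of z.
        x = -math.log(upper_tail(z1)) / rate_x
        y = -math.log(upper_tail(z2)) / rate_y
        pairs.append((x, y))
    return pairs


if __name__ == "__main__":
    sample = simulate_correlated_exponentials(5, rho=0.8, rate_x=1.0, rate_y=2.0)
    for x, y in sample:
        print(round(x, 3), round(y, 3))
```

The Pearson correlation of the output pairs generally differs from `rho`, since the marginal transforms are nonlinear. Choosing the latent correlation so that the outputs hit a target correlation, and doing so in thousands of dimensions, is the harder problem that a dedicated package addresses.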
【5】 ORION-AE: Multisensor acoustic emission datasets reflecting supervised untightening of bolts in a jointed vibrating structure Link: https://arxiv.org/abs/2111.06322
Authors: Benoit Verdin, Emmanuel Ramasso, Gael Chevallier Affiliations: Department of Applied Mechanics, Institut FEMTO-ST, UBFC/UFC/ENSMM/CNRS/UTBM, Rue Alain Savary, Besançon, France Abstract: Monitoring loosening in jointed structures during operation is challenging because contact and friction in bolted joints induce a nonlinear stochastic behavior. The purpose of this work is to provide datasets from acoustic emission (AE) sensors obtained during harmonic vibration tests. These datasets have been used to validate the development of a new clustering method for time-series, based on an original loss function that includes parameters on cluster onset, kinetics and accumulation. The datasets originate from a jointed structure called ORION, which is an original test-rig designed for vibration study. It integrates several sensors that provide the vibration velocity from a laser vibrometer and the acoustic signals from three different AE sensors. A design of experiments made it possible to acquire data streams in different operating conditions by varying the tightening configurations. These datasets can be used to challenge machine learning and signal processing methods.
【6】 Proper Scoring Rules, Gradients, Divergences, and Entropies for Paths and Time Series Link: https://arxiv.org/abs/2111.06314
Authors: Patric Bonnier, Harald Oberhauser Affiliations: Mathematical Institute, University of Oxford Abstract: Many forecasts consist not of point predictions but concern the evolution of quantities. For example, a central bank might predict the interest rates during the next quarter, an epidemiologist might predict trajectories of infection rates, a clinician might predict the behaviour of medical markers over the next day, etc. The situation is further complicated since these forecasts sometimes only concern the approximate "shape of the future evolution" or "order of events". Formally, such forecasts can be seen as probability measures on spaces of equivalence classes of paths modulo time-parametrization. We leverage the statistical framework of proper scoring rules with classical mathematical results to derive a principled approach to decision making with such forecasts. In particular, we introduce notions of gradients, entropy, and divergence that are tailor-made to respect the underlying non-Euclidean structure.
【7】 On Recovering the Best Rank-r Approximation from Few Entries Link: https://arxiv.org/abs/2111.06302
Authors: Shun Xu, Ming Yuan Affiliations: Columbia University Abstract: In this note, we investigate how well we can reconstruct the best rank-$r$ approximation of a large matrix from a small number of its entries. We show that even if a data matrix is of full rank and cannot be approximated well by a low-rank matrix, its best low-rank approximations may still be reliably computed or estimated from a small number of its entries. This is especially relevant from a statistical viewpoint: the best low-rank approximations to a data matrix are often of more interest than the matrix itself because they capture the more stable and oftentimes more reproducible properties of an otherwise complicated data-generating model. In particular, we investigate two agnostic approaches: the first is based on spectral truncation; and the second is a projected gradient descent based optimization procedure. We argue that, while the first approach is intuitive and reasonably effective, the latter has far superior performance in general. We show that the error depends on how close the matrix is to being of low rank. Both theoretical and numerical evidence is presented to demonstrate the effectiveness of the proposed approaches.
【8】 Pool samples to efficiently estimate pathogen prevalence dynamics Link: https://arxiv.org/abs/2111.06249
Authors: Braden Scherting, Alison Peel, Raina Plowright, Andrew Hoegh Affiliations: Dept. Mathematical Sciences, Montana State University, MT, USA, Centre for Planetary Health and Food Security, Griffith University, Queensland, AU, Dept. Microbiology and Immunology Abstract: Estimating the prevalence of a disease is necessary for evaluating and mitigating risks of its transmission within or between populations. Estimates that consider how prevalence changes with time provide more information about these risks but are difficult to obtain due to the necessary sampling intensity and commensurate testing costs. We propose pooling and jointly testing multiple samples to reduce testing costs and use a novel nonparametric, hierarchical Bayesian model to infer population prevalence from the pooled test results. Through two synthetic studies and two case studies of natural infection data, this approach is shown to reduce uncertainty compared to individual testing at the same budget and to produce similar estimates compared to individual testing at a much higher budget.
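To make the pooling idea concrete, the sketch below computes the classical maximum-likelihood prevalence estimate from pooled results at a single time point. It assumes equal pool sizes, independent samples, and a perfectly accurate test, and it is far simpler than the hierarchical Bayesian model of the paper, which tracks prevalence over time. The function name is hypothetical.

```python
def pooled_prevalence_estimate(positive_pools, total_pools, pool_size):
    """MLE of individual-level prevalence from equal-size pools.

    Assumes independent samples and a test with perfect sensitivity and
    specificity (an illustrative simplification).
    """
    if total_pools <= 0 or pool_size <= 0:
        raise ValueError("total_pools and pool_size must be positive")
    if not 0 <= positive_pools <= total_pools:
        raise ValueError("positive_pools must lie between 0 and total_pools")
    negative_fraction = 1.0 - positive_pools / total_pools
    # A pool tests negative only if every sample in it is negative, so
    # P(pool negative) = (1 - prevalence) ** pool_size.
    return 1.0 - negative_fraction ** (1.0 / pool_size)


if __name__ == "__main__":
    # 19 positive pools out of 100 pools of 2 samples each.
    print(pooled_prevalence_estimate(19, 100, 2))
```

In this example 100 tests cover 200 samples, which is the cost saving that motivates pooling. The estimate degenerates to 1 when every pool is positive, so pool sizes have to be chosen with the plausible prevalence range in mind.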
【9】 ARISE: ApeRIodic SEmi-parametric Process for Efficient Markets without Periodogram and Gaussianity Assumptions Link: https://arxiv.org/abs/2111.06222
Authors: Shao-Qun Zhang, Zhi-Hua Zhou Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China Abstract: Mimicking and learning the long-term memory of efficient markets is a fundamental problem in the interaction between machine learning and financial economics on sequential data. Despite the prominence of this issue, current treatments either remain largely limited to heuristic techniques or rely significantly on periodogram or Gaussianity assumptions. In this paper, we present the ApeRIodic SEmi-parametric (ARISE) process for investigating efficient markets. The ARISE process is formulated as an infinite-sum function of some known processes and employs aperiodic spectrum estimation to determine the key hyper-parameters, thus possessing the power and potential to model price data with long-term memory, non-stationarity, and aperiodic spectrum. We further show theoretically that the ARISE process has mean-square convergence, consistency, and asymptotic normality without periodogram and Gaussianity assumptions. In practice, we apply the ARISE process to identify the efficiency of real-world markets. Besides, we also provide two alternative ARISE applications: studying the long-term memorability of various machine-learning models and developing a latent state-space model for inference and forecasting of time series. The numerical experiments confirm the superiority of our proposed approaches.
【10】 Integrating metabolic networks and growth biomarkers to unveil potential mechanisms of obesity Link: https://arxiv.org/abs/2111.06212
Authors: Andrea Cremaschi, Maria De Iorio, Narasimhan Kothandaraman, Fabian Yap, Mya Tway Tint, Johan Eriksson Affiliations: Singapore Institute for Clinical Sciences, A*STAR, Singapore, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Division of Science, Yale-NUS College, Singapore, Department of Statistical Science, University College London, UK Abstract: The prevalence of chronic non-communicable diseases such as obesity has noticeably increased in the last decade. The study of these diseases in early life is of paramount importance in determining their course in adult life and in supporting clinical interventions. Recently, attention has been drawn to approaches that study the alteration of metabolic pathways in obese children. In this work, we propose a novel joint modelling approach for the analysis of growth biomarkers and metabolite concentrations, to unveil metabolic pathways related to child obesity. Within a Bayesian framework, we flexibly model the temporal evolution of growth trajectories and metabolic associations through the specification of a joint non-parametric random effect distribution which also allows for clustering of the subjects, thus identifying risk sub-groups. Growth profiles as well as patterns of metabolic associations determine the clustering structure. Inclusion of risk factors is straightforward through the specification of a regression term. We demonstrate the proposed approach on data from the Growing Up in Singapore Towards healthy Outcomes (GUSTO) cohort study, based in Singapore.
Posterior inference is obtained via a tailored MCMC algorithm, accommodating a nonparametric prior with mixed support. Our analysis has identified potential key pathways in obese children that allow for exploration of possible molecular mechanisms associated with child obesity.
【11】 Robust Integrative Biclustering for Multi-view Data Link: https://arxiv.org/abs/2111.06209
Authors: W. Zhang, C. Wendt, R. Bowler, C. P. Hersh, S. E. Safo Affiliations: Division of Biostatistics, Division of Pulmonary, Allergy and Critical Care, University of Minnesota, Minneapolis, USA, Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, National Jewish Health, Denver, CO, USA Abstract: In many biomedical studies, multiple views of data (e.g., genomics, proteomics) are available, and a particular interest might be the detection of sample subgroups characterized by specific groups of variables. Biclustering methods are well-suited for this problem as they assume that specific groups of variables might be relevant only to specific groups of samples. Many biclustering methods exist for identifying row-column clusters in a single view, but few methods exist for data from multiple views. The few existing algorithms depend heavily on regularization parameters to obtain row-column clusters, and they impose an unnecessary burden on users, thus limiting their use in practice. We extend an existing biclustering method based on sparse singular value decomposition for single-view data to data from multiple views. Our method, integrative sparse singular value decomposition (iSSVD), incorporates stability selection to control Type I error rates, estimates the probability that samples and variables belong to a bicluster, finds stable biclusters, and results in interpretable row-column associations. Simulations and real data analyses show that iSSVD outperforms several other single- and multi-view biclustering methods and is able to detect meaningful biclusters.
iSSVD is a user-friendly, computationally efficient algorithm that will be useful in many disease subtyping applications.
【12】 Haar-Weave-Metropolis kernel Link: https://arxiv.org/abs/2111.06148
Authors: Kengo Kamatani, Xiaolin Song Affiliations: Institute of Statistical Mathematics, Graduate School of Engineering Science, Osaka University Comments: 24 pages, 3 figures Abstract: Recently, many Markov chain Monte Carlo methods have been developed with deterministic reversible transform proposals inspired by the Hamiltonian Monte Carlo method. The deterministic transform is relatively easy to reconcile with the local information (gradient etc.) of the target distribution. However, as the ergodic theory suggests, these deterministic proposal methods seem to be incompatible with robustness and lead to poor convergence, especially in the case of target distributions with heavy tails. On the other hand, the Markov kernel using the Haar measure is relatively robust since it learns global information about the target distribution by introducing global parameters. However, it requires a density preserving condition, and many deterministic proposals break this condition. In this paper, we carefully select deterministic transforms that preserve the structure and create a Markov kernel, the Weave-Metropolis kernel, using the deterministic transforms. By combining with the Haar measure, we also introduce the Haar-Weave-Metropolis kernel. In this way, the Markov kernel can employ the local information of the target distribution using the deterministic proposal, and thanks to the Haar measure, it can employ the global information of the target distribution.
Finally, we show through numerical experiments that the performance of the proposed method is superior to other methods in terms of effective sample size and mean square jump distance per second.
【13】 On the Equivalence between Neural Network and Support Vector Machine Link: https://arxiv.org/abs/2111.06063
Authors: Yilan Chen, Wei Huang, Lam M. Nguyen, Tsui-Wei Weng Affiliations: Computer Science and Engineering, University of California San Diego, La Jolla, CA, Engineering and Information Technology, University of Technology Sydney, Ultimo, Australia, IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Abstract: Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK) \citep{jacot2018neural}. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with NTK \citep{arora2019exact}. However, the equivalence is currently only known for ridge regression \citep{arora2019harnessing}, while the equivalence between NN and other kernel machines (KMs), e.g. support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, between the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent.
Our main theoretical results include establishing the equivalence between NN and a broad family of $\ell_2$ regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate that our theory can enable three practical applications, including (i) a \textit{non-vacuous} generalization bound of NN via the corresponding KM; (ii) a \textit{non-trivial} robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments is available at \url{https://github.com/leslie-CH/equiv-nn-svm}.
【14】 Tight bounds for minimum l1-norm interpolation of noisy data Link: https://arxiv.org/abs/2111.05987
Authors: Guillaume Wang, Konstantin Donhauser, Fanny Yang Affiliations: ETH Zurich, Department of Computer Science, ETH AI Center Comments: 29 pages, 1 figure Abstract: We provide matching upper and lower bounds of order $\sigma^2/\log(d/n)$ for the prediction error of the minimum $\ell_1$-norm interpolator, a.k.a. basis pursuit. Our result is tight up to negligible terms when $d \gg n$, and is the first to imply asymptotic consistency of noisy minimum-norm interpolation for isotropic features and sparse ground truths. Our work complements the literature on "benign overfitting" for minimum $\ell_2$-norm interpolation, where asymptotic consistency can be achieved only when the features are effectively low-dimensional.
【15】 SyMetric: Measuring the Quality of Learnt Hamiltonian Dynamics Inferred from Vision Link: https://arxiv.org/abs/2111.05986
Authors: Irina Higgins, Peter Wirnsberger, Andrew Jaegle, Aleksandar Botev Affiliations: DeepMind, London Abstract: A recently proposed class of models attempts to learn latent dynamics from high-dimensional observations, like images, using priors informed by Hamiltonian mechanics. While these models have important potential applications in areas like robotics or autonomous driving, there is currently no good way to evaluate their performance: existing methods primarily rely on image reconstruction quality, which does not always reflect the quality of the learnt latent dynamics. In this work, we empirically highlight the problems with the existing measures and develop a set of new measures, including a binary indicator of whether the underlying Hamiltonian dynamics have been faithfully captured, which we call Symplecticity Metric or SyMetric. Our measures take advantage of the known properties of Hamiltonian dynamics and are more discriminative of the model's ability to capture the underlying dynamics than reconstruction error. Using SyMetric, we identify a set of architectural choices that significantly improve the performance of a previously proposed model for inferring latent dynamics from pixels, the Hamiltonian Generative Network (HGN). Unlike the original HGN, the new HGN is able to discover an interpretable phase space with physically meaningful latents on some datasets.
Furthermore, it is stable for significantly longer rollouts on a diverse range of 13 datasets, producing rollouts of essentially infinite length both forward and backwards in time with no degradation in quality on a subset of the datasets.
【16】 The modeling of multiple animals that share behavioral features Link: https://arxiv.org/abs/2111.05985
Authors: Gianluca Mastrantonio Affiliations: Polytechnic of Turin, Department of Mathematical Science, Turin, Italy Abstract: In this work, we propose a model that can be used to infer the behavior of multiple animals. Our proposal is defined as a set of hidden Markov models that are based on the sticky hierarchical Dirichlet process, with a shared base-measure, and a STAP emission distribution. The latent classifications are representative of the behavior assumed by the animals, which is described by the STAP parameters. Given the latent classifications, the animals are independent. As a result of the way we formalize the distribution over the STAP parameters, the animals may share, in different behaviors, the set or a subset of the parameters, thereby allowing us to investigate the similarities between them. The hidden Markov models, based on the Dirichlet process, allow us to estimate the number of latent behaviors for each animal, as a model parameter. This proposal is motivated by a real data problem, where the GPS coordinates of six Maremma Sheepdogs have been observed. Among the other results, we show that four dogs share most of the behavior characteristics, while two have specific behaviors.
【17】 Accurate confidence interval estimation for non-centrality parameters and effect size indices Link: https://arxiv.org/abs/2111.05966
Authors: Kaidi Kang, Kristan Armstrong, Suzanne Avery, Maureen McHugo, Stephan Heckers, Simon Vandekar Affiliations: Vanderbilt University, Department of Biostatistics, Vanderbilt University Medical Center, Department of Psychiatry and Behavioral Sciences, Nashville, TN Abstract: We recently proposed a robust effect size index (RESI) that is related to the non-centrality parameter of a test statistic. RESI is advantageous over common indices because (1) it is widely applicable to many types of data; (2) it can rely on a robust covariance estimate; (3) it can accommodate the existence of nuisance parameters. We provided a consistent estimator for the RESI; however, there is no established confidence interval (CI) estimation procedure for the RESI. Here, we use statistical theory and simulations to evaluate several CI estimation procedures for three estimators of the RESI. Our findings show (1) in contrast to common effect sizes, the robust estimator is consistent for the true effect size; (2) common CI procedures for effect sizes that are non-centrality parameters fail to cover the true effect size at the nominal level. Using the robust estimator along with the proposed bootstrap CI is generally accurate and applicable to conduct consistent estimation and valid inference for the RESI, especially when model assumptions may be violated.
Based on the RESI, we propose a general framework for the analysis of effect size (ANOES), such that effect sizes and confidence intervals can be easily reported in an analysis of variance (ANOVA) table format for a wide range of models.
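As a rough illustration of what a bootstrap confidence interval involves, here is a generic nonparametric percentile bootstrap for an arbitrary statistic. The abstract does not specify which bootstrap variant the authors propose, so this should be read as a sketch of the general technique only. The function name, the default of 2000 resamples, and the index-based quantile rule are assumptions.

```python
import random


def percentile_bootstrap_ci(data, statistic, level=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for statistic(data).

    Generic sketch: resample the data with replacement, recompute the
    statistic, and read off empirical quantiles of the replicates.
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    lower_index = int((alpha / 2.0) * (n_resamples - 1))
    upper_index = int((1.0 - alpha / 2.0) * (n_resamples - 1))
    return replicates[lower_index], replicates[upper_index]


def sample_mean(values):
    """Arithmetic mean, used here as a stand-in for an effect size estimator."""
    return sum(values) / len(values)


if __name__ == "__main__":
    observations = [float(v) for v in range(1, 51)]
    print(percentile_bootstrap_ci(observations, sample_mean))
```

Any estimator that maps a resampled dataset to a number can be passed as `statistic`, so an effect size estimator could replace `sample_mean` without changing the resampling logic.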
【18】 Spatial epidemiology and adaptive targeted sampling to manage the Chagas disease vector Triatoma dimidiata 标题:空间流行病学和适应性靶向抽样管理恰加斯病病媒二分体Triatoma dimidiata 链接:https://arxiv.org/abs/2111.05964
作者:B. K. M. Case,Jean-Gabriel Young,Daniel Penados,Laurent Hébert-Dufresne,Lori Stevens 机构:Vermont Complex Systems Center, University of Vermont, USA, Department of Computer Science, University of Vermont, USA, Department of Mathematics & Statistics, University of Vermont, USA 摘要:在中美洲,广泛使用杀虫剂仍然是控制恰加斯病的主要形式,尽管只是暂时降低了国内地方病媒二分锥虫的水平,而且没有什么长期影响。最近,一种强调社区反馈和住房改善的方法被证明能产生持久的效果。然而,这种干预措施所需的额外资源和人员可能妨碍其广泛采用。解决这一问题的一个办法是只针对社区中的一部分房屋,同时仍然消除足够的虫害,以中断疾病传播。在这里,我们开发了一个顺序抽样框架,该框架适用于访问更多房屋时特定于社区的信息,从而使我们能够高效地找到带有家庭媒介的房屋,同时最小化抽样偏差。该方法适用于贝叶斯地质统计模型进行空间信息预测,同时逐渐从基于预测不确定性的房屋优先排序过渡到针对具有高感染风险的房屋。该方法的一个关键特征是使用单个勘探参数$\alpha$来控制这两个设计目标之间的转换速率。在使用危地马拉东南部五个村庄的经验数据进行的模拟研究中,我们使用$\alpha$的一系列值对我们的方法进行了测试,发现它可以始终如一地选择比随机抽样更少的家庭,同时仍然使村庄感染率低于给定的阈值。我们进一步发现,当更多的社会经济信息可用时,可以节省更多的资金,但满足目标感染率的一致性较差,尤其是在探索性较差的策略中。我们的研究结果为实施长期的二分锥虫控制提供了新的选择。 摘要:Widespread application of insecticide remains the primary form of control for Chagas disease in Central America, despite only temporarily reducing domestic levels of the endemic vector Triatoma dimidiata and having little long-term impact. Recently, an approach emphasizing community feedback and housing improvements has been shown to yield lasting results. However, the additional resources and personnel required by such an intervention likely hinder its widespread adoption. One solution to this problem would be to target only a subset of houses in a community while still eliminating enough infestations to interrupt disease transfer. Here we develop a sequential sampling framework that adapts to information specific to a community as more houses are visited, thereby allowing us to efficiently find homes with domiciliary vectors while minimizing sampling bias. The method fits Bayesian geostatistical models to make spatially informed predictions, while gradually transitioning from prioritizing houses based on prediction uncertainty to targeting houses with a high risk of infestation.
A key feature of the method is the use of a single exploration parameter, $\alpha$, to control the rate of transition between these two design targets. In a simulation study using empirical data from five villages in southeastern Guatemala, we test our method using a range of values for $\alpha$, and find it can consistently select fewer homes than random sampling, while still bringing the village infestation rate below a given threshold. We further find that when additional socioeconomic information is available, much larger savings are possible, but that meeting the target infestation rate is less consistent, particularly among the less exploratory strategies. Our results suggest new options for implementing long-term T. dimidiata control.
【19】 Adversarial sampling of unknown and high-dimensional conditional distributions 标题:未知和高维条件分布的对抗性抽样 链接:https://arxiv.org/abs/2111.05962
作者:Malik Hassanaly,Andrew Glaws,Karen Stengel,Ryan N. King 备注:26 pages, 12 figures, 4 tables 摘要:许多工程问题需要预测实现到实现的可变性或对建模数量的精确描述。在这种情况下,有必要从可能具有数百万自由度的未知高维空间中采样元素。虽然存在能够从已知形状的概率密度函数(PDF)中采样元素的方法,但当分布未知时,需要进行几种近似。在本文中,采样方法以及对基础分布的推断都使用了一种称为生成对抗网络(GAN)的数据驱动方法进行处理,该方法训练两个相互竞争的神经网络,以生成一个网络,该网络可以有效地从训练集分布生成样本。在实践中,经常需要从条件分布中抽取样本。当条件变量连续时,可能只有一个(如果有)数据点对应于条件变量的特定值,这不足以估计条件分布。这项工作处理这个问题,使用先验估计的条件矩的PDF。比较了随机估计和外部神经网络两种计算这些矩的方法;但是,可以使用任何首选方法。该算法以过滤后的湍流流场的反褶积为例进行了验证。结果表明,与现有方法相比,该算法的所有版本都能有效地对目标条件分布进行采样,且对样本质量的影响最小。此外,该程序还可用作由具有连续变量的条件GAN(cGAN)生成的样本多样性的度量。 摘要:Many engineering problems require the prediction of realization-to-realization variability or a refined description of modeled quantities. In that case, it is necessary to sample elements from unknown high-dimensional spaces with possibly millions of degrees of freedom. While there exist methods able to sample elements from probability density functions (PDFs) with known shapes, several approximations need to be made when the distribution is unknown. In this paper the sampling method, as well as the inference of the underlying distribution, are both handled with a data-driven method known as generative adversarial networks (GAN), which trains two competing neural networks to produce a network that can effectively generate samples from the training set distribution. In practice, it is often necessary to draw samples from conditional distributions. When the conditional variables are continuous, only one (if any) data point corresponding to a particular value of a conditioning variable may be available, which is not sufficient to estimate the conditional distribution. This work handles this problem using an a priori estimation of the conditional moments of a PDF. Two approaches, stochastic estimation and an external neural network, are compared here for computing these moments; however, any preferred method can be used. The algorithm is demonstrated in the case of the deconvolution of a filtered turbulent flow field.
It is shown that all the versions of the proposed algorithm effectively sample the target conditional distribution with minimal impact on the quality of the samples compared to state-of-the-art methods. Additionally, the procedure can be used as a metric for the diversity of samples generated by a conditional GAN (cGAN) conditioned with continuous variables.
【20】 Beyond Importance Scores: Interpreting Tabular ML by Visualizing Feature Semantics 标题:超越重要性分数:通过可视化要素语义来解释表格ML 链接:https://arxiv.org/abs/2111.05898
作者:Amirata Ghorbani,Dina Berenbaum,Maor Ivgi,Yuval Dafna,James Zou 机构:Stanford University, Demystify AI 摘要:随着机器学习(ML)模型越来越广泛地用于做出关键决策,可解释性正成为一个活跃的研究课题。表格数据是医疗保健和金融等多种应用中最常用的数据模式之一。许多用于表格数据的现有可解释性方法只报告特征重要性得分——局部(每个示例)或全局(每个模型)——但它们不提供特征交互方式的解释或可视化。我们通过引入特征向量(专为表格数据集设计的一种新的全局可解释性方法)来解决这一局限性。除了提供特征重要性外,特征向量还通过直观的特征可视化技术发现特征之间固有的语义关系。我们的系统实验通过将该方法应用于几个真实数据集,证明了该方法的经验效用。我们还为特征向量提供了一个易于使用的Python包。 摘要:Interpretability is becoming an active research topic as machine learning (ML) models are more widely used to make critical decisions. Tabular data is one of the most commonly used modes of data in diverse applications such as healthcare and finance. Many of the existing interpretability methods used for tabular data only report feature-importance scores -- either locally (per example) or globally (per model) -- but they do not provide interpretation or visualization of how the features interact. We address this limitation by introducing Feature Vectors, a new global interpretability method designed for tabular datasets. In addition to providing feature importance, Feature Vectors discovers the inherent semantic relationship among features via an intuitive feature visualization technique. Our systematic experiments demonstrate the empirical utility of this new method by applying it to several real-world datasets. We further provide an easy-to-use Python package for Feature Vectors.
【21】 PDMP Monte Carlo methods for piecewise-smooth densities 标题:分段光滑密度的PDMP蒙特卡罗方法 链接:https://arxiv.org/abs/2111.05859
作者:Augustin Chevallier,Sam Power,Andi Q. Wang,Paul Fearnhead 摘要:发展基于分段确定马尔可夫过程的马尔可夫链蒙特卡罗算法已经引起了人们极大的兴趣。然而,现有的算法只能在目标兴趣分布处处可微的情况下使用。对这些算法进行调整的关键在于,当过程遇到不连续时,为其定义适当的动力学,以使其能够从具有不连续性的密度中进行采样。我们提出了一个简单的不连续过程过渡条件,可用于扩展任何现有的光滑密度取样器,并给出了该过渡的具体选择,该过渡可与常用算法(如反弹粒子取样器、坐标取样器和之字形过程)配合使用。我们的理论结果扩展并提出了先前提出的严格论点,例如,为限制在有界区域内的连续密度构造采样器,并且我们提出了一种可以在这种情况下工作的锯齿形过程。我们提出了一种新的方法来推导具有边界的分段确定马尔可夫过程的不变分布,这种方法可能具有独立的意义。 摘要:There has been substantial interest in developing Markov chain Monte Carlo algorithms based on piecewise-deterministic Markov processes. However, existing algorithms can only be used if the target distribution of interest is differentiable everywhere. The key to adapting these algorithms so that they can sample from densities with discontinuities is defining appropriate dynamics for the process when it hits a discontinuity. We present a simple condition for the transition of the process at a discontinuity which can be used to extend any existing sampler for smooth densities, and give specific choices for this transition which work with popular algorithms such as the Bouncy Particle Sampler, the Coordinate Sampler and the Zig-Zag Process. Our theoretical results extend and make rigorous arguments that have been presented previously, for instance constructing samplers for continuous densities restricted to a bounded domain, and we present a version of the Zig-Zag Process that can work in such a scenario. Our novel approach to deriving the invariant distribution of a piecewise-deterministic Markov process with boundaries may be of independent interest.
【22】 Learning Signal-Agnostic Manifolds of Neural Fields 标题:学习信号不可知的神经场流形 链接:https://arxiv.org/abs/2111.06387
作者:Yilun Du,Katherine M. Collins,Joshua B. Tenenbaum,Vincent Sitzmann 机构:Katherine Collins, MIT CSAIL, MIT BCS, MIT CBMM 备注:NeurIPS 2021, additional results and code at this https URL 摘要:深度神经网络已被广泛用于跨图像、形状和音频信号等模式学习数据集的潜在结构。然而,现有的模型通常依赖于模态,需要定制的体系结构和目标来处理不同类别的信号。我们利用神经场以模态独立的方式捕获图像、形状、音频和跨模态视听域中的底层结构。我们的任务是学习流形,我们的目标是推断数据所在的低维局部线性子空间。通过加强流形、局部线性和局部等距的覆盖,我们的模型(称为GEM)学习捕获跨模式数据集的底层结构。然后,我们可以沿着流形的线性区域移动,以获得样本之间的感知一致性插值,并可以进一步使用GEM恢复流形上的点,不仅收集输入图像的各种完整信息,还收集音频或图像信号的跨模态幻觉。最后,我们展示了通过遍历GEM的底层流形,我们可以在信号域中生成新样本。有关代码和其他结果,请访问https://yilundu.github.io/gem/. 摘要:Deep neural networks have been used widely to learn the latent structure of datasets, across modalities such as images, shapes, and audio signals. However, existing models are generally modality-dependent, requiring custom architectures and objectives to process different classes of signals. We leverage neural fields to capture the underlying structure in image, shape, audio and cross-modal audiovisual domains in a modality-independent manner. We cast our task as one of learning a manifold, where we aim to infer a low-dimensional, locally linear subspace in which our data resides. By enforcing coverage of the manifold, local linearity, and local isometry, our model -- dubbed GEM -- learns to capture the underlying structure of datasets across modalities. We can then travel along linear regions of our manifold to obtain perceptually consistent interpolations between samples, and can further use GEM to recover points on our manifold and glean not only diverse completions of input images, but cross-modal hallucinations of audio or image signals. Finally, we show that by walking across the underlying manifold of GEM, we may generate new samples in our signal domains. Code and additional results are available at https://yilundu.github.io/gem/.
【23】 Quantum Model-Discovery 标题:量子模型--发现 链接:https://arxiv.org/abs/2111.06376
作者:Niklas Heim,Atiyo Ghosh,Oleksandr Kyriienko,Vincent E. Elfving 机构:Qu & Co B.V., PO Box , AW, Amsterdam, The Netherlands, Artificial Intelligence Center, Czech Technical University, Prague, CZ , Department of Physics and Astronomy, University of Exeter, Stocker Road, Exeter EX,QL, United Kingdom 备注:first version, 18 pages, 6 figures 摘要:量子计算有望加速科学和工程中一些最具挑战性的问题。量子算法已经被提出,在从化学到物流优化的应用中显示出理论优势。科学和工程中出现的许多问题可以改写为一组微分方程。用于求解微分方程的量子算法在容错量子计算领域显示出可证明的优势,在容错量子计算领域,深度和广度的量子电路可以有效地求解偏微分方程(PDE)等大型线性系统。最近,也有人提出了用变分方法来求解非线性偏微分方程(PDE)的方法。最有希望的通用方法之一是基于科学机器学习领域中解决偏微分方程的最新发展。我们将短期量子计算机的适用性扩展到更一般的科学机器学习任务,包括从测量数据集中发现微分方程。我们使用可微量子电路(DQCs)求解由算符库参数化的方程,并对数据和方程的组合进行回归。我们的结果表明,在经典和量子机器学习方法之间的接口上,量子模型发现(QMoD)是一条有前途的道路。我们展示了在不同的系统上使用QMoD成功地进行参数推断和方程发现,包括一个二阶常微分方程和一个非线性偏微分方程。 摘要:Quantum computing promises to speed up some of the most challenging problems in science and engineering. Quantum algorithms have been proposed showing theoretical advantages in applications ranging from chemistry to logistics optimization. Many problems appearing in science and engineering can be rewritten as a set of differential equations. Quantum algorithms for solving differential equations have shown a provable advantage in the fault-tolerant quantum computing regime, where deep and wide quantum circuits can be used to solve large linear systems like partial differential equations (PDEs) efficiently. Recently, variational approaches to solving non-linear PDEs also with near-term quantum devices were proposed. One of the most promising general approaches is based on recent developments in the field of scientific machine learning for solving PDEs. We extend the applicability of near-term quantum computers to more general scientific machine learning tasks, including the discovery of differential equations from a dataset of measurements. We use differentiable quantum circuits (DQCs) to solve equations parameterized by a library of operators, and perform regression on a combination of data and equations. 
Our results show a promising path to Quantum Model Discovery (QMoD), on the interface between classical and quantum machine learning approaches. We demonstrate successful parameter inference and equation discovery using QMoD on different systems including a second-order, ordinary differential equation and a non-linear, partial differential equation.
【24】 Can you always reap what you sow? Network and functional data analysis of VC investments in health-tech companies 标题:种瓜得瓜,种豆得豆吗?健康科技企业风险投资的网络与功能数据分析 链接:https://arxiv.org/abs/2111.06371
作者:Christian Esposito,Marco Gortan,Lorenzo Testa,Francesca Chiaromonte,Giorgio Fagiolo,Andrea Mina,Giulio Rossetti 机构: Inst. of Economics & EMbeDS, Sant’Anna School of Advanced Studies, Pisa, Italy, Dept. of Computer Science, University of Pisa, Pisa, Italy, Dept. of Finance, Bocconi University, Milan, Italy 备注:12 pages, 4 figures, accepted for publication in the proceedings of the 10th International Conference on Complex Networks and Their Applications 摘要:企业在风险资本市场上的“成功”很难定义,其决定因素至今仍知之甚少。我们建立了一个由医疗保健部门的投资者和公司组成的二方网络,描述其结构和社区。然后,我们通过逐步引入更精确的定义来描述“成功”,并发现这些定义与公司的中心地位之间存在正相关关系。特别是,我们能够将公司的融资轨迹分为两组,分别采用不同的“成功”机制,并将属于其中一组或另一组的概率与其网络特征(特别是其中心地位和投资者之一)联系起来。我们通过引入标量和功能性“成功”结果进一步研究了这种正相关性,证实了我们的发现及其稳健性。 摘要:"Success" of firms in venture capital markets is hard to define, and its determinants are still poorly understood. We build a bipartite network of investors and firms in the healthcare sector, describing its structure and its communities. Then, we characterize "success" introducing progressively more refined definitions, and we find a positive association between such definitions and the centrality of a company. In particular, we are able to cluster funding trajectories of firms into two groups capturing different "success" regimes and to link the probability of belonging to one or the other to their network features (in particular their centrality and the one of their investors). We further investigate this positive association by introducing scalar as well as functional "success" outcomes, confirming our findings and their robustness.
【25】 Machine Learning-Based Optimization of Chiral Photonic Nanostructures: Evolution- and Neural Network-Based Design 标题:基于机器学习的手性光子纳米结构优化:基于进化和神经网络的设计 链接:https://arxiv.org/abs/2111.06272
作者:Oliver Mey,Arash Rahimi-Iman 机构:O. Mey, Fraunhofer IISEAS, Fraunhofer Institute for Integrated Circuits, Division Engineering of Adaptive Systems, Dresden, Münchner Str. , Dresden, Germany, A. Rahimi-Iman, . Physikalisches Institut, Justus-Liebig-Universität Gießen, Gießen, Germany 摘要:手性光子学开辟了新的途径来操纵光与物质的相互作用,并通过纳米结构的非平凡模式来调整超表面和超材料的光学响应。物质的手性,如分子的手性,以及光的手性,在最简单的情况下是由圆极化的手性给出的,在化学、纳米光子学和光学信息处理中的应用引起了广泛的关注。我们报道了利用进化算法和神经网络两种机器学习方法设计手征光子结构,以快速有效地优化介电超表面的光学特性。在过渡金属二硫族化合物激子共振范围内获得的可见光设计配方表明,反射光的圆偏振度随频率变化,其表现为左右圆偏振强度之间的差异。我们的研究结果表明,利用二硫化钨作为可能的活性材料,利用谷霍尔效应和光谷相干等特性,可以方便地制备和表征用于手性敏感光-物质耦合场景的光学纳米图案反射器。 摘要:Chiral photonics opens new pathways to manipulate light-matter interactions and tailor the optical response of meta-surfaces and -materials by nanostructuring nontrivial patterns. Chirality of matter, such as that of molecules, and light, which in the simplest case is given by the handedness of circular polarization, have attracted much attention for applications in chemistry, nanophotonics and optical information processing. We report the design of chiral photonic structures using two machine learning methods, the evolutionary algorithm and neural network approach, for rapid and efficient optimization of optical properties for dielectric metasurfaces. The design recipes obtained for visible light in the range of transition-metal dichalcogenide exciton resonances show a frequency-dependent modification in the reflected light's degree of circular polarization, which is represented by the difference between left- and right-circularly polarized intensity. Our results suggest the facile fabrication and characterization of optical nanopatterned reflectors for chirality-sensitive light-matter coupling scenarios employing tungsten disulfide as possible active material with features such as valley Hall effect and optical valley coherence.
【26】 Training Cross-Lingual embeddings for Setswana and Sepedi 标题:训练茨瓦纳语和塞佩迪语的跨语言嵌入 链接:https://arxiv.org/abs/2111.06230
作者:Mack Makgatho,Vukosi Marivate,Tshephisho Sefara,Valencia Wagner 机构:University of Pretoria, Sol Plaatje University 备注:Accepted (to appear) for the 2nd Workshop on Resources for African Indigenous Languages 摘要:非洲语言仍然落后于自然语言处理技术的进步,一个原因是缺乏代表性数据,拥有一种能够在语言之间传递信息的技术可以帮助缓解数据缺乏的问题。本文训练Setswana和Sepedi单语词向量,并使用VecMap为Setswana Sepedi创建跨语言嵌入,以实现跨语言迁移。单词嵌入是将单词表示为连续浮点数的单词向量,其中语义相似的单词映射到n维空间中的邻近点。单词嵌入的概念基于分布假设,即语义相似的单词分布在相似的上下文中(Harris,1954)。跨语言嵌入通过学习两个单独训练的单语向量的共享向量空间来利用单语嵌入,使得具有相似意义的单词由相似向量表示。在本文中,我们研究了塞斯瓦纳单语词向量的跨语言嵌入。我们使用VecMap中的无监督跨语言嵌入来训练Setswana-Sepedi跨语言单词嵌入。我们使用语义评估任务评估Setswana Sepedi跨语言单词表示的质量。对于语义相似性任务,我们将WordSim和SimLex任务翻译成Setswana和Sepedi。我们发布这个数据集作为其他研究人员工作的一部分。我们评估嵌入的内在质量,以确定单词嵌入的语义表示是否有改进。 摘要:African languages still lag in the advances of Natural Language Processing techniques, one reason being the lack of representative data; a technique that can transfer information between languages can help mitigate this lack of data. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create cross-lingual embeddings for Setswana-Sepedi in order to do a cross-lingual transfer. Word embeddings are word vectors that represent words as continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space. The idea of word embeddings is based on the distributional hypothesis, which states that semantically similar words are distributed in similar contexts (Harris, 1954). Cross-lingual embeddings leverage monolingual embeddings by learning a shared vector space for two separately trained monolingual vectors such that words with similar meaning are represented by similar vectors. In this paper, we investigate cross-lingual embeddings for Setswana-Sepedi monolingual word vectors. We use the unsupervised cross-lingual embeddings in VecMap to train the Setswana-Sepedi cross-language word embeddings.
We evaluate the quality of the Setswana-Sepedi cross-lingual word representation using a semantic evaluation task. For the semantic similarity task, we translated the WordSim and SimLex tasks into Setswana and Sepedi. We release this dataset as part of this work for other researchers. We evaluate the intrinsic quality of the embeddings to determine if there is improvement in the semantic representation of the word embeddings.
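The abstract above describes embeddings in which semantically similar words sit at nearby points in a shared vector space. As a minimal sketch of how such closeness is commonly scored, the following computes cosine similarity between two vectors. The function name and the toy vectors are assumptions made for illustration; they are not taken from the paper, from VecMap, or from the released dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings for two words in a shared space.
word_a = [0.2, 0.7, 0.1]
word_b = [0.25, 0.65, 0.05]
similarity = cosine_similarity(word_a, word_b)
```

In a word-similarity benchmark of the WordSim or SimLex kind, scores like this are typically compared against human similarity ratings, for example with a rank correlation.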
【27】 Occupational Income Inequality of Thailand: A Case Study of Exploratory Data Analysis beyond Gini Coefficient 标题:泰国职业收入不平等:超出基尼系数的探索性数据分析 链接:https://arxiv.org/abs/2111.06224
作者:Wanetha Sudswong,Anon Plangprasopchok,Chainarong Amornbunchornvej 机构:National Electronics and Computer Technology Center (NECTEC), Phahonyothin Road, Khlong Nueng, Pathum Thani, Thailand 备注:22 pages 摘要:收入不平等是我们社会进步必须解决的一个重要问题。收入不平等的研究通过基尼系数广受欢迎,基尼系数通常用于衡量不平等程度。虽然这种方法在几个方面是有效的,但仅基尼系数就不可避免地忽略了少数群体亚群体(例如职业),这导致少数群体中未发现的不平等模式缺失。在这项研究中,通过使用基尼系数和收入支配网络的网络密度,对泰国1200多万户家庭的收入和职业调查进行了分析,以了解一般和职业收入不平等问题的程度。结果表明,在农业省份,两种类型的不平等(低基尼系数和网络密度)问题较少,而一些非农业省份面临着职业收入不平等(高网络密度)问题,没有任何总体收入不平等(低基尼系数)的症状。此外,研究结果还利用估计统计数据说明了收入不平等的差距,这不仅支持收入不平等是否存在,而且我们还能够判断职业间收入差距的大小。这些结果不能仅通过基尼系数获得。这项工作是从一般人口和亚人口角度分析收入不平等的一个用例,可用于其他国家的研究。 摘要:Income inequality is an important issue that has to be solved in order to make progress in our society. The study of income inequality is well received through the Gini coefficient, which is used to measure degrees of inequality in general. While this method is effective in several aspects, the Gini coefficient alone inevitably overlooks minority subpopulations (e.g. occupations), which leaves patterns of inequality among minorities undetected. In this study, the surveys of incomes and occupations from more than 12 million households across Thailand have been analyzed by using both Gini coefficients and network densities of income domination networks to get insight regarding the degrees of general and occupational income inequality issues. The results show that, in agricultural provinces, there are fewer issues in both types of inequality (low Gini coefficients and network densities), while some non-agricultural provinces face an issue of occupational income inequality (high network densities) without any symptom of general income inequality (low Gini coefficients). Moreover, the results also illustrate the gaps of income inequality using estimation statistics, which not only indicate whether income inequality exists but also show the magnitudes of income gaps among occupations.
These results cannot be obtained via Gini coefficients alone. This work serves as a use case of analyzing income inequality from both general population and subpopulations perspectives that can be utilized in studies of other countries.
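For readers unfamiliar with the general-inequality measure the abstract above relies on, here is a minimal sketch of the Gini coefficient for a list of incomes, using the standard rank-weighted formula. The function name and the sample incomes are assumptions for illustration only; this is not the authors' code or data, and it does not cover the income domination networks, which the abstract does not define.

```python
def gini(incomes):
    """Gini coefficient of non-negative incomes: 0 means perfect equality."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on sorted values x_1 <= ... <= x_n:
    # G = 2 * sum_i(i * x_i) / (n * sum_i(x_i)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical household incomes for two groups of four households each.
equal_group = [100, 100, 100, 100]
unequal_group = [0, 0, 0, 400]
g_equal = gini(equal_group)
g_unequal = gini(unequal_group)
```

A single pooled Gini value summarizes the whole population, which is why, as the abstract argues, differences between subpopulations such as occupations can go unnoticed when it is used alone.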
【28】 BOiLS: Bayesian Optimisation for Logic Synthesis 标题:BOILS:逻辑综合的贝叶斯优化 链接:https://arxiv.org/abs/2111.06178
作者:Antoine Grosnit,Cedric Malherbe,Rasul Tutunov,Xingchen Wan,Jun Wang,Haitham Bou Ammar 机构:Huawei Noah’s Ark Lab, University College London 摘要:在逻辑综合过程中优化电路的结果质量(QoR)是一项艰巨的挑战,需要探索指数大小的搜索空间。虽然专家设计的操作有助于发现有效序列,但逻辑电路复杂性的增加有利于自动化程序。受机器学习成功的启发,研究人员将深度学习和强化学习应用于逻辑综合应用。无论多么成功,这些技术都面临着样本复杂度高的问题,阻碍了广泛采用。为了实现高效和可扩展的解决方案,我们提出了BOiLS,这是第一个采用现代贝叶斯优化来导航合成操作空间的算法。BOiLS不需要人工干预,通过新颖的高斯过程内核和信任区域约束的获取,有效地权衡了探索与开发。在EPFL基准测试的一组实验中,我们展示了BOiLS在样本效率和QoR值方面优于最新技术。 摘要:Optimising the quality-of-results (QoR) of circuits during logic synthesis is a formidable challenge necessitating the exploration of exponentially sized search spaces. While expert-designed operations aid in uncovering effective sequences, the increase in complexity of logic circuits favours automated procedures. Inspired by the successes of machine learning, researchers adapted deep learning and reinforcement learning to logic synthesis applications. However successful, those techniques suffer from high sample complexities preventing widespread adoption. To enable efficient and scalable solutions, we propose BOiLS, the first algorithm adapting modern Bayesian optimisation to navigate the space of synthesis operations. BOiLS requires no human intervention and effectively trades off exploration against exploitation through novel Gaussian process kernels and trust-region constrained acquisitions. In a set of experiments on EPFL benchmarks, we demonstrate BOiLS's superior performance compared to the state of the art in terms of both sample efficiency and QoR values.
【29】 Uncertainty quantification of a 3D In-Stent Restenosis model with surrogate modelling 标题:用代理建模法量化三维支架内再狭窄模型的不确定性 链接:https://arxiv.org/abs/2111.06173
作者:Dongwei Ye,Pavel Zun,Valeria Krzhizhanovskaya,Alfons G. Hoekstra 机构:Computational Science Lab, Institute for Informatics, University of Amsterdam, The Netherlands, National Center for Cognitive Research, ITMO University, Saint Petersburg, Russia 摘要:支架内再狭窄是由于球囊扩张和支架置入引起的血管损伤导致冠状动脉狭窄的复发。它可能导致心绞痛症状的复发或急性冠状动脉综合征。提出了一种支架内再狭窄模型的不确定性量化方法,该模型具有四个不确定参数(内皮细胞再生时间、平滑肌细胞键断裂的阈值应变、血流速度和内部弹性层中的开窗百分比)。研究了两个感兴趣的量,即容器中的平均横截面积和最大相对面积损失。由于模型的计算强度和不确定性量化所需的评估数量,开发了基于高斯过程回归和适当正交分解的替代模型,该模型随后取代了不确定性量化中的原始支架内再狭窄模型。详细分析了不确定性传播和灵敏度分析。平均横截面积和最大相对面积损失的不确定性分别约为11%和16%,不确定性估计表明,较高的开窗率主要决定了过程初始阶段新生内膜生长的不确定性。另一方面,血流速度和内皮细胞再生时间的不确定性主要决定了再狭窄过程后期临床相关阶段的相关数量的不确定性。与其他不确定参数相比,阈值应变的不确定性相对较小。 摘要:In-Stent Restenosis is a recurrence of coronary artery narrowing due to vascular injury caused by balloon dilation and stent placement. It may lead to the relapse of angina symptoms or to an acute coronary syndrome. An uncertainty quantification of a model for In-Stent Restenosis with four uncertain parameters (endothelium regeneration time, the threshold strain for smooth muscle cells bond breaking, blood flow velocity and the percentage of fenestration in the internal elastic lamina) is presented. Two quantities of interest were studied, namely the average cross-sectional area and the maximum relative area loss in a vessel. Due to the computational intensity of the model and the number of evaluations required in the uncertainty quantification, a surrogate model, based on Gaussian process regression with proper orthogonal decomposition, was developed which subsequently replaced the original In-Stent Restenosis model in the uncertainty quantification. A detailed analysis of the uncertainty propagation and sensitivity analysis is presented. 
Around 11% and 16% of uncertainty are observed on the average cross-sectional area and maximum relative area loss respectively, and the uncertainty estimates show that a higher fenestration mainly determines uncertainty in the neointimal growth at the initial stage of the process. On the other hand, the uncertainty in blood flow velocity and endothelium regeneration time mainly determines the uncertainty in the quantities of interest at the later, clinically relevant stages of the restenosis process. The uncertainty in the threshold strain is relatively small compared to the other uncertain parameters.
【30】 Causal KL: Evaluating Causal Discovery 标题:因果KL:评估因果发现 链接:https://arxiv.org/abs/2111.06029
作者:Rodney T. O'Donnell,Kevin B. Korb,Lloyd Allison 机构:School of Information Technology, Monash University, Clayton, Vic, Australia 备注:26 pages 摘要:使用人工数据评估因果模型发现的两个最常用标准是编辑距离和Kullback-Leibler散度,从真实模型到学习模型进行测量。这两个指标最大限度地奖励了真正的模型。然而,我们认为,在判断错误模型的相对优点时,他们都没有足够的辨别力。例如,“编辑距离”无法区分强概率依赖和弱概率依赖。另一方面,KL分歧对所有统计上等价的模型都给予同等的奖励,而不管它们的因果关系不同。我们提出了一种扩展的KL散度,我们称之为因果KL(CKL),它考虑了区分观测等效模型的因果关系。结果显示了三种CKL变体,表明因果KL在实践中效果良好。 摘要:The two most commonly used criteria for assessing causal model discovery with artificial data are edit-distance and Kullback-Leibler divergence, measured from the true model to the learned model. Both of these metrics maximally reward the true model. However, we argue that they are both insufficiently discriminating in judging the relative merits of false models. Edit distance, for example, fails to distinguish between strong and weak probabilistic dependencies. KL divergence, on the other hand, rewards equally all statistically equivalent models, regardless of their different causal claims. We propose an augmented KL divergence, which we call Causal KL (CKL), which takes into account causal relationships which distinguish between observationally equivalent models. Results are presented for three variants of CKL, showing that Causal KL works well in practice.
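As background for the abstract above, the following is a minimal sketch of the ordinary Kullback-Leibler divergence between two discrete distributions, measured from a "true" distribution p to a "learned" distribution q. It is the standard KL divergence only, not the Causal KL (CKL) variant the paper proposes, whose definition the abstract does not give. The function name and the example distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing, by convention
        if q_i == 0.0:
            return math.inf  # q rules out an outcome that p allows
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical joint distributions over four outcomes.
true_joint = [0.4, 0.1, 0.1, 0.4]
learned_joint = [0.25, 0.25, 0.25, 0.25]
divergence = kl_divergence(true_joint, learned_joint)
```

Because this quantity depends only on the two joint distributions, any two models that encode the same joint distribution receive the same score, whatever causal structure they assert. That is the limitation the abstract raises for plain KL divergence.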