q-fin (Quantitative Finance): 13 papers
cs.SD (Sound): 9 papers
eess.AS (Audio Processing): 10 papers
1. q-fin (Quantitative Finance):
【1】 Correlation scenarios and correlation stress testing
Authors: N. Packham, F. Woebbeking Link: https://arxiv.org/abs/2107.06839 Abstract: We develop a general approach for stress testing correlations of financial asset portfolios. The correlation matrix of asset returns is specified in a parametric form, where correlations are represented as a function of risk factors, such as country and industry factors. A sparse factor structure linking assets and risk factors is built using Bayesian variable selection methods. Regular calibration yields a joint distribution of economically meaningful stress scenarios of the factors. As such, the method also lends itself as a reverse stress testing framework: using the Mahalanobis distance or highest density regions (HDR) on the joint risk factor distribution allows one to infer worst-case correlation scenarios. We give examples of stress tests on a large portfolio of European and North American stocks.
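The abstract above uses the Mahalanobis distance to judge how extreme a risk-factor scenario is. As a rough illustration only (not the authors' implementation), the sketch below computes that distance for a two-factor scenario; the function name `mahalanobis_2d` and the example numbers are hypothetical, and the closed-form 2x2 inverse is used to keep the block free of dependencies.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a distribution.

    x, mean: pairs of floats; cov: 2x2 covariance as nested pairs.
    Illustrative helper, not taken from the paper.
    """
    dx = x[0] - mean[0]
    dy = x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Quadratic form v^T * inverse(cov) * v using the 2x2 inverse formula.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Two scenarios of equal Euclidean size under positively correlated factors:
cov = ((1.0, 0.5), (0.5, 1.0))
same_direction = mahalanobis_2d((1.0, 1.0), (0.0, 0.0), cov)
opposite_direction = mahalanobis_2d((1.0, -1.0), (0.0, 0.0), cov)
```

With positively correlated factors, the scenario in which the factors move in opposite directions lies farther from the center, so it counts as the less plausible of the two.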
【2】 A decision support tool for ship biofouling management in the Baltic Sea
Authors: Emilia Luoma, Mirka Laurila-Pant, Elias Altarriba, Inari Helle, Lena Granhag, Maiju Lehtiniemi, Greta Srėbalienė, Sergej Olenin, Annukka Lehikoinen Affiliations: Ecosystems and Environment Research Programme, Environmental Sciences, University of Helsinki, Finland; South-Eastern Finland University of Applied Sciences (Xamk), Logistics and Seafaring, Kotka, Finland; Organismal and Evolutionary Biology Research Programme Link: https://arxiv.org/abs/2107.06810 Abstract: Biofouling of ships causes major environmental and economic consequences all over the world. In addition, biofouling management of ship hulls poses social, environmental and economic risks that should all be considered when reaching well-balanced decisions. Moreover, each case is unique, and thus the optimal management strategy must be considered case-specifically. We produced a novel decision support tool using Bayesian networks to promote a comprehensive understanding of the complex biofouling management issue in the Baltic Sea and to identify potential management options and their consequences. The tool compares the biofouling management strategies in relation to NIS (non-indigenous species) introduction risk, eco-toxicological risk due to biocidal coating, carbon dioxide emissions resulting from fuel consumption, and costs related to fuel consumption, in-water cleaning and coating. According to the results, the optimal biofouling management strategy would consist of a biocidal-free coating with regular in-water cleaning and with devices collecting the material. However, the best biocidal-free coating type and the optimal in-water cleaning interval vary and depend, e.g., on the operational profile of the ship. The decision support tool can increase the multi-perspective understanding of the issue and support the implementation of the optimal biofouling management strategies in the Baltic Sea.
【3】 Clustering and attention model based for Intelligent Trading
Authors: Mimansa Rana, Nanxiang Mao, Ming Ao, Xiaohui Wu, Poning Liang, Matloob Khushi Affiliations: The School of Computer Science, The University of Sydney Link: https://arxiv.org/abs/2107.06782 Abstract: The foreign exchange market has taken an important role in the global financial market. While foreign exchange trading brings high-yield opportunities to investors, it also brings certain risks. Since the establishment of the foreign exchange market in the 20th century, foreign exchange rate forecasting has become a hot issue studied by scholars from all over the world. Due to the complexity and number of factors affecting the foreign exchange market, technical analysis cannot respond to administrative intervention or unexpected events. Our team chose historical data for several foreign currency pairs, together with derived technical indicators, from 2005 to 2021 as the dataset and established different machine learning models for event-driven price prediction in oversold scenarios.
【4】 On the short term stability of financial ARCH price processes
Authors: Gilles Zumbach Affiliations: Edgelab, Avenue de la Rasude, Lausanne, Switzerland Link: https://arxiv.org/abs/2107.06758 Abstract: For many financial applications, it is important to have reliable and tractable models for the behavior of assets and indexes, for example in risk evaluation. A successful approach is based on ARCH processes, which strike the right balance between statistical properties and ease of computation. This study focuses on quadratic ARCH processes and the theoretical conditions to have a stable long-term behavior. In particular, the weights for the variance estimators should sum to 1, and the variance of the innovations should be 1. Using historical data, the realized empirical innovations can be computed, and their statistical properties assessed. Using samples of 3 to 5 decades, the variance of the empirical innovations is always significantly above 1, for a sample of stock indexes, commodity indexes and FX rates. This departure points to a short term instability, or to a fast adaptability due to changing conditions. Another theoretical condition on the innovations is to have a zero mean. This condition is also investigated empirically, with some time series showing significant departure from zero.
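The two stability conditions named in the abstract above (variance-estimator weights summing to 1, innovations with unit variance) can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's process: the exponentially decaying weights, the lag count, and the function names `normalized_weights`, `realized_innovations` and `second_moment` are all hypothetical choices.

```python
import math


def normalized_weights(n_lags, decay):
    """Exponentially decaying weights rescaled to sum to 1 (assumed shape)."""
    raw = [decay ** i for i in range(n_lags)]
    total = sum(raw)
    return [w / total for w in raw]


def realized_innovations(returns, weights):
    """Divide each return by the volatility forecast built from past returns.

    The variance forecast is a weighted sum of past squared returns, which is
    the quadratic-ARCH form; weights[0] applies to the most recent past return.
    """
    n = len(weights)
    innovations = []
    for t in range(n, len(returns)):
        variance = sum(w * returns[t - 1 - i] ** 2 for i, w in enumerate(weights))
        if variance > 0:
            innovations.append(returns[t] / math.sqrt(variance))
    return innovations


def second_moment(values):
    """Mean of squares; close to 1 when the unit-variance condition holds."""
    return sum(v * v for v in values) / len(values)


weights = normalized_weights(4, 0.9)
toy_returns = [0.01, -0.01] * 10  # constant-magnitude toy series
toy_innovations = realized_innovations(toy_returns, weights)
```

On real return series one would compare `second_moment(toy_innovations)`-style estimates against 1; the abstract reports that empirical values sit significantly above it.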
【5】 Financial Return Distributions: Past, Present, and COVID-19
Authors: Marcin Wątorek, Jarosław Kwapień, Stanisław Drożdż Link: https://arxiv.org/abs/2107.06659 Abstract: We analyze the price return distributions of currency exchange rates, cryptocurrencies, and contracts for differences (CFDs) representing stock indices, stock shares, and commodities. Based on recent data from the years 2017--2020, we model tails of the return distributions at different time scales by using power-law, stretched exponential, and $q$-Gaussian functions. We focus on the fitted function parameters and how they change over the years by comparing our results with those from earlier studies and find that, on the time horizons of up to a few minutes, the so-called "inverse-cubic power-law" still constitutes an appropriate global reference. However, we no longer observe the hypothesized universal constant acceleration of the market time flow that was manifested before in an ever faster convergence of empirical return distributions towards the normal distribution. Our results do not exclude such a scenario but, rather, suggest that some other short-term processes related to a current market situation alter market dynamics and may mask this scenario. Real market dynamics is associated with a continuous alternation of different regimes with different statistical properties. An example is the COVID-19 pandemic outburst, which had an enormous yet short-time impact on financial markets. We also point out that two factors -- speed of the market time flow and the asset cross-correlation magnitude -- while related (the larger the speed, the larger the cross-correlations on a given time scale), act in opposite directions with regard to the return distribution tails, which can affect the expected distribution convergence to the normal distribution.
【6】 A General Approach for Parisian Stopping Times under Markov Processes
Authors: Gongqiu Zhang, Lingfei Li Link: https://arxiv.org/abs/2107.06605 Abstract: We propose a method based on continuous time Markov chain approximation to compute the distribution of Parisian stopping times and price Parisian options under general one-dimensional Markov processes. We prove the convergence of the method under a general setting and obtain a sharp estimate of the convergence rate for diffusion models. Our theoretical analysis reveals how to design the grid of the CTMC to achieve faster convergence. Numerical experiments are conducted to demonstrate the accuracy and efficiency of our method for both diffusion and jump models. To show the versatility of our approach, we develop extensions for multi-sided Parisian stopping times, the joint distribution of Parisian stopping times and first passage times, Parisian bonds and for more sophisticated models like regime-switching and stochastic volatility models.
【7】 The Infinite Horizon Investment-Consumption Problem for Epstein-Zin Stochastic Differential Utility
Authors: David Hobson, Martin Herdegen, Joseph Jerome Link: https://arxiv.org/abs/2107.06593 Abstract: In this article we consider the optimal investment-consumption problem for an agent with preferences governed by Epstein-Zin stochastic differential utility who invests in a constant-parameter Black-Scholes-Merton market. The paper has three main goals: first, to provide a detailed introduction to infinite-horizon Epstein-Zin stochastic differential utility, including a discussion of which parameter combinations lead to a well-formulated problem; second, to prove existence and uniqueness of infinite horizon Epstein-Zin stochastic differential utility under a restriction on the parameters governing the agent's risk aversion and temporal variance aversion; and third, to provide a verification argument for the candidate optimal solution to the investment-consumption problem among all admissible consumption streams. To achieve these goals, we introduce a slightly different formulation of Epstein-Zin stochastic differential utility to that which is traditionally used in the literature. This formulation highlights the necessity and appropriateness of certain restrictions on the parameters governing the stochastic differential utility function.
【8】 Winners and losers of immigration
Authors: Davide Fiaschi, Cristina Tealdi Comments: 85 pages, 19 figures, 16 tables Link: https://arxiv.org/abs/2107.06544 Abstract: We aim to identify winners and losers of a sudden inflow of low-skilled immigrants using a general equilibrium search and matching model in which employees, either native or non-native, are heterogeneous with respect to their skill level and produce different types of goods, and Government expenditure in public goods is financed by a progressive taxation on wages and profits. We estimate the short-term impact of this shock for Italy in each year in the period 2008-2017 to be sizeable and highly asymmetric. In 2017, the real wages of low-skilled and high-skilled employees were 4% lower and 8% higher, respectively, compared to a counter-factual scenario with no non-natives. Similarly, employers working in the low-skilled market experienced a drop in profits of comparable magnitude, while the opposite happened to employers operating in the high-skilled market. Finally, the presence of non-natives led to a 14% increase in GDP and to an increment of approximately 70 billion euros in government revenues and 18 billion euros in social security contributions.
【9】 From Carbon-transition Premium to Carbon-transition Risk
Authors: Suryadeepto Nag, Siddhartha P. Chakrabarty, Sankarshan Basu Affiliations: Indian Institute of Technology Guwahati Link: https://arxiv.org/abs/2107.06518 Abstract: Investor awareness about impending regulations requiring firms to reduce their carbon footprint has introduced a carbon transition risk premium in the stocks of firms. On performing a cross-section analysis, a significant premium was estimated among large caps in the US markets. The existence of a risk premium indicates investor awareness about future exposure to low-carbon transition. A new measure, the Single Event Transition Risk (SETR), was developed to model the maximum exposure of a firm to carbon transition risk, and a functional form for the same was determined, in terms of risk premia. Different classes of distributions for arrival processes of transition events were considered and the respective SETRs were determined and studied. The trade-off between higher premia and higher risks was studied for the different processes, and it was observed that, based on the distributions of arrival times, investors could have a lower, equal or higher probability of positive returns (from the premium-risk trade-off), and that despite a fair pricing of the carbon premium, decisions by investors to take long or short positions on a stock could still be biased.
【10】 A Unified Formula of the Optimal Portfolio for Piecewise HARA Utilities
Authors: Zongxia Liang, Yang Liu, Ming Ma Link: https://arxiv.org/abs/2107.06460 Abstract: We propose a general family of piecewise hyperbolic absolute risk aversion (PHARA) utility, including many non-standard utilities as examples. A typical application is the composition of an HARA preference and a piecewise linear payoff in hedge fund management. We derive a unified closed-form formula of the optimal portfolio, which is a four-term division. The formula has clear economic meanings, reflecting the behavior of risk aversion, risk seeking, loss aversion and first-order risk aversion. One main finding is that risk-taking behaviors are greatly increased by non-concavity and reduced by non-differentiability.
【11】 Arbitrage-free pricing of CVA for cross-currency swap with wrong-way risk under stochastic correlation modeling framework
Authors: Ashish Kumar, Laszlo Markus, Norbert Hari Affiliations: Eötvös Loránd University, Budapest, Hungary Link: https://arxiv.org/abs/2107.06349 Abstract: A positive correlation between exposure and counterparty credit risk gives rise to the so-called Wrong-Way Risk (WWR). Even a decade after the financial crisis, addressing WWR in both sound and tractable ways remains challenging. Academicians have proposed arbitrage-free set-ups through copula methods but those are computationally expensive and hard to use in practice. Resampling methods are proposed by the industry but they lack mathematical foundations. The purpose of this article is to bridge this gap between the approaches used by academicians and industry. To this end, we propose a stochastic correlation approach to assess WWR. The methods based on constant correlation to model the dependency between exposure and counterparty credit risk assume a linear dependency, and thus fail to capture the tail dependence. Using a stochastic correlation we move further away from the Gaussian copula and can capture the tail risk. This effect is reflected in the results where the impact of stochastic correlation on calculated CVA is substantial when compared to the case when a high constant correlation is assumed between exposure and credit. Given the uncertainty inherent to CVA, the proposed method is believed to provide a promising way to model WWR.
【12】 Comparing Intellectual property policy in the Global North and South -- A one-size-fits-all policy for economic prosperity?
Authors: Madhumitha Raghuraman, Malavika Ranjan, S Sidhartha Narayan Affiliations: Indian Institute of Technology, Madras Link: https://arxiv.org/abs/2107.06855 Abstract: This paper attempts to analyse policymaking in the field of Intellectual Property (IP) as an instrument of economic growth across the Global North and South. It begins by studying the links between economic growth and IP, followed by an understanding of Intellectual Property Rights (IPR) development in the US, a leading proponent of robust IPR protection internationally. The next section compares the IPR in the Global North and South and undertakes an analysis of the diverse factors that result in these differences. The paper uses the case study of the Indian Pharmaceutical Industry to understand how IPR may differentially affect economies and concludes that there may not yet be a one-size-fits-all policy for the adoption of Intellectual Property Rights.
【13】 The threshold strategy for spectrally negative Levy processes and a terminal value at creeping ruin in the objective function
Authors: Chongrui Zhu Link: https://arxiv.org/abs/2107.06841 Abstract: In this paper, a dividend optimization problem with a terminal value at creeping ruin for Levy risk models has been investigated. We consider an insurance company whose surplus process evolves as a spectrally negative Levy process with a Gaussian part and whose objective function is given by cumulative discounted dividend payments and a terminal value at creeping ruin. In view of identities from fluctuation theory, under the restriction on the negative terminal value, we show that the threshold strategy turns out to be the optimal one with threshold level at zero over an admissible class with restricted dividend rates. Furthermore, some sufficient conditions for the positive one have also been given.
2. cs.SD (Sound):
【1】 Federated Self-Training for Semi-Supervised Audio Recognition
Authors: Vasileios Tsouvalas, Aaqib Saeed, Tanir Ozcelebi Affiliations: Eindhoven University of Technology Link: https://arxiv.org/abs/2107.06877 Abstract: Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices like smartphones and virtual assistants, labeling is entrusted to the clients, or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users' devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models. We further demonstrate that self-supervised pre-trained models can accelerate the training of on-device models, significantly improving convergence to within fewer training rounds. We conduct experiments on diverse public audio classification datasets and investigate the performance of our models under varying percentages of labeled and unlabeled data. Notably, we show that with as little as 3% labeled data available, FedSTAR on average can improve the recognition rate by 13.28% compared to the fully supervised federated model.
【2】 Localization Based Sequential Grouping for Continuous Speech Separation
Authors: Zhong-Qiu Wang, DeLiang Wang Affiliations: Mitsubishi Electric Research Laboratories (MERL), USA; Department of Computer Science and Engineering, The Ohio State University, USA Comments: 5 pages, 1 figure Link: https://arxiv.org/abs/2107.06853 Abstract: This study investigates robust speaker localization for continuous speech separation and speaker diarization, where we use speaker directions to group non-contiguous segments of the same speaker. Assuming that speakers do not move and are located in different directions, the direction of arrival (DOA) information provides an informative cue for accurate sequential grouping and speaker diarization. Our system is block-online in the following sense. Given a block of frames with at most two speakers, we apply a two-speaker separation model to separate (and enhance) the speakers, estimate the DOA of each separated speaker, and group the separation results across blocks based on the DOA estimates. Speaker diarization and speaker-attributed speech recognition results on the LibriCSS corpus demonstrate the effectiveness of the proposed algorithm.
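The grouping step described in the abstract above, assigning separated segments to speakers by their estimated direction of arrival, might be sketched as follows. This is a simplified greedy scheme assumed for illustration and is not the authors' algorithm; the names `angular_distance` and `group_by_doa` and the 15-degree tolerance are hypothetical.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two azimuth angles in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def group_by_doa(doas_deg, tolerance_deg=15.0):
    """Label each segment with a speaker index based on its DOA estimate.

    A segment joins the closest existing speaker direction within the
    tolerance; otherwise it starts a new speaker. Each speaker keeps the DOA
    of its first segment as reference, relying on the stated assumption that
    speakers do not move.
    """
    references = []  # one reference direction per discovered speaker
    labels = []
    for doa in doas_deg:
        best_index = None
        best_distance = tolerance_deg
        for index, reference in enumerate(references):
            distance = angular_distance(doa, reference)
            if distance <= best_distance:
                best_index, best_distance = index, distance
        if best_index is None:
            references.append(doa)
            best_index = len(references) - 1
        labels.append(best_index)
    return labels


# DOA estimates (degrees) for five separated segments from successive blocks.
segment_labels = group_by_doa([10.0, 95.0, 12.0, 358.0, 92.0])
```

Because the distance wraps around at 360 degrees, the segment at 358 degrees is grouped with the speaker near 10 degrees rather than treated as a new one.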
【3】 MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation
Authors: Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin Affiliations: School of Information, Renmin University of China Link: https://arxiv.org/abs/2107.06779 Abstract: Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses. However, most works focus on modeling speaker and contextual information primarily on the textual modality or simply leveraging multimodal information through feature concatenation. In order to explore a more effective way of utilizing both multimodal and long-distance contextual information, we propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work. MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency. We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN, which outperforms other SOTA methods by a significant margin under the multimodal conversation setting.
【4】 The Period-Modulated Harmonic Locked Loop (PM-HLL): A low-effort algorithm for rapid time-domain periodicity estimation
Authors: Volker Hohmann Affiliations: Department of Medical Physics and Acoustics, University of Oldenburg, Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany; Cluster of Excellence Hearing4all, Oldenburg, Germany Link: https://arxiv.org/abs/2107.06645 Abstract: Many speech and music analysis and processing schemes rely on an estimate of the fundamental frequency f0 of periodic signal components. Most established schemes apply rather unspecific signal models such as sinusoidal models to the estimation problem, which may limit time resolution and estimation accuracy. This study proposes a novel time-domain locked-loop algorithm with low computational effort and low memory footprint for f0 estimation. The loop control signal is directly derived from the input time signal, using a harmonic signal model. Theoretically, this allows for a noise-robust and rapid f0 estimation for periodic signals of arbitrary waveform, and without the requirement of a prior frequency analysis. Several simulations with short signals employing different types of periodicity and with added wide-band noise were performed to demonstrate and evaluate the basic properties of the proposed algorithm. Depending on the Signal-to-Noise Ratio (SNR), the estimator was found to converge within 3-4 signal repetitions, even at SNR close to or below 0 dB. Furthermore, it was found to follow fundamental frequency sweeps with a delay of less than one period and to track all tones of a three-tone musical chord signal simultaneously. Quasi-periodic sounds with shifted harmonics as well as signals with stochastic periodicity were robustly tracked. Mean and standard deviation of the estimation error, i.e., the difference between true and estimated f0, were at or below 1 Hz in most cases. The results suggest that the proposed algorithm may be applicable to low-delay speech and music analysis and processing.
【5】 ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition
Authors: Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristia, Emmanuel Dupoux, Bertrand Higy, Marvin Lavechin, Okko Räsänen, Chen Yu Affiliations: Dept. of Cognitive Science and AI, Tilburg University, Netherlands; Laboratoire de Sciences Cognitives et Psycholinguistique, ENS, Paris, France; Unit of Computing Sciences, Tampere University, Finland Link: https://arxiv.org/abs/2107.06546 Abstract: We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.
【6】 Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding
Authors: Hongning Zhu, Kong Aik Lee, Haizhou Li Affiliations: School of Computing, National University of Singapore, Singapore; Department of Electrical and Computer Engineering, National University of Singapore, Singapore; Institute for Infocomm Research, A⋆STAR, Singapore Comments: Accepted by Interspeech 2021 Link: https://arxiv.org/abs/2107.06493 Abstract: This paper proposes a serialized multi-layer multi-head attention for neural speaker embedding in text-independent speaker verification. In prior works, frame-level features from one layer are aggregated to form an utterance-level representation. Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms to derive refined features that are more correlated with speakers. The serialized attention mechanism contains a stack of self-attention modules to create fixed-dimensional representations of speakers. Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner. In addition, we employ an input-aware query for each utterance with the statistics pooling. With more layers stacked, the neural network can learn more discriminative speaker embeddings. Experiment results on the VoxCeleb1 dataset and the SITW dataset show that our proposed method outperforms other baseline methods, including x-vectors and other x-vectors + conventional attentive pooling approaches, by 9.7% in EER and 8.1% in DCF0.01.
【7】 Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder
Authors: Manh Luong, Viet Anh Tran Affiliations: Vinai Research; Deezer Research and Development, Paris, France Link: https://arxiv.org/abs/2107.06642 Abstract: Voice conversion is a challenging task which transforms the voice characteristics of a source speaker to a target speaker without changing linguistic content. Recently, there have been many works on many-to-many Voice Conversion (VC) based on Variational Autoencoders (VAEs) achieving good results; however, these methods lack the ability to disentangle speaker identity and linguistic content to achieve good performance on unseen speaker scenarios. In this paper, we propose a new method based on feature disentanglement to tackle many-to-many voice conversion. The method has the capability to disentangle speaker identity and linguistic content from utterances, and it can convert from many source speakers to many target speakers with a single autoencoder network. Moreover, it naturally deals with the unseen target speaker scenarios. We perform both objective and subjective evaluations to show the competitive performance of our proposed method compared with other state-of-the-art models in terms of naturalness and target speaker similarity.
【8】 Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
Authors: Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li Affiliations: Department of Electrical and Computer Engineering, National University of Singapore, Singapore Comments: ACM Multimedia 2021 Link: https://arxiv.org/abs/2107.06592 Abstract: Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD
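The audio-visual cross-attention named in the abstract can be pictured as scaled dot-product attention in which queries come from one modality and keys and values from the other. The following is a minimal sketch under that assumption; which modality supplies the queries, the embedding sizes, and the toy vectors are hypothetical and not taken from the TalkNet code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[d] for w, v in zip(weights, values))
                         for d in range(len(values[0]))])
    return attended

# Hypothetical frame embeddings for two time steps of each modality.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[10.0, 0.0], [0.0, 10.0]]

# Audio queries attend over visual keys and values.
audio_attended = cross_attention(audio, video, video)
```

Each audio frame ends up represented mostly by the visual frame it matches best, which is one way to let the two streams interact before a later self-attention stage looks across time.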
【9】 Multi-Task Audio Source Separation
Authors: Lu Zhang, Chenxing Li, Feng Deng, Xiaorui Wang Affiliations: Kuai Shou Technology Co., Beijing, China; Harbin Institute of Technology, Shenzhen, China Link: https://arxiv.org/abs/2107.06467 Abstract: Audio source separation tasks, such as speech enhancement, speech separation, and music source separation, have achieved impressive performance in recent studies. The powerful modeling capabilities of deep neural networks give us hope for more challenging tasks. This paper launches a new multi-task audio source separation (MTASS) challenge to separate the speech, music, and noise signals from a monaural mixture. First, we introduce the details of this task and generate a dataset of mixtures containing speech, music, and background noises. Then, we propose an MTASS model in the complex domain to fully utilize the differences in spectral characteristics of the three audio signals. In detail, the proposed model follows a two-stage pipeline, which separates the three types of audio signals and then performs signal compensation separately. After comparing different training targets, the complex ratio mask is selected as a more suitable target for MTASS. The experimental results also indicate that the residual signal compensation module helps to recover the signals further. The proposed model shows significant advantages in separation performance over several well-known separation models.
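The complex ratio mask chosen as the training target can be shown on a few time-frequency bins. This is a sketch of the ideal (oracle) mask only, computed from known sources; the bin values, function names, and the small `eps` regularizer are assumptions, and the paper's network, which has to predict such a mask from the mixture alone, is not reproduced.

```python
def complex_ratio_mask(source, mixture, eps=1e-8):
    """Ideal complex ratio mask M = S / Y for each time-frequency bin."""
    return [s / (y + eps) for s, y in zip(source, mixture)]

def apply_mask(mask, mixture):
    """Recover a source estimate as M * Y (complex multiplication)."""
    return [m * y for m, y in zip(mask, mixture)]

# Hypothetical complex spectrogram bins for the three source types.
speech = [1 + 2j, -0.5 + 0.25j, 3 - 1j]
music = [0.5 - 1j, 2 + 2j, -1 + 0.5j]
noise = [0.1 + 0.1j, -0.2j, 0.3 + 0j]
mixture = [s + m + n for s, m, n in zip(speech, music, noise)]

speech_mask = complex_ratio_mask(speech, mixture)
speech_estimate = apply_mask(speech_mask, mixture)
```

Because the mask is complex, it corrects phase as well as magnitude, so the oracle mask recovers the source bin almost exactly; a real-valued magnitude mask could not do this.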
3. eess.AS Audio and Speech Processing:
【1】 Low complexity online convolutional beamforming
Authors: Sebastian Braun, Ivan Tashev Affiliations: Microsoft Research, Redmond, WA, USA Link: https://arxiv.org/abs/2107.06775 Abstract: Convolutional beamformers integrate the multichannel linear prediction model into beamformers, which provide good performance and optimality for joint dereverberation and noise reduction tasks. While longer filters are required to model long reverberation times, the computational burden of current online solutions grows fast with the filter length and number of microphones. In this work, we propose a low-complexity convolutional beamformer using a Kalman-filter-derived affine projection algorithm to solve the adaptive filtering problem. The proposed solution is several orders of magnitude less complex than comparable existing solutions while slightly outperforming them on the REVERB challenge dataset.
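To give a feel for the adaptive filtering problem mentioned above, the sketch below uses plain normalized LMS (NLMS), which coincides with the affine projection algorithm at projection order 1. It is deliberately a swapped-in, simpler technique: it is single-channel, it is not derived from a Kalman filter, and it is not the authors' beamformer. The step size, the 2-tap system, and the signal length are hypothetical.

```python
import random

def nlms_identify(x, d, taps, mu=0.5, eps=1e-8):
    """Adapt a length-`taps` FIR filter w so that its output tracks d."""
    w = [0.0] * taps
    for n in range(taps - 1, len(x)):
        # Most recent sample first: [x[n], x[n-1], ...]
        window = [x[n - k] for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, window))
        error = d[n] - y
        # Normalizing by the window energy keeps the step size scale-free.
        energy = sum(xk * xk for xk in window) + eps
        w = [wk + mu * error * xk / energy for wk, xk in zip(w, window)]
    return w

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_w = [0.5, -0.3]  # hypothetical 2-tap system to identify
d = [true_w[0] * x[n] + true_w[1] * (x[n - 1] if n > 0 else 0.0)
     for n in range(len(x))]
w_est = nlms_identify(x, d, taps=2)
```

Each update costs a number of operations proportional to the filter length; higher projection orders reuse several past input windows per update, which speeds up convergence for correlated inputs at extra cost.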
【2】 Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder
Authors: Manh Luong, Viet Anh Tran Link: https://arxiv.org/abs/2107.06642 Cross-listed with cs.SD; the full entry and abstract appear under cs.SD 【7】 above.
【3】 Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
Authors: Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li Link: https://arxiv.org/abs/2107.06592 Cross-listed with cs.SD; the full entry and abstract appear under cs.SD 【8】 above.
【4】 Multi-Task Audio Source Separation
Authors: Lu Zhang, Chenxing Li, Feng Deng, Xiaorui Wang Link: https://arxiv.org/abs/2107.06467 Cross-listed with cs.SD; the full entry and abstract appear under cs.SD 【9】 above.
【5】 Federated Self-Training for Semi-Supervised Audio Recognition
Authors: Vasileios Tsouvalas, Aaqib Saeed, Tanir Ozcelebi Affiliations: Eindhoven University of Technology Link: https://arxiv.org/abs/2107.06877 Abstract: Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices like smartphones and virtual assistants, labeling is entrusted to the clients, or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users' devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models. We further demonstrate that self-supervised pre-trained models can accelerate the training of on-device models, so that convergence is reached within fewer training rounds. We conduct experiments on diverse public audio classification datasets and investigate the performance of our models under varying percentages of labeled and unlabeled data. Notably, we show that with as little as 3% labeled data available, FedSTAR on average can improve the recognition rate by 13.28% compared to the fully supervised federated model.
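The two ingredients the abstract combines, self-training and federated learning, can each be sketched in a few lines. The code below shows confidence-thresholded pseudo-labeling on a client and sample-weighted parameter averaging on the server. Both are generic textbook forms; the threshold value, function names, and toy numbers are assumptions, and the abstract does not state that FedSTAR uses exactly these rules.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Return (sample_index, class) pairs whose top probability passes the threshold."""
    selected = []
    for index, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best] >= threshold:
            selected.append((index, best))
    return selected

def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
            for d in range(dim)]

# Hypothetical model outputs on three unlabeled clips (two classes).
predictions = [[0.95, 0.05], [0.6, 0.4], [0.02, 0.98]]
labels = pseudo_label(predictions)

# Two clients with 1 and 3 local samples, each holding a 2-parameter model.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only the confidently predicted clips are kept as training targets, so the uncertain second clip is skipped; the server never sees audio, only the parameter vectors it averages.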
【6】 Localization Based Sequential Grouping for Continuous Speech Separation
Authors: Zhong-Qiu Wang, DeLiang Wang Affiliations: Mitsubishi Electric Research Laboratories (MERL), USA; Department of Computer Science and Engineering, The Ohio State University, USA Comments: 5 pages, 1 figure Link: https://arxiv.org/abs/2107.06853 Abstract: This study investigates robust speaker localization for continuous speech separation and speaker diarization, where we use speaker directions to group non-contiguous segments of the same speaker. Assuming that speakers do not move and are located in different directions, the direction of arrival (DOA) information provides an informative cue for accurate sequential grouping and speaker diarization. Our system is block-online in the following sense. Given a block of frames with at most two speakers, we apply a two-speaker separation model to separate (and enhance) the speakers, estimate the DOA of each separated speaker, and group the separation results across blocks based on the DOA estimates. Speaker diarization and speaker-attributed speech recognition results on the LibriCSS corpus demonstrate the effectiveness of the proposed algorithm.
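The grouping step described in the abstract, assigning separated streams to speakers by their DOA across blocks, might look roughly like the greedy sketch below. The 15-degree tolerance, the nearest-track rule, and the toy DOA values are assumptions for illustration; the paper's actual grouping procedure may differ, and the separation and DOA estimation stages are taken as given.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two angles in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)

def group_by_doa(blocks, tolerance=15.0):
    """Assign each separated stream to a speaker track based on its DOA.

    blocks: list of blocks; each block lists one DOA estimate (degrees)
            per separated stream in that block.
    Returns the track reference DOAs and, per block, each stream's track index.
    """
    centres = []      # first DOA observed for each speaker track
    assignments = []
    for doas in blocks:
        block_ids = []
        for doa in doas:
            best = None
            for idx, centre in enumerate(centres):
                dist = angular_distance(doa, centre)
                if dist <= tolerance and (best is None or dist < best[0]):
                    best = (dist, idx)
            if best is None:
                # No existing track is close enough: start a new speaker.
                centres.append(doa)
                block_ids.append(len(centres) - 1)
            else:
                block_ids.append(best[1])
        assignments.append(block_ids)
    return centres, assignments

# Hypothetical per-block DOA estimates, at most two streams per block.
blocks = [[30.0, 200.0], [198.0], [32.0, 120.0], [355.0], [2.0]]
centres, assignments = group_by_doa(blocks)
```

The wrap-around distance matters for the last two blocks: 355 and 2 degrees are only 7 degrees apart and end up in the same track. The sketch relies on the abstract's assumption that speakers stay still and sit in different directions.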
【7】 MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation
Authors: Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin Affiliations: School of Information, Renmin University of China Link: https://arxiv.org/abs/2107.06779 Abstract: Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses. However, most works either focus on modeling speaker and contextual information primarily in the textual modality or simply leverage multimodal information through feature concatenation. In order to explore a more effective way of utilizing both multimodal and long-distance contextual information, we propose a new model based on a multimodal fused graph convolutional network, MMGCN, in this work. MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency. We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN, which outperforms other SOTA methods by a significant margin under the multimodal conversation setting.
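A single graph convolution step, the building block behind a model like MMGCN, can be written out in plain Python. The sketch uses the common symmetric-normalized form ReLU(Â H W) with self-loops; the three-node graph (one text, one audio, and one visual node for the same utterance), the feature values, and the weight matrix are hypothetical, and the paper's actual graph construction and deep variant are not reproduced.

```python
import math

def normalized_adjacency(edges, num_nodes):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_nodes)] for i in range(num_nodes)]

def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj_norm, features, weight):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    hidden = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Nodes 0, 1, 2: text, audio, and visual views of one utterance, fully connected.
adj_norm = normalized_adjacency([(0, 1), (0, 2), (1, 2)], num_nodes=3)
features = [[3.0], [6.0], [0.0]]   # hypothetical 1-dimensional node features
weight = [[1.0]]                   # hypothetical layer weight
fused = gcn_layer(adj_norm, features, weight)
```

After one step every node holds the average of the three modality features, which is how message passing mixes modalities without simply concatenating them.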
【8】 The Period-Modulated Harmonic Locked Loop (PM-HLL): A low-effort algorithm for rapid time-domain periodicity estimation
Authors: Volker Hohmann Affiliations: Department of Medical Physics and Acoustics, University of Oldenburg; HörTech gGmbH; Cluster of Excellence Hearing4all; all Oldenburg, Germany Link: https://arxiv.org/abs/2107.06645 Abstract: Many speech and music analysis and processing schemes rely on an estimate of the fundamental frequency f0 of periodic signal components. Most established schemes apply rather unspecific signal models such as sinusoidal models to the estimation problem, which may limit time resolution and estimation accuracy. This study proposes a novel time-domain locked-loop algorithm with low computational effort and low memory footprint for f0 estimation. The loop control signal is directly derived from the input time signal, using a harmonic signal model. Theoretically, this allows for a noise-robust and rapid f0 estimation for periodic signals of arbitrary waveform, and without the requirement of a prior frequency analysis. Several simulations with short signals employing different types of periodicity and with added wide-band noise were performed to demonstrate and evaluate the basic properties of the proposed algorithm. Depending on the signal-to-noise ratio (SNR), the estimator was found to converge within 3-4 signal repetitions, even at SNR close to or below 0 dB. Furthermore, it was found to follow fundamental frequency sweeps with a delay of less than one period and to track all tones of a three-tone musical chord signal simultaneously. Quasi-periodic sounds with shifted harmonics as well as signals with stochastic periodicity were robustly tracked. Mean and standard deviation of the estimation error, i.e., the difference between true and estimated f0, were at or below 1 Hz in most cases. The results suggest that the proposed algorithm may be applicable to low-delay speech and music analysis and processing.
【9】 ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition
Authors: Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristia, Emmanuel Dupoux, Bertrand Higy, Marvin Lavechin, Okko Räsänen, Chen Yu Affiliations: Dept. of Cognitive Science and AI, Tilburg University, Netherlands; Laboratoire de Sciences Cognitives et Psycholinguistique, ENS, Paris, France; Unit of Computing Sciences, Tampere University, Finland Link: https://arxiv.org/abs/2107.06546 Abstract: We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.
【10】 Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding
Authors: Hongning Zhu, Kong Aik Lee, Haizhou Li Comments: Accepted by Interspeech 2021 Link: https://arxiv.org/abs/2107.06493 Cross-listed with cs.SD; the full entry and abstract appear under cs.SD 【6】 above.