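Finance / Sound / Audio Processing Academic Digest [7.7]

2021-07-27 10:26:55

q-fin (Quantitative Finance): 10 papers

cs.SD (Sound): 6 papers

eess.AS (Audio and Speech Processing): 7 papers

1. q-fin (Quantitative Finance):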

【1】 Countering Misinformation on Social Media Through Educational Interventions: Evidence from a Randomized Experiment in Pakistan

Authors: Ayesha Ali, Ihsan Ayyub Qazi
Link: https://arxiv.org/abs/2107.02775
Abstract: Fake news is a growing problem in developing countries with potentially far-reaching consequences. We conduct a randomized experiment in urban Pakistan to evaluate the effectiveness of two educational interventions to counter misinformation among low-digital-literacy populations. We do not find a significant effect of video-based general educational messages about misinformation. However, when such messages are augmented with personalized feedback based on individuals' past engagement with fake news, we find an improvement of 0.14 standard deviations in identifying fake news. We also find negative but insignificant effects on identifying true news, driven by female respondents. Our results suggest that educational interventions can enable information discernment, but their effectiveness critically depends on how well their features and delivery are customized for the population of interest.

【2】 Collaborative Insurance Sustainability and Network Structure

Authors: Arthur Charpentier, Lariosse Kouakou, Matthias Löwe, Philipp Ratz, Franck Vermet
Affiliations: Université du Québec à Montréal (UQAM), Montréal (Québec), Canada; EURo Institut d'Actuariat (EURIA), Université de Brest, France; University of Münster, Germany
Link: https://arxiv.org/abs/2107.02764
Abstract: The peer-to-peer (P2P) economy has been growing with the advent of the Internet, with well-known brands such as Uber or Airbnb being examples thereof. In the insurance sector the approach is still in its infancy, but some companies have started to explore P2P-based collaborative insurance products (e.g., Lemonade in the U.S. or Inspeer in France). The actuarial literature only recently started to consider those risk sharing mechanisms, as in Denuit and Robert (2021) or Feng et al. (2021). In this paper, we describe and analyse such a P2P product, with some reciprocal risk sharing contracts. Here, we consider the case where policyholders still have an insurance contract, but the first self-insurance layer, below the deductible, can be shared with friends. We study the impact of the shape of the network (through the distribution of degrees) on the risk reduction. We also consider some optimal settings of the reciprocal commitments, and discuss the introduction of contracts with friends of friends to mitigate some possible drawbacks of having people without enough connections to exchange risks.

【3】 Optimal Insurance to Maximize RDEU Under a Distortion-Deviation Premium Principle

Authors: Xiaoqing Liang, Ruodu Wang, Virginia Young
Link: https://arxiv.org/abs/2107.02656
Abstract: In this paper, we study an optimal insurance problem for a risk-averse individual who seeks to maximize the rank-dependent expected utility (RDEU) of her terminal wealth, and insurance is priced via a general distortion-deviation premium principle. We prove necessary and sufficient conditions satisfied by the optimal solution and consider three ambiguity orders to further determine the optimal indemnity. Finally, we analyze examples under three distortion-deviation premium principles to explore the specific conditions under which no insurance or deductible insurance is optimal.
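As a rough illustration of the RDEU criterion named in the abstract, the sketch below evaluates rank-dependent expected utility for a discrete wealth distribution. It assumes one common convention, in which the distortion function is applied to decumulative probabilities; the paper's exact formulation may differ, and the function name `rdeu` and the example numbers are hypothetical.

```python
def rdeu(outcomes, probs, utility, distortion):
    """Rank-dependent expected utility of a discrete wealth distribution.

    Assumed convention (not taken from the paper): outcomes are ranked in
    ascending order and the distortion g, with g(0) = 0 and g(1) = 1, is
    applied to the decumulative probability P(X >= x).
    """
    pairs = sorted(zip(outcomes, probs))  # rank outcomes from worst to best
    total = 0.0
    tail = 1.0  # P(X >= current outcome)
    for x, p in pairs:
        next_tail = max(tail - p, 0.0)  # P(X > current outcome)
        weight = distortion(tail) - distortion(next_tail)
        total += weight * utility(x)
        tail = next_tail
    return total


# Hypothetical example: a 50/50 gamble between wealth 0 and 100.
identity = lambda t: t
pessimistic = lambda t: t ** 2  # convex distortion overweights bad outcomes

plain_eu = rdeu([0, 100], [0.5, 0.5], identity, identity)
distorted = rdeu([0, 100], [0.5, 0.5], identity, pessimistic)
```

With the identity distortion the value reduces to ordinary expected utility; a convex distortion shifts weight toward the worst outcomes and lowers the value.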

【4】 A dynamic version of the super-replication theorem under proportional transaction costs

Authors: Francesca Biagini, Thomas Reitsam
Link: https://arxiv.org/abs/2107.02628
Abstract: We extend the super-replication theorems of [27] in a dynamic setting, both in the numéraire-based as well as in the numéraire-free setting. For this purpose, we generalize the notion of admissible strategies. In particular, we obtain a well-defined super-replication price process, which is right-continuous under some regularity assumptions.

【5】 Approximations to ultimate ruin probabilities with a Wienner process perturbation

Authors: Yacine Koucha, Alfredo D. Egidio dos Reis
Affiliations: Brunel University London, Uxbridge, United Kingdom; Universidade de Lisboa, ISEG, Lisboa, Portugal
Comments: Master dissertation work, 18 pages, 4 figures, 8 numerical tables
Link: https://arxiv.org/abs/2107.02537
Abstract: In this paper, we adapt the classic Cramér-Lundberg collective risk theory model to a perturbed model by adding a Wiener process to the compound Poisson process, which can be used to incorporate premium income uncertainty, interest rate fluctuations and changes in the number of policyholders. Our study is part of a Master's dissertation; our aim is to give a short overview of the infinite-time ruin probabilities for the perturbed risk model and to present some new approximation methods. We present four different approximation methods for the perturbed risk model. The first method is based on iterative upper and lower approximations to the maximal aggregate loss distribution. The second method relies on a four-moment exponential De Vylder approximation. The third method is based on the first-order Padé approximation of the Renyi and De Vylder approximations. The last method is the second-order Padé-Ramsay approximation. These are generated by fitting one, two, three or four moments of the claim amount distribution, which greatly generalizes the approximations. We test the precision of the approximations using a combination of light- and heavy-tailed distributions for the individual claim amount.
We assess the ultimate ruin probability and present numerical results for the exponential, gamma, and mixed exponential claim distributions, demonstrating the high accuracy of these four methods. Analytical and numerical methods are used to highlight the practical implications of our findings.
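For orientation, the sketch below covers only the classical, unperturbed Cramér-Lundberg model, not the Wiener-perturbed model or the four-moment method studied in the paper. It shows the textbook closed-form ruin probability for exponential claims and the classical three-moment De Vylder approximation, which replaces the claim distribution by an exponential one matching the first three moments of the surplus process. Function names and the example numbers are hypothetical.

```python
import math


def ruin_probability_exponential(u, mean_claim, loading):
    """Exact ultimate ruin probability in the classical (unperturbed)
    Cramér-Lundberg model with exponential claims.

    u: initial surplus, mean_claim: mean claim size,
    loading: relative safety loading theta > 0.
    """
    if u < 0 or mean_claim <= 0 or loading <= 0:
        raise ValueError("need u >= 0, mean_claim > 0 and loading > 0")
    rate = loading / ((1.0 + loading) * mean_claim)  # adjustment coefficient
    return math.exp(-rate * u) / (1.0 + loading)


def de_vylder_ruin_probability(u, m1, m2, m3, loading):
    """Classical three-moment De Vylder approximation (unperturbed model).

    m1, m2, m3: first three raw moments of the claim size distribution.
    """
    beta = 3.0 * m2 / m3  # rate of the fitted exponential claims
    theta = 2.0 * m1 * m3 * loading / (3.0 * m2 ** 2)  # fitted loading
    return math.exp(-theta * beta * u / (1.0 + theta)) / (1.0 + theta)


# Hypothetical check: exponential claims with mean 2 have raw moments
# 2, 8 and 48, so the approximation should reproduce the exact formula.
exact = ruin_probability_exponential(5.0, 2.0, 0.25)
approx = de_vylder_ruin_probability(5.0, 2.0, 8.0, 48.0, 0.25)
```

Because the De Vylder fit is itself an exponential-claims model, it is exact when the true claims are exponential, which makes this a convenient sanity check before trying gamma or mixed exponential claims.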

【6】 Predicting Exporters with Machine Learning

Authors: Francesca Micocci, Armando Rungi
Affiliations: IMT School for Advanced Studies
Comments: 40 pages, 10 figures
Link: https://arxiv.org/abs/2107.02512
Abstract: In this contribution, we exploit machine learning techniques to predict out-of-sample firms' ability to export based on the financial accounts of both exporters and non-exporters. We then show how forecasts can be used as exporting scores, i.e., to measure the distance of non-exporters from export status. For our purpose, we train and test various algorithms on the financial reports of 57,021 manufacturing firms in France in 2010-2018. We find that a Bayesian Additive Regression Tree with Missingness In Attributes (BART-MIA) performs better than other techniques, with a prediction accuracy of up to 0.90. Predictions are robust to changes in definitions of exporters and in the presence of discontinuous exporters. Finally, we argue that exporting scores can be helpful for trade promotion, trade credit, and assessing firms' competitiveness. For example, back-of-the-envelope estimates show that a representative firm with just below-average exporting scores needs up to 44% more cash resources and up to 2.5 times more capital expenses to reach full export status.

【7】 Clustering Structure of Microstructure Measures

Authors: Liao Zhu, Ningning Sun, Martin T. Wells
Affiliations: Cornell University, Department of Computer Science
Link: https://arxiv.org/abs/2107.02283
Abstract: This paper builds a clustering model of the market microstructure measures that are popular for predicting stock returns. At a 10-second time frequency, we study the clustering structure of different measures to find the best ones for prediction. In this way, we can predict more accurately with a limited number of predictors, which removes noise and makes the model more interpretable.

【8】 Two Stochastic Control Problems In Capital Structure and Portfolio Choice

Authors: Shan Huang
Link: https://arxiv.org/abs/2107.02242
Abstract: This thesis mainly focuses on two problems in capital structure and individuals' life-cycle portfolio choice. In the first problem, we derive a stochastic control model to optimize banks' dividend and recapitalization policies and calibrate it to a sample of U.S. banks, modeling banks' true accounting asset values as partially observed variables because of the opaqueness of banks' assets. According to the calibrated model, the noise in reported accounting asset values hides about one-third of the true asset return volatility and raises the banks' market equity value by 7.8%, because the noise hides the banks' solvency risk from banking regulators. In particular, banks with a high level of loan loss provisions, nonperforming assets, and real estate loans, and with a low volatility of reported total assets, have noisy accounting asset values. Because of the substantial shock to the true asset values, the banks' assets were more opaque during the recent financial crisis. In the second problem, we present an optimal portfolio selection model with a voluntary retirement option in an economic situation where an investor faces borrowing and short-sale constraints, as well as cointegration between the stock and labor markets. Our model reinterprets the non-participation puzzle in stock investment and early retirement in market booms.
Investor's willingness to retire earlier becomes stronger as risk aversion increases or as wages decline in the long term. Consistent with the empirical evidence, we find that retirement flexibility makes the optimal portfolio invest less in the stock market. We also find that our model-generated portfolio share rises in wealth.

【9】 The global migration network of sex-workers

Authors: Luis E C Rocha, Petter Holme, Claudio D G Linhares
Affiliations: Ghent University, Dept of Economics, Ghent, Belgium; Ghent University, Dept of Physics and Astronomy, Ghent, Belgium; Tokyo Institute of Technology, Tokyo, Japan; University of São Paulo, Institute of Mathematics and Computer Sciences, São Carlos, Brazil
Comments: Comments and feedback welcomed. Two tables and 6 figures including SI
Link: https://arxiv.org/abs/2107.02633
Abstract: Differences in the social and economic environment across countries encourage humans to migrate in search of better living conditions, including job opportunities, higher salaries, security and welfare. Quantifying global migration is, however, challenging because of poor recording, privacy issues and residence status. This is particularly critical for some classes of migrants involved in stigmatised, unregulated or illegal activities. Escorting services or high-end prostitution are well-paid activities that attract workers all around the world. In this paper, we study international migration patterns of sex-workers by using network methods. Using an extensive international online advertisement directory of escorting services and information about individual escorts, we reconstruct a migrant flow network where nodes represent either origin or destination countries. The links represent the direct routes between two countries.
The migration network of sex-workers shows different structural patterns than the migration of the general population. The network contains a strong core where mutual migration is often observed between a group of high-income European countries, yet Europe is split into different network communities with specific ties to non-European countries. We find non-reciprocal relations between countries, with some of them mostly offering workers while others attract them. The GDP per capita is a good indicator of country attractiveness for incoming workers and of service rates, but is unrelated to the probability of emigration. The median financial gain of migrating, in comparison to working in the home country, is 15.9%. Sex-workers from only 77% of the countries gain financially from migration, and average gains decrease with the GDP per capita (GDPc) of the country of origin. Our results show that high-end sex-worker migration is regulated by economic, geographic and cultural aspects.

【10】 Face masks, vaccination rates and low crowding drive the demand for the London Underground during the COVID-19 pandemic

Authors: Prateek Bansal, Roselinde Kessels, Rico Krueger, Daniel J Graham
Affiliations: Transport Strategy Centre, Imperial College London, UK; Department of Data Analytics and Digitalization, Maastricht University, the Netherlands; Transport and Mobility Laboratory, École Polytechnique Fédérale de Lausanne, Switzerland
Link: https://arxiv.org/abs/2107.02394
Abstract: The COVID-19 pandemic has drastically impacted people's travel behaviour and out-of-home activity participation. While countermeasures are being eased with increasing vaccination rates, the demand for public transport remains uncertain. To investigate user preferences to travel by London Underground during the pandemic, we conducted a stated choice experiment among its pre-pandemic users (N=961). We analysed the collected data using multinomial and mixed logit models. Our analysis provides insights into the sensitivity of the demand for the London Underground with respect to travel attributes (crowding density and travel time), the epidemic situation (confirmed new COVID-19 cases), and interventions (vaccination rates and mandatory face masks). Mandatory face masks and higher vaccination rates are the top two drivers of travel demand for the London Underground during COVID-19. The positive impact of vaccination rates on the Underground demand increases with crowding density, and the positive effect of mandatory face masks decreases with travel time. Mixed logit reveals substantial preference heterogeneity.
For instance, while the average effect of mandatory face masks is positive, preferences of around 20% of the pre-pandemic users to travel by the Underground are negatively affected. The estimated demand sensitivities are relevant for supply-demand management in transit systems and the calibration of advanced epidemiological models.
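The multinomial logit model mentioned in the abstract turns systematic utilities into choice probabilities with a softmax. The sketch below is a minimal illustration under assumed, made-up coefficients; the attribute names, coefficient values and alternatives are hypothetical and are not estimates from the paper.

```python
import math


def systematic_utility(attributes, coefficients):
    """Linear-in-parameters utility: sum of coefficient * attribute value."""
    return sum(coefficients[name] * value for name, value in attributes.items())


def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities (numerically stable softmax)."""
    shift = max(utilities)
    exps = [math.exp(v - shift) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical coefficients, chosen only to illustrate the signs discussed
# in the abstract (crowding and travel time reduce utility, masks raise it).
coefficients = {"crowding": -0.5, "travel_time": -0.05, "mask_mandate": 0.8}

trip_without_masks = {"crowding": 2.0, "travel_time": 30.0, "mask_mandate": 0.0}
trip_with_masks = {"crowding": 2.0, "travel_time": 30.0, "mask_mandate": 1.0}
opt_out_utility = -2.0  # hypothetical utility of not travelling by Underground

p_without = mnl_probabilities(
    [systematic_utility(trip_without_masks, coefficients), opt_out_utility]
)
p_with = mnl_probabilities(
    [systematic_utility(trip_with_masks, coefficients), opt_out_utility]
)
```

A mixed logit, as used in the paper to capture preference heterogeneity, would additionally draw the coefficients from a distribution and average these probabilities over the draws.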

2. cs.SD (Sound):

【1】 A Multi-Objective Approach for Sustainable Generative Audio Models

Authors: Constance Douwes, Philippe Esling, Jean-Pierre Briot
Affiliations: IRCAM, Sorbonne Université, CNRS, UMR, Paris, France; Sorbonne Université, CNRS, LIP, Paris, France; UNIRIO, Rio de Janeiro, RJ, Brazil
Comments: 9 pages, 3 figures
Link: https://arxiv.org/abs/2107.02621
Abstract: In recent years, the deep learning community has largely focused on the accuracy of deep generative models, resulting in impressive improvements in several research fields. However, this scientific race for quality comes at a tremendous computational cost, which incurs vast energy consumption and greenhouse gas emissions. If the current exponential growth of computational consumption persists, Artificial Intelligence (AI) will sadly become a considerable contributor to global warming. At the heart of this problem are the measures that we use as a scientific community to evaluate our work. Currently, researchers in the field of AI judge scientific works mostly based on the improvement in accuracy, log-likelihood, reconstruction or opinion scores, all of which entirely obliterate the actual computational cost of generative models. In this paper, we introduce the idea of relying on a multi-objective measure based on Pareto optimality, which simultaneously integrates the models' accuracy and the environmental impact of their training. By applying this measure to the current state of the art in generative audio models, we show that it drastically changes the perceived significance of the results in the field, encouraging optimal training techniques and resource allocation.
We hope that this type of measure will be widely adopted, in order to help the community to better evaluate the significance of their work, while bringing computational cost -- and in fine carbon emissions -- in the spotlight of AI research.
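The Pareto-optimality idea in the abstract can be sketched as follows: a model is kept if no other model is at least as good on every objective and strictly better on one. The sketch assumes both objectives are minimized (for example a reconstruction error and a training energy figure); the model scores below are invented for illustration and do not come from the paper.

```python
def dominates(a, b):
    """True if point a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error, training energy in kWh) pairs for four models.
models = [(0.10, 500.0), (0.12, 100.0), (0.15, 300.0), (0.20, 50.0)]
front = pareto_front(models)
```

Here the third model is dropped because the second one is both more accurate and cheaper to train, while the most accurate and the cheapest models both remain on the front.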

【2】 Self-training with noisy student model and semi-supervised loss function for dcase 2021 challenge task 4

Authors: Nam Kyun Kim, Hong Kook Kim
Affiliations: School of Electrical Engineering and Computer Science, AI Graduate School, Gwangju Institute of Science and Technology, Cheomdangwagi-ro, Gwangju, Republic of Korea
Comments: 5 pages, DCASE 2021 challenge Task 4 technical report
Link: https://arxiv.org/abs/2107.02569
Abstract: This report proposes a polyphonic sound event detection (SED) method for the DCASE 2021 Challenge Task 4. The proposed SED model consists of two stages: a mean-teacher model for providing target labels for weakly labeled or unlabeled data, and a self-training-based noisy student model for predicting strong labels for sound events. The mean-teacher model, which is based on the residual convolutional recurrent neural network (RCRNN) for the teacher and student model, is first trained using all the training data from a weakly labeled dataset, an unlabeled dataset, and a strongly labeled synthetic dataset. Then, the trained mean-teacher model predicts strong labels for each of the weakly labeled and unlabeled datasets, which are passed to the noisy student model in the second stage of the proposed SED model. Here, the structure of the noisy student model is identical to the RCRNN-based student model of the mean-teacher model in the first stage. Then, it is self-trained by adding feature noises, such as time-frequency shift, mixup, SpecAugment, and dropout-based model noise. In addition, a semi-supervised loss function is applied to train the noisy student model, which acts as label noise injection.
The performance of the proposed SED model is evaluated on the validation set of the DCASE 2021 Challenge Task 4, and then, several ensemble models that combine five-fold validation models with different hyperparameters of the semi-supervised loss function are finally selected as our final models.
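In the standard mean-teacher scheme, the teacher's weights are an exponential moving average (EMA) of the student's weights rather than being trained by gradient descent. The sketch below shows only that update on plain Python floats; the decay value, parameter names and the dictionary representation are assumptions for illustration, not details from the report.

```python
def ema_update(teacher, student, alpha=0.999):
    """One mean-teacher step: teacher <- alpha * teacher + (1 - alpha) * student.

    teacher, student: dicts mapping parameter names to float values
    (a stand-in for real network weights). Returns a new teacher dict.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Hypothetical weights: the teacher drifts slowly toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
```

A decay close to 1 makes the teacher a smooth average over many student checkpoints, which is what makes its predictions usable as targets for weakly labeled or unlabeled clips.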

【3】 AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

Authors: Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu
Affiliations: Department of Electronic Engineering, Tsinghua University, China; Microsoft Research Asia, China; Microsoft Azure Speech, China; Department of Electronic Engineering, The Chinese University of Hong Kong, China
Comments: Accepted by INTERSPEECH 2021
Link: https://arxiv.org/abs/2107.02530
Abstract: While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e.g., podcast or conversation), mainly for two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to other speakers' timbre, we fine-tune some parameters in the decoder with a small amount of speech data.
To address the lack of training data, we mine a spontaneous speech dataset to support this work and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.

【4】 Lexical Access Model for Italian -- Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features

Authors: Maria-Gabriella Di Benedetto, Stefanie Shattuck-Hufnagel, Jeung-Yoon Choi, Luca De Nardis, Javier Arango, Ian Chan, Alec DeCaprio
Affiliations: Massachusetts Institute of Technology (MIT), Cambridge, MA, United States; DIET Department, Sapienza University of Rome, Rome, Italy; Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, United States
Comments: Submitted to Language and Speech, 2021
Link: https://arxiv.org/abs/2107.02720
Abstract: Modelling the process that a listener actuates in deriving the words intended by a speaker requires setting a hypothesis on how lexical items are stored in memory. This work aims at developing a system that imitates humans when identifying words in running speech and, in this way, at providing a framework to better understand human speech processing. We build a speech recognizer for Italian based on the principles of Stevens' model of Lexical Access, in which words are stored as hierarchical arrangements of distinctive features (Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am., 111(4):1872-1891). Over the past few decades, the Speech Communication Group at the Massachusetts Institute of Technology (MIT) developed a speech recognition system for English based on this approach.
Italian will be the first language beyond English to be explored; the extension to another language provides the opportunity to test the hypothesis that words are represented in memory as a set of hierarchically-arranged distinctive features, and reveal which of the underlying mechanisms may have a language-independent nature. This paper also introduces a new Lexical Access corpus, the LaMIT database, created and labeled specifically for this work, that will be provided freely to the speech research community. Future developments will test the hypothesis that specific acoustic discontinuities - called landmarks - that serve as cues to features, are language independent, while other cues may be language-dependent, with powerful implications for understanding how the human brain recognizes speech.

【5】 Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm

Authors: Elijah Gutierrez, Pilar Oplustil-Gallegos, Catherine Lai
Affiliations: Linguistics and English Language, University of Edinburgh, United Kingdom; The Centre for Speech Technology Research, University of Edinburgh, United Kingdom
Comments: Accepted to Speech Synthesis Workshop 2019
Link: https://arxiv.org/abs/2107.02527
Abstract: Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation.
However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence.
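One plausible reading of the "probabilistic representation" described in the abstract is a per-word score equal to the fraction of listeners who marked that word as erroneous. The sketch below computes such scores; the data layout (one set of marked word indices per listener) and the example utterance are assumptions for illustration, and the paper's actual scoring may differ.

```python
def error_proportions(words, listener_marks):
    """Fraction of listeners who marked each word as containing an error.

    words: list of words in the utterance.
    listener_marks: one set of marked word indices per listener.
    Returns a list of (word, proportion) pairs in utterance order.
    """
    if not listener_marks:
        raise ValueError("need at least one listener")
    n_listeners = len(listener_marks)
    return [
        (word, sum(1 for marks in listener_marks if i in marks) / n_listeners)
        for i, word in enumerate(words)
    ]


# Hypothetical utterance marked by four listeners.
utterance = ["the", "cat", "sat"]
marks = [{1}, {1, 2}, set(), {1}]
scores = error_proportions(utterance, marks)
```

Averaging these word-level scores over an utterance or a system gives a single number that could be compared with MOS rankings, while the word-level profile shows where the errors cluster.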

【6】 Separation Guided Speaker Diarization in Realistic Mismatched Conditions

Authors: Shu-Tong Niu, Jun Du, Lei Sun, Chin-Hui Lee
Affiliations: University of Science and Technology of China, Hefei, Anhui, P. R. China; Flytek Research, Hefei, Anhui, P. R. China; Georgia Institute of Technology, Atlanta, GA, USA
Link: https://arxiv.org/abs/2107.02357
Abstract: We propose a separation guided speaker diarization (SGSD) approach that fully utilizes the complementarity of speech separation and speaker clustering. Since the conventional clustering-based speaker diarization (CSD) approach cannot handle overlapping speech segments well, we investigate, in this study, separation-based speaker diarization (SSD), which inherently has the potential to handle the speaker overlap regions. Our preliminary analysis shows that the state-of-the-art Conv-TasNet based speech separation, which works quite well on simulated data, is unstable on realistic conversational speech due to the high mismatch of speaking styles in simulated training speech and read speech. Even so, separation-based processing can assist CSD in handling the overlapping speech segments under realistic mismatched conditions. Specifically, several strategies are designed to select between the results of the SSD and CSD systems based on an analysis of the instability of the SSD system's performance. Experiments on the conversational telephone speech (CTS) data from the DIHARD-III Challenge show that the proposed SGSD system can significantly improve the performance of state-of-the-art CSD systems, yielding relative diarization error rate reductions of 20.2% and 20.8% on the development set and evaluation set, respectively.

3. eess.AS (Audio and Speech Processing):

【1】 Lexical Access Model for Italian -- Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features

Authors: Maria-Gabriella Di Benedetto, Stefanie Shattuck-Hufnagel, Jeung-Yoon Choi, Luca De Nardis, Javier Arango, Ian Chan, Alec DeCaprio
Affiliations: Massachusetts Institute of Technology (MIT), Cambridge, MA, United States; DIET Department, Sapienza University of Rome, Rome, Italy; Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, United States
Comments: Submitted to Language and Speech, 2021
Link: https://arxiv.org/abs/2107.02720
Abstract: Modelling the process that a listener actuates in deriving the words intended by a speaker requires setting a hypothesis on how lexical items are stored in memory. This work aims at developing a system that imitates humans when identifying words in running speech and, in this way, at providing a framework to better understand human speech processing. We build a speech recognizer for Italian based on the principles of Stevens' model of Lexical Access, in which words are stored as hierarchical arrangements of distinctive features (Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am., 111(4):1872-1891). Over the past few decades, the Speech Communication Group at the Massachusetts Institute of Technology (MIT) developed a speech recognition system for English based on this approach.
Italian will be the first language beyond English to be explored; the extension to another language provides the opportunity to test the hypothesis that words are represented in memory as a set of hierarchically-arranged distinctive features, and reveal which of the underlying mechanisms may have a language-independent nature. This paper also introduces a new Lexical Access corpus, the LaMIT database, created and labeled specifically for this work, that will be provided freely to the speech research community. Future developments will test the hypothesis that specific acoustic discontinuities - called landmarks - that serve as cues to features, are language independent, while other cues may be language-dependent, with powerful implications for understanding how the human brain recognizes speech.

【2】 Exploiting Single-Channel Speech For Multi-channel End-to-end Speech Recognition

Authors: Keyu An, Zhijian Ou Affiliations: Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University, China Comments: Submitted to ASRU 2021 Link: https://arxiv.org/abs/2107.02670 Abstract: Recently, the end-to-end training approach for neural beamformer-supported multi-channel ASR has shown its effectiveness in multi-channel speech recognition. However, the integration of multiple modules makes it more difficult to perform end-to-end training, particularly given that multi-channel speech corpora recorded in real environments with a sizeable data scale are relatively limited. This paper explores the use of single-channel data to improve the multi-channel end-to-end speech recognition system. Specifically, we design three schemes to exploit the single-channel data, namely pre-training, data scheduling, and data simulation. Extensive experiments on the CHiME4 and AISHELL-4 datasets demonstrate that all three methods improve the multi-channel end-to-end training stability and speech recognition performance, while the data scheduling approach keeps a much simpler pipeline (vs. pre-training) and a lower computation cost (vs. data simulation). Moreover, we give a thorough analysis of our systems, including how the performance is affected by the choice of front-end, the data augmentation, the training strategy, and the single-channel data size.

【3】 Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm

Authors: Elijah Gutierrez, Pilar Oplustil-Gallegos, Catherine Lai Affiliations: Linguistics and English Language, University of Edinburgh, United Kingdom; The Centre for Speech Technology Research, University of Edinburgh, United Kingdom Comments: Accepted to Speech Synthesis Workshop 2019 Link: https://arxiv.org/abs/2107.02527 Abstract: Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation. However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence.
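
The "probabilistic representation" of perceptual errors mentioned in the abstract above can be pictured with a small sketch. The following Python snippet is only an illustration under stated assumptions: it assumes each listener's response is a set of marked word positions, and it computes, for each word, the fraction of listeners who marked it. The function name `error_probabilities` and the sample data are hypothetical and do not come from the paper.

```python
from collections import Counter


def error_probabilities(words, listener_marks):
    """Return, per word position, the fraction of listeners who marked it.

    words: list of words in the utterance.
    listener_marks: one set of marked word indices per listener
                    (hypothetical input format, assumed for this sketch).
    """
    if not listener_marks:
        raise ValueError("at least one listener response is required")
    counts = Counter()
    for marks in listener_marks:
        # A listener contributes at most once per word position.
        counts.update(set(marks))
    n_listeners = len(listener_marks)
    return [counts[i] / n_listeners for i in range(len(words))]


# Hypothetical utterance and four listeners' error marks.
words = ["the", "cat", "sat", "down"]
listener_marks = [{1}, {1, 3}, set(), {3}]
probs = error_probabilities(words, listener_marks)
```

Words with high values are those most listeners agreed on, which is the kind of per-word information a single MOS rating cannot provide.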

【4】 Separation Guided Speaker Diarization in Realistic Mismatched Conditions

Authors: Shu-Tong Niu, Jun Du, Lei Sun, Chin-Hui Lee Affiliations: University of Science and Technology of China, Hefei, Anhui, P. R. China; iFlytek Research, Hefei, Anhui, P. R. China; Georgia Institute of Technology, Atlanta, GA, USA Link: https://arxiv.org/abs/2107.02357 Abstract: We propose a separation guided speaker diarization (SGSD) approach by fully utilizing the complementarity of speech separation and speaker clustering. Since the conventional clustering-based speaker diarization (CSD) approach cannot handle overlapping speech segments well, we investigate, in this study, separation-based speaker diarization (SSD), which inherently has the potential to handle the speaker overlap regions. Our preliminary analysis shows that the state-of-the-art Conv-TasNet based speech separation, which works quite well on the simulation data, is unstable in realistic conversational speech due to the highly mismatched speaking styles in simulated training speech and read speech. In doing so, separation-based processing can assist CSD in handling the overlapping speech segments under the realistic mismatched conditions. Specifically, several strategies are designed to select between the results of SSD and CSD systems based on an analysis of the instability of the SSD system performances. Experiments on the conversational telephone speech (CTS) data from the DIHARD-III Challenge show that the proposed SGSD system can significantly improve the performance of state-of-the-art CSD systems, yielding relative diarization error rate reductions of 20.2% and 20.8% on the development set and evaluation set, respectively.

【5】 A Multi-Objective Approach for Sustainable Generative Audio Models

Authors: Constance Douwes, Philippe Esling, Jean-Pierre Briot Affiliations: IRCAM, Sorbonne Université, CNRS, UMR, Paris, France; Sorbonne Université, CNRS, LIP, Paris, France; UNIRIO, Rio de Janeiro, RJ, Brazil Comments: 9 pages, 3 figures Link: https://arxiv.org/abs/2107.02621 Abstract: In recent years, the deep learning community has largely focused on the accuracy of deep generative models, resulting in impressive improvements in several research fields. However, this scientific race for quality comes at a tremendous computational cost, which incurs vast energy consumption and greenhouse gas emissions. If the current exponential growth of computational consumption persists, Artificial Intelligence (AI) will sadly become a considerable contributor to global warming. At the heart of this problem are the measures that we use as a scientific community to evaluate our work. Currently, researchers in the field of AI judge scientific works mostly based on the improvement in accuracy, log-likelihood, reconstruction or opinion scores, all of which entirely obliterate the actual computational cost of generative models. In this paper, we introduce the idea of relying on a multi-objective measure based on Pareto optimality, which simultaneously integrates the models' accuracy and the environmental impact of their training. By applying this measure to the current state of the art in generative audio models, we show that it drastically changes the perceived significance of the results in the field, encouraging optimal training techniques and resource allocation. We hope that this type of measure will be widely adopted, in order to help the community better evaluate the significance of their work, while bringing computational cost, and ultimately carbon emissions, into the spotlight of AI research.
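
To make the notion of Pareto optimality in the abstract above concrete, here is a minimal Python sketch. It assumes each model is summarized by two numbers, a quality score (higher is better) and a training cost (lower is better), and returns the models that no other model dominates. The model names, scores, and costs are invented for illustration, and the paper's actual measure may be defined differently.

```python
def pareto_front(models):
    """Return names of models not dominated by any other model.

    models: list of (name, quality, cost) tuples, where higher quality
    and lower cost are both preferred (an assumption of this sketch).
    """
    front = []
    for name, quality, cost in models:
        dominated = any(
            # Another model is at least as good on both objectives...
            (q2 >= quality and c2 <= cost)
            # ...and strictly better on at least one of them.
            and (q2 > quality or c2 < cost)
            for _, q2, c2 in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical models: (name, quality score, training cost).
models = [
    ("A", 0.90, 100.0),
    ("B", 0.92, 400.0),
    ("C", 0.88, 150.0),
    ("D", 0.92, 300.0),
]
front = pareto_front(models)
```

In this toy data, B matches D on quality but costs more, and C is worse than A on both axes, so only A and D remain on the front. A ranking by quality alone would not distinguish B from D.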

【6】 Self-training with noisy student model and semi-supervised loss function for DCASE 2021 Challenge Task 4

Authors: Nam Kyun Kim, Hong Kook Kim Affiliations: School of Electrical Engineering and Computer Science, AI Graduate School, Gwangju Institute of Science and Technology, Cheomdangwagi-ro, Gwangju, Republic of Korea Comments: 5 pages, DCASE 2021 Challenge Task 4 technical report Link: https://arxiv.org/abs/2107.02569 Abstract: This report proposes a polyphonic sound event detection (SED) method for the DCASE 2021 Challenge Task 4. The proposed SED model consists of two stages: a mean-teacher model for providing target labels regarding weakly labeled or unlabeled data and a self-training-based noisy student model for predicting strong labels for sound events. The mean-teacher model, which is based on the residual convolutional recurrent neural network (RCRNN) for the teacher and student model, is first trained using all the training data from a weakly labeled dataset, an unlabeled dataset, and a strongly labeled synthetic dataset. Then, the trained mean-teacher model predicts the strong label for each of the weakly labeled and unlabeled datasets, which is brought to the noisy student model in the second stage of the proposed SED model. Here, the structure of the noisy student model is identical to the RCRNN-based student model of the mean-teacher model in the first stage. Then, it is self-trained by adding feature noises, such as time-frequency shift, mixup, SpecAugment, and dropout-based model noise. In addition, a semi-supervised loss function is applied to train the noisy student model, which acts as label noise injection. The performance of the proposed SED model is evaluated on the validation set of the DCASE 2021 Challenge Task 4, and several ensemble models that combine five-fold validation models with different hyperparameters of the semi-supervised loss function are then selected as our final models.
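
For readers unfamiliar with the mean-teacher scheme named in the abstract above: in its usual formulation, the teacher's weights are an exponential moving average (EMA) of the student's weights. The Python sketch below shows only that update rule on plain lists of numbers. The function name `ema_update`, the smoothing factor, and the toy weights are assumptions for illustration; the report's RCRNN training setup is not reproduced here.

```python
def ema_update(teacher, student, alpha=0.999):
    """Move each teacher weight toward the matching student weight.

    teacher, student: equal-length lists of floats standing in for
    model parameters (a simplification assumed for this sketch).
    alpha: EMA decay; values close to 1 make the teacher change slowly.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same size")
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


# Toy weights; alpha=0.5 is chosen only to keep the arithmetic easy to follow.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, alpha=0.5)
```

Because the teacher averages the student over many steps, its predictions tend to be smoother, which is why it is commonly used to produce target labels for weakly labeled or unlabeled data.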

【7】 AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

Authors: Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu Affiliations: Department of Electronic Engineering, Tsinghua University, China; Microsoft Research Asia, China; Microsoft Azure Speech, China; Department of Electronic Engineering, The Chinese University of Hong Kong, China Comments: Accepted by INTERSPEECH 2021 Link: https://arxiv.org/abs/2107.02530 Abstract: While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e.g., podcast or conversation), mainly for two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to other speaker timbre, we fine-tune some parameters in the decoder with little speech data. To address the challenge of lack of training data, we mine a spontaneous speech dataset to support our research and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
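
The mixture-of-experts (MoE) duration predictor described in the abstract above can be sketched in a few lines. The abstract does not say how the three experts' outputs are combined, so the Python snippet below assumes the common soft-gating form: a softmax over gate scores weights the durations proposed by the fast, medium, and slow experts. All names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_duration(expert_durations, gate_logits):
    """Blend per-expert duration predictions with softmax gate weights.

    expert_durations: durations (e.g., in frames) proposed by the fast,
                      medium, and slow experts for one phoneme.
    gate_logits: unnormalized gate scores, one per expert (assumed to
                 come from a small gating network not shown here).
    """
    weights = softmax(gate_logits)
    return sum(w * d for w, d in zip(weights, expert_durations))


# Equal gate scores give each hypothetical expert a weight of 1/3.
duration = moe_duration([3.0, 5.0, 8.0], [0.0, 0.0, 0.0])
```

With soft gating, shifting the gate scores toward one expert moves the predicted duration smoothly toward that expert's speaking rate, which is one plausible way to cover the varied rhythms of spontaneous speech.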
