q-fin (Quantitative Finance): 9 papers
cs.SD (Sound/Speech): 8 papers
eess.AS (Audio and Speech Processing): 13 papers
1. q-fin (Quantitative Finance):
【1】 Efficient Black-Box Importance Sampling for VaR and CVaR Estimation
Authors: Anand Deo, Karthyek Murthy Affiliations: Singapore University of Technology and Design, Somapah Rd, Singapore Link: https://arxiv.org/abs/2106.10236 Abstract: This paper considers Importance Sampling (IS) for the estimation of tail risks of a loss defined in terms of a sophisticated object such as a machine learning feature map or a mixed integer linear optimisation formulation. Assuming only black-box access to the loss and the distribution of the underlying random vector, the paper presents an efficient IS algorithm for estimating the Value at Risk and Conditional Value at Risk. The key challenge in any IS procedure, namely, identifying an appropriate change of measure, is automated with a self-structuring IS transformation that learns and replicates the concentration properties of the conditional excess from less rare samples. The resulting estimators enjoy asymptotically optimal variance reduction when viewed on the logarithmic scale. Simulation experiments highlight the efficacy and practicality of the proposed scheme.
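The black-box transformation is the paper's contribution; the basic mechanics of importance sampling for VaR/CVaR can still be sketched with a classic exponentially tilted proposal. A minimal, self-contained illustration (NOT the paper's self-structuring algorithm; the Gaussian loss and the tilt parameter are assumptions made for this example):

```python
import math
import random

random.seed(0)

# Assumed toy setting: the loss is L ~ N(0, 1). We sample from the
# exponentially tilted proposal N(theta, 1) and reweight each draw by the
# likelihood ratio dP/dQ(x) = exp(-theta * x + theta**2 / 2).

def var_cvar_is(alpha=0.999, theta=3.0, n=200_000):
    draws = []
    for _ in range(n):
        x = random.gauss(theta, 1.0)                     # proposal sample
        w = math.exp(-theta * x + 0.5 * theta * theta)   # likelihood ratio
        draws.append((x, w))
    draws.sort()                                         # sort by loss value
    total = sum(w for _, w in draws)
    # VaR_alpha: smallest loss whose self-normalized weighted CDF reaches alpha.
    acc, var = 0.0, draws[-1][0]
    for x, w in draws:
        acc += w
        if acc / total >= alpha:
            var = x
            break
    # CVaR_alpha: weighted average of losses at or beyond the VaR.
    tail = [(x, w) for x, w in draws if x >= var]
    cvar = sum(x * w for x, w in tail) / sum(w for _, w in tail)
    return var, cvar

var999, cvar999 = var_cvar_is()
# True values for N(0,1): VaR_0.999 ~ 3.090, CVaR_0.999 ~ 3.367.
```

With the tilt placed near the target quantile, the tail is sampled densely; a crude Monte Carlo run of the same size would see only about 200 exceedances at this level.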
【2】 Active labour market policies for the long-term unemployed: New evidence from causal machine learning
Authors: Daniel Goller, Tamara Harrer, Michael Lechner, Joachim Wolff Affiliations: Swiss Institute of Empirical Economic Research, University of St. Gallen; Institute for Employment Research, Nuremberg; Centre for Research in Economics of Education, University of Bern Link: https://arxiv.org/abs/2106.10141 Abstract: We investigate the effectiveness of three different job-search and training programmes for German long-term unemployed persons. On the basis of an extensive administrative data set, we evaluate the effects of those programmes on various levels of aggregation using causal machine learning. We find that participants benefit from the investigated programmes, with placement services being the most effective. Effects are realised quickly and are long-lasting for any programme. While the effects are rather homogeneous for men, we find differential effects for women across various characteristics. Women benefit in particular when local labour market conditions improve. Regarding the allocation mechanism of the unemployed to the different programmes, we find the observed allocation to be as effective as a random allocation. Therefore, we propose data-driven rules for allocating the unemployed to the respective labour market programmes that would improve the status quo.
【3】 Introductory Economics: Gender, Majors, and Future Performance
Authors: Natsuki Arai, Shian Chang, Biing-Shen Kuo Note: 18 pages, 4 figures Link: https://arxiv.org/abs/2106.10091 Abstract: By investigating the exam scores of introductory economics in a business school in Taiwan between 2008 and 2019, we find three sets of results: First, we find no significant difference between genders in the exam scores. Second, students' majors are significantly associated with their exam scores, which likely reflects their academic ability measured at college admission. Third, the exam scores are strong predictors of students' future academic performance.
【4】 Universal Risk Budgeting
Authors: Alex Garivaltis Affiliations: Northern Illinois University Note: 25 pages, 8 figures Link: https://arxiv.org/abs/2106.10030 Abstract: I juxtapose Cover's vaunted universal portfolio selection algorithm (Cover 1991) with the modern representation (Qian 2016; Roncalli 2013) of a portfolio as a certain allocation of risk among the available assets, rather than a mere allocation of capital. Thus, I define a Universal Risk Budgeting scheme that weights each risk budget (instead of each capital budget) by its historical performance record (à la Cover). I prove that my scheme is mathematically equivalent to a novel type of Cover and Ordentlich 1996 universal portfolio that uses a new family of prior densities that have hitherto not appeared in the literature on universal portfolio theory. I argue that my universal risk budget, so defined, is a potentially more perspicuous and flexible type of universal portfolio; it allows the algorithmic trader to incorporate, with advantage, his prior knowledge (or beliefs) about the particular covariance structure of instantaneous asset returns. Say, if there is some dispersion in the volatilities of the available assets, then the uniform (or Dirichlet) priors that are standard in the literature will generate a dangerously lopsided prior distribution over the possible risk budgets. In the author's opinion, the proposed "Garivaltis prior" makes for a nice improvement on Cover's timeless expert system (Cover 1991), one that is properly agnostic and open (from the very get-go) to different risk budgets. Inspired by Jamshidian 1992, the universal risk budget is formulated as a new kind of exotic option in the continuous-time Black and Scholes 1973 market, with all the pleasure, elegance, and convenience that that entails.
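For intuition on what a single risk budget is (not Cover's universal weighting over all of them), here is a minimal sketch: with a diagonal covariance matrix, the portfolio whose risk contributions match prescribed budgets b_i has the closed form w_i proportional to sqrt(b_i)/sigma_i. The volatilities and budgets below are made-up illustration values:

```python
import math

def risk_budget_weights(vols, budgets):
    """Closed-form risk-budgeting weights for a DIAGONAL covariance:
    w_i ~ sqrt(b_i) / sigma_i, normalized to sum to one."""
    raw = [math.sqrt(b) / s for b, s in zip(budgets, vols)]
    total = sum(raw)
    return [r / total for r in raw]

def risk_contributions(weights, vols):
    # With a diagonal covariance, portfolio variance is sum_i (w_i * sigma_i)^2,
    # and asset i's normalized risk contribution is its share of that sum.
    var = sum((w * s) ** 2 for w, s in zip(weights, vols))
    return [(w * s) ** 2 / var for w, s in zip(weights, vols)]

vols = [0.10, 0.20, 0.40]      # hypothetical asset volatilities
budgets = [0.5, 0.3, 0.2]      # target risk budget (sums to 1)
w = risk_budget_weights(vols, budgets)
rc = risk_contributions(w, vols)   # should reproduce the budgets
```

The check that `rc` equals `budgets` is exactly the defining property of a risk budget; general (non-diagonal) covariances require a numerical solver instead of the closed form.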
【5】 Robust deep hedging
Authors: Eva Lütkebohmert, Thorsten Schmidt, Julian Sester Affiliations: Department of Quantitative Finance, Institute for Economic Research, University of Freiburg, Germany; Department of Mathematical Stochastics, Mathematical Institute, University of Freiburg, Germany Link: https://arxiv.org/abs/2106.10024 Abstract: We study pricing and hedging under parameter uncertainty for a class of Markov processes which we call generalized affine processes and which includes the Black-Scholes model as well as the constant elasticity of variance (CEV) model as special cases. Based on a general dynamic programming principle, we are able to link the associated nonlinear expectation to a variational form of the Kolmogorov equation, which opens the door for fast numerical pricing in the robust framework. The main novelty of the paper is that we propose a deep hedging approach which efficiently solves the hedging problem under parameter uncertainty. We numerically evaluate this method on simulated and real data and show that the robust deep hedging approach outperforms existing hedging approaches, in particular in highly volatile periods.
【6】 XRP Network and Proposal of Flow Index
Authors: Hideaki Aoyama Note: 15 pages, 15 figures, for the "Blockchain in Kyoto 2021" JPS Conference Proceedings Link: https://arxiv.org/abs/2106.10012 Abstract: XRP is a modern crypto-asset (cryptocurrency) developed by Ripple Labs, which has been increasing its financial presence. We study its transaction history, available as ledger data. An analysis of its basic statistics, correlations, and network properties is presented. Motivated by the behavior of some nodes with histories of large transactions, we propose a new index: the "Flow Index." The Flow Index is a pair of indices suitable for characterizing the transaction frequencies of a node as a source and as a destination. Using this Flow Index, we study the global structure of the XRP network and construct its bow-tie/walnut structure.
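The core bookkeeping behind a per-node source/destination frequency pair can be sketched in a few lines (a simplification of the paper's Flow Index; the toy transaction list is an assumption for illustration):

```python
from collections import Counter

def flow_index(transactions):
    """transactions: iterable of (source, destination) pairs.
    Returns {node: (frequency as source, frequency as destination)}."""
    src = Counter(s for s, _ in transactions)
    dst = Counter(d for _, d in transactions)
    nodes = set(src) | set(dst)
    return {n: (src[n], dst[n]) for n in nodes}  # Counter returns 0 for missing keys

# Hypothetical ledger excerpt: A pays B and C, B pays C, C pays A.
txs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
fi = flow_index(txs)   # e.g. A is a source twice and a destination once
```

Nodes with a lopsided pair (many outgoing, few incoming, or vice versa) are exactly the "source-like" and "sink-like" nodes the index is meant to surface.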
【7】 Proof-of-Work Cryptocurrencies: Does Mining Technology Undermine Decentralization?
Authors: Agostino Capponi, Sveinn Ólafsson, Humoud Alsabah Link: https://arxiv.org/abs/2106.09783 Abstract: Does the proof-of-work protocol serve its intended purpose of supporting decentralized cryptocurrency mining? To address this question, we develop a game-theoretical model where miners first invest in hardware to improve the efficiency of their operations, and then compete for mining rewards in a rent-seeking game. We argue that because of capacity constraints faced by miners, centralization in mining is lower than indicated by both public discourse and recent academic work. We show that advancements in hardware efficiency do not necessarily lead to larger miners increasing their advantage, but rather allow smaller miners to expand and new miners to enter the competition. Our calibrated model illustrates that hardware efficiency has a small impact on the cost of attacking a network, while the mining reward has a significant impact. This highlights the vulnerability of smaller and emerging cryptocurrencies, as well as of established cryptocurrencies transitioning to a fee-based mining reward scheme.
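The rent-seeking competition mentioned in the abstract is commonly modeled as a Tullock contest. A minimal sketch of the symmetric case (this omits the paper's hardware-investment stage and capacity constraints, which are its actual focus; the parameter values are illustrative):

```python
def payoff(x_i, x_others, reward, cost):
    """Miner i's expected payoff: reward share proportional to own effort,
    minus a linear effort cost."""
    total = x_i + sum(x_others)
    return reward * x_i / total - cost * x_i

def symmetric_equilibrium_effort(n, reward, cost):
    # First-order condition of the symmetric Tullock game:
    # x* = R * (n - 1) / (n^2 * c)
    return reward * (n - 1) / (n ** 2 * cost)

n, R, c = 4, 100.0, 1.0
x_star = symmetric_equilibrium_effort(n, R, c)        # 18.75
eq_payoff = payoff(x_star, [x_star] * (n - 1), R, c)  # R/n - c*x* = 6.25

# Numerically confirm x* is a best response when the other n-1 miners play x*:
# no unilateral deviation on a grid around x* improves the payoff.
others = [x_star] * (n - 1)
best_dev = max(payoff(x_star * k / 100, others, R, c) for k in range(1, 301))
```

Here x* = R(n-1)/(n^2 c) = 18.75, and the grid search finds no profitable unilateral deviation, which is the equilibrium property the paper's richer two-stage model builds on.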
【8】 Chances for the honest in honest versus insider trading
Authors: Mauricio Elizalde, Carlos Escudero Link: https://arxiv.org/abs/2106.10033 Abstract: We study a Black-Scholes market with a finite time horizon and two investors: an honest trader and an insider. We analyze it with anticipating stochastic calculus in two steps. First, we recover the classical result on portfolio optimization showing that the expected logarithmic utility of the insider is strictly greater than that of the honest trader. Then, we prove that, whenever the market is viable, the honest trader can achieve a higher logarithmic utility, and therefore more wealth, than the insider with strictly positive probability. Our proof relies on the analysis of a sort of forward-integral variant of the Doléans-Dade exponential process. The main financial conclusion is that the logarithmic utility is perhaps too conservative for some insiders.
【9】 Centralized systemic risk control in the interbank system: Relaxed control and Gamma-convergence
Authors: Lijun Bo, Tongqing Li, Xiang Yu Affiliations: School of Mathematical Sciences, University of Science and Technology of China Note: Keywords: systemic risk; interbank system; relaxed control; mean field model; stochastic FPK equation; Gamma-convergence Link: https://arxiv.org/abs/2106.09978 Abstract: This paper studies a systemic risk control problem by the central bank, which dynamically plans monetary supply for the interbank system with borrowing and lending activities. Facing both heterogeneity among banks and the common noise, the central bank aims to find an optimal strategy to minimize the average distance between log-monetary reserves and some prescribed capital levels for all banks. A relaxed control approach is adopted, and an optimal randomized control can be obtained in the system with finitely many banks by applying Ekeland's variational principle. As the number of banks grows large, we further prove the convergence of optimal strategies using Gamma-convergence arguments, which yields an optimal relaxed control in the mean field model. It is shown that the limiting optimal relaxed control is linked to a solution of a stochastic Fokker-Planck-Kolmogorov (FPK) equation. The uniqueness of the solution to the stochastic FPK equation is also established under some mild conditions.
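As a rough illustration of the kind of dynamics being controlled, here is a toy Euler simulation of log-reserves steered toward a prescribed capital level under idiosyncratic plus common Brownian noise (the parameters and the simple linear feedback drift are assumptions for illustration, not the paper's optimal relaxed control):

```python
import math
import random

random.seed(1)

def simulate(n_banks=50, steps=200, dt=0.01, kappa=2.0,
             target=1.0, sigma=0.2, sigma_common=0.1):
    """Euler scheme for dX_i = kappa*(target - X_i) dt + sigma dW_i + sigma_common dW0,
    where W0 is a Brownian motion common to all banks."""
    x = [0.0] * n_banks                      # initial log-reserves
    for _ in range(steps):
        common = random.gauss(0.0, 1.0)      # common-noise increment, shared
        for i in range(n_banks):
            incr = kappa * (target - x[i]) * dt          # steering toward target
            incr += sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
            incr += sigma_common * math.sqrt(dt) * common
            x[i] += incr
    return x

final = simulate()
mean_reserve = sum(final) / len(final)   # should hover near the target level
```

Averaging over banks washes out the idiosyncratic noise but not the common noise, which is why the mean-field limit in the paper is described by a stochastic (rather than deterministic) FPK equation.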
2. cs.SD (Sound/Speech):
【1】 Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition
Authors: Ruirui Li, Chelsea J.-T. Ju, Zeya Chen, Hongda Mao, Oguz Elibol, Andreas Stolcke Affiliations: Amazon Alexa Speech Link: https://arxiv.org/abs/2106.10169 Abstract: By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., when either the TD or the TI input is missing. As a solution, we propose a fusion of embeddings network (foenet) architecture, combining joint learning with neural attention. We compare foenet with four competitive baseline methods on a dataset of voice assistant inputs, and show that it achieves higher accuracy than the baseline and score-fusion methods, especially in the presence of incomplete inputs.
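foenet fuses embeddings with neural attention; as a point of contrast, the score-fusion baseline it is compared against can be sketched in a form that stays robust to a missing input (the weights and score values below are illustrative assumptions):

```python
def fuse_scores(td_score=None, ti_score=None, w_td=0.5, w_ti=0.5):
    """Weighted average over whichever speaker-verification scores are present.
    Weights are renormalized so that a missing TD or TI input is simply ignored."""
    pairs = [(w, s) for w, s in ((w_td, td_score), (w_ti, ti_score))
             if s is not None]
    if not pairs:
        raise ValueError("at least one score is required")
    total_w = sum(w for w, _ in pairs)
    return sum(w * s for w, s in pairs) / total_w

both = fuse_scores(td_score=0.9, ti_score=0.7)   # averages the two scores
td_only = fuse_scores(td_score=0.9)              # falls back to the TD score
```

Fusing at the embedding level instead (as foenet does) lets the combiner learn how much to trust each branch per utterance, rather than using fixed weights as here.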
【2】 Synchronising speech segments with musical beats in Mandarin and English singing
Authors: Cong Zhang, Jian Zhu Affiliations: Radboud University; University of Michigan Note: To be published in the Proceedings of Interspeech 2021 Link: https://arxiv.org/abs/2106.10045 Abstract: Generating synthesised singing voice with models trained on speech data has many advantages due to the models' flexibility and controllability. However, since information about the temporal relationship between segments and beats is lacking in speech training data, the synthesised singing may sound off-beat at times. Therefore, the availability of information on the temporal relationship between speech segments and music beats is crucial. The current study investigated segment-beat synchronisation in singing data, with hypotheses formed based on the linguistic theories of the P-centre and the sonority hierarchy. A Mandarin corpus and an English corpus of professional singing data were manually annotated and analysed. The results showed that the presence of musical beats was more dependent on segment duration than sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variations despite exhibiting common patterns.
【3】 Zero-Shot Federated Learning with New Classes for Audio Classification
Authors: Gautham Krishna Gudur, Satheesh K. Perepu Affiliations: Global AI Accelerator, Ericsson; Ericsson Research Note: Accepted at Interspeech 2021. Also accepted at the Distributed and Private Machine Learning (DPML) and Hardware Aware Efficient Training (HAET) workshops at ICLR 2021 Link: https://arxiv.org/abs/2106.10019 Abstract: Federated learning is an effective way of extracting insights from different user devices while preserving the privacy of users. However, new classes with completely unseen data distributions can stream across any device in a federated learning setting, whose data cannot be accessed by the global server or other users. To this end, we propose a unified zero-shot framework to handle these aforementioned challenges during federated learning. We simulate two scenarios here -- 1) when the new class labels are not reported by the user, the traditional FL setting is used; 2) when new class labels are reported by the user, we synthesize Anonymized Data Impressions by calculating class similarity matrices corresponding to each device's new classes, followed by unsupervised clustering to distinguish between new classes across different users. Moreover, our proposed framework can also handle statistical heterogeneities in both labels and models across the participating users. We empirically evaluate our framework on-device across different communication rounds (FL iterations) with new classes in both local and global updates, along with heterogeneous labels and models, on two widely used audio classification applications -- keyword spotting and urban sound classification -- and observe an average deterministic accuracy increase of ~4.041% and ~4.258%, respectively.
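One ingredient of the pipeline, clustering new classes reported by different users via a class-similarity measure, can be sketched as follows (using mean feature vectors as stand-in "impressions", cosine similarity, and a simple threshold; all three are assumptions for illustration, not the paper's exact construction):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_new_classes(impressions, threshold=0.95):
    """impressions: {class_id: feature vector}. Greedily merge classes whose
    cosine similarity to a cluster's first member exceeds the threshold
    (unsupervised: no raw data or labels are shared across users)."""
    clusters = []
    for cid, vec in impressions.items():
        for cluster in clusters:
            if cosine(vec, impressions[cluster[0]]) >= threshold:
                cluster.append(cid)
                break
        else:
            clusters.append([cid])
    return clusters

# Toy vectors: two users report near-identical "siren"-like classes,
# a third user reports a distinct "drill"-like class.
imps = {"user1/siren": [1.0, 0.1, 0.0],
        "user2/alarm": [0.98, 0.12, 0.01],
        "user3/drill": [0.0, 0.2, 1.0]}
clusters = cluster_new_classes(imps)
```

The two near-identical classes end up in one cluster and the distinct class in another, which is the behaviour needed to reconcile new classes across users without exchanging their data.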
【4】 Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS
Authors: Xiaochun An, Frank K. Soong, Lei Xie Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Microsoft China Link: https://arxiv.org/abs/2106.10003 Abstract: End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, with disjoint, multi-style datasets, i.e., datasets of different styles are recorded, each individual style by one speaker with multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to "fool" a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; and 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of the new approach is better and more robust than that of the four baseline systems of the prior art.
【5】 PixInWav: Residual Steganography for Hiding Pixels in Audio
Authors: Margarita Geleta, Cristina Puntí, Kevin McGuinness, Jordi Pons, Cristian Canton, Xavier Giró-i-Nieto Affiliations: Universitat Politècnica de Catalunya; Dublin City University; Dolby Labs; Institut de Robòtica i Informàtica Industrial, CSIC-UPC; Barcelona Supercomputing Center Note: Extended abstract presented at the CVPR 2021 Women in Computer Vision Workshop Link: https://arxiv.org/abs/2106.09814 Abstract: Steganography comprises the mechanics of hiding data in a host media that may be publicly available. While previous works focused on unimodal setups (e.g., hiding images in images, or hiding audio in audio), PixInWav targets the multimodal case of hiding images in audio. To this end, we propose a novel residual architecture operating on top of short-time discrete cosine transform (STDCT) audio spectrograms. Among our results, we find that the residual audio steganography setup we propose allows independent encoding of the hidden image from the host audio without compromising quality. Accordingly, while previous works require both host and hidden signals to hide a signal, PixInWav can encode images offline -- which can be later hidden, in a residual fashion, into any audio signal. Finally, we test our scheme in a lab setting to transmit images over airwaves from a loudspeaker to a microphone, verifying our theoretical insights and obtaining promising results.
【6】 On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech
Authors: Katrin Tomanek, Françoise Beaufays, Julie Cattiau, Angad Chandorkar, Khe Chai Sim Affiliations: Google, USA Link: https://arxiv.org/abs/2106.10259 Abstract: While current state-of-the-art Automatic Speech Recognition (ASR) systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns. Personalization of ASR models, a commonly applied solution to this problem, is usually performed in a server-based training environment, posing problems around data privacy, delayed model-update times, and communication cost for copying data and models between mobile device and server infrastructure. In this paper, we present an approach to on-device ASR personalization with very small amounts of speaker-specific data. We test our approach on a diverse set of 100 speakers with disordered speech and find a median relative word error rate improvement of 71% with only 50 short utterances required per speaker. When tested on a voice-controlled home automation platform, on-device personalized models show a median task success rate of 81%, compared to only 40% for the unadapted models.
【7】 A learned conditional prior for the VAE acoustic space of a TTS system
Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman Affiliations: Amazon Research, Cambridge, United Kingdom Note: In Proceedings of Interspeech 2021 Link: https://arxiv.org/abs/2106.10229 Abstract: Many factors influence speech, yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure. By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE taking the conditioning explicitly into account, resulting in samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates a significant preference for the proposed approach over a standard Conditional VAE. We also provide visualisations of the latent space, where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.
【8】 Multi-mode Transformer Transducer with Stochastic Future Context
Authors: Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe Affiliations: ASAPP, USA; Carnegie Mellon University, USA Note: Accepted to Interspeech 2021 Link: https://arxiv.org/abs/2106.09760 Abstract: Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naively, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Instead, a more desirable approach is to have a single model that can dynamically adjust its latency based on different constraints, which we refer to as Multi-mode ASR. A Multi-mode ASR model can fulfill various latency requirements during inference -- when a larger latency becomes acceptable, the model can process longer future context to achieve higher accuracy, and when a latency budget is not flexible, the model can be less dependent on future context but still achieve reliable accuracy. In pursuit of Multi-mode ASR, we propose Stochastic Future Context, a simple training procedure that samples one streaming configuration in each iteration. Through extensive experiments on AISHELL-1 and LibriSpeech datasets, we show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
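The training trick named in the title can be sketched directly: at every iteration, sample one streaming configuration (how many future frames the encoder may attend to) and build the corresponding attention mask. The context sizes below are illustrative assumptions, and the actual transducer forward/backward pass is omitted:

```python
import random

random.seed(7)

FUTURE_CONTEXTS = [0, 4, 16, 10**9]   # frames of lookahead; a huge value ~ full context

def attention_mask(seq_len, future):
    """mask[t][s] = 1 where frame t may attend to frame s,
    allowing at most `future` frames ahead of t."""
    return [[1 if s <= t + future else 0 for s in range(seq_len)]
            for t in range(seq_len)]

def training_step(seq_len):
    future = random.choice(FUTURE_CONTEXTS)   # one configuration per iteration
    mask = attention_mask(seq_len, future)
    # ... encoder forward/backward pass constrained by `mask` would go here ...
    return future, mask

future, mask = training_step(seq_len=6)
strict_stream = attention_mask(3, 0)   # zero-lookahead (fully streaming) mask
```

At inference time, the same trained weights are run with whichever fixed mask matches the deployment's latency budget, which is what makes the single model "multi-mode".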
3. eess.AS (Audio and Speech Processing):
【1】 On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech
Authors: Katrin Tomanek, Françoise Beaufays, Julie Cattiau, Angad Chandorkar, Khe Chai Sim Affiliations: Google, USA Link: https://arxiv.org/abs/2106.10259 Abstract: While current state-of-the-art Automatic Speech Recognition (ASR) systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns. Personalization of ASR models, a commonly applied solution to this problem, is usually performed in a server-based training environment, posing problems around data privacy, delayed model-update times, and communication cost for copying data and models between mobile device and server infrastructure. In this paper, we present an approach to on-device ASR personalization with very small amounts of speaker-specific data. We test our approach on a diverse set of 100 speakers with disordered speech and find a median relative word error rate improvement of 71% with only 50 short utterances required per speaker. When tested on a voice-controlled home automation platform, on-device personalized models show a median task success rate of 81%, compared to only 40% for the unadapted models.
【2】 A learned conditional prior for the VAE acoustic space of a TTS system
Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman Affiliations: Amazon Research, Cambridge, United Kingdom Note: In Proceedings of Interspeech 2021 Link: https://arxiv.org/abs/2106.10229 Abstract: Many factors influence speech, yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure. By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE taking the conditioning explicitly into account, resulting in samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates a significant preference for the proposed approach over a standard Conditional VAE. We also provide visualisations of the latent space, where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.
【3】 Golos: Russian Dataset for Speech Research
Authors: Nikolay Karpov, Alexander Denisenko, Fedor Minkin Affiliations: Sber, Russia Note: 5 pages, 3 figures, accepted to Interspeech 2021 Link: https://arxiv.org/abs/2106.10161 Abstract: This paper introduces a novel Russian speech dataset called Golos, a large corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. We have made the corpus freely available for download, along with the acoustic model with CTC loss prepared on this corpus. Additionally, transfer learning was applied to improve the performance of the acoustic model. In order to evaluate the quality of the dataset with the beam-search algorithm, we have built a 3-gram language model on the open Common Crawl dataset. The resulting word error rate (WER) metrics turned out to be about 3.3% and 11.5%.
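The WER figures quoted above follow the standard definition: word-level Levenshtein (edit) distance divided by the number of reference words. A minimal sketch with a made-up sentence pair:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words: WER = 2/6.
score = wer("the cat sat on the mat", "the cat sit on mat")
```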
【4】 VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion 标题:VQMIVC:基于矢量量化和互信息的无监督语音表示解纠缠单样本语音转换
作者:Disong Wang,Liqun Deng,Yu Ting Yeung,Xiao Chen,Xunying Liu,Helen Meng 机构:The Chinese University of Hong Kong, Hong Kong SAR, China, Huawei Noah’s Ark Lab 备注:Accepted to Interspeech 2021. Code, pre-trained models and demo are available at this https URL 链接:https://arxiv.org/abs/2106.10132 摘要:单样本语音转换(VC)只需一个目标说话人的话语作为参考,即可在任意说话人之间进行转换,这可以通过语音表示解纠缠有效实现。现有的研究普遍忽略了训练过程中不同语音表征之间的相关性,导致内容信息泄漏到说话人表征中,从而降低了VC的性能。为了缓解这一问题,我们采用矢量量化(VQ)进行内容编码,并在训练过程中引入互信息(MI)作为相关性度量,通过无监督方式减少内容、说话人和基音表示的相互依赖,实现三者的适当解纠缠。实验结果表明,该方法能有效地学习解纠缠的语音表征,在保留源语言内容和语调变化的同时捕捉目标说话人特征。因此,该方法比现有最先进的单样本VC系统获得了更高的语音自然度和说话人相似度。我们的代码、预训练模型和演示可在https://github.com/Wendison/VQMIVC 获取。 摘要:One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations, while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC.
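其中矢量量化(VQ)内容编码的核心操作是最近邻码本查找:把连续的内容向量替换为最接近的码本条目,从而离散化。下面是一个与论文实现无关的极简草图,码本与输入均为假设的玩具数据:

```python
def vector_quantize(frame, codebook):
    # 最近邻查找:返回距离最小的码本索引及对应码字(VQ的离散化一步)
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
    return idx, codebook[idx]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # 玩具码本
idx, code = vector_quantize([0.9, 0.2], codebook)  # 最接近第二个码字
```

训练中通常还需配合直通估计(straight-through estimator)回传梯度,这里仅演示前向的量化查找。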
【5】 Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization 标题:基于域对抗训练和互信息最小化的无监督域自适应构音障碍语音检测
作者:Disong Wang,Liqun Deng,Yu Ting Yeung,Xiao Chen,Xunying Liu,Helen Meng 机构:The Chinese University of Hong Kong, Hong Kong SAR, China, Huawei Noah’s Ark Lab 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2106.10127 摘要:构音障碍语音检测(DSD)系统旨在从语音中检测神经运动障碍的特征。这种系统特别容易受到域不匹配的影响,其中训练和测试数据分别来自源域和目标域,但这两个域在语音刺激、疾病病因等方面可能不同。由于注释大量数据集的高成本,很难获得目标域中的标记数据。本文首次尝试将跨域DSD描述为一个无监督域自适应(UDA)问题。我们使用标记的源域数据和未标记的目标域数据,提出了一种多任务学习策略,包括构音障碍存在分类(DPC)、域对抗训练(DAT)和互信息最小化(MIM),旨在学习构音障碍区分性和域不变的生物标记嵌入。具体来说,DPC有助于生物标记物嵌入捕获构音障碍的关键指标;DAT迫使生物标记嵌入在源域和靶域无法区分;MIM进一步降低了生物标记嵌入与领域相关线索之间的相关性。将UASPEECH和TORGO语料库分别作为源域和目标域,实验结果表明,UDA的加入在话语级加权平均回忆和说话人级准确率上分别获得了22.2%和20.0%的绝对提高。 摘要:Dysarthric speech detection (DSD) systems aim to detect characteristics of the neuromotor disorder from speech. Such systems are particularly susceptible to domain mismatch where the training and testing data come from the source and target domains respectively, but the two domains may differ in terms of speech stimuli, disease etiology, etc. It is hard to acquire labelled data in the target domain, due to high costs of annotating sizeable datasets. This paper makes a first attempt to formulate cross-domain DSD as an unsupervised domain adaptation (UDA) problem. We use labelled source-domain data and unlabelled target-domain data, and propose a multi-task learning strategy, including dysarthria presence classification (DPC), domain adversarial training (DAT) and mutual information minimization (MIM), which aim to learn dysarthria-discriminative and domain-invariant biomarker embeddings. Specifically, DPC helps biomarker embeddings capture critical indicators of dysarthria; DAT forces biomarker embeddings to be indistinguishable in source and target domains; and MIM further reduces the correlation between biomarker embeddings and domain-related cues. 
By treating the UASPEECH and TORGO corpora respectively as the source and target domains, experiments show that the incorporation of UDA attains absolute increases of 22.2% and 20.0% respectively in utterance-level weighted average recall and speaker-level accuracy.
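域对抗训练(DAT)通常通过梯度反转层实现:前向为恒等映射,反向把来自域分类器的梯度取负,使特征提取器学到域不可分的表征。下面是一个脱离任何深度学习框架的数值示意,并非论文代码:

```python
def grad_reverse(x, grad_from_domain_head, lam=1.0):
    # 前向:恒等;反向:来自域分类器的梯度乘以-lam后传给特征提取器,
    # 从而让特征提取器朝“让域分类器分不清源域/目标域”的方向更新。
    forward = x
    backward = [-lam * g for g in grad_from_domain_head]
    return forward, backward

feat = [0.5, -0.2]                                # 玩具生物标记嵌入
fwd, bwd = grad_reverse(feat, [0.1, -0.3], lam=1.0)
```

实际实现中这一层通常写成自动微分框架里的自定义反向函数,lam可随训练进程逐渐增大。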
【6】 Low Resource German ASR with Untranscribed Data Spoken by Non-native Children -- INTERSPEECH 2021 Shared Task SPAPL System 标题:非母语儿童未转录语音数据下的低资源德语ASR--INTERSPEECH 2021共享任务SPAPL系统
作者:Jinhan Wang,Yunzheng Zhu,Ruchao Fan,Wei Chu,Abeer Alwan 机构:Department of Electrical and Computer Engineering, University of California Los Angeles, USA, PAII Inc., USA 备注:Accepted to INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.09963 摘要:本文介绍了SPAPL系统在INTERSPEECH 2021挑战赛“德语非母语儿童语音自动识别”共享任务中的应用。该任务提供约5小时的转录数据和约60小时的未转录数据,用于开发面向儿童的德语ASR系统。对于转录数据的训练,我们提出了一种非语音状态判别损失(NSDL),以减轻话语内长时非语音片段的影响。为了探索未转录数据的使用,我们实现并组合了多种方法,以逐步提高系统性能。首先,使用双向自回归预测编码(Bi-APC)学习声学模型的初始参数。其次,进一步利用增量半监督学习迭代生成伪转录数据。第三,在不同的训练阶段采用不同的数据扩增方案,以增加训练数据的多样性和规模。最后,采用递归神经网络语言模型(RNNLM)进行重打分。我们的系统在评估数据上取得了39.68%的字错误率(WER),相对官方基线(45.21%)提高约12%。 摘要:This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German. ~ 5 hours of transcribed data and ~ 60 hours of untranscribed data are provided to develop a German ASR system for children. For the training of the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances. In order to explore the use of the untranscribed data, various approaches are implemented and combined together to incrementally improve the system performance. First, bidirectional autoregressive predictive coding (Bi-APC) is used to learn initial parameters for acoustic modelling using the provided untranscribed data. Second, incremental semi-supervised learning is further used to iteratively generate pseudo-transcribed data. Third, different data augmentation schemes are used at different training stages to increase the variability and size of the training data. Finally, a recurrent neural network language model (RNNLM) is used for rescoring. Our system achieves a word error rate (WER) of 39.68% on the evaluation data, an approximately 12% relative improvement over the official baseline (45.21%).
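增量半监督学习里“按置信度筛选伪转录并入训练池”这一步可示意如下。阈值与样例数据均为假设,并非该系统的实际设置:

```python
def select_pseudo_labels(hyps, threshold=0.9):
    # 只保留模型置信度不低于阈值的识别假设;
    # 被接受的(音频, 伪转录)对在下一轮迭代中加入训练数据。
    return [(utt, text) for utt, text, conf in hyps if conf >= threshold]

hyps = [("u1", "guten tag", 0.95),   # 高置信度 -> 保留
        ("u2", "???", 0.40),         # 低置信度 -> 丢弃
        ("u3", "hallo", 0.91)]
kept = select_pseudo_labels(hyps)
```

每轮迭代后用扩充过的训练池重训模型,再对剩余未转录数据重新解码与筛选。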
【7】 An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition 标题:一种用于自动语音识别的改进单步非自回归Transformer
作者:Ruchao Fan,Wei Chu,Peng Chang,Jing Xiao,Abeer Alwan 机构:Dept. of Electrical and Computer Engineering, University of California, Los Angeles, USA, PAII Inc., USA 备注:To appear in Interspeech2021 链接:https://arxiv.org/abs/2106.09885 摘要:非自回归机制可以显著减少语音Transformer的推理时间,特别是在使用单步变体时。先前关于基于CTC对齐的单步非自回归Transformer(CASS-NAT)的工作表明,与自回归Transformer(AT)相比,实时率(RTF)有很大提升。在这项工作中,我们提出了几种提高端到端CASS-NAT准确率的方法,并进行了性能分析。首先,卷积增强的自注意力块被应用于编码器和解码器模块。其次,我们建议扩展每个令牌的触发掩码(声学边界),以增强CTC对齐的鲁棒性。此外,采用迭代损失函数来增强低层参数的梯度更新。在不使用外部语言模型的情况下,使用这三种方法的改进CASS-NAT在Librispeech test clean/other测试集上的WER为3.1%/7.2%,在Aishell1测试集上的CER为5.4%,相对WER/CER提升7%~21%。为了进行分析,我们绘制了解码器中的注意力权重分布图,以可视化令牌级声学嵌入之间的关系。当声学嵌入被可视化时,我们发现它们具有与词嵌入相似的行为,这解释了为什么改进的CASS-NAT具有与AT相似的性能。 摘要:Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments. In addition, iterated loss functions are used to enhance the gradient update of low-layer parameters. Without using an external language model, the WERs of the improved CASS-NAT, when using the three methods, are 3.1%/7.2% on Librispeech test clean/other sets and the CER is 5.4% on the Aishell1 test set, achieving a 7%~21% relative WER/CER improvement. For the analyses, we plot attention weight distributions in the decoders to visualize the relationships between token-level acoustic embeddings. 
When the acoustic embeddings are visualized, we find that they have a similar behavior to word embeddings, which explains why the improved CASS-NAT performs similarly to AT.
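“扩展每个令牌的触发掩码(声学边界)”可以理解为把CTC对齐给出的每个令牌帧区间向两侧各放宽若干帧,以容忍对齐误差。如下草图演示这一操作,边界与帧数均为玩具数值:

```python
def expand_trigger_mask(boundaries, margin, n_frames):
    # 将每个令牌的(起始帧, 结束帧)区间向两侧各放宽margin帧,
    # 并裁剪到话语范围内,使触发掩码对不精确的CTC对齐更鲁棒。
    return [(max(0, s - margin), min(n_frames, e + margin)) for s, e in boundaries]

spans = expand_trigger_mask([(0, 10), (10, 25), (25, 30)], margin=2, n_frames=30)
```

放宽后相邻令牌的帧区间会部分重叠,这是有意为之:每个令牌能看到边界附近更多的声学上下文。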
【8】 Multi-mode Transformer Transducer with Stochastic Future Context 标题:具有随机未来上下文的多模式Transformer Transducer
作者:Kwangyoun Kim,Felix Wu,Prashant Sridhar,Kyu J. Han,Shinji Watanabe 机构:ASAPP, USA, Carnegie Mellon University, USA 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2106.09760 摘要:自动语音识别(ASR)模型在以上下文形式获得更多周围语音信息时错误率更低。不幸的是,获取更长的未来上下文会导致更高的延迟,速度和准确率之间存在不可避免的权衡。朴素的做法是存储多个模型,并在约束条件下选择最佳模型,以适应不同的延迟需求。更理想的方法则是用单一模型根据不同约束动态调整其延迟,我们称之为多模式ASR。多模式ASR模型可以满足推理过程中的各种延迟要求:当可接受更大的延迟时,模型可以处理更长的未来上下文以获得更高的准确率;当延迟预算不灵活时,模型可以减少对未来上下文的依赖,但仍能获得可靠的准确率。为实现多模式ASR,我们提出了随机未来上下文(Stochastic Future Context),一种在每次迭代中采样一种流式配置的简单训练过程。通过在AISHELL-1和LibriSpeech数据集上的大量实验,我们表明一个多模式ASR模型可以媲美甚至超越一组用不同延迟预算训练的有竞争力的流式基线。 摘要:Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naively, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Instead, a more desirable approach is to have a single model that can dynamically adjust its latency based on different constraints, which we refer to as Multi-mode ASR. A Multi-mode ASR model can fulfill various latency requirements during inference -- when a larger latency becomes acceptable, the model can process longer future context to achieve higher accuracy and when a latency budget is not flexible, the model can be less dependent on future context but still achieve reliable accuracy. In pursuit of Multi-mode ASR, we propose Stochastic Future Context, a simple training procedure that samples one streaming configuration in each iteration. Through extensive experiments on AISHELL-1 and LibriSpeech datasets, we show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
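“每次迭代采样一种流式配置”的训练过程骨架可示意如下。候选的未来上下文长度为假设值(None表示全上下文的非流式模式),并非论文的实际配置:

```python
import random

def sample_streaming_config(rng, choices=(0, 2, 4, 8, None)):
    # 每个训练迭代随机抽取一种未来上下文大小(帧数);
    # None代表全上下文模式,使同一模型同时学会流式与非流式行为。
    return rng.choice(choices)

rng = random.Random(7)
configs = [sample_streaming_config(rng) for _ in range(100)]  # 模拟100次迭代的抽样
```

推理时按延迟预算固定选用其中一种配置即可,无需重新训练或存储多个模型。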
【9】 Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition 标题:融合嵌入网络以稳健结合文本相关与文本无关说话人识别
作者:Ruirui Li,Chelsea J. -T. Ju,Zeya Chen,Hongda Mao,Oguz Elibol,Andreas Stolcke 机构:Chelsea J.-T. Ju, Amazon Alexa Speech 链接:https://arxiv.org/abs/2106.10169 摘要:通过基于用户的语音输入隐式地识别用户,说话人识别可以实现许多下游应用,例如个性化的系统行为和快速购物结账。基于语音内容是否受限,可以使用文本相关(TD)和文本无关(TI)说话人识别模型。我们希望通过一个集成系统结合这两类模型的优点,做出更可靠的预测。然而,任何这样的组合方法都必须对不完整的输入具有鲁棒性,即当TD或TI输入缺失时。作为解决方案,我们提出了一种融合嵌入网络(foenet)架构,将联合学习与神经注意力相结合。我们在一个语音助手输入数据集上将foenet与四种有竞争力的基线方法进行了比较,结果表明它比基线和分数融合方法取得了更高的准确率,特别是在存在不完整输入的情况下。 摘要:By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., when either TD or TI input is missing. As a solution we propose a fusion of embeddings network foenet architecture, combining joint learning with neural attention. We compare foenet with four competitive baseline methods on a dataset of voice assistant inputs, and show that it achieves higher accuracy than the baseline and score fusion methods, especially in the presence of incomplete inputs.
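作为对比基线之一的分数融合,在输入不完整时需要回退策略。下面是一个带缺失处理的加权分数融合草图,权重为假设值,演示的是基线方法而非foenet本身:

```python
def fuse_scores(td_score, ti_score, w=0.5):
    # 加权分数级融合;当TD或TI分支缺失(None)时回退到存在的那一支。
    if td_score is None:
        return ti_score
    if ti_score is None:
        return td_score
    return w * td_score + (1 - w) * ti_score

both = fuse_scores(0.8, 0.6)       # 两支都在:加权平均
td_only = fuse_scores(0.8, None)   # TI缺失:回退到TD
```

foenet的不同之处在于在嵌入层面融合并用神经注意力学习权重,而不是这种固定权重的分数平均。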
【10】 Synchronising speech segments with musical beats in Mandarin and English singing 标题:普通话和英语演唱中语音片段与音乐节拍的同步
作者:Cong Zhang,Jian Zhu 机构:Radboud University, University of Michigan 备注:To be published in the Proceeding of Interspeech 2021 链接:https://arxiv.org/abs/2106.10045 摘要:利用在语音数据上训练的模型生成合成歌声,因模型的灵活性和可控性而具有许多优点。然而,由于语音训练数据中缺乏关于音段与节拍之间时间关系的信息,合成的歌唱有时可能听起来不合节拍。因此,获得语音音段和音乐节拍之间时间关系的信息至关重要。本文基于P-中心(P-centre)和响度层级(sonority hierarchy)的语言学理论提出假设,研究了歌唱数据中的音段-节拍同步问题。我们对一个普通话和一个英语的专业歌唱语料库进行了人工标注和分析。结果表明,音乐节拍的出现更多地取决于音段时长而非响度。然而,响度层级和P-中心理论与节拍的位置高度相关。普通话和英语虽表现出共同的模式,但也存在跨语言差异。 摘要:Generating synthesised singing voice with models trained on speech data has many advantages due to the models' flexibility and controllability. However, since the information about the temporal relationship between segments and beats are lacking in speech training data, the synthesised singing may sound off-beat at times. Therefore, the availability of the information on the temporal relationship between speech segments and music beats is crucial. The current study investigated the segment-beat synchronisation in singing data, with hypotheses formed based on the linguistics theories of P-centre and sonority hierarchy. A Mandarin corpus and an English corpus of professional singing data were manually annotated and analysed. The results showed that the presence of musical beats was more dependent on segment duration than sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variations despite exhibiting common patterns.
【11】 Zero-Shot Federated Learning with New Classes for Audio Classification 标题:用于音频分类的支持新类的零样本联邦学习
作者:Gautham Krishna Gudur,Satheesh K. Perepu 机构:Global AI Accelerator, Ericsson, Ericsson Research 备注:Accepted at Interspeech 2021. Also accepted at the Distributed and Private Machine Learning (DPML) and Hardware Aware Efficient Training (HAET) workshops at ICLR 2021 链接:https://arxiv.org/abs/2106.10019 摘要:联邦学习是一种在保护用户隐私的同时从不同用户设备中提取见解的有效方法。然而,在联邦学习设置中,具有完全未见数据分布的新类可能出现在任何设备上,而全局服务器或其他用户无法访问这些设备的数据。为此,我们提出了一个统一的零样本框架来应对联邦学习过程中的上述挑战。我们在此模拟了两种情况:1) 当用户没有报告新的类标签时,使用传统的FL设置;2) 当用户报告新的类标签时,我们通过计算与每个设备的新类对应的类相似矩阵来合成匿名数据印象(Anonymized Data Impressions),然后通过无监督聚类来区分不同用户间的新类。此外,我们提出的框架还可以处理参与用户在标签和模型上的统计异质性。 摘要:Federated learning is an effective way of extracting insights from different user devices while preserving the privacy of users. However, new classes with completely unseen data distributions can stream across any device in a federated learning setting, whose data cannot be accessed by the global server or other users. To this end, we propose a unified zero-shot framework to handle these aforementioned challenges during federated learning. We simulate two scenarios here -- 1) when the new class labels are not reported by the user, the traditional FL setting is used; 2) when new class labels are reported by the user, we synthesize Anonymized Data Impressions by calculating class similarity matrices corresponding to each device's new classes followed by unsupervised clustering to distinguish between new classes across different users. Moreover, our proposed framework can also handle statistical heterogeneities in both labels and models across the participating users. 
We empirically evaluate our framework on-device across different communication rounds (FL iterations) with new classes in both local and global updates, along with heterogeneous labels and models, on two widely used audio classification applications -- keyword spotting and urban sound classification, and observe an average deterministic accuracy increase of ~4.041% and ~4.258% respectively.
【12】 Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS 标题:改善端到端神经TTS中已见与未见语音风格转换的性能
作者:Xiaochun An,Frank K. Soong,Lei Xie 机构:School of Computer Science, Northwestern Polytechnical University, Xi’an, China, Microsoft China 链接:https://arxiv.org/abs/2106.10003 摘要:端到端的神经TTS训练在语音风格转换方面表现出了更好的性能。然而,这种改进仍然受到目标风格和说话人两方面训练数据的限制。当训练好的TTS试图将语音转换为来自具有未知、任意风格的新说话人的目标风格时,就会出现风格转换性能不足的情况。在本文中,我们提出了一种同时面向已见与未见风格的新的风格转换方法,采用不相交的多风格数据集:不同风格分别录制,每种风格由一位说话人的多条话语构成。为了对风格信息进行编码,我们采用逆自回归流(IAF)结构来改进变分推断。整个系统通过最小化四种损失函数的加权和进行优化:1) 重建损失,用于衡量源重建和目标重建中的失真;2) 对抗损失,用于“欺骗”训练好的判别器;3) 风格失真损失,用于衡量转换后的预期风格损失;4) 循环一致性损失,用于在转换后保留源说话人的身份。实验从客观和主观两方面证明了该方法在已见和未见风格转换任务上的有效性。 摘要:End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, with disjoint, multi-style datasets, i.e., datasets of different styles are recorded, each individual style is by one speaker with multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighed sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to "fool" a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. 
The performance of the new approach is better and more robust than those of four baseline systems of the prior art.
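四种损失的加权和可以直接写成如下形式。这里的权重仅为示意值,并非论文所用的超参数:

```python
def total_loss(rec, adv, style, cycle, weights=(1.0, 0.5, 0.5, 0.5)):
    # 重建、对抗、风格失真、循环一致性四项损失的加权和;
    # 实际训练中各权重需要在验证集上调优。
    w = weights
    return w[0] * rec + w[1] * adv + w[2] * style + w[3] * cycle

loss = total_loss(1.0, 2.0, 2.0, 2.0)
```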
【13】 PixInWav: Residual Steganography for Hiding Pixels in Audio 标题:PixInWav:一种在音频中隐藏像素的残差隐写算法
作者:Margarita Geleta,Cristina Punti,Kevin McGuinness,Jordi Pons,Cristian Canton,Xavier Giro-i-Nieto 机构:Cristina Punt´ı, Universitat Politecnica de Catalunya, Dublin City University, Dolby Labs, Institut de Robotica i Informatica Industrial, CSIC-UPC, Barcelona Supercomputing Center 备注:Extended abstract presented in CVPR 2021 Women in Computer Vision Workshop 链接:https://arxiv.org/abs/2106.09814 摘要:隐写术是指在可公开获得的宿主媒介中隐藏数据的机制。以前的工作集中于单模态设置(例如,在图像中隐藏图像,或在音频中隐藏音频),而PixInWav针对的是在音频中隐藏图像的多模态情形。为此,我们提出了一种作用于短时离散余弦变换(STDCT)音频谱图之上的新型残差结构。我们的结果表明,所提出的残差音频隐写方案允许在不影响质量的情况下,独立于宿主音频对隐藏图像进行编码。因此,虽然以前的工作需要同时有宿主信号和隐藏信号才能隐藏信号,但PixInWav可以离线编码图像——之后可以以残差方式将其隐藏到任何音频信号中。最后,我们在实验室环境中测试了我们的方案,通过声波将图像从扬声器传输到麦克风,验证了我们的理论见解并获得了有希望的结果。 摘要:Steganography comprises the mechanics of hiding data in a host media that may be publicly available. While previous works focused on unimodal setups (e.g., hiding images in images, or hiding audio in audio), PixInWav targets the multimodal case of hiding images in audio. To this end, we propose a novel residual architecture operating on top of short-time discrete cosine transform (STDCT) audio spectrograms. Among our results, we find that the residual audio steganography setup we propose allows independent encoding of the hidden image from the host audio without compromising quality. Accordingly, while previous works require both host and hidden signals to hide a signal, PixInWav can encode images offline -- which can be later hidden, in a residual fashion, into any audio signal. Finally, we test our scheme in a lab setting to transmit images over airwaves from a loudspeaker to a microphone verifying our theoretical insights and obtaining promising results.
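论文所依赖的短时离散余弦变换(STDCT)谱图,即对信号逐帧做DCT-II。下面是一个纯Python的示意实现;帧长、帧移为玩具参数,且未加窗、未归一化,与论文的实际前端无关:

```python
import math

def stdct(signal, frame_len, hop):
    # 短时DCT-II:以hop为步长滑窗取帧,对每帧做DCT-II,
    # 得到PixInWav所操作的实值谱图(帧数 x 系数数)。
    def dct(frame):
        n = len(frame)
        return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                    for i, x in enumerate(frame)) for k in range(n)]
    frames = [signal[t:t + frame_len] for t in range(0, len(signal) - frame_len + 1, hop)]
    return [dct(f) for f in frames]

sig = [math.sin(0.3 * t) for t in range(64)]   # 玩具信号
spec = stdct(sig, frame_len=16, hop=8)          # 7帧,每帧16个DCT系数
```

与STFT不同,DCT谱是实值的,这使得在其上做残差式的图像嵌入更为直接。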