q-fin (Quantitative Finance): 3 papers
cs.SD (Sound): 5 papers
eess.AS (Audio and Speech Processing): 4 papers
1. q-fin (Quantitative Finance):
【1】 Shock Symmetry and Business Cycle Synchronization: Is Monetary Unification Feasible among CAPADR Countries?
Link: https://arxiv.org/abs/2112.02063
Authors: Jafet Baca
Abstract: In light of the ongoing integration efforts, the question of whether CAPADR economies may benefit from a single currency arises naturally. This paper examines the feasibility of an Optimum Currency Area (OCA) among seven CAPADR countries. We estimate SVAR models to retrieve demand and supply shocks over 2009:01-2020:01 and determine their degree of symmetry. We then compute two regional indicators of dispersion and the cost of inclusion in a hypothetical OCA for each country. Our results indicate that asymmetric shocks tend to prevail. In addition, the dispersion indexes show that business cycles have become more synchronous over time. However, CAPADR countries are still sources of cyclical divergence, so they would incur significant costs in terms of cycle correlation whenever they pursue currency unification. We conclude that the region does not meet the symmetry and synchronicity required for an OCA to be appropriate.
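The abstract does not spell out how the SVAR separates demand from supply shocks; a standard choice for a bivariate system is the Blanchard-Quah long-run restriction, under which the demand shock has no long-run effect on the first variable. The sketch below is a toy illustration of that technique on simulated data, not the paper's actual specification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bivariate VAR(1), y_t = A y_{t-1} + u_t, with correlated
# reduced-form shocks (stand-ins for, e.g., output growth and prices).
A_true = np.array([[0.5, 0.1],
                   [0.2, 0.4]])
T = 5000
u = rng.standard_normal((T, 2)) @ np.array([[1.0, 0.0],
                                            [0.5, 0.8]]).T
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = A_true @ y[t - 1] + u[t]

# Step 1: estimate the VAR(1) by OLS, keep the residual covariance.
X, Y = y[:-1], y[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
resid = Y - X @ A_hat.T
Sigma = np.cov(resid.T)

# Step 2: Blanchard-Quah identification. The Cholesky factor of the
# long-run covariance makes the long-run impact matrix lower-triangular,
# i.e. shock 2 ("demand") has zero long-run effect on variable 1.
Psi = np.linalg.inv(np.eye(2) - A_hat)      # long-run multiplier C(1)
L = np.linalg.cholesky(Psi @ Sigma @ Psi.T)
B0 = np.linalg.inv(Psi) @ L                 # structural impact matrix
eps = resid @ np.linalg.inv(B0).T           # unit-variance structural shocks
```

The degree of shock symmetry between two countries can then be gauged by correlating their recovered supply (or demand) shock series, which is the quantity the paper's symmetry analysis rests on.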
【2】 Reinforcement learning for options on target volatility funds
Link: https://arxiv.org/abs/2112.01841
Authors: Roberto Daluiso, Emanuele Nastasi, Andrea Pallavicini, Stefano Polo
Abstract: In this work we deal with the funding costs arising from hedging the risky securities underlying a target volatility strategy (TVS), a portfolio of risky assets and a risk-free one dynamically rebalanced to keep the realized volatility of the portfolio at a certain level. The uncertainty in the composition of the TVS risky portfolio, along with the difference in hedging costs across components, requires solving a control problem to evaluate the option prices. We derive an analytical solution of the problem in the Black and Scholes (BS) scenario. We then use Reinforcement Learning (RL) techniques to determine the fund composition leading to the most conservative price under the local volatility (LV) model, for which no a priori solution is available. We show that the performance of the RL agents is comparable to that obtained by applying the BS analytical strategy path-wise to the TVS dynamics, which therefore appears competitive in the LV scenario as well.
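Before any control problem enters, the TVS mechanics are easy to simulate: under Black-Scholes the instantaneous volatility is constant, so the strategy holds a constant weight sigma_target/sigma in the risky asset. The toy Monte Carlo below (all parameters illustrative, not taken from the paper, and ignoring funding costs) checks that the strategy's realized volatility lands on the target and prices an at-the-money call on the fund:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (not from the paper)
sigma, sigma_tgt, r = 0.30, 0.10, 0.01
T_mat, n_steps, n_paths = 1.0, 252, 100_000
dt = T_mat / n_steps

# Under BS the instantaneous vol is constant, so the target volatility
# strategy holds a constant risky weight w with w * sigma = sigma_tgt.
w = sigma_tgt / sigma

# Risk-neutral simple returns of the risky asset over each step
z = rng.standard_normal((n_paths, n_steps))
S_ret = np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z) - 1.0

# Fund value: weight w in the risky asset, the remainder at the risk-free rate
step_ret = 1.0 + w * S_ret + (1.0 - w) * r * dt
V_T = np.cumprod(step_ret, axis=1)[:, -1]

# Monte Carlo price of an at-the-money call on the fund (V_0 = K = 1)
price = np.exp(-r * T_mat) * np.maximum(V_T - 1.0, 0.0).mean()

# Annualised realized volatility of the fund's log returns
realized = np.log(step_ret).std() / np.sqrt(dt)
```

The paper's point is that under local volatility the optimal weight is no longer constant and the components carry different funding costs, which is where the RL control problem comes in.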
【3】 The relationship between IMF broad based financial development index and international trade: Evidence from India
Link: https://arxiv.org/abs/2112.01749
Authors: Ummuhabeeba Chaliyan, Mini P. Thomas
Notes: 26 pages
Abstract: This study investigates whether a uni-directional or bi-directional causal relationship exists between financial development and international trade for the Indian economy over the period 1980 to 2019. Three measures of financial development created by the IMF, namely the financial institutional development index, the financial market development index, and a composite index of financial development, are utilized for the empirical analysis. Johansen cointegration, a vector error correction model, and a vector autoregressive model are estimated to examine the long-run relationship and short-run dynamics among the variables of interest. The econometric results indicate that there is indeed a long-run association between the composite index of financial development and trade openness. Cointegration is also found between trade openness and the index of financial market development. However, there is no evidence of cointegration between financial institutional development and trade openness. Granger causality tests indicate uni-directional causality running from the composite index of financial development to trade openness. Financial market development is also found to Granger-cause trade openness. The empirical evidence thus underlines the importance of formulating policies which recognize the role of well-developed financial markets in promoting international trade.
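A Granger causality test of the kind used here asks whether lags of one series improve predictions of another beyond that series' own lags. A minimal OLS/F-statistic version, run on hypothetical simulated data in place of the study's trade and financial-development series (and assuming stationary inputs, whereas the study first handles trends via cointegration/VECM):

```python
import numpy as np

def granger_f(x, y, p=2):
    """F-statistic for H0: lags of x add no predictive power for y beyond
    y's own lags (i.e. x does not Granger-cause y). OLS with an intercept."""
    n = len(y) - p
    Y = y[p:]
    lag = lambda s, k: s[p - k : len(s) - k]
    X_r = np.column_stack([np.ones(n)] + [lag(y, k) for k in range(1, p + 1)])
    X_u = np.column_stack([X_r] + [lag(x, k) for k in range(1, p + 1)])
    rss = lambda X: float(np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2))
    rss_r, rss_u = rss(X_r), rss(X_u)
    return ((rss_r - rss_u) / p) / (rss_u / (n - X_u.shape[1]))

# Hypothetical data in which x leads y by one period but not vice versa
rng = np.random.default_rng(2)
x = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
```

Comparing `granger_f(x, y)` against `granger_f(y, x)` reproduces the paper's uni-directional reading: a large F-statistic in one direction only.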
2. cs.SD (Sound):
【1】 Blackbox Untargeted Adversarial Testing of Automatic Speech Recognition Systems
Link: https://arxiv.org/abs/2112.01821
Authors: Xiaoliang Wu, Ajitha Rajan
Notes: 10 pages, 6 figures and 7 tables
Abstract: Automatic speech recognition (ASR) systems are prevalent, particularly in applications for voice navigation and voice control of domestic appliances. The computational core of ASRs are deep neural networks (DNNs) that have been shown to be susceptible to adversarial perturbations, which attackers can easily misuse to generate malicious outputs. To help test the correctness of ASRs, we propose techniques that automatically generate blackbox (agnostic to the DNN), untargeted adversarial attacks that are portable across ASRs. Much of the existing work on adversarial ASR testing focuses on targeted attacks, i.e., generating audio samples given an output text. Targeted techniques are customised to the structure of the DNN (whitebox) within a specific ASR and are therefore not portable. In contrast, our method attacks the signal processing stage of the ASR pipeline, which is shared across most ASRs. Additionally, we ensure the generated adversarial audio samples have no human-audible difference from the originals by manipulating the acoustic signal using a psychoacoustic model that maintains the signal below the thresholds of human perception. We evaluate the portability and effectiveness of our techniques on three popular ASRs and three input audio datasets, using three metrics: the WER of the output text, the similarity to the original audio, and the attack success rate across ASRs. We found our testing techniques were portable across ASRs, with the adversarial audio samples producing high success rates, WERs, and similarities to the original audio.
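The first metric, WER, is the word-level edit distance between the ASR output and the reference transcript, normalised by the reference length. A self-contained implementation (the paper's own scoring code is not shown, so this is the textbook definition):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between hypothesis
    and reference, divided by the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return d[-1][-1] / len(r)
```

For an untargeted attack, a high WER on the adversarial sample (relative to the clean transcript) is precisely the success signal.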
【2】 Music-to-Dance Generation with Optimal Transport
Link: https://arxiv.org/abs/2112.01806
Authors: Shuang Wu, Shijian Lu, Li Cheng
Abstract: Dance choreography for a piece of music is a challenging task: it must be creative in presenting distinctive stylistic dance elements while taking into account the musical theme and rhythm. The problem has been tackled by different approaches such as similarity retrieval, sequence-to-sequence modeling, and generative adversarial networks, but the generated dance sequences often fall short in motion realism, diversity, and music consistency. In this paper, we propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographies from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music. This gives a well-defined and non-divergent training objective that mitigates the limitations of standard GAN training, which is frequently plagued by instability and divergent generator loss. Extensive experiments demonstrate that MDOT-Net can synthesize realistic and diverse dances which achieve an organic unity with the input music, reflecting the shared intentionality and matching the rhythmic articulation.
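The optimal transport distance between two discrete distributions is commonly computed with entropic regularisation via Sinkhorn iterations. The paper does not specify its solver, so the following is a generic sketch on toy 1-D histograms (the real dance and music distributions live in much higher dimensions):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=1000):
    """Entropic-regularised OT between histograms a and b with cost matrix C.
    Returns the transport plan and the associated transport cost."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)   # scale columns to match marginal b
        u = a / (K @ v)     # scale rows to match marginal a
    P = u[:, None] * K * v[None, :]
    return P, float(np.sum(P * C))

# Toy example: two histograms on a shared 1-D grid
x = np.linspace(0.0, 1.0, 6)
C = (x[:, None] - x[None, :]) ** 2       # squared-distance ground cost
a = np.array([4.0, 3.0, 2.0, 1.0, 1.0, 1.0]); a /= a.sum()
b = a[::-1].copy()
P, cost = sinkhorn(a, b, C)
```

Unlike a GAN discriminator loss, this cost is a well-defined distance between distributions, which is the property the paper exploits for a non-divergent training objective.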
【3】 BBS-KWS: The Mandarin Keyword Spotting System Won the Video Keyword Wakeup Challenge
Link: https://arxiv.org/abs/2112.01757
Authors: Yuting Yang, Binbin Du, Yingxin Zhang, Wenxuan Wang, Yuke Li
Abstract: This paper introduces the system submitted by the Yidun NISP team to the video keyword wakeup challenge. We propose a Mandarin keyword spotting system (KWS) with several novel and effective improvements, including a big backbone (B) model, a keyword biasing (B) mechanism, and the introduction of syllable modeling units (S); taken together, these give the full system its abbreviation, BBS-KWS. The BBS-KWS system consists of an end-to-end automatic speech recognition (ASR) module and a KWS module. The ASR module converts speech features to text representations; it applies a big backbone network to the acoustic model and adopts syllable modeling units. In addition, the keyword biasing mechanism is used to improve the recall rate of keywords in the ASR inference stage. The KWS module applies multiple criteria to determine the absence or presence of keywords, such as multi-stage matching, fuzzy matching, and the connectionist temporal classification (CTC) prefix score. To further improve the system, we conduct semi-supervised learning on the CN-Celeb dataset for better generalization. In the VKW task, the BBS-KWS system achieves significant gains over the baseline and won first place in two tracks.
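The CTC prefix score used by the KWS module builds on CTC's label topology, in which repeated symbols collapse and blanks are removed. The full prefix-score computation is involved; the greedy-decoding sketch below (with made-up syllable labels) illustrates just that collapsing rule:

```python
import numpy as np

def ctc_greedy_decode(frame_probs, labels, blank=0):
    """Greedy CTC decoding: take the argmax label per frame, collapse
    consecutive repeats, then drop blanks."""
    best = np.argmax(frame_probs, axis=1)
    out, prev = [], None
    for t in best:
        if t != prev and t != blank:
            out.append(labels[t])
        prev = t
    return "".join(out)

# Toy 3-class example: 0 = blank, then two hypothetical syllable units
labels = ["<b>", "ni", "hao"]
probs = np.array([[0.9, 0.1, 0.0],    # blank
                  [0.1, 0.8, 0.1],    # "ni"
                  [0.1, 0.8, 0.1],    # "ni" again (repeat -> collapses)
                  [0.9, 0.1, 0.0],    # blank
                  [0.0, 0.1, 0.9]])   # "hao"
```

A blank between two identical labels prevents the collapse, which is how CTC can still emit genuinely repeated syllables.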
【4】 LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences
Link: https://arxiv.org/abs/2112.01697
Authors: Ziwang Fu, Feng Liu, Hanyang Wang, Siyuan Shen, Jiahao Zhang, Jiayin Qi, Xiangling Fu, Aimin Zhou
Notes: 9 pages, Figure 2, Table 5
Abstract: Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging tasks in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse the language, visual, and audio modalities. However, these approaches introduce information redundancy when fusing features and are inefficient because they do not consider the complementarity of modalities. In this paper, we propose an efficient neural network for learning modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction for the three modalities separately to obtain the local structure of the sequences. Then, we design a novel transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities, mainly divided into local temporal learning, cross-modal feature fusion, and global self-attention representations. In addition, we splice the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results show the superiority and efficiency of our proposed method in both settings. Compared with mainstream methods, our approach reaches the state of the art with a minimal number of parameters.
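The "global self-attention representations" stage relies on standard scaled dot-product self-attention. A single-head numpy sketch, with sequence length, dimensions, and weights all illustrative rather than taken from the paper:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of
    shape (seq_len, d_model). Returns outputs and attention weights."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)        # softmax over key positions
    return A @ V, A

rng = np.random.default_rng(3)
seq_len, d = 10, 16                          # illustrative sizes
X = rng.standard_normal((seq_len, d))        # e.g. 10 fused modality frames
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Y, A = self_attention(X, Wq, Wk, Wv)
```

Because every position attends to every other, this stage captures global context over the fused sequence, complementing the local temporal learning performed earlier in the block.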
【5】 Adversarially learning disentangled speech representations for robust multi-factor voice conversion
Link: https://arxiv.org/abs/2102.00184
Authors: Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng
Abstract: Factorizing speech into disentangled speech representations is vital for achieving highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, lacking controllability over other prosody-related factors. State-of-the-art speech representation learning methods for more speech factors use basic disentanglement algorithms such as random resampling and ad-hoc bottleneck layer size adjustment, which make it hard to ensure robust speech representation disentanglement. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm, and pitch are extracted and further disentangled by an adversarial Mask-And-Predict (MAP) network inspired by BERT. The adversarial network is used to minimize the correlations between the speech representations by randomly masking one of the representations and predicting it from the others. Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors, increasing the speech quality MOS from 2.79 to 3.30 and decreasing the MCD from 3.89 to 3.58.
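MCD (mel-cepstral distortion), the objective metric reported above, measures the dB-scale distance between time-aligned mel-cepstral frames of converted and reference speech. A commonly used form of the formula, sketched in numpy (the paper's exact alignment and coefficient choices are not given here):

```python
import numpy as np

def mcd(c_ref, c_syn):
    """Mel-cepstral distortion in dB between two time-aligned mel-cepstral
    sequences of shape (frames, dims). By convention the 0th (energy)
    coefficient is excluded before calling."""
    diff_sq = np.sum((np.asarray(c_ref) - np.asarray(c_syn)) ** 2, axis=1)
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * diff_sq)))
```

Lower is better, so the drop from 3.89 to 3.58 reported above indicates converted speech that is spectrally closer to the reference.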
3. eess.AS (Audio and Speech Processing):
【1】 Blackbox Untargeted Adversarial Testing of Automatic Speech Recognition Systems
Link: https://arxiv.org/abs/2112.01821 (cross-listed; see cs.SD entry 【1】)
【2】 Music-to-Dance Generation with Optimal Transport
Link: https://arxiv.org/abs/2112.01806 (cross-listed; see cs.SD entry 【2】)
【3】 BBS-KWS: The Mandarin Keyword Spotting System Won the Video Keyword Wakeup Challenge
Link: https://arxiv.org/abs/2112.01757 (cross-listed; see cs.SD entry 【3】)
【4】 LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences
Link: https://arxiv.org/abs/2112.01697 (cross-listed; see cs.SD entry 【4】)
Machine translation, for reference only.