计算机视觉学术速递[7.12]

2021-07-27 10:48:27


cs.CV 方向,今日共计55篇

Transformer(1篇)

【1】 ViTGAN: Training GANs with Vision Transformers 标题:ViTGAN:用视觉Transformer训练GAN

作者:Kwonjoon Lee,Huiwen Chang,Lu Jiang,Han Zhang,Zhuowen Tu,Ce Liu 机构:UC San Diego, Google Research 链接:https://arxiv.org/abs/2107.04589 摘要:近年来,视觉Transformer(vit)在图像识别方面表现出了很强的竞争力,同时对视觉感应偏差的要求也越来越低。在本文中,我们探讨了这种观察是否可以扩展到图像生成。为此,我们将ViT架构集成到生成性对抗网络(GAN)中。我们观察到,现有的正则化方法与自我注意的交互作用很差,导致训练过程中严重的不稳定性。为了解决这个问题,我们引入了一种新的正则化技术来训练具有ViTs的GANs。根据经验,我们的方法名为ViTGAN,在CIFAR-10、CelebA和LSUN卧室数据集上实现了与基于CNN的最新StyleGAN2相当的性能。 摘要:Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases. In this paper, we investigate if such observation can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). We observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce novel regularization techniques for training GANs with ViTs. Empirically, our approach, named ViTGAN, achieves comparable performance to state-of-the-art CNN-based StyleGAN2 on CIFAR-10, CelebA, and LSUN bedroom datasets.

检测相关(3篇)

【1】 Learning Cascaded Detection Tasks with Weakly-Supervised Domain Adaptation 标题:具有弱监督领域自适应的学习级联检测任务

作者:Niklas Hanselmann,Nick Schneider,Benedikt Ortelt,Andreas Geiger 机构: Germany 2University of T¨ubingen, Germany 4Max Planck Institute for Intelligent Systems 备注:Accepted to IEEE IV 2021 链接:https://arxiv.org/abs/2107.04523 摘要:为了应对自动驾驶的挑战,深度学习在处理越来越复杂的任务(如3D检测或实例分割)方面已被证明是至关重要的。用于基于图像的检测任务的最新方法通过以级联方式操作来解决这种复杂性:它们首先提取2D边界框,基于该边界框推断附加属性,例如实例掩码。尽管这些方法表现良好,但一个关键的挑战仍然是缺乏针对日益多样化的任务的准确且廉价的注释。合成数据提供了一个很有希望的解决方案,但是,尽管在领域适应性研究方面做出了努力,合成数据和真实数据之间的差距仍然是一个悬而未决的问题。在这项工作中,我们提出了一个弱监督域自适应设置,利用级联检测任务的结构。特别是,我们学习仅从源域推断属性,同时利用二维边界框作为两个域中的弱标签来解释域移动。我们进一步鼓励领域不变的特点,通过类级特征对齐使用地面真理类信息,这是不可用的无监督设置。我们的实验表明,该方法在完全监督的环境下是有竞争力的,而在很大程度上优于无监督自适应方法。 摘要:In order to handle the challenges of autonomous driving, deep learning has proven to be crucial in tackling increasingly complex tasks, such as 3D detection or instance segmentation. State-of-the-art approaches for image-based detection tasks tackle this complexity by operating in a cascaded fashion: they first extract a 2D bounding box based on which additional attributes, e.g. instance masks, are inferred. While these methods perform well, a key challenge remains the lack of accurate and cheap annotations for the growing variety of tasks. Synthetic data presents a promising solution but, despite the effort in domain adaptation research, the gap between synthetic and real data remains an open problem. In this work, we propose a weakly supervised domain adaptation setting which exploits the structure of cascaded detection tasks. In particular, we learn to infer the attributes solely from the source domain while leveraging 2D bounding boxes as weak labels in both domains to explain the domain shift. We further encourage domain-invariant features through class-wise feature alignment using ground-truth class information, which is not available in the unsupervised setting. As our experiments demonstrate, the approach is competitive with fully supervised settings while outperforming unsupervised adaptation approaches by a large margin.

【2】 Action Unit Detection with Joint Adaptive Attention and Graph Relation 标题:基于联合自适应注意力和图关系的动作单元检测

作者:Chenggong Zhang,Juan Song,Qingyang Zhang,Weilong Dong,Ruomeng Ding,Zhilei Liu 机构:College of Intelligence and Computing, Tianjin University, Tianjin, China 链接:https://arxiv.org/abs/2107.04389 摘要:本文描述了一种面部动作单元(AU)的检测方法。在这项工作中,我们介绍了提交至野外情感行为分析(ABAW)2021竞赛的方案。该方法采用预先训练好的JAA模型作为特征提取工具,在多尺度特征的基础上提取全局特征、人脸对齐特征和AU局部特征。将AU局部特征作为图卷积的输入,进一步考虑AU之间的相关性,最后利用融合特征对AU进行分类。检测准确度按0.5×准确度 + 0.5×F1进行评价。我们的模型在具有挑战性的Aff-Wild2数据库上达到了0.674。 摘要:This paper describes an approach to the facial action unit (AU) detection. In this work, we present our submission to the Field Affective Behavior Analysis (ABAW) 2021 competition. The proposed method uses the pre-trained JAA model as the feature extractor, and extracts global features, face alignment features and AU local features on the basis of multi-scale features. We take the AU local features as the input of the graph convolution to further consider the correlation between AU, and finally use the fused features to classify AU. The detected accuracy was evaluated by 0.5*accuracy + 0.5*F1. Our model achieves 0.674 on the challenging Aff-Wild2 database.
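
下面用一个最小的示意代码说明上述评测指标(0.5×准确率 + 0.5×F1)的计算方式,仅作说明、并非论文原始代码;其中 AU 数量取 12、输入为 0/1 多标签矩阵均为本文假设。

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def abaw_au_score(y_true, y_pred):
    """示意:按 0.5*准确率 + 0.5*F1 计算 AU 检测得分。
    y_true, y_pred: 形状为 (样本数, AU数) 的 0/1 多标签矩阵。"""
    accs, f1s = [], []
    for k in range(y_true.shape[1]):  # 逐个 AU 统计后取平均
        accs.append(accuracy_score(y_true[:, k], y_pred[:, k]))
        f1s.append(f1_score(y_true[:, k], y_pred[:, k], zero_division=0))
    return 0.5 * float(np.mean(accs)) + 0.5 * float(np.mean(f1s))

# 用法示例(随机数据,仅演示接口)
y_true = np.random.randint(0, 2, size=(100, 12))
y_pred = np.random.randint(0, 2, size=(100, 12))
print(abaw_au_score(y_true, y_pred))
```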

【3】 RGB Stream Is Enough for Temporal Action Detection 标题:RGB流足以进行时间动作检测

作者:Chenhao Wang,Hongxiang Cai,Yuxin Zou,Yichao Xiong 机构:Media Intelligence Technology Co.,Ltd 链接:https://arxiv.org/abs/2107.04362 摘要:到目前为止,最先进的时间动作检测器基于两个流输入,包括RGB帧和光流。虽然将RGB帧与光流相结合可以显著提高性能,但光流是一种手工设计的表示方法,不仅计算量大,而且两种流方法往往不能与光流进行端到端的联合学习。在本文中,我们认为光流在高精度的时间动作检测中是不必要的,而图像级数据增强(ILDA)是解决去除光流后性能下降的关键。为了评估ILDA的有效性,我们设计了一个简单而有效的基于单个RGB流的单级时间动作检测器DaoTAD。我们的结果表明,当使用ILDA训练时,DaoTAD具有与所有现有的最先进的双流检测器相当的精度,同时大大超过了以前的方法的推理速度,并且在GeForce GTX 1080 Ti上的推理速度是惊人的6668 fps。代码位于 https://github.com/Media-Smart/vedatad 。 摘要:State-of-the-art temporal action detectors to date are based on two-stream input including RGB frames and optical flow. Although combining RGB frames and optical flow boosts performance significantly, optical flow is a hand-designed representation which not only requires heavy computation, but also makes it methodologically unsatisfactory that two-stream methods are often not learned end-to-end jointly with the flow. In this paper, we argue that optical flow is dispensable in high-accuracy temporal action detection and image level data augmentation (ILDA) is the key solution to avoid performance degradation when optical flow is removed. To evaluate the effectiveness of ILDA, we design a simple yet efficient one-stage temporal action detector based on single RGB stream named DaoTAD. Our results show that when trained with ILDA, DaoTAD has comparable accuracy with all existing state-of-the-art two-stream detectors while surpassing the inference speed of previous methods by a large margin and the inference speed is astounding 6668 fps on GeForce GTX 1080 Ti. Code is available at https://github.com/Media-Smart/vedatad.

分类|识别相关(4篇)

【1】 Seven Basic Expression Recognition Using ResNet-18 标题:基于ResNet-18的七种基本表情识别

作者:Satnam Singh,Doris Schicker 链接:https://arxiv.org/abs/2107.04569 摘要:我们建议使用在FER 数据集上预先训练的ResNet-18架构来解决野外情感行为分析(ABAW)问题,以分类七种基本表达,即中性、愤怒、厌恶、恐惧、快乐、悲伤和惊讶。作为第二届野外情感行为分析研讨会和竞赛(ABAW2)的一部分,提供了一个由564个视频组成的数据库,大约280万帧,以及这七种基本表达的标签。我们对数据集进行了重采样,通过对过度代表的类进行欠采样和对欠代表的类进行过采样以及类权重来抵消类的不平衡。为了避免过度拟合,我们进行了数据扩充和使用L2正则化。我们的分类器达到了0.4的ABAW2分数,因此超过了比赛主持人提供的基线结果。 摘要:We propose to use a ResNet-18 architecture that was pre-trained on the FER dataset for tackling the problem of affective behavior analysis in-the-wild (ABAW) for classification of the seven basic expressions, namely, neutral, anger, disgust, fear, happiness, sadness and surprise. As part of the second workshop and competition on affective behavior analysis in-the-wild (ABAW2), a database consisting of 564 videos with around 2.8M frames is provided along with labels for these seven basic expressions. We resampled the dataset to counter class-imbalances by under-sampling the over-represented classes and over-sampling the under-represented classes along with class-wise weights. To avoid overfitting we performed data-augmentation and used L2 regularisation. Our classifier reaches an ABAW2 score of 0.4 and therefore exceeds the baseline results provided by the hosts of the competition.
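
摘要中提到的类别重采样、类别权重与 L2 正则化,可以用下面的 PyTorch 片段示意;这只是一个假设性的草图(数据、类别数为 7、权重计算方式均为本文假设),并非该论文的原始实现。

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
from torchvision.models import resnet18

# 假设的 7 类表情数据(实际应为人脸图像数据集)
images = torch.randn(1000, 3, 224, 224)
labels = torch.randint(0, 7, (1000,))
dataset = TensorDataset(images, labels)

# 按类别频率的倒数给每个样本加权,采样时同时起到欠采样/过采样的效果
class_counts = torch.bincount(labels, minlength=7).float().clamp(min=1)
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = resnet18(num_classes=7)
# 类别权重进一步抵消类别不平衡;weight_decay 即 L2 正则化
criterion = nn.CrossEntropyLoss(weight=class_counts.sum() / (7 * class_counts))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
```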

【2】 Emotion Recognition with Incomplete Labels Using Modified Multi-task Learning Technique 标题:基于改进多任务学习技术的不完全标签情感识别

作者:Phan Tran Dac Thinh,Hoang Manh Hung,Hyung-Jeong Yang,Soo-Hyung Kim,Guee-Sang Lee 机构:Department of Artificial Intelligence Convergence, Chonnam National University, South Korea 链接:https://arxiv.org/abs/2107.04192 摘要:由于大量注释数据集的可访问性和可用性,从人脸中预测情感信息(如七种基本情感或动作单元)的任务逐渐变得更加有趣。在这项研究中,我们提出了一种方法,利用七个基本情绪和12个行动单位之间的关联从AffWild2数据集。该方法基于ResNet50的体系结构,针对两个任务的不完全标注,采用多任务学习技术。通过结合两个相关任务的知识,与只使用一种标签的模型相比,两种性能都有很大的提高。 摘要:The task of predicting affective information in the wild such as seven basic emotions or action units from human faces has gradually become more interesting due to the accessibility and availability of massive annotated datasets. In this study, we propose a method that utilizes the association between seven basic emotions and twelve action units from the AffWild2 dataset. The method based on the architecture of ResNet50 involves the multi-task learning technique for the incomplete labels of the two tasks. By combining the knowledge for two correlated tasks, both performances are improved by a large margin compared to those with the model employing only one kind of label.
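
对于"两个任务的标签都不完整"的多任务学习,一种常见做法是对缺失标签的样本屏蔽相应的损失项;下面给出一个示意写法(并非论文原实现,"缺失标签记为 -1"也是本文的假设)。

```python
import torch
import torch.nn.functional as F

def multitask_loss(expr_logits, au_logits, expr_labels, au_labels):
    """expr_labels: (N,),取值 0..6,缺失记为 -1;
    au_labels: (N, 12),取值 0/1,整行缺失记为 -1。"""
    loss = expr_logits.new_zeros(())
    expr_mask = expr_labels >= 0
    if expr_mask.any():  # 仅对有表情标签的样本计算交叉熵
        loss = loss + F.cross_entropy(expr_logits[expr_mask], expr_labels[expr_mask])
    au_mask = (au_labels >= 0).all(dim=1)
    if au_mask.any():    # 仅对有 AU 标签的样本计算二元交叉熵
        loss = loss + F.binary_cross_entropy_with_logits(
            au_logits[au_mask], au_labels[au_mask].float())
    return loss
```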

【3】 A Multi-modal and Multi-task Learning Method for Action Unit and Expression Recognition 标题:一种面向动作单元和表情识别的多模态多任务学习方法

作者:Yue Jin,Tianqing Zheng,Chao Gao,Guoqiang Xu 机构:China Pacific Insurance (Group) Co., Ltd., Yishan Road, Shanghai, China, Shanghai Jiaotong University, Dongchuan Road, Shanghai, China 备注:5 pages, 3 figures, 2 tables 链接:https://arxiv.org/abs/2107.04187 摘要:分析人的情感对于人机交互系统是至关重要的。大多数方法都是在限制性场景中开发的,在野外环境中并不实用。野外情感行为分析(ABAW)2021竞赛为这一野外场景下的问题提供了一个基准。本文介绍了一种基于视觉和听觉信息的多模式多任务学习方法。我们使用AU和表情标注来训练模型,并应用序列模型来进一步提取视频帧之间的关联。在验证集上,我们获得了0.712的AU分数和0.477的表情分数。这些结果证明了我们的方法在提高模型性能方面的有效性。 摘要:Analyzing human affect is vital for human-computer interaction systems. Most methods are developed in restricted scenarios which are not practical for in-the-wild settings. The Affective Behavior Analysis in-the-wild (ABAW) 2021 Contest provides a benchmark for this in-the-wild problem. In this paper, we introduce a multi-modal and multi-task learning method by using both visual and audio information. We use both AU and expression annotations to train the model and apply a sequence model to further extract associations between video frames. We achieve an AU score of 0.712 and an expression score of 0.477 on the validation set. These results demonstrate the effectiveness of our approach in improving model performance.

【4】 Multitask Multi-database Emotion Recognition 标题:多任务多数据库情感识别

作者:Manh Tu Vu,Marie Beurton-Aimar 机构:Lucine, Avenue Emile Counord, Bordeaux, France, LaBRI, Cours de la Libération, Talence CEDEX, France 链接:https://arxiv.org/abs/2107.04127 摘要:在这项工作中,我们介绍我们提交的第二届情感行为分析在野生(ABAW)2021年的竞争。我们在多个资料库上训练一个统一的深度学习模型来执行两项任务:七种基本的面部表情预测和价唤醒估计。由于这些数据库并不包含所有这两个任务的标签,我们应用了知识提取技术来训练两个网络:一个教师模型和一个学生模型。学生模型将使用基本真值标签和从预先训练的教师模型导出的软标签进行训练。在训练过程中,为了更好地利用任务间的相关性,我们又增加了一个任务,即两个任务的组合。我们还利用比赛中使用的AffWild2数据库中两个任务之间的视频共享,进一步提高了网络的性能。实验结果表明,该网络在AffWild2数据库的验证集上取得了良好的效果。代码和预训练模型在https://github.com/glmanhtu/multitask-abaw-2021 摘要:In this work, we introduce our submission to the 2nd Affective Behavior Analysis in-the-wild (ABAW) 2021 competition. We train a unified deep learning model on multi-databases to perform two tasks: seven basic facial expressions prediction and valence-arousal estimation. Since these databases do not contains labels for all the two tasks, we have applied the distillation knowledge technique to train two networks: one teacher and one student model. The student model will be trained using both ground truth labels and soft labels derived from the pretrained teacher model. During the training, we add one more task, which is the combination of the two mentioned tasks, for better exploiting inter-task correlations. We also exploit the sharing videos between the two tasks of the AffWild2 database that is used in the competition, to further improve the performance of the network. Experiment results shows that the network have achieved promising results on the validation set of the AffWild2 database. Code and pretrained model are publicly available at https://github.com/glmanhtu/multitask-abaw-2021
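
摘要中"学生模型同时使用真值标签和预训练教师模型输出的软标签"这一知识蒸馏思路,可用下面的片段示意;温度 T、权重 alpha 以及"缺失标签记为 -1"均为本文假设,并非论文的具体设置。

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    """软标签项:带温度的 KL 散度;硬标签项:只对有真值标签的样本计算。"""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    mask = hard_labels >= 0            # 缺失的真值标签记为 -1(假设)
    if mask.any():
        hard = F.cross_entropy(student_logits[mask], hard_labels[mask])
        return alpha * hard + (1 - alpha) * soft
    return soft
```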

分割|语义相关(9篇)

【1】 Semantic and Geometric Unfolding of StyleGAN Latent Space 标题:StyleGAN潜在空间的语义和几何展开

作者:Mustafa Shukor,Xu Yao,Bharath Bhushan Damodaran,Pierre Hellier 机构:InterDigital, Inc., Télécom Paris 备注:16 pages 链接:https://arxiv.org/abs/2107.04481 摘要:生成性对抗网络(generative adversarial networks,GANs)通过反转和操纵与自然图像相对应的潜在代码,在图像编辑方面具有惊人的效率。这种特性来自于潜在空间的解纠缠性质。在本文中,我们发现了这种潜在空间的两个几何限制:(a)欧氏距离不同于图像感知距离;(b)解纠缠不是最优的,使用线性模型的人脸属性分离是一个限制性假设。因此,我们提出了一种新的方法,利用规范化流来学习代理潜在表示以弥补这些限制,并表明这为人脸图像编辑带来了一个更有效的空间。 摘要:Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to a natural image. This property emerges from the disentangled nature of the latent space. In this paper, we identify two geometric limitations of such latent space: (a) euclidean distances differ from image perceptual distance, and (b) disentanglement is not optimal and facial attribute separation using linear model is a limiting hypothesis. We thus propose a new method to learn a proxy latent representation using normalizing flows to remedy these limitations, and show that this leads to a more efficient space for face image editing.

【2】 Semantic Segmentation on Multiple Visual Domains 标题:多视图域上的语义分割

作者:Floris Naber 机构:Department of Electrical Engineering, Eindhoven University of Technology 备注:Graduation project report 链接:https://arxiv.org/abs/2107.04326 摘要:语义分割模型只在其训练的领域中表现良好,训练数据集很少,并且通常具有很小的标签空间,因为所需的像素级注释制作成本很高。因此,需要在多个现有域上训练模型以增加输出标签空间。目前的研究表明,使用多域训练有可能提高跨数据集的准确性,但这尚未成功地扩展到三个不同的非重叠域的数据集,而无需手动标记。本文针对Cityscapes、SUIM和sunrgb-D数据集提出了一种方法,通过创建一个跨越所有数据集类的标签空间。合并重复的类,并通过保持类分离来解决不同粒度的问题。结果表明,在硬件性能均衡的情况下,由于资源不是无限的,多域模型的精度比所有基线模型加在一起的精度要高,表明即使在没有共同点的域中,模型也能从额外的数据中受益。 摘要:Semantic segmentation models only perform well on the domain they are trained on and datasets for training are scarce and often have a small label-spaces, because the pixel level annotations required are expensive to make. Thus training models on multiple existing domains is desired to increase the output label-space. Current research shows that there is potential to improve accuracy across datasets by using multi-domain training, but this has not yet been successfully extended to datasets of three different non-overlapping domains without manual labelling. In this paper a method for this is proposed for the datasets Cityscapes, SUIM and SUN RGB-D, by creating a label-space that spans all classes of the datasets. Duplicate classes are merged and discrepant granularity is solved by keeping classes separate. Results show that accuracy of the multi-domain model has higher accuracy than all baseline models together, if hardware performance is equalized, as resources are not limitless, showing that models benefit from additional data even from domains that have nothing in common.

【3】 Fast Pixel-Matching for Video Object Segmentation 标题:一种用于视频对象分割的快速像素匹配算法

作者:Siyue Yu,Jimin Xiao,BingFeng Zhang,Eng Gee Lim 机构:Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China, Beijing Jiaotong University, Beijing, China 备注:Accepted by Signal Processing: Image Communication 链接:https://arxiv.org/abs/2107.04279 摘要:视频对象分割是对给定第一帧注释的前景对象进行分割的一种方法,近年来受到越来越多的关注。许多先进的方法依靠在线模型更新或掩模传播技术取得了很好的效果。然而,由于模型在推理过程中的微调,大多数在线模型需要很高的计算代价。大多数基于掩模传播的模型速度较快,但由于无法适应对象外观的变化,性能相对较低。在本文中,我们的目标是设计一个新的模式,使良好的平衡之间的速度和性能。提出了一种基于掩模传播和非局部技术的前景目标定位模型NPMCA-net,该模型通过匹配参考帧和目标帧中的像素来实现对前景目标的直接定位。由于引入了第一帧和前一帧的信息,网络对较大的物体外观变化具有鲁棒性,并能更好地适应遮挡。大量实验表明,在同一水平比较下,我们的方法可以同时以较快的速度(DAVIS-2016的IoU为86.5%,DAVIS-2017的IoU为72.2%,每帧速度为0.11s)获得最新的性能。源代码位于https://github.com/siyueyu/NPMCA-net. 摘要:Video object segmentation, aiming to segment the foreground objects given the annotation of the first frame, has been attracting increasing attentions. Many state-of-the-art approaches have achieved great performance by relying on online model updating or mask-propagation techniques. However, most online models require high computational cost due to model fine-tuning during inference. Most mask-propagation based models are faster but with relatively low performance due to failure to adapt to object appearance variation. In this paper, we are aiming to design a new model to make a good balance between speed and performance. We propose a model, called NPMCA-net, which directly localizes foreground objects based on mask-propagation and non-local technique by matching pixels in reference and target frames. Since we bring in information of both first and previous frames, our network is robust to large object appearance variation, and can better adapt to occlusions. Extensive experiments show that our approach can achieve a new state-of-the-art performance with a fast speed at the same time (86.5% IoU on DAVIS-2016 and 72.2% IoU on DAVIS-2017, with speed of 0.11s per frame) under the same level comparison. Source code is available at https://github.com/siyueyu/NPMCA-net.
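
NPMCA-net 的核心思想是用非局部(注意力)方式在参考帧与目标帧之间做像素匹配;下面给出这一思想的简化示意,通道数、缩放方式等均为假设,并非论文原结构。

```python
import torch
import torch.nn as nn

class PixelMatching(nn.Module):
    """示意:把参考帧特征按像素相似度聚合到目标帧。"""
    def __init__(self, in_ch=256, key_ch=64):
        super().__init__()
        self.query = nn.Conv2d(in_ch, key_ch, 1)  # 目标帧 -> query
        self.key = nn.Conv2d(in_ch, key_ch, 1)    # 参考帧 -> key
        self.value = nn.Conv2d(in_ch, in_ch, 1)   # 参考帧 -> value

    def forward(self, target_feat, ref_feat):
        B, C, H, W = target_feat.shape
        q = self.query(target_feat).flatten(2).transpose(1, 2)     # (B, HW, key_ch)
        k = self.key(ref_feat).flatten(2)                          # (B, key_ch, HW)
        v = self.value(ref_feat).flatten(2).transpose(1, 2)        # (B, HW, C)
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # 像素两两相似度
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)       # 聚合参考帧信息
        return out + target_feat                                   # 残差连接

# 用法示意
m = PixelMatching()
t, r = torch.randn(1, 256, 30, 54), torch.randn(1, 256, 30, 54)
print(m(t, r).shape)
```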

【4】 Towards Robust General Medical Image Segmentation 标题:面向稳健的通用医学图像分割

作者:Laura Daza,Juan C. Pérez,Pablo Arbeláez 机构:Universidad de los Andes, Colombia, King Abdullah University of Science and Technology (KAUST), Saudi Arabia 备注:Accepted at MICCAI 2021 链接:https://arxiv.org/abs/2107.04263 摘要:深度学习系统的可靠性不仅取决于其准确性,还取决于其对输入数据的对抗性扰动的鲁棒性。针对自然图像域中存在对抗性噪声的情况,提出了几种攻击和防御方法来提高深层神经网络的性能。然而,在体积数据的计算机辅助诊断中,鲁棒性仅针对特定任务和有限的攻击进行了研究。我们提出一个新的架构来评估一般医学影像分割系统的稳健性。我们的贡献有两个方面:(i)我们提出了一个新的基准,通过将最近的自动攻击自然图像分类框架扩展到体积数据分割领域来评估医学分割十项全能(MSD)的鲁棒性,(ii)提出了一种新的格结构,用于鲁棒的通用医学图像分割(ROG)。我们的结果表明,ROG能够推广到MSD的不同任务中,并且在复杂的对抗性攻击下大大超过了最先进的技术。 摘要:The reliability of Deep Learning systems depends on their accuracy but also on their robustness against adversarial perturbations to the input data. Several attacks and defenses have been proposed to improve the performance of Deep Neural Networks under the presence of adversarial noise in the natural image domain. However, robustness in computer-aided diagnosis for volumetric data has only been explored for specific tasks and with limited attacks. We propose a new framework to assess the robustness of general medical image segmentation systems. Our contributions are two-fold: (i) we propose a new benchmark to evaluate robustness in the context of the Medical Segmentation Decathlon (MSD) by extending the recent AutoAttack natural image classification framework to the domain of volumetric data segmentation, and (ii) we present a novel lattice architecture for RObust Generic medical image segmentation (ROG). Our results show that ROG is capable of generalizing across different tasks of the MSD and largely surpasses the state-of-the-art under sophisticated adversarial attacks.

【5】 Modality specific U-Net variants for biomedical image segmentation: A survey 标题:用于生物医学图像分割的模态特定U网变体研究进展

作者:Narinder Singh Punn,Sonali Agarwal 机构:Indian Institute of Information Technology Allahabad 链接:https://arxiv.org/abs/2107.04537 摘要:随着深度学习方法的发展,如深度卷积神经网络、残差神经网络、对抗性网络等;U-Net结构是生物医学图像分割中应用最广泛的一种结构,用于实现目标区域或子区域的自动识别和检测。在最近的研究中,基于U-Net的方法在不同的应用中展示了最先进的性能,用于开发用于早期诊断和治疗疾病的计算机辅助诊断系统,如脑瘤、肺癌、阿尔茨海默病、乳腺癌等。本文通过描述U-Net框架,并综合分析不同医学成像模态(如磁共振成像、X射线、计算机断层扫描/计算机轴向断层扫描、超声、正电子发射断层扫描等)下的U-Net变体,来展示这些方法的成功。此外,本文还重点介绍了基于U-Net的框架在当前仍在流行的严重急性呼吸综合征冠状病毒2型(SARS-CoV-2,又称COVID-19)疫情中的贡献。 摘要:With the advent of advancements in deep learning approaches, such as deep convolution neural network, residual neural network, adversarial network; U-Net architectures are most widely utilized in biomedical image segmentation to address the automation in identification and detection of the target regions or sub-regions. In recent studies, U-Net based approaches have illustrated state-of-the-art performance in different applications for the development of computer-aided diagnosis systems for early diagnosis and treatment of diseases such as brain tumor, lung cancer, alzheimer, breast cancer, etc. This article contributes to present the success of these approaches by describing the U-Net framework, followed by the comprehensive analysis of the U-Net variants for different medical imaging or modalities such as magnetic resonance imaging, X-ray, computerized tomography/computerized axial tomography, ultrasound, positron emission tomography, etc. Besides, this article also highlights the contribution of U-Net based frameworks in the on-going pandemic, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) also known as COVID-19.

【6】 Hepatocellular Carcinoma Segmentation from Digital Subtraction Angiography Videos using Learnable Temporal Difference 标题:利用可学习时域差分法从数字减影血管造影视频中分割肝癌

作者:Wenting Jiang,Yicheng Jiang,Lu Zhang,Changmiao Wang,Xiaoguang Han,Shuixing Zhang,Xiang Wan,Shuguang Cui 机构: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China, Shenzhen Research Institute of Big Data, Shenzhen, China, Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China 链接:https://arxiv.org/abs/2107.04306 摘要:在数字减影血管造影(DSA)视频中自动分割肝细胞癌(HCC)可以帮助放射科医师在临床上有效地诊断HCC和准确地评价肿瘤。很少有研究从DSA视频进行肝癌分割。由于拍摄过程中的运动伪影、肿瘤区域边界的模糊性以及与其他解剖组织在成像上的高度相似性,该任务具有很大的挑战性。本文提出了DSA视频中的HCC分割问题,并建立了自己的DSA数据集。我们还提出了一种新的分割网络DSA-LTDNet,包括分割子网络、时差学习(TDL)模块和肝脏区域分割(LRS)子网络,以提供额外的指导。DSA-LTDNet能够主动地从DSA视频中学习潜在的运动信息,从而提升分割性能。实验结果表明,DSA-LTDNet的DICE评分比U-Net基线提高了近4%。 摘要:Automatic segmentation of hepatocellular carcinoma (HCC) in Digital Subtraction Angiography (DSA) videos can assist radiologists in efficient diagnosis of HCC and accurate evaluation of tumors in clinical practice. Few studies have investigated HCC segmentation from DSA videos. It shows great challenging due to motion artifacts in filming, ambiguous boundaries of tumor regions and high similarity in imaging to other anatomical tissues. In this paper, we raise the problem of HCC segmentation in DSA videos, and build our own DSA dataset. We also propose a novel segmentation network called DSA-LTDNet, including a segmentation sub-network, a temporal difference learning (TDL) module and a liver region segmentation (LRS) sub-network for providing additional guidance. DSA-LTDNet is preferable for learning the latent motion information from DSA videos proactively and boosting segmentation performance. All of experiments are conducted on our self-collected dataset. Experimental results show that DSA-LTDNet increases the DICE score by nearly 4% compared to the U-Net baseline.

【7】 LIFE: A Generalizable Autodidactic Pipeline for 3D OCT-A Vessel Segmentation 标题:LIFE:一种适用于3D OCT-A血管分割的通用型自学管道

作者:Dewei Hu,Can Cui,Hao Li,Kathleen E. Larson,Yuankai K. Tao,Ipek Oguz 机构: Vanderbilt University, Dept. of Electrical Engineering and Computer Science, Vanderbilt University, Dept. of Biomedical Engineering, Nashville, TN, USA 备注:Accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2021 链接:https://arxiv.org/abs/2107.04282 摘要:光学相干断层扫描(OCT)是一种广泛应用于眼科的无创成像技术。它可以扩展到OCT血管造影(OCT-A),以更高的对比度显示视网膜血管。最近的深度学习算法产生了有希望的血管分割结果;然而,由于缺乏人工标注的训练数据,视网膜血管的三维分割仍然很困难。我们提出了一种基于学习的方法,仅由名为局部强度融合(LIF)的自合成模态进行监督。LIF是直接从OCT-A输入计算的毛细血管增强体数据。然后构造局部强度融合编码器(LIFE),将给定的OCT-A体数据及其LIF对应体映射到一个共享的潜在空间。LIFE的潜在空间具有与输入数据相同的维度,并且包含两种模态的共同特征。通过二值化这个潜在空间,我们得到一个体素级血管分割。我们的方法在一个带有手动标签的人眼中心凹OCT-A体数据和三个斑马鱼OCT-A体数据上进行了评估。人类数据的Dice得分为0.7736,斑马鱼数据的Dice得分为0.8594 +/- 0.0275,这比现有的无监督算法有了显著的改进。 摘要:Optical coherence tomography (OCT) is a non-invasive imaging technique widely used for ophthalmology. It can be extended to OCT angiography (OCT-A), which reveals the retinal vasculature with improved contrast. Recent deep learning algorithms produced promising vascular segmentation results; however, 3D retinal vessel segmentation remains difficult due to the lack of manually annotated training data. We propose a learning-based method that is only supervised by a self-synthesized modality named local intensity fusion (LIF). LIF is a capillary-enhanced volume computed directly from the input OCT-A. We then construct the local intensity fusion encoder (LIFE) to map a given OCT-A volume and its LIF counterpart to a shared latent space. The latent space of LIFE has the same dimensions as the input data and it contains features common to both modalities. By binarizing this latent space, we obtain a volumetric vessel segmentation. Our method is evaluated in a human fovea OCT-A and three zebrafish OCT-A volumes with manual labels. It yields a Dice score of 0.7736 on human data and 0.8594 +/- 0.0275 on zebrafish data, a dramatic improvement over existing unsupervised algorithms.

【8】 CASPIANET++: A Multidimensional Channel-Spatial Asymmetric Attention Network with Noisy Student Curriculum Learning Paradigm for Brain Tumor Segmentation 标题:CASPIANET++:基于噪声学生课程学习范式的多维通道-空间非对称注意力脑肿瘤分割网络

作者:Andrea Liew,Chun Cheng Lee,Boon Leong Lan,Maxine Tan 机构:Electrical and Computer Systems Engineering Discipline, School of Engineering, Monash University Malaysia, Bandar Sunway, Malaysia, Radiology Department, Sunway Medical Centre, Bandar Sunway, Malaysia 链接:https://arxiv.org/abs/2107.04099 摘要:卷积神经网络(CNNs)已经成功地应用于脑肿瘤的语义分割。然而,目前的CNN和注意机制本质上是随机的,忽略了放射科医生用来手动标注感兴趣区域的形态学指标。在这篇文章中,我们引入了一种通道和空间不对称注意(CASPIAN)方法,通过利用肿瘤的固有结构来检测显著区域。为了证明我们提出的层的有效性,我们将其集成到一个成熟的卷积神经网络(CNN)体系结构中,以获得更高的Dice分数,并使用更少的GPU资源。此外,我们还研究了辅助多尺度和多平面注意分支的加入,以增加语义分割任务中至关重要的空间上下文。由此得到的新架构CASPIANET++,其全肿瘤Dice评分为91.19%,肿瘤核心为87.6%,增强肿瘤为81.03%。此外,由于脑肿瘤数据的稀缺性,本文研究了用于分割任务的Noisy Student方法。我们新的Noisy Student课程学习范式,逐步注入噪声以增加暴露给网络的训练图像的复杂性,进一步将增强肿瘤区域提高到81.53%。对BraTS2020数据进行的额外验证表明,在没有任何额外训练或微调的情况下,Noisy Student课程学习方法效果良好。 摘要:Convolutional neural networks (CNNs) have been used quite successfully for semantic segmentation of brain tumors. However, current CNNs and attention mechanisms are stochastic in nature and neglect the morphological indicators used by radiologists to manually annotate regions of interest. In this paper, we introduce a channel and spatial wise asymmetric attention (CASPIAN) by leveraging the inherent structure of tumors to detect regions of saliency. To demonstrate the efficacy of our proposed layer, we integrate this into a well-established convolutional neural network (CNN) architecture to achieve higher Dice scores, with less GPU resources. Also, we investigate the inclusion of auxiliary multiscale and multiplanar attention branches to increase the spatial context crucial in semantic segmentation tasks. The resulting architecture is the new CASPIANET++, which achieves Dice Scores of 91.19% whole tumor, 87.6% for tumor core and 81.03% for enhancing tumor. Furthermore, driven by the scarcity of brain tumor data, we investigate the Noisy Student method for segmentation tasks. Our new Noisy Student Curriculum Learning paradigm, which infuses noise incrementally to increase the complexity of the training images exposed to the network, further boosts the enhancing tumor region to 81.53%. Additional validation performed on the BraTS2020 data shows that the Noisy Student Curriculum Learning method works well without any additional training or finetuning.

【9】 Comparison of 2D vs. 3D U-Net Organ Segmentation in abdominal 3D CT images 标题:腹部三维CT图像二维与三维U-net器官分割的比较

作者:Nico Zettler,Andre Mastmeyer 机构:Aalen University, Beethovenstr. , Burren Campus, Germany, Aalen, Baden-Wuerttemberg 备注:None 链接:https://arxiv.org/abs/2107.04062 摘要:提出了一种两步分割的方法,对体层CT图像中的5个腹部器官进行三维分割。首先提取每个相关器官的感兴趣体积作为边界框。提取的体积作为第二阶段的输入,其中对两个不同架构维度的U-Net进行比较,以标签掩码的形式重建器官分割。在这项工作中,我们重点比较二维U-Net与其三维U-Net对应结构。我们的初步结果表明,Dice分数最多可提高约6%。在这项研究中,令我们惊讶的是,例如肝脏和肾脏,使用更快且更节省GPU显存的2D U-Net反而得到了显著更好的分割效果。对于其他腹部关键器官,没有显著性差异,但我们观察到2D U-Net在所有研究器官的GPU计算开销方面具有非常显著的优势。 摘要:A two-step concept for 3D segmentation on 5 abdominal organs inside volumetric CT images is presented. First each relevant organ's volume of interest is extracted as bounding box. The extracted volume acts as input for a second stage, wherein two compared U-Nets with different architectural dimensions re-construct an organ segmentation as label mask. In this work, we focus on comparing 2D U-Nets vs. 3D U-Net counterparts. Our initial results indicate Dice improvements of about 6% at maximum. In this study to our surprise, liver and kidneys for instance were tackled significantly better using the faster and GPU-memory saving 2D U-Nets. For other abdominal key organs, there were no significant differences, but we observe highly significant advantages for the 2D U-Net in terms of GPU computational efforts for all organs under study.

Zero/Few Shot|迁移|域适配|自适应(3篇)

【1】 On the Challenges of Open World Recognition under Shifting Visual Domains 标题:论视觉域偏移下开放世界识别面临的挑战

作者:Dario Fontanel,Fabio Cermelli,Massimiliano Mancini,Barbara Caputo 备注:RAL/ICRA 2021 链接:https://arxiv.org/abs/2107.04461 摘要:在野外工作的机器人视觉系统必须在不受约束的场景中,在不同的环境条件下,同时面对各种语义概念,包括未知的语义概念。为此,最近的工作试图使视觉对象识别方法具有以下能力:i)检测看不见的概念;ii)随着时间的推移,随着新语义类图像的出现,扩展其知识。这种设置被称为开放世界识别(OWR),其目标是生成能够打破初始训练集中存在的语义限制的系统。然而,这种训练集不仅给系统施加了自己的语义限制,而且还施加了环境限制,因为它偏向于某些不一定反映现实世界高度可变性的获取条件。训练和测试分布之间的这种差异称为域转移。本文研究了OWR算法在域转移情况下的有效性,提出了第一个公平评估OWR算法性能的基准设置,包括有域转移和无域转移。然后,我们使用这个基准在各种场景中进行分析,显示当训练和测试分布不同时,现有的OWR算法是如何遭受严重的性能退化的。我们的分析表明,通过将OWR与领域泛化技术相结合,这种退化只得到了轻微的缓解,这表明现有算法的简单即插即用不足以识别未知领域中的新类别和未知类别。我们的研究结果清楚地指出了有待解决的问题和未来的研究方向,这些问题和方向是建立机器人视觉系统在这些具有挑战性但非常真实的条件下能够可靠地工作所需要研究的。代码位于https://github.com/DarioFontanel/OWR-VisualDomains 摘要:Robotic visual systems operating in the wild must act in unconstrained scenarios, under different environmental conditions while facing a variety of semantic concepts, including unknown ones. To this end, recent works tried to empower visual object recognition methods with the capability to i) detect unseen concepts and ii) extended their knowledge over time, as images of new semantic classes arrive. This setting, called Open World Recognition (OWR), has the goal to produce systems capable of breaking the semantic limits present in the initial training set. However, this training set imposes to the system not only its own semantic limits, but also environmental ones, due to its bias toward certain acquisition conditions that do not necessarily reflect the high variability of the real-world. This discrepancy between training and test distribution is called domain-shift. This work investigates whether OWR algorithms are effective under domain-shift, presenting the first benchmark setup for assessing fairly the performances of OWR algorithms, with and without domain-shift. We then use this benchmark to conduct analyses in various scenarios, showing how existing OWR algorithms indeed suffer a severe performance degradation when train and test distributions differ. Our analysis shows that this degradation is only slightly mitigated by coupling OWR with domain generalization techniques, indicating that the mere plug-and-play of existing algorithms is not enough to recognize new and unknown categories in unseen domains. Our results clearly point toward open issues and future research directions, that need to be investigated for building robot visual systems able to function reliably under these challenging yet very real conditions. Code available at https://github.com/DarioFontanel/OWR-VisualDomains

【2】 Wavelet Transform-assisted Adaptive Generative Modeling for Colorization 标题:小波变换辅助的自适应产生式着色建模

作者:Jin Li,Wanyun Li,Zichen Xu,Yuhao Wang,Qiegen Liu 机构:Department of Electronic Information Engineering, Nanchang University 链接:https://arxiv.org/abs/2107.04261 摘要:无监督深度学习最近展示了生成高质量样本的潜力。虽然它在图像彩色化方面有着巨大的潜力,但由于机器学习中的流形假设,其性能受到限制。针对这一问题,本文提出了一种基于分数的小波域生成模型的解决方案。该模型利用小波变换的多尺度、多通道表示,从叠加的小波系数分量中学习先验信息,从而联合有效地学习粗、细频谱下的图像特征。此外,这种高度灵活的生成模型不需要对抗式优化,在小波域的双重一致性条件下,即数据一致性和结构一致性,可以更好地执行着色任务。具体地说,在训练阶段,使用一组由小波系数组成的多通道张量作为输入,通过去噪得分匹配来训练网络。在测试阶段,通过退火Langevin动力学迭代生成具有数据和结构一致性的样本。实验结果表明,该模型在着色质量,特别是着色鲁棒性和多样性方面有显著的改进。 摘要:Unsupervised deep learning has recently demonstrated the promise to produce high-quality samples. While it has tremendous potential to promote the image colorization task, the performance is limited owing to the manifold hypothesis in machine learning. This study presents a novel scheme that exploiting the score-based generative model in wavelet domain to address the issue. By taking advantage of the multi-scale and multi-channel representation via wavelet transform, the proposed model learns the priors from stacked wavelet coefficient components, thus learns the image characteristics under coarse and detail frequency spectrums jointly and effectively. Moreover, such a highly flexible generative model without adversarial optimization can execute colorization tasks better under dual consistency terms in wavelet domain, namely data-consistency and structure-consistency. Specifically, in the training phase, a set of multi-channel tensors consisting of wavelet coefficients are used as the input to train the network by denoising score matching. In the test phase, samples are iteratively generated via annealed Langevin dynamics with data and structure consistencies. Experiments demonstrated remarkable improvements of the proposed model on colorization quality, particularly on colorization robustness and diversity.

【3】 Exploring Dropout Discriminator for Domain Adaptation 标题:用于领域自适应的丢弃判别器研究

作者:Vinod K Kurmi,Venkatesh K Subramanian,Vinay P. Namboodiri 机构:a Electrical Engineering Department, Indian Institute of Technology Kanpur, Kanpur, India, b Department of Computer Science and Engineering, Indian Institute of Technology 备注:This work is an extension of our BMVC-2019 paper (arXiv:1907.10628) 链接:https://arxiv.org/abs/2107.04231 摘要:如何使分类器适应新的领域是机器学习中一个具有挑战性的问题。这一点已经通过许多基于深度和非深度学习的方法来解决。在所使用的方法中,对抗性学习方法被广泛应用于解决许多深度学习问题和领域适应问题。这些方法是基于一个鉴别器,确保源和目标分布是密切的。然而,这里我们建议,与其使用由单个鉴别器获得的点估计,不如使用基于鉴别器集合的分布来弥合这一差距。这可以通过使用多个分类器或使用传统的集成方法来实现。与此相反,我们建议一个基于montecarlo辍学的系综鉴别器足以获得基于分布的鉴别器。具体来说,我们提出了一个基于课程的辍学鉴别器,该鉴别器逐渐增加基于样本的分布的方差,并使用相应的反向梯度来对齐源和目标的特征表示。一组鉴别器有助于模型有效地学习数据分布。它还提供了更好的梯度估计来训练特征抽取器。详细的结果和深入的烧蚀分析表明,我们的模型优于最新的结果。 摘要:Adaptation of a classifier to new domains is one of the challenging problems in machine learning. This has been addressed using many deep and non-deep learning based methods. Among the methodologies used, that of adversarial learning is widely applied to solve many deep learning problems along with domain adaptation. These methods are based on a discriminator that ensures source and target distributions are close. However, here we suggest that rather than using a point estimate obtaining by a single discriminator, it would be useful if a distribution based on ensembles of discriminators could be used to bridge this gap. This could be achieved using multiple classifiers or using traditional ensemble methods. In contrast, we suggest that a Monte Carlo dropout based ensemble discriminator could suffice to obtain the distribution based discriminator. Specifically, we propose a curriculum based dropout discriminator that gradually increases the variance of the sample based distribution and the corresponding reverse gradients are used to align the source and target feature representations. An ensemble of discriminators helps the model to learn the data distribution efficiently. It also provides a better gradient estimates to train the feature extractor. The detailed results and thorough ablation analysis show that our model outperforms state-of-the-art results.
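
摘要中的两个关键部件,梯度反转与基于 Monte Carlo dropout 的"判别器集成",可用下面的 PyTorch 片段示意;网络结构与超参均为本文假设,仅用于说明思路。

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """梯度反转层:前向恒等,反向把梯度乘以 -lambda。"""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None

class DropoutDiscriminator(nn.Module):
    """带 dropout 的域判别器;对同一特征多次随机前向即得到一个判别器集成。"""
    def __init__(self, feat_dim=256, p=0.5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Dropout(p), nn.Linear(128, 1))

    def forward(self, feat, lamb=1.0, n_samples=8):
        feat = GradReverse.apply(feat, lamb)
        return torch.stack([self.net(feat) for _ in range(n_samples)], dim=0)

# 用法示意
feats = torch.randn(16, 256, requires_grad=True)
domain_logits = DropoutDiscriminator()(feats)   # (n_samples, 16, 1)
print(domain_logits.shape)
```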

半弱无监督|主动学习|不确定性(3篇)

【1】 Gradient-Based Quantification of Epistemic Uncertainty for Deep Object Detectors 标题:基于梯度的深部目标探测器认知不确定性量化

作者:Tobias Riedlinger,Matthias Rottmann,Marius Schubert,Hanno Gottschalk 机构:School of Mathematics and Natural Sciences, University of Wuppertal 备注:20 pages, 8 figures, 7 tables 链接:https://arxiv.org/abs/2107.04517 摘要:可靠的认知不确定度估计是深部目标探测器在安全关键环境中后端应用的重要组成部分。现代网络体系结构倾向于用有限的预测能力提供校准不好的信心。在这里,我们介绍了一种新的基于梯度的不确定性度量,并针对不同的目标检测体系结构进行了研究。在MS-COCO、PASCAL-VOC和KITTI数据集上的实验表明,与网络置信度相比,在真阳性/假阳性鉴别和联合交叉预测方面有显著的改进。我们还发现了对蒙特卡罗不确定性度量的改进,并通过聚合不确定性度量的不同来源进一步显著提高了不确定性模型的可信度。此外,我们将我们的不确定性量化模型应用到目标检测管道中,作为区分预测真假的一种手段,取代了通常基于分数阈值的决策规则。在我们的实验中,我们在平均精度方面取得了显著的提高。在计算复杂度方面,我们发现计算梯度不确定性度量会导致浮点运算计数与montecarlo dropout类似。 摘要:Reliable epistemic uncertainty estimation is an essential component for backend applications of deep object detectors in safety-critical environments. Modern network architectures tend to give poorly calibrated confidences with limited predictive power. Here, we introduce novel gradient-based uncertainty metrics and investigate them for different object detection architectures. Experiments on the MS COCO, PASCAL VOC and the KITTI dataset show significant improvements in true positive / false positive discrimination and prediction of intersection over union as compared to network confidence. We also find improvement over Monte-Carlo dropout uncertainty metrics and further significant boosts by aggregating different sources of uncertainty metrics.The resulting uncertainty models generate well-calibrated confidences in all instances. Furthermore, we implement our uncertainty quantification models into object detection pipelines as a means to discern true against false predictions, replacing the ordinary score-threshold-based decision rule. In our experiments, we achieve a significant boost in detection performance in terms of mean average precision. With respect to computational complexity, we find that computing gradient uncertainty metrics results in floating point operation counts similar to those of Monte-Carlo dropout.
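
"基于梯度的不确定性度量"的基本思路是:把网络自身的预测当作伪标签,对损失求(部分)参数的梯度,用梯度范数作为不确定性分数。下面给出一个与论文无关的最小示意,仅演示分类情形,模型与取哪一层参数均为本文假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_uncertainty(model, last_layer, x):
    """对单个样本:以预测类别为伪标签,返回最后一层权重梯度的范数。"""
    logits = model(x.unsqueeze(0))
    pseudo_label = logits.argmax(dim=1)          # 自身预测作为伪标签
    loss = F.cross_entropy(logits, pseudo_label)
    grads = torch.autograd.grad(loss, list(last_layer.parameters()))
    return torch.cat([g.flatten() for g in grads]).norm().item()

# 用法示意(玩具模型)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
score = gradient_uncertainty(model, model[1], torch.randn(3, 32, 32))
print(score)
```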

【2】 Differentially private training of neural networks with Langevin dynamics for calibrated predictive uncertainty 标题:利用朗之万动力学对神经网络进行差分隐私训练以获得校准的预测不确定性

作者:Moritz Knolle,Alexander Ziller,Dmitrii Usynin,Rickmer Braren,Marcus R. Makowski,Daniel Rueckert,Georgios Kaissis 机构:Department of Diagnostic and Interventional Radiology, School of Medicine, Technical University of Munich, Germany; Institute for Artificial Intelligence and Informatics in Medicine, Technical University of Munich 备注:Accepted to the ICML 2021 Theory and Practice of Differential Privacy Workshop 链接:https://arxiv.org/abs/2107.04296 摘要:我们发现,差分隐私随机梯度下降(DP-SGD)可能产生校准较差、过度自信的深度学习模型。这是安全关键应用的一个严重问题,例如在医疗诊断中。我们强调并利用随机梯度Langevin动力学(一种用于训练深层神经网络的可伸缩贝叶斯推理技术)和DP-SGD之间的相似性,以便在对原始(DP-SGD)算法稍作调整的情况下训练差分隐私的贝叶斯神经网络。我们的方法提供了比DP-SGD更可靠的不确定度估计,如预期校准误差的减少(MNIST约5倍,儿科肺炎数据集约2倍)所证明的那样。 摘要:We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models. This represents a serious issue for safety-critical applications, e.g. in medical diagnosis. We highlight and exploit parallels between stochastic gradient Langevin dynamics, a scalable Bayesian inference technique for training deep neural networks, and DP-SGD, in order to train differentially private, Bayesian neural networks with minor adjustments to the original (DP-SGD) algorithm. Our approach provides considerably more reliable uncertainty estimates than DP-SGD, as demonstrated empirically by a reduction in expected calibration error (MNIST ~5-fold, Pediatric Pneumonia Dataset ~2-fold).
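
摘要中用于衡量校准程度的"预期校准误差(ECE)"可按如下方式计算;分桶数等均为常用默认值,并非原文的具体设置。

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE:按置信度分桶,累加 |桶内准确率 - 桶内平均置信度| 的加权和。"""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# 用法示意
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 2], [1, 1, 2]))
```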

【3】 A Multi-task Mean Teacher for Semi-supervised Facial Affective Behavior Analysis 标题:一种用于半监督面部情感行为分析的多任务均值教师

作者:Lingfeng Wang,Shisen Wang 机构:School of Information and Communication Engineering, University of Electronic Science and Technology of China 链接:https://arxiv.org/abs/2107.04225 摘要:情感行为分析是人机交互的重要组成部分。已有的成功的情感行为分析方法如TSAV[9]受到了不完全标记数据集的挑战。为了提高该模型的性能,本文提出了一种用于半监督情感行为分析的多任务均值-教师模型,以从缺失的标签中学习,同时探索多个相关任务的学习。具体来说,我们首先利用TSAV作为基线模型来同时识别这三个任务。为了提供更好的语义信息,我们对渲染掩码的预处理方法进行了改进。在此基础上,我们将TSAV模型推广到半监督模型中,利用均值教师模型,使得TSAV模型能够从未标记数据中获益。在验证数据集上的实验结果表明,该方法比TSAV模型具有更好的性能,验证了该网络能够有效地学习额外的未标记数据,提高情感行为分析的性能。 摘要:Affective Behavior Analysis is an important part in human-computer interaction. Existing successful affective behavior analysis method such as TSAV[9] suffer from challenge of incomplete labeled datasets. To boost its performance, this paper presents a multi-task mean teacher model for semi-supervised Affective Behavior Analysis to learn from missing labels and exploring the learning of multiple correlated task simultaneously. To be specific, we first utilize TSAV as baseline model to simultaneously recognize the three tasks. We have modified the preprocessing method of rendering mask to provide better semantics information. After that, we extended TSAV model to semi-supervised model using mean teacher, which allow it to be benefited from unlabeled data. Experimental results on validation datasets show that our method achieves better performance than TSAV model, which verifies that the proposed network can effectively learn additional unlabeled data to boost the affective behavior analysis performance.
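
Mean Teacher 的两个要点是"教师参数取学生参数的指数滑动平均"和"无标签数据上的一致性损失",下面给出一个通用示意;与该文的具体网络无关,动量等数值均为假设。

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.999):
    """教师参数 = momentum * 教师参数 + (1 - momentum) * 学生参数。"""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def consistency_loss(student_logits, teacher_logits):
    """无标签样本上,让学生预测逼近教师预测(教师侧不回传梯度)。"""
    return F.mse_loss(torch.softmax(student_logits, dim=1),
                      torch.softmax(teacher_logits.detach(), dim=1))

# 用法示意
student = torch.nn.Linear(128, 7)
teacher = copy.deepcopy(student)
x = torch.randn(8, 128)
loss = consistency_loss(student(x), teacher(x))
loss.backward()
update_teacher(student, teacher)
```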

医学相关(3篇)

【1】 Cross-modal Attention for MRI and Ultrasound Volume Registration 标题:MRI和超声容积配准中的跨模态注意

作者:Xinrui Song,Hengtao Guo,Xuanang Xu,Hanqing Chao,Sheng Xu,Baris Turkbey,Bradford J. Wood,Ge Wang,Pingkun Yan 机构: Department of Biomedical Engineering and Center for Biotechnology and, Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, NY , USA, Center for Interventional Oncology, Radiology & Imaging Sciences, National 链接:https://arxiv.org/abs/2107.04548 摘要:前列腺癌活检得益于经直肠超声(TRUS)和磁共振(MR)图像的精确融合。在过去的几年中,卷积神经网络(CNNs)在提取图像特征方面已经被证明是非常有效的。然而,具有挑战性的应用和计算机视觉的最新进展表明,CNNs在理解特征之间的空间对应关系方面的能力相当有限,而自我注意机制在这一任务中表现得尤为突出。本文旨在开发一种专门用于跨模态图像配准的自注意机制。我们提出的跨模态注意块有效地将一个卷中的每个特征映射到相应卷中的所有特征。实验结果表明,嵌入跨模态注意块的CNN网络的性能要比现有的CNN网络大10倍。我们还结合了可视化技术来提高我们网络的可解释性。我们工作的源代码可以在https://github.com/DIAL-RPI/Attention-Reg . 摘要:Prostate cancer biopsy benefits from accurate fusion of transrectal ultrasound (TRUS) and magnetic resonance (MR) images. In the past few years, convolutional neural networks (CNNs) have been proved powerful in extracting image features crucial for image registration. However, challenging applications and recent advances in computer vision suggest that CNNs are quite limited in its ability to understand spatial correspondence between features, a task in which the self-attention mechanism excels. This paper aims to develop a self-attention mechanism specifically for cross-modal image registration. Our proposed cross-modal attention block effectively maps each of the features in one volume to all features in the corresponding volume. Our experimental results demonstrate that a CNN network designed with the cross-modal attention block embedded outperforms an advanced CNN network 10 times of its size. We also incorporated visualization techniques to improve the interpretability of our network. The source code of our work is available at https://github.com/DIAL-RPI/Attention-Reg .
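
"跨模态注意力块"的作用是把一个体数据中的每个特征与另一体数据中的所有特征建立对应;下面用 3D 特征体给出一个简化示意,通道数、拼接方式等均为假设,并非论文原结构。

```python
import torch
import torch.nn as nn

class CrossModalAttention3D(nn.Module):
    """示意:一种模态(如 MR)的特征作 query,另一模态(如 TRUS)作 key/value。"""
    def __init__(self, ch=64, key_ch=16):
        super().__init__()
        self.q = nn.Conv3d(ch, key_ch, 1)
        self.k = nn.Conv3d(ch, key_ch, 1)
        self.v = nn.Conv3d(ch, ch, 1)

    def forward(self, feat_a, feat_b):
        B, C, D, H, W = feat_a.shape
        q = self.q(feat_a).flatten(2).transpose(1, 2)   # (B, N, key_ch)
        k = self.k(feat_b).flatten(2)                   # (B, key_ch, N)
        v = self.v(feat_b).flatten(2).transpose(1, 2)   # (B, N, C)
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, C, D, H, W)
        return torch.cat([feat_a, out], dim=1)          # 拼接自身特征与跨模态特征

# 用法示意
m = CrossModalAttention3D()
a, b = torch.randn(1, 64, 8, 16, 16), torch.randn(1, 64, 8, 16, 16)
print(m(a, b).shape)   # (1, 128, 8, 16, 16)
```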

【2】 Deep Learning models for benign and malign Ocular Tumor Growth Estimation 标题:用于眼部良恶性肿瘤生长估计的深度学习模型

作者:Mayank Goswami 机构:Divyadrishti Imaging Laboratory, Department of Physics, Indian Institute of Technology Roorkee, Roorkee, India 备注:22 Page, 10 Figures 链接:https://arxiv.org/abs/2107.04220 摘要:相对丰富的医学图像数据为基于神经网络的图像处理方法的开发和测试提供了重要的支持。临床医生经常面临的问题是如何选择合适的图像处理算法的医学影像数据。本文提出了一种选择合适模型的策略。训练数据集包括50只小鼠眼睛的光学相干断层扫描(OCT)和血管造影(OCT-A)图像,随访超过100天。这些数据包含了经过治疗和未经治疗的老鼠眼睛的图像。我们测试了四种深度学习变体,用于自动(a)区分具有健康视网膜层的肿瘤区域和(b)分割3D眼部肿瘤体积。利用8个性能指标对深度学习模型的训练和测试图像的数目进行了详尽的敏感性分析,以研究其准确性、可靠性/再现性和速度。带有UVgg16的U-net最适合于治疗的恶性肿瘤数据集(具有相当大的变化),带有起始主干的U-net最适合于良性肿瘤数据集(具有较小的变化)。损失值和均方根误差分别是最敏感和最不敏感的性能指标。性能(通过指数)被发现是指数级的改善有关一些训练图像。分割的OCT血管造影术数据显示新生血管驱动肿瘤体积。图像分析显示,光动力成像辅助肿瘤治疗方案正在将一个生长旺盛的肿瘤转化为囊肿。得到了一个经验表达式,以帮助医学专业人员在给定图像数量和特征类型的情况下选择特定的模型。我们建议在采用特定的深度学习模式进行生物医学图像分析之前,应将所提出的练习作为标准实践。 摘要:Relatively abundant availability of medical imaging data has provided significant support in the development and testing of Neural Network based image processing methods. Clinicians often face issues in selecting suitable image processing algorithm for medical imaging data. A strategy for the selection of a proper model is presented here. The training data set comprises optical coherence tomography (OCT) and angiography (OCT-A) images of 50 mice eyes with more than 100 days follow-up. The data contains images from treated and untreated mouse eyes. Four deep learning variants are tested for automatic (a) differentiation of tumor region with healthy retinal layer and (b) segmentation of 3D ocular tumor volumes. Exhaustive sensitivity analysis of deep learning models is performed with respect to the number of training and testing images using 8 eight performance indices to study accuracy, reliability/reproducibility, and speed. U-net with UVgg16 is best for malign tumor data set with treatment (having considerable variation) and U-net with Inception backbone for benign tumor data (with minor variation). Loss value and root mean square error (R.M.S.E.) are found most and least sensitive performance indices, respectively. The performance (via indices) is found to be exponentially improving regarding a number of training images. The segmented OCT-Angiography data shows that neovascularization drives the tumor volume. Image analysis shows that photodynamic imaging-assisted tumor treatment protocol is transforming an aggressively growing tumor into a cyst. An empirical expression is obtained to help medical professionals to choose a particular model given the number of images and types of characteristics. We recommend that the presented exercise should be taken as standard practice before employing a particular deep learning model for biomedical image analysis.

【3】 3D RegNet: Deep Learning Model for COVID-19 Diagnosis on Chest CT Image 标题:3D RegNet:胸部CT图像冠状病毒诊断的深度学习模型

作者:Haibo Qi,Yuhan Wang,Xinyu Liu 机构:Xidian University, Xian, China 链接:https://arxiv.org/abs/2107.04055 摘要:本文提出了一种基于三维RegNet的神经网络诊断冠状病毒(Covid-19)感染患者身体状况的方法。在临床医学的应用中,肺部CT图像被医生用来判断病人是否感染了冠状病毒。然而,对于这种诊断方法,还可以考虑一些延迟,例如耗时和低准确度。肺作为人体较大的器官,如果利用二维切片图像进行诊断,会丢失重要的空间特征。为此,本文设计了一个基于三维图像的深度学习模型。三维图像作为输入数据,由二维肺部图像序列组成,从中提取相关的冠状病毒感染三维特征并进行分类。结果表明,在三维模型的测试集上,f1得分为0.8379,AUC值为0.8807。 摘要:In this paper, a 3D-RegNet-based neural network is proposed for diagnosing the physical condition of patients with coronavirus (Covid-19) infection. In the application of clinical medicine, lung CT images are utilized by practitioners to determine whether a patient is infected with coronavirus. However, there are some laybacks can be considered regarding to this diagnostic method, such as time consuming and low accuracy. As a relatively large organ of human body, important spatial features would be lost if the lungs were diagnosed utilizing two dimensional slice image. Therefore, in this paper, a deep learning model with 3D image was designed. The 3D image as input data was comprised of two-dimensional pulmonary image sequence and from which relevant coronavirus infection 3D features were extracted and classified. The results show that the test set of the 3D model, the result: f1 score of 0.8379 and AUC value of 0.8807 have been achieved.

GAN|对抗|攻击|生成相关(6篇)

【1】 White-Box Cartoonization Using An Extended GAN Framework 标题:使用扩展GAN框架的白盒卡通化

作者:Amey Thakur,Hasan Rizvi,Mega Satish 机构:Department of Computer Engineering, Department of Computer Engineering, Department of Computer Engineering, University of Mumbai, University of Mumbai, University of Mumbai 备注:5 pages, 6 figures. International Journal of Engineering Applied Sciences and Technology, 2021 链接:https://arxiv.org/abs/2107.04551 摘要:在本研究中,我们提出一个新的框架来估计生成模型,通过一个对抗过程来扩展现有的GAN框架,并开发一个白盒可控的图像卡通化,它可以从真实世界的照片和视频中生成高质量的卡通图像/视频。我们系统的学习目的是基于三种不同的表示:表面表示、结构表示和纹理表示。表面表示是指图像的光滑表面。结构表示与稀疏色块相关并压缩一般内容。纹理表示法表示卡通图像中的纹理、曲线和特征。生成性对抗网络(GAN)框架将图像分解为不同的表示形式,并从中学习生成卡通图像。这种分解使得框架更加可控和灵活,允许用户根据所需的输出进行更改。这种方法克服了以往任何系统在保持清晰度,颜色,纹理,形状的图像,但显示出卡通形象的特点。 摘要:In the present study, we propose to implement a new framework for estimating generative models via an adversarial process to extend an existing GAN framework and develop a white-box controllable image cartoonization, which can generate high-quality cartooned images/videos from real-world photos and videos. The learning purposes of our system are based on three distinct representations: surface representation, structure representation, and texture representation. The surface representation refers to the smooth surface of the images. The structure representation relates to the sparse colour blocks and compresses generic content. The texture representation shows the texture, curves, and features in cartoon images. Generative Adversarial Network (GAN) framework decomposes the images into different representations and learns from them to generate cartoon images. This decomposition makes the framework more controllable and flexible which allows users to make changes based on the required output. This approach overcomes any previous system in terms of maintaining clarity, colours, textures, shapes of images yet showing the characteristics of cartoon images.

【2】 Learning to Detect Adversarial Examples Based on Class Scores 标题:基于类别分数学习检测对抗样本

作者:Tobias Uelwer,Felix Michels,Oliver De Candido 机构: Department of Computer Science, Heinrich Heine University Düsseldorf, Germany, Department of Electrical and Computer Engineering, Technical University of Munich, Germany 备注:Accepted at the 44th German Conference on Artificial Intelligence (KI 2021) 链接:https://arxiv.org/abs/2107.04435 摘要:随着对抗性攻击对深度神经网络的威胁越来越大,研究有效的检测方法显得尤为重要。在这项工作中,我们进一步研究了基于已训练分类模型类别分数的对抗攻击检测。我们建议在类别分数上训练一个支持向量机(SVM)来检测对抗样本。我们的方法能够检测出由各种攻击产生的对抗样本,并且可以很容易地应用到大量的深度分类模型中。我们表明,与现有方法相比,我们的方法取得了更高的检测率,同时易于实现。我们对不同的深度分类模型进行了广泛的实证分析,考察了各种最先进的对抗攻击。此外,我们观察到,我们提出的方法在检测组合的对抗攻击时效果更好。这项工作表明,仅利用已训练分类模型的类别分数,就可以检测出各种对抗攻击。 摘要:Given the increasing threat of adversarial attacks on deep neural networks (DNNs), research on efficient detection methods is more important than ever. In this work, we take a closer look at adversarial attack detection based on the class scores of an already trained classification model. We propose to train a support vector machine (SVM) on the class scores to detect adversarial examples. Our method is able to detect adversarial examples generated by various attacks, and can be easily adopted to a plethora of deep classification models. We show that our approach yields an improved detection rate compared to an existing method, whilst being easy to implement. We perform an extensive empirical analysis on different deep classification models, investigating various state-of-the-art adversarial attacks. Moreover, we observe that our proposed method is better at detecting a combination of adversarial attacks. This work indicates the potential of detecting various adversarial attacks simply by using the class scores of an already trained classification model.
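
"在已训练分类器的类别分数上训练 SVM 来检测对抗样本"的流程非常简单,下面用随机生成的分数演示接口;数据分布与数值均为本文虚构,仅作示意,并非论文实验。

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 假设已得到 softmax 类别分数:干净样本与对抗样本各 500 条、10 类
rng = np.random.default_rng(0)
clean_scores = rng.dirichlet(np.ones(10) * 0.3, size=500)  # 干净样本的分数通常更"尖"(示意)
adv_scores = rng.dirichlet(np.ones(10) * 2.0, size=500)    # 对抗样本的分数更平坦(示意)
X = np.vstack([clean_scores, adv_scores])
y = np.hstack([np.zeros(500), np.ones(500)])               # 1 = 对抗样本

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
detector = SVC(kernel="rbf").fit(X_tr, y_tr)               # 在类别分数上训练 SVM
print("检测准确率:", detector.score(X_te, y_te))
```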

【3】 Graph-based Deep Generative Modelling for Document Layout Generation 标题:基于图的文档版面生成的深度生成建模

作者:Sanket Biswas,Pau Riba,Josep Lladós,Umapada Pal 机构: Computer Vision Center & Computer Science Department, Universitat Autonoma de Barcelona, Spain, CVPR Unit, Indian Statistical Institute, India 备注:Accepted by ICDAR Workshops-GLESDO 2021 链接:https://arxiv.org/abs/2107.04357 摘要:任何深度学习方法的主要先决条件之一是提供大规模的训练数据。在处理现实场景中扫描的文档图像时,其内容的主要信息存储在版面中。在这项工作中,我们提出了一个自动化的深度生成模型,使用图神经网络(GNNs)生成具有高度可变和合理文档布局的合成数据,可用于训练文档解释系统,在这种情况下,特别是在数字邮件收发应用中。这也是第一个基于图形的文档布局生成方法,在管理文档图像(本例中是发票)上进行了实验。 摘要:One of the major prerequisites for any deep learning approach is the availability of large-scale training data. When dealing with scanned document images in real world scenarios, the principal information of its content is stored in the layout itself. In this work, we have proposed an automated deep generative model using Graph Neural Networks (GNNs) to generate synthetic data with highly variable and plausible document layouts that can be used to train document interpretation systems, in this case, specially in digital mailroom applications. It is also the first graph-based approach for document layout generation task experimented on administrative document images, in this case, invoices.

【4】 StyleCariGAN: Caricature Generation via StyleGAN Feature Map Modulation 标题:StyleCariGAN:基于StyleGAN要素地图调制的漫画生成

作者:Wonjong Jang,Gwangjin Ju,Yucheol Jung,Jiaolong Yang,Xin Tong,Seungyong Lee 备注:None 链接:https://arxiv.org/abs/2107.04331 摘要:提出了一种基于StyleGAN的形状和风格操纵的漫画生成框架。我们的框架,称为StyleCariGAN,自动创建一个现实和详细的漫画从一个输入的照片与可选的控制形状夸张程度和颜色样式类型。我们的方法的关键部分是形状放大块,用于调制StyleGAN的粗层特征图以产生理想的漫画形状放大。我们首先建立一个图层混合样式器,将图片样式器的精细图层交换到相应的图层样式器中,以生成漫画。给定一张输入照片,图层混合模型为漫画生成详细的颜色样式,但没有形状夸张。然后在混合模型的粗层中加入形状夸张块,在保持输入特征外观的前提下,对块进行训练以产生形状夸张。实验结果表明,与目前最先进的方法相比,我们的StyleCariGAN可以生成逼真、细致的漫画。我们证明StyleCariGAN还支持其他基于StyleGAN的图像处理,比如面部表情控制。 摘要:We present a caricature generation framework based on shape and style manipulation using StyleGAN. Our framework, dubbed StyleCariGAN, automatically creates a realistic and detailed caricature from an input photo with optional controls on shape exaggeration degree and color stylization type. The key component of our method is shape exaggeration blocks that are used for modulating coarse layer feature maps of StyleGAN to produce desirable caricature shape exaggerations. We first build a layer-mixed StyleGAN for photo-to-caricature style conversion by swapping fine layers of the StyleGAN for photos to the corresponding layers of the StyleGAN trained to generate caricatures. Given an input photo, the layer-mixed model produces detailed color stylization for a caricature but without shape exaggerations. We then append shape exaggeration blocks to the coarse layers of the layer-mixed model and train the blocks to create shape exaggerations while preserving the characteristic appearances of the input. Experimental results show that our StyleCariGAN generates realistic and detailed caricatures compared to the current state-of-the-art methods. We demonstrate StyleCariGAN also supports other StyleGAN-based image manipulations, such as facial expression control.

【5】 JPGNet: Joint Predictive Filtering and Generative Network for Image Inpainting 标题:JPGNet:用于图像修复的联合预测滤波和生成网络

作者:Xiaoguang Li,Qing Guo,Felix Juefei-Xu,Hongkai Yu,Yang Liu,Song wang 机构:University of South Carolina, USA, Nanyang Technological University, Singapore, Alibaba Group, USA, Cleveland State University, USA 备注:This work has been accepted to ACM-MM 2021 链接:https://arxiv.org/abs/2107.04281 摘要:图像修复的目的是恢复缺失区域,使恢复结果与原始完整图像一致,这不同于一般的生成任务强调生成图像的自然性。然而,现有的研究通常将其视为一个纯粹的生成问题,并采用前沿的生成技术来解决。生成网络用现实的内容填补了主要缺失的部分,但通常会扭曲局部结构。在本文中,我们将图像修复描述为两个问题的混合,即预测滤波和深度生成。预测滤波在保留局部结构和去除伪影方面有很好的效果,但在完成较大的缺失区域方面存在不足。深层生成网络能够在对整个场景理解的基础上,对大量缺失的像素进行填充,但很难恢复到与原始图像相同的细节。为了充分利用它们各自的优势,我们提出了联合预测滤波与生成网络(JPGNet),它包括三个分支:预测滤波与不确定性网络(PFUNet)、深层生成网络和不确定性感知融合网络(UAFNet)。PFUNet可以根据输入图像自适应地预测像素核进行滤波修复,并输出不确定性映射。该图表示像素应该通过过滤或生成网络进行处理,这些网络被进一步馈送到UAFNet,以便在过滤和生成结果之间进行智能组合。值得注意的是,我们的方法作为图像修复问题的一个新框架,可以使任何现有的基于生成的方法受益。我们在三个公共数据集,即Dunhuang、Places2和CelebA上验证了我们的方法,并证明了我们的方法可以显著地增强三种最先进的生成方法(即StructFlow、EdgeConnect和RFRNet),而且只需稍微增加时间开销。 摘要:Image inpainting aims to restore the missing regions and make the recovery results identical to the originally complete image, which is different from the common generative task emphasizing the naturalness of generated images. Nevertheless, existing works usually regard it as a pure generation problem and employ cutting-edge generative techniques to address it. The generative networks fill the main missing parts with realistic contents but usually distort the local structures. In this paper, we formulate image inpainting as a mix of two problems, i.e., predictive filtering and deep generation. Predictive filtering is good at preserving local structures and removing artifacts but falls short to complete the large missing regions. The deep generative network can fill the numerous missing pixels based on the understanding of the whole scene but hardly restores the details identical to the original ones. To make use of their respective advantages, we propose the joint predictive filtering and generative network (JPGNet) that contains three branches: predictive filtering & uncertainty network (PFUNet), deep generative network, and uncertainty-aware fusion network (UAFNet). The PFUNet can adaptively predict pixel-wise kernels for filtering-based inpainting according to the input image and output an uncertainty map. This map indicates the pixels should be processed by filtering or generative networks, which is further fed to the UAFNet for a smart combination between filtering and generative results. Note that, our method as a novel framework for the image inpainting problem can benefit any existing generation-based methods. We validate our method on three public datasets, i.e., Dunhuang, Places2, and CelebA, and demonstrate that our method can enhance three state-of-the-art generative methods (i.e., StructFlow, EdgeConnect, and RFRNet) significantly with the slightly extra time cost.
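
JPGNet 的最后一步是按不确定性图在"滤波结果"与"生成结果"之间做逐像素融合;真实的 UAFNet 是一个学习得到的网络,下面只是用一个固定公式示意这一融合思想,取值范围等均为假设。

```python
import torch

def uncertainty_aware_fusion(filtered_img, generated_img, uncertainty):
    """逐像素融合:不确定性高的位置更依赖生成网络,低的位置保留滤波结果。
    filtered_img, generated_img: (B, 3, H, W);uncertainty: (B, 1, H, W),取值 [0, 1]。"""
    u = uncertainty.clamp(0, 1)
    return (1 - u) * filtered_img + u * generated_img

# 用法示意
f, g = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
u = torch.rand(1, 1, 64, 64)
print(uncertainty_aware_fusion(f, g, u).shape)
```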

【6】 Deep Image Synthesis from Intuitive User Input: A Review and Perspectives 标题:基于直观用户输入的深度图像合成:回顾与展望

作者:Yuan Xue,Yuan-Chen Guo,Han Zhang,Tao Xu,Song-Hai Zhang,Xiaolei Huang 机构: The Pennsylvania State University, University Park, PA, USA, Tsinghua University, Beijing, China, Google Brain, Mountain View, CA, USA, Facebook, Menlo Park, CA, USA 备注:None 链接:https://arxiv.org/abs/2107.04240 摘要:在计算机图形学、艺术和设计的许多应用中,用户希望提供直观的非图像输入,例如文本、草图、笔划、图形或布局,并且计算机系统自动生成附着于输入内容的照片真实图像。虽然允许这种自动图像内容生成的经典作品遵循了图像检索和合成的框架,但是在深层生成模型方面的最新进展,例如生成性对抗网络(GANs)、变分自动编码器(VAEs)和基于流的方法,使得更强大和通用的图像生成任务成为可能。本文回顾了近年来基于用户输入的图像合成方法,包括输入通用性、图像生成方法、基准数据集和评价指标等方面的研究进展。这激发了对输入表示和交互、主要图像生成范式之间的交叉授粉以及生成方法的评估和比较的新观点。 摘要:In many applications of computer graphics, art and design, it is desirable for a user to provide intuitive non-image input, such as text, sketch, stroke, graph or layout, and have a computer system automatically generate photo-realistic images that adhere to the input content. While classic works that allow such automatic image content generation have followed a framework of image retrieval and composition, recent advances in deep generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and flow-based methods have enabled more powerful and versatile image generation tasks. This paper reviews recent works for image synthesis given intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics. This motivates new perspectives on input representation and interactivity, cross pollination between major image generation paradigms, and evaluation and comparison of generation methods.

NAS模型搜索(1篇)

【1】 Mutually-aware Sub-Graphs Differentiable Architecture Search 标题:相互感知的子图可微分体系结构搜索

作者:Haoxian Tan,Sheng Guo,Yujie Zhong,Weilin Huang 机构:Malong LLC., MYBank, Ant Group 链接:https://arxiv.org/abs/2107.04324 摘要:可差分体系结构搜索以其简单高效的特点在NAS领域得到了广泛的应用,其中多路径算法和单路径算法占据主导地位。多路径框架(如DART)是直观的,但会受到内存使用和训练崩溃的影响。单路径方法(如GDAS和ProxylessNAS)减轻了内存问题,缩小了搜索和评估之间的差距,但牺牲了性能。在本文中,我们提出了一种概念上简单而有效的方法来桥接这两种范式,称为相互感知子图可微结构搜索(MSG-DAS)。我们框架的核心是一个可微Gumbel-TopK采样器,它产生多个相互排斥的单路径子图。为了缓解多子图设置带来的严重跳接问题,我们提出了一个Dropblock标识模块来稳定优化。为了充分利用现有的模型(超网和子图),我们引入了一种记忆高效的超网引导蒸馏来改进训练。该框架在灵活的内存使用和搜索质量之间取得了平衡。我们在ImageNet和CIFAR10上展示了我们的方法的有效性,其中搜索的模型显示了与最新方法相当的性能。 摘要:Differentiable architecture search is prevalent in the field of NAS because of its simplicity and efficiency, where two paradigms, multi-path algorithms and single-path methods, are dominated. Multi-path framework (e.g. DARTS) is intuitive but suffers from memory usage and training collapse. Single-path methods (e.g.GDAS and ProxylessNAS) mitigate the memory issue and shrink the gap between searching and evaluation but sacrifice the performance. In this paper, we propose a conceptually simple yet efficient method to bridge these two paradigms, referred as Mutually-aware Sub-Graphs Differentiable Architecture Search (MSG-DAS). The core of our framework is a differentiable Gumbel-TopK sampler that produces multiple mutually exclusive single-path sub-graphs. To alleviate the severer skip-connect issue brought by multiple sub-graphs setting, we propose a Dropblock-Identity module to stabilize the optimization. To make best use of the available models (super-net and sub-graphs), we introduce a memory-efficient super-net guidance distillation to improve training. The proposed framework strikes a balance between flexible memory usage and searching quality. We demonstrate the effectiveness of our methods on ImageNet and CIFAR10, where the searched models show a comparable performance as the most recent approaches.
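作为示意,下面用几行 PyTorch 展示"Gumbel-TopK 采样产生多条互斥单路径"这一核心部件的一个简化版本:在每条边上为结构参数加入 Gumbel 噪声后取 Top-K,第 i 个子图在每条边取第 i 个被选中的操作,因而 K 个单路径子图互不重叠。函数接口与温度参数均为示例假设,并非论文的完整实现(也未展示直通梯度等可微化细节)。

```python
import torch

def gumbel_topk_subgraphs(logits, k, tau=1.0):
    # logits: (num_edges, num_ops), 每条边上各候选操作的结构参数
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
    scores = (logits + gumbel) / tau
    topk = scores.topk(k, dim=-1).indices            # (num_edges, k)
    # 第 i 个子图在每条边选择第 i 个被采样到的操作, K 个子图互斥
    return [topk[:, i] for i in range(k)]

subgraphs = gumbel_topk_subgraphs(torch.zeros(14, 8), k=3)  # 假想的 14 条边、8 种操作的搜索空间
```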

跟踪(2篇)

【1】 Event-Based Feature Tracking in Continuous Time with Sliding Window Optimization 标题:基于滑动窗口优化的连续时间基于事件的特征跟踪

作者:Jason Chui,Simon Klenk,Daniel Cremers 机构:Technical University of Munich, Germany 备注:9 pages, 4 figures, 1 table 链接:https://arxiv.org/abs/2107.04536 摘要:提出了一种事件摄像机连续时间特征跟踪的新方法。为此,我们通过在时空中沿估计的轨迹对齐事件来跟踪特征,从而使图像平面上的投影产生最大锐利的事件斑图像。轨迹由$n^{th}$阶B样条函数参数化,B样条函数连续到$(n-2)^{th}$阶导数。与以往的工作不同,本文采用滑动窗口的方式对曲线参数进行了优化。在一个公共数据集上,我们通过实验证实了所提出的滑动窗口B样条优化方法比以前的方法能得到更长更精确的特征轨迹。 摘要:We propose a novel method for continuous-time feature tracking in event cameras. To this end, we track features by aligning events along an estimated trajectory in space-time such that the projection on the image plane results in maximally sharp event patch images. The trajectory is parameterized by $n^{th}$ order B-splines, which are continuous up to $(n-2)^{th}$ derivative. In contrast to previous work, we optimize the curve parameters in a sliding window fashion. On a public dataset we experimentally confirm that the proposed sliding-window B-spline optimization leads to longer and more accurate feature tracks than in previous work.
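作为补充示例,下面用 NumPy 演示摘要里"轨迹由三次(n=3)均匀 B 样条参数化"的含义:给定图像平面上的控制点,即可在任意连续时刻求出特征位置。控制点的取得方式、时间归一化方式均为示例假设,与论文的滑动窗口优化本身无关。

```python
import numpy as np

# 均匀三次 B 样条的基矩阵
M = (1.0 / 6.0) * np.array([[ 1,  4,  1, 0],
                            [-3,  0,  3, 0],
                            [ 3, -6,  3, 0],
                            [-1,  3, -3, 1]])

def cubic_bspline(ctrl_pts, t):
    # ctrl_pts: (N, 2) 控制点, N >= 4; t: 归一化时间, 取值 [0, 1)
    n_seg = len(ctrl_pts) - 3
    s = t * n_seg
    i = min(int(s), n_seg - 1)              # 当前样条段
    u = s - i
    U = np.array([1.0, u, u ** 2, u ** 3])
    return U @ M @ ctrl_pts[i:i + 4]        # 该时刻特征在图像平面上的位置

pts = np.array([[0, 0], [1, 2], [3, 3], [5, 2], [6, 0]], dtype=float)
print(cubic_bspline(pts, 0.5))
```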

【2】 Score refinement for confidence-based 3D multi-object tracking 标题:基于置信度的三维多目标跟踪分数细化算法

作者:Nuri Benbarka,Jona Schröder,Andreas Zell 机构: University of T¨ubingen 备注:Accepted at IROS 2021 链接:https://arxiv.org/abs/2107.04327 摘要:多目标跟踪是自主导航的一个重要组成部分,它为决策提供了有价值的信息。许多研究者通过滤除逐帧的三维检测来解决三维多目标跟踪问题;然而,他们的重点主要是寻找有用的特征或合适的匹配度量。我们的工作集中在跟踪系统中一个被忽视的部分:分数细化和tracklet终止。我们表明,根据时间一致性操纵分数,同时根据tracklet分数终止tracklet,可以改善跟踪结果。我们通过使用分数更新函数增加匹配的tracklet的分数并减少不匹配的tracklet的分数来实现这一点。与基于计数的方法相比,在不同的数据集上使用不同的检测器和过滤算法时,我们的方法一致地产生更好的AMOTA和MOTA分数。AMOTA评分提高了1.83分,MOTA评分提高了2.96分。我们还使用了我们的方法作为一种后期融合集成方法,它比基于投票的集成方法具有更高的性能。它在nuScenes测试评估中获得了67.6分的AMOTA分数,这与其他最先进的跟踪器相当。代码可在以下网址公开获取:url{https://github.com/cogsys-tuebingen/CBMOT}. 摘要:Multi-object tracking is a critical component in autonomous navigation, as it provides valuable information for decision-making. Many researchers tackled the 3D multi-object tracking task by filtering out the frame-by-frame 3D detections; however, their focus was mainly on finding useful features or proper matching metrics. Our work focuses on a neglected part of the tracking system: score refinement and tracklet termination. We show that manipulating the scores depending on time consistency while terminating the tracklets depending on the tracklet score improves tracking results. We do this by increasing the matched tracklets' score with score update functions and decreasing the unmatched tracklets' score. Compared to count-based methods, our method consistently produces better AMOTA and MOTA scores when utilizing various detectors and filtering algorithms on different datasets. The improvements in AMOTA score went up to 1.83 and 2.96 in MOTA. We also used our method as a late-fusion ensembling method, and it performed better than voting-based ensemble methods by a solid margin. It achieved an AMOTA score of 67.6 on nuScenes test evaluation, which is comparable to other state-of-the-art trackers. Code is publicly available at: url{https://github.com/cogsys-tuebingen/CBMOT}.
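摘要中"按时间一致性调整分数、按轨迹分数终止轨迹"可以用一个很小的更新函数来说明。下面的写法(匹配成功按概率式融合提升分数、未匹配按固定量衰减、低于阈值即终止)只是满足这一描述的示例假设,并非论文给出的确切公式,具体实现请以其开源代码为准。

```python
def update_scores(track_scores, det_scores, matched, decay=0.1, kill_thresh=0.05):
    # track_scores: {track_id: score}; det_scores: {det_id: score}
    # matched: {track_id: det_id}; 返回更新后的分数和应终止的轨迹集合
    new_scores, terminated = {}, set()
    for tid, s in track_scores.items():
        if tid in matched:
            d = det_scores[matched[tid]]
            s = 1.0 - (1.0 - s) * (1.0 - d)   # 匹配成功: 分数提升
        else:
            s = max(s - decay, 0.0)           # 未匹配: 分数衰减
        new_scores[tid] = s
        if s < kill_thresh:
            terminated.add(tid)               # 按轨迹分数终止 tracklet
    return new_scores, terminated
```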

裁剪|量化|加速|压缩相关(1篇)

【1】 Does Form Follow Function? An Empirical Exploration of the Impact of Deep Neural Network Architecture Design on Hardware-Specific Acceleration 标题:形式遵循功能吗?深度神经网络结构设计对硬件加速影响的实证研究

作者:Saad Abbasi,Mohammad Javad Shafiee,Ellick Chan,Alexander Wong 机构:Waterloo, ON, Canada, University of Waterloo, DarwinAI, Intel Corporation, United States 备注:8 pages 链接:https://arxiv.org/abs/2107.04144 摘要:关于深层神经网络架构设计和硬件特定加速的形式和功能之间的细粒度关系是研究文献中没有很好研究的一个领域,形式通常由精度而不是硬件功能决定。在这项研究中,我们进行了全面的实证研究,以探讨深层神经网络架构设计对通过特定硬件加速实现的推理加速程度的影响。更具体地说,我们通过OpenVINO微处理器特定加速和GPU特定加速的视角,实证研究了各种常用的宏体系结构设计模式对不同体系结构深度的影响。实验结果表明,在利用硬件特定加速的情况下,平均推理速度提高了380%,而推理速度的提高程度因宏体系结构设计模式的不同而有很大的差异,其中最快的加速速度达到了550%。此外,我们还深入探讨了随着体系结构深度和宽度的增加,FLOPs需求、3级缓存效率和网络延迟之间的关系。最后,我们分析了在各种手工制作的深度卷积神经网络架构设计以及通过神经架构搜索策略发现的架构设计中,使用硬件特定加速与本地深度学习框架相比,推理时间的减少。我们发现,DARTS派生的体系结构受益于硬件特定软件加速的最大改进(1200%),而基于深度瓶颈卷积的MobileNet-V2的总体推断时间最低,约为2.4毫秒。 摘要:The fine-grained relationship between form and function with respect to deep neural network architecture design and hardware-specific acceleration is one area that is not well studied in the research literature, with form often dictated by accuracy as opposed to hardware function. In this study, a comprehensive empirical exploration is conducted to investigate the impact of deep neural network architecture design on the degree of inference speedup that can be achieved via hardware-specific acceleration. More specifically, we empirically study the impact of a variety of commonly used macro-architecture design patterns across different architectural depths through the lens of OpenVINO microprocessor-specific and GPU-specific acceleration. Experimental results showed that while leveraging hardware-specific acceleration achieved an average inference speed-up of 380%, the degree of inference speed-up varied drastically depending on the macro-architecture design pattern, with the greatest speedup achieved on the depthwise bottleneck convolution design pattern at 550%. Furthermore, we conduct an in-depth exploration of the correlation between FLOPs requirement, level 3 cache efficacy, and network latency with increasing architectural depth and width. Finally, we analyze the inference time reductions using hardware-specific acceleration when compared to native deep learning frameworks across a wide variety of hand-crafted deep convolutional neural network architecture designs as well as ones found via neural architecture search strategies. We found that the DARTS-derived architecture to benefit from the greatest improvement from hardware-specific software acceleration (1200%) while the depthwise bottleneck convolution-based MobileNet-V2 to have the lowest overall inference time of around 2.4 ms.

蒸馏|知识提取(1篇)

【1】 Multi-Modal Association based Grouping for Form Structure Extraction 标题:用于表单结构提取的基于多模态关联的分组方法

作者:Milan Aggarwal,Mausoom Sarkar,Hiresh Gupta,Balaji Krishnamurthy 机构:Media and Data Science Research Lab, Adobe, Adobe Experience Cloud 备注:This work has been accepted and presented at WACV 2020 链接:https://arxiv.org/abs/2107.04396 摘要:文档结构抽取是近几十年来广泛研究的领域。最近这方面的工作是基于深度学习的,主要集中在通过语义分割使用完全卷积神经网络来提取结构。在这项工作中,我们提出了一种新的多模态形式结构提取方法。给定textruns和widget等简单元素,我们提取TextBlocks、文本字段、选择字段和选择组等高阶结构,这些结构对于表单中的信息收集非常重要。为了实现这一点,我们通过识别离它最近的候选元素,在每个低层元素(参考)周围获得一个局部图像补丁。我们通过BiLSTM依次处理候选图像的文本和空间表示,以获得上下文感知的表示,并将其与通过CNN处理得到的图像块特征融合。随后,序列解码器采用该融合特征向量来预测参考和候选之间的关联类型。这些预测的关联通过连接成分分析来确定更大的结构。实验结果表明,该方法对上述结构的召回率分别为90.29%、73.80%、83.12%和52.72%,显著优于语义分割基线。我们通过烧蚀显示了我们的方法的有效性,并将其与单独使用的方法进行了比较。我们还介绍了新的富人类注释表单数据集。 摘要:Document structure extraction has been a widely researched area for decades. Recent work in this direction has been deep learning-based, mostly focusing on extracting structure using fully convolution NN through semantic segmentation. In this work, we present a novel multi-modal approach for form structure extraction. Given simple elements such as textruns and widgets, we extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups, which are essential for information collection in forms. To achieve this, we obtain a local image patch around each low-level element (reference) by identifying candidate elements closest to it. We process textual and spatial representation of candidates sequentially through a BiLSTM to obtain context-aware representations and fuse them with image patch features obtained by processing it through a CNN. Subsequently, the sequential decoder takes this fused feature vector to predict the association type between reference and candidates. These predicted associations are utilized to determine larger structures through connected components analysis. Experimental results show the effectiveness of our approach achieving a recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively, outperforming semantic segmentation baselines significantly. We show the efficacy of our method through ablations, comparing it against using individual modalities. We also introduce our new rich human-annotated Forms Dataset.
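为帮助理解摘要中"BiLSTM 编码候选的文本/空间表示 + CNN 编码参考元素周围图像块,融合后预测关联类型"的结构,下面给出一个高度简化的 PyTorch 模块示意。维度、层数、关联类别数均为示例假设,且此处用逐候选的线性分类器代替了论文中的序列解码器。

```python
import torch
import torch.nn as nn

class AssociationNet(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, n_assoc=4):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cls = nn.Linear(2 * hidden + 32, n_assoc)

    def forward(self, cand_feats, patch):
        # cand_feats: (B, T, feat_dim) 候选元素的文本+空间特征序列
        # patch: (B, 3, H, W) 参考元素周围的局部图像块
        ctx, _ = self.rnn(cand_feats)                       # (B, T, 2*hidden)
        img = self.cnn(patch).unsqueeze(1).expand(-1, ctx.size(1), -1)
        return self.cls(torch.cat([ctx, img], dim=-1))      # 每个候选的关联类型 logits

logits = AssociationNet()(torch.rand(2, 5, 64), torch.rand(2, 3, 64, 64))
```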

超分辨率|去噪|去模糊|去雾(2篇)

【1】 Effectiveness of State-of-the-Art Super Resolution Algorithms in Surveillance Environment 标题:超分辨率算法在监控环境中的有效性研究

作者:Muhammad Ali Farooq,Ammar Ali Khan,Ansar Ahmad,Rana Hammad Raza 机构: National University of Ireland (NUIG), Galway, Ireland, Pakistan Navy Engineering College (PNEC), National University of Sciences and Technology, (NUST), Karachi, Pakistan 备注:None 链接:https://arxiv.org/abs/2107.04133 摘要:图像超分辨率(SR)在需要观察者仔细观察图像以提取增强信息的领域有着广泛的应用。其中一个重点应用是监视源的离线法医分析。由于摄像机硬件、摄像机姿态、有限带宽、变化的照明条件和遮挡的限制,监视馈送的质量有时会显著降低,从而影响对行为、活动和场景中其他零星信息的监视。在所提出的研究工作中,我们检验了四种传统但有效的SR算法和三种基于深度学习的SR算法的有效性,以寻求在训练数据操作有限的监视环境中执行良好的最佳方法。这些算法从单个低分辨率(LR)输入图像生成增强分辨率的输出图像。为了进行性能分析,我们使用了来自6个监控数据集的220幅图像的子集,这些图像由与摄像机的距离不同、光照条件变化和背景复杂的个体组成。使用定性和定量指标对这些算法的性能进行了评估和比较。在人脸检测精度的基础上,对这些SR算法进行了比较。通过分析和比较各种算法的性能,证明了基于卷积神经网络(CNN)的外部字典SR技术在不同的监控条件下能够获得鲁棒的人脸检测精度和最优的量化度量结果。这是因为CNN层使用外部字典逐步学习更复杂的特征。 摘要:Image Super Resolution (SR) finds applications in areas where images need to be closely inspected by the observer to extract enhanced information. One such focused application is an offline forensic analysis of surveillance feeds. Due to the limitations of camera hardware, camera pose, limited bandwidth, varying illumination conditions, and occlusions, the quality of the surveillance feed is significantly degraded at times, thereby compromising monitoring of behavior, activities, and other sporadic information in the scene. For the proposed research work, we have inspected the effectiveness of four conventional yet effective SR algorithms and three deep learning-based SR algorithms to seek the finest method that executes well in a surveillance environment with limited training data op-tions. These algorithms generate an enhanced resolution output image from a sin-gle low-resolution (LR) input image. For performance analysis, a subset of 220 images from six surveillance datasets has been used, consisting of individuals with varying distances from the camera, changing illumination conditions, and complex backgrounds. The performance of these algorithms has been evaluated and compared using both qualitative and quantitative metrics. These SR algo-rithms have also been compared based on face detection accuracy. By analyzing and comparing the performance of all the algorithms, a Convolutional Neural Network (CNN) based SR technique using an external dictionary proved to be best by achieving robust face detection accuracy and scoring optimal quantitative metric results under different surveillance conditions. This is because the CNN layers progressively learn more complex features using an external dictionary.

【2】 Retinal OCT Denoising with Pseudo-Multimodal Fusion Network 标题:基于伪多模融合网络的视网膜OCT去噪

作者:Dewei Hu,Joseph D. Malone,Yigit Atay,Yuankai K. Tao,Ipek Oguz 机构: Vanderbilt University, Dept. of Electrical Engineering and Computer Science, Vanderbilt University, Dept. of Biomedical Engineering, Nashville, TN, USA 备注:Accepted by International Workshop on Ophthalmic Medical Image Analysis (OMIA) 2020 链接:https://arxiv.org/abs/2107.04288 摘要:光学相干断层扫描(OCT)是目前比较流行的视网膜成像技术。然而,它受到乘性斑点噪声的影响,乘性斑点噪声会降低基本解剖结构(包括血管和组织层)的可见性。尽管平均重复的B扫描帧可以显著提高信噪比(SNR),但这需要更长的采集时间,这会引入运动伪影并给患者带来不适。在这项研究中,我们提出了一种基于学习的方法,该方法利用了来自单帧噪声B扫描的信息,并利用自融合方法创建了一个伪模态。伪模态为在有噪声的B扫描中几乎看不到的层提供了良好的信噪比,但可以过度平滑细小的特征,如小血管。通过使用融合网络,可以组合来自每个模态的期望特征,并且它们的贡献权重是可调整的。结果表明,该方法能有效地抑制斑点噪声,增强视网膜层间的对比度,同时保留了视网膜的整体结构和小血管。与单模态网络相比,该方法将低噪声B扫描的结构相似度从0.559 ± 0.033提高到0.576 ± 0.031。 摘要:Optical coherence tomography (OCT) is a prevalent imaging technique for retina. However, it is affected by multiplicative speckle noise that can degrade the visibility of essential anatomical structures, including blood vessels and tissue layers. Although averaging repeated B-scan frames can significantly improve the signal-to-noise-ratio (SNR), this requires longer acquisition time, which can introduce motion artifacts and cause discomfort to patients. In this study, we propose a learning-based method that exploits information from the single-frame noisy B-scan and a pseudo-modality that is created with the aid of the self-fusion method. The pseudo-modality provides good SNR for layers that are barely perceptible in the noisy B-scan but can over-smooth fine features such as small vessels. By using a fusion network, desired features from each modality can be combined, and the weight of their contribution is adjustable. Evaluated by intensity-based and structural metrics, the result shows that our method can effectively suppress the speckle noise and enhance the contrast between retina layers while the overall structure and small blood vessels are preserved. Compared to the single modality network, our method improves the structural similarity with low noise B-scan from 0.559 ± 0.033 to 0.576 ± 0.031.

多模态(1篇)

【1】 Multimodal Icon Annotation For Mobile Applications 标题:面向移动应用的多模态图标标注

作者:Xiaoxue Zang,Ying Xu,Jindong Chen 机构:Google Research, Mountain View, CA, United States 备注:11 pages, MobileHCI 2021 链接:https://arxiv.org/abs/2107.04452 摘要:注释用户界面(UI)涉及对屏幕上有意义的UI元素进行定位和分类,对于许多移动应用程序(如屏幕阅读器和设备的语音控制)来说是一个关键步骤。由于屏幕上缺少明确的标签、它们与图片的相似性以及形状的多样性,注释对象图标(如菜单、搜索和向后箭头)尤其具有挑战性。现有的研究或者使用视图层次或者基于像素的方法来处理这个任务。基于像素的方法更受欢迎,因为移动平台上的视图层次结构特征通常不完整或不准确,但是它忽略了视图层次结构中的指导信息,例如资源ID或内容描述。我们提出了一种新的基于深度学习的多模式方法,它结合了像素和视图层次结构的优点,并利用了最先进的对象检测技术。为了展示所提供的效用,我们通过手动注释Rico中最常用的29个图标来创建一个高质量的UI数据集,Rico是一个由72k UI屏幕截图组成的大型移动设计数据集。实验结果表明了该方法的有效性。该模型不仅优于广泛使用的目标分类基线,而且优于基于像素的目标检测模型。我们的研究揭示了如何结合视图层次结构和像素特征来注释UI元素。 摘要:Annotating user interfaces (UIs) that involves localization and classification of meaningful UI elements on a screen is a critical step for many mobile applications such as screen readers and voice control of devices. Annotating object icons, such as menu, search, and arrow backward, is especially challenging due to the lack of explicit labels on screens, their similarity to pictures, and their diverse shapes. Existing studies either use view hierarchy or pixel based methods to tackle the task. Pixel based approaches are more popular as view hierarchy features on mobile platforms are often incomplete or inaccurate, however it leaves out instructional information in the view hierarchy such as resource-ids or content descriptions. We propose a novel deep learning based multi-modal approach that combines the benefits of both pixel and view hierarchy features as well as leverages the state-of-the-art object detection techniques. In order to demonstrate the utility provided, we create a high quality UI dataset by manually annotating the most commonly used 29 icons in Rico, a large scale mobile design dataset consisting of 72k UI screenshots. The experimental results indicate the effectiveness of our multi-modal approach. Our model not only outperforms a widely used object classification baseline but also pixel based object detection models. Our study sheds light on how to combine view hierarchy with pixel features for annotating UI elements.

3D|3D重建等相关(1篇)

【1】 Prior-Guided Multi-View 3D Head Reconstruction 标题:先验引导的多视点三维头部重建

作者:Xueying Wang,Yudong Guo,Zhongqi Yang,Juyong Zhang 机构:The authors are with the School of Mathematical Sciences, University ofScience and Technology of China 链接:https://arxiv.org/abs/2107.04277 摘要:在计算机视觉和图形学中,恢复包含完整面部和头发区域的三维头部模型仍然是一个具有挑战性的问题。在本文中,我们考虑这个问题与一些多视图肖像图像作为输入。以往的基于优化策略和深度学习技术的多视点立体视觉方法都存在着头部结构不清晰、头发区域重建不准确等低频几何结构问题。为了解决这个问题,我们提出了一种先验引导的隐式神经绘制网络。具体地说,我们用一个可学习的符号距离场(SDF)对头部几何模型进行建模,并在人脸先验知识、头部语义分割信息和二维头发方向图的指导下,通过隐式可微渲染器对其进行优化。利用这些先验知识可以提高重建精度和鲁棒性,从而得到高质量的三维头部模型。大量的烧蚀研究和与最新方法的比较表明,我们的方法可以在这些先验知识的指导下产生高保真的三维头部几何结构。 摘要:Recovering a 3D head model including the complete face and hair regions is still a challenging problem in computer vision and graphics. In this paper, we consider this problem with a few multi-view portrait images as input. Previous multi-view stereo methods, either based on the optimization strategies or deep learning techniques, suffer from low-frequency geometric structures such as unclear head structures and inaccurate reconstruction in hair regions. To tackle this problem, we propose a prior-guided implicit neural rendering network. Specifically, we model the head geometry with a learnable signed distance field (SDF) and optimize it via an implicit differentiable renderer with the guidance of some human head priors, including the facial prior knowledge, head semantic segmentation information and 2D hair orientation maps. The utilization of these priors can improve the reconstruction accuracy and robustness, leading to a high-quality integrated 3D head model. Extensive ablation studies and comparisons with state-of-the-art methods demonstrate that our method could produce high-fidelity 3D head geometries with the guidance of these priors.

其他神经网络|深度学习|模型|建模(6篇)

【1】 Interpretable Compositional Convolutional Neural Networks 标题:可解释的组合卷积神经网络

作者:Wen Shen,Zhihua Wei,Shikun Huang,Binbin Zhang,Jiaqi Fan,Ping Zhao,Quanshi Zhang 机构:Tongji University, Shanghai, China, Shanghai Jiao Tong University, Shanghai, China 备注:IJCAI2021 链接:https://arxiv.org/abs/2107.04474 摘要:语义可解释性的合理定义是可解释人工智能的核心挑战。提出了一种将传统的卷积神经网络(CNN)修改为可解释的复合CNN的方法,以学习在中间卷积层中编码有意义的视觉模式的滤波器。在合成CNN中,每个滤波器都应该一致地表示具有明确含义的特定合成对象部分或图像区域。合成CNN从图像标签中学习分类,而不需要对零件或区域进行任何注释以进行监督。该方法可广泛应用于不同类型的cnn。实验证明了该方法的有效性。 摘要:The reasonable definition of semantic interpretability presents the core challenge in explainable AI. This paper proposes a method to modify a traditional convolutional neural network (CNN) into an interpretable compositional CNN, in order to learn filters that encode meaningful visual patterns in intermediate convolutional layers. In a compositional CNN, each filter is supposed to consistently represent a specific compositional object part or image region with a clear meaning. The compositional CNN learns from image labels for classification without any annotations of parts or regions for supervision. Our method can be broadly applied to different types of CNNs. Experiments have demonstrated the effectiveness of our method.

【2】 Understanding the Distributions of Aggregation Layers in Deep Neural Networks 标题:理解深度神经网络中聚合层的分布

作者:Eng-Jon Ong,Sameed Husain,Miroslaw Bober 机构: University of Surrey 链接:https://arxiv.org/abs/2107.04458 摘要:聚合过程在几乎所有的深网模型中都是普遍存在的。它是一种重要的机制,用于将深层特征整合为更紧凑的表示,同时提高对过度拟合的鲁棒性,并在深层网络中提供空间不变性。特别地,全局聚合层与DNNs的输出层的接近性意味着聚合特征对深网的性能有直接的影响。利用信息论方法可以更好地理解这种关系。然而,这需要了解聚合层激活的分布。为了实现这一点,我们提出了一种新的数学公式,用于分析建模涉及深度特征聚合的层的输出值的概率分布。一个重要的结果是我们的能力,分析预测KL发散的输出节点在DNN。我们还通过实验验证了我们的理论预测与一系列不同的分类任务和数据集的经验观测。 摘要:The process of aggregation is ubiquitous in almost all deep nets models. It functions as an important mechanism for consolidating deep features into a more compact representation, whilst increasing robustness to overfitting and providing spatial invariance in deep nets. In particular, the proximity of global aggregation layers to the output layers of DNNs mean that aggregated features have a direct influence on the performance of a deep net. A better understanding of this relationship can be obtained using information theoretic methods. However, this requires the knowledge of the distributions of the activations of aggregation layers. To achieve this, we propose a novel mathematical formulation for analytically modelling the probability distributions of output values of layers involved with deep feature aggregation. An important outcome is our ability to analytically predict the KL-divergence of output nodes in a DNN. We also experimentally verify our theoretical predictions against empirical observations across a range of different classification tasks and datasets.
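作为一个直观的补充:若按中心极限定理式的近似,把全局平均池化后的聚合输出看成高斯分布(这一近似与下面的维度均为示例假设,并非论文的完整解析建模),那么两个输出节点分布之间的 KL 散度就有闭式表达,可直接用经验均值和方差估计。

```python
import numpy as np

def kl_gaussian(mu1, var1, mu2, var2):
    # 一维高斯 N(mu1, var1) 相对 N(mu2, var2) 的 KL 散度闭式解
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

feats = np.random.rand(10000, 256)   # 假想的聚合前特征 (样本数, 被聚合的维度)
agg = feats.mean(axis=1)             # 全局平均聚合, 输出近似服从高斯分布
print(kl_gaussian(agg.mean(), agg.var(), 0.5, 0.01))
```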

【3】 Hoechst Is All You Need: Lymphocyte Classification with Deep Learning 标题:Hoechst是您所需要的一切:基于深度学习的淋巴细胞分类

作者:Jessica Cooper,In Hwa Um,Ognjen Arandjelović,David J Harrison 机构:University of St Andrews 备注:15 pages, 4 figures 链接:https://arxiv.org/abs/2107.04388 摘要:多重免疫荧光和免疫组织化学使癌症病理学家能够识别细胞表面表达的几种蛋白质,从而使细胞分类、更好地了解肿瘤微环境、更准确的诊断、预后,并根据个别病人的免疫状况进行量身定制的免疫治疗。然而,他们是昂贵和耗时的过程,需要复杂的染色和成像技术的专家技术人员。Hoechst染色更便宜,也更容易进行,但在这种情况下通常不使用,因为它与DNA结合,而不是与免疫荧光技术靶向的蛋白质结合,而且以前认为仅基于DNA形态来分化表达这些蛋白质的细胞是不可能的。在这项工作中,我们展示了另一种方法,即训练一个深度卷积神经网络来识别表达三种蛋白质(T淋巴细胞标记CD3和CD8,以及B淋巴细胞标记CD20)的细胞,其精确度和召回率超过90%,仅来自Hoechst 33342染色的组织。我们的模型学习了以前未知的与这些蛋白表达相关的形态学特征,这些特征可用于准确区分淋巴细胞亚型,用于评估免疫细胞浸润等关键预后指标,从而预测和改善患者的预后,而无需昂贵的多重免疫荧光。 摘要:Multiplex immunofluorescence and immunohistochemistry benefit patients by allowing cancer pathologists to identify several proteins expressed on the surface of cells, enabling cell classification, better understanding of the tumour micro-environment, more accurate diagnoses, prognoses, and tailored immunotherapy based on the immune status of individual patients. However, they are expensive and time consuming processes which require complex staining and imaging techniques by expert technicians. Hoechst staining is much cheaper and easier to perform, but is not typically used in this case as it binds to DNA rather than to the proteins targeted by immunofluorescent techniques, and it was not previously thought possible to differentiate cells expressing these proteins based only on DNA morphology. In this work we show otherwise, training a deep convolutional neural network to identify cells expressing three proteins (T lymphocyte markers CD3 and CD8, and the B lymphocyte marker CD20) with greater than 90% precision and recall, from Hoechst 33342 stained tissue only. Our model learns previously unknown morphological features associated with expression of these proteins which can be used to accurately differentiate lymphocyte subtypes for use in key prognostic metrics such as assessment of immune cell infiltration,and thereby predict and improve patient outcomes without the need for costly multiplex immunofluorescence.

【4】 Joint Matrix Decomposition for Deep Convolutional Neural Networks Compression 标题:用于深卷积神经网络压缩的联合矩阵分解

作者:Shaowu Chen,Jihao Zhou,Weize Sun,Lei Huang 机构: Weize Sun and Lei Huang are with the Guang-dong Key Laboratory of Intelligent Information Processing, Shenzhen University 备注:Code is publicly available on GitHub: this https URL 链接:https://arxiv.org/abs/2107.04386 摘要:具有大量参数的深度卷积神经网络(CNNs)需要大量的计算资源,这限制了CNNs在资源受限设备上的应用。因此,近年来,基于分解的方法被用于压缩cnn。然而,由于压缩因子与性能呈负相关,最先进的作品要么性能严重下降,要么压缩因子很低。为了克服这些问题,不同于以往单独压缩层的工作,我们提出了通过联合矩阵分解来压缩cnn,缓解性能下降的问题。该思想的灵感来源于CNNs中存在大量的重复模块,通过将具有相同结构的权值投影到同一子空间中,可以进一步压缩甚至加速网络。提出了三种基于奇异值分解的联合矩阵分解方案,并给出了相应的优化方法。在三个具有挑战性的紧凑CNN和三个基准数据集上进行了大量的实验,证明了我们提出的算法的优越性能。结果表明,我们的方法可以将ResNet-34的大小压缩22倍,与现有的几种方法相比,精度下降较小。 摘要:Deep convolutional neural networks (CNNs) with a large number of parameters requires huge computational resources, which has limited the application of CNNs on resources constrained appliances. Decomposition-based methods, therefore, have been utilized to compress CNNs in recent years. However, since the compression factor and performance are negatively correlated, the state-of-the-art works either suffer from severe performance degradation or have limited low compression factors. To overcome these problems, unlike previous works compressing layers separately, we propose to compress CNNs and alleviate performance degradation via joint matrix decomposition. The idea is inspired by the fact that there are lots of repeated modules in CNNs, and by projecting weights with the same structures into the same subspace, networks can be further compressed and even accelerated. In particular, three joint matrix decomposition schemes are developed, and the corresponding optimization approaches based on Singular Values Decomposition are proposed. Extensive experiments are conducted across three challenging compact CNNs and 3 benchmark data sets to demonstrate the superior performance of our proposed algorithms. As a result, our methods can compress the size of ResNet-34 by 22x with slighter accuracy degradation compared with several state-of-the-art methods.
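摘要的核心是"把结构相同的多层权重投影到同一子空间"。下面用 NumPy 给出一个最小示意:把同形状的权重矩阵水平拼接后做截断 SVD,以共享的左奇异向量作为公共基,每层只保留一个小因子矩阵。变量名与截断方式均为示例假设,并非论文中三种联合分解方案的原样实现。

```python
import numpy as np

def joint_svd_compress(weights, rank):
    # weights: 若干个形状同为 (m, n) 的权重矩阵; 返回共享基和每层的小因子
    W = np.concatenate(weights, axis=1)              # (m, n*L)
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank]                                # 共享左子空间基 (m, rank)
    factors = [U_r.T @ Wi for Wi in weights]         # 每层因子 (rank, n)
    return U_r, factors                              # 重构: Wi ≈ U_r @ factors[i]

Ws = [np.random.randn(256, 256) for _ in range(4)]
U_r, Fs = joint_svd_compress(Ws, rank=32)
print(np.linalg.norm(Ws[0] - U_r @ Fs[0]) / np.linalg.norm(Ws[0]))  # 相对重构误差
```

存储量由 L·m·n 降为 m·r + L·r·n,这正是共享子空间相对逐层单独分解能多压缩的部分。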

【5】 Activated Gradients for Deep Neural Networks 标题:深度神经网络的激活梯度

作者:Mei Liu,Liangming Chen,Xiaohao Du,Long Jin,Mingsheng Shang 机构: Jin are with the School of InformationScience and Engineering, Lanzhou University 链接:https://arxiv.org/abs/2107.04228 摘要:由于病态问题、消失/爆炸梯度问题和鞍点问题,深层神经网络往往性能不佳甚至训练失败。本文提出了一种利用梯度激活函数(GAF)来处理这些问题的新方法。直观地说,GAF放大了微小梯度,限制了大梯度。本文从理论上给出了GAF需要满足的条件,并在此基础上证明了GAF缓解了上述问题。此外,本文还证明了在一定的假设条件下,带GAF的SGD的收敛速度比不带GAF的SGD快。此外,在CIFAR、ImageNet和PASCAL视觉对象类上的实验证实了GAF的有效性。实验结果还表明,该方法可以应用于各种深度神经网络中,提高其性能。源代码在https://github.com/LongJin-lab/Activated-Gradients-for-Deep-Neural-Networks. 摘要:Deep neural networks often suffer from poor performance or even training failure due to the ill-conditioned problem, the vanishing/exploding gradient problem, and the saddle point problem. In this paper, a novel method by acting the gradient activation function (GAF) on the gradient is proposed to handle these challenges. Intuitively, the GAF enlarges the tiny gradients and restricts the large gradient. Theoretically, this paper gives conditions that the GAF needs to meet, and on this basis, proves that the GAF alleviates the problems mentioned above. In addition, this paper proves that the convergence rate of SGD with the GAF is faster than that without the GAF under some assumptions. Furthermore, experiments on CIFAR, ImageNet, and PASCAL visual object classes confirm the GAF's effectiveness. The experimental results also demonstrate that the proposed method is able to be adopted in various deep neural networks to improve their performance. The source code is publicly available at https://github.com/LongJin-lab/Activated-Gradients-for-Deep-Neural-Networks.
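"放大微小梯度、限制过大梯度"的做法只需在反向传播之后、参数更新之前对梯度做一次逐元素变换即可接入普通训练循环。下面用缩放的 tanh 作为满足该性质的一个示例函数,函数形式与超参数均为本示例的假设,具体 GAF 请以论文及其开源代码为准。

```python
import torch

def train_step_with_gaf(model, optimizer, loss, alpha=2.0, beta=3.0):
    # 对每个参数梯度 g 施加 g' = alpha * tanh(beta * g):
    # 零点附近近似放大 alpha*beta 倍, 过大梯度被限制在 ±alpha 以内
    optimizer.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.copy_(alpha * torch.tanh(beta * p.grad))
    optimizer.step()
```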

【6】 A Deep Discontinuity-Preserving Image Registration Network 标题:一种保持不连续性的深度图像配准网络

作者:Xiang Chen,Nishant Ravikumar,Yan Xia,Alejandro F Frangi 机构: Center for Computational Imaging and Simulation Technologies in Biomedicine, School of Computing, University of Leeds, Leeds,UK, Biomedical Imaging Department, Leeds Institute for Cardiovascular and Metabolic 备注:Provisional accepted in MICCAI 2021 链接:https://arxiv.org/abs/2107.04440 摘要:图像配准的目的是建立图像对或图像组之间的空间对应关系,是医学图像计算和计算机辅助干预的基础。目前,大多数基于深度学习的配准方法都假设所需的变形场是全局平滑和连续的,这在实际场景中并不总是有效的,特别是在医学图像配准(如心脏成像和腹部成像)中。这样的全局约束会导致不连续组织界面的伪影和错误增加。为了解决这一问题,我们提出了一种弱监督的深度不连续保持图像配准网络(DDIR),以获得更好的配准性能和逼真的变形场。在英国生物银行成像研究(UKBB)的心脏磁共振(MR)图像的配准实验中,我们证明了我们的方法在配准精度和预测更真实的变形方面比最先进的方法有显著的提高。 摘要:Image registration aims to establish spatial correspondence across pairs, or groups of images, and is a cornerstone of medical image computing and computer-assisted-interventions. Currently, most deep learning-based registration methods assume that the desired deformation fields are globally smooth and continuous, which is not always valid for real-world scenarios, especially in medical image registration (e.g. cardiac imaging and abdominal imaging). Such a global constraint can lead to artefacts and increased errors at discontinuous tissue interfaces. To tackle this issue, we propose a weakly-supervised Deep Discontinuity-preserving Image Registration network (DDIR), to obtain better registration performance and realistic deformation fields. We demonstrate that our method achieves significant improvements in registration accuracy and predicts more realistic deformations, in registration experiments on cardiac magnetic resonance (MR) images from UK Biobank Imaging Study (UKBB), than state-of-the-art approaches.

其他(8篇)

【1】 ANCER: Anisotropic Certification via Sample-wise Volume Maximization 标题:ANCER:基于逐样本体积最大化的各向异性认证

作者:Francisco Eiras,Motasem Alfarra,M. Pawan Kumar,Philip H. S. Torr,Puneet K. Dokania,Bernard Ghanem,Adel Bibi 机构: University of Oxford, United Kingdom, KAUST, Saudi Arabia, Five AI Limited, United Kingdom 备注:First two authors and the last one contributed equally to this work 链接:https://arxiv.org/abs/2107.04570 摘要:随机平滑是近年来发展起来的一种有效的深度神经网络分类器规模验证工具。关于随机平滑的所有现有技术都集中于各向同性$\ell_p$认证,其优点是产生可以通过$\ell_p$-范数半径在各向同性方法之间容易比较的证书。然而,各向同性认证限制了在最坏情况下对手的输入周围可以认证的区域,即它不能推理其他“接近的”、潜在的大的、恒定的预测安全区域。为了缓解这个问题,(i)在简化分析之后,我们从理论上把各向同性随机平滑$\ell_1$和$\ell_2$证书扩展到它们的广义各向异性证书。此外,(ii)我们提出了评估指标,允许比较一般证书(如果证书证明了超集区域,则证书优于另一证书),并通过认证区域的体积对每个证书进行量化。我们介绍了ANCER,一个通过体积最大化获得给定测试集样本的各向异性证书的实用框架。我们的实证结果表明,ANCER在CIFAR-10和ImageNet上都达到了最先进的$\ell_1$和$\ell_2$认证精度,同时在体积方面认证了更大的区域,从而突出了远离各向同性分析的好处。我们实验中使用的代码在https://github.com/MotasemAlfarra/ANCER. 摘要:Randomized smoothing has recently emerged as an effective tool that enables certification of deep neural network classifiers at scale. All prior art on randomized smoothing has focused on isotropic $\ell_p$ certification, which has the advantage of yielding certificates that can be easily compared among isotropic methods via $\ell_p$-norm radius. However, isotropic certification limits the region that can be certified around an input to worst-case adversaries, i.e. it cannot reason about other "close", potentially large, constant prediction safe regions. To alleviate this issue, (i) we theoretically extend the isotropic randomized smoothing $\ell_1$ and $\ell_2$ certificates to their generalized anisotropic counterparts following a simplified analysis. Moreover, (ii) we propose evaluation metrics allowing for the comparison of general certificates - a certificate is superior to another if it certifies a superset region - with the quantification of each certificate through the volume of the certified region. We introduce ANCER, a practical framework for obtaining anisotropic certificates for a given test set sample via volume maximization. Our empirical results demonstrate that ANCER achieves state-of-the-art $\ell_1$ and $\ell_2$ certified accuracy on both CIFAR-10 and ImageNet at multiple radii, while certifying substantially larger regions in terms of volume, thus highlighting the benefits of moving away from isotropic analysis. Code used in our experiments is available in https://github.com/MotasemAlfarra/ANCER.
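为说明"各向同性半径"与"各向异性认证区域"的区别,下面给出随机平滑认证的一个蒙特卡洛简化示意(省略了置信下界等必要的统计修正,基分类器 f 的接口为假设)。把标量 sigma 换成逐维向量,认证区域即从球变为轴对齐椭球,ANCER 所做的正是逐样本最大化这一椭球的体积。

```python
import numpy as np
from scipy.stats import norm

def certify_l2(f, x, sigma, n=1000, num_classes=10):
    # f: 基分类器, 输入样本返回类别编号; 返回平滑分类器的预测与 L2 认证半径(简化版)
    noise = np.random.randn(n, *x.shape) * sigma
    preds = np.array([f(x + e) for e in noise])
    counts = np.bincount(preds, minlength=num_classes)
    top = int(counts.argmax())
    pA = counts[top] / n
    radius = float(sigma * norm.ppf(pA)) if pA > 0.5 else 0.0
    return top, radius

# 各向异性情形: sigma 换成与 x 同形状的逐维向量, 认证区域为椭球,
# 可用其体积(正比于各维 sigma 之积)来比较不同证书。
```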

【2】 Hacking VMAF and VMAF NEG: metrics vulnerability to different preprocessing 标题:攻破VMAF与VMAF NEG:质量指标对不同预处理的脆弱性

作者:Maksim Siniukov,Anastasia Antsiferova,Dmitriy Kulikov,Dmitriy Vatolin 链接:https://arxiv.org/abs/2107.04510 摘要:视频质量测量在视频处理应用的发展中起着至关重要的作用。在本文中,我们展示了如何流行的质量指标VMAF及其抗调谐版本VMAF NEG可以通过视频预处理人为地增加。我们提出了一个用于调整处理算法参数的管道,允许将VMAF提高218.8%。对预处理视频的主观比较表明,大多数方法的视频质量下降或保持不变。结果表明,通过一些预处理方法,VMAF-NEG评分也可以提高23.6%。 摘要:Video quality measurement plays a critical role in the development of video processing applications. In this paper, we show how popular quality metrics VMAF and its tuning-resistant version VMAF NEG can be artificially increased by video preprocessing. We propose a pipeline for tuning parameters of processing algorithms that allows increasing VMAF by up to 218.8%. A subjective comparison of preprocessed videos showed that with the majority of methods visual quality drops down or stays unchanged. We show that VMAF NEG scores can also be increased by some preprocessing methods by up to 23.6%.

【3】 MutualEyeContact: A conversation analysis tool with focus on eye contact 标题:MutualEyeContact:一种关注眼神交流的对话分析工具

作者:Alexander Schäfer,Tomoko Isomura,Gerd Reis,Katsumi Watanabe,Didier Stricker 链接:https://arxiv.org/abs/2107.04476 摘要:个人之间的眼神交流对于理解人类行为尤为重要。为了进一步研究眼神交流在社会交往中的重要性,便携式眼球跟踪技术似乎是一种自然的选择。然而,对可用数据的分析可能变得相当复杂。科学家需要快速准确地计算数据。此外,必须自动分离相关数据以节省时间。在这项工作中,我们提出了一种称为MutualEyeContact的工具,它在这些任务中表现出色,可以帮助科学家理解(相互)眼神交流在社会交往中的重要性。我们将最先进的眼动跟踪技术与基于机器学习的人脸识别技术相结合,为社交会话的分析和可视化提供了一个工具。这项工作由计算机科学家和认知科学家合作完成。它将社会和行为科学领域与计算机视觉和深度学习相结合。 摘要:Eye contact between individuals is particularly important for understanding human behaviour. To further investigate the importance of eye contact in social interactions, portable eye tracking technology seems to be a natural choice. However, the analysis of available data can become quite complex. Scientists need data that is calculated quickly and accurately. Additionally, the relevant data must be automatically separated to save time. In this work, we propose a tool called MutualEyeContact which excels in those tasks and can help scientists to understand the importance of (mutual) eye contact in social interactions. We combine state-of-the-art eye tracking with face recognition based on machine learning and provide a tool for analysis and visualization of social interaction sessions. This work is a joint collaboration of computer scientists and cognitive scientists. It combines the fields of social and behavioural science with computer vision and deep learning.

【4】 Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset 标题:野生模因:评估仇恨模因挑战数据集的泛化性

作者:Hannah Rose Kirk,Yennie Jun,Paulius Rauba,Gal Wachtel,Ruining Li,Xingjian Bai,Noah Broestl,Martin Doff-Sotta,Aleksandar Shtedritski,Yuki M. Asano 机构:Oxford Internet Institute,Dept. of Computer Science,Oxford Uehiro Centre for Practical Ethics, Dept. of Engineering Science, † Oxford Artificial Intelligence Society 备注:Accepted paper at ACL WOAH 2021 链接:https://arxiv.org/abs/2107.04313 摘要:可恨的模因对当前的机器学习系统提出了独特的挑战,因为它们的信息来源于文本和视觉形式。为此,Facebook发布了“可恨的模因挑战”(Hateful Memes Challenge),这是一个预提取文本标题的模因数据集,但目前尚不清楚这些合成示例是否推广到“野生模因”。在这篇论文中,我们从Pinterest收集仇恨和非仇恨的模因来评估在Facebook数据集上预先训练的模型的样本外表现。我们发现,野生模因在两个关键方面有所不同:1)字幕必须通过OCR提取,注入噪声和多模态模型的递减性能;2)模因比“传统模因”更加多样化,包括在普通背景下的对话或文本截图。因此,本文作为一个现实的检查,目前的基准仇恨模因检测及其适用性检测现实世界的仇恨。 摘要:Hateful memes pose a unique challenge for current machine learning systems because their message is derived from both text- and visual-modalities. To this effect, Facebook released the Hateful Memes Challenge, a dataset of memes with pre-extracted text captions, but it is unclear whether these synthetic examples generalize to `memes in the wild'. In this paper, we collect hateful and non-hateful memes from Pinterest to evaluate out-of-sample performance on models pre-trained on the Facebook dataset. We find that memes in the wild differ in two key aspects: 1) Captions must be extracted via OCR, injecting noise and diminishing performance of multimodal models, and 2) Memes are more diverse than `traditional memes', including screenshots of conversations or text on a plain background. This paper thus serves as a reality check for the current benchmark of hateful meme detection and its applicability for detecting real world hate.

【5】 Beyond Farthest Point Sampling in Point-Wise Analysis 标题:逐点分析中超越最远点采样

作者:Yiqun Lin,Lichang Chen,Haibin Huang,Chongyang Ma,Xiaoguang Han,Shuguang Cui 备注:12 pages, 13 figures and 13 tables 链接:https://arxiv.org/abs/2107.04291 摘要:采样、分组和聚集是点云多尺度分析的三个重要组成部分。在本文中,我们提出了一种新的数据驱动的采样学习策略的逐点分析任务。与广泛使用的采样技术,最远点采样(FPS)不同,我们建议将采样和下游应用结合起来学习。我们的主要观点是,对于不同的任务,像FPS这样的均匀采样方法并不总是最优的:在边界区域周围采样更多的点可以使逐点分类更容易分割。最后,我们提出了一种新的取样器学习策略,在任务相关的地面真实信息的监督下学习采样点的位移,并且可以与底层任务联合训练。我们进一步在不同的逐点分析架构中演示了我们的方法,包括语义部分分割、点云完成和关键点检测。实验结果表明,与以往的基线方法相比,采样器和任务的联合学习带来了显著的改进。 摘要:Sampling, grouping, and aggregation are three important components in the multi-scale analysis of point clouds. In this paper, we present a novel data-driven sampler learning strategy for point-wise analysis tasks. Unlike the widely used sampling technique, Farthest Point Sampling (FPS), we propose to learn sampling and downstream applications jointly. Our key insight is that uniform sampling methods like FPS are not always optimal for different tasks: sampling more points around boundary areas can make the point-wise classification easier for segmentation. Towards the end, we propose a novel sampler learning strategy that learns sampling point displacement supervised by task-related ground truth information and can be trained jointly with the underlying tasks. We further demonstrate our methods in various point-wise analysis architectures, including semantic part segmentation, point cloud completion, and keypoint detection. Our experiments show that jointly learning of the sampler and task brings remarkable improvement over previous baseline methods.
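作为参照,下面给出摘要中被用作对比基线的最远点采样(FPS)的 NumPy 实现:每一步选取与已选点集距离最远的点。论文提出的可学习采样器是在此类采样点之上由任务监督预测位移来"搬动"采样点,此处不作展示。

```python
import numpy as np

def farthest_point_sampling(points, m):
    # points: (N, 3) 点云坐标; 返回 m 个采样点的索引
    N = points.shape[0]
    selected = np.zeros(m, dtype=np.int64)
    dist = np.full(N, np.inf)
    selected[0] = np.random.randint(N)
    for i in range(1, m):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)            # 每个点到已选点集的最近距离
        selected[i] = int(dist.argmax())      # 取最远者
    return selected

idx = farthest_point_sampling(np.random.rand(1024, 3), 128)
```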

【6】 UrbanScene3D: A Large Scale Urban Scene Dataset and Simulator 标题:UrbanScene3D:一个大规模的城市场景数据集和模拟器

作者:Yilin Liu,Fuyou Xue,Hui Huang 机构:Visual Computing Research Center, Shenzhen University, China 备注:Project page: this https URL 链接:https://arxiv.org/abs/2107.04286 摘要:以不同方式感知环境的能力对机器人研究至关重要。这涉及到对二维和三维数据源的分析。我们提出了一个基于Unreal Engine 4和AirSim的大型城市场景数据集,它由不同尺度的人造场景和真实场景组成,称为UrbanScene3D。与以往单纯基于二维信息或人工三维CAD模型的工作不同,UrbanScene3D包含了紧凑的人工模型和由航空图像重建的详细的真实世界模型。每个建筑都是从整个场景模型中手工提取出来的,然后被指定一个唯一的标签,形成一个实例分割图。在UrbanScene3D中提供的带有实例分割标签的3D地面真实纹理模型允许用户获得他们想要的各种数据:实例分割图、任意分辨率的深度图、可见和不可见位置的3D点云/网格等,用户还可以模拟机器人(汽车/无人机),在提议的城市环境中测试各种自主任务。请参考我们的报纸和网站(https://vcc.tech/UrbanScene3D/)更多细节和应用。 摘要:The ability to perceive the environments in different ways is essential to robotic research. This involves the analysis of both 2D and 3D data sources. We present a large scale urban scene dataset associated with a handy simulator based on Unreal Engine 4 and AirSim, which consists of both man-made and real-world reconstruction scenes in different scales, referred to as UrbanScene3D. Unlike previous works that purely based on 2D information or man-made 3D CAD models, UrbanScene3D contains both compact man-made models and detailed real-world models reconstructed by aerial images. Each building has been manually extracted from the entire scene model and then has been assigned with a unique label, forming an instance segmentation map. The provided 3D ground-truth textured models with instance segmentation labels in UrbanScene3D allow users to obtain all kinds of data they would like to have: instance segmentation map, depth map in arbitrary resolution, 3D point cloud/mesh in both visible and invisible places, etc. In addition, with the help of AirSim, users can also simulate the robots (cars/drones)to test a variety of autonomous tasks in the proposed city environment. Please refer to our paper and website(https://vcc.tech/UrbanScene3D/) for further details and applications.

【7】 Unity Perception: Generate Synthetic Data for Computer Vision 标题:Unity Perception:为计算机视觉生成合成数据

作者:Steve Borkman,Adam Crespi,Saurav Dhakad,Sujoy Ganguly,Jonathan Hogins,You-Cyuan Jhang,Mohsen Kamalzadeh,Bowen Li,Steven Leal,Pete Parisi,Cesar Romero,Wesley Smith,Alex Thaman,Samuel Warren,Nupur Yadav 机构:Unity Technologies 链接:https://arxiv.org/abs/2107.04259 摘要:我们介绍了Unity-Perception包,该包旨在通过提供易于使用和高度可定制的工具集,简化和加速为计算机视觉任务生成合成数据集的过程。这个开源软件包扩展了Unity编辑器和引擎组件,为几个常见的计算机视觉任务生成完美的注释示例。此外,它还提供了一个可扩展的随机化框架,允许用户快速构造和配置随机化的模拟参数,以便在生成的数据集中引入变化。我们概述了所提供的工具及其工作原理,并通过训练二维目标检测模型来演示生成的合成数据集的价值。使用合成数据训练的模型比仅使用真实数据训练的模型具有更好的性能。 摘要:We introduce the Unity Perception package which aims to simplify and accelerate the process of generating synthetic datasets for computer vision tasks by offering an easy-to-use and highly customizable toolset. This open-source package extends the Unity Editor and engine components to generate perfectly annotated examples for several common computer vision tasks. Additionally, it offers an extensible Randomization framework that lets the user quickly construct and configure randomized simulation parameters in order to introduce variation into the generated datasets. We provide an overview of the provided tools and how they work, and demonstrate the value of the generated synthetic datasets by training a 2D object detection model. The model trained with mostly synthetic data outperforms the model trained using only real data.

【8】 EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Environments 标题:EasyCom:增强现实数据集,支持在嘈杂环境中轻松通信的算法

作者:Jacob Donley,Vladimir Tourbabin,Jung-Suk Lee,Mark Broyles,Hao Jiang,Jie Shen,Maja Pantic,Vamsi Krishna Ithapu,Ravish Mehra 机构:∗Facebook Reality Labs Research, USA., †Facebook AI Applied Research, UK. 备注:Dataset is available at: this https URL 链接:https://arxiv.org/abs/2107.04174 摘要:增强现实(AR)作为一个平台,有助于降低鸡尾酒会效应。未来的AR头戴式耳机可能会利用来自多种不同模式的传感器阵列的信息。训练和测试信号处理和机器学习算法的任务,如波束形成和语音增强需要高质量的代表性数据。据作者所知,截至出版之日,还没有包含以自我为中心的多通道音频和视频的数据集,这些音频和视频在嘈杂的环境中具有动态移动和对话。在这项工作中,我们描述、评估和发布了一个数据集,其中包含超过5小时的多模态数据,可用于训练和测试算法,以改进AR眼镜佩戴者的会话。我们提供了一个基线方法的语音清晰度、质量和信噪比改进结果,并显示了所有测试指标的改进。我们正在发布的数据集包含AR眼镜、以自我为中心的多通道麦克风阵列音频、宽视场RGB视频、语音源姿势、耳机麦克风音频、带注释的语音活动、语音转录、头部边界框、语音目标和源识别标签。我们已经创建并发布了这个数据集,以促进鸡尾酒会问题的多模式AR解决方案的研究。 摘要:Augmented Reality (AR) as a platform has the potential to facilitate the reduction of the cocktail party effect. Future AR headsets could potentially leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beam-forming and speech enhancement require high quality representative data. To the best of the author's knowledge, as of publication there are no available datasets that contain synchronized egocentric multi-channel audio and video with dynamic movement and conversations in a noisy environment. In this work, we describe, evaluate and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer. We provide speech intelligibility, quality and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics. The dataset we are releasing contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head bounding boxes, target of speech and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
