Computer Vision arXiv Daily Digest [6.25]

Posted 2021-07-02 17:41:59

Visit www.arxivdaily.com for daily digests with abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and posting features!

cs.CV: 63 papers today

Transformer (4 papers)

【1】 Video Swin Transformer

Authors: Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, Han Hu
Affiliations: Microsoft Research Asia; University of Science and Technology of China; Huazhong University of Science and Technology; Tsinghua University
Link: https://arxiv.org/abs/2106.13230
Abstract: The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2). The code and models will be made publicly available at https://github.com/SwinTransformer/Video-Swin-Transformer.
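The central idea, spatiotemporal locality, amounts to computing self-attention inside non-overlapping 3D windows rather than globally. Below is a minimal sketch of 3D window partitioning in PyTorch; the helper name and shapes are illustrative, not the authors' code (their official implementation is at the linked repository).

```python
import torch

def window_partition_3d(x, window_size):
    """Split video features into non-overlapping 3D windows.

    x: (B, D, H, W, C) features; window_size: (wD, wH, wW).
    Returns (num_windows * B, wD*wH*wW, C); self-attention is then
    computed independently within each window.
    """
    B, D, H, W, C = x.shape
    wD, wH, wW = window_size
    x = x.view(B, D // wD, wD, H // wH, wH, W // wW, wW, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wD * wH * wW, C)

# e.g. an 8-frame clip with a 56x56 feature map and (2, 7, 7) windows
x = torch.randn(2, 8, 56, 56, 96)
print(window_partition_3d(x, (2, 7, 7)).shape)  # torch.Size([512, 98, 96])
```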

【2】 Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers

Authors: Katelyn Morrison, Benjamin Gilby, Colton Lipchak, Adam Mattioli, Adriana Kovashka
Affiliations: Department of Computer Science
Note: Under review at the Uncertainty and Robustness in Deep Learning workshop at ICML 2021. The appendix is attached to the last page of the paper.
Link: https://arxiv.org/abs/2106.13122
Abstract: Recently, vision transformers and MLP-based models have been developed in order to address some of the prevalent weaknesses in convolutional neural networks. Due to the novelty of transformers being used in this domain along with the self-attention mechanism, it remains unclear to what degree these architectures are robust to corruptions. Despite some works proposing that data augmentation remains essential for a model to be robust against corruptions, we propose to explore the impact that the architecture has on corruption robustness. We find that vision transformer architectures are inherently more robust to corruptions than the ResNet-50 and MLP-Mixers. We also find that vision transformers with 5 times fewer parameters than a ResNet-50 have more shape bias. Our code is available to reproduce.

【3】 A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021

Authors: Ke-Han Lu, Bo-Han Fang, Kuan-Yu Chen
Affiliations: Department of CSIE, NTUST, Taiwan (the first two authors contributed equally)
Note: CVPR 2021 Workshop: Visual Question Answering (VQA) Challenge
Link: https://arxiv.org/abs/2106.13033
Abstract: In this paper, inspired by the successes of vision-language pre-trained models and the benefits from training with adversarial attacks, we present a novel transformer-based cross-modal fusion modeling by incorporating both notions for VQA challenge 2021. Specifically, the proposed model is on top of the architecture of the VinVL model [19], and the adversarial training strategy [4] is applied to make the model robust and generalized. Moreover, two implementation tricks are also used in our system to obtain better results. The experiments demonstrate that the novel framework can achieve 76.72% on the VQAv2 test-std set.

【4】 IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers

Authors: Bowen Pan, Yifan Jiang, Rameswar Panda, Zhangyang Wang, Rogerio Feris, Aude Oliva
Affiliations: MIT CSAIL; UT Austin; MIT-IBM Watson AI Lab
Link: https://arxiv.org/abs/2106.12620
Abstract: The self-attention-based model, transformer, is recently becoming the leading backbone in the field of computer vision. In spite of the impressive success made by transformers in a variety of vision tasks, it still suffers from heavy computation and intensive memory cost. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by observing a large amount of redundant computation, mainly spent on uncorrelated input patches, and then introduce an interpretable module to dynamically and gracefully drop these redundant patches. This novel framework is then extended to a hierarchical structure, where uncorrelated tokens at different stages are gradually removed, resulting in a considerable shrinkage of computational cost. We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4X speed-up for state-of-the-art models like DeiT and TimeSformer, by only sacrificing less than 0.7% accuracy. More importantly, contrary to other acceleration approaches, our method is inherently interpretable with substantial visual evidence, making vision transformer closer to a more human-understandable architecture while being lighter. We demonstrate that the interpretability that naturally emerged in our framework can outperform the raw attention learned by the original visual transformer, as well as those generated by off-the-shelf interpretation methods, with both qualitative and quantitative results. Project Page: http://people.csail.mit.edu/bpan/ia-red/.
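The redundancy reduction hinges on a lightweight module that scores patch tokens and drops the uninformative ones between stages. A hedged sketch of score-based token pruning follows; `TokenPruner`, its linear scorer, and the fixed `keep_ratio` are illustrative stand-ins for the paper's learned interpretable policy:

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Keep only the top-k highest-scoring patch tokens (class token kept)."""
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # predicts per-token "informativeness"
        self.keep_ratio = keep_ratio

    def forward(self, tokens):            # tokens: (B, 1 + N, C), index 0 = CLS
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.scorer(patches).squeeze(-1)           # (B, N)
        k = int(patches.shape[1] * self.keep_ratio)
        idx = scores.topk(k, dim=1).indices                 # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        kept = patches.gather(1, idx)                       # (B, k, C)
        return torch.cat([cls_tok, kept], dim=1)

pruner = TokenPruner(dim=384)
print(pruner(torch.randn(2, 197, 384)).shape)  # torch.Size([2, 138, 384])
```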

Detection (10 papers)

【1】 Depth Confidence-aware Camouflaged Object Detection

Authors: Jing Zhang, Yunqiu Lv, Mochu Xiang, Aixuan Li, Yuchao Dai, Yiran Zhong
Affiliations: Australian National University; Northwestern Polytechnical University
Note: 10 pages main content + 3 pages references. The first work on RGB-D camouflaged object detection (COD).
Link: https://arxiv.org/abs/2106.13217
Abstract: Camouflaged object detection (COD) aims to segment camouflaged objects hiding in the environment, which is challenging due to the similar appearance of camouflaged objects and their surroundings. Research in biology suggests that depth can provide useful object localization cues for camouflaged object discovery, as all the animals have 3D perception ability. However, the depth information has not been exploited for camouflaged object detection. To explore the contribution of depth for camouflage detection, we present a depth-guided camouflaged object detection network with pre-computed depth maps from existing monocular depth estimation methods. Due to the domain gap between the depth estimation dataset and our camouflaged object detection dataset, the generated depth may not be accurate enough to be directly used in our framework. We then introduce a depth quality assessment module to evaluate the quality of depth based on the model prediction from both the RGB COD branch and the RGB-D COD branch. During training, only high-quality depth is used to update the modal interaction module for multi-modal learning. During testing, our depth quality assessment module can effectively determine the contribution of depth and select the RGB branch or RGB-D branch for camouflage prediction. Extensive experiments on various camouflaged object detection datasets prove the effectiveness of our solution in exploring the depth information for camouflaged object detection. Our code and data are publicly available at: https://github.com/JingZhang617/RGBD-COD.

【2】 Differential Morph Face Detection using Discriminative Wavelet Sub-bands

Authors: Baaria Chaudhary, Poorya Aghdaie, Sobhan Soleymani, Jeremy Dawson, Nasser M. Nasrabadi
Affiliations: West Virginia University
Link: https://arxiv.org/abs/2106.13178
Abstract: Face recognition systems are extremely vulnerable to morphing attacks, in which a morphed facial reference image can be successfully verified as two or more distinct identities. In this paper, we propose a morph attack detection algorithm that leverages an undecimated 2D Discrete Wavelet Transform (DWT) for identifying morphed face images. The core of our framework is that artifacts resulting from the morphing process that are not discernible in the image domain can be more easily identified in the spatial frequency domain. A discriminative wavelet sub-band can accentuate the disparity between a real and a morphed image. To this end, multi-level DWT is applied to all images, yielding 48 mid and high-frequency sub-bands each. The entropy distributions for each sub-band are calculated separately for both bona fide and morph images. For some of the sub-bands, there is a marked difference between the entropy of the sub-band in a bona fide image and the identical sub-band's entropy in a morphed image. Consequently, we employ Kullback-Leibler Divergence (KLD) to exploit these differences and isolate the sub-bands that are the most discriminative. We measure how discriminative a sub-band is by its KLD value and the 22 sub-bands with the highest KLD values are chosen for network training. Then, we train a deep Siamese neural network using these 22 selected sub-bands for differential morph attack detection. We examine the efficacy of discriminative wavelet sub-bands for morph attack detection and show that a deep neural network trained on these sub-bands can accurately identify morph imagery.
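The sub-band selection step is concrete enough to prototype directly: take an undecimated 2D DWT, compute an entropy distribution per sub-band for the bona fide and morphed sets, and keep the sub-bands with the largest KL divergence. A rough sketch with PyWavelets and SciPy, where the wavelet, level count, and histogram binning are assumptions rather than the paper's exact settings:

```python
import numpy as np
import pywt
from scipy.stats import entropy

def subband_entropies(img, wavelet="db4", levels=2):
    """Entropy of each detail sub-band of an undecimated 2D DWT (SWT).

    img sides must be multiples of 2**levels.
    """
    ents = []
    for cA, (cH, cV, cD) in pywt.swt2(img, wavelet, level=levels):
        for band in (cH, cV, cD):
            hist, _ = np.histogram(band, bins=64, density=True)
            ents.append(entropy(hist + 1e-12))
    return np.array(ents)

def rank_subbands(bona_fide_imgs, morph_imgs, top_k=22):
    """Rank sub-bands by KL divergence between the two entropy distributions."""
    e_real = np.stack([subband_entropies(im) for im in bona_fide_imgs])
    e_fake = np.stack([subband_entropies(im) for im in morph_imgs])
    klds = []
    for j in range(e_real.shape[1]):
        p, edges = np.histogram(e_real[:, j], bins=16, density=True)
        q, _ = np.histogram(e_fake[:, j], bins=edges, density=True)
        klds.append(entropy(p + 1e-12, q + 1e-12))   # KL(p || q)
    return np.argsort(klds)[::-1][:top_k]            # most discriminative first
```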

【3】 Class agnostic moving target detection by color and location prediction of moving area

Authors: Zhuang He, Qi Li, Huajun Feng, Zhihai Xu
Affiliations: The State Key Laboratory of Modern Optical Instruments, Zhejiang University, Zheda Road, Hangzhou, China
Link: https://arxiv.org/abs/2106.12966
Abstract: Moving target detection plays an important role in computer vision. However, traditional algorithms such as frame difference and optical flow usually suffer from low accuracy or heavy computation. Recent algorithms such as deep learning-based convolutional neural networks have achieved high accuracy and real-time performance, but they usually need to know the classes of targets in advance, which limits the practical applications. Therefore, we proposed a model-free moving target detection algorithm. This algorithm extracts the moving area through the difference of image features. Then, the color and location probability map of the moving area will be calculated through maximum a posteriori probability. And the target probability map can be obtained through the dot multiply between the two maps. Finally, the optimal moving target area can be solved by stochastic gradient descent on the target probability map. Results show that the proposed algorithm achieves the highest accuracy compared with state-of-the-art algorithms, without needing to know the classes of targets. Furthermore, as the existing datasets are not suitable for moving target detection, we proposed a method for producing an evaluation dataset. Besides, we also proved the proposed algorithm can be used to assist target tracking.

【4】 Continual Novelty Detection

Authors: Rahaf Aljundi, Daniel Olmeda Reino, Nikolay Chumerin, Richard E. Turner
Affiliations: Toyota Motor Europe; University of Cambridge
Link: https://arxiv.org/abs/2106.12964
Abstract: Novelty Detection methods identify samples that are not representative of a model's training set thereby flagging misleading predictions and bringing a greater flexibility and transparency at deployment time. However, research in this area has only considered Novelty Detection in the offline setting. Recently, there has been a growing realization in the computer vision community that applications demand a more flexible framework - Continual Learning - where new batches of data representing new domains, new classes or new tasks become available at different points in time. In this setting, Novelty Detection becomes more important, interesting and challenging. This work identifies the crucial link between the two problems and investigates the Novelty Detection problem under the Continual Learning setting. We formulate the Continual Novelty Detection problem and present a benchmark, where we compare several Novelty Detection methods under different Continual Learning settings. We show that Continual Learning affects the behaviour of novelty detection algorithms, while novelty detection can pinpoint insights in the behaviour of a continual learner. We further propose baselines and discuss possible research directions. We believe that the coupling of the two problems is a promising direction to bring vision models into practice.

【5】 DCoM: A Deep Column Mapper for Semantic Data Type Detection

Authors: Subhadip Maji, Swapna Sourav Rout, Sudeep Choudhary
Affiliations: Optum Global Solutions, Bangalore, India
Note: 9 pages, 2 figures, 7 tables
Link: https://arxiv.org/abs/2106.12871
Abstract: Detection of semantic data types is a very crucial task in data science for automated data cleaning, schema matching, data discovery, semantic data type normalization and sensitive data identification. Existing methods include regular expression-based or dictionary lookup-based methods that are not robust to dirty as well as unseen data and are limited to a very small number of semantic data types to predict. Existing machine learning methods extract a large number of engineered features from data and build logistic regression, random forest or feedforward neural networks for this purpose. In this paper, we introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types where, instead of extracting a large number of features from the data, we feed the raw values of columns (or instances) to the model as texts. We train DCoM on 686,765 data columns extracted from the VizNet corpus with 78 different semantic data types. DCoM outperforms other contemporary results with a quite significant margin on the same dataset.

【6】 Detection of Deepfake Videos Using Long Distance Attention

Authors: Wei Lu, Lingyi Liu, Junwei Luo, Xianfeng Zhao, Yicong Zhou, Jiwu Huang
Affiliations: School of Cyber Security, University of Chinese Academy of Sciences
Link: https://arxiv.org/abs/2106.12832
Abstract: With the rapid progress of deepfake techniques in recent years, facial video forgery can generate highly deceptive video contents and bring severe security threats. And detection of such forgery videos is much more urgent and challenging. Most existing detection methods treat the problem as a vanilla binary classification problem. In this paper, the problem is treated as a special fine-grained classification problem since the differences between fake and real faces are very subtle. It is observed that most existing face forgery methods left some common artifacts in the spatial domain and time domain, including generative defects in the spatial domain and inter-frame inconsistencies in the time domain. And a spatial-temporal model is proposed which has two components for capturing spatial and temporal forgery traces in global perspective respectively. The two components are designed using a novel long distance attention mechanism. The one component of the spatial domain is used to capture artifacts in a single frame, and the other component of the time domain is used to capture artifacts in consecutive frames. They generate attention maps in the form of patches. The attention method has a broader vision which contributes to better assembling global information and extracting local statistic information. Finally, the attention maps are used to guide the network to focus on pivotal parts of the face, just like other fine-grained classification methods. The experimental results on different public datasets demonstrate that the proposed method achieves the state-of-the-art performance, and the proposed long distance attention method can effectively capture pivotal parts for face forgery.

【7】 Multi-Modal 3D Object Detection in Autonomous Driving: a Survey

Authors: Yingjie Wang, Qiuyu Mao, Hanqi Zhu, Yu Zhang, Jianmin Ji, Yanyong Zhang
Link: https://arxiv.org/abs/2106.12735
Abstract: In the past few years, we have witnessed rapid development of autonomous driving. However, achieving full autonomy remains a daunting task due to the complex and dynamic driving environment. As a result, self-driving cars are equipped with a suite of sensors to conduct robust and accurate environment perception. As the number and type of sensors keep increasing, combining them for better perception is becoming a natural trend. So far, there has been no in-depth review that focuses on multi-sensor fusion based perception. To bridge this gap and motivate future research, this survey devotes to review recent fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs. In this survey, we first introduce the background of popular sensors for autonomous cars, including their common data representations as well as object detection networks developed for each type of sensor data. Next, we discuss some popular datasets for multi-modal 3D object detection, with a special focus on the sensor data included in each dataset. Then we present in-depth reviews of recent multi-modal 3D detection networks by considering the following three aspects of the fusion: fusion location, fusion data representation, and fusion granularity. After a detailed review, we discuss open challenges and point out possible solutions. We hope that our detailed review can help researchers to embark investigations in the area of multi-modal 3D object detection.

【8】 All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection

Authors: Meng Cao, Can Zhang, Dongming Yang, Yuexian Zou
Affiliations: School of Electrical and Computer Engineering, Shenzhen Graduate School
Note: Accepted by T-CSVT
Link: https://arxiv.org/abs/2106.12720
Abstract: Arbitrary-shaped text detection is a challenging task since curved texts in the wild are of the complex geometric layouts. Existing mainstream methods follow the instance segmentation pipeline to obtain the text regions. However, arbitrary-shaped texts are difficult to be depicted through one single segmentation network because of the varying scales. In this paper, we propose a two-stage segmentation-based detector, termed as NASK (Need A Second looK), for arbitrary-shaped text detection. Compared to the traditional single-stage segmentation network, our NASK conducts the detection in a coarse-to-fine manner with the first stage segmentation spotting the rectangle text proposals and the second one retrieving compact representations. Specifically, NASK is composed of a Text Instance Segmentation (TIS) network (1st stage), a Geometry-aware Text RoI Alignment (GeoAlign) module, and a Fiducial pOint eXpression (FOX) module (2nd stage). Firstly, TIS extracts the augmented features with a novel Group Spatial and Channel Attention (GSCA) module and conducts instance segmentation to obtain rectangle proposals. Then, GeoAlign converts these rectangles into the fixed size and encodes RoI-wise feature representation. Finally, FOX disintegrates the text instance into several pivotal geometrical attributes to refine the detection results. Extensive experimental results on three public benchmarks including Total-Text, SCUT-CTW1500, and ICDAR 2015 verify that our NASK outperforms recent state-of-the-art methods.

【9】 Deep Fake Detection: Survey of Facial Manipulation Detection Solutions

Authors: Samay Pashine, Sagar Mandiya, Praveen Gupta, Rashid Sheikh
Affiliations: Dept. of Computer Science, AITR, Indore, India
Link: https://arxiv.org/abs/2106.12605
Abstract: Deep Learning as a field has been successfully used to solve a plethora of complex problems, the likes of which we could not have imagined a few decades back. But as many benefits as it brings, there are still ways in which it can be used to bring harm to our society. Deep fakes have been proven to be one such problem, and now more than ever, when any individual can create a fake image or video simply using an application on the smartphone, there need to be some countermeasures, with which we can detect if the image or video is a fake or real and dispose of the problem threatening the trustworthiness of online information. Although the Deep fakes created by neural networks, may seem to be as real as a real image or video, it still leaves behind spatial and temporal traces or signatures after moderation, these signatures while being invisible to a human eye can be detected with the help of a neural network trained to specialize in Deep fake detection. In this paper, we analyze several such state-of-the-art neural networks (MesoNet, ResNet-50, VGG-19, and Xception Net) and compare them against each other, to find an optimal solution for various scenarios like real-time deep fake detection to be deployed in online social media platforms where the classification should be made as fast as possible or for a small news agency where the classification need not be in real-time but requires utmost accuracy.

【10】 VinDr-SpineXR: A deep learning framework for spinal lesions detection and classification from radiographs

Authors: Hieu T. Nguyen, Hieu H. Pham, Nghia T. Nguyen, Ha Q. Nguyen, Thang Q. Huynh, Minh Dao, Van Vu
Affiliations: Medical Imaging Center, Vingroup Big Data Institute, Hanoi, Vietnam; School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam; College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam
Note: This is a preprint of our paper which was accepted for publication by the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2021)
Link: https://arxiv.org/abs/2106.12930
Abstract: Radiographs are used as the most important imaging tool for identifying spine anomalies in clinical practice. The evaluation of spinal bone lesions, however, is a challenging task for radiologists. This work aims at developing and evaluating a deep learning-based framework, named VinDr-SpineXR, for the classification and localization of abnormalities from spine X-rays. First, we build a large dataset, comprising 10,468 spine X-ray images from 5,000 studies, each of which is manually annotated by an experienced radiologist with bounding boxes around abnormal findings in 13 categories. Using this dataset, we then train a deep learning classifier to determine whether a spine scan is abnormal and a detector to localize 7 crucial findings amongst the total 13. The VinDr-SpineXR is evaluated on a test set of 2,078 images from 1,000 studies, which is kept separate from the training set. It demonstrates an area under the receiver operating characteristic curve (AUROC) of 88.61% (95% CI 87.19%, 90.02%) for the image-level classification task and a mean average precision (mAP@0.5) of 33.56% for the lesion-level localization task. These results serve as a proof of concept and set a baseline for future research in this direction. To encourage advances, the dataset, codes, and trained deep learning models are made publicly available.
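For orientation, the image-level AUROC metric quoted above is the standard scikit-learn computation over per-scan abnormality scores; the labels and scores below are made-up toy values, only to show the call:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# labels: 1 = abnormal spine scan, 0 = normal; scores: predicted P(abnormal)
y_true = np.array([0, 1, 1, 0, 1, 0])
y_score = np.array([0.20, 0.85, 0.60, 0.35, 0.90, 0.10])
print(f"AUROC: {roc_auc_score(y_true, y_score):.4f}")
```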

Classification & Recognition (9 papers)

【1】 Driver-centric Risk Object Identification

Authors: Chengxi Li, Stanley H. Chan, Yi-Ting Chen
Affiliations: National Yang Ming Chiao Tung University; National Chiao Tung University
Note: Submitted to TPAMI
Link: https://arxiv.org/abs/2106.13201
Abstract: A massive number of traffic fatalities are due to driver errors. To reduce fatalities, developing intelligent driving systems assisting drivers to identify potential risks is in urgent need. Risky situations are generally defined based on collision prediction in existing research. However, collisions are only one type of risk in traffic scenarios. We believe a more generic definition is required. In this work, we propose a novel driver-centric definition of risk, i.e., risky objects influence driver behavior. Based on this definition, a new task called risk object identification is introduced. We formulate the task as a cause-effect problem and present a novel two-stage risk object identification framework, taking inspiration from models of situation awareness and causal inference. A driver-centric Risk Object Identification (ROI) dataset is curated to evaluate the proposed system. We demonstrate state-of-the-art risk object identification performance compared with strong baselines on the ROI dataset. In addition, we conduct extensive ablative studies to justify our design choices.

【2】 VOLO: Vision Outlooker for Visual Recognition

Authors: Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan
Affiliations: Sea AI Lab; National University of Singapore
Note: code: this https URL
Link: https://arxiv.org/abs/2106.13112
Abstract: Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Though recently the prevailing vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification, their performance is still inferior to latest SOTA CNNs if no extra data are provided. In this work, we aim to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We found that the main factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention aims to efficiently encode finer-level features and contexts into tokens, which are shown to be critical for recognition performance but largely ignored by the self-attention. Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, being the first model exceeding 87% accuracy on this competitive benchmark, without using any extra training data. In addition, the pre-trained VOLO transfers well to downstream tasks, such as semantic segmentation. We achieve 84.3% mIoU score on the Cityscapes validation set and 54.3% on the ADE20K validation set. Code is available at https://github.com/sail-sg/volo.
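Outlook attention is unusual in that the weights over each K x K neighborhood come from a plain linear layer rather than query-key dot products, and the weighted neighborhoods are folded back onto the spatial grid. A simplified single-head, stride-1 sketch under assumed shapes (the class and its dimensions are illustrative; see the linked repository for the real multi-head version):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleOutlookAttention(nn.Module):
    """Single-head sketch of VOLO-style outlook attention (stride 1)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.scale = dim ** -0.5
        self.v = nn.Linear(dim, dim)
        # each location predicts a (k*k x k*k) attention over its neighborhood
        self.attn = nn.Linear(dim, kernel_size ** 4)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, H, W, C)
        B, H, W, C = x.shape
        k, pad = self.k, self.k // 2
        v = self.v(x).permute(0, 3, 1, 2)                       # (B, C, H, W)
        # gather the k*k neighborhood values for every location
        v = F.unfold(v, k, padding=pad)                         # (B, C*k*k, H*W)
        v = v.reshape(B, C, k * k, H * W).permute(0, 3, 2, 1)   # (B, HW, k*k, C)
        # attention weights come from a linear layer, not dot products
        a = self.attn(x).reshape(B, H * W, k * k, k * k) * self.scale
        a = a.softmax(dim=-1)
        out = a @ v                                             # (B, HW, k*k, C)
        # fold the weighted neighborhoods back onto the spatial grid
        out = out.permute(0, 3, 2, 1).reshape(B, C * k * k, H * W)
        out = F.fold(out, (H, W), k, padding=pad)               # (B, C, H, W)
        return self.proj(out.permute(0, 2, 3, 1))               # (B, H, W, C)

attn = SimpleOutlookAttention(192)
print(attn(torch.randn(2, 14, 14, 192)).shape)  # torch.Size([2, 14, 14, 192])
```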

【3】 High Performance Hyperspectral Image Classification using Graphics Processing Units

Authors: Mahmoud Hossam
Affiliations: Basic Science Department, Faculty of Computer and Information Sciences, Ain Shams University
Note: Master's Thesis, Ain Shams University
Link: https://arxiv.org/abs/2106.12942
Abstract: Real-time remote sensing applications like search and rescue missions, military target detection, environmental monitoring, hazard prevention and other time-critical applications require onboard real time processing capabilities or autonomous decision making. Some unmanned remote systems like satellites are physically remote from their operators, and all control of the spacecraft and data returned by the spacecraft must be transmitted over a wireless radio link. This link may not be available for extended periods when the satellite is out of line of sight of its ground station. Therefore, lightweight, small size and low power consumption hardware is essential for onboard real time processing systems. With increasing dimensionality, size and resolution of recent hyperspectral imaging sensors, additional challenges are posed upon remote sensing processing systems and more capable computing architectures are needed. Graphical Processing Units (GPUs) emerged as promising architecture for light weight high performance computing that can address these computational requirements for onboard systems. The goal of this study is to build high performance methods for onboard hyperspectral analysis. We propose accelerated methods for the well-known recursive hierarchical segmentation (RHSEG) clustering method, using GPUs, hybrid multicore CPU with a GPU and hybrid multi-core CPU/GPU clusters. RHSEG is a method developed by the National Aeronautics and Space Administration (NASA), which is designed to provide rich classification information with several output levels. The achieved speedups by parallel solutions compared to CPU sequential implementations are 21x for parallel single GPU and 240x for hybrid multi-node computer clusters with 16 computing nodes. The energy consumption is reduced to 74% using a single GPU compared to the equivalent parallel CPU cluster.

【4】 Long-term Cross Adversarial Training: A Robust Meta-learning Method for Few-shot Classification Tasks

Authors: Fan Liu, Shuyu Zhao, Xuelong Dai, Bin Xiao
Affiliations: Department of Computing, The Hong Kong Polytechnic University
Link: https://arxiv.org/abs/2106.12900
Abstract: Meta-learning model can quickly adapt to new tasks using few-shot labeled data. However, despite achieving good generalization on few-shot classification tasks, it is still challenging to improve the adversarial robustness of the meta-learning model in few-shot learning. Although adversarial training (AT) methods such as Adversarial Query (AQ) can improve the adversarially robust performance of meta-learning models, AT is still computationally expensive. On the other hand, meta-learning models trained with AT will drop significant accuracy on the original clean images. This paper proposed a meta-learning method on the adversarially robust neural network called Long-term Cross Adversarial Training (LCAT). LCAT will update meta-learning model parameters cross along the natural and adversarial sample distribution direction with long-term to improve both adversarial and clean few-shot classification accuracy. Due to cross-adversarial training, LCAT only needs half of the adversarial training epochs of AQ, resulting in a low adversarial training computation. Experiment results show that LCAT achieves superior performance both on the clean and adversarial few-shot classification accuracy than SOTA adversarial training methods for meta-learning models.
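The abstract does not spell out LCAT's exact update schedule, but the adversarial half of any such scheme needs adversarial samples. As a reference point, here is the standard one-step (FGSM) generation that adversarial-training loops like AQ alternate with clean batches; this is a generic sketch, not the paper's method:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=8 / 255):
    """One-step adversarial sample: move x along the sign of the loss gradient.

    Assumes x is an image batch in [0, 1] and model returns class logits.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()
```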

【5】 Frequency Domain Convolutional Neural Network: Accelerated CNN for Large Diabetic Retinopathy Image Classification

Authors: Ee Fey Goh, ZhiYuan Chen, Wei Xiang Lim
Affiliations: University of Nottingham Malaysia, School of Computer Science, Jln Broga, Semenyih, Selangor
Note: This paper has been submitted to Neurocomputing
Link: https://arxiv.org/abs/2106.12736
Abstract: The conventional spatial convolution layers in the Convolutional Neural Networks (CNNs) are computationally expensive at the point where the training time could take days unless the number of layers, the number of training images or the size of the training images are reduced. The image size of 256x256 pixels is commonly used for most of the applications of CNN, but this image size is too small for applications like Diabetic Retinopathy (DR) classification where the image details are important for accurate classification. This research proposed Frequency Domain Convolution (FDC) and Frequency Domain Pooling (FDP) layers which were built with RFFT, kernel initialization strategy, convolution artifact removal and Channel Independent Convolution (CIC) to replace the conventional convolution and pooling layers. The FDC and FDP layers are used to build a Frequency Domain Convolutional Neural Network (FDCNN) to accelerate the training of large images for DR classification. The Full FDC layer is an extension of the FDC layer to allow direct use in conventional CNNs, it is also used to modify the VGG16 architecture. FDCNN is shown to be at least 54.21% faster and 70.74% more memory efficient compared to an equivalent CNN architecture. The modified VGG16 architecture with Full FDC layer is reported to achieve a shorter training time and a higher accuracy at 95.63% compared to the original VGG16 architecture for DR classification.
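The speed-up rests on the convolution theorem: a convolution in the spatial domain becomes a pointwise product after an RFFT. A minimal sketch of frequency-domain, channel-independent convolution in PyTorch; note this realizes circular convolution, and the paper's kernel-initialization and artifact-removal details are omitted:

```python
import torch

def fft_conv2d(x, kernel):
    """Circular 2D convolution via the convolution theorem (rFFT).

    x:      (B, C, H, W) input batch
    kernel: (C, kH, kW) per-channel filters (channel-independent, as in CIC)
    """
    B, C, H, W = x.shape
    # zero-pad the kernel to the image size, then transform both
    kpad = torch.zeros(C, H, W, dtype=x.dtype, device=x.device)
    kpad[:, : kernel.shape[-2], : kernel.shape[-1]] = kernel
    Xf = torch.fft.rfft2(x)        # (B, C, H, W//2 + 1)
    Kf = torch.fft.rfft2(kpad)     # (C, H, W//2 + 1), broadcast over batch
    # pointwise multiplication in the frequency domain == circular convolution
    return torch.fft.irfft2(Xf * Kf, s=(H, W))

y = fft_conv2d(torch.randn(2, 3, 256, 256), torch.randn(3, 5, 5))
print(y.shape)  # torch.Size([2, 3, 256, 256])
```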

【6】 Feature Completion for Occluded Person Re-Identification

Authors: Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, Xilin Chen
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Link: https://arxiv.org/abs/2106.12733
Abstract: Person re-identification (reID) plays an important role in computer vision. However, existing methods suffer from performance degradation in occluded scenes. In this work, we propose an occlusion-robust block, Region Feature Completion (RFC), for occluded reID. Different from most previous works that discard the occluded regions, RFC block can recover the semantics of occluded regions in feature space. Firstly, a Spatial RFC (SRFC) module is developed. SRFC exploits the long-range spatial contexts from non-occluded regions to predict the features of occluded regions. The unit-wise prediction task leads to an encoder/decoder architecture, where the region-encoder models the correlation between non-occluded and occluded region, and the region-decoder utilizes the spatial correlation to recover occluded region features. Secondly, we introduce Temporal RFC (TRFC) module which captures the long-term temporal contexts to refine the prediction of SRFC. RFC block is lightweight, end-to-end trainable and can be easily plugged into existing CNNs to form RFCnet. Extensive experiments are conducted on occluded and commonly holistic reID benchmarks. Our method significantly outperforms existing methods on the occlusion datasets, while still achieving top, even superior, performance on holistic datasets. The source code is available at https://github.com/blue-blue272/OccludedReID-RFCnet.

【7】 What makes visual place recognition easy or hard?

Authors: Stefan Schubert, Peer Neubert
Affiliations: TU Chemnitz, Germany
Link: https://arxiv.org/abs/2106.12671
Abstract: Visual place recognition is a fundamental capability for the localization of mobile robots. It places image retrieval in the practical context of physical agents operating in a physical world. It is an active field of research and many different approaches have been proposed and evaluated in many different experiments. In the following, we argue that due to variations of this practical context and individual design decisions, place recognition experiments are barely comparable across different papers and that there is a variety of properties that can change from one experiment to another. We provide an extensive list of such properties and give examples how they can be used to setup a place recognition experiment easier or harder. This might be interesting for different involved parties: (1) people who just want to select a place recognition approach that is suitable for the properties of their particular task at hand, (2) researchers that look for open research questions and are interested in particularly difficult instances, (3) authors that want to create reproducible papers on this topic, and (4) also reviewers that have the task to identify potential problems in papers under review.

【8】 Human Activity Recognition using Continuous Wavelet Transform and Convolutional Neural Networks

Authors: Anna Nedorubova, Alena Kadyrova, Aleksey Khlyupin
Affiliations: Center for Engineering and Technology of MIPT, Moscow Institute of Physics and Technology, Institutskiy Pereulok, Dolgoprudny, Moscow, Russia
Keywords: Human activity recognition, convolutional neural network, residual neural networks
Link: https://arxiv.org/abs/2106.12666
Abstract: Quite a few people in the world have to stay under permanent surveillance for health reasons; they include diabetic people or people with some other chronic conditions, the elderly and the disabled. These groups may face heightened risk of having life-threatening falls or of being struck by a syncope. Due to limited availability of resources a substantial part of people at risk can not receive necessary monitoring and thus are exposed to excessive danger. Nowadays, this problem is usually solved via applying Human Activity Recognition (HAR) methods. HAR is a promising and fast-paced data science field, which has a wide range of application areas such as healthcare, sport, security etc. However, the current techniques of recognition are markedly lacking in accuracy, hence, the present paper suggests a highly accurate method for human activity classification. We propose a new workflow to address the HAR problem and evaluate it on the UniMiB SHAR dataset, which consists of the accelerometer signals. The model we suggest is based on continuous wavelet transform (CWT) and convolutional neural networks (CNNs). Wavelet transform localizes signal features both in time and frequency domains and after that a CNN extracts these features and recognizes activity. It is also worth noting that CWT converts 1D accelerometer signal into 2D images and thus enables to obtain better results as 2D networks have a significantly higher predictive capacity. In the course of the work we build a convolutional neural network and vary such model parameters as number of spatial axes, number of layers, number of neurons in each layer, image size, type of mother wavelet, the order of zero moment of mother wavelet etc. Besides, we also apply models with residual blocks which resulted in significantly higher metric values. Finally, we succeed to reach 99.26% accuracy and it is a worthy performance for this problem.
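The 1D-to-2D conversion step is easy to reproduce with PyWavelets: a CWT turns each accelerometer axis into a scales-by-time scalogram that a 2D CNN can consume. A small sketch, where the Morlet wavelet and 64 scales are assumed settings rather than necessarily the authors':

```python
import numpy as np
import pywt

def accel_to_scalogram(sig, num_scales=64, wavelet="morl"):
    """1D accelerometer signal -> 2D CWT magnitude image (scales x time)."""
    scales = np.arange(1, num_scales + 1)
    coef, _ = pywt.cwt(sig, scales, wavelet)   # (num_scales, len(sig))
    return np.abs(coef)

# stack the x/y/z axes as channels of one image for a 2D CNN
window = np.random.randn(3, 151)                        # a UniMiB-style window
img = np.stack([accel_to_scalogram(a) for a in window])  # (3, 64, 151)
print(img.shape)
```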

【9】 Handwritten Digit Recognition using Machine and Deep Learning Algorithms

Authors: Samay Pashine, Ritik Dixit, Rishika Kushwah
Affiliations: Computer Science and Engineering, Acropolis Institute of Technology & Research, Indore, India
Link: https://arxiv.org/abs/2106.12614
Abstract: The reliance of humans over machines has never been so high such that from object classification in photographs to adding sound to silent movies everything can be performed with the help of deep learning and machine learning algorithms. Likewise, handwritten text recognition is one of the significant areas of research and development with a streaming number of possibilities that could be attained. Handwriting recognition (HWR), also known as Handwritten Text Recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices [1]. Apparently, in this paper, we have performed handwritten digit recognition with the help of MNIST datasets using Support Vector Machines (SVM), Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN) models. Our main objective is to compare the accuracy of the models stated above along with their execution time to get the best possible model for digit recognition.
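A compact way to reproduce the SVM/MLP half of the comparison with scikit-learn is sketched below (the CNN would need a deep learning library and is omitted); the subsample size and hyperparameters are arbitrary choices for a quick run, not the paper's configuration:

```python
import time
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
# subsample so the RBF-SVM finishes quickly; use the full set for real numbers
X_tr, X_te, y_tr, y_te = train_test_split(
    X[:12000], y[:12000], test_size=2000, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("MLP", MLPClassifier(hidden_layer_sizes=(128,), max_iter=30))]:
    t0 = time.time()
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: accuracy={acc:.4f}, time={time.time() - t0:.1f}s")
```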

Segmentation & Semantics (3 papers)

【1】 AutoAdapt: Automated Segmentation Network Search for Unsupervised Domain Adaptation

Authors: Xueqing Deng, Yi Zhu, Yuxin Tian, Shawn Newsam
Note: Short version accepted at the 1st NAS workshop co-organized with CVPR 2021
Link: https://arxiv.org/abs/2106.13227
Abstract: Neural network-based semantic segmentation has achieved remarkable results when large amounts of annotated data are available, that is, in the supervised case. However, such data is expensive to collect and so methods have been developed to adapt models trained on related, often synthetic data for which labels are readily available. Current adaptation approaches do not consider the dependence of the generalization/transferability of these models on network architecture. In this paper, we perform neural architecture search (NAS) to provide architecture-level perspective and analysis for domain adaptation. We identify the optimization gap that exists when searching architectures for unsupervised domain adaptation which makes this NAS problem uniquely difficult. We propose bridging this gap by using maximum mean discrepancy and regional weighted entropy to estimate the accuracy metric. Experimental results on several widely adopted benchmarks show that our proposed AutoAdapt framework indeed discovers architectures that improve the performance of a number of existing adaptation techniques.
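One of the two proxy signals, maximum mean discrepancy between source and target feature batches, has a short closed form with an RBF kernel. A hedged sketch (biased estimator, fixed bandwidth; the paper's exact kernel choice is not stated in the abstract):

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased squared MMD between feature batches x: (n, d) and y: (m, d)."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2                 # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))    # RBF kernel
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

src, tgt = torch.randn(64, 256), torch.randn(64, 256) + 0.5
print(rbf_mmd2(src, tgt))   # larger value => larger domain gap
```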

【2】 Attention Toward Neighbors: A Context Aware Framework for High Resolution Image Segmentation

Authors: Fahim Faisal Niloy, M. Ashraful Amin, Amin Ahsan Ali, AKM Mahbubur Rahman
Affiliations: Agency Lab, CSE, Independent University, Bangladesh
Note: Accepted at ICIP 2021
Link: https://arxiv.org/abs/2106.12902
Abstract: High-resolution image segmentation remains challenging and error-prone due to the enormous size of intermediate feature maps. Conventional methods avoid this problem by using patch based approaches where each patch is segmented independently. However, independent patch segmentation induces errors, particularly at the patch boundary due to the lack of contextual information in very high-resolution images where the patch size is much smaller compared to the full image. To overcome these limitations, in this paper, we propose a novel framework to segment a particular patch by incorporating contextual information from its neighboring patches. This allows the segmentation network to see the target patch with a wider field of view without the need of larger feature maps. Comparative analysis from a number of experiments shows that our proposed framework is able to segment high resolution images with significantly improved mean Intersection over Union and overall accuracy.
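The idea reduces to a padded sliding window: run the network on a window that borrows context pixels from the neighboring patches, then keep only the center crop of the prediction. A sketch under the assumption that `model` returns full-resolution per-class logits; the names and sizes are illustrative, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def segment_with_context(model, image, patch=512, context=128):
    """Segment each patch of image (C, H, W) using a window that includes
    `context` pixels from neighbouring patches; keep the centre prediction."""
    _, H, W = image.shape
    out = torch.zeros(H, W, dtype=torch.long)
    pad = F.pad(image, (context,) * 4, mode="reflect")
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            win = pad[:, y : y + patch + 2 * context,
                         x : x + patch + 2 * context]
            pred = model(win.unsqueeze(0)).argmax(1)[0]   # (h, w) class map
            centre = pred[context : context + patch,
                          context : context + patch]
            out[y : y + patch, x : x + patch] = centre[: H - y, : W - x]
    return out
```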

【3】 Topological Semantic Mapping by Consolidation of Deep Visual Features

Authors: Ygor C. N. Sousa, Hansenclever F. Bassani
Affiliations: Universidade Federal de Pernambuco
Note: 8 pages, 3 figures
Link: https://arxiv.org/abs/2106.12709
Abstract: Many works in the recent literature introduce semantic mapping methods that use CNNs (Convolutional Neural Networks) to recognize semantic properties in images. The types of properties (eg.: room size, place category, and objects) and their classes (eg.: kitchen and bathroom, for place category) are usually predefined and restricted to a specific task. Thus, all the visual data acquired and processed during the construction of the maps are lost and only the recognized semantic properties remain on the maps. In contrast, this work introduces a topological semantic mapping method that uses deep visual features extracted by a CNN, the GoogLeNet, from 2D images captured in multiple views of the environment as the robot operates, to create consolidated representations of visual features acquired in the regions covered by each topological node. These consolidated representations allow flexible recognition of semantic properties of the regions and use in a range of visual tasks. The experiments, performed using a real-world indoor dataset, showed that the method is able to consolidate the visual features of regions and use them to recognize objects and place categories as semantic properties, and to indicate the topological location of images, with very promising results. The objects are classified using the classification layer of GoogLeNet, without retraining, and the place categories are recognized using a shallow Multilayer Perceptron.

Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (3 papers)

【1】 Unsupervised Learning of Depth and Depth-of-Field Effect from Natural Images with Aperture Rendering Generative Adversarial Networks

Authors: Takuhiro Kaneko
Affiliations: NTT Communication Science Laboratories, NTT Corporation
Note: Accepted to CVPR 2021 (Oral). Project page: this https URL
Link: https://arxiv.org/abs/2106.13041
Abstract: Understanding the 3D world from 2D projected natural images is a fundamental challenge in computer vision and graphics. Recently, an unsupervised learning approach has garnered considerable attention owing to its advantages in data collection. However, to mitigate training limitations, typical methods need to impose assumptions for viewpoint distribution (e.g., a dataset containing various viewpoint images) or object shape (e.g., symmetric objects). These assumptions often restrict applications; for instance, the application to non-rigid objects or images captured from similar viewpoints (e.g., flower or bird images) remains a challenge. To complement these approaches, we propose aperture rendering generative adversarial networks (AR-GANs), which equip aperture rendering on top of GANs, and adopt focus cues to learn the depth and depth-of-field (DoF) effect of unlabeled natural images. To address the ambiguities triggered by unsupervised setting (i.e., ambiguities between smooth texture and out-of-focus blurs, and between foreground and background blurs), we develop DoF mixture learning, which enables the generator to learn real image distribution while generating diverse DoF images. In addition, we devise a center focus prior to guiding the learning direction. In the experiments, we demonstrate the effectiveness of AR-GANs in various datasets, such as flower, bird, and face images, demonstrate their portability by incorporating them into other 3D representation learning GANs, and validate their applicability in shallow DoF rendering.

【2】 Self-Supervised Monocular Depth Estimation of Untextured Indoor Rotated Scenes

Authors: Benjamin Keltjens, Tom van Dijk, Guido de Croon
Affiliations: Delft University of Technology
Link: https://arxiv.org/abs/2106.12958
Abstract: Self-supervised deep learning methods have leveraged stereo images for training monocular depth estimation. Although these methods show strong results on outdoor datasets such as KITTI, they do not match performance of supervised methods on indoor environments with camera rotation. Indoor, rotated scenes are common for less constrained applications and pose problems for two reasons: abundance of low texture regions and increased complexity of depth cues for images under rotation. In an effort to extend self-supervised learning to more generalised environments we propose two additions. First, we propose a novel Filled Disparity Loss term that corrects for ambiguity of image reconstruction error loss in textureless regions. Specifically, we interpolate disparity in untextured regions, using the estimated disparity from surrounding textured areas, and use L1 loss to correct the original estimation. Our experiments show that depth estimation is substantially improved on low-texture scenes, without any loss on textured scenes, when compared to Monodepth by Godard et al. Secondly, we show that training with an application's representative rotations, in both pitch and roll, is sufficient to significantly improve performance over the entire range of expected rotation. We demonstrate that depth estimation is successfully generalised as performance is not lost when evaluated on test sets with no camera rotation. Together these developments enable a broader use of self-supervised learning of monocular depth estimation for complex environments.
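The Filled Disparity Loss can be approximated as: interpolate disparity over textureless pixels from the surrounding textured ones, then L1-penalize the original estimate against the filled version. The sketch below uses SciPy's griddata as a stand-in interpolator, which may well differ from the authors' actual interpolation scheme:

```python
import numpy as np
from scipy.interpolate import griddata

def filled_disparity_loss(disp, texture_mask):
    """L1 loss between estimated disparity and its texture-guided fill.

    disp: (H, W) disparity map; texture_mask: (H, W) bool, True = textured.
    """
    h, w = disp.shape
    ys, xs = np.nonzero(texture_mask)
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    # interpolate over textureless pixels from the textured ones
    filled = griddata((ys, xs), disp[ys, xs], (grid_y, grid_x), method="linear")
    filled = np.where(np.isnan(filled), disp, filled)  # outside convex hull
    return np.abs(disp - filled)[~texture_mask].mean()
```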

【3】 Unsupervised Deep Image Stitching: Reconstructing Stitched Features to Images

Authors: Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, Yao Zhao Note: Accepted by IEEE Transactions on Image Processing Link: https://arxiv.org/abs/2106.12859 Abstract: Traditional feature-based image stitching technologies rely heavily on feature detection quality, often failing to stitch images with few features or low resolution. Learning-based image stitching solutions are rarely studied due to the lack of labeled data, making supervised methods unreliable. To address the above limitations, we propose an unsupervised deep image stitching framework consisting of two stages: unsupervised coarse image alignment and unsupervised image reconstruction. In the first stage, we design an ablation-based loss to constrain an unsupervised homography network, which is more suitable for large-baseline scenes. Moreover, a transformer layer is introduced to warp the input images in the stitching-domain space. In the second stage, motivated by the insight that pixel-level misalignments can be eliminated to a certain extent at the feature level, we design an unsupervised image reconstruction network to eliminate the artifacts from features to pixels. Specifically, the reconstruction network can be implemented by a low-resolution deformation branch and a high-resolution refinement branch, learning the deformation rules of image stitching and enhancing the resolution simultaneously. To establish an evaluation benchmark and train the learning framework, a comprehensive real-world image dataset for unsupervised deep image stitching is presented and released. Extensive experiments demonstrate the superiority of our method over other state-of-the-art solutions. Even compared with supervised solutions, our image stitching quality is still preferred by users.

Temporal | Action Recognition | Pose | Video | Motion Estimation (5 papers)

【1】 FitVid: Overfitting in Pixel-Level Video Prediction

Authors: Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, Dumitru Erhan Affiliations: Google Brain; Stanford University Link: https://arxiv.org/abs/2106.13195 Abstract: An agent that is capable of predicting what happens next can perform a variety of tasks through planning with no additional training. Furthermore, such an agent can internally represent the complex dynamics of the real world and therefore can acquire a representation useful for a variety of visual perception tasks. This makes predicting the future frames of a video, conditioned on the observed past and potentially future actions, an interesting task which remains exceptionally challenging despite many recent advances. Existing video prediction models have shown promising results on simple narrow benchmarks, but they generate low-quality predictions on real-life datasets with more complicated dynamics or a broader domain. There is a growing body of evidence that underfitting on the training data is one of the primary causes of low-quality predictions. In this paper, we argue that the inefficient use of parameters in current video models is the main reason for underfitting. Therefore, we introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks while having a similar parameter count to the current state-of-the-art models. We analyze the consequences of overfitting, illustrating how it can produce unexpected outcomes such as generating high-quality output by repeating the training data, and how it can be mitigated using existing image augmentation techniques. As a result, FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.

【2】 Exploring Stronger Feature for Temporal Action Localization

Authors: Zhiwu Qing, Xiang Wang, Ziyuan Huang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Changxin Gao, Nong Sang Affiliations: Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Alibaba Group Note: Rank 1st on the CVPR2021 HACS supervised Temporal Action Localization Challenge Link: https://arxiv.org/abs/2106.13014 Abstract: Temporal action localization aims to localize starting and ending times together with the action category. Limited by GPU memory, mainstream methods pre-extract features for each video; feature quality therefore determines the upper bound of detection performance. In this technical report, we explore classic convolution-based backbones and the recent surge of transformer-based backbones. We find that transformer-based methods can achieve better classification performance than convolution-based ones, but they cannot generate accurate action proposals. In addition, extracting features at a larger frame resolution to reduce the loss of spatial information can also effectively improve the performance of temporal action localization. Finally, we achieve 42.42% mAP on the validation set with a single SlowFast feature by a simple combination, BMN+TCANet, which is 1.87% higher than the result of 2020's multi-model ensemble. With this, we achieve Rank 1st on the CVPR2021 HACS supervised Temporal Action Localization Challenge.

【3】 Evaluation of deep lift pose models for 3D rodent pose estimation based on geometrically triangulated data

Authors: Indrani Sarkar, Indranil Maji, Charitha Omprakash, Sebastian Stober, Sanja Mikulovic, Pavol Bauer Affiliations: Otto von Guericke University, Germany; Leibniz Institute for Neurobiology, Germany Note: 5 pages, 6 figures. Accepted at the CVPR 2021 CV4Animals workshop Link: https://arxiv.org/abs/2106.12993 Abstract: The assessment of laboratory animal behavior is of central interest in modern neuroscience research. Behavior is typically studied in terms of pose changes, which are ideally captured in three dimensions. This requires triangulation over a multi-camera system which views the animal from different angles. However, this is challenging in realistic laboratory setups due to occlusions and other technical constraints. Here we propose the use of lift-pose models that allow for robust 3D pose estimation of freely moving rodents from a single camera view. To obtain high-quality training data for the pose lifting, we first perform geometric calibration in a camera setup involving bottom as well as side views of the behaving animal. We then evaluate the performance of two previously proposed model architectures under the given inference perspectives and conclude that reliable 3D pose inference can be obtained using temporal convolutions. With this work we would like to contribute to more robust and diverse behavior tracking of freely moving rodents for a wide range of experiments and setups in the neuroscience community.

【4】 Video Super-Resolution with Long-Term Self-Exemplars

Authors: Guotao Meng, Yue Wu, Sijin Li, Qifeng Chen Affiliations: HKUST; DJI Link: https://arxiv.org/abs/2106.12778 Abstract: Existing video super-resolution methods often utilize a few neighboring frames to generate a higher-resolution image for each frame. However, the redundant information between distant frames has not been fully exploited in these methods: corresponding patches of the same instance appear across distant frames at different scales. Based on this observation, we propose a video super-resolution method with long-term cross-scale aggregation that leverages similar patches (self-exemplars) across distant frames. Our model also includes a multi-reference alignment module to fuse the features derived from similar patches: we fuse the features of distant references to perform high-quality super-resolution. We also propose a novel and practical training strategy for reference-based super-resolution. To evaluate the performance of our proposed method, we conduct extensive experiments on our collected CarCam dataset and the Waymo Open dataset, and the results demonstrate that our method outperforms state-of-the-art methods. Our source code will be publicly available.

【5】 A Global Appearance and Local Coding Distortion based Fusion Framework for CNN based Filtering in Video Coding

Authors: Jian Yue, Yanbo Gao, Shuai Li, Hui Yuan, Frédéric Dufaux Link: https://arxiv.org/abs/2106.12746 Abstract: In-loop filtering is used in video coding to process the reconstructed frame in order to remove blocking artifacts. With the development of convolutional neural networks (CNNs), CNNs have been explored for in-loop filtering, since filtering can be treated as an image de-noising task. However, in addition to being a distorted image, the reconstructed frame is also obtained by a fixed line of block-based encoding operations in video coding; it carries coding-unit-based coding distortion with some similar characteristics. Therefore, in this paper, we address the filtering problem from two aspects: global appearance restoration for disrupted texture, and restoration of local coding distortion caused by the fixed coding pipeline. Accordingly, a three-stream global appearance and local coding distortion based fusion network is developed with a high-level global feature stream, a high-level local feature stream and a low-level local feature stream. An ablation study is conducted to validate the necessity of the different features, demonstrating that the global features and local features can complement each other in filtering and achieve better performance when combined. To the best of our knowledge, we are the first to clearly characterize the video filtering process from the above global appearance and local coding distortion restoration aspects with experimental verification, providing a clear pathway to developing filter techniques. Experimental results demonstrate that the proposed method significantly outperforms existing single-frame based methods and achieves 13.5%, 11.3%, and 11.7% BD-rate savings on average for the AI, LDP and RA configurations, respectively, compared with the HEVC reference software.

Medical (6 papers)

【1】 Handling Data Heterogeneity with Generative Replay in Collaborative Learning for Medical Imaging

Authors: Liangqiong Qu, Niranjan Balachandar, Miao Zhang, Daniel Rubin Affiliations: Department of Biomedical Data Science, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science and Department of Radiology, Stanford University, Stanford, CA, USA Link: https://arxiv.org/abs/2106.13208 Abstract: Collaborative learning, which enables collaborative and decentralized training of deep neural networks at multiple institutions in a privacy-preserving manner, is rapidly emerging as a valuable technique in healthcare applications. However, its distributed nature often leads to significant heterogeneity in data distributions across institutions. Existing collaborative learning approaches generally do not account for the presence of heterogeneity in data among institutions, or only mildly skewed label distributions are studied. In this paper, we present a novel generative replay strategy to address the challenge of data heterogeneity in collaborative learning methods. Instead of directly training a model for task performance, we leverage recent image synthesis techniques to develop a novel dual-model architecture: a primary model learns the desired task, and an auxiliary "generative replay model" either synthesizes images that closely resemble the input images or helps extract latent variables. The generative replay strategy is flexible to use: it can either be incorporated into existing collaborative learning methods to improve their capability of handling data heterogeneity across institutions, or be used as a novel, standalone collaborative learning framework (termed FedReplay) to reduce communication cost. Experimental results demonstrate the capability of the proposed method in handling heterogeneous data across institutions. On highly heterogeneous data partitions, our model achieves a ~4.88% improvement in prediction accuracy on a diabetic retinopathy classification dataset, and a ~49.8% reduction in mean absolute error on a bone age prediction dataset, respectively, compared to state-of-the-art collaborative learning methods.

【2】 Relationship between pulmonary nodule malignancy and surrounding pleurae, airways and vessels: a quantitative study using the public LIDC-IDRI dataset

Authors: Yulei Qin, Yun Gu, Hanxiao Zhang, Jie Yang, Lihui Wang, Feng Yao, Yue-Min Zhu Affiliations: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China; CREATIS, INSA Lyon, CNRS UMR, INSERM U, Université de Lyon, Villeurbanne, France Note: 33 pages, 3 figures. Submitted for review Link: https://arxiv.org/abs/2106.12991 Abstract: To investigate whether the pleurae, airways and vessels surrounding a nodule on non-contrast computed tomography (CT) can discriminate benign and malignant pulmonary nodules. The LIDC-IDRI dataset, one of the largest publicly available CT databases, was exploited for this study. A total of 1556 nodules from 694 patients were involved in the statistical analysis, where nodules with average malignancy scores <3 and >3 were denoted as benign and malignant, respectively. Besides, 339 nodules from 113 patients with diagnosis ground truth were independently evaluated. Computer algorithms were developed to segment pulmonary structures and quantify the distances to the pleural surface, airways and vessels, as well as the count and normalized volume of airways and vessels near a nodule. Odds ratio (OR) and chi-square (χ²) testing were performed to demonstrate the correlation between features of surrounding structures and nodule malignancy. A non-parametric receiver operating characteristic (ROC) analysis was conducted with logistic regression to evaluate the discrimination ability of each structure. For the benign and malignant groups, the average distances from nodules to the pleural surface, airways and vessels are respectively (6.56, 5.19), (37.08, 26.43) and (1.42, 1.07) mm. The correlation between nodules and the count of airways and vessels that contact or project towards nodules is respectively (OR=22.96, χ²=105.04) and (OR=7.06, χ²=290.11). The correlation between nodules and the volume of airways and vessels is (OR=9.19, χ²=159.02) and (OR=2.29, χ²=55.89). The areas under the curve (AUCs) for pleurae, airways and vessels are respectively 0.5202, 0.6943 and 0.6529. Our results show that malignant nodules are often surrounded by more pulmonary structures compared with benign ones, suggesting that features of these structures could be viewed as lung cancer biomarkers.
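The reported statistics are standard and easy to reproduce on one's own contingency tables; a minimal sketch with SciPy (the table values below are made up for illustration, not taken from the paper):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = (benign, malignant) nodules,
# columns = (few, many) vessels contacting or projecting towards the nodule.
table = np.array([[320, 180],
                  [150, 400]])

chi2, p, dof, expected = chi2_contingency(table)

# Odds ratio from the 2x2 table: (a*d) / (b*c).
a, b = table[0]
c, d = table[1]
odds_ratio = (a * d) / (b * c)

print(f"chi^2 = {chi2:.2f}, p = {p:.3g}, OR = {odds_ratio:.2f}")
```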

【3】 Q-space Conditioned Translation Networks for Directional Synthesis of Diffusion Weighted Images from Multi-modal Structural MRI

Authors: Mengwei Ren, Heejong Kim, Neel Dey, Guido Gerig Affiliations: Department of Computer Science and Engineering, New York University, NY, USA Note: Accepted by MICCAI 2021. Project page: this https URL; Code: this https URL Link: https://arxiv.org/abs/2106.13188 Abstract: Current deep learning approaches for diffusion MRI modeling circumvent the need for densely-sampled diffusion-weighted images (DWIs) by directly predicting microstructural indices from sparsely-sampled DWIs. However, they implicitly make unrealistic assumptions of static q-space sampling during training and reconstruction. Further, such approaches can restrict the downstream use of variably sampled DWIs for purposes including the estimation of microstructural indices or tractography. We propose a generative adversarial translation framework for high-quality DWI synthesis with arbitrary q-space sampling, given commonly acquired structural images (e.g., B0, T1, T2). Our translation network linearly modulates its internal representations conditioned on continuous q-space information, thus removing the need for fixed sampling schemes. Moreover, this approach enables downstream estimation of high-quality microstructural maps from arbitrarily subsampled DWIs, which may be particularly important in cases with sparsely sampled DWIs. Across several recent methodologies, the proposed approach yields improved DWI synthesis accuracy and fidelity with enhanced downstream utility, as quantified by the accuracy of scalar microstructure indices estimated from the synthesized images. Code is available at https://github.com/mengweiren/q-space-conditioned-dwi-synthesis.
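Linear modulation of internal representations conditioned on a continuous code is reminiscent of FiLM-style conditioning. A minimal sketch of that general mechanism, with illustrative names and sizes (not the authors' exact design):

```python
import torch
import torch.nn as nn

class QSpaceFiLM(nn.Module):
    """Sketch of linear feature modulation conditioned on q-space info.

    Assumes the q-space condition is a small vector (e.g., gradient
    direction plus b-value); names and sizes are illustrative."""
    def __init__(self, q_dim: int, num_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(q_dim, 2 * num_channels)

    def forward(self, features: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W); q: (B, q_dim)
        gamma, beta = self.to_scale_shift(q).chunk(2, dim=-1)
        gamma = gamma[..., None, None]  # broadcast over spatial dims
        beta = beta[..., None, None]
        return gamma * features + beta

# Usage: modulate a feature map with a continuous q-space code.
film = QSpaceFiLM(q_dim=4, num_channels=64)
feats = torch.randn(2, 64, 32, 32)
q = torch.randn(2, 4)  # e.g., unit gradient direction (3) + b-value (1)
out = film(feats, q)
```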

【4】 High-resolution Image Registration of Consecutive and Re-stained Sections in Histopathology

Authors: Johannes Lotz, Nick Weiss, Jeroen van der Laak, Stefan Heldmann Affiliations: Fraunhofer Institute for Digital Medicine MEVIS Link: https://arxiv.org/abs/2106.13150 Abstract: We compare variational image registration of consecutive and re-stained sections from histopathology. We present a fully automatic algorithm for non-parametric (nonlinear) image registration and apply it to a previously existing dataset from the ANHIR challenge (230 slide pairs, consecutive sections) and a new dataset (hybrid re-stained and consecutive, 81 slide pairs, ca. 3000 landmarks), which is made publicly available. Registration hyperparameters are obtained on the ANHIR dataset and applied to the new dataset without modification. In the new dataset, landmark errors after registration range from 13.2 micrometers for consecutive sections to 1 micrometer for re-stained sections. We observe that non-parametric registration leads to lower landmark errors in both cases, even though the effect is smaller for re-stained sections. The nucleus-level alignment after non-parametric registration of re-stained sections provides a valuable tool to generate automatic ground truth for machine learning applications in histopathology.

【5】 Advancing biological super-resolution microscopy through deep learning: a brief review

Authors: Tianjie Yang, Yaoru Luo, Wei Ji, Ge Yang Affiliations: Institute of Biophysics, Chinese Academy of Sciences, Beijing, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing; Laboratory of Computational Biology and Machine Intelligence, School of Artificial Intelligence Link: https://arxiv.org/abs/2106.13064 Abstract: Super-resolution microscopy overcomes the diffraction limit of conventional light microscopy in spatial resolution. By providing novel spatial or spatio-temporal information on biological processes at nanometer resolution with molecular specificity, it plays an increasingly important role in the life sciences. However, its technical limitations require trade-offs to balance its spatial resolution, temporal resolution, and light exposure of samples. Recently, deep learning has achieved breakthrough performance in many image processing and computer vision tasks. It has also shown great promise in pushing the performance envelope of super-resolution microscopy. In this brief review, we survey recent advances in using deep learning to enhance the performance of super-resolution microscopy. We focus primarily on how deep learning advances the reconstruction of super-resolution images. Related key technical challenges are discussed. Despite the challenges, deep learning is set to play an indispensable and transformative role in the development of super-resolution microscopy. We conclude with an outlook on how deep learning could shape the future of this new generation of light microscopy technology.

【6】 A Systematic Collection of Medical Image Datasets for Deep Learning

Authors: Johann Li, Guangming Zhu, Cong Hua, Mingtao Feng, Basheer Bennamoun, Ping Li, Xiaoyuan Lu, Juan Song, Peiyi Shen, Xu Xu, Lin Mei, Liang Zhang, Syed Afaq Ali Shah, Mohammed Bennamoun Note: This paper has been submitted to a journal Link: https://arxiv.org/abs/2106.12864 Abstract: The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analysis. Medical image acquisition, annotation, and analysis are costly, and their usage is constrained by ethical restrictions. They also require many resources, such as human expertise and funding. That makes it difficult for non-medical researchers to have access to useful and large medical data. Thus, to be as comprehensive as possible, this paper provides a collection of medical image datasets with their associated challenges for deep learning research. We have collected information on around three hundred datasets and challenges mainly reported between 2013 and 2020 and categorized them into four categories: head & neck, chest & abdomen, pathology & blood, and "others". Our paper has three purposes: 1) to provide an up-to-date and complete list that can be used as a universal reference to easily find the datasets for clinical image analysis, 2) to guide researchers on the methodology to test and evaluate their methods' performance and robustness on relevant datasets, and 3) to provide a "route" to the relevant algorithms for the relevant medical topics, along with challenge leaderboards.

GAN | Adversarial | Attack | Generation (3 papers)

【1】 GaussiGAN: Controllable Image Synthesis with 3D Gaussians from Unposed Silhouettes

Authors: Youssef A. Mejjati, Isa Milefchik, Aaron Gokaslan, Oliver Wang, Kwang In Kim, James Tompkin Affiliations: University of Bath; Cornell University; Adobe; UNIST; Brown University Link: https://arxiv.org/abs/2106.13215 Abstract: We present an algorithm that learns a coarse 3D representation of objects from unposed multi-view 2D mask supervision, then uses it to generate detailed mask and image texture. In contrast to existing voxel-based methods for unposed object reconstruction, our approach learns to represent the generated shape and pose with a set of self-supervised canonical 3D anisotropic Gaussians via a perspective camera, and a set of per-image transforms. We show that this approach can robustly estimate a 3D space for the camera and object, while recent baselines sometimes struggle to reconstruct coherent 3D spaces in this setting. We show results on synthetic datasets with realistic lighting, and demonstrate object insertion with interactive posing. With our work, we help move towards structured representations that handle more real-world variation in learning-based object reconstruction.

【2】 SGTBN: Generating Dense Depth Maps from Single-Line LiDAR

Authors: Hengjie Lu, Shugong Xu, Shan Cao Link: https://arxiv.org/abs/2106.12994 Abstract: Depth completion aims to generate a dense depth map from a sparse depth map and an aligned RGB image. However, current depth completion methods use the extremely expensive 64-line LiDAR (about $100,000) to obtain sparse depth maps, which limits their application scenarios. Compared with the 64-line LiDAR, the single-line LiDAR is much less expensive and much more robust. Therefore, we propose a method to tackle the problem of single-line depth completion, in which we aim to generate a dense depth map from the single-line LiDAR information and the aligned RGB image. A single-line depth completion dataset is proposed based on the existing 64-line depth completion dataset (KITTI). A network called Semantic Guided Two-Branch Network (SGTBN), which contains global and local branches to extract and fuse global and local information, is proposed for this task. A semantic-guided depth upsampling module is used in our network to make full use of the semantic information in RGB images. In addition to the usual MSE loss, we add a virtual normal loss to increase the constraint of high-order 3D geometry in our network. Our network outperforms the state of the art in the single-line depth completion task. Besides, compared with monocular depth estimation, our method also has significant advantages in precision and model size.

【3】 Towards Automatic Speech to Sign Language Generation

Authors: Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri, C V Jawahar Affiliations: Department of Computer Science and Engineering, IIT Kanpur; Center for Visual Information Technology, IIIT Hyderabad; Department of Computer Science, University of Bath Note: 5 pages (including references), 5 figures. Accepted at Interspeech 2021 Link: https://arxiv.org/abs/2106.12790 Abstract: We aim to solve, for the first time, the highly challenging task of generating continuous sign language videos solely from speech segments. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language proves to be a practical solution when communicating with people suffering from hearing loss. Therefore, we eliminate the need for using text as input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate a signer's poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct additional ablation studies to analyze the effect of different modules of our network. A demo video containing several results is attached to the supplementary material.

OCR | Text (2 papers)

【1】 AudioCLIP: Extending CLIP to Image, Text and Audio

Authors: Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel Affiliations: DFKI GmbH, Trippstadter Str., Kaiserslautern, Germany; TU Kaiserslautern, Kaiserslautern, Germany Note: submitted to GCPR 2021 Link: https://arxiv.org/abs/2106.13043 Abstract: In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend of fusing domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
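Extending CLIP to a third modality keeps the same pairwise contrastive recipe. A sketch of the generic symmetric contrastive loss such training builds on (not AudioCLIP's exact head or hyperparameters):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings
    (e.g., audio-image, audio-text, image-text), CLIP-style. A sketch;
    the temperature handling in AudioCLIP may differ."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Tri-modal training would sum this loss over the three modality pairs:
# L = L(audio, image) + L(audio, text) + L(image, text)
```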

【2】 A Simple and Strong Baseline: Progressively Region-based Scene Text Removal Networks

Authors: Yuxin Wang, Hongtao Xie, Shancheng Fang, Yadong Qu, Yongdong Zhang Affiliations: University of Science and Technology of China Note: 10 pages, 7 figures Link: https://arxiv.org/abs/2106.13029 Abstract: Existing scene text removal methods mainly train an elaborate network with paired images to realize text localization and background reconstruction simultaneously, but there exist two problems: 1) the lack of exhaustive erasure of text regions, and 2) excessive erasure of text-free areas. To handle these issues, this paper provides a novel ProgrEssively Region-based scene Text eraser (PERT), which introduces a region-based modification strategy to progressively erase the pixels in only the text region. First, PERT decomposes the STR task into several erasing stages. As each stage aims to take a further step toward the text-removed image rather than directly regressing to the final result, the decomposition reduces the learning difficulty in each stage, and an exhaustive erasure result can be obtained by iterating over lightweight erasing blocks with shared parameters. Then, PERT introduces a region-based modification strategy to ensure the integrity of text-free areas by decoupling text localization from the erasure process to guide the removal. Benefiting from its simple architecture, PERT is a simple and strong baseline, and is easy to follow and develop. Extensive experiments demonstrate that PERT obtains state-of-the-art results on both synthetic and real-world datasets. Code is available at https://github.com/wangyuxin87/PERT.
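The progressive, region-based erasing can be pictured as iterating a shared lightweight block while modifying only pixels inside a predicted text mask. A minimal sketch under our own naming (the block itself and the mask source are stand-ins, not PERT's modules):

```python
import torch
import torch.nn as nn

class ProgressiveEraser(nn.Module):
    """Sketch of progressive, region-based erasing: a block with shared
    parameters is iterated, and at each stage only pixels inside the
    predicted text region are modified."""
    def __init__(self, block: nn.Module, stages: int = 3):
        super().__init__()
        self.block = block      # shared erasing block
        self.stages = stages

    def forward(self, img: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
        x = img
        for _ in range(self.stages):
            update = self.block(x)                        # proposed erasure
            x = text_mask * update + (1 - text_mask) * x  # keep text-free areas intact
        return x

# Toy usage with a stand-in erasing block and a random binary mask:
eraser = ProgressiveEraser(nn.Conv2d(3, 3, 3, padding=1), stages=3)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
out = eraser(torch.rand(1, 3, 64, 64), mask)
```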

Attention (1 paper)

【1】 ATP-Net: An Attention-based Ternary Projection Network For Compressed Sensing

Authors: Guanxiong Nie, Yajian Zhou Affiliations: China Construction Bank Hebei Branch, Shijiazhuang, Hebei, China; Beijing University of Posts and Telecommunications, Beijing, China Link: https://arxiv.org/abs/2106.12728 Abstract: Compressed Sensing (CS) theory realizes the signal sampling and compression processes simultaneously, and can achieve accurate signal recovery with fewer observations, providing a solution for better and faster transmission of massive data. In this paper, a ternary sampling matrix-based method with an attention mechanism is proposed to address the problem that CS sampling matrices are in most cases random matrices, which are irrelevant to the sampled signal and require large storage space. The proposed method consists of three components, i.e., ternary sampling, initial reconstruction and deep reconstruction, with the emphasis on ternary sampling. The main idea of the ternary method (-1, 0, +1) is to introduce an attention mechanism to evaluate the importance of parameters at the sampling layer after the sampling matrix is binarized (-1, +1), followed by pruning the weights whose importance is below a predefined threshold, to achieve ternarization. Furthermore, a compressed sensing algorithm for image reconstruction, called ATP-Net, i.e., Attention-based Ternary Projection network, is implemented on the basis of the ternary sampling matrix. Experimental results show that the quality of image reconstruction by means of ATP-Net maintains a satisfactory level with the ternary sampling matrix, i.e., the average PSNR on Set11 is 30.4 when the sampling rate is 0.25, an approximately 6% improvement compared with that of DR2-Net.
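The ternarization step, as described, amounts to binarizing the sampling matrix and then zeroing entries whose attention-derived importance falls below a threshold. A minimal sketch with the importance scores stubbed out (shapes and threshold are illustrative):

```python
import torch

def ternarize(weights: torch.Tensor, importance: torch.Tensor, thresh: float):
    """Sketch of the ternarization idea: start from a binarized sampling
    matrix (-1, +1) and prune entries whose attention-derived importance
    falls below `thresh` to zero. `importance` is assumed to come from an
    attention module scoring each sampling-layer parameter."""
    binary = torch.sign(weights)           # {-1, +1} (sign(0) -> 0 is harmless here)
    keep = (importance >= thresh).float()  # prune unimportant entries to 0
    return binary * keep                   # values in {-1, 0, +1}

w = torch.randn(64, 256)     # sampling matrix for 16x16 blocks, rate 0.25
imp = torch.rand_like(w)     # stand-in for learned importance scores
phi = ternarize(w, imp, thresh=0.2)
```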

Distillation | Knowledge Extraction (1 paper)

【1】 MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction

Authors: Guozhi Tang, Lele Xie, Lianwen Jin, Jiapeng Wang, Jingdong Chen, Zhen Xu, Qianying Wang, Yaqiang Wu, Hui Li Affiliations: School of Electronic and Information Engineering, South China University of Technology, China; Guangdong Artificial Intelligence and Digital Economy Laboratory (Pazhou Lab), Guangzhou, China; Ant Group, China; Lenovo Research, China Note: accepted by IJCAI 2021 Link: https://arxiv.org/abs/2106.12940 Abstract: The Visual Information Extraction (VIE) task aims to extract key information from multifarious document images (e.g., invoices and purchase receipts). Most previous methods treat the VIE task simply as a sequence labeling or classification problem, which requires models to carefully identify each kind of semantics by introducing multimodal features such as font, color, and layout. But simply introducing multimodal features cannot work well when faced with numeric semantic categories or some ambiguous texts. To address this issue, in this paper we propose a novel key-value matching model based on a graph neural network for VIE (MatchVIE). Through key-value matching based on relevancy evaluation, the proposed MatchVIE can bypass the recognition of various semantics and simply focus on the strong relevancy between entities. Besides, we introduce a simple but effective operation, Num2Vec, to tackle the instability of encoded values, which helps the model converge more smoothly. Comprehensive experiments demonstrate that the proposed MatchVIE can significantly outperform previous methods. Notably, to the best of our knowledge, MatchVIE may be the first attempt to tackle the VIE task by modeling the relevancy between keys and values, and it is a good complement to existing methods.

Super-Resolution | Denoising | Deblurring | Dehazing (1 paper)

【1】 ChaLearn Looking at People: Inpainting and Denoising Challenges

Authors: Sergio Escalera, Marti Soler, Stephane Ayache, Umut Guclu, Jun Wan, Meysam Madadi, Xavier Baro, Hugo Jair Escalante, Isabelle Guyon Affiliations: Universitat de Barcelona and Computer Vision Center, Barcelona, Spain; Aix Marseille Univ, CNRS, LIF, Marseille, France; Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen Link: https://arxiv.org/abs/2106.13071 Abstract: Dealing with incomplete information is a well-studied problem in the context of machine learning and computational intelligence. However, in the context of computer vision, the problem has only been studied in specific scenarios (e.g., certain types of occlusions in specific types of images), although it is common to have incomplete information in visual data. This chapter describes the design of an academic competition focusing on inpainting of images and video sequences that was part of the competition program of WCCI 2018 and had a satellite event collocated with ECCV 2018. The ChaLearn Looking at People Inpainting Challenge aimed at advancing the state of the art in visual inpainting by promoting the development of methods for recovering missing and occluded information from images and video. Three tracks were proposed in which visual inpainting might be helpful but still challenging: human body pose estimation, text overlay removal and fingerprint denoising. This chapter describes the design of the challenge, including the release of three novel datasets and the description of evaluation metrics, baselines and the evaluation protocol. The results of the challenge are analyzed and discussed in detail, and conclusions derived from this event are outlined.

Point Cloud | SLAM | Radar | LiDAR | Depth/RGB-D (1 paper)

【1】 FaDIV-Syn: Fast Depth-Independent View Synthesis

Authors: Andre Rochow, Max Schwarz, Michael Weinmann, Sven Behnke Affiliations: University of Bonn Link: https://arxiv.org/abs/2106.13139 Abstract: We introduce FaDIV-Syn, a fast depth-independent view synthesis method. Our multi-view approach addresses the problem that view synthesis methods are often limited by their depth estimation stage, where incorrect depth predictions can lead to large projection errors. To avoid this issue, we efficiently warp multiple input images into the target frame for a range of assumed depth planes. The resulting tensor representation is fed into a U-Net-like CNN with gated convolutions, which directly produces the novel output view. We therefore side-step explicit depth estimation. This improves efficiency and performance on transparent, reflective, and featureless scene parts. FaDIV-Syn can handle both interpolation and extrapolation tasks and outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10k dataset. In contrast to comparable methods, it is capable of real-time operation due to its lightweight architecture. We further demonstrate the data efficiency of FaDIV-Syn by training from fewer examples, as well as its generalization to higher resolutions and arbitrary depth ranges under severe depth discretization.
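Warping each input image into the target frame for a range of assumed depth planes is the classic plane-sweep construction. Below is a sketch of that construction for fronto-parallel planes via per-plane homographies; the interface, the pose convention, and the assumption that source and target share one resolution are ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def plane_sweep_volume(src_img, K_src, K_tgt, R, t, depths, out_hw):
    """Warp a source image into the target view for a set of assumed
    fronto-parallel depth planes (a plane-sweep volume).

    src_img: (3, H, W); K_*: (3, 3) intrinsics; R, t: source pose w.r.t.
    target; depths: iterable of assumed plane depths. Returns (D, 3, H, W)."""
    H, W = out_hw
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().reshape(3, -1)
    n = torch.tensor([[0.0, 0.0, 1.0]])  # plane normal in the target frame
    layers = []
    for d in depths:
        # Homography induced by the plane at depth d: K_s (R - t n^T / d) K_t^-1
        Hmat = K_src @ (R - t.reshape(3, 1) @ n / d) @ torch.inverse(K_tgt)
        warped = Hmat @ pix
        uv = warped[:2] / warped[2:].clamp(min=1e-6)
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,   # normalize for grid_sample
                            uv[1] / (H - 1) * 2 - 1], -1).reshape(1, H, W, 2)
        layers.append(F.grid_sample(src_img[None], grid, align_corners=True)[0])
    return torch.stack(layers)  # stacked along depth, fed to the CNN as channels
```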

Multimodal (1 paper)

【1】 Planetary UAV localization based on Multi-modal Registration with Pre-existing Digital Terrain Model

Authors: Xue Wan, Yuanbin Shao, Shengyang Li Affiliations: Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing; Key Laboratory of Space Utilization, Chinese Academy of Sciences, Beijing, China Link: https://arxiv.org/abs/2106.12738 Abstract: Autonomous real-time optical navigation of planetary UAVs is one of the key technologies to ensure the success of exploration. In such a GPS-denied environment, vision-based localization is an optimal approach. In this paper, we propose a multi-modal registration based SLAM algorithm, which estimates the location of a planetary UAV using a nadir-view camera on the UAV compared with a pre-existing digital terrain model. To overcome the scale and appearance differences between on-board UAV images and the pre-installed digital terrain model, a theoretical model is proposed to prove that the topographic features of a UAV image and the DEM can be correlated in the frequency domain via the cross power spectrum. To provide the six-DoF pose of the UAV, we also develop an optimization approach which fuses the geo-referencing result into a SLAM system via LBA (Local Bundle Adjustment) to achieve robust and accurate vision-based navigation even in featureless planetary areas. To test the robustness and effectiveness of the proposed localization algorithm, a new cross-source drone-based localization dataset for planetary exploration is proposed. The proposed dataset includes 40200 synthetic drone images taken from nine planetary scenes with related DEM query images. Comparison experiments demonstrate that over a flight distance of 33.8 km, the proposed method achieved an average localization error of 0.45 meters, compared to 1.31 meters by ORB-SLAM, with a processing speed of 12 Hz, which ensures real-time performance. We will make our datasets available to encourage further work on this promising topic.
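Correlating two signals via the normalized cross power spectrum is the classic phase-correlation technique. A minimal sketch of that building block on raw intensities (the paper correlates terrain features of the UAV image and the DEM, and estimates more than a pure translation):

```python
import numpy as np

def phase_correlation(a: np.ndarray, b: np.ndarray):
    """Estimate the translation between two same-sized images via the
    normalized cross power spectrum (phase correlation)."""
    Fa = np.fft.fft2(a)
    Fb = np.fft.fft2(b)
    cps = Fa * np.conj(Fb)
    cps /= np.abs(cps) + 1e-12            # keep phase only
    corr = np.fft.ifft2(cps).real         # sharp peak at the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the image size to negative offsets.
    if dy > a.shape[0] // 2: dy -= a.shape[0]
    if dx > a.shape[1] // 2: dx -= a.shape[1]
    return dy, dx
```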

Other Neural Networks | Deep Learning | Models | Modeling (5 papers)

【1】 Towards Fully Interpretable Deep Neural Networks: Are We There Yet?

Authors: Sandareka Wickramanayake, Wynne Hsu, Mong Li Lee Affiliations: School of Computing, National University of Singapore; Institute of Data Science, National University of Singapore Note: Presented at the ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI Link: https://arxiv.org/abs/2106.13164 Abstract: Despite their remarkable performance, Deep Neural Networks (DNNs) behave as black boxes, hindering user trust in Artificial Intelligence (AI) systems. Research on opening black-box DNNs can be broadly categorized into post-hoc methods and inherently interpretable DNNs. While many surveys have been conducted on post-hoc interpretation methods, little effort is devoted to inherently interpretable DNNs. This paper provides a review of existing methods to develop DNNs with intrinsic interpretability, with a focus on Convolutional Neural Networks (CNNs). The aim is to understand the current progress towards fully interpretable DNNs that can cater to different interpretation requirements. Finally, we identify gaps in current work and suggest potential research directions.

【2】 Learning by Planning: Language-Guided Global Image Editing

Authors: Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu Affiliations: University of Rochester; Adobe Research Note: Accepted by CVPR2021 Link: https://arxiv.org/abs/2106.13156 Abstract: Recently, language-guided global image editing has drawn increasing attention with growing application potential. However, previous GAN-based methods are not only confined to domain-specific, low-resolution data but also lack interpretability. To overcome these collective difficulties, we develop a text-to-operation model to map a vague editing language request into a series of editing operations, e.g., change contrast, brightness, and saturation. Each operation is interpretable and differentiable. Furthermore, the only supervision in the task is the target image, which is insufficient for stable training of sequential decisions. Hence, we propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth. Comparison experiments on the newly collected MA5k-Req dataset and the GIER dataset show the advantages of our methods. Code is available at https://jshi31.github.io/T2ONet.
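Editing operations such as brightness, contrast, and saturation are simple differentiable image transforms, which is what makes a predicted operation sequence end-to-end trainable. A sketch of such operations (our own simplified definitions, not necessarily the paper's parameterization):

```python
import torch

def adjust_brightness(img, b):
    # img in [0, 1]; b is a (possibly learnable) scalar offset
    return (img + b).clamp(0, 1)

def adjust_contrast(img, c):
    mean = img.mean(dim=(-1, -2), keepdim=True)   # per-channel mean
    return ((img - mean) * c + mean).clamp(0, 1)

def adjust_saturation(img, s):
    gray = img.mean(dim=-3, keepdim=True)         # crude luma
    return (gray + (img - gray) * s).clamp(0, 1)

# An edit sequence is then a differentiable composition, e.g.:
img = torch.rand(3, 64, 64)
edited = adjust_saturation(adjust_contrast(adjust_brightness(img, 0.1), 1.2), 0.9)
```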

【3】 Conditional Deformable Image Registration with Convolutional Neural Network

Authors: Tony C. W. Mok, Albert C. S. Chung Affiliations: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong Note: Early accepted by MICCAI2021 Link: https://arxiv.org/abs/2106.12673 Abstract: Recent deep learning-based methods have shown promising results and runtime advantages in deformable image registration. However, analyzing the effects of hyperparameters and searching for optimal regularization parameters prove to be prohibitive in deep learning-based methods, because this involves training a substantial number of separate models with distinct hyperparameter values. In this paper, we propose a conditional image registration method and a new self-supervised learning paradigm for deep deformable image registration. By learning conditional features that are correlated with the regularization hyperparameter, we demonstrate that optimal solutions with arbitrary hyperparameters can be captured by a single deep convolutional neural network. In addition, the smoothness of the resulting deformation field can be manipulated with arbitrary strength of smoothness regularization during inference. Extensive experiments on a large-scale brain MRI dataset show that our proposed method enables precise control of the smoothness of the deformation field without sacrificing the runtime advantage or registration accuracy.
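Capturing all hyperparameter settings in one network requires feeding the regularization weight into the model itself, e.g., by mapping λ to a per-channel scale and shift of intermediate features while the same λ weights the smoothness term. A minimal sketch under our own naming (the paper's exact conditioning architecture may differ):

```python
import torch
import torch.nn as nn

class LambdaConditioning(nn.Module):
    """Sketch: map the smoothness hyperparameter lambda to a per-channel
    scale/shift of 3D registration features. Layer names are illustrative."""
    def __init__(self, num_channels: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                                 nn.Linear(32, 2 * num_channels))

    def forward(self, feats: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, D, H, W); lam: (B,) sampled per training example
        gamma, beta = self.mlp(lam.view(-1, 1)).chunk(2, dim=-1)
        return feats * gamma[..., None, None, None] + beta[..., None, None, None]

# Self-supervised training step (schematically): sample lambda, condition
# the network on it, and weight the regularizer by the same value:
#   loss = similarity(warped, fixed) + lam * smoothness(flow)
```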

【4】 Rate Distortion Characteristic Modeling for Neural Image Compression

Authors: Chuanmin Jia, Ziqing Ge, Shanshe Wang, Siwei Ma, Wen Gao Affiliations: Department of Computer Science, Peking University, Beijing, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China Note: 13 pages, 7 figures Link: https://arxiv.org/abs/2106.12954 Abstract: The end-to-end optimization capability offers neural image compression (NIC) superior lossy compression performance. However, distinct models need to be trained to reach different points in the rate-distortion (R-D) space. In this paper, we consider the problem of R-D characteristic analysis and modeling for NIC. We make efforts to formulate the essential mathematical functions that describe the R-D behavior of NIC using deep networks and statistical modeling. Continuous bit-rate points can thus be elegantly realized by leveraging such a model via a single trained network. In this regard, we propose a plug-in module to learn the relationship between the target bit-rate and the binary representation of the auto-encoder's latent variable. Furthermore, we model the rate and distortion characteristics of NIC as functions of the coding parameter λ. Our experiments show that the proposed method is easy to adopt and obtains competitive coding performance compared with fixed-rate coding approaches, which would benefit the practical deployment of NIC. In addition, the proposed model can be applied to NIC rate control with limited bit-rate error using a single network.
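The practical payoff of modeling R(λ) and D(λ) is that the curves can be fitted from a few encodes and R(λ) inverted for rate control. A sketch of that workflow with made-up measurements and an illustrative logarithmic functional form (the paper derives its own parametric functions):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (lambda, rate, distortion) measurements from a few encodes.
lam = np.array([0.0016, 0.0032, 0.0075, 0.015, 0.03, 0.045])
rate = np.array([0.12, 0.21, 0.38, 0.56, 0.83, 1.02])   # bpp
dist = np.array([33.1, 34.6, 36.4, 37.9, 39.5, 40.3])   # PSNR (dB)

def log_model(x, a, b):
    return a * np.log(x) + b

(ra, rb), _ = curve_fit(log_model, lam, rate)
(da, db), _ = curve_fit(log_model, lam, dist)

# Invert the fitted R(lambda) to pick a lambda for a target bit-rate,
# then read off the expected distortion -- the basis of rate control.
target_bpp = 0.5
lam_star = np.exp((target_bpp - rb) / ra)
print(lam_star, log_model(lam_star, da, db))
```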

【5】 Continuous-Time Deep Glioma Growth Models

Authors: Jens Petersen, Fabian Isensee, Gregor Köhler, Paul F. Jäger, David Zimmerer, Ulf Neuberger, Wolfgang Wick, Jürgen Debus, Sabine Heiland, Martin Bendszus, Philipp Vollmuth, Klaus H. Maier-Hein Affiliations: Division of Medical Image Computing, German Cancer Research Center; HIP Applied Computer Vision Lab, Division of Medical Image Computing, German Cancer Research Center; Interactive Machine Learning Group, German Cancer Research Center Note: MICCAI 2021 Link: https://arxiv.org/abs/2106.12917 Abstract: The ability to estimate how a tumor might evolve in the future could have tremendous clinical benefits, from improved treatment decisions to better dose distribution in radiation therapy. Recent work has approached the glioma growth modeling problem via deep learning and variational inference, thus learning growth dynamics entirely from a real patient data distribution. So far, this approach was constrained to predefined image acquisition intervals and sequences of fixed length, which limits its applicability in more realistic scenarios. We overcome these limitations by extending Neural Processes, a class of conditional generative models for stochastic time series, with a hierarchical multi-scale representation encoding that includes a spatio-temporal attention mechanism. The result is a learned growth model that can be conditioned on an arbitrary number of observations, and that can produce a distribution of temporally consistent growth trajectories on a continuous time axis. On a dataset of 379 patients, the approach successfully captures both global and finer-grained variations in the images, exhibiting superior performance compared to other learned growth models.

Other (8 papers)

【1】 HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields

Authors: Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, Steven M. Seitz Note: Project page: this https URL Link: https://arxiv.org/abs/2106.13228 Abstract: Neural Radiance Fields (NeRF) are able to reconstruct scenes with unprecedented fidelity, and various recent works have extended NeRF to handle dynamic scenes. A common approach to reconstruct such non-rigid scenes is through the use of a learned deformation field mapping from coordinates in each input image into a canonical template coordinate space. However, these deformation-based approaches struggle to model changes in topology, as topological changes require a discontinuity in the deformation field, but these deformation fields are necessarily continuous. We address this limitation by lifting NeRFs into a higher-dimensional space, and by representing the 5D radiance field corresponding to each individual input image as a slice through this "hyper-space". Our method is inspired by level set methods, which model the evolution of surfaces as slices through a higher-dimensional surface. We evaluate our method on two tasks: (i) interpolating smoothly between "moments", i.e., configurations of the scene seen in the input images, while maintaining visual plausibility, and (ii) novel-view synthesis at fixed moments. We show that our method, which we dub HyperNeRF, outperforms existing methods on both tasks by significant margins. Compared to Nerfies, HyperNeRF reduces average error rates by 8.6% for interpolation and 8.8% for novel-view synthesis, as measured by LPIPS.

【2】 When Differential Privacy Meets Interpretability: A Case Study

Authors: Rakshit Naidu, Aman Priyanshu, Aadith Kumar, Sasikanth Kotti, Haofan Wang, Fatemehsadat Mireshghallah Note: 4 pages, 7 figures; extended abstract presented at RCV-CVPR'21 Link: https://arxiv.org/abs/2106.13203 Abstract: Given the increase in the use of personal data for training Deep Neural Networks (DNNs) in tasks such as medical imaging and diagnosis, differentially private training of DNNs is surging in importance, and there is a large body of work focused on providing a better privacy-utility trade-off. However, little attention is given to the interpretability of these models, and how the application of DP affects the quality of interpretations. We propose an extensive study into the effects of DP training on DNNs, especially on medical imaging applications, on the APTOS dataset.
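For context, differentially private training of a DNN usually means DP-SGD: clip each per-sample gradient to a maximum norm, then add calibrated Gaussian noise before the update. The hand-rolled step below shows that standard recipe; the hyperparameters are placeholders, not the study's settings, and in practice libraries such as Opacus automate this together with privacy accounting.

```python
# One hand-rolled DP-SGD step: clip each per-sample gradient to max_norm,
# then add Gaussian noise before averaging. Hyperparameters are placeholders.
import torch

def dp_sgd_step(model, loss_fn, xs, ys, opt, max_norm=1.0, noise_mult=1.1):
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):                       # per-sample gradients
        opt.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        total = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = torch.clamp(max_norm / (total + 1e-12), max=1.0)
        for g, p in zip(grads, model.parameters()):
            g.add_(p.grad * scale)                 # clip, then accumulate
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            g.add_(torch.randn_like(g), alpha=noise_mult * max_norm)
            p.grad = g / len(xs)                   # noisy averaged gradient
    opt.step()
```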

【3】 Sparse Needlets for Lighting Estimation with Spherical Transport Loss

Authors: Fangneng Zhan, Changgong Zhang, Wenbo Hu, Shijian Lu, Feiying Ma, Xuansong Xie, Ling Shao Affiliations: Nanyang Technological University; DAMO Academy, Alibaba Group; The Chinese University of Hong Kong; Inception Institute of Artificial Intelligence Note: 11 pages, 7 figures Link: https://arxiv.org/abs/2106.13090 Abstract: Accurate lighting estimation is challenging yet critical to many computer vision and computer graphics tasks such as high-dynamic-range (HDR) relighting. Existing approaches model lighting in either the frequency domain or the spatial domain, which is insufficient to represent the complex lighting conditions in scenes and tends to produce inaccurate estimates. This paper presents NeedleLight, a new lighting estimation model that represents illumination with needlets and allows lighting estimation in the frequency domain and the spatial domain jointly. An optimal thresholding function is designed to achieve sparse needlets, which trims redundant lighting parameters and demonstrates superior localization properties for illumination representation. In addition, a novel spherical transport loss is designed based on optimal transport theory, which guides the regression of lighting representation parameters with consideration of spatial information. Furthermore, we propose a new metric that is concise yet effective, directly evaluating the estimated illumination maps rather than rendered images. Extensive experiments show that NeedleLight achieves superior lighting estimation consistently across multiple evaluation metrics compared with state-of-the-art methods.
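The sparsification step can be pictured with a generic soft-threshold on the coefficient expansion: small coefficients are zeroed, large ones shrunk, leaving a sparse, well-localized representation. The paper derives an optimal thresholding function; the fixed `lam` below is a hand-picked stand-in.

```python
# Generic soft-thresholding of a coefficient expansion: zero out small
# coefficients and shrink the rest. The fixed lam is a placeholder for
# the paper's optimal threshold.
import numpy as np

def soft_threshold(coeffs, lam):
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - lam, 0.0)

coeffs = np.random.randn(512)            # stand-in for needlet coefficients
sparse = soft_threshold(coeffs, lam=0.8)
print(np.count_nonzero(sparse), "of", coeffs.size, "coefficients survive")
```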

【4】 Symmetric Wasserstein Autoencoders

Authors: Sun Sun, Hongyu Guo Affiliations: National Research Council Canada, Ottawa, ON, Canada Note: Accepted by UAI 2021 Link: https://arxiv.org/abs/2106.13024 Abstract: Leveraging the framework of Optimal Transport, we introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders (SWAEs). We propose to symmetrically match the joint distributions of the observed data and the latent representation induced by the encoder and the decoder. The resulting algorithm jointly optimizes the modelling losses in both the data and the latent spaces, with the loss in the data space leading to a denoising effect. With the symmetric treatment of the data and the latent representation, the algorithm implicitly preserves the local structure of the data in the latent space. To further improve the quality of the latent representation, we incorporate a reconstruction loss into the objective, which significantly benefits both generation and reconstruction. We empirically show the superior performance of SWAEs over state-of-the-art generative autoencoders in terms of classification, reconstruction, and generation.
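Schematically, the symmetric treatment means penalizing reconstruction in both directions: x mapped through the latent space and back in data space, and a prior sample z mapped through the data space and back in latent space. The sketch below conveys only that structure; the weighting and the paper's actual optimal-transport divergence terms are omitted, and the toy encoder/decoder are placeholders.

```python
# Schematic two-way objective: reconstruct x through the latent space and
# reconstruct z (drawn from the prior) through the data space. The paper's
# optimal-transport terms and weightings are omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F

def symmetric_loss(enc, dec, x, z_prior):
    data_loss = F.mse_loss(dec(enc(x)), x)                 # x -> z -> x'
    latent_loss = F.mse_loss(enc(dec(z_prior)), z_prior)   # z -> x -> z'
    return data_loss + latent_loss

enc = nn.Linear(784, 16)   # toy encoder/decoder stand-ins
dec = nn.Linear(16, 784)
loss = symmetric_loss(enc, dec, torch.randn(8, 784), torch.randn(8, 16))
```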

【5】 Regularisation for PCA- and SVD-type matrix factorisations

Authors: Abdolrahman Khoshrou, Eric J. Pauwels Affiliations: Centrum Wiskunde & Informatica, Amsterdam, The Netherlands; Department of Mathematics and Computer Science, Delft University of Technology, The Netherlands Link: https://arxiv.org/abs/2106.12955 Abstract: Singular Value Decomposition (SVD) and its close relative, Principal Component Analysis (PCA), are well-known linear matrix decomposition techniques that are widely used in applications such as dimension reduction and clustering. However, an important limitation of SVD/PCA is its sensitivity to noise in the input data. In this paper, we take another look at the problem of regularisation and show that different formulations of the minimisation problem lead to qualitatively different solutions.
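As a concrete illustration of how different formulations behave differently, compare two textbook regularisations of an SVD reconstruction: hard truncation keeps the top-k singular values unchanged, while soft thresholding (the proximal operator of the nuclear norm) shrinks all of them. These are standard examples for intuition, not necessarily the formulations analysed in the paper.

```python
# Two common regularisations of an SVD-based reconstruction, which give
# qualitatively different solutions from the same decomposition.
import numpy as np

A = np.random.randn(50, 30)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_trunc = (U[:, :k] * s[:k]) @ Vt[:k]      # hard truncation: keep top-k exactly

lam = 0.5
s_soft = np.maximum(s - lam, 0.0)          # soft thresholding: shrink everything
A_soft = (U * s_soft) @ Vt

print(np.linalg.matrix_rank(A_trunc), np.linalg.matrix_rank(A_soft))
```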

【6】 Fast Monte Carlo Rendering via Multi-Resolution Sampling

Authors: Qiqi Hou, Zhan Li, Carl S Marshall, Selvakumar Panneer, Feng Liu Affiliations: Portland State University; Intel Note: Graphics Interface 2021 Link: https://arxiv.org/abs/2106.12802 Abstract: Monte Carlo rendering algorithms are widely used to produce photorealistic computer graphics images. However, these algorithms need to sample a substantial number of rays per pixel to enable proper global illumination, and thus require an immense amount of computation. In this paper, we present a hybrid rendering method to speed up Monte Carlo rendering algorithms. Our method first generates two versions of a rendering: one at a low resolution with a high sample rate (LRHS) and the other at a high resolution with a low sample rate (HRLS). We then develop a deep convolutional neural network to fuse these two renderings into a high-quality image as if it were rendered at a high resolution with a high sample rate. Specifically, we formulate this fusion task as a super-resolution problem that generates a high-resolution rendering from a low-resolution input (LRHS), assisted by the HRLS rendering. The HRLS rendering provides critical high-frequency details which are difficult to recover from the LRHS for any super-resolution method. Our experiments show that our hybrid rendering algorithm is significantly faster than state-of-the-art Monte Carlo denoising methods while rendering high-quality images, when tested on both our own BCR dataset and the Gharbi dataset. Code: https://github.com/hqqxyy/msspl
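The fusion step can be sketched as a small super-resolution network: bilinearly upsample the LRHS render to the HRLS resolution, concatenate the two, and predict the fused frame. The `FusionNet` name and layer sizes below are illustrative placeholders, not the paper's network.

```python
# Minimal sketch of the LRHS/HRLS fusion: upsample the low-resolution,
# high-sample-rate render, concatenate with the high-resolution,
# low-sample-rate render, and predict the fused frame with a small CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, lrhs, hrls):
        up = F.interpolate(lrhs, size=hrls.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.net(torch.cat([up, hrls], dim=1))

out = FusionNet()(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 3, 512, 512])
```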

【7】 Florida Wildlife Camera Trap Dataset

Authors: Crystal Gagne, Jyoti Kini, Daniel Smith, Mubarak Shah Affiliations: University of Central Florida, Orlando, Florida Note: IEEE Conference on Computer Vision and Pattern Recognition, CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling Workshop, 2021 Link: https://arxiv.org/abs/2106.12628 Abstract: Trail camera imagery has increasingly gained popularity amongst biologists for conservation and ecological research. The minimal human interference required to operate camera traps allows capturing unbiased species activities. Several studies - based on human and wildlife interactions, migratory patterns of various species, and the risk of extinction in endangered populations - are limited by the lack of rich data and the time-consuming nature of manually annotating trail camera imagery. We introduce a challenging wildlife camera trap classification dataset collected from two different locations in Southwestern Florida, consisting of 104,495 images featuring visually similar species, varying illumination conditions, a skewed class distribution, and samples of endangered species, i.e. Florida panthers. Experimental evaluations with the ResNet-50 architecture indicate that this image classification dataset can further push advancements in wildlife statistical modeling. We will make the dataset publicly available.
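A typical baseline for such a dataset, and one common way to handle the skewed class distribution the authors mention, is a fine-tuned ResNet-50 with a weighted sampler that oversamples rare species. The directory path below is a hypothetical placeholder, and the torchvision weights string assumes torchvision 0.13 or newer; this is a generic sketch, not the authors' training setup.

```python
# Fine-tuning sketch for a skewed camera-trap dataset: replace the
# classifier head and oversample rare species with a weighted sampler.
# The dataset path is a placeholder.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
ds = datasets.ImageFolder("florida_camera_traps/train", transform=tfm)

counts = torch.bincount(torch.tensor(ds.targets))
weights = (1.0 / counts.float())[ds.targets]   # rare classes drawn more often
loader = DataLoader(ds, batch_size=32,
                    sampler=WeightedRandomSampler(weights, num_samples=len(ds)))

model = models.resnet50(weights="IMAGENET1K_V1")   # torchvision >= 0.13
model.fc = nn.Linear(model.fc.in_features, len(ds.classes))
```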

【8】 AVHYAS: A Free and Open Source QGIS Plugin for Advanced Hyperspectral Image Analysis

Authors: Rosly Boy Lyngdoh, Anand S Sahadevan, Touseef Ahmad, Pradyuman Singh Rathore, Manoj Mishra, Praveen Kumar Gupta, Arundhati Misra Affiliations: Hyperspectral Techniques Development Division, Advanced Microwave and Hyperspectral Techniques Development Group, Space Applications Centre, ISRO, Ahmedabad, Gujarat, India Note: Accepted at IEEE International Conference on Emerging Techniques in Computational Intelligence, 2021 Link: https://arxiv.org/abs/2106.12776 Abstract: The Advanced Hyperspectral Data Analysis Software (AVHYAS) plugin is a Python 3-based Quantum GIS (QGIS) plugin designed to process and analyse hyperspectral (Hx) images. It is developed to guarantee full usage of present and future airborne or spaceborne Hx sensors and provides access to advanced algorithms for Hx data processing. The software is freely available and offers a range of basic and advanced tools, such as atmospheric correction (for airborne AVIRIS-NG imagery), standard processing utilities, and powerful machine learning and deep learning interfaces for Hx data analysis.
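As a flavor of the per-pixel analyses such toolboxes typically expose, here is a plain-numpy Spectral Angle Mapper, a staple of hyperspectral classification. This is a generic illustration, not AVHYAS's actual API; the cube shape and reference spectrum are made up for the demo.

```python
# Spectral Angle Mapper (SAM): score each pixel by the angle between its
# spectrum and a reference spectrum. Generic numpy, not the plugin's API.
import numpy as np

def spectral_angle(cube, ref):
    # cube: (H, W, B) hyperspectral image, ref: (B,) reference spectrum
    num = np.tensordot(cube, ref, axes=([2], [0]))
    den = np.linalg.norm(cube, axis=-1) * np.linalg.norm(ref) + 1e-12
    return np.arccos(np.clip(num / den, -1.0, 1.0))   # radians per pixel

cube = np.random.rand(64, 64, 200)        # toy 200-band cube
angles = spectral_angle(cube, cube[10, 10])
print(angles.shape)  # (64, 64); small angles mean spectrally similar pixels
```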
