Computer Vision arXiv Daily [8.19]

2021-08-24 16:36:19


cs.CV: 71 papers today

Transformer (1 paper)

【1】 Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net Link: https://arxiv.org/abs/2108.07851

Authors: Yu Qiu, Yun Liu, Le Zhang, Jing Xu Affiliations: Nankai University; School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC) Abstract: Existing salient object detection (SOD) models mainly rely on CNN-based U-shaped structures with skip connections to combine the global contexts and local spatial details that are crucial for locating salient objects and refining object details, respectively. Despite great successes, the ability of CNN in learning global contexts is limited. Recently, the vision transformer has achieved revolutionary progress in computer vision owing to its powerful modeling of global dependencies. However, directly applying the transformer to SOD is obviously suboptimal because the transformer lacks the ability to learn local spatial representations. To this end, this paper explores the combination of transformer and CNN to learn both global and local representations for SOD. We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net). The asymmetric bilateral encoder has a transformer path and a lightweight CNN path, where the two paths communicate at each encoder stage to learn complementary global contexts and local spatial details, respectively. The asymmetric bilateral decoder also consists of two paths to process features from the transformer and CNN encoder paths, with communication at each decoder stage for decoding coarse salient object locations and fine-grained object details, respectively. Such communication between the two encoder/decoder paths enables ABiU-Net to learn complementary global and local representations, taking advantage of the natural properties of transformer and CNN, respectively. Hence, ABiU-Net provides a new perspective for transformer-based SOD. Extensive experiments demonstrate that ABiU-Net performs favorably against previous state-of-the-art SOD methods. The code will be released.
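
The description above suggests a two-path encoder stage with cross-path communication. Below is a minimal PyTorch sketch of that idea; the layer sizes, the 1x1-conv fusion, and the module name are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class BilateralEncoderStage(nn.Module):
    """Hypothetical encoder stage: a transformer path (global context) and a
    lightweight CNN path (local detail) that exchange features once per stage."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)
        self.fuse_t2c = nn.Conv2d(2 * dim, dim, 1)  # inject global context into CNN path
        self.fuse_c2t = nn.Conv2d(2 * dim, dim, 1)  # inject local detail into transformer path

    def forward(self, x_t, x_c):
        b, d, h, w = x_t.shape
        tokens = x_t.flatten(2).transpose(1, 2)               # (B, HW, D) token sequence
        x_t = self.attn(tokens).transpose(1, 2).reshape(b, d, h, w)
        x_c = self.cnn(x_c)
        # communication: each path receives the other's complementary features
        return (self.fuse_c2t(torch.cat([x_t, x_c], dim=1)),
                self.fuse_t2c(torch.cat([x_c, x_t], dim=1)))

x_t, x_c = BilateralEncoderStage()(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```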

Detection (13 papers)

【1】 LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector Link: https://arxiv.org/abs/2108.08258

Authors: Xiaoyang Guo, Shaoshuai Shi, Xiaogang Wang, Hongsheng Li Affiliations: CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong Note: ICCV'21 Abstract: Stereo-based 3D detection aims at detecting 3D object bounding boxes from stereo images using intermediate depth maps or implicit 3D geometry representations, which provides a low-cost solution for 3D perception. However, its performance is still inferior compared with LiDAR-based detection algorithms. To detect and localize accurate 3D bounding boxes, LiDAR-based models can encode accurate object boundaries and surface normal directions from LiDAR point clouds. However, the detection results of stereo-based detectors are easily affected by erroneous depth features due to the limitation of stereo matching. To solve the problem, we propose LIGA-Stereo (LiDAR Geometry Aware Stereo Detector) to learn stereo-based 3D detectors under the guidance of high-level geometry-aware representations of LiDAR-based detection models. In addition, we find that existing voxel-based stereo detectors fail to learn semantic features effectively from indirect 3D supervision. We attach an auxiliary 2D detection head to provide direct 2D semantic supervision. Experimental results show that the above two strategies improve the geometric and semantic representation capabilities. Compared with the state-of-the-art stereo detector, our method improves the 3D detection performance of cars, pedestrians, and cyclists by 10.44%, 5.69%, and 5.97% mAP, respectively, on the official KITTI benchmark. The gap between stereo-based and LiDAR-based 3D detectors is further narrowed.
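
The "guidance" here is a form of cross-modal feature distillation. A hedged sketch of what such an imitation loss could look like, using a plain masked L2 distance to a frozen LiDAR teacher (LIGA-Stereo's actual loss, feature spaces, and masking may differ):

```python
import torch

def imitation_loss(stereo_feats, lidar_feats, fg_mask):
    """Pull the stereo network's features toward the frozen LiDAR teacher's
    features, supervising only foreground regions (all shapes are assumed)."""
    diff = (stereo_feats - lidar_feats.detach()) ** 2   # teacher is not updated
    return (diff * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)

stereo = torch.randn(2, 64, 100, 100, requires_grad=True)  # student BEV features
lidar = torch.randn(2, 64, 100, 100)                       # teacher BEV features
mask = (torch.rand(2, 1, 100, 100) > 0.8).float()          # foreground mask
loss = imitation_loss(stereo, lidar, mask)
loss.backward()
```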

【2】 Deployment of Deep Neural Networks for Object Detection on Edge AI Devices with Runtime Optimization Link: https://arxiv.org/abs/2108.08166

Authors: Lukas Stäcker, Juncong Fei, Philipp Heidenreich, Frank Bonarens, Jason Rambach, Didier Stricker, Christoph Stiller Affiliations: Stellantis, Opel Automobile GmbH, Germany; German Research Center for Artificial Intelligence, Germany; Institute of Measurement and Control Systems, Karlsruhe Institute of Technology, Germany (equal contribution) Note: To be presented at ICCV 2021 (ERCVAD Workshop) Abstract: Deep neural networks have proven increasingly important for automotive scene understanding, with new algorithms offering constant improvements in detection performance. However, there is little emphasis on experiences and needs for deployment in embedded environments. We therefore perform a case study of the deployment of two representative object detection networks on an edge AI platform. In particular, we consider RetinaNet for image-based 2D object detection and PointPillars for LiDAR-based 3D object detection. We describe the modifications necessary to convert the algorithms from a PyTorch training environment to the deployment environment, taking into account the available tools. We evaluate the runtime of the deployed DNNs using two different libraries, TensorRT and TorchScript. In our experiments, we observe slight advantages of TensorRT for convolutional layers and TorchScript for fully connected layers. We also study the trade-off between runtime and performance when selecting an optimized setup for deployment, and observe that quantization significantly reduces the runtime while having only little impact on the detection performance.
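
The TorchScript half of the runtime comparison can be reproduced in outline as below; torchvision's RetinaNet and the timing loop are illustrative stand-ins for the authors' exact deployment setup (the TensorRT path, which typically goes through an ONNX export, is omitted).

```python
import time
import torch
import torchvision

# Compile an eval-mode detector to TorchScript and save a deployable artifact.
model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True).eval()
scripted = torch.jit.script(model)
scripted.save("retinanet_scripted.pt")

x = [torch.rand(3, 512, 512)]          # detection models take a list of images
with torch.no_grad():
    for _ in range(5):                 # warm-up before timing
        scripted(x)
    t0 = time.perf_counter()
    for _ in range(20):
        scripted(x)
print(f"mean latency: {(time.perf_counter() - t0) / 20 * 1e3:.1f} ms")
```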

【3】 Specificity-preserving RGB-D Saliency Detection Link: https://arxiv.org/abs/2108.08162

Authors: Tao Zhou, Huazhu Fu, Geng Chen, Yi Zhou, Deng-Ping Fan, Ling Shao Affiliations: PCA Lab, Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of MoE, School of Computer Science and Engineering, Nanjing University of Science and Technology, China; IIAI, Abu Dhabi, UAE Note: Accepted by ICCV 2021 Abstract: RGB-D saliency detection has attracted increasing attention, due to its effectiveness and the fact that depth cues can now be conveniently captured. Existing works often focus on learning a shared representation through various fusion strategies, with few methods explicitly considering how to preserve modality-specific characteristics. In this paper, taking a new perspective, we propose a specificity-preserving network (SP-Net) for RGB-D saliency detection, which benefits saliency detection performance by exploring both the shared information and modality-specific properties (e.g., specificity). Specifically, two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps. A cross-enhanced integration module (CIM) is proposed to fuse cross-modal features in the shared learning network, which are then propagated to the next layer for integrating cross-level information. Besides, we propose a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder, which can provide rich complementary multi-modal information to boost the saliency detection performance. Further, skip connections are used to combine hierarchical features between the encoder and decoder layers. Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods. Code is available at: https://github.com/taozh2017/SPNet.

【4】 Towards Deep and Efficient: A Deep Siamese Self-Attention Fully Efficient Convolutional Network for Change Detection in VHR Images Link: https://arxiv.org/abs/2108.08157

Authors: Hongruixuan Chen, Chen Wu, Bo Du Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China; School of Computer Science, Wuhan University, Wuhan, China Abstract: Recently, FCNs have attracted widespread attention in the change detection (CD) field. In pursuit of better CD performance, it has become a tendency to design deeper and more complicated FCNs, which inevitably brings huge numbers of parameters and an unbearable computational burden. With the goal of designing a very deep architecture that obtains more precise CD results while decreasing parameter numbers to improve efficiency, in this work, we present a very deep and efficient CD network, entitled EffCDNet. In EffCDNet, to reduce the numerous parameters associated with the deep architecture, an efficient convolution consisting of depth-wise convolution and group convolution with a channel shuffle mechanism is introduced to replace standard convolutional layers. In terms of the specific network architecture, EffCDNet does not use the mainstream UNet-like architecture, but rather adopts an architecture with a very deep encoder and a lightweight decoder. In the very deep encoder, two very deep siamese streams stacked by efficient convolution first extract two highly representative and informative feature maps from input image pairs. Subsequently, an efficient ASPP module is designed to capture multi-scale change information. In the lightweight decoder, a recurrent criss-cross self-attention (RCCA) module is applied to efficiently utilize non-local similar feature representations to enhance the discriminability of each pixel, thus effectively separating the changed and unchanged regions. Moreover, to tackle the optimization problem for confusing pixels, two novel loss functions based on information entropy are presented. On two challenging CD datasets, our approach outperforms other SOTA FCN-based methods, with only benchmark-level parameter numbers and quite low computational overhead.
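
The "efficient convolution" named here is the familiar depthwise + group convolution pairing with a channel shuffle in between; a sketch of such a block follows (EffCDNet's exact block layout is an assumption).

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups so information can flow between them."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class EfficientConv(nn.Module):
    """Depthwise conv + grouped 1x1 conv with channel shuffle, as a cheap
    drop-in replacement for a standard 3x3 convolution."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1, groups=groups)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        x = channel_shuffle(self.pointwise(self.depthwise(x)), self.groups)
        return torch.relu(self.bn(x))

y = EfficientConv(64)(torch.randn(1, 64, 128, 128))   # same spatial/channel shape out
```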

【5】 Optimising Knee Injury Detection with Spatial Attention and Validating Localisation Ability Link: https://arxiv.org/abs/2108.08136

Authors: Niamh Belton, Ivan Welaratne, Adil Dahlan, Ronan T Hearne, Misgina Tsighe Hagos, Aonghus Lawlor, Kathleen M. Curran Affiliations: Science Foundation Ireland Centre for Research Training in Machine Learning; School of Medicine, University College Dublin; Department of Radiology, Mater Misericordiae University Hospital, Dublin; School of Electronic Engineering, University College Dublin Abstract: This work employs a pre-trained, multi-view Convolutional Neural Network (CNN) with a spatial attention block to optimise knee injury detection. An open-source Magnetic Resonance Imaging (MRI) data set with image-level labels was leveraged for this analysis. As MRI data is acquired from three planes, we compare our technique using data from a single plane and multiple planes (multi-plane). For multi-plane, we investigate various methods of fusing the planes in the network. This analysis resulted in the novel 'MPFuseNet' network and state-of-the-art Area Under the Curve (AUC) scores for detecting Anterior Cruciate Ligament (ACL) tears and Abnormal MRIs, achieving AUC scores of 0.977 and 0.957 respectively. We then developed an objective metric, Penalised Localisation Accuracy (PLA), to validate the model's localisation ability. This metric compares binary masks generated from Grad-CAM output and the radiologist's annotations on a sample of MRIs. We also extracted explainability features in a model-agnostic approach that were then verified as clinically relevant by the radiologist.

【6】 Multi-patch Feature Pyramid Network for Weakly Supervised Object Detection in Optical Remote Sensing Images Link: https://arxiv.org/abs/2108.08063

Authors: Pourya Shamsolmoali, Jocelyn Chanussot, Masoumeh Zareapoor, Huiyu Zhou, Jie Yang Affiliations: Université Grenoble Alpes Abstract: Object detection is a challenging task in remote sensing because objects only occupy a few pixels in the images, and the models are required to simultaneously learn object locations and detection. Even though the established approaches perform well for objects of regular sizes, they achieve weak performance when analyzing small ones or get stuck in local minima (e.g., false object parts). Two possible issues stand in their way. First, the existing methods struggle to perform stably on the detection of small objects because of the complicated background. Second, most of the standard methods use hand-crafted features and do not work well on detecting objects when parts of them are missing. We here address the above issues and propose a new architecture with a multiple patch feature pyramid network (MPFP-Net). Different from current models that during training only pursue the most discriminative patches, in MPFP-Net the patches are divided into class-affiliated subsets, in which the patches are related; based on the primary loss function, a sequence of smooth loss functions is determined for the subsets to improve the model for collecting small object parts. To enhance the feature representation for patch selection, we introduce an effective method to regularize the residual values and make the fusion transition layers strictly norm-preserving. The network contains bottom-up and crosswise connections to fuse the features of different scales to achieve better accuracy, compared to several state-of-the-art object detection models. Also, the developed architecture is more efficient than the baselines.

【7】 Few-Shot Batch Incremental Road Object Detection via Detector Fusion Link: https://arxiv.org/abs/2108.08048

Authors: Anuj Tambwekar, Kshitij Agrawal, Anay Majee, Anbumani Subramanian Affiliations: PES University, Intel Corporation Note: Accepted at the 2nd Autonomous Vehicle Vision Workshop, ICCV 2021 Abstract: Incremental few-shot learning has emerged as a new and challenging area in deep learning, whose objective is to train deep learning models using very few samples of new class data, and none of the old class data. In this work we tackle the problem of batch incremental few-shot road object detection using data from the India Driving Dataset (IDD). Our approach, DualFusion, combines object detectors in a manner that allows us to learn to detect rare objects with very limited data, all without severely degrading the performance of the detector on the abundant classes. In the IDD OpenSet incremental few-shot detection task, we achieve a mAP50 score of 40.0 on the base classes and an overall mAP50 score of 38.8, both of which are the highest to date. In the COCO batch incremental few-shot detection task, we achieve a novel AP score of 9.9, surpassing the state-of-the-art novel-class performance on the same by over 6.6 times.

【8】 Unbiased IoU for Spherical Image Object Detection Link: https://arxiv.org/abs/2108.08029

Authors: Qiang Zhao, Bin Chen, Hang Xu, Yike Ma, Xiaodong Li, Bailan Feng, Chenggang Yan, Feng Dai Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Hangzhou Dianzi University, Hangzhou, China; Huawei Technologies Co., Ltd Abstract: As one of the most fundamental and challenging problems in computer vision, object detection tries to locate object instances and find their categories in natural images. The most important step in the evaluation of object detection algorithms is calculating the intersection-over-union (IoU) between the predicted bounding box and the ground-truth one. Although this procedure is well-defined and solved for planar images, it is not easy for spherical image object detection. Existing methods either compute the IoUs based on biased bounding box representations or make excessive approximations, and thus give incorrect results. In this paper, we first identify that spherical rectangles are unbiased bounding boxes for objects in spherical images, and then propose an analytical method for IoU calculation without any approximations. Based on the unbiased representation and calculation, we also present an anchor-free object detection algorithm for spherical images. The experiments on two spherical object detection datasets show that the proposed method can achieve better performance than existing methods.
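
For contrast, the planar evaluation primitive the paper generalizes is the ordinary axis-aligned IoU below; the spherical version replaces these rectangle areas (and the intersection) with analytically computed spherical-rectangle areas.

```python
def planar_iou(box_a, box_b):
    """Axis-aligned IoU on the plane; boxes are (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # clamp to 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(planar_iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 / 7 ≈ 0.143
```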

【9】 WRICNet: A Weighted Rich-scale Inception Coder Network for Multi-Resolution Remote Sensing Image Change Detection Link: https://arxiv.org/abs/2108.07955

Authors: Yu Jiang, Lei Hu, Yongmei Zhang, Xin Yang Affiliations: Jiangxi Normal University, Nanchang, China; University of Technology, Beijing, China Abstract: Most remote sensing image change detection models achieve good results only on datasets of a specific resolution. To improve a model's change detection effectiveness on multi-resolution datasets, this article proposes a weighted rich-scale inception coder network (WRICNet), which effectively fuses shallow and deep multi-scale features. The proposed weighted rich-scale inception module obtains shallow multi-scale features, and the weighted rich-scale coder module obtains deep multi-scale features. The weighted scale block assigns appropriate weights to features of different scales, which strengthens the expressive ability of the edges of changing areas. Performance experiments on a multi-resolution dataset demonstrate that, compared to the competing methods, the proposed network further reduces false alarms outside the change area and missed alarms within it; in addition, the edge of the change area is more accurate. An ablation study shows that the training strategy and the improvements of this article increase the effectiveness of change detection.

【10】 PLAD: A Dataset for Multi-Size Power Line Assets Detection in High-Resolution UAV Images Link: https://arxiv.org/abs/2108.07944

Authors: André Luiz Buarque Vieira-e-Silva, Heitor de Castro Felix, Thiago de Menezes Chaves, Francisco Paulo Magalhães Simões, Veronica Teichrieb, Michel Mozinho dos Santos, Hemir da Cunha Santiago, Virginia Adélia Cordeiro Sgotti, Henrique Baptista Duffles Teixeira Lott Neto Affiliations: Voxar Labs, Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil; Departamento de Computação, Universidade Federal Rural de Pernambuco, Recife, Brazil; In Forma Software, Recife, Brazil Note: Accepted for presentation at SIBGRAPI 2021 Abstract: Many power line companies are using UAVs to perform their inspection processes instead of putting their workers at risk by making them climb high-voltage power line towers, for instance. A crucial task for the inspection is to detect and classify assets in the power transmission lines. However, public data related to power line assets are scarce, preventing a faster evolution of this area. This work proposes the Power Line Assets Dataset (PLAD), containing high-resolution and real-world images of multiple high-voltage power line components. It has 2,409 annotated objects divided into five classes: transmission tower, insulator, spacer, tower plate, and Stockbridge damper, which vary in size (resolution), orientation, illumination, angulation, and background. This work also presents an evaluation with popular deep object detection methods, showing considerable room for improvement. The PLAD dataset is publicly available at https://github.com/andreluizbvs/PLAD.

【11】 Affect-Aware Deep Belief Network Representations for Multimodal Unsupervised Deception Detection Link: https://arxiv.org/abs/2108.07897

Authors: Leena Mathur, Maja J Matarić Affiliations: Department of Computer Science, University of Southern California Abstract: Automated systems that detect the social behavior of deception can enhance human well-being across medical, social work, and legal domains. Labeled datasets to train supervised deception detection models can rarely be collected for real-world, high-stakes contexts. To address this challenge, we propose the first unsupervised approach for detecting real-world, high-stakes deception in videos without requiring labels. This paper presents our novel approach for affect-aware unsupervised Deep Belief Networks (DBN) to learn discriminative representations of deceptive and truthful behavior. Drawing on psychology theories that link affect and deception, we experimented with unimodal and multimodal DBN-based approaches trained on facial valence, facial arousal, audio, and visual features. In addition to using facial affect as a feature on which DBN models are trained, we also introduce a DBN training procedure that uses facial affect as an aligner of audio-visual representations. We conducted classification experiments with unsupervised Gaussian Mixture Model clustering to evaluate our approaches. Our best unsupervised approach (trained on facial valence and visual features) achieved an AUC of 80%, outperforming human ability and performing comparably to fully-supervised models. Our results motivate future work on unsupervised, affect-aware computational approaches for detecting deception and other social behaviors in the wild.
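
The unsupervised evaluation step, clustering learned representations with a Gaussian Mixture Model, can be sketched as follows; random vectors stand in for the DBN embeddings, and the component count follows the two-class (deceptive/truthful) setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 32))     # placeholder for (num_videos, embedding_dim)

# Two components for the deceptive vs. truthful hypothesis; no labels used.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
cluster_ids = gmm.fit_predict(features)
print(np.bincount(cluster_ids))           # cluster sizes
```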

【12】 Gastric Cancer Detection from X-ray Images Using Effective Data Augmentation and Hard Boundary Box Training Link: https://arxiv.org/abs/2108.08158

Authors: Hideaki Okamoto, Takakiyo Nomura, Kazuhito Nabeshima, Jun Hashimoto, Hitoshi Iyatomi Affiliations: Graduate School of Science and Engineering, Hosei University Note: 9 pages, 6 figures Abstract: X-ray examination is suitable for screening of gastric cancer. Compared to endoscopy, which can only be performed by doctors, X-ray imaging can also be performed by radiographers, so more patients can be examined. However, the diagnostic accuracy of gastric radiographs is as low as 85%. To address this problem, highly accurate and quantitative automated diagnosis using machine learning needs to be performed. This paper proposes a diagnostic support method for detecting gastric cancer sites from X-ray images with high accuracy. The method's two new technical proposals are (1) stochastic functional gastric image augmentation (sfGAIA), and (2) hard boundary box training (HBBT). The former is a probabilistic enhancement of gastric folds in X-ray images based on medical knowledge, whereas the latter is a recursive retraining technique to reduce false positives. We use 4,724 gastric radiographs of 145 patients in clinical practice and evaluate the cancer detection performance of the method in a patient-based five-group cross-validation. The proposed sfGAIA and HBBT significantly enhance the performance of the EfficientDet-D7 network by 5.9% in terms of the F1-score, and our screening method reaches a practical screening capability for gastric cancer (F1: 57.8%, recall: 90.2%, precision: 42.5%).

【13】 DRDrV3: Complete Lesion Detection in Fundus Images Using Mask R-CNN, Transfer Learning, and LSTM Link: https://arxiv.org/abs/2108.08095

Authors: Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, M. Hadi Amini, Thiab Taha, Khaled Rasheed, Hamid R. Arabnia Affiliations: Department of Computer Science, University of Georgia, Athens, GA; Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL Note: The 7th International Conference on Health Informatics & Medical Systems (HIMS'21: July 26-29, 2021, USA) Abstract: Medical imaging is one of the growing fields in the world of computer vision. In this study, we aim to address the Diabetic Retinopathy (DR) problem, one of the open challenges in medical imaging. We propose a new lesion detection architecture, comprising two sub-modules, that detects not only the types of lesions caused by DR, their corresponding bounding boxes, and their masks, but also the severity level of the overall case. Aside from traditional accuracy, we also use two popular evaluation criteria to evaluate the outputs of our models: intersection over union (IoU) and mean average precision (mAP). We hypothesize that this new solution enables specialists to detect lesions with high confidence and estimate the severity of the damage with high accuracy.

Classification & Recognition (8 papers)

【1】 Masked Face Recognition Challenge: The InsightFace Track Report Link: https://arxiv.org/abs/2108.08191

Authors: Jiankang Deng, Jia Guo, Xiang An, Zheng Zhu, Stefanos Zafeiriou Affiliations: Imperial College London, InsightFace, Tsinghua University Note: The WebFace260M track of the ICCV-21 MFR Challenge is still open at this https URL Abstract: During the COVID-19 coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to deep face recognition. In this workshop, we organize the Masked Face Recognition (MFR) challenge and focus on benchmarking deep face recognition methods under the existence of facial masks. In the MFR challenge, there are two main tracks: the InsightFace track and the WebFace260M track. For the InsightFace track, we manually collect a large-scale masked face test set with 7K identities. In addition, we also collect a children test set including 14K identities and a multi-racial test set containing 242K identities. By using these three test sets, we build up an online model testing system, which can give a comprehensive evaluation of face recognition models. To avoid data privacy problems, no test image is released to the public. As the challenge is still ongoing, we will keep updating the top-ranked solutions as well as this report on arXiv.

【2】 Hand Hygiene Video Classification Based on Deep Learning Link: https://arxiv.org/abs/2108.08127

Authors: Rashmi Bakshi Affiliations: School of Electrical and Electronic Engineering, Technological University Dublin, Ireland Abstract: In this work, an extensive review of the gesture recognition literature is carried out, along with the implementation of a simple deep-learning-based classification system for hand hygiene stages. A subset of a robust dataset is utilized, consisting of two-hand handwashing gestures as well as one-hand gestures such as linear hand movement. A pretrained neural network model, ResNet-50 with ImageNet weights, is used to classify 3 categories: linear hand movement, rub hands palm to palm, and rub hands with fingers interlaced. Correct predictions are made for the first two classes with >60% accuracy. A complete dataset, along with an increased number of classes and training steps, will be explored as future work.
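
A minimal sketch of the described setup, written in PyTorch for concreteness (the original implementation may use another framework): ResNet-50 with ImageNet weights, backbone frozen, and the final layer replaced for the three classes.

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(pretrained=True)   # ImageNet weights
for p in model.parameters():                           # freeze the backbone
    p.requires_grad = False
# New head: linear movement / palm-to-palm / fingers interlaced
model.fc = nn.Linear(model.fc.in_features, 3)

logits = model(torch.randn(4, 3, 224, 224))            # (batch, 3) class scores
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 0]))
```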

【3】 Structured Outdoor Architecture Reconstruction by Exploration and Classification Link: https://arxiv.org/abs/2108.07990

Authors: Fuyang Zhang, Xiang Xu, Nelson Nauata, Yasutaka Furukawa Affiliations: Simon Fraser University, BC, Canada Note: 2021 International Conference on Computer Vision (ICCV 2021) Abstract: This paper presents an explore-and-classify framework for structured architectural reconstruction from an aerial image. Starting from a potentially imperfect building reconstruction by an existing algorithm, our approach 1) explores the space of building models by modifying the reconstruction via heuristic actions; 2) learns to classify the correctness of building models while generating classification labels based on the ground truth; and 3) repeats. At test time, we iterate exploration and classification, seeking a result with the best classification score. We evaluate the approach using initial reconstructions by two baselines and two state-of-the-art reconstruction algorithms. Qualitative and quantitative evaluations demonstrate that our approach consistently improves the reconstruction quality from every initial reconstruction.

【4】 SynFace: Face Recognition with Synthetic Data Link: https://arxiv.org/abs/2108.07960

Authors: Haibo Qiu, Baosheng Yu, Dihong Gong, Zhifeng Li, Wei Liu, Dacheng Tao Affiliations: JD Explore Academy, China; The University of Sydney, Australia; Tencent Data Platform, China Note: Accepted by ICCV 2021 Abstract: With the recent success of deep neural networks, remarkable progress has been achieved on face recognition. However, collecting large-scale real-world training data for face recognition has turned out to be challenging, especially due to the label noise and privacy issues. Meanwhile, existing face recognition datasets are usually collected from web images, lacking detailed annotations on attributes (e.g., pose and expression), so the influences of different attributes on face recognition have been poorly investigated. In this paper, we address the above-mentioned issues in face recognition using synthetic face images, i.e., SynFace. Specifically, we first explore the performance gap between recent state-of-the-art face recognition models trained with synthetic and real face images. We then analyze the underlying causes behind the performance gap, e.g., the poor intra-class variations and the domain gap between synthetic and real face images. Inspired by this, we devise the SynFace with identity mixup (IM) and domain mixup (DM) to mitigate the above performance gap, demonstrating the great potential of synthetic data for face recognition. Furthermore, with the controllable face synthesis model, we can easily manage different factors of synthetic face generation, including pose, expression, illumination, the number of identities, and samples per identity. Therefore, we also perform a systematically empirical analysis on synthetic face images to provide some insights on how to effectively utilize synthetic data for face recognition.
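
Identity mixup (IM) presumably follows the standard mixup recipe applied to synthetic identities; a generic sketch, with alpha as an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def identity_mixup(images, labels, num_classes, alpha=0.2):
    """Blend random pairs of face images and their one-hot identity labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    one_hot = F.one_hot(labels, num_classes).float()
    return (lam * images + (1 - lam) * images[perm],
            lam * one_hot + (1 - lam) * one_hot[perm])

x, y = identity_mixup(torch.randn(8, 3, 112, 112),
                      torch.randint(0, 100, (8,)), num_classes=100)
```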

【5】 Adversarial Relighting against Face Recognition Link: https://arxiv.org/abs/2108.07920

Authors: Ruijun Gao, Qing Guo, Qian Zhang, Felix Juefei-Xu, Hongkai Yu, Wei Feng Affiliations: Qing Guo is also with Nanyang Technological University Abstract: Deep face recognition (FR) has achieved significantly high accuracy on several challenging datasets and fosters successful real-world applications, even showing high robustness to the illumination variation that is usually regarded as a main threat to the FR system. However, in the real world, illumination variation caused by diverse lighting conditions cannot be fully covered by limited face datasets. In this paper, we study the threat of lighting against FR from a new angle, i.e., adversarial attack, and identify a new task, i.e., adversarial relighting. Given a face image, adversarial relighting aims to produce a naturally relighted counterpart while fooling state-of-the-art deep FR methods. To this end, we first propose the physical-model-based adversarial relighting attack (ARA) denoted as albedo-quotient-based adversarial relighting attack (AQ-ARA). It generates natural adversarial light under the physical lighting model and guidance of FR systems and synthesizes adversarially relighted face images. Moreover, we propose the auto-predictive adversarial relighting attack (AP-ARA) by training an adversarial relighting network (ARNet) to automatically predict the adversarial light in a one-step manner according to different input faces, allowing efficiency-sensitive applications. More importantly, we propose to transfer the above digital attacks to physical ARA (Phy-ARA) through a precise relighting device, making the estimated adversarial lighting condition reproducible in the real world. We validate our methods on three state-of-the-art deep FR methods, i.e., FaceNet, ArcFace, and CosFace, on two public datasets. The extensive and insightful results demonstrate that our work can generate realistic adversarial relighted face images that fool FR easily, revealing the threat of specific light directions and strengths.

【6】 Activity Recognition for Autism Diagnosis Link: https://arxiv.org/abs/2108.07917

Authors: Anish Lakkapragada, Peter Washington, Dennis Wall Affiliations: Lynbrook High School, San Jose, CA; Stanford University, Stanford, CA Abstract: A formal autism diagnosis is an inefficient and lengthy process. Families often have to wait years before receiving a diagnosis for their child; some may not receive one at all due to this delay. One approach to this problem is to use digital technologies to detect the presence of behaviors related to autism, which in aggregate may lead to remote and automated diagnostics. One of the strongest indicators of autism is stimming, which is a set of repetitive, self-stimulatory behaviors such as hand flapping, headbanging, and spinning. Using computer vision to detect hand flapping is especially difficult due to the sparsity of public training data in this space and excessive shakiness and motion in such data. Our work demonstrates a novel method that overcomes these issues: we use hand landmark detection over time as a feature representation which is then fed into a Long Short-Term Memory (LSTM) model. We achieve a validation accuracy and F1 score of about 72% on detecting whether videos from the Self-Stimulatory Behaviour Dataset (SSBD) contain hand flapping or not. Our best model also predicts accurately on external videos we recorded of ourselves outside of the dataset it was trained on. This model uses less than 26,000 parameters, providing promise for fast deployment into ubiquitous and wearable digital settings for a remote autism diagnosis.
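
The pipeline reads as landmark sequences into an LSTM; a sketch under assumed sizes (21 landmarks x 2 coordinates per frame, small hidden state consistent with the <26,000-parameter budget):

```python
import torch
import torch.nn as nn

class HandFlappingLSTM(nn.Module):
    """Per-frame hand-landmark coordinates -> LSTM -> binary flapping logit."""
    def __init__(self, feat_dim=42, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        _, (h_n, _) = self.lstm(x)      # final hidden state summarizes the clip
        return self.head(h_n[-1])       # (batch, 1) logit

model = HandFlappingLSTM()
print(sum(p.numel() for p in model.parameters()))   # ~9.8K parameters
logit = model(torch.randn(2, 90, 42))               # two 90-frame clips
```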

【7】 Multi-task learning for jersey number recognition in Ice Hockey Link: https://arxiv.org/abs/2108.07848

Authors: Kanav Vats, Mehrnaz Fani, David A. Clausi, John Zelek Affiliations: University of Waterloo, Waterloo, Ontario, Canada Note: Accepted to the 4th International ACM Workshop on Multimedia Content Analysis in Sports Abstract: Identifying players in sports videos by recognizing their jersey numbers is a challenging task in computer vision. We have designed and implemented a multi-task learning network for jersey number recognition. In order to train the network to recognize jersey numbers, two output label representations are used: (1) holistic, which considers the entire jersey number as one class, and (2) digit-wise, which considers the two digits in a jersey number as two separate classes. The proposed network learns both holistic and digit-wise representations through a multi-task loss function. We determine the optimal weights to be assigned to the holistic and digit-wise losses through an ablation study. Experimental results demonstrate that the proposed multi-task learning network performs better than the constituent holistic and digit-wise single-task learning networks.
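
The two label representations and the weighted loss can be sketched as follows; the 100-class holistic space (numbers 00-99) and the loss weights are assumptions, the latter being what the paper's ablation tunes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JerseyNumberHeads(nn.Module):
    """Shared features feed a holistic head (whole number as one class) and
    two digit-wise heads (tens digit, units digit)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.holistic = nn.Linear(feat_dim, 100)   # numbers 00-99
        self.tens = nn.Linear(feat_dim, 10)
        self.units = nn.Linear(feat_dim, 10)

    def forward(self, feats):
        return self.holistic(feats), self.tens(feats), self.units(feats)

def multitask_loss(outputs, number, w_holistic=1.0, w_digit=0.5):
    h, d1, d2 = outputs
    return (w_holistic * F.cross_entropy(h, number)
            + w_digit * (F.cross_entropy(d1, number // 10)
                         + F.cross_entropy(d2, number % 10)))

feats = torch.randn(4, 512)
loss = multitask_loss(JerseyNumberHeads()(feats), torch.tensor([23, 9, 88, 10]))
```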

【8】 Classification and reconstruction of spatially overlapping phase images using diffractive optical networks Link: https://arxiv.org/abs/2108.07977

Authors: Deniz Mengu, Muhammed Veli, Yair Rivenson, Aydogan Ozcan Affiliations: Department of Electrical & Computer Engineering, University of California Los Angeles; Department of Bioengineering, University of California Los Angeles (UCLA), California, USA Note: 30 pages, 7 figures, 2 tables Abstract: Diffractive optical networks unify wave optics and deep learning to all-optically compute a given machine learning or computational imaging task as the light propagates from the input to the output plane. Here, we report the design of diffractive optical networks for the classification and reconstruction of spatially overlapping, phase-encoded objects. When two different phase-only objects spatially overlap, the individual object functions are perturbed since their phase patterns are summed up. The retrieval of the underlying phase images from solely the overlapping phase distribution presents a challenging problem, the solution of which is generally not unique. We show that through a task-specific training process, passive diffractive networks composed of successive transmissive layers can all-optically and simultaneously classify two different randomly-selected, spatially overlapping phase images at the input. After training with ~550 million unique combinations of phase-encoded handwritten digits from the MNIST dataset, our blind testing results reveal that the diffractive network achieves an accuracy of >85.8% for all-optical classification of two overlapping phase images of new handwritten digits. In addition to all-optical classification of overlapping phase objects, we also demonstrate the reconstruction of these phase images based on a shallow electronic neural network that uses the highly compressed output of the diffractive network as its input (with, e.g., ~20-65 times fewer pixels) to rapidly reconstruct both of the phase images, despite their spatial overlap and related phase ambiguity. The presented phase image classification and reconstruction framework might find applications in, e.g., computational imaging, microscopy, and quantitative phase imaging.

Segmentation & Semantics (2 papers)

【1】 Multi-Anchor Active Domain Adaptation for Semantic Segmentation Link: https://arxiv.org/abs/2108.08012

Authors: Munan Ning, Donghuan Lu, Dong Wei, Cheng Bian, Chenglang Yuan, Shuang Yu, Kai Ma, Yefeng Zheng Affiliations: Tencent Jarvis Lab, Shenzhen, China Note: ICCV 2021 Oral Abstract: Unsupervised domain adaptation has proven to be an effective approach for alleviating the intensive workload of manual annotation by aligning synthetic source-domain data and real-world target-domain samples. Unfortunately, mapping the target-domain distribution to the source domain unconditionally may distort the essential structural information of the target-domain data. To this end, we first propose to introduce a novel multi-anchor-based active learning strategy to assist domain adaptation for the semantic segmentation task. By innovatively adopting multiple anchors instead of a single centroid, the source domain can be better characterized as a multimodal distribution, and thus more representative and complementary samples are selected from the target domain. With little workload to manually annotate these active samples, the distortion of the target-domain distribution can be effectively alleviated, resulting in a large performance gain. The multi-anchor strategy is additionally employed to model the target distribution. By regularizing the latent representations of the target samples to be compact around multiple anchors through a novel soft alignment loss, more precise segmentation can be achieved. Extensive experiments are conducted on public datasets to demonstrate that the proposed approach outperforms state-of-the-art methods significantly, along with a thorough ablation study to verify the effectiveness of each component.
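
The multi-anchor selection step might look like the sketch below: k-means centroids over source features act as anchors, and the target samples farthest from every anchor are queried for annotation. The anchor count, query budget, and mocked features are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
source_feats = rng.normal(size=(1000, 128))            # placeholder embeddings
target_feats = rng.normal(loc=0.5, size=(500, 128))

# Multiple anchors characterize the source domain as a multimodal distribution.
anchors = KMeans(n_clusters=8, random_state=0).fit(source_feats).cluster_centers_

# A target sample far from its NEAREST anchor is poorly covered by the source.
dists = np.linalg.norm(target_feats[:, None, :] - anchors[None, :, :], axis=-1)
novelty = dists.min(axis=1)
query_ids = np.argsort(novelty)[-50:]                  # 50 samples to annotate
```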

【2】 A New Bidirectional Unsupervised Domain Adaptation Segmentation Framework Link: https://arxiv.org/abs/2108.07979

Authors: Munan Ning, Cheng Bian, Dong Wei, Chenglang Yuan, Yaohua Wang, Yang Guo, Kai Ma, Yefeng Zheng Affiliations: Tencent Jarvis Lab, Shenzhen, China; National University of Defense Technology, China Note: IPMI 2021 Abstract: Domain shift commonly occurs in cross-domain scenarios because of the wide gaps between different domains: when applying a deep learning model well-trained in one domain to another target domain, the model usually performs poorly. To tackle this problem, unsupervised domain adaptation (UDA) techniques are proposed to bridge the gap between different domains, for the purpose of improving model performance without annotation in the target domain. Particularly, UDA has great value for multimodal medical image analysis, where annotation difficulty is a practical concern. However, most existing UDA methods can only achieve satisfactory improvements in one adaptation direction (e.g., MRI to CT), but often perform poorly in the other (CT to MRI), limiting their practical usage. In this paper, we propose a bidirectional UDA (BiUDA) framework based on disentangled representation learning for equally competent two-way UDA performance. This framework employs a unified domain-aware pattern encoder which not only can adaptively encode images in different domains through a domain controller, but also improves model efficiency by eliminating redundant parameters. Furthermore, to avoid distortion of the contents and patterns of input images during the adaptation process, a content-pattern consistency loss is introduced. Additionally, for better UDA segmentation performance, a label consistency strategy is proposed to provide extra supervision by recomposing target-domain-styled images and corresponding source-domain annotations. Comparison experiments and ablation studies conducted on two public datasets demonstrate the superiority of our BiUDA framework over current state-of-the-art UDA methods and the effectiveness of its novel designs. By successfully addressing two-way adaptations, our BiUDA framework offers a flexible UDA solution for real-world scenarios.

Zero/Few-Shot | Transfer | Domain Adaptation (7 papers)

【1】 Confidence Adaptive Regularization for Deep Learning with Noisy Labels Link: https://arxiv.org/abs/2108.08212

Authors: Yangdi Lu, Yang Bo, Wenbo He Affiliations: Department of Computing and Software, McMaster University, Hamilton, Canada Abstract: Recent studies on the memorization effects of deep neural networks on noisy labels show that the networks first fit the correctly-labeled training samples before memorizing the mislabeled samples. Motivated by this early-learning phenomenon, we propose a novel method to prevent memorization of the mislabeled samples. Unlike the existing approaches which use the model output to identify or ignore the mislabeled samples, we introduce an indicator branch to the original model and enable the model to produce a confidence value for each sample. The confidence values are incorporated in our loss function, which learns to assign large confidence values to correctly-labeled samples and small confidence values to mislabeled samples. We also propose an auxiliary regularization term to further improve the robustness of the model. To improve the performance, we gradually correct the noisy labels with a well-designed target estimation strategy. We provide theoretical analysis and conduct experiments on synthetic and real-world datasets, demonstrating that our approach achieves results comparable to the state-of-the-art methods.
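
The abstract does not spell out the loss, but one common confidence-weighting form conveys the mechanism: scale each sample's cross-entropy by a learned confidence and add a -log(c) penalty so the model cannot simply discount every sample. A hedged sketch (the paper's actual formulation differs in detail):

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, conf_logits, targets, lam=0.5):
    """Mislabeled samples (large CE) drive their own confidence down, while the
    -log(c) penalty keeps confidence near 1 for clean samples. lam is assumed."""
    c = torch.sigmoid(conf_logits).squeeze(1).clamp(1e-6, 1.0)   # (B,) confidences
    ce = F.cross_entropy(logits, targets, reduction="none")      # per-sample CE
    return (c * ce - lam * torch.log(c)).mean()

logits = torch.randn(8, 10)        # classification head output
conf_logits = torch.randn(8, 1)    # indicator-branch output
loss = confidence_weighted_loss(logits, conf_logits, torch.randint(0, 10, (8,)))
```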

【2】 Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting Link: https://arxiv.org/abs/2108.08165

Authors: Anna Kukleva, Hilde Kuehne, Bernt Schiele Affiliations: MPI for Informatics, Saarland Informatics Campus; CVAI Lab, Goethe University Frankfurt; MIT-IBM Watson AI Lab, Cambridge Note: ICCV 2021 Abstract: Both generalized and incremental few-shot learning have to deal with three major challenges: learning novel classes from only few samples per class, preventing catastrophic forgetting of base classes, and classifier calibration across novel and base classes. In this work we propose a three-stage framework that allows to explicitly and effectively address these challenges. While the first phase learns base classes with many samples, the second phase learns a calibrated classifier for novel classes from few samples while also preventing catastrophic forgetting. In the final phase, calibration is achieved across all classes. We evaluate the proposed framework on four challenging benchmark datasets for image and video few-shot classification and obtain state-of-the-art results for both generalized and incremental few-shot learning.

【3】 Target Adaptive Context Aggregation for Video Scene Graph Generation Link: https://arxiv.org/abs/2108.08121

Authors: Yao Teng, Limin Wang, Zhifeng Li, Gangshan Wu Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; Tencent AI Lab, Shenzhen, China Note: ICCV 2021 camera-ready version Abstract: This paper deals with a challenging task of video scene graph generation (VidSGG), which could serve as a structured video representation for high-level understanding tasks. We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition. Our TRACE framework streamlines the VidSGG pipeline with a modular design, and presents two unique blocks: Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context Aggregation. More specifically, our HRTree first provides an adaptive structure for organizing possible relation candidates efficiently, and guides the context aggregation module to effectively capture spatio-temporal structure information. Then, we obtain a contextualized feature representation for each relation candidate and build a classification head to recognize its relation category. Finally, we provide a simple temporal association strategy to track the TRACE-detected results and yield video-level scene graphs. We perform experiments on two VidSGG benchmarks, ImageNet-VidVRD and Action Genome, and the results demonstrate that our TRACE achieves state-of-the-art performance. The code and models are made available at https://github.com/MCG-NJU/TRACE.

【4】 Adaptive Graph Convolution for Point Cloud Analysis Link: https://arxiv.org/abs/2108.08035

Authors: Haoran Zhou, Yidan Feng, Mingsheng Fang, Mingqiang Wei, Jing Qin, Tong Lu Affiliations: Nanjing University of Aeronautics and Astronautics; The Hong Kong Polytechnic University Note: Camera-ready, to be published in ICCV 2021 Abstract: Convolution on 3D point clouds, generalized from 2D grid-like domains, has been widely researched yet is far from perfect. The standard convolution characterises feature correspondences indistinguishably among 3D points, presenting an intrinsic limitation of poor distinctive feature learning. In this paper, we propose Adaptive Graph Convolution (AdaptConv), which generates adaptive kernels for points according to their dynamically learned features. Compared with using a fixed/isotropic kernel, AdaptConv improves the flexibility of point cloud convolutions, effectively and precisely capturing the diverse relations between points from different semantic parts. Unlike popular attentional weight schemes, the proposed AdaptConv implements the adaptiveness inside the convolution operation instead of simply assigning different weights to the neighboring points. Extensive qualitative and quantitative evaluations show that our method outperforms state-of-the-art point cloud classification and segmentation approaches on several benchmark datasets. Our code is available at https://github.com/hrzhou2/AdaptConv-master.

【5】 Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting 标题:变分注意:人群计数中多领域学习的特定领域知识传播 链接:https://arxiv.org/abs/2108.08023

作者:Binghui Chen,Zhaoyi Yan,Ke Li,Pengyu Li,Biao Wang,Wangmeng Zuo,Lei Zhang 机构:Harbin Institute of Technology, The Hong Kong Polytechnic University 备注:ICCV 2021 摘要:在人群计数中,由于标注费时费力,收集一个在密度、场景等方面差异很大、图像数量充足的新的大规模数据集被认为是难以实现的。因此,为了学习通用模型,使用来自多个不同数据集的数据进行训练可能是一种补救方法,并具有很大的价值。在本文中,我们采用多领域联合学习,提出了一种简单而有效的领域特定知识传播网络(DKPNet),用于同时且无偏地学习多个不同数据领域的知识。这主要是通过提出新的变分注意(VA)技术、显式地建模不同领域的注意力分布来实现的。作为VA的扩展,我们提出了内禀变分注意(InVA)来处理域重叠和子域的问题。我们进行了大量实验,在包括ShanghaiTech A/B、UCF-QNRF和NWPU在内的多个流行数据集上验证了DKPNet的优越性。 摘要:In crowd counting, due to the problem of laborious labelling, it is perceived as intractable to collect a new large-scale dataset which has plentiful images with large diversity in density, scene, etc. Thus, for learning a general model, training with data from multiple different datasets might be a remedy and be of great value. In this paper, we resort to the multi-domain joint learning and propose a simple but effective Domain-specific Knowledge Propagating Network (DKPNet) for unbiasedly learning the knowledge from multiple diverse data domains at the same time. It is mainly achieved by proposing the novel Variational Attention (VA) technique for explicitly modeling the attention distributions for different domains. And as an extension to VA, Intrinsic Variational Attention (InVA) is proposed to handle the problems of overlapped domains and sub-domains. Extensive experiments have been conducted to validate the superiority of our DKPNet over several popular datasets, including ShanghaiTech A/B, UCF-QNRF and NWPU.
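摘要中"变分注意"的核心在于把注意力建成分布而非确定值。下面是按常见变分做法(高斯分布加重参数化采样)写的示意草图,属于对摘要思路的合理猜测,并非DKPNet官方实现;`VariationalChannelAttention` 等命名均为假设。

```python
import torch
import torch.nn as nn

class VariationalChannelAttention(nn.Module):
    """示意:为每个域建模通道注意力的高斯分布,并用重参数化采样(假设性实现)。"""
    def __init__(self, channels: int, num_domains: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # 每个域一组头,分别预测注意力分布的均值与对数方差
        self.mu = nn.ModuleList([nn.Linear(channels, channels) for _ in range(num_domains)])
        self.logvar = nn.ModuleList([nn.Linear(channels, channels) for _ in range(num_domains)])

    def forward(self, x: torch.Tensor, domain: int) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.pool(x).flatten(1)                    # (B, C) 全局统计
        mu, logvar = self.mu[domain](s), self.logvar[domain](s)
        eps = torch.randn_like(mu)
        attn = torch.sigmoid(mu + eps * logvar.mul(0.5).exp())  # 重参数化采样
        return x * attn.view(b, c, 1, 1)

x = torch.randn(4, 64, 32, 32)
att = VariationalChannelAttention(64, num_domains=3)
print(att(x, domain=1).shape)  # torch.Size([4, 64, 32, 32])
```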

【6】 Adaptive Convolutions with Per-pixel Dynamic Filter Atom 标题:采用每像素动态滤波器原子的自适应卷积 链接:https://arxiv.org/abs/2108.07895

作者:Ze Wang,Zichen Miao,Jun Hu,Qiang Qiu 机构:Purdue University, Facebook 摘要:使用依赖于特征的网络权重已在许多领域被证明是有效的。然而,在实际应用中,受模型参数量和内存占用的巨大限制,具有每像素自适应滤波器的可扩展、通用的动态卷积尚未得到充分探索。在本文中,我们通过把适应于每个空间位置的滤波器分解到由轻量级网络从局部特征生成的动态滤波器原子上,来应对这一挑战。通过在一组预先固定的多尺度基上进一步表示每个滤波器原子,还可以支持自适应感受野。作为卷积层的即插即用替换,所引入的具有每像素动态原子的自适应卷积能够显式地建模图像内方差,同时避免沉重的计算、参数和内存开销。我们的方法保留了传统卷积的吸引人的特性,即平移等变和参数高效。实验表明,所提出的方法在不同任务上提供了相当甚至更好的性能,并且在处理具有显著图像内方差的任务时特别有效。 摘要:Applying feature-dependent network weights has been proved to be effective in many fields. However, in practice, restricted by the enormous size of model parameters and memory footprints, scalable and versatile dynamic convolutions with per-pixel adapted filters are yet to be fully explored. In this paper, we address this challenge by decomposing filters, adapted to each spatial position, over dynamic filter atoms generated by a light-weight network from local features. Adaptive receptive fields can be supported by further representing each filter atom over sets of pre-fixed multi-scale bases. As plug-and-play replacements to convolutional layers, the introduced adaptive convolutions with per-pixel dynamic atoms enable explicit modeling of intra-image variance, while avoiding heavy computation, parameters, and memory cost. Our method preserves the appealing properties of conventional convolutions as being translation-equivariant and parametrically efficient. We present experiments to show that the proposed method delivers comparable or even better performance across tasks, and is particularly effective on handling tasks with significant intra-image variance.
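下面的草图按摘要描述的思路走:轻量网络从局部特征预测每个像素的"原子系数",逐像素卷积核由一组预先固定的基线性组合得到。这里用随机固定基代替论文中的多尺度基,整个实现属于示意性假设,并非官方代码。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerPixelAtomConv(nn.Module):
    """示意:逐像素卷积核 = 预先固定的基 -> 原子响应 -> 逐像素系数组合(假设性实现)。"""
    def __init__(self, channels: int, num_atoms: int = 6, k: int = 3):
        super().__init__()
        self.k, self.m = k, num_atoms
        # 预先固定的基(论文用多尺度基,此处以随机固定基示意)
        self.register_buffer('bases', torch.randn(num_atoms, 1, k, k))
        # 轻量网络:从局部特征预测每个像素的原子系数
        self.coef_net = nn.Conv2d(channels, num_atoms, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        coef = torch.softmax(self.coef_net(x), dim=1)          # (B, m, H, W)
        # 用每个原子对输入做深度卷积,得到 m 份响应
        w = self.bases.repeat(C, 1, 1, 1)                      # (m*C, 1, k, k)
        resp = F.conv2d(x.repeat_interleave(self.m, dim=1), w,
                        padding=self.k // 2, groups=C * self.m)
        resp = resp.view(B, C, self.m, H, W)
        # 逐像素组合响应:等价于每个空间位置拥有自己的卷积核
        return (resp * coef.unsqueeze(1)).sum(dim=2)

x = torch.randn(2, 16, 32, 32)
print(PerPixelAtomConv(16)(x).shape)  # torch.Size([2, 16, 32, 32])
```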

【7】 Channel-Temporal Attention for First-Person Video Domain Adaptation 标题:第一人称视频域自适应的通道-时间注意 链接:https://arxiv.org/abs/2108.07846

作者:Xianyuan Liu,Shuo Zhou,Tao Lei,Haiping Lu 机构:Institute of Optics and Electronics, Chinese Academy of Sciences, China, University of Chinese Academy of Sciences, China, University of Sheffield, United Kingdom 摘要:无监督领域自适应(UDA)可以将知识从标记的源数据转移到相同类别的未标记的目标数据。然而,第一人称动作识别的UDA是一个探索不足的问题,既缺乏数据集,对第一人称视频特性的考虑也有限。本文着重于解决这一问题。首先,我们提出了两个小规模的第一人称视频域适配数据集:ADL$_{small}$和GTEA-KITCHEN。其次,我们引入通道-时间注意块来捕捉通道和时间的关系,并建模它们之间对第一人称视觉至关重要的相互依赖。最后,我们提出了一个通道-时间注意网络(CTAN)来将这些模块集成到现有的体系结构中。CTAN在两个拟议数据集和一个现有数据集EPIC$_{cvpr20}$上优于基线。 摘要:Unsupervised Domain Adaptation (UDA) can transfer knowledge from labeled source data to unlabeled target data of the same categories. However, UDA for first-person action recognition is an under-explored problem, with lack of datasets and limited consideration of first-person video characteristics. This paper focuses on addressing this problem. Firstly, we propose two small-scale first-person video domain adaptation datasets: ADL$_{small}$ and GTEA-KITCHEN. Secondly, we introduce channel-temporal attention blocks to capture the channel-wise and temporal-wise relationships and model their inter-dependencies important to first-person vision. Finally, we propose a Channel-Temporal Attention Network (CTAN) to integrate these blocks into existing architectures. CTAN outperforms baselines on the two proposed datasets and one existing dataset EPIC$_{cvpr20}$.
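摘要中"通道-时间注意块"的一种可能形态如下:先用全局池化得到每帧每通道的描述子,再分别沿通道维与时间维建模相互依赖。以下实现是对该思路的示意性猜测(SE式通道注意加时间维1D卷积注意),并非CTAN官方代码。

```python
import torch
import torch.nn as nn

class ChannelTemporalAttention(nn.Module):
    """示意:对 (B, T, C, H, W) 的视频特征同时做通道注意与时间注意(假设性实现)。"""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # 时间注意:沿时间维做1D卷积,捕获帧间依赖
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C, H, W = x.shape
        s = self.pool(x.reshape(B * T, C, H, W)).reshape(B, T, C)
        ca = self.channel_fc(s)                                      # 通道注意 (B, T, C)
        ta = self.temporal_conv(s.transpose(1, 2)).transpose(1, 2)   # 时间注意 (B, T, C)
        return x * (ca * ta).reshape(B, T, C, 1, 1)

x = torch.randn(2, 8, 64, 14, 14)
print(ChannelTemporalAttention(64)(x).shape)  # torch.Size([2, 8, 64, 14, 14])
```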

半弱无监督|主动学习|不确定性(4篇)

【1】 Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision 标题:在不精确对齐的监督下学习原始到sRGB映射 链接:https://arxiv.org/abs/2108.08119

作者:Zhilu Zhang,Haolin Wang,Ming Liu,Ruohao Wang,Jiawei Zhang,Wangmeng Zuo 机构:Harbin Institute of Technology, SenseTime Research, Pazhou Lab, Guangzhou 备注:Accepted by ICCV 2021 摘要:近年来,学习RAW到sRGB的映射引起了越来越多的关注,其中输入的RAW图像被训练成模仿由另一个相机捕获的目标sRGB图像。然而,严重的颜色不一致使得生成输入RAW图像和目标sRGB图像的对齐训练对非常具有挑战性,而使用不准确对齐的监督进行学习容易导致像素偏移并产生模糊结果。在本文中,我们提出了一种用于图像对齐和RAW到sRGB映射的联合学习模型,从而规避了这一问题。为了减少图像对齐中颜色不一致的影响,我们引入全局颜色映射(GCM)模块,在给定输入RAW图像的情况下生成初始sRGB图像,该模块可以保持像素的空间位置不变,并利用目标sRGB图像引导GCM向其转换颜色。然后部署预先训练的光流估计网络(例如PWC-Net)来扭曲目标sRGB图像,使其与GCM输出对齐。为了减轻不精确对齐监督的影响,利用扭曲后的目标sRGB图像来学习RAW到sRGB的映射。训练完成后,GCM模块和光流网络可以分离,从而不会为推理带来额外的计算成本。实验表明,我们的方法在ZRR和SR-RAW数据集上优于现有最先进方法。通过我们的联合学习模型,轻量级主干也可以在ZRR数据集上实现更好的定量和定性性能。代码可在 https://github.com/cszhilu1998/RAW-to-sRGB 获得。 摘要:Learning RAW-to-sRGB mapping has drawn increasing attention in recent years, wherein an input raw image is trained to imitate the target sRGB image captured by another camera. However, the severe color inconsistency makes it very challenging to generate well-aligned training pairs of input raw and target sRGB images. While learning with inaccurately aligned supervision is prone to causing pixel shift and producing blurry results. In this paper, we circumvent such issue by presenting a joint learning model for image alignment and RAW-to-sRGB mapping. To diminish the effect of color inconsistency in image alignment, we introduce to use a global color mapping (GCM) module to generate an initial sRGB image given the input raw image, which can keep the spatial location of the pixels unchanged, and the target sRGB image is utilized to guide GCM for converting the color towards it. Then a pre-trained optical flow estimation network (e.g., PWC-Net) is deployed to warp the target sRGB image to align with the GCM output. To alleviate the effect of inaccurately aligned supervision, the warped target sRGB image is leveraged to learn RAW-to-sRGB mapping. When training is done, the GCM module and optical flow network can be detached, thereby bringing no extra computation cost for inference. Experiments show that our method performs favorably against state-of-the-arts on ZRR and SR-RAW datasets. With our joint learning model, a light-weight backbone can achieve better quantitative and qualitative performance on ZRR dataset. Codes are available at https://github.com/cszhilu1998/RAW-to-sRGB.
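摘要中GCM的关键约束是"保持像素空间位置不变":这可以用仅含1×1卷积(即逐像素颜色变换)的网络实现,天然不会引入空间位移。以下是按这一性质写的示意实现,结构与通道数为假设,非官方代码。

```python
import torch
import torch.nn as nn

class GlobalColorMapping(nn.Module):
    """示意:仅用1x1卷积做逐像素颜色映射,保证不产生空间位移(假设性实现)。"""
    def __init__(self, in_ch: int = 4, out_ch: int = 3, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 1), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 1),
        )

    def forward(self, raw: torch.Tensor) -> torch.Tensor:
        return self.net(raw)

raw = torch.randn(1, 4, 128, 128)        # packed RAW(RGGB)示意输入
init_srgb = GlobalColorMapping()(raw)     # 初始sRGB,用于与目标图像做光流对齐
print(init_srgb.shape)                    # torch.Size([1, 3, 128, 128])
```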

【2】 Panoramic Depth Estimation via Supervised and Unsupervised Learning in Indoor Scenes 标题:基于有监督和无监督学习的室内场景全景深度估计 链接:https://arxiv.org/abs/2108.08076

作者:Keyang Zhou,Kailun Yang,Kaiwei Wang 机构:State Key Laboratory of Modern Optical Instrumentation, Zhejiang University, Hangzhou, China, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany 备注:Accepted to Applied Optics. Code will be made publicly available at this https URL 摘要:深度估计作为将二维图像转换到三维空间的必要线索,已在许多机器视觉领域得到应用。然而,要实现全方位360度的几何感知,传统的立体匹配深度估计算法由于噪声大、精度低以及对多相机标定的严格要求而受到限制。在这项工作中,为了获得统一的周围环境感知,我们引入全景图像以获得更大的视野。我们扩展了首次出现在我们之前会议工作中、面向室外场景理解的PADENet,使其以室内场景为重点执行全景单目深度估计。同时,我们针对全景图像的特点改进了神经网络的训练过程。此外,我们将传统的立体匹配算法与深度学习方法相融合,进一步提高了深度预测的准确性。通过全面多样的实验,本研究证明了我们面向室内场景感知的方案的有效性。 摘要:Depth estimation, as a necessary clue to convert 2D images into the 3D space, has been applied in many machine vision areas. However, to achieve an entire surrounding 360-degree geometric sensing, traditional stereo matching algorithms for depth estimation are limited due to large noise, low accuracy, and strict requirements for multi-camera calibration. In this work, for a unified surrounding perception, we introduce panoramic images to obtain larger field of view. We extend PADENet, first appeared in our previous conference work for outdoor scene understanding, to perform panoramic monocular depth estimation with a focus for indoor scenes. At the same time, we improve the training process of the neural network adapted to the characteristics of panoramic images. In addition, we fuse traditional stereo matching algorithm with deep learning methods and further improve the accuracy of depth predictions. With a comprehensive variety of experiments, this research demonstrates the effectiveness of our schemes aiming for indoor scene perception.

【3】 Unsupervised Image Generation with Infinite Generative Adversarial Networks 标题:无限生成对抗性网络的无监督图像生成 链接:https://arxiv.org/abs/2108.07975

作者:Hui Ying,He Wang,Tianjia Shao,Yin Yang,Kun Zhou 机构:Zhejiang University, University of Leeds, Clemson University 备注:18 pages, 11 figures 摘要:图像生成在计算机视觉中得到了广泛的研究,其中一个核心研究挑战是在几乎没有监督的情况下从任意复杂的分布生成图像。生成对抗网络(GAN)作为一种隐式方法在这方面取得了巨大的成功,因此得到了广泛的应用。然而,已知GAN存在诸如模式崩溃、潜在空间无结构、无法计算似然等问题。在本文中,我们提出了一种新的无监督非参数方法,称为无限条件GAN的混合(MIC-GANs),以同时解决多个GAN问题,目标是在尽量少的先验知识下进行图像生成。通过在不同数据集上的综合评估,我们表明MIC-GANs能够有效地构造潜在空间并避免模式崩溃,且优于最先进的方法。MIC-GANs具有适应性、通用性和鲁棒性,为若干众所周知的GAN问题提供了一个有希望的解决方案。代码见:github.com/yinghdb/MICGANs。 摘要:Image generation has been heavily investigated in computer vision, where one core research challenge is to generate images from arbitrarily complex distributions with little supervision. Generative Adversarial Networks (GANs) as an implicit approach have achieved great successes in this direction and therefore been employed widely. However, GANs are known to suffer from issues such as mode collapse, non-structured latent space, being unable to compute likelihoods, etc. In this paper, we propose a new unsupervised non-parametric method named mixture of infinite conditional GANs or MIC-GANs, to tackle several GAN issues together, aiming for image generation with parsimonious prior knowledge. Through comprehensive evaluations across different datasets, we show that MIC-GANs are effective in structuring the latent space and avoiding mode collapse, and outperform state-of-the-art methods. MIC-GANs are adaptive, versatile, and robust. They offer a promising solution to several well-known GAN issues. Code available: github.com/yinghdb/MICGANs.

【4】 Self-Supervised Visual Representations Learning by Contrastive Mask Prediction 标题:基于对比掩模预测的自监督视觉表征学习 链接:https://arxiv.org/abs/2108.07954

作者:Yucheng Zhao,Guangting Wang,Chong Luo,Wenjun Zeng,Zheng-Jun Zha 机构:University of Science and Technology of China, Microsoft Research Asia 备注:Accepted to ICCV 2021 摘要:先进的自监督视觉表征学习方法依赖于实例判别(ID)借口任务。我们指出ID任务有一个隐式的语义一致性(SC)假设,这在无约束的数据集中可能不成立。在本文中,我们提出了一种新的用于视觉表征学习的对比掩码预测(CMP)任务,并设计了掩码对比(MaskCo)框架来实现这一想法。MaskCo对比区域级特征而非视图级特征,这使得无需任何假设即可识别正样本。为了解决掩码特征和非掩码特征之间的域间隙问题,我们在MaskCo中设计了一个专用的掩码预测头。该模块被证明是CMP成功的关键。我们在ImageNet之外的训练数据集上评估了MaskCo,并将其性能与MoCo V2进行了比较。结果表明,使用ImageNet训练数据集时,MaskCo实现了与MoCo V2相当的性能;而在使用COCO或Conceptual Captions进行训练时,在一系列下游任务中表现出更强的性能。MaskCo为真实场景下的自监督学习提供了一种有希望的、可替代基于ID方法的方案。 摘要:Advanced self-supervised visual representation learning methods rely on the instance discrimination (ID) pretext task. We point out that the ID task has an implicit semantic consistency (SC) assumption, which may not hold in unconstrained datasets. In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea. MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions. To solve the domain gap between masked and unmasked features, we design a dedicated mask prediction head in MaskCo. This module is shown to be the key to the success of the CMP. We evaluated MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2. Results show that MaskCo achieves comparable performance with MoCo V2 using ImageNet training dataset, but demonstrates a stronger performance across a range of downstream tasks when COCO or Conceptual Captions are used for training. MaskCo provides a promising alternative to the ID-based methods for self-supervised learning in the wild.

时序|行为识别|姿态|视频|运动估计(2篇)

【1】 Towards Robust Human Trajectory Prediction in Raw Videos 标题:面向原始视频的鲁棒人体轨迹预测 链接:https://arxiv.org/abs/2108.08259

作者:Rui Yu,Zihan Zhou 机构:Pennsylvania State University, University Park 备注:8 pages, 6 figures. Accepted by the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021) 摘要:由于人体轨迹预测在自动驾驶车辆和室内机器人等应用中的重要性,近年来受到越来越多的关注。然而,现有的大多数方法都是基于人工标注的轨迹进行预测,而忽略了检测和跟踪中的误差和噪声。在本文中,我们研究了原始视频中的人体轨迹预测问题,并表明各种类型的跟踪误差都会严重影响预测精度。因此,我们提出了一种简单而有效的策略,通过强制预测随时间保持一致来纠正跟踪失败。所提出的"再跟踪"(re-tracking)算法可以应用于任何现有的跟踪和预测管道。在公共基准数据集上的实验表明,该方法可以在具有挑战性的现实场景中同时提高跟踪和预测性能。代码和数据见 https://git.io/retracking-prediction。 摘要:Human trajectory prediction has received increased attention lately due to its importance in applications such as autonomous vehicles and indoor robots. However, most existing methods make predictions based on human-labeled trajectories and ignore the errors and noises in detection and tracking. In this paper, we study the problem of human trajectory forecasting in raw videos, and show that the prediction accuracy can be severely affected by various types of tracking errors. Accordingly, we propose a simple yet effective strategy to correct the tracking failures by enforcing prediction consistency over time. The proposed "re-tracking" algorithm can be applied to any existing tracking and prediction pipelines. Experiments on public benchmark datasets demonstrate that the proposed method can improve both tracking and prediction performance in challenging real-world scenarios. The code and data are available at https://git.io/retracking-prediction.
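"通过强制预测一致性来纠正跟踪失败"可以用一个极简示意来说明:若当前观测与此前预测出的位置偏差过大,就怀疑发生了跟踪错误(例如ID切换),并用预测值替代观测。阈值与接口均为作者为说明而设的假设,并非论文原始算法细节。

```python
import numpy as np

def retrack(observed: np.ndarray, predicted: np.ndarray, thresh: float = 2.0) -> np.ndarray:
    """示意:用预测一致性修正跟踪轨迹(假设性实现)。
    observed / predicted: (T, 2),同一行人的观测位置与模型预测位置序列。"""
    corrected = observed.copy()
    for t in range(len(observed)):
        # 观测与预测偏差过大时,视为跟踪失败,回退到预测位置
        if np.linalg.norm(observed[t] - predicted[t]) > thresh:
            corrected[t] = predicted[t]
    return corrected

obs = np.array([[0, 0], [1, 0], [9, 9], [3, 0]], dtype=float)   # 第3帧疑似ID切换
pred = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
print(retrack(obs, pred))  # 第3帧被修正为 [2, 0]
```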

【2】 Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation 标题:过拟合数据:通过内容感知特征调制实现紧凑的神经视频传输 链接:https://arxiv.org/abs/2108.08202

作者:Jiaming Liu,Ming Lu,Kaixin Chen,Xiaoqi Li,Shizun Wang,Zhaoqing Wang,Enhua Wu,Yurong Chen,Chuang Zhang,Ming Wu 机构:Beijing University of Posts and Telecommunications, Intel Labs China, State Key Lab of Computer Science, IOS, CAS & FST, University of Macau 备注:Accepted by ICCV 2021 摘要:互联网视频传输在过去几年经历了爆炸式增长。然而,视频传输系统的质量在很大程度上取决于互联网带宽。近年来,深度神经网络(DNN)被用于提高视频传输质量。这些方法将视频划分为块(chunk),并将低分辨率(LR)视频块和相应的内容感知模型流式传输到客户端;客户端运行模型推理来对LR块进行超分辨。因此,为了传输一个视频,需要流式传输大量模型。在本文中,我们首先仔细研究了不同块的模型之间的关系,然后巧妙地设计了一个联合训练框架以及内容感知特征调制(CaFM)层来压缩这些模型用于神经视频传输。使用我们的方法,每个视频块只需流式传输不到1%的原始参数,并实现更好的SR性能。我们在各种SR主干、视频时长和缩放因子上进行了大量实验,以证明我们方法的优势。此外,我们的方法也可以被视为一种新的视频编码方法:我们的初步实验在相同的存储成本下实现了比商用H.264和H.265标准更好的视频质量,显示了该方法的巨大潜力。代码见:https://github.com/Neural-video-delivery/CaFM-Pytorch-ICCV2021 摘要:Internet video delivery has undergone a tremendous explosion of growth over the past few years. However, the quality of video delivery system greatly depends on the Internet bandwidth. Deep Neural Networks (DNNs) are utilized to improve the quality of video delivery recently. These methods divide a video into chunks, and stream LR video chunks and corresponding content-aware models to the client. The client runs the inference of models to super-resolve the LR chunks. Consequently, a large number of models are streamed in order to deliver a video. In this paper, we first carefully study the relation between models of different chunks, then we tactfully design a joint training framework along with the Content-aware Feature Modulation (CaFM) layer to compress these models for neural video delivery. With our method, each video chunk only requires less than 1% of original parameters to be streamed, achieving even better SR performance. We conduct extensive experiments across various SR backbones, video time length, and scaling factors to demonstrate the advantages of our method. Besides, our method can be also viewed as a new approach of video coding. Our primary experiments achieve better video quality compared with the commercial H.264 and H.265 standard under the same storage cost, showing the great potential of the proposed method. Code is available at: https://github.com/Neural-video-delivery/CaFM-Pytorch-ICCV2021
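CaFM的思想是让所有视频块共享同一个SR主干,每个块只携带极少量私有参数做特征调制,流式传输时只需发送这些小参数。下面按摘要给出一个示意实现(每块一组逐通道缩放/偏置);具体层结构以官方仓库为准,此处仅为假设性草图。

```python
import torch
import torch.nn as nn

class CaFM(nn.Module):
    """示意:内容感知特征调制,每个视频块一组私有的逐通道仿射参数(假设性实现)。"""
    def __init__(self, channels: int, num_chunks: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_chunks, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(num_chunks, channels, 1, 1))

    def forward(self, feat: torch.Tensor, chunk_id: int) -> torch.Tensor:
        # 主干权重在所有块之间共享,仅此处的少量参数随视频块变化
        return feat * self.scale[chunk_id] + self.bias[chunk_id]

feat = torch.randn(1, 64, 32, 32)    # SR主干某层的特征
cafm = CaFM(64, num_chunks=10)
out = cafm(feat, chunk_id=3)         # 对第3个视频块做专属调制
print(out.shape, sum(p.numel() for p in cafm.parameters()))  # 每块仅 2x64 个参数
```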

GAN|对抗|攻击|生成相关(1篇)

【1】 Deep Hybrid Self-Prior for Full 3D Mesh Generation 标题:用于全三维网格生成的深度混合自先验算法 链接:https://arxiv.org/abs/2108.08017

作者:Xingkui Wei,Zhengqing Chen,Yanwei Fu,Zhaopeng Cui,Yinda Zhang 机构:Fudan University, Zhejiang University, Google 备注:Accepted by ICCV2021 摘要:我们提出了一个深度学习管道,它利用网络自先验(self-prior),从彩色3D点云中恢复由三角形网格和纹理贴图组成的完整3D模型。与以往利用2D自先验进行图像编辑或利用3D自先验进行纯表面重建的方法不同,我们建议在深度神经网络中利用一种新的2D-3D混合自先验,以显著提高几何质量并生成高分辨率纹理图,而后者通常是商品级3D扫描仪的输出所缺失的。特别地,我们首先使用具有3D自先验的三维卷积神经网络生成初始网格,然后在2D UV图谱中同时编码3D信息和颜色信息,并通过具有自先验的二维卷积神经网络进一步细化。这样,2D和3D自先验都被用于网格和纹理恢复。实验表明,在不需要任何额外训练数据的情况下,我们的方法从稀疏输入中恢复出高质量的3D纹理网格模型,并且在几何和纹理质量方面都优于现有的方法。 摘要:We present a deep learning pipeline that leverages network self-prior to recover a full 3D model consisting of both a triangular mesh and a texture map from the colored 3D point cloud. Different from previous methods either exploiting 2D self-prior for image editing or 3D self-prior for pure surface reconstruction, we propose to exploit a novel hybrid 2D-3D self-prior in deep neural networks to significantly improve the geometry quality and produce a high-resolution texture map, which is typically missing from the output of commodity-level 3D scanners. In particular, we first generate an initial mesh using a 3D convolutional neural network with 3D self-prior, and then encode both 3D information and color information in the 2D UV atlas, which is further refined by 2D convolutional neural networks with the self-prior. In this way, both 2D and 3D self-priors are utilized for the mesh and texture recovery. Experiments show that, without the need of any additional training data, our method recovers the 3D textured mesh model of high quality from sparse input, and outperforms the state-of-the-art methods in terms of both the geometry and texture quality.

自动驾驶|车辆|车道检测等(1篇)

【1】 End-to-End Urban Driving by Imitating a Reinforcement Learning Coach 标题:模仿强化学习教练的城市端到端驾驶 链接:https://arxiv.org/abs/2108.08265

作者:Zhejun Zhang,Alexander Liniger,Dengxin Dai,Fisher Yu,Luc Van Gool 机构:Computer Vision Lab, ETH Zürich, MPI for Informatics, PSI, KU Leuven 备注:ICCV 2021 摘要:自动驾驶的端到端方法通常依赖专家演示。尽管人类是很好的驾驶员,但他们并不是端到端算法的好教练,因为后者需要密集的同策略(on-policy)监督。相反,利用特权信息的自动化专家可以高效地生成大规模的同策略和异策略演示。然而,现有的城市驾驶自动化专家大量使用手工制定的规则,即使在可以获得真值信息的驾驶模拟器上也表现欠佳。为了解决这些问题,我们训练了一个强化学习专家,将鸟瞰图像映射到连续的低级动作。在为CARLA设定新的性能上界的同时,我们的专家也是一名更好的教练,可以为模仿学习代理提供信息丰富的监督信号供其学习。在强化学习教练的监督下,仅以单目摄像头为输入的基线端到端代理达到了专家级性能。我们的端到端代理在NoCrash-dense基准上取得了78%的成功率,并能泛化到新城镇和新天气,同时在更具挑战性的CARLA排行榜上达到最先进的性能。 摘要:End-to-end approaches to autonomous driving commonly rely on expert demonstrations. Although humans are good drivers, they are not good coaches for end-to-end algorithms that demand dense on-policy supervision. On the contrary, automated experts that leverage privileged information can efficiently generate large scale on-policy and off-policy demonstrations. However, existing automated experts for urban driving make heavy use of hand-crafted rules and perform suboptimally even on driving simulators, where ground-truth information is available. To address these issues, we train a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions. While setting a new performance upper-bound on CARLA, our expert is also a better coach that provides informative supervision signals for imitation learning agents to learn from. Supervised by our reinforcement learning coach, a baseline end-to-end agent with monocular camera-input achieves expert-level performance. Our end-to-end agent achieves a 78% success rate while generalizing to a new town and new weather on the NoCrash-dense benchmark and state-of-the-art performance on the more challenging CARLA LeaderBoard.

NAS模型搜索(3篇)

【1】 FOX-NAS: Fast, On-device and Explainable Neural Architecture Search 标题:FOX-NAS:快速、设备上和可解释的神经结构搜索 链接:https://arxiv.org/abs/2108.08189

作者:Chia-Hsiang Liu,Yu-Shin Han,Yuan-Yao Sung,Yi Lee,Hung-Yueh Chiang,Kai-Chiang Wu 机构:National Yang Ming Chiao Tung University, The University of Texas at Austin 备注:Accepted by ICCV 2021 Low-Power Computer Vision Workshop 摘要:神经结构搜索可以发现性能良好的神经网络,其中单次(One-Shot)方法非常流行。单次方法通常需要一个带权重共享的超网,以及预测架构性能的预测器。然而,以往的方法需要花费大量时间来生成性能预测器,因此效率低下。为此,我们提出了FOX-NAS,它由基于模拟退火和多元回归的快速且可解释的预测器组成。我们的方法对量化友好,可以高效地部署到边缘设备。在不同硬件上的实验表明,FOX-NAS模型优于其他一些流行的神经网络结构。例如,FOX-NAS在边缘CPU上以分别减少240%和40%的延迟达到了MobileNetV2和EfficientNet-Lite0的精度。FOX-NAS是2020年低功耗计算机视觉挑战赛(LPCVC)DSP分类赛道的第三名。所有评估结果见 https://lpcv.ai/competitions/2020。搜索代码和预训练模型发布于 https://github.com/great8nctu/FOX-NAS。 摘要:Neural architecture search can discover neural networks with good performance, and One-Shot approaches are prevalent. One-Shot approaches typically require a supernet with weight sharing and predictors that predict the performance of architecture. However, the previous methods take much time to generate performance predictors thus are inefficient. To this end, we propose FOX-NAS that consists of fast and explainable predictors based on simulated annealing and multivariate regression. Our method is quantization-friendly and can be efficiently deployed to the edge. The experiments on different hardware show that FOX-NAS models outperform some other popular neural network architectures. For example, FOX-NAS matches MobileNetV2 and EfficientNet-Lite0 accuracy with 240% and 40% less latency on the edge CPU. FOX-NAS is the 3rd place winner of the 2020 Low-Power Computer Vision Challenge (LPCVC), DSP classification track. See all evaluation results at https://lpcv.ai/competitions/2020. Search code and pre-trained models are released at https://github.com/great8nctu/FOX-NAS.

【2】 Single-DARTS: Towards Stable Architecture Search 标题:Single-DARTS:迈向稳定的架构搜索 链接:https://arxiv.org/abs/2108.08128

作者:Pengfei Hou,Ying Jin,Yukang Chen 机构:Tsinghua University, The Chinese University of Hong Kong 备注:Accepted by ICCV 2021 NeurArch Workshop 摘要:可微结构搜索(DARTS)是神经结构搜索(NAS)的一个里程碑,兼具简单性和较小的搜索成本。但是,DARTS仍然经常遭受性能崩溃:当某些操作(如跳过连接、零操作和池化)主导架构时就会发生这种情况。在本文中,我们首先指出这种现象归因于双层优化。我们提出了Single-DARTS,它仅使用单层优化,用同一批数据同时更新网络权重和结构参数。尽管此前已有人尝试过单层优化,但没有文献对这一关键点给出系统的解释。Single-DARTS取代了双层优化,明显缓解了性能崩溃,并增强了架构搜索的稳定性。实验结果表明,Single-DARTS在主流搜索空间上达到了最先进的性能。例如,在NAS-Benchmark-201上,搜索到的架构几乎是最优的。我们还验证了单层优化框架比双层优化框架稳定得多。我们希望这种简单而有效的方法能为可微结构搜索提供一些见解。代码见 https://github.com/PencilAndBike/Single-DARTS.git。 摘要:Differentiable architecture search (DARTS) marks a milestone in Neural Architecture Search (NAS), boasting simplicity and small search costs. However, DARTS still suffers from frequent performance collapse, which happens when some operations, such as skip connections, zeroes and poolings, dominate the architecture. In this paper, we are the first to point out that the phenomenon is attributed to bi-level optimization. We propose Single-DARTS which merely uses single-level optimization, updating network weights and architecture parameters simultaneously with the same data batch. Even though single-level optimization has been previously attempted, no literature provides a systematic explanation of this essential point. Replacing the bi-level optimization, Single-DARTS obviously alleviates performance collapse as well as enhances the stability of architecture search. Experiment results show that Single-DARTS achieves state-of-the-art performance on mainstream search spaces. For instance, on NAS-Benchmark-201, the searched architectures are nearly optimal ones. We also validate that the single-level optimization framework is much more stable than the bi-level one. We hope that this simple yet effective method will give some insights on differential architecture search. The code is available at https://github.com/PencilAndBike/Single-DARTS.git.
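单层优化与DARTS双层优化的区别用训练循环即可说明:前者用同一批数据、在同一次反向传播中同步更新网络权重和结构参数α。以下为示意性训练循环,其中候选操作与数据均为占位假设,并非官方代码。

```python
import torch
import torch.nn.functional as F

# 示意:两个候选操作的可微混合(DARTS式连续松弛)
ops = torch.nn.ModuleList([torch.nn.Linear(8, 2), torch.nn.Linear(8, 2)])
alpha = torch.nn.Parameter(torch.zeros(2))       # 结构参数

def forward(x: torch.Tensor) -> torch.Tensor:
    w = alpha.softmax(dim=0)
    return w[0] * ops[0](x) + w[1] * ops[1](x)   # 候选操作的加权混合

# Single-DARTS:一个优化器、同一批数据,同步更新权重与结构参数 alpha
opt = torch.optim.SGD(list(ops.parameters()) + [alpha], lr=0.01)
for step in range(100):
    x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
    opt.zero_grad()
    F.cross_entropy(forward(x), y).backward()
    opt.step()
# 对比 DARTS 的双层优化:需先用验证集单独更新 alpha,再用训练集更新权重
```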

【3】 RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving 标题:RANK-NOSH:基于非均匀连续减半的高效预测器体系结构搜索 链接:https://arxiv.org/abs/2108.08019

作者:Ruochen Wang,Xiangning Chen,Minhao Cheng,Xiaocheng Tang,Cho-Jui Hsieh 机构:Department of Computer Science, UCLA, DiDi AI Labs 备注:To Appear in ICCV2021. The code will be released shortly at this https URL 摘要:基于预测器的算法在神经结构搜索(NAS)任务中取得了显著的性能。然而,这些方法的计算成本很高,因为训练性能预测器通常需要从头开始训练和评估数百个架构。此前沿这一思路的工作主要集中在减少拟合预测器所需的架构数量。在这项工作中,我们从另一个角度应对这一挑战:通过削减架构训练的计算预算来提高搜索效率。我们提出了非均匀逐次减半(NOSH)算法,这是一种分层调度算法,它提前终止表现不佳的架构的训练,以避免浪费预算。为了有效地利用NOSH产生的非均匀监督信号,我们将基于预测器的架构搜索表述为通过成对比较学习排序。由此产生的RANK-NOSH方法将搜索预算减少了约5倍,同时在各种搜索空间和数据集上实现了与以前最先进的基于预测器的方法相比有竞争力甚至更好的性能。 摘要:Predictor-based algorithms have achieved remarkable performance in the Neural Architecture Search (NAS) tasks. However, these methods suffer from high computation costs, as training the performance predictor usually requires training and evaluating hundreds of architectures from scratch. Previous works along this line mainly focus on reducing the number of architectures required to fit the predictor. In this work, we tackle this challenge from a different perspective - improve search efficiency by cutting down the computation budget of architecture training. We propose NOn-uniform Successive Halving (NOSH), a hierarchical scheduling algorithm that terminates the training of underperforming architectures early to avoid wasting budget. To effectively leverage the non-uniform supervision signals produced by NOSH, we formulate predictor-based architecture search as learning to rank with pairwise comparisons. The resulting method - RANK-NOSH, reduces the search budget by ~5x while achieving competitive or even better performance than previous state-of-the-art predictor-based methods on various spaces and datasets.
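"通过成对比较学习排序"可以落到一个很简单的损失上:对任意一对架构,若前者真实性能更好,就要求预测器给它更高的分数。以下是margin式成对排序损失的示意写法(非官方实现,margin取值为假设):

```python
import torch

def pairwise_rank_loss(scores: torch.Tensor, perf: torch.Tensor, margin: float = 0.1):
    """示意:成对排序损失。scores 为预测器输出,perf 为架构的真实性能(假设性实现)。"""
    s_i, s_j = scores.unsqueeze(0), scores.unsqueeze(1)
    p_i, p_j = perf.unsqueeze(0), perf.unsqueeze(1)
    sign = torch.sign(p_i - p_j)             # 每对架构中谁的真实性能更好
    # hinge:预测分数的相对大小须与真实性能的相对大小一致
    loss = torch.relu(margin - sign * (s_i - s_j))
    mask = sign != 0
    return loss[mask].mean()

scores = torch.tensor([0.2, 0.9, 0.4], requires_grad=True)  # 预测器打分
perf = torch.tensor([0.71, 0.93, 0.88])                      # 真实精度
print(pairwise_rank_loss(scores, perf))   # 可反向传播以训练预测器
```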

Attention注意力(1篇)

【1】 An Attention Module for Convolutional Neural Networks 标题:一种卷积神经网络的注意力模块 链接:https://arxiv.org/abs/2108.08205

作者:Zhu Baozhou,Peter Hofstee,Jinho Lee,Zaid Al-Ars 机构:Delft University of Technology, Delft, Netherlands, IBM Research Austin, TX, USA, Yonsei University, Seoul, Korea 摘要:注意力机制被认为是一种捕捉长距离特征交互、提升卷积神经网络表示能力的先进技术。然而,我们发现当前基于注意力激活的模型存在两个被忽视的问题:注意力图的近似问题和容量不足问题。为了同时解决这两个问题,我们通过开发AW卷积,首先提出了一种用于卷积神经网络的注意力模块,其中注意力图的形状与权重而非激活相匹配。我们提出的注意力模块是对以往基于注意力方案(例如那些应用注意力机制来探索通道与空间特征之间关系的方案)的补充。在多个用于图像分类和目标检测任务的数据集上的实验表明了我们提出的注意力模块的有效性。特别地,在ImageNet分类上,我们的注意力模块在ResNet101基线之上实现了1.00%的Top-1精度提升;在COCO目标检测上,在以ResNet101-FPN为主干的Faster R-CNN基线之上实现了0.63的COCO风格平均精度(AP)提升。与以往基于注意力激活的模型集成时,我们的注意力模块可以进一步将其在ImageNet分类上的Top-1精度提高至多0.57%,将COCO目标检测上的COCO风格平均精度提高至多0.45。代码和预训练模型将公开提供。 摘要:Attention mechanism has been regarded as an advanced technique to capture long-range feature interactions and to boost the representation capability for convolutional neural networks. However, we found two ignored problems in current attentional activations-based models: the approximation problem and the insufficient capacity problem of the attention maps. To solve the two problems together, we initially propose an attention module for convolutional neural networks by developing an AW-convolution, where the shape of attention maps matches that of the weights rather than the activations. Our proposed attention module is a complementary method to previous attention-based schemes, such as those that apply the attention mechanism to explore the relationship between channel-wise and spatial features. Experiments on several datasets for image classification and object detection tasks show the effectiveness of our proposed attention module. In particular, our proposed attention module achieves 1.00% Top-1 accuracy improvement on ImageNet classification over a ResNet101 baseline and 0.63 COCO-style Average Precision improvement on the COCO object detection on top of a Faster R-CNN baseline with the backbone of ResNet101-FPN. When integrating with the previous attentional activations-based models, our proposed attention module can further increase their Top-1 accuracy on ImageNet classification by up to 0.57% and COCO-style Average Precision on the COCO object detection by up to 0.45. Code and pre-trained models will be publicly available.

人脸|人群计数(3篇)

【1】 DeepFake MNIST+: A DeepFake Facial Animation Dataset 标题:DeepFake MNIST+:一个DeepFake人脸动画数据集 链接:https://arxiv.org/abs/2108.07949

作者:Jiajun Huang,Xueyu Wang,Bo Du,Pei Du,Chang Xu 机构:The University of Sydney, Wuhan University, AntGroup 备注:14 pages 摘要:DeepFake(深度伪造)是一类面部操纵技术,是数字社会面临的新威胁。人们提出了各种深度伪造检测方法和数据集来检测此类数据,特别是人脸交换。然而,最近的研究较少考虑面部动画,而它在DeepFake攻击一方同样重要。面部动画试图利用驱动视频提供的动作来驱动一张人脸图像,这也引发了对近来支付系统安全性的担忧:这些系统依赖活体检测,通过识别一系列用户面部动作来验证真实用户。然而,我们的实验表明,现有的数据集不足以用于开发可靠的检测方法,而当前的活体检测器也无法抵御此类视频构成的攻击。作为回应,我们提出了一个新的人脸动画数据集,称为DeepFake MNIST+,由SOTA图像动画生成器生成。它包含10种不同动作的10,000个面部动画视频,能够欺骗最近的活体检测器。本文还给出了一种基线检测方法,并对该方法进行了综合分析。此外,我们还分析了所提数据集的性质,揭示了在不同运动类型和压缩质量下检测动画数据集的困难和重要性。 摘要:DeepFakes, which are facial manipulation techniques, are an emerging threat to digital society. Various DeepFake detection methods and datasets are proposed for detecting such data, especially for face-swapping. However, recent research less considers facial animation, which is also important on the DeepFake attack side. It tries to animate a face image with actions provided by a driving video, which also leads to a concern about the security of recent payment systems that rely on liveness detection to authenticate real users via recognising a sequence of user facial actions. However, our experiments show that the existing datasets are not sufficient to develop reliable detection methods, while current liveness detectors cannot defend against such videos as the attack. As a response, we propose a new human face animation dataset, called DeepFake MNIST+, generated by a SOTA image animation generator. It includes 10,000 facial animation videos in ten different actions, which can spoof the recent liveness detectors. A baseline detection method and a comprehensive analysis of the method is also included in this paper. In addition, we analyze the proposed dataset's properties and reveal the difficulty and importance of detecting animation datasets under different types of motion and compression quality.

【2】 FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning 标题:FACIAL:基于隐式属性学习的动态说话人脸合成 链接:https://arxiv.org/abs/2108.07938

作者:Chenxu Zhang,Yifan Zhao,Yifei Huang,Ming Zeng,Saifeng Ni,Madhukar Budagavi,Xiaohu Guo 机构:The University of Texas at Dallas, Beihang University, East China Normal University, Xiamen University, Samsung Research America 备注:10 pages, 9 figures. Accepted by ICCV 2021 摘要:在本文中,我们提出了一种说话人脸生成方法,它以音频信号和一段简短的目标人视频片段作为输入,合成目标人脸的照片级逼真视频,其中自然的嘴唇运动、头部姿势和眨眼均与输入音频信号同步。我们注意到,合成的人脸属性不仅包括与语音高度相关的嘴唇运动等显式属性,还包括与输入音频相关性较弱的头部姿势和眨眼等隐式属性。为了建模不同人脸属性与输入音频之间的复杂关系,我们提出了人脸隐式属性学习生成对抗网络(FACIAL-GAN),它集成了语音学感知、上下文感知和身份感知信息,以合成具有逼真嘴唇运动、头部姿势和眨眼的3D人脸动画。然后,我们的渲染到视频(Rendering-to-Video)网络以渲染的人脸图像和眨眼注意力图作为输入,生成照片级逼真的输出视频帧。实验结果和用户研究表明,我们的方法可以生成不仅嘴唇运动同步、而且头部运动和眨眼自然的逼真说话人脸视频,质量优于最先进方法的结果。 摘要:In this paper, we propose a talking face generation method that takes an audio signal as input and a short target video clip as reference, and synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in-sync with the input audio signal. We note that the synthetic face attributes include not only explicit ones such as lip motions that have high correlations with speech, but also implicit ones such as head poses and eye blinks that have only weak correlation with the input audio. To model such complicated relationships among different face attributes with input audio, we propose a FACe Implicit Attribute Learning Generative Adversarial Network (FACIAL-GAN), which integrates the phonetics-aware, context-aware, and identity-aware information to synthesize the 3D face animation with realistic motions of lips, head poses, and eye blinks. Then, our Rendering-to-Video network takes the rendered face images and the attention map of eye blinks as input to generate the photo-realistic output video frames. Experimental results and user studies show our method can generate realistic talking face videos with not only synchronized lip motions, but also natural head movements and eye blinks, with better qualities than the results of state-of-the-art methods.

【3】 ARCH++: Animation-Ready Clothed Human Reconstruction Revisited 标题:ARCH++:再探可动画的着装人体重建 链接:https://arxiv.org/abs/2108.07845

作者:Tong He,Yuanlu Xu,Shunsuke Saito,Stefano Soatto,Tony Tung 机构:Facebook Reality Labs Research, USA, University of California, Los Angeles, USA 备注:Accepted by ICCV 2021 摘要:我们提出了一种基于图像的方法ARCH++,用于重建具有任意服装风格的3D化身。我们重建的化身不仅在输入视图的可见区域、而且在不可见区域都具备可动画性且高度逼真。虽然先前的工作在重建具有各种拓扑结构的可动画着装人体方面显示出巨大前景,但我们观察到其中存在导致重建质量欠佳的根本性限制。在本文中,我们重新审视基于图像的化身重建的主要步骤,并用ARCH++解决这些限制。首先,我们引入了端到端的基于点的几何编码器,以更好地描述底层3D人体的语义,取代以前手工制作的特征。其次,为了解决着装人体在规范姿势下因拓扑变化引起的占用歧义问题,我们提出了一个具有跨空间一致性的协同监督框架,联合估计姿势空间和规范空间中的占用。最后,我们使用图像到图像翻译网络进一步细化重建表面上的几何细节和纹理,从而提高任意视点下的保真度和一致性。在实验中,我们在公共基准和用户研究上展示了在重建质量和真实感方面对现有最佳方法的改进。 摘要:We present ARCH++, an image-based method to reconstruct 3D avatars with arbitrary clothing styles. Our reconstructed avatars are animation-ready and highly realistic, in both the visible regions from input views and the unseen regions. While prior work shows great promise of reconstructing animatable clothed humans with various topologies, we observe that there exist fundamental limitations resulting in sub-optimal reconstruction quality. In this paper, we revisit the major steps of image-based avatar reconstruction and address the limitations with ARCH++. First, we introduce an end-to-end point based geometry encoder to better describe the semantics of the underlying 3D human body, in replacement of previous hand-crafted features. Second, in order to address the occupancy ambiguity caused by topological changes of clothed humans in the canonical pose, we propose a co-supervising framework with cross-space consistency to jointly estimate the occupancy in both the posed and canonical spaces. Last, we use image-to-image translation networks to further refine detailed geometry and texture on the reconstructed surface, which improves the fidelity and consistency across arbitrary viewpoints. In the experiments, we demonstrate improvements over the state of the art on both public benchmarks and user studies in reconstruction quality and realism.

跟踪(1篇)

【1】 Rendering and Tracking the Directional TSDF: Modeling Surface Orientation for Coherent Maps 标题:定向TSDF的渲染和跟踪:相干贴图的表面方向建模 链接:https://arxiv.org/abs/2108.08115

作者:Malte Splietker,Sven Behnke 备注:to be published in 10th European Conference on Mobile Robots (ECMR 2021) 摘要:基于RGB-D图像的稠密实时跟踪与建图是导航、抓取等许多机器人应用的重要工具。最近提出的方向截断有符号距离函数(DTSDF)是常规TSDF的扩展,显示出构建更一致的地图和改进跟踪性能的潜力。在这项工作中,我们提出了从DTSDF渲染深度图和颜色图的方法,使其成为现有跟踪器中常规TSDF的真正即插即用替代品。我们通过评估表明,我们的方法提高了所建场景地图的可重用性。此外,我们加入了颜色融合,显著提高了相邻表面的颜色正确性。 摘要:Dense real-time tracking and mapping from RGB-D images is an important tool for many robotic applications, such as navigation or grasping. The recently presented Directional Truncated Signed Distance Function (DTSDF) is an augmentation of the regular TSDF and shows potential for more coherent maps and improved tracking performance. In this work, we present methods for rendering depth- and color maps from the DTSDF, making it a true drop-in replacement for the regular TSDF in established trackers. We evaluate and show, that our method increases re-usability of mapped scenes. Furthermore, we add color integration which notably improves color-correctness at adjacent surfaces.

蒸馏|知识提取(1篇)

【1】 Learning Conditional Knowledge Distillation for Degraded-Reference Image Quality Assessment 标题:用于降质参考图像质量评估的条件知识蒸馏学习 链接:https://arxiv.org/abs/2108.07948

作者:Heliang Zheng,Huan Yang,Jianlong Fu,Zheng-Jun Zha,Jiebo Luo 机构:University of Science and Technology of China, Hefei, China, Microsoft Research, Beijing, China, University of Rochester, Rochester, NY 摘要:图像质量评估(IQA)的一个重要场景是评估图像恢复(IR)算法。最先进的方法采用全参考范式,将恢复的图像与其对应的原始质量图像进行比较。然而,在盲图像恢复任务和真实场景中,原始质量的图像通常是不可用的。在本文中,我们提出了一种实用的解决方案,称为降质参考IQA(DR-IQA),它利用IR模型的输入(即退化图像)作为参考。具体来说,我们通过从原始质量图像中蒸馏知识,从退化图像中提取参考信息。蒸馏是通过学习一个参考空间来实现的:在该空间中,鼓励各种退化图像与原始质量图像共享相同的特征统计量,并且参考空间被优化以捕获对质量评估有用的深度图像先验。请注意,原始质量图像仅在训练期间使用。我们的工作为盲图像恢复方法,特别是基于GAN的方法,提供了一个强大且可微的度量。大量实验表明,我们的结果甚至可以接近全参考设置的性能。 摘要:An important scenario for image quality assessment (IQA) is to evaluate image restoration (IR) algorithms. The state-of-the-art approaches adopt a full-reference paradigm that compares restored images with their corresponding pristine-quality images. However, pristine-quality images are usually unavailable in blind image restoration tasks and real-world scenarios. In this paper, we propose a practical solution named degraded-reference IQA (DR-IQA), which exploits the inputs of IR models, degraded images, as references. Specifically, we extract reference information from degraded images by distilling knowledge from pristine-quality images. The distillation is achieved through learning a reference space, where various degraded images are encouraged to share the same feature statistics with pristine-quality images. And the reference space is optimized to capture deep image priors that are useful for quality assessment. Note that pristine-quality images are only used during training. Our work provides a powerful and differentiable metric for blind IRs, especially for GAN-based methods. Extensive experiments show that our results can even be close to the performance of full-reference settings.

超分辨率|去噪|去模糊|去雾(1篇)

【1】 Deep Reparametrization of Multi-Frame Super-Resolution and Denoising 标题:多帧超分辨率深度再参数化与去噪 链接:https://arxiv.org/abs/2108.08286

作者:Goutam Bhat,Martin Danelljan,Fisher Yu,Luc Van Gool,Radu Timofte 机构:Computer Vision Lab, ETH Zurich, Switzerland 备注:ICCV 2021 Oral 摘要:我们对多帧图像恢复任务中常用的最大后验(MAP)公式提出了一种深度再参数化方法。我们的方法通过引入学习的误差度量和目标图像的潜在表示,将MAP目标函数变换到深度特征空间。深度再参数化允许我们直接在潜在空间中建模图像形成过程,并将学习到的图像先验集成到预测中。因此,我们的方法在利用深度学习优势的同时,也受益于经典MAP公式所提供的有原则的多帧融合。我们通过在连拍去噪和连拍超分辨率数据集上的综合实验验证了我们的方法。我们的方法在这两项任务上都创造了新的最先进水平,证明了所提公式的通用性和有效性。 摘要:We propose a deep reparametrization of the maximum a posteriori formulation commonly employed in multi-frame image restoration tasks. Our approach is derived by introducing a learned error metric and a latent representation of the target image, which transforms the MAP objective to a deep feature space. The deep reparametrization allows us to directly model the image formation process in the latent space, and to integrate learned image priors into the prediction. Our approach thereby leverages the advantages of deep learning, while also benefiting from the principled multi-frame fusion provided by the classical MAP formulation. We validate our approach through comprehensive experiments on burst denoising and burst super-resolution datasets. Our approach sets a new state-of-the-art for both tasks, demonstrating the generality and effectiveness of the proposed formulation.
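用公式可以更直观地说明"深度再参数化":经典的多帧MAP目标在像素空间比较观测与生成图像,而该方法把残差搬到学习到的特征空间,并用潜在表示参数化目标图像。下式是作者根据摘要整理的示意性写法,算子记号均为假设,并非论文原文:

```latex
% 经典多帧 MAP:在像素空间最小化重建误差加先验项
\hat{y} = \arg\min_{y} \sum_{i=1}^{N} \big\| A_i\, y - x_i \big\|^2 + R(y)

% 深度再参数化(示意):目标图像由潜在表示 z 给出 y = g_\theta(z),
% 误差在学习到的特征空间中用可学习度量 e_\theta 计算,
% 学习到的图像先验隐含在 g_\theta 与 e_\theta 之中
\hat{z} = \arg\min_{z} \sum_{i=1}^{N}
    \rho\!\Big( e_\theta\big( \phi_\theta(x_i),\; A_i\, g_\theta(z) \big) \Big),
\qquad \hat{y} = g_\theta(\hat{z})
```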

3D|3D重建等相关(1篇)

【1】 A Simple Framework for 3D Lensless Imaging with Programmable Masks 标题:一种简单的可编程掩模三维无透镜成像框架 链接:https://arxiv.org/abs/2108.07966

作者:Yucheng Zheng,Yi Hua,Aswin C. Sankaranarayanan,M. Salman Asif 机构:University of California Riverside, Carnegie Mellon University 备注:None 摘要:无透镜相机通过用靠近传感器的振幅或相位掩模替换传统相机中的镜头,提供了一个构建轻薄成像系统的框架。现有的无透镜成像方法可以恢复场景的深度和强度,但需要求解计算代价高昂的逆问题。此外,现有方法难以恢复深度变化较大的稠密场景。在本文中,我们提出了一种无透镜成像系统,它利用可编程掩模上的不同图案捕获少量测量值。在此背景下,我们做出三项贡献。首先,我们提出了一种快速恢复算法,以恢复场景中固定数量深度平面上的纹理。其次,我们研究了可编程无透镜相机的掩模设计问题,并提供了一个用于优化掩模图案的设计模板,以改善深度估计。最后,我们使用细化网络作为后处理步骤来识别和去除重建中的伪影。我们在无透镜相机原型上对这些改进进行了广泛的实验评估,以展示优化后的掩模和恢复算法相对于最新技术的性能优势。 摘要:Lensless cameras provide a framework to build thin imaging systems by replacing the lens in a conventional camera with an amplitude or phase mask near the sensor. Existing methods for lensless imaging can recover the depth and intensity of the scene, but they require solving computationally-expensive inverse problems. Furthermore, existing methods struggle to recover dense scenes with large depth variations. In this paper, we propose a lensless imaging system that captures a small number of measurements using different patterns on a programmable mask. In this context, we make three contributions. First, we present a fast recovery algorithm to recover textures on a fixed number of depth planes in the scene. Second, we consider the mask design problem, for programmable lensless cameras, and provide a design template for optimizing the mask patterns with the goal of improving depth estimation. Third, we use a refinement network as a post-processing step to identify and remove artifacts in the reconstruction. These modifications are evaluated extensively with experimental results on a lensless camera prototype to showcase the performance benefits of the optimized masks and recovery algorithms over the state of the art.

其他神经网络|深度学习|模型|建模(7篇)

【1】 Scarce Data Driven Deep Learning of Drones via Generalized Data Distribution Space 标题:基于广义数据分布空间的稀缺数据驱动无人机深度学习 链接:https://arxiv.org/abs/2108.08244

作者:Chen Li,Schyler C. Sun,Zhuangkun Wei,Antonios Tsourdos,Weisi Guo 机构:Cranfield University (Weisi Guo is also with the Alan Turing Institute) 备注:9 pages, 6 figures 摘要:民用和专业环境中无人机的日益普及为机场和国家基础设施带来了新的威胁向量。据估计,无人机入侵对一个主要机场造成的经济损失可达每天数百万美元。由于缺乏多样的无人机训练数据,在数据稀缺的情况下准确训练深度学习检测算法是一个有待解决的挑战。现有的方法主要依赖于收集多样而全面的无人机实拍数据、人工诱导的数据增强、迁移学习和元学习,以及物理知识引导的学习。然而,这些方法无法保证覆盖多样的无人机设计,也无法充分理解无人机的深层特征空间。在此,我们展示了如何通过生成对抗网络(GAN)理解无人机数据的总体分布,并使用拓扑数据分析(TDA)解释缺失的特征,从而有针对性地获取缺失数据,实现快速且更准确的学习。我们在一个无人机图像数据集上演示了我们的结果,该数据集既包含真实的无人机图像,也包含来自计算机辅助设计的模拟图像。与随机数据收集(通常做法,200个epoch后判别器准确率为94.67%)相比,我们提出的GAN-TDA引导的数据收集方法带来了显著的4%的提升(200个epoch后判别器准确率为99.42%)。我们相信,这种利用神经网络所获得的总体数据分布知识的方法可以应用于广泛的数据稀缺的开放挑战。 摘要:Increased drone proliferation in civilian and professional settings has created new threat vectors for airports and national infrastructures. The economic damage for a single major airport from drone incursions is estimated to be millions per day. Due to the lack of diverse drone training data, accurate training of deep learning detection algorithms under scarce data is an open challenge. Existing methods largely rely on collecting diverse and comprehensive experimental drone footage data, artificially induced data augmentation, transfer and meta-learning, as well as physics-informed learning. However, these methods cannot guarantee capturing diverse drone designs and fully understanding the deep feature space of drones. Here, we show how understanding the general distribution of the drone data via a Generative Adversarial Network (GAN) and explaining the missing features using Topological Data Analysis (TDA) can allow us to acquire missing data to achieve rapid and more accurate learning. We demonstrate our results on a drone image dataset, which contains both real drone images as well as simulated images from computer-aided design. When compared to random data collection (usual practice - discriminator accuracy of 94.67% after 200 epochs), our proposed GAN-TDA informed data collection method offers a significant 4% improvement (99.42% after 200 epochs). We believe that this approach of exploiting general data distribution knowledge from neural networks can be applied to a wide range of scarce data open challenges.

【2】 MBRS : Enhancing Robustness of DNN-based Watermarking by Mini-Batch of Real and Simulated JPEG Compression 标题:MBRS:通过小批量真实和模拟JPEG压缩增强基于DNN的水印的鲁棒性 链接:https://arxiv.org/abs/2108.08211

作者:Zhaoyang Jia,Han Fang,Weiming Zhang 机构:University of Science and Technology of China, Hefei, Anhui, China 备注:9 pages, 6 figures, received by ACM MM'21 摘要:基于深度学习结构强大的特征提取能力,近年来基于深度学习的水印算法得到了广泛的研究。此类算法的基本框架是由编码器、噪声层和解码器组成的自编码器式端到端架构,而保证鲁棒性的关键是使用可微噪声层进行对抗训练。然而,我们发现现有框架都不能很好地保证对JPEG压缩的鲁棒性;JPEG压缩虽然不可微,却是一种基本且重要的图像处理操作。为了解决这些局限,我们提出了一种新的端到端训练架构,利用小批量的真实和模拟JPEG压缩(MBRS)来增强JPEG鲁棒性。确切地说,对于不同的mini-batch,我们从真实JPEG、模拟JPEG和无噪声层中随机选择一个作为噪声层。此外,我们建议利用能在嵌入和提取阶段学习更好特征的Squeeze-and-Excitation块,并提出一种"消息处理器",以更合适的方式扩展消息。同时,为了提高对裁剪攻击的鲁棒性,我们在网络中引入了加性扩散块。大量的实验结果表明,与最先进的算法相比,所提方案具有优越的性能。在质量因子Q=50的JPEG压缩下,我们的模型对提取的消息实现了小于0.01%的误码率,且编码图像的PSNR大于36,表明对JPEG攻击的鲁棒性得到了很好的增强。此外,在高斯滤波、裁剪(crop)、cropout和dropout等多种失真下,所提框架同样具有很强的鲁棒性。PyTorch实现的代码见 https://github.com/jzyustc/MBRS。 摘要:Based on the powerful feature extraction ability of deep learning architecture, recently, deep-learning based watermarking algorithms have been widely studied. The basic framework of such algorithm is the auto-encoder like end-to-end architecture with an encoder, a noise layer and a decoder. The key to guarantee robustness is the adversarial training with the differential noise layer. However, we found that none of the existing framework can well ensure the robustness against JPEG compression, which is non-differential but is an essential and important image processing operation. To address such limitations, we proposed a novel end-to-end training architecture, which utilizes Mini-Batch of Real and Simulated JPEG compression (MBRS) to enhance the JPEG robustness. Precisely, for different mini-batches, we randomly choose one of real JPEG, simulated JPEG and noise-free layer as the noise layer. Besides, we suggest to utilize the Squeeze-and-Excitation blocks which can learn better feature in embedding and extracting stage, and propose a "message processor" to expand the message in a more appropriate way. Meanwhile, to improve the robustness against crop attack, we propose an additive diffusion block into the network. The extensive experimental results have demonstrated the superior performance of the proposed scheme compared with the state-of-the-art algorithms. Under the JPEG compression with quality factor Q=50, our models achieve a bit error rate less than 0.01% for extracted messages, with PSNR larger than 36 for the encoded images, which shows the well-enhanced robustness against JPEG attack. Besides, under many other distortions such as Gaussian filter, crop, cropout and dropout, the proposed framework also obtains strong robustness. The code implemented in PyTorch is available at https://github.com/jzyustc/MBRS.
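MBRS的核心训练技巧几行代码即可说明:每个mini-batch在{真实JPEG、可微模拟JPEG、无噪声}中随机选一个作为噪声层。以下示意中,真实JPEG用PIL编解码并配合直通估计(straight-through)让梯度近似绕过不可微操作;这一梯度处理方式与占位的模拟JPEG均为作者的常见做法假设,并非论文原文描述。

```python
import io
import random
import torch
from PIL import Image
from torchvision.transforms import functional as TF

def real_jpeg(x: torch.Tensor, quality: int = 50) -> torch.Tensor:
    """真实JPEG编解码(不可微),配合直通估计传递梯度(假设性做法)。"""
    outs = []
    for img in x.detach().cpu().clamp(0, 1):
        buf = io.BytesIO()
        TF.to_pil_image(img).save(buf, format='JPEG', quality=quality)
        buf.seek(0)
        outs.append(TF.to_tensor(Image.open(buf).convert('RGB')))
    y = torch.stack(outs).to(x.device)
    return x + (y - x).detach()   # 前向值为真实JPEG结果,反向梯度近似恒等

def simulated_jpeg(x: torch.Tensor) -> torch.Tensor:
    """可微的JPEG近似(此处仅以小噪声占位;论文中为可微的压缩模拟)。"""
    return x + 0.01 * torch.randn_like(x)

def mbrs_noise_layer(encoded: torch.Tensor) -> torch.Tensor:
    """核心技巧:每个mini-batch随机选择真实JPEG/模拟JPEG/无噪声之一。"""
    return random.choice([real_jpeg, simulated_jpeg, lambda t: t])(encoded)

stego = torch.rand(4, 3, 128, 128, requires_grad=True)  # 编码器输出的含水印图像
noised = mbrs_noise_layer(stego)                        # 再送入解码器提取消息
print(noised.shape)
```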

【3】 ALLNet: A Hybrid Convolutional Neural Network to Improve Diagnosis of Acute Lymphocytic Leukemia (ALL) in White Blood Cells 标题:ALLNet:一种改进白细胞急性淋巴细胞白血病(ALL)诊断的混合卷积神经网络 链接:https://arxiv.org/abs/2108.08195

作者:Sai Mattapalli,Rishi Athavale 机构:Thomas Jefferson High School for Science and Technology 备注:20 pages, 13 figures, 4 tables 摘要:由于微观层面上的形态学相似性,要在受急性淋巴细胞白血病(ALL)影响的血细胞与健康血细胞之间进行准确且及时的区分,需要借助机器学习架构。然而,最常见的三种模型VGG、ResNet和Inception各有缺陷、均有改进空间,这就需要一个更优的模型。ALLNet是我们提出的混合卷积神经网络结构,由VGG、ResNet和Inception模型组合而成。ISBI 2019的ALL挑战赛数据集(available here)包含10,691张白细胞图像,用于训练和测试模型。数据集中7,272张图像为ALL细胞,3,419张为健康细胞。其中60%的图像用于训练模型,20%用于交叉验证集,20%用于测试集。ALLNet在交叉验证集中的准确率为92.6567%,敏感性为95.5304%,特异性为85.9155%,AUC为0.966347,F1为0.94803,全面优于VGG、ResNet和Inception模型。在测试集中,ALLNet的准确率为92.0991%,敏感性为96.5446%,特异性为82.8035%,AUC为0.959972,F1为0.942963。在临床工作中使用ALLNet可以更好地帮助治疗世界各地成千上万的ALL患者,其中许多是儿童。 摘要:Due to morphological similarity at the microscopic level, making an accurate and time-sensitive distinction between blood cells affected by Acute Lymphocytic Leukemia (ALL) and their healthy counterparts calls for the usage of machine learning architectures. However, three of the most common models, VGG, ResNet, and Inception, each come with their own set of flaws with room for improvement which demands the need for a superior model. ALLNet, the proposed hybrid convolutional neural network architecture, consists of a combination of the VGG, ResNet, and Inception models. The ALL Challenge dataset of ISBI 2019 (available here) contains 10,691 images of white blood cells which were used to train and test the models. 7,272 of the images in the dataset are of cells with ALL and 3,419 of them are of healthy cells. Of the images, 60% were used to train the model, 20% were used for the cross-validation set, and 20% were used for the test set. ALLNet outperformed the VGG, ResNet, and the Inception models across the board, achieving an accuracy of 92.6567%, a sensitivity of 95.5304%, a specificity of 85.9155%, an AUC score of 0.966347, and an F1 score of 0.94803 in the cross-validation set. In the test set, ALLNet achieved an accuracy of 92.0991%, a sensitivity of 96.5446%, a specificity of 82.8035%, an AUC score of 0.959972, and an F1 score of 0.942963. The utilization of ALLNet in the clinical workspace can better treat the thousands of people suffering from ALL across the world, many of whom are children.

【4】 Effect of Parameter Optimization on Classical and Learning-based Image Matching Methods 标题:参数优化对经典和基于学习的图像匹配方法的影响 链接:https://arxiv.org/abs/2108.08179

作者:Ufuk Efe,Kutalmis Gokalp Ince,A. Aydin Alatan 机构:Department of Electrical and Electronics Engineering, Center for Image Analysis (OGAM), Middle East Technical University, Ankara, Turkey 备注:8 pages, 2 figures, 3 tables, ICCV 2021 TradiCV Workshop 摘要:近年来,基于深度学习的图像匹配方法得到了显著的改进。尽管有报道称这些方法优于经典方法,但经典方法的性能并未得到详细检验。在这项研究中,我们通过采用带比率测试的互最近邻搜索并优化比率测试阈值,使两类方法在两种不同的性能指标上都达到各自最佳性能,从而对经典方法和基于学习的方法进行比较。经过公平比较后,在HPatches数据集上的实验结果表明,经典方法和基于学习的方法之间的性能差距并没有那么显著。在整个实验过程中,我们证明了SuperGlue是HPatches数据集上图像匹配问题的最先进技术。然而,如果仔细优化一个参数,即比率测试阈值,那么著名的传统方法SIFT的性能会非常接近SuperGlue,甚至在1和2像素阈值下的平均匹配精度(MMA)方面优于SuperGlue。此外,最近的一种方法DFM仅使用预训练的VGG特征作为描述符并配合比率测试,就超过了大多数训练充分的基于学习的方法。因此,我们得出结论:在与基于学习的技术进行比较之前,应仔细分析任何经典方法的参数。 摘要:Deep learning-based image matching methods are improved significantly during the recent years. Although these methods are reported to outperform the classical techniques, the performance of the classical methods is not examined in detail. In this study, we compare classical and learning-based methods by employing mutual nearest neighbor search with ratio test and optimizing the ratio test threshold to achieve the best performance on two different performance metrics. After a fair comparison, the experimental results on HPatches dataset reveal that the performance gap between classical and learning-based methods is not that significant. Throughout the experiments, we demonstrated that SuperGlue is the state-of-the-art technique for the image matching problem on HPatches dataset. However, if a single parameter, namely ratio test threshold, is carefully optimized, a well-known traditional method SIFT performs quite close to SuperGlue and even outperforms in terms of mean matching accuracy (MMA) under 1 and 2 pixel thresholds. Moreover, a recent approach, DFM, which only uses pre-trained VGG features as descriptors and ratio test, is shown to outperform most of the well-trained learning-based methods. Therefore, we conclude that the parameters of any classical method should be analyzed carefully before comparing against a learning-based technique.
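摘要所强调的"互最近邻 + 比率测试(并仔细优化阈值)"流程用OpenCV十几行即可复现。下面是一个示意脚本;其中阈值0.8只是常用默认值,按摘要的结论应针对具体数据集与指标扫描调优:

```python
import cv2

def match_sift(img1_path: str, img2_path: str, ratio: float = 0.8):
    """SIFT + 互最近邻 + Lowe比率测试(示意;ratio 应按数据集与指标调优)。"""
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    bf = cv2.BFMatcher(cv2.NORM_L2)
    fwd = bf.knnMatch(des1, des2, k=2)                  # img1 -> img2 的两近邻
    best_bwd = {m[0].queryIdx: m[0].trainIdx
                for m in bf.knnMatch(des2, des1, k=2)}  # img2 -> img1 的最近邻
    good = []
    for m, n in fwd:
        mutual = best_bwd.get(m.trainIdx) == m.queryIdx  # 互最近邻检查
        if mutual and m.distance < ratio * n.distance:   # 比率测试
            good.append((kp1[m.queryIdx].pt, kp2[m.trainIdx].pt))
    return good

# 用法:matches = match_sift('ref.png', 'query.png', ratio=0.8)
```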

【5】 Spatially and color consistent environment lighting estimation using deep neural networks for mixed reality 标题:面向混合现实的基于深度神经网络的空间与颜色一致的环境光照估计 链接:https://arxiv.org/abs/2108.07903

作者:Bruno Augusto Dorta Marques,Esteban Walter Gonzalez Clua,Anselmo Antunes Montenegro,Cristina Nader Vasconcelos 机构:CMCC, Universidade Federal do ABC, Santo André, SP, Brazil; Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil 摘要:要呈现一致的混合现实(XR)环境,需要实时地对真实与虚拟光照进行恰当的合成,而估计真实场景的光照仍然是一个挑战。由于该问题的病态性质,经典的逆渲染技术只能针对简单的光照设置求解,这些假设已无法满足当前计算机图形学和XR应用的技术水平。虽然最近的许多工作使用机器学习技术来估计环境光照和场景材质,但其中大多数受限于几何信息或先验知识。本文提出了一个基于CNN的模型,在没有场景先验信息的情况下为混合现实环境估计复杂光照。我们使用一组球谐(SH)环境光照系数对环境光照进行建模,它能够高效地表示区域光照。我们提出了一种新的CNN结构,以RGB图像为输入,实时地识别环境光照。与以前基于CNN的光照估计方法不同,我们采用高度优化、参数量更少的深度神经网络结构,能够从真实世界的高动态范围(HDR)环境图像中学习高度复杂的光照场景。实验表明,在比较SH光照系数时,该CNN结构能以7.85e-04的平均均方误差(MSE)预测环境光照。我们在多种混合现实场景中验证了我们的模型。此外,我们还给出了真实场景重光照的定性比较结果。 摘要:The representation of consistent mixed reality (XR) environments requires adequate real and virtual illumination composition in real-time. Estimating the lighting of a real scenario is still a challenge. Due to the ill-posed nature of the problem, classical inverse-rendering techniques tackle the problem for simple lighting setups. However, those assumptions do not satisfy the current state-of-art in computer graphics and XR applications. While many recent works solve the problem using machine learning techniques to estimate the environment light and scene's materials, most of them are limited to geometry or previous knowledge. This paper presents a CNN-based model to estimate complex lighting for mixed reality environments with no previous information about the scene. We model the environment illumination using a set of spherical harmonics (SH) environment lighting, capable of efficiently represent area lighting. We propose a new CNN architecture that inputs an RGB image and recognizes, in real-time, the environment lighting. Unlike previous CNN-based lighting estimation methods, we propose using a highly optimized deep neural network architecture, with a reduced number of parameters, that can learn high complex lighting scenarios from real-world high-dynamic-range (HDR) environment images. We show in the experiments that the CNN architecture can predict the environment lighting with an average mean squared error (MSE) of 7.85e-04 when comparing SH lighting coefficients. We validate our model in a variety of mixed reality scenarios. Furthermore, we present qualitative results comparing relights of real-world scenes.
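摘要中"用一组球谐(SH)系数表示环境光照"对应一个很小的回归问题:2阶SH共9个基函数,RGB三通道合计27个系数,用MSE监督即可。以下是示意性的网络与损失,骨干结构为占位假设,并非论文原结构:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SHLightingRegressor(nn.Module):
    """示意:从RGB图像回归2阶球谐光照系数(9个基 x RGB三通道 = 27维);骨干为占位假设。"""
    def __init__(self, num_coeffs: int = 27):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_coeffs)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(img))

model = SHLightingRegressor()
img = torch.randn(8, 3, 128, 128)
pred = model(img)                    # (8, 27):预测的SH系数
target = torch.randn(8, 27)          # 真值SH系数,实际中由HDR环境图投影得到
loss = F.mse_loss(pred, target)      # 摘要中报告的MSE即在SH系数上计算
```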

【6】 Thermal Image Processing via Physics-Inspired Deep Networks 标题:基于物理启发的深度网络热图像处理 链接:https://arxiv.org/abs/2108.07973

作者:Vishwanath Saragadam,Akshat Dave,Ashok Veeraraghavan,Richard Baraniuk 机构:Rice University, Houston TX 备注:Accepted to 2nd ICCV workshop on Learning for Computational Imaging (LCI) 摘要:我们介绍了DeepIR,一种将物理精确的传感器建模与基于深度网络的图像表示相结合的新型热图像处理框架。我们的关键观察是:热传感器捕获的图像可以分解为缓慢变化、与场景无关的传感器非均匀性(可以用物理模型精确建模)和场景特定的辐射通量(可以用基于深度网络的正则化器很好地表示)。DeepIR既不需要训练数据,也不需要使用已知黑体目标进行定期的真值校准,因此非常适合实际的计算机视觉任务。我们通过开发新的去噪和超分辨率算法来展示DeepIR的威力,这些算法利用了在相机抖动下拍摄的多幅场景图像。仿真和真实数据实验表明,DeepIR只需三幅图像即可实现高质量的非均匀性校正,峰值信噪比(PSNR)比竞争方法高出10dB。 摘要:We introduce DeepIR, a new thermal image processing framework that combines physically accurate sensor modeling with deep network-based image representation. Our key enabling observations are that the images captured by thermal sensors can be factored into slowly changing, scene-independent sensor non-uniformities (that can be accurately modeled using physics) and a scene-specific radiance flux (that is well-represented using a deep network-based regularizer). DeepIR requires neither training data nor periodic ground-truth calibration with a known black body target--making it well suited for practical computer vision tasks. We demonstrate the power of going DeepIR by developing new denoising and super-resolution algorithms that exploit multiple images of the scene captured with camera jitter. Simulated and real data experiments demonstrate that DeepIR can perform high-quality non-uniformity correction with as few as three images, achieving a 10dB PSNR improvement over competing approaches.

【7】 OncoPetNet: A Deep Learning based AI system for mitotic figure counting on H&E stained whole slide digital images in a large veterinary diagnostic lab setting 标题:OncoPetNet:一种基于深度学习的人工智能系统,用于大型兽医诊断实验室环境下H&E染色全玻片数字图像的有丝分裂象计数 链接:https://arxiv.org/abs/2108.07856

作者:Michael Fitzke,Derick Whitley,Wilson Yau,Fernando Rodrigues Jr,Vladimir Fadeev,Cindy Bacmeister,Chris Carter,Jeffrey Edwards,Matthew P. Lungren,Mark Parkinson 机构:Mars Digital Technologies, Antech Diagnostics, Stanford University 摘要:背景:组织病理学是现代医疗中许多疾病诊断和管理的重要手段,在癌症诊疗中起着至关重要的作用。病理样本可能很大且需要多部位取样,单个肿瘤的玻片可多达20张,而由人类专家完成部位选择和有丝分裂象定量评估既耗时又主观。在数字病理服务环境中将这些任务自动化,为提高工作流效率和增强人类专家的实践能力提供了重要机会。方法:在OncoPetNet的开发过程中,使用了多种最先进的组织病理图像分类和有丝分裂象检测深度学习技术。此外,采用了无模型方法来提高速度和精度。健壮且可扩展的推理引擎利用了PyTorch的性能优化以及专门开发的推理加速技术。结果:与人类专家基线相比,所提系统在14种癌症类型的41例癌症病例中显示出显著改善的有丝分裂象计数性能。与人类专家评估相比,21.9%的病例因使用OncoPetNet而导致肿瘤分级发生变化。在部署中,该系统在2个中心的高通量兽医病理诊断服务中实现了每张玻片0.27分钟的有效推理速度,每天处理3,323张数字全玻片图像。结论:这项工作代表了深度学习系统首次成功的自动化部署,在高样本量的临床实践中大规模地在重要组织病理任务上实现了实时专家级性能。由此产生的影响概述了模型开发、部署和临床决策的重要考虑因素,并为在数字组织病理实践中实施深度学习系统提供了最佳实践。 摘要:Background: Histopathology is an important modality for the diagnosis and management of many diseases in modern healthcare, and plays a critical role in cancer care. Pathology samples can be large and require multi-site sampling, leading to upwards of 20 slides for a single tumor, and the human-expert tasks of site selection and quantitative assessment of mitotic figures are time consuming and subjective. Automating these tasks in the setting of a digital pathology service presents significant opportunities to improve workflow efficiency and augment human experts in practice. Approach: Multiple state-of-the-art deep learning techniques for histopathology image classification and mitotic figure detection were used in the development of OncoPetNet. Additionally, model-free approaches were used to increase speed and accuracy. The robust and scalable inference engine leverages Pytorch's performance optimizations as well as specifically developed speed up techniques in inference. Results: The proposed system, demonstrated significantly improved mitotic counting performance for 41 cancer cases across 14 cancer types compared to human expert baselines. In 21.9% of cases use of OncoPetNet led to change in tumor grading compared to human expert evaluation. In deployment, an effective 0.27 min/slide inference was achieved in a high throughput veterinary diagnostic pathology service across 2 centers processing 3,323 digital whole slide images daily. Conclusion: This work represents the first successful automated deployment of deep learning systems for real-time expert-level performance on important histopathology tasks at scale in a high volume clinical practice. The resulting impact outlines important considerations for model development, deployment, clinical decision making, and informs best practices for implementation of deep learning systems in digital histopathology practices.

其他(14篇)

【1】 Pixel-Perfect Structure-from-Motion with Featuremetric Refinement 标题:基于特征度量细化的像素级精确运动恢复结构 链接:https://arxiv.org/abs/2108.08291

作者:Philipp Lindenberger,Paul-Edouard Sarlin,Viktor Larsson,Marc Pollefeys 机构:Departments of Mathematics and Computer Science, ETH Zurich; Microsoft 备注:Accepted to ICCV 2021 for oral presentation 摘要:寻找可在多个视图中重复出现的局部特征是稀疏三维重建的基石。经典的图像匹配范式一次性地检测每幅图像的关键点,这可能产生定位不佳的特征,并将较大的误差传播到最终的几何结构。在本文中,我们通过直接对齐来自多个视图的低层图像信息来细化运动恢复结构(SfM)的两个关键步骤:首先在任何几何估计之前调整初始关键点位置,随后在后处理中细化三维点和相机位姿。这种细化对较大的检测噪声和外观变化具有鲁棒性,因为它优化的是基于神经网络预测的稠密特征的特征度量误差。这显著提高了各种关键点检测器、具有挑战性的观测条件和现成深度特征下的相机位姿和场景几何精度。我们的系统可以轻松扩展到大型图像集合,实现大规模像素级精确的众包定位。我们的代码已在 https://github.com/cvg/pixel-perfect-sfm 公开,作为流行SfM软件COLMAP的插件。 摘要:Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction. The classical image matching paradigm detects keypoints per-image once and for all, which can yield poorly-localized features and propagate large errors to the final geometry. In this paper, we refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views: we first adjust the initial keypoint locations prior to any geometric estimation, and subsequently refine points and camera poses as a post-processing. This refinement is robust to large detection noise and appearance changes, as it optimizes a featuremetric error based on dense features predicted by a neural network. This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors, challenging viewing conditions, and off-the-shelf deep features. Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale. Our code is publicly available at https://github.com/cvg/pixel-perfect-sfm as an add-on to the popular SfM software COLMAP.
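下面是一个体现"特征度量细化"思想的极简示意:在稠密特征图上用梯度下降微调目标视图中的关键点位置,使其特征与参考视图一致(仅为思路示意,非 pixel-perfect-sfm 的源码):

```python
import torch
import torch.nn.functional as F

def refine_keypoint(feat_ref, feat_tgt, kp_ref, kp_tgt_init, iters=50, lr=0.05):
    """feat_*: (1,C,H,W) 稠密特征图;kp_*: (2,) 归一化到 [-1,1] 的坐标。"""
    kp_tgt = kp_tgt_init.clone().requires_grad_(True)
    f_ref = F.grid_sample(feat_ref, kp_ref.view(1, 1, 1, 2),
                          align_corners=True).detach()
    opt = torch.optim.Adam([kp_tgt], lr=lr)
    for _ in range(iters):
        f_tgt = F.grid_sample(feat_tgt, kp_tgt.view(1, 1, 1, 2),
                              align_corners=True)
        loss = (f_ref - f_tgt).pow(2).sum()       # 特征度量误差
        opt.zero_grad(); loss.backward(); opt.step()
    return kp_tgt.detach()

kp = refine_keypoint(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32),
                     torch.zeros(2), torch.zeros(2))
```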

【2】 Stochastic Scene-Aware Motion Prediction 标题:随机场景感知运动预测 链接:https://arxiv.org/abs/2108.08284

作者:Mohamed Hassan,Duygu Ceylan,Ruben Villegas,Jun Saito,Jimei Yang,Yi Zhou,Michael Black 机构:Max Planck Institute for Intelligent Systems, Tübingen, Germany, Adobe Research 备注:ICCV2021 摘要:计算机视觉的一个长期目标是捕捉、建模并逼真地合成人类行为。具体来说,通过从数据中学习,我们的目标是使虚拟人能够在杂乱的室内场景中导航,并自然地与物体交互。这种具身行为在虚拟现实、计算机游戏和机器人技术中均有应用,而合成的行为也可以用作训练数据的来源。这一问题的挑战在于真实的人体运动是多样的,并且会适应场景。例如,一个人可以在沙发上的许多位置、以不同的姿态坐下或躺下。在合成能够逼真执行人与场景交互的虚拟人时,有必要对这种多样性进行建模。我们提出了一种新的数据驱动的随机运动合成方法,对针对目标物体执行给定动作的不同风格进行建模。我们的方法称为SAMP(Scene-Aware Motion Prediction,场景感知运动预测),可泛化到各种几何形状的目标物体,同时使角色能够在杂乱的场景中导航。为了训练我们的方法,我们采集了涵盖各种坐、躺、行走和跑步风格的MoCap数据。我们在复杂的室内场景中演示了我们的方法,并取得了优于现有方案的性能。我们的代码和数据可在 https://samp.is.tue.mpg.de 获取。 摘要:A long-standing goal in computer vision is to capture, model, and realistically synthesize human behavior. Specifically, by learning from data, our goal is to enable virtual humans to navigate within cluttered indoor scenes and naturally interact with objects. Such embodied behavior has applications in virtual reality, computer games, and robotics, while synthesized behavior can be used as a source of training data. This is challenging because real human motion is diverse and adapts to the scene. For example, a person can sit or lie on a sofa in many places and with varying styles. It is necessary to model this diversity when synthesizing virtual humans that realistically perform human-scene interactions. We present a novel data-driven, stochastic motion synthesis method that models different styles of performing a given action with a target object. Our method, called SAMP, for Scene-Aware Motion Prediction, generalizes to target objects of various geometries while enabling the character to navigate in cluttered scenes. To train our method, we collected MoCap data covering various sitting, lying down, walking, and running styles. We demonstrate our method on complex indoor scenes and achieve superior performance compared to existing solutions. Our code and data are available for research at https://samp.is.tue.mpg.de.
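下面用一个极简的条件生成模型示意"同一目标物体条件下采样多种动作风格"的做法(结构与维度均为笔者假设,并非 SAMP 的官方模型;训练所需的编码端从略):

```python
import torch
import torch.nn as nn

class MotionSampler(nn.Module):
    """条件 VAE 的解码端:同一场景/物体条件下采样多种动作风格。"""
    def __init__(self, pose_dim=63, cond_dim=128, z_dim=32):
        super().__init__()
        self.z_dim = z_dim
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, pose_dim))

    def sample(self, cond, n=5):
        z = torch.randn(n, self.z_dim)            # 随机潜变量带来动作多样性
        return self.dec(torch.cat([z, cond.expand(n, -1)], dim=-1))

poses = MotionSampler().sample(torch.randn(1, 128))   # 同一条件下 5 种姿态
```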

【3】 LOKI: Long Term and Key Intentions for Trajectory Prediction 标题:LOKI:轨迹预测的长期和关键意图 链接:https://arxiv.org/abs/2108.08236

作者:Harshayu Girase,Haiming Gang,Srikanth Malla,Jiachen Li,Akira Kanehara,Karttikeya Mangalam,Chiho Choi 机构:Honda Research Institute USA, University of California, Berkeley, Honda R&D Co., Ltd. 备注:ICCV 2021 (The dataset is available at this https URL) 摘要:轨迹预测方面的最新进展表明,对智能体意图进行显式推理对于准确预测其运动非常重要。然而,目前的研究成果还不能直接应用于智能和安全关键系统。这主要是因为可用的公共数据集很少,而且它们仅从受限的自我中心视角出发、在较短的时间范围内考虑行人特有的意图。为此,我们提出了LOKI(LOng term and Key Intentions),这是一个新型大规模数据集,旨在解决自动驾驶环境中异构交通参与者(行人和车辆)的轨迹与意图联合预测问题。LOKI数据集的构建是为了发现可能影响意图的若干因素,包括:i)智能体自身的意愿,ii)社会交互,iii)环境约束,以及iv)上下文信息。我们还提出了一个联合执行轨迹和意图预测的模型,表明对意图的循环推理可以辅助轨迹预测。我们的方法比最先进的轨迹预测方法高出多达27%,并且还为逐帧意图估计提供了基线。 摘要:Recent advances in trajectory prediction have shown that explicit reasoning about agents' intent is important to accurately forecast their motion. However, the current research activities are not directly applicable to intelligent and safety critical systems. This is mainly because very few public datasets are available, and they only consider pedestrian-specific intents for a short temporal horizon from a restricted egocentric view. To this end, we propose LOKI (LOng term and Key Intentions), a novel large-scale dataset that is designed to tackle joint trajectory and intention prediction for heterogeneous traffic agents (pedestrians and vehicles) in an autonomous driving setting. The LOKI dataset is created to discover several factors that may affect intention, including i) agent's own will, ii) social interactions, iii) environmental constraints, and iv) contextual information. We also propose a model that jointly performs trajectory and intention prediction, showing that recurrently reasoning about intention can assist with trajectory prediction. We show our method outperforms state-of-the-art trajectory prediction methods by up to 27% and also provide a baseline for frame-wise intention estimation.
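下面是一个"循环地先推理意图、再预测下一步位置"的联合模型示意(结构为笔者假设,仅用于说明摘要中的思路,并非 LOKI 原模型):

```python
import torch
import torch.nn as nn

class JointTrajIntent(nn.Module):
    """逐帧交替进行意图推理与位移预测的循环模型。"""
    def __init__(self, n_intent=8, hid=64):
        super().__init__()
        self.hid, self.n_intent = hid, n_intent
        self.gru = nn.GRUCell(2 + n_intent, hid)
        self.intent_head = nn.Linear(hid, n_intent)
        self.pos_head = nn.Linear(hid + n_intent, 2)

    def forward(self, pos, steps=12):
        h = torch.zeros(pos.size(0), self.hid)
        intent = torch.zeros(pos.size(0), self.n_intent)
        preds = []
        for _ in range(steps):
            h = self.gru(torch.cat([pos, intent], -1), h)
            intent = self.intent_head(h).softmax(-1)               # 先推理意图
            pos = pos + self.pos_head(torch.cat([h, intent], -1))  # 再预测位移
            preds.append(pos)
        return torch.stack(preds, 1), intent

traj, intent = JointTrajIntent()(torch.randn(4, 2))   # (4,12,2) 轨迹与末帧意图
```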

【4】 Research on Gender-related Fingerprint Features 标题:与性别相关的指纹特征研究 链接:https://arxiv.org/abs/2108.08233

作者:Yong Qi,Yanping Li,Huawei Lin,Jiashu Chen,Huaiguang Lei 机构:Shaanxi Joint Laboratory of Artificial Intelligence (Shaanxi University of Science & Technology), School of Computer Science, Northwestern Polytechnical University 备注:11 pages, 6 figures, 4 tables 摘要:指纹是人体重要的生物特征,包含丰富的性别信息。目前,学术界对指纹性别特征的研究大多停留在认识层面,而规范化研究则相当有限。在这项工作中,我们提出了一种更稳健的方法,即密集空洞卷积ResNet(DDC-ResNet),用于从指纹中提取有效的性别信息。通过在主干网络中用空洞卷积(atrous convolution)代替普通卷积运算,既提供了保持边缘细节的先验知识,又扩大了全局感受野。我们从三个方面考察了结果:1)DDC-ResNet的效率。在公平的实现细节下,我们在数据集上评估了6种典型的自动特征提取方法与9种主流分类器的组合。实验结果表明,我们的方法组合在平均准确率和分性别准确率方面均优于其他组合,平均准确率达到96.5%,分性别准确率为0.9752(男性)/0.9548(女性)。2)手指的影响。研究发现,右手无名指的分性别分类效果最好。3)特定特征的影响。根据我们方法可视化得到的指纹关注区域,可以推断箕形纹与斗形纹(一级特征)、分叉点(二级特征)以及纹线形状(三级特征)与性别相关。最后,我们将开源包含6000幅指纹图像的数据集。 摘要:Fingerprint is an important biological feature of human body, which contains abundant gender information. At present, the academic research of fingerprint gender characteristics is generally at the level of understanding, while the standardization research is quite limited. In this work, we propose a more robust method, Dense Dilated Convolution ResNet (DDC-ResNet) to extract valid gender information from fingerprints. By replacing the normal convolution operations with the atrous convolution in the backbone, prior knowledge is provided to keep the edge details and the global reception field can be extended. We explored the results in 3 ways: 1) The efficiency of the DDC-ResNet. 6 typical methods of automatic feature extraction coupling with 9 mainstream classifiers are evaluated in our dataset with fair implementation details. Experimental results demonstrate that the combination of our approach outperforms other combinations in terms of average accuracy and separate-gender accuracy. It reaches 96.5% for average and 0.9752 (males)/0.9548 (females) for separate-gender accuracy. 2) The effect of fingers. It is found that the best performance of classifying gender with separate fingers is achieved by the right ring finger. 3) The effect of specific features. Based on the observations of the concentrations of fingerprints visualized by our approach, it can be inferred that loops and whorls (level 1), bifurcations (level 2), as well as line shapes (level 3) are connected with gender. Finally, we will open source the dataset that contains 6000 fingerprint images.
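摘要提到在主干中以空洞卷积替换普通卷积以扩大感受野并保留边缘细节。DDC-ResNet 的具体结构摘要未给出,下面是这一思路的一个常见写法(假设性示意):

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """以空洞卷积构成的残差块:感受野随 dilation 扩大,输出分辨率不变。"""
    def __init__(self, ch=64, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(x + self.body(x))   # 残差连接

y = DilatedResBlock()(torch.randn(1, 64, 96, 96))  # 输出仍为 (1,64,96,96)
```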

【5】 X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics 标题:X-modaler:一个用于跨模态分析的通用高性能代码库 链接:https://arxiv.org/abs/2108.08217

作者:Yehao Li,Yingwei Pan,Jingwen Chen,Ting Yao,Tao Mei 机构:JD AI Research, Beijing, China 备注:Accepted by 2021 ACMMM Open Source Software Competition. Source code: this https URL 摘要:随着深度学习在过去十年中的兴起和发展,创新和突破的势头持续不断,有力地推动了多媒体领域视觉与语言之间跨模态分析的最新进展。然而,目前还没有一个开源代码库能够以统一、模块化的方式支持跨模态分析中众多神经网络模型的训练和部署。在这项工作中,我们提出了X-modaler,一个通用的高性能代码库,它将最先进的跨模态分析封装为几个通用阶段(例如预处理、编码器、跨模态交互、解码器和解码策略)。每个阶段都提供了涵盖最新研究中广泛采用的一系列模块的功能,并允许在这些模块之间无缝切换。这种方式自然支持灵活地实现图像描述、视频描述和视觉-语言预训练的最新算法,以促进研究社区的快速发展。同时,由于若干阶段(例如跨模态交互)的有效模块化设计在不同视觉-语言任务之间共享,X-modaler可以方便地扩展,为跨模态分析中的其他任务(包括视觉问答、视觉常识推理和跨模态检索)快速搭建原型。X-modaler是Apache许可的代码库,其源代码、示例项目和预训练模型可在线获取:https://github.com/YehLi/xmodaler. 摘要:With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art of cross-modal analytics between vision and language in multimedia field. Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler -- a versatile and high-performance codebase that encapsulates the state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage is empowered with the functionality that covers a series of modules widely adopted in state-of-the-arts and allows seamless switching in between. This way naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development of research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source codes, sample projects and pre-trained models are available on-line: https://github.com/YehLi/xmodaler.
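下面用一个极简的流水线骨架示意摘要所述的五个通用阶段及其可替换性(接口与命名为笔者假设,实际 API 以 X-modaler 仓库为准):

```python
from typing import Callable, Dict

class CrossModalPipeline:
    """按"预处理 -> 编码器 -> 跨模态交互 -> 解码器 -> 解码策略"组织的可替换流水线。"""
    STAGES = ("preprocess", "encoder", "interaction", "decoder", "decode_strategy")

    def __init__(self, modules: Dict[str, Callable]):
        assert set(modules) == set(self.STAGES), "五个阶段须全部给出"
        self.modules = modules

    def run(self, image, text):
        x = self.modules["preprocess"](image, text)
        x = self.modules["encoder"](x)
        x = self.modules["interaction"](x)     # 换掉这一阶段即可复用于其他任务
        x = self.modules["decoder"](x)
        return self.modules["decode_strategy"](x)
```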

【6】 ME-PCN: Point Completion Conditioned on Mask Emptiness 标题:ME-PCN:以掩码空白为条件的点云补全 链接:https://arxiv.org/abs/2108.08187

作者:Bingchen Gong,Yinyu Nie,Yiqun Lin,Xiaoguang Han,Yizhou Yu 机构:The University of Hong Kong, Technical University of Munich, SSE, CUHK (SZ) 备注:to appear in ICCV 2021 摘要:点云补全是指从不完整的观测中补全物体缺失的几何结构。主流方法通过解码从输入点云学习到的全局特征来预测缺失的形状,这通常导致在保持拓扑一致性和表面细节方面的不足。在这项工作中,我们提出了ME-PCN,一个利用3D形状空间中"空白"信息的点云补全网络。给定一次深度扫描,以往的方法通常只编码被占据的部分形状,而忽略深度图中的空白区域(例如孔洞)。相比之下,我们认为这些"空白"线索指示了形状边界,可用于改进表面的拓扑表示和细节粒度。具体而言,我们的ME-PCN同时编码被占据的点云和邻近的"空点"。它在第一阶段估计粗粒度但完整且合理的表面点,然后通过细化阶段生成细粒度的表面细节。综合实验证明,我们的ME-PCN在定性和定量方面均优于最新技术。此外,我们进一步证明了我们的"空白"设计是轻量级的,并且易于嵌入到现有方法中,在改善CD和EMD指标方面显示出一致的有效性。 摘要:Point completion refers to completing the missing geometries of an object from incomplete observations. Main-stream methods predict the missing shapes by decoding a global feature learned from the input point cloud, which often leads to deficient results in preserving topology consistency and surface details. In this work, we present ME-PCN, a point completion network that leverages `emptiness' in 3D shape space. Given a single depth scan, previous methods often encode the occupied partial shapes while ignoring the empty regions (e.g. holes) in depth maps. In contrast, we argue that these `emptiness' clues indicate shape boundaries that can be used to improve topology representation and detail granularity on surfaces. Specifically, our ME-PCN encodes both the occupied point cloud and the neighboring `empty points'. It estimates coarse-grained but complete and reasonable surface points in the first stage, followed by a refinement stage to produce fine-grained surface details. Comprehensive experiments verify that our ME-PCN presents better qualitative and quantitative performance against the state-of-the-art. Besides, we further prove that our `emptiness' design is lightweight and easy to embed in existing methods, which shows consistent effectiveness in improving the CD and EMD scores.
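下面示意如何从单次深度扫描中提取摘要所说的"空点":对深度图中无回波的空洞像素,沿视线方向采样若干三维点(相机内参与采样方式均为笔者假设):

```python
import numpy as np

def empty_points_from_depth(depth, fx=500., fy=500., cx=128., cy=128., n_samples=4):
    """depth: (H,W) 深度图,0 表示无回波的空洞像素;返回 (M,3) 的空点。"""
    ys, xs = np.nonzero(depth == 0)               # 空洞像素的坐标
    d_max = depth[depth > 0].max()
    pts = []
    for z in np.linspace(0.2, 1.0, n_samples) * d_max:   # 沿视线均匀采样深度
        X = (xs - cx) / fx * z                    # 针孔模型反投影
        Y = (ys - cy) / fy * z
        pts.append(np.stack([X, Y, np.full_like(X, z)], axis=1))
    return np.concatenate(pts, axis=0)
```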

【7】 Active Observer Visual Problem-Solving Methods are Dynamically Hypothesized, Deployed and Tested 标题:主动观察者的视觉问题求解方法是被动态假设、部署和测试的 链接:https://arxiv.org/abs/2108.08145

作者:Markus D. Solbach,John K. Tsotsos 机构:Electrical Engineering and Computer Science, York University, Toronto, Canada 备注:15 pages, 6 figures 摘要:STAR体系结构旨在检验完整的选择性调谐(Selective Tuning)视觉注意模型在复杂现实视觉空间任务和行为中的价值。然而,关于人类作为主动观察者如何在3D环境中解决这些任务的知识仍很匮乏。因此,我们设计了一种新的实验装置并研究了这种行为。我们发现,人类表现出各种各样的问题求解策略,其广度和复杂性令人惊讶,现有方法难以轻易处理。很明显,求解方法是动态组合出来的:先假设动作序列,再进行测试,如果失败,则尝试不同的序列。主动观察的重要性十分显著,同样显著的是没有观察到任何学习效应。这些结果为STAR的认知程序(Cognitive Program)表示提供了依据,并将其相关性扩展到现实世界任务。 摘要:The STAR architecture was designed to test the value of the full Selective Tuning model of visual attention for complex real-world visuospatial tasks and behaviors. However, knowledge of how humans solve such tasks in 3D as active observers is lean. We thus devised a novel experimental setup and examined such behavior. We discovered that humans exhibit a variety of problem-solving strategies whose breadth and complexity are surprising and not easily handled by current methodologies. It is apparent that solution methods are dynamically composed by hypothesizing sequences of actions, testing them, and if they fail, trying different ones. The importance of active observation is striking as is the lack of any learning effect. These results inform our Cognitive Program representation of STAR extending its relevance to real-world tasks.

【8】 Image Collation: Matching illustrations in manuscripts 标题:图像整理:手稿中插图的匹配 链接:https://arxiv.org/abs/2108.08109

作者:Ryad Kaoua,Xi Shen,Alexandra Durr,Stavros Lazaris,David Picard,Mathieu Aubry 机构:LIGM, Ecole des Ponts, Univ. Gustave Eiffel, CNRS, Marne-la-Vallée, France, Université de Versailles-Saint-Quentin-en-Yvelines, France, CNRS (UMR), France 备注:accepted to ICDAR 2021 摘要:插图是知识传播的重要载体。对于历史学家来说,要在一批相似手稿的语料中研究插图的演变,第一步是确定哪些插图彼此对应。对于因大量副本散佚而彼此相隔、流传数个世纪的手稿而言,这一图像整理任务十分艰巨:这些手稿可能已被完全重新编排,并为适应新的知识或信仰而经过重大修改,且往往包含数百幅插图。我们在本文中的贡献有三方面。首先,我们提出了插图整理任务,并给出一个带标注的大型公共数据集用于评估解决方案,其中包括2种不同文本的6份手稿,包含2000多幅插图和1200个标注的对应关系。其次,我们针对该任务分析了最先进的相似性度量方法,结果表明它们在简单情形下表现良好,但对于大型手稿,当插图发生了非常显著的变化且仅能通过细微细节区分时,这些方法就难以应对。最后,我们给出了明确的证据,表明利用循环一致性对应关系可以带来显著的性能提升。我们的代码和数据可在 http://imagine.enpc.fr/~shenx/ImageCollation 获取。 摘要:Illustrations are an essential transmission instrument. For an historian, the first step in studying their evolution in a corpus of similar manuscripts is to identify which ones correspond to each other. This image collation task is daunting for manuscripts separated by many lost copies, spreading over centuries, which might have been completely re-organized and greatly modified to adapt to novel knowledge or belief and include hundreds of illustrations. Our contributions in this paper are threefold. First, we introduce the task of illustration collation and a large annotated public dataset to evaluate solutions, including 6 manuscripts of 2 different texts with more than 2 000 illustrations and 1 200 annotated correspondences. Second, we analyze state of the art similarity measures for this task and show that they succeed in simple cases but struggle for large manuscripts when the illustrations have undergone very significant changes and are discriminated only by fine details. Finally, we show clear evidence that significant performance boosts can be expected by exploiting cycle-consistent correspondences. Our code and data are available on http://imagine.enpc.fr/~shenx/ImageCollation.
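下面是"利用循环一致性筛选插图对应关系"的一个简化示意:在三份手稿两两的相似度矩阵上,检查 A→B→C 的匹配链与 A→C 的直接匹配是否闭合成环(实现为笔者的简化版本):

```python
import numpy as np

def cycle_consistent_matches(S_ab, S_bc, S_ac):
    """S_xy[i, j] 为手稿 X 第 i 幅与手稿 Y 第 j 幅插图的相似度。"""
    matches = []
    for i in range(S_ab.shape[0]):
        j = S_ab[i].argmax()          # A_i -> B_j
        k = S_bc[j].argmax()          # B_j -> C_k
        if S_ac[i].argmax() == k:     # A_i -> C 的直接匹配是否闭合成环
            matches.append((i, j, k))
    return matches

S_ab, S_bc, S_ac = np.random.rand(20, 22), np.random.rand(22, 21), np.random.rand(20, 21)
print(cycle_consistent_matches(S_ab, S_bc, S_ac))
```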

【9】 Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates 标题:语音驱动模板:基于学习模板的协同语音手势合成 链接:https://arxiv.org/abs/2108.08020

作者:Shenhan Qian,Zhi Tu,YiHao Zhi,Wen Liu,Shenghua Gao 机构:ShanghaiTech University, Shanghai Engineering Research Center of Intelligent Vision and Imaging, Shanghai Engineering Research Center of Energy Efficient and Custom AI IC 备注:Accepted by ICCV 2021 摘要:伴随语音的手势生成旨在合成一个不仅看起来真实、而且与输入语音音频相匹配的手势序列。我们的方法生成包括手臂、手和头部在内的完整上半身运动。尽管最近的数据驱动方法取得了巨大成功,但仍然存在一些挑战,例如多样性有限、保真度差以及缺乏客观度量。鉴于语音并不能完全确定手势,我们设计了一种方法,学习一组手势模板向量来建模潜在条件,以缓解这种歧义。在我们的方法中,模板向量决定所生成手势序列的整体外观,而语音音频驱动身体的细微运动,两者对于合成逼真的手势序列都必不可少。由于难以为手势-语音同步定义客观指标,我们采用唇形同步误差作为代理指标来调节和评估模型的同步能力。大量实验表明,我们的方法在保真度和同步性的客观与主观评价中均表现出优越性。 摘要:Co-speech gesture generation is to synthesize a gesture sequence that not only looks real but also matches with the input speech audio. Our method generates the movements of a complete upper body, including arms, hands, and the head. Although recent data-driven methods achieve great success, challenges still exist, such as limited variety, poor fidelity, and lack of objective metrics. Motivated by the fact that the speech cannot fully determine the gesture, we design a method that learns a set of gesture template vectors to model the latent conditions, which relieve the ambiguity. For our method, the template vector determines the general appearance of a generated gesture sequence, while the speech audio drives subtle movements of the body, both indispensable for synthesizing a realistic gesture sequence. Due to the intractability of an objective metric for gesture-speech synchronization, we adopt the lip-sync error as a proxy metric to tune and evaluate the synchronization ability of our model. Extensive experiments show the superiority of our method in both objective and subjective evaluations on fidelity and synchronization.
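下面示意"模板向量决定整体风格、音频驱动细微运动"的组合方式(维度与结构均为笔者假设,仅用于说明两路条件如何拼接,并非论文模型):

```python
import torch
import torch.nn as nn

class TemplateGestureDecoder(nn.Module):
    def __init__(self, n_templates=10, tpl_dim=64, audio_dim=80, pose_dim=54):
        super().__init__()
        self.templates = nn.Embedding(n_templates, tpl_dim)  # 可学习的手势模板库
        self.rnn = nn.GRU(audio_dim + tpl_dim, 128, batch_first=True)
        self.head = nn.Linear(128, pose_dim)

    def forward(self, audio_feats, template_id):
        """audio_feats: (B,T,80) 逐帧音频特征;template_id: (B,) 模板索引。"""
        tpl = self.templates(template_id)                     # (B,tpl_dim) 决定整体风格
        tpl = tpl.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        out, _ = self.rnn(torch.cat([audio_feats, tpl], -1))  # 音频驱动逐帧变化
        return self.head(out)                                 # (B,T,pose_dim) 姿态序列

poses = TemplateGestureDecoder()(torch.randn(2, 100, 80), torch.tensor([0, 3]))
```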

【10】 Object Disparity 标题:物体视差 链接:https://arxiv.org/abs/2108.07939

作者:Ynjiun Paul Wang 机构:Cupertino, CA 备注:10 pages, 13 figures, 7 tables 摘要:大多数立体视觉工作都集中于计算给定左右图像对的稠密像素视差。相机对通常需要进行镜头去畸变和立体标定,以提供去畸变、极线校正后的图像对,用于精确的稠密像素视差计算。由于噪声、物体遮挡、纹理重复或缺失以及匹配算法的局限,物体边界区域的像素视差精度通常受影响最大。尽管从统计上看,像素视差误差的总数可能较低(根据当前Kitti Vision基准排名靠前的算法,低于2%),但这些视差误差出现在物体边界处的比例非常高。这使得后续的三维物体距离检测精度远低于预期。本文提出了一种解决三维物体距离检测的不同思路:直接检测物体级视差,而无需经过稠密像素视差计算。我们构建了一个基于SqueezeNet的物体视差SSD(OD-SSD)示例,证明其能高效地检测物体视差,且精度与Kitti数据集的像素视差真值相当。使用多个不同立体系统采集的混合图像数据集进行的进一步训练和测试结果表明,OD-SSD可能对立体系统参数不敏感,例如基线、FOV、镜头畸变,甚至左右相机的极线错位。 摘要:Most of stereo vision works are focusing on computing the dense pixel disparity of a given pair of left and right images. A camera pair usually required lens undistortion and stereo calibration to provide an undistorted epipolar line calibrated image pair for accurate dense pixel disparity computation. Due to noise, object occlusion, repetitive or lack of texture and limitation of matching algorithms, the pixel disparity accuracy usually suffers the most at those object boundary areas. Although statistically the total number of pixel disparity errors might be low (under 2% according to the Kitti Vision Benchmark of current top ranking algorithms), the percentage of these disparity errors at object boundaries are very high. This renders the subsequent 3D object distance detection with much lower accuracy than desired. This paper proposed a different approach for solving a 3D object distance detection by detecting object disparity directly without going through a dense pixel disparity computation. An example squeezenet Object Disparity-SSD (OD-SSD) was constructed to demonstrate an efficient object disparity detection with comparable accuracy compared with Kitti dataset pixel disparity ground truth. Further training and testing results with mixed image dataset captured by several different stereo systems may suggest that an OD-SSD might be agnostic to stereo system parameters such as a baseline, FOV, lens distortion, even left/right camera epipolar line misalignment.
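一旦直接检测到"物体级视差",到物体的距离即可由标准立体几何换算,如下所示(数值仅为示例;f 为以像素为单位的焦距,B 为基线长度):

```python
def object_distance(disparity_px: float, focal_px: float = 1000.0,
                    baseline_m: float = 0.12) -> float:
    """标准立体测距公式 Z = f * B / d;disparity_px 为检测到的物体视差(像素)。"""
    return focal_px * baseline_m / disparity_px

print(object_distance(8.0))   # 视差 8 px -> 15.0 m
```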

【11】 M-ar-K-Fast Independent Component Analysis 标题:M-Ar-K-快速独立分量分析 链接:https://arxiv.org/abs/2108.07908

作者:Luca Parisi 机构:Coventry, United Kingdom, PhD in Machine Learning for Clinical Decision Support Systems, MBA Candidate with Artificial Intelligence Specialism 备注:17 pages, 2 listings/Python code snippets, 2 figures, 5 tables. arXiv admin note: text overlap with arXiv:2009.07530 摘要:本研究提出了用于特征提取的m-arcsinh核(m-ar-K)快速独立分量分析(FastICA)方法(m-ar-K-FastICA)。核技巧使降维技术能够捕获数据中更高程度的非线性;然而,用于辅助特征提取的可复现开源核函数仍然有限,并且在从高熵数据投影特征时可能不可靠。m-ar-K函数在Python中免费提供,并与其开源库scikit-learn兼容,在此与FastICA相结合,以便在数据高度随机的情况下实现更可靠的特征提取,减少对预白化的需求。我们考虑了不同的分类任务,涉及五个(N = 5)具有不同信息熵程度的开放数据集,这些数据集来自scikit-learn和加州大学欧文分校(UCI)机器学习库。实验结果表明,所提出的特征提取方法提升了分类性能。新的m-ar-K-FastICA降维方法与"FastICA"金标准方法进行了比较,无论数据中存在何种潜在不确定性,结果都支持其更高的可靠性和计算效率。 摘要:This study presents the m-arcsinh Kernel ('m-ar-K') Fast Independent Component Analysis ('FastICA') method ('m-ar-K-FastICA') for feature extraction. The kernel trick has enabled dimensionality reduction techniques to capture a higher extent of non-linearity in the data; however, reproducible, open-source kernels to aid with feature extraction are still limited and may not be reliable when projecting features from entropic data. The m-ar-K function, freely available in Python and compatible with its open-source library 'scikit-learn', is hereby coupled with FastICA to achieve more reliable feature extraction in presence of a high extent of randomness in the data, reducing the need for pre-whitening. Different classification tasks were considered, as related to five (N = 5) open access datasets of various degrees of information entropy, available from scikit-learn and the University California Irvine (UCI) Machine Learning repository. Experimental results demonstrate improvements in the classification performance brought by the proposed feature extraction. The novel m-ar-K-FastICA dimensionality reduction approach is compared to the 'FastICA' gold standard method, supporting its higher reliability and computational efficiency, regardless of the underlying uncertainty in the data.
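下面示意"先经 m-arcsinh 映射、再用 scikit-learn 的 FastICA 提取特征"的流程。摘要未给出核的具体形式,这里沿用作者早先工作(arXiv:2009.07530)中 m-arcsinh 的定义 (1/3)·arcsinh(x)·(1/4)·√|x|,若有出入请以原文为准:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA

def m_arcsinh(x):
    # m-arcsinh 形式取自 arXiv:2009.07530,属笔者假设
    return (1.0 / 3.0) * np.arcsinh(x) * (1.0 / 4.0) * np.sqrt(np.abs(x))

X, y = load_digits(return_X_y=True)
X_feat = FastICA(n_components=32, random_state=0).fit_transform(m_arcsinh(X))
print(X_feat.shape)   # (1797, 32)
```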

【12】 Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs 标题:全局池化并非表面那么简单:位置信息在CNN中按通道(Channel-Wise)编码 链接:https://arxiv.org/abs/2108.07884

作者:Md Amirul Islam,Matthew Kowal,Sen Jia,Konstantinos G. Derpanis,Neil D. B. Bruce 机构:Ryerson University, Canada, York University, Canada, University of Guelph, Canada, Toronto AI Lab, LG Electronics, Samsung AI Centre Toronto, Canada, Vector Institute for AI, Canada 备注:ICCV 2021 摘要:在本文中,我们挑战了一个普遍的假设,即通过全局池化将卷积神经网络(CNN)中三维(空间-通道)张量的空间维度压缩为一个向量会丢失所有空间信息。具体来说,我们证明了位置信息是基于通道维度的顺序编码的,而语义信息在很大程度上不是。基于这一发现,我们将其应用于两个应用场景,以展示其现实影响。首先,我们提出了一种简单而有效的数据增强策略和损失函数,以提高CNN输出的平移不变性。其次,我们提出了一种方法,可以高效地确定潜在表示中的哪些通道负责(i)编码整体位置信息或(ii)特定区域的位置。我们首先表明,语义分割在很大程度上依赖整体位置通道来进行预测。随后我们首次展示,有可能执行"特定区域"攻击,使网络在输入的特定部分性能下降。我们相信,我们的发现及所展示的应用将有益于关注CNN特征特性理解的相关研究领域。 摘要:In this paper, we challenge the common assumption that collapsing the spatial dimensions of a 3D (spatial-channel) tensor in a convolutional neural network (CNN) into a vector via global pooling removes all spatial information. Specifically, we demonstrate that positional information is encoded based on the ordering of the channel dimensions, while semantic information is largely not. Following this demonstration, we show the real world impact of these findings by applying them to two applications. First, we propose a simple yet effective data augmentation strategy and loss function which improves the translation invariance of a CNN's output. Second, we propose a method to efficiently determine which channels in the latent representation are responsible for (i) encoding overall position information or (ii) region-specific positions. We first show that semantic segmentation has a significant reliance on the overall position channels to make predictions. We then show for the first time that it is possible to perform a `region-specific' attack, and degrade a network's performance in a particular part of the input. We believe our findings and demonstrated applications will benefit research areas concerned with understanding the characteristics of CNNs.
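摘要提出用数据增强与损失函数提升 CNN 输出的平移不变性,但未给出具体形式。下面是一个常见实现思路的假设性示意:对随机平移后的输入施加输出一致性约束:

```python
import torch
import torch.nn.functional as F

def shift_consistency_loss(model, x, max_shift=8):
    """x: (B,C,H,W)。随机平移输入,惩罚模型输出随平移发生的变化。"""
    dx, dy = [int(torch.randint(-max_shift, max_shift + 1, (1,))) for _ in range(2)]
    x_shift = torch.roll(x, shifts=(dy, dx), dims=(2, 3))  # 以循环平移近似图像平移
    return F.mse_loss(model(x_shift), model(x).detach())

# 训练时可写作:total_loss = task_loss + lam * shift_consistency_loss(model, x)
```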

【13】 A New Journey from SDRTV to HDRTV 标题:从SDRTV到HDRTV的新征程 链接:https://arxiv.org/abs/2108.07978

作者:Xiangyu Chen,Zhengwen Zhang,Jimmy S. Ren,Lynhoo Tian,Yu Qiao,Chao Dong 机构:ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, SenseTime Research, Qing Yuan Research Institute, Shanghai Jiao Tong University 备注:Accepted to ICCV 摘要:如今,现代显示器能够以高动态范围(HDR)和宽色域(WCG)呈现视频内容。然而,大多数可用资源仍是标准动态范围(SDR)。因此,迫切需要将现有SDR-TV内容转换为HDR-TV版本。在本文中,我们通过对SDRTV/HDRTV内容的形成过程进行建模来分析SDRTV到HDRTV的任务。基于该分析,我们提出了一个包括自适应全局颜色映射、局部增强和高光生成三步的解决方案。此外,上述分析启发我们提出了一个轻量级网络,它以全局统计量为指导进行图像自适应的颜色映射。另外,我们使用HDR10标准的HDR视频构建了一个数据集,命名为HDRTV1K,并选择五个指标来评估SDRTV到HDRTV算法的结果。我们的最终结果在定量比较和视觉质量方面均达到了最先进水平。代码和数据集可在 https://github.com/chxy95/HDRTVNet 获取。 摘要:Nowadays modern displays are capable to render video content with high dynamic range (HDR) and wide color gamut (WCG). However, most available resources are still in standard dynamic range (SDR). Therefore, there is an urgent demand to transform existing SDR-TV contents into their HDR-TV versions. In this paper, we conduct an analysis of SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content. Based on the analysis, we propose a three-step solution pipeline including adaptive global color mapping, local enhancement and highlight generation. Moreover, the above analysis inspires us to present a lightweight network that utilizes global statistics as guidance to conduct image-adaptive color mapping. In addition, we construct a dataset using HDR videos in HDR10 standard, named HDRTV1K, and select five metrics to evaluate the results of SDRTV-to-HDRTV algorithms. Furthermore, our final results achieve state-of-the-art performance in quantitative comparisons and visual quality. The code and dataset are available at https://github.com/chxy95/HDRTVNet.
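下面示意"以全局统计量为条件的图像自适应全局颜色映射":用全局平均池化得到的统计量预测逐图像的 3×3 颜色变换与偏置(结构为笔者对摘要思路的简化,并非 HDRTVNet 源码):

```python
import torch
import torch.nn as nn

class AdaptiveColorMap(nn.Module):
    def __init__(self):
        super().__init__()
        self.cond = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(3, 32), nn.ReLU(),
                                  nn.Linear(32, 12))       # 3x3 矩阵 + 3 维偏置

    def forward(self, sdr):                                # sdr: (B,3,H,W)
        p = self.cond(sdr)                                 # 全局统计量 -> 变换参数
        M = p[:, :9].view(-1, 3, 3)
        b = p[:, 9:].view(-1, 3, 1)
        flat = sdr.flatten(2)                              # (B,3,HW)
        return (M @ flat + b).view_as(sdr)                 # 逐图像的全局颜色映射

hdr = AdaptiveColorMap()(torch.rand(2, 3, 64, 64))
```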

【14】 Calibration Method of the Monocular Omnidirectional Stereo Camera 标题:单目全向立体相机的标定方法 链接:https://arxiv.org/abs/2108.07936

作者:Ryota Kawamata,Keiichi Betsui,Kazuyoshi Yamazaki,Rei Sakakibara,Takeshi Shimano 机构:Center for Technology Innovation – Instrumentation, Hitachi, Ltd. 备注:8 pages, 8 figures, 2 tables, accepted for publication in International Journal of Automotive Engineering 摘要:自动驾驶需要小巧、低成本的设备来成像并360度测量周围物体的距离。我们一直在开发一种全向立体相机,它利用两个双曲面反射镜和一组镜头与传感器,因此结构紧凑、成本低廉。我们为该相机建立了一种新的标定方法,考虑了高阶径向畸变、细化的切向畸变、图像传感器倾斜以及透镜-反射镜偏移。我们的方法将上视与下视图像的标定误差分别降低为原来的1/6.0和1/4.3。距离测量的随机误差为4.9%,对14米以内的物体,系统误差为5.7%,与传统方法相比提升了近9倍。剩余的距离误差源于原型机光学分辨率的下降,我们计划在未来工作中进一步改进。 摘要:Compact and low-cost devices are needed for autonomous driving to image and measure distances to objects 360-degree around. We have been developing an omnidirectional stereo camera exploiting two hyperbolic mirrors and a single set of a lens and sensor, which makes this camera compact and cost efficient. We establish a new calibration method for this camera considering higher-order radial distortion, detailed tangential distortion, an image sensor tilt, and a lens-mirror offset. Our method reduces the calibration error by 6.0 and 4.3 times for the upper- and lower-view images, respectively. The random error of the distance measurement is 4.9% and the systematic error is 5.7% up to objects 14 meters apart, which is improved almost nine times compared to the conventional method. The remaining distance error is due to a degraded optical resolution of the prototype, and we plan to make further improvements as future work.
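下面给出一个包含高阶径向畸变与切向畸变的针孔畸变模型示意(按常见的 Brown-Conrady 写法;论文中的传感器倾斜与透镜-反射镜偏移项此处从略,系数数值仅为示例):

```python
import numpy as np

def distort(xn, yn, k=(0.1, -0.05, 0.01), p=(1e-3, -5e-4)):
    """xn, yn: 归一化图像坐标;k: 径向畸变系数 k1..k3;p: 切向畸变系数 p1, p2。"""
    r2 = xn**2 + yn**2
    radial = 1 + k[0]*r2 + k[1]*r2**2 + k[2]*r2**3      # 高阶径向畸变
    xd = xn*radial + 2*p[0]*xn*yn + p[1]*(r2 + 2*xn**2)  # 切向畸变
    yd = yn*radial + p[0]*(r2 + 2*yn**2) + 2*p[1]*xn*yn
    return xd, yd

print(distort(np.array([0.2]), np.array([-0.1])))
```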
