cs.CV, 120 papers today
Transformer (6 papers)
【1】 Early Convolutions Help Transformers See Better
Authors: Tete Xiao,Mannat Singh,Eric Mintun,Trevor Darrell,Piotr Dollár,Ross Girshick
Affiliations: Facebook AI Research (FAIR), UC Berkeley
Link: https://arxiv.org/abs/2106.14881
Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are far easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p pxp convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3x3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ~1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models as a more robust architectural choice compared to the original ViT model design.
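To make the stem swap concrete, here is a minimal PyTorch sketch of the two designs compared above. The overall stride of 16 and the resulting token shape follow the abstract; the exact layer count and channel widths of the convolutional stem are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Standard ViT stem: a single stride-16 16x16 convolution."""
    def __init__(self, in_ch=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=16, stride=16)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

class ConvStem(nn.Module):
    """Convolutional stem: stacked stride-2 3x3 convs reaching the same
    overall stride of 16, then a 1x1 conv to the embedding dimension.
    Channel widths here are illustrative, not the paper's exact values."""
    def __init__(self, in_ch=3, embed_dim=768, widths=(64, 128, 256, 512)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:                       # four stride-2 stages -> /16
            layers += [nn.Conv2d(prev, w, 3, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
            prev = w
        layers.append(nn.Conv2d(prev, embed_dim, kernel_size=1))
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                       # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # same token shape as patchify

tokens = ConvStem()(torch.randn(2, 3, 224, 224))   # -> (2, 196, 768)
```

Both stems emit the same token sequence for a 224x224 input, so the rest of the ViT is unchanged, which is why the two designs have nearly identical compute.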
【2】 Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection
Authors: Xin Zhou,Le Kang,Zhiyu Cheng,Bo He,Jingyu Xin
Affiliations: Baidu Research, Bordeaux Dr, Sunnyvale, CA, USA
Note: Tech Report. Authors Xin Zhou, Le Kang, and Zhiyu Cheng made equal contributions
Link: https://arxiv.org/abs/2106.14447
Abstract: With rapidly evolving internet technologies and emerging tools, sports related videos generated online are increasing at an unprecedentedly fast pace. To automate the sports video editing/highlight generation process, a key task is to precisely recognize and locate the events in long untrimmed videos. In this tech report, we present a two-stage paradigm to detect what and when events happen in soccer broadcast videos. Specifically, we fine-tune multiple action recognition models on soccer data to extract high-level semantic features, and design a transformer based temporal detection module to locate the target events. This approach achieved state-of-the-art performance in both tasks, i.e., action spotting and replay grounding, in the SoccerNet-v2 Challenge, under the CVPR 2021 ActivityNet workshop. Our soccer embedding features are released at https://github.com/baidu-research/vidpress-sports. By sharing these features with the broader community, we hope to accelerate the research into soccer video understanding.
【3】 Multi-Compound Transformer for Accurate Biomedical Image Segmentation
Authors: Yuanfeng Ji,Ruimao Zhang,Huijie Wang,Zhen Li,Lingyun Wu,Shaoting Zhang,Ping Luo
Affiliations: The University of Hong Kong; Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong (Shenzhen); SenseTime Research
Note: Accepted by MICCAI2021
Link: https://arxiv.org/abs/2106.14385
Abstract: The recent vision transformer (i.e., for image classification) learns non-local attentive interaction of different patch tokens. However, prior arts miss learning the cross-scale dependencies of different pixels, the semantic correspondence of different labels, and the consistency of the feature representations and semantic embeddings, which are critical for biomedical segmentation. In this paper, we tackle the above issues by proposing a unified transformer network, termed Multi-Compound Transformer (MCTrans), which incorporates rich feature learning and semantic structure mining into a unified framework. Specifically, MCTrans embeds the multi-scale convolutional features as a sequence of tokens and performs intra- and inter-scale self-attention, rather than the single-scale attention of previous works. In addition, a learnable proxy embedding is also introduced to model semantic relationships and feature enhancement by using self-attention and cross-attention, respectively. MCTrans can be easily plugged into a UNet-like network and attains a significant improvement over state-of-the-art methods in biomedical image segmentation on six standard benchmarks. For example, MCTrans outperforms UNet by 3.64%, 3.71%, 4.34%, 2.8%, 1.88%, and 1.57% on the Pannuke, CVC-Clinic, CVC-Colon, Etis, Kvasir, and ISIC2018 datasets, respectively. Code is available at https://github.com/JiYuanFeng/MCTrans.
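The abstract's core move, flattening multi-scale convolutional features into one token sequence and attending within and across scales, can be sketched roughly as follows; the channel widths, learned scale embedding, and single attention layer are simplifying assumptions rather than MCTrans's actual design.

```python
import torch
import torch.nn as nn

class MultiScaleTokenAttention(nn.Module):
    """Hedged sketch: project conv feature maps from several scales to a
    common dim, flatten them into one token sequence, and run self-attention
    over all of them, so attention spans intra- and inter-scale dependencies."""
    def __init__(self, chans=(64, 128, 256), dim=256, heads=8):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in chans])
        self.scale_embed = nn.Parameter(torch.zeros(len(chans), dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                  # list of (B, C_i, H_i, W_i)
        tokens = []
        for i, f in enumerate(feats):
            t = self.proj[i](f).flatten(2).transpose(1, 2)  # (B, H_i*W_i, dim)
            tokens.append(t + self.scale_embed[i])          # mark the scale
        x = torch.cat(tokens, dim=1)           # tokens from every scale
        out, _ = self.attn(x, x, x)            # joint intra/inter-scale attention
        return out
```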
【4】 Post-Training Quantization for Vision Transformer
Authors: Zhenhua Liu,Yunhe Wang,Kai Han,Siwei Ma,Wen Gao
Affiliations: Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University; Noah's Ark Lab, Huawei Technologies; Peng Cheng Laboratory
Link: https://arxiv.org/abs/2106.14156
Abstract: Recently, transformer has achieved remarkable performance on a variety of computer vision applications. Compared with mainstream convolutional neural networks, vision transformers are often of sophisticated architectures for extracting powerful feature representations, which are more difficult to be developed on mobile devices. In this paper, we present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers. Basically, the quantization task can be regarded as finding the optimal low-bit quantization intervals for weights and inputs, respectively. To preserve the functionality of the attention mechanism, we introduce a ranking loss into the conventional quantization objective that aims to keep the relative order of the self-attention results after quantization. Moreover, we thoroughly analyze the relationship between quantization loss of different layers and the feature diversity, and explore a mixed-precision quantization scheme by exploiting the nuclear norm of each attention map and output feature. The effectiveness of the proposed method is verified on several benchmark models and datasets, where it outperforms the state-of-the-art post-training quantization algorithms. For instance, we can obtain an 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with about 8-bit quantization.
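The ranking-loss idea, keeping the relative order of self-attention scores after quantization, might look like the following sketch; the random pair sampling and margin are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def attention_ranking_loss(attn_fp, attn_q, margin=0.0, n_pairs=1024):
    """Hedged sketch of a pairwise ranking penalty: for pairs of attention
    scores whose order flips after quantization, apply a hinge loss so the
    quantized model preserves the full-precision relative ordering.
    attn_fp / attn_q are attention maps of identical shape."""
    a, b = attn_fp.flatten(1), attn_q.flatten(1)
    n = a.shape[1]
    # sample random score pairs instead of all O(N^2) of them
    i = torch.randint(n, (n_pairs,))
    j = torch.randint(n, (n_pairs,))
    sign = torch.sign(a[:, i] - a[:, j])            # full-precision order
    # hinge: penalize when the quantized difference disagrees with that order
    return torch.relu(margin - sign * (b[:, i] - b[:, j])).mean()
```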
【5】 OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments
Authors: Anukriti Singh,Kartikeya Singh,P. B. Sujit
Link: https://arxiv.org/abs/2106.13963
Abstract: We present OffRoadTranSeg, the first end-to-end framework for semi-supervised segmentation in unstructured outdoor environments using transformers and automatic data selection for labelling. Offroad segmentation is a scene understanding approach that is widely used in autonomous driving. The popular offroad segmentation method is to use fully connected convolution layers and large labelled data; however, due to class imbalance, there will be several mismatches and some classes may not be detected. Our approach is to do the task of offroad segmentation in a semi-supervised manner. The aim is to provide a model where a self-supervised vision transformer is used to fine-tune on offroad datasets, with self-supervised data collection for labelling using depth estimation. The proposed method is validated on the RELLIS-3D and RUGD offroad datasets. The experiments show that OffRoadTranSeg outperforms other state-of-the-art models and also addresses the RELLIS-3D class imbalance problem.
【6】 MTrans: Multi-Modal Transformer for Accelerated MR Imaging
Authors: Chun-Mei Feng,Yunlu Yan,Geng Chen,Huazhu Fu,Yong Xu,Ling Shao
Affiliations: Inception Institute of Artificial Intelligence
Link: https://arxiv.org/abs/2106.14248
Abstract: Accelerating multi-modal magnetic resonance (MR) imaging is a new and effective solution for fast MR imaging, providing superior performance in restoring the target modality from its undersampled counterpart with guidance from an auxiliary modality. However, existing works simply introduce the auxiliary modality as prior information, lacking in-depth investigations on the potential mechanisms for fusing two modalities. Further, they usually rely on convolutional neural networks (CNNs), which focus on local information and prevent them from fully capturing the long-distance dependencies of global knowledge. To this end, we propose a multi-modal transformer (MTrans), which is capable of transferring multi-scale features from the target modality to the auxiliary modality, for accelerated MR imaging. By restructuring the transformer architecture, our MTrans gains a powerful ability to capture deep multi-modal information. More specifically, the target modality and the auxiliary modality are first split into two branches and then fused using a multi-modal transformer module. This module is based on an improved multi-head attention mechanism, named the cross attention module, which absorbs features from the auxiliary modality that contribute to the target modality. Our framework provides two appealing benefits: (i) MTrans is the first attempt at using improved transformers for multi-modal MR imaging, affording more global information compared with CNN-based methods. (ii) A new cross attention module is proposed to exploit the useful information in each branch at different scales. It affords both distinct structural information and subtle pixel-level information, which supplement the target modality effectively.
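A minimal sketch of a cross-attention fusion step as described, where target-modality tokens query auxiliary-modality tokens; the residual/LayerNorm arrangement and dimensions are assumptions, not MTrans's exact module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hedged sketch of cross attention between modalities: tokens from the
    target modality act as queries, tokens from the auxiliary modality act as
    keys/values, so the target branch absorbs complementary features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_tokens, aux_tokens):
        fused, _ = self.attn(query=target_tokens,
                             key=aux_tokens, value=aux_tokens)
        return self.norm(target_tokens + fused)   # residual update of target
```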
Detection (19 papers)
【1】 Iris Presentation Attack Detection by Attention-based and Deep Pixel-wise Binary Supervision Network
Authors: Meiling Fang,Naser Damer,Fadi Boutros,Florian Kirchbuchner,Arjan Kuijper
Affiliations: Fraunhofer Institute for Computer Graphics Research IGD, Darmstadt, Germany; Mathematical and Applied Visual Computing, TU Darmstadt, Darmstadt, Germany
Note: To appear at the 2021 International Joint Conference on Biometrics (IJCB 2021)
Link: https://arxiv.org/abs/2106.14845
Abstract: Iris presentation attack detection (PAD) plays a vital role in iris recognition systems. Most existing CNN-based iris PAD solutions 1) perform only binary label supervision during the training of CNNs, serving global information learning but weakening the capture of local discriminative features, 2) prefer the stacked deeper convolutions or expert-designed networks, raising the risk of overfitting, 3) fuse multiple PAD systems or various types of features, increasing difficulty for deployment on mobile devices. Hence, we propose a novel attention-based deep pixel-wise binary supervision (A-PBS) method. Pixel-wise supervision is first able to capture the fine-grained pixel/patch-level cues. Then, the attention mechanism guides the network to automatically find regions that most contribute to an accurate PAD decision. Extensive experiments are performed on LivDet-Iris 2017 and three other publicly available databases to show the effectiveness and robustness of the proposed A-PBS methods. For instance, the A-PBS model achieves an HTER of 6.50% on the IIITD-WVU database, outperforming state-of-the-art methods.
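The pixel-wise binary supervision idea can be sketched as a two-term loss, one on a per-patch logit map and one on the global logit, both driven by the same bona fide/attack label; the loss weighting below is an assumption.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def pbs_loss(pixel_logits, binary_logit, is_attack, w_pixel=0.5):
    """Hedged sketch of pixel-wise binary supervision: the network emits a
    low-resolution map of per-patch logits plus one global logit; both are
    supervised with the same bona-fide/attack label, so fine-grained local
    cues and the global decision are learned jointly."""
    target_map = torch.full_like(pixel_logits, float(is_attack))
    target_bin = torch.full_like(binary_logit, float(is_attack))
    return (w_pixel * bce(pixel_logits, target_map) +
            (1 - w_pixel) * bce(binary_logit, target_bin))
```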
【2】 One-Shot Affordance Detection
Authors: Hongchen Luo,Wei Zhai,Jing Zhang,Yang Cao,Dacheng Tao
Affiliations: University of Science and Technology of China, China; The University of Sydney, Australia
Link: https://arxiv.org/abs/2106.14747
Abstract: Affordance detection refers to identifying the potential action possibilities of objects in an image, which is an important ability for robot perception and manipulation. To empower robots with this ability in unseen scenarios, we consider the challenging one-shot affordance detection problem in this paper, i.e., given a support image that depicts the action purpose, all objects in a scene with the common affordance should be detected. To this end, we devise a One-Shot Affordance Detection (OS-AD) network that firstly estimates the purpose and then transfers it to help detect the common affordance from all candidate images. Through collaboration learning, OS-AD can capture the common characteristics between objects having the same underlying affordance and learn a good adaptation capability for perceiving unseen affordances. Besides, we build a Purpose-driven Affordance Dataset (PAD) by collecting and labeling 4k images from 31 affordance and 72 object categories. Experimental results demonstrate the superiority of our model over previous representative ones in terms of both objective metrics and visual quality. The benchmark suite is at ProjectPage.
【3】 Dataset and Benchmarking of Real-Time Embedded Object Detection for RoboCup SSL
Authors: Roberto Fernandes,Walber M. Rodrigues,Edna Barros
Affiliations: Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
Link: https://arxiv.org/abs/2106.14597
Abstract: When producing a model for object detection in a specific context, the first obstacle is to have a dataset labeling the desired classes. In RoboCup, some leagues already have more than one dataset to train and evaluate a model. However, in the Small Size League (SSL), there is no such dataset available yet. This paper presents an open-source dataset to be used as a benchmark for real-time object detection in SSL. This work also presents a pipeline to train, deploy, and evaluate Convolutional Neural Network (CNN) models in a low-power embedded system. This pipeline was used to evaluate the proposed dataset with state-of-the-art optimized models. On this dataset, MobileNet SSD v1 achieves 44.88% AP (68.81% AP50) at 94 Frames Per Second (FPS) while running on an SSL robot.
【4】 Cheating Detection Pipeline for Online Interviews and Exams
Authors: Azmi Can Özgen,Mahiye Uluyağmur Öztürk,Umut Bayraktar
Affiliations: Huawei Turkey R&D Center, Istanbul, Turkey
Link: https://arxiv.org/abs/2106.14483
Abstract: Remote examination and job interviews have gained popularity and become indispensable because of both pandemics and the advantage of remote working circumstances. Most companies and academic institutions utilize these systems for their recruitment processes and also for online exams. However, one of the critical problems of remote examination systems is conducting the exams in a reliable environment. In this work, we present a cheating analysis pipeline for online interviews and exams. The system only requires a video of the candidate, which is recorded during the exam. Then a cheating detection pipeline is employed to detect another person, electronic device usage, and candidate absence status. The pipeline consists of face detection, face recognition, object detection, and face tracking algorithms. To evaluate the performance of the pipeline, we collected a private video dataset. The video dataset includes both cheating activities and clean videos. Ultimately, our pipeline presents an efficient and fast guideline to detect and analyze cheating activities in online interview and exam videos.
【5】 A More Compact Object Detector Head Network with Feature Enhancement and Relational Reasoning
Authors: Wenchao Zhang,Chong Fu,Xiangshi Chang,Tengfei Zhao,Xiang Li,Chiu-Wing Sham
Affiliations: School of Computer Science and Engineering, Northeastern University, Shenyang, China; Engineering Research Center of Security Technology of Complex Network System, Ministry of Education, China
Link: https://arxiv.org/abs/2106.14475
Abstract: Modeling implicit feature interaction patterns is of significant importance to object detection tasks. However, in the two-stage detectors, due to the excessive use of hand-crafted components, it is very difficult to reason about the implicit relationship of the instance features. To tackle this problem, we analyze three different levels of feature interaction relationships, namely, the dependency relationship between the cropped local features and global features, the feature autocorrelation within the instance, and the cross-correlation relationship between the instances. To this end, we propose a more compact object detector head network (CODH), which can not only preserve global context information and condense the information density, but also allows instance-wise feature enhancement and relational reasoning in a larger matrix space. Without bells and whistles, our method can effectively improve the detection performance while significantly reducing the parameters of the model; e.g., with our method, the parameters of the head network are 0.6 times smaller than the state-of-the-art Cascade R-CNN, yet the performance boost is 1.3% on COCO test-dev. Without losing generality, we can also build a lighter head network for other multi-stage detectors by assembling our method.
【6】 Rail-5k: a Real-World Dataset for Rail Surface Defects Detection
Authors: Zihao Zhang,Shaozuo Yu,Siwei Yang,Yu Zhou,Bingchen Zhao
Affiliations: The Key Laboratory of Road and Traffic Engineering, Ministry of Education; Shanghai Key Laboratory of Rail Infrastructure Durability and System Safety, Tongji University, Shanghai; Department of Computer Science
Link: https://arxiv.org/abs/2106.14366
Abstract: This paper presents the Rail-5k dataset for benchmarking the performance of visual algorithms in a real-world application scenario, namely the rail surface defect detection task. We collected over 5k high-quality images from railways across China, and annotated 1100 images with the help of railway experts to identify the 13 most common types of rail defects. The dataset can be used in two settings, both with unique challenges. The first is the fully-supervised setting using the 1k labeled images for training, where the fine-grained nature and long-tailed distribution of defect classes make it hard for visual algorithms to tackle. The second is the semi-supervised learning setting facilitated by the 4k unlabeled images; these 4k images are uncurated, containing possible image corruptions and domain shift with respect to the labeled images, which cannot be easily tackled by previous semi-supervised learning methods. We believe our dataset could be a valuable benchmark for evaluating the robustness and reliability of visual algorithms.
【7】 Change Detection for Geodatabase Updating
Authors: Rongjun Qin
Affiliations: Department of Civil, Environmental and Geodetic Engineering; Department of Electrical and Computer Engineering; Translational Data Analytics Institute, The Ohio State University, Columbus, Ohio, USA
Link: https://arxiv.org/abs/2106.14309
Abstract: The geodatabase (vectorized data) nowadays becomes a rather standard digital city infrastructure; however, updating geodatabases efficiently and economically remains a fundamental and practical issue in the geospatial industry. The cost of building a geodatabase is extremely high and labor intensive, and very often the maps we use have several months and even years of latency. One solution is to develop more automated methods for (vectorized) geospatial data generation, which has been proven a difficult task in the past decades. An alternative solution is to first detect the differences between the new data and the existing geospatial data, and then only update the area identified as changes. The second approach is becoming more favored due to its high practicality and flexibility. A highly relevant technique is change detection. This article aims to provide an overview of state-of-the-art change detection methods in the field of Remote Sensing and Geomatics to support the task of updating geodatabases. Data used for change detection are highly disparate; we therefore structure our review intuitively based on the dimension of the data, being 1) change detection with 2D data; 2) change detection with 3D data. Conclusions will be drawn based on the reviewed efforts in the field, and we will share our outlooks of the topic of updating geodatabases.
【8】 SDOF-Tracker: Fast and Accurate Multiple Human Tracking by Skipped-Detection and Optical-Flow
Authors: Hitoshi Nishimura,Satoshi Komorita,Yasutomo Kawanishi,Hiroshi Murase
Affiliations: KDDI Research, Inc., Saitama, Japan; RIKEN, Kyoto, Japan; Nagoya University, Aichi, Japan
Link: https://arxiv.org/abs/2106.14259
Abstract: Multiple human tracking is a fundamental problem for scene understanding. Although both accuracy and speed are required in real-world applications, recent tracking methods based on deep learning have focused on accuracy and require substantial running time. This study aims to improve running speed by performing human detection at a certain frame interval because it accounts for most of the running time. The question is how to maintain accuracy while skipping human detection. In this paper, we propose a method that complements the detection results with optical flow, based on the fact that someone's appearance does not change much between adjacent frames. To maintain the tracking accuracy, we introduce robust interest point selection within human regions and a tracking termination metric calculated by the distribution of the interest points. On the MOT20 dataset in the MOTChallenge, the proposed SDOF-Tracker achieved the best performance in terms of the total running speed while maintaining the MOTA metric. Our code is available at https://anonymous.4open.science/r/sdof-tracker-75AE.
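The skipped-detection mechanism, propagating person boxes between detector frames with optical flow over interest points, might look roughly like this OpenCV sketch; the point budget and termination thresholds are assumptions, simplified from the paper's interest-point-distribution metric.

```python
import cv2
import numpy as np

def propagate_box_with_flow(prev_gray, cur_gray, box):
    """Shift an (x, y, w, h) box by the median Lucas-Kanade flow of interest
    points selected inside it; return None to signal track termination."""
    x, y, w, h = box
    pts = cv2.goodFeaturesToTrack(prev_gray[y:y + h, x:x + w],
                                  maxCorners=30, qualityLevel=0.01,
                                  minDistance=3)
    if pts is None:
        return None                               # no trackable texture
    pts = pts.astype(np.float32) + np.float32([x, y])   # to full-image coords
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    good = status.reshape(-1) == 1
    if good.sum() < 5:                            # too few survivors: stop
        return None
    dx, dy = np.median((nxt - pts).reshape(-1, 2)[good], axis=0)
    return (int(x + dx), int(y + dy), w, h)
```

A detector would then only run every N-th frame, with this propagation filling the frames in between.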
【9】 Memory Guided Road Detection
Authors: Praveen Venkatesh,Rwik Rana,Varun Jain
Affiliations: Electrical Engineering and Mechanical Engineering, Indian Institute of Technology Gandhinagar, India
Link: https://arxiv.org/abs/2106.14184
Abstract: In self-driving car applications, there is a requirement to predict the location of the lane given an input RGB front-facing image. In this paper, we propose an architecture that allows us to increase the speed and robustness of road detection without a large hit to accuracy, by introducing an underlying shared feature space that is propagated over time and serves as a flowing dynamic memory. By utilizing the gist of previous frames, we train the network to predict the current road with greater accuracy and lesser deviation from previous frames.
【10】 Image content dependent semi-fragile watermarking with localized tamper detection
Authors: Samira Hosseini,Mojtaba Mahdavi
Affiliations: Department of IT Engineering, University of Isfahan, Isfahan, Iran
Note: 32 pages, 11 figures, 5 tables
Link: https://arxiv.org/abs/2106.14150
Abstract: Content-independent watermarks and block-wise independency can be considered as vulnerabilities in semi-fragile watermarking methods. In this paper, to achieve the objectives of semi-fragile watermarking techniques, a method is proposed that does not have the mentioned shortcomings. In the proposed method, the watermark is generated by relying on image content and a key. Furthermore, the embedding scheme causes the watermarked blocks to become dependent on each other, using a key. In the embedding phase, the image is partitioned into non-overlapping blocks. In order to detect and separate the different types of attacks more precisely, the proposed method embeds three copies of each watermark bit into LWT coefficients of each 4x4 block. In the authentication phase, by voting between the extracted bits the error maps are created; these maps indicate image authenticity and reveal the modified regions. Also, in order to automate the authentication, the images are classified into four categories using seven features. Classification accuracy in the experiments is 97.97%. Our experiments demonstrate that the proposed method is robust against JPEG compression and is competitive with a state-of-the-art semi-fragile watermarking method, in terms of robustness and semi-fragility.
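The authentication step described above, majority voting over the three embedded copies of each watermark bit, reduces to a few lines; the `copies` layout here is an assumed (n_blocks, 3) array of extracted 0/1 bits.

```python
import numpy as np

def vote_bits(copies):
    """Hedged sketch of the voting step: each watermark bit was embedded
    three times per 4x4 block, so a majority vote over the three extracted
    copies recovers the bit, and disagreement among copies flags candidate
    tampered blocks for the error map."""
    votes = copies.sum(axis=1)                  # 0..3 ones per block
    bits = (votes >= 2).astype(np.uint8)        # majority of the 3 copies
    suspicious = (votes == 1) | (votes == 2)    # copies disagree
    return bits, suspicious
```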
【11】 Machine Learning Detection Algorithm for Large Barkhausen Jumps in Cluttered Environment
Authors: Roger Alimi,Amir Ivry,Elad Fisher,Eyal Weiss
Affiliations: Technology Division, Soreq NRC, Yavne, Israel; Technion - Israel Institute of Technology, Haifa, Israel; Jerusalem College of Technology, Jerusalem, Israel
Link: https://arxiv.org/abs/2106.14148
Abstract: Modern magnetic sensor arrays conventionally utilize state-of-the-art low-power magnetometers such as parallel and orthogonal fluxgates. Low-power fluxgates tend to have large Barkhausen jumps that appear as a dc jump in the fluxgate output. This phenomenon deteriorates the signal fidelity and effectively increases the internal sensor noise. Even if sensors that are more prone to dc jumps can be screened during production, the conventional noise measurement does not always catch the dc jump because of its sparsity. Moreover, dc jumps persist in almost all the sensor cores, although at a slower but still intolerable rate. Even if dc jumps can be easily detected in a shielded environment, when deployed in presence of natural noise and clutter, it can be hard to positively detect them. This work fills this gap and presents algorithms that distinguish dc jumps embedded in natural magnetic field data. To improve robustness to noise, we developed two machine learning algorithms that employ temporal and statistical physical-based features of a pre-acquired and well-known experimental data set. The first algorithm employs a support vector machine classifier, while the second is based on a neural network architecture. We compare these new approaches to a more classical kernel-based method. To that purpose, the receiver operating characteristic curve is generated, which allows diagnosis ability of the different classifiers by comparing their performances across various operation points. The accuracy of the machine learning-based algorithms over the classic method is highly emphasized. In addition, high generalization and robustness of the neural network can be concluded, based on the rapid convergence of the corresponding receiver operating characteristic curves.
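As a rough illustration of the SVM branch of this comparison, one could train a classifier on simple temporal/statistical window features like the sketch below; the feature set is a placeholder, far smaller than the physics-based features the paper actually uses.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def window_features(w):
    """Toy temporal/statistical features for one magnetometer sample window."""
    return [w.mean(), w.std(), np.ptp(w),
            np.abs(np.diff(w)).max()]          # largest sample-to-sample step

def fit_jump_detector(windows, labels):
    """Hedged sketch: classify windows as 'contains a dc (Barkhausen) jump'
    or not with an RBF-kernel SVM on standardized features."""
    X = np.array([window_features(w) for w in windows])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return clf.fit(X, labels)
```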
【12】 Real-time 3D Object Detection using Feature Map Flow
Authors: Youshaa Murhij,Dmitry Yudin
Affiliations: Moscow Institute of Physics and Technology, Dolgoprudny, Russia
Note: CVPR 2021 Workshop on autonomous driving (Waymo Real-time 3D Detection)
Link: https://arxiv.org/abs/2106.14101
Abstract: In this paper, we present a real-time 3D detection approach that considers time-spatial feature map aggregation from different time steps of deep neural model inference (named feature map flow, FMF). The proposed approach improves the quality of a center-based 3D detection baseline and provides real-time performance on the nuScenes and Waymo benchmarks. Code is available at https://github.com/YoushaaMurhij/FMFNet
【13】 Radar Voxel Fusion for 3D Object Detection
Authors: Felix Nobis,Ehsan Shafiei,Phillip Karle,Johannes Betz,Markus Lienkamp
Citation: Nobis, F.; Shafiei, E.; Karle, P.; Betz, J.; Lienkamp, M. Radar Voxel Fusion for 3D Object Detection. Appl.
Link: https://arxiv.org/abs/2106.14087
Abstract: Automotive traffic scenes are complex due to the variety of possible scenarios, objects, and weather conditions that need to be handled. In contrast to more constrained environments, such as automated underground trains, automotive perception systems cannot be tailored to a narrow field of specific tasks but must handle an ever-changing environment with unforeseen events. As currently no single sensor is able to reliably perceive all relevant activity in the surroundings, sensor data fusion is applied to perceive as much information as possible. Data fusion of different sensors and sensor modalities on a low abstraction level enables the compensation of sensor weaknesses and misdetections among the sensors before the information-rich sensor data are compressed and thereby information is lost after a sensor-individual object detection. This paper develops a low-level sensor fusion network for 3D object detection, which fuses lidar, camera, and radar data. The fusion network is trained and evaluated on the nuScenes data set. On the test set, fusion of radar data increases the resulting AP (Average Precision) detection score by about 5.1% in comparison to the baseline lidar network. The radar sensor fusion proves especially beneficial in inclement conditions such as rain and night scenes. Fusing additional camera data contributes positively only in conjunction with the radar fusion, which shows that interdependencies of the sensors are important for the detection result. Additionally, the paper proposes a novel loss to handle the discontinuity of a simple yaw representation for object detection. Our updated loss increases the detection and orientation estimation performance for all sensor input configurations. The code for this research has been made available on GitHub.
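The abstract does not spell out the proposed yaw loss, but a standard way to remove the +/-pi discontinuity of a scalar yaw regression target is to compare (sin, cos) encodings, sketched below; treat this as an illustration of the problem being addressed, not the paper's exact loss.

```python
import torch

def yaw_loss(pred_yaw, gt_yaw):
    """Hedged sketch: compare (sin, cos) encodings instead of raw radians,
    so angles like -pi and +pi (the same physical heading) incur zero loss
    rather than a large wrap-around penalty."""
    return torch.mean((torch.sin(pred_yaw) - torch.sin(gt_yaw)) ** 2 +
                      (torch.cos(pred_yaw) - torch.cos(gt_yaw)) ** 2)
```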
【14】 Exploring Temporal Context and Human Movement Dynamics for Online Action Detection in Videos
Authors: Vasiliki I. Vasileiou,Nikolaos Kardaris,Petros Maragos
Affiliations: School of E.C.E., National Technical University of Athens, Athens, Greece
Note: EUSIPCO-2021
Link: https://arxiv.org/abs/2106.13967
Abstract: Nowadays, the interaction between humans and robots is constantly expanding, requiring more and more human motion recognition applications to operate in real time. However, most works on temporal action detection and recognition perform these tasks in an offline manner, i.e. temporally segmented videos are classified as a whole. In this paper, based on the recently proposed framework of Temporal Recurrent Networks, we explore how temporal context and human movement dynamics can be effectively employed for online action detection. Our approach uses various state-of-the-art architectures and appropriately combines the extracted features in order to improve action detection. We evaluate our method on a challenging but widely used dataset for temporal action localization, THUMOS'14. Our experiments show significant improvement over the baseline method, achieving state-of-the-art results on THUMOS'14.
【15】 Domain Adaptive YOLO for One-Stage Cross-Domain Detection
Authors: Shizhao Zhang,Hongya Tuo,Jian Hu,Zhongliang Jing
Affiliations: Shanghai Jiao Tong University, Shanghai, China; Baidu, Inc.
Link: https://arxiv.org/abs/2106.13939
Abstract: Domain shift is a major challenge for object detectors to generalize well to real world applications. Emerging techniques of domain adaptation for two-stage detectors help to tackle this problem. However, two-stage detectors are not the first choice for industrial applications due to their long time consumption. In this paper, a novel Domain Adaptive YOLO (DA-YOLO) is proposed to improve cross-domain performance for one-stage detectors. Image-level feature alignment is used to strictly match local features like texture, and loosely match global features like illumination. Multi-scale instance-level feature alignment is presented to reduce instance domain shift effectively, such as variations in object appearance and viewpoint. A consensus regularization on these domain classifiers is employed to help the network generate domain-invariant detections. We evaluate our proposed method on popular datasets like Cityscapes, KITTI, and SIM10K. The results demonstrate significant improvement when tested under different cross-domain scenarios.
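Domain classifiers of this kind are commonly trained adversarially through a gradient reversal layer, so that the detector's features drift toward domain invariance; the sketch below shows that standard building block (the abstract does not confirm it is this paper's exact mechanism).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Hedged sketch of the usual adversarial-alignment trick: the domain
    classifier trains normally on the forward pass, while the backbone
    receives negated (scaled) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # flip gradient for the feature extractor

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```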
【16】 A CNN Segmentation-Based Approach to Object Detection and Tracking in Ultrasound Scans with Application to the Vagus Nerve Detection
Authors: Abdullah F. Al-Battal,Yan Gong,Lu Xu,Timothy Morton,Chen Du,Yifeng Bu,Imanuel R. Lerman,Radhika Madhavan,Truong Q. Nguyen
Note: 7 pages, 4 figures, submitted to the IEEE EMBC 2021 conference
Link: https://arxiv.org/abs/2106.13849
Abstract: Ultrasound scanning is essential in several medical diagnostic and therapeutic applications. It is used to visualize and analyze anatomical features and structures that influence treatment plans. However, it is both labor intensive, and its effectiveness is operator dependent. Real-time, accurate, and robust automatic detection and tracking of anatomical structures while scanning would significantly impact diagnostic and therapeutic procedures to be consistent and efficient. In this paper, we propose a deep learning framework to automatically detect and track a specific anatomical target structure in ultrasound scans. Our framework is designed to be accurate and robust across subjects and imaging devices, to operate in real-time, and to not require a large training set. It maintains a localization precision and recall higher than 90% when trained on training sets that are as small as 20% of the size of the original training set. The framework backbone is a weakly trained segmentation neural network based on U-Net. We tested the framework on two different ultrasound datasets with the aim to detect and track the Vagus nerve, where it outperformed current state-of-the-art real-time object detection networks.
【17】 EARLIN: Early Out-of-Distribution Detection for Resource-efficient Collaborative Inference
Authors: Sumaiya Tabassum Nimi,Md Adnan Arefeen,Md Yusuf Sarwar Uddin,Yugyung Lee
Affiliations: University of Missouri-Kansas City, MO, USA
Note: To appear in the proceedings of ECML-PKDD'2021
Link: https://arxiv.org/abs/2106.13842
Abstract: Collaborative inference enables resource-constrained edge devices to make inferences by uploading inputs (e.g., images) to a server (i.e., cloud) where the heavy deep learning models run. While this setup works cost-effectively for successful inferences, it severely underperforms when the model faces input samples on which the model was not trained (known as Out-of-Distribution (OOD) samples). If the edge devices could, at least, detect that an input sample is an OOD, that could potentially save communication and computation resources by not uploading those inputs to the server for inference workload. In this paper, we propose a novel lightweight OOD detection approach that mines important features from the shallow layers of a pretrained CNN model and detects an input sample as ID (In-Distribution) or OOD based on a distance function defined on the reduced feature space. Our technique (a) works on pretrained models without any retraining of those models, and (b) does not expose itself to any OOD dataset (all detection parameters are obtained from the ID training dataset). To this end, we develop EARLIN (EARLy OOD detection for Collaborative INference) that takes a pretrained model and partitions the model at the OOD detection layer and deploys the considerably small OOD part on an edge device and the rest on the cloud. By experimenting using real datasets and a prototype implementation, we show that our technique achieves better results than other approaches in terms of overall accuracy and cost when tested against popular OOD datasets on top of popular deep learning models pretrained on benchmark datasets.
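A distance-based detector in the spirit described, per-class centroids fit on ID training features from shallow layers, with a threshold calibrated on ID data only, can be sketched as follows; the centroid and quantile choices are assumptions, not EARLIN's exact distance function.

```python
import numpy as np

class ShallowFeatureOOD:
    """Hedged sketch of distance-based OOD detection on shallow CNN features:
    flag a sample as OOD when its distance to the nearest ID class centroid
    exceeds a threshold calibrated on ID training data alone."""
    def fit(self, feats, labels, quantile=0.95):
        self.centroids = np.stack([feats[labels == c].mean(0)
                                   for c in np.unique(labels)])
        self.thresh = np.quantile(self._min_dist(feats), quantile)
        return self

    def _min_dist(self, feats):
        # distance to the nearest class centroid, shape (n_samples,)
        return np.linalg.norm(feats[:, None] - self.centroids[None],
                              axis=-1).min(1)

    def is_ood(self, feats):
        return self._min_dist(feats) > self.thresh
```

On the edge device, only the shallow layers plus this check would run; uploads to the cloud happen only for samples judged in-distribution.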
【18】 A Machine Learning Model for Early Detection of Diabetic Foot using Thermogram Images
Authors: Amith Khandakar,Muhammad E. H. Chowdhury,Mamun Bin Ibne Reaz,Sawal Hamid Md Ali,Md Anwarul Hasan,Serkan Kiranyaz,Tawsifur Rahman,Rashad Alfkey,Ahmad Ashrif A. Bakar,Rayaz A. Malik
Affiliations: Department of Electrical Engineering, Qatar University, Doha, Qatar; Dept. of Electrical, Electronics and Systems Engineering, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
Note: 23 pages, 8 figures
Link: https://arxiv.org/abs/2106.14207
Abstract: Diabetic foot ulceration (DFU) and amputation are a cause of significant morbidity. The prevention of DFU may be achieved by the identification of patients at risk of DFU and the institution of preventative measures through education and offloading. Several studies have reported that thermogram images may help to detect an increase in plantar temperature prior to DFU. However, the distribution of plantar temperature may be heterogeneous, making it difficult to quantify and utilize to predict outcomes. We have compared a machine learning-based scoring technique with feature selection and optimization techniques and learning classifiers to several state-of-the-art Convolutional Neural Networks (CNNs) on foot thermogram images and propose a robust solution to identify the diabetic foot. A comparatively shallow CNN model, MobileNetV2, achieved an F1 score of ~95% for a two-feet thermogram image-based classification, and the AdaBoost classifier used 10 features and achieved an F1 score of 97%. A comparison of the inference time for the best-performing networks confirmed that the proposed algorithm can be deployed as a smartphone application to allow the user to monitor the progression of the DFU in a home setting.
【19】 An XAI Approach to Deep Learning Models in the Detection of Ductal Carcinoma in Situ
Authors: Michele La Ferla,Matthew Montebello,Dylan Seychell
Affiliations: University of Malta, Msida, Malta
Note: 9 pages, 6 figures
Link: https://arxiv.org/abs/2106.14186
Abstract: During the last decade or so, there has been a surge in the deep learning community to solve health-related issues, particularly breast cancer. Following the Camelyon-16 challenge in 2016, several researchers have dedicated their time to build Convolutional Neural Networks (CNNs) to help radiologists and other clinicians diagnose breast cancer. In particular, there has been an emphasis on Ductal Carcinoma in Situ (DCIS), the clinical term for early-stage breast cancer. Large companies have given their fair share of research into this subject, among them Google DeepMind, which developed a model in 2020 that has proven to be better than radiologists themselves at diagnosing breast cancer correctly. We found that among the issues which exist, there is a need for an explanatory system that goes through the hidden layers of a CNN to highlight those pixels that contributed to the classification of a mammogram. We then chose an open-source, reasonably successful project developed by Prof. Shen, using the CBIS-DDSM image database to run our experiments on. It was later improved using the Resnet-50 and VGG-16 patch-classifiers, analytically comparing the outcome of both. The results showed that the Resnet-50 one converged earlier in the experiments. Following the research by Montavon and Binder, we used the DeepTaylor Layer-wise Relevance Propagation (LRP) model to highlight those pixels and regions within a mammogram which contribute most to its classification. This is represented as a map of those pixels in the original image which contribute to the diagnosis, and the extent to which they contribute to the final classification. The most significant advantage of this algorithm is that it performs exceptionally well with the Resnet-50 patch classifier architecture.
Classification | Recognition (14 papers)
【1】 Hyperspectral Remote Sensing Image Classification Based on Multi-scale Cross Graphic Convolution
Authors: Yunsong Zhao,Yin Li,Zhihan Chen,Tianchong Qiu,Guojin Liu
Link: https://arxiv.org/abs/2106.14804
Abstract: The mining and utilization of features directly affect the classification performance of models used in the classification and recognition of hyperspectral remote sensing images. Traditional models usually conduct feature mining from a single perspective, with the features mined being limited and the internal relationships between them being ignored. Consequently, useful features are lost and classification results are unsatisfactory. To fully mine and utilize image features, a new multi-scale feature-mining learning algorithm (MGRNet) is proposed. The model uses principal component analysis to reduce the dimensionality of the original hyperspectral image (HSI) to retain 99.99% of its semantic information and extract dimensionality-reduction features. Using a multi-scale convolution algorithm, the input dimensionality-reduction features are mined to obtain shallow features, which then serve as inputs into a multi-scale graph convolution algorithm to construct the internal relationships between eigenvalues at different scales. We then carry out cross fusion of the multi-scale information obtained by graph convolution, before inputting the new information into the residual network algorithm for deep feature mining. Finally, a flexible maximum transfer function classifier is used to predict the final features and complete the classification. Experiments on three common hyperspectral datasets showed the MGRNet algorithm proposed in this paper to be superior to traditional methods in recognition accuracy.
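The first stage, PCA reduction keeping 99.99% of the variance, is straightforward to sketch with scikit-learn; the per-pixel flattening below is a common convention and an assumption about the pipeline's details.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_hsi(cube, variance=0.9999):
    """Hedged sketch of the PCA step: flatten an HSI cube of shape
    (H, W, bands) to per-pixel spectra and keep just enough principal
    components to retain 99.99% of the variance."""
    h, w, b = cube.shape
    pca = PCA(n_components=variance)       # float in (0,1): pick by variance
    flat = pca.fit_transform(cube.reshape(-1, b))
    return flat.reshape(h, w, -1)          # (H, W, n_kept_components)
```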
【2】 Recurrent neural network transducer for Japanese and Chinese offline handwritten text recognition
Authors: Trung Tan Ngo,Hung Tuan Nguyen,Nam Tuan Ly,Masaki Nakagawa
Affiliations: Tokyo University of Agriculture and Technology, Tokyo, Japan
Link: https://arxiv.org/abs/2106.14459
Abstract: In this paper, we propose an RNN-Transducer model for recognizing Japanese and Chinese offline handwritten text line images. As far as we know, it is the first approach that adopts the RNN-Transducer model for offline handwritten text recognition. The proposed model consists of three main components: a visual feature encoder that extracts visual features from an input image by CNN and then encodes the visual features by BLSTM; a linguistic context encoder that extracts and encodes linguistic features from the input image by embedded layers and LSTM; and a joint decoder that combines and then decodes the visual features and the linguistic features into the final label sequence by fully connected and softmax layers. The proposed model takes advantage of both visual and linguistic information from the input image. In the experiments, we evaluated the performance of the proposed model on two datasets: Kuzushiji and SCUT-EPT. Experimental results show that the proposed model achieves state-of-the-art performance on all datasets.
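The joint decoder described, combining visual encoder states and linguistic prediction states per (time, label) position before a softmax, follows the standard RNN-Transducer joint network; a minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

class JointDecoder(nn.Module):
    """Hedged sketch of the RNN-Transducer joint network: broadcast-add
    encoder (visual) states and prediction-network (linguistic) states over
    every (time step, label position) pair, then project to label logits
    (vocabulary plus the blank symbol)."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_plus_blank):
        super().__init__()
        self.fc_enc = nn.Linear(enc_dim, joint_dim)
        self.fc_pred = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_plus_blank)

    def forward(self, enc, pred):              # enc: (B,T,De), pred: (B,U,Dp)
        joint = torch.tanh(self.fc_enc(enc).unsqueeze(2) +    # (B,T,1,J)
                           self.fc_pred(pred).unsqueeze(1))   # (B,1,U,J)
        return self.out(joint)                 # (B,T,U,V) logits for RNN-T loss
```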
【3】 Progressive Class-based Expansion Learning For Image Classification
Authors: Hui Wang,Hanbin Zhao,Xi Li
Affiliations: Zhejiang University
Note: Accepted to IEEE Signal Processing Letters
Link: https://arxiv.org/abs/2106.14412
Abstract: In this paper, we propose a novel image processing scheme called class-based expansion learning for image classification, which aims at improving the supervision-stimulation frequency for the samples of the confusing classes. Class-based expansion learning takes a bottom-up growing strategy in a class-based expansion optimization fashion, which pays more attention to the quality of learning the fine-grained classification boundaries for the preferentially selected classes. Besides, we develop a class confusion criterion to select the confusing class preferentially for training. In this way, the classification boundaries of the confusing classes are frequently stimulated, resulting in a fine-grained form. Experimental results demonstrate the effectiveness of the proposed scheme on several benchmarks.
【4】 Deep Learning Image Recognition for Non-images
Authors: Boris Kovalerchuk,Divya Chandrika Kalla,Bedant Agarwal
Affiliations: Dept. of Computer Science, Central Washington University, USA; Indian Institute of Technology Kharagpur, India
Note: 33 pages, 17 figures, 18 tables
Link: https://arxiv.org/abs/2106.14350
Abstract: Powerful deep learning algorithms open an opportunity for solving non-image Machine Learning (ML) problems by transforming these problems into image recognition problems. The CPC-R algorithm presented in this chapter converts non-image data into images by visualizing non-image data. Then deep learning CNN algorithms solve the learning problems on these images. The design of the CPC-R algorithm allows preserving all high-dimensional information in 2-D images. The use of pair-values mapping instead of the single-value mapping used in the alternative approaches allows encoding each n-D point with 2 times fewer visual elements. The attributes of an n-D point are divided into pairs of its values and each pair is visualized as 2-D points in the same 2-D Cartesian coordinates. Next, grey scale or colour intensity values are assigned to each pair to encode the order of pairs. This results in a heatmap image. The computational experiments with CPC-R are conducted for different CNN architectures and methods to optimize the CPC-R images, showing that the combined CPC-R and deep learning CNN algorithms are able to solve non-image ML problems, reaching high accuracy on the benchmark datasets. This chapter expands our prior work by adding more experiments to test the accuracy of classification, exploring the saliency and informativeness of discovered features to test their interpretability, and generalizing the approach.
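The CPC-R encoding itself is simple to sketch: split the n-D point into value pairs, plot each pair in shared 2-D coordinates, and write an intensity that encodes the pair's order. The image size, intensity ramp, padding, and collision handling below are assumptions for illustration.

```python
import numpy as np

def cpcr_image(point, size=32, levels=None):
    """Hedged sketch of CPC-R: values are assumed pre-scaled to [0, 1];
    each consecutive value pair becomes one pixel whose grey level encodes
    the pair's position in the attribute order."""
    vals = np.asarray(point, dtype=float)
    if len(vals) % 2:                        # pad odd dimensionality
        vals = np.append(vals, 0.0)
    pairs = vals.reshape(-1, 2)
    if levels is None:                       # brightest = first pair
        levels = np.linspace(255, 64, len(pairs)).astype(np.uint8)
    img = np.zeros((size, size), dtype=np.uint8)
    for k, (x, y) in enumerate(pairs):
        col = min(int(x * (size - 1)), size - 1)
        row = min(int(y * (size - 1)), size - 1)
        img[size - 1 - row, col] = levels[k] # Cartesian y-up, order as intensity
    return img
```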
【5】 Deep Learning for Technical Document Classification
Authors: Shuo Jiang,Jianxi Luo,Jie Hu,Christopher L. Magee
Affiliations: School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; Institute for Data, Systems, and Society and SUTD-MIT International Design Centre, Massachusetts Institute of Technology, Cambridge, MA, USA
Note: 34 pages, 7 figures, 10 tables
Link: https://arxiv.org/abs/2106.14269
Abstract: In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers in supporting relevant decision making have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have primarily focused on processing text for classification and small-scale databases. This paper describes a novel multimodal deep learning architecture, called TechDoc, for technical document classification, which utilizes both natural language and descriptive images to train hierarchical classifiers. The architecture synthesizes convolutional neural networks and recurrent neural networks through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that the trained neural network presents a greater classification accuracy than those using a single modality and several earlier text classification methods. The trained model can potentially be scaled to millions of real-world technical documents with both text and figures, which is useful for data and knowledge management in large technology companies and organizations.
【6】 Attention-guided Progressive Mapping for Profile Face Recognition 标题:用于侧面人脸识别的注意力引导的渐进式映射
作者:Junyang Huang,Changxing Ding 机构:South China University of Technology, Guangzhou, China 备注:Accepted by IJCB 2021. Code is available 链接:https://arxiv.org/abs/2106.14124 摘要:近年来,得益于深度学习的发展,人脸识别领域取得了长足的进步。然而,跨姿态人脸识别仍然是一个巨大的挑战。许多深度学习算法难以缩小姿态变化带来的性能差距,其主要原因在于不同姿态人脸图像之间的类内差异以及训练数据集的姿态不平衡。通过向正面人脸的特征空间遍历来学习姿态鲁棒特征,为缓解这一问题提供了一种有效而廉价的途径。在本文中,我们提出了一种方法,借助注意力成对损失(attentive pair-wise loss)将侧脸表示逐步变换到规范姿态。首先,为了降低将侧脸特征直接变换为正面姿态的难度,我们提出逐块学习源姿态与其邻近姿态之间的特征残差,通过加上所学残差遍历到更小姿态的特征空间。其次,我们提出注意力成对损失来引导特征变换沿最有效的方向推进。最后,我们所提出的渐进式模块和注意力成对损失轻量且易于实现,仅增加约7.5%的额外参数。在CFP和CPLFW数据集上的评估表明了该方法的优越性。代码位于https://github.com/hjy1312/AGPM。 摘要:The past few years have witnessed great progress in the domain of face recognition thanks to advances in deep learning. However, cross pose face recognition remains a significant challenge. It is difficult for many deep learning algorithms to narrow the performance gap caused by pose variations; the main reasons for this relate to the intra-class discrepancy between face images in different poses and the pose imbalances of training datasets. Learning pose-robust features by traversing to the feature space of frontal faces provides an effective and cheap way to alleviate this problem. In this paper, we present a method for progressively transforming profile face representations to the canonical pose with an attentive pair-wise loss. Firstly, to reduce the difficulty of directly transforming the profile face features into a frontal pose, we propose to learn the feature residual between the source pose and its nearby pose in a block-by-block fashion, and thus traversing to the feature space of a smaller pose by adding the learned residual. Secondly, we propose an attentive pair-wise loss to guide the feature transformation progressing in the most effective direction. Finally, our proposed progressive module and attentive pair-wise loss are light-weight and easy to implement, adding only about 7.5% extra parameters. Evaluations on the CFP and CPLFW datasets demonstrate the superiority of our proposed method. Code is available at https://github.com/hjy1312/AGPM.
【7】 An Image Classifier Can Suffice Video Understanding 标题:一个图像分类器足以胜任视频理解
作者:Quanfu Fan,Chun-Fu Chen,Rameswar Panda 机构:MIT-IBM Watson AI Lab, Cambridge, MA 链接:https://arxiv.org/abs/2106.14104 摘要:我们将视频识别问题转化为图像识别任务,为视频理解提供了一个新的视角。我们表明,无需时间建模,仅靠一个图像分类器就足以完成视频理解。我们的方法简单且通用:它将输入帧组合成一张超级图像,然后以与图像分类完全相同的方式训练图像分类器来完成动作识别任务。我们通过在四个公共数据集(包括Kinetics400、Something-to-Something(V2)、MiT和Jester)上使用最新开发的vision transformer展示了强大而有前途的性能,从而证明了这种想法的可行性。我们还用计算机视觉中流行的ResNet图像分类器进行了实验,进一步验证了我们的观点。Kinetics400上的结果与一些基于时空建模的最佳CNN方法相当。我们的代码和模型将发布于https://github.com/IBM/sifar-pytorch。 摘要:We propose a new perspective on video understanding by casting the video recognition problem as an image recognition task. We show that an image classifier alone can suffice for video understanding without temporal modeling. Our approach is simple and universal. It composes input frames into a super image to train an image classifier to fulfill the task of action recognition, in exactly the same way as classifying an image. We prove the viability of such an idea by demonstrating strong and promising performance on four public datasets including Kinetics400, Something-to-something (V2), MiT and Jester, using a recently developed vision transformer. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. The results on Kinetics400 are comparable to some of the best-performed CNN approaches based on spatio-temporal modeling. Our code and models will be made available at https://github.com/IBM/sifar-pytorch.
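按照摘要中"将输入帧组合成一张超级图像"的思路,下面给出一个最小示意(非论文官方实现;3x3网格布局与补零方式均为演示假设):

```python
import torch

def make_super_image(frames, grid=(3, 3)):
    """把T帧拼成一张"超级图像"(示意实现)。
    frames: (T, C, H, W) 张量;T 不超过 grid[0]*grid[1],空位补零帧。
    """
    t, c, h, w = frames.shape
    rows, cols = grid
    canvas = torch.zeros(c, rows * h, cols * w)
    for k in range(min(t, rows * cols)):
        r, q = divmod(k, cols)                      # 第k帧放在第r行第q列
        canvas[:, r * h:(r + 1) * h, q * w:(q + 1) * w] = frames[k]
    return canvas                                   # 之后可直接送入任意图像分类器

frames = torch.randn(9, 3, 224, 224)                # 模拟9帧输入
super_img = make_super_image(frames)                # (3, 672, 672)
print(super_img.shape)
```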
【8】 Image Classification with CondenseNeXt for ARM-Based Computing Platforms 标题:基于ARM计算平台的CondenseNeXt图像分类
作者:Priyank Kalgaonkar,Mohamed El-Sharkawy 机构:Department of Electrical and Computer Engineering, Purdue School of Engineering and Technology, Indianapolis, Indiana , USA. 备注:6 pages, 7 figures, conference, published IEEE Conference paper 链接:https://arxiv.org/abs/2106.14102 摘要:在这篇论文中,我们展示了我们的超高效深度卷积神经网络架构CondenseNeXt在NXP BlueBox上的实现;NXP BlueBox是一个为自动驾驶车辆开发的自主驾驶开发平台。我们证明CondenseNeXt在FLOPs方面非常高效,它专为计算资源有限的基于ARM的嵌入式计算平台设计,无需支持CUDA的GPU即可执行图像分类。CondenseNeXt利用最先进的深度可分离卷积和模型压缩技术来实现显著的计算效率。我们在CIFAR-10、CIFAR-100和ImageNet数据集上进行了广泛的分析,以验证该卷积神经网络(CNN)结构的性能。它在三个基准数据集上实现了最先进的图像分类性能:CIFAR-10(4.79% top-1误差)、CIFAR-100(21.98% top-1误差)和ImageNet(7.91% 单模型、单裁剪top-5误差)。与CondenseNet相比,CondenseNeXt最终训练模型的尺寸减小了2.9 MB,前向FLOPs最多减少59.98%,并且无需支持CUDA的GPU即可在基于ARM的计算平台上以出色的效率执行图像分类。 摘要:In this paper, we demonstrate the implementation of our ultra-efficient deep convolutional neural network architecture: CondenseNeXt on NXP BlueBox, an autonomous driving development platform developed for self-driving vehicles. We show that CondenseNeXt is remarkably efficient in terms of FLOPs, designed for ARM-based embedded computing platforms with limited computational resources and can perform image classification without the need of a CUDA enabled GPU. CondenseNeXt utilizes the state-of-the-art depthwise separable convolution and model compression techniques to achieve a remarkable computational efficiency. Extensive analyses are conducted on CIFAR-10, CIFAR-100 and ImageNet datasets to verify the performance of CondenseNeXt Convolutional Neural Network (CNN) architecture. It achieves state-of-the-art image classification performance on three benchmark datasets including CIFAR-10 (4.79% top-1 error), CIFAR-100 (21.98% top-1 error) and ImageNet (7.91% single model, single crop top-5 error). CondenseNeXt achieves final trained model size improvement of 2.9 MB and up to 59.98% reduction in forward FLOPs compared to CondenseNet and can perform image classification on ARM-Based computing platforms without needing a CUDA enabled GPU support, with outstanding efficiency.
【9】 Spectral-Spatial Graph Reasoning Network for Hyperspectral Image Classification 标题:用于高光谱图像分类的光谱-空间图推理网络
作者:Di Wang,Bo Du,Liangpei Zhang 机构: Zhang is with the State Key Laboratory of Information Engineering inSurveying 链接:https://arxiv.org/abs/2106.13952 摘要:提出了一种用于高光谱图像分类的光谱空间图推理网络(SSGRN)。具体地说,该网络由空间图推理子网(SAGRN)和谱图推理子网(SEGRN)两部分组成,分别捕获空间图和谱图的上下文。与以往对原始图像进行超像素分割或试图在标签图像的指导下获取类别特征的方法不同,本文对网络的中间特征进行超像素分割,自适应地产生同质区域,得到有效的描述子。然后,我们在谱部分采用了相似的思想,合理地聚合通道,生成谱描述符,用于谱图上下文捕获。SAGRN和SEGRN中的所有图推理过程都是通过图卷积实现的。为了保证该方法的全局感知能力,利用非局部自注意机制获得了图推理中的所有邻接矩阵。最后,结合提取的空间和谱图上下文,得到SSGRN,实现了高精度的分类。在三个公共高光谱图像(HSI)基准上进行的大量定量和定性实验表明,与其他最先进的方法相比,所提出的方法具有竞争力。 摘要:In this paper, we propose a spectral-spatial graph reasoning network (SSGRN) for hyperspectral image (HSI) classification. Concretely, this network contains two parts, separately named spatial graph reasoning subnetwork (SAGRN) and spectral graph reasoning subnetwork (SEGRN), to capture the spatial and spectral graph contexts, respectively. Different from the previous approaches implementing superpixel segmentation on the original image or attempting to obtain the category features under the guide of label image, we perform the superpixel segmentation on intermediate features of the network to adaptively produce the homogeneous regions to get the effective descriptors. Then, we adopt a similar idea in spectral part that reasonably aggregating the channels to generate spectral descriptors for spectral graph contexts capturing. All graph reasoning procedures in SAGRN and SEGRN are achieved through graph convolution. To guarantee the global perception ability of the proposed methods, all adjacent matrices in graph reasoning are obtained with the help of non-local self-attention mechanism. At last, by combining the extracted spatial and spectral graph contexts, we obtain the SSGRN to achieve a high accuracy classification. Extensive quantitative and qualitative experiments on three public HSI benchmarks demonstrate the competitiveness of the proposed methods compared with other state-of-the-art approaches.
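摘要中的图推理以非局部自注意力构造邻接矩阵、再做图卷积。下面给出一个最小示意(非论文官方实现;共享投影矩阵与缩放方式均为演示假设):

```python
import torch
import torch.nn.functional as F

def nonlocal_graph_reasoning(desc, w_qk, w_val):
    """对一组区域/通道描述子做一次图卷积推理(示意实现)。
    desc: (N, D) 由超像素或通道聚合得到的描述子;
    邻接矩阵按非局部自注意力构造:A = softmax(QK^T / sqrt(D))。
    """
    q = desc @ w_qk                                    # 查询投影
    k = desc @ w_qk                                    # 键投影(共享矩阵,示意选择)
    a = F.softmax(q @ k.t() / desc.shape[1] ** 0.5, dim=-1)   # 邻接矩阵
    return F.relu(a @ desc @ w_val)                    # 图卷积:A X W

n, d = 64, 32
desc = torch.randn(n, d)
w_qk, w_val = torch.randn(d, d) * 0.1, torch.randn(d, d) * 0.1
print(nonlocal_graph_reasoning(desc, w_qk, w_val).shape)   # (64, 32)
```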
【10】 Dual-Stream Reciprocal Disentanglement Learning for Domain Adaption Person Re-Identification 标题:面向域自适应行人重识别的双流互反解缠学习
作者:Huafeng Li,Kaixiong Xu,Jinxing Li,Guangming Lu,Yong Xu,Zhengtao Yu,David Zhang 机构: Lu and Yong Xu are with Harbin Instituteof Technology, also with Shenzhen Key Laboratory ofVisual Object Detection and Recognition 备注:12 pages, 7 figures 链接:https://arxiv.org/abs/2106.13929 摘要:由于目标集上无需人工标注样本,无监督行人重识别(Re-ID)近年来通过额外利用源集而引起了广泛关注。然而,由于摄像机风格、光照和背景的不同,源域和目标域之间存在很大的差距,给跨域匹配带来了巨大挑战。为了解决这一问题,本文提出了一种新的双流互反解缠学习(DRDL)方法,能够高效地学习域不变特征。在DRDL中,首先构造两个编码器分别提取id相关和id无关特征,并分别用与之关联的分类器加以度量。此外,在对抗学习策略的驱动下,两个流相互促进、彼此增益,使得id相关特征与id无关特征从给定图像中完全解缠,编码器也因此足以获得有判别力但域不变的特征。与现有方法相比,该方法无需图像生成,不仅显著降低了计算复杂度,还去除了id相关特征中的冗余信息。大量实验证实了所提方法相比现有技术的优越性。源代码已发布于https://github.com/lhf12278/DRDL。 摘要:Since human-labeled samples are free for the target set, unsupervised person re-identification (Re-ID) has attracted much attention in recent years, by additionally exploiting the source set. However, due to the differences on camera styles, illumination and backgrounds, there exists a large gap between source domain and target domain, introducing a great challenge on cross-domain matching. To tackle this problem, in this paper we propose a novel method named Dual-stream Reciprocal Disentanglement Learning (DRDL), which is quite efficient in learning domain-invariant features. In DRDL, two encoders are first constructed for id-related and id-unrelated feature extractions, which are respectively measured by their associated classifiers. Furthermore, followed by an adversarial learning strategy, both streams reciprocally and positively affect each other, so that the id-related features and id-unrelated features are completely disentangled from a given image, allowing the encoder to be powerful enough to obtain the discriminative but domain-invariant features. In contrast to existing approaches, our proposed method is free from image generation, which not only reduces the computational complexity remarkably, but also removes redundant information from id-related features. Extensive experiments substantiate the superiority of our proposed method compared with the state-of-the-arts. The source code has been released at https://github.com/lhf12278/DRDL.
【11】 Midpoint Regularization: from High Uncertainty Training to Conservative Classification 标题:中点正则化:从高不确定性训练到保守分类
作者:Hongyu Guo 机构:National Research Council Canada, Montreal Road, Ottawa, Ontario, K,A ,R 备注:Accepted to ECML-PKDD 2021. arXiv admin note: substantial text overlap with arXiv:2012.01559 链接:https://arxiv.org/abs/2106.13913 摘要:标签平滑(LS)通过惩罚产生过度自信输出分布的模型,提高了模型的泛化能力。对于每个训练样本,LS策略将其分布质量分散到非真实标签类别上,以平滑独热(one-hot)编码的训练信号。我们通过考虑样本对来扩展这一技术,称为PLS。PLS首先通过对随机样本对取平均来创建中点样本,然后在训练过程中为每个中点样本学习一个平滑分布,从而得到带有高不确定性标签的中点用于训练。实验表明,PLS显著优于LS,相对分类误差最多降低30%。我们还通过可视化发现,无论是分布内还是分布外样本,PLS都给出非常低的获胜(winning)softmax分数。 摘要:Label Smoothing (LS) improves model generalization through penalizing models from generating overconfident output distributions. For each training sample the LS strategy smooths the one-hot encoded training signal by distributing its distribution mass over the non-ground truth classes. We extend this technique by considering example pairs, coined PLS. PLS first creates midpoint samples by averaging random sample pairs and then learns a smoothing distribution during training for each of these midpoint samples, resulting in midpoints with high uncertainty labels for training. We empirically show that PLS significantly outperforms LS, achieving up to 30% of relative classification error reduction. We also visualize that PLS produces very low winning softmax scores for both in and out of distribution samples.
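就摘要描述的第一步(通过对随机样本对取平均来创建中点样本),下面给出一个最小示意(非论文官方实现;论文中中点的平滑分布是在训练中学习得到的,这里仅用两个原标签的均匀混合作为占位目标,说明"高不确定性标签"的含义):

```python
import torch

def make_midpoints(x, y, num_classes):
    """用随机样本对的平均构造中点样本(示意实现)。
    x: (B, C, H, W) 图像批;y: (B,) 整数类别标签。
    """
    perm = torch.randperm(x.size(0))                 # 随机配对
    x_mid = 0.5 * (x + x[perm])                      # 像素级平均得到中点样本
    y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
    y_mid = 0.5 * (y_onehot + y_onehot[perm])        # 占位目标:两类各占0.5
    return x_mid, y_mid

x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
x_mid, y_mid = make_midpoints(x, y, 10)
print(x_mid.shape, y_mid.sum(dim=1))                 # 每行目标质量和为1
```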
【12】 Scene Uncertainty and the Wellington Posterior of Deterministic Image Classifiers 标题:场景不确定性与确定性图像分类器的惠灵顿后验
作者:Stephanie Tsuei,Aditya Golatkar,Stefano Soatto 机构:Department of Computer Science, UCLA, Los Angeles, CA 链接:https://arxiv.org/abs/2106.13870 摘要:我们提出了一种在给定输入数据上估计图像分类器结果不确定性的方法。通常用于图像分类的深度神经网络是从输入图像到输出类别的确定性映射,因此它们在给定数据上的结果本身不含不确定性,我们在定义、度量和解释"置信度"时必须明确指出所指的是哪种可变性。为此,我们引入惠灵顿后验:即针对"产生给定图像的同一场景还可能生成的数据"所得到的结果分布。由于有无限多的场景可以生成给定图像,惠灵顿后验需要从所描绘场景之外的其他场景进行归纳。我们探讨了使用数据增强、集成(ensembling)和模型线性化的替代方法;其他替代方案还包括生成对抗网络、条件先验网络和有监督的单视图重建。我们将这些替代方法与通过推断视频中时间相邻帧类别得到的经验后验进行对比检验。要以与安全关键应用兼容的方式评估深度网络分类器的可靠性,这些进展只是迈出的一小步。 摘要:We propose a method to estimate the uncertainty of the outcome of an image classifier on a given input datum. Deep neural networks commonly used for image classification are deterministic maps from an input image to an output class. As such, their outcome on a given datum involves no uncertainty, so we must specify what variability we are referring to when defining, measuring and interpreting "confidence." To this end, we introduce the Wellington Posterior, which is the distribution of outcomes that would have been obtained in response to data that could have been generated by the same scene that produced the given image. Since there are infinitely many scenes that could have generated the given image, the Wellington Posterior requires induction from scenes other than the one portrayed. We explore alternate methods using data augmentation, ensembling, and model linearization. Additional alternatives include generative adversarial networks, conditional prior networks, and supervised single-view reconstruction. We test these alternatives against the empirical posterior obtained by inferring the class of temporally adjacent frames in a video. These developments are only a small step towards assessing the reliability of deep network classifiers in a manner that is compatible with safety-critical applications.
【13】 Nonuniform Defocus Removal for Image Classification 标题:用于图像分类的非均匀散焦去除
作者:Nguyen Hieu Thao,Oleg Soloviev,Jacques Noom,Michel Verhaegen 机构:the date of receipt and acceptance should be inserted later 备注:29 pages, 10 figures 链接:https://arxiv.org/abs/2106.13864 摘要:我们提出并研究了与使用机器学习算法进行图像分类相关联的单帧非等晕(anisoplanatic)去卷积问题,称为非均匀散焦去除(NDR)问题。本文对NDR问题进行了数学分析,并提出了求解该问题的散焦去除(DR)算法。在不附加任何不可验证假设的情况下,建立了DR算法的全局收敛性。仿真数据上的数值结果表明,DR具有可解性、噪声鲁棒性、收敛性、对模型不敏感和计算高效等显著特点。通过实验数据检验了NDR问题的物理相关性和DR算法的实用性。回到最初引发NDR问题研究的应用,我们证明了DR算法可以提高卷积神经网络对畸变图像分类的精度。与现有的大多数单帧非等晕去卷积工作相比,本文的关键区别在于新方法不要求数据图像可分解为等晕子区域。因此,将图像划分为等晕区域的求解思路不适用于NDR问题,需要开发并分析像DR算法这样处理整幅图像的方法。 摘要:We propose and study the single-frame anisoplanatic deconvolution problem associated with image classification using machine learning algorithms, named the nonuniform defocus removal (NDR) problem. Mathematical analysis of the NDR problem is done and the so-called defocus removal (DR) algorithm for solving it is proposed. Global convergence of the DR algorithm is established without imposing any unverifiable assumption. Numerical results on simulation data show significant features of DR including solvability, noise robustness, convergence, model insensitivity and computational efficiency. Physical relevance of the NDR problem and practicability of the DR algorithm are tested on experimental data. Back to the application that originally motivated the investigation of the NDR problem, we show that the DR algorithm can improve the accuracy of classifying distorted images using convolutional neural networks. The key difference of this paper compared to most existing works on single-frame anisoplanatic deconvolution is that the new method does not require the data image to be decomposable into isoplanatic subregions. Therefore, solution approaches partitioning the image into isoplanatic zones are not applicable to the NDR problem and those handling the entire image such as the DR algorithm need to be developed and analyzed.
【14】 Functional Classwise Principal Component Analysis: A Novel Classification Framework 标题:功能分类主成分分析:一种新的分类框架
作者:Avishek Chatterjee,Satyaki Mazumder,Koel Das 机构: Das are with the Department ofMathematics and Statistics 链接:https://arxiv.org/abs/2106.13959 摘要:近年来,功能数据分析(FDA)已成功地应用于高维数据分类领域。在本文中,我们提出了一个利用功能数据和分类(classwise)主成分分析(PCA)的新分类框架。该方法适用于通常面临小样本问题的高维时间序列数据。它提取分段线性的函数特征空间,特别适用于困难的分类问题。该框架将时间序列数据转化为功能数据,利用分类功能PCA进行特征提取,再用贝叶斯线性分类器进行分类。我们将该方法应用于合成数据集以及来自神经科学、食品科学、医学和化学计量学等多个领域的真实时间序列数据,证明了其有效性。 摘要:In recent times, functional data analysis (FDA) has been successfully applied in the field of high dimensional data classification. In this paper, we present a novel classification framework using functional data and classwise Principal Component Analysis (PCA). Our proposed method can be used in high dimensional time series data which typically suffers from small sample size problem. Our method extracts a piece wise linear functional feature space and is particularly suitable for hard classification problems. The proposed framework converts time series data into functional data and uses classwise functional PCA for feature extraction followed by classification using a Bayesian linear classifier. We demonstrate the efficacy of our proposed method by applying it to both synthetic data sets and real time series data from diverse fields including but not limited to neuroscience, food science, medical sciences and chemometrics.
分割|语义相关(10篇)
【1】 K-Net: Towards Unified Image Segmentation 标题:K-Net:走向统一的图像分割
作者:Wenwei Zhang,Jiangmiao Pang,Kai Chen,Chen Change Loy 机构:S-Lab, Nanyang Technological University, CUHK-SenseTime Joint Lab, the Chinese University of Hong Kong, SenseTime Research, Shanghai AI Laboratory 备注:Technical Report 链接:https://arxiv.org/abs/2106.14855 摘要:尽管语义分割、实例分割和全景分割之间存在潜在的联系,它们一直使用不同的专门框架来解决。本文为这些本质上相似的任务提供了一个统一、简单、有效的框架。这个名为K-Net的框架通过一组可学习的内核一致地分割实例和语义类别,每个内核负责为一个潜在实例或一个stuff类生成掩码。为了克服区分不同实例的困难,我们提出了一种内核更新策略,使每个内核成为动态的,并以其在输入图像中对应的有意义分组为条件。K-Net可以基于二部匹配进行端到端训练,其训练和推理天然无需NMS、无需边界框。在不使用任何额外技巧的情况下,K-Net以52.1%的PQ和54.3%的mIoU分别超过了以往在MS COCO上全景分割和在ADE20K上语义分割的最佳单模型结果。其实例分割性能与MS COCO上的Cascade Mask R-CNN相当,且推理速度快60%-90%。代码和模型将发布于https://github.com/open-mmlab/mmdetection。 摘要:Semantic, instance, and panoptic segmentations have been addressed using different and specialized frameworks despite their underlying connections. This paper presents a unified, simple, and effective framework for these essentially similar tasks. The framework, named K-Net, segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. To remedy the difficulties of distinguishing various instances, we propose a kernel update strategy that enables each kernel dynamic and conditional on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Without bells and whistles, K-Net surpasses all previous state-of-the-art single-model results of panoptic segmentation on MS COCO and semantic segmentation on ADE20K with 52.1% PQ and 54.3% mIoU, respectively. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO with 60%-90% faster inference speeds. Code and models will be released at https://github.com/open-mmlab/mmdetection.
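摘要中"每个内核负责生成一个掩码"与"内核更新"的思想,可用如下最小示意来理解(非论文官方实现;真实的K-Net内核更新包含门控、注意力等组件,这里用掩码加权的组特征做了极简替代):

```python
import torch

def knet_masks(feat, kernels):
    """用一组可学习内核对特征图做1x1动态卷积生成掩码(示意实现)。
    feat: (C, H, W);kernels: (K, C),每个内核对应一个潜在实例或stuff类。
    """
    c, h, w = feat.shape
    logits = kernels @ feat.reshape(c, h * w)            # (K, H*W)
    return logits.reshape(-1, h, w).sigmoid()

def kernel_update(feat, masks, kernels):
    """示意的内核更新:用各自掩码加权的组特征刷新内核,使其以输入内容为条件。"""
    c, h, w = feat.shape
    m = masks.reshape(masks.shape[0], -1)                # (K, H*W)
    group = (m @ feat.reshape(c, h * w).t()) / (m.sum(-1, keepdim=True) + 1e-6)
    return kernels + group                               # 真实实现用门控/MLP,此处简化

feat = torch.randn(256, 64, 64)
kernels = torch.randn(100, 256) * 0.01
masks = knet_masks(feat, kernels)
kernels = kernel_update(feat, masks, kernels)            # 可迭代多轮
print(masks.shape, kernels.shape)                        # (100,64,64) (100,256)
```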
【2】 Real-Time Multi-View 3D Human Pose Estimation using Semantic Feedback to Smart Edge Sensors 标题:基于语义反馈的智能边缘传感器实时多视角三维人体姿态估计
作者:Simon Bultmann,Sven Behnke 机构:Autonomous Intelligent Systems, University of Bonn, Germany, Accepted for Robotics: Science and Systems (RSS) 备注:Accepted for Robotics: Science and Systems (RSS), July 2021, 10 pages, 6 figures 链接:https://arxiv.org/abs/2106.14729 摘要:提出了一种基于多摄像机的三维人体姿态估计新方法,该方法采用分布式智能边缘传感器,并通过语义反馈环与后端耦合。每个摄像机视图的二维关节点检测在专用的嵌入式推理处理器上本地执行;只有语义骨架表示通过网络传输,原始图像保留在传感器板上。基于三角测量和包含人体骨架先验知识的人体模型,在中心后端从二维关节点恢复三维姿态。从后端到各个传感器的反馈通道在语义层面实现:全局坐标系(allocentric)下的三维姿态被反投影到各传感器视图中,并与二维关节点检测融合。如此引入全局上下文信息,即可改进每个传感器上的局部语义模型。整个流水线能够实时运行。我们在三个公共数据集以及我们自建的多人实验环境中评估了该方法,取得了最先进的结果,并展示了反馈架构的优点。使用反馈信号可以改善二维关节点检测,进而改善估计的三维姿态。 摘要:We present a novel method for estimation of 3D human poses from a multi-camera setup, employing distributed smart edge sensors coupled with a backend through a semantic feedback loop. 2D joint detection for each camera view is performed locally on a dedicated embedded inference processor. Only the semantic skeleton representation is transmitted over the network and raw images remain on the sensor board. 3D poses are recovered from 2D joints on a central backend, based on triangulation and a body model which incorporates prior knowledge of the human skeleton. A feedback channel from backend to individual sensors is implemented on a semantic level. The allocentric 3D pose is backprojected into the sensor views where it is fused with 2D joint detections. The local semantic model on each sensor can thus be improved by incorporating global context information. The whole pipeline is capable of real-time operation. We evaluate our method on three public datasets, where we achieve state-of-the-art results and show the benefits of our feedback architecture, as well as in our own setup for multi-person experiments. Using the feedback signal improves the 2D joint detections and in turn the estimated 3D poses.
【3】 False Negative Reduction in Video Instance Segmentation using Uncertainty Estimates 标题:利用不确定性估计减少视频实例分割中的漏报
作者:Kira Maag 机构:University of Wuppertal, Germany 链接:https://arxiv.org/abs/2106.14474 摘要:图像实例分割是实现场景自动理解的重要工具。神经网络通常被训练以优化其在精度方面的整体性能;然而在自动驾驶等应用中,漏检的行人似乎比误检的行人危害更大。在这项工作中,针对在线应用中可获得图像序列的情形,我们提出了一种基于被跟踪实例时间序列不一致性的假阴性检测方法。由于该算法会大幅增加实例数量,我们使用在实例上聚合的不确定性估计进行假阳性剪枝。为此,我们构造了逐实例的度量,它们或刻画给定实例的不确定性与几何结构,或基于深度估计。该方法作为后处理步骤,适用于任何神经网络,即便该网络仅在单帧上训练。在我们的测试中,与使用实例分割网络在推理时给出的普通分数值相比,我们的融合检测方法在假阴性与假阳性之间取得了更好的折衷。 摘要:Instance segmentation of images is an important tool for automated scene understanding. Neural networks are usually trained to optimize their overall performance in terms of accuracy. Meanwhile, in applications such as automated driving, an overlooked pedestrian seems more harmful than a falsely detected one. In this work, we present a false negative detection method for image sequences based on inconsistencies in time series of tracked instances given the availability of image sequences in online applications. As the number of instances can be greatly increased by this algorithm, we apply a false positive pruning using uncertainty estimates aggregated over instances. To this end, instance-wise metrics are constructed which characterize uncertainty and geometry of a given instance or are predicated on depth estimation. The proposed method serves as a post-processing step applicable to any neural network that can also be trained on single frames only. In our tests, we obtain an improved trade-off between false negative and false positive instances by our fused detection approach in comparison to the use of an ordinary score value provided by the instance segmentation network during inference.
【4】 Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems 标题:Kimera-Multi:面向多机器人系统的鲁棒、分布式、密集度量语义SLAM
作者:Yulun Tian,Yun Chang,Fernando Herrera Arias,Carlos Nieto-Granda,Jonathan P. How,Luca Carlone 备注:18 pages, 15 figures 链接:https://arxiv.org/abs/2106.14386 摘要:本文介绍了Kimera-Multi,这是第一个具备如下特性的多机器人系统:(i)鲁棒,能够识别和拒绝由感知混叠引起的错误的机器人间和机器人内回环闭合;(ii)完全分布式,仅依赖本地(对等)通信来实现分布式定位和建图;(iii)实时构建全局一致的环境度量语义三维网格模型,并对网格面进行语义标注。Kimera-Multi由一组配备视觉惯性传感器的机器人实现。每个机器人使用Kimera构建局部轨迹估计和局部网格。当通信可用时,机器人基于一种新的分布式渐进非凸算法,发起分布式位置识别和鲁棒位姿图优化协议。该协议允许机器人利用机器人间回环闭合来改进其局部轨迹估计,同时对异常值保持鲁棒。最后,每个机器人利用改进后的轨迹估计,借助网格变形技术修正局部网格。我们在照片级真实感仿真、SLAM基准数据集以及使用地面机器人采集的具有挑战性的室外数据集上演示了Kimera-Multi。真实和仿真实验都涉及长轨迹(例如,每个机器人最长800米)。实验表明,Kimera-Multi(i)在鲁棒性和准确性方面优于现有技术,(ii)在完全分布式的同时实现了与集中式SLAM系统相当的估计误差,(iii)通信带宽占用少,(iv)生成精确的度量语义3D网格,以及(v)具有模块化设计,还可用于标准3D重建(即不带语义标签)或仅用于轨迹估计(即不重建3D网格)。 摘要:This paper presents Kimera-Multi, the first multi-robot system that (i) is robust and capable of identifying and rejecting incorrect inter and intra-robot loop closures resulting from perceptual aliasing, (ii) is fully distributed and only relies on local (peer-to-peer) communication to achieve distributed localization and mapping, and (iii) builds a globally consistent metric-semantic 3D mesh model of the environment in real-time, where faces of the mesh are annotated with semantic labels. Kimera-Multi is implemented by a team of robots equipped with visual-inertial sensors. Each robot builds a local trajectory estimate and a local mesh using Kimera. When communication is available, robots initiate a distributed place recognition and robust pose graph optimization protocol based on a novel distributed graduated non-convexity algorithm. The proposed protocol allows the robots to improve their local trajectory estimates by leveraging inter-robot loop closures while being robust to outliers. Finally, each robot uses its improved trajectory estimate to correct the local mesh using mesh deformation techniques. We demonstrate Kimera-Multi in photo-realistic simulations, SLAM benchmarking datasets, and challenging outdoor datasets collected using ground robots. Both real and simulated experiments involve long trajectories (e.g., up to 800 meters per robot). The experiments show that Kimera-Multi (i) outperforms the state of the art in terms of robustness and accuracy, (ii) achieves estimation errors comparable to a centralized SLAM system while being fully distributed, (iii) is parsimonious in terms of communication bandwidth, (iv) produces accurate metric-semantic 3D meshes, and (v) is modular and can be also used for standard 3D reconstruction (i.e., without semantic labels) or for trajectory estimation (i.e., without reconstructing a 3D mesh).
【5】 Semi-supervised Semantic Segmentation with Directional Context-aware Consistency 标题:具有方向性上下文感知一致性的半监督语义分割
作者:Xin Lai,Zhuotao Tian,Li Jiang,Shu Liu,Hengshuang Zhao,Liwei Wang,Jiaya Jia 机构:The Chinese University of Hong Kong, SmartMore, University of Oxford 备注:Accepted to CVPR 2021. Code is available at this https URL 链接:https://arxiv.org/abs/2106.14133 摘要:语义分割近年来取得了很大的进展,然而令人满意的性能在很大程度上依赖于大量的像素级标注。因此,本文主要研究半监督分割问题:只提供一小部分有标记数据,以及规模大得多的完全未标记图像集合。由于标注有限,模型可能过度依赖训练数据中的上下文,导致对未见过场景的泛化能力较差。理想的高层表示应在不丧失自身感知的情况下捕捉上下文信息。因此,我们提出在身份相同但上下文不同的特征之间保持上下文感知的一致性,使表示对不同环境具有鲁棒性。此外,我们提出方向对比损失(DC Loss),以像素对像素的方式实现一致性,且仅要求质量较低的特征向其对应特征对齐。另外,为了避免假阴性样本并过滤不确定的阳性样本,我们提出了两种采样策略。大量实验表明,该方法简单而有效,大幅超越当前最先进的方法,并且在有额外图像级标注时也能很好地泛化。 摘要:Semantic segmentation has made tremendous progress in recent years. However, satisfying performance highly depends on a large number of pixel-level annotations. Therefore, in this paper, we focus on the semi-supervised segmentation problem where only a small set of labeled data is provided with a much larger collection of totally unlabeled images. Nevertheless, due to the limited annotations, models may overly rely on the contexts available in the training data, which causes poor generalization to the scenes unseen before. A preferred high-level representation should capture the contextual information while not losing self-awareness. Therefore, we propose to maintain the context-aware consistency between features of the same identity but with different contexts, making the representations robust to the varying environments. Moreover, we present the Directional Contrastive Loss (DC Loss) to accomplish the consistency in a pixel-to-pixel manner, only requiring the feature with lower quality to be aligned towards its counterpart. In addition, to avoid the false-negative samples and filter the uncertain positive samples, we put forward two sampling strategies. Extensive experiments show that our simple yet effective method surpasses current state-of-the-art methods by a large margin and also generalizes well with extra image-level annotations.
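摘要中"仅要求质量较低的特征向其对应特征对齐"的方向性思想,可用如下最小示意来理解(这并非论文完整的DC Loss:此处省略了对比学习中的负样本项,置信度的来源也只是演示假设):

```python
import torch
import torch.nn.functional as F

def directional_consistency(f1, f2, conf1, conf2):
    """方向性一致性示意:逐像素只把置信度较低的一侧向detach后的
    较高置信度特征对齐。f1, f2: (N, D) 同一身份像素在两种上下文下
    的特征;conf1, conf2: (N,) 各自的置信度。"""
    f1n, f2n = F.normalize(f1, dim=1), F.normalize(f2, dim=1)
    to2 = (conf2 > conf1).float().unsqueeze(1)        # f1 需向 f2 对齐的像素
    loss = (to2 * (1 - (f1n * f2n.detach()).sum(1, keepdim=True)) +
            (1 - to2) * (1 - (f2n * f1n.detach()).sum(1, keepdim=True)))
    return loss.mean()

f1 = torch.randn(128, 64, requires_grad=True)
f2 = torch.randn(128, 64, requires_grad=True)
c1, c2 = torch.rand(128), torch.rand(128)
print(directional_consistency(f1, f2, c1, c2).item())
```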
【6】 Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts 标题:基于语义概念的多模态变分自动编码器的广义Zero-Shot学习
作者:Nihar Bendre,Kevin Desai,Peyman Najafirad 机构:Secure Artificial Intelligent Laboratory and Autonomy (AILA), The University of Texas at San Antonio, Texas, USA 备注:5 pages, 2 figures, 2 tables 链接:https://arxiv.org/abs/2106.14082 摘要:随着数据量的不断增加,多模态学习面临的主要挑战是标记样本的局限性。对于分类任务,元学习、Zero-Shot学习和Few-Shot学习等技术展示了基于先验知识学习新类信息的能力。最近的技术试图学习语义空间和图像空间之间的跨模态映射,然而它们往往忽略了局部和全局的语义知识。为了克服这个问题,我们提出了一种多模态变分自动编码器(M-VAE),它可以学习图像特征与语义空间的共享潜空间。在我们的方法中,我们将多模态数据拼接为单个嵌入,然后将其传递给VAE以学习潜空间。我们提出在通过解码器重建特征嵌入的过程中使用多模态损失。我们的方法能够关联各模态,并利用局部和全局语义知识对新样本进行预测。我们在四个基准数据集上使用MLP分类器的实验结果表明,我们提出的模型优于目前最先进的广义Zero-Shot学习方法。 摘要:With the ever-increasing amount of data, the central challenge in multimodal learning involves limitations of labelled samples. For the task of classification, techniques such as meta-learning, zero-shot learning, and few-shot learning showcase the ability to learn information about novel classes based on prior knowledge. Recent techniques try to learn a cross-modal mapping between the semantic space and the image space. However, they tend to ignore the local and global semantic knowledge. To overcome this problem, we propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space. In our approach we concatenate multimodal data to a single embedding before passing it to the VAE for learning the latent space. We propose the use of a multi-modal loss during the reconstruction of the feature embedding through the decoder. Our approach is capable of correlating modalities and exploit the local and global semantic knowledge for novel sample predictions. Our experimental results using a MLP classifier on four benchmark datasets show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
【7】 ACN: Adversarial Co-training Network for Brain Tumor Segmentation with Missing Modalities 标题:ACN:面向缺失模态脑肿瘤分割的对抗性协同训练网络
作者:Yixin Wang,Yang Zhang,Yang Liu,Zihao Lin,Jiang Tian,Cheng Zhong,Zhongchao Shi,Jianping Fan,Zhiqiang He 机构: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China, Lenovo Corporate Research & Development, Lenovo Ltd., Beijing, China, AI Lab, Lenovo Research, Beijing, China 备注:MICCAI 2021 链接:https://arxiv.org/abs/2106.14591 摘要:从磁共振成像(MRI)中准确分割脑肿瘤在诊断、预后和手术治疗中具有重要的临床意义,这需要多种模态来提供互补的形态学和生理病理学信息。然而,在临床实践中,由于图像损坏、伪影、采集协议不同或对某些造影剂过敏,模态缺失十分常见。尽管已有工作证明了为所有缺失情况建立统一模型的可能性,但当缺失多于一种模态时,大多数模型的表现都很差。在本文中,我们提出了一种新的对抗式协同训练网络(ACN)来解决这一问题:针对每一种缺失情况训练一系列独立但相关的模型,效果显著更好。具体地说,ACN采用了一种新的协同训练网络,使完整模态和缺失模态的耦合学习过程能够互补彼此的域和特征表示,更重要的是,能够恢复缺失模态的"缺失"信息。在此基础上,我们提出了两个无监督学习模块,即熵对抗学习模块和知识对抗学习模块,分别用于在提高预测可靠性的同时最小化域差距,以及鼓励潜在表示的对齐。我们还将模态间互信息知识迁移学习引入ACN,以保留模态间丰富的互信息。在BraTS2018数据集上的大量实验表明,在任何缺失情况下,所提方法都显著优于现有最先进的方法。 摘要:Accurate segmentation of brain tumors from magnetic resonance imaging (MRI) is clinically relevant in diagnoses, prognoses and surgery treatment, which requires multiple modalities to provide complementary morphological and physiopathologic information. However, missing modality commonly occurs due to image corruption, artifacts, different acquisition protocols or allergies to certain contrast agents in clinical practice. Though existing efforts demonstrate the possibility of a unified model for all missing situations, most of them perform poorly when more than one modality is missing. In this paper, we propose a novel Adversarial Co-training Network (ACN) to solve this issue, in which a series of independent yet related models are trained dedicated to each missing situation with significantly better results. Specifically, ACN adopts a novel co-training network, which enables a coupled learning process for both full modality and missing modality to supplement each other's domain and feature representations, and more importantly, to recover the `missing' information of absent modalities. Then, two unsupervised modules, i.e., entropy and knowledge adversarial learning modules are proposed to minimize the domain gap while enhancing prediction reliability and encouraging the alignment of latent representations, respectively. We also adapt modality-mutual information knowledge transfer learning to ACN to retain the rich mutual information among modalities. Extensive experiments on BraTS2018 dataset show that our proposed method significantly outperforms all state-of-the-art methods under any missing situation.
【8】 Disentangling semantic features of macromolecules in Cryo-Electron Tomography 标题:低温电子断层扫描中大分子语义特征的解耦
作者:Kai Yi,Jianye Pang,Yungeng Zhang,Xiangrui Zeng,Min Xu 机构:Department of Computer Science, King Abdullah University of Science and Technology, Xi’an Jiaotong University, Peking University, Computational Biology Department, Carnegie Mellon University 链接:https://arxiv.org/abs/2106.14192 摘要:低温电子断层成像(Cryo-ET)是一种三维成像技术,能以近原子分辨率系统地研究单细胞内大分子结构的形态、丰度和分布。然而,由于结构的复杂性和成像的局限性,对Cryo-ET捕获的大分子结构进行系统而高效的从头(de novo)识别和恢复极具挑战性。即使是结构相同的大分子,由于取向不同以及噪声、缺失楔(missing wedge)效应等成像限制,也会呈现不同的外观。显式地解耦大分子的语义特征,是对其进行多种下游分析的关键。本文提出了一种三维空间变分自动编码器来解决这一问题,它可以显式地解耦大分子的结构、取向和位移。在合成与真实Cryo-ET数据集上的大量实验和跨域评估证明了该方法的有效性。 摘要:Cryo-electron tomography (Cryo-ET) is a 3D imaging technique that enables the systemic study of shape, abundance, and distribution of macromolecular structures in single cells in near-atomic resolution. However, the systematic and efficient de novo recognition and recovery of macromolecular structures captured by Cryo-ET are very challenging due to the structural complexity and imaging limits. Even macromolecules with identical structures have various appearances due to different orientations and imaging limits, such as noise and the missing wedge effect. Explicitly disentangling the semantic features of macromolecules is crucial for performing several downstream analyses on the macromolecules. This paper has addressed the problem by proposing a 3D Spatial Variational Autoencoder that explicitly disentangles the structure, orientation, and shift of macromolecules. Extensive experiments on both synthesized and real cryo-ET datasets and cross-domain evaluations demonstrate the efficacy of our method.
【9】 Residual Moment Loss for Medical Image Segmentation 标题:医学图像分割中的残差矩损失
作者:Quanziang Wang,Renzhen Wang,Yuexiang Li,Kai Ma,Yefeng Zheng,Deyu Meng 机构: School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China, Tencent Jarvis Lab, Shenzhen, China 链接:https://arxiv.org/abs/2106.14178 摘要:位置信息已被证明有助于深度学习模型捕捉目标的流形结构,从而提高医学图像分割的精度。然而,现有的大多数方法以隐式方式编码位置信息以便网络学习,例如描述每个像素到轮廓边界相对距离的距离变换图。这些隐式方法不能充分利用目标的位置信息(即绝对位置)。本文提出了一种新的损失函数,即残差矩(RM)损失,用于在深度学习网络训练过程中显式嵌入分割目标的位置信息。具体而言,受图像矩的启发,用坐标信息对分割预测图和真值图进行加权;我们的RM损失随后鼓励网络保持两幅加权图之间的一致性,促使分割网络更容易定位目标并提取与流形结构相关的特征。我们在两个公开数据集(二维视杯视盘分割和三维左心房分割)上进行了大量实验来验证所提出的RM损失。实验结果证明了RM损失的有效性,它显著提高了分割网络的精度。 摘要:Location information is proven to benefit the deep learning models on capturing the manifold structure of target objects, and accordingly boosts the accuracy of medical image segmentation. However, most existing methods encode the location information in an implicit way, e.g. the distance transform maps, which describe the relative distance from each pixel to the contour boundary, for the network to learn. These implicit approaches do not fully exploit the position information (i.e. absolute location) of targets. In this paper, we propose a novel loss function, namely residual moment (RM) loss, to explicitly embed the location information of segmentation targets during the training of deep learning networks. Particularly, motivated by image moments, the segmentation prediction map and ground-truth map are weighted by coordinate information. Then our RM loss encourages the networks to maintain the consistency between the two weighted maps, which promotes the segmentation networks to easily locate the targets and extract manifold-structure-related features. We validate the proposed RM loss by conducting extensive experiments on two publicly available datasets, i.e., 2D optic cup and disk segmentation and 3D left atrial segmentation. The experimental results demonstrate the effectiveness of our RM loss, which significantly boosts the accuracy of segmentation networks.
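按照摘要的描述(用坐标信息对预测图与真值图加权,并约束两组加权图一致),下面给出一个最小示意(非论文官方实现;所用矩的阶数与组合方式为演示假设):

```python
import torch

def rm_loss(pred, gt):
    """残差矩损失的示意实现:用x/y坐标网格对预测图与真值图加权,
    再约束两组加权图(外加零阶项)的一致性。
    pred, gt: (B, H, W),pred 为 sigmoid 后的前景概率。"""
    b, h, w = pred.shape
    ys = torch.linspace(0, 1, h).view(1, h, 1).expand(b, h, w)   # y 坐标图
    xs = torch.linspace(0, 1, w).view(1, 1, w).expand(b, h, w)   # x 坐标图
    loss = 0.0
    for wmap in (torch.ones_like(pred), xs, ys, xs * ys):        # 0阶与1阶矩加权
        loss = loss + ((wmap * pred - wmap * gt) ** 2).mean()
    return loss

pred = torch.rand(2, 64, 64, requires_grad=True)
gt = (torch.rand(2, 64, 64) > 0.5).float()
print(rm_loss(pred, gt).item())
```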
【10】 BiX-NAS: Searching Efficient Bi-directional Architecture for Medical Image Segmentation 标题:BiX-NAS:搜索高效的医学图像分割双向体系结构
作者:Xinyi Wang,Tiange Xiang,Chaoyi Zhang,Yang Song,Dongnan Liu,Heng Huang,Weidong Cai 机构: School of Computer Science, University of Sydney, Australia, School of Computer Science and Engineering, University of New South Wales, Electrical and Computer Engineering, University of Pittsburgh, USA 备注:MICCAI2021 链接:https://arxiv.org/abs/2106.14033 摘要:最近,在各种医学图像分割任务中,递归机制被引入到U-Net中。现有的研究主要集中在通过重用构建块来促进网络递归。虽然可以大大节省网络参数,但计算量仍随预设的迭代次数不可避免地增加。在这项工作中,我们研究了双向跳跃连接网络的多尺度升级,并通过一种新的两阶段神经结构搜索(NAS)算法BiX-NAS自动发现高效结构。该方法通过在不同层级和迭代中筛除无效的多尺度特征来降低网络计算量。我们在三个不同的医学图像数据集、两个分割任务上对BiX-NAS进行了评估,实验结果表明,BiX-NAS搜索得到的结构在显著降低计算成本的同时达到了最先进的性能。 摘要:The recurrent mechanism has recently been introduced into U-Net in various medical image segmentation tasks. Existing studies have focused on promoting network recursion via reusing building blocks. Although network parameters could be greatly saved, computational costs still increase inevitably in accordance with the pre-set iteration time. In this work, we study a multi-scale upgrade of a bi-directional skip connected network and then automatically discover an efficient architecture by a novel two-phase Neural Architecture Search (NAS) algorithm, namely BiX-NAS. Our proposed method reduces the network computational cost by sifting out ineffective multi-scale features at different levels and iterations. We evaluate BiX-NAS on two segmentation tasks using three different medical image datasets, and the experimental results show that our BiX-NAS searched architecture achieves the state-of-the-art performance with significantly lower computational cost.
Zero/Few Shot|迁移|域适配|自适应(6篇)
【1】 Fast computation of mutual information in the frequency domain with applications to global multimodal image alignment 标题:频域互信息快速计算及其在全局多模态图像配准中的应用
作者:Johan Öfverstedt,Joakim Lindblad,Nataša Sladoje 机构:Department of Information Technology, Uppsala University, L¨agerhyddsv¨agen , Uppsala, Sweden 备注:7 pages, 4 figures, 2 tables. The article is under consideration at Pattern Recognition Letters 链接:https://arxiv.org/abs/2106.14699 摘要:多模态图像配准是在由不同成像技术或不同条件下获得的图像之间寻找空间对应关系的过程,以便于异构数据融合和关联分析。信息论中的互信息(MI)概念被广泛用作指导多模态配准过程的相似性度量,其中大多数工作集中在MI的局部最大化上,通常只对小位移有效;这表明需要对MI进行全局最大化,而此前由于现有算法运行时间复杂度高,这在计算上是不可行的。我们提出了一种高效算法,基于频域互相关来计算所有离散位移下的MI(形式化为交叉互信息函数(CMIF))。我们证明该算法与直接方法等价,同时在运行时间上渐近更优。在此基础上,我们提出了一种基于CMIF算法、面向少自由度(如刚体)变换模型的多模态图像配准方法。我们在航空图像、细胞学图像和组织学图像三个不同的基准数据集上评估了该方法的有效性,在恢复已知刚性变换方面观察到极高的成功率,总体上优于其他方法,包括MI的局部优化以及最近的几种基于深度学习的方法。我们还评估了所提算法GPU实现的运行时间:与直接方法的GPU实现相比,对于真实图像尺寸,加速比从100倍到超过10000倍。代码已开源:https://github.com/MIDA-group/globalign。 摘要:Multimodal image alignment is the process of finding spatial correspondences between images formed by different imaging techniques or under different conditions, to facilitate heterogeneous data fusion and correlative analysis. The information-theoretic concept of mutual information (MI) is widely used as a similarity measure to guide multimodal alignment processes, where most works have focused on local maximization of MI that typically works well only for small displacements; this points to a need for global maximization of MI, which has previously been computationally infeasible due to the high run-time complexity of existing algorithms. We propose an efficient algorithm for computing MI for all discrete displacements (formalized as the cross-mutual information function (CMIF)), which is based on cross-correlation computed in the frequency domain. We show that the algorithm is equivalent to a direct method while asymptotically superior in terms of run-time. Furthermore, we propose a method for multimodal image alignment for transformation models with few degrees of freedom (e.g. rigid) based on the proposed CMIF-algorithm. We evaluate the efficacy of the proposed method on three distinct benchmark datasets, of aerial images, cytological images, and histological images, and we observe excellent success-rates (in recovering known rigid transformations), overall outperforming alternative methods, including local optimization of MI as well as several recent deep learning-based approaches. We also evaluate the run-times of a GPU implementation of the proposed algorithm and observe speed-ups from 100 to more than 10,000 times for realistic image sizes compared to a GPU implementation of a direct method. Code is shared as open-source at https://github.com/MIDA-group/globalign.
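摘要的核心思想是:对每对灰度级的指示图做频域互相关,即可一次性得到所有位移下的联合直方图,进而得到逐位移的MI。下面给出一个最小示意(非论文官方实现;为简化假设循环边界与完全重叠,量化级数也是演示取值):

```python
import numpy as np

def cmif(a, b, levels=8):
    """交叉互信息函数(CMIF)的示意实现:返回所有(循环)位移下的MI图。"""
    qa = np.minimum((a * levels).astype(int), levels - 1)   # 灰度量化
    qb = np.minimum((b * levels).astype(int), levels - 1)
    n = a.size
    fa = [np.fft.fft2((qa == i).astype(float)) for i in range(levels)]
    fb = [np.fft.fft2((qb == j).astype(float)) for j in range(levels)]
    pa = np.array([(qa == i).mean() for i in range(levels)])  # 边缘分布
    pb = np.array([(qb == j).mean() for j in range(levels)])
    mi = np.zeros(a.shape)
    for i in range(levels):
        for j in range(levels):
            # 指示图互相关 = 每个位移下(i, j)的联合计数,除以n得联合概率
            joint = np.real(np.fft.ifft2(np.conj(fa[i]) * fb[j])) / n
            joint = np.clip(joint, 1e-12, None)
            mi += joint * np.log(joint / (pa[i] * pb[j] + 1e-12))
    return mi                                 # mi[dy, dx] 即位移(dy, dx)处的MI

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = np.roll(a, (5, 9), axis=(0, 1))           # 已知平移的"另一模态"替身
print(np.unravel_index(np.argmax(cmif(a, b)), a.shape))   # 期望接近 (5, 9)
```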
【2】 Dizygotic Conditional Variational AutoEncoder for Multi-Modal and Partial Modality Absent Few-Shot Learning 标题:用于多模态及部分模态缺失Few-Shot学习的双合子条件变分自动编码器
作者:Yi Zhang,Sheng Huang,Xi Peng,Dan Yang 机构: Chongqing University 备注:13 pages 链接:https://arxiv.org/abs/2106.14467 摘要:数据增强技术是提高Few-Shot分类性能的有力手段。它生成更多的样本作为补充,然后将该任务转化为常规的有监督学习问题进行求解。然而,大多数主流的基于数据增强的方法只考虑单一模态的信息,导致生成特征的多样性和质量较低。针对上述问题,本文提出了一种新的多模态数据增强方法:双合子条件变分自动编码器(DCVAE)。DCVAE以双合子共生的方式,将两个条件变分自编码器(CVAE)以相同种子但不同模态条件配对来进行特征合成。随后,将两个CVAE生成的特征自适应地组合起来得到最终特征,该特征可被转换回其配对条件,同时保证这些条件不仅在表示上而且在功能上与原始条件一致。DCVAE本质上提供了一种新思路:利用不同模态先验信息的互补性,在各种多模态场景中进行数据增强。大量实验结果表明,我们的工作在miniImageNet、CIFAR-FS和CUB数据集上取得了最先进的性能,并且能够在部分模态缺失的情况下很好地工作。 摘要:Data augmentation is a powerful technique for improving the performance of the few-shot classification task. It generates more samples as supplements, and then this task can be transformed into a common supervised learning issue for solution. However, most mainstream data augmentation based approaches only consider the single modality information, which leads to the low diversity and quality of generated features. In this paper, we present a novel multi-modal data augmentation approach named Dizygotic Conditional Variational AutoEncoder (DCVAE) for addressing the aforementioned issue. DCVAE conducts feature synthesis via pairing two Conditional Variational AutoEncoders (CVAEs) with the same seed but different modality conditions in a dizygotic symbiosis manner. Subsequently, the generated features of two CVAEs are adaptively combined to yield the final feature, which can be converted back into its paired conditions while ensuring these conditions are consistent with the original conditions not only in representation but also in function. DCVAE essentially provides a new idea of data augmentation in various multi-modal scenarios by exploiting the complement of different modality prior information. Extensive experimental results demonstrate our work achieves state-of-the-art performances on miniImageNet, CIFAR-FS and CUB datasets, and is able to work well in the partial modality absence case.
【3】 Few-Shot Domain Expansion for Face Anti-Spoofing 标题:面向人脸反欺骗的Few-Shot域扩展
作者:Bowen Yang,Jing Zhang,Zhenfei Yin,Jing Shao 机构:College of Software, Beihang University, SenseTime Research 备注:10 pages, 5 figures 链接:https://arxiv.org/abs/2106.14162 摘要:人脸反欺骗(FAS)是人脸识别系统中不可缺少的、应用广泛的模块。尽管已经取得了很高的准确率,但由于应用环境的非平稳性以及现实应用中可能出现的新型呈现攻击(presentation attack),FAS系统永远不会是完美的。在实际应用中,给定新部署场景(目标域)中的少量标记样本和现有源域中丰富的标记人脸图像,我们期望FAS系统在新场景中表现良好,同时不牺牲其在原有域上的性能。为此,我们确定并解决了一个更为实际的问题:人脸反欺骗的Few-Shot域扩展(FSDE-FAS)。该问题具有挑战性,因为在目标域训练样本不足的情况下,模型可能既过拟合目标域,又对源域产生灾难性遗忘。针对这一问题,本文提出了一种基于风格迁移的语义对齐增强(SASA)框架。我们提出基于照片级真实感风格迁移生成辅助样本来增强目标数据。在增强数据的辅助下,我们进一步提出一种精心设计的机制,从实例级和分布级对不同的域进行对齐,并以少遗忘约束来稳定源域上的性能。我们提出了两个模拟FSDE-FAS场景的基准,实验结果表明所提出的SASA方法优于现有方法。 摘要:Face anti-spoofing (FAS) is an indispensable and widely used module in face recognition systems. Although high accuracy has been achieved, a FAS system will never be perfect due to the non-stationary applied environments and the potential emergence of new types of presentation attacks in real-world applications. In practice, given a handful of labeled samples from a new deployment scenario (target domain) and abundant labeled face images in the existing source domain, the FAS system is expected to perform well in the new scenario without sacrificing the performance on the original domain. To this end, we identify and address a more practical problem: Few-Shot Domain Expansion for Face Anti-Spoofing (FSDE-FAS). This problem is challenging since with insufficient target domain training samples, the model may suffer from both overfitting to the target domain and catastrophic forgetting of the source domain. To address the problem, this paper proposes a Style transfer-based Augmentation for Semantic Alignment (SASA) framework. We propose to augment the target data by generating auxiliary samples based on photorealistic style transfer. With the assistant of the augmented data, we further propose a carefully designed mechanism to align different domains from both instance-level and distribution-level, and then stabilize the performance on the source domain with a less-forgetting constraint. Two benchmarks are proposed to simulate the FSDE-FAS scenarios, and the experimental results show that the proposed SASA method outperforms state-of-the-art methods.
【4】 Semantics-aware Multi-modal Domain Translation:From LiDAR Point Clouds to Panoramic Color Images 标题:语义感知的多模态领域翻译:从LiDAR点云到全景彩色图像
作者:Tiago Cortinhal,Fatih Kurnaz,Eren Aksoy 机构:Halmstad University, Sweden, Middle East Technical University, Turkey, Eren Erdal Aksoy 链接:https://arxiv.org/abs/2106.13974 摘要:在这项工作中,我们提出了一个简单而有效的框架,来解决具有独特数据格式的不同传感器模态之间的域转换问题。通过仅依赖场景的语义,我们的模块化生成框架首次能够从给定的完整三维LiDAR点云合成全景彩色图像。该框架从点云的语义分割开始,点云先被投影到球面上;对相应的摄像机图像进行相同的语义分割。接下来,我们新的条件生成模型通过对抗学习,将预测的LiDAR分割图转换为对应的摄像机图像分割。最后,对生成的图像分割进行处理,渲染出全景场景图像。我们在SemanticKitti数据集上进行了全面的定量评估,结果表明我们提出的框架优于其他强基线模型。我们的源代码见https://github.com/halmstad-University/TITAN-NET。 摘要:In this work, we present a simple yet effective framework to address the domain translation problem between different sensor modalities with unique data formats. By relying only on the semantics of the scene, our modular generative framework can, for the first time, synthesize a panoramic color image from a given full 3D LiDAR point cloud. The framework starts with semantic segmentation of the point cloud, which is initially projected onto a spherical surface. The same semantic segmentation is applied to the corresponding camera image. Next, our new conditional generative model adversarially learns to translate the predicted LiDAR segment maps to the camera image counterparts. Finally, generated image segments are processed to render the panoramic scene images. We provide a thorough quantitative evaluation on the SemanticKitti dataset and show that our proposed framework outperforms other strong baseline models. Our source code is available at https://github.com/halmstad-University/TITAN-NET
【5】 Domain Conditional Predictors for Domain Adaptation 标题:面向域自适应的域条件预测器
作者:Joao Monteiro,Xavier Gibert,Jianqiao Feng,Vincent Dumoulin,Dar-Shyang Lee 机构:Google 备注:Part of the pre-registration workshop at NeurIPS 2020: this https URL 链接:https://arxiv.org/abs/2106.13899 摘要:学习保证通常依赖于i.i.d.数据的假设,一旦预测者被部署到执行现实任务中,在实践中很可能会违反这些假设。因此,域自适应方法作为一个有用的框架出现,在支持不同的训练和测试数据分布时产生额外的灵活性,前提是满足其他假设,如协变量移位,即期望标签上的条件分布独立于基础数据分布。为了在不同的训练数据源和测试数据源之间进行泛化,引入了几种方法,这些方法通常依赖于域不变性的一般思想,使得预测模型忽略了数据生成分布。在本文中,我们通过从相反的方向来处理跨数据源的泛化问题:我们考虑一种条件建模方法,其中预测除了依赖于输入数据外,还使用与底层数据生成分布相关的信息。例如,该模型有一个明确的机制来适应不断变化的环境和/或新的数据源。我们认为,这种方法比现有的域自适应方法更具普遍适用性,因为它不需要额外的假设,如协变量移位,并进一步产生更简单的训练算法,避免了通常在域不变方法中使用的minimax公式引起的训练不稳定性的共同来源。 摘要:Learning guarantees often rely on assumptions of i.i.d. data, which will likely be violated in practice once predictors are deployed to perform real-world tasks. Domain adaptation approaches thus appeared as a useful framework yielding extra flexibility in that distinct train and test data distributions are supported, provided that other assumptions are satisfied such as covariate shift, which expects the conditional distributions over labels to be independent of the underlying data distribution. Several approaches were introduced in order to induce generalization across varying train and test data sources, and those often rely on the general idea of domain-invariance, in such a way that the data-generating distributions are to be disregarded by the prediction model. In this contribution, we tackle the problem of generalizing across data sources by approaching it from the opposite direction: we consider a conditional modeling approach in which predictions, in addition to being dependent on the input data, use information relative to the underlying data-generating distribution. For instance, the model has an explicit mechanism to adapt to changing environments and/or new data sources. We argue that such an approach is more generally applicable than current domain adaptation methods since it does not require extra assumptions such as covariate shift and further yields simpler training algorithms that avoid a common source of training instabilities caused by minimax formulations, often employed in domain-invariant methods.
【6】 Multimodal Few-Shot Learning with Frozen Language Models 标题:基于冻结语言模型的多模态Few-Shot学习
作者:Maria Tsimpoukelli,Jacob Menick,Serkan Cabi,S. M. Ali Eslami,Oriol Vinyals,Felix Hill 机构:DeepMind, University College London 链接:https://arxiv.org/abs/2106.13884 摘要:在以足够大的规模训练时,自回归语言模型展现出显著的能力:仅需少量示例作为提示,即可学习新的语言任务。在这里,我们提出了一个简单而有效的方法,将这种Few-Shot学习能力迁移到多模态场景(视觉与语言)。使用对齐的图像和描述文本数据,我们训练一个视觉编码器,把每幅图像表示为一系列连续嵌入,使得以此为前缀提示的预训练冻结语言模型能够生成相应的图像描述。由此产生的系统是一个多模态Few-Shot学习器:当以若干示例(表示为多幅交织的图像嵌入与文本嵌入构成的序列)为条件时,它展现出学习各种新任务的惊人能力。通过在多种既有和新建基准上评测同一个模型,我们证明它能快速学习新对象和新视觉类别的词汇、仅用少量示例完成视觉问答,并能利用外部知识。 摘要:When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
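摘要中"图像前缀嵌入+冻结语言模型"的接口可用如下最小示意来理解(非论文官方实现;这里用一个小的TransformerEncoder充当"冻结LM"的替身,实际的Frozen使用自回归解码器;前缀数量k、维度与投影骨干均为演示假设):

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """把图像特征映射为k个连续前缀嵌入(示意实现)。"""
    def __init__(self, feat_dim=2048, k=2, d_model=512):
        super().__init__()
        self.k, self.d = k, d_model
        self.proj = nn.Linear(feat_dim, k * d_model)

    def forward(self, img_feat):                    # img_feat: (B, feat_dim)
        return self.proj(img_feat).view(-1, self.k, self.d)

d_model, vocab = 512, 1000
lm_embed = nn.Embedding(vocab, d_model)
lm_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, 8, batch_first=True), 2)
for p in list(lm_embed.parameters()) + list(lm_body.parameters()):
    p.requires_grad_(False)                         # 语言模型保持冻结

prefix_net = VisualPrefix(d_model=d_model)          # 只训练视觉侧
img_feat = torch.randn(4, 2048)
tokens = torch.randint(0, vocab, (4, 16))
seq = torch.cat([prefix_net(img_feat), lm_embed(tokens)], dim=1)  # 前缀+文本
out = lm_body(seq)                                  # 交给冻结的LM打分/生成
print(out.shape)                                    # (4, 18, 512)
```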
半弱无监督|主动学习|不确定性(4篇)
【1】 A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning 标题:一种理论驱动的对比表征学习自标注求精方法
作者:Pan Zhou,Caiming Xiong,Xiao-Tong Yuan,Steven Hoi 机构:∗ Salesforce Research, † Nanjing University of Information Science & Technology 备注:under review. arXiv admin note: substantial text overlap with arXiv:1903.11680 by other authors 链接:https://arxiv.org/abs/2106.14749 摘要:对于一个图像查询,无监督对比学习把同一图像的裁剪标记为正样本,把其他图像的裁剪标记为负样本。这种原生的标签分配策略虽然直观,却无法揭示查询与其正负样本之间潜在的语义相似性,并会损害性能,因为某些负样本在语义上与查询相似,甚至与查询属于同一语义类。在这项工作中,我们首先证明:在对比学习中,不准确的标签分配会严重损害语义实例判别的泛化,而准确的标签则有益于泛化。受该理论启发,我们提出了一种新的对比学习自标注精炼方法,通过两个互补模块提高标签质量:(i)自标注精炼(SLR)生成准确标签;(ii)动量混合(MM)增强查询与其正样本之间的相似度。SLR使用查询的一个正样本来估计查询与其正、负样本之间的语义相似度,并把估计的相似度与对比学习中的原始标签分配相结合,以迭代方式生成更准确、信息更丰富的软标签。我们从理论上证明,SLR能够准确恢复标签受损数据的真实语义标签,并监督网络在分类任务上实现零预测误差。MM随机组合查询与正样本,以增大生成的虚拟查询与其正样本之间的语义相似度,从而提高标签准确性。在CIFAR10、ImageNet、VOC和COCO上的实验结果表明了该方法的有效性。PyTorch代码和模型将在网上发布。 摘要:For an image query, unsupervised contrastive learning labels crops of the same image as positives, and other image crops as negatives. Although intuitive, such a native label assignment strategy cannot reveal the underlying semantic similarity between a query and its positives and negatives, and impairs performance, since some negatives are semantically similar to the query or even share the same semantic class as the query. In this work, we first prove that for contrastive learning, inaccurate label assignment heavily impairs its generalization for semantic instance discrimination, while accurate labels benefit its generalization. Inspired by this theory, we propose a novel self-labeling refinement approach for contrastive learning. It improves the label quality via two complementary modules: (i) self-labeling refinery (SLR) to generate accurate labels and (ii) momentum mixup (MM) to enhance similarity between query and its positive. SLR uses a positive of a query to estimate semantic similarity between a query and its positive and negatives, and combines estimated similarity with vanilla label assignment in contrastive learning to iteratively generate more accurate and informative soft labels. We theoretically show that our SLR can exactly recover the true semantic labels of label-corrupted data, and supervises networks to achieve zero prediction error on classification tasks. MM randomly combines queries and positives to increase semantic similarity between the generated virtual queries and their positives so as to improves label accuracy. Experimental results on CIFAR10, ImageNet, VOC and COCO show the effectiveness of our method. PyTorch code and model will be released online.
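就摘要中的动量混合(MM)而言,其基本操作是随机凸组合查询与其正样本,并用相同系数混合软标签。下面给出一个最小示意(非论文官方实现;Beta系数以及"让虚拟查询偏向原查询"的处理均为演示假设):

```python
import torch

def momentum_mixup(q_img, pos_img, y_q, y_pos, alpha=1.0):
    """随机凸组合查询与正样本图像,软标签按相同系数混合(示意实现)。
    y_q, y_pos: 软标签向量(例如对键集合/伪类的分布)。"""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)              # 让虚拟查询偏向原查询(示意选择)
    x = lam * q_img + (1 - lam) * pos_img
    y = lam * y_q + (1 - lam) * y_pos    # 软标签随之混合
    return x, y

q, pos = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
yq = torch.eye(10)[torch.randint(0, 10, (8,))]
yp = yq.clone()                          # 正样本与查询共享标签的演示设定
x, y = momentum_mixup(q, pos, yq, yp)
print(x.shape, y.shape)                  # (8,3,32,32) (8,10)
```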
【2】 Unsupervised Discovery of Actions in Instructional Videos 标题:教学视频中动作的无监督发现
作者:AJ Piergiovanni,Anelia Angelova,Michael S. Ryoo,Irfan Essa 机构:Robotics at Google, Michael Ryoo, Google Research 备注:Full paper 链接:https://arxiv.org/abs/2106.14733 摘要:在本文中,我们解决了以无监督方式从教学视频中自动发现原子动作的问题。教学视频包含复杂的活动,是智能代理(如自主机器人或虚拟助理)的丰富信息源:例如,智能代理可以自动"读取"教学视频中的步骤并执行。然而,视频很少带有原子动作及其边界或持续时间的标注。我们提出了一种无监督方法,从各种教学视频中学习结构化人类任务的原子动作。具体而言,我们提出了一种用于视频时间分割的序列化随机自回归模型,它学习表示并发现任务中不同原子动作之间的顺序关系,并为视频提供自动、无监督的自标注。我们的方法大幅优于最先进的无监督方法。我们将开放源代码。 摘要:In this paper we address the problem of automatically discovering atomic actions in unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as, autonomous robots or virtual assistants, which can, for example, automatically `read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and which provides automatic and unsupervised self-labeling for videos. Our approach outperforms the state-of-the-art unsupervised methods with large margins. We will open source the code.
【3】 Semi-Supervised Deep Ensembles for Blind Image Quality Assessment 标题:基于半监督深度集成的盲图像质量评价
作者:Zhihua Wang,Dingquan Li,Kede Ma 机构:Department of Computer Science, City University of Hong Kong, Peng Cheng Laboratory 备注:6 pages, 1 figure, 5 tables 链接:https://arxiv.org/abs/2106.14008 摘要:如果基础学习器被认为是"准确的"且"多样的",集成方法通常被认为优于单一模型。在这里,我们研究了一种半监督集成学习策略,用以产生可泛化的盲图像质量评估(BIQA)模型。我们训练了一个用于质量预测的多头卷积网络:最大化集成(以及各基础学习器)在有标记数据上的准确性,同时最大化各学习器在未标记数据上的不一致性(即多样性),两者均通过保真度损失实现。我们进行了大量实验,证明了利用未标记数据进行BIQA的优势,特别是在模型泛化和失效识别方面。 摘要:Ensemble methods are generally regarded to be better than a single model if the base learners are deemed to be "accurate" and "diverse." Here we investigate a semi-supervised ensemble learning strategy to produce generalizable blind image quality assessment models. We train a multi-head convolutional network for quality prediction by maximizing the accuracy of the ensemble (as well as the base learners) on labeled data, and the disagreement (i.e., diversity) among them on unlabeled data, both implemented by the fidelity loss. We conduct extensive experiments to demonstrate the advantages of employing unlabeled data for BIQA, especially in model generalization and failure identification.
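摘要两处都用到的保真度(fidelity)损失,常见形式为 L = 1 - sqrt(p*p_hat) - sqrt((1-p)(1-p_hat))。下面给出一个最小示意(非论文官方实现;"以负的保真度损失鼓励不一致"只是对多样性项的一种演示性写法):

```python
import torch

def fidelity_loss(p_hat, p):
    """保真度损失:p 为二元事件的目标概率,p_hat 为模型给出的概率。"""
    eps = 1e-8
    return (1 - torch.sqrt(p * p_hat + eps)
              - torch.sqrt((1 - p) * (1 - p_hat) + eps)).mean()

# 有标记数据:让各头(及集成)判断"图A优于图B"的概率贴近人工标注
p_hat = torch.sigmoid(torch.randn(16, requires_grad=True))
p = (torch.rand(16) > 0.5).float()
acc_term = fidelity_loss(p_hat, p)

# 未标记数据:鼓励两个头的预测不一致(多样性),示意为最小化负的保真度损失
p1, p2 = torch.rand(16), torch.rand(16)
div_term = -fidelity_loss(p1, p2)
print(acc_term.item(), div_term.item())
```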
【4】 Semi-Supervised Raw-to-Raw Mapping 标题:半监督原始到原始映射
作者:Mahmoud Afifi,Abdullah Abuolaim 机构:Lassonde School of Engineering, York University, Toronto 链接:https://arxiv.org/abs/2106.13883 摘要:相机传感器的原始RGB(raw-RGB)颜色因不同传感器品牌和型号的光谱灵敏度差异而不同。本文研究不同传感器raw-RGB颜色空间之间的映射问题。以前的工作使用成对校准来解决这个问题,以实现精确的颜色映射。这种方法虽然精确,但实用性较差,因为它需要:(1)在每个新场景中放置颜色校准对象,并由两个相机设备各拍摄一幅图像构成图像对;(2)对颜色校准对象进行精确的图像对齐或手动标注。本文旨在通过一个更实际的设定来解决原始颜色空间中的映射问题。具体地说,我们提出了一种半监督raw-to-raw映射方法,它在一小组成对图像上训练,同时利用每个相机设备各自采集的一组未配对图像。通过大量实验,我们证明了该方法比其他域自适应替代方案以及单次校准方案都取得了更好的结果。作为这项工作的一部分,我们用两部不同的智能手机相机生成了一个新的原始图像数据集,其中包含用于半监督训练和评估的未配对集与配对集。 摘要:The raw-RGB colors of a camera sensor vary due to the spectral sensitivity differences across different sensor makes and models. This paper focuses on the task of mapping between different sensor raw-RGB color spaces. Prior work addressed this problem using a pairwise calibration to achieve accurate color mapping. Although being accurate, this approach is less practical as it requires: (1) capturing pair of images by both camera devices with a color calibration object placed in each new scene; (2) accurate image alignment or manual annotation of the color calibration object. This paper aims to tackle color mapping in the raw space through a more practical setup. Specifically, we present a semi-supervised raw-to-raw mapping method trained on a small set of paired images alongside an unpaired set of images captured by each camera device. Through extensive experiments, we show that our method achieves better results compared to other domain adaptation alternatives in addition to the single-calibration solution. We have generated a new dataset of raw images from two different smartphone cameras as part of this effort. Our dataset includes unpaired and paired sets for our semi-supervised training and evaluation.
Temporal | Action Recognition | Pose | Video | Motion Estimation (8 papers)
【1】 Real-Time Human Pose Estimation on a Smart Walker using Convolutional Neural Networks
Authors: Manuel Palermo, Sara Moccia, Lucia Migliorelli, Emanuele Frontoni, Cristina P. Santos Affiliations: University of Minho, Portugal; The BioRobotics Institute, Italy; Department of Excellence in Robotics and AI Notes: Accepted for publication in Expert Systems with Applications Link: https://arxiv.org/abs/2106.14739 Abstract: Rehabilitation is important to improve quality of life for mobility-impaired patients. Smart walkers are a commonly used solution that should embed automatic and objective tools for data-driven human-in-the-loop control and monitoring. However, present solutions focus on extracting few specific metrics from dedicated sensors with no unified full-body approach. We investigate a general, real-time, full-body pose estimation framework based on two RGB-D camera streams with non-overlapping views mounted on a smart walker equipment used in rehabilitation. Human keypoint estimation is performed using a two-stage neural network framework. The 2D-Stage implements a detection module that locates body keypoints in the 2D image frames. The 3D-Stage implements a regression module that lifts and relates the detected keypoints in both cameras to the 3D space relative to the walker. Model predictions are low-pass filtered to improve temporal consistency. A custom acquisition method was used to obtain a dataset, with 14 healthy subjects, used for training and evaluating the proposed framework offline, which was then deployed on the real walker equipment. An overall keypoint detection error of 3.73 pixels for the 2D-Stage and 44.05mm for the 3D-Stage were reported, with an inference time of 26.6ms when deployed on the constrained hardware of the walker. We present a novel approach to patient monitoring and data-driven human-in-the-loop control in the context of smart walkers. It is able to extract a complete and compact body representation in real-time and from inexpensive sensors, serving as a common base for downstream metrics extraction solutions, and Human-Robot interaction applications. Despite promising results, more data should be collected on users with impairments, to assess its performance as a rehabilitation tool in real-world scenarios.
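The temporal-consistency step is easy to picture in isolation. Below is a minimal sketch, assuming an exponential moving average stands in for whatever low-pass filter the authors actually use, of smoothing a stream of predicted 3D keypoints.

```python
import numpy as np

def smooth_keypoints(frames, alpha=0.3):
    """frames: (T, K, 3) array of per-frame 3D keypoints; alpha in (0, 1]."""
    smoothed = np.empty_like(frames)
    smoothed[0] = frames[0]
    for t in range(1, len(frames)):
        # Blend the new prediction with the filtered history.
        smoothed[t] = alpha * frames[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# Fake jittery track of 17 joints over 100 frames, just to exercise the filter.
noisy = np.cumsum(np.random.randn(100, 17, 3) * 5.0, axis=0)
clean = smooth_keypoints(noisy)
print(clean.shape)  # (100, 17, 3)
```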
【2】 Motion Projection Consistency Based 3D Human Pose Estimation with Virtual Bones from Monocular Videos
Authors: Guangming Wang, Honghao Zeng, Ziliang Wang, Zhe Liu, Hesheng Wang Affiliations: Institute of Medical Robotics, Shanghai Jiao Tong University; Key Laboratory of Marine Intelligent Equipment and System, Ministry of Education Notes: 10 pages, 7 figures, under review Link: https://arxiv.org/abs/2106.14706 Abstract: Real-time 3D human pose estimation is crucial for human-computer interaction. It is cheap and practical to estimate 3D human pose only from monocular video. However, recent bone splicing based 3D human pose estimation methods bring about the problem of cumulative error. In this paper, the concept of virtual bones is proposed to solve such a challenge. The virtual bones are imaginary bones between non-adjacent joints. They do not exist in reality, but they bring new loop constraints for the estimation of 3D human joints. The proposed network in this paper predicts real bones and virtual bones, simultaneously. The final length of real bones is constrained and learned by the loop constructed by the predicted real bones and virtual bones. Besides, the motion constraints of joints in consecutive frames are considered. The consistency between the 2D projected position displacement predicted by the network and the captured real 2D displacement by the camera is proposed as a new projection consistency loss for the learning of 3D human pose. The experiments on the Human3.6M dataset demonstrate the good performance of the proposed method. Ablation studies demonstrate the effectiveness of the proposed inter-frame projection consistency constraints and intra-frame loop constraints.
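The projection-consistency idea can be sketched compactly. Assuming a simple pinhole camera with illustrative intrinsics, the displacement of projected 3D joints between consecutive frames is penalized against the observed 2D displacement; the authors' full loss and camera model may differ.

```python
import torch

def project(points_3d, fx, fy, cx, cy):
    """points_3d: (..., 3) camera-space joints -> (..., 2) pixel coordinates."""
    x, y, z = points_3d.unbind(-1)
    return torch.stack((fx * x / z + cx, fy * y / z + cy), dim=-1)

def projection_consistency_loss(pose3d_t, pose3d_t1, kp2d_t, kp2d_t1,
                                fx=1000.0, fy=1000.0, cx=500.0, cy=500.0):
    # Displacement of the projected 3D estimate vs. the observed 2D keypoints.
    pred_disp = project(pose3d_t1, fx, fy, cx, cy) - project(pose3d_t, fx, fy, cx, cy)
    real_disp = kp2d_t1 - kp2d_t
    return ((pred_disp - real_disp) ** 2).mean()

p_t = torch.randn(4, 17, 3) + torch.tensor([0.0, 0.0, 5.0])  # keep z positive
p_t1 = p_t + 0.01 * torch.randn_like(p_t)
loss = projection_consistency_loss(p_t, p_t1,
                                   torch.randn(4, 17, 2), torch.randn(4, 17, 2))
```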
【3】 VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects
Authors: Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, Hao Dong Affiliations: CFCS, CS Dept., PKU; AIIT, PKU; Stanford University; Tencent AI Lab Link: https://arxiv.org/abs/2106.14440 Abstract: Perceiving and manipulating 3D articulated objects (e.g., cabinets, doors) in human environments is an important yet challenging task for future home-assistant robots. The space of 3D articulated objects is exceptionally rich in their myriad semantic categories, diverse shape geometry, and complicated part functionality. Previous works mostly abstract kinematic structure with estimated joint parameters and part poses as the visual representations for manipulating 3D articulated objects. In this paper, we propose object-centric actionable visual priors as a novel perception-interaction handshaking point such that the perception system outputs more actionable guidance than kinematic structure estimation, by predicting dense geometry-aware, interaction-aware, and task-aware visual action affordance and trajectory proposals. We design an interaction-for-perception framework VAT-Mart to learn such actionable visual representations by simultaneously training a curiosity-driven reinforcement learning policy exploring diverse interaction trajectories and a perception module summarizing and generalizing the explored knowledge for pointwise predictions among diverse shapes. Experiments prove the effectiveness of the proposed approach using the large-scale PartNet-Mobility dataset in the SAPIEN environment and show promising generalization capabilities to novel test shapes, unseen object categories, and real-world data. Project page: https://hyperplane-lab.github.io/vat-mart
【4】 DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation
Authors: Haitao Lin, Zichang Liu, Chilam Cheang, Lingwei Zhang, Yanwei Fu, Xiangyang Xue Affiliations: Fudan University Link: https://arxiv.org/abs/2106.14193 Abstract: We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image, without external pose-annotated real-world training data. While previous works exploit visual cues in RGB(D) images, our method makes inferences based on the rich geometric information of the object in the depth channel alone. Essentially, our framework explores such geometric information by learning the unified 3D Orientation-Consistent Representations (3D-OCR) module, further enforced by the property of the Geometry-constrained Reflection Symmetry (GeoReS) module. The magnitude information of object size and the center point is finally estimated by the Mirror-Paired Dimensional Estimation (MPDE) module. Extensive experiments on the category-level NOCS benchmark demonstrate that our framework competes with state-of-the-art approaches that require labeled real-world images. We also deploy our approach to a physical Baxter robot to perform manipulation tasks on unseen but category-known instances, and the results further validate the efficacy of our proposed model. Our videos are available in the supplementary material.
【5】 Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference
Authors: Riko Suzuki, Hitomi Yanaka, Koji Mineshima, Daisuke Bekki Affiliations: Ochanomizu University, Tokyo, Japan; The University of Tokyo, Tokyo, Japan; Keio University, Tokyo, Japan Notes: Accepted to MMSR I Link: https://arxiv.org/abs/2106.14137 Abstract: This paper introduces a new video-and-language dataset with human actions for multimodal logical inference, which focuses on intentional and aspectual expressions that describe dynamic human actions. The dataset consists of 200 videos, 5,554 action labels, and 1,942 action triplets that can be translated into logical semantic representations. The dataset is expected to be useful for evaluating multimodal inference systems between videos and semantically complicated sentences including negation and quantification.
【6】 Robust Pose Transfer with Dynamic Details using Neural Video Rendering
Authors: Yang-tian Sun, Hao-zhi Huang, Xuan Wang, Yu-kun Lai, Wei Liu, Lin Gao Notes: Video link: this https URL Link: https://arxiv.org/abs/2106.14132 Abstract: Pose transfer of human videos aims to generate a high fidelity video of a target person imitating actions of a source person. A few studies have made great progress either through image translation with deep latent features or neural rendering with explicit 3D features. However, both of them rely on large amounts of training data to generate realistic results, and the performance degrades on more accessible internet videos due to insufficient training frames. In this paper, we demonstrate that the dynamic details can be preserved even when trained from short monocular videos. Overall, we propose a neural video rendering framework coupled with an image-translation-based dynamic details generation network (D2G-Net), which fully utilizes both the stability of explicit 3D features and the capacity of learning components. To be specific, a novel texture representation is presented to encode both the static and pose-varying appearance characteristics, which is then mapped to the image space and rendered as a detail-rich frame in the neural rendering stage. Moreover, we introduce a concise temporal loss in the training stage to suppress the detail flickering that is made more visible due to high-quality dynamic details generated by our method. Through extensive comparisons, we demonstrate that our neural human video renderer is capable of achieving both clearer dynamic details and more robust performance even on accessible short videos with only 2k - 4k frames.
【7】 Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization
Authors: Anurag Bagchi, Jazib Mahmood, Dolton Fernandes, Ravi Kiran Sarvadevabhatla Affiliations: Centre for Visual Information Technology, International Institute of Information Technology Hyderabad (IIIT-H), Hyderabad, INDIA Link: https://arxiv.org/abs/2106.14118 Abstract: State of the art architectures for untrimmed video Temporal Action Localization (TAL) have only considered RGB and Flow modalities, leaving the information-rich audio modality totally unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider audio and video modalities for supervised TAL. We experimentally show that our schemes consistently improve performance for state of the art video-only TAL approaches. Specifically, they help achieve new state of the art performance on large-scale benchmark datasets - ActivityNet-1.3 (52.73 mAP@0.5) and THUMOS14 (57.18 mAP@0.5). Our experiments include ablations involving multiple fusion schemes, modality combinations and TAL architectures. Our code, models and associated data will be made available.
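Of the fusion schemes one could ablate in such work, the simplest to write down is score-level late fusion. The sketch below, with illustrative shapes and a single learned convex weight, is an assumption-laden stand-in rather than the paper's exact fusion module.

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.logit_w = nn.Parameter(torch.zeros(1))  # sigmoid -> weight in (0, 1)

    def forward(self, video_scores, audio_scores):
        # Convex combination of per-snippet action scores from the two streams.
        w = torch.sigmoid(self.logit_w)
        return w * video_scores + (1 - w) * audio_scores

fuse = ScoreFusion()
video = torch.rand(1, 100, 20)   # (batch, temporal snippets, action classes)
audio = torch.rand(1, 100, 20)
fused = fuse(video, audio)       # same shape; a TAL head would consume this
```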
【8】 Saying the Unseen: Video Descriptions via Dialog Agents
Authors: Ye Zhu, Yu Wu, Yi Yang, Yan Yan Affiliations: Department of Computer Science, Illinois Institute of Technology Notes: Accepted as a regular paper at TPAMI 2021 Link: https://arxiv.org/abs/2106.14069 Abstract: Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input, however, practical scenarios may often consist of situations where part of the visual information becomes inaccessible due to various reasons, e.g., restricted view with a fixed camera or intentional vision block for security concerns. As a step towards more practical application scenarios, we introduce a novel task that aims to describe a video using the natural language dialog between two agents as a supplementary information source given incomplete visual data. Different from most existing vision-language tasks where AI systems have full access to images or video clips, which may reveal sensitive information such as recognizable human faces or voices, we intentionally limit the visual input for AI systems and seek a more secure and transparent information medium, i.e., the natural language dialog, to supplement the missing visual information. Specifically, one of the intelligent agents - Q-BOT - is given two semantic segmented frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent who has access to the entire video, assists Q-BOT to accomplish the goal by answering the asked questions. We introduce two different experimental settings with either a generative (i.e., agents generate questions and answers freely) or a discriminative (i.e., agents select the questions and answers from candidates) internal dialog generation process. With the proposed unified QA-Cooperative networks, we experimentally demonstrate the knowledge transfer process between the two dialog agents and the effectiveness of using the natural language dialog as a supplement for incomplete implicit visions.
Medical (7 papers)
【1】 Tiled sparse coding in eigenspaces for the COVID-19 diagnosis in chest X-ray images
Authors: Juan E. Arco, Andrés Ortiz, Javier Ramírez, Juan M Gorriz Affiliations: Department of Signal Theory, Networking and Communications, Universidad de Granada; Department of Communications Engineering, Universidad de Malaga Notes: 14 pages, 5 figures Link: https://arxiv.org/abs/2106.14724 Abstract: The ongoing crisis of the COVID-19 (Coronavirus disease 2019) pandemic has changed the world. According to the World Health Organization (WHO), 4 million people have died due to this disease, whereas there have been more than 180 million confirmed cases of COVID-19. The collapse of the health system in many countries has demonstrated the need to develop tools to automatize the diagnosis of the disease from medical imaging. Previous studies have used deep learning for this purpose. However, the performance of this alternative highly depends on the size of the dataset employed for training the algorithm. In this work, we propose a classification framework based on sparse coding in order to identify the pneumonia patterns associated with different pathologies. Specifically, each chest X-ray (CXR) image is partitioned into different tiles. The most relevant features extracted from PCA are then used to build the dictionary within the sparse coding procedure. Once images are transformed and reconstructed from the elements of the dictionary, classification is performed from the reconstruction errors of individual patches associated with each image. Performance is evaluated in a real scenario with simultaneous differentiation between four different pathologies: control vs bacterial pneumonia vs viral pneumonia vs COVID-19. The accuracy when identifying the presence of pneumonia is 93.85%, whereas 88.11% is obtained in the 4-class classification context. The excellent results and the pioneering use of sparse coding in this scenario evidence the applicability of this approach as an aid for clinicians in a real-world environment.
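The classify-by-reconstruction-error recipe is straightforward to prototype. The following sketch, assuming per-class dictionaries built from PCA components and orthogonal matching pursuit for the sparse codes, mirrors the overall idea; tile extraction and the paper's exact settings are omitted, and all data here is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA, SparseCoder

def class_dictionaries(tiles_per_class, n_atoms=32):
    """Build one PCA-based dictionary per class from that class's tiles."""
    dicts = {}
    for label, tiles in tiles_per_class.items():   # tiles: (n_samples, d)
        pca = PCA(n_components=n_atoms).fit(tiles)
        dicts[label] = pca.components_             # (n_atoms, d), unit-norm rows
    return dicts

def classify(tile, dicts, k=5):
    """Assign the class whose dictionary reconstructs the tile best."""
    errors = {}
    for label, D in dicts.items():
        coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                            transform_n_nonzero_coefs=k)
        code = coder.transform(tile[None, :])      # (1, n_atoms)
        errors[label] = np.linalg.norm(tile - code @ D)
    return min(errors, key=errors.get)

rng = np.random.default_rng(0)
data = {"control": rng.normal(0, 1, (200, 64)), "covid": rng.normal(1, 1, (200, 64))}
dicts = class_dictionaries(data)
print(classify(rng.normal(1, 1, 64), dicts))
```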
【2】 Weighted multi-level deep learning analysis and framework for processing breast cancer WSIs
Authors: Peter Bokor, Lukas Hudec, Ondrej Fabian, Wanda Benesova Affiliations: Charles University and Thomayer University Hospital Notes: 9 pages, 12 images, 3 tables with results. We intend to submit this paper to a journal focused on computer methods/deep learning in biomedicine Link: https://arxiv.org/abs/2106.14708 Abstract: Prevention and early diagnosis of breast cancer (BC) is an essential prerequisite for the selection of proper treatment. The substantial pressure due to the increase of demand for faster and more precise diagnostic results drives for automatic solutions. In the past decade, deep learning techniques have demonstrated their power over several domains, and Computer-Aided Diagnosis (CAD) became one of them. However, when it comes to the analysis of Whole Slide Images (WSI), most of the existing works compute predictions from levels independently. This is, however, in contrast to the histopathologist expert approach, which requires seeing a global architecture of tissue structures important in BC classification. We present a deep learning-based solution and framework for processing WSI based on a novel approach utilizing the advantages of image levels. We apply the weighing of information extracted from several levels into the final classification of the malignancy. Our results demonstrate the profitability of global information with an increase of accuracy from 72.2% to 84.8%.
【3】 Benchmarking convolutional neural networks for diagnosing Lyme disease from images
Authors: Sk Imran Hossain, Jocelyn de Goër de Herve, Md Shahriar Hassan, Delphine Martineau, Evelina Petrosyan, Violaine Corbain, Jean Beytout, Isabelle Lebert, Elisabeth Baux, Céline Cazorla, Carole Eldin, Yves Hansmann, Solene Patrat-Delon, Thierry Prazuck, Alice Raffetin, Pierre Tattevin, Gwenaël Vourc'H, Olivier Lesens, Engelbert Mephu Nguifo Affiliations: Université Clermont Auvergne, CNRS, ENSMSE, LIMOS, F-63000 Clermont-Ferrand, France; Université Clermont Auvergne, INRAE, VetAgro Sup, UMR EPIA, Saint-Genès-Champanelle Link: https://arxiv.org/abs/2106.14465 Abstract: Lyme disease is one of the most common infectious vector-borne diseases in the world. In the early stage, the disease manifests itself in most cases with erythema migrans (EM) skin lesions. Better diagnosis of these early forms would allow improving the prognosis by preventing the transition to a severe late form thanks to appropriate antibiotic therapy. Recent studies show that convolutional neural networks (CNNs) perform very well to identify skin lesions from the image but, there is not much work for Lyme disease prediction from EM lesion images. The main objective of this study is to extensively analyze the effectiveness of CNNs for diagnosing Lyme disease from images and to find out the best CNN architecture for the purpose. There is no publicly available EM image dataset for Lyme disease prediction mainly because of privacy concerns. In this study, we utilized an EM dataset consisting of images collected from Clermont-Ferrand University Hospital Center (CF-CHU) of France and the internet. CF-CHU collected the images from several hospitals in France. This dataset was labeled by expert dermatologists and infectiologists from CF-CHU. First, we benchmarked this dataset for twenty-three well-known CNN architectures in terms of predictive performance metrics, computational complexity metrics, and statistical significance tests. Second, to improve the performance of the CNNs, we used transfer learning from ImageNet pre-trained models as well as pre-trained the CNNs with the skin lesion dataset "Human Against Machine with 10000 training images (HAM10000)". In that process, we searched for the best performing number of layers to unfreeze during transfer learning fine-tuning for each of the CNNs. Third, for model explainability, we utilized Gradient-weighted Class Activation Mapping to visualize the regions of input that are significant to the CNNs for making predictions. Fourth, we provided guidelines for model selection based on predictive performance and computational complexity. Our study confirmed the effectiveness and potential of even some lightweight CNNs to be used for Lyme disease pre-scanner mobile applications. We also made all the trained models publicly available at https://dappem.limos.fr/download.html, which can be used by others for transfer learning and building pre-scanners for Lyme disease.
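The layer-unfreezing search at the heart of the fine-tuning step can be expressed in a few lines. The sketch below, assuming torchvision's ResNet-18 and a binary EM/non-EM head, freezes all but the last n parameter tensors; the value of n is what such a study would tune per architecture.

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(n_unfrozen=10, n_classes=2):
    # torchvision >= 0.13 weights API; older versions use pretrained=True.
    model = models.resnet18(weights="IMAGENET1K_V1")
    model.fc = nn.Linear(model.fc.in_features, n_classes)  # EM vs. not-EM head
    params = list(model.parameters())
    for p in params[:-n_unfrozen]:       # freeze everything except the tail
        p.requires_grad = False
    return model

model = build_finetune_model(n_unfrozen=10)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```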
【4】 A 3D CNN Network with BERT For Automatic COVID-19 Diagnosis From CT-Scan Images
Authors: Weijun Tan, Jingfeng Liu Affiliations: Shenzhen Deepcam Information Technologies, Shenzhen, China Link: https://arxiv.org/abs/2106.14403 Abstract: We present an automatic COVID-19 diagnosis framework from lung CT-scan slice images. In this framework, the slice images of a CT-scan volume are first preprocessed using segmentation techniques to filter out images of closed lung, and to remove the useless background. Then a resampling method is used to select one or multiple sets of a fixed number of slice images for training and validation. A 3D CNN network with BERT is used to classify this set of selected slice images. In this network, an embedding feature is also extracted. In cases where there are more than one set of slice images in a volume, the features of all sets are extracted and pooled into a global feature vector for the whole CT-scan volume. A simple multiple-layer perceptron (MLP) network is used to further classify the aggregated feature vector. The models are trained and evaluated on the provided training and validation datasets. On the validation dataset, the accuracy is 0.9278 and the F1 score is 0.9261.
【5】 Learning stochastic object models from medical imaging measurements by use of advanced AmbientGANs
Authors: Weimin Zhou, Sayantan Bhadra, Frank J. Brooks, Hua Li, Mark A. Anastasio Notes: Submitted to IEEE Transactions on Medical Imaging. arXiv admin note: substantial text overlap with arXiv:2006.00033 Link: https://arxiv.org/abs/2106.14324 Abstract: In order to objectively assess new medical imaging technologies via computer-simulations, it is important to account for all sources of variability that contribute to image data. One important source of variability that can significantly limit observer performance is associated with the variability in the ensemble of objects to-be-imaged. This source of variability can be described by stochastic object models (SOMs), which are generative models that can be employed to sample from a distribution of to-be-virtually-imaged objects. It is generally desirable to establish SOMs from experimental imaging measurements acquired by use of a well-characterized imaging system, but this task has remained challenging. Deep generative neural networks, such as generative adversarial networks (GANs), hold potential for such tasks. To establish SOMs from imaging measurements, an AmbientGAN has been proposed that augments a GAN with a measurement operator. However, the original AmbientGAN could not immediately benefit from modern training procedures and GAN architectures, which limited its ability to be applied to realistically sized medical image data. To circumvent this, in this work, a modified AmbientGAN training strategy is proposed that is suitable for modern progressive or multi-resolution training approaches such as employed in the Progressive Growing of GANs and Style-based GANs. AmbientGANs established by use of the proposed training procedure are systematically validated in a controlled way by use of computer-simulated measurement data corresponding to a stylized imaging system. Finally, emulated single-coil experimental magnetic resonance imaging data are employed to demonstrate the methods under less stylized conditions.
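The core AmbientGAN mechanism, a discriminator that only ever sees simulated measurements of the generator's outputs, can be sketched as follows. The blur-plus-noise measurement operator, network shapes, and logistic losses here are toy stand-ins, not the paper's stylized imaging system or training strategy.

```python
import torch
import torch.nn.functional as F

def measure(images, noise_std=0.05):
    """Toy measurement operator: 5x5 average blur plus Gaussian noise."""
    kernel = torch.full((1, 1, 5, 5), 1.0 / 25)
    blurred = F.conv2d(images, kernel, padding=2)
    return blurred + noise_std * torch.randn_like(blurred)

def ambient_gan_losses(generator, discriminator, real_measurements, z):
    fake_measurements = measure(generator(z))   # D never sees clean objects
    d_fake = discriminator(fake_measurements.detach())  # detached for D step
    d_real = discriminator(real_measurements)
    d_loss = F.softplus(d_fake).mean() + F.softplus(-d_real).mean()
    g_loss = F.softplus(-discriminator(fake_measurements)).mean()
    return d_loss, g_loss

# Tiny stand-in networks, just to exercise the losses end to end.
g = torch.nn.Sequential(torch.nn.Linear(16, 28 * 28), torch.nn.Unflatten(1, (1, 28, 28)))
d = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 1))
real = measure(torch.rand(8, 1, 28, 28))        # stand-in "experimental" data
d_loss, g_loss = ambient_gan_losses(g, d, real, torch.randn(8, 16))
```

Because the generator is only ever matched to the measurement distribution through the known operator, its raw outputs are pushed toward the distribution of underlying objects, which is exactly what a SOM is supposed to sample from.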
【6】 Knee Osteoarthritis Severity Prediction using an Attentive Multi-Scale Deep Convolutional Neural Network
Authors: Rohit Kumar Jain, Prasen Kumar Sharma, Sibaji Gaj, Arijit Sur, Palash Ghosh Affiliations: Department of Mathematics, Indian Institute of Technology Guwahati; Duke-NUS Medical School Notes: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible Link: https://arxiv.org/abs/2106.14292 Abstract: Knee Osteoarthritis (OA) is a destructive joint disease identified by joint stiffness, pain, and functional disability concerning millions of lives across the globe. It is generally assessed by evaluating physical symptoms, medical history, and other joint screening tests like radiographs, Magnetic Resonance Imaging (MRI), and Computed Tomography (CT) scans. Unfortunately, the conventional methods are very subjective, which forms a barrier in detecting the disease progression at an early stage. This paper presents a deep learning-based framework, namely OsteoHRNet, that automatically assesses the Knee OA severity in terms of Kellgren and Lawrence (KL) grade classification from X-rays. As a primary novelty, the proposed approach is built upon one of the most recent deep models, called the High-Resolution Network (HRNet), to capture the multi-scale features of knee X-rays. In addition, we have also incorporated an attention mechanism to filter out the counterproductive features and boost the performance further. Our proposed model has achieved the best multiclass accuracy of 71.74% and MAE of 0.311 on the baseline cohort of the OAI dataset, which is a remarkable gain over the existing best-published works. We have also employed the Gradient-based Class Activation Maps (Grad-CAMs) visualization to justify the proposed network learning.
【7】 Using deep learning to detect patients at risk for prostate cancer despite benign biopsies
Authors: Boing Liu, Yinxi Wang, Philippe Weitz, Johan Lindberg, Lars Egevad, Henrik Grönberg, Martin Eklund, Mattias Rantalainen Affiliations: Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden; Department of Oncology-Pathology, Karolinska Institutet, Stockholm, Sweden Notes: 13 pages, 3 figures Link: https://arxiv.org/abs/2106.14256 Abstract: Background: Transrectal ultrasound guided systematic biopsies of the prostate is a routine procedure to establish a prostate cancer diagnosis. However, the 10-12 prostate core biopsies only sample a relatively small volume of the prostate, and tumour lesions in regions between biopsy cores can be missed, leading to a well-known low sensitivity to detect clinically relevant cancer. As a proof-of-principle, we developed and validated a deep convolutional neural network model to distinguish between morphological patterns in benign prostate biopsy whole slide images from men with and without established cancer. Methods: This study included 14,354 hematoxylin and eosin stained whole slide images from benign prostate biopsies from 1,508 men in two groups: men without an established prostate cancer (PCa) diagnosis and men with at least one core biopsy diagnosed with PCa. 80% of the participants were assigned as training data and used for model optimization (1,211 men), and the remaining 20% (297 men) as a held-out test set used to evaluate model performance. An ensemble of 10 deep convolutional neural network models was optimized for classification of biopsies from men with and without established cancer. Hyperparameter optimization and model selection were performed by cross-validation in the training data. Results: Area under the receiver operating characteristic curve (ROC-AUC) was estimated as 0.727 (bootstrap 95% CI: 0.708-0.745) on biopsy level and 0.738 (bootstrap 95% CI: 0.682-0.796) on man level. At a specificity of 0.9 the model had an estimated sensitivity of 0.348. Conclusion: The developed model has the ability to detect men with risk of missed PCa due to under-sampling of the prostate. The proposed model has the potential to reduce the number of false negative cases in routine systematic prostate biopsies and to indicate men who could benefit from MRI-guided re-biopsy.
GAN | Adversarial | Attacks | Generation (3 papers)
【1】 HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps
Authors: Lu Mi, Hang Zhao, Charlie Nash, Xiaohan Jin, Jiyang Gao, Chen Sun, Cordelia Schmid, Nir Shavit, Yuning Chai, Dragomir Anguelov Affiliations: Waymo, MIT, DeepMind, Google Link: https://arxiv.org/abs/2106.14880 Abstract: High Definition (HD) maps are maps with precise definitions of road lanes with rich semantics of the traffic rules. They are critical for several key stages in an autonomous driving system, including motion forecasting and planning. However, there are only a small amount of real-world road topologies and geometries, which significantly limits our ability to test out the self-driving stack to generalize onto new unseen scenarios. To address this issue, we introduce a new challenging task to generate HD maps. In this work, we explore several autoregressive models using different data representations, including sequence, plain graph, and hierarchical graph. We propose HDMapGen, a hierarchical graph generation model capable of producing high-quality and diverse HD maps through a coarse-to-fine approach. Experiments on the Argoverse dataset and an in-house dataset show that HDMapGen significantly outperforms baseline methods. Additionally, we demonstrate that HDMapGen achieves high scalability and efficiency.
【2】 CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders
Authors: Kevin Frans, L. B. Soros, Olaf Witkowski Affiliations: Cross Labs, Cross Compass Ltd., Tokyo, Japan; Massachusetts Institute of Technology, Cambridge, MA, USA; Earth-Life Science Institute, Tokyo Institute of Technology, Japan; College of Arts and Sciences, University of Tokyo, Japan Link: https://arxiv.org/abs/2106.14843
【3】 Speech2Properties2Gestures: Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech
Authors: Taras Kucherenko, Rajmund Nagy, Patrik Jonell, Michael Neff, Hedvig Kjellström, Gustav Eje Henter Notes: Accepted for publication at the ACM International Conference on Intelligent Virtual Agents (IVA 2021) Link: https://arxiv.org/abs/2106.14736 Abstract: We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-quality output. This empowers the approach to generate gestures that are both diverse and representational.
Autonomous Driving | Vehicles | Lane Detection (1 paper)
【1】 Identifying High Accuracy Regions in Traffic Camera Images to Enhance the Estimation of Road Traffic Metrics: A Quadtree Based Method
Authors: Yue Lin, Ningchuan Xiao Affiliations: Department of Geography, The Ohio State University, Columbus, Ohio; Center for Urban and Regional Analysis, The Ohio State University, Columbus, Ohio Link: https://arxiv.org/abs/2106.14049 Abstract: The growing number of real-time camera feeds in urban areas has made it possible to provide high-quality traffic data for effective transportation planning, operations, and management. However, deriving reliable traffic metrics from these camera feeds has been a challenge due to the limitations of current vehicle detection techniques, as well as the various camera conditions such as height and resolution. In this work, a quadtree based algorithm is developed to continuously partition the image extent until only regions with high detection accuracy are remained. These regions are referred to as the high-accuracy identification regions (HAIR) in this paper. We demonstrate how the use of the HAIR can improve the accuracy of traffic density estimates using images from traffic cameras at different heights and resolutions in Central Ohio. Our experiments show that the proposed algorithm can be used to derive robust HAIR where vehicle detection accuracy is 41 percent higher than that in the original image extent. The use of the HAIR also significantly improves the traffic density estimation with an overall decrease of 49 percent in root mean squared error.
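The recursive partitioning itself is compact. Below is a minimal sketch, with a toy accuracy oracle standing in for evaluating the vehicle detector against ground truth inside a region, of splitting the extent until only high-accuracy leaves (the HAIR) remain.

```python
def high_accuracy_regions(region, accuracy_of, threshold=0.8, min_size=64):
    """region: (x, y, w, h). Return leaves whose accuracy clears the threshold."""
    x, y, w, h = region
    if accuracy_of(region) >= threshold:
        return [region]                    # whole region is good enough; keep it
    if w < 2 * min_size or h < 2 * min_size:
        return []                          # too small to split further; drop it
    hw, hh = w // 2, h // 2                # quadtree split into four children
    children = [(x, y, hw, hh), (x + hw, y, w - hw, hh),
                (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh)]
    return [leaf for child in children
            for leaf in high_accuracy_regions(child, accuracy_of, threshold, min_size)]

# Toy oracle: large regions and regions far from the image center score lower.
def toy_accuracy(region):
    x, y, w, h = region
    cx, cy = x + w / 2, y + h / 2
    return max(0.0, 1.0 - (w * h) / (1280 * 720)
               - (abs(cx - 640) + abs(cy - 360)) / 1000.0)

hair = high_accuracy_regions((0, 0, 1280, 720), toy_accuracy)
print(len(hair), "high-accuracy leaves")
```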
Attention (1 paper)
【1】 Interflow: Aggregating Multi-layer Feature Mappings with Attention Mechanism
Authors: Zhicheng Cai Affiliations: School of Electronic Science and Engineering, Nanjing University, Nanjing, China Notes: 10 pages, 4 figures Link: https://arxiv.org/abs/2106.14073 Abstract: Traditionally, CNN models possess hierarchical structures and utilize the feature mapping of the last layer to obtain the prediction output. However, it can be difficult to settle the optimal network depth and make the middle layers learn distinguished features. This paper proposes the Interflow algorithm specially for traditional CNN models. Interflow divides CNNs into several stages according to the depth and makes predictions by the feature mappings in each stage. Subsequently, we input these prediction branches into a well-designed attention module, which learns the weights of these prediction branches, aggregates them and obtains the final output. Interflow weights and fuses the features learned in both shallower and deeper layers, making the feature information at each stage processed reasonably and effectively, enabling the middle layers to learn more distinguished features, and enhancing the model representation ability. In addition, Interflow can alleviate the gradient vanishing problem, lower the difficulty of network depth selection, and lighten a possible over-fitting problem by introducing the attention mechanism. Besides, it can avoid network degradation as a byproduct. Compared with the original model, the CNN model with Interflow achieves higher test accuracy on multiple benchmark datasets.
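A minimal sketch of the aggregation step follows, assuming a toy convolutional backbone and a softmax over learned per-branch weights in place of the paper's attention module; only the stage-wise branching and weighted fusion are meant to carry over.

```python
import torch
import torch.nn as nn

class Interflow(nn.Module):
    def __init__(self, n_stages=3, n_classes=10, width=16):
        super().__init__()
        self.stages = nn.ModuleList([nn.Sequential(
            nn.Conv2d(3 if i == 0 else width, width, 3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2)) for i in range(n_stages)])
        # One prediction branch per stage, each seeing that stage's features.
        self.branches = nn.ModuleList([nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, n_classes))
            for _ in range(n_stages)])
        self.attn = nn.Parameter(torch.zeros(n_stages))  # learned branch weights

    def forward(self, x):
        logits = []
        for stage, branch in zip(self.stages, self.branches):
            x = stage(x)
            logits.append(branch(x))
        stacked = torch.stack(logits, dim=1)             # (B, n_stages, C)
        w = torch.softmax(self.attn, dim=0)              # (n_stages,)
        return (w[None, :, None] * stacked).sum(dim=1)   # weighted fusion

out = Interflow()(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])
```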
Faces | Crowd Counting (3 papers)
【1】 A Diffeomorphic Aging Model for Adult Human Brain from Cross-Sectional Data
Authors: Alphin J Thottupattu, Jayanthi Sivaswamy, Venkateswaran P. Krishnan Affiliations: International Institute of Information Technology, Hyderabad, India; TIFR Centre for Applicable Mathematics, Bangalore, India Link: https://arxiv.org/abs/2106.14516 Abstract: Normative aging trends of the brain can serve as an important reference in the assessment of neurological structural disorders. Such models are typically developed from longitudinal brain image data -- follow-up data of the same subject over different time points. In practice, obtaining such longitudinal data is difficult. We propose a method to develop an aging model for a given population, in the absence of longitudinal data, by using images from different subjects at different time points, the so-called cross-sectional data. We define an aging model as a diffeomorphic deformation on a structural template derived from the data and propose a method that develops a topology-preserving aging model close to natural aging. The proposed model is successfully validated on two public cross-sectional datasets which provide templates constructed from different sets of subjects at different age points.
【2】 Darker than Black-Box: Face Reconstruction from Similarity Queries
Authors: Anton Razzhigaev, Klim Kireev, Igor Udovichenko, Aleksandr Petiushko Affiliations: Skolkovo Institute of Science and Technology, Moscow, Russia; Lomonosov Moscow State University Link: https://arxiv.org/abs/2106.14290 Abstract: Several methods for inversion of face recognition models were recently presented, attempting to reconstruct a face from deep templates. Although some of these approaches work in a black-box setup using only face embeddings, usually, on the end-user side, only similarity scores are provided. Therefore, these algorithms are inapplicable in such scenarios. We propose a novel approach that allows reconstructing the face querying only similarity scores of the black-box model. While our algorithm operates in a more general setup, experiments show that it is query efficient and outperforms the existing methods.
【3】 ShapeEditer: a StyleGAN Encoder for Face Swapping
Authors: Shuai Yang, Kai Qiao Affiliations: Henan Key Laboratory of Imaging and Intelligent Processing, PLA Strategic Support Force Information Engineering University, Zhengzhou, China Notes: 13 pages, 3 figures Link: https://arxiv.org/abs/2106.13984 Abstract: In this paper, we propose a novel encoder, called ShapeEditor, for high-resolution, realistic and high-fidelity face swapping. First of all, in order to ensure sufficient clarity and authenticity, our key idea is to use an advanced pretrained high-quality random face image generator, i.e. StyleGAN, as backbone. Secondly, we design ShapeEditor, a two-step encoder, to make the swapped face integrate the identity and attribute of the input faces. In the first step, we extract the identity vector of the source image and the attribute vector of the target image respectively; in the second step, we map the concatenation of identity vector and attribute vector into the $\mathcal{W}$ latent space. In addition, for learning to map into the latent space of StyleGAN, we propose a set of self-supervised loss functions with which the training data do not need to be labeled manually. Extensive experiments on the test dataset show that the results of our method not only have a great advantage in clarity and authenticity over other state-of-the-art methods, but also reflect the sufficient integration of identity and attribute.
Image/Video Retrieval | Re-ID (1 paper)
【1】 A Graph-based approach to derive the geodesic distance on Statistical manifolds: Application to Multimedia Information Retrieval
Authors: Zakariae Abbad, Ahmed Drissi El Maliani, Said Ouatik El Alaoui, Mohammed El Hassouni Affiliations: LISAC, FSDM, Sidi Mohamed Ben Abdellah University, Fez, Morocco; LRIT Rabat IT Center, Mohammed V University, Rabat, Morocco; Laboratory of Engineering Sciences, National School of Applied Sciences, Ibn Tofail University, Kenitra, Morocco; DESTEC, FLSHR, LRIT IT Center Link: https://arxiv.org/abs/2106.14060 Abstract: In this paper, we leverage the properties of non-Euclidean geometry to define the Geodesic distance (GD) on the space of statistical manifolds. The Geodesic distance is a real and intuitive similarity measure that is a good alternative to the purely statistical and extensively used Kullback-Leibler divergence (KLD). Despite the effectiveness of the GD, a closed form does not exist for many manifolds, since the geodesic equations are hard to solve. This explains why most studies have been content to use numerical approximations. Nevertheless, most of those do not take account of the manifold properties, which leads to a loss of information and thus to low performances. We propose an approximation of the Geodesic distance through a graph-based method. The latter represents the structure of the statistical manifold well and respects its geometrical properties. Our main aim is to compare the graph-based approximation to the state of the art approximations. Thus, the proposed approach is evaluated for two statistical manifolds, namely the Weibull manifold and the Gamma manifold, considering the Content-Based Texture Retrieval application on different databases.
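The graph construction reduces to a few library calls. The sketch below, using Euclidean edge weights as a simplification of the local manifold distance, connects k nearest neighbors and takes shortest paths as the geodesic approximation; on points sampled from a circle, the result approaches arc length.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def graph_geodesic(points, k=8):
    """points: (n, d) samples of manifold parameters -> (n, n) geodesic matrix."""
    graph = kneighbors_graph(points, n_neighbors=k, mode="distance")
    # Dijkstra over the kNN graph; edges treated as undirected.
    return shortest_path(graph, method="D", directed=False)

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # 1D manifold in R^2
geo = graph_geodesic(circle)
print(geo.shape, geo.max())  # arc-length-like distances, ~pi at the far side
```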
Representation Learning (1 paper)
【1】 Understanding Dynamics of Nonlinear Representation Learning and Its Application
Authors: Kenji Kawaguchi, Linjun Zhang, Zhun Deng Affiliations: Harvard University; Rutgers University Link: https://arxiv.org/abs/2106.14836 Abstract: Representations of the world environment play a crucial role in machine intelligence. It is often inefficient to conduct reasoning and inference directly in the space of raw sensory representations, such as pixel values of images. Representation learning allows us to automatically discover suitable representations from raw sensory data. For example, given raw sensory data, a multilayer perceptron learns nonlinear representations at its hidden layers, which are subsequently used for classification (or regression) at its output layer. This happens implicitly during training through minimizing a supervised or unsupervised loss. In this paper, we study the dynamics of such implicit nonlinear representation learning. We identify a pair of a new assumption and a novel condition, called the common model structure assumption and the data-architecture alignment condition. Under the common model structure assumption, the data-architecture alignment condition is shown to be sufficient for the global convergence and necessary for the global optimality. Our results provide practical guidance for designing a model structure: e.g., the common model structure assumption can be used as a justification for using a particular model structure instead of others. As an application, we then derive a new training framework, which satisfies the data-architecture alignment condition without assuming it by automatically modifying any given training algorithm dependently on each data and architecture. Given a standard training algorithm, the framework running its modified version is empirically shown to maintain competitive (practical) test performances while providing global convergence guarantees for ResNet-18 with convolutions, skip connections, and batch normalization with standard benchmark datasets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST and SVHN.
Visual Explanation | Video Understanding | VQA | Captioning (3 papers)
【1】 Contrastive Counterfactual Visual Explanations With Overdetermination
Authors: Adam White, Kwun Ho Ngan, James Phelan, Saman Sadeghi Afgeh, Kevin Ryan, Constantino Carlos Reyes-Aldasoro, Artur d'Avila Garcez Affiliations: School of Mathematics, Computer Science and Engineering, City, University of London, London, EC1V 0HB, UK Link: https://arxiv.org/abs/2106.14556 Abstract: A novel explainable AI method called CLEAR Image is introduced in this paper. CLEAR Image is based on the view that a satisfactory explanation should be contrastive, counterfactual and measurable. CLEAR Image explains an image's classification probability by contrasting the image with a corresponding image generated automatically via adversarial learning. This enables both salient segmentation and perturbations that faithfully determine each segment's importance. CLEAR Image was successfully applied to a medical imaging case study where it outperformed methods such as Grad-CAM and LIME by an average of 27% using a novel pointing game metric. CLEAR Image excels in identifying cases of "causal overdetermination" where there are multiple patches in an image, any one of which is sufficient by itself to cause the classification probability to be close to one.
【2】 Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs
Authors: Daniel Reich, Felix Putze, Tanja Schultz Affiliations: Cognitive Systems Lab, University of Bremen, Bremen, Germany Link: https://arxiv.org/abs/2106.14476 Abstract: With the expressed goal of improving system transparency and visual grounding in the reasoning process in VQA, we present a modular system for the task of compositional VQA based on scene graphs. Our system is called "Adventurer's Treasure Hunt" (or ATH), named after an analogy we draw between our model's search procedure for an answer and an adventurer's search for treasure. We developed ATH with three characteristic features in mind: 1. By design, ATH allows us to explicitly quantify the impact of each of the sub-components on overall VQA performance, as well as their performance on their individual sub-task. 2. By modeling the search task after a treasure hunt, ATH inherently produces an explicit, visually grounded inference path for the processed question. 3. ATH is the first GQA-trained VQA system that dynamically extracts answers by querying the visual knowledge base directly, instead of selecting one from a specially learned classifier's output distribution over a pre-fixed answer vocabulary. We report detailed results on all components and their contributions to overall VQA performance on the GQA dataset and show that ATH achieves the highest visual grounding score among all examined systems.
【3】 UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning
Authors: Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, Kyomin Jung Affiliations: Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea; Adobe Research, San Jose, CA, USA Notes: ACL 2021 Link: https://arxiv.org/abs/2106.14019 Abstract: Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce a new metric UMIC, an Unreferenced Metric for Image Captioning which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. Also, we observe critical problems of the previous benchmark dataset (i.e., human annotations) on image captioning metrics, and introduce a new collection of human annotations on the generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC has a higher correlation than all previous metrics that require multiple references. We release the benchmark dataset and pre-trained models to compute the UMIC.
Super-Resolution | Denoising | Deblurring | Dehazing (1 paper)
【1】 Blind Non-Uniform Motion Deblurring using Atrous Spatial Pyramid Deformable Convolution and Deblurring-Reblurring Consistency
Authors: Dong Huo, Abbas Masoumzadeh, Yee-Hong Yang Affiliations: Department of Computing Science, University of Alberta, Edmonton, Canada Notes: 8 pages, 8 figures Link: https://arxiv.org/abs/2106.14336 Abstract: Many deep learning based methods are designed to remove non-uniform (spatially variant) motion blur caused by object motion and camera shake without knowing the blur kernel. Some methods directly output the latent sharp image in one stage, while others utilize a multi-stage strategy (e.g., multi-scale, multi-patch, or multi-temporal) to gradually restore the sharp image. However, these methods have the following two main issues: 1) The computational cost of multi-stage is high; 2) The same convolution kernel is applied in different regions, which is not an ideal choice for non-uniform blur. Hence, non-uniform motion deblurring is still a challenging and open problem. In this paper, we propose a new architecture which consists of multiple Atrous Spatial Pyramid Deformable Convolution (ASPDC) modules to deblur an image end-to-end with more flexibility. Multiple ASPDC modules implicitly learn the pixel-specific motion with different dilation rates in the same layer to handle movements of different magnitude. To improve the training, we also propose a reblurring network to map the deblurred output back to the blurred input, which constrains the solution space. Our experimental results show that the proposed method outperforms state-of-the-art methods on the benchmark datasets.
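The atrous-pyramid half of the ASPDC module is easy to isolate. The sketch below runs parallel 3x3 convolutions with different dilation rates and fuses them with a 1x1 convolution; the deformable-offset part of the module is deliberately omitted, so this is an illustration, not the paper's block.

```python
import torch
import torch.nn as nn

class AtrousPyramid(nn.Module):
    def __init__(self, channels=32, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding == dilation keeps the spatial size for a 3x3 kernel.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        # Each branch sees motion at a different scale; fuse them channel-wise.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = AtrousPyramid()(torch.randn(1, 32, 64, 64))
print(y.shape)  # torch.Size([1, 32, 64, 64])
```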
Point Cloud | SLAM | Radar | LiDAR | Depth/RGB-D (1 paper)
【1】 Learning without Forgetting for 3D Point Cloud Objects 标题:三维点云对象的学习而不忘
作者:Townim Chowdhury,Mahira Jalisha,Ali Cheraghian,Shafin Rahman 机构: North South University, Dhaka, Bangladesh, Australian National University, Canberra, Australia, Data,-CSIRO, Australia 备注:Accepted in International Work-Conference on Artificial and Natural Neural Networks, (IWANN) 2021 链接:https://arxiv.org/abs/2106.14275 摘要:当我们为一组新的课程微调一个训练有素的深度学习模型时,网络学习新的概念,但逐渐忘记旧的训练知识。在一些实际应用中,我们可能会对学习新课程感兴趣,而不会忘记以前经验的能力。这种学习而不遗忘的问题往往是研究使用二维图像识别任务。在本文中,考虑到深度相机技术的发展,我们解决了同样的问题,为三维点云对象数据。由于大数据集和强大的预训练主干模型的不可用,这个问题在三维领域比二维领域更具挑战性。研究了基于三维数据的知识提取技术,以减少对先前训练的灾难性遗忘。此外,我们还利用对象类的语义词向量来改进提取过程。我们观察到,在训练中探索新旧知识的相互关系有助于学习新概念而不忘记旧概念。在三个3D点云识别主干(PointNet、DGCNN和PointConv)和合成(ModelNet40、ModelNet10)以及真实扫描(ScanObjectNN)数据集上进行实验,我们建立了新的3D数据学习基线结果。这项研究将推动这一领域今后的许多工作。 摘要:When we fine-tune a well-trained deep learning model for a new set of classes, the network learns new concepts but gradually forgets the knowledge of old training. In some real-life applications, we may be interested in learning new classes without forgetting the capability of previous experience. Such learning without forgetting problem is often investigated using 2D image recognition tasks. In this paper, considering the growth of depth camera technology, we address the same problem for the 3D point cloud object data. This problem becomes more challenging in the 3D domain than 2D because of the unavailability of large datasets and powerful pretrained backbone models. We investigate knowledge distillation techniques on 3D data to reduce catastrophic forgetting of the previous training. Moreover, we improve the distillation process by using semantic word vectors of object classes. We observe that exploring the interrelation of old and new knowledge during training helps to learn new concepts without forgetting old ones. Experimenting on three 3D point cloud recognition backbones (PointNet, DGCNN, and PointConv) and synthetic (ModelNet40, ModelNet10) and real scanned (ScanObjectNN) datasets, we establish new baseline results on learning without forgetting for 3D data. This research will instigate many future works in this area.
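A sketch of what a distillation objective augmented with semantic word vectors could look like; the exact loss form and the weights (T, alpha, beta) are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def lwf3d_loss(new_logits, old_logits, labels, feats, word_vecs,
               T=2.0, alpha=1.0, beta=0.1):
    # Cross-entropy on the current task.
    ce = F.cross_entropy(new_logits, labels)
    # Distillation: match the softened predictions of the frozen old model
    # on the old-class outputs (assumed to be the first k columns).
    k = old_logits.size(1)
    kd = F.kl_div(F.log_softmax(new_logits[:, :k] / T, dim=1),
                  F.softmax(old_logits / T, dim=1),
                  reduction="batchmean") * T * T
    # Semantic term (assumed form): pull point-cloud embeddings toward
    # the word vector of their class.
    sem = 1.0 - F.cosine_similarity(feats, word_vecs[labels]).mean()
    return ce + alpha * kd + beta * sem

# Toy shapes: 10 old + 5 new classes, 300-d word vectors (e.g., word2vec).
logits_new = torch.randn(4, 15, requires_grad=True)
logits_old = torch.randn(4, 10)               # from the frozen old model
feats = torch.randn(4, 300, requires_grad=True)
wv = torch.randn(15, 300)
loss = lwf3d_loss(logits_new, logits_old, torch.randint(0, 15, (4,)), feats, wv)
loss.backward()
```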
3D | 3D Reconstruction (4 papers)
【1】 Geometric Processing for Image-based 3D Object Modeling
Authors: Rongjun Qin, Xu Huang Affiliations: Department of Civil, Environmental and Geodetic Engineering; Department of Electrical and Computer Engineering; Translational Data Analytics Institute, The Ohio State University, Columbus, Ohio, USA Link: https://arxiv.org/abs/2106.14307 Abstract: Image-based 3D object modeling refers to the process of converting raw optical images to 3D digital representations of the objects. Very often, such models are desired to be dimensionally true and semantically labeled, with photorealistic appearance (reality-based modeling). Laser scanning was deemed the standard (and direct) way of obtaining highly accurate 3D measurements of objects, although one has to bear the high acquisition cost and its unavailability on some platforms. Nowadays, image-based methods, backed by recently developed advanced dense image matching algorithms and geo-referencing paradigms, are becoming the dominant approaches due to their high flexibility, availability, and low cost. The largely automated geometric processing of images in a 3D object reconstruction workflow, from ordered/unordered raw imagery to textured meshes, is becoming a critical part of reality-based 3D modeling. This article summarizes the overall geometric processing workflow, with a focus on introducing the state-of-the-art methods of its three major components: 1) geo-referencing; 2) dense image matching; 3) texture mapping. Finally, we draw conclusions and share our outlook on the topics discussed in this article.
【2】 3D Reconstruction through Fusion of Cross-View Images
Authors: Rongjun Qin, Shuang Song, Xiao Ling, Mostafa Elhashash Affiliations: Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, USA; Department of Electrical and Computer Engineering, The Ohio State University, USA Link: https://arxiv.org/abs/2106.14306 Abstract: 3D recovery from multi-stereo and stereo images, as an important application of image-based perspective geometry, serves many applications in computer vision, remote sensing, and geomatics. In this chapter, the authors utilize the imaging geometry and present approaches that perform 3D reconstruction from cross-view images that are drastically different in their viewpoints. We introduce our framework that takes ground-view images and satellite images for full 3D recovery, which includes the necessary methods for satellite and ground-based point cloud generation from images, 3D data co-registration, fusion, and mesh generation. We demonstrate our proposed framework on a dataset consisting of twelve satellite images and 150k video frames acquired through a vehicle-mounted GoPro camera, and present the reconstruction results. We have also compared our results with results generated from an intuitive processing pipeline that involves typical geo-registration and meshing methods.
【3】 Indoor Panorama Planar 3D Reconstruction via Divide and Conquer
Authors: Cheng Sun, Chi-Wei Hsiao, Ning-Hsu Wang, Min Sun, Hwann-Tzong Chen Link: https://arxiv.org/abs/2106.14166 Abstract: Indoor panoramas typically consist of human-made structures parallel or perpendicular to gravity. We leverage this phenomenon to approximate the scene in a 360-degree image with (H)orizontal-planes and (V)ertical-planes. To this end, we propose an effective divide-and-conquer strategy that divides pixels based on their plane orientation estimation; then, the succeeding instance segmentation module conquers the task of plane clustering more easily within each plane orientation group. Besides, the parameters of V-planes depend on camera yaw rotation, but translation-invariant CNNs are less aware of the yaw change. We thus propose a yaw-invariant V-planar reparameterization for CNNs to learn. We create a benchmark for indoor panorama planar reconstruction by extending existing 360 depth datasets with ground truth H&V-planes (referred to as the PanoH&V dataset) and adopt state-of-the-art planar reconstruction methods to predict H&V-planes as our baselines. Our method outperforms the baselines by a large margin on the proposed dataset.
【4】 Fully Steerable 3D Spherical Neurons
Authors: Pavlo Melnyk, Michael Felsberg, Mårten Wadenbäck Affiliations: Computer Vision Laboratory, Department of Electrical Engineering, Linköping University Link: https://arxiv.org/abs/2106.13863 Abstract: Emerging from low-level vision theory, steerable filters found their counterpart in deep learning. Earlier works used the steering theorems and presented convolutional networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of spherical decision surfaces and operates on point clouds. Due to the inherent geometric 3D structure of our theory, we derive a 3D steerability constraint for its atomic parts, the hypersphere neurons. Exploiting the rotational equivariance, we show how the model parameters are fully steerable at inference time. The proposed spherical filter banks enable equivariant and, after online optimization, invariant class predictions for known synthetic point sets in unknown orientations.
Other Neural Networks | Deep Learning | Models | Modeling (11 papers)
【1】 Explicit Clothing Modeling for an Animatable Full-Body Avatar
Authors: Donglai Xiang, Fabian Andres Prada, Timur Bagautdinov, Weipeng Xu, Yuan Dong, He Wen, Jessica Hodgins, Chenglei Wu Affiliations: Facebook Reality Labs Research, USA Link: https://arxiv.org/abs/2106.14879 Abstract: Recent work has shown great progress in building photorealistic animatable full-body codec avatars, but these avatars still face difficulties in generating high-fidelity animation of clothing. To address the difficulties, we propose a method to build an animatable clothed body avatar with an explicit representation of the clothing on the upper body from multi-view captured videos. We use a two-layer mesh representation to separately register the 3D scans with templates. In order to improve the photometric correspondence across different frames, texture alignment is then performed through inverse rendering of the clothing geometry and texture predicted by a variational autoencoder. We then train a new two-layer codec avatar with separate modeling of the upper clothing and the inner body layer. To learn the interaction between the body dynamics and clothing states, we use a temporal convolution network to predict the clothing latent code based on a sequence of input skeletal poses. We show photorealistic animation output for three different actors, and demonstrate the advantage of our clothed-body avatars over the single-layer avatars of previous work. We also show the benefit of an explicit clothing model, which allows the clothing texture to be edited in the animation output.
【2】 Fractal Pyramid Networks
Authors: Zhiqiang Deng, Huimin Yu, Yangqi Long Affiliations: Department of Information Science and Electronic Engineering, Zhejiang University, China; State Key Laboratory of CAD & CG, China Link: https://arxiv.org/abs/2106.14694 Abstract: We propose a new network architecture, the Fractal Pyramid Networks (PFNs), for pixel-wise prediction tasks as an alternative to the widely used encoder-decoder structure. In the encoder-decoder structure, the input is processed by an encoding-decoding pipeline that tries to obtain a semantic large-channel feature. Different from that, our proposed PFNs hold multiple information processing pathways and encode the information into multiple separate small-channel features. On the task of self-supervised monocular depth estimation, even without ImageNet pretraining, our models can compete with or outperform the state-of-the-art methods on the KITTI dataset with far fewer parameters. Moreover, the visual quality of the prediction is significantly improved. A semantic segmentation experiment provides evidence that the PFNs can be applied to other pixel-wise prediction tasks, and demonstrates that our models can capture more global structure information.
【3】 R2RNet: Low-light Image Enhancement via Real-low to Real-normal Network
Authors: Jiang Hai, Zhu Xuan, Ren Yang, Yutong Hao, Fengzhu Zou, Fang Lin, Songchen Han Affiliations: School of Aeronautics and Astronautics, Sichuan University Note: 9 pages, 6 figures Link: https://arxiv.org/abs/2106.14501 Abstract: Images captured in weak illumination conditions suffer serious quality degradation. Solving the series of degradations of low-light images can effectively improve the visual quality of the image and the performance of high-level visual tasks. In this paper, we propose a novel Real-low to Real-normal Network for low-light image enhancement, dubbed R2RNet, based on the Retinex theory, which includes three subnets: a Decom-Net, a Denoise-Net, and a Relight-Net. These three subnets are used for decomposition, denoising, and contrast enhancement, respectively. Unlike most previous methods trained on synthetic images, we collect the first Large-Scale Real-World paired low/normal-light image dataset (LSRW dataset) for training. Our method can properly improve the contrast and suppress noise simultaneously. Extensive experiments on publicly available datasets demonstrate that our method outperforms the existing state-of-the-art methods by a large margin both quantitatively and visually. We also show that the performance of a high-level visual task (i.e., face detection) can be effectively improved by using the enhanced results obtained by our method in low-light conditions. Our codes and the LSRW dataset are available at: https://github.com/abcdef2000/R2RNet.
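A toy sketch of the Retinex-style decomposition underlying a Decom-Net: predict reflectance R and illumination L under the reconstruction constraint I ≈ R·L. The layer configuration and the L1 reconstruction term are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DecomNet(nn.Module):
    """Toy Retinex decomposition net: predicts reflectance (3 channels)
    and illumination (1 channel) such that input ~= R * L."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 4, 3, padding=1), nn.Sigmoid())

    def forward(self, img):
        out = self.body(img)
        return out[:, :3], out[:, 3:]   # reflectance R, illumination L

img = torch.rand(1, 3, 64, 64)
R, L = DecomNet()(img)
recon_loss = torch.mean(torch.abs(R * L - img))  # reconstruction constraint
```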
【4】 Co^2L: Contrastive Continual Learning
Authors: Hyuntak Cha, Jaeho Lee, Jinwoo Shin Affiliations: Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea Note: 14 pages, 5 figures Link: https://arxiv.org/abs/2106.14413 Abstract: Recent breakthroughs in self-supervised learning show that such algorithms learn visual representations that can be transferred better to unseen tasks than joint-training methods relying on task-specific supervision. In this paper, we found that a similar phenomenon holds in the continual learning context: contrastively learned representations are more robust against catastrophic forgetting than jointly trained representations. Based on this novel observation, we propose a rehearsal-based continual learning algorithm that focuses on continually learning and maintaining transferable representations. More specifically, the proposed scheme (1) learns representations using the contrastive learning objective, and (2) preserves learned representations using a self-supervised distillation step. We conduct extensive experimental validation on popular benchmark image classification datasets, where our method sets new state-of-the-art performance.
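The self-supervised distillation step can be pictured as matching the similarity structure of current features to that of a frozen snapshot of the past model. A simplified sketch follows; the real method also handles the self-similarity diagonal and uses augmented views, which are omitted here as assumptions.

```python
import torch
import torch.nn.functional as F

def sim_matrix(z, tau):
    z = F.normalize(z, dim=1)
    return (z @ z.t()) / tau

def relation_distillation(z_new, z_old, tau=0.2):
    """Match the pairwise-similarity distribution of current features
    (z_new) to that of features from the frozen past model (z_old)."""
    p_old = F.softmax(sim_matrix(z_old, tau), dim=1)
    log_p_new = F.log_softmax(sim_matrix(z_new, tau), dim=1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean")

z_new = torch.randn(16, 128, requires_grad=True)
z_old = torch.randn(16, 128)   # features from the snapshot of the old model
loss = relation_distillation(z_new, z_old)
loss.backward()
```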
【5】 Learning Mesh Representations via Binary Space Partitioning Tree Networks
Authors: Zhiqin Chen, Andrea Tagliasacchi, Hao Zhang Affiliations: Simon Fraser University Note: Accepted to TPAMI; this is the journal version of BSP-Net (arXiv:1911.06971) Link: https://arxiv.org/abs/2106.14274 Abstract: Polygonal meshes are ubiquitous, but have only played a relatively minor role in the deep learning revolution. State-of-the-art neural generative models for 3D shapes learn implicit functions and generate meshes via expensive iso-surfacing. We overcome these challenges by employing a classical spatial data structure from computer graphics, Binary Space Partitioning (BSP), to facilitate 3D learning. The core operation of BSP involves recursive subdivision of 3D space to obtain convex sets. By exploiting this property, we devise BSP-Net, a network that learns to represent a 3D shape via convex decomposition without supervision. The network is trained to reconstruct a shape using a set of convexes obtained from a BSP-tree built over a set of planes, where the planes and convexes are both defined by learned network weights. BSP-Net directly outputs polygonal meshes from the inferred convexes. The generated meshes are watertight, compact (i.e., low-poly), and well suited to represent sharp geometry. We show that the reconstruction quality of BSP-Net is competitive with those of state-of-the-art methods while using far fewer primitives. We also explore variations of BSP-Net, including using a more generic decoder for reconstruction, more general primitives than planes, as well as training a generative model with variational auto-encoders. Code is available at https://github.com/czq142857/BSP-NET-original.
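The core BSP evaluation is easy to state: a point lies inside a convex if it is on the negative side of all of that convex's planes, and inside the shape if it is inside any convex. A NumPy sketch follows; the plane grouping here is hand-specified, whereas BSP-Net learns both the planes and the grouping.

```python
import numpy as np

def bsp_inside(points, planes, groups):
    # points: (N, 3); planes: (P, 4) rows [a, b, c, d] for ax + by + cz + d;
    # groups: list of index arrays, one per convex.
    h = points @ planes[:, :3].T + planes[:, 3]    # signed distances (N, P)
    # max over a convex's planes: <= 0 iff inside that convex
    convex = np.stack([h[:, g].max(axis=1) for g in groups], axis=1)
    return convex.min(axis=1) <= 0                 # union of convexes

# Unit cube as a single convex built from its 6 half-spaces.
planes = np.array([[ 1, 0, 0, -1], [-1, 0, 0, 0], [0,  1, 0, -1],
                   [0, -1, 0, 0], [0, 0,  1, -1], [0, 0, -1, 0]], float)
pts = np.array([[0.5, 0.5, 0.5], [2.0, 0.0, 0.0]])
print(bsp_inside(pts, planes, [np.arange(6)]))     # [ True False]
```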
【6】 Learning to solve geometric construction problems from images
Authors: J. Macke, J. Sedlar, M. Olsak, J. Urban, J. Sivic Affiliations: Charles University in Prague, Czech Republic; Czech Technical University in Prague; University of Innsbruck Note: 16 pages, 7 figures, 3 tables Link: https://arxiv.org/abs/2106.14195 Abstract: We describe a purely image-based method for finding geometric constructions with a ruler and compass in the Euclidea geometric game. The method is based on adapting the state-of-the-art Mask R-CNN image processing neural architecture and adding a tree-based search procedure to it. In a supervised setting, the method learns to solve all 68 kinds of geometric construction problems from the first six level packs of Euclidea with an average 92% accuracy. When evaluated on new kinds of problems, the method can solve 31 of the 68 kinds of Euclidea problems. We believe that this is the first time that purely image-based learning has been trained to solve geometric construction problems of this difficulty.
【7】 Mitigating severe over-parameterization in deep convolutional neural networks through forced feature abstraction and compression with an entropy-based heuristic
Authors: Nidhi Gowdra, Roopak Sinha, Stephen MacDonell, Wei Qi Yan Affiliations: School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, New Zealand Link: https://arxiv.org/abs/2106.14190 Abstract: Convolutional Neural Networks (CNNs) such as ResNet-50, DenseNet-40 and ResNeXt-56 are severely over-parameterized, necessitating a consequent increase in the computational resources required for model training, which scales exponentially with increments in model depth. In this paper, we propose an Entropy-Based Convolutional Layer Estimation (EBCLE) heuristic which is robust and simple, yet effective in resolving the problem of over-parameterization with regard to the network depth of CNN models. The EBCLE heuristic employs a priori knowledge of the entropic data distribution of input datasets to determine an upper bound for convolutional network depth, beyond which identity transformations are prevalent, offering insignificant contributions to enhancing model performance. Restricting depth redundancies by forcing feature compression and abstraction restricts over-parameterization while decreasing training time by 24.99%-78.59% without degradation in model performance. We present empirical evidence to emphasize the relative effectiveness of broader yet shallower models trained using the EBCLE heuristic, which maintain or outperform the baseline classification accuracies of narrower yet deeper models. The EBCLE heuristic is architecturally agnostic, and EBCLE-based CNN models restrict depth redundancies, resulting in enhanced utilization of the available computational resources. The proposed EBCLE heuristic is a compelling technique for researchers to analytically justify their hyperparameter (HP) choices for CNNs. Empirical validation of the EBCLE heuristic in training CNN models was established on five benchmark datasets (ImageNet32, CIFAR-10/100, STL-10, MNIST) and four network architectures (DenseNet, ResNet, ResNeXt and EfficientNet B0-B2), with appropriate statistical tests employed to support any conclusive claims presented in this paper.
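The prior the heuristic builds on is the Shannon entropy of the input data distribution. Here is a sketch of how such a dataset entropy could be computed; the mapping from this number to a depth upper bound is the paper's contribution and is not reproduced here.

```python
import numpy as np

def dataset_entropy(images, bins=256):
    """Shannon entropy (bits) of the pixel-intensity distribution
    pooled over a dataset of 8-bit images."""
    hist, _ = np.histogram(np.concatenate([im.ravel() for im in images]),
                           bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy data standing in for, e.g., CIFAR-10.
imgs = [np.random.randint(0, 256, (32, 32, 3)) for _ in range(10)]
print(f"dataset entropy: {dataset_entropy(imgs):.2f} bits (max 8.0)")
```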
【8】 The Story in Your Eyes: An Individual-difference-aware Model for Cross-person Gaze Estimation
Authors: Jun Bao, Buyu Liu, Jun Yu Affiliations: Hangzhou Dianzi University, Hangzhou, China; NEC Laboratories America, San Jose, USA Link: https://arxiv.org/abs/2106.14183 Abstract: We propose a novel method for refining the cross-person gaze prediction task with eye/face images only, by explicitly modelling person-specific differences. Specifically, we first assume that we can obtain initial gaze prediction results with an existing method, which we refer to as InitNet, and then introduce three modules: the Validity Module (VM), the Self-Calibration (SC) module, and the Person-specific Transform (PT) module. By predicting the reliability of the current eye/face images, our VM is able to identify invalid samples, e.g., eye-blinking images, and reduce their effects in our modelling process. Our SC and PT modules then learn to compensate for the differences on valid samples only. The former models the translation offsets by bridging the gap between initial predictions and the dataset-wise distribution, and the latter learns a more general person-specific transformation by incorporating information from existing initial predictions of the same person. We validate our ideas on three publicly available datasets, EVE, XGaze and MPIIGaze, and demonstrate that our proposed method outperforms the SOTA methods significantly on all of them, with 21.7%, 36.0% and 32.9% relative performance improvements, respectively. We won the GAZE 2021 Competition on the EVE dataset. Our code can be found at https://github.com/bjj9/EVE_SCPT.
【9】 Visual Conceptual Blending with Large-scale Language and Vision Models
Authors: Songwei Ge, Devi Parikh Affiliations: University of Maryland; Facebook AI Research & Georgia Tech Link: https://arxiv.org/abs/2106.14127 Abstract: We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations demonstrate the superiority of language models over classical methods for conceptual blending, and of recent large-scale image generation models over prior models for the visual depiction.
【10】 Descriptive Modeling of Textiles using FE Simulations and Deep Learning
Authors: Arturo Mendoza, Roger Trullo, Yanneck Wielhorski Affiliations: Safran Tech, Rue des Jeunes Bois, Magny les Hameaux, France; Safran Aircraft Engines, Rond Point Réné Ravaud-Réau, France; LMT (ENS Paris-Saclay, CNRS, Univ. Paris-Saclay), Avenue des Sciences, Gif-sur-Yvette, France Note: Preprint submitted to Composites Science and Technology. A. Mendoza and R. Trullo contributed equally to this work Link: https://arxiv.org/abs/2106.13982 Abstract: In this work we propose a novel and fully automated method for extracting the yarn geometrical features in woven composites so that a direct parametrization of the textile reinforcement is achieved (e.g., an FE mesh). Thus, our aim is not only to perform yarn segmentation from tomographic images but rather to provide a complete descriptive model of the fabric. As such, this direct approach improves on previous methods that use voxel-wise masks as intermediate representations followed by re-meshing operations (yarn envelope estimation). The proposed approach employs two deep neural network architectures (U-Net and Mask R-CNN). First, we train the U-Net to generate synthetic CT images from the corresponding FE simulations. This allows generating large quantities of annotated data without requiring costly manual annotations. This data is then used to train the Mask R-CNN, which is focused on predicting contour points around each of the yarns in the image. Experimental results show that our method is accurate and robust for performing yarn instance segmentation on CT images, which is further validated by quantitative and qualitative analyses.
【11】 The Deep Neural Network based Photometry Framework for Wide Field Small Aperture Telescopes
Authors: Peng Jia, Yongyang Sun, Qiang Liu Affiliations: College of Physics and Optoelectronics, Taiyuan University of Technology, Taiyuan, China; Department of Physics, Durham University, UK Note: Submitted to MNRAS; comments welcome. Complete code and data can be downloaded from this https URL Link: https://arxiv.org/abs/2106.14349 Abstract: Wide field small aperture telescopes (WFSATs) are mainly used to obtain scientific information about point-like and streak-like celestial objects. However, the quality of images obtained by WFSATs is seriously affected by background noise and variable point spread functions. Developing high-speed and high-efficiency data processing methods is of great importance for further scientific research. In recent years, deep neural networks have been proposed for the detection and classification of celestial objects and have shown better performance than classical methods. In this paper, we further extend the abilities of a deep neural network based astronomical target detection framework to make it suitable for photometry and astrometry. We add new branches to the deep neural network to obtain the types, magnitudes and positions of different celestial objects at the same time. Tested with simulated data, we find that our neural network has better performance in photometry than classical methods. Because photometry and astrometry are regression tasks, which obtain high-accuracy measurements instead of rough classification results, their accuracy is affected by different observation conditions. To solve this problem, we further propose to use reference stars to train our deep neural network with a transfer learning strategy when observation conditions change. The photometry framework proposed in this paper could be used as an end-to-end quick data processing framework for WFSATs, which can further increase the response speed and scientific output of WFSATs.
Other (16 papers)
【1】 Rethinking Token-Mixing MLP for MLP-based Vision Backbone
Authors: Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li Affiliations: Cognitive Computing Lab, Baidu Research, Bellevue, Washington, USA; Beijing, China Link: https://arxiv.org/abs/2106.14882 Abstract: In the past decade, we have witnessed rapid progress in the machine vision backbone. By introducing the inductive bias from image processing, the convolutional neural network (CNN) has achieved excellent performance in numerous computer vision tasks and has been established as the de facto backbone. In recent years, inspired by the great success achieved by the Transformer in NLP tasks, vision Transformer models emerged. Using much less inductive bias, they have achieved promising performance in computer vision tasks compared with their CNN counterparts. More recently, researchers have investigated using the pure-MLP architecture to build the vision backbone to further reduce the inductive bias, achieving good performance. The pure-MLP backbone is built upon channel-mixing MLPs to fuse the channels and token-mixing MLPs for communication between patches. In this paper, we re-think the design of the token-mixing MLP. We discover that the token-mixing MLPs in existing MLP-based backbones are spatial-specific and thus sensitive to spatial translation. Meanwhile, the channel-agnostic property of existing token-mixing MLPs limits their capability in mixing tokens. To overcome these limitations, we propose an improved structure termed the Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific. It takes fewer parameters but achieves higher classification accuracy on the ImageNet1K benchmark.
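A circulant token-mixing matrix is equivalent to circular convolution along the token axis, which is what makes it spatial-invariant. Here is a sketch via FFT with one learned kernel per channel; the shapes and the FFT realization are illustrative assumptions, not the paper's implementation.

```python
import torch

def ccs_token_mixing(x, w):
    # x: (B, N, C) tokens; w: (C, N) generating column of each channel's
    # circulant matrix. Multiplying by a circulant matrix over tokens
    # equals circular convolution, computed here via FFT along tokens.
    X = torch.fft.fft(x, dim=1)                    # (B, N, C)
    W = torch.fft.fft(w.t(), dim=0)                # (N, C)
    return torch.fft.ifft(X * W.unsqueeze(0), dim=1).real

x = torch.randn(2, 16, 8)        # 16 tokens, 8 channels
w = torch.randn(8, 16)           # one length-16 kernel per channel
y = ccs_token_mixing(x, w)
# Spatial invariance: circularly shifting the tokens shifts the output identically.
assert torch.allclose(ccs_token_mixing(torch.roll(x, 3, dims=1), w),
                      torch.roll(y, 3, dims=1), atol=1e-4)
```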
【2】 Dataset Bias Mitigation Through Analysis of CNN Training Scores
Authors: Ekberjan Derman Affiliations: Department of Computer Science, Sabancı University, Tuzla, Istanbul, Turkey Note: 12 pages, 11 figures Link: https://arxiv.org/abs/2106.14829
【3】 Privacy-Preserving Image Acquisition Using Trainable Optical Kernel
Authors: Yamin Sepehri, Pedram Pad, Pascal Frossard, L. Andrea Dunbar Affiliations: Centre Suisse d'Electronique et de Microtechnique (CSEM); École polytechnique fédérale de Lausanne (EPFL) Note: 9 pages, 9 figures Link: https://arxiv.org/abs/2106.14577 Abstract: Preserving privacy is a growing concern in our society, where sensors and cameras are ubiquitous. In this work, for the first time, we propose a trainable image acquisition method that removes the sensitive identity-revealing information in the optical domain before it reaches the image sensor. The method benefits from a trainable optical convolution kernel which transmits the desired information while filtering out the sensitive content. As the sensitive content is suppressed before it reaches the image sensor, it does not enter the digital domain and therefore is unretrievable by any sort of privacy attack. This is in contrast with current digital privacy-preserving methods, which are all vulnerable to direct-access attacks. Also, in contrast with previous optical privacy-preserving methods that cannot be trained, our method is data-driven and optimized for the specific application at hand. Moreover, there is no additional computation, memory, or power burden on the acquisition system, since this processing happens passively in the optical domain; it can even be used together with, and on top of, fully digital privacy-preserving systems. The proposed approach is adaptable to different digital neural networks and content. We demonstrate it for several scenarios, such as smile detection as the desired attribute while gender is filtered out as the sensitive content. We trained the optical kernel in conjunction with two adversarial neural networks, where the analysis network tries to detect the desired attribute and the adversarial network tries to detect the sensitive content. We show that this method can reduce 65.1% of the sensitive content when it is chosen to be gender, while losing only 7.3% of the desired content. Moreover, we reconstruct the original faces using a deep reconstruction method, which confirms the ineffectiveness of reconstruction attacks in obtaining the sensitive content.
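The min-max training described above can be sketched as follows. The toy heads, image size, and loss weighting are assumptions, and a real system would additionally model the physical constraints of an optical kernel.

```python
import torch
import torch.nn as nn

# The "optical layer" is a single trainable conv kernel applied before the
# sensor; a utility head should still detect the desired attribute while an
# adversary head should fail on the sensitive one.
optical = nn.Conv2d(1, 1, 11, padding=5, bias=False)
utility = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 2))    # e.g., smile
adversary = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 2))  # e.g., gender

opt_main = torch.optim.Adam(list(optical.parameters()) + list(utility.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

imgs = torch.rand(16, 1, 28, 28)                     # toy batch
y_util = torch.randint(0, 2, (16,))
y_sens = torch.randint(0, 2, (16,))

# Step 1: the adversary learns to recover the sensitive attribute.
opt_adv.zero_grad()
ce(adversary(optical(imgs).detach()), y_sens).backward()
opt_adv.step()

# Step 2: kernel + utility head: keep desired info, suppress sensitive info.
opt_main.zero_grad()
z = optical(imgs)
loss = ce(utility(z), y_util) - ce(adversary(z), y_sens)
loss.backward()
opt_main.step()
```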
【4】 FreeTickets: Accurate, Robust and Efficient Deep Ensemble by Training with Dynamic Sparsity
Authors: Shiwei Liu, Tianlong Chen, Zahra Atashgahi, Xiaohan Chen, Ghada Sokar, Elena Mocanu, Mykola Pechenizkiy, Zhangyang Wang, Decebal Constantin Mocanu Affiliations: Eindhoven University of Technology; University of Texas at Austin; University of Twente Note: preprint version Link: https://arxiv.org/abs/2106.14568 Abstract: Recent works on sparse neural networks have demonstrated that it is possible to train a sparse network in isolation to match the performance of the corresponding dense network with a fraction of the parameters. However, the identification of these performant sparse neural networks (winning tickets) either involves a costly iterative train-prune-retrain process (e.g., the Lottery Ticket Hypothesis) or an over-extended sparse training time (e.g., Training with Dynamic Sparsity), both of which raise financial and environmental concerns. In this work, we attempt to address this cost-reduction problem by introducing the FreeTickets concept, as the first solution which can boost the performance of sparse convolutional neural networks over their dense equivalents by a large margin, while using for complete training only a fraction of the computational resources required by the latter. Concretely, we instantiate the FreeTickets concept by proposing two novel efficient ensemble methods with dynamic sparsity, which yield in one shot many diverse and accurate tickets "for free" during the sparse training process. The combination of these free tickets into an ensemble demonstrates a significant improvement in accuracy, uncertainty estimation, robustness, and efficiency over the corresponding dense (ensemble) networks. Our results provide new insights into the strength of sparse neural networks and suggest that the benefits of sparsity go way beyond the usual training/inference efficiency. We will release all code at https://github.com/Shiweiliuiiiiiii/FreeTickets.
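The "free tickets" come from snapshotting the model as dynamic sparse training rewires its connectivity. Below is a toy prune-and-grow step in the style of dynamic sparse training; the paper's exact drop/grow criteria and schedule may differ, so treat this purely as a sketch.

```python
import torch

def prune_and_grow(w, mask, frac=0.3):
    # Drop the smallest-magnitude active weights, then regrow the same
    # number of connections at random inactive positions.
    k = max(1, int(frac * int(mask.sum())))
    thresh = w[mask.bool()].abs().kthvalue(k).values
    drop = mask.bool() & (w.abs() <= thresh)
    mask[drop] = 0
    idle = (mask == 0).nonzero(as_tuple=False)
    grow = idle[torch.randperm(idle.size(0))[:int(drop.sum())]]
    mask[grow[:, 0], grow[:, 1]] = 1
    w.data[grow[:, 0], grow[:, 1]] = 0.0   # re-grown weights start at zero
    return mask

w = torch.randn(64, 64)
mask = (torch.rand(64, 64) < 0.1).float()  # ~90% sparse layer
tickets = []
for step in range(3):                      # stand-in for epochs of sparse training
    # ... gradient updates on (w * mask) would go here ...
    mask = prune_and_grow(w, mask)
    tickets.append((w.clone() * mask, mask.clone()))   # one "free ticket"
print(len(tickets), "tickets; sparsity:", 1 - mask.mean().item())
```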
【5】 Making Images Real Again: A Comprehensive Survey on Deep Image Composition
Authors: Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, Liqing Zhang Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University Link: https://arxiv.org/abs/2106.14490 Abstract: As a common image editing operation, image composition aims to cut the foreground from one image and paste it on another image, resulting in a composite image. However, there are many issues that can make composite images unrealistic. These issues can be summarized as the inconsistency between foreground and background, which includes appearance inconsistency (e.g., incompatible color and illumination) and geometry inconsistency (e.g., unreasonable size and location). Previous works on image composition target one or more of these issues. Since each individual issue is a complicated problem, there are research directions (e.g., image harmonization, object placement) that focus on only one issue. By putting all the efforts together, we can acquire realistic composite images. Sometimes, we expect the composite images to be not only realistic but also aesthetic, in which case aesthetic evaluation needs to be considered. In this survey, we summarize the datasets and methods for the above research directions. We also discuss the limitations and potential directions to facilitate future research on image composition. Finally, as a double-edged sword, image composition may also have a negative effect on our lives (e.g., fake news), and thus it is imperative to develop algorithms to fight against composite images. Datasets and codes for image composition are summarized at https://github.com/bcmi/Awesome-Image-Composition.
【6】 Prior-Induced Information Alignment for Image Matting
Authors: Yuhao Liu, Jiake Xie, Yu Qiao, Yong Tang, Xin Yang Affiliations: School of Computer Science and Technology, Dalian University of Technology Note: IEEE TMM Link: https://arxiv.org/abs/2106.14439 Abstract: Image matting is an ill-posed problem that aims to estimate the opacity of foreground pixels in an image. However, most existing deep learning-based methods still suffer from coarse-grained details. In general, these algorithms are incapable of felicitously distinguishing the degree of exploration between deterministic domains (certain FG and BG pixels) and undetermined domains (uncertain in-between pixels), or they inevitably lose information in the continuous sampling process, leading to sub-optimal results. In this paper, we propose a novel network named the Prior-Induced Information Alignment Matting Network (PIIAMatting), which can efficiently model the distinction of pixel-wise response maps and the correlation of layer-wise feature maps. It mainly consists of a Dynamic Gaussian Modulation mechanism (DGM) and an Information Alignment strategy (IA). Specifically, the DGM can dynamically acquire a pixel-wise domain response map learned from the prior distribution. The response map can present the relationship between the opacity variation and the convergence process during training. On the other hand, the IA comprises an Information Match Module (IMM) and an Information Aggregation Module (IAM), jointly scheduled to match and aggregate the adjacent layer-wise features adaptively. Besides, we also develop a Multi-Scale Refinement (MSR) module to integrate multi-scale receptive field information at the refinement stage to recover the fluctuating appearance details. Extensive quantitative and qualitative evaluations demonstrate that the proposed PIIAMatting performs favourably against state-of-the-art image matting methods on the Alphamatting.com, Composition-1K and Distinctions-646 datasets.
【7】 Representation Based Regression for Object Distance Estimation
Authors: Mete Ahishali, Mehmet Yamac, Serkan Kiranyaz, Moncef Gabbouj Affiliations: Tampere University; Department of Electrical Engineering, Qatar University Link: https://arxiv.org/abs/2106.14208 Abstract: In this study, we propose a novel approach to predict the distances of detected objects in an observed scene. The proposed approach modifies the recently proposed Convolutional Support Estimator Networks (CSENs). CSENs are designed to compute a direct mapping for the Support Estimation (SE) task in a representation-based classification problem. We further propose and demonstrate that representation-based methods (sparse or collaborative representation) can be used in well-designed regression problems. To the best of our knowledge, this is the first representation-based method proposed for performing a regression task by utilizing the modified CSENs; hence, we name this novel approach Representation-based Regression (RbR). The initial version of CSENs has a proxy mapping stage (i.e., a coarse estimation of the support set) that is required for the input. In this study, we improve the CSEN model by proposing Compressive Learning CSEN (CL-CSEN), which has the ability to jointly optimize the so-called proxy mapping stage along with the convolutional layers. The experimental evaluations using the KITTI 3D Object Detection distance estimation dataset show that the proposed method can achieve significantly improved distance estimation performance over all competing methods. Finally, the software implementations of the methods are publicly shared at https://github.com/meteahishali/CSENDistance.
【8】 DenseTNT: Waymo Open Dataset Motion Prediction Challenge 1st Place Solution
Authors: Junru Gu, Qiao Sun, Hang Zhao Affiliations: IIIS, Tsinghua University Link: https://arxiv.org/abs/2106.14160 Abstract: In autonomous driving, goal-based multi-trajectory prediction methods have recently been shown to be effective: they first score goal candidates, then select a final set of goals, and finally complete trajectories based on the selected goals. However, these methods usually involve goal predictions based on sparse predefined anchors. In this work, we propose an anchor-free model, named DenseTNT, which performs dense goal probability estimation for trajectory prediction. Our model achieves state-of-the-art performance and ranks 1st on the Waymo Open Dataset Motion Prediction Challenge.
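Dense goal probability estimation can be pictured as scoring a dense grid of candidate goal positions instead of a handful of sparse anchors. A toy sketch follows; the scoring head, grid resolution, and context encoding are assumptions for illustration only.

```python
import torch

def dense_goal_probs(scene_feat, grid_xy, head):
    # scene_feat: (D,) pooled context feature; grid_xy: (G, 2) dense candidates.
    # Score every candidate goal jointly with the scene context.
    inp = torch.cat([scene_feat.expand(len(grid_xy), -1), grid_xy], dim=1)
    return torch.softmax(head(inp).squeeze(-1), dim=0)   # (G,) probabilities

xs = torch.linspace(-50, 50, 100)                        # 1 m grid over 100 m
grid = torch.stack(torch.meshgrid(xs, xs, indexing="ij"), dim=-1).reshape(-1, 2)
head = torch.nn.Sequential(torch.nn.Linear(128 + 2, 64), torch.nn.ReLU(),
                           torch.nn.Linear(64, 1))
probs = dense_goal_probs(torch.randn(128), grid, head)
topk = grid[probs.topk(6).indices]   # candidate goals for trajectory completion
```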
【9】 Vision-driven Compliant Manipulation for Reliable, High-Precision Assembly Tasks
Authors: Andrew S. Morgan, Bowen Wen, Junchi Liang, Abdeslam Boularias, Aaron M. Dollar, Kostas Bekris Affiliations: Department of Mechanical Engineering and Materials Science, Yale University, USA; Department of Computer Science, Rutgers University, USA Link: https://arxiv.org/abs/2106.14070 Abstract: Highly constrained manipulation tasks continue to be challenging for autonomous robots, as they require high levels of precision, typically less than 1mm, which is often incompatible with what can be achieved by traditional perception systems. This paper demonstrates that the combination of state-of-the-art object tracking with passively adaptive mechanical hardware can be leveraged to complete precision manipulation tasks with tight, industrially relevant tolerances (0.25mm). The proposed control method closes the loop through vision by tracking the relative 6D pose of objects in the relevant workspace. It adjusts the control reference of both the compliant manipulator and the hand to complete object insertion tasks via within-hand manipulation. Contrary to previous efforts on insertion, our method does not require expensive force sensors, precision manipulators, or time-consuming online learning, which is data-hungry. Instead, this effort leverages mechanical compliance and utilizes an object-agnostic manipulation model of the hand learned offline, off-the-shelf motion planning, and an RGBD-based object tracker trained solely with synthetic data. These features allow the proposed system to easily generalize and transfer to new tasks and environments. This paper describes the system components in detail and showcases its efficacy with extensive experiments involving tight-tolerance peg-in-hole insertion tasks of various geometries as well as open-world constrained placement tasks.
【10】 Mining atmospheric data
Authors: Chaabane Djeraba, Jérôme Riedi Affiliations: Univ. Lille, CNRS, Centrale Lille, UMR CRIStAL, IRCICA, Villeneuve d'Ascq, France; ICARE, LOA, University of Lille, Villeneuve d'Ascq, France Note: 5 pages, 1 figure Link: https://arxiv.org/abs/2106.13992 Abstract: This paper overviews two interdependent issues important for mining remote sensing data (e.g., images) obtained from atmospheric monitoring missions. The first issue relates to building new public datasets and benchmarks, which are a high priority for the remote sensing community. The second issue is the investigation of deep learning methodologies for atmospheric data classification based on vast amounts of data without annotations, together with localized annotated data provided by sparse observing networks at the surface. The targeted application is air quality assessment and prediction. Air quality is defined as the pollution level linked with several atmospheric constituents such as gases and aerosols. There is a dependency between poor air quality, caused by air pollution, and public health. The target application is the development of a fast prediction model for local and regional air quality assessment and tracking. The results of mining such data will have significant implications for citizens and decision makers, by providing a fast, reliable air quality monitoring system able to cover the local and regional scale through intelligent extrapolation of sparse ground-based in situ measurement networks.
【11】 In-N-Out: Towards Good Initialization for Inpainting and Outpainting
Authors: Changho Jo, Woobin Im, Sung-Eui Yoon Affiliations: School of Computing, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea Note: 13 pages (9 pages without references), 7 figures Link: https://arxiv.org/abs/2106.13953 Abstract: In computer vision, recovering spatial information by filling in masked regions, e.g., inpainting, has been widely investigated for its usability and wide applicability to various other applications: image inpainting, image extrapolation, and environment map estimation. Most of them are studied separately depending on the application. Our focus, however, is on accommodating the opposite task, e.g., image outpainting, which would benefit the target application, e.g., image inpainting. Our self-supervision method, In-N-Out, is summarized as a training approach that leverages the knowledge of the opposite task in the target model. We empirically show that In-N-Out, which explores the complementary information, holds an effective advantage over traditional pipelines, where only task-specific learning takes place in training. In experiments, we compare our method to the traditional procedure and analyze its effectiveness on different applications: image inpainting, image extrapolation, and environment map estimation. For these tasks, we demonstrate that In-N-Out consistently improves the performance of recent works when In-N-Out self-supervision is added to their training procedure. We also show that our approach achieves better results than an existing training approach for outpainting.
【12】 Core Challenges in Embodied Vision-Language Planning
Authors: Jonathan Francis, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, Jean Oh Affiliations: Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA; Human-Machine Collaboration, Bosch Research Pittsburgh, PA, USA Note: 35 pages Link: https://arxiv.org/abs/2106.13948 Abstract: Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
【13】 Inverting and Understanding Object Detectors
Authors: Ang Cao, Justin Johnson Affiliations: University of Michigan Note: Preprint Link: https://arxiv.org/abs/2106.13933 Abstract: As a core problem in computer vision, the performance of object detection has improved drastically in the past few years. Despite their impressive performance, object detectors suffer from a lack of interpretability. Visualization techniques have been developed and widely applied to introspect the decisions made by other kinds of deep learning models; however, visualizing object detectors has been underexplored. In this paper, we propose using inversion as a primary tool to understand modern object detectors and develop an optimization-based approach to layout inversion, allowing us to generate synthetic images recognized by trained detectors as containing a desired configuration of objects. We reveal intriguing properties of detectors by applying our layout inversion technique to a variety of modern object detectors, and further investigate them via validation experiments: they rely on qualitatively different features for classification and regression; they learn canonical motifs of commonly co-occurring objects; they use different visual cues to recognize objects of varying sizes. We hope our insights can help practitioners improve object detectors.
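Optimization-based inversion treats the image as the free variable and ascends the detector's score for a desired layout. Here is a generic sketch with a stand-in differentiable "detector"; the score hook, the regularizer, and all hyperparameters are assumptions, not the paper's procedure.

```python
import torch

def invert_layout(detector_scores, target_boxes, target_labels,
                  shape=(1, 3, 224, 224), steps=200, lr=0.05):
    # detector_scores is a hypothetical differentiable hook returning the
    # classification score of `label` pooled over region `box`.
    img = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = sum(detector_scores(img, b, l)
                    for b, l in zip(target_boxes, target_labels))
        loss = -score + 1e-4 * img.pow(2).sum()   # maximize score + L2 prior
        loss.backward()
        opt.step()
    return img.detach()

# Toy stand-in "detector": a fixed conv per class, just to make the sketch
# executable; a real detector head would replace this.
conv = torch.nn.Conv2d(3, 10, 5, padding=2)
def detector_scores(img, box, label):
    x0, y0, x1, y1 = box
    return conv(img)[:, label, y0:y1, x0:x1].mean()

img = invert_layout(detector_scores, [(20, 20, 120, 120)], [3])
```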
【14】 CAMS: Color-Aware Multi-Style Transfer
Authors: Mahmoud Afifi, Abdullah Abuolaim, Mostafa Hussien, Marcus A. Brubaker, Michael S. Brown Affiliations: York University, Toronto, Ontario, Canada; École de technologie supérieure (ÉTS), Montreal, Quebec Link: https://arxiv.org/abs/2106.13920 Abstract: Image style transfer aims to manipulate the appearance of a source image, or "content" image, to share similar texture and colors of a target "style" image. Ideally, the style transfer manipulation should also preserve the semantic content of the source image. A commonly used approach to assist in transferring styles is based on Gram matrix optimization. One problem of Gram matrix-based optimization is that it does not consider the correlation between colors and their styles. Specifically, certain textures or structures should be associated with specific colors. This is particularly challenging when the target style image exhibits multiple style types. In this work, we propose a color-aware multi-style transfer method that generates aesthetically pleasing results while preserving the style-color correlation between style and generated images. We achieve this desired outcome by introducing a simple but efficient modification to classic Gram matrix-based style transfer optimization. A nice feature of our method is that it enables the users to manually select the color associations between the target style and content image for greater transfer flexibility. We validated our method with several qualitative comparisons, including a user study conducted with 30 participants. In comparison with prior work, our method is simple, easy to implement, and achieves visually appealing results when targeting images that have multiple styles. Source code is available at https://github.com/mahmoudnafifi/color-aware-style-transfer.
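Below is the classic Gram-matrix style loss, with the color-aware modification sketched as per-color-cluster Gram matching. How the soft color masks are built (e.g., clustering in a chroma space) is an assumption here, as is the exact way they enter the loss.

```python
import torch
import torch.nn.functional as F

def gram(f):
    # f: (C, H, W) feature map -> (C, C) Gram matrix.
    c, h, w = f.shape
    f = f.reshape(c, h * w)
    return (f @ f.t()) / (c * h * w)

def color_aware_style_loss(feat_gen, feat_style, masks_gen, masks_style):
    # One Gram matrix per color cluster (soft masks over pixels), matched
    # pairwise so each color keeps its associated textures.
    loss = 0.0
    for mg, ms in zip(masks_gen, masks_style):   # masks: (H, W) in [0, 1]
        loss = loss + F.mse_loss(gram(feat_gen * mg), gram(feat_style * ms))
    return loss

fg = torch.randn(64, 32, 32, requires_grad=True)
fs = torch.randn(64, 32, 32)
masks = [torch.rand(32, 32) for _ in range(3)]   # 3 color clusters
loss = color_aware_style_loss(fg, fs, masks, masks)
loss.backward()
```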
【15】 Self-paced Principal Component Analysis
Authors: Zhao Kang, Hongfei Liu, Jiangxin Li, Xiaofeng Zhu, Ling Tian
Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China
Link: https://arxiv.org/abs/2106.13880
Abstract: Principal Component Analysis (PCA) has been widely used for dimensionality reduction and feature extraction. Robust PCA (RPCA), under different robust distance metrics such as the l1-norm and the l2,p-norm, can deal with noise or outliers to some extent. However, real-world data may display structures that cannot be fully captured by these simple functions. In addition, existing methods treat complex and simple samples equally. By contrast, a learning pattern typically adopted by human beings is to learn from simple to complex and from less to more. Based on this principle, we propose a novel method called Self-paced PCA (SPCA) to further reduce the effect of noise and outliers. Notably, the complexity of each sample is calculated at the beginning of each iteration in order to integrate samples from simple to more complex into training. Based on an alternating optimization, SPCA finds an optimal projection matrix and filters out outliers iteratively. Theoretical analysis is presented to show the rationality of SPCA. Extensive experiments on popular data sets demonstrate that the proposed method improves considerably on state-of-the-art results.
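The alternating scheme in the abstract — score each sample's complexity, fit on the easy ones, then admit harder samples — can be sketched in a few lines of NumPy. The hard 0/1 weighting, reconstruction error as the complexity measure, and the quantile schedule are illustrative assumptions rather than the paper's exact self-paced regularizer.

```python
# Self-paced PCA sketch: alternate PCA fitting with easy-to-hard sample
# selection (hard 0/1 weights and a quantile schedule are assumptions).
import numpy as np

def self_paced_pca(X, n_components=2, n_iters=10, q0=0.5, q_step=0.1):
    """X: (n_samples, n_features), assumed mean-centered."""
    w = np.ones(len(X))          # start by treating every sample as easy
    q = q0                       # fraction of samples admitted per round
    for _ in range(n_iters):
        Xw = X * w[:, None]      # down-weight (here: drop) hard samples
        _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
        P = Vt[:n_components].T  # current projection matrix
        # "complexity" of each sample = its reconstruction error
        err = np.sum((X - X @ P @ P.T) ** 2, axis=1)
        w = (err <= np.quantile(err, min(q, 1.0))).astype(float)
        q += q_step              # gradually admit harder samples
    return P, w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
X[:10] += 20.0                   # inject a few gross outliers
X -= X.mean(axis=0)
P, w = self_paced_pca(X)         # the outliers should end up with w == 0
```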
【16】 Progressive Joint Low-light Enhancement and Noise Removal for Raw Images
Authors: Yucheng Lu, Seung-Won Jung
Affiliations: Y. Lu is with the Department of Multimedia Engineering, Dongguk University; S.-W. Jung is with the Department of Electrical Engineering, Korea University
Link: https://arxiv.org/abs/2106.14844
Abstract: Low-light imaging on mobile devices is typically challenging due to insufficient incident light coming through the relatively small aperture, resulting in a low signal-to-noise ratio. Most previous works on low-light image processing focus either on a single task, such as illumination adjustment, color enhancement, or noise removal, or on a joint illumination adjustment and denoising task that heavily relies on short-long exposure image pairs collected from specific camera models; these approaches are therefore less practical and generalizable in real-world settings where camera-specific joint enhancement and restoration is required. To tackle this problem, in this paper, we propose a low-light image processing framework that performs joint illumination adjustment, color enhancement, and denoising. Considering the difficulty of model-specific data collection and the ultra-high definition of the captured images, we design two branches: a coefficient estimation branch and a joint enhancement and denoising branch. The coefficient estimation branch works in a low-resolution space and predicts the coefficients for enhancement via bilateral learning, whereas the joint enhancement and denoising branch works in a full-resolution space and performs joint enhancement and denoising in a progressive manner. In contrast to existing methods, our framework does not need to recollect massive data when being adapted to another camera model, which significantly reduces the effort required to fine-tune our approach for practical usage. Through extensive experiments, we demonstrate its great potential in real-world low-light imaging applications when compared with current state-of-the-art methods.
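To make the two-branch design concrete, here is a heavily simplified sketch: the coefficient branch predicts per-pixel affine color transforms at low resolution, which are upsampled and applied at full resolution. The real system uses bilateral learning for the coefficients and a progressive joint enhancement/denoising branch; the plain bilinear upsampling, the tiny CNN, and the 3x3-matrix-plus-bias parameterization below are stand-in assumptions.

```python
# Two-branch sketch: low-res coefficient prediction, full-res application.
# The tiny CNN and plain bilinear upsampling are stand-ins for the paper's
# bilateral learning; the progressive denoising branch is omitted entirely.
import torch
import torch.nn.functional as F
from torch import nn

class CoefficientBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 12, 3, padding=1),  # per-pixel 3x3 matrix + bias
        )

    def forward(self, x_low):
        return self.net(x_low)

def apply_coefficients(x_full, coeffs_low):
    # Upsample low-res coefficients to full resolution, then apply the
    # per-pixel affine color transform: out = A(x) @ rgb + b(x).
    c = F.interpolate(coeffs_low, size=x_full.shape[-2:],
                      mode="bilinear", align_corners=False)
    A, b = c[:, :9], c[:, 9:]
    A = A.reshape(x_full.shape[0], 3, 3, *x_full.shape[-2:])
    return torch.einsum("bijhw,bjhw->bihw", A, x_full) + b

x_full = torch.rand(1, 3, 1024, 1024)   # full-resolution input
x_low = F.interpolate(x_full, size=(256, 256),
                      mode="bilinear", align_corners=False)
enhanced = apply_coefficients(x_full, CoefficientBranch()(x_low))
```

This split is what keeps the approach practical on ultra-high-definition inputs: the expensive network runs only at low resolution, while the full-resolution path is a cheap per-pixel transform.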