Computer Vision and Pattern Recognition arXiv Digest [12.10]

2021-12-10 17:13:39


cs.CV: 82 papers today

Transformer (8 papers)

【1】 BLT: Bidirectional Layout Transformer for Controllable Layout Generation Link: https://arxiv.org/abs/2112.05112

Authors: Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa Affiliations: Google; LTI, Carnegie Mellon University; Georgia Institute of Technology Note: 14 pages, under review

【2】 PE-former: Pose Estimation Transformer Link: https://arxiv.org/abs/2112.04981

Authors: Paschalis Panteleris, Antonis Argyros Affiliations: Institute of Computer Science, FORTH, Heraklion, Crete, Greece; Computer Science Department, University of Crete, Greece
Abstract: Vision transformer architectures have been demonstrated to work very effectively for image classification tasks. Efforts to solve more challenging vision tasks with transformers rely on convolutional backbones for feature extraction. In this paper we investigate the use of a pure transformer architecture (i.e., one with no CNN backbone) for the problem of 2D body pose estimation. We evaluate two ViT architectures on the COCO dataset. We demonstrate that using an encoder-decoder transformer architecture yields state-of-the-art results on this estimation problem.

【3】 A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer Link: https://arxiv.org/abs/2112.04888

Authors: Weijia Wu, Yuanqiang Cai, Debing Zhang, Sibo Wang, Zhuang Li, Jiahong Li, Yejun Tang, Hong Zhou Affiliations: Zhejiang University; Beijing University of Posts and Telecommunications; Kuaishou Technology
Abstract: Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). There are four features for BOVText. Firstly, we provide 2,000 videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos. Secondly, our dataset covers 30 open categories with a wide selection of various scenarios, e.g., Life Vlog, Driving, Movie, etc. Thirdly, abundant text types annotation (i.e., title, caption or scene text) are provided for the different representational meanings in video. Fourthly, the BOVText provides bilingual text annotation to promote multiple cultures live and communication. Besides, we propose an end-to-end video text spotting framework with Transformer, termed TransVTSpotter, which solves the multi-orient text spotting in video with a simple, but efficient attention-based query-key mechanism. It applies object features from the previous frame as a tracking query for the current frame and introduces a rotation angle prediction to fit the multi-orient text instance. On ICDAR2015 (video), TransVTSpotter achieves the state-of-the-art performance with 44.1% MOTA, 9 fps. The dataset and code of TransVTSpotter can be found at github.com/weijiawu/BOVText and github.com/weijiawu/TransVTSpotter, respectively.
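
The tracking-query idea can be illustrated with a minimal attention block: object features found in the previous frame act as queries over the current frame's features. This is an illustrative sketch under assumed shapes, not the released TransVTSpotter code; the class name and dimensions are hypothetical.

```python
import torch.nn as nn

class TrackingQueryAttention(nn.Module):
    """Sketch of a query-key tracking step: object features from frame t-1
    serve as queries attending over flattened features of frame t."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, prev_obj_feats, cur_frame_feats):
        # prev_obj_feats: (B, N_obj, dim) -- tracking queries from frame t-1
        # cur_frame_feats: (B, H*W, dim) -- current-frame feature tokens
        out, _ = self.attn(prev_obj_feats, cur_frame_feats, cur_frame_feats)
        return out  # updated object features, matched to the current frame
```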

【4】 Fast Point Transformer Link: https://arxiv.org/abs/2112.04702

Authors: Chunghyun Park, Yoonwoo Jeong, Minsu Cho, Jaesik Park Affiliations: POSTECH CSE & GSAI Note: 16 pages, 8 figures
Abstract: The recent success of neural networks enables a better interpretation of 3D point clouds, but processing a large-scale 3D scene remains a challenging problem. Most current approaches divide a large-scale scene into small regions and combine the local predictions together. However, this scheme inevitably involves additional stages for pre- and post-processing and may also degrade the final output due to predictions in a local perspective. This paper introduces Fast Point Transformer that consists of a new lightweight self-attention layer. Our approach encodes continuous 3D coordinates, and the voxel hashing-based architecture boosts computational efficiency. The proposed method is demonstrated with 3D semantic segmentation and 3D detection. The accuracy of our approach is competitive to the best voxel-based method, and our network achieves 136 times faster inference time than the state-of-the-art, Point Transformer, with a reasonable accuracy trade-off.
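
As a rough picture of how voxel hashing can index continuous 3D coordinates, the sketch below quantizes points into voxels, keeps the continuous in-voxel offset, and maps each voxel to a hash bucket using the classic spatial-hash primes. This is a generic illustration, not the paper's implementation; voxel_size and table_size are arbitrary.

```python
import numpy as np

def voxel_hash(points, voxel_size=0.05, table_size=2**20):
    """Quantize continuous 3D coordinates into voxels and hash each voxel id;
    the continuous in-voxel offset is kept so fine geometry is not discarded.
    points: (N, 3) float array."""
    grid = np.floor(points / voxel_size).astype(np.int64)   # (N, 3) voxel ids
    offsets = points - (grid + 0.5) * voxel_size            # continuous local coords
    primes = np.array([73856093, 19349663, 83492791], dtype=np.int64)
    keys = (grid * primes).sum(axis=1) % table_size         # one bucket per voxel
    return keys, offsets
```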

【5】 DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition Link: https://arxiv.org/abs/2112.04674

Authors: Yuxuan Liang, Pan Zhou, Roger Zimmermann, Shuicheng Yan Affiliations: Sea AI Lab; National University of Singapore Note: Preprint

【6】 Recurrent Glimpse-based Decoder for Detection with Transformer Link: https://arxiv.org/abs/2112.04632

Authors: Zhe Chen, Jing Zhang, Dacheng Tao Affiliations: The University of Sydney, Australia; JD Explore Academy, China

【7】 Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer Link: https://arxiv.org/abs/2112.04894

Authors: Xiangde Luo, Minhao Hu, Tao Song, Guotai Wang, Shaoting Zhang Affiliations: University of Electronic Science and Technology of China, Chengdu, China; Shanghai AI Lab, Shanghai, China; SenseTime, Shanghai, China Note: A technical report about SSL4MIS: this https URL
Abstract: Recently, deep learning with Convolutional Neural Networks (CNNs) and Transformers has shown encouraging results in fully supervised medical image segmentation. However, it is still challenging for them to achieve good performance with limited annotations for training. In this work, we present a very simple yet efficient framework for semi-supervised medical image segmentation by introducing the cross teaching between CNN and Transformer. Specifically, we simplify the classical deep co-training from consistency regularization to cross teaching, where the prediction of a network is used as the pseudo label to supervise the other network directly end-to-end. Considering the difference in learning paradigm between CNN and Transformer, we introduce the Cross Teaching between CNN and Transformer rather than just using CNNs. Experiments on a public benchmark show that our method outperforms eight existing semi-supervised learning methods just with a simpler framework. Notably, this work may be the first attempt to combine CNN and transformer for semi-supervised medical image segmentation and achieve promising results on a public benchmark. The code will be released at: https://github.com/HiLab-git/SSL4MIS.
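
The cross-teaching objective reduces to a pair of cross-entropy terms in which each network is supervised by the hard pseudo label of the other, with gradients blocked through the pseudo-label branch. A minimal sketch under our reading of the paper, assuming segmentation logits of shape (B, C, H, W):

```python
import torch.nn.functional as F

def cross_teaching_loss(logits_cnn, logits_tf):
    """Each network learns from the argmax pseudo label of the other;
    detach() blocks gradients through the pseudo-label branch.
    logits_*: (B, C, H, W) segmentation logits on unlabeled images."""
    pseudo_cnn = logits_cnn.detach().argmax(dim=1)  # (B, H, W) labels from CNN
    pseudo_tf = logits_tf.detach().argmax(dim=1)    # (B, H, W) labels from Transformer
    return F.cross_entropy(logits_cnn, pseudo_tf) + F.cross_entropy(logits_tf, pseudo_cnn)
```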

【8】 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis Link: https://arxiv.org/abs/2112.04863

Authors: Jianhui Yu, Chaoyi Zhang, Heng Wang, Dingxin Zhang, Yang Song, Tiange Xiang, Dongnan Liu, Weidong Cai Affiliations: School of Computer Science, University of Sydney, Australia; School of Computer Science and Engineering, University of New South Wales, Australia Note: technical report
Abstract: General point clouds have been increasingly investigated for different tasks, and recently Transformer-based networks are proposed for point cloud analysis. However, there are barely related works for medical point clouds, which are important for disease detection and treatment. In this work, we propose an attention-based model specifically for medical point clouds, namely 3D medical point Transformer (3DMedPT), to examine the complex biological structures. By augmenting contextual information and summarizing local responses at query, our attention module can capture both local context and global content feature interactions. However, the insufficient training samples of medical data may lead to poor feature learning, so we apply position embeddings to learn accurate local geometry and Multi-Graph Reasoning (MGR) to examine global knowledge propagation over channel graphs to enrich feature representations. Experiments conducted on the IntrA dataset prove the superiority of 3DMedPT, where we achieve the best classification and segmentation results. Furthermore, the promising generalization ability of our method is validated on general 3D point cloud benchmarks: ModelNet40 and ShapeNetPart. Code will be released soon.

Detection (12 papers)

【1】 Searching Parameterized AP Loss for Object Detection Link: https://arxiv.org/abs/2112.05138

Authors: Chenxin Tao, Zizhang Li, Xizhou Zhu, Gao Huang, Yong Liu, Jifeng Dai Affiliations: Tsinghua University; Zhejiang University; SenseTime Research; Shanghai Jiao Tong University; Beijing Academy of Artificial Intelligence, Beijing, China Note: Accepted by NeurIPS 2021

【2】 Illumination and Temperature-Aware Multispectral Networks for Edge-Computing-Enabled Pedestrian Detection Link: https://arxiv.org/abs/2112.05053

Authors: Yifan Zhuang, Ziyuan Pu, Jia Hu, Yinhai Wang Note: 13 pages, 12 figures
Abstract: Accurate and efficient pedestrian detection is crucial for the intelligent transportation system regarding pedestrian safety and mobility, e.g., Advanced Driver Assistance Systems, and smart pedestrian crosswalk systems. Among all pedestrian detection methods, vision-based detection method is demonstrated to be the most effective in previous studies. However, the existing vision-based pedestrian detection algorithms still have two limitations that restrict their implementations, those being real-time performance as well as the resistance to the impacts of environmental factors, e.g., low illumination conditions. To address these issues, this study proposes a lightweight Illumination and Temperature-aware Multispectral Network (IT-MN) for accurate and efficient pedestrian detection. The proposed IT-MN is an efficient one-stage detector. For accommodating the impacts of environmental factors and enhancing the sensing accuracy, thermal image data is fused by the proposed IT-MN with visual images to enrich useful information when visual image quality is limited. In addition, an innovative and effective late fusion strategy is also developed to optimize the image fusion performance. To make the proposed model implementable for edge computing, the model quantization is applied to reduce the model size by 75% while shortening the inference time significantly. The proposed algorithm is evaluated by comparing with the selected state-of-the-art algorithms using a public dataset collected by in-vehicle cameras. The results show that the proposed algorithm achieves a low miss rate and inference time at 14.19% and 0.03 seconds per image pair on GPU. Besides, the quantized IT-MN achieves an inference time of 0.21 seconds per image pair on the edge device, which also demonstrates the potentiality of deploying the proposed model on edge devices as a highly efficient pedestrian detection algorithm.

【3】 CaSP: Class-agnostic Semi-Supervised Pretraining for Detection and Segmentation Link: https://arxiv.org/abs/2112.04966

Authors: Lu Qi, Jason Kuen, Zhe Lin, Jiuxiang Gu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen, Jiaya Jia Affiliations: Platform & Content Group, Tencent; Adobe Research; The Chinese University of Hong Kong
Abstract: To improve instance-level detection/segmentation performance, existing self-supervised and semi-supervised methods extract either very task-unrelated or very task-specific training signals from unlabeled data. We argue that these two approaches, at the two extreme ends of the task-specificity spectrum, are suboptimal for the task performance. Utilizing too little task-specific training signals causes underfitting to the ground-truth labels of downstream tasks, while the opposite causes overfitting to the ground-truth labels. To this end, we propose a novel Class-agnostic Semi-supervised Pretraining (CaSP) framework to achieve a more favorable task-specificity balance in extracting training signals from unlabeled data. Compared to semi-supervised learning, CaSP reduces the task specificity in training signals by ignoring class information in the pseudo labels and having a separate pretraining stage that uses only task-unrelated unlabeled data. On the other hand, CaSP preserves the right amount of task specificity by leveraging box/mask-level pseudo labels. As a result, our pretrained model can better avoid underfitting/overfitting to ground-truth labels when finetuned on the downstream task. Using 3.6M unlabeled data, we achieve a remarkable performance gain of 4.7% over ImageNet-pretrained baseline on object detection. Our pretrained model also demonstrates excellent transferability to other detection and segmentation tasks/frameworks.
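
Making pseudo labels class-agnostic amounts to discarding the predicted category while keeping the localization signal (box/mask and score). A tiny sketch of that step; the dict keys here are hypothetical, not CaSP's actual data format:

```python
def to_class_agnostic(pseudo_labels):
    """Strip category information from detector pseudo labels, keeping only
    box/mask-level localization and the confidence score."""
    return [
        {"box": p["box"], "mask": p.get("mask"), "score": p["score"], "label": 0}
        for p in pseudo_labels
    ]
```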

【4】 Few-Shot Keypoint Detection as Task Adaptation via Latent Embeddings Link: https://arxiv.org/abs/2112.04910

Authors: Mel Vecerik, Jackie Kay, Raia Hadsell, Lourdes Agapito, Jon Scholz Affiliations: University College London, UK Note: Supplementary material available at: this https URL
Abstract: Dense object tracking, the ability to localize specific object points with pixel-level accuracy, is an important computer vision task with numerous downstream applications in robotics. Existing approaches either compute dense keypoint embeddings in a single forward pass, meaning the model is trained to track everything at once, or allocate their full capacity to a sparse predefined set of points, trading generality for accuracy. In this paper we explore a middle ground based on the observation that the number of relevant points at a given time are typically relatively few, e.g. grasp points on a target object. Our main contribution is a novel architecture, inspired by few-shot task adaptation, which allows a sparse-style network to condition on a keypoint embedding that indicates which point to track. Our central finding is that this approach provides the generality of dense-embedding models, while offering accuracy significantly closer to sparse-keypoint approaches. We present results illustrating this capacity vs. accuracy trade-off, and demonstrate the ability to zero-shot transfer to new object instances (within-class) using a real-robot pick-and-place task.

【5】 Knowledge Distillation for Object Detection via Rank Mimicking and Prediction-guided Feature Imitation Link: https://arxiv.org/abs/2112.04840

Authors: Gang Li, Xiang Li, Yujie Wang, Shanshan Zhang, Yichao Wu, Ding Liang Affiliations: Nanjing University of Science and Technology; SenseTime Research Note: Accepted by AAAI 2022

【6】 Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection Link: https://arxiv.org/abs/2112.04771

Authors: Jiaqi Tang, Zhaoyang Liu, Chen Qian, Wayne Wu, Limin Wang Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; SenseTime Research

【7】 3D-VField: Learning to Adversarially Deform Point Clouds for Robust 3D Object Detection Link: https://arxiv.org/abs/2112.04764

Authors: Alexander Lehner, Stefano Gasperini, Alvaro Marcos-Ramiro, Michael Schmidt, Mohammad-Ali Nikouei Mahani, Nassir Navab, Benjamin Busam, Federico Tombari Affiliations: Technical University of Munich; BMW Group; Johns Hopkins University; Google

【8】 BLPnet: A New DNN model for Automatic License Plate Detection with Bengali OCR Link: https://arxiv.org/abs/2112.04752

Authors: Md Saif Hassan Onim, Hussain Nyeem, Koushik Roy, Mahmudul Hasan, Abtahi Ishmam, Md. Akiful Hoque Akif, Tareque Bashar Ovi Affiliations: Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka, Bangladesh
Abstract: Deep Neural Network (DNN) models with image processing and object localization have the potential to advance the automatic traffic control and monitoring system. Despite some notable progress in developing robust license plate detection models, research endeavours continue to reduce computational complexities with higher detection accuracy. This paper reports a computationally efficient and reasonably accurate Automatic License Plate Recognition (ALPR) system for Bengali characters with a new DNN model that we call Bengali License Plate Network (BLPnet). Additionally, the cascaded architecture for detecting vehicle regions prior to VLP in the proposed model significantly reduces computational cost and false positives, making the system faster and more accurate. Besides, with a new Bengali OCR engine and word-mapping process, the model can readily extract, detect and output the complete license-plate number of a vehicle. The model, fed with 17 frames per second (fps) of real-time video footage, can detect a vehicle with a Mean Squared Error (MSE) of 0.0152 and a mean license plate character recognition accuracy of 95%. Compared to other models, improvements of 5% and 20% were recorded for BLPnet over the prominent YOLO-based ALPR model and the Tesseract model for number-plate detection accuracy and time requirement, respectively.

【9】 Superpixel-Based Building Damage Detection from Post-earthquake Very High Resolution Imagery Using Deep Neural Networks Link: https://arxiv.org/abs/2112.04744

Authors: Jun Wang, Zhoujing Li, Yixuan Qiao, Qiming Qin, Peng Gao, Guotong Xie Affiliations: Agricultural Information Institute; Institute of Remote Sensing and Geographic Information System, Peking University
Abstract: Building damage detection after natural disasters like earthquakes is crucial for initiating effective emergency response actions. Remotely sensed very high spatial resolution (VHR) imagery can provide vital information due to their ability to map the affected buildings with high geometric precision. Many approaches have been developed to detect damaged buildings due to earthquakes. However, little attention has been paid to exploiting rich features represented in VHR images using Deep Neural Networks (DNN). This paper presents a novel super-pixel based approach combining DNN and a modified segmentation method, to detect damaged buildings from VHR imagery. Firstly, a modified Fast Scanning and Adaptive Merging method is extended to create initial over-segmentation. Secondly, the segments are merged based on the Region Adjacent Graph (RAG), considering an improved semantic similarity criterion composed of Local Binary Patterns (LBP) texture, spectral, and shape features. Thirdly, a pre-trained DNN using Stacked Denoising Auto-Encoders called SDAE-DNN is presented, to exploit the rich semantic features for building damage detection. Deep-layer feature abstraction of SDAE-DNN could boost detection accuracy through learning more intrinsic and discriminative features, which outperformed other methods using state-of-the-art alternative classifiers. We demonstrate the feasibility and effectiveness of our method using a subset of WorldView-2 imagery, in the complex urban areas of Bhaktapur, Nepal, which was affected by the Nepal Earthquake of April 25, 2015.
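
For reference, one layer of a stacked denoising autoencoder reconstructs its clean input from a corrupted copy; layers are pretrained greedily and then stacked. A generic sketch with masking-noise corruption and an MSE objective, not the paper's exact SDAE-DNN:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One layer of a stacked denoising autoencoder: reconstruct the clean
    input from a corrupted copy (masking noise here)."""
    def __init__(self, in_dim, hidden_dim, noise=0.2):
        super().__init__()
        self.noise = noise
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.dec = nn.Linear(hidden_dim, in_dim)

    def forward(self, x):
        corrupted = x * (torch.rand_like(x) > self.noise).float()
        return self.dec(self.enc(corrupted))

# Greedy pretraining: minimize MSE(dae(x), x) per layer, then stack the
# encoders, feeding enc(x) of one layer as the input of the next.
```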

【10】 Learning Auxiliary Monocular Contexts Helps Monocular 3D Object Detection Link: https://arxiv.org/abs/2112.04628

Authors: Xianpeng Liu, Nan Xue, Tianfu Wu Affiliations: Department of Electrical and Computer Engineering, North Carolina State University, USA; School of Computer Science, Wuhan University, China
Abstract: Monocular 3D object detection aims to localize 3D bounding boxes in an input single 2D image. It is a highly challenging problem and remains open, especially when no extra information (e.g., depth, lidar and/or multi-frames) can be leveraged in training and/or inference. This paper proposes a simple yet effective formulation for monocular 3D object detection without exploiting any extra information. It presents the MonoCon method which learns Monocular Contexts, as auxiliary tasks in training, to help monocular 3D object detection. The key idea is that with the annotated 3D bounding boxes of objects in an image, there is a rich set of well-posed projected 2D supervision signals available in training, such as the projected corner keypoints and their associated offset vectors with respect to the center of the 2D bounding box, which should be exploited as auxiliary tasks in training. The proposed MonoCon is motivated by the Cramer-Wold theorem in measure theory at a high level. In implementation, it utilizes a very simple end-to-end design to justify the effectiveness of learning auxiliary monocular contexts, which consists of three components: a Deep Neural Network (DNN) based feature backbone, a number of regression head branches for learning the essential parameters used in the 3D bounding box prediction, and a number of regression head branches for learning auxiliary contexts. After training, the auxiliary context regression branches are discarded for better inference efficiency. In experiments, the proposed MonoCon is tested on the KITTI benchmark (car, pedestrian and cyclist). It outperforms all prior arts in the leaderboard on the car category and obtains comparable performance on pedestrian and cyclist in terms of accuracy. Thanks to the simple design, the proposed MonoCon method obtains the fastest inference speed, 38.7 fps, in comparisons.
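
Structurally, training-only auxiliary heads can be expressed as branches that are simply skipped at inference. A schematic sketch with made-up head sizes, not the released MonoCon model:

```python
import torch.nn as nn

class MonoConStyleHead(nn.Module):
    """Backbone with essential 3D-box heads plus auxiliary context heads
    used only during training; a structural sketch under our reading."""
    def __init__(self, backbone, feat_ch=64):
        super().__init__()
        self.backbone = backbone
        self.box3d_head = nn.Conv2d(feat_ch, 7, 1)   # hypothetical: depth, dims, rotation
        self.aux_head = nn.Conv2d(feat_ch, 10, 1)    # hypothetical: corners/offset vectors

    def forward(self, x):
        feat = self.backbone(x)
        out = {"box3d": self.box3d_head(feat)}
        if self.training:                  # auxiliary branch discarded at inference
            out["aux"] = self.aux_head(feat)
        return out
```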

【11】 Segment and Complete: Defending Object Detectors against Adversarial Patch Attacks with Robust Patch Detection Link: https://arxiv.org/abs/2112.04532

Authors: Jiang Liu, Alexander Levine, Chun Pong Lau, Rama Chellappa, Soheil Feizi Affiliations: Johns Hopkins University; University of Maryland, College Park Note: Under submission
Abstract: Object detection plays a key role in many security-critical systems. Adversarial patch attacks, which are easy to implement in the physical world, pose a serious threat to state-of-the-art object detectors. Developing reliable defenses for object detectors against patch attacks is critical but severely understudied. In this paper, we propose Segment and Complete defense (SAC), a general framework for defending object detectors against patch attacks through detecting and removing adversarial patches. We first train a patch segmenter that outputs patch masks that provide pixel-level localization of adversarial patches. We then propose a self adversarial training algorithm to robustify the patch segmenter. In addition, we design a robust shape completion algorithm, which is guaranteed to remove the entire patch from the images given the outputs of the patch segmenter are within a certain Hamming distance of the ground-truth patch masks. Our experiments on COCO and xView datasets demonstrate that SAC achieves superior robustness even under strong adaptive attacks with no performance drop on clean images, and generalizes well to unseen patch shapes, attack budgets, and unseen attack methods. Furthermore, we present the APRICOT-Mask dataset, which augments the APRICOT dataset with pixel-level annotations of adversarial patches. We show SAC can significantly reduce the targeted attack success rate of physical patch attacks.
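
Once the patch segmenter localizes adversarial pixels, the defense reduces to masking them out before the detector runs. A minimal sketch of that masking step; the segmenter and detector are assumed interfaces, not SAC's released code:

```python
import torch

def remove_adversarial_patch(images, patch_mask):
    """Zero out pixels flagged as adversarial before detection.
    images: (B, 3, H, W); patch_mask: (B, 1, H, W), 1 = predicted patch pixel."""
    return images * (1.0 - patch_mask)

# usage sketch:
# detections = detector(remove_adversarial_patch(x, (segmenter(x) > 0.5).float()))
```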

【12】 Binary Change Guided Hyperspectral Multiclass Change Detection Link: https://arxiv.org/abs/2112.04493

Authors: Meiqi Hu, Chen Wu, Bo Du, Liangpei Zhang Note: 14 pages, 17 figures

Classification & Recognition (12 papers)

【1】 Spatio-temporal Relation Modeling for Few-shot Action Recognition Link: https://arxiv.org/abs/2112.05132

Authors: Anirudh Thatipelli, Sanath Narayan, Salman Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, Bernard Ghanem Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; Inception Institute of Artificial Intelligence; Aalto University; Australian National University; CVL, Linköping University; King Abdullah University of Science & Technology
Abstract: We propose a novel few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations. The focus of our approach is a novel spatio-temporal enrichment module that aggregates spatial and temporal contexts with dedicated local patch-level and global frame-level feature enrichment sub-modules. Local patch-level enrichment captures the appearance-based characteristics of actions. On the other hand, global frame-level enrichment explicitly encodes the broad temporal context, thereby capturing the relevant object features over time. The resulting spatio-temporally enriched representations are then utilized to learn the relational matching between query and support action sub-sequences. We further introduce a query-class similarity classifier on the patch-level enriched features to enhance class-specific feature discriminability by reinforcing the feature learning at different stages in the proposed framework. Experiments are performed on four few-shot action recognition benchmarks: Kinetics, SSv2, HMDB51 and UCF101. Our extensive ablation study reveals the benefits of the proposed contributions. Furthermore, our approach sets a new state-of-the-art on all four benchmarks. On the challenging SSv2 benchmark, our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature. Our code and models will be publicly released.

【2】 DVHN: A Deep Hashing Framework for Large-scale Vehicle Re-identification Link: https://arxiv.org/abs/2112.04937

Authors: Yongbiao Chen, Sheng Zhang, Fangxin Liu, Chenggang Wu, Kaicheng Guo, Zhengwei Qi Affiliations: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; University of Southern California, Los Angeles, USA
Abstract: In this paper, we make the very first attempt to investigate the integration of deep hash learning with vehicle re-identification. We propose a deep hash-based vehicle re-identification framework, dubbed DVHN, which substantially reduces memory usage and promotes retrieval efficiency while reserving nearest neighbor search accuracy. Concretely, DVHN directly learns discrete compact binary hash codes for each image by jointly optimizing the feature learning network and the hash code generating module. Specifically, we directly constrain the output from the convolutional neural network to be discrete binary codes and ensure the learned binary codes are optimal for classification. To optimize the deep discrete hashing framework, we further propose an alternating minimization method for learning binary similarity-preserved hashing codes. Extensive experiments on two widely-studied vehicle re-identification datasets, VehicleID and VeRi, have demonstrated the superiority of our method against the state-of-the-art deep hash methods. DVHN of 2048 bits can achieve 13.94% and 10.21% accuracy improvement in terms of mAP and Rank@1 for the VehicleID (800) dataset. For VeRi, we achieve 35.45% and 32.72% performance gains for Rank@1 and mAP, respectively.
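
Deep-hashing retrieval boils down to binarizing embeddings and ranking by Hamming distance; tanh is the usual smooth training surrogate for the non-differentiable sign. A generic sketch of these two steps, not DVHN's training objective:

```python
import torch

def to_hash_codes(features):
    """Binarize embeddings to {-1, +1}; tanh is the common smooth surrogate
    for sign() during training."""
    return torch.sign(torch.tanh(features))

def hamming_rank(query_code, db_codes):
    """Rank database codes by Hamming distance to the query.
    query_code: (bits,), db_codes: (N, bits), both in {-1, +1}."""
    bits = query_code.numel()
    dist = (bits - db_codes @ query_code) / 2  # inner product <-> Hamming distance
    return torch.argsort(dist)
```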

【3】 Model Doctor: A Simple Gradient Aggregation Strategy for Diagnosing and Treating CNN Classifiers Link: https://arxiv.org/abs/2112.04934

Authors: Zunlei Feng, Jiacong Hu, Sai Wu, Xiaotian Yu, Jie Song, Mingli Song Affiliations: Zhejiang University Note: Accepted by AAAI 2022
Abstract: Recently, Convolutional Neural Network (CNN) has achieved excellent performance in the classification task. It is widely known that CNN is deemed as a 'black-box', which is hard for understanding the prediction mechanism and debugging the wrong prediction. Some model debugging and explanation works are developed for solving the above drawbacks. However, those methods focus on explanation and diagnosing possible causes for model prediction, based on which the researchers handle the following optimization of models manually. In this paper, we propose the first completely automatic model diagnosing and treating tool, termed as Model Doctor. Based on two discoveries that 1) each category is only correlated with sparse and specific convolution kernels, and 2) adversarial samples are isolated while normal samples are successive in the feature space, a simple aggregate gradient constraint is devised for effectively diagnosing and optimizing CNN classifiers. The aggregate gradient strategy is a versatile module for mainstream CNN classifiers. Extensive experiments demonstrate that the proposed Model Doctor applies to all existing CNN classifiers, and improves the accuracy of 16 mainstream CNN classifiers by 1%-5%.
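
The category-to-kernel correlation that Model Doctor exploits can be probed by aggregating the gradient of a class logit with respect to each convolution kernel. A sketch of that diagnostic step under our reading of the paper; the function and its aggregation are illustrative:

```python
import torch

def kernel_gradients(model, conv_layer, x, class_idx):
    """Gradient of one class logit w.r.t. one convolution's kernels.
    Averaged over samples of a class, large values indicate the kernels
    that category relies on."""
    logit = model(x)[:, class_idx].sum()
    grad = torch.autograd.grad(logit, conv_layer.weight)[0]  # (out_ch, in_ch, k, k)
    return grad.abs().sum(dim=(1, 2, 3))  # one aggregated score per kernel
```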

【4】 Explainability of the Implications of Supervised and Unsupervised Face Image Quality Estimations Through Activation Map Variation Analyses in Face Recognition Models Link: https://arxiv.org/abs/2112.04827

Authors: Biying Fu, Naser Damer Affiliations: Fraunhofer Institute for Computer Graphics Research IGD, Darmstadt, Germany; Department of Computer Science, TU Darmstadt, Darmstadt, Germany Note: accepted at the IEEE Winter Conference on Applications of Computer Vision Workshops, WACV Workshops 2022
Abstract: It is challenging to derive explainability for unsupervised or statistical-based face image quality assessment (FIQA) methods. In this work, we propose a novel set of explainability tools to derive reasoning for different FIQA decisions and their face recognition (FR) performance implications. We avoid limiting the deployment of our tools to certain FIQA methods by basing our analyses on the behavior of FR models when processing samples with different FIQA decisions. This leads to explainability tools that can be applied for any FIQA method with any CNN-based FR solution using activation mapping to exhibit the network's activation derived from the face embedding. To avoid the low discrimination between the general spatial activation mapping of low and high-quality images in FR models, we build our explainability tools in a higher derivative space by analyzing the variation of the FR activation maps of image sets with different quality decisions. We demonstrate our tools and analyze the findings on four FIQA methods, by presenting inter and intra-FIQA method analyses. Our proposed tools and the analyses based on them point out, among other conclusions, that high-quality images typically cause consistent low activation on the areas outside of the central face region, while low-quality images, despite general low activation, have high variations of activation in such areas. Our explainability tools also extend to analyzing single images where we show that low-quality images tend to have an FR model spatial activation that strongly differs from what is expected from a high-quality image, where this difference also tends to appear more in areas outside of the central face region and does correspond to issues like extreme poses and facial occlusions. The implementation of the proposed tools is accessible here [link].

【5】 HBReID: Harder Batch for Re-identification Link: https://arxiv.org/abs/2112.04761

Authors: Wen Li, Furong Xu, Jianan Zhao, Ruobing Zheng, Cheng Zou, Meng Wang, Yuan Cheng Affiliations: Ant Group
Abstract: Triplet loss is a widely adopted loss function in the ReID task, which pulls the hardest positive pairs close and pushes the hardest negative pairs far away. However, the selected samples are not the hardest globally, but the hardest only within a mini-batch, which affects performance. In this report, a hard batch mining method is proposed to mine the hardest samples globally to make triplets harder. More specifically, the most similar classes are selected into the same mini-batch so that similar classes can be pushed further away. Besides, an adversarial scene removal module composed of a scene classifier and an adversarial loss is used to learn scene-invariant feature representations. Experiments are conducted on the dataset MSMT17 to prove the effectiveness, and our method surpasses all previous methods and sets a state-of-the-art result.
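
For contrast with the proposed global mining, the standard batch-hard triplet loss that the report starts from looks like this (a common formulation, assuming L2 distances over a (B, D) embedding batch):

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, take the farthest positive
    (same identity) and the closest negative (different identity) in the batch.
    embeddings: (B, D), labels: (B,)."""
    dist = torch.cdist(embeddings, embeddings)               # (B, B) pairwise L2
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    hardest_pos = (dist * same).max(dim=1).values            # farthest positive
    hardest_neg = (dist + same * 1e9).min(dim=1).values      # closest negative
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```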

【6】 Amicable Aid: Turning Adversarial Attack to Benefit Classification Link: https://arxiv.org/abs/2112.04720

Authors: Juyeop Kim, Jun-Ho Choi, Soobeom Jang, Jong-Seok Lee Affiliations: Yonsei University Note: 16 pages (3 pages for appendix)
Abstract: While adversarial attacks on deep image classification models pose serious security concerns in practice, this paper suggests a novel paradigm where the concept of adversarial attacks can benefit classification performance, which we call amicable aid. We show that by taking the opposite search direction of perturbation, an image can be converted to another yielding higher confidence by the classification model and even a wrongly classified image can be made to be correctly classified. Furthermore, with a large amount of perturbation, an image can be made unrecognizable by human eyes, while it is correctly recognized by the model. The mechanism of the amicable aid is explained in the viewpoint of the underlying natural image manifold. We also consider universal amicable perturbations, i.e., a fixed perturbation can be applied to multiple images to improve their classification results. While it is challenging to find such perturbations, we show that making the decision boundary as perpendicular to the image manifold as possible via training with modified data is effective to obtain a model for which universal amicable perturbations are more easily found. Finally, we discuss several application scenarios where the amicable aid can be useful, including secure image communication, privacy-preserving image communication, and protection against adversarial attacks.
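
An amicable perturbation can be sketched as the mirror image of a PGD attack: step along the negative loss gradient of the true class so confidence rises. A minimal PGD-style sketch under assumed pixel range [0, 1]; step sizes and budget are arbitrary, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def amicable_perturbation(model, x, label, eps=8/255, steps=10):
    """Iteratively perturb x in the direction that DECREASES the true-class
    loss -- the opposite of an adversarial attack."""
    x_aid = x.clone().detach()
    step = eps / steps
    for _ in range(steps):
        x_aid.requires_grad_(True)
        loss = F.cross_entropy(model(x_aid), label)
        grad = torch.autograd.grad(loss, x_aid)[0]
        with torch.no_grad():
            x_aid = x_aid - step * grad.sign()          # descend, not ascend
            x_aid = x + (x_aid - x).clamp(-eps, eps)    # stay in the eps-ball
            x_aid = x_aid.clamp(0.0, 1.0)               # stay a valid image
    return x_aid.detach()
```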

【7】 Unsupervised Complementary-aware Multi-process Fusion for Visual Place Recognition Link: https://arxiv.org/abs/2112.04701

Authors: Stephen Hausler, Tobias Fischer, Michael Milford Affiliations: Centre for Robotics, Queensland University of Technology (QUT)
Abstract: A recent approach to the Visual Place Recognition (VPR) problem has been to fuse the place recognition estimates of multiple complementary VPR techniques simultaneously. However, selecting the optimal set of techniques to use in a specific deployment environment a-priori is a difficult and unresolved challenge. Further, to the best of our knowledge, no method exists which can select a set of techniques on a frame-by-frame basis in response to image-to-image variations. In this work, we propose an unsupervised algorithm that finds the most robust set of VPR techniques to use in the current deployment environment, on a frame-by-frame basis. The selection of techniques is determined by an analysis of the similarity scores between the current query image and the collection of database images and does not require ground-truth information. We demonstrate our approach on a wide variety of datasets and VPR techniques and show that the proposed dynamic multi-process fusion (Dyn-MPF) has superior VPR performance compared to a variety of challenging competitive methods, some of which are given an unfair advantage through access to the ground-truth information.

【8】 Dual Cluster Contrastive learning for Person Re-Identification Link: https://arxiv.org/abs/2112.04662

Authors: Hantao Yao, Changsheng Xu Affiliations: National Laboratory of Pattern Recognition, Institute of Automation, CAS Note: 10 pages, 6 figures

【9】 STAF: A Spatio-Temporal Attention Fusion Network for Few-shot Video Classification Link: https://arxiv.org/abs/2112.04585

Authors: Rex Liu, Huanle Zhang, Hamed Pirsiavash, Xin Liu Affiliations: Department of Computer Science, University of California, Davis
Abstract: We propose STAF, a Spatio-Temporal Attention Fusion network for few-shot video classification. STAF first extracts coarse-grained spatial and temporal features of videos by applying a 3D Convolution Neural Networks embedding network. It then fine-tunes the extracted features using self-attention and cross-attention networks. Last, STAF applies a lightweight fusion network and a nearest neighbor classifier to classify each query video. To evaluate STAF, we conduct extensive experiments on three benchmarks (UCF101, HMDB51, and Something-Something-V2). The experimental results show that STAF improves state-of-the-art accuracy by a large margin, e.g., STAF increases the five-way one-shot accuracy by 5.3% and 7.0% for UCF101 and HMDB51, respectively.

【10】 CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning Link: https://arxiv.org/abs/2112.04564

Authors: Yue Fan, Dengxin Dai, Bernt Schiele Affiliations: Max Planck Institute for Informatics, Saarbrücken, Germany; Saarland Informatics Campus
Abstract: In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced SSL. To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which has limited practicality in real-world scenarios. Therefore, we further conduct a comprehensive evaluation under various shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our code will be made publicly available.

【11】 SoK: Anti-Facial Recognition Technology Link: https://arxiv.org/abs/2112.04558

Authors: Emily Wenger, Shawn Shan, Haitao Zheng, Ben Y. Zhao Affiliations: Department of Computer Science, The University of Chicago Note: 13 pages
Abstract: The rapid adoption of facial recognition (FR) technology by both government and commercial entities in recent years has raised concerns about civil liberties and privacy. In response, a broad suite of so-called "anti-facial recognition" (AFR) tools has been developed to help users avoid unwanted facial recognition. The set of AFR tools proposed in the last few years is wide-ranging and rapidly evolving, necessitating a step back to consider the broader design space of AFR systems and long-term challenges. This paper aims to fill that gap and provides the first comprehensive analysis of the AFR research landscape. Using the operational stages of FR systems as a starting point, we create a systematic framework for analyzing the benefits and tradeoffs of different AFR approaches. We then consider both technical and social challenges facing AFR tools and propose directions for future research in this field.

【12】 Robust Weakly Supervised Learning for COVID-19 Recognition Using Multi-Center CT Images Link: https://arxiv.org/abs/2112.04984

Authors: Qinghao Ye, Yuan Gao, Weiping Ding, Zhangming Niu, Chengjia Wang, Yinghui Jiang, Minhao Wang, Evandro Fei Fang, Wade Menpes-Smith, Jun Xia, Guang Yang Affiliations: Hangzhou Ocean's Smart Boya Co., Ltd; University of California, San Diego, La Jolla, California, USA; Institute of Biomedical Engineering, University of Oxford, UK; Aladdin Healthcare Technologies Ltd; Nantong University, Nantong, China Note: 32 pages, 8 figures, Applied Soft Computing
Abstract: The world is currently experiencing an ongoing pandemic of an infectious disease named coronavirus disease 2019 (i.e., COVID-19), which is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Computed Tomography (CT) plays an important role in assessing the severity of the infection and can also be used to identify those symptomatic and asymptomatic COVID-19 carriers. With a surge of the cumulative number of COVID-19 patients, radiologists are increasingly stressed to examine the CT scans manually. Therefore, an automated 3D CT scan recognition tool is highly in demand since the manual analysis is time-consuming for radiologists and their fatigue can cause possible misjudgment. However, due to various technical specifications of CT scanners located in different hospitals, the appearance of CT images can be significantly different leading to the failure of many automated image recognition approaches. The multi-domain shift problem for the multi-center and multi-scanner studies is therefore nontrivial that is also crucial for a dependable recognition and critical for reproducible and objective diagnosis and prognosis. In this paper, we proposed a COVID-19 CT scan recognition model namely coronavirus information fusion and diagnosis network (CIFD-Net) that can efficiently handle the multi-domain shift problem via a new robust weakly supervised learning paradigm. Our model can resolve the problem of different appearance in CT scan images reliably and efficiently while attaining higher accuracy compared to other state-of-the-art methods.

Segmentation & Semantics (5 papers)

【1】 Exploring Event-driven Dynamic Context for Accident Scene Segmentation Link: https://arxiv.org/abs/2112.05006

Authors: Jiaming Zhang, Kailun Yang, Rainer Stiefelhagen Affiliations: Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology Note: Accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS), extended version of arXiv:2008.08974; dataset and code will be made publicly available at this https URL
Abstract: The robustness of semantic segmentation on edge cases of traffic scene is a vital factor for the safety of intelligent transportation. However, most of the critical scenes of traffic accidents are extremely dynamic and previously unseen, which seriously harm the performance of semantic segmentation methods. In addition, the delay of the traditional camera during high-speed driving will further reduce the contextual information in the time dimension. Therefore, we propose to extract dynamic context from event-based data with a higher temporal resolution to enhance static RGB images, even for those from traffic accidents with motion blur, collisions, deformations, overturns, etc. Moreover, in order to evaluate the segmentation performance in traffic accidents, we provide a pixel-wise annotated accident dataset, namely DADA-seg, which contains a variety of critical scenarios from traffic accidents. Our experiments indicate that event-based data can provide complementary information to stabilize semantic segmentation under adverse conditions by preserving fine-grained motion of fast-moving foreground (crash objects) in accidents. Our approach achieves 8.2% performance gain on the proposed accident dataset, exceeding more than 20 state-of-the-art semantic segmentation methods. The proposal has been demonstrated to be consistently effective for models learned on multiple source databases including Cityscapes, KITTI-360, BDD, and ApolloScape.

【2】 Implicit Feature Refinement for Instance Segmentation Link: https://arxiv.org/abs/2112.04709

Authors: Lufan Ma, Tiancai Wang, Bin Dong, Jiangpeng Yan, Xiu Li, Xiangyu Zhang Affiliations: Tsinghua University, Beijing, China; MEGVII Technology, Beijing, China Note: Published at ACM MM 2021. Code is available at this https URL
Abstract: We propose a novel implicit feature refinement module for high-quality instance segmentation. Existing image/video instance segmentation methods rely on explicitly stacked convolutions to refine instance features before the final prediction. In this paper, we first give an empirical comparison of different refinement strategies, which reveals that the widely-used four consecutive convolutions are not necessary. As an alternative, weight-sharing convolution blocks provide competitive performance. When such a block is iterated for infinite times, the block output will eventually converge to an equilibrium state. Based on this observation, the implicit feature refinement (IFR) is developed by constructing an implicit function. The equilibrium state of instance features can be obtained by fixed-point iteration via a simulated infinite-depth network. Our IFR enjoys several advantages: 1) simulates an infinite-depth refinement network while only requiring parameters of a single residual block; 2) produces high-level equilibrium instance features of global receptive field; 3) serves as a plug-and-play general module easily extended to most object recognition frameworks. Experiments on the COCO and YouTube-VIS benchmarks show that our IFR achieves improved performance on state-of-the-art image/video instance segmentation frameworks, while reducing the parameter burden (e.g., 1% AP improvement on Mask R-CNN with only 30.0% parameters in the mask head). Code is made available at https://github.com/lufanma/IFR.git
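
The fixed-point view can be sketched as iterating a single weight-shared residual block until the update stabilizes, z* = z* + f(z*, x). A simplified forward-iteration sketch in the spirit of implicit/equilibrium models; the paper's solver and backward pass may differ:

```python
import torch
import torch.nn as nn

class ImplicitRefinement(nn.Module):
    """A weight-shared residual block iterated toward a fixed point."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x, iters=20, tol=1e-4):
        z = torch.zeros_like(x)
        for _ in range(iters):
            z_next = z + self.f(torch.cat([z, x], dim=1))  # shared-weight update
            if (z_next - z).norm() < tol * (z.norm() + 1e-8):
                return z_next                               # approximate equilibrium
            z = z_next
        return z
```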

【3】 Style Mixing and Patchwise Prototypical Matching for One-Shot Unsupervised Domain Adaptive Semantic Segmentation Link: https://arxiv.org/abs/2112.04665

Authors: Xinyi Wu, Zhenyao Wu, Yuhang Lu, Lili Ju, Song Wang Affiliations: Department of Computer Science and Engineering, University of South Carolina, USA; Department of Mathematics, University of South Carolina, USA Note: Accepted by AAAI 2022

【4】 A Unified Architecture of Semantic Segmentation and Hierarchical Generative Adversarial Networks for Expression Manipulation Link: https://arxiv.org/abs/2112.04603

Authors: Rumeysa Bodur, Binod Bhattarai, Tae-Kyun Kim Affiliations: Imperial College London, UK; University College London, UK; KAIST, South Korea
Abstract: Editing facial expressions by only changing what we want is a long-standing research problem in Generative Adversarial Networks (GANs) for image manipulation. Most of the existing methods that rely only on a global generator usually suffer from changing unwanted attributes along with the target attributes. Recently, hierarchical networks that consist of both a global network dealing with the whole image and multiple local networks focusing on local parts are showing success. However, these methods extract local regions by bounding boxes centred around the sparse facial key points which are non-differentiable, inaccurate and unrealistic. Hence, the solution becomes sub-optimal, introduces unwanted artefacts degrading the overall quality of the synthetic images. Moreover, a recent study has shown strong correlation between facial attributes and local semantic regions. To exploit this relationship, we designed a unified architecture of semantic segmentation and hierarchical GANs. A unique advantage of our framework is that on forward pass the semantic segmentation network conditions the generative model, and on backward pass gradients from hierarchical GANs are propagated to the semantic segmentation network, which makes our framework an end-to-end differentiable architecture. This allows both architectures to benefit from each other. To demonstrate its advantages, we evaluate our method on two challenging facial expression translation benchmarks, AffectNet and RaFD, and a semantic segmentation benchmark, CelebAMask-HQ across two popular architectures, BiSeNet and UNet. Our extensive quantitative and qualitative evaluations on both face semantic segmentation and face expression manipulation tasks validate the effectiveness of our work over existing state-of-the-art methods.

【5】 Extending nn-UNet for brain tumor segmentation 标题:扩展nn-UNet用于脑肿瘤分割 链接:https://arxiv.org/abs/2112.04653

作者:Huan Minh Luu,Sung-Hong Park 机构:Magnetic Resonance Imaging Laboratory, Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology 备注:12 pages, 4 figures, BraTS competition paper

Zero/Few Shot|迁移|域适配|自适应(3篇)

【1】 Extending the WILDS Benchmark for Unsupervised Adaptation 标题:扩展用于无监督适应的WILDS基准 链接:https://arxiv.org/abs/2112.05090

作者:Shiori Sagawa,Pang Wei Koh,Tony Lee,Irena Gao,Sang Michael Xie,Kendrick Shen,Ananya Kumar,Weihua Hu,Michihiro Yasunaga,Henrik Marklund,Sara Beery,Etienne David,Ian Stavness,Wei Guo,Jure Leskovec,Kate Saenko,Tatsunori Hashimoto,Sergey Levine,Chelsea Finn,Percy Liang

【2】 AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach 标题:AdaStereo:一种高效的域自适应立体匹配方法 链接:https://arxiv.org/abs/2112.04974

作者:Xiao Song,Guorun Yang,Xinge Zhu,Hui Zhou,Yuexin Ma,Zhe Wang,Jianping Shi 备注:To be published in International Journal of Computer Vision (IJCV) 摘要:最近,端到端视差网络不断刷新立体匹配基准的记录。然而,这些深度模型的域适应能力非常有限。针对这一问题,我们提出了一种称为AdaStereo的新型域自适应方法,旨在对齐深度立体匹配网络的多层次表示。与以前的方法相比,我们的AdaStereo实现了更标准、完整和有效的域适应管线。首先,我们提出了一种用于输入图像级对齐的非对抗性渐进式颜色迁移算法。其次,我们设计了一个高效的无参数代价归一化层,用于内部特征级对齐。最后,提出了一个高度相关的辅助任务,即自监督的遮挡感知重建,以缩小输出空间中的差距。我们进行了密集的消融研究和逐项比较,以验证每个模块的有效性。我们的AdaStereo模型不需要额外的推理开销,只需稍微增加训练复杂度,就可以在多个基准(包括KITTI、Middlebury、ETH3D和DrivingStereo)上实现最先进的跨域性能,甚至超过了一些用目标域真值微调的最先进视差网络。此外,基于两个额外的评估指标,我们的域自适应立体匹配管线的优势从更多角度得到进一步揭示。最后,我们证明了我们的方法对各种域适应设置具有鲁棒性,并且可以轻松地集成到快速适应应用场景和实际部署中。 摘要:Recently, records on stereo matching benchmarks are constantly broken by end-to-end disparity networks. However, the domain adaptation ability of these deep models is quite limited. To address this problem, we present a novel domain-adaptive approach called AdaStereo that aims to align multi-level representations for deep stereo matching networks. Compared to previous methods, our AdaStereo realizes a more standard, complete and effective domain adaptation pipeline. Firstly, we propose a non-adversarial progressive color transfer algorithm for input image-level alignment. Secondly, we design an efficient parameter-free cost normalization layer for internal feature-level alignment. Lastly, a highly related auxiliary task, self-supervised occlusion-aware reconstruction, is presented to narrow the gaps in output space. We perform intensive ablation studies and break-down comparisons to validate the effectiveness of each proposed module. With no extra inference overhead and only a slight increase in training complexity, our AdaStereo models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo, even outperforming some state-of-the-art disparity networks finetuned with target-domain ground-truths. Moreover, based on two additional evaluation metrics, the superiority of our domain-adaptive stereo matching pipeline is further uncovered from more perspectives. Finally, we demonstrate that our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
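As a hedged illustration of what a parameter-free cost normalization could look like (the exact AdaStereo formulation may differ; the normalization axis and tensor layout here are assumptions), one can standardize the matching-cost volume along the disparity dimension so per-pixel cost statistics become comparable across domains:

```python
import torch

def normalize_cost_volume(cost: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """cost: (B, D, H, W) volume over D candidate disparities.
    Zero-mean, unit-variance per pixel across disparities; no learned weights."""
    mean = cost.mean(dim=1, keepdim=True)
    std = cost.std(dim=1, keepdim=True)
    return (cost - mean) / (std + eps)
```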

【3】 Adaptive Methods for Aggregated Domain Generalization 标题:聚合域泛化的自适应方法 链接:https://arxiv.org/abs/2112.04766

作者:Xavier Thomas,Dhruv Mahajan,Alex Pentland,Abhimanyu Dubey 机构:Manipal Institute of Technology†Facebook AI Research 摘要:领域泛化涉及从异构的训练源集合中学习分类器,从而将其泛化为来自类似未知目标领域的数据,并应用于大规模学习和个性化推理。在许多情况下,隐私问题禁止获取训练数据样本的域标签,而只具有训练点的聚合集合。利用域标签创建域不变特征表示的现有方法在此设置中不适用,需要其他方法来学习通用分类器。在本文中,我们提出了一种域自适应方法来解决这个问题,该方法分为两个步骤:(a)我们在精心选择的特征空间中聚类训练数据以创建伪域,以及(b)使用这些伪域,我们学习一个域自适应分类器,该分类器使用有关输入和它所属的伪域的信息进行预测。我们的方法在不使用任何域标签的情况下,在各种域泛化基准上实现了最先进的性能。此外,我们还利用聚类信息为领域泛化提供了新的理论保证。我们的方法适用于基于集成的方法,即使在大规模基准数据集上也能提供可观的收益。有关代码,请访问:https://github.com/xavierohan/AdaClust_DomainBed 摘要:Domain generalization involves learning a classifier from a heterogeneous collection of training sources such that it generalizes to data drawn from similar unknown target domains, with applications in large-scale learning and personalized inference. In many settings, privacy concerns prohibit obtaining domain labels for the training data samples, and instead only have an aggregated collection of training points. Existing approaches that utilize domain labels to create domain-invariant feature representations are inapplicable in this setting, requiring alternative approaches to learn generalizable classifiers. In this paper, we propose a domain-adaptive approach to this problem, which operates in two steps: (a) we cluster training data within a carefully chosen feature space to create pseudo-domains, and (b) using these pseudo-domains we learn a domain-adaptive classifier that makes predictions using information about both the input and the pseudo-domain it belongs to. Our approach achieves state-of-the-art performance on a variety of domain generalization benchmarks without using domain labels whatsoever. Furthermore, we provide novel theoretical guarantees on domain generalization using cluster information. Our approach is amenable to ensemble-based methods and provides substantial gains even on large-scale benchmark datasets. The code can be found at: https://github.com/xavierohan/AdaClust_DomainBed
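A hedged sketch of step (a) above: cluster features from some backbone into pseudo-domains whose labels then condition the classifier of step (b). The frozen-feature setup and the use of k-means are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_pseudo_domains(features: np.ndarray, k: int = 3) -> np.ndarray:
    """Assign each training point a pseudo-domain label by clustering its
    feature vector; no real domain labels are required anywhere."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)

feats = np.random.randn(1000, 512).astype(np.float32)  # placeholder features
pseudo_domain = make_pseudo_domains(feats)             # shape (1000,)
```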

半弱无监督|主动学习|不确定性(5篇)

【1】 GAN-Supervised Dense Visual Alignment 标题:GAN监督的密集视觉对齐 链接:https://arxiv.org/abs/2112.05143

作者:William Peebles,Jun-Yan Zhu,Richard Zhang,Antonio Torralba,Alexei Efros,Eli Shechtman 机构:UC Berkeley, Carnegie Mellon University, Adobe Research, MIT CSAIL, Facebook AI Research (FAIR) 备注:Code available at this https URL . Project page and videos available at this https URL

【2】 Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework 标题:基于统一梯度框架的孪生自监督学习等价性探讨 链接:https://arxiv.org/abs/2112.05141

作者:Chenxin Tao,Honghui Wang,Xizhou Zhu,Jiahua Dong,Shiji Song,Gao Huang,Jifeng Dai 机构:Tsinghua University,SenseTime Research,Zhejiang University, Beijing Academy of Artificial Intelligence, Beijing, China

【3】 Self-Supervised Keypoint Discovery in Behavioral Videos 标题:行为视频中的自监督关键点发现 链接:https://arxiv.org/abs/2112.05121

作者:Jennifer J. Sun,Serim Ryou,Roni Goldshmid,Brandon Weissbourd,John Dabiri,David J. Anderson,Ann Kennedy,Yisong Yue,Pietro Perona 机构:California Institute of Technology, Northwestern University

【4】 Self-Supervised Image-to-Text and Text-to-Image Synthesis 标题:自监督图文和文图合成 链接:https://arxiv.org/abs/2112.04928

作者:Anindya Sundar Das,Sriparna Saha 机构:Department of Computer Science and Engineering, Indian Institute of Technology Patna, India 备注:None 摘要:全面理解视觉和语言及其相互关系,对于认识这些模态之间潜在的相似与差异、并学习更通用且有意义的表示至关重要。近年来,大多数与文本到图像合成和图像到文本生成相关的工作都集中在用有监督的生成式深度架构来解决这些问题,很少有人关注跨模态嵌入空间之间相似性的学习。本文提出了一种新的基于自监督深度学习的方法来学习跨模态嵌入空间,用于图像到文本和文本到图像的生成。在我们的方法中,我们首先使用基于StackGAN的自编码器模型获得图像的稠密向量表示,并使用基于LSTM的文本自编码器获得句子级的稠密向量表示;然后利用基于GAN和最大均值差异(MMD)的生成网络,学习从一种模态的嵌入空间到另一种模态嵌入空间的映射。我们还从定性和定量两方面证明,我们的模型既能从图像数据生成文本描述,也能从文本数据生成图像。 摘要:A comprehensive understanding of vision and language and their interrelation are crucial to realize the underlying similarities and differences between these modalities and to learn more generalized, meaningful representations. In recent years, most of the works related to Text-to-Image synthesis and Image-to-Text generation focused on supervised generative deep architectures to solve the problems, where very little interest was placed on learning the similarities between the embedding spaces across modalities. In this paper, we propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces; for both image to text and text to image generations. In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder model and also dense vector representations on the sentence-level utilizing an LSTM-based text-autoencoder; then we study the mapping from the embedding space of one modality to the embedding space of the other modality utilizing GAN and maximum mean discrepancy based generative networks. We also demonstrate, both qualitatively and quantitatively, that our model learns to generate textual descriptions from image data as well as images from textual data.
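For readers unfamiliar with the maximum mean discrepancy used to align the two embedding spaces, a minimal biased RBF-kernel estimate is shown below; the kernel choice and bandwidth are assumptions for illustration (the population MMD is zero iff the two distributions match, for a characteristic kernel).

```python
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between samples x (n, d) and y (m, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

img_emb, txt_emb = torch.randn(64, 128), torch.randn(64, 128)
loss = mmd_rbf(img_emb, txt_emb)  # minimized to pull the two spaces together
```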

【5】 SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations 标题:SimIPU:简单二维图像和三维点云空间感知视觉表示的无监督预训练 链接:https://arxiv.org/abs/2112.04680

作者:Zhenyu Li,Zehui Chen,Ang Li,Liangji Fang,Qinhong Jiang,Xianming Liu,Junjun Jiang,Bolei Zhou,Hang Zhao 机构: Harbin Institute of Technology, University of Science and Technology, SenseTime Research, The Chinese University of Hong Kong, China, IIIS, Tsinghua University 备注:Accepted to 36th AAAI Conference on Artificial Intelligence (AAAI 2022)

时序|行为识别|姿态|视频|运动估计(1篇)

【1】 Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search 标题:Auto-X3D:通过更细粒度的神经架构搜索实现超高效的视频理解 链接:https://arxiv.org/abs/2112.04710

作者:Yifan Jiang,Xinyu Gong,Junru Wu,Humphrey Shi,Zhicheng Yan,Zhangyang Wang 机构:University of Texas at Austin, Texas A&M University, University of Oregon, Facebook AI, Picsart AI Research (PAIR) 备注:Accepted by WACV'2022 摘要:高效的视频体系结构是在计算资源有限的设备上部署视频识别系统的关键。不幸的是,现有的视频架构通常是计算密集型的,不适合此类应用。最近的X3D工作通过沿多个轴(如空间、时间、宽度和深度)扩展手工设计的图像体系结构,展示了一系列新的高效视频模型。尽管X3D在一个概念上很大的空间中运行,但它一次只搜索一个轴,总共只探索了30个架构,不足以充分探索这个空间。本文绕过现有的二维体系结构,直接在细粒度空间中搜索三维体系结构,其中块类型、滤波器数量、扩展比和注意力块被联合搜索。为了在如此大的空间内高效搜索,我们采用了概率神经架构搜索方法。在Kinetics和Something-Something-V2基准上的评估证实,在相近的FLOPs下,我们的Auto-X3D模型的精度优于现有模型最多1.3%,并在达到相近性能时将计算成本降低最多1.74倍。 摘要:Efficient video architecture is the key to deploying video recognition systems on devices with limited computing resources. Unfortunately, existing video architectures are often computationally intensive and not suitable for such applications. The recent X3D work presents a new family of efficient video models by expanding a hand-crafted image architecture along multiple axes, such as space, time, width, and depth. Although operating in a conceptually large space, X3D searches one axis at a time, and merely explores a small set of 30 architectures in total, which does not sufficiently explore the space. This paper bypasses existing 2D architectures, and directly searches for 3D architectures in a fine-grained space, where block type, filter number, expansion ratio and attention block are jointly searched. A probabilistic neural architecture search method is adopted to efficiently search in such a large space. Evaluations on the Kinetics and Something-Something-V2 benchmarks confirm that our Auto-X3D models outperform existing ones in accuracy by up to 1.3% under similar FLOPs, and reduce the computational cost by up to 1.74x when reaching similar performance.

医学相关(2篇)

【1】 One-dimensional Deep Low-rank and Sparse Network for Accelerated MRI 标题:一维深度低秩稀疏网络在加速磁共振成像中的应用 链接:https://arxiv.org/abs/2112.04721

作者:Zi Wang,Chen Qian,Di Guo,Hongwei Sun,Rushuai Li,Bo Zhao,Xiaobo Qu 备注:16 pages 摘要:深度学习在加速磁共振成像(MRI)中表现出惊人的性能。由于许多磁共振图像或其对应的k空间是二维的,大多数最先进的深度学习重建采用强大的卷积神经网络并执行二维卷积。在这项工作中,我们提出了一种探索一维卷积的新方法,使深层网络更易于训练和泛化。我们进一步将一维卷积集成到所提出的深度网络中,称为一维深度低秩稀疏网络(ODLS),它展开了低秩稀疏重建模型的迭代过程。在活体膝关节和大脑数据集上的大量结果表明,所提出的ODLS非常适合训练对象有限的情况,并且在视觉和定量上都比最先进的方法提供了更好的重建性能。此外,ODLS对不同的欠采样场景以及训练和测试数据之间的一些不匹配也表现出很好的鲁棒性。总之,我们的工作表明,一维深度学习方案在快速磁共振成像中具有内存效率和鲁棒性。 摘要:Deep learning has shown astonishing performance in accelerated magnetic resonance imaging (MRI). Most state-of-the-art deep learning reconstructions adopt the powerful convolutional neural network and perform 2D convolution since many magnetic resonance images or their corresponding k-space are in 2D. In this work, we present a new approach that explores the 1D convolution, making the deep network much easier to be trained and generalized. We further integrate the 1D convolution into the proposed deep network, named as One-dimensional Deep Low-rank and Sparse network (ODLS), which unrolls the iteration procedure of a low-rank and sparse reconstruction model. Extensive results on in vivo knee and brain datasets demonstrate that the proposed ODLS is very suitable for the case of limited training subjects and provides improved reconstruction performance than state-of-the-art methods both visually and quantitatively. Additionally, ODLS also shows nice robustness to different undersampling scenarios and some mismatches between the training and test data. In summary, our work demonstrates that the 1D deep learning scheme is memory-efficient and robust in fast MRI.

【2】 Learn2Reg: comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning 标题:Learn2Reg:深度学习时代综合多任务医学图像配准挑战、数据集与评估 链接:https://arxiv.org/abs/2112.04489

作者:Alessa Hering,Lasse Hansen,Tony C. W. Mok,Albert C. S. Chung,Hanna Siebert,Stephanie Häger,Annkristin Lange,Sven Kuckertz,Stefan Heldmann,Wei Shao,Sulaiman Vesal,Mirabela Rusu,Geoffrey Sonn,Théo Estienne,Maria Vakalopoulou,Luyi Han,Yunzhi Huang,Mikael Brudfors,Yaël Balbastre,Samuel Joutard,Marc Modat,Gal Lifshitz,Dan Raviv,Jinxin Lv,Qiang Li,Vincent Jaouen,Dimitris Visvikis,Constance Fourcade,Mathieu Rubeaux,Wentao Pan,Zhe Xu,Bailiang Jian,Francesca De Benetti,Marek Wodzinski,Niklas Gunnarsson,Huaqi Qiu,Zeju Li,Christoph Großbröhmer,Andrew Hoopes,Ingerid Reinertsen,Yiming Xiao,Bennett Landman,Yuankai Huo,Keelin Murphy,Bram van Ginneken,Adrian Dalca,Mattias P. Heinrich 机构:Universität zu Lübeck 摘要:迄今为止,很少有研究在广泛的临床相关互补任务中全面比较医学图像配准方法。这限制了研究进展在实践中的应用,并妨碍了竞争方法之间的公平基准比较。在过去五年中,许多新的基于学习的方法得到了探索,但哪种优化、架构或度量策略最为合适仍是悬而未决的问题。Learn2Reg涵盖了广泛的解剖结构(大脑、腹部和胸部)、模态(超声、CT、MRI)、人群(患者内和患者间)以及监督水平。我们为3D配准的训练和验证设置了较低的进入门槛,这帮助我们汇集了来自20多个独立团队的65份以上的方法提交结果。我们的互补指标集(包括稳健性、准确性、合理性和速度)使我们能够对当前医学图像配准的最新水平获得独特的见解。对可迁移性、偏差和监督重要性的进一步分析质疑了主要基于深度学习的方法的优越性,并开辟了利用GPU加速传统优化的混合方法这一令人兴奋的新研究方向。 摘要:To date few studies have comprehensively compared medical image registration approaches on a wide range of complementary clinically relevant tasks. This limits the adoption of advances in research into practice and prevents fair benchmarks across competing approaches. Many newer learning-based methods have been explored within the last five years, but the question which optimisation, architectural or metric strategy is ideally suited remains open. Learn2Reg covers a wide range of anatomies (brain, abdomen and thorax), modalities (ultrasound, CT, MRI), populations (intra- and inter-patient) and levels of supervision. We established a lower entry barrier for training and validation of 3D registration, which helped us compile results of over 65 individual method submissions from more than 20 unique teams. Our complementary set of metrics, including robustness, accuracy, plausibility and speed, enables unique insight into the current state-of-the-art of medical image registration. Further analyses into transferability, bias and importance of supervision question the superiority of primarily deep learning based approaches and open exciting new research directions into hybrid methods that leverage GPU-accelerated conventional optimisation.

GAN|对抗|攻击|生成相关(4篇)

【1】 Multimodal Conditional Image Synthesis with Product-of-Experts GANs 标题:基于专家乘积GAN的多模态条件图像合成 链接:https://arxiv.org/abs/2112.05130

作者:Xun Huang,Arun Mallya,Ting-Chun Wang,Ming-Yu Liu 机构:NVIDIA 摘要:现有的条件图像合成框架基于单一模态的用户输入生成图像,例如文本、分割图、草图或风格参考。它们通常无法利用可用的多模态用户输入,这降低了其实用性。为了解决这一局限,我们提出了专家乘积生成对抗网络(PoE-GAN)框架,该框架可以合成以多种输入模态或其任意子集(甚至空集)为条件的图像。PoE-GAN由专家乘积生成器和多模态多尺度投影鉴别器组成。通过我们精心设计的训练方案,PoE-GAN学会了合成兼具高质量与多样性的图像。除了提升多模态条件图像合成的技术水平外,PoE-GAN在单模态设置下测试时也优于现有最佳的单模态条件图像合成方法。该项目的网站位于https://deepimagination.github.io/PoE-GAN . 摘要:Existing conditional image synthesis frameworks generate images based on user inputs in a single modality, such as text, segmentation, sketch, or style reference. They are often unable to leverage multimodal user inputs when available, which reduces their practicality. To address this limitation, we propose the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set. PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting. The project website is available at https://deepimagination.github.io/PoE-GAN .
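The 'product of experts' name suggests fusing per-modality distributions by multiplication. For Gaussian experts this has a closed form (precisions add; means are precision-weighted), which neatly handles any subset of modalities, including the empty set once a prior expert is included. A small sketch under that Gaussian assumption (not necessarily how PoE-GAN parameterizes it):

```python
import torch

def gaussian_poe(mus, logvars):
    """Fuse Gaussian experts N(mu_i, var_i): the product is Gaussian with
    precision = sum of precisions and a precision-weighted mean."""
    precision = sum(torch.exp(-lv) for lv in logvars)
    mu = sum(m * torch.exp(-lv) for m, lv in zip(mus, logvars)) / precision
    return mu, -torch.log(precision)  # (mean, logvar) of the fused expert

prior = (torch.zeros(1, 16), torch.zeros(1, 16))  # N(0, I): the 'empty set' case
text = (torch.randn(1, 16), torch.zeros(1, 16))   # hypothetical text expert
mu, logvar = gaussian_poe([prior[0], text[0]], [prior[1], text[1]])
```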

【2】 Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior 标题:通过学习的交通先验生成有用的易发生事故的驾驶场景 链接:https://arxiv.org/abs/2112.05077

作者:Davis Rempe,Jonah Philion,Leonidas J. Guibas,Sanja Fidler,Or Litany 机构:Stanford University, NVIDIA, University of Toronto, Vector Institute, nv-tlabs.github.io/STRIVE 摘要:评估和改进自动驾驶车辆的规划需要可扩展地生成长尾交通场景。为了有用,这些场景必须真实且具有挑战性,但并非无法安全通过。在这项工作中,我们介绍了STRIVE,一种自动生成挑战性场景的方法,这些场景会导致给定的规划器产生碰撞等不良行为。为了保持场景的合理性,其关键思想是以基于图的条件VAE的形式利用学习到的交通运动模型。场景生成被表述为在该交通模型的潜在空间中进行的优化:通过扰动初始真实场景来产生与给定规划器发生碰撞的轨迹。随后的优化用于找到场景的“解决方案”,确保它有助于改进给定的规划器。进一步的分析按碰撞类型对生成的场景进行聚类。我们攻击了两个规划器,并证明在这两种情况下,STRIVE都成功地生成了真实且具有挑战性的场景。此外,我们还“闭合回路”,使用这些场景优化基于规则的规划器的超参数。 摘要:Evaluating and improving planning for autonomous vehicles requires scalable generation of long-tail traffic scenarios. To be useful, these scenarios must be realistic and challenging, but not impossible to drive through safely. In this work, we introduce STRIVE, a method to automatically generate challenging scenarios that cause a given planner to produce undesirable behavior, like collisions. To maintain scenario plausibility, the key idea is to leverage a learned model of traffic motion in the form of a graph-based conditional VAE. Scenario generation is formulated as an optimization in the latent space of this traffic model, effected by perturbing an initial real-world scene to produce trajectories that collide with a given planner. A subsequent optimization is used to find a "solution" to the scenario, ensuring it is useful to improve the given planner. Further analysis clusters generated scenarios based on collision type. We attack two planners and show that STRIVE successfully generates realistic, challenging scenarios in both cases. We additionally "close the loop" and use these scenarios to optimize hyperparameters of a rule-based planner.

【3】 Mutual Adversarial Training: Learning together is better than going alone 标题:相互对抗训练:一起学习好过单打独斗 链接:https://arxiv.org/abs/2112.05005

作者:Jiang Liu,Chun Pong Lau,Hossein Souri,Soheil Feizi,Rama Chellappa 机构:Johns Hopkins University, Baltimore, Maryland, USA, University of Maryland, College Park, Maryland, USA 备注:Under submission 摘要:最近的研究表明,对对抗攻击的鲁棒性可以跨网络迁移。换句话说,我们可以借助强教师模型使弱模型更加健壮。我们问:模型能否不再向静态教师学习,而是“一起学习”、“互相教导”,以实现更好的健壮性?在本文中,我们研究了模型之间的交互如何通过知识蒸馏影响鲁棒性。我们提出了相互对抗训练(MAT),其中多个模型一起训练,并共享对抗样本的知识,以提高鲁棒性。MAT允许健壮模型探索更大的对抗样本空间,并找到更健壮的特征空间和决策边界。通过在CIFAR-10和CIFAR-100上的大量实验,我们证明MAT可以有效地提高模型的鲁棒性,在白盒攻击下的性能优于最先进的方法,在PGD-100攻击下为普通对抗训练(AT)带来约8%的准确度增益。此外,我们还表明,MAT可以缓解不同扰动类型之间的鲁棒性权衡,在$l_\infty$、$l_2$和$l_1$攻击的并集下,为AT基线带来高达13.1%的精度增益。这些结果表明了该方法的优越性,并证明协作学习是设计健壮模型的有效策略。 摘要:Recent studies have shown that robustness to adversarial attacks can be transferred across networks. In other words, we can make a weak model more robust with the help of a strong teacher model. We ask if instead of learning from a static teacher, can models "learn together" and "teach each other" to achieve better robustness? In this paper, we study how interactions among models affect robustness via knowledge distillation. We propose mutual adversarial training (MAT), in which multiple models are trained together and share the knowledge of adversarial examples to achieve improved robustness. MAT allows robust models to explore a larger space of adversarial samples, and find more robust feature spaces and decision boundaries. Through extensive experiments on CIFAR-10 and CIFAR-100, we demonstrate that MAT can effectively improve model robustness and outperform state-of-the-art methods under white-box attacks, bringing $\sim$8% accuracy gain to vanilla adversarial training (AT) under PGD-100 attacks. In addition, we show that MAT can also mitigate the robustness trade-off among different perturbation types, bringing as much as 13.1% accuracy gain to AT baselines against the union of $l_\infty$, $l_2$ and $l_1$ attacks. These results show the superiority of the proposed method and demonstrate that collaborative learning is an effective strategy for designing robust models.
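A rough sketch of what 'sharing adversarial examples' between two models could look like in one training step; the PGD helper, the loss weights, and the distillation term are assumptions standing in for the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard l_inf PGD, used here as a stand-in example generator."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        grad, = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)
    return x_adv.clamp(0, 1).detach()

def mat_step(model_a, model_b, x, y, beta=1.0):
    """Each model trains on both models' adversarial examples and distills
    toward the other's predictions on the shared examples."""
    xa, xb = pgd_attack(model_a, x, y), pgd_attack(model_b, x, y)
    loss_a = (F.cross_entropy(model_a(xa), y) + F.cross_entropy(model_a(xb), y)
              + beta * F.kl_div(F.log_softmax(model_a(xb), 1),
                                F.softmax(model_b(xb), 1).detach(),
                                reduction='batchmean'))
    loss_b = (F.cross_entropy(model_b(xb), y) + F.cross_entropy(model_b(xa), y)
              + beta * F.kl_div(F.log_softmax(model_b(xa), 1),
                                F.softmax(model_a(xa), 1).detach(),
                                reduction='batchmean'))
    return loss_a, loss_b
```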

【4】 InvGAN: Invertable GANs 标题:InvGAN:可逆GANS 链接:https://arxiv.org/abs/2112.04598

作者:Partha Ghosh,Dominik Zietlow,Michael J. Black,Larry S. Davis,Xiaochen Hu 机构:MPI for Intelligent Systems, Tübingen 摘要:照片级真实感图像生成、语义编辑和表示学习是高分辨率生成模型的众多潜在应用中的几个。GAN的最新进展已将其确立为此类任务的绝佳选择。然而,由于它们不提供推理模型,因此无法使用GAN潜在空间对真实图像进行图像编辑或分类等下游任务。尽管在训练推理模型或设计迭代方法来反演预训练生成器方面已有大量工作,但以前的方法是针对特定数据集(如人脸图像)和特定体系结构(如StyleGAN)的,将它们扩展到新的数据集或体系结构并非易事。我们提出了一个与体系结构和数据集无关的通用框架。我们的主要见解是:通过将推理模型和生成模型一起训练,可以使它们相互适应,并收敛到质量更好的模型。我们的InvGAN(Invertable GAN的缩写)成功地将真实图像嵌入到高质量生成模型的潜在空间中。这使我们能够执行图像修复、合并、插值和在线数据增强,我们通过大量的定性和定量实验证明了这一点。 摘要:Generation of photo-realistic images, semantic editing and representation learning are a few of many potential applications of high resolution generative models. Recent progress in GANs has established them as an excellent choice for such tasks. However, since they do not provide an inference model, image editing or downstream tasks such as classification can not be done on real images using the GAN latent space. Despite numerous efforts to train an inference model or design an iterative method to invert a pre-trained generator, previous methods are dataset (e.g. human face images) and architecture (e.g. StyleGAN) specific. These methods are nontrivial to extend to novel datasets or architectures. We propose a general framework that is agnostic to architecture and datasets. Our key insight is that, by training the inference and the generative model together, we allow them to adapt to each other and to converge to a better quality model. Our InvGAN, short for Invertable GAN, successfully embeds real images to the latent space of a high quality generative model. This allows us to perform image inpainting, merging, interpolation and online data augmentation. We demonstrate this with extensive qualitative and quantitative experiments.

自动驾驶|车辆|车道检测等(2篇)

【1】 A Shared Representation for Photorealistic Driving Simulators 标题:真实感驾驶模拟器的共享表示 链接:https://arxiv.org/abs/2112.05134

作者:Saeed Saadatnejad,Siyuan Li,Taylor Mordan,Alexandre Alahi 机构:All researchers are with the VITA laboratory of EPFL 备注:Accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS) 摘要:在训练和评估自动驾驶车辆时,强大的模拟器可大大减少对真实世界测试的需求。随着条件生成对抗网络(cGAN)的发展,数据驱动模拟器蓬勃发展,可提供高保真图像。主要的挑战是在遵循给定约束的情况下合成照片级真实感图像。在这项工作中,我们建议通过重新思考鉴别器结构来提高生成图像的质量。重点是给定语义输入(如场景分割图或人体姿态)生成图像的一类问题。我们在成功的cGAN模型的基础上提出了一种新的语义感知鉴别器,它可以更好地指导生成器。我们的目标是学习一个共享的潜在表示,该表示编码足够的信息,以便联合进行语义分割、内容重建以及由粗到细的对抗性推理。所取得的改进是通用的,并且足够简单,可以应用于任何条件图像合成的体系结构。我们在三个不同数据集的场景、建筑和人体合成任务中展示了我们方法的优势。代码见 https://github.com/vita-epfl/SemDisc。 摘要:A powerful simulator highly decreases the need for real-world tests when training and evaluating autonomous vehicles. Data-driven simulators flourished with the recent advancement of conditional Generative Adversarial Networks (cGANs), providing high-fidelity images. The main challenge is synthesizing photorealistic images while following given constraints. In this work, we propose to improve the quality of generated images by rethinking the discriminator architecture. The focus is on the class of problems where images are generated given semantic inputs, such as scene segmentation maps or human body poses. We build on successful cGAN models to propose a new semantically-aware discriminator that better guides the generator. We aim to learn a shared latent representation that encodes enough information to jointly do semantic segmentation, content reconstruction, along with a coarse-to-fine grained adversarial reasoning. The achieved improvements are generic and simple enough to be applied to any architecture of conditional image synthesis. We demonstrate the strength of our method on the scene, building, and human synthesis tasks across three different datasets. The code is available at https://github.com/vita-epfl/SemDisc.

【2】 Does Redundancy in AI Perception Systems Help to Test for Super-Human Automated Driving Performance? 标题:人工智能感知系统中的冗余是否有助于测试超人自动驾驶性能? 链接:https://arxiv.org/abs/2112.04758

作者:Hanno Gottschalk,Matthias Rottmann,Maida Saltagic 机构:University of Wuppertal (equal contribution) 摘要:虽然自动驾驶常以优于人类驾驶的性能作为宣传,但本文指出,几乎不可能在系统层面提供直接的统计证据来证明事实确实如此:所需的标记数据量将超出当今技术和经济能力的规模。因此,一种常用的策略是使用冗余,同时证明各子系统具有足够的性能。众所周知,该策略在子系统独立运行(即错误的发生在统计意义上相互独立)的情况下尤其有效。在这里,我们给出了一些初步考虑和实验证据,表明这种策略并非没有代价:完成相同计算机视觉任务的神经网络,至少在某些情况下,其错误会相关地发生。即使训练数据、体系结构和训练过程保持分离,或者使用特殊的损失函数训练独立性,这一结论依然成立。在我们的实验中,使用来自不同传感器的数据(通过3D MNIST数据集的多达五个2D投影实现)能更有效地降低相关性,但其程度尚不足以实现冗余且统计独立的子系统在减少测试数据方面的潜力。 摘要:While automated driving is often advertised with better-than-human driving performance, this work reviews that it is nearly impossible to provide direct statistical evidence on the system level that this is actually the case. The amount of labeled data needed would exceed dimensions of present day technical and economical capabilities. A commonly used strategy therefore is the use of redundancy along with the proof of sufficient subsystems' performances. As it is known, this strategy is efficient especially for the case of subsystems operating independently, i.e. the occurrence of errors is independent in a statistical sense. Here, we give some first considerations and experimental evidence that this strategy is not a free ride as the errors of neural networks fulfilling the same computer vision task, at least for some cases, show correlated occurrences of errors. This remains true, if training data, architecture, and training are kept separate or independence is trained using special loss functions. Using data from different sensors (realized by up to five 2D projections of the 3D MNIST data set) in our experiments reduces correlations more efficiently, however not to an extent that realizes the potential reduction of testing data that could be obtained for redundant and statistically independent subsystems.
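The statistical point is straightforward to probe on any pair of classifiers: treat each model's per-sample error as a binary variable and measure the correlation; independence, which the redundancy argument needs, would imply a coefficient near zero. A minimal sketch (undefined if a model makes no errors at all):

```python
import numpy as np

def error_correlation(pred_a: np.ndarray, pred_b: np.ndarray,
                      labels: np.ndarray) -> float:
    """Pearson (phi) correlation between two models' error indicators."""
    err_a = (pred_a != labels).astype(float)
    err_b = (pred_b != labels).astype(float)
    return float(np.corrcoef(err_a, err_b)[0, 1])
```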

NAS模型搜索(1篇)

【1】 Learning with Nested Scene Modeling and Cooperative Architecture Search for Low-Light Vision 标题:基于嵌套场景建模和协同架构搜索的微光视觉学习 链接:https://arxiv.org/abs/2112.04719

作者:Risheng Liu,Long Ma,Tengyu Ma,Xin Fan,Zhongxuan Luo 机构:School of Software Technology, Dalian University of Technology 备注:Submitted to IEEE TPAMI. Code is available at this https URL 摘要:从微光场景捕获的图像通常会出现严重的退化,包括低可见度、色偏和强噪声等。这些因素不仅影响图像质量,还会降低下游微光视觉(LLV)应用的性能。为了提高微光图像的视觉质量,人们提出了多种深度学习方法。然而,这些方法大多依赖大量的架构工程来获得合适的微光模型,并且常常带来高计算负担。此外,将这些增强技术扩展到处理其他LLV任务仍然具有挑战性。为了部分解决上述问题,我们建立了受Retinex启发的展开式架构搜索(RUAS)这一通用学习框架,它不仅可以解决微光增强任务,还可以灵活处理其他更具挑战性的下游视觉应用。具体来说,我们首先建立嵌套优化公式以及展开策略,以探索一系列LLV任务的基本原理。此外,我们构建了一种可微策略,为RUAS协同搜索特定场景和任务的架构。最后,我们演示了如何将RUAS应用于低层和高层LLV应用(例如增强、检测和分割)。大量实验验证了RUAS的灵活性、有效性和效率。 摘要:Images captured from low-light scenes often suffer from severe degradations, including low visibility, color cast and intensive noises, etc. These factors not only affect image qualities, but also degrade the performance of downstream Low-Light Vision (LLV) applications. A variety of deep learning methods have been proposed to enhance the visual quality of low-light images. However, these approaches mostly rely on significant architecture engineering to obtain proper low-light models and often suffer from high computational burden. Furthermore, it is still challenging to extend these enhancement techniques to handle other LLVs. To partially address above issues, we establish Retinex-inspired Unrolling with Architecture Search (RUAS), a general learning framework, which not only can address low-light enhancement task, but also has the flexibility to handle other more challenging downstream vision applications. Specifically, we first establish a nested optimization formulation, together with an unrolling strategy, to explore underlying principles of a series of LLV tasks. Furthermore, we construct a differentiable strategy to cooperatively search specific scene and task architectures for RUAS. Last but not least, we demonstrate how to apply RUAS for both low- and high-level LLV applications (e.g., enhancement, detection and segmentation). Extensive experiments verify the flexibility, effectiveness, and efficiency of RUAS.

OCR|文本相关(2篇)

【1】 HairCLIP: Design Your Hair by Text and Reference Image 标题:HairCLIP:通过文本和参考图像设计你的发型 链接:https://arxiv.org/abs/2112.05142

作者:Tianyi Wei,Dongdong Chen,Wenbo Zhou,Jing Liao,Zhentao Tan,Lu Yuan,Weiming Zhang,Nenghai Yu 机构:University of Science and Technology of China, Microsoft Cloud AI, City University of Hong Kong, China

【2】 CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields 标题:CLIP-NeRF:文本和图像驱动的神经辐射场操作 链接:https://arxiv.org/abs/2112.05139

作者:Can Wang,Menglei Chai,Mingming He,Dongdong Chen,Jing Liao 机构:City University of Hong Kong, China, Snap Inc., USC Institute for Creative Technologies, Microsoft Cloud AI

Attention注意力(2篇)

【1】 Locally Shifted Attention With Early Global Integration 标题:具有早期全局整合的局部移位注意力 链接:https://arxiv.org/abs/2112.05080

作者:Shelly Sheynin,Sagie Benaim,Adam Polyak,Lior Wolf 机构:Tel Aviv University, Facebook AI Research

【2】 Trajectory-Constrained Deep Latent Visual Attention for Improved Local Planning in Presence of Heterogeneous Terrain 标题:异质地形下改进局部规划的轨迹约束深度潜在视觉注意力 链接:https://arxiv.org/abs/2112.04684

作者:Stefan Wapnick,Travis Manderson,David Meger,Gregory Dudek 机构: School of Computer Science 备注:Published in International Conference on Intelligent Robots and Systems (IROS) 2021 proceedings. Project website: this https URL 摘要:我们提出了一种奖励预测、基于模型的深度学习方法,该方法具有轨迹约束的视觉注意力,可用于无地图的局部视觉导航任务。我们的方法学习将视觉注意力放置在潜在图像空间中沿车辆控制动作所引起轨迹的位置,以提高规划过程中的预测精度。注意力模型由特定任务损失和额外的轨迹约束损失联合优化,在保持适应性的同时鼓励正则化结构,以提高泛化能力和可靠性。重要的是,视觉注意力被应用于潜在特征图空间而不是原始图像空间,以促进高效规划。我们在以下视觉导航任务中验证了我们的模型:在越野环境中规划低颠簸、无碰撞的轨迹,以及在湿滑地形下使用锁止差速器爬坡。实验涉及随机程序化生成的仿真环境和真实环境。我们发现,与无注意力和自注意力的替代方案相比,我们的方法提高了泛化能力和学习效率。 摘要:We present a reward-predictive, model-based deep learning method featuring trajectory-constrained visual attention for use in mapless, local visual navigation tasks. Our method learns to place visual attention at locations in latent image space which follow trajectories caused by vehicle control actions to enhance predictive accuracy during planning. The attention model is jointly optimized by the task-specific loss and an additional trajectory-constraint loss, allowing adaptability yet encouraging a regularized structure for improved generalization and reliability. Importantly, visual attention is applied in latent feature map space instead of raw image space to promote efficient planning. We validated our model in visual navigation tasks of planning low turbulence, collision-free trajectories in off-road settings and hill climbing with locking differentials in the presence of slippery terrain. Experiments involved randomized procedural generated simulation and real-world environments. We found our method improved generalization and learning efficiency when compared to no-attention and self-attention alternatives.

跟踪(1篇)

【1】 Enhancing Food Intake Tracking in Long-Term Care with Automated Food Imaging and Nutrient Intake Tracking (AFINI-T) Technology 标题:利用自动食物成像和营养素摄取跟踪(AFINI-T)技术加强长期护理中的食物摄入量跟踪 链接:https://arxiv.org/abs/2112.04608

作者:Kaylen J. Pfisterer,Robert Amelard,Jennifer Boger,Audrey G. Chung,Heather H. Keller,Alexander Wong 机构:University of Waterloo, Systems Design Engineering, Waterloo, ON, Canada, Waterloo AI Institute, Waterloo, ON, Canada, Schlegel-UW Research Institute for Aging, Waterloo, ON, Canada 备注:Key words: Automatic segmentation, convolutional neural network, deep learning, food intake tracking, volume estimation, malnutrition prevention, long-term care, hospital 摘要:半数长期护理(LTC)居民营养不良,导致住院率、死亡率和发病率上升,生活质量下降。目前的跟踪方法主观且耗时。本文介绍了为LTC设计的自动食物成像和营养素摄入跟踪(AFINI-T)技术。我们提出了一种用于食品分类的新型卷积自编码器,在增强的UNIMIB2016数据集上训练,并在我们模拟的LTC食物摄入数据集上测试(12种膳食场景;每种最多15个类别;top-1分类准确率:88.9%;平均摄入量误差:-0.4 mL $\pm$ 36.7 mL)。按体积估算的营养素摄入量与按质量估算的营养素摄入量($r^2$为0.92至0.99)呈强线性相关,且两种方法之间一致性良好($\sigma$=-2.7至-0.01;零均落在各一致性界限内)。AFINI-T方法是一种由深度学习驱动的计算营养素感知系统,可为更准确、客观地跟踪LTC居民的食物摄入量提供一种新手段,以支持营养不良预防策略。 摘要:Half of long-term care (LTC) residents are malnourished, increasing hospitalization, mortality, and morbidity, with lower quality of life. Current tracking methods are subjective and time consuming. This paper presents the automated food imaging and nutrient intake tracking (AFINI-T) technology designed for LTC. We propose a novel convolutional autoencoder for food classification, trained on an augmented UNIMIB2016 dataset and tested on our simulated LTC food intake dataset (12 meal scenarios; up to 15 classes each; top-1 classification accuracy: 88.9%; mean intake error: -0.4 mL $\pm$ 36.7 mL). Nutrient intake estimation by volume was strongly linearly correlated with nutrient estimates from mass ($r^2$ 0.92 to 0.99) with good agreement between methods ($\sigma$ = -2.7 to -0.01; zero within each of the limits of agreement). The AFINI-T approach is a deep-learning powered computational nutrient sensing system that may provide a novel means for more accurately and objectively tracking LTC resident food intake to support malnutrition prevention strategies.

表征学习(1篇)

【1】 Constrained Mean Shift Using Distant Yet Related Neighbors for Representation Learning 标题:利用遥远但相关的邻居进行表征学习的约束均值漂移 链接:https://arxiv.org/abs/2112.04607

作者:Ajinkya Tejankar,Soroush Abbasi Koohpayegani,KL Navaneet,Kossar Pourahmadi,Akshayvarun Subramanya,Hamed Pirsiavash 机构:University of Maryland, Baltimore County, University of California, Davis 备注:Code is available at this https URL arXiv admin note: text overlap with arXiv:2110.10309 摘要:我们关注自监督、监督或半监督设置下的表征学习。先前将均值漂移思想应用于自监督学习的工作MSF推广了BYOL的思想:查询图像不仅被拉近其另一增强视图,还被拉近该增强视图的最近邻(NNs)。我们相信,选择在语义上仍与查询相关的较远邻居可以使学习受益。因此,我们建议通过约束最近邻的搜索空间来推广MSF算法。我们表明:当约束要求使用图像的不同增强视图时,我们的方法在自监督设置中优于MSF;当约束确保最近邻与查询具有相同的伪标签时,我们的方法在半监督设置中以更少的训练资源优于PAWS。 摘要:We are interested in representation learning in self-supervised, supervised, or semi-supervised settings. The prior work on applying the mean-shift idea for self-supervised learning, MSF, generalizes the BYOL idea by pulling a query image to not only be closer to its other augmentation, but also to the nearest neighbors (NNs) of its other augmentation. We believe the learning can benefit from choosing far away neighbors that are still semantically related to the query. Hence, we propose to generalize MSF algorithm by constraining the search space for nearest neighbors. We show that our method outperforms MSF in SSL setting when the constraint utilizes a different augmentation of an image, and outperforms PAWS in semi-supervised setting with less training resources when the constraint ensures the NNs have the same pseudo-label as the query.
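A sketch of the constrained nearest-neighbor step in the semi-supervised variant, written for a single query for clarity; the memory bank, pseudo-labels, and k below are assumptions standing in for the paper's exact machinery. Neighbors are searched only among bank entries sharing the query's pseudo-label, and the query is then pulled toward their mean.

```python
import torch
import torch.nn.functional as F

def constrained_meanshift_target(query, bank, bank_labels, query_label, k=5):
    """Mean of the k nearest bank features that satisfy the constraint
    (same pseudo-label as the query); assumes >= k such entries exist."""
    sims = F.normalize(query, dim=0) @ F.normalize(bank, dim=1).t()  # (N,)
    sims = sims.masked_fill(bank_labels != query_label, float('-inf'))
    idx = sims.topk(k).indices
    return bank[idx].mean(dim=0)  # target the query embedding is pulled toward

bank = torch.randn(4096, 128)                # memory bank of past embeddings
bank_labels = torch.randint(0, 10, (4096,))  # pseudo-labels for the bank
target = constrained_meanshift_target(torch.randn(128), bank, bank_labels, 3)
```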

点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】 PRA-Net: Point Relation-Aware Network for 3D Point Cloud Analysis 标题:PRA-NET:用于三维点云分析的点关系感知网络 链接:https://arxiv.org/abs/2112.04903

作者:Silin Cheng,Xiwu Chen,Xinwei He,Zhe Liu,Xiang Bai 备注:None 摘要:学习区域内上下文和区域间关系是加强点云分析特征表示的两种有效策略。然而,在现有的点云表示方法中,统一这两种策略并没有得到充分的重视。为此,我们提出了一种新的点关系感知网络(PRA-Net)框架,它由一个区域内结构学习(ISL)模块和一个区域间关系学习(IRL)模块组成。ISL模块可以动态地将局部结构信息集成到点特征中,而IRL模块通过可微区域划分方案和基于代表点的策略自适应地、高效地捕获区域间的关系。在包括形状分类、关键点估计和零件分割的多个3D基准上进行的大量实验验证了PRA-Net的有效性和泛化能力。代码将发布于 https://github.com/XiwuChen/PRA-Net 。 摘要:Learning intra-region contexts and inter-region relations are two effective strategies to strengthen feature representations for point cloud analysis. However, unifying the two strategies for point cloud representation is not fully emphasized in existing methods. To this end, we propose a novel framework named Point Relation-Aware Network (PRA-Net), which is composed of an Intra-region Structure Learning (ISL) module and an Inter-region Relation Learning (IRL) module. The ISL module can dynamically integrate the local structural information into the point features, while the IRL module captures inter-region relations adaptively and efficiently via a differentiable region partition scheme and a representative point-based strategy. Extensive experiments on several 3D benchmarks covering shape classification, keypoint estimation, and part segmentation have verified the effectiveness and the generalization ability of PRA-Net. Code will be available at https://github.com/XiwuChen/PRA-Net .

其他神经网络|深度学习|模型|建模(6篇)

【1】 Plenoxels: Radiance Fields without Neural Networks 标题:Plenoxels:无神经网络的辐射场 链接:https://arxiv.org/abs/2112.05131

作者:Alex Yu,Sara Fridovich-Keil,Matthew Tancik,Qinhong Chen,Benjamin Recht,Angjoo Kanazawa 机构:UC Berkeley 备注:For video and code, please see this https URL 摘要:我们介绍了Plenoxels(全光体素),一个用于照片级真实感视图合成的系统。Plenoxels将场景表示为带球谐系数的稀疏三维网格。该表示可以通过梯度方法和正则化从标定图像中优化,而无需任何神经组件。在标准基准任务上,Plenoxels的优化速度比神经辐射场快两个数量级,且视觉质量没有损失。 摘要:We introduce Plenoxels (plenoptic voxels), a system for photorealistic view synthesis. Plenoxels represent a scene as a sparse 3D grid with spherical harmonics. This representation can be optimized from calibrated images via gradient methods and regularization without any neural components. On standard benchmark tasks, Plenoxels are optimized two orders of magnitude faster than Neural Radiance Fields with no loss in visual quality.
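The reason such a grid can be optimized directly with gradient methods is that querying it is just trilinear interpolation of the stored per-voxel coefficients, which is differentiable with respect to the grid. A small illustrative sketch (grid layout and coordinate convention are assumptions; points must lie strictly inside the grid):

```python
import torch

def trilerp(grid: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """grid: (X, Y, Z, C) voxel coefficients; pts: (N, 3) voxel coordinates.
    Returns (N, C) interpolated values, differentiable w.r.t. grid."""
    i = pts.floor().long()
    f = pts - i.float()
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[:, 0] if dx else 1 - f[:, 0])
                     * (f[:, 1] if dy else 1 - f[:, 1])
                     * (f[:, 2] if dz else 1 - f[:, 2]))
                out = out + w[:, None] * grid[i[:, 0] + dx, i[:, 1] + dy, i[:, 2] + dz]
    return out
```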

【2】 Learning Personal Representations from fMRI by Predicting Neurofeedback Performance 标题:通过预测神经反馈性能从fMRI学习个人表征 链接:https://arxiv.org/abs/2112.04902

作者:Jhonathan Osin,Lior Wolf,Guy Gurevitch,Jackob Nimrod Keynan,Tom Fruchtman-Steinbok,Ayelet Or-Borichev,Shira Reznik Balter,Talma Hendler 机构: School of Computer Science, Tel Aviv University, Sagol Brain Institue, Tel-Aviv Sourasky Medical Center, School of Psychological Sciences, Tel Aviv University, Sagol School of Neuroscience, Tel Aviv University 备注:None 摘要:我们提出了一种深度神经网络方法,用于在功能磁共振成像(fMRI)的指导下,为执行自我神经调节任务的个体学习个人表征。这种神经反馈任务(观察与调节)根据受试者杏仁核信号的下调为受试者提供持续反馈,学习算法关注该区域活动的时间过程。这种表征由一个自监督的递归神经网络学习,该网络根据最近的fMRI帧预测下一个fMRI帧中的杏仁核活动,并以学习到的个体表征为条件。结果表明,个体的表征大大改善了下一帧预测。此外,这种仅从fMRI图像中学习的个人表征,在精神特质的线性预测中表现良好,优于基于临床数据和人格测试进行的预测。我们的代码作为补充材料附后,数据将在获得伦理批准后共享。 摘要:We present a deep neural network method for learning a personal representation for individuals that are performing a self neuromodulation task, guided by functional MRI (fMRI). This neurofeedback task (watch vs. regulate) provides the subjects with a continuous feedback contingent on down regulation of their Amygdala signal and the learning algorithm focuses on this region's time-course of activity. The representation is learned by a self-supervised recurrent neural network, that predicts the Amygdala activity in the next fMRI frame given recent fMRI frames and is conditioned on the learned individual representation. It is shown that the individuals' representation improves the next-frame prediction considerably. Moreover, this personal representation, learned solely from fMRI images, yields good performance in linear prediction of psychiatric traits, which is better than performing such a prediction based on clinical data and personality tests. Our code is attached as supplementary and the data would be shared subject to ethical approvals.

【3】 Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning 标题:模仿Oracle:一种用于类增量学习的初始阶段去相关方法 链接:https://arxiv.org/abs/2112.04731

作者:Yujun Shi,Kuangqi Zhou,Jian Liang,Zihang Jiang,Jiashi Feng,Philip Torr,Song Bai,Vincent Y. F. Tan 机构:National University of Singapore, ByteDance Inc., Institute of Automation, Chinese Academy of Sciences (CAS), University of Oxford 摘要:类增量学习(CIL)旨在以逐阶段的方式学习多类分类器,每个阶段仅提供部分类别的数据。以往的研究主要集中在缓解初始阶段之后各阶段的遗忘。然而,我们发现改进CIL的初始阶段也是一个有希望的方向。具体而言,我们的实验表明,直接鼓励CIL学习者在初始阶段输出与在所有类别上联合训练的模型相似的表示,可以极大地提高CIL的性能。受此启发,我们研究了朴素训练的初始阶段模型与oracle模型之间的差异。具体来说,由于这两个模型之间的一个主要差异是训练类别的数量,我们研究了这种差异如何影响模型表示。我们发现,训练类别较少时,每个类别的数据表示位于一个狭长的区域;训练类别较多时,每个类别的表示分布更均匀。受这一观察的启发,我们提出了类级去相关(CwD)方法,它能有效地正则化每个类别的表示使其分布更均匀,从而模仿与所有类别联合训练的模型(即oracle模型)。我们的CwD易于实现,并可轻松插入现有方法。在各种基准数据集上的大量实验表明,CwD持续且显著地将现有最先进方法的性能提高约1%到3%。代码将被发布。 摘要:Class Incremental Learning (CIL) aims at learning a multi-class classifier in a phase-by-phase manner, in which only data of a subset of the classes are provided at each phase. Previous works mainly focus on mitigating forgetting in phases after the initial one. However, we find that improving CIL at its initial phase is also a promising direction. Specifically, we experimentally show that directly encouraging CIL Learner at the initial phase to output similar representations as the model jointly trained on all classes can greatly boost the CIL performance. Motivated by this, we study the difference between a naively-trained initial-phase model and the oracle model. Specifically, since one major difference between these two models is the number of training classes, we investigate how such difference affects the model representations. We find that, with fewer training classes, the data representations of each class lie in a long and narrow region; with more training classes, the representations of each class scatter more uniformly. Inspired by this observation, we propose Class-wise Decorrelation (CwD) that effectively regularizes representations of each class to scatter more uniformly, thus mimicking the model jointly trained with all classes (i.e., the oracle model). Our CwD is simple to implement and easy to plug into existing methods. Extensive experiments on various benchmark datasets show that CwD consistently and significantly improves the performance of existing state-of-the-art methods by around 1% to 3%. Code will be released.
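One way to read 'class-wise decorrelation' as a loss (the details below are assumptions, not the paper's exact formulation): penalize the off-diagonal entries of each class's feature correlation matrix, encouraging each class's representations to spread more uniformly.

```python
import torch

def cwd_loss(feats: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Average squared off-diagonal correlation, computed class by class."""
    total, counted = feats.new_zeros(()), 0
    for c in range(num_classes):
        f = feats[labels == c]
        if f.shape[0] < 2:
            continue                              # need >= 2 samples per class
        f = (f - f.mean(0)) / (f.std(0) + 1e-5)   # standardize each dimension
        corr = f.t() @ f / f.shape[0]             # (d, d) correlation matrix
        off = corr - torch.diag(corr.diagonal())
        total = total + off.pow(2).sum() / off.numel()
        counted += 1
    return total / max(counted, 1)
```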

【4】 BACON: Band-limited Coordinate Networks for Multiscale Scene Representation 标题:BACON:多尺度场景表示的带限坐标网络 链接:https://arxiv.org/abs/2112.04645

作者:David B. Lindell,Dave Van Veen,Jeong Joon Park,Gordon Wetzstein 机构:Stanford University 摘要:基于坐标的网络已成为三维表示和场景重建的有力工具。这些网络经过训练,可以将连续输入坐标映射到每个点的信号值。尽管如此,当前的体系结构仍然是黑匣子:它们的光谱特征不容易分析,它们在无监督点的行为也很难预测。此外,这些网络通常被训练成以单个尺度表示信号,因此简单的下采样或上采样会导致伪影。我们介绍了带限坐标网络(BACON),一种具有解析傅里叶谱的网络结构。BACON在无监督点上具有可预测的行为,可以根据所表示信号的频谱特征进行设计,并且可以在没有明确监督的情况下在多个尺度上表示信号。我们用符号距离函数演示了BACON对图像、辐射场和3D场景的多尺度神经表示,并表明它在可解释性和质量方面优于传统的单尺度坐标网络。 摘要:Coordinate-based networks have emerged as a powerful tool for 3D representation and scene reconstruction. These networks are trained to map continuous input coordinates to the value of a signal at each point. Still, current architectures are black boxes: their spectral characteristics cannot be easily analyzed, and their behavior at unsupervised points is difficult to predict. Moreover, these networks are typically trained to represent a signal at a single scale, and so naive downsampling or upsampling results in artifacts. We introduce band-limited coordinate networks (BACON), a network architecture with an analytical Fourier spectrum. BACON has predictable behavior at unsupervised points, can be designed based on the spectral characteristics of the represented signal, and can represent signals at multiple scales without explicit supervision. We demonstrate BACON for multiscale neural representation of images, radiance fields, and 3D scenes using signed distance functions and show that it outperforms conventional single-scale coordinate networks in terms of interpretability and quality.

【5】 Dynamic multi feature-class Gaussian process models 标题:动态多特征类高斯过程模型 链接:https://arxiv.org/abs/2112.04495

作者:Jean-Rassaire Fouefack,Bhushan Borotikar,Marcel Lüthi,Tania S. Douglas,Valérie Burdin,Tinashe E. M. Mutsvangwa 机构:and Tinashe E.M. Mutsvangwa 备注:16

【6】 A novel multi-view deep learning approach for BI-RADS and density assessment of mammograms 标题:一种用于乳腺X线摄影BI-RADS和密度评估的新型多视图深度学习方法 链接:https://arxiv.org/abs/2112.04490

作者:Huyen T. X. Nguyen,Sam B. Tran,Dung B. Nguyen,Hieu H. Pham,Ha Q. Nguyen 机构: VinUniversity 备注:Under review by IEEE EMBC

其他(14篇)

【1】 Neural Radiance Fields for Outdoor Scene Relighting 标题:用于室外场景重照明的神经辐射场 链接:https://arxiv.org/abs/2112.05140

作者:Viktor Rudnev,Mohamed Elgharib,William Smith,Lingjie Liu,Vladislav Golyanik,Christian Theobalt 机构:MPI for Informatics, SIC, University of York 备注:13 pages, 8 figures, 2 tables; project web page: this https URL

【2】 PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning 标题:PTR:基于零件的概念推理、关系推理和物理推理的基准 链接:https://arxiv.org/abs/2112.05136

作者:Yining Hong,Li Yi,Joshua B. Tenenbaum,Antonio Torralba,Chuang Gan 机构:UCLA, Stanford University, MIT BCS, CBMM, CSAIL, MIT CSAIL, MIT-IBM Watson AI Lab 备注:NeurIPS 2021. Project page: this http URL 摘要:人类视觉感知的一个关键能力,是将视觉场景解析为单个物体、并进一步解析为物体部件,从而形成部分-整体层次结构。这种组合结构可以归纳出丰富的语义概念和关系,从而在视觉信号的解释和组织以及视觉感知和推理的泛化中发挥重要作用。然而,现有的视觉推理基准主要关注物体而非部件。由于概念更细粒度、几何关系更丰富、物理更复杂,基于完整部分-整体层次结构的视觉推理比以物体为中心的推理更具挑战性。因此,为了更好地服务于基于部件的概念、关系和物理推理,我们引入了一个新的大规模诊断性视觉推理数据集PTR。PTR包含约70k张RGBD合成图像,带有关于语义实例分割、颜色属性、空间和几何关系以及某些物理属性(如稳定性)的物体级和部件级真值标注。这些图像与700k个机器生成的问题配对,涵盖多种推理类型,使其成为视觉推理模型的良好测试平台。我们在该数据集上考察了几种最先进的视觉推理模型,并观察到在人类可以轻松推断出正确答案的情况下,它们仍然会犯许多令人惊讶的错误。我们相信该数据集将为基于部件的推理开辟新的机会。 摘要:A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures could induce a rich set of semantic concepts and relations, thus playing an important role in the interpretation and organization of visual signals as well as for the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning due to finer-grained concepts, richer geometry relations, and more complex physics. Therefore, to better serve for part-based conceptual, relational and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations regarding semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions covering various types of reasoning types, making them a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning.

【3】 PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures 标题:PixMix:梦境般的图片全面改善安全性度量 链接:https://arxiv.org/abs/2112.05135

作者:Dan Hendrycks,Andy Zou,Mantas Mazeika,Leonard Tang,Dawn Song,Jacob Steinhardt 机构:UC Berkeley, UIUC, Harvard University 备注:Code and models are available at this https URL 摘要:在现实世界的机器学习应用中,可靠和安全的系统必须考虑标准测试集精度之外的性能度量。这些其他目标包括分布外(OOD)鲁棒性、预测一致性、对抗鲁棒性、校准的不确定性估计以及检测异常输入的能力。然而,为实现这些目标而提升性能往往是一种权衡,如今的方法无法在不牺牲其他安全维度性能的情况下做到这一点。例如,对抗训练提高了对抗鲁棒性,但严重降低了分类器的其他性能指标。类似地,强力的数据增强和正则化技术通常会提高OOD鲁棒性,但会损害异常检测,这就提出了一个问题:是否可能对所有现有安全度量实现帕累托改进。为应对这一挑战,我们设计了一种新的数据增强策略,利用分形等图片的天然结构复杂性;它优于众多基线,接近帕累托最优,并全面改善了各项安全度量。 摘要:In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy. These other goals include out-of-distribution (OOD) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. However, improving performance towards these goals is often a balancing act that today's methods cannot achieve without sacrificing performance on other safety axes. For instance, adversarial training improves adversarial robustness but sharply degrades other classifier performance metrics. Similarly, strong data augmentation and regularization techniques often improve OOD robustness but harm anomaly detection, raising the question of whether a Pareto improvement on all existing safety measures is possible. To meet this challenge, we design a new data augmentation strategy utilizing the natural structural complexity of pictures such as fractals, which outperforms numerous baselines, is near Pareto-optimal, and roundly improves safety measures.
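The augmentation itself is easy to picture: repeatedly blend the image with a structurally complex 'mixer' picture such as a fractal, alternating additive and multiplicative mixing. The sketch below captures the flavor only; the round count, blend operations, and constants are assumptions rather than the paper's exact settings.

```python
import numpy as np

def pixmix(image: np.ndarray, mixer: np.ndarray,
           max_rounds: int = 4, beta: float = 3.0) -> np.ndarray:
    """image, mixer: float arrays in [0, 1] with the same shape."""
    out = image.copy()
    for _ in range(np.random.randint(max_rounds + 1)):
        lam = np.random.beta(beta, 1.0)                # weight skewed toward out
        if np.random.rand() < 0.5:
            out = lam * out + (1 - lam) * mixer        # additive blend
        else:
            out = (out ** lam) * (mixer ** (1 - lam))  # geometric blend
    return np.clip(out, 0.0, 1.0)
```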

【4】 IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo 标题:IterMVS:高效多视点立体的迭代概率估计 链接:https://arxiv.org/abs/2112.05126

作者:Fangjinhua Wang,Silvano Galliani,Christoph Vogel,Marc Pollefeys 机构:Department of Computer Science, ETH Zurich, Microsoft Mixed Reality & AI Zurich Lab

【5】 Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation 标题:神经描述符场:用于操作的SE(3)-等变对象表示 链接:https://arxiv.org/abs/2112.05124

作者:Anthony Simeonov,Yilun Du,Andrea Tagliasacchi,Joshua B. Tenenbaum,Alberto Rodriguez,Pulkit Agrawal,Vincent Sitzmann 机构:Massachusetts Institute of Technology, Google Research, University of Toronto 备注:Website: this https URL First two authors contributed equally (order determined by coin flip), last two authors equal advising 摘要:我们提出了神经描述符场(NDF),这是一种对象表示,通过类别级描述符对物体与目标(例如机器人夹爪,或用于悬挂的挂架)之间的点和相对位姿进行编码。我们将这种表示用于物体操作:给定一次任务演示,我们希望在同一类别的新物体实例上重复相同的任务。我们建议通过搜索(借助优化)其描述符与演示中观察到的描述符相匹配的位姿来实现这一目标。NDF通过不依赖专家标注关键点的3D自编码任务,以自监督方式方便地训练。此外,NDF是SE(3)-等变的,保证其性能可泛化到所有可能的3D物体平移和旋转。我们在仿真和真实机器人上演示了从少量(5-10次)演示中学习操作任务。我们的性能可同时泛化到不同的物体实例和6自由度物体位姿,并显著优于最近一个依赖2D描述符的基线。项目网站:https://yilundu.github.io/ndf/. 摘要:We present Neural Descriptor Fields (NDFs), an object representation that encodes both points and relative poses between an object and a target (such as a robot gripper or a rack used for hanging) via category-level descriptors. We employ this representation for object manipulation, where given a task demonstration, we want to repeat the same task on a new object instance from the same category. We propose to achieve this objective by searching (via optimization) for the pose whose descriptor matches that observed in the demonstration. NDFs are conveniently trained in a self-supervised fashion via a 3D auto-encoding task that does not rely on expert-labeled keypoints. Further, NDFs are SE(3)-equivariant, guaranteeing performance that generalizes across all possible 3D object translations and rotations. We demonstrate learning of manipulation tasks from few (5-10) demonstrations both in simulation and on a real robot. Our performance generalizes across both object instances and 6-DoF object poses, and significantly outperforms a recent baseline that relies on 2D descriptors. Project website: https://yilundu.github.io/ndf/.

【6】 Latent Space Explanation by Intervention 标题:通过干预解释潜在空间 链接:https://arxiv.org/abs/2112.04895

作者:Itai Gat,Guy Lorberbom,Idan Schwartz,Tamir Hazan 机构: Technion - Israel Institute of Technology, NetApp 备注:Accepted to AAAI22 摘要:深度神经网络的成功在很大程度上依赖于其编码输入与输出之间复杂关系的能力。虽然该特性有助于很好地拟合训练数据,但它也掩盖了驱动预测的机制。本研究旨在通过一种基于离散变分自编码器、能够改变预测类别的干预机制来揭示隐藏概念。随后,一个解释模型将任一隐藏层的编码信息及其相应的干预后表示可视化。通过评估原始表示与干预后表示之间的差异,可以确定能够改变类别的概念,从而提供可解释性。我们在CelebA上展示了方法的有效性,给出了数据中偏差的多种可视化,并提出了不同的干预措施来揭示和改变偏差。 摘要:The success of deep neural nets heavily relies on their ability to encode complex relations between their input and their output. While this property serves to fit the training data well, it also obscures the mechanism that drives prediction. This study aims to reveal hidden concepts by employing an intervention mechanism that shifts the predicted class based on discrete variational autoencoders. An explanatory model then visualizes the encoded information from any hidden layer and its corresponding intervened representation. By the assessment of differences between the original representation and the intervened representation, one can determine the concepts that can alter the class, hence providing interpretability. We demonstrate the effectiveness of our approach on CelebA, where we show various visualizations for bias in the data and suggest different interventions to reveal and change bias.

【7】 ScaleNet: A Shallow Architecture for Scale Estimation 标题:ScaleNet:一种用于尺度估计的浅层体系结构 链接:https://arxiv.org/abs/2112.04846

作者:Axel Barroso-Laguna,Yurun Tian,Krystian Mikolajczyk 机构:Imperial College London

【8】 A Simple and efficient deep Scanpath Prediction 标题:一种简单有效的深度扫描路径预测方法 链接:https://arxiv.org/abs/2112.04610

作者:Mohamed Amine Kerkouri,Aladine Chetouani 机构:Laboratoire PRISME, Université d'Orléans, Orléans, FRANCE 备注:Electronic Imaging Symposium 2022 (EI 2022) 摘要:视觉扫描路径是人在观察图像时目光游走所经过的注视点序列,对其进行预测有助于对图像的视觉注意进行建模。为此,文献中提出了多种使用复杂深度学习架构和框架的模型。在这里,我们以一种简单的全卷积回归方式,探索使用通用深度学习架构的效率。我们实验了这些模型在两个数据集上预测扫描路径的能力。我们使用不同的度量与其他模型进行了比较,并展示出有竞争力、有时超过以往复杂架构的结果。我们还根据实验中的性能比较了所采用的不同主干架构,以推断哪些架构最适合该任务。 摘要:Visual scanpath is the sequence of fixation points that the human gaze travels while observing an image, and its prediction helps in modeling the visual attention of an image. To this end several models were proposed in the literature using complex deep learning architectures and frameworks. Here, we explore the efficiency of using common deep learning architectures, in a simple fully convolutional regressive manner. We experiment with how well these models can predict the scanpaths on 2 datasets. We compare with other models using different metrics and show competitive results that sometimes surpass previous complex architectures. We also compare the different leveraged backbone architectures based on their performances on the experiment to deduce which ones are the most suitable for the task.

【9】 SIRfyN: Single Image Relighting from your Neighbors 标题:SIRfyN:来自邻居的单幅图像重新照明 链接:https://arxiv.org/abs/2112.04497

作者:D. A. Forsyth,Anand Bhattad,Pranav Asthana,Yuanyi Zhong,Yuxiong Wang 机构:UIUC, Urbana, Illinois 摘要:我们将展示如何对(单幅图像中描绘的)场景重新打光,使得(a)整体明暗发生改变,并且(b)生成的图像看起来像该场景的自然图像。这一过程的应用包括生成训练数据和构建创作环境。朴素的做法无法做到这一点,原因之一是明暗与反照率强相关;例如,明暗中的尖锐边界往往出现在深度不连续处,而这通常在反照率中很明显。同一场景可以用不同方式打光,已有理论表明这些不同光照构成一个锥体(光照锥)。新的理论表明,可以利用相似场景来估计适用于给定场景的不同光照,且期望误差有界。我们的方法利用这一理论,以插补得到的光照锥生成元的形式估计可用光照场的表示。我们的方法不需要昂贵的“逆向图形”数据集,也不需要任何形式的真值数据。定性评估表明,该方法可以消除和恢复室内柔和的阴影,并可以“引导”光线在场景中移动。我们通过FID的一种新应用给出了该方法的总体定量评估;对FID的一个扩展还允许对每张生成图像进行评估。此外,我们通过用户研究提供了定性评估,并表明我们的方法生成的图像可成功用于数据增强。 摘要:We show how to relight a scene, depicted in a single image, such that (a) the overall shading has changed and (b) the resulting image looks like a natural image of that scene. Applications for such a procedure include generating training data and building authoring environments. Naive methods for doing this fail. One reason is that shading and albedo are quite strongly related; for example, sharp boundaries in shading tend to appear at depth discontinuities, which are usually apparent in the albedo. The same scene can be lit in different ways, and established theory shows the different lightings form a cone (the illumination cone). Novel theory shows that one can use similar scenes to estimate the different lightings that apply to a given scene, with bounded expected error. Our method exploits this theory to estimate a representation of the available lighting fields in the form of imputed generators of the illumination cone. Our procedure does not require expensive "inverse graphics" datasets, and sees no ground truth data of any kind. Qualitative evaluation suggests the method can erase and restore soft indoor shadows, and can "steer" light around a scene. We offer a summary quantitative evaluation of the method with a novel application of the FID. An extension of the FID allows per-generated-image evaluation. Furthermore, we offer qualitative evaluation with a user study, and show that our method produces images that can successfully be used for data augmentation.

【10】 Critical configurations for two projective views, a new approach 标题:双射影视图的临界构型--一种新方法 链接:https://arxiv.org/abs/2112.05074

作者:Martin Bråtelund 备注:22 pages, 6 figures 摘要:运动恢复结构问题关注如何从一组二维图像中恢复物体的三维结构。一般而言,只要提供足够多的图像和图像点,所有信息都可以被唯一恢复,但在某些情况下无法唯一恢复;这些情况被称为临界构型。本文采用代数方法研究两个射影相机的临界构型。我们证明了所有临界构型都位于二次曲面上,并精确分类了哪些二次曲面构成临界构型。本文还描述了在无法唯一重建时,不同重建结果之间的关系。 摘要:The problem of structure from motion is concerned with recovering 3-dimensional structure of an object from a set of 2-dimensional images. Generally, all information can be uniquely recovered if enough images and image points are provided, but there are certain cases where unique recovery is impossible; these are called critical configurations. In this paper we use an algebraic approach to study the critical configurations for two projective cameras. We show that all critical configurations lie on quadric surfaces, and classify exactly which quadrics constitute a critical configuration. The paper also describes the relation between the different reconstructions when unique reconstruction is impossible.

【11】 Sparse-View CT Reconstruction using Recurrent Stacked Back Projection Link: https://arxiv.org/abs/2112.04998

Authors: Wenrui Li, Gregery T. Buzzard, Charles A. Bouman Affiliations: Electrical and Computer Engineering, Purdue University, West Lafayette, IN, United States; Mathematics Note: 5 pages, 2021 Asilomar Conference on Signals, Systems, and Computers Abstract: Sparse-view CT reconstruction is important in a wide range of applications due to limitations on cost, acquisition time, or dosage. However, traditional direct reconstruction methods such as filtered back-projection (FBP) lead to low-quality reconstructions in the sub-Nyquist regime. In contrast, deep neural networks (DNNs) can produce high-quality reconstructions from sparse and noisy data, e.g. through post-processing of FBP reconstructions, as can model-based iterative reconstruction (MBIR), albeit at a higher computational cost. In this paper, we introduce a direct-reconstruction DNN method called Recurrent Stacked Back Projection (RSBP) that uses sequentially acquired backprojections of individual views as input to a recurrent convolutional LSTM network. The SBP structure retains all the information in the sinogram, while the recurrent processing exploits the correlations between adjacent views and produces an updated reconstruction after each new view. We train our network on simulated data and test on both simulated and real data, demonstrating that RSBP outperforms both DNN post-processing of FBP images and basic MBIR, at a lower computational cost than MBIR.
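
A minimal sketch of the recurrent idea, assuming a PyTorch implementation: PyTorch ships no built-in ConvLSTM, so a small cell is written out here, and the per-view loop mirrors the "updated reconstruction after each new view" behaviour described above. Layer sizes and names are illustrative assumptions, not the authors' architecture.

```python
# Sketch (assumption): consume one single-view backprojection per step and
# keep a running reconstruction in a convolutional LSTM state.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        # One conv produces all four gates from the [input, hidden] stack.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)
        self.hid_ch = hid_ch

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentBackProjection(nn.Module):
    def __init__(self, hid_ch=32):
        super().__init__()
        self.cell = ConvLSTMCell(1, hid_ch)
        self.to_image = nn.Conv2d(hid_ch, 1, kernel_size=1)

    def forward(self, backprojections):
        # backprojections: (B, V, 1, H, W), one backprojected view per step.
        B, V, _, H, W = backprojections.shape
        h = backprojections.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        recon = None
        for v in range(V):
            h, c = self.cell(backprojections[:, v], h, c)
            recon = self.to_image(h)  # reconstruction updated after each view
        return recon

net = RecurrentBackProjection()
out = net(torch.randn(1, 16, 1, 64, 64))  # 16 views -> (1, 1, 64, 64)
```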

【12】 Evaluating saliency methods on artificial data with different background types Link: https://arxiv.org/abs/2112.04882

Authors: Céline Budding, Fabian Eitel, Kerstin Ritter, Stefan Haufe Affiliations: Department of Industrial Engineering & Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands; Charité – Universitätsmedizin Berlin, Berlin, Germany; Bernstein Center for Computational Neuroscience Berlin, Berlin, Germany Note: 6 pages, 2 figures. Presented at Medical Imaging meets NeurIPS 2021 (poster presentation) Abstract: Over the last years, many 'explainable artificial intelligence' (xAI) approaches have been developed, but these have not always been objectively evaluated. To evaluate the quality of heatmaps generated by various saliency methods, we developed a framework to generate artificial data with synthetic lesions and a known ground-truth map. Using this framework, we evaluated two data sets with different backgrounds, Perlin noise and 2D brain MRI slices, and found that the heatmaps vary strongly between saliency methods and backgrounds. We strongly encourage further evaluation of saliency maps and xAI methods using this framework before applying these in clinical or other safety-critical settings.
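
A minimal sketch of such a data-generation framework, assuming NumPy/SciPy; smoothed Gaussian noise stands in for the Perlin-noise background to keep the example dependency-free, and all names and parameters are illustrative:

```python
# Sketch (assumption): an artificial image with a synthetic "lesion" plus
# the matching ground-truth map against which saliency heatmaps are scored.
import numpy as np
from scipy.ndimage import gaussian_filter

def make_sample(size=64, lesion_radius=6, seed=0):
    rng = np.random.default_rng(seed)
    # Smoothed noise background (stand-in for Perlin noise).
    background = gaussian_filter(rng.standard_normal((size, size)), sigma=4)
    # Disc-shaped lesion at a random location defines the ground truth.
    cy, cx = rng.integers(lesion_radius, size - lesion_radius, size=2)
    yy, xx = np.mgrid[:size, :size]
    ground_truth = ((yy - cy) ** 2 + (xx - cx) ** 2) <= lesion_radius ** 2
    image = background + 0.8 * ground_truth  # lesion raises local intensity
    return image.astype(np.float32), ground_truth.astype(np.float32)

image, gt_map = make_sample()
# A classifier is trained on such images; any saliency heatmap it yields can
# then be compared against gt_map, e.g. via overlap or correlation.
```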

【13】 Multiscale Softmax Cross Entropy for Fovea Localization on Color Fundus Photography Link: https://arxiv.org/abs/2112.04499

Authors: Yuli Wu, Peter Walter, Dorit Merhof Affiliations: Institute of Imaging and Computer Vision, RWTH Aachen, Germany; Department of Ophthalmology, RWTH Aachen, Germany

【14】 Revisiting Global Statistics Aggregation for Improving Image Restoration Link: https://arxiv.org/abs/2112.04491

Authors: Xiaojie Chu, Liangyu Chen, Chengpeng Chen, Xin Lu Affiliations: Megvii Technology

