Computer Vision and Pattern Recognition Academic Digest [11.8]

2021-11-17 10:53:00

cs.CV: 52 papers in total today

Transformer (2 papers)

【1】 Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers
Link: https://arxiv.org/abs/2111.03481

Authors: Yanhong Zeng, Huan Yang, Hongyang Chao, Jianbo Wang, Jianlong Fu
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Microsoft Research Asia; University of Tokyo
Note: NeurIPS 2021
Abstract: We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (e.g., a latent code), the new formulation enables a flexible local manipulation for different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (i.e., TokenGAN). Particularly, the TokenGAN inputs two semantically different visual tokens, i.e., the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning the styles to the content tokens via an attention mechanism with a Transformer. We conduct extensive experiments and show that the proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks, including FFHQ and LSUN CHURCH with different resolutions. In particular, the generator is able to synthesize high-fidelity images with 1024x1024 size, dispensing with convolutions entirely.
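The style-assignment step described above — content tokens querying style tokens through Transformer attention — is, at its core, cross-attention between two token sequences. Below is a minimal PyTorch sketch of that idea; the module name, token counts, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: content tokens attend to style tokens via cross-attention.
import torch
import torch.nn as nn

class StyleAssignment(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, content_tokens, style_tokens):
        # content_tokens: (B, N_content, dim) learned constants
        # style_tokens:   (B, N_style, dim) mapped from the latent space
        out, _ = self.attn(query=content_tokens,
                           key=style_tokens,
                           value=style_tokens)
        return self.norm(content_tokens + out)   # residual + norm

content = torch.randn(2, 256, 512)   # hypothetical token counts
styles = torch.randn(2, 16, 512)
stylized = StyleAssignment()(content, styles)    # (2, 256, 512)
```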

【2】 Hepatic vessel segmentation based on 3Dswin-transformer with inductive biased multi-head self-attention
Link: https://arxiv.org/abs/2111.03368

Authors: Mian Wu, Yinling Qian, Xiangyun Liao, Qiong Wang, Pheng-Ann Heng
Affiliations: SIAT, Chinese Academy of Sciences; The Chinese University of Hong Kong
Note: 20 pages, 6 figures
Abstract: Purpose: Segmentation of liver vessels from CT images is indispensable prior to surgical planning and has aroused broad interest in the medical image analysis community. Due to the complex structure and low-contrast background, automatic liver vessel segmentation remains particularly challenging. Most related research adopts FCN, U-net, and V-net variants as a backbone. However, these methods mainly focus on capturing multi-scale local features, which may produce misclassified voxels due to the convolutional operator's limited local receptive field. Methods: We propose a robust end-to-end vessel segmentation network called Inductive BIased Multi-Head Attention Vessel Net (IBIMHAV-Net) by expanding the swin transformer to 3D and employing an effective combination of convolution and self-attention. In practice, we introduce voxel-wise embedding rather than patch-wise embedding to locate precise liver vessel voxels, and adopt multi-scale convolutional operators to gain local spatial information. On the other hand, we propose the inductive biased multi-head self-attention, which learns inductive biased relative positional embeddings from initialized absolute position embeddings. Based on this, we can obtain more reliable query and key matrices. To validate the generalization of our model, we test on samples with different structural complexity. Results: We conducted experiments on the 3DIRCADb dataset. The average dice and sensitivity of the four tested cases were 74.8% and 77.5%, which exceed the results of existing deep learning methods and improved graph cuts methods. Conclusion: The proposed model IBIMHAV-Net provides an automatic, accurate 3D liver vessel segmentation with an interleaved architecture that better utilizes both global and local spatial features in CT volumes. It can be further extended to other clinical data.
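The general mechanism behind the inductive biased self-attention — attention logits augmented with a learned relative positional bias — can be illustrated with the 1D PyTorch sketch below. The paper's specific initialization from absolute position embeddings and its 3D windowing are not reproduced here; this is only the standard relative-bias pattern.

```python
# Sketch of multi-head attention with a learnable relative position bias.
import torch
import torch.nn as nn

class RelPosAttention(nn.Module):
    def __init__(self, dim, heads, window):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # one learnable bias per head per relative offset in [-(w-1), w-1]
        self.bias = nn.Parameter(torch.zeros(2 * window - 1, heads))
        idx = torch.arange(window)
        self.register_buffer("rel", idx[None, :] - idx[:, None] + window - 1)

    def forward(self, x):                      # x: (B, N, dim), N == window
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.heads, -1).transpose(1, 2)
        k = k.view(B, N, self.heads, -1).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.bias[self.rel].permute(2, 0, 1)  # (heads, N, N)
        return (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)

x = torch.randn(2, 49, 96)                     # a 49-token window
y = RelPosAttention(dim=96, heads=4, window=49)(x)
```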

Detection (8 papers)

【1】 AGPCNet: Attention-Guided Pyramid Context Networks for Infrared Small Target Detection
Link: https://arxiv.org/abs/2111.03580

Authors: Tianfang Zhang, Siying Cao, Tian Pu, Zhenming Peng
Note: 12 pages, 13 figures, 8 tables
Abstract: Infrared small target detection is an important problem in many fields such as earth observation, military reconnaissance, and disaster relief, and has received widespread attention recently. This paper presents the Attention-Guided Pyramid Context Network (AGPCNet) algorithm. Its main components are an Attention-Guided Context Block (AGCB), a Context Pyramid Module (CPM), and an Asymmetric Fusion Module (AFM). AGCB divides the feature map into patches to compute local associations and uses Global Context Attention (GCA) to compute global associations between semantics, CPM integrates features from multi-scale AGCBs, and AFM integrates low-level and deep-level semantics from a feature-fusion perspective to enhance the utilization of features. The experimental results illustrate that AGPCNet has achieved new state-of-the-art performance on two available infrared small target datasets. The source code is available at https://github.com/Tianfang-Zhang/AGPCNet.

【2】 Sampling Equivariant Self-attention Networks for Object Detection in Aerial Images
Link: https://arxiv.org/abs/2111.03420

Authors: Guo-Ye Yang, Xiang-Li Li, Ralph R. Martin, Shi-Min Hu
Affiliations: Cardiff University; Tsinghua University
Abstract: Objects in aerial images have greater variations in scale and orientation than in typical images, so detection is more difficult. Convolutional neural networks use a variety of frequency- and orientation-specific kernels to identify objects subject to different transformations; these require many parameters. Sampling equivariant networks can adjust sampling from input feature maps according to the transformation of the object, allowing a kernel to extract features of an object under different transformations. Doing so requires fewer parameters, and makes the network more suitable for representing deformable objects, like those in aerial images. However, methods like deformable convolutional networks can only provide sampling equivariance under certain circumstances, because of the locations used for sampling. We propose sampling equivariant self-attention networks which consider self-attention restricted to a local image patch as convolution sampling with masks instead of locations, and design a transformation embedding module to further improve the equivariant sampling ability. We also use a novel randomized normalization module to tackle overfitting due to limited aerial image data. We show that our model (i) provides significantly better sampling equivariance than existing methods, without additional supervision, (ii) provides improved classification on ImageNet, and (iii) achieves state-of-the-art results on the DOTA dataset, without increased computation.

【3】 KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal Action Localization
Link: https://arxiv.org/abs/2111.03319

Authors: Kalana Abeywardena, Shechem Sumanthiran, Sakuna Jayasundara, Sachira Karunasena, Ranga Rodrigo, Peshala Jayasekara
Affiliations: Department of Electronic and Telecommunication Engineering, University of Moratuwa
Note: 7 pages
Abstract: Real-time and online action localization in a video is a critical yet highly challenging problem. Accurate action localization requires the utilization of both temporal and spatial information. Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow, making them both unsuitable for real-time, online applications. To accomplish activity localization under highly challenging real-time constraints, we propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions. We then introduce a tube-linking algorithm that maintains the continuity of action tubes temporally in the presence of occlusions. Further, we eliminate the need for a two-stream architecture by combining temporal and spatial information into a cascaded input to a single network, allowing the network to learn from both types of information. Temporal information is efficiently extracted using a structural similarity index map as opposed to computationally intensive optical flow. Despite the simplicity of our approach, our lightweight end-to-end architecture achieves state-of-the-art frame-mAP of 74.7% on the challenging UCF101-24 dataset, demonstrating a performance gain of 6.4% over the previous best online methods. We also achieve state-of-the-art video-mAP results compared to both online and offline methods. Moreover, our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.
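The efficiency claim hinges on replacing optical flow with a structural similarity (SSIM) map between consecutive frames as the temporal cue. A small sketch of how such a map can be computed with scikit-image is shown below; the file names are placeholders, and the paper's exact preprocessing may differ.

```python
# Sketch: per-pixel SSIM map between consecutive frames as a motion cue.
import cv2
from skimage.metrics import structural_similarity

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# full=True returns the per-pixel SSIM map alongside the scalar score
score, ssim_map = structural_similarity(prev, curr, full=True)

# Regions that changed between frames (e.g. a moving actor) have low SSIM;
# 1 - SSIM can be stacked with the RGB frame as a cascaded network input.
motion_cue = 1.0 - ssim_map
```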

【4】 Remote Sensing Image Super-resolution and Object Detection: Benchmark and State of the Art
Link: https://arxiv.org/abs/2111.03260

Authors: Yi Wang, Syed Muhammad Arsalan Bashir, Mahrukh Khan, Qudrat Ullah, Rui Wang, Yilin Song, Zhe Guo, Yilong Niu
Note: 39 pages, 15 figures, 5 tables. Submitted to Elsevier journal for review
Abstract: For the past two decades, there have been significant efforts to develop methods for object detection in Remote Sensing (RS) images. In most cases, the datasets for small object detection in remote sensing images are inadequate. Many researchers used scene classification datasets for object detection, which has its limitations; for example, the large-sized objects outnumber the small objects in object categories. Thus, they lack diversity; this further affects the detection performance of small object detectors in RS images. This paper reviews current datasets and object detection methods (deep learning-based) for remote sensing images. We also propose a large-scale, publicly available benchmark Remote Sensing Super-resolution Object Detection (RSSOD) dataset. The RSSOD dataset consists of 1,759 hand-annotated images with 22,091 instances of very high resolution (VHR) images with a spatial resolution of ~0.05 m. There are five classes with varying frequencies of labels per class. The image patches are extracted from satellite images, including real image distortions such as tangential scale distortion and skew distortion. We also propose a novel Multi-class Cyclic super-resolution Generative adversarial network with Residual feature aggregation (MCGR) and auxiliary YOLOv5 detector to benchmark image super-resolution-based object detection and compare with the existing state-of-the-art methods based on image super-resolution (SR). The proposed MCGR achieved state-of-the-art performance for image SR with an improvement of 1.2 dB in PSNR compared to the current state-of-the-art NLSN method. MCGR achieved the best object detection mAPs of 0.758, 0.881, 0.841, and 0.983 for the five-class, four-class, two-class, and single-class settings respectively, surpassing the performance of the state-of-the-art object detectors YOLOv5, EfficientDet, Faster RCNN, SSD, and RetinaNet.

【5】 Fast Camouflaged Object Detection via Edge-based Reversible Re-calibration Network
Link: https://arxiv.org/abs/2111.03216

Authors: Ge-Peng Ji, Lei Zhu, Mingchen Zhuge, Keren Fu
Affiliations: School of Computer Science, Wuhan University; Hong Kong University of Science and Technology (Guangzhou); School of Computer Science, China University of Geosciences, China; College of Computer Science, Sichuan University
Note: 35 pages, 7 figures, 5 tables (Accepted by Pattern Recognition 2022)
Abstract: Camouflaged Object Detection (COD) aims to detect objects with similar patterns (e.g., texture, intensity, colour, etc.) to their surroundings, and has recently attracted growing research interest. As camouflaged objects often present very ambiguous boundaries, how to determine object locations as well as their weak boundaries is challenging and also the key to this task. Inspired by the biological visual perception process when a human observer discovers camouflaged objects, this paper proposes a novel edge-based reversible re-calibration network called ERRNet. Our model is characterized by two innovative designs, namely Selective Edge Aggregation (SEA) and the Reversible Re-calibration Unit (RRU), which aim to model the visual perception behaviour and achieve effective edge priors and cross-comparison between potential camouflaged regions and background. More importantly, RRU incorporates diverse priors with more comprehensive information compared to existing COD models. Experimental results show that ERRNet outperforms existing cutting-edge baselines on three COD datasets and five medical image segmentation datasets. Especially, compared with the existing top-1 model SINet, ERRNet significantly improves the performance by $\sim$6% (mean E-measure) with notably high speed (79.3 FPS), showing that ERRNet could be a general and robust solution for the COD task.

【6】 Addressing Multiple Salient Object Detection via Dual-Space Long-Range Dependencies
Link: https://arxiv.org/abs/2111.03195

Authors: Bowen Deng, Andrew P. French, Michael P. Pound
Affiliations: Computer Vision Laboratory, School of Computer Science, University of Nottingham, Nottingham, United Kingdom
Note: 10 pages, 9 figures
Abstract: Salient object detection plays an important role in many downstream tasks. However, complex real-world scenes with varying scales and numbers of salient objects still pose a challenge. In this paper, we directly address the problem of detecting multiple salient objects across complex scenes. We propose a network architecture incorporating non-local feature information in both the spatial and channel spaces, capturing the long-range dependencies between separate objects. Traditional bottom-up and non-local features are combined with edge features within a feature fusion gate that progressively refines the salient object prediction in the decoder. We show that our approach accurately locates multiple salient regions even in complex scenarios. To demonstrate the efficacy of our approach to the multiple salient objects problem, we curate a new dataset containing only multiple salient objects. Our experiments demonstrate the proposed method presents state-of-the-art results on five widely used datasets without any pre-processing and post-processing. We obtain a further performance improvement against competing techniques on our multi-object dataset. The dataset and source code are available at: https://github.com/EricDengbowen/DSLRDNet.
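Long-range dependencies of this kind are usually captured with a non-local (self-attention) block. The sketch below shows a generic embedded-Gaussian spatial version in PyTorch; the channel-space variant mentioned in the abstract would instead compute affinities between the C channels. This is a standard illustration, not the authors' code.

```python
# Generic spatial non-local block (embedded-Gaussian formulation).
import torch
import torch.nn as nn

class NonLocal2d(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.theta = nn.Conv2d(c, c // 2, 1)   # query projection
        self.phi = nn.Conv2d(c, c // 2, 1)     # key projection
        self.g = nn.Conv2d(c, c // 2, 1)       # value projection
        self.out = nn.Conv2d(c // 2, c, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C/2)
        k = self.phi(x).flatten(2)                     # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW) affinities
        y = (attn @ v).transpose(1, 2).reshape(B, C // 2, H, W)
        return x + self.out(y)                         # residual connection

feat = torch.randn(1, 64, 32, 32)
out = NonLocal2d(64)(feat)                     # same shape as input
```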

【7】 Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image
Link: https://arxiv.org/abs/2111.03098

Authors: Feng Liu, Xiaoming Liu
Note: NeurIPS 2021
Abstract: Inferring 3D locations and shapes of multiple objects from a single 2D image is a long-standing objective of computer vision. Most of the existing works either predict one of these 3D properties or focus on solving both for a single object. One fundamental challenge lies in how to learn an effective representation of the image that is well-suited for 3D detection and reconstruction. In this work, we propose to learn a regular grid of 3D voxel features from the input image which is aligned with 3D scene space via a 3D feature lifting operator. Based on the 3D voxel features, our novel CenterNet-3D detection head formulates the 3D detection as keypoint detection in the 3D space. Moreover, we devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation, which enables fine detail reconstruction and one order of magnitude faster inference than prior methods. With complementary supervision from both 3D detection and reconstruction, one enables the 3D voxel features to be geometry- and context-preserving, benefiting both tasks. The effectiveness of our approach is demonstrated through 3D detection and reconstruction in single-object and multiple-object scenarios.

【8】 A bone suppression model ensemble to improve COVID-19 detection in chest X-rays
Link: https://arxiv.org/abs/2111.03404

Authors: Sivaramakrishnan Rajaraman, Gregg Cohen, Les Folio, Sameer Antani
Affiliations: National Library of Medicine, National Institutes of Health, Maryland, USA; Clinical Center, Department of Radiology and Imaging Sciences, National Institutes of Health
Note: 29 pages, 10 figures, 4 tables
Abstract: Chest X-ray (CXR) is a widely performed radiology examination that helps to detect abnormalities in the tissues and organs in the thoracic cavity. Detecting pulmonary abnormalities like COVID-19 may become difficult because they are obscured by the presence of bony structures like the ribs and the clavicles, thereby resulting in screening/diagnostic misinterpretations. Automated bone suppression methods would help suppress these bony structures and increase soft tissue visibility. In this study, we propose to build an ensemble of convolutional neural network models to suppress bones in frontal CXRs, improve classification performance, and reduce interpretation errors related to COVID-19 detection. The ensemble is constructed by (i) measuring the multi-scale structural similarity index (MS-SSIM) score between the sub-blocks of the bone-suppressed image predicted by each of the top-3 performing bone-suppression models and the corresponding sub-blocks of its respective ground truth soft-tissue image, and (ii) performing a majority voting of the MS-SSIM scores computed in each sub-block to identify the sub-block with the maximum MS-SSIM score and use it in constructing the final bone-suppressed image. We empirically determine the sub-block size that delivers superior bone suppression performance. It is observed that the bone suppression model ensemble outperformed the individual models in terms of MS-SSIM and other metrics. A CXR modality-specific classification model is retrained and evaluated on the non-bone-suppressed and bone-suppressed images to classify them as showing normal lungs or other COVID-19-like manifestations. We observed that training on bone-suppressed images significantly outperformed training on non-bone-suppressed images for detecting COVID-19 manifestations.
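The per-block selection step can be pictured as follows: for every sub-block, keep the prediction of whichever model is closest to the soft-tissue ground truth. The sketch below uses single-scale SSIM from scikit-image as a stand-in for MS-SSIM, assumes H and W are multiples of the block size, and the block size itself is illustrative.

```python
# Sketch of block-wise best-model selection against a ground-truth image.
import numpy as np
from skimage.metrics import structural_similarity

def blockwise_select(preds, gt, block=64):
    """preds: list of (H, W) float arrays in [0, 1]; gt: (H, W) array."""
    out = np.zeros_like(gt)
    H, W = gt.shape
    for y in range(0, H, block):
        for x in range(0, W, block):
            sl = (slice(y, y + block), slice(x, x + block))
            scores = [structural_similarity(p[sl], gt[sl], data_range=1.0)
                      for p in preds]
            out[sl] = preds[int(np.argmax(scores))][sl]  # keep best block
    return out
```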

Classification and Recognition (5 papers)

【1】 The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
Link: https://arxiv.org/abs/2111.03651

Authors: Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Affiliations: Visual Geometry Group, University of Oxford, Oxford, UK
Note: To appear in BMVC 2021 (Oral). Project page: this https URL
Abstract: Most of us are not experts in specific fields, such as ornithology. Nonetheless, we do have general image and language understanding capabilities that we use to match what we see to expert resources. This allows us to expand our knowledge and perform novel tasks without ad-hoc external supervision. On the contrary, machines have a much harder time consulting expert-curated knowledge bases unless trained specifically with that knowledge in mind. Thus, in this paper we consider a new problem: fine-grained image recognition without expert annotations, which we address by leveraging the vast knowledge available in web encyclopedias. First, we learn a model to describe the visual appearance of objects using non-expert image descriptions. We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis. We evaluate the method on two datasets and compare with several strong baselines and the state of the art in cross-modal retrieval. Code is available at: https://github.com/subhc/clever
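The sentence-level matching step can be pictured with an off-the-shelf sentence encoder: score each encyclopedia document by its best-matching sentence against the generated image description. The snippet below is only an analogy built on the sentence-transformers library (model choice and texts are made up); the paper trains its own fine-grained textual similarity model.

```python
# Illustrative stand-in for sentence-level description/document matching.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
description = ["A small bird with a bright red crown and black wings."]
doc_sentences = [
    "The adult male has a red crown and black wings with white bars.",
    "It winters in open woodlands and feeds on seeds.",
]
sim = util.cos_sim(model.encode(description), model.encode(doc_sentences))
doc_score = sim.max().item()   # document score = best-matching sentence
```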

【2】 Recognizing Vector Graphics without Rasterization
Link: https://arxiv.org/abs/2111.03281

Authors: Xinyang Jiang, Lu Liu, Caihua Shan, Yifei Shen, Xuanyi Dong, Dongsheng Li
Affiliations: Microsoft Research Asia; University of Technology Sydney; The Hong Kong University of Science and Technology
Abstract: In this paper, we consider a different data format for images: vector graphics. In contrast to raster graphics, which are widely used in image recognition, vector graphics can be scaled up or down to any resolution without aliasing or information loss, due to the analytic representation of the primitives in the document. Furthermore, vector graphics are able to give extra structural information on how low-level elements group together to form high-level shapes or structures. These merits of vector graphics have not been fully leveraged in existing methods. To explore this data format, we target the fundamental recognition tasks: object localization and classification. We propose an efficient CNN-free pipeline that does not render the graphic into pixels (i.e., rasterization) and takes the textual document of the vector graphics as input, called YOLaT (You Only Look at Text). YOLaT builds multi-graphs to model the structural and spatial information in vector graphics, and a dual-stream graph neural network is proposed to detect objects from the graph. Our experiments show that by directly operating on vector graphics, YOLaT outperforms raster-graphics-based object detection baselines in terms of both average precision and efficiency.

【3】 Attention on Classification for Fire Segmentation
Link: https://arxiv.org/abs/2111.03129

Authors: Milad Niknejad, Alexandre Bernardino
Affiliations: Instituto de Sistemas e Robotica, Instituto Superior Tecnico, University of Lisbon, Lisbon, Portugal
Abstract: Detection and localization of fire in images and videos are important in tackling fire incidents. Although semantic segmentation methods can be used to indicate the location of pixels with fire in the images, their predictions are localized, and they often fail to consider the global information of the existence of fire in the image, which is implicit in the image labels. We propose a Convolutional Neural Network (CNN) for joint classification and segmentation of fire in images, which improves the performance of fire segmentation. We use a spatial self-attention mechanism to capture long-range dependencies between pixels, and a new channel attention module which uses the classification probability as an attention weight. The network is jointly trained for both segmentation and classification, leading to improvement over the performance of single-task image segmentation methods and the previous methods proposed for fire segmentation.
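One plausible reading of the proposed channel attention — gating segmentation features channel-wise with the image-level fire probability — is sketched below; the module name and exact wiring are assumptions based only on the abstract.

```python
# Hedged sketch: classification probability scales channel attention weights.
import torch
import torch.nn as nn

class ProbChannelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.cls_head = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, feat):                  # feat: (B, C, H, W)
        pooled = feat.mean(dim=(2, 3))        # global average pool -> (B, C)
        p_fire = self.cls_head(pooled)        # (B, 1) image-level probability
        weights = self.fc(pooled) * p_fire    # attention scaled by probability
        return feat * weights[:, :, None, None], p_fire
```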

【4】 Learning of Frequency-Time Attention Mechanism for Automatic Modulation Recognition
Link: https://arxiv.org/abs/2111.03258

Authors: Shangao Lin, Yuan Zeng, Yi Gong
Affiliations: Department of Electrical and Electronic Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China; Academy for Advanced Interdisciplinary Studies, SUSTech, Shenzhen, China
Abstract: Recent learning-based image classification and speech recognition approaches make extensive use of attention mechanisms to achieve state-of-the-art recognition power, which demonstrates the effectiveness of attention mechanisms. Motivated by the fact that the frequency and time information of modulated radio signals are crucial for modulation mode recognition, this paper proposes a frequency-time attention mechanism for a convolutional neural network (CNN)-based modulation recognition framework. The proposed frequency-time attention module is designed to learn which channel, frequency and time information is more meaningful in CNNs for modulation recognition. We analyze the effectiveness of the proposed frequency-time attention mechanism and compare the proposed method with two existing learning-based methods. Experiments on an open-source modulation recognition dataset show that the recognition performance of the proposed framework is better than that of the framework without frequency-time attention and existing learning-based methods.

【5】 PDBL: Improving Histopathological Tissue Classification with Plug-and-Play Pyramidal Deep-Broad Learning
Link: https://arxiv.org/abs/2111.03063

Authors: Jiatai Lin, Guoqiang Han, Xipeng Pan, Hao Chen, Danyi Li, Xiping Jia, Zhenwei Shi, Zhizhen Wang, Yanfen Cui, Haiming Li, Changhong Liang, Li Liang, Zaiyi Liu, Chu Han
Affiliations: School of Computer Science; Department of Pathology, Southern Medical University
Note: 10 pages, 5 figures
Abstract: Histopathological tissue classification is a fundamental task in pathomics cancer research. Precisely differentiating different tissue types benefits downstream research, such as cancer diagnosis and prognosis. Existing works mostly leverage the popular classification backbones in computer vision to achieve histopathological tissue classification. In this paper, we propose a super lightweight plug-and-play module, named Pyramidal Deep-Broad Learning (PDBL), for any well-trained classification backbone to further improve the classification performance without a re-training burden. We mimic how pathologists observe pathology slides in different magnifications and construct an image pyramid for the input image in order to obtain the pyramidal contextual information. For each level in the pyramid, we extract the multi-scale deep-broad features by our proposed Deep-Broad block (DB-block). We equip PDBL in three popular classification backbones, ShuffleNetV2, EfficientNetB0, and ResNet50, to evaluate the effectiveness and efficiency of our proposed module on two datasets (the Kather Multiclass Dataset and the LC25000 Dataset). Experimental results demonstrate that the proposed PDBL can steadily improve the tissue-level classification performance for any CNN backbone, especially for lightweight models given a small amount of training samples (less than 10%), which greatly saves computational time and annotation effort.
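The pyramid-context part — running the same frozen backbone over several magnifications of a tile and concatenating the pooled features for a lightweight head — can be sketched as below. Only the multi-scale extraction is shown; the Deep-Broad block and broad-learning head are not reproduced, and the backbone and scales are assumptions.

```python
# Sketch: multi-scale features from a frozen backbone, no re-training.
import torch
import torch.nn.functional as F
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()            # keep 2048-d pooled features
backbone.eval()

@torch.no_grad()
def pyramid_features(img, scales=(1.0, 0.5, 0.25)):
    feats = []
    for s in scales:
        x = F.interpolate(img, scale_factor=s, mode="bilinear",
                          align_corners=False)
        feats.append(backbone(x))            # (B, 2048) per pyramid level
    return torch.cat(feats, dim=1)           # concatenated descriptor

desc = pyramid_features(torch.randn(1, 3, 224, 224))   # (1, 6144)
```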

Segmentation and Semantics (6 papers)

【1】 Semantic Consistency in Image-to-Image Translation for Unsupervised Domain Adaptation
Link: https://arxiv.org/abs/2111.03522

Authors: Stephan Brehm, Sebastian Scherer, Rainer Lienhart
Affiliations: Department of Computer Science, University of Augsburg, Augsburg, Germany
Abstract: Unsupervised Domain Adaptation (UDA) aims to adapt models trained on a source domain to a new target domain where no labelled data is available. In this work, we investigate the problem of UDA from a synthetic computer-generated domain to a similar but real-world domain for learning semantic segmentation. We propose a semantically consistent image-to-image translation method in combination with a consistency regularisation method for UDA. We overcome previous limitations on transferring synthetic images to real-looking images. We leverage pseudo-labels in order to learn a generative image-to-image translation model that receives additional feedback from semantic labels on both domains. Our method outperforms state-of-the-art methods that combine image-to-image translation and semi-supervised learning on relevant domain adaptation benchmarks, i.e., on GTA5 to Cityscapes and SYNTHIA to Cityscapes.

【2】 Event-based Motion Segmentation by Cascaded Two-Level Multi-Model Fitting
Link: https://arxiv.org/abs/2111.03483

Authors: Xiuyuan Lu, Yi Zhou, Shaojie Shen
Affiliations: Hong Kong University of Science and Technology
Note: Accepted for presentation at the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021)
Abstract: Among prerequisites for a synthetic agent to interact with dynamic scenes, the ability to identify independently moving objects is specifically important. From an application perspective, nevertheless, standard cameras may deteriorate remarkably under aggressive motion and challenging illumination conditions. In contrast, event-based cameras, as a category of novel biologically inspired sensors, deliver advantages to deal with these challenges. Their rapid response and asynchronous nature enable them to capture visual stimuli at exactly the same rate as the scene dynamics. In this paper, we present a cascaded two-level multi-model fitting method for identifying independently moving objects (i.e., the motion segmentation problem) with a monocular event camera. The first level leverages tracking of event features and solves the feature clustering problem under a progressive multi-model fitting scheme. Initialized with the resulting motion model instances, the second level further addresses the event clustering problem using a spatio-temporal graph-cut method. This combination leads to efficient and accurate event-wise motion segmentation that cannot be achieved by either of them alone. Experiments demonstrate the effectiveness and versatility of our method in real-world scenes with different motion patterns and an unknown number of independently moving objects.

【3】 SSA: Semantic Structure Aware Inference for Weakly Pixel-Wise Dense Predictions without Cost
Link: https://arxiv.org/abs/2111.03392

Authors: Yanpeng Sun, Zechao Li
Affiliations: Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Abstract: Pixel-wise dense prediction tasks based on weak supervision currently use Class Attention Maps (CAMs) to generate pseudo masks as ground truth. However, existing methods typically depend on painstaking training modules, which may bring in heavy computational overhead and complex training procedures. In this work, semantic structure aware inference (SSA) is proposed to explore the semantic structure information hidden in different stages of a CNN-based network to generate high-quality CAMs at model inference time. Specifically, a semantic structure modeling module (SSM) is first proposed to generate a class-agnostic semantic correlation representation, where each item denotes the affinity degree between one category of objects and all the others. Then the structured feature representation is explored to polish an immature CAM via a dot-product operation. Finally, the polished CAMs from different backbone stages are fused as the output. The proposed method is parameter-free and does not need to be trained. Therefore, it can be applied to a wide range of weakly-supervised pixel-wise dense prediction tasks. Experimental results on both weakly-supervised object localization and weakly-supervised semantic segmentation tasks demonstrate the effectiveness of the proposed method, which achieves new state-of-the-art results on these two tasks.
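Because the method is parameter-free, its core can be written in a few lines: build a pixel-affinity matrix from intermediate features and propagate the immature CAM through it with a dot product. The sketch below is an interpretation of the abstract, not the exact SSM formulation.

```python
# Sketch: polish a CAM with a feature-derived pixel-affinity matrix.
import torch
import torch.nn.functional as F

def polish_cam(feat, cam):
    """feat: (C, H, W) backbone features; cam: (K, H, W) raw class maps."""
    C, H, W = feat.shape
    f = F.normalize(feat.flatten(1), dim=0)   # (C, HW), unit-norm per pixel
    affinity = (f.t() @ f).clamp(min=0)       # (HW, HW) pixel similarities
    affinity = affinity / affinity.sum(dim=1, keepdim=True).clamp(min=1e-6)
    polished = cam.flatten(1) @ affinity.t()  # propagate activations
    return polished.view(-1, H, W)

cam = polish_cam(torch.randn(256, 28, 28), torch.rand(20, 28, 28))
```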

【4】 FBNet: Feature Balance Network for Urban-Scene Segmentation
Link: https://arxiv.org/abs/2111.03286

Authors: Lei Gan, Huabin Huang, Banghuai Li, Ye Yuan
Affiliations: Beihang University; MEGVII Technology
Note: Tech Report
Abstract: Image segmentation in the urban scene has recently attracted much attention due to its success in autonomous driving systems. However, the poor performance on concerned foreground targets, e.g., traffic lights and poles, still limits its further practical applications. In urban scenes, foreground targets are often concealed in their surroundings because of the special camera position and 3D perspective projection. What's worse, the continuous expansion of the receptive field exacerbates the imbalance between foreground and background classes in high-level features. We call this Feature Camouflage. In this paper, we present a novel add-on module, named Feature Balance Network (FBNet), to eliminate feature camouflage in urban-scene segmentation. FBNet consists of two key components, i.e., Block-wise BCE (BwBCE) and a Dual Feature Modulator (DFM). BwBCE serves as an auxiliary loss to ensure uniform gradients for foreground classes and their surroundings during backpropagation. At the same time, DFM is intended to enhance the deep representation of foreground classes in high-level features adaptively under the supervision of BwBCE. These two modules facilitate each other as a whole to ease feature camouflage effectively. Our proposed method achieves new state-of-the-art segmentation performance on two challenging urban-scene benchmarks, i.e., Cityscapes and BDD100K. Code will be released for reproduction.
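One plausible reading of the Block-wise BCE idea — give every block that contains foreground equal weight, so that small objects such as poles are not drowned out by background pixels — is sketched below. The abstract does not give the exact formulation, so treat the weighting here as an assumption.

```python
# Hedged sketch of a block-wise BCE auxiliary loss (binary case).
import torch
import torch.nn.functional as F

def blockwise_bce(logits, target, block=32):
    """logits, target: (B, 1, H, W); H and W divisible by `block`."""
    loss = F.binary_cross_entropy_with_logits(logits, target,
                                              reduction="none")
    B, _, H, W = loss.shape

    def per_block_mean(t):
        return t.view(B, 1, H // block, block,
                      W // block, block).mean(dim=(3, 5))

    block_loss = per_block_mean(loss)            # (B, 1, H/b, W/b)
    fg = (per_block_mean(target) > 0).float()    # blocks containing foreground
    # every foreground-containing block contributes equally to the loss
    return (block_loss * fg).sum() / fg.sum().clamp(min=1.0)
```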

【5】 EditGAN: High-Precision Semantic Image Editing
Link: https://arxiv.org/abs/2111.03186

Authors: Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, Sanja Fidler
Affiliations: NVIDIA; University of Toronto; Vector Institute; MIT
Abstract: Generative adversarial networks (GANs) have recently found applications in image editing. However, most GAN-based image editing methods often require large-scale datasets with semantic segmentation annotations for training, only provide high-level control, or merely interpolate between different images. Here, we propose EditGAN, a novel method for high-quality, high-precision semantic image editing, allowing users to edit images by modifying their highly detailed part segmentation masks, e.g., drawing a new mask for the headlight of a car. EditGAN builds on a GAN framework that jointly models images and their semantic segmentations, requiring only a handful of labeled examples, making it a scalable tool for editing. Specifically, we embed an image into the GAN latent space and perform conditional latent code optimization according to the segmentation edit, which effectively also modifies the image. To amortize optimization, we find editing vectors in latent space that realize the edits. The framework allows us to learn an arbitrary number of editing vectors, which can then be directly applied on other images at interactive rates. We experimentally show that EditGAN can manipulate images with an unprecedented level of detail and freedom, while preserving full image quality. We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data. We demonstrate EditGAN on a wide variety of image types and quantitatively outperform several previous editing methods on standard editing benchmark tasks.
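At interactive rates, editing reduces to adding a previously learned editing vector to the inverted latent code before decoding. A hedged sketch (the generator interface and scaling are assumptions):

```python
# Sketch: apply a learned editing vector in latent space.
import torch

@torch.no_grad()
def apply_edit(generator, w, v, alpha=1.0):
    """w: (B, latent_dim) inverted code; v: (latent_dim,) edit vector."""
    return generator(w + alpha * v)   # alpha scales the edit strength

# Edits compose by summing several scaled vectors before decoding, e.g.
# generator(w + a1 * v_headlight + a2 * v_wheel)
```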

【6】 Segmentation of 2D Brain MR Images
Link: https://arxiv.org/abs/2111.03370

Authors: Angad Ripudaman Singh Bajwa
Affiliations: Indian Institute of Technology (IIT), Varanasi (BHU); National Institute of Technology (NIT)
Abstract: Brain tumour segmentation is an essential task in medical image processing. Early diagnosis of brain tumours plays a crucial role in improving treatment possibilities and increases the survival rate of patients. Manual segmentation of brain tumours for cancer diagnosis, from a large number of MRI images, is both a difficult and time-consuming task. There is a need for automatic brain tumour image segmentation. The purpose of this project is to provide an automatic brain tumour segmentation method for MRI images to help locate the tumour accurately and quickly.

Zero/Few-Shot, Transfer, Domain Adaptation (2 papers)

【1】 Solving Traffic4Cast Competition with U-Net and Temporal Domain Adaptation
Link: https://arxiv.org/abs/2111.03421

Authors: Vsevolod Konyakhin, Nina Lukashina, Aleksei Shpilman
Affiliations: ITMO University, Saint Petersburg, Russia; JetBrains Research; HSE University
Note: Conference on Neural Information Processing Systems (NeurIPS 2021) Traffic4cast Competition
Abstract: In this technical report, we present our solution to the Traffic4Cast 2021 Core Challenge, in which participants were asked to develop algorithms for predicting a traffic state 60 minutes ahead, based on the information from the previous hour, in 4 different cities. In contrast to the previously held competitions, this year's challenge focuses on the temporal domain shift in traffic due to the COVID-19 pandemic. Following the past success of U-Net, we utilize it for predicting future traffic maps. Additionally, we explore the usage of pre-trained encoders such as DenseNet and EfficientNet and employ multiple domain adaptation techniques to fight the domain shift. Our solution has ranked third in the final competition. The code is available at https://github.com/jbr-ai-labs/traffic4cast-2021.

【2】 Cross Modality 3D Navigation Using Reinforcement Learning and Neural Style Transfer
Link: https://arxiv.org/abs/2111.03485

Authors: Cesare Magnetti, Hadrien Reynaud, Bernhard Kainz
Affiliations: Imperial College London, London, UK; Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, DE
Abstract: This paper presents the use of Multi-Agent Reinforcement Learning (MARL) to perform navigation in 3D anatomical volumes from medical imaging. We utilize Neural Style Transfer to create synthetic Computed Tomography (CT) agent gym environments and assess the generalization capabilities of our agents to clinical CT volumes. Our framework does not require any labelled clinical data and integrates easily with several image translation techniques, enabling cross-modality applications. Further, we solely condition our agents on 2D slices, breaking ground for 3D guidance in much more difficult imaging modalities, such as ultrasound imaging. This is an important step towards user guidance during the acquisition of standardised diagnostic view planes, improving diagnostic consistency and facilitating better case comparison.

Semi/Weakly/Unsupervised, Active Learning, Uncertainty (1 paper)

【1】 Generalized Radiograph Representation Learning via Cross-supervision between Images and Free-text Radiology Reports
Link: https://arxiv.org/abs/2111.03452

Authors: Hong-Yu Zhou, Xiaoyu Chen, Yinghao Zhang, Ruibang Luo, Liansheng Wang, Yizhou Yu
Affiliations: Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong; Department of Computer Science, Xiamen University, Xiamen, China
Note: Technical Report
Abstract: Pre-training lays the foundation for recent successes in radiograph analysis supported by deep learning. It learns transferable image representations by conducting large-scale fully-supervised or self-supervised learning on a source domain. However, supervised pre-training requires a complex and labor-intensive two-stage human-assisted annotation process, while self-supervised learning cannot compete with the supervised paradigm. To tackle these issues, we propose a cross-supervised methodology named REviewing FreE-text Reports for Supervision (REFERS), which acquires free supervision signals from the original radiology reports accompanying the radiographs. The proposed approach employs a vision transformer and is designed to learn joint representations from multiple views within every patient study. REFERS outperforms its transfer learning and self-supervised learning counterparts on 4 well-known X-ray datasets under extremely limited supervision. Moreover, REFERS even surpasses methods based on a source domain of radiographs with human-assisted structured labels. Thus REFERS has the potential to replace canonical pre-training methodologies.

Temporal, Action Recognition, Pose, Video, Motion Estimation (6 papers)

【1】 Spatial-Temporal Residual Aggregation for High Resolution Video Inpainting
Link: https://arxiv.org/abs/2111.03574

Authors: Vishnu Sanjay Ramiya Srinivasan, Rui Ma, Qiang Tang, Zili Yi, Zhan Xu
Affiliations: Huawei Technologies, Canada; Jilin University, China
Note: Accepted by BMVC 2021. Project page: this https URL
Abstract: Recent learning-based inpainting algorithms have achieved compelling results for completing missing regions after removing undesired objects in videos. To maintain the temporal consistency among the frames, 3D spatial and temporal operations are often heavily used in the deep networks. However, these methods usually suffer from memory constraints and can only handle low-resolution videos. We propose STRA-Net, a novel spatial-temporal residual aggregation framework for high-resolution video inpainting. The key idea is to first learn and apply a spatial and temporal inpainting network on the downsampled low-resolution videos. Then, we refine the low-resolution results by aggregating the learned spatial and temporal image residuals (details) to the upsampled inpainted frames. Both the quantitative and qualitative evaluations show that we can produce more temporally coherent and visually appealing results than the state-of-the-art methods on inpainting high-resolution videos.

【2】 Synchronized Smartphone Video Recording System of Depth and RGB Image Frames with Sub-millisecond Precision
Link: https://arxiv.org/abs/2111.03552

Authors: Marsel Faizullin, Anastasiia Kornilova, Azat Akhmetyanov, Konstantin Pakulev, Andrey Sadkov, Gonzalo Ferrer
Affiliations: Skolkovo Institute of Science and Technology, Moscow, Russia
Note: IEEE Sensors Journal submitted paper
Abstract: In this paper, we propose a recording system with high time synchronization (sync) precision which consists of heterogeneous sensors such as a smartphone, depth camera, IMU, etc. Due to the general interest in and mass adoption of smartphones, we include at least one such device in our system. This heterogeneous system requires hybrid synchronization for the two different time authorities: smartphone and MCU, where we combine hardware wired-based trigger sync with software sync. We evaluate our sync results on a custom and novel system mixing an active infrared depth camera with an RGB camera. Our system achieves sub-millisecond precision of time sync. Moreover, our system exposes every RGB-depth image pair at the same time with this precision. We showcase one configuration in particular, but the general principles behind our system can be replicated by other projects.

【3】 DriveGuard: Robustification of Automated Driving Systems with Deep Spatio-Temporal Convolutional Autoencoder
Link: https://arxiv.org/abs/2111.03480

Authors: Andreas Papachristodoulou, Christos Kyrkou, Theocharis Theocharides
Affiliations: KIOS Research and Innovation Center of Excellence, Department of Electrical and Computer Engineering, University of Cyprus, Panepistimiou Avenue, Nicosia, Cyprus
Note: 2021 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW)
Abstract: Autonomous vehicles increasingly rely on cameras to provide the input for perception and scene understanding, and the ability of these models to classify their environment and objects under adverse conditions and image noise is crucial. When the input is deteriorated, either unintentionally or through targeted attacks, the reliability of the autonomous vehicle is compromised. In order to mitigate such phenomena, we propose DriveGuard, a lightweight spatio-temporal autoencoder, as a solution to robustify the image segmentation process for autonomous vehicles. By first processing camera images with DriveGuard, we offer a more universal solution than having to re-train each perception model with noisy input. We explore the space of different autoencoder architectures and evaluate them on a diverse dataset created with real and synthetic images, demonstrating that by exploiting spatio-temporal information combined with a multi-component loss we significantly increase robustness against adverse image effects, reaching within 5-6% of the original model on clean images.

【4】 Technical Report: Disentangled Action Parsing Networks for Accurate Part-level Action Parsing
Link: https://arxiv.org/abs/2111.03225

Authors: Xuanhan Wang, Xiaojia Chen, Lianli Gao, Lechao Chen, Jingkuan Song
Affiliations: Center for Future Media, University of Electronic Science and Technology of China, Chengdu, China; Zhejiang Lab, Hangzhou, China
Abstract: Part-level Action Parsing aims at part state parsing for boosting action recognition in videos. Despite dramatic progress in the area of video classification research, a severe problem faced by the community is that the detailed understanding of human actions is ignored. Our motivation is that parsing human actions requires building models that focus on the specific problem. We present a simple yet effective approach, named disentangled action parsing (DAP). Specifically, we divide part-level action parsing into three stages: 1) person detection, where a person detector is adopted to detect all persons from videos and perform instance-level action recognition; 2) part parsing, where a part-parsing model is proposed to recognize human parts from detected person images; and 3) action parsing, where a multi-modal action parsing network is used to parse action categories conditioning on all detection results obtained from the previous stages. With these three major models applied, our DAP approach records a global mean score of $0.605$ in the 2021 Kinetics-TPS Challenge.

【5】 Skeleton-Split Framework using Spatial Temporal Graph Convolutional Networks for Action Recognition
Link: https://arxiv.org/abs/2111.03106

Authors: Motasem Alsawadi, Miguel Rio
Affiliations: Department of Electronic and Electrical Engineering, University College London, UK; King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia
Abstract: There has been a dramatic increase in the volume of videos and their related content uploaded to the internet. Accordingly, the need for efficient algorithms to analyse this vast amount of data has attracted significant research interest. An action recognition system based upon human body motions has been proven to interpret video contents accurately. This work aims to recognize activities of daily living using the ST-GCN model, providing a comparison between four different partitioning strategies: spatial configuration partitioning, full distance split, connection split, and index split. To achieve this aim, we present the first implementation of the ST-GCN framework upon the HMDB-51 dataset. We achieved 48.88% top-1 accuracy by using the connection split partitioning approach. Through experimental simulation, we show that our proposals achieve higher accuracy on the UCF-101 dataset using the ST-GCN framework than the state-of-the-art approach. Finally, 73.25% top-1 accuracy is achieved by using the index split partitioning strategy.
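The partitioning strategies compared here all amount to different ways of splitting the skeleton adjacency matrix into subsets, each of which receives its own weight matrix inside the ST-GCN layer. The toy NumPy sketch below contrasts two such splits on a 5-joint chain; the subset definitions are simplified guesses at the named strategies, not the paper's exact formulas.

```python
# Toy illustration of adjacency-matrix partitioning for ST-GCN.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # a toy 5-joint kinematic chain
V = 5
A = np.zeros((V, V))
for i, j in edges:
    A[i, j] = A[j, i] = 1

# "Connection split"-style: self-loops vs. 1-hop neighbours
partitions_connection = [np.eye(V), A]

# "Index split"-style: self-loops plus each neighbour direction separately
partitions_index = [np.eye(V), np.triu(A), np.tril(A)]

# Each subset A_k then gets its own weight W_k in the graph convolution:
#   out = sum_k normalize(A_k) @ x @ W_k
```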

【6】 Versatile Learned Video Compression
Link: https://arxiv.org/abs/2111.03386

Authors: Runsen Feng, Zongyu Guo, Zhizheng Zhang, Zhibo Chen
Affiliations: University of Science and Technology of China; Microsoft Research Asia
Abstract: Learned video compression methods have demonstrated great promise in catching up with traditional video codecs in their rate-distortion (R-D) performance. However, existing learned video compression schemes are limited by the binding of the prediction mode and the fixed network framework. They are unable to support various inter prediction modes and are thus inapplicable to various scenarios. In this paper, to break this limitation, we propose a versatile learned video compression (VLVC) framework that uses one model to support all possible prediction modes. Specifically, to realize versatile compression, we first build a motion compensation module that applies multiple 3D motion vector fields (i.e., voxel flows) for weighted trilinear warping in spatial-temporal space. The voxel flows convey the information of temporal reference position, which helps to decouple inter prediction modes from framework design. Secondly, in the case of multiple-reference-frame prediction, we apply a flow prediction module to predict accurate motion trajectories with a unified polynomial function. We show that the flow prediction module can largely reduce the transmission cost of voxel flows. Experimental results demonstrate that our proposed VLVC not only supports versatile compression in various settings but also achieves comparable R-D performance with the latest VVC standard in terms of MS-SSIM.
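The weighted trilinear warping in spatio-temporal space can be pictured with torch.nn.functional.grid_sample on a 5-D volume, where the third grid coordinate indexes time. A minimal sketch follows (shapes and the random flow are placeholders; the real model would blend several warped volumes with learned weights):

```python
# Sketch: trilinear warping of stacked reference frames with a voxel flow.
import torch
import torch.nn.functional as F

B, C, T, H, W = 1, 3, 4, 64, 64
ref = torch.randn(B, C, T, H, W)             # stacked reference frames

# base sampling grid in normalized [-1, 1] coordinates, order (x, y, t)
t = torch.linspace(-1, 1, T)
y = torch.linspace(-1, 1, H)
x = torch.linspace(-1, 1, W)
tt, yy, xx = torch.meshgrid(t, y, x, indexing="ij")
grid = torch.stack((xx, yy, tt), dim=-1)[None]         # (1, T, H, W, 3)

voxel_flow = 0.05 * torch.randn(B, T, H, W, 3)         # predicted offsets
warped = F.grid_sample(ref, grid + voxel_flow, align_corners=True)
# warped: (B, C, T, H, W); several flows would be blended with weights
```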

GAN|对抗|攻击|生成相关(4篇)

【1】 A Unified Game-Theoretic Interpretation of Adversarial Robustness 标题:对抗性稳健性的统一博弈论解释 链接:https://arxiv.org/abs/2111.03536

作者:Jie Ren,Die Zhang,Yisen Wang,Lu Chen,Zhanpeng Zhou,Yiting Chen,Xu Cheng,Xin Wang,Meng Zhou,Jie Shi,Quanshi Zhang 机构:Shanghai Jiao Tong University, Key Lab. of Machine Perception (MoE), School of EECS, Peking University, Carnegie Mellon University, Huawei Technologies Inc. 备注:arXiv admin note: substantial text overlap with arXiv:2103.07364 摘要:本文提供了一个统一的视角来解释不同的对抗攻击和防御方法,即DNN输入变量之间多阶交互的视角。基于多阶交互,我们发现对抗攻击主要通过影响高阶交互来愚弄DNN。此外,我们发现经过对抗训练的DNN的鲁棒性来自特定类别的低阶交互。我们的发现提供了一种统一对抗扰动与鲁棒性的潜在方法,能够以更具原则性的方式解释现有的防御方法。此外,我们的发现还修正了以往对对抗学习特征的形状偏差的不准确理解。 摘要:This paper provides a unified view to explain different adversarial attacks and defense methods, i.e., the view of multi-order interactions between input variables of DNNs. Based on the multi-order interaction, we discover that adversarial attacks mainly affect high-order interactions to fool the DNN. Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings provide a potential method to unify adversarial perturbations and robustness, which can explain the existing defense methods in a principled way. Besides, our findings also revise the previous inaccurate understanding of the shape bias of adversarially learned features.
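
作为补充,该团队此前关于博弈论交互的工作中,变量i与j之间的m阶交互通常定义如下(记号为整理后的示意,非本文逐字公式):

```latex
% m阶交互: 在除 i, j 外随机抽取 |S| = m 个变量作为上下文时, i 与 j 的平均边际交互效应
I^{(m)}(i,j) \;=\; \mathbb{E}_{S \subseteq N \setminus \{i,j\},\ |S| = m}
\Big[\, f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S) \,\Big]
```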

【2】 A Deep Learning Generative Model Approach for Image Synthesis of Plant Leaves 标题:一种用于植物叶片图像合成的深度学习产生式模型方法 链接:https://arxiv.org/abs/2111.03388

作者:Alessandro Benfenati,Davide Bolzi,Paola Causin,Roberto Oberti 机构: Dept. of Environmental Science and Policy, Universita degli Studi di Milano, Milano, Dept. of Mathematics, Universita degli Studi di Milano, Milano, Italy, Dept. of Agricultural and Environmental Sciences - Production, Landscape 摘要:目标。我们通过先进的深度学习(DL)技术以自动化方式生成人工叶片图像,旨在为现代作物管理的人工智能应用提供训练样本来源。此类应用需要大量数据;虽然叶片图像并非真正稀缺,但图像采集和标注仍然是非常耗时的过程。数据稀缺可以通过数据增强技术解决,即对小数据集中的样本做简单变换,但增强数据的丰富性有限:这促使人们寻找替代方法。方法。我们采用基于DL生成模型的方法,提出了一种分两步进行的叶到叶翻译(L2L)过程:首先,残差变分自动编码器结构从真实图像的二值化骨架出发,生成合成叶骨架(叶轮廓和叶脉);第二步,通过Pix2pix框架进行翻译,该框架使用条件生成对抗网络再现叶片的着色,同时保留形状和叶脉模式。结果。L2L过程能生成外观逼真的叶片合成图像。我们以定性和定量两种方式衡量性能;对于定量评估,我们采用DL异常检测策略,量化合成叶片相对于真实样本的异常程度。结论。生成式DL方法有潜力成为一种新范式,为计算机辅助应用提供低成本、有意义的合成样本。当前的L2L方法是朝这一目标迈出的一步,能够生成在定性和定量上与真实叶片高度相似的合成样本。 摘要:Objectives. We generate via advanced Deep Learning (DL) techniques artificial leaf images in an automatized way. We aim to provide a source of training samples for AI applications for modern crop management. Such applications require large amounts of data and, while leaf images are not truly scarce, image collection and annotation remains a very time-consuming process. Data scarcity can be addressed by augmentation techniques consisting of simple transformations of samples belonging to a small dataset, but the richness of the augmented data is limited: this motivates the search for alternative approaches. Methods. Pursuing an approach based on DL generative models, we propose a Leaf-to-Leaf Translation (L2L) procedure structured in two steps: first, a residual variational autoencoder architecture generates synthetic leaf skeletons (leaf profile and veins) starting from companion binarized skeletons of real images. In a second step, we perform translation via a Pix2pix framework, which uses conditional generative adversarial networks to reproduce the colorization of leaf blades, preserving the shape and the venation pattern. Results. The L2L procedure generates synthetic images of leaves with a realistic appearance. We address the performance measurement both in a qualitative and a quantitative way; for the latter evaluation, we employ a DL anomaly detection strategy which quantifies the degree of anomaly of synthetic leaves with respect to real samples. Conclusions. Generative DL approaches have the potential to be a new paradigm to provide low-cost meaningful synthetic samples for computer-aided applications. The present L2L approach represents a step towards this goal, being able to generate synthetic samples with a relevant qualitative and quantitative resemblance to real leaves.

【3】 Seamless Satellite-image Synthesis 标题:无缝卫星图像合成 链接:https://arxiv.org/abs/2111.03384

作者:Jialin Zhu,Tom Kelly 机构:University of Leeds 备注:12 pages 摘要:我们介绍了无缝卫星图像合成(SSS),这是一种从地图数据创建在尺度和空间上连续的卫星纹理的新型神经结构。虽然2D地图数据便宜且易于合成,但精确的卫星图像价格昂贵,而且往往无法获得或已经过时。我们的方法可在任意大的空间范围内生成无缝纹理,并在整个尺度空间内保持一致。为了克服图像到图像转换方法中的图块尺寸限制,SSS学习以语义上有意义的方式消除图块图像之间的接缝。尺度空间的连续性则通过以风格和制图数据为条件的网络层次结构来实现。我们的定性和定量评估表明,我们的系统在几个关键方面优于最先进的技术。我们展示了其在程序化生成地图的纹理化以及交互式卫星图像编辑方面的应用。 摘要:We introduce Seamless Satellite-image Synthesis (SSS), a novel neural architecture to create scale-and-space continuous satellite textures from cartographic data. While 2D map data is cheap and easily synthesized, accurate satellite imagery is expensive and often unavailable or out of date. Our approach generates seamless textures over arbitrarily large spatial extents which are consistent through scale-space. To overcome tile size limitations in image-to-image translation approaches, SSS learns to remove seams between tiled images in a semantically meaningful manner. Scale-space continuity is achieved by a hierarchy of networks conditioned on style and cartographic data. Our qualitative and quantitative evaluations show that our system improves over the state-of-the-art in several key areas. We show applications to texturing procedurally generated maps and interactive satellite image manipulation.

【4】 GraN-GAN: Piecewise Gradient Normalization for Generative Adversarial Networks 标题:GRAN-GAN:生成性对抗网络的分段梯度归一化 链接:https://arxiv.org/abs/2111.03162

作者:Vineeth S. Bhaskara,Tristan Aumentado-Armstrong,Allan Jepson,Alex Levinshtein 机构:Samsung AI Centre Toronto, University of Toronto, Vector Institute for AI 备注:WACV 2022 Main Conference Paper (Submitted: 18 Aug 2021, Accepted: 4 Oct 2021) 摘要:现代生成对抗网络(GAN)的判别器(或批评器)主要使用分段线性激活函数,包括ReLU和LeakyReLU。这类模型学习分段线性映射,其中每个分段处理输入空间的一个子集,每个子集上的梯度为分段常数。在这类判别器(或批评器)函数下,我们提出了梯度归一化(GraN),这是一种新的与输入相关的归一化方法,可保证输入空间中的分段K-Lipschitz约束。与谱归一化不同,GraN不限制单个网络层的处理;与梯度惩罚不同,GraN几乎在所有位置都严格执行分段Lipschitz约束。经验上,我们在多个数据集(包括CIFAR-10/100、STL-10、LSUN卧室和CelebA)、GAN损失函数和度量指标上展示了更好的图像生成性能。此外,我们分析了在几个标准GAN中改变通常未调优的Lipschitz常数K的影响,不仅获得了显著的性能提升,还发现了K与训练动态之间的联系,特别是在使用常用的Adam优化器时出现的低梯度损失平台期。 摘要:Modern generative adversarial networks (GANs) predominantly use piecewise linear activation functions in discriminators (or critics), including ReLU and LeakyReLU. Such models learn piecewise linear mappings, where each piece handles a subset of the input space, and the gradients per subset are piecewise constant. Under such a class of discriminator (or critic) functions, we present Gradient Normalization (GraN), a novel input-dependent normalization method, which guarantees a piecewise K-Lipschitz constraint in the input space. In contrast to spectral normalization, GraN does not constrain processing at the individual network layers, and, unlike gradient penalties, strictly enforces a piecewise Lipschitz constraint almost everywhere. Empirically, we demonstrate improved image generation performance across multiple datasets (incl. CIFAR-10/100, STL-10, LSUN bedrooms, and CelebA), GAN loss functions, and metrics. Further, we analyze altering the often untuned Lipschitz constant K in several standard GANs, not only attaining significant performance gains, but also finding connections between K and training dynamics, particularly in low-gradient loss plateaus, with the common Adam optimizer.
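
梯度归一化的核心思想可以用如下PyTorch最小示意来理解(简化版本,非论文的确切公式:用判别器输出对输入的梯度范数去缩放输出,使每个线性分段内的局部Lipschitz常数被规范到K):

```python
import torch

def gradient_normalize(critic, x, K=1.0, eps=1e-8):
    """对分段线性判别器做输入相关的梯度归一化(示意)。
    在每个线性分段内 critic(x) = w·x + b, 归一化后该分段的梯度范数恒为K。"""
    x = x.requires_grad_(True)
    y = critic(x)                                            # (N, 1) 原始判别器输出
    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    gnorm = grad.flatten(1).norm(dim=1).view(-1, 1)          # 每个样本的输入梯度范数
    return K * y / (gnorm + eps)                             # 分段K-Lipschitz的输出
```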

自动驾驶|车辆|车道检测等(1篇)

【1】 ProSTformer: Pre-trained Progressive Space-Time Self-attention Model for Traffic Flow Forecasting 标题:ProSTformer:面向交通流预测的预训练渐进式时空自注意模型 链接:https://arxiv.org/abs/2111.03459

作者:Xiao Yan,Xianghua Gan,Jingjing Tang,Rui Wang 机构: School of Business Administration, Southwestern University of Finance and Economics. 摘要:交通流预测对智能城市管理和公共安全至关重要,也极具挑战性。最近的研究表明,无卷积的Transformer方法有潜力提取复杂影响因素之间的动态相关性。然而,有两个问题阻碍了该方法在交通流预测中的有效应用。首先,它忽略了交通流视频的时空结构。其次,对于长序列而言,点积计算量随序列长度呈平方增长,模型难以聚焦于关键的注意力。为了解决这两个问题,我们首先对依赖关系进行分解,然后设计了一种名为ProSTformer的渐进式时空自注意机制。它有两个显著特征:(1)与分解相对应,自注意机制依次关注从局部到全局区域的空间依赖、从片段内部到外部(即接近度、周期和趋势)的时间依赖,最后关注天气、温度和星期几等外部依赖;(2)通过将时空结构融入自注意机制,ProSTformer中的每个块聚合具有时空位置的区域以突出独特的依赖关系,从而显著减少计算量。我们在两个交通数据集上评估ProSTformer,每个数据集都包含大、中、小三种规模的独立数据集。尽管与用于交通流预测的卷积结构相比,ProSTformer的设计截然不同,但在大规模数据集上,其RMSE性能优于或持平于六种最先进的基线方法。在大规模数据集上预训练并迁移到中小规模数据集时,ProSTformer获得了显著提升,表现最佳。 摘要:Traffic flow forecasting is essential and challenging for intelligent city management and public safety. Recent studies have shown the potential of convolution-free Transformer approaches to extract the dynamic dependencies among complex influencing factors. However, two issues prevent the approach from being effectively applied in traffic flow forecasting. First, it ignores the spatiotemporal structure of the traffic flow videos. Second, for a long sequence, it is hard to focus on crucial attention due to the quadratic cost of dot-product computation. To address the two issues, we first factorize the dependencies and then design a progressive space-time self-attention mechanism named ProSTformer. It has two distinctive characteristics: (1) corresponding to the factorization, the self-attention mechanism progressively focuses on spatial dependence from local to global regions, on temporal dependence from inside to outside fragments (i.e., closeness, period, and trend), and finally on external dependence such as weather, temperature, and day-of-week; (2) by incorporating the spatiotemporal structure into the self-attention mechanism, each block in ProSTformer highlights the unique dependence by aggregating the regions with spatiotemporal positions to significantly decrease the computation. We evaluate ProSTformer on two traffic datasets, and each dataset includes three separate datasets with big, medium, and small scales. Despite the radically different design compared to the convolutional architectures for traffic flow forecasting, ProSTformer performs better than or on par with six state-of-the-art baseline methods in terms of RMSE on the big-scale datasets. When pre-trained on the big-scale datasets and transferred to the medium- and small-scale datasets, ProSTformer achieves a significant enhancement and performs best.
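
"先空间、后时间"的分解式自注意可用如下通用示意说明(PyTorch;这只是此类分解的一般形式,ProSTformer的渐进式设计更为精细,类名与接口均为示意):

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    """先在每帧内做空间注意力, 再在每个空间位置上做时间注意力(示意)。"""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, P, C), T为时间步数, P为每帧的空间token数
        B, T, P, C = x.shape
        s = x.reshape(B * T, P, C)                        # 逐帧空间注意力
        s, _ = self.spatial(s, s, s)
        s = s.reshape(B, T, P, C)
        t = s.permute(0, 2, 1, 3).reshape(B * P, T, C)    # 逐位置时间注意力
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, P, T, C).permute(0, 2, 1, 3)
```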

OCR|文本相关(2篇)

【1】 Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval 标题:否定样本各有其否定之道:为图文检索量身定制否定句 链接:https://arxiv.org/abs/2111.03349

作者:Zhihao Fan,Zhongyu Wei,Zejun Li,Siyuan Wang,Jianqing Fan 机构: Fudan University, Princeton University, Research Institute of Intelligent and Complex Systems, Fudan University, China 摘要:匹配模型是图文检索框架的基础。现有研究通常使用三元组损失训练模型,并探索各种策略在数据集中检索困难否定句。我们认为,当前基于检索的负样本构造方法受限于数据集的规模,因而无法为每幅图像找到足够高难度的负样本。我们提出了基于判别与纠正的否定句定制方法(TAGS-DC),自动生成合成句子作为负样本。TAGS-DC由掩蔽和再填充组成,用以生成难度更高的合成否定句。为了在训练过程中保持难度,我们通过参数共享使检索和生成相互促进。为了进一步利用否定句中不匹配的细粒度语义,我们提出了两个辅助任务,即单词判别和单词纠正,以改进训练。在实验中,我们在MS-COCO和Flickr30K上验证了我们的模型相对于当前最先进模型的有效性,并在进一步分析中证明了其鲁棒性和可信性。我们的代码见 https://github.com/LibertFan/TAGS。 摘要:Matching models are essential for Image-Text Retrieval frameworks. Existing research usually trains the model with a triplet loss and explores various strategies to retrieve hard negative sentences in the dataset. We argue that the current retrieval-based negative sample construction approach is limited by the scale of the dataset and thus fails to identify negative samples of high difficulty for every image. We propose TAiloring neGative Sentences with Discrimination and Correction (TAGS-DC) to generate synthetic sentences automatically as negative samples. TAGS-DC is composed of masking and refilling to generate synthetic negative sentences with higher difficulty. To keep the difficulty during training, we mutually improve the retrieval and generation through parameter sharing. To further utilize the fine-grained semantics of mismatch in the negative sentence, we propose two auxiliary tasks, namely word discrimination and word correction, to improve the training. In experiments, we verify the effectiveness of our model on MS-COCO and Flickr30K compared with current state-of-the-art models and demonstrate its robustness and faithfulness in further analysis. Our code is available at https://github.com/LibertFan/TAGS.
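
"掩蔽-再填充"生成合成否定句的思路,可以用一个现成的掩码语言模型粗略示意(下例使用Hugging Face transformers的fill-mask管线,仅为思路演示,并非TAGS-DC的实际生成器):

```python
from transformers import pipeline

# 掩蔽句中一个内容词, 用掩码填充得到与原句高度相似但语义不匹配的否定句(示意)
fill = pipeline("fill-mask", model="bert-base-uncased")

caption = "a man riding a horse on the beach"
masked = caption.replace("horse", fill.tokenizer.mask_token)

negatives = [
    cand["sequence"]
    for cand in fill(masked, top_k=10)
    if cand["token_str"].strip() != "horse"   # 过滤掉会还原原句的填充词
]
print(negatives[:3])
```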

【2】 StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis 标题:StyleCLIPDraw:文本到图形合成中内容和样式的耦合 链接:https://arxiv.org/abs/2111.03133

作者:Peter Schaldenbrand,Zhixuan Liu,Jean Oh 机构:Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China 摘要:随着CLIP图文编码器等模型的发布,使用机器学习生成符合给定文本描述的图像已有很大进步;然而,目前的方法缺乏对所生成图像风格的艺术控制。我们提出了StyleCLIPDraw,它在CLIPDraw文本到绘图合成模型中加入了风格损失,使得除了通过文本控制内容外,还能对合成的绘图进行艺术控制。在生成图像上执行解耦的风格迁移只会影响纹理,而我们提出的耦合方法能够同时捕获纹理和形状中的风格,这表明绘图的风格与绘图过程本身是耦合的。更多结果和我们的代码见 https://github.com/pschaldenbrand/StyleCLIPDraw 摘要:Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control of the style of image to be generated. We introduce StyleCLIPDraw, which adds a style loss to the CLIPDraw text-to-drawing synthesis model to allow artistic control of the synthesized drawings in addition to control of the content via text. Whereas performing decoupled style transfer on a generated image only affects the texture, our proposed coupled approach is able to capture a style in both texture and shape, suggesting that the style of the drawing is coupled with the drawing process itself. More results and our code are available at https://github.com/pschaldenbrand/StyleCLIPDraw
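
其总损失可以示意为"CLIP内容项 + Gram矩阵风格项"的加权和(最小示意,非官方实现;其中clip_encode、vgg_feats为假设的特征提取接口):

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram矩阵: 特征通道间的相关性, 常用于刻画风格。feat: (N, C, H, W)"""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def styleclipdraw_loss(image, text_emb, style_img, clip_encode, vgg_feats, lam=1.0):
    """image为可微渲染出的绘图; clip_encode/vgg_feats为假设的CLIP与VGG特征函数。"""
    img_emb = clip_encode(image)   # 内容项: 绘图与文本的CLIP嵌入余弦相似度
    content = 1 - F.cosine_similarity(img_emb, text_emb, dim=-1).mean()
    style = sum(F.mse_loss(gram(a), gram(b))   # 风格项: 各层VGG特征的Gram差异
                for a, b in zip(vgg_feats(image), vgg_feats(style_img)))
    return content + lam * style
```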

超分辨率|去噪|去模糊|去雾(3篇)

【1】 Normalizing Flow as a Flexible Fidelity Objective for Photo-Realistic Super-resolution 标题:将归一化流作为照片真实感超分辨率的灵活保真度目标 链接:https://arxiv.org/abs/2111.03649

作者:Andreas Lugmayr,Martin Danelljan,Fisher Yu,Luc Van Gool,Radu Timofte 机构:CVL, ETH Zürich, Switzerland 备注:None 摘要:超分辨率是一个不适定问题,真实(ground-truth)高分辨率图像仅是可行解空间中的一种可能。然而,主流范式采用像素级损失(如L_1),这会使预测趋向模糊的均值。当与对抗损失结合时,这会带来根本上相互冲突的目标,从而降低最终质量。我们通过重新审视L_1损失来解决这个问题,并表明它对应于单层条件流。受这一关系启发,我们探索以一般的流模型作为基于保真度的L_1目标替代方案。我们证明,当与对抗损失相结合时,更深的流所带来的灵活性能够产生更好的视觉质量和一致性。我们在三个数据集和多个放大倍数上进行了广泛的用户研究,结果表明我们的方法在照片真实感超分辨率方面优于最先进的方法。代码和训练好的模型将在以下位置提供:git.io/AdFlow 摘要:Super-resolution is an ill-posed problem, where a ground-truth high-resolution image represents only one possibility in the space of plausible solutions. Yet, the dominant paradigm is to employ pixel-wise losses, such as L_1, which drive the prediction towards a blurry average. This leads to fundamentally conflicting objectives when combined with adversarial losses, which degrades the final quality. We address this issue by revisiting the L_1 loss and show that it corresponds to a one-layer conditional flow. Inspired by this relation, we explore general flows as a fidelity-based alternative to the L_1 objective. We demonstrate that the flexibility of deeper flows leads to better visual quality and consistency when combined with adversarial losses. We conduct extensive user studies for three datasets and scale factors, where our approach is shown to outperform state-of-the-art methods for photo-realistic super-resolution. Code and trained models will be available at: git.io/AdFlow
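
"L_1损失对应单层条件流"可以这样理解:取拉普拉斯分布为基分布、以均值预测网络做一次仿射变换,其负对数似然恰好退化为L_1(以下推导为示意):

```latex
% 单层条件流: z = (y - \mu(x)) / b, 基分布取拉普拉斯 p(z) \propto e^{-|z|}
-\log p(y \mid x) \;=\; \frac{\lVert y - \mu(x) \rVert_1}{b} + \text{const}
\quad\Longrightarrow\quad
\arg\min_{\mu}\, \mathbb{E}\big[-\log p(y \mid x)\big]
\;=\; \arg\min_{\mu}\, \mathbb{E}\,\lVert y - \mu(x) \rVert_1
```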

【2】 Frequency-Aware Physics-Inspired Degradation Model for Real-World Image Super-Resolution 标题:面向真实世界图像超分辨率的频率感知、物理启发退化模型 链接:https://arxiv.org/abs/2111.03301

作者:Zhenxing Dong,Hong Cao,Wang Shen,Yu Gan,Yuye Ling,Guangtao Zhai,Yikai Su 机构:John Hopcroft Center for Computer Science, Shanghai Jiao Tong University, Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Department of Electrical and Computer Engineering, University of Alabama, Tuscaloosa, AL, United States 备注:18 pages,8 figures 摘要:当前基于学习的单图像超分辨率(SISR)算法在真实数据上表现不佳,原因在于假设的退化过程与真实场景中的退化过程存在偏差。传统的退化过程考虑在高分辨率(HR)图像上施加模糊、噪声和下采样(通常为双三次下采样)来合成低分辨率(LR)对应图像。然而,退化建模方面的工作很少考虑光学成像系统的物理特性。在本文中,我们从光学角度分析成像系统,并研究真实世界LR-HR图像对在空间频域中的特性。通过同时考虑光学和传感器退化,我们建立了一个受物理启发的真实世界退化模型;成像系统的物理退化被建模为低通滤波器,其截止频率由物距、镜头焦距和图像传感器的像元尺寸决定。特别地,我们提出使用卷积神经网络(CNN)来学习真实世界退化过程的截止频率,然后用学习到的网络从非配对的HR图像合成LR图像。合成的HR-LR图像对随后用于训练SISR网络。我们在不同成像系统拍摄的真实图像上评估了该退化模型的有效性和泛化能力。实验结果表明,使用我们的合成数据训练的SISR网络优于使用传统退化模型训练的网络。此外,我们的结果与使用真实世界LR-HR图像对训练的同一网络相当,而这类图像对在真实场景中很难获得。 摘要:Current learning-based single image super-resolution (SISR) algorithms underperform on real data due to the deviation in the assumed degradation process from that in the real-world scenario. Conventional degradation processes consider applying blur, noise, and downsampling (typically bicubic downsampling) on high-resolution (HR) images to synthesize low-resolution (LR) counterparts. However, few works on degradation modelling have taken the physical aspects of the optical imaging system into consideration. In this paper, we analyze the imaging system optically and exploit the characteristics of the real-world LR-HR pairs in the spatial frequency domain. We formulate a real-world physics-inspired degradation model by considering both optics and sensor degradation; the physical degradation of an imaging system is modelled as a low-pass filter, whose cut-off frequency is dictated by the object distance, the focal length of the lens, and the pixel size of the image sensor. In particular, we propose to use a convolutional neural network (CNN) to learn the cutoff frequency of the real-world degradation process. The learned network is then applied to synthesize LR images from unpaired HR images. The synthetic HR-LR image pairs are later used to train an SISR network. We evaluate the effectiveness and generalization capability of the proposed degradation model on real-world images captured by different imaging systems. Experimental results showcase that the SISR network trained by using our synthetic data performs favorably against the network using the traditional degradation model. Moreover, our results are comparable to those obtained by the same network trained by using real-world LR-HR pairs, which are challenging to obtain in real scenes.
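
若把传感器的奈奎斯特频率映射回物方空间,截止频率与物距、焦距、像元尺寸的关系可作如下粗略估算(基于针孔成像模型的简化推导,仅为示意,并非论文中的公式):

```python
def object_space_cutoff(focal_length_m, pixel_size_m, object_distance_m):
    """物方截止空间频率的粗略估计(周期/米, 示意)。
    物方长度 s 在传感器上约成像为 s * f / Z, 传感器奈奎斯特频率为 1/(2p),
    故物方可分辨的最高空间频率约为 f / (2 * p * Z)。"""
    return focal_length_m / (2.0 * pixel_size_m * object_distance_m)

# 例: 50mm镜头、4um像元、物距10m, 约625周期/米
print(object_space_cutoff(0.05, 4e-6, 10.0), "cycles/m")
```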

【3】 Multi-Spectral Multi-Image Super-Resolution of Sentinel-2 with Radiometric Consistency Losses and Its Effect on Building Delineation 标题:具有辐射一致性损失的哨兵2号多光谱多图像超分辨率及其对建筑物勾画的影响 链接:https://arxiv.org/abs/2111.03231

作者:Muhammed Razzak,Gonzalo Mateo-Garcia,Luis Gómez-Chova,Yarin Gal,Freddie Kalaitzis 摘要:高分辨率遥感图像被用于包括目标检测和分类在内的广泛任务。然而,高分辨率图像价格昂贵,而低分辨率图像通常免费提供,公众可以将其用于各种社会公益应用。为此,我们构建了一个多光谱多图像超分辨率数据集,以SpaceNet 7挑战赛的PlanetScope图像作为高分辨率参考,并以同一区域的多次Sentinel-2重访图像作为低分辨率图像。我们给出了首次将多图像超分辨率(MISR)应用于多光谱遥感图像的结果。此外,我们在MISR模型中引入了辐射一致性模块,以保持Sentinel-2传感器的高辐射分辨率。我们表明,在一系列图像保真度指标上,MISR优于单图像超分辨率和其他基线方法。此外,我们还首次评估了多图像超分辨率对建筑物勾画的效用,结果表明利用多幅图像能在这些下游任务中获得更好的性能。 摘要:High-resolution remote sensing imagery is used in a broad range of tasks, including detection and classification of objects. High-resolution imagery is, however, expensive, while lower-resolution imagery is often freely available and can be used by the public for a range of social good applications. To that end, we curate a multi-spectral multi-image super-resolution dataset, using PlanetScope imagery from the SpaceNet 7 challenge as the high-resolution reference and multiple Sentinel-2 revisits of the same imagery as the low-resolution imagery. We present the first results of applying multi-image super-resolution (MISR) to multi-spectral remote sensing imagery. We additionally introduce a radiometric consistency module into the MISR model to preserve the high radiometric resolution of the Sentinel-2 sensor. We show that MISR is superior to single-image super-resolution and other baselines on a range of image fidelity metrics. Furthermore, we conduct the first assessment of the utility of multi-image super-resolution on building delineation, showing that utilising multiple images results in better performance in these downstream tasks.
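
辐射一致性约束的一种常见做法,是要求超分结果退化回低分辨率后与输入保持一致,可示意如下(最小示意,非论文实现;下采样算子与损失形式均为假设):

```python
import torch
import torch.nn.functional as F

def radiometric_consistency_loss(sr, lr, scale=3):
    """把超分输出平均池化回LR尺度, 约束其辐射量与输入LR一致(示意)。
    sr: (N, C, H*scale, W*scale) 超分输出; lr: (N, C, H, W) 低分辨率输入"""
    sr_down = F.avg_pool2d(sr, kernel_size=scale)   # 近似传感器的积分式降采样
    return F.l1_loss(sr_down, lr)
```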

点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】 Interpreting Representation Quality of DNNs for 3D Point Cloud Processing 标题:三维点云处理中DNNs表示质量的解释 链接:https://arxiv.org/abs/2111.03549

作者:Wen Shen,Qihan Ren,Dongrui Liu,Quanshi Zhang 机构:Shanghai Jiao Tong University, Tongji University 摘要:在本文中,我们评估了三维点云处理中深层神经网络(DNN)编码的知识表示的质量。我们提出了一种将整体模型脆弱性分解为对旋转、平移、缩放和局部3D结构的敏感性的方法。此外,我们还提出了评估编码3D结构的空间平滑度和DNN表示复杂性的指标。基于这样的分析,实验揭示了经典DNN的表征问题,并解释了对抗训练的效用。 摘要:In this paper, we evaluate the quality of knowledge representations encoded in deep neural networks (DNNs) for 3D point cloud processing. We propose a method to disentangle the overall model vulnerability into the sensitivity to the rotation, the translation, the scale, and local 3D structures. Besides, we also propose metrics to evaluate the spatial smoothness of encoding 3D structures, and the representation complexity of the DNN. Based on such analysis, experiments expose representation problems with classic DNNs, and explain the utility of the adversarial training.
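
把整体脆弱性拆解为对各类变换的敏感性,一个直观的度量是比较变换前后网络输出的差异(以下为通用示意,并非论文的解耦方法;model的接口为假设):

```python
import math
import torch

def transform_sensitivity(model, points, transform, n_trials=16):
    """对点云施加随机变换, 以输出变化的均值粗略度量敏感性(示意)。
    points: (N, P, 3); transform: 返回变换后点云的函数"""
    with torch.no_grad():
        base = model(points)
        diffs = [(model(transform(points)) - base).norm(dim=-1).mean()
                 for _ in range(n_trials)]
    return torch.stack(diffs).mean()

def random_rotation_z(points):
    """绕z轴的随机旋转, 作为变换示例。"""
    theta = float(torch.rand(1)) * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T
```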

多模态(1篇)

【1】 BiosecurID: a multimodal biometric database 标题:BiosecurID:一个多模态生物特征数据库 链接:https://arxiv.org/abs/2111.03472

作者:Julian Fierrez,Javier Galbally,Javier Ortega-Garcia,Manuel R Freire,Fernando Alonso-Fernandez,Daniel Ramos,Doroteo Torre Toledano,Joaquin Gonzalez-Rodriguez,Juan A Siguenza,Javier Garrido-Salas,E Anguiano,Guillermo Gonzalez-de-Rivera,Ricardo Ribalda,Marcos Faundez-Zanuy,JA Ortega,Valentín Cardeñoso-Payo,A Viloria,Carlos E Vivaracho,Q Isaac Moro,Juan J Igarza,J Sanchez,Inmaculada Hernaez,Carlos Orrite-Urunuela,Francisco Martinez-Contreras,Juan José Gracia-Roche 备注:Published at Pattern Analysis and Applications journal 摘要:本文介绍了在BiosecurID项目框架内采集的一个新的多模态生物特征数据库,并描述了采集设置和协议。该数据库包括八种单模态生物特征,即:语音、虹膜、人脸(静态图像、说话人脸视频)、手写签名和手写文本(在线动态信号、离线扫描图像)、指纹(通过两种不同传感器采集)、手部(掌纹、轮廓几何)和击键。该数据库包含400名受试者,具有以下特点:真实的采集场景、均衡的性别和人口分布、特定人群(年龄、性别、利手)信息的可用性、语音和击键重放攻击样本的采集、签名的熟练伪造样本,以及与其他现有数据库的兼容性。所有这些特点使其在单模态和多模态生物识别系统的研究和开发中非常有用。 摘要:A new multimodal biometric database, acquired in the framework of the BiosecurID project, is presented together with the description of the acquisition setup and protocol. The database includes eight unimodal biometric traits, namely: speech, iris, face (still images, videos of talking faces), handwritten signature and handwritten text (on-line dynamic signals, off-line scanned images), fingerprints (acquired with two different sensors), hand (palmprint, contour-geometry) and keystroking. The database comprises 400 subjects and presents features such as: realistic acquisition scenario, balanced gender and population distributions, availability of information about particular demographic groups (age, gender, handedness), acquisition of replay attacks for speech and keystroking, skilled forgeries for signatures, and compatibility with other existing databases. All these characteristics make it very useful in research and development of unimodal and multimodal biometric systems.

其他神经网络|深度学习|模型|建模(3篇)

【1】 Single Image Deraining Network with Rain Embedding Consistency and Layered LSTM 标题:具有雨嵌入一致性和分层LSTM的单幅图像去雨网络 链接:https://arxiv.org/abs/2111.03615

作者:Yizhou Li,Yusuke Monno,Masatoshi Okutomi 机构:Tokyo Institute of Technology, Japan 备注:Accepted by WACV2022, January 2022 摘要:单幅图像去雨通常被视为残差学习问题,即从输入的雨图像中预测雨层。为此,编码器-解码器网络受到广泛关注:编码器需要编码高质量的雨嵌入,它决定了后续解码阶段重建雨层的性能。然而,大多数现有研究忽视了雨嵌入质量的重要性,导致出现过度或不足去雨,性能受限。在本文中,基于雨到雨自动编码器对雨层具有很高重建性能的观察,我们引入了"雨嵌入一致性"的思想:将自动编码器得到的编码嵌入视为理想雨嵌入,通过提高理想雨嵌入与去雨网络编码器所得雨嵌入之间的一致性来提升去雨性能。为此,我们应用雨嵌入损失直接监督编码过程,并以校正局部对比度归一化(RLCN)作为指导,有效提取候选雨像素。我们还提出了分层LSTM,用于循环去雨以及考虑不同尺度的细粒度编码器特征细化。定性和定量实验表明,我们提出的方法优于以往最先进的方法,尤其是在真实世界数据集上。我们的源代码见 http://www.ok.sc.e.titech.ac.jp/res/SIR/。 摘要:Single image deraining is typically addressed as residual learning to predict the rain layer from an input rainy image. For this purpose, an encoder-decoder network draws wide attention, where the encoder is required to encode a high-quality rain embedding which determines the performance of the subsequent decoding stage to reconstruct the rain layer. However, most existing studies ignore the significance of rain embedding quality, thus leading to limited performance with over/under-deraining. In this paper, with our observation of the high rain layer reconstruction performance by a rain-to-rain autoencoder, we introduce the idea of "Rain Embedding Consistency" by regarding the encoded embedding by the autoencoder as an ideal rain embedding and aim at enhancing the deraining performance by improving the consistency between the ideal rain embedding and the rain embedding derived by the encoder of the deraining network. To achieve this, a Rain Embedding Loss is applied to directly supervise the encoding process, with a Rectified Local Contrast Normalization (RLCN) as the guide that effectively extracts the candidate rain pixels. We also propose Layered LSTM for recurrent deraining and fine-grained encoder feature refinement considering different scales. Qualitative and quantitative experiments demonstrate that our proposed method outperforms previous state-of-the-art methods particularly on a real-world dataset. Our source code is available at http://www.ok.sc.e.titech.ac.jp/res/SIR/.
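
其中的校正局部对比度归一化(RLCN)可以示意为"减局部均值、除局部标准差、再取正值"的组合(示意实现,窗口大小等超参数为假设):

```python
import torch
import torch.nn.functional as F

def rlcn(x, ksize=9, eps=1e-6):
    """校正局部对比度归一化的示意: 突出比邻域更亮的像素(如雨纹)。
    x: (N, C, H, W) 输入图像"""
    pad = ksize // 2
    mean = F.avg_pool2d(x, ksize, stride=1, padding=pad)        # 局部均值
    sq_mean = F.avg_pool2d(x * x, ksize, stride=1, padding=pad)
    std = (sq_mean - mean * mean).clamp_min(0).sqrt()           # 局部标准差
    return F.relu(x - mean) / (std + eps)                       # 校正(取正)并归一化
```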

【2】 Visualizing the Emergence of Intermediate Visual Patterns in DNNs 标题:在DNNs中可视化中间视觉模式的出现 链接:https://arxiv.org/abs/2111.03505

作者:Mingjie Li,Shaobo Wang,Quanshi Zhang 机构:Shanghai Jiao Tong University, Harbin Institute of Technology 摘要:本文提出了一种可视化DNN所编码的中间层视觉模式判别能力的方法。具体而言,我们可视化了(1)DNN在训练过程中如何逐渐学习各中间层中的区域性视觉模式,以及(2)DNN通过前向传播,利用低层的非判别性模式在中高层构建判别性模式的效果。基于我们的可视化方法,我们可以量化DNN学到的知识点(即判别性视觉模式的数量),以评估DNN的表示能力。此外,该方法还为对抗攻击和知识蒸馏等现有深度学习技术的信号处理行为提供了新的见解。 摘要:This paper proposes a method to visualize the discrimination power of intermediate-layer visual patterns encoded by a DNN. Specifically, we visualize (1) how the DNN gradually learns regional visual patterns in each intermediate layer during the training process, and (2) the effects of the DNN using non-discriminative patterns in low layers to construct discriminative patterns in middle/high layers through the forward propagation. Based on our visualization method, we can quantify knowledge points (i.e., the number of discriminative visual patterns) learned by the DNN to evaluate the representation capacity of the DNN. Furthermore, this method also provides new insights into signal-processing behaviors of existing deep-learning techniques, such as adversarial attacks and knowledge distillation.

【3】 Pathological Analysis of Blood Cells Using Deep Learning Techniques 标题:基于深度学习技术的血细胞病理学分析 链接:https://arxiv.org/abs/2111.03274

作者:Virender Ranga,Shivam Gupta,Priyansh Agrawal,Jyoti Meena 机构:Department of Computer Engineering, National Institute of Technology, Kurukshetra, Haryana, India, Department of Computer Science and Engineering, Indian Institute of Information Technology, Sonepat, Haryana, India 备注:None 摘要:病理学是通过分析身体样本来发现疾病原因的实践。该领域最常用的方法是组织学,即研究和观察细胞与组织的微观结构。玻片观察方法被广泛使用,并被转换为数字形式以生成高分辨率图像,这使得深度学习和机器学习领域得以深入医学领域。在本研究中,提出了一种基于神经网络的血细胞图像分类方法。当输入图像通过所提出的结构,并按照所提出的算法设置全部超参数和dropout比率时,该模型对血细胞图像分类的准确率达到95.24%。该模型的性能优于现有的标准体系结构和众多研究人员的工作。因此,该模型将有助于开发病理系统,从而减少人为错误和实验室人员的日常负担,进而帮助病理学家更高效地开展工作。 摘要:Pathology deals with the practice of discovering the reasons for disease by analyzing body samples. The most common method used in this field is histology, which is basically studying and viewing the microscopic structures of cells and tissues. The slide viewing method is widely used and has been converted into digital form to produce high-resolution images. This has enabled the fields of deep learning and machine learning to deep dive into medical sciences. In the present study, a neural network is proposed for the classification of blood cell images into various categories. When an input image is passed through the proposed architecture and all the hyperparameters and dropout ratio values are used in accordance with the proposed algorithm, the model classifies the blood images with an accuracy of 95.24%. The performance of the proposed model is better than existing standard architectures and the work done by various researchers. Thus the model will enable the development of a pathological system which will reduce human errors and the daily load on laboratory personnel. This will in turn help pathologists carry out their work more efficiently and effectively.

其他(7篇)

【1】 TermiNeRF: Ray Termination Prediction for Efficient Neural Rendering 标题:TermiNeRF:用于高效神经绘制的光线终止预测 链接:https://arxiv.org/abs/2111.03643

作者:Martin Piala,Ronald Clark 机构:Imperial College London 备注:3DV 2021; Project page with videos: this https URL 摘要:使用神经场的体渲染在捕捉和合成三维场景新视图方面展现出巨大潜力。但是,这类方法为了渲染图像,需要沿每条视线在多个点处查询体积网络,导致渲染速度非常慢。在本文中,我们提出了一种克服这一限制的方法:学习一个从相机光线到沿线最可能影响像素最终外观的位置的直接映射。使用这种方法,我们能够以比标准方法快一个数量级的速度渲染、训练和微调体渲染的神经场模型。与现有方法不同,我们的方法适用于一般体积,并且可以端到端训练。 摘要:Volume rendering using neural fields has shown great promise in capturing and synthesizing novel views of 3D scenes. However, this type of approach requires querying the volume network at multiple points along each viewing ray in order to render an image, resulting in very slow rendering times. In this paper, we present a method that overcomes this limitation by learning a direct mapping from camera rays to locations along the ray that are most likely to influence the pixel's final appearance. Using this approach we are able to render, train and fine-tune a volumetrically-rendered neural field model an order of magnitude faster than standard approaches. Unlike existing methods, our approach works with general volumes and can be trained end-to-end.
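
其思路可示意为:先用一个轻量网络把光线参数直接映射为沿线的终止分布,再仅在高概率深度附近查询NeRF(以下为高度简化的示意,网络结构与接口均为假设,并非论文实现):

```python
import torch
import torch.nn as nn

class RaySamplerNet(nn.Module):
    """把光线(原点+方向)映射为沿线n_bins个深度区间的终止概率(示意)。"""
    def __init__(self, n_bins=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),
        )

    def forward(self, rays_o, rays_d):
        logits = self.net(torch.cat([rays_o, rays_d], dim=-1))
        return torch.softmax(logits, dim=-1)          # 每条光线的区间权重

def sample_near_termination(weights, near, far, n_samples=16):
    """按预测权重做重要性采样, 只在可能影响像素的深度附近取样(示意)。"""
    n_bins = weights.shape[-1]
    edges = torch.linspace(near, far, n_bins + 1)
    idx = torch.multinomial(weights, n_samples, replacement=True)
    lo, hi = edges[idx], edges[idx + 1]
    return lo + (hi - lo) * torch.rand_like(lo)       # 区间内均匀抖动
```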

【2】 BBC-Oxford British Sign Language Dataset 标题:BBC-牛津英国手语数据集 链接:https://arxiv.org/abs/2111.03635

作者:Samuel Albanie,Gül Varol,Liliane Momeni,Hannah Bull,Triantafyllos Afouras,Himel Chowdhury,Neil Fox,Bencie Woll,Rob Cooper,Andrew McParland,Andrew Zisserman 摘要:在这项工作中,我们介绍了BBC牛津英国手语(BOBSL)数据集,这是一个英国手语(BSL)的大规模视频集合。BOBSL是一个基于先前工作中引入的BSL-1K数据集的扩展和公开发布的数据集。我们描述了数据集的动机,以及统计数据和可用的注释。我们进行实验,为手语识别、手语对齐和手语翻译任务提供基线。最后,我们从机器学习和语言学的角度描述了数据的一些优势和局限性,指出了数据集中存在的偏见来源,并讨论了BOBSL在手语技术方面的潜在应用。该数据集可在https://www.robots.ox.ac.uk/~vgg/data/bobsl/。 摘要:In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL). BOBSL is an extended and publicly released dataset based on the BSL-1K dataset introduced in previous work. We describe the motivation for the dataset, together with statistics and available annotations. We conduct experiments to provide baselines for the tasks of sign recognition, sign language alignment, and sign language translation. Finally, we describe several strengths and limitations of the data from the perspectives of machine learning and linguistics, note sources of bias present in the dataset, and discuss potential applications of BOBSL in the context of sign language technology. The dataset is available at https://www.robots.ox.ac.uk/~vgg/data/bobsl/.

【3】 Edge Tracing using Gaussian Process Regression 标题:基于高斯过程回归的边缘跟踪 链接:https://arxiv.org/abs/2111.03605

作者:Jamie Burke,Stuart King 机构:School of Mathematics, James Clerk Maxwell Building, The University of Edinburgh, Edinburgh, UK EH9 3FD 备注:15 pages, 6 figures. Accepted to be published in IEEE Transactions on Image Processing. Github repository: this https URL 摘要:介绍了一种基于高斯过程回归的新型边缘跟踪算法。我们的基于边缘的分割算法使用高斯过程回归对感兴趣的边缘建模,并以递归贝叶斯方案迭代地在图像中搜索边缘像素。该过程将来自图像梯度的局部边缘信息,与来自模型后验预测分布所采样后验曲线的全局结构信息相结合,依次构建并细化边缘像素的观测集。像素的不断累积使分布收敛到感兴趣的边缘。用户可在初始化时调节超参数,并可根据细化后的观测集对其进行优化。这种可调方法不需要任何事先训练,也不限于任何特定类型的成像领域。得益于模型的不确定性量化,该算法对降低图像边缘质量和连续性的伪影与遮挡具有鲁棒性。我们的方法还能够把前一幅图像的边缘跟踪结果作为后续图像的先验信息,从而高效地跟踪图像序列中的边缘。我们在医学成像和卫星成像的多种应用中验证了该技术,并与两种常用的边缘跟踪算法进行了比较。 摘要:We introduce a novel edge tracing algorithm using Gaussian process regression. Our edge-based segmentation algorithm models an edge of interest using Gaussian process regression and iteratively searches the image for edge pixels in a recursive Bayesian scheme. This procedure combines local edge information from the image gradient and global structural information from posterior curves, sampled from the model's posterior predictive distribution, to sequentially build and refine an observation set of edge pixels. This accumulation of pixels converges the distribution to the edge of interest. Hyperparameters can be tuned by the user at initialisation and optimised given the refined observation set. This tunable approach does not require any prior training and is not restricted to any particular type of imaging domain. Due to the model's uncertainty quantification, the algorithm is robust to artefacts and occlusions which degrade the quality and continuity of edges in images. Our approach also has the ability to efficiently trace edges in image sequences by using previous-image edge traces as a priori information for consecutive images. Various applications to medical imaging and satellite imaging are used to validate the technique and comparisons are made with two commonly used edge tracing algorithms.
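
用高斯过程回归建模一条边缘并给出带不确定性的预测,可用scikit-learn粗略示意(仅演示"以已接受的边缘像素拟合GP,再对后续位置给出后验均值与方差"这一步,数据与核函数均为示意,并非论文的完整递归算法):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# 已接受的边缘像素观测: x为列坐标, y为行坐标(示意数据)
obs_x = np.array([[0.0], [5.0], [10.0], [15.0], [20.0]])
obs_y = np.array([12.0, 13.0, 12.5, 14.0, 15.0])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0) + WhiteKernel(1e-2),
                              normalize_y=True)
gp.fit(obs_x, obs_y)

# 对后续列位置给出边缘行坐标的后验均值与标准差;
# 实际算法会再结合这些位置上的图像梯度强度来挑选下一批边缘像素
query_x = np.array([[25.0], [30.0]])
mean, std = gp.predict(query_x, return_std=True)
print(mean, std)
```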

【4】 Nondestructive Testing of Composite Fibre Materials with Hyperspectral Imaging: Evaluative Studies in the EU H2020 FibreEUse Project 标题:复合纤维材料的高光谱成像无损检测:欧盟H2020 FibreEUse项目中的评估研究 链接:https://arxiv.org/abs/2111.03443

作者:Yijun Yan,Jinchang Ren,Huan Zhao,James F. C. Windmill,Winifred Ijomah,Jesper de Wit,Justus von Freeden 机构: National Subsea Centre, Robert Gordon University, Aberdeen, U.K., Dept. of Electronic and Electrical Engineering, University of Strathclyde, Glasgow, U.K., Dept of Design, Manufacturing & Engineering Management, University of Strathclyde, Glasgow, U.K. 备注:11 pages, 12 figures 摘要:高光谱成像(HSI)通过捕获宽频率范围的光谱数据以及空间信息,可以检测温度、湿度和化学成分方面的微小差异。因此,HSI已被成功用于多种应用,包括用于安全和国防的遥感、用于植被和作物监测的精确农业,以及食品/饮料和药品的质量控制。然而,对于碳纤维增强聚合物(CFRP)的状态监测和损伤检测,HSI的使用是一个相对未触及的领域,因为现有的无损检测(NDT)技术主要侧重于提供有关结构物理完整性的信息,而不是材料组成的信息。为此,HSI可以提供一种独特的方式来应对这一挑战。本文以欧盟H2020 FibreEUse项目为背景,介绍了近红外HSI相机在CFRP产品无损检测中的应用,并详细介绍了三个案例研究的技术挑战和解决方案,包括粘合剂残留检测、表面损伤检测和基于协作机器人(Cobot)的自动检测。实验结果充分证明了HSI和相关视觉技术在CFRP无损检测中的巨大潜力,特别是满足工业制造环境要求的潜力。 摘要:Through capturing spectral data from a wide frequency range along with the spatial information, hyperspectral imaging (HSI) can detect minor differences in terms of temperature, moisture and chemical composition. Therefore, HSI has been successfully applied in various applications, including remote sensing for security and defense, precision agriculture for vegetation and crop monitoring, food/drink, and pharmaceuticals quality control. However, for condition monitoring and damage detection in carbon fibre reinforced polymer (CFRP), the use of HSI is a relatively untouched area, as existing non-destructive testing (NDT) techniques focus mainly on delivering information about physical integrity of structures but not on material composition. To this end, HSI can provide a unique way to tackle this challenge. In this paper, with the use of a near-infrared HSI camera, applications of HSI for the non-destructive inspection of CFRP products are introduced, taking the EU H2020 FibreEUse project as the background. Technical challenges and solutions on three case studies are presented in detail, including adhesive residues detection, surface damage detection and Cobot based automated inspection. Experimental results have fully demonstrated the great potential of HSI and related vision techniques for NDT of CFRP, especially the potential to satisfy the industrial manufacturing environment.

【5】 Structure-aware Image Inpainting with Two Parallel Streams 标题:基于双并行流的结构感知图像修复 链接:https://arxiv.org/abs/2111.03414

作者:Zhilin Huang,Chujun Qin,Ruixin Liu,Zhenyu Weng,Yuesheng Zhu 机构:Communication and Information Security Lab, Peking University 备注:9 pages, 8 figures, rejected by IJCAI 2021 摘要:最近的图像修复工作表明,结构信息在恢复视觉效果方面起着重要作用。在本文中,我们提出了一种端到端架构,它由两个并行的基于UNet的流组成:一个主流(MS)和一个结构流(SS)。在SS的帮助下,MS可以产生合理结构和真实细节的合理结果。具体而言,MS通过同时推断缺失结构和纹理来重建详细图像,SS通过处理MS编码器的层次信息仅恢复缺失结构。通过在训练过程中与SS交互,MS可以被隐含地鼓励利用结构线索。为了帮助SS关注结构并防止MS中的纹理受到影响,提出了一种门控单元来抑制MS和SS之间信息流中与结构无关的激活。此外,利用SS中的多尺度结构特征映射,通过融合块明确指导MS解码器中结构合理的图像重建。在CelebA、Paris StreetView和Places2数据集上的大量实验表明,我们提出的方法优于最先进的方法。 摘要:Recent works in image inpainting have shown that structural information plays an important role in recovering visually pleasing results. In this paper, we propose an end-to-end architecture composed of two parallel UNet-based streams: a main stream (MS) and a structure stream (SS). With the assistance of SS, MS can produce plausible results with reasonable structures and realistic details. Specifically, MS reconstructs detailed images by inferring missing structures and textures simultaneously, and SS restores only missing structures by processing the hierarchical information from the encoder of MS. By interacting with SS in the training process, MS can be implicitly encouraged to exploit structural cues. In order to help SS focus on structures and prevent textures in MS from being affected, a gated unit is proposed to depress structure-irrelevant activations in the information flow between MS and SS. Furthermore, the multi-scale structure feature maps in SS are utilized to explicitly guide the structure-reasonable image reconstruction in the decoder of MS through the fusion block. Extensive experiments on CelebA, Paris StreetView and Places2 datasets demonstrate that our proposed method outperforms state-of-the-art methods.
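
文中用于抑制结构无关激活的门控单元,常见实现是"卷积 + Sigmoid"的逐元素门控(以下为通用示意,并非论文的确切结构):

```python
import torch
import torch.nn as nn

class GatedUnit(nn.Module):
    """用MS特征生成0~1之间的门, 过滤传入SS的信息流中与结构无关的激活(示意)。"""
    def __init__(self, channels):
        super().__init__()
        self.gate_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, ms_feat):
        gate = torch.sigmoid(self.gate_conv(ms_feat))  # 逐元素门控系数
        return ms_feat * gate                          # 仅让结构相关的激活流向SS
```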

【6】 MSC-VO: Exploiting Manhattan and Structural Constraints for Visual Odometry 标题:MSC-VO:利用曼哈顿和结构约束进行视觉里程计 链接:https://arxiv.org/abs/2111.03408

作者:Joan P. Company-Corcoles,Emilio Garcia-Fidalgo,Alberto Ortiz 机构:All authors are with the Department of Mathematics and Computer Science (University of the Balearic Islands) and IDISBA (Institut d'Investigacio Sanitaria de les Illes Balears) 备注:Submitted to RAL ICRA 2022 摘要:视觉里程计算法在面对低纹理场景(例如人造环境)时往往会退化,在这些场景中通常很难找到足够数量的点特征。此时,线等通常能在这类场景中找到的替代几何视觉线索就会变得特别有用。此外,这些场景通常呈现平行或正交等结构规律,符合曼哈顿世界假设。基于这些前提,在这项工作中,我们提出了MSC-VO,一种基于RGB-D的视觉里程计方法,它结合了点和线特征,并在存在时利用场景的这些结构规律和曼哈顿轴。在我们的方法中,这些结构约束首先用于精确估计所提取直线的3D位置,随后与估计的曼哈顿轴以及点和线的重投影误差相结合,通过局部地图优化来细化相机位姿。这样的组合使我们的方法即使在缺少上述约束时也能运行,从而适用于更广泛的场景。此外,我们提出了一种主要依赖线特征的新型多视图曼哈顿轴估计方法。MSC-VO在多个公共数据集上进行了评估,优于其他最先进的解决方案,甚至与一些SLAM方法相比也表现出色。 摘要:Visual odometry algorithms tend to degrade when facing low-textured scenes (e.g., from human-made environments), where it is often difficult to find a sufficient number of point features. Alternative geometrical visual cues, such as lines, which can often be found within these scenarios, can become particularly useful. Moreover, these scenarios typically present structural regularities, such as parallelism or orthogonality, and hold the Manhattan World assumption. Under these premises, in this work, we introduce MSC-VO, an RGB-D-based visual odometry approach that combines both point and line features and leverages, if they exist, those structural regularities and the Manhattan axes of the scene. Within our approach, these structural constraints are initially used to estimate accurately the 3D position of the extracted lines. These constraints are also combined next with the estimated Manhattan axes and the reprojection errors of points and lines to refine the camera pose by means of local map optimization. Such a combination enables our approach to operate even in the absence of the aforementioned constraints, allowing the method to work for a wider variety of scenarios. Furthermore, we propose a novel multi-view Manhattan axes estimation procedure that mainly relies on line features. MSC-VO is assessed using several public datasets, outperforming other state-of-the-art solutions, and comparing favourably even with some SLAM methods.

【7】 Numérisation d'un siècle de paysage ferroviaire français : recul du rail, conséquences territoriales et coût environnemental 标题:法国铁路景观百年数字化:铁路的衰退、地域影响与环境成本 链接:https://arxiv.org/abs/2111.03433

作者:Robert Jeansoulin 备注:in French. Territoire(s) et Numérique : innovations, mutations et décision, ASRDLF (Association de Sciences Régionales de Langue Française), Sep 2021, Avignon, France 摘要:通过对一个世纪以来地理数据的重建,可以呈现法国铁路景观的演变,以及它如何受到重大事件(如二战)或更长时间尺度进程的影响:产业外迁、大都市化、公共交通政策或其缺失。这项工作源于对多个公共地理数据(SNCF、IGN)的融合,并借助计算机辅助添加了在互联网上收集的多种数据(维基百科、志愿者地理信息)。该数据集汇集了几乎所有铁路车站(甚至简单的停靠点)和铁路分支节点,它们与各自铁路线的关联使我们能够构建网络底层的一致图。每条铁路线都有一个"有效至"日期(或近似值),以便展示时间演变。目前重建进度约为预期的90%(确切总数未知)。这使得我们可以进行时间维度的人口分析(自1925年至今有多少城市和城镇有铁路服务),以及环境模拟(按给定目的地计算的二氧化碳成本)。 摘要:The reconstruction of geographical data over a century allows figuring out the evolution of the French railway landscape, and how it has been impacted by major events (e.g., WWII) or longer time-span processes: industry outsourcing, metropolization, public transport policies or the absence of them. This work results from the fusion of several public geographical datasets (SNCF, IGN), enriched with the computer-assisted addition of multiple data gathered on the Internet (Wikipedia, volunteer geographic information). The dataset compounds almost every rail station (even simple stops) and railway branch nodes, whose link to their respective rail lines allows building the underlying consistent graph of the network. Every rail line has a "valid to" date (or an approximation) so that the time evolution can be displayed. The present progress of that reconstruction sums up to roughly 90% of what is expected (exact total unknown). This allows temporal demographic analysis (how many cities and towns have been served by the railway from 1925 up to today), as well as environmental simulations (CO2 cost for a given destination).
