Computer Vision arXiv Daily Digest [8.20]

2021-08-24 16:38:07

cs.CV: 72 papers in total today

Transformers (3 papers)

【1】 PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers  Link: https://arxiv.org/abs/2108.08839

Authors: Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, Jie Zhou  Affiliations: Department of Automation, Tsinghua University, China; State Key Lab of Intelligent Technologies and Systems, China; Beijing National Research Center for Information Science and Technology, China  Note: Accepted to ICCV 2021 (Oral Presentation)  Abstract: Point clouds captured in real-world applications are often incomplete due to the limited sensor resolution, single viewpoint, and occlusion. Therefore, recovering the complete point clouds from partial ones becomes an indispensable task in many practical applications. In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, that adopts a transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the point cloud to a sequence of point proxies and employ the transformers for point cloud generation. To facilitate transformers to better leverage the inductive bias about 3D geometric structures of point clouds, we further devise a geometry-aware block that models the local geometric relationships explicitly. The migration of transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Furthermore, we propose two more challenging benchmarks with more diverse incomplete point clouds that can better reflect the real-world scenarios to promote future research. Experimental results show that our method outperforms state-of-the-art methods by a large margin on both the new benchmarks and the existing ones. Code is available at https://github.com/yuxumin/PoinTr

【2】 Do Vision Transformers See Like Convolutional Neural Networks?  Link: https://arxiv.org/abs/2108.08810

Authors: Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy  Affiliations: Google Research, Brain Team  Abstract: Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.
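A common tool for this kind of layer-wise representation comparison is centered kernel alignment (CKA). Below is a minimal, self-contained sketch of the linear CKA score between two activation matrices; it is an illustration of the analysis technique, not code from the paper.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two representation matrices.

    X: (n_samples, d1), Y: (n_samples, d2) activations for the same inputs.
    Returns a scalar in [0, 1]; higher means more similar representations.
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

# Toy usage: compare two hypothetical layer activations for 512 inputs
rng = np.random.default_rng(0)
acts_vit_block = rng.normal(size=(512, 768))   # e.g., a ViT block output
acts_cnn_stage = rng.normal(size=(512, 256))   # e.g., a CNN stage output
print(linear_cka(acts_vit_block, acts_cnn_stage))
```

Computing this score for every pair of layers of two networks yields the similarity heatmaps typically used to compare architectures.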

【3】 Video Relation Detection via Tracklet based Visual Transformer  Link: https://arxiv.org/abs/2108.08669

Authors: Kaifeng Gao, Long Chen, Yifeng Huang, Jun Xiao  Affiliations: Zhejiang University; Columbia University  Note: 1st place in the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021  Abstract: Video Visual Relation Detection (VidVRD) has received significant attention of our community over recent years. In this paper, we apply the state-of-the-art video object tracklet detection pipeline MEGA and deepSORT to generate tracklet proposals. Then we perform VidVRD in a tracklet-based manner without any pre-cutting operations. Specifically, we design a tracklet-based visual Transformer. It contains a temporal-aware decoder which performs feature interactions between the tracklets and learnable predicate query embeddings, and finally predicts the relations. Experimental results strongly demonstrate the superiority of our method, which outperforms other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge in ACM Multimedia 2021. Codes are released at https://github.com/Dawn-LX/VidVRD-tracklets.

Detection (8 papers)

【1】 Efficient remedies for outlier detection with variational autoencoders  Link: https://arxiv.org/abs/2108.08760

Authors: Kushal Chauhan, Pradeep Shenoy, Manish Gupta, Devarajan Sridharan  Affiliations: Google Research; Center for Neuroscience, Indian Institute of Science  Note: 27 pages  Abstract: Deep networks often make confident, yet incorrect, predictions when tested with outlier data that is far removed from their training distributions. Likelihoods computed by deep generative models are a candidate metric for outlier detection with unlabeled data. Yet, previous studies have shown that such likelihoods are unreliable and can be easily biased by simple transformations to input data. Here, we examine outlier detection with variational autoencoders (VAEs), among the simplest class of deep generative models. First, we show that a theoretically-grounded correction readily ameliorates a key bias with VAE likelihood estimates. The bias correction is model-free, sample-specific, and accurately computed with the Bernoulli and continuous Bernoulli visible distributions. Second, we show that a well-known preprocessing technique, contrast normalization, extends the effectiveness of bias correction to natural image datasets. Third, we show that the variance of the likelihoods computed over an ensemble of VAEs also enables robust outlier detection. We perform a comprehensive evaluation of our remedies with nine (grayscale and natural) image datasets, and demonstrate significant advantages, in terms of both speed and accuracy, over four other state-of-the-art methods. Our lightweight remedies are biologically inspired and may serve to achieve efficient outlier detection with many types of deep generative models.

【2】 FSNet: A Failure Detection Framework for Semantic Segmentation  Link: https://arxiv.org/abs/2108.08748

Authors: Quazi Marufur Rahman, Niko Sünderhauf, Peter Corke, Feras Dayoub  Affiliations: Queensland University of Technology (QUT), Centre for Robotics  Abstract: Semantic segmentation is an important task that helps autonomous vehicles understand their surroundings and navigate safely. During deployment, even the most mature segmentation models are vulnerable to various external factors that can degrade the segmentation performance with potentially catastrophic consequences for the vehicle and its surroundings. To address this issue, we propose a failure detection framework to identify pixel-level misclassification. We do so by exploiting internal features of the segmentation model and training it simultaneously with a failure detection network. During deployment, the failure detector can flag areas in the image where the segmentation model has failed to segment correctly. We evaluate the proposed approach against state-of-the-art methods and achieve 12.30%, 9.46%, and 9.65% performance improvement in the AUPR-Error metric for the Cityscapes, BDD100K, and Mapillary semantic segmentation datasets.

【3】 Wind Turbine Blade Surface Damage Detection based on Aerial Imagery and VGG16-RCNN Framework  Link: https://arxiv.org/abs/2108.08636

Authors: Juhi Patel, Lagan Sharma, Harsh S. Dhiman  Affiliations: Department of Data Science, NMIMS University; Department of Computer Science, Tandon School of Engineering, New York University; Department of Electrical Engineering, Adani Institute of Infrastructure Engineering  Abstract: In this manuscript, an image analytics based deep learning framework for wind turbine blade surface damage detection is proposed. Turbine blade(s) which carry approximately one-third of a turbine weight are susceptible to damage and can cause sudden malfunction of a grid-connected wind energy conversion system. The surface damage detection of wind turbine blade requires a large dataset so as to detect a type of damage at an early stage. Turbine blade images are captured via aerial imagery. Upon inspection, it is found that the image dataset was limited and hence image augmentation is applied to improve blade image dataset. The approach is modeled as a multi-class supervised learning problem and deep learning methods like Convolutional neural network (CNN), VGG16-RCNN and AlexNet are tested for determining the potential capability of turbine blade surface damage.

【4】 Exploiting Scene Graphs for Human-Object Interaction Detection  Link: https://arxiv.org/abs/2108.08584

Authors: Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li  Affiliations: Monash University; Center for Future Media, University of Electronic Science and Technology of China  Note: Accepted to ICCV 2021  Abstract: Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects. Existing works focus on the visual and linguistic features of humans and objects. However, they do not capitalise on the high-level and semantic relationships present in the image, which provides crucial contextual and detailed relational knowledge for HOI inference. We propose a novel method to exploit this information, through the scene graph, for the Human-Object Interaction (SG2HOI) detection task. Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions. Empirical evaluation shows that our SG2HOI method outperforms the state-of-the-art methods on two benchmark HOI datasets: V-COCO and HICO-DET. Code will be available at https://github.com/ht014/SG2HOI.

【5】 Inter-Species Cell Detection: Datasets on pulmonary hemosiderophages in equine, human and feline specimens  Link: https://arxiv.org/abs/2108.08529

Authors: Christian Marzahl, Jenny Hill, Jason Stayt, Dorothee Bienzle, Lutz Welker, Frauke Wilm, Jörn Voigt, Marc Aubreville, Andreas Maier, Robert Klopfleisch, Katharina Breininger, Christof A. Bertram  Affiliations: Pattern Recognition Lab, Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany; Research and Development, EUROIMMUN Medizinische Labordiagnostika AG, Lübeck, Germany; VetPath Laboratory Services, Ascot, Western Australia  Note: Submitted to SCIENTIFIC DATA  Abstract: Pulmonary hemorrhage (P-Hem) occurs among multiple species and can have various causes. Cytology of bronchoalveolar lavage fluid (BALF) using a 5-tier scoring system of alveolar macrophages based on their hemosiderin content is considered the most sensitive diagnostic method. We introduce a novel, fully annotated multi-species P-Hem dataset which consists of 74 cytology whole slide images (WSIs) with equine, feline and human samples. To create this high-quality and high-quantity dataset, we developed an annotation pipeline combining human expertise with deep learning and data visualisation techniques. We applied a deep learning-based object detection approach trained on 17 expertly annotated equine WSIs, to the remaining 39 equine, 12 human and 7 feline WSIs. The resulting annotations were semi-automatically screened for errors on multiple types of specialised annotation maps and finally reviewed by a trained pathologist. Our dataset contains a total of 297,383 hemosiderophages classified into five grades. It is one of the largest publicly available WSI datasets with respect to the number of annotations, the scanned area and the number of species covered.

【6】 VIL-100: A New Dataset and A Baseline Model for Video Instance Lane Detection  Link: https://arxiv.org/abs/2108.08482

Authors: Yujun Zhang, Lei Zhu, Wei Feng, Huazhu Fu, Mingqian Wang, Qingxia Li, Cheng Li, Song Wang  Affiliations: Tianjin University; University of Cambridge; Inception Institute of Artificial Intelligence; Automotive Data of China (Tianjin) Co., Ltd; University of South Carolina  Abstract: Lane detection plays a key role in autonomous driving. While car cameras always take streaming videos on the way, current lane detection works mainly focus on individual images (frames) by ignoring dynamics along the video. In this work, we collect a new video instance lane detection (VIL-100) dataset, which contains 100 videos with in total 10,000 frames, acquired from different real traffic scenarios. All the frames in each video are manually annotated to a high-quality instance-level lane annotation, and a set of frame-level and video-level metrics are included for quantitative performance evaluation. Moreover, we propose a new baseline model, named multi-level memory aggregation network (MMA-Net), for video instance lane detection. In our approach, the representation of current frame is enhanced by attentively aggregating both local and global memory features from other frames. Experiments on the new collected dataset show that the proposed MMA-Net outperforms state-of-the-art lane detection methods and video object segmentation methods. We release our dataset and code at https://github.com/yujun0-0/MMA-Net.

【7】 Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes  Link: https://arxiv.org/abs/2108.08421

Authors: Mingjun Yin, Shasha Li, Zikui Cai, Chengyu Song, M. Salman Asif, Amit K. Roy-Chowdhury, Srikanth V. Krishnamurthy  Affiliations: University of California, Riverside, USA  Note: Accepted to ICCV 2021  Abstract: Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks (e.g., by checking the object co-occurrence relationships in complex scenes). However, existing approaches are tied to specific models and do not offer generalizability. Motivated by the observation that language descriptions of natural scene images have already captured the object co-occurrence relationships that can be learned by a language model, we develop a novel approach to perform context consistency checks using such language models. The distinguishing aspect of our approach is that it is independent of the deployed object detector and yet offers very high accuracy in terms of detecting adversarial examples in practical scenes with multiple objects.

【8】 Social Fabric: Tubelet Compositions for Video Relation Detection  Link: https://arxiv.org/abs/2108.08363

Authors: Shuo Chen, Zenglin Shi, Pascal Mettes, Cees G. M. Snoek  Affiliations: University of Amsterdam  Note: ICCV 2021  Abstract: This paper strives to classify and detect the relationship between object tubelets appearing within a video as a triplet. Where existing works treat object proposals or tubelets as single entities and model their relations a posteriori, we propose to classify and detect predicates for pairs of object tubelets a priori. We also propose Social Fabric: an encoding that represents a pair of object tubelets as a composition of interaction primitives. These primitives are learned over all relations, resulting in a compact representation able to localize and classify relations from the pool of co-occurring object tubelets across all timespans in a video. The encoding enables our two-stage network. In the first stage, we train Social Fabric to suggest proposals that are likely interacting. We use the Social Fabric in the second stage to simultaneously fine-tune and predict predicate labels for the tubelets. Experiments demonstrate the benefit of early video relation modeling, our encoding and the two-stage architecture, leading to a new state-of-the-art on two benchmarks. We also show how the encoding enables query-by-primitive-example to search for spatio-temporal video relations. Code: https://github.com/shanshuo/Social-Fabric.

Classification & Recognition (8 papers)

【1】 Causal Attention for Unbiased Visual Recognition  Link: https://arxiv.org/abs/2108.08782

Authors: Tan Wang, Chang Zhou, Qianru Sun, Hanwang Zhang  Affiliations: Nanyang Technological University; Damo Academy, Alibaba Group; Singapore Management University  Note: Accepted by ICCV 2021  Abstract: Attention module does not always help deep models learn causal features that are robust in any confounding context, e.g., a foreground object feature is invariant to different backgrounds. This is because the confounders trick the attention to capture spurious correlations that benefit the prediction when the training and testing data are IID (identical & independent distribution); while harm the prediction when the data are OOD (out-of-distribution). The sole fundamental solution to learn causal attention is by causal intervention, which requires additional annotations of the confounders, e.g., a "dog" model is learned within "grass dog" and "road dog" respectively, so the "grass" and "road" contexts will no longer confound the "dog" recognition. However, such annotation is not only prohibitively expensive, but also inherently problematic, as the confounders are elusive in nature. In this paper, we propose a causal attention module (CaaM) that self-annotates the confounders in unsupervised fashion. In particular, multiple CaaMs can be stacked and integrated in conventional attention CNN and self-attention Vision Transformer. In OOD settings, deep models with CaaM outperform those without it significantly; even in IID settings, the attention localization is also improved by CaaM, showing a great potential in applications that require robust visual saliency. Codes are available at https://github.com/Wangt-CN/CaaM.

【2】 Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification  Link: https://arxiv.org/abs/2108.08728

Authors: Yongming Rao, Guangyi Chen, Jiwen Lu, Jie Zhou  Affiliations: Department of Automation, Tsinghua University, China; State Key Lab of Intelligent Technologies and Systems, China; Beijing National Research Center for Information Science and Technology, China  Note: Accepted to ICCV 2021  Abstract: Attention mechanism has demonstrated great potential in fine-grained visual recognition tasks. In this paper, we present a counterfactual attention learning method to learn more effective attention based on causal inference. Unlike most existing methods that learn visual attention based on conventional likelihood, we propose to learn the attention with counterfactual causality, which provides a tool to measure the attention quality and a powerful supervisory signal to guide the learning process. Specifically, we analyze the effect of the learned visual attention on network prediction through counterfactual intervention and maximize the effect to encourage the network to learn more useful attention for fine-grained image recognition. Empirically, we evaluate our method on a wide range of fine-grained recognition tasks where attention plays a crucial role, including fine-grained image categorization, person re-identification, and vehicle re-identification. The consistent improvement on all benchmarks demonstrates the effectiveness of our method. Code is available at https://github.com/raoyongming/CAL

【3】 Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition  Link: https://arxiv.org/abs/2108.08633

Authors: Ning Wang, Guangming Zhu, Liang Zhang, Peiyi Shen, Hongsheng Li, Cong Hua  Affiliations: School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, China  Note: ACM MM Oral paper  Abstract: For a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects are the important cue to understand the contextual information presented in the video. With the effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies. It is more critical to capture the position changes of human and objects over the spatio-temporal dimension when their appearance features may not show up significant changes over time. The full use of appearance features, the spatial location and the semantic information are also the key to improve the video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are constructed, which encode the videos with a graph composed of human and object nodes. These nodes are connected by two types of relations: (i) spatial relations modeling the interactions between human and the interacted objects within each frame. (ii) inter-time relations capturing the long range dependencies between human and the interacted objects across frames. With the graph, STIGPN learns spatio-temporal features directly from the whole video-based Human-Object Interaction scenes. Multi-modal features and a multi-stream fusion strategy are used to enhance the reasoning capability of STIGPN. Two Human-Object Interaction video datasets, including CAD-120 and Something-Else, are used to evaluate the proposed architectures, and the state-of-the-art performance demonstrates the superiority of STIGPN.

【4】 Understanding and Mitigating Annotation Bias in Facial Expression Recognition  Link: https://arxiv.org/abs/2108.08504

Authors: Yunliang Chen, Jungseock Joo  Affiliations: University of California, Los Angeles  Note: To appear in ICCV 2021  Abstract: The performance of a computer vision model depends on the size and quality of its training data. Recent studies have unveiled previously-unknown composition biases in common image datasets which then lead to skewed model outputs, and have proposed methods to mitigate these biases. However, most existing works assume that human-generated annotations can be considered gold-standard and unbiased. In this paper, we reveal that this assumption can be problematic, and that special care should be taken to prevent models from learning such annotation biases. We focus on facial expression recognition and compare the label biases between lab-controlled and in-the-wild datasets. We demonstrate that many expression datasets contain significant annotation biases between genders, especially when it comes to the happy and angry expressions, and that traditional methods cannot fully mitigate such biases in trained models. To remove expression annotation bias, we propose an AU-Calibrated Facial Expression Recognition (AUC-FER) framework that utilizes facial action units (AUs) and incorporates the triplet loss into the objective function. Experimental results suggest that the proposed method is more effective in removing expression annotation bias than existing techniques.
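For readers unfamiliar with the triplet term mentioned in the abstract, here is a minimal sketch of a margin-based triplet loss added on top of a cross-entropy objective. The AU-based triplet mining itself (how anchors, positives and negatives are chosen from action units) is specific to the paper and is only assumed here, not reproduced.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on L2-normalized embeddings.

    Assumption: anchor/positive share the same expression (e.g. similar AU
    activations across genders), negative does not.
    """
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()

# Hypothetical combined objective (lambda_triplet is an assumed weight):
# loss = F.cross_entropy(logits, labels) + lambda_triplet * triplet_loss(a, p, n)
```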

【5】 Semantic Reinforced Attention Learning for Visual Place Recognition  Link: https://arxiv.org/abs/2108.08443

Authors: Guohao Peng, Yufeng Yue, Jun Zhang, Zhenyu Wu, Xiaoyu Tang, Danwei Wang  Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University; School of Automation, Beijing Institute of Technology  Abstract: Large-scale visual place recognition (VPR) is inherently challenging because not all visual cues in the image are beneficial to the task. In order to highlight the task-relevant visual cues in the feature embedding, the existing attention mechanisms are either based on artificial rules or trained in a thorough data-driven manner. To fill the gap between the two types, we propose a novel Semantic Reinforced Attention Learning Network (SRALNet), in which the inferred attention can benefit from both semantic priors and data-driven fine-tuning. The contribution lies in two-folds. (1) To suppress misleading local features, an interpretable local weighting scheme is proposed based on hierarchical feature distribution. (2) By exploiting the interpretability of the local weighting scheme, a semantic constrained initialization is proposed so that the local attention can be reinforced by semantic priors. Experiments demonstrate that our method outperforms state-of-the-art techniques on city-scale VPR benchmark datasets.

【6】 STAR: Noisy Semi-Supervised Transfer Learning for Visual Classification  Link: https://arxiv.org/abs/2108.08362

Authors: Hasib Zunair, Yan Gobeil, Samuel Mercier, A. Ben Hamza  Affiliations: Concordia University, Montreal, Canada; Décathlon  Abstract: Semi-supervised learning (SSL) has proven to be effective at leveraging large-scale unlabeled data to mitigate the dependency on labeled data in order to learn better models for visual recognition and classification tasks. However, recent SSL methods rely on unlabeled image data at a scale of billions to work well. This becomes infeasible for tasks with relatively fewer unlabeled data in terms of runtime, memory and data acquisition. To address this issue, we propose noisy semi-supervised transfer learning, an efficient SSL approach that integrates transfer learning and self-training with noisy student into a single framework, which is tailored for tasks that can leverage unlabeled image data on a scale of thousands. We evaluate our method on both binary and multi-class classification tasks, where the objective is to identify whether an image displays people practicing sports or the type of sport, as well as to identify the pose from a pool of popular yoga poses. Extensive experiments and ablation studies demonstrate that by leveraging unlabeled data, our proposed framework significantly improves visual classification, especially in multi-class classification settings compared to state-of-the-art methods. Moreover, incorporating transfer learning not only improves classification performance, but also requires 6x less compute time and 5x less memory. We also show that our method boosts robustness of visual classification models, even without specifically optimizing for adversarial robustness.
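The self-training-with-noisy-student idea the abstract refers to boils down to a teacher pseudo-labeling unlabeled images and a noised student fitting the confident ones. A minimal, generic sketch of one such step is shown below; the confidence threshold, noise injection and STAR's specific integration with transfer learning are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(teacher, student, unlabeled_images, optimizer, threshold=0.9):
    """One self-training step: teacher pseudo-labels, student trains on confident ones.

    Noise (strong augmentation / dropout) is assumed to be applied inside the
    student or its dataloader, as in noisy-student style training.
    """
    teacher.eval()
    with torch.no_grad():
        probs = F.softmax(teacher(unlabeled_images), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        keep = confidence >= threshold          # keep only confident pseudo-labels

    if keep.sum() == 0:
        return 0.0                              # nothing confident in this batch
    student.train()
    logits = student(unlabeled_images[keep])
    loss = F.cross_entropy(logits, pseudo_labels[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```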

【7】 End-to-End License Plate Recognition Pipeline for Real-time Low Resource Video Based Applications  Link: https://arxiv.org/abs/2108.08339

Authors: Alif Ashrafee, Akib Mohammed Khan, Mohammad Sabik Irbaz, MD Abdullah Al Nasim  Affiliations: Department of Computer Science and Engineering, Islamic University of Technology; Machine Learning Team, Pioneer Alpha Ltd.  Note: Under Review  Abstract: Automatic License Plate Recognition systems aim to provide an end-to-end solution towards detecting, localizing, and recognizing license plate characters from vehicles appearing in video frames. However, deploying such systems in the real world requires real-time performance in low-resource environments. In our paper, we propose a novel two-stage detection pipeline paired with Vision API that aims to provide real-time inference speed along with consistently accurate detection and recognition performance. We used a haar-cascade classifier as a filter on top of our backbone MobileNet SSDv2 detection model. This reduces inference time by only focusing on high confidence detections and using them for recognition. We also impose a temporal frame separation strategy to identify multiple vehicle license plates in the same clip. Furthermore, there are no publicly available Bangla license plate datasets, for which we created an image dataset and a video dataset containing license plates in the wild. We trained our models on the image dataset and achieved an AP(0.5) score of 86% and tested our pipeline on the video dataset and observed reasonable detection and recognition performance (82.7% detection rate, and 60.8% OCR F1 score) with real-time processing speed (27.2 frames per second).
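The core efficiency trick here is using a cheap Haar cascade as a first-stage filter so that the heavier SSD detector and OCR only run on promising regions. Below is a small OpenCV sketch of such a filter stage; the cascade file used (OpenCV's bundled Russian plate cascade) is a stand-in, since the paper targets Bangla plates and presumably uses its own cascade.

```python
import cv2

# Stand-in cascade file bundled with OpenCV; the paper's own cascade would differ.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_russian_plate_number.xml")

def plate_candidates(frame_bgr, min_neighbors=4):
    """Cheap first stage: return candidate plate boxes (x, y, w, h).

    Only frames/regions passing this filter would be forwarded to the heavier
    MobileNet SSD detector and the OCR stage, keeping the pipeline real-time.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=min_neighbors)
    return list(boxes)
```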

【8】 Classification of Diabetic Retinopathy Severity in Fundus Images with DenseNet121 and ResNet50  Link: https://arxiv.org/abs/2108.08473

Authors: Jonathan Zhang, Bowen Xie, Xin Wu, Rahul Ram, David Liang  Affiliations: Commack High School, Commack, NY; Glenda Dawson High School, Pearland, TX; Mira Loma High School, Carmichael, CA; Ward Melville High School, East Setauket, NY; Machine Learning, Camp Illumina, Illumina Learning  Note: 15 pages, 14 figures; Jonathan Zhang - first author, Rahul Ram and David Liang - principal investigators; classifier repository - this https URL  Abstract: In this work, deep learning algorithms are used to classify fundus images in terms of diabetic retinopathy severity. Six different combinations of two model architectures, the Dense Convolutional Network-121 and the Residual Neural Network-50 and three image types, RGB, Green, and High Contrast, were tested to find the highest performing combination. We achieved an average validation loss of 0.17 and a max validation accuracy of 85 percent. By testing out multiple combinations, certain combinations of parameters performed better than others, though minimal variance was found overall. Green filtration was shown to perform the poorest, while amplified contrast appeared to have a negligible effect in comparison to RGB analysis. ResNet50 proved to be less of a robust model as opposed to DenseNet121.
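Setting up the two backbones compared in this paper is straightforward with torchvision. The sketch below assumes five severity grades (a common convention for diabetic retinopathy grading); the paper's exact label set and training details may differ.

```python
import torch.nn as nn
from torchvision import models

def build_classifier(arch="densenet121", num_classes=5):
    """Return an ImageNet-pretrained backbone with a new classification head."""
    if arch == "densenet121":
        model = models.densenet121(pretrained=True)
        # DenseNet exposes its head as `classifier`
        model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    elif arch == "resnet50":
        model = models.resnet50(pretrained=True)
        # ResNet exposes its head as `fc`
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    else:
        raise ValueError(f"unknown architecture: {arch}")
    return model

model = build_classifier("densenet121", num_classes=5)
```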

Segmentation & Semantics (7 papers)

【1】 Semantic Compositional Learning for Low-shot Scene Graph Generation  Link: https://arxiv.org/abs/2108.08600

Authors: Tao He, Lianli Gao, Jingkuan Song, Jianfei Cai, Yuan-Fang Li  Affiliations: Monash University; Center for Future Media, University of Electronic Science and Technology of China  Abstract: Scene graphs provide valuable information to many downstream tasks. Many scene graph generation (SGG) models solely use the limited annotated relation triples for training, leading to their underperformance on low-shot (few and zero) scenarios, especially on the rare predicates. To address this problem, we propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples with objects from different images. Specifically, our strategy decomposes a relation triple by identifying and removing the unessential component and composes a new relation triple by fusing with a semantically or visually similar object from a visual components dictionary, whilst ensuring the realisticity of the newly composed triple. Notably, our strategy is generic and can be combined with existing SGG models to significantly improve their performance. We performed a comprehensive evaluation on the benchmark dataset Visual Genome. For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.

【2】 Multi-task Federated Learning for Heterogeneous Pancreas Segmentation  Link: https://arxiv.org/abs/2108.08537

Authors: Chen Shen, Pochuan Wang, Holger R. Roth, Dong Yang, Daguang Xu, Masahiro Oda, Weichung Wang, Chiou-Shann Fuh, Po-Ting Chen, Kao-Lang Liu, Wei-Chih Liao, Kensaku Mori  Affiliations: Nagoya University, Japan; National Taiwan University, Taiwan; NVIDIA Corporation, United States; National Taiwan University Hospital, Taiwan  Note: Accepted by MICCAI DCL Workshop 2021  Abstract: Federated learning (FL) for medical image segmentation becomes more challenging in multi-task settings where clients might have different categories of labels represented in their data. For example, one client might have patient data with "healthy" pancreases only while datasets from other clients may contain cases with pancreatic tumors. The vanilla federated averaging algorithm makes it possible to obtain more generalizable deep learning-based segmentation models representing the training data from multiple institutions without centralizing datasets. However, it might be sub-optimal for the aforementioned multi-task scenarios. In this paper, we investigate heterogeneous optimization methods that show improvements for the automated segmentation of pancreas and pancreatic tumors in abdominal CT images with FL settings.
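For reference, the vanilla federated averaging baseline the abstract starts from is just a sample-weighted average of client model weights after each round of local training. A minimal sketch is below; the heterogeneous optimization variants the paper actually studies are not shown.

```python
import copy

def federated_average(client_state_dicts, client_sizes):
    """Vanilla FedAvg: weighted average of client parameters.

    client_state_dicts: list of model.state_dict() from each client.
    client_sizes: number of local training samples per client (weights).
    """
    total = float(sum(client_sizes))
    averaged = copy.deepcopy(client_state_dicts[0])
    for key in averaged:
        averaged[key] = sum(sd[key].float() * (n / total)
                            for sd, n in zip(client_state_dicts, client_sizes))
    return averaged

# server_model.load_state_dict(federated_average(client_states, client_sizes))
```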

【3】 Few-shot Segmentation with Optimal Transport Matching and Message Flow  Link: https://arxiv.org/abs/2108.08518

Authors: Weide Liu, Chi Zhang, Henghui Ding, Tzu-Yi Hung, Guosheng Lin  Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University (NTU)  Abstract: We address the challenging task of few-shot segmentation in this work. It is essential for few-shot semantic segmentation to fully utilize the support information. Previous methods typically adapt masked average pooling over the support feature to extract the support clues as a global vector, usually dominated by the salient part and loses some important clues. In this work, we argue that every support pixel's information is desired to be transferred to all query pixels and propose a Correspondence Matching Network (CMNet) with an Optimal Transport Matching module to mine out the correspondence between the query and support images. Besides, it is important to fully utilize both local and global information from the annotated support images. To this end, we propose a Message Flow module to propagate the message along the inner-flow within the same image and cross-flow between support and query images, which greatly help enhance the local feature representations. We further address the few-shot segmentation as a multi-task learning problem to alleviate the domain gap issue between different datasets. Experiments on PASCAL VOC 2012, MS COCO, and FSS-1000 datasets show that our network achieves new state-of-the-art few-shot segmentation performance.
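Optimal transport matching of this kind is commonly solved with entropy-regularized Sinkhorn iterations over a pairwise cost between support and query features. The sketch below is a generic Sinkhorn solver with uniform marginals, offered only as background for the matching module described in the abstract; the paper's exact cost design and marginals are not reproduced.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropy-regularized optimal transport between uniform marginals.

    cost: (n, m) pairwise matching cost, e.g. 1 - cosine similarity between
    support-pixel and query-pixel features. Returns an (n, m) transport plan
    whose entries act as soft correspondences.
    """
    n, m = cost.shape
    K = torch.exp(-cost / eps)                        # Gibbs kernel
    mu = torch.full((n,), 1.0 / n, dtype=cost.dtype)  # uniform source marginal
    nu = torch.full((m,), 1.0 / m, dtype=cost.dtype)  # uniform target marginal
    a = torch.ones_like(mu)
    b = torch.ones_like(nu)
    for _ in range(n_iters):                          # alternating scaling updates
        a = mu / (K @ b)
        b = nu / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)        # plan = diag(a) K diag(b)
```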

【4】 Box-Adapt: Domain-Adaptive Medical Image Segmentation using Bounding Box Supervision  Link: https://arxiv.org/abs/2108.08432

Authors: Yanwu Xu, Mingming Gong, Kayhan Batmanghelich  Affiliations: Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA; School of Mathematics and Statistics, Melbourne Centre for Data Science, The University of Melbourne, Australia  Abstract: Deep learning has achieved remarkable success in medical image segmentation, but it usually requires a large number of images labeled with fine-grained segmentation masks, and the annotation of these masks can be very expensive and time-consuming. Therefore, recent methods try to use unsupervised domain adaptation (UDA) methods to borrow information from labeled data from other datasets (source domains) to a new dataset (target domain). However, due to the absence of labels in the target domain, the performance of UDA methods is much worse than that of the fully supervised method. In this paper, we propose a weakly supervised domain adaptation setting, in which we can partially label new datasets with bounding boxes, which are easier and cheaper to obtain than segmentation masks. Accordingly, we propose a new weakly-supervised domain adaptation method called Box-Adapt, which fully explores the fine-grained segmentation mask in the source domain and the weak bounding box in the target domain. Our Box-Adapt is a two-stage method that first performs joint training on the source and target domains, and then conducts self-training with the pseudo-labels of the target domain. We demonstrate the effectiveness of our method in the liver segmentation task.

【5】 GP-S3Net: Graph-based Panoptic Sparse Semantic Segmentation Network  Link: https://arxiv.org/abs/2108.08401

Authors: Ryan Razani, Ran Cheng, Enxu Li, Ehsan Taghavi, Yuan Ren, Liu Bingbing  Affiliations: Huawei Noah's Ark Lab, Toronto, Canada  Abstract: Panoptic segmentation as an integrated task of both static environmental understanding and dynamic object identification, has recently begun to receive broad research interest. In this paper, we propose a new computationally efficient LiDAR based panoptic segmentation framework, called GP-S3Net. GP-S3Net is a proposal-free approach in which no object proposals are needed to identify the objects in contrast to conventional two-stage panoptic systems, where a detection network is incorporated for capturing instance information. Our new design consists of a novel instance-level network to process the semantic results by constructing a graph convolutional network to identify objects (foreground), which later on are fused with the background classes. Through the fine-grained clusters of the foreground objects from the semantic segmentation backbone, over-segmentation priors are generated and subsequently processed by 3D sparse convolution to embed each cluster. Each cluster is treated as a node in the graph and its corresponding embedding is used as its node feature. Then a GCNN predicts whether edges exist between each cluster pair. We utilize the instance label to generate ground truth edge labels for each constructed graph in order to supervise the learning. Extensive experiments demonstrate that GP-S3Net outperforms the current state-of-the-art approaches, by a significant margin across available datasets such as, nuScenes and SemanticPOSS, ranking first on the competitive public SemanticKITTI leaderboard upon publication.

【6】 Patch-Based Cervical Cancer Segmentation using Distance from Boundary of Tissue  Link: https://arxiv.org/abs/2108.08508

Authors: Kengo Araki, Mariyo Rokutan-Kurata, Kazuhiro Terada, Akihiko Yoshizawa, Ryoma Bise  Note: 4 pages, 6 figures, EMBC 2021  Abstract: Pathological diagnosis is used for examining cancer in detail, and its automation is in demand. To automatically segment each cancer area, a patch-based approach is usually used since a Whole Slide Image (WSI) is huge. However, this approach loses the global information needed to distinguish between classes. In this paper, we utilized the Distance from the Boundary of tissue (DfB), which is global information that can be extracted from the original image. We experimentally applied our method to the three-class classification of cervical cancer, and found that it improved the total performance compared with the conventional method.
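A distance-from-boundary map of the kind described here can be computed with a Euclidean distance transform of the tissue mask, then sampled per patch as an extra global feature. The following is a minimal sketch under that assumption; how the paper actually injects the DfB into the patch classifier is not shown.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_from_tissue_boundary(tissue_mask):
    """tissue_mask: boolean array, True inside the tissue region.

    Returns a signed per-pixel Euclidean distance to the tissue boundary
    (positive inside the tissue, negative outside). The value at a patch
    centre can then be attached to that patch as a global feature.
    """
    inside = distance_transform_edt(tissue_mask)      # distance to background
    outside = distance_transform_edt(~tissue_mask)    # distance to tissue
    return np.where(tissue_mask, inside, -outside)
```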

【7】 Medical Image Segmentation using 3D Convolutional Neural Networks: A Review  Link: https://arxiv.org/abs/2108.08467

Authors: S. Niyas, S J Pawan, M Anand Kumar, Jeny Rajan  Affiliations: Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India; Department of Information Technology  Note: 17 pages, 4 figures  Abstract: Computer-aided medical image analysis plays a significant role in assisting medical practitioners for expert clinical diagnosis and deciding the optimal treatment plan. At present, convolutional neural networks (CNN) are the preferred choice for medical image analysis. In addition, with the rapid advancements in three-dimensional (3D) imaging systems and the availability of excellent hardware and software support to process large volumes of data, 3D deep learning methods are gaining popularity in medical image analysis. Here, we present an extensive review of the recently evolved 3D deep learning methods in medical image segmentation. Furthermore, the research gaps and future directions in 3D medical image segmentation are discussed.

Zero/Few-Shot | Transfer | Domain Adaptation (4 papers)

【1】 Spatially-Adaptive Image Restoration using Distortion-Guided Networks  Link: https://arxiv.org/abs/2108.08617

Authors: Kuldeep Purohit, Maitreya Suin, A. N. Rajagopalan, Vishnu Naresh Boddeti  Affiliations: Michigan State University; Indian Institute of Technology Madras  Note: Accepted at ICCV 2021  Abstract: We present a general learning-based solution for restoring images suffering from spatially-varying degradations. Prior approaches are typically degradation-specific and employ the same processing across different images and different pixels within. However, we hypothesize that such spatially rigid processing is suboptimal for simultaneously restoring the degraded pixels as well as reconstructing the clean regions of the image. To overcome this limitation, we propose SPAIR, a network design that harnesses distortion-localization information and dynamically adjusts computation to difficult regions in the image. SPAIR comprises of two components, (1) a localization network that identifies degraded pixels, and (2) a restoration network that exploits knowledge from the localization network in filter and feature domain to selectively and adaptively restore degraded pixels. Our key idea is to exploit the non-uniformity of heavy degradations in spatial-domain and suitably embed this knowledge within distortion-guided modules performing sparse normalization, feature extraction and attention. Our architecture is agnostic to physical formation model and generalizes across several types of spatially-varying degradations. We demonstrate the efficacy of SPAIR individually on four restoration tasks-removal of rain-streaks, raindrops, shadows and motion blur. Extensive qualitative and quantitative comparisons with prior art on 11 benchmark datasets demonstrate that our degradation-agnostic network design offers significant performance gains over state-of-the-art degradation-specific architectures. Code available at https://github.com/human-analysis/spatially-adaptive-image-restoration.

【2】 Feature Stylization and Domain-aware Contrastive Learning for Domain Generalization  Link: https://arxiv.org/abs/2108.08596

Authors: Seogkyu Jeon, Kibeom Hong, Pilhyeon Lee, Jewook Lee, Hyeran Byun  Affiliations: Department of Computer Science, Yonsei University, South Korea  Note: Accepted to ACM MM 2021 (oral)  Abstract: Domain generalization aims to enhance the model robustness against domain shift without accessing the target domain. Since the available source domains for training are limited, recent approaches focus on generating samples of novel domains. Nevertheless, they either struggle with the optimization problem when synthesizing abundant domains or cause the distortion of class semantics. To these ends, we propose a novel domain generalization framework where feature statistics are utilized for stylizing original features to ones with novel domain properties. To preserve class information during stylization, we first decompose features into high and low frequency components. Afterward, we stylize the low frequency components with the novel domain styles sampled from the manipulated statistics, while preserving the shape cues in high frequency ones. As the final step, we re-merge both components to synthesize novel domain features. To enhance domain robustness, we utilize the stylized features to maintain the model consistency in terms of features as well as outputs. We achieve the feature consistency with the proposed domain-aware supervised contrastive loss, which ensures domain invariance while increasing class discriminability. Experimental results demonstrate the effectiveness of the proposed feature stylization and the domain-aware contrastive loss. Through quantitative comparisons, we verify the lead of our method upon existing state-of-the-art methods on two benchmarks, PACS and Office-Home.
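The stylization step described here is in the spirit of AdaIN-style feature re-normalization: replace a feature map's channel statistics with statistics sampled from a manipulated style distribution. The sketch below shows only that re-normalization; the frequency decomposition and the way the statistics are manipulated are the paper's own design and are assumed, not reproduced.

```python
import torch

def stylize_features(feat, mean_new, std_new, eps=1e-5):
    """Re-normalize a feature map to new channel-wise statistics (AdaIN-style).

    feat: (B, C, H, W) feature map to stylize.
    mean_new, std_new: (B, C) target statistics, e.g. sampled around the
    statistics of source-domain styles (assumption for illustration).
    """
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (feat - mu) / sigma                   # strip original style
    return normalized * std_new[..., None, None] + mean_new[..., None, None]
```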

【3】 Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain  Link: https://arxiv.org/abs/2108.08487

Authors: Guangyao Chen, Peixi Peng, Li Ma, Jia Li, Lin Du, Yonghong Tian  Affiliations: Department of Computer Science and Technology, Peking University; Peng Cheng Laboratory; State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University; AI Application Research Center, Huawei  Note: ICCV 2021  Abstract: Recently, the generalization behavior of Convolutional Neural Networks (CNN) is gradually transparent through explanation techniques with the frequency components decomposition. However, the importance of the phase spectrum of the image for a robust vision system is still ignored. In this paper, we notice that the CNN tends to converge at the local optimum which is closely related to the high-frequency components of the training images, while the amplitude spectrum is easily disturbed such as noises or common corruptions. In contrast, more empirical studies found that humans rely on more phase components to achieve robust recognition. This observation leads to more explanations of the CNN's generalization behaviors in both robustness to common perturbations and out-of-distribution detection, and motivates a new perspective on data augmentation designed by re-combing the phase spectrum of the current image and the amplitude spectrum of the distracter image. That is, the generated samples force the CNN to pay more attention to the structured information from phase components and keep robust to the variation of the amplitude. Experiments on several image datasets indicate that the proposed method achieves state-of-the-art performances on multiple generalizations and calibration tasks, including adaptability for common corruptions and surface variations, out-of-distribution detection, and adversarial attack.
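The augmentation itself (take the phase of the current image and the amplitude of a distracter image, then invert the FFT) is easy to write down. The following is a minimal sketch of that recombination; how the augmented sample is labeled and mixed into training follows the paper and is not shown here.

```python
import torch

def amplitude_phase_recombine(img_current, img_distracter):
    """Build an augmented image from the PHASE of img_current and the
    AMPLITUDE of img_distracter. Inputs: (B, C, H, W) tensors in [0, 1]."""
    fft_cur = torch.fft.fft2(img_current)
    fft_dis = torch.fft.fft2(img_distracter)
    amplitude = torch.abs(fft_dis)          # amplitude spectrum of distracter
    phase = torch.angle(fft_cur)            # phase spectrum of current image
    recombined = amplitude * torch.exp(1j * phase)
    out = torch.fft.ifft2(recombined).real  # back to the spatial domain
    return out.clamp(0.0, 1.0)
```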

【4】 Universal Cross-Domain Retrieval: Generalizing Across Classes and Domains  Link: https://arxiv.org/abs/2108.08356

Authors: Soumava Paul, Titir Dutta, Soma Biswas  Affiliations: Indian Institute of Technology, Kharagpur; Indian Institute of Science, Bangalore  Note: Accepted at ICCV 2021; 15 pages, 6 figures  Abstract: In this work, for the first time, we address the problem of universal cross-domain retrieval, where the test data can belong to classes or domains which are unseen during training. Due to dynamically increasing number of categories and practical constraint of training on every possible domain, which requires large amounts of data, generalizing to both unseen classes and domains is important. Towards that goal, we propose SnMpNet (Semantic Neighbourhood and Mixture Prediction Network), which incorporates two novel losses to account for the unseen classes and domains encountered during testing. Specifically, we introduce a novel Semantic Neighborhood loss to bridge the knowledge gap between seen and unseen classes and ensure that the latent space embedding of the unseen classes is semantically meaningful with respect to its neighboring classes. We also introduce a mix-up based supervision at image-level as well as semantic-level of the data for training with the Mixture Prediction loss, which helps in efficient retrieval when the query belongs to an unseen domain. These losses are incorporated on the SE-ResNet50 backbone to obtain SnMpNet. Extensive experiments on two large-scale datasets, Sketchy Extended and DomainNet, and thorough comparisons with state-of-the-art justify the effectiveness of the proposed model.

半弱无监督|主动学习|不确定性(4篇)

【1】 Fine-grained Semantics-aware Representation Enhancement for Self-supervised Monocular Depth Estimation 标题:用于自监督单目深度估计的细粒度语义感知表示增强 链接:https://arxiv.org/abs/2108.08829

作者:Hyunyoung Jung,Eunhyeok Park,Sungjoo Yoo 机构:Seoul National University, POSTECH 备注:ICCV 2021 (Oral) 摘要:自监督单目深度估计由于其实际重要性和最近的有前途的改进而得到了广泛的研究。然而,大多数作品都受到光度一致性监督的限制,特别是在弱纹理区域和对象边界。为了克服这一缺点,我们提出了利用跨域信息,特别是场景语义来改进自监督单目深度估计的新思路。我们专注于将隐式语义知识纳入几何表示增强,并提出了两种想法:一种利用语义引导的局部几何优化中间深度表示的度量学习方法和一种新的特征融合模块,该模块明智地利用两个异构体之间的跨模态特征表示。我们在KITTI数据集上全面评估了我们的方法,并证明我们的方法优于最先进的方法。源代码可在https://github.com/hyBlue/FSRE-Depth. 摘要:Self-supervised monocular depth estimation has been widely studied, owing to its practical importance and recent promising improvements. However, most works suffer from limited supervision of photometric consistency, especially in weak texture regions and at object boundaries. To overcome this weakness, we propose novel ideas to improve self-supervised monocular depth estimation by leveraging cross-domain information, especially scene semantics. We focus on incorporating implicit semantic knowledge into geometric representation enhancement and suggest two ideas: a metric learning approach that exploits the semantics-guided local geometry to optimize intermediate depth representations and a novel feature fusion module that judiciously utilizes cross-modality between two heterogeneous feature representations. We comprehensively evaluate our methods on the KITTI dataset and demonstrate that our method outperforms state-of-the-art methods. The source code is available at https://github.com/hyBlue/FSRE-Depth.

【2】 StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation 标题:StructDepth:利用结构规律进行自监督室内深度估计 链接:https://arxiv.org/abs/2108.08574

作者:Boying Li,Yuan Huang,Zeyu Liu,Danping Zou,Wenxian Yu 机构:Shanghai Key Laboratory of Navigation and Location-Based Services, Shanghai Key Laboratory of Intelligent Sensing and Recognition, Shanghai Jiao Tong University 备注:Accepted by ICCV2021. Project is in this https URL 摘要:自监督单目深度估计在室外数据集上取得了令人印象深刻的性能。但是,由于缺乏纹理,其性能在室内环境中会显著下降。如果没有丰富的纹理,光度一致性太弱,无法训练良好的深度网络。受早期室内建模作品的启发,我们利用室内场景中呈现的结构规律,来训练更好的深度网络。具体来说,我们采用两个额外的监督信号进行自我监督训练:1)曼哈顿法线约束和2)共面约束。曼哈顿法线约束强制主曲面(地板、天花板和墙)与主方向对齐。共面约束表示,如果三维点位于同一平面区域内,则它们将由平面很好地拟合。为了生成监控信号,我们在训练过程中采用两个分量将主表面法线分类为主导方向,并动态检测平面区域。随着经过更多训练次数后预测深度变得更加准确,监控信号也会得到改善,进而反馈以获得更好的深度模型。通过对室内基准数据集的大量实验,结果表明我们的网络性能优于最先进的方法。源代码可在https://github.com/SJTU-ViSYS/StructDepth . 摘要:Self-supervised monocular depth estimation has achieved impressive performance on outdoor datasets. Its performance however degrades notably in indoor environments because of the lack of textures. Without rich textures, the photometric consistency is too weak to train a good depth network. Inspired by the early works on indoor modeling, we leverage the structural regularities exhibited in indoor scenes, to train a better depth network. Specifically, we adopt two extra supervisory signals for self-supervised training: 1) the Manhattan normal constraint and 2) the co-planar constraint. The Manhattan normal constraint enforces the major surfaces (the floor, ceiling, and walls) to be aligned with dominant directions. The co-planar constraint states that the 3D points be well fitted by a plane if they are located within the same planar region. To generate the supervisory signals, we adopt two components to classify the major surface normal into dominant directions and detect the planar regions on the fly during training. As the predicted depth becomes more accurate after more training epochs, the supervisory signals also improve and in turn feedback to obtain a better depth model. Through extensive experiments on indoor benchmark datasets, the results show that our network outperforms the state-of-the-art methods. The source code is available at https://github.com/SJTU-ViSYS/StructDepth .
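下面给出“曼哈顿法线约束”一种可能实现的简化示意(PyTorch):将由预测深度反投影得到的表面法线,与三个相互正交的主方向中最接近者对齐;主方向的估计与共面约束此处省略,接口与权重均为示意。

```python
import torch

def manhattan_normal_loss(normals, dominant_dirs):
    """示意:鼓励主表面(地面/天花板/墙面)法线与主方向(含正反向)对齐。
    normals: (N, 3) 单位法线;dominant_dirs: (3, 3) 三个正交单位主方向。"""
    # 每条法线与各主方向夹角余弦的绝对值,取最接近的主方向
    cos = torch.abs(normals @ dominant_dirs.t())   # (N, 3)
    best = cos.max(dim=1).values                   # (N,)
    # 对齐得越好 best 越接近 1,损失越小
    return (1.0 - best).mean()

# 用法示意:仅对被分类为主表面的像素计算该损失,
# 并与光度一致性损失加权求和后共同训练深度网络。
```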

【3】 Concurrent Discrimination and Alignment for Self-Supervised Feature Learning 标题:自监督特征学习的并行判别和对齐 链接:https://arxiv.org/abs/2108.08562

作者:Anjan Dutta,Massimiliano Mancini,Zeynep Akata 机构:University of Exeter,University of T¨ubingen 备注:International Conference on Computer Vision (DeepMTL) 2021 摘要:现有的自监督学习方法通过借口任务学习表征,这些任务要么(1)明确指定哪些特征应该分离,要么(2)对齐,精确指示哪些特征应该闭合,但忽略了一个事实,即如何共同且主要地定义哪些特征应该被排斥,哪些特征应该被吸引。在这项工作中,我们结合了识别和对齐方法的积极方面,并设计了一种解决上述问题的混合方法。我们的方法通过区分性预测任务和同时最大化共享冗余信息的成对视图之间的互信息,分别明确指定了排斥和吸引机制。我们定性和定量地表明,我们提出的模型学习更好的特征,更有效地处理从分类到语义分割的各种下游任务。我们在九个已建立的基准上的实验表明,该模型始终优于现有的自监督和转移学习协议的最新结果。 摘要:Existing self-supervised learning methods learn representation by means of pretext tasks which are either (1) discriminating that explicitly specify which features should be separated or (2) aligning that precisely indicate which features should be closed together, but ignore the fact how to jointly and principally define which features to be repelled and which ones to be attracted. In this work, we combine the positive aspects of the discriminating and aligning methods, and design a hybrid method that addresses the above issue. Our method explicitly specifies the repulsion and attraction mechanism respectively by discriminative predictive task and concurrently maximizing mutual information between paired views sharing redundant information. We qualitatively and quantitatively show that our proposed model learns better features that are more effective for the diverse downstream tasks ranging from classification to semantic segmentation. Our experiments on nine established benchmarks show that the proposed model consistently outperforms the existing state-of-the-art results of self-supervised and transfer learning protocol.
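下面给出“判别式预测任务 + 成对视图互信息最大化”这一混合目标的一个简化示意(PyTorch):用 InfoNCE 充当互信息下界作为对齐项,用旋转预测充当判别式借口任务;具体损失形式与权重仅为示意,并非论文原始实现。

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """示意:成对视图之间的 InfoNCE 损失(互信息下界)。z1, z2: (N, D)。"""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                       # (N, N) 相似度矩阵
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def hybrid_loss(z1, z2, rot_logits, rot_labels, w=1.0):
    """示意:对齐项(InfoNCE)+ 判别项(旋转预测交叉熵)的加权和。"""
    return info_nce(z1, z2) + w * F.cross_entropy(rot_logits, rot_labels)
```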

【4】 Self-Supervised Video Representation Learning with Meta-Contrastive Network 标题:基于元对比网络的自监督视频表示学习 链接:https://arxiv.org/abs/2108.08426

作者:Yuanze Lin,Xun Guo,Yan Lu 机构:University of Washington, Microsoft Research Asia 备注:Accepted to ICCV 2021 摘要:自监督学习已成功地应用于视频表示的预训练,其目的是高效地从预训练域适应到下游任务。现有的方法仅仅利用对比损失来学习实例级别的区分。然而,类别信息的缺乏将导致难正样本(hard-positive)问题,从而限制了这类方法的泛化能力。我们发现元学习的多任务过程可以解决这个问题。在本文中,我们提出了一个元对比网络(MCN),它将对比学习和元学习结合起来,以增强现有自监督方法的学习能力。我们的方法包含两个基于模型不可知元学习(MAML)的训练阶段,每个阶段包括一个对比分支和一个元分支。广泛的评估证明了我们方法的有效性。对于两个下游任务,即视频动作识别和视频检索,MCN在UCF101和HMDB51数据集上优于最先进的方法。更具体地说,使用R(2+1)D主干,MCN在视频动作识别上的Top-1准确率分别达到84.8%和54.5%,在视频检索上分别达到52.5%和23.7%。 摘要:Self-supervised learning has been successfully applied to pre-train video representations, which aims at efficient adaptation from pre-training domain to downstream tasks. Existing approaches merely leverage contrastive loss to learn instance-level discrimination. However, lack of category information will lead to hard-positive problem that constrains the generalization ability of this kind of methods. We find that the multi-task process of meta learning can provide a solution to this problem. In this paper, we propose a Meta-Contrastive Network (MCN), which combines the contrastive learning and meta learning, to enhance the learning ability of existing self-supervised approaches. Our method contains two training stages based on model-agnostic meta learning (MAML), each of which consists of a contrastive branch and a meta branch. Extensive evaluations demonstrate the effectiveness of our method. For two downstream tasks, i.e., video action recognition and video retrieval, MCN outperforms state-of-the-art approaches on UCF101 and HMDB51 datasets. To be more specific, with R(2+1)D backbone, MCN achieves Top-1 accuracies of 84.8% and 54.5% for video action recognition, as well as 52.5% and 23.7% for video retrieval.

时序|行为识别|姿态|视频|运动估计(10篇)

【1】 Multi-Object Tracking with Hallucinated and Unlabeled Videos 标题:利用幻觉视频与未标记视频的多目标跟踪 链接:https://arxiv.org/abs/2108.08836

作者:Daniel McKee,Bing Shuai,Andrew Berneshawi,Manchen Wang,Davide Modolo,Svetlana Lazebnik,Joseph Tighe 机构:University of Illinois at Urbana-Champaign, Amazon Web Services 摘要:在这篇论文中,我们探讨了在没有跟踪注释的情况下学习端到端的深度神经跟踪器。这一点很重要,因为大规模训练数据对于训练深度神经跟踪器至关重要,而跟踪注释的获取成本很高。代替跟踪注释,我们首先使用放大/缩小运动变换从带有边界框注释的图像中产生幻觉视频,以获得自由跟踪标签。我们添加了视频模拟增强,以创建多样化的跟踪数据集,尽管只是简单的运动。接下来,为了解决更难的跟踪案例,我们使用训练过幻觉视频数据的跟踪器,在未标记的真实视频池中挖掘硬示例。对于硬示例挖掘,我们提出了一种基于优化的连接过程,首先从未标记的视频池中识别硬示例,然后纠正硬示例。最后,我们在幻觉数据和挖掘的硬视频示例上联合训练我们的跟踪器。我们的弱监督跟踪器在MOT17和TAO个人数据集上实现了最先进的性能。在MOT17上,我们进一步证明了我们自己生成的数据和现有手动注释数据的组合带来了额外的改进。 摘要:In this paper, we explore learning end-to-end deep neural trackers without tracking annotations. This is important as large-scale training data is essential for training deep neural trackers while tracking annotations are expensive to acquire. In place of tracking annotations, we first hallucinate videos from images with bounding box annotations using zoom-in/out motion transformations to obtain free tracking labels. We add video simulation augmentations to create a diverse tracking dataset, albeit with simple motion. Next, to tackle harder tracking cases, we mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data. For hard example mining, we propose an optimization-based connecting process to first identify and then rectify hard examples from the pool of unlabeled videos. Finally, we train our tracker jointly on hallucinated data and mined hard video examples. Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets. On MOT17, we further demonstrate that the combination of our self-generated data and the existing manually-annotated data leads to additional improvements.
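下面给出“用放大/缩小运动变换从带边界框标注的静态图像产生幻觉视频”的一个最小示意:逐帧缩放裁剪窗口模拟相机运动,并同步变换边界框得到免费的跟踪标签;裁剪与缩放方式均为示意,并非论文原始实现。

```python
import numpy as np
from PIL import Image

def hallucinate_video(img, box, num_frames=8, max_zoom=1.3):
    """示意:img 为 PIL.Image,box 为 (x1, y1, x2, y2)。
    返回 (帧列表, 对应边界框列表),作为免费的跟踪标签。"""
    W, H = img.size
    frames, boxes = [], []
    for t in range(num_frames):
        z = 1.0 + (max_zoom - 1.0) * t / max(num_frames - 1, 1)   # 逐帧放大
        cw, ch = W / z, H / z                                     # 裁剪窗口大小
        cx, cy = (W - cw) / 2, (H - ch) / 2                       # 以图像中心为缩放中心
        crop = img.crop((cx, cy, cx + cw, cy + ch)).resize((W, H))
        sx, sy = W / cw, H / ch
        x1, y1, x2, y2 = box
        boxes.append(((x1 - cx) * sx, (y1 - cy) * sy,
                      (x2 - cx) * sx, (y2 - cy) * sy))
        frames.append(np.asarray(crop))
    return frames, boxes
```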

【2】 Click to Move: Controlling Video Generation with Sparse Motion 标题:点击移动:使用稀疏运动控制视频生成 链接:https://arxiv.org/abs/2108.08815

作者:Pierfrancesco Ardino,Marco De Nadai,Bruno Lepri,Elisa Ricci,Stéphane Lathuilière 机构:University of Trento, Fondazione Bruno Kessler, LTCI, T´el´ecom Paris, Institut Polytechnique de Paris 备注:Accepted by International Conference on Computer Vision (ICCV 2021) 摘要:本文介绍了一种新的视频生成框架——点击移动(C2M),用户可以通过鼠标点击指定场景中关键对象的简单对象轨迹来控制合成视频的运动。我们的模型接收一个初始帧、其相应的分割图和编码用户输入的稀疏运动矢量作为输入。它从给定的帧开始输出一个看似合理的视频序列,并且运动与用户输入一致。值得注意的是,我们提出的deep体系结构结合了一个图形卷积网络(GCN),以整体方式对场景中所有对象的运动进行建模,并有效地结合了稀疏的用户运动信息和图像特征。实验结果表明,C2M在两个公开可用的数据集上优于现有方法,从而证明了我们的GCN框架在建模对象交互方面的有效性。源代码可在https://github.com/PierfrancescoArdino/C2M. 摘要:This paper introduces Click to Move (C2M), a novel framework for video generation where the user can control the motion of the synthesized video through mouse clicks specifying simple object trajectories of the key objects in the scene. Our model receives as input an initial frame, its corresponding segmentation map and the sparse motion vectors encoding the input provided by the user. It outputs a plausible video sequence starting from the given frame and with a motion that is consistent with user input. Notably, our proposed deep architecture incorporates a Graph Convolution Network (GCN) modelling the movements of all the objects in the scene in a holistic manner and effectively combining the sparse user motion information and image features. Experimental results show that C2M outperforms existing methods on two publicly available datasets, thus demonstrating the effectiveness of our GCN framework at modelling object interactions. The source code is publicly available at https://github.com/PierfrancescoArdino/C2M.

【3】 Category-Level 6D Object Pose Estimation via Cascaded Relation and Recurrent Reconstruction Networks 标题:基于级联关系和递归重构网络的类别级六维目标位姿估计 链接:https://arxiv.org/abs/2108.08755

作者:Jiaze Wang,Kai Chen,Qi Dou 机构:Department of Computer Science and Engineering, The Chinese University of Hong Kong; T Stone Robotics Institute, The Chinese University of Hong Kong 备注:accepted by IROS2021 摘要:类别级6D姿势估计,旨在预测看不见对象实例的位置和方向,是许多场景(如机器人操作和增强现实)的基础,但仍然没有解决。在正则空间中精确地恢复实例三维模型,并将其与观测值精确匹配,是对不可见对象进行6D姿态估计的关键。在本文中,我们通过级联关系和递归重建网络实现了精确的类别级6D位姿估计。具体而言,一种新颖的级联关系网络专门用于高级表征学习,以探索实例RGB图像、实例点云和类别形状先验之间复杂且信息丰富的关系。此外,我们设计了一个用于迭代残差细化的递归重建网络,以从粗到细逐步改进重建和对应估计。最后,利用实例点云与标准空间中重建的三维模型之间的估计密集对应关系,获得实例6D姿势。我们在两个公认的类别级6D姿势估计基准上进行了大量实验,与现有方法相比,性能有了显著提高。在具有代表性的严格评估指标$3D_{75}$和$5^{\circ}2\,cm$上,我们的方法在CAMERA25数据集上比最新最先进的SPD高出4.9%和17.7%,在REAL275数据集上比最新的SPD高出2.7%和8.5%。代码可在https://wangjiaze.cn/projects/6DPoseEstimation.html. 摘要:Category-level 6D pose estimation, aiming to predict the location and orientation of unseen object instances, is fundamental to many scenarios such as robotic manipulation and augmented reality, yet still remains unsolved. Precisely recovering instance 3D model in the canonical space and accurately matching it with the observation is an essential point when estimating 6D pose for unseen objects. In this paper, we achieve accurate category-level 6D pose estimation via cascaded relation and recurrent reconstruction networks. Specifically, a novel cascaded relation network is dedicated for advanced representation learning to explore the complex and informative relations among instance RGB image, instance point cloud and category shape prior. Furthermore, we design a recurrent reconstruction network for iterative residual refinement to progressively improve the reconstruction and correspondence estimations from coarse to fine. Finally, the instance 6D pose is obtained leveraging the estimated dense correspondences between the instance point cloud and the reconstructed 3D model in the canonical space. We have conducted extensive experiments on two well-acknowledged benchmarks of category-level 6D pose estimation, with significant performance improvement over existing approaches. On the representatively strict evaluation metrics of $3D_{75}$ and $5^{\circ}2\,cm$, our method exceeds the latest state-of-the-art SPD by $4.9\%$ and $17.7\%$ on the CAMERA25 dataset, and by $2.7\%$ and $8.5\%$ on the REAL275 dataset. Codes are available at https://wangjiaze.cn/projects/6DPoseEstimation.html.

【4】 DECA: Deep viewpoint-Equivariant human pose estimation using Capsule Autoencoders 标题:DECA:基于胶囊自动编码器的深度视点等变人体姿态估计 链接:https://arxiv.org/abs/2108.08557

作者:Nicola Garau,Niccolò Bisagno,Piotr Bródka,Nicola Conci 机构:University of Trento, Via Sommarive, Povo, Trento TN 备注:International Conference on Computer Vision 2021 (ICCV 2021), 8 pages, 4 figures, 4 tables, accepted for ICCV 2021 oral 摘要:人体姿势估计(HPE)旨在从图像或视频中检索人体关节的三维位置。我们发现,当前的3D HPE方法缺乏视点等价性,即在处理训练时看不到的视点时,它们往往会失败或表现不佳。深度学习方法通常依赖于缩放不变、平移不变或旋转不变操作,如最大池。然而,采用这样的程序并不一定能提高视点的泛化,反而会导致更多依赖数据的方法。为了解决这个问题,我们提出了一种具有快速变分贝叶斯胶囊路由的新型胶囊自动编码器网络,称为DECA。通过将每个关节建模为胶囊实体,并结合路由算法,我们的方法可以独立于视点在特征空间中保留关节的层次结构和几何结构。通过实现视点等变,我们大大减少了训练时的网络数据依赖性,从而提高了对不可见视点的泛化能力。在实验验证中,我们在可见和不可见视点、俯视和前视图的深度图像上都优于其他方法。在RGB领域,同一网络在具有挑战性的视点转移任务上提供了最先进的结果,也为俯视HPE建立了新的框架。有关代码,请访问https://github.com/mmlab-cv/DECA. 摘要:Human Pose Estimation (HPE) aims at retrieving the 3D position of human joints from images or videos. We show that current 3D HPE methods suffer a lack of viewpoint equivariance, namely they tend to fail or perform poorly when dealing with viewpoints unseen at training time. Deep learning methods often rely on either scale-invariant, translation-invariant, or rotation-invariant operations, such as max-pooling. However, the adoption of such procedures does not necessarily improve viewpoint generalization, rather leading to more data-dependent methods. To tackle this issue, we propose a novel capsule autoencoder network with fast Variational Bayes capsule routing, named DECA. By modeling each joint as a capsule entity, combined with the routing algorithm, our approach can preserve the joints' hierarchical and geometrical structure in the feature space, independently from the viewpoint. By achieving viewpoint equivariance, we drastically reduce the network data dependency at training time, resulting in an improved ability to generalize for unseen viewpoints. In the experimental validation, we outperform other methods on depth images from both seen and unseen viewpoints, both top-view, and front-view. In the RGB domain, the same network gives state-of-the-art results on the challenging viewpoint transfer task, also establishing a new framework for top-view HPE. The code can be found at https://github.com/mmlab-cv/DECA.

【5】 D3D-HOI: Dynamic 3D Human-Object Interactions from Videos 标题:D3D-HOI:视频中的动态3D人-物交互 链接:https://arxiv.org/abs/2108.08420

作者:Xiang Xu,Hanbyul Joo,Greg Mori,Manolis Savva 机构:Simon Fraser University, Facebook AI Research 摘要:我们介绍了D3D-HOI:一个单目视频数据集,在人机交互过程中具有三维物体姿势、形状和部分运动的地面真值注释。我们的数据集由多个常见的关节对象组成,这些对象是从不同的真实场景和摄影机视点捕获的。每个操纵对象(如微波炉)都用匹配的三维参数化模型表示。这些数据使我们能够评估关节对象的重建质量,并为这项具有挑战性的任务建立基准。特别是,我们利用估计的3D人体姿势来更准确地推断对象的空间布局和动力学。我们在我们的数据集上对这种方法进行了评估,证明人-物关系可以显著减少具有挑战性的真实世界视频中关节对象重建的模糊性。代码和数据集可在https://github.com/facebookresearch/d3d-hoi. 摘要:We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions. Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints. Each manipulated object (e.g., microwave oven) is represented with a matching 3D parametric model. This data allows us to evaluate the reconstruction quality of articulated objects and establish a benchmark for this challenging task. In particular, we leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics. We evaluate this approach on our dataset, demonstrating that human-object relations can significantly reduce the ambiguity of articulated object reconstructions from challenging real-world videos. Code and dataset are available at https://github.com/facebookresearch/d3d-hoi.

【6】 SO-Pose: Exploiting Self-Occlusion for Direct 6D Pose Estimation 标题:SO-Pose:利用自遮挡进行直接6D位姿估计 链接:https://arxiv.org/abs/2108.08367

作者:Yan Di,Fabian Manhardt,Gu Wang,Xiangyang Ji,Nassir Navab,Federico Tombari 机构:Technical University of Munich,Google,Tsinghua University 备注:ICCV2021 摘要:从单个RGB图像直接回归杂乱环境中对象姿势(例如3D旋转和平移)的所有6个自由度(6DoF)是一个具有挑战性的问题。虽然端到端方法最近显示出高效率的有希望的结果,但就姿势精度而言,它们与精心设计的基于P$n$P/RANSAC的方法相比仍然较差。在这项工作中,我们通过一种新的关于自遮挡的推理来解决这一缺点,以便为三维对象建立一个两层表示,这大大提高了端到端6D姿势估计的准确性。我们的框架,命名为SO Pose,以单个RGB图像作为输入,利用共享编码器和两个单独的解码器分别生成2D-3D对应以及自遮挡信息。然后将两个输出进行融合,以直接回归6自由度姿态参数。结合对齐对应、自遮挡和6D姿势的跨层一致性,我们可以进一步提高准确性和鲁棒性,在各种具有挑战性的数据集上超越或与所有其他最先进的方法相抗衡。 摘要:Directly regressing all 6 degrees-of-freedom (6DoF) for the object pose (e.g. the 3D rotation and translation) in a cluttered environment from a single RGB image is a challenging problem. While end-to-end methods have recently demonstrated promising results at high efficiency, they are still inferior when compared with elaborate P$n$P/RANSAC-based approaches in terms of pose accuracy. In this work, we address this shortcoming by means of a novel reasoning about self-occlusion, in order to establish a two-layer representation for 3D objects which considerably enhances the accuracy of end-to-end 6D pose estimation. Our framework, named SO-Pose, takes a single RGB image as input and respectively generates 2D-3D correspondences as well as self-occlusion information harnessing a shared encoder and two separate decoders. Both outputs are then fused to directly regress the 6DoF pose parameters. Incorporating cross-layer consistencies that align correspondences, self-occlusion and 6D pose, we can further improve accuracy and robustness, surpassing or rivaling all other state-of-the-art approaches on various challenging datasets.

【7】 The Multi-Modal Video Reasoning and Analyzing Competition 标题:多模态视频推理分析比赛 链接:https://arxiv.org/abs/2108.08344

作者:Haoran Peng,He Huang,Li Xu,Tianjiao Li,Jun Liu,Hossein Rahmani,Qiuhong Ke,Zhicheng Guo,Cong Wu,Rongchang Li,Mang Ye,Jiahao Wang,Jiaxu Zhang,Yuanzhong Liu,Tao He,Fuwei Zhang,Xianbin Liu,Tao Lin 机构:Singapore University of Technology and Design, Lancaster University, University of Melbourne, Xidian University, Jiangnan University, Wuhan University, Tsinghua University, Sun Yat-sen University, BOE Technology Group Co., Ltd 备注:Accepted to ICCV 2021 Workshops 摘要:在本文中,我们结合ICCV 2021介绍了多模式视频推理和分析竞赛(MMVRAC)研讨会。该竞赛由四个不同的轨道组成,即视频问答、基于骨架的动作识别、基于鱼眼视频的动作识别和人物再识别,这是基于两个数据集:SUTD TrafficQA和UAV Human。我们总结了参赛者提交的最佳表现方法,并展示了他们在比赛中取得的成绩。 摘要:In this paper, we introduce the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) workshop in conjunction with ICCV 2021. This competition is composed of four different tracks, namely, video question answering, skeleton-based action recognition, fisheye video-based action recognition, and person re-identification, which are based on two datasets: SUTD-TrafficQA and UAV-Human. We summarize the top-performing methods submitted by the participants in this competition and show their results achieved in the competition.

【8】 Learned Video Compression with Residual Prediction and Loop Filter 标题:基于残差预测和环路滤波的学习型视频压缩 链接:https://arxiv.org/abs/2108.08551

作者:Chao Liu,Heming Sun,Jiro Katto,Xiaoyang Zeng,Yibo Fan 摘要:在本文中,我们提出了一种学习视频编解码器与残差预测网络(RP网络)和特征辅助环路滤波器(LF网络)。对于RP网络,我们利用先前多帧的残差来进一步消除当前帧残差的冗余。对于LF网络,使用残差解码网络和运动补偿网络的特征来辅助重建质量。为了降低复杂度,RP网络和LF网络均采用轻型ResNet结构作为主干。实验结果表明,与已有的视频压缩框架相比,我们可以节省约10%的BD速率。此外,由于采用了ResNet主干网,我们可以实现更快的编码速度。该项目可在以下网址获得:https://github.com/chaoliu18/RPLVC. 摘要:In this paper, we propose a learned video codec with a residual prediction network (RP-Net) and a feature-aided loop filter (LF-Net). For the RP-Net, we exploit the residual of previous multiple frames to further eliminate the redundancy of the current frame residual. For the LF-Net, the features from residual decoding network and the motion compensation network are used to aid the reconstruction quality. To reduce the complexity, a light ResNet structure is used as the backbone for both RP-Net and LF-Net. Experimental results illustrate that we can save about 10% BD-rate compared with previous learned video compression frameworks. Moreover, we can achieve faster coding speed due to the ResNet backbone. This project is available at https://github.com/chaoliu18/RPLVC.

【9】 Blindly Assess Quality of In-the-Wild Videos via Quality-aware Pre-training and Motion Perception 标题:通过质量感知预训练和运动感知对野外视频质量进行盲评估 链接:https://arxiv.org/abs/2108.08505

作者:Bowen Li,Weixia Zhang,Meng Tian,Guangtao Zhai,Xianpei Wang 摘要:对野外采集的视频进行感知质量评估对于视频服务的质量保证至关重要。原始质量参考视频的不可访问性和真实失真的复杂性对这种盲视频质量评估(BVQA)任务提出了巨大挑战。尽管基于模型的迁移学习是BVQA任务的一种有效范例,但探索什么以及如何跨越域转移以获得更好的视频表示仍然是一个挑战。在这项工作中,我们建议从具有真实失真的图像质量评估(IQA)数据库和具有丰富运动模式的大规模动作识别中转移知识。我们依靠这两组数据来学习特征提取器。我们使用混合列表排序损失函数在目标VQA数据库上训练所提出的模型。在六个数据库上的大量实验表明,我们的方法在单个数据库和混合数据库训练环境下都具有很强的竞争力。我们还验证了所提方法各组成部分的合理性,并探索了进一步改进的简单方式。 摘要:Perceptual quality assessment of the videos acquired in the wilds is of vital importance for quality assurance of video services. The inaccessibility of reference videos with pristine quality and the complexity of authentic distortions pose great challenges for this kind of blind video quality assessment (BVQA) task. Although model-based transfer learning is an effective and efficient paradigm for the BVQA task, it remains to be a challenge to explore what and how to bridge the domain shifts for better video representation. In this work, we propose to transfer knowledge from image quality assessment (IQA) databases with authentic distortions and large-scale action recognition with rich motion patterns. We rely on both groups of data to learn the feature extractor. We train the proposed model on the target VQA databases using a mixed list-wise ranking loss function. Extensive experiments on six databases demonstrate that our method performs very competitively under both individual database and mixed database training settings. We also verify the rationality of each component of the proposed method and explore a simple manner for further improvement.
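下面给出“按列表排序(list-wise ranking)训练质量预测”的一个简化示意(PyTorch),采用 ListNet 风格的 top-1 概率交叉熵;不同数据库间的混合训练与论文中具体的损失形式均从略,仅作示意。

```python
import torch
import torch.nn.functional as F

def listwise_ranking_loss(pred_scores, gt_scores, tau=1.0):
    """示意:ListNet 风格的列表排序损失。
    pred_scores / gt_scores: (N,),同一数据库内一个列表(batch)的预测分与主观质量分。"""
    p_pred = F.log_softmax(pred_scores / tau, dim=0)
    p_gt = F.softmax(gt_scores / tau, dim=0)
    return F.kl_div(p_pred, p_gt, reduction="sum")

# 混合数据库训练的用法示意:对每个数据库各采样一个列表、分别计算排序损失后求和,
# 从而回避不同数据库主观分数尺度不一致的问题。
```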

【10】 Temporal Kernel Consistency for Blind Video Super-Resolution 标题:盲视频超分辨率的时间核一致性研究 链接:https://arxiv.org/abs/2108.08305

作者:Lichuan Xiang,Royson Lee,Mohamed S. Abdelfattah,Nicholas D. Lane,Hongkai Wen 机构:University of Warwick, University of Cambridge, Samsung AI Center, Cambridge 摘要:基于深度学习的盲超分辨率(SR)方法最近在未知退化的放大帧中取得了前所未有的性能。这些模型能够从给定的低分辨率(LR)图像中准确估计未知的降尺度核,以便在恢复过程中利用核。尽管这些方法在很大程度上取得了成功,但它们主要基于图像,因此不利用多个视频帧中内核的时间特性。在本文中,我们研究了核的时间特性,并强调了它在盲视频超分辨率任务中的重要性。具体地说,我们测量了真实世界视频的内核时间一致性,并说明了在场景及其对象的动态性不同的视频中,估计的内核在每帧中是如何变化的。有了这一新的见解,我们回顾了以前流行的视频SR方法,并表明以前在整个恢复过程中使用固定内核的假设在放大真实世界的视频时会导致视觉伪影。为了解决这个问题,我们定制了现有的单图像和视频SR技术,以在内核估计和视频放大过程中利用内核一致性。对合成视频和真实视频的大量实验表明,从数量和质量上都有很大的恢复收益,实现了盲视频SR的最新技术,并强调了利用内核时间一致性的潜力。 摘要:Deep learning-based blind super-resolution (SR) methods have recently achieved unprecedented performance in upscaling frames with unknown degradation. These models are able to accurately estimate the unknown downscaling kernel from a given low-resolution (LR) image in order to leverage the kernel during restoration. Although these approaches have largely been successful, they are predominantly image-based and therefore do not exploit the temporal properties of the kernels across multiple video frames. In this paper, we investigated the temporal properties of the kernels and highlighted its importance in the task of blind video super-resolution. Specifically, we measured the kernel temporal consistency of real-world videos and illustrated how the estimated kernels might change per frame in videos of varying dynamicity of the scene and its objects. With this new insight, we revisited previous popular video SR approaches, and showed that previous assumptions of using a fixed kernel throughout the restoration process can lead to visual artifacts when upscaling real-world videos. In order to counteract this, we tailored existing single-image and video SR techniques to leverage kernel consistency during both kernel estimation and video upscaling processes. Extensive experiments on synthetic and real-world videos show substantial restoration gains quantitatively and qualitatively, achieving the new state-of-the-art in blind video SR and underlining the potential of exploiting kernel temporal consistency.

医学相关(2篇)

【1】 MobileCaps: A Lightweight Model for Screening and Severity Analysis of COVID-19 Chest X-Ray Images 标题:MobileCaps:一种用于冠状病毒胸部X线图像筛查和严重性分析的轻量级模型 链接:https://arxiv.org/abs/2108.08775

作者:S J Pawan,Rahul Sankar,Amithash M Prabhudev,P A Mahesh,K Prakashini,Sudha Kiran Das,Jeny Rajan 机构:Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India, Department of Respiratory Medicine, Kasturba Medical College and Hospital, Manipal, J.S.S. Medical College, Mysore, India 备注:14 pages, 6 figures 摘要:由于新冠疫情对医疗体系和经济造成的灾难性影响,世界正在经历一个具有挑战性的阶段。传播速度、新冠病毒后的症状以及新冠病毒新毒株的出现使全球的医疗体系陷入了混乱。因此,准确筛查新冠肺炎(COVID-19)病例已成为当务之急。由于病毒感染呼吸系统,胸部X光是一种广泛用于初始筛查的成像方式。我们进行了一项综合研究,利用CXR图像识别新冠肺炎病例,并认识到有必要建立一个更具普遍性的模型。我们利用MobileNetV2体系结构作为特征提取器,并将其集成到胶囊网络中,以构建称为MobileCaps的全自动轻量级模型。MobileCaps在公开的数据集上使用模型集成和贝叶斯优化策略进行训练和评估,以有效地区分新冠肺炎、非新冠肺炎性肺炎和健康病例的CXR图像。在另外两个经RT-PCR证实的数据集上进一步评估了所提出的模型,以证明其普遍性。我们还介绍了MobileCaps-S,并利用它基于肺水肿放射学评估(RALE)评分技术,对新冠肺炎的CXR图像进行严重程度评估。我们的分类模型对新冠肺炎、非新冠肺炎性肺炎和健康病例的召回率分别为91.60、94.60和92.20,精确率分别为98.50、88.21和92.62。此外,严重性评估模型的R$^2$系数为70.51。由于所提出的模型比文献中报道的最先进模型具有更少的可训练参数,我们相信我们的模型将在帮助医疗系统对抗此次大流行方面发挥很大作用。 摘要:The world is going through a challenging phase due to the disastrous effect caused by the COVID-19 pandemic on the healthcare system and the economy. The rate of spreading, post-COVID-19 symptoms, and the occurrence of new strands of COVID-19 have put the healthcare systems in disruption across the globe. Due to this, the task of accurately screening COVID-19 cases has become of utmost priority. Since the virus infects the respiratory system, Chest X-Ray is an imaging modality that is adopted extensively for the initial screening. We have performed a comprehensive study that uses CXR images to identify COVID-19 cases and realized the necessity of having a more generalizable model. We utilize MobileNetV2 architecture as the feature extractor and integrate it into Capsule Networks to construct a fully automated and lightweight model termed as MobileCaps. MobileCaps is trained and evaluated on the publicly available dataset with the model ensembling and Bayesian optimization strategies to efficiently classify CXR images of patients with COVID-19 from non-COVID-19 pneumonia and healthy cases. The proposed model is further evaluated on two additional RT-PCR confirmed datasets to demonstrate the generalizability. We also introduce MobileCaps-S and leverage it for performing severity assessment of CXR images of COVID-19 based on the Radiographic Assessment of Lung Edema (RALE) scoring technique. Our classification model achieved an overall recall of 91.60, 94.60, 92.20, and a precision of 98.50, 88.21, 92.62 for COVID-19, non-COVID-19 pneumonia, and healthy cases, respectively. Further, the severity assessment model attained an R$^2$ coefficient of 70.51. Owing to the fact that the proposed models have fewer trainable parameters than the state-of-the-art models reported in the literature, we believe our models will go a long way in aiding healthcare systems in the battle against the pandemic.

【2】 Reproducible radiomics through automated machine learning validated on twelve clinical applications 标题:自动机器学习的可重复性放射组学在12个临床应用中的验证 链接:https://arxiv.org/abs/2108.08618

作者:Martijn P. A. Starmans,Sebastian R. van der Voort,Thomas Phil,Milea J. M. Timbergen,Melissa Vos,Guillaume A. Padmos,Wouter Kessels,David Hanff,Dirk J. Grunhagen,Cornelis Verhoef,Stefan Sleijfer,Martin J. van den Bent,Marion Smits,Roy S. Dwarkasing,Christopher J. Els,Federico Fiduzi,Geert J. L. H. van Leenders,Anela Blazevic,Johannes Hofland,Tessa Brabander,Renza A. H. van Gils,Gaston J. H. Franssen,Richard A. Feelders,Wouter W. de Herder,Florian E. Buisman,Francois E. J. A. Willemssen,Bas Groot Koerkamp,Lindsay Angus,Astrid A. M. van der Veldt,Ana Rajicic,Arlette E. Odink,Mitchell Deen,Jose M. Castillo T.,Jifke Veenland,Ivo Schoots,Michel Renckens,Michail Doukas,Rob A. de Man,Jan N. M. IJzermans,Razvan L. Miclea,Peter B. Vermeulen,Esther E. Bron,Maarten G. Thomeer,Jacob J. Visser,Wiro J. Niessen,Stefan Klein 机构:Department of Radiology and Nuclear Medicine, Erasmus MC, Rotterdam, the Netherlands, Department of Surgical Oncology, Erasmus MC Cancer Institute, Rotterdam, the Netherlands 备注:29 pages, 3 figures, 4 tables, 2 supplementary figures, 1 supplementary table, submitted to Medical Image Analysis 摘要:放射组学使用定量医学成像特征来预测临床结果。虽然文献中描述了许多放射组学方法,但这些方法通常是为单一应用而设计的。本研究的目的是通过提出一个框架来自动构建和优化每个应用程序的radiomics工作流,从而将radiomics推广到各个应用程序。为此,我们制定了一个模块化的工作流程,由几个部分组成:图像和分割预处理、特征提取、特征和样本预处理以及机器学习。对于每个组件,都包含一组通用算法。为了优化每个应用程序的工作流程,我们使用自动机器学习,使用随机搜索和置乱。我们在12个不同的临床应用中评估了我们的方法,得出以下曲线下区域:1)脂肪肉瘤(0.83);2) 硬纤维样纤维瘤病(0.82);3) 原发性肝肿瘤(0.81);4) 胃肠道间质瘤(0.77);5) 结直肠肝转移(0.68);6) 黑色素瘤转移(0.51);7) 肝细胞癌(0.75);8) 肠系膜纤维化(0.81);9) 前列腺癌(0.72);10) 胶质瘤(0.70);11) 阿尔茨海默病(0.87);12)头颈癌(0.84)。最后,我们的方法完全自动构建和优化放射组学工作流程,从而在新的应用中简化放射组学生物标记物的搜索。为了促进再现性和未来研究,我们公开发布了六个数据集、我们框架的软件实现(开源)以及再现本研究的代码。 摘要:Radiomics uses quantitative medical imaging features to predict clinical outcomes. While many radiomics methods have been described in the literature, these are generally designed for a single application. The aim of this study is to generalize radiomics across applications by proposing a framework to automatically construct and optimize the radiomics workflow per application. To this end, we formulate radiomics as a modular workflow, consisting of several components: image and segmentation preprocessing, feature extraction, feature and sample preprocessing, and machine learning. For each component, a collection of common algorithms is included. To optimize the workflow per application, we employ automated machine learning using a random search and ensembling. We evaluate our method in twelve different clinical applications, resulting in the following area under the curves: 1) liposarcoma (0.83); 2) desmoid-type fibromatosis (0.82); 3) primary liver tumors (0.81); 4) gastrointestinal stromal tumors (0.77); 5) colorectal liver metastases (0.68); 6) melanoma metastases (0.51); 7) hepatocellular carcinoma (0.75); 8) mesenteric fibrosis (0.81); 9) prostate cancer (0.72); 10) glioma (0.70); 11) Alzheimer's disease (0.87); and 12) head and neck cancer (0.84). Concluding, our method fully automatically constructs and optimizes the radiomics workflow, thereby streamlining the search for radiomics biomarkers in new applications. To facilitate reproducibility and future research, we publicly release six datasets, the software implementation of our framework (open-source), and the code to reproduce this study.
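下面用 scikit-learn 给出“对模块化工作流做随机搜索、再对表现最好的若干条流水线做集成(ensembling)”这一思路的一个极简示意:随机采样若干(预处理、特征选择、分类器)组合,按交叉验证 AUC 排序后保留前 k 条;组件集合、搜索规模与评估方式均为示意,并非论文框架的完整实现。

```python
import random
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def sample_workflow(n_features):
    """示意:随机采样一条 radiomics 工作流配置。"""
    scaler = random.choice([StandardScaler(), RobustScaler()])
    k = random.choice([10, 20, 50, n_features])
    clf = random.choice([
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=200),
        SVC(probability=True),
    ])
    return Pipeline([("scale", scaler),
                     ("select", SelectKBest(f_classif, k=min(k, n_features))),
                     ("clf", clf)])

def random_search_ensemble(X, y, n_iter=50, top_k=5):
    """示意:随机搜索 n_iter 条工作流,返回按交叉验证 AUC 排名前 top_k 的流水线。"""
    scored = []
    for _ in range(n_iter):
        pipe = sample_workflow(X.shape[1])
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        scored.append((auc, pipe))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:top_k]]  # 预测时可对这些模型的输出概率取平均
```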

GAN|对抗|攻击|生成相关(7篇)

【1】 Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs 标题:Graph-to-3D:使用场景图端到端地生成与操作3D场景 链接:https://arxiv.org/abs/2108.08841

作者:Helisa Dhamo,Fabian Manhardt,Nassir Navab,Federico Tombari 机构: Technische Universit¨at M¨unchen, Google 备注:accepted to ICCV 2021 摘要:可控场景合成包括生成满足底层规范的3D信息。因此,这些规范应该是抽象的,即允许用户轻松交互,同时为详细控制提供足够的界面。场景图是场景的表示,由对象(节点)和对象间关系(边)组成,经证明特别适合此任务,因为它们允许对生成的内容进行语义控制。以前处理此任务的工作通常依赖于合成数据和检索对象网格,这自然限制了生成能力。为了避免这个问题,我们提出了第一个直接从场景图以端到端的方式生成形状的工作。此外,我们还展示了相同的模型支持场景修改,使用各自的场景图作为接口。利用图卷积网络(GCN),我们在对象和边缘类别以及3D形状和场景布局上训练了一个可变自动编码器,允许对新场景和形状进行后期采样。 摘要:Controllable scene synthesis consists of generating 3D information that satisfy underlying specifications. Thereby, these specifications should be abstract, i.e. allowing easy user interaction, whilst providing enough interface for detailed control. Scene graphs are representations of a scene, composed of objects (nodes) and inter-object relationships (edges), proven to be particularly suited for this task, as they allow for semantic control on the generated content. Previous works tackling this task often rely on synthetic data, and retrieve object meshes, which naturally limits the generation capabilities. To circumvent this issue, we instead propose the first work that directly generates shapes from a scene graph in an end-to-end manner. In addition, we show that the same model supports scene modification, using the respective scene graph as interface. Leveraging Graph Convolutional Networks (GCN) we train a variational Auto-Encoder on top of the object and edge categories, as well as 3D shapes and scene layouts, allowing latter sampling of new scenes and shapes.

【2】 ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis 标题:ImageBART:用于自回归图像合成的多项式扩散双向上下文 链接:https://arxiv.org/abs/2108.08827

作者:Patrick Esser,Robin Rombach,Andreas Blattmann,Björn Ommer 机构:Heidelberg Collaboratory for Image Processing, IWR, Heidelberg University, Germany 摘要:最近,自回归模型及其数据似然的顺序因子分解在图像表示和合成方面显示出巨大的潜力。尽管如此,他们通过只关注先前合成的图像块上方或左侧,以线性1D顺序合并图像上下文。这种单向、连续的注意力偏差不仅对图像来说是不自然的,因为它忽略了场景的大部分,直到合成几乎完成。它还以单个比例处理整个图像,从而忽略更多全局上下文信息,直至整个场景的要点。作为补救措施,我们通过将自回归公式与多项式扩散过程相结合,将上下文的粗糙到精细层次结合起来:尽管多级扩散过程连续移除信息以粗化图像,但我们训练(短)马尔可夫链来反转此过程。在每个阶段中,产生的自回归ImageBART模型以从粗到精的方式逐步合并先前阶段的上下文。实验表明,与自回归模型相比,图像修改能力大大提高,同时还提供了高保真图像生成,这两种方法都是通过在压缩的潜在空间中进行有效训练实现的。具体地说,我们的方法可以考虑用户提供的无限制遮罩来执行本地图像编辑。因此,与纯自回归模型相比,它可以解决自由形式的图像修复,并且在条件模型的情况下,可以解决局部文本引导的图像修改,而不需要特定于掩码的训练。 摘要:Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. Not only is this unidirectional, sequential bias of attention unnatural for images as it disregards large parts of a scene until synthesis is almost complete. It also processes the entire image on a single scale, thus ignoring more global contextual information up to the gist of the entire scene. As a remedy we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: Whereas a multistage diffusion process successively removes information to coarsen an image, we train a (short) Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both of which are enabled through efficient training in a compressed latent space. Specifically, our approach can take unrestricted, user-provided masks into account to perform local image editing. Thus, in contrast to pure autoregressive models, it can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training.

【3】 Towards Vivid and Diverse Image Colorization with Generative Color Prior 标题:基于生成式颜色先验的生动多样图像着色 链接:https://arxiv.org/abs/2108.08826

作者:Yanze Wu,Xintao Wang,Yu Li,Honglun Zhang,Xun Zhao,Ying Shan 机构:Applied Research Center (ARC), Tencent PCG 备注:ICCV 2021 摘要:近年来,彩色化引起了人们越来越多的兴趣。经典的基于参考的方法通常依赖外部彩色图像来获得合理的结果。检索此类示例不可避免地需要大型图像数据库或在线搜索引擎。最近的基于深度学习的方法可以以较低的成本自动为图像着色。然而,不令人满意的人工制品和不连贯的颜色总是伴随着。在这项工作中,我们的目标是通过利用预先训练的生成性对抗网络(GAN)中封装的丰富多样的颜色先验来恢复生动的颜色。具体来说,我们首先通过GAN编码器“检索”匹配特征(类似于示例),然后将这些特征与特征调制结合到着色过程中。由于强大的生成性色彩优先和精致的设计,我们的方法可以通过一次向前传递产生生动的色彩。此外,通过修改GAN潜码可以非常方便地获得不同的结果。我们的方法也继承了GAN的可解释控制的优点,并且可以通过穿越GAN的潜在空间实现可控和平滑的过渡。大量的实验和用户研究表明,我们的方法比以前的工作取得了更好的性能。 摘要:Colorization has attracted increasing interest in recent years. Classic reference-based methods usually rely on external color images for plausible results. A large image database or online search engine is inevitably required for retrieving such exemplars. Recent deep-learning-based methods could automatically colorize images at a low cost. However, unsatisfactory artifacts and incoherent colors are always accompanied. In this work, we aim at recovering vivid colors by leveraging the rich and diverse color priors encapsulated in a pretrained Generative Adversarial Networks (GAN). Specifically, we first "retrieve" matched features (similar to exemplars) via a GAN encoder and then incorporate these features into the colorization process with feature modulations. Thanks to the powerful generative color prior and delicate designs, our method could produce vivid colors with a single forward pass. Moreover, it is highly convenient to obtain diverse results by modifying GAN latent codes. Our method also inherits the merit of interpretable controls of GANs and could attain controllable and smooth transitions by walking through GAN latent space. Extensive experiments and user studies demonstrate that our method achieves superior performance than previous works.

【4】 Generating Superpixels for High-resolution Images with Decoupled Patch Calibration 标题:解耦面片定标生成高分辨率图像的超像素 链接:https://arxiv.org/abs/2108.08607

作者:Yaxiong Wang,Yuchao Wei,Xueming Qian,Li Zhu,Yi Yang 机构: Xi’an Jiaotong University 备注:10 pages, superpixel segmentation. High-resolution Computer Version 摘要:超像素分割最近取得了重要进展,得益于可微深度学习的进展。然而,由于内存和计算成本昂贵,超像素分割仍然具有挑战性,使得目前先进的超像素网络无法处理。在本文中,我们设计了面片校准网络(PCNet),旨在高效准确地实现高分辨率超像素分割。PCNet遵循从低分辨率输入产生高分辨率输出的原则,以节省GPU内存并降低计算成本。为了回忆由下采样操作破坏的细节,我们提出了一种新的解耦面片校准(DPC)分支,用于协作增强主超像素生成分支。特别是,DPC从高分辨率图像中获取一个局部补丁,并动态生成一个二进制掩码,以将网络施加到区域边界上。通过共享DPC和主要分支的参数,从高分辨率面片中学习到的精细细节知识将被传输,以帮助校准被破坏的信息。据我们所知,我们第一次尝试考虑基于深度学习的超像素生成高分辨率的情况。为了促进这项研究,我们从两个公共数据集和一个新构建的数据集构建了评估基准,涵盖了从细粒度人体部位到城市景观的广泛多样性。大量的实验表明,我们的PCNet不仅在定量结果方面优于现有技术,而且在1080Ti GPU上将分辨率上限从3K提高到5K。 摘要:Superpixel segmentation has recently seen important progress benefiting from the advances in differentiable deep learning. However, the very high-resolution superpixel segmentation still remains challenging due to the expensive memory and computation cost, making the current advanced superpixel networks fail to process. In this paper, we devise Patch Calibration Networks (PCNet), aiming to efficiently and accurately implement high-resolution superpixel segmentation. PCNet follows the principle of producing high-resolution output from low-resolution input for saving GPU memory and relieving computation cost. To recall the fine details destroyed by the down-sampling operation, we propose a novel Decoupled Patch Calibration (DPC) branch for collaboratively augment the main superpixel generation branch. In particular, DPC takes a local patch from the high-resolution images and dynamically generates a binary mask to impose the network to focus on region boundaries. By sharing the parameters of DPC and main branches, the fine-detailed knowledge learned from high-resolution patches will be transferred to help calibrate the destroyed information. To the best of our knowledge, we make the first attempt to consider the deep-learning-based superpixel generation for high-resolution cases. To facilitate this research, we build evaluation benchmarks from two public datasets and one new constructed one, covering a wide range of diversities from fine-grained human parts to cityscapes. Extensive experiments demonstrate that our PCNet can not only perform favorably against the state-of-the-arts in the quantitative results but also improve the resolution upper bound from 3K to 5K on 1080Ti GPUs.

【5】 Image2Lego: Customized LEGO Set Generation from Images 标题:Image2Lego:从图像生成自定义乐高积木集 链接:https://arxiv.org/abs/2108.08477

作者:Kyle Lennon,Katharina Fransen,Alexander O'Brien,Yumeng Cao,Matthew Beveridge,Yamin Arefeen,Nikhil Singh,Iddo Drori 机构:Massachusetts Institute of Technology 备注:9 pages, 10 figures 摘要:尽管乐高玩具已经让一代又一代的儿童和成人感到愉悦,但对于一般的乐高爱好者来说,设计与现实世界或想象场景的复杂性相匹配的定制玩具仍然是一个巨大的挑战。为了使这一壮举成为可能,我们实现了一个从2D图像生成乐高积木模型的系统。我们设计了一个新的解决方案,使用在三维体素化模型上训练的八叉树结构的自动编码器来获得模型重建的可行潜在表示,并使用训练的独立网络从二维图像预测该潜在表示。乐高模型是通过三维体素化模型到砖块的算法转换获得的。我们展示了第一个将照片转换为三维乐高模型的例子。八叉树体系结构可以灵活地生成多个分辨率,以最适合用户的创造性愿景或设计需求。为了证明我们系统的广泛适用性,我们为乐高物体和人脸模型生成逐步构建说明和动画。最后,我们通过使用真实的乐高积木构建物理构建来测试这些自动生成的乐高积木集。 摘要:Although LEGO sets have entertained generations of children and adults, the challenge of designing customized builds matching the complexity of real-world or imagined scenes remains too great for the average enthusiast. In order to make this feat possible, we implement a system that generates a LEGO brick model from 2D images. We design a novel solution to this problem that uses an octree-structured autoencoder trained on 3D voxelized models to obtain a feasible latent representation for model reconstruction, and a separate network trained to predict this latent representation from 2D images. LEGO models are obtained by algorithmic conversion of the 3D voxelized model to bricks. We demonstrate first-of-its-kind conversion of photographs to 3D LEGO models. An octree architecture enables the flexibility to produce multiple resolutions to best fit a user's creative vision or design needs. In order to demonstrate the broad applicability of our system, we generate step-by-step building instructions and animations for LEGO models of objects and human faces. Finally, we test these automatically generated LEGO sets by constructing physical builds using real LEGO bricks.

【6】 Generating Smooth Pose Sequences for Diverse Human Motion Prediction 标题:用于多种人体运动预测的平滑姿势序列生成 链接:https://arxiv.org/abs/2108.08422

作者:Wei Mao,Miaomiao Liu,Mathieu Salzmann 机构:Australian National University; CVLab, EPFL; ClearSpace, Switzerland 备注:ICCV21(oral) 摘要:随机运动预测的最新进展,即预测给定单个过去姿势序列的多个可能的未来人体运动,已导致产生真正多样的未来运动,甚至提供对某些身体部位运动的控制。然而,为了实现这一点,最先进的方法需要学习多个多样性映射和一个用于可控运动预测的专用模型。在本文中,我们介绍了一个统一的深度生成网络,用于多样性和可控运动预测。为此,我们利用了一种直觉,即真实的人体运动由有效姿势的平滑序列组成,并且在数据有限的情况下,学习姿态先验要比学习运动先验容易得多。因此,我们设计了一个生成器,可以按顺序预测不同身体部位的运动,并引入基于归一化流(normalizing flow)的姿态先验和关节角度损失,以实现运动真实感。我们在两个标准基准数据集Human3.6M和HumanEva-I上的实验,证明我们的方法在样本多样性和准确性方面优于最先进的基线。代码可在 https://github.com/wei-mao-2019/gsps 获取。 摘要:Recent progress in stochastic motion prediction, i.e., predicting multiple possible future human motions given a single past pose sequence, has led to producing truly diverse future motions and even providing control over the motion of some body parts. However, to achieve this, the state-of-the-art method requires learning several mappings for diversity and a dedicated model for controllable motion prediction. In this paper, we introduce a unified deep generative network for both diverse and controllable motion prediction. To this end, we leverage the intuition that realistic human motions consist of smooth sequences of valid poses, and that, given limited data, learning a pose prior is much more tractable than a motion one. We therefore design a generator that predicts the motion of different body parts sequentially, and introduce a normalizing flow based pose prior, together with a joint angle loss, to achieve motion realism. Our experiments on two standard benchmark datasets, Human3.6M and HumanEva-I, demonstrate that our approach outperforms the state-of-the-art baselines in terms of both sample diversity and accuracy. The code is available at https://github.com/wei-mao-2019/gsps
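下面给出上文提到的“关节角度损失”一种可能形式的简化示意(PyTorch):对超出经验活动范围的关节角度施加惩罚,以配合基于归一化流的姿态先验提升动作真实感;角度上下限与张量形状均为示意性占位。

```python
import torch
import torch.nn.functional as F

def joint_angle_loss(angles, lower, upper):
    """示意:关节角度限位损失。
    angles: (N, J) 预测的关节角度;lower / upper: (J,) 每个关节的经验活动范围。"""
    below = F.relu(lower.unsqueeze(0) - angles)   # 低于下限的部分
    above = F.relu(angles - upper.unsqueeze(0))   # 高于上限的部分
    return (below + above).mean()

# 用法示意:与姿态先验的负对数似然、重建误差等项加权求和,
# 共同约束生成器按身体部位依次预测出的未来姿态序列。
```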

【7】 Quality assessment of image matchers for DSM generation -- a comparative study based on UAV images 标题:用于DSM生成的图像匹配器的质量评估--基于无人机图像的比较研究 链接:https://arxiv.org/abs/2108.08369

作者:Rongjun Qin,Armin Gruen,Cive Fraser 机构: PhD student at Future Cities Laboratory, Singapore-ETH Center, CREATE Way #,-, CREATE Tower, Singapore , Principal Investigator of the Future Cities Laboratory, Chair of Information Architecture, ETH Zuerich, Wolfgang-Pauli-Strasse , CH-, Zuerich Switzerland 摘要:最近开发的自动稠密图像匹配算法目前正在用于DSM/DTM生产,其像素级表面生成能力提供了部分缓解手动和半自动立体测量需求的前景。在本文中,使用无人机记录的5cm GSD图像,评估了五个商业/公共3D表面生成软件包。生成的表面模型根据移动激光雷达和手动立体测量生成的点云进行评估。考虑的软件包包括APS、MICMAC、SURE、Pix4UAV和DLR的SGM实现。 摘要:Recently developed automatic dense image matching algorithms are now being implemented for DSM/DTM production, with their pixel-level surface generation capability offering the prospect of partially alleviating the need for manual and semi-automatic stereoscopic measurements. In this paper, five commercial/public software packages for 3D surface generation are evaluated, using 5cm GSD imagery recorded from a UAV. Generated surface models are assessed against point clouds generated from mobile LiDAR and manual stereoscopic measurements. The software packages considered are APS, MICMAC, SURE, Pix4UAV and an SGM implementation from DLR.

人脸|人群计数(2篇)

【1】 Gravity-Aware Monocular 3D Human-Object Reconstruction 标题:重力感知的单目三维人-物重建 链接:https://arxiv.org/abs/2108.08844

作者:Rishabh Dabral,Soshi Shimada,Arjun Jain,Christian Theobalt,Vladislav Golyanik 机构:IIT Bombay, MPI for Informatics, SIC, IISc Bangalore, Fast Code AI, D Object and, Pose Estimation, . . ., Initial ,D Pose, D Object Trajectory and, Joint Keypoints 备注:None 摘要:本文提出了一种基于单目RGB视频的联合无标记3D人体运动捕获和目标轨迹估计的新方法GraviCap。我们关注在自由飞行期间部分观察到的对象的场景。与现有的单目方法相比,由于意识到重力约束对象运动,我们可以恢复比例、对象轨迹以及人体骨骼长度(单位:米)和地平面的方向。我们的目标函数由物体的初始速度和位置、重力方向和焦距参数化,并针对一个或多个自由飞行事件进行联合优化。与无约束情况相比,提出的人机交互约束确保了三维重建的几何一致性,并改善了人体姿势的物理合理性。我们在一个新的数据集上评估GraviCap,该数据集包含自由飞行的人和不同物体的地面真实值注释。在实验中,我们的方法在各种度量上实现了最先进的三维人体运动捕捉精度。我们呼吁读者观看我们的补充视频。源代码和数据集都已发布;看见http://4dqv.mpi-inf.mpg.de/GraviCap/. 摘要:This paper proposes GraviCap, i.e., a new approach for joint markerless 3D human motion capture and object trajectory estimation from monocular RGB videos. We focus on scenes with objects partially observed during a free flight. In contrast to existing monocular methods, we can recover scale, object trajectories as well as human bone lengths in meters and the ground plane's orientation, thanks to the awareness of the gravity constraining object motions. Our objective function is parametrised by the object's initial velocity and position, gravity direction and focal length, and jointly optimised for one or several free flight episodes. The proposed human-object interaction constraints ensure geometric consistency of the 3D reconstructions and improved physical plausibility of human poses compared to the unconstrained case. We evaluate GraviCap on a new dataset with ground-truth annotations for persons and different objects undergoing free flights. In the experiments, our approach achieves state-of-the-art accuracy in 3D human motion capture on various metrics. We urge the reader to watch our supplementary video. Both the source code and the dataset are released; see http://4dqv.mpi-inf.mpg.de/GraviCap/.

【2】 Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting 标题:人群智慧:人群计数的贝叶斯分层范式 链接:https://arxiv.org/abs/2108.08784

作者:Sravya Vardhani Shivapuja,Mansi Pradeep Khamkar,Divij Bajaj,Ganesh Ramakrishnan,Ravi Kiran Sarvadevabhatla 机构:CVIT, IIIT Hyderabad, Hyderabad, INDIA; Dept. of CSE, IIT Bombay, Mumbai, INDIA 备注:Accepted at ACM Multimedia (ACMMM) 2021 . Code, pretrained models and interactive visualizations can be viewed at our project page this https URL 摘要:用于训练人群计数深度网络的数据集在计数分布上通常是重尾的,并且在整个计数范围内表现出不连续性。因此,事实上的统计指标(MSE、MAE)表现出很大的差异,往往是整个计数范围内不可靠的性能指标。为了全面解决这些问题,我们修订了标准人群计数流水线中各个阶段的处理流程。为了实现有原则且均衡的小批量采样,我们提出了一种新的平滑贝叶斯样本分层方法。我们提出了一个新的成本函数,可以很容易地纳入现有的人群计数深度网络,以鼓励分层感知优化。我们在标准数据集上按分层和整体两个层面分析了代表性人群计数方法的性能,并证明我们提出的修改显著降低了误差标准差。我们的贡献代表了对人群计数方法细致、统计均衡且细粒度的性能表征。代码、预训练模型和交互式可视化可以在我们的项目页面上查看https://deepcount.iiit.ac.in/ 摘要:Datasets for training crowd counting deep networks are typically heavy-tailed in count distribution and exhibit discontinuities across the count range. As a result, the de facto statistical measures (MSE, MAE) exhibit large variance and tend to be unreliable indicators of performance across the count range. To address these concerns in a holistic manner, we revise processes at various stages of the standard crowd counting pipeline. To enable principled and balanced minibatch sampling, we propose a novel smoothed Bayesian sample stratification approach. We propose a novel cost function which can be readily incorporated into existing crowd counting deep networks to encourage strata-aware optimization. We analyze the performance of representative crowd counting approaches across standard datasets at per strata level and in aggregate, and demonstrate that our proposed modifications noticeably reduce error standard deviation. Our contributions represent a nuanced, statistically balanced and fine-grained characterization of performance for crowd counting approaches. Code, pretrained models and interactive visualizations can be viewed at our project page https://deepcount.iiit.ac.in/
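下面给出“按人数分层的平衡小批量采样”这一思路的一个简化示意(numpy):先按人群计数把图像划入若干分层(bin),再按经平滑的分层概率采样小批量;这里的加性平滑仅为示意,并非论文中的贝叶斯平滑方案。

```python
import numpy as np

def stratified_minibatch(counts, batch_size=32, n_bins=8, smooth=1.0, rng=None):
    """示意:counts 为每张训练图像的人群计数,返回一个小批量的图像索引。
    smooth 为加性平滑系数,缓解重尾分布下部分分层样本过少的问题。"""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts, dtype=float)

    # 按计数的分位数划分分层(bin)
    edges = np.quantile(counts, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, counts, side="right") - 1, 0, n_bins - 1)

    # 仅对非空分层计算平滑后的采样概率
    sizes = np.bincount(bins, minlength=n_bins).astype(float)
    valid = np.flatnonzero(sizes > 0)
    probs = sizes[valid] + smooth
    probs /= probs.sum()

    chosen = rng.choice(valid, size=batch_size, p=probs)
    idx = [rng.choice(np.flatnonzero(bins == b)) for b in chosen]
    return np.array(idx)
```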

图像视频检索|Re-id相关(1篇)

【1】 Retrieval and Localization with Observation Constraints 标题:带观测约束的检索与定位 链接:https://arxiv.org/abs/2108.08516

作者:Yuhao Zhou,Huanhuan Fan,Shuang Gao,Yuchen Yang,Xudong Zhang,Jijunnan Li,Yandong Guo 机构:comAll authors are with OPPO Research Institute 备注:Accepted by the 2021 International Conference on Robotics and Automation (ICRA2021) 摘要:准确的视觉重新定位对于许多人工智能应用非常关键,例如增强现实、虚拟现实、机器人技术和自动驾驶。为了完成这项任务,我们提出了一种综合的视觉重新定位方法,称为RLOCS,通过结合图像检索、语义一致性和几何验证来实现精确的估计。本地化管道设计为从粗到精的范例。在检索部分,我们级联了ResNet101 GeM ArcFace的体系结构,并使用DBSCAN进行空间验证,以获得更好的初始粗略姿态。我们设计了一个称为观测约束的模块,该模块结合几何信息和语义一致性来过滤异常值。在开放数据集上进行了综合实验,包括R-Oxford5k和R-Paris6k检索、城市景观语义分割、亚琛昼夜定位和InLoc。通过创造性地修改整个管道中的单独模块,我们的方法在具有挑战性的本地化基准上实现了许多性能改进。 摘要:Accurate visual re-localization is very critical to many artificial intelligence applications, such as augmented reality, virtual reality, robotics and autonomous driving. To accomplish this task, we propose an integrated visual re-localization method called RLOCS by combining image retrieval, semantic consistency and geometry verification to achieve accurate estimations. The localization pipeline is designed as a coarse-to-fine paradigm. In the retrieval part, we cascade the architecture of ResNet101-GeM-ArcFace and employ DBSCAN followed by spatial verification to obtain a better initial coarse pose. We design a module called observation constraints, which combines geometry information and semantic consistency for filtering outliers. Comprehensive experiments are conducted on open datasets, including retrieval on R-Oxford5k and R-Paris6k, semantic segmentation on Cityscapes, localization on Aachen Day-Night and InLoc. By creatively modifying separate modules in the total pipeline, our method achieves many performance improvements on the challenging localization benchmarks.

裁剪|量化|加速|压缩相关(1篇)

【1】 An Information Theory-inspired Strategy for Automatic Network Pruning 标题:一种信息论启发的网络自动剪枝策略 链接:https://arxiv.org/abs/2108.08532

作者:Xiawu Zheng,Yuexiao Ma,Teng Xi,Gang Zhang,Errui Ding,Yuchao Li,Jie Chen,Yonghong Tian,Rongrong Ji 机构:Institute of Digital Media 摘要:尽管深度卷积神经网络在许多计算机视觉任务上表现优异,但众所周知,要将其部署到资源受限的设备上,通常需要对网络进行压缩。大多数现有的网络剪枝方法需要费力的人力和令人望而却步的计算资源,尤其是当约束条件发生变化时。当模型需要部署在广泛的设备上时,这实际上限制了模型压缩的应用。此外,现有的方法仍然受到缺少理论指导的挑战。在本文中,我们提出了一个信息理论启发的自动模型压缩策略。我们的方法背后的原理是信息瓶颈理论,即隐藏表示应该相互压缩信息。因此,我们引入网络激活的规范化Hilbert-Schmidt独立性准则(nHSIC)作为层重要性的稳定和广义指标。当给定一定的资源约束时,我们将HSIC指标与约束相结合,将架构搜索问题转化为具有二次约束的线性规划问题。这种问题很容易用凸优化方法解决,只需几秒钟。我们还提供了一个严格的证明,表明优化归一化HSIC同时最小化了不同层之间的互信息。与最先进的压缩算法相比,无需任何搜索过程,我们的方法实现了更好的压缩折衷。例如,使用ResNet-50,我们实现了45.3%的FLOPs削减,在ImageNet上达到了75.75的top-1精度。代码可在https://github.com/MAC-AutoML/ITPruner/tree/master. 摘要:Despite superior performance on many computer vision tasks, deep convolution neural networks are well known to be compressed on devices that have resource constraints. Most existing network pruning methods require laborious human efforts and prohibitive computation resources, especially when the constraints are changed. This practically limits the application of model compression when the model needs to be deployed on a wide range of devices. Besides, existing methods are still challenged by the missing theoretical guidance. In this paper we propose an information theory-inspired strategy for automatic model compression. The principle behind our method is the information bottleneck theory, i.e., the hidden representation should compress information with each other. We thus introduce the normalized Hilbert-Schmidt Independence Criterion (nHSIC) on network activations as a stable and generalized indicator of layer importance. When a certain resource constraint is given, we integrate the HSIC indicator with the constraint to transform the architecture search problem into a linear programming problem with quadratic constraints. Such a problem is easily solved by a convex optimization method with a few seconds. We also provide a rigorous proof to reveal that optimizing the normalized HSIC simultaneously minimizes the mutual information between different layers. Without any search process, our method achieves better compression tradeoffs comparing to the state-of-the-art compression algorithms. For instance, with ResNet-50, we achieve a 45.3%-FLOPs reduction, with a 75.75 top-1 accuracy on ImageNet. Codes are available at https://github.com/MAC-AutoML/ITPruner/tree/master.
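下面给出“用归一化 HSIC(nHSIC)衡量网络各层激活之间依赖/冗余程度”的一个最小示意(numpy,线性核);据此得到的层重要性指标再与资源约束一起构成带二次约束的线性规划,后者此处省略,接口均为示意。

```python
import numpy as np

def hsic(X, Y):
    """示意:线性核下的(有偏)HSIC 估计。X: (n, d1), Y: (n, d2),n 为样本数。"""
    n = X.shape[0]
    K, L = X @ X.T, Y @ Y.T                  # 线性核的 Gram 矩阵
    H = np.eye(n) - np.ones((n, n)) / n      # 中心化矩阵
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def nhsic(X, Y, eps=1e-12):
    """示意:归一化 HSIC,作为两层激活之间依赖程度的稳定指标。"""
    return hsic(X, Y) / (np.sqrt(hsic(X, X) * hsic(Y, Y)) + eps)

# 用法示意:对同一批输入,取相邻两层展平后的激活 X, Y,
# nhsic(X, Y) 可作为度量层间冗余、进而确定各层剪枝比例的依据。
```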

点云|SLAM|雷达|激光|深度RGBD相关(2篇)

【1】 VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction 标题:VolumeFusion:三维场景重建的深度融合 链接:https://arxiv.org/abs/2108.08623

作者:Jaesung Choe,Sunghoon Im,Francois Rameau,Minjun Kang,In So Kweon 机构:KAIST, DGIST 备注:ICCV 2021 Accepted 摘要:为了从一组校准视图重建三维场景,传统的多视图立体技术依赖于两个不同的阶段:局部深度贴图计算和全局深度贴图融合。最近的研究集中于使用传统深度融合方法或通过回归截断符号距离函数(TSDF)的直接三维重建网络进行深度估计的深度神经结构。在本文中,我们提倡使用深度神经网络复制传统的两阶段框架,以提高结果的可解释性和准确性。如前所述,我们的网络分两步运行:1)使用深度MVS技术对局部深度图进行局部计算,2)深度图和图像特征融合以构建单个TSDF体积。为了提高从不同视点(如大基线和旋转)获取的图像之间的匹配性能,我们引入了一种称为PosedConv的旋转不变三维卷积核。通过在ScanNet数据集上进行的一系列实验,我们的方法与传统和深度学习技术相比都具有优势,从而强调了所提出的体系结构的有效性。 摘要:To reconstruct a 3D scene from a set of calibrated views, traditional multi-view stereo techniques rely on two distinct stages: local depth maps computation and global depth maps fusion. Recent studies concentrate on deep neural architectures for depth estimation by using conventional depth fusion method or direct 3D reconstruction network by regressing Truncated Signed Distance Function (TSDF). In this paper, we advocate that replicating the traditional two stages framework with deep neural networks improves both the interpretability and the accuracy of the results. As mentioned, our network operates in two steps: 1) the local computation of the local depth maps with a deep MVS technique, and, 2) the depth maps and images' features fusion to build a single TSDF volume. In order to improve the matching performance between images acquired from very different viewpoints (e.g., large-baseline and rotations), we introduce a rotation-invariant 3D convolution kernel called PosedConv. The effectiveness of the proposed architecture is underlined via a large series of experiments conducted on the ScanNet dataset where our approach compares favorably against both traditional and deep learning techniques.

【2】 Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Learned Virtual View Visibility 标题:Vis2Mesh:学习虚拟视图可见性的大型场景非结构化点云的高效网格重建 链接:https://arxiv.org/abs/2108.08378

作者:Shuang Song,Zhaopeng Cui,Rongjun Qin 机构:The Ohio State University, Zhejiang University 备注:ICCV2021 摘要:我们提出了一种新的非结构化点云网格重建框架,利用虚拟视图中三维点的可见性和传统的基于图割的网格生成。具体地说,我们首先提出了一个三步网络,该网络明确采用深度补全进行可见性预测。然后,通过求解一个考虑可见性的优化问题,将多个视图的可见性信息聚合生成三维网格模型,该优化问题还引入了一种新的自适应可见性权重来抑制具有大入射角的视线。与其他基于学习的方法相比,我们的管道仅在2D二元分类任务上进行学习,即在视图中可见或不可见的点,这更具普遍性,实际上更有效,能够处理大量点。实验表明,我们的方法具有良好的可转移性和鲁棒性,并在小型复杂对象上实现了与当前最先进的基于学习的方法相媲美的性能,在大型室内和室外场景中的性能也优于传统的基于学习的方法。代码可在https://github.com/GDAOSU/vis2mesh. 摘要:We present a novel framework for mesh reconstruction from unstructured point clouds by taking advantage of the learned visibility of the 3D points in the virtual views and traditional graph-cut based mesh generation. Specifically, we first propose a three-step network that explicitly employs depth completion for visibility prediction. Then the visibility information of multiple views is aggregated to generate a 3D mesh model by solving an optimization problem considering visibility in which a novel adaptive visibility weighting in surface determination is also introduced to suppress line of sight with a large incident angle. Compared to other learning-based approaches, our pipeline only exercises the learning on a 2D binary classification task, ie, points visible or not in a view, which is much more generalizable and practically more efficient and capable to deal with a large number of points. Experiments demonstrate that our method with favorable transferability and robustness, and achieve competing performances wrt state-of-the-art learning-based approaches on small complex objects and outperforms on large indoor and outdoor scenes. Code is available at https://github.com/GDAOSU/vis2mesh.

3D|3D重建等相关(3篇)

【1】 Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables 标题:基于可学习空间感知3D查找表的实时图像增强器 链接:https://arxiv.org/abs/2108.08697

作者:Tao Wang,Yong Li,Jingyang Peng,Yipeng Ma,Xian Wang,Fenglong Song,Youliang Yan 机构:Huawei Noah’s Ark Lab 备注:Accepted to ICCV2021 摘要:最近,基于深度学习的图像增强算法在几个公开的数据集上实现了最先进的(SOTA)性能。然而,大多数现有的方法都不能满足视觉感知或计算效率的实际要求,特别是对于高分辨率图像。在本文中,我们提出了一种新的实时图像增强器,该增强器通过可学习的空间感知三维查找表(3D LUT)实现,充分考虑了全局场景和局部空间信息。具体地说,我们引入了一个具有两个输出的轻量级双头权重预测器。一个是用于图像级场景自适应的一维权重向量,另一个是用于像素级类别融合的三维权重图。我们学习空间感知3D LUT,并根据上述权重以端到端的方式对其进行融合。然后使用融合后的LUT以高效的方式将源图像变换到目标色调。大量结果表明,我们的模型在主观和客观上都优于SOTA图像增强方法,并且我们的模型在一个NVIDIA V100 GPU上处理4K分辨率的图像只需要大约4ms。 摘要:Recently, deep learning-based image enhancement algorithms achieved state-of-the-art (SOTA) performance on several publicly available datasets. However, most existing methods fail to meet practical requirements either for visual perception or for computation efficiency, especially for high-resolution images. In this paper, we propose a novel real-time image enhancer via learnable spatial-aware 3-dimensional lookup tables (3D LUTs), which well considers global scenario and local spatial information. Specifically, we introduce a lightweight two-head weight predictor that has two outputs. One is a 1D weight vector used for image-level scenario adaptation, the other is a 3D weight map aimed for pixel-wise category fusion. We learn the spatial-aware 3D LUTs and fuse them according to the aforementioned weights in an end-to-end manner. The fused LUT is then used to transform the source image into the target tone in an efficient way. Extensive results show that our model outperforms SOTA image enhancement methods on public datasets both subjectively and objectively, and that our model only takes about 4ms to process a 4K resolution image on one NVIDIA V100 GPU.
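下面是一个简化示意,展示"按权重融合多张3D LUT、再用三线性插值查表变换图像"这一步骤;其中权重向量假定已由文中的双头预测器给出,LUT的存储顺序与归一化方式均为假设。

```python
import torch
import torch.nn.functional as F

def apply_fused_3d_luts(img, luts, weights):
    """示意:按图像级权重融合多张 3D LUT,再用三线性插值查表变换图像。
    img: (1, 3, H, W),取值已归一化到 [0, 1]
    luts: (K, 3, S, S, S),假设存储顺序为 lut[k, c, b, g, r]
    weights: (K,),假设为轻量级双头预测器输出的一维权重向量
    """
    fused = torch.einsum('k,kcdhw->cdhw', weights, luts).unsqueeze(0)  # (1, 3, S, S, S)
    # grid_sample 的 3D 采样坐标按 (x, y, z) = (r, g, b) 排列,并映射到 [-1, 1]
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    grid = torch.stack([r, g, b], dim=-1).unsqueeze(1) * 2 - 1         # (1, 1, H, W, 3)
    out = F.grid_sample(fused, grid, mode='bilinear', align_corners=True)
    return out.squeeze(2)                                              # (1, 3, H, W)
```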

【2】 3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces 标题:3DIAS:基于隐式代数曲面的三维形状重建 链接:https://arxiv.org/abs/2108.08653

作者:Mohsen Yavartanoo,JaeYoung Chung,Reyhaneh Neshatavar,Kyoung Mu Lee 机构:ASRI, Department of ECE, Seoul National University, Seoul, Korea 备注:Published at ICCV 2021 摘要:三维形状表示对三维形状重建有重要影响。基于基元的表示主要通过一组简单的隐式基元来逼近三维形状,但基元的低几何复杂度限制了形状分辨率。此外,为任意形状设置足够数量的基本体是一项挑战。为了克服这些问题,我们提出了一种约束隐式代数曲面作为基本体,具有较少的可学习系数和较高的几何复杂度,并提出了一种生成这些基本体的深层神经网络。我们的实验表明,在单RGB图像三维形状重建中,与最先进的方法相比,我们的方法在表示能力方面具有优势。此外,我们证明了我们的方法可以在无监督的情况下从语义上学习三维形状的片段。该代码可从以下网站公开获取:https://myavartanoo.github.io/3dias/ . 摘要:3D Shape representation has substantial effects on 3D shape reconstruction. Primitive-based representations approximate a 3D shape mainly by a set of simple implicit primitives, but the low geometrical complexity of the primitives limits the shape resolution. Moreover, setting a sufficient number of primitives for an arbitrary shape is challenging. To overcome these issues, we propose a constrained implicit algebraic surface as the primitive with few learnable coefficients and higher geometrical complexities and a deep neural network to produce these primitives. Our experiments demonstrate the superiorities of our method in terms of representation power compared to the state-of-the-art methods in single RGB image 3D shape reconstruction. Furthermore, we show that our method can semantically learn segments of 3D shapes in an unsupervised manner. The code is publicly available from https://myavartanoo.github.io/3dias/ .
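作为对"以隐式代数曲面为基元"这一思路的简化示意,下面用一组二次曲面(quadric)基元在查询点上取最小值来组合形状;基元阶次、系数参数化与组合方式均为假设,并非论文的具体设计。

```python
import torch

def eval_quadric_primitives(points, Q):
    """示意:用一组二次代数曲面作为基元,在查询点上取最小值以近似组合形状。
    points: (N, 3) 查询点;Q: (K, 4, 4) 每个基元的对称系数矩阵(阶次与组合方式均为假设)
    返回每个点的隐式值,小于 0 可视为位于形状内部。
    """
    ones = torch.ones(points.shape[0], 1, device=points.device)
    ph = torch.cat([points, ones], dim=1)                # (N, 4) 齐次坐标
    vals = torch.einsum('ni,kij,nj->nk', ph, Q, ph)      # (N, K) 每个基元的 x^T Q x
    return vals.min(dim=1).values                        # 取并集(逐点最小值)
```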

【3】 3D Shapes Local Geometry Codes Learning with SDF 标题:用SDF学习3D Shape局部几何编码 链接:https://arxiv.org/abs/2108.08593

作者:Shun Yao,Fei Yang,Yongmei Cheng,Mikhail G. Mozerov 机构:School of Automation, Northwestern Polytechnical University, Xi'an, China, Computer Vision Center, Universitat Autonoma de Barcelona, Barcelona, Spain 备注:DLGC workshop in ICCV 2021 摘要:符号距离函数(SDF)作为三维形状描述是表示三维几何体以进行渲染和重建的最有效方法之一。我们的工作受到最先进的DeepSDF方法的启发,该方法将3D形状作为其外壳的等值面进行学习和分析,并在3D形状重建和压缩领域显示出良好的效果。在本文中,我们研究了DeepSDF模型因容量受限而导致的重建退化问题,该模型用一个神经网络和单个潜码来逼近SDF。我们提出了局部几何编码学习(LGCL),这是一种通过学习完整3D形状的局部形状几何来改进原始DeepSDF结果的模型。我们添加了一个额外的图神经网络,将单个可传输的潜码分割成一组分布在三维形状上的局部潜码。这些局部潜码用于在其各自的局部区域内近似SDF,与原始DeepSDF相比降低了近似的复杂性。此外,我们还引入了一个新的几何损失函数来促进这些局部潜码的训练。请注意,其他局部形状调整方法使用三维体素表示,而这类表示本身就带来极难甚至无法求解的问题。相比之下,我们的体系结构隐式地基于图处理,直接在潜码空间中执行学习回归过程,从而使所提出的体系结构更加灵活且易于实现。我们在三维形状重建上的实验表明,LGCL方法可以用显著更小的SDF解码器保留更多细节,并且在最重要的定量指标上大大优于原始的DeepSDF方法。 摘要:A signed distance function (SDF) as the 3D shape description is one of the most effective approaches to represent 3D geometry for rendering and reconstruction. Our work is inspired by the state-of-the-art method DeepSDF that learns and analyzes the 3D shape as the iso-surface of its shell and this method has shown promising results especially in the 3D shape reconstruction and compression domain. In this paper, we consider the degeneration problem of reconstruction coming from the capacity decrease of the DeepSDF model, which approximates the SDF with a neural network and a single latent code. We propose Local Geometry Code Learning (LGCL), a model that improves the original DeepSDF results by learning from a local shape geometry of the full 3D shape. We add an extra graph neural network to split the single transmittable latent code into a set of local latent codes distributed on the 3D shape. Mentioned latent codes are used to approximate the SDF in their local regions, which will alleviate the complexity of the approximation compared to the original DeepSDF. Furthermore, we introduce a new geometric loss function to facilitate the training of these local latent codes. Note that other local shape adjusting methods use the 3D voxel representation, which in turn is a problem highly difficult to solve or even insolvable. In contrast, our architecture is based on graph processing implicitly and performs the learning regression process directly in the latent code space, thus making the proposed architecture more flexible and also simple to implement. Our experiments on 3D shape reconstruction demonstrate that our LGCL method can keep more details with a significantly smaller size of the SDF decoder and outperforms considerably the original DeepSDF method under the most important quantitative metrics.
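下面给出一个简化的解码器示意,体现"用分布在形状上的局部潜码在各自局部区域内近似SDF"的想法;最近邻选码、网络结构与维度均为假设,论文中局部潜码的生成与分布由图神经网络完成。

```python
import torch
import torch.nn as nn

class LocalLatentSDFDecoder(nn.Module):
    """示意:用分布在形状上的局部潜码近似 SDF(结构为假设,非论文原网络)。"""
    def __init__(self, code_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(code_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, query, anchors, codes):
        """query: (N, 3) 查询点;anchors: (M, 3) 局部潜码所在位置;codes: (M, C) 局部潜码"""
        d = torch.cdist(query, anchors)              # (N, M) 查询点到各潜码位置的距离
        idx = d.argmin(dim=1)                        # 每个查询点选取最近的局部潜码
        local = codes[idx]                           # (N, C)
        rel = query - anchors[idx]                   # 相对坐标,把近似限制在局部区域内
        return self.mlp(torch.cat([local, rel], dim=1)).squeeze(-1)
```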

其他神经网络|深度学习|模型|建模(3篇)

【1】 Learning to Match Features with Seeded Graph Matching Network 标题:基于种子图匹配网络的特征匹配学习 链接:https://arxiv.org/abs/2108.08771

作者:Hongkai Chen,Zixin Luo,Jiahui Zhang,Lei Zhou,Xuyang Bai,Zeyu Hu,Chiew-Lan Tai,Long Quan 机构:Hong Kong University of Science and Technology, Tsinghua University 备注:Accepted by ICCV2021, code to be released at this https URL 摘要:跨图像匹配局部特征是计算机视觉中的一个基本问题。为了兼顾高精度和高效率,我们提出了种子图匹配网络,这是一种稀疏结构的图神经网络,用于减少冗余连接并学习紧凑表示。该网络由以下部分组成:1)种子模块,通过生成一小组可靠的匹配作为种子来初始化匹配;2)种子图神经网络,利用种子匹配在图像内/图像间传递消息,并预测分配代价。我们提出了三种新的操作作为消息传递的基本元素:1)注意力池化,将图像内的关键点特征聚合到种子匹配上;2)种子过滤,增强种子特征并在图像间交换消息;3)注意力反池化,将种子特征传播回原始关键点。实验表明,与典型的基于注意力的网络相比,该方法显著降低了计算和内存复杂度,同时获得了相当甚至更高的性能。 摘要:Matching local features across images is a fundamental problem in computer vision. Targeting towards high accuracy and efficiency, we propose Seeded Graph Matching Network, a graph neural network with sparse structure to reduce redundant connectivity and learn compact representation. The network consists of 1) Seeding Module, which initializes the matching by generating a small set of reliable matches as seeds. 2) Seeded Graph Neural Network, which utilizes seed matches to pass messages within/across images and predicts assignment costs. Three novel operations are proposed as basic elements for message passing: 1) Attentional Pooling, which aggregates keypoint features within the image to seed matches. 2) Seed Filtering, which enhances seed features and exchanges messages across images. 3) Attentional Unpooling, which propagates seed features back to original keypoints. Experiments show that our method reduces computational and memory complexity significantly compared with typical attention-based networks while competitive or higher performance is achieved.
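以其中的"注意力池化"为例,下面用单头注意力给出一个最小示意:以种子匹配特征为query,聚合图像内全部关键点特征;具体公式、归一化与多头设置均为假设。

```python
import torch

def attentional_pooling(seed_feat, kpt_feat):
    """示意:注意力池化,把图像内的关键点特征聚合到种子匹配上(单头,公式为假设)。
    seed_feat: (S, C) 种子特征;kpt_feat: (N, C) 图像内关键点特征
    """
    scale = seed_feat.shape[-1] ** 0.5
    attn = torch.softmax(seed_feat @ kpt_feat.t() / scale, dim=-1)  # (S, N) 注意力权重
    return seed_feat + attn @ kpt_feat                              # 残差式聚合
```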

【2】 Progressive and Selective Fusion Network for High Dynamic Range Imaging 标题:用于高动态范围成像的渐进式和选择性融合网络 链接:https://arxiv.org/abs/2108.08585

作者:Qian Ye,Jun Xiao,Kin-man Lam,Takayuki Okatani 机构:Graduate School of Information Sciences, Sendai, Japan, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China, Tohoku University, RIKEN Center for AIP 摘要:本文考虑从场景的多张LDR图像生成该场景HDR图像的问题。最近的研究采用深度学习,以端到端的方式解决该问题,带来了显著的性能提升。然而,对于手持相机拍摄的动态场景LDR图像,仍然难以生成高质量图像,例如前景物体的大幅运动会造成遮挡,从而产生重影伪影。成功的关键在于能否在特征空间中充分融合输入图像:我们希望在执行HDR图像生成所需的基本计算(例如选择曝光最佳的图像/区域)的同时,去除导致低质量结果的因素。基于两个想法,我们提出了一种能够更好地融合特征的新方法。其一是多步特征融合:我们的网络通过一组结构相同的堆叠模块逐步融合特征。其二是组成模块的设计,它能有效执行对该问题至关重要的两个操作,即比较和选择合适的图像/区域。实验结果表明,在标准基准测试中,该方法优于以往最先进的方法。 摘要:This paper considers the problem of generating an HDR image of a scene from its LDR images. Recent studies employ deep learning and solve the problem in an end-to-end fashion, leading to significant performance improvements. However, it is still hard to generate a good quality image from LDR images of a dynamic scene captured by a hand-held camera, e.g., occlusion due to the large motion of foreground objects, causing ghosting artifacts. The key to success relies on how well we can fuse the input images in their feature space, where we wish to remove the factors leading to low-quality image generation while performing the fundamental computations for HDR image generation, e.g., selecting the best-exposed image/region. We propose a novel method that can better fuse the features based on two ideas. One is multi-step feature fusion; our network gradually fuses the features in a stack of blocks having the same structure. The other is the design of the component block that effectively performs two operations essential to the problem, i.e., comparing and selecting appropriate images/regions. Experimental results show that the proposed method outperforms the previous state-of-the-art methods on the standard benchmark tests.

【3】 Learning Anchored Unsigned Distance Functions with Gradient Direction Alignment for Single-view Garment Reconstruction 标题:基于梯度方向对齐的锚定无符号距离函数学习在单视图服装重建中的应用 链接:https://arxiv.org/abs/2108.08478

作者:Fang Zhao,Wenhao Wang,Shengcai Liao,Ling Shao 机构:Inception Institute of Artificial Intelligence, ReLER, University of Technology Sydney, Mohamed bin Zayed University of Artificial Intelligence 备注:ICCV 2021 摘要:近年来,得益于深度形状表示,单视图三维重建取得了重大进展,但由于曲面开放、拓扑结构多样、几何细节复杂等原因,服装三维重建问题一直没有得到很好的解决。在本文中,我们提出了一种新的可学习的锚定无符号距离函数(AnchorUDF)表示方法,用于从单幅图像重建三维服装。AnchorUDF通过预测无符号距离场(UDF)来表示3D形状,以实现任意分辨率的开放式服装曲面建模。为了捕捉多样的服装拓扑,AnchorUDF不仅计算查询点的像素对齐局部图像特征,还利用位于曲面周围的一组锚点来丰富查询点的三维位置特征,为距离函数提供更强的三维空间上下文。此外,为了在推理时获得更精确的点投影方向,我们在训练过程中将AnchorUDF的空间梯度方向与指向曲面的真实方向显式对齐。在MGN和Deep Fashion3D两个公共3D服装数据集上进行的大量实验表明,AnchorUDF在单视图服装重建上达到了最先进的性能。 摘要:While single-view 3D reconstruction has made significant progress benefiting from deep shape representations in recent years, garment reconstruction is still not solved well due to open surfaces, diverse topologies and complex geometric details. In this paper, we propose a novel learnable Anchored Unsigned Distance Function (AnchorUDF) representation for 3D garment reconstruction from a single image. AnchorUDF represents 3D shapes by predicting unsigned distance fields (UDFs) to enable open garment surface modeling at arbitrary resolution. To capture diverse garment topologies, AnchorUDF not only computes pixel-aligned local image features of query points, but also leverages a set of anchor points located around the surface to enrich 3D position features for query points, which provides stronger 3D space context for the distance function. Furthermore, in order to obtain more accurate point projection direction at inference, we explicitly align the spatial gradient direction of AnchorUDF with the ground-truth direction to the surface during training. Extensive experiments on two public 3D garment datasets, i.e., MGN and Deep Fashion3D, demonstrate that AnchorUDF achieves the state-of-the-art performance on single-view garment reconstruction.
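针对"将UDF的空间梯度方向与指向曲面的真实方向对齐"这一训练目标,下面给出一种基于自动求导的损失示意;损失形式与梯度符号约定均为假设,仅用于说明做法。

```python
import torch
import torch.nn.functional as F

def gradient_alignment_loss(udf_net, query, gt_dir):
    """示意:把预测 UDF 对查询点的空间梯度方向与指向曲面的真实方向对齐(损失形式为假设)。
    udf_net: 输入 (N, 3) 点、输出 (N,) 无符号距离的网络;gt_dir: (N, 3) 单位化的指向曲面方向
    """
    query = query.clone().requires_grad_(True)
    d = udf_net(query)                                                 # (N,) 预测的无符号距离
    grad = torch.autograd.grad(d.sum(), query, create_graph=True)[0]   # (N, 3) 空间梯度
    grad_dir = F.normalize(grad, dim=1)
    # UDF 的梯度大致指向远离曲面的一侧,故与 -gt_dir 对齐;符号约定为假设
    return (1 - F.cosine_similarity(grad_dir, -gt_dir, dim=1)).mean()
```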

其他(7篇)

【1】 Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing 标题:Neural-GIF:用于服装人体动画的神经广义隐函数 链接:https://arxiv.org/abs/2108.08807

作者:Garvita Tiwari,Nikolaos Sarafianos,Tony Tung,Gerard Pons-Moll 机构:University of Tübingen, Germany, Max Planck Institute for Informatics, Saarland Informatics Campus, Germany, Facebook Reality Labs, Sausalito, USA 摘要:我们提出了神经广义隐函数(Neural Generalized Implicit Functions,Neural-GIF),将穿着衣服的人体的动画表示为身体姿态的函数。给定某个对象在不同姿态下的扫描序列,我们学习为新姿态生成该角色的动画。现有方法依赖于基于模板的人体(或衣服)表示。然而,此类模型通常具有固定且有限的分辨率,需要繁琐的数据预处理步骤,并且无法用于复杂的服装。我们从基于模板的方法中获得启发,这类方法将运动分解为关节运动和非刚性变形,而我们将这一思想推广到隐式形状学习中,以获得更灵活的模型。我们学习将空间中的每个点映射到规范空间,并在求取符号距离场之前应用学习到的变形场来建模非刚性效应。我们的公式允许学习服装和软组织的复杂非刚性变形,而无需像现有方法那样计算模板配准。Neural-GIF可以在原始3D扫描上进行训练,并重建细节丰富的复杂表面几何和变形。此外,该模型还能泛化到新的姿态。我们在来自不同公共数据集、穿着不同服装风格的多个角色上评估了我们的方法,在定量和定性上都比基线方法有显著改进。我们还将模型扩展到多形状设置。为了促进进一步研究,我们将在以下网站公开模型、代码和数据:https://virtualhumans.mpi-inf.mpg.de/neuralgif/ 摘要:We present Neural Generalized Implicit Functions(Neural-GIF), to animate people in clothing as a function of the body pose. Given a sequence of scans of a subject in various poses, we learn to animate the character for new poses. Existing methods have relied on template-based representations of the human body (or clothing). However such models usually have fixed and limited resolutions, require difficult data pre-processing steps and cannot be used with complex clothing. We draw inspiration from template-based methods, which factorize motion into articulation and non-rigid deformation, but generalize this concept for implicit shape learning to obtain a more flexible model. We learn to map every point in the space to a canonical space, where a learned deformation field is applied to model non-rigid effects, before evaluating the signed distance field. Our formulation allows the learning of complex and non-rigid deformations of clothing and soft tissue, without computing a template registration as it is common with current approaches. Neural-GIF can be trained on raw 3D scans and reconstructs detailed complex surface geometry and deformations. Moreover, the model can generalize to new poses. We evaluate our method on a variety of characters from different public datasets in diverse clothing styles and show significant improvements over baseline methods, quantitatively and qualitatively. We also extend our model to multiple shape setting. To stimulate further research, we will make the model, code and data publicly available at: https://virtualhumans.mpi-inf.mpg.de/neuralgif/
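下面用一个极简的前向过程示意Neural-GIF的整体思路:先把查询点映射到规范空间,再叠加姿态相关的非刚性变形,最后查询符号距离场;各子网络的结构、维度与条件化方式均为假设。

```python
import torch
import torch.nn as nn

class NeuralGIFSketch(nn.Module):
    """示意:规范空间映射 + 非刚性变形场 + 规范空间 SDF(各子网结构均为假设,非论文原网络)。"""
    def __init__(self, pose_dim=72, hidden=256):
        super().__init__()
        mlp = lambda i, o: nn.Sequential(nn.Linear(i, hidden), nn.ReLU(), nn.Linear(hidden, o))
        self.canonical_map = mlp(3 + pose_dim, 3)   # 查询点 + 姿态 -> 规范空间坐标
        self.deform_field = mlp(3 + pose_dim, 3)    # 规范坐标 + 姿态 -> 非刚性位移
        self.sdf = mlp(3, 1)                        # 规范空间中的符号距离场

    def forward(self, x, pose):
        """x: (N, 3) 姿态空间中的查询点;pose: (P,) 姿态参数"""
        p = pose.unsqueeze(0).expand(x.shape[0], -1)
        x_c = self.canonical_map(torch.cat([x, p], dim=1))
        x_c = x_c + self.deform_field(torch.cat([x_c, p], dim=1))  # 建模衣物/软组织的非刚性效应
        return self.sdf(x_c).squeeze(-1)
```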

【2】 Image Inpainting using Partial Convolution 标题:利用部分卷积进行图像修复 链接:https://arxiv.org/abs/2108.08791

作者:Harsh Patel,Amey Kulkarni,Shivam Sahni,Udit Vyas 机构:Indian Institute of Technology Gandhinagar, India 摘要:图像修复是图像处理领域的一个热门课题,在计算机视觉中有着广泛的应用。在各种实际应用中,图像常常因包含损坏、丢失或不需要的信息而受噪声影响发生退化。过去已有多种修复技术用于处理此类问题,既有经典方法也有深度学习方法。一些传统方法利用邻近的已知像素填补缺失像素,或对其使用滑动平均来恢复图像。本文的目的是使用基于部分卷积层的鲁棒深度学习方法进行图像修复。 摘要:Image Inpainting is one of the very popular tasks in the field of image processing with broad applications in computer vision. In various practical applications, images are often deteriorated by noise due to the presence of corrupted, lost, or undesirable information. There have been various restoration techniques used in the past with both classical and deep learning approaches for handling such issues. Some traditional methods include image restoration by filling gap pixels using the nearby known pixels or using the moving average over the same. The aim of this paper is to perform image inpainting using robust deep learning methods that use partial convolution layers.
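部分卷积层的核心做法是只对掩码内的有效像素做卷积,并按窗口内有效像素的比例重新归一化,同时更新掩码。下面给出一个遵循这一定义的简化实现示意,超参数与实现细节为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """示意:部分卷积层,只用掩码内有效像素做卷积,并按有效比例重新归一化、更新掩码。"""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=True)
        # 全 1 卷积核,用于统计每个窗口内的有效像素数(不参与训练)
        self.register_buffer('ones', torch.ones(1, in_ch, kernel_size, kernel_size))
        self.padding = padding

    def forward(self, x, mask):
        """x: (B, C, H, W) 缺损图像;mask: (B, 1, H, W),1 表示有效像素"""
        with torch.no_grad():
            valid = F.conv2d(mask.expand_as(x), self.ones, padding=self.padding)  # 窗口内有效像素数
        out = self.conv(x * mask)
        scale = self.ones.sum() / valid.clamp(min=1e-8)        # 按有效像素比例重新归一化
        bias = self.conv.bias.view(1, -1, 1, 1)
        out = (out - bias) * scale * (valid > 0).float() + bias
        new_mask = (valid > 0).float()                         # 窗口内只要有有效像素即视为已填充
        return out, new_mask
```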

【3】 How to cheat with metrics in single-image HDR reconstruction 标题:如何在单幅图像HDR重建中欺骗度量 链接:https://arxiv.org/abs/2108.08713

作者:Gabriel Eilertsen,Saghi Hajisharif,Param Hanji,Apostolia Tsirikoglou,Rafal K. Mantiuk,Jonas Unger 机构:Dept. of Science and Technology, Linköping University, Sweden, Dept. of Computer Science and Technology, University of Cambridge, UK 备注:ICCV 2021 workshop on Learning for Computational Imaging (LCI) 摘要:单图像高动态范围(SI-HDR)重建是一个非常适合深度学习方法的问题。每一项后续技术都通过报告更高的图像质量分数来宣称对现有方法的改进。然而,本文强调,客观指标上的这些提升并不一定意味着视觉上更优的图像。第一个问题是各篇论文在数据和度量参数方面使用了不同的评估条件,因此需要一个标准化的协议,以便能够在论文之间进行比较。第二个问题是评估SI-HDR重建本身的固有困难,这也是本文的主要关注点,因为重建问题的某些方面会主导客观差异,从而引入偏差。在这里,我们使用现有方法和模拟的SI-HDR方法复现了一个典型的评估流程,以展示问题的不同方面如何影响客观质量指标。令人惊讶的是,我们发现即使是完全不重建HDR信息的方法也能与最先进的深度学习方法相竞争。我们说明了为何这样的结果并不能代表感知质量,并指出SI-HDR重建需要更好的评估协议。 摘要:Single-image high dynamic range (SI-HDR) reconstruction has recently emerged as a problem well-suited for deep learning methods. Each successive technique demonstrates an improvement over existing methods by reporting higher image quality scores. This paper, however, highlights that such improvements in objective metrics do not necessarily translate to visually superior images. The first problem is the use of disparate evaluation conditions in terms of data and metric parameters, calling for a standardized protocol to make it possible to compare between papers. The second problem, which forms the main focus of this paper, is the inherent difficulty in evaluating SI-HDR reconstructions since certain aspects of the reconstruction problem dominate objective differences, thereby introducing a bias. Here, we reproduce a typical evaluation using existing as well as simulated SI-HDR methods to demonstrate how different aspects of the problem affect objective quality metrics. Surprisingly, we found that methods that do not even reconstruct HDR information can compete with state-of-the-art deep learning methods. We show how such results are not representative of the perceived quality and that SI-HDR reconstruction needs better evaluation protocols.

【4】 Contrastive Language-Image Pre-training for the Italian Language 标题:面向意大利语的对比语言-图像预训练 链接:https://arxiv.org/abs/2108.08688

作者:Federico Bianchi,Giuseppe Attanasio,Raphael Pisoni,Silvia Terragni,Gabriele Sarti,Sri Lakshmi 机构:Bocconi University, Milan, Italy, Politecnico di Torino, Turin, Italy, Independent Researcher, Vienna, Austria, University of Milano-Bicocca, University of Groningen, Groningen, The Netherlands, Chennai, India 摘要:CLIP(对比语言-图像预训练)是一种最新的多模态模型,它联合学习图像和文本的表示。该模型基于大量英文数据进行训练,在Zero-Shot分类任务中表现出令人印象深刻的性能。在另一种语言上训练同一个模型并非易事,因为其他语言的数据可能不足,并且模型需要高质量的文本翻译才能保证良好的性能。在本文中,我们提出了首个意大利语CLIP模型(CLIP-Italian),该模型在140多万个图像-文本对上进行训练。结果表明,CLIP-Italian在图像检索和Zero-Shot分类任务上优于多语言CLIP模型。 摘要:CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since data in other languages might be not enough and the model needs high-quality translations of the texts to guarantee a good performance. In this paper, we present the first CLIP model for the Italian Language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.

【5】 Towards Controllable and Photorealistic Region-wise Image Manipulation 标题:迈向可控且照片级真实的区域级图像操纵 链接:https://arxiv.org/abs/2108.08674

作者:Ansheng You,Chenglin Zhou,Qixuan Zhang,Lan Xu 机构:Peking University, Beijing, China, ShanghaiTech University, Shanghai, China 备注:None 摘要:自适应且灵活的图像编辑是现代生成模型的理想功能。在这项工作中,我们提出了一个采用自编码器架构的生成模型,用于逐区域的风格操纵。我们应用编码一致性损失来强制内容与风格潜在表示之间的显式解耦,使生成样本的内容和风格与其对应的内容参考和风格参考保持一致。模型还受到内容对齐损失的约束,以确保前景编辑不会干扰背景内容。因此,给定用户提供的感兴趣区域掩码,我们的模型支持前景区域的风格迁移。特别地,除自监督外,我们的模型不需要任何额外标注(如语义标签)。大量实验表明了该方法的有效性,并展示了该模型在不同应用中的灵活性,包括区域风格编辑、潜在空间插值、跨域风格迁移。 摘要:Adaptive and flexible image editing is a desirable function of modern generative models. In this work, we present a generative model with auto-encoder architecture for per-region style manipulation. We apply a code consistency loss to enforce an explicit disentanglement between content and style latent representations, making the content and style of generated samples consistent with their corresponding content and style references. The model is also constrained by a content alignment loss to ensure the foreground editing will not interfere background contents. As a result, given interested region masks provided by users, our model supports foreground region-wise style transfer. Specifically, our model receives no extra annotations such as semantic labels except for self-supervision. Extensive experiments show the effectiveness of the proposed method and exhibit the flexibility of the proposed model for various applications, including region-wise style editing, latent space interpolation, cross-domain style transfer.

【6】 A Unified Objective for Novel Class Discovery 标题:新类发现的统一目标 链接:https://arxiv.org/abs/2108.08536

作者:Enrico Fini,Enver Sangineto,Stéphane Lathuilière,Zhun Zhong,Moin Nabi,Elisa Ricci 机构:Stéphane Lathuilière, University of Trento, Trento, Italy, LTCI, Télécom Paris, Institut Polytechnique de Paris, France, SAP AI Research, Berlin, Germany, Fondazione Bruno Kessler, Trento, Italy 备注:ICCV 2021 (Oral) 摘要:本文研究了新类发现(NCD)问题。NCD旨在利用包含不同但相关类别的标注集的先验知识,在未标注集中推断新的对象类别。现有方法通过考虑多个目标函数来解决该问题,通常分别为标注样本和未标注样本设计专门的损失项,并且往往还需要辅助的正则化项。在本文中,我们摒弃了这种传统方案,引入了一个统一的目标函数(UNO)来发现新类,其明确目的是促进有监督学习与无监督学习之间的协同。通过多视图自标注策略,我们生成可与真实标签同等对待的伪标签,从而在已知类和未知类上得到单一的分类目标。尽管方法简单,UNO在多个基准上都以显著优势超过现有最佳方法(在CIFAR-100上约10%,在ImageNet上8%)。项目页面:https://ncd-uno.github.io 摘要:In this paper, we study the problem of Novel Class Discovery (NCD). NCD aims at inferring novel object categories in an unlabeled set by leveraging from prior knowledge of a labeled set containing different, but related classes. Existing approaches tackle this problem by considering multiple objective functions, usually involving specialized loss terms for the labeled and the unlabeled samples respectively, and often requiring auxiliary regularization terms. In this paper, we depart from this traditional scheme and introduce a UNified Objective function (UNO) for discovering novel classes, with the explicit purpose of favoring synergy between supervised and unsupervised learning. Using a multi-view self-labeling strategy, we generate pseudo-labels that can be treated homogeneously with ground truth labels. This leads to a single classification objective operating on both known and unknown classes. Despite its simplicity, UNO outperforms the state of the art by a significant margin on several benchmarks (~ 10% on CIFAR-100 and 8% on ImageNet). The project page is available at: https://ncd-uno.github.io
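下面用几行代码示意"已知类与新类共用单一分类目标、伪标签与真实标签同等对待"的统一目标;其中伪标签直接取另一视图的softmax预测,仅作简化示意,论文中另有专门的伪标签生成策略。

```python
import torch
import torch.nn.functional as F

def unified_objective(logits_v1, logits_v2, labels, is_labeled):
    """示意:把已知类与新类拼接成同一个分类头,标注样本用真实标签、未标注样本用另一视图的
    预测作为伪标签,统一到一个交叉熵目标中(伪标签生成方式为简化假设)。
    logits_v1/v2: (B, C) 两个增广视图在拼接头(已知类+新类)上的 logits
    labels: (B,) 真实类别,未标注样本可填 -1;is_labeled: (B,) 布尔掩码
    """
    with torch.no_grad():
        pseudo = F.softmax(logits_v2, dim=1)                        # 视图 2 的预测作为视图 1 的软目标
    onehot = F.one_hot(labels.clamp(min=0), logits_v1.shape[1]).float()
    targets = torch.where(is_labeled.unsqueeze(1), onehot, pseudo)  # 真实标签与伪标签同等对待
    return -(targets * F.log_softmax(logits_v1, dim=1)).sum(dim=1).mean()
```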

【7】 Revisiting Binary Local Image Description for Resource Limited Devices 标题:重新审视面向资源受限设备的二值局部图像描述 链接:https://arxiv.org/abs/2108.08380

作者:Iago Suárez,José M. Buenaposada,Luis Baumela 摘要:各类资源受限设备的出现给计算机视觉算法的设计带来了新的挑战,需要在精度与计算开销之间做出明确的权衡。在本文中,我们提出了新的二值图像描述符,它们是将三元组排序损失、困难负样本挖掘和锚点交换应用于基于像素差分和图像梯度的传统特征而得到的。这些描述符,即BAD(Box Average Difference)和HashSIFT,在最先进的精度-资源权衡曲线上建立了新的工作点。在实验中,我们评估了所提描述符的精度、运行时间和能耗。我们表明,BAD拥有文献中速度最快的描述符实现,而HashSIFT的精度接近最优的基于深度学习的描述符,同时计算效率更高。我们已公开源代码。 摘要:The advent of a panoply of resource limited devices opens up new challenges in the design of computer vision algorithms with a clear compromise between accuracy and computational requirements. In this paper we present new binary image descriptors that emerge from the application of triplet ranking loss, hard negative mining and anchor swapping to traditional features based on pixel differences and image gradients. These descriptors, BAD (Box Average Difference) and HashSIFT, establish new operating points in the state-of-the-art's accuracy vs. resources trade-off curve. In our experiments we evaluate the accuracy, execution time and energy consumption of the proposed descriptors. We show that BAD bears the fastest descriptor implementation in the literature while HashSIFT approaches in accuracy that of the top deep learning-based descriptors, being computationally more efficient. We have made the source code public.
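下面给出三元组排序损失结合批内困难负样本挖掘与锚点交换的一个通用示意;这与论文训练BAD/HashSIFT时的具体实现无关,细节为假设。

```python
import torch
import torch.nn.functional as F

def triplet_loss_hard_negative(anchor, positive, margin=1.0):
    """示意:批内困难负样本挖掘 + 锚点交换的三元组排序损失(细节为假设,非论文原式)。
    anchor, positive: (B, D) 成对的描述符,anchor[i] 与 positive[i] 构成正样本对
    """
    dist = torch.cdist(anchor, positive)                         # (B, B) 两两距离矩阵
    pos = dist.diag()                                            # 正样本对距离
    off = dist + torch.eye(len(dist), device=dist.device) * 1e6  # 屏蔽对角线(正样本对)
    hard_neg_a = off.min(dim=1).values                           # anchor 视角的最难负样本
    hard_neg_p = off.min(dim=0).values                           # 锚点交换:positive 视角的最难负样本
    hard_neg = torch.minimum(hard_neg_a, hard_neg_p)
    return F.relu(margin + pos - hard_neg).mean()
```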
