计算机视觉与模式识别学术速递[11.11]

2021-11-17 11:02:04

cs.CV 方向,今日共计39篇

Transformer(3篇)

【1】 Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation 标题:用于视觉与语言导航的变长记忆多模态Transformer 链接:https://arxiv.org/abs/2111.05759

作者:Chuang Lin,Yi Jiang,Jianfei Cai,Lizhen Qu,Gholamreza Haffari,Zehuan Yuan 机构:Dept of Data Science and AI, Monash University, ByteDance Inc. 摘要:视觉与语言导航(VLN)要求智能体按照语言指令导航到目标位置,并依赖移动过程中与环境的持续交互。最近,得益于多模态交叉注意力机制在视觉观察与语言指令之间建立的直接联系,基于Transformer的VLN方法取得了很大进展。然而,这些方法通常使用LSTM解码器,或使用手工设计的隐藏状态构建循环Transformer,从而将时间上下文表示为固定长度的向量。考虑到单个固定长度向量通常不足以捕获长期的时间上下文,本文通过显式建模时间上下文,提出了用于视觉引导自然语言导航的变长记忆多模态Transformer(MTVM)。具体来说,MTVM将以往时间步的激活直接存入记忆库,使智能体能够跟踪导航轨迹。为进一步提升性能,我们提出了一种记忆感知一致性损失,借助随机掩蔽的指令学习更好的时间上下文联合表示。我们在常用的R2R和CVDN数据集上评估了MTVM:模型在R2R未见验证集和测试集上的成功率各提高了2%,并在CVDN测试集上将Goal Process减少了1.6米。 摘要:Vision-and-Language Navigation (VLN) is a task that an agent is required to follow a language instruction to navigate to the goal position, which relies on the ongoing interactions with the environment during moving. Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction via the multimodal cross-attention mechanism. However, these methods usually represent temporal context as a fixed-length vector by using an LSTM decoder or using manually designed hidden states to build a recurrent Transformer. Considering a single fixed-length vector is often insufficient to capture long-term temporal context, in this paper, we introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation by modelling the temporal context explicitly. Specifically, MTVM enables the agent to keep track of the navigation trajectory by directly storing previous activations in a memory bank. To further boost the performance, we propose a memory-aware consistency loss to help learn a better joint representation of temporal context with random masked instructions. We evaluate MTVM on popular R2R and CVDN datasets, and our model improves Success Rate on R2R unseen validation and test set by 2% each, and reduce Goal Process by 1.6m on CVDN test set.
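下面给出一个极简的PyTorch示意(非论文官方实现,网络结构、维度与超参数均为本速递为便于说明而作的假设),用于演示摘要中"将以往时间步的激活直接存入记忆库,再与当前多模态特征一起做交叉注意力"的思路:

```python
import torch
import torch.nn as nn

class VariableLengthMemory(nn.Module):
    """极简示意:把历史激活存入记忆库,与当前观察、指令一起做注意力(结构为假设)。"""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory = []  # 变长记忆库:保存以往各步的激活

    def forward(self, obs_tokens, instr_tokens):
        # obs_tokens: (B, N_obs, dim) 当前视觉观察; instr_tokens: (B, N_txt, dim) 指令
        if self.memory:
            mem = torch.cat(self.memory, dim=1)                 # (B, N_mem, dim)
            context = torch.cat([instr_tokens, mem, obs_tokens], dim=1)
        else:
            context = torch.cat([instr_tokens, obs_tokens], dim=1)
        out, _ = self.attn(obs_tokens, context, context)        # 以当前观察为 query
        self.memory.append(out.detach())                        # 将当前激活写入记忆库
        return out

# 用法示意:逐个时间步调用,记忆库长度随导航步数增长
model = VariableLengthMemory()
for step in range(3):
    obs = torch.randn(2, 5, 256)
    instr = torch.randn(2, 12, 256)
    feat = model(obs, instr)
print(feat.shape)  # torch.Size([2, 5, 256])
```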

【2】 CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval 标题:CLIP2TV:基于Transformer的视频文本检索方法的实证研究 链接:https://arxiv.org/abs/2111.05610

作者:Zijian Gao,Jingyu Liu,Sheng Chen,Dedan Chang,Hao Zhang,Jinwei Yuan 机构:OVBU, Tencent PCG 备注:Tech Report 摘要:现代视频文本检索框架主要由三部分组成:视频编码器、文本编码器和相似度头。随着视觉和文本表征学习的成功,基于Transformer的编码器和融合方法也被应用于视频文本检索领域。在本报告中,我们介绍了CLIP2TV,旨在探索基于Transformer的方法中的关键要素。为此,我们首先回顾了最近关于多模态学习的一些工作,然后将若干技术引入视频文本检索,最后通过在不同配置下的大量实验对它们进行评估。值得注意的是,CLIP2TV在MSR-VTT数据集上取得了52.9@R1的成绩,比之前的SOTA结果高出4.1%。 摘要:Modern video-text retrieval frameworks basically consist of three parts: video encoder, text encoder and the similarity head. With the success on both visual and textual representation learning, transformer based encoders and fusion methods have also been adopted in the field of video-text retrieval. In this report, we present CLIP2TV, aiming at exploring where the critical elements lie in transformer based methods. To achieve this, We first revisit some recent works on multi-modal learning, then introduce some techniques into video-text retrieval, finally evaluate them through extensive experiments in different configurations. Notably, CLIP2TV achieves 52.9@R1 on MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
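摘要中的 52.9@R1 指检索指标 Recall@1。下面是一个与论文实现无关、仅作说明的 Recall@K 计算示意(假设已得到文本-视频相似度矩阵,且第 i 个文本与第 i 个视频互为配对):

```python
import numpy as np

def recall_at_k(sim, k=1):
    """sim: (N_text, N_video) 相似度矩阵,假设第 i 个文本与第 i 个视频配对。"""
    ranks = np.argsort(-sim, axis=1)                  # 按相似度从高到低排序
    correct = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return correct.mean() * 100                       # 以百分比返回

sim = np.random.randn(100, 100)
np.fill_diagonal(sim, 5.0)                            # 人为让配对项得分最高
print(recall_at_k(sim, k=1))                          # 100.0
```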

【3】 Are Transformers More Robust Than CNNs? 标题:Transformer比CNN更健壮吗? 链接:https://arxiv.org/abs/2111.05464

作者:Yutong Bai,Jieru Mei,Alan Yuille,Cihang Xie 机构:Johns Hopkins University, University of California, Santa Cruz 摘要:Transformer是一种强大的视觉识别工具。除了在广泛的视觉基准上展示出有竞争力的性能外,最近的研究还认为Transformer比卷积神经网络(CNN)更稳健。然而,令人惊讶的是,我们发现这些结论是从不公平的实验设置中得出的:Transformer和CNN在不同的规模下进行比较,并采用了不同的训练框架。在本文中,我们旨在提供Transformer和CNN之间的首次公平而深入的比较,重点是稳健性评估。通过我们统一的训练设置,我们首先挑战了之前的信念,即在衡量对抗稳健性时Transformer优于CNN。更令人惊讶的是,我们发现,如果CNN正确地采用Transformer的训练方法,它们在防御对抗攻击方面可以很容易地与Transformer一样稳健。关于分布外样本的泛化,我们表明(外部)大规模数据集上的预训练并不是使Transformer获得比CNN更好性能的基本要求。此外,我们的消融实验表明,这种更强的泛化在很大程度上得益于Transformer类自注意力的架构本身,而不是其他训练设置。我们希望这项工作能够帮助社区更好地理解Transformer和CNN的稳健性并对其进行基准测试。代码和模型公开于 https://github.com/ytongbai/ViTs-vs-CNNs。 摘要:Transformer emerges as a powerful tool for visual recognition. In addition to demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutions Neural Networks (CNNs). Nonetheless, surprisingly, we find these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and are applied with distinct training frameworks. In this paper, we aim to provide the first fair & in-depth comparisons between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can easily be as robust as Transformers on defending against adversarial attacks, if they properly adopt Transformers' training recipes. While regarding generalization on out-of-distribution samples, we show pre-training on (external) large-scale datasets is not a fundamental request for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest such stronger generalization is largely benefited by the Transformer's self-attention-like architectures per se, rather than by other training setups. We hope this work can help the community better understand and benchmark the robustness of Transformers and CNNs. The code and models are publicly available at https://github.com/ytongbai/ViTs-vs-CNNs.
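评估对抗稳健性通常会用到 PGD 这类攻击;下面给出一个通用的 PGD 攻击极简示意(与该论文的具体评测设置无关,eps、步数等超参数均为示例取值):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4/255, alpha=1/255, steps=10):
    """极简 PGD 攻击示意:在 L∞ 球内迭代寻找使分类损失最大的扰动。x 取值应在 [0,1]。"""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()            # 沿梯度符号方向走一步
        x_adv = x.detach() + torch.clamp(x_adv - x, -eps, eps)  # 投影回 eps 球
        x_adv = torch.clamp(x_adv, 0.0, 1.0)                    # 保持像素取值合法
    return x_adv

# 用法示意:用一个占位线性分类器演示调用方式
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
print((x_adv - x).abs().max().item() <= 4 / 255 + 1e-6)  # True:扰动被限制在 eps 内
```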

检测相关(2篇)

【1】 Automated Pulmonary Embolism Detection from CTPA Images Using an End-to-End Convolutional Neural Network 标题:基于端到端卷积神经网络的CTPA图像肺栓塞自动检测 链接:https://arxiv.org/abs/2111.05506

作者:Yi Lin,Jianchao Su,Xiang Wang,Xiang Li,Jingen Liu,Kwang-Ting Cheng,Xin Yang 机构:Huazhong University of Science and Technology, China, The Central Hospital of Wuhan, China, JD AI Research, Mountain View, USA, Hong Kong University of Science and Technology, Hong Kong 备注:Accepted to MICCAI 2019 摘要:在CT肺动脉造影(CTPA)图像上自动检测肺栓塞(PEs)的方法有很高的要求。现有方法通常采用单独的步骤来检测PE候选和去除假阳性,而不考虑其他步骤的能力。因此,为了获得可接受的灵敏度,大多数现有方法通常存在较高的假阳性率。本研究提出了一种端到端可训练的卷积神经网络(CNN),其中两个步骤被联合优化。提出的CNN由三个子网组成:1)一个新的3D候选提议网络,用于检测包含可疑PEs的立方体;2)一个3D空间变换子网,用于生成候选的固定大小血管对齐图像表示,以及3)二维分类网络,该网络以变换立方体的三个横截面作为输入,并消除误报。我们使用PE挑战中的20 CTPA测试数据集对我们的方法进行了评估,在0毫米、2毫米和5毫米定位误差下,每体积2个假阳性时,灵敏度分别为78.9%、80.7%和80.7%,优于最先进的方法。我们在自己的数据集上进一步评估了我们的系统,该数据集由129个CTPA数据和269个栓子组成。在0毫米、2毫米和5毫米定位误差下,我们的系统在每个体积2个误报时的灵敏度分别为63.2%、78.9%和86.8%。 摘要:Automated methods for detecting pulmonary embolisms (PEs) on CT pulmonary angiography (CTPA) images are of high demand. Existing methods typically employ separate steps for PE candidate detection and false positive removal, without considering the ability of the other step. As a result, most existing methods usually suffer from a high false positive rate in order to achieve an acceptable sensitivity. This study presents an end-to-end trainable convolutional neural network (CNN) where the two steps are optimized jointly. The proposed CNN consists of three concatenated subnets: 1) a novel 3D candidate proposal network for detecting cubes containing suspected PEs, 2) a 3D spatial transformation subnet for generating fixed-sized vessel-aligned image representation for candidates, and 3) a 2D classification network which takes the three cross-sections of the transformed cubes as input and eliminates false positives. We have evaluated our approach using the 20 CTPA test dataset from the PE challenge, achieving a sensitivity of 78.9%, 80.7% and 80.7% at 2 false positives per volume at 0mm, 2mm and 5mm localization error, which is superior to the state-of-the-art methods. We have further evaluated our system on our own dataset consisting of 129 CTPA data with a total of 269 emboli. Our system achieves a sensitivity of 63.2%, 78.9% and 86.8% at 2 false positives per volume at 0mm, 2mm and 5mm localization error.

【2】 Early Myocardial Infarction Detection over Multi-view Echocardiography 标题:多视角超声心动图检测早期心肌梗死 链接:https://arxiv.org/abs/2111.05790

作者:Aysen Degerli,Serkan Kiranyaz,Tahir Hamid,Rashid Mazhar,Moncef Gabbouj 机构:Tampere University; Department of Electrical Engineering, Qatar University 摘要:心肌梗死(MI)由供养心肌的冠状动脉阻塞引起,是世界范围内的首要死因。心肌梗死的早期诊断和定位可通过促进早期治疗干预减轻心肌损伤的程度。冠状动脉阻塞后,缺血心肌节段的局部室壁运动异常(RWMA)是最早出现的改变。超声心动图是评估RWMA的基本工具。仅从单个超声心动图视图评估左心室(LV)壁的运动可能导致漏诊MI,因为在该特定视图上可能看不到RWMA。因此,在本研究中,我们建议融合心尖四腔(A4C)和心尖二腔(A2C)视图,从而总共可分析11个心肌节段用于MI检测。所提出的方法首先通过主动多项式(APs)估计左室壁的运动,APs提取并跟踪心内膜边界以计算心肌节段位移。随后从A4C和A2C视图的位移中提取特征,将其融合并输入分类器以检测MI。本研究的主要贡献是:1)创建了一个新的基准数据集,共包含260段同时具有A4C和A2C视图的超声心动图记录,并与研究社区公开共享;2)通过基于机器学习的方法改进了先前基于阈值的APs工作的性能;3)首创通过融合A4C和A2C视图信息、基于多视图超声心动图进行MI检测的方法。实验结果表明,该方法在多视图超声心动图上检测心肌梗死的灵敏度为90.91%,精确率为86.36%。 摘要:Myocardial infarction (MI) is the leading cause of mortality in the world that occurs due to a blockage of the coronary arteries feeding the myocardium. An early diagnosis of MI and its localization can mitigate the extent of myocardial damage by facilitating early therapeutic interventions. Following the blockage of a coronary artery, the regional wall motion abnormality (RWMA) of the ischemic myocardial segments is the earliest change to set in. Echocardiography is the fundamental tool to assess any RWMA. Assessing the motion of the left ventricle (LV) wall only from a single echocardiography view may lead to missing the diagnosis of MI as the RWMA may not be visible on that specific view. Therefore, in this study, we propose to fuse apical 4-chamber (A4C) and apical 2-chamber (A2C) views in which a total of 11 myocardial segments can be analyzed for MI detection. The proposed method first estimates the motion of the LV wall by Active Polynomials (APs), which extract and track the endocardial boundary to compute myocardial segment displacements. The features are extracted from the A4C and A2C view displacements, which are fused and fed into the classifiers to detect MI. The main contributions of this study are 1) creation of a new benchmark dataset by including both A4C and A2C views in a total of 260 echocardiography recordings, which is publicly shared with the research community, 2) improving the performance of the prior work of threshold-based APs by a Machine Learning based approach, and 3) a pioneer MI detection approach via multi-view echocardiography by fusing the information of A4C and A2C views. Experimental results show that the proposed method achieves 90.91% sensitivity and 86.36% precision for MI detection over multi-view echocardiography.

分类|识别相关(2篇)

【1】 Handwritten Digit Recognition Using Improved Bounding Box Recognition Technique 标题:基于改进边界框识别技术的手写体数字识别 链接:https://arxiv.org/abs/2111.05483

作者:Arkaprabha Basu,M. Sathya 机构:(Registration Number: ,), Project report Submitted in partial fulfilment of the, requirements for the award of the degree of, MASTER OF SCIENCE, DEPARTMENT OF COMPUTER SCIENCE, SCHOOL OF ENGINEERING AND TECHNOLOGY, PONDICHERRY UNIVERSITY, PUDUCHERRY- 备注:41 pages, 12 figures 摘要:该项目伴随着OCR(光学字符识别)技术而来,包括计算机科学的各个研究领域。该项目将拍摄一个角色的照片,并对其进行处理,以识别该角色的图像,就像人脑识别各种数字一样。该项目包含对图像处理技术的深刻理解,以及机器学习的大研究领域和机器学习的构造块神经网络。这个项目有两个不同的部分。训练部分的想法是通过给孩子们提供不同的相似字符集来训练他们,但不是完全相同的,并且说他们的输出就是这个。像这个想法一样,我们必须训练新建立的具有如此多特征的神经网络。这部分包含一些新的算法,这些算法是根据项目需要自行创建和升级的。测试部分包含一个新数据集的测试。这部分总是在训练部分之后。首先必须教孩子如何识别字符。然后必须测试他是否给出了正确的答案。如果没有,就必须通过提供新的数据集和新条目对他进行更严格的训练。就像那样,我们也必须测试算法。统计建模和优化技术的许多部分进入项目,需要大量的统计建模概念,如优化器技术和过滤过程,过滤或算法背后的数学和预测是如何产生的,或者预测模型创建的实际需要和最终需要的结果。机器学习算法是根据预测和编程的概念建立的。 摘要:The project comes with the technique of OCR (Optical Character Recognition) which includes various research sides of computer science. The project is to take a picture of a character and process it up to recognize the image of that character like a human brain recognize the various digits. The project contains the deep idea of the Image Processing techniques and the big research area of machine learning and the building block of the machine learning called Neural Network. There are two different parts of the project. Training part comes with the idea of to train a child by giving various sets of similar characters but not the totally same and to say them the output of this is this. Like this idea one has to train the newly built neural network with so many characters. This part contains some new algorithm which is self-created and upgraded as the project need. The testing part contains the testing of a new dataset .This part always comes after the part of the training .At first one has to teach the child how to recognize the character .Then one has to take the test whether he has given right answer or not. If not, one has to train him harder by giving new dataset and new entries. Just like that one has to test the algorithm also. There are many parts of statistical modeling and optimization techniques which come into the project requiring a lot of modeling concept of statistics like optimizer technique and filtering process, that how the mathematics and prediction behind that filtering or the algorithms comes after or which result one actually needs to and ultimately for the prediction of a predictive model creation. Machine learning algorithm is built by concepts of prediction and programming.

【2】 Learning to Disentangle Scenes for Person Re-identification 标题:学习解耦场景以实现行人重识别 链接:https://arxiv.org/abs/2111.05476

作者:Xianghao Zang,Ge Li,Wei Gao,Xiujun Shu 机构:School of Electronic and Computer Engineering, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China 备注:Preprint Version; Accepted by Image and Vision Computing 摘要:行人重识别(ReID)任务中存在许多具有挑战性的问题,例如遮挡和尺度变化。现有工作通常试图通过采用单分支网络来解决这些问题。这种单分支网络需要对各种具有挑战性的问题都保持鲁棒,这使得网络负担过重。本文提出对ReID任务进行分而治之。为此,我们使用若干自监督操作来模拟不同的挑战性问题,并使用不同的网络分别处理。具体地说,我们使用了随机擦除操作,并提出了一种新的随机缩放操作来生成具有可控特性的新图像。我们引入了一种通用的多分支网络,包括一个主分支和两个从属分支,用于处理不同的场景。这些分支协作学习并获得不同的感知能力。通过这种方式,ReID任务中的复杂场景得到了有效解耦,每个分支的负担得以减轻。大量实验的结果表明,该方法在三个ReID基准和两个遮挡ReID基准上都达到了最先进的性能。消融研究还表明,所提出的方案和操作显著提高了在各种场景中的性能。 摘要:There are many challenging problems in the person re-identification (ReID) task, such as the occlusion and scale variation. Existing works usually tried to solve them by employing a one-branch network. This one-branch network needs to be robust to various challenging problems, which makes this network overburdened. This paper proposes to divide-and-conquer the ReID task. For this purpose, we employ several self-supervision operations to simulate different challenging problems and handle each challenging problem using different networks. Concretely, we use the random erasing operation and propose a novel random scaling operation to generate new images with controllable characteristics. A general multi-branch network, including one master branch and two servant branches, is introduced to handle different scenes. These branches learn collaboratively and achieve different perceptive abilities. In this way, the complex scenes in the ReID task are effectively disentangled, and the burden of each branch is relieved. The results from extensive experiments demonstrate that the proposed method achieves state-of-the-art performances on three ReID benchmarks and two occluded ReID benchmarks. Ablation study also shows that the proposed scheme and operations significantly improve the performance in various scenes.
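摘要中提到的随机擦除是一种常见的数据增强操作;下面给出其通用实现示意(与论文代码无关,面积与长宽比范围为常用取值,随机缩放操作可按类似方式实现):

```python
import random
import torch

def random_erase(img, area=(0.02, 0.2), ratio=(0.3, 3.3)):
    """随机擦除示意:在图像上抹去一块随机矩形区域(这里简单置 0)。img: (C,H,W)"""
    _, H, W = img.shape
    for _ in range(10):                                   # 最多尝试 10 次采样合法区域
        a = random.uniform(*area) * H * W                 # 目标面积
        r = random.uniform(*ratio)                        # 目标长宽比
        h, w = int(round((a * r) ** 0.5)), int(round((a / r) ** 0.5))
        if 0 < h < H and 0 < w < W:
            y, x = random.randint(0, H - h), random.randint(0, W - w)
            img = img.clone()
            img[:, y:y + h, x:x + w] = 0.0
            return img
    return img

out = random_erase(torch.rand(3, 256, 128))
print(out.shape)  # torch.Size([3, 256, 128])
```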

分割|语义相关(1篇)

【1】 Robust deep learning-based semantic organ segmentation in hyperspectral images 标题:基于鲁棒深度学习的高光谱图像语义器官分割 链接:https://arxiv.org/abs/2111.05408

作者:Silvia Seidlitz,Jan Sellner,Jan Odenthal,Berkin Özdemir,Alexander Studier-Fischer,Samuel Knödler,Leonardo Ayala,Tim Adler,Hannes G. Kenngott,Minu Tizabi,Martin Wagner,Felix Nickel,Beat P. Müller-Stich,Lena Maier-Hein 机构:Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ), Heidelberg, Germany; Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany 备注:The first two authors (Silvia Seidlitz and Jan Sellner) contributed equally to this paper 摘要:语义图像分割是外科手术中上下文感知和自主机器人技术的重要前提。目前的研究主要集中在微创手术期间采集的常规RGB视频数据上,而基于光谱成像数据、在开放手术期间采集的全场景语义分割迄今几乎没有受到关注。为了填补文献中的这一空白,我们基于在开放手术环境下采集的猪高光谱成像(HSI)数据研究以下问题:(1)对于基于神经网络的全自动器官分割,HSI数据的合适表示形式是什么,特别是在数据的空间粒度方面(像素、超像素、图像块还是完整图像)?(2)在执行语义器官分割时,与其他模态(即RGB数据和处理后的HSI数据,例如氧合等组织参数)相比,使用HSI数据是否有优势?一项基于20头猪共506张HSI图像(标注了19个类别)的综合验证研究表明,基于深度学习的分割性能在各模态上均随输入数据空间上下文的增大而持续提升。未经处理的HSI数据优于RGB数据或相机供应商提供的处理后数据,且输入神经网络的尺寸越小,优势越大。最佳性能(对整幅图像应用HSI)的平均Dice相似系数(DSC)为0.89(标准差(SD)0.04),处于标注者间变异(DSC为0.89(SD 0.07))的范围内。我们的结论是,HSI有望成为全自动手术场景理解的强大图像模态,与传统成像相比具有许多优势,包括恢复额外功能性组织信息的能力。 摘要:Semantic image segmentation is an important prerequisite for context-awareness and autonomous robotics in surgery. The state of the art has focused on conventional RGB video data acquired during minimally invasive surgery, but full-scene semantic segmentation based on spectral imaging data and obtained during open surgery has received almost no attention to date. To address this gap in the literature, we are investigating the following research questions based on hyperspectral imaging (HSI) data of pigs acquired in an open surgery setting: (1) What is an adequate representation of HSI data for neural network-based fully automated organ segmentation, especially with respect to the spatial granularity of the data (pixels vs. superpixels vs. patches vs. full images)? (2) Is there a benefit of using HSI data compared to other modalities, namely RGB data and processed HSI data (e.g. tissue parameters like oxygenation), when performing semantic organ segmentation? According to a comprehensive validation study based on 506 HSI images from 20 pigs, annotated with a total of 19 classes, deep learning-based segmentation performance increases - consistently across modalities - with the spatial context of the input data. Unprocessed HSI data offers an advantage over RGB data or processed data from the camera provider, with the advantage increasing with decreasing size of the input to the neural network. Maximum performance (HSI applied to whole images) yielded a mean dice similarity coefficient (DSC) of 0.89 (standard deviation (SD) 0.04), which is in the range of the inter-rater variability (DSC of 0.89 (SD 0.07)). We conclude that HSI could become a powerful image modality for fully-automatic surgical scene understanding with many advantages over traditional imaging, including the ability to recover additional functional tissue information.
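摘要中用于评估分割质量的 Dice 相似系数(DSC)定义为 2|A∩B|/(|A|+|B|);下面是一个简单的 NumPy 计算示意(掩码为构造的示例数据):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice 相似系数 (DSC) = 2|A∩B| / (|A|+|B|);pred/target 为同形状的二值掩码。"""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((4, 4), int); a[:2] = 1   # 8 个前景像素
b = np.zeros((4, 4), int); b[:3] = 1   # 12 个前景像素,交集为 8
print(round(dice(a, b), 3))            # 0.8
```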

半弱无监督|主动学习|不确定性(3篇)

【1】 Deep Attention-guided Graph Clustering with Dual Self-supervision 标题:具有双重自我监督的深度注意引导图聚类 链接:https://arxiv.org/abs/2111.05548

作者:Zhihao Peng,Hui Liu,Yuheng Jia,Junhui Hou 机构:Liu is with the School of Computing & Information Sciences, CaritasInstitute of Higher Education 摘要:现有的深度嵌入聚类只考虑最深的层来学习特征嵌入,因此不能很好地利用来自聚类指派的可用判别信息,从而导致性能限制。为此,我们提出了一种新的方法,即双自监督深度注意引导图聚类(DAGC)。具体地说,DAGC首先利用异构融合模块在每一层中自适应地集成自动编码器和图卷积网络的特征,然后使用尺度融合模块动态地连接不同层中的多尺度特征。这些模块能够通过基于注意的机制学习鉴别特征嵌入。此外,我们还设计了一个分布式融合模块,该模块利用集群分配直接获取集群结果。为了更好地探索集群分配中的鉴别信息,我们开发了一个双重自我监督解决方案,包括一个具有三重Kullback-Leibler散度损失的软自我监督策略和一个具有伪监督损失的硬自我监督策略。大量实验证明,在六个基准数据集上,我们的方法始终优于最先进的方法。特别是,我们的方法比最佳基线提高了ARI 18.14%以上。 摘要:Existing deep embedding clustering works only consider the deepest layer to learn a feature embedding and thus fail to well utilize the available discriminative information from cluster assignments, resulting performance limitation. To this end, we propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). Specifically, DAGC first utilizes a heterogeneity-wise fusion module to adaptively integrate the features of an auto-encoder and a graph convolutional network in each layer and then uses a scale-wise fusion module to dynamically concatenate the multi-scale features in different layers. Such modules are capable of learning a discriminative feature embedding via an attention-based mechanism. In addition, we design a distribution-wise fusion module that leverages cluster assignments to acquire clustering results directly. To better explore the discriminative information from the cluster assignments, we develop a dual self-supervision solution consisting of a soft self-supervision strategy with a triplet Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss. Extensive experiments validate that our method consistently outperforms state-of-the-art methods on six benchmark datasets. Especially, our method improves the ARI by more than 18.14% over the best baseline.
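摘要提到用软自监督(KL散度损失)优化聚类分配;下面给出一个 DEC 风格的"锐化目标分布 + KL 损失"的通用示意(这是此类深度聚类方法的常见构造,并非论文的具体公式):

```python
import torch
import torch.nn.functional as F

def target_distribution(q):
    """DEC 风格的"锐化"目标分布:p_ij ∝ q_ij^2 / Σ_i q_ij(常见构造,仅作示意)。"""
    w = q ** 2 / q.sum(dim=0, keepdim=True)
    return w / w.sum(dim=1, keepdim=True)

q = torch.softmax(torch.randn(8, 3), dim=1)        # 软聚类分配 (N, K)
p = target_distribution(q).detach()                 # 锐化后的目标分布
kl = F.kl_div(q.log(), p, reduction='batchmean')    # KL(p || q) 作为自监督损失
print(kl.item())
```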

【2】 Towards Active Vision for Action Localization with Reactive Control and Predictive Learning 标题:基于反应控制和预测学习的主动视觉动作定位 链接:https://arxiv.org/abs/2111.05448

作者:Shubham Trehan,Sathyanarayanan N. Aakur 机构:Department of Computer Science, Oklahoma State University 备注:To appear at WACV 2022 摘要:诸如动作定位之类的视觉事件感知任务主要集中在静态观察者下的监督学习设置,即相机是静态的且不能由算法控制。它们通常受到标注训练数据的质量、数量和多样性的限制,并且往往无法推广到域外样本。在这项工作中,我们研究主动动作定位问题,其目标是在控制主动摄像机的几何和物理参数的同时定位动作,从而在没有训练数据的情况下将动作保持在视野内。我们提出了一种基于能量的机制,将预测学习和反应式控制相结合,在没有奖励(现实环境中奖励可能稀疏甚至不存在)的情况下执行主动动作定位。我们在模拟和真实环境中对两项任务进行了广泛的实验:主动目标跟踪和主动动作定位。我们证明了所提出的方法可以以流式方式推广到不同的任务和环境,而无需显式的奖励或训练。结果表明,该方法优于无监督基线,并且与使用强化学习训练的方法相比取得了有竞争力的性能。 摘要:Visual event perception tasks such as action localization have primarily focused on supervised learning settings under a static observer, i.e., the camera is static and cannot be controlled by an algorithm. They are often restricted by the quality, quantity, and diversity of annotated training data and do not often generalize to out-of-domain samples. In this work, we tackle the problem of active action localization where the goal is to localize an action while controlling the geometric and physical parameters of an active camera to keep the action in the field of view without training data. We formulate an energy-based mechanism that combines predictive learning and reactive control to perform active action localization without rewards, which can be sparse or non-existent in real-world environments. We perform extensive experiments in both simulated and real-world environments on two tasks - active object tracking and active action localization. We demonstrate that the proposed approach can generalize to different tasks and environments in a streaming fashion, without explicit rewards or training. We show that the proposed approach outperforms unsupervised baselines and obtains competitive performance compared to those trained with reinforcement learning.

【3】 Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity 标题:放松跨模态时间同步性的自监督视听表征学习 链接:https://arxiv.org/abs/2111.05329

作者:Pritam Sarkar,Ali Etemad 机构:Queen's University, Canada, Vector Institute 备注:18 pages 摘要:我们提出了CrissCross,一个用于学习视听表示的自监督框架。我们的框架引入了一个新概念:除了学习模态内以及标准的"同步"跨模态关系外,CrissCross还学习"异步"跨模态关系。我们表明,通过放松音频和视频模态之间的时间同步性,网络可以学习到强的时不变表示。我们的实验表明,对音频和视觉两种模态施加强增强并放松跨模态时间同步性能够取得最优性能。为了预训练我们提出的框架,我们使用了3个不同规模的数据集:Kinetics-Sound、Kinetics-400和AudioSet。学习到的表示在多个下游任务上进行评估,即动作识别、声音分类和检索。CrissCross在动作识别(UCF101和HMDB51)和声音分类(ESC50)上展示了最先进的性能。代码和预训练模型将公开提供。 摘要:We present CrissCross, a self-supervised framework for learning audio-visual representations. A novel notion is introduced in our framework whereby in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong time-invariant representations. Our experiments show that strong augmentations for both audio and visual modalities with relaxation of cross-modal temporal synchronicity optimize performance. To pretrain our proposed framework, we use 3 different datasets with varying sizes, Kinetics-Sound, Kinetics-400, and AudioSet. The learned representations are evaluated on a number of downstream tasks namely action recognition, sound classification, and retrieval. CrissCross shows state-of-the-art performances on action recognition (UCF101 and HMDB51) and sound classification (ESC50). The codes and pretrained models will be made publicly available.
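学习"同步"跨模态关系常用 InfoNCE 式对比损失;下面是一个标准的跨模态对比损失示意(论文中"放松时间同步性/异步关系"的具体构造不在此示意范围内,温度系数等为示例取值):

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(a, v, tau=0.07):
    """标准 InfoNCE 示意:第 i 个音频与第 i 个视频片段互为正样本,其余为负样本。"""
    a = F.normalize(a, dim=1)
    v = F.normalize(v, dim=1)
    logits = a @ v.t() / tau                         # (B, B) 相似度矩阵
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = cross_modal_nce(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```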

时序|行为识别|姿态|视频|运动估计(2篇)

【1】 Space-Time Memory Network for Sounding Object Localization in Videos 标题:用于视频中发声对象定位的时空记忆网络 链接:https://arxiv.org/abs/2111.05526

作者:Sizhe Li,Yapeng Tian,Chenliang Xu 机构:Department of Computer Science, University of Rochester, Rochester, USA 备注:Accepted to BMVC2021. Project page: this https URL 摘要:利用视觉与声音之间的时间同步和关联,是实现发声对象鲁棒定位的关键一步。为此,我们提出了一种用于视频中发声对象定位的时空记忆网络。它可以同时在来自音频和视觉模态的单模态与跨模态表示上学习时空注意力。我们从定量和定性两方面展示并分析了引入时空学习对视听对象定位的有效性。我们证明了该方法可以推广到各种复杂的视听场景,并优于最近的最先进方法。 摘要:Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding object localization in videos. It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations from audio and visual modalities. We show and analyze both quantitatively and qualitatively the effectiveness of incorporating spatio-temporal learning in localizing audio-visual objects. We demonstrate that our approach generalizes over various complex audio-visual scenes and outperforms recent state-of-the-art methods.

【2】 Sparse Adversarial Video Attacks with Spatial Transformations 标题:基于空间变换的稀疏对抗性视频攻击 链接:https://arxiv.org/abs/2111.05468

作者:Ronghui Mu,Wenjie Ruan,Leandro Soriano Marcolino,Qiang Ni 机构:Lancaster UniversityLancasterronghui, University of Exeter, Lancaster UniversityLancasterl, Lancaster UniversityLancasterq 备注:The short version of this work will appear in the BMVC 2021 conference 摘要:近年来,大量的研究工作集中在对图像的对抗性攻击上,而对对抗性视频攻击的研究却很少。我们提出了一种对抗性的视频攻击策略,称为DeepSAVA。我们的模型通过统一的优化框架包括加性扰动和空间变换,其中采用结构相似性指数(SSIM)度量来度量对抗距离。我们设计了一个有效且新颖的优化方案,该方案交替使用贝叶斯优化来识别视频中最有影响力的帧,并使用基于随机梯度下降(SGD)的优化来产生加性和空间变换扰动。这样做使DeepSAVA能够对视频执行非常稀疏的攻击,以保持人类的不可感知性,同时在攻击成功率和对抗性转移能力方面仍能实现最先进的性能。我们在各种类型的深度神经网络和视频数据集上进行的密集实验证实了DeepSAVA的优越性。 摘要:In recent years, a significant amount of research efforts concentrated on adversarial attacks on images, while adversarial video attacks have seldom been explored. We propose an adversarial attack strategy on videos, called DeepSAVA. Our model includes both additive perturbation and spatial transformation by a unified optimisation framework, where the structural similarity index (SSIM) measure is adopted to measure the adversarial distance. We design an effective and novel optimisation scheme which alternatively utilizes Bayesian optimisation to identify the most influential frame in a video and Stochastic gradient descent (SGD) based optimisation to produce both additive and spatial-transformed perturbations. Doing so enables DeepSAVA to perform a very sparse attack on videos for maintaining human imperceptibility while still achieving state-of-the-art performance in terms of both attack success rate and adversarial transferability. Our intensive experiments on various types of deep neural networks and video datasets confirm the superiority of DeepSAVA.
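摘要中用 SSIM 度量对抗距离;下面给出用 scikit-image 计算 SSIM(以及 1-SSIM 作为距离)的简单示意,图像为随机生成的示例数据,与论文的优化框架无关:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

clean = np.random.rand(64, 64).astype(np.float32)
perturbed = np.clip(clean + 0.03 * np.random.randn(64, 64).astype(np.float32), 0, 1)

# SSIM 越接近 1 表示两帧越相似;1 - SSIM 可作为"对抗距离"的度量
score = ssim(clean, perturbed, data_range=1.0)
print(1.0 - score)
```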

医学相关(1篇)

【1】 Explanatory Analysis and Rectification of the Pitfalls in COVID-19 Datasets 标题:冠状病毒数据集中缺陷的解释性分析与纠正 链接:https://arxiv.org/abs/2111.05679

作者:Samyak Prajapati,Japman Singh Monga,Shaanya Singh,Amrit Raj,Yuvraj Singh Champawat,Chandra Prakash 机构:Department of Computer Science and, Engineering, National Institute of Technology Delhi, New Delhi, Delhi, India, Pandit Deendayal Energy University, Gandhinagar, Gujarat, India, Co-First Authors 摘要:自2020年新冠疫情爆发以来,已有数百万人死于这种致命病毒。人们曾多次尝试设计一种能够检测病毒的自动化测试方法。全球各地的许多研究人员提出了基于深度学习的方法,用胸部X光检测新冠病毒-19。然而,对于大多数研究人员使用的公开的胸部X射线数据集中是否存在偏差提出了疑问。在本文中,我们提出了一个两阶段的方法来解决这个问题。作为该方法第1阶段的一部分,已经进行了两个实验,以证明数据集中存在偏差。随后,在该方法的第2阶段提出了一种图像分割、超分辨率和基于CNN的管道以及不同的图像增强技术,以减少偏差的影响。InceptionResNetV2在胸部X射线图像上进行训练,该图像在通过第2阶段中提出的管道时经过直方图均衡化和伽马校正后得到增强,对于3级(正常、肺炎和新冠病毒-19)分类任务,最高准确率为90.47%。 摘要:Since the onset of the COVID-19 pandemic in 2020, millions of people have succumbed to this deadly virus. Many attempts have been made to devise an automated method of testing that could detect the virus. Various researchers around the globe have proposed deep learning based methodologies to detect the COVID-19 using Chest X-Rays. However, questions have been raised on the presence of bias in the publicly available Chest X-Ray datasets which have been used by the majority of the researchers. In this paper, we propose a 2 staged methodology to address this topical issue. Two experiments have been conducted as a part of stage 1 of the methodology to exhibit the presence of bias in the datasets. Subsequently, an image segmentation, super-resolution and CNN based pipeline along with different image augmentation techniques have been proposed in stage 2 of the methodology to reduce the effect of bias. InceptionResNetV2 trained on Chest X-Ray images that were augmented with Histogram Equalization followed by Gamma Correction when passed through the pipeline proposed in stage 2, yielded a top accuracy of 90.47% for 3-class (Normal, Pneumonia, and COVID-19) classification task.
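摘要中提到的"直方图均衡化 + 伽马校正"增强可以用 OpenCV 简单实现;下面是一个与论文流水线无关的最小示意(gamma 取值为示例):

```python
import cv2
import numpy as np

def enhance(gray, gamma=0.8):
    """先做直方图均衡化,再做伽马校正(仅演示两种增强的组合方式)。gray: uint8 灰度图"""
    eq = cv2.equalizeHist(gray)                                          # 直方图均衡化
    lut = np.array([(i / 255.0) ** gamma * 255 for i in range(256)]).astype(np.uint8)
    return cv2.LUT(eq, lut)                                              # 伽马校正查找表

img = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
out = enhance(img)
print(out.shape, out.dtype)
```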

GAN|对抗|攻击|生成相关(1篇)

【1】 Object-Centric Representation Learning with Generative Spatial-Temporal Factorization 标题:基于产生式时空分解的以对象为中心的表征学习 链接:https://arxiv.org/abs/2111.05393

作者:Li Nanbo,Muhammad Ahmed Raza,Hu Wenbin,Zhaole Sun,Robert B. Fisher 机构:School of Informatics, University of Edinburgh 备注:Accepted at NeurIPS 2021 摘要:学习以对象为中心的场景表示对于获得复杂场景的结构理解和抽象至关重要。然而,由于当前的无监督以对象为中心的表示学习方法是建立在静态观察者假设或静态场景假设的基础上的,因此它们通常:i)存在单视图空间模糊性,或ii)从动态场景中错误或不准确地推断对象表示。为了解决这个问题,我们提出了动态感知多对象网络(DyMON),这是一种将多视图以对象为中心的表示学习扩展到动态场景的方法。我们在多视图动态场景数据上训练DyMON,并表明DyMON在没有监督的情况下学习从一系列观察中分解观察者运动和场景对象动态的纠缠效应,并构建适合在任意时间渲染的场景对象空间表示(跨时间查询)以及从任意视点(跨空间查询)。我们还表明,分解场景表示(w.r.t.对象)支持独立于空间和时间查询单个对象。 摘要:Learning object-centric scene representations is essential for attaining structural understanding and abstraction of complex scenes. Yet, as current approaches for unsupervised object-centric representation learning are built upon either a stationary observer assumption or a static scene assumption, they often: i) suffer single-view spatial ambiguities, or ii) infer incorrectly or inaccurately object representations from dynamic scenes. To address this, we propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes. We train DyMON on multi-view-dynamic-scene data and show that DyMON learns -- without supervision -- to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations, and constructs scene object spatial representations suitable for rendering at arbitrary times (querying across time) and from arbitrary viewpoints (querying across space). We also show that the factorized scene representations (w.r.t. objects) support querying about a single object by space and time independently.

Attention注意力(1篇)

【1】 Learning to ignore: rethinking attention in CNNs 标题:学会忽视:CNN中注意力的再思考 链接:https://arxiv.org/abs/2111.05684

作者:Firas Laakom,Kateryna Chumachenko,Jenni Raitoharju,Alexandros Iosifidis,Moncef Gabbouj 机构:Tampere University, FinnishEnvironment Institute 备注:accepted to BMVC 2021 摘要:近年来,人们越来越关注卷积神经网络(CNN)中的注意机制来解决计算机视觉任务。这些方法大多学习明确识别和突出显示场景的相关部分,并将参与的图像传递到网络的其他层。在本文中,我们认为这种方法可能不是最优的。可以说,明确地了解图像的哪些部分是相关的通常比了解图像的哪些部分不那么相关更难,因此应该忽略。事实上,在视觉领域中,存在许多容易识别的无关特征模式。例如,靠近边界的图像区域不太可能包含分类任务的有用信息。基于这一观点,我们建议重新构建CNN中的注意机制,使其学会忽略,而不是学会参与。具体来说,我们建议显式学习场景中的无关信息,并在生成的表示中抑制它,只保留重要属性。这种内隐注意机制可以整合到任何现有的注意机制中。在这项工作中,我们用两种最新的注意方法——挤压和激发(SE)块和卷积块注意模块(CBAM)验证了这一想法。在不同数据集和模型架构上的实验结果表明,与标准方法相比,学习忽略,即内隐注意,可以产生更好的性能。 摘要:Recently, there has been an increasing interest in applying attention mechanisms in Convolutional Neural Networks (CNNs) to solve computer vision tasks. Most of these methods learn to explicitly identify and highlight relevant parts of the scene and pass the attended image to further layers of the network. In this paper, we argue that such an approach might not be optimal. Arguably, explicitly learning which parts of the image are relevant is typically harder than learning which parts of the image are less relevant and, thus, should be ignored. In fact, in vision domain, there are many easy-to-identify patterns of irrelevant features. For example, image regions close to the borders are less likely to contain useful information for a classification task. Based on this idea, we propose to reformulate the attention mechanism in CNNs to learn to ignore instead of learning to attend. Specifically, we propose to explicitly learn irrelevant information in the scene and suppress it in the produced representation, keeping only important attributes. This implicit attention scheme can be incorporated into any existing attention mechanism. In this work, we validate this idea using two recent attention methods Squeeze and Excitation (SE) block and Convolutional Block Attention Module (CBAM). Experimental results on different datasets and model architectures show that learning to ignore, i.e., implicit attention, yields superior performance compared to the standard approaches.
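下面给出一个"学会忽略"思路的通道注意力示意:在 SE 风格模块中把网络输出解释为各通道的"无关程度" r,并用 (1-r) 进行抑制。具体形式纯属假设,并非论文的官方实现:

```python
import torch
import torch.nn as nn

class IgnoreSE(nn.Module):
    """SE 风格通道注意力的"学会忽略"变体示意(形式为假设):
    fc 输出 r 表示各通道的"无关程度",用 (1 - r) 抑制被判为无关的通道。"""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        r = self.fc(x).view(x.size(0), -1, 1, 1)   # r∈(0,1):通道"无关性"得分
        return x * (1.0 - r)                       # 抑制无关通道,保留重要属性

x = torch.randn(2, 64, 32, 32)
print(IgnoreSE(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```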

人脸|人群计数(1篇)

【1】 Pipeline for 3D reconstruction of the human body from AR/VR headset mounted egocentric cameras 标题:从安装在AR/VR头显上的自我中心摄像机进行人体三维重建的流水线 链接:https://arxiv.org/abs/2111.05409

作者:Shivam Grover,Kshitij Sidana,Vanita Jain 机构:Bharati Vidyapeeth's College of Engineering, New Delhi, India, Contributed equally to this work as first authors. 备注:11 pages, 12 figures and 2 tables 摘要:在本文中,我们提出了一种从自我中心视角对全身进行三维重建的新流水线。从自我中心视角对人体进行三维重建是一项具有挑战性的任务,因为视角是倾斜的,且距离摄像机较远的身体部位会被遮挡。其中一个例子是安装在VR头显下方的摄像头所拍摄的视图。为了完成这项任务,我们首先利用条件GAN将自我中心视图转换为全身第三人称视图。这提高了图像的可理解性,并处理了遮挡问题。生成的第三人称视图随后被送入三维重建模块,生成人体的三维网格。我们还训练了一个网络,它能获取主体的第三人称全身视图,并生成可应用于网格的纹理贴图。生成的网格具有相当逼真的身体比例,并已完成骨骼绑定,可用于游戏中的实时动画和姿态迁移等进一步应用。这种方法有望成为移动人体远程临场感这一新领域的关键。 摘要:In this paper, we propose a novel pipeline for the 3D reconstruction of the full body from egocentric viewpoints. 3-D reconstruction of the human body from egocentric viewpoints is a challenging task as the view is skewed and the body parts farther from the cameras are occluded. One such example is the view from cameras installed below VR headsets. To achieve this task, we first make use of conditional GANs to translate the egocentric views to full body third-person views. This increases the comprehensibility of the image and caters to occlusions. The generated third-person view is further sent through the 3D reconstruction module that generates a 3D mesh of the body. We also train a network that can take the third person full-body view of the subject and generate the texture maps for applying on the mesh. The generated mesh has fairly realistic body proportions and is fully rigged allowing for further applications such as real-time animation and pose transfer in games. This approach can be key to a new domain of mobile human telepresence.

图像视频检索|Re-id相关(1篇)

【1】 SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval 标题:SWAMP:面向跨模态检索的多模态对互换赋值 链接:https://arxiv.org/abs/2111.05814

作者:Minyoung Kim 机构:Samsung AI Center, Cambridge, UK 摘要:我们处理跨模态检索问题,其中训练仅由数据中的相关多模态对监督。对比学习是完成这项任务最常用的方法。然而,它的学习采样复杂度在训练数据点的数量上是二次的。此外,它还可能错误地假设不同对中的实例自动无关。为了解决这些问题,我们提出了一种新的基于未知类自标记的损失函数。具体来说,我们的目标是预测每个模态中数据实例的类标签,并将这些标签分配给另一模态中的相应实例(即交换伪标签)。通过这些交换标签,我们使用监督交叉熵损失学习每个模态的数据嵌入,从而导致线性采样复杂性。我们还维护用于存储最新批次嵌入的队列,对于这些队列,聚类分配和嵌入学习以在线方式同时完成。这消除了为离线集群注入间歇的整个训练数据扫描的计算开销。我们在几个实际的跨模态检索问题上测试了我们的方法,包括基于文本的视频检索、基于草图的图像检索和图像文本检索,并且对于所有这些任务,我们的方法都比对比学习有显著的性能改进。 摘要:We tackle the cross-modal retrieval problem, where the training is only supervised by the relevant multi-modal pairs in the data. The contrastive learning is the most popular approach for this task. However, its sampling complexity for learning is quadratic in the number of training data points. Moreover, it makes potentially wrong assumption that the instances in different pairs are automatically irrelevant. To address these issues, we propose a novel loss function that is based on self-labeling of the unknown classes. Specifically, we aim to predict class labels of the data instances in each modality, and assign those labels to the corresponding instances in the other modality (i.e., swapping the pseudo labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss, hence leading to linear sampling complexity. We also maintain the queues for storing the embeddings of the latest batches, for which clustering assignment and embedding learning are done at the same time in an online fashion. This removes computational overhead of injecting intermittent epochs of entire training data sweep for offline clustering. We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval, and for all these tasks our method achieves significant performance improvement over the contrastive learning.
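按摘要的描述,核心是"各模态先预测聚类(伪)标签,再交换标签用交叉熵监督对方";下面是该思路的一个简化示意(原型矩阵、维度等均为假设,省略了论文中的在线队列等细节):

```python
import torch
import torch.nn.functional as F

def swamp_loss(z_text, z_video, prototypes):
    """交换伪标签示意:各模态先对"类"打分并取伪标签,再用对方的伪标签做交叉熵监督。"""
    logit_t = z_text @ prototypes.t()                 # (B, K) 文本对各"类"的打分
    logit_v = z_video @ prototypes.t()
    pseudo_t = logit_t.argmax(dim=1).detach()         # 文本侧伪标签
    pseudo_v = logit_v.argmax(dim=1).detach()         # 视频侧伪标签
    # 交换:用文本伪标签监督视频,用视频伪标签监督文本
    return F.cross_entropy(logit_v, pseudo_t) + F.cross_entropy(logit_t, pseudo_v)

B, D, K = 8, 128, 32
prototypes = torch.randn(K, D, requires_grad=True)    # 可学习的"类"原型(假设)
loss = swamp_loss(torch.randn(B, D), torch.randn(B, D), prototypes)
loss.backward()
print(loss.item())
```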

裁剪|量化|加速|压缩相关(1篇)

【1】 Efficient Neural Network Training via Forward and Backward Propagation Sparsification 标题:基于前向和后向传播稀疏化的高效神经网络训练 链接:https://arxiv.org/abs/2111.05685

作者:Xiao Zhou,Weizhong Zhang,Zonghao Chen,Shizhe Diao,Tong Zhang 机构:Hong Kong University of Science and Technology, Tsinghua University 摘要:稀疏训练是一种自然的想法,可以加快深层神经网络的训练速度并节省内存使用,特别是在大型现代神经网络明显过度参数化的情况下。然而,大多数现有方法在实践中无法实现这一目标,因为先前方法采用的基于链规则的梯度(w.r.t.结构参数)估计至少在反向传播步骤中需要密集计算。本文提出了一种具有完全稀疏前向和后向通道的有效稀疏训练方法,解决了这一问题。我们首先将训练过程描述为全局稀疏约束下的连续最小化问题。然后,我们将优化过程分为两个步骤,分别对应于权重更新和结构参数更新。对于前一步,我们使用传统的链式规则,通过利用稀疏结构可以实现稀疏。对于后一步,我们不使用现有方法中基于链规则的梯度估计,而是提出了一种减少方差的策略梯度估计,它只需要两个前向过程而不需要反向传播,从而实现完全稀疏的训练。我们证明了梯度估计的方差是有界的。在真实数据集上的大量实验结果表明,与以前的方法相比,我们的算法在加速训练过程方面更加有效,速度快了一个数量级。 摘要:Sparse training is a natural idea to accelerate the training speed of deep neural networks and save the memory usage, especially since large modern neural networks are significantly over-parameterized. However, most of the existing methods cannot achieve this goal in practice because the chain rule based gradient (w.r.t. structure parameters) estimators adopted by previous methods require dense computation at least in the backward propagation step. This paper solves this problem by proposing an efficient sparse training method with completely sparse forward and backward passes. We first formulate the training process as a continuous minimization problem under global sparsity constraint. We then separate the optimization process into two steps, corresponding to weight update and structure parameter update. For the former step, we use the conventional chain rule, which can be sparse via exploiting the sparse structure. For the latter step, instead of using the chain rule based gradient estimators as in existing methods, we propose a variance reduced policy gradient estimator, which only requires two forward passes without backward propagation, thus achieving completely sparse training. We prove that the variance of our gradient estimator is bounded. Extensive experimental results on real-world datasets demonstrate that compared to previous methods, our algorithm is much more effective in accelerating the training process, up to an order of magnitude faster.

蒸馏|知识提取(1篇)

【1】 Multimodal Approach for Metadata Extraction from German Scientific Publications 标题:从德国科学出版物中提取元数据的多模态方法 链接:https://arxiv.org/abs/2111.05736

作者:Azeddine Bouabdallah,Jorge Gavilan,Jennifer Gerbl,Prayuth Patumcharoenpol 机构:Institute for Web Science and Technologies (WeST), University of Koblenz-Landau, Koblenz, Germany 备注:8 pages, 5 figures, 4 tables 摘要:如今,元数据信息通常由作者自己在提交时提供。然而,现有研究论文中有相当一部分缺少或不完整的元数据信息。德国科学论文有各种各样的布局,这使得元数据的提取成为一项非常重要的任务,需要一种精确的方法对从文档中提取的元数据进行分类。在本文中,我们提出了一种从德语科学论文中提取元数据的多模式深度学习方法。我们结合自然语言处理和图像视觉处理,考虑了多种类型的输入数据。与其他最先进的方法相比,此模型旨在提高元数据提取的总体准确性。它能够利用空间和上下文特征,以实现更可靠的提取。我们的这种方法模型是在一个由大约8800个文档组成的数据集上训练的,能够获得0.923的F1总分。 摘要:Nowadays, metadata information is often given by the authors themselves upon submission. However, a significant part of already existing research papers have missing or incomplete metadata information. German scientific papers come in a large variety of layouts which makes the extraction of metadata a non-trivial task that requires a precise way to classify the metadata extracted from the documents. In this paper, we propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language. We consider multiple types of input data by combining natural language processing and image vision processing. This model aims to increase the overall accuracy of metadata extraction compared to other state-of-the-art approaches. It enables the utilization of both spatial and contextual features in order to achieve a more reliable extraction. Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.

超分辨率|去噪|去模糊|去雾(2篇)

【1】 Single image dehazing via combining the prior knowledge and CNNs 标题:结合先验知识与CNN的单幅图像去雾 链接:https://arxiv.org/abs/2111.05701

作者:Yuwen Li,Chaobing Zheng,Shiqian Wu,Wangming Xu 机构:Hubei Key Laboratory of Mechanical Transmission and Manufacturing Engineering, Wuhan University of Science and Technology; The Institute of Robotics and Intelligent Systems, School of Information Science and Engineering, Wuhan University of Science and Technology 摘要:现有基于先验知识和假设的单幅图像去雾算法在实际应用中受到诸多限制,并可能受到噪声和光晕放大的影响。针对这些问题,本文提出了一种端到端系统,将先验知识与深度学习方法相结合以减少上述缺陷。首先,通过加权引导图像滤波器(WGIF)将有雾图像分解为基础层和细节层,并从基础层估计大气光。然后,将基础层图像输入高效的深度卷积网络以估计透射率图。为了在不放大天空或浓雾场景中噪声的前提下完整恢复靠近相机的目标,提出了一种基于透射率值的自适应策略:如果某像素的透射率较小,则使用有雾图像的基础层并通过大气散射模型恢复无雾图像;否则,直接使用有雾图像。实验表明,该方法比现有方法具有更好的性能。 摘要:Aiming at the existing single image haze removal algorithms, which are based on prior knowledge and assumptions, subject to many limitations in practical applications, and could suffer from noise and halo amplification. An end-to-end system is proposed in this paper to reduce defects by combining the prior knowledge and deep learning method. The haze image is decomposed into the base layer and detail layers through a weighted guided image filter (WGIF) firstly, and the airlight is estimated from the base layer. Then, the base layer image is passed to the efficient deep convolutional network for estimating the transmission map. To restore object close to the camera completely without amplifying noise in sky or heavily hazy scene, an adaptive strategy is proposed based on the value of the transmission map. If the transmission map of a pixel is small, the base layer of the haze image is used to recover a haze-free image via atmospheric scattering model, finally. Otherwise, the haze image is used. Experiments show that the proposed method achieves superior performance over existing methods.
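摘要中提到的大气散射模型为 I(x)=J(x)t(x)+A(1-t(x));在估计出透射率 t 和大气光 A 后,可按下式恢复无雾图像 J。下面的数值均为示例,仅演示恢复步骤本身:

```python
import numpy as np

def recover(I, t, A, t_min=0.1):
    """按大气散射模型恢复无雾图像:J = (I - A) / max(t, t_min) + A。"""
    t = np.clip(t, t_min, 1.0)[..., None]        # 限制透射率下限,避免噪声放大
    return np.clip((I - A) / t + A, 0.0, 1.0)

I = np.random.rand(4, 4, 3)        # 有雾图像(归一化到 [0,1])
t = np.full((4, 4), 0.6)           # 假设的透射率图
A = np.array([0.9, 0.9, 0.9])      # 假设的大气光
print(recover(I, t, A).shape)      # (4, 4, 3)
```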

【2】 Multi-Scale Single Image Dehazing Using Laplacian and Gaussian Pyramids 标题:基于拉普拉斯与高斯金字塔的多尺度单幅图像去雾 链接:https://arxiv.org/abs/2111.05700

作者:Zhengguo Li,Haiyan Shu,Chaobing Zheng 机构:Wuhan University of Science and Technology, Wuhan 摘要:由于应用广泛,基于各种先验的模型驱动单幅图像去雾得到了深入研究。目标辐射与雾之间的歧义性以及天空区域的噪声放大,是模型驱动单幅图像去雾的两个固有问题。本文提出了一种暗直接衰减先验(DDAP)来解决前一个问题,并提出了一种新的雾线平均(haze line averaging)方法来减少DDAP引起的形态伪影,使得半径较小的加权引导图像滤波器能够进一步减少形态伪影,同时保留图像中的精细结构。针对后一个问题,提出了一种多尺度去雾算法:采用拉普拉斯金字塔和高斯金字塔将有雾图像分解到不同层次,并在金字塔的不同层次上采用不同的去雾和降噪方法恢复场景辐射,最后折叠(合并)所得金字塔以恢复无雾图像。实验结果表明,该算法优于现有去雾算法,并有效防止了天空区域的噪声放大。 摘要:Model driven single image dehazing was widely studied on top of different priors due to its extensive applications. Ambiguity between object radiance and haze and noise amplification in sky regions are two inherent problems of model driven single image dehazing. In this paper, a dark direct attenuation prior (DDAP) is proposed to address the former problem. A novel haze line averaging is proposed to reduce the morphological artifacts caused by the DDAP which enables a weighted guided image filter with a smaller radius to further reduce the morphological artifacts while preserve the fine structure in the image. A multi-scale dehazing algorithm is then proposed to address the latter problem by adopting Laplacian and Guassian pyramids to decompose the hazy image into different levels and applying different haze removal and noise reduction approaches to restore the scene radiance at different levels of the pyramid. The resultant pyramid is collapsed to restore a haze-free image. Experiment results demonstrate that the proposed algorithm outperforms state of the art dehazing algorithms and the noise is indeed prevented from being amplified in the sky region.
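下面给出高斯/拉普拉斯金字塔分解与"折叠"重建的通用示意(与论文的去雾、降噪处理无关,仅演示金字塔本身;假设图像边长可被 2 的若干次幂整除):

```python
import cv2
import numpy as np

def build_pyramids(img, levels=3):
    """构建高斯金字塔与拉普拉斯金字塔,并演示由金字塔折叠(重建)回原图。"""
    gauss = [img.astype(np.float32)]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))                 # 逐层下采样
    lap = [gauss[i] - cv2.pyrUp(gauss[i + 1]) for i in range(levels)]  # 各层细节
    rec = gauss[-1]                                           # 从最粗一层开始重建
    for i in reversed(range(levels)):
        rec = cv2.pyrUp(rec) + lap[i]                         # 上采样并加回细节
    return gauss, lap, rec

img = np.random.rand(256, 256).astype(np.float32)
_, _, rec = build_pyramids(img)
print(float(np.abs(rec - img).max()) < 1e-4)                  # 重建误差应接近 0
```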

多模态(1篇)

【1】 A Structure Feature Algorithm for Multi-modal Forearm Registration 标题:一种多模态前臂配准的结构特征算法 链接:https://arxiv.org/abs/2111.05485

作者:Jiaxin Li,Yan Ding,Weizhong Zhang,Yifan Zhao,Lingxi Guo,Zhe Yang 机构:Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education, School of, Aerospace Engineering, Beijing Institute of Technology, Beijing , China, School of Aerospace, Transport and Manufacturing, Cranfield University, Wharley End 摘要:基于图像配准的增强现实技术因方便术前准备和医学教育而日益流行。本文主要研究前臂图像与数字解剖模型的配准问题。针对前臂多模态图像纹理特征的差异,提出了一种基于结构兼容多模态图像配准框架(FAM)的前臂特征表示曲线(FFRC)。 摘要:Augmented reality technology based on image registration is becoming increasingly popular for the convenience of pre-surgery preparation and medical education. This paper focuses on the registration of forearm images and digital anatomical models. Due to the difference in texture features of forearm multi-modal images, this paper proposes a forearm feature representation curve (FFRC) based on structure compliant multi-modal image registration framework (FAM) for the forearm.

3D|3D重建等相关(2篇)

【1】 3D modelling of survey scene from images enhanced with a multi-exposure fusion 标题:利用多曝光融合增强图像的测量场景三维建模 链接:https://arxiv.org/abs/2111.05541

作者:Kwok-Leung Chan,Liping Li,Arthur Wing-Tak Leung,Ho-Yin Chan 机构:Department of Electrical Engineering, City University of Hong Kong; Division of Building Science and Technology, City University of Hong Kong 摘要:在当前实践中,现场测量由工人使用全站仪进行。该方法精度较高,但如果需要连续监测,成本较高。基于摄影测量的技术配合相对便宜的数码相机已在许多领域得到广泛应用。除了点测量,摄影测量还可以创建场景的三维(3D)模型。精确的三维模型重建依赖于高质量的图像,退化的图像会导致重建的三维模型产生较大的误差。在本文中,我们提出了一种可用于提高图像可见度、并最终减小三维场景模型误差的方法。这一想法的灵感来自图像去雾。首先通过伽马校正和自适应直方图均衡化将每幅原始图像变换为多幅不同曝光的图像,并通过计算局部二值模式对变换后的图像进行分析。随后对图像进行增强:每个输出像素由变换图像的对应像素集合加权生成,权重是局部模式特征和图像饱和度的函数。我们在基准图像去雾数据集上进行了性能评估,并在室外和室内测量场景中开展了实验。分析表明,该方法适用于室外和室内图像中存在的不同类型的退化。当输入摄影测量软件时,增强图像可以重建出平均误差为亚毫米级的三维场景模型。 摘要:In current practice, scene survey is carried out by workers using total stations. The method has high accuracy, but it incurs high costs if continuous monitoring is needed. Techniques based on photogrammetry, with the relatively cheaper digital cameras, have gained wide applications in many fields. Besides point measurement, photogrammetry can also create a three-dimensional (3D) model of the scene. Accurate 3D model reconstruction depends on high quality images. Degraded images will result in large errors in the reconstructed 3D model. In this paper, we propose a method that can be used to improve the visibility of the images, and eventually reduce the errors of the 3D scene model. The idea is inspired by image dehazing. Each original image is first transformed into multiple exposure images by means of gamma-correction operations and adaptive histogram equalization. The transformed images are analyzed by the computation of the local binary patterns. The image is then enhanced, with each pixel generated from the set of transformed image pixels weighted by a function of the local pattern feature and image saturation. Performance evaluation has been performed on benchmark image dehazing datasets. Experimentations have been carried out on outdoor and indoor surveys. Our analysis finds that the method works on different types of degradation that exist in both outdoor and indoor images. When fed into the photogrammetry software, the enhanced images can reconstruct 3D scene models with sub-millimeter mean errors.
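摘要中的"伽马校正 + 自适应直方图均衡化生成多曝光序列,再计算局部二值模式(LBP)"可用 OpenCV 与 scikit-image 简单演示如下(参数均为示例取值,非论文原始实现):

```python
import numpy as np
import cv2
from skimage.feature import local_binary_pattern

def multi_exposure(gray, gammas=(0.5, 1.0, 2.0)):
    """由单张灰度图生成多曝光序列:不同 gamma 校正 + 自适应直方图均衡化(CLAHE)。"""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    outs = []
    for g in gammas:
        lut = np.array([(i / 255.0) ** g * 255 for i in range(256)]).astype(np.uint8)
        outs.append(clahe.apply(cv2.LUT(gray, lut)))
    return outs

gray = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
exposures = multi_exposure(gray)
lbp = local_binary_pattern(exposures[0], P=8, R=1, method='uniform')  # 局部二值模式特征
print(len(exposures), lbp.shape)
```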

【2】 Efficient Data Compression for 3D Sparse TPC via Bicephalous Convolutional Autoencoder 标题:基于双头卷积自动编码器的三维稀疏TPC数据压缩 链接:https://arxiv.org/abs/2111.05423

作者:Yi Huang,Yihui Ren,Shinjae Yoo,Jin Huang 备注:6 pages, 6 figures 摘要:大型实验设施中的实时数据收集和分析在高能物理、核物理和宇宙学等多个领域都是巨大挑战。为了解决这个问题,基于机器学习(ML)的实时数据压缩方法引起了极大关注。然而,与规模相对较小且取值连续的自然图像数据(如CIFAR和ImageNet)不同,科学数据通常以高采集速率产生三维数据体,具有高稀疏性(大量零值)和非高斯的取值分布。这使得直接应用流行的ML压缩方法以及传统数据压缩方法都不够理想。为了解决这些障碍,本工作引入了一种双头自动编码器来同时处理稀疏性和回归问题,称为Bicephalous Convolutional AutoEncoder(BCAE)。与传统的数据压缩方法(如MGARD、SZ和ZFP)相比,该方法在压缩保真度和压缩比方面都具有优势:要达到相似的保真度,传统方法中表现最好的方法也只能达到BCAE压缩比的一半。此外,对BCAE方法的全面消融研究表明,专用的分割解码器可以改善重建效果。 摘要:Real-time data collection and analysis in large experimental facilities present a great challenge across multiple domains, including high energy physics, nuclear physics, and cosmology. To address this, machine learning (ML)-based methods for real-time data compression have drawn significant attention. However, unlike natural image data, such as CIFAR and ImageNet that are relatively small-sized and continuous, scientific data often come in as three-dimensional data volumes at high rates with high sparsity (many zeros) and non-Gaussian value distribution. This makes direct application of popular ML compression methods, as well as conventional data compression methods, suboptimal. To address these obstacles, this work introduces a dual-head autoencoder to resolve sparsity and regression simultaneously, called Bicephalous Convolutional AutoEncoder (BCAE). This method shows advantages both in compression fidelity and ratio compared to traditional data compression methods, such as MGARD, SZ, and ZFP. To achieve similar fidelity, the best performer among the traditional methods can reach only half the compression ratio of BCAE. Moreover, a thorough ablation study of the BCAE method shows that a dedicated segmentation decoder improves the reconstruction.
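下面给出"双头自动编码器"思想的一个极简示意:一个解码头预测体素是否非零(分割),另一个回归其取值,二者相乘得到重建。网络结构与通道数均为假设,并非BCAE的官方实现:

```python
import torch
import torch.nn as nn

class DualHeadAE(nn.Module):
    """双头自动编码器示意(结构为假设):分割头处理稀疏性,回归头处理数值重建。"""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec_seg = nn.Sequential(nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
                                     nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1))
        self.dec_reg = nn.Sequential(nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
                                     nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1))

    def forward(self, x):
        z = self.enc(x)
        mask_logit = self.dec_seg(z)               # 稀疏性:预测体素是否非零
        value = self.dec_reg(z)                    # 回归:预测非零体素的取值
        return torch.sigmoid(mask_logit) * value   # 两头相乘得到重建结果

x = torch.randn(1, 1, 32, 32, 32)
print(DualHeadAE()(x).shape)  # torch.Size([1, 1, 32, 32, 32])
```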

其他神经网络|深度学习|模型|建模(5篇)

【1】 Structure from Silence: Learning Scene Structure from Ambient Sound 标题:无声中的结构:从环境声中学习场景结构 链接:https://arxiv.org/abs/2111.05846

作者:Ziyang Chen,Xixi Hu,Andrew Owens 机构:University of Michigan 备注:Accepted to CoRL 2021 (Oral Presentation) 摘要:从旋转的吊扇到滴答作响的时钟,当我们在场景中移动时,我们听到的声音会发生微妙的变化。我们询问这些环境声音是否传达了有关3D场景结构的信息,如果是,它们是否为多模态模型提供了有用的学习信号。为了研究这一点,我们收集了来自各种安静室内场景的成对音频和RGB-D录音数据集。然后,我们训练模型,估计到附近墙壁的距离,只提供音频作为输入。我们还利用这些录音通过自我监督学习多模态表征,通过训练网络将图像与其对应的声音相关联。这些结果表明,环境声音传递了大量有关场景结构的信息,是学习多模态特征的有用信号。 摘要:From whirling ceiling fans to ticking clocks, the sounds that we hear subtly vary as we move through a scene. We ask whether these ambient sounds convey information about 3D scene structure and, if so, whether they provide a useful learning signal for multimodal models. To study this, we collect a dataset of paired audio and RGB-D recordings from a variety of quiet indoor scenes. We then train models that estimate the distance to nearby walls, given only audio as input. We also use these recordings to learn multimodal representations through self-supervision, by training a network to associate images with their corresponding sounds. These results suggest that ambient sound conveys a surprising amount of information about scene structure, and that it is a useful signal for learning multimodal features.

【2】 Palette: Image-to-Image Diffusion Models 标题:调色板:图像到图像扩散模型 链接:https://arxiv.org/abs/2111.05826

作者:Chitwan Saharia,William Chan,Huiwen Chang,Chris A. Lee,Jonathan Ho,Tim Salimans,David J. Fleet,Mohammad Norouzi 机构:Google Research 摘要:我们介绍Palette(调色板),这是一个使用条件扩散模型进行图像到图像转换的简单而通用的框架。在四项具有挑战性的图像到图像转换任务(着色、修复、取消裁剪和JPEG解压缩)中,Palette的性能优于强GAN和回归基线,并创造了新的最优水平。这是在没有特定任务的超参数调整、架构定制或任何辅助损失的情况下完成的,显示了理想的通用性和灵活性。我们揭示了在去噪扩散目标中使用$L_2$与$L_1$损失对样本多样性的影响,并通过实证架构研究证明了自注意力的重要性。重要的是,我们提倡基于ImageNet的统一评估协议,并报告多个样本质量分数,包括FID、Inception分数、预训练ResNet-50的分类准确率,以及相对参考图像的感知距离,涵盖多个基线。我们期望这一标准化评估协议在推进图像到图像转换研究中发挥关键作用。最后,我们展示了在3项任务(着色、修复、JPEG解压缩)上训练的单一通用Palette模型,其性能与特定任务的专家模型相当或更好。 摘要:We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models. On four challenging image-to-image translation tasks (colorization, inpainting, uncropping, and JPEG decompression), Palette outperforms strong GAN and regression baselines, and establishes a new state of the art. This is accomplished without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss, demonstrating a desirable degree of generality and flexibility. We uncover the impact of using $L_2$ vs. $L_1$ loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention through empirical architecture studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, and report several sample quality scores including FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against reference images for various baselines. We expect this standardized evaluation protocol to play a critical role in advancing image-to-image translation research. Finally, we show that a single generalist Palette model trained on 3 tasks (colorization, inpainting, JPEG decompression) performs as well or better than task-specific specialist counterparts.
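条件扩散模型的训练目标通常是让网络在给定条件图像时回归前向过程加入的噪声;下面是该训练损失的一个通用示意(噪声调度、网络均为占位假设,并非Palette的官方实现):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, x0, cond, T=1000):
    """条件扩散模型的单步训练损失示意:预测加入的噪声并用 L2(或 L1)回归。"""
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.size(0),))
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise            # 前向加噪
    pred = denoiser(torch.cat([x_t, cond], dim=1), t)       # 以退化图像 cond 为条件
    return F.mse_loss(pred, noise)                          # L2;换成 l1_loss 即为 L1 目标

# 用法示意:denoiser 可以是任意 (B,2C,H,W), t -> (B,C,H,W) 的网络
denoiser = lambda x, t: x[:, :3]                            # 占位网络,仅保证形状正确
loss = diffusion_loss(denoiser, torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32))
print(loss.item())
```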

【3】 Understanding the Generalization Benefit of Model Invariance from a Data Perspective 标题:从数据角度理解模型不变性的泛化效益 链接:https://arxiv.org/abs/2111.05529

作者:Sicheng Zhu,Bang An,Furong Huang 机构:Department of Computer Science, University of Maryland, College Park 备注:Accepted to NeurIPS 2021 摘要:机器学习模型在某些类型的数据转换下具有不变性,在实践中表现出更好的泛化能力。然而,对不变性为什么有利于推广的原则性理解是有限的。给定一个数据集,通常没有原则性的方法来选择“合适的”数据转换,在这种转换下,模型不变性可以保证更好的泛化。本文通过引入由变换引起的样本覆盖,即数据集的一个代表子集,可以通过变换近似地恢复整个数据集,研究了模型不变性的泛化优势。对于任何数据转换,我们基于样本覆盖率为不变模型提供精确的泛化边界。我们还通过转换所诱导的样本覆盖数,即其诱导样本覆盖的最小大小,来描述一组数据转换的“适用性”。我们表明,对于样本覆盖数较小的“合适”变换,我们可以收紧推广边界。此外,我们提出的样本覆盖数可以进行经验评估,从而为选择变换以发展模型不变性以更好地推广提供了指导。在多个数据集上的实验中,我们评估了一些常用变换的样本覆盖数,并表明一组变换(例如,3D视图变换)的较小样本覆盖数表明不变模型的测试和训练误差之间的差距较小,这验证了我们的命题。 摘要:Machine learning models that are developed to be invariant under certain types of data transformations have shown improved generalization in practice. However, a principled understanding of why invariance benefits generalization is limited. Given a dataset, there is often no principled way to select "suitable" data transformations under which model invariance guarantees better generalization. This paper studies the generalization benefit of model invariance by introducing the sample cover induced by transformations, i.e., a representative subset of a dataset that can approximately recover the whole dataset using transformations. For any data transformations, we provide refined generalization bounds for invariant models based on the sample cover. We also characterize the "suitability" of a set of data transformations by the sample covering number induced by transformations, i.e., the smallest size of its induced sample covers. We show that we may tighten the generalization bounds for "suitable" transformations that have a small sample covering number. In addition, our proposed sample covering number can be empirically evaluated and thus provides a guide for selecting transformations to develop model invariance for better generalization. In experiments on multiple datasets, we evaluate sample covering numbers for some commonly used transformations and show that the smaller sample covering number for a set of transformations (e.g., the 3D-view transformation) indicates a smaller gap between the test and training error for invariant models, which verifies our propositions.

【4】 Analysis of PDE-based binarization model for degraded document images 标题:基于偏微分方程的退化文档图像二值化模型分析 链接:https://arxiv.org/abs/2111.05471

作者:Uche A. Nnolim 机构:Department of Electronic Engineering, University of Nigeria, Nsukka, Enugu, Nigeria 备注:11 pages, 6 figures 摘要:本报告介绍了一种基于PDE的退化文档图像二值化模型的结果。该模型在其公式中使用了边缘项和二值源项。结果表明,该方法对存在墨迹渗透和文本褪色的文档图像有效,对污渍的处理效果则相对有限。 摘要:This report presents the results of a PDE-based binarization model for degraded document images. The model utilizes an edge and binary source term in its formulation. Results indicate effectiveness for document images with bleed-through and faded text and stains to a lesser extent.

【5】 Evaluation of Deep Learning Topcoders Method for Neuron Individualization in Histological Macaque Brain Section 标题:组织学猕猴脑切片神经元个体化深度学习Topcoders方法的评价 链接:https://arxiv.org/abs/2111.05789

作者:Huaqian Wu,Nicolas Souedet,Zhenzhen You,Caroline Jan,Cédric Clouchoux,Thierry Delzescaux 摘要:细胞个体化在数字病理图像分析中起着至关重要的作用。深度学习被认为是一个有效的工具,例如分割任务,包括细胞个性化。然而,深度学习模型的精度依赖于大量无偏数据集和手动像素级注释,这是劳动密集型的。此外,大多数深度学习的应用都是为了处理肿瘤数据而开发的。为了克服这些挑战,我)我们建立了一个管道,只提供点注释来合成像素级标签;ii)我们测试了一种集成深度学习算法,对神经数据执行细胞个体化。结果表明,该方法在目标层和像素层均能成功分割神经元细胞,平均检测准确率为0.93。 摘要:Cell individualization has a vital role in digital pathology image analysis. Deep Learning is considered as an efficient tool for instance segmentation tasks, including cell individualization. However, the precision of the Deep Learning model relies on massive unbiased dataset and manual pixel-level annotations, which is labor intensive. Moreover, most applications of Deep Learning have been developed for processing oncological data. To overcome these challenges, i) we established a pipeline to synthesize pixel-level labels with only point annotations provided; ii) we tested an ensemble Deep Learning algorithm to perform cell individualization on neurological data. Results suggest that the proposed method successfully segments neuronal cells in both object-level and pixel-level, with an average detection accuracy of 0.93.

其他(9篇)

【1】 Advances in Neural Rendering 标题:神经渲染研究进展 链接:https://arxiv.org/abs/2111.05849

作者:Ayush Tewari,Justus Thies,Ben Mildenhall,Pratul Srinivasan,Edgar Tretschk,Yifan Wang,Christoph Lassner,Vincent Sitzmann,Ricardo Martin-Brualla,Stephen Lombardi,Tomas Simon,Christian Theobalt,Matthias Niessner,Jonathan T. Barron,Gordon Wetzstein,Michael Zollhoefer,Vladislav Golyanik 机构:Golyanik 1 1MPI for Informatics 2MPI for Intelligent Systems 3Google Research 4ETH Zurich 5Reality Labs Research6MIT 7Technical University of Munich 8Stanford University ⋆Equal contribution 备注:29 pages, 14 figures, 5 tables 摘要:合成真实照片图像和视频是计算机图形学的核心,也是数十年来研究的焦点。传统上,场景的合成图像是使用栅格化或光线跟踪等渲染算法生成的,这些算法将特定定义的几何体和材质属性表示作为输入。这些输入共同定义了实际场景和渲染内容,称为场景表示(场景由一个或多个对象组成)。示例场景表示为带有伴随纹理的三角形网格(例如,由艺术家创建)、点云(例如,来自深度传感器)、体积网格(例如,来自CT扫描)或隐式曲面函数(例如,截断的有符号距离场)。使用可微分的渲染损失从观测值重建这样的场景表示称为逆图形或逆渲染。神经渲染是密切相关的,它结合了经典计算机图形学和机器学习的思想,创建了从真实世界观察合成图像的算法。神经渲染是朝着合成照片真实感图像和视频内容目标的一次飞跃。近年来,我们通过数百份出版物看到了这一领域的巨大进步,这些出版物展示了将可学习组件注入渲染管道的不同方法。这篇关于神经渲染进展的最新报告重点介绍了将经典渲染原理与学习的3D场景表示(现在通常称为神经场景表示)相结合的方法。这些方法的一个关键优势是,它们在设计上是3D一致的,支持诸如捕获场景的新视点合成等应用。除了处理静态场景的方法外,我们还介绍了用于建模非刚性变形对象的神经场景表示。。。 摘要:Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take specifically defined representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanied textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...
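下面给出神经场景表示中最常用的体渲染合成公式的一个最小 NumPy 示意,即沿射线把密度与颜色合成为像素颜色,用来说明这类方法如何由 3D 表示生成图像;射线采样与神经网络部分从略,数值仅为示例。
```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """体渲染合成公式的离散近似:由一条射线上各采样点的密度与颜色合成像素颜色。
    sigmas: (N,) 密度; colors: (N, 3) 颜色; deltas: (N,) 相邻采样点间距。"""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # 每段的不透明度
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))   # 累积透射率
    weights = alphas * trans                                         # 每个采样点的贡献权重
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# 用法示例:一条射线上 64 个采样点
sig = np.linspace(0.0, 3.0, 64)
col = np.tile([[0.8, 0.5, 0.3]], (64, 1))
pixel, w = volume_render(sig, col, np.full(64, 0.02))
```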

【2】 PIMIP: An Open Source Platform for Pathology Information Management and Integration 标题:PIMIP:一个开放源码的病理信息管理与集成平台 链接:https://arxiv.org/abs/2111.05794

作者:Jialun Wu,Anyu Mao,Xinrui Bao,Haichuan Zhang,Zeyu Gao,Chunbao Wang,Tieliang Gong,Chen Li 机构:School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China, National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an, China, School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, China 备注:BIBM 2021 accepted, including 8 pages, 8 figures 摘要:数字病理学在医学领域人工智能的发展中起着至关重要的作用。数字化病理平台可以使病理资源数字化、网络化,实现视觉数据的永久存储和不受时间和空间限制的同步浏览处理。它已广泛应用于病理学的各个领域。然而,仍然缺乏一个开放和通用的数字病理学平台来帮助医生管理和分析数字病理切片,以及管理和结构化描述相关患者信息。大多数平台无法集成图像查看、注释和分析以及文本信息管理。为了解决上述问题,我们提出了一个全面的、可扩展的平台PIMIP。我们的PIMIP开发了基于数字病理切片可视化的图像注释功能。我们的标注功能支持多用户协同标注和多设备标注,实现了部分标注任务的自动化。在注释任务中,我们邀请了一位专业病理学家进行指导。我们介绍了一个用于图像分析的机器学习模块。我们收集的数据包括当地医院的公共数据和临床实例。我们的平台更适合临床使用。除了图像数据,我们还构建了文本信息的管理和显示。所以我们的平台是全面的。平台框架以模块化方式构建,支持用户独立添加机器学习模块,使平台具有可扩展性。 摘要:Digital pathology plays a crucial role in the development of artificial intelligence in the medical field. The digital pathology platform can make the pathological resources digital and networked, and realize the permanent storage of visual data and the synchronous browsing processing without the limitation of time and space. It has been widely used in various fields of pathology. However, there is still a lack of an open and universal digital pathology platform to assist doctors in the management and analysis of digital pathological sections, as well as the management and structured description of relevant patient information. Most platforms cannot integrate image viewing, annotation and analysis, and text information management. To solve the above problems, we propose a comprehensive and extensible platform PIMIP. Our PIMIP has developed the image annotation functions based on the visualization of digital pathological sections. Our annotation functions support multi-user collaborative annotation and multi-device annotation, and realize the automation of some annotation tasks. In the annotation task, we invited a professional pathologist for guidance. We introduce a machine learning module for image analysis. The data we collected included public data from local hospitals and clinical examples. Our platform is more clinical and suitable for clinical use. In addition to image data, we also structured the management and display of text information. So our platform is comprehensive. The platform framework is built in a modular way to support users to add machine learning modules independently, which makes our platform extensible.
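平台的接口细节并未公开,下面仅用一个注册器草图示意“让用户独立添加机器学习模块”这类模块化设计的常见实现方式;其中的模块名与返回值均为假设,并非 PIMIP 的实际接口。
```python
# 以注册器方式组织可插拔的图像分析模块(仅为常见做法的示意,非 PIMIP 实际接口)
ANALYSIS_MODULES = {}

def register_module(name):
    """装饰器:把分析模块类登记到全局注册表,平台按名称调度。"""
    def wrap(cls):
        ANALYSIS_MODULES[name] = cls
        return cls
    return wrap

@register_module("nuclei_segmentation")
class NucleiSegmentation:
    def run(self, image):
        # 这里应调用实际的分割模型;此处仅返回占位结果
        return {"num_cells": 0}

def analyze(name, image):
    """平台侧统一入口:按模块名实例化并运行对应的分析模块。"""
    return ANALYSIS_MODULES[name]().run(image)
```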

【3】 Theoretical and empirical analysis of a fast algorithm for extracting polygons from signed distance bounds 标题:基于符号距离界的多边形快速提取算法的理论与实验分析 链接:https://arxiv.org/abs/2111.05778

作者:Nenad Markuš 摘要:我们研究了一种将有符号距离界转化为多边形网格的渐近快速方法。这是通过结合球体追踪(也称为光线行进)与一种传统的多边形化方案(例如移动立方体 Marching Cubes)实现的。我们将这种方法称为 Gridhopping。我们提供了理论和实验证据,证明对于具有 $N^3$ 个单元的多边形化网格,其计算复杂度为 $O(N^2 \log N)$。该算法在一组基本几何形状以及由点云通过机器学习生成的有符号距离场上进行了测试。考虑到它的速度、简单性和可移植性,我们认为它在建模阶段以及面向存储的形状压缩中都会很有用。代码可在以下位置获得:https://github.com/nenadmarkus/gridhopping 摘要:We investigate an asymptotically fast method for transforming signed distance bounds into polygon meshes. This is achieved by combining sphere tracing (also known as ray marching) and one of the traditional polygonization schemes (e.g., Marching cubes). Let us call this approach Gridhopping. We provide theoretical and experimental evidence that it is of the $O(N^2 \log N)$ computational complexity for a polygonization grid with $N^3$ cells. The algorithm is tested on both a set of primitive shapes as well as signed distance fields generated from point clouds by machine learning. Given its speed, simplicity and portability, we argue that it could prove useful during the modelling stage as well as in shape compression for storage. The code is available here: https://github.com/nenadmarkus/gridhopping
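下面是一个示意性的 Python 草图,说明 Gridhopping 的核心想法:利用有符号距离界做球体追踪,跳过远离表面的区域,只保留可能含有表面的体素,随后再交给 Marching Cubes 处理。步长与终止条件为简化假设,并非官方实现(官方代码见摘要中的链接)。
```python
import numpy as np

def find_surface_cells(sdf, N, lo=-1.0, hi=1.0):
    """沿每条与 z 轴平行的射线做球体追踪(ray marching),利用有符号距离界
    安全地跳过不含表面的区域,只记录可能与表面相交的体素(示意实现)。"""
    h = (hi - lo) / N                          # 体素边长
    cells = set()
    xs = lo + (np.arange(N) + 0.5) * h
    for i, x in enumerate(xs):
        for j, y in enumerate(xs):
            z = lo
            while z < hi:
                d = sdf(np.array([x, y, z]))
                if abs(d) < h:                 # 距表面不足一个体素:记录该体素
                    k = min(int((z - lo) / h), N - 1)
                    cells.add((i, j, k))
                    z += h                     # 前进一个体素,继续寻找后续交点
                else:
                    z += max(abs(d) - h, h)    # |d| 以内必然不含表面,安全跳过
    return cells

# 用法示例:半径 0.5 的球的 SDF
sphere = lambda p: np.linalg.norm(p) - 0.5
surface_cells = find_surface_cells(sphere, N=32)
```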

【4】 Robust reconstructions by multi-scale/irregular tangential covering 标题:基于多尺度/不规则切线覆盖的鲁棒重建 链接:https://arxiv.org/abs/2111.05688

作者:Antoine Vacavant,Bertrand Kerautret,Fabien Feschet 机构:Université Clermont Auvergne, CNRS, SIGMA Clermont, Institut Pascal, F-, Clermont-Ferrand, France, Univ Lyon, Lyon , LIRIS, F-, Lyon, France, Université Clermont Auvergne, CNRS, ENSMSE, LIMOS 摘要:在本文中,我们提出了一种新颖的方式来使用切线覆盖算法 minDSS,对含噪的数字轮廓进行几何重建。为此,我们利用了此前工作中引入的、以最大基元表示图形对象的方法。通过计算轮廓的多尺度、不规则等轴(isothetic)表示,我们得到了一维(1-D)区间,并在此基础上将其分解为最大直线段或圆弧。通过让 minDSS 适应这些支撑最大基元的稀疏、不规则一维区间数据,我们现在能够把输入的含噪对象重建为由直线或圆弧构成、且基元数量最少的闭合轮廓。在这项工作中,我们介绍了新的完整流水线,并在合成图像和真实图像数据上给出了实验评估。结合多尺度噪声评估过程,并与从现有最新方法中选出的参考方法比较,我们还表明该方法是稳健的。 摘要:In this paper, we propose an original manner to employ a tangential cover algorithm - minDSS - in order to geometrically reconstruct noisy digital contours. To do so, we exploit the representation of graphical objects by maximal primitives we have introduced in previous works. By calculating multi-scale and irregular isothetic representations of the contour, we obtained 1-D (one-dimensional) intervals, and achieved afterwards a decomposition into maximal line segments or circular arcs. By adapting minDSS to this sparse and irregular data of 1-D intervals supporting the maximal primitives, we are now able to reconstruct the input noisy objects into cyclic contours made of lines or arcs with a minimal number of primitives. In this work, we explain our novel complete pipeline, and present its experimental evaluation by considering both synthetic and real image data. We also show that this is a robust approach, with respect to selected references from state-of-the-art, and by considering a multi-scale noise evaluation process.
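下面用一个简化的 Python 草图示意“把一串一维区间贪心分解为尽量少的直线段”这一步;其中用最小二乘直线加区间包含检查来近似判断“是否存在穿过所有区间的直线”,与 minDSS 的精确判定并不等价,仅作说明。
```python
import numpy as np

def greedy_segment_decomposition(xs, lows, highs):
    """把一串一维区间 [lows_i, highs_i](横坐标为 xs_i)贪心分解为尽量少的直线段:
    只要还存在一条能穿过当前所有区间的直线,就继续向后扩展,否则另起一段。
    返回每段的 (起始下标, 结束下标)。近似示意,非 minDSS 原算法。"""
    xs, lows, highs = (np.asarray(v, dtype=float) for v in (xs, lows, highs))
    segments, start, n = [], 0, len(xs)
    while start < n:
        end = start + 1
        while end < n:
            sl = slice(start, end + 1)
            mids = 0.5 * (lows[sl] + highs[sl])
            a, b = np.polyfit(xs[sl], mids, 1)            # 拟合直线 y = a*x + b
            y = a * xs[sl] + b
            if np.all((y >= lows[sl]) & (y <= highs[sl])):
                end += 1                                   # 直线仍落在所有区间内,继续扩展
            else:
                break
        segments.append((start, end - 1))
        start = end
    return segments
```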

【5】 The Impact of Changes in Resolution on the Persistent Homology of Images 标题:分辨率变化对图像持续同调的影响 链接:https://arxiv.org/abs/2111.05663

作者:Teresa Heiss,Sarah Tymochko,Brittany Story,Adélie Garin,Hoa Bui,Bea Bleile,Vanessa Robins 机构:Institute of Science and Technology (IST) Austria, Klosterneuburg, Austria, Michigan State University, Department of Computational Mathematics, Science and Engineering, East Lansing, MI, USA 备注:accepted for the IEEE Big Data 2021 workshop: Applications of Topological Data Analysis to 'Big Data' 摘要:数字图像可以在微观和宏观尺度上对材料特性进行定量分析,但在获取图像时选择合适的分辨率是一项挑战。高分辨率意味着对给定样本需要更长的图像采集时间和更大的数据量,但如果分辨率过低,又可能丢失重要信息。本文研究了分辨率变化对持续同调的影响;持续同调是拓扑数据分析中的一种工具,能在所有长度尺度上刻画图像的结构特征。在给定某一分辨率下关于函数、物体几何或其密度分布的先验信息时,我们给出了选择最粗分辨率的方法,使结果仍落在可接受的误差容限内。我们为一个说明性的合成示例以及理论界未知的多孔材料样品提供了数值案例研究。 摘要:Digital images enable quantitative analysis of material properties at micro and macro length scales, but choosing an appropriate resolution when acquiring the image is challenging. A high resolution means longer image acquisition and larger data requirements for a given sample, but if the resolution is too low, significant information may be lost. This paper studies the impact of changes in resolution on persistent homology, a tool from topological data analysis that provides a signature of structure in an image across all length scales. Given prior information about a function, the geometry of an object, or its density distribution at a given resolution, we provide methods to select the coarsest resolution yielding results within an acceptable tolerance. We present numerical case studies for an illustrative synthetic example and samples from porous materials where the theoretical bounds are unknown.
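下面给出一个示意性草图,展示如何比较不同分辨率下图像的持续同调:这里假设使用 GUDHI 库的立方复形,用简单下采样模拟分辨率降低,并用 bottleneck 距离度量持续图的差异;数据与参数均为示例,并非论文的实验设置。
```python
import numpy as np
import gudhi  # 假设已安装 GUDHI(pip install gudhi)

def image_persistence(img, dim=1):
    """把灰度图视为立方复形上的下水平集滤波,返回指定维数的持续区间。"""
    cc = gudhi.CubicalComplex(top_dimensional_cells=img)
    cc.persistence()                        # 计算持续同调
    return cc.persistence_intervals_in_dimension(dim)

# 示意:用简单下采样模拟降低分辨率,再用 bottleneck 距离衡量持续图的变化
img = np.random.rand(128, 128)              # 这里用随机图代替真实材料图像
diag_full = image_persistence(img)
diag_half = image_persistence(img[::2, ::2])
print("bottleneck 距离:", gudhi.bottleneck_distance(diag_full, diag_half))
```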

【6】 FabricFlowNet: Bimanual Cloth Manipulation with a Flow-based Policy 标题:FabricFlowNet:基于流策略的双臂布料操纵 链接:https://arxiv.org/abs/2111.05623

作者:Thomas Weng,Sujay Bajracharya,Yufei Wang,Khush Agrawal,David Held 机构:Robotics Institute, Carnegie Mellon University, USA 备注:CoRL 2021 摘要:我们研究目标导向的布料操纵问题;由于布料的可变形性,这是一项具有挑战性的任务。我们的见解是:光流这种通常用于视频运动估计的技术,也可以有效表示观测图像与目标图像之间相对应的布料位姿。我们提出了FabricFlowNet(FFN),这是一种布料操纵策略,它同时把光流用作输入和动作表示来提高性能。FabricFlowNet还能根据给定目标在双臂和单臂动作之间灵活切换。我们表明,FabricFlowNet的性能显著优于以图像为输入的最先进的无模型和基于模型的布料操纵策略。我们还在一个双臂系统上进行了真实环境实验,展示了有效的模拟到真实(sim-to-real)迁移。最后,我们证明了仅在单块方形布料上训练的方法可以推广到其他形状的布料,如T恤和长方形布料。视频和其他补充材料可从以下网址获得:https://sites.google.com/view/fabricflownet. 摘要:We address the problem of goal-directed cloth manipulation, a challenging task due to the deformability of cloth. Our insight is that optical flow, a technique normally used for motion estimation in video, can also provide an effective representation for corresponding cloth poses across observation and goal images. We introduce FabricFlowNet (FFN), a cloth manipulation policy that leverages flow as both an input and as an action representation to improve performance. FabricFlowNet also elegantly switches between bimanual and single-arm actions based on the desired goal. We show that FabricFlowNet significantly outperforms state-of-the-art model-free and model-based cloth manipulation policies that take image input. We also present real-world experiments on a bimanual system, demonstrating effective sim-to-real transfer. Finally, we show that our method generalizes when trained on a single square cloth to other cloth shapes, such as T-shirts and rectangular cloths. Video and other supplementary materials are available at: https://sites.google.com/view/fabricflownet.
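下面是一个按论文思想写的简化 Python 草图,说明“把光流当作动作表示”的一种做法:在流幅值最大处选抓取点,沿流向量确定放置点,并用第二个候选点的流幅值决定是否切换为双臂动作。阈值、选点策略与光流的存储约定均为假设,并非论文网络的实际输出处理。
```python
import numpy as np

def flow_to_pick_place(flow, min_sep=20, bimanual_thresh=5.0):
    """由"当前观测 -> 目标"的光流生成抓取/放置动作(简化示意)。
    假设 flow 形状为 (H, W, 2),flow[..., 0] 为 x 方向位移、flow[..., 1] 为 y 方向位移。"""
    mag = np.linalg.norm(flow, axis=-1)                       # (H, W) 流幅值
    p1 = np.unravel_index(mag.argmax(), mag.shape)            # 第一个抓取点 (y, x)
    place1 = (p1[0] + flow[p1][1], p1[1] + flow[p1][0])
    yy, xx = np.mgrid[0:mag.shape[0], 0:mag.shape[1]]
    far = (yy - p1[0]) ** 2 + (xx - p1[1]) ** 2 > min_sep ** 2
    masked = np.where(far, mag, 0.0)                          # 屏蔽第一个抓取点附近区域
    p2 = np.unravel_index(masked.argmax(), mag.shape)
    if masked[p2] > bimanual_thresh:                          # 第二个候选点也需要移动:双臂动作
        place2 = (p2[0] + flow[p2][1], p2[1] + flow[p2][0])
        return [(p1, place1), (p2, place2)]
    return [(p1, place1)]                                     # 否则退化为单臂动作
```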

【7】 Leveraging Geometry for Shape Estimation from a Single RGB Image 标题:利用几何信息从单个RGB图像进行形状估计 链接:https://arxiv.org/abs/2111.05615

作者:Florian Langer,Ignas Budvytis,Roberto Cipolla 机构:Department of Engineering, University of Cambridge, Cambridge, UK 摘要:从单个RGB图像预测静态物体的三维形状和姿态是现代计算机视觉的一个重要研究领域,其应用范围涵盖增强现实、机器人技术和数字内容创作。通常,此任务通过直接预测物体形状和姿态来完成,但这种做法并不准确。一个有前景的研究方向是从大规模数据库中检索CAD模型并将其与图像中观察到的物体对齐,从而保证形状预测是有意义的。然而,现有工作没有考虑物体几何,导致物体姿态预测不准确,尤其是对于未见过的物体。在这项工作中,我们展示了从RGB图像到渲染CAD模型的跨域关键点匹配,能够比直接预测得到更精确的物体姿态。我们进一步证明,关键点匹配不仅可以用来估计物体姿态,还可以用来修改物体本身的形状。这一点很重要,因为仅靠物体检索所能达到的精度本质上受限于可用的CAD模型,而允许形状调整可以缩小检索到的CAD模型与观察到的形状之间的差距。我们在具有挑战性的Pix3D数据集上验证了我们的方法。所提出的几何形状预测将AP mesh从最先进的33.2提高到37.8(可见物体),从8.2提高到17.1(未见物体)。此外,当采用所提出的形状自适应时,即使没有紧密匹配的CAD模型,我们也能获得更精确的形状预测。代码公开于 https://github.com/florianlanger/leveraging_geometry_for_shape_estimation 。 摘要:Predicting 3D shapes and poses of static objects from a single RGB image is an important research area in modern computer vision. Its applications range from augmented reality to robotics and digital content creation. Typically this task is performed through direct object shape and pose predictions which is inaccurate. A promising research direction ensures meaningful shape predictions by retrieving CAD models from large scale databases and aligning them to the objects observed in the image. However, existing work does not take the object geometry into account, leading to inaccurate object pose predictions, especially for unseen objects. In this work we demonstrate how cross-domain keypoint matches from an RGB image to a rendered CAD model allow for more precise object pose predictions compared to ones obtained through direct predictions. We further show that keypoint matches can not only be used to estimate the pose of an object, but also to modify the shape of the object itself. This is important as the accuracy that can be achieved with object retrieval alone is inherently limited to the available CAD models. Allowing shape adaptation bridges the gap between the retrieved CAD model and the observed shape. We demonstrate our approach on the challenging Pix3D dataset. The proposed geometric shape prediction improves the AP mesh over the state-of-the-art from 33.2 to 37.8 on seen objects and from 8.2 to 17.1 on unseen objects. Furthermore, we demonstrate more accurate shape predictions without closely matching CAD models when following the proposed shape adaptation. Code is publicly available at https://github.com/florianlanger/leveraging_geometry_for_shape_estimation .
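下面用 OpenCV 的 PnP + RANSAC 给出“由 2D-3D 关键点匹配恢复物体位姿”这一通用步骤的示意草图;这只是该类方法的标准组件,并不代表论文的完整方法(论文还利用匹配进一步调整形状),函数名与阈值均为示例。
```python
import numpy as np
import cv2  # 假设使用 OpenCV

def pose_from_keypoint_matches(pts_3d, pts_2d, K):
    """由 CAD 模型上的 3D 关键点与图像 2D 关键点的匹配,用 PnP + RANSAC 恢复物体位姿。
    pts_3d: (N, 3) CAD 坐标系下的关键点; pts_2d: (N, 2) 图像像素坐标; K: 3x3 相机内参。"""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts_3d, dtype=np.float64),
        np.asarray(pts_2d, dtype=np.float64),
        K, None, reprojectionError=4.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)              # 旋转向量 -> 旋转矩阵
    return R, tvec, inliers
```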

【8】 TomoSLAM: factor graph optimization for rotation angle refinement in microtomography 标题:TomoSLAM:用于微层析成像中旋转角细化的因子图优化 链接:https://arxiv.org/abs/2111.05562

作者:Mark Griguletskii,Mikhail Chekanov,Oleg Shipitko 机构:Kibalov, Institute for Information Transmission Problems – IITP RAS, Bol’shoy Karetnyy Pereulok , Moscow, Russia, ., Evocargo, Moscow, Russia, Smart Engines Service LLC, Moscow, Russia 摘要:在计算机断层扫描(CT)中,样品、探测器和信号源的相对轨迹传统上被认为是已知的,因为它们由仪器部件有意的预编程运动产生。然而,由于机械间隙(backlash)、旋转传感器测量误差、热变形等原因,真实轨迹与期望轨迹并不相同,这会对层析重建的结果质量产生负面影响。设备的校准和预先调整都不能完全消除轨迹的不精确性,反而会显著增加仪器的维护成本。解决此问题的许多方法都基于在重建过程中,对每个投影(每个时间步)的射线源和探测器相对于样品的位置估计进行自动细化。在机器人学中(特别是移动机器人和自动驾驶车辆领域),从不同角度观察物体的不同图像并同时优化位置的类似问题是众所周知的,称为同步定位与建图(SLAM)。这项工作的科学新颖之处在于,将微层析成像中的轨迹细化问题视为一个SLAM问题。具体做法是:从X射线投影中提取加速鲁棒特征(SURF),用随机抽样一致性(RANSAC)过滤匹配,计算投影之间的角度,并在因子图中将其与步进电机控制信号结合,用以细化旋转角度。 摘要:In computed tomography (CT), the relative trajectories of a sample, a detector, and a signal source are traditionally considered to be known, since they are caused by the intentional preprogrammed movement of the instrument parts. However, due to the mechanical backlashes, rotation sensor measurement errors, thermal deformations real trajectory differs from desired ones. This negatively affects the resulting quality of tomographic reconstruction. Neither the calibration nor preliminary adjustments of the device completely eliminates the inaccuracy of the trajectory but significantly increase the cost of instrument maintenance. A number of approaches to this problem are based on an automatic refinement of the source and sensor position estimate relative to the sample for each projection (at each time step) during the reconstruction process. A similar problem of position refinement while observing different images of an object from different angles is well known in robotics (particularly, in mobile robots and self-driving vehicles) and is called Simultaneous Localization And Mapping (SLAM). The scientific novelty of this work is to consider the problem of trajectory refinement in microtomography as a SLAM problem. This is achieved by extracting Speeded Up Robust Features (SURF) features from X-ray projections, filtering matches with Random Sample Consensus (RANSAC), calculating angles between projections, and using them in factor graph in combination with stepper motor control signals in order to refine rotation angles.
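下面的 Python 草图示意该流程的前半部分:在相邻投影间提取 SURF 特征、做比值测试匹配,并用 RANSAC 剔除外点;由匹配恢复旋转角并放入因子图优化的部分此处省略。SURF 属于 opencv-contrib 的专利算法,若不可用可换成 ORB/SIFT;各参数为常见默认值,仅作说明。
```python
import numpy as np
import cv2  # SURF 需要 opencv-contrib 且启用 nonfree;不可用时可换成 ORB/SIFT

def match_projections(proj_a, proj_b):
    """相邻两幅 X 射线投影之间的特征匹配示意:SURF 特征 + 比值测试 + RANSAC 剔除外点。
    返回过滤后的对应点;由对应点估计旋转角并构建因子图的部分从略。"""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    k1, d1 = surf.detectAndCompute(proj_a, None)
    k2, d2 = surf.detectAndCompute(proj_b, None)
    knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d1, d2, k=2)
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
    src = np.float32([k1[m.queryIdx].pt for m in good])
    dst = np.float32([k2[m.trainIdx].pt for m in good])
    # 用 RANSAC 估计基础矩阵并剔除误匹配(仅作演示,几何模型可按需更换)
    _, mask = cv2.findFundamentalMat(src, dst, cv2.FM_RANSAC, 3.0, 0.99)
    keep = mask.ravel() == 1
    return src[keep], dst[keep]
```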

【9】 ICDAR 2021 Competition on Document Visual Question Answering 标题:ICDAR 2021文档视觉问答竞赛 链接:https://arxiv.org/abs/2111.05547

作者:Rubèn Tito,Minesh Mathew,C. V. Jawahar,Ernest Valveny,Dimosthenis Karatzas 机构:Computer Vision Center, UAB, Spain, CVIT, IIIT Hyderabad, India 摘要:在本报告中,我们介绍了ICDAR 2021届文档视觉问答挑战赛的结果。本届比赛在之前的单文档VQA和文档集合VQA任务之外,新引入了信息图(Infographics)VQA任务。信息图VQA基于一个包含5000多张信息图图像和30000个问答对的新数据集。获胜方法在信息图VQA任务中的ANLS得分为0.6120,在文档集合VQA任务中的ANLSL得分为0.7743,在单文档VQA任务中的ANLS得分为0.8705。我们总结了每项任务所使用的数据集,描述了每种提交的方法,并给出其结果与性能分析。本文还总结了自DocVQA 2020挑战赛第一届以来在单文档VQA上取得的进展。 摘要:In this report we present results of the ICDAR 2021 edition of the Document Visual Question Challenges. This edition complements the previous tasks on Single Document VQA and Document Collection VQA with a newly introduced on Infographics VQA. Infographics VQA is based on a new dataset of more than 5,000 infographics images and 30,000 question-answer pairs. The winner methods have scored 0.6120 ANLS in Infographics VQA task, 0.7743 ANLSL in Document Collection VQA task and 0.8705 ANLS in Single Document VQA. We present a summary of the datasets used for each task, description of each of the submitted methods and the results and analysis of their performance. A summary of the progress made on Single Document VQA since the first edition of the DocVQA 2020 challenge is also presented.
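比赛使用的 ANLS(平均归一化 Levenshtein 相似度)指标定义明确,下面给出一个示意性实现,便于理解上述分数的含义;阈值取常用的 0.5,边界细节可能与官方评测脚本略有差异。
```python
def levenshtein(a, b):
    """经典的单行动态规划编辑距离。"""
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i                      # prev 保存 d[i-1][j-1]
        for j in range(1, n + 1):
            cur = min(row[j] + 1,                     # 删除
                      row[j - 1] + 1,                 # 插入
                      prev + (a[i - 1] != b[j - 1]))  # 替换
            prev, row[j] = row[j], cur
    return row[n]

def nls(pred, gt):
    """归一化 Levenshtein 相似度:1 - 编辑距离 / max(两串长度)。"""
    a, b = pred.strip().lower(), gt.strip().lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def anls(predictions, answers, tau=0.5):
    """ANLS:每个问题取与所有参考答案中最小的归一化编辑距离 nl;
    nl >= tau 时该题计 0 分,否则计 1 - nl,最后对所有问题取平均。"""
    total = 0.0
    for pred, gts in zip(predictions, answers):
        nl = min(1.0 - nls(pred, g) for g in gts)
        total += (1.0 - nl) if nl < tau else 0.0
    return total / len(predictions)

# 用法示例
print(anls(["barack obama"], [["Barack Obama", "Obama"]]))   # 输出 1.0
```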
