cs.CV 方向,今日共计97篇
Transformer(2篇)
【1】 Spatial-Temporal Transformer for Dynamic Scene Graph Generation 标题:用于动态场景图生成的时空转换器
作者:Yuren Cong,Wentong Liao,Hanno Ackermann,Michael Ying Yang,Bodo Rosenhahn 机构:Institute of Information Processing, Leibniz University Hannover, Germany, Scene Understanding Group, University of Twente, The Netherlands 备注:accepted by ICCV2021 链接:https://arxiv.org/abs/2107.12309 摘要:动态场景图生成的目的是生成给定视频的场景图。与由图像生成场景图的任务相比,它具有更大的挑战性,因为对象之间的动态关系和帧之间的时间依赖性允许更丰富的语义解释。在本文中,我们提出了时空转换器(STTran),一种由两个核心模块组成的神经网络:(1)一个空间编码器,它以单个输入帧为输入,提取空间上下文并对帧内的视觉关系进行推理;以及(2)一个时间解码器,其将空间编码器的输出作为输入,以捕获帧之间的时间依赖性并推断动态关系。此外,STTran可以灵活地将不同长度的视频作为输入,而无需剪辑,这对于长视频尤其重要。我们的方法在基准数据集Action Genome(AG)上得到了验证。实验结果表明,该方法在动态场景图生成方面具有优越的性能。此外,还进行了一系列消融实验,验证了所提出各个模块的有效性。 摘要:Dynamic scene graph generation aims at generating a scene graph of the given video. Compared to the task of scene graph generation from images, it is more challenging because of the dynamic relationships between objects and the temporal dependencies between frames allowing for a richer semantic interpretation. In this paper, we propose Spatial-temporal Transformer (STTran), a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input in order to capture the temporal dependencies between frames and infer the dynamic relationships. Furthermore, STTran is flexible to take varying lengths of videos as input without clipping, which is especially important for long videos. Our method is validated on the benchmark dataset Action Genome (AG). The experimental results demonstrate the superior performance of our method in terms of dynamic scene graphs. Moreover, a set of ablative studies is conducted and the effect of each proposed module is justified.
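下面是"空间编码器 + 时间解码器"这一结构的极简示意(并非作者的原始代码):假设每帧的候选关系特征已被提取为固定维度向量,用 PyTorch 标准的 TransformerEncoder/TransformerDecoder 近似摘要描述的两个模块;其中的维度、层数、类别数均为假设值。

import torch
import torch.nn as nn

class STTranSketch(nn.Module):
    """Minimal sketch of a spatial encoder + temporal decoder (not the authors' code)."""
    def __init__(self, d_model=512, nhead=8, num_rel_classes=26):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.temporal_decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.rel_head = nn.Linear(d_model, num_rel_classes)

    def forward(self, rel_feats):
        # rel_feats: (T, N, D) -- N candidate relationship features per frame, T frames
        spatial = self.spatial_encoder(rel_feats)              # reason within each frame (batch dim = frames)
        seq = spatial.permute(1, 0, 2)                         # (N, T, D), batch dim = relationship slots
        temporal = self.temporal_decoder(tgt=seq, memory=seq)  # reason across frames
        return self.rel_head(temporal)                         # per-frame relationship logits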
【2】 Contextual Transformer Networks for Visual Recognition 标题:用于视觉识别的上下文变换网络
作者:Yehao Li,Ting Yao,Yingwei Pan,Tao Mei 机构:JD AI Research, Beijing, China 备注:Rank 1 in open-set image classification task of Open World Vision Challenge @ CVPR 2021; The source code and models are publicly available at: this https URL 链接:https://arxiv.org/abs/2107.12292 摘要:具有自注意力的Transformer引发了自然语言处理领域的一场革命,并在众多的计算机视觉任务中激发了Transformer风格的体系结构设计。然而,现有的设计大多直接在二维特征图上利用自注意力,基于每个空间位置上孤立的查询-键对来获得注意力矩阵,而没有充分利用相邻键之间丰富的上下文。在这项工作中,我们设计了一个新的Transformer风格模块,即上下文Transformer(CoT)块,用于视觉识别。这种设计充分利用了输入键间的上下文信息来指导动态注意力矩阵的学习,从而增强了视觉表征能力。从技术上讲,CoT块首先通过$3\times3$卷积对输入键进行上下文编码,从而产生输入的静态上下文表示。随后,我们将编码后的键与输入查询连接起来,通过两个连续的$1\times1$卷积学习动态多头注意力矩阵。学习到的注意力矩阵与输入值相乘,得到输入的动态上下文表示。最后将静态和动态的上下文表示融合作为输出。我们的CoT块很吸引人,因为它可以很容易地替换ResNet架构中的每个$3\times3$卷积,产生一个称为上下文Transformer网络(CoTNet)的Transformer式主干。通过对广泛应用(如图像识别、目标检测和实例分割)的大量实验,验证了CoTNet作为更强大主干网的优越性。源代码位于 https://github.com/JDAI-CV/CoTNet 。 摘要:Transformer with self-attention has led to the revolutionizing of natural language processing field, and recently inspires the emergence of Transformer-style architecture design with competitive results in numerous computer vision tasks. Nevertheless, most of existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix based on pairs of isolated queries and keys at each spatial location, but leave the rich contexts among neighbor keys under-exploited. In this work, we design a novel Transformer-style module, i.e., Contextual Transformer (CoT) block, for visual recognition. Such design fully capitalizes on the contextual information among input keys to guide the learning of dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, CoT block first contextually encodes input keys via a $3\times3$ convolution, leading to a static contextual representation of inputs. We further concatenate the encoded keys with input queries to learn the dynamic multi-head attention matrix through two consecutive $1\times1$ convolutions. The learnt attention matrix is multiplied by input values to achieve the dynamic contextual representation of inputs. The fusion of the static and dynamic contextual representations is finally taken as the output. Our CoT block is appealing in the view that it can readily replace each $3\times3$ convolution in ResNet architectures, yielding a Transformer-style backbone named as Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications (e.g., image recognition, object detection and instance segmentation), we validate the superiority of CoTNet as a stronger backbone. Source code is available at https://github.com/JDAI-CV/CoTNet.
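按照摘要的描述("3x3卷积编码静态键上下文 → 键与查询拼接后经两个1x1卷积得到注意力 → 与值相乘得到动态表示 → 融合输出"),下面给出一个粗略的PyTorch示意;其中局部注意力的具体形式与融合方式是简化假设,并非官方实现。

import torch
import torch.nn as nn

class CoTBlockSketch(nn.Module):
    """Rough sketch of a Contextual Transformer (CoT) block, following the abstract only."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.key_embed = nn.Sequential(            # static contextual encoding of keys (dim assumed divisible by 4)
            nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=4, bias=False),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        self.value_embed = nn.Conv2d(dim, dim, 1, bias=False)
        self.attn = nn.Sequential(                 # two consecutive 1x1 convolutions on [key, query]
            nn.Conv2d(2 * dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 1))

    def forward(self, x):
        k_static = self.key_embed(x)                            # static context of keys
        v = self.value_embed(x)
        attn = self.attn(torch.cat([k_static, x], dim=1))       # query = x
        k_dynamic = torch.sigmoid(attn) * v                     # dynamic contextual representation
        return k_static + k_dynamic                             # simple fusion (the paper fuses differently)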
检测相关(11篇)
【1】 Continental-Scale Building Detection from High Resolution Satellite Imagery 标题:基于高分辨率卫星图像的大陆尺度建筑物检测
作者:Wojciech Sirko,Sergii Kashubin,Marvin Ritter,Abigail Annkah,Yasser Salah Edine Bouchareb,Yann Dauphin,Daniel Keysers,Maxim Neumann,Moustapha Cisse,John Quinn 机构:Google Research 链接:https://arxiv.org/abs/2107.12283 摘要:识别建筑物的位置和脚印对于许多实用和科学目的至关重要。这类信息在缺乏替代数据来源的发展中区域特别有用。在这项工作中,我们描述了一个使用50厘米卫星图像探测整个非洲大陆建筑物的模型训练管道。从卫星图像分析中广泛应用的U-Net模型出发,研究了结构、损失函数、正则化、预训练、自训练和后处理等方面的变化对提高实例分割性能的影响。实验是利用非洲各地10万张卫星图像的数据集进行的,其中包括1.75米人工标记的建筑实例,以及用于预训练和自我训练的进一步数据集。我们报告了利用这种模型改进建筑物检测性能的新方法,包括使用混合(mAP 0.12)和使用软KL损失(mAP 0.06)的自训练。由此产生的管道即使在各种具有挑战性的农村和城市环境中也取得了良好的效果,并被用于创建5.16亿非洲范围内探测到的足迹的开放式建筑数据集。 摘要:Identifying the locations and footprints of buildings is vital for many practical and scientific purposes. Such information can be particularly useful in developing regions where alternative data sources may be scarce. In this work, we describe a model training pipeline for detecting buildings across the entire continent of Africa, using 50 cm satellite imagery. Starting with the U-Net model, widely used in satellite image analysis, we study variations in architecture, loss functions, regularization, pre-training, self-training and post-processing that increase instance segmentation performance. Experiments were carried out using a dataset of 100k satellite images across Africa containing 1.75M manually labelled building instances, and further datasets for pre-training and self-training. We report novel methods for improving performance of building detection with this type of model, including the use of mixup (mAP 0.12) and self-training with soft KL loss (mAP 0.06). The resulting pipeline obtains good results even on a wide variety of challenging rural and urban contexts, and was used to create the Open Buildings dataset of 516M Africa-wide detected footprints.
【2】 AA3DNet: Attention Augmented Real Time 3D Object Detection 标题:AA3DNet:注意力增强的实时三维目标检测
作者:Abhinav Sagar 机构:Independent Researcher, Mumbai, India 备注:12 pages, 8 tables, 6 figures 链接:https://arxiv.org/abs/2107.12137 摘要:在这项工作中,我们解决了从点云数据中实时检测三维目标的问题。对于自主车辆来说,感知部件在高精度和快速推理的同时检测真实世界的物体是非常重要的。我们提出了一种新的神经网络结构以及训练和优化细节,用于使用点云数据检测三维对象。我们提出锚设计以及自定义损失函数在这项工作中使用。在这项工作中结合了空间和通道注意模块。我们使用kitti3d鸟瞰图数据集进行基准测试和验证我们的结果。我们的方法在平均精度和运行速度方面都超过了这一领域的最新技术。最后,我们提出烧蚀研究,以证明我们的网络性能是普遍的。这使得它成为一个可行的选择,部署在实时应用程序,如自动驾驶汽车。 摘要:In this work, we address the problem of 3D object detection from point cloud data in real time. For autonomous vehicles to work, it is very important for the perception component to detect the real world objects with both high accuracy and fast inference. We propose a novel neural network architecture along with the training and optimization details for detecting 3D objects using point cloud data. We present anchor design along with custom loss functions used in this work. A combination of spatial and channel wise attention module is used in this work. We use the Kitti 3D Birds Eye View dataset for benchmarking and validating our results. Our method surpasses previous state of the art in this domain both in terms of average precision and speed running at > 30 FPS. Finally, we present the ablation study to demonstrate that the performance of our network is generalizable. This makes it a feasible option to be deployed in real time applications like self driving cars.
【3】 CP-loss: Connectivity-preserving Loss for Road Curb Detection in Autonomous Driving with Aerial Images 标题:CP-Loss:航空影像自动驾驶中道路路缘检测的连通性保持损失
作者:Zhenhua Xu,Yuxiang Sun,Lujia Wang,Ming Liu 机构:The Hong Kong Polytechnic University 备注:Accepted by The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021 链接:https://arxiv.org/abs/2107.11920 摘要:道路边缘检测对于自动驾驶非常重要。它可以用来确定道路边界,限制车辆在道路上行驶,从而避免潜在的事故。目前的大多数方法都是使用车载传感器(如摄像头或三维激光雷达)在线检测路缘。然而,这些方法通常会遇到严重的遮挡问题。特别是在高度动态的交通环境中,大部分视场被动态对象占据。为了缓解这一问题,本文利用高分辨率航空图像离线检测路缘石。此外,检测到的路缘可用于为自动驾驶车辆创建高清(HD)地图。具体来说,我们首先预测路缘石的像素级分割图,然后进行一系列的后处理步骤来提取路缘石的图结构。为了解决分割图中的不连通性问题,我们提出了一种新的连通性保持损失(CP-loss)来提高分割性能。在公共数据集上的实验结果证明了本文提出的损失函数的有效性。本文附有演示视频和补充文件,可在 https://sites.google.com/view/cp-loss 获取。 摘要:Road curb detection is important for autonomous driving. It can be used to determine road boundaries to constrain vehicles on roads, so that potential accidents could be avoided. Most of the current methods detect road curbs online using vehicle-mounted sensors, such as cameras or 3-D Lidars. However, these methods usually suffer from severe occlusion issues. Especially in highly-dynamic traffic environments, most of the field of view is occupied by dynamic objects. To alleviate this issue, we detect road curbs offline using high-resolution aerial images in this paper. Moreover, the detected road curbs can be used to create high-definition (HD) maps for autonomous vehicles. Specifically, we first predict the pixel-wise segmentation map of road curbs, and then conduct a series of post-processing steps to extract the graph structure of road curbs. To tackle the disconnectivity issue in the segmentation maps, we propose an innovative connectivity-preserving loss (CP-loss) to improve the segmentation performance. The experimental results on a public dataset demonstrate the effectiveness of our proposed loss function. This paper is accompanied with a demonstration video and a supplementary document, which are available at https://sites.google.com/view/cp-loss.
【4】 Comprehensive Studies for Arbitrary-shape Scene Text Detection 标题:任意形状场景文本检测的综合研究
作者:Pengwen Dai,Xiaochun Cao 机构:SKLOIS, Institute of Information Engineering, CAS, Beijing, China, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China 链接:https://arxiv.org/abs/2107.11800 摘要:近年来,大量的场景文本检测方法被提出。他们中的大多数人都宣称他们取得了最先进的表演。然而,由于训练数据、骨干网、多尺度特征融合、评估协议等不一致的设置,性能比较是不公平的。这些不同的设置将掩盖提议的核心技术的优点和缺点。本文仔细研究和分析了不一致设置,提出了一个统一的基于自底向上的场景文本检测方法框架。在统一的框架下,我们保证了非核心模块设置的一致性,主要研究了描述任意形状场景文本的表示方法,如文本轮廓上的回归点、用预测的辅助信息聚类像素、用学习的链接对连接组件进行分组等,通过全面的调查和细致的分析,不仅扫清了理解现有方法性能差异的障碍,而且在公平比较的基础上揭示了以往模型的优缺点。 摘要:Numerous scene text detection methods have been proposed in recent years. Most of them declare they have achieved state-of-the-art performances. However, the performance comparison is unfair, due to lots of inconsistent settings (e.g., training data, backbone network, multi-scale feature fusion, evaluation protocols, etc.). These various settings would dissemble the pros and cons of the proposed core techniques. In this paper, we carefully examine and analyze the inconsistent settings, and propose a unified framework for the bottom-up based scene text detection methods. Under the unified framework, we ensure the consistent settings for non-core modules, and mainly investigate the representations of describing arbitrary-shape scene texts, e.g., regressing points on text contours, clustering pixels with predicted auxiliary information, grouping connected components with learned linkages, etc. With the comprehensive investigations and elaborate analyses, it not only cleans up the obstacle of understanding the performance differences between existing methods but also reveals the advantages and disadvantages of previous models under fair comparisons.
【5】 Improving Variational Autoencoder based Out-of-Distribution Detection for Embedded Real-time Applications 标题:面向嵌入式实时应用的基于变分自编码器的分布外检测改进
作者:Yeli Feng,Daniel Jun Xian Ng,Arvind Easwaran 机构:Nanyang Technological University 链接:https://arxiv.org/abs/2107.11750 摘要:机器学习中的不确定性是其应用于安全关键网络物理系统(CPS)的一个重要障碍。不确定性的一个来源是训练和测试场景之间输入数据的分布变化。实时检测这种分布变化是应对这一挑战的一种新兴方法。在涉及成像的CPS应用中,高维的输入空间给任务增加了额外的难度。生成式学习模型被广泛用于这一任务,即分布外(OoD)检测。为了提高现有的技术水平,我们研究了机器学习和CPS领域的现有方案。在后者中,自动驾驶代理的实时安全监控一直是人们关注的焦点。利用视频中运动的时空相关性,我们可以鲁棒地检测出自主驾驶代理周围的危险运动。受变分自动编码器(VAE)理论和实践的最新进展的启发,我们利用数据中的先验知识来进一步提高OoD检测的鲁棒性。对nuScenes和Synthia数据集的比较研究表明,我们的方法显著提高了对驾驶场景特有的OoD因素的检测能力,比最先进的方法提高了42%。我们的模型也近乎完美地进行了推广,在真实世界和模拟驾驶数据集实验中,比最先进的模型好97%。最后,我们定制了一个双编码器模型,可以部署到资源有限的嵌入式设备进行实时OoD检测。在低精度的8位整数推理中,它的执行时间减少了4倍以上,而检测能力与相应的浮点模型相当。 摘要:Uncertainties in machine learning are a significant roadblock for its application in safety-critical cyber-physical systems (CPS). One source of uncertainty arises from distribution shifts in the input data between training and test scenarios. Detecting such distribution shifts in real-time is an emerging approach to address the challenge. The high dimensional input space in CPS applications involving imaging adds extra difficulty to the task. Generative learning models are widely adopted for the task, namely out-of-distribution (OoD) detection. To improve the state-of-the-art, we studied existing proposals from both machine learning and CPS fields. In the latter, safety monitoring in real-time for autonomous driving agents has been a focus. Exploiting the spatiotemporal correlation of motion in videos, we can robustly detect hazardous motion around autonomous driving agents. Inspired by the latest advances in the Variational Autoencoder (VAE) theory and practice, we tapped into the prior knowledge in data to further boost OoD detection's robustness. Comparison studies over nuScenes and Synthia data sets show our methods significantly improve detection capabilities of OoD factors unique to driving scenarios, 42% better than state-of-the-art approaches. Our model also generalized near-perfectly, 97% better than the state-of-the-art across the real-world and simulation driving data sets experimented. Finally, we customized one proposed method into a twin-encoder model that can be deployed to resource limited embedded devices for real-time OoD detection. Its execution time was reduced over four times in low-precision 8-bit integer inference, while detection capability is comparable to its corresponding floating-point model.
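基于VAE做OoD检测的一种常见做法是用"重构误差 + KL散度"作为异常分数,下面给出一个最小示意(假设 vae(x) 返回 (重构, mu, logvar),阈值在分布内数据上标定;这只是通用写法,并非论文的具体方法):

import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_score(vae, x):
    """OoD score = reconstruction error + KL of the approximate posterior; higher = more likely OoD."""
    recon, mu, logvar = vae(x)                       # assumed VAE interface
    rec_err = F.mse_loss(recon, x, reduction="none").flatten(1).sum(dim=1)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return rec_err + kl

def is_ood(vae, frames, threshold):
    # flag frames whose score exceeds a threshold calibrated on in-distribution data
    return ood_score(vae, frames) > threshold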
【6】 Rank & Sort Loss for Object Detection and Instance Segmentation 标题:用于目标检测和实例分割的秩与排序损失
作者:Kemal Oksuz,Baris Can Cam,Emre Akbas,Sinan Kalkan 机构:Dept. of Computer Engineering, Middle East Technical University, Ankara, Turkey 备注:ICCV 2021, oral presentation 链接:https://arxiv.org/abs/2107.11669 摘要:我们提出秩排序损失(RS)作为一种基于秩的损失函数来训练深度目标检测和实例分割方法(即视觉检测器)。RS Loss监督分类器,即这些方法的子网络,将每个正的排序置于所有负的之上,并根据它们的连续定位质量(例如联合IoU上的交集)在它们之间对正的排序。为了解决排序和排序的不可微性,我们重新将错误驱动更新和反向传播相结合作为身份更新,这使得我们能够对新的排序错误进行建模。有了RS损失,我们大大简化了训练:(i)由于我们的分类目标,分类器在没有额外辅助头(例如中心度、IoU、掩码IoU)的情况下对阳性进行优先排序;(ii)由于RS损失的排名性质,它对类不平衡具有鲁棒性,因此,不需要任何抽样启发式,(iii)我们使用无需调整的任务平衡系数来解决视觉检测器的多任务性质。使用RS-Loss,我们仅通过调整学习率来训练七种不同的视觉检测器,并且表明它始终优于基线:例如,我们的RS-Loss提高了(i)在COCO数据集上更快的R-CNN约3盒AP和aLRP损失(基于排名的基线)约2盒AP,(ii)在LVIS数据集上用3.5个掩模AP(稀有类约7个AP)掩模R-CNN和重复因子抽样(RFS);而且表现也优于所有同行。代码位于https://github.com/kemaloksuz/RankSortLoss 摘要:We propose Rank & Sort (RS) Loss, as a ranking-based loss function to train deep object detection and instance segmentation methods (i.e. visual detectors). RS Loss supervises the classifier, a sub-network of these methods, to rank each positive above all negatives as well as to sort positives among themselves with respect to (wrt.) their continuous localisation qualities (e.g. Intersection-over-Union - IoU). To tackle the non-differentiable nature of ranking and sorting, we reformulate the incorporation of error-driven update with backpropagation as Identity Update, which enables us to model our novel sorting error among positives. With RS Loss, we significantly simplify training: (i) Thanks to our sorting objective, the positives are prioritized by the classifier without an additional auxiliary head (e.g. for centerness, IoU, mask-IoU), (ii) due to its ranking-based nature, RS Loss is robust to class imbalance, and thus, no sampling heuristic is required, and (iii) we address the multi-task nature of visual detectors using tuning-free task-balancing coefficients. Using RS Loss, we train seven diverse visual detectors only by tuning the learning rate, and show that it consistently outperforms baselines: e.g. our RS Loss improves (i) Faster R-CNN by ~ 3 box AP and aLRP Loss (ranking-based baseline) by ~ 2 box AP on COCO dataset, (ii) Mask R-CNN with repeat factor sampling (RFS) by 3.5 mask AP (~ 7 AP for rare classes) on LVIS dataset; and also outperforms all counterparts. Code available at https://github.com/kemaloksuz/RankSortLoss
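下面用sigmoid平滑的阶跃函数,对"每个正样本应排在所有负样本之上、正样本之间按IoU排序"这一目标给出一个朴素的可微近似,仅作示意;它并不是论文中基于误差驱动恒等更新(Identity Update)的RS Loss实现。

import torch

def naive_rank_sort_loss(scores, labels, ious, tau=0.1):
    """Naive, smoothed stand-in for a rank-and-sort objective (illustrative only).
    scores: (N,) classifier logits; labels: (N,) 1 = positive, 0 = negative;
    ious: (N,) localisation quality, used for positives only."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.sum() * 0.0
    # ranking term: every positive should score above every negative (smoothed step)
    rank_err = torch.sigmoid((neg[None, :] - pos[:, None]) / tau).mean()
    # sorting term: ordering of positive scores should agree with their IoU ordering
    iou_pos = ious[labels == 1]
    score_diff = pos[:, None] - pos[None, :]
    iou_diff = iou_pos[:, None] - iou_pos[None, :]
    sort_err = (torch.sigmoid(-score_diff / tau) * (iou_diff > 0).float()).mean()
    return rank_err + sort_err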
【7】 An Uncertainty-Aware Deep Learning Framework for Defect Detection in Casting Products 标题:一种不确定性感知的铸造产品缺陷检测深度学习框架
作者:Maryam Habibpour,Hassan Gharoun,AmirReza Tajally,Afshar Shamsi,Hamzeh Asgharnezhad,Abbas Khosravi,Saeid Nahavandi 机构: Tajally is with the Department of Industrial Engineering, university ofTehran 备注:9 pages, 5 figures, 3 tables 链接:https://arxiv.org/abs/2107.11643 摘要:由于铸造工艺的复杂性,铸件生产中不可避免地会出现缺陷。在大规模生产中,传统的铸件目视检查速度慢,效率低,而自动可靠的缺陷检测不仅提高了质量控制过程,而且积极地提高了生产率。然而,由于铸件缺陷形貌的多样性和多样性,铸件缺陷检测是一项具有挑战性的任务。卷积神经网络(CNNs)广泛应用于图像分类和缺陷检测。然而,具有频繁推理的CNN需要大量的数据来训练,并且仍然不能报告对其预测不确定性的有益估计。因此,利用迁移学习范式,我们首先在一个小数据集上应用四个强大的基于CNN的模型(VGG16、ResNet50、DenseNet121和InceptionResNetV2)来提取有意义的特征。提取的特征通过各种机器学习算法进行处理,完成分类任务。仿真结果表明,线性支持向量机(SVM)和多层感知器(MLP)在铸件图像缺陷检测中具有良好的性能。其次,为了实现可靠的分类和测量认知不确定性,我们使用了一种不确定性量化(UQ)技术(MLP模型集成),该技术使用了从四个预先训练的cnn中提取的特征。利用UQ混淆矩阵和不确定度精度度量对预测不确定度进行了评估。综合比较表明,基于VGG16的UQ方法在提取不确定性方面优于其他方法。我们相信,一个具有不确定性的自动缺陷检测解决方案将加强铸件生产的质量保证。 摘要:Defects are unavoidable in casting production owing to the complexity of the casting process. While conventional human-visual inspection of casting products is slow and unproductive in mass productions, an automatic and reliable defect detection not just enhances the quality control process but positively improves productivity. However, casting defect detection is a challenging task due to diversity and variation in defects' appearance. Convolutional neural networks (CNNs) have been widely applied in both image classification and defect detection tasks. Howbeit, CNNs with frequentist inference require a massive amount of data to train on and still fall short in reporting beneficial estimates of their predictive uncertainty. Accordingly, leveraging the transfer learning paradigm, we first apply four powerful CNN-based models (VGG16, ResNet50, DenseNet121, and InceptionResNetV2) on a small dataset to extract meaningful features. Extracted features are then processed by various machine learning algorithms to perform the classification task. Simulation results demonstrate that linear support vector machine (SVM) and multi-layer perceptron (MLP) show the finest performance in defect detection of casting images. Secondly, to achieve a reliable classification and to measure epistemic uncertainty, we employ an uncertainty quantification (UQ) technique (ensemble of MLP models) using features extracted from four pre-trained CNNs. UQ confusion matrix and uncertainty accuracy metric are also utilized to evaluate the predictive uncertainty estimates. Comprehensive comparisons reveal that UQ method based on VGG16 outperforms others to fetch uncertainty. We believe an uncertainty-aware automatic defect detection solution will reinforce casting productions quality assurance.
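摘要中"用预训练CNN提取特征、再交给经典分类器"的流程可以用下面的最小示意来说明(以torchvision的VGG16与线性SVM为例,均为假设选型,并非作者原始代码):

import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.svm import LinearSVC

def extract_features(images):
    """Extract features with a pre-trained VGG16 backbone (classifier head removed).
    images: float tensor of shape (N, 3, H, W), ImageNet-normalised."""
    backbone = models.vgg16(pretrained=True).features.eval()
    pool = nn.AdaptiveAvgPool2d(1)
    with torch.no_grad():
        feats = pool(backbone(images)).flatten(1)   # (N, 512)
    return feats.numpy()

# downstream classical classifier on the extracted features
# clf = LinearSVC(C=1.0).fit(extract_features(train_imgs), train_labels)
# preds = clf.predict(extract_features(test_imgs))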
【8】 ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos 标题:ASOD60K:全景视频中音频诱导的显著目标检测
作者:Yi Zhang,Fang-Yi Chao,Ge-Peng Ji,Deng-Ping Fan,Lu Zhang,Ling Shao 机构: 3Inception Institute of AI (IIAI) 备注:22 pages, 17 figures, 7 tables (Project Page: this https URL) 链接:https://arxiv.org/abs/2107.11629 摘要:探索人类在动态全景场景中的注意力对于许多基础应用是有用的,包括零售业中的增强现实(AR)、AR驱动的招聘和视觉语言导航。基于这一目标,我们提出了PV-SOD,这是一项新的任务,旨在从全景视频中分割突出的物体。与现有的注视水平或目标水平显著性检测任务相比,本文重点研究了多模态显著性目标检测(SOD),它通过在视听线索的引导下分割显著性目标来模仿人类的注意机制。为了支持这项任务,我们收集了第一个大规模数据集ASOD60K,其中包含4K分辨率的视频帧,并采用六级层次结构进行注释,因此具有丰富性、多样性和质量。具体地说,每个序列都用其超/子类进行标记,每个子类的对象用人眼注视、边界框、对象/实例级掩码和相关属性(例如,几何失真)进行进一步注释。这些从粗到细的注释使PV-SOD模型的详细分析成为可能,例如,确定现有SOD模型的主要挑战,预测扫描路径以研究人类的长期眼睛注视行为。我们在ASOD60K上对11种代表性方法进行了系统的基准测试,并得出了一些有趣的发现。希望本研究能作为一个良好的起点,推动显著性目标检测(SOD)研究向全景视频方向发展。 摘要:Exploring to what humans pay attention in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD), which mimics human attention mechanism by segmenting salient objects with the guidance of audio-visual cues. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, thus distinguishing itself with richness, diversity and quality. Specifically, each sequence is marked with both its super-/sub-class, with objects of each sub-class being further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modeling, e.g., determining the major challenges for existing SOD models, and predicting scanpaths to study the long-term eye fixation behaviors of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study could serve as a good starting point for advancing SOD research towards panoramic videos.
【9】 Multi-Echo LiDAR for 3D Object Detection 标题:多回波激光雷达在三维目标检测中的应用
作者:Yunze Man,Xinshuo Weng,Prasanna Kumar Sivakuma,Matthew O'Toole,Kris Kitani 机构:Carnegie Mellon University,DENSO 链接:https://arxiv.org/abs/2107.11470 摘要:激光雷达传感器可用于获得除简单的三维点云以外的各种测量信号,这些信号可用于改进诸如三维目标检测等感知任务。单个激光脉冲可以被多个物体沿其路径部分反射,从而产生称为回波的多次测量。多回波测量可以提供物体轮廓和半透明表面的信息,用于更好地识别和定位物体。激光雷达还可以测量表面反射率(激光脉冲回波强度)以及场景的环境光(物体反射的阳光)。这些信号已经在商业激光雷达设备中可用,但尚未在大多数基于激光雷达的探测模型中使用。提出了一种利用激光雷达提供的全光谱测量信号的三维目标检测模型。首先,我们提出了一个多信号融合(MSF)模块来结合(1)用2dcnn提取的反射和环境特征,以及(2)用3D图形神经网络(GNN)提取的点云特征。其次,我们提出一个多重回音聚合(MEA)模组来整合不同回音点集合中所编码的资讯。与传统的单回波点云方法相比,本文提出的多信号激光雷达探测器(MSLiD)能够从更大范围的传感测量中提取更丰富的背景信息,实现更精确的三维目标检测。实验表明,该方法结合了激光雷达的多模态特性,比现有的方法提高了9.1%。 摘要:LiDAR sensors can be used to obtain a wide range of measurement signals other than a simple 3D point cloud, and those signals can be leveraged to improve perception tasks like 3D object detection. A single laser pulse can be partially reflected by multiple objects along its path, resulting in multiple measurements called echoes. Multi-echo measurement can provide information about object contours and semi-transparent surfaces which can be used to better identify and locate objects. LiDAR can also measure surface reflectance (intensity of laser pulse return), as well as ambient light of the scene (sunlight reflected by objects). These signals are already available in commercial LiDAR devices but have not been used in most LiDAR-based detection models. We present a 3D object detection model which leverages the full spectrum of measurement signals provided by LiDAR. First, we propose a multi-signal fusion (MSF) module to combine (1) the reflectance and ambient features extracted with a 2D CNN, and (2) point cloud features extracted using a 3D graph neural network (GNN). Second, we propose a multi-echo aggregation (MEA) module to combine the information encoded in different set of echo points. Compared with traditional single echo point cloud methods, our proposed Multi-Signal LiDAR Detector (MSLiD) extracts richer context information from a wider range of sensing measurements and achieves more accurate 3D object detection. Experiments show that by incorporating the multi-modality of LiDAR, our method outperforms the state-of-the-art by up to 9.1%.
【10】 B-line Detection in Lung Ultrasound Videos: Cartesian vs Polar Representation 标题:肺部超声视频中的B线检测:笛卡尔表示与极坐标表示
作者:Hamideh Kerdegari,Phung Tran Huy Nhat,Angela McBride,Luigi Pisani,Reza Razavi,Louise Thwaites,Sophie Yacoub,Alberto Gomez 机构: School of Biomedical Engineering & Imaging Sciences, King’s College London, UK, Oxford University Clinical Research Unit, Ho Chi Minh City, Vietnam, Mahidol Oxford Research Unit, Thailand 备注:8 pages, 4 figures, 1 table 链接:https://arxiv.org/abs/2107.12291 摘要:肺部超声(LUS)成像在重症监护病房(ICU)中越来越流行,用于评估严重登革热引起的肺部异常,如B线伪影的出现。这些伪影出现在LUS图像中并很快消失,使得人工检测非常具有挑战性。它们也随着声波的传播呈放射状延伸。因此,我们假设极坐标表示可能更适合于这些图像的自动图像分析。本文提出了一种基于注意力的卷积LSTM模型来自动检测LUS视频中的B线,比较了笛卡尔表示和极坐标表示的图像数据的性能。结果表明,在B线分类任务上,采用极坐标表示的所提框架取得了与笛卡尔表示相当的性能,并且注意力机制能够提供更好的定位。 摘要:Lung ultrasound (LUS) imaging is becoming popular in the intensive care units (ICU) for assessing lung abnormalities such as the appearance of B-line artefacts as a result of severe dengue. These artefacts appear in the LUS images and disappear quickly, making their manual detection very challenging. They also extend radially following the propagation of the sound waves. As a result, we hypothesize that a polar representation may be more adequate for automatic image analysis of these images. This paper presents an attention-based Convolutional LSTM model to automatically detect B-lines in LUS videos, comparing performance when image data is taken in Cartesian and polar representations. Results indicate that the proposed framework with polar representation achieves competitive performance compared to the Cartesian representation for B-line classification and that attention mechanism can provide better localization.
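笛卡尔帧到极坐标表示的转换可以用OpenCV实现,下面是一个小示意(假设探头顶点位于图像顶部中点,文件名与中心位置均为假设值):

import cv2

def to_polar(frame, center=None):
    """Convert a Cartesian ultrasound frame to a polar representation with OpenCV.
    The probe apex is assumed to sit at the top-centre of the image."""
    h, w = frame.shape[:2]
    if center is None:
        center = (w / 2.0, 0.0)          # assumed apex position
    max_radius = float(h)
    return cv2.warpPolar(frame, (w, h), center, max_radius, cv2.WARP_POLAR_LINEAR)

# usage: polar_frame = to_polar(cv2.imread("lus_frame.png", cv2.IMREAD_GRAYSCALE))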
【11】 Early Diagnosis of Lung Cancer Using Computer Aided Detection via Lung Segmentation Approach 标题:计算机辅助肺分割检测技术在肺癌早期诊断中的应用
作者:Abhir Bhandary,Ananth Prabhu G,Mustafa Basthikodi,Chaitra K M 机构:Assistant Professor, Department of Information Science Engineering, NMAM Institute of Technology (Visvesvaraya Technological University, Belagavi), Nitte, Karnataka, India 备注:None 链接:https://arxiv.org/abs/2107.12205 摘要:肺癌始于肺部,是人群中癌症死亡的重要原因。据美国癌症协会估计,约27%的死亡由癌症导致。在其发展的早期阶段,肺癌通常不会引起任何症状。许多患者在症状已较为明显的进展期才被确诊,这导致治疗效果差、死亡率高。计算机辅助检测系统被用来提高肺癌诊断的准确性。本文提出了一种基于模糊C均值聚类、自适应阈值分割和活动轮廓模型分割的肺分割方法。分析并给出了实验结果。 摘要:Lung cancer begins in the lungs and leading to the reason of cancer demise amid population in the creation. According to the American Cancer Society, which estimates about 27% of the deaths because of cancer. In the early phase of its evolution, lung cancer does not cause any symptoms usually. Many of the patients have been diagnosed in a developed phase where symptoms become more prominent, that results in poor curative treatment and high mortality rate. Computer Aided Detection systems are used to achieve greater accuracies for the lung cancer diagnosis. In this research exertion, we proposed a novel methodology for lung Segmentation on the basis of Fuzzy C-Means Clustering, Adaptive Thresholding, and Segmentation of Active Contour Model. The experimental results are analysed and presented.
分类|识别相关(22篇)
【1】 Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition 标题:基于骨架的动作识别的通道拓扑细化图卷积
作者:Yuxin Chen,Ziqi Zhang,Chunfeng Yuan,Bing Li,Ying Deng,Weiming Hu 机构:NLPR, Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, CAS Center for Excellence in Brain Science and Intelligence Technology 备注:Accepted to ICCV2021. Code is available at this https URL 链接:https://arxiv.org/abs/2107.12213 摘要:图卷积网络在基于骨架的动作识别中得到了广泛的应用,并取得了显著的效果。在GCNs中,图的拓扑结构控制着特征的聚集,是提取代表性特征的关键。在这项工作中,我们提出了一种新的通道拓扑精化图卷积(CTR-GC)来动态学习不同的拓扑结构,并有效地聚合不同通道中的关节特征,用于基于骨架的动作识别。提出的CTR-GC通过学习共享拓扑作为所有通道的通用先验,并使用每个通道的特定相关性对其进行细化,从而对通道级拓扑进行建模。我们的细化方法引入了很少的额外参数,显著降低了对通道级拓扑进行建模的难度。此外,通过将图卷积重新表示成统一的形式,我们发现CTR-GC放松了图卷积的严格约束,从而具有更强的表示能力。结合CTR-GC和时序建模模块,我们开发了一个强大的图卷积网络CTR-GCN,它在NTU RGB+D、NTU RGB+D 120和NW-UCLA数据集上显著优于最新的方法。 摘要:Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition. In GCNs, graph topology dominates feature aggregation and therefore is the key to extracting representative features. In this work, we propose a novel Channel-wise Topology Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies and effectively aggregate joint features in different channels for skeleton-based action recognition. The proposed CTR-GC models channel-wise topologies through learning a shared topology as a generic prior for all channels and refining it with channel-specific correlations for each channel. Our refinement method introduces few extra parameters and significantly reduces the difficulty of modeling channel-wise topologies. Furthermore, via reformulating graph convolutions into a unified form, we find that CTR-GC relaxes strict constraints of graph convolutions, leading to stronger representation capability. Combining CTR-GC with temporal modeling modules, we develop a powerful graph convolutional network named CTR-GCN which notably outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
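"共享拓扑先验 + 逐通道相关性细化"这一思路可以粗略地写成如下示意(按摘要含义整理,通道数、降维比例等均为假设值,并非官方代码):

import torch
import torch.nn as nn

class CTRGCSketch(nn.Module):
    """Sketch of channel-wise topology refinement graph convolution: a shared skeleton
    topology A is refined per output channel by learned pairwise joint correlations."""
    def __init__(self, in_ch, out_ch, rel_ch=8):
        super().__init__()
        self.q = nn.Conv2d(in_ch, rel_ch, 1)
        self.k = nn.Conv2d(in_ch, rel_ch, 1)
        self.v = nn.Conv2d(in_ch, out_ch, 1)
        self.expand = nn.Conv2d(rel_ch, out_ch, 1)
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, A):
        # x: (N, C, T, V) skeleton features, A: (V, V) shared topology prior
        q = self.q(x).mean(-2)                                   # (N, rel_ch, V) after temporal pooling
        k = self.k(x).mean(-2)
        val = self.v(x)                                          # (N, out_ch, T, V)
        corr = torch.tanh(q.unsqueeze(-1) - k.unsqueeze(-2))     # (N, rel_ch, V, V) pairwise correlations
        topo = A + self.alpha * self.expand(corr)                # channel-specific refined topology
        return torch.einsum('ncuv,nctv->nctu', topo, val)        # aggregate joint features per channel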
【2】 Image-Based Parking Space Occupancy Classification: Dataset and Baseline 标题:基于图像的车位占用分类:数据集和基线
作者:Martin Marek 机构:ParkDots, PosAm 链接:https://arxiv.org/abs/2107.12207 摘要:介绍了一种新的基于图像的停车位占用分类数据集ACPDS。与以前的数据集不同,每个图像都是从独特的视角拍摄并经过系统标注的,且训练集、验证集和测试集中的停车场互不重复。利用这个数据集,我们提出了一个简单的停车位占用分类的基线模型,在未见过的停车场上达到98%的准确率,显著优于现有的模型。我们在MIT许可证下共享我们的数据集、代码和经过训练的模型。 摘要:We introduce a new dataset for image-based parking space occupancy classification: ACPDS. Unlike in prior datasets, each image is taken from a unique view, systematically annotated, and the parking lots in the train, validation, and test sets are unique. We use this dataset to propose a simple baseline model for parking space occupancy classification, which achieves 98% accuracy on unseen parking lots, significantly outperforming existing models. We share our dataset, code, and trained models under the MIT license.
【3】 An Efficient Insect Pest Classification Using Multiple Convolutional Neural Network Based Models 标题:一种基于多卷积神经网络模型的有效害虫分类方法
作者:Hieu T. Ung,Huy Q. Ung,Binh T. Nguyen 机构:Nguyen., Received: date Accepted: date 备注:22 pages, 15 figures 链接:https://arxiv.org/abs/2107.12189 摘要:准确识别病虫害对保护作物或对病虫害进行早期处理,减少农业经济损失具有重要意义。由于人工识别速度慢、耗时长、成本高,因此设计一个害虫自动识别系统是非常必要的。传统的基于图像的害虫分类方法由于其复杂性,效率不高。由于害虫种类、规模、形态多样,田间背景复杂,昆虫种间的外貌相似性高,因此害虫分类是一项艰巨的任务。随着深度学习技术的迅速发展,基于CNN的方法是开发快速、准确的害虫分类器的最佳途径。在这项工作中,我们提出了不同的基于卷积神经网络的模型,包括注意、特征金字塔和细粒度模型。我们在两个公共数据集上评估了我们的方法:大规模虫害数据集、IP102基准数据集和一个较小的数据集,即D0的宏观平均精度(MPre)、宏观平均召回率(MRec)、宏观平均F1评分(MF1)、精度(Acc)和几何平均值(GM)。实验结果表明,在这两种数据集上,结合这些基于卷积神经网络的模型比现有的方法具有更好的性能。例如,我们在IP102和D0上获得的最高精度分别是$74.13%$和$99.78%$,绕过了相应的最新精度:$67.1%$(IP102)和$98.8%$(D0)。我们还发表了我们的代码,为当前有关昆虫害虫分类问题的研究做出贡献。 摘要:Accurate insect pest recognition is significant to protect the crop or take the early treatment on the infected yield, and it helps reduce the loss for the agriculture economy. Design an automatic pest recognition system is necessary because manual recognition is slow, time-consuming, and expensive. The Image-based pest classifier using the traditional computer vision method is not efficient due to the complexity. Insect pest classification is a difficult task because of various kinds, scales, shapes, complex backgrounds in the field, and high appearance similarity among insect species. With the rapid development of deep learning technology, the CNN-based method is the best way to develop a fast and accurate insect pest classifier. We present different convolutional neural network-based models in this work, including attention, feature pyramid, and fine-grained models. We evaluate our methods on two public datasets: the large-scale insect pest dataset, the IP102 benchmark dataset, and a smaller dataset, namely D0 in terms of the macro-average precision (MPre), the macro-average recall (MRec), the macro-average F1- score (MF1), the accuracy (Acc), and the geometric mean (GM). The experimental results show that combining these convolutional neural network-based models can better perform than the state-of-the-art methods on these two datasets. For instance, the highest accuracy we obtained on IP102 and D0 is $74.13%$ and $99.78%$, respectively, bypassing the corresponding state-of-the-art accuracy: $67.1%$ (IP102) and $98.8%$ (D0). We also publish our codes for contributing to the current research related to the insect pest classification problem.
【4】 Federated Action Recognition on Heterogeneous Embedded Devices 标题:异构嵌入式设备上的联邦动作识别
作者:Pranjal Jain,Shreyas Goenka,Saurabh Bagchi,Biplab Banerjee,Somali Chaterji 机构: Purdue University 备注:None 链接:https://arxiv.org/abs/2107.12147 摘要:联合学习允许大量设备在不共享数据的情况下联合学习模型。在这项工作中,我们让计算能力有限的客户机执行动作识别,这是一项计算繁重的任务。我们首先在中央服务器上通过对一个大数据集的知识提炼来进行模型压缩。这允许模型学习复杂的特性,并作为模型微调的初始化。由于小数据集中存在的有限数据不足以让动作识别模型学习复杂的时空特征,因此需要进行微调。由于客户端的计算资源往往是异构的,因此我们使用了异步联邦优化,并进一步给出了收敛界。我们将我们的方法与两种基线方法进行比较:在中心服务器(无客户机)进行微调,以及使用同步联邦平均法使用(异构)客户机进行微调。我们在一个异构嵌入式设备的实验台上进行了实验,实验结果表明,我们能够以与上述两个基线相当的准确率进行动作识别,而我们的异步学习策略比同步学习减少了40%的训练时间。 摘要:Federated learning allows a large number of devices to jointly learn a model without sharing data. In this work, we enable clients with limited computing power to perform action recognition, a computationally heavy task. We first perform model compression at the central server through knowledge distillation on a large dataset. This allows the model to learn complex features and serves as an initialization for model fine-tuning. The fine-tuning is required because the limited data present in smaller datasets is not adequate for action recognition models to learn complex spatio-temporal features. Because the clients present are often heterogeneous in their computing resources, we use an asynchronous federated optimization and we further show a convergence bound. We compare our approach to two baseline approaches: fine-tuning at the central server (no clients) and fine-tuning using (heterogeneous) clients using synchronous federated averaging. We empirically show on a testbed of heterogeneous embedded devices that we can perform action recognition with comparable accuracy to the two baselines above, while our asynchronous learning strategy reduces the training time by 40%, relative to synchronous learning.
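摘要中"在中央服务器上通过知识蒸馏进行模型压缩"这一步骤,其常见形式可用下面的最小示意说明(温度T与权重alpha为假设超参数,并非论文给定值):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Standard knowledge-distillation objective: temperature-scaled soft-target KL + hard-label CE."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard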
【5】 Towards Unbiased Visual Emotion Recognition via Causal Intervention 标题:基于因果干预的无偏视觉情绪识别研究
作者:Yuedong Chen,Xu Yang,Tat-Jen Cham,Jianfei Cai 机构:Monash University, Australia, Nanyang Technological University, Singapore 备注:10 pages, 7 figures 链接:https://arxiv.org/abs/2107.12096 摘要:尽管在视觉情感识别方面已经取得了很大的进展,但是研究人员已经意识到现代深层网络倾向于利用数据集特征来学习输入和目标之间虚假的统计关联。这些数据集特征通常被视为数据集偏差,这会影响这些识别系统的鲁棒性和泛化性能。在这项工作中,我们从因果推理的角度来审视这个问题,这种数据集特征被称为一种混淆,误导系统学习虚假的相关性。为了缓解数据集偏差带来的负面影响,我们提出了一种新的干预性情绪识别网络(interventive emotional Recognition Network,IERN)来实现后门调整,这是因果推理中的一种基本解构技术。一系列设计的测试验证了IERN的有效性,在三个情绪基准上的实验表明,IERN优于其他最先进的方法。 摘要:Although much progress has been made in visual emotion recognition, researchers have realized that modern deep networks tend to exploit dataset characteristics to learn spurious statistical associations between the input and the target. Such dataset characteristics are usually treated as dataset bias, which damages the robustness and generalization performance of these recognition systems. In this work, we scrutinize this problem from the perspective of causal inference, where such dataset characteristic is termed as a confounder which misleads the system to learn the spurious correlation. To alleviate the negative effects brought by the dataset bias, we propose a novel Interventional Emotion Recognition Network (IERN) to achieve the backdoor adjustment, which is one fundamental deconfounding technique in causal inference. A series of designed tests validate the effectiveness of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms other state-of-the-art approaches.
【6】 Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition 标题:联合视觉语义推理:文本识别的多级解码器
作者:Ayan Kumar Bhunia,Aneeshan Sain,Amandeep Kumar,Shuvozit Ghose,Pinaki Nath Chowdhury,Yi-Zhe Song 机构:SketchX, CVSSP, University of Surrey, United Kingdom., FlyTek-Surrey Joint Research Centre on Artificial Intelligence. 备注:IEEE International Conference on Computer Vision (ICCV), 2021 链接:https://arxiv.org/abs/2107.12090 摘要:尽管文本识别在过去的几年中已经有了很大的发展,但是由于复杂的背景、不同的字体、不受控制的照明、扭曲和其他人工制品,最先进的(SOTA)模型仍然难以适应。这是因为这样的模型仅仅依赖视觉信息进行文本识别,因此缺乏语义推理能力。在本文中,我们认为,语义信息提供了一个补充作用,除了视觉。更具体地说,我们通过提出一个多阶段多尺度注意解码器来利用语义信息,该解码器执行联合视觉语义推理。我们的新颖之处在于直觉,即对于文本识别,预测应该以阶段性的方式进行细化。因此,我们的主要贡献是设计了一个阶段性的展开注意解码器,其中离散预测字符标签调用的不可微性需要在端到端训练中绕过。虽然第一阶段使用视觉特征进行预测,但随后的阶段使用联合视觉语义信息对其进行优化。此外,我们引入多尺度2D注意,以及不同阶段之间的密集和剩余连接,以处理不同尺度的角色大小,从而在训练过程中获得更好的性能和更快的收敛速度。实验结果表明,我们的方法比现有的SOTA方法有相当大的优势。 摘要:Although text recognition has significantly evolved over the years, state-of-the-art (SOTA) models still struggle in the wild scenarios due to complex backgrounds, varying fonts, uncontrolled illuminations, distortions and other artefacts. This is because such models solely depend on visual information for text recognition, thus lacking semantic reasoning capabilities. In this paper, we argue that semantic information offers a complementary role in addition to visual only. More specifically, we additionally utilize semantic information by proposing a multi-stage multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that for text recognition, the prediction should be refined in a stage-wise manner. Therefore our key contribution is in designing a stage-wise unrolling attentional decoder where non-differentiability, invoked by discretely predicted character labels, needs to be bypassed for end-to-end training. While the first stage predicts using visual features, subsequent stages refine on top of it using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention along with dense and residual connections between different stages to deal with varying scales of character sizes, for better performance and faster convergence during training. Experimental results show our approach to outperform existing SOTA methods by a considerable margin.
【7】 Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation 标题:文本就是文本,无论何种形式:使用知识蒸馏统一文本识别
作者:Ayan Kumar Bhunia,Aneeshan Sain,Pinaki Nath Chowdhury,Yi-Zhe Song 机构:SketchX, CVSSP, University of Surrey, United Kingdom., FlyTek-Surrey Joint Research Centre on Artificial Intelligence. 备注:IEEE International Conference on Computer Vision (ICCV), 2021 链接:https://arxiv.org/abs/2107.12087 摘要:文本识别是计算机视觉领域的一个基础和广泛的研究课题,主要是由于其广泛的商业应用。然而,这个问题的挑战性决定了研究工作的分散性:处理日常场景中文本的场景文本识别(STR)和处理手写文本的手写文本识别(HTR)。在这篇论文中,我们第一次主张它们的统一——我们的目标是一个单一的模型,可以与两个独立的最先进的STR和HTR模型相竞争。我们首先表明,由于STR和HTR模型内在挑战的差异,它们的交叉使用会引发显著的性能下降。然后,我们通过引入基于知识提炼(KD)的框架来解决它们的结合。然而,这是非常重要的,主要是由于文本序列的可变长度和序列性质,这使得现有的KD技术主要用于全局固定长度的数据是不够的。为此,我们提出了三个蒸馏损失,所有这些都是专为应付上述独特的文字识别的特点。经验证据表明,我们提出的统一模型与个别模型相比,甚至在某些情况下超过了它们。烧蚀研究表明,天真的基线,如两阶段的框架,和领域适应/泛化的替代品不工作以及进一步验证我们的设计的适当性。 摘要:Text recognition remains a fundamental and extensively researched topic in computer vision, largely owing to its wide array of commercial applications. The challenging nature of the very problem however dictated a fragmentation of research efforts: Scene Text Recognition (STR) that deals with text in everyday scenes, and Handwriting Text Recognition (HTR) that tackles hand-written text. In this paper, for the first time, we argue for their unification -- we aim for a single model that can compete favourably with two separate state-of-the-art STR and HTR models. We first show that cross-utilisation of STR and HTR models trigger significant performance drops due to differences in their inherent challenges. We then tackle their union by introducing a knowledge distillation (KD) based framework. This is however non-trivial, largely due to the variable-length and sequential nature of text sequences, which renders off-the-shelf KD techniques that mostly works with global fixed-length data inadequate. For that, we propose three distillation losses all of which are specifically designed to cope with the aforementioned unique characteristics of text recognition. Empirical evidence suggests that our proposed unified model performs on par with individual models, even surpassing them in certain cases. Ablative studies demonstrate that naive baselines such as a two-stage framework, and domain adaption/generalisation alternatives do not work as well, further verifying the appropriateness of our design.
【8】 Towards the Unseen: Iterative Text Recognition by Distilling from Errors 标题:面向未见样本:通过从错误中蒸馏的迭代文本识别
作者:Ayan Kumar Bhunia,Pinaki Nath Chowdhury,Aneeshan Sain,Yi-Zhe Song 机构:SketchX, CVSSP, University of Surrey, United Kingdom., FlyTek-Surrey Joint Research Centre on Artificial Intelligence. 备注:IEEE International Conference on Computer Vision (ICCV), 2021 链接:https://arxiv.org/abs/2107.12081 摘要:视觉文本识别无疑是计算机视觉中研究最广泛的课题之一。到目前为止已经取得了很大的进展,最新的车型开始注重更实用的“野外”环境。然而,一个突出的问题仍然阻碍了实际的部署——现有技术大多难以识别看不见(或很少看到)的字符序列。在本文中,我们提出了一个新的框架来专门解决这个“看不见的”问题。我们的框架本质上是迭代的,因为它利用了从上一次迭代中得到的字符序列的预测知识来增强主网络以改进下一次的预测。我们成功的关键是一个独特的跨模态变分自动编码器作为一个反馈模块,这是训练与文本错误分布数据的存在。该模块将离散的预测字符空间转化为连续的仿射变换参数空间,用于在下一次迭代中对视觉特征图进行条件化。在普通数据集上的实验表明,在传统环境下,其性能优于现有技术。最重要的是,在新的不相交设置下,列车测试标签是互斥的,我们提供了最好的性能,从而展示了对看不见的词进行归纳的能力。 摘要:Visual text recognition is undoubtedly one of the most extensively researched topics in computer vision. Great progress have been made to date, with the latest models starting to focus on the more practical "in-the-wild" setting. However, a salient problem still hinders practical deployment -- prior arts mostly struggle with recognising unseen (or rarely seen) character sequences. In this paper, we put forward a novel framework to specifically tackle this "unseen" problem. Our framework is iterative in nature, in that it utilises predicted knowledge of character sequences from a previous iteration, to augment the main network in improving the next prediction. Key to our success is a unique cross-modal variational autoencoder to act as a feedback module, which is trained with the presence of textual error distribution data. This module importantly translate a discrete predicted character space, to a continuous affine transformation parameter space used to condition the visual feature map at next iteration. Experiments on common datasets have shown competitive performance over state-of-the-arts under the conventional setting. Most importantly, under the new disjoint setup where train-test labels are mutually exclusive, ours offers the best performance thus showcasing the capability of generalising onto unseen words.
【9】 Augmentation Pathways Network for Visual Recognition 标题:用于视觉识别的增强路径网络
作者:Yalong Bai,Mohan Zhou,Yuxiang Chen,Wei Zhang,Bowen Zhou,Tao Mei 机构:JD AI Research, Harbin Institute of Technology 链接:https://arxiv.org/abs/2107.11990 摘要:特别是在数据匮乏的情况下,数据增强对视觉识别有着实际的帮助。然而,这样的成功仅限于若干轻度增强(例如,随机裁剪、翻转)。由于原始图像和增强图像之间存在较大的差距,重度增强(如灰度化、网格打乱)在训练过程中要么不稳定,要么表现出不利影响。本文介绍了一种新的网络设计,称为增强路径(AP),以系统地稳定更大范围增强策略下的训练。值得注意的是,AP能够驾驭重度数据增强,并稳定地提高性能,而无需对增强策略进行仔细挑选。与传统的单一路径不同,增强图像在不同的神经路径中进行处理。主路径处理轻度增强,而其他路径侧重于重度增强。主干网络通过多条路径相互依赖的方式,从各增强间共享的视觉模式中学习,同时抑制噪声模式。此外,我们将AP扩展到一个同质版本和一个高阶场景的异构版本,在实际应用中证明了AP的健壮性和灵活性。在ImageNet基准上的实验结果表明,该方法在更广范围的增强(如Crop、Gray、Grid Shuffle、RandAugment)上具有兼容性和有效性,同时在推理时消耗更少的参数和更低的计算成本。源代码:https://github.com/ap-conv/ap-net. 摘要:Data augmentation is practically helpful for visual recognition, especially at the time of data scarcity. However, such success is only limited to quite a few light augmentations (e.g., random crop, flip). Heavy augmentations (e.g., gray, grid shuffle) are either unstable or show adverse effects during training, owing to the big gap between the original and augmented images. This paper introduces a novel network design, noted as Augmentation Pathways (AP), to systematically stabilize training on a much wider range of augmentation policies. Notably, AP tames heavy data augmentations and stably boosts performance without a careful selection among augmentation policies. Unlike traditional single pathway, augmented images are processed in different neural paths. The main pathway handles light augmentations, while other pathways focus on heavy augmentations. By interacting with multiple paths in a dependent manner, the backbone network robustly learns from shared visual patterns among augmentations, and suppresses noisy patterns at the same time. Furthermore, we extend AP to a homogeneous version and a heterogeneous version for high-order scenarios, demonstrating its robustness and flexibility in practical usage. Experimental results on ImageNet benchmarks demonstrate the compatibility and effectiveness on a much wider range of augmentations (e.g., Crop, Gray, Grid Shuffle, RandAugment), while consuming fewer parameters and lower computational costs at inference time. Source code:https://github.com/ap-conv/ap-net.
【10】 Transductive Maximum Margin Classifier for Few-Shot Learning 标题:用于Few-Shot学习的直推式最大间隔分类器
作者:Fei Pan,Chunlei Xu,Jie Guo,Yanwen Guo 机构:National Key Lab for Novel Software Technology, Nanjing University 链接:https://arxiv.org/abs/2107.11975 摘要:Few-Shot学习的目的是训练一个分类器,当每个类只给出少量的标记样本时,该分类器能够很好地泛化。我们引入了用于小镜头学习的最大边缘分类器(TMMC)。经典最大边距分类器的基本思想是求解一个最优的预测函数,即相应的分离超平面能正确分割训练数据,得到的分类器具有最大的几何边距。在少数镜头学习场景中,训练样本很少,不足以找到对未知数据具有良好泛化能力的分离超平面。TMMC是在给定的任务中混合使用标记的支持集和未标记的查询集来构造的。查询集中的未标记样本可以调整分离超平面,使预测函数在标记样本和未标记样本上都是最优的。此外,我们利用一个高效的拟牛顿算法,L-BFGS方法来优化TMMC。在minimagenet、tieredImagenet和CUB三个标准的Few-Shot学习基准上的实验结果表明,我们的TMMC达到了最先进的精度。 摘要:Few-shot learning aims to train a classifier that can generalize well when just a small number of labeled samples per class are given. We introduce Transductive Maximum Margin Classifier (TMMC) for few-shot learning. The basic idea of the classical maximum margin classifier is to solve an optimal prediction function that the corresponding separating hyperplane can correctly divide the training data and the resulting classifier has the largest geometric margin. In few-shot learning scenarios, the training samples are scarce, not enough to find a separating hyperplane with good generalization ability on unseen data. TMMC is constructed using a mixture of the labeled support set and the unlabeled query set in a given task. The unlabeled samples in the query set can adjust the separating hyperplane so that the prediction function is optimal on both the labeled and unlabeled samples. Furthermore, we leverage an efficient and effective quasi-Newton algorithm, the L-BFGS method to optimize TMMC. Experimental results on three standard few-shot learning benchmarks including miniImagenet, tieredImagenet and CUB suggest that our TMMC achieves state-of-the-art accuracies.
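下面是用 torch.optim.LBFGS 拟合一个直推式最大间隔分类器的简化示意;其中针对无标注查询集的"远离决策边界"项只是对摘要思路的一种假设性写法,并非论文的精确目标函数,超参数亦为假设值。

import torch

def fit_tmmc(support_x, support_y, query_x, n_cls, C=1.0, lam=0.1):
    """Sketch: linear max-margin classifier fitted transductively with L-BFGS."""
    d = support_x.shape[1]
    W = torch.zeros(n_cls, d, requires_grad=True)
    b = torch.zeros(n_cls, requires_grad=True)
    opt = torch.optim.LBFGS([W, b], max_iter=50)

    def closure():
        opt.zero_grad()
        s = support_x @ W.t() + b                                 # (Ns, n_cls) support scores
        one_hot = torch.zeros_like(s).scatter_(1, support_y[:, None], 1.0)
        # multi-class hinge loss on the labeled support set
        margins = (s - (s * one_hot).sum(1, keepdim=True) + 1.0) * (1.0 - one_hot)
        sup_loss = margins.clamp(min=0).sum(1).mean()
        # assumed transductive term: push unlabeled queries away from the decision boundary
        q = query_x @ W.t() + b
        top2 = q.topk(2, dim=1).values
        qry_loss = (1.0 - (top2[:, 0] - top2[:, 1])).clamp(min=0).mean()
        loss = 0.5 * (W ** 2).sum() + C * sup_loss + lam * qry_loss
        loss.backward()
        return loss

    opt.step(closure)
    return W.detach(), b.detach()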
【11】 Temporal Alignment Prediction for Few-Shot Video Classification 标题:用于Few-Shot视频分类的时序对齐预测
作者:Fei Pan,Chunlei Xu,Jie Guo,Yanwen Guo 机构:National Key Lab for Novel Software Technology, Nanjing University 链接:https://arxiv.org/abs/2107.11960 摘要:Few-Shot视频分类的目标是在训练时只需少量的标记视频就可以得到一个具有良好泛化能力的分类模型。然而,在这样的背景下,很难学习视频的区别性特征表示。本文提出了一种基于序列相似性学习的时间对齐预测(TAP)方法用于视频分类。为了获得一对视频的相似性,我们利用时间对齐预测函数来预测两个视频中所有时间位置对之间的对齐分数。此外,该函数的输入还配备了时域中的上下文信息。我们在两个视频分类基准上评估TAP,包括Kinetics和Something V2。实验结果验证了TAP方法的有效性,并显示了其优于现有方法的优越性。 摘要:The goal of few-shot video classification is to learn a classification model with good generalization ability when trained with only a few labeled videos. However, it is difficult to learn discriminative feature representations for videos in such a setting. In this paper, we propose Temporal Alignment Prediction (TAP) based on sequence similarity learning for few-shot video classification. In order to obtain the similarity of a pair of videos, we predict the alignment scores between all pairs of temporal positions in the two videos with the temporal alignment prediction function. Besides, the inputs to this function are also equipped with the context information in the temporal domain. We evaluate TAP on two video classification benchmarks including Kinetics and Something-Something V2. The experimental results verify the effectiveness of TAP and show its superiority over state-of-the-art methods.
【12】 Spatio-Temporal Representation Factorization for Video-based Person Re-Identification 标题:时空表示因式分解在基于视频的人物再识别中的应用
作者:Abhishek Aich,Meng Zheng,Srikrishna Karanam,Terrence Chen,Amit K. Roy-Chowdhury,Ziyan Wu 机构:United Imaging Intelligence, Cambridge, MA, USA,University of California, Riverside, CA, USA 备注:Accepted at IEEE ICCV 2021, Includes Supplementary Material 链接:https://arxiv.org/abs/2107.11878 摘要:尽管最近在基于视频的人物再识别(re-ID)方面取得了许多进展,但是当前的技术仍然面临着现实世界中常见的挑战,例如不同人物之间的外观相似性、遮挡和帧不对中。为了缓解这些问题,我们提出了时空表示因子分解模块(STRF),一种灵活的新计算单元,可与大多数现有的三维卷积神经网络结构结合使用,用于re-ID。与以前的工作相比,STRF的关键创新包括学习区分性时间和空间特征的显式路径,对每个分量进一步分解,以捕获互补的特定于人的外观和运动信息。具体而言,时间因子分解包括两个分支,一个分支用于静态特征(例如,衣服的颜色),这些特征随时间变化不大,另一个分支用于动态特征(例如,行走模式),这些特征随时间变化。此外,空间因子分解还包括两个分支来学习全局(粗段)以及局部(细段)外观特征,其中局部特征在遮挡或空间失准的情况下特别有用。这两个因子分解操作结合在一起,为我们的参数经济型STRF单元提供了一个模块化的体系结构,可以插入任意两个3D卷积层之间,从而形成端到端的学习框架。我们的经验表明,STRF提高了各种现有的基线架构的性能,同时展示了在三个基准上使用标准的人员重新识别评估协议的最新结果。 摘要:Despite much recent progress in video-based person re-identification (re-ID), the current state-of-the-art still suffers from common real-world challenges such as appearance similarity among various people, occlusions, and frame misalignment. To alleviate these problems, we propose Spatio-Temporal Representation Factorization module (STRF), a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID. The key innovations of STRF over prior work include explicit pathways for learning discriminative temporal and spatial features, with each component further factorized to capture complementary person-specific appearance and motion information. Specifically, temporal factorization comprises two branches, one each for static features (e.g., the color of clothes) that do not change much over time, and dynamic features (e.g., walking patterns) that change over time. Further, spatial factorization also comprises two branches to learn both global (coarse segments) as well as local (finer segments) appearance features, with the local features particularly useful in cases of occlusion or spatial misalignment. These two factorization operations taken together result in a modular architecture for our parameter-wise economic STRF unit that can be plugged in between any two 3D convolutional layers, resulting in an end-to-end learning framework. We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results using standard person re-identification evaluation protocols on three benchmarks.
【13】 Bangla sign language recognition using concatenated BdSL network 标题:基于级联BDSL网络的孟加拉手语识别
作者:Thasin Abedin,Khondokar S. S. Prottoy,Ayana Moshruba,Safayat Bin Hakim 机构:Department of Electrical and Electronic Engineering, Islamic University of Technology (IUT) 链接:https://arxiv.org/abs/2107.11818 摘要:手语是聋哑人和聋哑人交流的唯一媒介。因此,与大众的沟通对这个少数群体来说始终是一个挑战。特别是在孟加拉语手语(BdSL)中,有38个字母表,其中一些有几乎相同的符号。因此,在BdSL识别中,除了从传统的卷积神经网络(CNN)中提取视觉特征外,手的姿态也是一个重要的因素。本文提出了一种由CNN图像网络和姿态估计网络组成的级联BdSL网络结构。图像网络在获取视觉特征的同时,通过姿态估计网络获取手部关键点的相对位置,获得附加特征,以应对BdSL符号的复杂性。实验结果表明,该附加姿态估计网络的有效性。 摘要:Sign language is the only medium of communication for the hearing impaired and the deaf and dumb community. Communication with the general mass is thus always a challenge for this minority group. Especially in Bangla sign language (BdSL), there are 38 alphabets with some having nearly identical symbols. As a result, in BdSL recognition, the posture of hand is an important factor in addition to visual features extracted from traditional Convolutional Neural Network (CNN). In this paper, a novel architecture "Concatenated BdSL Network" is proposed which consists of a CNN based image network and a pose estimation network. While the image network gets the visual features, the relative positions of hand keypoints are taken by the pose estimation network to obtain the additional features to deal with the complexity of the BdSL symbols. A score of 91.51% was achieved by this novel approach in test set and the effectiveness of the additional pose estimation network is suggested by the experimental results.
【14】 Adaptive Recursive Circle Framework for Fine-grained Action Recognition 标题:适用于细粒度动作识别的自适应递归圆框架
作者:Hanxi Lin,Xinxiao Wu,Jiebo Luo 机构:Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing , China, Department of Computer Science, University of Rochester, Rochester NY , USA 链接:https://arxiv.org/abs/2107.11813 摘要:如何对视频中的细粒度时空动态进行建模一直是动作识别的一个难题。它要求学习深刻而丰富的特征,对微妙而抽象的动作具有优越的识别性。大多数现有的方法都是以一种纯前馈的方式生成一个层的特征,在这种方式中,信息从输入到输出沿一个方向移动。它们依靠堆叠更多的层来获得更强大的特征,带来额外的不可忽略的开销。在本文中,我们提出了一个自适应递归圆(ARC)框架,一个纯前馈层的细粒度装饰器。它继承了原始层的操作符和参数,但在使用这些操作符和参数时略有不同。具体来说,该层的输入被视为一个演化状态,其更新与特征生成交替进行。在每个递归步骤中,输入状态由先前生成的特征丰富,特征生成由新更新的输入状态进行。我们希望ARC框架能够以较低的代价引入更精细的特征和多尺度的感受野,从而促进细粒度的动作识别。在多个基准上观察到相对前馈基线的显著改进。例如,配备了ARC的TSM-ResNet18在Something-Something V1和Diving48上性能优于TSM-ResNet50,同时FLOPs减少48%,模型参数减少52%。 摘要:How to model fine-grained spatial-temporal dynamics in videos has been a challenging problem for action recognition. It requires learning deep and rich features with superior distinctiveness for the subtle and abstract motions. Most existing methods generate features of a layer in a pure feedforward manner, where the information moves in one direction from inputs to outputs. And they rely on stacking more layers to obtain more powerful features, bringing extra non-negligible overheads. In this paper, we propose an Adaptive Recursive Circle (ARC) framework, a fine-grained decorator for pure feedforward layers. It inherits the operators and parameters of the original layer but is slightly different in the use of those operators and parameters. Specifically, the input of the layer is treated as an evolving state, and its update is alternated with the feature generation. At each recursive step, the input state is enriched by the previously generated features and the feature generation is made with the newly updated input state. We hope the ARC framework can facilitate fine-grained action recognition by introducing deeply refined features and multi-scale receptive fields at a low cost. Significant improvements over feedforward baselines are observed on several benchmarks. For example, an ARC-equipped TSM-ResNet18 outperforms TSM-ResNet50 with 48% fewer FLOPs and 52% model parameters on Something-Something V1 and Diving48.
【15】 PoseFace: Pose-Invariant Features and Pose-Adaptive Loss for Face Recognition 标题:PoseFace:面向人脸识别的姿态不变特征和姿态自适应损失
作者:Qiang Meng,Xiaqing Xu,Xiaobo Wang,Yang Qian,Yunxiao Qin,Zezheng Wang,Chenxu Zhao,Feng Zhou,Zhen Lei 机构:†Institute of Automation, Chinese Academy of Sciences‡The University of Sydney§Northwestern Polytechnical University¶Corresponding author||Academy of Sciences, Institute of Automation, China††School of Artificial Intelligence 链接:https://arxiv.org/abs/2107.11721 摘要:尽管深度学习方法在人脸识别方面取得了巨大的成功,但在无约束的环境中(例如,在监视和照片标记的情况下),姿势变化较大时,性能会严重下降。为了解决这个问题,当前的方法要么部署特定于姿势的模型,要么通过附加模块将人脸正面化。尽管如此,他们忽略了一个事实,即身份信息应该在不同姿势之间保持一致,并且没有意识到在训练期间正面和侧面人脸图像之间的数据不平衡。本文提出了一种有效的PoseFace框架,该框架利用人脸标志来分离姿态不变特征,并利用姿态自适应损失自适应地处理不平衡问题。在Multi-PIE、CFP、CPLFW和IJB等基准上的大量实验结果表明了该方法的优越性。 摘要:Despite the great success achieved by deep learning methods in face recognition, severe performance drops are observed for large pose variations in unconstrained environments (e.g., in cases of surveillance and photo-tagging). To address it, current methods either deploy pose-specific models or frontalize faces by additional modules. Still, they ignore the fact that identity information should be consistent across poses and are not realizing the data imbalance between frontal and profile face images during training. In this paper, we propose an efficient PoseFace framework which utilizes the facial landmarks to disentangle the pose-invariant features and exploits a pose-adaptive loss to handle the imbalance issue adaptively. Extensive experimental results on the benchmarks of Multi-PIE, CFP, CPLFW and IJB have demonstrated the superiority of our method over the state-of-the-arts.
【16】 Temporal-wise Attention Spiking Neural Networks for Event Streams Classification 标题:用于事件流分类的时序注意力脉冲神经网络
作者:Man Yao,Huanhuan Gao,Guangshe Zhao,Dingheng Wang,Yihan Lin,Zhaoxu Yang,Guoqi Li 机构:School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an , China, Department of Precision Instrument, Center for Brain Inspired Computing Research, Tsinghua University, Beijing , China 备注:Accepted by ICCV 2021 链接:https://arxiv.org/abs/2107.11711 摘要:如何有效地处理事件稀疏、不均匀、时间分辨率为微秒的时空事件流具有重要的应用价值。脉冲神经网络(Spiking neural network,SNN)作为一种受大脑启发的事件触发计算模型,具有从事件流中提取有效时空特征的潜力。然而,当将单个事件聚合成具有新的更高时间分辨率的帧时,现有的SNN模型并不重视序列帧具有不同的信噪比,因为事件流是稀疏和不均匀的。这种情况会干扰现有snn的性能。在这项工作中,我们提出了一个时序注意SNN(TA-SNN)模型来学习处理事件流的基于帧的表示。具体而言,我们将注意力概念扩展到时间输入,在训练阶段判断框架对最终决策的重要性,在推理阶段丢弃无关框架。我们证明了TA-SNN模型提高了事件流分类任务的准确性。我们还研究了多尺度时间分辨率对基于帧表示的影响。我们的方法在三个不同的分类任务上进行了测试:手势识别、图像分类和语音数字识别。我们报告了这些任务的最新结果,并且在仅60毫秒的时间内,手势识别的准确率得到了本质性的提高(几乎19%)。 摘要:How to effectively and efficiently deal with spatio-temporal event streams, where the events are generally sparse and non-uniform and have the microsecond temporal resolution, is of great value and has various real-life applications. Spiking neural network (SNN), as one of the brain-inspired event-triggered computing models, has the potential to extract effective spatio-temporal features from the event streams. However, when aggregating individual events into frames with a new higher temporal resolution, existing SNN models do not attach importance to that the serial frames have different signal-to-noise ratios since event streams are sparse and non-uniform. This situation interferes with the performance of existing SNNs. In this work, we propose a temporal-wise attention SNN (TA-SNN) model to learn frame-based representation for processing event streams. Concretely, we extend the attention concept to temporal-wise input to judge the significance of frames for the final decision at the training stage, and discard the irrelevant frames at the inference stage. We demonstrate that TA-SNN models improve the accuracy of event streams classification tasks. We also study the impact of multiple-scale temporal resolutions for frame-based representation. Our approach is tested on three different classification tasks: gesture recognition, image classification, and spoken digit recognition. We report the state-of-the-art results on these tasks, and get the essential improvement of accuracy (almost 19%) for gesture recognition with only 60 ms.
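"时序注意力"的思想可以用下面的粗略示意说明:对事件流聚合出的各帧打分,训练时按重要性加权、推理时丢弃低分帧;这里省略了SNN特有的脉冲动力学,维度与丢弃比例均为假设,并非论文实现。

import torch
import torch.nn as nn

class TemporalAttentionSketch(nn.Module):
    """Sketch of a temporal-wise attention gate over frame-based event representations."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 1))

    def forward(self, x, drop_ratio=0.0):
        # x: (N, T, C, H, W) frames aggregated from an event stream
        ctx = x.mean(dim=(-2, -1))                   # (N, T, C) spatial squeeze
        w = torch.sigmoid(self.score(ctx))           # (N, T, 1) frame importance
        if drop_ratio > 0:                           # inference: discard the least relevant frames
            k = int(x.size(1) * drop_ratio)
            if k > 0:
                idx = w.squeeze(-1).topk(k, dim=1, largest=False).indices
                w = w.clone()
                w.scatter_(1, idx.unsqueeze(-1), 0.0)
        return x * w[..., None, None]                # re-weighted frames, same shape as x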
【17】 Deep Machine Learning Based Egyptian Vehicle License Plate Recognition Systems 标题:基于深度机器学习的埃及车牌识别系统
作者:Mohamed Shehata,Mohamed Taha Abou-Kreisha,Hany Elnashar 机构:Systems & Computers Engineering, Al-Azhar University, Egypt, , Mathematical & Computer Science, Department, Al-, Azhar University. Egypt., Project Development Unite, Computers and Artificial Intelligent, Beni-Suef University 备注:None 链接:https://arxiv.org/abs/2107.11640 摘要:自动车牌检测与识别是近年来的一个重要研究课题。VLP定位和识别是利用数字技术进行交通管理的关键技术之一。本文开发了四种埃及汽车牌照识别智能系统。两个系统是基于字符识别的,分别是(System1,经典机器学习字符识别)和(System2,深度机器学习字符识别)。另外两个系统是基于全车牌识别的,分别是(System3,经典机器学习的全车牌识别)和(System4,深度机器学习的全车牌识别)。我们使用目标检测算法和基于机器学习的目标识别算法。在实际图像上对所开发的系统进行了性能测试,实验结果表明,采用深度学习方法可以获得最佳的VLP检测准确率。其中VLP的检测准确率比经典系统高出32%。然而,经典的车牌阿拉伯字符(VLPAC)检测方法提供了最佳的检测准确率。其中VLPAC的检测准确率比基于深度学习的系统高出6%。同时,实验结果表明,在VLP识别过程中,深度学习比经典技术有更好的效果。其中识别准确率比经典系统高8%。最后,本文提出了一个基于统计和深度机器学习的鲁棒VLP识别系统。 摘要:Automated Vehicle License Plate (VLP) detection and recognition have ended up being a significant research issue as of late. VLP localization and recognition are some of the most essential techniques for managing traffic using digital techniques. In this paper, four smart systems are developed to recognize Egyptian vehicles license plates. Two systems are based on character recognition, which are (System1, Characters Recognition with Classical Machine Learning) and (System2, Characters Recognition with Deep Machine Learning). The other two systems are based on the whole plate recognition which are (System3, Whole License Plate Recognition with Classical Machine Learning) and (System4, Whole License Plate Recognition with Deep Machine Learning). We use object detection algorithms, and machine learning based object recognition algorithms. The performance of the developed systems has been tested on real images, and the experimental results demonstrate that the best detection accuracy rate for VLP is provided by using the deep learning method. Where the VLP detection accuracy rate is better than the classical system by 32%. However, the best detection accuracy rate for Vehicle License Plate Arabic Character (VLPAC) is provided by using the classical method. Where VLPAC detection accuracy rate is better than the deep learning-based system by 6%. Also, the results show that deep learning is better than the classical technique used in VLP recognition processes. Where the recognition accuracy rate is better than the classical system by 8%. Finally, the paper output recommends a robust VLP recognition system based on both statistical and deep machine learning.
【18】 Multi-Label Image Classification with Contrastive Learning 标题:基于对比学习的多标签图像分类
作者:Son D. Dao,Ethan Zhao,Dinh Phung,Jianfei Cai 机构:Monash University 链接:https://arxiv.org/abs/2107.11626 摘要:近年来,对比学习作为一种学习潜在表征的有效方法,在各个领域得到了越来越广泛的应用并取得了成功。对比学习在单标签分类中的成功促使我们利用这一学习框架来增强特征的判别性,从而在多标签图像分类中获得更好的性能。在本文中,我们发现直接应用对比学习在多标签情况下很难带来改善。因此,本文提出了一种新的基于对比学习的多标签分类框架,该框架在完全监督的设置下学习图像在不同标签上下文中的多种表示。这有助于将对比学习简单而直观地引入我们的模型,以提高其在多标签图像分类中的性能。在两个基准数据集上的大量实验表明,与多标签分类的先进方法相比,该框架达到了最先进的性能。 摘要:Recently, as an effective way of learning latent representations, contrastive learning has been increasingly popular and successful in various domains. The success of contrastive learning in single-label classifications motivates us to leverage this learning framework to enhance distinctiveness for better performance in multi-label image classification. In this paper, we show that a direct application of contrastive learning can hardly improve in multi-label cases. Accordingly, we propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting, which learns multiple representations of an image under the context of different labels. This facilitates a simple yet intuitive adaption of contrastive learning into our model to boost its performance in multi-label image classification. Extensive experiments on two benchmark datasets show that the proposed framework achieves state-of-the-art performance in the comparison with the advanced methods in multi-label classification.
【19】 Going Deeper into Semi-supervised Person Re-identification 标题:深入研究半监督人员再识别
作者:Olga Moskvyak,Frederic Maire,Feras Dayoub,Mahsa Baktashmotlagh 机构:School of Electrical Engineering and Robotics, Queensland University of Technology, George Street, Brisbane City, QLD , Australia, School of Information Technology and Electrical Engineering, The University of Queensland, St Lucia, QLD , Australia 链接:https://arxiv.org/abs/2107.11566 摘要:行人重识别是在不同摄像机视角下识别同一个人的一项具有挑战性的任务。为此任务训练卷积神经网络(CNN)需要标注一个大规模数据集,因此需要耗时的人工跨摄像机行人匹配。为了减少对标注数据的需求,我们重点研究了一种只需要对训练数据的一个子集进行标注的半监督方法。我们对有限标签条件下的行人重识别领域进行了全面的研究。现有的相关工作存在局限性:它们利用了来自多个CNN的特征,并且需要已知未标注数据中的身份数量。为了克服这些局限,我们建议使用来自单个CNN的基于部件的特征,而不需要知道标签空间(即身份的数量)。这使得我们的方法更适合实际场景,并且大大减少了对计算资源的需求。我们还提出了一种PartMixUp损失,提高了半监督环境下用于伪标注的基于部件的特征的判别能力。我们的方法在三个大规模行人重识别数据集上优于最新的结果,并且在只有三分之一的标注身份的情况下达到了与完全监督方法相同的性能水平。 摘要:Person re-identification is the challenging task of identifying a person across different camera views. Training a convolutional neural network (CNN) for this task requires annotating a large dataset, and hence, it involves the time-consuming manual matching of people across cameras. To reduce the need for labeled data, we focus on a semi-supervised approach that requires only a subset of the training data to be labeled. We conduct a comprehensive survey in the area of person re-identification with limited labels. Existing works in this realm are limited in the sense that they utilize features from multiple CNNs and require the number of identities in the unlabeled data to be known. To overcome these limitations, we propose to employ part-based features from a single CNN without requiring the knowledge of the label space (i.e., the number of identities). This makes our approach more suitable for practical scenarios, and it significantly reduces the need for computational resources. We also propose a PartMixUp loss that improves the discriminative ability of learned part-based features for pseudo-labeling in semi-supervised settings. Our method outperforms the state-of-the-art results on three large-scale person re-id datasets and achieves the same level of performance as fully supervised methods with only one-third of labeled identities.
【20】 Semantic-guided Pixel Sampling for Cloth-Changing Person Re-identification 标题:语义引导的像素采样在换衣行人再识别中的应用
作者:Xiujun Shu,Ge Li,Xiao Wang,Weijian Ruan,Qi Tian 机构: Ge Li)Xiujun Shu is with Peng Cheng Laboratory and Peking University, Ge Li is with the School of Electronic and Computer Engineering, PekingUniversity 备注:This paper has been published on IEEE Signal Processing Letters 链接:https://arxiv.org/abs/2107.11522 摘要:换衣人再识别(re-ID)是一个新兴的研究课题。这项任务相当具有挑战性,迄今尚未得到充分研究。目前的研究主要集中在体形或轮廓素描方面,但由于视点和姿态的变化,这些研究不够健壮。这项任务的关键是利用与布料无关的线索。本文提出了一种语义引导的像素采样方法,用于换衣人员身份识别任务。我们没有明确定义要提取的特征,而是强制模型自动学习不相关的线索。具体地说,我们首先识别行人的上衣和裤子,然后通过从其他行人身上采样像素来随机改变它们。更改后的样本保留了身份标签,但在不同的行人之间交换了衣服或裤子的像素。此外,我们采用损失函数来约束学习到的特征,以保持变化前后的一致性。这样,模特就被迫学习与上衣和裤子无关的线索。我们在最新发布的PRCC数据集上进行了广泛的实验。我们的方法在Rank1精度上达到65.8%,比以前的方法有很大的提高。代码可在https://github.com/shuxjweb/pixel_sampling.git. 摘要:Cloth-changing person re-identification (re-ID) is a new rising research topic that aims at retrieving pedestrians whose clothes are changed. This task is quite challenging and has not been fully studied to date. Current works mainly focus on body shape or contour sketch, but they are not robust enough due to view and posture variations. The key to this task is to exploit cloth-irrelevant cues. This paper proposes a semantic-guided pixel sampling approach for the cloth-changing person re-ID task. We do not explicitly define which feature to extract but force the model to automatically learn cloth-irrelevant cues. Specifically, we first recognize the pedestrian's upper clothes and pants, then randomly change them by sampling pixels from other pedestrians. The changed samples retain the identity labels but exchange the pixels of clothes or pants among different pedestrians. Besides, we adopt a loss function to constrain the learned features to keep consistent before and after changes. In this way, the model is forced to learn cues that are irrelevant to upper clothes and pants. We conduct extensive experiments on the latest released PRCC dataset. Our method achieved 65.8% on Rank1 accuracy, which outperforms previous methods with a large margin. The code is available at https://github.com/shuxjweb/pixel_sampling.git.
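The pixel-sampling idea above (swap clothes/pants pixels between pedestrians using parsing masks, then enforce feature consistency) can be sketched as follows. The parsing-label ids, the positional sampling rule and the MSE consistency term are assumptions made for illustration; the paper's exact sampling scheme and loss may differ.

```python
import torch
import torch.nn.functional as F

# Assumed human-parsing label ids for this sketch (not the paper's exact ids).
UPPER, PANTS = 1, 2

def swap_cloth_pixels(img_a, img_b, parse_a):
    """Replace the upper-clothes and pants pixels of img_a with pixels taken from
    img_b at the same spatial positions; the identity label of img_a is kept.
    img_*: (3, H, W) tensors, parse_a: (H, W) integer parsing map of img_a."""
    mask = (parse_a == UPPER) | (parse_a == PANTS)
    out = img_a.clone()
    out[:, mask] = img_b[:, mask]          # naive positional pixel sampling
    return out

def consistency_loss(backbone, img, img_changed):
    """Keep features consistent before and after the change, pushing the backbone
    toward cloth-irrelevant cues; MSE is an illustrative choice."""
    f_orig = backbone(img.unsqueeze(0))
    f_changed = backbone(img_changed.unsqueeze(0))
    return F.mse_loss(f_changed, f_orig.detach())
```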
【21】 TinyAction Challenge: Recognizing Real-world Low-resolution Activities in Videos 标题:TinyAction挑战:识别视频中真实世界的低分辨率活动
作者:Praveen Tirupattur,Aayush J Rana,Tushar Sangam,Shruti Vyas,Yogesh S Rawat,Mubarak Shah 机构:Center for Research in Computer Vision, University of Central Florida, Orlando, Florida, USA, Email: 备注:8 pages. arXiv admin note: text overlap with arXiv:2007.07355 链接:https://arxiv.org/abs/2107.11494 摘要:本文总结了在CVPR 2021的ActivityNet研讨会上组织的TinyAction挑战。该挑战的重点是识别视频中存在的真实世界的低分辨率活动。动作识别任务目前主要集中在对高质量视频中的动作进行分类,在这些视频中,演员和动作清晰可见。虽然各种方法在最近的工作中被证明是有效的识别任务,他们往往不处理低分辨率的视频,其中行动发生在一个很小的区域。然而,许多真实世界的安全视频往往以很小的分辨率捕捉到实际的动作,这使得在一个很小的区域进行动作识别成为一项具有挑战性的任务。在这项工作中,我们提出了一个基准数据集TinyVIRAT-v2,它由自然发生的低分辨率动作组成。这是TinyVIRAT数据集的扩展,由具有多个标签的操作组成。这些视频都是从安全视频中提取出来的,这使得它们更真实,更具挑战性。我们以目前最先进的动作识别方法为基准,提出了TinyAction挑战。 摘要:This paper summarizes the TinyAction challenge which was organized in ActivityNet workshop at CVPR 2021. This challenge focuses on recognizing real-world low-resolution activities present in videos. Action recognition task is currently focused around classifying the actions from high-quality videos where the actors and the action is clearly visible. While various approaches have been shown effective for recognition task in recent works, they often do not deal with videos of lower resolution where the action is happening in a tiny region. However, many real world security videos often have the actual action captured in a small resolution, making action recognition in a tiny region a challenging task. In this work, we propose a benchmark dataset, TinyVIRAT-v2, which is comprised of naturally occuring low-resolution actions. This is an extension of the TinyVIRAT dataset and consists of actions with multiple labels. The videos are extracted from security videos which makes them realistic and more challenging. We use current state-of-the-art action recognition methods on the dataset as a benchmark, and propose the TinyAction Challenge.
【22】 MAG-Net: Mutli-task attention guided network for brain tumor segmentation and classification 标题:MAG-Net:用于脑肿瘤分割和分类的多任务注意力引导网络
作者:Sachin Gupta,Narinder Singh Punn,Sanjay Kumar Sonbhadra,Sonali Agarwal 机构:Agarwal, Indian Institute of Information Technology Allahabad, India 链接:https://arxiv.org/abs/2107.12321 摘要:脑瘤是所有年龄组中最常见和最致命的疾病。放射科医生通常采用MRI检查来鉴别和诊断肿瘤。正确识别肿瘤的部位和类型有助于肿瘤的诊断和后续治疗方案的制定。然而,对于任何放射科医生来说,分析这种扫描是一项复杂而耗时的任务。在基于深度学习的计算机辅助诊断系统的启发下,本文提出了一种基于多任务注意引导的编解码网络(MAG-Net)来利用MRI图像对脑肿瘤区域进行分类和分割。MAG网络是在Figshare数据集上训练和评估的,该数据集包括三种类型的肿瘤:脑膜瘤、胶质瘤和垂体瘤的冠状面、轴面和矢状面。通过详尽的实验测试,与现有的最新模型相比,该模型取得了令人满意的结果,同时在其他最新模型中训练参数的数目最少。 摘要:Brain tumor is the most common and deadliest disease that can be found in all age groups. Generally, MRI modality is adopted for identifying and diagnosing tumors by the radiologists. The correct identification of tumor regions and its type can aid to diagnose tumors with the followup treatment plans. However, for any radiologist analysing such scans is a complex and time-consuming task. Motivated by the deep learning based computer-aided-diagnosis systems, this paper proposes multi-task attention guided encoder-decoder network (MAG-Net) to classify and segment the brain tumor regions using MRI images. The MAG-Net is trained and evaluated on the Figshare dataset that includes coronal, axial, and sagittal views with 3 types of tumors meningioma, glioma, and pituitary tumor. With exhaustive experimental trials the model achieved promising results as compared to existing state-of-the-art models, while having least number of training parameters among other state-of-the-art models.
分割|语义相关(13篇)
【1】 Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference 标题:用于视频和语言推理的语义一致性自适应层次图推理
作者:Juncheng Li,Siliang Tang,Linchao Zhu,Haochen Shi,Xuanwen Huang,Fei Wu,Yi Yang,Yueting Zhuang 机构:Zhejiang University, ReLER, University of Technology Sydney, Universit´e de Montr´eal 链接:https://arxiv.org/abs/2107.12270 摘要:视频和语言推理是最近提出的一项视频和语言联合理解的任务。这项新任务需要一个模型来推断一个自然语言语句是否包含或与给定的视频片段相矛盾。在本文中,我们研究了如何解决这项任务的三个关键挑战:判断涉及多个语义的语句的全局正确性、视频和字幕的联合推理以及建立远程关系和复杂社会互动的模型。首先,我们提出了一个自适应的层次图网络,实现了对复杂交互中视频的深入理解。具体来说,它在三个层次上对视频和字幕进行联合推理,其中图形结构根据语句的语义结构进行自适应调整。其次,我们引入语义一致性学习,从三个层次明确地鼓励自适应层次图网络的语义一致性。语义连贯学习可以进一步提高视觉与语言学的一致性,提高视频片段间的连贯性。实验结果表明,该方法明显优于基线。 摘要:Video-and-Language Inference is a recently proposed task for joint video-and-language understanding. This new task requires a model to draw inference on whether a natural language statement entails or contradicts a given video clip. In this paper, we study how to address three critical challenges for this task: judging the global correctness of the statement involved multiple semantic meanings, joint reasoning over video and subtitles, and modeling long-range relationships and complex social interactions. First, we propose an adaptive hierarchical graph network that achieves in-depth understanding of the video over complex interactions. Specifically, it performs joint reasoning over video and subtitles in three hierarchies, where the graph structure is adaptively adjusted according to the semantic structures of the statement. Secondly, we introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network from three hierarchies. The semantic coherence learning can further improve the alignment between vision and linguistics, and the coherence across a sequence of video segments. Experimental results show that our method significantly outperforms the baseline by a large margin.
【2】 Efficient Video Object Segmentation with Compressed Video 标题:基于压缩视频的高效视频对象分割
作者:Kai Xu,Angela Yao 机构:National University of Singapore 备注:9 pages, 5 figures 链接:https://arxiv.org/abs/2107.12192 摘要:利用视频的时间冗余性,提出了一种有效的半监督视频对象分割推理框架。我们的方法对选定的关键帧进行推断,并通过基于压缩视频比特流的运动矢量和残差的传播对其他帧进行预测。具体来说,我们提出了一种新的基于运动矢量的扭曲方法,用于以多参考的方式将分割模板从关键帧传播到其他帧。此外,我们还提出了一个基于残差的细化模块,该模块可以对按块传播的分割掩模进行校正和添加细节。我们的方法是灵活的,可以添加到现有的视频对象分割算法之上。以带top-k滤波的STM作为我们的基本模型,我们在DAVIS16和YouTube-VOS上取得了极具竞争力的结果,大大提高了速度,达到了4.9倍,精度损失很小。 摘要:We propose an efficient inference framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video. Our method performs inference on selected keyframes and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream. Specifically, we propose a new motion vector-based warping method for propagating segmentation masks from keyframes to other frames in a multi-reference manner. Additionally, we propose a residual-based refinement module that can correct and add detail to the block-wise propagated segmentation masks. Our approach is flexible and can be added on top of existing video object segmentation algorithms. With STM with top-k filtering as our base model, we achieved highly competitive results on DAVIS16 and YouTube-VOS with substantial speedups of up to 4.9X with little loss in accuracy.
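A rough sketch of block-wise motion-vector warping of a keyframe mask, the core propagation step described above. The assumed motion-vector layout (one (dy, dx) per 16x16 block, pointing from the current frame back to the keyframe, with frame sizes divisible by the block size) is a simplification; real bitstreams mix block sizes and references, and the paper additionally refines the result with residuals.

```python
import numpy as np

def warp_mask_with_motion_vectors(key_mask, motion_vectors, block=16):
    """Propagate a keyframe segmentation mask to the current frame using
    block-wise motion vectors decoded from the compressed bitstream.
    key_mask: (H, W) integer mask; H and W assumed divisible by `block`.
    motion_vectors: (H//block, W//block, 2) array of (dy, dx) per block."""
    h, w = key_mask.shape
    warped = np.zeros_like(key_mask)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            sy = int(np.clip(by + dy, 0, h - block))   # source block in the keyframe
            sx = int(np.clip(bx + dx, 0, w - block))
            warped[by:by + block, bx:bx + block] = key_mask[sy:sy + block, sx:sx + block]
    return warped

if __name__ == "__main__":
    mask = np.zeros((64, 64), dtype=np.uint8); mask[16:48, 16:48] = 1
    mv = np.full((4, 4, 2), fill_value=8)              # every block moved by (8, 8)
    print(warp_mask_with_motion_vectors(mask, mv).sum())
```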
【3】 3D AGSE-VNet: An Automatic Brain Tumor MRI Data Segmentation Framework 标题:3D AGSE-VNet:一种脑肿瘤MRI数据自动分割框架
作者:Xi Guan,Guang Yang,Jianming Ye,Weiji Yang,Xiaomei Xu,Weiwei Jiang,Xiaobo Lai 机构: School of Medical Technology and Information Engineering, Zhejiang Chinese Medical University, Cardiovascular Research Centre, Royal Brompton Hospital, London, SW,NP, UK, National Heart and Lung Institute, Imperial College London, London, SW,AZ, UK 备注:34 pages, 12 figure, Accepted by BMC Medical Imaging 链接:https://arxiv.org/abs/2107.12046 摘要:背景:脑胶质瘤是最常见的脑恶性肿瘤,发病率高,死亡率高达3%以上,严重危害人类健康。临床上获取脑肿瘤的主要方法是MRI。从多模态MRI扫描图像中分割脑肿瘤区域,有助于治疗检查、诊断后监测和疗效评价。然而,目前临床上常用的脑肿瘤分割操作仍然是手工分割,导致其耗时长,不同算子之间的性能差异较大,迫切需要一种一致、准确的自动分割方法。方法:针对上述问题,提出了一种脑肿瘤MRI数据自动分割框架AGSE-VNet。在我们的研究中,在每个编码器中加入压缩和激励(SE)模块,在每个解码器中加入注意引导滤波器(AG)模块,利用信道关系自动增强信道中的有用信息,抑制无用信息,利用注意机制引导边缘信息,去除噪声等无关信息的影响。结果:我们使用BraTS2020挑战在线验证工具来评估我们的方法。验证的重点是整个肿瘤(WT)、肿瘤核心(TC)和增强肿瘤(ET)的Dice评分分别为0.68、0.85和0.70。结论:尽管MRI图像强度不同,但AGSE-VNet不受肿瘤大小的影响,能更准确地提取三个区域的特征,取得了令人印象深刻的效果,为脑肿瘤患者的临床诊断和治疗做出了突出贡献。 摘要:Background: Glioma is the most common brain malignant tumor, with a high morbidity rate and a mortality rate of more than three percent, which seriously endangers human health. The main method of acquiring brain tumors in the clinic is MRI. Segmentation of brain tumor regions from multi-modal MRI scan images is helpful for treatment inspection, post-diagnosis monitoring, and effect evaluation of patients. However, the common operation in clinical brain tumor segmentation is still manual segmentation, lead to its time-consuming and large performance difference between different operators, a consistent and accurate automatic segmentation method is urgently needed. Methods: To meet the above challenges, we propose an automatic brain tumor MRI data segmentation framework which is called AGSE-VNet. In our study, the Squeeze and Excite (SE) module is added to each encoder, the Attention Guide Filter (AG) module is added to each decoder, using the channel relationship to automatically enhance the useful information in the channel to suppress the useless information, and use the attention mechanism to guide the edge information and remove the influence of irrelevant information such as noise. Results: We used the BraTS2020 challenge online verification tool to evaluate our approach. The focus of verification is that the Dice scores of the whole tumor (WT), tumor core (TC) and enhanced tumor (ET) are 0.68, 0.85 and 0.70, respectively. Conclusion: Although MRI images have different intensities, AGSE-VNet is not affected by the size of the tumor, and can more accurately extract the features of the three regions, it has achieved impressive results and made outstanding contributions to the clinical diagnosis and treatment of brain tumor patients.
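For reference, a standard 3D squeeze-and-excitation (SE) block of the kind the abstract adds to each encoder: global pooling squeezes each channel to one value, and two fully-connected layers produce channel weights that re-scale the features. Channel count and reduction ratio are placeholders, and AGSE-VNet's attention-guide filter in the decoder is not shown.

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Vanilla 3D squeeze-and-excitation: channel-wise re-weighting of a
    volumetric feature map. Hyper-parameters are illustrative."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, D, H, W)
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c))     # (B, C) channel weights
        return x * w.view(b, c, 1, 1, 1)

if __name__ == "__main__":
    se = SEBlock3D(channels=32)
    vol = torch.randn(1, 32, 16, 64, 64)         # a multi-modal MRI feature volume
    print(se(vol).shape)                         # torch.Size([1, 32, 16, 64, 64])
```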
【4】 Language Models as Zero-shot Visual Semantic Learners 标题:作为Zero-Shot视觉语义学习者的语言模型
作者:Yue Jiao,Jonathon Hare,Adam Prügel-Bennett 机构:University of Southampton, Southampton, UK 链接:https://arxiv.org/abs/2107.12021 摘要:视觉语义嵌入(Visual Semantic Embedding,VSE)模型将图像映射到丰富的语义嵌入空间,是目标识别和Zero-Shot学习的一个里程碑。目前的VSE方法严重依赖于静态词嵌入技术。在这项工作中,我们提出了一个视觉语义嵌入探针(VSEP),用来探测视觉语义理解任务中语境化词嵌入的语义信息。研究表明,Transformer语言模型中编码的知识可以用于需要视觉语义理解的任务,而具有上下文表示的VSEP可以作为组合式Zero-Shot学习器来区分复杂场景中的词级对象表示。我们进一步引入了一个使用VSEP的Zero-Shot设置,以评估模型将新单词与新视觉类别相关联的能力。我们发现,当对象的组合链较短时,语言模型中的上下文表示优于静态词嵌入。我们还注意到,目前的视觉语义嵌入模型缺乏互斥偏置,这限制了它们的性能。 摘要:Visual Semantic Embedding (VSE) models, which map images into a rich semantic embedding space, have been a milestone in object recognition and zero-shot learning. Current approaches to VSE heavily rely on static word embedding techniques. In this work, we propose a Visual Semantic Embedding Probe (VSEP) designed to probe the semantic information of contextualized word embeddings in visual semantic understanding tasks. We show that the knowledge encoded in transformer language models can be exploited for tasks requiring visual semantic understanding. The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner. We further introduce a zero-shot setting with VSEPs to evaluate a model's ability to associate a novel word with a novel visual category. We find that contextual representations in language models outperform static word embeddings, when the compositional chain of object is short. We notice that current visual semantic embedding models lack a mutual exclusivity bias which limits their performance.
【5】 What Remains of Visual Semantic Embeddings 标题:视觉语义嵌入还剩下什么
作者:Yue Jiao,Jonathon Hare,Adam Prügel-Bennett 机构:University of Southampton, Southampton, UK, Adam Pr¨ugel-Bennett 链接:https://arxiv.org/abs/2107.11991 摘要:Zero-Shot学习(Zero-shot learning,ZSL)由于与幼儿识别新奇物体的机制密切相关,近十年来引起了人们的极大兴趣。尽管视觉语义嵌入模型的不同范例被设计来对齐视觉特征和分布式单词表示,但是目前的ZSL模型在多大程度上编码来自分布式单词表示的语义信息还不清楚。在这项工作中,我们将分层ImageNet的拆分引入到ZSL任务中,以避免标准ImageNet基准中的结构缺陷。我们建立了一个统一的ZSL框架,以对比学习作为预训练,既保证了语义信息的不泄漏,又鼓励了线性可分离的视觉特征。我们的工作使得在语义推理起决定性作用的ZSL环境下评估视觉语义嵌入模型是公平的。在这个框架下,我们证明了现有的ZSL模型难以从词汇类比和词汇层次编码语义关系。我们的分析为探索语境语言表征在ZSL任务中的作用提供了动力。 摘要:Zero shot learning (ZSL) has seen a surge in interest over the decade for its tight links with the mechanism making young children recognize novel objects. Although different paradigms of visual semantic embedding models are designed to align visual features and distributed word representations, it is unclear to what extent current ZSL models encode semantic information from distributed word representations. In this work, we introduce the split of tiered-ImageNet to the ZSL task, in order to avoid the structural flaws in the standard ImageNet benchmark. We build a unified framework for ZSL with contrastive learning as pre-training, which guarantees no semantic information leakage and encourages linearly separable visual features. Our work makes it fair for evaluating visual semantic embedding models on a ZSL setting in which semantic inference is decisive. With this framework, we show that current ZSL models struggle with encoding semantic relationships from word analogy and word hierarchy. Our analyses provide motivation for exploring the role of context language representations in ZSL tasks.
【6】 Denoising and Segmentation of Epigraphical Scripts 标题:碑文的去噪与分割
作者:P Preethi,Hrishikesh Viswanath 机构:Assistant Professor, Department of Computer Science, PES University, Bangalore 链接:https://arxiv.org/abs/2107.11801 摘要:本文提出了一种利用Haralick特征对图像进行去噪、并利用人工神经网络进一步分割字符的新方法。图像被划分成核(图像块),每个核被转换为一个GLCM(灰度共生矩阵),并在其上调用Haralick特征生成函数,其结果是一个包含14个元素的数组,对应14个Haralick特征。这些Haralick值及其对应的噪声/文本类别构成一个字典,然后通过核比较对图像进行去噪处理。切分是从文档中提取字符的过程;当字母由空白分隔时即可使用切分,空白是一种明确的边界标记。字符切分是许多自然语言处理问题的第一步。本文探讨了利用神经网络进行切分的过程。虽然已有许多方法可以分割文档中的字符,但本文只关注使用神经网络进行分割的准确率。字符的正确切分是非常必要的,否则将导致自然语言处理工具对字符的错误识别。采用人工神经网络,准确率达到89%。此方法适用于字符由空格分隔的语言。然而,当语言大量使用连写字母时,这种方法将无法提供可接受的结果。天城文(Devanagari)就是一个例子,它主要在印度北部使用。 摘要:This paper is a presentation of a new method for denoising images using Haralick features and further segmenting the characters using artificial neural networks. The image is divided into kernels, each of which is converted to a GLCM (Gray Level Co-Occurrence Matrix) on which a Haralick Feature generation function is called, the result of which is an array with fourteen elements corresponding to fourteen features. The Haralick values and the corresponding noise/text classification form a dictionary, which is then used to de-noise the image through kernel comparison. Segmentation is the process of extracting characters from a document and can be used when letters are separated by white space, which is an explicit boundary marker. Segmentation is the first step in many Natural Language Processing problems. This paper explores the process of segmentation using Neural Networks. While there have been numerous methods to segment characters of a document, this paper is only concerned with the accuracy of doing so using neural networks. It is imperative that the characters be segmented correctly, for failing to do so will lead to incorrect recognition by Natural language processing tools. Artificial Neural Networks was used to attain accuracy of up to 89%. This method is suitable for languages where the characters are delimited by white space. However, this method will fail to provide acceptable results when the language heavily uses connected letters. An example would be the Devanagari script, which is predominantly used in northern India.
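A hedged sketch of the GLCM/Haralick kernel-classification step using scikit-image, whose graycoprops exposes only six of the fourteen Haralick descriptors (older scikit-image versions spell these functions greycomatrix/greycoprops). The feature_to_label callable stands in for the dictionary/ANN classifier of the paper and is purely illustrative.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # 'greycomatrix' in older scikit-image

def kernel_texture_features(kernel, levels=256):
    """Compute a GLCM for one uint8 image kernel (patch) and a subset of
    Haralick-style texture descriptors from it."""
    glcm = graycomatrix(kernel, distances=[1], angles=[0], levels=levels,
                        symmetric=True, normed=True)
    props = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation", "ASM"]
    return np.array([graycoprops(glcm, p)[0, 0] for p in props])

def denoise_by_kernel_classification(image, feature_to_label, ksize=16):
    """Slide non-overlapping kernels over a grayscale uint8 image, classify each
    feature vector as 'noise' or 'text' via the supplied callable (a stand-in for
    the paper's dictionary/ANN), and blank out the noise blocks."""
    denoised = image.copy()
    h, w = image.shape
    for y in range(0, h - ksize + 1, ksize):
        for x in range(0, w - ksize + 1, ksize):
            patch = image[y:y + ksize, x:x + ksize]
            if feature_to_label(kernel_texture_features(patch)) == "noise":
                denoised[y:y + ksize, x:x + ksize] = 255   # treat as background
    return denoised
```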
【7】 Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation 标题:利用具有亲和力学习的辅助任务进行弱监督语义分割
作者:Lian Xu,Wanli Ouyang,Mohammed Bennamoun,Farid Boussaid,Ferdous Sohel,Dan Xu 机构:The University of Western Australia, The University of Sydney, Murdoch University, Hong Kong University of Sciences and Technology 备注:Accepted at ICCV 2021 链接:https://arxiv.org/abs/2107.11787 摘要:在缺乏密集标记数据的情况下,语义分割是一项具有挑战性的任务。仅依赖于带有图像级标签的类激活映射(CAM)提供了不足的分割监控。因此,以前的作品考虑预先训练的模型产生粗显著图,以指导产生伪分割标签。然而,常用的离线启发式生成过程不能充分利用这些粗糙显著图的优点。基于显著的任务间相关性,我们提出了一种新的弱监督多任务框架AuxSegNet,利用显著性检测和多标签图像分类作为辅助任务,改进了仅使用图像级地面真值标签进行语义分割的主要任务。受其相似结构语义的启发,我们还提出从显著性和分割表示中学习一个跨任务的全局像素级亲和图。学习的跨任务亲和性可用于改进显著性预测和传播CAM图,从而为两个任务提供改进的伪标签。伪标签更新和跨任务相似性学习之间的相互促进使得分割性能的迭代改进成为可能。大量实验证明了所提出的辅助学习网络结构和跨任务亲和学习方法的有效性。该方法在具有挑战性的PASCAL-VOC 2012和MS-COCO基准上实现了最先进的弱监督分割性能。 摘要:Semantic segmentation is a challenging task in the absence of densely labelled data. Only relying on class activation maps (CAM) with image-level labels provides deficient segmentation supervision. Prior works thus consider pre-trained models to produce coarse saliency maps to guide the generation of pseudo segmentation labels. However, the commonly used off-line heuristic generation process cannot fully exploit the benefits of these coarse saliency maps. Motivated by the significant inter-task correlation, we propose a novel weakly supervised multi-task framework termed as AuxSegNet, to leverage saliency detection and multi-label image classification as auxiliary tasks to improve the primary task of semantic segmentation using only image-level ground-truth labels. Inspired by their similar structured semantics, we also propose to learn a cross-task global pixel-level affinity map from the saliency and segmentation representations. The learned cross-task affinity can be used to refine saliency predictions and propagate CAM maps to provide improved pseudo labels for both tasks. The mutual boost between pseudo label updating and cross-task affinity learning enables iterative improvements on segmentation performance. Extensive experiments demonstrate the effectiveness of the proposed auxiliary learning network structure and the cross-task affinity learning method. The proposed approach achieves state-of-the-art weakly supervised segmentation performance on the challenging PASCAL VOC 2012 and MS COCO benchmarks.
【8】 ReDAL: Region-based and Diversity-aware Active Learning for Point Cloud Semantic Segmentation 标题:ReDAL:基于区域和多样性感知的主动学习点云语义分割
作者:Tsung-Han Wu,Yueh-Cheng Liu,Yu-Kai Huang,Hsin-Ying Lee,Hung-Ting Su,Ping-Chia Huang,Winston H. Hsu 机构:National Taiwan University 备注:Accepted by ICCV 2021 链接:https://arxiv.org/abs/2107.11769 摘要:尽管深度学习在有监督点云语义分割方面取得了成功,但是获取大规模的逐点人工标注仍然是一个巨大的挑战。为了减少标注的巨大负担,我们提出了一种基于区域和多样性感知的主动学习(ReDAL),这是许多深度学习方法的一个通用框架,旨在自动选择信息丰富和多样的子场景区域进行标签获取。观察到只有一小部分的注释区域足够用于深度学习的三维场景理解,我们使用softmax熵、颜色不连续性和结构复杂性来度量子场景区域的信息。提出了一种多样性感知的选择算法,避免了在查询批中选择信息丰富但相似的区域所产生的冗余标注。大量实验表明,该方法的性能优于以往的主动学习策略,在S3DIS和SemanticKITTI数据集上实现了90%的完全监督学习,而标注量分别小于15%和5%。 摘要:Despite the success of deep learning on supervised point cloud semantic segmentation, obtaining large-scale point-by-point manual annotations is still a significant challenge. To reduce the huge annotation burden, we propose a Region-based and Diversity-aware Active Learning (ReDAL), a general framework for many deep learning approaches, aiming to automatically select only informative and diverse sub-scene regions for label acquisition. Observing that only a small portion of annotated regions are sufficient for 3D scene understanding with deep learning, we use softmax entropy, color discontinuity, and structural complexity to measure the information of sub-scene regions. A diversity-aware selection algorithm is also developed to avoid redundant annotations resulting from selecting informative but similar regions in a querying batch. Extensive experiments show that our method highly outperforms previous active learning strategies, and we achieve the performance of 90% fully supervised learning, while less than 15% and 5% annotations are required on S3DIS and SemanticKITTI datasets, respectively.
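The region information score combining the three cues named in the abstract could look like the following sketch; the weights, the pre-computed color-discontinuity and structural-complexity terms, and the greedy top-k selection (which ignores the paper's diversity-aware step) are all assumptions.

```python
import torch

def region_information_score(logits, color_discont, struct_complexity,
                             alpha=1.0, beta=0.05, gamma=0.05):
    """Score one sub-scene region for active-learning selection.
    logits: (N, num_classes) network outputs for the points in the region.
    color_discont / struct_complexity: pre-computed per-region scalars (assumed).
    The mixing weights are illustrative, not the paper's values."""
    prob = torch.softmax(logits, dim=1)
    entropy = -(prob * torch.log(prob + 1e-12)).sum(dim=1).mean()   # mean softmax entropy
    return alpha * entropy + beta * color_discont + gamma * struct_complexity

def select_regions(scores, budget):
    """Greedy baseline: pick the `budget` highest-information regions.
    scores: dict {region_id: score}. The paper additionally enforces diversity
    inside each querying batch, which this sketch omits."""
    return sorted(scores, key=scores.get, reverse=True)[:budget]
```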
【9】 Semantic Attention and Scale Complementary Network for Instance Segmentation in Remote Sensing Images 标题:语义关注度和尺度互补网络在遥感图像实例分割中的应用
作者:Tianyang Zhang,Xiangrong Zhang,Peng Zhu,Xu Tang,Chen Li,Licheng Jiao,Huiyu Zhou 备注:14 pages 链接:https://arxiv.org/abs/2107.11758 摘要:针对遥感图像中具有挑战性的多类别实例分割问题,提出了一种基于像素级掩模的多类别实例分割方法。尽管许多具有里程碑意义的框架在实例分割方面表现出了良好的性能,但是对于RSIs的实例分割来说,背景的复杂性和尺度的可变性仍然是一个挑战。针对上述问题,本文提出了一种端到端的多类别实例分割模型,即语义注意和尺度互补网络,该模型主要由语义注意(SEA)模块和尺度互补掩码分支(SCMB)组成。SEA模块包含一个简单的完全卷积的语义分割分支,通过额外的监督来增强特征图上兴趣实例的激活,减少背景噪声的干扰。为了解决大尺度下地理空间实例的欠分割问题,我们设计了SCMB,将原来的单掩模分支扩展到三叉戟掩模分支,并引入不同尺度下的互补掩模监督,以充分利用多尺度信息。在iSAID数据集和NWPU实例分割数据集上进行了综合实验,验证了该方法的有效性,并取得了良好的性能。 摘要:In this paper, we focus on the challenging multicategory instance segmentation problem in remote sensing images (RSIs), which aims at predicting the categories of all instances and localizing them with pixel-level masks. Although many landmark frameworks have demonstrated promising performance in instance segmentation, the complexity in the background and scale variability instances still remain challenging for instance segmentation of RSIs. To address the above problems, we propose an end-to-end multi-category instance segmentation model, namely Semantic Attention and Scale Complementary Network, which mainly consists of a Semantic Attention (SEA) module and a Scale Complementary Mask Branch (SCMB). The SEA module contains a simple fully convolutional semantic segmentation branch with extra supervision to strengthen the activation of interest instances on the feature map and reduce the background noise's interference. To handle the under-segmentation of geospatial instances with large varying scales, we design the SCMB that extends the original single mask branch to trident mask branches and introduces complementary mask supervision at different scales to sufficiently leverage the multi-scale information. We conduct comprehensive experiments to evaluate the effectiveness of our proposed method on the iSAID dataset and the NWPU Instance Segmentation dataset and achieve promising performance.
【10】 Crosslink-Net: Double-branch Encoder Segmentation Network via Fusing Vertical and Horizontal Convolutions 标题:Crosslink-Net:基于垂直和水平卷积融合的双分支编码器分段网络
作者:Qian Yu,Lei Qi,Luping Zhou,Lei Wang,Yilong Yin,Yinghuan Shi,Wuzhang Wang,Yang Gao 备注:13 pages, 12 figures 链接:https://arxiv.org/abs/2107.11517 摘要:准确的图像分割在医学图像分析中起着至关重要的作用,但它面临着形状多样、大小不一、边界模糊等诸多挑战。为了解决这些问题,人们提出了基于方形卷积核的编解码结构并得到了广泛应用,但其性能仍然不尽如人意。为了进一步应对这些挑战,我们提出了一种新的双分支编码器结构。我们的架构受到两个观察的启发:1)由于通过方形卷积核学习到的特征的判别性仍需进一步提升,我们建议在双分支编码器中使用非方形的垂直和水平卷积核,这样两个分支所学习的特征可以相互补充。2)考虑到空间注意力有助于模型更好地聚焦于大尺寸图像中的目标区域,我们设计了一种注意力损失来进一步强调对小尺寸目标的分割。结合上述两种方案,我们提出了一种新的用于医学图像分割的双分支编码器分割框架,即Crosslink-Net。在四个数据集上的实验验证了该模型的有效性。代码发布于https://github.com/Qianyu1226/Crosslink-Net. 摘要:Accurate image segmentation plays a crucial role in medical image analysis, yet it faces great challenges of various shapes, diverse sizes, and blurry boundaries. To address these difficulties, square kernel-based encoder-decoder architecture has been proposed and widely used, but its performance remains still unsatisfactory. To further cope with these challenges, we present a novel double-branch encoder architecture. Our architecture is inspired by two observations: 1) Since the discrimination of features learned via square convolutional kernels needs to be further improved, we propose to utilize non-square vertical and horizontal convolutional kernels in the double-branch encoder, so features learned by the two branches can be expected to complement each other. 2) Considering that spatial attention can help models to better focus on the target region in a large-sized image, we develop an attention loss to further emphasize the segmentation on small-sized targets. Together, the above two schemes give rise to a novel double-branch encoder segmentation framework for medical image segmentation, namely Crosslink-Net. The experiments validate the effectiveness of our model on four datasets. The code is released at https://github.com/Qianyu1226/Crosslink-Net.
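A small PyTorch sketch of the double-branch encoder idea: one branch uses vertical (k, 1) kernels, the other horizontal (1, k) kernels, and their outputs are fused. Kernel length, channel split and concatenation fusion are assumptions, not the exact Crosslink-Net configuration.

```python
import torch
import torch.nn as nn

class NonSquareBranch(nn.Module):
    """One encoder branch built from non-square kernels: (k, 1) sees vertical
    context, (1, k) horizontal context. Widths are illustrative."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        pad = (k[0] // 2, k[1] // 2)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=pad),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DoubleBranchEncoderStage(nn.Module):
    """A single stage of the double-branch encoder: the vertical and horizontal
    branches are fused by concatenation (the fusion rule is an assumption)."""
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        self.vert = NonSquareBranch(in_ch, out_ch // 2, (k, 1))
        self.horiz = NonSquareBranch(in_ch, out_ch // 2, (1, k))

    def forward(self, x):
        return torch.cat([self.vert(x), self.horiz(x)], dim=1)

if __name__ == "__main__":
    stage = DoubleBranchEncoderStage(1, 64)
    print(stage(torch.randn(1, 1, 128, 128)).shape)   # torch.Size([1, 64, 128, 128])
```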
【11】 Deep Learning Based Cardiac MRI Segmentation: Do We Need Experts? 标题:基于深度学习的心脏MRI分割:我们需要专家吗?
作者:Youssef Skandarani,Pierre-Marc Jodoin,Alain Lalande 备注:None 链接:https://arxiv.org/abs/2107.11447 摘要:深度学习方法已成为众多医学图像分析任务的事实标准解决方案。心脏MRI分割就是这样一种应用,它和其他许多应用一样,需要大量的标注数据,训练出的网络才能很好地泛化。不幸的是,由医学专家手工标注大量图像的过程既缓慢又昂贵。在这篇论文中,我们着手探讨专家知识是否是创建可供机器学习成功训练的标注数据集的严格要求。为此,我们评估了U-Net、Attention U-Net和ENet三种分割模型分别在专家标注和非专家标注数据上、采用不同损失函数训练后用于心脏电影MRI分割的性能。评估采用经典分割指标(Dice指数和Hausdorff距离)以及临床测量指标,如心室射血分数和心肌质量。结果表明,就实际应用而言,在非专家标注数据上训练的分割神经网络的泛化性能与在专家标注数据上训练的相当,尤其是当非专家经过一定程度的培训时,这突显了为心脏数据集高效、低成本地创建标注的机会。 摘要:Deep learning methods are the de-facto solutions to a multitude of medical image analysis tasks. Cardiac MRI segmentation is one such application which, like many others, requires a large number of annotated data so a trained network can generalize well. Unfortunately, the process of having a large number of manually curated images by medical experts is both slow and utterly expensive. In this paper, we set out to explore whether expert knowledge is a strict requirement for the creation of annotated datasets that machine learning can successfully train on. To do so, we gauged the performance of three segmentation models, namely U-Net, Attention U-Net, and ENet, trained with different loss functions on expert and non-expert groundtruth for cardiac cine-MRI segmentation. Evaluation was done with classic segmentation metrics (Dice index and Hausdorff distance) as well as clinical measurements, such as the ventricular ejection fractions and the myocardial mass. Results reveal that generalization performances of a segmentation neural network trained on non-expert groundtruth data is, to all practical purposes, as good as on expert groundtruth data, in particular when the non-expert gets a decent level of training, highlighting an opportunity for the efficient and cheap creation of annotations for cardiac datasets.
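Of the two geometric metrics used in the study, the Dice index is the simpler one; a minimal NumPy version is given below for reference (the Hausdorff distance and the clinical measurements are omitted).

```python
import numpy as np

def dice_index(pred, gt, label=1):
    """Classic Dice overlap between a predicted and a reference segmentation
    for one label; returns 1.0 when both masks are empty."""
    p, g = (pred == label), (gt == label)
    denom = p.sum() + g.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(p, g).sum() / denom

if __name__ == "__main__":
    # e.g. a network trained on non-expert labels evaluated against expert ground truth
    gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1
    pred = np.zeros((8, 8), dtype=int); pred[3:6, 2:6] = 1
    print(round(dice_index(pred, gt), 3))   # 0.857
```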
【12】 Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation 标题:视频活动定位中的跨句时间和语义关系
作者:Jiabo Huang,Yang Liu,Shaogang Gong,Hailin Jin 机构:Queen Mary University of London, Peking University, Adobe Research 备注:International Conference on Computer Vision (ICCV'21) 链接:https://arxiv.org/abs/2107.11443 摘要:视频活动定位可以自动定位未剪辑、非结构化视频中与语言描述(句子)相对应的最显著的视频片段,具有重要的实用价值,近年来受到越来越多的关注。对于有监督的模型训练,必须为每个句子对应的视频片段(即一个视频时刻)给出开始和结束时间索引的时间标注。这不仅代价高昂,而且容易受到歧义和主观标注偏差的影响,是比图像标注困难得多的任务。在这项工作中,我们提出了一种更准确的弱监督解决方案:在仅有活动的段落描述、而没有逐句时间标注可用的情况下,在候选视频时刻的生成与匹配中引入跨句关系挖掘(CRM)。具体来说,我们探索了两种跨句关系约束:(1)视频活动段落描述中句子之间的时间顺序,以及(2)句子之间的语义一致性。现有的弱监督技术在训练中只考虑句子内部与视频片段的相关性,而不考虑跨句的段落上下文。由于单个句子的表达可能存在歧义,孤立地使用在视觉上难以区分的候选视频时刻会产生误导。在两个公开的活动定位数据集上的实验表明,与现有的弱监督方法相比,我们的方法具有优势,特别是当视频活动描述变得更加复杂时。 摘要:Video activity localisation has recently attained increasing attention due to its practical values in automatically localising the most salient visual segments corresponding to their language descriptions (sentences) from untrimmed and unstructured videos. For supervised model training, a temporal annotation of both the start and end time index of each video segment for a sentence (a video moment) must be given. This is not only very expensive but also sensitive to ambiguity and subjective annotation bias, a much harder task than image labelling. In this work, we develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining (CRM) in video moment proposal generation and matching when only a paragraph description of activities without per-sentence temporal annotation is available. Specifically, we explore two cross-sentence relational constraints: (1) Temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities. Existing weakly-supervised techniques only consider within-sentence video segment correlations in training without considering cross-sentence paragraph context. This can mislead due to ambiguous expressions of individual sentences with visually indiscriminate video moment proposals in isolation. Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods, especially so when the video activity descriptions become more complex.
【13】 Dual-Attention Enhanced BDense-UNet for Liver Lesion Segmentation 标题:双注意力增强BDense-UNET在肝脏病变分割中的应用
作者:Wenming Cao,Philip L. H. Yu,Gilbert C. S. Lui,Keith W. H. Chiu,Ho-Ming Cheng,Yanwen Fang,Man-Fung Yuen,Wai-Kay Seto 机构: Department of Statistics and Actuarial Science, The University of Hong Kong, Department of Diagnostic Radiology, The University of Hong Kong, State Key Laboratory for Liver Research, The University of Hong Kong 备注:9 pages, 3 figures 链接:https://arxiv.org/abs/2107.11645 摘要:在这项工作中,我们提出了一个新的分割网络,将DenseUNet和双向LSTM与注意机制相结合,称为DA-BDense-UNet。DenseUNet允许学习足够多的不同特性,并通过调节信息流来增强网络的代表性。双向LSTM负责探索编码和解码路径中编码特征和上采样特征之间的关系。同时,我们在DenseUNet中引入注意门(AG)来减少不相关背景区域的响应,并逐步放大显著区域的响应。此外,双向LSTM算法在分割改进中考虑了编码特征和上采样特征的贡献差异,进而对这两类特征调整适当的权值。我们在多家医院采集的肝脏CT图像数据集上进行了实验,并与现有的分割模型进行了比较。实验结果表明,本文提出的算法在骰子系数方面取得了比较好的效果,证明了算法的有效性。 摘要:In this work, we propose a new segmentation network by integrating DenseUNet and bidirectional LSTM together with attention mechanism, termed as DA-BDense-UNet. DenseUNet allows learning enough diverse features and enhancing the representative power of networks by regulating the information flow. Bidirectional LSTM is responsible to explore the relationships between the encoded features and the up-sampled features in the encoding and decoding paths. Meanwhile, we introduce attention gates (AG) into DenseUNet to diminish responses of unrelated background regions and magnify responses of salient regions progressively. Besides, the attention in bidirectional LSTM takes into account the contribution differences of the encoded features and the up-sampled features in segmentation improvement, which can in turn adjust proper weights for these two kinds of features. We conduct experiments on liver CT image data sets collected from multiple hospitals by comparing them with state-of-the-art segmentation models. Experimental results indicate that our proposed method DA-BDense-UNet has achieved comparative performance in terms of dice coefficient, which demonstrates its effectiveness.
Zero/Few Shot|迁移|域适配|自适应(4篇)
【1】 Meta-FDMixup: Cross-Domain Few-Shot Learning Guided by Labeled Target Data 标题:Meta-FDMixup:标签目标数据引导的跨域Few-Shot学习
作者:Yuqian Fu,Yanwei Fu,Yu-Gang Jiang 机构:Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, School of Data Science, Fudan University 备注:Accepted by ACM Multimedia 2021 链接:https://arxiv.org/abs/2107.11978 摘要:最近的一项研究发现,现有的几种基于源域训练的镜头学习方法,在观察到域间隙的情况下,不能推广到新的目标域。这激发了跨域Few-Shot学习(CD-FSL)的任务。在本文中,我们意识到CD-FSL中标记的目标数据没有以任何方式帮助学习过程。因此,我们提倡利用少量的标记目标数据来指导模型学习。在技术上,提出了一种新型的元fd混频网络。我们主要从两个方面来解决这个问题。首先,利用两个不同类集合的源数据和新引入的目标数据,重新提出了一个混合模块,并将其集成到元学习机制中。其次,提出了一种新的解纠缠模块和领域分类器相结合的方法来提取解纠缠后的领域无关特征和领域特定特征。这两个模块一起使我们的模型能够缩小域差距,从而很好地推广到目标数据集。此外,还进行了详细的可行性和试点研究,以反映在我们的新设置下对CD-FSL的直观理解。实验结果表明了新设置和所提方法的有效性。代码和型号可在https://github.com/lovelyqian/Meta-FDMixup. 摘要:A recent study finds that existing few-shot learning methods, trained on the source domain, fail to generalize to the novel target domain when a domain gap is observed. This motivates the task of Cross-Domain Few-Shot Learning (CD-FSL). In this paper, we realize that the labeled target data in CD-FSL has not been leveraged in any way to help the learning process. Thus, we advocate utilizing few labeled target data to guide the model learning. Technically, a novel meta-FDMixup network is proposed. We tackle this problem mainly from two aspects. Firstly, to utilize the source and the newly introduced target data of two different class sets, a mixup module is re-proposed and integrated into the meta-learning mechanism. Secondly, a novel disentangle module together with a domain classifier is proposed to extract the disentangled domain-irrelevant and domain-specific features. These two modules together enable our model to narrow the domain gap thus generalizing well to the target datasets. Additionally, a detailed feasibility and pilot study is conducted to reflect the intuitive understanding of CD-FSL under our new setting. Experimental results show the effectiveness of our new setting and the proposed method. Codes and models are available at https://github.com/lovelyqian/Meta-FDMixup.
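The mixup module re-purposed in the abstract above builds on standard mixup between source-domain and labeled target-domain batches; a generic sketch is shown below. How the two disjoint label sets are mixed and fed into the meta-learning episodes follows the paper, not this snippet.

```python
import torch

def mixup(x_source, x_target, alpha=1.0):
    """Standard mixup between a batch of source-domain images and the few labeled
    target-domain images. Returns the mixed batch and the mixing coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mixed = lam * x_source + (1.0 - lam) * x_target
    return x_mixed, lam

if __name__ == "__main__":
    xs, xt = torch.randn(4, 3, 84, 84), torch.randn(4, 3, 84, 84)
    xm, lam = mixup(xs, xt)
    print(xm.shape, round(lam, 3))
```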
【2】 Will Multi-modal Data Improves Few-shot Learning? 标题:多模态数据能否改善Few-Shot学习?
作者:Zilun Zhang,Shihao Ma,Yichun Zhang 备注:Project Report 链接:https://arxiv.org/abs/2107.11853 摘要:大多数Few-Shot学习模型只利用单一模态的数据。我们希望从定性和定量的角度研究:如果增加一个额外的模态(即图像的文本描述),模型会有多大的改进,以及它如何影响学习过程。为了实现这一目标,本文提出了四种融合方法,将图像特征和文本特征相结合。为了验证改进的有效性,我们用ProtoNet和MAML两个经典的Few-Shot学习模型,以及ConvNet和ResNet12等图像特征提取器对融合方法进行了测试。基于注意力的融合方法效果最好,分类精度比基线结果提高了约30%。 摘要:Most few-shot learning models utilize only one modality of data. We would like to investigate qualitatively and quantitatively how much will the model improve if we add an extra modality (i.e. text description of the image), and how it affects the learning procedure. To achieve this goal, we propose four types of fusion method to combine the image feature and text feature. To verify the effectiveness of improvement, we test the fusion methods with two classical few-shot learning models - ProtoNet and MAML, with image feature extractors such as ConvNet and ResNet12. The attention-based fusion method works best, which improves the classification accuracy by a large margin around 30% comparing to the baseline result.
【3】 LAConv: Local Adaptive Convolution for Image Fusion 标题:LAConv:用于图像融合的局部自适应卷积
作者:Zi-Rong Jin,Liang-Jian Deng,Tai-Xiang Jiang,Tian-Jing Zhang 机构:School of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, School of Mathematical Sciences, School of Economic Information Engineering, Southwestern University of Finance and Economics 链接:https://arxiv.org/abs/2107.11617 摘要:卷积运算是一种强有力的特征提取工具,在计算机视觉领域占有重要地位。然而,在针对图像融合等像素级任务时,如果在不同的面片上使用均匀卷积核,则无法充分感知图像中每个像素的特殊性。在本文中,我们提出了一种局部自适应卷积(LAConv),它是动态调整到不同的空间位置。LAConv使网络能够关注学习过程中的每一个特定的局部区域。此外,引入动态偏差(DYB),为特征描述提供了更多的可能性,使网络更加灵活。我们进一步设计了一个包含LAConv和DYB模块的残差结构网络,并将其应用于两个图像融合任务中。通过对泛锐化和高光谱图像超分辨率(HISR)的实验,证明了该方法的优越性。值得一提的是,LAConv还可以以较少的计算量胜任其他超分辨率任务。 摘要:The convolution operation is a powerful tool for feature extraction and plays a prominent role in the field of computer vision. However, when targeting the pixel-wise tasks like image fusion, it would not fully perceive the particularity of each pixel in the image if the uniform convolution kernel is used on different patches. In this paper, we propose a local adaptive convolution (LAConv), which is dynamically adjusted to different spatial locations. LAConv enables the network to pay attention to every specific local area in the learning process. Besides, the dynamic bias (DYB) is introduced to provide more possibilities for the depiction of features and make the network more flexible. We further design a residual structure network equipped with the proposed LAConv and DYB modules, and apply it to two image fusion tasks. Experiments for pansharpening and hyperspectral image super-resolution (HISR) demonstrate the superiority of our method over other state-of-the-art methods. It is worth mentioning that LAConv can also be competent for other super-resolution tasks with less computation effort.
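An illustrative spatially-adaptive convolution in the spirit of LAConv: a light head predicts a k x k weight map per output position, which is applied to the unfolded neighbourhood, plus a per-pixel dynamic bias. The kernel-generation head, softmax normalization and channel sharing are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyAdaptiveConv(nn.Module):
    """Sketch of a locally adaptive convolution: per-pixel kernels shared across
    channels, plus a dynamically predicted bias."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Conv2d(channels, k * k, 3, padding=1)   # per-pixel kernel
        self.bias_head = nn.Conv2d(channels, channels, 3, padding=1)  # dynamic bias

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.k
        weights = torch.softmax(self.kernel_head(x), dim=1)           # (B, k*k, H, W)
        patches = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
        out = (patches * weights.unsqueeze(1)).sum(dim=2)             # adaptive aggregation
        return out + self.bias_head(x)

if __name__ == "__main__":
    conv = LocallyAdaptiveConv(8)
    print(conv(torch.randn(2, 8, 32, 32)).shape)   # torch.Size([2, 8, 32, 32])
```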
【4】 Structure-Preserving Multi-Domain Stain Color Augmentation using Style-Transfer with Disentangled Representations 标题:基于解缠表示的样式传递保持结构的多域染色颜色增强
作者:Sophia J. Wagner,Nadieh Khalili,Raghav Sharma,Melanie Boxberg,Carsten Marr,Walter de Back,Tingying Peng 机构: Technical University Munich, Munich, Germany, Helmholtz AI, Neuherberg, Germany, Institute for Computational Biology, HelmholtzZentrum Munich, Germany, Munich School of Data Science (MuDS), Munich, Germany, ContextVision AB, Stockholm, Sweden 备注:accepted at MICCAI 2021, code and model weights are available at this http URL 链接:https://arxiv.org/abs/2107.12357 摘要:在数字病理学中,不同的染色程序和扫描仪会导致整个玻片图像(WSIs)的颜色变化,特别是在不同的实验室。这些颜色变化导致从训练域到外部病理数据的基于深度学习的方法的泛化性较差。为了提高测试性能,采用了染色归一化技术来减小训练域和测试域之间的方差。或者,可以在训练期间应用颜色增强,从而产生更健壮的模型,而无需在测试时进行额外的颜色归一化步骤。我们提出了一种新的颜色增强技术HistAuGAN,它可以模拟多种真实的组织学染色颜色,从而使神经网络在训练时具有染色不变性。基于图像到图像的生成性对抗网络(GAN),我们的模型将图像的内容,即形态组织结构,从颜色属性中分离出来。它可以在多个域上训练,因此,可以学习覆盖不同的染色颜色以及载玻片制备和成像过程中引入的其他特定域的变化。我们证明HistAuGAN在公开数据集camelon17上的分类任务上优于传统的颜色增强技术,并且表明它能够缓解当前的批处理效应。 摘要:In digital pathology, different staining procedures and scanners cause substantial color variations in whole-slide images (WSIs), especially across different laboratories. These color shifts result in a poor generalization of deep learning-based methods from the training domain to external pathology data. To increase test performance, stain normalization techniques are used to reduce the variance between training and test domain. Alternatively, color augmentation can be applied during training leading to a more robust model without the extra step of color normalization at test time. We propose a novel color augmentation technique, HistAuGAN, that can simulate a wide variety of realistic histology stain colors, thus making neural networks stain-invariant when applied during training. Based on a generative adversarial network (GAN) for image-to-image translation, our model disentangles the content of the image, i.e., the morphological tissue structure, from the stain color attributes. It can be trained on multiple domains and, therefore, learns to cover different stain colors as well as other domain-specific variations introduced in the slide preparation and imaging process. We demonstrate that HistAuGAN outperforms conventional color augmentation techniques on a classification task on the publicly available dataset Camelyon17 and show that it is able to mitigate present batch effects.
半弱无监督|主动学习|不确定性(3篇)
【1】 Improve Unsupervised Pretraining for Few-label Transfer 标题:改进用于少标签转移的无监督预训练
作者:Suichan Li,Dongdong Chen,Yinpeng Chen,Lu Yuan,Lei Zhang,Qi Chu,Bin Liu,Nenghai Yu 机构:University of Science and Technology of China, Microsoft Research 备注:ICCV 2021. arXiv admin note: substantial text overlap with arXiv:2012.05899 链接:https://arxiv.org/abs/2107.12369 摘要:无监督预训练已经取得了很大的成功,最近的许多研究表明,在下游目标数据集上,无监督预训练可以获得与有监督预训练相当甚至稍好的传输性能。但在本文中,我们发现当目标数据集只有很少的标记样本进行微调时,即很少有标记转移时,这一结论可能不成立。我们从聚类的角度分析了可能的原因:1)目标样本的聚类质量对很少的标签转移有重要影响;2) 尽管对比学习是学习聚类的必要手段,但由于缺乏对标签的监督,对比学习的聚类质量仍然不如有监督的预训练。在分析的基础上,我们有趣地发现,只有将一些未标记的目标域加入到无监督的预训练中,才能提高聚类质量,从而缩小与有监督预训练的传输性能差距。这一发现也促使我们为实际应用提出一种新的渐进式少标签传输算法,其目的是在有限的注释预算下最大限度地提高传输性能。为了支持我们的分析和提出的方法,我们在9个不同的目标数据集上进行了广泛的实验。实验结果表明,本文提出的方法能显著提高无监督预训练的少量标记转移性能。 摘要:Unsupervised pretraining has achieved great success and many recent works have shown unsupervised pretraining can achieve comparable or even slightly better transfer performance than supervised pretraining on downstream target datasets. But in this paper, we find this conclusion may not hold when the target dataset has very few labeled samples for finetuning, ie, few-label transfer. We analyze the possible reason from the clustering perspective: 1) The clustering quality of target samples is of great importance to few-label transfer; 2) Though contrastive learning is essential to learn how to cluster, its clustering quality is still inferior to supervised pretraining due to lack of label supervision. Based on the analysis, we interestingly discover that only involving some unlabeled target domain into the unsupervised pretraining can improve the clustering quality, subsequently reducing the transfer performance gap with supervised pretraining. This finding also motivates us to propose a new progressive few-label transfer algorithm for real applications, which aims to maximize the transfer performance under a limited annotation budget. To support our analysis and proposed method, we conduct extensive experiments on nine different target datasets. Experimental results show our proposed method can significantly boost the few-label transfer performance of unsupervised pretraining.
【2】 Self-Conditioned Probabilistic Learning of Video Rescaling 标题:视频缩放的自条件概率学习
作者:Yuan Tian,Guo Lu,Xiongkuo Min,Zhaohui Che,Guangtao Zhai,Guodong Guo,Zhiyong Gao 机构:Shanghai Jiao Tong Unversity, Beijing Institute of Technology, Baidu 备注:accepted to ICCV2021 链接:https://arxiv.org/abs/2107.11639 摘要:双三次降尺度是一种常用的技术,用于减少视频存储负担或加快下游处理速度。然而,逆向上缩放步骤是不平凡的,并且缩小的视频也可能恶化下游任务的性能。在本文中,我们提出了一个自我调节的视频重缩放概率框架来同时学习成对的降尺度和升尺度过程。在训练过程中,我们以视频中的强时空先验信息为条件,通过最大化其概率来降低降尺度过程中丢失的信息的熵。经过优化后,我们的框架得到的缩小视频保留了更多有意义的信息,这对于放大步骤和下游任务都是有利的,例如视频动作识别任务。我们进一步将该框架扩展到一个有损视频压缩系统中,提出了一种非差分工业有损编解码器的梯度估计器,用于整个系统的端到端训练。大量的实验结果证明了我们的方法在视频重缩放、视频压缩和有效的动作识别任务上的优越性。 摘要:Bicubic downscaling is a prevalent technique used to reduce the video storage burden or to accelerate the downstream processing speed. However, the inverse upscaling step is non-trivial, and the downscaled video may also deteriorate the performance of downstream tasks. In this paper, we propose a self-conditioned probabilistic framework for video rescaling to learn the paired downscaling and upscaling procedures simultaneously. During the training, we decrease the entropy of the information lost in the downscaling by maximizing its probability conditioned on the strong spatial-temporal prior information within the downscaled video. After optimization, the downscaled video by our framework preserves more meaningful information, which is beneficial for both the upscaling step and the downstream tasks, e.g., video action recognition task. We further extend the framework to a lossy video compression system, in which a gradient estimator for non-differential industrial lossy codecs is proposed for the end-to-end training of the whole system. Extensive experimental results demonstrate the superiority of our approach on video rescaling, video compression, and efficient action recognition tasks.
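For the non-differentiable codec mentioned above, one common family of tricks is a straight-through style estimator, sketched below as a generic illustration; the paper proposes its own gradient estimator for industrial codecs, which this snippet does not reproduce.

```python
import torch

def straight_through_codec(x, lossy_codec):
    """Straight-through style estimator: the forward value is the real codec
    output, while gradients flow through the identity mapping."""
    with torch.no_grad():
        y = lossy_codec(x)               # e.g. encode + decode with an external codec
    return x + (y - x).detach()

if __name__ == "__main__":
    x = torch.randn(1, 3, 8, 8, requires_grad=True)
    fake_codec = lambda t: torch.round(t * 10) / 10   # stand-in for a real lossy codec
    straight_through_codec(x, fake_codec).sum().backward()
    print(x.grad.abs().sum() > 0)        # gradients reach the input
```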
【3】 Weakly Supervised Attention Model for RV Strain Classification from volumetric CTPA Scans 标题:基于容积CTPA扫描的右心室应变分类的弱监督注意力模型
作者:Noa Cahan,Edith M. Marom,Shelly Soffer,Yiftach Barash,Eli Konen,Eyal Klang,Hayit Greenspan 机构:Department of Biomedical Engineering, Tel-Aviv University, Tel-Aviv, Israel, Department of Diagnostic Imaging, Sheba Medical Center, Ramat Gan, Israel affiliated with, the Tel Aviv University, Tel Aviv, Israel 备注:12 pages, 6 figures, 5 tables 链接:https://arxiv.org/abs/2107.12009 摘要:肺栓塞(PE)是指血块阻塞肺动脉。仅在美国,PE每年就造成大约10万人死亡。PE的临床表现往往是非特异性的,这使得诊断具有挑战性。因此,快速准确的风险分层至关重要。高危PE源于急性压力超负荷引起的右心室(RV)功能障碍,识别右心室功能障碍反过来有助于确定哪些患者需要更积极的治疗。胸部CT重建的心脏四腔心视图可以发现右心室扩大。CT肺动脉造影(CTPA)是诊断可疑PE的金标准,因此它可以将诊断和风险分层策略联系起来。我们开发了一种弱监督深度学习算法,重点在于一种新的注意力机制,用于在CTPA上自动分类右心室应变。我们的方法是一个集成了三维残差注意力块的三维DenseNet模型。我们在急诊室(ED)PE患者的CTPA数据集上评估了该模型。在右心室应变分类任务上,该模型的受试者工作特征曲线下面积(AUC)为0.88,敏感性为87%,特异性为83.7%。我们的解决方案优于最先进的3D CNN网络。所提出的设计是一个完全自动化的网络,可以很容易地以端到端的方式进行训练,而不需要计算密集且耗时的预处理或繁重的数据标注。我们由此推断,未标注的CTPA可以用于有效的右心室应变分类,该方法可以作为第二阅片者,对高危PE患者进行提示。据我们所知,此前还没有基于深度学习的研究尝试解决这一问题。 摘要:Pulmonary embolus (PE) refers to obstruction of pulmonary arteries by blood clots. PE accounts for approximately 100,000 deaths per year in the United States alone. The clinical presentation of PE is often nonspecific, making the diagnosis challenging. Thus, rapid and accurate risk stratification is of paramount importance. High-risk PE is caused by right ventricular (RV) dysfunction from acute pressure overload, which in return can help identify which patients require more aggressive therapy. Reconstructed four-chamber views of the heart on chest CT can detect right ventricular enlargement. CT pulmonary angiography (CTPA) is the golden standard in the diagnostic workup of suspected PE. Therefore, it can link between diagnosis and risk stratification strategies. We developed a weakly supervised deep learning algorithm, with an emphasis on a novel attention mechanism, to automatically classify RV strain on CTPA. Our method is a 3D DenseNet model with integrated 3D residual attention blocks. We evaluated our model on a dataset of CTPAs of emergency department (ED) PE patients. This model achieved an area under the receiver operating characteristic curve (AUC) of 0.88 for classifying RV strain. The model showed a sensitivity of 87% and specificity of 83.7%. Our solution outperforms state-of-the-art 3D CNN networks. The proposed design allows for a fully automated network that can be trained easily in an end-to-end manner without requiring computationally intensive and time-consuming preprocessing or strenuous labeling of the data. We infer that unmarked CTPAs can be used for effective RV strain classification. This could be used as a second reader, alerting for high-risk PE patients. To the best of our knowledge, there are no previous deep learning-based studies that attempted to solve this problem.
时序|行为识别|姿态|视频|运动估计(7篇)
【1】 Perceptually Validated Precise Local Editing for Facial Action Units with StyleGAN 标题:基于StyleGAN的面部动作单元感知验证精确局部编辑
作者:Alara Zindancıoğlu,T. Metin Sezgin 机构:Koc¸ University, Istanbul, Turkey 链接:https://arxiv.org/abs/2107.12143 摘要:编辑面部表情的能力在计算机图形学中有着广泛的应用。理想的面部表情编辑算法需要满足两个重要条件。首先,它应该允许精确和有针对性地编辑个人面部动作。其次,它应该生成高保真输出而不产生伪影。我们构建了一个基于StyleGAN的解决方案,它已被广泛应用于人脸的语义处理。在这样做的同时,我们进一步了解了StyleGAN中各种语义属性的编码方式。特别地,我们证明了在潜在空间中执行编辑的天真策略会导致某些动作单元之间的不希望的耦合,即使它们在概念上是不同的。例如,虽然眉毛较低和嘴唇较紧是不同的行动单位,他们似乎相关的训练数据。因此,StyleGAN很难解开它们。我们通过计算每个动作单元的独立影响区域,允许对这些动作单元进行分离编辑,并限制对这些区域的编辑。通过对23名受试者的感知实验,验证了本文提出的局部编辑方法的有效性。结果表明,我们的方法提供了更高的控制本地编辑和产生的图像具有优越的保真度相比,国家的最先进的方法。 摘要:The ability to edit facial expressions has a wide range of applications in computer graphics. The ideal facial expression editing algorithm needs to satisfy two important criteria. First, it should allow precise and targeted editing of individual facial actions. Second, it should generate high fidelity outputs without artifacts. We build a solution based on StyleGAN, which has been used extensively for semantic manipulation of faces. As we do so, we add to our understanding of how various semantic attributes are encoded in StyleGAN. In particular, we show that a naive strategy to perform editing in the latent space results in undesired coupling between certain action units, even if they are conceptually distinct. For example, although brow lowerer and lip tightener are distinct action units, they appear correlated in the training data. Hence, StyleGAN has difficulty in disentangling them. We allow disentangled editing of such action units by computing detached regions of influence for each action unit, and restrict editing to these regions. We validate the effectiveness of our local editing method through perception experiments conducted with 23 subjects. The results show that our method provides higher control over local editing and produces images with superior fidelity compared to the state-of-the-art methods.
【2】 HANet: Hierarchical Alignment Networks for Video-Text Retrieval 标题:HANET:面向视频文本检索的层次对齐网络
作者:Peng Wu,Xiangteng He,Mingqian Tang,Yiliang Lv,Jing Liu 机构:Guangzhou Institute of Technology, Xidian University, Guangzhou, China, DAMO Academy, Alibaba Group, Hangzhou, China 备注:This work has been accepted to ACM-MM 2021 链接:https://arxiv.org/abs/2107.12059 摘要:视频文本检索是视觉语言理解中一项重要而富有挑战性的任务,其目的是学习一个相关视频和文本实例相互接近的联合嵌入空间。目前的研究大多是基于视频层和文本层的嵌入来度量视频文本的相似度。然而,忽略更细粒度或局部信息会导致表示不足的问题。有些作品通过解构句子来挖掘局部细节,却忽略了相应的视频,造成了视频文本表达的不对称。为了解决上述局限性,我们提出了一种层次对齐网络(HANet)来对齐视频文本匹配的不同层次表示。具体来说,我们首先将视频和文本分解为三个语义层次,即事件(视频和文本)、动作(动作和动词)和实体(外观和名词)。在此基础上,我们自然地以个体-局部-全局的方式构建了层次化的表示,其中个体层次关注的是帧与词之间的对齐,局部层次关注的是视频片段与文本上下文之间的对齐,全局层次关注的是整个视频与文本之间的对齐。不同层次的对齐捕获了视频和文本之间从细到粗的相关性,并利用了三个语义层次之间的互补信息。此外,我们的HANet还可以通过显式地学习关键的语义概念来进行丰富的解释。在MSR-VTT和VATEX两个公共数据集上的大量实验表明,所提出的HANet方法优于其他最新的方法,证明了层次表示和对齐的有效性。我们的代码是公开的。 摘要:Video-text retrieval is an important yet challenging task in vision-language understanding, which aims to learn a joint embedding space where related video and text instances are close to each other. Most current works simply measure the video-text similarity based on video-level and text-level embeddings. However, the neglect of more fine-grained or local information causes the problem of insufficient representation. Some works exploit the local details by disentangling sentences, but overlook the corresponding videos, causing the asymmetry of video-text representation. To address the above limitations, we propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching. Specifically, we first decompose video and text into three semantic levels, namely event (video and text), action (motion and verb), and entity (appearance and noun). Based on these, we naturally construct hierarchical representations in the individual-local-global manner, where the individual level focuses on the alignment between frame and word, local level focuses on the alignment between video clip and textual context, and global level focuses on the alignment between the whole video and text. Different level alignments capture fine-to-coarse correlations between video and text, as well as take the advantage of the complementary information among three semantic levels. Besides, our HANet is also richly interpretable by explicitly learning key semantic concepts. Extensive experiments on two public datasets, namely MSR-VTT and VATEX, show the proposed HANet outperforms other state-of-the-art methods, which demonstrates the effectiveness of hierarchical representation and alignment. Our code is publicly available.
【3】 ICDAR 2021 Competition on Scene Video Text Spotting 标题:ICDAR 2021现场视频文字识别比赛
作者:Zhanzhan Cheng,Jing Lu,Baorui Zou,Shuigeng Zhou,Fei Wu 机构: Zhejiang University, Hangzhou, China, Hikvision Research Institute, Hangzhou, China, Fudan University, Shanghai, China 备注:SVTS Technique Report for ICDAR 2021 competition 链接:https://arxiv.org/abs/2107.11919 摘要:场景视频文本识别(SVTS)因其众多的实际应用而成为一个非常重要的研究课题。然而,与静态图像中大量的场景文本识别研究相比,对场景视频文本的识别研究却很少。由于运动模糊等各种环境干扰,场景视频文本的识别变得非常困难。为了推动这一研究领域,本次比赛引入了一个新的挑战数据集,其中包含了来自21个自然场景的129个视频片段,并带有完整标注。比赛包括三项任务,即视频文本检测(任务1)、视频文本跟踪(任务2)和端到端视频文本定位(任务3)。在比赛期间(2021年3月1日开始,2021年4月11日结束),共有24支队伍参与了三项任务,共提交了46份有效结果。本文包括数据集描述、任务定义、评估协议和ICDAR 2021 SVTS比赛的结果总结。得益于可观的参赛队伍和提交数量,我们认为SVTS竞赛已经成功举办,引起了社区的广泛关注,促进了该领域的研究和发展。 摘要:Scene video text spotting (SVTS) is a very important research topic because of many real-life applications. However, only a little effort has been put to spotting scene video text, in contrast to massive studies of scene text spotting in static images. Due to various environmental interferences like motion blur, spotting scene video text becomes very challenging. To promote this research area, this competition introduces a new challenge dataset containing 129 video clips from 21 natural scenarios in full annotations. The competition contains three tasks, that is, video text detection (Task 1), video text tracking (Task 2) and end-to-end video text spotting (Task 3). During the competition period (opened on 1st March, 2021 and closed on 11th April, 2021), a total of 24 teams participated in the three proposed tasks with 46 valid submissions, respectively. This paper includes dataset descriptions, task definitions, evaluation protocols and results summaries of the ICDAR 2021 on SVTS competition. Thanks to the healthy number of teams as well as submissions, we consider that the SVTS competition has been successfully held, drawing much attention from the community and promoting the field research and its development.
【4】 Transcript to Video: Efficient Clip Sequencing from Texts 标题:从文本到视频:从文本到视频的高效剪辑排序
作者:Yu Xiong,Fabian Caba Heilbron,Dahua Lin 机构:The Chinese University of Hong Kong, Adobe Research 备注:Tech Report; Demo and project page at this http URL 链接:https://arxiv.org/abs/2107.11851 摘要:在网上分享的众多视频中,经过精心编辑的视频总是吸引更多的关注。然而,对于没有经验的用户来说,制作经过良好编辑的视频是很困难的,因为这需要专业知识和巨大的体力劳动。为了满足非专家的需求,我们提出了“转录到视频”(Transcript-to-Video)——一种弱监督框架,它使用文本作为输入,从大量镜头中自动创建视频序列。具体来说,我们提出了一个内容检索模块和一个时间连贯模块,分别学习视觉-语言表示和对镜头排序风格建模。为了快速推断,我们引入了一种高效的实时视频片段排序搜索策略。定量结果和用户研究从经验上表明,所提出的学习框架能够检索与内容相关的镜头,同时在风格上生成合理的视频序列。此外,运行时性能分析表明,该框架能够支持实际应用。 摘要:Among numerous videos shared on the web, well-edited ones always attract more attention. However, it is difficult for inexperienced users to make well-edited videos because it requires professional expertise and immense manual labor. To meet the demands for non-experts, we present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots. Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles, respectively. For fast inference, we introduce an efficient search strategy for real-time video clip sequencing. Quantitative results and user studies demonstrate empirically that the proposed learning framework can retrieve content-relevant shots while creating plausible video sequences in terms of style. Besides, the run-time performance analysis shows that our framework can support real-world applications.
【5】 Can Action be Imitated? Learn to Reconstruct and Transfer Human Dynamics from Videos 标题:行动能被模仿吗?学习从视频中重建和传输人体动力学
作者:Yuqian Fu,Yanwei Fu,Yu-Gang Jiang 机构:Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, China, School of Data Science and MOE Frontiers Center for Brain Science, Fudan University, China 备注:accepted by ICMR2021 链接:https://arxiv.org/abs/2107.11756 摘要:给出一个视频演示,我们能模仿这个视频中包含的动作吗?在本文中,我们介绍了一个新的任务,称为基于网格的动作模拟。此任务的目标是使任意目标人体网格能够执行视频演示中显示的相同操作。为此,本文提出了一种基于网格的视频动作模拟(M-VAI)方法。M-VAI首先学习从给定的源图像帧中重构网格,然后将初始恢复的网格序列输入到我们提出的网格序列平滑模块mesh2mesh中,以提高时间一致性。最后,我们通过将构建的人体姿势转换到目标身份网格来模拟动作。利用M-VAI可以生成高质量、精细的人体网格。大量的实验证明了我们的任务的可行性和我们提出的方法的有效性。 摘要:Given a video demonstration, can we imitate the action contained in this video? In this paper, we introduce a novel task, dubbed mesh-based action imitation. The goal of this task is to enable an arbitrary target human mesh to perform the same action shown on the video demonstration. To achieve this, a novel Mesh-based Video Action Imitation (M-VAI) method is proposed by us. M-VAI first learns to reconstruct the meshes from the given source image frames, then the initial recovered mesh sequence is fed into mesh2mesh, a mesh sequence smooth module proposed by us, to improve the temporal consistency. Finally, we imitate the actions by transferring the pose from the constructed human body to our target identity mesh. High-quality and detailed human body meshes can be generated by using our M-VAI. Extensive experiments demonstrate the feasibility of our task and the effectiveness of our proposed method.
【6】 Boosting Video Captioning with Dynamic Loss Network 标题:利用动态损失网络增强视频字幕
作者:Nasibullah,Partha Pratim Mohanta 机构: Indian Statistical Institute, Kolkata 备注:10 pages, 3 figures, Preprint 链接:https://arxiv.org/abs/2107.11707 摘要:视频字幕是视觉与语言交叉领域的一个极具挑战性的问题,在视频检索、视频监控、辅助视觉障碍者、人机界面等方面有着广泛的应用。近年来,基于深度学习的方法取得了很好的效果,但与其他视觉任务(如图像分类、目标检测等)相比仍处于较低的水平。现有的视频字幕方法的一个显著缺点是,它们是在交叉熵损失函数的基础上进行优化的,而交叉熵损失函数与事实上的评价指标(BLEU、METEOR、CIDER、ROUGE)不相关,换句话说,交叉熵不是视频字幕真实损失函数的合适替代品。本文通过引入动态损失网络(DLN)来解决这个问题,DLN提供了一个直接反映评估指标的反馈信号。我们在Microsoft Research视频描述语料库(MSVD)和MSR-Video to Text(MSRVTT)数据集上的结果优于以前的方法。 摘要:Video captioning is one of the challenging problems at the intersection of vision and language, having many real-life applications in video retrieval, video surveillance, assisting visually challenged people, Human-machine interface, and many more. Recent deep learning-based methods have shown promising results but are still on the lower side than other vision tasks (such as image classification, object detection). A significant drawback with existing video captioning methods is that they are optimized over cross-entropy loss function, which is uncorrelated to the de facto evaluation metrics (BLEU, METEOR, CIDER, ROUGE). In other words, cross-entropy is not a proper surrogate of the true loss function for video captioning. This paper addresses the drawback by introducing a dynamic loss network (DLN), which provides an additional feedback signal that directly reflects the evaluation metrics. Our results on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform previous methods.
【7】 Towards Generative Video Compression 标题:走向生成性视频压缩
作者:Fabian Mentzer,Eirikur Agustsson,Johannes Ballé,David Minnen,Nick Johnston,George Toderici 机构:Google Research 链接:https://arxiv.org/abs/2107.12038 摘要:提出了一种基于生成对抗网络的神经视频压缩方法,其性能优于以往的神经视频压缩方法,在用户研究中与HEVC相当。本文提出了一种基于频谱分析的随机移位和非移位的递归帧压缩技术,以减少由递归帧压缩引起的时间误差积累。我们详细介绍了网络设计的选择,它们的相对重要性,并阐述了在用户研究中评估视频压缩方法的挑战。 摘要:We present a neural video compression method based on generative adversarial networks (GANs) that outperforms previous neural video compression methods and is comparable to HEVC in a user study. We propose a technique to mitigate temporal error accumulation caused by recursive frame compression that uses randomized shifting and un-shifting, motivated by a spectral analysis. We present in detail the network design choices, their relative importance, and elaborate on the challenges of evaluating video compression methods in user studies.
医学相关(1篇)
【1】 Lung Cancer Risk Estimation with Incomplete Data: A Joint Missing Imputation Perspective 标题:不完全数据下的肺癌风险估计:联合缺失归因视角
作者:Riqiang Gao,Yucheng Tang,Kaiwen Xu,Ho Hin Lee,Steve Deppen,Kim Sandler,Pierre Massion,Thomas A. Lasko,Yuankai Huo,Bennett A. Landman 机构:EECS, Vanderbilt University, Nashville, TN, USA, Vanderbilt University Medical Center, Nashville, TN, USA 备注:Early Accepted by MICCAI 2021. Traveling Award 链接:https://arxiv.org/abs/2107.11882 摘要:来自多模态的数据为临床预测提供了补充信息,但临床队列中的缺失数据限制了多模态学习环境中受试者的数量。当1)缺失数据跨异质模态(如图像与非图像)时,多模态缺失插补对现有方法具有挑战性;或者2)一种模态基本上缺失。本文通过建立多模态数据的联合分布模型来解决缺失数据的插补问题。基于部分双向生成对抗网(PBiGAN)的思想,提出了一种新的条件PBiGAN(C-PBiGAN)方法,结合另一模态的条件知识对一模态进行插补。具体而言,C-PBiGAN在缺失插补框架中引入了一个条件潜空间,对可用的多模态数据进行联合编码,同时对插补数据施加类别正则化损失,以恢复判别信息。据我们所知,这是第一个通过建立图像和非图像数据的联合分布模型来解决多模态缺失插补问题的生成性对抗模型。我们用国家肺筛查试验(NLST)数据集和外部临床验证队列来验证我们的模型。与有代表性的插补方法相比,建议的C-PBiGAN在肺癌风险估计方面取得了显著的改进(例如,与PBiGAN相比,NLST(+2.9%)和内部数据集(+4.3%)的AUC值均增加,p<0.05)。 摘要:Data from multi-modality provide complementary information in clinical prediction, but missing data in clinical cohorts limits the number of subjects in multi-modal learning context. Multi-modal missing imputation is challenging with existing methods when 1) the missing data span across heterogeneous modalities (e.g., image vs. non-image); or 2) one modality is largely missing. In this paper, we address imputation of missing data by modeling the joint distribution of multi-modal data. Motivated by partial bidirectional generative adversarial net (PBiGAN), we propose a new Conditional PBiGAN (C-PBiGAN) method that imputes one modality combining the conditional knowledge from another modality. Specifically, C-PBiGAN introduces a conditional latent space in a missing imputation framework that jointly encodes the available multi-modal data, along with a class regularization loss on imputed data to recover discriminative information. To our knowledge, it is the first generative adversarial model that addresses multi-modal missing imputation by modeling the joint distribution of image and non-image data. We validate our model with both the national lung screening trial (NLST) dataset and an external clinical validation cohort. The proposed C-PBiGAN achieves significant improvements in lung cancer risk estimation compared with representative imputation methods (e.g., AUC values increase in both NLST (+2.9%) and in-house dataset (+4.3%) compared with PBiGAN, p<0.05).
GAN|对抗|攻击|生成相关(5篇)
【1】 Benign Adversarial Attack: Tricking Algorithm for Goodness 标题:良性对抗性攻击:行善的欺骗算法
作者:Xian Zhao,Jiaming Zhang,Zhiyu Lin,Jitao Sang 机构:Beijing Jiaotong University, Beijing, China 备注:Preprint. Under review 链接:https://arxiv.org/abs/2107.11986 摘要:尽管机器学习算法在许多领域得到了成功的应用,但它仍然存在着一些臭名昭著的问题,比如易受对手的攻击。除了在对抗攻击和防御之间陷入猫捉老鼠的游戏中,本文还提供了另一种视角来考虑对抗实例,并探讨我们能否在良性应用中利用它。我们首先提出了一种新的视觉信息分类方法,包括任务相关性和语义取向。对抗性实例的出现是由于算法利用了与任务相关的非语义信息。在经典的机器学习机制中,任务相关的非语义信息在很大程度上被忽略了,但它有三个有趣的特点:(1)算法独有的;(2)反映出共同的弱点;(3)作为特征可利用。受此启发,我们提出了一种称为良性对抗攻击的勇敢的新思想,从三个方向利用对抗性例子:(1)对抗性图灵测试,(2)拒绝恶意算法,(3)对抗性数据扩充。每一个方向都有动机阐述、理由分析和原型应用程序来展示其潜力。 摘要:In spite of the successful application in many fields, machine learning algorithms today suffer from notorious problems like vulnerability to adversarial examples. Beyond falling into the cat-and-mouse game between adversarial attack and defense, this paper provides alternative perspective to consider adversarial example and explore whether we can exploit it in benign applications. We first propose a novel taxonomy of visual information along task-relevance and semantic-orientation. The emergence of adversarial example is attributed to algorithm's utilization of task-relevant non-semantic information. While largely ignored in classical machine learning mechanisms, task-relevant non-semantic information enjoys three interesting characteristics as (1) exclusive to algorithm, (2) reflecting common weakness, and (3) utilizable as features. Inspired by this, we present brave new idea called benign adversarial attack to exploit adversarial examples for goodness in three directions: (1) adversarial Turing test, (2) rejecting malicious algorithm, and (3) adversarial data augmentation. Each direction is positioned with motivation elaboration, justification analysis and prototype applications to showcase its potential.
【2】 Adversarial training may be a double-edged sword 标题:对抗性训练可能是一把双刃剑
作者:Ali Rahmati,Seyed-Mohsen Moosavi-Dezfooli,Huaiyu Dai 机构:∗North Carolina State University, Raleigh, NC, †Institute for Machine Learning, ETH Z¨urich 备注:Presented as a RobustML workshop paper at ICLR 2021 链接:https://arxiv.org/abs/2107.11671 摘要:对抗训练是提高图像分类器抗白盒攻击鲁棒性的有效方法。然而,它对黑匣子攻击的有效性更为微妙。在这项工作中,我们证明了在深度网络的决策边界上对抗性训练的一些几何结果为某些类型的黑盒攻击提供了优势。特别地,我们定义了一个称为鲁棒性增益的度量来表明,尽管对抗性训练是一种有效的方法,可以显著地提高白盒场景中的鲁棒性,但它可能无法提供针对更现实的基于决策的黑盒攻击的鲁棒性增益。此外,我们还证明了即使是最小扰动白盒攻击也能比常规神经网络更快地收敛于对手训练的神经网络。 摘要:Adversarial training has been shown as an effective approach to improve the robustness of image classifiers against white-box attacks. However, its effectiveness against black-box attacks is more nuanced. In this work, we demonstrate that some geometric consequences of adversarial training on the decision boundary of deep networks give an edge to certain types of black-box attacks. In particular, we define a metric called robustness gain to show that while adversarial training is an effective method to dramatically improve the robustness in white-box scenarios, it may not provide such a good robustness gain against the more realistic decision-based black-box attacks. Moreover, we show that even the minimal perturbation white-box attacks can converge faster against adversarially-trained neural networks compared to the regular ones.
【3】 X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering 标题:X-GGM:视觉答疑中离散型泛化的图形生成建模
作者:Jingjing Jiang,Ziyi Liu,Yifan Liu,Zhixiong Nan,Nanning Zheng 机构:Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Shannxi, China 备注:Accepted by ACM MM2021 链接:https://arxiv.org/abs/2107.11576 摘要:近年来,可视化问答(VQA)的研究取得了可喜的进展,但如何使VQA模型能够自适应地推广到分布外(OOD)样本仍然是一个挑战。直观地说,对已有的视觉概念(即属性和对象)进行重组,可以在训练集中生成不可见的合成,这将促进VQA模型推广到面向对象的样本。本文将面向对象的泛化问题描述为一个组合泛化问题,并提出了一种基于图生成建模的隐式训练方案(X-GGM)。X-GGM利用图生成建模来迭代地生成预定义图的关系矩阵和节点表示,该预定义图利用属性对象对作为节点。此外,为了解决图生成建模中训练不稳定的问题,我们提出了一种梯度分布一致性损失的方法来约束具有对抗性扰动的数据分布和生成的分布。使用X-GGM方案训练的基线VQA模型(LXMERT)在两个标准VQA-OOD基准(即VQA-CP v2和GQA-OOD)上实现了最先进的OOD性能。广泛的消融研究证明了X-GGM组件的有效性。 摘要:Encouraging progress has been made towards Visual Question Answering (VQA) in recent years, but it is still challenging to enable VQA models to adaptively generalize to out-of-distribution (OOD) samples. Intuitively, recompositions of existing visual concepts (i.e., attributes and objects) can generate unseen compositions in the training set, which will promote VQA models to generalize to OOD samples. In this paper, we formulate OOD generalization in VQA as a compositional generalization problem and propose a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation matrix and node representations for the predefined graph that utilizes attribute-object pairs as nodes. Furthermore, to alleviate the unstable training issue in graph generative modeling, we propose a gradient distribution consistency loss to constrain the data distribution with adversarial perturbations and the generated distribution. The baseline VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation studies demonstrate the effectiveness of X-GGM components.
【4】 A Unified Hyper-GAN Model for Unpaired Multi-contrast MR Image Translation 标题:用于非配对多对比度MR图像平移的统一Hyper-GaN模型
作者:Heran Yang,Jian Sun,Liwei Yang,Zongben Xu 机构: School of Mathematics and Statistics, Xi’an Jiaotong University, China, Pazhou Lab, Guangzhou, China 备注:11 pages, 4 figures, accepted by MICCAI 2021 链接:https://arxiv.org/abs/2107.11945 摘要:在临床诊断中,交叉对比图像的翻译是完成缺失对比的重要任务。然而,大多数现有的方法都是为每一对对比度学习单独的翻译,这是由于在实际场景中存在许多可能的对比度对而导致的效率低下。在这项工作中,我们提出了一个统一的Hyper-GAN模型来有效地在不同的对比度对之间进行翻译。超GAN由一对超编码器和超解码器组成,首先将源对比度映射到公共特征空间,然后进一步映射到目标对比度图像。为了便于不同对比度对之间的转换,设计了对比度调制器来调整超编码器和超解码器以适应不同的对比度。我们还设计了一个公共空间损失,以强制一个主题的多对比度图像共享一个公共特征空间,隐式地建模共享的底层解剖结构。在IXI和BraTS 2019两个数据集上的实验表明,我们的Hyper-GAN在精度和效率上都达到了最先进的结果,例如,在两个参数量不到一半的数据集上,PSNR分别提高了1.47和1.09db以上。 摘要:Cross-contrast image translation is an important task for completing missing contrasts in clinical diagnosis. However, most existing methods learn separate translator for each pair of contrasts, which is inefficient due to many possible contrast pairs in real scenarios. In this work, we propose a unified Hyper-GAN model for effectively and efficiently translating between different contrast pairs. Hyper-GAN consists of a pair of hyper-encoder and hyper-decoder to first map from the source contrast to a common feature space, and then further map to the target contrast image. To facilitate the translation between different contrast pairs, contrast-modulators are designed to tune the hyper-encoder and hyper-decoder adaptive to different contrasts. We also design a common space loss to enforce that multi-contrast images of a subject share a common feature space, implicitly modeling the shared underlying anatomical structures. Experiments on two datasets of IXI and BraTS 2019 show that our Hyper-GAN achieves state-of-the-art results in both accuracy and efficiency, e.g., improving more than 1.47 and 1.09 dB in PSNR on two datasets with less than half the amount of parameters.
【5】 Reconstructing Images of Two Adjacent Objects through Scattering Medium Using Generative Adversarial Network 标题:利用产生式对抗性网络通过散射介质重建相邻两个物体的图像
作者:Xuetian Lai,Qiongyao Li,Ziyang Chen,Xiaopeng Shao,Jixiong Pu 机构:Fujian Provincial Key Laboratory of Light Propagation and Transformation, College of Information Science and, Engineering, Huaqiao University, Xiamen, Fujian , China, School of Physics and Optoelectronic Engineering, Xidian University, Xi’an , China 链接:https://arxiv.org/abs/2107.11574 摘要:利用卷积神经网络(CNNs)进行图像重建是近十年来研究的热点。到目前为止,利用神经网络对散射介质中的单个物体进行成像的技术已经发展了很多,但是如何同时重建多个物体的图像却很难实现。本文提出了一种利用生成对抗网络(GAN)通过散射介质重建相邻物体图像的方法。我们构造了一个成像系统来成像散射介质后面的两个相邻物体。通常情况下,当两个相邻物体图像的光场通过散射板时,会得到散斑图样。利用所设计的对抗网络YGAN对图像进行同步重建。结果表明,基于训练好的YGAN,我们可以从一个散斑图中重建两个相邻物体的图像,具有很高的保真度。此外,我们还研究了物体图像类型和相邻物体之间的距离对重建图像保真度的影响。此外,即使在两个物体之间插入另一个散射介质,我们也可以从一个斑点重建两个物体的高质量图像。该技术可应用于医学图像分析领域,如医学图像分类、分割、多目标散射成像研究等。 摘要:Reconstruction of image by using convolutional neural networks (CNNs) has been vigorously studied in the last decade. Until now, there have being developed several techniques for imaging of a single object through scattering medium by using neural networks, however how to reconstruct images of more than one object simultaneously seems hard to realize. In this paper, we demonstrate an approach by using generative adversarial network (GAN) to reconstruct images of two adjacent objects through scattering media. We construct an imaging system for imaging of two adjacent objects behind the scattering media. In general, as the light field of two adjacent object images pass through the scattering slab, a speckle pattern is obtained. The designed adversarial network, which is called as YGAN, is employed to reconstruct the images simultaneously. It is shown that based on the trained YGAN, we can reconstruct images of two adjacent objects from one speckle pattern with high fidelity. In addition, we study the influence of the object image types, and the distance between the two adjacent objects on the fidelity of the reconstructed images. Moreover even if another scattering medium is inserted between the two objects, we can also reconstruct the images of two objects from a speckle with high quality. The technique presented in this work can be used for applications in areas of medical image analysis, such as medical image classification, segmentation, and studies of multi-object scattering imaging etc.
自动驾驶|车辆|车道检测等(1篇)
【1】 Multimodal Fusion Using Deep Learning Applied to Driver's Referencing of Outside-Vehicle Objects 标题:基于深度学习的多模态融合在驾驶员对车外物体参考中的应用
作者:Abdul Rafey Aftab,Michael von der Beeck,Steven Rohrhirsch,Benoit Diotte,Michael Feld 链接:https://arxiv.org/abs/2107.12167 摘要:人们越来越感兴趣的是更智能的自然用户与汽车的互动。手势和语音已经被应用于司机和汽车的互动。此外,多模态方法在汽车工业中也显示出前景。在本文中,我们将深度学习用于多模态融合网络,以对车辆外部的对象进行参考(referencing)。我们同时利用凝视、头部姿势和手指指向的特征来精确预测不同汽车姿势下的参考对象。我们演示了每种模态在用于自然引用形式时的实际局限性,特别是在车内。从我们的结果可以明显看出,我们通过添加其他模态在很大程度上克服了特定模态的局限性。这项工作强调了多模态感知的重要性,特别是在走向自然用户交互时。此外,我们基于用户的分析显示,根据车辆姿态,用户行为的识别存在显著差异。 摘要:There is a growing interest in more intelligent natural user interaction with the car. Hand gestures and speech are already being applied for driver-car interaction. Moreover, multimodal approaches are also showing promise in the automotive industry. In this paper, we utilize deep learning for a multimodal fusion network for referencing objects outside the vehicle. We use features from gaze, head pose and finger pointing simultaneously to precisely predict the referenced objects in different car poses. We demonstrate the practical limitations of each modality when used for a natural form of referencing, specifically inside the car. As evident from our results, we overcome the modality specific limitations, to a large extent, by the addition of other modalities. This work highlights the importance of multimodal sensing, especially when moving towards natural user interaction. Furthermore, our user based analysis shows noteworthy differences in recognition of user behavior depending upon the vehicle pose.
OCR|文本相关(1篇)
【1】 Cycled Compositional Learning between Images and Text 标题:图文循环构图学习
作者:Jongseok Kim,Youngjae Yu,Seunghwan Lee,Gunhee Kim 机构:Seoul National University, RippleAI, Seoul, Korea 备注:Fashion IQ 2020 challenge winner. Workshop tech report 链接:https://arxiv.org/abs/2107.11509 摘要:我们提出了一种称为循环组合网络(Cycled Composition Network)的方法,用于度量图像-文本嵌入组合的语义距离。首先,组合网络在嵌入空间中利用相对描述(relative caption)将参考图像映射到目标图像。其次,校正网络计算嵌入空间中参考图像和检索到的目标图像之间的差异,并将其与相对描述匹配。我们的目标是通过组合网络学习组合映射。由于这种单向映射是高度欠约束的,我们将其与校正网络的逆关系学习相结合,并为给定图像引入循环关系。我们参加了Fashion IQ 2020挑战赛,并通过我们的模型集成获得了第一名。 摘要:We present an approach named the Cycled Composition Network that can measure the semantic distance of the composition of image-text embedding. First, the Composition Network transits a reference image to target image in an embedding space using relative caption. Second, the Correction Network calculates a difference between reference and retrieved target images in the embedding space and match it with a relative caption. Our goal is to learn a Composition mapping with the Composition Network. Since this one-way mapping is highly under-constrained, we couple it with an inverse relation learning with the Correction Network and introduce a cycled relation for a given image. We participate in Fashion IQ 2020 challenge and have won the first place with the ensemble of our model.
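下面用一个最小的 Python 示意说明“组合网络 + 校正网络 + 循环关系”的训练形式(非官方实现,网络结构、维度与损失形式均为假设):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512
compose = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))   # 组合网络
correct = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))   # 校正网络

ref, tgt, cap = torch.randn(32, D), torch.randn(32, D), torch.randn(32, D)

pred_tgt = compose(torch.cat([ref, cap], dim=-1))      # 参考图像 + 相对描述 -> 目标图像嵌入
pred_cap = correct(torch.cat([ref, tgt], dim=-1))      # 参考与目标的差异 -> 相对描述嵌入

loss_comp = 1 - F.cosine_similarity(pred_tgt, tgt, dim=-1).mean()   # 组合(前向)损失
loss_corr = 1 - F.cosine_similarity(pred_cap, cap, dim=-1).mean()   # 校正(逆向)损失
cycle_tgt = compose(torch.cat([ref, pred_cap], dim=-1))             # 循环:用预测出的描述再做组合
loss_cycle = 1 - F.cosine_similarity(cycle_tgt, tgt, dim=-1).mean()

loss = loss_comp + loss_corr + loss_cycle
loss.backward()
print(float(loss))
```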
人脸|人群计数(1篇)
【1】 Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations 标题:Faceron:基于跨模态潜在表征的多说话人人脸到语音模型
作者:Se-Yun Um,Jihyun Kim,Jihyun Lee,Sangshin Oh,Kyungguen Byun,Hong-Goo Kang 机构:Dept. of E.E., Yonsei University, Seoul, Korea 备注:10 pages (including references), 3 figures 链接:https://arxiv.org/abs/2107.12003 摘要:在这篇论文中,我们提出了一种有效的方法来合成特定说话人的语音波形通过调节个人的脸视频。该方法以语言特征和说话人特征为辅助条件,在端到端的训练框架下,直接将人脸图像转换成语音波形。利用唇读模型从唇部运动中提取语言特征,利用预先训练好的声学模型通过跨模态学习从人脸图像中预测说话人特征。由于这两个特征是不相关的,并且是独立控制的,因此我们可以灵活地合成语音波形,其说话人特征随输入的人脸图像而变化。因此,我们的方法可以看作是一个多说话人面对面语音波形模型。在客观和主观评价结果方面,我们证明了我们提出的模型比传统方法的优越性。具体来说,我们分别通过测量自动语音识别和自动说话人/性别识别任务的准确性来评估语言特征和说话人特征生成模块的性能。我们也使用平均意见评分(MOS)测试来评估合成语音波形的自然度。 摘要:In this paper, we propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. Therefore, our method can be regarded as a multi-speaker face-to-speech waveform model. We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results. Specifically, we evaluate the performances of the linguistic feature and the speaker characteristic generation modules by measuring the accuracy of automatic speech recognition and automatic speaker/gender recognition tasks, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test.
跟踪(1篇)
【1】 Learning to Adversarially Blur Visual Object Tracking 标题:学习逆模糊视觉目标跟踪
作者:Qing Guo,Ziyi Cheng,Felix Juefei-Xu,Lei Ma,Xiaofei Xie,Yang Liu,Jianjun Zhao 机构: Nanyang Technological University, Singapore, Kyushu University, Japan, Alibaba Group, USA, University of Alberta, Canada 备注:This work has been accepted to ICCV2021. 12 pages, 5 figures 链接:https://arxiv.org/abs/2107.12085 摘要:在曝光过程中,由于物体或摄像机的运动而引起的运动模糊是视觉目标跟踪的一个关键问题,严重影响跟踪精度。在这项工作中,我们从一个新的角度,即对抗性模糊攻击(ABA),来探讨视觉目标跟踪器对运动模糊的鲁棒性。我们的主要目标是在线地将输入帧转换为其自然的运动模糊版本,同时在跟踪过程中误导最先进的跟踪器。为此,本文首先根据运动模糊的产生原理,结合运动信息和光的积累过程,设计了一种用于视觉跟踪的运动模糊合成方法。利用这种合成方法,我们提出了基于优化的ABA(OP-ABA),即针对跟踪器、关于运动和光积累参数迭代优化一个对抗目标函数。OP-ABA能够生成自然的对抗性示例,但迭代会导致大量的时间开销,因此不适合攻击实时跟踪器。为了缓解这一问题,我们进一步提出了一步ABA(OS-ABA),在OP-ABA的指导下设计并训练了一个联合对抗运动和积累预测网络(JAMANet),该网络能够一步有效地估计对抗运动和积累参数。在四个流行数据集(如OTB100、VOT2018、UAV123和LaSOT)上的实验表明,我们的方法能够使四个最先进的跟踪器出现显著的精度下降,并具有较高的可迁移性。源代码见 https://github.com/tsingqguo/ABA 摘要:Motion blur caused by the moving of the object or camera during the exposure can be a key challenge for visual object tracking, affecting tracking accuracy significantly. In this work, we explore the robustness of visual object trackers against motion blur from a new angle, i.e., adversarial blur attack (ABA). Our main objective is to online transfer input frames to their natural motion-blurred counterparts while misleading the state-of-the-art trackers during the tracking process. To this end, we first design the motion blur synthesizing method for visual tracking based on the generation principle of motion blur, considering the motion information and the light accumulation process. With this synthetic method, we propose optimization-based ABA (OP-ABA) by iteratively optimizing an adversarial objective function against the tracking w.r.t. the motion and light accumulation parameters. The OP-ABA is able to produce natural adversarial examples but the iteration can cause heavy time cost, making it unsuitable for attacking real-time trackers. To alleviate this issue, we further propose one-step ABA (OS-ABA) where we design and train a joint adversarial motion and accumulation predictive network (JAMANet) with the guidance of OP-ABA, which is able to efficiently estimate the adversarial motion and accumulation parameters in a one-step way. The experiments on four popular datasets (e.g., OTB100, VOT2018, UAV123, and LaSOT) demonstrate that our methods are able to cause significant accuracy drops on four state-of-the-art trackers with high transferability. Please find the source code at https://github.com/tsingqguo/ABA
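下面给出按“运动 + 光积累”原理合成运动模糊的一个最小 Python 示意(非官方实现):把同一帧沿运动方向平移多次并取平均;其中的运动参数即可作为对抗优化的变量,参数取值均为假设。
```python
import torch

def synthesize_motion_blur(frame, dx, dy, steps=8):
    """frame: (C, H, W);dx, dy: 曝光内的整像素运动分量;steps: 光积累的累积次数。"""
    acc = torch.zeros_like(frame)
    for i in range(steps):
        # 取运动轨迹上第 i 个位置的图像并累加,模拟曝光期间的光积累
        shift_x = int(round(dx * i / max(steps - 1, 1)))
        shift_y = int(round(dy * i / max(steps - 1, 1)))
        acc += torch.roll(frame, shifts=(shift_y, shift_x), dims=(1, 2))
    return acc / steps

blurred = synthesize_motion_blur(torch.rand(3, 224, 224), dx=6, dy=2)
print(blurred.shape)
```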
表征学习(1篇)
【1】 Alleviate Representation Overlapping in Class Incremental Learning by Contrastive Class Concentration 标题:通过对比类别集中缓解类别增量学习中的表征重叠
作者:Zixuan Ni,Haizhou shi,Siliang tang,Yueting Zhuang 机构:Zhejiang University 链接:https://arxiv.org/abs/2107.12308 摘要:课堂增量学习(CIL)的挑战在于,学习者很难在不保留原有数据的情况下,区分新旧课堂的数据。即不同相位的表示分布相互重叠。为了缓解基于记忆和无记忆方法的表示重叠现象,本文提出了一种新的CIL框架,即对比类集中CIL(C4IL)。我们的框架利用了对比表征学习的类集中效应,从而产生了具有更好的类内紧性和类间可分性的表征分布。定量实验展示了我们的框架在基于记忆和无记忆两种情况下都是有效的:在10相和20相CIL中,它的平均准确率和top-1准确率都比两种情况下的基线方法高出5%。定性结果也表明,我们的方法产生了一个更紧凑的表示分布,减轻了重叠问题。 摘要:The challenge of the Class Incremental Learning (CIL) lies in difficulty for a learner to discern the old classes' data from the new while no previous data is preserved. Namely, the representation distribution of different phases overlaps with each other. In this paper, to alleviate the phenomenon of representation overlapping for both memory-based and memory-free methods, we propose a new CIL framework, Contrastive Class Concentration for CIL (C4IL). Our framework leverages the class concentration effect of contrastive representation learning, therefore yielding a representation distribution with better intra-class compactibility and inter-class separability. Quantitative experiments showcase our framework that is effective in both memory-based and memory-free cases: it outperforms the baseline methods of both cases by 5% in terms of the average and top-1 accuracy in 10-phase and 20-phase CIL. Qualitative results also demonstrate that our method generates a more compact representation distribution that alleviates the overlapping problem.
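论文利用的“类集中效应”来自有监督对比学习;下面给出有监督对比损失(SupCon)的一个朴素 Python 写法作为参考(非 C4IL 官方实现,温度等超参为假设值):
```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, D) 样本表示;labels: (N,) 类别标签。同类样本互为正样本。"""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature                            # (N, N) 相似度
    n = z.size(0)
    mask_self = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(mask_self, float('-inf'))          # 排除自身
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    # 对每个锚点,在其所有同类正样本上取平均:把同类样本拉近,即"类集中"
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count
    return loss.mean()

loss = supervised_contrastive_loss(torch.randn(16, 128), torch.randint(0, 4, (16,)))
print(float(loss))
```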
视觉解释|视频理解VQA|caption等(1篇)
【1】 Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph 标题:利用多模态知识图增强实体感知图像字幕
作者:Wentian Zhao,Yao Hu,Heda Wang,Xinxiao Wu,Jiebo Luo 机构: School of Computer Science, Beijing Institute of Technology, Beijing , China, Alibaba Group, Department of Computer Science, University of Rochester, Rochester NY , USA 链接:https://arxiv.org/abs/2107.11970 摘要:实体感知图像字幕的目的是利用相关文章中的背景知识来描述与图像相关的命名实体和事件。这项任务仍然具有挑战性,因为由于命名实体的长尾分布,很难学习命名实体和视觉线索之间的关联。此外,本文的复杂性给提取实体之间的细粒度关系以生成有关图像的信息性事件描述带来了困难。为了应对这些挑战,我们提出了一种新的方法,即构造一个多模态的知识图来将视觉对象与命名实体相关联,并借助从web上收集的外部知识同时捕获实体之间的关系。具体来说,我们通过从文章中提取命名实体及其关系来构建文本子图,通过检测图像中的对象来构建图像子图。为了连接这两个子图,我们提出了一个跨模态实体匹配模块,该模块使用包含Wikipedia条目和相应图像的知识库进行训练。最后,通过图注意机制将多模态知识图集成到字幕模型中。在GoodNews和NYTimes800k数据集上的大量实验证明了该方法的有效性。 摘要:Entity-aware image captioning aims to describe named entities and events related to the image by utilizing the background knowledge in the associated article. This task remains challenging as it is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities. Furthermore, the complexity of the article brings difficulty in extracting fine-grained relationships between entities to generate informative event descriptions about the image. To tackle these challenges, we propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities and capture the relationship between entities simultaneously with the help of external knowledge collected from the web. Specifically, we build a text sub-graph by extracting named entities and their relationships from the article, and build an image sub-graph by detecting the objects in the image. To connect these two sub-graphs, we propose a cross-modal entity matching module trained using a knowledge base that contains Wikipedia entries and the corresponding images. Finally, the multi-modal knowledge graph is integrated into the captioning model via a graph attention mechanism. Extensive experiments on both GoodNews and NYTimes800k datasets demonstrate the effectiveness of our method.
点云|SLAM|雷达|激光|深度RGBD相关(1篇)
【1】 HRegNet: A Hierarchical Network for Large-scale Outdoor LiDAR Point Cloud Registration 标题:HRegNet:一种适用于大规模室外激光雷达点云配准的分层网络
作者:Fan Lu,Guang Chen,Yinlong Liu,Lijun Zhang,Sanqing Qu,Shu Liu,Rongqi Gu 机构:Tongji University,Technische Universit¨at M¨unchen,ETH Zurich,Westwell lab 备注:Accepted to ICCV 2021 链接:https://arxiv.org/abs/2107.11992 摘要:点云配准是三维计算机视觉中的一个基本问题。室外LiDAR点云具有尺度大、分布复杂等特点,使得点云配准具有挑战性。在本文中,我们提出了一个有效的分层网络命名为HRegNet的大规模户外LiDAR点云登记。HRegNet不使用点云中的所有点,而是对分层提取的关键点和描述符执行注册。整体框架结合了深层可靠的特征和浅层精确的位置信息,实现了鲁棒精确的配准。我们提出一个对应网络来产生正确和准确的关键点对应。此外,在关键点匹配中引入了双边一致性和邻域一致性,并设计了新的相似性特征将其融入到对应网络中,显著提高了配准性能。此外,整个网络也很高效,因为只有少量的关键点用于注册。在两个大型室外激光雷达点云数据集上进行了大量实验,验证了所提出的HRegNet的高精度和高效率。项目网站是https://ispc-group.github.io/hregnet. 摘要:Point cloud registration is a fundamental problem in 3D computer vision. Outdoor LiDAR point clouds are typically large-scale and complexly distributed, which makes the registration challenging. In this paper, we propose an efficient hierarchical network named HRegNet for large-scale outdoor LiDAR point cloud registration. Instead of using all points in the point clouds, HRegNet performs registration on hierarchically extracted keypoints and descriptors. The overall framework combines the reliable features in deeper layer and the precise position information in shallower layers to achieve robust and precise registration. We present a correspondence network to generate correct and accurate keypoints correspondences. Moreover, bilateral consensus and neighborhood consensus are introduced for keypoints matching and novel similarity features are designed to incorporate them into the correspondence network, which significantly improves the registration performance. Besides, the whole network is also highly efficient since only a small number of keypoints are used for registration. Extensive experiments are conducted on two large-scale outdoor LiDAR point cloud datasets to demonstrate the high accuracy and efficiency of the proposed HRegNet. The project website is https://ispc-group.github.io/hregnet.
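在得到关键点对应及其置信度之后,基于对应点的配准通常用加权 SVD(Kabsch)闭式求解刚体变换;下面给出这一步的最小 Python 示意(非 HRegNet 官方实现,层次特征提取与对应网络此处不涉及):
```python
import numpy as np

def weighted_kabsch(src, dst, w):
    """src, dst: (N, 3) 对应关键点;w: (N,) 非负置信度。返回旋转 R (3x3) 与平移 t (3,)。"""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    S = (w[:, None] * (src - mu_s)).T @ (dst - mu_d)          # 3x3 加权互协方差
    U, _, Vt = np.linalg.svd(S)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                    # 防止出现反射
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# 构造一个已知变换做自检
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
dst = src @ R_true.T + np.array([0.5, -0.2, 1.0])
R_est, t_est = weighted_kabsch(src, dst, np.ones(100))
print(np.allclose(R_est, R_true, atol=1e-5), t_est)
```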
多模态(1篇)
【1】 Two Headed Dragons: Multimodal Fusion and Cross Modal Transactions 标题:双头龙:多模式融合和跨模式交易
作者:Rupak Bose,Shivam Pande,Biplab Banerjee 机构:Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay, India 备注:Accepted in IEEE International conference on Image Processing (ICIP), 2021 链接:https://arxiv.org/abs/2107.11585 摘要:随着遥感领域的不断发展,我们见证了多光谱(MS)、高光谱(HSI)、激光雷达等多种模式的信息积累,每种模式都有其独特的特点,当它们协同工作时,在识别和分类任务中表现得非常好。然而,由于各领域高度不同,在遥感中融合多种模式很麻烦。此外,现有的方法不利于跨模态相互作用。为此,我们提出了一种新的基于Transformer的HSI和LiDAR融合方法。该模型由堆叠式自动编码器组成,利用HSI和激光雷达的交叉键值对,从而在两种模式之间建立通信,同时使用CNN从HSI和激光雷达提取光谱和空间信息。我们在休斯顿(2013年数据融合大赛)和MUUFL-Gulfport数据集上测试了我们的模型,并取得了有竞争力的结果。 摘要:As the field of remote sensing is evolving, we witness the accumulation of information from several modalities, such as multispectral (MS), hyperspectral (HSI), LiDAR etc. Each of these modalities possess its own distinct characteristics and when combined synergistically, perform very well in the recognition and classification tasks. However, fusing multiple modalities in remote sensing is cumbersome due to highly disparate domains. Furthermore, the existing methods do not facilitate cross-modal interactions. To this end, we propose a novel transformer based fusion method for HSI and LiDAR modalities. The model is composed of stacked auto encoders that harness the cross key-value pairs for HSI and LiDAR, thus establishing a communication between the two modalities, while simultaneously using the CNNs to extract the spectral and spatial information from HSI and LiDAR. We test our model on Houston (Data Fusion Contest - 2013) and MUUFL Gulfport datasets and achieve competitive results.
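下面用一个最小的 Python 示意说明“一种模态提供 query、另一种模态提供 key/value”的交叉注意力通信方式(非官方实现,维度与头数为假设值):
```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_query, x_kv):
        # x_query 来自一种模态(如 HSI token),x_kv 来自另一种模态(如 LiDAR token)
        out, _ = self.attn(x_query, x_kv, x_kv)
        return out

hsi_tokens = torch.randn(8, 49, 64)      # (batch, tokens, dim)
lidar_tokens = torch.randn(8, 49, 64)
fusion = CrossModalAttention()
hsi_enriched = fusion(hsi_tokens, lidar_tokens)     # HSI 查询 LiDAR
lidar_enriched = fusion(lidar_tokens, hsi_tokens)   # LiDAR 查询 HSI
print(hsi_enriched.shape, lidar_enriched.shape)
```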
其他神经网络|深度学习|模型|建模(11篇)
【1】 In Defense of the Learning Without Forgetting for Task Incremental Learning 标题:为任务增量学习的学习不忘辩护
作者:Guy Oren,Lior Wolf 机构:Tel-Aviv University 备注:12 pages with 4 figures 链接:https://arxiv.org/abs/2107.12304 摘要:灾难性遗忘是持续学习系统面临的主要挑战之一,它伴随着一系列的在线任务。这一领域引起了相当大的兴趣,并提出了一套不同的方法来克服这一挑战。不遗忘学习(LwF)是最早也是最常被引用的学习方法之一。它的优点是不需要存储以前任务中的样本,实现简单,并且依赖于知识提炼而具有良好的基础。然而,普遍的观点是,当只引入两个任务时,它显示出相对少量的遗忘,但它不能扩展到长序列的任务。本文对这一观点提出了质疑,通过使用正确的体系结构和标准的扩充集,LwF得到的结果超过了任务增量场景的最新算法。在CIFAR-100和Tiny-ImageNet上进行的大量实验证明了这种改进的性能,同时也表明其他方法不能从类似的改进中获益。 摘要:Catastrophic forgetting is one of the major challenges on the road for continual learning systems, which are presented with an on-line stream of tasks. The field has attracted considerable interest and a diverse set of methods have been presented for overcoming this challenge. Learning without Forgetting (LwF) is one of the earliest and most frequently cited methods. It has the advantages of not requiring the storage of samples from the previous tasks, of implementation simplicity, and of being well-grounded by relying on knowledge distillation. However, the prevailing view is that while it shows a relatively small amount of forgetting when only two tasks are introduced, it fails to scale to long sequences of tasks. This paper challenges this view, by showing that using the right architecture along with a standard set of augmentations, the results obtained by LwF surpass the latest algorithms for task incremental scenario. This improved performance is demonstrated by an extensive set of experiments over CIFAR-100 and Tiny-ImageNet, where it is also shown that other methods cannot benefit as much from similar improvements.
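LwF 的核心是知识蒸馏:训练新任务时,用旧模型在旧任务头上的输出作为软目标约束新模型,从而无需保存旧任务样本。下面是该损失的一个最小 Python 示意(非论文官方实现,温度与权重为假设值):
```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits_new_head, labels, new_logits_old_head, old_logits_old_head,
             T=2.0, lambda_old=1.0):
    # 新任务:普通交叉熵
    loss_new = F.cross_entropy(new_logits_new_head, labels)
    # 旧任务:以旧模型输出为软目标做温度蒸馏(KL 散度)
    p_old = F.softmax(old_logits_old_head / T, dim=1)
    log_p_new = F.log_softmax(new_logits_old_head / T, dim=1)
    loss_old = F.kl_div(log_p_new, p_old, reduction='batchmean') * (T * T)
    return loss_new + lambda_old * loss_old

loss = lwf_loss(torch.randn(16, 10), torch.randint(0, 10, (16,)),
                torch.randn(16, 20), torch.randn(16, 20))
print(float(loss))
```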
【2】 Thought Flow Nets: From Single Predictions to Trains of Model Thought 标题:思维流网络:从单一预测到模型思维序列
作者:Hendrik Schuff,Heike Adel,Ngoc Thang Vu 机构: Bosch Center for Artificial Intelligence, Renningen, Germany, Institut für Maschinelle Sprachverarbeitung, University of Stuttgart 链接:https://arxiv.org/abs/2107.12220 摘要:当人类解决复杂的问题时,很少能马上做出决定。相反,他们从一个直观的决定开始,反思它,发现错误,解决矛盾,在不同的假设之间跳跃。因此,他们创造了一系列的想法,并遵循一系列的思路,最终得出结论性的决定。与此相反,今天的神经分类模型大多被训练为将输入映射到单一且固定的输出。在本文中,我们将探讨如何给予模型第二次、第三次和第k次思考的机会。我们从黑格尔的辩证法中得到启发,提出了一种将现有分类器的类预测(如图像类forest)转化为一系列预测(如forest → tree → mushroom)的方法。具体地说,我们提出了一个校正模块,用来估计模型的正确性,以及一个基于预测梯度的迭代预测更新。我们的方法在类概率分布上产生一个动态系统,即思维流。我们从计算机视觉和自然语言处理的不同数据集和任务来评估我们的方法。我们观察到令人惊讶的复杂但直观的行为,并证明我们的方法(i)可以纠正错误分类,(ii)增强模型性能,(iii)对高水平的敌对攻击具有鲁棒性,(iv)在标签分布偏移设置中可将精确度提高高达4%,(v)提供了一种模型解释性工具,该工具可揭示在单个分布预测中不可见的模型知识。 摘要:When humans solve complex problems, they rarely come up with a decision right-away. Instead, they start with an intuitive decision, reflect upon it, spot mistakes, resolve contradictions and jump between different hypotheses. Thus, they create a sequence of ideas and follow a train of thought that ultimately reaches a conclusive decision. Contrary to this, today's neural classification models are mostly trained to map an input to one single and fixed output. In this paper, we investigate how we can give models the opportunity of a second, third and k-th thought. We take inspiration from Hegel's dialectics and propose a method that turns an existing classifier's class prediction (such as the image class forest) into a sequence of predictions (such as forest → tree → mushroom). Concretely, we propose a correction module that is trained to estimate the model's correctness as well as an iterative prediction update based on the prediction's gradient. Our approach results in a dynamic system over class probability distributions, the thought flow. We evaluate our method on diverse datasets and tasks from computer vision and natural language processing. We observe surprisingly complex but intuitive behavior and demonstrate that our method (i) can correct misclassifications, (ii) strengthens model performance, (iii) is robust to high levels of adversarial attacks, (iv) can increase accuracy up to 4% in a label-distribution-shift setting and (v) provides a tool for model interpretability that uncovers model knowledge which otherwise remains invisible in a single distribution prediction.
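下面给出“校正模块 + 沿梯度迭代更新预测分布”这一思路的最小 Python 示意(非论文实现,各模块均为随机初始化的占位网络,仅演示迭代形式):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, steps, alpha = 10, 3, 0.5
classifier = nn.Linear(128, num_classes)
correction = nn.Sequential(nn.Linear(num_classes, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.randn(4, 128)
logits = classifier(x)
flow = [F.softmax(logits, dim=1)]                 # 思维流的起点:初始预测分布
for _ in range(steps):
    p = flow[-1].detach().requires_grad_(True)
    correctness = correction(p).sum()             # 校正模块估计当前预测的正确性
    grad = torch.autograd.grad(correctness, p)[0]
    # 沿"正确性"对预测的梯度更新分布,并重新归一化,得到下一步预测
    p_next = F.softmax(torch.log(p + 1e-8) + alpha * grad, dim=1)
    flow.append(p_next.detach())

for t, p in enumerate(flow):
    print(t, p[0].argmax().item())
```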
【3】 A Multiple-Instance Learning Approach for the Assessment of Gallbladder Vascularity from Laparoscopic Images 标题:多示例学习方法在腹腔镜图像胆囊血管评价中的应用
作者:C. Loukas,A. Gazis,D. Schizas 机构:Laboratory of Medical Physics, Medical School, National and Kapodistrian University of Athens, Athens 备注:6 pages, 5 tables, 2 figures 链接:https://arxiv.org/abs/2107.12093 摘要:腹腔镜胆囊切除术(LC)开始时的一项重要任务是检查胆囊(GB),以评估胆囊壁的厚度、炎症的存在和脂肪的程度。胆囊壁血管难以显示可能源于上述因素,可能是慢性炎症或其他疾病的结果。在本文中,我们提出了一种多实例学习(MIL)技术,通过计算机视觉分析LC手术图像来评估胆囊壁血管。其中的包(bag)对应于来自53例手术的181张胆囊图像构成的标记(低与高)血管化数据集。实例对应于从这些图像中提取的未标记图像块。每个图像块由一个包含颜色、纹理和统计特征的向量表示。我们比较了各种最新的MIL和单实例学习方法,以及一种基于变分贝叶斯推理的MIL技术。我们在基于图像和基于视频(即基于患者)的分类两个实验任务中对各方法进行了比较。所提方法性能最佳,在第一个和第二个任务上的准确率分别为92.1%和90.3%。该方法的一个显著优点是不需要手动标记实例的耗时任务。 摘要:An important task at the onset of a laparoscopic cholecystectomy (LC) operation is the inspection of gallbladder (GB) to evaluate the thickness of its wall, presence of inflammation and extent of fat. Difficulty in visualization of the GB wall vessels may be due to the previous factors, potentially as a result of chronic inflammation or other diseases. In this paper we propose a multiple-instance learning (MIL) technique for assessment of the GB wall vascularity via computer-vision analysis of images from LC operations. The bags correspond to a labeled (low vs. high) vascularity dataset of 181 GB images, from 53 operations. The instances correspond to unlabeled patches extracted from these images. Each patch is represented by a vector with color, texture and statistical features. We compare various state-of-the-art MIL and single-instance learning approaches, as well as a proposed MIL technique based on variational Bayesian inference. The methods were compared for two experimental tasks: image-based and video-based (i.e. patient-based) classification. The proposed approach presents the best performance with accuracy 92.1% and 90.3% for the first and second task, respectively. A significant advantage of the proposed technique is that it does not require the time-consuming task of manual labelling the instances.
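下面用一个最小的 Python 示意说明多实例学习(MIL)的基本结构:对包内各实例打分后做池化得到包级预测(非论文实现;论文采用的是基于变分贝叶斯推断的 MIL,此处仅演示一般形式):
```python
import torch
import torch.nn as nn

class SimpleMIL(nn.Module):
    def __init__(self, feat_dim=32, pooling='mean'):
        super().__init__()
        self.instance_scorer = nn.Linear(feat_dim, 1)    # 实例级打分器
        self.pooling = pooling

    def forward(self, bag):                               # bag: (num_instances, feat_dim)
        scores = self.instance_scorer(bag).squeeze(-1)    # 每个图像块一个分数
        pooled = scores.mean() if self.pooling == 'mean' else scores.max()
        return torch.sigmoid(pooled)                      # 包级(图像级)"高血管化"概率

model = SimpleMIL()
bag_of_patches = torch.randn(40, 32)     # 一幅胆囊图像提取出的 40 个图像块特征
print(float(model(bag_of_patches)))
```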
【4】 Parametric Contrastive Learning 标题:参数对比学习
作者:Jiequan Cui,Zhisheng Zhong,Shu Liu,Bei Yu,Jiaya Jia 机构:The Chinese University of Hong Kong, SmartMore 备注:ICCV 2021 链接:https://arxiv.org/abs/2107.12028 摘要:本文提出参数对比学习(PaCo)来解决长尾识别问题。在理论分析的基础上,我们发现有监督对比损失倾向于偏向高频类别,从而增加了不平衡学习的难度。我们引入一组参数化的、按类别可学习的中心,从优化的角度重新平衡。进一步,我们分析了在均衡条件下的PaCo损失。我们的分析表明,当更多的样本与其对应的中心被拉到一起时,PaCo可以自适应地增强将同一类样本推近的强度,有利于难例学习。在长尾CIFAR、ImageNet、Places和iNaturalist 2018上的实验取得了长尾识别的最新最优结果。在full ImageNet上,用PaCo损失训练的模型在不同ResNet主干上都优于有监督的对比学习。我们的代码可从 https://github.com/jiequancui/Parametric-Contrastive-Learning 获得。 摘要:In this paper, we propose Parametric Contrastive Learning (PaCo) to tackle long-tailed recognition. Based on theoretical analysis, we observe supervised contrastive loss tends to bias on high-frequency classes and thus increases the difficulty of imbalance learning. We introduce a set of parametric class-wise learnable centers to rebalance from an optimization perspective. Further, we analyze our PaCo loss under a balanced setting. Our analysis demonstrates that PaCo can adaptively enhance the intensity of pushing samples of the same class close as more samples are pulled together with their corresponding centers and benefit hard example learning. Experiments on long-tailed CIFAR, ImageNet, Places, and iNaturalist 2018 manifest the new state-of-the-art for long-tailed recognition. On full ImageNet, models trained with PaCo loss surpass supervised contrastive learning across various ResNet backbones. Our code is available at https://github.com/jiequancui/Parametric-Contrastive-Learning.
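下面给出“在对比集合中加入一组可学习类中心”这一思想的简化 Python 示意(非官方实现,与论文的具体损失形式并不完全一致,超参为假设值):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def paco_style_loss(features, labels, centers, temperature=0.1):
    """features: (N, D);labels: (N,);centers: (C, D) 可学习类中心。
    每个锚点的正样本 = 批内同类样本 + 其自身类别的中心。"""
    z = F.normalize(features, dim=-1)
    c = F.normalize(centers, dim=-1)
    n, num_classes = z.size(0), c.size(0)
    keys = torch.cat([z, c], dim=0)                        # 对比集合 = 批内样本 + 类中心
    sim = z @ keys.t() / temperature                       # (N, N+C)
    self_mask = torch.zeros_like(sim, dtype=torch.bool)
    self_mask[:, :n] = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float('-inf'))        # 排除自身
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    key_labels = torch.cat([labels, torch.arange(num_classes)], dim=0)
    pos_mask = (labels.unsqueeze(1) == key_labels.unsqueeze(0)) & ~self_mask
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

centers = nn.Parameter(torch.randn(10, 128))
loss = paco_style_loss(torch.randn(32, 128), torch.randint(0, 10, (32,)), centers)
loss.backward()
print(float(loss), centers.grad.shape)
```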
【5】 Log-Polar Space Convolution for Convolutional Neural Networks 标题:卷积神经网络的对数极空间卷积
作者:Bing Su,Ji-Rong Wen 机构:Beijing Key Laboratory of Big Data Management and Analysis Methods, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing , China 链接:https://arxiv.org/abs/2107.11943 摘要:卷积神经网络采用规则的四边形卷积核来提取特征。由于参数的数目随卷积核的大小二次增加,许多流行的模型使用较小的卷积核,导致较低层的局部感受野较小。提出了一种新的对数极空间卷积(LPSC)方法,该方法以椭圆为卷积核,根据相对方向和对数距离自适应地将局部感受野划分为不同的区域。局部感受野随着距离水平的增加呈指数增长。因此,所提出的LPSC不仅能自然地编码局部空间结构,而且在保持参数个数的前提下,大大增加了单层感受野。我们证明了LPSC可以通过对数极性空间池实现,并且可以应用于任何网络结构来代替传统的卷积。在不同的任务和数据集上的实验证明了该算法的有效性。代码位于https://github.com/BingSu12/Log-Polar-Space-Convolution. 摘要:Convolutional neural networks use regular quadrilateral convolution kernels to extract features. Since the number of parameters increases quadratically with the size of the convolution kernel, many popular models use small convolution kernels, resulting in small local receptive fields in lower layers. This paper proposes a novel log-polar space convolution (LPSC) method, where the convolution kernel is elliptical and adaptively divides its local receptive field into different regions according to the relative directions and logarithmic distances. The local receptive field grows exponentially with the number of distance levels. Therefore, the proposed LPSC not only naturally encodes local spatial structures, but also greatly increases the single-layer receptive field while maintaining the number of parameters. We show that LPSC can be implemented with conventional convolution via log-polar space pooling and can be applied in any network architecture to replace conventional convolutions. Experiments on different tasks and datasets demonstrate the effectiveness of the proposed LPSC. Code is available at https://github.com/BingSu12/Log-Polar-Space-Convolution.
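下面用一个最小的 Python 示意说明“按方向 + 对数距离把局部邻域划分为若干区域并做池化”的对数极坐标分区(非官方实现,分区数量等超参为假设值):
```python
import numpy as np

def logpolar_bins(window=9, num_angles=8, num_levels=3):
    """返回 (window, window) 的整数 bin 索引图,中心像素单独占一个 bin(索引 0)。"""
    r = window // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    dist = np.sqrt(xs ** 2 + ys ** 2)
    angle = np.arctan2(ys, xs)                                   # [-pi, pi]
    angle_bin = ((angle + np.pi) / (2 * np.pi) * num_angles).astype(int) % num_angles
    # 对数距离分级:距离按指数增长划分,感受野随级数指数扩大
    level = np.clip(np.floor(np.log2(np.maximum(dist, 1.0))), 0, num_levels - 1).astype(int)
    bins = 1 + angle_bin * num_levels + level                    # 非中心像素的 bin
    bins[r, r] = 0                                               # 中心像素
    return bins

def logpolar_pool(patch, bins):
    """patch: (window, window) 单通道局部邻域;对每个 bin 做平均池化。"""
    num_bins = bins.max() + 1
    return np.array([patch[bins == b].mean() if (bins == b).any() else 0.0
                     for b in range(num_bins)])

bins = logpolar_bins()
print(bins)
print(logpolar_pool(np.random.rand(9, 9), bins).shape)
```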
【6】 Character Spotting Using Machine Learning Techniques 标题:基于机器学习技术的字符定位
作者:P Preethi,Hrishikesh Viswanath 机构:Department of Computer Science and Engineering, PES University 链接:https://arxiv.org/abs/2107.11795 摘要:这项工作提出了一个机器学习算法的比较,实现了分割字符的文本作为一个图像。这些算法设计用于处理文本未按组织方式对齐的降级文档。研究了利用支持向量机、K近邻算法和编码网络进行字符定位的方法。字符定位是通过选择空白区域从文本流中提取潜在字符。 摘要:This work presents a comparison of machine learning algorithms that are implemented to segment the characters of text presented as an image. The algorithms are designed to work on degraded documents with text that is not aligned in an organized fashion. The paper investigates the use of Support Vector Machines, K-Nearest Neighbor algorithm and an Encoder Network to perform the operation of character spotting. Character Spotting involves extracting potential characters from a stream of text by selecting regions bound by white space.
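下面给出“按空白区域从文本行中提取候选字符”的一个最小 Python 示意(非论文实现,阈值为假设值);得到的候选区域随后可送入 SVM、KNN 或编码网络判别:
```python
import numpy as np

def spot_characters(binary_line, min_width=2):
    """binary_line: (H, W),前景(墨迹)为 1,背景为 0。返回候选字符的 (start_col, end_col) 列表。"""
    col_has_ink = binary_line.sum(axis=0) > 0
    regions, start = [], None
    for x, ink in enumerate(col_has_ink):
        if ink and start is None:
            start = x                                    # 进入一个非空白列段
        elif not ink and start is not None:
            if x - start >= min_width:
                regions.append((start, x))               # 一个候选字符区域结束
            start = None
    if start is not None and len(col_has_ink) - start >= min_width:
        regions.append((start, len(col_has_ink)))
    return regions

# 构造一行带有三个"字符"的玩具图像
line = np.zeros((20, 60), dtype=int)
line[5:15, 3:10] = 1
line[5:15, 20:28] = 1
line[5:15, 40:50] = 1
print(spot_characters(line))    # 预期得到三个列区间
```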
【7】 Hand Image Understanding via Deep Multi-Task Learning 标题:基于深度多任务学习的手部图像理解
作者:Zhang Xiong,Huang Hongsheng,Tan Jianchao,Xu Hongmin,Yang Cheng,Peng Guozhu,Wang Lei,Liu Ji 备注:Accepted By ICCV 2021 链接:https://arxiv.org/abs/2107.11646 摘要:分析和理解来自多媒体材料(如图像或视频)的手部信息对于许多实际应用是非常重要的,并且在研究领域仍然很活跃。目前有许多工作致力于从单幅图像中恢复手部信息,但通常只解决一个任务,例如手部掩模分割、二维/三维手部姿势估计或手部网格重建,在具有挑战性的场景中表现不佳。为了进一步提高这些任务的性能,我们提出了一种新的手部图像理解(HIU)框架,通过综合考虑这些任务之间的关系,从单个RGB图像中提取手部目标的综合信息。为了实现这一目标,我们设计了一个级联多任务学习(MTL)主干来估计二维热图,学习分割模板,生成中间的三维信息编码,然后采用从粗到精的学习模式和自监督学习策略。定性实验表明,该方法即使在有挑战性的情况下也能恢复合理的网格表示。从数量上讲,我们的方法在各种广泛使用的数据集上,在不同的评估指标方面显著优于最新的方法。 摘要:Analyzing and understanding hand information from multimedia materials like images or videos is important for many real world applications and remains active in research community. There are various works focusing on recovering hand information from single image, however, they usually solve a single task, for example, hand mask segmentation, 2D/3D hand pose estimation, or hand mesh reconstruction and perform not well in challenging scenarios. To further improve the performance of these tasks, we propose a novel Hand Image Understanding (HIU) framework to extract comprehensive information of the hand object from a single RGB image, by jointly considering the relationships between these tasks. To achieve this goal, a cascaded multi-task learning (MTL) backbone is designed to estimate the 2D heat maps, to learn the segmentation mask, and to generate the intermediate 3D information encoding, followed by a coarse-to-fine learning paradigm and a self-supervised learning strategy. Qualitative experiments demonstrate that our approach is capable of recovering reasonable mesh representations even in challenging situations. Quantitatively, our method significantly outperforms the state-of-the-art approaches on various widely-used datasets, in terms of diverse evaluation metrics.
【8】 Using a Cross-Task Grid of Linear Probes to Interpret CNN Model Predictions On Retinal Images 标题:使用线性探针的跨任务网格解释视网膜图像上的CNN模型预测
作者:Katy Blumer,Subhashini Venugopalan,Michael P. Brenner,Jon Kleinberg 机构:Cornell University, Google Research, Harvard University 备注:Extended abstract at Interpretable Machine Learning in Healthcare (IMLH) workshop at ICML 2021 链接:https://arxiv.org/abs/2107.11468 摘要:我们使用线性探针分析视网膜图像数据集:以在某个“源”任务上训练的深度卷积(CNN)模型的嵌入作为输入,训练针对某个“目标”任务的线性回归模型。我们在英国生物库(UK Biobank)视网膜图像数据集中93个任务的所有可能配对上使用这种方法,得到了约164k个不同的模型。我们按源任务、目标任务以及层深度来分析这些线性探针的性能。我们观察到来自网络中间层的表示更具泛化性。我们发现,不管源任务是什么,一些目标任务都很容易预测;而另一些目标任务用相关源任务的嵌入来预测,比用在同一任务上训练的嵌入来预测更准确。 摘要:We analyze a dataset of retinal images using linear probes: linear regression models trained on some "target" task, using embeddings from a deep convolutional (CNN) model trained on some "source" task as input. We use this method across all possible pairings of 93 tasks in the UK Biobank dataset of retinal images, leading to ~164k different models. We analyze the performance of these linear probes by source and target task and by layer depth. We observe that representations from the middle layers of the network are more generalizable. We find that some target tasks are easily predicted irrespective of the source task, and that some other target tasks are more accurately predicted from correlated source tasks than from embeddings trained on the same task.
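线性探针的做法可以用几行代码说明:固定源任务模型的嵌入,在其上训练一个线性(岭)回归去预测目标任务。下面是一个用随机数据代替真实嵌入的最小 Python 示意(非论文实现):
```python
import numpy as np

def linear_probe(embeddings, targets, l2=1.0):
    """岭回归闭式解:w = (X^T X + lambda*I)^{-1} X^T y。embeddings: (N, D), targets: (N,)。"""
    X = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])   # 加偏置列
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(D), X.T @ targets)

rng = np.random.default_rng(0)
source_embeddings = rng.normal(size=(1000, 64))       # 某一层的"源任务"嵌入(此处为随机数据)
target_values = source_embeddings[:, 0] * 2.0 + rng.normal(scale=0.1, size=1000)
w = linear_probe(source_embeddings, target_values)
pred = np.hstack([source_embeddings, np.ones((1000, 1))]) @ w
print("R^2 ~", 1 - np.var(target_values - pred) / np.var(target_values))
```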
【9】 Compressing Neural Networks: Towards Determining the Optimal Layer-wise Decomposition 标题:压缩神经网络:朝着确定最佳分层分解的方向发展
作者:Lucas Liebenwein,Alaa Maalouf,Oren Gal,Dan Feldman,Daniela Rus 机构:MIT CSAIL, University of Haifa 链接:https://arxiv.org/abs/2107.11442 摘要:我们提出了一种新的深度神经网络全局压缩框架,该框架自动分析每一层,以确定最佳的每一层压缩比,同时实现所需的整体压缩。我们的算法依赖于压缩每个卷积(或完全连接)层的思想,通过将其信道分为多个组,并通过低秩分解对每个组进行分解。该算法的核心是从Eckart-Young-Mirsky定理推导出分层误差界。然后,我们利用这些边界将压缩问题框架化为一个优化问题,在这个优化问题中,我们希望最小化跨层的最大压缩错误,并提出一个有效的算法来解决这个问题。我们的实验表明,我们的方法优于现有的低秩压缩方法在广泛的网络和数据集。我们相信,我们的研究结果为将来研究现代神经网络的全局性能-规模权衡开辟了新的途径。我们的代码在https://github.com/lucaslie/torchprune. 摘要:We present a novel global compression framework for deep neural networks that automatically analyzes each layer to identify the optimal per-layer compression ratio, while simultaneously achieving the desired overall compression. Our algorithm hinges on the idea of compressing each convolutional (or fully-connected) layer by slicing its channels into multiple groups and decomposing each group via low-rank decomposition. At the core of our algorithm is the derivation of layer-wise error bounds from the Eckart Young Mirsky theorem. We then leverage these bounds to frame the compression problem as an optimization problem where we wish to minimize the maximum compression error across layers and propose an efficient algorithm towards a solution. Our experiments indicate that our method outperforms existing low-rank compression approaches across a wide range of networks and data sets. We believe that our results open up new avenues for future research into the global performance-size trade-offs of modern neural networks. Our code is available at https://github.com/lucaslie/torchprune.
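下面用一个最小的 Python 示意说明“按通道分组 + 每组做截断 SVD 低秩分解”的压缩方式(非论文完整实现,分组数与保留秩为假设值);Eckart-Young-Mirsky 定理保证截断 SVD 是给定秩下 Frobenius 误差最小的近似,这也是论文分层误差界的来源:
```python
import numpy as np

def lowrank_compress(W, num_groups=2, rank=8):
    """W: (out_channels, in_features)。按输出通道分组,每组做秩-rank 近似。
    返回近似矩阵及相对 Frobenius 误差。"""
    groups = np.array_split(W, num_groups, axis=0)
    approx = []
    for G in groups:
        U, S, Vt = np.linalg.svd(G, full_matrices=False)
        k = min(rank, len(S))
        approx.append(U[:, :k] @ np.diag(S[:k]) @ Vt[:k])    # 该组的最优秩-k 近似
    W_hat = np.vstack(approx)
    rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
    return W_hat, rel_err

W = np.random.randn(64, 128)
W_hat, err = lowrank_compress(W)
print("relative Frobenius error:", round(float(err), 4))
```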
【10】 Robust Explainability: A Tutorial on Gradient-Based Attribution Methods for Deep Neural Networks 标题:稳健可解释性:深度神经网络基于梯度的属性方法教程
作者:Ian E. Nielsen,Ghulam Rasool,Dimah Dera,Nidhal Bouaynaya,Ravi P. Ramachandran 机构: Rowan University, The University of Texas Rio Grande Valley 备注:21 pages, 3 figures 链接:https://arxiv.org/abs/2107.11400 摘要:随着深层神经网络的兴起,人们越来越认识到解释这些网络预测的挑战。虽然有许多方法可以解释深层神经网络的决策,但目前对于如何评价它们还没有共识。另一方面,稳健性是深度学习研究的热门话题;然而,直到最近才有人在可解释性方面谈论它。在本教程中,我们首先介绍基于梯度的可解释性方法。这些技术使用梯度信号来分配输入特征的决策负担。之后,我们将讨论如何评估基于梯度的方法的稳健性,以及对抗性稳健性在有意义的解释中所起的作用。我们还讨论了基于梯度的方法的局限性。最后,我们给出了在选择可解释性方法之前应该检查的最佳实践和属性。最后,我们在稳健性和可解释性的收敛性方面提出了该领域未来的研究方向。 摘要:With the rise of deep neural networks, the challenge of explaining the predictions of these networks has become increasingly recognized. While many methods for explaining the decisions of deep neural networks exist, there is currently no consensus on how to evaluate them. On the other hand, robustness is a popular topic for deep learning research; however, it is hardly talked about in explainability until very recently. In this tutorial paper, we start by presenting gradient-based interpretability methods. These techniques use gradient signals to assign the burden of the decision on the input features. Later, we discuss how gradient-based methods can be evaluated for their robustness and the role that adversarial robustness plays in having meaningful explanations. We also discuss the limitations of gradient-based methods. Finally, we present the best practices and attributes that should be examined before choosing an explainability method. We conclude with the future directions for research in the area at the convergence of robustness and explainability.
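下面给出两种最基本的基于梯度的归因(朴素梯度与“输入 × 梯度”)的最小 Python 示意,模型用随机初始化的小 CNN 代替:
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
model.eval()

x = torch.rand(1, 3, 32, 32, requires_grad=True)
score = model(x)[0].max()             # 取最大类得分作为归因目标
score.backward()                      # 得分对输入像素求梯度

saliency = x.grad.abs().squeeze(0).max(dim=0).values       # 朴素梯度(saliency map)
input_x_grad = (x * x.grad).sum(dim=1).squeeze(0)           # 输入 × 梯度
print(saliency.shape, input_x_grad.shape)
```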
【11】 Deep Learning-based Frozen Section to FFPE Translation 标题:基于深度学习的冰冻切片到FFPE的翻译
作者:Kutsev Bengisu Ozyoruk,Sermet Can,Guliz Irem Gokceler,Kayhan Basak,Derya Demir,Gurdeniz Serin,Uguray Payam Hacisalihoglu,Berkan Darbaz,Ming Y. Lu,Tiffany Y. Chen,Drew F. K. Williamson,Funda Yilmaz,Faisal Mahmood,Mehmet Turan 机构:Institute of Biomedical Engineering, Bogazici University, Istanbul, Turkey, UK Biocentre, Tilbrook, England, Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 链接:https://arxiv.org/abs/2107.11786 摘要:冷冻切片(FS)是外科手术中组织显微评价的首选制备方法。高速的手术过程使病理学家能够快速评估关键的显微特征,如肿瘤边缘和恶性状态,以指导手术决策,并尽量减少对手术过程的干扰。然而,FS容易引入许多误导性的人工结构(组织学人工制品),如核冰晶、压缩和切割人工制品,阻碍病理学家及时准确的诊断判断。另一方面,福尔马林固定石蜡包埋(FFPE)的金标准组织制备技术提供了显著优越的图像质量,但这是一个非常耗时的过程(12-48小时),不适合术中使用。本文提出了一种人工智能(AI)方法,通过在数分钟内将冻结切片的全玻片图像(FS-WSIs)转换成全玻片FFPE样式的图像来提高FS图像的质量。AI-FFPE在注意机制的指导下纠正FS伪影,特别强调伪影,同时利用FS输入图像和合成FFPE风格图像之间建立的自我调节机制,保留临床相关特征。结果,AI-FFPE方法在不显著延长组织处理时间的情况下成功生成FFPE样式的图像,从而提高了诊断准确率。 摘要:Frozen sectioning (FS) is the preparation method of choice for microscopic evaluation of tissues during surgical operations. The high speed of procedure allows pathologists to rapidly assess the key microscopic features, such as tumor margins and malignant status to guide surgical decision-making and minimise disruptions to the course of the operation. However, FS is prone to introducing many misleading artificial structures (histological artefacts), such as nuclear ice crystals, compression, and cutting artefacts, hindering timely and accurate diagnostic judgement of the pathologist. On the other hand, the gold standard tissue preparation technique of formalin-fixation and paraffin-embedding (FFPE) provides significantly superior image quality, but is a very time-consuming process (12-48 hours), making it unsuitable for intra-operative use. In this paper, we propose an artificial intelligence (AI) method that improves FS image quality by computationally transforming frozen-sectioned whole-slide images (FS-WSIs) into whole-slide FFPE-style images in minutes. AI-FFPE rectifies FS artefacts with the guidance of an attention-mechanism that puts a particular emphasis on artefacts while utilising a self-regularization mechanism established between FS input image and synthesized FFPE-style image that preserves clinically relevant features. As a result, AI-FFPE method successfully generates FFPE-style images without significantly extending tissue processing time and consequently improves diagnostic accuracy.
其他(10篇)
【1】 NeLF: Neural Light-transport Field for Portrait View Synthesis and Relighting 标题:NELF:人像视图合成和重光的神经光传输场
作者:Tiancheng Sun,Kai-En Lin,Sai Bi,Zexiang Xu,Ravi Ramamoorthi 机构:University of California, San Diego,Adobe Research, ∗Equal contribution 备注:Published at EGSR 2021. Project page with video and code: this http URL 链接:https://arxiv.org/abs/2107.12351 摘要:在不同的光照条件下,从不同的视角观察人像时,人像表现出不同的面貌。我们可以很容易地想象在另一种情况下,这张脸会是什么样子,但由于观察有限,计算机算法在这个问题上仍然失败。为此,我们提出了一个用于人像合成和重照明的系统:在给定多幅人像的情况下,我们使用神经网络来预测三维空间中的光传输场,并根据预测的神经光传输场(NeLF)在新的环境光照下从新的相机视图生成人像。我们的系统是在大量的合成模型上训练出来的,可以在不同的光照条件下推广到不同的合成和真实肖像。该方法以多视点人像为输入,实现了多视点人像的同时合成和重照明,取得了很好的效果。 摘要:Human portraits exhibit various appearances when observed from different views under different lighting conditions. We can easily imagine how the face will look like in another setup, but computer algorithms still fail on this problem given limited observations. To this end, we present a system for portrait view synthesis and relighting: given multiple portraits, we use a neural network to predict the light-transport field in 3D space, and from the predicted Neural Light-transport Field (NeLF) produce a portrait from a new camera view under a new environmental lighting. Our system is trained on a large number of synthetic models, and can generalize to different synthetic and real portraits under various lighting conditions. Our method achieves simultaneous view synthesis and relighting given multi-view portraits as the input, and achieves state-of-the-art results.
【2】 Using Synthetic Corruptions to Measure Robustness to Natural Distribution Shifts 标题:利用合成腐败度量对自然分布偏移的稳健性
作者:Alfred Laugros,Alice Caplier,Matthieu Ospici 机构:Atos, France, Universite Grenoble Alpes, France 链接:https://arxiv.org/abs/2107.12052 摘要:合成腐蚀收集到一个基准经常被用来衡量神经网络的鲁棒性分布转移。然而,对合成腐败基准的鲁棒性并不总是预测对现实应用中遇到的分布变化的鲁棒性。在本文中,我们提出了一种方法来建立综合腐败基准,使稳健性估计与对现实世界分布变化的稳健性更相关。使用重叠准则,我们将合成腐蚀分为多个类别,以帮助更好地理解神经网络的鲁棒性。基于这些类别,我们确定了构建腐败基准时需要考虑的三个相关参数:代表类别的数量、类别之间的平衡和基准的大小。应用所提出的方法,我们建立了一个新的基准ImageNet-Syn2Nat来预测图像分类器的鲁棒性。 摘要:Synthetic corruptions gathered into a benchmark are frequently used to measure neural network robustness to distribution shifts. However, robustness to synthetic corruption benchmarks is not always predictive of robustness to distribution shifts encountered in real-world applications. In this paper, we propose a methodology to build synthetic corruption benchmarks that make robustness estimations more correlated with robustness to real-world distribution shifts. Using the overlapping criterion, we split synthetic corruptions into categories that help to better understand neural network robustness. Based on these categories, we identify three parameters that are relevant to take into account when constructing a corruption benchmark: number of represented categories, balance among categories and size of benchmarks. Applying the proposed methodology, we build a new benchmark called ImageNet-Syn2Nat to predict image classifier robustness.
【3】 Synthetic Periocular Iris PAI from a Small Set of Near-Infrared-Images 标题:从少量近红外图像合成眼周虹膜PAI
作者:Jose Maureira,Juan Tapia,Claudia Arellano,Christoph Busch 机构:Member, IEEE. 链接:https://arxiv.org/abs/2107.12014 摘要:近年来,生物特征识别技术的应用越来越广泛,例如可用于门禁控制等多种应用。不幸的是,随着生物特征识别应用部署的增加,我们观察到攻击也在增加。因此,检测此类攻击的算法(呈现攻击检测,PAD)变得越来越重要。以呈现攻击检测(PAD)算法为主题的LivDet-2020竞赛表明,该领域仍存在未解决的问题,尤其是在未知攻击场景下。为了提高生物特征识别系统的鲁棒性,改进PAD方法至关重要。这可以通过增加用于训练此类算法的呈现攻击工具(PAI)和真实图像的数量来实现。不幸的是,呈现攻击工具的捕获和制作,甚至真实图像的采集,有时都很难实现。本文提出了一种新方法,利用四种最先进的GAN算法(cGAN、WGAN、WGAN-GP和StyleGAN2)和少量眼周近红外图像来合成PAI(SPI-PAI)。利用生成图像与训练所用原始图像之间的Frechet Inception距离(FID)对各GAN算法进行了基准比较。我们使用由StyleGAN2算法得到的合成PAI,测试了LivDet-2020竞赛报告的最佳PAD算法。令人惊讶的是,该PAD算法无法将合成图像检测为呈现攻击,而是将其全部归类为真实图像。这一结果证明了合成图像欺骗呈现攻击检测算法的可行性,也说明此类算法需要用更多图像和PAI场景不断更新和训练。 摘要:Biometrics has been increasing in relevance these days since it can be used for several applications such as access control for instance. Unfortunately, with the increased deployment of biometric applications, we observe an increase of attacks. Therefore, algorithms to detect such attacks (Presentation Attack Detection (PAD)) have been increasing in relevance. The LivDet-2020 competition, which focuses on Presentation Attack Detection (PAD) algorithms, has shown that open problems remain, especially for unknown attack scenarios. In order to improve the robustness of biometric systems, it is crucial to improve PAD methods. This can be achieved by augmenting the number of presentation attack instruments (PAI) and bona fide images that are used to train such algorithms. Unfortunately, the capture and creation of presentation attack instruments and even the capture of bona fide images is sometimes complex to achieve. This paper proposes a novel PAI synthetically created (SPI-PAI) using four state-of-the-art GAN algorithms (cGAN, WGAN, WGAN-GP, and StyleGAN2) and a small set of periocular NIR images. A benchmark between GAN algorithms is performed using the Frechet Inception Distance (FID) between the generated images and the original images used for training. The best PAD algorithm reported by the LivDet-2020 competition was tested by us using the synthetic PAI obtained with the StyleGAN2 algorithm. Surprisingly, the PAD algorithm was not able to detect the synthetic images as a presentation attack, categorizing all of them as bona fide. Such results demonstrate the feasibility of synthetic images fooling presentation attack detection algorithms and the need for such algorithms to be constantly updated and trained with a larger number of images and PAI scenarios.
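The FID used above to benchmark the GANs has a standard closed form over Inception features; a minimal sketch is given below, assuming the Inception-v3 pool features for each image set have already been extracted elsewhere.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    # feats_*: (N, D) arrays of Inception features for real / generated images.
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov1 @ cov2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```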
【4】 Improving Robot Localisation by Ignoring Visual Distraction 标题:忽略视觉干扰改善机器人定位
作者:Oscar Mendez,Matthew Vowels,Richard Bowden 机构:All authors with the University of Surrey 备注:2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 链接:https://arxiv.org/abs/2107.11857 摘要:注意力是现代深度学习的重要组成部分。然而,人们对其反面却不够重视:忽略干扰。我们的日常生活要求我们有意避免关注那些会干扰当前任务的显著视觉特征。这种视觉优先级排序使我们能够专注于重要任务,而忽略视觉干扰物。在这项工作中,我们引入了"神经失明"(Neural Blindness),它使智能体能够完全忽略被视为干扰物的对象或类别。更明确地说,我们的目标是使神经网络在其潜在空间中完全无法表示特定选定的类别。从非常切实的意义上说,这使网络对某些类别"视而不见",让智能体能够专注于对给定任务重要的内容;我们还展示了如何利用这一点来改进定位。 摘要:Attention is an important component of modern deep learning. However, less emphasis has been put on its inverse: ignoring distraction. Our daily lives require us to explicitly avoid giving attention to salient visual features that confound the task we are trying to accomplish. This visual prioritisation allows us to concentrate on important tasks while ignoring visual distractors. In this work, we introduce Neural Blindness, which gives an agent the ability to completely ignore objects or classes that are deemed distractors. More explicitly, we aim to render a neural network completely incapable of representing specific chosen classes in its latent space. In a very real sense, this makes the network "blind" to certain classes, allowing an agent to focus on what is important for a given task, and demonstrates how this can be used to improve localisation.
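The abstract does not spell out the training objective, so the following is only one plausible (assumed) way to make chosen classes unrepresentable in the latent space: collapse every "blind"-class sample onto a single constant code so the encoder carries no information about them.

```python
import torch
import torch.nn.functional as F

def blindness_loss(latents, labels, blind_classes):
    # latents: (B, D) encoder outputs; labels: (B,) int class ids;
    # blind_classes: 1-D tensor of class ids that should be ignored.
    mask = torch.isin(labels, blind_classes)
    if not mask.any():
        return latents.new_zeros(())
    # Push all distractor-class codes to an arbitrary constant (here zero),
    # so the latent space cannot distinguish or represent those classes.
    return F.mse_loss(latents[mask], torch.zeros_like(latents[mask]))
```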
【5】 On-Device Content Moderation 标题:设备上的内容审核
作者:Anchal Pandey,Sukumar Moharana,Debi Prasanna Mohanty,Archit Panwar,Dewang Agarwal,Siva Prasad Thota 机构:On Device AI, Samsung R&D Bangalore, Bangalore, India 链接:https://arxiv.org/abs/2107.11845 摘要:随着互联网的普及,不适宜工作场合(NSFW)内容的审核已成为当今的一个主要问题。由于智能手机已成为亿万人日常生活的一部分,拥有一个能够检测并提示用户其手机上潜在NSFW内容的解决方案变得更加重要。本文提出了一种新颖的设备端NSFW图像检测方案。除了传统的色情内容审核外,我们还纳入了半裸内容审核,因为对很大一部分人群而言这类内容仍属NSFW。我们构建了一个包含三大类的数据集,即裸体、半裸体和安全图像,并构建了由目标检测器和分类器组成的集成模型,用于过滤裸体和半裸体内容。该方案在识别半裸体图像的同时,还提供不安全身体部位的标注。我们在多个公共数据集以及自建数据集上对所提方案进行了广泛测试。该模型在自建的NSFW16k数据集上取得0.91的F1分数(精确率95%,召回率88%),在NPDI数据集上取得0.92的mAP;此外,在多个安全图像开放数据集上,其平均误报率为0.002。 摘要:With the advent of the internet, not-safe-for-work (NSFW) content moderation is a major problem today. Since smartphones are now part of the daily life of billions of people, it becomes even more important to have a solution which could detect and notify the user about potential NSFW content present on their phone. In this paper we present a novel on-device solution for detecting NSFW images. In addition to conventional pornographic content moderation, we have also included semi-nude content moderation as it is still NSFW in a large demography. We have curated a dataset comprising three major categories, namely nude, semi-nude and safe images. We have created an ensemble of an object detector and a classifier for filtering of nude and semi-nude contents. The solution provides unsafe body part annotations along with identification of semi-nude images. We extensively tested our proposed solution on several public datasets and also on our custom dataset. The model achieves an F1 score of 0.91 with 95% precision and 88% recall on our custom NSFW16k dataset and 0.92 mAP on the NPDI dataset. Moreover, it achieves an average 0.002 false positive rate on a collection of safe image open datasets.
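A sketch of the ensemble gating logic implied by the abstract is shown below; the model interfaces, label names and thresholds are placeholders for illustration, not the authors' implementation.

```python
def moderate_image(image, classifier, detector, cls_thresh=0.5, det_thresh=0.5):
    # classifier(image) -> (label, score), e.g. ('semi_nude', 0.8); hypothetical API.
    # detector(image)   -> list of boxes with .label and .score for unsafe body parts.
    label, score = classifier(image)
    unsafe_regions = [b for b in detector(image) if b.score >= det_thresh]
    unsafe = (label in {"nude", "semi_nude"} and score >= cls_thresh) or bool(unsafe_regions)
    return {"unsafe": unsafe, "label": label, "unsafe_regions": unsafe_regions}
```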
【6】 Distributional Shifts in Automated Diabetic Retinopathy Screening 标题:糖尿病视网膜病变自动筛查中的分布偏移
作者:Jay Nandy,Wynne Hsu,Mong Li Lee 机构:School of Computing, National University of Singapore, Institute of Data Science, National University of Singapore 备注:Accepted at IEEE ICIP 2021 链接:https://arxiv.org/abs/2107.11822 摘要:在糖尿病视网膜病变(DR)筛查中,基于深度学习的模型可自动检测视网膜图像是否"可转诊"。然而,当输入图像的分布偏离训练分布时,其分类精度会下降。此外,即使输入的根本不是视网膜图像,标准的DR分类器也会以很高的置信度预测该图像"可转诊"。本文提出了一个基于Dirichlet先验网络的框架来解决这一问题。该框架利用一个分布外(OOD)检测模型和一个DR分类模型,通过识别OOD图像来提高泛化能力。在真实数据集上的实验表明,该框架能够剔除未知的非视网膜图像,并识别出发生分布偏移的视网膜图像以供人工干预。 摘要:Deep learning-based models are developed to automatically detect if a retina image is `referable' in diabetic retinopathy (DR) screening. However, their classification accuracy degrades as the input images distributionally shift from their training distribution. Further, even if the input is not a retina image, a standard DR classifier produces a highly confident prediction that the image is `referable'. Our paper presents a Dirichlet Prior Network-based framework to address this issue. It utilizes an out-of-distribution (OOD) detector model and a DR classification model to improve generalizability by identifying OOD images. Experiments on real-world datasets indicate that the proposed framework can eliminate the unknown non-retina images and identify the distributionally shifted retina images for human intervention.
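For intuition, a Dirichlet Prior Network outputs concentration parameters rather than a single softmax, and images can be flagged as OOD when the distributional uncertainty is high. The sketch below assumes an exp-of-logits parameterisation and a mutual-information score, which is one common scoring rule; the paper's exact rule may differ.

```python
import torch

def dirichlet_ood_score(logits):
    # Treat exp(logits) as Dirichlet concentrations alpha and compute the
    # mutual information between the label and the class distribution:
    # high values indicate distributional (out-of-distribution) uncertainty.
    alpha = logits.exp()
    alpha0 = alpha.sum(-1, keepdim=True)
    p = alpha / alpha0                                    # expected class probabilities
    total_unc = -(p * p.clamp_min(1e-12).log()).sum(-1)   # H[E[p]]
    exp_data_unc = -(p * (torch.digamma(alpha + 1.0) -
                          torch.digamma(alpha0 + 1.0))).sum(-1)  # E[H[p]]
    return p, total_unc - exp_data_unc
```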
【7】 Go Wider Instead of Deeper 标题:变宽而非变深
作者:Fuzhao Xue,Ziji Shi,Yuxuan Lou,Yong Liu,Yang You 机构:Department of Computer Science, National University of Singapore, Singapore 链接:https://arxiv.org/abs/2107.11817 摘要:Transformer最近在各种任务上取得了令人印象深刻的成果。为了进一步提高Transformer的有效性和效率,现有工作沿两条思路展开:(1)通过扩展到更多可训练参数使模型变宽;(2)通过参数共享或沿深度压缩模型使其变浅。然而,当可用于训练的token较少时,较大的模型通常难以很好地扩展;而当模型非常大时,还需要先进的并行技术。与原始Transformer模型相比,较小的模型则往往由于表示能力的损失而性能较差。本文为了用更少的可训练参数获得更好的性能,提出了一个高效部署可训练参数的框架,其做法是让模型更宽而不是更深。具体而言,我们用专家混合(MoE)替换前馈网络(FFN),沿模型宽度进行扩展;然后在各Transformer块之间共享MoE层,但每个块使用各自独立的层归一化。这样的部署起到了变换多种语义表示的作用,使模型在参数利用上更高效、更有效。为了评估该框架,我们设计了WideNet并在ImageNet-1K上进行了评估。我们最好的模型以$0.72\times$的可训练参数比Vision Transformer(ViT)高出$1.46\%$;使用$0.46\times$和$0.13\times$的参数量,WideNet仍可分别超过ViT和ViT-MoE $0.83\%$和$2.08\%$。 摘要:The transformer has recently achieved impressive results on various tasks. To further improve the effectiveness and efficiency of the transformer, there are two trains of thought among existing works: (1) going wider by scaling to more trainable parameters; (2) going shallower by parameter sharing or model compression along the depth. However, larger models usually do not scale well when fewer tokens are available to train, and advanced parallelisms are required when the model is extremely large. Smaller models usually achieve inferior performance compared to the original transformer model due to the loss of representation power. In this paper, to achieve better performance with fewer trainable parameters, we propose a framework to deploy trainable parameters efficiently, by going wider instead of deeper. Specifically, we scale along the model width by replacing the feed-forward network (FFN) with a mixture-of-experts (MoE) layer. We then share the MoE layers across transformer blocks using individual layer normalization. Such deployment plays the role of transforming various semantic representations, which makes the model more parameter-efficient and effective. To evaluate our framework, we design WideNet and evaluate it on ImageNet-1K. Our best model outperforms Vision Transformer (ViT) by $1.46\%$ with $0.72\times$ trainable parameters. Using $0.46\times$ and $0.13\times$ parameters, our WideNet can still surpass ViT and ViT-MoE by $0.83\%$ and $2.08\%$, respectively.
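A condensed PyTorch sketch of the described recipe follows: a toy top-1 MoE feed-forward layer stands in for the FFN, and the same MoE module object is reused in every block while each block keeps its own layer norms. The expert count, routing and capacity handling are simplified assumptions, not the released WideNet code.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    # Toy top-1 mixture-of-experts feed-forward layer (simplified routing).
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))

    def forward(self, x):                      # x: (..., dim)
        scores = self.gate(x).softmax(-1)
        top = scores.argmax(-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top == i
            if sel.any():
                out[sel] = expert(x[sel]) * scores[sel][:, i:i + 1]
        return out

class WideNetBlock(nn.Module):
    # Attention plus a *shared* MoE-FFN, but with per-block layer normalization.
    def __init__(self, dim, heads, shared_moe):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.moe = shared_moe                  # same module object in every block

    def forward(self, x):                      # x: (batch, tokens, dim)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.moe(self.ln2(x))

# Usage: one MoE shared by all blocks, individual layer norms per block.
moe = MoEFFN(dim=192, hidden=768)
blocks = nn.ModuleList(WideNetBlock(192, 3, moe) for _ in range(12))
```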
【8】 Efficient Large Scale Inlier Voting for Geometric Vision Problems 标题:几何视觉问题的高效大规模内点投票
作者:Dror Aiger,Simon Lynen,Jan Hosang,Bernhard Zeisl 机构:Google 链接:https://arxiv.org/abs/2107.11810 摘要:离群点剔除及与之等价的内点集优化是计算机视觉众多应用中的关键环节,例如相机位姿估计中的点匹配过滤、点云中的平面与法向估计等。已有多种方法,但在大规模场景下,可能解的数量会发生组合爆炸,RANSAC、Hough变换或分支定界等最先进的方法需要最低内点比例或先验知识才能保持实用。事实上,对于在超大场景中进行相机定位这类问题,如果不满足这些条件,这些方法的运行时间会呈指数增长,因而变得不可用。为了解决这一问题,我们提出了一种高效且通用的离群点剔除算法,其基于$R^d$中$k$维曲面的"相交"。我们给出了一套将各种几何问题转化为在$R^d$中寻找一个点的方法,该点使其附近曲面的数量最大化(从而使内点数量最大化)。所得算法具有线性的最坏情况复杂度,与竞争算法相比对近似因子的运行时依赖更优,且不需要特定领域的界。这是通过引入一种空间分解方案实现的,该方案通过逐次舍入并分组样本来限制计算量。我们的方法(及开源代码)使任何人都能在广泛领域为新问题快速推导出此类高效方法。我们在多个相机定位问题上展示了该方法的通用性:在内点比例低、匹配数量多的情况下,以显著更低的处理时间取得最先进的结果。 摘要:Outlier rejection and equivalently inlier set optimization is a key ingredient in numerous applications in computer vision such as filtering point-matches in camera pose estimation or plane and normal estimation in point clouds. Several approaches exist, yet at large scale we face a combinatorial explosion of possible solutions and state-of-the-art methods like RANSAC, Hough transform or Branch&Bound require a minimum inlier ratio or prior knowledge to remain practical. In fact, for problems such as camera posing in very large scenes these approaches become useless as they have exponential runtime growth if these conditions aren't met. To approach the problem we present an efficient and general algorithm for outlier rejection based on "intersecting" $k$-dimensional surfaces in $R^d$. We provide a recipe for casting a variety of geometric problems as finding a point in $R^d$ which maximizes the number of nearby surfaces (and thus inliers). The resulting algorithm has linear worst-case complexity with a better runtime dependency in the approximation factor than competing algorithms while not requiring domain specific bounds. This is achieved by introducing a space decomposition scheme that bounds the number of computations by successively rounding and grouping samples. Our recipe (and open-source code) enables anybody to derive such fast approaches to new problems across a wide range of domains. We demonstrate the versatility of the approach on several camera posing problems with a high number of matches at low inlier ratio, achieving state-of-the-art results at significantly lower processing times.
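The accumulation idea can be conveyed with a toy voting routine: discretise $R^d$ into cells, let each surface vote at most once per cell its samples fall into, and return the most-voted cell. The successive-rounding refinement that gives the paper its complexity guarantees is omitted, and surfaces are assumed to be given as point samples.

```python
import numpy as np
from collections import defaultdict

def vote_for_intersection(surface_samples, cell_size):
    # surface_samples: list of (n_i, d) arrays, one array of samples per surface.
    votes = defaultdict(set)
    for sid, pts in enumerate(surface_samples):
        cells = {tuple(c) for c in np.floor(pts / cell_size).astype(int)}
        for cell in cells:
            votes[cell].add(sid)          # each surface votes once per cell
    best = max(votes, key=lambda c: len(votes[c]))
    center = (np.asarray(best) + 0.5) * cell_size
    return center, len(votes[best])       # candidate point and its inlier count
```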
【9】 Clustering by Maximizing Mutual Information Across Views 标题:通过最大化视图间的互信息进行聚类
作者:Kien Do,Truyen Tran,Svetha Venkatesh 机构:Applied Artificial Intelligence Institute (A,I,), Deakin University, Geelong, Australia 备注:Accepted at ICCV 2021 链接:https://arxiv.org/abs/2107.11635 摘要:我们提出了一种将联合表示学习与聚类相结合的新型图像聚类框架。该方法由共享同一主干网络的两个头组成:一个"表示学习"头和一个"聚类"头。"表示学习"头在实例级捕获对象的细粒度模式,作为线索供"聚类"头提取将对象划分成簇的粗粒度信息。整个模型以端到端的方式训练,通过最小化施加于两个头输出上的两个面向样本的对比损失的加权和来优化。为了保证"聚类"头对应的对比损失是最优的,我们引入了一个新的评价函数"log-of-dot-product"(点积取对数)。大量实验结果表明,我们的方法在多个图像数据集上显著优于最先进的单阶段聚类方法,在CIFAR10/20、STL10和ImageNet-Dogs上的准确率比最佳基线提高约5-7%。此外,我们方法的"两阶段"变体在三个具有挑战性的ImageNet子集上也取得了优于基线的结果。 摘要:We propose a novel framework for image clustering that incorporates joint representation learning and clustering. Our method consists of two heads that share the same backbone network - a "representation learning" head and a "clustering" head. The "representation learning" head captures fine-grained patterns of objects at the instance level which serve as clues for the "clustering" head to extract coarse-grain information that separates objects into clusters. The whole model is trained in an end-to-end manner by minimizing the weighted sum of two sample-oriented contrastive losses applied to the outputs of the two heads. To ensure that the contrastive loss corresponding to the "clustering" head is optimal, we introduce a novel critic function called "log-of-dot-product". Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art single-stage clustering methods across a variety of image datasets, improving over the best baseline by about 5-7% in accuracy on CIFAR10/20, STL10, and ImageNet-Dogs. Further, the "two-stage" variant of our method also achieves better results than baselines on three challenging ImageNet subsets.
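The clustering-head objective can be sketched as a sample-wise contrastive loss whose critic is the log of the dot product between cluster-probability vectors of two augmented views; the handling of negatives, the loss weighting and the representation-head term are simplified assumptions here.

```python
import torch
import torch.nn.functional as F

def cluster_head_loss(p1, p2):
    # p1, p2: (batch, num_clusters) softmax outputs for two views of the same images.
    eps = 1e-12
    scores = torch.log(p1 @ p2.t() + eps)                  # log-of-dot-product critic, all pairs
    targets = torch.arange(p1.size(0), device=p1.device)   # matching views are positives
    return 0.5 * (F.cross_entropy(scores, targets) +
                  F.cross_entropy(scores.t(), targets))
```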
【10】 Accelerating Atmospheric Turbulence Simulation via Learned Phase-to-Space Transform 标题:基于学习的相位-空间变换加速大气湍流模拟
作者:Zhiyuan Mao,Nicholas Chimitt,Stanley H. Chan 机构:School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana USA 备注:The paper will be published at the ICCV 2021 链接:https://arxiv.org/abs/2107.11627 摘要:快速准确地模拟大气湍流成像是发展湍流抑制算法的关键。认识到以往方法的局限性,我们引入了一个称为相位-空间(P2S)变换的新概念,大大加快了模拟速度。P2S建立在三个思想之上:(1)将空间可变卷积重新表示为一组基于基函数的空间不变卷积;(2)利用已知的湍流统计模型学习基函数;(3)通过一个直接将相位表示转换为空间表示的轻量级网络实现P2S变换。与主流的分步(split-step)模拟器相比,新模拟器可提速300到1000倍,同时保留了关键的湍流统计特性。 摘要:Fast and accurate simulation of imaging through atmospheric turbulence is essential for developing turbulence mitigation algorithms. Recognizing the limitations of previous approaches, we introduce a new concept known as the phase-to-space (P2S) transform to significantly speed up the simulation. P2S is built upon three ideas: (1) reformulating the spatially varying convolution as a set of invariant convolutions with basis functions, (2) learning the basis functions via the known turbulence statistics models, (3) implementing the P2S transform via a light-weight network that directly converts the phase representation to the spatial representation. The new simulator offers a 300x -- 1000x speedup compared to mainstream split-step simulators while preserving the essential turbulence statistics.
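Idea (1) can be illustrated in a few lines: a spatially varying blur is approximated as a per-pixel weighted sum of a small number of spatially invariant convolutions with basis PSFs. The tensor shapes and the source of the per-pixel coefficients (in the paper they would come from the learned P2S network) are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def basis_space_varying_blur(image, basis_psfs, coeff_maps):
    # image: (1, 1, H, W); basis_psfs: (K, 1, k, k); coeff_maps: (1, K, H, W)
    # per-pixel mixing weights for the K basis point-spread functions.
    pad = basis_psfs.shape[-1] // 2
    blurred = F.conv2d(image, basis_psfs, padding=pad)   # K invariant convolutions
    return (blurred * coeff_maps).sum(dim=1, keepdim=True)
```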