Computer Vision arXiv Daily Digest [8.24]

2021-08-25 16:12:17


cs.CV: 104 papers today

Transformer (6 papers)

【1】 Discovering Spatial Relationships by Transformers for Domain Generalization Link: https://arxiv.org/abs/2108.10046

Authors: Cuicui Kang, Karthik Nandakumar Abstract: Due to the rapid increase in the diversity of image data, the problem of domain generalization has received increased attention recently. While domain generalization is a challenging problem, it has seen great progress thanks to the fast development of AI techniques in computer vision. Most of these advanced algorithms are proposed with deep architectures based on convolutional neural networks (CNNs). However, though CNNs have a strong ability to find discriminative features, they do a poor job of modeling the relations between different locations in the image, because the responses of CNN filters are mostly local. Since these local and global spatial relationships characterize the object under consideration, they play a critical role in improving the generalization ability against the domain gap. To capture object-part relationships for better domain generalization, this work proposes to use the self-attention model. However, attention models were proposed for sequences and are not adept at discriminative feature extraction from 2D images. Considering this, we propose a hybrid architecture to discover the spatial relationships between these local features, and derive a composite representation that encodes both the discriminative features and their relationships to improve the domain generalization. Evaluation on three well-known benchmarks demonstrates the benefits of modeling relationships between the features of an image using the proposed method, which achieves state-of-the-art domain generalization performance. More specifically, the proposed algorithm outperforms the state-of-the-art by 2.2% and 3.4% on the PACS and Office-Home databases, respectively.
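
As a rough illustration of the hybrid idea described above (not the authors' code), the sketch below pairs a CNN feature extractor with a transformer encoder so that self-attention models the spatial relations between local features; all module choices and sizes are illustrative assumptions.

```python
# Hedged sketch: CNN features as local tokens, self-attention over locations,
# pooled composite representation for classification. Sizes are illustrative.
import torch
import torch.nn as nn
import torchvision

class HybridCNNTransformer(nn.Module):
    def __init__(self, num_classes, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 512, H', W')
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        f = self.cnn(x)                         # local CNN features
        tokens = f.flatten(2).transpose(1, 2)   # (B, H'*W', 512) spatial tokens
        rel = self.encoder(tokens)              # self-attention across locations
        return self.head(rel.mean(dim=1))       # pooled composite representation

logits = HybridCNNTransformer(num_classes=7)(torch.randn(2, 3, 224, 224))
```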

【2】 Spatial Transformer Networks for Curriculum Learning Link: https://arxiv.org/abs/2108.09696

Authors: Fatemeh Azimi, Jean-Francois Jacques Nicolas Nies, Sebastian Palacio, Federico Raue, Jörn Hees, Andreas Dengel Affiliations: TU Kaiserslautern, Germany; Smart Data and Knowledge Services, German Research Center for Artificial Intelligence (DFKI), Germany Abstract: Curriculum learning is a bio-inspired training technique that is widely adopted in machine learning for improved optimization and better training of neural networks regarding the convergence rate or obtained accuracy. The main concept in curriculum learning is to start the training with simpler tasks and gradually increase the level of difficulty. Therefore, a natural question is how to determine or generate these simpler tasks. In this work, we take inspiration from Spatial Transformer Networks (STNs) in order to form an easy-to-hard curriculum. As STNs have been proven capable of removing clutter from input images and obtaining higher accuracy in image classification tasks, we hypothesize that images processed by STNs can be seen as easier tasks and utilized in the interest of curriculum learning. To this end, we study multiple strategies developed for shaping the training curriculum, using the data generated by STNs. We perform various experiments on the cluttered MNIST and Fashion-MNIST datasets, where on the former, we obtain an improvement of 3.8 pp in classification accuracy compared to the baseline.
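
A compact Spatial Transformer module of the kind this paper builds on is sketched below: a small localization network predicts an affine transform that resamples the input, which can, for example, crop away clutter. Layer sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(                       # localization network
            nn.Conv2d(1, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(), nn.Linear(10 * 9, 6),
        )
        self.loc[-1].weight.data.zero_()                # start at the identity transform
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)              # predicted affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # resampled (decluttered) image

out = STN()(torch.randn(4, 1, 60, 60))  # cluttered-MNIST-sized input
```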

【3】 Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads Link: https://arxiv.org/abs/2108.09691

Authors: Xiaohu Jiang, Ze Chen, Zhicheng Wang, Erjin Zhou, Chun Yuan Affiliations: Tsinghua Shenzhen International Graduate School; MEGVII Technology Abstract: After DETR was proposed, this novel transformer-based detection paradigm, which performs several cross-attentions between object queries and feature maps for predictions, subsequently derived a series of transformer-based detection heads. These models iterate object queries after each cross-attention. However, they do not renew the query position, which encodes the object queries' position information, so the model needs extra learning to figure out the newest regions that the query position should express. To fix this issue, we propose the Guided Query Position (GQPos) method to iteratively embed the latest location information of object queries into the query position. Another problem of such transformer-based detection heads is the high complexity of performing attention on multi-scale feature maps, which hinders them from improving detection performance at all scales. Therefore, we propose a novel fusion scheme named Similar Attention (SiA): besides fusing the feature maps, SiA also fuses the attention weight maps, so that a well-learned low-resolution attention weight map accelerates the learning of the high-resolution one. Our experiments show that the proposed GQPos improves the performance of a series of models, including DETR, SMCA, YoloS, and HoiTransformer, and SiA consistently improves the performance of multi-scale transformer-based detection heads like DETR and HoiTransformer.

【4】 Construction material classification on imbalanced datasets for construction monitoring automation using Vision Transformer (ViT) architecture Link: https://arxiv.org/abs/2108.09527

Authors: Maryam Soleymani, Mahdi Bonyani, Hadi Mahami, Farnad Nasirzadeh Affiliations: Department of Construction and Project Management, Art University of Tehran, Tehran, Iran; Department of Computer Engineering, University of Tabriz, Tabriz, Iran; School of Architecture and Built Environment, Deakin University, Geelong, Australia Note: 19 pages, 10 figures, 9 tables Abstract: Nowadays, automation is a critical topic due to its significant impact on the productivity of construction projects. Utilizing automation in this industry brings about great results, such as remarkable improvements in the efficiency, quality, and safety of construction activities. The scope of automation in construction includes a wide range of stages, and monitoring construction projects is no exception. It is also of great importance in project management, since an accurate and timely assessment of project progress enables managers to quickly identify deviations from the schedule and take the required actions at the right time. In this stage, one of the most important tasks is to keep track of project progress daily, which is very time-consuming and labor-intensive, but automation has facilitated and accelerated this task. It has also eliminated, or at least decreased, the risk of many dangerous tasks. The first step of construction automation is thus to automatically detect the materials used on a project site. In this paper, a novel deep learning architecture called Vision Transformer (ViT) is utilized for detecting and classifying construction materials. To evaluate the applicability and performance of the proposed method, it is trained and tested on three large imbalanced datasets, namely the Construction Material Library (CML) and the Building Material Dataset (BMD), used in previous papers, as well as a new dataset created by combining them. The achieved results reveal an accuracy of 100 percent in all parameters and in each material category. It is believed that the proposed method provides a novel and robust tool for detecting and classifying different material types.
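
The abstract does not give the training recipe, but a plausible hedged sketch of fine-tuning a ViT classifier on an imbalanced material dataset is shown below, using torchvision's ViT-B/16 with a resized head and an inverse-frequency class-weighted loss; the class counts and the weighting scheme are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 6  # illustrative number of material categories
model = torchvision.models.vit_b_16(weights=None)  # in practice, start from pretrained weights
model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)

# inverse-frequency class weights to counter imbalance (counts are made up)
class_counts = torch.tensor([5200., 300., 4100., 150., 2600., 900.])
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = model(torch.randn(2, 3, 224, 224))
loss = criterion(logits, torch.tensor([0, 3]))
```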

【5】 MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition Link: https://arxiv.org/abs/2108.09322

Authors: Jiawei Chen, Chiu Man Ho Affiliations: OPPO US Research Center Abstract: This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Different from other schemes which solely utilize the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and the audio waveform. In order to handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy, and performs better than or on par with the state-of-the-art CNN counterparts that rely on computationally heavy optical flow.
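
The factorization the abstract mentions can be sketched as attending over one axis at a time; the toy block below applies attention over the spatial axis and then the temporal axis (the modality axis is omitted, and all sizes are illustrative).

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, N, D) frames x patch tokens
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)
        s, _ = self.spatial(s, s, s)           # attention within each frame
        t = s.reshape(B, T, N, D).transpose(1, 2).reshape(B * N, T, D)
        t, _ = self.temporal(t, t, t)          # attention across time per location
        return t.reshape(B, N, T, D).transpose(1, 2)

out = FactorizedAttention()(torch.randn(2, 8, 49, 256))  # 8 frames, 7x7 patches
```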

【6】 SwinIR: Image Restoration Using Swin Transformer Link: https://arxiv.org/abs/2108.10257

Authors: Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, Radu Timofte Affiliations: Computer Vision Lab, ETH Zurich, Switzerland; KU Leuven, Belgium Note: SOTA results on classical/lightweight/real-world image SR, image denoising and JPEG compression artifact reduction. Code: this https URL Abstract: Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers, which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model, SwinIR, for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14~0.45 dB, while the total number of parameters can be reduced by up to 67%.
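
The three-part structure listed in the abstract can be outlined as below; note that plain multi-head attention stands in for Swin's shifted-window attention, so this is a structural sketch under that assumption, not SwinIR itself.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):  # stand-in for a residual Swin Transformer block (RSTB)
    def __init__(self, dim=60, heads=6, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=2 * dim, batch_first=True)
        self.body = nn.TransformerEncoder(enc, layers)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = self.body(x.flatten(2).transpose(1, 2))
        return x + t.transpose(1, 2).reshape(B, C, H, W)  # residual connection

class SwinIRSketch(nn.Module):
    def __init__(self, dim=60, blocks=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)              # shallow feature extraction
        self.deep = nn.Sequential(*[ResidualBlock(dim) for _ in range(blocks)])
        self.reconstruct = nn.Conv2d(dim, 3, 3, padding=1)          # image reconstruction

    def forward(self, x):
        f = self.shallow(x)
        return self.reconstruct(self.deep(f) + f)  # global residual, e.g. for denoising

out = SwinIRSketch()(torch.randn(1, 3, 48, 48))
```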

Detection (11 papers)

【1】 PW-MAD: Pixel-wise Supervision for Generalized Face Morphing Attack Detection Link: https://arxiv.org/abs/2108.10291

Authors: Naser Damer, Noemie Spiller, Meiling Fang, Fadi Boutros, Florian Kirchbuchner, Arjan Kuijper Affiliations: Fraunhofer Institute for Computer Graphics Research IGD, Germany; Department of Computer Science, TU Darmstadt, Germany Abstract: A face morphing attack image can be verified to multiple identities, making this attack a major vulnerability to processes based on identity verification, such as border checks. Different methods have been proposed to detect face morphing attacks, however, with low generalizability to unexpected post-morphing processes. A major post-morphing process is the print-and-scan operation performed in many countries when issuing a passport or identity document. In this work, we address this generalization problem by adapting a pixel-wise supervision approach, where we train a network to classify each pixel of the image as attack or not during the training process, rather than only having one label for the whole image. Our pixel-wise morphing attack detection (PW-MAD) solution performs more accurately than a set of established baselines. More importantly, our approach shows high generalizability in comparison to related works when evaluated on unknown re-digitized attacks. In addition to our PW-MAD approach, we create a new face morphing attack dataset with digital and re-digitized attacks and bona fide samples, namely the LMA-DRD dataset, which will be made publicly available for research purposes.
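
The pixel-wise supervision idea can be sketched as a dense binary loss on a per-pixel attack map, optionally combined with an image-level term; the tiny network and the way the two losses are combined below are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(                      # any dense feature extractor would do
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),                  # 1-channel pixel-wise attack logit map
)
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(4, 3, 224, 224)
pixel_labels = torch.randint(0, 2, (4, 1, 224, 224)).float()  # 1 = morphed pixel

pixel_logits = net(images)
loss_pixel = criterion(pixel_logits, pixel_labels)            # dense supervision
loss_image = criterion(pixel_logits.mean(dim=(2, 3)),         # coarse image-level term
                       pixel_labels.amax(dim=(2, 3)))
loss = loss_pixel + loss_image
```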

【2】 LivDet 2021 Fingerprint Liveness Detection Competition -- Into the unknown Link: https://arxiv.org/abs/2108.10183

Authors: Roberto Casula, Marco Micheletto, Giulia Orrù, Rita Delussu, Sara Concas, Andrea Panzino, Gian Luca Marcialis Affiliations: University of Cagliari, Italy Abstract: The International Fingerprint Liveness Detection Competition is a biennial international competition open to academia and industry with the aim to assess and report advances in Fingerprint Presentation Attack Detection. The proposed "Liveness Detection in Action" and "Fingerprint representation" challenges were aimed at evaluating the impact of a PAD embedded into a verification system, and the effectiveness and compactness of feature sets for mobile applications. Furthermore, we experimented with a new spoof fabrication method that particularly affected the final results. Twenty-three algorithms were submitted to the competition, the maximum number ever achieved by LivDet.

【3】 ODAM: Object Detection, Association, and Mapping using Posed RGB Video Link: https://arxiv.org/abs/2108.10165

Authors: Kejie Li, Daniel DeTone, Steven Chen, Minh Vo, Ian Reid, Hamid Rezatofighi, Chris Sweeney, Julian Straub, Richard Newcombe Affiliations: The University of Adelaide; Facebook Reality Labs Research; Monash University Note: Accepted at ICCV 2021 as an oral Abstract: Localizing objects and estimating their extent in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics. We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos. The proposed system relies on a deep learning front-end to detect 3D objects from a given RGB frame and associate them to a global object-based map using a graph neural network (GNN). Based on these frame-to-model associations, our back-end optimizes object bounding volumes, represented as super-quadrics, under multi-view geometry constraints and an object scale prior. We validate the proposed system on ScanNet, where we show a significant improvement over existing RGB-only methods.

【4】 Deep neural networks approach to microbial colony detection -- a comparative analysis Link: https://arxiv.org/abs/2108.10103

Authors: Sylwia Majchrowska, Jarosław Pawłowski, Natalia Czerep, Aleksander Górecki, Jakub Kuciński, Tomasz Golan Affiliations: NeuroSYS, Wrocław, Poland; Wrocław University of Science and Technology, Wrocław, Poland; University of Wrocław, Wrocław, Poland Note: 8 pages, 2 figures, 3 tables Abstract: Counting microbial colonies is a fundamental task in microbiology and has many applications in numerous industry branches. Despite this, current studies towards automatic microbial counting using artificial intelligence are hardly comparable due to the lack of unified methodology and the availability of large datasets. The recently introduced AGAR dataset is the answer to the second need, but the research carried out is still not exhaustive. To tackle this problem, we compared the performance of three well-known deep learning approaches for object detection on the AGAR dataset, namely two-stage, one-stage and transformer-based neural networks. The achieved results may serve as a benchmark for future experiments.

【5】 Towards Real-world X-ray Security Inspection: A High-Quality Benchmark and Lateral Inhibition Module for Prohibited Items Detection Link: https://arxiv.org/abs/2108.09917

Authors: Renshuai Tao, Yanlu Wei, Xiangjian Jiang, Hainan Li, Haotong Qin, Jiakai Wang, Yuqing Ma, Libo Zhang, Xianglong Liu Affiliations: State Key Laboratory of Software Development Environment, Beihang University; iFLYTEK Research; Institute of Software, Chinese Academy of Sciences Abstract: Prohibited items detection in X-ray images often plays an important role in protecting public safety. It typically deals with color-monotonous and luster-insufficient objects, resulting in unsatisfactory performance. Till now, there have been rare studies touching this topic due to the lack of specialized high-quality datasets. In this work, we first present a High-quality X-ray (HiXray) security inspection image dataset, which contains 102,928 common prohibited items of 8 categories. It is the largest high-quality dataset for prohibited items detection, gathered from real-world airport security inspection and annotated by professional security inspectors. Besides, for accurate prohibited item detection, we further propose the Lateral Inhibition Module (LIM), inspired by the fact that humans recognize these items by ignoring irrelevant information and focusing on identifiable characteristics, especially when objects overlap with each other. Specifically, LIM, an elaborately designed flexible add-on module, maximally suppresses noisy information flow via the Bidirectional Propagation (BP) module and activates the most identifiable characteristic, the boundary, from four directions via the Boundary Activation (BA) module. We evaluate our method extensively on HiXray and OPIXray, and the results demonstrate that it outperforms SOTA detection methods.

【6】 Multi-Expert Adversarial Attack Detection in Person Re-identification Using Context Inconsistency Link: https://arxiv.org/abs/2108.09891

Authors: Xueping Wang, Shasha Li, Min Liu, Yaonan Wang, Amit K. Roy-Chowdhury Affiliations: College of Electrical and Information Engineering, Hunan University, China; National Engineering Laboratory for Robot Visual Perception and Control Technology, China; University of California, Riverside Note: Accepted at IEEE ICCV 2021 Abstract: The success of deep neural networks (DNNs) has promoted the widespread applications of person re-identification (ReID). However, ReID systems inherit the vulnerability of DNNs to malicious attacks by visually inconspicuous adversarial perturbations. Detection of adversarial attacks is, therefore, a fundamental requirement for robust ReID systems. In this work, we propose a Multi-Expert Adversarial Attack Detection (MEAAD) approach to achieve this goal by checking context inconsistency, which is suitable for any DNN-based ReID system. Specifically, three kinds of context inconsistencies caused by adversarial attacks are employed to learn a detector for distinguishing the perturbed examples, i.e., a) the embedding distances between a perturbed query person image and its top-K retrievals are generally larger than those between a benign query image and its top-K retrievals, b) the embedding distances among the top-K retrievals of a perturbed query image are larger than those of a benign query image, and c) the top-K retrievals of a benign query image obtained with multiple expert ReID models tend to be consistent, which is not preserved when attacks are present. Extensive experiments on the Market1501 and DukeMTMC-ReID datasets show that, as the first adversarial attack detection approach for ReID, MEAAD effectively detects various adversarial attacks and achieves high ROC-AUC (over 97.5%).
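
The three inconsistency cues translate naturally into distance statistics over a gallery; a hedged sketch (feature dimensions and the feature vector layout are assumptions) follows.

```python
import torch

def context_features(query, gallery_sets, k=10):
    """query: (D,) embedding; gallery_sets: one (N, D) gallery per expert model."""
    topk_ids, feats = [], []
    for gallery in gallery_sets:
        d = torch.cdist(query[None], gallery)[0]            # query-to-gallery distances
        dist, ids = d.topk(k, largest=False)
        topk_ids.append(set(ids.tolist()))
        feats.append(dist.mean())                           # cue a): query-to-top-K distance
        feats.append(torch.cdist(gallery[ids], gallery[ids]).mean())  # cue b): intra-top-K
    agree = len(set.intersection(*topk_ids)) / k            # cue c): cross-expert consensus
    return torch.stack(feats + [torch.tensor(agree)])

f = context_features(torch.randn(128), [torch.randn(500, 128) for _ in range(3)])
```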

【7】 Detection and Localization of Multiple Image Splicing Using MobileNet V1 Link: https://arxiv.org/abs/2108.09674

Authors: Kalyani Kadam, Swati Ahirrao, Ketan Kotecha, Sayan Sahu Abstract: In modern society, digital images have become a prominent source of information and medium of communication. They can, however, be easily altered using freely available image editing software. Two or more images can be combined to generate a new image that transmits information across social media platforms to influence people in society, and this information may have both positive and negative consequences. Hence there is a need to develop a technique that detects and localizes multiple image splicing forgery in an image. This research work proposes multiple image splicing forgery detection using Mask R-CNN with MobileNet V1 as the backbone. It also calculates the percentage score of the forged region of multiple spliced images. A comparative analysis of the proposed work with variants of ResNet is performed. The proposed model is trained and tested using our MISD (Multiple Image Splicing Dataset), and it is observed that the proposed model outperforms the variants of ResNet models (ResNet 51, 101 and 151).

【8】 SIDE: Center-based Stereo 3D Detector with Structure-aware Instance Depth Estimation Link: https://arxiv.org/abs/2108.09663

Authors: Xidong Peng, Xinge Zhu, Tai Wang, Yuexin Ma Affiliations: ShanghaiTech University; The Chinese University of Hong Kong Abstract: 3D detection plays an indispensable role in environment perception. Due to the high cost of the commonly used LiDAR sensor, stereo-vision-based 3D detection, as an economical yet effective setting, has attracted more attention recently. For these approaches based on 2D images, accurate depth information is the key to achieving 3D detection, and most existing methods resort to a preliminary stage for depth estimation. They mainly focus on the global depth and neglect the properties of depth information in this specific task, namely, sparsity and locality, whereby accurate depth is only needed for the 3D bounding boxes. Motivated by this finding, we propose a stereo-image-based anchor-free 3D detection method, called structure-aware stereo 3D detector (termed SIDE), where we explore instance-level depth information by constructing a cost volume from the RoIs of each object. Due to the information sparsity of the local cost volume, we further introduce match reweighting and structure-aware attention to make the depth information more concentrated. Experiments conducted on the KITTI dataset show that our method achieves state-of-the-art performance compared to existing methods without depth map supervision.
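
The instance-level cost volume step can be sketched as a correlation volume over candidate disparities computed from left/right RoI features; the correlation form and disparity range below are assumptions, and the rest of the detector is omitted.

```python
import torch

def roi_cost_volume(left, right, max_disp=24):
    """left, right: (B, C, H, W) RoI feature maps from the stereo pair."""
    B, C, H, W = left.shape
    cost = left.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (left * right).mean(1)   # correlation at zero disparity
        else:                                      # shift the right features by d pixels
            cost[:, d, :, d:] = (left[:, :, :, d:] * right[:, :, :, :-d]).mean(1)
    return cost                                    # (B, max_disp, H, W)

vol = roi_cost_volume(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14), max_disp=8)
```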

【9】 MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking? Link: https://arxiv.org/abs/2108.09518

Authors: Matteo Fabbri, Guillem Braso, Gianluca Maugeri, Orcun Cetintas, Riccardo Gasparini, Aljosa Osep, Simone Calderara, Laura Leal-Taixe, Rita Cucchiara Affiliations: University of Modena and Reggio Emilia, Italy; Technical University of Munich, Germany; GoatAI S.r.l. Note: ICCV 2021 camera-ready version Abstract: Deep learning-based methods for video pedestrian detection and tracking require large volumes of training data to achieve good performance. However, data acquisition in crowded public environments raises data privacy concerns -- we are not allowed to simply record and store data without the explicit consent of all participants. Furthermore, the annotation of such data for computer vision applications usually requires a substantial amount of manual effort, especially in the video domain. Labeling instances of pedestrians in highly crowded scenarios can be challenging even for human annotators and may introduce errors in the training data. In this paper, we study how we can advance different aspects of multi-person tracking using solely synthetic data. To this end, we generate MOTSynth, a large, highly diverse synthetic dataset for object detection and tracking using a rendering game engine. Our experiments show that MOTSynth can be used as a replacement for real data on tasks such as pedestrian detection, re-identification, segmentation, and tracking.

【10】 Multi-scale Edge-based U-shape Network for Salient Object Detection Link: https://arxiv.org/abs/2108.09408

Authors: Han Sun, Yetong Bian, Ningzhong Liu, Huiyu Zhou Affiliations: Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China; School of Informatics, University of Leicester, Leicester, U.K. Note: 14 pages, 5 figures. Accepted by PRICAI 2021. Code: this https URL Abstract: Deep-learning-based salient object detection methods have achieved great improvements. However, there are still problems in the predictions, such as blurry boundaries and inaccurate locations, which are mainly caused by inadequate feature extraction and integration. In this paper, we propose a Multi-scale Edge-based U-shape Network (MEUN) to integrate various features at different scales to achieve better performance. To extract more useful information for boundary prediction, U-shape Edge Network modules are embedded in each decoder unit. Besides, an additional down-sampling module alleviates location inaccuracy. Experimental results on four benchmark datasets demonstrate the validity and reliability of the proposed method. The Multi-scale Edge-based U-shape Network also shows its superiority when compared with 15 state-of-the-art salient object detection methods.

【11】 Detecting and Segmenting Adversarial Graphics Patterns from Images Link: https://arxiv.org/abs/2108.09383

Authors: Xiangyu Qu, Stanley H. Chan Affiliations: School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, USA Note: ICCV 2021 Workshop on Adversarial Robustness in the Real World Abstract: Adversarial attacks pose a substantial threat to computer vision system security, but the social media industry constantly faces another form of "adversarial attack", in which hackers attempt to upload inappropriate images and fool automated screening systems by adding artificial graphics patterns. In this paper, we formulate the defense against such attacks as an artificial graphics pattern segmentation problem. We evaluate the efficacy of several segmentation algorithms and, based on observation of their performance, propose a new method tailored to this specific problem. Extensive experiments show that the proposed method outperforms the baselines and has a promising generalization capability, which is the most crucial aspect in segmenting artificial graphics patterns.

Classification & Recognition (16 papers)

【1】 Cross-Quality LFW: A Database for Analyzing Cross-Resolution Image Face Recognition in Unconstrained Environments Link: https://arxiv.org/abs/2108.10290

Authors: Martin Knoche, Stefan Hörmann, Gerhard Rigoll Affiliations: Chair of Human-Machine Communication, Technical University of Munich, Germany Note: 9 pages, 4 figures, 2 tables Abstract: Real-world face recognition applications often deal with suboptimal image quality or resolution due to different capturing conditions such as various subject-to-camera distances, poor camera settings, or motion blur. This characteristic has a non-negligible effect on performance. Recent cross-resolution face recognition approaches used simple, arbitrary, and unrealistic down- and up-scaling techniques to measure robustness against real-world edge cases in image quality. Thus, we propose a new standardized benchmark dataset derived from the famous Labeled Faces in the Wild (LFW). In contrast to previous derivatives, which focus on pose, age, similarity, and adversarial attacks, our Cross-Quality Labeled Faces in the Wild (XQLFW) dataset maximizes the quality difference. It contains only more realistic synthetically degraded images when necessary. Our proposed dataset is then used to further investigate the influence of image quality on several state-of-the-art approaches. With XQLFW, we show that these models perform differently in cross-quality cases, and hence, their generalizing capability is not accurately predicted by their performance on LFW. Additionally, we report baseline accuracy with recent deep learning models explicitly trained for cross-resolution applications and evaluate the susceptibility to image quality. To encourage further research in cross-resolution face recognition and incite the assessment of image quality robustness, we publish the database and code for evaluation.

【2】 Pattern Inversion as a Pattern Recognition Method for Machine Learning Link: https://arxiv.org/abs/2108.10242

Authors: Alexei Mikhailov, Mikhail Karavay Affiliations: Institute for Control Problems, Russian Academy of Sciences, Moscow, Russia Note: 9 pages, 12 figures Abstract: Artificial neural networks use a lot of coefficients whose adjustment takes a great deal of computing power, especially if deep learning networks are employed. However, there exist coefficient-free, extremely fast indexing-based technologies that work, for instance, in Google search engines, in genome sequencing, etc. The paper discusses the use of indexing-based methods for pattern recognition. It is shown that for pattern recognition applications such indexing methods replace the fully inverted files typically employed in search engines with inverse patterns. Not only does such inversion provide automatic feature extraction, which is a distinguishing mark of deep learning, but, unlike deep learning, pattern inversion supports almost instantaneous learning, a consequence of the absence of coefficients. The paper discusses a pattern inversion formalism that makes use of a novel pattern transform and its application to unsupervised instant learning. Examples demonstrate view-angle-independent recognition of three-dimensional objects, such as cars, against arbitrary backgrounds, prediction of the remaining useful life of aircraft engines, and other applications. In conclusion, it is noted that, in neurophysiology, the function of the neocortical mini-column has been widely debated since 1957. This paper hypothesizes that, mathematically, the cortical mini-column can be described as an inverse pattern which physically serves as a connection multiplier, expanding associations of inputs with relevant pattern classes.
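
A toy sketch of the coefficient-free, indexing-based idea: training "posts" each pattern's features into an inverted index, and a query is classified by voting over the classes its features retrieve. This is purely illustrative of the principle, not the paper's formalism.

```python
from collections import Counter, defaultdict

index = defaultdict(list)          # feature -> classes observed with that feature

def learn(features, label):        # "instant learning": just post the features
    for f in features:
        index[f].append(label)

def classify(features):            # inverse lookup plus majority vote
    votes = Counter()
    for f in features:
        votes.update(index.get(f, []))
    return votes.most_common(1)[0][0] if votes else None

learn({"wheel", "window", "headlight"}, "car")
learn({"wing", "window", "turbine"}, "plane")
print(classify({"wheel", "headlight", "window"}))   # -> car
```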

【3】 Fusion of evidential CNN classifiers for image classification Link: https://arxiv.org/abs/2108.10233

Authors: Zheng Tong, Philippe Xu, Thierry Denoeux Affiliations: Université de technologie de Compiègne, CNRS, UMR Heudiasyc, Compiègne, France; Institut universitaire de France, Paris, France Abstract: We propose an information-fusion approach based on belief functions to combine convolutional neural networks. In this approach, several pre-trained DS-based CNN architectures extract features from input images and convert them into mass functions on different frames of discernment. A fusion module then aggregates these mass functions using Dempster's rule. An end-to-end learning procedure allows us to fine-tune the overall architecture using a learning set with soft labels, which further improves the classification performance. The effectiveness of this approach is demonstrated experimentally using three benchmark databases.
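
The fusion step rests on Dempster's rule of combination; a minimal sketch for two mass functions over the same frame of discernment is given below (the CNN-to-mass mapping is omitted, and the example masses are made up).

```python
def dempster(m1, m2):
    """m1, m2: dicts mapping frozenset focal elements to masses summing to 1."""
    combined, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2                 # mass assigned to the empty set
    return {k: v / (1.0 - conflict) for k, v in combined.items()}  # normalize

theta = frozenset({"cat", "dog"})                   # full frame encodes ignorance
m1 = {frozenset({"cat"}): 0.7, theta: 0.3}
m2 = {frozenset({"cat"}): 0.4, frozenset({"dog"}): 0.4, theta: 0.2}
print(dempster(m1, m2))   # cat mass rises to 0.75 after combination
```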

【4】 Deep Bayesian Image Set Classification: A Defence Approach against Adversarial Attacks Link: https://arxiv.org/abs/2108.10217

Authors: Nima Mirnateghi, Syed Afaq Ali Shah, Mohammed Bennamoun Affiliations: Murdoch University; Department of Computer Science and Software Engineering, University of Western Australia Abstract: Deep learning has become an integral part of various computer vision systems in recent years due to its outstanding achievements in object recognition, facial recognition, and scene understanding. However, deep neural networks (DNNs) are susceptible to being fooled with high confidence by an adversary. In practice, the vulnerability of deep learning systems against carefully perturbed images, known as adversarial examples, poses a dire security threat in physical-world applications. To address this phenomenon, we present what is, to our knowledge, the first image set-based adversarial defence approach. Image set classification has shown exceptional performance for object and face recognition, owing to its intrinsic ability to handle appearance variability. We propose robust deep Bayesian image set classification as a defence framework against a broad range of adversarial attacks. We extensively evaluate the performance of the proposed technique with several voting strategies. We further analyse the effects of image size and perturbation magnitude, along with the ratio of perturbed images in each image set. We also evaluate our technique against recent state-of-the-art defence methods and on the single-shot recognition task. The empirical results demonstrate superior performance on the CIFAR-10, MNIST, ETH-80, and Tiny ImageNet datasets.

【5】 Towards Balanced Learning for Instance Recognition Link: https://arxiv.org/abs/2108.10175

Authors: Jiangmiao Pang, Kai Chen, Qi Li, Zhihai Xu, Huajun Feng, Jianping Shi, Wanli Ouyang, Dahua Lin Note: Accepted by IJCV. Journal extension of paper arXiv:1904.02701 Abstract: Instance recognition is rapidly advancing along with the development of various deep convolutional neural networks. Compared to the architectures of networks, the training process, which is also crucial to the success of detectors, has received relatively little attention. In this work, we carefully revisit the standard training practice of detectors, and find that the detection performance is often limited by imbalance during the training process, which generally consists of three levels - sample level, feature level, and objective level. To mitigate the adverse effects caused thereby, we propose Libra R-CNN, a simple yet effective framework towards balanced learning for instance recognition. It integrates IoU-balanced sampling, a balanced feature pyramid, and objective re-weighting, respectively, for reducing the imbalance at the sample, feature, and objective levels. Extensive experiments conducted on the MS COCO, LVIS and Pascal VOC datasets prove the effectiveness of the overall balanced design.
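
Of the three components, IoU-balanced sampling is the easiest to sketch: the negative pool is split into IoU intervals that are sampled evenly, so hard negatives are not drowned out by easy ones. Bin count and thresholds below are illustrative.

```python
import torch

def iou_balanced_sample(neg_ious, num_samples, num_bins=3, max_iou=0.5):
    """neg_ious: (N,) IoU of each negative proposal with its nearest ground truth."""
    picked = []
    per_bin = num_samples // num_bins
    edges = torch.linspace(0, max_iou, num_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = ((neg_ious >= lo) & (neg_ious < hi)).nonzero(as_tuple=True)[0]
        if len(in_bin):                      # sample evenly from each IoU interval
            picked.append(in_bin[torch.randperm(len(in_bin))[:per_bin]])
    return torch.cat(picked) if picked else torch.empty(0, dtype=torch.long)

idx = iou_balanced_sample(torch.rand(1000) * 0.5, num_samples=64)
```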

【6】 Improving the trustworthiness of image classification models by utilizing bounding-box annotations Link: https://arxiv.org/abs/2108.10131

Authors: Dharma KC, Chicheng Zhang Affiliations: University of Arizona Abstract: We study utilizing auxiliary information in training data to improve the trustworthiness of machine learning models. Specifically, in the context of image classification, we propose to optimize a training objective that incorporates bounding box information, which is available in many image classification datasets. Preliminary experimental results show that the proposed algorithm achieves better performance in accuracy, robustness, and interpretability compared with baselines.

【7】 ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos Link: https://arxiv.org/abs/2108.10059

Authors: Razieh Rastgoo, Kourosh Kiani, Sergio Escalera Affiliations: Semnan University; Institute for Research in Fundamental Sciences (IPM); Universitat de Barcelona and Computer Vision Center Abstract: Sign Language Recognition (SLR) is a challenging research area in computer vision. To tackle the annotation bottleneck in SLR, we formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model with two input modalities: RGB and depth videos. To benefit from vision Transformer capabilities, we use two vision Transformer models, for human detection and visual feature representation. We configure a transformer encoder-decoder architecture, as a fast and accurate human detection model, to overcome the challenges of current human detection models. Considering the human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation of the human body is obtained using a vision Transformer and an LSTM network. A semantic space maps the visual features to the lingual embedding of the class labels via a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluated the proposed model on four datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, and NTU-60, obtaining state-of-the-art results compared to state-of-the-art ZS-SLR models.

【8】 How Transferable Are Self-supervised Features in Medical Image Classification Tasks? Link: https://arxiv.org/abs/2108.10048

Authors: Tuan Truong, Sadegh Mohammadi, Matthias Lenga Affiliations: Bayer AG, Leverkusen, Germany; Berlin, Germany Abstract: Transfer learning has become a standard practice to mitigate the lack of labeled data in medical classification tasks. Whereas finetuning a downstream task using supervised ImageNet pretrained features is straightforward and extensively investigated in many works, there is little study on the usefulness of self-supervised pretraining. In this paper, we assess the transferability of ImageNet self-supervised pretraining by evaluating the performance of models initialized with pretrained features from three self-supervised techniques (SimCLR, SwAV, and DINO) on selected medical classification tasks. The chosen tasks cover tumor detection in sentinel axillary lymph node images, diabetic retinopathy classification in fundus images, and multiple pathological condition classification in chest X-ray images. We demonstrate that self-supervised pretrained models yield richer embeddings than their supervised counterparts, which benefits downstream tasks in view of both linear evaluation and finetuning. For example, in view of linear evaluation on a critically small subset of the data, we see an improvement of up to 14.79% in Kappa score in the diabetic retinopathy classification task, 5.4% in AUC in the tumor classification task, 7.03% in AUC in pneumonia detection, and 9.4% in AUC in the detection of pathological conditions in chest X-rays. In addition, we introduce Dynamic Visual Meta-Embedding (DVME) as an end-to-end transfer learning approach that fuses pretrained embeddings from multiple models. We show that the collective representation obtained by DVME leads to a significant improvement in the performance of selected tasks compared to using a single pretrained model approach, and can be generalized to any combination of pretrained models.
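
In the spirit of DVME, fusing embeddings from several pretrained encoders can be sketched as projecting and concatenating them before a shared head; the paper's exact fusion mechanism may differ, and the plain ResNets below merely stand in for SimCLR/SwAV/DINO encoders.

```python
import torch
import torch.nn as nn
import torchvision

def make_encoder(ctor):
    enc = ctor(weights=None)   # in practice, load self-supervised pretrained weights
    enc.fc = nn.Identity()     # drop the classification head -> raw embeddings
    return enc

class FusedClassifier(nn.Module):
    def __init__(self, encoders, dims, proj=256, num_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.projs = nn.ModuleList(nn.Linear(d, proj) for d in dims)
        self.head = nn.Linear(proj * len(dims), num_classes)

    def forward(self, x):
        embs = [p(e(x)) for e, p in zip(self.encoders, self.projs)]
        return self.head(torch.cat(embs, dim=1))   # collective representation

encoders = [make_encoder(torchvision.models.resnet50), make_encoder(torchvision.models.resnet18)]
logits = FusedClassifier(encoders, dims=(2048, 512))(torch.randn(2, 3, 224, 224))
```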

【9】 Exploring the Quality of GAN Generated Images for Person Re-Identification Link: https://arxiv.org/abs/2108.09977

Authors: Yiqi Jiang, Weihua Chen, Xiuyu Sun, Xiaoyu Shi, Fan Wang, Hao Li Affiliations: Alibaba Group, China Note: 10 pages, 4 figures Abstract: Recently, GAN-based methods have demonstrated strong effectiveness in generating augmentation data for person re-identification (ReID), on account of their ability to bridge the gap between domains and enrich the data variety in feature space. However, most ReID works pick all the GAN-generated data as additional training samples or evaluate the quality of GAN generation at the entire dataset level, ignoring the image-level essential features of data in the ReID task. In this paper, we analyze the in-depth characteristics of ReID samples and address the question of "What makes a GAN-generated image good for ReID". Specifically, we propose to examine each data sample with id-consistency and diversity constraints by mapping images onto different spaces. With a metric-based sampling method, we demonstrate that not every GAN-generated sample is beneficial for augmentation. Models trained with data filtered by our quality evaluation outperform those trained with the full augmentation set by a large margin. Extensive experiments show the effectiveness of our method on both the supervised ReID task and the unsupervised domain adaptation ReID task.
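
A hedged sketch of metric-based filtering with id-consistency and diversity constraints: keep a generated image only if its embedding stays close to its identity's centroid without collapsing onto an existing sample. Thresholds and the exact cues are assumptions, not the paper's definitions.

```python
import torch

def filter_generated(gen_emb, real_emb, t_id=1.0, t_div=0.2):
    """gen_emb: (M, D) embeddings of GAN images for one identity,
    real_emb: (N, D) embeddings of that identity's real images."""
    centroid = real_emb.mean(dim=0, keepdim=True)
    id_dist = torch.cdist(gen_emb, centroid)[:, 0]               # id-consistency cue
    nn_dist = torch.cdist(gen_emb, real_emb).min(dim=1).values   # diversity cue
    return (id_dist < t_id) & (nn_dist > t_div)                  # keep mask

mask = filter_generated(torch.randn(50, 128), torch.randn(20, 128))
```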

【10】 Face Photo-Sketch Recognition Using Bidirectional Collaborative Synthesis Network Link: https://arxiv.org/abs/2108.09898

Authors: Seho Bae, Nizam Ud Din, Hyunkyu Park, Juneho Yi Affiliations: Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, Republic of Korea Abstract: This research features a deep-learning-based framework to address the problem of matching a given face sketch image against a face photo database. The photo-sketch matching problem is challenging because 1) there is a large modality gap between photo and sketch, and 2) the number of paired training samples is insufficient to train deep-learning-based networks. To circumvent the problem of the large modality gap, our approach is to use an intermediate latent space between the two modalities. We effectively align the distributions of the two modalities in this latent space by employing a bidirectional (photo -> sketch and sketch -> photo) collaborative synthesis network. A StyleGAN-like architecture is utilized to equip the intermediate latent space with rich representation power. To resolve the problem of insufficient training samples, we introduce a three-step training scheme. Extensive evaluation on a public composite face sketch database confirms the superior performance of our method compared to existing state-of-the-art methods. The proposed methodology can be employed in matching other modality pairs.

【11】 Uncertainty-aware Clustering for Unsupervised Domain Adaptive Object Re-identification Link: https://arxiv.org/abs/2108.09682

Authors: Pengfei Wang, Changxing Ding, Wentao Tan, Mingming Gong, Kui Jia, Dacheng Tao Affiliations: South China University of Technology; University of Melbourne; JD Explore Academy, China Abstract: Unsupervised Domain Adaptive (UDA) object re-identification (Re-ID) aims at adapting a model trained on a labeled source domain to an unlabeled target domain. State-of-the-art object Re-ID approaches adopt clustering algorithms to generate pseudo-labels for the unlabeled target domain. However, the inevitable label noise caused by the clustering procedure significantly degrades the discriminative power of the Re-ID model. To address this problem, we propose an uncertainty-aware clustering framework (UCF) for UDA tasks. First, a novel hierarchical clustering scheme is proposed to promote clustering quality. Second, an uncertainty-aware collaborative instance selection method is introduced to select images with reliable labels for model training. Combining both techniques effectively reduces the impact of noisy labels. In addition, we introduce a strong baseline that features a compact contrastive loss. Our UCF method consistently achieves state-of-the-art performance in multiple UDA tasks for object Re-ID, and significantly reduces the gap between unsupervised and supervised Re-ID performance. In particular, the performance of our unsupervised UCF method in the MSMT17→Market1501 task is better than that of the fully supervised setting on Market1501. The code of UCF is available at https://github.com/Wang-pengfei/UCF.

【12】 Relational Embedding for Few-Shot Classification Link: https://arxiv.org/abs/2108.09666

Authors: Dahyun Kang, Heeseung Kwon, Juhong Min, Minsu Cho Affiliations: Pohang University of Science and Technology (POSTECH), South Korea Note: Accepted at ICCV 2021 Abstract: We propose to address the problem of few-shot classification by meta-learning "what to observe" and "where to attend" in a relational perspective. Our method leverages relational patterns within and between images via self-correlational representation (SCR) and cross-correlational attention (CCA). Within each image, the SCR module transforms a base feature map into a self-correlation tensor and learns to extract structural patterns from the tensor. Between the images, the CCA module computes cross-correlation between two image representations and learns to produce co-attention between them. Our Relational Embedding Network (RENet) combines the two relational modules to learn relational embedding in an end-to-end manner. In experimental evaluation, it achieves consistent improvements over state-of-the-art methods on four widely used few-shot classification benchmarks: miniImageNet, tieredImageNet, CUB-200-2011, and CIFAR-FS.

【13】 Tensor Pooling Driven Instance Segmentation Framework for Baggage Threat Recognition Link: https://arxiv.org/abs/2108.09603

Authors: Taimur Hassan, Samet Akcay, Mohammed Bennamoun, Salman Khan, Naoufel Werghi Note: Accepted in Neural Computing and Applications. Source code is available at this https URL Abstract: Automated systems designed for screening contraband items from X-ray imagery still face difficulties with high clutter, concealment, and extreme occlusion. In this paper, we address this challenge using a novel multi-scale contour instance segmentation framework that effectively identifies cluttered contraband data within baggage X-ray scans. Unlike standard models that employ region-based or keypoint-based techniques to generate multiple boxes around objects, we propose to derive proposals according to the hierarchy of the regions defined by the contours. The proposed framework is rigorously validated on three public datasets, dubbed GDXray, SIXray, and OPIXray, where it outperforms the state-of-the-art methods by achieving mean average precision scores of 0.9779, 0.9614, and 0.8396, respectively. Furthermore, to the best of our knowledge, this is the first contour instance segmentation framework that leverages multi-scale information to recognize cluttered and concealed contraband data in colored and grayscale security X-ray imagery.

【14】 End2End Occluded Face Recognition by Masking Corrupted Features Link: https://arxiv.org/abs/2108.09468

Authors: Haibo Qiu, Dihong Gong, Zhifeng Li, Wei Liu, Dacheng Tao Note: Accepted by TPAMI 2021 Abstract: With the recent advancement of deep convolutional neural networks, significant progress has been made in general face recognition. However, state-of-the-art general face recognition models do not generalize well to occluded face images, which are exactly the common cases in real-world scenarios. The potential reasons are the absence of large-scale occluded face data for training and of specific designs for tackling corrupted features brought by occlusions. This paper presents a novel face recognition method that is robust to occlusions based on a single end-to-end deep neural network. Our approach, named FROM (Face Recognition with Occlusion Masks), learns to discover the corrupted features from the deep convolutional neural networks, and cleans them with dynamically learned masks. In addition, we construct massive occluded face images to train FROM effectively and efficiently. FROM is simple yet powerful compared to the existing methods, which either rely on external detectors to discover the occlusions or employ shallow models that are less discriminative. Experimental results on the LFW, MegaFace Challenge 1, RMF2 and AR datasets and other simulated occluded/masked datasets confirm that FROM dramatically improves the accuracy under occlusions and generalizes well on general face recognition.
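
The mask-and-clean idea can be sketched as a small branch that predicts a soft validity mask over the feature map and suppresses corrupted features by element-wise multiplication; the mask branch design, losses, and sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedFeatures(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.mask_branch = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, 1, 1), nn.Sigmoid(),   # per-location validity in [0, 1]
        )

    def forward(self, feat):              # feat: (B, C, H, W) from the backbone
        mask = self.mask_branch(feat)     # near 0 where features look occlusion-corrupted
        return feat * mask                # cleaned features for recognition

cleaned = MaskedFeatures()(torch.randn(2, 256, 14, 14))
```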

【15】 Ensemble of CNN classifiers using Sugeno Fuzzy Integral Technique for Cervical Cytology Image Classification 标题:基于Sugeno模糊积分技术的宫颈细胞学图像分类的CNN分类器集成 链接:https://arxiv.org/abs/2108.09460

作者:Rohit Kundu,Hritam Basak,Akhil Koilada,Soham Chattopadhyay,Sukanta Chakraborty,Nibaran Das 机构:Department of Electrical Engineering, Jadavpur University, Kolkata-, INDIA, Theism Medical Diagnostics Centre, Department of Computer Science and Engineering, ∗ 备注:16 pages 摘要:宫颈癌是第四种最常见的癌症,由于检测程序缓慢,每年影响50多万妇女。早期诊断有助于治疗甚至治愈癌症,但繁琐、耗时的检测过程使得无法进行人群筛查。为了帮助病理学家进行高效可靠的检测,在本文中,我们提出了一种全自动的计算机辅助诊断工具,用于对宫颈癌的单细胞和玻片图像进行分类。开发用于生物医学图像分类的自动检测工具的主要问题是公共可访问数据的可用性低。集成学习是一种流行的图像分类方法,但对各分类器赋予预定权重的简单做法难以取得令人满意的效果。在本研究中,我们使用Sugeno模糊积分来整合三种流行的预训练深度学习模型的决策分数,即Inception v3、DenseNet-161和ResNet-34。所提出的模糊融合能够考虑各分类器对每个样本的置信度分数,从而自适应地改变每个分类器的重要性,捕获各分类器提供的互补信息,从而获得优异的分类性能。我们在三个公开可用的数据集上评估了所提出的方法,即Mendeley液基细胞学(LBC)数据集、SIPaKMeD全玻片图像(WSI)数据集和SIPaKMeD单细胞图像(SCI)数据集,由此产生的结果是有希望的。使用基于GradCAM的视觉表示和统计测试对该方法进行分析,并将该方法与文献中现有和基线模型进行比较,证明该方法的有效性。 摘要:Cervical cancer is the fourth most common category of cancer, affecting more than 500,000 women annually, owing to the slow detection procedure. Early diagnosis can help in treating and even curing cancer, but the tedious, time-consuming testing process makes it impossible to conduct population-wise screening. To aid the pathologists in efficient and reliable detection, in this paper, we propose a fully automated computer-aided diagnosis tool for classifying single-cell and slide images of cervical cancer. The main concern in developing an automatic detection tool for biomedical image classification is the low availability of publicly accessible data. Ensemble Learning is a popular approach for image classification, but simplistic approaches that leverage pre-determined weights to classifiers fail to perform satisfactorily. In this research, we use the Sugeno Fuzzy Integral to ensemble the decision scores from three popular pretrained deep learning models, namely, Inception v3, DenseNet-161 and ResNet-34. The proposed Fuzzy fusion is capable of taking into consideration the confidence scores of the classifiers for each sample, and thus adaptively changing the importance given to each classifier, capturing the complementary information supplied by each, thus leading to superior classification performance. We evaluated the proposed method on three publicly available datasets, the Mendeley Liquid Based Cytology (LBC) dataset, the SIPaKMeD Whole Slide Image (WSI) dataset, and the SIPaKMeD Single Cell Image (SCI) dataset, and the results thus yielded are promising. Analysis of the approach using GradCAM-based visual representations and statistical tests, and comparison of the method with existing and baseline models in literature justify the efficacy of the approach.
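为说明Sugeno模糊积分如何融合多个分类器的置信度,下面给出一个最小化的NumPy草图;其中嵌套集合的模糊测度g取固定值,仅为假设示例(实际工作中通常由各分类器的重要性密度推得)。

```python
import numpy as np

def sugeno_integral(scores, measures):
    """scores: 各分类器对某一类别的置信度 (n,)
    measures: 置信度最高的前i个分类器组成集合的模糊测度g(A_i),
              单调不减且measures[-1] == 1(此处取假设值)。"""
    sorted_scores = np.sort(scores)[::-1]              # 按置信度从高到低排序
    return max(min(s, g) for s, g in zip(sorted_scores, measures))

# 融合三个分类器(如Inception v3、DenseNet-161、ResNet-34)对3个类别的得分
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.4, 0.4, 0.2]])
g = np.array([0.5, 0.8, 1.0])                          # 假设的模糊测度
fused = [sugeno_integral(probs[:, c], g) for c in range(probs.shape[1])]
print(int(np.argmax(fused)))                           # 融合后的预测类别
```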

【16】 Multimodal Breast Lesion Classification Using Cross-Attention Deep Networks 标题:基于交叉注意深度网络的多模态乳腺病变分类 链接:https://arxiv.org/abs/2108.09591

作者:Hung Q. Vo,Pengyu Yuan,Tiancheng He,Stephen T. C. Wong,Hien V. Nguyen 机构:Electrical and Computer Engineering, University of Houston, Texas, USA, Systems Medicine and, Biomedical Engineering, Houston Methodist, Stephen T.C. Wong 摘要:准确的乳腺病变风险评估可以显著减少不必要的活检,并帮助医生决定最佳治疗方案。大多数现有的计算机辅助系统仅依靠乳房X光片特征来对乳腺病变进行分类。虽然这种方法很方便,但它没有充分利用临床报告中的有用信息来实现最佳性能。与单独使用乳房X光片相比,临床特征会显著改善乳腺病变分类吗?如何处理因医疗实践中的差异而导致的临床信息缺失?结合乳房X光片和临床特征的最佳方法是什么?迫切需要进行系统研究,以解决这些基本问题。本文研究了几种基于特征串联、交叉注意和共同注意的多模式深度网络,以结合乳腺X线摄影和分类临床变量。我们发现,所提出的架构显著提高了病变分类性能(ROC曲线下的平均面积从0.89增加到0.94)。当临床变量缺失时,我们也对模型进行评估。 摘要:Accurate breast lesion risk estimation can significantly reduce unnecessary biopsies and help doctors decide optimal treatment plans. Most existing computer-aided systems rely solely on mammogram features to classify breast lesions. While this approach is convenient, it does not fully exploit useful information in clinical reports to achieve the optimal performance. Would clinical features significantly improve breast lesion classification compared to using mammograms alone? How to handle missing clinical information caused by variation in medical practice? What is the best way to combine mammograms and clinical features? There is a compelling need for a systematic study to address these fundamental questions. This paper investigates several multimodal deep networks based on feature concatenation, cross-attention, and co-attention to combine mammograms and categorical clinical variables. We show that the proposed architectures significantly increase the lesion classification performance (average area under ROC curves from 0.89 to 0.94). We also evaluate the model when clinical variables are missing.
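论文比较的最简单融合方式是特征串联。下面是一个假设性的PyTorch草图:影像特征与嵌入后的临床分类变量拼接,并用掩码置零缺失的临床变量;各维度大小均为示例假设,并非论文的具体结构。

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """特征串联基线:乳腺X线影像特征 + 分类临床变量(维度均为假设)。"""
    def __init__(self, img_dim=512, n_clin=8, clin_emb=16, n_classes=2):
        super().__init__()
        self.clin = nn.Sequential(nn.Linear(n_clin, clin_emb), nn.ReLU())
        self.head = nn.Linear(img_dim + clin_emb, n_classes)

    def forward(self, img_feat, clin_feat, clin_mask):
        # clin_mask: 1=已观测, 0=缺失;置零是最简单的缺失临床信息处理方式
        c = self.clin(clin_feat * clin_mask)
        return self.head(torch.cat([img_feat, c], dim=1))
```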

分割|语义相关(8篇)

【1】 Exploring Biases and Prejudice of Facial Synthesis via Semantic Latent Space 标题:基于语义潜在空间的人脸合成偏见与成见探究 链接:https://arxiv.org/abs/2108.10265

作者:Xuyang Shen,Jo Plested,Sabrina Caldwell,Tom Gedeon 机构:Research School of Computer Science, the Australian National University, Canberra, Australia 备注:8 pages, 11 figures; accepted by IJCNN2021 摘要:深度学习(DL)模型被广泛应用于提供更方便、更智能的生活。然而,有偏见的算法会对我们产生负面影响。例如,被有偏见的算法针对的群体会感到受到不公平对待,甚至害怕这些偏见带来的负面后果。这项工作针对有偏见的生成模型的行为,找出偏见的成因并加以消除。我们可以(如预期的)得出结论:有偏见的数据会导致人脸正面化模型做出有偏见的预测。改变训练数据中男性和女性面孔的比例会对测试数据上的行为产生实质性影响:我们发现,看似显而易见的50:50比例并不是在该数据集上减少针对女性面孔偏见行为的最佳选择,其无偏率为71%,而我们得到的最高无偏率为84%。生成失败和生成错误性别的人脸是这些模型的两种偏见行为。此外,只有人脸正面化模型中的某些层容易受到有偏见的数据集的影响。优化正面化模型中生成器的跳跃连接可以减少模型的偏差。我们的结论是:如果没有无限大的数据集,很可能无法消除所有训练偏差;但我们的实验表明,偏差可以被减少并量化。我们相信,仅次于完美无偏预测器的最佳选择,是一个将剩余已知偏差最小化的预测器。 摘要:Deep learning (DL) models are widely used to provide a more convenient and smarter life. However, biased algorithms will negatively influence us. For instance, groups targeted by biased algorithms will feel unfairly treated and even fearful of negative consequences of these biases. This work targets biased generative models' behaviors, identifying the cause of the biases and eliminating them. We can (as expected) conclude that biased data causes biased predictions of face frontalization models. Varying the proportions of male and female faces in the training data can have a substantial effect on behavior on the test data: we found that the seemingly obvious choice of 50:50 proportions was not the best for this dataset to reduce biased behavior on female faces, which was 71% unbiased as compared to our top unbiased rate of 84%. Failure in generation and generating incorrect gender faces are two behaviors of these models. In addition, only some layers in face frontalization models are vulnerable to biased datasets. Optimizing the skip-connections of the generator in face frontalization models can make models less biased. We conclude that it is likely to be impossible to eliminate all training bias without an unlimited size dataset, and our experiments show that the bias can be reduced and quantified. We believe the next best to a perfect unbiased predictor is one that has minimized the remaining known bias.

【2】 Tracked 3D Ultrasound and Deep Neural Network-based Thyroid Segmentation reduce Interobserver Variability in Thyroid Volumetry 标题:基于跟踪三维超声和深度神经网络的甲状腺分割降低甲状腺体积测量中的观察者间变异性 链接:https://arxiv.org/abs/2108.10118

作者:Markus Krönke,Christine Eilers,Desislava Dimova,Melanie Köhler,Gabriel Buschner,Lilit Mirzojan,Lemonia Konstantinidou,Marcus R. Makowski,James Nagarajah,Nassir Navab,Wolfgang Weber,Thomas Wendler 机构:Department of Nuclear Medicine, School of Medicine, Technical University of Munich, Department of Radiology, School of Medicine, Technical University of Munich, Munich, Chair for Computer Aided Medical Procedures and Augmented Reality, Department of 备注:7 figures, 19 pages, under review 摘要:背景:甲状腺容量测定在甲状腺疾病的诊断、治疗和监测中至关重要。然而,传统的二维超声甲状腺容积测定法高度依赖于操作者。本研究比较了2D超声和跟踪3D超声与基于深度神经网络的甲状腺自动分割,包括观察者之间和观察者内部的变异性、时间和准确性。体积参考值为MRI。方法:对28名健康志愿者进行二维、三维超声扫描及MRI检查。三位经验不同(分别为6年、4年和1年)的医师(MD 1、2、3)对每位志愿者进行了三次二维超声和三次跟踪三维超声扫描。在2D扫描中,甲状腺叶体积采用椭球体公式计算。卷积深度神经网络(CNN)自动分割3D甲状腺叶。在MRI(T1 VIBE序列)上,由经验丰富的医生手动分割甲状腺。结果:CNN经过训练,Dice得分为0.94。比较两种MDs的观察者间变异性显示,2D和3D的平均差异分别为0.58 ml至0.52 ml(MD1对2)、-1.33 ml至-0.17 ml(MD1对3)和-1.89 ml至-0.70 ml(MD2对3)。配对样本t检验显示,2D有两组比较存在显著差异,而3D没有。二维和三维超声的观察者内变异性相似。通过配对样本t检验比较超声体积和MRI体积,发现所有MDs的2D体积测定结果存在显著差异,而3D超声体积测定结果无显著差异。三维超声的采集时间明显缩短。结论:跟踪三维超声结合CNN分割显著降低了观察者之间甲状腺容积测量的变异性,并在较短的采集时间内提高了测量的准确性。 摘要:Background: Thyroid volumetry is crucial in diagnosis, treatment and monitoring of thyroid diseases. However, conventional thyroid volumetry with 2D ultrasound is highly operator-dependent. This study compares 2D ultrasound and tracked 3D ultrasound with an automatic thyroid segmentation based on a deep neural network regarding inter- and intraobserver variability, time and accuracy. Volume reference was MRI. Methods: 28 healthy volunteers were scanned with 2D and 3D ultrasound as well as by MRI. Three physicians (MD 1, 2, 3) with different levels of experience (6, 4 and 1 a) performed three 2D ultrasound and three tracked 3D ultrasound scans on each volunteer. In the 2D scans the thyroid lobe volumes were calculated with the ellipsoid formula. A convolutional deep neural network (CNN) segmented the 3D thyroid lobes automatically. On MRI (T1 VIBE sequence) the thyroid was manually segmented by an experienced medical doctor. Results: The CNN was trained to obtain a dice score of 0.94. The interobserver variability comparing two MDs showed mean differences for 2D and 3D respectively of 0.58 ml to 0.52 ml (MD1 vs. 2), -1.33 ml to -0.17 ml (MD1 vs. 3) and -1.89 ml to -0.70 ml (MD2 vs. 3). Paired samples t-tests showed significant differences in two comparisons for 2D and none for 3D. Intraobserver variability was similar for 2D and 3D ultrasound. Comparison of ultrasound volumes and MRI volumes by paired samples t-tests showed a significant difference for the 2D volumetry of all MDs, and no significant difference for 3D ultrasound. Acquisition time was significantly shorter for 3D ultrasound. Conclusion: Tracked 3D ultrasound combined with a CNN segmentation significantly reduces interobserver variability in thyroid volumetry and increases the accuracy of the measurements with shorter acquisition times.
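文中二维超声采用椭球体公式计算甲状腺叶体积,即 V = π/6 × 长 × 宽 × 深(约0.524的校正系数)。下面是这一公式的直接实现;示例尺寸为虚构数值,仅作演示。

```python
import math

def lobe_volume_ml(length_cm, width_cm, depth_cm):
    # 传统2D甲状腺体积测量的椭球近似,pi/6 ≈ 0.524 为标准校正系数
    return math.pi / 6.0 * length_cm * width_cm * depth_cm

# 左右两叶体积相加(示例尺寸为虚构数值,单位cm)
total = lobe_volume_ml(4.5, 1.8, 1.6) + lobe_volume_ml(4.2, 1.7, 1.5)
print(f"甲状腺总体积 ≈ {total:.2f} ml")
```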

【3】 SegMix: Co-occurrence Driven Mixup for Semantic Segmentation and Adversarial Robustness 标题:SegMix:用于语义分割和对抗性鲁棒性的共现驱动混合 链接:https://arxiv.org/abs/2108.09929

作者:Md Amirul Islam,Matthew Kowal,Konstantinos G. Derpanis,Neil D. B. Bruce 备注:Under submission at IJCV (BMVC 2020 Extension). arXiv admin note: substantial text overlap with arXiv:2008.05667 摘要:在本文中,我们提出了一种训练卷积神经网络的策略,以有效地解决与整个网络中的分类间信息相关的竞争假设所产生的干扰。前提是基于特征绑定的概念,特征绑定定义为跨空间和网络中各层的激活被成功集成以获得正确推理决策的过程。在我们的工作中,这是通过基于(i)类别聚类或(ii)类别共现可能性的混合图像来完成稠密图像标记的任务。然后,我们训练一个特征绑定网络,同时分割和分离混合图像。随后的特征去噪以抑制噪声激活,揭示了额外的期望特性和高成功率预测。通过这一过程,我们揭示了一种不同于以往任何方法的通用机制,用于提高基本分割和显著性网络的性能,同时提高对抗性攻击的鲁棒性。 摘要:In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. The premise is based on the notion of feature binding, which is defined as the process by which activations spread across space and layers in the network are successfully integrated to arrive at a correct inference decision. In our work, this is accomplished for the task of dense image labelling by blending images based on (i) categorical clustering or (ii) the co-occurrence likelihood of categories. We then train a feature binding network which simultaneously segments and separates the blended images. Subsequent feature denoising to suppress noisy activations reveals additional desirable properties and high degrees of successful predictions. Through this process, we reveal a general mechanism, distinct from any prior methods, for boosting the performance of the base segmentation and saliency network while simultaneously increasing robustness to adversarial attacks.

【4】 Self-Regulation for Semantic Segmentation 标题:语义分割的自我调节 链接:https://arxiv.org/abs/2108.09702

作者:Zhang Dong,Zhang Hanwang,Tang Jinhui,Hua Xiansheng,Sun Qianru 机构:Self-Regulation for Semantic SegmentationDong Zhang 1Hanwang Zhang 2Jinhui Tang 1Xian-Sheng Hua 3Qianru Sun 4 1School of Computer Science and Engineering, Nanjing University of Science and Technology 2Nanyang Technological University 3Damo Academy 备注:ICCV 2021 摘要:在本文中,我们寻找语义分割(SS)中两个主要失败案例的原因:1)缺少小对象或小对象部分,2)将大对象的小部分错误标记为错误的类。我们有一个有趣的发现:失败1是由于细节特征的使用不足,失败2是由于视觉上下文的使用不足。为了帮助模型更好地进行权衡,我们引入了几种自调节(SR)损失来训练SS神经网络。所谓"自我",我们的意思是损失来自模型本身,而不使用任何额外的数据或监督。通过应用SR损失,深层特征由浅层特征调节,以保留更多细节;同时,浅层分类logits由深层logits调节,以获取更多的语义。我们在弱监督和完全监督的SS任务上进行了大量实验,结果表明我们的方法始终优于基线。我们还验证了SR损失在各种最先进的SS模型(如SPGNet和OCRNet)中易于实现,在训练期间产生的计算开销很小,而在测试中则没有。 摘要:In this paper, we seek reasons for the two major failure cases in Semantic Segmentation (SS): 1) missing small objects or minor object parts, and 2) mislabeling minor parts of large objects as wrong classes. We have an interesting finding that Failure-1 is due to the underuse of detailed features and Failure-2 is due to the underuse of visual contexts. To help the model learn a better trade-off, we introduce several Self-Regulation (SR) losses for training SS neural networks. By "self", we mean that the losses are from the model per se without using any additional data or supervision. By applying the SR losses, the deep layer features are regulated by the shallow ones to preserve more details; meanwhile, shallow layer classification logits are regulated by the deep ones to capture more semantics. We conduct extensive experiments on both weakly and fully supervised SS tasks, and the results show that our approach consistently surpasses the baselines. We also validate that SR losses are easy to implement in various state-of-the-art SS models, e.g., SPGNet and OCRNet, incurring little computational overhead during training and none for testing.
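按摘要的描述,SR损失可理解为两个方向的自我调节:深层特征向浅层特征看齐以保留细节,浅层logits向深层logits看齐以获取语义。下面是对这一思想的示意性PyTorch草图;通道数匹配、温度T等均为假设,并非论文原始公式。

```python
import torch.nn.functional as F

def sr_losses(shallow_feat, deep_feat, shallow_logits, deep_logits, T=2.0):
    """假设shallow_feat与deep_feat已投影到相同通道数;T为假设的蒸馏温度。"""
    # 深层特征受浅层调节:上采样后做L2,以保留更多细节
    deep_up = F.interpolate(deep_feat, size=shallow_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
    l_detail = F.mse_loss(deep_up, shallow_feat.detach())
    # 浅层logits受深层调节:软化后的KL散度,以获取更多语义
    l_sem = F.kl_div(F.log_softmax(shallow_logits / T, dim=1),
                     F.softmax(deep_logits.detach() / T, dim=1),
                     reduction="batchmean") * T * T
    return l_detail, l_sem
```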

【5】 A Technical Survey and Evaluation of Traditional Point Cloud Clustering Methods for LiDAR Panoptic Segmentation 标题:LiDAR全景图像分割中传统点云聚类方法的技术综述与评价 链接:https://arxiv.org/abs/2108.09522

作者:Yiming Zhao,Xiao Zhang,Xinming Huang 机构:Worcester Polytechnic Institute, Institute Rd, Worcester, MA, USA 备注:1. A hybrid SOTA solution. 2. Accept by ICCV2021 Workshop on Traditional Computer Vision in the Age of Deep Learning. 3. Code: this https URL 摘要:激光雷达全景分割是一项新提出的自动驾驶技术任务。与流行的端到端深度学习解决方案相比,我们提出了一种混合方法:使用现有的语义分割网络提取语义信息,并使用传统的激光雷达点云聚类算法分割每个实例对象。我们在SemanticKITTI数据集的全景分割排行榜上取得了超越所有已发布端到端深度学习方案的最新性能,以此论证基于几何的传统聚类算法值得重视。据我们所知,我们是第一个尝试使用聚类算法进行点云全景分割的人。因此,在本文中,我们没有研究新的模型,而是通过实现四种典型的聚类方法进行了全面的技术调查,并在基准上报告了它们的性能。这四种聚类方法最具代表性,且具有实时运行速度。在本文中,它们用C++实现,然后被封装为Python函数,与现有的深度学习框架无缝集成。我们为可能对此问题感兴趣的同行研究人员发布代码。 摘要:LiDAR panoptic segmentation is a newly proposed technical task for autonomous driving. In contrast to popular end-to-end deep learning solutions, we propose a hybrid method with an existing semantic segmentation network to extract semantic information and a traditional LiDAR point cloud cluster algorithm to split each instance object. We argue geometry-based traditional clustering algorithms are worth being considered by showing a state-of-the-art performance among all published end-to-end deep learning solutions on the panoptic segmentation leaderboard of the SemanticKITTI dataset. To our best knowledge, we are the first to attempt the point cloud panoptic segmentation with clustering algorithms. Therefore, instead of working on new models, we give a comprehensive technical survey in this paper by implementing four typical cluster methods and report their performances on the benchmark. Those four cluster methods are the most representative ones with real-time running speed. They are implemented with C++ in this paper and then wrapped as a python function for seamless integration with the existing deep learning frameworks. We release our code for peer researchers who might be interested in this problem.
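这类"语义分割网络+传统聚类拆分实例"的混合流程,可以用现成的欧氏聚类简单示意。此处以DBSCAN代替文中实现的四种聚类方法,eps等参数为假设值:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_instances(points, sem_labels, thing_class, eps=0.5, min_pts=10):
    """points: (N,3) 点云坐标;sem_labels: 任一语义分割网络输出的 (N,) 标签。
    对某个thing类的点做欧氏聚类,得到实例id(-1表示噪声或其他类)。"""
    mask = sem_labels == thing_class
    inst = np.full(points.shape[0], -1, dtype=int)
    if mask.sum() >= min_pts:
        inst[mask] = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(points[mask])
    return inst
```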

【6】 Palmira: A Deep Deformable Network for Instance Segmentation of Dense and Uneven Layouts in Handwritten Manuscripts 标题:Palmira:用于手写手稿密集且不均匀版面实例分割的深度可变形网络 链接:https://arxiv.org/abs/2108.09436

作者:Prema Satish Sharan,Sowmya Aitha,Amandeep Kumar,Abhishek Trivedi,Aaron Augustine,Ravi Kiran Sarvadevabhatla 机构:(�)[,−,−,−,], Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad – , INDIA 备注:Accepted at ICDAR-21. Watch teaser video this https URL , code and pretrained models this https URL , project page this https URL 摘要:手写文档通常具有密集和不均匀布局的特点。尽管取得了一些进展,基于标准深度网络的语义布局分割方法对跨语义区域的复杂变形并不鲁棒。这一现象在低资源印度棕榈叶手稿领域尤为明显。为了解决这个问题,我们首先介绍了Indiscapes2,这是一个新的带有语义版面标注的大规模多样化印度手稿数据集。Indiscapes2包含来自四种不同历史馆藏的文档,比其前身Indiscapes大150%。我们还提出了一种新的深度网络Palmira,用于手写手稿区域的鲁棒、变形感知的实例分割。我们还报告了Hausdorff距离及其变体作为边界感知性能度量。我们的实验表明,Palmira提供了稳健的布局,优于强基线方法和烧蚀变体。我们还包括阿拉伯语、东南亚和希伯来语历史手稿的定性结果,以展示Palmira的泛化能力。 摘要:Handwritten documents are often characterized by dense and uneven layout. Despite advances, standard deep network based approaches for semantic layout segmentation are not robust to complex deformations seen across semantic regions. This phenomenon is especially pronounced for the low-resource Indic palm-leaf manuscript domain. To address the issue, we first introduce Indiscapes2, a new large-scale diverse dataset of Indic manuscripts with semantic layout annotations. Indiscapes2 contains documents from four different historical collections and is 150% larger than its predecessor, Indiscapes. We also propose a novel deep network Palmira for robust, deformation-aware instance segmentation of regions in handwritten manuscripts. We also report Hausdorff distance and its variants as a boundary-aware performance measure. Our experiments demonstrate that Palmira provides robust layouts, outperforms strong baseline approaches and ablative variants. We also include qualitative results on Arabic, South-East Asian and Hebrew historical manuscripts to showcase the generalization capability of Palmira.
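文中将Hausdorff距离及其变体作为边界感知的性能度量。对称Hausdorff距离可用SciPy直接计算,如下(输入为两组边界点坐标):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(pred_boundary, gt_boundary):
    """pred_boundary, gt_boundary: (N,2) 与 (M,2) 的边界点集。"""
    d_pg = directed_hausdorff(pred_boundary, gt_boundary)[0]
    d_gp = directed_hausdorff(gt_boundary, pred_boundary)[0]
    return max(d_pg, d_gp)   # 取两个方向的最大值得到对称Hausdorff距离
```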

【7】 Efficient Medical Image Segmentation Based on Knowledge Distillation 标题:基于知识蒸馏的高效医学图像分割 链接:https://arxiv.org/abs/2108.09987

作者:Dian Qin,Jiajun Bu,Zhe Liu,Xin Shen,Sheng Zhou,Jingjun Gu,Zhijua Wang,Lei Wu,Huifen Dai 备注:Accepted by IEEE TMI, Code Available 摘要:在应用卷积神经网络实现医学图像分割问题的更精确预测结果方面,最近取得了一些进展。然而,现有方法的成功在很大程度上依赖于巨大的计算复杂度和海量存储,这在现实场景中是不切实际的。为了解决这个问题,我们提出了一种有效的架构,通过从训练有素的医学图像分割网络中蒸馏知识来训练另一个轻量级网络。这种体系结构使轻量级网络能够在保持其运行效率的同时显著提高分割能力。我们进一步设计了一个新的专用于医学图像分割的蒸馏模块,将语义区域信息从教师网络传输到学生网络。它迫使学生网络模拟从不同组织区域计算的表征差异程度。该模块避免了在处理医学成像时遇到的边界模糊问题,而是对每个语义区域的内部信息进行编码以进行传输。得益于我们的模块,轻量级网络在我们的实验中可以得到高达32.6%的改进,同时在推理阶段保持其可移植性。整个结构已在两个广泛接受的公共CT数据集LiTS17和KiTS19上得到验证。我们证明,在需要相对较高的运行速度和较低的存储使用率的场景中,通过我们的方法蒸馏得到的轻量级网络具有不可忽略的价值。 摘要:Recent advances have been made in applying convolutional neural networks to achieve more precise prediction results for medical image segmentation problems. However, the success of existing methods has highly relied on huge computational complexity and massive storage, which is impractical in the real-world scenario. To deal with this problem, we propose an efficient architecture by distilling knowledge from well-trained medical image segmentation networks to train another lightweight network. This architecture empowers the lightweight network to get a significant improvement on segmentation capability while retaining its runtime efficiency. We further devise a novel distillation module tailored for medical image segmentation to transfer semantic region information from teacher to student network. It forces the student network to mimic the extent of difference of representations calculated from different tissue regions. This module avoids the ambiguous boundary problem encountered when dealing with medical imaging but instead encodes the internal information of each semantic region for transferring. Benefited from our module, the lightweight network could receive an improvement of up to 32.6% in our experiment while maintaining its portability in the inference phase. The entire structure has been verified on two widely accepted public CT datasets LiTS17 and KiTS19. We demonstrate that a lightweight network distilled by our method has non-negligible value in the scenario which requires relatively high operating speed and low storage usage.
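摘要提到蒸馏模块让学生网络"模仿不同组织区域表示之间的差异程度"。下面是对这一思想的一种假设性解读:按语义区域池化特征,再匹配师生网络的区域两两差异结构;并非论文的原始损失定义。

```python
import torch
import torch.nn.functional as F

def region_affinity_distill(feat_s, feat_t, region_mask, n_regions):
    """feat_s/feat_t: 学生/教师特征 (B,C,H,W),假设通道数一致,
    且 region_mask (B,H,W) 已对齐到特征分辨率。仅为示意性实现。"""
    def region_means(feat):
        means = []
        for r in range(n_regions):
            m = (region_mask == r).float().unsqueeze(1)            # (B,1,H,W)
            means.append((feat * m).sum((2, 3)) / m.sum((2, 3)).clamp(min=1))
        return torch.stack(means, dim=1)                           # (B,R,C)
    ms, mt = region_means(feat_s), region_means(feat_t)
    diff_s = ms.unsqueeze(2) - ms.unsqueeze(1)                     # 区域两两差异
    diff_t = mt.unsqueeze(2) - mt.unsqueeze(1)
    return F.mse_loss(F.normalize(diff_s, dim=-1),
                      F.normalize(diff_t, dim=-1).detach())
```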

【8】 Systematic Clinical Evaluation of A Deep Learning Method for Medical Image Segmentation: Radiosurgery Application 标题:医学图像分割深度学习方法的系统临床评价:放射外科应用 链接:https://arxiv.org/abs/2108.09535

作者:Boris Shirokikh,Alexandra Dalechina,Alexey Shevtsov,Egor Krivov,Valery Kostjuchenko,Amayak Durgaryan,Mikhail Galkin,Andrey Golanov,Mikhail Belyaev 摘要:我们在三维医学图像分割任务中系统地评估了一种深度学习(DL)方法。我们的分割方法集成到放射外科治疗过程中,直接影响临床工作流程。通过我们的方法,我们解决了手工分割的相对缺点:评分者间轮廓可变性高,轮廓过程耗时长。对现有评估的主要扩展是仔细和详细的分析,可以进一步推广到其他医学图像分割任务中。首先,我们分析了评分员间检测协议的变化。我们发现,分割模型将检测不一致的比率从0.162降低到0.085(p<0.05)。其次,我们表明,该模型将评分者之间的轮廓一致性从0.845提高到了0.871(p<0.05)。第三,我们发现该模型在1.6到2.0倍之间加速了描绘过程(p<0.05)。最后,我们设计了临床实验的设置,以排除或估计评估偏差,从而保持结果的显著性。除了临床评价外,我们还总结了建立一个有效的基于DL的三维医学图像分割模型的直觉和实践思路。 摘要:We systematically evaluate a Deep Learning (DL) method in a 3D medical image segmentation task. Our segmentation method is integrated into the radiosurgery treatment process and directly impacts the clinical workflow. With our method, we address the relative drawbacks of manual segmentation: high inter-rater contouring variability and high time consumption of the contouring process. The main extension over the existing evaluations is the careful and detailed analysis that could be further generalized on other medical image segmentation tasks. Firstly, we analyze the changes in the inter-rater detection agreement. We show that the segmentation model reduces the ratio of detection disagreements from 0.162 to 0.085 (p < 0.05). Secondly, we show that the model improves the inter-rater contouring agreement from 0.845 to 0.871 surface Dice Score (p < 0.05). Thirdly, we show that the model accelerates the delineation process in between 1.6 and 2.0 times (p < 0.05). Finally, we design the setup of the clinical experiment to either exclude or estimate the evaluation biases, thus preserve the significance of the results. Besides the clinical evaluation, we also summarize the intuitions and practical ideas for building an efficient DL-based model for 3D medical image segmentation.
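文中多处结论依赖配对样本检验(p < 0.05)。下面用SciPy演示同一组受试者在两种条件下测量值的配对t检验;数据为随机生成的示意数值,与论文结果无关。

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
manual = rng.normal(10.0, 2.0, size=30)                 # 假设:人工勾画指标
assisted = manual - rng.normal(0.8, 0.5, size=30)       # 假设:模型辅助后的指标

t, p = stats.ttest_rel(assisted, manual)                # 配对样本t检验
print(f"t = {t:.2f}, p = {p:.4f}")                      # p < 0.05 即差异显著
```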

Zero/Few Shot|迁移|域适配|自适应(5篇)

【1】 Domain Adaptation for Underwater Image Enhancement 标题:一种用于水下图像增强的域自适应算法 链接:https://arxiv.org/abs/2108.09650

作者:Zhengyong Wang,Liquan Shen,Mei Yu,Kun Wang,Yufei Lin,Mai Xu 摘要:最近,基于学习的算法在水下图像增强中表现出了令人印象深刻的性能。其中大多数方法依赖合成数据进行训练,并取得了优异的成绩。然而,这些方法忽略了合成数据和真实数据之间的显著域差距(即域间差距),因此基于合成数据训练的模型通常无法很好地推广到真实的水下场景。此外,复杂多变的水下环境也导致真实数据本身之间存在巨大的分布差距(即域内差距)。然而,几乎没有研究关注这个问题,因此他们的技术经常在各种真实图像上产生视觉上令人不快的伪影和颜色失真。基于这些观察结果,我们提出了一种新的两相水下域自适应网络(TUDA),以同时最小化域间和域内间隙。具体来说,第一阶段设计了一个新的双对准网络,包括一个用于增强输入图像真实感的平移部分,然后是一个增强部分。通过联合对抗性学习,将图像级和特征级自适应分为两部分,网络可以更好地跨域建立不变性,从而弥合域间的鸿沟。在第二阶段,我们根据增强图像的评估质量对真实数据进行简单的硬分类,其中嵌入了基于等级的水下质量评估方法。通过利用从排名中获得的隐含质量信息,该方法可以更准确地评估增强图像的感知质量。然后利用容易部分的伪标签,进行容易-硬适应技术,以有效地减小容易和硬样本之间的域内间隙。 摘要:Recently, learning-based algorithms have shown impressive performance in underwater image enhancement. Most of them resort to training on synthetic data and achieve outstanding performance. However, these methods ignore the significant domain gap between the synthetic and real data (i.e., interdomain gap), and thus the models trained on synthetic data often fail to generalize well to real underwater scenarios. Furthermore, the complex and changeable underwater environment also causes a great distribution gap among the real data itself (i.e., intra-domain gap). However, almost no research focuses on this problem and thus their techniques often produce visually unpleasing artifacts and color distortions on various real images. Motivated by these observations, we propose a novel Two-phase Underwater Domain Adaptation network (TUDA) to simultaneously minimize the inter-domain and intra-domain gap. Concretely, a new dual-alignment network is designed in the first phase, including a translation part for enhancing realism of input images, followed by an enhancement part. With performing image-level and feature-level adaptation in two parts by jointly adversarial learning, the network can better build invariance across domains and thus bridge the inter-domain gap. In the second phase, we perform an easy-hard classification of real data according to the assessed quality of enhanced images, where a rank-based underwater quality assessment method is embedded. By leveraging implicit quality information learned from rankings, this method can more accurately assess the perceptual quality of enhanced images. Using pseudo labels from the easy part, an easy-hard adaptation technique is then conducted to effectively decrease the intra-domain gap between easy and hard samples.

【2】 Robust Ensembling Network for Unsupervised Domain Adaptation 标题:用于无监督领域自适应的鲁棒集成网络 链接:https://arxiv.org/abs/2108.09473

作者:Han Sun,Lei Lin,Ningzhong Liu,Huiyu Zhou 机构: Nanjing University of Aeronautics and Astronautics, Jiangsu Nanjing, China, School of Informatics, University of Leicester, Leicester LE,RH, U.K 备注:14 pages, 4 figures. accepted by PRICA-2021. code: this https URL 摘要:最近,为了解决无监督领域自适应(UDA)问题,人们提出了大量的研究来实现可迁移模型。其中,最常用的方法是对抗域自适应,它可以缩短源域和目标域之间的距离。尽管对抗式学习非常有效,但它仍然会导致网络的不稳定性和混淆类别信息的缺点。在本文中,我们提出了一种用于UDA的鲁棒集成网络(REN),该网络应用鲁棒的时间集成教师网络来学习全局信息以进行域迁移。具体地说,REN主要包括教师网络和学生网络,其执行标准域适应训练并更新教师网络的权重。此外,我们还提出了一种双网络条件对抗损失来提高鉴别器的能力。最后,为了提高学生网络的基本能力,我们利用一致性约束来平衡学生网络和教师网络之间的误差。在几个UDA数据集上的大量实验结果表明,与其他最先进的UDA算法相比,我们的模型是有效的。 摘要:Recently, in order to address the unsupervised domain adaptation (UDA) problem, extensive studies have been proposed to achieve transferrable models. Among them, the most prevalent method is adversarial domain adaptation, which can shorten the distance between the source domain and the target domain. Although adversarial learning is very effective, it still leads to the instability of the network and the drawbacks of confusing category information. In this paper, we propose a Robust Ensembling Network (REN) for UDA, which applies a robust time ensembling teacher network to learn global information for domain transfer. Specifically, REN mainly includes a teacher network and a student network, which performs standard domain adaptation training and updates weights of the teacher network. In addition, we also propose a dual-network conditional adversarial loss to improve the ability of the discriminator. Finally, for the purpose of improving the basic ability of the student network, we utilize the consistency constraint to balance the error between the student network and the teacher network. Extensive experimental results on several UDA datasets have demonstrated the effectiveness of our model by comparing with other state-of-the-art UDA algorithms.
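"时间集成教师网络"通常指教师权重为学生权重的指数滑动平均(EMA),并以一致性约束拉近两者输出。下面是这一常见做法的最小草图;alpha为假设的动量值,细节未必与论文一致。

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """教师参数 = alpha * 教师 + (1 - alpha) * 学生,即时间集成。"""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1 - alpha)

def consistency_loss(student_logits, teacher_logits):
    # 一致性约束:让学生预测与缓慢更新的教师预测保持一致
    return F.mse_loss(student_logits.softmax(dim=1),
                      teacher_logits.softmax(dim=1).detach())
```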

【3】 Learned Image Coding for Machines: A Content-Adaptive Approach 标题:面向机器的学习图像编码:一种内容自适应方法 链接:https://arxiv.org/abs/2108.09992

作者:Nam Le,Honglei Zhang,Francesco Cricri,Ramin Ghaznavi-Youvalari,Hamed Rezazadegan Tavakoli,Esa Rahtu 机构:†Nokia Technologies, ∗Tampere University, Tampere, Finland 备注:None 摘要:今天,根据思科年度互联网报告(2018-2023),互联网流量增长最快的类别是机器对机器通信。特别是,图像和视频的机器对机器通信代表了一个新的挑战,并在数据压缩方面开辟了新的前景。一种可能的解决方法是根据机器消耗的使用情况,调整当前针对人类的图像和视频编码标准。另一种方法包括为机器对机器通信开发全新的压缩范例和体系结构。在本文中,我们将重点放在图像压缩上,并提出了一种推理时间内容自适应微调方案,该方案优化了端到端学习图像编解码器的潜在表示,旨在提高机器消耗的压缩效率。进行的实验表明,与预训练图像编解码器相比,我们的在线微调带来了-3.66%的平均比特率节省(BD率)。特别是,在低比特率点,我们提出的方法可以显著节省-9.85%的比特率。总体而言,我们经过预训练和微调的系统在最先进的图像/视频编解码器通用视频编码(VVC)上实现了-30.54%的BD速率。 摘要:Today, according to the Cisco Annual Internet Report (2018-2023), the fastest-growing category of Internet traffic is machine-to-machine communication. In particular, machine-to-machine communication of images and videos represents a new challenge and opens up new perspectives in the context of data compression. One possible solution approach consists of adapting current human-targeted image and video coding standards to the use case of machine consumption. Another approach consists of developing completely new compression paradigms and architectures for machine-to-machine communications. In this paper, we focus on image compression and present an inference-time content-adaptive finetuning scheme that optimizes the latent representation of an end-to-end learned image codec, aimed at improving the compression efficiency for machine-consumption. The conducted experiments show that our online finetuning brings an average bitrate saving (BD-rate) of -3.66% with respect to our pretrained image codec. In particular, at low bitrate points, our proposed method results in a significant bitrate saving of -9.85%. Overall, our pretrained-and-then-finetuned system achieves -30.54% BD-rate over the state-of-the-art image/video codec Versatile Video Coding (VVC).
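摘要中的推理时内容自适应微调,可理解为在固定编解码器权重的前提下,对单张图像的潜在表示做若干步梯度优化。下面是高度简化的示意:decoder、task_net均为假设的可调用对象,z的L1项只是码率代价的粗略替身,并非论文的真实率估计。

```python
import torch

def finetune_latent(latent, decoder, task_net, steps=100, lr=1e-2, lam=0.01):
    """推理时微调潜在表示:任务损失 + 码率代价(此处用L1粗略替代)。"""
    z = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_hat = decoder(z)                    # 解码得到重建图像
        loss = task_net(x_hat) + lam * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()                         # 面向机器消费优化后的潜在表示
```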

【4】 FEDI: Few-shot learning based on Earth Mover's Distance algorithm combined with deep residual network to identify diabetic retinopathy 标题:FEDI:基于推土机距离算法结合深度残差网络的小样本学习识别糖尿病视网膜病变 链接:https://arxiv.org/abs/2108.09711

作者:Liangrui Pan,Boya Ji,Peng Xi,Xiaoqi Wang,Mitchai Chongcheawchamnan,Shaoliang Peng 机构:College of Computer Scienceand, Electronic Engineering, HunanUniversity, Chang Sha, China, College of Computer Science and, Prince of Songkla University, Songkhla, Thailand, Hunan University 摘要:糖尿病视网膜病变(DR)是糖尿病患者致盲的主要原因。然而,通过眼底诊断DR可以有效延缓失明的发生。鉴于现实情况,在临床实践中很难收集到大量的糖尿病视网膜数据。本文提出了一种基于推土机距离(Earth Mover's Distance)算法与深度残差网络的小样本学习模型,以辅助诊断DR。我们基于39个类别、1000个样本的数据构建小样本学习的训练和验证分类任务,训练深度残差网络,并获得经验最大化的预训练模型。基于预训练模型的权重,推土机距离算法计算图像之间的距离,获得图像之间的相似性,并更新模型参数以提高训练模型的准确性。最后,我们构建测试集上的小样本分类任务以进一步优化模型,最终在糖尿病视网膜测试集的3-way 10-shot任务上达到93.5667%的准确率。有关实验代码和结果,请参考:https://github.com/panliangrui/few-shot-learning-funds. 摘要:Diabetic retinopathy(DR) is the main cause of blindness in diabetic patients. However, DR can easily delay the occurrence of blindness through the diagnosis of the fundus. In view of the reality, it is difficult to collect a large amount of diabetic retina data in clinical practice. This paper proposes a few-shot learning model of a deep residual network based on Earth Mover's Distance algorithm to assist in diagnosing DR. We build training and validation classification tasks for few-shot learning based on 39 categories of 1000 sample data, train deep residual networks, and obtain experience maximization pre-training models. Based on the weights of the pre-trained model, the Earth Mover's Distance algorithm calculates the distance between the images, obtains the similarity between the images, and changes the model's parameters to improve the accuracy of the training model. Finally, the experimental construction of the small sample classification task of the test set to optimize the model further, and finally, an accuracy of 93.5667% on the 3way10shot task of the diabetic retina test set. For the experimental code and results, please refer to: https://github.com/panliangrui/few-shot-learning-funds.
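推土机距离(EMD)度量两个分布之间的最小搬运代价。作为示意,下面用SciPy的一维Wasserstein距离比较两个特征向量的取值直方图;这只是对论文中图像间EMD匹配的粗略替代,bins等参数为假设值。

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_distance(feat_a, feat_b, bins=64):
    """比较两个特征向量的直方图分布,值越小越相似(粗略示意)。"""
    lo = min(feat_a.min(), feat_b.min())
    hi = max(feat_a.max(), feat_b.max())
    ha, _ = np.histogram(feat_a, bins=bins, range=(lo, hi), density=True)
    hb, _ = np.histogram(feat_b, bins=bins, range=(lo, hi), density=True)
    centers = np.linspace(lo, hi, bins)
    return wasserstein_distance(centers, centers,
                                u_weights=ha + 1e-8, v_weights=hb + 1e-8)
```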

【5】 Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform 标题:基于空间自适应特征变换的变速率深度图像压缩 链接:https://arxiv.org/abs/2108.09551

作者:Myungseo Song,Jinyoung Choi,Bohyung Han 机构:ECE & ASRI, Seoul National University, Korea 备注:ICCV 2021 摘要:我们提出了一种基于空间特征变换的通用深度图像压缩网络(SFT arXiv:1804.02815),该网络以源图像和相应的质量图作为输入,并以可变速率生成压缩图像。我们的模型使用单个模型覆盖了广泛的压缩率,该模型由任意像素质量贴图控制。此外,所提出的框架允许我们通过有效地估计特定于编码网络目标任务的优化质量映射,对各种任务(例如分类)执行任务感知图像压缩。这甚至可以通过预先训练的网络实现,而无需学习单独任务的单独模型。与基于多个模型的方法相比,我们的算法实现了出色的率失真折衷,这些模型分别针对多个不同的目标速率进行了优化。在相同的压缩水平下,该方法通过任务感知质量图估计成功地提高了图像分类和文本区域质量保持的性能,而无需额外的模型训练。该代码可在项目网站上获得:https://github.com/micmic123/QmapCompression 摘要:We propose a versatile deep image compression network based on Spatial Feature Transform (SFT arXiv:1804.02815), which takes a source image and a corresponding quality map as inputs and produce a compressed image with variable rates. Our model covers a wide range of compression rates using a single model, which is controlled by arbitrary pixel-wise quality maps. In addition, the proposed framework allows us to perform task-aware image compressions for various tasks, e.g., classification, by efficiently estimating optimized quality maps specific to target tasks for our encoding network. This is even possible with a pretrained network without learning separate models for individual tasks. Our algorithm achieves outstanding rate-distortion trade-off compared to the approaches based on multiple models that are optimized separately for several different target rates. At the same level of compression, the proposed approach successfully improves performance on image classification and text region quality preservation via task-aware quality map estimation without additional model training. The code is available at the project website: https://github.com/micmic123/QmapCompression
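SFT(arXiv:1804.02815)的核心是由条件输入逐像素地预测缩放与平移来调制特征。下面按这一公开定义给出最小实现;卷积结构为假设,quality_map假定已缩放到特征图分辨率。

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """空间特征变换:质量图预测逐位置的scale/shift来调制编码特征。"""
    def __init__(self, ch, cond_ch=1):
        super().__init__()
        self.scale = nn.Sequential(nn.Conv2d(cond_ch, ch, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(ch, ch, 3, padding=1))
        self.shift = nn.Sequential(nn.Conv2d(cond_ch, ch, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, feat, quality_map):
        # feat * (1 + scale) + shift:质量图逐像素控制压缩强度
        return feat * (1 + self.scale(quality_map)) + self.shift(quality_map)
```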

半弱无监督|主动学习|不确定性(5篇)

【1】 A Weakly Supervised Amodal Segmenter with Boundary Uncertainty Estimation 标题:一种具有边界不确定性估计的弱监督非模态分割器 链接:https://arxiv.org/abs/2108.09897

作者:Khoi Nguyen,Sinisa Todorovic 机构:Oregon State University, Corvallis, OR , USA 备注:Accepted to ICCV 2021 摘要:本文讨论弱监督amodal实例分割,其目标是分割可见和遮挡(amodal)对象部分,而训练只提供地面真实可见(模态)分割。在之前的工作之后,我们使用数据操纵在训练图像中生成遮挡,从而训练分割器来预测操纵数据的amodal分割。将训练图像的预测结果作为Mask-RCNN标准训练的伪地面真值,我们将其用于测试图像的amodal实例分割。为了生成伪地面真值,我们设计了一种新的基于边界不确定性估计(ASBU)的Amodal分割器,并做出了两个贡献。首先,之前的工作使用遮挡物的掩码,而我们的ASBU使用遮挡边界作为输入。其次,ASBU估计预测的不确定性图。估计的不确定性使学习规则化,从而在具有高不确定性的区域上产生较低的分割损失。相对于COCOA和KINS数据集的最新技术,ASBU在三项任务中实现了显著的性能改进:amodal实例分割、amodal完成和排序恢复。 摘要:This paper addresses weakly supervised amodal instance segmentation, where the goal is to segment both visible and occluded (amodal) object parts, while training provides only ground-truth visible (modal) segmentations. Following prior work, we use data manipulation to generate occlusions in training images and thus train a segmenter to predict amodal segmentations of the manipulated data. The resulting predictions on training images are taken as the pseudo-ground truth for the standard training of Mask-RCNN, which we use for amodal instance segmentation of test images. For generating the pseudo-ground truth, we specify a new Amodal Segmenter based on Boundary Uncertainty estimation (ASBU) and make two contributions. First, while prior work uses the occluder's mask, our ASBU uses the occlusion boundary as input. Second, ASBU estimates an uncertainty map of the prediction. The estimated uncertainty regularizes learning such that lower segmentation loss is incurred on regions with high uncertainty. ASBU achieves significant performance improvement relative to the state of the art on the COCOA and KINS datasets in three tasks: amodal instance segmentation, amodal completion, and ordering recovery.
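"在高不确定性区域上产生较低的分割损失"可以最直接地实现为按不确定性对逐像素损失降权。下面是这一解读的草图;假设不确定性已归一化到[0,1],与论文的正则化形式未必完全相同。

```python
import torch.nn.functional as F

def uncertainty_weighted_loss(logits, target, uncertainty):
    """logits: (B,C,H,W);target: (B,H,W);uncertainty: (B,H,W),取值[0,1]。"""
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # 逐像素损失
    return ((1.0 - uncertainty) * per_pixel).mean()                # 高不确定性→低权重
```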

【2】 SSR: Semi-supervised Soft Rasterizer for single-view 2D to 3D Reconstruction 标题:SSR:用于单视图2D到3D重建的半监督软光栅化器 链接:https://arxiv.org/abs/2108.09593

作者:Issam Laradji,Pau Rodríguez,David Vazquez,Derek Nowrouzezahrai 机构:McGill University, Element AI, Pau Rodr´ıguez 摘要:最近的工作在学习监督薄弱的对象网格方面取得了重大进展。软光栅化方法仅在视点监控下从二维图像实现精确的三维重建。在这项工作中,我们通过允许这样的3D重建方法利用未标记的图像来进一步减少标记工作。为了获得这些未标记图像的视点,我们建议使用暹罗网络,该网络将两幅图像作为输入,并输出它们是否对应于同一视点。在训练过程中,我们最小化交叉熵损失以最大化预测一对图像是否属于同一视点的概率。为了获得新图像的视点,我们将其与从训练样本中获得的不同视点进行比较,并选择匹配概率最高的视点。最后,我们用最自信的预测视点标记未标记的图像,并训练具有可微光栅化层的深度网络。我们的实验表明,当利用未标记的示例时,即使只标记两个对象也会显著提高ShapeNet的IoU。代码可在https://github.com/IssamLaradji/SSR. 摘要:Recent work has made significant progress in learning object meshes with weak supervision. Soft Rasterization methods have achieved accurate 3D reconstruction from 2D images with viewpoint supervision only. In this work, we further reduce the labeling effort by allowing such 3D reconstruction methods leverage unlabeled images. In order to obtain the viewpoints for these unlabeled images, we propose to use a Siamese network that takes two images as input and outputs whether they correspond to the same viewpoint. During training, we minimize the cross entropy loss to maximize the probability of predicting whether a pair of images belong to the same viewpoint or not. To get the viewpoint of a new image, we compare it against different viewpoints obtained from the training samples and select the viewpoint with the highest matching probability. We finally label the unlabeled images with the most confident predicted viewpoint and train a deep network that has a differentiable rasterization layer. Our experiments show that even labeling only two objects yields significant improvement in IoU for ShapeNet when leveraging unlabeled examples. Code is available at https://github.com/IssamLaradji/SSR.
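摘要中的暹罗网络接收两张图像并判断其是否来自同一视点,训练目标为交叉熵。下面给出一个示意性结构;encoder为任意图像骨干网络,特征维度等均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewpointSiamese(nn.Module):
    """共享编码器 + 两分类头:判断两张图像是否属于同一视点。"""
    def __init__(self, encoder, feat_dim=512):
        super().__init__()
        self.encoder = encoder                    # 假设输出 (B, feat_dim)
        self.head = nn.Linear(feat_dim * 2, 2)    # 同视点 / 不同视点

    def forward(self, img_a, img_b):
        fa, fb = self.encoder(img_a), self.encoder(img_b)
        return self.head(torch.cat([fa, fb], dim=1))

# 训练步骤示意:logits = model(img_a, img_b)
# loss = F.cross_entropy(logits, same_view_labels)
```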

【3】 Unsupervised Local Discrimination for Medical Images 标题:医学图像的无监督局部鉴别 链接:https://arxiv.org/abs/2108.09440

作者:Huai Chen,Renzhen Wang,Jieyu Li,Qing Peng,Deyu Meng,Lisheng Wang 机构: Xi'an Jiaotong University 备注:16 pages, 9 figures 摘要:对比表征学习是一种有效的无监督方法,可以缓解医学图像处理中对昂贵标注数据的需求。最近的工作主要基于实例识别来学习全局特征,而忽略了局部细节,这限制了它们在处理微小解剖结构、组织和病变方面的应用。因此,我们旨在提出一个通用的局部判别框架来学习局部判别特征,从而有效地初始化医学模型,同时系统地研究其实际医学应用。具体而言,基于模态内部结构相似性的共同特性,即相同模态图像之间共享相似结构,提出了一种系统的局部特征学习框架。我们的方法不是基于全局嵌入进行实例比较,而是进行像素级嵌入,重点度量斑块和区域之间的相似性。更精细的对比规则使学习到的表示更通用于分割任务,并通过在彩色眼底和胸部X射线的所有12个下游任务中赢得11个,优于广泛的最新方法。此外,基于模态间形状相似性的特点,即在不同的医学模式下,结构可能具有相似的形状,我们将跨模态的形状先验联合到区域识别中,实现无监督分割。它表明了仅基于其他模态提供的形状描述和区域判别提供的内部模式相似性来分割目标的可行性。最后,我们通过引入中心敏感平均来提高斑块识别的中心敏感能力,实现一次定位,这是斑块识别的一项有效应用。 摘要:Contrastive representation learning is an effective unsupervised method to alleviate the demand for expensive annotated data in medical image processing. Recent work is mainly based on instance-wise discrimination to learn global features, while neglecting local details, which limit their application in processing tiny anatomical structures, tissues and lesions. Therefore, we aim to propose a universal local discrimination framework to learn local discriminative features to effectively initialize medical models, meanwhile, we systematically investigate its practical medical applications. Specifically, based on the common property of intra-modality structure similarity, i.e. similar structures are shared among the same modality images, a systematic local feature learning framework is proposed. Instead of making instance-wise comparisons based on global embedding, our method makes pixel-wise embedding and focuses on measuring similarity among patches and regions. The finer contrastive rule makes the learnt representation more generalized for segmentation tasks and outperform extensive state-of-the-art methods by winning 11 out of all 12 downstream tasks in color fundus and chest X-ray. Furthermore, based on the property of inter-modality shape similarity, i.e. structures may share similar shape although in different medical modalities, we joint across-modality shape prior into region discrimination to realize unsupervised segmentation. It shows the feasibility of segmenting target only based on shape description from other modalities and inner pattern similarity provided by region discrimination. Finally, we enhance the center-sensitive ability of patch discrimination by introducing center-sensitive averaging to realize one-shot landmark localization, this is an effective application for patch discrimination.

【4】 SemiFed: Semi-supervised Federated Learning with Consistency and Pseudo-Labeling 标题:SemiFED:具有一致性和伪标注的半监督联邦学习 链接:https://arxiv.org/abs/2108.09412

作者:Haowen Lin,Jian Lou,Li Xiong,Cyrus Shahabi 机构:University of Southern California, Emory University, Xidian University 摘要:联邦学习使多个客户端(如移动电话和组织)能够协作学习用于预测的共享模型,同时保护本地数据隐私。然而,联邦学习的最新研究和应用假设所有客户端都有完全标记的数据,这在现实环境中是不切实际的。在这项工作中,我们关注一种新的跨孤岛(cross-silo)联邦学习场景,其中每个客户端的数据样本只有部分标记。我们借鉴了半监督学习方法的思想,在半监督学习方法中,尽管可访问的标记示例有限,仍可使用大量未标记数据来提高模型的准确性。我们提出了一个称为SemiFed的新框架,它将两种主要的半监督学习方法统一起来:一致性正则化和伪标记。SemiFed首先应用先进的数据增强技术来实施一致性正则化,然后在训练期间使用模型的预测生成伪标签。SemiFed利用了联邦的优势,因此对于给定的图像,只有当来自不同客户端的多个模型产生高置信度预测并同意相同的标签时,伪标签才有效。在两个图像基准上的大量实验证明了我们的方法在同质和异构数据分布环境下的有效性 摘要:Federated learning enables multiple clients, such as mobile phones and organizations, to collaboratively learn a shared model for prediction while protecting local data privacy. However, most recent research and applications of federated learning assume that all clients have fully labeled data, which is impractical in real-world settings. In this work, we focus on a new scenario for cross-silo federated learning, where data samples of each client are partially labeled. We borrow ideas from semi-supervised learning methods where a large amount of unlabeled data is utilized to improve the model's accuracy despite limited access to labeled examples. We propose a new framework dubbed SemiFed that unifies two dominant approaches for semi-supervised learning: consistency regularization and pseudo-labeling. SemiFed first applies advanced data augmentation techniques to enforce consistency regularization and then generates pseudo-labels using the model's predictions during training. SemiFed takes advantage of the federation so that for a given image, the pseudo-label holds only if multiple models from different clients produce a high-confidence prediction and agree on the same label. Extensive experiments on two image benchmarks demonstrate the effectiveness of our approach under both homogeneous and heterogeneous data distribution settings
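SemiFed保留伪标签的条件是:多个客户端模型均给出高置信度预测且类别一致。下面是这一筛选逻辑的最小实现(置信度阈值为假设值):

```python
import torch

def agreed_pseudo_labels(prob_list, threshold=0.9):
    """prob_list: 各客户端模型对同一批未标记样本的softmax输出列表,每项 (B,C)。
    仅当所有模型都高置信且预测同一类别时,伪标签才生效。"""
    preds = [p.argmax(dim=1) for p in prob_list]
    confs = [p.max(dim=1).values for p in prob_list]
    agree = torch.stack([p == preds[0] for p in preds]).all(dim=0)
    confident = torch.stack([c >= threshold for c in confs]).all(dim=0)
    return preds[0], agree & confident        # 伪标签 + 可用样本掩码
```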

【5】 Influence Selection for Active Learning 标题:主动学习的影响选择 链接:https://arxiv.org/abs/2108.09331

作者:Zhuoming Liu,Hao Ding,Huaping Zhong,Weijia Li,Jifeng Dai,Conghui He 机构:University Southern California, Johns Hopkins University, SenseTime Research, CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong 备注:ICCV2021 accepted paper 摘要:现有的主动学习方法是基于不同的任务特定或模型特定标准,通过评估样本的不确定性或其对标记数据集多样性的影响来选择样本。在本文中,我们提出了主动学习的影响选择(ISAL),它选择对模型性能具有最积极影响的未标记样本。为了获得主动学习场景中未标记样本的影响,我们设计了未经训练的未标记样本影响计算(UUIC)来估计未标记样本的预期梯度,并以此计算其影响。为了证明UUIC的有效性,我们提供了理论和实验分析。由于UUIC只依赖于模型梯度,这可以很容易地从任何神经网络中获得,因此我们的主动学习算法是任务不可知和模型不可知的。ISAL在不同的主动学习环境中,针对不同的任务和不同的数据集,实现了最先进的性能。与以前的方法相比,我们的方法在CIFAR10、VOC2012和COCO上的注释成本分别降低了至少12%、13%和16%。 摘要:The existing active learning methods select the samples by evaluating the sample's uncertainty or its effect on the diversity of labeled datasets based on different task-specific or model-specific criteria. In this paper, we propose the Influence Selection for Active Learning(ISAL) which selects the unlabeled samples that can provide the most positive Influence on model performance. To obtain the Influence of the unlabeled sample in the active learning scenario, we design the Untrained Unlabeled sample Influence Calculation(UUIC) to estimate the unlabeled sample's expected gradient with which we calculate its Influence. To prove the effectiveness of UUIC, we provide both theoretical and experimental analyses. Since the UUIC just depends on the model gradients, which can be obtained easily from any neural network, our active learning algorithm is task-agnostic and model-agnostic. ISAL achieves state-of-the-art performance in different active learning settings for different tasks with different datasets. Compared with previous methods, our method decreases the annotation cost at least by 12%, 13% and 16% on CIFAR10, VOC2012 and COCO, respectively.
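UUIC用"未标记样本的期望梯度"估计其影响,期望取自模型自身的预测类别分布。下面是一个低效但直观的解读性草图:逐个候选标签做反向传播,并以梯度范数的期望近似影响;这只是对该思想的一种假设性近似,未必等同论文定义。

```python
import torch
import torch.nn.functional as F

def expected_gradient_norm(model, x):
    """x: 单个未标记样本 (1, ...)。对每个可能标签算梯度,再按预测概率加权。"""
    probs = torch.softmax(model(x), dim=1).squeeze(0).detach()
    total = 0.0
    for c, p_c in enumerate(probs):
        model.zero_grad()
        loss = F.cross_entropy(model(x), torch.tensor([c]))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        total += p_c.item() * g.norm().item()   # 期望梯度范数越大,估计影响越大
    return total
```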

时序|行为识别|姿态|视频|运动估计(6篇)

【1】 ChiNet: Deep Recurrent Convolutional Learning for Multimodal Spacecraft Pose Estimation 标题:CHINET:用于多模态航天器姿态估计的深度递归卷积学习 链接:https://arxiv.org/abs/2108.10282

作者:Duarte Rondao,Nabil Aouf,Mark A. Richardson 机构: Aouf is a Professor of Robotics and Autonomous Systems with theDepartment of Electrical and Electronic Engineering at City, University ofLondon 备注:This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 摘要:本文提出了一种创新的深度学习管道,该管道通过结合交会序列的时间信息来估计航天器的相对姿态。它利用长短期记忆(LSTM)单元在数据序列建模中的性能来处理卷积神经网络(CNN)主干提取的特征。三种不同的训练策略,遵循从粗到细的漏斗式方法,结合起来促进特征学习,并通过回归改进端到端姿势估计。利用CNN自主确定图像特征表示的能力,将热红外数据与红-绿-蓝(RGB)输入进行融合,从而减轻在可见光波长下对空间物体成像产生的伪影的影响。在一个合成数据集上演示了所提出框架(称为ChiNet)的每一项贡献,并在实验数据上验证了整个管道。 摘要:This paper presents an innovative deep learning pipeline which estimates the relative pose of a spacecraft by incorporating the temporal information from a rendezvous sequence. It leverages the performance of long short-term memory (LSTM) units in modelling sequences of data for the processing of features extracted by a convolutional neural network (CNN) backbone. Three distinct training strategies, which follow a coarse-to-fine funnelled approach, are combined to facilitate feature learning and improve end-to-end pose estimation by regression. The capability of CNNs to autonomously ascertain feature representations from images is exploited to fuse thermal infrared data with red-green-blue (RGB) inputs, thus mitigating the effects of artefacts from imaging space objects in the visible wavelength. Each contribution of the proposed framework, dubbed ChiNet, is demonstrated on a synthetic dataset, and the complete pipeline is validated on experimental data.

【2】 Recurrent Video Deblurring with Blur-Invariant Motion Estimation and Pixel Volumes 标题:基于模糊不变运动估计和像素体积的递归视频去模糊 链接:https://arxiv.org/abs/2108.09982

作者:Hyeongseok Son,Junyong Lee,Jonghyeop Lee,Sunghyun Cho,Seungyong Lee 备注:17 pages, Camera-ready version for ACM Transactions on Graphics (TOG) 2021 摘要:为了成功地进行视频去模糊,必须利用相邻帧的信息。大多数最先进的视频去模糊方法采用视频帧之间的运动补偿来聚合来自多个帧的信息,这有助于对目标帧进行去模糊。然而,先前的去模糊方法所采用的运动补偿方法不是模糊不变的,因此,对于具有不同模糊量的模糊帧,其精度受到限制。为了缓解这个问题,我们提出了两种新的方法,通过有效地聚合来自多个视频帧的信息来消除视频中的模糊。首先,我们提出了模糊不变运动估计学习来提高模糊帧之间的运动估计精度。其次,对于运动补偿,我们使用包含候选锐利像素的像素体积来解决运动估计错误,而不是通过扭曲估计运动来对齐帧。我们将这两个过程结合起来,提出了一种有效的循环视频去模糊网络,该网络充分利用了先前的去模糊帧。实验表明,与最近使用深度学习的方法相比,我们的方法在数量和质量上都达到了最先进的性能。 摘要:For the success of video deblurring, it is essential to utilize information from neighboring frames. Most state-of-the-art video deblurring methods adopt motion compensation between video frames to aggregate information from multiple frames that can help deblur a target frame. However, the motion compensation methods adopted by previous deblurring methods are not blur-invariant, and consequently, their accuracy is limited for blurry frames with different blur amounts. To alleviate this problem, we propose two novel approaches to deblur videos by effectively aggregating information from multiple video frames. First, we present blur-invariant motion estimation learning to improve motion estimation accuracy between blurry frames. Second, for motion compensation, instead of aligning frames by warping with estimated motions, we use a pixel volume that contains candidate sharp pixels to resolve motion estimation errors. We combine these two processes to propose an effective recurrent video deblurring network that fully exploits deblurred previous frames. Experiments show that our method achieves the state-of-the-art performance both quantitatively and qualitatively compared to recent methods that use deep learning.

【3】 TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment 标题:TACO:用于视频-文本对齐的标记感知级联对比学习 链接:https://arxiv.org/abs/2108.09980

作者:Jianwei Yang,Yonatan Bisk,Jianfeng Gao 机构:Microsoft Research, Carnegie Mellon University 备注:Accepted by ICCV 2021 摘要:对比学习被广泛用于训练基于变换器的视觉语言模型,用于视频文本对齐和多模态表示学习。本文提出了一种新的标记感知级联对比学习(TACo)算法,该算法利用两种新技术改进对比学习。第一种是标记感知对比损失,它是通过考虑单词的句法类别来计算的。这是因为观察到,对于视频-文本对,文本中的内容词(如名词和动词)比虚词更可能与视频中的视觉内容对齐。其次,采用级联采样方法生成一组小的硬负示例,用于有效估计多模态融合层的损耗。为了验证TACo的有效性,在我们的实验中,我们对一组下游任务的预训练模型进行了微调,包括文本视频检索(YouCook2、MSR-VTT和ActivityNet)、视频动作步骤定位(CrossTask)、视频动作分段(COIN)。结果表明,与以前的方法相比,我们的模型在不同的实验环境中取得了一致的改进,在YouCook2、MSR-VTT和ActivityNet三个公共文本视频检索基准上建立了最新的技术水平。 摘要:Contrastive learning has been widely used to train transformer-based vision-language models for video-text alignment and multi-modal representation learning. This paper presents a new algorithm called Token-Aware Cascade contrastive learning (TACo) that improves contrastive learning using two novel techniques. The first is the token-aware contrastive loss which is computed by taking into account the syntactic classes of words. This is motivated by the observation that for a video-text pair, the content words in the text, such as nouns and verbs, are more likely to be aligned with the visual contents in the video than the function words. Second, a cascade sampling method is applied to generate a small set of hard negative examples for efficient loss estimation for multi-modal fusion layers. To validate the effectiveness of TACo, in our experiments we finetune pretrained models for a set of downstream tasks including text-video retrieval (YouCook2, MSR-VTT and ActivityNet), video action step localization (CrossTask), video action segmentation (COIN). The results show that our models attain consistent improvements across different experimental settings over previous methods, setting new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
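标记感知对比损失按词的句法类别区分权重:名词、动词等内容词比虚词更可能与视频内容对齐。下面是权重构造的示意;词性标签集合与具体权重均为假设,并非论文的原始定义。

```python
import torch

def token_aware_weights(pos_tags, content_w=1.0, function_w=0.2):
    """pos_tags: 各token的词性(如spaCy的粗粒度标签);内容词权重更高。"""
    content = {"NOUN", "VERB", "PROPN", "ADJ"}
    return torch.tensor([content_w if t in content else function_w
                         for t in pos_tags])

# 例:"a dog jumps over the fence" 的逐token对比损失权重
w = token_aware_weights(["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"])
print(w)   # tensor([0.2000, 1.0000, 1.0000, 0.2000, 0.2000, 1.0000])
```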

【4】 PR-GCN: A Deep Graph Convolutional Network with Point Refinement for 6D Pose Estimation 标题:PR-GCN:一种用于6D位姿估计的点精化深图卷积网络 链接:https://arxiv.org/abs/2108.09916

作者:Guangyuan Zhou,Huiqun Wang,Jiaxin Chen,Di Huang 机构:State Key Laboratory of Software Development Environment, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China 摘要:基于RGB-D的6D位姿估计最近取得了显著的进展,但仍然存在两个主要局限性:(1)深度数据的无效表示;(2)不同模式的集成不足。本文提出了一种新的深度学习方法,即带点细化的图卷积网络(PR-GCN),以统一的方式同时解决上述问题。它首先引入点细化网络(PRN)来抛光3D点云,在去除噪声的情况下恢复缺失的部分。随后,提出了多模态融合图卷积网络(MMF-GCN)来增强RGB-D组合,该网络通过图卷积网络中的局部信息传播来捕获几何感知的模态间相关性。在三个广泛使用的基准上进行了大量实验,达到了最先进的性能。此外,还表明所提出的PRN和MMF-GCN模块可以很好地推广到其他框架。 摘要:RGB-D based 6D pose estimation has recently achieved remarkable progress, but still suffers from two major limitations: (1) ineffective representation of depth data and (2) insufficient integration of different modalities. This paper proposes a novel deep learning approach, namely Graph Convolutional Network with Point Refinement (PR-GCN), to simultaneously address the issues above in a unified way. It first introduces the Point Refinement Network (PRN) to polish 3D point clouds, recovering missing parts with noise removed. Subsequently, the Multi-Modal Fusion Graph Convolutional Network (MMF-GCN) is presented to strengthen RGB-D combination, which captures geometry-aware inter-modality correlation through local information propagation in the graph convolutional network. Extensive experiments are conducted on three widely used benchmarks, and state-of-the-art performance is reached. Besides, it is also shown that the proposed PRN and MMF-GCN modules are well generalized to other frameworks.

【5】 StarVQA: Space-Time Attention for Video Quality Assessment 标题:StarVQA:用于视频质量评估的时空注意力 链接:https://arxiv.org/abs/2108.09635

作者:Fengchuang Xing,Yuan-Gen Wang,Hanpin Wang,Leida Li,Guopu Zhu 机构: Xing is also with the School of Physics and Information Engineering, Guangdong University of Education 摘要:注意机制是当今计算机视觉研究的热点。然而,其在视频质量评估(VQA)中的应用尚未见报道。由于原始参考和拍摄失真的未知性,评估野外(in-the-wild)视频的质量具有挑战性。本文提出了一个新的时空注意力网络来解决VQA问题,名为StarVQA。StarVQA通过交替连接划分的时空注意力来构建一个Transformer。为了使Transformer体系结构适合训练,StarVQA将平均意见分数(MOS)编码为概率向量,并嵌入一个特殊的向量化标签标记作为可学习变量,从而设计了向量化回归损失。为了捕获视频序列的长距离时空相关性,StarVQA将每个图像块的时空位置信息编码到Transformer的输入端。我们在事实上的野外视频数据集上进行了各种实验,包括LIVE-VQC、KoNViD-1k、LSVQ和LSVQ-1080p。实验结果表明,所提出的StarVQA优于现有最先进方法。代码和模型将在以下位置提供:https://github.com/DVL/StarVQA. 摘要:The attention mechanism is blooming in computer vision nowadays. However, its application to video quality assessment (VQA) has not been reported. Evaluating the quality of in-the-wild videos is challenging due to the unknown of pristine reference and shooting distortion. This paper presents a novel space-time attention network for the VQA problem, named StarVQA. StarVQA builds a Transformer by alternately concatenating the divided space-time attention. To adapt the Transformer architecture for training, StarVQA designs a vectorized regression loss by encoding the mean opinion score (MOS) to the probability vector and embedding a special vectorized label token as the learnable variable. To capture the long-range spatiotemporal dependencies of a video sequence, StarVQA encodes the space-time position information of each patch to the input of the Transformer. Various experiments are conducted on the de-facto in-the-wild video datasets, including LIVE-VQC, KoNViD-1k, LSVQ, and LSVQ-1080p. Experimental results demonstrate the superiority of the proposed StarVQA over the state-of-the-art. Code and model will be available at: https://github.com/DVL/StarVQA.
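"将MOS编码为概率向量"的一种可能做法是:在若干锚点分数上放置软标签。下面的草图仅为解读示意,锚点与温度tau均为假设,论文的具体编码方式可能不同。

```python
import torch

ANCHORS = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])   # 假设的锚点分数

def encode_mos(mos, tau=0.5):
    """标量MOS -> 锚点上的软概率向量(tau为假设温度)。"""
    logits = -((mos - ANCHORS) ** 2) / tau
    return torch.softmax(logits, dim=-1)

def decode_mos(prob):
    return (prob * ANCHORS).sum(-1)                  # 期望值还原为标量分数

p = encode_mos(torch.tensor(3.7))
print(p, decode_mos(p))                              # 软标签及其还原的近似MOS
```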

【6】 BlockCopy: High-Resolution Video Processing with Block-Sparse Feature Propagation and Online Policies 标题:BlockCopy:基于块稀疏特征传播和在线策略的高分辨率视频处理 链接:https://arxiv.org/abs/2108.09376

作者:Thomas Verelst,Tinne Tuytelaars 机构:ESAT-PSI, KU Leuven, Leuven, Belgium 备注:Accepted for International Conference on Computer Vision (ICCV 2021) 摘要:在本文中,我们提出了BlockCopy方案,与标准逐帧处理方案相比,该方案加速了预训练的基于帧的CNN,使其更高效地处理视频。为此,轻量级策略网络确定图像中的重要区域,并使用自定义块稀疏卷积仅对选定区域应用操作。非选定区域的特征仅从前一帧复制,减少了计算量和延迟。执行策略使用在线强化学习进行训练,无需地面真值注释。我们的通用框架在密集预测任务(如行人检测、实例分割和语义分割)上得到了演示,同时使用了最新技术(Center and Scale Predictor、MGAN、SwiftNet)和标准基线网络(Mask-RCNN、DeepLabV3+)。BlockCopy在对准确性影响最小的情况下实现了显著的FLOPs节省和推理加速。 摘要:In this paper we propose BlockCopy, a scheme that accelerates pretrained frame-based CNNs to process video more efficiently, compared to standard frame-by-frame processing. To this end, a lightweight policy network determines important regions in an image, and operations are applied on selected regions only, using custom block-sparse convolutions. Features of non-selected regions are simply copied from the preceding frame, reducing the number of computations and latency. The execution policy is trained using reinforcement learning in an online fashion without requiring ground truth annotations. Our universal framework is demonstrated on dense prediction tasks such as pedestrian detection, instance segmentation and semantic segmentation, using both state of the art (Center and Scale Predictor, MGAN, SwiftNet) and standard baseline networks (Mask-RCNN, DeepLabV3+). BlockCopy achieves significant FLOPS savings and inference speedup with minimal impact on accuracy.
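BlockCopy的核心语义是:策略选中的块重新计算,未选中的块直接复制上一帧特征。下面用"密集前向+掩码"的方式示意这一语义;真实实现依赖块稀疏卷积来真正节省计算,此处net假定保持空间分辨率,均为假设性简化。

```python
import torch

def block_copy_forward(frame, prev_feat, policy_mask, net, block=32):
    """policy_mask: (B, H//block, W//block),1=重新计算,0=复制上一帧。
    仅为语义示意:真实实现用块稀疏卷积跳过未选中块的计算。"""
    new_feat = net(frame)                                  # 密集前向代替稀疏计算
    m = policy_mask.float()
    m = m.repeat_interleave(block, dim=1).repeat_interleave(block, dim=2)
    m = m.unsqueeze(1)                                     # (B,1,H,W)
    return m * new_feat + (1 - m) * prev_feat              # 选中块更新,其余复制
```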

医学相关(2篇)

【1】 Learn-Explain-Reinforce: Counterfactual Reasoning and Its Guidance to Reinforce an Alzheimer's Disease Diagnosis Model 标题:学习-解释-强化:反事实推理及其对强化阿尔茨海默病诊断模型的指导 链接:https://arxiv.org/abs/2108.09451

作者:Kwanseok Oh,Jee Seok Yoon,Heung-Il Suk 机构: Yoon is with the Department of Brain and Cognitive Engi-neering, Korea University 备注:14 pages, 9 figures 摘要:关于疾病诊断模型的现有研究要么侧重于诊断模型学习以提高性能,要么侧重于对经过训练的诊断模型的直观解释。我们提出了一个新的学习-解释-强化(LEAR)框架,该框架将诊断模型学习、可视化解释生成(解释单元)和在可视化解释指导下训练的诊断模型强化(强化单元)统一起来。对于视觉解释,我们生成一个反事实映射,该映射将输入样本转换为预期目标标签。例如,反事实地图可以定位正常大脑图像中的假设异常,这可能导致其被诊断为阿尔茨海默病(AD)。我们认为,生成的反事实图代表了关于目标任务的数据驱动和模型诱导知识,即使用结构MRI进行AD诊断,这可能是加强训练诊断模型泛化的重要信息来源。为此,我们在反事实地图的指导下设计了一个基于注意的特征求精模块。解释和强化单元是相互的,可以迭代操作。通过对ADNI数据集的定性和定量分析,我们提出的方法得到了验证。通过消融研究和与现有方法的比较,证明了该方法的可理解性和保真度。 摘要:Existing studies on disease diagnostic models focus either on diagnostic model learning for performance improvement or on the visual explanation of a trained diagnostic model. We propose a novel learn-explain-reinforce (LEAR) framework that unifies diagnostic model learning, visual explanation generation (explanation unit), and trained diagnostic model reinforcement (reinforcement unit) guided by the visual explanation. For the visual explanation, we generate a counterfactual map that transforms an input sample to be identified as an intended target label. For example, a counterfactual map can localize hypothetical abnormalities within a normal brain image that may cause it to be diagnosed with Alzheimer's disease (AD). We believe that the generated counterfactual maps represent data-driven and model-induced knowledge about a target task, i.e., AD diagnosis using structural MRI, which can be a vital source of information to reinforce the generalization of the trained diagnostic model. To this end, we devise an attention-based feature refinement module with the guidance of the counterfactual maps. The explanation and reinforcement units are reciprocal and can be operated iteratively. Our proposed approach was validated via qualitative and quantitative analysis on the ADNI dataset. Its comprehensibility and fidelity were demonstrated through ablation studies and comparisons with existing methods.

【2】 Deep survival analysis with longitudinal X-rays for COVID-19 标题:基于纵向X光片的COVID-19深度生存分析 链接:https://arxiv.org/abs/2108.09641

作者:Michelle Shu,Richard Strong Bowen,Charles Herrmann,Gengmo Qi,Michele Santacatterina,Ramin Zabih 机构: Cornell Tech, Google Research, George Washington University 摘要:事件时间分析是分配ICU病床等临床资源的重要统计工具。然而,像Cox模型这样的经典技术无法直接纳入图像,因为图像的维度过高。我们提出了一种深度学习方法,自然地将多个时间相关的成像研究以及非成像数据纳入到时间-事件分析中。我们的技术在一个包含1894名COVID-19患者的临床数据集上进行了基准测试,结果显示图像序列显著改善了预测。例如,经典的事件时间方法在预测住院时产生的一致性误差约为30-40%,而我们的误差在没有图像的情况下为25%,在包括多张X光片的情况下为20%。消融研究表明,我们的模型并没有学习到诸如扫描仪伪影之类的虚假特征。虽然我们的研究重点和评估针对COVID-19,但我们开发的方法具有广泛的适用性。 摘要:Time-to-event analysis is an important statistical tool for allocating clinical resources such as ICU beds. However, classical techniques like the Cox model cannot directly incorporate images due to their high dimensionality. We propose a deep learning approach that naturally incorporates multiple, time-dependent imaging studies as well as non-imaging data into time-to-event analysis. Our techniques are benchmarked on a clinical dataset of 1,894 COVID-19 patients, and show that image sequences significantly improve predictions. For example, classical time-to-event methods produce a concordance error of around 30-40% for predicting hospital admission, while our error is 25% without images and 20% with multiple X-rays included. Ablation studies suggest that our models are not learning spurious features such as scanner artifacts. While our focus and evaluation is on COVID-19, the methods we develop are broadly applicable.
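文中以一致性误差(即1减去一致性指数C-index)评价事件时间预测。一致性指数统计"预测风险排序与实际事件先后一致"的可比样本对比例,朴素实现如下(忽略并列,删失样本仅作为较晚一方参与比较):

```python
import numpy as np

def concordance_index(times, risks, events):
    """times: 事件/删失时间;risks: 预测风险;events: 1=发生事件, 0=删失。"""
    ok, n = 0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] == 1 and times[i] < times[j]:   # 可比样本对
                n += 1
                ok += risks[i] > risks[j]                # 风险高者应更早发生事件
    return ok / n

print(concordance_index(np.array([2., 5., 7.]), np.array([0.9, 0.4, 0.1]),
                        np.array([1, 1, 0])))            # 完全一致时输出1.0
```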

GAN|对抗|攻击|生成相关(5篇)

【1】 Adaptable GAN Encoders for Image Reconstruction via Multi-type Latent Vectors with Two-scale Attentions 标题:基于双尺度注意力与多类型潜向量的自适应GAN图像重建编码器 链接:https://arxiv.org/abs/2108.10201

作者:Cheng Yu,Wenmin Wang 机构: and also with the State Key Laboratory of Lunarand Planetary Sciences, Macau University of Science and Technology 摘要:尽管目前的深度生成对抗网络(GAN)可以合成高质量(HQ)图像,但发现用于图像重建的新型GAN编码器仍然是有利的。当将图像嵌入到潜在空间时,现有的GAN编码器对于对齐的图像(如人脸)工作得很好,但它们不适应更广义的GAN。据我们所知,目前最先进的GAN编码器没有合适的编码器,无法从不同GAN上的大多数错位HQ合成图像重建高保真图像。他们的表现是有限的,特别是在不结盟和真实的图像上。我们提出了一种新的方法(MTV-TSA)来处理这些问题。从潜在空间创建多类型潜在向量(MTV)和从图像创建双尺度注意(TSA),可以设计一组编码器,可适用于各种预先训练的GANs。我们推广了两组损失函数来优化编码器。所设计的编码器可以使GANs从大多数合成的HQ图像重建出高保真图像。此外,该方法能够很好地重建真实图像,并基于学习的属性方向对其进行处理。所设计的编码器具有统一的卷积块,通过微调相应的规范化层和最后一个块,可以很好地匹配当前的GAN架构(如PGGAN、StyleGANs和BigGAN)。这种设计良好的编码器也可以训练更快地收敛。 摘要:Although current deep generative adversarial networks (GANs) could synthesize high-quality (HQ) images, discovering novel GAN encoders for image reconstruction is still favorable. When embedding images to latent space, existing GAN encoders work well for aligned images (such as the human face), but they do not adapt to more generalized GANs. To our knowledge, current state-of-the-art GAN encoders do not have a proper encoder to reconstruct high-fidelity images from most misaligned HQ synthesized images on different GANs. Their performances are limited, especially on non-aligned and real images. We propose a novel method (named MTV-TSA) to handle such problems. Creating multi-type latent vectors (MTV) from latent space and two-scale attentions (TSA) from images allows designing a set of encoders that can be adaptable to a variety of pre-trained GANs. We generalize two sets of loss functions to optimize the encoders. The designed encoders could make GANs reconstruct higher fidelity images from most synthesized HQ images. In addition, the proposed method can reconstruct real images well and process them based on learned attribute directions. The designed encoders have unified convolutional blocks and could match well in current GAN architectures (such as PGGAN, StyleGANs, and BigGAN) by fine-tuning the corresponding normalization layers and the last block. Such well-designed encoders can also be trained to converge more quickly.

【2】 Realistic Image Synthesis with Configurable 3D Scene Layouts 标题:基于可配置3D场景布局的真实感图像合成 链接:https://arxiv.org/abs/2108.10031

作者:Jaebong Jeong,Janghun Jo,Jingdong Wang,Sunghyun Cho,Jaesik Park 机构:POSTECH, Republic of Korea, Microsoft Research Asia, Beijing, China 备注:paper: 9 pages, supplementary materials: 7 pages 摘要:最近的条件图像合成方法提供了高质量的合成图像。然而,精确调整图像内容(如对象的位置和方向)仍然是一个挑战,合成图像通常具有几何无效的内容。为了在三维几何方面为用户提供对合成图像丰富的可控性,我们提出了一种基于可配置三维场景布局的真实感图像合成方法。我们的方法将带有语义类标签的3D场景作为输入,并训练一个3D场景绘制网络,该网络合成输入3D场景的颜色值。使用经过训练的绘制网络,可以渲染和操纵输入3D场景的逼真图像。为了在没有3D颜色监督的情况下训练绘制网络,我们利用了一种现成的2D语义图像合成方法。在实验中,我们证明了我们的方法可以生成具有几何正确结构的图像,并支持几何操作,例如视点和对象姿势的改变以及绘画风格的操作。 摘要:Recent conditional image synthesis approaches provide high-quality synthesized images. However, it is still challenging to accurately adjust image contents such as the positions and orientations of objects, and synthesized images often have geometrically invalid contents. To provide users with rich controllability on synthesized images in the aspect of 3D geometry, we propose a novel approach to realistic-looking image synthesis based on a configurable 3D scene layout. Our approach takes a 3D scene with semantic class labels as input and trains a 3D scene painting network that synthesizes color values for the input 3D scene. With the trained painting network, realistic-looking images for the input 3D scene can be rendered and manipulated. To train the painting network without 3D color supervision, we exploit an off-the-shelf 2D semantic image synthesis method. In experiments, we show that our approach produces images with geometrically correct structures and supports geometric manipulation such as the change of the viewpoint and object poses as well as manipulation of the painting style.

【3】 Voxel-based Network for Shape Completion by Leveraging Edge Generation 标题:利用边缘生成实现形状完成的基于体素的网络 链接:https://arxiv.org/abs/2108.09936

作者:Xiaogang Wang,Marcelo H Ang Jr,Gim Hee Lee 机构:National University of Singapore 备注:ICCV 2021 摘要:深度学习技术在点云补全方面取得了显著的改进,其目的是从部分输入中补全缺失的物体形状。然而,由于细粒度细节的过度平滑,大多数现有方法无法恢复真实结构。在本文中,我们通过利用边缘生成(VE-PCN)开发了一个基于体素的点云补全网络。我们首先将点云嵌入到规则的体素网格中,然后借助幻觉的形状边缘生成完整的对象。这种解耦的体系结构与多尺度网格特征学习相结合,能够生成更真实的表面细节。我们在公开的点云补全数据集上评估了我们的模型,并表明它在数量和质量上都优于现有的最新方法。我们的源代码可在 https://github.com/xiaogangw/VE-PCN 获取。 摘要:Deep learning technique has yielded significant improvements in point cloud completion with the aim of completing missing object shapes from partial inputs. However, most existing methods fail to recover realistic structures due to over-smoothing of fine-grained details. In this paper, we develop a voxel-based network for point cloud completion by leveraging edge generation (VE-PCN). We first embed point clouds into regular voxel grids, and then generate complete objects with the help of the hallucinated shape edges. This decoupled architecture together with a multi-scale grid feature learning is able to generate more realistic on-surface details. We evaluate our model on the publicly available completion datasets and show that it outperforms existing state-of-the-art approaches quantitatively and qualitatively. Our source code is available at https://github.com/xiaogangw/VE-PCN.
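"把点云嵌入规则体素网格"这一步可以用下面的极简示意理解(假设输入已归一化到[-1,1]^3;VE-PCN实际在每个体素中存放学习到的特征而非0/1占据值):

```python
import torch

def points_to_voxels(points, resolution=32):
    # points: (N, 3) coordinates in [-1, 1]^3 -> binary occupancy grid
    idx = ((points + 1.0) * 0.5 * (resolution - 1)).round().long()
    idx = idx.clamp(0, resolution - 1)                # guard border points
    grid = torch.zeros(resolution, resolution, resolution)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```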

【4】 Image Inpainting via Conditional Texture and Structure Dual Generation 标题:基于条件纹理和结构双重生成的图像修复 链接:https://arxiv.org/abs/2108.09760

作者:Xiefan Guo,Hongyu Yang,Di Huang 机构:State Key Laboratory of Software Development Environment, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China 备注:Accepted at ICCV'2021 摘要:近年来,通过引入结构先验,深层生成方法在图像修复方面取得了长足的进展。然而,由于在结构重建过程中缺乏与图像纹理的适当交互,目前的解决方案难以处理大面积破损的情况,并且通常会产生失真的结果。在本文中,我们提出了一种新的用于图像修复的双流网络,该网络以耦合方式对结构约束纹理合成和纹理引导结构重建进行建模,以便它们更好地相互利用以生成更合理的结果。此外,为了增强全局一致性,设计了双向选通特征融合(Bi-GFF)模块来交换和组合结构和纹理信息,并开发了上下文特征聚合(CFA)模块,通过区域亲和性学习和多尺度特征聚合来细化生成的内容。在CelebA、Paris StreetView和Places2数据集上的定性和定量实验证明了该方法的优越性。我们的代码可在 https://github.com/Xiefan-Guo/CTSDG 获取。 摘要:Deep generative approaches have recently made considerable progress in image inpainting by introducing structure priors. Due to the lack of proper interaction with image texture during structure reconstruction, however, current solutions are incompetent in handling the cases with large corruptions, and they generally suffer from distorted results. In this paper, we propose a novel two-stream network for image inpainting, which models the structure-constrained texture synthesis and texture-guided structure reconstruction in a coupled manner so that they better leverage each other for more plausible generation. Furthermore, to enhance the global consistency, a Bi-directional Gated Feature Fusion (Bi-GFF) module is designed to exchange and combine the structure and texture information and a Contextual Feature Aggregation (CFA) module is developed to refine the generated contents by region affinity learning and multi-scale feature aggregation. Qualitative and quantitative experiments on the CelebA, Paris StreetView and Places2 datasets demonstrate the superiority of the proposed method. Our code is available at https://github.com/Xiefan-Guo/CTSDG.
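"双向选通特征融合"的思路可以用下面的假设性PyTorch模块示意:两个分支各自保留自身特征,并通过sigmoid门控从另一分支软选择信息。层结构与门控的具体形式均为笔者假设,并非作者实现。

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    # A minimal bi-directional soft-gated exchange between structure and
    # texture feature maps, in the spirit of Bi-GFF (layer sizes assumed).
    def __init__(self, channels):
        super().__init__()
        self.gate_s = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.gate_t = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_structure, f_texture):
        both = torch.cat([f_structure, f_texture], dim=1)
        # each stream keeps its own features and soft-selects from the other
        f_s = f_structure + torch.sigmoid(self.gate_s(both)) * f_texture
        f_t = f_texture + torch.sigmoid(self.gate_t(both)) * f_structure
        return f_s, f_t
```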

【5】 Robustness-via-Synthesis: Robust Training with Generative Adversarial Perturbations 标题:通过合成实现健壮性:具有生成性对抗性扰动的健壮训练 链接:https://arxiv.org/abs/2108.09713

作者:Inci M. Baytas,Debayan Deb 机构: Baytas¸ is with the Department of Computer Engineering, Bo ‘gazic¸iUniversity 摘要:随着对抗性攻击的发现,鲁棒模型已成为基于深度学习的系统的必备模型。一阶攻击对抗性训练是迄今为止对抗性干扰最有效的防御手段之一。大多数对抗性训练方法侧重于使用损失函数相对于输入图像的梯度对每个像素进行迭代扰动。然而,基于梯度攻击的对抗训练缺乏多样性,不能很好地推广到自然图像和各种攻击。本研究提出了一种鲁棒训练算法,其中对抗性干扰通过生成网络从随机向量自动合成。该分类器采用交叉熵损失进行训练,通过自然和合成对抗性样本表示之间的最佳传输距离进行正则化。与流行的生成性防御不同,所提出的一步攻击生成框架在不利用分类器损失梯度的情况下合成各种扰动。实验结果表明,在CIFAR10、CIFAR100和SVHN数据集上,该方法与各种基于梯度和生成的鲁棒训练技术相比具有相当的鲁棒性。此外,与基线相比,本文提出的鲁棒训练框架能很好地推广到自然样本。代码和经过训练的模型将公开提供。 摘要:Upon the discovery of adversarial attacks, robust models have become obligatory for deep learning-based systems. Adversarial training with first-order attacks has been one of the most effective defenses against adversarial perturbations to this day. The majority of the adversarial training approaches focus on iteratively perturbing each pixel with the gradient of the loss function with respect to the input image. However, the adversarial training with gradient-based attacks lacks diversity and does not generalize well to natural images and various attacks. This study presents a robust training algorithm where the adversarial perturbations are automatically synthesized from a random vector using a generator network. The classifier is trained with cross-entropy loss regularized with the optimal transport distance between the representations of the natural and synthesized adversarial samples. Unlike prevailing generative defenses, the proposed one-step attack generation framework synthesizes diverse perturbations without utilizing gradient of the classifier's loss. Experimental results show that the proposed approach attains comparable robustness with various gradient-based and generative robust training techniques on CIFAR10, CIFAR100, and SVHN datasets. In addition, compared to the baselines, the proposed robust training framework generalizes well to the natural samples. Code and trained models will be made publicly available.

自动驾驶|车辆|车道检测等(1篇)

【1】 Exploring Simple 3D Multi-Object Tracking for Autonomous Driving 标题:面向自动驾驶的简易三维多目标跟踪探索 链接:https://arxiv.org/abs/2108.10312

作者:Chenxu Luo,Xiaodong Yang,Alan Yuille 机构:QCraft, Johns Hopkins University 备注:ICCV 2021 摘要:激光雷达点云中的三维多目标跟踪是自动驾驶车辆的关键组成部分。现有的方法主要基于检测后跟踪(tracking-by-detection)流程,不可避免地需要用于检测关联的启发式匹配步骤。在本文中,我们提出了SimTrack,通过一个端到端的可训练模型从原始点云进行联合检测和跟踪,以简化手工设计的跟踪范式。我们的关键设计是预测每个对象在给定片段中的首次出现位置,以获得跟踪身份,然后基于运动估计更新位置。在推理过程中,通过简单的读取操作,可以完全放弃启发式匹配步骤。SimTrack在一个统一的模型中集成了跟踪对象关联、新生对象检测和死亡轨迹消除。我们对两个大型数据集进行了广泛的评估:nuScenes和Waymo开放数据集。实验结果表明,在排除启发式匹配规则的情况下,我们的简单方法与最新的方法相比具有优势。 摘要:3D multi-object tracking in LiDAR point clouds is a key ingredient for self-driving vehicles. Existing methods are predominantly based on the tracking-by-detection pipeline and inevitably require a heuristic matching step for the detection association. In this paper, we present SimTrack to simplify the hand-crafted tracking paradigm by proposing an end-to-end trainable model for joint detection and tracking from raw point clouds. Our key design is to predict the first-appear location of each object in a given snippet to get the tracking identity and then update the location based on motion estimation. In the inference, the heuristic matching step can be completely waived by a simple read-off operation. SimTrack integrates the tracked object association, newborn object detection, and dead track killing in a single unified model. We conduct extensive evaluations on two large-scale datasets: nuScenes and Waymo Open Dataset. Experimental results reveal that our simple approach compares favorably with the state-of-the-art methods while ruling out the heuristic matching rules.

NAS模型搜索(2篇)

【1】 Pi-NAS: Improving Neural Architecture Search by Reducing Supernet Training Consistency Shift 标题:PI-NAS:通过减少超网训练一致性漂移来改进神经结构搜索 链接:https://arxiv.org/abs/2108.09671

作者:Jiefeng Peng,Jiqi Zhang,Changlin Li,Guangrun Wang,Xiaodan Liang,Liang Lin 机构:Sun Yat-sen University, DarkMatter AI Research, GORSE Lab, Dept. of DSAI, Monash University, University of Oxford 备注:Accepted to ICCV 2021 摘要:最近提出的神经结构搜索(NAS)方法在一个超网中联合训练数十亿个结构,并使用从超网分离的网络权重估计其潜在精度。然而,体系结构的预测精度与实际性能之间的排序关系是不正确的,这导致了现有NAS方法的困境。我们将这种排序相关性问题归因于超网训练一致性偏移,包括特征偏移和参数偏移。由于随机路径采样,特征偏移被识别为隐藏层的动态输入分布。输入分布动态影响损耗下降,最终影响架构排名。参数偏移被识别为在不同的训练步骤中,对位于不同路径中的共享层进行的相互矛盾的参数更新。快速变化的参数无法保持体系结构排名。我们使用一个非平凡的超网Pi模型(称为Pi-NAS)同时处理这两种偏移。具体来说,我们采用了一个包含交叉路径学习的supernet-Pi模型,以减少不同路径之间的特征一致性偏移。同时,我们采用了一种新颖的包含负样本的非平凡均值教师(mean teacher)来克服参数偏移和模型碰撞。此外,我们的Pi-NAS以无监督的方式运行,可以搜索更多可转移的体系结构。在ImageNet和一系列下游任务(如COCO 2017、ADE20K和Cityscapes)上进行的大量实验证明了我们的Pi-NAS与监督NAS相比的有效性和通用性。代码见:https://github.com/Ernie1/Pi-NAS. 摘要:Recently proposed neural architecture search (NAS) methods co-train billions of architectures in a supernet and estimate their potential accuracy using the network weights detached from the supernet. However, the ranking correlation between the architectures' predicted accuracy and their actual capability is incorrect, which causes the existing NAS methods' dilemma. We attribute this ranking correlation problem to the supernet training consistency shift, including feature shift and parameter shift. Feature shift is identified as dynamic input distributions of a hidden layer due to random path sampling. The input distribution dynamic affects the loss descent and finally affects architecture ranking. Parameter shift is identified as contradictory parameter updates for a shared layer lay in different paths in different training steps. The rapidly-changing parameter could not preserve architecture ranking. We address these two shifts simultaneously using a nontrivial supernet-Pi model, called Pi-NAS. Specifically, we employ a supernet-Pi model that contains cross-path learning to reduce the feature consistency shift between different paths. Meanwhile, we adopt a novel nontrivial mean teacher containing negative samples to overcome parameter shift and model collision. Furthermore, our Pi-NAS runs in an unsupervised manner, which can search for more transferable architectures. Extensive experiments on ImageNet and a wide range of downstream tasks (e.g., COCO 2017, ADE20K, and Cityscapes) demonstrate the effectiveness and universality of our Pi-NAS compared to supervised NAS. See Codes: https://github.com/Ernie1/Pi-NAS.

【2】 D-DARTS: Distributed Differentiable Architecture Search 标题:D-DARTS:分布式可区分体系结构搜索 链接:https://arxiv.org/abs/2108.09306

作者:Alexandre Heuillet,Hedi Tabia,Hichem Arioui,Kamal Youcef-Toumi 机构:Universit´e Paris-Saclay, MIT 摘要:可微结构搜索(DARTS)是最流行的神经结构搜索(NAS)方法之一,通过采用随机梯度下降(SGD)和权重共享,大大降低了搜索成本。然而,它也大大减少了搜索空间,从而排除了潜在的有希望的架构被发现。在本文中,我们提出了D-DART,这是一种新的解决方案,通过在单元级别嵌套多个神经网络来解决此问题,而不是使用权重共享来生成更多样化和专用的体系结构。此外,我们还介绍了一种新的算法,该算法可以从少量训练单元中派生出更深层次的体系结构,从而提高了性能并节省了计算时间。我们的解决方案能够在CIFAR-10、CIFAR-100和ImageNet上提供最先进的结果,同时使用的参数比以前的基线少得多,从而产生更具硬件效率的神经网络。 摘要:Differentiable ARchiTecture Search (DARTS) is one of the most trending Neural Architecture Search (NAS) methods, drastically reducing search cost by resorting to Stochastic Gradient Descent (SGD) and weight-sharing. However, it also greatly reduces the search space, thus excluding potential promising architectures from being discovered. In this paper, we propose D-DARTS, a novel solution that addresses this problem by nesting several neural networks at cell-level instead of using weight-sharing to produce more diversified and specialized architectures. Moreover, we introduce a novel algorithm which can derive deeper architectures from a few trained cells, increasing performance and saving computation time. Our solution is able to provide state-of-the-art results on CIFAR-10, CIFAR-100 and ImageNet while using significantly less parameters than previous baselines, resulting in more hardware-efficient neural networks.

OCR|文本相关(2篇)

【1】 External Knowledge Augmented Text Visual Question Answering 标题:外部知识增强的文本视觉问答 链接:https://arxiv.org/abs/2108.09717

作者:Arka Ujjal Dey,Ernest Valveny,Gaurav Harit 机构:IIT Jodhpur, Rajasthan, India, Computer Vision Center, Universitat Autonoma de Barcelona, Bellaterra (Barcelona) 摘要:文本VQA的开放式问答任务需要阅读和推理图像的局部(通常是以前看不见的)场景文本内容,以生成答案。在这项工作中,我们建议广泛使用外部知识,以增强我们对上述场景文本的理解。我们设计了一个框架来提取、过滤和编码标准多模式转换器上的知识,用于视觉语言理解任务。通过经验证据,我们展示了知识如何突出实例线索,从而帮助处理训练数据偏差,提高答案实体类型的正确性,并检测多词命名实体。在类似上游OCR系统和训练数据的约束下,我们在两个公开可用的数据集上生成了与最新技术相当的结果。 摘要:The open-ended question answering task of Text-VQA requires reading and reasoning about local, often previously unseen, scene-text content of an image to generate answers. In this work, we propose the generalized use of external knowledge to augment our understanding of the said scene-text. We design a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks. Through empirical evidence, we demonstrate how knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We generate results comparable to the state-of-the-art on two publicly available datasets, under the constraints of similar upstream OCR systems and training data.

【2】 From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network 标题:从二到一:一种基于可视化语言建模网络的新型场景文本识别器 链接:https://arxiv.org/abs/2108.09661

作者:Yuxin Wang,Hongtao Xie,Shancheng Fang,Jing Wang,Shenggao Zhu,Yongdong Zhang 机构:University of Science and Technology of China, Huawei Cloud & AI 备注:Accepted by ICCV2021 摘要:在本文中,我们放弃了占主导地位的复杂语言模型,重新思考了场景文本识别中的语言学习过程。与以往将视觉和语言信息分为两个独立结构的方法不同,我们提出了一种视觉语言建模网络(VisionLAN),它通过直接赋予视觉模型语言能力,将视觉和语言信息视为一个联合体。特别地,我们在训练阶段引入了字符级遮挡特征图的文本识别。这种操作指导视觉模型在视觉线索混淆(如遮挡、噪声等)时,不仅使用字符的视觉纹理,而且使用视觉上下文中的语言信息进行识别。由于语言信息与视觉特征一起获取,无需额外的语言模型,VisionLAN将速度显著提高39%,并自适应地考虑语言信息以增强视觉特征,从而实现准确识别。此外,还提出了一个遮挡场景文本(OST)数据集来评估在缺少字符视觉线索的情况下的性能。几个基准的最新结果证明了我们的有效性。代码和数据集可在https://github.com/wangyuxin87/VisionLAN. 摘要:In this paper, we abandon the dominant complex language model and rethink the linguistic learning process in the scene text recognition. Different from previous methods considering the visual and linguistic information in two separate structures, we propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union by directly enduing the vision model with language capability. Specially, we introduce the text recognition of character-wise occluded feature maps in the training stage. Such operation guides the vision model to use not only the visual texture of characters, but also the linguistic information in visual context for recognition when the visual cues are confused (e.g. occlusion, noise, etc.). As the linguistic information is acquired along with visual features without the need of extra language model, VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition. Furthermore, an Occlusion Scene Text (OST) dataset is proposed to evaluate the performance on the case of missing character-wise visual cues. The state-of-the-art results on several benchmarks prove our effectiveness. Code and dataset are available at https://github.com/wangyuxin87/VisionLAN.

人脸|人群计数(2篇)

【1】 Modeling Dynamics of Facial Behavior for Mental Health Assessment 标题:心理健康评估中面部行为的建模动力学研究 链接:https://arxiv.org/abs/2108.09934

作者:Minh Tran,Ellen Bradley,Michelle Matvey,Joshua Woolley,Mohammad Soleymani 机构:Computer Science Department, University of Southern California, Los Angeles, USA, University of California San Francisco, San Francisco, USA 备注:Accepted to FG 2021 摘要:面部动作单元(FAU)强度是分析面部行为的常用描述符。然而,由于同一时刻通常只有少数FAU被激活,FAU的表示是稀疏的。在这项研究中,我们探索了通过采用自然语言处理中用于单词表示的算法来表示面部表情动态的可能性。具体地说,我们先对一个包含5.3M帧的大型时序面部表情数据集进行聚类,再应用全局向量表示(GloVe)算法学习各人脸簇的嵌入。我们评估了我们学习到的表征在两个下游任务上的有用性:精神分裂症症状估计和抑郁严重度回归。这些实验结果表明,与单独使用FAU强度的基线模型相比,我们的方法在改善心理健康症状评估方面具有潜在的有效性。 摘要:Facial action unit (FAU) intensities are popular descriptors for the analysis of facial behavior. However, FAUs are sparsely represented when only a few are activated at a time. In this study, we explore the possibility of representing the dynamics of facial expressions by adopting algorithms used for word representation in natural language processing. Specifically, we perform clustering on a large dataset of temporal facial expressions with 5.3M frames before applying the Global Vector representation (GloVe) algorithm to learn the embeddings of the facial clusters. We evaluate the usefulness of our learned representations on two downstream tasks: schizophrenia symptom estimation and depression severity regression. These experimental results show the potential effectiveness of our approach for improving the assessment of mental health symptoms over baseline models that use FAU intensities alone.
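其流程可用如下示意代码理解(假设性实现,数据为随机占位):先将逐帧AU强度向量用K-means量化为"面部词",在时间窗口内统计距离加权共现,再对GloVe目标做若干轮SGD。为简洁起见,这里使用单一嵌入矩阵的对称变体,而非标准GloVe的词/上下文双矩阵。

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

frames = np.random.rand(10000, 17)                 # stand-in for per-frame AU intensities
ids = KMeans(n_clusters=64, n_init=4).fit_predict(frames)   # "facial words"

window, counts = 5, Counter()                      # distance-weighted co-occurrence
for i in range(len(ids)):
    for j in range(i + 1, min(i + 1 + window, len(ids))):
        counts[(ids[i], ids[j])] += 1.0 / (j - i)

V, d, lr = 64, 16, 0.05                            # GloVe: w_i.w_j + b_i + b_j ~ log X_ij
W, b = 0.1 * np.random.randn(V, d), np.zeros(V)
for _ in range(25):
    for (i, j), x in counts.items():
        f = min(1.0, (x / 20.0) ** 0.75)           # GloVe weighting f(X_ij)
        err = W[i] @ W[j] + b[i] + b[j] - np.log(x)
        W[i], W[j] = W[i] - lr * f * err * W[j], W[j] - lr * f * err * W[i]
        b[i] -= lr * f * err; b[j] -= lr * f * err
```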

【2】 A Synthesis-Based Approach for Thermal-to-Visible Face Verification 标题:一种基于合成的热转可见光人脸验证方法 链接:https://arxiv.org/abs/2108.09558

作者:Neehar Peri,Joshua Gleason,Carlos D. Castillo,Thirimachos Bourlai,Vishal M. Patel,Rama Chellappa 机构: MUKH Technologies, University of Maryland, Johns Hopkins University, University of Georgia 摘要:近年来,可见光谱人脸验证系统已被证明与法医鉴定专家的识别性能相匹配。然而,这种系统在低光和夜间条件下是无效的。热人脸图像捕捉人体热量释放,有效地增强可见光谱,在光照有限的场景中捕捉有区别的面部特征。由于获取不同的成对热光谱和可见光谱数据集的成本和难度增加,用于弱光识别的算法和大规模基准受到限制。本文提出了一种在ARL-VTF和TUFTS多光谱人脸数据集上实现最先进性能的算法。重要的是,我们研究了人脸对齐、像素级对应和带有标签平滑的身份分类对多光谱人脸合成和验证的影响。结果表明,该方法具有广泛的适用性、鲁棒性和高效性。此外,我们还表明,该方法在从侧面到正面的验证上明显优于人脸正面化方法。最后,我们介绍了MILAB-VTF(B),这是一个具有挑战性的多光谱人脸数据集,由成对的热视频和可见视频组成。据我们所知,该数据集收集了400名受试者的面部数据,代表了最广泛的公开室内和远程室外热可见面部图像。最后,我们展示了我们的端到端热-可见人脸验证系统在MILAB-VTF(B)数据集上提供了强大的性能。 摘要:In recent years, visible-spectrum face verification systems have been shown to match expert forensic examiner recognition performance. However, such systems are ineffective in low-light and nighttime conditions. Thermal face imagery, which captures body heat emissions, effectively augments the visible spectrum, capturing discriminative facial features in scenes with limited illumination. Due to the increased cost and difficulty of obtaining diverse, paired thermal and visible spectrum datasets, algorithms and large-scale benchmarks for low-light recognition are limited. This paper presents an algorithm that achieves state-of-the-art performance on both the ARL-VTF and TUFTS multi-spectral face datasets. Importantly, we study the impact of face alignment, pixel-level correspondence, and identity classification with label smoothing for multi-spectral face synthesis and verification. We show that our proposed method is widely applicable, robust, and highly effective. In addition, we show that the proposed method significantly outperforms face frontalization methods on profile-to-frontal verification. Finally, we present MILAB-VTF(B), a challenging multi-spectral face dataset that is composed of paired thermal and visible videos. To the best of our knowledge, with face data from 400 subjects, this dataset represents the most extensive collection of publicly available indoor and long-range outdoor thermal-visible face imagery. Lastly, we show that our end-to-end thermal-to-visible face verification system provides strong performance on the MILAB-VTF(B) dataset.

跟踪(1篇)

【1】 Beyond Tracking: Using Deep Learning to Discover Novel Interactions in Biological Swarms 标题:超越跟踪:利用深度学习发现生物群中的新互动 链接:https://arxiv.org/abs/2108.09394

作者:Taeyeong Choi,Benjamin Pyenson,Juergen Liebig,Theodore P. Pavlic 机构: Lincoln Institute for Agri-food Technology, University of Lincoln, Riseholme Park, Lincoln, UK, School of Life Sciences, Social Insect Research Group, School of Computing, Informatics, and Decision Systems Engineering 备注:Best Paper Winner at the 4th International Symposium on Swarm Behavior and Bio-Inspired Robotics (SWARM 2021) 摘要:大多数用于理解生物群的深度学习框架旨在将群体行为的感知模型与单独从视频观察中收集的个体层面数据(例如,个体识别特征的空间坐标)相匹配。尽管在自动跟踪方面取得了相当大的进步,但当同时跟踪大量动物时,这些方法仍然非常昂贵或不可靠。此外,这种方法假设人类选择的特征包括足够的特征来解释集体行为中的重要模式。为了解决这些问题,我们建议训练深度网络模型,直接从整个视图的通用图形特征预测系统级状态,以完全自动化的方式收集这些特征相对便宜。由于生成的预测模型并非基于人类理解的预测因子,因此我们使用解释性模块(例如。,Grad-CAM)将隐藏在深层网络模型潜在变量中的信息与视频数据本身结合起来,与人类观察者交流观察到的个体行为的哪些方面在预测群体行为方面最具信息性。这代表了行为生态学中智能增强的一个例子——人类人工智能团队中的知识共同创造。作为概念证明,我们利用20天对50多只Harpegnathos saltator蚂蚁群体的视频记录来展示,在不提供任何单独注释的情况下,经过训练的模型可以生成跨视频帧的“重要性地图”,以突出重要行为的区域,例如决斗(AI对决斗没有先验知识),这在解决生殖层次重组方面发挥了作用。基于实证结果,我们还讨论了潜在的用途和当前的挑战。 摘要:Most deep-learning frameworks for understanding biological swarms are designed to fit perceptive models of group behavior to individual-level data (e.g., spatial coordinates of identified features of individuals) that have been separately gathered from video observations. Despite considerable advances in automated tracking, these methods are still very expensive or unreliable when tracking large numbers of animals simultaneously. Moreover, this approach assumes that the human-chosen features include sufficient features to explain important patterns in collective behavior. To address these issues, we propose training deep network models to predict system-level states directly from generic graphical features from the entire view, which can be relatively inexpensive to gather in a completely automated fashion. Because the resulting predictive models are not based on human-understood predictors, we use explanatory modules (e.g., Grad-CAM) that combine information hidden in the latent variables of the deep-network model with the video data itself to communicate to a human observer which aspects of observed individual behaviors are most informative in predicting group behavior. This represents an example of augmented intelligence in behavioral ecology -- knowledge co-creation in a human-AI team. As proof of concept, we utilize a 20-day video recording of a colony of over 50 Harpegnathos saltator ants to showcase that, without any individual annotations provided, a trained model can generate an "importance map" across the video frames to highlight regions of important behaviors, such as dueling (which the AI has no a priori knowledge of), that play a role in the resolution of reproductive-hierarchy re-formation. Based on the empirical results, we also discuss the potential use and current challenges.
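文中提到的解释性模块(如Grad-CAM)的核心做法可用下面的通用示意代码表达(与本文的具体模型无关):用类别得分对选定卷积层的梯度做空间平均池化得到通道权重,再加权该层激活并取ReLU,得到"重要性图"。

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, x, class_idx):
    # Standard Grad-CAM: channel weights = spatially pooled gradients of the
    # class score w.r.t. the chosen conv layer's activations.
    acts, grads = [], []
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.zero_grad()
    model(x)[0, class_idx].backward()                # gradient of the class score
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)      # GAP over spatial gradients
    cam = F.relu((w * acts[0]).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
```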

图像视频检索|Re-id相关(1篇)

【1】 Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image 标题:Patch2CAD:基于补丁嵌入学习的单图像野外形状检索 链接:https://arxiv.org/abs/2108.09368

作者:Weicheng Kuo,Anelia Angelova,Tsung-Yi Lin,Angela Dai 机构: Google Research, Brain Team, Technical University of Munich 备注:To appear at ICCV 2021(IEEE/CVF International Conference on Computer Vision) 摘要:从RGB图像输入对物体形状的三维感知是语义场景理解的基础,是我们空间三维真实环境中基于图像的感知的基础。为了实现对象图像视图和三维形状之间的映射,我们利用现有大型数据库中的CAD模型优先级,并提出了一种以面片方式在二维图像和三维CAD模型之间构建联合嵌入空间的新方法——在对象图像视图的面片和CAD几何体的面片之间建立对应关系。这使零件相似性推理能够检索到新图像视图中的相似CAD,而无需在数据库中进行精确匹配。在我们对单个输入图像中检测到的对象的CAD模型形状和姿势的端到端估计中,我们的面片嵌入为形状估计提供了更稳健的CAD检索。对来自ScanNet的野外复杂图像进行的实验表明,在没有任何精确CAD匹配的真实场景中,我们的方法比最先进的方法更稳健。 摘要:3D perception of object shapes from RGB image input is fundamental towards semantic scene understanding, grounding image-based perception in our spatially 3-dimensional real-world environments. To achieve a mapping between image views of objects and 3D shapes, we leverage CAD model priors from existing large-scale databases, and propose a novel approach towards constructing a joint embedding space between 2D images and 3D CAD models in a patch-wise fashion -- establishing correspondences between patches of an image view of an object and patches of CAD geometry. This enables part similarity reasoning for retrieving similar CADs to a new image view without exact matches in the database. Our patch embedding provides more robust CAD retrieval for shape estimation in our end-to-end estimation of CAD model shape and pose for detected objects in a single input image. Experiments on in-the-wild, complex imagery from ScanNet show that our approach is more robust than state of the art in real-world scenarios without any exact CAD matches.

裁剪|量化|加速|压缩相关(1篇)

【1】 Integer-arithmetic-only Certified Robustness for Quantized Neural Networks 标题:量化神经网络的仅整数运算认证鲁棒性 链接:https://arxiv.org/abs/2108.09413

作者:Haowen Lin,Jian Lou,Li Xiong,Cyrus Shahabi 机构:University of Southern California, Emory University, Xidian University 摘要:对抗性数据示例引起了机器学习和安全社区的极大关注。处理对抗性示例的一系列工作通过随机平滑验证了鲁棒性,可提供理论上的鲁棒性保证。然而,这种机制通常使用浮点算法进行推理计算,并且需要大量内存占用和令人望而生畏的计算成本。这些防御模型既不能在边缘设备上高效运行,也不能部署在纯整数逻辑单元(如图灵张量核或纯整数ARM处理器)上。为了克服这些挑战,我们提出了一种带量化的整数随机化平滑方法,将任何分类器转换为一个新的平滑分类器,该方法使用仅整数算法来证明对对抗性扰动的鲁棒性。我们证明了该方法在L2范数下的严格鲁棒性保证。我们表明,在通用CPU和移动设备上,在两个不同的数据集(CIFAR-10和Caltech-101)上,我们的方法可以获得与浮点算法认证的健壮方法相当的精度和4x~5x的加速比。 摘要:Adversarial data examples have drawn significant attention from the machine learning and security communities. A line of work on tackling adversarial examples is certified robustness via randomized smoothing that can provide a theoretical robustness guarantee. However, such a mechanism usually uses floating-point arithmetic for calculations in inference and requires large memory footprints and daunting computational costs. These defensive models cannot run efficiently on edge devices nor be deployed on integer-only logical units such as Turing Tensor Cores or integer-only ARM processors. To overcome these challenges, we propose an integer randomized smoothing approach with quantization to convert any classifier into a new smoothed classifier, which uses integer-only arithmetic for certified robustness against adversarial perturbations. We prove a tight robustness guarantee under L2-norm for the proposed approach. We show our approach can obtain a comparable accuracy and 4x~5x speedup over floating-point arithmetic certified robust methods on general-purpose CPUs and mobile devices on two distinct datasets (CIFAR-10 and Caltech-101).
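作为背景,下面示意了本文所基于的随机平滑认证流程(Cohen等人的浮点版本):对输入加高斯噪声取多数类,并用Clopper-Pearson下界给出认证的L2半径 sigma * Phi^{-1}(pA)。本文的贡献在于把其中的推理计算改为纯整数运算并加入量化,此处未复现。

```python
import numpy as np
from scipy.stats import beta, norm

def certify(predict, x, sigma=0.25, n0=100, n=1000, alpha=0.001):
    # predict: maps a batch of noisy copies of x to integer class labels
    guess = predict(x + sigma * np.random.randn(n0, *x.shape))
    top = np.bincount(guess).argmax()                       # candidate class
    votes = predict(x + sigma * np.random.randn(n, *x.shape))
    k = int((votes == top).sum())
    p_lo = beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0  # Clopper-Pearson lower bound
    if p_lo <= 0.5:
        return None, 0.0                                    # abstain
    return int(top), sigma * norm.ppf(p_lo)                 # certified L2 radius
```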

点云|SLAM|雷达|激光|深度RGBD相关(2篇)

【1】 Vogtareuth Rehab Depth Datasets: Benchmark for Marker-less Posture Estimation in Rehabilitation 标题:Vogtareuth康复深度数据集:康复中无标记姿势估计的基准 链接:https://arxiv.org/abs/2108.10272

作者:Soubarna Banik,Alejandro Mendoza Garcia,Lorenz Kiwull,Steffen Berweck,Alois Knoll 备注:To appear in Proc. IEEE EMBC; 4 pages 摘要:使用单深度相机进行姿势估计已成为分析康复运动的有用工具。由于大规模姿态数据集的可用性,计算机视觉研究中姿态估计的最新进展成为可能。然而,在现有的基准深度数据集中,康复训练中涉及的复杂姿势并不存在。为了解决这一局限性,我们提出了两个康复特定姿势数据集,其中包含进行康复训练的成人和儿童患者的深度图像和2D姿势信息。我们使用最先进的无标记姿势估计模型,该模型在非康复基准数据集上训练。我们在我们的康复数据集上对其进行评估,并观察到从非康复到康复的绩效显著下降,突出了对这些数据集的需求。我们表明,我们的数据集可用于训练姿势模型,以检测康复特定的复杂姿势。这些数据集将为研究界的利益而发布。 摘要:Posture estimation using a single depth camera has become a useful tool for analyzing movements in rehabilitation. Recent advances in posture estimation in computer vision research have been possible due to the availability of large-scale pose datasets. However, the complex postures involved in rehabilitation exercises are not represented in the existing benchmark depth datasets. To address this limitation, we propose two rehabilitation-specific pose datasets containing depth images and 2D pose information of patients, both adult and children, performing rehab exercises. We use a state-of-the-art marker-less posture estimation model which is trained on a non-rehab benchmark dataset. We evaluate it on our rehab datasets, and observe that the performance degrades significantly from non-rehab to rehab, highlighting the need for these datasets. We show that our dataset can be used to train pose models to detect rehab-specific complex postures. The datasets will be released for the benefit of the research community.

【2】 DSP-SLAM: Object Oriented SLAM with Deep Shape Priors 标题:DSP-SLAM:具有深度形状先验的面向对象SLAM 链接:https://arxiv.org/abs/2108.09481

作者:Jingwen Wang,Martin Rünz,Lourdes Agapito 机构:Department of Computer Science, University College London 摘要:我们提出了DSP-SLAM,这是一个面向对象的SLAM系统,它为前景对象构建了一个丰富而精确的密集3D模型的联合地图,并用稀疏的地标点来表示背景。DSP-SLAM将基于特征的SLAM系统重建的三维点云作为输入,并使其具备通过密集重建检测对象来增强其稀疏地图的能力。通过语义实例分割检测目标,并通过一种新的二阶优化算法,以特定类别的深形状嵌入作为先验估计目标的形状和姿态。我们的对象感知光束法平差(bundle adjustment)构建位姿图,以联合优化摄影机姿势、对象位置和特征点。DSP-SLAM可以在3种不同的输入模式下以每秒10帧的速度工作:单目、双目或双目+激光雷达。我们演示了DSP-SLAM在Freiburg和Redwood-OS数据集的单目RGB序列和KITTI里程计数据集的双目+激光雷达序列上以几乎帧速率运行,表明它实现了高质量的全对象重建,即使是部分观测,同时保持了一致的全局地图。我们的评估显示,与最近基于深度先验的重建方法相比,物体姿态和形状重建有了改进,并减少了KITTI数据集上的相机跟踪漂移。 摘要:We propose DSP-SLAM, an object-oriented SLAM system that builds a rich and accurate joint map of dense 3D models for foreground objects, and sparse landmark points to represent the background. DSP-SLAM takes as input the 3D point cloud reconstructed by a feature-based SLAM system and equips it with the ability to enhance its sparse map with dense reconstructions of detected objects. Objects are detected via semantic instance segmentation, and their shape and pose is estimated using category-specific deep shape embeddings as priors, via a novel second order optimization. Our object-aware bundle adjustment builds a pose-graph to jointly optimize camera poses, object locations and feature points. DSP-SLAM can operate at 10 frames per second on 3 different input modalities: monocular, stereo, or stereo+LiDAR. We demonstrate DSP-SLAM operating at almost frame rate on monocular-RGB sequences from the Freiburg and Redwood-OS datasets, and on stereo+LiDAR sequences on the KITTI odometry dataset showing that it achieves high-quality full object reconstructions, even from partial observations, while maintaining a consistent global map. Our evaluation shows improvements in object pose and shape reconstruction with respect to recent deep prior-based reconstruction methods and reductions in camera tracking drift on the KITTI dataset.

3D|3D重建等相关(3篇)

【1】 Separable Convolutions for Optimizing 3D Stereo Networks 标题:用于优化三维立体网络的可分离卷积 链接:https://arxiv.org/abs/2108.10216

作者:Rafia Rahim,Faranak Shamsafar,Andreas Zell 机构:Department of Computer Science (WSI), University of Tuebingen, Germany 备注:Accepted at IEEE International Conference on Image Processing, ICIP, 2021 摘要:与2D网络和传统立体方法相比,基于深度学习的3D立体网络具有优越的性能。然而,这种性能的提高是以计算复杂度的增加为代价的,因此这些网络在实际应用中不实用。具体地说,这些网络使用3D卷积作为主力来细化和回归视差。在这项工作中,我们首先表明立体网络中的这些3D卷积消耗了高达94%的整体网络操作,并成为主要的瓶颈。接下来,我们提出了一组"即插即用"可分离卷积,以减少参数和操作的数量。当与现有最先进的立体网络集成时,这些卷积可在不影响其性能的情况下,使操作数减少7倍,参数减少3.5倍。事实上,在大多数情况下,这些卷积导致性能的改善。 摘要:Deep learning based 3D stereo networks give superior performance compared to 2D networks and conventional stereo methods. However, this improvement in the performance comes at the cost of increased computational complexity, thus making these networks non-practical for the real-world applications. Specifically, these networks use 3D convolutions as a major work horse to refine and regress disparities. In this work first, we show that these 3D convolutions in stereo networks consume up to 94% of overall network operations and act as a major bottleneck. Next, we propose a set of "plug-&-run" separable convolutions to reduce the number of parameters and operations. When integrated with the existing state of the art stereo networks, these convolutions lead up to 7x reduction in number of operations and up to 3.5x reduction in parameters without compromising their performance. In fact these convolutions lead to improvement in their performance in the majority of cases.
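"深度可分离"是实现此类可分离3D卷积的一种常见方式,示意如下(作者的具体分解形式可能不同):逐通道的3x3x3卷积后接1x1x1逐点卷积,3x3x3核的参数量由 C_in*C_out*27 降为 C_in*27 + C_in*C_out。

```python
import torch.nn as nn

class SeparableConv3d(nn.Module):
    # Depthwise-separable 3D convolution: per-channel 3x3x3 spatial filtering
    # (groups=in_ch) followed by a 1x1x1 pointwise channel mix.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```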

【2】 Black-Box Test-Time Shape REFINEment for Single View 3D Reconstruction 标题:单视图三维重建的黑盒测试时间形状细化 链接:https://arxiv.org/abs/2108.09911

作者:Brandon Leung,Chih-Hui Ho,Nuno Vasconcelos 机构:UC San Diego 摘要:最近在从物体的图像重建物体的三维形状方面取得了许多进展,即单视图三维重建。然而,有人建议,目前的方法只是采用“最近邻”策略,而不是真正理解输入图像背后的形状。在本文中,我们严格证明,对于许多最先进的方法,这个问题表现为:(1)粗略重建和输入图像之间的不一致,(2)无法跨域推广。因此,我们提出了REFINE,这是一个后处理网格细化步骤,可以很容易地集成到文献中任何黑盒方法的管道中。在测试时,REFINE会针对每个网格实例优化网络,以鼓励网格和给定对象视图之间的一致性。这与调整损耗的新组合一起,减少了域差距并实现了最先进的性能。我们相信,这一新的范式是朝着稳健、准确的重建迈出的重要一步,随着新重建网络的引入,重建仍然具有相关性。 摘要:Much recent progress has been made in reconstructing the 3D shape of an object from an image of it, i.e. single view 3D reconstruction. However, it has been suggested that current methods simply adopt a "nearest-neighbor" strategy, instead of genuinely understanding the shape behind the input image. In this paper, we rigorously show that for many state of the art methods, this issue manifests as (1) inconsistencies between coarse reconstructions and input images, and (2) inability to generalize across domains. We thus propose REFINE, a postprocessing mesh refinement step that can be easily integrated into the pipeline of any black-box method in the literature. At test time, REFINE optimizes a network per mesh instance, to encourage consistency between the mesh and the given object view. This, along with a novel combination of regularizing losses, reduces the domain gap and achieves state of the art performance. We believe that this novel paradigm is an important step towards robust, accurate reconstructions, remaining relevant as new reconstruction networks are introduced.

【3】 3D Reconstruction from public webcams 标题:基于公共网络摄像机的三维重建 链接:https://arxiv.org/abs/2108.09476

作者:Tianyu Wu,Konrad Schindler,Cenek Albl 机构:ETH Zurich 摘要:在本文中,我们研究了重建由多个网络摄像头捕获的场景的三维几何体的可能性。可公开访问的网络摄像头数量已经很大,并且每天都在增长。一个合乎逻辑的问题出现了——我们能否将这些免费数据源用于休闲活动之外的事情?挑战在于这些摄像机没有内部、外部或时间校准。我们展示了利用计算机视觉的最新进展,我们成功地校准了摄像机,执行了静态场景的三维重建,并恢复了运动对象的三维轨迹。 摘要:In this paper, we investigate the possibility of reconstructing the 3D geometry of a scene captured by multiple webcams. The number of publicly accessible webcams is already large and it is growing every day. A logical question arises - can we use this free source of data for something beyond leisure activities? The challenge is that no internal, external, or temporal calibration of these cameras is available. We show that using recent advances in computer vision, we successfully calibrate the cameras, perform 3D reconstructions of the static scene and also recover the 3D trajectories of moving objects.

其他神经网络|深度学习|模型|建模(13篇)

【1】 Ranking Models in Unlabeled New Environments 标题:在未标记的新环境中对模型进行排序 链接:https://arxiv.org/abs/2108.10310

作者:Xiaoxiao Sun,Yunzhong Hou,Weijian Deng,Hongdong Li,Liang Zheng 机构:Australian National University 备注:13 pages, 10 figures, ICCV2021 摘要:考虑一个场景,在该场景中,我们提供了一些在特定源域上训练的可用模型,并希望根据模型的相对性能直接将最合适的模型应用到不同的目标域。理想情况下,我们应该为每个新的目标环境上的模型性能评估注释一个验证集,但是这样的注释通常非常昂贵。在这种情况下,我们引入了未标记新环境中的排序模型问题。对于这个问题,我们建议采用一个代理数据集,1)完全标记,2)很好地反映给定目标环境中的真实模型排名,并使用代理集上的性能排名作为代理。我们首先选择带标签的数据集作为代理。具体而言,发现更类似于未标记目标域的数据集可以更好地保存相对性能排名。基于此,我们进一步建议通过从与目标具有相似分布的各种数据集中采样图像来搜索代理集。我们在人员重新识别(re-ID)任务中分析了该问题及其解决方案,对于该任务,有足够的数据集是公开的,并且表明精心构造的代理集有效地捕获了新环境中的相对性能排名。代码位于 https://github.com/sxzrt/Proxy-Set 。 摘要:Consider a scenario where we are supplied with a number of ready-to-use models trained on a certain source domain and hope to directly apply the most appropriate ones to different target domains based on the models' relative performance. Ideally we should annotate a validation set for model performance assessment on each new target environment, but such annotations are often very expensive. Under this circumstance, we introduce the problem of ranking models in unlabeled new environments. For this problem, we propose to adopt a proxy dataset that 1) is fully labeled and 2) well reflects the true model rankings in a given target environment, and use the performance rankings on the proxy sets as surrogates. We first select labeled datasets as the proxy. Specifically, datasets that are more similar to the unlabeled target domain are found to better preserve the relative performance rankings. Motivated by this, we further propose to search the proxy set by sampling images from various datasets that have similar distributions as the target. We analyze the problem and its solutions on the person re-identification (re-ID) task, for which sufficient datasets are publicly available, and show that a carefully constructed proxy set effectively captures relative performance ranking in new environments. Code is available at https://github.com/sxzrt/Proxy-Set.
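衡量一个候选代理集好坏的标准,可以用"代理集上的模型精度排名"与"目标环境真实排名"之间的Spearman相关来示意(下列数值为虚构示例,仅说明评价方式;论文的搜索过程还会从分布与目标相近的数据集中采样图像):

```python
from scipy.stats import spearmanr

def proxy_quality(acc_on_proxy, acc_on_target):
    # Spearman rank correlation between per-model accuracies on the proxy
    # set and on a (held-out, labeled) target set: 1.0 = ranking preserved.
    rho, _ = spearmanr(acc_on_proxy, acc_on_target)
    return rho

# e.g. five re-ID models evaluated on a candidate proxy vs. the true environment
print(proxy_quality([71.2, 64.5, 80.1, 59.8, 75.0],
                    [68.0, 61.3, 77.9, 55.2, 72.4]))   # -> 1.0
```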

【2】 Deep Relational Metric Learning 标题:深度关系度量学习 链接:https://arxiv.org/abs/2108.10026

作者:Wenzhao Zheng,Borui Zhang,Jiwen Lu,Jie Zhou 机构:Department of Automation, Tsinghua University, China, Beijing National Research Center for Information Science and Technology, China 备注:Accepted to ICCV 2021. Source code available at this https URL 摘要:提出了一种用于图像聚类和检索的深度关系度量学习(DRML)框架。大多数现有的深度度量学习方法学习嵌入空间,其总体目标是增加类间距离和减少类内距离。然而,传统的度量学习损失通常会抑制组内变化,这可能有助于识别未知类的样本。为了解决这个问题,我们建议自适应地学习从不同方面表征图像的特征集合,以建模类内和类内分布。我们进一步使用关系模块来捕获集合中每个特征之间的相关性,并构造一个图来表示图像。然后,我们在图上执行关系推理来集成集合,并获得关系感知嵌入来度量相似度。在广泛使用的CUB-200-2011、Cars196和斯坦福在线产品数据集上进行的大量实验表明,我们的框架改进了现有的深度度量学习方法,并取得了非常有竞争力的结果。 摘要:This paper presents a deep relational metric learning (DRML) framework for image clustering and retrieval. Most existing deep metric learning methods learn an embedding space with a general objective of increasing interclass distances and decreasing intraclass distances. However, the conventional losses of metric learning usually suppress intraclass variations which might be helpful to identify samples of unseen classes. To address this problem, we propose to adaptively learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions. We further employ a relational module to capture the correlations among each feature in the ensemble and construct a graph to represent an image. We then perform relational inference on the graph to integrate the ensemble and obtain a relation-aware embedding to measure the similarities. Extensive experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.

【3】 Learning Signed Distance Field for Multi-view Surface Reconstruction 标题:用于多视点曲面重建的带符号距离场学习 链接:https://arxiv.org/abs/2108.09964

作者:Jingyang Zhang,Yao Yao,Long Quan 机构:The Hong Kong University of Science and Technology 备注:ICCV 2021 (Oral) 摘要:最近关于隐式神经表示的研究表明,隐式神经表示对于多视图曲面重建具有良好的效果。然而,大多数方法仅限于相对简单的几何图形,通常需要干净的对象遮罩来重建复杂和凹面对象。在这项工作中,我们介绍了一种新的神经曲面重建框架,该框架利用立体匹配和特征一致性的知识来优化隐式曲面表示。更具体地说,我们分别应用有符号距离场(SDF)和曲面光场来表示场景几何体和外观。SDF由立体匹配中的几何图形直接监督,并通过优化多视图特征的一致性和渲染图像的保真度来细化。我们的方法能够提高几何估计的鲁棒性,并支持复杂场景拓扑的重建。在DTU、EPFL和Tanks and Temples数据集上进行了广泛的实验。与以前的最新方法相比,我们的方法在完全开放的场景中实现了更好的网格重建,无需使用遮罩作为输入。 摘要:Recent works on implicit neural representations have shown promising results for multi-view surface reconstruction. However, most approaches are limited to relatively simple geometries and usually require clean object masks for reconstructing complex and concave objects. In this work, we introduce a novel neural surface reconstruction framework that leverages the knowledge of stereo matching and feature consistency to optimize the implicit surface representation. More specifically, we apply a signed distance field (SDF) and a surface light field to represent the scene geometry and appearance respectively. The SDF is directly supervised by geometry from stereo matching, and is refined by optimizing the multi-view feature consistency and the fidelity of rendered images. Our method is able to improve the robustness of geometry estimation and support reconstruction of complex scene topologies. Extensive experiments have been conducted on DTU, EPFL and Tanks and Temples datasets. Compared to previous state-of-the-art methods, our method achieves better mesh reconstruction in wide open scenes without masks as input.

【4】 CANet: A Context-Aware Network for Shadow Removal 标题:CANET:一种上下文感知的阴影去除网络 链接:https://arxiv.org/abs/2108.09894

作者:Zipei Chen,Chengjiang Long,Ling Zhang,Chunxia Xiao 机构:School of Computer Science, Wuhan University, Wuhan, Hubei, China, JD Finance America Corporation, Mountain View, CA, USA, Wuhan University of Science and Technology, Wuhan, Hubei, China 备注:This paper was accepted to the IEEE International Conference on Computer Vision (ICCV), Montreal, Canada, Oct 11-17, 2021 摘要:在本文中,我们提出了一种新的用于阴影去除的两阶段上下文感知网络CANet,该网络将来自非阴影区域的上下文信息转移到嵌入特征空间中的阴影区域。在第一阶段,我们提出了一个上下文补丁匹配(CPM)模块来生成一组阴影和非阴影补丁的潜在匹配对。结合阴影和非阴影区域之间潜在的上下文关系,我们精心设计的上下文特征转移(CFT)机制可以在不同的尺度上将上下文信息从非阴影区域转移到阴影区域。利用重构后的特征图,我们分别去除L和A/B通道上的阴影。在第二阶段,我们使用编码器-解码器优化当前结果,并生成最终阴影消除结果。我们在两个基准数据集和一些具有复杂场景的真实阴影图像上评估了我们提出的CANet。大量的实验结果有力地证明了我们提出的CANet的有效性,并显示出优于现有技术的性能。 摘要:In this paper, we propose a novel two-stage context-aware network named CANet for shadow removal, in which the contextual information from non-shadow regions is transferred to shadow regions at the embedded feature spaces. At Stage-I, we propose a contextual patch matching (CPM) module to generate a set of potential matching pairs of shadow and non-shadow patches. Combined with the potential contextual relationships between shadow and non-shadow regions, our well-designed contextual feature transfer (CFT) mechanism can transfer contextual information from non-shadow to shadow regions at different scales. With the reconstructed feature maps, we remove shadows at L and A/B channels separately. At Stage-II, we use an encoder-decoder to refine current results and generate the final shadow removal results. We evaluate our proposed CANet on two benchmark datasets and some real-world shadow images with complex scenes. Extensive experimental results strongly demonstrate the efficacy of our proposed CANet and exhibit superior performance to state-of-the-arts.

【5】 MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching 标题:MobileStereoNet:面向立体匹配的轻量级深度网络 链接:https://arxiv.org/abs/2108.09770

作者:Faranak Shamsafar,Samuel Woerz,Rafia Rahim,Andreas Zell 机构:University of Tuebingen, WSI Institute for Computer Science, Tuebingen, Germany 备注:Under review. Further figures and tables in the appendix. Code provided 摘要:最近的立体匹配方法使用深度模型不断提高精度。然而,该增益是通过计算成本的高增加来实现的,因此网络可能甚至不适合中等GPU。当需要在资源有限的设备上部署模型时,此问题会引发问题。为此,我们提出了两种用于立体视觉的光照模型,既降低了复杂度,又不牺牲精度。根据成本量的大小,我们设计了一个二维和三维模型,分别使用二维和三维卷积构建编码器-解码器。为此,我们利用二维MobileNet块,并将其扩展到三维立体视觉应用。此外,还提出了一种新的成本量,以提高2D模型的精度,使其性能接近3D网络。实验表明,所提出的2D/3D网络在保持精度的同时,有效地减少了计算开销(2D和3D模型中的参数/操作分别减少27%/95%和72%/38%)。我们的代码可在https://github.com/cogsys-tuebingen/mobilestereonet. 摘要:Recent methods in stereo matching have continuously improved the accuracy using deep models. This gain, however, is attained with a high increase in computation cost, such that the network may not fit even on a moderate GPU. This issue raises problems when the model needs to be deployed on resource-limited devices. For this, we propose two light models for stereo vision with reduced complexity and without sacrificing accuracy. Depending on the dimension of cost volume, we design a 2D and a 3D model with encoder-decoders built from 2D and 3D convolutions, respectively. To this end, we leverage 2D MobileNet blocks and extend them to 3D for stereo vision application. Besides, a new cost volume is proposed to boost the accuracy of the 2D model, making it performing close to 3D networks. Experiments show that the proposed 2D/3D networks effectively reduce the computational expense (27%/95% and 72%/38% fewer parameters/operations in 2D and 3D models, respectively) while upholding the accuracy. Our code is available at https://github.com/cogsys-tuebingen/mobilestereonet.

【6】 Learning of Visual Relations: The Devil is in the Tails 标题:视觉关系的学习:尾巴里的魔鬼 链接:https://arxiv.org/abs/2108.09668

作者:Alakh Desai,Tz-Ying Wu,Subarna Tripathi,Nuno Vasconcelos 机构:University of California San Diego, USA, Intel Labs, USA 备注:Accepted to ICCV 2021 摘要:最近,人们在视觉关系建模上投入了大量精力。这主要涉及架构的设计,通常是通过添加参数和增加模型复杂性。然而,视觉关系学习是一个长尾问题,因为关于对象组的联合推理具有组合性质。一般来说,模型复杂性的增加不适合长尾问题,因为它们有过度拟合的倾向。在本文中,我们探讨了另一种假设,即魔鬼在尾巴上。在这个假设下,通过保持模型简单,但提高其处理长尾分布的能力,可以获得更好的性能。为了验证这一假设,我们设计了一种新的训练视觉关系模型的方法,该方法的灵感来自最新的长尾识别文献。这基于一个迭代解耦训练方案,称为Decoupled Training for Devil in the Tails(DT2)。DT2采用了一种新的采样方法,交替类平衡采样(ACBS),以捕获视觉关系的长尾实体和谓词分布之间的相互作用。结果表明,DT2-ACBS具有极其简单的体系结构,在场景图生成任务中的性能明显优于更复杂的最新方法。这表明,复杂模型的开发必须与问题的长尾性结合起来考虑。 摘要:Significant effort has been recently devoted to modeling visual relations. This has mostly addressed the design of architectures, typically by adding parameters and increasing model complexity. However, visual relation learning is a long-tailed problem, due to the combinatorial nature of joint reasoning about groups of objects. Increasing model complexity is, in general, ill-suited for long-tailed problems due to their tendency to overfit. In this paper, we explore an alternative hypothesis, denoted the Devil is in the Tails. Under this hypothesis, better performance is achieved by keeping the model simple but improving its ability to cope with long-tailed distributions. To test this hypothesis, we devise a new approach for training visual relationships models, which is inspired by state-of-the-art long-tailed recognition literature. This is based on an iterative decoupled training scheme, denoted Decoupled Training for Devil in the Tails (DT2). DT2 employs a novel sampling approach, Alternating Class-Balanced Sampling (ACBS), to capture the interplay between the long-tailed entity and predicate distributions of visual relations. Results show that, with an extremely simple architecture, DT2-ACBS significantly outperforms much more complex state-of-the-art methods on scene graph generation tasks. This suggests that the development of sophisticated models must be considered in tandem with the long-tailed nature of the problem.
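类平衡采样本身的做法可示意如下(假设性实现;ACBS在此基础上于实例平衡与类平衡两种方案间交替,并同时作用于实体和谓词分布):先均匀抽取一个类,再从该类中抽取一个样本,使尾部类与头部类出现频率相当。

```python
import random
from collections import defaultdict

def class_balanced_batches(labels, batch_size, num_batches):
    # Pick a class uniformly, then an instance of it, so tail classes
    # appear as often as head classes (one half of an ACBS-style scheme).
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = list(by_class)
    for _ in range(num_batches):
        yield [random.choice(by_class[random.choice(classes)])
               for _ in range(batch_size)]
```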

【7】 SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function 标题:SERF:使用LOG-Softplus误差激活函数更好地训练深度神经网络 链接:https://arxiv.org/abs/2108.09598

作者:Sayan Nag,Mayukh Bhattacharyya 机构:University of Toronto, Stony Brook University 摘要:激活函数在决定训练动态和神经网络性能方面起着关键作用。广泛采用的激活函数ReLU尽管简单有效,但也存在一些缺点,包括ReLU死亡(Dying ReLU)问题。为了解决这些问题,我们提出了一种新的激活函数,称为Serf,它是自正则的,本质上是非单调的。与Mish一样,Serf也属于Swish函数家族。基于计算机视觉(图像分类和目标检测)和自然语言处理(机器翻译、情感分类和多模态蕴涵)任务的多个实验,采用不同的最新架构,据观察,Serf的性能远远优于ReLU(基线)和其他激活函数,包括Swish和Mish,在更深层次的体系结构上具有更大的优势。消融研究进一步证明,基于Serf的体系结构在不同场景下的性能优于Swish和Mish,验证了Serf在不同深度、复杂度、优化器、学习率、批量大小、初始化方法和dropout率下的有效性和兼容性。最后,我们研究了Swish和Serf之间的数学关系,从而显示了Serf一阶导数中固有的预条件函数的影响,它提供了一种正则化效果,使梯度更平滑,优化速度更快。 摘要:Activation functions play a pivotal role in determining the training dynamics and neural network performance. The widely adopted activation function ReLU despite being simple and effective has few disadvantages including the Dying ReLU problem. In order to tackle such problems, we propose a novel activation function called Serf which is self-regularized and nonmonotonic in nature. Like Mish, Serf also belongs to the Swish family of functions. Based on several experiments on computer vision (image classification and object detection) and natural language processing (machine translation, sentiment classification and multimodal entailment) tasks with different state-of-the-art architectures, it is observed that Serf vastly outperforms ReLU (baseline) and other activation functions including both Swish and Mish, with a markedly bigger margin on deeper architectures. Ablation studies further demonstrate that Serf based architectures perform better than those of Swish and Mish in varying scenarios, validating the effectiveness and compatibility of Serf with varying depth, complexity, optimizers, learning rates, batch sizes, initializers and dropout rates. Finally, we investigate the mathematical relation between Swish and Serf, thereby showing the impact of preconditioner function ingrained in the first derivative of Serf which provides a regularization effect making gradients smoother and optimization faster.
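按标题"log-Softplus ERror activation Function"的定义,Serf可写作 f(x) = x·erf(ln(1+e^x)),即对softplus取误差函数再与输入相乘,PyTorch中一行即可实现(以下为据此定义的示意实现):

```python
import torch
import torch.nn.functional as F

def serf(x):
    # Serf: f(x) = x * erf(ln(1 + exp(x))) = x * erf(softplus(x))
    return x * torch.erf(F.softplus(x))

print(serf(torch.linspace(-3.0, 3.0, 7)))   # smooth and non-monotonic, like Swish/Mish
```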

【8】 BoundaryNet: An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation 标题:边界网:一种带快速行进距离图的细心深度网络半自动布局标注 链接:https://arxiv.org/abs/2108.09433

作者:Abhishek Trivedi,Ravi Kiran Sarvadevabhatla 机构:(�)[,−,−,−,], Centre for Visual Information Technology (CVIT), International Institute of Information Technology, Hyderabad – , INDIA 备注:Accepted at ICDAR-21 for oral presentation - watch video this https URL View webpage this http URL Code and pretrained models this https URL 摘要:图像区域的精确边界标注对于依赖区域类语义的下游应用程序至关重要。一些文档集合包含布局密集、高度不规则和重叠的多类区域实例,其纵横比范围很大。全自动边界估计方法往往是数据密集型的,不能处理可变大小的图像,并且对上述图像产生次优结果。为了解决这些问题,我们提出了BoundaryNet,这是一种新的无需调整大小的高精度半自动布局注释方法。可变大小的用户选择的感兴趣区域首先由注意力引导的跳过网络进行处理。通过快速行进距离图指导网络优化,以获得高质量的初始边界估计和相关特征表示。这些输出由使用Hausdorff损失优化的残差图卷积网络处理,以获得最终区域边界。在具有挑战性的图像手稿数据集上的结果表明,BoundaryNet优于强基线,并生成高质量的语义区域边界。定性地说,我们的方法在包含不同脚本系统和布局的多个文档图像数据集上进行了推广,而无需进行额外的微调。我们将BoundaryNet集成到文档注释系统中,并表明与手动和全自动替代方案相比,它提供了高注释吞吐量。 摘要:Precise boundary annotations of image regions can be crucial for downstream applications which rely on region-class semantics. Some document collections contain densely laid out, highly irregular and overlapping multi-class region instances with large range in aspect ratio. Fully automatic boundary estimation approaches tend to be data intensive, cannot handle variable-sized images and produce sub-optimal results for aforementioned images. To address these issues, we propose BoundaryNet, a novel resizing-free approach for high-precision semi-automatic layout annotation. The variable-sized user selected region of interest is first processed by an attention-guided skip network. The network optimization is guided via Fast Marching distance maps to obtain a good quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph Convolution Network optimized using Hausdorff loss to obtain the final region boundary. Results on a challenging image manuscript dataset demonstrate that BoundaryNet outperforms strong baselines and produces high-quality semantic region boundaries. Qualitatively, our approach generalizes across multiple document image datasets containing different script systems and layouts, all without additional fine-tuning. We integrate BoundaryNet into a document annotation system and show that it provides high annotation throughput compared to manual and fully automatic alternatives.

【9】 ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators 标题:ARAPReg:一种学习可变形形状生成器的尽可能刚性的正则化损失 链接:https://arxiv.org/abs/2108.09432

作者:Qixing Huang,Xiangru Huang,Bo Sun,Zaiwei Zhang,Junfeng Jiang,Chandrajit Bajaj 机构:UT Austin & MIT, Hohai University 摘要:本文介绍了一种用于训练参数化变形形状生成器的无监督损失。其关键思想是使生成的形状保持局部刚性。我们的方法建立在对尽可能刚性(as-rigid-as-possible, ARAP)变形能量的近似之上。我们展示了如何通过ARAP能量Hessian矩阵的谱分解来推导该无监督损失。该损失通过一个鲁棒范数很好地解耦了姿态变化和形状变化,并具有简单的闭式表达式。它易于训练,可插入任何标准生成模型,例如变分自动编码器(VAE)和自动解码器(AD)。实验结果表明,在人类、动物和骨骼等各种形状类别的公共基准数据集上,我们的方法比现有的形状生成方法有很大的优势。 摘要:This paper introduces an unsupervised loss for training parametric deformation shape generators. The key idea is to enforce the preservation of local rigidity among the generated shapes. Our approach builds on an approximation of the as-rigid-as possible (or ARAP) deformation energy. We show how to develop the unsupervised loss via a spectral decomposition of the Hessian of the ARAP energy. Our loss nicely decouples pose and shape variations through a robust norm. The loss admits simple closed-form expressions. It is easy to train and can be plugged into any standard generation models, e.g., variational auto-encoder (VAE) and auto-decoder (AD). Experimental results show that our approach outperforms existing shape generation approaches considerably on public benchmark datasets of various shape categories such as human, animal and bone.
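为传达"保持局部刚性"的直觉,下面给出一个粗略的代理正则项示意(仅为笔者的说明性近似,并非论文基于ARAP能量Hessian谱分解的损失):对拓扑相同的两个生成网格,惩罚对应边长的变化。

```python
import torch

def rigidity_proxy_loss(verts_a, verts_b, edges):
    # verts_a, verts_b: (V, 3) two generated meshes sharing topology
    # edges: (E, 2) long tensor of vertex index pairs
    ea = verts_a[edges[:, 0]] - verts_a[edges[:, 1]]
    eb = verts_b[edges[:, 0]] - verts_b[edges[:, 1]]
    # penalize edge-length changes: a crude stand-in for local rigidity
    return ((ea.norm(dim=1) - eb.norm(dim=1)) ** 2).mean()
```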

【10】 A Multiple-View Geometric Model for Specularity Prediction on Non-Uniformly Curved Surfaces 标题:非均匀曲面镜面镜面反射率预测的多视图几何模型 链接:https://arxiv.org/abs/2108.09378

作者:Alexandre Morgand,Mohamed Tamaazousti,Adrien Bartoli 机构: Tamaazousti is with Université Paris Saclay 摘要:镜面反射度预测对于许多计算机视觉应用至关重要,它提供了重要的视觉线索,可用于增强现实(AR)、同步定位和映射(SLAM)、三维重建和材料建模,从而提高场景理解。然而,这是一项具有挑战性的任务,需要来自场景的大量信息,包括摄影机姿势、场景几何体、光源和材质属性。我们以前的工作通过椭球体建立显式模型来解决这一任务,其投影可拟合给定相机姿态下的镜面高光图像轮廓。这些基于椭球体的方法属于一系列称为"联合灯光材质镜面反射"(JOLIMAS)的模型,在这些模型中,我们尝试逐步去除对场景的假设,例如镜面曲面的几何形状。然而,我们最近的方法仍然局限于均匀曲面。本文在这些方法的基础上,将JOLIMAS推广到任何曲面几何体,同时在不牺牲计算性能的情况下提高镜面反射度预测的质量。提出的方法建立了曲面曲率和镜面形状之间的联系,从而摆脱先前工作中的几何假设。与以前的工作相反,我们的新模型是从基于物理的局部照明模型Torrance-Sparrow构建的,提供了更好的模型重建。使用我们的新模型对合成序列和具有不同形状曲率对象的真实序列进行镜面反射度预测,并与最新的JOLIMAS版本进行对比测试。我们的方法在镜面反射度预测方面优于以前的方法,包括实时设置,如使用视频的补充材料所示。 摘要:Specularity prediction is essential to many computer vision applications by giving important visual cues that could be used in Augmented Reality (AR), Simultaneous Localisation and Mapping (SLAM), 3D reconstruction and material modeling, thus improving scene understanding. However, it is a challenging task requiring numerous information from the scene including the camera pose, the geometry of the scene, the light sources and the material properties. Our previous work have addressed this task by creating an explicit model using an ellipsoid whose projection fits the specularity image contours for a given camera pose. These ellipsoid-based approaches belong to a family of models called JOint-LIght MAterial Specularity (JOLIMAS), where we have attempted to gradually remove assumptions on the scene such as the geometry of the specular surfaces. However, our most recent approach is still limited to uniformly curved surfaces. This paper builds upon these methods by generalising JOLIMAS to any surface geometry while improving the quality of specularity prediction, without sacrificing computation performances. The proposed method establishes a link between surface curvature and specularity shape in order to lift the geometric assumptions from previous work. Contrary to previous work, our new model is built from a physics-based local illumination model namely Torrance-Sparrow, providing a better model reconstruction. Specularity prediction using our new model is tested against the most recent JOLIMAS version on both synthetic and real sequences with objects of varying shape curvatures. Our method outperforms previous approaches in specularity prediction, including the real-time setup, as shown in the supplementary material using videos.

【11】 Early-exit deep neural networks for distorted images: providing an efficient edge offloading 标题:用于失真图像的提前退出深度神经网络:提供有效的边缘卸载 链接:https://arxiv.org/abs/2108.09343

Authors: Roberto G. Pacheco, Fernanda D. V. R. Oliveira, Rodrigo S. Couto Affiliations: Universidade Federal do Rio de Janeiro, GTA/PADS/PEE-COPPE/DEL-Poli, Rio de Janeiro, RJ, Brazil Note: to appear in Proc. IEEE Global Communications Conference (GLOBECOM) 2021 Abstract: Edge offloading for deep neural networks (DNNs) can adapt to the input's complexity by using early-exit DNNs. These DNNs have side branches throughout their architecture, allowing the inference to end earlier at the edge. The branches estimate the accuracy for a given input. If this estimated accuracy reaches a threshold, the inference ends on the edge. Otherwise, the edge offloads the inference to the cloud to process the remaining DNN layers. However, DNNs for image classification deal with distorted images, which negatively impact the branches' estimated accuracy. Consequently, the edge offloads more inferences to the cloud. This work introduces expert side branches trained on a particular distortion type to improve robustness against image distortion. The edge detects the distortion type and selects the appropriate expert branches to perform the inference. This approach increases the estimated accuracy on the edge, improving the offloading decisions. We validate our proposal in a realistic scenario, in which the edge offloads DNN inference to Amazon EC2 instances.
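The threshold rule described in the abstract is easy to picture in code. A minimal sketch follows, assuming each branch returns intermediate features plus a prediction and that confidence is the maximum softmax probability (both illustrative choices, not necessarily the paper's):

```python
# Minimal sketch of early-exit edge offloading; assumes batch size 1 and a
# hypothetical branch interface returning (features, logits).
import torch

def infer_with_offloading(x, edge_branches, cloud_tail, threshold=0.8):
    h = x
    for branch in edge_branches:                 # early-exit branches run on the edge
        h, logits = branch(h)
        conf = torch.softmax(logits, dim=-1).max().item()
        if conf >= threshold:                    # confident enough: finish on the edge
            return logits, "edge"
    return cloud_tail(h), "cloud"                # otherwise offload the remaining layers
```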

【12】 LoOp: Looking for Optimal Hard Negative Embeddings for Deep Metric Learning Link: https://arxiv.org/abs/2108.09335

Authors: Bhavya Vasudeva, Puneesh Deora, Saumik Bhattacharya, Umapada Pal, Sukalpa Chanda Affiliations: Indian Statistical Institute, Kolkata, India; Indian Institute of Technology, Kharagpur, India; Østfold University College, Halden, Norway Note: 17 pages, 9 figures, 5 tables. Accepted at the IEEE/CVF International Conference on Computer Vision (ICCV) 2021 Abstract: Deep metric learning has been used effectively to learn distance metrics for different visual tasks such as image retrieval and clustering. To aid the training process, existing methods either use a hard mining strategy to extract the most informative samples or seek to generate hard synthetics using an additional network. Such approaches face different challenges: they can lead to biased embeddings in the former case, and to (i) harder optimization, (ii) slower training speed, and (iii) higher model complexity in the latter case. To overcome these challenges, we propose a novel approach that looks for optimal hard negatives (LoOp) in the embedding space, taking full advantage of each tuple by calculating the minimum distance between a pair of positives and a pair of negatives. Unlike mining-based methods, our approach considers the entire space between pairs of embeddings to calculate the optimal hard negatives. Extensive experiments combining our approach with representative metric learning losses reveal a significant boost in performance on three benchmark datasets.
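To make the "entire space between pairs of embeddings" idea concrete, here is a hedged sketch that approximates the minimum distance between the segment spanned by a positive pair and the segment spanned by a negative pair by dense sampling; the paper derives this optimum analytically rather than by sampling:

```python
import torch

def min_segment_distance(a1, a2, b1, b2, steps=64):
    """Approximate min distance between segments [a1, a2] and [b1, b2].
    a*, b*: (D,) embeddings; sampling stands in for the paper's closed form."""
    t = torch.linspace(0.0, 1.0, steps).unsqueeze(1)
    pa = a1 + t * (a2 - a1)          # points along the positive-pair segment
    pb = b1 + t * (b2 - b1)          # points along the negative-pair segment
    return torch.cdist(pa, pb).min()
```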

【13】 Fourier Neural Operator Networks: A Fast and General Solver for the Photoacoustic Wave Equation Link: https://arxiv.org/abs/2108.09374

Authors: Steven Guan, Ko-Tsung Hsu, Parag V. Chitnis Affiliations: S. Guan is with the Bioengineering Department, George Mason University, Fairfax, VA, USA, and The MITRE Corporation (the author's affiliation with The MITRE Corporation is provided for identification purposes only and is not intended to convey or…) Abstract: Simulation tools for photoacoustic wave propagation have played a key role in advancing photoacoustic imaging by providing quantitative and qualitative insights into parameters affecting image quality. Classical methods for numerically solving the photoacoustic wave equation rely on a fine discretization of space and can become computationally expensive for large computational grids. In this work, we apply Fourier Neural Operator (FNO) networks as a fast data-driven deep learning method for solving the 2D photoacoustic wave equation in a homogeneous medium. Comparisons between the FNO network and a pseudo-spectral time domain approach demonstrate that the FNO network generates comparable simulations with small errors and is several orders of magnitude faster. Moreover, the FNO network is generalizable and can generate simulations not observed in the training data.
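The core FNO building block is a spectral convolution: transform to the Fourier domain, mix channels on a truncated set of low-frequency modes with learned complex weights, and transform back. A simplified single-layer sketch (keeping only the top-left mode block, whereas full FNO implementations also keep the conjugate-symmetric modes) is given below:

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Simplified FNO spectral convolution layer (illustrative, not the paper's code)."""
    def __init__(self, in_ch, out_ch, modes):
        super().__init__()
        self.modes = modes
        self.w = nn.Parameter(torch.randn(in_ch, out_ch, modes, modes,
                                          dtype=torch.cfloat) / (in_ch * out_ch))

    def forward(self, x):                       # x: (B, C, H, W)
        x_ft = torch.fft.rfft2(x)               # to the Fourier domain
        out = torch.zeros(x.size(0), self.w.size(1), *x_ft.shape[-2:],
                          dtype=torch.cfloat, device=x.device)
        m = self.modes                          # mix channels on low-frequency modes only
        out[:, :, :m, :m] = torch.einsum("bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.w)
        return torch.fft.irfft2(out, s=x.shape[-2:])  # back to the spatial domain
```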

Others (12 papers)

【1】 BiaSwap: Removing dataset bias with bias-tailored swapping augmentation Link: https://arxiv.org/abs/2108.10008

Authors: Eungyeup Kim, Jihyeon Lee, Jaegul Choo Affiliations: KAIST Note: Accepted to ICCV'21 Abstract: Deep neural networks often make decisions based on spurious correlations inherent in the dataset, failing to generalize to an unbiased data distribution. Although previous approaches pre-define the type of dataset bias to prevent the network from learning it, identifying the bias type in a real dataset is often prohibitive. This paper proposes a novel bias-tailored, augmentation-based approach, BiaSwap, for learning debiased representations without requiring supervision on the bias type. Assuming that the bias corresponds to easy-to-learn attributes, we sort the training images by how much a biased classifier can exploit them as a shortcut, and divide them into bias-guiding and bias-contrary samples in an unsupervised manner. Afterwards, we integrate the style-transferring module of an image translation model with the class activation maps of such a biased classifier, which enables transferring primarily the bias attributes learned by the classifier. Therefore, given a bias-guiding and bias-contrary pair, BiaSwap generates a bias-swapped image that contains the bias attributes of the bias-contrary image while preserving the bias-irrelevant attributes of the bias-guiding image. Given such augmented images, BiaSwap demonstrates superior debiasing over existing baselines on both synthetic and real-world datasets. Even without careful supervision on the bias, BiaSwap achieves remarkable performance on both unbiased and bias-guiding samples, implying improved generalization capability of the model.
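The unsupervised split can be pictured as ranking samples by how well the biased classifier already fits them. A hedged sketch, where the split ratio and the use of per-sample cross-entropy as the "shortcut" score are illustrative assumptions:

```python
import torch

@torch.no_grad()
def split_by_bias(biased_clf, loader, guide_ratio=0.8):
    """Assumes loader yields (sample_index, (image, label)) batches."""
    ce = torch.nn.CrossEntropyLoss(reduction="none")
    losses, indices = [], []
    for idx, (x, y) in loader:
        losses.append(ce(biased_clf(x), y))
        indices.append(idx)
    losses, indices = torch.cat(losses), torch.cat(indices)
    order = losses.argsort()            # low loss => shortcut works => bias-guiding
    cut = int(guide_ratio * len(order))
    return indices[order[:cut]], indices[order[cut:]]   # bias-guiding, bias-contrary
```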

【2】 Image coding for machines: an end-to-end learned approach Link: https://arxiv.org/abs/2108.09993

Authors: Nam Le, Honglei Zhang, Francesco Cricri, Ramin Ghaznavi-Youvalari, Esa Rahtu Affiliations: Nokia Technologies; Tampere University, Tampere, Finland Abstract: Over recent years, deep learning-based computer vision systems have been applied to images at an ever-increasing pace, oftentimes representing the only type of consumption for those images. Given the dramatic explosion in the number of images generated per day, a question arises: how much better would an image codec targeting machine consumption perform against state-of-the-art codecs targeting human consumption? In this paper, we propose an image codec for machines that is neural network (NN) based and end-to-end learned. In particular, we propose a set of training strategies that address the delicate problem of balancing competing loss functions, such as computer vision task losses, image distortion losses, and rate loss. Our experimental results show that our NN-based codec outperforms the state-of-the-art Versatile Video Coding (VVC) standard on the object detection and instance segmentation tasks, achieving BD-rate gains of -37.87% and -32.90%, respectively, while being fast thanks to its compact size. To the best of our knowledge, this is the first end-to-end learned machine-targeted image codec.
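The "delicate balance" reduces to a weighted sum of the three loss terms; the weights below are illustrative placeholders, not the paper's tuned values:

```python
def codec_loss(task_loss, distortion_loss, rate_bits,
               lambda_t=1.0, lambda_d=0.5, lambda_r=0.01):
    # task_loss: e.g. detection/segmentation loss computed on the decoded image
    # distortion_loss: e.g. MSE between the original and decoded image
    # rate_bits: estimated bitstream length from the entropy model
    return lambda_t * task_loss + lambda_d * distortion_loss + lambda_r * rate_bits
```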

【3】 The 2nd Anti-UAV Workshop & Challenge: Methods and Results Link: https://arxiv.org/abs/2108.09909

Authors: Jian Zhao, Gang Wang, Jianan Li, Lei Jin, Nana Fan, Min Wang, Xiaojuan Wang, Ting Yong, Yafeng Deng, Yandong Guo, Shiming Ge, Guodong Guo Abstract: The 2nd Anti-UAV Workshop & Challenge aims to encourage research in developing novel and accurate methods for multi-scale object tracking. The Anti-UAV dataset used for the Anti-UAV Challenge has been publicly released. There are two subsets in the dataset, i.e., the test-dev subset and the test-challenge subset. Both subsets consist of 140 thermal infrared video sequences, spanning multiple occurrences of multi-scale UAVs. Around 24 participating teams from around the globe competed in the 2nd Anti-UAV Challenge. In this paper, we provide a brief summary of the 2nd Anti-UAV Workshop & Challenge, including brief introductions to the top three methods. The submission leaderboard will be reopened for researchers interested in the Anti-UAV challenge. The benchmark dataset and other information can be found at: https://anti-uav.github.io/.

【4】 Burst Imaging for Light-Constrained Structure-From-Motion Link: https://arxiv.org/abs/2108.09895

Authors: Ahalya Ravendran, Mitch Bryson, Donald G. Dansereau Affiliations: School of Aerospace, The University of Sydney, and the Sydney Institute for Robotics and Intelligent Systems Note: 8 pages, 8 figures, 2 tables; for the associated project page, see: this https URL Abstract: Images captured under extremely low light conditions are noise-limited, which can cause existing robotic vision algorithms to fail. In this paper we develop an image processing technique for aiding 3D reconstruction from images acquired in low light conditions. Our technique, based on burst photography, uses direct methods for image registration within bursts of short-exposure-time images to improve the robustness and accuracy of feature-based structure-from-motion (SfM). We demonstrate improved SfM performance in challenging light-constrained scenes, including quantitative evaluations that show improved feature performance and camera pose estimates. Additionally, we show that our method converges to correct reconstructions more frequently than the state of the art. Our method is a significant step towards allowing robots to operate in low light conditions, with potential applications to robots operating in environments such as underground mines and at night.
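One way to picture direct registration within a burst is ECC alignment against a reference frame followed by averaging; the sketch below uses OpenCV's ECC as a stand-in for the paper's registration pipeline and assumes grayscale frames:

```python
import cv2
import numpy as np

def merge_burst(frames):
    """frames: list of grayscale arrays from one short-exposure burst."""
    mid = len(frames) // 2
    ref = frames[mid].astype(np.float32)
    acc, n = ref.copy(), 1
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-4)
    for i, f in enumerate(frames):
        if i == mid:
            continue
        warp = np.eye(2, 3, dtype=np.float32)
        _, warp = cv2.findTransformECC(ref, f.astype(np.float32), warp,
                                       cv2.MOTION_EUCLIDEAN, criteria)
        acc += cv2.warpAffine(f.astype(np.float32), warp, ref.shape[::-1],
                              flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        n += 1
    return acc / n        # merged, denoised frame for downstream feature-based SfM
```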

【5】 Relating CNNs with brain: Challenges and findings Link: https://arxiv.org/abs/2108.09768

Authors: Reem Abdel-Salam Affiliations: Department of Computer Engineering, Cairo University Abstract: Convolutional neural network models (CNNs), loosely inspired by the primate visual system, have been shown to predict neural responses in the visual cortex. However, the relationship between CNNs and the visual system remains incomplete for many reasons. On one hand, state-of-the-art CNN architectures are very complex, yet can be fooled by imperceptibly small, explicitly crafted perturbations, which makes it difficult to map the network's layers onto the visual system and to understand what they are doing. On the other hand, we do not know the exact mapping between the feature space of a CNN and the spatial domain of the visual cortex, which makes it hard to accurately predict neural responses. In this paper we review the challenges and the methods that have been used to predict neural responses in the visual cortex and the whole brain as part of The Algonauts Project 2021 Challenge: "How the Human Brain Makes Sense of a World in Motion".

【6】 Graph2Pix: A Graph-Based Image to Image Translation Framework Link: https://arxiv.org/abs/2108.09752

Authors: Dilara Gokay, Enis Simsar, Efehan Atici, Alper Ahmetoglu, Atif Emre Yuksel, Pinar Yanardag Affiliations: Technical University of Munich; Bogazici University Abstract: In this paper, we propose a graph-based image-to-image translation framework for generating images. We use rich data collected from the popular creativity platform Artbreeder (http://artbreeder.com), where users interpolate multiple GAN-generated images to create artworks. This unique approach to creating new images leads to a tree-like structure in which one can track historical data about the creation of a particular image. Inspired by this structure, we propose a novel graph-to-image translation model called Graph2Pix, which takes a graph and corresponding images as input and generates a single image as output. Our experiments show that Graph2Pix is able to outperform several image-to-image translation frameworks on benchmark metrics, including LPIPS (with a 25% improvement) and human perception studies (n=60), in which users preferred the images generated by our method 81.5% of the time. Our source code and dataset are publicly available at https://github.com/catlab-team/graph2pix.

【7】 DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets Link: https://arxiv.org/abs/2108.09640

Authors: Junru Gu, Chen Sun, Hang Zhao Affiliations: IIIS, Tsinghua University; Brown University Note: Accepted to ICCV 2021 Abstract: Due to the stochasticity of human behavior, predicting the future trajectories of road agents is challenging for autonomous driving. Recently, goal-based multi-trajectory prediction methods have proved effective: they first score over-sampled goal candidates and then select a final set from them. However, these methods usually involve goal predictions based on sparse pre-defined anchors and heuristic goal selection algorithms. In this work, we propose an anchor-free, end-to-end trajectory prediction model, named DenseTNT, that directly outputs a set of trajectories from dense goal candidates. In addition, we introduce an offline, optimization-based technique to provide multi-future pseudo-labels for our final online model. Experiments show that DenseTNT achieves state-of-the-art performance, ranking 1st on the Argoverse motion forecasting benchmark and winning 1st place in the 2021 Waymo Open Dataset Motion Prediction Challenge.
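The anchor-free idea can be pictured as scoring a dense set of candidate goals and decoding trajectories for the best ones; the scorer, decoder, and simple top-k selection below are illustrative placeholders (the paper uses an optimization-based selection for its offline teacher):

```python
import torch

def predict_trajectories(scene_feat, goal_scorer, traj_decoder, dense_goals, k=6):
    scores = goal_scorer(scene_feat, dense_goals)   # one score per dense goal candidate
    top = scores.topk(k).indices                    # pick a final set of goals
    return [traj_decoder(scene_feat, dense_goals[i]) for i in top]
```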

【8】 Flikcer -- A Chrome Extension to Resolve Online Epileptogenic Visual Content with Real-Time Luminance Frequency Analysis Link: https://arxiv.org/abs/2108.09491

Authors: Jaisal Kothari, Ashay Srivastava Affiliations: Amity International School, Saket, New Delhi; Delhi Public School, RK Puram Abstract: Video content with fast luminance variations, or with spatial patterns of high contrast -- referred to as epileptogenic visual content -- may induce seizures in viewers with photosensitive epilepsy, and can even cause discomfort in users not affected by this disease. Flikcer is a web app, in the form of a website and Chrome extension, which aims to resolve epileptogenic content in videos. It reports the number of possible seizure triggers, provides timestamps for these triggers, and offers a safer version of the video, free to download. The algorithm is written in Python and uses machine learning and computer vision. A key aspect of the algorithm is its computational efficiency, which allows real-time use by the public.
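A bare-bones version of the luminance-frequency analysis: track mean frame luminance, flag large jumps, and report windows where jumps exceed a flashes-per-second limit. The thresholds below follow common photosensitivity guidance (roughly three flashes per second) and are assumptions, not Flikcer's exact values:

```python
import cv2
import numpy as np

def flash_timestamps(path, lum_delta=20.0, max_flashes_per_sec=3):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    lum, ok = [], True
    while ok:
        ok, frame = cap.read()
        if ok:
            lum.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).mean())
    cap.release()
    flashes = np.abs(np.diff(lum)) > lum_delta   # large frame-to-frame luminance jumps
    w = max(int(fps), 1)                         # one-second sliding window
    return [i / fps for i in range(len(flashes) - w)
            if flashes[i:i + w].sum() > max_flashes_per_sec]
```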

【9】 Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training Link: https://arxiv.org/abs/2108.09479

Authors: Ming Yan, Haiyang Xu, Chenliang Li, Bin Bi, Junfeng Tian, Min Gui, Wei Wang Affiliations: Alibaba Group Abstract: Existing approaches to vision-language pre-training (VLP) rely heavily on an object detector based on bounding boxes (regions), where salient objects are first detected from images and a Transformer-based model is then used for cross-modal fusion. Despite their superior performance, these approaches are bounded by the capability of the object detector in terms of both effectiveness and efficiency. Besides, the presence of object detection imposes unnecessary constraints on model design and makes it difficult to support end-to-end training. In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps. We propose a simple yet effective grid-based VLP method that works surprisingly well with grid features. By pre-training only with in-domain datasets, the proposed Grid-VLP method outperforms the most competitive region-based VLP methods on three examined vision-language understanding tasks. We hope that our findings help to further advance the state of the art of vision-language pre-training and provide a new direction towards effective and efficient VLP.
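Skipping the detector means the visual tokens are just the backbone's grid cells. A minimal sketch of that step (the backbone choice and token layout are illustrative; the `weights` argument requires torchvision >= 0.13):

```python
import torch
import torchvision

# CNN backbone without the pooling/classification head
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2])

def grid_tokens(images):                    # images: (B, 3, H, W)
    fmap = backbone(images)                 # (B, 2048, H/32, W/32) grid features
    return fmap.flatten(2).transpose(1, 2)  # (B, num_cells, 2048) tokens for fusion
```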

【10】 OSRM-CCTV: Open-source CCTV-aware routing and navigation system for privacy, anonymity and safety (Preprint) Link: https://arxiv.org/abs/2108.09369

Authors: Lauri Sintonen, Hannu Turtiainen, Andrei Costin, Timo Hamalainen, Tuomo Lahtinen Affiliations: Department of Information Technology, University of Jyväskylä Abstract: For the last several decades, the increased, widespread, unwarranted, and unaccountable use of Closed-Circuit TeleVision (CCTV) cameras globally has raised privacy concerns. Additional recent features of many CCTV cameras, such as Internet of Things (IoT) connectivity and Artificial Intelligence (AI)-based facial recognition, only increase concerns among privacy advocates. Therefore, on-par CCTV-aware solutions must exist that provide privacy, safety, and cybersecurity features. We argue that an important step forward is to develop solutions addressing privacy concerns via routing and navigation systems (e.g., OpenStreetMap, Google Maps) that provide both privacy and safety options for areas where cameras are known to be present. However, at present no routing and navigation system, whether online or offline, provides such CCTV-aware functionality. In this paper we introduce OSRM-CCTV -- the first and only CCTV-aware routing and navigation system designed and built for privacy, anonymity and safety applications. We validate and demonstrate the effectiveness and usability of the system on a handful of synthetic and real-world examples. To help validate our work, as well as to further encourage the development and wide adoption of the system, we release OSRM-CCTV as open source.
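Conceptually, CCTV-aware routing just penalizes road segments near known cameras when computing shortest paths; the sketch below uses networkx as a stand-in for OSRM's routing engine, with hypothetical camera data:

```python
import networkx as nx

def cctv_aware_route(G, src, dst, camera_nodes, penalty=5.0):
    """camera_nodes: set of graph nodes within known cameras' fields of view."""
    def weight(u, v, data):
        w = data.get("length", 1.0)
        if u in camera_nodes or v in camera_nodes:
            w *= penalty            # discourage surveilled segments, don't forbid them
        return w
    return nx.shortest_path(G, src, dst, weight=weight)
```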

【11】 All-Optical Synthesis of an Arbitrary Linear Transformation Using Diffractive Surfaces Link: https://arxiv.org/abs/2108.09833

Authors: Onur Kulce, Deniz Mengu, Yair Rivenson, Aydogan Ozcan Affiliations: Electrical and Computer Engineering Department, University of California, Los Angeles, CA, USA; Bioengineering Department, University of California, Los Angeles, CA, USA; California NanoSystems Institute, University of California, Los Angeles, CA, USA Note: 46 pages, 12 figures Abstract: We report the design of diffractive surfaces that all-optically perform arbitrary complex-valued linear transformations between an input (N_i) and output (N_o), where N_i and N_o represent the number of pixels at the input and output fields-of-view (FOVs), respectively. First, we consider a single diffractive surface and use a matrix pseudoinverse-based method to determine the complex-valued transmission coefficients of the diffractive features/neurons that all-optically perform a desired/target linear transformation. In addition to this data-free design approach, we also consider a deep learning-based design method that optimizes the transmission coefficients of the diffractive surfaces using examples of input/output fields corresponding to the target transformation. We compared the all-optical transformation errors and diffraction efficiencies achieved using data-free designs as well as data-driven (deep learning-based) diffractive designs to all-optically perform (i) arbitrarily-chosen complex-valued transformations, including unitary, nonunitary and noninvertible transforms, (ii) 2D discrete Fourier transformation, (iii) arbitrary 2D permutation operations, and (iv) high-pass filtered coherent imaging. Our analyses reveal that if the total number (N) of spatially-engineered diffractive features/neurons is N_i x N_o or larger, both design methods succeed in the all-optical implementation of the target transformation, achieving negligible error. However, compared to data-free designs, deep learning-based diffractive designs achieve significantly larger diffraction efficiencies for a given N, and their all-optical transformations are more accurate for N < N_i x N_o. These conclusions are generally applicable to various optical processors that employ spatially-engineered diffractive surfaces.
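The data-free idea, stripped to its essence: given enough input/output field pairs of the target transform, the operator is recovered by a least-squares fit via the pseudoinverse. A toy numpy sketch (ignoring the physical per-neuron transmission constraints of a real diffractive design):

```python
import numpy as np

rng = np.random.default_rng(0)
N_i, N_o, K = 64, 32, 256                 # input pixels, output pixels, examples
A_target = rng.standard_normal((N_o, N_i)) + 1j * rng.standard_normal((N_o, N_i))
X = rng.standard_normal((N_i, K)) + 1j * rng.standard_normal((N_i, K))  # input fields
Y = A_target @ X                          # corresponding target output fields
A_hat = Y @ np.linalg.pinv(X)             # least-squares recovery via pseudoinverse
print(np.allclose(A_hat, A_target))       # True when K >= N_i and X has full rank
```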

【12】 L3C-Stereo: Lossless Compression for Stereo Images Link: https://arxiv.org/abs/2108.09422

Authors: Zihao Huang, Zhe Sun, Feng Duan, Andrzej Cichocki, Peiying Ruan, Chao Li Affiliations: Hangzhou Dianzi University, China; Department of Informatics, Nicolaus Copernicus University, Poland; Systems Research Institute of the Polish Academy of Sciences Abstract: A large number of autonomous driving tasks require high-definition stereo images, which demand considerable storage space. Efficient lossless compression has therefore become a practical problem. It is generally hard to make accurate probability estimates for each pixel. To tackle this, we propose L3C-Stereo, a multi-scale lossless compression model consisting of two main modules: a warping module and a probability estimation module. The warping module takes advantage of two view feature maps from the same domain to generate a disparity map, which is used to reconstruct the right view so as to improve the confidence of the probability estimate of the right view. The probability estimation module provides pixel-wise logistic mixture distributions for adaptive arithmetic coding. In the experiments, our method outperforms hand-crafted compression methods and a learning-based method on all three datasets used. We then show that a better maximum disparity leads to a better compression effect. Furthermore, thanks to a compression property of our model, it naturally generates a disparity map of acceptable quality for subsequent stereo tasks.
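The warping module's key step, reconstructing the right view by resampling the left view along the predicted disparity, can be sketched with `grid_sample`; the layout conventions here (positive disparity shifting columns leftward) are assumptions:

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(left, disparity):
    """left: (B, C, H, W); disparity: (B, 1, H, W) predicted horizontal offsets."""
    B, _, H, W = left.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.unsqueeze(0).float() - disparity[:, 0]       # shift columns by disparity
    ys = ys.unsqueeze(0).float().expand_as(xs)
    grid = torch.stack([2 * xs / (W - 1) - 1,            # normalize to [-1, 1]
                        2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(left, grid, align_corners=True)  # reconstructed right view
```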
