计算机视觉与模式识别学术速递[12.15]

2021-12-17 16:27:04

cs.CV 方向,今日共计63篇

Transformer(5篇)

【1】 AdaViT: Adaptive Tokens for Efficient Vision Transformer 标题:AdaViT:面向高效视觉Transformer的自适应令牌 链接:https://arxiv.org/abs/2112.07658

作者:Hongxu Yin,Arash Vahdat,Jose Alvarez,Arun Mallya,Jan Kautz,Pavlo Molchanov 机构:NVIDIA 摘要:我们介绍了AdaViT,一种针对不同复杂度的图像自适应调整视觉Transformer(ViT)推理代价的方法。AdaViT通过在推理过程中自动减少网络所处理的令牌数量来实现这一点。我们针对这一任务重新表述了自适应计算时间(ACT),将"停止"机制扩展为丢弃冗余的空间令牌。视觉Transformer良好的结构特性使我们的自适应令牌缩减机制能够在不修改网络结构或推理硬件的情况下加速推理。由于自适应停止是基于原始网络参数学习的,AdaViT不需要额外的参数或子网络来实现停止。我们进一步引入了分布先验正则化,与此前的ACT方法相比可以稳定训练。在图像分类任务(ImageNet1K)上,我们展示了所提出的AdaViT在筛选富含信息的空间特征和降低总体计算量方面的高效性。该方法将DeiT-Tiny和DeiT-Small的吞吐量分别提高了62%和38%,而准确率仅下降0.3%,大幅优于现有方法。 摘要:We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enables our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We demonstrate that AdaViT requires no extra parameters or sub-network for halting, as we base the learning of adaptive halting on the original network parameters. We further introduce distributional prior regularization that stabilizes training compared to prior ACT approaches. On the image classification task (ImageNet1K), we show that our proposed AdaViT yields high efficacy in filtering informative spatial features and cutting down on the overall compute. The proposed method improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop, outperforming prior art by a large margin.
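
下面是一段基于上述摘要思路的极简 PyTorch 示意(并非论文官方实现):演示"逐层累计令牌停止概率、并在后续层屏蔽已停止令牌"的基本做法;其中停止概率取自令牌嵌入的第一个通道,维度、层数与阈值均为示例性假设。

```python
import torch
import torch.nn as nn

class ToyAdaptiveTokenHalting(nn.Module):
    """逐层估计每个令牌的停止概率,累计超过阈值后在后续层中屏蔽(丢弃)该令牌。"""
    def __init__(self, dim=192, depth=4, num_heads=3, eps=0.01):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(depth)
        ])
        self.eps = eps

    def forward(self, tokens):                                   # tokens: (B, N, dim)
        B, N, _ = tokens.shape
        cum_halt = torch.zeros(B, N, device=tokens.device)
        active = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
        for blk in self.blocks:
            # 摘要称停止概率由原网络参数得到;这里假设性地用令牌嵌入的第一个通道代替
            halt_prob = torch.sigmoid(tokens[..., 0] - 2.0)
            cum_halt = cum_halt + halt_prob * active
            active = active & (cum_halt < 1.0 - self.eps)        # 累计停止分数达到阈值的令牌被丢弃
            active[:, 0] = True                                  # 类别令牌始终保留
            tokens = blk(tokens, src_key_padding_mask=~active)   # True 表示在注意力中屏蔽该令牌
        return tokens, active

x = torch.randn(2, 16, 192)
out, remaining = ToyAdaptiveTokenHalting()(x)
print(out.shape, remaining.sum(dim=1))                           # 每个样本最终保留的令牌数
```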

【2】 Geometry-Contrastive Transformer for Generalized 3D Pose Transfer 标题:面向广义三维姿态迁移的几何对比Transformer 链接:https://arxiv.org/abs/2112.07374

作者:Haoyu Chen,Hao Tang,Zitong Yu,Nicu Sebe,Guoying Zhao 机构:CMVS, University of Oulu, Computer Vision Lab, ETH Zurich, DISI, University of Trento 备注:AAAI 2022 摘要:我们针对姿态迁移任务提出了一个定制的三维网格Transformer模型。由于三维姿态迁移本质上是一个依赖于给定网格的形变过程,这项工作的直觉是借助强大的自注意力机制感知给定网格之间的几何不一致性。具体来说,我们提出了一种新颖的几何对比Transformer,它对给定网格间的全局几何不一致性具有高效的三维结构化感知能力。此外,在局部层面,我们进一步提出了一种简单而有效的中心测地对比损失,以改进区域几何不一致性的学习。最后,我们提出了一个潜在等距正则化模块以及一个新的半合成数据集,用于面向未知空间的跨数据集三维姿态迁移任务。大量实验结果证明了我们方法的有效性:在SMPL-NPT、FAUST和我们新提出的SMG-3D数据集上取得了最先进的定量性能,并在MG-cloth和SMAL数据集上给出了有希望的定性结果。结果表明,该方法能够实现鲁棒的三维姿态迁移,并可推广到跨数据集任务中来自未知空间的挑战性网格。代码和数据集已公开:https://github.com/mikecheninoulu/CGT. 摘要:We present a customized 3D mesh Transformer model for the pose transfer task. As the 3D pose transfer essentially is a deformation procedure dependent on the given meshes, the intuition of this work is to perceive the geometric inconsistency between the given meshes with the powerful self-attention mechanism. Specifically, we propose a novel geometry-contrastive Transformer that has an efficient 3D structured perceiving ability to the global geometric inconsistencies across the given meshes. Moreover, locally, a simple yet efficient central geodesic contrastive loss is further proposed to improve the regional geometric-inconsistency learning. At last, we present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task towards unknown spaces. The massive experimental results prove the efficacy of our approach by showing state-of-the-art quantitative performances on SMPL-NPT, FAUST and our new proposed dataset SMG-3D datasets, as well as promising qualitative results on MG-cloth and SMAL datasets. It's demonstrated that our method can achieve robust 3D pose transfer and be generalized to challenging meshes from unknown spaces on cross-dataset tasks. The code and dataset are made available. Code is available: https://github.com/mikecheninoulu/CGT.

【3】 Temporal Transformer Networks with Self-Supervision for Action Recognition 标题:用于动作识别的具有自监督功能的时序Transformer网络 链接:https://arxiv.org/abs/2112.07338

作者:Yongkang Zhang,Jun Li,Guoming Wu,Han Zhang,Zhiping Shi,Zhaoxun Liu,Zizhang Wu,Na Jiang 机构: Beihang University 摘要:近年来,基于二维卷积网络的视频动作识别获得了广泛关注;然而,受限于缺乏长程非线性时间关系建模和反向运动信息建模,现有模型的性能受到严重制约。为了解决这一问题,我们引入了一种带自监督的时序Transformer网络(TTSN)。我们的高性能TTSN主要由一个时序Transformer模块和一个时序自监督模块组成。简而言之,我们利用高效的时序Transformer模块来建模非局部帧之间的非线性时间依赖关系,从而显著增强复杂运动特征的表示。我们采用的时序自监督模块首次采用"随机批次、随机通道"的精简策略来反转视频帧序列,从而能够从反转的时间维度中鲁棒地提取运动信息表示,并提高模型的泛化能力。在三个广泛使用的数据集(HMDB51、UCF101和Something-Something V1)上进行的大量实验充分证明,我们提出的TTSN颇具潜力,它成功地在动作识别上取得了最先进的性能。 摘要:In recent years, 2D Convolutional Networks-based video action recognition has encouragingly gained wide popularity; However, constrained by the lack of long-range non-linear temporal relation modeling and reverse motion information modeling, the performance of existing models is, therefore, undercut seriously. To address this urgent problem, we introduce a startling Temporal Transformer Network with Self-supervision (TTSN). Our high-performance TTSN mainly consists of a temporal transformer module and a temporal sequence self-supervision module. Concisely speaking, we utilize the efficient temporal transformer module to model the non-linear temporal dependencies among non-local frames, which significantly enhances complex motion feature representations. The temporal sequence self-supervision module we employ unprecedentedly adopts the streamlined strategy of "random batch random channel" to reverse the sequence of video frames, allowing robust extractions of motion information representation from inversed temporal dimensions and improving the generalization capability of the model. Extensive experiments on three widely used datasets (HMDB51, UCF101, and Something-something V1) have conclusively demonstrated that our proposed TTSN is promising as it successfully achieves state-of-the-art performance for action recognition.
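
下面用一小段 PyTorch 代码示意"随机选取批内样本并倒转其帧序"这一自监督数据构造方式(摘要中还提到随机通道的细节,此处从略;属简化示意,非官方实现):

```python
import torch

def random_batch_reverse(videos, p=0.5):
    """videos: (B, T, C, H, W)。随机挑选一部分样本,将其帧序沿时间维倒转,
    并返回 0/1 标签,供"判断帧序是否被倒转"这一自监督任务使用。"""
    B = videos.size(0)
    labels = (torch.rand(B) < p).long()                      # 1 表示该样本被倒序
    out = videos.clone()
    idx = labels.nonzero(as_tuple=True)[0]
    out[idx] = videos[idx].flip(dims=[1])                    # 沿时间维 T 翻转
    return out, labels

clips = torch.randn(4, 8, 3, 32, 32)
aug, y = random_batch_reverse(clips)
print(aug.shape, y)
```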

【4】 Co-training Transformer with Videos and Images Improves Action Recognition 标题:具有视频和图像的联合训练Transformer提高了动作识别能力 链接:https://arxiv.org/abs/2112.07175

作者:Bowen Zhang,Jiahui Yu,Christopher Fifty,Wei Han,Andrew M. Dai,Ruoming Pang,Fei Sha 机构:USC, Google Brain, Apple AI, Google Research 摘要:在学习动作识别时,模型通常先在图像(如ImageNet)上针对目标识别进行预训练,然后在视频上针对目标动作识别进行微调。这种做法已经取得了良好的经验性能,尤其是在最近基于Transformer的视频架构中。尽管近来许多工作致力于为动作识别设计更先进的Transformer架构,但在如何训练视频Transformer方面的探索较少。在这项工作中,我们探索了几种训练范式并给出两个发现。首先,视频Transformer受益于在不同视频数据集和标签空间上的联合训练(例如,Kinetics以外观为中心,而SomethingSomething以运动为中心)。其次,通过进一步与图像(视作单帧视频)联合训练,视频Transformer可以学到更好的视频表示。我们将这种方法称为面向动作识别的视频与图像联合训练(CoVeR)。特别地,当基于TimeSformer架构在ImageNet-21K上预训练时,CoVeR将Kinetics-400的Top-1精度提高了2.4%,Kinetics-600提高了2.3%,SomethingSomething-v2提高了2.3%。当按照此前最先进工作的做法在更大规模的图像数据集上预训练时,CoVeR仅用一个简单的时空视频Transformer就在Kinetics-400(87.2%)、Kinetics-600(87.9%)、Kinetics-700(79.8%)、SomethingSomething-v2(70.9%)和Moments-in-Time(46.1%)上取得了最佳结果。 摘要:In learning action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on target action recognition with videos. This approach has achieved good empirical performance especially with recent transformer-based video architectures. While recently many works aim to design more advanced transformer architectures for action recognition, less effort has been made on how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach as Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pretrained on ImageNet-21K based on the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1 Accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When pretrained on larger-scale image datasets following previous state-of-the-art, CoVeR achieves best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), with a simple spatio-temporal video transformer.
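
下面的 PyTorch 小例子示意 CoVeR 的核心做法之一:把图像当作单帧视频,与视频数据共用同一骨干、各数据集使用独立分类头联合训练。骨干用极简 3D 卷积代替真实的视频 Transformer,数据集名与类别数均为示例假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiHeadVideoClassifier(nn.Module):
    """共享的视频骨干 + 每个数据集各自的分类头;骨干此处用极简 3D 卷积代替真实的视频 Transformer。"""
    def __init__(self, classes_per_dataset):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.heads = nn.ModuleDict({k: nn.Linear(16, v) for k, v in classes_per_dataset.items()})

    def forward(self, clip, dataset):                        # clip: (B, C, T, H, W)
        return self.heads[dataset](self.backbone(clip))

def image_as_single_frame_video(images):                     # images: (B, C, H, W)
    """把图像批视作单帧视频 (B, C, 1, H, W),从而与视频共用同一骨干联合训练。"""
    return images.unsqueeze(2)

model = ToyMultiHeadVideoClassifier({"kinetics": 400, "ssv2": 174, "imagenet": 1000})
video_batch, image_batch = torch.randn(2, 3, 8, 32, 32), torch.randn(2, 3, 32, 32)
loss = F.cross_entropy(model(video_batch, "kinetics"), torch.randint(400, (2,))) + \
       F.cross_entropy(model(image_as_single_frame_video(image_batch), "imagenet"), torch.randint(1000, (2,)))
loss.backward()
print(loss.item())
```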

【5】 Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text 标题:迈向统一的基础模型:对未配对的图像和文本进行联合预训练Transformer 链接:https://arxiv.org/abs/2112.07074

作者:Qing Li,Boqing Gong,Yin Cui,Dan Kondratyuk,Xianzhi Du,Ming-Hsuan Yang,Matthew Brown 机构:Google Research, University of California, Los Angeles 备注:preliminary work 摘要:在本文中,我们探讨建立统一的基础模型的可能性,可以适用于视觉和纯文本的任务。从BERT和ViT开始,我们设计了一个统一的转换器,它由特定于模态的标记器、共享的转换器编码器和特定于任务的输出头组成。为了有效地在未配对的图像和文本上对所提出的模型进行联合预训练,我们提出了两种新技术:(i)我们使用单独训练的BERT和ViT模型作为教师,并应用知识提取为联合训练提供额外的、准确的监督信号;(ii)我们提出了一种新的梯度掩蔽策略来平衡来自图像和文本预训练损失的参数更新。我们通过分别在图像分类任务和自然语言理解任务中对联合预训练的transformer进行微调来评估它。实验结果表明,所得到的统一基础Transformer在视觉和纯文本两个任务上都令人惊讶地很好地工作,并且所提出的知识蒸馏和梯度掩蔽策略可以有效地提升性能以接近单独训练的模型的水平。 摘要:In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads. To efficiently pre-train the proposed model jointly on unpaired images and text, we propose two novel techniques: (i) We employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals for the joint training; (ii) We propose a novel gradient masking strategy to balance the parameter updates from the image and text pre-training losses. We evaluate the jointly pre-trained transformer by fine-tuning it on image classification tasks and natural language understanding tasks, respectively. The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
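
摘要提到用"梯度掩蔽"来平衡图像与文本预训练损失对共享参数的更新。下面给出一种示意性的实现思路(掩码的具体生成方式是本文的假设,并非论文原方案):

```python
import torch
import torch.nn as nn

def masked_joint_step(model, loss_img, loss_txt, p_img=0.5):
    """分别求两种模态损失对共享参数的梯度,再按随机掩码互补地组合成最终梯度。"""
    params = list(model.parameters())
    g_img = torch.autograd.grad(loss_img, params, retain_graph=True, allow_unused=True)
    g_txt = torch.autograd.grad(loss_txt, params, allow_unused=True)
    for p, gi, gt in zip(params, g_img, g_txt):
        gi = torch.zeros_like(p) if gi is None else gi
        gt = torch.zeros_like(p) if gt is None else gt
        mask = (torch.rand_like(p) < p_img).float()          # 1 处采用图像梯度,0 处采用文本梯度
        p.grad = mask * gi + (1.0 - mask) * gt

shared_encoder = nn.Linear(8, 8)
opt = torch.optim.SGD(shared_encoder.parameters(), lr=0.1)
x = torch.randn(4, 8)
masked_joint_step(shared_encoder, shared_encoder(x).pow(2).mean(), shared_encoder(x).abs().mean())
opt.step()
```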

检测相关(10篇)

【1】 Out-of-Distribution Detection without Class Labels 标题:无类别标签的分布外检测 链接:https://arxiv.org/abs/2112.07662

作者:Niv Cohen,Ron Abutbul,Yedid Hoshen 机构:School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel 摘要:异常检测方法识别偏离数据集正常行为的样本。它通常用于包含来自多个标记类或单个未标记类的正常数据的训练集。当前的方法在面对由多个类组成但没有标签的训练数据时会遇到困难。在这项工作中,我们首先发现,通过自监督图像聚类方法学习的分类器为未标记的多类数据集的异常检测提供了强大的基线。也许令人惊讶的是,我们发现使用预先训练的特征初始化聚类方法并没有比其自我监督的方法有所改进。这是由于灾难性遗忘的现象。相反,我们建议采用两阶段方法。我们首先使用自监督方法对图像进行聚类,并为每个图像获得一个聚类标签。我们使用集群标签作为分布外(OOD)方法的“伪监督”。具体地说,我们在通过聚类标签对图像进行分类的任务中对预训练的特征进行微调。我们对我们的方法进行了广泛的分析,并论证了我们两阶段方法的必要性。我们根据最先进的自我监督和预训练方法对其进行评估,并证明其具有优异的性能。 摘要:Anomaly detection methods identify samples that deviate from the normal behavior of the dataset. It is typically tackled either for training sets containing normal data from multiple labeled classes or a single unlabeled class. Current methods struggle when faced with training data consisting of multiple classes but no labels. In this work, we first discover that classifiers learned by self-supervised image clustering methods provide a strong baseline for anomaly detection on unlabeled multi-class datasets. Perhaps surprisingly, we find that initializing clustering methods with pre-trained features does not improve over their self-supervised counterparts. This is due to the phenomenon of catastrophic forgetting. Instead, we suggest a two stage approach. We first cluster images using self-supervised methods and obtain a cluster label for every image. We use the cluster labels as "pseudo supervision" for out-of-distribution (OOD) methods. Specifically, we finetune pretrained features on the task of classifying images by their cluster labels. We provide extensive analyses of our method and demonstrate the necessity of our two-stage approach. We evaluate it against the state-of-the-art self-supervised and pretrained methods and demonstrate superior performance.
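
下面用一段简短的 Python/PyTorch 代码示意这一两阶段流程:先聚类得到伪标签,再以"按聚类标签分类"为任务微调,并用最大 softmax 概率做 OOD 评分。其中聚类用 KMeans 代替摘要中的自监督聚类方法,特征与维度均为示例假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

# 第一阶段:对无标注训练集的特征做聚类,得到每张图像的聚类标签
feats = torch.randn(256, 64)                                  # 假设:预训练特征,形状为 (样本数, 特征维)
pseudo_labels = torch.as_tensor(KMeans(n_clusters=10, n_init=10).fit_predict(feats.numpy())).long()

# 第二阶段:把聚类标签当作"伪监督",以按聚类标签分类为任务微调(此处只微调一个线性头)
head = nn.Linear(64, 10)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(20):
    opt.zero_grad()
    F.cross_entropy(head(feats), pseudo_labels).backward()
    opt.step()

def ood_score(x):
    """用最大 softmax 概率的补数作为分布外评分,分数越高越可能是分布外样本。"""
    return 1.0 - head(x).softmax(dim=-1).max(dim=-1).values

print(ood_score(torch.randn(5, 64)))
```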

【2】 Approaches Toward Physical and General Video Anomaly Detection 标题:面向物理及一般视频异常检测的若干方法 链接:https://arxiv.org/abs/2112.07661

作者:Laura Kart,Niv Cohen 机构:School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel. 摘要:近年来,许多作品解决了在视频中发现以前从未见过的异常的问题。然而,大多数工作都集中在检测从安全摄像头拍摄的监控视频中的异常帧。同时,异常检测(AD)在表现出异常机械行为的视频中的任务也被忽视了。此类视频中的异常检测具有学术和实际意义,因为它们可以在许多制造、维护和现实环境中自动检测故障。为了评估不同方法检测此类异常的潜力,我们评估了两种简单的基线方法:(i)时间池图像AD技术。(ii)用预训练特征表示的视频密度估计,用于视频分类。开发这种方法需要新的基准,以便对不同的可能方法进行评估。我们介绍了物理异常轨迹或运动(PHANTOM)数据集,它包含六个不同的视频类。每节课由正常和异常视频组成。这些类别在呈现的现象、正常类别可变性以及视频中的异常类型方面有所不同。我们还提出了一个更难的基准,即在高度可变的场景中发现异常活动。 摘要:In recent years, many works have addressed the problem of finding never-seen-before anomalies in videos. Yet, most work has been focused on detecting anomalous frames in surveillance videos taken from security cameras. Meanwhile, the task of anomaly detection (AD) in videos exhibiting anomalous mechanical behavior, has been mostly overlooked. Anomaly detection in such videos is both of academic and practical interest, as they may enable automatic detection of malfunctions in many manufacturing, maintenance, and real-life settings. To assess the potential of the different approaches to detect such anomalies, we evaluate two simple baseline approaches: (i) Temporal-pooled image AD techniques. (ii) Density estimation of videos represented with features pretrained for video-classification. Development of such methods calls for new benchmarks to allow evaluation of different possible approaches. We introduce the Physical Anomalous Trajectory or Motion (PHANTOM) dataset, which contains six different video classes. Each class consists of normal and anomalous videos. The classes differ in the presented phenomena, the normal class variability, and the kind of anomalies in the videos. We also suggest an even harder benchmark where anomalous activities should be spotted on highly variable scenes.
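
下面示意其中第一种基线"时间池化 + 图像异常检测"的一种可能实现:先沿时间维池化,再用与正常样本特征库的 k 近邻距离打分(特征库与各尺寸均为示例假设,非论文官方实现):

```python
import torch

def temporal_pool(video):                                     # video: (T, C, H, W)
    """沿时间维做均值与最大值池化,把视频压成一张"聚合图像",之后即可套用任意图像异常检测方法。"""
    return torch.cat([video.mean(dim=0), video.amax(dim=0)], dim=0)   # (2C, H, W)

def knn_anomaly_score(query_feat, normal_feats, k=2):
    """以查询特征到正常样本特征库的 k 近邻平均距离作为异常评分。"""
    dists = torch.cdist(query_feat.unsqueeze(0), normal_feats)[0]
    return dists.topk(k, largest=False).values.mean()

normal_bank = torch.randn(100, 6 * 32 * 32)                   # 假设:正常视频经时间池化并展平后的特征库
test_video = torch.randn(16, 3, 32, 32)
print(knn_anomaly_score(temporal_pool(test_video).flatten(), normal_bank))
```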

【3】 CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning 标题:核心文本:利用对比关系推理改进场景文本检测 链接:https://arxiv.org/abs/2112.07513

作者:Jingyang Lin,Yingwei Pan,Rongfeng Lai,Xuehang Yang,Hongyang Chao,Ting Yao 机构:∗Sun Yat-sen University, Guangzhou, China, †The Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, Guangzhou, P. R. China, ‡JD AI Research, Beijing, China 备注:ICME 2021 (Oral); Code is publicly available at: this https URL 摘要:在自然场景中定位文本实例被认为是计算机视觉中的一项基本挑战。然而,由于真实场景中文本实例的纵横比和尺度差异极大,大多数传统文本检测器都面临只定位到文本实例片段(即子文本)的子文本问题。在这项工作中,我们定量分析了子文本问题,并提出了一个简单而有效的设计——对比关系(CORE)模块,以缓解该问题。CORE首先利用一个朴素的关系模块来建模所有文本候选(多个文本实例的子文本)之间的关系,并通过实例级子文本判别以对比方式进一步增强关系推理。这种方式自然地学习到文本候选的实例感知表示,从而促进场景文本检测。我们将CORE模块集成到基于Mask R-CNN的两阶段文本检测器中,构建了我们的文本检测器CORE-Text。在四个基准上的大量实验证明了CORE-Text的优越性。代码地址:https://github.com/jylins/CORE-Text. 摘要:Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem that only localizes the fragments of text instance (i.e., sub-texts). In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module, to mitigate that issue. CORE first leverages a vanilla relation block to model the relations among all text proposals (sub-texts of multiple text instances) and further enhances relational reasoning via instance-level sub-text discrimination in a contrastive manner. Such way naturally learns instance-aware representations of text proposals and thus facilitates scene text detection. We integrate the CORE module into a two-stage text detector of Mask R-CNN and devise our text detector CORE-Text. Extensive experiments on four benchmarks demonstrate the superiority of CORE-Text. Code is available: https://github.com/jylins/CORE-Text.

【4】 Improving Human-Object Interaction Detection via Phrase Learning and Label Composition 标题:利用短语学习和标签合成改进人-物交互检测 链接:https://arxiv.org/abs/2112.07383

作者:Zhimin Li,Cheng Zou,Yu Zhao,Boxun Li,Sheng Zhong 机构: National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Ant Group, Megvii Technology 备注:Accepted to AAAI2022 摘要:人-物交互(HOI)检测是高层以人为中心的场景理解中的一项基本任务。我们提出PhraseHOI,它包含一个HOI分支和一个新颖的短语分支,以利用语言先验并改进关系表达。具体而言,短语分支由语义嵌入进行监督,其真值标注由原始HOI标注自动转换而来,无需额外人力。同时,针对HOI中的长尾问题,提出了一种新的标签合成方法,通过语义近邻合成新的短语标签。此外,为了优化短语分支,提出了由蒸馏损失和平衡三元组损失组成的损失函数。大量实验证明了所提出的PhraseHOI的有效性:在具有挑战性的HICO-DET基准上,其在Full和NonRare设定下较基线取得了显著提升,并超过了此前最先进的方法。 摘要:Human-Object Interaction (HOI) detection is a fundamental task in high-level human-centric scene understanding. We propose PhraseHOI, containing a HOI branch and a novel phrase branch, to leverage language prior and improve relation expression. Specifically, the phrase branch is supervised by semantic embeddings, whose ground truths are automatically converted from the original HOI annotations without extra human efforts. Meanwhile, a novel label composition method is proposed to deal with the long-tailed problem in HOI, which composites novel phrase labels by semantic neighbors. Further, to optimize the phrase branch, a loss composed of a distilling loss and a balanced triplet loss is proposed. Extensive experiments are conducted to prove the effectiveness of the proposed PhraseHOI, which achieves significant improvement over the baseline and surpasses previous state-of-the-art methods on Full and NonRare on the challenging HICO-DET benchmark.

【5】 Static-Dynamic Co-Teaching for Class-Incremental 3D Object Detection 标题:面向类增量三维目标检测的静态-动态协同教学 链接:https://arxiv.org/abs/2112.07241

作者:Na Zhao,Gim Hee Lee 机构:Department of Computer Science, National University of Singapore 备注:Accepted at AAAI 2022 摘要:基于深度学习的方法在三维目标检测任务中表现出了显著的性能。然而,当他们在不重新访问旧数据的情况下增量学习新类时,他们会在最初训练的类上遭受灾难性的性能下降。这种“灾难性遗忘”现象阻碍了3D目标检测方法在现实场景中的应用,因为现实场景中需要持续学习系统。在本文中,我们研究了尚未探索但重要的类增量三维目标检测问题,并提出了第一个解决方案-SDCoT,一种新的静态-动态协同教学方法。我们的SDCoT通过静态教师减轻了对旧类的灾难性遗忘,该教师在新样本中为旧类提供伪注释,并通过提取具有蒸馏损失的先前知识来正则化当前模型。同时,SDCoT通过动态教师不断地从新数据中学习基础知识。我们在两个基准数据集上进行了广泛的实验,并在几个增量学习场景中展示了我们的SDCoT方法优于基线方法的性能。 摘要:Deep learning-based approaches have shown remarkable performance in the 3D object detection task. However, they suffer from a catastrophic performance drop on the originally trained classes when incrementally learning new classes without revisiting the old data. This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection approaches in real-world scenarios, where continuous learning systems are needed. In this paper, we study the unexplored yet important class-incremental 3D object detection problem and present the first solution - SDCoT, a novel static-dynamic co-teaching method. Our SDCoT alleviates the catastrophic forgetting of old classes via a static teacher, which provides pseudo annotations for old classes in the new samples and regularizes the current model by extracting previous knowledge with a distillation loss. At the same time, SDCoT consistently learns the underlying knowledge from new data via a dynamic teacher. We conduct extensive experiments on two benchmark datasets and demonstrate the superior performance of our SDCoT over baseline approaches in several incremental learning scenarios.
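
下面把静态-动态协同教学的几个损失项压缩成一个分类小例子以示意其组合方式(原方法面向 3D 检测,此处的类别划分、温度与 EMA 动量均为示例假设,非官方实现):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(16, 8)                                    # 8 = 旧类 5 + 新类 3,均为假设
static_teacher = copy.deepcopy(student).requires_grad_(False) # 静态教师:冻结的旧模型
dynamic_teacher = copy.deepcopy(student)                      # 动态教师:学生参数的指数滑动平均

def train_step(x, new_labels, opt, ema_m=0.99, T=2.0):
    with torch.no_grad():
        old_logits = static_teacher(x)
        pseudo_old = old_logits[:, :5].argmax(dim=1)          # 静态教师给出的旧类伪标注
    logits = student(x)
    loss_new = F.cross_entropy(logits, new_labels)            # 新类的监督损失
    loss_pseudo = F.cross_entropy(logits[:, :5], pseudo_old)  # 旧类伪标签损失
    loss_distill = F.kl_div(F.log_softmax(logits[:, :5] / T, dim=1),
                            F.softmax(old_logits[:, :5] / T, dim=1),
                            reduction="batchmean")            # 从静态教师蒸馏旧知识
    loss = loss_new + loss_pseudo + loss_distill
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                     # 动态教师随学生做 EMA 更新
        for pt, ps in zip(dynamic_teacher.parameters(), student.parameters()):
            pt.mul_(ema_m).add_(ps, alpha=1.0 - ema_m)
    return loss.item()

opt = torch.optim.SGD(student.parameters(), lr=0.1)
print(train_step(torch.randn(4, 16), torch.randint(5, 8, (4,)), opt))
```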

【6】 Noise Reduction and Driving Event Extraction Method for Performance Improvement on Driving Noise-based Surface Anomaly Detection 标题:提高基于驾驶噪声的表面异常检测性能的降噪和驾驶事件提取方法 链接:https://arxiv.org/abs/2112.07214

作者:YeongHyeon Park,JoonSung Lee,Myung Jin Kim,Wonseok Park 机构:SK Planet Co., Ltd. 备注:3 pages, 3 figures, 2 tables 摘要:路面上的异物(如雨水或黑冰)会减少轮胎与路面之间的摩擦。上述情况将降低制动性能,并使车身姿态难以控制。在这种情况下,至少有可能造成财产损失。在最坏的情况下,将发生人身伤害。为了避免这一问题,提出了一种基于车辆行驶噪声的道路异常检测模型。然而,先前的建议不考虑额外的噪声,与驾驶噪声混合,并且跳过没有车辆驾驶的时刻的计算。在本文中,我们提出了一种简单的驱动事件提取方法和降噪方法,以提高计算效率和异常检测性能。 摘要:Foreign substances on the road surface, such as rainwater or black ice, reduce the friction between the tire and the surface. The above situation will reduce the braking performance and make difficult to control the vehicle body posture. In that case, there is a possibility of property damage at least. In the worst case, personal damage will be occured. To avoid this problem, a road anomaly detection model is proposed based on vehicle driving noise. However, the prior proposal does not consider the extra noise, mixed with driving noise, and skipping calculations for moments without vehicle driving. In this paper, we propose a simple driving event extraction method and noise reduction method for improving computational efficiency and anomaly detection performance.

【7】 Joint 3D Object Detection and Tracking Using Spatio-Temporal Representation of Camera Image and LiDAR Point Clouds 标题:基于摄像机图像和激光雷达点云时空表示的联合三维目标检测与跟踪 链接:https://arxiv.org/abs/2112.07116

作者:Junho Koh,Jaekyum Kim,Jinhyuk Yoo,Yecheol Kim,Jun Won Choi 机构:Hanyang University, Korea Advanced Institute of Science and Technology (KAIST) 摘要:在本文中,我们提出了一种新的联合目标检测和跟踪(JoDT)框架,用于基于相机和激光雷达传感器的三维目标检测和跟踪。所提出的方法称为3D DetectTrack,使探测器和跟踪器能够协作生成相机和激光雷达数据的时空表示,然后使用该表示来执行3D对象检测和跟踪。探测器通过相机和激光雷达融合获得的空间特征的加权时间聚集来构造时空特征。然后,检测器使用来自保持到上一时间步的轨迹的信息重新配置初始检测结果。基于检测器生成的时空特征,跟踪器使用图形神经网络(GNN)将检测到的对象与先前跟踪的对象相关联。我们通过基于规则的边缘修剪和基于注意的边缘选通相结合,设计了一个完全连通的GNN,它利用空间和时间对象上下文来提高跟踪性能。在KITTI和nuScenes基准上进行的实验表明,与基线方法相比,3D DetecTrack在检测和跟踪性能方面取得了显著的改进,并且通过检测器和跟踪器之间的协作,在现有方法中实现了最先进的性能。 摘要:In this paper, we propose a new joint object detection and tracking (JoDT) framework for 3D object detection and tracking based on camera and LiDAR sensors. The proposed method, referred to as 3D DetecTrack, enables the detector and tracker to cooperate to generate a spatio-temporal representation of the camera and LiDAR data, with which 3D object detection and tracking are then performed. The detector constructs the spatio-temporal features via the weighted temporal aggregation of the spatial features obtained by the camera and LiDAR fusion. Then, the detector reconfigures the initial detection results using information from the tracklets maintained up to the previous time step. Based on the spatio-temporal features generated by the detector, the tracker associates the detected objects with previously tracked objects using a graph neural network (GNN). We devise a fully-connected GNN facilitated by a combination of rule-based edge pruning and attention-based edge gating, which exploits both spatial and temporal object contexts to improve tracking performance. The experiments conducted on both KITTI and nuScenes benchmarks demonstrate that the proposed 3D DetecTrack achieves significant improvements in both detection and tracking performances over baseline methods and achieves state-of-the-art performance among existing methods through collaboration between the detector and tracker.

【8】 EMDS-6: Environmental Microorganism Image Dataset Sixth Version for Image Denoising, Segmentation, Feature Extraction, Classification and Detection Methods Evaluation 标题:EMDS-6:环境微生物图像数据集第六版,用于图像去噪、分割、特征提取、分类和检测方法评价 链接:https://arxiv.org/abs/2112.07111

作者:Peng Zhao,Chen Li,Md Mamunur Rahaman,Hao Xu,Pingli Ma,Hechen Yang,Hongzan Sun,Tao Jiang,Ning Xu,Marcin Grzegorzek 机构:Microscopic Image and Medical Image Analysis Group, MBIE College, Northeastern University, Shenyang, PR China, School of Control Engineering, Chengdu University of Information Technology, Chengdu , China, University of Lübeck, Germany 摘要:环境微生物(EMs)无处不在,对人类社会的生存和发展具有重要影响。然而,环境微生物(EM)数据的高标准和严格要求导致了现有相关数据库的不足,更不用说具有GT图像的数据库了。这个问题严重影响了相关实验的进展。因此,本研究开发了环境微生物数据集第六版(EMDS-6),其中包含21种EMs。每种类型的EM包含40个原始图像和40个GT图像,总共1680个EM图像。在本研究中,为了检验EMDS-6的有效性。我们选择了经典的图像处理算法,如图像去噪、图像分割和目标检测。实验结果表明,EMDS-6可用于评价图像去噪、图像分割、图像特征提取、图像分类和目标检测方法的性能。 摘要:Environmental microorganisms (EMs) are ubiquitous around us and have an important impact on the survival and development of human society. However, the high standards and strict requirements for the preparation of environmental microorganism (EM) data have led to the insufficient of existing related databases, not to mention the databases with GT images. This problem seriously affects the progress of related experiments. Therefore, This study develops the Environmental Microorganism Dataset Sixth Version (EMDS-6), which contains 21 types of EMs. Each type of EM contains 40 original and 40 GT images, in total 1680 EM images. In this study, in order to test the effectiveness of EMDS-6. We choose the classic algorithms of image processing methods such as image denoising, image segmentation and target detection. The experimental result shows that EMDS-6 can be used to evaluate the performance of image denoising, image segmentation, image feature extraction, image classification, and object detection methods.

【9】 Improving COVID-19 CXR Detection with Synthetic Data Augmentation 标题:利用合成数据增强改进COVID-19胸片(CXR)检测 链接:https://arxiv.org/abs/2112.07529

作者:Daniel Schaudt,Christopher Kloth,Christian Spaete,Andreas Hinteregger,Meinrad Beer,Reinhold von Schwerin 机构: Technische Hochschule Ulm - Ulm University of Applied Sciences, Universitätsklinikum Ulm - Ulm University Medical Center 备注:This paper has been accepted at the Upper-Rhine Artificial Intelligence Symposium 2021 arXiv:2112.05657 摘要:自COVID-19大流行开始以来,研究人员开发了深度学习模型来对COVID-19引起的肺炎进行分类。与许多医学影像任务一样,可用数据的质量和数量通常有限。在这项工作中,我们在公开可用的COVID-19影像数据上训练深度学习模型,并在本地医院的胸部X射线数据上对其进行评估。这些数据由两名放射科医生审查并标注,以确保对模型泛化能力做出高质量的评估。此外,我们使用生成对抗网络在这些数据的基础上生成合成X射线图像。我们的结果表明,使用这些合成图像做数据增强可以显著提升模型性能。对于许多数据稀缺的领域而言,这是一种很有前景的方法。 摘要:Since the beginning of the COVID-19 pandemic, researchers have developed deep learning models to classify COVID-19 induced pneumonia. As with many medical imaging tasks, the quality and quantity of the available data is often limited. In this work we train a deep learning model on publicly available COVID-19 image data and evaluate the model on local hospital chest X-ray data. The data has been reviewed and labeled by two radiologists to ensure a high quality estimation of the generalization capabilities of the model. Furthermore, we are using a Generative Adversarial Network to generate synthetic X-ray images based on this data. Our results show that using those synthetic images for data augmentation can improve the model's performance significantly. This can be a promising approach for many sparse data domains.

【10】 COVID-19 Pneumonia and Influenza Pneumonia Detection Using Convolutional Neural Networks 标题:基于卷积神经网络的COVID-19肺炎与流感肺炎检测 链接:https://arxiv.org/abs/2112.07102

作者:Julianna Antonchuk,Benjamin Prescott,Philip Melanchthon,Robin Singh 机构:Northwestern University 备注:for associated Azure ML notebook code, see this https URL 摘要:在这项研究中,我们开发了一种计算机视觉解决方案,以支持诊断放射学区分COVID-19肺炎、流感病毒肺炎和正常生物标志物。COVID-19肺炎的胸片表现被认为是非特异性的,这使得寻找一种能够在COVID-19与非COVID-19类型肺炎的肺部炎症特征之间进行高灵敏度分类的最优卷积神经网络(CNN)结构成为一项挑战。Rahman(2021)指出,COVID-19影像数据存在可得性不足和质量问题,影响诊断流程并降低深度学习检测模型的准确性。COVID-19影像的显著稀缺导致了数据不平衡,这促使我们使用过采样技术。在本研究中,我们纳入了涵盖COVID-19肺炎、流感病毒肺炎和正常生物标志物的大量人肺X射线影像(CXR),以构建可扩展且准确的CNN模型。在实验阶段,我们评估了多种卷积网络结构,最终选择了一个包含两个传统卷积层和两个最大池化层的顺序卷积网络。在分类性能上,表现最好的模型取得了93%的验证准确率和0.95的F1分数。我们选择Azure机器学习服务来进行网络实验和方案部署,其自动伸缩计算集群显著缩短了网络训练时间。我们期待人工智能与人类生物学领域的科学家开展合作并拓展所提出的方案,以提供快速而全面的诊断,从而有效减缓病毒的传播。 摘要:In the research, we developed a computer vision solution to support diagnostic radiology in differentiating between COVID-19 pneumonia, influenza virus pneumonia, and normal biomarkers. The chest radiograph appearance of COVID-19 pneumonia is thought to be nonspecific, having presented a challenge to identify an optimal architecture of a convolutional neural network (CNN) that would classify with a high sensitivity among the pulmonary inflammation features of COVID-19 and non-COVID-19 types of pneumonia. Rahman (2021) states that COVID-19 radiography images observe unavailability and quality issues impacting the diagnostic process and affecting the accuracy of the deep learning detection models. A significant scarcity of COVID-19 radiography images introduced an imbalance in data motivating us to use over-sampling techniques. In the study, we include an extensive set of X-ray imaging of human lungs (CXR) with COVID-19 pneumonia, influenza virus pneumonia, and normal biomarkers to achieve an extensible and accurate CNN model. In the experimentation phase of the research, we evaluated a variety of convolutional network architectures, selecting a sequential convolutional network with two traditional convolutional layers and two pooling layers with maximum function. In its classification performance, the best performing model demonstrated a validation accuracy of 93% and an F1 score of 0.95. We chose the Azure Machine Learning service to perform network experimentation and solution deployment. The auto-scaling compute clusters offered a significant time reduction in network training. We would like to see scientists across fields of artificial intelligence and human biology collaborating and expanding on the proposed solution to provide rapid and comprehensive diagnostics, effectively mitigating the spread of the virus.
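
摘要所述表现最好的结构是"两个传统卷积层加两个最大池化层的顺序卷积网络"。下面给出一个符合这一描述的极简 PyTorch 示意(通道数、输入分辨率与类别数均为示例假设):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 3),        # 三分类:COVID-19 肺炎 / 流感肺炎 / 正常
)
logits = model(torch.randn(2, 1, 224, 224))
print(logits.shape)                     # torch.Size([2, 3])
```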

分类|识别相关(6篇)

【1】 Text Classification Models for Form Entity Linking 标题:用于表单实体链接的文本分类模型 链接:https://arxiv.org/abs/2112.07443

作者:María Villota,César Domínguez,Jónathan Heras,Eloy Mata,Vico Pascual 机构:Department of Mathematics and Computer Science, University of La Rioja, Spain 摘要:表单是一种广泛使用的基于模板的文档类型,用于各种领域,包括管理、医学、金融或保险等。由于每天生成的表单数量不断增加,因此迫切需要自动提取这些文档中包含的信息。但是,在处理扫描的表单时,这并不是一项简单的任务,因为具有不同表单实体位置的模板非常多样,而且扫描文档的质量也很高。在这种情况下,所有表单都有一个共同的特性:它们包含一个作为键值(或标签值)对构建的互连实体集合,以及其他实体,如标题或图像。在这项工作中,我们通过结合图像处理技术和基于BERT体系结构的文本分类模型,解决了表单中的实体链接问题。这种方法在FUNSD数据集上的F1得分为0.80,达到了最先进的结果,比以前最好的方法提高了5%。该项目的代码可在https://github.com/mavillot/FUNSD-Entity-Linking. 摘要:Forms are a widespread type of template-based document used in a great variety of fields including, among others, administration, medicine, finance, or insurance. The automatic extraction of the information included in these documents is greatly demanded due to the increasing volume of forms that are generated in a daily basis. However, this is not a straightforward task when working with scanned forms because of the great diversity of templates with different location of form entities, and the quality of the scanned documents. In this context, there is a feature that is shared by all forms: they contain a collection of interlinked entities built as key-value (or label-value) pairs, together with other entities such as headers or images. In this work, we have tacked the problem of entity linking in forms by combining image processing techniques and a text classification model based on the BERT architecture. This approach achieves state-of-the-art results with a F1-score of 0.80 on the FUNSD dataset, a 5% improvement regarding the best previous method. The code of this project is available at https://github.com/mavillot/FUNSD-Entity-Linking.

【2】 Federated Learning for Face Recognition with Gradient Correction 标题:带梯度校正的人脸识别联邦学习 链接:https://arxiv.org/abs/2112.07246

作者:Yifan Niu,Weihong Deng 机构:Beijing University of Posts and Telecommunications 备注:accepted by AAAI2022 摘要:随着对人脸识别中隐私问题的日益关注,联合学习已经成为研究具有私有分散数据的无约束人脸识别问题的最流行的方法之一。然而,在人脸识别场景中,传统的分散联邦算法在客户端之间共享整个网络参数,存在隐私泄漏问题。在这项工作中,我们引入了一个框架,FedGC,以解决人脸识别的联合学习问题,并保证更高的隐私。我们从反向传播的角度探索了一种新的梯度校正思想,并提出了一种基于softmax的正则化器,通过精确注入跨客户端梯度项来校正类嵌入的梯度。理论上,我们证明了FedGC构成了一个类似于标准softmax的有效损失函数。通过大量的实验验证了FedGC的优越性,它可以在几个流行的基准数据集上与传统的集中式方法的性能相匹配。 摘要:With increasing appealing to privacy issues in face recognition, federated learning has emerged as one of the most prevalent approaches to study the unconstrained face recognition problem with private decentralized data. However, conventional decentralized federated algorithm sharing whole parameters of networks among clients suffers from privacy leakage in face recognition scene. In this work, we introduce a framework, FedGC, to tackle federated learning for face recognition and guarantees higher privacy. We explore a novel idea of correcting gradients from the perspective of backward propagation and propose a softmax-based regularizer to correct gradients of class embeddings by precisely injecting a cross-client gradient term. Theoretically, we show that FedGC constitutes a valid loss function similar to standard softmax. Extensive experiments have been conducted to validate the superiority of FedGC which can match the performance of conventional centralized methods utilizing full training dataset on several popular benchmark datasets.

【3】 Margin Calibration for Long-Tailed Visual Recognition 标题:面向长尾视觉识别的分类间隔校准 链接:https://arxiv.org/abs/2112.07225

作者:Yidong Wang,Bowen Zhang,Wenxin Hou,Zhen Wu,Jindong Wang,Takahiro Shinozaki 机构:Tokyo Institute of Technology, Nanjing University, Microsoft Research Asia 备注:Technical report; 9 pages 摘要:视觉识别任务中的长尾类别分布给神经网络带来了巨大挑战:模型在头部类和尾部类之间的预测存在偏差,倾向于把尾部类误分为头部类。现有研究主要集中在数据重采样和损失函数设计上,而本文采取了一个不同的视角:分类间隔。我们研究了间隔与logits(分类分数)之间的关系,并通过实验观察到有偏间隔与有偏logits之间呈正相关。我们提出MARC,一个简单而有效的间隔校准(MARgin Calibration)函数,用于动态校准有偏间隔以获得无偏logits。我们在常见的长尾基准(包括CIFAR-LT、ImageNet-LT、Places-LT和iNaturalist-LT)上进行了大量实验来验证MARC。实验结果表明,MARC在这些基准上取得了良好的结果。此外,MARC极易实现,只需三行代码。我们希望这一简单方法能促使人们重新思考长尾视觉识别中的有偏间隔与有偏logits。 摘要:The long-tailed class distribution in visual recognition tasks poses great challenges for neural networks on how to handle the biased predictions between head and tail classes, i.e., the model tends to classify tail classes as head classes. While existing research focused on data resampling and loss function engineering, in this paper, we take a different perspective: the classification margins. We study the relationship between the margins and logits (classification scores) and empirically observe the biased margins and the biased logits are positively correlated. We propose MARC, a simple yet effective MARgin Calibration function to dynamically calibrate the biased margins for unbiased logits. We validate MARC through extensive experiments on common long-tailed benchmarks including CIFAR-LT, ImageNet-LT, Places-LT, and iNaturalist-LT. Experimental results demonstrate that our MARC achieves favorable results on these benchmarks. In addition, MARC is extremely easy to implement with just three lines of code. We hope this simple method will motivate people to rethink the biased margins and biased logits in long-tailed visual recognition.
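
摘要称 MARC 只需几行代码即可实现。下面给出一种与其思路一致的示意写法:在冻结模型的 logits 上为每个类学习缩放与偏置(具体参数化方式以论文为准,此处为假设性示例):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginCalibration(nn.Module):
    """在冻结的长尾模型之后,为每个类学习一个缩放系数与偏置来校准有偏的 logits(核心只有几行)。"""
    def __init__(self, num_classes):
        super().__init__()
        self.a = nn.Parameter(torch.ones(num_classes))
        self.b = nn.Parameter(torch.zeros(num_classes))

    def forward(self, logits):
        return self.a * logits + self.b

calib = MarginCalibration(num_classes=10)
frozen_logits = torch.randn(32, 10)                  # 假设:来自已训练好的长尾模型的输出
labels = torch.randint(10, (32,))
opt = torch.optim.SGD(calib.parameters(), lr=0.1)
for _ in range(50):                                  # 只训练校准参数,骨干与原分类器保持冻结
    opt.zero_grad()
    F.cross_entropy(calib(frozen_logits), labels).backward()
    opt.step()
print(calib.a, calib.b)
```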

【4】 Exploring Category-correlated Feature for Few-shot Image Classification 标题:探索类别相关特征进行Few-Shot图像分类 链接:https://arxiv.org/abs/2112.07224

作者:Jing Xu,Xinglin Pan,Xu Luo,Wenjie Pei,Zenglin Xu 机构:Harbin Institute of Technology, Shenzhen, University of Electronic Science and Technology of China 备注:10 pages, 9 figures 摘要:Few-Shot分类旨在使分类器适应具有少量训练样本的新类。然而,训练数据的不足可能会导致对某一类特征分布的有偏估计。为了缓解这一问题,我们提出了一种简单而有效的特征校正方法,该方法利用新类和基类之间的类别相关性作为先验知识。我们通过将特征映射到一个与基类数目相匹配的维数的潜在向量中,将其视为特征在基类上的对数概率,来明确地捕获这种相关性。基于此潜在向量,校正后的特征由解码器直接构造,我们期望解码器在保留类别相关信息的同时去除其他随机因素,从而更接近其类别质心。此外,通过改变softmax中的温度值,我们可以重新平衡特征校正和重构以获得更好的性能。我们的方法是通用的,灵活的,对任何特征提取和分类器都是不可知的,很容易嵌入到现有的FSL方法中。实验证明,我们的方法能够纠正有偏特征,特别是当特征远离类质心时。所提出的方法在三个广泛使用的基准上一致地获得了可观的性能增益,并使用不同的主干和分类器进行了评估。该守则将予以公布。 摘要:Few-shot classification aims to adapt classifiers to novel classes with a few training samples. However, the insufficiency of training data may cause a biased estimation of feature distribution in a certain class. To alleviate this problem, we present a simple yet effective feature rectification method by exploring the category correlation between novel and base classes as the prior knowledge. We explicitly capture such correlation by mapping features into a latent vector with dimension matching the number of base classes, treating it as the logarithm probability of the feature over base classes. Based on this latent vector, the rectified feature is directly constructed by a decoder, which we expect maintaining category-related information while removing other stochastic factors, and consequently being closer to its class centroid. Furthermore, by changing the temperature value in softmax, we can re-balance the feature rectification and reconstruction for better performance. Our method is generic, flexible and agnostic to any feature extractor and classifier, readily to be embedded into existing FSL approaches. Experiments verify that our method is capable of rectifying biased features, especially when the feature is far from the class centroid. The proposed approach consistently obtains considerable performance gains on three widely used benchmarks, evaluated with different backbones and classifiers. The code will be made public.

【5】 Multi-Expert Human Action Recognition with Hierarchical Super-Class Learning 标题:基于分层超类学习的多专家人体动作识别 链接:https://arxiv.org/abs/2112.07015

作者:Hojat Asgarian Dehkordi,Ali Soltani Nezhad,Hossein Kashiani,Shahriar Baradaran Shokouhi,Ahmad Ayatollahi 机构:School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran 备注:47 pages 摘要:在静态图像人类行为识别中,现有的研究主要利用额外的边界框信息和类标签来缓解静态图像中时间信息的不足;但是,使用手动注释准备额外数据非常耗时,而且容易出现人为错误。此外,现有的研究还没有解决行动识别的长尾分布。在本文中,我们提出了一种两阶段多专家分类的人体行为识别方法,以处理长尾分布,通过超类学习,在没有任何额外信息的情况下。为了为每个超类选择最佳配置并描述不同动作类之间的类间依赖关系,我们提出了一种新的基于图的类选择(GCS)算法。在所提出的方法中,粗粒度阶段选择最相关的细粒度专家。然后,细粒度专家对每个超类中的复杂细节进行编码,以便增加类间的变化。对各种公共人类行为识别数据集进行了广泛的实验评估,包括Stanford40、Pascal VOC 2012 action、BU101 和IHAR数据集。实验结果表明,该方法具有良好的效果。更具体地说,在IHAR、Sanford40、Pascal VOC 2012 Action和BU101 基准中,所提出的方法比最新研究高出8.92%、0.41%、0.66%和2.11%,计算成本更低,且无任何辅助注释信息。此外,本文还证明了在解决长尾分布的动作识别问题时,该方法的性能明显优于其他方法。 摘要:In still image human action recognition, existing studies have mainly leveraged extra bounding box information along with class labels to mitigate the lack of temporal information in still images; however, preparing extra data with manual annotation is time-consuming and also prone to human errors. Moreover, the existing studies have not addressed action recognition with long-tailed distribution. In this paper, we propose a two-phase multi-expert classification method for human action recognition to cope with long-tailed distribution by means of super-class learning and without any extra information. To choose the best configuration for each super-class and characterize inter-class dependency between different action classes, we propose a novel Graph-Based Class Selection (GCS) algorithm. In the proposed approach, a coarse-grained phase selects the most relevant fine-grained experts. Then, the fine-grained experts encode the intricate details within each super-class so that the inter-class variation increases. Extensive experimental evaluations are conducted on various public human action recognition datasets, including Stanford40, Pascal VOC 2012 Action, BU101 , and IHAR datasets. The experimental results demonstrate that the proposed method yields promising improvements. To be more specific, in IHAR, Sanford40, Pascal VOC 2012 Action, and BU101 benchmarks, the proposed approach outperforms the state-of-the-art studies by 8.92%, 0.41%, 0.66%, and 2.11 % with much less computational cost and without any auxiliary annotation information. Besides, it is proven that in addressing action recognition with long-tailed distribution, the proposed method outperforms its counterparts by a significant margin.

【6】 Classification of histopathology images using ConvNets to detect Lupus Nephritis 标题:基于ConvNets的狼疮性肾炎组织病理学图像分类 链接:https://arxiv.org/abs/2112.07555

作者:Akash Gupta,Anirudh Reddy,CV Jawahar,PK Vinod 机构:New York University, IIIT Hyderabad 备注:Accepted in the 2021 Medical Imaging meets NeurIPS Workshop 摘要:系统性红斑狼疮(SLE)是一种自身免疫性疾病,患者的免疫系统开始攻击身体的健康组织。狼疮性肾炎(LN)是指肾组织的炎症,由于这些攻击而导致肾功能衰竭。国际肾病学会/肾脏病理学会(ISN/RPS)发布了一个基于SLE肾损伤过程中观察到的各种模式的分类系统。传统的方法需要对肾活检进行细致的病理评估,而且耗时。最近,计算技术通过使用虚拟显微镜或全玻片成像(WSI)帮助缓解了这个问题。通过使用深度学习和现代计算机视觉技术,我们提出了一个管道,该管道能够自动完成以下过程:1)检测这些完整幻灯片图像中的各种肾小球模式;2)使用提取的肾小球特征对每个图像进行分类。 摘要:Systemic lupus erythematosus (SLE) is an autoimmune disease in which the immune system of the patient starts attacking healthy tissues of the body. Lupus Nephritis (LN) refers to the inflammation of kidney tissues resulting in renal failure due to these attacks. The International Society of Nephrology/Renal Pathology Society (ISN/RPS) has released a classification system based on various patterns observed during renal injury in SLE. Traditional methods require meticulous pathological assessment of the renal biopsy and are time-consuming. Recently, computational techniques have helped to alleviate this issue by using virtual microscopy or Whole Slide Imaging (WSI). With the use of deep learning and modern computer vision techniques, we propose a pipeline that is able to automate the process of 1) detection of various glomeruli patterns present in these whole slide images and 2) classification of each image using the extracted glomeruli features.

分割|语义相关(5篇)

【1】 n-CPS: Generalising Cross Pseudo Supervision to n networks for Semi-Supervised Semantic Segmentation 标题:n-CPS:将交叉伪监督推广到n网络进行半监督语义分割 链接:https://arxiv.org/abs/2112.07528

作者:Dominik Filipiak,Piotr Tempczyk,Marek Cygan 机构: AI Clearing, Inc., Semantic Technology Institute, Department of Computer Science, University of Innsbruck, Informatics and Mechanics, University of Warsaw 摘要:我们提出了n-CPS——对最新的半监督语义分割方法交叉伪监督(CPS)的一种推广。在n-CPS中,n个同时训练的子网络通过one-hot编码扰动和一致性正则化相互学习。我们还表明,对子网络输出应用集成技术可以显著提高性能。据我们所知,n-CPS与CutMix结合的表现优于CPS,并在Pascal VOC 2012(1/16、1/8、1/4和1/2监督比例)和Cityscapes(1/16监督比例)上取得了新的最先进水平。 摘要:We present $n$-CPS - a generalisation of the recent state-of-the-art cross pseudo supervision (CPS) approach for the task of semi-supervised semantic segmentation. In $n$-CPS, there are $n$ simultaneously trained subnetworks that learn from each other through one-hot encoding perturbation and consistency regularisation. We also show that ensembling techniques applied to subnetworks outputs can significantly improve the performance. To the best of our knowledge, $n$-CPS paired with CutMix outperforms CPS and sets the new state-of-the-art for Pascal VOC 2012 with (1/16, 1/8, 1/4, and 1/2 supervised regimes) and Cityscapes (1/16 supervised).
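
下面用 PyTorch 给出 n 个子网络互为伪监督这一核心损失的极简示意(用 1x1 卷积充当分割网络,n 与类别数均为示例假设,非官方实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# n 个分割子网互为伪标签:每个子网对无标注图像的 one-hot 预测被用来监督其余 n-1 个子网
n, num_classes = 3, 21
nets = nn.ModuleList([nn.Conv2d(3, num_classes, kernel_size=1) for _ in range(n)])

def cross_pseudo_supervision_loss(image):                     # image: (B, 3, H, W)
    logits = [net(image) for net in nets]                     # 每项: (B, C, H, W)
    pseudo = [l.argmax(dim=1).detach() for l in logits]       # one-hot(argmax)伪标签,不回传梯度
    loss = sum(F.cross_entropy(logits[i], pseudo[j])
               for i in range(n) for j in range(n) if i != j)
    return loss / (n * (n - 1))

unlabeled = torch.randn(2, 3, 16, 16)
cross_pseudo_supervision_loss(unlabeled).backward()
# 推理时可按摘要所述对 n 个子网的输出做集成,例如取 softmax 概率的平均
```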

【2】 A Style and Semantic Memory Mechanism for Domain Generalization 标题:一种面向领域泛化的风格和语义记忆机制 链接:https://arxiv.org/abs/2112.07517

作者:Yang Chen,Yu Wang,Yingwei Pan,Ting Yao,Xinmei Tian,Tao Mei 机构:† University of Science and Technology of China, Hefei, China, ‡JD AI Research, Beijing, China 备注:ICCV 2021 摘要:主流最先进的领域泛化算法倾向于优先考虑跨领域语义不变性的假设。同时,固有的域内风格不变性通常被低估和搁置。在本文中,我们发现利用域内风格不变性对于提高域泛化的效率也至关重要。我们验证了网络提供关于哪些领域特征是不变的以及在实例之间共享的信息是至关重要的,这样网络可以增强其理解能力并提高其语义辨别能力。相应地,我们还提出了一种新的“陪审团”机制,该机制在学习领域间有用的语义特征共性方面特别有效。我们称为STEAM的完整模型可以解释为一种新的概率图形模型,其实现需要方便地构造两种类型的内存库:语义特征库和样式特征库。实证结果表明,我们提出的框架明显优于最先进的方法。 摘要:Mainstream state-of-the-art domain generalization algorithms tend to prioritize the assumption on semantic invariance across domains. Meanwhile, the inherent intra-domain style invariance is usually underappreciated and put on the shelf. In this paper, we reveal that leveraging intra-domain style invariance is also of pivotal importance in improving the efficiency of domain generalization. We verify that it is critical for the network to be informative on what domain features are invariant and shared among instances, so that the network sharpens its understanding and improves its semantic discriminative ability. Correspondingly, we also propose a novel "jury" mechanism, which is particularly effective in learning useful semantic feature commonalities among domains. Our complete model called STEAM can be interpreted as a novel probabilistic graphical model, for which the implementation requires convenient constructions of two kinds of memory banks: semantic feature bank and style feature bank. Empirical results show that our proposed framework surpasses the state-of-the-art methods by clear margins.

【3】 Uncertainty Estimation via Response Scaling for Pseudo-mask Noise Mitigation in Weakly-supervised Semantic Segmentation 标题:弱监督语义分割中伪掩模噪声抑制的响应尺度不确定性估计 链接:https://arxiv.org/abs/2112.07431

作者:Yi Li,Yiqun Duan,Zhanghui Kuang,Yimin Chen,Wayne Zhang,Xiaomeng Li 机构: Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, SenseTime Research , Department of Computer Science, University of Technology Sydney 备注:Accept at AAAI 2022, Code is available at this https URL 摘要:弱监督语义分割(WSSS)在不需要密集注释的情况下分割对象。而作为代价,生成的伪掩模存在明显的噪声像素,这导致在这些伪掩模上训练次优分割模型。但很少有研究注意到或研究这个问题,即使这些噪声像素在伪掩模改进后也是不可避免的。因此,我们试图在噪声抑制方面对无线传感器网络进行改进。我们观察到许多噪声像素具有很高的置信度,特别是当响应范围太宽或太窄时,呈现不确定状态。因此,在本文中,我们通过多次缩放预测图来模拟响应的噪声变化,以进行不确定性估计。然后利用不确定性对分割损失进行加权,以减轻噪声监督信号。我们称这种方法为URN,缩写为通过响应缩放进行噪声缓解的不确定性估计。实验验证了URN的好处,我们的方法在PASCAL VOC 2012和MS COCO 2014上分别达到71.2%和41.5%的最新结果,而无需额外的模型,如显著性检测。代码可在https://github.com/XMed-Lab/URN. 摘要:Weakly-Supervised Semantic Segmentation (WSSS) segments objects without a heavy burden of dense annotation. While as a price, generated pseudo-masks exist obvious noisy pixels, which result in sub-optimal segmentation models trained over these pseudo-masks. But rare studies notice or work on this problem, even these noisy pixels are inevitable after their improvements on pseudo-mask. So we try to improve WSSS in the aspect of noise mitigation. And we observe that many noisy pixels are of high confidence, especially when the response range is too wide or narrow, presenting an uncertain status. Thus, in this paper, we simulate noisy variations of response by scaling the prediction map multiple times for uncertainty estimation. The uncertainty is then used to weight the segmentation loss to mitigate noisy supervision signals. We call this method URN, abbreviated from Uncertainty estimation via Response scaling for Noise mitigation. Experiments validate the benefits of URN, and our method achieves state-of-the-art results at 71.2% and 41.5% on PASCAL VOC 2012 and MS COCO 2014 respectively, without extra models like saliency detection. Code is available at https://github.com/XMed-Lab/URN.
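
下面给出对"多次缩放响应图以估计不确定性、再为伪掩模损失加权"这一思想的一种示意性实现;缩放方式与加权函数均为本文假设,具体请以论文与官方代码为准:

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_seg_loss(logits, pseudo_mask, scales=(0.5, 1.0, 2.0)):
    """logits: (B, C, H, W);pseudo_mask: (B, H, W),来自弱监督得到的伪掩模。
    对响应乘以多个缩放系数得到若干概率图,用其方差近似像素级不确定性,
    再据此降低可疑噪声像素的损失权重。"""
    probs = torch.stack([F.softmax(logits * s, dim=1) for s in scales])      # (S, B, C, H, W)
    uncertainty = probs.var(dim=0).sum(dim=1)                                # (B, H, W)
    weight = 1.0 / (1.0 + uncertainty)                                       # 不确定性越大,权重越小
    per_pixel = F.cross_entropy(logits, pseudo_mask, reduction="none")       # (B, H, W)
    return (weight.detach() * per_pixel).mean()

loss = uncertainty_weighted_seg_loss(torch.randn(2, 21, 16, 16),
                                     torch.randint(21, (2, 16, 16)))
print(loss)
```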

【4】 PP-HumanSeg: Connectivity-Aware Portrait Segmentation with a Large-Scale Teleconferencing Video Dataset 标题:PP-HumanSeg:基于大规模电话会议视频数据集的连通性感知的人像分割 链接:https://arxiv.org/abs/2112.07146

作者:Lutao Chu,Yi Liu,Zewu Wu,Shiyu Tang,Guowei Chen,Yuying Hao,Juncai Peng,Zhiliang Yu,Zeyu Chen,Baohua Lai,Haoyi Xiong 机构:Baidu, Inc. 备注:Accepted by WACV workshop 摘要:随着全球范围内2019冠状病毒疾病的猖獗,视频会议的需求激增。为此,实时肖像分割成为一种流行的功能,以取代会议参与者的背景。虽然为从生活场景中提取身体姿势的分割提供了功能丰富的数据集、模型和算法,但视频会议环境中尚未很好地涵盖肖像分割。为了促进这一领域的进展,我们引入了一个名为PP HumanSeg的开源解决方案。这项工作是第一次构建大规模视频肖像数据集,其中包含来自23个会议场景的291个视频,具有14K精细标记帧,并扩展到多摄像机远程会议。此外,我们还提出了一种新的语义连接感知学习(SCL)方法用于语义分割,该方法引入了语义连接感知损失,从连接的角度提高了分割结果的质量。我们提出了一种超轻量级的基于SCL的人像分割模型,在IoU和推理速度之间实现了最佳的平衡。对我们的数据集的广泛评估证明了SCL和我们的模型的优越性。源代码可在https://github.com/PaddlePaddle/PaddleSeg. 摘要:As the COVID-19 pandemic rampages across the world, the demands of video conferencing surge. To this end, real-time portrait segmentation becomes a popular feature to replace backgrounds of conferencing participants. While feature-rich datasets, models and algorithms have been offered for segmentation that extract body postures from life scenes, portrait segmentation has yet not been well covered in a video conferencing context. To facilitate the progress in this field, we introduce an open-source solution named PP-HumanSeg. This work is the first to construct a large-scale video portrait dataset that contains 291 videos from 23 conference scenes with 14K fine-labeled frames and extensions to multi-camera teleconferencing. Furthermore, we propose a novel Semantic Connectivity-aware Learning (SCL) for semantic segmentation, which introduces a semantic connectivity-aware loss to improve the quality of segmentation results from the perspective of connectivity. And we propose an ultra-lightweight model with SCL for practical portrait segmentation, which achieves the best trade-off between IoU and the speed of inference. Extensive evaluations on our dataset demonstrate the superiority of SCL and our model. The source code is available at https://github.com/PaddlePaddle/PaddleSeg.

【5】 ES-CRF: Embedded Superpixel CRF for Semantic Segmentation 标题:ES-CRF:嵌入式超像素语义分割CRF 链接:https://arxiv.org/abs/2112.07106

作者:Jie Zhu,Huabin Huang,Banghuai Li,Leye Wang 机构:Peking University, Beijing, China, Megvii Inc 备注:10 page 4 figures 摘要:现代语义分割方法主要通过度量学习、结构设计等方法调整特征表示以提高分割性能,但几乎所有这些方法都忽略了边界像素的特殊性。由于CNN网络中感受野的不断扩大,这些像素容易从两侧获得令人困惑的特征。这样会误导模型优化的方向,使这些倾向于共享多个相邻像素的类别权重缺乏区分,从而损害整体性能。在这项工作中,我们深入研究了这个问题,并提出了一种新的方法称为嵌入式超像素CRF(ES-CRF)来解决这个问题。ES-CRF涉及两个主要方面。一方面,ES-CRF创新性地将CRF机制作为一个有机整体融合到CNN网络中,以实现更有效的端到端优化。它利用CRF引导高层特征中像素间的信息传递,借助于同一对象的内部像素来净化边界像素的特征表示。另一方面,超级像素被集成到ES-CRF中,以利用本地对象,从而实现更可靠的消息传递。最后,我们提出的方法在两个具有挑战性的基准上产生了新的记录,即Cityscapes和ADE20K。此外,我们还进行了详细的理论分析,以验证ES-CRF的优越性。 摘要:Modern semantic segmentation methods devote much attention to adjusting feature representations to improve the segmentation performance in various ways, such as metric learning, architecture design, etc. However, almost all those methods neglect the particularity of boundary pixels. These pixels are prone to obtain confusing features from both sides due to the continuous expansion of receptive fields in CNN networks. In this way, they will mislead the model optimization direction and make the class weights of such categories that tend to share many adjacent pixels lack discrimination, which will damage the overall performance. In this work, we dive deep into this problem and propose a novel method named Embedded Superpixel CRF (ES-CRF) to address it. ES-CRF involves two main aspects. On the one hand, ES-CRF innovatively fuses the CRF mechanism into the CNN network as an organic whole for more effective end-to-end optimization. It utilizes CRF to guide the message passing between pixels in high-level features to purify the feature representation of boundary pixels, with the help of inner pixels belong to the same object. On the other hand, superpixel is integrated into ES-CRF to exploit the local object prior for more reliable message passing. Finally, our proposed method yields new records on two challenging benchmarks, i.e., Cityscapes and ADE20K. Moreover, we make detailed theoretical analysis to verify the superiority of ES-CRF.

Zero/Few Shot|迁移|域适配|自适应(3篇)

【1】 Adaptive Affinity for Associations in Multi-Target Multi-Camera Tracking 标题:多目标多摄像机跟踪中的关联自适应亲和度 链接:https://arxiv.org/abs/2112.07664

作者:Yunzhong Hou,Zhongdao Wang,Shengjin Wang,Liang Zheng 备注:This paper appears in: IEEE Transactions on Image Processing 摘要:多目标多摄像机跟踪(MTMCT)中的数据关联通常直接从重新识别(re ID)特征距离估计亲和性。然而,我们认为,鉴于re ID和MTMCT问题之间匹配范围的差异,这可能不是最佳选择。Re-ID系统专注于全局匹配,即从所有摄像机和所有时间检索目标。相比之下,跟踪中的数据关联是一个局部匹配问题,因为它的候选对象只来自相邻的位置和时间帧。在本文中,我们设计了实验来验证跟踪中全局re-ID特征距离和局部匹配之间的这种不匹配,并提出了一种简单而有效的方法来适应MTMCT中相应的匹配范围。我们没有试图处理所有的外观变化,而是对亲和性度量进行定制,以专门处理在数据关联过程中可能出现的变化。为此,我们引入了一种新的带有时间窗口的数据采样方案,最初用于跟踪中的数据关联。自适应关联模块最大限度地减少了不匹配,大大改善了全局re ID距离,并在CityFlow和DukeMTMC数据集上产生了具有竞争力的性能。 摘要:Data associations in multi-target multi-camera tracking (MTMCT) usually estimate affinity directly from re-identification (re-ID) feature distances. However, we argue that it might not be the best choice given the difference in matching scopes between re-ID and MTMCT problems. Re-ID systems focus on global matching, which retrieves targets from all cameras and all times. In contrast, data association in tracking is a local matching problem, since its candidates only come from neighboring locations and time frames. In this paper, we design experiments to verify such misfit between global re-ID feature distances and local matching in tracking, and propose a simple yet effective approach to adapt affinity estimations to corresponding matching scopes in MTMCT. Instead of trying to deal with all appearance changes, we tailor the affinity metric to specialize in ones that might emerge during data associations. To this end, we introduce a new data sampling scheme with temporal windows originally used for data associations in tracking. Minimizing the mismatch, the adaptive affinity module brings significant improvements over global re-ID distance, and produces competitive performance on CityFlow and DukeMTMC datasets.

【2】 Transferrable Contrastive Learning for Visual Domain Adaptation 标题:用于视域自适应的可转移对比学习 链接:https://arxiv.org/abs/2112.07516

作者:Yang Chen,Yingwei Pan,Yu Wang,Ting Yao,Xinmei Tian,Tao Mei 机构:University of Science and Technology of China, Hefei, China, JD AI Research, Beijing, China 备注:ACM Multimedia 2021 摘要:自监督学习(SSL)最近成为特征学习方法中最受欢迎的一种。因此,呼吁域适应方法考虑纳入SSL。直觉是强制执行实例级的特性一致性,这样预测器就可以在域之间保持不变。然而,域适配机制中的大多数现有SSL方法通常被视为独立的辅助组件,使得域适配的签名无人值守。实际上,域间隙消失的最佳区域和SSL所研究的实例级约束可能根本不一致。从这一点出发,我们提出了一种专门针对领域适应的自我监督学习范式,即可转移对比学习(TCL),它将SSL与所需的跨领域转移性一致地联系起来。我们发现,对比学习本质上是一种适用于领域适应的方法,因为它的实例不变性假设可以方便地推广到领域适应任务所青睐的跨领域类级不变性。TCL基于特定的记忆库结构和伪标记策略,通过清晰新颖的对比损失来惩罚源和目标之间的跨域类内域差异。免费午餐是,得益于对比学习的结合,TCL依赖于移动平均密钥编码器,该编码器自然实现了目标数据伪标签的临时集成版本,从而避免了伪标签错误传播,无需额外成本。因此,TCL有效地减少了跨域差距。通过针对单源和多源域适配任务的大量基准测试(Office Home、VisDA-2017、Digits five、PACS和DomainNet),TCL展示了最先进的性能。 摘要:Self-supervised learning (SSL) has recently become the favorite among feature learning methodologies. It is therefore appealing for domain adaptation approaches to consider incorporating SSL. The intuition is to enforce instance-level feature consistency such that the predictor becomes somehow invariant across domains. However, most existing SSL methods in the regime of domain adaptation usually are treated as standalone auxiliary components, leaving the signatures of domain adaptation unattended. Actually, the optimal region where the domain gap vanishes and the instance level constraint that SSL peruses may not coincide at all. From this point, we present a particular paradigm of self-supervised learning tailored for domain adaptation, i.e., Transferrable Contrastive Learning (TCL), which links the SSL and the desired cross-domain transferability congruently. We find contrastive learning intrinsically a suitable candidate for domain adaptation, as its instance invariance assumption can be conveniently promoted to cross-domain class-level invariance favored by domain adaptation tasks. Based on particular memory bank constructions and pseudo label strategies, TCL then penalizes cross-domain intra-class domain discrepancy between source and target through a clean and novel contrastive loss. The free lunch is, thanks to the incorporation of contrastive learning, TCL relies on a moving-averaged key encoder that naturally achieves a temporally ensembled version of pseudo labels for target data, which avoids pseudo label error propagation at no extra cost. TCL therefore efficiently reduces cross-domain gaps. Through extensive experiments on benchmarks (Office-Home, VisDA-2017, Digits-five, PACS and DomainNet) for both single-source and multi-source domain adaptation tasks, TCL has demonstrated state-of-the-art performances.

【3】 DeepDiffusion: Unsupervised Learning of Retrieval-adapted Representations via Diffusion-based Ranking on Latent Feature Manifold 标题:深度扩散:基于潜在特征流形扩散排序的检索自适应表示的无监督学习 链接:https://arxiv.org/abs/2112.07082

作者:Takahiko Furuya,Ryutarou Ohbuchi 机构:University of Yamanashi, Takeda, Kofu-shi, Yamanashi-ken, Japan 备注:Currently under review 摘要:对于分析大量没有语义标签的多媒体数据而言,特征表示的无监督学习是一个具有挑战性且重要的问题。最近提出的基于神经网络的无监督学习方法已能成功获得适合多媒体数据分类的特征。然而,针对多媒体数据基于内容的匹配、比较或检索而适配的特征表示的无监督学习尚未得到充分探索。为了获得这种适配检索的特征,我们引入了将特征流形上的扩散距离与基于神经网络的无监督特征学习相结合的思想。这一思想被实现为一种称为DeepDiffusion(DD)的新算法。DD同时优化两个组件:由深度神经网络得到的特征嵌入,以及利用潜在特征流形上的扩散的距离度量。DD依赖于其损失函数,而不依赖于特定的编码器架构,因此可以配合各自的编码器架构应用于不同类型的多媒体数据。使用3D形状和2D图像进行的实验评估表明了DD算法的通用性和高精度。代码见:https://github.com/takahikof/DeepDiffusion 摘要:Unsupervised learning of feature representations is a challenging yet important problem for analyzing a large collection of multimedia data that do not have semantic labels. Recently proposed neural network-based unsupervised learning approaches have succeeded in obtaining features appropriate for classification of multimedia data. However, unsupervised learning of feature representations adapted to content-based matching, comparison, or retrieval of multimedia data has not been explored well. To obtain such retrieval-adapted features, we introduce the idea of combining diffusion distance on a feature manifold with neural network-based unsupervised feature learning. This idea is realized as a novel algorithm called DeepDiffusion (DD). DD simultaneously optimizes two components, a feature embedding by a deep neural network and a distance metric that leverages diffusion on a latent feature manifold, together. DD relies on its loss function but not encoder architecture. It can thus be applied to diverse multimedia data types with their respective encoder architectures. Experimental evaluation using 3D shapes and 2D images demonstrates versatility as well as high accuracy of the DD algorithm. Code is available at https://github.com/takahikof/DeepDiffusion

半弱无监督|主动学习|不确定性(4篇)

【1】 Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking 标题:具有自监督学习的多模态感知注意网络在视听说话人跟踪中的应用 链接:https://arxiv.org/abs/2112.07423

作者:Yidi Li,Hong Liu,Hao Tang 机构:Key Laboratory of Machine Perception, Peking University, Shenzhen Graduate School, China, Computer Vision Lab, ETH Zurich, Switzerland 备注:Accepted by AAAI2022 摘要:多模态融合被证明是提高说话人跟踪精度和鲁棒性的有效方法,特别是在复杂场景下。然而,如何结合异构信息,利用多模态信号的互补性仍然是一个具有挑战性的问题。在本文中,我们提出了一种新的多模态感知跟踪器(MPT),用于同时使用音频和视频模式的说话人跟踪。具体而言,首先构建了一种基于时空全局相干场(stGCF)的异质信号融合声学图,该图采用摄像机模型将音频线索映射到与视觉线索一致的定位空间。然后引入多模态感知注意网络,推导出感知权重,用于衡量受噪声干扰的间歇音频和视频流的可靠性和有效性。此外,提出了一种独特的跨模态自监督学习方法,利用不同模态之间的互补性和一致性,对视听观测的可信度进行建模。实验结果表明,该算法对标准数据集和遮挡数据集的跟踪准确率分别达到98.6%和78.3%,在不利条件下表现出良好的鲁棒性,优于现有的先进方法。 摘要:Multi-modal fusion is proven to be an effective method to improve the accuracy and robustness of speaker tracking, especially in complex scenarios. However, how to combine the heterogeneous information and exploit the complementarity of multi-modal signals remains a challenging issue. In this paper, we propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities. Specifically, a novel acoustic map based on spatial-temporal Global Coherence Field (stGCF) is first constructed for heterogeneous signal fusion, which employs a camera model to map audio cues to the localization space consistent with the visual cues. Then a multi-modal perception attention network is introduced to derive the perception weights that measure the reliability and effectiveness of intermittent audio and video streams disturbed by noise. Moreover, a unique cross-modal self-supervised learning method is presented to model the confidence of audio and visual observations by leveraging the complementarity and consistency between different modalities. Experimental results show that the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively, which demonstrates its robustness under adverse conditions and outperforms the current state-of-the-art methods.

【2】 Weakly Supervised High-Fidelity Clothing Model Generation 标题:弱监督高保真服装模型生成 链接:https://arxiv.org/abs/2112.07200

作者:Ruili Feng,Cheng Ma,Chengji Shen,Xin Gao,Zhenjiang Liu,Xiaobo Li,Kairi Ou,Zhengjun Zha 机构:University of Science and Technology of China,Zhejiang University,Alibaba Group 摘要:网络经济的发展引发了在产品服装上生成模特形象、展示新服装和促进销售的需求。然而，昂贵的专有模特图像对该场景中现有的图像虚拟试穿方法构成挑战，因为这些方法大多需要在大量模特图像及配对服装图像上进行训练。在本文中，我们提出了一种廉价且可扩展的弱监督方法，称为深度生成投影（DGP），以解决这一特定场景。该方法的核心是模仿人类预测穿着效果的过程，这是一种基于生活经验的无监督想象，而不是从监督中学习的计算规则。这里使用一个预训练的StyleGAN来捕捉穿着效果的实际经验。实验表明，将衣服和身体的粗略对齐投影到StyleGAN空间可以产生逼真的穿着效果。在真实场景专有模特图像上的实验表明，在生成服装模特图像时，DGP优于几种最先进的监督方法。 摘要:The development of online economics arouses the demand of generating images of models on product clothes, to display new clothes and promote sales. However, the expensive proprietary model images challenge the existing image virtual try-on methods in this scenario, as most of them need to be trained on considerable amounts of model images accompanied with paired clothes images. In this paper, we propose a cheap yet scalable weakly-supervised method called Deep Generative Projection (DGP) to address this specific scenario. Lying in the heart of the proposed method is to imitate the process of human predicting the wearing effect, which is an unsupervised imagination based on life experience rather than computation rules learned from supervisions. Here a pretrained StyleGAN is used to capture the practical experience of wearing. Experiments show that projecting the rough alignment of clothing and body onto the StyleGAN space can yield photo-realistic wearing results. Experiments on real scene proprietary model images demonstrate the superiority of DGP over several state-of-the-art supervised methods when generating clothing model images.

【3】 ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses 标题:ElePose:基于摄像机仰角预测和二维姿态归一化流学习的无监督三维人体姿态估计 链接:https://arxiv.org/abs/2112.07088

作者:Bastian Wandt,James J. Little,Helge Rhodin 机构:University of British Columbia 摘要:单幅图像的人体姿态估计是一个具有挑战性的问题，通常通过监督学习来解决。不幸的是，由于3D注释需要专用的运动捕捉系统，许多人类活动还不存在带标签的训练数据。因此，我们提出了一种无监督的方法，该方法学习从单幅图像预测3D人体姿态，同时仅使用2D姿态数据进行训练，这类数据可以众包获取，并且已经广泛可用。为此，我们估计在随机投影下最有可能的3D姿态，其似然由2D姿态上的归一化流估计。虽然之前的工作需要对训练数据集中的摄像机旋转施加强先验，但我们学习了摄像机角度的分布，这显著提升了性能。我们贡献的另一部分是，先将2D姿态投影到线性子空间，从而稳定高维3D姿态数据上归一化流的训练。在基准数据集Human3.6M和MPI-INF-3DHP上，我们在许多指标上都优于最先进的无监督人体姿态估计方法。 摘要:Human pose estimation from single images is a challenging problem that is typically solved by supervised learning. Unfortunately, labeled training data does not yet exist for many human activities since 3D annotation requires dedicated motion capture systems. Therefore, we propose an unsupervised approach that learns to predict a 3D human pose from a single image while only being trained with 2D pose data, which can be crowd-sourced and is already widely available. To this end, we estimate the 3D pose that is most likely over random projections, with the likelihood estimated using normalizing flows on 2D poses. While previous work requires strong priors on camera rotations in the training data set, we learn the distribution of camera angles which significantly improves the performance. Another part of our contribution is to stabilize training with normalizing flows on high-dimensional 3D pose data by first projecting the 2D poses to a linear subspace. We outperform the state-of-the-art unsupervised human pose estimation methods on the benchmark datasets Human3.6M and MPI-INF-3DHP in many metrics.

【4】 Stochastic Planner-Actor-Critic for Unsupervised Deformable Image Registration 标题:无监督变形图像配准的随机规划者-执行者-批评者 链接:https://arxiv.org/abs/2112.07415

作者:Ziwei Luo,Jing Hu,Xin Wang,Shu Hu,Bin Kong,Youbing Yin,Qi Song,Xi Wu,Siwei Lyu 机构: Chengdu University of Information Technology, China, Keya Medical, Seattle, USA, University at Buffalo, SUNY, USA 备注:Accepted by AAAI'22 摘要:不同形状和非线性形状变化引起的器官大变形对医学图像配准提出了重大挑战。传统的配准方法需要通过特定的变形模型迭代优化目标函数,并进行细致的参数调整,但在配准大变形图像时能力有限。虽然基于深度学习的方法可以学习从输入图像到其各自变形场的复杂映射,但它是基于回归的,并且容易陷入局部极小值,特别是当涉及大变形时。为此,我们提出了一种新的基于强化学习的框架,即随机规划-演员-评论家(SPAC),该框架执行逐步注册。关键概念是按每个时间步连续扭曲运动图像,以最终与固定图像对齐。考虑到在传统强化学习(RL)框架中处理高维连续动作和状态空间具有挑战性,我们在标准演员-评论家模型中引入了一个新概念“计划”,该概念具有低维性,可以帮助演员生成可处理的高维动作。整个框架以无监督训练为基础,以端到端的方式运作。我们在几个二维和三维医学图像数据集上评估了我们的方法,其中一些包含大变形。我们的实证结果突出表明,我们的工作取得了一致、显著的收益,并且优于最先进的方法。 摘要:Large deformations of organs, caused by diverse shapes and nonlinear shape changes, pose a significant challenge for medical image registration. Traditional registration methods need to iteratively optimize an objective function via a specific deformation model along with meticulous parameter tuning, but which have limited capabilities in registering images with large deformations. While deep learning-based methods can learn the complex mapping from input images to their respective deformation field, it is regression-based and is prone to be stuck at local minima, particularly when large deformations are involved. To this end, we present Stochastic Planner-Actor-Critic (SPAC), a novel reinforcement learning-based framework that performs step-wise registration. The key notion is warping a moving image successively by each time step to finally align to a fixed image. Considering that it is challenging to handle high dimensional continuous action and state spaces in the conventional reinforcement learning (RL) framework, we introduce a new concept `Plan' to the standard Actor-Critic model, which is of low dimension and can facilitate the actor to generate a tractable high dimensional action. The entire framework is based on unsupervised training and operates in an end-to-end manner. We evaluate our method on several 2D and 3D medical image datasets, some of which contain large deformations. Our empirical results highlight that our work achieves consistent, significant gains and outperforms state-of-the-art methods.

时序|行为识别|姿态|视频|运动估计(5篇)

【1】 CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising 标题:Coco-BERT:用对比跨模态匹配和去噪改进视频语言预训练 链接:https://arxiv.org/abs/2112.07515

作者:Jianjie Luo,Yehao Li,Yingwei Pan,Ting Yao,Hongyang Chao,Tao Mei 机构:★ School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, ♣ The Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of, Education, Guangzhou, China, ♠ JD AI Research, Beijing, China 备注:ACM Multimedia 2021 摘要:BERT式结构导致了视觉语言预训练的革命,并在众多视觉语言下游任务中取得了最新成果。现有解决方案主要利用带有掩码令牌的多模态输入来触发基于掩码的代理预训练任务(例如,掩码语言建模和掩码对象/帧预测)。在这项工作中,我们认为这样的蒙蔽输入将不可避免地在跨模式匹配代理任务中引入噪声,从而使固有的视觉语言关联未得到充分的探索。作为替代方案,我们推导了一种用于视频语言预训练的特定形式的跨模态代理目标,即对比跨模态匹配和去噪(CoCo)。通过将屏蔽帧/词序列视为主要未屏蔽帧/词序列的噪声增强,CoCo通过在屏蔽和未屏蔽输入之间以对比方式同时进行模态间匹配和模态内去噪来增强视频语言关联。我们的CoCo代理目标可以进一步集成到任何用于视频语言预训练的BERT类型编解码器结构中,称为对比跨模态BERT(CoCo BERT)。我们在电视数据集和新收集的大规模GIF视频数据集(ACTION)上预先训练CoCo BERT。通过广泛的下游任务(如跨模式检索、视频问答和视频字幕)的大量实验,我们证明了CoCo-BERT作为预训练结构的优越性。 摘要:BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs would inevitably introduce noise for cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as the noisy augmentation of primary unmasked ones, CoCo strengthens video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can be further integrated into any BERT-type encoder-decoder structure for video-language pre-training, named as Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.

【2】 I M Avatar: Implicit Morphable Head Avatars from Videos 标题:I M阿凡达:视频中隐含的可变形头像 链接:https://arxiv.org/abs/2112.07471

作者:Yufeng Zheng,Victoria Fernández Abrevaya,Xu Chen,Marcel C. Bühler,Michael J. Black,Otmar Hilliges 机构:Victoria Fern´andez Abrevaya, Marcel C. B¨uhler, ETH Z¨urich, Max Planck Institute for Intelligent Systems, T¨ubingen, Max Planck ETH Center for Learning Systems 备注:Submitted to CVPR 2022 摘要:传统的可变形人脸模型提供了对表情的细粒度控制,但无法轻松捕获几何和外观细节。神经体积表示法接近照片真实感,但很难设置动画,也不能很好地推广到看不见的表达式。为了解决这个问题,我们提出了IMavatar(隐式变形化身),一种从单目视频学习隐式头部化身的新方法。受传统3DMMs提供的细粒度控制机制的启发,我们通过学习混合形状和蒙皮场来表示与表情和姿势相关的变形。这些属性与姿势无关,可用于变形给定新表达式和姿势参数的规范几何体和纹理场。我们使用光线追踪和迭代寻根来定位每个像素的标准曲面交点。一个关键贡献是我们新颖的分析梯度公式,它支持从视频中对Imavatar进行端到端的训练。我们从数量和质量上证明,与最先进的方法相比,我们的方法改进了几何,覆盖了更完整的表达空间。 摘要:Traditional morphable face models provide fine-grained control over expression but cannot easily capture geometric and appearance details. Neural volumetric representations approach photo-realism but are hard to animate and do not generalize well to unseen expressions. To tackle this problem, we propose IMavatar (Implicit Morphable avatar), a novel method for learning implicit head avatars from monocular videos. Inspired by the fine-grained control mechanisms afforded by conventional 3DMMs, we represent the expression- and pose-related deformations via learned blendshapes and skinning fields. These attributes are pose-independent and can be used to morph the canonical geometry and texture fields given novel expression and pose parameters. We employ ray tracing and iterative root-finding to locate the canonical surface intersection for each pixel. A key contribution is our novel analytical gradient formulation that enables end-to-end training of IMavatars from videos. We show quantitatively and qualitatively that our method improves geometry and covers a more complete expression space compared to state-of-the-art methods.

【3】 OMAD: Object Model with Articulated Deformations for Pose Estimation and Retrieval 标题:OMAD:用于位姿估计和检索的关节变形对象模型 链接:https://arxiv.org/abs/2112.07334

作者:Han Xue,Liu Liu,Wenqiang Xu,Haoyuan Fu,Cewu Lu 机构:Shanghai Jiao Tong University, Shanghai, CN 摘要:铰接物体在日常生活中无处不在。然而,由于固有的高自由度结构,关节对象的关节状态难以估计。建立铰接物体模型时,需要考虑两种形状变形,即几何变形和姿态变形。在这项工作中,我们提出了一种新的特定于类别的参数化表示,称为带关节变形的对象模型(OMAD),以显式地建模关节对象。在OMAD中,类别与具有共享形状基的线性形状函数和非线性关节函数相关联。这两个函数都可以从大规模对象模型数据集中学习,并固定为特定于类别的优先级。然后,我们提出了一个OMADNet来预测形状参数和关节状态从一个对象的单一观测。通过对物体形状和关节状态的完整表示,我们可以处理多个任务,包括类别级物体姿态估计和关节式物体检索。为了评估这些任务,我们创建了一个基于PartNet Mobility的合成数据集。大量实验表明,我们的简单OMADNet可以作为这两项任务的强大基线。 摘要:Articulated objects are pervasive in daily life. However, due to the intrinsic high-DoF structure, the joint states of the articulated objects are hard to be estimated. To model articulated objects, two kinds of shape deformations namely the geometric and the pose deformation should be considered. In this work, we present a novel category-specific parametric representation called Object Model with Articulated Deformations (OMAD) to explicitly model the articulated objects. In OMAD, a category is associated with a linear shape function with shared shape basis and a non-linear joint function. Both functions can be learned from a large-scale object model dataset and fixed as category-specific priors. Then we propose an OMADNet to predict the shape parameters and the joint states from an object's single observation. With the full representation of the object shape and joint states, we can address several tasks including category-level object pose estimation and the articulated object retrieval. To evaluate these tasks, we create a synthetic dataset based on PartNet-Mobility. Extensive experiments show that our simple OMADNet can serve as a strong baseline for both tasks.

【4】 A real-time spatiotemporal AI model analyzes skill in open surgical videos 标题:一种实时时空人工智能模型在开放式手术视频中的技能分析 链接:https://arxiv.org/abs/2112.07219

作者:Emmett D. Goodman,Krishna K. Patel,Yilun Zhang,William Locke,Chris J. Kennedy,Rohan Mehrotra,Stephen Ren,Melody Guan,Maren Downing,Hao Wei Chen,Jevin Z. Clark,Gabriel A. Brat,Serena Yeung 机构:Department of Computer Science, Stanford University, Stanford, CA, USA., Department of Biomedical Data Science, Stanford University, Stanford, CA, USA, Department of Electrical Engineering, Stanford University, Stanford, CA, USA 备注:22 pages, 4 main text figures, 7 extended data figures, 4 extended data tables 摘要:开放式手术是全世界外科手术的主要形式。人工智能（AI）有可能优化手术实践并改善患者预后，但相关工作主要集中在微创技术上。我们的工作通过在YouTube上整理迄今为止最大的开放手术视频数据集，克服了现有人工智能模型训练数据的局限性：来自50个国家上传、涵盖23种手术类型的1997段手术视频。利用这个数据集，我们开发了一个多任务人工智能模型，能够实时理解手术行为、手和工具——它们是手术流程和外科医生技能的构成要素。我们证明了我们的模型可以推广到不同的手术类型和环境中。为了说明这种泛化能力，我们直接应用YouTube训练的模型分析了在学术医疗中心前瞻性收集的开放手术，并确定了与手部运动效率相关的手术技能运动学描述符。我们的开放手术注释视频（AVOS）数据集和训练好的模型将被公开，以用于外科人工智能的进一步开发。 摘要:Open procedures represent the dominant form of surgery worldwide. Artificial intelligence (AI) has the potential to optimize surgical practice and improve patient outcomes, but efforts have focused primarily on minimally invasive techniques. Our work overcomes existing data limitations for training AI models by curating, from YouTube, the largest dataset of open surgical videos to date: 1997 videos from 23 surgical procedures uploaded from 50 countries. Using this dataset, we developed a multi-task AI model capable of real-time understanding of surgical behaviors, hands, and tools - the building blocks of procedural flow and surgeon skill. We show that our model generalizes across diverse surgery types and environments. Illustrating this generalizability, we directly applied our YouTube-trained model to analyze open surgeries prospectively collected at an academic medical center and identified kinematic descriptors of surgical skill related to efficiency of hand motion. Our Annotated Videos of Open Surgery (AVOS) dataset and trained model will be made available for further development of surgical AI.

【5】 Event-guided Deblurring of Unknown Exposure Time Videos 标题:未知曝光时间视频的事件引导去模糊 链接:https://arxiv.org/abs/2112.06988

作者:Taewoo Kim,Jungmin Lee,Lin Wang,Kuk-Jin Yoon 机构:Visual Intelligence Lab., KAIST, Korea 备注:Under review 摘要:由于模糊退化过程中运动信息的丢失，视频去模糊是一个高度不适定的问题。由于事件相机可以以高时间分辨率捕捉视运动，已有若干工作探索了利用事件来引导视频去模糊的潜力。这些方法通常假设曝光时间等于视频帧率的倒数。然而，在真实情况下并非如此，曝光时间可能是未知的，并且会根据视频拍摄环境（例如光照条件）动态变化。在本文中，我们研究在基于帧的相机曝光时间动态变化且未知的设定下的事件引导视频去模糊问题。为此，我们首先通过考虑视频帧采集过程中的曝光时间和读出时间，推导出一种新的事件引导视频去模糊公式。然后，我们提出了一个新的端到端学习框架用于事件引导视频去模糊。特别地，我们设计了一个新的基于曝光时间的事件选择（ETES）模块，通过估计模糊帧特征与事件之间的跨模态相关性来有选择地使用事件特征。此外，我们还提出了一个特征融合模块来有效融合来自事件和模糊帧的特征。我们在各种数据集上进行了大量实验，证明我们的方法达到了最先进的性能。我们的项目代码和预训练模型将会公开。 摘要:Video deblurring is a highly ill-posed problem due to the loss of motion information in the blur degradation process. Since event cameras can capture apparent motion with a high temporal resolution, several attempts have explored the potential of events for guiding video deblurring. These methods generally assume that the exposure time is the same as the reciprocal of the video frame rate. However, this is not true in real situations, and the exposure time might be unknown and dynamically varies depending on the video shooting environment (e.g., illumination condition). In this paper, we address the event-guided video deblurring assuming dynamically variable unknown exposure time of the frame-based camera. To this end, we first derive a new formulation for event-guided video deblurring by considering the exposure and readout time in the video frame acquisition process. We then propose a novel end-to-end learning framework for event-guided video deblurring. In particular, we design a novel Exposure Time-based Event Selection (ETES) module to selectively use event features by estimating the cross-modal correlation between the features from blurred frames and the events. Moreover, we propose a feature fusion module to effectively fuse the selected features from events and blur frames. We conduct extensive experiments on various datasets and demonstrate that our method achieves state-of-the-art performance. Our project code and pretrained models will be available.

医学相关(1篇)

【1】 The Brain Tumor Sequence Registration Challenge: Establishing Correspondence between Pre-Operative and Follow-up MRI scans of diffuse glioma patients 标题:脑肿瘤序列配准挑战:建立弥漫性胶质瘤患者术前和随访MRI扫描之间的对应关系 链接:https://arxiv.org/abs/2112.06979

作者:Bhakti Baheti,Diana Waldmannstetter,Satrajit Chakrabarty,Hamed Akbari,Michel Bilello,Benedikt Wiestler,Julian Schwarting,Evan Calabrese,Jeffrey Rudie,Syed Abidi,Mina Mousa,Javier Villanueva-Meyer,Daniel S. Marcus,Christos Davatzikos,Aristeidis Sotiras,Bjoern Menze,Spyridon Bakas 机构:∗University of Pennsylvania ; †University of Zurich ; ‡Technical University of Munich ; §Washington University in Saint Louis ; ∗∗University of California San Francisco; ∥Equally contributing first authors 摘要:由于组织外观的改变，包含病变的纵向脑磁共振成像（MRI）扫描的配准具有挑战性，至今仍是一个未解决的问题。本文描述了第一届脑肿瘤序列配准（BraTS-Reg）挑战，重点是估计同一名被诊断为脑弥漫性胶质瘤患者的术前和随访扫描之间的对应关系。BraTS-Reg挑战旨在为可变形配准算法建立一个公共基准环境。相关数据集包括去标识化的多机构多参数MRI（mpMRI）数据，并根据通用解剖模板对每次扫描的尺寸和分辨率进行了整理。临床专家已经在扫描中生成了大量的标志点注释，用于描述跨时间域的不同解剖位置。训练数据以及这些真值注释将发布给参与者，用于设计和开发他们的配准算法，而验证和测试数据的注释将由组织者保留，并用于评估参与者的容器化算法。每个提交的算法都将使用若干指标进行定量评估，如中值绝对误差（MAE）、稳健性和雅可比行列式。 摘要:Registration of longitudinal brain Magnetic Resonance Imaging (MRI) scans containing pathologies is challenging due to tissue appearance changes, and still an unsolved problem. This paper describes the first Brain Tumor Sequence Registration (BraTS-Reg) challenge, focusing on estimating correspondences between pre-operative and follow-up scans of the same patient diagnosed with a brain diffuse glioma. The BraTS-Reg challenge intends to establish a public benchmark environment for deformable registration algorithms. The associated dataset comprises de-identified multi-institutional multi-parametric MRI (mpMRI) data, curated for each scan's size and resolution, according to a common anatomical template. Clinical experts have generated extensive annotations of landmarks points within the scans, descriptive of distinct anatomical locations across the temporal domain. The training data along with these ground truth annotations will be released to participants to design and develop their registration algorithms, whereas the annotations for the validation and the testing data will be withheld by the organizers and used to evaluate the containerized algorithms of the participants. Each submitted algorithm will be quantitatively evaluated using several metrics, such as the Median Absolute Error (MAE), Robustness, and the Jacobian determinant.
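文中提到的中值绝对误差（MAE）与稳健性等指标，大致可以按如下方式计算（示意代码，并非官方评测脚本；坐标单位与函数名均为假设）。

```python
import numpy as np

def landmark_mae(pred_pts, gt_pts):
    """pred_pts / gt_pts: [K, 3]，配准后预测的标志点与真值标志点坐标(假设单位为mm)。"""
    errors = np.linalg.norm(pred_pts - gt_pts, axis=1)   # 每个标志点的欧氏误差
    return float(np.median(errors))                      # 中值绝对误差(MAE)

def robustness(pred_pts, baseline_pts, gt_pts):
    """稳健性示意：配准后误差比配准前(baseline)减小的标志点比例。"""
    err_after = np.linalg.norm(pred_pts - gt_pts, axis=1)
    err_before = np.linalg.norm(baseline_pts - gt_pts, axis=1)
    return float((err_after < err_before).mean())
```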

GAN|对抗|攻击|生成相关(2篇)

【1】 Handwritten text generation and strikethrough characters augmentation 标题:手写文本生成和删除线字符增强 链接:https://arxiv.org/abs/2112.07395

作者:Alex Shonenkov,Denis Karachev,Max Novopoltsev,Mark Potanin,Denis Dimitrov,Andrey Chertok 机构:a,cSBER AI, Moscow, Russia, OCRV, Sochi, Russia, SBER AI, MIPT, Moscow, Russia, SBER AI, Lomonosov MSU, Moscow, Russia, SBER AI, AIRI, Moscow, Russia 备注:16 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2108.11667 摘要:我们介绍了两种数据增强技术,这两种技术与Resnet BiLSTM CTC网络一起使用,大大降低了手写文本识别(HTR)任务的字错误率(WER)和字符错误率(CER),其结果超出了最佳报告结果。我们采用了一种新的模拟删除线文本(手写污点)的增强方法和一种基于打印文本的手写文本生成方法(StackMix),这两种方法在HTR任务中被证明是非常有效的。StackMix使用弱监督框架来获取角色边界。由于这些数据增强技术独立于所使用的网络,因此它们也可用于增强其他网络的性能和HTR方法。对10个手写文本数据集的大量实验表明,手写Blots增强和StackMix显著提高了HTR模型的质量 摘要:We introduce two data augmentation techniques, which, used with a Resnet-BiLSTM-CTC network, significantly reduce Word Error Rate (WER) and Character Error Rate (CER) beyond best-reported results on handwriting text recognition (HTR) tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix), which proved to be very effective in HTR tasks. StackMix uses weakly-supervised framework to get character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and approaches to HTR. Extensive experiments on ten handwritten text datasets show that HandWritten Blots augmentation and StackMix significantly improve the quality of HTR models
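作为"删除线/手写污点"类增强思路的一个简化示意（并非论文原始实现），可以在文本行图像上叠加一条带抖动的随机笔划；线宽与端点位置等参数均为假设值。

```python
import numpy as np
import cv2

def add_strikethrough(img, thickness=3):
    """img: 灰度文本行图像 (H, W)，uint8；返回叠加了一条随机删除线的副本。"""
    h, w = img.shape
    out = img.copy()
    y0, y1 = np.random.randint(h // 4, 3 * h // 4, size=2)       # 线条两端的纵坐标
    xs = np.linspace(0, w - 1, 20)
    ys = np.linspace(y0, y1, 20) + np.random.randn(20) * 1.5     # 加抖动模拟手写笔划
    pts = np.stack([xs, ys], axis=1).astype(np.int32).reshape(-1, 1, 2)
    cv2.polylines(out, [pts], isClosed=False, color=0, thickness=thickness)
    return out
```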

【2】 Learning Body-Aware 3D Shape Generative Models 标题:学习体感三维形状生成模型 链接:https://arxiv.org/abs/2112.07022

作者:Bryce Blinn,Alexander Ding,Daniel Ritchie,R. Kenny Jones,Srinath Sridhar,Manolis Savva 机构:Brown University, Providence, RI , Simon Fraser University, Burnaby, BC V,A ,S 备注:11 pages, 8 figures 摘要:建筑环境中许多物体的形状是由它们与人体的关系决定的:一个人将如何与这个物体互动?现有的数据驱动的三维形状生成模型生成看似合理的对象,但无法解释这些对象与人体的关系。在本文中,我们学习三维形状的身体感知生成模型。具体来说,我们训练椅子的生成模型,这是一个普遍存在的形状类别,可以根据给定的身体形状或坐姿进行调节。体形调节模型生产的椅子对于具有给定体形的人来说是舒适的;姿势调节模型生成可容纳给定坐姿的椅子。为了训练这些模型,我们定义了一个“坐姿匹配”度量和一个新的“坐姿舒适度”度量。计算这些指标需要一个昂贵的优化,让身体坐在椅子上,这太慢了,无法用作训练生成模型的损失函数。因此,我们训练神经网络来有效地逼近这些度量。我们使用我们的方法来训练三个身体感知的生成形状模型:基于结构化零件的生成器、点云生成器和隐式曲面生成器。在所有情况下,我们的方法都会生成模型,使其输出座椅形状适应输入人体规格。 摘要:The shape of many objects in the built environment is dictated by their relationships to the human body: how will a person interact with this object? Existing data-driven generative models of 3D shapes produce plausible objects but do not reason about the relationship of those objects to the human body. In this paper, we learn body-aware generative models of 3D shapes. Specifically, we train generative models of chairs, an ubiquitous shape category, which can be conditioned on a given body shape or sitting pose. The body-shape-conditioned models produce chairs which will be comfortable for a person with the given body shape; the pose-conditioned models produce chairs which accommodate the given sitting pose. To train these models, we define a "sitting pose matching" metric and a novel "sitting comfort" metric. Calculating these metrics requires an expensive optimization to sit the body into the chair, which is too slow to be used as a loss function for training a generative model. Thus, we train neural networks to efficiently approximate these metrics. We use our approach to train three body-aware generative shape models: a structured part-based generator, a point cloud generator, and an implicit surface generator. In all cases, our approach produces models which adapt their output chair shapes to input human body specifications.

Attention注意力(3篇)

【1】 Multi-Modal Temporal Attention Models for Crop Mapping from Satellite Time Series 标题:卫星时间序列作物测绘的多模态时间注意模型 链接:https://arxiv.org/abs/2112.07558

作者:Vivien Sainte Fare Garnot,Loic Landrieu,Nesrine Chehata 机构:LASTIG, ENSG, IGN, Univ Gustave Eiffel, Saint-Mande,France, EA G&E Bordeaux INP, Univ Bordeaux Montaigne, Bordeaux 备注:Under review 摘要:光学和雷达卫星时间序列是协同的:光学图像包含丰富的光谱信息,而C波段雷达捕获有用的几何信息,并且不受云层的影响。最近,基于时间注意的方法在多个作物制图任务中取得了成功,因此我们建议研究这些模型如何适用于多种模式。我们实施并评估了多种融合方案,包括一种新的方法和对训练过程的简单调整,在不增加复杂度的情况下显著提高了性能和效率。我们表明,大多数融合方案都有优点和缺点,使得它们与特定设置相关。然后,我们评估了多模式在多个任务中的优势:地块分类、基于像素的分割和全景地块分割。我们表明,通过利用光学和雷达时间序列,基于多模态时间注意的模型在性能和对云层覆盖的恢复能力方面可以超过单模态模型。为了进行这些实验,我们用空间对齐的雷达图像时间序列扩充了PASTIS数据集。由此产生的数据集PASTIS-R构成了第一个具有语义和实例注释的大规模、多模式和开放存取卫星时间序列数据集。 摘要:Optical and radar satellite time series are synergetic: optical images contain rich spectral information, while C-band radar captures useful geometrical information and is immune to cloud cover. Motivated by the recent success of temporal attention-based methods across multiple crop mapping tasks, we propose to investigate how these models can be adapted to operate on several modalities. We implement and evaluate multiple fusion schemes, including a novel approach and simple adjustments to the training procedure, significantly improving performance and efficiency with little added complexity. We show that most fusion schemes have advantages and drawbacks, making them relevant for specific settings. We then evaluate the benefit of multimodality across several tasks: parcel classification, pixel-based segmentation, and panoptic parcel segmentation. We show that by leveraging both optical and radar time series, multimodal temporal attention-based models can outmatch single-modality models in terms of performance and resilience to cloud cover. To conduct these experiments, we augment the PASTIS dataset with spatially aligned radar image time series. The resulting dataset, PASTIS-R, constitutes the first large-scale, multimodal, and open-access satellite time series dataset with semantic and instance annotations.

【2】 TRACER: Extreme Attention Guided Salient Object Tracing Network 标题:Tracer:极端注意引导的显著目标跟踪网络 链接:https://arxiv.org/abs/2112.07380

作者:Min Seok Lee,WooSeok Shin,Sung Won Han 机构: School of Industrial and Management Engineering, Korea University 备注:AAAI 2022, SA poster session accepted paper 摘要:现有的显著目标检测研究主要集中在提取具有边缘信息的不同目标和聚集多层次特征以提高显著目标检测的性能。为了获得令人满意的性能,该方法采用了细化的边缘信息和较低的多级差异。然而,性能增益和计算效率都无法达到,这促使我们研究现有编码器-解码器结构的低效性,以避免这种权衡。我们提出了跟踪器,它通过合并注意引导跟踪模块来检测具有显式边缘的显著对象。我们在第一个编码器的末端使用一个屏蔽边缘注意模块,使用快速傅里叶变换将细化的边缘信息传播到下游特征提取。在多级聚合阶段,联合注意模块识别互补通道和重要空间信息。为了提高解码器的性能和计算效率,我们使用对象注意模块最小化解码器块的使用。该模块从细化的通道和空间表示中提取未检测到的对象和边缘信息。随后,我们提出了一种自适应像素强度损失函数来处理相对重要的像素,而传统的损失函数对所有像素都一视同仁。与13种现有方法的比较表明,TRACER在五个基准数据集上实现了最先进的性能。特别是,TRACER-Efficient3(TE3)优于LDF,LDF是一种现有的方法,同时需要的学习参数和时间更少1.8倍;TE3快5倍。 摘要:Existing studies on salient object detection (SOD) focus on extracting distinct objects with edge information and aggregating multi-level features to improve SOD performance. To achieve satisfactory performance, the methods employ refined edge information and low multi-level discrepancy. However, both performance gain and computational efficiency cannot be attained, which has motivated us to study the inefficiencies in existing encoder-decoder structures to avoid this trade-off. We propose TRACER, which detects salient objects with explicit edges by incorporating attention guided tracing modules. We employ a masked edge attention module at the end of the first encoder using a fast Fourier transform to propagate the refined edge information to the downstream feature extraction. In the multi-level aggregation phase, the union attention module identifies the complementary channel and important spatial information. To improve the decoder performance and computational efficiency, we minimize the decoder block usage with object attention module. This module extracts undetected objects and edge information from refined channels and spatial representations. Subsequently, we propose an adaptive pixel intensity loss function to deal with the relatively important pixels unlike conventional loss functions which treat all pixels equally. A comparison with 13 existing methods reveals that TRACER achieves state-of-the-art performance on five benchmark datasets. In particular, TRACER-Efficient3 (TE3) outperforms LDF, an existing method while requiring 1.8x fewer learning parameters and less time; TE3 is 5x faster.
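关于"自适应像素强度损失"，下面给出一个概念性示意（并非论文原式）：用真值显著图的局部均值与自身的偏差近似每个像素的重要性（边缘附近权重更高），再对BCE逐像素加权；核大小k与放大系数lam均为假设值。

```python
import torch
import torch.nn.functional as F

def adaptive_pixel_bce(pred_logits, gt, k=15, lam=4.0):
    """pred_logits: [B,1,H,W] 未过sigmoid的预测; gt: [B,1,H,W] 0/1显著图。"""
    local_mean = F.avg_pool2d(gt, kernel_size=k, stride=1, padding=k // 2)
    weight = 1.0 + lam * torch.abs(local_mean - gt)          # 偏差越大(越靠近边缘)权重越高
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt, reduction='none')
    return (weight * bce).sum() / weight.sum()
```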

【3】 Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering 标题:视觉问答中特征融合的双边跨模态图匹配关注点 链接:https://arxiv.org/abs/2112.07270

作者:JianJian Cao,Xiameng Qin,Sanyuan Zhao,Jianbing Shen 备注:pre-print, TNNLS, 12 pages 摘要:在视觉问答（VQA）任务中，根据图像回答语义复杂的问题具有挑战性。虽然深度学习可以很好地表示图像，但问题往往只是被简单地嵌入，难以充分表达其含义。此外，视觉特征和文本特征属于不同模态而存在差异，跨模态信息很难对齐和利用。本文针对这两个问题，提出了一种图匹配注意（GMA）网络。首先，它不仅为图像构建图，而且还根据句法和嵌入信息为问题构建图。接下来，我们通过一个双阶段图编码器来探索模态内的关系，然后提出一种双边跨模态图匹配注意来推断图像和问题之间的关系。更新后的跨模态特征随后被送入答案预测模块以进行最终的答案预测。实验表明，我们的网络在GQA数据集和VQA 2.0数据集上实现了最先进的性能。消融实验验证了我们GMA网络中每个模块的有效性。 摘要:Answering semantically-complicated questions according to an image is challenging in Visual Question Answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap for different modalities, it is difficult to align and utilize the cross-modality information. In this paper, we focus on these two problems and propose a Graph Matching Attention (GMA) network. Firstly, it not only builds graph for the image, but also constructs graph for the question in terms of both syntactic and embedding information. Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The ablation studies verify the effectiveness of each modules in our GMA network.

人脸|人群计数(1篇)

【1】 EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices 标题:EgoBody:来自头盔设备的人体形状、运动和社会互动 链接:https://arxiv.org/abs/2112.07642

作者:Siwei Zhang,Qianli Ma,Yan Zhang,Zhiyin Qian,Marc Pollefeys,Federica Bogo,Siyu Tang 机构:ETH Z¨urich, Microsoft 摘要:从第一人称视角理解社会互动对于许多应用至关重要,从辅助机器人到AR/VR。推理交互的第一步是理解人类的姿势和形状。然而,由于缺乏数据,这方面的研究目前受到阻碍。现有数据集在大小、注释、地面真相捕获模式或交互多样性方面都受到限制。我们通过提出EgoBody来解决这个缺点,EgoBody是一个用于复杂3D场景中社会交互的新型大规模数据集。我们使用Microsoft HoloLens2耳机记录丰富的以自我为中心的数据流(包括RGB、深度、眼睛注视、头部和手部跟踪)。为了获得准确的3D地面真实感,我们使用multi-Kinect装备校准耳机,并将富有表现力的SMPL-X身体网格拟合到多视图RGB-D帧,重建与场景相关的3D人体姿势和形状。我们收集了68个序列,跨越不同的社会学交互类别,并提出了第一个从自我中心的角度进行3D全身姿势和形状估计的基准。我们的数据集和代码将在https://sanweiliti.github.io/egobody/egobody.html. 摘要:Understanding social interactions from first-person views is crucial for many applications, ranging from assistive robotics to AR/VR. A first step for reasoning about interactions is to understand human pose and shape. However, research in this area is currently hindered by the lack of data. Existing datasets are limited in terms of either size, annotations, ground-truth capture modalities or the diversity of interactions. We address this shortcoming by proposing EgoBody, a novel large-scale dataset for social interactions in complex 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human poses and shapes relative to the scene. We collect 68 sequences, spanning diverse sociological interaction categories, and propose the first benchmark for 3D full-body pose and shape estimation from egocentric views. Our dataset and code will be available for research at https://sanweiliti.github.io/egobody/egobody.html.

裁剪|量化|加速|压缩相关(2篇)

【1】 SNF: Filter Pruning via Searching the Proper Number of Filters 标题:SNF:通过搜索合适的滤波器数量进行滤波器剪枝 链接:https://arxiv.org/abs/2112.07282

作者:Pengkun Liu,Yaru Yue,Yanjun Guo,Xingxiang Tao,Xiaoguang Zhou 机构:School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing, China. 摘要:卷积神经网络（CNN）存在大量的参数冗余，滤波器剪枝旨在去除冗余滤波器，为CNN在终端设备上的应用提供了可能。然而，以前的工作更多地关注于设计滤波器重要性的评价准则，然后以固定的剪枝率或固定的剪枝数量剪掉不太重要的滤波器以减少卷积神经网络的冗余，却没有考虑每层保留多少滤波器才是最合理的选择。从这个角度出发，我们提出了一种通过搜索合适滤波器数量（SNF）的新剪枝方法。SNF致力于为每一层搜索最合理的保留滤波器数量，然后按特定准则剪枝滤波器。它可以在不同的FLOPs约束下定制最合适的网络结构。使用我们的方法进行滤波器剪枝在CIFAR-10上达到了最先进（SOTA）的精度，并在ImageNet ILSVRC-2012上取得了有竞争力的性能。基于ResNet-56网络的SNF在CIFAR-10上、FLOPs减少52.94%的情况下实现了0.14%的Top-1精度提升。在CIFAR-10上剪枝ResNet-110时，在减少68.68%的FLOPs的同时还能提升0.03%的Top-1精度。对于ImageNet，我们将剪枝率设置为FLOPs减少52.10%，Top-1精度仅下降0.74%。代码可在以下网址获得：https://github.com/pk-l/SNF. 摘要:Convolutional Neural Network (CNN) has an amount of parameter redundancy, filter pruning aims to remove the redundant filters and provides the possibility for the application of CNN on terminal devices. However, previous works pay more attention to designing evaluation criteria of filter importance and then prune less important filters with a fixed pruning rate or a fixed number to reduce convolutional neural networks' redundancy. It does not consider how many filters to reserve for each layer is the most reasonable choice. From this perspective, we propose a new filter pruning method by searching the proper number of filters (SNF). SNF is dedicated to searching for the most reasonable number of reserved filters for each layer and then pruning filters with specific criteria. It can tailor the most suitable network structure at different FLOPs. Filter pruning with our method leads to the state-of-the-art (SOTA) accuracy on CIFAR-10 and achieves competitive performance on ImageNet ILSVRC-2012. SNF based on the ResNet-56 network achieves an increase of 0.14% in Top-1 accuracy at 52.94% FLOPs reduction on CIFAR-10. Pruning ResNet-110 on CIFAR-10 also improves the Top-1 accuracy of 0.03% when reducing 68.68% FLOPs. For ImageNet, we set the pruning rates as 52.10% FLOPs, and the Top-1 accuracy only has a drop of 0.74%. The codes can be available at https://github.com/pk-l/SNF.
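下面的Python片段演示了"按准则为某一层保留指定数量滤波器"的通用做法（以L1范数准则为例），仅用于说明流程：SNF真正的贡献在于搜索每层应保留的数量，这里将该数量直接作为输入假设给出，且未处理后续层输入通道的同步裁剪。

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, num_keep: int) -> nn.Conv2d:
    """按滤波器L1范数保留num_keep个输出通道，返回新的Conv2d（示意实现）。"""
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # 每个滤波器的L1范数
    keep = torch.argsort(importance, descending=True)[:num_keep]
    new_conv = nn.Conv2d(conv.in_channels, num_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    return new_conv

conv = nn.Conv2d(64, 128, 3, padding=1)
pruned = prune_conv_filters(conv, num_keep=96)   # 假设搜索结果是该层保留96个滤波器
```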

【2】 Modeling Image Quantization Tradeoffs for Optimal Compression 标题:基于最优压缩的图像量化权衡建模 链接:https://arxiv.org/abs/2112.07207

作者:Johnathan Chiu 机构:University of California, Berkeley 摘要:所有有损压缩算法都采用类似的压缩方案——频域变换,然后是量化和无损编码方案。他们通过量化高频数据来提高压缩率,从而达到折衷的目的,而压缩率的代价是更高的图像失真。我们提出了一种利用深度学习和极大极小损失函数优化量化表的新方法,与以前的方法相比,该方法能够更准确地测量率和失真参数(RD)之间的权衡。我们设计了一个卷积神经网络(CNN),以无监督的方式学习图像块和量化表之间的映射。通过一次处理所有通道中的图像,我们还可以通过测量不同通道之间信息丢失的权衡来实现更高的性能。我们最初的目标是对JPEG图像进行优化,但感觉这可以扩展到任何有损压缩。 摘要:All Lossy compression algorithms employ similar compression schemes -- frequency domain transform followed by quantization and lossless encoding schemes. They target tradeoffs by quantizating high frequency data to increase compression rates which come at the cost of higher image distortion. We propose a new method of optimizing quantization tables using Deep Learning and a minimax loss function that more accurately measures the tradeoffs between rate and distortion parameters (RD) than previous methods. We design a convolutional neural network (CNN) that learns a mapping between image blocks and quantization tables in an unsupervised manner. By processing images across all channels at once, we can achieve stronger performance by also measuring tradeoffs in information loss between different channels. We initially target optimization on JPEG images but feel that this can be expanded to any lossy compressor.
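作为背景，下面示意量化表在JPEG式压缩中的作用：对8x8块做DCT后逐元素除以量化表并取整，量化越粗码率越低、失真越大；论文学习的目标即是为不同内容产出更合适的量化表。代码仅为说明性示意，与论文实现无关。

```python
import numpy as np
from scipy.fftpack import dct, idct

def quantize_block(block, qtable):
    """block: 8x8像素块(浮点, 已减去128); qtable: 8x8量化表。"""
    coeffs = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')   # 2D DCT
    q = np.round(coeffs / qtable)              # 量化：多数高频系数被置零 -> 码率下降
    rec = idct(idct(q * qtable, axis=0, norm='ortho'), axis=1, norm='ortho')
    distortion = float(np.mean((rec - block) ** 2))   # 失真(MSE)，与码率共同构成RD权衡
    return q, distortion
```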

表征学习(1篇)

【1】 CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations 标题:CLIP-Lite:从文本标注中学习信息高效的视觉表示 链接:https://arxiv.org/abs/2112.07133

作者:Aman Shrivastava,Ramprasaath R. Selvaraju,Nikhil Naik,Vicente Ordonez 机构:University of Virginia,Salesforce Research,Rice University 摘要:我们提出了CLIP-Lite，这是一种通过与文本注释进行特征对齐来进行视觉表示学习的信息高效方法。与之前提出的CLIP模型相比，CLIP-Lite在优化其对比学习目标时，每个正图像-文本样本只需要一个负图像-文本样本对。我们通过利用一个信息高效的下界来最大化两种输入模态之间的互信息来实现这一点。这使得CLIP-Lite能够在数据量和批大小显著减少的情况下训练，同时获得比CLIP更好的性能。我们通过在COCO-Captions数据集上预训练并测试向其他数据集的迁移学习来评估CLIP-Lite。CLIP-Lite在Pascal VOC分类上获得15.4%的mAP绝对增益，在ImageNet上获得22.1%的top-1精度增益，同时与其他更复杂的文本监督模型相当或更优。在图像和文本检索、零样本分类和视觉定位（visual grounding）任务上，CLIP-Lite也优于CLIP。最后，通过在表示学习期间执行显式的图像-文本对齐，我们表明CLIP-Lite可以利用语言语义来鼓励无偏见的视觉表示，这些表示可用于下游任务。 摘要:We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a 15.4% mAP absolute gain in performance on Pascal VOC classification, and a 22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, by performing explicit image-text alignment during representation learning, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks.
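CLIP-Lite的关键在于"每个正样本只需一个负样本"的互信息下界。下面给出一个基于JSD型下界（类似Deep InfoMax做法）的概念性示意，其中用点积充当判别器打分、用batch内错位一位构造负样本，这些均为示意性假设，并非论文的具体实现。

```python
import torch
import torch.nn.functional as F

def jsd_mi_lower_bound(score_pos, score_neg):
    """JSD形式的互信息下界：score_pos/score_neg 为对配对与非配对
    (图像, 文本) 的判别器打分 [B]，每个正样本只需一个负样本即可估计。"""
    return (-F.softplus(-score_pos) - F.softplus(score_neg)).mean()

# 用法示意：最大化下界 == 最小化其相反数
img_emb = F.normalize(torch.randn(32, 256), dim=1)
txt_emb = F.normalize(torch.randn(32, 256), dim=1)
score_pos = (img_emb * txt_emb).sum(dim=1)                   # 配对样本打分
score_neg = (img_emb * txt_emb.roll(1, dims=0)).sum(dim=1)   # 错位构造一个负样本
loss = -jsd_mi_lower_bound(score_pos, score_neg)
```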

视觉解释|视频理解VQA|caption等(1篇)

【1】 Dual-Key Multimodal Backdoors for Visual Question Answering 标题:用于视觉答疑的双键多模式后门 链接:https://arxiv.org/abs/2112.07668

作者:Matthew Walmer,Karan Sikka,Indranil Sur,Abhinav Shrivastava,Susmit Jha 机构:University of Maryland, College Park, SRI International 备注:22 pages, 11 figures, 12 tables 摘要:深度学习的成功促进了多模态任务的发展,这些任务需要对多个输入域进行非平凡的融合。尽管多模态模型在许多问题中显示出了潜力,但其日益增加的复杂性使其更容易受到攻击。后门(或特洛伊木马)攻击是一类安全漏洞,其中攻击者将恶意秘密行为嵌入到网络中(例如,有针对性的误分类),当攻击者指定的触发器添加到输入时,该网络会被激活。在这项工作中,我们表明多模网络容易受到一种新型攻击,我们称之为双键多模后门。这种攻击利用最先进的网络使用的复杂融合机制来嵌入既有效又隐蔽的后门。该攻击没有使用单个触发器,而是在每个输入模式中嵌入一个触发器,并且仅当两个触发器都存在时才激活恶意行为。我们对具有多种体系结构和视觉特征主干的视觉问答(VQA)任务的多模式后门进行了广泛的研究。在VQA模型中嵌入后门的一个主要挑战是,大多数模型使用从固定的预训练对象检测器中提取的视觉特征。这对攻击者来说是一个挑战,因为探测器可能会完全扭曲或忽略视觉触发器,从而导致后门过分依赖语言触发器的模型。我们通过提出一种针对预训练目标检测器的视觉触发优化策略来解决这个问题。通过这种方法,我们创建了双键后门,攻击成功率超过98%,同时只毒害了1%的训练数据。最后,我们发布了TrojVQA,这是一个干净的和特洛伊木马VQA模型的大集合,以支持针对多模式后门的防御研究。 摘要:The success of deep learning has enabled advances in multimodal tasks that require non-trivial fusion of multiple input domains. Although multimodal models have shown potential in many problems, their increased complexity makes them more vulnerable to attacks. A Backdoor (or Trojan) attack is a class of security vulnerability wherein an attacker embeds a malicious secret behavior into a network (e.g. targeted misclassification) that is activated when an attacker-specified trigger is added to an input. In this work, we show that multimodal networks are vulnerable to a novel type of attack that we refer to as Dual-Key Multimodal Backdoors. This attack exploits the complex fusion mechanisms used by state-of-the-art networks to embed backdoors that are both effective and stealthy. Instead of using a single trigger, the proposed attack embeds a trigger in each of the input modalities and activates the malicious behavior only when both the triggers are present. We present an extensive study of multimodal backdoors on the Visual Question Answering (VQA) task with multiple architectures and visual feature backbones. A major challenge in embedding backdoors in VQA models is that most models use visual features extracted from a fixed pretrained object detector. This is challenging for the attacker as the detector can distort or ignore the visual trigger entirely, which leads to models where backdoors are over-reliant on the language trigger. We tackle this problem by proposing a visual trigger optimization strategy designed for pretrained object detectors. Through this method, we create Dual-Key Backdoors with over a 98% attack success rate while only poisoning 1% of the training data. Finally, we release TrojVQA, a large collection of clean and trojan VQA models to enable research in defending against multimodal backdoors.

超分辨率|去噪|去模糊|去雾(3篇)

【1】 Learning to Deblur and Rotate Motion-Blurred Faces 标题:学习消除模糊和旋转运动模糊面 链接:https://arxiv.org/abs/2112.07599

作者:Givi Meishvili,Attila Szabó,Simon Jenni,Paolo Favaro 机构: University of Bern, Switzerland, Adobe Research, Huawei, Noah’s Ark Lab, (work was done before joining) 备注:British Machine Vision Conference 2021 摘要:我们提出了一个解决方案,从一个单一的运动模糊的人脸图像从新的视点渲染清晰的视频。我们的方法通过在三个大型数据集(FFHQ和300VW)上的联合训练隐式学习人脸的几何和运动来处理人脸模糊的复杂性,这三个数据集是公开的,并且我们构建了一个新的Bern多视图人脸数据集(BMFD)。前两个数据集提供了大量的面,使我们的模型能够更好地概括。BMFD允许我们引入多视图约束,这对于从新的摄像机视图合成清晰视频至关重要。它由来自多个主题的多个视图的高帧速率同步视频组成,显示了广泛的面部表情。我们使用高帧率视频通过平均来模拟真实的运动模糊。借助于这个数据集,我们训练了一个神经网络,从单个图像和相应的面部注视重建三维视频表示。然后,我们提供一个相对于估计的凝视和模糊图像的摄像机视点,作为编码器-解码器网络的输入,以生成具有新摄像机视点的锐利帧的视频。我们在多视图数据集和VIDTIMIT的测试对象上演示了我们的方法。 摘要:We propose a solution to the novel task of rendering sharp videos from new viewpoints from a single motion-blurred image of a face. Our method handles the complexity of face blur by implicitly learning the geometry and motion of faces through the joint training on three large datasets: FFHQ and 300VW, which are publicly available, and a new Bern Multi-View Face Dataset (BMFD) that we built. The first two datasets provide a large variety of faces and allow our model to generalize better. BMFD instead allows us to introduce multi-view constraints, which are crucial to synthesizing sharp videos from a new camera view. It consists of high frame rate synchronized videos from multiple views of several subjects displaying a wide range of facial expressions. We use the high frame rate videos to simulate realistic motion blur through averaging. Thanks to this dataset, we train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze. We then provide a camera viewpoint relative to the estimated gaze and the blurry image as input to an encoder-decoder network to generate a video of sharp frames with a novel camera viewpoint. We demonstrate our approach on test subjects of our multi-view dataset and VIDTIMIT.

【2】 Mitigating Channel-wise Noise for Single Image Super Resolution 标题:单幅图像超分辨率通道噪声的抑制 链接:https://arxiv.org/abs/2112.07589

作者:Srimanta Mandal,Kuldeep Purohit,A. N. Rajagopalan 摘要:实际上,对于不同的颜色通道,图像可能包含不同数量的噪声,这是现有超分辨率方法所不承认的。在本文中,我们提出了通过联合考虑颜色通道来超分辨含噪彩色图像。噪声统计是从输入的低分辨率图像中盲估计的,用于在数据开销中为不同的颜色通道分配不同的权重。视觉数据的隐式低秩结构是通过核范数最小化和自适应权值来实现的,自适应权值作为正则化项加入到代价中。此外,图像的多尺度细节通过另一个正则化项添加到模型中,该正则化项涉及到PCA基础上的投影,该PCA基础是使用在输入图像的不同尺度上提取的相似面片构建的。结果表明,该方法在实际场景中具有超分辨能力。 摘要:In practice, images can contain different amounts of noise for different color channels, which is not acknowledged by existing super-resolution approaches. In this paper, we propose to super-resolve noisy color images by considering the color channels jointly. Noise statistics are blindly estimated from the input low-resolution image and are used to assign different weights to different color channels in the data cost. Implicit low-rank structure of visual data is enforced via nuclear norm minimization in association with adaptive weights, which is added as a regularization term to the cost. Additionally, multi-scale details of the image are added to the model through another regularization term that involves projection onto PCA basis, which is constructed using similar patches extracted across different scales of the input image. The results demonstrate the super-resolving capability of the approach in real scenarios.

【3】 Kernel-aware Raw Burst Blind Super-Resolution 标题:内核感知的原始突发盲超分辨率 链接:https://arxiv.org/abs/2112.07315

作者:Wenyi Lian,Shanglian Peng 机构:School of Computer Science, Chengdu University of Information Technology, China 摘要:突发超分辨率（SR）提供了从低质量图像恢复丰富细节的可能性。然而，由于低分辨率（LR）图像在实际应用中存在多种复杂且未知的退化，现有的非盲（如双三次）设计网络通常会导致恢复高分辨率（HR）图像的性能严重下降。此外，处理多个未对齐的带噪原始输入也是一项挑战。在本文中，我们解决了从现代手持设备获取的原始突发序列重建HR图像的问题。核心思想是一种核引导策略，它通过两个步骤解决突发SR：核建模和HR恢复。前者根据原始输入估计突发核，而后者根据估计的核预测超分辨率图像。此外，我们还引入了一个核感知的可变形对齐模块，该模块可以在考虑模糊先验的情况下有效地对齐原始图像。对合成数据集和真实数据集的大量实验表明，该方法在突发超分辨率问题上取得了领先的性能。 摘要:Burst super-resolution (SR) provides a possibility of restoring rich details from low-quality images. However, since low-resolution (LR) images in practical applications have multiple complicated and unknown degradations, existing non-blind (e.g., bicubic) designed networks usually lead to a severe performance drop in recovering high-resolution (HR) images. Moreover, handling multiple misaligned noisy raw inputs is also challenging. In this paper, we address the problem of reconstructing HR images from raw burst sequences acquired from modern handheld devices. The central idea is a kernel-guided strategy which can solve the burst SR with two steps: kernel modeling and HR restoring. The former estimates burst kernels from raw inputs, while the latter predicts the super-resolved image based on the estimated kernels. Furthermore, we introduce a kernel-aware deformable alignment module which can effectively align the raw images with consideration of the blurry priors. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method can perform favorable state-of-the-art performance in the burst SR problem.

其他神经网络|深度学习|模型|建模(5篇)

【1】 VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena 标题:VALSE:以语言现象为中心的视觉和语言模型的任务无关基准 链接:https://arxiv.org/abs/2112.07566

作者:Letitia Parcalabescu,Michele Cafagna,Lilitta Muradjan,Anette Frank,Iacer Calixto,Albert Gatt 机构:Heidelberg University, Department of Computational Linguistics, University of Malta, Institute of Linguistics and Language Technology, New York University, ILLC, University of Amsterdam, Utrecht University, Department of Information and Computing Sciences 备注:28 pages, 4 figures, 11 tables 摘要:我们提出了VALSE（Vision And Language Structured Evaluation，视觉和语言结构化评估），这是一个新的基准，旨在测试通用预训练视觉和语言（V&L）模型针对特定语言现象的视觉-语言接地（grounding）能力。VALSE提供了一套六项测试，涵盖各种语言结构。解决这些问题需要模型将语言现象接地到视觉模态中，从而允许进行比以往更细粒度的评估。我们使用支持构建有效干扰项（foils）的方法来构建VALSE，并报告了评估五种广泛使用的V&L模型的结果。我们的实验表明，目前的模型很难处理大多数现象。因此，我们希望VALSE能够作为一个重要的基准，从语言学的角度衡量预训练V&L模型的未来进展，补充规范的以任务为中心的V&L评估。 摘要:We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

【2】 An Interpretive Constrained Linear Model for ResNet and MgNet 标题:ResNet和MgNet的一种解释性约束线性模型 链接:https://arxiv.org/abs/2112.07441

作者:Juncai He,Jinchao Xu,Lian Zhang,Jianqing Zhu 机构: The University of Texas at Austin, †Department of Mathematics, The Pennsylvania State University, University Park 备注:26 pages, 2 figures and 11 tables. arXiv admin note: text overlap with arXiv:1911.10428 摘要:我们提出了一种约束线性数据特征映射模型,作为使用卷积神经网络(CNN)进行图像分类的可解释数学模型。从这个观点出发,我们在线性系统的传统迭代方案和ResNet和MgNet类型模型的基本块的体系结构之间建立了详细的联系。利用这些联系,我们提出了一些改进的ResNet模型,与原始模型相比,这些模型具有更少的参数,但可以产生更精确的结果,从而证明了这种约束学习数据特征映射假设的有效性。基于这一假设,我们进一步提出了一种通用的数据特征迭代方案来证明MgNet的合理性。我们还对MgNet进行了系统的数值研究,以展示其在图像分类问题上的成功和优势,并与已建立的网络进行了比较。 摘要:We propose a constrained linear data-feature-mapping model as an interpretable mathematical model for image classification using a convolutional neural network (CNN). From this viewpoint, we establish detailed connections between the traditional iterative schemes for linear systems and the architectures of the basic blocks of ResNet- and MgNet-type models. Using these connections, we present some modified ResNet models that compared with the original models have fewer parameters and yet can produce more accurate results, thereby demonstrating the validity of this constrained learning data-feature-mapping assumption. Based on this assumption, we further propose a general data-feature iterative scheme to show the rationality of MgNet. We also provide a systematic numerical study on MgNet to show its success and advantages in image classification problems and demonstrate its advantages in comparison with established networks.

【3】 Simple and Robust Loss Design for Multi-Label Learning with Missing Labels 标题:具有缺失标签的多标签学习的简单鲁棒损失设计 链接:https://arxiv.org/abs/2112.07368

作者:Youcai Zhang,Yuhao Cheng,Xinyu Huang,Fei Wen,Rui Feng,Yaqian Li,Yandong Guo 摘要:标签缺失情况下的多标签学习（MLML）是一个具有挑战性的问题。现有的方法主要集中在网络结构或训练方案的设计上，这增加了实现的复杂性。这项工作试图在不增加流程和复杂性的情况下挖掘损失函数在MLML中的潜力。为此，基于模型在训练期间能够以较高精度识别缺失标签这一观察，我们通过鲁棒损失设计提出了两种简单而有效的方法。第一种是针对负样本的新型鲁棒损失，即Hill损失，它以山丘形状对负样本重新加权，以减轻假负例的影响。第二种是自步损失校正（SPLC）方法，该方法在缺失标签的近似分布下使用从最大似然准则导出的损失。在大量多标签图像分类数据集上的综合实验表明，我们的方法可以显著提升MLML的性能，并成为MLML中新的最先进损失函数。 摘要:Multi-label learning in the presence of missing labels (MLML) is a challenging problem. Existing methods mainly focus on the design of network structures or training schemes, which increase the complexity of implementation. This work seeks to fulfill the potential of loss function in MLML without increasing the procedure and complexity. Toward this end, we propose two simple yet effective methods via robust loss design based on an observation that a model can identify missing labels during training with a high precision. The first is a novel robust loss for negatives, namely the Hill loss, which re-weights negatives in the shape of a hill to alleviate the effect of false negatives. The second is a self-paced loss correction (SPLC) method, which uses a loss derived from the maximum likelihood criterion under an approximate distribution of missing labels. Comprehensive experiments on a vast range of multi-label image classification datasets demonstrate that our methods can remarkably boost the performance of MLML and achieve new state-of-the-art loss functions in MLML.
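下面给出一个针对"缺失标签造成假负例"问题、对负样本做山丘形加权的示意实现；其中的具体函数形式与超参数lam是为说明思路而假设的，并不保证与原论文公式完全一致。

```python
import torch

def hill_negative_loss(p, lam=1.5):
    """"标注为负(可能实为缺失标签)"条目的损失示意：取 (lam - p) * p^2。
    当lam=1.5时其对p的梯度为3p(1-p)，呈山丘形、在p接近1时趋于0，
    因而高置信度的"假负例"几乎不产生梯度。具体公式以原论文为准。"""
    return (lam - p) * p ** 2

def mlml_loss(logits, targets):
    """targets: 1=正标签, 0=负/未标注。正样本用常规BCE项，负样本用山丘形损失。"""
    p = torch.sigmoid(logits)
    pos_loss = -targets * torch.log(p.clamp(min=1e-6))
    neg_loss = (1 - targets) * hill_negative_loss(p)
    return (pos_loss + neg_loss).mean()
```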

【4】 Why you should learn functional basis 标题:为什么要学习函数基 链接:https://arxiv.org/abs/2112.07289

作者:Riccardo Marin,Souhaib Attaiki,Simone Melzi,Emanuele Rodolà,Maks Ovsjanikov 机构:Sapienza, University of Rome, LIX, Ecole Polytechnique, IP Paris, Emanuele Rodola 摘要:几何数据的高效、实用表示是几何处理中的一个普遍问题。一种广泛使用的选择是通过光谱嵌入对3D对象进行编码,将微分算子(通常为拉普拉斯算子)特征函数的截断子集在该点处假定的值与每个曲面点关联。在过去的十年中,为不同的应用程序定义新的、更好的嵌入的几次尝试已经见效。尽管如此,标准拉普拉斯特征函数仍然稳固地位于可用解的顶部,尽管它们存在局限性,例如形状匹配仅限于近等距。最近,一个新的趋势显示了学习拉普拉斯特征函数替代物的优势。与此同时,许多研究问题仍然没有解决:新的基础是否比LBO特征函数更好,以及它们与它们之间的关系如何?从功能角度看,他们是如何行动的?如何在新配置中结合其他特性和描述符利用这些基础?在本研究中,我们适当地提出这些问题,以提高我们对这一新兴研究方向的理解。我们展示了它们在不同语境中的应用相关性,揭示了它们的一些见解和令人兴奋的未来方向。 摘要:Efficient and practical representation of geometric data is a ubiquitous problem for several applications in geometry processing. A widely used choice is to encode the 3D objects through their spectral embedding, associating to each surface point the values assumed at that point by a truncated subset of the eigenfunctions of a differential operator (typically the Laplacian). Several attempts to define new, preferable embeddings for different applications have seen the light during the last decade. Still, the standard Laplacian eigenfunctions remain solidly at the top of the available solutions, despite their limitations, such as being limited to near-isometries for shape matching. Recently, a new trend shows advantages in learning substitutes for the Laplacian eigenfunctions. At the same time, many research questions remain unsolved: are the new bases better than the LBO eigenfunctions, and how do they relate to them? How do they act in the functional perspective? And how to exploit these bases in new configurations in conjunction with additional features and descriptors? In this study, we properly pose these questions to improve our understanding of this emerging research direction. We show their applicative relevance in different contexts revealing some of their insights and exciting future directions.
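文中反复讨论的"谱嵌入"，是把每个点与截断的拉普拉斯特征函数在该点的取值关联起来。下面用k近邻图拉普拉斯在小规模点云上近似这一过程（以图拉普拉斯代替LBO，仅为示意；近邻数与基函数个数均为假设值）。

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse import csgraph
from scipy.spatial import cKDTree

def spectral_embedding(points, k_neighbors=8, n_basis=20):
    """points: [N,3] 点云(N较小时适用)；返回前n_basis个拉普拉斯特征函数构成的谱嵌入。"""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k_neighbors + 1)                  # 最近邻(含自身)
    rows = np.repeat(np.arange(len(points)), k_neighbors)
    cols = idx[:, 1:].reshape(-1)
    W = sp.coo_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(len(points), len(points))).tocsr()
    W = W.maximum(W.T)                                              # 对称化邻接矩阵
    L = csgraph.laplacian(W, normed=True).toarray()
    _, vecs = np.linalg.eigh(L)                                     # 特征值升序排列
    return vecs[:, 1:n_basis + 1]                                   # 去掉第一个常数特征向量

phi = spectral_embedding(np.random.rand(500, 3))   # 每个点由20维谱坐标表示
```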

【5】 Heuristic Hyperparameter Optimization for Convolutional Neural Networks using Genetic Algorithm 标题:基于遗传算法的卷积神经网络启发式超参数优化 链接:https://arxiv.org/abs/2112.07087

作者:Meng Zhou 机构:School of Computing, Queen’s University, Kingston, ON, Canada 备注:8 pages, 3 figures 摘要:近年来，世界各地的人们都在遭受历史上最严重的疾病之一，即2019冠状病毒病（简称COVID-19）。当病毒到达肺部时，更容易引起肺炎和败血症。对COVID-19患者而言，X射线图像是识别感染典型特征的有力工具。放射科医生和病理学家观察到，感染患者的胸部X光片中出现磨玻璃影，可作为诊断过程中的标准之一。在过去的几年中，深度学习已被证明是图像分类领域最强大的方法之一。由于正常人和感染者的胸部X射线存在显著差异，可以使用深度模型根据患者的胸部X射线判断其是否患病。许多深度模型都很复杂，并包含大量输入参数。设计者有时会在深度模型的调参过程中苦苦挣扎，尤其是当他们从头开始构建模型时。遗传算法受生物进化过程的启发，在解决此类复杂问题中发挥着关键作用。在本文中，我提出了一种基于遗传算法的方法来优化用于胸部X射线分类任务的卷积神经网络（CNN）。 摘要:In recent years, people from all over the world are suffering from one of the most severe diseases in history, known as Coronavirus disease 2019, COVID-19 for short. When the virus reaches the lungs, it has a higher probability to cause lung pneumonia and sepsis. X-ray image is a powerful tool in identifying the typical features of the infection for COVID-19 patients. The radiologists and pathologists observe that ground-glass opacity appears in the chest X-ray for infected patient, and it could be used as one of the criteria during the diagnosis process. In the past few years, deep learning has proven to be one of the most powerful methods in the field of image classification. Due to significant differences in Chest X-Ray between normal and infected people, deep models could be used to identify the presence of the disease given a patient's Chest X-Ray. Many deep models are complex, and it evolves with lots of input parameters. Designers sometimes struggle with the tuning process for deep models, especially when they build up the model from scratch. Genetic Algorithm, inspired by the biological evolution process, plays a key role in solving such complex problems. In this paper, I proposed a genetic-based approach to optimize the Convolutional Neural Network(CNN) for the Chest X-Ray classification task.
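遗传算法做CNN超参数优化的"选择-交叉-变异"基本循环可以概括为如下示意代码；其中的超参数空间SEARCH_SPACE与适应度函数（验证集精度）均为假设的占位，并非论文的具体设置。

```python
import random

SEARCH_SPACE = {                      # 假设的超参数空间，仅作占位
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "filters": [16, 32, 64],
    "kernel": [3, 5, 7],
    "dropout": [0.2, 0.3, 0.5],
}

def random_individual():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in SEARCH_SPACE}

def mutate(ind, p=0.2):
    return {k: (random.choice(SEARCH_SPACE[k]) if random.random() < p else v)
            for k, v in ind.items()}

def evolve(fitness_fn, pop_size=10, generations=5):
    """fitness_fn(超参数字典) -> 验证集精度，由调用者实现(如训练一个小型CNN后评估)。"""
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness_fn, reverse=True)
        parents = scored[: pop_size // 2]                           # 选择
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]        # 交叉 + 变异
        pop = parents + children
    return max(pop, key=fitness_fn)
```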

其他(6篇)

【1】 Deep Sea Bubble Stream Characterization Using Wide-Baseline Stereo Photogrammetry 标题:基于宽基线立体摄影测量的深海泡沫流特征研究 链接:https://arxiv.org/abs/2112.07414

作者:Mengkun She,Yifan Song,Tim Weiß,Jens Greinert,Kevin Köser 机构:GEOMAR Helmholtz Centre for Ocean Research Kiel, Wischhofstr. ,-, Kiel 备注:43 pages, 25 figures 摘要:可靠地量化从海底向海洋以及最终向大气释放的天然气和人为气体(例如,CO$U 2$,甲烷)是一项具有挑战性的任务。虽然船载回声测深仪允许从更远的距离探测水中的自由气体,但精确的量化需要诸如上升速度和气泡大小分布等参数,而这些传感器无法获得这些参数。光学方法是互补的,因为它们可以从近距离提供单个气泡或气泡流的高时间和空间分辨率。在本文中,我们介绍了一种完整的光学气泡流表征仪器和评估方法。该专用仪器采用了一个高速深海立体摄像系统,当部署在seep现场进行自动化分析时,可以记录数TB的气泡图像。气泡特征可在几分钟的短时间内获得,然后将仪器重新定位到其他位置,或在长达几天的时间间隔内以自主模式获得,以便捕捉由于海流和压力变化以及潮汐周期而产生的变化。除了报告使气泡特征具有鲁棒性和自主性的步骤外,我们还仔细评估了可达到的精度,并提出了一种新的校准程序,由于缺乏点对应,因此仅使用气泡轮廓。该系统已在太平洋最深1000米的水域成功运行,以评估甲烷通量。除了示例结果外,我们还报告了开发过程中的失败案例和经验教训。 摘要:Reliable quantification of natural and anthropogenic gas release (e.g. CO$_2$, methane) from the seafloor into the ocean, and ultimately, the atmosphere, is a challenging task. While ship-based echo sounders allow detection of free gas in the water even from a larger distance, exact quantification requires parameters such as rise speed and bubble size distribution not obtainable by such sensors. Optical methods are complementary in the sense that they can provide high temporal and spatial resolution of single bubbles or bubble streams from close distance. In this contribution we introduce a complete instrument and evaluation method for optical bubble stream characterization. The dedicated instrument employs a high-speed deep sea stereo camera system that can record terabytes of bubble imagery when deployed at a seep site for later automated analysis. Bubble characteristics can be obtained for short sequences of few minutes, then relocating the instrument to other locations, or in autonomous mode of intervals up to several days, in order to capture variations due to current and pressure changes and across tidal cycles. Beside reporting the steps to make bubble characterization robust and autonomous, we carefully evaluate the reachable accuracy and propose a novel calibration procedure that, due to the lack of point correspondences, uses only the silhouettes of bubbles. The system has been operated successfully in up to 1000m water depth in the Pacific Ocean to assess methane fluxes. Besides sample results we also report failure cases and lessons learnt during development.

【2】 Stochastic Actor-Executor-Critic for Image-to-Image Translation 标题:图像到图像翻译的随机行为者-执行者-批评者 链接:https://arxiv.org/abs/2112.07403

作者:Ziwei Luo,Jing Hu,Xin Wang,Siwei Lyu,Bin Kong,Youbing Yin,Qi Song,Xi Wu 机构:Chengdu University of Information Technology, China, Keya Medical, Seattle, USA, University at Buffalo, SUNY, USA 备注:None 摘要:训练一个无模型的深度强化学习模型来解决图像到图像的转换是困难的，因为它涉及高维的连续状态空间和动作空间。在本文中，我们从近来在具有挑战性的连续控制问题上取得成功的最大熵强化学习框架中获得启发，在包括图像表示、生成和控制在内的高维连续空间上同时学习随机策略。该方法的核心是随机参与者-执行者-批评家（SAEC）：一种带有额外执行者以生成逼真图像的离策略（off-policy）参与者-批评家模型。具体来说，参与者通过一个随机潜在动作关注高层表示和控制策略，并显式地指示执行者生成低层动作来操纵状态。在多个图像到图像的翻译任务上的实验证明了所提出的SAEC在面对高维连续空间问题时的有效性和鲁棒性。 摘要:Training a model-free deep reinforcement learning model to solve image-to-image translation is difficult since it involves high-dimensional continuous state and action spaces. In this paper, we draw inspiration from the recent success of the maximum entropy reinforcement learning framework designed for challenging continuous control problems to develop stochastic policies over high dimensional continuous spaces including image representation, generation, and control simultaneously. Central to this method is the Stochastic Actor-Executor-Critic (SAEC) which is an off-policy actor-critic model with an additional executor to generate realistic images. Specifically, the actor focuses on the high-level representation and control policy by a stochastic latent action, as well as explicitly directs the executor to generate low-level actions to manipulate the state. Experiments on several image-to-image translation tasks have demonstrated the effectiveness and robustness of the proposed SAEC when facing high-dimensional continuous space problems.
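The following PyTorch skeleton illustrates the general actor-executor-critic split described in the abstract: the actor emits a stochastic latent action, the executor decodes it into an image-shaped low-level action, and the critic scores the (state, latent action) pair. All layer sizes, the 32x32 toy resolution and the module details are assumptions, not the authors' architecture, and the maximum-entropy training losses are omitted.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an image state to a distribution over a low-dimensional latent action."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mu = nn.Linear(64, latent_dim)
        self.log_std = nn.Linear(64, latent_dim)

    def forward(self, state):
        h = self.encoder(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        z = dist.rsample()                       # stochastic latent action
        return z, dist.log_prob(z).sum(-1)

class Executor(nn.Module):
    """Decodes a latent action into a low-level, image-shaped action."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z):
        return self.decoder(self.fc(z).view(-1, 64, 8, 8))   # 32x32 output

class Critic(nn.Module):
    """Scores a (state, latent action) pair, as in standard off-policy actor-critic."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.q = nn.Sequential(nn.Linear(32 + latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state, z):
        return self.q(torch.cat([self.encoder(state), z], dim=-1))

# One forward pass with a dummy 32x32 state:
state = torch.randn(1, 3, 32, 32)
actor, executor, critic = Actor(), Executor(), Critic()
z, logp = actor(state)
generated = executor(z)
q_value = critic(state, z)
```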

【3】 Levels of Autonomous Radiology 标题:自主放射学水平 链接:https://arxiv.org/abs/2112.07286

作者:Suraj Ghuwalewala,Viraj Kulkarni,Richa Pant,Amit Kharat 机构: DeepTek, Pune, India, Dr. D.Y. Patil Medical College, Hospital and Research Centre, Pune, India 备注:8 pages 摘要:放射学是一门年轻的医学学科，历史仅一个多世纪，它见证了巨大的技术进步，并彻底改变了我们今天的行医方式。在过去的几十年中，医学成像模态产生了海量的医学数据。利用这些数据开发和采用人工智能（AI）应用将引领放射学进入下一个发展阶段，其中包括自动化标注、报告生成等费力的手动任务，以及对病例进行初步放射学评估，以辅助放射科医生的评估工作流程。我们为放射学自动化的进展提出了一个分级分类法，解释了AI在每个级别上提供的帮助以及相应的挑战和解决方案。我们希望这些讨论能够帮助我们以结构化的方式应对这些挑战，并采取必要的步骤确保新技术在放射学领域的顺利采用。 摘要:Radiology, being one of the younger disciplines of medicine with a history of just over a century, has witnessed tremendous technological advancements and has revolutionized the way we practice medicine today. In the last few decades, medical imaging modalities have generated seismic amounts of medical data. The development and adoption of Artificial Intelligence (AI) applications using this data will lead to the next phase of evolution in radiology. It will include automating laborious manual tasks such as annotations, report-generation, etc., along with the initial radiological assessment of cases to aid radiologists in their evaluation workflow. We propose a level-wise classification for the progression of automation in radiology, explaining AI assistance at each level with corresponding challenges and solutions. We hope that such discussions can help us address the challenges in a structured way and take the necessary steps to ensure the smooth adoption of new technologies in radiology.

【4】 On the use of Cortical Magnification and Saccades as Biological Proxies for Data Augmentation 标题:皮层放大和眼跳作为数据增强的生物学替代的研究 链接:https://arxiv.org/abs/2112.07173

作者:Binxu Wang,David Mayo,Arturo Deza,Andrei Barbu,Colin Conwell 机构:Dept. of Neurobiology, Harvard Medical School &, Dept. of Neuroscience, Washington University in St Louis, Dept. of Psychology, Harvard University; ,Google Research, Brain Team, BCS & CBMM, MIT; ,CSAIL & CBMM, MIT 备注:14 pages, 6 figures, 2 tables. Published in NeurIPS 2021 Workshop, Shared Visual Representations in Human & Machine Intelligence (SVRHM). For code, see this https URL 摘要:自监督学习是从自然数据中学习有用表征的有效方法。它也被认为是人类建立视觉表征的一种可能方式，但具体的目标函数和算法尚不清楚。目前，大多数自监督方法鼓励系统对同一图像的不同变换学习不变的表征，以与其他图像形成对比。然而，这些变换通常在生物学上并不合理，并且往往由人为设计的感知方案组成，例如随机裁剪和颜色抖动。在本文中，我们试图对这些增强进行逆向工程，使其在生物学或感知上更合理，同时仍然为鼓励稳健表征带来相同的好处。关键的是，我们发现随机裁剪可以被皮层放大所取代，而类似眼跳的图像采样也有助于表征学习。这些变换的可行性提示了生物视觉系统实现自监督的一种潜在方式。此外，它们打破了许多计算机视觉算法中广泛采用的空间均匀处理假设，表明空间自适应计算在人类和机器中都可能发挥作用。我们的代码和演示可以在这里找到。 摘要:Self-supervised learning is a powerful way to learn useful representations from natural data. It has also been suggested as one possible means of building visual representation in humans, but the specific objective and algorithm are unknown. Currently, most self-supervised methods encourage the system to learn an invariant representation of different transformations of the same image in contrast to those of other images. However, such transformations are generally non-biologically plausible, and often consist of contrived perceptual schemes such as random cropping and color jittering. In this paper, we attempt to reverse-engineer these augmentations to be more biologically or perceptually plausible while still conferring the same benefits for encouraging robust representation. Critically, we find that random cropping can be substituted by cortical magnification, and saccade-like sampling of the image could also assist the representation learning. The feasibility of these transformations suggests a potential way that biological visual systems could implement self-supervision. Further, they break the widely accepted spatially-uniform processing assumption used in many computer vision algorithms, suggesting a role for spatially-adaptive computation in humans and machines alike. Our code and demo can be found here.
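A simple way to picture a cortical-magnification-style augmentation is a radial warp around a (possibly random) fixation point, so that the fovea is magnified and the periphery compressed. The sketch below is only an illustrative stand-in for the general idea; the authors' actual warp and sampling scheme may differ, and the `gamma` exponent, fixation range and resolution are assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def cortical_magnification(img, fixation=None, gamma=1.8):
    """Radially warp `img` (H, W, C) around a fixation point: center magnified,
    periphery compressed.  gamma > 1 controls the strength of magnification."""
    h, w = img.shape[:2]
    if fixation is None:                 # a random fixation point, mimicking a saccade
        fixation = (np.random.uniform(0.3, 0.7) * h, np.random.uniform(0.3, 0.7) * w)
    fy, fx = fixation

    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    dy, dx = yy - fy, xx - fx
    r = np.sqrt(dy ** 2 + dx ** 2)
    r_max = r.max() + 1e-8

    # Output radius r samples source radius r_src = r_max * (r / r_max) ** gamma:
    # small radii shrink (center magnified), the maximal radius maps to itself.
    scale = (r / r_max) ** (gamma - 1.0)
    src_y, src_x = fy + dy * scale, fx + dx * scale

    warped = np.stack([
        map_coordinates(img[..., c].astype(np.float64),
                        [src_y, src_x], order=1, mode="nearest")
        for c in range(img.shape[2])
    ], axis=-1)
    return warped.astype(img.dtype)

# In an SSL pipeline this could replace RandomResizedCrop: two fixations give two views.
dummy = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)
view1 = cortical_magnification(dummy)
view2 = cortical_magnification(dummy)
```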

【5】 Birds Eye View Social Distancing Analysis System 标题:鸟瞰社交距离分析系统 链接:https://arxiv.org/abs/2112.07159

作者:Zhengye Yang,Mingfei Sun,Hongzhe Ye,Zihao Xiong,Gil Zussman,Zoran Kostic 机构:†Department of Electrical Engineering, Columbia University, United States of America 摘要:在COVID-19等呼吸道传染病大流行中，保持社交距离可以降低感染率。交通路口特别适合用于监测和评估大都市中的社交距离行为。我们提出并评估了一个保护隐私的社交距离分析系统（B-SDA），该系统使用行人穿越交通路口的鸟瞰视频记录。我们设计了视频预处理、目标检测和跟踪算法，这些算法植根于已知的计算机视觉和深度学习技术，但经过修改以解决检测架设在高处的摄像机所拍摄到的非常小的物体/行人的问题。我们提出了一种结合行人分组的方法来检测违反社交距离的行为。B-SDA被用于基于某大都市地区大流行前与大流行期间的视频来比较行人行为。所实现的行人检测性能为$63.0\%$ $AP_{50}$，跟踪性能为$47.6\%$ MOTA。大流行期间违反社交距离的比率为$15.6\%$，明显低于大流行前$31.4\%$的基线，表明行人遵循了CDC规定的社交距离建议。该系统适合部署于实际应用中。 摘要:Social distancing can reduce the infection rates in respiratory pandemics such as COVID-19. Traffic intersections are particularly suitable for monitoring and evaluation of social distancing behavior in metropolises. We propose and evaluate a privacy-preserving social distancing analysis system (B-SDA), which uses bird's-eye view video recordings of pedestrians who cross traffic intersections. We devise algorithms for video pre-processing, object detection and tracking which are rooted in the known computer-vision and deep learning techniques, but modified to address the problem of detecting very small objects/pedestrians captured by a highly elevated camera. We propose a method for incorporating pedestrian grouping for detection of social distancing violations. B-SDA is used to compare pedestrian behavior based on pre-pandemic and pandemic videos in a major metropolitan area. The accomplished pedestrian detection performance is $63.0\%$ $AP_{50}$ and the tracking performance is $47.6\%$ MOTA. The social distancing violation rate of $15.6\%$ during the pandemic is notably lower than $31.4\%$ pre-pandemic baseline, indicating that pedestrians followed CDC-prescribed social distancing recommendations. The proposed system is suitable for deployment in real-world applications.
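The core of a group-aware violation check can be sketched in a few lines: once detections are projected to metric bird's-eye-view coordinates and grouped, only pairs from different groups that fall under the distance threshold count as violations. The 6 ft (1.83 m) threshold, the simple pairwise rule and the toy positions below are illustrative assumptions, not the B-SDA pipeline itself.

```python
import numpy as np
from itertools import combinations

def social_distance_violations(positions_m, group_ids, threshold_m=1.83):
    """Count pairwise social-distancing violations, skipping pairs that belong
    to the same group (e.g. people walking together).
    positions_m: (N, 2) ground-plane coordinates in metres (bird's-eye view).
    group_ids:   length-N group labels from an upstream grouping step."""
    positions_m = np.asarray(positions_m, dtype=float)
    violations = []
    for i, j in combinations(range(len(positions_m)), 2):
        if group_ids[i] == group_ids[j]:
            continue                      # same group: not a violation
        dist = np.linalg.norm(positions_m[i] - positions_m[j])
        if dist < threshold_m:
            violations.append((i, j, dist))
    return violations

# Toy frame: four pedestrians, two of whom walk together as one group.
pos = [(0.0, 0.0), (1.0, 0.2), (5.0, 5.0), (5.5, 5.1)]
groups = [0, 1, 2, 2]                     # persons 2 and 3 form a group
print(social_distance_violations(pos, groups))
# -> only the (0, 1) pair is flagged; the grouped pair (2, 3) is ignored.
```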

【6】 Exploring Latent Dimensions of Crowd-sourced Creativity 标题:探索众包创意的潜在维度 链接:https://arxiv.org/abs/2112.06978

作者:Umut Kocasari,Alperen Bag,Efehan Atici,Pinar Yanardag 机构:Bogazici University 备注:5th Workshop on Machine Learning for Creativity and Design (NeurIPS 2021), Sydney, Australia 摘要:近年来，在预训练GAN的潜在空间中发现可解释方向已成为一个热门话题。虽然现有工作大多考虑用于语义图像操纵的方向，但我们关注一个抽象的属性：创造力。我们能否通过操纵一幅图像来增加或减少其创造性？我们的工作建立在最大的基于AI的创意平台Artbreeder之上，用户可以在该平台上使用预训练的GAN模型生成图像。我们探索了在该平台上生成的图像的潜在维度，并提出了一个用于操纵图像使其更具创造性的新框架。我们的代码和数据集可在 http://github.com/catlab-team/latentcreative 获取。 摘要:Recently, the discovery of interpretable directions in the latent spaces of pre-trained GANs has become a popular topic. While existing works mostly consider directions for semantic image manipulations, we focus on an abstract property: creativity. Can we manipulate an image to be more or less creative? We build our work on the largest AI-based creativity platform, Artbreeder, where users can generate images using pre-trained GAN models. We explore the latent dimensions of images generated on this platform and present a novel framework for manipulating images to make them more creative. Our code and dataset are available at http://github.com/catlab-team/latentcreative.
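The basic mechanics of editing an image along a latent direction can be sketched as below: given latent codes labelled as more or less "creative", a direction is estimated and added to a code before it is pushed through the generator. The difference-of-means direction, the 512-d latent size and the synthetic codes are illustrative assumptions; the paper's framework for discovering the direction is more involved.

```python
import numpy as np

# Hypothetical latent codes collected from a pre-trained GAN, split by a
# creativity label.  Synthetic data stands in for real Artbreeder latents.
latent_dim = 512
rng = np.random.default_rng(0)
creative_codes = rng.normal(0.3, 1.0, size=(200, latent_dim))   # labelled "creative"
ordinary_codes = rng.normal(0.0, 1.0, size=(200, latent_dim))   # labelled "ordinary"

# A simple interpretable direction: difference of class means, unit-normalised.
direction = creative_codes.mean(axis=0) - ordinary_codes.mean(axis=0)
direction /= np.linalg.norm(direction)

def edit_latent(z, alpha):
    """Move a latent code along the creativity direction; alpha > 0 pushes the
    generated image towards 'more creative', alpha < 0 towards 'less creative'."""
    return z + alpha * direction

z = rng.normal(0.0, 1.0, size=latent_dim)
more_creative = edit_latent(z, alpha=3.0)
less_creative = edit_latent(z, alpha=-3.0)
# In practice each edited code would be decoded by the pre-trained generator,
# e.g. image = G(more_creative) for a hypothetical generator G.
```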
