cs.CV 方向,今日共计75篇
Transformer(3篇)
【1】 SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal 标题:SSAT:一种用于妆容迁移与卸妆的对称语义感知Transformer网络 链接:https://arxiv.org/abs/2112.03631
作者:Zhaoyang Sun,Yaxiong Chen,Shengwu Xiong 机构: School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, Wuhan University of Technology Chongqing Research Institute, Chongqing, Sanya Science and Education Innovation Park of Wuhan University of Technology, Sanya 备注:Accepted to AAAI 2022 摘要:妆容迁移不仅要提取参考图像的妆容风格,还要将该妆容风格渲染到目标图像的语义对应位置。然而,大多数现有方法侧重于前者而忽视后者,导致无法达到预期的结果。为了解决上述问题,我们提出了一种统一的对称语义感知Transformer(SSAT)网络,该网络结合语义对应学习来同时实现妆容迁移与卸妆。在SSAT中,提出了一种新的对称语义对应特征迁移(SSCFT)模块和一种弱监督语义损失,用于建模并促进精确语义对应的建立。在生成过程中,利用SSCFT对提取的妆容特征进行空间变形,实现与目标图像的语义对齐,然后将变形后的妆容特征与未修改的妆容无关特征相结合,生成最终结果。实验表明,我们的方法获得了视觉上更精确的妆容迁移结果,与其他最先进的妆容迁移方法进行比较的用户研究也反映了我们方法的优越性。此外,我们还验证了该方法在表情和姿态差异、对象遮挡等场景下的鲁棒性,并将其扩展到视频妆容迁移。代码将发布于https://gitee.com/sunzhaoyang0304/ssat-msp。 摘要:Makeup transfer is not only to extract the makeup style of the reference image, but also to render the makeup style to the semantic corresponding position of the target image. However, most existing methods focus on the former and ignore the latter, resulting in a failure to achieve desired results. To solve the above problems, we propose a unified Symmetric Semantic-Aware Transformer (SSAT) network, which incorporates semantic correspondence learning to realize makeup transfer and removal simultaneously. In SSAT, a novel Symmetric Semantic Corresponding Feature Transfer (SSCFT) module and a weakly supervised semantic loss are proposed to model and facilitate the establishment of accurate semantic correspondence. In the generation process, the extracted makeup features are spatially distorted by SSCFT to achieve semantic alignment with the target image, then the distorted makeup features are combined with unmodified makeup irrelevant features to produce the final result. Experiments show that our method obtains more visually accurate makeup transfer results, and user study in comparison with other state-of-the-art makeup transfer methods reflects the superiority of our method. Besides, we verify the robustness of the proposed method in the difference of expression and pose, object occlusion scenes, and extend it to video makeup transfer. Code will be available at https://gitee.com/sunzhaoyang0304/ssat-msp.
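下面给出“按语义对应关系把参考图像的妆容特征变形到目标图像坐标系”这一思路的极简示意(Python/PyTorch,非原论文实现;特征维度、相似度归一化方式与温度系数均为假设):
import torch
import torch.nn.functional as F

def warp_makeup_features(feat_src, feat_ref, makeup_ref, tau=0.01):
    # feat_src/feat_ref: 目标图与参考图的语义特征 (B, C, H, W);makeup_ref: 参考图的妆容特征 (B, Cm, H, W)
    B, C, H, W = feat_src.shape
    fs = F.normalize(feat_src.flatten(2), dim=1)   # (B, C, HW)
    fr = F.normalize(feat_ref.flatten(2), dim=1)   # (B, C, HW)
    corr = torch.bmm(fs.transpose(1, 2), fr)       # (B, HW, HW) 逐位置语义相似度
    attn = F.softmax(corr / tau, dim=-1)           # 对参考图位置归一化,得到软对应关系
    mr = makeup_ref.flatten(2).transpose(1, 2)     # (B, HW, Cm)
    warped = torch.bmm(attn, mr).transpose(1, 2)   # 按对应关系聚合(变形)妆容特征
    return warped.view(B, -1, H, W)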
【2】 Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training 标题:自举式ViTs:将视觉Transformer从预训练中解放出来 链接:https://arxiv.org/abs/2112.03552
作者:Haofei Zhang,Jiarui Duan,Mengqi Xue,Jie Song,Li Sun,Mingli Song 机构: Zhejiang University, Xidian University 备注:10 Pages, Reviewing under CVPR 2022 摘要:近年来,视觉Transformer(ViT)发展迅速,并开始挑战卷积神经网络(CNN)在计算机视觉领域的主导地位。随着通用Transformer结构取代硬编码的卷积归纳偏置,ViT已超过CNN,尤其是在数据充足的情况下。然而,ViT容易在小数据集上过拟合,因此依赖于大规模预训练,而预训练需要耗费大量时间。在本文中,我们通过把CNN的归纳偏置重新引入ViT,同时保留其网络结构以获得更高的性能上界,并设置更合适的优化目标,努力将ViT从预训练中解放出来。首先,基于给定的ViT设计了一个带有归纳偏置的代理CNN。然后,提出了一种自举训练算法,通过权重共享对代理网络和ViT进行联合优化,在此过程中ViT从代理网络的中间特征中学习归纳偏置。在有限训练数据下对CIFAR-10/100和ImageNet-1k进行的大量实验给出了令人鼓舞的结果:归纳偏置有助于ViT显著加快收敛,并以更少的参数超越传统CNN。 摘要:Recently, vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in the realm of computer vision (CV). With the general-purpose Transformer architecture for replacing the hard-coded inductive biases of convolution, ViTs have surpassed CNNs, especially in data-sufficient circumstances. However, ViTs are prone to over-fit on small datasets and thus rely on large-scale pre-training, which expends enormous time. In this paper, we strive to liberate ViTs from pre-training by introducing CNNs' inductive biases back to ViTs while preserving their network architectures for higher upper bound and setting up more suitable optimization objectives. To begin with, an agent CNN is designed based on the given ViT with inductive biases. Then a bootstrapping training algorithm is proposed to jointly optimize the agent and ViT with weight sharing, during which the ViT learns inductive biases from the intermediate features of the agent. Extensive experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results that the inductive biases help ViTs converge significantly faster and outperform conventional CNNs with even fewer parameters.
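下面给出“ViT从代理CNN的中间特征中学习归纳偏置”这一联合优化步骤的极简示意(Python/PyTorch,非原文实现;权重共享未显式建模,对齐层proj与损失权重alpha均为假设):
import torch.nn.functional as F

def bootstrap_step(vit, agent_cnn, proj, images, labels, alpha=1.0):
    # vit/agent_cnn 均假设返回 (logits, 中间特征);proj 为假设的特征对齐层
    logits_v, feat_v = vit(images)
    logits_a, feat_a = agent_cnn(images)
    cls_loss = F.cross_entropy(logits_v, labels) + F.cross_entropy(logits_a, labels)
    distill = F.mse_loss(proj(feat_v), feat_a.detach())   # ViT向代理CNN的中间特征看齐
    return cls_loss + alpha * distill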
【3】 Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal 标题:通过分块对抗性去除对视觉Transformer实施的基于决策的黑盒攻击 链接:https://arxiv.org/abs/2112.03492
作者:Yucheng Shi,Yahong Han 机构:College of Intelligence and Computing, Tianjin University, Tianjin, China 摘要:与深度卷积神经网络(CNN)相比,视觉Transformer(ViT)表现出令人印象深刻的性能和更强的对抗鲁棒性。一方面,ViT关注各个图像块之间的全局交互,降低了图像的局部噪声敏感性。另一方面,现有的针对CNN的基于决策的攻击忽略了图像不同区域的噪声敏感性差异,影响了噪声压缩的效率。因此,当只能查询目标模型时,验证ViT的黑盒对抗鲁棒性仍然是一个具有挑战性的问题。在本文中,我们提出了一种新的针对ViT的基于决策的黑盒攻击,称为分块对抗性去除(PAR)。PAR通过由粗到细的搜索过程将图像划分为图像块,并分别压缩每个图像块上的噪声。PAR记录每个图像块的噪声幅度和噪声敏感度,并选择查询价值最高的图像块进行噪声压缩。此外,PAR可以用作其他基于决策的攻击的噪声初始化方法,在不引入额外计算的情况下提高ViT和CNN上的噪声压缩效率。在ImageNet-21k、ILSVRC-2012和Tiny-Imagenet数据集上的大量实验表明,PAR在相同查询次数下实现了平均低得多的扰动幅度。 摘要:Vision transformers (ViTs) have demonstrated impressive performance and stronger adversarial robustness compared to Deep Convolutional Neural Networks (CNNs). On the one hand, ViTs' focus on global interaction between individual patches reduces the local noise sensitivity of images. On the other hand, the existing decision-based attacks for CNNs ignore the difference in noise sensitivity between different regions of the image, which affects the efficiency of noise compression. Therefore, validating the black-box adversarial robustness of ViTs when the target model can only be queried still remains a challenging problem. In this paper, we propose a new decision-based black-box attack against ViTs termed Patch-wise Adversarial Removal (PAR). PAR divides images into patches through a coarse-to-fine search process and compresses the noise on each patch separately. PAR records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can be used as a noise initialization method for other decision-based attacks to improve the noise compression efficiency on both ViTs and CNNs without introducing additional calculations. Extensive experiments on ImageNet-21k, ILSVRC-2012, and Tiny-Imagenet datasets demonstrate that PAR achieves a much lower magnitude of perturbation on average with the same number of queries.
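下面给出“逐块压缩对抗噪声”这一核心思路的极简示意(Python/PyTorch,非原文实现;块大小与收缩比例均为假设,且省略了论文中由粗到细的块划分与按查询价值排序的策略):
import torch

def patchwise_compress(x_clean, x_adv, is_adversarial, patch=56, shrink=0.5):
    # x_clean/x_adv: (C, H, W);is_adversarial(img) 只需目标模型的决策输出(是否仍被欺骗)
    x = x_adv.clone()
    _, H, W = x.shape
    for top in range(0, H, patch):
        for left in range(0, W, patch):
            region = (slice(None), slice(top, top + patch), slice(left, left + patch))
            trial = x.clone()
            trial[region] = x_clean[region] + shrink * (x[region] - x_clean[region])  # 收缩该块上的噪声
            if is_adversarial(trial):   # 压缩后仍保持对抗性则接受
                x = trial
    return x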
检测相关(7篇)
【1】 MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection 标题:MS-TCT:用于动作检测的多尺度时序ConvTransformer 链接:https://arxiv.org/abs/2112.03902
作者:Rui Dai,Srijan Das,Kumara Kahatapitiya,Michael S. Ryoo,Francois Bremond 机构:François Brémond, Inria, Université Côte d'Azur, Stony Brook University 摘要:动作检测是一项重要且具有挑战性的任务,尤其是对于未剪辑视频的密集标注数据集。在这些数据集中,时间关系是复杂的,包括复合动作和共现动作等挑战。为了检测这些复杂视频中的动作,有效地捕获视频中的短期和长期时间信息至关重要。为此,我们提出了一种新的用于动作检测的ConvTransformer网络。该网络由三个主要部分组成:(1)时间编码器模块在多个时间分辨率下广泛探索全局和局部时间关系。(2) 时间尺度混合模块有效地融合多尺度特征,形成统一的特征表示。(3) 分类模块用于学习相对于实例中心的位置并预测帧级分类分数。在多个数据集(包括Charades、TSU和MultiTHUMOS)上进行的大量实验证实了我们所提方法的有效性。我们的网络在所有三个数据集上都优于最先进的方法。 摘要:Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. The temporal relation is complex in those datasets, including challenges like composite action, and co-occurring action. For detecting actions in those complex videos, efficiently capturing both short-term and long-term temporal information in the video is critical. To this end, we propose a novel ConvTransformer network for action detection. This network comprises three main components: (1) Temporal Encoder module extensively explores global and local temporal relations at multiple temporal resolutions. (2) Temporal Scale Mixer module effectively fuses the multi-scale features to have a unified feature representation. (3) Classification module is used to learn the instance center-relative position and predict the frame-level classification scores. The extensive experiments on multiple datasets, including Charades, TSU and MultiTHUMOS, confirm the effectiveness of our proposed method. Our network outperforms the state-of-the-art methods on all three datasets.
【2】 Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection 标题:从激活到显著性:为无监督显著目标检测生成高质量标签 链接:https://arxiv.org/abs/2112.03650
作者:Huajun Zhou,Peijia Chen,Lingxiao Yang,Jianhuang Lai,Xiaohua Xie 机构: Shell was with the Department of Electrical and Computer Engineering, Georgia Institute of Technology 备注:11 pages 摘要:无监督显著目标检测(USOD)对于工业应用和下游任务都具有极其重要的意义。现有的基于深度学习(DL)的USOD方法利用一些传统SOD方法提取的低质量显著性预测作为显著性线索,主要捕获图像中的一些显著区域。此外,他们在语义信息的辅助下对这些显著性线索进行细化,这些语义信息来自于在其他相关视觉任务中通过监督学习训练的一些模型。在这项工作中,我们提出了一个两阶段激活显著性(A2S)框架,有效地生成高质量的显著性线索,并使用这些线索来训练一个鲁棒的显著性检测器。更重要的是,在整个训练过程中,我们的框架中没有涉及人工注释。在第一阶段,我们将一个预训练网络(MoCo v2)转化为一个单一的激活图,其中提出了一个自适应决策边界(ADB)来帮助训练转化后的网络。为了便于生成高质量的伪标签,我们提出了一种损失函数来扩大像素与其均值之间的特征距离。在第二阶段,在线标签纠正(OLR)策略在训练过程中更新伪标签,以减少干扰物的负面影响。此外,我们使用两个剩余注意模块(RAMs)构造了一个轻量级显著性检测器,该检测器利用边缘和颜色等低层特征中的互补信息来细化高层特征。在几个SOD基准上的大量实验证明,与现有的USOD方法相比,我们的框架报告了显著的性能。此外,在3000张图像上训练我们的框架大约需要1小时,比以前最先进的方法快30倍多。 摘要:Unsupervised Salient Object Detection (USOD) is of paramount significance for both industrial applications and downstream tasks. Existing deep-learning (DL) based USOD methods utilize some low-quality saliency predictions extracted by several traditional SOD methods as saliency cues, which mainly capture some conspicuous regions in images. Furthermore, they refine these saliency cues with the assistant of semantic information, which is obtained from some models trained by supervised learning in other related vision tasks. In this work, we propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues and uses these cues to train a robust saliency detector. More importantly, no human annotations are involved in our framework during the whole training process. In the first stage, we transform a pretrained network (MoCo v2) to aggregate multi-level features to a single activation map, where an Adaptive Decision Boundary (ADB) is proposed to assist the training of the transformed network. To facilitate the generation of high-quality pseudo labels, we propose a loss function to enlarges the feature distances between pixels and their means. In the second stage, an Online Label Rectifying (OLR) strategy updates the pseudo labels during the training process to reduce the negative impact of distractors. In addition, we construct a lightweight saliency detector using two Residual Attention Modules (RAMs), which refine the high-level features using the complementary information in low-level features, such as edges and colors. Extensive experiments on several SOD benchmarks prove that our framework reports significant performance compared with existing USOD methods. Moreover, training our framework on 3000 images consumes about 1 hour, which is over 30x faster than previous state-of-the-art methods.
【3】 Regularity Learning via Explicit Distribution Modeling for Skeletal Video Anomaly Detection 标题:基于显式分布建模的规则性学习在骨架视频异常检测中的应用 链接:https://arxiv.org/abs/2112.03649
作者:Shoubin Yu,Zhongyin Zhao,Haoshu Fang,Andong Deng,Haisheng Su,Dongliang Wang,Weihao Gan,Cewu Lu,Wei Wu 机构: Shanghai Jiao Tong University, SenseTime Research, Shanghai AI Laboratory 摘要:监控视频中的异常检测对于确保公共安全具有挑战性和重要意义。与基于像素的异常检测方法不同,基于姿态的异常检测方法利用高度结构化的骨架数据,既减少了计算负担,又避免了背景噪声的负面影响。然而,与基于像素的方法不同(基于像素的方法可以直接利用光流等显式运动特征),基于姿势的方法缺乏替代的动态表示。本文提出了一种新的运动嵌入器(ME),从概率的角度提供了一种姿态运动表示。此外,还采用了一种新的任务特定时空变换器(STT)进行自我监督的姿势序列重建。然后将这两个模块集成到一个统一的姿势规则学习框架中,称为运动先验规则学习器(MoPRL)。MoPRL通过在几个具有挑战性的数据集上平均提高4.7%的AUC,实现了最先进的性能。大量实验验证了每个模块的通用性。 摘要:Anomaly detection in surveillance videos is challenging and important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, unlike pixel-based methods, which could directly exploit explicit motion features such as optical flow, pose-based methods suffer from the lack of alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose motion representation from the probability perspective. Furthermore, a novel task-specific Spatial-Temporal Transformer (STT) is deployed for self-supervised pose sequence reconstruction. These two modules are then integrated into a unified framework for pose regularity learning, which is referred to as Motion Prior Regularity Learner (MoPRL). MoPRL achieves the state-of-the-art performance by an average improvement of 4.7% AUC on several challenging datasets. Extensive experiments validate the versatility of each proposed module.
【4】 Gram-SLD: Automatic Self-labeling and Detection for Instance Objects 标题:GRAM-SLD:实例对象的自动自标记和检测 链接:https://arxiv.org/abs/2112.03641
作者:Rui Wang,Chengtun Wu,Jiawen Xin,Liang Zhang 机构:School of Instrumentation Science and Opto-electronics Engineering, Beihang University, Key Laboratory of Precision Opto-mechatronics Technology, China, Department of Electrical and Computer Engineering, University of Connecticut 备注:37 pages with 7 figures 摘要:实例目标检测在智能监控、视觉导航、人机交互、智能服务等领域发挥着重要作用。受深度卷积神经网络(DCNN)巨大成功的启发,基于DCNN的实例目标检测已成为一个很有前景的研究课题。针对DCNN需要大规模标注数据集来监督训练、而人工标注费时费力的问题,我们提出了一种基于协同训练的新框架——Gram自标记与检测(Gram-SLD)。所提出的Gram-SLD能够在只有非常有限的人工标注关键数据的情况下自动标注大量数据,并获得有竞争力的性能。在我们的框架中,定义了Gram损失并用其构造两个完全冗余且相互独立的视图,同时提出了关键样本选择策略以及综合考虑精确率与召回率的自动标注策略,以产生高质量的伪标签。在公共GMU Kitchen数据集、Active Vision数据集和自制BHID-ITEM数据集上的实验表明,仅使用5%的标注训练数据,我们的Gram-SLD在目标检测上即可取得与全监督方法相当的性能(mAP损失小于2%)。在环境复杂多变的实际应用中,该方法能够满足实例目标检测的实时性和准确性要求。 摘要:Instance object detection plays an important role in intelligent monitoring, visual navigation, human-computer interaction, intelligent services and other fields. Inspired by the great success of Deep Convolutional Neural Network (DCNN), DCNN-based instance object detection has become a promising research topic. To address the problem that DCNN always requires a large-scale annotated dataset to supervise its training while manual annotation is exhausting and time-consuming, we propose a new framework based on co-training called Gram Self-Labeling and Detection (Gram-SLD). The proposed Gram-SLD can automatically annotate a large amount of data with very limited manually labeled key data and achieve competitive performance. In our framework, gram loss is defined and used to construct two fully redundant and independent views and a key sample selection strategy along with an automatic annotating strategy that comprehensively consider precision and recall are proposed to generate high quality pseudo-labels. Experiments on the public GMU Kitchen Dataset, Active Vision Dataset and the self-made BHID-ITEM Dataset demonstrate that, with only 5% labeled training data, our Gram-SLD achieves competitive performance in object detection (less than 2% mAP loss), compared with the fully supervised methods. In practical applications with complex and changing environments, the proposed method can satisfy the real-time and accuracy requirements on instance object detection.
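摘要未给出Gram损失的具体定义;作为参考,下面是基于Gram矩阵(特征通道间二阶统计量)构造损失的一种常见写法(Python/PyTorch,属于通用做法而非论文原定义):
import torch

def gram_matrix(feat):
    # feat: (B, C, H, W) -> (B, C, C)
    B, C, H, W = feat.shape
    f = feat.view(B, C, H * W)
    return torch.bmm(f, f.transpose(1, 2)) / (C * H * W)

def gram_discrepancy(feat_a, feat_b):
    # 度量两路视图特征的Gram矩阵差异,可据此构造冗余/独立性约束
    return torch.mean((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2)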
【5】 DCAN: Improving Temporal Action Detection via Dual Context Aggregation 标题:DCAN:通过双上下文聚合改进时间动作检测 链接:https://arxiv.org/abs/2112.03612
作者:Guo Chen,Yin-Dong Zheng,Limin Wang,Tong Lu 机构:State Key Lab for Novel Software Technology, Nanjing University, China 备注:AAAI 2022 camera ready version 摘要:时间动作检测的目的是定位视频中动作的边界。当前基于边界匹配的方法枚举并计算所有可能的边界匹配以生成提议。然而,这些方法忽略了边界预测中的长程上下文聚合。同时,由于相邻匹配的语义相似,密集生成匹配的局部语义聚合不能提高语义的丰富性和区分性。在本文中,我们提出了名为双上下文聚合网络(DCAN)的端到端提议生成方法,在边界级和提议级两个层次上聚合上下文,以生成高质量的动作提议,从而提高时间动作检测的性能。具体来说,我们设计了多路径时间上下文聚合(MTCA),以实现边界级的平滑上下文聚合和精确的边界评估。对于匹配评估,设计了由粗到细匹配(CFM),在提议级聚合上下文,并将匹配图由粗到细进行细化。我们在ActivityNet v1.3和THUMOS-14上进行了广泛的实验。DCAN在ActivityNet v1.3上获得35.39%的平均mAP,在THUMOS-14上IoU@0.5时达到54.14%的mAP,这表明DCAN可以生成高质量的提议并实现最先进的性能。代码发布于https://github.com/cg1177/DCAN。 摘要:Temporal action detection aims to locate the boundaries of action in the video. The current method based on boundary matching enumerates and calculates all possible boundary matchings to generate proposals. However, these methods neglect the long-range context aggregation in boundary prediction. At the same time, due to the similar semantics of adjacent matchings, local semantic aggregation of densely-generated matchings cannot improve semantic richness and discrimination. In this paper, we propose the end-to-end proposal generation method named Dual Context Aggregation Network (DCAN) to aggregate context on two levels, namely, boundary level and proposal level, for generating high-quality action proposals, thereby improving the performance of temporal action detection. Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries. For matching evaluation, Coarse-to-fine Matching (CFM) is designed to aggregate context on the proposal level and refine the matching map from coarse to fine. We conduct extensive experiments on ActivityNet v1.3 and THUMOS-14. DCAN obtains an average mAP of 35.39% on ActivityNet v1.3 and reaches mAP 54.14% at IoU@0.5 on THUMOS-14, which demonstrates DCAN can generate high-quality proposals and achieve state-of-the-art performance. We release the code at https://github.com/cg1177/DCAN.
【6】 Voxelized 3D Feature Aggregation for Multiview Detection 标题:用于多视角检测的体素化三维特征聚合 链接:https://arxiv.org/abs/2112.03471
作者:Jiahao Ma,Jinguang Tong,Shan Wang,Wei Zhao,Liang Zheng,Chuong Nguyen 机构:Cyber Physical Systems, CSIRO Data, Australia, College of Engineering & Computer Science, Australian National University, Australia 摘要:多视图检测结合多个摄像机视图来缓解拥挤场景中的遮挡,其中最先进的方法采用单应变换将多视图特征投影到地平面。但是,我们发现这些2D变换没有考虑对象的高度,因此同一对象沿垂直方向的特征很可能不会被投影到同一地平面点上,从而导致不纯净的地平面特征。为了解决这一问题,我们提出了体素化三维特征聚合(VFA),用于多视图检测中的特征变换和聚合。具体来说,我们将三维空间体素化,将体素投影到每个摄像机视图上,并将二维特征与这些投影体素相关联。这使我们能够沿同一垂直线识别并聚合二维特征,从而在很大程度上减轻投影畸变。此外,由于不同类型的对象(人与牛)在地平面上具有不同的形状,我们引入了定向高斯编码来匹配这些形状,从而提高了准确性和效率。我们对多视图2D检测和多视图3D检测问题进行了实验。四个数据集(包括新引入的MultiviewC数据集)上的结果表明,与最先进的方法相比,我们的系统具有很强的竞争力。代码和MultiviewC数据集已发布于https://github.com/Robert-Mar/VFA。 摘要:Multi-view detection incorporates multiple camera views to alleviate occlusion in crowded scenes, where the state-of-the-art approaches adopt homography transformations to project multi-view features to the ground plane. However, we find that these 2D transformations do not take into account the object's height, and with this neglection features along the vertical direction of same object are likely not projected onto the same ground plane point, leading to impure ground-plane features. To solve this problem, we propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection. Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels. This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent. Additionally, because different kinds of objects (human vs. cattle) have different shapes on the ground plane, we introduce the oriented Gaussian encoding to match such shapes, leading to increased accuracy and efficiency. We perform experiments on multiview 2D detection and multiview 3D detection problems. Results on four datasets (including a newly introduced MultiviewC dataset) show that our system is very competitive compared with the state-of-the-art approaches. Code and MultiviewC are released at https://github.com/Robert-Mar/VFA.
【7】 Hybrid SNN-ANN: Energy-Efficient Classification and Object Detection for Event-Based Vision 标题:混合SNN-ANN:基于事件视觉的节能分类与目标检测 链接:https://arxiv.org/abs/2112.03423
作者:Alexander Kugele,Thomas Pfeil,Michael Pfeiffer,Elisabetta Chicca 机构:-,-,- 备注:Accepted at DAGM German Conference on Pattern Recognition (GCPR 2021) 摘要:基于事件的视觉传感器对事件流而非图像帧中的局部像素亮度变化进行编码,并产生稀疏、节能的场景编码,此外还具有低延迟、高动态范围和缺少运动模糊的特点。基于事件的传感器在目标识别方面的最新进展来自使用反向传播训练的深层神经网络的转换。然而,将这些方法用于事件流需要转换为同步范式,这不仅会损失计算效率,而且还会错过提取时空特征的机会。在本文中,我们提出了一种用于基于事件的模式识别和目标检测的深度神经网络端到端训练的混合体系结构,将用于有效基于事件的特征提取的尖峰神经网络(SNN)主干与随后的模拟神经网络(ANN)相结合负责解决同步分类和检测任务。这是通过将标准反向传播与代理梯度训练相结合来实现的,以通过SNN传播梯度。混合SNN ANN可以在不进行转换的情况下进行训练,并生成比ANN对应网络更具计算效率的高精度网络。我们展示了基于事件的分类和目标检测数据集的结果,其中只需要调整ANN头的结构以适应任务,并且不需要转换基于事件的输入。由于ANN和SNN需要不同的硬件模式来最大限度地提高其效率,我们设想SNN主干和ANN头部可以在不同的处理单元上执行,从而分析两部分之间通信所需的带宽。混合网络是一种很有前途的体系结构,可以在不影响效率的情况下进一步推进基于事件的视觉的机器学习方法。 摘要:Event-based vision sensors encode local pixel-wise brightness changes in streams of events rather than image frames and yield sparse, energy-efficient encodings of scenes, in addition to low latency, high dynamic range, and lack of motion blur. Recent progress in object recognition from event-based sensors has come from conversions of deep neural networks, trained with backpropagation. However, using these approaches for event streams requires a transformation to a synchronous paradigm, which not only loses computational efficiency, but also misses opportunities to extract spatio-temporal features. In this article we propose a hybrid architecture for end-to-end training of deep neural networks for event-based pattern recognition and object detection, combining a spiking neural network (SNN) backbone for efficient event-based feature extraction, and a subsequent analog neural network (ANN) head to solve synchronous classification and detection tasks. This is achieved by combining standard backpropagation with surrogate gradient training to propagate gradients through the SNN. Hybrid SNN-ANNs can be trained without conversion, and result in highly accurate networks that are substantially more computationally efficient than their ANN counterparts. We demonstrate results on event-based classification and object detection datasets, in which only the architecture of the ANN heads need to be adapted to the tasks, and no conversion of the event-based input is necessary. Since ANNs and SNNs require different hardware paradigms to maximize their efficiency, we envision that SNN backbone and ANN head can be executed on different processing units, and thus analyze the necessary bandwidth to communicate between the two parts. Hybrid networks are promising architectures to further advance machine learning approaches for event-based vision, without having to compromise on efficiency.
分类|识别相关(8篇)
【1】 Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual Learning 标题:基于注意力聚合的双向交互学习手写数学表达式识别 链接:https://arxiv.org/abs/2112.03603
作者:Xiaohang Bian,Bo Qin,Xiaozhe Xin,Jianwu Li,Xuefeng Su,Yanfeng Wang 机构: Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, China, AI Interaction Department, Tencent, China 备注:None 摘要:手写数学表达式识别旨在从给定的图像中自动生成LaTeX序列。目前,基于注意力的编解码模型被广泛应用于这项任务中。它们通常以从左到右(L2R)的方式生成目标序列,而没有利用从右到左(R2L)的上下文。本文提出了一种基于注意力聚合的双向互学习网络(ABM),该网络由一个共享编码器和两个并行的反向解码器(L2R和R2L)组成。这两个解码器通过相互蒸馏来增强,在每个训练步骤中进行一对一的知识迁移,充分利用来自两个相反方向的互补信息。此外,为了处理不同尺度下的数学符号,提出了一种注意力聚合模块(AAM)来有效地集成多尺度覆盖注意力。值得注意的是,在推理阶段,鉴于模型已经从两个相反方向学习到知识,我们只使用L2R分支进行推理,保持原有的参数量和推理速度。大量实验表明,我们提出的方法在不使用数据增强和模型集成的情况下,在CROHME 2014、CROHME 2016和CROHME 2019上的识别准确率分别达到56.85%、52.92%和53.96%,大大优于最先进的方法。源代码见补充材料。 摘要:Handwritten mathematical expression recognition aims to automatically generate LaTeX sequences from given images. Currently, attention-based encoder-decoder models are widely used in this task. They typically generate target sequences in a left-to-right (L2R) manner, leaving the right-to-left (R2L) contexts unexploited. In this paper, we propose an Attention aggregation based Bi-directional Mutual learning Network (ABM) which consists of one shared encoder and two parallel inverse decoders (L2R and R2L). The two decoders are enhanced via mutual distillation, which involves one-to-one knowledge transfer at each training step, making full use of the complementary information from two inverse directions. Moreover, in order to deal with mathematical symbols in diverse scales, an Attention Aggregation Module (AAM) is proposed to effectively integrate multi-scale coverage attentions. Notably, in the inference phase, given that the model already learns knowledge from two inverse directions, we only use the L2R branch for inference, keeping the original parameter size and inference speed. Extensive experiments demonstrate that our proposed approach achieves the recognition accuracy of 56.85 % on CROHME 2014, 52.92 % on CROHME 2016, and 53.96 % on CROHME 2019 without data augmentation and model ensembling, substantially outperforming the state-of-the-art methods. The source code is available in the supplementary materials.
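下面给出L2R与R2L两个解码器之间互蒸馏损失的极简示意(Python/PyTorch,非原文实现;此处假设将R2L分支的逐步预测沿时间维翻转后与L2R分支一一对齐,温度T为假设参数):
import torch
import torch.nn.functional as F

def mutual_distill_loss(logits_l2r, logits_r2l, T=1.0):
    # logits_*: (B, L, V),两个方向解码器在各解码步的词表logits
    log_p = F.log_softmax(logits_l2r / T, dim=-1)
    log_q = F.log_softmax(torch.flip(logits_r2l, dims=[1]) / T, dim=-1)  # 翻转时间维进行对齐
    kl_a = F.kl_div(log_p, log_q.detach().exp(), reduction='batchmean')  # R2L作为L2R的教师
    kl_b = F.kl_div(log_q, log_p.detach().exp(), reduction='batchmean')  # L2R作为R2L的教师
    return (kl_a + kl_b) * (T ** 2)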
【2】 E^2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition 标题:E^2(GO)MOTION:用于自我中心动作识别的运动增强事件流 链接:https://arxiv.org/abs/2112.03596
作者:Chiara Plizzari,Mirco Planamente,Gabriele Goletto,Marco Cannici,Emanuele Gusso,Matteo Matteucci,Barbara Caputo 机构: Politecnico di Torino, CINI Consortium, Politecnico di Milano 摘要:事件摄像机是一种新颖的仿生传感器,它以“事件”的形式异步捕捉像素级的强度变化。由于其传感机制,事件摄影机几乎没有运动模糊,具有非常高的时间分辨率,并且与传统的基于帧的摄影机相比,所需的电源和内存要少得多。这些特性使它们非常适合于一些现实世界的应用,例如可穿戴设备上的以自我为中心的动作识别,在这些应用中,快速的摄像机运动和有限的功率挑战了传统的视觉传感器。然而,不断增长的基于事件的视觉领域迄今为止忽视了事件摄像机在此类应用中的潜力。在本文中,我们证明了事件数据是一种非常有价值的自我中心行为识别模式。为此,我们引入了N-EPIC-Kitchens,这是大型EPIC Kitchens数据集的第一个基于事件的摄像头扩展。在这种情况下,我们提出了两种策略:(i)使用传统视频处理架构(E$^2$(GO))直接处理事件摄影机数据;(ii)使用事件数据提取光流信息(E$^2$(GO)MO)。在我们提出的基准上,我们表明,事件数据提供了与RGB和光流相当的性能,但在部署时没有任何额外的流量计算,并且相对于仅RGB的信息,性能提高了4%。 摘要:Event cameras are novel bio-inspired sensors, which asynchronously capture pixel-level intensity changes in the form of "events". Due to their sensing mechanism, event cameras have little to no motion blur, a very high temporal resolution and require significantly less power and memory than traditional frame-based cameras. These characteristics make them a perfect fit to several real-world applications such as egocentric action recognition on wearable devices, where fast camera motion and limited power challenge traditional vision sensors. However, the ever-growing field of event-based vision has, to date, overlooked the potential of event cameras in such applications. In this paper, we show that event data is a very valuable modality for egocentric action recognition. To do so, we introduce N-EPIC-Kitchens, the first event-based camera extension of the large-scale EPIC-Kitchens dataset. In this context, we propose two strategies: (i) directly processing event-camera data with traditional video-processing architectures (E$^2$(GO)) and (ii) using event-data to distill optical flow information (E$^2$(GO)MO). On our proposed benchmark, we show that event data provides a comparable performance to RGB and optical flow, yet without any additional flow computation at deploy time, and an improved performance of up to 4% with respect to RGB only information.
【3】 Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-supervised Action Recognition 标题:面向自监督动作识别的极限增强骨架序列对比学习 链接:https://arxiv.org/abs/2112.03590
作者:Tianyu Guo,Hong Liu,Zhan Chen,Mengyuan Liu,Tao Wang,Runwei Ding 机构: Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China, The School of Intelligent Systems Engineering, Sun Yat-sen University, China 备注:Accepted by AAAI2022 摘要:近年来,随着对比学习方法的发展,基于骨架的动作识别的自监督表示学习得到了发展。现有的对比学习方法使用正常的增强来构建相似的积极样本,这限制了探索新的运动模式的能力。为了更好地利用极限增强引入的运动模式,提出了一种基于丰富信息挖掘的自监督动作表示对比学习框架(AIMCRL)。首先,提出了极端增强和基于能量的注意引导下降模块(EADM)来获得不同的正样本,这带来了新的运动模式,提高了学习表征的普遍性。其次,由于直接使用极端增广可能无法提高性能,因为原始身份发生了剧烈变化,因此提出了双分布散度最小化损失(D$^3$M损失),以更温和的方式最小化分布散度。第三,提出了最近邻挖掘(NNM),进一步扩展正样本,使丰富信息挖掘过程更加合理。在NTU RGB D 60、PKU-MMD、NTU RGB D 120数据集上进行的详尽实验已经证实,我们的AimCLR可以在各种评估协议下,与最先进的方法相比,在观察到的更高质量的动作表现下,显著表现良好。我们的代码可在https://github.com/Levigty/AimCLR. 摘要:In recent years, self-supervised representation learning for skeleton-based action recognition has been developed with the advance of contrastive learning methods. The existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits the ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed. First, the extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to obtain diverse positive samples, which bring novel movement patterns to improve the universality of the learned representations. Second, since directly using extreme augmentations may not be able to boost the performance due to the drastic changes in original identity, the Dual Distributional Divergence Minimization Loss (D$^3$M Loss) is proposed to minimize the distribution divergence in a more gentle way. Third, the Nearest Neighbors Mining (NNM) is proposed to further expand positive samples to make the abundant information mining process more reasonable. Exhaustive experiments on NTU RGB D 60, PKU-MMD, NTU RGB D 120 datasets have verified that our AimCLR can significantly perform favorably against state-of-the-art methods under a variety of evaluation protocols with observed higher quality action representations. Our code is available at https://github.com/Levigty/AimCLR.
【4】 CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification 标题:CMA-CLIP:用于图文分类的跨模态注意力CLIP 链接:https://arxiv.org/abs/2112.03562
作者:Huidong Liu,Shaoyuan Xu,Jinmiao Fu,Yang Liu,Ning Xie,Chien-chih Wang,Bryan Wang,Yi Sun 机构:Stony Brook University, Stony Brook, NY, USA, Amazon Inc., Seattle, WA, USA 摘要:社交媒体和电子商务等现代网络系统包含以图像和文本表示的丰富内容。利用来自多种模式的信息可以提高机器学习任务(如分类和推荐)的性能。在本文中,我们提出了跨模态注意对比语言图像预训练(CMA-CLIP),这是一种新的框架,它将两种跨模态注意(顺序注意和模态注意)结合起来,以有效地融合图像和文本对中的信息。序列式注意使框架能够捕获图像块和文本标记之间的细粒度关系,而模态式注意则通过其与下游任务的相关性来衡量每个模态。此外,通过添加任务特定的模态注意和多层感知器,我们提出的框架能够执行多模态的多任务分类。我们在一个主要零售网站产品属性(MRWPA)数据集和两个公共数据集Food101和Fashion-Gen上进行了实验。结果表明,CMA-CLIP在多任务分类的MRWPA数据集上,在相同精度水平下,召回率平均比预训练和微调的CLIP高11.9%。它在精度上也比Fashion Gen数据集上的最新方法高出5.5%,并在Food101数据集上实现了具有竞争力的性能。通过详细的消融研究,我们进一步证明了跨模态注意模块的有效性以及我们的方法对图像和文本输入噪声的鲁棒性,这是实践中的一个常见挑战。 摘要:Modern Web systems such as social media and e-commerce contain rich contents expressed in images and text. Leveraging information from multi-modalities can improve the performance of machine learning tasks such as classification and recommendation. In this paper, we propose the Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new framework which unifies two types of cross-modality attentions, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs. The sequence-wise attention enables the framework to capture the fine-grained relationship between image patches and text tokens, while the modality-wise attention weighs each modality by its relevance to the downstream tasks. In addition, by adding task specific modality-wise attentions and multilayer perceptrons, our proposed framework is capable of performing multi-task classification with multi-modalities. We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen. The results show that CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9% in recall at the same level of precision on the MRWPA dataset for multi-task classification. It also surpasses the state-of-the-art method on Fashion-Gen Dataset by 5.5% in accuracy and achieves competitive performance on Food101 Dataset. Through detailed ablation studies, we further demonstrate the effectiveness of both cross-modality attention modules and our method's robustness against noise in image and text inputs, which is a common challenge in practice.
【5】 Label Hallucination for Few-Shot Classification 标题:用于小样本分类的标签幻觉 链接:https://arxiv.org/abs/2112.03340
作者:Yiren Jian,Lorenzo Torresani 机构:Dartmouth College 备注:Accepted by AAAI 2022. Code is available: this https URL 摘要:少数镜头分类需要调整从大型带注释的基础数据集中学习的知识,以识别新的看不见的类,每个类由少数标记的示例表示。在这种情况下,在大数据集上预训练具有高容量的网络,然后在少数示例上对其进行微调会导致严重的过度拟合。同时,在从大的标记数据集中学习到的“冻结”特征的基础上训练一个简单的线性分类器无法使模型适应新类的属性,从而有效地导致欠拟合。在本文中,我们提出了两种流行策略的替代方法。首先,我们的方法使用在新类上训练的线性分类器对整个大型数据集进行伪标记。这有效地“幻觉”了大型数据集中的新类,尽管新类不存在于基础数据库中(新类和基类是不相交的)。然后,除了新数据集上的标准交叉熵损失外,它还使用伪标记基示例上的蒸馏损失对整个模型进行微调。该步骤有效地训练网络识别上下文和外观线索,这些线索对新类别识别有用,但使用整个大规模基础数据集,从而克服了少数镜头学习固有的数据稀缺问题。尽管该方法简单,但我们表明,我们的方法在四个完善的少数镜头分类基准上优于最先进的方法。 摘要:Few-shot classification requires adapting knowledge learned from a large annotated base dataset to recognize novel unseen classes, each represented by few labeled examples. In such a scenario, pretraining a network with high capacity on the large dataset and then finetuning it on the few examples causes severe overfitting. At the same time, training a simple linear classifier on top of "frozen" features learned from the large labeled dataset fails to adapt the model to the properties of the novel classes, effectively inducing underfitting. In this paper we propose an alternative approach to both of these two popular strategies. First, our method pseudo-labels the entire large dataset using the linear classifier trained on the novel classes. This effectively "hallucinates" the novel classes in the large dataset, despite the novel categories not being present in the base database (novel and base classes are disjoint). Then, it finetunes the entire model with a distillation loss on the pseudo-labeled base examples, in addition to the standard cross-entropy loss on the novel dataset. This step effectively trains the network to recognize contextual and appearance cues that are useful for the novel-category recognition but using the entire large-scale base dataset and thus overcoming the inherent data-scarcity problem of few-shot learning. Despite the simplicity of the approach, we show that that our method outperforms the state-of-the-art on four well-established few-shot classification benchmarks.
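下面给出“用新类线性分类器给基类数据打软伪标签,再以交叉熵加蒸馏损失微调整个模型”这一流程核心损失的极简示意(Python/PyTorch,非原文实现;温度T与权重lam均为假设):
import torch.nn.functional as F

def hallucination_finetune_loss(model, novel_x, novel_y, base_x, base_soft, T=4.0, lam=0.5):
    # novel_x/novel_y: 新类的少量带标签样本;base_x: 基类数据;base_soft: 新类分类器给出的软伪标签(概率分布)
    ce = F.cross_entropy(model(novel_x), novel_y)
    log_p = F.log_softmax(model(base_x) / T, dim=-1)
    distill = F.kl_div(log_p, base_soft, reduction='batchmean') * (T ** 2)
    return ce + lam * distill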
【6】 Learning Connectivity with Graph Convolutional Networks for Skeleton-based Action Recognition 标题:利用图卷积网络学习连通性进行基于骨架的动作识别 链接:https://arxiv.org/abs/2112.03328
作者:Hichem Sahbi 机构:Sorbonne University, CNRS, LIP, F-, Paris, France 备注:arXiv admin note: text overlap with arXiv:2104.04255, arXiv:2104.05482 摘要:学习图卷积网络(GCN)是一个新兴的领域,旨在将卷积运算推广到任意非规则域。特别地,与谱域GCN相比,在空间域上运行的GCN表现出更优的性能,但其成功与否在很大程度上取决于输入图的拓扑结构是如何定义的。在本文中,我们为图卷积网络引入了一个学习图拓扑性质的新框架。我们方法的设计原则是优化一个带约束的目标函数,该函数不仅学习GCN中常用的卷积参数,还学习能够传递这些图中最相关拓扑关系的变换基。在具有挑战性的基于骨架的动作识别任务上进行的实验表明,与手工设计的图以及相关工作相比,该方法具有优越性。 摘要:Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing convolutional operations to arbitrary non-regular domains. In particular, GCNs operating on spatial domains show superior performances compared to spectral ones, however their success is highly dependent on how the topology of input graphs is defined. In this paper, we introduce a novel framework for graph convolutional networks that learns the topological properties of graphs. The design principle of our method is based on the optimization of a constrained objective function which learns not only the usual convolutional parameters in GCNs but also a transformation basis that conveys the most relevant topological relationships in these graphs. Experiments conducted on the challenging task of skeleton-based action recognition shows the superiority of the proposed method compared to handcrafted graph design as well as the related work.
【7】 Hard Sample Aware Noise Robust Learning for Histopathology Image Classification 标题:硬样本感知噪声鲁棒学习在组织病理学图像分类中的应用 链接:https://arxiv.org/abs/2112.03694
作者:Chuang Zhu,Wenkai Chen,Ting Peng,Ying Wang,Mulan Jin 机构: Peng are with the School of Arti-ficial Intelligence, Beijing University of Posts and Telecommunica-tions 备注:14 pages, 20figures, IEEE Transactions on Medical Imaging 摘要:基于深度学习的组织病理学图像分类是帮助医生提高癌症诊断准确性和及时性的关键技术。然而,在复杂的人工标注过程中,标签噪声往往是不可避免的,从而误导了分类模型的训练。在这项工作中,我们介绍了一种新的硬样本感知噪声鲁棒学习方法用于组织病理学图像分类。为了区分信息性硬样本和有害噪声样本,我们利用样本训练历史建立了易/硬/噪声(EHN)检测模型。然后,我们将EHN集成到一个自训练结构中,通过逐步标记校正来降低噪声率。利用获得的几乎干净的数据集,我们进一步提出了一种噪声抑制和硬增强(NSHE)方案来训练噪声鲁棒模型。与以前的工作相比,我们的方法可以节省更多的干净样本,并且可以直接应用于真实的有噪声数据集场景,而不需要使用干净的子集。实验结果表明,无论是在合成数据集还是在真实噪声数据集,该方法都优于目前最新的方法。源代码和数据可在https://github.com/bupt-ai-cz/HSA-NRL/. 摘要:Deep learning-based histopathology image classification is a key technique to help physicians in improving the accuracy and promptness of cancer diagnosis. However, the noisy labels are often inevitable in the complex manual annotation process, and thus mislead the training of the classification model. In this work, we introduce a novel hard sample aware noise robust learning method for histopathology image classification. To distinguish the informative hard samples from the harmful noisy ones, we build an easy/hard/noisy (EHN) detection model by using the sample training history. Then we integrate the EHN into a self-training architecture to lower the noise rate through gradually label correction. With the obtained almost clean dataset, we further propose a noise suppressing and hard enhancing (NSHE) scheme to train the noise robust model. Compared with the previous works, our method can save more clean samples and can be directly applied to the real-world noisy dataset scenario without using a clean subset. Experimental results demonstrate that the proposed scheme outperforms the current state-of-the-art methods in both the synthetic and real-world noisy datasets. The source code and data are available at https://github.com/bupt-ai-cz/HSA-NRL/.
【8】 RSBNet: One-Shot Neural Architecture Search for A Backbone Network in Remote Sensing Image Recognition 标题:RSBNet:遥感图像识别中骨干网络的一次性神经结构搜索 链接:https://arxiv.org/abs/2112.03456
作者:Cheng Peng,Yangyang Li,Ronghua Shang,Licheng Jiao 机构:Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of, Artificial Intelligence, Xidian University, Xi’an, China 摘要:近年来,大量基于深度学习的方法已成功应用于各种遥感图像(RSI)识别任务。然而,目前RSI领域的深度学习方法大多依赖于人工设计的主干网络提取的特征,由于RSI的复杂性和先验知识的局限性,这严重阻碍了深度学习模型的发展。在本文中,我们研究了一种新的RSI识别任务中主干结构的设计范式,包括场景分类、土地覆盖分类和目标检测。提出了一种新的基于权重共享策略和进化算法的一次性搜索框架RSBNet,该框架包括三个阶段:首先,基于集成单路径训练策略,在自组装的大规模RSI数据集上预训练在分层搜索空间中构造的超网。然后,通过可切换的识别模块为预先训练的超网配备不同的识别头,并分别对目标数据集进行微调,以获得特定于任务的超网。最后,在不需要任何网络训练的情况下,基于进化算法搜索不同识别任务的最佳主干结构。在五个基准数据集上对不同的识别任务进行了大量的实验,结果表明了所提出的搜索范式的有效性,并证明了所搜索的主干能够灵活地适应不同的RSI识别任务,并取得了令人印象深刻的性能。 摘要:Recently, a massive number of deep learning based approaches have been successfully applied to various remote sensing image (RSI) recognition tasks. However, most existing advances of deep learning methods in the RSI field heavily rely on the features extracted by the manually designed backbone network, which severely hinders the potential of deep learning models due the complexity of RSI and the limitation of prior knowledge. In this paper, we research a new design paradigm for the backbone architecture in RSI recognition tasks, including scene classification, land-cover classification and object detection. A novel one-shot architecture search framework based on weight-sharing strategy and evolutionary algorithm is proposed, called RSBNet, which consists of three stages: Firstly, a supernet constructed in a layer-wise search space is pretrained on a self-assembled large-scale RSI dataset based on an ensemble single-path training strategy. Next, the pre-trained supernet is equipped with different recognition heads through the switchable recognition module and respectively fine-tuned on the target dataset to obtain task-specific supernet. Finally, we search the optimal backbone architecture for different recognition tasks based on the evolutionary algorithm without any network training. Extensive experiments have been conducted on five benchmark datasets for different recognition tasks, the results show the effectiveness of the proposed search paradigm and demonstrate that the searched backbone is able to flexibly adapt different RSI recognition tasks and achieve impressive performance.
分割|语义相关(4篇)
【1】 A Contrastive Distillation Approach for Incremental Semantic Segmentation in Aerial Images 标题:一种用于航空图像增量式语义分割的对比蒸馏方法 链接:https://arxiv.org/abs/2112.03814
作者:Edoardo Arnaudo,Fabio Cermelli,Antonio Tavera,Claudio Rossi,Barbara Caputo 备注:12 pages, ICIAP 2021 摘要:增量学习是航空图像处理中的一项关键任务,特别是在大规模标注数据集有限的情况下。当前深层神经结构的一个主要问题是灾难性遗忘,即一旦提供了一组新的数据用于再训练,就无法忠实地维护过去的知识。多年来,人们提出了几种技术来缓解图像分类和目标检测中的这一问题。然而,直到最近,焦点才转向更复杂的下游任务,如实例或语义分割。从语义分割任务的增量类学习开始,我们的目标是将此策略应用于航空领域,利用其与自然图像不同的一个特殊特征,即方向。除了标准的知识提取方法外,我们还提出了一种对比正则化方法,将任何给定的输入与其增强版本(即翻转和旋转)进行比较,以最小化两个输入产生的分割特征之间的差异。我们在波茨坦数据集上展示了我们的解决方案的有效性,在每个测试中都优于增量基线。代码可从以下网址获得:https://github.com/edornd/contrastive-distillation. 摘要:Incremental learning represents a crucial task in aerial image processing, especially given the limited availability of large-scale annotated datasets. A major issue concerning current deep neural architectures is known as catastrophic forgetting, namely the inability to faithfully maintain past knowledge once a new set of data is provided for retraining. Over the years, several techniques have been proposed to mitigate this problem for image classification and object detection. However, only recently the focus has shifted towards more complex downstream tasks such as instance or semantic segmentation. Starting from incremental-class learning for semantic segmentation tasks, our goal is to adapt this strategy to the aerial domain, exploiting a peculiar feature that differentiates it from natural images, namely the orientation. In addition to the standard knowledge distillation approach, we propose a contrastive regularization, where any given input is compared with its augmented version (i.e. flipping and rotations) in order to minimize the difference between the segmentation features produced by both inputs. We show the effectiveness of our solution on the Potsdam dataset, outperforming the incremental baseline in every test. Code available at: https://github.com/edornd/contrastive-distillation.
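下面给出“同一输入与其翻转/旋转增强版本在分割特征上保持一致”这一对比正则项的极简示意(Python/PyTorch,非原文实现;此处仅演示水平翻转,并用均方误差度量差异,实际损失形式以论文为准):
import torch
import torch.nn.functional as F

def flip_consistency_loss(seg_model, x):
    # x: (B, C, H, W);seg_model 输出分割特征或逐像素logits
    feat = seg_model(x)
    feat_flip = seg_model(torch.flip(x, dims=[-1]))   # 对翻转后的输入再前向一次
    feat_flip = torch.flip(feat_flip, dims=[-1])      # 翻转回原坐标系后再比较
    return F.mse_loss(feat, feat_flip)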
【2】 Deep Level Set for Box-supervised Instance Segmentation in Aerial Images 标题:航空影像盒监督实例分割的深水平集方法 链接:https://arxiv.org/abs/2112.03451
作者:Wentong Li,Yijie Chen,Wenyu Liu,Jianke Zhu 机构:Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies 备注:10 pages, 5 figures 摘要:盒监督实例分割是近年来研究的热点,但在航空图像领域的研究却很少。与一般的目标集合相比,航空目标具有较大的类内方差和类间相似性,且背景复杂。此外,在高分辨率卫星图像中还存在许多微小物体。这使得最近的成对亲和建模方法不可避免地涉及到噪声监督,结果较差。为了解决这些问题,我们提出了一种新的航空实例分割方法,该方法驱动网络以端到端的方式学习一系列仅带有方框标注的航空对象的水平集函数。水平集方法采用精心设计的能量函数,不学习两两相似性,而是将对象分割视为曲线演化,能够准确地恢复对象的边界,防止不可分辨背景和相似对象的干扰。实验结果表明,该方法优于现有的盒监督实例分割方法。源代码可在https://github.com/LiWentomng/boxlevelset. 摘要:Box-supervised instance segmentation has recently attracted lots of research efforts while little attention is received in aerial image domain. In contrast to the general object collections, aerial objects have large intra-class variances and inter-class similarity with complex background. Moreover, there are many tiny objects in the high-resolution satellite images. This makes the recent pairwise affinity modeling method inevitably to involve the noisy supervision with the inferior results. To tackle these problems, we propose a novel aerial instance segmentation approach, which drives the network to learn a series of level set functions for the aerial objects with only box annotations in an end-to-end fashion. Instead of learning the pairwise affinity, the level set method with the carefully designed energy functions treats the object segmentation as curve evolution, which is able to accurately recover the object's boundaries and prevent the interference from the indistinguishable background and similar objects. The experimental results demonstrate that the proposed approach outperforms the state-of-the-art box-supervised instance segmentation methods. The source code is available at https://github.com/LiWentomng/boxlevelset.
【3】 Hybrid guiding: A multi-resolution refinement approach for semantic segmentation of gigapixel histopathological images 标题:混合引导:一种用于千兆像素组织病理图像语义分割的多分辨率细化方法 链接:https://arxiv.org/abs/2112.03455
作者:André Pedersen,Erik Smistad,Tor V. Rise,Vibeke G. Dale,Henrik S. Pettersen,Tor-Arne S. Nordmo,David Bouget,Ingerid Reinertsen,Marit Valla 机构:Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, NO-, Trondheim, Norway, Clinic of Surgery, St. Olavs Hospital, Trondheim University Hospital, NO-, Trondheim, Norway 备注:12 pages, 3 figures 摘要:组织病理学癌症诊断变得更加复杂,不断增加的活检数量对大多数病理实验室来说是一个挑战。因此,开发用于评估组织病理学癌症切片的自动方法将很有价值。在这项研究中,我们使用了来自挪威队列的624张乳腺癌全切片图像(WSI)。我们提出了一种称为H2G-Net的级联卷积神经网络设计,用于千兆像素组织病理学图像的语义分割。该设计包括使用分块方法的检测阶段和使用卷积自编码器的细化阶段。为了验证该设计,我们进行了消融研究,以评估流程中所选组件对肿瘤分割的影响。在分割组织病理学图像时,使用分层采样和深度热图细化来引导分割被证明是有益的。我们发现,使用细化网络对生成的肿瘤分割热图进行后处理能带来显著改进。总体最佳设计在90张WSI的独立测试集上取得了0.933的Dice分数。该设计优于单分辨率方法,例如使用MobileNetV2的聚类引导分块式高分辨率分类(0.872)和低分辨率U-Net(0.874)。此外,仅使用CPU,在典型的x400 WSI上进行分割大约需要58秒。这些发现证明了利用细化网络改进分块预测的潜力。该方案高效,且不需要重叠分块推理或模型集成。此外,我们还表明,深度神经网络可以使用同时在多个不同标签上进行平衡的随机采样方案进行训练,而无需在磁盘上存储图像块。未来的工作应包括更高效的分块生成和采样,以及改进的聚类。 摘要:Histopathological cancer diagnostics has become more complex, and the increasing number of biopsies is a challenge for most pathology laboratories. Thus, development of automatic methods for evaluation of histopathological cancer sections would be of value. In this study, we used 624 whole slide images (WSIs) of breast cancer from a Norwegian cohort. We propose a cascaded convolutional neural network design, called H2G-Net, for semantic segmentation of gigapixel histopathological images. The design involves a detection stage using a patch-wise method, and a refinement stage using a convolutional autoencoder. To validate the design, we conducted an ablation study to assess the impact of selected components in the pipeline on tumour segmentation. Guiding segmentation, using hierarchical sampling and deep heatmap refinement, proved to be beneficial when segmenting the histopathological images. We found a significant improvement when using a refinement network for postprocessing the generated tumour segmentation heatmaps. The overall best design achieved a Dice score of 0.933 on an independent test set of 90 WSIs. The design outperformed single-resolution approaches, such as cluster-guided, patch-wise high-resolution classification using MobileNetV2 (0.872) and a low-resolution U-Net (0.874). In addition, segmentation on a representative x400 WSI took ~58 seconds, using only the CPU. The findings demonstrate the potential of utilizing a refinement network to improve patch-wise predictions. The solution is efficient and does not require overlapping patch inference or ensembling. Furthermore, we showed that deep neural networks can be trained using a random sampling scheme that balances on multiple different labels simultaneously, without the need of storing patches on disk. Future work should involve more efficient patch generation and sampling, as well as improved clustering.
【4】 Quality control for more reliable integration of deep learning-based image segmentation into medical workflows 标题:将基于深度学习的图像分割更可靠地集成到医疗工作流中的质量控制 链接:https://arxiv.org/abs/2112.03277
作者:Elena Williams,Sebastian Niehaus,Janis Reinelt,Alberto Merola,Paul Glad Mihai,Ingo Roeder,Nico Scherf,Maria del C. Valdés Hernández 机构: – AICURA medical, Bessemerstrasse , Berlin, Germany., – Centre for Clinical Brain Sciences. University of Edinburgh., – Institute for Medical Informatics and Biometry, Technische Universität Dresden, Fetscherstrasse , Dresden, Germany. 备注:25 pages 摘要:机器学习算法是现代诊断辅助软件的基础,该软件在临床实践中,特别是在放射学中被证明是有价值的。然而,不准确主要是由于用于训练这些算法的临床样本有限,妨碍了它们在临床医生中的广泛适用性、接受度和认可度。我们分析了最先进的自动质量控制(QC)方法,这些方法可以在这些算法中实现,以估计其输出的确定性。我们在脑图像分割任务中验证了最有希望的方法,以识别磁共振成像数据中的白质高强度(WMH)。WMH是一种常见于成年中后期的小血管疾病,由于其大小和分布模式不同,对其进行分割尤其具有挑战性。我们的结果表明,不确定性聚合和骰子预测在该任务的故障检测中最有效。两种方法独立地将平均骰子从0.82提高到0.84。我们的工作揭示了QC方法如何帮助检测分割失败的案例,从而使自动分割更可靠,更适合临床实践。 摘要:Machine learning algorithms underpin modern diagnostic-aiding software, which has proved valuable in clinical practice, particularly in radiology. However, inaccuracies, mainly due to the limited availability of clinical samples for training these algorithms, hamper their wider applicability, acceptance, and recognition amongst clinicians. We present an analysis of state-of-the-art automatic quality control (QC) approaches that can be implemented within these algorithms to estimate the certainty of their outputs. We validated the most promising approaches on a brain image segmentation task identifying white matter hyperintensities (WMH) in magnetic resonance imaging data. WMH are a correlate of small vessel disease common in mid-to-late adulthood and are particularly challenging to segment due to their varied size, and distributional patterns. Our results show that the aggregation of uncertainty and Dice prediction were most effective in failure detection for this task. Both methods independently improved mean Dice from 0.82 to 0.84. Our work reveals how QC methods can help to detect failed segmentation cases and therefore make automatic segmentation more reliable and suitable for clinical practice.
Zero/Few Shot|迁移|域适配|自适应(5篇)
【1】 Domain Generalization via Progressive Layer-wise and Channel-wise Dropout 标题:基于渐进式逐层与逐通道Dropout的域泛化 链接:https://arxiv.org/abs/2112.03676
作者:Jintao Guo,Lei Qi,Yinghuan Shi,Yang Gao 机构: National Key Laboratory for Novel Software Technology, Nanjing University, National Institute of Healthcare Data Science, Nanjing University, Key Lab of Computer Network and Information Integration, Southeast University 摘要:域泛化旨在通过在多个可观测的源域上训练一个模型,使其无需进一步训练即可很好地泛化到任意未见的目标域。现有工作主要集中在学习域不变特征以提高泛化能力。然而,由于目标域在训练过程中不可用,以前的方法不可避免地会在源域上过拟合。为了解决这个问题,我们开发了一个有效的基于Dropout的框架来扩大模型的关注区域,从而有效缓解过拟合问题。特别地,与通常在固定层上进行Dropout的典型方案不同,我们首先随机选择一层,然后随机选择该层的通道进行Dropout。此外,我们利用渐进式方案在训练过程中逐步增大Dropout比率,从而逐渐提高模型训练的难度,增强模型的鲁棒性。此外,为了进一步缓解过拟合问题的影响,我们利用图像级和特征级的增强方案来构建一个强基线模型。我们在多个基准数据集上进行了大量实验,结果表明我们的方法优于最先进的方法。 摘要:By training a model on multiple observed source domains, domain generalization aims to generalize well to arbitrary unseen target domains without further training. Existing works mainly focus on learning domain-invariant features to improve the generalization ability. However, since target domain is not available during training, previous methods inevitably suffer from overfitting in source domains. To tackle this issue, we develop an effective dropout-based framework to enlarge the region of the model's attention, which can effectively mitigate the overfitting problem. Particularly, different from the typical dropout scheme, which normally conducts the dropout on the fixed layer, first, we randomly select one layer, and then we randomly select its channels to conduct dropout. Besides, we leverage the progressive scheme to add the ratio of the dropout during training, which can gradually boost the difficulty of training model to enhance the robustness of the model. Moreover, to further alleviate the impact of the overfitting issue, we leverage the augmentation schemes on image-level and feature-level to yield a strong baseline model. We conduct extensive experiments on multiple benchmark datasets, which show our method can outperform the state-of-the-art methods.
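下面给出“随机选层、再随机选通道做Dropout,并随训练进程逐步增大丢弃比例”的极简示意(Python/PyTorch,非原文实现;比例调度与最大丢弃率均为假设):
import random
import torch

def progressive_layer_channel_dropout(features, epoch, total_epochs, max_ratio=0.33):
    # features: 各层特征图组成的列表,每个元素形状为 (B, C, H, W)
    ratio = max_ratio * (epoch + 1) / total_epochs      # 渐进式提高丢弃比例
    idx = random.randrange(len(features))               # 随机选择一层
    feat = features[idx]
    B, C, _, _ = feat.shape
    keep = (torch.rand(B, C, 1, 1, device=feat.device) > ratio).float()
    features[idx] = feat * keep / (1.0 - ratio + 1e-6)  # 随机丢弃整条通道并做尺度补偿
    return features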
【2】 Parallel Discrete Convolutions on Adaptive Particle Representations of Images 标题:图像自适应粒子表示的并行离散卷积 链接:https://arxiv.org/abs/2112.03592
作者:Joel Jonsson,Bevan L. Cheeseman,Suryanarayana Maddu,Krzysztof Gonciarz,Ivo F. Sbalzarini 机构: Technische Universit¨at Dresden, Dresden, Germany, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany, Center for Systems Biology Dresden, Dresden, Germany 备注:18 pages, 13 figures 摘要:我们提出了并行计算机体系结构上图像自适应粒子表示(APR)上离散卷积算子本机实现的数据结构和算法。APR是一种内容自适应图像表示法,可根据图像信号局部调整采样分辨率。它已被开发为一种替代像素表示的大型稀疏图像,因为它们通常出现在荧光显微镜中。已经证明,它可以减少存储、可视化和处理此类图像的内存和运行时成本。然而,这要求图像处理在APR上本机运行,而不需要中间恢复到像素。然而,设计高效且可扩展的APR本机图像处理原语由于APR不规则的内存结构而变得复杂。在这里,我们提供了高效和本机处理APR图像所需的算法构建块,使用了可以用离散卷积表示的各种算法。我们表明,APR卷积自然会导致在多核CPU和GPU架构上高效并行的规模自适应算法。与基于像素的算法和均匀采样数据上的卷积相比,我们量化了加速比。我们在单个Nvidia GeForce RTX 2080游戏GPU上实现了高达1 TB/s的像素等效吞吐量,与基于像素的实现相比,所需内存最多可减少两个数量级。 摘要:We present data structures and algorithms for native implementations of discrete convolution operators over Adaptive Particle Representations (APR) of images on parallel computer architectures. The APR is a content-adaptive image representation that locally adapts the sampling resolution to the image signal. It has been developed as an alternative to pixel representations for large, sparse images as they typically occur in fluorescence microscopy. It has been shown to reduce the memory and runtime costs of storing, visualizing, and processing such images. This, however, requires that image processing natively operates on APRs, without intermediately reverting to pixels. Designing efficient and scalable APR-native image processing primitives, however, is complicated by the APR's irregular memory structure. Here, we provide the algorithmic building blocks required to efficiently and natively process APR images using a wide range of algorithms that can be formulated in terms of discrete convolutions. We show that APR convolution naturally leads to scale-adaptive algorithms that efficiently parallelize on multi-core CPU and GPU architectures. We quantify the speedups in comparison to pixel-based algorithms and convolutions on evenly sampled data. We achieve pixel-equivalent throughputs of up to 1 TB/s on a single Nvidia GeForce RTX 2080 gaming GPU, requiring up to two orders of magnitude less memory than a pixel-based implementation.
【3】 Learning Instance and Task-Aware Dynamic Kernels for Few Shot Learning 标题:面向小样本学习的实例与任务感知动态卷积核学习 链接:https://arxiv.org/abs/2112.03494
作者:Rongkai Ma,Pengfei Fang,Gil Avraham,Yan Zuo,Tom Drummond,Mehrtash Harandi 机构:Monash University,Australian National University,CSIRO,The University of Melbourne 摘要:在实际应用中,学习和推广具有少量样本的新概念(少量快照学习)仍然是一个基本挑战。实现Few-Shot学习的一个主要方法是实现一个能够快速适应给定任务上下文的模型。动态网络已被证明能够有效地学习内容自适应参数,使其适合于少量镜头学习。在本文中,我们建议学习卷积网络的动态核作为手头任务的函数,从而实现更快的泛化。为此,我们基于整个任务和每个样本获得了动态内核,并开发了一种机制,进一步独立地调节每个通道和位置。这导致动态内核同时关注全局信息,同时也考虑可用的微小细节。我们的经验表明,我们的模型提高了在少数镜头分类和检测任务上的性能,与一些基线模型相比取得了明显的改进。这包括4个少数镜头分类基准的最新结果:mini ImageNet、分层ImageNet、CUB和FC100,以及少数镜头检测数据集MS COCO-PASCAL-VOC的竞争结果。 摘要:Learning and generalizing to novel concepts with few samples (Few-Shot Learning) is still an essential challenge to real-world applications. A principle way of achieving few-shot learning is to realize a model that can rapidly adapt to the context of a given task. Dynamic networks have been shown capable of learning content-adaptive parameters efficiently, making them suitable for few-shot learning. In this paper, we propose to learn the dynamic kernels of a convolution network as a function of the task at hand, enabling faster generalization. To this end, we obtain our dynamic kernels based on the entire task and each sample and develop a mechanism further conditioning on each individual channel and position independently. This results in dynamic kernels that simultaneously attend to the global information whilst also considering minuscule details available. We empirically show that our model improves performance on few-shot classification and detection tasks, achieving a tangible improvement over several baseline models. This includes state-of-the-art results on 4 few-shot classification benchmarks: mini-ImageNet, tiered-ImageNet, CUB and FC100 and competitive results on a few-shot detection dataset: MS COCO-PASCAL-VOC.
【4】 Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score Matching 标题:基于Tweedie分布和分数匹配的噪声分布自适应自监督图像去噪 链接:https://arxiv.org/abs/2112.03696
作者:Kwanyoung Kim,Taesung Kwon,Jong Chul Ye 机构: Department of Bio and Brain Engineering, Kim Jaechul Graduate School of AI, Deptartment of Mathematical Sciences, Korea Advanced Institute of Science and Technology (KAIST) 摘要:Tweedie分布是指数色散模型的一种特例,在经典统计学中常用作广义线性模型的分布。在这里,我们揭示了Tweedie分布在现代深度学习时代也起着关键作用,导致了一个独立于分布的自监督图像去噪公式,而没有干净的参考图像。具体地说,通过结合最近的Noise2Score自监督图像去噪方法和Tweedie分布的鞍点近似,我们可以提供一个通用的封闭形式去噪公式,该公式可用于大类噪声分布,而不需要知道潜在的噪声分布。与原始Noise2Score相似,新方法由两个连续步骤组成:使用扰动噪声图像进行分数匹配,然后通过分布无关的Tweedie公式得到封闭形式的图像去噪公式。这也提出了一个系统的算法来估计噪声模型和噪声参数为给定的噪声图像数据集。通过大量实验,我们证明了该方法能够准确估计噪声模型和参数,并在基准数据集和真实数据集上提供了最先进的自监督图像去噪性能。 摘要:Tweedie distributions are a special case of exponential dispersion models, which are often used in classical statistics as distributions for generalized linear models. Here, we reveal that Tweedie distributions also play key roles in modern deep learning era, leading to a distribution independent self-supervised image denoising formula without clean reference images. Specifically, by combining with the recent Noise2Score self-supervised image denoising approach and the saddle point approximation of Tweedie distribution, we can provide a general closed-form denoising formula that can be used for large classes of noise distributions without ever knowing the underlying noise distribution. Similar to the original Noise2Score, the new approach is composed of two successive steps: score matching using perturbed noisy images, followed by a closed form image denoising formula via distribution-independent Tweedie's formula. This also suggests a systematic algorithm to estimate the noise model and noise parameters for a given noisy image data set. Through extensive experiments, we demonstrate that the proposed method can accurately estimate noise models and parameters, and provide the state-of-the-art self-supervised image denoising performance in the benchmark dataset and real-world dataset.
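作为参考,Tweedie公式在加性高斯噪声这一特例下给出的后验均值估计为 $\hat{x} = y + \sigma^2 \nabla_y \log p(y)$,其中分数 $\nabla_y \log p(y)$ 可由分数匹配网络估计;下面是对应的极简示意(Python/PyTorch,仅为该类方法最常见的特例,并非论文中分布无关的通用公式):
import torch

def tweedie_denoise_gaussian(y, score_net, sigma):
    # y: 含噪图像 (B, C, H, W);score_net(y) ≈ ∇_y log p(y),由分数匹配训练得到
    with torch.no_grad():
        return y + (sigma ** 2) * score_net(y)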
【5】 Learning Pixel-Adaptive Weights for Portrait Photo Retouching 标题:用于人像照片润色的像素自适应权值学习 链接:https://arxiv.org/abs/2112.03536
作者:Binglu Wang,Chengzhe Lu,Dawei Yan,Yongqiang Zhao 机构: Equal Contribution 备注:Technical report 摘要:人像照片润饰是一项强调人物区域优先和组级一致性的照片润饰任务。基于查找表的方法通过学习图像自适应权重来组合三维查找表(3D LUT)并进行逐像素的颜色变换,从而获得了良好的润饰性能。但是,当人像像素和背景像素具有相同的原始RGB值时,该范式会忽略局部上下文线索,并对人像像素和背景像素应用相同的变换。相比之下,专家通常会执行不同的操作来调整人像区域和背景区域的色温和色调。这启发我们对局部上下文线索进行显式建模,以提高润饰质量。首先,我们考虑一个图像块,并预测像素自适应的查找表权重,以精确地润饰中心像素。其次,由于相邻像素与中心像素具有不同的亲和度,我们估计一个局部注意力掩码来调节相邻像素的影响。第三,局部注意力掩码的质量可以通过施加监督进一步提高,该监督基于由真值人像掩码计算得到的亲和图。对于组级一致性,我们建议直接约束Lab颜色空间中平均颜色分量的方差。在PPR10K数据集上的大量实验验证了我们方法的有效性,例如,在高分辨率照片上,PSNR指标获得了超过0.5的增益,而组级一致性指标至少降低了2.1。 摘要:Portrait photo retouching is a photo retouching task that emphasizes human-region priority and group-level consistency. The lookup table-based method achieves promising retouching performance by learning image-adaptive weights to combine 3-dimensional lookup tables (3D LUTs) and conducting pixel-to-pixel color transformation. However, this paradigm ignores the local context cues and applies the same transformation to portrait pixels and background pixels when they exhibit the same raw RGB values. In contrast, an expert usually conducts different operations to adjust the color temperatures and tones of portrait regions and background regions. This inspires us to model local context cues to improve the retouching quality explicitly. Firstly, we consider an image patch and predict pixel-adaptive lookup table weights to precisely retouch the center pixel. Secondly, as neighboring pixels exhibit different affinities to the center pixel, we estimate a local attention mask to modulate the influence of neighboring pixels. Thirdly, the quality of the local attention mask can be further improved by applying supervision, which is based on the affinity map calculated by the groundtruth portrait mask. As for group-level consistency, we propose to directly constrain the variance of mean color components in the Lab space. Extensive experiments on PPR10K dataset verify the effectiveness of our method, e.g. on high-resolution photos, the PSNR metric receives over 0.5 gains while the group-level consistency metric obtains at least 2.1 decreases.
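下面给出“为每个像素预测权重以组合多张3D LUT的输出”的极简示意(Python/PyTorch,非原文实现;此处假设已得到各LUT作用后的候选结果,三线性插值查表过程从略):
import torch

def pixel_adaptive_lut_blend(lut_outputs, pixel_weights):
    # lut_outputs: (B, N, 3, H, W),N张3D LUT分别作用于输入后的候选结果
    # pixel_weights: (B, N, H, W),逐像素预测的组合权重
    w = torch.softmax(pixel_weights, dim=1).unsqueeze(2)  # (B, N, 1, H, W)
    return (lut_outputs * w).sum(dim=1)                   # 逐像素加权融合 -> (B, 3, H, W)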
半弱无监督|主动学习|不确定性(7篇)
【1】 STC-mix: Space, Time, Channel mixing for Self-supervised Video Representation 标题:STC-mix:用于自监督视频表示的空间、时间、通道混合 链接:https://arxiv.org/abs/2112.03906
作者:Srijan Das,Michael S. Ryoo 机构:Stony Brook University 备注:12 pages, codes and model links will be updated soon 摘要:视频对比表征学习在很大程度上依赖于数百万未标记视频的可用性。这对于网络上的视频来说是可行的,但是为现实世界的应用程序获取如此大规模的视频是非常昂贵和费力的。因此,在本文中,我们重点设计用于自监督学习的视频增强,我们首先分析混合视频的最佳策略,以创建新的增强视频样本。那么,问题仍然是,我们能否利用视频中的其他模式进行数据混合?为此,我们提出了跨模态流形切割混合(CMMC),将一个视频拼接插入到特征空间中的另一个视频拼接中,跨越两个不同的模态。我们发现,我们的视频混合策略STC mix,即视频的初步混合,然后是视频中不同模式的CMMC,提高了学习视频表示的质量。我们在两个小规模视频数据集UCF101和HMDB51上对两个下游任务进行了彻底的实验:动作识别和视频检索。我们还展示了我们的STC组合在领域知识有限的NTU数据集上的有效性。我们表明,在两个下游任务的STC混合的性能等同于其他自我监督的方法,同时需要较少的训练数据。 摘要:Contrastive representation learning of videos highly relies on the availability of millions of unlabelled videos. This is practical for videos available on web but acquiring such large scale of videos for real-world applications is very expensive and laborious. Therefore, in this paper we focus on designing video augmentation for self-supervised learning, we first analyze the best strategy to mix videos to create a new augmented video sample. Then, the question remains, can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy STC-mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video representations. We conduct thorough experiments for two downstream tasks: action recognition and video retrieval on two small scale video datasets UCF101, and HMDB51. We also demonstrate the effectiveness of our STC-mix on NTU dataset where domain knowledge is limited. We show that the performance of our STC-mix on both the downstream tasks is on par with the other self-supervised approaches while requiring less training data.
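下面给出在时空特征体上做Cutmix式裁剪-粘贴混合的极简示意(Python/PyTorch,非原文实现;混合比例的取值范围为假设,跨模态特征的具体对齐方式以论文为准):
import random
import torch

def feature_cutmix(feat_a, feat_b):
    # feat_a/feat_b: 两个样本(或两种模态)的时空特征 (B, C, T, H, W)
    B, C, T, H, W = feat_a.shape
    lam = random.uniform(0.3, 0.7)
    t, h, w = int(T * lam), int(H * lam), int(W * lam)
    t0, h0, w0 = random.randint(0, T - t), random.randint(0, H - h), random.randint(0, W - w)
    mixed = feat_a.clone()
    mixed[:, :, t0:t0 + t, h0:h0 + h, w0:w0 + w] = feat_b[:, :, t0:t0 + t, h0:h0 + h, w0:w0 + w]
    # 实际混合比可按被替换体素的占比重新计算,用于加权的对比学习目标
    return mixed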
【2】 ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints 标题:ViewCLR:学习不可见视点的自监督视频表示 链接:https://arxiv.org/abs/2112.03905
作者:Srijan Das,Michael S. Ryoo
机构:Stony Brook University