cs.CV: 70 papers in total today
Transformer (5 papers)
【1】 Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer
Link: https://arxiv.org/abs/2112.01838
Authors: Frederic Z. Zhang, Dylan Campbell, Stephen Gould
Note: 14 pages, 14 figures and 5 tables
Abstract: Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialise, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.
【2】 LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences
Link: https://arxiv.org/abs/2112.01697
Authors: Ziwang Fu, Feng Liu, Hanyang Wang, Siyuan Shen, Jiahao Zhang, Jiayin Qi, Xiangling Fu, Aimin Zhou
Note: 9 pages, 2 figures, 5 tables
Abstract: Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse language, visual, and audio modalities. However, those approaches introduce information redundancy when fusing features and are inefficient without considering the complementarity of modalities. In this paper, we propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction for the three modalities respectively to obtain the local structure of the sequences. Then, we design a novel transformer with cross-modal blocks (CB-Transformer) that enables complementary learning of different modalities, mainly divided into local temporal learning, cross-modal feature fusion and global self-attention representations. In addition, we splice the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets, IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results show the superiority and efficiency of our proposed method in both settings. Compared with the mainstream methods, our approach reaches the state of the art with a minimal number of parameters.
【3】 Make A Long Image Short: Adaptive Token Length for Vision Transformers
Link: https://arxiv.org/abs/2112.01686
Authors: Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang
Note: 10 pages, technical report
Abstract: The vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance but considerably increased computational cost. Motivated by the proverb "A picture is worth a thousand words," we aim to accelerate the ViT model by making a long image short. To this end, we propose a novel approach to assign token length adaptively during inference. Specifically, we first train a ViT model, called Resizable-ViT (ReViT), that can process any given input with diverse token lengths. Then, we retrieve the "token-length label" from ReViT and use it to train a lightweight Token-Length Assigner (TLA). The token-length label is the smallest number of tokens with which the ReViT can still make the correct prediction for an image, and the TLA is learned to allocate the optimal token length based on these labels. The TLA enables the ReViT to process each image with the minimum sufficient number of tokens during inference. Thus, the inference speed is boosted by reducing the token numbers in the ViT model. Our approach is general, compatible with modern vision transformer architectures, and can significantly reduce computational expense. We verified the effectiveness of our method on multiple representative ViT models (DeiT, LV-ViT, and TimeSformer) across two tasks (image classification and action recognition).
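To make the token-length-label idea concrete, here is a minimal sketch of how such labels could be extracted, assuming a ReViT-style model that accepts inputs at several resolutions (each yielding a different number of patch tokens); all names and the candidate sizes are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

# Fewer pixels -> fewer patch tokens (assuming a fixed 16x16 patch size).
CANDIDATE_SIZES = [112, 160, 224]

@torch.no_grad()
def token_length_label(revit, image, label):
    """Return the index of the smallest token budget that still yields a
    correct prediction, falling back to the largest budget otherwise."""
    for idx, size in enumerate(CANDIDATE_SIZES):
        resized = F.interpolate(image.unsqueeze(0), size=(size, size),
                                mode='bilinear', align_corners=False)
        if revit(resized).argmax(dim=1).item() == label:
            return idx
    return len(CANDIDATE_SIZES) - 1

# A lightweight Token-Length Assigner (TLA) would then be trained to predict
# this index from the image, so inference can run at the cheapest budget.
```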
【4】 TransZero: Attribute-guided Transformer for Zero-Shot Learning
Link: https://arxiv.org/abs/2112.01683
Authors: Shiming Chen, Ziming Hong, Yang Liu, Guo-Sen Xie, Baigui Sun, Hao Li, Qinmu Peng, Ke Lu, Xinge You
Note: Accepted to AAAI'22
Abstract: Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is learned from attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant visual-semantic interaction. Although some attention-based models have attempted to learn such region features in a single image, the transferability and discriminative attribute localization of visual features are typically neglected. In this paper, we propose an attribute-guided Transformer network, termed TransZero, to refine visual features and learn attribute localization for discriminative visual embedding representations in ZSL. Specifically, TransZero takes a feature augmentation encoder to alleviate the cross-dataset bias between ImageNet and ZSL benchmarks, and improves the transferability of visual features by reducing the entangled relative geometry relationships among region features. To learn locality-augmented visual features, TransZero employs a visual-semantic decoder to localize the image regions most relevant to each attribute in a given image, under the guidance of semantic attribute information. Then, the locality-augmented visual features and semantic vectors are used to conduct effective visual-semantic interaction in a visual-semantic embedding network. Extensive experiments show that TransZero achieves the new state of the art on three ZSL benchmarks. The code is available at: https://github.com/shiming-chen/TransZero.
【5】 MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification
Link: https://arxiv.org/abs/2112.01767
Authors: Jingye Chen, Jieneng Chen, Zongwei Zhou, Bin Li, Alan Yuille, Yongyi Lu
Note: A technical report. Code will be released
Abstract: Recent advances in automated skin cancer diagnosis have yielded performance on par with board-certified dermatologists. However, these approaches formulated skin cancer diagnosis as a simple classification task, dismissing the potential benefit from lesion segmentation. We argue that an accurate lesion segmentation can supplement the classification task with additive lesion information, such as asymmetry, border, intensity, and physical size; in turn, a faithful lesion classification can support the segmentation task with discriminant lesion features. To this end, this paper proposes a new multi-task framework, named MT-TransUNet, which is capable of segmenting and classifying skin lesions collaboratively by mediating multi-task tokens in Transformers. Furthermore, we have introduced dual-task and attended region consistency losses to take advantage of those images without pixel-level annotation, ensuring the model's robustness when it encounters the same image with an account of augmentation. Our MT-TransUNet exceeds the previous state of the art for lesion segmentation and classification tasks on ISIC-2017 and PH2; more importantly, it preserves compelling computational efficiency regarding model parameters (48M vs. 130M) and inference speed (0.17s vs. 2.02s per image). Code will be available at https://github.com/JingyeChen/MT-TransUNet.
Detection (7 papers)
【1】 Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images
Link: https://arxiv.org/abs/2112.01932
Authors: Gongyang Li, Zhi Liu, Weisi Lin, Haibin Ling
Note: 12 pages, 7 figures. Accepted by IEEE Transactions on Geoscience and Remote Sensing 2021
Abstract: In the computer vision community, great progress has been made in salient object detection from natural scene images (NSI-SOD); by contrast, salient object detection in optical remote sensing images (RSI-SOD) remains a challenging emerging topic. The unique characteristics of optical RSIs, such as scales, illuminations and imaging orientations, bring significant differences between NSI-SOD and RSI-SOD. In this paper, we propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple types of content for RSI-SOD. Specifically, MCCNet is based on the general encoder-decoder architecture and contains a novel key component named the Multi-Content Complementation Module (MCCM), which bridges the encoder and the decoder. In MCCM, we consider multiple types of features that are critical to RSI-SOD, including foreground features, edge features, background features, and global image-level features, and exploit the content complementarity between them to highlight salient regions over various scales in RSI features through an attention mechanism. Besides, we comprehensively introduce pixel-level, map-level and metric-aware losses in the training phase. Extensive experiments on two popular datasets demonstrate that the proposed MCCNet outperforms 23 state-of-the-art methods, including both NSI-SOD and RSI-SOD methods. The code and results of our method are available at https://github.com/MathLee/MCCNet.
【2】 SGM3D: Stereo Guided Monocular 3D Object Detection
Link: https://arxiv.org/abs/2112.01914
Authors: Zheyuan Zhou, Liang Du, Xiaoqing Ye, Zhikang Zou, Xiao Tan, Errui Ding, Li Zhang, Xiangyang Xue, Jianfeng Feng
Note: 11 pages, 5 figures
Abstract: Monocular 3D object detection is a critical yet challenging task for autonomous driving, due to the lack of accurate depth information captured by LiDAR sensors. In this paper, we propose a stereo-guided monocular 3D object detection network, termed SGM3D, which leverages robust 3D features extracted from stereo images to enhance the features learned from the monocular image. We innovatively investigate a multi-granularity domain adaptation module (MG-DA) to exploit the network's ability to generate stereo-mimic features based only on monocular cues. Both the coarse BEV feature-level and the fine anchor-level domain adaptation are leveraged to guide the monocular branch. We present an IoU matching-based alignment module (IoU-MA) for object-level domain adaptation between the stereo and monocular predictions to alleviate the mismatches of previous stages. We conduct extensive experiments on the most challenging KITTI and Lyft datasets and achieve new state-of-the-art performance. Furthermore, our method can be integrated into many other monocular approaches to boost their performance without introducing any extra computational cost.
【3】 The Box Size Confidence Bias Harms Your Object Detector
Link: https://arxiv.org/abs/2112.01901
Authors: Johannes Gilg, Torben Teepe, Fabian Herzog, Gerhard Rigoll
Abstract: Countless applications depend on accurate predictions with reliable confidence estimates from modern object detectors. It is well known, however, that neural networks, including object detectors, produce miscalibrated confidence estimates. Recent work even suggests that detectors' confidence predictions are biased with respect to object size and position, but it is still unclear how this bias relates to the performance of the affected object detectors. We formally prove that the conditional confidence bias is harming the expected performance of object detectors and empirically validate these findings. Specifically, we demonstrate how to modify histogram binning calibration to not only avoid performance impairment but also improve performance through conditional confidence calibration. We further find that the confidence bias is also present in detections generated on the training data of the detector, which we leverage to perform our de-biasing without using additional data. Moreover, Test Time Augmentation magnifies this bias, which results in even larger performance gains from our calibration method. Finally, we validate our findings on a diverse set of object detection architectures and show improvements of up to 0.6 mAP and 0.8 mAP50 without extra data or training.
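As an illustration of the calibration technique the abstract names, the following is a minimal histogram-binning sketch conditioned on box size; the bin counts and the definition of a correct detection (e.g., IoU >= 0.5 with a ground-truth box) are assumptions, not the authors' exact protocol:

```python
import numpy as np

def fit_conditional_binning(conf, box_area, correct, n_size_bins=3, n_conf_bins=15):
    """Group detections by box size; within each group, map each confidence
    bin to its empirical precision."""
    size_edges = np.quantile(box_area, np.linspace(0, 1, n_size_bins + 1))
    conf_edges = np.linspace(0, 1, n_conf_bins + 1)
    table = np.zeros((n_size_bins, n_conf_bins))
    for s in range(n_size_bins):
        in_size = (box_area >= size_edges[s]) & (box_area <= size_edges[s + 1])
        for c in range(n_conf_bins):
            in_bin = in_size & (conf >= conf_edges[c]) & (conf < conf_edges[c + 1])
            if in_bin.any():
                table[s, c] = correct[in_bin].mean()  # empirical precision
    return size_edges, conf_edges, table

def calibrate(conf, box_area, size_edges, conf_edges, table):
    """Replace raw confidences by the calibrated value of their (size, conf) bin."""
    s = np.clip(np.searchsorted(size_edges, box_area) - 1, 0, table.shape[0] - 1)
    c = np.clip(np.searchsorted(conf_edges, conf) - 1, 0, table.shape[1] - 1)
    return table[s, c]
```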
【4】 AirDet: Few-Shot Detection without Fine-tuning for Autonomous Exploration
Link: https://arxiv.org/abs/2112.01740
Authors: Bowen Li, Chen Wang, Pranay Reddy, Seungchan Kim, Sebastian Scherer
Note: 10 pages, 8 figures
Abstract: Few-shot object detection has rapidly progressed owing to the success of meta-learning strategies. However, the fine-tuning stage required by existing methods is time-consuming and significantly hinders their usage in real-time applications such as autonomous exploration with low-power robots. To solve this problem, we present a brand new architecture, AirDet, which is free of fine-tuning by learning class-agnostic relations with support images. Specifically, we propose a support-guided cross-scale (SCS) feature fusion network to generate object proposals, a global-local relation network (GLR) for shots aggregation, and a relation-based prototype embedding network (R-PEN) for precise localization. Exhaustive experiments are conducted on the COCO and PASCAL VOC datasets, where, surprisingly, AirDet achieves comparable or even better results than the exhaustively fine-tuned methods, reaching up to 40-60% improvements over the baseline. Notably, AirDet obtains favorable performance on multi-scale objects, especially small ones. Furthermore, we present evaluation results on real-world exploration tests from the DARPA Subterranean Challenge, which strongly validate the feasibility of AirDet in robotics. The source code, pre-trained models, and the real-world exploration data will be made public.
【5】 MFNet: Multi-filter Directive Network for Weakly Supervised Salient Object Detection
Link: https://arxiv.org/abs/2112.01732
Authors: Yongri Piao, Jian Wang, Miao Zhang, Huchuan Lu
Note: Accepted by ICCV 2021
Abstract: Weakly supervised salient object detection (WSOD) aims to train a CNN-based saliency network using only low-cost annotations. Existing WSOD methods employ various techniques to pursue a single "high-quality" pseudo label from low-cost annotations and then develop their saliency networks. Though these methods have achieved good performance, the generated single label is inevitably affected by the adopted refinement algorithm and shows prejudiced characteristics, which further influence the saliency networks. In this work, we introduce a new multiple-pseudo-label framework to integrate more comprehensive and accurate saliency cues from multiple labels, avoiding the aforementioned problem. Specifically, we propose a multi-filter directive network (MFNet) comprising a saliency network as well as multiple directive filters. The directive filter (DF) is designed to extract and filter more accurate saliency cues from the noisy pseudo labels. The multiple accurate cues from multiple DFs are then simultaneously propagated to the saliency network with a multi-guidance loss. Extensive experiments on five datasets over four metrics demonstrate that our method outperforms all existing congeneric methods. Moreover, our framework is flexible enough to be applied to existing methods and improve their performance.
【6】 Adversarial Attacks against a Satellite-borne Multispectral Cloud Detector
Link: https://arxiv.org/abs/2112.01723
Authors: Andrew Du, Yee Wei Law, Michele Sasdelli, Bo Chen, Ken Clarke, Michael Brown, Tat-Jun Chin
Abstract: Data collected by Earth-observing (EO) satellites are often afflicted by cloud cover. Detecting the presence of clouds, which is increasingly done using deep learning, is a crucial preprocessing step in EO applications. In fact, advanced EO satellites perform deep learning-based cloud detection on board and downlink only clear-sky data to save precious bandwidth. In this paper, we highlight the vulnerability of deep learning-based cloud detection to adversarial attacks. By optimising an adversarial pattern and superimposing it onto a cloudless scene, we bias the neural network into detecting clouds in the scene. Since the input spectra of cloud detectors include non-visible bands, we generated our attacks in the multispectral domain. This opens up the potential of multi-objective attacks, specifically, adversarial biasing in the cloud-sensitive bands and visual camouflage in the visible bands. We also investigated mitigation strategies against the adversarial attacks. We hope our work further builds awareness of the potential of adversarial attacks in the EO community.
【7】 Detection of Large Vessel Occlusions using Deep Learning by Deforming Vessel Tree Segmentations
Link: https://arxiv.org/abs/2112.01797
Authors: Florian Thamm, Oliver Taubmann, Markus Jürgens, Hendrik Ditt, Andreas Maier
Note: 6 pages, preprint
Abstract: Computed tomography angiography is a key modality providing insights into the cerebrovascular vessel tree that are crucial for the diagnosis and treatment of ischemic strokes, in particular in cases of large vessel occlusions (LVO). Thus, the clinical workflow greatly benefits from an automated detection of patients suffering from LVOs. This work uses convolutional neural networks for case-level classification, trained with elastic deformation of the vessel tree segmentation masks to artificially augment the training data. Using only masks as the input to our model uniquely allows us to apply such deformations much more aggressively than one could with conventional image volumes while retaining sample realism. The neural network classifies the presence of an LVO and the affected hemisphere. In a 5-fold cross-validated ablation study, we demonstrate that the use of the suggested augmentation enables us to train robust models even from few data sets. Training the EfficientNetB1 architecture on 100 data sets, the proposed augmentation scheme was able to raise the ROC AUC to 0.85 from a baseline value of 0.57 with no augmentation. The best performance was achieved using a 3D-DenseNet, yielding an AUC of 0.88. The augmentation also had a positive impact on the classification of the affected hemisphere, where the 3D-DenseNet reached an AUC of 0.93 on both sides.
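For readers unfamiliar with the augmentation, a generic elastic-deformation sketch for segmentation masks, in the spirit of Simard et al., could look like the following; the alpha/sigma values are placeholder assumptions, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform_mask(mask, alpha=300.0, sigma=12.0, rng=None):
    """Warp a (2D or 3D) label mask with a smooth random displacement field.
    order=0 (nearest neighbour) keeps the output a valid label mask, which is
    what permits aggressive deformation without intensity artifacts."""
    rng = rng or np.random.default_rng()
    shape = mask.shape
    # One Gaussian-smoothed random displacement field per axis.
    displacements = [
        gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
        for _ in shape
    ]
    grid = np.meshgrid(*[np.arange(s) for s in shape], indexing='ij')
    coords = [g + d for g, d in zip(grid, displacements)]
    return map_coordinates(mask, coords, order=0, mode='nearest')
```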
Classification | Recognition (6 papers)
【1】 Cross-modal Knowledge Distillation for Vision-to-Sensor Action Recognition
Link: https://arxiv.org/abs/2112.01849
Authors: Jianyuan Ni, Raunak Sarbajna, Yang Liu, Anne H. H. Ngu, Yan Yan
Note: 5 pages, 2 figures, submitted to ICASSP 2022
Abstract: Human activity recognition (HAR) based on multi-modal approaches has recently been shown to improve the accuracy of HAR. However, the restricted computational resources of wearable devices such as smartwatches fail to directly support such advanced methods. To tackle this issue, this study introduces an end-to-end Vision-to-Sensor Knowledge Distillation (VSKD) framework. In this VSKD framework, only time-series data, i.e., accelerometer data, is needed from wearable devices during the testing phase. The framework therefore not only reduces the computational demands on edge devices, but also produces a learning model that closely matches the performance of the computationally expensive multi-modal approach. In order to retain the local temporal relationship and facilitate visual deep learning models, we first convert time-series data to two-dimensional images by applying the Gramian Angular Field (GAF) based encoding method. We adopt ResNet18 and multi-scale TRN with BN-Inception as the teacher and student networks in this study, respectively. A novel loss function, named Distance and Angle-wise Semantic Knowledge loss (DASK), is proposed to mitigate the modality variations between the vision and sensor domains. Extensive experimental results on the UTD-MHAD, MMAct, and Berkeley-MHAD datasets demonstrate the effectiveness and competitiveness of the proposed VSKD model, which can be deployed on wearable sensors.
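The Gramian Angular Field encoding mentioned above is a standard transform; a minimal Gramian Angular Summation Field (GASF) sketch, not the authors' exact preprocessing, is:

```python
import numpy as np

def gramian_angular_field(x, eps=1e-8):
    """Encode a 1-D series (e.g., one accelerometer channel) as a 2-D image."""
    x = np.asarray(x, dtype=np.float64)
    # Rescale the series to [-1, 1] so arccos is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + eps) - 1
    phi = np.arccos(np.clip(x, -1, 1))       # polar-coordinate angles
    # GASF[i, j] = cos(phi_i + phi_j)
    return np.cos(phi[:, None] + phi[None, :])

image = gramian_angular_field(np.sin(np.linspace(0, 8 * np.pi, 128)))
print(image.shape)  # (128, 128); one such image per sensor channel
```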
【2】 Mind Your Clever Neighbours: Unsupervised Person Re-identification via Adaptive Clustering Relationship Modeling
Link: https://arxiv.org/abs/2112.01839
Authors: Lianjie Jia, Chenyang Yu, Xiehao Ye, Tianyu Yan, Yinjie Lei, Pingping Zhang
Note: This work has been accepted by AAAI-2022. Some modifications may be made in the final version
Abstract: Unsupervised person re-identification (Re-ID) attracts increasing attention due to its potential to resolve the scalability problem of supervised Re-ID models. Most existing unsupervised methods adopt an iterative clustering mechanism, where the network is trained on pseudo labels generated by unsupervised clustering. However, clustering errors are inevitable. To generate high-quality pseudo-labels and mitigate the impact of clustering errors, we propose a novel clustering relationship modeling framework for unsupervised person Re-ID. Specifically, before clustering, the relations between unlabeled images are explored with a graph correlation learning (GCL) module, and the refined features are then used for clustering to generate high-quality pseudo-labels. Thus, GCL adaptively mines the relationship between samples in a mini-batch to reduce the impact of abnormal clustering during training. To train the network more effectively, we further propose a selective contrastive learning (SCL) method with a selective memory bank update policy. Extensive experiments demonstrate that our method achieves much better results than most state-of-the-art unsupervised methods on the Market1501, DukeMTMC-reID and MSMT17 datasets. We will release the code for model reproduction.
【3】 A Survey: Deep Learning for Hyperspectral Image Classification with Few Labeled Samples
Link: https://arxiv.org/abs/2112.01800
Authors: Sen Jia, Shuguo Jiang, Zhijie Lin, Nanying Li, Meng Xu, Shiqi Yu
Abstract: With the rapid development of deep learning technology and improvements in computing capability, deep learning has been widely used in the field of hyperspectral image (HSI) classification. In general, deep learning models often contain many trainable parameters and require a massive number of labeled samples to achieve optimal performance. However, for HSI classification, a large number of labeled samples is generally difficult to acquire due to the difficulty and time-consuming nature of manual labeling. Therefore, many research works focus on building deep learning models for HSI classification with few labeled samples. In this article, we concentrate on this topic and provide a systematic review of the relevant literature. Specifically, the contributions of this paper are twofold. First, the research progress of related methods is categorized according to the learning paradigm, including transfer learning, active learning and few-shot learning. Second, a number of experiments with various state-of-the-art approaches have been carried out, and the results are summarized to reveal potential research directions. More importantly, although there is a vast gap between deep learning models (which usually need sufficient labeled samples) and the HSI scenario with few labeled samples, the issue of small sample sets can be well characterized by a fusion of deep learning methods and related techniques, such as transfer learning and lightweight models. For reproducibility, the source code of the methods assessed in the paper can be found at https://github.com/ShuGuoJ/HSI-Classification.git.
【4】 Gesture Recognition with a Skeleton-Based Keyframe Selection Module
Link: https://arxiv.org/abs/2112.01736
Authors: Yunsoo Kim, Hyun Myung
Note: 8 pages
Abstract: We propose a bidirectional consecutively connected two-pathway network (BCCN) for efficient gesture recognition. The BCCN consists of two pathways: (i) a keyframe pathway and (ii) a temporal-attention pathway. The keyframe pathway is configured using the skeleton-based keyframe selection module. Keyframes pass through this pathway to extract their own spatial features, while the temporal-attention pathway extracts temporal semantics. Our model improved gesture recognition performance in videos and obtained better activation maps for spatial and temporal properties. Tests were performed on the Chalearn dataset, the ETRI-Activity 3D dataset, and the Toyota Smart Home dataset.
【5】 Adaptive Poincaré Point to Set Distance for Few-Shot Classification
Link: https://arxiv.org/abs/2112.01719
Authors: Rongkai Ma, Pengfei Fang, Tom Drummond, Mehrtash Harandi
Note: Accepted at AAAI 2022
Abstract: Learning and generalizing from limited examples, i.e., few-shot learning, is of core importance to many real-world vision applications. A principal way of achieving few-shot learning is to realize an embedding where samples from different classes are distinctive. Recent studies suggest that embedding via hyperbolic geometry enjoys low distortion for hierarchical and structured data, making it suitable for few-shot learning. In this paper, we propose to learn a context-aware hyperbolic metric to characterize the distance between a point and a set associated with a learned set-to-set distance. To this end, we formulate the metric as a weighted sum on the tangent bundle of the hyperbolic space and develop a mechanism to obtain the weights adaptively, based on the constellation of the points. This not only makes the metric local but also dependent on the task at hand, meaning that the metric will adapt depending on the samples that it compares. We empirically show that such a metric yields robustness in the presence of outliers and achieves a tangible improvement over baseline models. This includes state-of-the-art results on five popular few-shot classification benchmarks, namely mini-ImageNet, tiered-ImageNet, Caltech-UCSD Birds-200-2011 (CUB), CIFAR-FS, and FC100.
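For reference, the point-to-point distance on the Poincaré ball underlying such hyperbolic metrics is standard; a small sketch follows (the adaptive point-to-set weighting itself is task-specific and omitted):

```python
import torch

def poincare_distance(x, y, eps=1e-5):
    """d(x, y) = arccosh(1 + 2 ||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))
    for points inside the unit ball (curvature -1)."""
    sq = (x - y).pow(2).sum(dim=-1)
    denom = (1 - x.pow(2).sum(dim=-1)).clamp_min(eps) * \
            (1 - y.pow(2).sum(dim=-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom)

x = torch.tensor([0.1, 0.2])
y = torch.tensor([0.3, -0.4])
print(poincare_distance(x, y))
```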
【6】 Fully automatic integration of dental CBCT images and full-arch intraoral impressions with stitching error correction via individual tooth segmentation and identification
Link: https://arxiv.org/abs/2112.01784
Authors: Tae Jun Jang, Hye Sun Yun, Jong-Eun Kim, Sang-Hwy Lee, Jin Keun Seo
Abstract: We present a fully automated method for integrating intraoral scan (IOS) and dental cone-beam computed tomography (CBCT) images into one image by complementing each image's weaknesses. Dental CBCT alone may not be able to delineate precise details of the tooth surface due to limited image resolution and various CBCT artifacts, including metal-induced artifacts. IOS is very accurate for the scanning of narrow areas, but it produces cumulative stitching errors during full-arch scanning. The proposed method is intended not only to compensate for the low quality of CBCT-derived tooth surfaces with IOS, but also to correct the cumulative stitching errors of IOS across the entire dental arch. Moreover, the integration provides both the gingival structure of IOS and the tooth roots of CBCT in one image. The proposed fully automated method consists of four parts: (i) an individual tooth segmentation and identification module for IOS data (TSIM-IOS); (ii) an individual tooth segmentation and identification module for CBCT data (TSIM-CBCT); (iii) global-to-local tooth registration between IOS and CBCT; and (iv) stitching error correction of the full-arch IOS. The experimental results show that the proposed method achieved landmark and surface distance errors of 0.11mm and 0.30mm, respectively.
Segmentation | Semantics (7 papers)
【1】 Novel Class Discovery in Semantic Segmentation
Link: https://arxiv.org/abs/2112.01900
Authors: Yuyang Zhao, Zhun Zhong, Nicu Sebe, Gim Hee Lee
Abstract: We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS), which aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes. In contrast to existing approaches that look at novel class discovery in image classification, we focus on the more challenging semantic segmentation. In NCDSS, we need to distinguish objects from the background and handle the existence of multiple classes within an image, which increases the difficulty of using the unlabeled data. To tackle this new setting, we leverage the labeled base data and a saliency model to coarsely cluster novel classes for model training in our basic framework. Additionally, we propose the Entropy-based Uncertainty Modeling and Self-training (EUMS) framework to overcome noisy pseudo-labels, further improving the model performance on the novel classes. Our EUMS utilizes an entropy ranking technique and a dynamic reassignment to distill clean labels, thereby making full use of the noisy data via self-supervised learning. We build the NCDSS benchmark on the PASCAL-5$^i$ dataset. Extensive experiments demonstrate the feasibility of the basic framework (achieving an average mIoU of 49.81%) and the effectiveness of the EUMS framework (outperforming the basic framework by 9.28% mIoU).
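A minimal sketch of the entropy-ranking idea follows, here applied per pixel with an assumed keep ratio; the paper's dynamic reassignment is omitted:

```python
import torch

def split_by_entropy(probs, keep_ratio=0.5, ignore_index=255):
    """probs: (N, C, H, W) softmax outputs for unlabeled images.
    Low-entropy predictions are kept as clean pseudo-labels; the rest are
    masked out as uncertain."""
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # (N, H, W)
    pseudo = probs.argmax(dim=1)
    threshold = torch.quantile(entropy.flatten(), keep_ratio)
    pseudo[entropy > threshold] = ignore_index  # uncertain pixels ignored in the loss
    return pseudo
```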
【2】 Incremental Learning in Semantic Segmentation from Image Labels
Link: https://arxiv.org/abs/2112.01882
Authors: Fabio Cermelli, Dario Fontanel, Antonio Tavera, Marco Ciccone, Barbara Caputo
Abstract: Although existing semantic segmentation approaches achieve impressive results, they still struggle to update their models incrementally as new categories are uncovered. Furthermore, pixel-by-pixel annotations are expensive and time-consuming. This paper proposes a novel framework for Weakly Incremental Learning for Semantic Segmentation that aims at learning to segment new classes from cheap and largely available image-level labels. As opposed to existing approaches that need to generate pseudo-labels offline, we use an auxiliary classifier, trained with image-level labels and regularized by the segmentation model, to obtain pseudo-supervision online and update the model incrementally. We cope with the inherent noise in the process by using soft labels generated by the auxiliary classifier. We demonstrate the effectiveness of our approach on the Pascal VOC and COCO datasets, outperforming offline weakly-supervised methods and obtaining results comparable with incremental learning methods with full supervision.
【3】 Semantic Map Injected GAN Training for Image-to-Image Translation
Link: https://arxiv.org/abs/2112.01845
Authors: Balaram Singh Kshatriya, Shiv Ram Dubey, Himangshu Sarma, Kunal Chaudhary, Meva Ram Gurjar, Rahul Rai, Sunny Manchanda
Note: Accepted in Fourth Workshop on Computer Vision Applications (WCVA) at ICVGIP 2021
Abstract: Image-to-image translation is a recent trend for transforming images from one domain to another using generative adversarial networks (GANs). Existing GAN models perform training using only the input and output modalities of the transformation. In this paper, we perform semantic-injected training of GAN models. Specifically, we train with the original input and output modalities and inject a few epochs of training for translation from the input to the semantic map. Let us refer to the original training as the training for the translation of the input image into the target domain. The injection of semantic training into the original training improves the generalization capability of the trained GAN model. Moreover, it also better preserves the categorical information in the generated image. The semantic map is only utilized at training time and is not required at test time. The experiments are performed using state-of-the-art GAN models on the CityScapes and RGB-NIR stereo datasets. We observe improved performance in terms of the SSIM, FID and KID scores after injecting semantic training, as compared to the original training.
【4】 MSP: Refine Boundary Segmentation via Multiscale Superpixel
Link: https://arxiv.org/abs/2112.01746
Authors: Jie Zhu, Huabin Huang, Banghuai Li, Yong Liu, Leye Wang
Note: Under review
Abstract: In this paper, we propose a simple but effective message passing method to improve the boundary quality of semantic segmentation results. Inspired by the sharp edges produced by superpixel blocks, we employ superpixels to guide information passing within the feature map. Simultaneously, the sharp boundaries of the blocks also restrict the message passing scope. Specifically, we average the features that each superpixel block covers within the feature map, and add the result back to each feature vector. Further, to obtain sharper edges and longer-range spatial dependence, we develop a multiscale superpixel module (MSP) as a cascade of superpixel blocks at different scales. Our method can serve as a plug-and-play module and is easily inserted into any segmentation network without introducing new parameters. Extensive experiments are conducted on three strong baselines, namely PSPNet, DeepLabV3, and DeepLabV3+, and four challenging scene parsing datasets including ADE20K, Cityscapes, PASCAL VOC, and PASCAL Context. The experimental results verify its effectiveness and generalizability.
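A single-scale sketch of this superpixel message passing might look like the following; the segment count is illustrative, the paper cascades several scales, and the features are assumed here to share the image's spatial resolution:

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_message_passing(image, features, n_segments=200):
    """image: (H, W, 3) uint8; features: (H, W, C) feature map at the same H, W.
    Average the feature vectors each superpixel covers and add the block mean
    back to every position it covers; the sharp block boundaries limit the
    passing range."""
    labels = slic(image, n_segments=n_segments, start_label=0)
    out = features.copy()
    for lab in np.unique(labels):
        mask = labels == lab
        out[mask] += features[mask].mean(axis=0)
    return out
```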
【5】 Hybrid Instance-aware Temporal Fusion for Online Video Instance Segmentation
Link: https://arxiv.org/abs/2112.01695
Authors: Xiang Li, Jinglu Wang, Xiao Li, Yan Lu
Note: AAAI 2022 accepted paper
Abstract: Recently, transformer-based image segmentation methods have achieved notable success over previous solutions. For the video domain, how to effectively model temporal context with the attention of object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation framework with a novel instance-aware temporal fusion method. We first leverage a representation, i.e., a latent code in the global context (instance code) together with CNN feature maps, to represent instance- and pixel-level features. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build up inter-frame contextual fusion with hybrid attentions between the instance codes and CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching of prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e., YouTube-VIS-19/21. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone.
【6】 Localized Feature Aggregation Module for Semantic Segmentation
Link: https://arxiv.org/abs/2112.01702
Authors: Ryouichi Furukawa, Kazuhiro Hotta
Note: SMC 2021
Abstract: We propose a new information aggregation method, called the Localized Feature Aggregation Module, based on the similarity between the feature maps of an encoder and a decoder. The proposed method recovers positional information by emphasizing the similarity between the decoder's feature maps, which carry superior semantic information, and the encoder's feature maps, which carry superior positional information. The proposed method can learn positional information more efficiently than the conventional concatenation in U-Net and attention U-Net. Additionally, the proposed method uses a localized attention range to reduce the computational cost. These two innovations improve segmentation accuracy at a lower computational cost. Through experiments on the Drosophila cell image dataset and a COVID-19 image dataset, we confirmed that our method outperforms conventional methods.
【7】 Automatic tumour segmentation in H&E-stained whole-slide images of the pancreas
Link: https://arxiv.org/abs/2112.01533
Authors: Pierpaolo Vendittelli, Esther M. M. Smeets, Geert Litjens
Abstract: Pancreatic cancer will soon be the second leading cause of cancer-related death in Western society. Imaging techniques such as CT, MRI and ultrasound typically help in providing the initial diagnosis, but histopathological assessment is still the gold standard for final confirmation of disease presence and prognosis. In recent years machine learning approaches and pathomics pipelines have shown potential in improving diagnostics and prognostics in other cancerous entities, such as breast and prostate cancer. A crucial first step in these pipelines is typically the identification and segmentation of the tumour area. Ideally this step is done automatically to prevent time-consuming manual annotation. We propose a multi-task convolutional neural network to balance disease detection and segmentation accuracy. We validated our approach on a dataset of 29 patients (for a total of 58 slides) at different resolutions. The best single-task segmentation network achieved a median Dice of 0.885 (0.122 IQR) at a resolution of 15.56 μm. Our multi-task network improved on that with a median Dice score of 0.934 (0.077 IQR).
Zero/Few-Shot | Transfer | Domain Adaptation | Adaptation (2 papers)
【1】 Hierarchical Optimal Transport for Unsupervised Domain Adaptation
Link: https://arxiv.org/abs/2112.02073
Authors: Mourad El Hamri, Younès Bennani, Issam Falih, Hamid Ahaggach
Abstract: In this paper, we propose a novel approach for unsupervised domain adaptation that relates notions of optimal transport, learning of probability measures and unsupervised learning. The proposed approach, HOT-DA, is based on a hierarchical formulation of optimal transport that leverages, beyond the geometrical information captured by the ground metric, richer structural information in the source and target domains. The additional information in the labeled source domain is formed instinctively by grouping samples into structures according to their class labels, while exploring hidden structures in the unlabeled target domain is reduced to the problem of learning probability measures through the Wasserstein barycenter, which we prove to be equivalent to spectral clustering. Experiments on a toy dataset with controllable complexity and two challenging visual adaptation datasets show the superiority of the proposed approach over the state of the art.
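For context, a plain entropic-regularized Sinkhorn solver, the usual computational workhorse behind OT-based domain adaptation, is easy to sketch; note this is generic and not the paper's hierarchical formulation, which additionally groups samples into label/cluster structures before transporting between them:

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.1, n_iters=200):
    """a, b: source/target marginals (each sums to 1); cost: (m, n) ground cost.
    Returns the entropic-regularized transport plan."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

m, n = 4, 5
cost = np.random.rand(m, n)
plan = sinkhorn(np.full(m, 1 / m), np.full(n, 1 / n), cost)
print(plan.sum())  # ~1.0; row/column sums match the marginals
```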
【2】 Boosting Unsupervised Domain Adaptation with Soft Pseudo-label and Curriculum Learning
Link: https://arxiv.org/abs/2112.01948
Authors: Shengjia Zhang, Tiancheng Lin, Yi Xu
Note: 28 pages
Abstract: By leveraging data from a fully labeled source domain, unsupervised domain adaptation (UDA) improves classification performance on an unlabeled target domain through explicit discrepancy minimization of data distributions or adversarial learning. As an enhancement, category alignment is involved during adaptation to reinforce target feature discrimination by utilizing model predictions. However, there remain unexplored problems of pseudo-label inaccuracy incurred by wrong category predictions on the target domain, and distribution deviation caused by overfitting on the source domain. In this paper, we propose a model-agnostic two-stage learning framework, which greatly reduces flawed model predictions using a soft pseudo-label strategy and avoids overfitting on the source domain with a curriculum learning strategy. Theoretically, it successfully decreases the combined risk in the upper bound of the expected error on the target domain. In the first stage, we train a model with a distribution alignment-based UDA method to obtain soft semantic labels on the target domain with rather high confidence. To avoid overfitting on the source domain, in the second stage we propose a curriculum learning strategy to adaptively control the weighting between the losses from the two domains, so that the focus of training is gradually shifted from the source distribution to the target distribution, with prediction confidence boosted on the target domain. Extensive experiments on two well-known benchmark datasets validate the universal effectiveness of our proposed framework in promoting the performance of top-ranked UDA algorithms and demonstrate its consistently superior performance.
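A toy version of the second-stage weighting is sketched below with a fixed cosine schedule, which is an assumption; the paper adapts the weight using prediction confidence rather than a fixed schedule:

```python
import math

def curriculum_weight(epoch, total_epochs):
    """Returns lambda in [0, 1];
    total loss = (1 - lambda) * L_source + lambda * L_target."""
    return 0.5 * (1 - math.cos(math.pi * epoch / total_epochs))

for e in [0, 10, 20]:
    print(e, round(curriculum_weight(e, 20), 3))  # 0.0 -> 0.5 -> 1.0
```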
Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (5 papers)
【1】 SSDL: Self-Supervised Dictionary Learning
Link: https://arxiv.org/abs/2112.01790
Authors: Shuai Shao, Lei Xing, Wei Yu, Rui Xu, Yanjiang Wang, Baodi Liu
Note: Accepted by the 22nd IEEE International Conference on Multimedia and Expo (ICME) as an oral
Abstract: Label-embedded dictionary learning (DL) algorithms generate influential dictionaries by introducing discriminative information. However, there exists a limitation: all label-embedded DL methods rely on labels, so this approach only achieves ideal performance in supervised learning, while in semi-supervised and unsupervised learning it is no longer sufficiently effective. Inspired by the concept of self-supervised learning (e.g., setting a pretext task to generate a universal model for downstream tasks), we propose a Self-Supervised Dictionary Learning (SSDL) framework to address this challenge. Specifically, we first design a $p$-Laplacian Attention Hypergraph Learning (pAHL) block as the pretext task to generate pseudo soft labels for DL. Then, we adopt the pseudo labels to train a dictionary with a primary label-embedded DL method. We evaluate our SSDL on two human activity recognition datasets. Comparison results with other state-of-the-art methods demonstrate the efficiency of SSDL.
【2】 Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior
Link: https://arxiv.org/abs/2112.01766
Authors: Feng Zhang, Yuanjie Shao, Yishi Sun, Kai Zhu, Changxin Gao, Nong Sang
Note: Submitted to IEEE Transactions on Image Processing
Abstract: Deep learning-based methods for low-light image enhancement typically require enormous amounts of paired training data, which are impractical to capture in real-world scenarios. Recently, unsupervised approaches have been explored to eliminate the reliance on paired training data. However, they perform erratically in diverse real-world scenarios due to the absence of priors. To address this issue, we propose an unsupervised low-light image enhancement method based on an effective prior termed the histogram equalization prior (HEP). Our work is inspired by the interesting observation that the feature maps of a histogram-equalization-enhanced image and of the ground truth are similar. Specifically, we formulate the HEP to provide abundant texture and luminance information. Embedded into a Light Up Module (LUM), it helps decompose the low-light images into illumination and reflectance maps, and the reflectance maps can be regarded as the restored images. However, a derivation based on Retinex theory reveals that the reflectance maps are contaminated by noise. We introduce a Noise Disentanglement Module (NDM) to disentangle the noise and content in the reflectance maps with the reliable aid of unpaired clean images. Guided by the histogram equalization prior and noise disentanglement, our method can recover finer details and is better able to suppress noise in real-world low-light scenarios. Extensive experiments demonstrate that our method performs favorably against state-of-the-art unsupervised low-light enhancement algorithms and even matches state-of-the-art supervised algorithms.
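The histogram-equalized image that serves as the prior can be computed with the textbook CDF mapping; a minimal single-channel sketch:

```python
import numpy as np

def histogram_equalize(gray):
    """gray: (H, W) uint8 single-channel image. No learning is involved at
    this step; the equalized image only supplies the prior."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)
    return lut[gray]  # remap intensities through the CDF
```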
【3】 Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks
Link: https://arxiv.org/abs/2112.01715
Authors: Peri Akiva, Matthew Purri, Matthew Leotta
Abstract: Self-supervised learning aims to learn image feature representations without the use of manually annotated labels. It is often used as a precursor step to obtain useful initial network weights which contribute to faster convergence and superior performance on downstream tasks. While self-supervision allows one to reduce the domain gap between supervised and unsupervised learning without the use of labels, the self-supervised objective still requires a strong inductive bias towards downstream tasks for effective transfer learning. In this work, we present our material- and texture-based self-supervision method named MATTER (MATerial and TExture Representation Learning), which is inspired by classical material and texture methods. Material and texture can effectively describe any surface, including its tactile properties, color, and specularity. By extension, effective representations of material and texture can describe other semantic classes strongly associated with said material and texture. MATTER leverages multi-temporal, spatially aligned remote sensing imagery over unchanged regions to learn invariance to illumination and viewing angle as a mechanism to achieve consistency of material and texture representation. We show that our self-supervised pre-training method allows for up to 24.22% and 6.33% performance increases in unsupervised and fine-tuned setups, and up to 76% faster convergence on change detection, land cover classification, and semantic segmentation tasks.
【4】 D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans 标题:D3Net:RGB-D扫描中半监督密集字幕和视觉接地的说话人-听话人结构 链接:https://arxiv.org/abs/2112.01551
作者:Dave Zhenyu Chen,Qirui Wu,Matthias Nießner,Angel X. Chang 备注:Project website: this https URL 摘要:最近关于3D密集字幕和3D视觉定位(visual grounding)的研究已经取得了令人印象深刻的成果。尽管在这两个领域都有所发展,但可用的3D视觉语言数据量有限,导致3D视觉定位和3D密集字幕方法存在过拟合问题。此外,如何在复杂的三维环境中有区分性地描述对象还没有得到充分的研究。为了应对这些挑战,我们提出了D3Net,一种端到端的神经说话人-听话人体系结构,可以检测、描述和辨别。我们的D3Net以一种自批评(self-critical)的方式将3D中的密集字幕和视觉定位统一起来。D3Net的这种自批评特性还在对象字幕生成过程中引入了可辨别性,并支持在带有部分注释描述的ScanNet数据上进行半监督训练。我们的方法在ScanRefer数据集的两项任务上都优于SOTA方法,并大幅超过SOTA的3D密集字幕方法(23.56%的CiDEr@0.5IoU提升)。 摘要:Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments is not fully studied yet. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. Our D3Net unifies dense captioning and visual grounding in 3D in a self-critical manner. This self-critical property of D3Net also introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions. Our method outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin (23.56% CiDEr@0.5IoU improvement).
【5】 Quantifying the uncertainty of neural networks using Monte Carlo dropout for deep learning based quantitative MRI 标题:基于蒙特卡罗Dropout的深度学习定量MRI神经网络不确定性量化 链接:https://arxiv.org/abs/2112.01587
作者:Mehmet Yigit Avci,Ziyu Li,Qiuyun Fan,Susie Huang,Berkin Bilgic,Qiyuan Tian 摘要:Dropout通常在训练阶段用作正则化方法,并用于量化深度学习中的不确定性。我们建议在训练和推理步骤中都使用Dropout,并对多次预测取平均,以提高准确性,同时减少并量化不确定性。我们对仅从3个方向扫描获得的分数各向异性(FA)和平均扩散率(MD)图的结果进行了评估。使用我们的方法,与不使用Dropout的网络输出相比,准确度可以显著提高,特别是在训练数据集很小的情况下。此外,生成的置信图可能有助于诊断未见过的病变或伪影。 摘要:Dropout is conventionally used during the training phase as regularization method and for quantifying uncertainty in deep learning. We propose to use dropout during training as well as inference steps, and average multiple predictions to improve the accuracy, while reducing and quantifying the uncertainty. The results are evaluated for fractional anisotropy (FA) and mean diffusivity (MD) maps which are obtained from only 3 direction scans. With our method, accuracy can be improved significantly compared to network outputs without dropout, especially when the training dataset is small. Moreover, confidence maps are generated which may aid in diagnosis of unseen pathology or artifacts.
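蒙特卡罗Dropout的通用做法是在推理时保持Dropout层激活,多次前向取均值作为预测、取标准差作为不确定性。以下PyTorch草图仅演示这一机制,与论文的MRI网络无关。

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Monte Carlo dropout: keep dropout active at inference and average predictions."""
    model.eval()
    # Re-enable only the dropout layers so batch-norm etc. stay in eval mode.
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)], dim=0)
    return preds.mean(dim=0), preds.std(dim=0)  # prediction and uncertainty map

# Toy usage with a small regression head.
net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 2))
mean, uncertainty = mc_dropout_predict(net, torch.randn(4, 8))
```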
时序|行为识别|姿态|视频|运动估计(4篇)
【1】 Class-agnostic Reconstruction of Dynamic Objects from Videos 标题:视频中与类无关的动态对象重建 链接:https://arxiv.org/abs/2112.02091
作者:Zhongzheng Ren,Xiaoming Zhao,Alexander G. Schwing 备注:NeurIPS 2021 摘要:我们引入了REDO,一个与类无关的框架,用于从RGBD或校准视频重建动态对象。与之前的工作相比,我们的问题设置更真实,但也更具挑战性,原因有三:1)由于遮挡或相机设置,感兴趣的对象可能永远不会完全可见,但我们的目标是重建完整的形状;2) 我们的目标是处理不同的对象动力学,包括刚体运动、非刚体运动和关节运动;3) 我们的目标是用一个统一的框架重建不同类别的对象。为了应对这些挑战,我们开发了两个新模块。首先,我们引入一个规范的4D隐式函数,它与聚合的时间视觉线索逐像素对齐。其次,我们开发了一个4D变换模块,该模块捕获对象动态以支持时间传播和聚合。我们在合成RGBD视频数据集SAIL-VOS 3D和DeformingThings4D以及真实世界视频数据3DPW的大量实验中研究了REDO的效果。我们发现REDO以明显优势超过最先进的动态重建方法。在消融研究中,我们验证了每个开发的组件。 摘要:We introduce REDO, a class-agnostic framework to REconstruct the Dynamic Objects from RGBD or calibrated videos. Compared to prior work, our problem setting is more realistic yet more challenging for three reasons: 1) due to occlusion or camera settings an object of interest may never be entirely visible, but we aim to reconstruct the complete shape; 2) we aim to handle different object dynamics including rigid motion, non-rigid motion, and articulation; 3) we aim to reconstruct different categories of objects with one unified framework. To address these challenges, we develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues. Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation. We study the efficacy of REDO in extensive experiments on synthetic RGBD video datasets SAIL-VOS 3D and DeformingThings4D, and on real-world video data 3DPW. We find REDO outperforms state-of-the-art dynamic reconstruction methods by a margin. In ablation studies we validate each developed component.
【2】 Lightweight Attentional Feature Fusion for Video Retrieval by Text 标题:用于文本视频检索的轻量级注意特征融合 链接:https://arxiv.org/abs/2112.01832
作者:Fan Hu,Aozhu Chen,Ziyue Wang,Fangming Zhou,Xirong Li 摘要:在本文中,我们在通过文本检索视频这一新背景下重新审视特征融合(feature fusion)这一老话题。与以往只考虑一端(视频或文本)特征融合的研究不同,我们的目标是在一个统一的框架内对两端进行特征融合。我们假设,优化特征的凸组合比通过计算繁重的多头自注意力来建模它们的相关性更可取。因此,我们提出了轻量级注意特征融合(LAFF)。LAFF在早期和后期以及视频和文本端执行特征融合,使其成为利用各种(现成)特征的强大方法。对四个公共数据集(即MSR-VTT、MSVD、TGIF、VATEX)和大规模TRECVID AVS基准评估(2016-2020)进行的广泛实验表明了LAFF的可行性。此外,LAFF的实现极其简单,因此对现实世界的部署非常有吸引力。 摘要:In this paper, we revisit feature fusion, an old-fashioned topic, in the new context of video retrieval by text. Different from previous research that considers feature fusion only at one end, let it be video or text, we aim for feature fusion for both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferred to modeling their correlations by computationally heavy multi-head self-attention. Accordingly, we propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. Extensive experiments on four public datasets, i.e. MSR-VTT, MSVD, TGIF, VATEX, and the large-scale TRECVID AVS benchmark evaluations (2016-2020) show the viability of LAFF. Moreover, LAFF is extremely simple to implement, making it appealing for real-world deployment.
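LAFF的核心假设是"学习特征的凸组合"优于计算繁重的多头自注意力。下面给出一个假设性的PyTorch草图:将多路现成特征投影到统一维度,再以softmax权重做凸组合;层数与细节未必与论文实现一致。

```python
import torch
import torch.nn as nn

class ConvexFeatureFusion(nn.Module):
    """Fuse multiple (off-the-shelf) features by a learned convex combination."""
    def __init__(self, in_dims, common_dim=512):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, common_dim) for d in in_dims])
        self.score = nn.Linear(common_dim, 1)  # one scalar logit per feature branch

    def forward(self, feats):
        projected = torch.stack([p(f) for p, f in zip(self.projs, feats)], dim=1)  # (B, N, D)
        weights = torch.softmax(self.score(torch.tanh(projected)), dim=1)          # (B, N, 1), sums to 1
        return (weights * projected).sum(dim=1)                                    # convex combination

# Toy usage: three heterogeneous video features fused into one embedding.
fusion = ConvexFeatureFusion(in_dims=[2048, 1024, 768])
fused = fusion([torch.randn(2, 2048), torch.randn(2, 1024), torch.randn(2, 768)])
```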
【3】 Action Units That Constitute Trainable Micro-expressions (and A Large-scale Synthetic Dataset) 标题:构成可训练微表情的动作单元(和大规模合成数据集) 链接:https://arxiv.org/abs/2112.01730
作者:Yuchi Liu,Zhongdao Wang,Tom Gedeon,Liang Zheng 摘要:由于昂贵的数据采集过程,微表情数据集的规模通常比其他计算机视觉领域的数据集小得多,使得大规模训练的稳定性和可行性较差。在本文中,我们的目标是开发一个协议来自动合成微表情训练数据,使其1)具有大规模,2)允许我们在真实世界的测试集上训练高精度的识别模型。具体来说,我们发现了三种类型的动作单元(AU),它们可以很好地构成可训练的微表情。这些AU来自真实世界的微表情、宏表情的早期帧,以及由人类知识定义的AU与表情标签之间的关系。利用这些AU,我们的协议使用大量具有不同身份的人脸图像和现有的人脸生成方法进行微表情合成。微表情识别模型在生成的微表情数据集上进行训练,并在真实世界的测试集上进行评估,从而获得非常有竞争力和稳定的性能。实验结果不仅验证了这些AU和我们的数据集合成协议的有效性,还揭示了微表情的一些关键特性:它们可跨人脸泛化、接近早期阶段的宏表情,并且可以人工定义。 摘要:Due to the expensive data collection process, micro-expression datasets are generally much smaller in scale than those in other computer vision fields, rendering large-scale training less stable and feasible. In this paper, we aim to develop a protocol to automatically synthesize micro-expression training data that 1) are on a large scale and 2) allow us to train recognition models with strong accuracy on real-world test sets. Specifically, we discover three types of Action Units (AUs) that can well constitute trainable micro-expressions. These AUs come from real-world micro-expressions, early frames of macro-expressions, and the relationship between AUs and expression labels defined by human knowledge. With these AUs, our protocol then employs large numbers of face images with various identities and an existing face generation method for micro-expression synthesis. Micro-expression recognition models are trained on the generated micro-expression datasets and evaluated on real-world test sets, where very competitive and stable performance is obtained. The experimental results not only validate the effectiveness of these AUs and our dataset synthesis protocol but also reveal some critical properties of micro-expressions: they generalize across faces, are close to early-stage macro-expressions, and can be manually defined.
【4】 Neural Head Avatars from Monocular RGB Videos 标题:来自单目RGB视频的神经头部头像 链接:https://arxiv.org/abs/2112.01554
作者:Philip-William Grassal,Malte Prinzler,Titus Leistner,Carsten Rother,Matthias Nießner,Justus Thies 备注:Video: this https URL Project page: this https URL 摘要:我们提出了神经头部头像(Neural Head Avatars),这是一种新的神经表示,可显式建模可动画化人类头像的表面几何结构和外观,可用于AR/VR中的远程会议,或依赖数字人的电影、游戏行业中的其他应用。我们的表示可以从单目RGB人像视频中学习,该视频包含一系列不同的表情和视角。具体地说,我们提出了一种混合表示,它由一个刻画面部粗略形状和表情的可变形模型,以及两个前馈网络组成,分别预测底层网格的顶点偏移和一个依赖于视角与表情的纹理。我们证明了这种表示能够精确地外推到未见过的姿态和视角,并在提供清晰纹理细节的同时生成自然的表情。与以前的头部头像研究相比,我们的方法提供了一个完整人头(包括头发)的解耦形状与外观模型,该模型与标准图形管线兼容。此外,在重建质量和新视角合成方面,它在定量和定性上都优于当前的最先进水平。 摘要:We present Neural Head Avatars, a novel neural representation that explicitly models the surface geometry and appearance of an animatable human avatar that can be used for teleconferencing in AR/VR or other applications in the movie or games industry that rely on a digital human. Our representation can be learned from a monocular RGB portrait video that features a range of different expressions and views. Specifically, we propose a hybrid representation consisting of a morphable model for the coarse shape and expressions of the face, and two feed-forward networks, predicting vertex offsets of the underlying mesh as well as a view- and expression-dependent texture. We demonstrate that this representation is able to accurately extrapolate to unseen poses and view points, and generates natural expressions while providing sharp texture details. Compared to previous works on head avatars, our method provides a disentangled shape and appearance model of the complete human head (including hair) that is compatible with the standard graphics pipeline. Moreover, it quantitatively and qualitatively outperforms current state of the art in terms of reconstruction quality and novel-view synthesis.
医学相关(2篇)
【1】 Towards Super-Resolution CEST MRI for Visualization of Small Structures 标题:面向小结构可视化的超分辨率CEST MRI 链接:https://arxiv.org/abs/2112.01905
作者:Lukas Folle,Katharian Tkotz,Fasil Gadjimuradov,Lorenz Kapsner,Moritz Fabian,Sebastian Bickelhaupt,David Simon,Arnd Kleyer,Gerhard Krönke,Moritz Zaiß,Armin Nagel,Andreas Maier 摘要:风湿性疾病(如类风湿性关节炎)的发病通常是亚临床的,这使疾病的早期发现具有挑战性。然而,解剖结构的特征性变化可以通过MRI或CT等成像技术检测出来。现代成像技术,如化学交换饱和转移(CEST)MRI,通过对体内代谢物的成像,有望进一步改善早期检测。为了成像患者关节中的小结构(通常是疾病最先引起变化的区域之一),CEST MR成像需要高分辨率。然而,由于采集的潜在物理限制,目前CEST MR固有的分辨率较低。在这项工作中,我们比较了已建立的上采样技术和基于神经网络的超分辨率方法。我们可以证明,神经网络能够比现有方法更好地学习从低分辨率到高分辨率非饱和CEST图像的映射。在测试集上,使用ResNet神经网络可实现32.29dB的PSNR(+10%)、0.14的NRMSE(+28%)和0.85的SSIM(+15%),显著改善了基线。这项工作为超分辨率CEST MRI神经网络的前瞻性研究铺平了道路,并有望实现对风湿性疾病发病的更早检测。 摘要:The onset of rheumatic diseases such as rheumatoid arthritis is typically subclinical, which results in challenging early detection of the disease. However, characteristic changes in the anatomy can be detected using imaging techniques such as MRI or CT. Modern imaging techniques such as chemical exchange saturation transfer (CEST) MRI drive the hope to improve early detection even further through the imaging of metabolites in the body. To image small structures in the joints of patients, typically one of the first regions where changes due to the disease occur, a high resolution for the CEST MR imaging is necessary. Currently, however, CEST MR suffers from an inherently low resolution due to the underlying physical constraints of the acquisition. In this work we compared established up-sampling techniques to neural network-based super-resolution approaches. We could show, that neural networks are able to learn the mapping from low-resolution to high-resolution unsaturated CEST images considerably better than present methods. On the test set a PSNR of 32.29dB (+10%), an NRMSE of 0.14 (+28%), and a SSIM of 0.85 (+15%) could be achieved using a ResNet neural network, improving the baseline considerably. This work paves the way for the prospective investigation of neural networks for super-resolution CEST MRI and, followingly, might lead to an earlier detection of the onset of rheumatic diseases.
【2】 Learning to automate cryo-electron microscopy data collection with Ptolemy 标题:学习用托勒密自动采集低温电子显微镜数据 链接:https://arxiv.org/abs/2112.01534
作者:Paul T. Kim,Alex J. Noble,Anchi Cheng,Tristan Bepler 备注:13 pages, 11 figures 摘要:在过去的十年中,冷冻电镜(cryo-EM)已成为测定生物大分子近自然、近原子分辨率三维结构的主要方法。为了满足日益增长的冷冻电镜需求,需要采用自动化方法来提高吞吐量和效率,同时降低成本。目前,收集高倍率冷冻电镜显微照片的过程(即数据采集)需要人工输入和手动调整参数,因为专家操作员必须在低、中倍率图像间导航,以找到良好的高倍率采集位置。实现这一自动化并非易事:图像的信噪比很低,并且受到一系列实验参数的影响,而这些参数在每个采集会话中可能有所不同。在这里,我们使用多种计算机视觉算法,包括混合模型、卷积神经网络(CNN)和U-Net,开发了第一条使用专门构建的算法自动化低、中倍率目标定位的管道。该管道中的学习模型在来自真实冷冻电镜数据采集会话的大型内部图像数据集上进行训练,并以操作员选择的位置作为标注。利用这些模型,我们证明可以有效地检测和分类中低倍率图像中的感兴趣区域(ROI),并且可以泛化到未见过的采集会话,以及外部机构使用不同显微镜拍摄的图像。我们预计,我们的管道Ptolemy将立即可用作自动化冷冻电镜数据采集的工具,并为未来高效、自动化冷冻电镜的先进方法奠定基础。 摘要:Over the past decade, cryogenic electron microscopy (cryo-EM) has emerged as a primary method for determining near-native, near-atomic resolution 3D structures of biological macromolecules. In order to meet increasing demand for cryo-EM, automated methods to improve throughput and efficiency while lowering costs are needed. Currently, the process of collecting high-magnification cryo-EM micrographs, data collection, requires human input and manual tuning of parameters, as expert operators must navigate low- and medium-magnification images to find good high-magnification collection locations. Automating this is non-trivial: the images suffer from low signal-to-noise ratio and are affected by a range of experimental parameters that can differ for each collection session. Here, we use various computer vision algorithms, including mixture models, convolutional neural networks (CNNs), and U-Nets to develop the first pipeline to automate low- and medium-magnification targeting with purpose-built algorithms. Learned models in this pipeline are trained on a large internal dataset of images from real world cryo-EM data collection sessions, labeled with locations that were selected by operators. Using these models, we show that we can effectively detect and classify regions of interest (ROIs) in low- and medium-magnification images, and can generalize to unseen sessions, as well as to images captured using different microscopes from external facilities. We expect our pipeline, Ptolemy, will be both immediately useful as a tool for automation of cryo-EM data collection, and serve as a foundation for future advanced methods for efficient and automated cryo-EM microscopy.
GAN|对抗|攻击|生成相关(5篇)
【1】 Music-to-Dance Generation with Optimal Transport 标题:具有最优运输的音乐到舞蹈生成 链接:https://arxiv.org/abs/2112.01806
作者:Shuang Wu,Shijian Lu,Li Cheng 摘要:为一段音乐编排舞蹈是一项具有挑战性的任务,必须在考虑音乐主题和节奏的同时,创造性地呈现独特的风格化舞蹈元素。相似性检索、序列到序列建模和生成对抗网络等不同方法已被用于解决这一问题,但它们生成的舞蹈序列往往缺乏运动真实感、多样性和音乐一致性。在本文中,我们提出了音乐到舞蹈最优传输网络(MDOT-Net),用于学习从音乐生成三维舞蹈编排。我们引入一个最优传输距离来评估生成的舞蹈分布的真实性,并引入Gromov-Wasserstein距离来度量舞蹈分布与输入音乐之间的对应关系。这提供了一个定义明确且不发散的训练目标,缓解了标准GAN训练的局限:后者常受训练不稳定和生成器损失发散问题的困扰。大量实验表明,我们的MDOT-Net可以合成真实、多样的舞蹈,这些舞蹈与输入音乐实现有机统一,反映出共同的意向性,并与节奏衔接相匹配。 摘要:Dance choreography for a piece of music is a challenging task, having to be creative in presenting distinctive stylistic dance elements while taking into account the musical theme and rhythm. It has been tackled by different approaches such as similarity retrieval, sequence-to-sequence modeling and generative adversarial networks, but their generated dance sequences are often short of motion realism, diversity and music consistency. In this paper, we propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographs from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music. This gives a well defined and non-divergent training objective that mitigates the limitation of standard GAN training which is frequently plagued with instability and divergent generator loss issues. Extensive experiments demonstrate that our MDOT-Net can synthesize realistic and diverse dances which achieve an organic unity with the input music, reflecting the shared intentionality and matching the rhythmic articulation.
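摘要中的最优传输距离在实践中常用熵正则化的Sinkhorn迭代来近似。以下numpy草图给出这一标准算法(与MDOT-Net的具体实现无关),用于计算两组样本经验分布之间的传输代价。

```python
import numpy as np

def sinkhorn_distance(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> float:
    """Entropy-regularized OT between two uniform empirical distributions."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)               # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):              # alternating Sinkhorn projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]    # (approximate) optimal transport plan
    return float((plan * cost).sum())

# Toy usage: pairwise squared distances between generated and real motion features.
x, y = np.random.randn(16, 8), np.random.randn(24, 8)
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
print(sinkhorn_distance(C))
```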
【2】 Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation 标题:用于图像生成的矢量量化建模中的离散扩散全局上下文 链接:https://arxiv.org/abs/2112.01799
作者:Minghui Hu,Yujie Wang,Tat-Jen Cham,Jianfei Yang,P. N. Suganthan 摘要:矢量量化变分自编码器(VQ-VAE)与作为生成部分的自回归模型相结合,在图像生成方面取得了高质量的结果。然而,在采样阶段,自回归模型将严格遵循逐行扫描顺序。这使得现有的VQ系列模型难以摆脱缺乏全局信息的陷阱。连续域中的去噪扩散概率模型(DDPM)已显示出在生成高质量图像的同时捕获全局上下文的能力。在离散状态空间中,一些工作已经证明了执行文本生成和低分辨率图像生成的潜力。我们证明,借助VQ-VAE提供的内容丰富的离散视觉码本,离散扩散模型也可以生成具有全局上下文的高保真图像,这弥补了经典自回归模型在像素空间上的不足。同时,离散VAE与扩散模型的集成解决了传统自回归模型尺寸过大的缺点,以及扩散模型在生成图像时采样过程需要过多时间的缺点。研究发现,生成图像的质量在很大程度上取决于离散视觉码本。大量实验表明,所提出的矢量量化离散扩散模型(VQ-DDM)能够以较低的复杂度实现与顶级方法相当的性能。在无需额外训练的图像修复任务上,它还显示出优于其他结合自回归模型的矢量量化方法的突出优势。 摘要:The integration of Vector Quantised Variational AutoEncoder (VQ-VAE) with autoregressive models as generation part has yielded high-quality results on image generation. However, the autoregressive models will strictly follow the progressive scanning order during the sampling phase. This leads the existing VQ series models to hardly escape the trap of lacking global information. Denoising Diffusion Probabilistic Models (DDPM) in the continuous domain have shown a capability to capture the global context, while generating high-quality images. In the discrete state space, some works have demonstrated the potential to perform text generation and low resolution image generation. We show that with the help of a content-rich discrete visual codebook from VQ-VAE, the discrete diffusion model can also generate high fidelity images with global context, which compensates for the deficiency of the classical autoregressive model along pixel space. Meanwhile, the integration of the discrete VAE with the diffusion model resolves the drawback of conventional autoregressive models being oversized, and the diffusion model which demands excessive time in the sampling process when generating images. It is found that the quality of the generated images is heavily dependent on the discrete visual codebook. Extensive experiments demonstrate that the proposed Vector Quantised Discrete Diffusion Model (VQ-DDM) is able to achieve comparable performance to top-tier methods with low complexity. It also demonstrates outstanding advantages over other vectors quantised with autoregressive models in terms of image inpainting tasks without additional training.
【3】 Multi-modal application: Image Memes Generation 标题:多模态应用:图像模因生成 链接:https://arxiv.org/abs/2112.01651
作者:Zhiyuan Liu,Chuanzheng Sun,Yuxin Jiang,Shiqi Jiang,Mei Ming 摘要:模因(meme)是一个有趣的词。互联网模因为我们对世界、媒体和我们自身生活的认知变化提供了独特的见解。如果你在网上冲浪足够长的时间,就会在网上的某个地方看到它。随着社交媒体平台的兴起和便捷的图像传播,图像模因已经声名鹊起。图像模因已经成为一种流行文化,在社交媒体、博客和公开消息的传播中发挥着重要作用。随着人工智能的发展和深度学习的广泛应用,自然语言处理(NLP)和计算机视觉(CV)也可以用来解决生活中的更多问题,包括模因生成。互联网模因通常采用图像的形式,通过结合模因模板(图像)和标题(自然语言句子)创建。在我们的项目中,我们提出了一种端到端编码器-解码器结构的meme生成器。对于给定的输入句子,我们使用模因模板选择模型来确定其表达的情感,并选择图像模板,然后通过meme字幕生成器生成字幕与模因。代码和模型可在GitHub上获取。 摘要:Meme is an interesting word. Internet memes offer unique insights into the changes in our perception of the world, the media and our own lives. If you surf the Internet for long enough, you will see it somewhere on the Internet. With the rise of social media platforms and convenient image dissemination, Image Meme has gained fame. Image memes have become a kind of pop culture and they play an important role in communication over social media, blogs, and open messages. With the development of artificial intelligence and the widespread use of deep learning, Natural Language Processing (NLP) and Computer Vision (CV) can also be used to solve more problems in life, including meme generation. An Internet meme commonly takes the form of an image and is created by combining a meme template (image) and a caption (natural language sentence). In our project, we propose an end-to-end encoder-decoder architecture meme generator. For a given input sentence, we use the Meme template selection model to determine the emotion it expresses and select the image template. Then captions and memes are generated through the meme caption generator. Code and models are available on GitHub.
【4】 Is RobustBench/AutoAttack a suitable Benchmark for Adversarial Robustness? 标题:RobustBench/AutoAttack是衡量对抗鲁棒性的合适基准吗? 链接:https://arxiv.org/abs/2112.01601
作者:Peter Lorenz,Dominik Strassel,Margret Keuper,Janis Keuper 备注:AAAI-22 AdvML Workshop ShortPaper 摘要:最近,RobustBench(Croce et al. 2020)已成为图像分类网络对抗鲁棒性的公认基准。在其最常报告的子任务中,RobustBench评估在CIFAR10上训练的神经网络在AutoAttack(Croce和Hein 2020b)下的对抗鲁棒性,其中l-inf扰动限制为eps=8/255。鉴于目前表现最好的模型的领先分数约为基线的60%,可以公平地将该基准描述为相当具有挑战性。尽管RobustBench在最近的文献中被普遍接受,但我们的目标是促进关于其适用性的讨论:RobustBench能否作为可推广到实际应用的鲁棒性关键指标。我们反对这一观点的论据有两方面,并得到了本文中大量实验的支持:我们认为,I)AutoAttack以l-inf、eps=8/255对数据施加的改动强度不切实际地大,以至于即使是简单的检测算法和人类观察者,也能获得接近完美的对抗样本检测率;我们还表明,其他攻击方法在获得类似成功率的同时更难被检测。II)低分辨率数据集(如CIFAR10)上的结果不能很好地推广到更高分辨率的图像,因为基于梯度的攻击似乎随着分辨率的增加而变得更容易被检测。 摘要:Recently, RobustBench (Croce et al. 2020) has become a widely recognized benchmark for the adversarial robustness of image classification networks. In its most commonly reported sub-task, RobustBench evaluates and ranks the adversarial robustness of trained neural networks on CIFAR10 under AutoAttack (Croce and Hein 2020b) with l-inf perturbations limited to eps = 8/255. With leading scores of the currently best performing models of around 60% of the baseline, it is fair to characterize this benchmark to be quite challenging. Despite its general acceptance in recent literature, we aim to foster discussion about the suitability of RobustBench as a key indicator for robustness which could be generalized to practical applications. Our line of argumentation against this is two-fold and supported by excessive experiments presented in this paper: We argue that I) the alternation of data by AutoAttack with l-inf, eps = 8/255 is unrealistically strong, resulting in close to perfect detection rates of adversarial samples even by simple detection algorithms and human observers. We also show that other attack methods are much harder to detect while achieving similar success rates. II) That results on low-resolution data sets like CIFAR10 do not generalize well to higher resolution images as gradient-based attacks appear to become even more detectable with increasing resolutions.
【5】 FuseDream: Training-Free Text-to-Image Generation with Improved CLIP GAN Space Optimization 标题:FuseDream:通过改进的CLIP GAN空间优化实现免训练的文本到图像生成 链接:https://arxiv.org/abs/2112.01573
作者:Xingchao Liu,Chengyue Gong,Lemeng Wu,Shujian Zhang,Hao Su,Qiang Liu 摘要:从自然语言指令生成图像是一项有趣但极具挑战性的任务。我们将预训练CLIP表示的能力与现成的图像生成器(GAN)相结合,在GAN的潜在空间中进行优化,以找到在给定输入文本下获得最大CLIP分数的图像,从而实现文本到图像的生成。与从零开始训练文本到图像生成模型的传统方法相比,CLIP GAN方法无需训练、零样本(zero-shot),并且可以轻松地用不同的生成器进行定制。然而,在GAN空间中优化CLIP分数是一个极具挑战性的优化问题,Adam等现成的优化器无法产生令人满意的结果。在这项工作中,我们提出了FuseDream管道,它通过三个关键技术改进了CLIP GAN方法:1)一个AugCLIP分数,它通过在图像上引入随机增强来增强CLIP目标的鲁棒性;2)一种新颖的初始化和过参数化优化策略,使我们能够有效地在GAN空间的非凸地形中导航;3)一种组合生成技术,通过利用一种新的双层优化公式,可以组合多幅图像以扩展GAN空间并克服数据偏差。在不同输入文本的提示下,FuseDream可以生成具有不同对象、背景、艺术风格,甚至未出现在所用GAN训练数据中的新颖反事实概念的高质量图像。从定量上看,FuseDream生成的图像在MS COCO数据集上取得顶级的Inception分数和FID分数,而无需额外的架构设计或训练。我们的代码公开于 https://github.com/gnobitab/FuseDream 。 摘要:Generating images from natural language instructions is an intriguing yet highly challenging task. We approach text-to-image generation by combining the power of the pretrained CLIP representation with an off-the-shelf image generator (GANs), optimizing in the latent space of GAN to find images that achieve maximum CLIP score with the given input text. Compared to traditional methods that train generative models from text to image starting from scratch, the CLIP GAN approach is training-free, zero-shot and can be easily customized with different generators. However, optimizing CLIP score in the GAN space casts a highly challenging optimization problem and off-the-shelf optimizers such as Adam fail to yield satisfying results. In this work, we propose a FuseDream pipeline, which improves the CLIP GAN approach with three key techniques: 1) an AugCLIP score which robustifies the CLIP objective by introducing random augmentation on image. 2) a novel initialization and over-parameterization strategy for optimization which allows us to efficiently navigate the non-convex landscape in GAN space. 3) a composed generation technique which, by leveraging a novel bi-level optimization formulation, can compose multiple images to extend the GAN space and overcome the data-bias. When prompted by different input text, FuseDream can generate high-quality images with varying objects, backgrounds, artistic styles, even novel counterfactual concepts that do not appear in the training data of the GAN we use. Quantitatively, the images generated by FuseDream yield top-level Inception score and FID score on MS COCO dataset, without additional architecture design or training. Our code is publicly available at https://github.com/gnobitab/FuseDream.
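AugCLIP分数的思想是:对候选图像施加多次随机增强,再对CLIP相似度取平均,从而使优化目标对对抗性扰动更鲁棒。以下PyTorch草图仅为示意:image_encoder与text_embedding均为占位(实际管线应替换为真实的CLIP模型),此处的加噪增强也是假设,论文使用的是更强的可微图像增强。

```python
import torch

def aug_clip_score(image, text_embedding, image_encoder, n_aug: int = 8):
    """Average cosine similarity over randomly augmented copies of the image."""
    scores = []
    for _ in range(n_aug):
        # Placeholder augmentation (additive noise); the real pipeline uses
        # stronger differentiable image augmentations.
        augmented = image + 0.05 * torch.randn_like(image)
        img_emb = image_encoder(augmented)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        scores.append((img_emb * text_embedding).sum(dim=-1))
    return torch.stack(scores).mean(dim=0)

# Toy usage with stand-in encoders (the real pipeline plugs in CLIP here).
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
text_emb = torch.nn.functional.normalize(torch.randn(64), dim=-1)
score = aug_clip_score(torch.randn(1, 3, 32, 32), text_emb, encoder)
```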
NAS模型搜索(1篇)
【1】 Data-Free Neural Architecture Search via Recursive Label Calibration 标题:基于递归标签校准的无数据神经结构搜索 链接:https://arxiv.org/abs/2112.02086
作者:Zechun Liu,Zhiqiang Shen,Yun Long,Eric Xing,Kwang-Ting Cheng,Chas Leichner 备注:Technical report 摘要:本文旨在探讨在不使用任何原始训练数据、仅给定预训练模型的情况下进行神经结构搜索(NAS)的可行性。在现实场景中,这对于隐私保护、避免偏见等是重要的设定。为了实现这一点,我们首先通过从预训练的深度神经网络中恢复知识来合成可用数据,然后使用合成数据及其预测的软标签来指导神经结构搜索。我们发现,NAS任务需要合成的数据(我们这里针对图像域)具有足够的语义、多样性,并且与自然图像的域差距最小。对于语义,我们提出递归标签校准来产生信息量更大的输出。对于多样性,我们提出了一种区域更新策略,以生成更加多样和语义丰富的合成数据。对于最小域差距,我们使用输入和特征级正则化来模拟原始数据在潜在空间中的分布。我们用三种流行的NAS算法来实例化我们提出的框架:DARTS、ProxylessNAS和SPOS。令人惊讶的是,我们的结果表明,使用我们的合成数据进行搜索所发现的架构,其精度可与在原始数据上搜索发现的架构相当甚至更高;由此我们首次得出结论:如果合成方法设计得当,NAS无需访问原始数据(或称自然数据)也能有效进行。我们的代码将公开提供。 摘要:This paper aims to explore the feasibility of neural architecture search (NAS) given only a pre-trained model without using any original training data. This is an important circumstance for privacy protection, bias avoidance, etc., in real-world scenarios. To achieve this, we start by synthesizing usable data through recovering the knowledge from a pre-trained deep neural network. Then we use the synthesized data and their predicted soft-labels to guide neural architecture search. We identify that the NAS task requires the synthesized data (we target at image domain here) with enough semantics, diversity, and a minimal domain gap from the natural images. For semantics, we propose recursive label calibration to produce more informative outputs. For diversity, we propose a regional update strategy to generate more diverse and semantically-enriched synthetic data. For minimal domain gap, we use input and feature-level regularization to mimic the original data distribution in latent space. We instantiate our proposed framework with three popular NAS algorithms: DARTS, ProxylessNAS and SPOS. Surprisingly, our results demonstrate that the architectures discovered by searching with our synthetic data achieve accuracy that is comparable to, or even higher than, architectures discovered by searching from the original ones, for the first time, deriving the conclusion that NAS can be done effectively with no need of access to the original or called natural data if the synthesis method is well designed. Our code will be publicly available.
人脸|人群计数(1篇)
【1】 Total Scale: Face-to-Body Detail Reconstruction from Sparse RGBD Sensors 标题:总尺度:基于稀疏RGBD传感器的从人脸到全身的细节重建 链接:https://arxiv.org/abs/2112.02082
作者:Zheng Dong,Ke Xu,Ziheng Duan,Hujun Bao,Weiwei Xu,Rynson W. H. Lau 备注:14 pages, 16 figures, 4 tables 摘要:虽然使用像素对齐隐式函数(PIFu)的三维人体重建方法发展迅速,但我们观察到重建细节的质量仍然不令人满意。在基于PIFu的重建结果中,平坦的面部表面经常出现。为此,我们提出了一种双尺度PIFu表示方法来提高重建人脸细节的质量。具体地说,我们使用两个MLP分别表示人脸和人体的像素。专用于三维人脸重建的MLP可以增加网络容量,并降低人脸细节重建的难度,如前一种单尺度PIFu表示。为了纠正拓扑错误,我们利用3个RGBD传感器捕获多视图RGBD数据作为网络的输入,这是一种稀疏、轻量级的捕获设置。由于深度噪声严重影响重建结果,我们设计了一个深度细化模块,在输入RGB图像的引导下降低原始深度的噪声。我们还提出了一种自适应融合方案来融合预测的身体和面部占据场,以消除其边界处的不连续伪影。实验证明了我们的方法在重建生动的面部细节和变形的身体形状方面的有效性,并验证了它优于最先进的方法。 摘要:While the 3D human reconstruction methods using Pixel-aligned implicit function (PIFu) develop fast, we observe that the quality of reconstructed details is still not satisfactory. Flat facial surfaces frequently occur in the PIFu-based reconstruction results. To this end, we propose a two-scale PIFu representation to enhance the quality of the reconstructed facial details. Specifically, we utilize two MLPs to separately represent the PIFus for the face and human body. An MLP dedicated to the reconstruction of 3D faces can increase the network capacity and reduce the difficulty of the reconstruction of facial details as in the previous one-scale PIFu representation. To remedy the topology error, we leverage 3 RGBD sensors to capture multiview RGBD data as the input to the network, a sparse, lightweight capture setting. Since the depth noise severely influences the reconstruction results, we design a depth refinement module to reduce the noise of the raw depths under the guidance of the input RGB images. We also propose an adaptive fusion scheme to fuse the predicted occupancy field of the body and face to eliminate the discontinuity artifact at their boundaries. Experiments demonstrate the effectiveness of our approach in reconstructing vivid facial details and deforming body shapes, and verify its superiority over state-of-the-art methods.
跟踪(1篇)
【1】 Probabilistic Tracking with Deep Factors 标题:具有深度因子的概率跟踪 链接:https://arxiv.org/abs/2112.01609
作者:Fan Jiang,Andrew Marmon,Ildebrando De Courten,Marc Rasi,Frank Dellaert 摘要:在计算机视觉的许多应用中,通过融合来自多个源(其中2D和3D图像仅为一个)的数据,精确估计物体随时间的轨迹非常重要。在本文中,我们展示了如何在基于因子图的概率跟踪框架中使用深度特征编码和特征上的生成密度。我们提出了一个似然模型,该模型将学习的特征编码器与它们上的生成密度相结合,两者都以有监督的方式进行训练。我们还通过使用图像分类模型直接推断概率进行了实验,这些模型输入到似然公式中。这些模型用于实现添加到因子图中的深层因子,以补充表示特定领域知识的其他因子,如运动模型和/或其他先验信息。然后在一个非线性最小二乘跟踪框架中对这些因素进行优化,该框架采用具有高斯先验的扩展卡尔曼平滑器的形式。我们的似然模型的一个关键特征是,它利用被跟踪目标姿态的李群特性对图像块应用特征编码,该特征编码是通过受空间变换网络启发的可微扭曲函数提取的。为了说明所提出的方法,我们在一个具有挑战性的昆虫社会行为数据集上对其进行了评估,并表明使用深度特征确实优于在此设置中使用的这些早期线性外观模型。 摘要:In many applications of computer vision it is important to accurately estimate the trajectory of an object over time by fusing data from a number of sources, of which 2D and 3D imagery is only one. In this paper, we show how to use a deep feature encoding in conjunction with generative densities over the features in a factor-graph based, probabilistic tracking framework. We present a likelihood model that combines a learned feature encoder with generative densities over them, both trained in a supervised manner. We also experiment with directly inferring probability through the use of image classification models that feed into the likelihood formulation. These models are used to implement deep factors that are added to the factor graph to complement other factors that represent domain-specific knowledge such as motion models and/or other prior information. Factors are then optimized together in a non-linear least-squares tracking framework that takes the form of an Extended Kalman Smoother with a Gaussian prior. A key feature of our likelihood model is that it leverages the Lie group properties of the tracked target's pose to apply the feature encoding on an image patch, extracted through a differentiable warp function inspired by spatial transformer networks. To illustrate the proposed approach we evaluate it on a challenging social insect behavior dataset, and show that using deep features does outperform these earlier linear appearance models used in this setting.
图像视频检索|Re-id相关(1篇)
【1】 ROCA: Robust CAD Model Retrieval and Alignment from a Single Image 标题:ROCA:单幅图像的稳健CAD模型检索与对齐 链接:https://arxiv.org/abs/2112.01988
作者:Can Gümeli,Angela Dai,Matthias Nießner 摘要:我们介绍了ROCA,一种新颖的端到端方法,它可以从形状数据库检索三维CAD模型并将其与单个输入图像对齐。这使得能够从2D RGB观察中对观察到的场景进行3D感知,其特征是轻量级、紧凑、干净的CAD表示。我们方法的核心是基于密集2D-3D对象对应和Procrustes对齐的可微对齐优化。因此,ROCA可以提供可靠的CAD对齐,同时通过利用2D-3D对应关系来学习几何相似的CAD模型来通知CAD检索。对来自ScanNet的具有挑战性的真实图像进行的实验表明,ROCA在支持检索的CAD对齐精度方面显著提高,从9.5%提高到17.6%。 摘要:We present ROCA, a novel end-to-end approach that retrieves and aligns 3D CAD models from a shape database to a single input image. This enables 3D perception of an observed scene from a 2D RGB observation, characterized as a lightweight, compact, clean CAD representation. Core to our approach is our differentiable alignment optimization based on dense 2D-3D object correspondences and Procrustes alignment. ROCA can thus provide a robust CAD alignment while simultaneously informing CAD retrieval by leveraging the 2D-3D correspondences to learn geometrically similar CAD models. Experiments on challenging, real-world imagery from ScanNet show that ROCA significantly improves on state of the art, from 9.5% to 17.6% in retrieval-aware CAD alignment accuracy.
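摘要中的Procrustes对齐有经典闭式解:给定成对的对应点,可用SVD(Kabsch算法)求出最优旋转与平移。以下numpy草图演示这一标准步骤;ROCA在其上构建了可微、由密集2D-3D对应加权的版本,此处不做复现。

```python
import numpy as np

def procrustes_align(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid alignment (Kabsch): find R, t with dst ≈ R @ src + t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)           # SVD of the covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

# Toy usage: recover a known rotation from noiseless correspondences.
src = np.random.randn(50, 3)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
R, t = procrustes_align(src, src @ R_true.T + 1.0)
```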
点云|SLAM|雷达|激光|深度RGBD相关(3篇)
【1】 Bridging the Gap: Point Clouds for Merging Neurons in Connectomics 标题:跨越鸿沟:连接组学中用于合并神经元的点云 链接:https://arxiv.org/abs/2112.02039
作者:Jules Berman,Dmitri B. Chklovskii,Jingpeng Wu 备注:10 pages, 6 figures, MIDL 2022 摘要:在连接组学领域,一个主要问题是三维神经元分割。尽管基于深度学习的方法已经取得了显著的精度,但误差仍然存在,特别是在图像有缺陷的区域。一种常见的缺陷是连续多张图像切片缺失:此时数据沿某一轴丢失,由此产生的神经元分割在缺口两侧被割裂。为了解决这个问题,我们提出了一种基于神经元点云表示的新方法。我们将其表述为一个分类问题,并训练CurveNet(一种最先进的点云分类模型)来确定应该合并哪些神经元。我们表明,我们的方法不仅表现强劲,而且可以合理地扩展到远超其他方法所尝试解决的缺口。此外,我们的点云表示在数据方面非常高效,能够以对其他方法而言不可行的数据量保持高性能。我们认为,这表明将点云表示用于其他校对(proofreading)任务是可行的。 摘要:In the field of Connectomics, a primary problem is that of 3D neuron segmentation. Although Deep Learning based methods have achieved remarkable accuracy, errors still exist, especially in regions with image defects. One common type of defect is that of consecutive missing image sections. Here data is lost along some axis, and the resulting neuron segmentations are split across the gap. To address this problem, we propose a novel method based on point cloud representations of neurons. We formulate this as a classification problem and train CurveNet, a state-of-the-art point cloud classification model, to identify which neurons should be merged. We show that our method not only performs strongly but scales reasonably to gaps well beyond what other methods have attempted to address. Additionally, our point cloud representations are highly efficient in terms of data, maintaining high performance with an amount of data that would be unfeasible for other methods. We believe that this is an indicator of the viability of using point clouds representations for other proofreading tasks.
【2】 Graph-Guided Deformation for Point Cloud Completion 标题:基于图形引导的点云补全变形算法 链接:https://arxiv.org/abs/2112.01840
作者:Jieqi Shi,Lingyun Xu,Liang Heng,Shaojie Shen 备注:RAL with IROS 2021 摘要:长期以来,点云补全任务一直被视为纯生成任务:在通过编码器获得全局形状编码后,利用网络预先学习到的形状先验生成完整的点云。然而,这样的模型会不可取地偏向于先验的平均物体,并且天然难以拟合几何细节。本文提出了一种图引导变形网络,它分别将输入数据和中间生成结果视为控制点和支撑点,并用图卷积网络(GCN)引导的优化来建模点云补全任务。我们的主要见解是通过网格变形方法模拟最小二乘拉普拉斯变形过程,这为建模几何细节的变化带来了适应性。通过这种方式,我们也缩小了补全任务和网格变形算法之间的差距。据我们所知,我们是第一个通过GCN引导变形模仿传统图形学算法来优化点云补全任务的工作。我们在模拟的室内数据集ShapeNet、室外数据集KITTI和我们自行收集的自动驾驶数据集Pandar40上进行了广泛的实验。结果表明,在三维点云补全任务中,我们的方法优于现有的最新算法。 摘要:For a long time, the point cloud completion task has been regarded as a pure generation task. After obtaining the global shape code through the encoder, a complete point cloud is generated using the shape prior learnt by the networks. However, such models are undesirably biased towards prior average objects and inherently limited to fit geometry details. In this paper, we propose a Graph-Guided Deformation Network, which respectively regards the input data and intermediate generation as controlling and supporting points, and models the optimization guided by a graph convolutional network(GCN) for the point cloud completion task. Our key insight is to simulate the least square Laplacian deformation process via mesh deformation methods, which brings adaptivity for modeling variation in geometry details. By this means, we also reduce the gap between the completion task and the mesh deformation algorithms. As far as we know, we are the first to refine the point cloud completion task by mimicking traditional graphics algorithms with GCN-guided deformation. We have conducted extensive experiments on both the simulated indoor dataset ShapeNet, outdoor dataset KITTI, and our self-collected autonomous driving dataset Pandar40. The results show that our method outperforms the existing state-of-the-art algorithms in the 3D point cloud completion task.
【3】 Deep Depth from Focus with Differential Focus Volume 标题:基于差分聚焦体积的深度从聚焦求深度方法 链接:https://arxiv.org/abs/2112.01712
作者:Fengting Yang,Xiaolei Huang,Zihan Zhou 备注:17 pages 摘要:从聚焦求深度(DFF)是一种利用相机对焦变化推断深度的技术。在这项工作中,我们提出了一种卷积神经网络(CNN)来寻找聚焦堆栈中的最佳聚焦像素,并从聚焦估计中推断深度。该网络的关键创新在于新型深度差分聚焦体积(DFV)。通过对不同焦距上的堆叠特征计算一阶导数,DFV能够同时捕获焦点和上下文信息以进行焦点分析。此外,我们还为聚焦估计引入了概率回归机制,以处理稀疏采样的聚焦堆栈,并为最终预测提供不确定性估计。综合实验表明,该模型在多个数据集上达到了最先进的性能,且具有良好的泛化能力和较快的速度。 摘要:Depth-from-focus (DFF) is a technique that infers depth using the focus change of a camera. In this work, we propose a convolutional neural network (CNN) to find the best-focused pixels in a focal stack and infer depth from the focus estimation. The key innovation of the network is the novel deep differential focus volume (DFV). By computing the first-order derivative with the stacked features over different focal distances, DFV is able to capture both the focus and context information for focus analysis. Besides, we also introduce a probability regression mechanism for focus estimation to handle sparsely sampled focal stacks and provide uncertainty estimation to the final prediction. Comprehensive experiments demonstrate that the proposed model achieves state-of-the-art performance on multiple datasets with good generalizability and fast speed.
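"差分聚焦体积"的基本操作,是沿焦距维度对堆叠特征取一阶差分。以下给出一个与论文实现无关的最小PyTorch草图,仅演示该构造。

```python
import torch

def differential_focus_volume(feats: torch.Tensor) -> torch.Tensor:
    """First-order difference of stacked features along the focal dimension.

    feats: (B, S, C, H, W) features for a focal stack with S focal distances.
    Returns a (B, S-1, C, H, W) volume encoding how sharpness/context change
    between neighboring focal slices.
    """
    return feats[:, 1:] - feats[:, :-1]

# Toy usage: a 10-slice focal stack of 16-channel feature maps.
stack_feats = torch.randn(2, 10, 16, 32, 32)
dfv = differential_focus_volume(stack_feats)  # shape (2, 9, 16, 32, 32)
```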
3D|3D重建等相关(1篇)
【1】 Geometric Feature Learning for 3D Meshes 标题:三维网格的几何特征学习 链接:https://arxiv.org/abs/2112.01801
作者:Huan Lei,Naveed Akhtar,Mubarak Shah,Ajmal Mian 备注:Submitted to TPAMI 摘要:三维网格的几何特征学习是计算机图形学的核心,在许多视觉应用中具有重要意义。然而,由于缺乏必要的操作和/或其高效实现,深度学习目前在异构三维网格的层次化建模方面滞后。在本文中,我们提出了一系列模块化操作,用于在异构三维网格上进行有效的几何深度学习。这些操作包括网格卷积、(反)池化和高效网格抽取。我们提供这些操作的开源实现,统称为Picasso。Picasso的网格抽取模块是GPU加速的,可以动态处理一批网格以用于深度学习。我们的(反)池化操作为不同分辨率网络层中新创建的神经元计算特征。我们的网格卷积包括facet2vertex、vertex2facet和facet2facet卷积,它们利用vMF混合和重心插值来结合模糊建模。利用Picasso的模块化操作,我们提出了一种新的层次化神经网络PicassoNet-II,用于从三维网格中学习高度有区分性的特征。PicassoNet-II接受网格面的原始几何和精细纹理作为输入特征,并处理完整的场景网格。我们的网络在各种基准上的形状分析和场景解析任务中取得了极具竞争力的性能。我们在GitHub上发布了Picasso和PicassoNet-II:https://github.com/EnyaHermite/Picasso。 摘要:Geometric feature learning for 3D meshes is central to computer graphics and highly important for numerous vision applications. However, deep learning currently lags in hierarchical modeling of heterogeneous 3D meshes due to the lack of required operations and/or their efficient implementations. In this paper, we propose a series of modular operations for effective geometric deep learning over heterogeneous 3D meshes. These operations include mesh convolutions, (un)pooling and efficient mesh decimation. We provide open source implementation of these operations, collectively termed Picasso. The mesh decimation module of Picasso is GPU-accelerated, which can process a batch of meshes on-the-fly for deep learning. Our (un)pooling operations compute features for newly-created neurons across network layers of varying resolution. Our mesh convolutions include facet2vertex, vertex2facet, and facet2facet convolutions that exploit vMF mixture and Barycentric interpolation to incorporate fuzzy modelling. Leveraging the modular operations of Picasso, we contribute a novel hierarchical neural network, PicassoNet-II, to learn highly discriminative features from 3D meshes. PicassoNet-II accepts primitive geometrics and fine textures of mesh facets as input features, while processing full scene meshes. Our network achieves highly competitive performance for shape analysis and scene parsing on a variety of benchmarks. We release Picasso and PicassoNet-II on GitHub: https://github.com/EnyaHermite/Picasso.
其他神经网络|深度学习|模型|建模(5篇)
【1】 Optimization of phase-only holograms calculated with scaled diffraction calculation through deep neural networks 标题:用深度神经网络优化比例衍射计算的纯位相全息图 链接:https://arxiv.org/abs/2112.01970
作者:Yoshiyuki Ishii,Tomoyoshi Shimobaba,David Blinder,Tobias Birnbaum,Peter Schelkens,Takashi Kakue,Tomoyoshi Ito 摘要:计算机生成全息图(CGH)用于全息三维显示和全息投影。由于重建图像的振幅难以控制,使用纯相位CGHs重建图像的质量会降低。迭代优化方法,如Gerchberg-Saxton(GS)算法,是提高图像质量的一种选择。他们以迭代方式优化CGH,以获得更高的图像质量。然而,这种迭代计算非常耗时,而且图像质量的改善往往停滞不前。近年来,基于深度学习的全息计算被提出。深度神经网络直接从输入图像数据推断CGH。然而,它仅限于再现与全息图大小相同的图像。在本研究中,我们使用深度学习来优化使用比例衍射计算和随机无相位方法生成的纯相位CGH。通过将随机无相位法与比例衍射计算相结合,可以处理比全息图更大的可缩放再现图像。与GS算法相比,该方法优化了高质量和高速度。 摘要:Computer-generated holograms (CGHs) are used in holographic three-dimensional (3D) displays and holographic projections. The quality of the reconstructed images using phase-only CGHs is degraded because the amplitude of the reconstructed image is difficult to control. Iterative optimization methods such as the Gerchberg-Saxton (GS) algorithm are one option for improving image quality. They optimize CGHs in an iterative fashion to obtain a higher image quality. However, such iterative computation is time consuming, and the improvement in image quality is often stagnant. Recently, deep learning-based hologram computation has been proposed. Deep neural networks directly infer CGHs from input image data. However, it is limited to reconstructing images that are the same size as the hologram. In this study, we use deep learning to optimize phase-only CGHs generated using scaled diffraction computations and the random phase-free method. By combining the random phase-free method with the scaled diffraction computation, it is possible to handle a zoomable reconstructed image larger than the hologram. In comparison to the GS algorithm, the proposed method optimizes both high quality and speed.
【2】 Frame Averaging for Equivariant Shape Space Learning 标题:用于等变形状空间学习的帧平均方法 链接:https://arxiv.org/abs/2112.01741
作者:Matan Atzmon,Koki Nagano,Sanja Fidler,Sameh Khamis,Yaron Lipman 摘要:形状空间学习的任务是将训练集中的形状映射到一个具有良好泛化性质的潜在表示空间,并能从该空间映射回来。通常,真实世界的形状集合具有对称性,而对称性可以定义为不改变形状本质的变换。将对称性纳入形状空间学习的一种自然方法,是要求到形状空间的映射(编码器)和来自形状空间的映射(解码器)对相关对称性等变。在本文中,我们通过引入两个贡献,提出了一个在编码器和解码器中纳入等变性的框架:(i)调整最近的帧平均(FA)框架,以构建通用、高效且具有最大表达能力的等变自编码器;(ii)构造对作用于形状不同部位的分段欧氏运动等变的自编码器。据我们所知,这是首个完全分段欧氏等变的自编码器构造。我们框架的训练很简单:它使用标准重建损失,不需要引入新的损失。我们的体系结构由标准(主干)体系结构构建,并辅以适当的帧平均以使其等变。在使用隐式神经表示的刚性形状数据集和使用基于网格的神经网络的关节形状数据集上测试我们的框架,显示出对未见测试形状的最先进泛化能力,大幅超越相关基线。特别是,我们的方法在泛化到未见过的关节姿态方面有显著的改进。 摘要:The task of shape space learning involves mapping a train set of shapes to and from a latent representation space with good generalization properties. Often, real-world collections of shapes have symmetries, which can be defined as transformations that do not change the essence of the shape. A natural way to incorporate symmetries in shape space learning is to ask that the mapping to the shape space (encoder) and mapping from the shape space (decoder) are equivariant to the relevant symmetries. In this paper, we present a framework for incorporating equivariance in encoders and decoders by introducing two contributions: (i) adapting the recent Frame Averaging (FA) framework for building generic, efficient, and maximally expressive Equivariant autoencoders; and (ii) constructing autoencoders equivariant to piecewise Euclidean motions applied to different parts of the shape. To the best of our knowledge, this is the first fully piecewise Euclidean equivariant autoencoder construction. Training our framework is simple: it uses standard reconstruction losses and does not require the introduction of new losses. Our architectures are built of standard (backbone) architectures with the appropriate frame averaging to make them equivariant. Testing our framework on both rigid shapes dataset using implicit neural representations, and articulated shape datasets using mesh-based neural networks show state-of-the-art generalization to unseen test shapes, improving relevant baselines by a large margin. In particular, our method demonstrates significant improvement in generalizing to unseen articulated poses.
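帧平均(FA)框架的通用形式可写成如下标准公式(取自FA的一般定义,符号仅为示意):对群G及其在输入/输出空间上的表示ρ1、ρ2,给定帧F(X)⊂G,

```latex
% Frame averaging: average a backbone \phi over an (input-dependent) frame F(X).
\Phi(X) \;=\; \frac{1}{|F(X)|} \sum_{g \in F(X)} \rho_2(g)\, \phi\!\left(\rho_1(g)^{-1} X\right).
% If the frame is equivariant, i.e. F(\rho_1(h)X) = h\,F(X), then
% \Phi(\rho_1(h)X) = \rho_2(h)\,\Phi(X), so \Phi is equivariant by construction.
```

当帧本身等变时,任意主干φ经帧平均后即获得等变性;这正是摘要中"由标准主干结构加适当帧平均构建"的含义。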
【3】 Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research 标题:缩减、重用和再循环:机器学习研究中的数据集寿命 链接:https://arxiv.org/abs/2112.01716
作者:Bernard Koch,Emily Denton,Alex Hanna,Jacob G. Foster 备注:35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia 摘要:基准数据集在机器学习研究的组织中起着核心作用。它们使研究人员围绕共同的研究问题协作,并作为实现共同目标进展的衡量标准。尽管基准测试实践在这一领域起着基础性作用,但在机器学习各子社区内部或之间,对基准数据集使用和重用的动态关注相对较少。在本文中,我们深入研究了这些动态。我们研究了2015-2020年间数据集使用模式在机器学习各子社区之间以及随时间的差异。我们发现:任务社区内的研究日益集中于越来越少的数据集;来自其他任务的数据集被大量采用;并且整个领域都集中使用由少数精英机构的研究人员引入的数据集。我们的研究结果对科学评估、人工智能伦理和领域内的公平/准入具有影响。 摘要:Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.
【4】 Structure-Aware Multi-Hop Graph Convolution for Graph Neural Networks 标题:图神经网络的结构感知多跳图卷积 链接:https://arxiv.org/abs/2112.01714
作者:Yang Li,Yuichi Tanaka 摘要:在本文中,我们提出了一种空间图卷积(GC)来对图上的信号进行分类。现有的GC方法仅限于使用特征空间中的结构信息。此外,单步GC仅使用目标节点一跳邻居上的特征。在本文中,我们提出了两种提高GC性能的方法:1)利用特征空间中的结构信息,2)在一个GC步骤中利用多跳信息。在第一种方法中,我们在特征空间中定义了三种结构特征:特征角度、特征距离和关系嵌入。第二种方法在一步GC中聚合多跳邻居的节点特征。这两种方法可以同时使用。我们还提出了集成上述GC的图神经网络(GNN),用于对3D点云和引文网络中的节点进行分类。在实验中,所提出的GNN表现出比现有方法更高的分类精度。 摘要:In this paper, we propose a spatial graph convolution (GC) to classify signals on a graph. Existing GC methods are limited to using the structural information in the feature space. Additionally, the single step of GCs only uses features on the one-hop neighboring nodes from the target node. In this paper, we propose two methods to improve the performance of GCs: 1) Utilizing structural information in the feature space, and 2) exploiting the multi-hop information in one GC step. In the first method, we define three structural features in the feature space: feature angle, feature distance, and relational embedding. The second method aggregates the node-wise features of multi-hop neighbors in a GC. Both methods can be simultaneously used. We also propose graph neural networks (GNNs) integrating the proposed GC for classifying nodes in 3D point clouds and citation networks. In experiments, the proposed GNNs exhibited a higher classification accuracy than existing methods.
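在单步图卷积内聚合多跳信息,常见做法是使用归一化邻接矩阵的幂。以下numpy草图演示在一步内拼接1..K跳邻居聚合特征的思路(摘要中的三种结构特征此处未展示,实现细节为假设)。

```python
import numpy as np

def multi_hop_aggregate(adj: np.ndarray, x: np.ndarray, k_hops: int = 3) -> np.ndarray:
    """Concatenate features aggregated from 1..K-hop neighborhoods in one step."""
    deg = adj.sum(1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-8)))
    a_norm = d_inv_sqrt @ adj @ d_inv_sqrt        # symmetric normalization
    hops, h = [], x
    for _ in range(k_hops):
        h = a_norm @ h                            # one more hop of propagation
        hops.append(h)
    return np.concatenate(hops, axis=1)           # (N, K * F)

# Toy usage: a 5-node ring graph with 4-dimensional node features.
A = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
out = multi_hop_aggregate(A, np.random.randn(5, 4), k_hops=2)  # (5, 8)
```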
【5】 Learning to Detect Every Thing in an Open World 标题:学会探测开放世界中的每一件事 链接:https://arxiv.org/abs/2112.01698
作者:Kuniaki Saito,Ping Hu,Trevor Darrell,Kate Saenko 备注:Project page is available at this https URL 摘要:许多开放世界的应用需要检测新对象,但最先进的对象检测和实例分割网络并不擅长这项任务。关键问题在于,它们假设没有任何注释的区域应作为负样本加以抑制,这使模型学会将未注释的对象视为背景。为了解决这个问题,我们提出了一种简单但功能惊人的数据增强和训练方案,我们称之为"学习检测一切"(LDET)。为了避免抑制隐藏对象(即可见但未标记的背景对象),我们将带注释的对象粘贴到从原始图像的小区域采样的背景图像上。由于仅在此类合成增强图像上训练会出现域偏移,我们将训练分为两部分:1)在增强图像上训练区域分类和回归头,2)在原始图像上训练掩码头。这样,模型既能很好地泛化到真实图像,又不会学着将隐藏对象分类为背景。LDET在开放世界实例分割任务的许多数据集上带来显著改进,在COCO的跨类别泛化以及UVO和Cityscapes的跨数据集评估上均优于基线。 摘要:Many open-world applications require the detection of novel objects, yet state-of-the-art object detection and instance segmentation networks do not excel at this task. The key issue lies in their assumption that regions without any annotations should be suppressed as negatives, which teaches the model to treat the unannotated objects as background. To address this issue, we propose a simple yet surprisingly powerful data augmentation and training scheme we call Learning to Detect Every Thing (LDET). To avoid suppressing hidden objects, background objects that are visible but unlabeled, we paste annotated objects on a background image sampled from a small region of the original image. Since training solely on such synthetically augmented images suffers from domain shift, we decouple the training into two parts: 1) training the region classification and regression head on augmented images, and 2) training the mask heads on original images. In this way, a model does not learn to classify hidden objects as background while generalizing well to real images. LDET leads to significant improvements on many datasets in the open world instance segmentation task, outperforming baselines on cross-category generalization on COCO, as well as cross-dataset evaluation on UVO and Cityscapes.
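LDET的数据增强可概括为:从原图小区域采样背景、铺成画布,再将带标注的前景对象贴回。以下numpy草图仅示意这一合成流程;论文中背景块的选取与放缩策略更讲究,此处以平铺代替,属于假设。

```python
import numpy as np

def ldet_style_augment(image, obj_mask, crop_size=16):
    """Paste annotated foreground onto a background canvas sampled from a small region."""
    h, w = image.shape[:2]
    # 1) Sample a small background patch (likely free of hidden objects).
    y0 = np.random.randint(0, h - crop_size)
    x0 = np.random.randint(0, w - crop_size)
    patch = image[y0:y0 + crop_size, x0:x0 + crop_size]
    # 2) Tile the patch up to full resolution as the new background.
    reps = (h // crop_size + 1, w // crop_size + 1, 1)
    background = np.tile(patch, reps)[:h, :w]
    # 3) Paste the annotated object back; unlabeled objects are gone by design.
    return np.where(obj_mask[..., None], image, background)

# Toy usage: a 64x64 image with a square annotated object mask.
img = np.random.rand(64, 64, 3)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True
augmented = ldet_style_augment(img, mask)
```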
其他(14篇)
【1】 Coupling Vision and Proprioception for Navigation of Legged Robots 标题:腿式机器人导航中的视觉与本体感觉耦合 链接:https://arxiv.org/abs/2112.02094
作者:Zipeng Fu,Ashish Kumar,Ananye Agarwal,Haozhi Qi,Jitendra Malik,Deepak Pathak 备注:Website and videos at this https URL 摘要:我们利用视觉和本体感觉的互补优势,在腿式机器人中实现点目标导航。腿式系统比轮式机器人能够穿越更复杂的地形,但为了充分利用这一能力,我们需要让导航系统中的高级路径规划器了解低级移动策略在不同地形上的行走能力。我们通过使用本体感觉反馈来估计行走策略的安全操作极限,并感知视觉可能错过的意外障碍物和地形特性(如地面的平滑度或柔软度),从而实现这一目标。导航系统使用机载摄像头生成占用地图和相应的代价地图,以到达目标。然后,FMM(快速行进法)规划器生成目标路径。速度命令生成器以该路径为输入,并结合来自安全顾问的附加约束(意外障碍物与由地形决定的速度限制),为移动策略生成期望速度。与轮式机器人(LoCoBot)基线以及其他高层规划与底层控制相互割裂的基线相比,我们显示出了优越的性能。我们还展示了系统在带有机载传感器和计算单元的四足机器人上的实际部署。视频见 https://navigation-locomotion.github.io/camera-ready 摘要:We exploit the complementary strengths of vision and proprioception to achieve point goal navigation in a legged robot. Legged systems are capable of traversing more complex terrain than wheeled robots, but to fully exploit this capability, we need the high-level path planner in the navigation system to be aware of the walking capabilities of the low-level locomotion policy on varying terrains. We achieve this by using proprioceptive feedback to estimate the safe operating limits of the walking policy, and to sense unexpected obstacles and terrain properties like smoothness or softness of the ground that may be missed by vision. The navigation system uses onboard cameras to generate an occupancy map and a corresponding cost map to reach the goal. The FMM (Fast Marching Method) planner then generates a target path. The velocity command generator takes this as input to generate the desired velocity for the locomotion policy using as input additional constraints, from the safety advisor, of unexpected obstacles and terrain determined speed limits. We show superior performance compared to wheeled robot (LoCoBot) baselines, and other baselines which have disjoint high-level planning and low-level control. We also show the real-world deployment of our system on a quadruped robot with onboard sensors and compute. Videos at https://navigation-locomotion.github.io/camera-ready
【2】 CoNeRF: Controllable Neural Radiance Fields 标题:CoNeRF:可控神经辐射场 链接:https://arxiv.org/abs/2112.01983
作者:Kacper Kania,Kwang Moo Yi,Marek Kowalski,Tomasz Trzciński,Andrea Taliasacchi 备注:Project page: this https URL 摘要:我们扩展了神经3D表示,允许用户在新视角渲染(即相机控制)之外进行直观和可解释的控制。我们允许用户在训练图像中仅使用少量的遮罩注释,来标注希望控制的场景部分。我们的关键思想是将属性视为潜在变量,在给定场景编码的情况下由神经网络进行回归。这就产生了一个少样本(few-shot)学习框架:当不提供注释时,框架会自动发现这些属性。我们将我们的方法应用于具有不同类型可控属性的各种场景(例如,人脸上的表情控制,或无生命物体运动中的状态控制)。总的来说,据我们所知,我们首次展示了单个视频场景的新视角与新属性重渲染。 摘要:We extend neural 3D representations to allow for intuitive and interpretable user control beyond novel view rendering (i.e. camera control). We allow the user to annotate which part of the scene one wishes to control with just a small number of mask annotations in the training images. Our key idea is to treat the attributes as latent variables that are regressed by the neural network given the scene encoding. This leads to a few-shot learning framework, where attributes are discovered automatically by the framework, when annotations are not provided. We apply our method to various scenes with different types of controllable attributes (e.g. expression control on human faces, or state control in movement of inanimate objects). Overall, we demonstrate, to the best of our knowledge, for the first time novel view and novel attribute re-rendering of scenes from a single video.
【3】 Bio-inspired Polarization Event Camera 标题:仿生偏振事件相机 链接:https://arxiv.org/abs/2112.01933
作者:Germain Haessig,Damien Joubert,Justin Haque,Yingkai Chen,Moritz Milde,Tobi Delbruck,Viktor Gruev 摘要:口足类(螳螂虾)视觉系统最近为范式转换的偏振与多光谱成像传感器的设计提供了蓝图,使解决具有挑战性的医学和遥感问题成为可能。然而,这些仿生传感器缺乏口足类视觉系统的高动态范围(HDR)和异步偏振视觉能力,时间分辨率被限制在~12 ms,动态范围被限制在~72 dB。在这里,我们提出了一种新颖的口足类启发偏振相机,它模拟持续与瞬态两条生物视觉通路,以节省功耗,并以超过最大奈奎斯特帧率的速率采样数据。这种仿生传感器同时捕获同步强度帧和异步偏振亮度变化信息,在一百万倍的照明范围内具有亚毫秒级延迟。我们的PDAVIS相机由346x260像素组成,以2×2的宏像素组织,使用四个彼此偏移45度的线性偏振滤光片过滤入射光。偏振信息的重建既可使用低成本、低延迟的基于事件的算法,也可使用更精确但速度较慢的深度神经网络。我们的传感器被用于成像高速变化的HDR偏振场景,并观察快速循环载荷下牛肌腱中单根胶原纤维的动力学特性。 摘要:The stomatopod (mantis shrimp) visual system has recently provided a blueprint for the design of paradigm-shifting polarization and multispectral imaging sensors, enabling solutions to challenging medical and remote sensing problems. However, these bioinspired sensors lack the high dynamic range (HDR) and asynchronous polarization vision capabilities of the stomatopod visual system, limiting temporal resolution to ~12 ms and dynamic range to ~ 72 dB. Here we present a novel stomatopod-inspired polarization camera which mimics the sustained and transient biological visual pathways to save power and sample data beyond the maximum Nyquist frame rate. This bio-inspired sensor simultaneously captures both synchronous intensity frames and asynchronous polarization brightness change information with sub-millisecond latencies over a million-fold range of illumination. Our PDAVIS camera is comprised of 346x260 pixels, organized in 2-by-2 macropixels, which filter the incoming light with four linear polarization filters offset by 45 degrees. Polarization information is reconstructed using both low cost and latency event-based algorithms and more accurate but slower deep neural networks. Our sensor is used to image HDR polarization scenes which vary at high speeds and to observe dynamical properties of single collagen fibers in bovine tendon under rapid cyclical loads.
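由0°/45°/90°/135°四个线偏振通道恢复线偏振信息是偏振成像的标准做法:先计算Stokes参量,再得到线偏振度(DoLP)与偏振角(AoP)。以下numpy草图为通用公式的示意实现,与PDAVIS的具体处理无关。

```python
import numpy as np

def linear_stokes(i0, i45, i90, i135):
    """Recover linear Stokes parameters, DoLP and AoP from four polarizer channels."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90                        # horizontal vs. vertical component
    s2 = i45 - i135                      # diagonal components
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / np.maximum(s0, 1e-8)
    aop = 0.5 * np.arctan2(s2, s1)       # angle of linear polarization
    return s0, s1, s2, dolp, aop

# Toy usage on per-channel images extracted from a 2x2 macropixel layout.
i0, i45, i90, i135 = (np.random.rand(128, 128) for _ in range(4))
s0, s1, s2, dolp, aop = linear_stokes(i0, i45, i90, i135)
```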
【4】 Panoptic-based Object Style-Align for Image-to-Image Translation 标题:用于图像到图像转换的基于全景的对象样式对齐 链接:https://arxiv.org/abs/2112.01926
作者:Liyun Zhang,Photchara Ratsamee,Bowen Wang,Manabu Higashida,Yuki Uranishi,Haruo Takemura 摘要:尽管最近在图像翻译方面取得了显著的进展,但具有多个不一致对象的复杂场景仍然是一个具有挑战性的问题。由于翻译后的图像保真度较低,细节较小,目标识别效果不理想。如果没有图像的完整对象感知(即边界框、类别和遮罩)作为先验知识,则在图像翻译过程中很难跟踪每个对象的样式转换。我们提出了基于全景的对象样式对齐生成对抗网络(POSA-GANs),用于图像到图像的转换,并提供了一个紧凑的全景分割数据集。全景分割模型用于提取全景级感知(即图像中去除重叠的前景对象实例和背景语义区域)。这用于指导输入域图像的对象内容代码与从目标域的样式空间采样的对象样式代码之间的对齐。样式对齐的对象表示将进一步变换,以获得精确的边界布局,从而生成更高保真的对象。将该方法与不同的竞争方法进行了系统的比较,在图像质量和目标识别性能上都取得了显著的提高。 摘要:Despite remarkable recent progress in image translation, the complex scene with multiple discrepant objects remains a challenging problem. Because the translated images have low fidelity and tiny objects in fewer details and obtain unsatisfactory performance in object recognition. Without the thorough object perception (i.e., bounding boxes, categories, and masks) of the image as prior knowledge, the style transformation of each object will be difficult to track in the image translation process. We propose panoptic-based object style-align generative adversarial networks (POSA-GANs) for image-to-image translation together with a compact panoptic segmentation dataset. The panoptic segmentation model is utilized to extract panoptic-level perception (i.e., overlap-removed foreground object instances and background semantic regions in the image). This is utilized to guide the alignment between the object content codes of the input domain image and object style codes sampled from the style space of the target domain. The style-aligned object representations are further transformed to obtain precise boundaries layout for higher fidelity object generation. The proposed method was systematically compared with different competing methods and obtained significant improvement on both image quality and object recognition performance for translated images.
【5】 TRNR: Task-Driven Image Rain and Noise Removal with a Few Images Based on Patch Analysis 标题:TRNR:基于面片分析的任务驱动的少幅图像去雨去噪 链接:https://arxiv.org/abs/2112.01924
作者:Wu Ran,Bohong Yang,Peirong Ma,Hong Lu 备注:13 pages 摘要:最近基于学习的图像去雨与去噪的繁荣,主要归功于设计良好的神经网络结构和大型标记数据集。然而,我们发现目前的图像去雨与去噪方法的图像利用率很低。为了减少对大型标记数据集的依赖,我们在引入面片分析策略的基础上提出了任务驱动的图像去雨与去噪(TRNR)。面片分析策略为训练提供具有各种空间和统计特性的图像面片,并已被验证可提高图像的利用率。此外,面片分析策略促使我们考虑以任务驱动而非数据驱动的方式学习图像去雨与去噪。因此,我们为TRNR引入了N-frequency-K-shot学习任务。每个N-frequency-K-shot学习任务基于一个小数据集,该数据集包含通过面片分析策略采样的N×K个图像面片。TRNR使神经网络能够从大量N-frequency-K-shot学习任务中学习,而非从充足的数据中学习。为了验证TRNR的有效性,我们构建了一个具有约0.9M参数的轻量级多尺度残差网络(MSResNet)来学习图像去雨,并使用一个具有约1.2M参数的简单ResNet(称为DNNet),在少量图像(例如,20.0%的Rain100H训练集)上进行盲高斯去噪。实验结果表明,TRNR使MSResNet能够从较少的图像中学得更好。此外,使用TRNR的MSResNet和DNNet,获得了优于以数据驱动方式在大型标记数据集上训练的最新深度学习方法的性能。这些实验结果证实了所提出的TRNR的有效性和优越性。TRNR的代码将很快公布。 摘要:The recent prosperity of learning-based image rain and noise removal is mainly due to the well-designed neural network architectures and large labeled datasets. However, we find that current image rain and noise removal methods result in low utilization of images. To alleviate the reliance on large labeled datasets, we propose the task-driven image rain and noise removal (TRNR) based on the introduced patch analysis strategy. The patch analysis strategy provides image patches with various spatial and statistical properties for training and has been verified to increase the utilization of images. Further, the patch analysis strategy motivates us to consider learning image rain and noise removal task-driven instead of data-driven. Therefore we introduce the N-frequency-K-shot learning task for TRNR. Each N-frequency-K-shot learning task is based on a tiny dataset containing N×K image patches sampled by the patch analysis strategy. TRNR enables neural networks to learn from abundant N-frequency-K-shot learning tasks other than from adequate data. To verify the effectiveness of TRNR, we build a light Multi-Scale Residual Network (MSResNet) with about 0.9M parameters to learn image rain removal and use a simple ResNet with about 1.2M parameters dubbed DNNet for blind gaussian noise removal with a few images (for example, 20.0% train-set of Rain100H). Experimental results demonstrate that TRNR enables MSResNet to learn better from fewer images. In addition, MSResNet and DNNet utilizing TRNR have obtained better performance than most recent deep learning methods trained data-driven on large labeled datasets. These experimental results have confirmed the effectiveness and superiority of the proposed TRNR. The codes of TRNR will be public soon.
【6】 A Structured Dictionary Perspective on Implicit Neural Representations 标题:关于隐含神经表征的结构化词典视角 链接:https://arxiv.org/abs/2112.01917
作者:Gizem Yüce,Guillermo Ortiz-Jiménez,Beril Besbinar,Pascal Frossard 备注:26 pages, 14 figures 摘要:在新设计的推动下,隐式神经表征(INR)可以绕过光谱偏差,成为信号经典离散化表征的一种有前景的替代方法。然而,尽管INR在实践中取得了成功,我们仍然缺乏对INR如何表示信号的恰当理论刻画。在这项工作中,我们旨在填补这一空白,并提出了一个新的统一视角来对INR进行理论分析。利用调和分析和深度学习理论的结果,我们发现大多数INR族类似于结构化信号字典,其原子是初始映射频率集的整数次谐波。这种结构使INR能够用仅随深度线性增长的参数数量来表达频率支持呈指数增长的信号。随后,我们利用最近关于经验神经切线核(NTK)的研究结果,探讨了INR的归纳偏差。具体来说,我们证明了NTK的本征函数可以看作字典原子,其与目标信号的内积决定了重建的最终性能。在这方面,我们发现元学习初始化对NTK具有类似于字典学习的重塑效应,即把字典原子构建为元训练过程中所见示例的组合。我们的结果有助于设计和调整新颖的INR体系结构,也可能引起更广泛的深度学习理论界的兴趣。 摘要:Propelled by new designs that permit to circumvent the spectral bias, implicit neural representations (INRs) have recently emerged as a promising alternative to classical discretized representations of signals. Nevertheless, despite their practical success, we still lack a proper theoretical characterization of how INRs represent signals. In this work, we aim to fill this gap, and we propose a novel unified perspective to theoretically analyse INRs. Leveraging results from harmonic analysis and deep learning theory, we show that most INR families are analogous to structured signal dictionaries whose atoms are integer harmonics of the set of initial mapping frequencies. This structure allows INRs to express signals with an exponentially increasing frequency support using a number of parameters that only grows linearly with depth. Afterwards, we explore the inductive bias of INRs exploiting recent results about the empirical neural tangent kernel (NTK). Specifically, we show that the eigenfunctions of the NTK can be seen as dictionary atoms whose inner products with the target signal determine the final performance of their reconstruction. In this regard, we reveal that meta-learning the initialization has a reshaping effect on the NTK analogous to dictionary learning, building dictionary atoms as combinations of the examples seen during meta-training. Our results permit the design and tuning of novel INR architectures, but can also be of interest for the wider deep learning theory community.
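To make the "integer harmonics" claim concrete, a standard identity suffices. The worked expansion below is our illustration for a two-layer SIREN-style composition (our assumption, not an equation reproduced from the paper):

```latex
% Illustrative only: composing two sine layers, sin(a sin(omega x)), already
% generates integer harmonics of the mapping frequency omega, by the
% Jacobi-Anger expansion:
\[
  \sin\!\bigl(a\,\sin(\omega x)\bigr)
  \;=\; 2\sum_{k=0}^{\infty} J_{2k+1}(a)\,\sin\!\bigl((2k+1)\,\omega x\bigr),
\]
% so even at depth two the representable frequencies include the odd multiples
% (2k+1)\omega of the initial frequency, with Bessel-function amplitudes
% J_{2k+1}(a) -- a structured dictionary of harmonics, as the paper argues.
```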
【7】 Image-to-image Translation as a Unique Source of Knowledge 标题:图像到图像翻译:一种独特的知识来源 链接:https://arxiv.org/abs/2112.01873
作者:Alejandro D. Mousist 摘要:图像到图像(I2I)翻译是一种将数据从一个域转换到另一个域的成熟方法,但在处理SAR/光学卫星图像这类差异较大的域时,翻译后的图像在目标域中的可用性,以及原始域有多少信息被迁移到目标域,仍然不够清楚。本文通过使用最先进的不同I2I算法将标记数据集从光学域翻译到SAR域、在目标域中从迁移的特征中学习,并在之后评估原始数据集有多少被迁移,来解决这一问题。此外,还提出将堆叠(stacking)作为一种结合从不同I2I翻译中学到的知识的方法,并与单一模型进行对比评估。 摘要:Image-to-image (I2I) translation is an established way of translating data from one domain to another, but when working with domains as dissimilar as SAR and optical satellite imagery, it is still not clear enough how usable the translated images are in the target domain and how much of the origin domain is translated to the target domain. This article addresses this by translating labelled datasets from the optical domain to the SAR domain with different state-of-the-art I2I algorithms, learning from transferred features in the destination domain, and evaluating afterwards how much of the original dataset was transferred. In addition, stacking is proposed as a way of combining the knowledge learned from the different I2I translations and is evaluated against single models.
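A minimal picture of the stacking step (our sketch; the model interfaces are assumptions): classifiers trained on differently translated datasets are combined by a meta-learner over their held-out predictions.

```python
# Hedged stacking sketch: combine classifiers trained on datasets translated
# by different I2I algorithms using a logistic-regression meta-learner.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_predictions(base_probs: list[np.ndarray], labels: np.ndarray) -> LogisticRegression:
    """base_probs: per-model class probabilities on held-out SAR data,
    each of shape (num_samples, num_classes)."""
    meta_features = np.hstack(base_probs)   # concatenate the models' outputs
    meta = LogisticRegression(max_iter=1000)
    meta.fit(meta_features, labels)         # learn how much to trust each model
    return meta
```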
【8】 A Systematic IoU-Related Method: Beyond Simplified Regression for Better Localization 标题:一种系统的IoU相关方法:超越简化回归以实现更好的定位 链接:https://arxiv.org/abs/2112.01793
作者:Hanyang Peng,Shiqi Yu 备注:None 摘要:现代检测器默认使用四变量独立回归定位损失,例如平滑 $\ell_1$ 损失。然而,这种损失过于简化,因此与最终评估指标——交并比(IoU)不一致。直接使用标准IoU也不可行,因为在框不重叠的情况下存在恒为零的平台区,且在最小值处梯度非零,这可能使其无法训练。因此,我们提出了一个系统的方法来解决这些问题。首先,我们提出了一个新的度量——扩展IoU(EIoU),当两个框不重叠时它有良好定义,而在重叠时退化为标准IoU。其次,我们提出了基于EIoU的凸化技术(CT)来构造损失,它可以保证在最小值处梯度为零。第三,我们提出了一种稳定优化技术(SOT),使分式EIoU损失更稳定、更平滑地逼近最小值。第四,为了充分发挥基于EIoU的损失的能力,我们引入了一个相关联的IoU预测头来进一步提高定位精度。基于上述贡献,新方法与以ResNet50 FPN为主干的Faster R-CNN结合,相比基线平滑 $\ell_1$ 损失,在VOC2007上产生 4.2 mAP 增益,在COCO2017上产生 2.3 mAP 增益,且几乎不增加训练和推断的计算成本。特别地,指标越严格,增益越显著:在指标 $AP_{90}$ 下,VOC2007上提高 8.2 mAP,COCO2017上提高 5.4 mAP。 摘要:Four-variable-independent-regression localization losses, such as the Smooth-$\ell_1$ loss, are used by default in modern detectors. Nevertheless, this kind of loss is oversimplified, so that it is inconsistent with the final evaluation metric, intersection over union (IoU). Directly employing the standard IoU is also not feasible, since the constant-zero plateau in the case of non-overlapping boxes and the non-zero gradient at the minimum may make it not trainable. Accordingly, we propose a systematic method to address these problems. Firstly, we propose a new metric, the extended IoU (EIoU), which is well-defined when two boxes are not overlapping and reduces to the standard IoU when they overlap. Secondly, we present a convexification technique (CT) to construct a loss on the basis of EIoU, which can guarantee that the gradient at the minimum is zero. Thirdly, we propose a steady optimization technique (SOT) to make the fractional EIoU loss approach the minimum more steadily and smoothly. Fourthly, to fully exploit the capability of the EIoU-based loss, we introduce an interrelated IoU-predicting head to further boost localization accuracy. With the proposed contributions, the new method incorporated into Faster R-CNN with a ResNet50 FPN backbone yields a \textbf{4.2 mAP} gain on VOC2007 and a \textbf{2.3 mAP} gain on COCO2017 over the baseline Smooth-$\ell_1$ loss, at almost \textbf{no training and inference computational cost}. Specifically, the stricter the metric is, the more notable the gain is, improving by \textbf{8.2 mAP} on VOC2007 and \textbf{5.4 mAP} on COCO2017 at metric $AP_{90}$.
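The motivation for EIoU can be reproduced in a few lines. The sketch below is our illustration (it does not reproduce the paper's EIoU definition): for two disjoint boxes, the standard IoU sits on a constant-zero plateau, so its gradient with respect to the predicted box vanishes and provides no learning signal.

```python
# Illustration of the failure mode the abstract describes (not the EIoU loss):
# standard IoU is identically zero for disjoint boxes, so no gradient flows.
import torch

def iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    # Boxes as (x1, y1, x2, y2).
    lt = torch.maximum(box_a[:2], box_b[:2])
    rb = torch.minimum(box_a[2:], box_b[2:])
    wh = (rb - lt).clamp(min=0)                      # zero if boxes are disjoint
    inter = wh[0] * wh[1]
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred = torch.tensor([0., 0., 1., 1.], requires_grad=True)
target = torch.tensor([2., 2., 3., 3.])              # no overlap with pred
loss = 1.0 - iou(pred, target)
loss.backward()
print(pred.grad)                                     # all zeros: no learning signal
```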
【9】 Detect Faces Efficiently: A Survey and Evaluations 标题:高效人脸检测:综述与评估 链接:https://arxiv.org/abs/2112.01787
作者:Yuantao Feng,Shiqi Yu,Hanyang Peng,Yan-Ran Li,Jianguo Zhang 摘要:人脸检测是在图像中搜索所有可能的人脸区域,并在有人脸的情况下对其进行定位。包括人脸识别、表情识别、人脸跟踪和头部姿态估计在内的许多应用都假设人脸在图像中的位置和大小是已知的。近几十年来,研究人员创造了许多经典而高效的人脸检测器,从Viola-Jones人脸检测器到当前基于CNN的检测器。然而,随着图像和视频的大量增加,人脸的尺度、外观、表情、遮挡和姿态变化多样,传统的人脸检测器面临着检测各种"野外"人脸的挑战。深度学习技术的出现给人脸检测带来了显著的突破,代价是计算量大幅增加。本文介绍了有代表性的基于深度学习的方法,并从准确性和效率两方面进行了深入而彻底的分析。我们进一步比较和讨论了流行且具有挑战性的数据集及其评估指标。我们对几种成功的基于深度学习的人脸检测器进行了综合比较,通过FLOPs和延迟这两个指标来揭示它们的效率。本文可以指导人们根据不同的应用场合选择合适的人脸检测器,也有助于开发出更高效、更准确的检测器。 摘要:Face detection searches all possible regions of an image for faces and locates any faces that are present. Many applications, including face recognition, facial expression recognition, face tracking and head-pose estimation, assume that both the location and the size of faces are known in the image. In recent decades, researchers have created many typical and efficient face detectors, from the Viola-Jones face detector to current CNN-based ones. However, with the tremendous increase in images and videos featuring variations in face scale, appearance, expression, occlusion and pose, traditional face detectors are challenged to detect various "in the wild" faces. The emergence of deep learning techniques brought remarkable breakthroughs to face detection, at the price of a considerable increase in computation. This paper introduces representative deep learning-based methods and presents a deep and thorough analysis in terms of accuracy and efficiency. We further compare and discuss the popular and challenging datasets and their evaluation metrics. A comprehensive comparison of several successful deep learning-based face detectors is conducted to uncover their efficiency using two metrics: FLOPs and latency. The paper can guide readers to choose appropriate face detectors for different applications and also to develop more efficient and accurate detectors.
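As a rough companion to the survey's efficiency comparison, the sketch below shows how the latency half of its two metrics is typically measured; FLOP counting usually relies on an external profiler and is omitted here. Input shape and run counts are arbitrary assumptions, not values from the paper.

```python
# Hedged latency-measurement sketch: warm up, synchronize, then time inference.
import time
import torch

@torch.no_grad()
def measure_latency(model: torch.nn.Module, input_shape=(1, 3, 640, 640),
                    warmup: int = 10, runs: int = 50) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                  # warm up kernels and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()             # make GPU timing honest
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3   # ms per image
```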
【10】 NeRF-SR: High-Quality Neural Radiance Fields using Super-Sampling 标题:NeRF-SR:采用超采样的高质量神经辐射场 链接:https://arxiv.org/abs/2112.01759
作者:Chen Wang,Xian Wu,Yuan-Chen Guo,Song-Hai Zhang,Yu-Wing Tai,Shi-Min Hu 备注:this https URL 摘要:我们提出了NeRF-SR,这是一种主要使用低分辨率(LR)输入进行高分辨率(HR)新视图合成的解决方案。我们的方法建立在神经辐射场(NeRF)的基础上,它通过多层感知器预测每个点的密度和颜色。虽然NeRF可以以任意尺度生成图像,但它难以处理超出观测图像分辨率的情形。我们的关键洞察是NeRF具有局部先验,这意味着对某个3D点的预测可以传播到邻近区域并保持准确。我们首先通过超采样策略加以利用,即在每个图像像素上发射多条光线,从而在亚像素级别施加多视图约束。然后,我们证明了NeRF-SR可以通过一个细化网络进一步提升超采样的性能,该网络利用已估计的深度,从HR参考图像上的相关面片中幻化出细节。实验结果表明,无论是在合成数据集还是真实数据集上,NeRF-SR都能在HR下生成高质量的新视图合成结果。 摘要:We present NeRF-SR, a solution for high-resolution (HR) novel view synthesis with mostly low-resolution (LR) inputs. Our method is built upon Neural Radiance Fields (NeRF), which predicts per-point density and color with a multi-layer perceptron. While producing images at arbitrary scales, NeRF struggles with resolutions that go beyond the observed images. Our key insight is that NeRF has a local prior, which means predictions for a 3D point can be propagated to the nearby region and remain accurate. We first exploit this with a super-sampling strategy that shoots multiple rays at each image pixel, which enforces multi-view constraints at a sub-pixel level. Then, we show that NeRF-SR can further boost the performance of super-sampling with a refinement network that leverages the estimated depth at hand to hallucinate details from related patches on an HR reference image. Experimental results demonstrate that NeRF-SR generates high-quality results for novel view synthesis at HR on both synthetic and real-world datasets.
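The super-sampling idea can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the ray-generation conventions (pixel corners, uniform jitter) are assumptions.

```python
# Hedged sketch of super-sampling: cast several jittered sub-pixel rays per
# low-resolution pixel so the loss constrains NeRF below pixel scale.
import torch

def subpixel_sample_coords(h: int, w: int, rays_per_pixel: int = 4) -> torch.Tensor:
    """Return (h*w*rays_per_pixel, 2) sub-pixel sample coordinates."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    corners = torch.stack([xs, ys], dim=-1).reshape(-1, 1, 2).float()
    jitter = torch.rand(corners.shape[0], rays_per_pixel, 2)   # uniform in [0,1)
    return (corners + jitter).reshape(-1, 2)

# Each LR pixel's predicted color can then be the average of its sub-pixel
# samples, matched against the observed LR pixel during training.
samples = subpixel_sample_coords(4, 4, rays_per_pixel=4)       # (64, 2)
```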
【11】 Investigating the usefulness of Quantum Blur 标题:量子模糊的有用性研究 链接:https://arxiv.org/abs/2112.01646
作者:James R. Wootton,Marcel Pfaffhauser 摘要:虽然距离量子计算能够超越传统计算还有几年时间,但它已经提供了可用于各个领域探索性用途的资源。这包括计算机游戏、音乐和艺术中程序化生成的某些任务。量子模糊(Quantum Blur)方法作为一个原理验证示例被提出,以表明利用量子软件的原理来设计程序化生成方法是有用的。在这里,我们分析了该方法的效果,并将其与传统的模糊效果进行了比较。我们还确定了所观察到的效果是如何源自对量子叠加和纠缠的操纵的。 摘要:Though some years remain before quantum computation can outperform conventional computation, it already provides resources that can be used for exploratory purposes in various fields. This includes certain tasks for procedural generation in computer games, music and art. The Quantum Blur method was introduced as a proof-of-principle example, to show that it can be useful to design methods for procedural generation using the principles of quantum software. Here we analyse the effects of the method and compare it to conventional blur effects. We also determine how the effects seen derive from the manipulation of quantum superposition and entanglement.
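For a concrete point of comparison, the snippet below applies the kind of conventional blur the paper evaluates against; it is our baseline sketch and is unrelated to the Quantum Blur code itself.

```python
# Conventional blur baseline (our sketch, not from the paper): a classical
# Gaussian blur is a linear convolution with a fixed smoothing kernel.
import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(64, 64)
blurred = gaussian_filter(image, sigma=2.0)   # classical linear smoothing
# Quantum Blur instead encodes pixel values in the amplitudes of a quantum
# state and applies partial rotations, so its artefacts reflect superposition
# and entanglement rather than convolution with a fixed kernel.
```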
【12】 Hamiltonian prior to Disentangle Content and Motion in Image Sequences 标题:用于解缠图像序列中内容与运动的哈密顿先验 链接:https://arxiv.org/abs/2112.01641
作者:Asif Khan,Amos Storkey 备注:Controllable Generative Modeling in Language and Vision Workshop at NeurIPS 2021 摘要:我们提出了一个高维序列数据的深层潜变量模型。我们的模型将潜在空间分解为内容和运动变量。为了对不同的动力学进行建模,我们将运动空间划分为子空间,并为每个子空间引入唯一的哈密顿算符。哈密顿公式提供可逆动力学,学习约束运动路径以保持不变特性。运动空间的显式分裂将哈密顿量分解为对称群,并给出动力学的长期可分性。这种分离还意味着可以学习易于理解和控制的表达。我们展示了我们的模型用于交换两个视频的运动、从给定图像生成各种动作序列和无条件序列生成的实用性。 摘要:We present a deep latent variable model for high dimensional sequential data. Our model factorises the latent space into content and motion variables. To model the diverse dynamics, we split the motion space into subspaces, and introduce a unique Hamiltonian operator for each subspace. The Hamiltonian formulation provides reversible dynamics that learn to constrain the motion path to conserve invariant properties. The explicit split of the motion space decomposes the Hamiltonian into symmetry groups and gives long-term separability of the dynamics. This split also means representations can be learnt that are easy to interpret and control. We demonstrate the utility of our model for swapping the motion of two videos, generating sequences of various actions from a given image and unconditional sequence generation.
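For readers unfamiliar with the formalism, the standard equations behind the reversibility and conservation claims are worth stating; the per-subspace operators H_i are learned in the paper, while the equations themselves are textbook background, not material reproduced from it.

```latex
% Background (standard Hamiltonian mechanics): each motion subspace i evolves
% a latent position q_i and momentum p_i by Hamilton's equations,
\[
  \frac{dq_i}{dt} = \frac{\partial H_i}{\partial p_i}, \qquad
  \frac{dp_i}{dt} = -\frac{\partial H_i}{\partial q_i},
\]
% which yields a time-reversible, volume-preserving flow that conserves H_i
% along each trajectory (dH_i/dt = 0) -- the invariant property the abstract
% says the learned motion paths are constrained to preserve.
```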
【13】 Fast Neural Representations for Direct Volume Rendering 标题:直接体绘制的快速神经表示 链接:https://arxiv.org/abs/2112.01579
作者:Sebastian Weiss,Philipp Hermüller,Rüdiger Westermann 摘要:尽管神经场景表示有潜力以高重建质量有效压缩三维标量场,但使用场景表示网络进行训练和数据重建的计算复杂度限制了它们在实际应用中的使用。在本文中,我们分析了场景表示网络是否可以修改以减少这些限制,以及这些架构是否也可以用于时间重建任务。我们提出了一种新的场景表示网络设计,利用GPU张量核将重建无缝集成到片上光线追踪内核中。此外,我们还研究了以图像引导的网络训练作为经典数据驱动方法的替代方案,并探讨了这种替代方案在质量和速度方面的潜在优势和劣势。作为时变场空间超分辨率方法的替代方案,我们提出了一种基于潜空间插值的解决方案,以实现任意粒度的随机访问重建。最后,我们总结了场景表示网络在科学可视化任务中的优势与局限,并概述了该领域未来有前景的研究方向。 摘要:Despite the potential of neural scene representations to effectively compress 3D scalar fields at high reconstruction quality, the computational complexity of the training and data reconstruction steps using scene representation networks limits their use in practical applications. In this paper, we analyze whether scene representation networks can be modified to reduce these limitations and whether such architectures can also be used for temporal reconstruction tasks. We propose a novel design of scene representation networks using GPU tensor cores to integrate the reconstruction seamlessly into on-chip raytracing kernels. Furthermore, we investigate the use of image-guided network training as an alternative to classical data-driven approaches, and we explore the potential strengths and weaknesses of this alternative regarding quality and speed. As an alternative to spatial super-resolution approaches for time-varying fields, we propose a solution that builds upon latent-space interpolation to enable random-access reconstruction at arbitrary granularity. We summarize our findings in the form of an assessment of the strengths and limitations of scene representation networks for scientific visualization tasks and outline promising future research directions in this field.
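The random-access idea can be pictured with a short sketch. Everything here (key-frame latent storage, linear blending, the decoder interface) is an assumption for illustration, not the paper's actual design.

```python
# Hedged sketch of random-access temporal reconstruction via latent
# interpolation: decode an arbitrary time t from latents at key timesteps.
import torch

def decode_at_time(t: float, key_times: torch.Tensor, key_latents: torch.Tensor,
                   decoder: torch.nn.Module) -> torch.Tensor:
    """key_times: (T,) sorted timestamps; key_latents: (T, D) latent codes."""
    idx = torch.searchsorted(key_times, torch.tensor(t)).clamp(1, len(key_times) - 1)
    t0, t1 = key_times[idx - 1], key_times[idx]
    alpha = (t - t0) / (t1 - t0)                       # linear blend weight
    z = (1 - alpha) * key_latents[idx - 1] + alpha * key_latents[idx]
    return decoder(z)              # any t between key frames is reachable directly
```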
【14】 Engineering AI Tools for Systematic and Scalable Quality Assessment in Magnetic Resonance Imaging 标题:磁共振成像系统化、可扩展性质量评估的工程人工智能工具 链接:https://arxiv.org/abs/2112.01629
作者:Yukai Zou,Ikbeom Jang 备注:6 pages, 2 figures, NeurIPS Data-Centric AI Workshop 2021 (Virtual) 摘要:随着机器学习算法、并行计算和硬件技术的发展,对大型医学影像数据集的需求不断增长。因此,为了开展大规模的临床或转化研究,汇集来自多个临床和学术机构数据的需求日益增加。磁共振成像(MRI)是一种常用的非侵入性成像方式。然而,构建大型MRI数据存储库面临着与隐私、数据量、DICOM格式、物流和非标准化图像相关的多重挑战。不仅构建数据存储库很困难,使用从存储库汇集的数据也具有挑战性,因为不同MRI供应商和成像站点在图像采集、重建和处理流程上存在异质性。本立场论文描述了构建大型MRI数据存储库以及在各方面使用从此类存储库下载的数据所面临的挑战。为帮助应对这些挑战,本文建议引入一个质量评估管道,并给出相关考虑因素和一般设计原则。 摘要:The desire for large medical imaging datasets keeps growing as machine learning algorithms, parallel computing, and hardware technology evolve. Accordingly, there is a growing demand for pooling data from multiple clinical and academic institutes to enable large-scale clinical or translational research studies. Magnetic resonance imaging (MRI) is a frequently used, non-invasive imaging modality. However, constructing a big MRI data repository has multiple challenges related to privacy, data size, DICOM format, logistics, and non-standardized images. Not only is building the data repository difficult, but using the pooled data is also challenging, due to heterogeneity in image acquisition, reconstruction, and processing pipelines across MRI vendors and imaging sites. This position paper describes challenges in constructing a large MRI data repository and in using data downloaded from such repositories in various aspects. To help address these challenges, the paper proposes introducing a quality assessment pipeline, along with considerations and general design principles.
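As an illustration of what such a pipeline might check (a hypothetical example on our part, not the paper's proposal), a header-level QA pass over pooled DICOM files could flag privacy and metadata issues before images enter the repository:

```python
# Hypothetical header-level QA checks for pooled MRI DICOM files; the specific
# checks and thresholds are illustrative assumptions, not the paper's design.
import pydicom

def basic_header_checks(path: str) -> list[str]:
    issues = []
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    if ds.get("PatientName"):                     # privacy: should be de-identified
        issues.append("PatientName not removed")
    if "MagneticFieldStrength" not in ds:         # acquisition metadata present?
        issues.append("missing MagneticFieldStrength")
    if ds.get("Modality") != "MR":                # wrong modality slipped in?
        issues.append(f"unexpected modality: {ds.get('Modality')}")
    return issues
```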
机器翻译,仅供参考 (machine translation, for reference only)