Computer Vision and Pattern Recognition Academic Digest [12.23]

2021-12-27 17:10:31

cs.CV: 59 papers in total today

Transformer (1 paper)

【1】 MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation
Link: https://arxiv.org/abs/2112.11542

Authors: Zhongzhi Yu, Yonggan Fu, Sicheng Li, Chaojian Li, Yingyan Lin
Affiliations: Department of Electrical and Computer Engineering, Rice University; Alibaba DAMO Academy
Abstract: ViTs are often too computationally expensive to be fitted onto real-world resource-constrained devices, due to (1) their quadratically increased complexity with the number of input tokens and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varied complexity and their different regions can contain various levels of visual information, indicating that treating all regions/tokens equally in terms of model complexity is unnecessary while such opportunities for trimming down ViTs' complexity have not been fully explored. To this end, we propose a Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine-grained granularities (i.e., model depth and the number of model heads/tokens). In particular, our MIA-Former adopts a low-cost network trained with a hybrid supervised and reinforcement training method to skip unnecessary layers, heads, and tokens in an input adaptive manner, reducing the overall computational cost. Furthermore, an interesting side effect of our MIA-Former is that its resulting ViTs are naturally equipped with improved robustness against adversarial attacks over their static counterparts, because MIA-Former's multi-grained dynamic control improves the model diversity similar to the effect of ensemble and thus increases the difficulty of adversarial attacks against all its sub-models. Extensive experiments and ablation studies validate that the proposed MIA-Former framework can effectively allocate computation budgets adaptive to the difficulty of input images meanwhile increase robustness, achieving state-of-the-art (SOTA) accuracy-efficiency trade-offs, e.g., 20% computation savings with the same or even a higher accuracy compared with SOTA dynamic transformer models.
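
To make the multi-grained skipping concrete, here is a minimal PyTorch sketch of the kind of low-cost controller the abstract describes, emitting per-input keep/skip decisions for layers, heads, and tokens. All names, sizes, and the 0.5 thresholding are illustrative assumptions, not details from the paper (which trains this policy with a hybrid supervised/reinforcement scheme).

```python
import torch
import torch.nn as nn

class SkipController(nn.Module):
    """Low-cost policy net: per-input keep/skip logits at three granularities.
    A minimal sketch; sizes and thresholding are assumptions, not the paper's."""
    def __init__(self, dim, num_layers, num_heads, num_tokens):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.layer_head = nn.Linear(64, num_layers)              # skip whole blocks
        self.head_head = nn.Linear(64, num_layers * num_heads)   # skip attention heads
        self.token_head = nn.Linear(64, num_tokens)              # skip tokens

    def forward(self, cls_embedding):
        h = self.backbone(cls_embedding)                         # (B, 64)
        keep_layer = torch.sigmoid(self.layer_head(h)) > 0.5     # (B, L)
        keep_head = torch.sigmoid(self.head_head(h)) > 0.5       # (B, L*H)
        keep_token = torch.sigmoid(self.token_head(h)) > 0.5     # (B, N)
        return keep_layer, keep_head, keep_token

# Usage: the decisions would gate a ViT's blocks/heads/tokens per input image.
ctrl = SkipController(dim=192, num_layers=12, num_heads=3, num_tokens=196)
cls = torch.randn(4, 192)                       # cheap embedding of each input
keep_layer, keep_head, keep_token = ctrl(cls)
print(keep_layer.shape, keep_token.float().mean())  # fraction of tokens kept
```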

Detection (11 papers)

【1】 Two Stream Network for Stroke Detection in Table Tennis
Link: https://arxiv.org/abs/2112.12073

Authors: Anam Zahra, Pierre-Etienne Martin
Affiliations: CCP Department, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Note: MediaEval 2021, Dec 2021, Online, Germany
Abstract: This paper presents a table tennis stroke detection method from videos. The method relies on a two-stream Convolutional Neural Network processing in parallel the RGB stream and its computed optical flow. The method has been developed as part of the MediaEval 2021 benchmark for the Sport task. Our contribution did not outperform the provided baseline on the test set but performed the best among the other participants with regard to the mAP metric.
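
A two-stream architecture of the kind described can be sketched in a few lines of PyTorch: one branch consumes RGB frames, the other the computed optical flow, and their pooled features are fused for classification. Branch depths, channel counts, and the late-fusion choice are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class TwoStreamStrokeDetector(nn.Module):
    """Parallel RGB and optical-flow branches with late fusion (schematic)."""
    def __init__(self, num_classes=2):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, 16, 3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.rgb = branch(3)       # raw frames
        self.flow = branch(2)      # (dx, dy) optical flow
        self.classifier = nn.Linear(64, num_classes)  # 32 + 32 fused features

    def forward(self, rgb_clip, flow_clip):
        feats = torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1)
        return self.classifier(feats)

model = TwoStreamStrokeDetector()
rgb = torch.randn(1, 3, 16, 112, 112)    # (B, C, T, H, W) video clip
flow = torch.randn(1, 2, 16, 112, 112)
print(model(rgb, flow).shape)             # torch.Size([1, 2])
```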

【2】 A Single-Target License Plate Detection with Attention
Link: https://arxiv.org/abs/2112.12070

Authors: Wenyun Li, Chi-Man Pun
Affiliations: Information Science, University of Macau, Macau, China
Note: IWAIT 2022
Abstract: With the development of deep learning, neural networks are commonly adopted for the License Plate Detection (LPD) task and achieve much better performance and precision; in particular, CNN-based networks such as RetinaNet [1] can reach state-of-the-art results. For a single-object detection task such as LPD, a modified general-purpose object detector is time-consuming, unable to cope with complex scenarios, and yields a cumbersome weights file that is too hard to deploy on embedded devices.

【3】 Looking Beyond Corners: Contrastive Learning of Visual Representations for Keypoint Detection and Description Extraction
Link: https://arxiv.org/abs/2112.12002

Authors: Henrique Siqueira, Patrick Ruhkamp, Ibrahim Halfaoui, Markus Karmann, Onay Urfalioglu
Affiliations: Technical University of Munich; HUAWEI Munich Research Center
Abstract: Learnable keypoint detectors and descriptors are beginning to outperform classical hand-crafted feature extraction methods. Recent studies on self-supervised learning of visual representations have driven the increasing performance of learnable models based on deep networks. By leveraging traditional data augmentations and homography transformations, these networks learn to detect corners under adverse conditions such as extreme illumination changes. However, their generalization capabilities are limited to corner-like features detected a priori by classical methods or synthetically generated data. In this paper, we propose the Correspondence Network (CorrNet) that learns to detect repeatable keypoints and to extract discriminative descriptions via unsupervised contrastive learning under spatial constraints. Our experiments show that CorrNet is not only able to detect low-level features such as corners, but also high-level features that represent similar objects present in a pair of input images through our proposed joint guided backpropagation of their latent space. Our approach obtains competitive results under viewpoint changes and achieves state-of-the-art performance under illumination changes.

【4】 DA-FDFtNet: Dual Attention Fake Detection Fine-tuning Network to Detect Various AI-Generated Fake Images
Link: https://arxiv.org/abs/2112.12001

Authors: Young Oh Bang, Simon S. Woo
Affiliations: Department of Artificial Intelligence, Sungkyunkwan University, Suwon, South Korea; Department of Applied Data Science, Computer Science & Engineering Department, Sungkyunkwan University, Suwon, South Korea
Abstract: Due to the advancement of Generative Adversarial Networks (GAN), Autoencoders, and other AI technologies, it has been much easier to create fake images such as "Deepfakes". More recent research has introduced few-shot learning, which uses a small amount of training data to produce fake images and videos more effectively. Therefore, the ease of generating manipulated images and the difficulty of distinguishing those images can cause a serious threat to our society, such as propagating fake information. However, detecting realistic fake images generated by the latest AI technology is challenging due to the reasons mentioned above. In this work, we propose Dual Attention Fake Detection Fine-tuning Network (DA-FDFtNet) to detect the manipulated fake face images from the real face data. Our DA-FDFtNet integrates the pre-trained model with Fine-Tune Transformer, MBblockV3, and a channel attention module to improve the performance and robustness across different types of fake images. In particular, Fine-Tune Transformer consists of multiple numbers of an image-based self-attention module and a down-sampling layer. The channel attention module is also connected with the pre-trained model to capture the fake images' feature space. We experiment with our DA-FDFtNet with the FaceForensics++ dataset and various GAN-generated datasets, and we show that our approach outperforms the previous baseline models.
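
As one plausible form of the channel attention module mentioned in the abstract, here is a squeeze-and-excitation style block in PyTorch that reweights the pre-trained backbone's feature maps; the reduction ratio and placement are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed form)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w   # reweight backbone feature maps channel-wise

feat = torch.randn(2, 256, 14, 14)        # features from a pre-trained model
print(ChannelAttention(256)(feat).shape)  # torch.Size([2, 256, 14, 14])
```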

【5】 YOLO-Z: Improving small object detection in YOLOv5 for autonomous vehicles
Link: https://arxiv.org/abs/2112.11798

Authors: Aduen Benjumea, Izzeddin Teeti, Fabio Cuzzolin, Andrew Bradley
Abstract: As autonomous vehicles and autonomous racing rise in popularity, so does the need for faster and more accurate detectors. While our naked eyes are able to extract contextual information almost instantly, even from far away, image resolution and computational resources limitations make detecting smaller objects (that is, objects that occupy a small pixel area in the input image) a genuinely challenging task for machines and a wide-open research field. This study explores how the popular YOLOv5 object detector can be modified to improve its performance in detecting smaller objects, with a particular application in autonomous racing. To achieve this, we investigate how replacing certain structural elements of the model (as well as their connections and other parameters) can affect performance and inference time. In doing so, we propose a series of models at different scales, which we name `YOLO-Z', and which display an improvement of up to 6.9% in mAP when detecting smaller objects at 50% IOU, at the cost of just a 3ms increase in inference time compared to the original YOLOv5. Our objective is to inform future research on the potential of adjusting a popular detector such as YOLOv5 to address specific tasks and provide insights on how specific changes can impact small object detection. Such findings, applied to the broader context of autonomous vehicles, could increase the amount of contextual information available to such systems.

【6】 BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View
Link: https://arxiv.org/abs/2112.11790

Authors: Junjie Huang, Guan Huang, Zheng Zhu, Dalong Du
Affiliations: PhiGent Robotics
Note: Multi-camera 3D Object Detection
Abstract: Autonomous driving perceives the surrounding environment for decision making, which is one of the most complicated scenes for visual perception. The great power of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet is developed by following the principle of detecting the 3D objects in Bird-Eye-View (BEV), where route planning can be handily performed. In this paradigm, four kinds of modules are conducted in succession with different roles: an image-view encoder for encoding feature in image view, a view transformer for feature transformation from image view to BEV, a BEV encoder for further encoding feature in BEV, and a task-specific head for predicting the targets in BEV. We merely reuse the existing modules for constructing BEVDet and make it feasible for multi-camera 3D object detection by constructing an exclusive data augmentation strategy. The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between computing budget and performance. BEVDet with 704x256 (1/8 of the competitors) image size scores 29.4% mAP and 38.4% NDS on the nuScenes val set, which is comparable with FCOS3D (i.e., 2008.2 GFLOPs, 1.7 FPS, 29.5% mAP and 37.2% NDS), while requiring merely 12% of the computing budget at 239.4 GFLOPs and running 4.3 times faster. Scaling up the input size to 1408x512, BEVDet scores 34.9% mAP and 41.7% NDS, which requires just 601.4 GFLOPs and significantly suppresses FCOS3D by 5.4% mAP and 4.5% NDS. The superior performance of BEVDet demonstrates the power of paradigm innovation.
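
The four-stage paradigm (image-view encoder → view transformer → BEV encoder → task-specific head) can be sketched with placeholder modules. The view transform below is a deliberately crude stand-in (camera pooling plus resampling) for the actual image-to-BEV lifting; all shapes and module choices are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVDetSketch(nn.Module):
    """The four-stage BEVDet paradigm with placeholder modules (assumed
    shapes: 6 cameras, 64-channel features, a 128x128 BEV grid)."""
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, 64, 3, stride=4, padding=1)
        self.bev_encoder = nn.Conv2d(64, 64, 3, padding=1)
        self.det_head = nn.Conv2d(64, 10, 1)   # e.g. per-cell box parameters

    def view_transform(self, feats, bev_hw=(128, 128)):
        # Crude stand-in for image-to-BEV lifting: pool the camera views
        # and resample onto the BEV grid, to keep the sketch self-contained.
        return F.interpolate(feats.mean(dim=1), bev_hw)

    def forward(self, imgs):                    # (B, N_cam, 3, H, W)
        b, n = imgs.shape[:2]
        feats = self.image_encoder(imgs.flatten(0, 1))
        feats = feats.view(b, n, *feats.shape[1:])
        bev = self.bev_encoder(self.view_transform(feats))
        return self.det_head(bev)

out = BEVDetSketch()(torch.randn(1, 6, 3, 256, 704))
print(out.shape)   # torch.Size([1, 10, 128, 128])
```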

【7】 Few-Shot Object Detection: A Survey
Link: https://arxiv.org/abs/2112.11699

Authors: Mona Köhler, Markus Eisenbach, Horst-Michael Gross
Affiliations: Ilmenau University of Technology
Note: 24 pages, 13 figures, submitted to IEEE Transactions on Neural Networks and Learning Systems
Abstract: Humans are able to learn to recognize new objects even from a few examples. In contrast, training deep-learning-based object detectors requires huge amounts of annotated data. To avoid the need to acquire and annotate these huge amounts of data, few-shot object detection aims to learn from few object instances of new categories in the target domain. In this survey, we provide an overview of the state of the art in few-shot object detection. We categorize approaches according to their training scheme and architectural layout. For each type of approach, we describe the general realization as well as concepts to improve the performance on novel categories. Whenever appropriate, we give short takeaways regarding these concepts in order to highlight the best ideas. Eventually, we introduce commonly used datasets and their evaluation protocols and analyze reported benchmark results. As a result, we emphasize common challenges in evaluation and identify the most promising current trends in this emerging field of few-shot object detection.

【8】 GAN Based Boundary Aware Classifier for Detecting Out-of-distribution Samples
Link: https://arxiv.org/abs/2112.11648

Authors: Sen Pei, Xin Zhang, Richard YiDa Xu, Gaofeng Meng
Affiliations: NLPR, Institute of Automation, Chinese Academy of Sciences; Department of Mathematics, Hong Kong Baptist University
Abstract: This paper focuses on the problem of detecting out-of-distribution (ood) samples with neural nets. In image recognition tasks, the trained classifier often gives high confidence scores for input images which are remote from the in-distribution (id) data, and this has greatly limited its application in the real world. For alleviating this problem, we propose a GAN based boundary aware classifier (GBAC) for generating a closed hyperspace which only contains most id data. Our method is based on the fact that the traditional neural net separates the feature space into several unclosed regions which are not suitable for ood detection. With GBAC as an auxiliary module, the ood data distributed outside the closed hyperspace will be assigned a much lower score, allowing more effective ood detection while maintaining the classification performance. Moreover, we present a fast sampling method for generating hard ood representations which lie on the boundary of the aforementioned closed hyperspace. Experiments on several datasets and neural net architectures demonstrate the effectiveness of GBAC.

【9】 EyePAD++: A Distillation-based approach for joint Eye Authentication and Presentation Attack Detection using Periocular Images
Link: https://arxiv.org/abs/2112.11610

Authors: Prithviraj Dhar, Amit Kumar, Kirsten Kaplan, Khushi Gupta, Rakesh Ranjan, Rama Chellappa
Affiliations: Johns Hopkins University; Meta Platforms, Inc.
Abstract: A practical eye authentication (EA) system targeted for edge devices needs to perform authentication and be robust to presentation attacks, all while remaining compute and latency efficient. However, existing eye-based frameworks a) perform authentication and Presentation Attack Detection (PAD) independently and b) involve significant pre-processing steps to extract the iris region. Here, we introduce a joint framework for EA and PAD using periocular images. While a deep Multitask Learning (MTL) network can perform both the tasks, MTL suffers from the forgetting effect since the training datasets for EA and PAD are disjoint. To overcome this, we propose Eye Authentication with PAD (EyePAD), a distillation-based method that trains a single network for EA and PAD while reducing the effect of forgetting. To further improve the EA performance, we introduce a novel approach called EyePAD++ that includes training an MTL network on both EA and PAD data, while distilling the `versatility' of the EyePAD network through an additional distillation step. Our proposed methods outperform the SOTA in PAD and obtain near-SOTA performance in eye-to-eye verification, without any pre-processing. We also demonstrate the efficacy of EyePAD and EyePAD++ in user-to-user verification with PAD across network backbones and image quality.
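
The distillation ingredient can be illustrated with a generic knowledge-distillation objective of the kind such methods build on: the student matches the frozen teacher's softened predictions while also training on the new task. Temperature T and mixing weight alpha are assumed hyperparameters; this is not the paper's exact EyePAD formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, task_loss, T=4.0, alpha=0.5):
    """Generic KD objective: KL to the teacher's softened outputs plus the
    new-task loss. T and alpha are assumed, not the paper's values."""
    soft_t = F.softmax(teacher_logits / T, dim=1)
    soft_s = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_s, soft_t, reduction="batchmean") * T * T
    return alpha * kd + (1 - alpha) * task_loss

# Usage: teacher trained on EA; the student additionally learns PAD.
s = torch.randn(8, 100, requires_grad=True)   # student EA logits
t = torch.randn(8, 100)                        # frozen teacher EA logits
pad_loss = torch.tensor(0.7)                   # PAD task loss on the same batch
print(distillation_loss(s, t, pad_loss))
```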

【10】 Community Detection in Medical Image Datasets: Using Wavelets and Spectral Methods
Link: https://arxiv.org/abs/2112.12021

Authors: Roozbeh Yousefzadeh
Affiliations: Yale Center for Medical Informatics and VA Connecticut Healthcare System
Abstract: Medical image datasets can have a large number of images representing patients with different health conditions and various disease severity. When dealing with raw unlabeled image datasets, the large number of samples often makes it hard for experts and non-experts to understand the variety of images present in a dataset. Supervised learning methods rely on labeled images, which requires a considerable effort by medical experts to first understand the communities of images present in the data and then label the images. Here, we propose an algorithm to facilitate the automatic identification of communities in medical image datasets. We further explain that such analysis can also be insightful in a supervised setting, when the images are already labeled. Such insights are useful because in reality, health and disease severity can be considered a continuous spectrum, and within each class, there usually are finer communities worthy of investigation, especially when they have similarities to communities in other classes. In our approach, we use wavelet decomposition of images in tandem with spectral methods. We show that the eigenvalues of a graph Laplacian can reveal the number of notable communities in an image dataset. In our experiments, we use a dataset of images labeled with different conditions for COVID patients. We detect 25 communities in the dataset and then observe that only 6 of those communities contain patients with pneumonia. We also investigate the contents of a colorectal cancer histopathology dataset.
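
The spectral step, reading the number of notable communities off the Laplacian spectrum, can be sketched with the standard eigengap heuristic. Building the affinity matrix from wavelet coefficients is assumed here and not shown; the snippet covers only the Laplacian/eigengap part.

```python
import numpy as np

def estimate_num_communities(similarity, max_k=10):
    """Eigengap heuristic: small eigenvalues of the normalized graph
    Laplacian count the notable communities. `similarity` is a symmetric
    affinity matrix, e.g. built from distances between the images'
    wavelet coefficients (that construction is assumed, not shown)."""
    deg = similarity.sum(axis=1)
    d = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(similarity)) - d[:, None] * similarity * d[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(lap))[:max_k]
    return int(np.argmax(np.diff(eigvals))) + 1, eigvals

# Toy affinity with two obvious blocks -> two communities.
A = np.block([[np.ones((5, 5)), np.zeros((5, 5))],
              [np.zeros((5, 5)), np.ones((5, 5))]])
np.fill_diagonal(A, 0)
k, _ = estimate_num_communities(A)
print(k)  # 2
```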

【11】 Deep learning for brain metastasis detection and segmentation in longitudinal MRI data
Link: https://arxiv.org/abs/2112.11833

Authors: Yixing Huang, Christoph Bert, Philipp Sommer, Benjamin Frey, Udo Gaipl, Luitpold V. Distel, Thomas Weissmann, Michael Uder, Manuel A. Schmidt, Arnd Dörfler, Andreas Maier, Rainer Fietkau, Florian Putz
Affiliations: Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
Abstract: Brain metastases occur frequently in patients with metastatic cancer. Early and accurate detection of brain metastases is essential for treatment planning and prognosis in radiation therapy. To improve brain metastasis detection performance with deep learning, a custom detection loss called volume-level sensitivity-specificity (VSS) is proposed, which rates individual metastasis detection sensitivity and specificity in (sub-)volume levels. As sensitivity and precision are always a trade-off in a metastasis level, either a high sensitivity or a high precision can be achieved by adjusting the weights in the VSS loss without decline in dice score coefficient for segmented metastases. To reduce metastasis-like structures being detected as false positive metastases, a temporal prior volume is proposed as an additional input of the neural network. Our proposed VSS loss improves the sensitivity of brain metastasis detection, increasing the sensitivity from 86.7% to 95.5%. Alternatively, it improves the precision from 68.8% to 97.8%. With the additional temporal prior volume, about 45% of the false positive metastases are reduced in the high sensitivity model and the precision reaches 99.6% for the high specificity model. The mean dice coefficient for all metastases is about 0.81. With the ensemble of the high sensitivity and high specificity models, on average only 1.5 false positive metastases per patient need further check, while the majority of true positive metastases are confirmed. The ensemble learning is able to distinguish high confidence true positive metastases from metastases candidates that require special expert review or further follow-up, being particularly well-fit to the requirements of expert support in real clinical practice.
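
A simplified sensitivity-specificity loss conveys how a single weight w trades missed metastases against false positives, in the spirit of the proposed VSS loss; the paper's (sub-)volume-level rating is not reproduced here.

```python
import torch

def vss_like_loss(pred, target, w=0.9, eps=1e-6):
    """Simplified sensitivity-specificity trade-off loss (a sketch, not the
    paper's exact VSS formulation). pred: sigmoid probabilities, target:
    binary mask. w -> 1 favours sensitivity (fewer missed metastases),
    w -> 0 favours specificity (fewer false positives)."""
    sens_err = ((target - pred) ** 2 * target).sum() / (target.sum() + eps)
    spec_err = ((target - pred) ** 2 * (1 - target)).sum() / ((1 - target).sum() + eps)
    return w * sens_err + (1 - w) * spec_err

pred = torch.rand(1, 1, 32, 64, 64, requires_grad=True)   # (B, C, D, H, W)
mask = (torch.rand(1, 1, 32, 64, 64) > 0.99).float()      # sparse lesions
print(vss_like_loss(pred, mask))
```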

Classification | Recognition (3 papers)

【1】 Improved skin lesion recognition by a Self-Supervised Curricular Deep Learning approach
Link: https://arxiv.org/abs/2112.12086

Authors: Kirill Sirotkin, Marcos Escudero Viñolo, Pablo Carballeira, Juan Carlos SanMiguel
Note: 11 pages, 8 figures, submitted to the Journal of Biomedical and Health Informatics (Special Issue on Skin Image Analysis in the Age of Deep Learning)
Abstract: State-of-the-art deep learning approaches for skin lesion recognition often require pretraining on larger and more varied datasets, to overcome the generalization limitations derived from the reduced size of the skin lesion imaging datasets. ImageNet is often used as the pretraining dataset, but its transferring potential is hindered by the domain gap between the source dataset and the target dermatoscopic scenario. In this work, we introduce a novel pretraining approach that sequentially trains a series of Self-Supervised Learning pretext tasks and only requires the unlabeled skin lesion imaging data. We present a simple methodology to establish an ordering that defines a pretext task curriculum. For the multi-class skin lesion classification problem, and ISIC-2019 dataset, we provide experimental evidence showing that: i) a model pretrained by a curriculum of pretext tasks outperforms models pretrained by individual pretext tasks, and ii) a model pretrained by the optimal pretext task curriculum outperforms a model pretrained on ImageNet. We demonstrate that this performance gain is related to the fact that the curriculum of pretext tasks better focuses the attention of the final model on the skin lesion. Beyond performance improvement, this strategy allows for a large reduction in the training time with respect to ImageNet pretraining, which is especially advantageous for network architectures tailored for a specific problem.

【2】 High-Accuracy RGB-D Face Recognition via Segmentation-Aware Face Depth Estimation and Mask-Guided Attention Network
Link: https://arxiv.org/abs/2112.11713

Authors: Meng-Tzu Chiu, Hsun-Ying Cheng, Chien-Yi Wang, Shang-Hong Lai
Affiliations: Department of Computer Science, National Tsing Hua University, Taiwan; Microsoft AI R&D Center, Taiwan
Note: IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2021
Abstract: Deep learning approaches have achieved highly accurate face recognition by training the models with very large face image datasets. Unlike the availability of large 2D face image datasets, there is a lack of large 3D face datasets available to the public. Existing public 3D face datasets were usually collected with few subjects, leading to the over-fitting problem. This paper proposes two CNN models to improve the RGB-D face recognition task. The first is a segmentation-aware depth estimation network, called DepthNet, which estimates depth maps from RGB face images by including semantic segmentation information for more accurate face region localization. The other is a novel mask-guided RGB-D face recognition model that contains an RGB recognition branch, a depth map recognition branch, and an auxiliary segmentation mask branch with a spatial attention module. Our DepthNet is used to augment a large 2D face image dataset to a large RGB-D face dataset, which is used for training an accurate RGB-D face recognition model. Furthermore, the proposed mask-guided RGB-D face recognition model can fully exploit the depth map and segmentation mask information and is more robust against pose variation than previous methods. Our experimental results show that DepthNet can produce more reliable depth maps from face images with the segmentation mask. Our mask-guided face recognition model outperforms state-of-the-art methods on several public 3D face datasets.

【3】 Ghost-dil-NetVLAD: A Lightweight Neural Network for Visual Place Recognition
Link: https://arxiv.org/abs/2112.11679

Authors: Qingyuan Gong, Yu Liu, Liqiang Zhang, Renhe Liu
Affiliations: School of Microelectronics, Tianjin University, Tianjin, China
Abstract: Visual place recognition (VPR) is a challenging task given the imbalance between enormous computational cost and high recognition performance. Thanks to the practical feature extraction ability of lightweight convolution neural networks (CNNs) and the trainability of the vector of locally aggregated descriptors (VLAD) layer, we propose a lightweight weakly supervised end-to-end neural network consisting of a front-end perception model called GhostCNN and a learnable VLAD layer as a back-end. GhostCNN is based on Ghost modules, which are lightweight CNN-based architectures. They can generate redundant feature maps using linear operations instead of the traditional convolution process, making a good trade-off between computation resources and recognition accuracy. To enhance our proposed lightweight model further, we add dilated convolutions to the Ghost module to get features containing more spatial semantic information, improving accuracy. Finally, extensive experiments conducted on a commonly used public benchmark and our private dataset validate that the proposed neural network reduces the FLOPs and parameters of VGG16-NetVLAD by 99.04% and 80.16%, respectively. Besides, both models achieve similar accuracy.
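
The Ghost module with a dilated "cheap operation" can be sketched as follows; the ratio and kernel settings are typical GhostNet defaults assumed for illustration, and the placement of dilation follows the abstract's description rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class GhostModuleDilated(nn.Module):
    """Ghost module whose cheap depthwise operation is dilated, enlarging
    spatial context as the abstract suggests (settings are assumptions)."""
    def __init__(self, in_ch, out_ch, ratio=2, dilation=2):
        super().__init__()
        init_ch = out_ch // ratio
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU())
        # cheap linear operation: depthwise conv, here dilated
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, out_ch - init_ch, 3, padding=dilation,
                      dilation=dilation, groups=init_ch, bias=False),
            nn.BatchNorm2d(out_ch - init_ch), nn.ReLU())

    def forward(self, x):
        y = self.primary(x)                      # "intrinsic" feature maps
        return torch.cat([y, self.cheap(y)], dim=1)  # + "ghost" maps

print(GhostModuleDilated(16, 32)(torch.randn(1, 16, 56, 56)).shape)
# torch.Size([1, 32, 56, 56])
```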

Segmentation | Semantics (8 papers)

【1】 Open-Vocabulary Image Segmentation
Link: https://arxiv.org/abs/2112.12143

Authors: Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin
Affiliations: Google Research
Abstract: We design an open-vocabulary image segmentation model to organize an image into meaningful regions indicated by arbitrary texts. We identify that recent open-vocabulary models can not localize visual concepts well despite recognizing what are in an image. We argue that these models miss an important step of visual grouping, which organizes pixels into groups before learning visual-semantic alignments. We propose OpenSeg to address the above issue. First, it learns to propose segmentation masks for possible organizations. Then it learns visual-semantic alignments by aligning each word in a caption to one or a few predicted masks. We find the mask representations are the key to support learning from captions, making it possible to scale up the dataset and vocabulary sizes. Our work is the first to perform zero-shot transfer on holdout segmentation datasets. We set up two strong baselines by applying class activation maps or fine-tuning with pixel-wise labels on a pre-trained ALIGN model. OpenSeg outperforms these baselines by 3.4 mIoU on PASCAL-Context (459 classes) and 2.7 mIoU on ADE-20k (847 classes).

【2】 Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization
Link: https://arxiv.org/abs/2112.12072

Authors: Litian Zhang, Xiaoming Zhang, Junshu Pan, Feiran Huang
Affiliations: School of Cyber Science and Technology, Beihang University, Beijing, China; College of Cyber Security, Jinan University, Guangzhou, China
Note: Accepted by AAAI 2022
Abstract: Multimodal summarization with multimodal output (MSMO) generates a summary with both textual and visual content. Multimodal news reports contain heterogeneous contents, which makes MSMO nontrivial. Moreover, it is observed that different modalities of data in a news report correlate hierarchically. Traditional MSMO methods indistinguishably handle different modalities of data by learning a representation for the whole data, which is not directly adaptable to the heterogeneous contents and hierarchical correlation. In this paper, we propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlation existing in the multimodal data. HCSCL adopts a graph network to encode the intra-modal correlation. Then, a hierarchical fusion framework is proposed to learn the hierarchical correlation between text and images. Furthermore, we construct a new dataset with relevant image annotation and image object label information to provide the supervision information for the learning procedure. Extensive experiments on the dataset show that HCSCL significantly outperforms the baseline methods in automatic summarization metrics and fine-grained diversity tests.

【3】 Page Segmentation using Visual Adjacency Analysis
Link: https://arxiv.org/abs/2112.11975

Authors: Mohammad Bajammal, Ali Mesbah
Affiliations: University of British Columbia, Vancouver, BC, Canada
Abstract: Page segmentation is a web page analysis process that divides a page into cohesive segments, such as sidebars, headers, and footers. Current page segmentation approaches use either the DOM, textual content, or rendering style information of the page. However, these approaches have a number of drawbacks, such as a large number of parameters and rigid assumptions about the page, which negatively impact their segmentation accuracy. We propose a novel page segmentation approach based on visual analysis of localized adjacency regions. It combines DOM attributes and visual analysis to build features of a given page and guide an unsupervised clustering. We evaluate our approach on 35 real-world web pages, and examine the effectiveness and efficiency of segmentation. The results show that, compared with state-of-the-art, our approach achieves an average of 156% increase in precision and 249% improvement in F-measure.

【4】 A Discriminative Single-Shot Segmentation Network for Visual Object Tracking
Link: https://arxiv.org/abs/2112.11846

Authors: Alan Lukežič, Jiří Matas, Matej Kristan
Affiliations: University of Ljubljana, Slovenia; Czech Technical University in Prague, Czech Republic
Note: Extended version of the D3S tracker (CVPR 2020). Accepted to IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1911.08862
Abstract: Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker -- D3S2, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, the other assuming a rigid object to simultaneously achieve robust online target segmentation. The overall tracking reliability is further increased by decoupling the object and feature scale estimation. Without per-dataset finetuning, and trained only for segmentation as the primary output, D3S2 outperforms all published trackers on the recent short-term tracking benchmark VOT2020 and performs very close to the state-of-the-art trackers on the GOT-10k, TrackingNet, OTB100 and LaSoT. D3S2 outperforms the leading segmentation tracker SiamMask on video object segmentation benchmarks and performs on par with top video object segmentation algorithms.

【5】 Cost Aggregation Is All You Need for Few-Shot Segmentation
Link: https://arxiv.org/abs/2112.11685

Authors: Sunghwan Hong, Seokju Cho, Jisu Nam, Seungryong Kim
Affiliations: Korea University; Yonsei University
Note: Trained weights and code are available at https://seokju-cho.github.io/VAT/
Abstract: We introduce a novel cost aggregation network, dubbed Volumetric Aggregation with Transformers (VAT), to tackle the few-shot segmentation task by using both convolutions and transformers to efficiently handle high dimensional correlation maps between query and support. Specifically, we propose our encoder consisting of a volume embedding module to not only transform the correlation maps into a more tractable size but also inject some convolutional inductive bias, and a volumetric transformer module for the cost aggregation. Our encoder has a pyramidal structure to let the coarser level aggregation guide the finer level and enforce the learning of complementary matching scores. We then feed the output into our affinity-aware decoder along with the projected feature maps for guiding the segmentation process. Combining these components, we conduct experiments to demonstrate the effectiveness of the proposed method, and our method sets a new state-of-the-art for all the standard benchmarks in the few-shot segmentation task. Furthermore, we find that the proposed method attains state-of-the-art performance even for the standard benchmarks in the semantic correspondence task although not specifically designed for this task. We also provide an extensive ablation study to validate our architectural choices. The trained weights and codes are available at: https://seokju-cho.github.io/VAT/.
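
The object being aggregated, the high-dimensional correlation map between query and support features, can be computed as below; the pyramidal multi-level stacking and the volume embedding module are omitted, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def correlation_volume(query_feat, support_feat):
    """Dense cosine correlation between query and support features, the
    volume that an aggregation network like VAT would then process.
    Shapes are illustrative assumptions."""
    b, c, hq, wq = query_feat.shape
    _, _, hs, ws = support_feat.shape
    q = F.normalize(query_feat.flatten(2), dim=1)      # (B, C, Hq*Wq)
    s = F.normalize(support_feat.flatten(2), dim=1)    # (B, C, Hs*Ws)
    corr = torch.einsum('bcq,bcs->bqs', q, s)          # (B, Hq*Wq, Hs*Ws)
    return corr.clamp(min=0).view(b, hq, wq, hs, ws)   # 4D volume per image

vol = correlation_volume(torch.randn(2, 256, 16, 16), torch.randn(2, 256, 16, 16))
print(vol.shape)   # torch.Size([2, 16, 16, 16, 16])
```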

【6】 MOSAIC: Mobile Segmentation via decoding Aggregated Information and encoded Context
Link: https://arxiv.org/abs/2112.11623

Authors: Weijun Wang, Andrew Howard
Affiliations: Google Research
Abstract: We present a next-generation neural network architecture, MOSAIC, for efficient and accurate semantic image segmentation on mobile devices. MOSAIC is designed using commonly supported neural operations by diverse mobile hardware platforms for flexible deployment across various mobile platforms. With a simple asymmetric encoder-decoder structure which consists of an efficient multi-scale context encoder and a light-weight hybrid decoder to recover spatial details from aggregated information, MOSAIC achieves new state-of-the-art performance while balancing accuracy and computational cost. Deployed on top of a tailored feature extraction backbone based on a searched classification network, MOSAIC achieves a 5% absolute accuracy gain surpassing the current industry standard MLPerf models and state-of-the-art architectures.

【7】 Distribution-aware Margin Calibration for Semantic Segmentation in Images
Link: https://arxiv.org/abs/2112.11554

Authors: Litao Yu, Zhibin Li, Min Xu, Yongsheng Gao, Jiebo Luo, Jian Zhang
Affiliations: University of Technology Sydney; CSIRO; Griffith University; University of Rochester
Note: Accepted by the International Journal of Computer Vision (IJCV), published 09 November 2021. arXiv admin note: text overlap with arXiv:2011.01462
Abstract: The Jaccard index, also known as Intersection-over-Union (IoU), is one of the most critical evaluation metrics in image semantic segmentation. However, direct optimization of the IoU score is very difficult because the learning objective is neither differentiable nor decomposable. Although some algorithms have been proposed to optimize its surrogates, there is no guarantee provided for their generalization ability. In this paper, we propose a margin calibration method, which can be directly used as a learning objective, for an improved generalization of IoU over the data-distribution, underpinned by a rigid lower bound. This scheme theoretically ensures a better segmentation performance in terms of IoU score. We evaluated the effectiveness of the proposed margin calibration method on seven image datasets, showing substantial improvements in IoU score over other learning objectives using deep segmentation models.
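
For context, the standard differentiable IoU surrogate that this line of work builds on looks as follows; the paper's margin-calibration objective itself is not reproduced here, this is only the baseline-style soft-Jaccard loss.

```python
import torch

def soft_jaccard_loss(pred, target, eps=1e-6):
    """Standard soft-IoU (Jaccard) surrogate, the kind of objective the
    paper improves upon. pred: probabilities in [0, 1], target: binary mask."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

pred = torch.rand(2, 1, 64, 64, requires_grad=True)
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(soft_jaccard_loss(pred, mask))
```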

【8】 Teacher-Student Architecture for Mixed Supervised Lung Tumor Segmentation
Link: https://arxiv.org/abs/2112.11541

Authors: Vemund Fredriksen, Svein Ole M. Svele, André Pedersen, Thomas Langø, Gabriel Kiss, Frank Lindseth
Affiliations: Department of Computer Science, Norwegian University of Science and Technology; Department of Health Research, Medical Technology, SINTEF, Trondheim, Norway; Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology
Note: 17 pages, 3 figures, 5 tables, submitted to journal
Abstract: Purpose: Automating tasks such as lung tumor localization and segmentation in radiological images can free valuable time for radiologists and other clinical personnel. Convolutional neural networks may be suited for such tasks, but require substantial amounts of labeled data to train. Obtaining labeled data is a challenge, especially in the medical domain. Methods: This paper investigates the use of a teacher-student design to utilize datasets with different types of supervision to train an automatic model performing pulmonary tumor segmentation on computed tomography images. The framework consists of two models: the student that performs end-to-end automatic tumor segmentation and the teacher that supplies the student additional pseudo-annotated data during training. Results: Using only a small proportion of semantically labeled data and a large number of bounding box annotated data, we achieved competitive performance using a teacher-student design. Models trained on larger amounts of semantic annotations did not perform better than those trained on teacher-annotated data. Conclusions: Our results demonstrate the potential of utilizing teacher-student designs to reduce the annotation load, as less supervised annotation schemes may be performed, without any real degradation in segmentation accuracy.
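
A minimal mixed-supervision step in PyTorch: the student trains on the small semantically labeled set plus teacher-generated pseudo-masks for the box-annotated set. The toy one-layer "segmenters" and the way the teacher is conditioned are illustrative simplifications, not the paper's pipeline.

```python
import torch
import torch.nn as nn

# Toy student/teacher: 1-layer "segmenters" for illustration only.
student = nn.Conv2d(1, 2, 3, padding=1)
teacher = nn.Conv2d(1, 2, 3, padding=1)   # assumed already trained
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

imgs_l = torch.randn(2, 1, 64, 64)               # semantically labeled slices
masks_l = torch.randint(0, 2, (2, 64, 64))       # their ground-truth masks
imgs_b = torch.randn(2, 1, 64, 64)               # box-annotated slices

with torch.no_grad():                 # teacher supplies pseudo-annotations
    pseudo = teacher(imgs_b).argmax(dim=1)

opt.zero_grad()
loss = ce(student(imgs_l), masks_l) + ce(student(imgs_b), pseudo)
loss.backward()
opt.step()
print(loss.item())
```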

Zero/Few-Shot | Transfer | Domain Adaptation | Adaptive (8 papers)

【1】 A New Adaptive Noise Covariance Matrices Estimation and Filtering Method: Application to Multi-Object Tracking
Link: https://arxiv.org/abs/2112.12082

Authors: Chao Jiang, Zhiling Wang, Shuhang Tan, Huawei Liang
Affiliations: Chinese Academy of Sciences, Hefei, China; University of Science and Technology of China, Hefei, China
Abstract: Kalman filters are widely used for object tracking, where process and measurement noise are usually considered accurately known and constant. However, these exactly-known and constant assumptions do not always hold in practice. For example, when lidar is used to track noncooperative targets, the measurement noise is different under different distances and weather conditions. In addition, the process noise changes with the object's motion state, especially when the tracked object is a pedestrian, in which case the process noise changes more frequently. This paper proposes a new estimation-calibration-correction closed-loop estimation method to estimate the Kalman filter process and measurement noise covariance matrices online. First, we decompose the noise covariance matrix into an element distribution matrix and noise intensity and improve the Sage filter to estimate the element distribution matrix. Second, we propose a calibration method to accurately diagnose the noise intensity deviation. We then propose a correction method to adaptively correct the noise intensity online. Third, under the assumption that the system is detectable, the unbiasedness and convergence of the proposed method are mathematically proven. Simulation results prove the effectiveness and reliability of the proposed method. Finally, we apply the proposed method to multiobject tracking with lidar and evaluate it on the official KITTI server. The proposed method on the KITTI pedestrian multiobject tracking leaderboard (http://www.cvlibs.net/datasets/kitti/eval_tracking.php) surpasses all existing methods using lidar, proving the feasibility of the method in practical applications. This work provides a new way to improve the performance of the Kalman filter and multiobject tracking.
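
One step of a Kalman filter with innovation-based online adaptation of the measurement-noise covariance gives the flavor of such schemes; this is a simplified Sage-style update with an assumed forgetting factor, not the paper's full estimation-calibration-correction loop.

```python
import numpy as np

def adaptive_kf_step(x, P, z, F, H, Q, R, alpha=0.05):
    """One Kalman step that nudges R toward the innovation statistics.
    A simplified sketch; alpha is an assumed forgetting factor, and a
    full implementation would also enforce positive semi-definiteness."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Innovation and its empirical covariance
    y = z - H @ x
    R = (1 - alpha) * R + alpha * (np.outer(y, y) - H @ P @ H.T)
    R = 0.5 * (R + R.T)                   # keep R symmetric
    # Update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P, R

# Constant-velocity example: state (pos, vel), noisy position measurements.
dt = 0.1
F = np.array([[1, dt], [0, 1]]); H = np.array([[1.0, 0.0]])
x = np.zeros(2); P = np.eye(2); Q = 1e-3 * np.eye(2); R = np.eye(1)
for t in range(50):
    z = np.array([0.5 * t * dt + np.random.randn() * 0.3])
    x, P, R = adaptive_kf_step(x, P, z, F, H, Q, R)
print(x, R)   # R drifts toward the true measurement noise level
```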

【2】 Few-shot Font Generation with Weakly Supervised Localized Representations
Link: https://arxiv.org/abs/2112.11895

Authors: Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, Hyunjung Shim
Affiliations: School of Integrated Technology, Yonsei University
Note: First two authors contributed equally. This is a journal extension of our AAAI 2021 paper arXiv:2009.11042. Code: https://github.com/clovaai/lffont and https://github.com/clovaai/fewshot-font-generation
Abstract: Automatic few-shot font generation aims to solve a well-defined, real-world problem because manual font designs are expensive and sensitive to the expertise of designers. Existing methods learn to disentangle style and content elements by developing a universal style representation for each font style. However, this approach limits the model in representing diverse local styles, because it is unsuitable for complicated letter systems, for example, Chinese, whose characters consist of a varying number of components (often called "radical") -- with a highly complex structure. In this paper, we propose a novel font generation method that learns localized styles, namely component-wise style representations, instead of universal styles. The proposed style representations enable the synthesis of complex local details in text designs. However, learning component-wise styles solely from a few reference glyphs is infeasible when a target script has a large number of components, for example, over 200 for Chinese. To reduce the number of required reference glyphs, we represent component-wise styles by a product of component and style factors, inspired by low-rank matrix factorization. Owing to the combination of strong representation and a compact factorization strategy, our method shows remarkably better few-shot font generation results (with only eight reference glyphs) than other state-of-the-art methods. Moreover, strong locality supervision, for example, location of each component, skeleton, or strokes, was not utilized. The source code is available at https://github.com/clovaai/lffont and https://github.com/clovaai/fewshot-font-generation.
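
The factorization idea, component-wise style as a product of a shared per-component factor and a per-font style factor, can be sketched as follows; the Hadamard form of the product and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FactorizedComponentStyle(nn.Module):
    """Component-wise style as a product of a shared per-component factor
    and a per-font style factor (the low-rank idea from the abstract), so
    a few reference glyphs can cover hundreds of components. Dimensions
    and the elementwise form of the product are assumptions."""
    def __init__(self, num_components=200, dim=64):
        super().__init__()
        self.component_factor = nn.Embedding(num_components, dim)

    def forward(self, component_ids, font_style_factor):
        # component_ids: (B, K) components present in the target glyph
        # font_style_factor: (B, dim), pooled from the reference glyphs
        comp = self.component_factor(component_ids)      # (B, K, dim)
        return comp * font_style_factor.unsqueeze(1)     # (B, K, dim)

styles = FactorizedComponentStyle()(torch.randint(0, 200, (2, 4)),
                                    torch.randn(2, 64))
print(styles.shape)   # torch.Size([2, 4, 64])
```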

【3】 Generalized Local Optimality for Video Steganalysis in Motion Vector Domain
Link: https://arxiv.org/abs/2112.11729

Authors: Liming Zhai, Lina Wang, Yanzhen Ren, Yang Liu
Affiliations: Wuhan University, China; Nanyang Technological University, Singapore
Abstract: The local optimality of motion vectors (MVs) is an intrinsic property in video coding, and any modifications to the MVs will inevitably destroy this optimality, making it a sensitive indicator of steganography in the MV domain. Thus the local optimality is commonly used to design steganalytic features, and the estimation of local optimality has become a top priority in video steganalysis. However, the local optimality in existing works is often estimated inaccurately or under an unreasonable assumption, limiting its capability in steganalysis. In this paper, we propose to estimate the local optimality in a more reasonable and comprehensive fashion, and generalize the concept of local optimality in two aspects. First, the local optimality measured in a rate-distortion sense is jointly determined by the MV and the predicted motion vector (PMV), and the variability of the PMV will affect the estimation of local optimality. Hence we generalize the local optimality from a static estimation to a dynamic one. Second, the PMV is a special case of the MV, and can also reflect the embedding traces in MVs. So we generalize the local optimality from the MV domain to the PMV domain. Based on the two generalizations of local optimality, we construct new types of steganalytic features and also propose feature symmetrization rules to reduce feature dimension. Extensive experiments performed on three databases demonstrate the effectiveness of the proposed features, which achieve state-of-the-art accuracy and robustness under various conditions, including cover source mismatch, video prediction methods, video codecs, and video resolutions.
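
The underlying rate-distortion check is easy to state in code: a motion vector is locally optimal if no neighbouring candidate achieves a lower cost J = D + λR. The sketch below is the generic static, MV-domain check, not the paper's generalized (dynamic, PMV-aware) estimator.

```python
import numpy as np

def is_locally_optimal(cost_fn, mv, lam=1.0):
    """Check rate-distortion local optimality of a motion vector: mv is
    locally optimal if no 8-neighbour candidate has lower J = D + lam*R.
    cost_fn(mv) must return (distortion, rate); a generic sketch."""
    def J(v):
        d, r = cost_fn(v)
        return d + lam * r
    neighbours = [(mv[0] + dx, mv[1] + dy)
                  for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]
    return all(J(mv) <= J(n) for n in neighbours)

# Toy cost: quadratic distortion around (2, 3), rate = |mv| in bits.
cost = lambda v: ((v[0] - 2) ** 2 + (v[1] - 3) ** 2, abs(v[0]) + abs(v[1]))
print(is_locally_optimal(cost, (2, 3)), is_locally_optimal(cost, (5, 5)))
# True False -- a modified MV typically loses local optimality
```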

【4】 Adaptive Contrast for Image Regression in Computer-Aided Disease Assessment
Link: https://arxiv.org/abs/2112.11700

Authors: Weihang Dai, Xiaomeng Li, Wan Hang Keith Chiu, Michael D. Kuo, Kwang-Ting Cheng
Note: Accepted in IEEE Transactions on Medical Imaging
Abstract: Image regression tasks for medical applications, such as bone mineral density (BMD) estimation and left-ventricular ejection fraction (LVEF) prediction, play an important role in computer-aided disease assessment. Most deep regression methods train the neural network with a single regression loss function like MSE or L1 loss. In this paper, we propose the first contrastive learning framework for deep image regression, namely AdaCon, which consists of a feature learning branch via a novel adaptive-margin contrastive loss and a regression prediction branch. Our method incorporates label distance relationships as part of the learned feature representations, which allows for better performance in downstream regression tasks. Moreover, it can be used as a plug-and-play module to improve performance of existing regression methods. We demonstrate the effectiveness of AdaCon on two medical image regression tasks, i.e., bone mineral density estimation from X-ray images and left-ventricular ejection fraction prediction from echocardiogram videos. AdaCon leads to relative improvements of 3.3% and 5.9% in MAE over state-of-the-art BMD estimation and LVEF prediction methods, respectively.
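
A pairwise contrastive loss whose margin grows with label distance captures the AdaCon idea in spirit (the paper's exact formulation is not reproduced): the embedding is pushed to mirror the label ordering.

```python
import torch

def adaptive_margin_contrastive(features, labels, scale=1.0):
    """Pairwise loss with a margin proportional to label distance, so the
    embedding inherits the label ordering. A sketch, not AdaCon's exact
    objective. features: (B, D) normalized embeddings, labels: (B,)."""
    d_feat = torch.cdist(features, features)             # (B, B) embedding dist
    d_label = (labels[:, None] - labels[None, :]).abs()  # (B, B) label dist
    margin = scale * d_label                             # adaptive margin
    # push each pair's embedding distance toward its label distance
    loss = (d_feat - margin).abs()
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    return loss[off_diag].mean()

f = torch.nn.functional.normalize(torch.randn(8, 16, requires_grad=True), dim=1)
y = torch.rand(8)   # e.g. BMD values
print(adaptive_margin_contrastive(f, y))
```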

【5】 Multi-Centroid Representation Network for Domain Adaptive Person Re-ID
Link: https://arxiv.org/abs/2112.11689

Authors: Yuhang Wu, Tengteng Huang, Haotian Yao, Chi Zhang, Yuanjie Shao, Chuchu Han, Changxin Gao, Nong Sang
Affiliations: Key Laboratory of Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Megvii Technology
Note: Accepted by AAAI 2022
Abstract: Recently, many approaches tackle the Unsupervised Domain Adaptive person re-identification (UDA re-ID) problem through pseudo-label-based contrastive learning. During training, a uni-centroid representation is obtained by simply averaging all the instance features from a cluster with the same pseudo label. However, a cluster may contain images with different identities (label noises) due to the imperfect clustering results, which makes the uni-centroid representation inappropriate. In this paper, we present a novel Multi-Centroid Memory (MCM) to adaptively capture different identity information within the cluster. MCM can effectively alleviate the issue of label noises by selecting proper positive/negative centroids for the query image. Moreover, we further propose two strategies to improve the contrastive learning process. First, we present a Domain-Specific Contrastive Learning (DSCL) mechanism to fully explore intra-domain information by comparing samples only from the same domain. Second, we propose Second-Order Nearest Interpolation (SONI) to obtain abundant and informative negative samples. We integrate MCM, DSCL, and SONI into a unified framework named Multi-Centroid Representation Network (MCRN). Extensive experiments demonstrate the superiority of MCRN over state-of-the-art approaches on multiple UDA re-ID tasks and fully unsupervised re-ID tasks.

【6】 JoJoGAN: One Shot Face Stylization
Link: https://arxiv.org/abs/2112.11641

Authors: Min Jin Chong, David Forsyth
Affiliations: University of Illinois at Urbana-Champaign
Notes: code at this https URL
Abstract: While there have been recent advances in few-shot image stylization, these methods fail to capture stylistic details that are obvious to humans. Details such as the shape of the eyes or the boldness of the lines are especially difficult for a model to learn, especially under a limited data setting. In this work, we aim to perform one-shot image stylization that gets the details right. Given a reference style image, we approximate paired real data using GAN inversion and finetune a pretrained StyleGAN using that approximate paired data. We then encourage the StyleGAN to generalize so that the learned style can be applied to all other images.

【7】 Convolutional neural network based on transfer learning for breast cancer screening
Link: https://arxiv.org/abs/2112.11629

Authors: Hussin Ragb, Redha Ali, Elforjani Jera, Nagi Buaossa
Affiliations: Department of Electrical and Computer Engineering, Christian Brothers University, Memphis, TN; University of Dayton, Dayton, Ohio; Department of Electro Optics and Photonics
Notes: 9 pages, 7 figures. arXiv admin note: text overlap with arXiv:2009.08831
Abstract: Breast cancer is the most common cancer in the world and the most prevalent cause of death among women worldwide. Nevertheless, it is also one of the most treatable malignancies if detected early. In this paper, a deep convolutional neural network-based algorithm is proposed to aid in accurately identifying breast cancer from ultrasonic images. In this algorithm, several neural networks are fused in a parallel architecture to perform the classification process, and a voting criterion is applied to the final classification decision between the candidate object classes, where the output of each neural network represents a single vote. Several experiments were conducted on the breast ultrasound dataset consisting of 537 benign, 360 malignant, and 133 normal images. These experiments show promising results and the capability of the proposed model to outperform many state-of-the-art algorithms on several measures. Using k-fold cross-validation and a bagging classifier ensemble, we achieved an accuracy of 99.5% and a sensitivity of 99.6%.
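The voting criterion described here is easy to make concrete: each network's argmax counts as one vote per image, with ties broken by averaged probabilities. A minimal sketch, not the authors' exact fusion rule.

import numpy as np

def majority_vote(prob_list):
    # prob_list: list of (N, C) probability arrays, one per network.
    probs = np.stack(prob_list)                 # (M, N, C)
    votes = probs.argmax(axis=2)                # (M, N) per-network predictions
    n_models, n_samples, n_classes = probs.shape
    counts = np.zeros((n_samples, n_classes))
    for m in range(n_models):
        counts[np.arange(n_samples), votes[m]] += 1
    # break ties with the mean probability over models
    return (counts + probs.mean(axis=0) * 1e-3).argmax(axis=1)

p1 = np.random.dirichlet(np.ones(3), size=10)   # benign / malignant / normal
p2 = np.random.dirichlet(np.ones(3), size=10)
p3 = np.random.dirichlet(np.ones(3), size=10)
print(majority_vote([p1, p2, p3]))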

【8】 AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation
Link: https://arxiv.org/abs/2112.11593

Authors: Mohsen Gholami, Bastian Wandt, Helge Rhodin, Rabab Ward, Z. Jane Wang
Affiliations: University of British Columbia
Abstract: This paper addresses the problem of cross-dataset generalization of 3D human pose estimation models. Testing a pre-trained 3D pose estimator on a new dataset results in a major performance drop. Previous methods have mainly addressed this problem by improving the diversity of the training data. We argue that diversity alone is not sufficient and that the characteristics of the training data need to be adapted to those of the new dataset, such as camera viewpoint, position, human actions, and body size. To this end, we propose AdaptPose, an end-to-end framework that generates synthetic 3D human motions from a source dataset and uses them to fine-tune a 3D pose estimator. AdaptPose follows an adversarial training scheme. From a source 3D pose, the generator generates a sequence of 3D poses and a camera orientation that is used to project the generated poses to a novel view. Without any 3D labels or camera information, AdaptPose successfully learns to create synthetic 3D poses from the target dataset while only being trained on 2D poses. In experiments on the Human3.6M, MPI-INF-3DHP, 3DPW, and Ski-Pose datasets, our method outperforms previous work in cross-dataset evaluations by 14% and previous semi-supervised learning methods that use partial 3D annotations by 16%.

Semi-/Weakly-/Un-supervised | Active Learning | Uncertainty (3 papers)

【1】 Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
Link: https://arxiv.org/abs/2112.12141

Authors: Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song, Charles R. Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou, Congcong Li, Dragomir Anguelov
Affiliations: Waymo LLC; Google Research
Abstract: 3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors, including the 3D resolution and range of data, the absence of dense depth maps, failure modes for LiDAR, the relative location between the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive. In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over a camera-only 2D HPE baseline, and a 6% improvement over a LiDAR-only model. Finally, careful ablation studies and parts-based analysis illustrate the advantages of each of our contributions.
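The core weak-supervision signal, projecting predicted 3D keypoints through the camera intrinsics and penalizing their distance to the 2D labels, can be sketched as below. This is a generic reprojection loss under assumed pinhole intrinsics, not the paper's full multi-modal pipeline.

import torch
import torch.nn.functional as F

def weak_2d_supervision_loss(pred_xyz, kp2d, K):
    # pred_xyz: (N, J, 3) keypoints in camera coordinates (z > 0)
    # kp2d:     (N, J, 2) annotated pixel coordinates
    # K:        (3, 3) camera intrinsic matrix
    uvw = pred_xyz @ K.T                          # homogeneous pixel coordinates
    uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)
    return F.l1_loss(uv, kp2d)

K = torch.tensor([[1000., 0., 640.], [0., 1000., 360.], [0., 0., 1.]])
pred = torch.rand(4, 17, 3) + torch.tensor([0., 0., 2.])  # keep depth positive
labels = torch.rand(4, 17, 2) * 500
print(weak_2d_supervision_loss(pred, labels, K))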

【2】 Barely-Supervised Learning: Semi-Supervised Learning with very few labeled images
Link: https://arxiv.org/abs/2112.12004

Authors: Thomas Lucas, Philippe Weinzaepfel, Gregory Rogez
Affiliations: Naver Labs Europe
Abstract: This paper tackles the problem of semi-supervised learning when the set of labeled samples is limited to a small number of images per class, typically less than 10, a problem that we refer to as barely-supervised learning. We analyze in depth the behavior of a state-of-the-art semi-supervised method, FixMatch, which relies on a weakly-augmented version of an image to obtain a supervision signal for a more strongly-augmented version. We show that it frequently fails in barely-supervised scenarios, due to a lack of training signal when no pseudo-label can be predicted with high confidence. We propose a method to leverage self-supervised methods that provides a training signal in the absence of confident pseudo-labels. We then propose two methods to refine the pseudo-label selection process which lead to further improvements. The first one relies on a per-sample history of the model predictions, akin to a voting scheme. The second iteratively updates class-dependent confidence thresholds to better explore classes that are under-represented in the pseudo-labels. Our experiments show that our approach performs significantly better on STL-10 in the barely-supervised regime, e.g. with 4 or 8 labeled images per class.
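The second refinement, class-dependent confidence thresholds, can be illustrated as follows: classes that rarely win confident pseudo-labels get a lower threshold so they are explored more. The linear schedule between tau_min and tau_max is an assumption for illustration, not the paper's exact update rule.

import numpy as np

def update_class_thresholds(probs, tau_max=0.95, tau_min=0.5):
    # Classes under-represented among confident pseudo-labels get lower thresholds.
    preds = probs.argmax(1)
    conf = probs.max(1)
    n_classes = probs.shape[1]
    # how often each class currently wins a confident pseudo-label
    freq = np.array([(conf[preds == c] > tau_max).sum() for c in range(n_classes)])
    freq = freq / max(freq.max(), 1)
    return tau_min + (tau_max - tau_min) * freq   # rare classes get a low threshold

def select_pseudo_labels(probs, thresholds):
    preds = probs.argmax(1)
    keep = probs.max(1) > thresholds[preds]
    return np.where(keep)[0], preds[keep]

probs = np.random.dirichlet(np.ones(10) * 0.3, size=256)
tau = update_class_thresholds(probs)
idx, plabels = select_pseudo_labels(probs, tau)
print(len(idx), tau.round(2))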

【3】 Meta-Learning and Self-Supervised Pretraining for Real World Image Translation
Link: https://arxiv.org/abs/2112.11929

Authors: Ileana Rugina, Rumen Dangovski, Mark Veillette, Pooya Khorrami, Brian Cheung, Olga Simek, Marin Soljačić
Affiliations: MIT EECS; MIT Lincoln Lab; MIT CSAIL & BCS; MIT Physics
Notes: 10 pages, 8 figures, 2 tables
Abstract: Recent advances in deep learning, in particular enabled by hardware advances and big data, have provided impressive results across a wide range of computational problems such as computer vision, natural language, or reinforcement learning. Many of these improvements are however constrained to problems with large-scale curated data-sets which require a lot of human labor to gather. Additionally, these models tend to generalize poorly under both slight distributional shifts and low-data regimes. In recent years, emerging fields such as meta-learning or self-supervised learning have been closing the gap between proof-of-concept results and real-life applications of machine learning by extending deep learning to the semi-supervised and few-shot domains. We follow this line of work and explore spatio-temporal structure in a recently introduced image-to-image translation problem in order to: i) formulate a novel multi-task few-shot image generation benchmark and ii) explore data augmentations in contrastive pre-training for image translation downstream tasks. We present several baselines for the few-shot problem and discuss trade-offs between different approaches. Our code is available at https://github.com/irugina/meta-image-translation.

Temporal | Action Recognition | Pose | Video | Motion Estimation (2 papers)

【1】 Spatio-Temporal CNN baseline method for the Sports Video Task of MediaEval 2021 benchmark
Link: https://arxiv.org/abs/2112.12074

Authors: Pierre-Etienne Martin
Affiliations: CCP Department, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Notes: None
Abstract: This paper presents the baseline method proposed for the Sports Video task, part of the MediaEval 2021 benchmark. This task proposes a stroke detection subtask and a stroke classification subtask. This baseline addresses both subtasks. The spatio-temporal CNN architecture and the training process of the model are tailored according to the addressed subtask. The method has the purpose of helping the participants to solve the task and is not meant to reach state-of-the-art performance. Still, for the detection task, the baseline performs better than the other participants, which stresses the difficulty of such a task.

【2】 Bottom-up approaches for multi-person pose estimation and its applications: A brief review
Link: https://arxiv.org/abs/2112.11834

Authors: Milan Kresović, Thong Duy Nguyen
Affiliations: Norwegian University of Science and Technology, Norway
Notes: 13 pages, 11 figures
Abstract: Human Pose Estimation (HPE) is one of the fundamental problems in computer vision. It has applications ranging from virtual reality, human behavior analysis, video surveillance, anomaly detection, and self-driving to medical assistance. The main objective of HPE is to obtain the person's posture from the given input. Among the different paradigms for HPE, one is called bottom-up multi-person pose estimation. In the bottom-up approach, all the key points of the targets are detected first, and later, in the optimization stage, the detected key points are associated with the corresponding targets. This review paper discusses the recent advancements in bottom-up approaches for HPE and lists possible high-quality datasets used to train the models. Additionally, a discussion of the prominent bottom-up approaches and their quantitative results on standard performance metrics is given. Finally, the limitations of the existing methods are highlighted, and guidelines for future research directions are given.

Medical (2 papers)

【1】 Comparing radiologists' gaze and saliency maps generated by interpretability methods for chest x-rays
Link: https://arxiv.org/abs/2112.11716

Authors: Ricardo Bigolin Lanfredi, Ambuj Arora, Trafton Drew, Joyce D. Schroeder, Tolga Tasdizen
Affiliations: Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA; School of Computing, University of Utah, Salt Lake City, UT, USA; Department of Psychology, University of Utah, Salt Lake City, UT, USA
Abstract: The interpretability of medical image analysis models is considered a key research field. We use a dataset of eye-tracking data from five radiologists to compare the outputs of interpretability methods against heatmaps representing where the radiologists looked. We conduct a class-independent analysis of the saliency maps generated by two methods selected from the literature: Grad-CAM and attention maps from an attention-gated model. For the comparison, we use shuffled metrics, which avoid biases from fixation locations. We achieve scores comparable to an interobserver baseline in one shuffled metric, highlighting the potential of saliency maps from Grad-CAM to mimic a radiologist's attention over an image. We also divide the dataset into subsets to evaluate in which cases the similarities are higher.
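One common shuffled metric is the shuffled AUC, which scores saliency at true fixations against saliency at fixations borrowed from other images, cancelling dataset-wide biases such as center bias. The sketch below is a generic version of that idea; the exact shuffled metrics used in the paper may differ.

import numpy as np

def shuffled_auc(saliency, fixations, other_fixations):
    # saliency: (H, W) map; fixations / other_fixations: (K, 2) row-col indices.
    pos = saliency[fixations[:, 0], fixations[:, 1]]
    neg = saliency[other_fixations[:, 0], other_fixations[:, 1]]
    # probability that a random true fixation outranks a random borrowed one
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

sal = np.random.rand(224, 224)
fix = np.random.randint(0, 224, size=(30, 2))       # this image's fixations
other = np.random.randint(0, 224, size=(300, 2))    # fixations from other images
print(shuffled_auc(sal, fix, other))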

【2】 Fusion of medical imaging and electronic health records with attention and multi-head mechanisms
Link: https://arxiv.org/abs/2112.11710

Authors: Cheng Jiang, Yihao Chen, Jianbo Chang, Ming Feng, Renzhi Wang, Jianhua Yao
Abstract: Doctors often make diagnostic decisions based on a patient's image scans, such as magnetic resonance imaging (MRI), and the patient's electronic health records (EHR), such as age, gender, and blood pressure. Although many automatic methods have been proposed for either image or text analysis in the computer vision and natural language research areas, far fewer studies have addressed the fusion of medical images and EHR data for medical problems. Among existing early or intermediate fusion methods, the concatenation of features from both modalities is still the mainstream. To better exploit image and EHR data, we propose a multi-modal attention module which uses EHR data to help select important regions during the image feature extraction process conducted by a traditional CNN. Moreover, we propose to incorporate a multi-head mechanism into the gated multimodal unit (GMU) to enable it to fuse image and EHR features in parallel in different subspaces. With the help of the two modules, existing CNN architectures can be enhanced using both modalities. Experiments on predicting the Glasgow Outcome Scale (GOS) of intracerebral hemorrhage patients and classifying Alzheimer's disease showed that the proposed method can automatically focus on task-related areas and achieve better results by making better use of image and EHR features.
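A minimal multi-head gated multimodal unit: both modalities are projected, and a learned gate mixes them independently in several subspaces. Dimensions and head count are illustrative assumptions, and the paper's EHR-guided attention module is not shown.

import torch
import torch.nn as nn

class MultiHeadGMU(nn.Module):
    # Gated Multimodal Unit with multiple heads: image and EHR features are
    # fused independently in several subspaces, echoing the parallel-fusion idea.
    def __init__(self, img_dim, ehr_dim, hidden, heads=4):
        super().__init__()
        assert hidden % heads == 0
        self.heads, self.d = heads, hidden // heads
        self.img_proj = nn.Linear(img_dim, hidden)
        self.ehr_proj = nn.Linear(ehr_dim, hidden)
        self.gate = nn.Linear(img_dim + ehr_dim, hidden)

    def forward(self, img, ehr):
        hi = torch.tanh(self.img_proj(img)).view(-1, self.heads, self.d)
        he = torch.tanh(self.ehr_proj(ehr)).view(-1, self.heads, self.d)
        z = torch.sigmoid(self.gate(torch.cat([img, ehr], 1)))
        z = z.view(-1, self.heads, self.d)          # one gate per head and channel
        return (z * hi + (1 - z) * he).flatten(1)

fused = MultiHeadGMU(img_dim=512, ehr_dim=32, hidden=256)(
    torch.randn(8, 512), torch.randn(8, 32))
print(fused.shape)   # torch.Size([8, 256])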

Autonomous Driving | Vehicles | Lane Detection, etc. (1 paper)

【1】 Exploring Credibility Scoring Metrics of Perception Systems for Autonomous Driving
Link: https://arxiv.org/abs/2112.11643

Authors: Viren Khandal, Arth Vidyarthi
Affiliations: University of California, Berkeley, Berkeley, California, United States
Notes: In 14th International Conference on COMmunication Systems & NETworkS (COMSNETS) Intelligent Transportation Systems 2022
Abstract: Autonomous and semi-autonomous vehicles' perception algorithms can encounter situations with erroneous object detection, such as misclassification of objects on the road, which can lead to safety violations and potentially fatal consequences. While there has been substantial work on the robustness of object detection algorithms and online metric learning, there is little research on benchmarking scoring metrics to determine any possible indicators of potential misclassification. An emphasis is put on exploring the potential of taking these scoring metrics online in order to allow the AV to make perception-based decisions given real-time constraints. In this work, we explore which metrics, if any, act as online indicators of when perception algorithms and object detectors are failing. Our work provides insight into better design principles and characteristics of online metrics to accurately evaluate the credibility of object detectors. Our approach employs non-adversarial and realistic perturbations to images, on which we evaluate various quantitative metrics. We found that offline metrics can be designed to account for real-world corruptions such as poor weather conditions, and that the analysis of such metrics can provide a segue into designing online metrics. This is a clear next step as it can allow for error-free autonomous vehicle perception and safer time-critical and safety-critical decision-making.

Face | Crowd Counting (2 papers)

【1】 Automatic Estimation of Anthropometric Human Body Measurements
Link: https://arxiv.org/abs/2112.11992

Authors: Dana Škorvánková, Adam Riečický, Martin Madaras
Affiliations: Physics and Informatics, Comenius University Bratislava, Slovakia; Skeletex Research, Slovakia
Abstract: Research tasks related to human body analysis have been drawing a lot of attention in the computer vision area over the last few decades, considering their potential benefits to our day-to-day life. Anthropometry is a field defining physical measures of a human body's size, form, and functional capacities. Specifically, the accurate estimation of anthropometric body measurements from visual human body data is one of the challenging problems, where a solution would ease many different application areas, including ergonomics, garment manufacturing, etc. This paper formulates research in the field of deep learning and neural networks to tackle the challenge of body measurement estimation from various types of visual input data (such as 2D images or 3D point clouds). Also, we deal with the lack of real human data annotated with the ground-truth body measurements required for training and evaluation by generating a synthetic dataset of various human body shapes and performing skeleton-driven annotation.

【2】 Real-time Street Human Motion Capture
Link: https://arxiv.org/abs/2112.11543

Authors: Yanquan Chen, Fei Yang, Tianyu Lang, Guanfang Dong, Anup Basu
Affiliations: University of Alberta
Notes: 7 pages, 7 figures
Abstract: In recent years, motion capture technology using computers has developed rapidly. Because of its high efficiency and excellent performance, it has replaced many traditional methods and is widely used in many fields. Our project is about human motion capture and analysis in street-scene videos. The primary goal of the project is to capture the human motion in a video and use the motion information for 3D animation (of a human) in real time. We applied a neural network for motion capture and implemented it in Unity under a street-view scene. By analyzing the motion data, we can better estimate the street condition, which is useful for other high-tech applications such as self-driving cars.

Distillation | Knowledge Extraction (1 paper)

【1】 Multimodal Analysis of memes for sentiment extraction
Link: https://arxiv.org/abs/2112.11850

Authors: Nayan Varma Alluri, Neeli Dheeraj Krishna
Affiliations: Department of Computer Science and Engineering, PES University, Bangalore, India
Notes: 5 pages
Abstract: Memes are one of the most ubiquitous forms of social media communication. The study and processing of memes, which are intrinsically multimedia, is a popular topic right now. The study presented in this research is based on the Memotion dataset, which involves categorising memes based on irony, comedy, motivation, and overall sentiment. Three separate innovative transformer-based techniques have been developed, and their outcomes have been thoroughly reviewed. Across all our techniques, the best algorithm achieved a macro F1 score of 0.633 for humour classification, 0.55 for motivation classification, 0.61 for sarcasm classification, and 0.575 for the overall sentiment of the meme.

Visual Explanation | Video Understanding & VQA | Captioning, etc. (1 paper)

【1】 CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes
Link: https://arxiv.org/abs/2112.11691

Authors: Xu Yan, Zhihao Yuan, Yuhao Du, Yinghong Liao, Yao Guo, Zhen Li, Shuguang Cui
Affiliations: The Chinese University of Hong Kong (Shenzhen); Shenzhen Research Institute of Big Data; Shanghai Jiao Tong University
Abstract: 3D scene understanding is a relatively emerging research field. In this paper, we introduce the Visual Question Answering task in 3D real-world scenes (VQA-3D), which aims to answer all possible questions given a 3D scene. To tackle this problem, the first VQA-3D dataset, namely CLEVR3D, is proposed, which contains 60K questions in 1,129 real-world scenes. Specifically, we develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions, covering questions about objects' attributes (i.e., size, color, and material) and their spatial relationships. Built upon this dataset, we further design the first VQA-3D baseline model, TransVQA3D. The TransVQA3D model adopts well-designed Transformer architectures to achieve superior VQA-3D performance, compared with a pure-language baseline and previous 3D reasoning methods directly applied to 3D scenarios. Experimental results verify that taking VQA-3D as an auxiliary task can boost the performance of 3D scene understanding, including scene graph analysis for node-wise classification and whole-graph recognition.

Super-Resolution | Denoising | Deblurring | Dehazing (2 papers)

【1】 Reflash Dropout in Image Super-Resolution
Link: https://arxiv.org/abs/2112.12089

Authors: Xiangtao Kong, Xina Liu, Jinjin Gu, Yu Qiao, Chao Dong
Affiliations: Shenzhen Institutes of Advanced Technology, CAS; University of Chinese Academy of Sciences; The University of Sydney; Shanghai AI Lab, Shanghai, China
Abstract: Dropout is designed to relieve the overfitting problem in high-level vision tasks but is rarely applied in low-level vision tasks like image super-resolution (SR). As a classic regression problem, SR exhibits different behaviour from high-level tasks and is sensitive to the dropout operation. However, in this paper, we show that appropriate usage of dropout benefits SR networks and improves the generalization ability. Specifically, dropout is better embedded at the end of the network and is significantly helpful for multi-degradation settings. This discovery challenges our common sense and inspires us to explore its working mechanism. We further use two analysis tools: one is from recent network interpretation works, and the other is specially designed for this task. The analysis results provide side evidence for our experimental findings and show us a new perspective to understand SR networks.
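The finding translates into a one-line architectural change: place a dropout layer just before the reconstruction tail of the SR network. The toy network below is only a stand-in to show the placement; it is not the architecture used in the paper.

import torch
import torch.nn as nn

class TinySRNet(nn.Module):
    # Minimal SR-style network with dropout near the end, per the finding
    # that late dropout helps generalization in multi-degradation settings.
    def __init__(self, channels=64, p=0.1, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.dropout = nn.Dropout2d(p)              # just before reconstruction
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        return self.tail(self.dropout(self.body(x)))

lr = torch.randn(1, 3, 32, 32)
print(TinySRNet()(lr).shape)    # torch.Size([1, 3, 64, 64])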

【2】 Exploring Inter-frequency Guidance of Image for Lightweight Gaussian Denoising
Link: https://arxiv.org/abs/2112.11779

Authors: Zhuang Jia
Affiliations: Xiaomi Camera
Abstract: Image denoising is of vital importance in many imaging and computer vision related areas. With convolutional neural networks showing strong capability in computer vision tasks, the performance of image denoising has also been improved by CNN-based methods. Though CNN-based image denoisers show promising results on this task, most current CNN-based methods try to learn the mapping from a noisy image to a clean image directly, lacking explicit exploration of prior knowledge of images and noise. Natural images are observed to obey the reciprocal power law, implying that the low-frequency band of an image tends to occupy most of the energy. Thus, under AGWN (additive Gaussian white noise) deterioration, the low-frequency band tends to preserve a higher PSNR than the high-frequency band. Considering the spatial morphological consistency of different frequency bands, the low-frequency band, with more fidelity, can be used as a guide to refine the more contaminated high-frequency bands. Based on this idea, we propose a novel network architecture denoted IGNet, which refines the frequency bands from low to high in a progressive manner. It first decomposes the feature maps into high- and low-frequency subbands using the DWT (discrete wavelet transform) iteratively, and then the low-band features are used to refine the high-band features. Finally, the refined feature maps are processed by a decoder to recover the clean result. With this design, more inter-frequency priors and information are utilized, and thus the model size can be reduced while still preserving competitive results. Experiments on several public datasets show that our model obtains competitive performance compared with other state-of-the-art methods while having a lightweight structure.
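The decomposition step can be reproduced with an off-the-shelf DWT; the sketch below splits an image into low and high bands and applies a hand-crafted "low guides high" attenuation. IGNet learns this refinement with convolutions, so the guidance rule here is purely illustrative.

import numpy as np
import pywt

def split_bands(img):
    # One DWT level: returns the low band and the three high bands.
    cA, (cH, cV, cD) = pywt.dwt2(img, 'haar')
    return cA, (cH, cV, cD)

def refine_high_with_low(low, high, alpha=0.1):
    # Toy refinement: attenuate high-band coefficients where the low band
    # suggests the region is flat (likely noise-dominated).
    guide = np.abs(low - low.mean())
    guide = guide / (guide.max() + 1e-8)          # near 1 at edges, near 0 in flats
    return tuple(h * (alpha + (1 - alpha) * guide) for h in high)

img = np.random.rand(64, 64)
low, high = split_bands(img)
den_high = refine_high_with_low(low, high)
recon = pywt.idwt2((low, den_high), 'haar')
print(recon.shape)   # (64, 64)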

Point Cloud | SLAM | Radar | LiDAR | Depth & RGB-D (2 papers)

【1】 NICE-SLAM: Neural Implicit Scalable Encoding for SLAM
Link: https://arxiv.org/abs/2112.12130

Authors: Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, Marc Pollefeys
Affiliations: ETH Zurich; State Key Lab of CAD&CG, Zhejiang University; MPI for Intelligent Systems, Tübingen; University of Amsterdam; Microsoft
Notes: Project page: this https URL
Abstract: Neural implicit representations have recently shown encouraging results in various domains, including promising progress in simultaneous localization and mapping (SLAM). Nevertheless, existing methods produce over-smoothed scene reconstructions and have difficulty scaling up to large scenes. These limitations are mainly due to their simple fully-connected network architecture that does not incorporate local information in the observations. In this paper, we present NICE-SLAM, a dense SLAM system that incorporates multi-level local information by introducing a hierarchical scene representation. Optimizing this representation with pre-trained geometric priors enables detailed reconstruction on large indoor scenes. Compared to recent neural implicit SLAM systems, our approach is more scalable, efficient, and robust. Experiments on five challenging datasets demonstrate competitive results of NICE-SLAM in both mapping and tracking quality.

【2】 Multi-View Partial (MVP) Point Cloud Challenge 2021 on Completion and Registration: Methods and Results
Link: https://arxiv.org/abs/2112.12053

Authors: Liang Pan, Tong Wu, Zhongang Cai, Ziwei Liu, Xumin Yu, Yongming Rao, Jiwen Lu, Jie Zhou, Mingye Xu, Xiaoyuan Luo, Kexue Fu, Peng Gao, Manning Wang, Yali Wang, Yu Qiao, Junsheng Zhou, Xin Wen, Peng Xiang, Yu-Shen Liu, Zhizhong Han, Yuanjie Yan, Junyi An, Lifa Zhu, Changwei Lin, Dongrui Liu, Xin Li, Francisco Gómez-Fernández, Qinlong Wang, Yang Yang
Affiliations: S-Lab, Nanyang Technological University; SenseTime-CUHK Joint Lab, The Chinese University of Hong Kong; SenseTime Research; Shanghai AI Laboratory; Department of Automation, Tsinghua University; University of Chinese Academy of Sciences
Notes: 15 pages, 13 figures, ICCV 2021 Workshop Technique Report, the codebase webpage: this https URL
Abstract: As real-scanned point clouds are mostly partial due to occlusions and viewpoints, reconstructing complete 3D shapes based on incomplete observations becomes a fundamental problem for computer vision. With a single incomplete point cloud, this becomes the partial point cloud completion problem. Given multiple different observations, 3D reconstruction can be addressed by performing partial-to-partial point cloud registration. Recently, a large-scale Multi-View Partial (MVP) point cloud dataset has been released, which consists of over 100,000 high-quality virtual-scanned partial point clouds. Based on the MVP dataset, this paper reports methods and results from the Multi-View Partial Point Cloud Challenge 2021 on Completion and Registration. In total, 128 participants registered for the competition, and 31 teams made valid submissions. The top-ranked solutions are analyzed, and future research directions are discussed.

Other Neural Networks | Deep Learning | Models | Modeling (4 papers)

【1】 Can Deep Neural Networks be Converted to Ultra Low-Latency Spiking Neural Networks?
Link: https://arxiv.org/abs/2112.12133

Authors: Gourav Datta, Peter A. Beerel
Affiliations: Ming Hsieh Dept. of Electrical and Computer Engineering, University of Southern California, Los Angeles, USA
Notes: Accepted to DATE 2022
Abstract: Spiking neural networks (SNNs), which operate via binary spikes distributed over time, have emerged as a promising energy-efficient ML paradigm for resource-constrained devices. However, the current state-of-the-art (SOTA) SNNs require multiple time steps for acceptable inference accuracy, increasing spiking activity and, consequently, energy consumption. SOTA training strategies for SNNs involve conversion from a non-spiking deep neural network (DNN). In this paper, we determine that SOTA conversion strategies cannot yield ultra low latency because they incorrectly assume that the DNN and SNN pre-activation values are uniformly distributed. We propose a new training algorithm that accurately captures these distributions, minimizing the error between the DNN and the converted SNN. The resulting SNNs have ultra low latency and high activation sparsity, yielding significant improvements in compute efficiency. In particular, we evaluate our framework on image recognition tasks from the CIFAR-10 and CIFAR-100 datasets on several VGG and ResNet architectures. We obtain a top-1 accuracy of 64.19% with only 2 time steps on the CIFAR-100 dataset with ~159.2x lower compute energy compared to an iso-architecture standard DNN. Compared to other SOTA SNN models, our models perform inference 2.5-8x faster (i.e., with fewer time steps).
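For context, classic conversion pipelines set per-layer firing thresholds from observed pre-activation statistics, often a high percentile rather than the maximum. The sketch below gathers such statistics with forward pre-hooks; the paper's contribution is a training algorithm that models the full distributions, which is not reproduced here.

import torch
import torch.nn as nn

@torch.no_grad()
def calibrate_thresholds(model, loader, percentile=99.7):
    # Collect pre-activation values at every ReLU and derive a firing
    # threshold from a high percentile (a common conversion heuristic).
    stats, hooks = {}, []
    for name, m in model.named_modules():
        if isinstance(m, nn.ReLU):
            stats[name] = []
            hooks.append(m.register_forward_pre_hook(
                lambda mod, inp, key=name: stats[key].append(inp[0].flatten())))
    for x, _ in loader:
        model(x)
    for h in hooks:
        h.remove()
    return {k: torch.quantile(torch.cat(v), percentile / 100).item()
            for k, v in stats.items()}

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5), nn.ReLU())
loader = [(torch.randn(16, 10), None) for _ in range(4)]
print(calibrate_thresholds(model, loader))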

【2】 Deeper Learning with CoLU Activation
Link: https://arxiv.org/abs/2112.12078

Authors: Advait Vagerwal
Notes: 7 pages, 4 figures, 4 tables
Abstract: In neural networks, non-linearity is introduced by activation functions. One commonly used activation function is the Rectified Linear Unit (ReLU). ReLU has been a popular choice as an activation but has flaws. State-of-the-art functions like Swish and Mish are now gaining attention as better choices, as they combat many flaws presented by other activation functions. CoLU is an activation function similar to Swish and Mish in its properties. It is defined as f(x) = x/(1 - x*e^(-(x + e^x))). It is smooth, continuously differentiable, unbounded above, bounded below, non-saturating, and non-monotonic. Based on experiments comparing CoLU with different activation functions, it is observed that CoLU usually performs better than other functions on deeper neural networks. While training different neural networks on MNIST with an incrementally increasing number of convolutional layers, CoLU retained the highest accuracy for more layers. On a smaller network with 8 convolutional layers, CoLU had the highest mean accuracy, closely followed by ReLU. On VGG-13 trained on Fashion-MNIST, CoLU had 4.20% higher accuracy than Mish and 3.31% higher accuracy than ReLU. On ResNet-9 trained on CIFAR-10, CoLU had 0.05% higher accuracy than Swish, 0.09% higher accuracy than Mish, and 0.29% higher accuracy than ReLU. It is observed that activation functions may behave better than others depending on different factors, including the number of layers, types of layers, number of parameters, learning rate, and optimizer. Further research can be done on these factors and activation functions to obtain more optimal activation functions and more knowledge about their behavior.
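The activation itself is a one-liner, taking the definition straight from the abstract:

import torch

def colu(x: torch.Tensor) -> torch.Tensor:
    # CoLU: f(x) = x / (1 - x * exp(-(x + e^x)))
    return x / (1 - x * torch.exp(-(x + torch.exp(x))))

x = torch.linspace(-4, 4, 9)
print(colu(x))                    # smooth, non-monotonic, bounded below
print(colu(torch.tensor(0.0)))    # 0 at the origin, like Swish and Mish

For large positive x the correction term vanishes and colu(x) approaches x, while for large negative x the output decays to 0 from below, matching the bounded-below, non-saturating behavior described above.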

【3】 Deep Models for Visual Sentiment Analysis of Disaster-related Multimedia Content
Link: https://arxiv.org/abs/2112.12060

Authors: Khubaib Ahmad, Muhammad Asif Ayub, Kashif Ahmad, Ala Al-Fuqaha, Nasir Ahmad
Affiliations: Department of Computer Systems Engineering, University of Engineering and Technology, Peshawar, Pakistan; Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar
Notes: 3 pages
Abstract: This paper presents a solution for the MediaEval 2021 task "Visual Sentiment Analysis: A Natural Disaster Use-case". The task aims to extract and classify sentiments perceived by viewers and the emotional message conveyed by natural-disaster-related images shared on social media. The task is composed of three subtasks: one single-label multi-class image classification task and two multi-label multi-class image classification tasks with different sets of labels. In our proposed solutions, we rely mainly on two different state-of-the-art models, namely Inception-v3 and VggNet-19, pre-trained on ImageNet, which are fine-tuned for each of the three tasks using different strategies. Overall, encouraging results are obtained on all three tasks. On the single-label classification task (i.e., Task 1), we obtained weighted average F1-scores of 0.540 and 0.526 for the Inception-v3 and VggNet-19 based solutions, respectively. On the multi-label classification tasks, i.e., Task 2 and Task 3, the weighted F1-scores of our Inception-v3 based solutions were 0.572 and 0.516, respectively. Similarly, the weighted F1-scores of our VggNet-19 based solution on Task 2 and Task 3 were 0.584 and 0.495, respectively.

【4】 Learning and Crafting for the Wide Multiple Baseline Stereo
Link: https://arxiv.org/abs/2112.12027

Authors: Dmytro Mishkin
Affiliations: Thesis Advisor: prof. Jiří Matas
Notes: After-defence version with additional fixes based on reviewer comments. 144 pages
Abstract: This thesis introduces the wide multiple baseline stereo (WxBS) problem. WxBS, a generalization of the standard wide baseline stereo problem, considers the matching of images that simultaneously differ in more than one image acquisition factor such as viewpoint, illumination, sensor type, or where object appearance changes significantly, e.g., over time. A new dataset with ground truth, an evaluation metric, and baselines has been introduced. The thesis presents the following improvements of the WxBS pipeline. (i) A loss function, called HardNeg, for learning a local image descriptor that relies on hard negative mining within a mini-batch and on the maximization of the distance between the closest positive and the closest negative patches. (ii) The descriptor trained with the HardNeg loss, called HardNet, is compact and shows state-of-the-art performance in standard matching, patch verification and retrieval benchmarks. (iii) A method for learning the affine shape, orientation, and potentially other parameters related to geometric and appearance properties of local features. (iv) A tentative correspondence generation strategy which generalizes the standard first-to-second closest distance ratio is presented. The selection strategy, which shows performance superior to the standard method, is applicable to either hand-engineered descriptors like SIFT, LIOP, and MROGH or deeply learned ones like HardNet. (v) A feedback loop is introduced for the two-view matching problem, resulting in the MODS (matching with on-demand view synthesis) algorithm. MODS handles a viewing angle difference even larger than the previous state-of-the-art ASIFT algorithm, without a significant increase of computational cost over "standard" wide and narrow baseline approaches. Last but not least, a comprehensive benchmark for local features and robust estimation algorithms is introduced.
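The HardNeg idea, mining the hardest in-batch negative for each matching descriptor pair and maximizing the margin between the closest positive and the closest negative, can be sketched compactly. This is a reading of the loss as described above; implementation details in the thesis may differ.

import torch
import torch.nn.functional as F

def hardneg_loss(anchors, positives, margin=1.0):
    # anchors, positives: (N, D) descriptors where row i of each is a match.
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    d = torch.cdist(a, p)                        # (N, N) pairwise distances
    pos = d.diag()                               # distances of matching pairs
    # mask out the diagonal, then take the hardest (closest) negative per pair
    eye = torch.eye(len(d), dtype=torch.bool, device=d.device)
    neg = d.masked_fill(eye, float('inf'))
    hardest = torch.minimum(neg.min(dim=1).values, neg.min(dim=0).values)
    return F.relu(margin + pos - hardest).mean()

a = torch.randn(32, 128, requires_grad=True)
p = a + 0.1 * torch.randn(32, 128)               # noisy positives
print(hardneg_loss(a, p))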

Others (8 papers)

【1】 Input-Specific Robustness Certification for Randomized Smoothing
Link: https://arxiv.org/abs/2112.12084

Authors: Ruoxin Chen, Jie Li, Junchi Yan, Ping Li, Bin Sheng
Affiliations: Shanghai Jiao Tong University; The Hong Kong Polytechnic University
Notes: Accepted by AAAI 2022
Abstract: Although randomized smoothing has demonstrated high certified robustness and superior scalability to other certified defenses, the high computational overhead of the robustness certification bottlenecks its practical applicability, as it depends heavily on a large-sample approximation for estimating the confidence interval. In existing works, the sample size for the confidence interval is universally set and agnostic to the input for prediction. This Input-Agnostic Sampling (IAS) scheme may yield a poor Average Certified Radius (ACR)-runtime trade-off, which calls for improvement. In this paper, we propose Input-Specific Sampling (ISS) acceleration to achieve cost-effective robustness certification by adaptively reducing the sampling size based on the input characteristics. Furthermore, our method universally controls the certified radius decline caused by the ISS sample-size reduction. The empirical results on CIFAR-10 and ImageNet show that ISS can speed up certification by more than three times at a limited cost of 0.05 certified radius. Meanwhile, ISS surpasses IAS on the average certified radius across extensive hyperparameter settings. Specifically, ISS achieves ACR=0.958 on ImageNet (σ=1.0) in 250 minutes, compared to ACR=0.917 by IAS under the same condition. We release our code at https://github.com/roy-ch/Input-Specific-Certification.
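Background for the trade-off: in Cohen-style randomized smoothing the certified radius is sigma * Phi^{-1}(pA_lower), where pA_lower is a lower confidence bound on the top-class probability estimated from n noisy samples, so the attainable radius depends on both the input and n. The sketch below computes that radius; it only illustrates why the sample size matters per input and is not the ISS algorithm itself.

from scipy.stats import norm, binomtest

def certified_radius(k, n, sigma=0.5, alpha=0.001):
    # k successes (top class) out of n noisy samples; a two-sided exact CI at
    # level 1 - 2*alpha gives a one-sided lower bound at level 1 - alpha.
    p_lower = binomtest(k, n).proportion_ci(confidence_level=1 - 2 * alpha,
                                            method='exact').low
    if p_lower <= 0.5:
        return 0.0                                 # abstain: no certificate
    return sigma * norm.ppf(p_lower)

# the same empirical success rate certified from different sample sizes:
print(certified_radius(990, 1000))   # larger n, tighter bound, larger radius
print(certified_radius(99, 100))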

【2】 Geodesic squared exponential kernel for non-rigid shape registration
Link: https://arxiv.org/abs/2112.11853

Authors: Florent Jousse, Xavier Pennec, Hervé Delingette, Matilde Gonzalez
Affiliations: Université Côte d'Azur, INRIA, EPIONE team, Sophia-Antipolis, France; Qc Labs Department, QuantifiCare, Sophia-Antipolis, France
Notes: 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021) Proceedings, Dec 2021, Jodhpur, India
Abstract: This work addresses the problem of non-rigid registration of 3D scans, which is at the core of shape modeling techniques. Firstly, we propose a new kernel based on geodesic distances for the Gaussian Process Morphable Models (GPMMs) framework. The use of geodesic distances in the kernel makes it better adapted to the topological and geometric characteristics of the surface and leads to more realistic deformations around holes and curved areas. Since the kernel possesses hyperparameters, we have optimized them for the task of face registration on the FaceWarehouse dataset. We show that the geodesic squared exponential kernel performs significantly better than state-of-the-art kernels for the task of face registration on all 20 expressions of the FaceWarehouse dataset. Secondly, we propose a modification of the loss function used in the non-rigid ICP registration algorithm that allows correspondences to be weighted according to the confidence assigned to them. As a use case, we show that we can make the registration more robust to outliers in the 3D scans, such as non-skin parts.
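The kernel itself is the standard squared exponential form evaluated on geodesic rather than Euclidean distances: k(x, y) = sigma^2 * exp(-d_geo(x, y)^2 / (2 * l^2)). A minimal sketch, assuming the geodesic distance matrix is precomputed (e.g. by a heat-method or Dijkstra solver); the hyperparameter values are placeholders.

import numpy as np

def geodesic_se_kernel(geo_dist, sigma=1.0, length=0.05):
    # geo_dist: (N, N) matrix of on-surface geodesic distances.
    return sigma ** 2 * np.exp(-geo_dist ** 2 / (2 * length ** 2))

# toy 1-D "surface": geodesic distance along a curve parameterized by t
t = np.linspace(0, 1, 5)
geo = np.abs(t[:, None] - t[None, :])
print(geodesic_se_kernel(geo).round(3))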

【3】 Binary Image Skeletonization Using 2-Stage U-Net
Link: https://arxiv.org/abs/2112.11824

Authors: Mohamed A. Ghanem, Alaa A. Anani
Affiliations: The American University in Cairo, Computer Science and Engineering Department
Notes: Computer Vision Course Project [AUC, Spring 21]
Abstract: Object skeletonization is the process of extracting skeletal, line-like representations of shapes. It provides a very useful tool for geometric shape understanding and minimal shape representation. It also has a wide variety of applications, most notably in anatomical research and activity detection. Several mathematical algorithmic approaches have been developed to solve this problem, and some of them have proven quite robust. However, less attention has been devoted to deep learning solutions. In this paper, we use a 2-stage variant of the well-known U-Net architecture to split the problem space into two sub-problems: shape minimization and corrective skeleton thinning. Our model produces results that are visually much better than the baseline SkelNetOn model. We propose a new metric, M-CCORR, based on normalized correlation coefficients, as an alternative to F1 for this challenge, as it solves the problem of class imbalance and manages to recognize skeleton similarity without suffering from F1's over-sensitivity to pixel shifts.
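One plausible reading of a shift-tolerant normalized-correlation metric is sketched below: take the best Pearson correlation between the prediction and the ground truth over small integer shifts, so a one-pixel offset is not punished the way F1 punishes it. This is an illustrative interpretation, not the authors' exact M-CCORR definition.

import numpy as np

def shift_tolerant_ccorr(pred, gt, max_shift=2):
    # pred, gt: (H, W) binary skeleton masks.
    best = -1.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(pred, dy, axis=0), dx, axis=1)
            a = shifted - shifted.mean()
            b = gt - gt.mean()
            denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-8
            best = max(best, float((a * b).sum() / denom))
    return best

gt = np.zeros((32, 32)); gt[16, 4:28] = 1            # a horizontal skeleton
pred = np.roll(gt, 1, axis=0)                        # same skeleton, shifted 1px
print(shift_tolerant_ccorr(pred, gt))                # ~1.0 despite the shift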

【4】 Class-aware Sounding Objects Localization via Audiovisual Correspondence
Link: https://arxiv.org/abs/2112.11749

Authors: Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen
Affiliations: The Chinese University of Hong Kong
Notes: accepted by TPAMI 2021. Code: this https URL
Abstract: Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding object localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in single-source cases. Then visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as supervision to achieve fine-grained audio and sounding-object distribution alignment. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network to the unsupervised object detection task, obtaining reasonable performance.

【5】 Simple and Effective Balance of Contrastive Losses
Link: https://arxiv.org/abs/2112.11743

Authors: Arnaud Sors, Rafael Sampaio de Rezende, Sarah Ibrahimi, Jean-Marc Andreoli
Affiliations: NAVER LABS Europe; University of Amsterdam
Notes: 15 pages, 10 figures
Abstract: Contrastive losses have long been a key ingredient of deep metric learning and are now becoming more popular due to the success of self-supervised learning. Recent research has shown the benefit of decomposing such losses into two sub-losses which act in a complementary way when learning the representation network: a positive term and an entropy term. Although the overall loss is thus defined as a combination of two terms, the balance of these two terms is often hidden behind implementation details and is largely ignored and sub-optimal in practice. In this work, we approach the balance of contrastive losses as a hyper-parameter optimization problem, and propose a coordinate descent-based search method that efficiently finds the hyper-parameters that optimize evaluation performance. In the process, we extend existing balance analyses to the contrastive margin loss, include batch size in the balance, and explain how to aggregate loss elements from the batch to maintain near-optimal performance over a larger range of batch sizes. Extensive experiments with benchmarks from deep metric learning and self-supervised learning show that optimal hyper-parameters are found faster with our method than with other common search methods.
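The search strategy can be made concrete with a generic coordinate descent: optimize one hyper-parameter at a time over a small grid while holding the others fixed, and sweep until no coordinate improves. A minimal sketch; the hyper-parameter names and the toy objective below are assumptions standing in for a full train/validate run.

import numpy as np

def coordinate_descent_search(evaluate, grids, n_rounds=2):
    # grids: dict mapping hyper-parameter name -> list of candidate values.
    current = {k: v[0] for k, v in grids.items()}
    best = evaluate(current)
    for _ in range(n_rounds):
        for key, values in grids.items():
            for v in values:
                cand = dict(current, **{key: v})
                score = evaluate(cand)
                if score > best:
                    best, current = score, cand
    return current, best

# toy objective standing in for validation accuracy
evaluate = lambda h: -(h['balance'] - 0.3) ** 2 - (np.log10(h['temperature']) + 1) ** 2
grids = {'balance': [0.1, 0.3, 0.5, 0.7], 'temperature': [0.01, 0.1, 1.0]}
print(coordinate_descent_search(evaluate, grids))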

【6】 Entropy Regularized Iterative Weighted Shrinkage-Thresholding Algorithm (ERIWSTA): An Application to CT Image Restoration
Link: https://arxiv.org/abs/2112.11706

Authors: Bingxue Wu, Jiao Wei, Chen Li, Yudong Yao, Yueyang Teng
Affiliations: College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China; Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang, China
Abstract: The iterative weighted shrinkage-thresholding algorithm (IWSTA), which treats attributes differently, has shown superiority over the classic unweighted iterative shrinkage-thresholding algorithm (ISTA) for solving linear inverse problems. This paper proposes a new entropy-regularized IWSTA (ERIWSTA) that adds an entropy regularizer to the cost function to measure the uncertainty of the weights, encouraging attributes to participate in problem solving. The weights are then solved with a Lagrange multiplier method to obtain a simple iterative update. The weights can be explained as the probability of an attribute's contribution to the problem solution. Experimental results on CT image restoration show that the proposed method has better performance in terms of convergence speed and restoration accuracy than existing methods.
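A heavily hedged sketch of what such an update could look like: minimizing a weighted L1 term plus an entropy regularizer over the probability simplex gives softmax-form weights, which are then plugged into a standard ISTA step. The specific cost, step sizes, and weight formula below are plausible assumptions for illustration, not the authors' exact algorithm.

import numpy as np

def soft(x, t):
    # element-wise soft-thresholding, the shrinkage operator of ISTA
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def weighted_ista_sketch(A, b, lam=0.1, mu=1.0, n_iter=200):
    n = A.shape[1]
    x = np.zeros(n)
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    w = np.full(n, 1.0 / n)                # uniform initial weights
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = soft(x - grad / L, lam * n * w / L)   # per-attribute thresholds
        # entropy-regularized weights: softmax over negative magnitudes,
        # interpretable as attribute-contribution probabilities (assumption)
        e = np.exp(-np.abs(x) / mu)
        w = e / e.sum()
    return x

A = np.random.randn(40, 100)
x_true = np.zeros(100); x_true[[3, 30, 77]] = [1.0, -2.0, 1.5]
b = A @ x_true + 0.01 * np.random.randn(40)
print(np.round(weighted_ista_sketch(A, b), 2).nonzero()[0])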

【7】 Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types
Link: https://arxiv.org/abs/2112.11573

Authors: Kihyuk Sohn, Jinsung Yoon, Chun-Liang Li, Chen-Yu Lee, Tomas Pfister
Affiliations: Google Cloud AI Research
Notes: Tech report
Abstract: We introduce anomaly clustering, whose goal is to group data into semantically coherent clusters of anomaly types. This is different from anomaly detection, whose goal is to divide anomalies from normal data. Unlike object-centered image clustering applications, anomaly clustering is particularly challenging as anomalous patterns are subtle and local. We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods. We define a distance function between images, each of which is represented as a bag of embeddings, by the Euclidean distance between weighted averaged embeddings. The weight defines the importance of instances (i.e., patch embeddings) in the bag, which may highlight defective regions. We compute weights in an unsupervised way, or in a semi-supervised way if labeled normal data is available. Extensive experimental studies show the effectiveness of the proposed clustering framework along with a novel distance function upon existing multiple-instance or deep clustering frameworks. Overall, our framework achieves 0.451 and 0.674 normalized mutual information scores on MVTec object and texture categories and improves further with a few labeled normal samples (0.577, 0.669), far exceeding the baselines (0.244, 0.273) and state-of-the-art deep clustering methods (0.176, 0.277).
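The distance function described above is easy to write down: each image is a bag of patch embeddings, and the distance between two images is the Euclidean distance between their weighted average embeddings. The unsupervised weighting heuristic below (upweight patches far from an average "normal" embedding) is an assumption standing in for one of several options.

import numpy as np

def bag_distance(emb_a, w_a, emb_b, w_b):
    # emb_*: (P, D) patch embeddings; w_*: (P,) weights summing to 1.
    mean_a = (w_a[:, None] * emb_a).sum(0)
    mean_b = (w_b[:, None] * emb_b).sum(0)
    return float(np.linalg.norm(mean_a - mean_b))

def unsup_weights(emb, normal_mean):
    # Patches far from an average "normal" embedding get more weight.
    d = np.linalg.norm(emb - normal_mean, axis=1)
    w = np.exp(d - d.max())
    return w / w.sum()

emb_a = np.random.randn(196, 512); emb_b = np.random.randn(196, 512)
mu = np.zeros(512)                    # stand-in for the normal-data mean
print(bag_distance(emb_a, unsup_weights(emb_a, mu),
                   emb_b, unsup_weights(emb_b, mu)))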

【8】 Decompose the Sounds and Pixels, Recompose the Events
Link: https://arxiv.org/abs/2112.11547

Authors: Varshanth R. Rao, Md Ibrahim Khalil, Haoda Li, Peng Dai, Juwei Lu
Affiliations: Huawei Noah's Ark Lab; University of Waterloo, Canada; University of Toronto, Canada
Notes: Accepted at AAAI 2022
Abstract: In this paper, we propose a framework centering around a novel architecture called the Event Decomposition Recomposition Network (EDRNet) to tackle the Audio-Visual Event (AVE) localization problem in the supervised and weakly supervised settings. AVEs in the real world exhibit common unravelling patterns (termed Event Progress Checkpoints (EPCs)), which humans can perceive through the cooperation of their auditory and visual senses. Unlike earlier methods which attempt to recognize entire event sequences, EDRNet models EPCs and inter-EPC relationships using stacked temporal convolutions. Based on the postulation that EPC representations are theoretically consistent for an event category, we introduce State Machine Based Video Fusion, a novel augmentation technique that blends source videos using different EPC template sequences. Additionally, we design a new loss function called the Land-Shore-Sea loss to compactify continuous foreground and background representations. Lastly, to alleviate the issue of confusing events during weak supervision, we propose a prediction stabilization method called Bag-to-Instance Label Correction. Experiments on the AVE dataset show that our collective framework outperforms the state-of-the-art by a sizable margin.

