Computer Vision and Pattern Recognition Paper Digest [11.10]

2021-11-17 10:59:26

cs.CV: 56 papers today

Transformer (3 papers)

【1】 Sliced Recursive Transformer
Link: https://arxiv.org/abs/2111.05297

Authors: Zhiqiang Shen, Zechun Liu, Eric Xing
Affiliations: CMU; MBZUAI
Note: Code and models are available at this https URL
Abstract: We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without involving additional parameters. This is achieved by sharing weights across the depth of transformer networks. The proposed method can obtain a substantial gain (~2%) simply using a naïve recursive operation, requires no special or sophisticated knowledge for designing principles of networks, and introduces minimum computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining the superior accuracy, we propose an approximating method through multiple sliced group self-attentions across recursive layers which can reduce the cost consumption by 10~30% with minimal performance loss. We call our model Sliced Recursive Transformer (SReT), which is compatible with a broad range of other designs for efficient vision transformers. Our best model establishes significant improvement on ImageNet over state-of-the-art methods while containing fewer parameters. The proposed sliced recursive operation allows us to build a transformer with more than 100 or even 1000 layers effortlessly under a still small size (13~15M), to avoid difficulties in optimization when the model size is too large. The flexible scalability has shown great potential for scaling up and constructing extremely deep and large-dimensionality vision transformers. Our code and models are available at https://github.com/szq0214/SReT.
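The naive recursive operation at the heart of SReT amounts to reusing one transformer block's weights several times along the depth axis, so the effective depth grows while the parameter count stays fixed. Below is a minimal PyTorch sketch of that weight-sharing idea, not the authors' released code; the module name, dimensions, and recursion count are illustrative, and the sliced group self-attention approximation is omitted.

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """Apply one shared transformer block `num_recursions` times:
    depth increases without adding any new parameters."""
    def __init__(self, dim=384, heads=6, num_recursions=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.num_recursions = num_recursions

    def forward(self, x):
        for _ in range(self.num_recursions):
            x = self.block(x)  # same weights reused at every recursion step
        return x

tokens = torch.randn(2, 197, 384)        # (batch, patches + cls token, dim)
print(RecursiveEncoder()(tokens).shape)  # torch.Size([2, 197, 384])
```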

【2】 Lymph Node Detection in T2 MRI with Transformers
Link: https://arxiv.org/abs/2111.04885

Authors: Tejas Sudharshan Mathai, Sungwon Lee, Daniel C. Elton, Thomas C. Shen, Yifan Peng, Zhiyong Lu, Ronald M. Summers
Affiliations: Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Radiology and Imaging Sciences, Clinical Center, National Institutes of Health, Bethesda MD, USA; National Center for Biotechnology Information, National Library of Medicine, National
Note: Accepted at SPIE 2022
Abstract: Identification of lymph nodes (LN) in T2 Magnetic Resonance Imaging (MRI) is an important step performed by radiologists during the assessment of lymphoproliferative diseases. The size of the nodes plays a crucial role in their staging, and radiologists sometimes use an additional contrast sequence such as diffusion weighted imaging (DWI) for confirmation. However, lymph nodes have diverse appearances in T2 MRI scans, making it tough to stage for metastasis. Furthermore, radiologists often miss smaller metastatic lymph nodes over the course of a busy day. To deal with these issues, we propose to use the DEtection TRansformer (DETR) network to localize suspicious metastatic lymph nodes for staging in challenging T2 MRI scans acquired by different scanners and exam protocols. False positives (FP) were reduced through a bounding box fusion technique, and a precision of 65.41% and sensitivity of 91.66% at 4 FP per image was achieved. To the best of our knowledge, our results improve upon the current state-of-the-art for lymph node detection in T2 MRI scans.
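The false-positive reduction hinges on fusing overlapping boxes from the detector; the abstract does not spell out the procedure, so the sketch below shows one generic variant, confidence-weighted fusion of IoU-overlapping boxes, in NumPy. The threshold, greedy clustering rule, and score aggregation are assumptions, not the paper's exact method.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_boxes(boxes, scores, thr=0.5):
    """Greedily cluster boxes with IoU > thr, then replace each cluster by a
    confidence-weighted average box so duplicates collapse into one detection."""
    clusters = []  # each entry: ([boxes], [scores]), seeded by highest score
    for i in np.argsort(scores)[::-1]:
        for cl in clusters:
            if iou(boxes[i], cl[0][0]) > thr:
                cl[0].append(boxes[i]); cl[1].append(scores[i])
                break
        else:
            clusters.append(([boxes[i]], [scores[i]]))
    return [(np.average(np.asarray(bs), axis=0, weights=np.asarray(ss)),
             float(np.mean(ss))) for bs, ss in clusters]

boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [80, 80, 120, 120]], float)
scores = np.array([0.9, 0.6, 0.8])
for box, score in fuse_boxes(boxes, scores):
    print(box.round(1), round(score, 2))  # two overlapping boxes are merged
```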

【3】 Mixed Transformer U-Net For Medical Image Segmentation
Link: https://arxiv.org/abs/2111.04734

Authors: Hongyi Wang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, Xian-Hua Han, Yen-Wei Chen, Ruofeng Tong
Affiliations: College of Computer Science and Technology, Zhejiang University, China; College of Information Science and Engineering, Ritsumeikan University, Japan; Artificial Intelligence Research Center, Yamaguchi University, Japan
Abstract: Though U-Net has achieved tremendous success in medical image segmentation tasks, it lacks the ability to explicitly model long-range dependencies. Therefore, Vision Transformers have emerged as alternative segmentation structures recently, for their innate ability of capturing long-range correlations through Self-Attention (SA). However, Transformers usually rely on large-scale pre-training and have high computational complexity. Furthermore, SA can only model self-affinities within a single sample, ignoring the potential correlations of the overall dataset. To address these problems, we propose a novel Transformer module named Mixed Transformer Module (MTM) for simultaneous inter- and intra-affinity learning. MTM first calculates self-affinities efficiently through our well-designed Local-Global Gaussian-Weighted Self-Attention (LGG-SA). Then, it mines inter-connections between data samples through External Attention (EA). By using MTM, we construct a U-shaped model named Mixed Transformer U-Net (MT-UNet) for accurate medical image segmentation. We test our method on two different public datasets, and the experimental results show that the proposed method achieves better performance over other state-of-the-art methods. The code is available at: https://github.com/Dootmaan/MT-UNet.
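The External Attention (EA) component follows the published external-attention formulation: every token attends to two small learnable memory units shared across the whole dataset, with a softmax followed by an l1 "double normalization". A minimal PyTorch sketch is below; the memory size is an assumption, and MTM's LGG-SA branch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """External attention: queries hit a learnable key memory M_k, and the
    normalized attention map then reads a learnable value memory M_v. The
    memories are shared by all samples, so inter-sample correlations can
    be captured at cost linear in the number of tokens."""
    def __init__(self, dim, mem_size=64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)  # key memory M_k
        self.mv = nn.Linear(mem_size, dim, bias=False)  # value memory M_v

    def forward(self, x):                 # x: (B, N, dim)
        attn = self.mk(x)                 # (B, N, mem_size)
        attn = F.softmax(attn, dim=1)     # normalize over the token axis
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # l1 double norm
        return self.mv(attn)              # (B, N, dim)

x = torch.randn(2, 196, 256)
print(ExternalAttention(256)(x).shape)    # torch.Size([2, 196, 256])
```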

Detection (8 papers)

【1】 Does Thermal data make the detection systems more reliable?
Link: https://arxiv.org/abs/2111.05191

Authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani
Affiliations: Advanced Research Lab, NavInfo Europe, The Netherlands
Note: Accepted at NeurIPS 2021 - ML4AD workshop (The code for this research is available at: this https URL)
Abstract: Deep learning-based detection networks have made remarkable progress in autonomous driving systems (ADS). ADS should have reliable performance across a variety of ambient lighting and adverse weather conditions. However, luminance degradation and visual obstructions (such as glare and fog) result in poor quality images from the visual camera, which leads to performance decline. To overcome these challenges, we explore the idea of leveraging a different data modality that is disparate yet complementary to the visual data. We propose a comprehensive detection system based on a multimodal-collaborative framework that learns from both RGB (from visual cameras) and thermal (from infrared cameras) data. This framework trains two networks collaboratively and provides flexibility in learning optimal features of its own modality while also incorporating the complementary knowledge of the other. Our extensive empirical results show that while the improvement in accuracy is nominal, the value lies in challenging and extremely difficult edge cases, which are crucial in safety-critical applications such as AD. We provide a holistic view of both the merits and limitations of using a thermal imaging system in detection.

【2】 Deep Convolution Network Based Emotion Analysis for Automatic Detection of Mild Cognitive Impairment in the Elderly
Link: https://arxiv.org/abs/2111.05066

Authors: Zixiang Fei, Erfu Yang, Leijian Yu, Xia Li, Huiyu Zhou, Wenju Zhou
Affiliations: Department of Design, Manufacturing and Engineering Management, University of Strathclyde, Glasgow, UK; Shanghai Jiaotong University, Shanghai, China; Department of Informatics, University of Leicester, Leicester, UK
Note: 17 pages
Abstract: A significant number of people are suffering from cognitive impairment all over the world. Early detection of cognitive impairment is of great importance to both patients and caregivers. However, existing approaches have their shortcomings, such as the time consumption and financial expenses involved in clinics and the neuroimaging stage. It has been found that patients with cognitive impairment show abnormal emotion patterns. In this paper, we present a novel deep convolution network-based system to detect cognitive impairment through the analysis of the evolution of facial emotions while participants are watching designed video stimuli. In our proposed system, a novel facial expression recognition algorithm is developed using layers from MobileNet and a Support Vector Machine (SVM), which showed satisfactory performance on 3 datasets. To verify the proposed system in detecting cognitive impairment, 61 elderly people, including patients with cognitive impairment and healthy people as a control group, were invited to participate in the experiments, and a dataset was built accordingly. With this dataset, the proposed system has successfully achieved a detection accuracy of 73.3%.
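The recognition head pairs deep features with a classical classifier: layers of MobileNet produce an embedding and an SVM performs the final classification. A hedged sketch with torchvision and scikit-learn follows; the MobileNet variant, cut point, pooling, and SVM kernel are all assumptions, and the toy tensors stand in for cropped face images with emotion labels.

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Frozen MobileNetV2 backbone used purely as a feature extractor.
backbone = models.mobilenet_v2(weights="IMAGENET1K_V1").features.eval()

def embed(images):                            # images: (B, 3, 224, 224) floats
    with torch.no_grad():
        fmap = backbone(images)               # (B, 1280, 7, 7) feature maps
        return fmap.mean(dim=(2, 3)).numpy()  # global average pooling

# Toy stand-ins: real use would embed face crops labeled with expressions.
x_train = embed(torch.randn(32, 3, 224, 224))
y_train = [i % 4 for i in range(32)]          # 4 dummy emotion classes
clf = SVC(kernel="rbf").fit(x_train, y_train)
print(clf.predict(embed(torch.randn(2, 3, 224, 224))))
```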

【3】 Explaining Face Presentation Attack Detection Using Natural Language
Link: https://arxiv.org/abs/2111.04862

Authors: Hengameh Mirzaalian, Mohamed E. Hussein, Leonidas Spinoulas, Jonathan May, Wael Abd-Almageed
Affiliations: University of Southern California, Information Sciences Institute, Marina del Rey, CA, USA
Note: To appear in the Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition 2021
Abstract: A large number of deep neural network based techniques have been developed to address the challenging problem of face presentation attack detection (PAD). Whereas such techniques' focus has been on improving PAD performance in terms of classification accuracy and robustness against unseen attacks and environmental conditions, little attention has been paid to the explainability of PAD predictions. In this paper, we tackle the problem of explaining PAD predictions through natural language. Our approach passes feature representations of a deep layer of the PAD model to a language model to generate text describing the reasoning behind the PAD prediction. Due to the limited amount of annotated data in our study, we apply a lightweight LSTM network as our natural language generation model. We investigate how the quality of the generated explanations is affected by different loss functions, including the commonly used word-wise cross entropy loss, a sentence discriminative loss, and a sentence semantic loss. We perform our experiments using face images from a dataset consisting of 1,105 bona-fide and 924 presentation attack samples. Our quantitative and qualitative results show the effectiveness of our model for generating proper PAD explanations through text as well as the power of the sentence-wise losses. To the best of our knowledge, this is the first introduction of a joint biometrics-NLP task. Our dataset can be obtained through our GitHub page.

【4】 Frustum Fusion: Pseudo-LiDAR and LiDAR Fusion for 3D Detection
Link: https://arxiv.org/abs/2111.04780

Authors: Farzin Negahbani, Onur Berk Töre, Fatma Güney, Baris Akgun
Affiliations: Koc University
Abstract: Most autonomous vehicles are equipped with LiDAR sensors and stereo cameras. The former is very accurate but generates sparse data, whereas the latter is dense and has rich texture and color information, but it is difficult to extract robust 3D representations from. In this paper, we propose a novel data fusion algorithm to combine accurate point clouds with dense but less accurate point clouds obtained from stereo pairs. We develop a framework to integrate this algorithm into various 3D object detection methods. Our framework starts with 2D detections from both of the RGB images, calculates frustums and their intersection, creates Pseudo-LiDAR data from the stereo images, and fills in the parts of the intersection region where the LiDAR data is lacking with the dense Pseudo-LiDAR points. We train multiple 3D object detection methods and show that our fusion strategy consistently improves the performance of detectors.
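Pseudo-LiDAR construction itself is a standard recipe: stereo disparity is converted to depth (depth = focal length × baseline / disparity), and each pixel is back-projected into a 3D point using the camera intrinsics. A NumPy sketch follows; the intrinsics are illustrative KITTI-like values, not the paper's calibration.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map into a 3D point cloud (camera frame).
    `depth` would typically come from stereo: depth = fx * baseline / disparity."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.full((375, 1242), 20.0)   # toy depth map: 20 m everywhere
pts = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(pts.shape)                     # (465750, 3) pseudo-LiDAR points
```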

【5】 Stain-free Detection of Embryo Polarization using Deep Learning
Link: https://arxiv.org/abs/2111.05315

Authors: Cheng Shen, Adiyant Lamba, Meng Zhu, Ray Zhang, Changhuei Yang, Magdalena Zernicka-Goetz
Affiliations: Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA; Mammalian Embryo and Stem Cell Group, Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, UK
Abstract: Polarization of the mammalian embryo at the right developmental time is critical for its development to term and would be valuable in assessing the potential of human embryos. However, tracking polarization requires invasive fluorescence staining, impermissible in the in vitro fertilization clinic. Here, we report the use of artificial intelligence to detect polarization from unstained time-lapse movies of mouse embryos. We assembled a dataset of bright-field movie frames from 8-cell-stage embryos, side-by-side with corresponding images of fluorescent markers of cell polarization. We then used an ensemble learning model to detect whether any bright-field frame showed an embryo before or after onset of polarization. Our resulting model has an accuracy of 85% for detecting polarization, significantly outperforming human volunteers trained on the same data (61% accuracy). We discovered that our self-learning model focuses upon the angle between cells as one known cue for compaction, which precedes polarization, but it outperforms the use of this cue alone. By compressing three-dimensional time-lapse image data into two dimensions, we are able to reduce the data to an easily manageable size for deep learning processing. In conclusion, we describe a method for detecting a key developmental feature of embryo development that avoids clinically impermissible fluorescence staining.

【6】 Universal Lesion Detection in CT Scans using Neural Network Ensembles
Link: https://arxiv.org/abs/2111.04886

Authors: Tarun Mattikalli, Tejas Sudharshan Mathai, Ronald M. Summers
Affiliations: Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Radiology and Imaging Sciences, Clinical Center, National Institutes of Health, Bethesda MD, USA
Note: Accepted at SPIE 2022
Abstract: In clinical practice, radiologists are reliant on the lesion size when distinguishing metastatic from non-metastatic lesions. A prerequisite for lesion sizing is their detection, as it promotes the downstream assessment of tumor spread. However, lesions vary in their size and appearance in CT scans, and radiologists often miss small lesions during a busy clinical day. To overcome these challenges, we propose the use of state-of-the-art detection neural networks to flag suspicious lesions present in the NIH DeepLesion dataset for sizing. Additionally, we incorporate a bounding box fusion technique to minimize false positives (FP) and improve detection accuracy. Finally, to resemble clinical usage, we constructed an ensemble of the best detection models to localize lesions for sizing with a precision of 65.17% and sensitivity of 91.67% at 4 FP per image. Our results improve upon or maintain the performance of current state-of-the-art methods for lesion detection in challenging CT scans.

【7】 Unsupervised Approaches for Out-Of-Distribution Dermoscopic Lesion Detection
Link: https://arxiv.org/abs/2111.04807

Authors: Max Torop, Sandesh Ghimire, Wenqian Liu, Dana H. Brooks, Octavia Camps, Milind Rajadhyaksha, Jennifer Dy, Kivanc Kose
Affiliations: Northeastern University; Memorial Sloan Kettering Cancer Center
Note: NeurIPS: Medical Imaging Meets NeurIPS Workshop
Abstract: There are limited works showing the efficacy of unsupervised Out-of-Distribution (OOD) methods on complex medical data. Here, we present preliminary findings of our unsupervised OOD detection algorithm, SimCLR-LOF, as well as a recent state-of-the-art approach (SSD), applied on medical images. SimCLR-LOF learns semantically meaningful features using SimCLR and uses LOF for scoring whether a test sample is OOD. We evaluated on the multi-source International Skin Imaging Collaboration (ISIC) 2019 dataset, and show results that are competitive with SSD as well as with recent supervised approaches applied on the same data.
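The scoring stage of SimCLR-LOF is straightforward to reproduce in outline: embed images with a (frozen) SimCLR encoder, fit a Local Outlier Factor model on in-distribution embeddings, and score test embeddings. A scikit-learn sketch with random stand-in features follows; the neighbor count is an assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
train_feats = rng.normal(0, 1, (500, 128))           # stand-in SimCLR features
test_feats = np.vstack([rng.normal(0, 1, (5, 128)),  # in-distribution-like
                        rng.normal(6, 1, (5, 128))]) # OOD-like

# novelty=True fits on inliers only and allows scoring unseen samples.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train_feats)
scores = -lof.score_samples(test_feats)              # higher = more likely OOD
print(scores.round(2))
```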

【8】 Real-time landmark detection for precise endoscopic submucosal dissection via shape-aware relation network
Link: https://arxiv.org/abs/2111.04733

Authors: Jiacheng Wang, Yueming Jin, Shuntian Cai, Hongzhi Xu, Pheng-Ann Heng, Jing Qin, Liansheng Wang
Abstract: We propose a novel shape-aware relation network for accurate and real-time landmark detection in endoscopic submucosal dissection (ESD) surgery. This task is of great clinical significance but extremely challenging due to bleeding, lighting reflection, and motion blur in the complicated surgical environment. Compared with existing solutions, which either neglect geometric relationships among target objects or capture the relationships by using complicated aggregation schemes, the proposed network is capable of achieving satisfactory accuracy while maintaining real-time performance by taking full advantage of the spatial relations among landmarks. We first devise an algorithm to automatically generate relation keypoint heatmaps, which are able to intuitively represent the prior knowledge of spatial relations among landmarks without using any extra manual annotation efforts. We then develop two complementary regularization schemes to progressively incorporate the prior knowledge into the training process. While one scheme introduces pixel-level regularization by multi-task learning, the other integrates global-level regularization by harnessing a newly designed grouped consistency evaluator, which adds relation constraints to the proposed network in an adversarial manner. Both schemes are beneficial to the model in training, and can be readily unloaded in inference to achieve real-time detection. We establish a large in-house dataset of ESD surgery for esophageal cancer to validate the effectiveness of our proposed method. Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods in terms of accuracy and efficiency, achieving better detection results faster. Promising results on two downstream applications further corroborate the great potential of our method in ESD clinical practice.

Classification & Recognition (5 papers)

【1】 Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
Link: https://arxiv.org/abs/2111.05222

Authors: Gnana Praveen R, Eric Granger, Patrick Cardinal
Affiliations: Laboratoire d'imagerie, de vision et d'intelligence artificielle (LIVIA), École de technologie supérieure, Montreal, Canada
Note: Accepted at FG2021
Abstract: Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated uni-modal approaches. The most effective techniques for multimodal emotion recognition efficiently leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most of the existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. Our new cross-attentional A-V fusion model efficiently leverages the inter-modal relationships. In particular, it computes cross-attention weights to focus on the more contributive features across individual modalities, and thereby combines contributive feature representations, which are then fed to fully connected layers for the prediction of valence and arousal. The effectiveness of the proposed approach is validated experimentally on videos from the RECOLA and Fatigue (private) datasets. Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches. Code is available at: https://github.com/praveena2j/Cross-Attentional-AV-Fusion
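The fusion mechanism can be sketched compactly: each modality's sequence queries the other through cross-attention, and the attended features are combined for valence/arousal regression. The minimal PyTorch module below is an illustration, not the paper's architecture; dimensions, pooling, and the regression head are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Each modality attends to the other; the attended features are
    concatenated and regressed to continuous (valence, arousal) values."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)            # -> (valence, arousal)

    def forward(self, audio, visual):                # (B, Ta, dim), (B, Tv, dim)
        a_att, _ = self.a2v(audio, visual, visual)   # audio queries visual
        v_att, _ = self.v2a(visual, audio, audio)    # visual queries audio
        fused = torch.cat([a_att.mean(1), v_att.mean(1)], dim=-1)
        return self.head(fused)

out = CrossAttentionFusion()(torch.randn(2, 50, 128), torch.randn(2, 40, 128))
print(out.shape)  # torch.Size([2, 2])
```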

【2】 Exploiting Robust Unsupervised Video Person Re-identification
Link: https://arxiv.org/abs/2111.05170

Authors: Xianghao Zang, Ge Li, Wei Gao, Xiujun Shu
Affiliations: School of Electronic and Computer Engineering, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
Note: Accepted by IET Image Processing
Abstract: Unsupervised video person re-identification (reID) methods usually depend on global-level features, while many supervised reID methods employ local-level features and achieve significant performance improvements. However, applying local-level features to unsupervised methods may introduce unstable performance. To improve the performance stability for unsupervised video reID, this paper introduces a general scheme fusing part models and unsupervised learning. In this scheme, the global-level feature is divided into equal local-level features. A local-aware module is employed to explore the potential of local-level features for unsupervised learning. A global-aware module is proposed to overcome the disadvantages of local-level features. Features from these two modules are fused to form a robust feature representation for each input image. This feature representation has the advantages of local-level features without suffering from their disadvantages. Comprehensive experiments are conducted on three benchmarks, including PRID2011, iLIDS-VID, and DukeMTMC-VideoReID, and the results demonstrate that the proposed approach achieves state-of-the-art performance. Extensive ablation studies demonstrate the effectiveness and robustness of the proposed scheme, the local-aware module, and the global-aware module.

【3】 Incremental Meta-Learning via Episodic Replay Distillation for Few-Shot Image Recognition
Link: https://arxiv.org/abs/2111.04993

Authors: Kai Wang, Xialei Liu, Andy Bagdanov, Luis Herranz, Shangling Rui, Joost van de Weijer
Affiliations: Computer Vision Center, Universitat Autonoma de Barcelona, Barcelona, Spain; College of Computer Science, Nankai University, Tianjin, China; MICC, University of Florence, Florence, Italy; Huawei Kirin Solution, Shanghai, China
Abstract: Most meta-learning approaches assume the existence of a very large set of labeled data available for episodic meta-learning of base knowledge. This contrasts with the more realistic continual learning paradigm in which data arrives incrementally in the form of tasks containing disjoint classes. In this paper we consider the problem of Incremental Meta-Learning (IML), in which classes are presented incrementally in discrete tasks. We propose an approach to IML, which we call Episodic Replay Distillation (ERD), that mixes classes from the current task with class exemplars from previous tasks when sampling episodes for meta-learning. These episodes are then used for knowledge distillation to minimize catastrophic forgetting. Experiments on four datasets demonstrate that ERD surpasses the state-of-the-art. In particular, on the more challenging one-shot, long task sequence incremental meta-learning scenarios, we reduce the gap between IML and the joint-training upper bound from 3.5% / 10.1% / 13.4% with the current state-of-the-art to 2.6% / 2.9% / 5.0% with our method on Tiered-ImageNet / Mini-ImageNet / CIFAR100, respectively.

【4】 Label-Aware Distribution Calibration for Long-tailed Classification
Link: https://arxiv.org/abs/2111.04901

Authors: Chaozheng Wang, Shuzheng Gao, Cuiyun Gao, Pengyun Wang, Wenjie Pei, Lujia Pan, Zenglin Xu
Affiliations: Harbin Institute of Technology, Shenzhen; Noah's Ark Lab, Shenzhen; Huawei Technologies Co. Ltd
Note: 9 pages
Abstract: Real-world data usually present long-tailed distributions. Training on imbalanced data tends to make neural networks perform well on head classes but much worse on tail classes. The severe sparseness of training instances for the tail classes is the main challenge, which results in biased distribution estimation during training. Plenty of efforts have been devoted to ameliorating this challenge, including data re-sampling and synthesizing new training instances for tail classes. However, no prior research has exploited the transferable knowledge from head classes to tail classes for calibrating the distribution of tail classes. In this paper, we suppose that tail classes can be enriched by similar head classes and propose a novel distribution calibration approach named Label-Aware Distribution Calibration (LADC). LADC transfers the statistics from relevant head classes to infer the distribution of tail classes. Sampling from the calibrated distribution further facilitates re-balancing the classifier. Experiments on both image and text long-tailed datasets demonstrate that LADC significantly outperforms existing methods. The visualization also shows that LADC provides a more accurate distribution estimation.
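LADC's core move, transferring statistics from relevant head classes to infer a tail class's distribution and then sampling from the calibrated distribution, can be sketched in a few lines of NumPy. This is a simplified stand-in, not the paper's algorithm: the similarity measure, the number of transferred classes, and the regularization are assumptions.

```python
import numpy as np

def calibrate_tail(tail_feats, head_stats, top_k=2, n_samples=100):
    """Borrow mean/covariance statistics from the head classes most similar
    to a tail class, then sample extra features from the calibrated Gaussian."""
    tail_mean = tail_feats.mean(axis=0)
    # pick head classes whose means are closest to the tail mean
    dists = [np.linalg.norm(tail_mean - m) for m, _ in head_stats]
    nearest = np.argsort(dists)[:top_k]
    means = np.stack([head_stats[i][0] for i in nearest] + [tail_mean])
    covs = np.stack([head_stats[i][1] for i in nearest])
    calib_mean = means.mean(axis=0)
    calib_cov = covs.mean(axis=0) + 0.1 * np.eye(len(tail_mean))  # regularize
    return np.random.multivariate_normal(calib_mean, calib_cov, n_samples)

d = 16
head_stats = [(np.random.randn(d), np.eye(d)) for _ in range(5)]  # (mean, cov)
tail_feats = np.random.randn(3, d)                  # only 3 tail instances
print(calibrate_tail(tail_feats, head_stats).shape)  # (100, 16) new samples
```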

【5】 Bilinear pooling and metric learning network for early Alzheimer's disease identification with FDG-PET images
Link: https://arxiv.org/abs/2111.04985

Authors: Wenju Cui, Caiying Yan, Zhuangzhi Yan, Yunsong Peng, Yilin Leng, Chenlu Liu, Shuangqing Chen, Xi Jiang
Affiliations: Institute of Biomedical Engineering, School of Communication and Information Engineering, Shanghai University, Shanghai, China; Medical Imaging Department, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, China
Abstract: FDG-PET reveals altered brain metabolism in individuals with mild cognitive impairment (MCI) and Alzheimer's disease (AD). Some biomarkers derived from FDG-PET by computer-aided diagnosis (CAD) technologies have been shown to accurately diagnose normal control (NC), MCI, and AD. However, studies on the identification of early MCI (EMCI) and late MCI (LMCI) with FDG-PET images are still insufficient. Compared with studies based on fMRI and DTI images, research on the inter-region representation features in FDG-PET images is insufficient. Moreover, considering the variability across individuals, some hard samples that are very similar to both classes limit the classification performance. To tackle these problems, in this paper, we propose a novel bilinear pooling and metric learning network (BMNet), which can extract the inter-region representation features and distinguish hard samples by constructing an embedding space. To validate the proposed method, we collect 998 FDG-PET images from ADNI. Following the common preprocessing steps, 90 features are extracted from each FDG-PET image according to the automatic anatomical landmark (AAL) template and then sent into the proposed network. Extensive 5-fold cross-validation experiments are performed for multiple two-class classifications. Experiments show that most metrics are improved after adding the bilinear pooling module and metric losses to the baseline model. Specifically, in the classification task between EMCI and LMCI, the specificity improves by 6.38% after adding the triplet metric loss, and the negative predictive value (NPV) improves by 3.45% after using the bilinear pooling module.
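Both named ingredients have compact textbook forms: bilinear pooling takes the averaged outer product of region features (capturing inter-region interactions), followed by signed square root and l2 normalization, and the metric loss can be a triplet margin loss on the pooled embeddings. The PyTorch sketch below uses toy dimensions (90 regions echoes the AAL template; the feature width and margin are assumptions) and is not the BMNet implementation.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(x):
    """Bilinear pooling of per-region features x (B, N, C): average outer
    product over regions, then signed sqrt and l2 normalization."""
    b = torch.einsum("bnc,bnd->bcd", x, x) / x.shape[1]  # (B, C, C)
    b = b.flatten(1)
    b = torch.sign(b) * torch.sqrt(b.abs() + 1e-9)       # signed square root
    return F.normalize(b, dim=1)                         # l2 normalization

feats = torch.randn(6, 90, 32)   # 90 AAL-region features per scan (toy width)
emb = bilinear_pool(feats)
# Triplet metric loss on the pooled embeddings (margin is an assumption).
anchor, pos, neg = emb[0:2], emb[2:4], emb[4:6]
loss = F.triplet_margin_loss(anchor, pos, neg, margin=0.5)
print(emb.shape, float(loss))
```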

Segmentation & Semantics (8 papers)

【1】 Unsupervised Spiking Instance Segmentation on Event Data using STDP
Link: https://arxiv.org/abs/2111.05283

Authors: Paul Kirkland, Davide L. Manna, Alex Vincente-Sola, Gaetano Di Caterina
Affiliations: Neuromorphic Sensor Signal Processing Lab, Centre for Image and Signal Processing, Electrical and Electronic Engineering, University of Strathclyde, Glasgow, United Kingdom
Note: 20 pages, 13 figures
Abstract: Spiking Neural Networks (SNN) and the field of Neuromorphic Engineering have brought about a paradigm shift in how to approach Machine Learning (ML) and Computer Vision (CV) problems. This paradigm shift comes from the adoption of event-based sensing and processing. An event-based vision sensor allows sparse and asynchronous events to be produced that are dynamically related to the scene, capturing not only spatial information but also high-fidelity temporal information, while avoiding the extra overhead and redundancy of conventional high frame rate approaches. However, with this change in paradigm, many techniques from traditional CV and ML are not applicable to these event-based spatial-temporal visual streams. As such, only a limited number of recognition, detection and segmentation approaches exist. In this paper, we present a novel approach that can perform instance segmentation using just the weights of a Spike Time Dependent Plasticity trained Spiking Convolutional Neural Network that was trained for object recognition. This exploits the spatial and temporal aspects of the network's internal feature representations, adding this new discriminative capability. We highlight the new capability by successfully transforming a single-class unsupervised network for face detection into a multi-person face recognition and instance segmentation network.

【2】 Dual Prototypical Contrastive Learning for Few-shot Semantic Segmentation
Link: https://arxiv.org/abs/2111.04982

Authors: Hyeongjun Kwon, Somi Jeong, Sunok Kim, Kwanghoon Sohn
Affiliations: Department of Electrical & Electronic Engineering, Yonsei University, Seoul, Korea; Department of Software Engineering, Korea Aerospace University, Goyang, Korea
Note: 8 pages, 7 figures, this https URL
Abstract: We address the problem of few-shot semantic segmentation (FSS), which aims to segment novel class objects in a target image with a few annotated samples. Though recent advances have been made by incorporating prototype-based metric learning, existing methods still show limited performance under extreme intra-class object variations and semantically similar inter-class objects due to their poor feature representation. To tackle this problem, we propose a dual prototypical contrastive learning approach tailored to the FSS task to capture the representative semantic features effectively. The main idea is to encourage the prototypes to be more discriminative by increasing inter-class distance while reducing intra-class distance in the prototype feature space. To this end, we first present a class-specific contrastive loss with a dynamic prototype dictionary that stores the class-aware prototypes during training, thus encouraging same-class prototypes to be similar and different-class prototypes to be dissimilar. Furthermore, we introduce a class-agnostic contrastive loss to enhance the generalization ability to unseen classes by compressing the feature distribution of each semantic class within each episode. We demonstrate that the proposed dual prototypical contrastive learning approach outperforms state-of-the-art FSS methods on the PASCAL-5i and COCO-20i datasets. The code is available at: https://github.com/kwonjunn01/DPCL1.

【3】 LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation
Link: https://arxiv.org/abs/2111.04875

Authors: Sambit Mohapatra, Mona Hodaei, Senthil Yogamani, Stefan Milz, Patrick Maeder, Heinrich Gotzig, Martin Simon, Hazem Rashed
Affiliations: Valeo, Germany; Valeo, Ireland; Spleenlab.ai, Germany; TU Ilmenau, Germany
Abstract: Moving object detection and segmentation is an essential task in the autonomous driving pipeline. Detecting and isolating static and moving components of a vehicle's surroundings is particularly crucial in path planning and localization tasks. This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data. We use two successive scans of LiDAR data in 2D Bird's Eye View (BEV) representation to perform pixel-wise classification as static or moving. Furthermore, we propose a novel data augmentation technique to reduce the significant class imbalance between static and moving objects. We achieve this by artificially synthesizing moving objects by cutting and pasting static vehicles. We demonstrate a low latency of 8 ms on a commonly used automotive embedded platform, namely the Nvidia Jetson Xavier. To the best of our knowledge, this is the first work directly performing motion segmentation in LiDAR BEV space. We provide quantitative results on the challenging SemanticKITTI dataset, and qualitative results are provided at https://youtu.be/2aJ-cL8b0LI.
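The input representation here is two consecutive LiDAR scans rasterized into 2D bird's-eye-view grids and stacked channel-wise, so a 2D network can classify each pixel as static or moving. Below is a NumPy sketch of the rasterization step; the range, resolution, and occupancy-only encoding are assumptions (height or intensity channels are also common in practice).

```python
import numpy as np

def to_bev(points, x_range=(0, 50), y_range=(-25, 25), res=0.2):
    """Rasterize a LiDAR point cloud (N, 3) into a 2D BEV occupancy grid."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    ix = ((pts[:, 0] - x_range[0]) / res).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / res).astype(int)
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((h, w), np.float32)
    bev[ix, iy] = 1.0    # mark occupied cells
    return bev

scan_t0 = np.random.uniform([0, -25, -2], [50, 25, 1], (10000, 3))
scan_t1 = np.random.uniform([0, -25, -2], [50, 25, 1], (10000, 3))
# Two successive scans stacked channel-wise as network input.
net_input = np.stack([to_bev(scan_t0), to_bev(scan_t1)])
print(net_input.shape)  # (2, 250, 250)
```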

【4】 Feature-enhanced Generation and Multi-modality Fusion based Deep Neural Network for Brain Tumor Segmentation with Missing MR Modalities
Link: https://arxiv.org/abs/2111.04735

Authors: Tongxue Zhou, Stéphane Canu, Pierre Vera, Su Ruan
Affiliations: INSA de Rouen, LITIS - Apprentissage, Rouen, France; Université de Rouen Normandie, LITIS - QuantIF, Rouen, France; Normandie Univ, INSA Rouen, UNIROUEN, UNIHAVRE, LITIS, France; Department of Nuclear Medicine, Henri Becquerel Cancer Center, Rouen, France
Abstract: Using multimodal Magnetic Resonance Imaging (MRI) is necessary for accurate brain tumor segmentation. The main problem is that not all types of MRIs are always available in clinical exams. Based on the fact that there is a strong correlation between the MR modalities of the same patient, in this work we propose a novel brain tumor segmentation network for the case of one or more missing modalities. The proposed network consists of three sub-networks: a feature-enhanced generator, a correlation constraint block, and a segmentation network. The feature-enhanced generator utilizes the available modalities to generate a 3D feature-enhanced image representing the missing modality. The correlation constraint block can exploit the multi-source correlation between the modalities and also constrains the generator to synthesize a feature-enhanced modality which must have a coherent correlation with the available modalities. The segmentation network is a multi-encoder based U-Net to achieve the final brain tumor segmentation. The proposed method is evaluated on the BraTS 2018 dataset. Experimental results demonstrate the effectiveness of the proposed method, which achieves average Dice scores of 82.9, 74.9 and 59.1 on whole tumor, tumor core and enhancing tumor, respectively, across all situations, and outperforms the best method by 3.5%, 17% and 18.2%.

【5】 Segmentation of Multiple Myeloma Plasma Cells in Microscopy Images with Noisy Labels
Link: https://arxiv.org/abs/2111.05125

Authors: Álvaro García Faura, Dejan Štepec, Tomaž Martinčič, Danijel Skočaj
Affiliations: Slovenia; University of Ljubljana
Note: Accepted to SPIE Medical Imaging conference
Abstract: A key component towards improved and fast cancer diagnosis is the development of computer-assisted tools. In this article, we present the solution that won the SegPC-2021 competition for the segmentation of multiple myeloma plasma cells in microscopy images. The labels used in the competition dataset were generated semi-automatically and were noisy. To deal with this, a heavy image augmentation procedure was carried out and predictions from several models were combined using a custom ensemble strategy. State-of-the-art feature extractors and instance segmentation architectures were used, resulting in a mean Intersection-over-Union of 0.9389 on the SegPC-2021 final test set.

【6】 Real-time Instance Segmentation of Surgical Instruments using Attention and Multi-scale Feature Fusion
Link: https://arxiv.org/abs/2111.04911

Authors: Juan Carlos Angeles-Ceron, Gilberto Ochoa-Ruiz, Leonardo Chang, Sharib Ali
Affiliations: Tecnologico de Monterrey, Escuela de Ingeniería y Ciencias, México; Institute of Biomedical Engineering (IBME), Department of Engineering Science, University of Oxford, Oxford, UK; Oxford NIHR Biomedical Research Centre, University of Oxford, Oxford, UK
Abstract: Precise instrument segmentation aids surgeons in navigating the body more easily and increases patient safety. While accurate tracking of surgical instruments in real-time plays a crucial role in minimally invasive computer-assisted surgeries, it is a challenging task to achieve, mainly due to 1) the complex surgical environment, and 2) the difficulty of designing models with both optimal accuracy and speed. Deep learning gives us the opportunity to learn such complex environments from large-scale surgical scene data and the placement of these instruments in real-world scenarios. The Robust Medical Instrument Segmentation 2019 challenge (ROBUST-MIS) provides more than 10,000 frames with surgical tools in different clinical settings. In this paper, we use a lightweight single-stage instance segmentation model complemented with a convolutional block attention module to achieve both faster and more accurate inference. We further improve accuracy through data augmentation and optimal anchor localisation strategies. To our knowledge, this is the first work that explicitly focuses on both real-time performance and improved accuracy. Our approach outperformed the top team in the ROBUST-MIS challenge with over 44% improvement on both the area-based metric MI_DSC and the distance-based metric MI_NSD. We also demonstrate real-time performance (>60 frames per second) with different but competitive variants of our final approach.
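The attention component named here, the convolutional block attention module (CBAM), has a well-known published form: channel attention computed from average- and max-pooled descriptors through a shared MLP, followed by spatial attention computed from channel-wise statistics. A generic PyTorch sketch follows; the reduction ratio and kernel size are the usual defaults assumed here, and this is not the paper's exact integration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over avg/max pooled descriptors),
    then spatial attention (conv over channel-wise avg/max maps)."""
    def __init__(self, c, reduction=16, k=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // reduction), nn.ReLU(),
                                 nn.Linear(c // reduction, c))
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):                        # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))       # channel attention branch
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.conv(s))   # spatial attention

print(CBAM(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```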

【7】 DR-VNet: Retinal Vessel Segmentation via Dense Residual UNet
Link: https://arxiv.org/abs/2111.04739

Authors: Ali Karaali, Rozenn Dahyot, Donal J. Sexton
Affiliations: The Irish Longitudinal Study on Ageing (TILDA), School of Medicine, Trinity College Dublin, Ireland; ADAPT: Science Foundation Ireland (SFI) Research Centre for Digital Media Technology, Trinity College Dublin, Ireland
Abstract: Accurate retinal vessel segmentation is an important task for many computer-aided diagnosis systems. Yet, it is still a challenging problem due to the complex vessel structures of the eye. Numerous vessel segmentation methods have been proposed recently, but more research is needed to deal with the poor segmentation of thin and tiny vessels. To address this, we propose a new deep learning pipeline combining the efficiency of residual dense net blocks and residual squeeze-and-excitation blocks. We validate our approach experimentally on three datasets and show that our pipeline outperforms current state-of-the-art techniques on the sensitivity metric, which is relevant to assessing the capture of small vessels.

【8】 Synthetic magnetic resonance images for domain adaptation: Application to fetal brain tissue segmentation
Link: https://arxiv.org/abs/2111.04737

Authors: Priscille de Dumast, Hamza Kebiri, Kelly Payette, Andras Jakab, Hélène Lajous, Meritxell Bach Cuadra
Affiliations: Department of Radiology, Lausanne University Hospital (CHUV) and University of Lausanne (UNIL), Lausanne, Switzerland; CIBM Center for Biomedical Imaging, Switzerland; Center for MR Research, University Children's Hospital Zurich, University of Zurich
Note: 4 pages, 4 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Abstract: The quantitative assessment of the developing human brain in utero is crucial to fully understand neurodevelopment. Thus, automated multi-tissue fetal brain segmentation algorithms are being developed, which in turn require annotated data to be trained. However, the available annotated fetal brain datasets are limited in number and heterogeneity, hampering domain adaptation strategies for robust segmentation. In this context, we use FaBiAN, a Fetal Brain magnetic resonance Acquisition Numerical phantom, to simulate various realistic magnetic resonance images of the fetal brain along with its class labels. We demonstrate that these multiple synthetic annotated data, generated at no cost and further reconstructed using the target super-resolution technique, can be successfully used for domain adaptation of a deep learning method that segments seven brain tissues. Overall, the accuracy of the segmentation is significantly enhanced, especially in the cortical gray matter, the white matter, the cerebellum, the deep gray matter and the brain stem.

Zero/Few-Shot, Transfer & Domain Adaptation (2 papers)

【1】 Deep Learning Adapted Acceleration for Limited-view Photoacoustic Computed Tomography
Link: https://arxiv.org/abs/2111.05194

Authors: Hengrong Lan, Jiali Gong, Fei Gao
Affiliations: Engineering Research Center of Intelligent Vision and Imaging, School of Information Science and Technology, ShanghaiTech University, Shanghai, China; Chinese Academy of Sciences, Shanghai Institute of
Note: Journal version submitted
Abstract: Photoacoustic imaging (PAI) is a non-invasive imaging modality that detects the ultrasound signal generated from tissue with light excitation. Photoacoustic computed tomography (PACT) uses unfocused large-area light to illuminate the target, with an ultrasound transducer array for PA signal detection. The limited-view issue can cause low-quality images in PACT due to geometric constraints. Model-based methods with different regularizations are used to resolve this problem. To enable fast and high-quality reconstruction of limited-view PA data, in this paper a model-based method that combines a mathematical variational model with deep learning is proposed to speed up and regularize the unrolled reconstruction procedure. A deep neural network is designed to adapt the step size of the data-consistency gradient-update term in the gradient descent procedure, which can obtain a high-quality PA image with only a few iterations. Note that all parameters and priors are automatically learned during the offline training stage. In experiments, we show that this method outperforms the other methods with half-view (180 degrees) simulated and real data. The comparison of different model-based methods shows that our proposed scheme has superior performance (over 0.05 for SSIM) with the same number of iterations (3). Furthermore, unseen data are used to validate the generalization of the different methods. Finally, we find that our method obtains superior results (SSIM of 0.94 in vivo) with high robustness and accelerated reconstruction.
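The learned component replaces the hand-tuned step size in unrolled gradient descent: at each of a few iterations the data-consistency gradient A^T(Ax - y) is computed, and a small network predicts how large a step to take. The PyTorch sketch below is a simplified stand-in for the paper's scheme; the step-size network, the toy forward operator, and the iteration count are assumptions.

```python
import torch
import torch.nn as nn

class UnrolledRecon(nn.Module):
    """Unrolled gradient descent for y = A x with a learned, adaptive
    step size applied to the data-consistency gradient at each iteration."""
    def __init__(self, n, iters=3):
        super().__init__()
        self.iters = iters
        self.step_net = nn.Sequential(nn.Linear(n, 32), nn.ReLU(),
                                      nn.Linear(32, 1), nn.Softplus())

    def forward(self, A, y):
        x = torch.zeros(A.shape[1])
        for _ in range(self.iters):
            grad = A.T @ (A @ x - y)      # data-consistency gradient A^T(Ax - y)
            alpha = self.step_net(grad)   # network-predicted step size > 0
            x = x - alpha * grad
        return x

A = torch.randn(40, 64)                   # toy limited-view forward operator
y = A @ torch.randn(64)
print(UnrolledRecon(64)(A, y).shape)      # torch.Size([64])
```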

【2】 Mitigating domain shift in AI-based tuberculosis screening with unsupervised domain adaptation
Link: https://arxiv.org/abs/2111.04893

Authors: Nishanjan Ravin, Sourajit Saha, Alan Schweitzer, Ameena Elahi, Farouk Dako, Daniel Mollura, David Chapman
Affiliations: Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Hilltop Circle, Baltimore, MD
Abstract: We demonstrate that Domain Invariant Feature Learning (DIFL) can improve the out-of-domain generalizability of a deep learning tuberculosis screening algorithm. It is well known that state-of-the-art deep learning algorithms often have difficulty generalizing to unseen data distributions due to "domain shift". In the context of medical imaging, this could lead to unintended biases such as the inability to generalize from one patient population to another. We analyze the performance of a ResNet-50 classifier for the purposes of tuberculosis screening using the four most popular public datasets with geographically diverse sources of imagery. We show that without domain adaptation, ResNet-50 has difficulty in generalizing between imaging distributions from a number of public tuberculosis screening datasets with imagery from geographically distributed regions. However, with the incorporation of DIFL, the out-of-domain performance is greatly enhanced. Analysis criteria include a comparison of accuracy, sensitivity, specificity and AUC over both the baseline and the DIFL-enhanced algorithms. We conclude that DIFL improves the generalizability of tuberculosis screening while maintaining acceptable accuracy over the source domain imagery when applied across a variety of public datasets.
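The abstract does not describe how DIFL is implemented. One common realization of domain-invariant feature learning is DANN-style adversarial training, where a gradient reversal layer makes the feature extractor maximize the loss of a domain classifier; the sketch below illustrates only that generic idea and should not be read as this paper's method.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass,
    so the feature extractor learns to *confuse* the domain classifier
    (DANN-style; the paper's exact DIFL variant may differ)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

feat = nn.Linear(512, 128)                        # stand-in feature extractor
dom_clf = nn.Linear(128, 2)                       # source vs. target domain
z = feat(torch.randn(8, 512))
dom_logits = dom_clf(GradReverse.apply(z, 1.0))   # reversed grads reach `feat`
loss = nn.functional.cross_entropy(dom_logits, torch.randint(0, 2, (8,)))
loss.backward()
print(feat.weight.grad.shape)                     # torch.Size([128, 512])
```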

Semi/Weakly/Unsupervised, Active Learning & Uncertainty (1 paper)

【1】 TAGLETS: A System for Automatic Semi-Supervised Learning with Auxiliary Data
Link: https://arxiv.org/abs/2111.04798

Authors: Wasu Piriyakulkij, Cristina Menghini, Ross Briden, Nihal V. Nayak, Jeffrey Zhu, Elaheh Raisi, Stephen H. Bach
Affiliations: Brown University; LinkedIn
Abstract: Machine learning practitioners often have access to a spectrum of data: labeled data for the target task (which is often limited), unlabeled data, and auxiliary data, the many available labeled datasets for other tasks. We describe TAGLETS, a system built to study techniques for automatically exploiting all three types of data and creating high-quality, servable classifiers. The key components of TAGLETS are: (1) auxiliary data organized according to a knowledge graph, (2) modules encapsulating different methods for exploiting auxiliary and unlabeled data, and (3) a distillation stage in which the ensembled modules are combined into a servable model. We compare TAGLETS with state-of-the-art transfer learning and semi-supervised learning methods on four image classification tasks. Our study covers a range of settings, varying the amount of labeled data and the semantic relatedness of the auxiliary data to the target task. We find that the intelligent incorporation of auxiliary and unlabeled data into multiple learning techniques enables TAGLETS to match, and most often significantly surpass, these alternatives. TAGLETS is available as an open-source system at github.com/BatsResearch/taglets.

Temporal, Action Recognition, Pose, Video & Motion Estimation (2 papers)

【1】 Video Text Tracking With a Spatio-Temporal Complementary Model
Link: https://arxiv.org/abs/2111.04987

Authors: Yuzhe Gao, Xing Li, Jiajian Zhang, Yu Zhou, Dian Jin, Jing Wang, Shenggao Zhu, Xiang Bai
Affiliations: School of Electronic Information and Communications, Huazhong University of Science and Technology
Abstract: Text tracking is to track multiple texts in a video and construct a trajectory for each text. Existing methods tackle this task by utilizing the tracking-by-detection framework, i.e., detecting the text instances in each frame and associating the corresponding text instances in consecutive frames. We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios, e.g., owing to motion blur, etc., the missed detection of text instances causes the break of the text trajectory. In addition, different text instances with similar appearance are easily confused, leading to the incorrect association of the text instances. To this end, a novel spatio-temporal complementary text tracking model is proposed in this paper. We leverage a Siamese Complementary Module to fully exploit the continuity characteristic of the text instances in the temporal dimension, which effectively alleviates the missed detection of the text instances, and hence ensures the completeness of each text trajectory. We further integrate the semantic cues and the visual cues of the text instance into a unified representation via a text similarity learning network, which supplies a high discriminative power in the presence of text instances with similar appearance, and thus avoids the mis-association between them. Our method achieves state-of-the-art performance on several public benchmarks. The source code is available at https://github.com/lsabrinax/VideoTextSCM.

【2】 Cascaded Multilingual Audio-Visual Learning from Videos 标题:级联多语种视频视听学习 链接:https://arxiv.org/abs/2111.04823

作者:Andrew Rouditchenko,Angie Boggust,David Harwath,Samuel Thomas,Hilde Kuehne,Brian Chen,Rameswar Panda,Rogerio Feris,Brian Kingsbury,Michael Picheny,James Glass 机构:MIT CSAIL, USA, UT Austin, USA, IBM Research AI, USA, Columbia University, USA, NYU, USA 备注:Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset 摘要:在本文中，我们探讨了从教学视频中学习的自监督视听模型。之前的工作表明，这些模型在对大规模视频数据集进行训练后，可以将口语和声音与视觉内容联系起来，但它们仅在英语视频上进行训练和评估。为了学习多语言视听表示，我们提出了一种级联方法，该方法利用在英语视频上训练的模型，并将其应用于其他语言的视听数据，例如日语视频。通过我们的级联方法，我们显示检索性能比仅在日语视频上训练提高了近10倍。我们还将经过英语视频训练的模型应用于日语和印地语口语图像字幕，实现了最先进的性能。 摘要:In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training on the Japanese videos solely. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.

医学相关(2篇)

【1】 BRACS: A Dataset for BReAst Carcinoma Subtyping in H&E Histology Images 标题:BRACS:一种用于乳腺癌HE组织学图像分型的数据集 链接:https://arxiv.org/abs/2111.04740

作者:Nadia Brancati,Anna Maria Anniciello,Pushpak Pati,Daniel Riccio,Giosuè Scognamiglio,Guillaume Jaume,Giuseppe De Pietro,Maurizio Di Bonito,Antonio Foncubierta,Gerardo Botti,Maria Gabrani,Florinda Feroce,Maria Frucci 机构:Institute for High Performance Computing and Networking of the Research Council of Italy, ICAR-CNR, Naples, National Cancer Institute – IRCCS – Fondazione Pascale, Naples, Italy, IBM Research – Zurich, Switzerland 备注:10 pages, 3 figures, 8 tables, 30 references 摘要:乳腺癌是最常见的癌症，也是女性癌症死亡人数最多的癌症。诊断活动的最新进展与大规模筛查政策相结合，显著降低了乳腺癌患者的死亡率。然而，病理学家对组织切片的手动检查既繁琐又耗时，且受观察者之间和内部显著差异的影响。最近，全玻片扫描系统的出现使病理玻片的快速数字化成为可能，并使之能够开发数字工作流程。这些进步进一步使得利用人工智能(AI)来辅助、自动化和增强病理诊断成为可能。但是人工智能技术，特别是深度学习(Deep Learning,DL)，需要大量高质量的注释数据来学习。构建这种特定于任务的数据集带来了一些挑战，例如数据采集层面的限制、耗时且昂贵的注释以及私有信息的匿名化。在这篇文章中，我们介绍了乳腺癌亚型(BRACS)数据集，这是一个大队列的注释苏木精和伊红(H&E)染色图像，有助于乳腺病变的表征。BRACS包含547张全切片图像(WSI)，以及从WSI中提取的4539个感兴趣区域(ROI)。每个WSI及其对应的ROI均由三名委员会认证病理学家以共识方式标注为不同的病变类别。具体而言，BRACS包括三种病变类型，即良性、恶性和非典型，这些病变又细分为七类。据我们所知，它是WSI和ROI水平上最大的乳腺癌亚型注释数据集。此外，通过包括研究不足的非典型病变，BRACS为利用AI更好地了解其特征提供了独特的机会。 摘要:Breast cancer is the most commonly diagnosed cancer and registers the highest number of deaths for women with cancer. Recent advancements in diagnostic activities combined with large-scale screening policies have significantly lowered the mortality rates for breast cancer patients. However, the manual inspection of tissue slides by the pathologists is cumbersome, time-consuming, and is subject to significant inter- and intra-observer variability. Recently, the advent of whole-slide scanning systems have empowered the rapid digitization of pathology slides, and enabled to develop digital workflows. These advances further enable to leverage Artificial Intelligence (AI) to assist, automate, and augment pathological diagnosis. But the AI techniques, especially Deep Learning (DL), require a large amount of high-quality annotated data to learn from. Constructing such task-specific datasets poses several challenges, such as, data-acquisition level constraints, time-consuming and expensive annotations, and anonymization of private information. In this paper, we introduce the BReAst Carcinoma Subtyping (BRACS) dataset, a large cohort of annotated Hematoxylin & Eosin (H&E)-stained images to facilitate the characterization of breast lesions. BRACS contains 547 Whole-Slide Images (WSIs), and 4539 Regions of Interest (ROIs) extracted from the WSIs. Each WSI, and respective ROIs, are annotated by the consensus of three board-certified pathologists into different lesion categories. Specifically, BRACS includes three lesion types, i.e., benign, malignant and atypical, which are further subtyped into seven categories. It is, to the best of our knowledge, the largest annotated dataset for breast cancer subtyping both at WSI- and ROI-level. Further, by including the understudied atypical lesions, BRACS offers a unique opportunity for leveraging AI to better understand their characteristics.

【2】 HEROHE Challenge: assessing HER2 status in breast cancer without immunohistochemistry or in situ hybridization 标题:HEROHE挑战:在没有免疫组织化学或原位杂交的情况下评估乳腺癌的HER2状态 链接:https://arxiv.org/abs/2111.04738

作者:Eduardo Conde-Sousa,João Vale,Ming Feng,Kele Xu,Yin Wang,Vincenzo Della Mea,David La Barbera,Ehsan Montahaei,Mahdieh Soleymani Baghshah,Andreas Turzynski,Jacob Gildenblat,Eldad Klaiman,Yiyu Hong,Guilherme Aresta,Teresa Araújo,Paulo Aguiar,Catarina Eloy,António Polónia 机构:a., I,S - Instituto de Investigação e Inovação em Saúde, Universidade Do Porto, Portugal, b., INEB - Instituto de Engenharia Biomédica, Universidade Do Porto, Portugal, c. 摘要:乳腺癌是女性最常见的恶性肿瘤,每年造成50多万人死亡。因此,早期准确的诊断至关重要。诊断和正确分类乳腺癌以及确定适当的治疗需要人类专业知识,这取决于对不同生物标记物(如跨膜蛋白受体HER2)表达的评估。该评估需要几个步骤,包括特殊技术,如免疫组织化学或原位杂交,以评估HER2状态。为了减少诊断中的步骤数量和人类偏见,HEROHE挑战作为第16届欧洲数字病理学大会的平行活动组织起来,旨在仅基于苏木精和伊红染色的浸润性乳腺癌组织样本自动评估HER2状态。全球21个团队介绍了评估HER2状态的方法,其中一些方法所取得的结果为提高最新水平开辟了潜在的前景。 摘要:Breast cancer is the most common malignancy in women, being responsible for more than half a million deaths every year. As such, early and accurate diagnosis is of paramount importance. Human expertise is required to diagnose and correctly classify breast cancer and define appropriate therapy, which depends on the evaluation of the expression of different biomarkers such as the transmembrane protein receptor HER2. This evaluation requires several steps, including special techniques such as immunohistochemistry or in situ hybridization to assess HER2 status. With the goal of reducing the number of steps and human bias in diagnosis, the HEROHE Challenge was organized, as a parallel event of the 16th European Congress on Digital Pathology, aiming to automate the assessment of the HER2 status based only on hematoxylin and eosin stained tissue sample of invasive breast cancer. Methods to assess HER2 status were presented by 21 teams worldwide and the results achieved by some of the proposed methods open potential perspectives to advance the state-of-the-art.

GAN|对抗|攻击|生成相关(1篇)

【1】 GDCA: GAN-based single image super resolution with Dual discriminators and Channel Attention 标题:GDCA:基于GAN的具有双鉴别器和通道注意的单图像超分辨率 链接:https://arxiv.org/abs/2111.05014

作者:Thanh Nguyen,Hieu Hoang,Chang D. Yoo 机构: Korea Advanced Institute of Science and Technology (KAIST) 备注:None 摘要:单图像超分辨率(SISR)是一个非常活跃的研究领域。本文通过使用基于GAN的双鉴别器方法并将其与注意机制相结合来解决SISR。实验结果表明，与其他传统方法相比，GDCA可以生成更清晰、更令人满意的图像。 摘要:Single Image Super-Resolution (SISR) is a very active research field. This paper addresses SISR by using a GAN-based approach with dual discriminators and incorporating it with an attention mechanism. The experimental results show that GDCA can generate sharper and more pleasing images compared to other conventional methods.

Attention注意力(1篇)

【1】 E(2) Equivariant Self-Attention for Radio Astronomy 标题:射电天文学的E(2)等变自我注意 链接:https://arxiv.org/abs/2111.04742

作者:Micah Bowles,Matthew Bromley,Max Allen,Anna Scaife 机构:Department of Physics & Astronomy, University of Manchester, UK 备注:7 pages, 3 figures, NeurIPS Workshop: Machine Learning and the Physical Sciences 摘要:在这项工作中，我们引入了群等变自注意模型来解决天文学中可解释射电星系分类的问题。我们评估了循环群和二面体群等变的各种阶数，并表明将等变性作为先验既可以减少拟合数据所需的轮数(epoch)，又能提高性能。我们强调了将自注意用作可解释模型时等变性的好处，并说明等变模型在分类时如何在统计意义上关注与人类天文学家相同的特征。 摘要:In this work we introduce group-equivariant self-attention models to address the problem of explainable radio galaxy classification in astronomy. We evaluate various orders of both cyclic and dihedral equivariance, and show that including equivariance as a prior both reduces the number of epochs required to fit the data and results in improved performance. We highlight the benefits of equivariance when using self-attention as an explainable model and illustrate how equivariant models statistically attend the same features in their classifications as human astronomers.
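作为补充，下面给出一个说明"等变性来自权重共享"这一思想的最小草图：标准群卷积网络(GCNN)中的C4(90度旋转群)提升卷积层。注意这只是示意，并非论文中基于自注意力的E(2)等变模型，滤波器尺寸等均为任意示例。

```python
import torch
import torch.nn.functional as F

def c4_lifting_conv(x, weight):
    """C4提升卷积的最小示意(非论文官方实现)。
    x: (B, Cin, H, W); weight: (Cout, Cin, k, k)，k为奇数。
    对同一滤波器的4个旋转副本分别卷积并堆叠：输入旋转90度时，
    输出只是空间旋转加上方向通道的循环移位，即满足等变性。"""
    k = weight.shape[-1]
    outs = [F.conv2d(x, torch.rot90(weight, r, dims=(2, 3)), padding=k // 2)
            for r in range(4)]
    return torch.stack(outs, dim=2)  # (B, Cout, 4, H, W)

x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)
y1 = c4_lifting_conv(torch.rot90(x, 1, dims=(2, 3)), w)        # 先旋转输入
y2 = torch.rot90(c4_lifting_conv(x, w), 1, dims=(3, 4))        # 先卷积再做空间旋转
# 等变性检查：二者相差一个方向通道上的循环移位
print(torch.allclose(y1, torch.roll(y2, shifts=1, dims=2), atol=1e-5))  # True
```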

人脸|人群计数(3篇)

【1】 Monocular Human Shape and Pose with Dense Mesh-borne Local Image Features 标题:具有密集网格局部图像特征的单目人体形状和姿势 链接:https://arxiv.org/abs/2111.05319

作者:Shubhendu Jena,Franck Multon,Adnane Boukhayma 机构:Inria, Univ. Rennes, CNRS, IRISA, M,S, France 备注:FG 2021 摘要:我们建议改进基于图卷积的单目输入人体形状和姿势估计方法,使用像素对齐的局部图像特征。给定单个输入彩色图像,现有的基于图卷积网络(GCN)的人体形状和姿势估计技术使用单个卷积神经网络(CNN)生成的全局图像特征(附加到所有网格顶点)来初始化GCN阶段,该阶段将模板T姿势网格转换为目标姿势。相比之下,我们首次提出了使用每个顶点的局部图像特征的想法。这些特征是利用DensePose生成的像素到网格的对应关系从CNN图像特征图中采样的。我们在标准基准上的定量和定性结果表明,使用本地功能比使用全球功能更好,并在最新技术方面带来竞争性性能。 摘要:We propose to improve on graph convolution based approaches for human shape and pose estimation from monocular input, using pixel-aligned local image features. Given a single input color image, existing graph convolutional network (GCN) based techniques for human shape and pose estimation use a single convolutional neural network (CNN) generated global image feature appended to all mesh vertices equally to initialize the GCN stage, which transforms a template T-posed mesh into the target pose. In contrast, we propose for the first time the idea of using local image features per vertex. These features are sampled from the CNN image feature maps by utilizing pixel-to-mesh correspondences generated with DensePose. Our quantitative and qualitative results on standard benchmarks show that using local features improves on global ones and leads to competitive performances with respect to the state-of-the-art.
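下面用F.grid_sample给出"逐顶点像素对齐局部特征"这一思想的最小示意：给定每个网格顶点在图像上的2D对应位置(论文中由DensePose提供)，从CNN特征图中双线性采样得到各顶点的局部特征。这是基于摘要的假设性草图，顶点对应关系和特征图均为随机占位数据。

```python
import torch
import torch.nn.functional as F

B, C, H, W, V = 2, 64, 56, 56, 6890          # 6890为SMPL顶点数(示例取值)
feat = torch.randn(B, C, H, W)               # CNN特征图(占位)
uv = torch.rand(B, V, 2) * 2 - 1             # 各顶点的像素对应位置，归一化到[-1,1](占位)

# grid_sample要求grid形状为(B, Hout, Wout, 2)，这里把V个顶点当作Hout=V, Wout=1
grid = uv.view(B, V, 1, 2)
vert_feat = F.grid_sample(feat, grid, align_corners=False)  # (B, C, V, 1)
vert_feat = vert_feat.squeeze(-1).permute(0, 2, 1)          # (B, V, C) 逐顶点局部特征
print(vert_feat.shape)  # torch.Size([2, 6890, 64])
```

这些逐顶点特征随后可用来初始化GCN各顶点，替代"同一个全局特征拼到所有顶点"的做法。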

【2】 Ethically aligned Deep Learning: Unbiased Facial Aesthetic Prediction 标题:伦理一致的深度学习:无偏见的面部美学预测 链接:https://arxiv.org/abs/2111.05149

作者:Michael Danner,Thomas Weber,Leping Peng,Tobias Gerlach,Xueping Su,Matthias Rätsch 机构:∗University of Surrey †Reutlingen University ‡Hunan University of Science and Technology, §Xi’an Polytechnic University 备注:Peer reviewed and accepted at CEPE/IACAP 2021 as Extended Abstract 摘要:面部美学预测(FBP)旨在开发一种自动进行面部吸引力评估的机器。在过去，这些结果与人类评分高度相关，因此也与他们在注释上的偏见高度相关。由于人工智能可能具有种族主义和歧视倾向，因此必须查明数据中出现偏差的原因。对科学家来说，开发对有偏见的信息具有鲁棒性的训练数据和人工智能算法是一个新的挑战。由于审美判断通常是有偏见的，我们想进一步提出一种无偏卷积神经网络。虽然可以创建能够高水平评估人脸吸引力的网络模型，但从伦理角度来看，确保模型不带偏见同样重要。在这项工作中，我们引入了AestheticNet，这是一个最先进的吸引力预测网络，其以0.9601的Pearson相关系数显著优于竞争对手。此外，我们提出了一种新的生成无偏差CNN的方法，以提高机器学习的公平性。 摘要:Facial beauty prediction (FBP) aims to develop a machine that automatically makes facial attractiveness assessment. In the past those results were highly correlated with human ratings, therefore also with their bias in annotating. As artificial intelligence can have racist and discriminatory tendencies, the cause of skews in the data must be identified. Development of training data and AI algorithms that are robust against biased information is a new challenge for scientists. As aesthetic judgement usually is biased, we want to take it one step further and propose an Unbiased Convolutional Neural Network for FBP. While it is possible to create network models that can rate attractiveness of faces on a high level, from an ethical point of view, it is equally important to make sure the model is unbiased. In this work, we introduce AestheticNet, a state-of-the-art attractiveness prediction network, which significantly outperforms competitors with a Pearson Correlation of 0.9601. Additionally, we propose a new approach for generating a bias-free CNN to improve fairness in machine learning.

【3】 SAFA: Structure Aware Face Animation 标题:SAFA:结构感知人脸动画 链接:https://arxiv.org/abs/2111.04928

作者:Qiulin Wang,Lu Zhang,Bo Li 备注:Accepted at 3DV2021 摘要:生成对抗网络(GAN)最近的成功使人脸动画任务取得了巨大进展。然而，人脸图像的复杂场景结构仍然使生成人脸姿势明显偏离源图像的视频成为一项挑战。一方面，在不知道人脸几何结构的情况下，生成的人脸图像可能会不正确地失真。另一方面，生成图像的某些区域可能会在源图像中被遮挡，这使得GAN难以生成真实的外观。为了解决这些问题，我们提出了一种结构感知人脸动画(SAFA)方法，该方法通过构造特定的几何结构来建模人脸图像的不同组件。遵循公认的基于运动的人脸动画技术，我们使用3D可变形模型(3DMM)对人脸建模，使用多个仿射变换对其他前景组件(如头发和胡须)建模，使用恒等变换对背景建模。3DMM几何嵌入不仅有助于为驱动场景生成真实的结构，而且有助于更好地感知生成图像中的遮挡区域。此外，我们进一步建议利用广泛研究的图像修复(inpainting)技术来忠实地恢复被遮挡的图像区域。定量和定性实验结果都表明了该方法的优越性。代码可在 https://github.com/Qiulin-W/SAFA 获取。 摘要:Recent success of generative adversarial networks (GAN) has made great progress on the face animation task. However, the complex scene structure of a face image still makes it a challenge to generate videos with face poses significantly deviating from the source image. On one hand, without knowing the facial geometric structure, generated face images might be improperly distorted. On the other hand, some area of the generated image might be occluded in the source image, which makes it difficult for GAN to generate realistic appearance. To address these problems, we propose a structure aware face animation (SAFA) method which constructs specific geometric structures to model different components of a face image. Following the well recognized motion based face animation technique, we use a 3D morphable model (3DMM) to model the face, multiple affine transforms to model the other foreground components like hair and beard, and an identity transform to model the background. The 3DMM geometric embedding not only helps generate realistic structure for the driving scene, but also contributes to better perception of occluded area in the generated image. Besides, we further propose to exploit the widely studied inpainting technique to faithfully recover the occluded image area. Both quantitative and qualitative experiment results have shown the superiority of our method. Code is available at https://github.com/Qiulin-W/SAFA.

跟踪(2篇)

【1】 EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction 标题:EEGEyeNet:一种同时进行脑电图和眼动跟踪的数据集和眼动预测基准 链接:https://arxiv.org/abs/2111.05100

作者:Ard Kastrati,Martyna Beata Płomecka,Damián Pascual,Lukas Wolf,Victor Gillioz,Roger Wattenhofer,Nicolas Langer 机构:ETH Zurich, Switzerland, University of Zurich, Switzerland 备注:Published at NeurIPS 2021 Datasets and Benchmarks Track 摘要:我们提出了一个新的数据集和基准，旨在推进大脑活动和眼球运动交叉点的研究。我们的数据集EEGEyeNet由来自三个不同实验范式的356名不同受试者的同时脑电图(EEG)和眼球跟踪(ET)记录组成。利用这个数据集，我们还提出了一个基准来评估凝视预测从脑电图测量。基准测试包括三项难度越来越大的任务：左右、角度幅度和绝对位置。我们在这个基准上运行了大量的实验，以提供基于经典机器学习模型和大型神经网络的可靠基线。我们发布了完整的代码和数据，并提供了一个简单易用的界面来评估新方法。 摘要:We present a new dataset and benchmark with the goal of advancing research in the intersection of brain activities and eye movements. Our dataset, EEGEyeNet, consists of simultaneous Electroencephalography (EEG) and Eye-tracking (ET) recordings from 356 different subjects collected from three different experimental paradigms. Using this dataset, we also propose a benchmark to evaluate gaze prediction from EEG measurements. The benchmark consists of three tasks with an increasing level of difficulty: left-right, angle-amplitude and absolute position. We run extensive experiments on this benchmark in order to provide solid baselines, both based on classical machine learning models and on large neural networks. We release our complete code and data and provide a simple and easy-to-use interface to evaluate new methods.

【2】 Combining Machine Learning with Physics: A Framework for Tracking and Sorting Multiple Dark Solitons 标题:机器学习与物理相结合:跟踪和分类多个暗孤子的框架 链接:https://arxiv.org/abs/2111.04881

作者:Shangjie Guo,Sophia M. Koh,Amilson R. Fritsch,I. B. Spielman,Justyna P. Zwolak 机构:Joint Quantum Institute, National Institute of Standards and Technology, and University of Maryland, Gaithersburg, Maryland , USA, Department of Physics and Astronomy, Amherst College, Amherst, Massachusetts , USA 备注:12 pages, 9 figures 摘要:在超冷原子实验中，数据通常以图像的形式出现，而这些图像会受到制备和测量系统所用技术中固有信息损失的影响。当感兴趣的过程很复杂时，例如玻色-爱因斯坦凝聚体(BEC)中激发之间的相互作用，这一问题尤其严重。在本文中，我们描述了一个结合机器学习(ML)模型和基于物理的传统分析的框架，用于识别和跟踪BEC图像中的多个孤子激发。我们使用一个基于ML的目标检测器来定位孤子激发，并开发一个物理知情(physics-informed)分类器，将孤子激发划分为具有物理动机的子类别。最后，我们引入了一个质量度量来量化特定特征是扭结孤子的可能性。该框架经过训练的实现SolDet已作为开源Python包公开提供。当在用户提供的合适数据集上训练时，SolDet可广泛适用于冷原子图像中的特征识别。 摘要:In ultracold atom experiments, data often comes in the form of images which suffer information loss inherent in the techniques used to prepare and measure the system. This is particularly problematic when the processes of interest are complicated, such as interactions among excitations in Bose-Einstein condensates (BECs). In this paper, we describe a framework combining machine learning (ML) models with physics-based traditional analyses to identify and track multiple solitonic excitations in images of BECs. We use an ML-based object detector to locate the solitonic excitations and develop a physics-informed classifier to sort solitonic excitations into physically motivated sub-categories. Lastly, we introduce a quality metric quantifying the likelihood that a specific feature is a kink soliton. Our trained implementation of this framework -- SolDet -- is publicly available as an open-source python package. SolDet is broadly applicable to feature identification in cold atom images when trained on a suitable user-provided dataset.

图像视频检索|Re-id相关(2篇)

【1】 MMD-ReID: A Simple but Effective Solution for Visible-Thermal Person ReID 标题:MMD-ReID:一种简单而有效的可见光-热红外行人重识别解决方案 链接:https://arxiv.org/abs/2111.05059

作者:Chaitra Jambigi,Ruchit Rawal,Anirban Chakraborty 机构:Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India 备注:Accepted in BMVC 2021 (Oral) 摘要:学习模态不变特征是可见光-热红外跨模态行人重识别(VT-ReID)问题的核心，其中查询图像和库(gallery)图像来自不同的模态。现有工作通过使用对抗性学习或仔细设计严重依赖领域知识的特征提取模块，隐式地在像素和特征空间中对齐模态。我们提出了一个简单而有效的框架MMD-ReID，该框架通过一个显式的差异减少约束来减少模态差异。MMD-ReID的灵感来自最大平均差异(MMD)，这是一种广泛用于假设检验的统计工具，用于确定两个分布之间的距离。MMD-ReID使用一种新的基于边距(margin)的公式来匹配可见光和热红外样本的类条件特征分布，以最小化类内距离，同时保持特征可辨别性。MMD-ReID在架构和损失函数形式方面都是一个简单的框架。我们进行了大量的实验，定性和定量地证明了MMD-ReID在对齐边缘分布和类条件分布方面的有效性，从而学习了模态无关且身份一致的特征。该框架在SYSU-MM01和RegDB数据集上的性能明显优于最先进的方法。代码将发布于 https://github.com/vcl-iisc/MMD-ReID。 摘要:Learning modality invariant features is central to the problem of Visible-Thermal cross-modal Person Reidentification (VT-ReID), where query and gallery images come from different modalities. Existing works implicitly align the modalities in pixel and feature spaces by either using adversarial learning or carefully designing feature extraction modules that heavily rely on domain knowledge. We propose a simple but effective framework, MMD-ReID, that reduces the modality gap by an explicit discrepancy reduction constraint. MMD-ReID takes inspiration from Maximum Mean Discrepancy (MMD), a widely used statistical tool for hypothesis testing that determines the distance between two distributions. MMD-ReID uses a novel margin-based formulation to match class-conditional feature distributions of visible and thermal samples to minimize intra-class distances while maintaining feature discriminability. MMD-ReID is a simple framework in terms of architecture and loss formulation. We conduct extensive experiments to demonstrate both qualitatively and quantitatively the effectiveness of MMD-ReID in aligning the marginal and class conditional distributions, thus learning both modality-independent and identity-consistent features. The proposed framework significantly outperforms the state-of-the-art methods on SYSU-MM01 and RegDB datasets. Code will be released at https://github.com/vcl-iisc/MMD-ReID
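MMD-ReID的核心统计工具是最大平均差异(MMD)。下面给出带RBF核的经典MMD^2经验估计的最小示意(论文实际使用的是基于边距的类条件变体，此处仅展示其基础形式，特征为随机占位)：

```python
import torch

def mmd2_rbf(x, y, sigma=1.0):
    """RBF核下MMD^2的有偏经验估计。x:(n,d)、y:(m,d)为两个模态的特征。"""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

vis = torch.randn(32, 256)          # 可见光特征(占位)
thm = torch.randn(32, 256) + 0.5    # 热红外特征(占位，分布有偏移)
print(mmd2_rbf(vis, thm))           # 分布差异越大值越大，可作为损失项最小化
```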

【2】 PREMA: Part-based REcurrent Multi-view Aggregation Network for 3D Shape Retrieval 标题:PREMA:用于三维形状检索的基于部件的递归多视图聚集网络 链接:https://arxiv.org/abs/2111.04945

作者:Jiongchao Jin,Huanqiang Xu,Pengliang Ji,Zehao Tang,Zhang Xiong 机构:School of Computer Science, Beihang University, Beijing, China 备注:Accepted by ICCSMT 2021 摘要:我们提出了基于部分的递归多视图聚合网络(PREMA),以消除实际视图缺陷的不利影响,如视图数量不足、遮挡或背景杂波,并增强形状表示的辨别能力。受人类主要通过判别部分识别物体这一事实的启发,我们定义了多视点相干部分(MCP),一种在不同视点中重复出现的判别部分。我们的PREMA能够可靠地定位并有效地利用MCP来构建健壮的形状表示。综合而言,我们在PREMA中设计了一个新的区域注意单元(RAU)来计算每个视图的置信度图,并通过将这些图应用于视图特征来提取MCP。PREMA通过关联不同视图的特征来强调MCP,并聚合形状表示的零件感知特征。 摘要:We propose the Part-based Recurrent Multi-view Aggregation network(PREMA) to eliminate the detrimental effects of the practical view defects, such as insufficient view numbers, occlusions or background clutters, and also enhance the discriminative ability of shape representations. Inspired by the fact that human recognize an object mainly by its discriminant parts, we define the multi-view coherent part(MCP), a discriminant part reoccurring in different views. Our PREMA can reliably locate and effectively utilize MCPs to build robust shape representations. Comprehensively, we design a novel Regional Attention Unit(RAU) in PREMA to compute the confidence map for each view, and extract MCPs by applying those maps to view features. PREMA accentuates MCPs via correlating features of different views, and aggregates the part-aware features for shape representation.

蒸馏|知识提取(1篇)

【1】 MixACM: Mixup-Based Robustness Transfer via Distillation of Activated Channel Maps 标题:MixACM:通过蒸馏激活通道图实现基于Mixup的鲁棒性迁移 链接:https://arxiv.org/abs/2111.05073

作者:Muhammad Awais,Fengwei Zhou,Chuanlong Xie,Jiawei Li,Sung-Ho Bae,Zhenguo Li 机构: Huawei Noah’s Ark Lab, Department of Computer Science, Kyung-Hee University, South Korea 备注:Accepted by NeurIPS 2021 摘要:深层神经网络容易受到对自然输入施加的对抗性构造的、微小且不可察觉的扰动的影响。对抗这些样本最有效的防御机制是对抗性训练，它通过迭代损失最大化在训练期间构造对抗性样本，然后训练模型使这些构造样本上的损失最小化。这种最小-最大优化需要更多的数据、更大容量的模型和额外的计算资源，还会降低模型的标准泛化性能。我们能否更高效地实现鲁棒性？在这项工作中，我们从知识迁移的角度来探讨这个问题。首先，我们从理论上证明了在mixup增强的帮助下，鲁棒性可以从一个经过对抗训练的教师模型迁移到一个学生模型。其次，我们提出了一种新的鲁棒性迁移方法，称为基于Mixup的激活通道图(MixACM)迁移。MixACM通过匹配无需昂贵对抗扰动即可生成的激活通道图，将鲁棒性从鲁棒的教师模型迁移给学生模型。最后，在多个数据集和不同的学习场景上进行的大量实验表明，我们的方法可以在提高自然图像泛化能力的同时迁移鲁棒性。 摘要:Deep neural networks are susceptible to adversarially crafted, small and imperceptible changes in the natural inputs. The most effective defense mechanism against these examples is adversarial training which constructs adversarial examples during training by iterative maximization of loss. The model is then trained to minimize the loss on these constructed examples. This min-max optimization requires more data, larger capacity models, and additional computing resources. It also degrades the standard generalization performance of a model. Can we achieve robustness more efficiently? In this work, we explore this question from the perspective of knowledge transfer. First, we theoretically show the transferability of robustness from an adversarially trained teacher model to a student model with the help of mixup augmentation. Second, we propose a novel robustness transfer method called Mixup-Based Activated Channel Maps (MixACM) Transfer. MixACM transfers robustness from a robust teacher to a student by matching activated channel maps generated without expensive adversarial perturbations. Finally, extensive experiments on multiple datasets and different learning scenarios show our method can transfer robustness while also improving generalization on natural images.
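下面给出MixACM所依赖的两个基本构件的示意：mixup增强，以及用L2距离匹配教师/学生的激活图。其中激活图的具体构造方式和损失权重是本示例的假设，以论文为准：

```python
import torch
import torch.nn.functional as F

def mixup(x, alpha=1.0):
    """标准mixup增强：对一个batch内的样本做随机凸组合。"""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx]

def activation_match_loss(teacher_feat, student_feat):
    """匹配激活图的示意：把中间特征沿通道聚合成空间图后做L2匹配。
    (聚合方式为假设，非论文中激活通道图的精确定义)"""
    t_map = F.normalize(teacher_feat.pow(2).mean(dim=1).flatten(1), dim=1)
    s_map = F.normalize(student_feat.pow(2).mean(dim=1).flatten(1), dim=1)
    return (t_map - s_map).pow(2).sum(dim=1).mean()

x_mix = mixup(torch.randn(8, 3, 32, 32))   # 无需对抗扰动，只用mixup样本
t = torch.randn(8, 64, 16, 16)             # 鲁棒教师的中间激活(占位)
s = torch.randn(8, 64, 16, 16)             # 学生的中间激活(占位)
print(activation_match_loss(t, s))
```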

视觉解释|视频理解VQA|caption等(1篇)

【1】 Visual Question Answering based on Formal Logic 标题:基于形式逻辑的视觉问答 链接:https://arxiv.org/abs/2111.04785

作者:Muralikrishnna G. Sethuraman,Ali Payani,Faramarz Fekri,J. Clayton Kerce 机构:School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA, Cisco, California, USA, Georgia Tech Research Institute 摘要:近年来，由于在理解来自多种模式(即图像、语言)的信息方面所面临的挑战，视觉问答(VQA)在机器学习领域得到了广泛的关注。在VQA中，基于一组图像提出一系列问题，手头的任务是找到答案。为了实现这一点，我们使用形式逻辑框架，采用基于符号推理的方法。图像和问题被转换成符号表示，并在此基础上进行显式推理。我们提出了一个形式逻辑框架，其中(i)借助场景图将图像转换为逻辑背景事实，(ii)使用基于Transformer的深度学习模型将问题转换为一阶谓词逻辑子句，以及(iii)利用背景知识和谓词子句的接地(grounding)执行可满足性检查，从而得出答案。我们提出的方法具有高度的可解释性，管道中的每一步都可以很容易地由人来分析。我们在CLEVR和GQA数据集上验证了我们的方法。我们在CLEVR数据集上实现了接近完美的99.6%的准确率，与最先进的模型相当，这表明形式逻辑是解决视觉问答问题的可行工具。我们的模型数据效率也很高，当仅使用10%的训练数据进行训练时，CLEVR数据集的准确率达到99.1%。 摘要:Visual question answering (VQA) has been gaining a lot of traction in the machine learning community in the recent years due to the challenges posed in understanding information coming from multiple modalities (i.e., images, language). In VQA, a series of questions are posed based on a set of images and the task at hand is to arrive at the answer. To achieve this, we take a symbolic reasoning based approach using the framework of formal logic. The image and the questions are converted into symbolic representations on which explicit reasoning is performed. We propose a formal logic framework where (i) images are converted to logical background facts with the help of scene graphs, (ii) the questions are translated to first-order predicate logic clauses using a transformer based deep learning model, and (iii) perform satisfiability checks, by using the background knowledge and the grounding of predicate clauses, to obtain the answer. Our proposed method is highly interpretable and each step in the pipeline can be easily analyzed by a human. We validate our approach on the CLEVR and the GQA dataset. We achieve near perfect accuracy of 99.6% on the CLEVR dataset comparable to the state of art models, showcasing that formal logic is a viable tool to tackle visual question answering. Our model is also data efficient, achieving 99.1% accuracy on CLEVR dataset when trained on just 10% of the training data.
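下面用纯Python给出该流水线第(iii)步"利用背景事实对接地后的谓词子句做可满足性检查"的玩具示意。场景事实与查询均为虚构示例，仅说明符号推理的形式：

```python
# 玩具示例：由场景图导出的逻辑背景事实，用谓词元组集合表示
facts = {
    ("color", "cube1", "red"),
    ("shape", "cube1", "cube"),
    ("left_of", "cube1", "sphere1"),
}

def holds(pred, *args):
    """检查一个接地(grounded)谓词是否被背景事实满足。"""
    return (pred, *args) in facts

# 假设问题"Is there a red cube left of the sphere?"被翻译成如下合取子句
query = [("color", "cube1", "red"), ("shape", "cube1", "cube"),
         ("left_of", "cube1", "sphere1")]

answer = all(holds(*clause) for clause in query)
print("yes" if answer else "no")   # -> yes
```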

超分辨率|去噪|去模糊|去雾(1篇)

【1】 Graph-Based Depth Denoising & Dequantization for Point Cloud Enhancement 标题:基于图的点云增强深度去噪与去量化 链接:https://arxiv.org/abs/2111.04946

作者:Xue Zhang,Gene Cheung,Jiahao Pang,Yash Sanghvi,Abhiram Gnanasambandam,Stanley H. Chan 备注:13 pages,14 figures 摘要:三维点云通常由传感器在一个或多个视点上获取的深度测量构建。测量同时受到量化和噪声的影响。为了提高质量，以前的工作在将不完美的深度数据投影到三维空间后对点云进行事后(a posteriori)去噪。相反，我们在合成3D点云之前，直接在感测图像上预先(a priori)增强深度测量。通过在物理传感过程附近进行增强，我们在后续处理步骤(掩盖测量误差)之前，根据深度形成模型进行优化。具体地说，我们将深度形成建模为信号相关噪声添加和基于非均匀对数的量化的组合过程。利用从实际深度传感器收集的经验数据对设计的模型进行验证(参数拟合)。为了增强深度图像中的每个像素行，我们首先通过特征图学习将可用行像素之间的视图内相似性编码为图的边权重。接下来，我们通过视点映射和稀疏线性插值建立与另一个校正深度图像的视图间相似性。这导出了一个凸且可微的最大后验(MAP)图滤波目标。我们使用加速梯度下降(AGD)高效地优化该目标，其中最佳步长通过Gershgorin圆定理(GCT)近似。实验表明，在两个已建立的点云质量指标中，我们的方法明显优于最近的点云去噪方案和最新的图像去噪方案。 摘要:A 3D point cloud is typically constructed from depth measurements acquired by sensors at one or more viewpoints. The measurements suffer from both quantization and noise corruption. To improve quality, previous works denoise a point cloud a posteriori after projecting the imperfect depth data onto 3D space. Instead, we enhance depth measurements directly on the sensed images a priori, before synthesizing a 3D point cloud. By enhancing near the physical sensing process, we tailor our optimization to our depth formation model before subsequent processing steps that obscure measurement errors. Specifically, we model depth formation as a combined process of signal-dependent noise addition and non-uniform log-based quantization. The designed model is validated (with parameters fitted) using collected empirical data from an actual depth sensor. To enhance each pixel row in a depth image, we first encode intra-view similarities between available row pixels as edge weights via feature graph learning. We next establish inter-view similarities with another rectified depth image via viewpoint mapping and sparse linear interpolation. This leads to a maximum a posteriori (MAP) graph filtering objective that is convex and differentiable. We optimize the objective efficiently using accelerated gradient descent (AGD), where the optimal step size is approximated via Gershgorin circle theorem (GCT). Experiments show that our method significantly outperformed recent point cloud denoising schemes and state-of-the-art image denoising schemes, in two established point cloud quality metrics.
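摘要中"凸且可微的MAP图滤波目标 + AGD + Gershgorin圆定理步长"可以用其最常见的原型来理解：min_x ||y-x||^2 + mu*x^T L x，其中L为图拉普拉斯。下面是在该简化目标上的示意(论文完整目标还包含视图间项和深度形成模型，此处仅演示优化机制)：

```python
import numpy as np

# 构造一个随机无向图：邻接权重W(占位)，拉普拉斯 L = D - W
rng = np.random.default_rng(0)
n = 50
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
d = W.sum(axis=1)
L = np.diag(d) - W

y = rng.standard_normal(n)      # 含噪深度测量(占位)
mu = 0.5                        # 图平滑正则权重

# 目标 f(x) = ||y - x||^2 + mu * x^T L x，梯度为 2(x - y) + 2*mu*L@x
# Gershgorin圆定理给出 lambda_max(I + mu*L) <= max_i (1 + 2*mu*d_i)，据此取步长
step = 1.0 / (2 * (1 + 2 * mu * d.max()))

x = y.copy(); z = y.copy(); t = 1.0
for _ in range(500):            # Nesterov加速梯度下降(AGD)
    grad = 2 * (z - y) + 2 * mu * (L @ z)
    x_new = z - step * grad
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    z = x_new + (t - 1) / t_new * (x_new - x)
    x, t = x_new, t_new

# 与闭式解 (I + mu*L) x = y 比较，迭代次数足够时误差趋于0
print(np.linalg.norm(x - np.linalg.solve(np.eye(n) + mu * L, y)))
```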

3D|3D重建等相关(1篇)

【1】 Evolving Evocative 2D Views of Generated 3D Objects 标题:生成的3D对象的进化唤起的2D视图 链接:https://arxiv.org/abs/2111.04839

作者:Eric Chu 机构:MIT Media Lab 备注:None 摘要:我们提出了一种在不同视角下联合生成物体三维模型和二维渲染的方法，该过程由基于ImageNet和CLIP的模型指导。我们的结果表明，它可以生成变形的对象，其渲染既能唤起目标字幕，又在视觉上具有吸引力。 摘要:We present a method for jointly generating 3D models of objects and 2D renders at different viewing angles, with the process guided by ImageNet- and CLIP-based models. Our results indicate that it can generate anamorphic objects, with renders that both evoke the target caption and look visually appealing.

其他神经网络|深度学习|模型|建模(4篇)

【1】 Self-Interpretable Model with Transformation Equivariant Interpretation 标题:具有变换等变解释的自解释模型 链接:https://arxiv.org/abs/2111.04927

作者:Yipei Wang,Xiaoqian Wang 机构:Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 备注:Accepted by NeurIPS 2021 摘要:在本文中，我们提出了一个具有变换等变解释的自解释模型SITE。我们关注几何变换下解释的鲁棒性和自一致性。除了变换等变性，作为一种自解释模型，SITE具有与基准黑盒分类器相当的表达能力，同时能够以高质量呈现忠实和稳健的解释。值得注意的是，尽管在大多数CNN可视化方法中都有应用，但双线性上采样近似是一种粗略近似，它只能以热图的形式提供解释(而不是像素级)。这些解释是否可以直接落到输入空间(如MNIST实验所示)，仍然是一个悬而未决的问题。此外，我们在模型中考虑了平移和旋转变换。在未来的工作中，我们将探索在更复杂的变换(如缩放和失真)下的稳健解释。此外，我们澄清SITE并不限于(我们在计算机视觉领域中使用的)几何变换，并将在未来的工作中探索SITE在其他领域的应用。 摘要:In this paper, we propose a self-interpretable model SITE with transformation-equivariant interpretations. We focus on the robustness and self-consistency of the interpretations of geometric transformations. Apart from the transformation equivariance, as a self-interpretable model, SITE has comparable expressive power as the benchmark black-box classifiers, while being able to present faithful and robust interpretations with high quality. It is worth noticing that although applied in most of the CNN visualization methods, the bilinear upsampling approximation is a rough approximation, which can only provide interpretations in the form of heatmaps (instead of pixel-wise). It remains an open question whether such interpretations can be direct to the input space (as shown in the MNIST experiments). Besides, we consider the translation and rotation transformations in our model. In future work, we will explore the robust interpretations under more complex transformations such as scaling and distortion. Moreover, we clarify that SITE is not limited to geometric transformation (that we used in the computer vision domain), and will explore SITE in other domains in future work.
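"变换等变解释"这一性质可以直接用代码表述：对旋转后的输入计算解释，应等于对原输入的解释再做同样的旋转。下面用最简单的梯度显著图做一个一致性检查的示意(SITE本身的解释机制不同，此处仅演示等变性这一性质本身)：

```python
import torch

def saliency(model, x):
    """最简单的梯度显著图：对预测类logit求输入梯度的幅值。"""
    x = x.clone().requires_grad_(True)
    model(x).max(dim=1).values.sum().backward()
    return x.grad.abs().sum(dim=1)            # (B, H, W)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.randn(4, 3, 32, 32)

e_rot_in = saliency(model, torch.rot90(x, 1, dims=(2, 3)))    # 先旋转输入再解释
e_rot_out = torch.rot90(saliency(model, x), 1, dims=(1, 2))   # 先解释再旋转
# 等变模型下该差值应接近0；这里的普通线性模型一般不为0
print((e_rot_in - e_rot_out).abs().mean())
```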

【2】 Survey of Deep Learning Methods for Inverse Problems 标题:反问题的深度学习方法综述 链接:https://arxiv.org/abs/2111.04731

作者:Shima Kamyab,Zohreh Azimifar,Rasool Sabzi,Paul Fieguth 机构:Dept. of Comp. Sci. and Eng., Shiraz University, Shiraz, Iran, Dept. of Systems Design Engineering, University of Waterloo, Waterloo, Canada 摘要:在本文中，我们研究了求解反问题的各种深度学习策略。我们将现有的反问题深度学习解决方案分为三类：直接映射、数据一致性优化器和深度正则化器。我们为每种反问题类型选择了一个样本，以比较这三类方法的鲁棒性，并报告其差异的统计分析。我们对经典的线性回归问题以及计算机视觉中的三个著名反问题(图像去噪、3D人脸逆绘制和目标跟踪，它们被选为每类反问题的代表性原型)进行了广泛的实验。总体结果和统计分析表明，各解决方案类别的鲁棒性行为取决于反问题域的类型，具体而言取决于问题是否包含测量异常值。基于实验结果，我们最后为每类反问题提出了最鲁棒的解决方案类别。 摘要:In this paper we investigate a variety of deep learning strategies for solving inverse problems. We classify existing deep learning solutions for inverse problems into three categories of Direct Mapping, Data Consistency Optimizer, and Deep Regularizer. We choose a sample of each inverse problem type, so as to compare the robustness of the three categories, and report a statistical analysis of their differences. We perform extensive experiments on the classic problem of linear regression and three well-known inverse problems in computer vision, namely image denoising, 3D human face inverse rendering, and object tracking, selected as representative prototypes for each class of inverse problems. The overall results and the statistical analyses show that the solution categories have a robustness behaviour dependent on the type of inverse problem domain, and specifically dependent on whether or not the problem includes measurement outliers. Based on our experimental results, we conclude by proposing the most robust solution category for each inverse problem class.

【3】 MAC-ReconNet: A Multiple Acquisition Context based Convolutional Neural Network for MR Image Reconstruction using Dynamic Weight Prediction 标题:MAC-ReconNet:一种基于多采集上下文、使用动态权重预测的MR图像重建卷积神经网络 链接:https://arxiv.org/abs/2111.05055

作者:Sriprabha Ramanarayanan,Balamurali Murugesan,Keerthi Ram,Mohanasankar Sivaprakasam 机构: Indian Institute of Technology Madras (IITM), India, Healthcare Technology Innovation Centre (HTIC), IITM, India 备注:None 摘要:基于卷积神经网络的MR重建方法已被证明能够提供快速和高质量的重建。基于CNN的模型的一个主要缺点是,它缺乏灵活性,只能在特定的采集环境下有效运行,限制了实际应用。我们所说的采集环境是指三种输入设置的具体组合,即研究中的解剖结构、欠采样掩模模式和欠采样加速因子。该模型可以在结合多个上下文的图像上进行联合训练。然而,该模型既不能满足特定于上下文的模型的性能,也不能扩展到火车时刻看不见的上下文。这就需要在生成特定于上下文的权重时对现有体系结构进行修改,以便将灵活性结合到多个上下文中。我们提出了一种基于多个采集上下文的网络,称为用于MRI重建的MAC ReconNet,该网络灵活地适应多个采集上下文,并可推广到不可见上下文,以适用于实际场景。该网络具有一个MRI重建模块和一个动态权重预测(DWP)模块。DWP模块将相应的采集上下文信息作为输入,学习重建模块的上下文特定权重,该权重在运行时随上下文动态变化。我们表明,该方法可以处理基于心脏和大脑数据集、高斯和笛卡尔欠采样模式以及五个加速因子的多个上下文。该网络的性能优于朴素的联合训练模型,并在数量和质量上与上下文特定的模型具有竞争性。我们还通过在火车时刻看不到的上下文上进行测试来证明我们模型的普遍性。 摘要:Convolutional Neural network-based MR reconstruction methods have shown to provide fast and high quality reconstructions. A primary drawback with a CNN-based model is that it lacks flexibility and can effectively operate only for a specific acquisition context limiting practical applicability. By acquisition context, we mean a specific combination of three input settings considered namely, the anatomy under study, undersampling mask pattern and acceleration factor for undersampling. The model could be trained jointly on images combining multiple contexts. However the model does not meet the performance of context specific models nor extensible to contexts unseen at train time. This necessitates a modification to the existing architecture in generating context specific weights so as to incorporate flexibility to multiple contexts. We propose a multiple acquisition context based network, called MAC-ReconNet for MRI reconstruction, flexible to multiple acquisition contexts and generalizable to unseen contexts for applicability in real scenarios. The proposed network has an MRI reconstruction module and a dynamic weight prediction (DWP) module. The DWP module takes the corresponding acquisition context information as input and learns the context-specific weights of the reconstruction module which changes dynamically with context at run time. We show that the proposed approach can handle multiple contexts based on cardiac and brain datasets, Gaussian and Cartesian undersampling patterns and five acceleration factors. The proposed network outperforms the naive jointly trained model and gives competitive results with the context-specific models both quantitatively and qualitatively. We also demonstrate the generalizability of our model by testing on contexts unseen at train time.
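下面给出"动态权重预测(DWP)"思想的最小示意：一个小型MLP以采集上下文向量(解剖部位、欠采样掩模类型、加速因子的编码)为输入，在运行时生成重建卷积层的权重。结构细节与上下文编码方式均为基于摘要的假设，非论文官方实现：

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """由上下文向量动态生成卷积核权重的示意模块(假设性草图)。"""
    def __init__(self, ctx_dim, cin, cout, k=3):
        super().__init__()
        self.cin, self.cout, self.k = cin, cout, k
        self.hyper = nn.Sequential(
            nn.Linear(ctx_dim, 64), nn.ReLU(),
            nn.Linear(64, cout * cin * k * k))

    def forward(self, x, ctx):
        w = self.hyper(ctx).view(self.cout, self.cin, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2)   # 权重随上下文在运行时变化

# 上下文 = [解剖部位one-hot(2), 掩模类型one-hot(2), 加速因子(1)]，编码方式为假设
ctx = torch.tensor([1., 0., 0., 1., 4.])
layer = DynamicConv(ctx_dim=5, cin=1, cout=8)
y = layer(torch.randn(1, 1, 64, 64), ctx)
print(y.shape)  # torch.Size([1, 8, 64, 64])
```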

【4】 Multi-Modality Cardiac Image Analysis with Deep Learning 标题:基于深度学习的多模态心脏图像分析 链接:https://arxiv.org/abs/2111.04736

作者:Lei Li,Fuping Wu,Sihang Wang,Xiahai Zhuang 机构:†Lei Li is with Department of Engineering Science, University of Oxford, Sihan Wang and Xiahai Zhuang are with School of Data Science, Fudan University 备注:Under review as a chapter of book 'Deep Learning for Medical Image Analysis, 2E' 摘要:准确的心脏计算、分析和多模态图像建模对于心脏疾病的诊断和治疗具有重要意义。晚期钆增强磁共振成像(LGE MRI)是一种很有前途的技术,可以可视化和量化心肌梗死(MI)和心房瘢痕。由于LGE MRI的低图像质量和复杂的增强模式,MI和心房疤痕的自动量化可能具有挑战性。此外,与其他序列相比,具有金标准标签的LGE MRI特别有限,这是开发LGE MRI自动分割和量化新算法的另一个障碍。本章旨在总结基于深度学习的多模态心脏图像分析的最新进展和我们最近的先进贡献。首先,我们介绍了基于多序列心脏MRI的心肌和病理分割的两个基准工作。其次,提出了两种新的左心房瘢痕分割和定量框架。第三,我们提出了三种用于跨模态心脏图像分割的无监督域自适应技术。 摘要:Accurate cardiac computing, analysis and modeling from multi-modality images are important for the diagnosis and treatment of cardiac disease. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is a promising technique to visualize and quantify myocardial infarction (MI) and atrial scars. Automating quantification of MI and atrial scars can be challenging due to the low image quality and complex enhancement patterns of LGE MRI. Moreover, compared with the other sequences LGE MRIs with gold standard labels are particularly limited, which represents another obstacle for developing novel algorithms for automatic segmentation and quantification of LGE MRIs. This chapter aims to summarize the state-of-the-art and our recent advanced contributions on deep learning based multi-modality cardiac image analysis. Firstly, we introduce two benchmark works for multi-sequence cardiac MRI based myocardial and pathology segmentation. Secondly, two novel frameworks for left atrial scar segmentation and quantification from LGE MRI were presented. Thirdly, we present three unsupervised domain adaptation techniques for cross-modality cardiac image segmentation.

其他(8篇)

【1】 Data Augmentation Can Improve Robustness 标题:数据增强可以提高健壮性 链接:https://arxiv.org/abs/2111.05328

作者:Sylvestre-Alvise Rebuffi,Sven Gowal,Dan A. Calian,Florian Stimberg,Olivia Wiles,Timothy Mann 机构:DeepMind, London 备注:Accepted at NeurIPS 2021. arXiv admin note: substantial text overlap with arXiv:2103.01946; text overlap with arXiv:2110.09468 摘要:对抗性训练存在鲁棒过拟合现象，即鲁棒测试精度在训练过程中开始下降。在本文中，我们重点讨论了使用常用的数据增强方案来减少鲁棒过拟合。我们证明，与之前的研究结果相反，当与模型权重平均相结合时，数据增强可以显著提高鲁棒精度。此外，我们比较了各种增强技术，发现空间合成技术最适合对抗训练。最后，我们在CIFAR-10上分别针对大小为$\epsilon=8/255$和$\epsilon=128/255$的$\ell_\infty$和$\ell_2$范数有界扰动评估了我们的方法。与以前最先进的方法相比，我们在鲁棒精度上取得了2.93%和2.16%的巨大绝对改进。特别是，对于大小为$\epsilon=8/255$的$\ell_\infty$范数有界扰动，我们的模型在不使用任何外部数据的情况下达到60.07%的鲁棒精度。在使用其他体系结构和数据集(如CIFAR-100、SVHN和TinyImageNet)时，我们还通过这种方法实现了显著的性能提升。 摘要:Adversarial training suffers from robust overfitting, a phenomenon where the robust test accuracy starts to decrease during training. In this paper, we focus on reducing robust overfitting by using common data augmentation schemes. We demonstrate that, contrary to previous findings, when combined with model weight averaging, data augmentation can significantly boost robust accuracy. Furthermore, we compare various augmentations techniques and observe that spatial composition techniques work the best for adversarial training. Finally, we evaluate our approach on CIFAR-10 against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements of 2.93% and 2.16% in robust accuracy compared to previous state-of-the-art methods. In particular, against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our model reaches 60.07% robust accuracy without using any external data. We also achieve a significant performance boost with this approach while using other architectures and datasets such as CIFAR-100, SVHN and TinyImageNet.
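该文的关键配方是"数据增强 + 模型权重平均"。下面给出权重平均中最常用的指数滑动平均(EMA)的最小示意(衰减系数等为常见取值，非论文给定参数)：

```python
import copy
import torch

def update_ema(ema_model, model, decay=0.999):
    """对可训练参数做指数滑动平均：ema = decay*ema + (1-decay)*param。"""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

model = torch.nn.Linear(10, 2)
ema_model = copy.deepcopy(model)   # 评估鲁棒精度时使用的平均模型
# 对抗训练循环中每个优化步之后调用(此处仅示意一次)：
update_ema(ema_model, model)
```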

【2】 Designing and Analyzing the PID and Fuzzy Control System for an Inverted Pendulum 标题:倒立摆的PID和模糊控制系统设计与分析 链接:https://arxiv.org/abs/2111.05309

作者:Armin Masoumian,Pezhman kazemi,Mohammad Chehreghani Montazer,Hatem A. Rashwan,Domenec Puig Valls 机构:Department of Computer Engineering and Mathematics, Universitat Rovira I Virgili, Tarragona, Spain, Departament of Chemical Engineering, Mechatronics Department, University of Debrecen, Debrecen, Hungary 备注:5 pages, Accepted for The 6th International Conference on Mechatronics and Robotics Engineering (ICMRE 2020) 摘要:倒立摆是一个非线性不平衡系统,需要使用电机来控制,以实现稳定和平衡。倒立摆由乐高制作,使用乐高Mindstorm NXT,这是一种可编程机器人,能够完成许多不同的功能。本文提出了倒立摆的初步设计,并研究了与乐高Mindstorm NXT兼容的不同传感器的性能。此外,还研究了计算机视觉实现维护系统所需稳定性的能力。倒立摆是一种传统的小车,可以使用模糊逻辑控制器进行控制,该控制器为小车的移动提供自调整PID控制。在MATLAB和Simulink中对模糊逻辑和PID进行了仿真,并在LabVIEW软件中开发了机器人程序。 摘要:The inverted pendulum is a non-linear unbalanced system that needs to be controlled using motors to achieve stability and equilibrium. The inverted pendulum is constructed with Lego and using the Lego Mindstorm NXT, which is a programmable robot capable of completing many different functions. In this paper, an initial design of the inverted pendulum is proposed and the performance of different sensors, which are compatible with the Lego Mindstorm NXT was studied. Furthermore, the ability of computer vision to achieve the stability required to maintain the system is also investigated. The inverted pendulum is a conventional cart that can be controlled using a Fuzzy Logic controller that produces a self-tuning PID control for the cart to move on. The fuzzy logic and PID are simulated in MATLAB and Simulink, and the program for the robot is developed in the LabVIEW software.
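文中小车控制的核心是PID，模糊逻辑用于在线自整定其增益。下面给出离散PID控制器的最小示意(增益与采样周期为任意示例值，非论文参数)：

```python
class PID:
    """离散PID控制器：u = Kp*e + Ki*∫e dt + Kd*de/dt。"""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, error):
        self.integral += error * self.dt
        deriv = (error - self.prev_err) / self.dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

pid = PID(kp=20.0, ki=1.0, kd=2.0, dt=0.01)  # 增益为示例值，模糊逻辑可在线调节它们
u = pid.step(error=0.1)                       # error为摆角偏差，u为电机控制量
print(u)
```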

【3】 Using The Feedback of Dynamic Active-Pixel Vision Sensor (Davis) to Prevent Slip in Real Time 标题:利用动态主动像素视觉传感器(Davis)的反馈进行实时防滑 链接:https://arxiv.org/abs/2111.05308

作者:Armin Masoumian,Pezhman kazemi,Mohammad Chehreghani Montazer,Hatem A. Rashwan,Domenec Puig Valls 机构:Department of Computer Engineering and Mathematics, Universitat Rovira I Virgili, Tarragona, Spain, Departament of Chemical Engineering, Mechatronics Department, University of Debrecen, Debrecen, Hungary 备注:5 pages, Accepted for The 6th International Conference on Mechatronics and Robotics Engineering (ICMRE 2020) 摘要:本文的目的是描述一种在实时反馈中检测打滑和接触力的方法。在这种新方法中，DAVIS相机由于其快速的处理速度和高分辨率被用作视觉触觉传感器。在四个不同形状、大小、重量和材料的物体上进行了200次实验，以比较Baxter机器人夹持器的精度和响应，从而避免打滑。该方法通过力敏电阻(FSR402)进行了验证。DAVIS相机捕捉到的事件通过特定算法进行处理，以向Baxter机器人提供反馈，帮助其检测打滑。 摘要:The objective of this paper is to describe an approach to detect the slip and contact force in real-time feedback. In this novel approach, the DAVIS camera is used as a vision tactile sensor due to its fast process speed and high resolution. Two hundred experiments were performed on four objects with different shapes, sizes, weights, and materials to compare the accuracy and response of the Baxter robot grippers to avoid slipping. The advanced approach is validated by using a force-sensitive resistor (FSR402). The events captured with the DAVIS camera are processed with specific algorithms to provide feedback to the Baxter robot aiding it to detect the slip.

【4】 Residual Quantity in Percentage of Factory Machines Using Computer Vision and Mathematical Methods 标题:用计算机视觉和数学方法计算工厂机器的剩余量(以百分比表示) 链接:https://arxiv.org/abs/2111.05080

作者:Seunghyeon Kim,Jihoon Ryoo,Dongyeob Lee,Youngho Kim 备注:4 pages, 13 figures 摘要:自从人工智能的发展得到推动以来，计算机视觉一直在蓬勃发展。使用深度学习技术是计算机科学家最常想到的解决方案。然而，深度学习技术的性能有时反而低于人工处理。使用深度学习并不总是计算机视觉相关问题的答案。 摘要:Computer vision has been thriving since AI development was gaining thrust. Using deep learning techniques has been the most popular approach that computer scientists reach for as a solution. However, deep learning techniques tend to show lower performance than manual processing. Using deep learning is not always the answer to a problem related to computer vision.

【5】 View Birdification in the Crowd: Ground-Plane Localization from Perceived Movements 标题:人群中的视图鸟化:从感知到的运动中的地平面定位 链接:https://arxiv.org/abs/2111.05060

作者:Mai Nishimura,Shohei Nobuhara,Ko Nishino 机构: Kyoto University, Kyoto, Japan, OMRON SINIC X, Tokyo, Japan 备注:Accepted to the British Machine Vision Conference (BMVC) 2021 摘要:我们介绍了视图鸟化(view birdification)，即从同样在人群中移动的观察者(如人或车辆)捕获的以自我为中心的视频中恢复人群中人的地平面运动的问题。恢复的地平面运动将为态势理解提供良好的基础，并有利于计算机视觉和机器人技术的下游应用。在本文中，我们将视图鸟化表述为一个几何轨迹重建问题，并从贝叶斯的角度推导了一种级联优化方法。该方法首先估计观察者的运动，然后在考虑行人之间的局部相互作用的同时，对每一帧的周围行人进行定位。我们通过利用人群中人的合成和真实轨迹介绍了三个数据集，并评估了我们方法的有效性。结果证明了我们方法的准确性，并为进一步研究视图鸟化这一重要但具有挑战性的视觉理解问题奠定了基础。 摘要:We introduce view birdification, the problem of recovering ground-plane movements of people in a crowd from an ego-centric video captured from an observer (e.g., a person or a vehicle) also moving in the crowd. Recovered ground-plane movements would provide a sound basis for situational understanding and benefit downstream applications in computer vision and robotics. In this paper, we formulate view birdification as a geometric trajectory reconstruction problem and derive a cascaded optimization method from a Bayesian perspective. The method first estimates the observer's movement and then localizes surrounding pedestrians for each frame while taking into account the local interactions between them. We introduce three datasets by leveraging synthetic and real trajectories of people in crowds and evaluate the effectiveness of our method. The results demonstrate the accuracy of our method and set the ground for further studies of view birdification as an important but challenging visual understanding problem.

【6】 Hybrid BYOL-ViT: Efficient approach to deal with small Datasets 标题:混合BYOL-VIT:处理小数据集的有效方法 链接:https://arxiv.org/abs/2111.04845

作者:Safwen Naimi,Rien van Leeuwen,Wided Souidene,Slim Ben Saoud 备注:19 pages, 8 figures and 16 tables 摘要:监督学习可以学习较大的表征空间，这对于处理困难的学习任务至关重要。然而，由于模型的设计，经典的图像分类方法在处理小数据集时难以推广到新问题和新情况。事实上，监督学习可能会丢失图像特征的位置，从而在非常深的体系结构中导致监督崩溃(supervision collapse)。在本文中，我们研究了对未标记数据进行强且充分增强的自监督如何在不需要数百万标记数据的情况下有效地训练神经网络的前几层，甚至优于监督学习。其主要目标是通过获取与任务无关的通用底层特征来断开像素数据与注释的连接。此外，我们还研究了视觉Transformer(ViT)并表明，从自监督体系结构中提取的低层特征可以提高这种新兴体系结构的鲁棒性和整体性能。我们在最小的开源数据集之一STL-10上评估了我们的方法，当将自监督学习体系结构中的低级特征(而非原始图像)输入到ViT时，我们获得了从41.66%到83.25%的显著性能提升。 摘要:Supervised learning can learn large representational spaces, which are crucial for handling difficult learning tasks. However, due to the design of the model, classical image classification approaches struggle to generalize to new problems and new situations when dealing with small datasets. In fact, supervised learning can lose the location of image features which leads to supervision collapse in very deep architectures. In this paper, we investigate how self-supervision with strong and sufficient augmentation of unlabeled data can train effectively the first layers of a neural network even better than supervised learning, with no need for millions of labeled data. The main goal is to disconnect pixel data from annotation by getting generic task-agnostic low-level features. Furthermore, we look into Vision Transformers (ViT) and show that the low-level features derived from a self-supervised architecture can improve the robustness and the overall performance of this emergent architecture. We evaluated our method on one of the smallest open-source datasets STL-10 and we obtained a significant boost of performance from 41.66% to 83.25% when inputting low-level features from a self-supervised learning architecture to the ViT instead of the raw images.
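下面是"用自监督骨干的低层特征替代ViT原始像素patch嵌入"这一混合结构的假设性草图，结构超参数、占位骨干与冻结策略均为示例设定，非论文配置：

```python
import torch
import torch.nn as nn

class HybridSSLViT(nn.Module):
    def __init__(self, ssl_stem, feat_ch, embed_dim=192, depth=6, heads=3, classes=10):
        super().__init__()
        self.stem = ssl_stem                       # 自监督预训练的底层(此处冻结)
        for p in self.stem.parameters():
            p.requires_grad = False
        self.proj = nn.Conv2d(feat_ch, embed_dim, kernel_size=1)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, classes)  # STL-10为10类

    def forward(self, x):
        f = self.proj(self.stem(x))                # 低层特征图作为token来源
        tokens = f.flatten(2).transpose(1, 2)      # (B, H'*W', D)
        return self.head(self.encoder(tokens).mean(dim=1))

stem = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU())  # 占位骨干
net = HybridSSLViT(stem, feat_ch=64)
print(net(torch.randn(2, 3, 96, 96)).shape)   # torch.Size([2, 10])
```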

【7】 Leveraging blur information for plenoptic camera calibration 标题:利用模糊信息进行全光相机校准 链接:https://arxiv.org/abs/2111.05226

作者:Mathieu Labussière,Céline Teulière,Frédéric Bernardin,Omar Ait-Aider 备注:arXiv admin note: text overlap with arXiv:2004.07745 摘要:本文提出了一种仅使用原始图像的新的全光相机标定算法，尤其适用于使用多种类型微透镜的多焦点配置。当前的校准方法依赖于简化的投影模型，使用重建图像的特征，或者要求对每种类型的微透镜进行单独校准。在多焦点配置中，场景的相同部分将根据微透镜焦距显示不同的模糊量。通常，仅使用模糊程度最小的微图像。为了利用所有可用数据，我们建议借助我们新引入的模糊感知全光(BAP)特征，在新的相机模型中显式建模散焦模糊。首先，它用于预校准步骤以检索初始相机参数；其次，用它表达一个在我们的单一优化过程中被最小化的新成本函数；第三，利用它标定微图像之间的相对模糊度。它将几何模糊(即模糊圆)链接到物理模糊(即点扩散函数)。最后，我们使用得到的模糊轮廓来描述相机的景深。在受控环境中对真实数据的定量评估证明了我们校准的有效性。 摘要:This paper presents a novel calibration algorithm for plenoptic cameras, especially the multi-focus configuration, where several types of micro-lenses are used, using raw images only. Current calibration methods rely on simplified projection models, use features from reconstructed images, or require separated calibrations for each type of micro-lens. In the multi-focus configuration, the same part of a scene will demonstrate different amounts of blur according to the micro-lens focal length. Usually, only micro-images with the smallest amount of blur are used. In order to exploit all available data, we propose to explicitly model the defocus blur in a new camera model with the help of our newly introduced Blur Aware Plenoptic (BAP) feature. First, it is used in a pre-calibration step that retrieves initial camera parameters, and second, to express a new cost function to be minimized in our single optimization process. Third, it is exploited to calibrate the relative blur between micro-images. It links the geometric blur, i.e., the blur circle, to the physical blur, i.e., the point spread function. Finally, we use the resulting blur profile to characterize the camera's depth of field. Quantitative evaluations in controlled environment on real-world data demonstrate the effectiveness of our calibrations.

【8】 Approaching the Limit of Image Rescaling via Flow Guidance 标题:通过流引导接近图像缩放的极限 链接:https://arxiv.org/abs/2111.05133

作者:Shang Li,Guixuan Zhang,Zhengxiong Luo,Jie Liu,Zhi Zeng,Shuwu Zhang 机构: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Institute of Automation, Chinese Academy of Sciences, Beijing Engineering Research Center for Digital Content Technology 备注:BMVC 2021 摘要:图像缩小和放大是两种基本的重缩放操作。一旦图像被缩小，由于信息丢失，很难通过放大来重建图像。为了使这两个过程更加兼容并提高重建性能，一些工作将其建模为联合编码-解码任务，约束条件是缩小(即编码)后的低分辨率(LR)图像必须保持原始视觉外观。为了实现这一约束，大多数方法使用原始高分辨率(HR)图像经双三次下采样得到的LR版本来监督缩小模块，以此对其进行指导。然而，这种双三次LR指导对于随后的放大(即解码)可能并非最优，并会限制最终的重建性能。在本文中，我们没有直接应用LR指导，而是提出了一个附加的可逆流引导模块(FGM)，该模块可以在缩小期间将缩小后的表示转换为视觉上合理的图像，并在放大期间将其转换回来。得益于FGM的可逆性，缩小后的表示可以摆脱LR指导，并且不会干扰缩小-放大过程。这使我们能够消除对缩小模块的限制，并以端到端的方式优化缩小和放大模块。通过这种方式，这两个模块可以相互协作，以最大限度地提高HR重建性能。大量实验表明，该方法在缩小图像和重建图像上都能达到最先进(SotA)的性能。 摘要:Image downscaling and upscaling are two basic rescaling operations. Once the image is downscaled, it is difficult to be reconstructed via upscaling due to the loss of information. To make these two processes more compatible and improve the reconstruction performance, some efforts model them as a joint encoding-decoding task, with the constraint that the downscaled (i.e. encoded) low-resolution (LR) image must preserve the original visual appearance. To implement this constraint, most methods guide the downscaling module by supervising it with the bicubically downscaled LR version of the original high-resolution (HR) image. However, this bicubic LR guidance may be suboptimal for the subsequent upscaling (i.e. decoding) and restrict the final reconstruction performance. In this paper, instead of directly applying the LR guidance, we propose an additional invertible flow guidance module (FGM), which can transform the downscaled representation to the visually plausible image during downscaling and transform it back during upscaling. Benefiting from the invertibility of FGM, the downscaled representation could get rid of the LR guidance and would not disturb the downscaling-upscaling process. It allows us to remove the restrictions on the downscaling module and optimize the downscaling and upscaling modules in an end-to-end manner. In this way, these two modules could cooperate to maximize the HR reconstruction performance. Extensive experiments demonstrate that the proposed method can achieve state-of-the-art (SotA) performance on both downscaled and reconstructed images.
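FGM的可逆性是这类流模块的关键。下面用标准的仿射耦合层(normalizing flow的基本构件)示意"可精确求逆的变换"。摘要未给出FGM的具体结构，此处仅演示可逆机制本身：

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """仿射耦合层：一半通道保持不变，另一半做可逆的缩放加平移。"""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch // 2, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        s, t = self.net(x1).chunk(2, dim=1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)

layer = AffineCoupling(ch=8)
x = torch.randn(1, 8, 16, 16)
print(torch.allclose(layer.inverse(layer.forward(x)), x, atol=1e-5))  # True：精确可逆
```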
