cs.CV: 119 papers today
Transformer (5 papers)
【1】 Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation Link: https://arxiv.org/abs/2108.07007
Authors: Haobin Tan, Chang Chen, Xinyu Luo, Jiaming Zhang, Constantin Seibold, Kailun Yang, Rainer Stiefelhagen Note: Code, dataset, and video demo will be made publicly available at this https URL Abstract: Lacking the ability to sense ambient environments effectively, blind and visually impaired people (BVIP) face difficulty walking outdoors, especially in urban areas. Therefore, tools for assisting BVIP are of great importance. In this paper, we propose a novel "flying guide dog" prototype for BVIP assistance using a drone and street-view semantic segmentation. Based on the walkable areas extracted from the segmentation prediction, the drone adjusts its movement automatically and thus leads the user along the walkable path. By recognizing the color of pedestrian traffic lights, our prototype can help the user cross a street safely. Furthermore, we introduce a new dataset named Pedestrian and Vehicle Traffic Lights (PVTL), dedicated to traffic light recognition. The results of our user study in real-world scenarios show that our prototype is effective and easy to use, providing new insight into BVIP assistance.
【2】 Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers Link: https://arxiv.org/abs/2108.06932
Authors: Bo Dong, Wenhai Wang, Deng-Ping Fan, Jinpeng Li, Huazhu Fu, Ling Shao Affiliations: Department of Computer Science and Engineering, The Chinese University of Hong Kong Note: Technical Report Abstract: Most polyp segmentation methods use CNNs as their backbone, leading to two key issues when exchanging information between the encoder and decoder: 1) taking into account the differences in contribution between different-level features; and 2) designing an effective mechanism for fusing these features. Different from existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the image acquisition influence and elusive properties of polyps, we introduce three novel modules: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features, while the CIM is applied to capture polyp information disguised in low-level features. With the help of the SAM, we extend the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects) than existing methods, and achieves new state-of-the-art performance. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
【3】 No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency Link: https://arxiv.org/abs/2108.06858
Authors: S. Alireza Golestaneh, Saba Dadsetan, Kris M. Kitani Affiliations: Bosch Center for AI, Pittsburgh; University of Pittsburgh; Carnegie Mellon University Abstract: The goal of No-Reference Image Quality Assessment (NR-IQA) is to estimate perceptual image quality in accordance with subjective evaluations; it is a complex and unsolved problem due to the absence of a pristine reference image. In this paper, we propose a novel model to address the NR-IQA task by leveraging a hybrid approach that benefits from Convolutional Neural Networks (CNNs) and the self-attention mechanism in Transformers to extract both local and non-local features from the input image. We capture local structure information of the image via CNNs; then, to circumvent the locality bias among the extracted CNN features and obtain a non-local representation of the image, we utilize Transformers on the extracted features, modeling them as a sequential input to the Transformer model. Furthermore, to improve the monotonicity correlation between the subjective and objective scores, we utilize the relative distance information among the images within each batch and enforce the relative ranking among them. Last but not least, we observe that the performance of NR-IQA models degrades when we apply equivariant transformations (e.g., horizontal flipping) to the inputs. Therefore, we propose a method that leverages self-consistency as a source of self-supervision to improve the robustness of NR-IQA models. Specifically, we enforce self-consistency between the outputs of our quality assessment model for each image and its transformation (horizontally flipped) to utilize the rich self-supervisory information and reduce the uncertainty of the model. To demonstrate the effectiveness of our work, we evaluate it on seven standard IQA datasets (both synthetic and authentic) and show that our model achieves state-of-the-art results on various datasets.
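The self-consistency objective lends itself to a compact sketch. Below is a minimal PyTorch illustration, assuming a model that maps an image batch to scalar quality scores; the MSE form of the consistency penalty is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import torch

def self_consistency_loss(model, images):
    """Penalize disagreement between quality predictions for an image
    and its horizontal flip -- a label-free self-supervision signal."""
    q = model(images)                              # (B,) predicted quality scores
    q_flip = model(torch.flip(images, dims=[-1]))  # same batch, horizontally flipped
    return torch.mean((q - q_flip) ** 2)
```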
【4】 SOTR: Segmenting Objects with Transformers Link: https://arxiv.org/abs/2108.06747
Authors: Ruohao Guo, Dantong Niu, Liao Qu, Zhenbo Li Affiliations: College of Information and Electrical Engineering, China Agricultural University; EECS Department, University of California, Berkeley; Key Laboratory of Agricultural Information Acquisition Technology, Ministry of Agriculture Note: ICCV 2021 Abstract: Most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present a novel, flexible, and effective transformer-based model for high-quality instance segmentation. The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline, building on an alternative CNN backbone appended with two parallel subtasks: (1) predicting per-instance categories via transformer and (2) dynamically generating segmentation masks with a multi-level upsampling module. SOTR can effectively extract lower-level feature representations and capture long-range context dependencies by Feature Pyramid Network (FPN) and twin transformer, respectively. Meanwhile, compared with the original transformer, the proposed twin transformer is time- and resource-efficient, since only row and column attention are involved to encode pixels. Moreover, SOTR is easy to incorporate with various CNN backbones and transformer model variants, yielding considerable improvements in segmentation accuracy and training convergence. Extensive experiments show that our SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches. We hope our simple but strong framework can serve as a preferred baseline for instance-level recognition. Our code is available at https://github.com/easton-cau/SOTR.
【5】 PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds Link: https://arxiv.org/abs/2108.06455
Authors: Jiayao Shan, Sifan Zhou, Zheng Fang, Yubo Cui Affiliations: Northeastern University Note: Accepted by IROS 2021 Abstract: 3D single object tracking is a key issue for robotics. In this paper, we propose a transformer module called Point-Track-Transformer (PTT) for point cloud-based 3D single object tracking. The PTT module contains three blocks for feature embedding, position encoding, and self-attention feature computation. Feature embedding aims to place features closer in the embedding space if they have similar semantic information. Position encoding is used to encode the coordinates of point clouds into high-dimensional distinguishable features. Self-attention generates refined attention features by computing attention weights. Besides, we embed the PTT module into the open-source state-of-the-art method P2B to construct PTT-Net. Experiments on the KITTI dataset reveal that our PTT-Net surpasses the state-of-the-art by a noticeable margin (~10%). Additionally, PTT-Net achieves real-time performance (~40 FPS) on an NVIDIA 1080Ti GPU. Our code is open-sourced for the robotics community at https://github.com/shanjiayao/PTT.
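The three blocks named above map naturally onto a small PyTorch module. The sketch below is an illustrative reconstruction under assumed layer sizes and block wiring (the authors' exact dimensions may differ); it requires PyTorch >= 1.9 for the batch_first attention flag.

```python
import torch.nn as nn

class PTTBlock(nn.Module):
    """Sketch of a PTT-style block: feature embedding, coordinate position
    encoding, and self-attention over the points (residual connection assumed)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Linear(dim, dim)             # feature embedding
        self.pos_enc = nn.Sequential(                # lift xyz into feature space
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, xyz):
        # feats: (B, N, dim) point features; xyz: (B, N, 3) point coordinates
        x = self.embed(feats) + self.pos_enc(xyz)
        out, _ = self.attn(x, x, x)                  # refined attention features
        return out + x
```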
Detection (16 papers)
【1】 Vehicle-counting with Automatic Region-of-Interest and Driving-Trajectory detection Link: https://arxiv.org/abs/2108.07135
Authors: Malolan Vasu, Nelson Abreu, Raysa Vásquez, Christian López Note: 5 pages with 3 figures and 1 table. Presented at the ICML 2021 LatinXAI Workshop Abstract: Vehicle counting systems can help with vehicle analysis and traffic incident detection. Unfortunately, most existing methods require some level of human input to identify the region of interest (ROI), the movements of interest, or to establish a reference point or line to count vehicles from traffic cameras. This work introduces a method to count vehicles from traffic videos that automatically identifies the ROI for the camera, as well as the driving trajectories of the vehicles. This makes the method feasible to use with pan-tilt-zoom cameras, which are frequently used in developing countries. Preliminary results indicate that the proposed method achieves an average intersection-over-union of 57.05% for the ROI and a mean absolute error of just 17.44% when counting vehicles in the traffic videos tested.
【2】 Towards Real-World Prohibited Item Detection: A Large-Scale X-ray Benchmark Link: https://arxiv.org/abs/2108.07020
Authors: Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, Yanjun Wu Affiliations: State Key Laboratory of Computer Science, ISCAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Hangzhou Institute for Advanced Study, UCAS, Hangzhou, China; JD Finance America Corporation, Mountain View, CA, USA Note: Accepted by ICCV 2021 Abstract: Automatic security inspection using computer vision technology is a challenging task in real-world scenarios due to various factors, including intra-class variance, class imbalance, and occlusion. Most previous methods rarely address the case in which prohibited items are deliberately hidden in messy objects, owing to the lack of large-scale datasets, restricting their application in real-world scenarios. Towards real-world prohibited item detection, we collect a large-scale dataset, named PIDray, which covers various cases in real-world scenarios for prohibited item detection, especially for deliberately hidden items. With an intensive amount of effort, our dataset contains 12 categories of prohibited items in 47,677 X-ray images with high-quality annotated segmentation masks and bounding boxes. To the best of our knowledge, it is the largest prohibited item detection dataset to date. Meanwhile, we design the selective dense attention network (SDANet) to construct a strong baseline, which consists of a dense attention module and a dependency refinement module. The dense attention module, formed by spatial and channel-wise dense attention, is designed to learn discriminative features to boost performance. The dependency refinement module is used to exploit the dependencies of multi-scale features. Extensive experiments conducted on the collected PIDray dataset demonstrate that the proposed method performs favorably against state-of-the-art methods, especially for detecting deliberately hidden items.
【3】 Pixel Difference Networks for Efficient Edge Detection Link: https://arxiv.org/abs/2108.07009
Authors: Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, Li Liu Affiliations: Center for Machine Vision and Signal Analysis, University of Oulu, Finland; National University of Defense Technology, China; Harbin Institute of Technology (Shenzhen), China; Xidian University, China Note: ICCV 2021 Abstract: Recently, deep Convolutional Neural Networks (CNNs) have achieved human-level performance in edge detection with their rich and abstract edge representation capacities. However, the high performance of CNN-based edge detection is achieved with a large pretrained CNN backbone, which is memory- and energy-consuming. In addition, it is surprising that the previous wisdom from traditional edge detectors, such as Canny, Sobel, and LBP, is rarely investigated in the rapidly developing deep learning era. To address these issues, we propose a simple, lightweight yet effective architecture named Pixel Difference Network (PiDiNet) for efficient edge detection. Extensive experiments on BSDS500, NYUD, and Multicue are provided to demonstrate its effectiveness and its high training and inference efficiency. Surprisingly, when trained from scratch with only the BSDS500 and VOC datasets, PiDiNet can surpass the recorded result of human perception (0.807 vs. 0.803 in ODS F-measure) on the BSDS500 dataset at 100 FPS with fewer than 1M parameters. A faster version of PiDiNet with fewer than 0.1M parameters can still achieve comparable performance among the state of the art at 200 FPS. Results on the NYUD and Multicue datasets show similar observations. The code is available at https://github.com/zhuoinoulu/pidinet.
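PiDiNet's central idea, a pixel difference convolution, has a neat closed form: applying kernel weights to differences against the center pixel equals a vanilla convolution minus a 1x1 convolution whose weights are the per-channel kernel sums. A minimal PyTorch sketch of this central variant (layer sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class CentralPDC(nn.Module):
    """Central pixel-difference convolution: y = sum_i w_i * (x_i - x_center),
    computed as conv(x, w) minus a 1x1 conv with the summed kernel weights."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)

    def forward(self, x):
        vanilla = self.conv(x)
        w_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (out_ch, in_ch, 1, 1)
        return vanilla - F.conv2d(x, w_sum)
```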
【4】 Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery Link: https://arxiv.org/abs/2108.07002
Authors: Zhuo Zheng, Ailong Ma, Liangpei Zhang, Yanfei Zhong Affiliations: Wuhan University, Wuhan, China Note: Accepted by ICCV 2021 Abstract: For high spatial resolution (HSR) remote sensing images, bitemporal supervised learning always dominates change detection, using many pairwise-labeled bitemporal images. However, it is very expensive and time-consuming to pairwise label large-scale bitemporal HSR remote sensing images. In this paper, we propose single-temporal supervised learning (STAR) for change detection from a new perspective: exploiting object changes in unpaired images as supervisory signals. STAR enables us to train a high-accuracy change detector using only unpaired labeled images and generalize to real-world bitemporal images. To evaluate the effectiveness of STAR, we design a simple yet effective change detector called ChangeStar, which can reuse any deep semantic segmentation architecture via the ChangeMixin module. Comprehensive experimental results show that ChangeStar outperforms the baseline by a large margin under single-temporal supervision and achieves superior performance under bitemporal supervision. Code is available at https://github.com/Z-Zheng/ChangeStar
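The supervisory signal STAR exploits can be illustrated compactly: pair two unpaired images and treat the disagreement of their object masks as a pseudo change label. A sketch of that labeling step (the pairing strategy and loss design in the paper are more involved):

```python
import numpy as np

def pseudo_change_label(mask_a, mask_b):
    """Pseudo bitemporal supervision from two *unpaired* single-temporal object
    masks: a pixel is 'changed' where exactly one mask contains an object."""
    return np.logical_xor(mask_a > 0, mask_b > 0).astype(np.uint8)
```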
【5】 3D High-Fidelity Mask Face Presentation Attack Detection Challenge Link: https://arxiv.org/abs/2108.06968
Authors: Ajian Liu, Chenxu Zhao, Zitong Yu, Anyang Su, Xing Liu, Zijian Kong, Jun Wan, Sergio Escalera, Hugo Jair Escalante, Zhen Lei, Guodong Guo Affiliations: MUST, Macau; Mininglamp Academy of Sciences, Mininglamp Technology, China; University of Oulu, Finland; NLPR, CASIA, China; SAI, UCAS, China; CAIR, HKISI, CAS; CVC, UB, Spain; INAOE, CINVESTAV, Mexico; ChaLearn, USA; Baidu Research, China Abstract: The threat of 3D masks to face recognition systems is increasingly serious and has drawn wide attention from researchers. To facilitate the study of relevant algorithms, a large-scale High-Fidelity Mask dataset, namely CASIA-SURF HiFiMask (briefly, HiFiMask), has been collected. Specifically, it consists of a total of 54,600 videos recorded from 75 subjects with 225 realistic masks under 7 new kinds of sensors. Based on this dataset and Protocol 3, which evaluates both the discrimination and generalization ability of an algorithm under open-set scenarios, we organized a 3D High-Fidelity Mask Face Presentation Attack Detection Challenge to boost research on 3D mask-based attack detection. It attracted 195 teams in the development phase, with a total of 18 teams qualifying for the final round. All results were verified and re-run by the organizing team and used for the final ranking. This paper presents an overview of the challenge, including an introduction to the dataset used, the definition of the protocol, the calculation of the evaluation criteria, and a summary and publication of the competition results. Finally, we focus on introducing and analyzing the top-ranking algorithms, summarize our conclusions, and outline the research directions for mask attack detection suggested by this competition.
【6】 TL-SDD: A Transfer Learning-Based Method for Surface Defect Detection with Few Samples Link: https://arxiv.org/abs/2108.06939
Authors: Jiahui Cheng, Bin Guo, Jiaqi Liu, Sicong Liu, Guangzhi Wu, Yueqi Sun, Zhiwen Yu Affiliations: Northwestern Polytechnical University, Xi'an, China Abstract: Surface defect detection plays an increasingly important role in the manufacturing industry to guarantee product quality. Many deep learning methods have been widely used in surface defect detection tasks and have proven to perform well in defect classification and localization. However, deep learning-based detection methods often require plenty of training data, which fails in real industrial scenarios since the distribution of defect categories is often imbalanced. In other words, common defect classes have many samples while rare defect classes have extremely few, and it is difficult for these methods to detect rare defect classes well. To solve the imbalanced-distribution problem, in this paper we propose TL-SDD: a novel Transfer Learning-based method for Surface Defect Detection. First, we adopt a two-phase training scheme to transfer knowledge from common defect classes to rare defect classes. Second, we propose a novel Metric-based Surface Defect Detection (M-SDD) model with three modules: (1) a feature extraction module, containing feature fusion that combines high-level semantic information with low-level structural information; (2) a feature reweighting module, transforming examples into a reweighting vector that indicates the importance of features; and (3) a distance metric module, learning a metric space in which defects are classified by computing distances to the representation of each category. Finally, we validate the performance of our proposed method on a real dataset of aluminum profile surface defects. Compared to the baseline methods, the performance of our proposed method improves by up to 11.98% for rare defect classes.
【7】 A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction Link: https://arxiv.org/abs/2108.06852
Authors: Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, Guiqing Li Affiliations: School of Computer Science and Engineering, South China University of Technology, China; JD Finance America Corporation, Mountain View, CA, USA; School of Computer Science and Engineering, Sun Yat-sen University, China Note: Accepted to ICCV 2021 (oral) Abstract: In this paper, we propose HF²-VAD, a hybrid framework that integrates flow reconstruction and frame prediction seamlessly to handle video anomaly detection. Firstly, we design the network of ML-MemAE-SC (Multi-Level Memory modules in an Autoencoder with Skip Connections) to memorize normal patterns for optical flow reconstruction, so that abnormal events can be sensitively identified by their larger flow reconstruction errors. More importantly, conditioned on the reconstructed flows, we then employ a Conditional Variational Autoencoder (CVAE), which captures the high correlation between video frame and optical flow, to predict the next frame given several previous frames. With the CVAE, the quality of flow reconstruction essentially influences that of frame prediction. Therefore, poorly reconstructed optical flows of abnormal events further deteriorate the quality of the final predicted future frame, making the anomalies more detectable. Experimental results demonstrate the effectiveness of the proposed method. Code is available at https://github.com/LiUzHiAn/hf2vad.
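The cascade described above implies a simple scoring rule at test time: both the flow reconstruction error and the flow-conditioned frame prediction error contribute to the anomaly score. A hedged sketch (the weighting and error normalization here are illustrative assumptions, not the paper's exact formula):

```python
import torch

def anomaly_score(flow, flow_recon, frame, frame_pred, w_flow=1.0, w_frame=1.0):
    """Hybrid anomaly score: abnormal events reconstruct flow poorly, which in
    turn degrades the predicted next frame, so both errors push the score up."""
    e_flow = torch.mean((flow - flow_recon) ** 2)    # flow reconstruction error
    e_frame = torch.mean((frame - frame_pred) ** 2)  # frame prediction error
    return w_flow * e_flow + w_frame * e_frame
```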
【8】 AdaCon: Adaptive Context-Aware Object Detection for Resource-Constrained Embedded Devices Link: https://arxiv.org/abs/2108.06850
Authors: Marina Neseem, Sherief Reda Affiliations: School of Engineering, Brown University, Providence, RI Note: 9 pages, 6 figures. 2021 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2021) Abstract: Convolutional Neural Networks achieve state-of-the-art accuracy in object detection tasks. However, they have large computational and energy requirements that challenge their deployment on resource-constrained edge devices. Object detection takes an image as input and identifies the existing object classes as well as their locations in the image. In this paper, we leverage prior knowledge about the probabilities that different object categories occur jointly to increase the efficiency of object detection models. In particular, our technique clusters the object categories based on their spatial co-occurrence probability. We use those clusters to design an adaptive network. During runtime, a branch controller decides which part(s) of the network to execute based on the spatial context of the input frame. Our experiments using the COCO dataset show that our adaptive object detection model achieves up to a 45% reduction in energy consumption and up to a 27% reduction in latency, with a small loss in the average precision (AP) of object detection.
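The clustering step that drives the branch design can be sketched directly: treat the pairwise spatial co-occurrence probabilities as an affinity matrix and cluster the categories over it. A minimal scikit-learn illustration; spectral clustering and the cluster count are assumptions here, not necessarily the paper's choices:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_categories(cooccurrence, n_clusters=4):
    """Group object categories by spatial co-occurrence.
    cooccurrence: (C, C) matrix of co-occurrence probabilities."""
    affinity = (cooccurrence + cooccurrence.T) / 2   # symmetrize
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(affinity)
```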
【9】 A Cascaded Zoom-In Network for Patterned Fabric Defect Detection Link: https://arxiv.org/abs/2108.06760
Authors: Zhiwei Zhang Note: Technical Report; 12 pages, 15 figures Abstract: Nowadays, Deep Convolutional Neural Networks (DCNNs) are widely used in fabric defect detection, at the cost of expensive training and complex model parameters. Observing that most fabrics are defect-free in practice, a two-step Cascaded Zoom-In Network (CZI-Net) is proposed for patterned fabric defect detection. In the CZI-Net, Aggregated HOG (A-HOG) and SIFT features are used instead of simple convolution filters for feature extraction. Moreover, in order to extract more distinctive features, a feature representation layer and a fully connected layer are included in the CZI-Net. In practice, most defect-free fabrics only involve the first step of our method and avoid the costly computation of the second step, which makes fabric detection very fast. More importantly, we propose the Locality-constrained Reconstruction Error (LCRE) in the first step, and the Restrictive Locality-constrained Coding (RLC) and Bag-of-Indexes (BoI) methods in the second step. We also analyse the connections between different coding methods and conclude that the index of visual words plays an essential role in coding methods. Finally, experiments based on real-world datasets demonstrate that our proposed method is not only computationally simple but also achieves high detection accuracy.
【10】 SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation Link: https://arxiv.org/abs/2108.06709
Authors: Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles R. Qi, Dragomir Anguelov Affiliations: University of Southern California; Waymo, LLC Abstract: In autonomous driving, a LiDAR-based object detector should perform reliably at different geographic locations and under various weather conditions. While recent 3D detection research focuses on improving performance within a single domain, our study reveals that the performance of modern detectors can drop drastically cross-domain. In this paper, we investigate unsupervised domain adaptation (UDA) for LiDAR-based 3D object detection. On the Waymo Domain Adaptation dataset, we identify deteriorating point cloud quality as the root cause of the performance drop. To address this issue, we present Semantic Point Generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shifts. Specifically, SPG generates semantic points in the predicted foreground regions and faithfully recovers missing parts of the foreground objects, which are caused by phenomena such as occlusion, low reflectance, or weather interference. By merging the semantic points with the original points, we obtain an augmented point cloud, which can be directly consumed by modern LiDAR-based detectors. To validate the wide applicability of SPG, we experiment with two representative detectors, PointPillars and PV-RCNN. On the UDA task, SPG significantly improves both detectors across all object categories of interest and at all difficulty levels. SPG can also benefit object detection in the original domain. On the Waymo Open Dataset and KITTI, SPG improves the 3D detection results of these two methods across all categories. Combined with PV-RCNN, SPG achieves state-of-the-art 3D detection results on KITTI.
【11】 Exploring Temporal Coherence for More General Video Face Forgery Detection Link: https://arxiv.org/abs/2108.06693
Authors: Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, Fang Wen Affiliations: School of Informatics, Xiamen University; Microsoft Research Asia Note: Accepted by ICCV 2021 Abstract: Although current face manipulation techniques achieve impressive performance regarding quality and controllability, they still struggle to generate temporally coherent face videos. In this work, we explore taking full advantage of temporal coherence for video face forgery detection. To achieve this, we propose a novel end-to-end framework consisting of two major stages. The first stage is a fully temporal convolution network (FTCN). The key insight of FTCN is to reduce the spatial convolution kernel size to 1 while keeping the temporal convolution kernel size unchanged. We surprisingly find this special design can benefit the model in extracting temporal features as well as improve its generalization capability. The second stage is a Temporal Transformer network, which aims to explore long-term temporal coherence. The proposed framework is general and flexible, and can be trained directly from scratch without any pre-trained models or external datasets. Extensive experiments show that our framework outperforms existing methods and remains effective when applied to detect new sorts of face forgery videos.
【12】 Vector-Decomposed Disentanglement for Domain-Invariant Object Detection Link: https://arxiv.org/abs/2108.06685
Authors: Aming Wu, Rui Liu, Yahong Han, Linchao Zhu, Yi Yang Affiliations: College of Intelligence and Computing, Tianjin University, Tianjin, China; Tianjin Key Lab of Machine Learning, Tianjin University, Tianjin, China; Peng Cheng Laboratory, Shenzhen, China; ReLER Lab, AAII, University of Technology Sydney Note: Accepted by ICCV 2021 Abstract: To improve the generalization of detectors for domain adaptive object detection (DAOD), recent advances mainly explore aligning feature-level distributions between the source and a single target domain, which may neglect the impact of domain-specific information in the aligned features. For DAOD, it is important to extract domain-invariant object representations. To this end, in this paper, we try to disentangle domain-invariant representations from domain-specific representations, and we propose a novel disentanglement method based on vector decomposition. Firstly, an extractor is devised to separate domain-invariant representations from the input, which are used for extracting object proposals. Secondly, domain-specific representations are introduced as the difference between the input and the domain-invariant representations. Through the difference operation, the gap between the domain-specific and domain-invariant representations is enlarged, which promotes domain-invariant representations containing more domain-irrelevant information. In the experiments, we evaluate our method separately on the single- and compound-target cases. For the single-target case, experimental results on four domain-shift scenes show our method obtains a significant performance gain over baseline methods. Moreover, for the compound-target case (i.e., the target is a compound of two different domains without domain labels), our method outperforms baseline methods by around 4%, which demonstrates its effectiveness.
【13】 ST3D++: Denoised Self-training for Unsupervised Domain Adaptation on 3D Object Detection Link: https://arxiv.org/abs/2108.06682
Authors: Jihan Yang, Shaoshuai Shi, Zhe Wang, Hongsheng Li, Xiaojuan Qi Affiliations: Jihan Yang and Xiaojuan Qi are with the Department of Electrical and Electronic Engineering, The University of Hong Kong; Shaoshuai Shi and Hongsheng Li are with the Department of Electronic Engineering, The Chinese University of Hong Kong Abstract: In this paper, we present a self-training method, named ST3D++, with a holistic pseudo-label denoising pipeline for unsupervised domain adaptation on 3D object detection. ST3D++ aims at reducing noise in pseudo-label generation as well as alleviating the negative impact of noisy pseudo labels on model training. First, ST3D++ pre-trains the 3D object detector on the labeled source domain with random object scaling (ROS), which is designed to reduce target-domain pseudo-label noise arising from the object scale bias of the source domain. Then, the detector is progressively improved by alternating between generating pseudo labels and training the object detector with pseudo-labeled target-domain data. Here, we equip the pseudo-label generation process with a hybrid quality-aware triplet memory to improve the quality and stability of the generated pseudo labels. Meanwhile, in the model training stage, we propose a source-data-assisted training strategy and a curriculum data augmentation policy to effectively rectify noisy gradient directions and avoid model over-fitting to noisy pseudo-labeled data. These specific designs enable the detector to be trained on meticulously refined pseudo-labeled target data with denoised training signals, and thus effectively facilitate adapting an object detector to a target domain without requiring annotations. Finally, our method is assessed on four 3D benchmark datasets (i.e., Waymo, KITTI, Lyft, and nuScenes) for three common categories (i.e., car, pedestrian, and bicycle). ST3D++ achieves state-of-the-art performance on all evaluated settings, outperforming the corresponding baseline by a large margin (e.g., 9.6% to 38.16% on Waymo → KITTI in terms of AP_3D), and even surpasses the fully supervised oracle results on the KITTI 3D object detection benchmark with target prior. Code will be available.
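Random object scaling (ROS) is the easiest of these components to picture: each labeled object and the points inside it are rescaled about the box center, so the detector stops memorizing source-domain object sizes. A simplified NumPy sketch, assuming axis-aligned boxes for brevity (the actual method works in each box's rotated frame):

```python
import numpy as np

def random_object_scaling(points, boxes, scale_range=(0.9, 1.1)):
    """points: (N, 3); boxes: (M, 6) as [cx, cy, cz, dx, dy, dz].
    Rescales each object's points and box dimensions by a random factor."""
    for box in boxes:
        center, dims = box[:3], box[3:6]
        inside = np.all(np.abs(points - center) <= dims / 2, axis=1)
        s = np.random.uniform(*scale_range)
        points[inside] = center + (points[inside] - center) * s
        box[3:6] = dims * s
    return points, boxes
```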
【14】 Semi-supervised 3D Object Detection via Adaptive Pseudo-Labeling Link: https://arxiv.org/abs/2108.06649
Authors: Hongyi Xu, Fengqi Liu, Qianyu Zhou, Jinkun Hao, Zhijie Cao, Zhengyang Feng, Lizhuang Ma Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Computer Science and Technology, East China Normal University, China Note: Accepted at the International Conference on Image Processing (ICIP 2021) Abstract: 3D object detection is an important task in computer vision. Most existing methods require a large number of high-quality 3D annotations, which are expensive to collect. Especially for outdoor scenes, the problem becomes more severe due to the sparseness of the point cloud and the complexity of urban scenes. Semi-supervised learning is a promising technique to mitigate the data annotation issue. Inspired by this, we propose a novel semi-supervised framework based on pseudo-labeling for outdoor 3D object detection tasks. We design the Adaptive Class Confidence Selection (ACCS) module to generate high-quality pseudo-labels. Besides, we propose Holistic Point Cloud Augmentation (HPCA) for unlabeled data to improve robustness. Experiments on the KITTI benchmark demonstrate the effectiveness of our method.
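The idea behind adaptive class-confidence selection can be sketched in a few lines: instead of one global confidence cutoff, each class keeps predictions above its own score statistic, so rare classes are not starved of pseudo-labels. A hypothetical NumPy illustration (the percentile rule below is an assumption; the ACCS module's actual criterion may differ):

```python
import numpy as np

def select_pseudo_labels(boxes, scores, labels, percentile=80.0):
    """Keep detections whose score clears a per-class threshold."""
    keep = np.zeros(len(scores), dtype=bool)
    for c in np.unique(labels):
        cls = labels == c
        thr = np.percentile(scores[cls], percentile)  # class-adaptive cutoff
        keep |= cls & (scores >= thr)
    return boxes[keep], scores[keep], labels[keep]
```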
【15】 MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding Link: https://arxiv.org/abs/2108.06543
Authors: Zhanghui Kuang, Hongbin Sun, Zhizhong Li, Xiaoyu Yue, Tsui Hin Lin, Jianyong Chen, Huaqiang Wei, Yiqin Zhu, Tong Gao, Wenwei Zhang, Kai Chen, Wayne Zhang, Dahua Lin Affiliations: SenseTime Research; Center for Perceptual and Interactive Intelligence; South China University of Technology; Nanyang Technological University, Singapore; Shanghai AI Laboratory; The Chinese University of Hong Kong Note: Accepted to ACM MM (Open Source Competition Track) Abstract: We present MMOCR, an open-source toolbox that provides a comprehensive pipeline for text detection and recognition, as well as their downstream tasks such as named entity recognition and key information extraction. MMOCR implements 14 state-of-the-art algorithms, which is significantly more than all the existing open-source OCR projects we are aware of to date. To facilitate future research and industrial applications of text recognition-related problems, we also provide a large number of trained models and detailed benchmarks to give insights into the performance of text detection, recognition and understanding. MMOCR is publicly released at https://github.com/open-mmlab/mmocr.
【16】 Is Pseudo-Lidar needed for Monocular 3D Object detection? Link: https://arxiv.org/abs/2108.06417
Authors: Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, Adrien Gaidon Affiliations: Toyota Research Institute Note: In Proceedings of ICCV 2021 Abstract: Recent progress in 3D object detection from single images leverages monocular depth estimation as a way to produce 3D point clouds, turning cameras into pseudo-lidar sensors. These two-stage detectors improve with the accuracy of the intermediate depth estimation network, which can itself be improved without manual labels via large-scale self-supervised learning. However, they tend to suffer from overfitting more than end-to-end methods, are more complex, and the gap with similar lidar-based detectors remains significant. In this work, we propose an end-to-end, single-stage, monocular 3D object detector, DD3D, that can benefit from depth pre-training like pseudo-lidar methods, but without their limitations. Our architecture is designed for effective information transfer between depth estimation and 3D detection, allowing us to scale with the amount of unlabeled pre-training data. Our method achieves state-of-the-art results on two challenging benchmarks, with 16.34% and 9.28% AP for cars and pedestrians, respectively, on the KITTI-3D benchmark, and 41.5% mAP on NuScenes.
Classification | Recognition (15 papers)
【1】 Masked Face Recognition Challenge: The WebFace260M Track Report Link: https://arxiv.org/abs/2108.07189
Authors: Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Jia Guo, Jiwen Lu, Dalong Du, Jie Zhou Affiliations: Tsinghua University; XForwardAI; Imperial College London; InsightFace Note: The WebFace260M Track of the ICCV-21 MFR Challenge is still open at this https URL Abstract: According to WHO statistics, there are more than 204,617,027 confirmed COVID-19 cases, including 4,323,247 deaths worldwide, as of August 12, 2021. During the coronavirus epidemic, almost everyone wears a facial mask. Traditionally, face recognition approaches process mostly non-occluded faces, which include primary facial features such as the eyes, nose, and mouth. Removing the mask for authentication in airports or laboratories increases the risk of virus infection, posing a huge challenge to current face recognition systems. Due to the sudden outbreak of the epidemic, there is as yet no publicly available real-world masked face recognition (MFR) benchmark. To cope with the above-mentioned issue, we organized the Face Bio-metrics under COVID Workshop and Masked Face Recognition Challenge at ICCV 2021. Enabled by the ultra-large-scale WebFace260M benchmark and the Face Recognition Under Inference Time conStraint (FRUITS) protocol, this challenge (WebFace260M Track) aims to push the frontiers of practical MFR. Since public evaluation sets are mostly saturated or contain noise, a new test set is gathered, consisting of 2,478 carefully curated celebrities and 60,926 faces. Meanwhile, we collect the world's largest real-world masked test set. In the first phase of the WebFace260M Track, 69 teams (833 solutions in total) participated in the challenge, and 49 teams exceeded the performance of our baseline. The second phase of the challenge runs until October 1, 2021, with an ongoing leaderboard. We will actively update this report in the future.
【2】 Data Augmentation and CNN Classification For Automatic COVID-19 Diagnosis From CT-Scan Images On Small Dataset Link: https://arxiv.org/abs/2108.07148
Authors: Weijun Tan, Hongwei Guo Abstract: We present an automatic COVID-19 diagnosis framework from lung CT images. The focus is on signal processing and classification on small datasets, with effort put into exploring data preparation and augmentation to improve the generalization capability of the 2D CNN classification models. We propose a unique and effective data augmentation method using multiple Hounsfield Unit (HU) normalization windows. In addition, the original slice image is cropped to exclude the background, and a filter is applied to filter out closed-lung images. For the classification network, we choose to use 2D DenseNet and Xception with a feature pyramid network (FPN). To further improve the classification accuracy, an ensemble of multiple CNN models and HU windows is used. On the training/validation dataset, we achieve a patient classification accuracy of 93.39%.
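The multi-window HU normalization is concrete enough to sketch: the same CT slice is clipped and rescaled under several (center, width) windows, giving the classifier complementary views of lung and soft tissue. A NumPy illustration; the window settings below are common radiology presets used as assumptions, not necessarily the paper's values:

```python
import numpy as np

def hu_window(ct_slice, center, width):
    """Clip a CT slice (in Hounsfield Units) to one window and rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return np.clip((ct_slice - lo) / (hi - lo), 0.0, 1.0)

# Render one slice under several windows as an augmentation / ensemble input.
WINDOWS = [(-600, 1500), (40, 400), (-160, 1600)]  # (center, width) presets

def multi_window_views(ct_slice):
    return [hu_window(ct_slice, c, w) for c, w in WINDOWS]
```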
【3】 Learning Canonical View Representation for 3D Shape Recognition with Arbitrary Views Link: https://arxiv.org/abs/2108.07084
Authors: Xin Wei, Yifei Gong, Fudong Wang, Xing Sun Affiliations: Xi'an Jiaotong University; Tencent Youtu Lab Abstract: In this paper, we focus on recognizing 3D shapes from arbitrary views, i.e., arbitrary numbers and positions of viewpoints. It is a challenging and realistic setting for view-based 3D shape recognition. We propose a canonical view representation to tackle this challenge. We first transform the original features of arbitrary views into a fixed number of view features, dubbed the canonical view representation, by aligning the arbitrary view features to a set of learnable reference view features using optimal transport. In this way, each 3D shape with arbitrary views is represented by a fixed number of canonical view features, which are further aggregated to generate a rich and robust 3D shape representation for shape recognition. We also propose a canonical view feature separation constraint to enforce that the view features in the canonical view representation can be embedded into scattered points in a Euclidean space. Experiments on the ModelNet40, ScanObjectNN, and RGBD datasets show that our method achieves competitive results under the fixed viewpoint settings, and significantly outperforms the applicable methods under the arbitrary view setting.
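The optimal-transport alignment at the heart of this method is commonly computed with Sinkhorn iterations. Below is a hedged PyTorch sketch of that step: a transport plan between n view features and m learnable reference features, followed by a plan-weighted aggregation into m canonical features (the cost function, entropic regularization, and aggregation rule are illustrative assumptions):

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic-OT transport plan between uniform marginals for an (n, m) cost."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    r, c = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    v = torch.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # (n, m) plan

def canonical_views(view_feats, ref_feats):
    """Align arbitrary view features to reference slots, then aggregate."""
    plan = sinkhorn(torch.cdist(view_feats, ref_feats))
    weights = plan / plan.sum(dim=0, keepdim=True)  # column-normalize per slot
    return weights.t() @ view_feats                  # (m, d) canonical features
```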
【4】 Towards Efficient and Data Agnostic Image Classification Training Pipeline for Embedded Systems Link: https://arxiv.org/abs/2108.07049
Authors: Kirill Prokofiev, Vladislav Sovrasov Note: Submitted to ICIAP 2022 Abstract: Nowadays, deep learning-based methods have achieved remarkable progress on the image classification task across a wide range of commonly used datasets (ImageNet, CIFAR, SVHN, Caltech 101, SUN397, etc.). SOTA performance on each of the mentioned datasets is obtained by carefully tuning the model architecture and training tricks according to the properties of the target data. Although this approach allows setting academic records, it is unrealistic that an average data scientist would have enough resources to build a sophisticated training pipeline for every image classification task they meet in practice. This work focuses on reviewing the latest augmentation and regularization methods for image classification and exploring ways to automatically choose some of the most important hyperparameters: the total number of epochs, the initial learning rate value, and its schedule. Having a training procedure equipped with a lightweight modern CNN architecture (like MobileNetV3 or EfficientNet), a sufficient level of regularization, and a learning rate schedule adaptive to the data, we can achieve reasonable performance on a variety of downstream image classification tasks without manually tuning the parameters for each particular task. The resulting models are computationally efficient and can be deployed to CPU using the OpenVINO toolkit. The source code is available as part of the OpenVINO Training Extensions (https://github.com/openvinotoolkit/training_extensions).
【5】 Data Augmentation for Scene Text Recognition Link: https://arxiv.org/abs/2108.06949
Authors: Rowel Atienza Affiliations: Electrical and Electronics Engineering Institute, University of the Philippines Note: Interactive Labeling and Data Augmentation for Vision, ICCV 2021 Workshop Abstract: Scene text recognition (STR) is a challenging task in computer vision due to the large number of possible text appearances in natural scenes. Most STR models rely on synthetic datasets for training since there are no sufficiently big and publicly available labelled real datasets. Since STR models are evaluated using real data, the mismatch between training and testing data distributions results in poor performance of models, especially on challenging text affected by noise, artifacts, geometry, structure, etc. In this paper, we introduce STRAug, which is made of 36 image augmentation functions designed for STR. Each function mimics certain text image properties that can be found in natural scenes, caused by camera sensors, or induced by signal processing operations, but are poorly represented in the training dataset. When applied to strong baseline models using RandAugment, STRAug significantly increases the overall absolute accuracy of STR models across regular and irregular test datasets, by as much as 2.10% on Rosetta, 1.48% on R2AM, 1.30% on CRNN, 1.35% on RARE, 1.06% on TRBA and 0.89% on GCRNN. The diversity and simplicity of the API provided by the STRAug functions enable easy replication and validation of existing data augmentation methods for STR. STRAug is available at https://github.com/roatienza/straug.
【6】 Video Person Re-identification using Attribute-enhanced Features Link: https://arxiv.org/abs/2108.06946
Authors: Tianrui Chai, Zhiyuan Chen, Annan Li, Jiaxin Chen, Xinyu Mei, Yunhong Wang Abstract: Video-based person re-identification (Re-ID), which aims to associate people across non-overlapping cameras using surveillance video, is a challenging task. Pedestrian attributes, such as gender, age, and clothing characteristics, contain rich and supplementary information but are less explored in video person Re-ID. In this work, we propose a novel network architecture named Attribute Salience Assisted Network (ASA-Net) for attribute-assisted video person Re-ID, which improves considerably over existing works in two ways. First, to learn a better separation of the target from the background, we propose to learn visual attention from middle-level attributes instead of high-level identities. The proposed Attribute Salient Region Enhance (ASRE) module can attend more accurately to the body of the pedestrian. Second, we found that many identity-irrelevant but object- or subject-relevant factors, such as the view angle and movement of the target pedestrian, can greatly influence the two-dimensional appearance of a pedestrian. This problem can be mitigated by investigating both identity-relevant and identity-irrelevant attributes via a novel triplet loss, referred to as the Pose & Motion-Invariant (PMI) triplet loss.
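For readers unfamiliar with the base objective, the PMI loss builds on the standard margin-based triplet loss shown below; the attribute-aware mining of positives and negatives that makes it pose- and motion-invariant is the paper's contribution and is not reproduced in this sketch.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull the anchor toward the positive embedding and push it from the
    negative until the gap exceeds the margin. Embeddings: (B, d) tensors."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()
```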
【7】 Unsupervised Person Re-identification with Stochastic Training Strategy Link: https://arxiv.org/abs/2108.06938
Authors: Tianyang Liu, Yutian Lin, Bo Du Affiliations: School of Computer Science, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University Abstract: Unsupervised person re-identification (re-ID) has attracted increasing research interest because of its scalability and potential for real-world applications. State-of-the-art unsupervised re-ID methods usually follow a clustering-based strategy, which generates pseudo labels by clustering and maintains a memory to store instance features and represent the cluster centroids for contrastive learning. This approach suffers from two problems. First, the centroid generated by unsupervised learning may not be a perfect prototype. Forcing images to get closer to the centroid emphasizes the result of clustering, which could accumulate clustering errors during iterations. Second, previous methods utilize features obtained at different training iterations to represent one centroid, which is not consistent with the current training sample, since the features are not directly comparable. To this end, we propose an unsupervised re-ID approach with a stochastic learning strategy. Specifically, we adopt a stochastically updated memory, where a random instance from a cluster is used to update the cluster-level memory for contrastive learning. In this way, the relationships between randomly selected pairs of images are learned, avoiding the training bias caused by unreliable pseudo labels. The stochastic memory is also always up-to-date for classification, which keeps the consistency. Besides, to relieve the issue of camera variance, a unified distance matrix is proposed during clustering, where the distance bias from different camera domains is reduced and the variance of identities is emphasized.
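The stochastic memory update is simple enough to sketch directly: after each batch, every cluster slot that appears in the batch is overwritten with the feature of one randomly drawn instance, so the slot always holds a current, directly comparable feature. A hedged PyTorch illustration (the L2 normalization is an assumption; any momentum term is omitted):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_cluster_memory(memory, feats, pseudo_labels):
    """memory: (n_clusters, d); feats: (B, d); pseudo_labels: (B,) cluster ids."""
    for c in pseudo_labels.unique():
        idx = (pseudo_labels == c).nonzero(as_tuple=True)[0]
        pick = idx[torch.randint(len(idx), (1,))]       # one random instance
        memory[c] = F.normalize(feats[pick].squeeze(0), dim=0)
```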
【8】 Online Continual Learning For Visual Food Classification 标题:在线持续学习在食品视觉分类中的应用 链接:https://arxiv.org/abs/2108.06781
作者:Jiangpeng He,Fengqing Zhu 机构:School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana USA 备注:Accepted paper for LargeFineFoodAI, ICCV 2021 摘要:食品图像分类对于现实世界的应用是一个挑战,因为现有的方法需要静态数据集进行训练,并且不能从连续可用的新食品图像中学习。在线持续学习旨在从数据流中学习新类别,每条新数据只使用一次,而不忘记以前学习的知识。然而,现有的研究都没有针对食品图像分析的,食品图像分析由于其高度的类内变异性以及未来食品类分布的不平衡性和不可预测性而更难以增量学习。在本文中,我们通过引入(1)一种新的基于聚类的样本选择算法来解决这些问题,该算法存储属于每个已学食品类别的最具代表性的数据,用于知识回放,(2)一个有效的在线学习机制,使用平衡的训练批以及增广样本上的知识蒸馏,以保持模型在所有已学类别上的性能。我们的方法是在一个具有挑战性的大型食品图像数据库Food-1K上通过改变新添加的食品类别的数量进行评估的。我们的结果表明,与现有最先进的在线持续学习方法相比,该方法有了显著的改进,显示出在现实世界中实现食品图像分类终身学习的巨大潜力。 摘要:Food image classification is challenging for real-world applications since existing methods require static datasets for training and are not capable of learning from sequentially available new food images. Online continual learning aims to learn new classes from a data stream by using each new sample only once without forgetting the previously learned knowledge. However, none of the existing works target food image analysis, which is more difficult to learn incrementally due to its high intra-class variation with the unbalanced and unpredictable characteristics of future food class distribution. In this paper, we address these issues by introducing (1) a novel clustering based exemplar selection algorithm to store the most representative data belonging to each learned food for knowledge replay, and (2) an effective online learning regime using balanced training batch along with the knowledge distillation on augmented exemplars to maintain the model performance on all learned classes. Our method is evaluated on a challenging large scale food image database, Food-1K, by varying the number of newly added food classes. Our results show significant improvements compared with existing state-of-the-art online continual learning methods, showing great potential to achieve lifelong learning for food image classification in real world.
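下面用scikit-learn给出"基于聚类的样本选择"的一种可能实现思路:对某一已学类别的特征做k-means,取距各簇心最近的样本作为代表性样本存入回放集;簇数、样本数等参数均为假设,并非论文原始算法。

```python
import numpy as np
from sklearn.cluster import KMeans

def select_exemplars(features, n_exemplars=20, n_clusters=5, seed=0):
    """对一个类别的特征聚类,按"距簇心最近"原则挑选代表性样本的索引(示意)。
    features: (N, D) 该类别全部样本的特征。"""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    per_cluster = n_exemplars // n_clusters
    exemplars = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]                          # 该簇的样本
        d = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
        exemplars.extend(idx[np.argsort(d)[:per_cluster]].tolist()) # 距簇心最近者
    return exemplars
```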
【9】 Learning Open-World Object Proposals without Learning to Classify 标题:在不学习分类的情况下学习开放世界对象建议 链接:https://arxiv.org/abs/2108.06753
作者:Dahun Kim,Tsung-Yi Lin,Anelia Angelova,In So Kweon,Weicheng Kuo 机构:KAIST, Google Brain 摘要:目标建议已成为许多视觉管道(包括目标检测、弱监督检测、目标发现、跟踪等)不可或缺的预处理步骤。与免学习方法相比,基于学习的建议近年来由于对目标检测的兴趣越来越大而变得流行。常见的范例是从标有一组对象区域及其相应类别的数据中学习对象建议。然而,这种方法往往难以应对训练集中未出现的开放世界新对象。在本文中,我们发现问题在于现有建议方法中的二元分类器倾向于过拟合训练类别。因此,我们提出了一种无分类的目标定位网络(OLN),该网络仅通过区域的位置和形状与任意真值对象的重叠程度(例如,中心度和IoU)来估计每个区域的目标性。这一简单策略学习到可泛化的目标性,在COCO的跨类别泛化以及RoboNet、Object365和EpicKitchens的跨数据集评估上均优于现有建议方法。最后,我们展示了OLN在大型词汇表数据集LVIS上用于长尾目标检测的优点,我们注意到在罕见和常见类别上有明显的改进。 摘要:Object proposals have become an integral preprocessing step of many vision pipelines including object detection, weakly supervised detection, object discovery, tracking, etc. Compared to the learning-free methods, learning-based proposals have become popular recently due to the growing interest in object detection. The common paradigm is to learn object proposals from data labeled with a set of object regions and their corresponding categories. However, this approach often struggles with novel objects in the open world that are absent in the training set. In this paper, we identify that the problem is that the binary classifiers in existing proposal methods tend to overfit to the training categories. Therefore, we propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlap with any ground-truth object (e.g., centerness and IoU). This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization on COCO, as well as cross-dataset evaluation on RoboNet, Object365, and EpicKitchens. Finally, we demonstrate the merit of OLN for long-tail object detection on the large vocabulary dataset LVIS, where we notice clear improvement in rare and common categories.
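OLN以区域与任意真值框在位置和形状上的重合程度(中心度与IoU)作为目标性监督。下面给出FCOS风格中心度与IoU的NumPy计算示意,帮助理解这类监督信号;具体的回归与组合方式以论文为准。

```python
import numpy as np

def centerness(x, y, box):
    """FCOS风格的中心度:位置(x, y)相对于真值框box=(x1, y1, x2, y2)。"""
    l, t = x - box[0], y - box[1]
    r, b = box[2] - x, box[3] - y
    if min(l, t, r, b) <= 0:          # 点落在框外,中心度记为0
        return 0.0
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def iou(a, b):
    """两个框的交并比,框格式为(x1, y1, x2, y2)。"""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```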
【10】 HCR-Net: A deep learning based script independent handwritten character recognition network 标题:HCR-Net:一种基于深度学习的脚本无关手写字符识别网络 链接:https://arxiv.org/abs/2108.06663
作者:Vinod Kumar Chauhan,Sukhdeep Singh,Anuj Sharma 备注:21 pages, 5 figures, 16 tables (under review) 摘要:手写字符识别(HCR)是模式识别中一个具有挑战性的学习问题,主要原因是字符结构相似、书写风格不同、数据集噪声大以及语言和脚本种类繁多。HCR问题已经被广泛研究了几十年,但是对于脚本无关模型的研究非常有限。这是由于以下因素:脚本种类繁多;大多数传统研究侧重于特定语言/脚本的手工特征提取技术,而这些技术并不总是可用;以及缺乏可复现结果的公共数据集和代码。另一方面,深度学习在模式识别的不同领域(包括HCR)取得了巨大成功,并提供了端到端的学习,即自动特征提取和识别。在本文中,我们提出了一种新的深度学习体系结构,称为HCR-Net,该体系结构利用迁移学习和图像增强进行端到端学习,用于脚本无关的手写字符识别。该网络基于一种新的HCR迁移学习方法,其中使用了预训练VGG16网络的部分较低层。得益于迁移学习和图像增强,HCR-Net可实现更快的训练、更好的性能和更好的泛化。在孟加拉语、旁遮普语、印地语、英语、瑞典语、乌尔都语、波斯语、藏语、卡纳达语、马拉雅拉姆语、泰卢固语、马拉地语、尼泊尔语和阿拉伯语的公开数据集上的实验结果证明了HCR-Net的有效性,并建立了若干新的基准。为了结果的可复现性和HCR研究的进展,完整代码已在GitHub公开发布:https://github.com/jmdvinodjmd/HCR-Net。 摘要:Handwritten character recognition (HCR) is a challenging learning problem in pattern recognition, mainly due to similarity in structure of characters, different handwriting styles, noisy datasets and a large variety of languages and scripts. The HCR problem has been studied extensively for a few decades but there is very limited research on script independent models. This is because of factors like the diversity of scripts, the focus of most conventional research efforts on handcrafted feature extraction techniques which are language/script specific and are not always available, and the unavailability of public datasets and codes to reproduce the results. On the other hand, deep learning has witnessed huge success in different areas of pattern recognition, including HCR, and provides end-to-end learning, i.e., automated feature extraction and recognition. In this paper, we have proposed a novel deep learning architecture which exploits transfer learning and image-augmentation for end-to-end learning for script independent handwritten character recognition, called HCR-Net. The network is based on a novel transfer learning approach for HCR, where some of the lower layers of a pre-trained VGG16 network are utilised. Due to transfer learning and image-augmentation, HCR-Net provides faster training, better performance and better generalisations. The experimental results on publicly available datasets of Bangla, Punjabi, Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam, Telugu, Marathi, Nepali and Arabic languages prove the efficacy of HCR-Net and establish several new benchmarks. For reproducibility of the results and for the advancements of HCR research, the complete code is publicly released on GitHub at https://github.com/jmdvinodjmd/HCR-Net.
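下面用torchvision示意"复用预训练VGG16的较低层并接新分类头"的通用做法;复用的层数、是否冻结以及分类头结构均为假设,并非HCR-Net的确切配置。

```python
import torch.nn as nn
from torchvision.models import vgg16

def build_hcr_like(num_classes, n_lower_layers=10, freeze=True):
    """复用预训练VGG16的前n_lower_layers层作为特征提取器,再接新分类头(示意)。"""
    backbone = vgg16(weights="IMAGENET1K_V1").features[:n_lower_layers]
    if freeze:                       # 是否冻结底层权重为假设,实际也可微调
        for p in backbone.parameters():
            p.requires_grad = False
    last_conv = [m for m in backbone if isinstance(m, nn.Conv2d)][-1]
    head = nn.Sequential(            # 全局池化后接线性分类层(假设的头部结构)
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(last_conv.out_channels, num_classes))
    return nn.Sequential(backbone, head)
```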
【11】 Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning 标题:基于双向注意和对比元学习的Few-Shot细粒度动作识别 链接:https://arxiv.org/abs/2108.06647
作者:Jiahao Wang,Yunhong Wang,Sheng Liu,Annan Li 机构:State Key Laboratory of Virtual Reality Technology and System, Beihang University, Beijing, China 备注:Accepted in ACM Multimedia 2021 摘要:由于实际应用中对特定动作理解的需求不断增加,细粒度动作识别正吸引着越来越多的关注,而稀有细粒度类别的数据非常有限。因此,我们提出了Few-Shot细粒度动作识别问题,旨在识别新的细粒度动作,每个类只需给出很少的样本。虽然在粗粒度动作方面已经取得了进展,但现有的少数镜头识别方法在处理细粒度动作时遇到了两个问题:无法捕获细微动作细节和从类间方差较低的数据中学习不足。为了解决第一个问题,提出了一种基于人类视觉的双向注意模块(BAM)。BAM将自上而下的任务驱动信号与自下而上的显著刺激相结合,通过准确突出信息性时空区域来捕捉细微的动作细节。为了解决第二个问题,我们引入了对比元学习(CML)。与广泛采用的基于ProtoNet的方法相比,CML在每一次训练中都充分利用了潜在的对比对,因此能够为低类间方差数据生成更具辨别力的视频表示。此外,为了比较不同的模型,我们在两个大规模细粒度动作识别数据集上建立了特定的基准协议。大量的实验表明,我们的方法在评估的任务中始终达到最先进的性能。 摘要:Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications, whereas the data of rare fine-grained categories is very limited. Therefore, we propose the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class. Although progress has been made in coarse-grained actions, existing few-shot recognition methods encounter two issues handling fine-grained actions: the inability to capture subtle action details and the inadequacy in learning from data with low inter-class variance. To tackle the first issue, a human vision inspired bidirectional attention module (BAM) is proposed. Combining top-down task-driven signals with bottom-up salient stimuli, BAM captures subtle action details by accurately highlighting informative spatio-temporal regions. To address the second issue, we introduce contrastive meta-learning (CML). Compared with the widely adopted ProtoNet-based method, CML generates more discriminative video representations for low inter-class variance data, since it makes full use of potential contrastive pairs in each training episode. Furthermore, to fairly compare different models, we establish specific benchmark protocols on two large-scale fine-grained action recognition datasets. Extensive experiments show that our method consistently achieves state-of-the-art performance across evaluated tasks.
【12】 Unravelling the Effect of Image Distortions for Biased Prediction of Pre-trained Face Recognition Models 标题:揭示图像失真对预训练人脸识别模型有偏预测的影响 链接:https://arxiv.org/abs/2108.06581
作者:Puspita Majumdar,Surbhi Mittal,Richa Singh,Mayank Vatsa 机构:IIIT-Delhi, India, IIT Jodhpur, India 备注:Accepted in ICCV Workshops 摘要:由于对社会的影响,识别和缓解深度学习算法中的偏差在过去几年受到了广泛关注。研究人员认为,在具有良好代表性的平衡数据集上训练的模型可在各亚组之间提供平等且无偏的性能。然而,当输入数据发生某些失真时,看似无偏的预训练模型是否会变得有偏?我们首次尝试在人脸识别的背景下回答这个问题。我们提供了一个系统的分析,以评估四种最先进的深度人脸识别模型在不同性别和种族亚组的图像失真情况下的性能。我们观察到,图像失真与模型在不同亚组之间的性能差距存在关联。 摘要:Identifying and mitigating bias in deep learning algorithms has gained significant popularity in the past few years due to its impact on the society. Researchers argue that models trained on balanced datasets with good representation provide equal and unbiased performance across subgroups. However, can a seemingly unbiased pre-trained model become biased when input data undergoes certain distortions? For the first time, we attempt to answer this question in the context of face recognition. We provide a systematic analysis to evaluate the performance of four state-of-the-art deep face recognition models in the presence of image distortions across different gender and race subgroups. We have observed that image distortions have a relationship with the performance gap of the model across different subgroups.
【13】 Feature Identification and Matching for Hand Hygiene Pose 标题:手部卫生姿势的特征识别与匹配 链接:https://arxiv.org/abs/2108.06537
作者:Rashmi Bakshi 机构:School of Electrical and Electronic Engineering, Technological University, Dublin, Ireland 摘要:本文比较和评价了SIFT、SURF和ORB三种常用的计算机视觉特征描述符:统计了原始手部卫生姿势("掌心相对搓手")图像及其旋转图像中提取并正确匹配的特征数,并根据总匹配数与正确匹配数计算准确度分数。实验表明,ORB算法以更短的时间给出更多正确匹配,性能更优。将ORB特征检测技术应用于洗手视频的特征提取与手部卫生姿势分类是今后的工作方向。算法均通过OpenCV在python脚本中实现。 摘要:Three popular computer vision feature descriptors, SIFT, SURF, and ORB, are compared and evaluated. The number of correct features extracted and matched is measured for the original hand hygiene pose (rub hands palm to palm) image and a rotated image. An accuracy score is calculated based on the total number of matches and the number of correct matches produced. The experiment demonstrated that the ORB algorithm outperforms the others by giving a high number of correct matches in less time. Applying the ORB feature detection technique to handwashing video recordings for feature extraction and hand hygiene pose classification is planned as future work. OpenCV is utilized to apply the algorithms within python scripts.
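摘要提到算法通过OpenCV在python脚本中实现;下面给出ORB特征提取、汉明距离匹配并统计正确匹配的一个最小示例。图像路径与Lowe比值检验阈值0.75为假设的常用经验值,并非论文原始设置。

```python
import cv2

# 假设的图像路径:原始姿势图像与其旋转版本
img1 = cv2.imread("rub_palm_to_palm.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("rub_palm_to_palm_rot.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# ORB为二值描述子,用汉明距离做KNN匹配,再以Lowe比值检验筛选"正确"匹配
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = bf.knnMatch(des1, des2, k=2)          # 示意:假定每个查询都返回2个近邻
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

accuracy = len(good) / max(len(matches), 1)     # 正确匹配数 / 总匹配数(示意)
print(f"total={len(matches)}, correct={len(good)}, accuracy={accuracy:.3f}")
```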
【14】 Joint Optimization in Edge-Cloud Continuum for Federated Unsupervised Person Re-identification 标题:面向联邦无监督行人重识别的边云连续体联合优化 链接:https://arxiv.org/abs/2108.06493
作者:Weiming Zhuang,Yonggang Wen,Shuai Zhang 机构:S-Lab, Nanyang Technological University, SenseTime Research 备注:ACMMM'21 摘要:人员重新识别(ReID)旨在从非重叠摄像机视图中重新识别人员。由于person ReID数据包含敏感的个人信息,研究人员采用了联邦学习(federated learning)这一新兴的分布式训练方法来降低隐私泄露风险。然而,现有的研究依赖于数据标签,获取这些标签费时费力。我们提出FedUReID,一个联邦无监督person ReID系统,在不需要任何标签的情况下学习person ReID模型,同时保护隐私。FedUReID允许在带有未标记数据的边缘上进行现场模型训练。云服务器从边缘聚合模型,而不是集中原始数据,以保护数据隐私。此外,为了解决边缘在数据量和分布方面存在差异的问题,我们通过云和边缘的联合优化对边缘进行个性化训练。具体地说,我们提出了个性化epoch来重新分配整个训练过程中的计算,个性化聚类来迭代地预测未标记数据的合适标签,以及个性化更新来使服务器聚合模型适应每个边缘。在8个行人ReID数据集上的大量实验表明,FedUReID不仅具有更高的精度,而且还将计算成本降低了29%。我们的FedUReID系统及其联合优化将为在更多无数据标签的多媒体任务中实施联邦学习提供借鉴。 摘要:Person re-identification (ReID) aims to re-identify a person from non-overlapping camera views. Since person ReID data contains sensitive personal information, researchers have adopted federated learning, an emerging distributed training method, to mitigate the privacy leakage risks. However, existing studies rely on data labels that are laborious and time-consuming to obtain. We present FedUReID, a federated unsupervised person ReID system to learn person ReID models without any labels while preserving privacy. FedUReID enables in-situ model training on edges with unlabeled data. A cloud server aggregates models from edges instead of centralizing raw data to preserve data privacy. Moreover, to tackle the problem that edges vary in data volumes and distributions, we personalize training in edges with joint optimization of cloud and edge. Specifically, we propose personalized epoch to reassign computation throughout training, personalized clustering to iteratively predict suitable labels for unlabeled data, and personalized update to adapt the server aggregated model to each edge. Extensive experiments on eight person ReID datasets demonstrate that FedUReID not only achieves higher accuracy but also reduces computation cost by 29%. Our FedUReID system with the joint optimization will shed light on implementing federated learning to more multimedia tasks without data labels.
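云端"聚合来自边缘的模型而非集中原始数据"通常可由FedAvg式的参数加权平均实现;下面给出一个最小示意,论文中的个性化epoch/聚类/更新模块未包含在内。

```python
import copy
import torch

def aggregate(edge_states, edge_sizes):
    """按各边缘数据量加权平均模型参数(FedAvg式,示意)。
    edge_states: 各边缘模型的state_dict列表;edge_sizes: 对应样本数。"""
    total = float(sum(edge_sizes))
    global_state = copy.deepcopy(edge_states[0])
    for key in global_state:
        # 为简洁起见,整数型缓冲区(如BN计数)在此也被一并平均
        global_state[key] = sum(
            s[key].float() * (n / total) for s, n in zip(edge_states, edge_sizes))
    return global_state
```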
【15】 DICOM Imaging Router: An Open Deep Learning Framework for Classification of Body Parts from DICOM X-ray Scans 标题:DICOM影像路由器:一种用于DICOM X射线扫描身体部位分类的开放式深度学习框架 链接:https://arxiv.org/abs/2108.06490
作者:Hieu H. Pham,Dung V. Do,Ha Q. Nguyen 机构:Medical Imaging Center, Vingroup Big Data Institute, Hanoi, Vietnam, College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam 备注:This is a preprint of our paper, which was accepted for publication to ICCV Workshop 2021 摘要:DICOM格式的X射线成像是临床实践中最常用的成像方式,产生了大量的非标准化数据库。这导致在部署用于分析医学图像的AI解决方案时遇到障碍,通常需要在将图像输入指定的AI模型之前识别正确的身体部位。这一挑战催生了对一种自动化、高效的X射线扫描身体部位分类方法的需求。不幸的是,据我们所知,到目前为止还没有用于此任务的开放工具或框架。为了弥补这一不足,我们引入了一种DICOM成像路由器,该路由器部署深度CNN,用于将未知的DICOM X射线图像分为五个解剖组:腹部、成人胸部、儿童胸部、脊柱和其他。为此,收集了由16093张图像组成的大规模X射线数据集,并对其进行了手动分类。然后,我们使用11263张图像的训练集训练了一组最先进的深度CNN。这些网络随后在2419张图像的独立测试集上进行评估,并在身体部位分类上表现出优异的性能。具体而言,我们表现最好的模型的召回率为0.982(95%置信区间,0.977-0.988),精确度为0.985(95%置信区间,0.975-0.989),F1得分为0.981(95%置信区间,0.976-0.987),同时推理计算量较小(每幅图像0.0295秒)。在1000张X射线图像上的外部验证表明了所提方法跨医院的稳健性。这些显著的性能表明,深度CNN能够准确有效地从X射线扫描中区分人体部位,从而为临床环境中的广泛应用提供潜在益处。本研究的数据集、代码和经过训练的深度学习模型将在我们的项目网站上公开:https://vindr.ai/。 摘要:X-ray imaging in DICOM format is the most commonly used imaging modality in clinical practice, resulting in vast, non-normalized databases. This leads to an obstacle in deploying AI solutions for analyzing medical images, which often requires identifying the right body part before feeding the image into a specified AI model. This challenge raises the need for an automated and efficient approach to classifying body parts from X-ray scans. Unfortunately, to the best of our knowledge, there is no open tool or framework for this task to date. To fill this gap, we introduce a DICOM Imaging Router that deploys deep CNNs for categorizing unknown DICOM X-ray images into five anatomical groups: abdominal, adult chest, pediatric chest, spine, and others. To this end, a large-scale X-ray dataset consisting of 16,093 images has been collected and manually classified. We then trained a set of state-of-the-art deep CNNs using a training set of 11,263 images. These networks were then evaluated on an independent test set of 2,419 images and showed superior performance in classifying the body parts. Specifically, our best performing model achieved a recall of 0.982 (95% CI, 0.977-0.988), a precision of 0.985 (95% CI, 0.975-0.989) and a F1-score of 0.981 (95% CI, 0.976-0.987), whilst requiring less computation for inference (0.0295 second per image). Our external validation on 1,000 X-ray images shows the robustness of the proposed approach across hospitals. These remarkable performances indicate that deep CNNs can accurately and effectively differentiate human body parts from X-ray scans, thereby providing potential benefits for a wide range of applications in clinical settings. The dataset, codes, and trained deep learning models from this study will be made publicly available on our project website at https://vindr.ai/.
分割|语义相关(17篇)
【1】 Real-time Human-Centric Segmentation for Complex Video Scenes 标题:复杂视频场景中以人为中心的实时分割 链接:https://arxiv.org/abs/2108.07199
作者:Ran Yu,Chenyu Tian,Weihao Xia,Xinyuan Zhao,Haoqian Wang,Yujiu Yang 机构: Yang are with Tsinghua ShenzhenInternational Graduate School, Tsinghua University, Zhao is with Northwestern University 摘要:大多数现有的与“人类”相关的视频任务侧重于突出的人类的分割,而忽略了视频中未指明的其他人。很少有研究集中于分割和跟踪复杂视频中的所有人,包括行人和其他状态的人(例如,坐着、骑着或被遮挡)。在本文中,我们提出了一个新的框架,简称为HVISNet,该框架基于一级检测器对给定视频中所有呈现的人进行分割和跟踪。为了更好地评估复杂场景,我们提供了一个称为HVIS(Human Video Instance Segmentation,人类视频实例分割)的新基准,该基准由805个不同场景的高分辨率视频中的1447个人类实例遮罩组成。大量的实验表明,我们提出的HVISNet在实时推理速度(30 FPS)下的准确性优于最先进的方法,特别是在复杂的视频场景中。我们还注意到,使用边界框的中心来区分不同的个体会严重降低分割精度,尤其是在严重遮挡的情况下。这种常见的现象称为模糊正样本问题。为了缓解这个问题,我们提出了一种称为内中心采样的机制来提高实例分割的准确性。这种即插即用的内中心采样机制可以结合到基于一级检测器的任何实例分割模型中,以提高性能。特别是,对于被遮挡的人类,它在最先进的方法基础上获得了4.1的mAP改进。有关代码和数据,请访问https://github.com/IIGROUP/HVISNet. 摘要:Most existing video tasks related to "human" focus on the segmentation of salient humans, ignoring the unspecified others in the video. Few studies have focused on segmenting and tracking all humans in a complex video, including pedestrians and humans of other states (e.g., seated, riding, or occluded). In this paper, we propose a novel framework, abbreviated as HVISNet, that segments and tracks all presented people in given videos based on a one-stage detector. To better evaluate complex scenes, we offer a new benchmark called HVIS (Human Video Instance Segmentation), which comprises 1447 human instance masks in 805 high-resolution videos in diverse scenes. Extensive experiments show that our proposed HVISNet outperforms the state-of-the-art methods in terms of accuracy at a real-time inference speed (30 FPS), especially on complex video scenes. We also notice that using the center of the bounding box to distinguish different individuals severely deteriorates the segmentation accuracy, especially in heavily occluded conditions. This common phenomenon is referred to as the ambiguous positive samples problem. To alleviate this problem, we propose a mechanism named Inner Center Sampling to improve the accuracy of instance segmentation. Such a plug-and-play inner center sampling mechanism can be incorporated in any instance segmentation models based on a one-stage detector to improve the performance. In particular, it gains 4.1 mAP improvement on the state-of-the-art method in the case of occluded humans. Code and data are available at https://github.com/IIGROUP/HVISNet.
【2】 ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration 标题:Rosita:通过跨模态和内模态知识集成增强视觉和语言语义对齐 链接:https://arxiv.org/abs/2108.07073
作者:Yuhao Cui,Zhou Yu,Chunqi Wang,Zhongzhou Zhao,Ji Zhang,Meng Wang,Jun Yu 机构:School of Computer Science and Technology, Hangzhou Dianzi University, China, Alibaba Group, China, School of Computer Science and Information Engineering, Hefei University of Technology, China 备注:Accepted at ACM Multimedia 2021. Code available at this https URL 摘要:视觉和语言预训练(VLP)旨在从海量图像-文本对中学习通用多模态表示。虽然已经提出了各种成功的尝试,但学习图像-文本对之间的细粒度语义对齐在这些方法中起着关键作用。然而,大多数现有的VLP方法没有充分利用图像-文本对中的固有知识,这限制了学习对齐的有效性,并进一步限制了其模型的性能。为此,我们引入了一种称为ROSITA的新VLP方法,该方法将跨模态和模态内知识集成到一个统一的场景图中,以增强语义对齐。具体来说,我们引入了一种新的结构知识掩蔽(SKM)策略,以场景图结构作为先验来执行掩蔽语言(区域)建模,该策略通过消除模态内部和跨模态的干扰信息来增强语义对齐。广泛的消融研究和综合分析证实了ROSITA在语义对齐方面的有效性。通过在域内和域外数据集上预训练,ROSITA在六个基准数据集的三个典型视觉与语言任务上显著优于现有最先进的VLP方法。 摘要:Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been proposed, learning fine-grained semantic alignments between image-text pairs plays a key role in their approaches. Nevertheless, most existing VLP approaches have not fully utilized the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and further restricts the performance of their models. To this end, we introduce a new VLP method called ROSITA, which integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments. Specifically, we introduce a novel structural knowledge masking (SKM) strategy to use the scene graph structure as a priori to perform masked language (region) modeling, which enhances the semantic alignments by eliminating the interference information within and across modalities. Extensive ablation studies and comprehensive analysis verify the effectiveness of ROSITA in semantic alignments. Pretrained with both in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art VLP methods on three typical vision-and-language tasks over six benchmark datasets.
【3】 Multi-Target Adversarial Frameworks for Domain Adaptation in Semantic Segmentation 标题:语义分割中领域自适应的多目标对抗框架 链接:https://arxiv.org/abs/2108.06962
作者:Antoine Saporta,Tuan-Hung Vu,Matthieu Cord,Patrick Pérez 机构:Sorbonne University, Valeo.ai 备注:Accepted at the 2021 International Conference on Computer Vision (ICCV) 摘要:在这项工作中,我们解决了存在多个目标域时语义分割任务的无监督域自适应(UDA)问题:目标是训练一个单一的模型,该模型可以在测试时处理所有这些域。这种多目标自适应对于现实世界中自治系统必须处理的各种场景来说至关重要。这是一个具有挑战性的设置,因为人们不仅要面对标记源集和未标记目标集之间的域间隙,还要面对后者在不同目标域之间存在的分布偏移。为此,我们引入了两个对抗性框架:(i)多鉴别器,明确地将每个目标域与其对应域对齐;(ii)多目标知识迁移,通过多教师/单学生的蒸馏机制学习目标无关模型。评估在四个新提出的语义分割多目标UDA基准上进行。在所有测试场景中,我们的方法始终优于基线,为这一新任务设定了具有竞争力的标准。 摘要:In this work, we address the task of unsupervised domain adaptation (UDA) for semantic segmentation in presence of multiple target domains: The objective is to train a single model that can handle all these domains at test time. Such a multi-target adaptation is crucial for a variety of scenarios that real-world autonomous systems must handle. It is a challenging setup since one faces not only the domain gap between the labeled source set and the unlabeled target set, but also the distribution shifts existing within the latter among the different target domains. To this end, we introduce two adversarial frameworks: (i) multi-discriminator, which explicitly aligns each target domain to its counterparts, and (ii) multi-target knowledge transfer, which learns a target-agnostic model thanks to a multi-teacher/single-student distillation mechanism. The evaluation is done on four newly-proposed multi-target benchmarks for UDA in semantic segmentation. In all tested scenarios, our approaches consistently outperform baselines, setting competitive standards for the novel task.
【4】 CarveMix: A Simple Data Augmentation Method for Brain Lesion Segmentation 标题:CarveMix:一种简单的脑病变分割数据增强方法 链接:https://arxiv.org/abs/2108.06883
作者:Xinru Zhang,Chenghao Liu,Ni Ou,Xiangzhu Zeng,Xiaoliang Xiong,Yizhou Yu,Zhiwen Liu,Chuyang Ye 机构:School of Information and Electronics, Beijing Institute of Technology, Beijing, School of Automation, Beijing Institute of Technology, Beijing, China, Department of Radiology, Peking University Third Hospital, Beijing, China, Deepwise AI Lab, Beijing, China 摘要:脑损伤分割为临床诊断提供了一个有价值的工具,卷积神经网络(CNN)在这项任务中取得了前所未有的成功。数据增强是一种广泛使用的改进CNN训练的策略,脑损伤分割增强方法的设计仍然是一个有待解决的问题。在这项工作中,我们提出了一种简单的数据增强方法,称为CarveMix,用于基于CNN的脑损伤分割。与其他基于“混合”的方法(如Mixup和CutMix)一样,CarveMix随机组合两个现有的标记图像以生成新的标记样本。然而,与这些基于图像组合的增强策略不同,CarveMix是病变感知的,在这种情况下,组合是在关注病变的情况下进行的,并为生成的图像创建适当的注释。具体地说,我们根据病变位置和几何结构从一幅标记图像中雕刻出感兴趣区域(ROI),并从概率分布中采样ROI的大小。然后,雕刻的ROI替换第二个标记图像中的相应体素,并且相应地替换第二个图像的注释。通过这种方式,我们生成新的标记图像用于网络训练,并保留病变信息。为了评估所提出的方法,在两个脑损伤数据集上进行了实验。结果表明,与其他简单的数据增强方法相比,该方法提高了分割精度。 摘要:Brain lesion segmentation provides a valuable tool for clinical diagnosis, and convolutional neural networks (CNNs) have achieved unprecedented success in the task. Data augmentation is a widely used strategy that improves the training of CNNs, and the design of the augmentation method for brain lesion segmentation is still an open problem. In this work, we propose a simple data augmentation approach, dubbed as CarveMix, for CNN-based brain lesion segmentation. Like other "mix"-based methods, such as Mixup and CutMix, CarveMix stochastically combines two existing labeled images to generate new labeled samples. Yet, unlike these augmentation strategies based on image combination, CarveMix is lesion-aware, where the combination is performed with an attention on the lesions and a proper annotation is created for the generated image. Specifically, from one labeled image we carve a region of interest (ROI) according to the lesion location and geometry, and the size of the ROI is sampled from a probability distribution. The carved ROI then replaces the corresponding voxels in a second labeled image, and the annotation of the second image is replaced accordingly as well. In this way, we generate new labeled images for network training and the lesion information is preserved. To evaluate the proposed method, experiments were performed on two brain lesion datasets. The results show that our method improves the segmentation accuracy compared with other simple data augmentation approaches.
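CarveMix的核心操作是按病变位置与几何从一幅标注图像中"雕刻"出ROI,替换到另一幅图像及其标注中。下面给出一个二维NumPy简化示意(论文面向三维体数据;ROI尺寸在此用简单的随机扩边代替论文中的概率分布采样,属假设),并假定两幅图像已对齐且尺寸一致。

```python
import numpy as np

def carvemix_2d(img_a, lab_a, img_b, lab_b, rng=np.random.default_rng(0)):
    """将图像A中围绕病变的ROI雕刻下来,粘贴到图像B及其标注上(示意)。"""
    ys, xs = np.where(lab_a > 0)                     # 病变像素位置
    if len(ys) == 0:                                 # A中无病变则直接返回B
        return img_b.copy(), lab_b.copy()
    y1, y2, x1, x2 = ys.min(), ys.max(), xs.min(), xs.max()
    m = int(rng.integers(0, 8))                      # 随机扩边量(假设的采样方式)
    y1, x1 = max(0, y1 - m), max(0, x1 - m)
    y2 = min(img_a.shape[0] - 1, y2 + m)
    x2 = min(img_a.shape[1] - 1, x2 + m)
    new_img, new_lab = img_b.copy(), lab_b.copy()
    new_img[y1:y2 + 1, x1:x2 + 1] = img_a[y1:y2 + 1, x1:x2 + 1]  # 替换体素
    new_lab[y1:y2 + 1, x1:x2 + 1] = lab_a[y1:y2 + 1, x1:x2 + 1]  # 同步替换标注
    return new_img, new_lab
```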
【5】 Weakly Supervised Temporal Anomaly Segmentation with Dynamic Time Warping 标题:基于动态时间规整的弱监督时间异常分割 链接:https://arxiv.org/abs/2108.06816
作者:Dongha Lee,Sehun Yu,Hyunjun Ju,Hwanjo Yu 机构:University of Illinois at Urbana-Champaign (UIUC), Urbana, IL, United States, Pohang University of Science and Technology (POSTECH), Pohang, South Korea 备注:ICCV 2021. 8 pages, References (2 pages), Appendix (3 pages), 6 figures 摘要:最近关于检测和定位时间异常的研究主要是利用深度神经网络以无监督的方式学习时间数据的正常模式。与它们不同的是,我们工作的目标是充分利用实例级(或弱)异常标签,它只指示在每个时态数据实例中是否发生了任何异常事件。在本文中,我们提出了一种新的框架WETAS,它可以有效地识别输入实例中的异常时间段(即连续时间点)。WETAS从实例级别的标签中学习鉴别特征,从而推断每个实例中正常和异常段的顺序,这可以用作粗分割掩码。基于输入实例与其分割掩码之间的动态时间扭曲(DTW)对齐,WETAS获得时间分割的结果,同时,通过使用掩码作为附加监控,它进一步增强了自身。我们的实验表明,在时间异常的定位方面,WETAS大大优于其他基线,并且与点级检测方法相比,它提供了更多信息。 摘要:Most recent studies on detecting and localizing temporal anomalies have mainly employed deep neural networks to learn the normal patterns of temporal data in an unsupervised manner. Unlike them, the goal of our work is to fully utilize instance-level (or weak) anomaly labels, which only indicate whether any anomalous events occurred or not in each instance of temporal data. In this paper, we present WETAS, a novel framework that effectively identifies anomalous temporal segments (i.e., consecutive time points) in an input instance. WETAS learns discriminative features from the instance-level labels so that it infers the sequential order of normal and anomalous segments within each instance, which can be used as a rough segmentation mask. Based on the dynamic time warping (DTW) alignment between the input instance and its segmentation mask, WETAS obtains the result of temporal segmentation, and simultaneously, it further enhances itself by using the mask as additional supervision. Our experiments show that WETAS considerably outperforms other baselines in terms of the localization of temporal anomalies, and also it provides more informative results than point-level detection methods.
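WETAS依赖输入实例与其分割掩码之间的动态时间规整(DTW)对齐;下面给出经典DTW累计代价的动态规划最小实现,便于理解该对齐机制,与论文中具体使用的DTW变体无关。

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """经典DTW:返回序列x与y的累计对齐代价(示意)。"""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # 三个转移方向:插入、删除、匹配
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 用法示例
```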
【6】 The Marine Debris Dataset for Forward-Looking Sonar Semantic Segmentation 标题:面向前视声纳语义分割的海洋废弃物数据集 链接:https://arxiv.org/abs/2108.06800
作者:Deepak Singh,Matias Valdenegro-Toro 机构:Netaji Subhas Institute Of Technology, Dwarka Sec-, Delhi, India, German Research Center for Artificial Intelligence, Robert-Hooke-Str , Bremen, Germany 备注:OceanVision 2021 ICCV Workshop, Camera Ready, 9 pages, 13 figures, 6 Tables 摘要:准确检测和分割海洋废弃物对于保持水体清洁非常重要。本文提出了一种利用前视声纳(FLS)采集的海洋废弃物分割新数据集。该数据集由使用ARIS Explorer 3000传感器捕获的1868幅FLS图像组成。用于生成此数据集的对象包含典型的家用类海洋废弃物和干扰物海洋对象(轮胎、挂钩、阀门等),分为11个类和一个背景类。在此数据集上分析了各种编码器的最新语义分割体系结构的性能,并将其作为基线结果呈现。由于图像是灰度的,因此没有使用预训练权重。使用交并比(IoU)进行比较。性能最好的模型是以ResNet34为主干网络的Unet,mIoU为0.7481。该数据集可在https://github.com/mvaldenegro/marine-debris-fls-datasets/ 获取。 摘要:Accurate detection and segmentation of marine debris is important for keeping the water bodies clean. This paper presents a novel dataset for marine debris segmentation collected using a Forward Looking Sonar (FLS). The dataset consists of 1868 FLS images captured using the ARIS Explorer 3000 sensor. The objects used to produce this dataset contain typical house-hold marine debris and distractor marine objects (tires, hooks, valves, etc.), divided into 11 classes plus a background class. Performance of state of the art semantic segmentation architectures with a variety of encoders has been analyzed on this dataset and presented as baseline results. Since the images are grayscale, no pretrained weights have been used. Comparisons are made using Intersection over Union (IoU). The best performing model is Unet with a ResNet34 backbone at 0.7481 mIoU. The dataset is available at https://github.com/mvaldenegro/marine-debris-fls-datasets/
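基线比较使用交并比(IoU);下面示意如何由混淆矩阵计算各类IoU与mIoU。类别缺失时记为NaN并在取均值时忽略,这一处理方式为假设。

```python
import numpy as np

def miou(conf):
    """由混淆矩阵计算各类IoU与mIoU(示意)。
    conf: (C, C) 矩阵,conf[i, j]为真实类i被预测为类j的像素数。"""
    inter = np.diag(conf).astype(float)                  # 各类交集
    union = conf.sum(axis=1) + conf.sum(axis=0) - inter  # 行和+列和-对角
    iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)
    return float(np.nanmean(iou)), iou
```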
【7】 Dilated Inception U-Net (DIU-Net) for Brain Tumor Segmentation 标题:扩张Inception U-Net(DIU-Net)在脑肿瘤分割中的应用 链接:https://arxiv.org/abs/2108.06772
作者:Daniel E. Cahall,Ghulam Rasool,Nidhal C. Bouaynaya,Hassan M. Fathallah-Shaykh 机构: Department of Electrical and Computer Engineering, Rowan University, University of Alabama at Birmingham 摘要:磁共振成像(MRI)通常用于脑肿瘤诊断、治疗计划和治疗后监测。最近,基于深度神经网络的各种模型被提出用于脑磁共振成像中肿瘤的像素级分割。然而,磁共振成像的结构变化、空间差异和强度不均匀性使得分割成为一项具有挑战性的任务。我们提出了一种新的基于U-Net的端到端脑肿瘤分割架构,该架构将Inception模块和空洞卷积集成到其收缩和扩展路径中。这使我们能够提取局部结构信息以及全局上下文信息。我们使用脑肿瘤分割(BraTS)2018数据集对胶质瘤亚区域进行分割,包括肿瘤核心、增强肿瘤和整个肿瘤。我们提出的模型在肿瘤核心和整个肿瘤分割方面的表现明显优于最先进的基于U-Net的模型($p<0.05$)。 摘要:Magnetic resonance imaging (MRI) is routinely used for brain tumor diagnosis, treatment planning, and post-treatment surveillance. Recently, various models based on deep neural networks have been proposed for the pixel-level segmentation of tumors in brain MRIs. However, the structural variations, spatial dissimilarities, and intensity inhomogeneity in MRIs make segmentation a challenging task. We propose a new end-to-end brain tumor segmentation architecture based on U-Net that integrates Inception modules and dilated convolutions into its contracting and expanding paths. This allows us to extract local structural as well as global contextual information. We performed segmentation of glioma sub-regions, including tumor core, enhancing tumor, and whole tumor using Brain Tumor Segmentation (BraTS) 2018 dataset. Our proposed model performed significantly better than the state-of-the-art U-Net-based model ($p<0.05$) for tumor core and whole tumor segmentation.
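下面给出"并联多个不同扩张率卷积分支的Inception式模块"的PyTorch示意;分支数与扩张率(1/2/4)为假设,并非DIU-Net的确切结构。

```python
import torch
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    """并联扩张率为1/2/4的3x3卷积分支并在通道维拼接(示意)。
    padding=dilation可保持空间尺寸不变。"""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        assert out_ch % 3 == 0, "输出通道数需能被分支数整除"
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch // 3, 3, padding=d, dilation=d),
                nn.BatchNorm2d(out_ch // 3), nn.ReLU(inplace=True))
            for d in (1, 2, 4)])

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)

print(DilatedInceptionBlock(64, 96)(torch.randn(1, 64, 32, 32)).shape)  # 用法示例
```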
【8】 Multi-Slice Dense-Sparse Learning for Efficient Liver and Tumor Segmentation 标题:多层密集-稀疏学习在肝脏和肿瘤分割中的应用 链接:https://arxiv.org/abs/2108.06761
作者:Ziyuan Zhao,Zeyu Ma,Yanjie Liu,Zeng Zeng,Pierce KH Chow 机构:National University of Singapore 备注:Accepted in 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE EMBC 2021 摘要:准确的肝脏和肿瘤自动分割在治疗计划和疾病监测中起着至关重要的作用。近年来,深度卷积神经网络(DCNNs)在二维和三维医学图像分割中取得了巨大的成功。然而,2D DCNN不能充分利用片间信息,而3D DCNN计算成本高且内存密集。为了解决这些问题,我们首先从数据的角度提出了一种新的密集-稀疏训练流,其中,密集相邻切片和稀疏相邻切片被提取作为正则化DCNN的输入,从而提高模型性能。此外,我们还从网络的角度设计了一个2.5D轻量级nnU-Net,其中采用了深度可分离卷积来提高效率。在LiTS数据集上的大量实验证明了该方法的优越性。 摘要:Accurate automatic liver and tumor segmentation plays a vital role in treatment planning and disease monitoring. Recently, deep convolutional neural networks (DCNNs) have obtained tremendous success in 2D and 3D medical image segmentation. However, 2D DCNNs cannot fully leverage the inter-slice information, while 3D DCNNs are computationally expensive and memory intensive. To address these issues, we first propose a novel dense-sparse training flow from a data perspective, in which densely adjacent slices and sparsely adjacent slices are extracted as inputs for regularizing DCNNs, thereby improving the model performance. Moreover, we design a 2.5D light-weight nnU-Net from a network perspective, in which depthwise separable convolutions are adopted to improve the efficiency. Extensive experiments on the LiTS dataset have demonstrated the superiority of the proposed method.
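深度可分离卷积将标准卷积拆为逐通道卷积加1x1逐点卷积,从而降低计算量;下面是PyTorch最小示意。

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """深度可分离卷积 = 逐通道(groups=in_ch)卷积 + 1x1逐点卷积(示意)。"""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

相比同尺寸的标准卷积,其参数量约从 in_ch*out_ch*k*k 降为 in_ch*k*k + in_ch*out_ch,这也是MobileNet等轻量级网络的常见设计选择。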
【9】 Temporal Action Segmentation with High-level Complex Activity Labels 标题:基于高层复杂活动标签的时态动作分割 链接:https://arxiv.org/abs/2108.06706
作者:Guodong Ding,Angela Yao 机构:School of Computing, National University of Singapore 备注:11 pages, 6 figures 摘要:在过去的几年中,在短剪辑视频的动作识别方面的成功导致了对未剪辑长视频中动作时间分割的更多研究。最近,监督方法在分割未剪辑视频中复杂的人类行为方面取得了优异的性能。然而,除了动作标签外,这种方法还需要每个动作的起点和终点,收集这些标注既昂贵又繁琐。在本文中,我们的目标是仅以高级活动标签作为输入来学习动作片段。在不提供动作级监督的情况下,匈牙利匹配通常用于寻找分割片段与真实动作之间的映射,以评估模型并报告性能。一方面,我们表明,借助高级监督,我们能够将匈牙利匹配设置从当前的视频和活动级别推广到全局级别。扩展的全局级别匹配允许跨活动共享动作。另一方面,我们提出了一个新的动作发现框架,该框架通过活动分类任务自动发现视频中的组成动作。具体来说,我们定义了有限数量的原型,以形成视频序列的对偶表示。这些集体学习的原型被视为已发现的动作。这种分类设置赋予我们的方法发现跨多个复杂活动的潜在共享动作的能力。大量实验表明,所发现的动作有助于进行时间动作分割和活动识别。 摘要:Over the past few years, the success in action recognition on short trimmed videos has led to more investigations towards the temporal segmentation of actions in untrimmed long videos. Recently, supervised approaches have achieved excellent performance in segmenting complex human actions in untrimmed videos. However, besides action labels, such approaches also require the start and end points of each action, which is expensive and tedious to collect. In this paper, we aim to learn the action segments taking only the high-level activity labels as input. Under the setting where no action-level supervision is provided, Hungarian matching is often used to find the mapping between segments and ground truth actions to evaluate the model and report the performance. On the one hand, we show that with the high-level supervision, we are able to generalize the Hungarian matching settings from the current video and activity level to the global level. The extended global-level matching allows for the shared actions across activities. On the other hand, we propose a novel action discovery framework that automatically discovers constituent actions in videos with the activity classification task. Specifically, we define a finite number of prototypes to form a dual representation of a video sequence. These collectively learned prototypes are considered discovered actions. This classification setting endows our approach with the capability of discovering potentially shared actions across multiple complex activities. Extensive experiments demonstrate that the discovered actions are helpful in performing temporal action segmentation and activity recognition.
【10】 CPNet: Cycle Prototype Network for Weakly-supervised 3D Renal Compartments Segmentation on CT Images 标题:CPNet:用于CT图像弱监督三维肾室分割的循环原型网络 链接:https://arxiv.org/abs/2108.06669
作者:Song Wang,Yuting He,Youyong Kong,Xiaomei Zhu,Shaobo Zhang,Pengfei Shao,Jean-Louis Dillenseger,Jean-Louis Coatrieux,Shuo Li,Guanyu Yang 机构: LIST, Key Laboratory of Computer Network and Information Integration, (Southeast University), Ministry of Education, Nanjing, China, Univ Rennes, Inserm, LTSI - UMR, Rennes, F-, France 备注:24th International Conference on Medical Image Computing and Computer Assisted Intervention 摘要:CT图像肾室分割的目的是从腹部CTA图像中提取肾室的三维结构,对肾脏疾病的诊断和治疗具有重要意义。然而,由于三维肾脏CT图像的腔室边界不清、腔室结构薄、解剖变异大,基于深度学习的肾腔室分割是一项具有挑战性的任务。我们提出了一种新的弱监督学习框架,循环原型网络,用于三维肾室分割。该方法有三个创新之处:1)提出了一种循环原型学习(CPL)来学习一致性以增强泛化。它通过正向过程从伪标签中学习,通过反向过程学习一致性正则化。这两个过程使得模型对噪声具有鲁棒性,并具有标签高效性。2)提出了一种基于跨周期先验知识的贝叶斯弱监督模块(BWSM)。它从跨周期的未标记数据中学习先验知识并自动进行纠错,从而生成准确的伪标签。3)我们提出了一种用于细粒度特征提取的精细解码特征提取器(FDFE)。该模型将全局形态信息与局部细节信息相结合,得到细节清晰的特征图,从而实现对薄结构的精细分割。我们的模型仅用四张标注图像便取得了79.1%和78.7%的Dice系数,比典型的原型模型PANet显著提升约20%。 摘要:Renal compartment segmentation on CT images targets on extracting the 3D structure of renal compartments from abdominal CTA images and is of great significance to the diagnosis and treatment for kidney diseases. However, due to the unclear compartment boundary, thin compartment structure and large anatomy variation of 3D kidney CT images, deep-learning based renal compartment segmentation is a challenging task. We propose a novel weakly supervised learning framework, Cycle Prototype Network, for 3D renal compartment segmentation. It has three innovations: 1) A Cycle Prototype Learning (CPL) is proposed to learn consistency for generalization. It learns from pseudo labels through the forward process and learns consistency regularization through the reverse process. The two processes make the model robust to noise and label-efficient. 2) We propose a Bayes Weakly Supervised Module (BWSM) based on cross-period prior knowledge. It learns prior knowledge from cross-period unlabeled data and performs error correction automatically, thus generates accurate pseudo labels. 3) We present a Fine Decoding Feature Extractor (FDFE) for fine-grained feature extraction. It combines global morphology information and local detail information to obtain feature maps with sharp detail, so the model will achieve fine segmentation on thin structures. Our model achieves Dice of 79.1% and 78.7% with only four labeled images, achieving a significant improvement of about 20% over the typical prototype model PANet.
【11】 Semantic-embedded Unsupervised Spectral Reconstruction from Single RGB Images in the Wild 标题:野外单幅RGB图像的语义嵌入非监督光谱重建 链接:https://arxiv.org/abs/2108.06659
作者:Zhiyu Zhu,Hui Liu,Junhui Hou,Huanqiang Zeng,Qingfu Zhang 机构: City University of Hong Kong, Huaqiao University 摘要:本文研究了在训练过程中不使用成对的高光谱(HS)图像和RGB图像的情况下,从商业相机拍摄的单幅RGB图像重建高光谱图像的问题。为了应对这一挑战,我们提出了一个新的轻量级、基于端到端学习的框架。具体而言,基于由HS图像生成RGB图像的固有成像退化模型,我们通过有效的无监督相机光谱响应函数估计,逐步传播输入RGB图像与由恢复HS图像重投影得到的RGB图像之间的差异。为了在没有成对真值HS图像作为监督的情况下进行学习,我们采用对抗学习方式,并使用一个简单而有效的$\mathcal{L}_1$梯度裁剪方案对其进行增强。此外,我们还嵌入输入RGB图像的语义信息来对无监督学习进行局部正则化,以促使语义相同的像素具有一致的光谱特征。除了在两个广泛使用的数据集上对从合成RGB图像重建HS图像进行定量实验外,我们还通过将从真实RGB图像恢复的HS图像应用于基于HS的视觉跟踪来评估我们的方法。大量结果表明,我们的方法显著优于最新的无监督方法,在某些设置下甚至超过了最新的有监督方法。源代码公开于https://github.com/zbzhzhy/Unsupervised-Spectral-Reconstruction。 摘要:This paper investigates the problem of reconstructing hyperspectral (HS) images from single RGB images captured by commercial cameras, without using paired HS and RGB images during training. To tackle this challenge, we propose a new lightweight and end-to-end learning-based framework. Specifically, on the basis of the intrinsic imaging degradation model of RGB images from HS images, we progressively spread the differences between input RGB images and re-projected RGB images from recovered HS images via effective unsupervised camera spectral response function estimation. To enable the learning without paired ground-truth HS images as supervision, we adopt the adversarial learning manner and boost it with a simple yet effective $\mathcal{L}_1$ gradient clipping scheme. Besides, we embed the semantic information of input RGB images to locally regularize the unsupervised learning, which is expected to promote pixels with identical semantics to have consistent spectral signatures. In addition to conducting quantitative experiments over two widely-used datasets for HS image reconstruction from synthetic RGB images, we also evaluate our method by applying recovered HS images from real RGB images to HS-based visual tracking. Extensive results show that our method significantly outperforms state-of-the-art unsupervised methods and even exceeds the latest supervised method under some settings. The source code is publicly available at https://github.com/zbzhzhy/Unsupervised-Spectral-Reconstruction.
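摘要提到以重投影RGB差异为监督并配合$\mathcal{L}_1$梯度裁剪稳定训练;下面示意"重投影L1损失 + 按值裁剪梯度"的一种通用写法。其中generator、camera_response为假设的模块名,裁剪阈值为假设值,对抗损失部分从略,论文中"$\mathcal{L}_1$梯度裁剪"的确切定义以原文为准。

```python
import torch
import torch.nn.functional as F

def train_step(rgb, generator, camera_response, optimizer, clip=0.1):
    """示意:HS重建 -> 经估计的相机光谱响应重投影回RGB -> L1一致性损失。"""
    optimizer.zero_grad()
    hs = generator(rgb)                      # 预测高光谱图像(假设的生成器模块)
    rgb_reproj = camera_response(hs)         # 由估计的响应函数重投影RGB(假设的模块)
    loss = F.l1_loss(rgb_reproj, rgb)        # 重投影一致性损失
    loss.backward()
    # 按值裁剪梯度:一种常见的稳定化做法,用以近似论文中的梯度裁剪方案
    torch.nn.utils.clip_grad_value_(generator.parameters(), clip)
    optimizer.step()
    return loss.item()
```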
【12】 Real-Time Multi-Modal Semantic Fusion on Unmanned Aerial Vehicles 标题:无人机上的实时多模态语义融合 链接:https://arxiv.org/abs/2108.06608
作者:Simon Bultmann,Jan Quenzel,Sven Behnke 备注:Accepted for: 10th European Conference on Mobile Robots (ECMR), Bonn, Germany, September 2021 摘要:配备多个互补传感器的无人机(UAV)在快速自主或远程控制的语义场景分析方面具有巨大潜力,例如用于灾害勘察。在这项工作中,我们提出了一个用于多传感器模态实时语义推理和融合的无人机系统。激光雷达扫描和RGB图像的语义分割,以及RGB和热图像上的目标检测,使用轻量级CNN架构和嵌入式推理加速器在无人机机载计算机上在线运行。我们采用后融合方法:来自多种模态的语义信息增强3D点云和图像分割掩码,同时还生成一个非自我中心(allocentric)语义地图。我们的系统以约9 Hz的速率提供增强的语义图像和点云。我们在城市环境的真实实验中评估了集成系统。 摘要:Unmanned aerial vehicles (UAVs) equipped with multiple complementary sensors have tremendous potential for fast autonomous or remote-controlled semantic scene analysis, e.g., for disaster examination. In this work, we propose a UAV system for real-time semantic inference and fusion of multiple sensor modalities. Semantic segmentation of LiDAR scans and RGB images, as well as object detection on RGB and thermal images, run online onboard the UAV computer using lightweight CNN architectures and embedded inference accelerators. We follow a late fusion approach where semantic information from multiple modalities augments 3D point clouds and image segmentation masks while also generating an allocentric semantic map. Our system provides augmented semantic images and point clouds at $\approx 9$ Hz. We evaluate the integrated system in real-world experiments in an urban environment.
【13】 A Self-Distillation Embedded Supervised Affinity Attention Model for Few-Shot Segmentation 标题:一种用于Few-Shot分割的自蒸馏嵌入式亲和力监督注意模型 链接:https://arxiv.org/abs/2108.06600
作者:Qi Zhao,Binghao Liu,Shuchang Lyu,Xu Wang,Yifan Yang 备注:13 pages, 13 figures 摘要:Few-Shot语义分割是一项具有挑战性的任务,其目标是在仅有少量标注样本的情况下逐像素预测目标类别。然而,现有方法仍然面临两大挑战。首先,支持图像和查询图像之间的巨大特征差异造成了知识传递障碍,从而影响了分割性能。其次,支持样本少导致支持特征不具代表性,难以指导高质量的查询分割。为了解决上述两个问题,我们提出了自蒸馏嵌入式监督亲和注意模型(SD-AANet)来提高少样本分割任务的性能。具体而言,自蒸馏引导原型模块(SDPM)通过在支持和查询之间进行自蒸馏来提取内在原型,以捕获具有代表性的特征。有监督的亲和注意模块(SAAM)利用支持图像的真实标注来指导高质量查询注意图的生成,该模块可以学习亲和信息以关注查询目标的整个区域。大量实验证明,与现有方法相比,我们的SD-AANet显著提高了性能。综合消融实验和可视化研究也表明,SDPM和SAAM对少样本分割任务有显著效果。在基准数据集PASCAL-5i和COCO-20i上,我们提出的SD-AANet均实现了最先进的结果。我们的代码将很快公开。 摘要:Few-shot semantic segmentation is a challenging task of predicting object categories pixel-wise with only a few annotated samples. However, existing approaches still face two main challenges. First, the huge feature distinction between support and query images causes a knowledge transferring barrier, which harms the segmentation performance. Second, few support samples cause unrepresentative support features, which makes it hard to guide high-quality query segmentation. To deal with the above two issues, we propose the self-distillation embedded supervised affinity attention model (SD-AANet) to improve the performance of the few-shot segmentation task. Specifically, the self-distillation guided prototype module (SDPM) extracts intrinsic prototypes by self-distillation between support and query to capture representative features. The supervised affinity attention module (SAAM) adopts the support ground truth to guide the production of high-quality query attention maps, which can learn affinity information to focus on the whole area of the query target. Extensive experiments prove that our SD-AANet significantly improves the performance compared with existing methods. Comprehensive ablation experiments and visualization studies also show the significant effect of SDPM and SAAM for the few-shot segmentation task. On the benchmark datasets PASCAL-5i and COCO-20i, our proposed SD-AANet achieves state-of-the-art results. Our code will be publicly available soon.
【14】 Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation 标题:利用联合嵌入空间进行广义Zero-Shot语义切分 链接:https://arxiv.org/abs/2108.06536
作者:Donghyeon Baek,Youngmin Oh,Bumsub Ham 机构:School of Electrical and Electronic Engineering, Yonsei University 备注:Accepted to ICCV 2021 摘要:我们解决了广义Zero-Shot语义分割(GZS3)问题,即对可见类和不可见类进行像素级语义标签预测。大多数GZS3方法采用生成式方法,从相应的语义信息(例如word2vec)中合成不可见类的视觉特征,从而为可见类和不可见类训练新的分类器。尽管生成方法表现出不错的性能,但它们有两个局限性:(1)视觉特征偏向于可见类;(2)每当出现新的不可见类时,都需要重新训练分类器。我们提出了一种判别式方法,在统一框架中解决这些限制。为此,我们利用视觉和语义编码器学习一个联合嵌入空间,其中语义编码器将语义特征转换为语义原型,作为相应类视觉特征的中心。具体来说,我们引入边界感知回归(BAR)和语义一致性(SC)损失来学习判别特征。我们利用联合嵌入空间的方法,以及BAR和SC项,缓解了可见类偏差问题。在测试时,我们通过将语义原型用作最近邻(NN)分类器来避免重新训练过程。为了进一步缓解偏差问题,我们还提出了一种称为Apollonius校准(AC)的推理技术,该技术自适应地将NN分类器的决策边界调整到Apollonius圆。实验结果证明了我们框架的有效性,在标准基准上达到了新的最先进水平。 摘要:We address the problem of generalized zero-shot semantic segmentation (GZS3) predicting pixel-wise semantic labels for seen and unseen classes. Most GZS3 methods adopt a generative approach that synthesizes visual features of unseen classes from corresponding semantic ones (e.g., word2vec) to train novel classifiers for both seen and unseen classes. Although generative methods show decent performance, they have two limitations: (1) the visual features are biased towards seen classes; (2) the classifier should be retrained whenever novel unseen classes appear. We propose a discriminative approach to address these limitations in a unified framework. To this end, we leverage visual and semantic encoders to learn a joint embedding space, where the semantic encoder transforms semantic features to semantic prototypes that act as centers for visual features of corresponding classes. Specifically, we introduce boundary-aware regression (BAR) and semantic consistency (SC) losses to learn discriminative features. Our approach to exploiting the joint embedding space, together with BAR and SC terms, alleviates the seen bias problem. At test time, we avoid the retraining process by exploiting semantic prototypes as a nearest-neighbor (NN) classifier. To further alleviate the bias problem, we also propose an inference technique, dubbed Apollonius calibration (AC), that modulates the decision boundary of the NN classifier to the Apollonius circle adaptively. Experimental results demonstrate the effectiveness of our framework, achieving a new state of the art on standard benchmarks.
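测试时将语义原型用作最近邻分类器即可避免重新训练;下面用余弦相似度给出该步骤的示意(Apollonius校准对决策边界的调整未包含,其具体形式以论文为准)。

```python
import torch
import torch.nn.functional as F

def nn_classify(visual_feats, prototypes):
    """在联合嵌入空间中按最近(余弦最相似)原型分类(示意)。
    visual_feats: (N, D) 视觉特征;prototypes: (C, D) 含可见与不可见类的语义原型。"""
    sim = F.normalize(visual_feats, dim=1) @ F.normalize(prototypes, dim=1).t()
    return sim.argmax(dim=1)   # 每个样本的预测类别索引
```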
【15】 Adapting to Unseen Vendor Domains for MRI Lesion Segmentation 标题:面向未见扫描仪供应商域的MRI病变分割自适应 链接:https://arxiv.org/abs/2108.06434
作者:Brandon Mac,Alan R. Moody,April Khademi 机构:Image Analysis in Medicine Lab (IAMLAB), Ryerson University, Toronto, ON, Canada, Department of Medical Imaging, University of Toronto, Toronto, ON, Canada, Department of Medical Imaging, Sunnybrook Health Sciences Centre, Toronto, ON, Canada 摘要:机器学习模型的一个关键限制是在训练分布范围之外的数据上表现不佳。这对于磁共振(MR)成像中的图像分析尤其如此,因为硬件和软件的变化会在扫描仪之间产生非标准的强度、对比度和噪声分布。最近,人们提出了图像转换模型来跨域增广数据,以创建合成数据点。在本文中,我们研究了应用无监督图像转换模型将源数据集的MR图像增广到目标数据集。具体而言,我们希望评估这些模型通过图像转换创建代表目标数据集的合成数据点的能力,并查看在这些合成数据点上训练的分割模型是否能接近直接在目标数据集上训练的模型的性能。我们考虑了三种数据集间增广配置:图像之间的转换、扫描仪供应商之间的转换、以及从标签到图像的转换。研究发现,在"从标签到图像"配置的合成数据上训练的分割模型与直接在目标数据集上训练的分割模型的性能最接近。在合成数据上训练时,每个目标供应商(GE、西门子、飞利浦)的Dice系数得分分别为0.63、0.64和0.58,而直接在目标数据集上训练的得分分别为0.65、0.72和0.61。 摘要:One of the key limitations in machine learning models is poor performance on data that is out of the domain of the training distribution. This is especially true for image analysis in magnetic resonance (MR) imaging, as variations in hardware and software create non-standard intensities, contrasts, and noise distributions across scanners. Recently, image translation models have been proposed to augment data across domains to create synthetic data points. In this paper, we investigate the application of an unsupervised image translation model to augment MR images from a source dataset to a target dataset. Specifically, we want to evaluate how well these models can create synthetic data points representative of the target dataset through image translation, and to see if a segmentation model trained on these synthetic data points would approach the performance of a model trained directly on the target dataset. We consider three configurations of augmentation between datasets consisting of translation between images, between scanner vendors, and from labels to images. It was found that the segmentation models trained on synthetic data from the labels-to-images configuration yielded the closest performance to the segmentation model trained directly on the target dataset. The Dice coefficient score per each target vendor (GE, Siemens, Philips) for training on synthetic data was 0.63, 0.64, and 0.58, compared to 0.65, 0.72, and 0.61 for training directly on the target dataset.
【16】 Soccer line mark segmentation with stochastic watershed transform 标题:基于随机分水岭变换的足球线标分割 链接:https://arxiv.org/abs/2108.06432
作者:Daniel Berjón,Carlos Cuevas,Narciso García 机构:Grupo de Tratamiento de Imágenes, Information Processing and Telecommunications, Center, Universidad Politécnica de Madrid, Spain 备注:30 pages, 17 figures 摘要:增强现实应用开始改变体育节目的播放方式,为球迷提供更丰富的体验和宝贵的见解。增强现实系统的第一步是摄像机校准,可能是基于检测比赛场地的线条标记。大多数现有的线检测方案依赖于边缘检测和Hough变换,但光学失真和外部边缘会导致线标记检测不准确或虚假。我们提出了一种基于随机分水岭变换的自动准确分割线标记的新策略,该变换对光学畸变具有鲁棒性,因为它不假设线的直线度,并且不受比赛场地中球员或球的影响。首先,整个运动场被分割,完全消除了看台和外围看板。然后提取线条标记。该策略已经在一个新的公共数据库上进行了测试,该数据库由来自五个体育场比赛的60幅带注释的图像组成。结果表明,本文提出的分割算法能够成功、准确地检测出大多数线标记像素。 摘要:Augmented reality applications are beginning to change the way sports are broadcast, providing richer experiences and valuable insights to fans. The first step of augmented reality systems is camera calibration, possibly based on detecting the line markings of the field of play. Most existing proposals for line detection rely on edge detection and Hough transform, but optical distortion and extraneous edges cause inaccurate or spurious detections of line markings. We propose a novel strategy to automatically and accurately segment line markings based on a stochastic watershed transform that is robust to optical distortions, since it makes no assumptions about line straightness, and is unaffected by the presence of players or the ball in the field of play. Firstly, the playing field as a whole is segmented completely eliminating the stands and perimeter boards. Then the line markings are extracted. The strategy has been tested on a new and public database composed by 60 annotated images from matches in five stadiums. The results obtained have proven that the proposed segmentation algorithm allows successful and precise detection of most line mark pixels.
【17】 DensePASS: Dense Panoramic Semantic Segmentation via Unsupervised Domain Adaptation with Attention-Augmented Context Exchange 标题:DensePASS:基于注意力增强上下文交换的无监督领域适配的密集全景语义分割 链接:https://arxiv.org/abs/2108.06383
作者:Chaoxiang Ma,Jiaming Zhang,Kailun Yang,Alina Roitberg,Rainer Stiefelhagen 机构: in part by the University of Excellence through the “KITFuture Fields” project, ) 1Authors are with Institute for Anthropomatics and Robotics, KarlsruheInstitute of Technology 备注:Accepted to IEEE ITSC 2021. Dataset and code will be made publicly available at this https URL 摘要:智能车辆显然受益于360度传感器的扩展视野(FoV),但绝大多数可用的语义分割训练图像都是用针孔相机拍摄的。在这项工作中,我们通过领域自适应的视角来看待这个问题,并将全景语义分割引入到一个场景中,其中标记的训练数据来自于传统针孔相机图像的不同分布。首先,我们形式化了全景语义分割的无监督域适配任务,其中,根据针孔相机数据源域中的标记示例训练的网络部署在全景图像的不同目标域中,而没有可用的标记。为了验证这一想法,我们收集并公开发布了DensePASS——一种用于跨域条件下全景分割的新型密集注释数据集,专门用于研究针孔到全景的转换,并附带了从城市景观中获得的针孔相机训练示例。DensePASS涵盖标记和未标记的360度图像,标记数据包括19个类别,这些类别明确符合源域(即针孔)数据中可用的类别。为了应对领域转移的挑战,我们利用当前基于注意机制的进展,基于不同的注意增强域自适应模块,构建了跨领域全景语义分割的通用框架。我们的框架在学习域对应时促进了局部和全局级别的信息交换,并将两个标准分段网络的平均IoU域自适应性能提高了6.05%和11.26%。 摘要:Intelligent vehicles clearly benefit from the expanded Field of View (FoV) of the 360-degree sensors, but the vast majority of available semantic segmentation training images are captured with pinhole cameras. In this work, we look at this problem through the lens of domain adaptation and bring panoramic semantic segmentation to a setting, where labelled training data originates from a different distribution of conventional pinhole camera images. First, we formalize the task of unsupervised domain adaptation for panoramic semantic segmentation, where a network trained on labelled examples from the source domain of pinhole camera data is deployed in a different target domain of panoramic images, for which no labels are available. To validate this idea, we collect and publicly release DensePASS - a novel densely annotated dataset for panoramic segmentation under cross-domain conditions, specifically built to study the Pinhole-to-Panoramic transfer and accompanied with pinhole camera training examples obtained from Cityscapes. DensePASS covers both, labelled- and unlabelled 360-degree images, with the labelled data comprising 19 classes which explicitly fit the categories available in the source domain (i.e. pinhole) data. To meet the challenge of domain shift, we leverage the current progress of attention-based mechanisms and build a generic framework for cross-domain panoramic semantic segmentation based on different variants of attention-augmented domain adaptation modules. Our framework facilitates information exchange at local- and global levels when learning the domain correspondences and improves the domain adaptation performance of two standard segmentation networks by 6.05% and 11.26% in Mean IoU.
Zero/Few Shot|迁移|域适配|自适应(7篇)
【1】 PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation 标题:PIT:用于跨视图域自适应的位置不变变换 链接:https://arxiv.org/abs/2108.07142
作者:Qiqi Gu,Qianyu Zhou,Minghao Xu,Zhengyang Feng,Guangliang Cheng,Xuequan Lu,Jianping Shi,Lizhuang Ma 机构:Shanghai Jiao Tong University,SenseTime Group Research, Deakin University,East China Normal University,Shanghai AI Laboratory, MoE Key Lab of Artificial Intelligence, SJTU,Qing Yuan Research Institute, SJTU 备注:Accepted to ICCV 2021. Code is available at this https URL 摘要:近年来,跨领域目标检测和语义分割取得了令人瞩目的进展。现有的方法主要考虑由外部环境引起的域转换,包括背景、照明或天气的变化,而不同的摄像机内参数通常出现在不同的域中,并且它们对域适应的影响很少被探索。在本文中,我们观察到视场(FoV)间隙在源域和目标域之间引起明显的实例外观差异。我们进一步发现,在FoV增加(源FoV<目标FoV)和FoV降低的情况下,两个域之间的FoV差距会损害域适应性能。受观察结果的启发,我们提出位置不变变换(PIT)来更好地对齐不同域中的图像。我们还引入了一个反向PIT,用于将变换/对齐的图像映射回原始图像空间,并设计了一个损失重加权策略来加速训练过程。我们的方法可以很容易地插入现有的跨域检测/分割框架,同时带来的计算开销可以忽略不计。大量的实验表明,我们的方法可以有效地提高现有技术在跨域目标检测和分割方面的性能。我们的代码可在https://github.com/sheepooo/PIT-Position-Invariant-Transform. 摘要:Cross-domain object detection and semantic segmentation have witnessed impressive progress recently. Existing approaches mainly consider the domain shift resulting from external environments including the changes of background, illumination or weather, while distinct camera intrinsic parameters appear commonly in different domains, and their influence for domain adaptation has been very rarely explored. In this paper, we observe that the Field of View (FoV) gap induces noticeable instance appearance differences between the source and target domains. We further discover that the FoV gap between two domains impairs domain adaptation performance under both the FoV-increasing (source FoV < target FoV) and FoV-decreasing cases. Motivated by the observations, we propose the Position-Invariant Transform (PIT) to better align images in different domains. We also introduce a reverse PIT for mapping the transformed/aligned images back to the original image space and design a loss re-weighting strategy to accelerate the training process. Our method can be easily plugged into existing cross-domain detection/segmentation frameworks while bringing about negligible computational overhead. Extensive experiments demonstrate that our method can soundly boost the performance on both cross-domain object detection and segmentation for state-of-the-art techniques. Our code is available at https://github.com/sheepooo/PIT-Position-Invariant-Transform.
【2】 Deep Self-Adaptive Hashing for Image Retrieval 标题:深度自适应散列在图像检索中的应用 链接:https://arxiv.org/abs/2108.07094
作者:Qinghong Lin,Xiaojun Chen,Qin Zhang,Shangxuan Tian,Yudong Chen 机构:Shenzhen University,Tencent,The University of Queensland 备注:10 pages, 11 figures, 4 tables 摘要:哈希技术由于其计算效率和存储效率在图像检索中得到了广泛的应用。近年来,由于人工标注的高成本和深度学习技术的优越性,深度无监督哈希方法越来越受到人们的关注。然而,大多数深度无监督哈希方法通常预先计算一个相似矩阵,在预训练的特征空间中对成对关系进行建模。然后,该相似矩阵将用于指导哈希学习,其中大多数数据对被等价地处理。上述过程存在以下缺陷:1)预先计算的相似度矩阵是不可修改的,且与哈希学习过程脱节,无法挖掘底层语义信息。2)信息丰富的数据对可能被大量信息较少的数据对淹没。为了解决上述问题,我们提出了一个深度自适应哈希(Deep Self-Adaptive Hashing, DSAH)模型,通过两种特殊设计自适应地捕获语义信息:自适应邻居发现(Adaptive Neighbor Discovery, AND)和成对信息量(Pairwise Information Content, PIC)。首先,我们采用AND初始构造一个基于邻域的相似度矩阵,然后使用一种新的更新策略对该初始相似度矩阵进行细化,以进一步研究学习表示背后的语义结构。其次,我们使用PIC度量数据对的优先级,并为它们分配自适应权重,这基于如下假设:差异更大的数据对包含更多可用于哈希学习的判别信息。在多个基准数据集上的大量实验表明,上述两种技术有助于深度哈希模型以自适应方式实现优异的性能。 摘要:Hashing technology has been widely used in image retrieval due to its computational and storage efficiency. Recently, deep unsupervised hashing methods have attracted increasing attention due to the high cost of human annotations in the real world and the superiority of deep learning technology. However, most deep unsupervised hashing methods usually pre-compute a similarity matrix to model the pairwise relationship in the pre-trained feature space. Then this similarity matrix would be used to guide hash learning, in which most of the data pairs are treated equivalently. The above process is confronted with the following defects: 1) The pre-computed similarity matrix is inalterable and disconnected from the hash learning process, which cannot explore the underlying semantic information. 2) The informative data pairs may be buried by the large number of less-informative data pairs. To solve the aforementioned problems, we propose a Deep Self-Adaptive Hashing (DSAH) model to adaptively capture the semantic information with two special designs: Adaptive Neighbor Discovery (AND) and Pairwise Information Content (PIC). Firstly, we adopt the AND to initially construct a neighborhood-based similarity matrix, and then refine this initial similarity matrix with a novel update strategy to further investigate the semantic structure behind the learned representation. Secondly, we measure the priorities of data pairs with PIC and assign adaptive weights to them, which relies on the assumption that more dissimilar data pairs contain more discriminative information for hash learning. Extensive experiments on several benchmark datasets demonstrate that the above two technologies facilitate the deep hashing model to achieve superior performance in a self-adaptive manner.
【3】 Structure-Aware Feature Generation for Zero-Shot Learning 标题:面向Zero-Shot学习的结构感知特征生成 链接:https://arxiv.org/abs/2108.07032
作者:Lianbo Zhang,Shaoli Huang,Xinchao Wang,Wei Liu,Dacheng Tao 机构:Shaoli Huang is with School of Computer Science, University of Sydney 摘要:Zero-Shot学习(Zero-Shot Learning,ZSL)的目标是通过利用辅助信息(如属性嵌入)来识别看不见的类别。尽管取得了令人鼓舞的结果,以前的ZSL方法侧重于提高已见类(seen-class)特征的判别能力,但在很大程度上忽略了样本和原型的几何结构。因此,后续的基于属性的生成性对抗网络(GAN)在样本生成过程中也忽略了拓扑信息,从而在对不可见类的视觉特征进行分类时产生了较差的性能。在本文中,我们介绍了一种新的结构感知特征生成方案,称为SA-GAN,以明确说明在学习潜在空间和生成网络时的拓扑结构。具体地说,我们在学习辨别性潜在空间时引入约束损失来保持初始几何结构,并使用来自结构感知鉴别器和重构模块的附加监督信号来执行我们的GAN训练。前者根据与类原型的亲和力区分假样本和真样本,后者则根据生成的潜在空间重构原始特征空间。这种拓扑保持机制使我们的方法能够显著增强对不可见类的泛化能力,从而提高分类性能。在四个基准上的实验表明,所提出的方法始终优于现有的方法。我们的代码可以在补充材料中找到,也将公开提供。 摘要:Zero-Shot Learning (ZSL) targets at recognizing unseen categories by leveraging auxiliary information, such as attribute embedding. Despite the encouraging results achieved, prior ZSL approaches focus on improving the discriminant power of seen-class features, yet have largely overlooked the geometric structure of the samples and the prototypes. The subsequent attribute-based generative adversarial network (GAN), as a result, also neglects the topological information in sample generation and further yields inferior performances in classifying the visual features of unseen classes. In this paper, we introduce a novel structure-aware feature generation scheme, termed as SA-GAN, to explicitly account for the topological structure in learning both the latent space and the generative networks. Specifically, we introduce a constraint loss to preserve the initial geometric structure when learning a discriminative latent space, and carry out our GAN training with additional supervising signals from a structure-aware discriminator and a reconstruction module. The former supervision distinguishes fake and real samples based on their affinity to class prototypes, while the latter aims to reconstruct the original feature space from the generated latent space. This topology-preserving mechanism enables our method to significantly enhance the generalization capability on unseen-classes and consequently improve the classification performance. Experiments on four benchmarks demonstrate that the proposed approach consistently outperforms the state of the art. Our code can be found in the supplementary material and will also be made publicly available.
【4】 End-to-End Adaptive Monte Carlo Denoising and Super-Resolution 标题:端到端自适应蒙特卡罗去噪和超分辨率 链接:https://arxiv.org/abs/2108.06915
作者:Xinyue Wei,Haozhi Huang,Yujin Shi,Hongliang Yuan,Li Shen,Jue Wang 摘要:经典的蒙特卡罗路径跟踪可以在大量计算的代价下实现高质量的渲染。最近的工作利用深度神经网络来加速这一过程,通过在后处理中使用超分辨率或去噪神经网络来改善低分辨率或更少的样本渲染。然而,在以前的工作中,去噪和超分辨率仅被单独考虑。我们在这项工作中表明,在后处理中,联合超分辨率和去噪(SRD)可以进一步加速蒙特卡罗路径跟踪。这种新型的联合滤波仅允许通过路径跟踪渲染低分辨率和较少样本(因此有噪声)的图像,然后将路径跟踪馈入深度神经网络以生成高分辨率和干净的图像。这项工作的主要贡献是一种新的端到端网络体系结构,专门为SRD任务设计。它包含两个具有共享组件的级联级。我们发现,去噪和超分辨率需要非常不同的感受野,这是导致在网络设计中引入可变形卷积的关键洞察。大量的实验表明,该方法的性能优于以前的SRD任务所采用的方法及其变体。 摘要:The classic Monte Carlo path tracing can achieve high quality rendering at the cost of heavy computation. Recent works make use of deep neural networks to accelerate this process, by improving either low-resolution or fewer-sample rendering with super-resolution or denoising neural networks in post-processing. However, denoising and super-resolution have only been considered separately in previous work. We show in this work that Monte Carlo path tracing can be further accelerated by joint super-resolution and denoising (SRD) in post-processing. This new type of joint filtering allows only a low-resolution and fewer-sample (thus noisy) image to be rendered by path tracing, which is then fed into a deep neural network to produce a high-resolution and clean image. The main contribution of this work is a new end-to-end network architecture, specifically designed for the SRD task. It contains two cascaded stages with shared components. We discover that denoising and super-resolution require very different receptive fields, a key insight that leads to the introduction of deformable convolution into the network design. Extensive experiments show that the proposed method outperforms previous methods and their variants adopted for the SRD task.
【5】 SCIDA: Self-Correction Integrated Domain Adaptation from Single- to Multi-label Aerial Images 标题:SCIDA:从单标签航空影像到多标签航空影像的自校正综合域自适应 链接:https://arxiv.org/abs/2108.06810
作者:Tianze Yu,Jianzhe Lin,Lichao Mou,Yuansheng Hua,Xiaoxiang Zhu,Z. Jane Wang 机构:Department of Electrical and Computer Engineering, University of British Columbia 摘要:大多数公开的图像分类数据集都是单标签的,而在我们的日常生活中,图像本身就是多标签的。这种注释缺口使得许多预先训练的单标签分类模型在实际场景中失败。这一标注问题在航空影像中尤为突出:从传感器收集的航空数据自然覆盖了具有多个标签的相对较大的陆地区域,而可公开获取的注释航空数据集(例如UCM、AID)是单标签的。针对人工标注多标签航空影像耗时费力的问题,提出了一种新的自动多标签学习自校正集成域自适应(SCIDA)方法。SCIDA是弱监督的,即利用大量公开的单标签图像自动学习多标签图像分类模型。为了实现这一目标,我们提出了一种新的标签自校正(LWC)模块,以更好地探索潜在的标签相关性。该模块还使从单标签数据到多标签数据的无监督域自适应(UDA)成为可能。对于模型训练,该模型只使用单标签信息,而不需要多标签数据的先验知识;并预测多标签航空图像的标签。在我们的实验中,使用单标签的MAI-AID-s和MAI-UCM-s数据集进行训练,在我们收集的多场景航空影像(MAI)数据集上直接测试了所提出的模型。 摘要:Most publicly available datasets for image classification are with single labels, while images are inherently multi-labeled in our daily life. Such an annotation gap makes many pre-trained single-label classification models fail in practical scenarios. This annotation issue is more concerned for aerial images: Aerial data collected from sensors naturally cover a relatively large land area with multiple labels, while annotated aerial datasets, which are publicly available (e.g., UCM, AID), are single-labeled. As manually annotating multi-label aerial images would be time/labor-consuming, we propose a novel self-correction integrated domain adaptation (SCIDA) method for automatic multi-label learning. SCIDA is weakly supervised, i.e., automatically learning the multi-label image classification model from using massive, publicly available single-label images. To achieve this goal, we propose a novel Label-Wise self-Correction (LWC) module to better explore underlying label correlations. This module also makes the unsupervised domain adaptation (UDA) from single- to multi-label data possible. For model training, the proposed model only uses single-label information yet requires no prior knowledge of multi-labeled data; and it predicts labels for multi-label aerial images. In our experiments, trained with single-labeled MAI-AID-s and MAI-UCM-s datasets, the proposed model is tested directly on our collected Multi-scene Aerial Image (MAI) dataset.
【6】 Towards Category and Domain Alignment: Category-Invariant Feature Enhancement for Adversarial Domain Adaptation 标题:面向类别和领域对齐:对抗性领域自适应的类别不变特征增强 链接:https://arxiv.org/abs/2108.06583
作者:Yuan Wu,Diana Inkpen,Ahmed El-Roby 机构:Carleton University, University of Ottawa 备注:10 pages, 4 figures 摘要:对抗性领域适应通过调整两个领域的特征分布,在将知识从源领域转移到目标领域方面取得了令人印象深刻的进展。这些方法侧重于最小化域发散,并将适应性(作为理想联合假设在这两个域上的预期误差)视为一个小常数。然而,这些方法仍然面临两个问题:(1)对抗性领域对齐扭曲了原始特征分布,降低了适应性(2) 将特征表示转换为域不变需要牺牲特定于域的变化,从而导致较弱的可辨别性。为了缓解这些问题,我们提出了类别不变特征增强(CIFE),这是一种通过优化适应性来增强对抗领域适应性的通用机制。具体来说,CIFE方法引入了类别不变特征,在保持可转移性的前提下提高了域不变特征的可分辨性。实验表明,CIFE可以改进具有代表性的对抗性领域适应方法,在五个基准上产生最先进的结果。 摘要:Adversarial domain adaptation has made impressive advances in transferring knowledge from the source domain to the target domain by aligning feature distributions of both domains. These methods focus on minimizing domain divergence and regard the adaptability, which is measured as the expected error of the ideal joint hypothesis on these two domains, as a small constant. However, these approaches still face two issues: (1) Adversarial domain alignment distorts the original feature distributions, deteriorating the adaptability; (2) Transforming feature representations to be domain-invariant needs to sacrifice domain-specific variations, resulting in weaker discriminability. In order to alleviate these issues, we propose category-invariant feature enhancement (CIFE), a general mechanism that enhances the adversarial domain adaptation through optimizing the adaptability. Specifically, the CIFE approach introduces category-invariant features to boost the discriminability of domain-invariant features with preserving the transferability. Experiments show that the CIFE could improve upon representative adversarial domain adaptation methods to yield state-of-the-art results on five benchmarks.
【7】 Transfer Learning from an Artificial Radiograph-landmark Dataset for Registration of the Anatomic Skull Model to Dual Fluoroscopic X-ray Images 标题:基于人工X线片-标志点数据集的迁移学习:解剖颅骨模型到双透视X射线图像的配准 链接:https://arxiv.org/abs/2108.06466
作者:Chaochao Zhou,Thomas Cha,Yun Peng,Guoan Li 机构:Orthopaedic Bioengineering Research Center, Newton-Wellesley Hospital and Harvard Medical School, Newton, MA, USA; Department of Orthopaedic Surgery, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA 摘要:三维解剖结构与二维双透视X射线图像的配准是一种广泛应用的运动跟踪技术。然而,深度学习的实施往往受到缺乏医学图像和真值标注的阻碍。在这项研究中,我们提出了一种基于人工数据集训练的深度神经网络的三维到二维配准迁移学习策略。从女性受试者的颅颈CT数据自动创建数字重建X线照片(DRR)和头颅X线标志。他们被用来训练用于地标检测的残差网络(ResNet)和循环生成对抗网络(GAN),以消除DRR和实际X射线之间的风格差异。经GAN风格迁移后的X射线上的地标由ResNet检测,并被用于三角剖分优化,以便在实际双荧光镜图像中对颅骨进行三维到二维配准(使用非正交设置、点X射线源、图像失真和部分捕获的颅骨区域)。在多个颅颈运动场景中评估配准精度。在行走时,基于学习的颅骨配准的角度/位置误差为3.9 ± 2.1度 / 4.6 ± 2.2毫米。然而,在功能性颈部活动期间,由于在末端位置的双透视图像上成像的颅骨区域过小,准确性较低。策略性地增加人工训练数据的方法可以处理复杂的颅骨配准场景,并且有潜力扩展到广泛的配准场景。 摘要:Registration of 3D anatomic structures to their 2D dual fluoroscopic X-ray images is a widely used motion tracking technique. However, deep learning implementation is often impeded by a paucity of medical images and ground truths. In this study, we proposed a transfer learning strategy for 3D-to-2D registration using deep neural networks trained from an artificial dataset. Digitally reconstructed radiographs (DRRs) and radiographic skull landmarks were automatically created from craniocervical CT data of a female subject. They were used to train a residual network (ResNet) for landmark detection and a cycle generative adversarial network (GAN) to eliminate the style difference between DRRs and actual X-rays. Landmarks on the X-rays experiencing GAN style translation were detected by the ResNet, and were used in triangulation optimization for 3D-to-2D registration of the skull in actual dual-fluoroscope images (with a non-orthogonal setup, point X-ray sources, image distortions, and partially captured skull regions). The registration accuracy was evaluated in multiple scenarios of craniocervical motions. In walking, learning-based registration for the skull had angular/position errors of 3.9 ± 2.1 deg / 4.6 ± 2.2 mm. However, the accuracy was lower during functional neck activity, due to overly small skull regions imaged on the dual fluoroscopic images at end-range positions. The methodology to strategically augment artificial training data can tackle the complicated skull registration scenario, and has potentials to extend to widespread registration scenarios.
半弱无监督|主动学习|不确定性(8篇)
【1】 Improving Self-supervised Learning with Hardness-aware Dynamic Curriculum Learning: An Application to Digital Pathology 标题:用硬度意识动态课程学习改进自我监督学习:在数字病理学中的应用 链接:https://arxiv.org/abs/2108.07183
作者:Chetan L Srinidhi,Anne L Martel 机构:Physical Sciences, Sunnybrook Research Institute, Toronto, Canada, Department of Medical Biophysics, University of Toronto, Canada 备注:Accepted at ICCV 2021 CDpath workshop 摘要:自监督学习(SSL)最近显示出巨大的潜力,可以学习对许多图像分析任务有用的通用视觉表示。尽管它们取得了显著的成功,但当标记的训练实例数量较少或传输域之间的域转移显著时,现有的SSL方法无法推广到下游任务。在本文中,我们试图通过课程学习的视角,通过提出一种硬度感知的动态课程学习(HaDCL)方法来改进自我监督的预训练表征。为了提高SSL的健壮性和通用性,我们在小批量下游微调过程中通过易到难和难到非常难的样本动态地利用渐进式较难示例。我们发现,通过循序渐进的阶段性课程学习,预训练表征得到显著增强,并适用于域内和域外分布数据。我们对三个组织学基准数据集进行了广泛验证,包括补丁级分类和切片级分类问题。与标准微调相比,我们基于课程的微调产生了显著的改进,在域内和域外分布数据上,曲线下面积(AUC)分数的最小改进分别为1.7%和2.2%。此外,我们的经验表明,我们的方法更通用,适用于任何SSL方法,并且不会增加任何额外的开销复杂性。此外,我们还概述了基于补丁与基于切片的课程学习在组织病理学中的作用,以提供基于课程的SSL方法微调成功的实际见解。代码将发布于 https://github.com/srinidhiPY/ICCVCDPATH2021-ID-8 摘要:Self-supervised learning (SSL) has recently shown tremendous potential to learn generic visual representations useful for many image analysis tasks. Despite their notable success, the existing SSL methods fail to generalize to downstream tasks when the number of labeled training instances is small or if the domain shift between the transfer domains is significant. In this paper, we attempt to improve self-supervised pretrained representations through the lens of curriculum learning by proposing a hardness-aware dynamic curriculum learning (HaDCL) approach. To improve the robustness and generalizability of SSL, we dynamically leverage progressive harder examples via easy-to-hard and hard-to-very-hard samples during mini-batch downstream fine-tuning. We discover that by progressive stage-wise curriculum learning, the pretrained representations are significantly enhanced and adaptable to both in-domain and out-of-domain distribution data. We performed extensive validation on three histology benchmark datasets on both patch-wise and slide-level classification problems. Our curriculum based fine-tuning yields a significant improvement over standard fine-tuning, with a minimum improvement in area-under-the-curve (AUC) score of 1.7% and 2.2% on in-domain and out-of-domain distribution data, respectively. Further, we empirically show that our approach is more generic and adaptable to any SSL methods and does not impose any additional overhead complexity. Besides, we also outline the role of patch-based versus slide-based curriculum learning in histopathology to provide practical insights into the success of curriculum based fine-tuning of SSL methods. Code will be released at https://github.com/srinidhiPY/ICCVCDPATH2021-ID-8
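其中"由易到难"的渐进式课程可以用如下自定步速(self-paced)草图直观说明(通用课程学习写法,并非HaDCL的精确调度;frac_start 与 total_steps 均为假设的超参数):

```python
import numpy as np

def curriculum_indices(losses, step, total_steps, frac_start=0.5):
    # 示意性草图:微调早期只保留当前损失最小(最容易)的一部分样本,
    # 随训练进行逐步纳入更难的样本,直至使用整个 mini-batch。
    n = len(losses)
    frac = frac_start + (1.0 - frac_start) * min(step / total_steps, 1.0)
    k = max(1, int(frac * n))
    return np.argsort(losses)[:k]  # 参与本次梯度更新的样本下标
```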
【2】 Semi-Supervised Siamese Network for Identifying Bad Data in Medical Imaging Datasets 标题:医学影像数据集中识别不良数据的半监督暹罗网络 链接:https://arxiv.org/abs/2108.07130
作者:Niamh Belton,Aonghus Lawlor,Kathleen M. Curran 机构:Science Foundation Ireland Centre for Research Training in Machine Learning, School of Medicine,School of Computer Science, University College Dublin, Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland 备注:None 摘要:医学成像数据集中存在的噪声数据通常有助于开发能够处理真实世界数据的稳健模型。但是,如果不良数据包含的解剖信息不足,可能会对模型的性能产生严重的负面影响。我们提出了一种使用半监督暹罗网络识别不良数据的新方法。这种方法只需要一小部分“参考”医学图像就可以由非专家人员进行审查,以确保主要解剖结构出现在视野中。该模型在此参考集上进行训练,并通过使用暹罗网络计算参考集与数据集中所有其他医学图像之间的距离来识别不良数据。该方法实现了0.989的曲线下面积(AUC),用于识别不良数据。代码将发布于 https://git.io/JYFuV。 摘要:Noisy data present in medical imaging datasets can often aid the development of robust models that are equipped to handle real-world data. However, if the bad data contains insufficient anatomical information, it can have a severe negative effect on the model's performance. We propose a novel methodology using a semi-supervised Siamese network to identify bad data. This method requires only a small pool of 'reference' medical images to be reviewed by a non-expert human to ensure the major anatomical structures are present in the Field of View. The model trains on this reference set and identifies bad data by using the Siamese network to compute the distance between the reference set and all other medical images in the dataset. This methodology achieves an Area Under the Curve (AUC) of 0.989 for identifying bad data. Code will be available at https://git.io/JYFuV.
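其"与参考集比较距离"这一步可以用如下草图说明(假设嵌入已由训练好的暹罗网络给出;用最近参考样本的欧氏距离做聚合只是一种示意性选择,也可改用平均距离):

```python
import numpy as np

def bad_data_scores(ref_emb, emb):
    # 示意性草图:计算每幅图像到"参考"良好图像集合的嵌入距离,
    # 距离越大越可能是解剖信息不足的不良数据,可按阈值筛查。
    d = np.linalg.norm(emb[:, None, :] - ref_emb[None, :, :], axis=-1)
    return d.min(axis=1)  # 到最近参考图像的距离
```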
【3】 SSH: A Self-Supervised Framework for Image Harmonization 标题:SSH:一种自监督的图像调和框架 链接:https://arxiv.org/abs/2108.06805
作者:Yifan Jiang,He Zhang,Jianming Zhang,Yilin Wang,Zhe Lin,Kalyan Sunkavalli,Simon Chen,Sohrab Amirghodsi,Sarah Kong,Zhangyang Wang 机构:†The University of Texas at Austin, ‡Adobe Inc. 备注:Accepted by ICCV'2021 摘要:图像协调旨在通过匹配前景图像和背景图像之间的“外观”(例如色调、亮度和对比度),提高图像合成的质量。然而,为此任务收集大规模带注释的数据集需要复杂的专业修饰。相反,我们提出了一种新的自我监督协调框架(SSH),它可以只使用“自由”自然图像进行训练,而无需编辑。我们从表示融合的角度重新描述了图像协调问题,分别处理前景和背景示例,以解决背景遮挡问题。该框架设计允许使用双重数据增强方法,通过使用3D颜色查找表(LUT)对图像进行扰动裁剪,可以生成不同的[前景、背景、伪GT]三元组。此外,我们还构建了一个由专家用户精心创建的真实世界的协调数据集,用于评估和基准测试。我们的结果表明,所提出的自监督方法在参考指标、视觉质量和主题用户研究方面优于以前的最新方法。代码和数据集见 https://github.com/VITA-Group/SSHarmonization。 摘要:Image harmonization aims to improve the quality of image compositing by matching the "appearance" (e.g., color tone, brightness and contrast) between foreground and background images. However, collecting large-scale annotated datasets for this task requires complex professional retouching. Instead, we propose a novel Self-Supervised Harmonization framework (SSH) that can be trained using just "free" natural images without being edited. We reformulate the image harmonization problem from a representation fusion perspective, which separately processes the foreground and background examples, to address the background occlusion issue. This framework design allows for a dual data augmentation method, where diverse [foreground, background, pseudo GT] triplets can be generated by cropping an image with perturbations using 3D color lookup tables (LUTs). In addition, we build a real-world harmonization dataset as carefully created by expert users, for evaluation and benchmarking purposes. Our results show that the proposed self-supervised method outperforms previous state-of-the-art methods in terms of reference metrics, visual quality, and subject user study. Code and dataset are available at https://github.com/VITA-Group/SSHarmonization.
【4】 Self-supervised Contrastive Learning of Multi-view Facial Expressions 标题:多视点面部表情的自监督对比学习 链接:https://arxiv.org/abs/2108.06723
作者:Shuvendu Roy,Ali Etemad 机构:Department of Electrical and Computer Engineering & Ingenuity Labs Research Institute, Queen’s University, Kingston, Canada 备注:Accepted by 23rd ACM International Conference on Multimodal Interaction (ICMI 2021) 摘要:人脸表情识别(FER)已成为人机交互系统的重要组成部分。尽管FER最近有所进步,但非正面面部图像的性能通常会显著下降。我们提出多视角面部表情对比学习(CL-MEx)来利用从不同角度同时捕获的面部图像。CL MEx是一个两步训练框架。在第一步中,使用所提出的自监督对比损失对编码器网络进行预训练,学习为主体的不同视图生成视图不变嵌入。然后,在有监督的环境中使用标记数据对模型进行微调。我们在两个多视图FER数据集KDEF和DDCF上演示了所提出的方法的性能,在这两个数据集上实现了最先进的性能。进一步的实验表明,我们的方法在处理具有挑战性的角度和减少标记数据量方面具有鲁棒性。 摘要:Facial expression recognition (FER) has emerged as an important component of human-computer interaction systems. Despite recent advancements in FER, performance often drops significantly for non-frontal facial images. We propose Contrastive Learning of Multi-view facial Expressions (CL-MEx) to exploit facial images captured simultaneously from different angles towards FER. CL-MEx is a two-step training framework. In the first step, an encoder network is pre-trained with the proposed self-supervised contrastive loss, where it learns to generate view-invariant embeddings for different views of a subject. The model is then fine-tuned with labeled data in a supervised setting. We demonstrate the performance of the proposed method on two multi-view FER datasets, KDEF and DDCF, where state-of-the-art performances are achieved. Further experiments show the robustness of our method in dealing with challenging angles and reduced amounts of labeled data.
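其中"同一受试者的不同视角互为正样本"的对比损失,可用下面的简化草图说明(SupCon式写法,并非论文损失的精确形式;tau为假设的温度超参数,输入假定已做L2归一化):

```python
import numpy as np

def multiview_contrastive_loss(z, subject_ids, tau=0.1):
    # 示意性草图:同一 subject 的不同视角为正样本对,其余样本为负样本。
    subject_ids = np.asarray(subject_ids)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                  # 排除自身相似度
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    prob = e / e.sum(axis=1, keepdims=True)         # 对候选样本做 softmax
    pos = subject_ids[:, None] == subject_ids[None, :]
    np.fill_diagonal(pos, False)
    losses = [-np.log(prob[i][pos[i]] + 1e-12).mean()
              for i in range(len(z)) if pos[i].any()]
    return float(np.mean(losses))
```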
【5】 Unsupervised Disentanglement without Autoencoding: Pitfalls and Future Directions 标题:不带自动编码的无监督解缠:陷阱和未来方向 链接:https://arxiv.org/abs/2108.06613
作者:Andrea Burns,Aaron Sarna,Dilip Krishnan,Aaron Maschinot 备注:Accepted at the ICML 2021 Self-Supervised Learning for Reasoning and Perception Workshop 摘要:解缠的视觉表示主要是通过生成模型(如变分自动编码器(VAE))来研究的。虽然之前的工作主要集中在用于解缠表示学习的生成方法上,但由于生成模型的当前限制,这些方法不能扩展到大型数据集。相反,我们探索了使用对比学习的正则化方法,这可能会导致对大规模数据集和下游应用足够强大的解缠表示。然而,我们发现,由于对优化和初始化敏感,无监督解缠很难实现,并需要在任务性能上做出权衡。我们通过下游任务评估解缠效果,分析了每种正则化方法的优缺点,并讨论了未来的方向。 摘要:Disentangled visual representations have largely been studied with generative models such as Variational AutoEncoders (VAEs). While prior work has focused on generative methods for disentangled representation learning, these approaches do not scale to large datasets due to current limitations of generative models. Instead, we explore regularization methods with contrastive learning, which could result in disentangled representations that are powerful enough for large scale datasets and downstream applications. However, we find that unsupervised disentanglement is difficult to achieve due to optimization and initialization sensitivity, with trade-offs in task performance. We evaluate disentanglement with downstream tasks, analyze the benefits and disadvantages of each regularization used, and discuss future directions.
【6】 Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization 标题:弱监督时间动作定位的前景-动作一致性网络 链接:https://arxiv.org/abs/2108.06524
作者:Linjiang Huang,Liang Wang,Hongsheng Li 机构:Multimedia Laboratory, The Chinese University of Hong Kong, Centre for Perceptual and Interactive Intelligence, Hong Kong, Institute of Automation, Chinese Academy of Sciences 备注:Accepted by ICCV 2021. Code is available at this https URL 摘要:作为一项具有挑战性的高级视频理解任务,弱监督时间动作定位越来越受到人们的关注。由于只有视频注释,大多数现有方法试图通过分类框架进行本地化来处理此任务,该框架通常采用选择器来选择动作概率较高的片段或前景。然而,现有的前景选择策略主要局限于只考虑前景与动作之间的单向关系,不能保证前景动作的一致性。在本文中,我们提出了一个基于I3D主干的FAC-Net框架,在该框架上增加了三个分支,分别为类前景分类分支、类无关注意分支和多实例学习分支。首先,我们的类级前景分类分支将动作和前景之间的关系规则化,以最大限度地实现前景-背景分离。此外,采用类无关注意分支和多实例学习分支对前景动作一致性进行正则化,帮助学习有意义的前景分类器。在每个分支中,我们引入了一种混合注意机制,该机制为每个片段计算多个注意分数,以关注区分性和不那么区分性的片段,从而捕获完整的动作边界。在THUMOS14和ActivityNet1.3上的实验结果证明了我们方法的最新性能。我们的代码可在https://github.com/LeonHLJ/FAC-Net. 摘要:As a challenging task of high-level video understanding, weakly supervised temporal action localization has been attracting increasing attention. With only video annotations, most existing methods seek to handle this task with a localization-by-classification framework, which generally adopts a selector to select snippets of high probabilities of actions or namely the foreground. Nevertheless, the existing foreground selection strategies have a major limitation of only considering the unilateral relation from foreground to actions, which cannot guarantee the foreground-action consistency. In this paper, we present a framework named FAC-Net based on the I3D backbone, on which three branches are appended, named class-wise foreground classification branch, class-agnostic attention branch and multiple instance learning branch. First, our class-wise foreground classification branch regularizes the relation between actions and foreground to maximize the foreground-background separation. Besides, the class-agnostic attention branch and multiple instance learning branch are adopted to regularize the foreground-action consistency and help to learn a meaningful foreground classifier. Within each branch, we introduce a hybrid attention mechanism, which calculates multiple attention scores for each snippet, to focus on both discriminative and less-discriminative snippets to capture the full action boundaries. Experimental results on THUMOS14 and ActivityNet1.3 demonstrate the state-of-the-art performance of our method. Our code is available at https://github.com/LeonHLJ/FAC-Net.
【7】 Collaborative Unsupervised Visual Representation Learning from Decentralized Data 标题:分散数据的协作式无监督视觉表示学习 链接:https://arxiv.org/abs/2108.06492
作者:Weiming Zhuang,Xin Gan,Yonggang Wen,Shuai Zhang,Shuai Yi 机构:S-Lab, Nanyang Technological University,Nanyang Technological University,SenseTime Research 备注:ICCV'21 摘要:利用互联网上的集中数据,无监督表征学习取得了优异的成绩。然而,隐私保护意识的提高限制了分散的未标记图像数据的共享,这些数据在多方(如手机和相机)中爆炸性增长。因此,一个自然的问题是如何利用这些数据来学习下游任务的可视化表示,同时保护数据隐私。为了解决这个问题,我们提出了一个新的联邦无监督学习框架FedU。在这个框架中,各方使用在线网络和目标网络的对比学习,独立地从未标记的数据中训练模型。然后,中央服务器聚合经过训练的模型,并使用聚合的模型更新客户机的模型。它保护了数据隐私,因为各方只能访问其原始数据。多方之间分散的数据通常是非独立且分布相同的(非IID),导致性能下降。为了应对这一挑战,我们提出了两种简单但有效的方法:1)设计通信协议,仅上传在线网络的编码器进行服务器聚合,并使用聚合编码器进行更新;2) 我们引入了一个新的模块,根据非IID引起的偏差动态决定如何更新预测器。预测器是在线网络的另一个组成部分。大量的实验和烧蚀证明了FedU的有效性和重要性。在非IID数据的线性和半监督评估中,其表现比仅使用单方数据训练高出5%以上,比其他方法高出14%以上。 摘要:Unsupervised representation learning has achieved outstanding performances using centralized data available on the Internet. However, the increasing awareness of privacy protection limits sharing of decentralized unlabeled image data that grows explosively in multiple parties (e.g., mobile phones and cameras). As such, a natural problem is how to leverage these data to learn visual representations for downstream tasks while preserving data privacy. To address this problem, we propose a novel federated unsupervised learning framework, FedU. In this framework, each party trains models from unlabeled data independently using contrastive learning with an online network and a target network. Then, a central server aggregates trained models and updates clients' models with the aggregated model. It preserves data privacy as each party only has access to its raw data. Decentralized data among multiple parties are normally non-independent and identically distributed (non-IID), leading to performance degradation. To tackle this challenge, we propose two simple but effective methods: 1) We design the communication protocol to upload only the encoders of online networks for server aggregation and update them with the aggregated encoder; 2) We introduce a new module to dynamically decide how to update predictors based on the divergence caused by non-IID. The predictor is the other component of the online network. Extensive experiments and ablations demonstrate the effectiveness and significance of FedU. It outperforms training with only one party by over 5% and other methods by over 14% in linear and semi-supervised evaluation on non-IID data.
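其"只上传在线编码器并聚合"的通信协议可用如下草图说明(客户端接口 local_train / encoder_weights / num_samples 均为假设的示意接口;FedAvg按样本数加权,并非FedU的全部细节):

```python
def fedavg(weight_dicts, sizes):
    # 按样本数加权平均各客户端的参数字典(FedAvg,参数为 numpy 数组)
    total = float(sum(sizes))
    return {k: sum(w[k] * (n / total) for w, n in zip(weight_dicts, sizes))
            for k in weight_dicts[0]}

def fedu_round(clients, server_encoder):
    # 示意性草图:每个客户端本地做在线/目标网络的对比学习,
    # 只上传在线编码器;服务器聚合后下发,目标网络与预测器留在本地。
    for c in clients:
        c.encoder_weights = dict(server_encoder)  # 下发聚合后的编码器
        c.local_train()                           # 本地无监督训练
    return fedavg([c.encoder_weights for c in clients],
                  [c.num_samples for c in clients])
```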
【8】 Focus on the Positives: Self-Supervised Learning for Biodiversity Monitoring 标题:关注积极因素:用于生物多样性监测的自我监督学习 链接:https://arxiv.org/abs/2108.06435
作者:Omiros Pantazis,Gabriel Brostow,Kate Jones,Oisin Mac Aodha 机构:Gabriel J. Brostow, Kate E. Jones, University College London, Niantic, University of Edinburgh 备注:ICCV 2021 摘要:我们解决了从未标记图像集合中学习自监督表示的问题。与试图通过最大化每个输入图像的增强版本之间的相似性或通过推测性地拾取负样本来学习有用特征的现有方法不同,我们还利用了使用静态监控摄像机捕获的图像集合中发生的自然变化。为了实现这一点,我们利用现成的上下文数据对输入图像之间的空间和时间关系等信息进行编码。通过在训练时首先识别高概率正对,即那些可能描述相同视觉概念的图像,我们能够学习对下游监督分类出奇有效的表示。对于全球生物多样性监测的关键任务而言,这会产生图像特征,可以在有限的人类监督下适应具有挑战性的视觉物种分类任务。我们在四个不同的相机陷阱(camera trap)图像集合、三类自监督学习方法上展示了结果,并表明与现有基线(如传统的自监督训练和转移学习)相比,在训练时仔细选择图像可以获得更好的性能。 摘要:We address the problem of learning self-supervised representations from unlabeled image collections. Unlike existing approaches that attempt to learn useful features by maximizing similarity between augmented versions of each input image or by speculatively picking negative samples, we instead also make use of the natural variation that occurs in image collections that are captured using static monitoring cameras. To achieve this, we exploit readily available context data that encodes information such as the spatial and temporal relationships between the input images. We are able to learn representations that are surprisingly effective for downstream supervised classification, by first identifying high probability positive pairs at training time, i.e. those images that are likely to depict the same visual concept. For the critical task of global biodiversity monitoring, this results in image features that can be adapted to challenging visual species classification tasks with limited human supervision. We present results on four different camera trap image collections, across three different families of self-supervised learning methods, and show that careful image selection at training time results in superior performance compared to existing baselines such as conventional self-supervised training and transfer learning.
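其利用上下文数据挑选高概率正样本对的做法,可用如下草图说明(规则与阈值均为假设的简化:同一台静态相机、时间间隔足够近的两帧大概率描绘同一视觉概念):

```python
import numpy as np

def context_positive_pairs(camera_ids, timestamps, max_dt=60.0):
    # 示意性草图:按时间排序后,把同一相机、间隔不超过 max_dt 秒的
    # 两帧视为高概率正样本对,供对比学习使用。
    camera_ids = np.asarray(camera_ids)
    timestamps = np.asarray(timestamps, dtype=float)
    order = np.argsort(timestamps)
    pairs = []
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            if timestamps[j] - timestamps[i] > max_dt:
                break                       # 已按时间排序,可提前终止
            if camera_ids[i] == camera_ids[j]:
                pairs.append((i, j))
    return pairs
```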
时序|行为识别|姿态|视频|运动估计(9篇)
【1】 Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation 标题:面向困难三维姿态估计的骨架图神经网络学习 链接:https://arxiv.org/abs/2108.07181
作者:Ailing Zeng,Xiao Sun,Lei Yang,Nanxuan Zhao,Minhao Liu,Qiang Xu 机构:The Chinese University of Hong Kong, Microsoft Research Asia,Sensetime Group Ltd. 备注:ICCV 2021 摘要:人们提出了各种深度学习技术来解决单视图二维到三维姿态估计问题。虽然多年来平均预测精度已显著提高,但在深度模糊、自遮挡和复杂或罕见姿势的硬姿势上的性能仍远不能令人满意。在这项工作中,我们针对这些困难的姿势,提出了一种新的骨骼GNN学习解决方案。具体地说,我们提出了一种跳感知的分层信道压缩融合层,在抑制GNN学习中不希望出现的噪声的同时,有效地从相邻节点提取相关信息。此外,我们还提出了一种时间感知的动态图形构造方法,该方法对三维姿态估计具有鲁棒性和有效性。在Human3.6M数据集上的实验结果表明,我们的解决方案实现了10.3%的平均预测精度提高,并且与最先进的技术相比,在硬姿势方面有了很大的改进。我们进一步将所提出的技术应用于基于骨架的动作识别任务,并获得了最先进的性能。我们的代码可在https://github.com/ailingzengzzz/Skeletal-GNN. 摘要:Various deep learning techniques have been proposed to solve the single-view 2D-to-3D pose estimation problem. While the average prediction accuracy has been improved significantly over the years, the performance on hard poses with depth ambiguity, self-occlusion, and complex or rare poses is still far from satisfactory. In this work, we target these hard poses and present a novel skeletal GNN learning solution. To be specific, we propose a hop-aware hierarchical channel-squeezing fusion layer to effectively extract relevant information from neighboring nodes while suppressing undesired noises in GNN learning. In addition, we propose a temporal-aware dynamic graph construction procedure that is robust and effective for 3D pose estimation. Experimental results on the Human3.6M dataset show that our solution achieves 10.3% average prediction accuracy improvement and greatly improves on hard poses over state-of-the-art techniques. We further apply the proposed technique on the skeleton-based action recognition task and also achieve state-of-the-art performance. Our code is available at https://github.com/ailingzengzzz/Skeletal-GNN.
【2】 Towards unconstrained joint hand-object reconstruction from RGB videos 标题:基于RGB视频的无约束手-物体联合重建 链接:https://arxiv.org/abs/2108.07044
作者:Yana Hasson,Gül Varol,Ivan Laptev,Cordelia Schmid 机构:G¨ul Varol, Inria,D´epartement d’informatique de l’ENS, CNRS, PSL Research University, LIGM, ’EEcole des Ponts, Univ Gustave Eiffel, CNRS, France 备注:Project website: this https URL 摘要:我们的工作旨在从单目视频中获得手和操纵对象的三维重建。重建手对象操作对于机器人学和从人类演示中学习具有巨大的潜力。然而,这个问题的监督学习方法需要3D监督,并且仍然局限于受限的实验室设置和3D地面真实性可用的模拟器。在本文中,我们首先提出了一种无学习拟合的手对象重建方法,它可以无缝地处理双手对象的交互。我们的方法依赖于通过常用的对象检测、手姿势估计和实例分割方法获得的线索。我们定量地评估了我们的方法,并表明它可以应用于训练数据不可用的具有不同难度的数据集。 摘要:Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos. Reconstructing hand-object manipulations holds a great potential for robotics and learning from human demonstrations. The supervised learning approach to this problem, however, requires 3D supervision and remains limited to constrained laboratory settings and simulators for which 3D ground truth is available. In this paper we first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions. Our method relies on cues obtained with common methods for object detection, hand pose estimation and instance segmentation. We quantitatively evaluate our approach and show that it can be applied to datasets with varying levels of difficulty for which training data is unavailable.
【3】 Human Pose and Shape Estimation from Single Polarization Images 标题:基于单幅偏振图像的人体姿态和形状估计 链接:https://arxiv.org/abs/2108.06834
作者:Shihao Zou,Xinxin Zuo,Sen Wang,Yiming Qian,Chuan Guo,Wei Ji,Jingjing Li,Minglun Gong,Li Cheng 机构: China University of Geosciences 备注:Submitted to IEEE TIP 摘要:本文研究了一种新的基于单偏振图像的人体姿态和形状估计方法。众所周知,偏振照相机能够捕捉反射光的偏振,从而保留物体表面丰富的几何线索。受最近偏振图像表面法线重建应用的启发,本文尝试利用偏振诱导的几何线索从单偏振图像估计人体姿势和形状。提出了一种专用的两级流水线:给定一幅单偏振图像,第一级(Polar2Normal)着重于精细的人体表面法线估计;第二阶段(Polar2Shape)然后根据偏振图像和估计的表面法线重建穿着衣服的人体形状。为了验证我们的方法,我们构建了一个专用数据集(PHSPD),该数据集由超过500K帧组成,具有精确的姿势和形状注释。对这个真实世界数据集以及合成数据集(超现实)的实证评估证明了我们方法的有效性。它表明偏振摄像机是一种很有希望的替代传统RGB摄像机的人体姿势和形状估计方法。 摘要:This paper focuses on a new problem of estimating human pose and shape from single polarization images. Polarization camera is known to be able to capture the polarization of reflected lights that preserves rich geometric cues of an object surface. Inspired by the recent applications in surface normal reconstruction from polarization images, in this paper, we attempt to estimate human pose and shape from single polarization images by leveraging the polarization-induced geometric cues. A dedicated two-stage pipeline is proposed: given a single polarization image, stage one (Polar2Normal) focuses on the fine detailed human body surface normal estimation; stage two (Polar2Shape) then reconstructs clothed human shape from the polarization image and the estimated surface normal. To empirically validate our approach, a dedicated dataset (PHSPD) is constructed, consisting of over 500K frames with accurate pose and shape annotations. Empirical evaluations on this real-world dataset as well as a synthetic dataset, SURREAL, demonstrate the effectiveness of our approach. It suggests polarization camera as a promising alternative to the more conventional RGB camera for human pose and shape estimation.
【4】 EventHPE: Event-based 3D Human Pose and Shape Estimation 标题:EventHPE:基于事件的三维人体姿态和形状估计 链接:https://arxiv.org/abs/2108.06819
作者:Shihao Zou,Chuan Guo,Xinxin Zuo,Sen Wang,Pengyu Wang,Xiaoqin Hu,Shoushun Chen,Minglun Gong,Li Cheng 机构:∗University of Alberta, ‡Shandong University, †Celepixel Technology, §University of Guelph, ♮ Nanyang Technological University 备注:ICCV 2021 摘要:事件摄像机是一种新兴的成像传感器,用于捕捉运动对象作为事件的动态,这推动了我们从事件信号中估计三维人体姿势和形状的工作。另一方面,事件有其独特的挑战:事件信号最擅长捕捉局部运动,而不是捕捉静态身体姿势。这导致我们提出了一种两阶段的深度学习方法,称为EventHPE。第一阶段,FlowNet,通过无监督学习进行训练,从事件中推断光流。事件和光流都与人体动力学密切相关,在第二阶段将其作为输入输入输入ShapeNet,以估计3D人体形状。为了缓解基于图像的流(光流)和基于形状的流(人体形状的顶点运动)之间的差异,通过利用两种流都源自相同的人体运动这一事实,引入了一种新的流一致性损失。我们策划了一个基于内部事件的3D人体数据集,该数据集附带3D姿势和形状注释,这是迄今为止我们所知的最大的一个。对DHP19数据集和我们内部数据集的实证评估证明了我们方法的有效性。 摘要:Event camera is an emerging imaging sensor for capturing dynamics of moving objects as events, which motivates our work in estimating 3D human pose and shape from the event signals. Events, on the other hand, have their unique challenges: rather than capturing static body postures, the event signals are best at capturing local motions. This leads us to propose a two-stage deep learning approach, called EventHPE. The first-stage, FlowNet, is trained by unsupervised learning to infer optical flow from events. Both events and optical flow are closely related to human body dynamics, which are fed as input to the ShapeNet in the second stage, to estimate 3D human shapes. To mitigate the discrepancy between image-based flow (optical flow) and shape-based flow (vertices movement of human body shape), a novel flow coherence loss is introduced by exploiting the fact that both flows are originated from the identical human motion. An in-house event-based 3D human dataset is curated that comes with 3D pose and shape annotations, which is by far the largest one to our knowledge. Empirical evaluations on DHP19 dataset and our in-house dataset demonstrate the effectiveness of our approach.
【5】 Asymmetric Bilateral Motion Estimation for Video Frame Interpolation 标题:用于视频帧内插的非对称双边运动估计 链接:https://arxiv.org/abs/2108.06815
作者:Junheum Park,Chul Lee,Chang-Su Kim 机构:Korea University, Dongguk University 备注:Accepted to ICCV2021 摘要:我们提出了一种新的基于非对称双边运动估计(ABME)的视频帧插值算法,该算法在两个输入帧之间合成一个中间帧。首先,我们预测对称的双边运动场来插值锚框架。其次,我们估计从锚帧到输入帧的不对称双边运动场。第三,我们使用非对称场向后扭曲输入帧并重建中间帧。最后,为了细化中间帧,我们开发了一个新的合成网络,该网络使用局部和全局信息生成一组动态滤波器和一个剩余帧。实验结果表明,该算法在各种数据集上都取得了良好的性能。源代码和预训练模型可在https://github.com/JunHeum/ABME. 摘要:We propose a novel video frame interpolation algorithm based on asymmetric bilateral motion estimation (ABME), which synthesizes an intermediate frame between two input frames. First, we predict symmetric bilateral motion fields to interpolate an anchor frame. Second, we estimate asymmetric bilateral motions fields from the anchor frame to the input frames. Third, we use the asymmetric fields to warp the input frames backward and reconstruct the intermediate frame. Last, to refine the intermediate frame, we develop a new synthesis network that generates a set of dynamic filters and a residual frame using local and global information. Experimental results show that the proposed algorithm achieves excellent performance on various datasets. The source codes and pretrained models are available at https://github.com/JunHeum/ABME.
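其中"利用非对称运动场向后扭曲(backward warping)输入帧"是标准操作,可用如下双线性采样草图说明(单通道、边界钳制,仅为示意,并非论文实现):

```python
import numpy as np

def backward_warp(image, flow):
    # 示意性草图:输出像素 (y, x) 从 image[y + dy, x + dx] 双线性采样,
    # 其中 flow[..., 0] = dx, flow[..., 1] = dy。
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    x_src = np.clip(xs + flow[..., 0], 0, W - 1.001)
    y_src = np.clip(ys + flow[..., 1], 0, H - 1.001)
    x0, y0 = np.floor(x_src).astype(int), np.floor(y_src).astype(int)
    wx, wy = x_src - x0, y_src - y0
    top = image[y0, x0] * (1 - wx) + image[y0, x0 + 1] * wx
    bot = image[y0 + 1, x0] * (1 - wx) + image[y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy
```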
【6】 Occlusion-Aware Video Object Inpainting 标题:遮挡感知视频对象修复 链接:https://arxiv.org/abs/2108.06765
作者:Lei Ke,Yu-Wing Tai,Chi-Keung Tang 机构:The Hong Kong University of Science and Technology, Kuaishou Technology 备注:ICCV 2021 摘要:传统的视频修复既不是面向对象的,也不是遮挡感知的,这使得在修复大的遮挡对象区域时容易产生明显的伪影。本文提出了遮挡感知的视频对象修复,该方法在给定遮挡对象可见掩码分割的情况下,恢复视频中遮挡对象的完整形状和外观。为了促进这项新的研究,我们构建了第一个大规模视频对象修复基准YouTube VOI,以提供具有遮挡和可见对象遮罩的真实遮挡场景。我们的技术贡献VOIN联合执行视频对象形状完成和遮挡纹理生成。特别是,形状完成模块建模远程对象一致性,而流完成模块恢复具有尖锐运动边界的精确流,以便跨帧将时间一致的纹理传播到同一运动对象。为了获得更真实的结果,我们使用T-PatchGAN和一种新的基于时空注意的多类鉴别器对VOIN进行了优化。最后,我们在YouTube VOI上比较VOIN和强基线。实验结果清楚地证明了我们的方法的有效性,包括修复复杂和动态对象。当输入可见掩码不精确时,VOIN的性能仅平缓下降。 摘要:Conventional video inpainting is neither object-oriented nor occlusion-aware, making it liable to obvious artifacts when large occluded object regions are inpainted. This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos given their visible mask segmentation. To facilitate this new research, we construct the first large-scale video object inpainting benchmark YouTube-VOI to provide realistic occlusion scenarios with both occluded and visible object masks available. Our technical contribution VOIN jointly performs video object shape completion and occluded texture generation. In particular, the shape completion module models long-range object coherence while the flow completion module recovers accurate flow with sharp motion boundary, for propagating temporally-consistent texture to the same moving object across frames. For more realistic results, VOIN is optimized using both T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator. Finally, we compare VOIN and strong baselines on YouTube-VOI. Experimental results clearly demonstrate the efficacy of our method including inpainting complex and dynamic objects. VOIN degrades gracefully with inaccurate input visible mask.
【7】 Cross-Modal Graph with Meta Concepts for Video Captioning 标题:带元概念的跨模态图在视频字幕中的应用 链接:https://arxiv.org/abs/2108.06458
作者:Hao Wang,Guosheng Lin,Steven C. H. Hoi,Chunyan Miao 机构: Guosheng Lin and Chunyan Miao are with School of Com-puter Science and Engineering, Nanyang Technological University 摘要:视频字幕的目标是将复杂的视觉内容解释为文本描述,这要求模型充分理解视频场景,包括对象及其交互。流行的方法采用现成的目标检测网络给出目标建议,并使用注意机制对目标之间的关系进行建模。它们经常会遗漏一些未定义的预训练模型语义概念,并且无法识别对象之间的精确谓词关系。在本文中,我们调查了一个开放的研究任务,为给定的视频生成文本描述,并提出了跨模态图(CMG)与元概念的视频字幕。具体来说,为了涵盖视频字幕中有用的语义概念,我们弱学习文本描述的相应视觉区域,其中相关的视觉区域和文本词被命名为跨模态元概念。我们进一步利用所学的跨模态元概念动态构建元概念图。我们还使用预测谓词构建整体视频级和局部帧级视频图,以模拟视频序列结构。我们通过大量实验验证了我们提出的技术的有效性,并在两个公共数据集上获得了最新的结果。 摘要:Video captioning targets interpreting the complex visual contents as text descriptions, which requires the model to fully understand video scenes including objects and their interactions. Prevailing methods adopt off-the-shelf object detection networks to give object proposals and use the attention mechanism to model the relations between objects. They often miss some undefined semantic concepts of the pretrained model and fail to identify exact predicate relationships between objects. In this paper, we investigate an open research task of generating text descriptions for the given videos, and propose Cross-Modal Graph (CMG) with meta concepts for video captioning. Specifically, to cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions, where the associated visual regions and textual words are named cross-modal meta concepts. We further build meta concept graphs dynamically with the learned cross-modal meta concepts. We also construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures. We validate the efficacy of our proposed techniques with extensive experiments and achieve state-of-the-art results on two public datasets.
【8】 FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration 标题:FrankMocap:一种基于回归积分的单目三维全身姿态估计系统 链接:https://arxiv.org/abs/2108.06428
作者:Yu Rong,Takaaki Shiratori,Hanbyul Joo 机构:The Chinese University of Hong Kong, Facebook Reality Labs, Facebook AI Research 备注:Accepted to ICCV 2021 Workshops on Assistive Computer Vision and Robotics. An updated version of this https URL Code, models and demo videos are available at:this https URL 摘要:大多数现有的单目3D姿势估计方法只关注单个身体部位,而忽略了一个事实,即人体运动的本质细微差别是通过面部、手部和身体的细微运动协调传达的。在本文中,我们介绍了FrankMocap,一个快速准确的全身3D姿势估计系统,它可以从野生单目图像中同时生成3D人脸、手和身体。FrankMocap的核心思想是其模块化设计:我们首先独立运行面部、手部和身体的3D姿势回归方法,然后通过集成模块组合回归输出。单独的回归模块使我们能够充分利用其最先进的性能,而不会在实践中损害原始的准确性和可靠性。我们开发了三个不同的集成模块,在延迟和准确性之间进行权衡。所有这些都能够提供简单而有效的解决方案,将单独的输出统一为无缝的全身姿势估计结果。我们定量和定性地证明,我们的模块化系统优于基于优化和端到端的估计全身姿势的方法。 摘要:Most existing monocular 3D pose estimation approaches only focus on a single body part, neglecting the fact that the essential nuance of human motion is conveyed through a concert of subtle movements of face, hands, and body. In this paper, we present FrankMocap, a fast and accurate whole-body 3D pose estimation system that can produce 3D face, hands, and body simultaneously from in-the-wild monocular images. The core idea of FrankMocap is its modular design: We first run 3D pose regression methods for face, hands, and body independently, followed by composing the regression outputs via an integration module. The separate regression modules allow us to take full advantage of their state-of-the-art performances without compromising the original accuracy and reliability in practice. We develop three different integration modules that trade off between latency and accuracy. All of them are capable of providing simple yet effective solutions to unify the separate outputs into seamless whole-body pose estimation results. We quantitatively and qualitatively demonstrate that our modularized system outperforms both the optimization-based and end-to-end methods of estimating whole-body pose.
【9】 Continuous-Time Spatiotemporal Calibration of a Rolling Shutter Camera---IMU System 标题:卷帘快门相机-IMU系统的连续时间时空标定 链接:https://arxiv.org/abs/2108.07200
作者:Jianzhu Huai,Yuan Zhuang,Qicheng Yuan,Yukai Lin 机构:Wuhan University, Wuhan Hubei China, Huawei Inc., Shanghai China 备注:11 pages, 9 figures 摘要:卷帘式快门(RS)机构广泛用于消费级相机,是智能手机和自动驾驶汽车的重要部件。当相机和场景之间相对运动时,RS效应会导致图像失真。在视频稳定、运动结构和视觉辅助里程计中需要考虑这种效应,最近的研究通过考虑RS效应改进了早期的全局快门(GS)方法。然而,RS如何影响传感器组件中相机的时空校准尚不清楚,这对于上述应用中的良好性能至关重要。本文以相机IMU系统为例,研究了RS对其时空标定的影响。为此,我们开发了一种基于连续时间B样条的RS相机-IMU系统标定方法。与校准GS相机不同,对目标上地标的每次观察都有一个由连续时间B样条拟合的独特相机姿势。通过四组公共校准数据生成的模拟数据,我们表明RS可以显著影响外部参数,与普通智能手机摄像头一样,RS设置会导致方向误差约为$1^\circ$,平移误差约为2 cm。通过两个工业摄像机IMU系统采集的真实数据,我们发现考虑RS效应可以提供更精确和一致的时空校准。此外,我们的方法还可以精确校准RS的行间延迟。模拟和校准代码是公开的。 摘要:The rolling shutter (RS) mechanism is widely used by consumer-grade cameras, which are essential parts in smartphones and autonomous vehicles. The RS effect leads to image distortion upon relative motion between a camera and the scene. This effect needs to be considered in video stabilization, structure from motion, and vision-aided odometry, for which recent studies have improved earlier global shutter (GS) methods by accounting for the RS effect. However, it is still unclear how the RS affects spatiotemporal calibration of the camera in a sensor assembly, which is crucial to good performance in aforementioned applications. This work takes the camera-IMU system as an example and looks into the RS effect on its spatiotemporal calibration. To this end, we develop a calibration method for a RS-camera-IMU system with continuous-time B-splines by using a calibration target. Unlike in calibrating GS cameras, every observation of a landmark on the target has a unique camera pose fitted by continuous-time B-splines. With simulated data generated from four sets of public calibration data, we show that RS can noticeably affect the extrinsic parameters, causing errors about $1^\circ$ in orientation and 2 cm in translation with a RS setting as in common smartphone cameras. With real data collected by two industrial camera-IMU systems, we find that considering the RS effect gives more accurate and consistent spatiotemporal calibration. Moreover, our method also accurately calibrates the inter-line delay of the RS. The code for simulation and calibration is publicly available.
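卷帘快门标定的关键事实是:同一帧内每一行的曝光时刻不同,因此每个标志点观测都对应B样条上一个独立的相机位姿。下面的草图给出每行时间戳的计算(line_delay即论文中待标定的行间延迟;数值仅为示意):

```python
import numpy as np

def rs_row_times(frame_time, num_rows, line_delay):
    # 示意性草图:第 r 行的曝光时刻 t = frame_time + r * line_delay,
    # 相机位姿再由连续时间 B 样条在各自的 t 处求值得到。
    return frame_time + np.arange(num_rows) * line_delay

# 例:1080 行、行间延迟约 30 微秒的传感器,一帧内时间跨度约 32 毫秒
times = rs_row_times(frame_time=0.0, num_rows=1080, line_delay=30e-6)
```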
医学相关(4篇)
【1】 NPBDREG: A Non-parametric Bayesian Deep-Learning Based Approach for Diffeomorphic Brain MRI Registration 标题:NPBDREG:一种基于非参数贝叶斯深度学习的微分同胚脑MRI配准方法 链接:https://arxiv.org/abs/2108.06771
作者:Samah Khawaled,Moti Freiman 机构:Department of Applied Mathematics, Department of Biomedical Engineering, Technion, Israel Institute of Technology, Haifa, Israel 备注:Currently under review 摘要:基于深度神经网络(DNN)的图像配准算法中不确定性的量化对于安全部署现实世界的医学应用和面向研究的处理管道,以及提高泛化能力具有重要作用。目前可用的不确定性估计方法,包括变分编码器-解码器体系结构和推断时间差方法,需要特定的网络结构,并假设潜在空间的参数分布,这可能导致预测变形场后验分布的次优表征。我们介绍了NPBDREG,一种用于基于DNN的无监督可变形图像配准的完全非参数贝叶斯框架,通过将Adam优化器与随机梯度朗之万动力学(SGLD)相结合,通过后验采样来描述真实的后验分布。NPBDREG提供了一种原则性的非参数方法来描述真实的后验分布,从而以一种理论依据充分且计算高效的方式提供了改进的不确定性估计和置信度度量。我们使用四个公开数据库(MGH10、CMUC12、ISBR18和LPBA40)中$390$的图像对,证明了NPBDREG相对于基线概率VoxelMorph无监督模型(PrVXM)在脑MRI图像配准上的附加值。NPBDREG显示,与PrVXM相比,配准精度略有提高(骰子分数为$0.73$对$0.68$,$p \ll 0.01$),对于被混合结构噪声破坏的数据具有更好的泛化能力(例如,$\alpha=0.2$时骰子分数为$0.729$对$0.686$),最后但最重要的是,预测不确定性与分布外数据的相关性显著提高($r>0.95$,而$r<0.5$)。 摘要:Quantification of uncertainty in deep-neural-networks (DNN) based image registration algorithms plays an important role in the safe deployment of real-world medical applications and research-oriented processing pipelines, and in improving generalization capabilities. Currently available approaches for uncertainty estimation, including the variational encoder-decoder architecture and the inference-time dropout approach, require specific network architectures and assume parametric distribution of the latent space which may result in sub-optimal characterization of the posterior distribution for the predicted deformation-fields. We introduce the NPBDREG, a fully non-parametric Bayesian framework for unsupervised DNN-based deformable image registration by combining an Adam optimizer with stochastic gradient Langevin dynamics (SGLD) to characterize the true posterior distribution through posterior sampling. The NPBDREG provides a principled non-parametric way to characterize the true posterior distribution, thus providing improved uncertainty estimates and confidence measures in a theoretically well-founded and computationally efficient way. We demonstrated the added-value of NPBDREG, compared to the baseline probabilistic VoxelMorph unsupervised model (PrVXM), on brain MRI images registration using $390$ image pairs from four publicly available databases: MGH10, CMUC12, ISBR18 and LPBA40. The NPBDREG shows a slight improvement in the registration accuracy compared to PrVXM (Dice score of $0.73$ vs. $0.68$, $p \ll 0.01$), a better generalization capability for data corrupted by a mixed structure noise (e.g Dice score of $0.729$ vs. $0.686$ for $\alpha=0.2$) and last but foremost, a significantly better correlation of the predicted uncertainty with out-of-distribution data ($r>0.95$ vs. $r<0.5$).
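其后验采样依赖的SGLD更新非常简洁,可用如下草图说明(此处为朴素SGLD;NPBDREG实际将其与Adam结合,且grad_log_post由配准损失反向传播得到,这里作为抽象输入):

```python
import numpy as np

def sgld_step(theta, grad_log_post, lr, rng):
    # 示意性草图:在对数后验梯度上升的基础上,注入方差等于步长的
    # 高斯噪声,使迭代序列近似采样自参数的后验分布。
    noise = rng.normal(0.0, np.sqrt(lr), size=theta.shape)
    return theta + 0.5 * lr * grad_log_post + noise

# 用法示例(rng 可取 np.random.default_rng()):
# theta = sgld_step(theta, grad_fn(theta), lr=1e-4, rng=np.random.default_rng())
```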
【2】 Deep Algorithm Unrolling for Biomedical Imaging 标题:生物医学成像的深度展开算法 链接:https://arxiv.org/abs/2108.06637
作者:Yuelong Li,Or Bar-Shira,Vishal Monga,Yonina C. Eldar 摘要:在本章中,我们通过利用算法展开(一种连接传统迭代算法和现代深度学习技术的重要技术)来回顾生物医学应用和突破。为了提供上下文,我们首先跟踪算法展开的起源,并提供关于如何将迭代算法展开到深度网络的全面教程。然后,我们广泛地介绍了各种生物医学成像模式中的算法展开,并深入研究了一些具有代表性的最新工作。事实上,生物医学图像合成的迭代算法有着丰富的历史,这使得展开技术的领域已经成熟。此外,我们将算法展开放在一个广阔的视角中,以便理解为什么它特别有效,并讨论最近的趋势。最后,我们通过讨论开放的挑战来结束本章,并提出未来的研究方向。 摘要:In this chapter, we review biomedical applications and breakthroughs via leveraging algorithm unrolling, an important technique that bridges between traditional iterative algorithms and modern deep learning techniques. To provide context, we start by tracing the origin of algorithm unrolling and providing a comprehensive tutorial on how to unroll iterative algorithms into deep networks. We then extensively cover algorithm unrolling in a wide variety of biomedical imaging modalities and delve into several representative recent works in detail. Indeed, there is a rich history of iterative algorithms for biomedical image synthesis, which makes the field ripe for unrolling techniques. In addition, we put algorithm unrolling into a broad perspective, in order to understand why it is particularly effective and discuss recent trends. Finally, we conclude the chapter by discussing open challenges, and suggesting future research directions.
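算法展开的经典例子是把ISTA稀疏编码迭代展开成LISTA网络层:下面的草图中 W_e、S 与阈值 lam 原本由字典决定,展开成网络后即可作为可学习参数(仅为示意,参数固定、未含训练过程):

```python
import numpy as np

def soft_threshold(x, lam):
    # 软阈值算子,ISTA 中对应 L1 正则的近端映射
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lista_forward(y, W_e, S, lam, num_layers=10):
    # 示意性草图:每次循环对应网络的一层,即一次 ISTA 迭代
    #   x <- soft_threshold(W_e @ y + S @ x, lam)
    x = np.zeros(S.shape[0])
    for _ in range(num_layers):
        x = soft_threshold(W_e @ y + S @ x, lam)
    return x
```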
【3】 Disease-oriented image embedding with pseudo-scanner standardization for content-based image retrieval on 3D brain MRI 标题:面向疾病的三维脑MRI图像嵌入与伪扫描仪标准化的基于内容的图像检索 链接:https://arxiv.org/abs/2108.06518
作者:Hayato Arai,Yuto Onga,Kumpei Ikuta,Yusuke Chayama,Hitoshi Iyatomi,Kenichi Oishi 机构:and the Parkinson’s Progression Markers Initiative., Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database 备注:13 pages, 7 figures 摘要:为了建立一个适用于临床脑MRI数据库的健壮实用的基于内容的图像检索(CBIR)系统,我们提出了一个新的框架——基于伪扫描仪标准化的面向疾病的图像嵌入(DI-PSS)——该框架包括两个核心技术:数据协调和降维算法。我们的DI-PSS使用颅骨剥离和基于CycleGAN的图像转换,映射到标准大脑,然后转换为使用给定参考扫描仪拍摄的大脑图像。然后,我们的3D卷积自动编码器(3D-CAE)通过深度度量学习获得了低维嵌入,更好地反映了疾病的特征。我们提出的框架的有效性在从阿尔茨海默病神经成像倡议和帕金森病进展标志物倡议中选择的T1加权磁共振成像上进行了测试。我们确认,我们的PSS大大降低了由不同扫描仪和数据集引起的低维嵌入的可变性。与基线条件相比,我们的PSS将阿尔茨海默病(AD)与临床正常(CN)和帕金森病(PD)患者之间距离的变异性分别降低了15.8-22.6%和18.0-29.9%。这些特性允许DI-PSS生成更适合疾病分类的低维表示。在基于谱聚类的AD和CN分类实验中,PSS将平均准确率和宏F1分别提高了6.2%和10.7%。考虑到DI-PSS在协调未用于扫描训练数据的MRI扫描仪扫描的图像方面的潜力,我们期望DI-PSS适用于在异构环境中扫描的大量传统MRI。 摘要:To build a robust and practical content-based image retrieval (CBIR) system that is applicable to a clinical brain MRI database, we propose a new framework -- Disease-oriented image embedding with pseudo-scanner standardization (DI-PSS) -- that consists of two core techniques, data harmonization and a dimension reduction algorithm. Our DI-PSS uses skull stripping and CycleGAN-based image transformations that map to a standard brain followed by transformation into a brain image taken with a given reference scanner. Then, our 3D convolutional autoencoders (3D-CAE) with deep metric learning acquires a low-dimensional embedding that better reflects the characteristics of the disease. The effectiveness of our proposed framework was tested on the T1-weighted MRIs selected from the Alzheimer's Disease Neuroimaging Initiative and the Parkinson's Progression Markers Initiative. We confirmed that our PSS greatly reduced the variability of low-dimensional embeddings caused by different scanner and datasets. Compared with the baseline condition, our PSS reduced the variability in the distance from Alzheimer's disease (AD) to clinically normal (CN) and Parkinson disease (PD) cases by 15.8-22.6% and 18.0-29.9%, respectively. These properties allow DI-PSS to generate lower dimensional representations that are more amenable to disease classification. In AD and CN classification experiments based on spectral clustering, PSS improved the average accuracy and macro-F1 by 6.2% and 10.7%, respectively. Given the potential of the DI-PSS for harmonizing images scanned by MRI scanners that were not used to scan the training data, we expect that the DI-PSS is suitable for application to a large number of legacy MRIs scanned in heterogeneous environments.
【4】 Learning to Automatically Diagnose Multiple Diseases in Pediatric Chest Radiographs Using Deep Convolutional Neural Networks 标题:深度卷积神经网络在小儿胸片多病自动诊断中的学习 链接:https://arxiv.org/abs/2108.06486
作者:Thanh T. Tran,Hieu H. Pham,Thang V. Nguyen,Tung T. Le,Hieu T. Nguyen,Ha Q. Nguyen 机构:Medical Imaging Center, Vingroup Big Data Institute, Hanoi, Vietnam, College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam 备注:This is a preprint of our paper which was accepted for publication to ICCV Workshop 2021 摘要:儿科患者的胸片(CXR)解释容易出错,需要对放射学专业知识的高度理解。最近,深卷积神经网络(D-CNN)在解释成人CXR方面表现出了显著的性能。然而,缺乏证据表明D-CNN能够从儿科CXR扫描中准确识别多种肺部病变。特别是,用于检测儿童胸部疾病的诊断模型的开发面临着重大挑战,如(i)缺乏医生注释的数据集和(ii)类别不平衡问题。在本文中,我们回顾性地收集了5017个儿科CXR扫描的大数据集,每个扫描都由经验丰富的放射科医生手动标记10种常见病理。然后在3550个带注释的扫描上训练D-CNN模型,以自动对多种儿科肺部病变进行分类。为了解决类别高度不平衡的问题,我们建议修改并应用“分布平衡损失”来训练D-CNN,它重塑了标准的二进制交叉熵损失(BCE),通过降低分配给大多数类别的损失的权重来有效地学习更难的样本。在一个由777项研究组成的独立测试集上,所提出的方法得出的接收器工作特性(AUC)下的面积为0.709(95%CI,0.690-0.729)。截止值处的敏感性、特异性和F1评分分别为0.722(0.694-0.750)、0.579(0.563-0.595)和0.389(0.373-0.405)。在大多数目标疾病上,这些结果明显优于以前最先进的方法。此外,我们的烧蚀研究验证了与其他标准损耗(例如BCE和焦点损耗)相比,所提出的损耗函数在该学习任务中的有效性。总的来说,我们展示了D-CNN在解释儿科CXR方面的潜力。 摘要:Chest radiograph (CXR) interpretation in pediatric patients is error-prone and requires a high level of understanding of radiologic expertise. Recently, deep convolutional neural networks (D-CNNs) have shown remarkable performance in interpreting CXR in adults. However, there is a lack of evidence indicating that D-CNNs can recognize accurately multiple lung pathologies from pediatric CXR scans. In particular, the development of diagnostic models for the detection of pediatric chest diseases faces significant challenges such as (i) lack of physician-annotated datasets and (ii) class imbalance problems. In this paper, we retrospectively collect a large dataset of 5,017 pediatric CXR scans, for which each is manually labeled by an experienced radiologist for the presence of 10 common pathologies. A D-CNN model is then trained on 3,550 annotated scans to classify multiple pediatric lung pathologies automatically. To address the high class-imbalance issue, we propose to modify and apply "Distribution-Balanced loss" for training D-CNNs which reshapes the standard Binary-Cross Entropy loss (BCE) to efficiently learn harder samples by down-weighting the loss assigned to the majority classes. On an independent test set of 777 studies, the proposed approach yields an area under the receiver operating characteristic (AUC) of 0.709 (95% CI, 0.690-0.729). The sensitivity, specificity, and F1-score at the cutoff value are 0.722 (0.694-0.750), 0.579 (0.563-0.595), and 0.389 (0.373-0.405), respectively. These results significantly outperform previous state-of-the-art methods on most of the target diseases. Moreover, our ablation studies validate the effectiveness of the proposed loss function compared to other standard losses, e.g., BCE and Focal Loss, for this learning task. Overall, we demonstrate the potential of D-CNNs in interpreting pediatric CXRs.
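“分布平衡损失”对标准BCE的改造,核心是按类别频率降低多数类的权重;下面给出一个简化草图(仅保留频率重加权这一思想,省略了原损失中的负类容忍正则等设计;alpha为假设的超参数):

```python
import numpy as np

def rebalanced_bce(probs, targets, class_freq, alpha=1.0, eps=1e-7):
    # 示意性草图:多标签 BCE,按类别频率的倒数加权,
    # 使损失不被多数类主导,从而更有效地学习少数类/难样本。
    w = (1.0 / (np.asarray(class_freq) + eps)) ** alpha
    w = w / w.mean()                         # 归一化到平均权重为 1
    p = np.clip(probs, eps, 1 - eps)
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return float((bce * w[None, :]).mean())
```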
GAN|对抗|攻击|生成相关(8篇)
【1】 Patch Attack Invariance: How Sensitive are Patch Attacks to 3D Pose? 标题:补丁攻击不变性:补丁攻击对3D姿势有多敏感? 链接:https://arxiv.org/abs/2108.07229
作者:Max Lennon,Nathan Drenkow,Philippe Burlina 机构:The Johns Hopkins University Applied Physics Laboratory, Johns Hopkins Road, Laurel, Maryland 摘要:基于扰动的攻击虽然在物理上无法实现,但一直是对抗式机器学习(ML)研究的重点。相比之下,基于补丁的攻击在物理上是可以实现的,但大多数工作都集中在2D领域,最近又涉足3D领域。描述面片攻击的鲁棒性及其对三维姿态的不变性是重要的,但尚未完全阐明,这是本文的重点。为此,本文做出了以下几点贡献:A)我们开发了一种新的度量,称为变换平均攻击成功率(mAST),用于评估补丁攻击的鲁棒性和不变性;和B),我们系统地评估了补丁攻击在各种条件下对3D位置和方向的鲁棒性;特别是,我们进行了敏感性分析,该分析提供了关于攻击有效性的重要定性见解,作为面片相对于相机的3D姿势(旋转、平移)的函数,并阐述了面片攻击3D不变性的一些特性;和C),我们得出了新的定性结论,包括:1)我们证明,对于一些3D变换,即旋转和逼近运动(loom),增加训练分布支持可以提高测试时整个范围内的补丁成功率。2) 我们提供了一个新的见解,以了解面片攻击有效性的基本截止极限的存在,该极限取决于面外旋转角度的范围。这些发现将共同指导3D补丁攻击和防御的未来设计。 摘要:Perturbation-based attacks, while not physically realizable, have been the main emphasis of adversarial machine learning (ML) research. Patch-based attacks by contrast are physically realizable, yet most work has focused on 2D domain with recent forays into 3D. Characterizing the robustness properties of patch attacks and their invariance to 3D pose is important, yet not fully elucidated, and is the focus of this paper. To this end, several contributions are made here: A) we develop a new metric called mean Attack Success over Transformations (mAST) to evaluate patch attack robustness and invariance; and B), we systematically assess robustness of patch attacks to 3D position and orientation for various conditions; in particular, we conduct a sensitivity analysis which provides important qualitative insights into attack effectiveness as a function of the 3D pose of a patch relative to the camera (rotation, translation) and sets forth some properties for patch attack 3D invariance; and C), we draw novel qualitative conclusions including: 1) we demonstrate that for some 3D transformations, namely rotation and loom, increasing the training distribution support yields an increase in patch success over the full range at test time. 2) We provide new insights into the existence of a fundamental cutoff limit in patch attack effectiveness that depends on the extent of out-of-plane rotation angles. These findings should collectively guide future design of 3D patch attacks and defenses.
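mAST这类"对变换取平均的攻击成功率"可以用蒙特卡洛方式估计,草图如下(attack_success_fn为假设的回调:在给定3D位姿下渲染补丁并判断攻击是否成功;采样方式与范围均为示意,并非论文的精确定义):

```python
import numpy as np

def mast(attack_success_fn, pose_ranges, n=100, seed=0):
    # 示意性草图:在给定的 3D 位姿范围内均匀随机采样,
    # 对每个位姿判断补丁攻击是否仍然成功,取平均即得 mAST 的估计。
    rng = np.random.default_rng(seed)
    hits = []
    for _ in range(n):
        pose = {k: rng.uniform(lo, hi) for k, (lo, hi) in pose_ranges.items()}
        hits.append(float(attack_success_fn(pose)))
    return float(np.mean(hits))

# 例:mast(fn, {"yaw_deg": (-60, 60), "loom": (0.5, 2.0)})
```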
【2】 Exploring Transferable and Robust Adversarial Perturbation Generation from the Perspective of Network Hierarchy 标题:从网络层次角度探索可转移的鲁棒对抗扰动生成 链接:https://arxiv.org/abs/2108.07033
作者:Ruikui Wang,Yuanfang Guo,Ruijie Yang,Yunhong Wang 机构: School of Computer Science and Engineering, Beihang University, Beijing, China 摘要:对抗性示例的可转移性和鲁棒性是黑盒对抗攻击的两个实用但重要的特性。在本文中,我们从网络层次结构的角度探索了促进这两个阶段的有效机制,其中典型的网络可以分为输出阶段、中间阶段和输入阶段。由于源模型过于专业化,我们很难在输出阶段提高对抗性扰动的可转移性和鲁棒性。因此,本文将重点放在中间阶段和输入阶段,并提出一种可转移且鲁棒的对抗性扰动生成(TRAP)方法。具体而言,我们提出了动态引导机制,用于在中间阶段连续计算扰动产生的精确方向引导。在输入阶段,我们利用多种形式的仿射变换增强来进一步丰富输入多样性,提高对抗性扰动的鲁棒性和可转移性,而不是现有方法中采用的单一形式的变换增强。大量实验表明,我们的陷阱实现了令人印象深刻的可转移性和对某些干扰的高鲁棒性。 摘要:The transferability and robustness of adversarial examples are two practical yet important properties for black-box adversarial attacks. In this paper, we explore effective mechanisms to boost both of them from the perspective of network hierarchy, where a typical network can be hierarchically divided into output stage, intermediate stage and input stage. Since over-specialization of source model, we can hardly improve the transferability and robustness of the adversarial perturbations in the output stage. Therefore, we focus on the intermediate and input stages in this paper and propose a transferable and robust adversarial perturbation generation (TRAP) method. Specifically, we propose the dynamically guided mechanism to continuously calculate accurate directional guidances for perturbation generation in the intermediate stage. In the input stage, instead of the single-form transformation augmentations adopted in the existing methods, we leverage multiform affine transformation augmentations to further enrich the input diversity and boost the robustness and transferability of the adversarial perturbations. Extensive experiments demonstrate that our TRAP achieves impressive transferability and high robustness against certain interferences.
【3】 Online Multi-Granularity Distillation for GAN Compression 标题:用于GAN压缩的在线多粒度蒸馏 链接:https://arxiv.org/abs/2108.06908
作者:Yuxi Ren,Jie Wu,Xuefeng Xiao,Jianchao Yang 机构:ByteDance Inc. 备注:Accepted by ICCV2021 摘要:生成性对抗网络(GAN)在生成出色的图像方面取得了巨大的成功,但是,由于沉重的计算成本和庞大的内存使用,它们在资源受限的设备上部署起来很麻烦。尽管最近压缩GAN的努力取得了显著的成果,但这些模型仍然存在潜在的冗余,可以进一步压缩。为了解决这个问题,我们提出了一种新的在线多粒度蒸馏(OMGD)方案来获得轻量级的GAN,这有助于以较低的计算需求生成高保真图像。我们首次尝试将单阶段在线蒸馏推广到面向GAN的压缩中:逐步提升的教师生成器帮助改进无鉴别器的学生生成器。互补的教师生成器和网络层提供了全面和多粒度的概念,以从不同维度增强视觉逼真度。在四个基准数据集上的实验结果表明,OMGD成功地在Pix2Pix和CycleGAN上将MAC压缩40倍、参数压缩82.5倍,而不损失图像质量。这表明OMGD为在资源受限的设备上部署实时图像翻译提供了一个可行的解决方案。我们的代码和模型公开于:https://github.com/bytedance/OMGD. 摘要:Generative Adversarial Networks (GANs) have witnessed prevailing success in yielding outstanding images, however, they are burdensome to deploy on resource-constrained devices due to ponderous computational costs and hulking memory usage. Although recent efforts on compressing GANs have acquired remarkable results, they still contain potential model redundancies and can be further compressed. To solve this issue, we propose a novel online multi-granularity distillation (OMGD) scheme to obtain lightweight GANs, which contributes to generating high-fidelity images with low computational demands. We offer the first attempt to popularize single-stage online distillation for GAN-oriented compression, where the progressively promoted teacher generator helps to refine the discriminator-free student generator. Complementary teacher generators and network layers provide comprehensive and multi-granularity concepts to enhance visual fidelity from diverse dimensions. Experimental results on four benchmark datasets demonstrate that OMGD succeeds in compressing MACs by 40x and parameters by 82.5x on Pix2Pix and CycleGAN, without loss of image quality. It reveals that OMGD provides a feasible solution for the deployment of real-time image translation on resource-constrained devices. Our code and models are made public at: https://github.com/bytedance/OMGD.
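在线蒸馏中"无鉴别器的学生生成器只向教师学习"这一点,可以用如下极简训练步骤示意(PyTorch)。这里仅用输出图像的L1蒸馏作演示;OMGD原文还包含中间特征等多粒度蒸馏项,细节以官方代码为准:

```python
import torch
import torch.nn.functional as F

def student_distill_step(teacher_G, student_G, x, opt_S):
    # 学生生成器不接触判别器与真值,只模仿(逐步提升中的)教师输出
    with torch.no_grad():
        t_out = teacher_G(x)
    s_out = student_G(x)
    loss = F.l1_loss(s_out, t_out)  # 仅示意输出级蒸馏;原文为多粒度组合
    opt_S.zero_grad()
    loss.backward()
    opt_S.step()
    return loss.item()
```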
【4】 Interpreting Attributions and Interactions of Adversarial Attacks 标题:解读对抗性攻击的归因与互动 链接:https://arxiv.org/abs/2108.06895
作者:Xin Wang,Shuyun Lin,Hao Zhang,Yufei Zhu,Quanshi Zhang 机构:Shanghai Jiao Tong University 摘要:本文旨在从对抗性扰动如何促成攻击任务的角度来解释对抗性攻击。我们基于Shapley值估计不同图像区域对攻击成本下降的归因。我们定义并量化对抗扰动像素之间的相互作用,并将整个扰动图分解为相对独立的扰动分量。对扰动图的分解表明,对抗训练的DNN比正常训练的DNN在前景中具有更多的扰动分量。此外,与正常训练的DNN相比,对抗训练的DNN有更多主要用于降低真实类别得分的分量。上述分析为理解对抗性攻击提供了新的见解。 摘要:This paper aims to explain adversarial attacks in terms of how adversarial perturbations contribute to the attacking task. We estimate attributions of different image regions to the decrease of the attacking cost based on the Shapley value. We define and quantify interactions among adversarial perturbation pixels, and decompose the entire perturbation map into relatively independent perturbation components. The decomposition of the perturbation map shows that adversarially-trained DNNs have more perturbation components in the foreground than normally-trained DNNs. Moreover, compared to the normally-trained DNN, the adversarially-trained DNN has more components which mainly decrease the score of the true category. The above analyses provide new insights into the understanding of adversarial attacks.
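基于Shapley值对图像区域做归因,常用蒙特卡洛排列法近似。下面是与摘要思路一致的通用示意(numpy),其中 value_fn(给定启用扰动的区域集合,返回攻击成本的下降量)为假设的占位:

```python
import numpy as np

def shapley_regions(value_fn, n_regions, n_samples=200, seed=0):
    # 蒙特卡洛排列法:沿随机排列逐个启用区域,累加每个区域的边际贡献
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_regions)
    for _ in range(n_samples):
        mask = np.zeros(n_regions)
        prev = value_fn(mask)
        for i in rng.permutation(n_regions):
            mask[i] = 1
            cur = value_fn(mask)
            phi[i] += cur - prev
            prev = cur
    return phi / n_samples

# 演示:假设的 value_fn(可加型),此时 Shapley 归因应收敛到各区域权重
weights = np.array([0.5, 0.2, 0.3])
print(shapley_regions(lambda m: float(weights @ m), 3))  # ≈ weights
```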
【5】 Neural Architecture Dilation for Adversarial Robustness 标题:用于对抗健壮性的神经结构扩张 链接:https://arxiv.org/abs/2108.06885
作者:Yanxi Li,Zhaohui Yang,Yunhe Wang,Chang Xu 机构: School of Computer Science, University of Sydney, Australia, Noah's Ark Lab, Huawei Technologies, China, Key Lab of Machine Perception (MOE), Department of Machine Intelligence, Peking University, China 备注:9 pages of main text, 5 pages of appendix, 4 figures, 9 tables 摘要:在过去的几十年里,随着卷积神经网络(CNN)的结构和规模的巨大进步,它们在某些任务中很容易达到甚至超过人类的性能。然而,最近发现的CNN的一个缺点是,它们容易受到对抗攻击。尽管对抗训练可以提高CNN的对抗鲁棒性,但在标准准确性和对抗鲁棒性之间存在权衡。从神经网络体系结构的角度出发,本文旨在在保持令人满意的准确性的前提下提高主干CNN的对抗鲁棒性。在最小的计算开销下,扩张架构的引入有望在追求对抗鲁棒性的同时不损害主干CNN的标准性能。对标准误差界和对抗误差界的理论分析自然地引出了所提出的神经结构扩张算法。在真实数据集和基准神经网络上的实验结果证明了该算法在平衡精度和对抗鲁棒性方面的有效性。 摘要:With the tremendous advances in the architecture and scale of convolutional neural networks (CNNs) over the past few decades, they can easily reach or even exceed the performance of humans in certain tasks. However, a recently discovered shortcoming of CNNs is that they are vulnerable to adversarial attacks. Although the adversarial robustness of CNNs can be improved by adversarial training, there is a trade-off between standard accuracy and adversarial robustness. From the neural architecture perspective, this paper aims to improve the adversarial robustness of the backbone CNNs that have a satisfactory accuracy. Under a minimal computational overhead, the introduction of a dilation architecture is expected to be friendly with the standard performance of the backbone CNN while pursuing adversarial robustness. Theoretical analyses on the standard and adversarial error bounds naturally motivate the proposed neural architecture dilation algorithm. Experimental results on real-world datasets and benchmark neural networks demonstrate the effectiveness of the proposed algorithm to balance the accuracy and adversarial robustness.
【6】 Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders 标题:Audio2Gestures:使用条件变分自动编码器从语音音频生成多样化手势 链接:https://arxiv.org/abs/2108.06720
作者:Jing Li,Di Kang,Wenjie Pei,Xuefei Zhe,Ying Zhang,Zhenyu He,Linchao Bao 机构:Harbin Institute of Technology, Shenzhen, Tencent AI Lab 摘要:由于音频和身体运动之间固有的一对多映射关系,从语音音频生成对话手势具有挑战性。传统的CNN/RNN假设一对一映射,因此倾向于预测所有可能目标运动的平均值,从而在推断过程中产生平淡/无趣的运动。为了克服这个问题,我们提出了一种新的条件变分自动编码器(VAE),它通过将跨模态潜在代码分解为共享代码和运动专属代码来显式地建模一对多的音频到运动映射。共享代码主要建模音频和运动之间的强相关性(如同步的音频与运动节拍),而运动专属代码捕获独立于音频的多样化运动信息。然而,将潜在代码分成两部分会给VAE模型的训练带来困难。为了更好地训练VAE,设计了一个便于随机采样的映射网络以及其他技术,包括放松运动损失、bicycle约束和多样性损失。在3D和2D运动数据集上的实验证明,我们的方法在数量和质量上都比最先进的方法生成更真实和多样的运动。最后,我们证明了我们的方法可以很容易地用于在时间轴上按用户指定的运动片段生成运动序列。代码和更多结果位于https://jingli513.github.io/audio2gestures. 摘要:Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating random sampling along with other techniques including relaxed motion loss, bicycle constraint, and diversity loss are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.
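摘要的关键设计是把跨模态潜码拆为"共享码+运动专属码"。下面用一个极简的条件VAE前向过程示意这一拆分(PyTorch);网络结构与维度均为演示假设,原文还包含映射网络与多个辅助损失:

```python
import torch
import torch.nn as nn

class SplitLatentVAE(nn.Module):
    # 共享码建模音频-运动的强相关;运动专属码捕获与音频无关的多样运动
    def __init__(self, audio_dim=128, motion_dim=96, d_shared=32, d_motion=32):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, d_shared)
        self.motion_enc = nn.Linear(motion_dim, 2 * d_motion)  # -> (mu, logvar)
        self.dec = nn.Linear(d_shared + d_motion, motion_dim)

    def forward(self, audio, motion):
        z_shared = self.audio_enc(audio)
        mu, logvar = self.motion_enc(motion).chunk(2, dim=-1)
        z_motion = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # 重参数化
        recon = self.dec(torch.cat([z_shared, z_motion], dim=-1))
        return recon, mu, logvar

vae = SplitLatentVAE()
recon, mu, logvar = vae(torch.randn(4, 128), torch.randn(4, 96))
```

推理时,运动专属码可直接从先验中采样,从而对同一段音频生成多样化的手势。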
【7】 A Survey on GAN Acceleration Using Memory Compression Technique 标题:基于内存压缩技术的GAN加速研究综述 链接:https://arxiv.org/abs/2108.06626
作者:Dina Tantawy,Mohamed Zahran,Amr Wassal 机构:Correspondence:, Department of Computer, Engineering, Cairo University, Cairo, Egypt, Full list of author information is, available at the end of the article 备注:21 pages, 17 figures 摘要:自其发明以来,生成性对抗网络(GAN)在许多应用中已显示出优异的结果。生成性对抗网络是一种强大但又需要大量资源的深度学习模型。它们与普通深度学习模型的主要区别在于其输出的性质。例如,相对于检测对象或分类图像的其他模型,GAN的输出可以是一整幅图像。因此,网络的体系结构和数值精度会影响解决方案的质量和速度。因此,加速GAN至关重要。GAN加速可分为三个主要方向:(1)内存压缩,(2)计算优化,(3)数据流优化。因为数据传输是能源使用的主要来源,所以内存压缩带来了最大的节约。因此,在本文中,我们调查了基于CNN的GAN的内存压缩技术。此外,本文总结了GAN加速的机遇和挑战,并提出了有待进一步研究的开放性研究问题。 摘要:Since their invention, generative adversarial networks (GANs) have shown outstanding results in many applications. Generative adversarial networks are powerful yet resource-hungry deep-learning models. Their main difference from ordinary deep learning models is the nature of their output. For example, GAN output can be a whole image versus other models detecting objects or classifying images. Thus, the architecture and numeric precision of the network affect the quality and speed of the solution. Hence, accelerating GANs is pivotal. Accelerating GANs can be classified into three main tracks: (1) Memory compression, (2) Computation optimization, and (3) Data-flow optimization. Because data transfer is the main source of energy usage, memory compression leads to the most savings. Thus, in this paper, we survey memory compression techniques for CNN-Based GANs. Additionally, the paper summarizes opportunities and challenges in GANs acceleration and suggests open research problems to be further investigated.
【8】 High-dimensional Assisted Generative Model for Color Image Restoration 标题:彩色图像恢复的高维辅助生成模型 链接:https://arxiv.org/abs/2108.06460
作者:Kai Hong,Chunhua Wu,Cailian Yang,Minghui Zhang,Yancheng Lu,Yuhao Wang,Qiegen Liu 备注:12 pages,11 figures 摘要:本文提出了一种基于高维辅助评分生成模型的无监督深度学习方案,用于彩色图像恢复任务。考虑到基于分数的生成模型中的样本数和内部维数对估计数据分布梯度具有关键影响,提出了两种不同的高维方法:通道拷贝变换增加样本数,像素尺度变换减少可行空间维数。随后,使用由这些变换表示的一组高维张量通过去噪分数匹配来训练网络。然后,通过退火Langevin dynamics和交替的数据一致性更新执行采样。此外,为了缓解学习高维表示的困难,提出了一种渐进策略来提升性能。与其他数据驱动方法相比,所提出的无监督学习和迭代恢复算法具有透明和清晰的解释,该算法利用一个预先训练的生成网络来获取先验信息。去马赛克和图像修复的实验结果表明,该方法具有显著的性能和多样性。 摘要:This work presents an unsupervised deep learning scheme that exploits a high-dimensional assisted score-based generative model for color image restoration tasks. Considering that the sample number and internal dimension in score-based generative model have key influence on estimating the gradients of data distribution, two different high-dimensional ways are proposed: The channel-copy transformation increases the sample number and the pixel-scale transformation decreases feasible space dimension. Subsequently, a set of high-dimensional tensors represented by these transformations are used to train the network through denoising score matching. Then, sampling is performed by annealing Langevin dynamics and alternative data-consistency update. Furthermore, to alleviate the difficulty of learning high-dimensional representation, a progressive strategy is proposed to leverage the performance. The proposed unsupervised learning and iterative restoration algorithm, which involves a pre-trained generative network to obtain prior, has transparent and clear interpretation compared to other data-driven approaches. Experimental results on demosaicking and inpainting convey the remarkable performance and diversity of our proposed method.
OCR|文本相关(2篇)
【1】 Who's Waldo? Linking People Across Text and Images 标题:沃尔多是谁?跨文本和图像链接人员 链接:https://arxiv.org/abs/2108.07253
作者:Claire Yuqing Cui,Apoorv Khandelwal,Yoav Artzi,Noah Snavely,Hadar Averbuch-Elor 机构:Cornell University ,Cornell Tech 备注:Published in ICCV 2021 (Oral). Project webpage: this https URL 摘要:我们提出了一个任务和基准数据集,用于以人为中心的视觉定位,即标题中命名的人和图像中的人物之间的链接问题。与之前主要基于对象的视觉定位研究不同,我们的新任务掩盖了标题中的人名,以鼓励在此类图像-标题对上训练的方法关注上下文线索(如多人之间的丰富互动),而不是学习名字和外表之间的联系。为了促进这项任务,我们引入了一个新的数据集Who's Waldo,它是从Wikimedia Commons上的图像-标题数据中自动挖掘出来的。我们提出了一种基于Transformer的方法,它在该任务上优于几个强基线;我们还向研究社区发布了数据,以推动同时考虑视觉与语言的上下文模型研究。 摘要:We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.
【2】 Text-Aware Single Image Specular Highlight Removal 标题:文本感知的单个图像镜面反射高光移除 链接:https://arxiv.org/abs/2108.06881
作者:Shiyu Hou,Chaoqun Wang,Weize Quan,Jingen Jiang,Dong-Ming Yan 机构:Yan,�, University of Chinese Academy of Sciences, Beijing , China, Institute of Automation, Chinese Academy of Sciences, Beijing , China 摘要:从单个输入图像中去除不需要的镜面反射高光对于许多计算机视觉和图形任务至关重要。现有的方法通常会删除医学图像和特定对象图像的镜面反射高光,但是,它们无法处理带有文本的图像。此外,文本检测和识别界很少研究高光对文本识别的影响。因此,本文首先提出并研究了文本感知的单幅图像镜面高光去除问题。其核心目标是通过去除文本图像中的高光来提高文本检测和识别的准确性。为了解决这个具有挑战性的问题,我们首先收集了三个具有细粒度注释的高质量数据集,这些数据集将被适当地发布,以促进相关研究。然后,我们设计了一种新的两级网络,包括高光检测网络和高光去除网络。高光检测网络的输出提供有关高光区域的附加信息,以指导后续高光移除网络。此外,我们还提出了一个测量集,包括端到端文本检测和识别评估以及辅助视觉质量评估。在我们收集的数据集上进行的大量实验证明了该方法的优越性能。 摘要:Removing undesirable specular highlight from a single input image is of crucial importance to many computer vision and graphics tasks. Existing methods typically remove specular highlight for medical images and specific-object images, however, they cannot handle the images with text. In addition, the impact of specular highlight on text recognition is rarely studied by text detection and recognition community. Therefore, in this paper, we first raise and study the text-aware single image specular highlight removal problem. The core goal is to improve the accuracy of text detection and recognition by removing the highlight from text images. To tackle this challenging problem, we first collect three high-quality datasets with fine-grained annotations, which will be appropriately released to facilitate the relevant research. Then, we design a novel two-stage network, which contains a highlight detection network and a highlight removal network. The output of highlight detection network provides additional information about highlight regions to guide the subsequent highlight removal network. Moreover, we suggest a measurement set including the end-to-end text detection and recognition evaluation and auxiliary visual quality evaluation. Extensive experiments on our collected datasets demonstrate the superior performance of the proposed method.
Attention注意力(2篇)
【1】 Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism 标题:摆脱梯度消失:注意力机制中Softmax的周期性替代方案 链接:https://arxiv.org/abs/2108.07153
作者:Shulun Wang,Bin Liu,Feng Liu 机构:Department of Computer Science, Beijing Jiaotong University, Beijing, China, Key Laboratory of Deep Oil and Gas, China University of Petroleum (East China), Qingdao, China, slwang.github.ioPeriod alternatives 备注:18 pages, 16 figures 摘要:Softmax广泛应用于神经网络的多类分类、门结构和注意力机制。输入为正态分布的统计假设支持Softmax的梯度稳定性。然而,当用于注意力机制(如Transformer)时,由于嵌入之间的相关分数通常不服从正态分布,因此出现了梯度消失问题,我们通过实验验证了这一点。在这项工作中,我们建议用周期函数替换指数函数,并从取值和梯度的角度研究Softmax的一些潜在周期替代方案。通过在一个参考LeViT设计的简单演示模型上进行实验,我们的方法被证明能够缓解梯度问题,并且与Softmax及其变体相比产生实质性的改进。此外,我们还通过数学和实验分析了预归一化对Softmax以及我们的方法的影响。最后,我们增加了演示模型的深度,并证明了我们的方法在深层结构中的适用性。 摘要:Softmax is widely used in neural networks for multiclass classification, gate structure and attention mechanisms. The statistical assumption that the input is normal distributed supports the gradient stability of Softmax. However, when used in attention mechanisms such as transformers, since the correlation scores between embeddings are often not normally distributed, the gradient vanishing problem appears, and we prove this point through experimental confirmation. In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives of Softmax from the view of value and gradient. Through experiments on a simply designed demo referenced to LeViT, our method is proved to be able to alleviate the gradient problem and yield substantial improvements compared to Softmax and its variants. Further, we analyze the impact of pre-normalization for Softmax and our methods through mathematics and experiments. Lastly, we increase the depth of the demo and prove the applicability of our method in deep structures.
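按摘要"用周期函数替换Softmax中的指数函数"的思路,下面给出一个注意力归一化的极简示意(PyTorch)。具体周期核取 1+sin 仅为演示假设;原文讨论了多种候选及其取值/梯度性质:

```python
import torch

def periodic_softmax(scores, dim=-1, eps=1e-6):
    # 用非负有界的周期核 1 + sin 替换 exp,再做归一化(具体核为演示假设)
    w = 1.0 + torch.sin(scores)
    return w / (w.sum(dim=dim, keepdim=True) + eps)

# 用法示意:代替注意力中的 softmax;周期核的导数不会随分数增大而衰减
q, k = torch.randn(2, 8, 16), torch.randn(2, 8, 16)
attn = periodic_softmax(q @ k.transpose(-2, -1) / 16 ** 0.5)
print(attn.sum(-1))  # 每行权重近似归一化为 1
```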
【2】 Stacked Hourglass Network with a Multi-level Attention Mechanism: Where to Look for Intervertebral Disc Labeling 标题:具有多层次注意机制的堆叠沙漏网络:在哪里寻找椎间盘标记 链接:https://arxiv.org/abs/2108.06554
作者:Reza Azad,Lucas Rouhier,Julien Cohen-Adad 机构: NeuroPoly Lab, Institute of Biomedical Engineering, Polytechnique Montreal, Mila, Quebec AI Institute, Canada, Functional Neuroimaging Unit, CRIUGM, University of Montreal, Montreal 备注:None 摘要:从MRI扫描中标记椎间盘对于正确诊断脊柱相关疾病非常重要,包括多发性硬化症、肌萎缩侧索硬化症、退行性颈脊髓病和癌症。在MRI数据中自动标记椎间盘是一项困难的任务,因为椎间盘和骨区域之间具有相似性,个体间脊柱和周围组织的几何结构存在变异,扫描本身(制造商、脉冲序列、图像对比度、分辨率和伪影)也存在变异。在以前的研究中,椎间盘标记通常在椎间盘检测步骤之后进行,当定位算法遗漏椎间盘或产生假阳性检测时大多会失败。在这项工作中,我们的目标是通过使用姿态估计技术重新表述语义椎间盘标记来缓解这个问题。为此,我们提出了一种具有多级注意力机制的堆叠沙漏网络来联合学习椎间盘位置及其骨架结构。所提出的深度学习模型结合了语义分割和姿态估计技术的优点来处理缺失区域和假阳性检测。为了进一步提高该方法的性能,我们提出了一种基于骨架的搜索空间来减少假阳性检测。该方法在Spine Generic公共多中心数据集上进行了评估,在T1w和T2w两种对比度下均比之前的工作表现出更好的性能。该方法在ivadomed中实现(https://ivadomed.org). 摘要:Labeling vertebral discs from MRI scans is important for the proper diagnosis of spinal related diseases, including multiple sclerosis, amyotrophic lateral sclerosis, degenerative cervical myelopathy and cancer. Automatic labeling of the vertebral discs in MRI data is a difficult task because of the similarity between discs and bone area, the variability in the geometry of the spine and surrounding tissues across individuals, and the variability across scans (manufacturers, pulse sequence, image contrast, resolution and artefacts). In previous studies, vertebral disc labeling is often done after a disc detection step and mostly fails when the localization algorithm misses discs or has false positive detection. In this work, we aim to mitigate this problem by reformulating the semantic vertebral disc labeling using the pose estimation technique. To do so, we propose a stacked hourglass network with multi-level attention mechanism to jointly learn intervertebral disc position and their skeleton structure. The proposed deep learning model takes into account the strength of semantic segmentation and pose estimation technique to handle the missing area and false positive detection. To further improve the performance of the proposed method, we propose a skeleton-based search space to reduce false positive detection. The proposed method was evaluated on the Spine Generic public multi-center dataset and demonstrated better performance compared to previous work, on both T1w and T2w contrasts. The method is implemented in ivadomed (https://ivadomed.org).
人脸|人群计数(2篇)
【1】 MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction 标题:MSR-GCN:用于人体运动预测的多尺度残差图卷积网络 链接:https://arxiv.org/abs/2108.07152
作者:Lingwei Dang,Yongwei Nie,Chengjiang Long,Qing Zhang,Guiqing Li 机构:School of Computer Science and Engineering, South China University of Technology, China, JD Finance America Corporation, USA, School of Computer Science and Engineering, Sun Yat-sen University, China 备注:This paper has been accepted by ICCV2021 摘要:由于未来姿态的随机性和非周期性,人体运动预测是一项具有挑战性的任务。最近,图卷积网络被证明能非常有效地学习姿态关节之间的动态关系,这有助于姿态预测。另一方面,可以对人体姿态进行递归抽象,从而获得多个尺度上的一组姿态。随着抽象层次的提高,姿态的运动变得更加稳定,这同样有利于姿态预测。在本文中,我们提出了一种新的多尺度残差图卷积网络(MSR-GCN),以端到端的方式完成人体姿态预测任务。GCN用于从细尺度到粗尺度、再从粗尺度到细尺度提取特征。然后,对每个尺度上提取的特征进行组合和解码,以获得输入姿态和目标姿态之间的残差。对所有预测的姿态施加中间监督,从而促使网络学习更具代表性的特征。我们提出的方法在两个标准基准数据集(即Human3.6M数据集和CMU Mocap数据集)上进行了评估。实验结果表明,我们的方法优于现有的方法。代码和预先训练的模型可在https://github.com/Droliven/MSRGCN. 摘要:Human motion prediction is a challenging task due to the stochasticity and aperiodicity of future poses. Recently, graph convolutional network has been proven to be very effective to learn dynamic relations among pose joints, which is helpful for pose prediction. On the other hand, one can abstract a human pose recursively to obtain a set of poses at multiple scales. With the increase of the abstraction level, the motion of the pose becomes more stable, which benefits pose prediction too. In this paper, we propose a novel Multi-Scale Residual Graph Convolution Network (MSR-GCN) for human pose prediction task in the manner of end-to-end. The GCNs are used to extract features from fine to coarse scale and then from coarse to fine scale. The extracted features at each scale are then combined and decoded to obtain the residuals between the input and target poses. Intermediate supervisions are imposed on all the predicted poses, which enforces the network to learn more representative features. Our proposed approach is evaluated on two standard benchmark datasets, i.e., the Human3.6M dataset and the CMU Mocap dataset. Experimental results demonstrate that our method outperforms the state-of-the-art approaches. Code and pre-trained models are available at https://github.com/Droliven/MSRGCN.
【2】 U-mesh: Human Correspondence Matching with Mesh Convolutional Networks 标题:U-Mesh:基于Mesh卷积网络的人类对应匹配 链接:https://arxiv.org/abs/2108.06695
作者:Benjamin Groisser,Alon Wolf,Ron Kimmel 机构:Technion-Israel Institute of Technology 摘要:三维扫描技术的普及推动了对解释几何数据的方法的需求,尤其是针对人体对象。在本文中,我们提出了一种回归(自下而上)和生成(自上而下)方法的优雅融合,以将参数模板模型拟合到原始扫描网格。我们的第一个主要贡献是一个内蕴的网格卷积U-net架构,它预测到模板曲面的逐点对应。软对应被表述为新构造的笛卡尔空间中的坐标。将对应关系建模为欧氏邻近性可以实现高效优化,无论是网络训练还是算法的下一步。我们的第二个贡献是一个生成式优化算法,它使用U-net的对应预测来指导参数化的迭代最近点(ICP)配准。通过使用预先训练的人体表面参数模型,我们最大限度地利用特定领域的先验知识。网格卷积网络与生成模型拟合的配对使我们能够预测真实人体表面扫描的对应关系,包括遮挡、部分缺失和不同的亏格(例如由自接触导致)。我们在FAUST对应基准挑战上对所提出的方法进行了评估,在受试者间(受试者内)对应任务上比最先进的方法提高了20%(33%)。 摘要:The proliferation of 3D scanning technology has driven a need for methods to interpret geometric data, particularly for human subjects. In this paper we propose an elegant fusion of regression (bottom-up) and generative (top-down) methods to fit a parametric template model to raw scan meshes. Our first major contribution is an intrinsic convolutional mesh U-net architecture that predicts pointwise correspondence to a template surface. Soft-correspondence is formulated as coordinates in a newly-constructed Cartesian space. Modeling correspondence as Euclidean proximity enables efficient optimization, both for network training and for the next step of the algorithm. Our second contribution is a generative optimization algorithm that uses the U-net correspondence predictions to guide a parametric Iterative Closest Point registration. By employing pre-trained human surface parametric models we maximally leverage domain-specific prior knowledge. The pairing of a mesh-convolutional network with generative model fitting enables us to predict correspondence for real human surface scans including occlusions, partialities, and varying genus (e.g. from self-contact). We evaluate the proposed method on the FAUST correspondence challenge where we achieve 20% (33%) improvement over state of the art methods for inter- (intra-) subject correspondence.
裁剪|量化|加速|压缩相关(1篇)
【1】 Distance-aware Quantization 标题:距离感知量化 链接:https://arxiv.org/abs/2108.06983
作者:Dohyung Kim,Junghyup Lee,Bumsub Ham 机构:School of Electrical and Electronic Engineering, Yonsei University 备注:ICCV2021 摘要:我们解决了网络量化问题,即减少权重和/或激活的比特宽度以轻量化网络架构。量化方法使用舍入函数将全精度值映射到最近的量化值,但此操作是不可微的。使用基于梯度的优化器训练量化网络主要有两种方法。首先,直通估计器(STE)将舍入的零导数替换为恒等函数的导数,这会导致梯度失配问题。其次,软量化器在训练时用连续函数近似舍入,并在测试时利用舍入进行量化。这会减轻梯度失配,但会导致量化器间隙问题。我们在一个统一的框架内缓解了这两个问题。为此,我们介绍了一种新型量化器,称为距离感知量化器(DAQ),它主要由距离感知软舍入(DASR)和温度控制器组成。为了缓解梯度失配问题,DASR使用核soft-argmax近似离散舍入,这基于我们的见解,即量化可以表述为全精度值和量化值之间基于距离的分配问题。控制器根据输入自适应调整DASR中的温度参数,解决了量化器间隙问题。在标准基准上的实验结果表明,在不同的比特宽度下,DAQ的性能明显优于现有技术。 摘要:We address the problem of network quantization, that is, reducing bit-widths of weights and/or activations to lighten network architectures. Quantization methods use a rounding function to map full-precision values to the nearest quantized ones, but this operation is not differentiable. There are mainly two approaches to training quantized networks with gradient-based optimizers. First, a straight-through estimator (STE) replaces the zero derivative of the rounding with that of an identity function, which causes a gradient mismatch problem. Second, soft quantizers approximate the rounding with continuous functions at training time, and exploit the rounding for quantization at test time. This alleviates the gradient mismatch, but causes a quantizer gap problem. We alleviate both problems in a unified framework. To this end, we introduce a novel quantizer, dubbed a distance-aware quantizer (DAQ), that mainly consists of a distance-aware soft rounding (DASR) and a temperature controller. To alleviate the gradient mismatch problem, DASR approximates the discrete rounding with the kernel soft argmax, which is based on our insight that the quantization can be formulated as a distance-based assignment problem between full-precision values and quantized ones. The controller adjusts the temperature parameter in DASR adaptively according to the input, addressing the quantizer gap problem. Experimental results on standard benchmarks show that DAQ outperforms the state of the art significantly for various bit-widths without bells and whistles.
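摘要将量化表述为全精度值与量化级别之间基于距离的分配问题,并用核soft-argmax近似离散舍入。下面按这一描述写一个极简的DASR示意(PyTorch);温度的自适应控制器未包含在内,数值细节为演示假设:

```python
import torch

def dasr(x, levels, temperature):
    # 负平方距离经 softmax 得到对各量化级别的软分配(核 soft-argmax)
    logits = -(x.unsqueeze(-1) - levels) ** 2 / temperature
    w = torch.softmax(logits, dim=-1)
    return (w * levels).sum(-1)

levels = torch.linspace(0, 1, 4)          # 2-bit 均匀量化级别(演示假设)
x = torch.rand(2, 3)
print(dasr(x, levels, temperature=1e-3))  # 低温时逼近最近邻(硬)舍入
print(dasr(x, levels, temperature=1.0))   # 高温时更平滑、梯度更易传播
```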
表征学习(1篇)
【1】 Voxel-wise Cross-Volume Representation Learning for 3D Neuron Reconstruction 标题:三维神经元重建的体素跨体表示学习 链接:https://arxiv.org/abs/2108.06522
作者:Heng Wang,Chaoyi Zhang,Jianhui Yu,Yang Song,Siqi Liu,Wojciech Chrzanowski,Weidong Cai 机构: School of Computer Science, University of Sydney, Australia, School of Computer Science and Engineering, University of New South Wales, Paige AI, New York, NY, USA, Sydney Pharmacy School, University of Sydney, Australia 备注:10 pages, 3 figures, 3 tables, accepted by MICCAI-MLMI 2021 摘要:自动三维神经元重建对于分析脑回路活动中神经元的形态和功能至关重要。然而,现有跟踪算法的性能受到低图像质量的制约。近年来,人们提出了一系列基于深度学习的分割方法,通过从低对比度背景中去除噪声和恢复神经元结构来提高原始三维光学图像的质量。由于神经元形态的多样性和缺乏大型神经元数据集,目前的大多数神经元分割模型都依赖于在基本结构中引入复杂的、专门设计的子模块,以编码更好的特征表示。虽然成功,但在推理过程中会给计算带来额外负担。因此,我们不再修改基本网络,而是将注意力转移到数据集本身。大多数神经元分割模型中使用的编码器-解码器主干只关注体积内的体素点来学习神经元的结构特征,而忽略了不同体积中属于同一类别的体素的共享内在语义特征,这对于表达性表征学习也很重要。因此,为了更好地利用稀缺的数据集,我们建议在编码器-解码器分割模型的基础上,通过一种新的体素级交叉体积表示学习范式,明确利用体素的这种内在特征。我们的方法在推理过程中没有引入额外的成本。在BigNeuron项目的42幅3D神经元图像上进行了评估,结果表明,我们提出的方法提高了原始分割模型的学习能力,并进一步提高了重建性能。 摘要:Automatic 3D neuron reconstruction is critical for analysing the morphology and functionality of neurons in brain circuit activities. However, the performance of existing tracing algorithms is hinged by the low image quality. Recently, a series of deep learning based segmentation methods have been proposed to improve the quality of raw 3D optical image stacks by removing noises and restoring neuronal structures from low-contrast background. Due to the variety of neuron morphology and the lack of large neuron datasets, most of current neuron segmentation models rely on introducing complex and specially-designed submodules to a base architecture with the aim of encoding better feature representations. Though successful, extra burden would be put on computation during inference. Therefore, rather than modifying the base network, we shift our focus to the dataset itself. The encoder-decoder backbone used in most neuron segmentation models attends only intra-volume voxel points to learn structural features of neurons but neglect the shared intrinsic semantic features of voxels belonging to the same category among different volumes, which is also important for expressive representation learning. Hence, to better utilise the scarce dataset, we propose to explicitly exploit such intrinsic features of voxels through a novel voxel-level cross-volume representation learning paradigm on the basis of an encoder-decoder segmentation model. Our method introduces no extra cost during inference. Evaluated on 42 3D neuron images from BigNeuron project, our proposed method is demonstrated to improve the learning ability of the original segmentation model and further enhancing the reconstruction performance.
蒸馏|知识提取(1篇)
【1】 Multi-granularity for knowledge distillation 标题:面向知识蒸馏的多粒度机制 链接:https://arxiv.org/abs/2108.06681
作者:Baitan Shao,Ying Chen 机构:Key Laboratory of Advanced Process Control for Light Industry Ministry of Education, Jiangnan University, Wuxi , China, A R T I C L E I N F O 备注:14 pages, 12 figures 摘要:考虑到学生对教师传授的知识的理解能力不同,提出了一种多粒度的知识蒸馏机制,用于向学生网络传递更易理解的知识。设计了教师网络的多粒度自分析模块,使学生网络能够从不同的教学模式中学习知识。此外,还提出了一种稳定的激励方案,用于对学生训练进行鲁棒监督。所提出的蒸馏机制可以嵌入到不同的蒸馏框架中,这些框架被用作基线。实验表明,与基线相比,该机制使精度平均提高0.58%,最高提高1.08%,性能优于现有技术。研究还表明,该机制可以提高学生网络的微调能力和对噪声输入的鲁棒性。代码可在https://github.com/shaoeric/multi-granularity-distillation获取。 摘要:Considering the fact that students have different abilities to understand the knowledge imparted by teachers, a multi-granularity distillation mechanism is proposed for transferring more understandable knowledge for student networks. A multi-granularity self-analyzing module of the teacher network is designed, which enables the student network to learn knowledge from different teaching patterns. Furthermore, a stable excitation scheme is proposed for robust supervision for the student training. The proposed distillation mechanism can be embedded into different distillation frameworks, which are taken as baselines. Experiments show the mechanism improves the accuracy by 0.58% on average and by 1.08% in the best over the baselines, which makes its performance superior to the state-of-the-arts. It is also exploited that the student's ability of fine-tuning and robustness to noisy inputs can be improved via the proposed mechanism. The code is available at https://github.com/shaoeric/multi-granularity-distillation.
点云|SLAM|雷达|激光|深度RGBD相关(1篇)
【1】 PICCOLO: Point Cloud-Centric Omnidirectional Localization 标题:Piccolo:以点云为中心的全方位定位 链接:https://arxiv.org/abs/2108.06545
作者:Junho Kim,Changwoon Choi,Hojun Jang,Young Min Kim 机构:Dept. of Electrical and Computer Engineering, Seoul National University, Korea 备注:Accepted to ICCV 2021 摘要:我们提出了一种简单有效的全方位定位算法PICCOLO。给定场景的彩色点云和360度全景图像,我们的目标是恢复拍摄全景图像时的相机姿态。我们的管道以开箱即用的方式工作,只需单幅图像作为查询,不需要任何神经网络训练,也不需要收集图像的姿态真值。相反,我们通过梯度下降优化将每个点云颜色与全景图像的整体视图相匹配,以找到相机姿态。我们的损失函数称为采样损失,它以点云为中心,在点云中每个点的投影位置进行评估。相比之下,传统的光度损失以图像为中心,比较每个像素位置的颜色。通过对比较对象的这一简单更改,采样损失有效地克服了全向图像的严重视觉失真,并能利用360度视图的全局上下文来处理视觉定位中的挑战性场景。在各种环境下评估时,PICCOLO在准确性和稳定性方面都优于现有的全方位定位算法。 摘要:We present PICCOLO, a simple and efficient algorithm for omnidirectional localization. Given a colored point cloud and a 360 panorama image of a scene, our objective is to recover the camera pose at which the panorama image is taken. Our pipeline works in an off-the-shelf manner with a single image given as a query and does not require any training of neural networks or collecting ground-truth poses of images. Instead, we match each point cloud color to the holistic view of the panorama image with gradient-descent optimization to find the camera pose. Our loss function, called sampling loss, is point cloud-centric, evaluated at the projected location of every point in the point cloud. In contrast, conventional photometric loss is image-centric, comparing colors at each pixel location. With a simple change in the compared entities, sampling loss effectively overcomes the severe visual distortion of omnidirectional images, and enjoys the global context of the 360 view to handle challenging scenarios for visual localization. PICCOLO outperforms existing omnidirectional localization algorithms in both accuracy and stability when evaluated in various environments.
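采样损失"以点云为中心、在每个点的投影处比较颜色"的思想,可用如下可微示意(PyTorch)表达:把每个3D点按当前位姿投影到等距柱状全景图上,用grid_sample取色后与点云颜色求误差。投影公式与坐标约定是本示例的常见假设,与原文实现未必一致:

```python
import torch
import torch.nn.functional as F

def sampling_loss(points, colors, pano, R, t):
    # points:(N,3) 点坐标;colors:(N,3) 点颜色;pano:(1,3,H,W) 全景图
    p = (points - t) @ R                                  # 相机坐标(约定为假设)
    lon = torch.atan2(p[:, 0], p[:, 2])                   # 经度 [-pi, pi]
    lat = torch.asin((p[:, 1] / p.norm(dim=1).clamp(min=1e-8)).clamp(-1, 1))
    grid = torch.stack([lon / torch.pi, lat / (torch.pi / 2)], dim=-1)
    sampled = F.grid_sample(pano, grid.view(1, 1, -1, 2), align_corners=True)
    return ((sampled[0, :, 0].T - colors) ** 2).mean()    # 逐点颜色误差
```

将R、t设为可优化参数(例如旋转用轴角参数化)后,即可用梯度下降最小化该损失来恢复相机位姿。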
多模态(1篇)
【1】 MMChat: Multi-Modal Chat Dataset on Social Media 标题:MMChat:社交媒体上的多模态聊天数据集 链接:https://arxiv.org/abs/2108.07154
作者:Yinhe Zheng,Guanyi Chen,Xin Liu,Ke Lin 机构:♠ Alibaba Group, ♣Department of Information and Computing Sciences, Utrecht University, ♥Samsung Research China - Beijing (SRC-B) 摘要:在对话中融入多模态语境是发展更具吸引力的对话系统的一个重要步骤。在这项工作中,我们通过引入MMChat来探索这一方向:一个大规模的多模态对话语料库(3240万个原始对话和120.84K个过滤对话)。与以往通过众包或从虚构电影中收集的语料库不同,MMChat包含从社交媒体真实对话中收集的基于图像的对话,并在其中观察到了稀疏性问题。具体地说,在日常交流中,由图像引发的对话可能会随着对话的进行而偏离到一些与图像无关的主题。我们开发了一个基准模型,通过在图像特征上调整注意力路由机制来解决对话生成任务中的这一问题。实验证明了融合图像特征的有效性以及处理图像特征稀疏性的有效性。 摘要:Incorporating multi-modal contexts in conversation is an important step for developing more engaging dialogue systems. In this work, we explore this direction by introducing MMChat: a large scale multi-modal dialogue corpus (32.4M raw dialogues and 120.84K filtered dialogues). Unlike previous corpora that are crowd-sourced or collected from fictitious movies, MMChat contains image-grounded dialogues collected from real conversations on social media, in which the sparsity issue is observed. Specifically, image-initiated dialogues in common communications may deviate to some non-image-grounded topics as the conversation proceeds. We develop a benchmark model to address this issue in dialogue generation tasks by adapting the attention routing mechanism on image features. Experiments demonstrate the usefulness of incorporating image features and the effectiveness in handling the sparsity of image features.
3D|3D重建等相关(1篇)
【1】 Interpolation-Aware Padding for 3D Sparse Convolutional Neural Networks 标题:三维稀疏卷积神经网络的插值感知填充 链接:https://arxiv.org/abs/2108.06925
作者:Yu-Qi Yang,Peng-Shuai Wang,Yang Liu 机构:Tsinghua University, Microsoft Research Asia 摘要:基于稀疏体素的三维卷积神经网络(CNN)广泛应用于各种三维视觉任务。基于稀疏体素的三维CNN从三维输入创建稀疏的非空体素,并仅对其执行三维卷积运算。我们提出了一种简单而有效的填充方案——插值感知填充,用于填充与非空体素相邻的少数空体素,并将它们包含在3D CNN计算中,以便在通过三线性插值计算逐点特征时,所有相邻体素都存在。对于逐点特征至关重要的细粒度3D视觉任务(如语义分割和3D检测),我们的网络比使用最近邻插值、或配合零填充/八叉树填充方案的归一化三线性插值的现有网络具有更高的预测精度。通过对各种3D分割和检测任务的广泛比较,我们证明了3D稀疏CNN在我们的填充方案与特征插值相结合时的优越性。 摘要:Sparse voxel-based 3D convolutional neural networks (CNNs) are widely used for various 3D vision tasks. Sparse voxel-based 3D CNNs create sparse non-empty voxels from the 3D input and perform 3D convolution operations on them only. We propose a simple yet effective padding scheme --- interpolation-aware padding to pad a few empty voxels adjacent to the non-empty voxels and involve them in the 3D CNN computation so that all neighboring voxels exist when computing point-wise features via the trilinear interpolation. For fine-grained 3D vision tasks where point-wise features are essential, like semantic segmentation and 3D detection, our network achieves higher prediction accuracy than the existing networks using the nearest neighbor interpolation or the normalized trilinear interpolation with the zero-padding or the octree-padding scheme. Through extensive comparisons on various 3D segmentation and detection tasks, we demonstrate the superiority of 3D sparse CNNs with our padding scheme in conjunction with feature interpolation.
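三线性插值需要连续坐标周围8个相邻体素的特征;在稀疏CNN中这些邻居可能是空体素,这正是插值感知填充要补齐的对象。下面用稠密网格写一个逐点三线性插值的极简示意(PyTorch),仅用于说明插值本身:

```python
import torch

def trilinear_gather(feat, xyz):
    # feat: (C,D,H,W) 体素特征(稠密表示);xyz: (N,3) 连续体素坐标
    D, H, W = feat.shape[1:]
    x0 = xyz.floor().long()
    w1 = xyz - x0.float()          # 相对下角点的小数偏移
    w0 = 1.0 - w1
    out = torch.zeros(feat.shape[0], xyz.shape[0], device=feat.device)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                ix = (x0[:, 0] + dx).clamp(0, D - 1)
                iy = (x0[:, 1] + dy).clamp(0, H - 1)
                iz = (x0[:, 2] + dz).clamp(0, W - 1)
                w = ((w1 if dx else w0)[:, 0]
                     * (w1 if dy else w0)[:, 1]
                     * (w1 if dz else w0)[:, 2])
                out += w * feat[:, ix, iy, iz]   # 8 个相邻体素加权求和
    return out   # (C, N)

feat = torch.randn(16, 8, 8, 8)
print(trilinear_gather(feat, torch.rand(5, 3) * 7).shape)  # torch.Size([16, 5])
```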
其他神经网络|深度学习|模型|建模(8篇)
【1】 FaPN: Feature-aligned Pyramid Network for Dense Image Prediction 标题:FAPN:面向密集图像预测的特征对准金字塔网络 链接:https://arxiv.org/abs/2108.07058
作者:Shihua Huang,Zhichao Lu,Ran Cheng,Cheng He 机构:Department of Computer Science and Engineering, Southern University of Science and Technology 备注:ICCV2021 摘要:深度神经网络的最新进展使密集图像预测取得了显著的飞跃。然而,为了简单起见,大多数现有方法仍然忽略了特征对齐问题。对上采样特征和局部特征直接逐像素相加会导致特征图的上下文错位,进而导致预测中的错误分类,尤其是在物体边界上。在本文中,我们提出了一个特征对齐模块,该模块学习像素的变换偏移量,以在上下文上对齐上采样的高级特征;另一个特征选择模块强调具有丰富空间细节的低级特征。然后,我们将这两个模块集成到一个自顶向下的金字塔结构中,并提出了特征对齐金字塔网络(FaPN)。对四个密集预测任务和四个数据集的广泛实验评估已经证明了FaPN的有效性:当与Faster/Mask R-CNN配对时,AP/mIoU比FPN总体提升1.2-2.6个点。特别是,当集成到MaskFormer中时,我们的FaPN在ADE20K上实现了56.7% mIoU的最先进水平。代码可从以下网址获得:https://github.com/EMI-Group/FaPN. 摘要:Recent advancements in deep neural networks have made remarkable leap-forwards in dense image prediction. However, the issue of feature alignment remains as neglected by most existing approaches for simplicity. Direct pixel addition between upsampled and local features leads to feature maps with misaligned contexts that, in turn, translate to mis-classifications in prediction, especially on object boundaries. In this paper, we propose a feature alignment module that learns transformation offsets of pixels to contextually align upsampled higher-level features; and another feature selection module to emphasize the lower-level features with rich spatial details. We then integrate these two modules in a top-down pyramidal architecture and present the Feature-aligned Pyramid Network (FaPN). Extensive experimental evaluations on four dense prediction tasks and four datasets have demonstrated the efficacy of FaPN, yielding an overall improvement of 1.2 - 2.6 points in AP / mIoU over FPN when paired with Faster / Mask R-CNN. In particular, our FaPN achieves the state-of-the-art of 56.7% mIoU on ADE20K when integrated within Mask-Former. The code is available from https://github.com/EMI-Group/FaPN.
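特征对齐模块"从特征预测逐像素偏移、再据此重排上采样特征"的思路,可用如下简化示意(PyTorch)表达。原文使用可变形卷积实现;这里以grid_sample近似替代,结构与通道设置均为演示假设:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    # 简化的特征对齐:预测逐像素 2D 偏移,再用可微采样重排上采样特征
    def __init__(self, c):
        super().__init__()
        self.offset = nn.Conv2d(2 * c, 2, kernel_size=3, padding=1)

    def forward(self, up, lateral):
        n, _, h, w = lateral.shape
        up = F.interpolate(up, size=(h, w), mode="bilinear", align_corners=False)
        off = self.offset(torch.cat([up, lateral], dim=1))           # (n,2,h,w)
        ys, xs = torch.meshgrid(torch.arange(h, device=up.device),
                                torch.arange(w, device=up.device), indexing="ij")
        base = torch.stack([xs, ys], dim=0).float()                  # (2,h,w)
        coords = base.unsqueeze(0) + off                             # 偏移后的采样位置
        gx = coords[:, 0] / max(w - 1, 1) * 2 - 1                    # 归一化到 [-1,1]
        gy = coords[:, 1] / max(h - 1, 1) * 2 - 1
        aligned = F.grid_sample(up, torch.stack([gx, gy], dim=-1),
                                align_corners=True)
        return aligned + lateral                                     # 对齐后再相加
```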
【2】 Nowcasting-Nets: Deep Neural Network Structures for Precipitation Nowcasting Using IMERG 标题:临近预报网:基于IMERG的降水临近预报深度神经网络结构 链接:https://arxiv.org/abs/2108.06868
作者:Mohammad Reza Ehsani,Ariyan Zarei,Hoshin V. Gupta,Kobus Barnard,Ali Behrangi 机构: Department of Hydrology and Atmospheric Sciences, The University of Arizona; Tucson, AZ, Department of Computer Science, The University of Arizona; Tucson, AZ, Submitted to: arXiv 备注:41 Pages, 18 Figures 摘要:准确及时地估计降水量对于发布危险警告(如山洪暴发或滑坡)至关重要。由于卫星数据的采集和处理,目前的遥感降水产品存在几个小时的延迟。通过对这些产品应用稳健的临近预报系统,可以(原则上)减少延迟并提高其适用性、价值和影响。然而,由于大气的混沌性质以及由此导致的降水系统结构的快速变化,这种系统的开发十分复杂。在这项工作中,我们开发了两种方法(以下称为临近预报网,Nowcasting-Nets),使用循环和卷积深度神经网络结构来应对降水临近预报的挑战。我们使用美国东部毗连地区(CONUS)的全球降水测量(GPM)综合多卫星检索(IMERG)降水数据共训练了五个模型,然后在东部和西部CONUS的独立数据上进行测试。模型设计用于提供最长1.5小时提前期的预报,并且通过反馈回路方法,我们还研究了将预报时间延长至4.5小时的能力。我们将模型性能与随机森林(RF)和线性回归(LR)机器学习方法进行了比较,并与使用最新观测作为预报的持久性基准(BM)进行了比较。独立的IMERG观测被用作参考,并通过实验考察了总体统计数据以及针对特定降水事件的案例研究。总体而言,临近预报网模型提供的预报更优,其中带有残差头的卷积临近预报网络(CNC-R)在测试中分别取得了25%、28%和46%的提升…… 摘要:Accurate and timely estimation of precipitation is critical for issuing hazard warnings (e.g., for flash floods or landslides). Current remotely sensed precipitation products have a few hours of latency, associated with the acquisition and processing of satellite data. By applying a robust nowcasting system to these products, it is (in principle) possible to reduce this latency and improve their applicability, value, and impact. However, the development of such a system is complicated by the chaotic nature of the atmosphere, and the consequent rapid changes that can occur in the structures of precipitation systems. In this work, we develop two approaches (hereafter referred to as Nowcasting-Nets) that use Recurrent and Convolutional deep neural network structures to address the challenge of precipitation nowcasting. A total of five models are trained using Global Precipitation Measurement (GPM) Integrated Multi-satellitE Retrievals for GPM (IMERG) precipitation data over the Eastern Contiguous United States (CONUS) and then tested against independent data for the Eastern and Western CONUS. The models were designed to provide forecasts with a lead time of up to 1.5 hours and, by using a feedback loop approach, the ability of the models to extend the forecast time to 4.5 hours was also investigated. Model performance was compared against the Random Forest (RF) and Linear Regression (LR) machine learning methods, and also against a persistence benchmark (BM) that used the most recent observation as the forecast. Independent IMERG observations were used as a reference, and experiments were conducted to examine both overall statistics and case studies involving specific precipitation events. Overall, the forecasts provided by the Nowcasting-Net models are superior, with the Convolutional Nowcasting Network with Residual Head (CNC-R) achieving 25%, 28%, and 46% improvement in the test ...
【3】 CONet: Channel Optimization for Convolutional Neural Networks 标题:CONet:卷积神经网络的通道优化 链接:https://arxiv.org/abs/2108.06822
作者:Mahdi S. Hosseini,Jia Shu Zhang,Zhe Liu,Andre Fu,Jingxuan Su,Mathieu Tuli,Konstantinos N. Plataniotis 机构:The Department of Electrical and Computer Engineering, University of New Brunswick, The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto 备注:Accepted for Publication in ICCV2021 NeurArch 摘要:神经架构搜索(NAS)已将网络设计从依赖人类直觉转向利用由评估指标引导的搜索算法。我们研究了卷积神经网络(CNN)中的通道大小优化,并确定了它在模型精度和复杂性中所起的作用。当前的通道大小选择方法通常受限于离散的样本空间,同时依赖手动迭代和简单启发式。为了解决这个问题,我们引入了一种高效的动态缩放算法——CONet——它可以自动优化给定CNN的跨网络层通道大小。我们引入了两个度量——"Rank"和"Rank Average Slope"——来识别训练中积累的信息。该算法在固定的搜索阶段上动态地放大或缩小通道大小。我们在CIFAR10/100和ImageNet数据集上进行了实验,结果表明,CONet可以在ResNet、DARTS和DARTS 空间中搜索到高效、准确的体系结构,其性能优于各自的基线模型。 摘要:Neural Architecture Search (NAS) has shifted network design from using human intuition to leveraging search algorithms guided by evaluation metrics. We study channel size optimization in convolutional neural networks (CNN) and identify the role it plays in model accuracy and complexity. Current channel size selection methods are generally limited by discrete sample spaces while suffering from manual iteration and simple heuristics. To solve this, we introduce an efficient dynamic scaling algorithm -- CONet -- that automatically optimizes channel sizes across network layers for a given CNN. Two metrics -- "Rank" and "Rank Average Slope" -- are introduced to identify the information accumulated in training. The algorithm dynamically scales channel sizes up or down over a fixed searching phase. We conduct experiments on CIFAR10/100 and ImageNet datasets and show that CONet can find efficient and accurate architectures searched in ResNet, DARTS, and DARTS spaces that outperform their baseline models.
【4】 Reference Service Model for Federated Identity Management 标题:面向联合身份管理的参考服务模型 链接:https://arxiv.org/abs/2108.06701
作者:Daniela Pöhn,Peter Hillmann 机构:Universit¨at der Bundeswehr M¨unchen, Neubiberg, Germany 备注:None 摘要:随着新冠肺炎的流行,全世界越来越多的人在家工作。每个自然人通常具有多个携带不同关联信息的数字身份。在过去的几年中,各种身份和访问管理方法受到了关注,例如帮助在信任边界内访问其他组织的服务。由此产生的异质性带来了很高的复杂性,使参与实体难以区分这些方法和场景;将它们结合起来则更加困难。最后但并非最不重要的一点是,在这种情况下,不同的参与者对诸如"服务"之类的术语有着不同的理解或视角。本文描述了在通用联邦身份管理中使用标准组件的参考服务,并借助ArchiMate框架以现代企业架构加以呈现。提出的通用联邦身份管理服务模型(FIMSM)用于以通用的面向服务的方式描述各种联邦身份管理场景。所提出的参考设计在多个方面得到了验证,并且很容易适用于多种场景。 摘要:With the pandemic of COVID-19, people around the world increasingly work from home. Each natural person typically has several digital identities with different associated information. During the last years, various identity and access management approaches have gained attraction, helping for example to access other organization's services within trust boundaries. The resulting heterogeneity creates a high complexity to differentiate between these approaches and scenarios as participating entity; combining them is even harder. Last but not least, various actors have a different understanding or perspective of the terms, like 'service', in this context. Our paper describes a reference service with standard components in generic federated identity management. This is utilized with modern Enterprise Architecture using the framework ArchiMate. The proposed universal federated identity management service model (FIMSM) is applied to describe various federated identity management scenarios in a generic service-oriented way. The presented reference design is approved in multiple aspects and is easily applicable in numerous scenarios.
【5】 A Sparse Coding Interpretation of Neural Networks and Theoretical Implications 标题:神经网络的稀疏编码解释及其理论意义 链接:https://arxiv.org/abs/2108.06622
作者:Joshua Bowren 机构:Department of Computer Science, University of Miami, Coral Gables, FL, USA 摘要:神经网络,特别是深度卷积神经网络,在各种计算机视觉任务中取得了前所未有的性能,但是对于成功的神经网络的计算和结构的原理还没有完全理解。关于卷积神经网络用于图像分类的能力,已有许多理论,但对于这种模型为什么能够完成推理和异常识别等复杂视觉任务,人们了解得很少。在这里,我们提出了一种针对具有ReLU激活的神经网络、特别是卷积神经网络的稀疏编码解释。在稀疏编码中,当模型的基函数假设为正交时,最优系数由基函数与输入图像投影的软阈值函数给出。在稀疏编码的非负变体中,软阈值函数成为ReLU。在这里,我们首先通过假设基函数正交的稀疏编码推导出这些解,然后从一个改进的非负正交稀疏编码模型(每个稀疏编码系数带有一个指数先验参数)推导出卷积神经网络的前向变换。接下来,我们将逻辑回归添加到分层稀疏编码模型中,得到一个无需归一化和池化的完整卷积神经网络。最后,我们通过在卷积神经网络中保持稀疏先验以及执行更强的非线性变换,来引出潜在的更鲁棒的前向变换。 摘要:Neural networks, specifically deep convolutional neural networks, have achieved unprecedented performance in various computer vision tasks, but the rationale for the computations and structures of successful neural networks is not fully understood. Theories abound for the aptitude of convolutional neural networks for image classification, but less is understood about why such models would be capable of complex visual tasks such as inference and anomaly identification. Here, we propose a sparse coding interpretation of neural networks that have ReLU activation and of convolutional neural networks in particular. In sparse coding, when the model's basis functions are assumed to be orthogonal, the optimal coefficients are given by the soft-threshold function of the basis functions projected onto the input image. In a non-negative variant of sparse coding, the soft-threshold function becomes a ReLU. Here, we derive these solutions via sparse coding with orthogonal-assumed basis functions, then we derive the convolutional neural network forward transformation from a modified non-negative orthogonal sparse coding model with an exponential prior parameter for each sparse coding coefficient. Next, we derive a complete convolutional neural network without normalization and pooling by adding logistic regression to a hierarchical sparse coding model. Finally we motivate potentially more robust forward transformations by maintaining sparse priors in convolutional neural networks as well as performing a stronger nonlinear transformation.
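摘要的核心等式——正交基稀疏编码的最优系数是软阈值函数,而其非负变体退化为(带偏置的)ReLU——可以用几行numpy直接演示:

```python
import numpy as np

def soft_threshold(a, lam):
    # 正交基稀疏编码的闭式最优系数
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def nonneg_soft_threshold(a, lam):
    # 非负变体:软阈值退化为平移后的 ReLU,即 max(a - lam, 0)
    return np.maximum(a - lam, 0.0)

x = np.random.randn(16)
W = np.linalg.qr(np.random.randn(16, 16))[0]   # 构造正交基
codes = soft_threshold(W @ x, lam=0.5)
print((codes != 0).mean())                      # 软阈值带来稀疏性
# 非负变体的最优系数 max(Wx - lam, 0) 正是 ReLU(Wx - lam),
# 与"线性(卷积)层 + 偏置 + ReLU"的前向计算同构
nn_codes = nonneg_soft_threshold(W @ x, lam=0.5)
```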
【6】 Focusing on Persons: Colorizing Old Images Learning from Modern Historical Movies 标题:以人为本:借鉴现代历史电影为老照片着色 链接:https://arxiv.org/abs/2108.06515
作者:Xin Jin,Zhonglan Li,Ke Liu,Dongqing Zou,Xiaodong Li,Xingfan Zhu,Ziyin Zhou,Qilong Sun,Qingyu Liu 机构:Beijing Electronic Science and, Fengtai District, Beijing , SenseTime Research and Tetras.AI, Beijing , China, Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai , China, Department of Cyber Security , Cryptography, Beijing Electronic 备注:ACM Multimedia 2021 Industrial Track 摘要:在工业中,有很多场景需要对旧的灰度照片进行自动着色,例如视频网站和档案馆。在本文中,我们提出了HistoryNet,它基于细粒度语义理解和先验知识,关注历史人物多样化的高保真服装色彩。历史人物的着色既现实又实用,然而,现有方法在这方面的表现并不理想。本文提出了一个包含分类、细粒度语义解析和着色三部分的HistoryNet。分类子模块根据时代、民族和服装类型对图像进行分类;解析子网络为图像中的人物轮廓、服装和背景提供语义,实现更准确的服装和人物着色,防止颜色溢出。在训练过程中,我们将分类和语义解析特征集成到着色生成网络中,以提高着色效果。通过分类和解析子网络的设计,可以提高图像着色的精度,使图像各部分的边界更加清晰。此外,我们还提出了一个新的现代历史电影数据集(MHMD),其中包含1353166幅图像以及42个时代、民族和服装类型标签,用于从147部现代摄制的历史电影或电视剧中学习自动着色。各种定量和定性比较表明,我们的方法优于最新的着色方法,尤其是在军服上,其颜色与历史文献相符。 摘要:In industry, there exist plenty of scenarios where old gray photos need to be automatically colored, such as video sites and archives. In this paper, we present the HistoryNet focusing on historical person's diverse high fidelity clothing colorization based on fine grained semantic understanding and prior. Colorization of historical persons is realistic and practical, however, existing methods do not perform well in the regards. In this paper, a HistoryNet including three parts, namely, classification, fine grained semantic parsing and colorization, is proposed. Classification sub-module supplies classifying of images according to the eras, nationalities and garment types; Parsing sub-network supplies the semantic for person contours, clothing and background in the image to achieve more accurate colorization of clothes and persons and prevent color overflow. In the training process, we integrate classification and semantic parsing features into the coloring generation network to improve colorization. Through the design of classification and parsing subnetworks, the accuracy of image colorization can be improved and the boundaries of each part of the image can be made clearer. Moreover, we also propose a novel Modern Historical Movies Dataset (MHMD) containing 1,353,166 images and 42 labels of eras, nationalities, and garment types for automatic colorization from 147 historical movies or TV series made in modern time. Various quantitative and qualitative comparisons demonstrate that our method outperforms the state-of-the-art colorization methods, especially on military uniforms, which has correct colors according to the historical literatures.
【7】 GeoCLR: Georeference Contrastive Learning for Efficient Seafloor Image Interpretation 标题:GeoCLR:用于高效海底图像解释的地理参考对比学习 链接:https://arxiv.org/abs/2108.06421
作者:Takaki Yamada,Adam Prügel-Bennett,Stefan B. Williams,Oscar Pizarro,Blair Thornton 机构:Centre for In Situ and Remote Intelligent Sensing, University of Southampton, Southampton SO,QF, U.K., Australian Centre for Field Robotics, The University of Sydney, NSW , Australia, Institute of Industrial Science, The University of Tokyo 备注:27 pages, 9 figures 摘要:本文描述了用于深度学习卷积神经网络(CNN)高效训练的视觉表征地理参考对比学习(GeoCLR)。该方法利用地理参考信息,使用在邻近位置拍摄的图像生成相似的图像对,并将其与相距很远的图像对进行对比。其基本假设是,在近距离内采集的图像更可能具有相似的视觉外观;在海底机器人成像应用中,这一假设可以得到合理满足:图像足迹的边长限制在几米之内,且沿载具轨迹重叠拍摄,而海底底质和栖息地的斑块尺度要大得多。这种方法的一个主要优点是它是自监督的,CNN训练不需要任何人工输入。该方法计算效率高,在多日AUV任务期间,可以利用大多数海洋现场试验可获得的计算资源在两次下潜之间生成结果。我们将GeoCLR应用于一个数据集上的栖息地分类,该数据集由使用自主水下机器人(AUV)采集的约86k幅图像组成。我们展示了GeoCLR生成的潜在表示如何有效地指导人工注释工作:与使用相同CNN和同等数量人工注释进行训练的最先进迁移学习相比,该半监督框架将分类准确率平均提高了11.8%。 摘要:This paper describes Georeference Contrastive Learning of visual Representation (GeoCLR) for efficient training of deep-learning Convolutional Neural Networks (CNNs). The method leverages georeference information by generating a similar image pair using images taken of nearby locations, and contrasting these with an image pair that is far apart. The underlying assumption is that images gathered within a close distance are more likely to have similar visual appearance, where this can be reasonably satisfied in seafloor robotic imaging applications where image footprints are limited to edge lengths of a few metres and are taken so that they overlap along a vehicle's trajectory, whereas seafloor substrates and habitats have patch sizes that are far larger. A key advantage of this method is that it is self-supervised and does not require any human input for CNN training. The method is computationally efficient, where results can be generated between dives during multi-day AUV missions using computational resources that would be accessible during most oceanic field trials. We apply GeoCLR to habitat classification on a dataset that consists of ~86k images gathered using an Autonomous Underwater Vehicle (AUV). We demonstrate how the latent representations generated by GeoCLR can be used to efficiently guide human annotation efforts, where the semi-supervised framework improves classification accuracy by an average of 11.8 % compared to state-of-the-art transfer learning using the same CNN and equivalent number of human annotations for training.
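GeoCLR构造样本对的规则(近距离成像→相似对,远距离→不相似对)可以用如下极简示意(numpy)表达;半径阈值等数值为演示假设:

```python
import numpy as np

def geo_pairs(coords, pos_radius=2.0, neg_radius=50.0, seed=0):
    # 近距离成像 -> 相似对;远距离 -> 不相似对(半径为演示假设)
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    triplets = []
    for i in range(len(coords)):
        near = np.flatnonzero((d[i] < pos_radius) & (d[i] > 0))
        far = np.flatnonzero(d[i] > neg_radius)
        if near.size and far.size:
            triplets.append((i, rng.choice(near), rng.choice(far)))
    return triplets  # (锚点, 正样本, 负样本) 的索引三元组

coords = np.random.rand(100, 2) * 100   # 假设的图像平面坐标(单位:米)
print(len(geo_pairs(coords)))
```

这些三元组随后可喂给任意对比损失(如InfoNCE或triplet loss)来训练CNN,全程无需人工标注。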
【8】 Finding Representative Interpretations on Convolutional Neural Networks 标题:寻找卷积神经网络的代表性解释 链接:https://arxiv.org/abs/2108.06384
作者:Peter Cho-Ho Lam,Lingyang Chu,Maxim Torgonskiy,Jian Pei,Yong Zhang,Lanjun Wang 机构:Huawei Canada Technologies Co., Ltd., McMaster University, Simon Fraser University 备注:Accepted by ICCV 2021 (this http URL) 摘要:解释高效深度卷积神经网络(CNN)在图像上的决策逻辑,是对深度学习模型成功的重要补充。然而,现有的方法只能对单个或少量图像解释某些特定的决策逻辑。为了提升人类可理解性和泛化能力,重要的是开发具有代表性的解释,即在大量相似图像上解释CNN的共同决策逻辑,从而揭示出对许多密切相关的预测有贡献的共同语义。在本文中,我们开发了一种新的无监督方法,为大量相似图像生成具有高度代表性的解释。我们将寻找代表性解释的问题表述为一个共聚类问题,并基于CNN线性决策边界的样本将其转化为子模代价子模覆盖问题。我们还提出了一种可视化和相似性排序方法。我们的大量实验证明了我们方法的优异性能。 摘要:Interpreting the decision logic behind effective deep convolutional neural networks (CNN) on images complements the success of deep learning models. However, the existing methods can only interpret some specific decision logic on individual or a small number of images. To facilitate human understandability and generalization ability, it is important to develop representative interpretations that interpret common decision logics of a CNN on a large group of similar images, which reveal the common semantics data contributes to many closely related predictions. In this paper, we develop a novel unsupervised approach to produce a highly representative interpretation for a large number of similar images. We formulate the problem of finding representative interpretations as a co-clustering problem, and convert it into a submodular cost submodular cover problem based on a sample of the linear decision boundaries of a CNN. We also present a visualization and similarity ranking method. Our extensive experiments demonstrate the excellent performance of our method.
其他(10篇)
【1】 Reassessing the Limitations of CNN Methods for Camera Pose Regression 标题:重新评价CNN方法在相机姿态回归中的局限性 链接:https://arxiv.org/abs/2108.07260
作者:Tony Ng,Adrian Lopez-Rodriguez,Vassileios Balntas,Krystian Mikolajczyk 机构:Imperial College London, Facebook Reality Labs 摘要:在本文中,我们解决了室外和室内场景中的摄像机姿态估计问题。与目前基于2D到3D匹配的最佳方法相比,我们提出了一种模型,该模型可以直接从图像中回归相机姿态,其精度明显高于同类现有方法。我们首先分析了为什么回归方法仍然落后于最新技术,并用我们的新方法弥补了性能差距。具体来说,我们提出了一种通过一种新的训练技术来克服训练数据偏差的方法,该技术通过训练集的概率分布生成姿势,以合成新的训练视图。最后,我们在两个广泛使用的基准上对我们的方法进行了评估,结果表明,与先前基于回归的方法、检索技术以及具有局部特征匹配的三维管道相比,我们的方法实现了显著改进的性能。 摘要:In this paper, we address the problem of camera pose estimation in outdoor and indoor scenarios. In comparison to the currently top-performing methods that rely on 2D to 3D matching, we propose a model that can directly regress the camera pose from images with significantly higher accuracy than existing methods of the same class. We first analyse why regression methods are still behind the state-of-the-art, and we bridge the performance gap with our new approach. Specifically, we propose a way to overcome the biased training data by a novel training technique, which generates poses guided by a probability distribution from the training set for synthesising new training views. Lastly, we evaluate our approach on two widely used benchmarks and show that it achieves significantly improved performance compared to prior regression-based methods, retrieval techniques as well as 3D pipelines with local feature matching.
【2】 On the Importance of Encrypting Deep Features 标题:浅谈深度特征加密的重要性 链接:https://arxiv.org/abs/2108.07147
作者:Xingyang Ni,Heikki Huttunen,Esa Rahtu 机构:Tampere University, Tampere, Finland, Visy Oy 备注:First Version 摘要:在本研究中,我们仅在两个假设下分析模型反转攻击:已知用户数据的特征向量,并提供用于推断的黑盒API。一方面,通过选择更实际的设定来解决现有研究的局限性。我们在最新的行人重识别模型上进行了实验,并研究了两种攻击场景(即识别辅助属性和重建用户数据)。结果表明,即使在严格的约束条件下,对手也能成功地推断出敏感信息。另一方面,建议对特征向量进行加密,特别是对于生产环境中的机器学习模型。作为传统加密方法(如AES)的替代方案,提出了一种简单而有效的方法,称为ShuffleBits。更具体地说,每个浮点数的二进制序列被洗牌。以一次性密码本(one-time pad)方案部署时,它作为一个即插即用模块,适用于任何神经网络,生成的模型以加密形式直接输出深层特征。源代码可在https://github.com/nixingyang/ShuffleBits. 摘要:In this study, we analyze model inversion attacks with only two assumptions: feature vectors of user data are known, and a black-box API for inference is provided. On the one hand, limitations of existing studies are addressed by opting for a more practical setting. Experiments have been conducted on state-of-the-art models in person re-identification, and two attack scenarios (i.e., recognizing auxiliary attributes and reconstructing user data) are investigated. Results show that an adversary could successfully infer sensitive information even under severe constraints. On the other hand, it is advisable to encrypt feature vectors, especially for a machine learning model in production. As an alternative to traditional encryption methods such as AES, a simple yet effective method termed ShuffleBits is presented. More specifically, the binary sequence of each floating-point number gets shuffled. Deployed using the one-time pad scheme, it serves as a plug-and-play module that is applicable to any neural network, and the resulting model directly outputs deep features in encrypted form. Source code is publicly available at https://github.com/nixingyang/ShuffleBits.
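按摘要描述,ShuffleBits即打乱每个浮点数的二进制位序,并以一次性密码本方式使用置换密钥。下面是按此描述复原的numpy示意(位级细节未必与官方代码一致),附逆置换解密:

```python
import numpy as np

def shufflebits(feats, seed):
    # 密钥为 32 位的随机置换;一次性密码本用法下,不同数据应使用不同 seed
    rng = np.random.default_rng(seed)
    perm = rng.permutation(32)
    flat = np.ascontiguousarray(feats, dtype=np.float32)
    bits = np.unpackbits(flat.view(np.uint8)).reshape(-1, 32)
    enc = np.packbits(bits[:, perm].reshape(-1)).view(np.float32)
    return enc.reshape(feats.shape), perm

def unshufflebits(enc, perm):
    inv = np.argsort(perm)  # 逆置换即解密
    bits = np.unpackbits(np.ascontiguousarray(enc).view(np.uint8)).reshape(-1, 32)
    return np.packbits(bits[:, inv].reshape(-1)).view(np.float32).reshape(enc.shape)

feat = np.random.randn(4, 8).astype(np.float32)
enc, key = shufflebits(feat, seed=42)
assert np.array_equal(unshufflebits(enc, key), feat)  # 可无损还原
```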
【3】 WikiChurches: A Fine-Grained Dataset of Architectural Styles with Real-World Challenges 标题:WikiChurches:具有现实挑战的细粒度建筑样式数据集 链接:https://arxiv.org/abs/2108.06959
作者:Björn Barz,Joachim Denzler 机构:Computer Vision Group, Friedrich Schiller University Jena, Jena, Germany 备注:10 pages, 7 figures, 3 tables 摘要:我们介绍了一个新的建筑风格分类数据集,由9485个教堂建筑图像组成。图片和样式标签都来源于维基百科。该数据集可以作为各种研究领域的基准,因为它结合了许多现实世界的挑战:基于细微视觉特征的类之间的细粒度区分、相对较小的样本量、高度不平衡的类分布、高度不同的视点以及标签的分层组织,其中只有一些图像以最精确的级别进行标记。此外,我们还为四大类别的139座教堂提供了631个特征视觉特征的边界框注释。例如,这些注释可以用于细粒度分类的研究,在细粒度分类中,通常可以获得关于不同对象部分的额外专家知识。有关图像和注释,请访问:https://doi.org/10.5281/zenodo.5166987 摘要:We introduce a novel dataset for architectural style classification, consisting of 9,485 images of church buildings. Both images and style labels were sourced from Wikipedia. The dataset can serve as a benchmark for various research fields, as it combines numerous real-world challenges: fine-grained distinctions between classes based on subtle visual features, a comparatively small sample size, a highly imbalanced class distribution, a high variance of viewpoints, and a hierarchical organization of labels, where only some images are labeled at the most precise level. In addition, we provide 631 bounding box annotations of characteristic visual features for 139 churches from four major categories. These annotations can, for example, be useful for research on fine-grained classification, where additional expert knowledge about distinctive object parts is often available. Images and annotations are available at: https://doi.org/10.5281/zenodo.5166987
【4】 Probeable DARTS with Application to Computational Pathology 标题:可探测DARTS及其在计算病理学中的应用 链接:https://arxiv.org/abs/2108.06859
作者:Sheyang Tang,Mahdi S. Hosseini,Lina Chen,Sonal Varma,Corwyn Rowsell,Savvas Damaskinos,Konstantinos N. Plataniotis,Zhou Wang 机构:University of Waterloo, Canada, University of New Brunswick, Canada, Sunnybrook health science center, Canada, Kingston Health Sciences Center, Canada, St. Michaels Hospital, Canada, Huron Digital Pathology, Canada, University of Toronto, Canada 摘要:人工智能技术在计算病理学(CPath)领域取得了显著的成就,特别是借助于深度神经网络。然而,网络性能与体系结构设计高度相关,这通常需要具有领域知识的专家。在本文中,我们借助神经结构搜索(NAS)的最新进展来应对这一挑战,以找到适合CPath应用的最佳网络。特别是,我们使用可微体系结构搜索(DARTS)来提高效率。我们首先采用一个探测指标来说明原始DARTS在CIFAR数据集上缺乏适当的超参数调整,以及如何使用自适应优化策略来解决泛化问题。然后,我们通过在组织学组织类型数据集(ADP)上搜索最佳网络结构,将我们的搜索框架应用于CPath应用。结果表明,搜索到的网络在预测精度和计算复杂度方面优于现有网络。我们进一步进行了大量实验,以证明搜索到的网络可迁移到新的CPath应用、对缩小的输入具有鲁棒性,以及预测的可靠性。 摘要:AI technology has made remarkable achievements in computational pathology (CPath), especially with the help of deep neural networks. However, the network performance is highly related to architecture design, which commonly requires human experts with domain knowledge. In this paper, we combat this challenge with the recent advance in neural architecture search (NAS) to find an optimal network for CPath applications. In particular, we use differentiable architecture search (DARTS) for its efficiency. We first adopt a probing metric to show that the original DARTS lacks proper hyperparameter tuning on the CIFAR dataset, and how the generalization issue can be addressed using an adaptive optimization strategy. We then apply our searching framework on CPath applications by searching for the optimum network architecture on a histological tissue type dataset (ADP). Results show that the searched network outperforms state-of-the-art networks in terms of prediction accuracy and computation complexity. We further conduct extensive experiments to demonstrate the transferability of the searched network to new CPath applications, the robustness against downscaled inputs, as well as the reliability of predictions.
【5】 Deep Adversarially-Enhanced k-Nearest Neighbors 标题:深度对抗增强的k-最近邻 链接:https://arxiv.org/abs/2108.06797
作者:Ren Wang,Tianqi Chen 机构:University of Michigan 摘要:最近的研究从理论和经验上表明,深度神经网络(DNN)对小扰动具有固有的脆弱性。应用深度k-最近邻(DkNN)分类器,我们观察到随着层数加深,鲁棒性与准确性的权衡急剧增大。在这项工作中,我们提出了一种深度对抗增强的k-最近邻(DAEkNN)方法,该方法通过两个关键要素实现了比DkNN更高的鲁棒性,并缓解了深层中的鲁棒性-准确性权衡。首先,DAEkNN基于一个经过对抗训练的模型。其次,DAEkNN通过对良性训练数据和对抗性训练数据进行加权组合来做出预测。经验表明,DAEkNN在MNIST和CIFAR-10数据集上同时改善了鲁棒性和鲁棒性-准确性权衡。 摘要:Recent works have theoretically and empirically shown that deep neural networks (DNNs) have an inherent vulnerability to small perturbations. Applying the Deep k-Nearest Neighbors (DkNN) classifier, we observe a dramatically increasing robustness-accuracy trade-off as the layer goes deeper. In this work, we propose a Deep Adversarially-Enhanced k-Nearest Neighbors (DAEkNN) method which achieves higher robustness than DkNN and mitigates the robustness-accuracy trade-off in deep layers through two key elements. First, DAEkNN is based on an adversarially trained model. Second, DAEkNN makes predictions by leveraging a weighted combination of benign and adversarial training data. Empirically, we find that DAEkNN improves both the robustness and the robustness-accuracy trade-off on MNIST and CIFAR-10 datasets.
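The prediction rule described in the abstract is a weighted nearest-neighbor vote over two reference sets, one benign and one adversarial. A rough illustration under assumed choices: random toy features instead of deep-layer representations, plain Euclidean k-NN, and a hypothetical mixing weight lam:

```python
# Hedged sketch of a weighted k-NN vote over benign and adversarial reference
# sets. Features, lam, and k are toy assumptions, not the paper's pipeline.
import numpy as np

rng = np.random.default_rng(0)
benign_X = rng.normal(size=(100, 16))
benign_y = rng.integers(0, 10, size=100)
adv_X = rng.normal(size=(100, 16))
adv_y = rng.integers(0, 10, size=100)

def daeknn_predict(x, k=5, lam=0.5, n_classes=10):
    """Combine k-NN votes from benign (weight 1-lam) and adversarial (weight lam) sets."""
    votes = np.zeros(n_classes)
    for X, y, w in [(benign_X, benign_y, 1.0 - lam), (adv_X, adv_y, lam)]:
        d = np.linalg.norm(X - x, axis=1)     # Euclidean distances to all references
        nn_labels = y[np.argsort(d)[:k]]      # labels of the k nearest references
        votes += w * np.bincount(nn_labels, minlength=n_classes)
    return int(votes.argmax())

print(daeknn_predict(rng.normal(size=16)))
```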
【6】 Two Eyes Are Better Than One: Exploiting Binocular Correlation for Diabetic Retinopathy Severity Grading 标题:双眼胜于一眼:利用双目相关性进行糖尿病视网膜病变严重程度分级 链接:https://arxiv.org/abs/2108.06763
作者:Peisheng Qian,Ziyuan Zhao,Cong Chen,Zeng Zeng,Xiaoli Li 机构:Institute for Infocomm Research (I2R), Singapore, National University of Singapore 备注:Accepted in 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE EMBC 2021 摘要:糖尿病视网膜病变(DR)是糖尿病患者最常见的眼部疾病之一。然而,视力丧失主要发生在DR的晚期,且视力损害的症状从轻微到严重差异很大,加重了临床实践中的诊断和治疗负担。基于视网膜图像的深度学习方法在DR自动分级方面取得了显著成功,但大多数方法忽视了糖尿病通常会同时影响双眼,而眼科医生通常会同时比较双眼进行DR诊断,使得左右眼之间的相关性未被利用。在本研究中,我们模拟这一诊断过程,提出了一种双流双目网络来捕获左右眼之间的细微相关性:训练期间,成对的双眼图像分别输入两个结构相同的子网络。我们设计了一种对比分级损失,为五类DR检测学习双目相关性,在最大化类间差异的同时最小化类内差异。在EyePACS数据集上的实验结果表明了所提出的双目模型的优越性,其大幅优于单目方法。 摘要:Diabetic retinopathy (DR) is one of the most common eye conditions among diabetic patients. However, vision loss occurs primarily in the late stages of DR, and the symptoms of visual impairment, ranging from mild to severe, can vary greatly, adding to the burden of diagnosis and treatment in clinical practice. Deep learning methods based on retinal images have achieved remarkable success in automatic DR grading, but most of them neglect that the presence of diabetes usually affects both eyes, and ophthalmologists usually compare both eyes concurrently for DR diagnosis, leaving correlations between left and right eyes unexploited. In this study, simulating the diagnostic process, we propose a two-stream binocular network to capture the subtle correlations between left and right eyes, in which paired images of eyes are fed into two identical subnetworks separately during training. We design a contrastive grading loss to learn binocular correlation for five-class DR detection, which maximizes inter-class dissimilarity while minimizing the intra-class difference. Experimental results on the EyePACS dataset show the superiority of the proposed binocular model, outperforming monocular methods by a large margin.
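The two ingredients, twin weight-shared sub-networks and a contrastive grading loss, can be sketched compactly. The margin-based form below is the standard contrastive loss and an assumption on our part; the paper's exact formulation may differ:

```python
# Hedged sketch: a shared encoder embeds left/right eye features; a standard
# contrastive term pulls same-grade pairs together and pushes different grades
# apart. Architecture, margin, and feature sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))

def contrastive_grading_loss(z_left, z_right, same_grade, margin=1.0):
    d = F.pairwise_distance(z_left, z_right)
    pos = same_grade * d.pow(2)                          # same grade: shrink distance
    neg = (1 - same_grade) * F.relu(margin - d).pow(2)   # different grade: enforce margin
    return (pos + neg).mean()

left, right = torch.randn(4, 64), torch.randn(4, 64)    # toy stand-ins for eye features
same = torch.tensor([1., 0., 1., 0.])                    # 1 if the pair shares a DR grade
loss = contrastive_grading_loss(encoder(left), encoder(right), same)
print(loss.item())
```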
【7】 Deepfake Representation with Multilinear Regression 标题:基于多线性回归的深度伪造表示 链接:https://arxiv.org/abs/2108.06702
作者:Sara Abdali,M. Alex O. Vasilescu,Evangelos E. Papalexakis 机构:University of California, Riverside, University of California, Los Angeles, Tensor Vision, Los Angeles 摘要:生成式神经网络结构(如GAN)可用于生成合成样本,以弥补真实数据的不足。然而,它们也可能被用来制作导致社会、政治或经济动荡的媒体内容。其中一种新兴媒体就是"深度伪造"(Deepfake),因此能够鉴别这类媒体的技术不可或缺。在本文中,我们提出了一种改进的多线性(张量)方法,即线性回归与多线性回归的组合,用于表示伪造数据和真实数据。我们用该改进的多线性(张量)方法表示Deepfake并执行SVM分类来检验我们的方法,结果令人鼓舞。 摘要:Generative neural network architectures such as GANs may be used to generate synthetic instances to compensate for the lack of real data. However, they may be employed to create media that may cause social, political or economical upheaval. One emerging medium is "Deepfake". Techniques that can discriminate between such media are indispensable. In this paper, we propose a modified multilinear (tensor) method, a combination of linear and multilinear regressions, for representing fake and real data. We test our approach by representing Deepfakes with our modified multilinear (tensor) approach and perform SVM classification with encouraging results.
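As a hedged illustration of a tensor-plus-SVM pipeline in this spirit (not the paper's actual regression model): extract a simple multilinear signature from each sample via its mode-n unfoldings, then classify with an SVM. The leading-singular-value feature and the toy data are illustrative assumptions:

```python
# Hedged sketch of a tensor-feature + SVM pipeline; the paper's multilinear
# regression model differs, this only illustrates the general idea.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def mode_unfold(t, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def multilinear_signature(t):
    # Leading singular value of each mode unfolding: a crude multilinear feature.
    return np.array([np.linalg.svd(mode_unfold(t, m), compute_uv=False)[0]
                     for m in range(t.ndim)])

# Toy "real" vs "fake" tensors (synthetic, for illustration only).
real = [rng.normal(0.0, 1.0, (4, 4, 3)) for _ in range(40)]
fake = [rng.normal(0.0, 1.5, (4, 4, 3)) for _ in range(40)]
X = np.stack([multilinear_signature(t) for t in real + fake])
y = np.array([0] * 40 + [1] * 40)

print(SVC().fit(X, y).score(X, y))  # training accuracy on the toy data
```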
【8】 B-Splines 标题:B样条 链接:https://arxiv.org/abs/2108.06617
作者:Arindam Chaudhuri 机构:Samsung R&D Institute Delhi, Noida, India 备注:This work is published in Encyclopedia of Computer Graphics and Games 摘要:B样条是计算机图形学中最有前途的曲线之一。它们具有一些优越的几何特性,使其成为计算机辅助设计行业中多种应用的理想选择。本文介绍了B样条曲线的一些基本性质,并讨论了两个重要性质:凸包性质和重复点效应。文中还说明了B样条在计算设备中的计算方法。最后,一个基于图像处理的工业应用(用B样条曲线为内部器官的CT图像数据集重建三维曲面)进一步凸显了这类曲线的优势。 摘要:B-Splines are one of the most promising curves in computer graphics. They are blessed with some superior geometric properties which make them an ideal candidate for several applications in computer aided design industry. In this article, some basic properties of B-Spline curves are presented. Two significant B-Spline properties, viz. the convex hull property and repeated points effects, are discussed. The B-Spline computation in computational devices is also illustrated. An industry application based on image processing, where B-Spline curves reconstruct 3D surfaces for CT image datasets of inner organs, further highlights the strength of these curves.
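B-spline evaluation rests on the Cox-de Boor recursion for the basis functions; the convex hull property mentioned above follows because the basis values are non-negative and sum to one. A minimal sketch, with a clamped cubic knot vector chosen purely for illustration:

```python
# Minimal sketch: Cox-de Boor recursion for B-spline basis functions,
# then a point on the curve as a basis-weighted sum of control points.
import numpy as np

def bspline_basis(i, p, u, knots):
    """Value of the i-th degree-p B-spline basis function at parameter u."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] != knots[i]:
        left = (u - knots[i]) / (knots[i + p] - knots[i]) * bspline_basis(i, p - 1, u, knots)
    if knots[i + p + 1] != knots[i + 1]:
        right = (knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1]) \
                * bspline_basis(i + 1, p - 1, u, knots)
    return left + right

# Cubic curve with 4 control points over a clamped knot vector (illustrative choice).
ctrl = np.array([[0, 0], [1, 2], [3, 2], [4, 0]], dtype=float)
knots = [0, 0, 0, 0, 1, 1, 1, 1]
u = 0.5
point = sum(bspline_basis(i, 3, u, knots) * ctrl[i] for i in range(4))
print(point)  # lies inside the control polygon, illustrating the convex hull property
```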
【9】 Monocular visual autonomous landing system for quadcopter drones using software in the loop 标题:基于软件在环的四旋翼无人机单目视觉自主着陆系统 链接:https://arxiv.org/abs/2108.06616
作者:Miguel Saavedra-Ruiz,Ana Maria Pinto-Vargas,Victor Romero-Cano 机构:∗Mila - Quebec Institute of Artificial Intelligence, Université de Montréal, Canada, †Alternova Tech SAS, Medellín, Colombia, ‡Robotics and Autonomous Systems Laboratory, Universidad Autónoma de Occidente, Cali, Colombia 备注:IEEE Aerospace and Electronic Systems 摘要:自主着陆是在许多社会和工业应用中充分发挥多旋翼无人机潜力所必需的能力。在物理平台上实现和测试这种能力既有风险又耗费资源;因此,为了确保合理的设计流程和安全部署,在实现物理原型之前需要进行仿真。本文介绍了一种采用软件在环方法开发的单目视觉系统,它能使四旋翼无人机自主、高效地降落在预定的着陆平台上,从而降低物理测试阶段的风险。除了通过基于Gazebo的仿真确保整个自主着陆系统满足设计要求外,我们的方法还提供了在物理实现之前进行安全参数整定和设计测试的工具。最后,所提出的纯单目视觉着陆平台跟踪方法,使得该系统能够在仅具备Odroid XU4嵌入式处理器标准计算能力的F450四旋翼无人机上有效实现。 摘要:Autonomous landing is a capability that is essential to achieve the full potential of multi-rotor drones in many social and industrial applications. The implementation and testing of this capability on physical platforms is risky and resource-intensive; hence, in order to ensure both a sound design process and a safe deployment, simulations are required before implementing a physical prototype. This paper presents the development of a monocular visual system, using a software-in-the-loop methodology, that autonomously and efficiently lands a quadcopter drone on a predefined landing pad, thus reducing the risks of the physical testing stage. In addition to ensuring that the autonomous landing system as a whole fulfils the design requirements using a Gazebo-based simulation, our approach provides a tool for safe parameter tuning and design testing prior to physical implementation. Finally, the proposed monocular vision-only approach to landing pad tracking made it possible to effectively implement the system in an F450 quadcopter drone with the standard computational capabilities of an Odroid XU4 embedded processor.
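Once the landing pad has been detected in the camera frame, the guidance step reduces to mapping the pad's pixel offset from the image center to velocity commands. A minimal proportional-control sketch; the gains, frame size, centering threshold, and axis conventions are all hypothetical, not taken from the paper:

```python
# Hedged sketch: proportional control from pixel error to velocity commands.
# Gains, frame size, threshold, and axis conventions are assumptions.
import numpy as np

FRAME_W, FRAME_H = 640, 480
KP_XY, KP_Z = 0.002, 0.3   # hypothetical proportional gains

def velocity_command(pad_center_px, altitude_m, centered_tol=0.05):
    ex = pad_center_px[0] - FRAME_W / 2        # horizontal pixel error
    ey = pad_center_px[1] - FRAME_H / 2        # vertical pixel error
    vx, vy = -KP_XY * ey, -KP_XY * ex          # image error -> body-frame velocity
    # Descend only once the pad is roughly centered under the drone.
    centered = np.hypot(ex, ey) / max(FRAME_W, FRAME_H) < centered_tol
    vz = -KP_Z * altitude_m if centered else 0.0
    return vx, vy, vz

print(velocity_command((330, 250), altitude_m=2.0))
```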
【10】 Refractive Geometry for Underwater Domes 标题:水下穹顶的折射几何 链接:https://arxiv.org/abs/2108.06575
作者:Mengkun She,David Nakath,Yifan Song,Kevin Köser 机构:Oceanic Machine Vision, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany 备注:41 pages, 18 figures 摘要:水下相机通常安放在玻璃窗后面,以保护它们不受水的影响。球形玻璃(即圆顶端口)非常适合大深度下的高水压,能提供大视场,并且如果针孔相机恰好位于球心,则可以避免折射。将真实镜头完美地调整到圆顶中心是一项具有挑战性的任务:既包括如何实际引导定心过程(例如视觉伺服)和如何度量对准质量,也包括如何以机械方式执行对准。因此,这类系统往往存在一定的偏心偏移,在球面上产生复杂的折射模式,使针孔相机模型失效。我们证明,即使对于深海勘探所用的厚圆顶,整个相机系统也会成为轴向相机,并给出了一种无需了解精确的空气、玻璃或水的性质即可计算折射中心的非迭代方法。我们还分析了球面上的折射几何,考察了前向与后向偏心、等折射曲线等效应,并得到了薄圆顶中三维点前向投影的六次多项式方程。随后,我们提出了一种纯水下标定流程,用于从多幅图像中估计偏心量。该估计值既可以在调整期间用于指导镜头的机械定位,也可以在水下摄影测量应用中加以考虑。 摘要:Underwater cameras are typically placed behind glass windows to protect them from the water. Spherical glass, a dome port, is well suited for high water pressures at great depth, allows for a large field of view, and avoids refraction if a pinhole camera is positioned exactly at the sphere's center. Adjusting a real lens perfectly to the dome center is a challenging task, both in terms of how to actually guide the centering process (e.g. visual servoing) and how to measure the alignment quality, but also, how to mechanically perform the alignment. Consequently, such systems are prone to being decentered by some offset, leading to challenging refraction patterns at the sphere that invalidate the pinhole camera model. We show that the overall camera system becomes an axial camera, even for thick domes as used for deep sea exploration and provide a non-iterative way to compute the center of refraction without requiring knowledge of exact air, glass or water properties. We also analyze the refractive geometry at the sphere, looking at effects such as forward- vs. backward decentering, iso-refraction curves and obtain a 6th-degree polynomial equation for forward projection of 3D points in thin domes. We then propose a pure underwater calibration procedure to estimate the decentering from multiple images. This estimate can either be used during adjustment to guide the mechanical position of the lens, or can be considered in photogrammetric underwater applications.
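The building block for reasoning about a decentered dome is tracing rays through the spherical interface with the vector form of Snell's law. A minimal sketch; the refractive indices are common textbook values and the hit point is an arbitrary illustrative choice, not calibrated quantities from the paper:

```python
# Minimal sketch: vector-form Snell refraction at a spherical interface,
# the basic step for tracing rays through a dome port. Indices and geometry
# below are illustrative assumptions, not calibrated values.
import numpy as np

def refract(d, n, eta):
    """Refract unit direction d at unit normal n (facing the ray), eta = n1/n2."""
    d, n = d / np.linalg.norm(d), n / np.linalg.norm(n)
    cos_i = -np.dot(n, d)
    sin2_t = eta**2 * (1.0 - cos_i**2)
    if sin2_t > 1.0:
        return None                              # total internal reflection
    return eta * d + (eta * cos_i - np.sqrt(1.0 - sin2_t)) * n

n_air, n_glass = 1.0, 1.52                       # textbook refractive indices
hit_point = np.array([0.01, 0.0, 0.05])          # ray hits the inner dome surface here
outward = hit_point / np.linalg.norm(hit_point)  # sphere normal at the hit point
ray = np.array([0.0, 0.0, 1.0])                  # ray leaving a (decentered) camera
print(refract(ray, -outward, n_air / n_glass))   # direction inside the glass
```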