计算机视觉学术速递[9.8]

2021-09-16 16:47:35

Update!H5支持摘要折叠,体验更佳!点击阅读原文访问arxivdaily.com,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏等功能!

cs.CV 方向,今日共计53篇

Transformer(4篇)

【1】 nnFormer: Interleaved Transformer for Volumetric Segmentation 标题:nnFormer:用于体积分割的交错Transformer 链接:https://arxiv.org/abs/2109.03201

作者:Hong-Yu Zhou,Jiansen Guo,Yinghao Zhang,Lequan Yu,Liansheng Wang,Yizhou Yu 机构:Department of Computer Science, The University of Hong Kong, Department of Computer Science, Xiamen University, Department of Statistics and Actuarial Science, The University of Hong Kong 备注:Codes and models are available at this https URL 摘要:Transformer是自然语言处理中的默认模型选择，但医学影像界对其关注甚少。鉴于Transformer具有利用长程依赖关系的能力，它有望帮助非典型卷积神经网络(convnet)克服其固有的空间归纳偏置方面的不足。然而，最近提出的大多数基于Transformer的分割方法只是把Transformer当作辅助模块，用来将全局上下文编码进卷积表示，而没有研究如何将自注意力(即Transformer的核心)与卷积最佳地结合。为了解决这个问题，本文引入了nnFormer(即Not-aNother transFormer)，这是一个强大的分割模型，具有基于自注意力与卷积经验性组合的交错结构。在实践中，nnFormer从三维局部体积学习体积表示。与朴素的体素级自注意力实现相比，这种基于局部体积的操作有助于在Synapse和ACDC数据集上分别将计算复杂度降低约98%和99.5%。与现有的网络配置相比，nnFormer在Synapse和ACDC这两个常用数据集上较以往基于Transformer的方法取得了巨大改进。例如，nnFormer在Synapse上比Swin-UNet高出7个百分点以上。即使与nnUNet(目前性能最好的全卷积医学分割网络)相比，nnFormer在Synapse和ACDC上的性能仍然稍好一些。 摘要:Transformers, the default model of choices in natural language processing, have drawn scant attention from the medical imaging community. Given the ability to exploit long-term dependencies, transformers are promising to help atypical convolutional neural networks (convnets) to overcome its inherent shortcomings of spatial inductive bias. However, most of recently proposed transformer-based segmentation approaches simply treated transformers as assisted modules to help encode global context into convolutional representations without investigating how to optimally combine self-attention (i.e., the core of transformers) with convolution. To address this issue, in this paper, we introduce nnFormer (i.e., Not-aNother transFormer), a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution. In practice, nnFormer learns volumetric representations from 3D local volumes. Compared to the naive voxel-level self-attention implementation, such volume-based operations help to reduce the computational complexity by approximate 98% and 99.5% on Synapse and ACDC datasets, respectively. In comparison to prior-art network configurations, nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC. For instance, nnFormer outperforms Swin-UNet by over 7 percents on Synapse. Even when compared to nnUNet, currently the best performing fully-convolutional medical segmentation network, nnFormer still provides slightly better performance on Synapse and ACDC.
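
下面给出一个示意性的局部三维体积(窗口)自注意力草图，用于说明摘要中"从三维局部体积学习体积表示、从而较全体素自注意力大幅降低计算量"的思路；窗口大小、通道数均为假设值，并非nnFormer的官方实现。

```python
import torch
import torch.nn as nn

class LocalVolumeSelfAttention(nn.Module):
    """示意性实现：把体积划分为不重叠的局部窗口，只在窗口内做多头自注意力。"""
    def __init__(self, dim=96, num_heads=4, window=(4, 4, 4)):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, D, H, W)，假设 D/H/W 能被窗口大小整除
        B, C, D, H, W = x.shape
        wd, wh, ww = self.window
        x = x.view(B, C, D // wd, wd, H // wh, wh, W // ww, ww)
        # 重排为 (窗口数, 窗口内体素数, 通道)，注意力只在窗口内计算
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).reshape(-1, wd * wh * ww, C)
        out, _ = self.attn(x, x, x)
        out = out.view(B, D // wd, H // wh, W // ww, wd, wh, ww, C)
        out = out.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(B, C, D, H, W)
        return out

feat = torch.randn(1, 96, 16, 16, 16)                      # 假设的三维特征体
print(LocalVolumeSelfAttention()(feat).shape)              # torch.Size([1, 96, 16, 16, 16])
```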

【2】 FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting 标题:FuseFormer:融合Transformer中的细粒度信息进行视频修复 链接:https://arxiv.org/abs/2109.02974

作者:Rui Liu,Hanming Deng,Yangyi Huang,Xiaoyu Shi,Lewei Lu,Wenxiu Sun,Xiaogang Wang,Jifeng Dai,Hongsheng Li 机构:†CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong, ‡SenseTime Research, §Zhejiang University, ♯Tetras.AI, #School of CST, Xidian University 备注:To appear at ICCV 2021 摘要:Transformer作为一种用于建模长期关系的强大而灵活的体系结构,在视觉任务中得到了广泛的探索。然而,当用于需要细粒度表示的视频修复时,现有的方法仍然存在由于硬面片分割而导致细节边缘模糊的问题。在这里,我们通过提出FuseFormer来解决这个问题,FuseFormer是一种基于新的软分割和软合成操作的细粒度特征融合视频修复Transformer模型。软分割将特征地图划分为具有给定重叠间隔的多个面片。相反,软合成通过将不同的面片缝合到一个完整的特征图中来操作,其中重叠区域中的像素被汇总。这两个模块首先用于转换器层之前的标记化和转换器层之后的去标记化,以实现标记和功能之间的有效映射。因此,子面片级信息交互能够在相邻面片之间进行更有效的特征传播,从而合成视频中孔洞区域的生动内容。此外,在FuseFormer中,我们精心地将软合成和软分割插入前馈网络,使一维线性层能够模拟二维结构。子面片级特征融合能力进一步增强。在定量和定性评估中,我们提出的FuseFormer优于最先进的方法。我们还进行了详细的分析,以检验其优越性。 摘要:Transformer, as a strong and flexible architecture for modelling long-range relations, has been widely explored in vision tasks. However, when used in video inpainting that requires fine-grained representation, existed method still suffers from yielding blurry edges in detail due to the hard patch splitting. Here we aim to tackle this problem by proposing FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion based on novel Soft Split and Soft Composition operations. The soft split divides feature map into many patches with given overlapping interval. On the contrary, the soft composition operates by stitching different patches into a whole feature map where pixels in overlapping regions are summed up. These two modules are first used in tokenization before Transformer layers and de-tokenization after Transformer layers, for effective mapping between tokens and features. Therefore, sub-patch level information interaction is enabled for more effective feature propagation between neighboring patches, resulting in synthesizing vivid content for hole regions in videos. Moreover, in FuseFormer, we elaborately insert the soft composition and soft split into the feed-forward network, enabling the 1D linear layers to have the capability of modelling 2D structure. And, the sub-patch level feature fusion ability is further enhanced. In both quantitative and qualitative evaluations, our proposed FuseFormer surpasses state-of-the-art methods. We also conduct detailed analysis to examine its superiority.
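
摘要中的"软分割(Soft Split)"与"软合成(Soft Composition)"可以用PyTorch自带的Unfold/Fold近似示意：带重叠间隔地切分特征图，再把重叠区域按像素求和拼回完整特征图。以下只是理解用的草图，块大小与步长为假设值，并非FuseFormer的官方实现。

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 16, 64, 64
kernel, stride = (8, 8), (4, 4)          # 重叠切块：块大小 8，步长 4

soft_split = nn.Unfold(kernel_size=kernel, stride=stride)
soft_comp = nn.Fold(output_size=(H, W), kernel_size=kernel, stride=stride)

x = torch.randn(B, C, H, W)
patches = soft_split(x)                  # (B, C*8*8, L)，L 为重叠块(token)数量
# ... 通常在这里把 patches 作为 token 送入 Transformer 层 ...
recon = soft_comp(patches)               # 重叠区域像素相加，拼回完整特征图
print(patches.shape, recon.shape)        # torch.Size([2, 1024, 225]) torch.Size([2, 16, 64, 64])
```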

【3】 GCsT: Graph Convolutional Skeleton Transformer for Action Recognition 标题:GCsT:面向动作识别的图卷积骨架变换器 链接:https://arxiv.org/abs/2109.02860

作者:Ruwen Bai,Min Li,Bo Meng,Fengfa Li,Junxing Ren,Miao Jiang,Degang Sun 机构:Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Beijing Institute of Technology, Beijing, China 备注:8 pages, 5 figures 摘要:图卷积网络(GCN)在基于骨架的动作识别中具有良好的性能。然而,在大多数基于GCN的方法中,时空图卷积受到图拓扑的严格限制,而只捕获短期的时间上下文,因此缺乏特征提取的灵活性。在这项工作中,我们提出了一种新的架构,名为图卷积骨架变换器(GCsT),它通过引入变换器来解决GCNs中的限制。我们的GCsT利用了Transformer的所有优势(即动态注意和全局上下文),同时保留了GCNs的优势(即层次结构和局部拓扑结构)。在GCsT中,时空GCN强制捕获局部依赖,而Transformer动态提取全局时空关系。此外,建议的GCsT通过添加骨架序列中的附加信息显示出更强的表达能力。合并Transformer允许几乎毫不费力地将信息引入模型中。我们通过大量实验验证了所提出的GCsT,在NTU RGB D、NTU RGB D 120和西北加州大学洛杉矶分校数据集上实现了最先进的性能。 摘要:Graph convolutional networks (GCNs) achieve promising performance for skeleton-based action recognition. However, in most GCN-based methods, the spatial-temporal graph convolution is strictly restricted by the graph topology while only captures the short-term temporal context, thus lacking the flexibility of feature extraction. In this work, we present a novel architecture, named Graph Convolutional skeleton Transformer (GCsT), which addresses limitations in GCNs by introducing Transformer. Our GCsT employs all the benefits of Transformer (i.e. dynamical attention and global context) while keeps the advantages of GCNs (i.e. hierarchy and local topology structure). In GCsT, the spatial-temporal GCN forces the capture of local dependencies while Transformer dynamically extracts global spatial-temporal relationships. Furthermore, the proposed GCsT shows stronger expressive capability by adding additional information present in skeleton sequences. Incorporating the Transformer allows that information to be introduced into the model almost effortlessly. We validate the proposed GCsT by conducting extensive experiments, which achieves the state-of-the-art performance on NTU RGB D, NTU RGB D 120 and Northwestern-UCLA datasets.
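
以下是"GCN捕获局部关节依赖 + Transformer提取全局时空关系"这一混合结构的极简草图，仅示意摘要中的组合思路；邻接矩阵的构造、维度等均为假设，并非GCsT的官方实现。

```python
import torch
import torch.nn as nn

class GCNTransformerBlock(nn.Module):
    def __init__(self, in_dim=3, dim=64, num_joints=25, heads=4):
        super().__init__()
        # 可学习的关节邻接矩阵(实际方法通常按骨架拓扑初始化)
        self.adj = nn.Parameter(torch.eye(num_joints))
        self.gcn = nn.Linear(in_dim, dim)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, V, C)，即批大小、帧数、关节数、坐标维度
        B, T, V, C = x.shape
        x = torch.einsum('vu,btuc->btvc', self.adj.softmax(-1), x)  # 图卷积：按邻接聚合关节特征
        x = self.gcn(x)                                             # (B, T, V, dim)
        x = x.reshape(B, T * V, -1)                                 # 展平成时空 token 序列
        return self.transformer(x)                                  # 全局时空自注意力

out = GCNTransformerBlock()(torch.randn(2, 30, 25, 3))
print(out.shape)   # torch.Size([2, 750, 64])
```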

【4】 Vision Transformers For Weeds and Crops Classification Of High Resolution UAV Images 标题:用于高分辨率无人机图像杂草和农作物分类的视觉转换器 链接:https://arxiv.org/abs/2109.02716

作者:Reenul Reedha,Eric Dericquebourg,Raphael Canals,Adel Hafiane 机构:INSA CVL, University of Orleans, PRISME EA , Bourges,France 摘要:作物和杂草监测是当今农业和粮食生产面临的重要挑战。由于数据采集和计算技术的最新进展,农业正在向更加智能和精确的农业发展,以满足高产和优质作物生产的需要。无人机(UAV)图像的分类和识别是作物监测的重要环节。基于卷积神经网络(CNN)的深度学习模型在农业领域取得了很好的分类效果。尽管这种体系结构取得了成功,CNN仍然面临着许多挑战,如计算成本高、需要大的标记数据集等。。。自然语言处理的transformer架构可以作为处理CNN局限性的替代方法。利用自我注意范式,视觉变换(ViT)模型可以在不使用任何卷积运算的情况下获得有竞争力或更好的结果。在本文中,我们通过ViT模型采用自我注意机制对杂草和作物进行植物分类:红甜菜、非类型甜菜(绿叶)、欧芹和菠菜。我们的实验表明,与最先进的基于CNN的模型EfficientNet和ResNet相比,在少量标记训练数据的情况下,ViT模型的性能更好,ViT模型的最高精确度达到99.8%。 摘要:Crop and weed monitoring is an important challenge for agriculture and food production nowadays. Thanks to recent advances in data acquisition and computation technologies, agriculture is evolving to a more smart and precision farming to meet with the high yield and high quality crop production. Classification and recognition in Unmanned Aerial Vehicles (UAV) images are important phases for crop monitoring. Advances in deep learning models relying on Convolutional Neural Network (CNN) have achieved high performances in image classification in the agricultural domain. Despite the success of this architecture, CNN still faces many challenges such as high computation cost, the need of large labelled datasets, ... Natural language processing's transformer architecture can be an alternative approach to deal with CNN's limitations. Making use of the self-attention paradigm, Vision Transformer (ViT) models can achieve competitive or better results without applying any convolution operations. In this paper, we adopt the self-attention mechanism via the ViT models for plant classification of weeds and crops: red beet, off-type beet (green leaves), parsley and spinach. Our experiments show that with small set of labelled training data, ViT models perform better compared to state-of-the-art CNN-based models EfficientNet and ResNet, with a top accuracy of 99.8% achieved by the ViT model.
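
用预训练ViT做四类植物分类的流程大致如下草图所示：加载预训练权重并替换分类头后微调。这里借助timm库演示，超参数与数据均为假设，并非论文的官方训练配置。

```python
import timm
import torch

# pretrained=True 会下载 ImageNet 预训练权重；num_classes=4 对应红甜菜、非标准甜菜、欧芹、菠菜
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)       # 一个假设的小批量无人机图像
labels = torch.randint(0, 4, (8,))
logits = model(images)                      # (8, 4)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```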

检测相关(4篇)
检测相关(4篇)

【1】 DeepFakes: Detecting Forged and Synthetic Media Content Using Machine Learning 标题:DeepFake:使用机器学习检测伪造和合成媒体内容 链接:https://arxiv.org/abs/2109.02874

作者:Sm Zobaed,Md Fazle Rabby,Md Istiaq Hossain,Ekram Hossain,Sazib Hasan,Asif Karim,Khan Md. Hasib 机构: University of Louisiana, Lafayette, LA , USA, Southern Utah University, Cedar City, UT , USA, Dixie State University, St. George, UT , USA, Charles Darwin University, Casuarina, NT , Australia, Ahsanullah University of Science & Technology, Dhaka, Bangladesh 备注:A preprint version 摘要:深度学习的快速发展使得区分真实的和经过处理的面部图像和视频剪辑变得前所未有的困难。通过深度生成方法操纵面部外观的底层技术,如DeepFake,最近通过推广大量恶意面部操纵应用程序而出现。随后,需要其他类型的技术来评估数字视觉内容的完整性,这是毋庸置疑的,以减少DeepFake创作的影响。对深度伪造和检测进行的大量研究创造了一个超越当前状态的相互推动的范围。本研究通过回顾DeepFake领域的著名研究,提出了与DeepFake创建和检测技术相关的挑战、研究趋势和方向,以促进开发更稳健的方法,从而在未来应对更先进的DeepFake。 摘要:The rapid advancement in deep learning makes the differentiation of authentic and manipulated facial images and video clips unprecedentedly harder. The underlying technology of manipulating facial appearances through deep generative approaches, enunciated as DeepFake that have emerged recently by promoting a vast number of malicious face manipulation applications. Subsequently, the need of other sort of techniques that can assess the integrity of digital visual content is indisputable to reduce the impact of the creations of DeepFake. A large body of research that are performed on DeepFake creation and detection create a scope of pushing each other beyond the current status. This study presents challenges, research trends, and directions related to DeepFake creation and detection techniques by reviewing the notable research in the DeepFake domain to facilitate the development of more robust approaches that could deal with the more advance DeepFake in the future.

【2】 Zero-Shot Open Set Detection by Extending CLIP 标题:基于扩展CLIP的Zero-Shot开集检测 链接:https://arxiv.org/abs/2109.02748

作者:Sepideh Esmaeilpour,Bing Liu,Eric Robertson,Lei Shu 机构: University of Illinois at Chicago, PAR Government 摘要:在常规的开集检测问题中，已知类(也称闭集类)的样本被用来训练一个专门的分类器。在测试中，该分类器可以(1)把已知类的测试样本分到各自的类别，(2)同时检测不属于任何已知类的样本(我们称其属于某些未知类或开集类)。本文研究零样本开集检测问题，该问题在测试时仍执行相同的两个任务，但除了使用给定的已知类名之外不做任何训练。本文提出了一种新颖而简单的方法(称为ZO-CLIP)来解决这个问题。ZO-CLIP建立在通过多模态表示学习实现零样本分类的最新进展之上。它首先在预训练的多模态模型CLIP之上训练一个基于文本的图像描述生成器，以此对CLIP进行扩展。在测试中，它使用扩展后的模型为每个测试样本生成一些候选未知类名，并基于已知类名和候选未知类名计算置信度分数，用于零样本开集检测。在5个开集检测基准数据集上的实验结果证实，ZO-CLIP大幅优于各基线方法。 摘要:In a regular open set detection problem, samples of known classes (also called closed set classes) are used to train a special classifier. In testing, the classifier can (1) classify the test samples of known classes to their respective classes and (2) also detect samples that do not belong to any of the known classes (we say they belong to some unknown or open set classes). This paper studies the problem of zero-shot open-set detection, which still performs the same two tasks in testing but has no training except using the given known class names. This paper proposes a novel and yet simple method (called ZO-CLIP) to solve the problem. ZO-CLIP builds on top of the recent advances in zero-shot classification through multi-modal representation learning. It first extends the pre-trained multi-modal model CLIP by training a text-based image description generator on top of CLIP. In testing, it uses the extended model to generate some candidate unknown class names for each test sample and computes a confidence score based on both the known class names and candidate unknown class names for zero-shot open set detection. Experimental results on 5 benchmark datasets for open set detection confirm that ZO-CLIP outperforms the baselines by a large margin.
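
下面是用OpenAI CLIP对"已知类名 + 候选未知类名"计算置信度分数的示意代码，用来说明摘要中零样本开集打分的思路；候选未知类名此处手工给定(论文中由扩展的图像描述生成器产生)，类名、图片路径与判别规则均为假设，并非ZO-CLIP的官方实现。

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

known_classes = ["dog", "cat", "car"]            # 闭集类名(假设)
candidate_unknown = ["horse", "airplane"]        # 候选未知类名(假设，论文中自动生成)
all_names = known_classes + candidate_unknown

image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)   # 假设的测试图像
text = clip.tokenize([f"a photo of a {c}" for c in all_names]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

known_score = probs[:len(known_classes)].max()   # 已知类的最大置信度
# 若已知类得分低于候选未知类得分，则判为开集(未知)样本
is_open_set = known_score < probs[len(known_classes):].max()
print(probs.tolist(), bool(is_open_set))
```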

【3】 Automatic Landmarks Correspondence Detection in Medical Images with an Application to Deformable Image Registration 标题:医学图像中自动标记对应检测及其在变形图像配准中的应用 链接:https://arxiv.org/abs/2109.02722

作者:Monika Grewal,Jan Wiersma,Henrike Westerveld,Peter A. N. Bosman,Tanja Alderliesten 机构:Life Science & Health Research Group, Centrum Wiskunde & Informatica, XG, Amsterdam, The Netherlands, Department of Radiation Oncology, Amsterdam University Medical Centers, location AMC, University of Amsterdam, Amsterdam, The Netherlands 备注:preprint submitted for peer review 摘要:可变形图像配准(DIR)可以受益于使用图像中相应地标的附加制导。然而,其优点在很大程度上尚未得到研究,特别是由于缺乏三维(3D)医学图像中相应地标的自动检测方法。在这项工作中,我们提出了一种深度卷积神经网络(DCNN),称为DCNN匹配,它学习以自我监督的方式预测3D图像中的地标对应。我们探索了使用不同损失函数的五种DCNN匹配变体,并分别测试了DCNN匹配,以及与开源注册软件Elastix结合,以评估其对常见DIR方法的影响。我们使用了子宫颈癌患者的下腹部计算机断层扫描(CT)扫描:121对盆腔CT扫描,包含模拟弹性变形,11对显示临床变形。我们的结果表明,当使用DCNN匹配预测的地标对应用于模拟和临床变形时,DIR性能显著改善。我们还观察到,自动识别的地标的空间分布和相关的匹配错误会影响DIR的改善程度。最后,发现DCNN匹配可以很好地推广到磁共振成像(MRI)扫描,而无需再训练,这表明它很容易适用于其他数据集。 摘要:Deformable Image Registration (DIR) can benefit from additional guidance using corresponding landmarks in the images. However, the benefits thereof are largely understudied, especially due to the lack of automatic detection methods for corresponding landmarks in three-dimensional (3D) medical images. In this work, we present a Deep Convolutional Neural Network (DCNN), called DCNN-Match, that learns to predict landmark correspondences in 3D images in a self-supervised manner. We explored five variants of DCNN-Match that use different loss functions and tested DCNN-Match separately as well as in combination with the open-source registration software Elastix to assess its impact on a common DIR approach. We employed lower-abdominal Computed Tomography (CT) scans from cervical cancer patients: 121 pelvic CT scan pairs containing simulated elastic transformations and 11 pairs demonstrating clinical deformations. Our results show significant improvement in DIR performance when landmark correspondences predicted by DCNN-Match were used in case of simulated as well as clinical deformations. We also observed that the spatial distribution of the automatically identified landmarks and the associated matching errors affect the extent of improvement in DIR. Finally, DCNN-Match was found to generalize well to Magnetic Resonance Imaging (MRI) scans without requiring retraining, indicating easy applicability to other datasets.

【4】 Graph Attention Layer Evolves Semantic Segmentation for Road Pothole Detection: A Benchmark and Algorithms 标题:基于图注意力层进化语义分割的道路坑洞检测基准与算法 链接:https://arxiv.org/abs/2109.02711

作者:Rui Fan,Hengli Wang,Yuan Wang,Ming Liu,Ioannis Pitas 机构: the Hong Kong University of Science and Technology 备注:accepted as a regular paper to IEEE Transactions on Image Processing 摘要:现有的道路坑洼检测方法可分为基于计算机视觉的方法和基于机器学习的方法。前一类方法通常采用二维图像分析/理解或三维点云建模与分割算法，从视觉传感器数据中检测道路坑洼。后一类方法一般以端到端的方式使用卷积神经网络(CNN)处理道路坑洼检测。然而，道路坑洼并非无处不在，为CNN训练准备一个大型、标注良好的数据集颇具挑战。因此，基于计算机视觉的方法是过去十年的主流研究趋势，而基于机器学习的方法讨论较少。最近，我们发布了第一个基于立体视觉的道路坑洼检测数据集和一种新的视差变换算法，借助它可以高度区分受损和未受损的道路区域。然而，目前还没有使用视差图像或变换视差图像训练的最先进(SoTA)CNN的基准。因此，在本文中，我们首先讨论了用于语义分割的SoTA CNN，并通过大量实验评估了它们在道路坑洼检测中的性能。此外，受图神经网络(GNN)的启发，我们提出了一种新的CNN层，称为图注意力层(GAL)，它可以很容易地部署到任何现有的CNN中，以优化用于语义分割的图像特征表示。我们的实验将性能最好的实现GAL-DeepLabv3+与九个SoTA CNN在三种模态的训练数据上进行了比较:RGB图像、视差图像和变换视差图像。实验结果表明，我们提出的GAL-DeepLabv3+在所有训练数据模态下都达到了最佳的整体坑洼检测精度。 摘要:Existing road pothole detection approaches can be classified as computer vision-based or machine learning-based. The former approaches typically employ 2-D image analysis/understanding or 3-D point cloud modeling and segmentation algorithms to detect road potholes from vision sensor data. The latter approaches generally address road pothole detection using convolutional neural networks (CNNs) in an end-to-end manner. However, road potholes are not necessarily ubiquitous and it is challenging to prepare a large well-annotated dataset for CNN training. In this regard, while computer vision-based methods were the mainstream research trend in the past decade, machine learning-based methods were merely discussed. Recently, we published the first stereo vision-based road pothole detection dataset and a novel disparity transformation algorithm, whereby the damaged and undamaged road areas can be highly distinguished. However, there are no benchmarks currently available for state-of-the-art (SoTA) CNNs trained using either disparity images or transformed disparity images. Therefore, in this paper, we first discuss the SoTA CNNs designed for semantic segmentation and evaluate their performance for road pothole detection with extensive experiments. Additionally, inspired by graph neural network (GNN), we propose a novel CNN layer, referred to as graph attention layer (GAL), which can be easily deployed in any existing CNN to optimize image feature representations for semantic segmentation. Our experiments compare GAL-DeepLabv3+, our best-performing implementation, with nine SoTA CNNs on three modalities of training data: RGB images, disparity images, and transformed disparity images. The experimental results suggest that our proposed GAL-DeepLabv3+ achieves the best overall pothole detection accuracy on all training data modalities.
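
下面是一个把特征图的每个空间位置视为图节点、用注意力聚合节点信息的简化层，仅用于示意摘要中"可嵌入任意现有CNN、用图注意力优化特征表示"这一思想；结构、维度均为假设，并非论文中GAL的官方实现。

```python
import torch
import torch.nn as nn

class SimpleGraphAttentionLayer(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8) 节点特征
        k = self.key(x).flatten(2)                      # (B, C//8, HW)
        v = self.value(x).flatten(2).transpose(1, 2)    # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # 节点间注意力(完全图)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return x + out                                   # 残差连接，便于插入现有 CNN

print(SimpleGraphAttentionLayer()(torch.randn(1, 64, 32, 32)).shape)
```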

分类|识别相关(8篇)

【1】 Rethinking Common Assumptions to Mitigate Racial Bias in Face Recognition Datasets 标题:重新思考减轻人脸识别数据集中种族偏见的常见假设 链接:https://arxiv.org/abs/2109.03229

作者:Matthew Gwilliam,Srinidhi Hegde,Lade Tinubu,Alex Hanson 机构:University of Maryland, University of Chicago 备注:Accepted as an oral paper to the Human-centric Trustworthy Computer Vision (HTCV 2021) workshop at ICCV 2021; 17 pages, 9 figures, 6 tables 摘要:许多现有的研究已经在减少人脸识别中的种族偏见方面取得了巨大的进步。然而,这些方法中的大多数都试图纠正训练期间模型中出现的偏差,而不是直接解决偏差的主要来源,即数据集本身。BUPT Balancedface/RFW和Fairface是例外,但这些工作假设主要在单一种族上训练或不在数据集上实现种族平衡本质上是不利的。我们证明这些假设不一定有效。在我们的实验中,仅对非洲面孔进行的训练比对平衡的面孔分布进行的训练产生的偏见要少,而将更多非洲面孔包括在内的分布会产生更公平的模型。我们还注意到,向数据集中添加更多现有身份的图像,而不是添加新身份,可以提高跨种族类别的准确性。我们的代码可在https://github.com/j-alex-hanson/rethinking-race-face-datasets 摘要:Many existing works have made great strides towards reducing racial bias in face recognition. However, most of these methods attempt to rectify bias that manifests in models during training instead of directly addressing a major source of the bias, the dataset itself. Exceptions to this are BUPT-Balancedface/RFW and Fairface, but these works assume that primarily training on a single race or not racially balancing the dataset are inherently disadvantageous. We demonstrate that these assumptions are not necessarily valid. In our experiments, training on only African faces induced less bias than training on a balanced distribution of faces and distributions skewed to include more African faces produced more equitable models. We additionally notice that adding more images of existing identities to a dataset in place of adding new identities can lead to accuracy boosts across racial categories. Our code is available at https://github.com/j-alex-hanson/rethinking-race-face-datasets

【2】 Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos 标题:集合:内窥镜视频中识别手术动作三联体的注意机制 链接:https://arxiv.org/abs/2109.03223

作者:Chinedu Innocent Nwoye,Tong Yu,Cristians Gonzalez,Barbara Seeliger,Pietro Mascagni,Didier Mutter,Jacques Marescaux,Nicolas Padoy 机构:ICube, University of Strasbourg, CNRS, France, IHU Strasbourg, France, University Hospital of Strasbourg, France, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy, IRCAD France 备注:21 pages, 9 figures, 19 tables. Submitted for journal peer-review on 20th August, 2021. Supplementary video available at: this https URL 摘要:在所有现有的内窥镜视频手术流程分析框架中，动作三元组识别是唯一旨在提供真正细粒度、全面的手术活动信息的框架。这些信息以<器械, 动词, 目标>组合的形式呈现，要准确识别非常困难。三元组的各个组成部分可能单独就难以识别;该任务不仅需要同时识别全部三个三元组组成部分，还需要正确建立它们之间的数据关联。为此，我们引入了新模型Rendezvous(RDV)，它通过在两个不同层次上利用注意力，直接从手术视频中识别三元组。我们首先引入一种新的空间注意力形式来捕捉场景中的单个动作三元组组成部分，称为类激活引导注意力机制(CAGAM)。该技术侧重于利用器械产生的激活来识别动词和目标。为了解决关联问题，我们的RDV模型加入了一种受Transformer网络启发的新形式的语义注意力。借助多个交叉注意力和自注意力头，RDV能够有效捕捉器械、动词和目标之间的关系。我们还介绍了CholecT50——一个包含50个内窥镜视频的数据集，其中每一帧都标注了来自100个三元组类别的标签。与该数据集上的最新方法相比，我们提出的RDV模型将三元组预测mAP显著提高了9%以上。 摘要:Out of all existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as <instrument, verb, target> combinations, is highly challenging to be accurately identified. Triplet components can be difficult to recognize individually; in this task, it requires not only performing recognition simultaneously for all three triplet components, but also correctly establishing the data association between them. To achieve this task, we introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels. We first introduce a new form of spatial attention to capture individual action triplet components in a scene; called the Class Activation Guided Attention Mechanism (CAGAM). This technique focuses on the recognition of verbs and targets using activations resulting from instruments. To solve the association problem, our RDV model adds a new form of semantic attention inspired by Transformer networks. Using multiple heads of cross and self attentions, RDV is able to effectively capture relationships between instruments, verbs, and targets. We also introduce CholecT50 - a dataset of 50 endoscopic videos in which every frame has been annotated with labels from 100 triplet classes. Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.

【3】 Support Vector Machine for Handwritten Character Recognition 标题:用于手写字符识别的支持向量机 链接:https://arxiv.org/abs/2109.03081

作者:Jomy John 机构:Department of Computer Science, K. K. T. M. Government College, Pullut. 备注:9 pages, KKTM Cognizance - A Multidisciplinary Journal, March 2016, ISSN:2456-4168 摘要:手写识别一直是图像处理和模式识别领域最吸引人和最具挑战性的研究领域之一。它极大地促进了自动化过程的改进。本文提出了一种无约束手写马来文字符识别系统。在这项工作中,使用了一个包含44个基本马来语字符的10000个字符样本的数据库。使用64个局部特征和4个全局特征的判别特征集对SVM分类器进行训练和测试,准确率达到92.24% 摘要:Handwriting recognition has been one of the most fascinating and challenging research areas in field of image processing and pattern recognition. It contributes enormously to the improvement of automation process. In this paper, a system for recognition of unconstrained handwritten Malayalam characters is proposed. A database of 10,000 character samples of 44 basic Malayalam characters is used in this work. A discriminate feature set of 64 local and 4 global features are used to train and test SVM classifier and achieved 92.24% accuracy
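
摘要中的实验设置(68维手工特征 + SVM分类器 + 44个字符类别)可以用scikit-learn快速复现流程，如下草图所示；这里用随机数据代替真实的马拉雅拉姆文字符特征，核函数与超参数为假设值，仅作流程演示。

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 68))          # 10000 个样本 × (64 局部 + 4 全局) 特征，此处为假数据
y = rng.integers(0, 44, size=10000)       # 44 个基本字符类别

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel='rbf', C=10, gamma='scale')   # 核函数与超参数为假设值
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```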

【4】 Fine-grained Hand Gesture Recognition in Multi-viewpoint Hand Hygiene 标题:多视点手部卫生中的细粒度手势识别 链接:https://arxiv.org/abs/2109.02917

作者:Huy Q. Vo,Tuong Do,Vi C. Pham,Duy Nguyen,An T. Duong,Quang D. Tran 备注:6 pages, accepted for oral in IEEE SMC 2021 摘要:本文为手卫生系统中的手势识别提供了一个新的高质量数据集,名为“MFH”。一般来说,当前的数据集并不关注:(i)细粒度的操作;以及(ii)不同视点之间的数据不匹配,这在真实设置下可用。为了解决上述问题,建议MFH数据集包含由6个非重叠位置的不同摄像机视图获得的总共731147个样本。此外,每个样本属于世界卫生组织(WHO)提出的七个步骤之一。作为一个小贡献,受细粒度图像识别和分布自适应的启发,本文建议使用自监督学习方法来处理上述问题。在基准MFH数据集上的大量实验表明,引入的方法在精确度和宏观F1分数方面都具有竞争力。代码和MFH数据集可在https://github.com/willogy-team/hand-gesture-recognition-smc2021. 摘要:This paper contributes a new high-quality dataset for hand gesture recognition in hand hygiene systems, named "MFH". Generally, current datasets are not focused on: (i) fine-grained actions; and (ii) data mismatch between different viewpoints, which are available under realistic settings. To address the aforementioned issues, the MFH dataset is proposed to contain a total of 731147 samples obtained by different camera views in 6 non-overlapping locations. Additionally, each sample belongs to one of seven steps introduced by the World Health Organization (WHO). As a minor contribution, inspired by advances in fine-grained image recognition and distribution adaptation, this paper recommends using the self-supervised learning method to handle these preceding problems. The extensive experiments on the benchmarking MFH dataset show that the introduced method yields competitive performance in both the Accuracy and the Macro F1-score. The code and the MFH dataset are available at https://github.com/willogy-team/hand-gesture-recognition-smc2021.

【5】 ICCAD Special Session Paper: Quantum-Classical Hybrid Machine Learning for Image Classification 标题:ICCAD专题会议论文:用于图像分类的量子-经典混合机器学习 链接:https://arxiv.org/abs/2109.02862

作者:Mahabubul Alam,Satwik Kundu,Rasit Onur Topaloglu,Swaroop Ghosh 机构:School of Electrical Engineering and Computer Science, Penn State University, University Park, IBM Corporation 摘要:图像分类是传统深度学习(DL)的一个主要应用领域。量子机器学习(QML)有可能彻底改变图像分类。在典型的基于DL的图像分类中，我们使用卷积神经网络(CNN)从图像中提取特征，并使用多层感知器网络(MLP)构建实际的决策边界。一方面，QML模型在这两项任务中都可能有用:参数化量子电路卷积(Quanvolution)可以从图像中提取丰富的特征;另一方面，量子神经网络(QNN)模型可以构建复杂的决策边界。因此，Quanvolution和QNN可用于构建端到端的图像分类QML模型。或者，我们也可以先用经典降维技术(如主成分分析(PCA)或卷积自编码器(CAE))单独提取图像特征，再用提取的特征训练QNN。我们回顾了用于图像分类的量子-经典混合ML模型的两种方案，即Quanvolutional神经网络，以及先用经典算法降维、再接QNN的方案。特别地，我们主张在Quanvolution中使用可训练滤波器，并针对图像数据集采用基于CAE的特征提取(而不是使用PCA等线性变换进行降维)。我们讨论了这些模型的各种设计选择、潜在机会和缺点，并发布了一个基于Python的框架，用于创建和探索具有各种设计选择的混合模型。 摘要:Image classification is a major application domain for conventional deep learning (DL). Quantum machine learning (QML) has the potential to revolutionize image classification. In any typical DL-based image classification, we use convolutional neural network (CNN) to extract features from the image and multi-layer perceptron network (MLP) to create the actual decision boundaries. On one hand, QML models can be useful in both of these tasks. Convolution with parameterized quantum circuits (Quanvolution) can extract rich features from the images. On the other hand, quantum neural network (QNN) models can create complex decision boundaries. Therefore, Quanvolution and QNN can be used to create an end-to-end QML model for image classification. Alternatively, we can extract image features separately using classical dimension reduction techniques such as, Principal Components Analysis (PCA) or Convolutional Autoencoder (CAE) and use the extracted features to train a QNN. We review two proposals on quantum-classical hybrid ML models for image classification namely, Quanvolutional Neural Network and dimension reduction using a classical algorithm followed by QNN. Particularly, we make a case for trainable filters in Quanvolution and CAE-based feature extraction for image datasets (instead of dimension reduction using linear transformations such as, PCA). We discuss various design choices, potential opportunities, and drawbacks of these models. We also release a Python-based framework to create and explore these hybrid models with a variety of design choices.
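
摘要提到可先用卷积自编码器(CAE)做降维、再把低维特征送入QNN。下面仅给出CAE特征提取部分的PyTorch草图(量子部分需另用PennyLane、Qiskit等框架实现)；网络结构与维度均为假设，并非论文的官方实现。

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(                     # 28x28 灰度图 -> latent_dim 维特征
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(16 * 7 * 7, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (16, 7, 7)),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # 低维特征，可作为 QNN 的输入(例如角度编码到量子比特)
        return self.decoder(z), z

x = torch.randn(4, 1, 28, 28)
recon, z = ConvAutoencoder()(x)
print(recon.shape, z.shape)          # torch.Size([4, 1, 28, 28]) torch.Size([4, 8])
```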

【6】 CIM: Class-Irrelevant Mapping for Few-Shot Classification 标题:CIM:用于Few-Shot分类的类无关映射 链接:https://arxiv.org/abs/2109.02840

作者:Shuai Shao,Lei Xing,Yixin Chen,Yan-Jiang Wang,Bao-Di Liu,Yicong Zhou 机构:Member, IEEE 摘要:少数镜头分类(FSC)是近年来最受关注的热点问题之一。一般设置包括两个阶段:(1)使用基础数据(具有大量标记样本)预训练特征提取模型(FEM)(2) 使用FEM提取新数据的特征(带有少量标记样本和与基础数据完全不同的类别),然后使用待设计的分类器对其进行分类。预训练FEM对新数据的适应性决定了新特征的准确性,从而影响最终的分类性能。为此,如何评估预先训练好的FEM是FSC社区最关键的焦点。听起来传统的基于类激活映射(CAM)的方法可以通过叠加加权特征映射来实现这一点。然而,由于FSC的特殊性(例如,使用预先训练的FEM提取新特征时没有反向传播),我们无法使用新类激活特征映射。为了应对这一挑战,我们提出了一种简单、灵活的方法,称为类无关映射(CIM)。具体来说,首先,我们介绍了字典学习理论,并将特征映射的通道视为字典中的基础。然后利用特征映射对图像的特征向量进行拟合,得到相应的通道权值。最后,我们重叠加权特征图进行可视化,以评估预先训练的FEM在新数据上的能力。为了在评估不同模型时合理使用CIM,我们提出了一个新的度量指标,称为特征定位精度(FLA)。在实验中,我们首先在常规任务中将CIM与CAM进行了比较,并取得了优异的性能。接下来,我们使用我们的CIM来评估几个经典的FSC框架,而不考虑分类结果,并对它们进行讨论。 摘要:Few-shot classification (FSC) is one of the most concerned hot issues in recent years. The general setting consists of two phases: (1) Pre-train a feature extraction model (FEM) with base data (has large amounts of labeled samples). (2) Use the FEM to extract the features of novel data (with few labeled samples and totally different categories from base data), then classify them with the to-be-designed classifier. The adaptability of pre-trained FEM to novel data determines the accuracy of novel features, thereby affecting the final classification performances. To this end, how to appraise the pre-trained FEM is the most crucial focus in the FSC community. It sounds like traditional Class Activate Mapping (CAM) based methods can achieve this by overlaying weighted feature maps. However, due to the particularity of FSC (e.g., there is no backpropagation when using the pre-trained FEM to extract novel features), we cannot activate the feature map with the novel classes. To address this challenge, we propose a simple, flexible method, dubbed as Class-Irrelevant Mapping (CIM). Specifically, first, we introduce dictionary learning theory and view the channels of the feature map as the bases in a dictionary. Then we utilize the feature map to fit the feature vector of an image to achieve the corresponding channel weights. Finally, we overlap the weighted feature map for visualization to appraise the ability of pre-trained FEM on novel data. For fair use of CIM in evaluating different models, we propose a new measurement index, called Feature Localization Accuracy (FLA). In experiments, we first compare our CIM with CAM in regular tasks and achieve outstanding performances. Next, we use our CIM to appraise several classical FSC frameworks without considering the classification results and discuss them.
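
按摘要中"把特征图各通道视作字典基、用拟合得到通道权重、再加权叠加可视化"的描述，可以给出如下简化草图。摘要没有说明待拟合的目标向量如何构造，此处用一个假设的HW维目标特征向量代替，仅作演示，并非论文CIM的官方实现。

```python
import torch

C, H, W = 64, 14, 14
feature_map = torch.randn(C, H, W)
target_vec = torch.randn(H * W)                    # 假设的待拟合图像特征向量

bases = feature_map.reshape(C, -1).T               # (HW, C)：每一列是一个通道基
# 最小二乘求通道权重 w，使 bases @ w ≈ target_vec
w = torch.linalg.lstsq(bases, target_vec.unsqueeze(1)).solution.squeeze(1)   # (C,)

heatmap = (w[:, None, None] * feature_map).sum(0)  # 加权叠加各通道得到可视化热力图
heatmap = torch.relu(heatmap)
heatmap = heatmap / (heatmap.max() + 1e-8)
print(heatmap.shape)                               # torch.Size([14, 14])
```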

【7】 Improving Dietary Assessment Via Integrated Hierarchy Food Classification 标题:通过综合分级食品分类改进膳食评估 链接:https://arxiv.org/abs/2109.02736

作者:Runyu Mao,Jiangpeng He,Luotao Lin,Zeman Shao,Heather A. Eicher-Miller,Fengqing Zhu 机构:School of Electrical and Computer, Engineering, Purdue University, West Lafayette, Indiana, USA, Department of Nutrition Science 摘要:基于图像的饮食评估是指从视觉数据中确定某人吃什么以及消耗多少能量和营养的过程。食品分类是第一步也是最关键的一步。现有方法侧重于通过仅基于视觉信息的正确分类率来提高准确度,这是非常具有挑战性的,因为食品的高度复杂性和类间相似性。此外,食品分类的准确性是概念性的,因为对食品的描述总是可以改进的。在这项工作中,我们引入了一个新的食品分类框架,通过整合来自多个领域的信息来提高预测的质量,同时保持分类的准确性。我们应用了一个基于层次结构的多任务网络,该网络使用视觉和营养领域的特定信息来聚类相似的食物。我们的方法通过包含相关的能量和营养信息,在改进的VIPER FoodNet(VFN)食品图像数据集上得到验证。我们实现了与仅使用视觉信息的现有方法相当的分类精度,但错误预测的能量和营养价值误差较小。 摘要:Image-based dietary assessment refers to the process of determining what someone eats and how much energy and nutrients are consumed from visual data. Food classification is the first and most crucial step. Existing methods focus on improving accuracy measured by the rate of correct classification based on visual information alone, which is very challenging due to the high complexity and inter-class similarity of foods. Further, accuracy in food classification is conceptual as description of a food can always be improved. In this work, we introduce a new food classification framework to improve the quality of predictions by integrating the information from multiple domains while maintaining the classification accuracy. We apply a multi-task network based on a hierarchical structure that uses both visual and nutrition domain specific information to cluster similar foods. Our method is validated on the modified VIPER-FoodNet (VFN) food image dataset by including associated energy and nutrient information. We achieve comparable classification accuracy with existing methods that use visual information only, but with less error in terms of energy and nutrient values for the wrong predictions.

【8】 Rethinking Crowdsourcing Annotation: Partial Annotation with Salient Labels for Multi-Label Image Classification 标题:对众包标注的再思考:用于多标签图像分类的突出标签局部标注 链接:https://arxiv.org/abs/2109.02688

作者:Jianzhe Lin,Tianze Yu,Z. Jane Wang 机构:The University of British Columbia 摘要:在图像分类中,有监督的模型训练和评估都需要带注释的图像。手动注释图像既费力又昂贵,尤其是对于多标签图像。最近的一个趋势是通过众包来完成这种人工标注任务,即由志愿者或在线付费工作者(例如亚马逊机械土耳其公司的工人)从头开始对图像进行标注。然而,众包图像标注的质量无法保证,不完整性和不正确性是众包标注的两个主要问题。为了解决这些问题,我们对众包注释进行了重新思考:我们的简单假设是,如果注释者仅使用他们确信的显著标签对多标签图像进行部分注释,注释错误将减少,注释者将在不确定标签上花费更少的时间。令人惊喜的是,在注释预算相同的情况下,我们展示了由具有显著注释的图像监督的多标签图像分类器可以优于由完全注释的图像监督的模型。我们的方法贡献有两个方面:一是提出了一种主动学习方法来获取多标签图像的显著标签;提出了一种基于局部标注的自适应温度关联模型(ATAM)用于多标签图像分类。我们在实际的众包数据、开放街道地图(OSM)数据集和基准数据集COCO 2014上进行了实验。与在全注释图像上训练的最新分类方法相比,所提出的ATAM可以实现更高的精度。所提出的想法是有希望的众包数据注释。我们的代码将公开提供。 摘要:Annotated images are required for both supervised model training and evaluation in image classification. Manually annotating images is arduous and expensive, especially for multi-labeled images. A recent trend for conducting such laboursome annotation tasks is through crowdsourcing, where images are annotated by volunteers or paid workers online (e.g., workers of Amazon Mechanical Turk) from scratch. However, the quality of crowdsourcing image annotations cannot be guaranteed, and incompleteness and incorrectness are two major concerns for crowdsourcing annotations. To address such concerns, we have a rethinking of crowdsourcing annotations: Our simple hypothesis is that if the annotators only partially annotate multi-label images with salient labels they are confident in, there will be fewer annotation errors and annotators will spend less time on uncertain labels. As a pleasant surprise, with the same annotation budget, we show a multi-label image classifier supervised by images with salient annotations can outperform models supervised by fully annotated images. Our method contributions are 2-fold: An active learning way is proposed to acquire salient labels for multi-label images; and a novel Adaptive Temperature Associated Model (ATAM) specifically using partial annotations is proposed for multi-label image classification. We conduct experiments on practical crowdsourcing data, the Open Street Map (OSM) dataset and benchmark dataset COCO 2014. When compared with state-of-the-art classification methods trained on fully annotated images, the proposed ATAM can achieve higher accuracy. The proposed idea is promising for crowdsourcing data annotation. Our code will be publicly available.

分割|语义相关(2篇)

【1】 Self-supervised Tumor Segmentation through Layer Decomposition 标题:基于分层分解的自监督肿瘤分割 链接:https://arxiv.org/abs/2109.03230

作者:Xiaoman Zhang,Weidi Xie,Chaoqin Huang,Ya Zhang,Yanfeng Wang 机构: University of Oxford 备注:Submitted to IEEE TMI. Project webpage: this https URL 摘要:在本文中,我们提出了一种自监督的肿瘤分割方法。具体而言,我们提倡Zero-Shot设置,其中来自自监督学习的模型应直接适用于下游任务,而不使用任何手动注释。我们作出以下贡献。首先,通过仔细检查现有的自监督学习方法,我们发现了一个令人惊讶的结果,即在适当的数据扩充下,从头开始训练的模型实际上达到了与使用自监督学习预训练的模型相当的性能。其次,受肿瘤倾向于独立于上下文的事实的启发,我们提出了一个生成合成肿瘤数据的可伸缩管道,并训练了一个自我监督模型,该模型最大限度地缩小了与下游任务的概括差距。第三,我们对不同的下游数据集进行了广泛的消融研究,BraTS2018用于脑肿瘤分割,LiTS2017用于肝肿瘤分割。在评估低注释制度下肿瘤分割的模型可转移性时,包括Zero-Shot分割的极端情况,提出的方法展示了最先进的性能,大大优于所有现有的自我监督方法,以及开放在实际场景中使用自我监督学习。 摘要:In this paper, we propose a self-supervised approach for tumor segmentation. Specifically, we advocate a zero-shot setting, where models from self-supervised learning should be directly applicable for the downstream task, without using any manual annotations whatsoever. We make the following contributions. First, with careful examination on existing self-supervised learning approaches, we reveal the surprising result that, given suitable data augmentation, models trained from scratch in fact achieve comparable performance to those pre-trained with self-supervised learning. Second, inspired by the fact that tumors tend to be characterized independently to the contexts, we propose a scalable pipeline for generating synthetic tumor data, and train a self-supervised model that minimises the generalisation gap with the downstream task. Third, we conduct extensive ablation studies on different downstream datasets, BraTS2018 for brain tumor segmentation and LiTS2017 for liver tumor segmentation. While evaluating the model transferability for tumor segmentation under a low-annotation regime, including an extreme case of zero-shot segmentation, the proposed approach demonstrates state-of-the-art performance, substantially outperforming all existing self-supervised approaches, and opening up the usage of self-supervised learning in practical scenarios.

【2】 FDA: Feature Decomposition and Aggregation for Robust Airway Segmentation 标题:FDA:用于鲁棒气道分割的特征分解和聚合 链接:https://arxiv.org/abs/2109.02920

作者:Minghui Zhang,Xin Yu,Hanxiao Zhang,Hao Zheng,Weihao Yu,Hong Pan,Xiangran Cai,Yun Gu 机构: Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China, Department of Computer Science and Software Engineering, Swinburne University, of Technology, Victoria, Australia 备注:Accepted at MICCAI2021-DART 摘要:三维卷积神经网络(CNN)已被广泛用于气道分割。3D CNN的性能受数据集的影响很大,而公共气道数据集主要是带有粗略注释的干净CT扫描,因此很难推广到有噪声的CT扫描(如COVID-19 CT扫描)。在这项工作中,我们提出了一种新的双流网络来解决干净域和噪声域之间的可变性,该网络利用干净的CT扫描和少量标记的噪声CT扫描进行气道分割。我们设计了两个不同的编码器分别提取可转移的干净特征和唯一的噪声特征,然后设计了两个独立的解码器。此外,可转移特征通过通道特征重新校准和符号距离图(SDM)回归进行细化。特征重校准模块强调关键特征,SDM更加关注支气管,这有利于提取对粗标签鲁棒的可转移拓扑特征。大量的实验结果表明了该方法的有效性。与其他最先进的转移学习方法相比,我们的方法在嘈杂的CT扫描中准确地分割了更多的支气管。 摘要:3D Convolutional Neural Networks (CNNs) have been widely adopted for airway segmentation. The performance of 3D CNNs is greatly influenced by the dataset while the public airway datasets are mainly clean CT scans with coarse annotation, thus difficult to be generalized to noisy CT scans (e.g. COVID-19 CT scans). In this work, we proposed a new dual-stream network to address the variability between the clean domain and noisy domain, which utilizes the clean CT scans and a small amount of labeled noisy CT scans for airway segmentation. We designed two different encoders to extract the transferable clean features and the unique noisy features separately, followed by two independent decoders. Further on, the transferable features are refined by the channel-wise feature recalibration and Signed Distance Map (SDM) regression. The feature recalibration module emphasizes critical features and the SDM pays more attention to the bronchi, which is beneficial to extracting the transferable topological features robust to the coarse labels. Extensive experimental results demonstrated the obvious improvement brought by our proposed method. Compared to other state-of-the-art transfer learning methods, our method accurately segmented more bronchi in the noisy CT scans.
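
摘要中用于监督可迁移特征的有符号距离图(SDM)可以由二值分割掩码直接计算得到，下面是2D情形的示意代码(符号约定取"前景内为负、外为正"，这是常见约定之一)；掩码为假设数据，并非论文的官方实现。

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """mask: 前景为 1、背景为 0 的二值数组；返回前景内为负、前景外为正的距离图。"""
    mask = mask.astype(bool)
    if not mask.any() or mask.all():
        return np.zeros(mask.shape, dtype=np.float32)
    dist_out = distance_transform_edt(~mask)   # 背景像素到前景的距离
    dist_in = distance_transform_edt(mask)     # 前景像素到背景的距离
    return (dist_out - dist_in).astype(np.float32)

mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 20:40] = 1                          # 一个假设的气道/支气管掩码
sdm = signed_distance_map(mask)
print(sdm.min(), sdm.max())
```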

Zero/Few Shot|迁移|域适配|自适应(3篇)

【1】 Grassmannian Graph-attentional Landmark Selection for Domain Adaptation 标题:格拉斯曼图--领域自适应的注意标志选择 链接:https://arxiv.org/abs/2109.02990

作者:Bin Sun,Shaofan Wang,Dehui Kong,Jinghua Li,Baocai Yin 机构:Received: date Accepted: date 备注:MTAP-R1,27 pages with 6 figures 摘要:域自适应旨在利用源域中的信息来提高目标域中的分类性能。它主要采用两种方案:样本重加权和特征匹配。第一个方案为单个样本分配不同的权重,第二个方案使用全局结构统计匹配两个域的特征。这两种方案相互补充,有望共同实现鲁棒域自适应。有几种方法结合了这两种方案,但由于忽略了样本的层次结构和样本之间的几何特性,因此没有充分分析样本之间的潜在关系。为了更好地结合这两种方案的优点,我们提出了一种用于领域自适应的格拉斯曼图注意地标选择(GGLS)框架。GGLS提出了一种利用样本图形结构的注意诱导邻域的地标选择方案,并在Grassmann流形上执行分布自适应和知识自适应。前者对每个样本的地标进行不同的处理,后者避免了特征失真,获得了更好的几何特性。在不同的实际跨领域视觉识别任务上的实验结果表明,与最先进的领域自适应方法相比,GGLS具有更好的分类精度。 摘要:Domain adaptation aims to leverage information from the source domain to improve the classification performance in the target domain. It mainly utilizes two schemes: sample reweighting and feature matching. While the first scheme allocates different weights to individual samples, the second scheme matches the feature of two domains using global structural statistics. The two schemes are complementary with each other, which are expected to jointly work for robust domain adaptation. Several methods combine the two schemes, but the underlying relationship of samples is insufficiently analyzed due to the neglect of the hierarchy of samples and the geometric properties between samples. To better combine the advantages of the two schemes, we propose a Grassmannian graph-attentional landmark selection (GGLS) framework for domain adaptation. GGLS presents a landmark selection scheme using attention-induced neighbors of the graphical structure of samples and performs distribution adaptation and knowledge adaptation over Grassmann manifold. the former treats the landmarks of each sample differently, and the latter avoids feature distortion and achieves better geometric properties. Experimental results on different real-world cross-domain visual recognition tasks demonstrate that GGLS provides better classification accuracies compared with state-of-the-art domain adaptation methods.

【2】 Few-shot Learning via Dependency Maximization and Instance Discriminant Analysis 标题:基于依赖最大化和实例判别分析的少概率学习 链接:https://arxiv.org/abs/2109.02820

作者:Zejiang Hou,Sun-Yuan Kung 机构:Princeton University 摘要:我们研究少样本学习(FSL)问题，即模型在每个类别只有极少量标注训练数据的情况下学习识别新对象。以往的FSL方法大多采用元学习范式，通过学习大量训练任务来积累归纳偏置，从而解决新的、未见过的少样本任务。相比之下，我们提出了一种利用少样本任务所附带的未标注数据来提升少样本性能的简单方法。首先，我们提出了一种基于互协方差算子Hilbert-Schmidt范数的相关性最大化方法，在优化支持集上监督损失的同时，最大化未标注数据的嵌入特征与其标签预测之间的统计相关性。然后，我们使用得到的模型推断这些未标注数据的伪标签。此外，我们还提出了一种实例判别分析(Instance Discriminant Analysis)来评估每个伪标注样本的可信度，并把最可信的样本选入扩充后的支持集，如第一步那样重新训练模型。我们迭代上述过程，直到未标注数据的伪标签趋于稳定。在标准的直推式和半监督FSL设置下，我们的实验表明，所提方法在mini-ImageNet、tiered-ImageNet、CUB和CIFAR-FS这四个广泛使用的基准上优于以往的最先进方法。 摘要:We study the few-shot learning (FSL) problem, where a model learns to recognize new objects with extremely few labeled training data per category. Most of previous FSL approaches resort to the meta-learning paradigm, where the model accumulates inductive bias through learning many training tasks so as to solve a new unseen few-shot task. In contrast, we propose a simple approach to exploit unlabeled data accompanying the few-shot task for improving few-shot performance. Firstly, we propose a Dependency Maximization method based on the Hilbert-Schmidt norm of the cross-covariance operator, which maximizes the statistical dependency between the embedded feature of those unlabeled data and their label predictions, together with the supervised loss over the support set. We then use the obtained model to infer the pseudo-labels for those unlabeled data. Furthermore, we propose an Instance Discriminant Analysis to evaluate the credibility of each pseudo-labeled example and select the most faithful ones into an augmented support set to retrain the model as in the first step. We iterate the above process until the pseudo-labels for the unlabeled data becomes stable. Following the standard transductive and semi-supervised FSL setting, our experiments show that the proposed method outperforms previous state-of-the-art methods on four widely used benchmarks, including mini-ImageNet, tiered-ImageNet, CUB, and CIFAR-FS.
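
摘要中的"相关性最大化"一步可以用HSIC(即互协方差算子Hilbert-Schmidt范数的经验估计)来实现：度量无标注样本嵌入特征与其标签预测之间的统计相关性，训练时把它的负值加到监督损失上一起最小化。以下为示意草图，核函数与带宽为假设值，并非论文的官方实现。

```python
import torch

def rbf_kernel(x, sigma=1.0):
    d2 = torch.cdist(x, x) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """x: (n, dx) 嵌入特征；y: (n, dy) 软标签预测。返回经验 HSIC，数值越大表示越相关。"""
    n = x.shape[0]
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    H = torch.eye(n) - torch.ones(n, n) / n          # 中心化矩阵
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

feats = torch.randn(32, 64)                          # 无标注样本的嵌入特征(假设)
preds = torch.softmax(torch.randn(32, 5), dim=1)     # 对应的类别预测
loss = -hsic(feats, preds)                           # 取负号后与支持集监督损失相加一起优化
print(loss.item())
```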

【3】 Improving Transferability of Domain Adaptation Networks Through Domain Alignment Layers 标题:通过域对齐层提高域适配网络的可移植性 链接:https://arxiv.org/abs/2109.02693

作者:Lucas Fernando Alvarenga e Silva,Daniel Carlos Guimarães Pedronette,Fábio Augusto Faria,João Paulo Papa,Jurandy Almeida 摘要:深度学习(DL)已经成为各种计算机视觉任务中使用的主要方法,因为它在许多任务中取得了相关的结果。然而,在部分或没有标记数据的真实场景中,DL方法也容易出现众所周知的域转移问题。多源无监督领域自适应(MSDA)旨在通过从一组源模型中分配弱知识来学习未标记领域的预测值。然而,大多数工作仅利用提取的特征进行域自适应,并从损失函数设计的角度减少其域转移。在本文中,我们认为仅仅基于域级特征处理域转移是不够的,但是在特征空间上对齐这些信息也是必要的。与以前的工作不同,我们专注于网络设计,并建议在预测器的不同级别嵌入多源版本的域对齐层(MS-DIAL)。这些层旨在匹配不同域之间的特征分布,并且可以轻松地应用于各种MSDA方法。为了证明我们的方法的稳健性,我们考虑了两个具有挑战性的场景:数字识别和对象分类,进行了广泛的实验评估。实验结果表明,我们的方法可以改进最先进的MSDA方法,其分类精度的相对增益高达 30.64%。 摘要:Deep learning (DL) has been the primary approach used in various computer vision tasks due to its relevant results achieved on many tasks. However, on real-world scenarios with partially or no labeled data, DL methods are also prone to the well-known domain shift problem. Multi-source unsupervised domain adaptation (MSDA) aims at learning a predictor for an unlabeled domain by assigning weak knowledge from a bag of source models. However, most works conduct domain adaptation leveraging only the extracted features and reducing their domain shift from the perspective of loss function designs. In this paper, we argue that it is not sufficient to handle domain shift only based on domain-level features, but it is also essential to align such information on the feature space. Unlike previous works, we focus on the network design and propose to embed Multi-Source version of DomaIn Alignment Layers (MS-DIAL) at different levels of the predictor. These layers are designed to match the feature distributions between different domains and can be easily applied to various MSDA methods. To show the robustness of our approach, we conducted an extensive experimental evaluation considering two challenging scenarios: digit recognition and object classification. The experimental results indicated that our approach can improve state-of-the-art MSDA methods, yielding relative gains of up to 30.64% on their classification accuracies.
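
域对齐层的一种常见实现方式是为每个源域/目标域维护独立的BatchNorm统计量，并把它嵌入预测器的不同层级。下面是按摘要思路给出的简化草图，域数量与维度为假设值，并非MS-DIAL的官方实现。

```python
import torch
import torch.nn as nn

class DomainAlignmentLayer(nn.Module):
    def __init__(self, num_features, num_domains):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(num_features) for _ in range(num_domains))

    def forward(self, x, domain_idx):
        # 同一批数据来自同一个域时，用该域专属的 BN 做归一化，以对齐各域的特征分布
        return self.bns[domain_idx](x)

layer = DomainAlignmentLayer(num_features=64, num_domains=3)   # 例如 2 个源域 + 1 个目标域
x_src0 = torch.randn(8, 64, 16, 16)
print(layer(x_src0, domain_idx=0).shape)
```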

半弱无监督|主动学习|不确定性(2篇)

【1】 Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution 标题:基于分层自监督增强分布的知识精馏 链接:https://arxiv.org/abs/2109.03075

作者:Chuanguang Yang,Zhulin An,Linhang Cai,Yongjun Xu 备注:15 pages, an extension of Hierarchical Self-supervised Augmented Knowledge Distillation published at IJCAI-2021 摘要:知识提炼(KD)是一个有效的框架,旨在将有意义的信息从大教师传递给小学生。一般来说,知识发现通常涉及如何定义和传递知识。以前的知识发现方法通常侧重于挖掘各种形式的知识,例如特征图和细化信息。然而,知识来源于主要的监督任务,因此具有高度的任务特异性。受最近自我监督表征学习成功的启发,我们提出了一个辅助自我监督增强任务来引导网络学习更多有意义的特征。因此,我们可以从KD的这项任务中获得更丰富的暗知识,即软自我监督增强分布。与以前的知识不同,该分布对有监督和自监督特征学习的联合知识进行编码。除了知识探索,另一个关键的方面是如何有效地学习和提炼我们提出的知识。为了充分利用层次特征映射,我们建议在各个隐藏层上附加几个辅助分支。每个辅助分支都被引导学习自我监督增强任务,并从教师到学生之间提取这种分布。因此,我们称我们的KD方法为分层自监督增强知识提取(HSSAKD)。对标准图像分类的实验表明,离线和在线HSSAKD在KD领域都达到了最先进的性能。进一步的目标检测转移实验进一步验证了HSSAKD可以引导网络学习更好的特征,这可以归因于有效地学习和提取辅助自监督增强任务。 摘要:Knowledge distillation (KD) is an effective framework that aims to transfer meaningful information from a large teacher to a smaller student. Generally, KD often involves how to define and transfer knowledge. Previous KD methods often focus on mining various forms of knowledge, for example, feature maps and refined information. However, the knowledge is derived from the primary supervised task and thus is highly task-specific. Motivated by the recent success of self-supervised representation learning, we propose an auxiliary self-supervision augmented task to guide networks to learn more meaningful features. Therefore, we can derive soft self-supervision augmented distributions as richer dark knowledge from this task for KD. Unlike previous knowledge, this distribution encodes joint knowledge from supervised and self-supervised feature learning. Beyond knowledge exploration, another crucial aspect is how to learn and distill our proposed knowledge effectively. To fully take advantage of hierarchical feature maps, we propose to append several auxiliary branches at various hidden layers. Each auxiliary branch is guided to learn self-supervision augmented task and distill this distribution from teacher to student. Thus we call our KD method as Hierarchical Self-Supervision Augmented Knowledge Distillation (HSSAKD). Experiments on standard image classification show that both offline and online HSSAKD achieves state-of-the-art performance in the field of KD. Further transfer experiments on object detection further verify that HSSAKD can guide the network to learn better features, which can be attributed to learn and distill an auxiliary self-supervision augmented task effectively.
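
蒸馏"自监督增强分布"可以理解为：把原任务的C个类别与若干种自监督变换(如4种旋转)组合成C*4维的联合分布，再用带温度的KL散度从教师辅助分支蒸馏到学生辅助分支。以下为示意草图，旋转任务与温度等均为假设，并非HSSAKD的官方实现。

```python
import torch
import torch.nn.functional as F

def ss_augmented_kd_loss(student_logits, teacher_logits, T=4.0):
    """logits 形状均为 (N, C*4)：类别 × 自监督旋转的联合 logits。"""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (T * T)

num_classes, batch = 100, 8
# 每张图旋转 0/90/180/270 度后送入网络，辅助分支输出 C*4 维联合 logits(此处用随机数代替)
student_logits = torch.randn(batch * 4, num_classes * 4)
teacher_logits = torch.randn(batch * 4, num_classes * 4)
print(ss_augmented_kd_loss(student_logits, teacher_logits).item())
```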

【2】 Deep Collaborative Multi-Modal Learning for Unsupervised Kinship Estimation 标题:深度协作多模态学习在无监督亲属关系评估中的应用 链接:https://arxiv.org/abs/2109.02804

作者:Guan-Nan Dong,Chi-Man Pun,Zheng Zhang 摘要:亲属关系验证是计算机视觉领域一个长期存在的研究挑战。呈现在脸上的视觉差异对亲属系统的识别能力有显著影响。我们认为,聚合多种视觉知识可以更好地描述受试者的特征,以便进行精确的亲属身份识别。通常,年龄不变特征可以表示更自然的面部细节。由于衰老的生物学效应,这种与年龄相关的变化对于人脸识别至关重要。然而,现有的方法主要集中在利用单视图图像特征进行亲属关系识别,而在特征学习步骤中直接忽略了种族和年龄等更有意义的视觉特性。为此,我们提出了一种新的深度协作多模式学习(DCML)方法,以自适应方式整合人脸属性中呈现的潜在信息,从而增强人脸细节,从而实现有效的无监督亲属关系验证。具体而言,我们构建了一个设计良好的自适应特征融合机制,该机制可以联合利用不同视觉视角的互补特性来生成复合特征,并将更多注意力吸引到空间特征地图中信息量最大的部分。特别是,基于一种新的注意机制,提出了一种自适应加权策略,通过自适应地减少信道中的信息冗余,增强了不同属性之间的依赖性。为了验证所提出的方法的有效性,在四个广泛使用的数据集上进行的大量实验评估表明,我们的DCML方法总是优于一些最先进的亲属关系验证方法。 摘要:Kinship verification is a long-standing research challenge in computer vision. The visual differences presented to the face have a significant effect on the recognition capabilities of the kinship systems. We argue that aggregating multiple visual knowledge can better describe the characteristics of the subject for precise kinship identification. Typically, the age-invariant features can represent more natural facial details. Such age-related transformations are essential for face recognition due to the biological effects of aging. However, the existing methods mainly focus on employing the single-view image features for kinship identification, while more meaningful visual properties such as race and age are directly ignored in the feature learning step. To this end, we propose a novel deep collaborative multi-modal learning (DCML) to integrate the underlying information presented in facial properties in an adaptive manner to strengthen the facial details for effective unsupervised kinship verification. Specifically, we construct a well-designed adaptive feature fusion mechanism, which can jointly leverage the complementary properties from different visual perspectives to produce composite features and draw greater attention to the most informative components of spatial feature maps. Particularly, an adaptive weighting strategy is developed based on a novel attention mechanism, which can enhance the dependencies between different properties by decreasing the information redundancy in channels in a self-adaptive manner. To validate the effectiveness of the proposed method, extensive experimental evaluations conducted on four widely-used datasets show that our DCML method is always superior to some state-of-the-art kinship verification methods.

时序|行为识别|姿态|视频|运动估计(8篇)

【1】 Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors 标题:基于单模和多模检测器的音视频多模态深伪数据集评估 链接:https://arxiv.org/abs/2109.02993

作者:Hasam Khalid,Minha Kim,Shahroz Tariq,Simon S. Woo 机构:College of Computing and Informatics, Sungkyunkwan University, South Korea, Department of Applied Data Science 备注:2 Figures, 2 Tables, Accepted for publication at the 1st Workshop on Synthetic Multimedia - Audiovisual Deepfake Generation and Detection (ADGD '21) at ACM MM 2021 摘要:deepfakes产品的开发取得了重大进展,导致了安全和隐私问题。攻击者可以用目标人的脸替换他的脸,从而很容易地在图像中模拟一个人的身份。此外,利用深度学习技术克隆人声的新领域也正在出现。现在,攻击者只需几秒钟的目标人音频就可以生成逼真的克隆人声。随着deepfake可能造成的潜在危害的不断出现,研究人员提出了deepfake检测方法。然而,他们只专注于检测单一模态,即视频或音频。另一方面,为了开发一个好的deepfake检测器,以应对deepfake一代的最新进展,我们需要一个能够检测多种模式的deepfake的检测器,即视频和音频。为了构建这样一个检测器,我们需要一个包含视频和音频的数据集。我们能够找到最新的deepfake数据集,音频-视频多模式deepfake检测数据集(FakeAVCeleb),它不仅包含deepfake视频,还包含合成的假音频。我们使用这个多模态数据集,并使用最先进的单峰、基于集合和多模态检测方法进行详细的基线实验来评估它。我们通过详细的实验得出结论,与基于集成的方法相比,仅处理单一模态(视频或音频)的单模方法性能不佳。而纯粹基于多模式的基线提供了最差的性能。 摘要:Significant advancements made in the generation of deepfakes have caused security and privacy issues. Attackers can easily impersonate a person's identity in an image by replacing his face with the target person's face. Moreover, a new domain of cloning human voices using deep-learning technologies is also emerging. Now, an attacker can generate realistic cloned voices of humans using only a few seconds of audio of the target person. With the emerging threat of potential harm deepfakes can cause, researchers have proposed deepfake detection methods. However, they only focus on detecting a single modality, i.e., either video or audio. On the other hand, to develop a good deepfake detector that can cope with the recent advancements in deepfake generation, we need to have a detector that can detect deepfakes of multiple modalities, i.e., videos and audios. To build such a detector, we need a dataset that contains video and respective audio deepfakes. We were able to find a most recent deepfake dataset, Audio-Video Multimodal Deepfake Detection Dataset (FakeAVCeleb), that contains not only deepfake videos but synthesized fake audios as well. We used this multimodal deepfake dataset and performed detailed baseline experiments using state-of-the-art unimodal, ensemble-based, and multimodal detection methods to evaluate it. We conclude through detailed experimentation that unimodals, addressing only a single modality, video or audio, do not perform well compared to ensemble-based methods. Whereas purely multimodal-based baselines provide the worst performance.

【2】 Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention 标题:具有动态模态注意的传感器增强的以自我为中心的视频字幕 链接:https://arxiv.org/abs/2109.02955

作者:Katsuyuki Nakamura,Hiroki Ohashi,Mitsuhiro Okada 机构:Hitachi, Ltd., Tokyo, Japan 备注:Accepted to ACM Multimedia (ACMMM) 2021 摘要:自动描述视频,或视频字幕,在多媒体领域得到了广泛的研究。本文提出了一种新的传感器增强的自我中心视频字幕任务,一种新的为其构建的数据集称为MMAC字幕,以及一种有效利用视频和运动传感器或惯性测量单元(IMU)的多模态数据的新任务方法。由于固定摄像机的视野有限,传统的视频字幕任务难以处理人类活动的详细描述,而以自我为中心的视觉具有更大的潜力,可用于在更近的视野的基础上生成更细粒度的人类活动描述。此外,我们利用可穿戴传感器数据作为辅助信息,以缓解以自我为中心的视觉中的固有问题:运动模糊、自遮挡和超出摄像机范围的活动。我们提出了一种基于注意机制的有效利用传感器数据和视频数据的方法,该注意机制在考虑上下文信息的情况下动态确定需要更多注意的模态。我们将提出的传感器融合方法与MMAC字幕数据集上的强基线进行了比较,发现使用传感器数据作为自我中心视频数据的补充信息是有益的,并且我们提出的方法优于强基线,证明了所提方法的有效性。 摘要:Automatically describing video, or video captioning, has been widely studied in the multimedia field. This paper proposes a new task of sensor-augmented egocentric-video captioning, a newly constructed dataset for it called MMAC Captions, and a method for the newly proposed task that effectively utilizes multi-modal data of video and motion sensors, or inertial measurement units (IMUs). While conventional video captioning tasks have difficulty in dealing with detailed descriptions of human activities due to the limited view of a fixed camera, egocentric vision has greater potential to be used for generating the finer-grained descriptions of human activities on the basis of a much closer view. In addition, we utilize wearable-sensor data as auxiliary information to mitigate the inherent problems in egocentric vision: motion blur, self-occlusion, and out-of-camera-range activities. We propose a method for effectively utilizing the sensor data in combination with the video data on the basis of an attention mechanism that dynamically determines the modality that requires more attention, taking the contextual information into account. We compared the proposed sensor-fusion method with strong baselines on the MMAC Captions dataset and found that using sensor data as supplementary information to the egocentric-video data was beneficial, and that our proposed method outperformed the strong baselines, demonstrating the effectiveness of the proposed method.
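
摘要中"根据上下文动态决定哪个模态应获得更多注意力"的思想可以用一个门控融合模块来示意：由视频特征与IMU特征的联合上下文计算两路权重，再加权融合。以下草图的维度与结构均为假设，并非论文的官方实现。

```python
import torch
import torch.nn as nn

class DynamicModalAttention(nn.Module):
    def __init__(self, video_dim=512, imu_dim=128, dim=256):
        super().__init__()
        self.proj_v = nn.Linear(video_dim, dim)
        self.proj_s = nn.Linear(imu_dim, dim)
        self.gate = nn.Linear(dim * 2, 2)      # 由两个模态的联合上下文决定注意力权重

    def forward(self, video_feat, imu_feat):
        v, s = self.proj_v(video_feat), self.proj_s(imu_feat)                # (B, dim)
        alpha = torch.softmax(self.gate(torch.cat([v, s], dim=-1)), dim=-1)  # (B, 2)
        fused = alpha[:, :1] * v + alpha[:, 1:] * s                          # 加权融合后送入字幕解码器
        return fused, alpha

fused, alpha = DynamicModalAttention()(torch.randn(4, 512), torch.randn(4, 128))
print(fused.shape, alpha[0].tolist())
```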

【3】 Learning to Combine the Modalities of Language and Video for Temporal Moment Localization 标题:学会将语言和视频的模态结合起来进行时间瞬间本地化 链接:https://arxiv.org/abs/2109.02925

作者:Jungkyoo Shin,Jinyoung Moon 机构:ICT Major, ETRI School, University of Science and Technology, Daejeon, Republic of Korea , Gajeong-ro, Yuseong-gu, Daejeon, Republic of Korea, Artificial Intelligent Research Laboratory, Electronics and Telecommunications Research Institute 摘要:时序时刻定位(temporal moment localization)的目标是检索与查询所指定时刻最匹配的视频片段。现有方法往往独立生成视觉嵌入和语义嵌入后再进行融合，没有充分考虑二者之间的长程时序关系。为了解决这些不足，我们模仿人类定位时刻的认知过程(先关注视频片段中与查询相关的部分，再在整个视频中循环累积上下文信息)，引入了一种新的循环单元:跨模态长短时记忆(CM-LSTM)。此外，我们针对受查询关注(attended)和未受关注(unattended)的视频特征设计了双流注意力机制，以防止必要的视觉信息被忽略。为了获得更精确的边界，我们提出了双流注意力跨模态交互网络(TACI):它从CM-LSTM生成的整体上下文特征中全局地得到一张2D提议图，从边界得分序列中局部地得到另一张2D提议图，再以端到端的方式将二者组合成最终的2D图。在TML基准数据集ActivityNet-Captions上，TACI优于最先进的TML方法，在IoU@0.5和IoU@0.7下的R@1分别达到45.50%和27.23%。此外，我们还表明，用CM-LSTM替换原始LSTM后，改进后的各最先进方法都能获得性能提升。 摘要:Temporal moment localization aims to retrieve the best video segment matching a moment specified by a query. The existing methods generate the visual and semantic embeddings independently and fuse them without full consideration of the long-term temporal relationship between them. To address these shortcomings, we introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), by mimicking the human cognitive process of localizing temporal moments that focuses on the part of a video segment related to the part of a query, and accumulates the contextual information across the entire video recurrently. In addition, we devise a two-stream attention mechanism for both attended and unattended video features by the input query to prevent necessary visual information from being neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal interaction network (TACI) that generates two 2D proposal maps obtained globally from the integrated contextual features, which are generated by using CM-LSTM, and locally from boundary score sequences and then combines them into a final 2D map in an end-to-end manner. On the TML benchmark dataset, ActivityNet-Captions, the TACI outperforms state-of-the-art TML methods with R@1 of 45.50% and 27.23% for IoU@0.5 and IoU@0.7, respectively. In addition, we show that the revised state-of-the-art methods by replacing the original LSTM with our CM-LSTM achieve performance gains.

【4】 STRIVE: Scene Text Replacement In Videos 标题:奋力:视频中的场景文字替换 链接:https://arxiv.org/abs/2109.02762

作者:Vijay Kumar B G,Jeyasri Subramanian,Varnith Chordia,Eugene Bart,Shaobo Fang,Kelly Guan,Raja Bala 机构:NEC Laboratories, America, Palo Alto Research Center , Amazon, Work done at PARC, Stanford University 备注:ICCV 2021, Project Page: this https URL 摘要:我们建议使用深度样式转换和学习的光度变换替换视频中的场景文本。在静态图像文本替换的最新进展的基础上,我们提出了在保留原始视频的外观和运动特征的同时更改文本的扩展。与静态图像文本替换问题相比,我们的方法解决了视频带来的其他挑战,即改变照明、运动模糊、相机对象姿势随时间的变化以及保持时间一致性引起的效果。我们将问题分为三个步骤。首先,使用时空变换器网络将所有帧中的文本归一化为正面姿势。其次,使用最先进的静止图像文本替换方法在单个参考帧中替换文本。最后,使用一种新颖的学习图像变换网络将新文本从参考帧转移到剩余帧,该网络以时间一致的方式捕获照明和模糊效果。合成视频和具有挑战性的真实视频的结果显示了真实的文本传输、有竞争力的定量和定性性能,以及相对于备选方案的优越推理速度。我们介绍了新的合成和现实世界的数据集与配对的文本对象。据我们所知,这是第一次尝试深度视频文本替换。 摘要:We propose replacing scene text in videos using deep style transfer and learned photometric transformations.Building on recent progress on still image text replacement,we present extensions that alter text while preserving the appearance and motion characteristics of the original video.Compared to the problem of still image text replacement,our method addresses additional challenges introduced by video, namely effects induced by changing lighting, motion blur, diverse variations in camera-object pose over time,and preservation of temporal consistency. We parse the problem into three steps. First, the text in all frames is normalized to a frontal pose using a spatio-temporal trans-former network. Second, the text is replaced in a single reference frame using a state-of-art still-image text replacement method. Finally, the new text is transferred from the reference to remaining frames using a novel learned image transformation network that captures lighting and blur effects in a temporally consistent manner. Results on synthetic and challenging real videos show realistic text trans-fer, competitive quantitative and qualitative performance,and superior inference speed relative to alternatives. We introduce new synthetic and real-world datasets with paired text objects. To the best of our knowledge this is the first attempt at deep video text replacement.

【5】 WhyAct: Identifying Action Reasons in Lifestyle Vlogs 标题:WhyAct:在生活方式Vlog中识别行动原因 链接:https://arxiv.org/abs/2109.02747

作者:Oana Ignat,Santiago Castro,Hanwen Miao,Weiji Li,Rada Mihalcea 机构:University of Michigan 备注:Accepted at EMNLP 2021 摘要:我们的目标是在在线视频中自动识别人类行为的原因。我们关注广泛存在的生活方式虚拟博客类型,在这种类型中,人们在口头描述自己的行为的同时,也在表演动作。我们引入并公开了WhyAct数据集,该数据集由1077个手动注释其原因的视觉动作组成。我们描述了一个多模态模型,该模型利用视觉和文本信息自动推断与视频中呈现的动作对应的原因。 摘要:We aim to automatically identify human action reasons in online videos. We focus on the widespread genre of lifestyle vlogs, in which people perform actions while verbally describing them. We introduce and make publicly available the WhyAct dataset, consisting of 1,077 visual actions manually annotated with their reasons. We describe a multimodal model that leverages visual and textual information to automatically infer the reasons corresponding to an action presented in the video.

【6】 Improving Phenotype Prediction using Long-Range Spatio-Temporal Dynamics of Functional Connectivity 标题:利用功能连接性的长程时空动力学改进表型预测 链接:https://arxiv.org/abs/2109.03115

作者:Simon Dahan,Logan Z. J. Williams,Daniel Rueckert,Emma C. Robinson 机构:Robinson, Department of Biomedical Engineering, School of Biomedical Engineering and, Imaging Sciences, King’s College London, London, SE,EH, UK, Centre for the Developing Brain, Department of Perinatal Imaging and Health 备注:MLCN 2021 摘要:脑功能连通性(FC)的研究对于理解许多精神疾病的潜在机制非常重要。最近的许多分析采用图卷积网络来研究功能相关态之间的非线性相互作用。然而,尽管已知大脑激活模式在空间和时间上都是分层组织的,但许多方法都未能提取出强大的时空特征。为了克服这些挑战,并提高对长期功能动力学的理解,我们从基于骨架的动作识别领域提出了一种方法,旨在对跨空间和时间的交互进行建模。我们使用人类连接组项目(HCP)数据集对该方法进行性别分类和流体智能预测。为了解释功能组织的受试者地形变异性,我们使用多分辨率双回归(受试者特定)ICA节点对功能连接体进行建模。结果显示,性别分类的预测准确率为94.4%(与其他方法相比增加了6.2%),与流体智能的相关性相对于分别编码空间和时间的基线模型提高了0.325 vs 0.144。结果表明,大脑功能活动时空动态的显式编码可以提高未来预测行为和认知表型的准确性。 摘要:The study of functional brain connectivity (FC) is important for understanding the underlying mechanisms of many psychiatric disorders. Many recent analyses adopt graph convolutional networks, to study non-linear interactions between functionally-correlated states. However, although patterns of brain activation are known to be hierarchically organised in both space and time, many methods have failed to extract powerful spatio-temporal features. To overcome those challenges, and improve understanding of long-range functional dynamics, we translate an approach, from the domain of skeleton-based action recognition, designed to model interactions across space and time. We evaluate this approach using the Human Connectome Project (HCP) dataset on sex classification and fluid intelligence prediction. To account for subject topographic variability of functional organisation, we modelled functional connectomes using multi-resolution dual-regressed (subject-specific) ICA nodes. Results show a prediction accuracy of 94.4% for sex classification (an increase of 6.2% compared to other methods), and an improvement of correlation with fluid intelligence of 0.325 vs 0.144, relative to a baseline model that encodes space and time separately. Results suggest that explicit encoding of spatio-temporal dynamics of brain functional activity may improve the precision with which behavioural and cognitive phenotypes may be predicted in the future.

【7】 Perceptual Video Compression with Recurrent Conditional GAN 标题:基于递归条件GAN的感知视频压缩 链接:https://arxiv.org/abs/2109.03082

作者:Ren Yang,Luc Van Gool,Radu Timofte 机构:Computer Vision Laboratory, D-ITET, ETH Zurich, Switzerland 摘要:本文提出了一种基于循环条件生成对抗网络的感知学习视频压缩(PLVC)方法。在我们的方法中,基于循环自编码器的生成器学习充分挖掘视频压缩中的时间相关性。更重要的是,我们提出了一种循环条件判别器,它以空间和时间信息(包括潜在表示、时间运动以及循环单元中的隐藏状态)为条件,对原始视频和压缩视频进行判别。这样,在对抗训练中,它促使生成的视频不仅在空间上具有照片级真实感,而且在时间上与真值(ground truth)保持一致,并在视频帧之间保持连贯。因此,所提出的PLVC模型学会在低比特率下将视频压缩到良好的感知质量。实验结果表明,我们的PLVC方法在若干感知质量指标上优于以往的传统方法和学习式方法。用户研究进一步验证了PLVC相对于最新的学习式视频压缩方法和官方HEVC测试模型(HM 16.20)的出色感知性能。代码将在 https://github.com/RenYang-home/PLVC 发布。 摘要:This paper proposes a Perceptual Learned Video Compression (PLVC) approach with recurrent conditional generative adversarial network. In our approach, the recurrent auto-encoder-based generator learns to fully explore the temporal correlation for compressing video. More importantly, we propose a recurrent conditional discriminator, which judges raw and compressed video conditioned on both spatial and temporal information, including the latent representation, temporal motion and hidden states in recurrent cells. This way, in the adversarial training, it pushes the generated video to be not only spatially photo-realistic but also temporally consistent with groundtruth and coherent among video frames. Therefore, the proposed PLVC model learns to compress video towards good perceptual quality at low bit-rate. The experimental results show that our PLVC approach outperforms the previous traditional and learned approaches on several perceptual quality metrics. The user study further validates the outstanding perceptual performance of PLVC in comparison with the latest learned video compression approaches and the official HEVC test model (HM 16.20). The codes will be released at https://github.com/RenYang-home/PLVC.

【8】 Crash Report Data Analysis for Creating Scenario-Wise, Spatio-Temporal Attention Guidance to Support Computer Vision-based Perception of Fatal Crash Risks 标题:事故报告数据分析,用于创建情景、时空注意指南,以支持基于计算机视觉的致命事故风险感知 链接:https://arxiv.org/abs/2109.02710

作者:Yu Li,Muhammad Monjurul Karim,Ruwen Qin 机构: Zeyi Sun 1AbstractReducing traffic fatalities and serious injuries is a top priority of the US Department of Transportation 备注:None 摘要:减少交通事故和重伤是美国交通部的首要任务。在接近碰撞阶段,基于计算机视觉(CV)的碰撞预测正受到越来越多的关注。提前感知致命碰撞风险的能力也至关重要,因为这将提高碰撞预测的可靠性。然而,用于训练可靠的人工智能模型以进行碰撞风险早期视觉感知的注释图像数据并不丰富。致命事故分析报告系统包含致命事故的大数据。它是学习驾驶场景特征和致命碰撞之间关系的可靠数据源,以弥补CV的限制。因此,本文从致命事故报告数据中开发了一个数据分析模型,名为场景感知时空注意引导,该模型可以根据检测到的对象的环境和上下文信息估计检测到的对象与致命事故的相关性。首先,本文确定了五个稀疏变量,这些变量允许分解5年致命车祸数据集,以制定情景注意指导。然后,对碰撞报告数据的位置和时间相关变量进行探索性分析,建议将致命碰撞减少到空间定义的组。该组的时间模式是该组致命交通事故相似性的指标。层次聚类和K-means聚类根据时间模式的相似性将空间定义的组合并为六个聚类。然后,关联规则挖掘发现每个聚类中驾驶场景的时间信息与碰撞特征之间的统计关系。本文展示了开发的注意指南如何支持初步CV模型的设计和实施,该模型可以从环境和上下文信息中识别可能涉及致命碰撞的对象。 摘要:Reducing traffic fatalities and serious injuries is a top priority of the US Department of Transportation. The computer vision (CV)-based crash anticipation in the near-crash phase is receiving growing attention. The ability to perceive fatal crash risks earlier is also critical because it will improve the reliability of crash anticipation. Yet, annotated image data for training a reliable AI model for the early visual perception of crash risks are not abundant. The Fatality Analysis Reporting System contains big data of fatal crashes. It is a reliable data source for learning the relationship between driving scene characteristics and fatal crashes to compensate for the limitation of CV. Therefore, this paper develops a data analytics model, named scenario-wise, Spatio-temporal attention guidance, from fatal crash report data, which can estimate the relevance of detected objects to fatal crashes from their environment and context information. First, the paper identifies five sparse variables that allow for decomposing the 5-year fatal crash dataset to develop scenario-wise attention guidance. Then, exploratory analysis of location- and time-related variables of the crash report data suggests reducing fatal crashes to spatially defined groups. The group's temporal pattern is an indicator of the similarity of fatal crashes in the group. Hierarchical clustering and K-means clustering merge the spatially defined groups into six clusters according to the similarity of their temporal patterns. After that, association rule mining discovers the statistical relationship between the temporal information of driving scenes with crash features, for each cluster. The paper shows how the developed attention guidance supports the design and implementation of a preliminary CV model that can identify objects of a possibility to involve in fatal crashes from their environment and context information.
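下面给出一个示意性代码(非论文原始分析流程),演示"按时间模式相似性将空间分组合并为若干聚类"这一步可以如何实现:假设每个空间分组已统计出一个按小时归一化的事故时间分布向量,再用层次聚类(或KMeans作为对照)合并为6个聚类。数据与参数均为假设。

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# 假设:每个空间分组的"时间模式"是 24 维向量(各小时致命事故占比),共 50 个分组
rng = np.random.default_rng(0)
patterns = rng.random((50, 24))
patterns /= patterns.sum(axis=1, keepdims=True)

# 层次聚类:按时间模式相似性把空间分组合并为 6 个聚类
hier_labels = AgglomerativeClustering(n_clusters=6).fit_predict(patterns)

# KMeans 作为对照
km_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(patterns)

for k in range(6):
    print(f"cluster {k}: {np.sum(hier_labels == k)} groups (hier), "
          f"{np.sum(km_labels == k)} groups (kmeans)")
```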

医学相关(2篇)

【1】 Single-Camera 3D Head Fitting for Mixed Reality Clinical Applications 标题:适用于混合现实临床应用的单摄像机3D头部拟合 链接:https://arxiv.org/abs/2109.02740

作者:Tejas Mane,Aylar Bayramova,Kostas Daniilidis,Philippos Mordohai,Elena Bernardis 机构:Dept. Dermatology, University of Pennsylvania, Philadelphia, PA , USA, Dept. Computer and Information Science, University of Pennsylvania, Philadelphia, PA , USA, Dept. Computer Science, Stevens Institute of Technology, Hoboken, NJ, USA 摘要:我们要解决的问题是:从单个移动摄像机拍摄的视频中估计人的头部形状(定义为完整头部表面的几何形状),并确定拟合得到的3D头部在所有视频帧中的对齐方式,而与人的姿态无关。三维头部重建通常侧重于完善面部的重建,而把头皮交给统计近似。我们的目标是重建每个人的头部模型,以支持未来的混合现实应用。为此,我们通过运动恢复结构(structure-from-motion)和多视图立体(multi-view stereo)恢复稠密三维重建和相机信息,然后在一个新的两阶段拟合过程中恢复三维头部形状:先在规范空间中将头部的三维可变形模型与稠密重建结果迭代拟合,再利用传统面部关键点和从头部分割掩码中提取的头皮特征,将其拟合到每个人的头部。我们的方法能够从不同的人、用不同智能手机、在从客厅到户外的各种环境中拍摄的视频里,为各种头部形状恢复一致的几何结构。 摘要:We address the problem of estimating the shape of a person's head, defined as the geometry of the complete head surface, from a video taken with a single moving camera, and determining the alignment of the fitted 3D head for all video frames, irrespective of the person's pose. 3D head reconstructions commonly tend to focus on perfecting the face reconstruction, leaving the scalp to a statistical approximation. Our goal is to reconstruct the head model of each person to enable future mixed reality applications. To do this, we recover a dense 3D reconstruction and camera information via structure-from-motion and multi-view stereo. These are then used in a new two-stage fitting process to recover the 3D head shape by iteratively fitting a 3D morphable model of the head with the dense reconstruction in canonical space and fitting it to each person's head, using both traditional facial landmarks and scalp features extracted from the head's segmentation mask. Our approach recovers consistent geometry for varying head shapes, from videos taken by different people, with different smartphones, and in a variety of environments from living rooms to outdoor spaces.

【2】 Analysis of MRI Biomarkers for Brain Cancer Survival Prediction 标题:MRI生物标志物在脑癌生存期预测中的应用分析 链接:https://arxiv.org/abs/2109.02785

作者:Subhashis Banerjee,Sushmita Mitra,Lawrence O. Hall 机构:Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India, Department of Computer Science and Engineering, University of South Florida, E. Fowler Ave., Tampa, FL, United States 摘要:从多模态MRI预测脑癌患者的总生存期(OS)是一个具有挑战性的研究领域。现有的大多数生存预测文献基于放射组学(Radiomic)特征,没有考虑非生物因素或患者的功能性神经状态。此外,如何选择合适的生存期截断值以及删失数据的存在也带来了进一步的问题。由于缺乏带标注的大型公开数据集,深度学习模型在OS预测中的应用同样受到限制。在此背景下,我们分析了两类新的神经影像特征家族(分别提取自脑区分割图谱和空间生境)以及经典的放射组学与几何特征的潜力,研究它们对总生存期分析的联合预测能力。我们提出了一种结合网格搜索的交叉验证策略,依据预测能力同时选择并评估最具预测性的特征子集。采用Cox比例风险(CoxPH)模型进行单变量特征选择,随后用三个多变量简约模型(即Coxnet、随机生存森林(RSF)和生存支持向量机(SSVM))预测患者特异的生存函数。本研究使用的脑癌MRI数据取自癌症影像档案(TCIA)提供的两个开放获取集合TCGA-GBM和TCGA-LGG,每位患者对应的生存数据下载自癌症基因组图谱(TCGA)。使用RSF和所选的最佳24个特征,获得了$0.82\pm0.10$的高交叉验证C-index分数。年龄被发现是最重要的生物学预测因子。从脑区分割、生境、放射组学和基于区域的特征组中分别选出了9、6、6和2个特征。 摘要:Prediction of Overall Survival (OS) of brain cancer patients from multi-modal MRI is a challenging field of research. Most of the existing literature on survival prediction is based on Radiomic features, which does not consider either non-biological factors or the functional neurological status of the patient(s). Besides, the selection of an appropriate cut-off for survival and the presence of censored data create further problems. Application of deep learning models for OS prediction is also limited due to the lack of large annotated publicly available datasets. In this scenario we analyse the potential of two novel neuroimaging feature families, extracted from brain parcellation atlases and spatial habitats, along with classical radiomic and geometric features; to study their combined predictive power for analysing overall survival. A cross validation strategy with grid search is proposed to simultaneously select and evaluate the most predictive feature subset based on its predictive power. A Cox Proportional Hazard (CoxPH) model is employed for univariate feature selection, followed by the prediction of patient-specific survival functions by three multivariate parsimonious models viz. Coxnet, Random survival forests (RSF) and Survival SVM (SSVM). The brain cancer MRI data used for this research was taken from two open-access collections TCGA-GBM and TCGA-LGG available from The Cancer Imaging Archive (TCIA). Corresponding survival data for each patient was downloaded from The Cancer Genome Atlas (TCGA). A high cross validation $C-index$ score of $0.82\pm.10$ was achieved using RSF with the best $24$ selected features. Age was found to be the most important biological predictor. There were $9$, $6$, $6$ and $2$ features selected from the parcellation, habitat, radiomic and region-based feature groups respectively.
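以下是一个示意性代码(非论文原始实现),说明"用Cox比例风险模型做单变量特征筛选"的常见做法:对每个特征分别拟合单变量CoxPH,按p值保留最显著的特征。这里使用lifelines库,数据为随机生成的假设示例。

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# 假设数据:100 名患者、5 个特征、生存时间 T 与事件指示 E
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"feat_{i}" for i in range(5)])
df["T"] = rng.exponential(scale=24, size=100)          # 生存时间(示例单位:月)
df["E"] = rng.integers(0, 2, size=100)                 # 1=事件发生, 0=删失

# 单变量 CoxPH 筛选:逐个特征拟合,记录 p 值
pvals = {}
for col in [c for c in df.columns if c.startswith("feat_")]:
    cph = CoxPHFitter()
    cph.fit(df[[col, "T", "E"]], duration_col="T", event_col="E")
    pvals[col] = float(cph.summary.loc[col, "p"])

selected = [c for c, p in sorted(pvals.items(), key=lambda kv: kv[1]) if p < 0.05]
print("univariate p-values:", pvals)
print("selected features:", selected)
```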

GAN|对抗|攻击|生成相关(5篇)

【1】 Unpaired Adversarial Learning for Single Image Deraining with Rain-Space Contrastive Constraints 标题:雨空间对比约束下单幅图像降维的非配对对抗性学习 链接:https://arxiv.org/abs/2109.02973

作者:Xiang Chen,Jinshan Pan,Kui Jiang,Yufeng Huang,Caihua Kong,Longgang Dai,Yufeng Li 机构: Shenyang Aerospace University, Nanjing University of Science and Technology, Wuhan University 备注:Submitted to AAAI 2022 摘要:基于非配对信息的深度学习单幅图像去雨(SID)具有极其重要的意义,因为依赖成对合成数据往往会限制方法在真实应用中的通用性和可扩展性。然而,我们注意到,在SID任务中直接使用非配对对抗学习和循环一致性约束,不足以学到从有雨输入到干净输出的潜在关系,因为有雨图像与无雨图像之间的领域知识是不对称的。为了解决这一局限,我们开发了一种有效的非配对SID方法,在GAN框架中以对比学习的方式挖掘非配对样本的互有属性,称为CDR-GAN。该方法主要由两个协作分支组成:双向翻译分支(BTB)和对比引导分支(CGB)。具体而言,BTB充分利用对抗一致性的循环结构来挖掘潜在特征分布,并借助双向映射引导两个域之间的迁移能力。同时,CGB通过鼓励相似的特征分布相互靠近、推远不相似的特征分布,隐式地约束不同样本在雨空间中的嵌入,从而更好地帮助去雨和图像恢复。在训练过程中,我们探索了若干损失函数来进一步约束所提出的CDR-GAN。大量实验表明,我们的方法在合成数据集和真实数据集上都优于现有的非配对去雨方法,甚至超过了若干全监督或半监督模型。 摘要:Deep learning-based single image deraining (SID) with unpaired information is of immense importance, as relying on paired synthetic data often limits their generality and scalability in real-world applications. However, we noticed that direct employ of unpaired adversarial learning and cycle-consistency constraints in the SID task is insufficient to learn the underlying relationship from rainy input to clean outputs, since the domain knowledge between rainy and rain-free images is asymmetrical. To address such limitation, we develop an effective unpaired SID method which explores mutual properties of the unpaired exemplars by a contrastive learning manner in a GAN framework, named as CDR-GAN. The proposed method mainly consists of two cooperative branches: Bidirectional Translation Branch (BTB) and Contrastive Guidance Branch (CGB). Specifically, BTB takes full advantage of the circulatory architecture of adversarial consistency to exploit latent feature distributions and guide transfer ability between two domains by equipping it with bidirectional mapping. Simultaneously, CGB implicitly constrains the embeddings of different exemplars in rain space by encouraging the similar feature distributions closer while pushing the dissimilar further away, in order to better help rain removal and image restoration. During training, we explore several loss functions to further constrain the proposed CDR-GAN. Extensive experiments show that our method performs favorably against existing unpaired deraining approaches on both synthetic and real-world datasets, even outperforms several fully-supervised or semi-supervised models.
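下面给出一个极简的对比损失示意代码(非论文官方的CGB实现),用来说明"让相似特征分布靠近、不相似的推远"这一约束的常见形式:对一批嵌入向量,以每个样本的另一视图为正样本、批内其余样本为负样本,计算InfoNCE式损失。温度等超参数均为假设。

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, temperature=0.1):
    """anchor/positive: (batch, dim),第 i 行互为正样本对;批内其余样本作为负样本。"""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                 # (batch, batch) 相似度矩阵
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# 用法示意:雨图经编码器得到的两组嵌入(例如同一样本的两个视图)
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2).item())
```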

【2】 CovarianceNet: Conditional Generative Model for Correct Covariance Prediction in Human Motion Prediction 标题:协方差网:人体运动预测中正确协方差预测的条件生成模型 链接:https://arxiv.org/abs/2109.02965

作者:Aleksey Postnikov,Aleksander Gamayunov,Gonzalo Ferrer 机构:Skolkovo Institute of Science and Technology 摘要:预测人体运动时,对不确定性的正确刻画与预测的准确性同等重要。我们提出了一种新方法,用于正确预测与所预测的未来轨迹分布相关的不确定性。我们的方法CovarianceNet基于带有高斯潜变量的条件生成模型,用于预测二元高斯分布的参数。将CovarianceNet与运动预测模型相结合,得到一种输出单峰分布的混合方法。我们将展示,按照我们提出的度量并在ETH数据集(Pellegrini et al., 2009)上验证,一些最先进的运动预测方法在预测不确定性时会变得过度自信。CovarianceNet能够正确预测不确定性,这使我们的方法适用于需要使用预测分布的应用,例如规划或决策。 摘要:The correct characterization of uncertainty when predicting human motion is equally important as the accuracy of this prediction. We present a new method to correctly predict the uncertainty associated with the predicted distribution of future trajectories. Our approach, CovarianceNet, is based on a Conditional Generative Model with Gaussian latent variables in order to predict the parameters of a bi-variate Gaussian distribution. The combination of CovarianceNet with a motion prediction model results in a hybrid approach that outputs a uni-modal distribution. We will show how some state of the art methods in motion prediction become overconfident when predicting uncertainty, according to our proposed metric and validated in the ETH data-set (Pellegrini et al., 2009). CovarianceNet correctly predicts uncertainty, which makes our method suitable for applications that use predicted distributions, e.g., planning or decision making.
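下面是一个示意性代码(非论文官方实现),展示"预测二元高斯分布参数并用负对数似然训练"的常见做法:输出头给出 (μx, μy, σx, σy, ρ) 五个参数,以二元高斯NLL作为损失。网络结构与维度均为假设。

```python
import torch
import torch.nn as nn

class BivariateGaussianHead(nn.Module):
    """由特征预测二元高斯参数 (mu_x, mu_y, sigma_x, sigma_y, rho) 的示意性输出头。"""
    def __init__(self, in_dim=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, 5)

    def forward(self, feat):
        mu_x, mu_y, log_sx, log_sy, rho_raw = self.fc(feat).unbind(-1)
        sx, sy = log_sx.exp(), log_sy.exp()          # 保证标准差为正
        rho = torch.tanh(rho_raw) * 0.999            # 保证 |rho| < 1
        return mu_x, mu_y, sx, sy, rho

def bivariate_nll(target_xy, mu_x, mu_y, sx, sy, rho):
    x, y = target_xy.unbind(-1)
    dx, dy = (x - mu_x) / sx, (y - mu_y) / sy
    one_m_rho2 = 1.0 - rho ** 2
    z = dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy
    nll = torch.log(2 * torch.pi * sx * sy * one_m_rho2.sqrt()) + z / (2.0 * one_m_rho2)
    return nll.mean()

# 用法示意
head = BivariateGaussianHead(in_dim=64)
feat = torch.randn(16, 64)                 # 轨迹编码特征(假设)
target = torch.randn(16, 2)                # 未来位置真值(假设)
loss = bivariate_nll(target, *head(feat))
print(loss.item())
```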

【3】 Brand Label Albedo Extraction of eCommerce Products using Generative Adversarial Network 标题:基于产生式对抗网络的电子商务产品品牌反照率提取 链接:https://arxiv.org/abs/2109.02929

作者:Suman Sapkota,Manish Juneja,Laurynas Keleras,Pranav Kotwal,Binod Bhattarai 机构:Zeg.AI Pvt. Ltd, London, UK 备注:5 pages, 5 figures 摘要:在本文中,我们提出了我们的解决方案,提取反照率的品牌标签的电子商务产品。为此,我们生成一个用于反照率提取的大规模照片真实合成数据集,然后训练生成模型,将具有不同照明条件的图像转换为反照率。我们进行了广泛的评估,以测试我们的方法在野生图像中的普遍性。从实验结果中,我们观察到,与现有方法相比,我们的解决方案在不可见的渲染图像和野生图像中都具有良好的通用性。 摘要:In this paper we present our solution to extract albedo of branded labels for e-commerce products. To this end, we generate a large-scale photo-realistic synthetic data set for albedo extraction followed by training a generative model to translate images with diverse lighting conditions to albedo. We performed an extensive evaluation to test the generalisation of our method to in-the-wild images. From the experimental results, we observe that our solution generalises well compared to the existing method both in the unseen rendered images as well as in the wild image.

【4】 Kinship Verification Based on Cross-Generation Feature Interaction Learning 标题:基于跨代特征交互学习的亲属关系验证 链接:https://arxiv.org/abs/2109.02809

作者:Guan-Nan Dong,Chi-Man Pun,Zheng Zhang 摘要:在许多潜在的计算机视觉应用中,人脸图像的亲缘关系验证已经被认为是一种新兴但具有挑战性的技术。在本文中,我们提出了一种新的跨代特征交互学习(CFIL)框架,用于鲁棒亲属关系验证。特别是,通过联合提取父母和子女图像对的特征,构建了一种有效的协作加权策略,以探索跨代关系的特征。具体来说,我们将父母和孩子作为一个整体来提取具有表现力的局部和非局部特征。与传统的基于距离的相似性度量不同,我们将相似性计算作为内部辅助权重插入到深度CNN架构中,以了解整体和自然特征。这些相似度权重不仅涉及对应的单点,还挖掘了交叉点之间的多个关系,利用这两种距离度量计算局部和非局部特征。重要的是,我们没有单独进行相似性计算和特征提取,而是将相似性学习和特征提取集成到一个统一的学习过程中。由局部特征和非局部特征推导出的综合表示能够综合表达图像中的信息语义,并保留图像对中丰富的相关知识。大量的实验证明了该模型的有效性和优越性。 摘要:Kinship verification from facial images has been recognized as an emerging yet challenging technique in many potential computer vision applications. In this paper, we propose a novel cross-generation feature interaction learning (CFIL) framework for robust kinship verification. Particularly, an effective collaborative weighting strategy is constructed to explore the characteristics of cross-generation relations by corporately extracting features of both parents and children image pairs. Specifically, we take parents and children as a whole to extract the expressive local and non-local features. Different from the traditional works measuring similarity by distance, we interpolate the similarity calculations as the interior auxiliary weights into the deep CNN architecture to learn the whole and natural features. These similarity weights not only involve corresponding single points but also excavate the multiple relationships cross points, where local and non-local features are calculated by using these two kinds of distance measurements. Importantly, instead of separately conducting similarity computation and feature extraction, we integrate similarity learning and feature extraction into one unified learning process. The integrated representations deduced from local and non-local features can comprehensively express the informative semantics embedded in images and preserve abundant correlation knowledge from image pairs. Extensive experiments demonstrate the efficiency and superiority of the proposed model compared to some state-of-the-art kinship verification methods.

【5】 Robustness and Generalization via Generative Adversarial Training 标题:通过生成性对抗性训练实现健壮性和泛化 链接:https://arxiv.org/abs/2109.02765

作者:Omid Poursaeed,Tianxing Jiang,Harry Yang,Serge Belongie,SerNam Lim 机构:Ser-Nam Lim, Cornell University, Cornell Tech, Facebook AI 备注:ICCV 2021. arXiv admin note: substantial text overlap with arXiv:1911.09058 摘要:虽然深度神经网络在各种计算机视觉任务中取得了显著的成功,但它们往往无法推广到新的领域和输入图像的细微变化。已经提出了几种防御措施来提高对这些变化的鲁棒性。然而,当前的防御系统只能抵御训练中使用的特定攻击,并且模型通常仍然容易受到其他输入变化的影响。此外,这些方法通常会降低模型在干净图像上的性能,并且不能推广到域外样本。在本文中,我们提出了生成性对抗训练,这是一种同时提高模型对测试集和域外样本的泛化能力以及对未知对抗攻击的鲁棒性的方法。我们不改变图像的低水平预定义方面,而是使用具有分离潜在空间的生成模型生成低水平、中水平和高水平变化的光谱。通过这些示例进行对抗性训练,可以通过观察训练期间的各种输入变化,使模型能够承受范围广泛的攻击。我们表明,我们的方法不仅提高了模型在干净图像和域外样本上的性能,而且使其对不可预见的攻击具有鲁棒性,并且优于以前的工作。我们通过在分类、分割和目标检测等各种任务上展示结果,验证了我们方法的有效性。 摘要:While deep neural networks have achieved remarkable success in various computer vision tasks, they often fail to generalize to new domains and subtle variations of input images. Several defenses have been proposed to improve the robustness against these variations. However, current defenses can only withstand the specific attack used in training, and the models often remain vulnerable to other input variations. Moreover, these methods often degrade performance of the model on clean images and do not generalize to out-of-domain samples. In this paper we present Generative Adversarial Training, an approach to simultaneously improve the model's generalization to the test set and out-of-domain samples as well as its robustness to unseen adversarial attacks. Instead of altering a low-level pre-defined aspect of images, we generate a spectrum of low-level, mid-level and high-level changes using generative models with a disentangled latent space. Adversarial training with these examples enable the model to withstand a wide range of attacks by observing a variety of input alterations during training. We show that our approach not only improves performance of the model on clean images and out-of-domain samples but also makes it robust against unforeseen attacks and outperforms prior work. We validate effectiveness of our method by demonstrating results on various tasks such as classification, segmentation and object detection.

自动驾驶|车辆|车道检测等(1篇)

【1】 Smart Traffic Monitoring System using Computer Vision and Edge Computing 标题:基于计算机视觉和边缘计算的智能交通监控系统 链接:https://arxiv.org/abs/2109.03141

作者:Guanxiong Liu,Hang Shi,Abbas Kiani,Abdallah Khreishah,Jo Young Lee,Nirwan Ansari,Chengjun Liu,Mustafa Yousef 机构: Hang Shi and ChengjunLiu are with the Department of Computer Science at New Jersey Instituteof Technology 摘要:交通管理系统捕获大量视频数据,并利用视频处理技术的进步来检测和监控交通事故。收集的数据传统上被转发到流量管理中心(TMC)进行深入分析,因此可能会加剧到TMC的网络路径。为了缓解这些瓶颈,我们建议通过为靠近摄像机的边缘节点配备计算资源(例如Cloudlet)来利用边缘计算。与TMC相比,计算资源有限的cloudlet提供有限的视频处理能力。在本文中,我们重点研究了两种常见的流量监测任务,拥塞检测和速度检测,并提出了一种基于两层边缘计算的模型,该模型考虑了Cloudlet中有限的计算能力和TMC的不稳定网络条件。我们的解决方案针对每个任务使用两种算法,一种在边缘实现,另一种在TMC实现,这两种算法的设计考虑了不同的计算资源。虽然TMC提供强大的计算能力,但它接收的视频质量取决于底层网络条件。另一方面,edge处理非常高质量的视频,但计算资源有限。我们的模型捕捉到了这种权衡。通过在不同天气和网络条件下的试验台实验,我们评估了所提出的两层模型和流量监控算法的性能,并表明我们提出的混合边缘云解决方案优于纯云和纯边缘解决方案。 摘要:Traffic management systems capture tremendous video data and leverage advances in video processing to detect and monitor traffic incidents. The collected data are traditionally forwarded to the traffic management center (TMC) for in-depth analysis and may thus exacerbate the network paths to the TMC. To alleviate such bottlenecks, we propose to utilize edge computing by equipping edge nodes that are close to cameras with computing resources (e.g. cloudlets). A cloudlet, with limited computing resources as compared to TMC, provides limited video processing capabilities. In this paper, we focus on two common traffic monitoring tasks, congestion detection, and speed detection, and propose a two-tier edge computing based model that takes into account of both the limited computing capability in cloudlets and the unstable network condition to the TMC. Our solution utilizes two algorithms for each task, one implemented at the edge and the other one at the TMC, which are designed with the consideration of different computing resources. While the TMC provides strong computation power, the video quality it receives depends on the underlying network conditions. On the other hand, the edge processes very high-quality video but with limited computing resources. Our model captures this trade-off. We evaluate the performance of the proposed two-tier model as well as the traffic monitoring algorithms via test-bed experiments under different weather as well as network conditions and show that our proposed hybrid edge-cloud solution outperforms both the cloud-only and edge-only solutions.

OCR|文本相关(1篇)

【1】 PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System 标题:PP-OCRv2:面向超轻量级OCR系统的实用技巧集 链接:https://arxiv.org/abs/2109.03144

作者:Yuning Du,Chenxia Li,Ruoyu Guo,Cheng Cui,Weiwei Liu,Jun Zhou,Bin Lu,Yehua Yang,Qiwen Liu,Xiaoguang Hu,Dianhai Yu,Yanjun Ma 机构:Baidu Inc. 备注:8 pages, 9 figures, 5 tables 摘要:光学字符识别(OCR)系统已广泛应用于各种应用场景,但设计OCR系统仍然是一项具有挑战性的任务。在之前的工作中,我们提出了一种实用的超轻量级OCR系统(PP-OCR),以在精度与效率之间取得平衡。为了在保持高效率的同时提高PP-OCR的精度,本文提出了一个更加鲁棒的OCR系统,即PP-OCRv2。我们引入了一系列技巧来训练更好的文本检测器和文本识别器,包括协同互学习(CML)、CopyPaste、轻量级CPU网络(LCNet)、统一深度互学习(U-DML)和增强CTCLoss。真实数据上的实验表明,在相同的推理代价下,PP-OCRv2的精度比PP-OCR高7%,并可与以ResNet系列为骨干网络的PP-OCR服务器版模型相媲美。上述所有模型均已开源,代码可在基于PaddlePaddle的GitHub仓库PaddleOCR中获取。 摘要:Optical Character Recognition (OCR) systems have been widely used in various of application scenarios. Designing an OCR system is still a challenging task. In previous work, we proposed a practical ultra lightweight OCR system (PP-OCR) to balance the accuracy against the efficiency. In order to improve the accuracy of PP-OCR and keep high efficiency, in this paper, we propose a more robust OCR system, i.e. PP-OCRv2. We introduce bag of tricks to train a better text detector and a better text recognizer, which include Collaborative Mutual Learning (CML), CopyPaste, Lightweight CPUNetwork (LCNet), Unified-Deep Mutual Learning (U-DML) and Enhanced CTCLoss. Experiments on real data show that the precision of PP-OCRv2 is 7% higher than PP-OCR under the same inference cost. It is also comparable to the server models of the PP-OCR which uses ResNet series as backbones. All of the above mentioned models are open-sourced and the code is available in the GitHub repository PaddleOCR which is powered by PaddlePaddle.
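作为参考,下面是调用PaddleOCR库进行文字检测与识别的一个典型用法示意(依据该库公开文档中的常见接口;具体参数名与返回结构随版本略有差异,请以所装版本的文档为准,此处仅作示意):

```python
# 安装(示意): pip install paddlepaddle paddleocr
from paddleocr import PaddleOCR

# lang="ch" 同时支持中英文;use_angle_cls 启用文本方向分类器(参数名以实际版本文档为准)
ocr = PaddleOCR(use_angle_cls=True, lang="ch")
result = ocr.ocr("demo.jpg", cls=True)

# 每个结果项大致为 [文本框四点坐标, (识别文本, 置信度)]
for line in result[0]:
    box, (text, score) = line
    print(text, round(score, 3))
```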

视觉解释|视频理解VQA|caption等(1篇)

【1】 Journalistic Guidelines Aware News Image Captioning 标题:新闻准则意识到的新闻图像字幕 链接:https://arxiv.org/abs/2109.02865

作者:Xuewen Yang,Svebor Karaman,Joel Tetreault,Alex Jaimes 机构:Stony Brook University, Dataminr Inc. 备注:None 摘要:新闻文章图像字幕的任务是为新闻文章图像生成描述性和信息性的字幕。与传统的简单描述图像内容的图像标题不同,新闻图像标题遵循新闻准则,严重依赖命名实体来描述图像内容,通常从与之相关的整篇文章中提取上下文。在这项工作中,我们提出了一种新的方法来完成这项任务,其动机是由记者遵循的字幕指导原则决定的。我们的方法,即新闻指南感知新闻图像字幕(JoGANIC),利用字幕的结构来提高生成质量并指导我们的表现设计。在两个大规模公开数据集上的实验结果(包括详细的消融研究)表明,JoGANIC在字幕生成和命名实体相关度量方面都明显优于最先进的方法。 摘要:The task of news article image captioning aims to generate descriptive and informative captions for news article images. Unlike conventional image captions that simply describe the content of the image in general terms, news image captions follow journalistic guidelines and rely heavily on named entities to describe the image content, often drawing context from the whole article they are associated with. In this work, we propose a new approach to this task, motivated by caption guidelines that journalists follow. Our approach, Journalistic Guidelines Aware News Image Captioning (JoGANIC), leverages the structure of captions to improve the generation quality and guide our representation design. Experimental results, including detailed ablation studies, on two large-scale publicly available datasets show that JoGANIC substantially outperforms state-of-the-art methods both on caption generation and named entity related metrics.

点云|SLAM|雷达|激光|深度RGBD相关(2篇)

【1】 Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds 标题:双耳声网:利用双耳声音预测语义、深度和运动 链接:https://arxiv.org/abs/2109.02763

作者:Dengxin Dai,Arun Balajee Vasudevan,Jiri Matas,Luc Van Gool 机构:! 备注:Journal extension of our ECCV'20 Paper -- 15 pages. arXiv admin note: substantial text overlap with arXiv:2003.04210 摘要:人类可以通过视觉和/或听觉线索来稳健地识别和定位物体。虽然机器已经能够对视觉数据进行同样的处理,但对声音的处理却很少。这项工作开发了一种完全基于双耳声音的场景理解方法。所考虑的任务包括预测发声对象的语义掩码、发声对象的运动以及场景的深度图。为此,我们提出了一种新的传感器设置,并使用八个专业的双耳麦克风和一个360度摄像机记录了一个新的街景视听数据集。视觉和听觉线索的共存被用于监督转移。特别是,我们采用了一个跨模式蒸馏框架,该框架由多个视觉教师方法和一个良好的学生方法组成——学生方法经过训练,可以产生与教师方法相同的结果。这样,听觉系统就可以在不使用人类注释的情况下进行训练。为了进一步提高性能,我们提出了另一个新的辅助任务,即创造空间声音超分辨率,以提高声音的方向分辨率。然后,我们将这四个任务组成一个端到端可训练的多任务网络,以提高整体性能。实验结果表明:1)我们的方法在所有四项任务中都取得了良好的效果;2)这四项任务是互利的——共同训练它们可以获得最佳性能;3)麦克风的数量和方向都很重要,4)从标准频谱图学习的特征和通过经典信号处理管道获得的特征对于听觉感知任务是互补的。数据和代码已发布。 摘要:Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene understanding purely based on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this aim, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision teacher methods and a sound student method -- the student method is trained to generate the same results as the teacher methods do. This way, the auditory system can be trained without using human annotations. To further boost the performance, we propose another novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results show that 1) our method achieves good results for all four tasks, 2) the four tasks are mutually beneficial -- training them together achieves the best performance, 3) the number and orientation of microphones are both important, and 4) features learned from the standard spectrogram and features obtained by the classic signal processing pipeline are complementary for auditory perception tasks. The data and code are released.
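下面用一小段示意性代码(非论文官方实现,做了简化,仅用单个教师)说明跨模态蒸馏的基本形式:视觉教师网络的输出作为监督信号(不回传梯度),让声音学生网络的输出去拟合它。网络与维度均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(512, 128))   # 视觉教师(假设已预训练并冻结)
student = nn.Sequential(nn.Linear(256, 128))   # 双耳声音学生

visual_feat = torch.randn(8, 512)   # 来自视觉模型的特征(假设)
audio_feat = torch.randn(8, 256)    # 来自双耳声音编码器的特征(假设)

with torch.no_grad():
    target = teacher(visual_feat)             # 教师输出不参与梯度

pred = student(audio_feat)
distill_loss = F.mse_loss(pred, target)       # 学生拟合教师输出,无需人工标注
distill_loss.backward()
print(distill_loss.item())
```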

【2】 Pano3D: A Holistic Benchmark and a Solid Baseline for $360^o$ Depth Estimation 标题:Pano3D:360°深度估计的整体基准和可靠基线 链接:https://arxiv.org/abs/2109.02749

作者:Georgios Albanis,Nikolaos Zioulis,Petros Drakoulis,Vasileios Gkitsas,Vladimiros Sterzentsenko,Federico Alvarez,Dimitrios Zarpalas,Petros Daras 机构:Centre for Research and Technology Hellas, Thessaloniki, Greece, Universidad Politécnica de Madrid, Madrid, Spain 备注:Presented at the OmniCV CVPR 2021 workshop. Code, models, data and demo at this https URL 摘要:Pano3D是一个新的基于球面全景图的深度估计基准。它旨在评估深度估计的各方面特性:主要特性是以精确度和准确度为目标的直接深度估计性能,次要特性包括边界保持与平滑度。此外,Pano3D从典型的数据集内评估走向数据集间性能评估:通过把对未见数据的泛化能力拆分为不同的测试划分,Pano3D构成了一个全面的360°深度估计基准。我们以它为基础进行扩展分析,以期为深度估计中的经典设计选择提供见解,并由此得到一个坚实的全景深度估计基线,供后续工作在其之上推动未来进展。 摘要:Pano3D is a new benchmark for depth estimation from spherical panoramas. It aims to assess performance across all depth estimation traits, the primary direct depth estimation performance targeting precision and accuracy, and also the secondary traits, boundary preservation, and smoothness. Moreover, Pano3D moves beyond typical intra-dataset evaluation to inter-dataset performance assessment. By disentangling the capacity to generalize to unseen data into different test splits, Pano3D represents a holistic benchmark for $360^o$ depth estimation. We use it as a basis for an extended analysis seeking to offer insights into classical choices for depth estimation. This results in a solid baseline for panoramic depth that follow-up works can build upon to steer future progress.
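作为补充,下面给出深度估计中常用的几个精度指标(绝对相对误差、RMSE、δ阈值准确率)的示意性计算代码;这些是社区通用定义,并非该基准的官方评测脚本:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """pred/gt: 同尺寸深度图(仅统计 gt>0 的有效像素)。返回常用指标字典。"""
    mask = gt > 0
    pred, gt = pred[mask].clip(min=eps), gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    delta2 = np.mean(ratio < 1.25 ** 2)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1, "delta2": delta2}

# 随机数据示意
pred = np.random.uniform(0.5, 8.0, size=(64, 128))
gt = np.random.uniform(0.5, 8.0, size=(64, 128))
print(depth_metrics(pred, gt))
```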

其他神经网络|深度学习|模型|建模(2篇)

【1】 Learning Fast Sample Re-weighting Without Reward Data 标题:无需奖励数据的快速样本重加权学习 链接:https://arxiv.org/abs/2109.03216

作者:Zizhao Zhang,Tomas Pfister 机构:Google Cloud AI 备注:ICCV2021 摘要:训练样本重新加权是解决数据偏差(如标签不平衡和损坏)的有效方法。最近的方法基于强化学习和元学习的框架,开发了基于学习的算法,结合模型训练来学习样本重新加权策略。然而,依赖于额外的无偏奖励数据限制了它们的普遍适用性。此外,现有的基于学习的样本重加权方法需要对模型和加权参数进行嵌套优化,这需要昂贵的二阶计算。本文针对这两个问题,提出了一种新的基于学习的快速样本重加权(FSR)方法,该方法不需要额外的奖励数据。该方法基于两个关键思想:从历史中学习建立代理奖励数据和特征共享以降低优化成本。我们的实验表明,与现有的标签噪声鲁棒性和长尾识别技术相比,该方法取得了具有竞争力的结果,同时显著提高了训练效率。源代码可在https://github.com/google-research/google-research/tree/master/ieg. 摘要:Training sample re-weighting is an effective approach for tackling data biases such as imbalanced and corrupted labels. Recent methods develop learning-based algorithms to learn sample re-weighting strategies jointly with model training based on the frameworks of reinforcement learning and meta learning. However, depending on additional unbiased reward data is limiting their general applicability. Furthermore, existing learning-based sample re-weighting methods require nested optimizations of models and weighting parameters, which requires expensive second-order computation. This paper addresses these two problems and presents a novel learning-based fast sample re-weighting (FSR) method that does not require additional reward data. The method is based on two key ideas: learning from history to build proxy reward data and feature sharing to reduce the optimization cost. Our experiments show the proposed method achieves competitive results compared to state of the arts on label noise robustness and long-tailed recognition, and does so while achieving significantly improved training efficiency. The source code is publicly available at https://github.com/google-research/google-research/tree/master/ieg.

【2】 Training Deep Networks from Zero to Hero: avoiding pitfalls and going beyond 标题:从零到英雄训练深度网络:避开陷阱并更进一步 链接:https://arxiv.org/abs/2109.02752

作者:Moacir Antonelli Ponti,Fernando Pereira dos Santos,Leo Sampaio Ferraz Ribeiro,Gabriel Biscaro Cavallari 机构:ICMC – Universidade de São Paulo (USP) 备注:9 pgs 摘要:在真实世界数据上训练深度神经网络可能颇具挑战。对于小数据集或特定应用,即便借助迁移学习,把模型当作黑盒使用也可能导致泛化能力差或结果不具结论性。本教程介绍了改进模型的基本步骤和较新的可选方案,主要但不限于监督学习。对于不如竞赛数据集那样准备充分的数据集,以及标注稀缺和/或数据量小的情形,本教程尤其有用。我们介绍了基本流程(数据准备、优化与迁移学习),以及较新的架构选择,例如Transformer模块、替代性卷积层、激活函数、宽网络与深网络,还有课程学习、对比学习和自监督学习等训练方法。 摘要:Training deep neural networks may be challenging in real world data. Using models as black-boxes, even with transfer learning, can result in poor generalization or inconclusive results when it comes to small datasets or specific applications. This tutorial covers the basic steps as well as more recent options to improve models, in particular, but not restricted to, supervised learning. It can be particularly useful in datasets that are not as well-prepared as those in challenges, and also under scarce annotation and/or small data. We describe basic procedures: as data preparation, optimization and transfer learning, but also recent architectural choices such as use of transformer modules, alternative convolutional layers, activation functions, wide and deep networks, as well as training procedures including as curriculum, contrastive and self-supervised learning.

其他(8篇)

【1】 Fair Comparison: Quantifying Variance in Resultsfor Fine-grained Visual Categorization 标题:公平比较:细粒度视觉分类结果方差的量化 链接:https://arxiv.org/abs/2109.03156

作者:Matthew Gwilliam,Adam Teuscher,Connor Anderson,Ryan Farrell 机构:Brigham Young University,University of Maryland 备注:None 摘要:为了完成图像分类的任务,研究人员努力开发下一代最先进的(SOTA)模型,每个模型都将自己的性能与前人和同行的性能进行对比。不幸的是,最常用于描述模型性能的度量标准,即平均分类准确率,通常是单独使用的。随着类数量的增加,例如在细粒度视觉分类(FGVC)中,仅通过平均准确度传递的信息量就会减少。虽然其最明显的缺点是未能按类描述模型的性能,但平均准确度也未能描述从同一体系结构、同一数据集上的一个经过训练的模型到另一个模型(所有类别和每类级别的平均)的性能变化。我们首先根据数据的属性证明了这些变化在不同模型和不同类别分布之间的幅度,比较了不同视觉域和不同类别图像分布(包括长尾分布和少数镜头子集)上的结果。然后,我们分析了各种FGVC方法对总体方差和每类方差的影响。从这一分析中,我们都强调了基于超出总体准确性的信息报告和比较方法的重要性,并指出了减少FGVC结果差异的技术。 摘要:For the task of image classification, researchers work arduously to develop the next state-of-the-art (SOTA) model, each bench-marking their own performance against that of their predecessors and of their peers. Unfortunately, the metric used most frequently to describe a model's performance, average categorization accuracy, is often used in isolation. As the number of classes increases, such as in fine-grained visual categorization (FGVC), the amount of information conveyed by average accuracy alone dwindles. While its most glaring weakness is its failure to describe the model's performance on a class-by-class basis, average accuracy also fails to describe how performance may vary from one trained model of the same architecture, on the same dataset, to another (both averaged across all categories and at the per-class level). We first demonstrate the magnitude of these variations across models and across class distributions based on attributes of the data, comparing results on different visual domains and different per-class image distributions, including long-tailed distributions and few-shot subsets. We then analyze the impact various FGVC methods have on overall and per-class variance. From this analysis, we both highlight the importance of reporting and comparing methods based on information beyond overall accuracy, as well as point out techniques that mitigate variance in FGVC results.
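下面是一段示意性代码,演示论文所强调的"除总体准确率外,还应报告逐类准确率及其在多次训练之间的方差"可以如何统计(预测结果为随机生成的假设数据,非论文实验):

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """逐类准确率;某类无样本时记为 NaN。"""
    return np.array([
        np.mean(y_pred[y_true == c] == c) if np.any(y_true == c) else np.nan
        for c in range(num_classes)
    ])

num_classes, runs = 10, 5
rng = np.random.default_rng(0)
y_true = rng.integers(0, num_classes, size=1000)

# 假设同一架构训练了 5 次,得到 5 份预测
per_run = []
for _ in range(runs):
    y_pred = np.where(rng.random(1000) < 0.7, y_true, rng.integers(0, num_classes, 1000))
    per_run.append(per_class_accuracy(y_true, y_pred, num_classes))
per_run = np.stack(per_run)                       # (runs, num_classes)

print("overall acc per run:", per_run.mean(axis=1).round(3))
print("per-class std across runs:", per_run.std(axis=0).round(3))
```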

【2】 Efficient ADMM-based Algorithms for Convolutional Sparse Coding 标题:基于ADMM的高效卷积稀疏编码算法 链接:https://arxiv.org/abs/2109.02969

作者:Farshad G. Veshki,Sergiy A. Vorobyov 机构:Department of Signal Processing and Acoustics, Aalto University 摘要:卷积稀疏编码通过引入全局平移不变模型,改进了标准稀疏近似。最有效的卷积稀疏编码方法基于交替方向乘子法(ADMM)和卷积定理。这些方法之间唯一的主要区别在于如何处理卷积最小二乘拟合子问题。本文给出了该子问题的一种求解方法,提高了最先进算法的效率。我们还用同样的思路开发了一种高效的卷积字典学习方法。此外,我们提出了一种带近似误差约束的新型卷积稀疏编码算法。 摘要:Convolutional sparse coding improves on the standard sparse approximation by incorporating a global shift-invariant model. The most efficient convolutional sparse coding methods are based on the alternating direction method of multipliers and the convolution theorem. The only major difference between these methods is how they approach a convolutional least-squares fitting subproblem. This letter presents a solution to this subproblem, which improves the efficiency of the state-of-the-art algorithms. We also use the same approach for developing an efficient convolutional dictionary learning method. Furthermore, we propose a novel algorithm for convolutional sparse coding with a constraint on the approximation error.
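下面给出一个极简的数值示例(并非该论文的算法,仅为单滤波器情形下基于卷积定理的标准ADMM示意):利用FFT在频域逐点求解卷积最小二乘子问题,再用软阈值处理L1项。

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def csc_admm_single_filter(s, d, lam=0.05, rho=1.0, iters=100):
    """最小化 0.5*||d(*)x - s||^2 + lam*||x||_1(循环卷积、单滤波器示意)。"""
    n = s.size
    Df = np.fft.fft(d, n)
    Sf = np.fft.fft(s)
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    denom = np.abs(Df) ** 2 + rho
    for _ in range(iters):
        # x 子问题:频域逐点求解 (D^H D + rho I) x = D^H s + rho (z - u)
        rhs = np.conj(Df) * Sf + rho * np.fft.fft(z - u)
        x = np.real(np.fft.ifft(rhs / denom))
        z = soft_threshold(x + u, lam / rho)   # z 子问题:软阈值
        u = u + x - z                          # 对偶变量更新
    return z

# 用法示意:稀疏脉冲序列与高斯形滤波器做循环卷积后再恢复
rng = np.random.default_rng(0)
d = np.exp(-0.5 * (np.arange(-8, 9) / 2.0) ** 2)
x_true = np.zeros(256); x_true[rng.choice(256, 5)] = rng.normal(size=5) * 3
s = np.real(np.fft.ifft(np.fft.fft(d, 256) * np.fft.fft(x_true)))
x_hat = csc_admm_single_filter(s, d, lam=0.05)
print("non-zeros recovered:", np.sum(np.abs(x_hat) > 1e-3))
```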

【3】 Fishr: Invariant Gradient Variances for Out-of-distribution Generalization 标题:Fishr:面向分布外泛化的不变梯度方差 链接:https://arxiv.org/abs/2109.02934

作者:Alexandre Rame,Corentin Dancette,Matthieu Cord 机构:Sorbonne Université, CNRS, LIP, Paris, France, Valeo.ai 备注:31 pages, 12 tables, 6 figures 摘要:学习在数据分布发生变化时仍能良好泛化的鲁棒模型,对实际应用至关重要。为此,人们对同时从多个训练域学习、并在这些域之间施加不同类型不变性的做法兴趣日益高涨。然而,在公平的评估协议下,现有方法都未能表现出系统性的优势。在本文中,我们提出了一种新的学习方案,在损失函数的梯度空间中施加域不变性:具体地说,我们引入一个正则化项,用于匹配各训练域之间梯度的域级方差。重要的是,我们的策略(命名为Fishr)与Fisher信息以及损失的Hessian密切相关。我们证明,在学习过程中强制各域的梯度协方差相似,最终会使各域的损失地形在最终权重附近局部对齐。大量实验证明了Fishr对分布外泛化的有效性。特别是,Fishr刷新了DomainBed基准上的最新水平,并显著优于经验风险最小化。代码发布于https://github.com/alexrame/fishr. 摘要:Learning robust models that generalize well under changes in the data distribution is critical for real-world applications. To this end, there has been a growing surge of interest to learn simultaneously from multiple training domains - while enforcing different types of invariance across those domains. Yet, all existing approaches fail to show systematic benefits under fair evaluation protocols. In this paper, we propose a new learning scheme to enforce domain invariance in the space of the gradients of the loss function: specifically, we introduce a regularization term that matches the domain-level variances of gradients across training domains. Critically, our strategy, named Fishr, exhibits close relations with the Fisher Information and the Hessian of the loss. We show that forcing domain-level gradient covariances to be similar during the learning procedure eventually aligns the domain-level loss landscapes locally around the final weights. Extensive experiments demonstrate the effectiveness of Fishr for out-of-distribution generalization. In particular, Fishr improves the state of the art on the DomainBed benchmark and performs significantly better than Empirical Risk Minimization. The code is released at https://github.com/alexrame/fishr.
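下面是一段示意性代码(非官方Fishr实现,做了明显简化):仅对线性分类头解析地计算每个样本的梯度,按域统计梯度的逐元素方差,再惩罚各域方差之间的差异,以说明"匹配域级梯度方差"这一正则项的形式。特征维度、域划分等均为假设。

```python
import torch
import torch.nn.functional as F

def per_sample_grads_linear(phi, labels, W, b):
    """对线性分类头 logits = phi @ W.T + b,解析计算每个样本关于 W 的梯度。"""
    logits = phi @ W.t() + b
    p = F.softmax(logits, dim=-1)
    y = F.one_hot(labels, num_classes=W.shape[0]).float()
    # 每个样本的梯度是 (p - y) 与特征的外积,形状 (N, C, d),展平成 (N, C*d)
    g = torch.einsum("nc,nd->ncd", p - y, phi)
    return g.flatten(1)

def gradient_variance_penalty(domains, W, b):
    """domains: [(phi, labels), ...];惩罚各域的逐元素梯度方差之间的差异。"""
    variances = [per_sample_grads_linear(phi, lab, W, b).var(dim=0) for phi, lab in domains]
    mean_var = torch.stack(variances).mean(dim=0)
    return sum(((v - mean_var) ** 2).sum() for v in variances) / len(variances)

torch.manual_seed(0)
num_classes, dim = 5, 32
W = torch.randn(num_classes, dim, requires_grad=True)
b = torch.zeros(num_classes, requires_grad=True)

# 两个训练域的假设数据
domA = (torch.randn(64, dim), torch.randint(0, num_classes, (64,)))
domB = (torch.randn(64, dim) + 0.5, torch.randint(0, num_classes, (64,)))

penalty = gradient_variance_penalty([domA, domB], W, b)
print("gradient-variance penalty:", penalty.item())
```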

【4】 Deep SIMBAD: Active Landmark-based Self-localization Using Ranking -based Scene Descriptor 标题:Deep SIMBAD:基于排名的场景描述符的主动地标自定位 链接:https://arxiv.org/abs/2109.02786

作者:Tanaka Kanji 机构:Graduate School of Engineering, University of Fukui 备注:6 pages, 7 figures, a preprint 摘要:基于地标的机器人自定位作为一种高度压缩、域不变的方法,近来因可用于跨域(例如一天中的时间、天气和季节)执行视觉位置识别(VPR)而受到关注。然而,对于被动观察者(例如人工控制的机器人),基于地标的自定位可能是一个不适定问题,因为许多视角可能无法提供有效的地标视图。在这项研究中,我们考虑由主动观察者执行的主动自定位任务,并提出了一种新的基于强化学习(RL)的下一最佳视角(NBV)规划器。我们的贡献如下。(1)基于SIMBAD的VPR:我们将基于地标的紧凑场景描述问题表述为SIMBAD(基于相似性的模式识别),并进一步给出其深度学习扩展。(2)VPR到NBV的知识迁移:我们通过把VPR的状态识别能力迁移给NBV,来应对不确定性下RL(即主动自定位)的挑战。(3)基于NNQL的NBV:我们采用Q学习的最近邻近似(NNQL),把已有的VPR当作经验数据库。最终得到一个极为紧凑的数据结构,将VPR和NBV压缩进同一个增量式倒排索引。使用公开NCLT数据集的实验验证了所提方法的有效性。 摘要:Landmark-based robot self-localization has recently garnered interest as a highly-compressive domain-invariant approach for performing visual place recognition (VPR) across domains (e.g., time of day, weather, and season). However, landmark-based self-localization can be an ill-posed problem for a passive observer (e.g., manual robot control), as many viewpoints may not provide an effective landmark view. In this study, we consider an active self-localization task by an active observer and present a novel reinforcement learning (RL)-based next-best-view (NBV) planner. Our contributions are as follows. (1) SIMBAD-based VPR: We formulate the problem of landmark-based compact scene description as SIMBAD (similarity-based pattern recognition) and further present its deep learning extension. (2) VPR-to-NBV knowledge transfer: We address the challenge of RL under uncertainty (i.e., active self-localization) by transferring the state recognition ability of VPR to the NBV. (3) NNQL-based NBV: We regard the available VPR as the experience database by adapting nearest-neighbor approximation of Q-learning (NNQL). The result shows an extremely compact data structure that compresses both the VPR and NBV into a single incremental inverted index. Experiments using the public NCLT dataset validated the effectiveness of the proposed approach.

【5】 Intelligent Motion Planning for a Cost-effective Object Follower Mobile Robotic System with Obstacle Avoidance 标题:高性价比目标跟随式避障移动机器人系统的智能运动规划 链接:https://arxiv.org/abs/2109.02700

作者:Sai Nikhil Gona,Prithvi Raj Bandhakavi 机构:Electrical and Electronics Engineering, Mechanical Engineering, Chaitanya Bharathi Institute of Technology, Telangana, India 备注:24 pages, 27 figures 摘要:少数行业使用人工控制的机器人搬运物料,而且这种机器人无法随时随地使用。因此,拥有一台能够通过跟随某人手持的特定颜色物体来跟随该人的机器人,会带来很大便利。为此,我们提出了一个机器人系统,利用机器人视觉和深度学习得到所需的线速度和角速度(分别记为ν和ω),从而使机器人在跟随人手持的特定颜色物体时能够避开障碍物。我们提出的新方法能够在任何光照条件下准确检测该特定颜色物体的位置,并告诉我们机器人所在位置的水平像素值,还能判断物体离机器人是近还是远。此外,我们在该问题中使用的人工神经网络在线速度和角速度预测上误差很小;用于控制线速度和角速度、进而控制机器人位置的PI控制器也给出了令人印象深刻的结果,该方法优于其他所有方法。 摘要:There are few industries which use manually controlled robots for carrying material and this cannot be used all the time in all the places. So, it is very tranquil to have robots which can follow a specific human by following the unique coloured object held by that person. So, we propose a robotic system which uses robot vision and deep learning to get the required linear and angular velocities which are $\nu$ and $\omega$, respectively. Which in turn makes the robot to avoid obstacles when following the unique coloured object held by the human. The novel methodology that we are proposing is accurate in detecting the position of the unique coloured object in any kind of lighting and tells us the horizontal pixel value where the robot is present and also tells if the object is close to or far from the robot. Moreover, the artificial neural networks that we have used in this problem gave us a meagre error in linear and angular velocity prediction and the PI controller which was used to control the linear and angular velocities, which in turn controls the position of the robot gave us impressive results and this methodology outperforms all other methodologies.

【6】 On the Out-of-distribution Generalization of Probabilistic Image Modelling 标题:关于概率图像建模的非分布泛化 链接:https://arxiv.org/abs/2109.02639

作者:Mingtian Zhang,Andi Zhang,Steven McDonagh 机构:AI Center, University College London, Computer Science, University of Cambridge, Huawei Noah’s Ark Lab 摘要:分布外(OOD)检测和无损压缩构成了两个问题,可以通过在第一个数据集上训练概率模型,然后在数据分布不同的第二个数据集上进行似然评估来解决。通过定义概率模型的似然泛化,我们表明,在图像模型的情况下,OOD泛化能力由局部特征决定。这促使我们提出局部自回归模型,专门对局部图像特征建模,以提高OOD性能。我们将所提出的模型应用于OOD检测任务,并在不引入额外数据的情况下实现最先进的无监督OOD检测性能。此外,我们使用我们的模型构建了一个新的无损图像压缩程序:NeLLoC(神经局部无损压缩程序),并报告了最先进的压缩率和模型大小。 摘要:Out-of-distribution (OOD) detection and lossless compression constitute two problems that can be solved by the training of probabilistic models on a first dataset with subsequent likelihood evaluation on a second dataset, where data distributions differ. By defining the generalization of probabilistic models in terms of likelihood we show that, in the case of image models, the OOD generalization ability is dominated by local features. This motivates our proposal of a Local Autoregressive model that exclusively models local image features towards improving OOD performance. We apply the proposed model to OOD detection tasks and achieve state-of-the-art unsupervised OOD detection performance without the introduction of additional data. Additionally, we employ our model to build a new lossless image compressor: NeLLoC (Neural Local Lossless Compressor) and report state-of-the-art compression rates and model size.
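下面是一段示意性代码(非论文官方的NeLLoC实现),用来说明"只建模局部图像特征的自回归模型"中最核心的构件——受限感受野的掩码卷积:卷积核只允许看到当前像素之前(光栅顺序)的邻近像素,因此似然只由局部特征决定。

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """PixelCNN 风格的掩码卷积:'A' 型掩码屏蔽中心及其之后的位置,感受野只含光栅顺序在前的邻近像素。"""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0   # 屏蔽中心行右侧(A 型含中心)
        mask[:, :, kh // 2 + 1:, :] = 0                          # 屏蔽中心行以下
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)

# 一个只看 5x5 局部邻域的"局部自回归"小模型示意
local_ar = nn.Sequential(
    MaskedConv2d("A", 1, 32, kernel_size=5, padding=2), nn.ReLU(),
    MaskedConv2d("B", 32, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(32, 256, kernel_size=1),     # 输出每像素 256 个灰度级的 logits
)
x = torch.rand(2, 1, 28, 28)
print(local_ar(x).shape)   # torch.Size([2, 256, 28, 28])
```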

【7】 Statistical analysis of locally parameterized shapes 标题:局部参数化形状的统计分析 链接:https://arxiv.org/abs/2109.03027

作者:Mohsen Taheri,Jörn Schulz 机构:Department of Mathematics & Physics, University of Stavanger 备注:25 pages, 20 figures 摘要:形状对齐是统计形状分析中的一个关键步骤,例如,在计算平均形状、检测两个形状群体之间的位置差异以及分类时。Procrustes对齐是最常用的方法和最先进的技术。在这项工作中,我们发现对齐可能会严重影响统计分析。例如,对齐可能导致错误的形状差异,并导致误导性结果和解释。提出了一种基于局部坐标系的层次形状参数化方法。局部参数化形状具有平移和旋转不变性。因此,使用此参数化可以避免形状表示常用全局坐标系的固有对齐问题。新参数化在形状变形和模拟方面也具有优越性。通过对模拟数据以及帕金森病患者和对照组的左海马的假设检验,证明了该方法的有效性。 摘要:The alignment of shapes has been a crucial step in statistical shape analysis, for example, in calculating mean shape, detecting locational differences between two shape populations, and classification. Procrustes alignment is the most commonly used method and state of the art. In this work, we uncover that alignment might seriously affect the statistical analysis. For example, alignment can induce false shape differences and lead to misleading results and interpretations. We propose a novel hierarchical shape parameterization based on local coordinate systems. The local parameterized shapes are translation and rotation invariant. Thus, the inherent alignment problems from the commonly used global coordinate system for shape representation can be avoided using this parameterization. The new parameterization is also superior for shape deformation and simulation. The method's power is demonstrated on the hypothesis testing of simulated data as well as the left hippocampi of patients with Parkinson's disease and controls.
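作为背景,下面给出摘要中作为对比对象的普通Procrustes对齐(两组对应点之间的平移、均匀缩放与旋转对齐)的示意性实现;这是经典算法,并非论文提出的局部参数化方法:

```python
import numpy as np

def procrustes_align(X, Y):
    """把点集 Y(n x d)经平移、均匀缩放和旋转对齐到 X;返回对齐后的 Y。"""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # 最优旋转:对互协方差矩阵做 SVD
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt
    if np.linalg.det(R) < 0:                 # 避免出现反射
        U[:, -1] *= -1
        R = U @ Vt
    s = np.trace(Xc.T @ (Yc @ R.T)) / np.trace(Yc.T @ Yc)   # 最优均匀缩放
    return s * (Yc @ R.T) + X.mean(0)

# 用法示意:Y 是 X 经已知旋转、缩放、平移后的副本,应能被几乎完全对齐回去
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta = 0.4
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
Y = 1.7 * X @ Rz.T + np.array([2.0, -1.0, 0.5])
Y_aligned = procrustes_align(X, Y)
print("alignment error:", np.linalg.norm(Y_aligned - X))
```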

【8】 End to end hyperspectral imaging system with coded compression imaging process 标题:具有编码压缩成像处理的端到端高光谱成像系统 链接:https://arxiv.org/abs/2109.02643

作者:Hui Xie,Zhuang Zhao,Jing Han,Yi Zhang,Lianfa Bai,Jun Lu 机构:School of Electronic Engineering and Optoelectronic Technology, Nanjing University of Science and Technology, Nanjing, China 摘要:高光谱图像(HSI)能够提供丰富的空间和光谱信息,具有广阔的应用前景。近年来,人们发展了若干利用卷积神经网络(CNN)重建HSI的方法。然而,大多数深度学习方法只是拟合压缩图像与标准HSI之间的强行映射关系,因此当观测数据偏离训练数据时,所学到的映射就会失效。为了从二维压缩图像中恢复三维HSI,我们基于编码孔径快照光谱成像系统,提出了一种双相机装置,并配以融入物理先验的自监督CNN方法。我们的方法有效利用了编码光谱信息中的空间-光谱关联性,并基于相机量子效应模型构成一个自监督系统。实验结果表明,该方法能够适应较宽的成像环境并具有良好的性能。此外,与大多数基于网络的方法相比,我们的系统不需要专门的数据集进行预训练,因此具有更强的场景适应性和更好的泛化能力。同时,我们的系统可以在真实场景中不断微调和自我改进。 摘要:Hyperspectral images (HSIs) can provide rich spatial and spectral information with extensive application prospects. Recently, several methods using convolutional neural networks (CNNs) to reconstruct HSIs have been developed. However, most deep learning methods fit a brute-force mapping relationship between the compressive and standard HSIs. Thus, the learned mapping would be invalid when the observation data deviate from the training data. To recover the three-dimensional HSIs from two-dimensional compressive images, we present dual-camera equipment with a physics-informed self-supervising CNN method based on a coded aperture snapshot spectral imaging system. Our method effectively exploits the spatial-spectral relativization from the coded spectral information and forms a self-supervising system based on the camera quantum effect model. The experimental results show that our method can be adapted to a wide imaging environment with good performance. In addition, compared with most of the network-based methods, our system does not require a dedicated dataset for pre-training. Therefore, it has greater scenario adaptability and better generalization ability. Meanwhile, our system can be constantly fine-tuned and self-improved in real-life scenarios.

机器翻译,仅供参考
