Computer Vision arXiv Daily Digest [7.16]

2021-07-27 11:00:58

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, bookmarking, and posting features!

cs.CV: 48 papers in total today

Transformer (2 papers)

【1】 STAR: Sparse Transformer-based Action Recognition

Authors: Feng Shi, Chonghan Lee, Liang Qiu, Yizhou Zhao, Tianyi Shen, Shivran Muralidhar, Tian Han, Song-Chun Zhu, Vijaykrishnan Narayanan
Affiliations: University of California Los Angeles; The Pennsylvania State University; Stevens Institute of Technology
Link: https://arxiv.org/abs/2107.07089
Abstract: The cognitive system for human action and behavior has evolved into a deep learning regime, and especially the advent of Graph Convolution Networks has transformed the field in recent years. However, previous works have mainly focused on over-parameterized and complex models based on dense graph convolution networks, resulting in low efficiency in training and inference. Meanwhile, Transformer-based models have not yet been well explored for cognitive applications in human action and behavior estimation. This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of the data. Our model can also process variable-length video clips grouped as a single batch. Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference: a 4~18x speedup and a 1/7~1/15 model size compared with the baseline models at competitive accuracy.
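
As a rough illustration of the spatial half of this idea, the sketch below restricts joint-to-joint self-attention to a sparse skeleton adjacency mask, so only bone-connected joints attend to each other. This is our own minimal PyTorch sketch of the general technique, not the authors' code; all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_spatial_attention(x, adjacency):
    """Self-attention over skeleton joints, masked by a sparse adjacency.

    x:         (batch, joints, dim) joint features for one frame
    adjacency: (joints, joints) boolean skeleton connectivity (incl. self-loops)
    """
    d = x.size(-1)
    scores = torch.matmul(x, x.transpose(-2, -1)) / d ** 0.5  # (B, J, J)
    scores = scores.masked_fill(~adjacency, float("-inf"))    # keep only bone-connected pairs
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, x)

# Toy usage: 4 joints in a chain 0-1-2-3
adj = torch.tensor([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]], dtype=torch.bool)
out = sparse_spatial_attention(torch.randn(2, 4, 16), adj)
print(out.shape)  # torch.Size([2, 4, 16])
```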

【2】 Surgical Instruction Generation with Transformers

Authors: Jinglu Zhang, Yinyu Nie, Jian Chang, Jian Jun Zhang
Affiliations: National Centre for Computer Animation (NCCA), Bournemouth University, UK; Technical University of Munich
Note: Accepted to MICCAI 2021
Link: https://arxiv.org/abs/2107.06964
Abstract: Automatic surgical instruction generation is a prerequisite towards intra-operative context-aware surgical assistance. However, generating instructions from surgical scenes is challenging, as it requires jointly understanding the surgical activity of the current view and modelling relationships between visual information and textual description. Inspired by neural machine translation and image captioning tasks in the open domain, we introduce a transformer-backboned encoder-decoder network with self-critical reinforcement learning to generate instructions from surgical images. We evaluate the effectiveness of our method on the DAISI dataset, which includes 290 procedures from various medical disciplines. Our approach outperforms the existing baseline over all caption evaluation metrics. The results demonstrate the benefits of the transformer-backboned encoder-decoder structure in handling multimodal context.

Detection (5 papers)

【1】 What and When to Look?: Temporal Span Proposal Network for Video Visual Relation Detection

Authors: Sangmin Woo, Junhyug Noh, Kangil Kim
Note: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Link: https://arxiv.org/abs/2107.07154
Abstract: Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, there have been many constraints in the video domain due to the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? When do relations occur and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these two methods and propose Temporal Span Proposal Network (TSPN), a novel method with two advantages in terms of efficiency and effectiveness. 1) TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness (i.e., the confidence that a relation exists between a pair of objects) of each object pair. 2) TSPN tells when to look: it leverages the full video context to simultaneously predict the temporal span and categories of the entire relations. TSPN demonstrates its effectiveness by achieving a new state of the art by a significant margin on two VidVRD benchmarks (ImageNet-VidVRD and VidOR) while also showing lower time complexity than existing methods - in particular, twice as efficient as a popular segment-based approach.

【2】 Diff-Net: Image Feature Difference based High-Definition Map Change Detection

Authors: Lei He, Shengjie Jiang, Xiaoqing Liang, Ning Wang, Shiyu Song
Affiliations: Baidu Autonomous Driving Technology Department (ADT); China University of Geosciences, Beijing, China; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; Nanjing University of Information Science & Technology
Note: 13 pages, 4 figures
Link: https://arxiv.org/abs/2107.07030
Abstract: Up-to-date High-Definition (HD) maps are essential for self-driving cars. To achieve constantly updated HD maps, we present a deep neural network (DNN), Diff-Net, to detect changes in them. Compared to traditional methods based on object detectors, the essential design in our work is a parallel feature difference calculation structure that infers map changes by comparing features extracted from the camera and rasterized images. To generate these rasterized images, we project map elements onto images in the camera view, yielding meaningful map representations that can be consumed by a DNN accordingly. As we formulate the change detection task as an object detection problem, we leverage an anchor-based structure that predicts bounding boxes with different change status categories. Furthermore, rather than relying on single-frame input, we introduce a spatio-temporal fusion module that fuses features from history frames into the current one, thus improving the overall performance. Finally, we comprehensively validate our method's effectiveness using freshly collected datasets. Results demonstrate that our Diff-Net achieves better performance than the baseline methods and is ready to be integrated into a map production pipeline maintaining an up-to-date HD map.
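
A minimal sketch of what a parallel feature-difference structure could look like is below (PyTorch). The layer sizes, names, and per-pixel change head are our illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FeatureDifference(nn.Module):
    """Toy parallel feature-difference backbone: one encoder runs on the
    camera image, another on the rasterized HD-map image, and a change head
    consumes the element-wise difference of their features."""
    def __init__(self, channels=16):
        super().__init__()
        self.cam_encoder = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.map_encoder = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.change_head = nn.Conv2d(channels, 1, 1)  # per-pixel change logit

    def forward(self, camera_img, rasterized_map):
        diff = self.cam_encoder(camera_img) - self.map_encoder(rasterized_map)
        return self.change_head(diff)

net = FeatureDifference()
logits = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```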

【3】 Lidar Light Scattering Augmentation (LISA): Physics-based Simulation of Adverse Weather Conditions for 3D Object Detection

Authors: Velat Kilic, Deepti Hegde, Vishwanath Sindagi, A. Brinton Cooper, Mark A. Foster, Vishal M. Patel
Affiliations: Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, USA
Link: https://arxiv.org/abs/2107.07004
Abstract: Lidar-based object detectors are critical parts of the 3D perception pipeline in autonomous navigation systems such as self-driving cars. However, they are known to be sensitive to adverse weather conditions such as rain, snow and fog due to reduced signal-to-noise ratio (SNR) and signal-to-background ratio (SBR). As a result, lidar-based object detectors trained on data captured in normal weather tend to perform poorly in such scenarios. However, collecting and labelling sufficient training data in a diverse range of adverse weather conditions is laborious and prohibitively expensive. To address this issue, we propose a physics-based approach to simulate lidar point clouds of scenes in adverse weather conditions. These augmented datasets can then be used to train lidar-based detectors to improve their all-weather reliability. Specifically, we introduce a hybrid Monte-Carlo based approach that treats (i) the effects of large particles by placing them randomly and comparing their back-reflected power against the target, and (ii) attenuation effects on average through calculation of scattering efficiencies from Mie theory and particle size distributions. Retraining networks with this augmented data improves mean average precision evaluated on real-world rainy scenes, and we observe greater improvement in performance with our model relative to existing models from the literature. Furthermore, we evaluate recent state-of-the-art detectors on the simulated weather conditions and present an in-depth analysis of their performance.
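
The average-attenuation part of such an augmentation can be approximated with a two-way Beer-Lambert factor. The sketch below (NumPy) is our simplification: a fixed extinction coefficient stands in for the Mie-theory scattering-efficiency computation, and the noise floor and jitter are placeholder assumptions.

```python
import numpy as np

def attenuate_point_cloud(points, intensity, alpha=0.03, noise_floor=0.05, rng=None):
    """points: (N, 3) xyz coordinates; intensity: (N,) return power in [0, 1].
    alpha is an extinction coefficient (1/m) standing in for the averaged
    Mie-theory scattering efficiency over a particle size distribution."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.linalg.norm(points, axis=1)
    attenuated = intensity * np.exp(-2.0 * alpha * r)  # two-way transmission loss
    # Drop returns that fall below a (jittered) detector noise floor.
    kept = attenuated > noise_floor * rng.uniform(0.5, 1.5, size=r.shape)
    return points[kept], attenuated[kept]
```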

【4】 Mutually improved endoscopic image synthesis and landmark detection in unpaired image-to-image translation

Authors: Lalith Sharan, Gabriele Romano, Sven Koehler, Halvar Kelm, Matthias Karck, Raffaele De Simone, Sandy Engelhardt
Affiliations: Heidelberg University Hospital
Note: Submitted to IEEE JBHI 2021; 13 pages, 8 figures, 4 tables
Link: https://arxiv.org/abs/2107.06941
Abstract: The CycleGAN framework allows for unsupervised image-to-image translation of unpaired data. In a scenario of surgical training on a physical surgical simulator, this method can be used to transform endoscopic images of phantoms into images which more closely resemble the intra-operative appearance of the same surgical target structure. This can be viewed as a novel augmented reality approach, which we coined Hyperrealism in previous work. In this use case, it is of paramount importance to display objects like needles, sutures or instruments consistently in both domains while altering the style to a more tissue-like appearance. Segmentation of these objects would allow for a direct transfer; however, contouring these partly tiny and thin foreground objects is cumbersome and perhaps inaccurate. Instead, we propose to use landmark detection at the points where sutures pass into the tissue. This objective is directly incorporated into a CycleGAN framework by treating the performance of pre-trained detector models as an additional optimization goal. We show that a task defined on these sparse landmark labels improves consistency of synthesis by the generator network in both domains. Comparing a baseline CycleGAN architecture to our proposed extension (DetCycleGAN), mean precision (PPV) improved by +61.32, mean sensitivity (TPR) by +37.91, and mean F1 score by +0.4743. Furthermore, it could be shown that by dataset fusion, generated intra-operative images can be leveraged as additional training data for the detection network itself. The data is released within the scope of the AdaptOR MICCAI Challenge 2021 at https://adaptor2021.github.io/, and code at https://github.com/Cardio-AI/detcyclegan_pytorch.

【5】 Potential UAV Landing Sites Detection through Digital Elevation Models Analysis

Authors: Efstratios Kakaletsis, Nikos Nikolaidis
Affiliations: Artificial Intelligence and Information Analysis Laboratory, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Note: Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO) satellite workshop "Signal Processing, Computer Vision and Deep Learning for Autonomous Systems"
Link: https://arxiv.org/abs/2107.06921
Abstract: In this paper, a simple technique for Unmanned Aerial Vehicle (UAV) potential landing site detection using terrain information through identification of flat areas is presented. The algorithm utilizes digital elevation models (DEM) that represent the height distribution of an area. Flat areas which constitute appropriate landing zones for UAVs in normal or emergency situations result from thresholding the image gradient magnitude of the digital surface model (DSM). The proposed technique also uses connected components evaluation on the thresholded gradient image in order to discover connected regions of sufficient size for landing. Moreover, man-made structures and vegetation areas are detected and excluded from the potential landing sites. Quantitative performance evaluation of the proposed landing site detection algorithm in a number of areas on real-world and synthetic datasets, accompanied by a comparison with a state-of-the-art algorithm, proves its efficiency and superiority.
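
The core of this pipeline reduces to thresholding the DEM gradient magnitude and keeping sufficiently large connected components. A minimal sketch (NumPy/SciPy) follows; the threshold and minimum-area values are placeholder assumptions, not the paper's settings.

```python
import numpy as np
from scipy import ndimage

def candidate_landing_sites(dem, grad_thresh=0.15, min_area_px=200):
    """dem: 2D array of terrain heights. Flat areas have low gradient
    magnitude; connected components below the threshold and above a
    minimum size are returned as a boolean candidate-region mask."""
    gy, gx = np.gradient(dem)
    flat = np.hypot(gx, gy) < grad_thresh        # low-slope pixels
    labels, n = ndimage.label(flat)              # connected components
    sizes = ndimage.sum(flat, labels, index=np.arange(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_area_px]
    return np.isin(labels, keep)
```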

Classification | Recognition (4 papers)

【1】 Context-Conditional Adaptation for Recognizing Unseen Classes in Unseen Domains

Authors: Puneet Mangla, Shivam Chandhok, Vineeth N Balasubramanian, Fahad Shahbaz Khan
Affiliations: Department of Computer Science, Indian Institute of Technology, Hyderabad; Mohamed bin Zayed University of Artificial Intelligence
Link: https://arxiv.org/abs/2107.07497
Abstract: Recent progress towards designing models that can generalize to unseen domains (i.e. domain generalization) or unseen classes (i.e. zero-shot learning) has sparked interest in building models that can tackle both domain shift and semantic shift simultaneously (i.e. zero-shot domain generalization). For models to generalize to unseen classes in unseen domains, it is crucial to learn feature representations that preserve class-level (domain-invariant) as well as domain-specific information. Motivated by the success of generative zero-shot approaches, we propose a feature generative framework integrated with a COntext COnditional Adaptive (COCOA) Batch-Normalization to seamlessly integrate class-level semantic and domain-specific information. The generated visual features better capture the underlying data distribution, enabling us to generalize to unseen classes and domains at test time. We thoroughly evaluate and analyse our approach on an established large-scale benchmark - DomainNet - and demonstrate promising performance over baselines and state-of-the-art methods.
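
A conditional batch-normalization layer of the kind the abstract describes can be sketched as follows (PyTorch). This is a generic conditional BN where the affine scale and shift are predicted from a conditioning vector, not the exact COCOA formulation; all names are ours.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm(nn.Module):
    """Batch norm whose affine parameters are generated from a conditioning
    vector (e.g., class semantics concatenated with domain information)."""
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(cond_dim, num_features)  # predicted scale
        self.beta = nn.Linear(cond_dim, num_features)   # predicted shift

    def forward(self, x, cond):
        out = self.bn(x)
        g = self.gamma(cond).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        b = self.beta(cond).unsqueeze(-1).unsqueeze(-1)
        return (1 + g) * out + b

layer = ConditionalBatchNorm(num_features=32, cond_dim=10)
y = layer(torch.randn(4, 32, 8, 8), torch.randn(4, 10))
```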

【2】 Unsupervised Anomaly Instance Segmentation for Baggage Threat Recognition

Authors: Taimur Hassan, Samet Akcay, Mohammed Bennamoun, Salman Khan, Naoufel Werghi
Link: https://arxiv.org/abs/2107.07333
Abstract: Identifying potential threats concealed within baggage is of prime concern for security staff. Many researchers have developed frameworks that can detect baggage threats from X-ray scans. However, to the best of our knowledge, all of these frameworks require extensive training on large-scale and well-annotated datasets, which are hard to procure in the real world. This paper presents a novel unsupervised anomaly instance segmentation framework that recognizes baggage threats, in X-ray scans, as anomalies without requiring any ground truth labels. Furthermore, thanks to its stylization capacity, the framework is trained only once, and at the inference stage, it detects and extracts contraband items regardless of their scanner specifications. Our one-stage approach initially learns to reconstruct normal baggage content via an encoder-decoder network utilizing a proposed stylization loss function. The model subsequently identifies the abnormal regions by analyzing the disparities between the original and the reconstructed scans. The anomalous regions are then clustered and post-processed to fit a bounding box for their localization. In addition, an optional classifier can also be appended to the proposed framework to recognize the categories of these extracted anomalies. A thorough evaluation of the proposed system on four public baggage X-ray datasets, without any re-training, demonstrates that it achieves competitive performance as compared to conventional fully supervised methods (i.e., mean average precision scores of 0.7941 on SIXray, 0.8591 on GDXray, 0.7483 on OPIXray, and 0.5439 on the COMPASS-XP dataset) while outperforming state-of-the-art semi-supervised and unsupervised baggage threat detection frameworks by 67.37%, 32.32%, 47.19%, and 45.81% in terms of F1 score across the SIXray, GDXray, OPIXray, and COMPASS-XP datasets, respectively.
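
A stripped-down version of the detect-by-reconstruction-disparity step might look like this (NumPy/SciPy). The threshold and connected-component clustering are simplified placeholders for the paper's actual clustering and post-processing.

```python
import numpy as np
from scipy import ndimage

def anomaly_boxes(scan, reconstruction, thresh=0.25, min_area=50):
    """Threshold the disparity between an X-ray scan and its autoencoder
    reconstruction, then cluster anomalous pixels into bounding boxes.
    scan, reconstruction: 2D float arrays in [0, 1]."""
    disparity = np.abs(scan - reconstruction)
    labels, _ = ndimage.label(disparity > thresh)
    boxes = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        if (labels[sl] == i).sum() >= min_area:  # ignore tiny speckle regions
            boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))
    return boxes  # (row0, col0, row1, col1) per anomaly
```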

【3】 An Efficient and Small Convolutional Neural Network for Pest Recognition -- ExquisiteNet

Authors: Shi-Yao Zhou, Chung-Yen Su
Affiliations: Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan
Note: 4 pages
Link: https://arxiv.org/abs/2107.07167
Abstract: Nowadays, due to rapid population growth, food shortage has become a critical issue. To stabilize food production, it is important to prevent crops from being attacked by pests. In general, farmers use pesticides to kill pests; however, improper use of pesticides also kills insects that are beneficial to crops, such as bees. If the number of bees drops too low, the world's food supply will run short. Besides, excessive pesticides seriously pollute the environment. Accordingly, farmers need a machine that can automatically recognize pests. Recently, deep learning has become popular because of its effectiveness in image classification. In this paper, we propose a small and efficient model called ExquisiteNet for pest recognition, and we expect to deploy our model on mobile devices. ExquisiteNet mainly consists of two blocks: a double fusion with squeeze-and-excitation-bottleneck block (DFSEB block) and a max feature expansion block (ME block). ExquisiteNet has only 0.98M parameters and its computing speed is very fast, almost the same as SqueezeNet. To evaluate our model's performance, we test it on a benchmark pest dataset called IP102. Compared to many state-of-the-art models, such as ResNet101, ShuffleNetV2, MobileNetV3-large, and EfficientNet, our model achieves higher accuracy: 52.32% on the IP102 test set without any data augmentation.

【4】 Training Compact CNNs for Image Classification using Dynamic-coded Filter Fusion

Authors: Mingbao Lin, Rongrong Ji, Bohong Chen, Fei Chao, Jianzhuang Liu, Wei Zeng, Yonghong Tian, Qi Tian
Link: https://arxiv.org/abs/2107.06916
Abstract: The mainstream approach for filter pruning is usually either to force a hard-coded importance estimation upon a computation-heavy pretrained model to select "important" filters, or to impose a hyperparameter-sensitive sparse constraint on the loss objective to regularize the network training. In this paper, we present a novel filter pruning method, dubbed dynamic-coded filter fusion (DCFF), to derive compact CNNs in a computation-economical and regularization-free manner for efficient image classification. Each filter in our DCFF is first given an inter-similarity distribution with a temperature parameter as a filter proxy, on top of which a fresh Kullback-Leibler divergence based dynamic-coded criterion is proposed to evaluate the filter importance. In contrast to simply keeping high-score filters as in other methods, we propose the concept of filter fusion, i.e., the weighted averages using the assigned proxies, as our preserved filters. We obtain a one-hot inter-similarity distribution as the temperature parameter approaches infinity. Thus, the relative importance of each filter can vary along with the training of the compact CNN, leading to dynamically changeable fused filters without dependency on the pretrained model or the introduction of sparse constraints. Extensive experiments on classification benchmarks demonstrate the superiority of our DCFF over the compared counterparts. For example, our DCFF derives a compact VGGNet-16 with only 72.77M FLOPs and 1.06M parameters while reaching top-1 accuracy of 93.47% on CIFAR-10. A compact ResNet-50 is obtained with 63.8% FLOPs and 58.6% parameter reductions, retaining 75.60% top-1 accuracy on ILSVRC-2012. Our code, narrower models and training logs are available at https://github.com/lmbxmu/DCFF.
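
A toy rendering of the fusion step is below (PyTorch). This is our reading of the abstract, with negative pairwise Euclidean distance as the similarity: each preserved filter is a temperature-softmax weighted average of all filters, and the distribution collapses to one-hot (each filter keeping itself) as the temperature grows.

```python
import torch
import torch.nn.functional as F

def fuse_filters(weights, temperature=1.0):
    """weights: (out_channels, in_channels, k, k) conv filters.
    Returns fused filters: each output filter is a weighted average of all
    filters, with weights from a temperature softmax over similarity."""
    flat = weights.flatten(1)                      # (C, d)
    dist = torch.cdist(flat, flat)                 # (C, C) pairwise distances
    proxy = F.softmax(-dist * temperature, dim=1)  # inter-similarity distribution
    return (proxy @ flat).view_as(weights)

w = torch.randn(8, 3, 3, 3)
fused_soft = fuse_filters(w, temperature=1.0)     # blended filters
fused_hard = fuse_filters(w, temperature=1e6)     # ~one-hot: recovers w
```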

Segmentation | Semantics (4 papers)

【1】 Amodal segmentation just like doing a jigsaw

Authors: Xunli Zeng, Jianqin Yin
Link: https://arxiv.org/abs/2107.07464
Abstract: Amodal segmentation is a new direction of instance segmentation that considers the segmentation of both the visible and occluded parts of an instance. The existing state-of-the-art method uses multi-task branches to predict the amodal part and the visible part separately and subtracts the visible part from the amodal part to obtain the occluded part. However, the amodal part contains visible information, so this separated prediction generates duplicate information. Differently, we propose a method of amodal segmentation based on the idea of a jigsaw. The method uses multi-task branches to predict the two naturally decoupled parts, visible and occluded, which is like getting two matching jigsaw pieces; putting the two pieces together then yields the amodal part. This makes each branch focus on modeling its own part of the object. We also believe that occlusion relationships in the real world follow certain rules - a kind of occlusion context information. This jigsaw method can better model the occlusion relationship and use the occlusion context information, which is important for amodal segmentation. Experiments on two widely used amodally annotated datasets prove that our method exceeds existing state-of-the-art methods. The source code of this work will be made public soon.

【2】 Deep Learning based Food Instance Segmentation using Synthetic Data

Authors: D. Park, J. Lee, J. Lee, K. Lee
Affiliations: School of Integrated Technology (SIT), Gwangju Institute of Science and Technology (GIST)
Note: Technical Report
Link: https://arxiv.org/abs/2107.07191
Abstract: In the process of intelligently segmenting foods in images using deep neural networks for diet management, data collection and labeling for network training are very important but labor-intensive tasks. To address the difficulties of data collection and annotation, this paper proposes a food segmentation method applicable to the real world through synthetic data. To perform food segmentation on healthcare robot systems, such as a meal-assistance robot arm, we generate synthetic data using the open-source 3D graphics software Blender, placing multiple objects on a meal plate, and train Mask R-CNN for instance segmentation. Also, we build a data collection system and verify our segmentation model on real-world food data. As a result, on our real-world dataset, the model trained only on synthetic data can segment food instances it was not trained on, with 52.2% mask AP@all, and fine-tuning improves performance by 6.4%p compared to the model trained from scratch. In addition, we also confirm the possibility and the performance improvement on a public dataset for fair analysis. Our code and pre-trained weights are available online at: https://github.com/gist-ailab/Food-Instance-Segmentation

【3】 Semantic Image Cropping

Authors: Oriol Corcoll
Affiliations: School of Electronic Engineering and Computer Science, Queen Mary University of London (MSc in Big Data Science, supervised by Dr. Fabrizio Smeraldi)
Link: https://arxiv.org/abs/2107.07153
Abstract: Automatic image cropping techniques are commonly used to enhance the aesthetic quality of an image; they do it by detecting the most beautiful or the most salient parts of the image and removing the unwanted content to obtain a smaller image that is more visually pleasing. In this thesis, I introduce an additional dimension to the problem of cropping: semantics. I argue that image cropping can also enhance the image's relevancy for a given entity by using the semantic information contained in the image. I call this problem Semantic Image Cropping. To support my argument, I provide a new dataset containing 100 images with at least two different entities per image and four ground truth croppings collected using Amazon Mechanical Turk. I use this dataset to show that state-of-the-art cropping algorithms that only take into account aesthetics do not perform well on the problem of semantic image cropping. Additionally, I provide a new deep learning system that takes not just aesthetics but also semantics into account to generate image croppings, and I evaluate its performance using my new semantic cropping dataset, showing that using the semantic information of an image can help to produce better croppings.

【4】 A modular U-Net for automated segmentation of X-ray tomography images in composite materials

Authors: João P C Bertoldo, Etienne Decencière, David Ryckelynck, Henry Proudhon
Affiliations: MINES ParisTech, PSL Research University, France; Centre des Matériaux; Centre de Morphologie Mathématique
Note: Submitted to Nature Machine Intelligence
Link: https://arxiv.org/abs/2107.07468
Abstract: X-ray Computed Tomography (XCT) techniques have evolved to a point that high-resolution data can be acquired so fast that classic segmentation methods are prohibitively cumbersome, demanding automated data pipelines capable of dealing with non-trivial 3D images. Deep learning has demonstrated success in many image processing tasks, including material science applications, showing a promising alternative for a human-free segmentation pipeline. In this paper, a modular interpretation of U-Net (Modular U-Net) is proposed and trained to segment 3D tomography images of a three-phase glass fiber-reinforced Polyamide 66. We compare 2D and 3D versions of our model, finding that the former is slightly better than the latter. We observe that human-comparable results can be achieved even with only 10 annotated layers, and that using a shallow U-Net yields better results than a deeper one. As a consequence, Neural Networks (NNs) indeed show a promising avenue to automate XCT data-processing pipelines, needing no ad-hoc human intervention.

Zero/Few-Shot | Transfer | Domain Adaptation (2 papers)

【1】 Finding Significant Features for Few-Shot Learning using Dimensionality Reduction

Authors: Mauricio Mendez-Ruiz, Ivan Garcia, Jorge Gonzalez-Zapata, Gilberto Ochoa-Ruiz, Andres Mendez-Vazquez
Affiliations: Tecnológico de Monterrey, School of Engineering and Sciences, Mexico; CINVESTAV Unidad Guadalajara, Mexico
Note: This paper is currently under review for the Mexican International Conference on Artificial Intelligence (MICAI) 2021
Link: https://arxiv.org/abs/2107.06992
Abstract: Few-shot learning is a relatively new technique that specializes in problems where we have small amounts of data. The goal of these methods is to classify categories that have not been seen before with just a handful of samples. Recent approaches, such as metric learning, adopt the meta-learning strategy in which we have episodic tasks composed of support (training) data and query (test) data. Metric learning methods have demonstrated that simple models can achieve good performance by learning a similarity function to compare the support and the query data. However, the feature space learned by a given metric learning approach may not exploit the information given by a specific few-shot task. In this work, we explore the use of dimensionality reduction techniques as a way to find task-significant features that help to make better predictions. We measure the performance of the reduced features by assigning a score based on the intra-class and inter-class distance, selecting a feature reduction method in which instances of different classes are far apart and instances of the same class are close. This module helps to improve accuracy by allowing the similarity function, given by the metric learning method, to have more discriminative features for the classification. Our method outperforms the metric learning baselines on the miniImageNet dataset by around 2% in accuracy.
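
The scoring criterion can be sketched directly from the abstract (scikit-learn/NumPy): reduce the features, then reward large inter-class centroid distances relative to intra-class spread. The exact reduction method and score used in the paper may differ; this is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def separation_score(features, labels, n_components=8):
    """Score a reduced feature space: mean inter-class centroid distance
    divided by mean intra-class spread (higher = better separated)."""
    z = PCA(n_components=n_components).fit_transform(features)
    classes = np.unique(labels)
    centroids = np.stack([z[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([np.linalg.norm(z[labels == c] - centroids[i], axis=1).mean()
                     for i, c in enumerate(classes)])
    inter = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                     for i in range(len(classes))
                     for j in range(i + 1, len(classes))])
    return inter / (intra + 1e-8)
```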

【2】 Multi-Channel Auto-Encoders and a Novel Dataset for Learning Domain Invariant Representations of Histopathology Images

Authors: Andrew Moyes, Richard Gault, Kun Zhang, Ji Ming, Danny Crookes, Jing Wang
Affiliations: School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast
Link: https://arxiv.org/abs/2107.07271
Abstract: Domain shift is a problem commonly encountered when developing automated histopathology pipelines. The performance of machine learning models such as convolutional neural networks within automated histopathology pipelines is often diminished when applying them to novel data domains due to factors arising from differing staining and scanning protocols. The Dual-Channel Auto-Encoder (DCAE) model was previously shown to produce feature representations that are less sensitive to appearance variation introduced by different digital slide scanners. In this work, the Multi-Channel Auto-Encoder (MCAE) model is presented as an extension to DCAE which learns from more than two domains of data. Additionally, a synthetic dataset is generated using CycleGANs that contains aligned tissue images whose appearance has been synthetically modified. Experimental results show that the MCAE model produces feature representations that are less sensitive to inter-domain variations than the comparative StaNoSA method when tested on the novel synthetic data. Additionally, the MCAE and StaNoSA models are tested on a novel tissue classification task. The results of this experiment show that the MCAE model outperforms the StaNoSA model by 5 percentage points in F1 score. These results show that the MCAE model is able to generalise better to novel data and tasks than existing approaches by actively learning normalised feature representations.

Temporal | Action Recognition | Pose | Video | Motion Estimation (1 paper)

【1】 Training for temporal sparsity in deep neural networks, application in video processing

Authors: Amirreza Yousefzadeh, Manolis Sifalakis
Link: https://arxiv.org/abs/2107.07305
Abstract: Activation sparsity improves compute efficiency and resource utilization in sparsity-aware neural network accelerators. As the predominant operation in DNNs is multiply-accumulate (MAC) of activations with weights to compute inner products, skipping operations where (at least) one of the two operands is zero can make inference more efficient in terms of latency and power. Spatial sparsification of activations is a popular topic in the DNN literature, and several methods have already been established to bias a DNN for it. On the other hand, temporal sparsity is an inherent feature of bio-inspired spiking neural networks (SNNs), which neuromorphic processing exploits for hardware efficiency. Introducing and exploiting spatio-temporal sparsity is a topic much less explored in the DNN literature, but one in perfect resonance with the trend in DNNs to shift from static signal processing to more streaming signal processing. Towards this goal, in this paper we introduce a new DNN layer (called Delta Activation Layer) whose sole purpose is to promote temporal sparsity of activations during training. A Delta Activation Layer casts temporal sparsity into spatial activation sparsity to be exploited when performing sparse tensor multiplications in hardware. By employing delta inference and "the usual" spatial sparsification heuristics during training, the resulting model learns to exploit not only spatial but also temporal activation sparsity (for a given input data distribution). One may use the Delta Activation Layer either during vanilla training or during a refinement phase. We have implemented the Delta Activation Layer as an extension of the standard TensorFlow-Keras library, and applied it to train deep neural networks on the Human Action Recognition (UCF101) dataset. We report an almost 3x improvement of activation sparsity, with recoverable loss of model accuracy after longer training.
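
A minimal guess at what such a layer could look like as a tf.keras extension is below: it emits frame-to-frame deltas and adds an L1 activity penalty that biases training toward temporally sparse activations. This is our own sketch under those assumptions, not the authors' released implementation.

```python
import tensorflow as tf

class DeltaActivationLayer(tf.keras.layers.Layer):
    """Cast temporal sparsity into spatial sparsity: output frame-to-frame
    differences along the time axis and penalize their L1 norm so that
    training favors activations that change rarely over time."""
    def __init__(self, l1=1e-4, **kwargs):
        super().__init__(**kwargs)
        self.l1 = l1

    def call(self, x):  # x: (batch, time, features)
        delta = x[:, 1:] - x[:, :-1]                     # temporal deltas
        delta = tf.pad(delta, [[0, 0], [1, 0], [0, 0]])  # keep the time length
        self.add_loss(self.l1 * tf.reduce_mean(tf.abs(delta)))
        return delta

# Toy usage inside a sequential model over (time, features) inputs
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16, 64)),
    tf.keras.layers.Dense(32, activation="relu"),
    DeltaActivationLayer(l1=1e-4),
])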

Medical (1 paper)

【1】 Variational Topic Inference for Chest X-Ray Report Generation

Authors: Ivona Najdenkoska, Xiantong Zhen, Marcel Worring, Ling Shao
Affiliations: AIM Lab, University of Amsterdam, The Netherlands
Note: To be published in the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2021
Link: https://arxiv.org/abs/2107.07314
Abstract: Automating report generation for medical imaging promises to reduce workload and assist diagnosis in clinical practice. Recent work has shown that deep learning models can successfully caption natural images. However, learning from medical data is challenging due to the diversity and uncertainty inherent in the reports written by different radiologists with discrepant expertise and experience. To tackle these challenges, we propose variational topic inference for automatic report generation. Specifically, we introduce a set of topics as latent variables to guide sentence generation by aligning image and language modalities in a latent space. The topics are inferred in a conditional variational inference framework, with each topic governing the generation of a sentence in the report. Further, we adopt a visual attention module that enables the model to attend to different locations in the image and generate more informative descriptions. We conduct extensive experiments on two benchmarks, namely Indiana U. Chest X-rays and MIMIC-CXR. The results demonstrate that our proposed variational topic inference method can generate novel reports rather than mere copies of reports used in training, while still achieving comparable performance to state-of-the-art methods in terms of standard language generation criteria.

GAN | Adversarial | Attacks | Generation (5 papers)

【1】 Adversarial Attacks on Multi-task Visual Perception for Autonomous Driving

Authors: Ibrahim Sobh, Ahmed Hamed, Varun Ravi Kumar, Senthil Yogamani
Affiliations: Valeo R&D Egypt; Valeo DAR Germany; TU Ilmenau, Germany; Valeo Ireland
Link: https://arxiv.org/abs/2107.07449
Abstract: Deep neural networks (DNNs) have accomplished impressive success in various applications, including autonomous driving perception tasks, in recent years. On the other hand, current deep neural networks are easily fooled by adversarial attacks. This vulnerability raises significant concerns, particularly in safety-critical applications. As a result, research into attacking and defending DNNs has gained much coverage. In this work, detailed adversarial attacks are applied on a diverse multi-task visual perception deep network across distance estimation, semantic segmentation, motion detection, and object detection. The experiments consider both white-box and black-box attacks for targeted and untargeted cases, attacking one task and inspecting the effect on all the others, in addition to inspecting the effect of applying a simple defense method. We conclude this paper by comparing and discussing the experimental results, proposing insights and future work. Visualizations of the attacks are available at https://youtu.be/R3JUV41aiPY.
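
For concreteness, a single-step white-box attack of the FGSM family (a standard technique, sketched here generically rather than as the paper's exact setup) can be written in a few lines. The `losses` callable maps a possibly multi-task model output to a scalar, so one can attack a single task and then inspect the effect of the same perturbation on the other tasks.

```python
import torch

def fgsm_attack(model, x, losses, epsilon=0.01):
    """Untargeted white-box FGSM: perturb the input along the sign of the
    gradient of a chosen (possibly multi-task) loss.

    model:  callable mapping input to task outputs
    losses: callable mapping model outputs to a scalar loss
    """
    x = x.clone().detach().requires_grad_(True)
    loss = losses(model(x))
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()
```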

【2】 StyleFusion: A Generative Model for Disentangling Spatial Segments

Authors: Omer Kafri, Or Patashnik, Yuval Alaluf, Daniel Cohen-Or
Affiliations: The Blavatnik School of Computer Science, Tel Aviv University
Note: Code is available at: this https URL
Link: https://arxiv.org/abs/2107.07437
Abstract: We present StyleFusion, a new mapping architecture for StyleGAN, which takes as input a number of latent codes and fuses them into a single style code. Inserting the resulting style code into a pre-trained StyleGAN generator results in a single harmonized image in which each semantic region is controlled by one of the input latent codes. Effectively, StyleFusion yields a disentangled representation of the image, providing fine-grained control over each region of the generated image. Moreover, to help facilitate global control over the generated image, a special input latent code is incorporated into the fused representation. StyleFusion operates in a hierarchical manner, where each level is tasked with learning to disentangle a pair of image regions (e.g., the car body and wheels). The resulting learned disentanglement allows one to modify both local, fine-grained semantics (e.g., facial features) as well as more global features (e.g., pose and background), providing improved flexibility in the synthesis process. As a natural extension, StyleFusion enables one to perform semantically-aware cross-image mixing of regions that are not necessarily aligned. Finally, we demonstrate how StyleFusion can be paired with existing editing techniques to more faithfully constrain the edit to the user's region of interest.

【3】 Level generation and style enhancement -- deep learning for game development overview

Authors: Piotr Migdał, Bartłomiej Olechno, Błażej Podgórski
Affiliations: ECC Games SA; Kozminski University
Note: 16 pages, 10 figures, submitted to the 52nd International Simulation and Gaming Association (ISAGA) Conference 2021
Link: https://arxiv.org/abs/2107.07397
Abstract: We present practical approaches of using deep learning to create and enhance level maps and textures for video games -- desktop, mobile, and web. We aim to present new possibilities for game developers and level artists. The task of designing levels and filling them with details is challenging. It is both time-consuming and takes effort to make levels rich, complex, and with a feeling of being natural. Fortunately, recent progress in deep learning provides new tools to accompany level designers and visual artists. Moreover, they offer a way to generate infinite worlds for game replayability and adjust educational games to players' needs. We present seven approaches to create level maps, each using statistical methods, machine learning, or deep learning. In particular, we include:
- Generative Adversarial Networks for creating new images from existing examples (e.g. ProGAN).
- Super-resolution techniques for upscaling images while preserving crisp detail (e.g. ESRGAN).
- Neural style transfer for changing visual themes.
- Image translation - turning semantic maps into images (e.g. GauGAN).
- Semantic segmentation for turning images into semantic masks (e.g. U-Net).
- Unsupervised semantic segmentation for extracting semantic features (e.g. Tile2Vec).
- Texture synthesis - creating large patterns based on a smaller sample (e.g. InGAN).

【4】 DynaDog+T: A Parametric Animal Model for Synthetic Canine Image Generation

Authors: Jake Deane, Sinead Kearney, Kwang In Kim, Darren Cosker
Affiliations: CAMERA, University of Bath; Ulsan National Institute of Science and Technology
Note: CV4Animals Workshop in CVPR 2021
Link: https://arxiv.org/abs/2107.07330
Abstract: Synthetic data is becoming increasingly common for training computer vision models for a variety of tasks. Notably, such data has been applied in tasks related to humans, such as 3D pose estimation, where data is either difficult to create or obtain in realistic settings. Comparatively, there has been less work on synthetic animal data and its use for training models. Consequently, we introduce a parametric canine model, DynaDog+T, for generating synthetic canine images and data, which we use for a common computer vision task, binary segmentation, that would otherwise be difficult due to the lack of available data.

【5】 StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN

Authors: Gereon Fox, Ayush Tewari, Mohamed Elgharib, Christian Theobalt
Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Link: https://arxiv.org/abs/2107.07224
Abstract: Generative adversarial models (GANs) continue to produce advances in terms of the visual quality of still images, as well as the learning of temporal correlations. However, few works manage to combine these two interesting capabilities for the synthesis of video content: most methods require an extensive training dataset in order to learn temporal correlations, while being rather limited in the resolution and visual quality of their output frames. In this paper, we present a novel approach to the video synthesis problem that helps to greatly improve visual quality and drastically reduce the amount of training data and resources necessary for generating video content. Our formulation separates the spatial domain, in which individual frames are synthesized, from the temporal domain, in which motion is generated. For the spatial domain we make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for. The expressive power of this model allows us to embed our training videos in the StyleGAN latent space. Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes. The advantageous properties of the StyleGAN space simplify the discovery of temporal correlations. We demonstrate that it suffices to train our temporal architecture on only 10 minutes of footage of 1 subject for about 6 hours. After training, our model can generate new portrait videos not only for the training subject, but also for any random subject that can be embedded in the StyleGAN space.

Attention (1 paper)

【1】 Passive attention in artificial neural networks predicts human visual selectivity

Authors: Thomas A. Langlois, H. Charles Zhao, Erin Grant, Ishita Dasgupta, Thomas L. Griffiths, Nori Jacoby
Affiliations: Department of Psychology, UC Berkeley; Department of Computer Science, Princeton University
Note: T.A.L. and H.C.Z. contributed equally to this work; T.L.G. and N.J. contributed equally to this work
Link: https://arxiv.org/abs/2107.07013
Abstract: Developments in machine learning interpretability techniques over the past decade have provided new tools to observe the image regions that are most informative for classification and localization in artificial neural networks (ANNs). Are the same regions similarly informative to human observers? Using data from 78 new experiments and 6,610 participants, we show that passive attention techniques reveal a significant overlap with human visual selectivity estimates derived from 6 distinct behavioral tasks including visual discrimination, spatial localization, recognizability, free-viewing, cued-object search, and saliency search fixations. We find that input visualizations derived from relatively simple ANN architectures probed using guided backpropagation methods are the best predictors of a shared component in the joint variability of the human measures. We validate these correlational results with causal manipulations using recognition experiments. We show that images masked with ANN attention maps were easier for humans to classify than control masks in a speeded recognition experiment. Similarly, we find that recognition performance in the same ANN models was likewise influenced by masking input images using human visual selectivity maps. This work contributes a new approach to evaluating the biological and psychological validity of leading ANNs as models of human vision: by examining their similarities and differences in terms of their visual selectivity to the information contained in images.
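
Plain gradient saliency, a simpler relative of the guided-backpropagation maps used in the study, can be computed in a few lines (PyTorch; an illustrative sketch, not the study's code).

```python
import torch

def input_saliency(model, image, class_idx):
    """How strongly each pixel influences the score of one class:
    the absolute input gradient, max-pooled over color channels.

    image: (1, 3, H, W) tensor; returns a (1, H, W) saliency map."""
    image = image.clone().detach().requires_grad_(True)
    score = model(image)[0, class_idx]
    score.backward()
    return image.grad.abs().amax(dim=1)
```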

Faces | Crowd Counting (1 paper)

【1】 Single-image Full-body Human Relighting

Authors: Manuel Lagunas, Xin Sun, Jimei Yang, Ruben Villegas, Jianming Zhang, Zhixin Shu, Belen Masia, Diego Gutierrez
Affiliations: Universidad de Zaragoza, I3A; Adobe Research
Link: https://arxiv.org/abs/2107.07259
Abstract: We present a single-image data-driven method to automatically relight images with full-body humans in them. Our framework is based on a realistic scene decomposition leveraging precomputed radiance transfer (PRT) and spherical harmonics (SH) lighting. In contrast to previous work, we lift the assumptions on Lambertian materials and explicitly model diffuse and specular reflectance in our data. Moreover, we introduce an additional light-dependent residual term that accounts for errors in the PRT-based image reconstruction. We propose a new deep learning architecture, tailored to the decomposition performed in PRT, that is trained using a combination of L1, logarithmic, and rendering losses. Our model outperforms the state of the art for full-body human relighting both with synthetic images and photographs.

Image/Video Retrieval | Re-ID (1 paper)

【1】 Object Retrieval and Localization in Large Art Collections using Deep Multi-Style Feature Fusion and Iterative Voting

Authors: Nikolai Ufer, Sabine Lang, Björn Ommer
Affiliations: Heidelberg University, HCI/IWR, Germany
Note: Accepted at ECCV 2020 Workshop on Computer Vision for Art Analysis
Link: https://arxiv.org/abs/2107.06935
Abstract: The search for specific objects or motifs is essential to art history, as both assist in decoding the meaning of artworks. Digitization has produced large art collections, but manual methods prove to be insufficient to analyze them. In the following, we introduce an algorithm that allows users to search for image regions containing specific motifs or objects and find similar regions in an extensive dataset, helping art historians to analyze large digitized art collections. Computer vision has presented efficient methods for visual instance retrieval across photographs. However, applied to art collections, these methods reveal severe deficiencies because of diverse motifs and massive domain shifts induced by differences in techniques, materials, and styles. In this paper, we present a multi-style feature fusion approach that successfully reduces the domain gap and improves retrieval results without labelled data or curated image collections. Our region-based voting with GPU-accelerated approximate nearest-neighbour search allows us to find and localize even small motifs within an extensive dataset in a few seconds. We obtain state-of-the-art results on the Brueghel dataset and demonstrate its generalization to inhomogeneous collections with a large number of distractors.

Representation Learning (1 paper)

【1】 MultiBench: Multiscale Benchmarks for Multimodal Representation Learning

Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, Louis-Philippe Morency
Affiliations: CMU; Johns Hopkins; Northeastern; Stanford; UT Austin
Note: Code: this https URL and Website: this https URL
Link: https://arxiv.org/abs/2107.07502
Abstract: Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, human-computer interaction, and healthcare. Unfortunately, multimodal research has seen limited resources to study (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiBench, a systematic and unified large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench offers a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections. To accompany this benchmark, we also provide a standardized implementation of 20 core approaches in multimodal learning. Simply applying methods proposed in different research areas can improve the state-of-the-art performance on 9/15 datasets. Therefore, MultiBench presents a milestone in unifying disjoint efforts in multimodal research and paves the way towards a better understanding of the capabilities and limitations of multimodal models, all the while ensuring ease of use, accessibility, and reproducibility. MultiBench, our standardized code, and leaderboards are publicly available, will be regularly updated, and welcome inputs from the community.

蒸馏|知识提取(1篇)

【1】 Confidence Conditioned Knowledge Distillation 标题:置信度条件化的知识蒸馏

作者:Sourav Mishra,Suresh Sundaram 机构:Department of Aerospace Engineering, Indian Institute of Science, Bangalore 备注:31 pages, 41 references, 5 figures, 9 tables 链接:https://arxiv.org/abs/2107.06993 摘要:本文提出了一种新的置信度条件化知识蒸馏(CCKD)方案,用于将知识从教师模型迁移到学生模型。现有的最先进方法为此采用固定的损失函数,忽略了不同样本需要传递不同层次的信息;除此之外,这些方法在数据使用上也是低效的。CCKD通过利用教师模型分配给正确类别的置信度来设计逐样本的损失函数(CCKD-L形式)和目标(CCKD-T形式),以解决这些问题。此外,CCKD通过自我调节,让学生模型已较快学会的样本不再参与蒸馏过程,从而提高了数据效率。在多个基准数据集上的实证评估表明,CCKD方法至少达到了与其他最先进方法相当的泛化性能水平,同时具有更高的数据效率。通过CCKD方法训练的学生模型不会保留教师模型在训练集上的大部分错误分类。与传统的KD方法相比,通过CCKD进行的蒸馏提高了学生模型对对抗攻击的鲁棒性。实验表明,在MNIST和Fashion MNIST数据集上抗对抗攻击性能至少提高3%,在CIFAR10数据集上至少提高6%。 摘要:In this paper, a novel confidence conditioned knowledge distillation (CCKD) scheme for transferring the knowledge from a teacher model to a student model is proposed. Existing state-of-the-art methods employ fixed loss functions for this purpose and ignore the different levels of information that need to be transferred for different samples. In addition to that, these methods are also inefficient in terms of data usage. CCKD addresses these issues by leveraging the confidence assigned by the teacher model to the correct class to devise sample-specific loss functions (CCKD-L formulation) and targets (CCKD-T formulation). Further, CCKD improves the data efficiency by employing self-regulation to stop those samples from participating in the distillation process on which the student model learns faster. Empirical evaluations on several benchmark datasets show that CCKD methods achieve at least as much generalization performance levels as other state-of-the-art methods while being data efficient in the process. Student models trained through CCKD methods do not retain most of the misclassifications committed by the teacher model on the training set. Distillation through CCKD methods improves the resilience of the student models against adversarial attacks compared to the conventional KD method. Experiments show at least 3% increase in performance against adversarial attacks for the MNIST and the Fashion MNIST datasets, and at least 6% increase for the CIFAR10 dataset.
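下面给出一个示意性草图,演示"按教师对正确类别的置信度逐样本调制蒸馏损失"这一核心思想(CCKD-L思想的简化版;具体的加权形式为本文演示用假设,并非论文原始公式)。

```python
# 示意性草图:置信度条件化的蒸馏损失
import torch
import torch.nn.functional as F

def confidence_conditioned_kd_loss(student_logits, teacher_logits, labels, T=4.0):
    # 教师分配给正确类别的置信度,作为逐样本权重
    conf = F.softmax(teacher_logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
    # 标准 KD 项(软目标 KL 散度),保留逐样本取值
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction='none').sum(dim=1) * (T * T)
    ce = F.cross_entropy(student_logits, labels, reduction='none')
    # 置信度高 -> 更信任教师的软目标;置信度低 -> 更依赖真实标签(本文假设的组合方式)
    return (conf * kd + (1.0 - conf) * ce).mean()

# 用法示例
s = torch.randn(8, 10); t = torch.randn(8, 10); y = torch.randint(0, 10, (8,))
print(confidence_conditioned_kd_loss(s, t, y))
```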

视觉解释|视频理解VQA|caption等(1篇)

【1】 From Show to Tell: A Survey on Image Captioning 标题:从展示到讲述:图像字幕研究综述

作者:Matteo Stefanini,Marcella Cornia,Lorenzo Baraldi,Silvia Cascianelli,Giuseppe Fiameni,Rita Cucchiara 机构: Cucchiaraare with the Department of Engineering “Enzo Ferrari”, University ofModena and Reggio Emilia 链接:https://arxiv.org/abs/2107.06912 摘要:连接视觉和语言在生成式智能中起着重要作用。因此,在过去的几年里,大量的研究致力于图像字幕,即用句法和语义上有意义的句子来描述图像。从2015年开始,这项任务通常由视觉编码步骤和用于文本生成的语言模型组成的流水线来完成。在这些年里,这两个组件通过利用对象区域、属性和关系,以及引入多模态连接、全注意力方法和类BERT的早期融合策略,已经有了很大的发展。然而,尽管取得了令人印象深刻的成果,图像字幕的研究还没有得出结论性的答案。这项工作旨在对图像字幕方法进行全面的概述和分类,涵盖视觉编码、文本生成、训练策略、所用数据集和评估指标。在这方面,我们定量比较了许多相关的最新方法,以确定图像字幕体系结构和训练策略中最具影响力的技术创新。此外,还分析和讨论了该问题的许多变体及其开放挑战。这项工作的最终目标是作为理解现有最新技术的工具,并为计算机视觉和自然语言处理能够找到最佳协同的这一研究领域指出未来方向。 摘要:Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, in the last few years, a large research effort has been devoted to image captioning, i.e. the task of describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoding step and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, and relationships and the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results obtained, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview and categorization of image captioning approaches, from visual encoding and text generation to training strategies, used datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in image captioning architectures and training strategies. Moreover, many variants of the problem and its open challenges are analyzed and discussed. The final goal of this work is to serve as a tool for understanding the existing state-of-the-art and highlighting the future directions for an area of research where Computer Vision and Natural Language Processing can find an optimal synergy.

点云|SLAM|雷达|激光|深度RGBD相关(2篇)

【1】 High carbon stock mapping at large scale with optical satellite imagery and spaceborne LIDAR 标题:利用光学卫星影像和星载激光雷达进行大范围高碳储量制图

作者:Nico Lang,Konrad Schindler,Jan Dirk Wegner 机构:EcoVision Lab, Photogrammetry and Remote Sensing, ETH Zurich, Switzerland, Institute for Computational Science, University of Zurich, Switzerland 链接:https://arxiv.org/abs/2107.07431 摘要:对商品日益增长的需求正导致世界各地土地利用的变化。在热带地区,森林砍伐造成高碳排放、威胁生物多样性,并常常与农业扩张有关。虽然人们普遍认识到需要建立无森林砍伐的全球供应链,但在实践中取得进展仍然是一项挑战。在此,我们提出了一种自动化的方法,旨在按照高碳储量(HCS)方法,以大范围和高空间分辨率绘制热带景观图,从而支持保护和可持续土地利用规划决策。我们开发了一种深度学习方法,通过从稀疏的GEDI激光雷达参考数据中学习,为每个10米Sentinel-2像素估计林冠高度,总体RMSE为6.3米。我们表明,这些全覆盖(wall-to-wall)的林冠高度图可用于区分HCS森林与退化区域,总体准确率为86%,并为印度尼西亚、马来西亚和菲律宾绘制了第一张高碳储量地图。 摘要:The increasing demand for commodities is leading to changes in land use worldwide. In the tropics, deforestation, which causes high carbon emissions and threatens biodiversity, is often linked to agricultural expansion. While the need for deforestation-free global supply chains is widely recognized, making progress in practice remains a challenge. Here, we propose an automated approach that aims to support conservation and sustainable land use planning decisions by mapping tropical landscapes at large scale and high spatial resolution following the High Carbon Stock (HCS) approach. A deep learning approach is developed that estimates canopy height for each 10 m Sentinel-2 pixel by learning from sparse GEDI LIDAR reference data, achieving an overall RMSE of 6.3 m. We show that these wall-to-wall maps of canopy top height are predictive for classifying HCS forests and degraded areas with an overall accuracy of 86 % and produce a first high carbon stock map for Indonesia, Malaysia, and the Philippines.
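这里的关键训练设置是"稀疏参考监督":只有极少数像素有GEDI激光雷达高度可供监督。下面是一个掩码回归损失的示意性草图(处理稀疏参考数据的常见做法;张量形状与数据均为演示用假设,并非论文官方实现)。

```python
# 示意性草图:仅在有 GEDI 参考的像素上监督冠层高度回归
import torch

def masked_mse(pred, target, valid_mask):
    # pred/target: (B, H, W) 每个 10m 像素的冠层高度; valid_mask: 有参考数据的像素为 1
    diff2 = (pred - target) ** 2 * valid_mask
    return diff2.sum() / valid_mask.sum().clamp(min=1)

pred = torch.rand(2, 64, 64) * 40              # 预测高度(米)
target = torch.rand(2, 64, 64) * 40
mask = (torch.rand(2, 64, 64) < 0.01).float()  # 假设约 1% 像素有激光雷达参考
print(masked_mse(pred, target, mask))
```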

【2】 VILENS: Visual, Inertial, Lidar, and Leg Odometry for All-Terrain Legged Robots 标题:VILENS:用于全地形腿式机器人的视觉、惯性、激光雷达和腿部里程计

作者:David Wisth,Marco Camurri,Maurice Fallon 机构:Oxford Robotics Institute, University of Oxford, Oxford, UK 备注:Video: this https URL 链接:https://arxiv.org/abs/2107.07243 摘要:提出了一种基于因子图的腿式机器人视觉-惯性-激光雷达-腿部里程计系统VILENS。关键的创新点是四种不同传感器模态的紧耦合融合,以便在单个传感器会产生退化估计的情况下实现可靠运行。为了减小腿部里程计漂移,我们用在线估计的线速度偏置项扩展机器人状态。这一偏置之所以可观测,正是因为该预积分速度因子与视觉、激光雷达和IMU因子的紧耦合融合。我们在ANYmal四足机器人上进行了广泛的实验验证,总持续时间为2小时,行程1.8公里。实验包括在碎石、斜坡和泥泞上的动态运动;其中既有感知方面的挑战,如黑暗多尘的地下洞穴或开阔、缺乏特征的区域,也有运动方面的挑战,如打滑和地形形变。与最先进的松耦合方法相比,平移误差平均降低62%,旋转误差平均降低51%。为了证明其鲁棒性,VILENS还与感知控制器和局部路径规划器进行了集成。 摘要:We present VILENS (Visual Inertial Lidar Legged Navigation System), an odometry system for legged robots based on factor graphs. The key novelty is the tight fusion of four different sensor modalities to achieve reliable operation when the individual sensors would otherwise produce degenerate estimation. To minimize leg odometry drift, we extend the robot's state with a linear velocity bias term which is estimated online. This bias is only observable because of the tight fusion of this preintegrated velocity factor with vision, lidar, and IMU factors. Extensive experimental validation on the ANYmal quadruped robots is presented, for a total duration of 2 h and 1.8 km traveled. The experiments involved dynamic locomotion over loose rocks, slopes, and mud; these included perceptual challenges, such as dark and dusty underground caverns or open, feature-deprived areas, as well as mobility challenges such as slipping and terrain deformation. We show an average improvement of 62% translational and 51% rotational errors compared to a state-of-the-art loosely coupled approach. To demonstrate its robustness, VILENS was also integrated with a perceptive controller and a local path planner.
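"带速度偏置的腿部里程计因子"可以用一个残差函数来直观理解。下面是一个极简的示意性草图(具体的因子定义与坐标系约定均为本文假设;实际系统在因子图优化库中与视觉/激光雷达/IMU因子联合求解)。

```python
# 示意性草图:带在线速度偏置项的腿部速度残差
import numpy as np

def leg_velocity_residual(v_world_est, R_body, v_leg_meas, bias_v):
    # v_world_est: 世界系下估计的线速度; R_body: 机体姿态旋转矩阵
    # v_leg_meas: 腿部运动学得到的机体系速度测量; bias_v: 在线估计的速度偏置
    return v_leg_meas - (R_body.T @ v_world_est + bias_v)

R = np.eye(3)
r = leg_velocity_residual(np.array([0.5, 0.0, 0.0]), R,
                          np.array([0.48, 0.01, 0.0]), np.array([0.02, 0.0, 0.0]))
print(r)  # 优化器最小化该残差;偏置只有在与其他因子紧耦合时才可观测
```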

其他神经网络|深度学习|模型|建模(7篇)

【1】 COAST: COntrollable Arbitrary-Sampling NeTwork for Compressive Sensing 标题:COAST:面向压缩感知的可控任意采样网络

作者:Di You,Jian Zhang,Jingfen Xie,Bin Chen,Siwei Ma 备注:None 链接:https://arxiv.org/abs/2107.07225 摘要:近年来,基于深度网络的压缩感知(CS)方法取得了巨大的成功。然而,它们大多将不同的采样矩阵视为不同的独立任务,需要为每个目标采样矩阵训练一个特定的模型。这种做法导致计算效率低下,泛化能力差。本文提出了一种新的可控任意采样网络COAST,用单一模型解决任意采样矩阵(包括未见过的采样矩阵)的CS问题。在优化启发的深度展开框架下,我们的COAST显示出良好的可解释性。在COAST中,提出了一种随机投影增强(RPA)策略,以提升采样空间中的训练多样性,从而实现任意采样;进一步提出了可控近端映射模块(CPMM)和即插即用解块(PnP-D)策略,分别对网络特征进行动态调制和有效消除块效应。在广泛使用的基准数据集上的大量实验表明,我们提出的COAST不仅能够用单一模型处理任意采样矩阵,而且能够以较快的速度获得最先进的性能。源代码位于https://github.com/jianzhangcs/COAST. 摘要:Recent deep network-based compressive sensing (CS) methods have achieved great success. However, most of them regard different sampling matrices as different independent tasks and need to train a specific model for each target sampling matrix. Such practices give rise to inefficiency in computing and suffer from poor generalization ability. In this paper, we propose a novel COntrollable Arbitrary-Sampling neTwork, dubbed COAST, to solve CS problems of arbitrary-sampling matrices (including unseen sampling matrices) with one single model. Under the optimization-inspired deep unfolding framework, our COAST exhibits good interpretability. In COAST, a random projection augmentation (RPA) strategy is proposed to promote the training diversity in the sampling space to enable arbitrary sampling, and a controllable proximal mapping module (CPMM) and a plug-and-play deblocking (PnP-D) strategy are further developed to dynamically modulate the network features and effectively eliminate the blocking artifacts, respectively. Extensive experiments on widely used benchmark datasets demonstrate that our proposed COAST is not only able to handle arbitrary sampling matrices with one single model but also to achieve state-of-the-art performance with fast speed. The source code is available on https://github.com/jianzhangcs/COAST.
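随机投影增强(RPA)的核心是在训练时为不同批次采样不同的采样矩阵,迫使单一模型适应任意采样。下面是一个示意性草图(用随机正交行构造采样矩阵属本文演示用假设,并非论文官方实现)。

```python
# 示意性草图:训练时随机采样测量矩阵 Phi 并生成压缩测量 y
import numpy as np

def sample_measurement_matrix(m, n, rng):
    # 随机高斯矩阵做 QR 分解,取前 m 个正交行作为采样矩阵 Phi
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q[:m]

rng = np.random.default_rng(0)
n, m = 256, 64                      # 信号维度与测量数(采样率 25%)
x = rng.standard_normal(n)          # 原始信号(如图像块展平)
Phi = sample_measurement_matrix(m, n, rng)
y = Phi @ x                         # 压缩测量;训练时 (y, Phi) 一起输入重建网络
print(y.shape)
```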

【2】 Neighbor-view Enhanced Model for Vision and Language Navigation 标题:视觉和语言导航的邻视增强模型

作者:Dong An,Yuankai Qi,Yan Huang,Qi Wu,Liang Wang,Tieniu Tan 机构: Institution of Automation, Chinese Academy of Sciences (CASIA), Beijing, China, University of Adelaide, Adelaide, Australia, Center for Excellence in Brain Science and Intelligence Technology (CEBSIT), Beijing, China 链接:https://arxiv.org/abs/2107.07201 摘要:视觉和语言导航(VLN)要求代理按照自然语言指令导航到目标位置。现有的大多数作品都是通过候选对象所在的单个视图的特征来表示导航候选对象。然而,一条指令可能会引用单个视图之外的地标,这可能导致现有方法的文本-视觉匹配失败。在这项工作中,我们提出了一个多模块邻居视图增强模型(NvEM)来自适应地合并来自邻居视图的视觉上下文,以获得更好的文本视觉匹配。具体地说,我们的NvEM使用一个主题模块和一个引用模块从邻居视图中收集上下文。主题模块在全局级融合相邻视图,参考模块在局部级融合相邻对象。主题和参考文献是通过注意机制自适应地确定的。我们的模型还包括一个动作模块,利用指令中的强方向引导(例如“左转”)。每个模块分别预测导航动作,并用它们的加权和预测最终动作。大量的实验结果表明,该方法在R2R和R4R基准上对多个最先进的导航器都是有效的,NvEM甚至超过了一些预训练的导航器。我们的代码在https://github.com/MarSaKi/NvEM. 摘要:Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most of existing works represent a navigation candidate by the feature of the corresponding single view where the candidate lies in. However, an instruction may mention landmarks out of the single view as references, which might lead to failures of textual-visual matching of existing methods. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views for better textual-visual matching. Specifically, our NvEM utilizes a subject module and a reference module to collect contexts from neighbor views. The subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are adaptively determined via attention mechanisms. Our model also includes an action module to utilize the strong orientation guidance (e.g., ``turn left'') in instructions. Each module predicts navigation action separately and their weighted sum is used for predicting the final action. Extensive experimental results demonstrate the effectiveness of the proposed method on the R2R and R4R benchmarks against several state-of-the-art navigators, and NvEM even beats some pre-training ones. Our code is available at https://github.com/MarSaKi/NvEM.

【3】 Applying the Case Difference Heuristic to Learn Adaptations from Deep Network Features 标题:案例差异启发式在深度网络特征自适应学习中的应用

作者:Xiaomeng Ye,Ziwei Zhao,David Leake,Xizi Wang,David Crandall 机构:Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington IN , USA 备注:7 pages, 2 figures, 1 table. To be published in the IJCAI-21 Workshop on Deep Learning, Case-Based Reasoning, and AutoML: Present and Future Synergies 链接:https://arxiv.org/abs/2107.07095 摘要:案例差异启发式(CDH)方法是一种从基于案例推理系统的案例库中学习案例适配知识的轻知识(knowledge-light)方法。给定一对案例,CDH方法将二者解的差异归因于其所解决问题的差异,并据此生成适配规则:当检索到的案例与新查询具有相似的问题差异时,相应地调整解。作为学习适配规则的一种替代,一些研究人员已经应用神经网络来学习从问题差异预测解的差异。以往这类方法的工作假设描述问题的特征集是预先定义好的。本文研究了一种两阶段过程,将用于特征提取的深度学习与基于所提特征的神经网络适配学习相结合。其性能在一项图像数据上的回归任务中得到验证:根据人脸图像预测年龄。结果表明,该组合过程能成功地学习适用于案例间非符号差异的适配知识。CBR系统的总体性能略低于基线深度网络回归器,但在新颖查询上的性能优于基线。 摘要:The case difference heuristic (CDH) approach is a knowledge-light method for learning case adaptation knowledge from the case base of a case-based reasoning system. Given a pair of cases, the CDH approach attributes the difference in their solutions to the difference in the problems they solve, and generates adaptation rules to adjust solutions accordingly when a retrieved case and new query have similar problem differences. As an alternative to learning adaptation rules, several researchers have applied neural networks to learn to predict solution differences from problem differences. Previous work on such approaches has assumed that the feature set describing problems is predefined. This paper investigates a two-phase process combining deep learning for feature extraction and neural network based adaptation learning from extracted features. Its performance is demonstrated in a regression task on an image data: predicting age given the image of a face. Results show that the combined process can successfully learn adaptation knowledge applicable to nonsymbolic differences in cases. The CBR system achieves slightly lower performance overall than a baseline deep network regressor, but better performance than the baseline on novel queries.
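"从特征差异预测解差异再做适配"这一思路可以写成很短的草图。下面是一个极简示意(特征提取器与差异回归网络均为占位的假设实现,未经训练,仅演示数据流,并非论文官方代码)。

```python
# 示意性草图:用神经网络从"问题差异"(深度特征差)预测"解差异"并做适配
import torch
import torch.nn as nn

feat_dim = 128
f = nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU())   # 占位:实际为深度图像特征提取器
adapt_net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def adapt(query_x, case_x, case_age):
    # 解差异 = g(f(查询) - f(检索案例));新解 = 案例解 + 预测的解差异
    delta = adapt_net(f(query_x) - f(case_x))
    return case_age + delta

q = torch.randn(1, 512); c = torch.randn(1, 512)
print(adapt(q, c, torch.tensor([[35.0]])))  # 基于 35 岁案例适配出查询人脸的年龄估计
```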

【4】 MeNToS: Tracklets Association with a Space-Time Memory Network 标题:MeNToS:基于时空记忆网络的轨迹片段关联

作者:Mehdi Miah,Guillaume-Alexandre Bilodeau,Nicolas Saunier 机构:Polytechnique Montr´eal 备注:Presented at the "Robust Video Scene Understanding: Tracking and Video Segmentation" workshop (CVPR-W 2021) 链接:https://arxiv.org/abs/2107.07067 摘要:我们提出了一种无需微调、也无需针对每个基准单独选择超参数的多目标跟踪与分割(MOTS)方法。该方法特别针对数据关联问题。事实上,最近引入的HOTA度量通过均衡地兼顾检测质量和关联质量,与人类视觉评估有更好的一致性,它表明数据关联仍有改进空间。在利用实例分割和光流生成轨迹片段后,该方法借助为单样本视频目标分割开发的时空记忆网络(STM),改进存在时间间隔的轨迹片段之间的关联。据我们所知,我们的方法(命名为MeNToS)是第一个使用STM网络跟踪对象掩码以完成MOTS任务的方法。我们在RobMOTS挑战赛中获得了第四名。项目页面是https://mehdimiah.com/mentos.html. 摘要:We propose a method for multi-object tracking and segmentation (MOTS) that does not require fine-tuning or per benchmark hyperparameter selection. The proposed method addresses particularly the data association problem. Indeed, the recently introduced HOTA metric, that has a better alignment with the human visual assessment by evenly balancing detections and associations quality, has shown that improvements are still needed for data association. After creating tracklets using instance segmentation and optical flow, the proposed method relies on a space-time memory network (STM) developed for one-shot video object segmentation to improve the association of tracklets with temporal gaps. To the best of our knowledge, our method, named MeNToS, is the first to use the STM network to track object masks for MOTS. We took the 4th place in the RobMOTS challenge. The project page is https://mehdimiah.com/mentos.html.

【5】 Learning Sparse Interaction Graphs of Partially Observed Pedestrians for Trajectory Prediction 标题:用于轨迹预测的部分观察行人稀疏交互图学习

作者:Zhe Huang,Ruohua Li,Kazuki Shin,Katherine Driggs-Campbell 机构:University of Illinois at Urbana-Champaign, University of Michigan 备注:10 pages, 3 figures 链接:https://arxiv.org/abs/2107.07056 摘要:多行人轨迹预测是在非结构化环境中与人群交互的自治系统不可或缺的安全要素。许多近期工作开发了轨迹预测算法,重点是理解行人运动背后的社会规范。然而我们观察到,这些工作通常持有两个假设,阻碍它们顺利应用于机器人系统:所有行人的位置始终被跟踪;目标智能体关注场景中的所有行人。第一个假设导致在行人数据不完整时产生有偏的交互建模,第二个假设引入不必要的干扰,导致冻结机器人问题。因此,我们提出了Gumbel Social Transformer,其中边Gumbel选择器(Edge Gumbel Selector)在每个时间步为部分可观测的行人采样一个稀疏交互图。节点Transformer编码器(Node Transformer Encoder)和掩码LSTM利用采样得到的稀疏图对行人特征进行编码,从而预测轨迹。我们证明,我们的模型克服了上述假设带来的潜在问题,并且在基准评估中优于相关工作。 摘要:Multi-pedestrian trajectory prediction is an indispensable safety element of autonomous systems that interact with crowds in unstructured environments. Many recent efforts have developed trajectory prediction algorithms with focus on understanding social norms behind pedestrian motions. Yet we observe these works usually hold two assumptions that prevent them from being smoothly applied to robot applications: positions of all pedestrians are consistently tracked; the target agent pays attention to all pedestrians in the scene. The first assumption leads to biased interaction modeling with incomplete pedestrian data, and the second assumption introduces unnecessary disturbances and leads to the freezing robot problem. Thus, we propose Gumbel Social Transformer, in which an Edge Gumbel Selector samples a sparse interaction graph of partially observed pedestrians at each time step. A Node Transformer Encoder and a Masked LSTM encode the pedestrian features with the sampled sparse graphs to predict trajectories. We demonstrate that our model overcomes the potential problems caused by the assumptions, and our approach outperforms the related works in benchmark evaluation.
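"用Gumbel技巧对交互边做可微的离散采样"是该方法的关键一步。下面用PyTorch自带的gumbel_softmax给出一个示意性草图(边打分网络的具体结构为本文假设;这里只演示采样本身)。

```python
# 示意性草图:用 Gumbel-Softmax 采样稀疏交互图的邻接掩码
import torch
import torch.nn.functional as F

def sample_sparse_edges(edge_logits, tau=0.5):
    # edge_logits: (N, N, 2) 每对行人"保留/丢弃"这条边的打分
    # hard=True 时前向输出 one-hot(离散边),反向用连续近似保持可微
    onehot = F.gumbel_softmax(edge_logits, tau=tau, hard=True, dim=-1)
    return onehot[..., 0]  # (N, N) 取"保留"通道作为邻接掩码

logits = torch.randn(6, 6, 2)
adj = sample_sparse_edges(logits)
print(adj)  # 后续用该稀疏邻接掩码约束注意力/消息传递
```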

【6】 Classifying Component Function in Product Assemblies with Graph Neural Networks 标题:用图神经网络对产品装配中的部件功能进行分类

作者:Vincenzo Ferrero,Kaveh Hassani,Daniele Grandi,Bryony DuPont 机构:Design Engineering Laboratory, Oregon State University, Corvallis, Oregon, Autodesk, Inc., Autodesk Research, San Rafael, CA 链接:https://arxiv.org/abs/2107.07042 摘要:功能被定义为使产品能够完成设计目的的任务集合。功能性工具(如功能建模)在产品设计的早期阶段提供决策指导,此时明确的设计决策尚未做出。基于功能的设计数据通常是稀疏的,并且依赖个体的解释。因此,基于功能的设计工具可以从自动功能分类中获益,以提高数据保真度,并提供支持基于功能的智能设计代理的功能表示模型。基于功能的设计数据通常存储在手工生成的设计库中。这些设计库是产品设计中专家知识和功能解释的集合,受功能-流和部件分类法约束。在这项工作中,我们将一个结构化的基于分类法的设计库表示为装配-流图,然后利用图神经网络(GNN)模型来执行自动功能分类。我们通过从库数据中学习来建立部件功能分配的基本事实,从而支持自动功能分类。实验结果表明,我们的GNN模型的微平均F$_1$分数在第1层(宽泛)功能上为0.832,第2层为0.756,第3层(具体)为0.783。鉴于数据特征的不平衡,这一结果令人鼓舞。本文的工作可以作为基于知识的CAD系统中更复杂应用以及基于功能设计中面向X设计(Design-for-X)考量的起点。 摘要:Function is defined as the ensemble of tasks that enable the product to complete the designed purpose. Functional tools, such as functional modeling, offer decision guidance in the early phase of product design, where explicit design decisions are yet to be made. Function-based design data is often sparse and grounded in individual interpretation. As such, function-based design tools can benefit from automatic function classification to increase data fidelity and provide function representation models that enable function-based intelligent design agents. Function-based design data is commonly stored in manually generated design repositories. These design repositories are a collection of expert knowledge and interpretations of function in product design bounded by function-flow and component taxonomies. In this work, we represent a structured taxonomy-based design repository as assembly-flow graphs, then leverage a graph neural network (GNN) model to perform automatic function classification. We support automated function classification by learning from repository data to establish the ground truth of component function assignment. Experimental results show that our GNN model achieves a micro-average F$_1$-score of 0.832 for tier 1 (broad), 0.756 for tier 2, and 0.783 for tier 3 (specific) functions. Given the imbalance of data features, the results are encouraging. Our efforts in this paper can be a starting point for more sophisticated applications in knowledge-based CAD systems and Design-for-X consideration in function-based design.
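"在装配-流图上做节点(部件)功能分类"可以用一个极简GCN来直观说明。下面是一个示意性草图(单层消息传递;真实模型结构、层数与特征均为本文假设,并非论文官方实现)。

```python
# 示意性草图:对装配-流图中每个零件节点做功能类别分类的极简 GCN
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    def __init__(self, in_dim, hid, n_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid)
        self.cls = nn.Linear(hid, n_classes)

    def forward(self, x, adj):
        # x: (N, in_dim) 零件节点特征; adj: (N, N) 含自环的归一化邻接矩阵
        h = torch.relu(adj @ self.lin(x))   # 一轮邻居聚合(消息传递)
        return self.cls(h)                  # 每个零件节点的功能类别 logits

n = 12
adj = torch.eye(n)                  # 占位:实际为装配-流边构成的归一化邻接
x = torch.randn(n, 16)
model = TinyGCN(16, 32, 8)
print(model(x, adj).shape)           # (12, 8):每个部件在 8 个功能类别上的打分
```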

【7】 FetalNet: Multi-task deep learning framework for fetal ultrasound biometric measurements 标题:FetalNet:胎儿超声生物特征测量的多任务深度学习框架

作者:Szymon Płotka,Tomasz Włodarczyk,Adam Klasa,Michał Lipa,Arkadiusz Sitek,Tomasz Trzciński 机构: Sano Centre for Computational Medicine, Cracow, Poland, Warsaw University of Technology, Warsaw, Poland, Medical University of Warsaw, Warsaw, Poland, Fetai Health Ltd., Tooploox, Wroclaw, Poland 备注:Submitted to ICONIP 2021 链接:https://arxiv.org/abs/2107.06943 摘要:在本文中,我们提出了一种带注意力机制和堆叠模块、用于时空胎儿超声扫描视频分析的端到端多任务神经网络FetalNet。胎儿生物测量是妊娠期的一种标准检查,用于胎儿生长监测以及胎龄和胎儿体重的估计。胎儿超声扫描视频分析的主要目标是找到合适的标准平面来测量胎儿的头、腹和股骨。由于超声数据中固有的高散斑噪声和阴影,找到合适的采集平面并对胎儿进行精确测量需要医学专业知识和超声经验。此外,现有的计算机辅助胎儿超声生物特征测量方法只处理单帧图像,没有考虑时间特征。针对这些不足,我们提出了一种端到端的多任务神经网络,用于时空超声扫描视频分析,同时对胎儿身体部位进行定位、分类和测量。我们提出了一个包含分类分支的新型编码器-解码器分割架构。此外,我们采用带堆叠模块的注意力机制来学习显著图,以抑制不相关的超声区域并高效定位扫描平面。我们在来自700名不同患者常规检查的胎儿超声视频上进行训练。我们称之为FetalNet的方法在胎儿超声视频的分类和分割两方面都优于现有的最先进方法。 摘要:In this paper, we propose an end-to-end multi-task neural network called FetalNet with an attention mechanism and stacked module for spatio-temporal fetal ultrasound scan video analysis. Fetal biometric measurement is a standard examination during pregnancy used for the fetus growth monitoring and estimation of gestational age and fetal weight. The main goal in fetal ultrasound scan video analysis is to find proper standard planes to measure the fetal head, abdomen and femur. Due to natural high speckle noise and shadows in ultrasound data, medical expertise and sonographic experience are required to find the appropriate acquisition plane and perform accurate measurements of the fetus. In addition, existing computer-aided methods for fetal US biometric measurement address only one single image frame without considering temporal features. To address these shortcomings, we propose an end-to-end multi-task neural network for spatio-temporal ultrasound scan video analysis to simultaneously localize, classify and measure the fetal body parts. We propose a new encoder-decoder segmentation architecture that incorporates a classification branch. Additionally, we employ an attention mechanism with a stacked module to learn salient maps to suppress irrelevant US regions and efficient scan plane localization. We trained on fetal ultrasound videos coming from routine examinations of 700 different patients. Our method called FetalNet outperforms existing state-of-the-art methods in both classification and segmentation in fetal ultrasound video recordings.
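"分割分支 + 分类分支联合训练"的多任务形式可以写成一个很短的损失函数草图(任务权重、类别数与结构均为本文演示用假设,并非论文官方实现)。

```python
# 示意性草图:分割 + 标准平面分类的多任务训练损失
import torch
import torch.nn.functional as F

def multitask_loss(seg_logits, seg_gt, cls_logits, cls_gt, w_seg=1.0, w_cls=0.5):
    # seg_logits: (B, C, H, W) 身体部位分割; cls_logits: (B, K) 标准平面分类
    return w_seg * F.cross_entropy(seg_logits, seg_gt) + \
           w_cls * F.cross_entropy(cls_logits, cls_gt)

seg = torch.randn(2, 4, 64, 64); seg_gt = torch.randint(0, 4, (2, 64, 64))
cls = torch.randn(2, 3);          cls_gt = torch.randint(0, 3, (2,))
print(multitask_loss(seg, seg_gt, cls, cls_gt))
```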

其他(9篇)

【1】 Recommending best course of treatment based on similarities of prognostic markers 标题:基于预后标志物相似性推荐最佳治疗方案(注:所有作者贡献相同)

作者:Sudhanshu,Narinder Singh Punn,Sanjay Kumar Sonbhadra,Sonali Agarwal 机构:Agarwal, Indian Institute of Information Technology Allahabad, India 链接:https://arxiv.org/abs/2107.07500 摘要:随着技术在各个领域的进步,信息的大量涌入不可避免。在技术进步带来的所有机遇中,其中之一就是提出高效的数据检索解决方案。这意味着面对海量数据,检索方法应让用户能够持续获取相关且最新的数据。在娱乐和电子商务领域,推荐系统一直承担着上述功能。在医学领域使用同样的系统,必将在许多方面被证明是有用的。基于这一背景,本文提出了一种基于协同过滤的医疗领域推荐系统,用于根据患者的症状推荐治疗方案。此外,还构建了一个新的数据集,其中包含针对各种疾病的治疗方案,以缓解数据有限的问题。所提出的推荐系统接受患者的预后标志物作为输入,并生成最佳治疗方案。在多次实验中,所提模型在为给定预后标志物推荐可能的治疗方案方面取得了有希望的结果。 摘要:With the advancement in the technology sector spanning over every field, a huge influx of information is inevitable. Among all the opportunities that the advancements in the technology have brought, one of them is to propose efficient solutions for data retrieval. This means that from an enormous pile of data, the retrieval methods should allow the users to fetch the relevant and recent data over time. In the field of entertainment and e-commerce, recommender systems have been functioning to provide the aforementioned. Employing the same systems in the medical domain could definitely prove to be useful in variety of ways. Following this context, the goal of this paper is to propose collaborative filtering based recommender system in the healthcare sector to recommend remedies based on the symptoms experienced by the patients. Furthermore, a new dataset is developed consisting of remedies concerning various diseases to address the limited availability of the data. The proposed recommender system accepts the prognostic markers of a patient as the input and generates the best remedy course. With several experimental trials, the proposed model achieved promising results in recommending the possible remedy for given prognostic markers.
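基于用户的协同过滤在这一场景下的基本形态可以用几行代码说明:按预后标志物的余弦相似度找相似患者,再用其治疗记录加权打分。以下为示意性草图(数据与打分方式均为本文假设,并非论文官方实现)。

```python
# 示意性草图:基于患者相似度的协同过滤治疗推荐
import numpy as np

def recommend(markers, treatments, query, topk_users=3):
    # markers: (P, D) 各患者的预后标志物; treatments: (P, R) 患者-治疗评分矩阵
    sims = markers @ query / (np.linalg.norm(markers, axis=1) *
                              np.linalg.norm(query) + 1e-8)
    nn_idx = np.argsort(-sims)[:topk_users]
    # 用相似度加权近邻患者的治疗评分,得到查询患者的治疗打分
    scores = sims[nn_idx] @ treatments[nn_idx]
    return int(np.argmax(scores)), scores

M = np.random.rand(50, 10); T = np.random.rand(50, 6); q = np.random.rand(10)
best, scores = recommend(M, T, q)
print(best, scores)   # 打分最高的治疗方案及全部打分
```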

【2】 Convolutional Neural Bandit: Provable Algorithm for Visual-aware Advertising 标题:卷积神经Bandit:面向视觉感知广告的可证明算法

作者:Yikun Ban,Jingrui He 机构:University of Illinois at Urbana-Champaign 备注:23 pages, in submission 链接:https://arxiv.org/abs/2107.07438 摘要:网络广告在网络商务中无处不在。图像展示被认为是与客户交互最常用的形式之一。上下文多臂老虎机(contextual multi-armed bandit)已在广告应用中成功解决了推荐过程中的探索-利用困境。受视觉感知广告的启发,本文提出了一种上下文bandit算法,利用卷积神经网络(CNN)学习奖励函数,并结合置信上界(UCB)进行探索。我们还证明了当网络过参数化时一个近似最优的遗憾界$\tilde{\mathcal{O}}(\sqrt{T})$,并建立了与卷积神经正切核(CNTK)的紧密联系。最后,我们评估了所提算法的实验性能,并在真实图像数据集上证明它优于其他最先进的基于UCB的bandit算法。 摘要:Online advertising is ubiquitous in web business. Image displaying is considered as one of the most commonly used formats to interact with customers. Contextual multi-armed bandit has shown success in the application of advertising to solve the exploration-exploitation dilemma existed in the recommendation procedure. Inspired by the visual-aware advertising, in this paper, we propose a contextual bandit algorithm, where the convolutional neural network (CNN) is utilized to learn the reward function along with an upper confidence bound (UCB) for exploration. We also prove a near-optimal regret bound $\tilde{\mathcal{O}}(\sqrt{T})$ when the network is over-parameterized and establish strong connections with convolutional neural tangent kernel (CNTK). Finally, we evaluate the empirical performance of the proposed algorithm and show that it outperforms other state-of-the-art UCB-based bandit algorithms on real-world image data sets.
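"估计奖励 + 置信上界"的探索思想可以用经典LinUCB直观演示。注意:下面的草图把论文中基于CNN的UCB换成了固定特征上的LinUCB,以便自包含运行,并非论文算法本身;特征假定由CNN预先提取。

```python
# 示意性草图:固定特征上的 LinUCB(演示 UCB 探索,而非论文的 CNN-UCB 算法)
import numpy as np

d, alpha = 8, 1.0
A = np.eye(d)            # 岭回归协方差累积
b = np.zeros(d)

def select(arm_feats):
    theta = np.linalg.solve(A, b)
    # UCB 分数 = 估计奖励 + alpha * 不确定度(逐臂二次型)
    ucb = arm_feats @ theta + alpha * np.sqrt(
        np.einsum('ij,jk,ik->i', arm_feats, np.linalg.inv(A), arm_feats))
    return int(np.argmax(ucb))

def update(x, reward):
    global A, b
    A += np.outer(x, x); b += reward * x

arms = np.random.rand(5, d)   # 每个候选广告图像的特征(假设由 CNN 提取)
a = select(arms); update(arms[a], reward=np.random.rand())
print('选择的广告:', a)
```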

【3】 Framework for A Personalized Intelligent Assistant to Elderly People for Activities of Daily Living 标题:一种面向老年人日常生活活动的个性化智能助手框架

作者:Nirmalya Thakur,Chia Y. Han 机构:Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati 备注:None 链接:https://arxiv.org/abs/2107.07344 摘要:随着老年人口的不断增长,人们需要满足他们日益增长的需求,并提供能够提高他们在智能家居中生活质量的解决方案。除了对使用系统界面的恐惧和焦虑之外,随着年龄的增长,老年人往往会面临认知障碍、记忆力减退、行为紊乱甚至身体受限等问题。要提供以科技为基础的解决方案来满足老年人的这些需求,并为其创造智能辅助生活空间,关键在于开发能够适应用户多样性、并能围绕其日常目标增强其表现的系统。因此,本研究提出一个框架,用于开发个性化智能助手,以协助老年人在智能互联的物联网(IoT)环境中完成日常生活活动(ADL)。这种个性化智能助手可以分析用户执行的不同任务,并根据用户的日常作息、当前情感状态和潜在的用户体验来推荐活动。为了验证该框架的有效性,我们分别在一个普通用户和一个特定用户的数据集上进行了测试。结果表明,该模型在对特定用户建模时的准确率为73.12%,大大高于对普通用户建模时的性能,支持了该框架的开发和实现。 摘要:The increasing population of elderly people is associated with the need to meet their increasing requirements and to provide solutions that can improve their quality of life in a smart home. In addition to fear and anxiety towards interfacing with systems; cognitive disabilities, weakened memory, disorganized behavior and even physical limitations are some of the problems that elderly people tend to face with increasing age. The essence of providing technology-based solutions to address these needs of elderly people and to create smart and assisted living spaces for the elderly; lies in developing systems that can adapt by addressing their diversity and can augment their performances in the context of their day to day goals. Therefore, this work proposes a framework for development of a Personalized Intelligent Assistant to help elderly people perform Activities of Daily Living (ADLs) in a smart and connected Internet of Things (IoT) based environment. This Personalized Intelligent Assistant can analyze different tasks performed by the user and recommend activities by considering their daily routine, current affective state and the underlining user experience. To uphold the efficacy of this proposed framework, it has been tested on a couple of datasets for modelling an average user and a specific user respectively. The results presented show that the model achieves a performance accuracy of 73.12% when modelling a specific user, which is considerably higher than its performance while modelling an average user, this upholds the relevance for development and implementation of this proposed framework.

【4】 Deep Automatic Natural Image Matting 标题:深度自动自然图像抠图

作者:Jizhizi Li,Jing Zhang,Dacheng Tao 机构:The University of Sydney, Australia 备注:Accepted to IJCAI-21, code and dataset available at this https URL 链接:https://arxiv.org/abs/2107.07235 摘要:自动图像抠图(AIM)是指在没有trimap等辅助输入的情况下,从任意自然图像中估计软前景,这对图像编辑非常有用。以前的方法试图学习语义特征来辅助抠图过程,但仅限于具有显著不透明前景(如人和动物)的图像。在本文中,我们研究了将它们扩展到具有显著透明/精细前景或非显著前景的自然图像时的困难。为了解决这一问题,我们提出了一种新的端到端抠图网络,它可以为上述任一类型的图像预测广义trimap,作为统一的语义表示。同时,学习到的语义特征通过注意力机制引导抠图网络聚焦于过渡区域。我们还构建了一个测试集AIM-500,其中包含500张覆盖所有类型的多样化自然图像以及手工标注的alpha蒙版,使得评测AIM模型的泛化能力成为可能。实验结果表明,在现有合成抠图数据集上训练的我们的网络,在客观和主观评价上都优于现有方法。源代码和数据集位于https://github.com/JizhiziLi/AIM. 摘要:Automatic image matting (AIM) refers to estimating the soft foreground from an arbitrary natural image without any auxiliary input like trimap, which is useful for image editing. Prior methods try to learn semantic features to aid the matting process while being limited to images with salient opaque foregrounds such as humans and animals. In this paper, we investigate the difficulties when extending them to natural images with salient transparent/meticulous foregrounds or non-salient foregrounds. To address the problem, a novel end-to-end matting network is proposed, which can predict a generalized trimap for any image of the above types as a unified semantic representation. Simultaneously, the learned semantic features guide the matting network to focus on the transition areas via an attention mechanism. We also construct a test set AIM-500 that contains 500 diverse natural images covering all types along with manually labeled alpha mattes, making it feasible to benchmark the generalization ability of AIM models. Results of the experiments demonstrate that our network trained on available composite matting datasets outperforms existing methods both objectively and subjectively. The source code and dataset are available at https://github.com/JizhiziLi/AIM.

【5】 Incorporating Lambertian Priors into Surface Normals Measurement 标题:在曲面法线测量中引入Lambertian先验

作者:Yakun Ju,Muwei Jian,Shaoxiang Guo,Yingyu Wang,Huiyu Zhou,Junyu Dong 机构: Jian is with the School of Information Science and Engineering, Linyi University, and the School of Computer Science andTechnology, Shandong University of Finance and Economics 链接:https://arxiv.org/abs/2107.07192 摘要:光度立体的目标是从具有各种着色提示的观察中测量三维对象的精确表面法线。然而,非朗伯曲面由于不规则的着色线索而影响测量精度。尽管深度神经网络已经被用来模拟非朗伯曲面的性能,但是镜面反射、阴影和褶皱区域的误差很难减小。为了应对这一挑战,我们在此提出了一种结合朗伯先验的光度立体网络来更好地测量曲面法线。本文利用朗伯假设下的初始法线作为先验信息来细化法线测量,而不是仅仅利用观察到的阴影线索来推导曲面法线。该方法利用朗伯信息对网络权值进行重参数化,利用深度神经网络强大的拟合能力来修正由一般反射特性引起的误差。我们的探索包括:Lambertian先验(1)减少了学习假设空间,使我们的方法在同一个曲面法向空间中学习映射,提高了学习的准确性;(2)提供了差分特征学习,提高了曲面细节重构。在具有挑战性的基准数据集上,大量实验验证了所提出的朗伯先验光度立体网络在精确测量表面法线方面的有效性。 摘要:The goal of photometric stereo is to measure the precise surface normal of a 3D object from observations with various shading cues. However, non-Lambertian surfaces influence the measurement accuracy due to irregular shading cues. Despite deep neural networks have been employed to simulate the performance of non-Lambertian surfaces, the error in specularities, shadows, and crinkle regions is hard to be reduced. In order to address this challenge, we here propose a photometric stereo network that incorporates Lambertian priors to better measure the surface normal. In this paper, we use the initial normal under the Lambertian assumption as the prior information to refine the normal measurement, instead of solely applying the observed shading cues to deriving the surface normal. Our method utilizes the Lambertian information to reparameterize the network weights and the powerful fitting ability of deep neural networks to correct these errors caused by general reflectance properties. Our explorations include: the Lambertian priors (1) reduce the learning hypothesis space, making our method learn the mapping in the same surface normal space and improving the accuracy of learning, and (2) provides the differential features learning, improving the surfaces reconstruction of details. Extensive experiments verify the effectiveness of the proposed Lambertian prior photometric stereo network in accurate surface normal measurement, on the challenging benchmark dataset.
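该方法以"朗伯假设下的初始法线"作为先验。这一初始法线可由经典最小二乘光度立体直接求得:在朗伯模型 I = L·n 下,对观测强度做 L 的最小二乘解再单位化。下面是一个自包含的示意性草图(论文随后用网络在此先验上做细化;此处仅演示先验本身)。

```python
# 示意性草图:朗伯假设下的最小二乘光度立体,得到初始表面法线
import numpy as np

def lambertian_normals(L, I):
    # L: (K, 3) K 个光源方向; I: (K, P) P 个像素在各光照下的强度
    G = np.linalg.lstsq(L, I, rcond=None)[0]       # (3, P),即 albedo * normal
    n = G / (np.linalg.norm(G, axis=0, keepdims=True) + 1e-8)
    return n                                        # 单位化后的初始法线

L = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1]], dtype=float)
L /= np.linalg.norm(L, axis=1, keepdims=True)
true_n = np.array([[0.0], [0.0], [1.0]])
I = L @ true_n                                      # 朗伯模型 I = L n(albedo=1)
print(lambertian_normals(L, I).ravel())             # 约为 [0, 0, 1]
```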

【6】 What Image Features Boost Housing Market Predictions? 标题:哪些图像特征能提升房地产市场预测?

作者:Zona Kostic,Aleksandar Jevremovic 机构: Harvard University, Paulson School Of Engineering andApplied Sciences, Singidunum University 链接:https://arxiv.org/abs/2107.07148 摘要:房产的吸引力是最有趣但也最具挑战性的建模对象之一。图像特征被用来描述某些属性,并检验视觉因素对挂牌价格或挂牌时长的影响。在本文中,我们提出了一组提取视觉特征的技术,以便将其高效地以数值形式纳入当今的预测算法。我们讨论了香农熵、重心计算、图像分割和卷积神经网络等技术。通过在一组房产相关图像(室内、室外和卫星图像)上比较这些技术,我们得出以下结论:(i)熵是预测房价最有效的单值视觉度量;(ii)图像分割是预测房屋挂牌时长最重要的视觉特征;(iii)深度图像特征可用于量化室内特征,并有助于吸引力建模。这里选择的40个图像特征具有很强的预测能力,并优于一些最强的元数据预测因子。在无需在房地产估价过程中替代人类专家的前提下,我们认为本文提出的技术能够有效刻画可见特征,从而将感知吸引力作为一种定量度量引入住房预测建模。 摘要:The attractiveness of a property is one of the most interesting, yet challenging, categories to model. Image characteristics are used to describe certain attributes, and to examine the influence of visual factors on the price or timeframe of the listing. In this paper, we propose a set of techniques for the extraction of visual features for efficient numerical inclusion in modern-day predictive algorithms. We discuss techniques such as Shannon's entropy, calculating the center of gravity, employing image segmentation, and using Convolutional Neural Networks. After comparing these techniques as applied to a set of property-related images (indoor, outdoor, and satellite), we conclude the following: (i) the entropy is the most efficient single-digit visual measure for housing price prediction; (ii) image segmentation is the most important visual feature for the prediction of housing lifespan; and (iii) deep image features can be used to quantify interior characteristics and contribute to captivation modeling. The set of 40 image features selected here carries a significant amount of predictive power and outperforms some of the strongest metadata predictors. Without any need to replace a human expert in a real-estate appraisal process, we conclude that the techniques presented in this paper can efficiently describe visible characteristics, thus introducing perceived attractiveness as a quantitative measure into the predictive modeling of housing.
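文中最有效的单值视觉度量——灰度直方图的香农熵——计算非常简单。以下为示意性草图(直方图分桶等实现细节为常规做法;占位的随机图像仅供演示)。

```python
# 示意性草图:计算灰度图像直方图的香农熵,作为单一数值的视觉特征
import numpy as np

def shannon_entropy(gray, bins=256):
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                       # 跳过空桶,避免 log(0)
    return float(-(p * np.log2(p)).sum())

img = np.random.randint(0, 256, (480, 640))   # 占位:实际为房源照片的灰度图
print(shannon_entropy(img))                    # 该数值随后与房价一起进入预测模型
```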

【7】 Recurrent Parameter Generators 标题:递归参数生成器

作者:Jiayun Wang,Yubei Chen,Stella X. Yu,Brian Cheung,Yann LeCun 机构: UC Berkeley ICSI , Facebook AI Research, New York University , MIT CSAIL & BCS 链接:https://arxiv.org/abs/2107.07110 摘要:我们提出了一种通用的方法,对许多不同的卷积层循环使用相同的参数来构建深度网络。具体地说,对于一个网络,我们创建一个递归参数生成器(RPG),由它生成每个卷积层的参数。虽然使用递归模型来构建深度卷积神经网络(CNN)并不完全是新方法,但与现有工作相比,我们的方法取得了显著的性能提升。我们演示了如何构建一个单层神经网络,在各种应用和数据集上达到与其他传统CNN模型相当的性能。这种方法使我们可以用任意数量的参数构建任意复杂的神经网络。例如,我们构建了一个ResNet34,模型参数减少超过400倍,仍达到41.6%的ImageNet top-1精度。此外,我们还证明了RPG可以应用于不同的尺度,如层、块甚至子网络。具体来说,我们使用RPG构建了一个ResNet18网络,其权重数量相当于传统ResNet的一个卷积层,并表明该模型可以达到67.2%的ImageNet top-1精度。该方法可以看作模型压缩的逆过程:它的目的不是从大模型中删除未使用的参数,而是将更多的信息压缩到少量参数中。大量的实验结果证明了所提出的递归参数生成器的能力。 摘要:We present a generic method for recurrently using the same parameters for many different convolution layers to build a deep network. Specifically, for a network, we create a recurrent parameter generator (RPG), from which the parameters of each convolution layer are generated. Though using recurrent models to build a deep convolutional neural network (CNN) is not entirely new, our method achieves significant performance gain compared to the existing works. We demonstrate how to build a one-layer neural network to achieve similar performance compared to other traditional CNN models on various applications and datasets. Such a method allows us to build an arbitrarily complex neural network with any amount of parameters. For example, we build a ResNet34 with model parameters reduced by more than 400 times, which still achieves 41.6% ImageNet top-1 accuracy. Furthermore, we demonstrate the RPG can be applied at different scales, such as layers, blocks, or even sub-networks. Specifically, we use the RPG to build a ResNet18 network with the number of weights equivalent to one convolutional layer of a conventional ResNet and show this model can achieve 67.2% ImageNet top-1 accuracy. The proposed method can be viewed as an inverse approach to model compression. Rather than removing the unused parameters from a large model, it aims to squeeze more information into a small number of parameters. Extensive experiment results are provided to demonstrate the power of the proposed recurrent parameter generator.
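"多个卷积层从同一共享参数源生成权重"的核心思想可以用如下示意性草图说明(此处用"按不同偏移从共享参数环取切片"来演示;这种取参方式为本文假设,并非论文官方生成器实现)。

```python
# 示意性草图:多层卷积共享同一参数环,按偏移生成各层权重
import torch
import torch.nn.functional as F

bank = torch.randn(4096, requires_grad=True)   # 共享参数环

def gen_conv_weight(offset, shape):
    n = int(torch.tensor(shape).prod())
    idx = (offset + torch.arange(n)) % bank.numel()   # 环形取 n 个参数
    return bank[idx].view(*shape)

x = torch.randn(1, 3, 32, 32)
h = F.conv2d(x, gen_conv_weight(0,    (16, 3, 3, 3)),  padding=1)
h = F.conv2d(h, gen_conv_weight(1000, (16, 16, 3, 3)), padding=1)
print(h.shape)   # 两层卷积的权重都来自同一参数环,反向传播时梯度累加到 bank
```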

【8】 A Generalized Framework for Edge-preserving and Structure-preserving Image Smoothing 标题:一种广义的保边缘保结构图像平滑框架

作者:Wei Liu,Pingping Zhang,Yinjie Lei,Xiaolin Huang,Jie Yang,Michael Ng 机构:cn•Yinjie Lei is with the School of Electronics and Information Engineering, Sichuan University 备注:This work is accepted by TPAMI. The code is available at this https URL arXiv admin note: substantial text overlap with arXiv:1907.09642 链接:https://arxiv.org/abs/2107.07058 摘要:图像平滑是计算机视觉和图形学应用中的一个基本过程。在不同的任务中,所需的平滑特性可能不同,甚至相互矛盾。然而,一个平滑算子固有的平滑性质通常是固定的,因此不能满足不同应用的各种要求。本文首先介绍了截尾Huber罚函数,它在不同的参数设置下表现出很强的灵活性。在此基础上,引入截断Huber罚函数,提出了一个广义框架。结合其强大的灵活性,我们的框架能够实现不同的平滑性质,甚至可以实现矛盾的平滑行为。它还可以产生以前的方法很难实现的平滑行为,因此在具有挑战性的情况下可以获得优异的性能。这些共同使得我们的框架能够应用于一系列应用,并且能够在多个任务中优于最先进的方法,例如图像细节增强、剪贴画压缩伪影去除、引导深度图恢复、图像纹理去除等,在非凸非光滑的优化框架下,给出了一个有效的数值解,并从理论上保证了其收敛性。进一步提出了一种简单而有效的方法,在保持算法性能的同时降低了算法的计算量。通过一系列应用的综合实验,验证了该方法的有效性和优越的性能。我们的代码在https://github.com/wliusjtu/Generalized-Smoothing-Framework. 摘要:Image smoothing is a fundamental procedure in applications of both computer vision and graphics. The required smoothing properties can be different or even contradictive among different tasks. Nevertheless, the inherent smoothing nature of one smoothing operator is usually fixed and thus cannot meet the various requirements of different applications. In this paper, we first introduce the truncated Huber penalty function which shows strong flexibility under different parameter settings. A generalized framework is then proposed with the introduced truncated Huber penalty function. When combined with its strong flexibility, our framework is able to achieve diverse smoothing natures where contradictive smoothing behaviors can even be achieved. It can also yield the smoothing behavior that can seldom be achieved by previous methods, and superior performance is thus achieved in challenging cases. These together enable our framework capable of a range of applications and able to outperform the state-of-the-art approaches in several tasks, such as image detail enhancement, clip-art compression artifacts removal, guided depth map restoration, image texture removal, etc. In addition, an efficient numerical solution is provided and its convergence is theoretically guaranteed even the optimization framework is non-convex and non-smooth. A simple yet effective approach is further proposed to reduce the computational cost of our method while maintaining its performance. The effectiveness and superior performance of our approach are validated through comprehensive experiments in a range of applications. Our code is available at https://github.com/wliusjtu/Generalized-Smoothing-Framework.
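截断Huber罚函数的灵活性来自"小差异二次、中等差异线性、大差异封顶"的分段行为。下面给出其一种常见构造的示意性草图(Huber罚与常数阈值取最小;具体参数化为本文假设,可能与论文的精确定义不同)。

```python
# 示意性草图:截断 Huber 罚函数的一种常见构造
import numpy as np

def truncated_huber(x, delta=1.0, t=2.0):
    ax = np.abs(x)
    huber = np.where(ax <= delta, 0.5 * ax**2, delta * (ax - 0.5 * delta))
    return np.minimum(huber, t)   # 超过阈值 t 后罚值封顶(截断),不再惩罚大差异

x = np.linspace(-5, 5, 11)
print(truncated_huber(x))   # 封顶意味着强边缘处的差异被保留,从而保边/保结构
```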

【9】 FastSHAP: Real-Time Shapley Value Estimation 标题:FastSHAP:Shapley值的实时估计

作者:Neil Jethani,Mukund Sudarshan,Ian Covert,Su-In Lee,Rajesh Ranganath 机构:New York University, University of Washington 备注:20 pages, 10 figures, 3 tables 链接:https://arxiv.org/abs/2107.07436 摘要:Shapley值被广泛用于解释黑盒模型,但其计算代价很高,因为需要多次模型评估。我们介绍了FastSHAP,一种利用学习得到的解释器模型在单次前向传播中估计Shapley值的方法。FastSHAP通过受Shapley值加权最小二乘刻画启发的学习方法,分摊了为大量输入生成解释的成本,并且可以使用标准随机梯度优化进行训练。我们将FastSHAP与现有估计方法进行了比较,结果表明它能生成高质量的解释,并带来数量级的加速。 摘要:Shapley values are widely used to explain black-box models, but they are costly to calculate because they require many model evaluations. We introduce FastSHAP, a method for estimating Shapley values in a single forward pass using a learned explainer model. FastSHAP amortizes the cost of explaining many inputs via a learning approach inspired by the Shapley value's weighted least squares characterization, and it can be trained using standard stochastic gradient optimization. We compare FastSHAP to existing estimation approaches, revealing that it generates high-quality explanations with orders of magnitude speedup.
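FastSHAP所依据的"加权最小二乘刻画"可以直接写出来:对所有特征子集按Shapley核加权的最小二乘目标,其最优解恰是Shapley值。下面的示意性草图枚举全部子集以展示该精确目标(FastSHAP则用子集采样加随机梯度来训练解释器网络,分摊这一代价;玩具价值函数 v 为本文假设)。

```python
# 示意性草图:Shapley 值的加权最小二乘刻画(KernelSHAP 核)
import math, itertools, numpy as np

def shapley_kernel_weight(d, s):
    # 核权重:w(S) = (d-1) / (C(d,s) * s * (d-s)),s 为子集大小
    return (d - 1) / (math.comb(d, s) * s * (d - s))

def weighted_ls_loss(phi, v, d):
    # phi: (d,) 解释器输出的 Shapley 值估计; v: 接受 0/1 特征掩码的价值函数
    loss = 0.0
    for s in range(1, d):
        for S in itertools.combinations(range(d), s):
            m = np.zeros(d); m[list(S)] = 1
            loss += shapley_kernel_weight(d, s) * (v(m) - v(np.zeros(d)) - phi @ m) ** 2
    return loss   # 使该目标最小的 phi 即为精确 Shapley 值

v = lambda m: float(m @ np.array([1.0, 2.0, 3.0]))           # 线性玩具价值函数
print(weighted_ls_loss(np.array([1.0, 2.0, 3.0]), v, d=3))   # 对精确解损失约为 0
```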
