Computer Vision arXiv Daily Digest [7.23]

2021-07-27 11:15:42

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and posting features.

cs.CV: 33 papers today

Transformer (1 paper)

【1】 Query2Label: A Simple Transformer Way to Multi-Label Classification

Authors: Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, Jun Zhu
Affiliations: Dept. of Comp. Sci. and Tech., BNRist Center, Institute for AI, Tsinghua-Bosch Joint ML Center, Tsinghua University, Beijing, China; International Digital Economy Academy
Link: https://arxiv.org/abs/2107.10834
Abstract: This paper presents a simple and effective approach to solving the multi-label classification problem. The proposed approach leverages Transformer decoders to query the existence of a class label. The use of Transformer is rooted in the need to extract local discriminative features adaptively for different labels, which is a strongly desired property due to the existence of multiple objects in one image. The built-in cross-attention module in the Transformer decoder offers an effective way to use label embeddings as queries to probe and pool class-related features from a feature map computed by a vision backbone for subsequent binary classifications. Compared with prior works, the new framework is simple, using standard Transformers and vision backbones, and effective, consistently outperforming all previous works on five multi-label classification data sets, including MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome. In particular, we establish 91.3% mAP on MS-COCO. We hope its compact structure, simple implementation, and superior performance serve as a strong baseline for multi-label classification tasks and future studies. The code will be available soon at https://github.com/SlongLiu/query2labels.
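The label-embeddings-as-queries idea is compact enough to sketch. Below is a minimal PyTorch illustration of learnable label embeddings querying backbone features through a Transformer decoder, with one binary classifier per pooled label feature; all sizes and module choices are illustrative assumptions, not the authors' released implementation (see their repository for that).

```python
import torch
import torch.nn as nn

class Query2LabelSketch(nn.Module):
    """Minimal sketch: label embeddings query backbone features via a
    Transformer decoder; each pooled per-label feature feeds its own
    binary classifier. Sizes here are illustrative assumptions."""
    def __init__(self, num_classes=80, dim=256, num_layers=2):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_classes, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.classifiers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, feat_map):                # (B, dim, H, W) from a backbone
        B = feat_map.size(0)
        mem = feat_map.flatten(2).permute(2, 0, 1)              # (HW, B, dim)
        q = self.label_queries.unsqueeze(1).expand(-1, B, -1)   # (K, B, dim)
        pooled = self.decoder(q, mem)                           # (K, B, dim)
        logits = (pooled * self.classifiers.unsqueeze(1)).sum(-1)
        return logits.t()                       # (B, K) per-label binary logits

print(Query2LabelSketch()(torch.randn(2, 256, 7, 7)).shape)  # torch.Size([2, 80])
```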

Detection (1 paper)

【1】 Unsupervised Detection of Adversarial Examples with Model Explanations

Authors: Gihyuk Ko, Gyumin Lim
Affiliations: Carnegie Mellon University, Pittsburgh, USA; CSRC, KAIST, Daejeon, South Korea
Note: AdvML@KDD'21
Link: https://arxiv.org/abs/2107.10480
Abstract: Deep Neural Networks (DNNs) have shown remarkable performance in a diverse range of machine learning applications. However, it is widely known that DNNs are vulnerable to simple adversarial perturbations, which cause the model to incorrectly classify inputs. In this paper, we propose a simple yet effective method to detect adversarial examples, using methods developed to explain the model's behavior. Our key observation is that adding small, humanly imperceptible perturbations can lead to drastic changes in the model explanations, resulting in unusual or irregular forms of explanations. From this insight, we propose an unsupervised detection of adversarial examples using reconstructor networks trained only on model explanations of benign examples. Our evaluations with the MNIST handwritten digit dataset show that our method is capable of detecting adversarial examples generated by state-of-the-art algorithms with high confidence. To the best of our knowledge, this work is the first to suggest an unsupervised defense method using model explanations.
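The detection rule itself is compact: train a reconstructor network only on explanations of benign inputs, then flag inputs whose explanations reconstruct poorly. The sketch below assumes the explanation map is precomputed and the threshold is tuned on benign validation data; both are illustrative assumptions, not the paper's exact setup.

```python
import torch

def is_adversarial(explanation, reconstructor, threshold=0.05):
    """Hedged sketch: flag an input whose model explanation (e.g., a
    saliency map) has a reconstruction error above a threshold chosen
    on benign explanations. The threshold value is illustrative."""
    with torch.no_grad():
        err = torch.mean((reconstructor(explanation) - explanation) ** 2)
    return err.item() > threshold
```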

Classification / Recognition (2 papers)

【1】 EAN: Event Adaptive Network for Enhanced Action Recognition

Authors: Yuan Tian, Yichao Yan, Xiongkuo Min, Guo Lu, Guangtao Zhai, Guodong Guo, Zhiyong Gao
Note: Submitted to TIP. Code is available at: this https URL
Link: https://arxiv.org/abs/2107.10771
Abstract: Efficiently modeling spatial-temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales, thus struggling with events of various scales. On the other hand, the dense interaction modeling paradigm only achieves sub-optimal performance, as action-irrelevant parts bring additional noise into the final prediction. In this paper, we propose a unified action recognition framework to investigate the dynamic nature of video content with the following designs. First, when extracting local cues, we generate spatial-temporal kernels of dynamic scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer, which yields a sparse paradigm. We call the proposed framework the Event Adaptive Network (EAN) because both key designs are adaptive to the input video content. To exploit the short-term motions within local segments, we propose a novel and efficient Latent Motion Code (LMC) module, further improving the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performance at low FLOPs. Code is available at: https://github.com/tianyuan168326/EAN-Pytorch.
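The sparse interaction design is easy to sketch on its own: score tokens for foreground-ness, keep the top-k, and let only those interact through self-attention instead of a dense non-local block. The scoring head, k, and layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseInteractionSketch(nn.Module):
    """Hedged sketch of the sparse paradigm: only the top-k foreground
    tokens interact via self-attention."""
    def __init__(self, dim=256, k=8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)        # foreground-ness score per token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, tokens):                # tokens: (B, N, dim)
        s = self.score(tokens).squeeze(-1)    # (B, N)
        idx = s.topk(self.k, dim=1).indices   # (B, k) selected foreground tokens
        picked = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return self.encoder(picked)           # (B, k, dim)
```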

【2】 Rule-Based Classification of Hyperspectral Imaging Data

Authors: Songuel Polat, Alain Tremeau, Frank Boochs
Affiliations: i3mainz, Institute for Spatial Information and Surveying Technology, Mainz University of Applied Sciences, Lucy-Hillebrand-Str.; Hubert Curien Laboratory, University Jean Monnet, University of Lyon, Rue Professeur Benoît Lauras, Saint-Etienne, France
Link: https://arxiv.org/abs/2107.10638
Abstract: Due to its high spatial and spectral information content, hyperspectral imaging opens up new possibilities for a better understanding of data and scenes in a wide variety of applications. An essential part of this understanding process is classification. In this article we present a general classification approach based on the shape of spectral signatures. In contrast to classical classification approaches (e.g., SVM, KNN), not only reflectance values are considered, but also parameters such as curvature points, curvature values, and the curvature behavior of spectral signatures are used to develop shape-describing rules, which are applied for classification by a rule-based procedure using IF-THEN queries. The flexibility and efficiency of the methodology are demonstrated on datasets from two different application fields, leading to convincing results with good performance.
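As a concrete illustration of the IF-THEN idea, the sketch below derives crude shape features (a discrete curvature proxy and an inflection count) from a spectral signature and applies hand-written rules. The thresholds, band ranges, and class names are invented for illustration and are not the paper's rules.

```python
import numpy as np

def classify_signature(signature):
    """Hedged sketch of rule-based classification on spectral shape."""
    s = np.asarray(signature, dtype=float)
    d2 = np.diff(s, n=2)                          # discrete curvature proxy
    inflections = int(np.sum(np.diff(np.sign(d2)) != 0))
    # IF a strong convex bend occurs in the early bands THEN class A ...
    if d2[:20].max() > 0.05 and inflections <= 2:
        return "class_A"
    # IF the signature oscillates and is bright overall THEN class B ...
    if s.mean() > 0.4 and inflections > 4:
        return "class_B"
    return "unclassified"
```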

Segmentation / Semantics (3 papers)

【1】 Semantic Text-to-Face GAN - ST^2FG

Authors: Manan Oza, Sukalpa Chanda, David Doermann
Affiliations: Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY; Department of Computer Science and Communication, Østfold University College, Halden, Norway
Note: arXiv admin note: text overlap with arXiv:2010.12136 by other authors
Link: https://arxiv.org/abs/2107.10756
Abstract: Faces generated using generative adversarial networks (GANs) have reached unprecedented realism. These faces, also known as "Deep Fakes", appear as realistic photographs with very little pixel-level distortion. While some work has enabled the training of models that generate specific properties of the subject, generating a facial image based on a natural language description has not been fully explored. For security and criminal identification, a GAN-based system that works like a sketch artist would be incredibly useful. In this paper, we present a novel approach to generate facial images from semantic text descriptions. The learned model is provided with a text description and an outline of the type of face, which the model uses to sketch the features. Our models are trained using an Affine Combination Module (ACM) mechanism to combine the text embedding from BERT and the GAN latent space using a self-attention matrix. This avoids the loss of features due to inadequate "attention", which may happen if the text embedding and latent vector are simply concatenated. Our approach is capable of generating images that are very accurately aligned to exhaustive textual descriptions of faces with many fine facial details, and helps in generating better images. The proposed method is also capable of making incremental changes to a previously generated image given additional textual descriptions or sentences.
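A rough sketch of the fusion step: attend from the GAN latent vector over BERT token embeddings and combine the result with the latent affinely, instead of concatenating. The dimensions, attention shape, and affine form below are assumptions inferred from the abstract, not the paper's exact ACM.

```python
import torch
import torch.nn as nn

class AffineCombinationSketch(nn.Module):
    """Hedged sketch: fuse a BERT text embedding with a GAN latent vector
    via attention-derived affine parameters rather than concatenation."""
    def __init__(self, text_dim=768, z_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=z_dim, num_heads=8,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.to_gamma = nn.Linear(z_dim, z_dim)   # scale
        self.to_beta = nn.Linear(z_dim, z_dim)    # shift

    def forward(self, z, text_tokens):    # z: (B, z_dim); text: (B, T, text_dim)
        ctx, _ = self.attn(z.unsqueeze(1), text_tokens, text_tokens)
        ctx = ctx.squeeze(1)
        return self.to_gamma(ctx) * z + self.to_beta(ctx)  # affine combination
```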

【2】 Segmentation of Cardiac Structures via Successive Subspace Learning with Saab Transform from Cine MRI

Authors: Xiaofeng Liu, Fangxu Xing, Hanna K. Gaggin, Weichung Wang, C.-C. Jay Kuo, Georges El Fakhri, Jonghye Woo
Affiliations: W. Wang is with the Institute of Applied Mathematical Sciences, National Taiwan University
Note: 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2021)
Link: https://arxiv.org/abs/2107.10718
Abstract: Assessment of cardiovascular disease (CVD) with cine magnetic resonance imaging (MRI) has been used to non-invasively evaluate detailed cardiac structure and function. Accurate segmentation of cardiac structures from cine MRI is a crucial step for early diagnosis and prognosis of CVD, and has been greatly improved with convolutional neural networks (CNNs). There are, however, a number of limitations identified in CNN models, such as limited interpretability and high complexity, thus limiting their use in clinical practice. In this work, to address these limitations, we propose a lightweight and interpretable machine learning model, successive subspace learning with the subspace approximation with adjusted bias (Saab) transform, for accurate and efficient segmentation from cine MRI. Specifically, our segmentation framework comprises the following steps: (1) sequential expansion of near-to-far neighborhoods at different resolutions; (2) channel-wise subspace approximation using the Saab transform for unsupervised dimension reduction; (3) class-wise entropy-guided feature selection for supervised dimension reduction; (4) concatenation of features and pixel-wise classification with gradient boosting; and (5) a conditional random field for post-processing. Experimental results on the ACDC 2017 segmentation database showed that our framework performed better than state-of-the-art U-Net models with 200× fewer parameters in delineating the left ventricle, right ventricle, and myocardium, thus showing its potential to be used in clinical practice.
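A rough skeleton of steps (2)-(5), with PCA standing in for the Saab transform, variance standing in for entropy-guided selection, and the CRF step left as a comment; every component choice here is an assumption for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

def saab_pipeline_sketch(features, labels):
    """Hedged sketch; `features` are per-pixel multi-resolution neighborhood
    descriptors from step (1), shaped (N, D)."""
    feats = PCA(n_components=32).fit_transform(features)   # (2) PCA ~ Saab
    keep = np.argsort(feats.var(axis=0))[-16:]             # (3) proxy selection
    clf = GradientBoostingClassifier().fit(feats[:, keep], labels)  # (4)
    return clf  # (5) a conditional random field would refine its predictions
```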

【3】 A Deep Learning-based Quality Assessment and Segmentation System with a Large-scale Benchmark Dataset for Optical Coherence Tomographic Angiography Image

Authors: Yufei Wang, Yiqing Shen, Meng Yuan, Jing Xu, Bin Yang, Chi Liu, Wenjia Cai, Weijing Cheng, Wei Wang
Affiliations: State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China; Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences; State Key Laboratory of Hydraulics
Link: https://arxiv.org/abs/2107.10476
Abstract: Optical Coherence Tomography Angiography (OCTA) is a non-invasive and non-contacting imaging technique providing visualization of the microvasculature of the retina and optic nerve head in human eyes in vivo. Adequate image quality of OCTA is the prerequisite for subsequent quantification of the retinal microvasculature. Traditionally, an image quality score based on signal strength is used to discriminate low quality. However, it is insufficient for identifying artefacts such as motion and off-centration, which rely on specialized knowledge and need tedious and time-consuming manual identification. One of the primary issues in OCTA analysis is to sort out the foveal avascular zone (FAZ) region in the retina, which correlates highly with visual acuity diseases. However, variations in OCTA visual quality affect the performance of deep learning in downstream tasks. Moreover, filtering low-quality OCTA images out is both labor-intensive and time-consuming. To address these issues, we develop an automated computer-aided OCTA image processing system using deep neural networks as the classifier and segmentor to help ophthalmologists in clinical diagnosis and research. This system can be an assistive tool as it can process OCTA images of different formats to assess the quality and segment the FAZ area. The source code is freely available at https://github.com/shanzha09/COIPS.git. Another major contribution is the large-scale OCTA dataset, namely OCTA-25K-IQA-SEG, which we publicize for performance evaluation. It comprises four subsets, namely sOCTA-3×3-10k, sOCTA-6×6-14k, sOCTA-3×3-1.1k-seg, and dOCTA-6×6-1.1k-seg, containing a total of 25,665 images. The large-scale OCTA dataset is available at https://doi.org/10.5281/zenodo.5111975 and https://doi.org/10.5281/zenodo.5111972.
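The system's control flow reduces to a quality gate followed by FAZ segmentation. The sketch below assumes a graded quality classifier and a pass threshold; both the grade scale and the threshold are illustrative assumptions, not the paper's exact pipeline.

```python
def process_octa(image, quality_model, faz_segmenter, min_grade=2):
    """Hedged sketch: grade OCTA image quality first, and segment the FAZ
    only on images of acceptable quality; low-quality images are filtered
    out instead of being segmented."""
    grade = quality_model(image)   # e.g., 0 = ungradable, 1 = fair, 2 = good
    if grade < min_grade:
        return {"grade": grade, "faz_mask": None}
    return {"grade": grade, "faz_mask": faz_segmenter(image)}
```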

Zero/Few-Shot | Transfer | Domain Adaptation | Adaptation (4 papers)

【1】 Adaptive Dilated Convolution For Human Pose Estimation

Authors: Zhengxiong Luo, Zhicheng Wang, Yan Huang, Liang Wang, Tieniu Tan, Erjin Zhou
Affiliations: Megvii Inc.; University of Chinese Academy of Sciences (UCAS); Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA)
Link: https://arxiv.org/abs/2107.10477
Abstract: Most existing human pose estimation (HPE) methods exploit multi-scale information by fusing feature maps of four different spatial sizes, i.e., 1/4, 1/8, 1/16, and 1/32 of the input image. There are two drawbacks to this strategy: 1) feature maps of different spatial sizes may not be well aligned spatially, which potentially hurts the accuracy of keypoint localization; 2) these scales are fixed and inflexible, which may restrict generalization over various human sizes. Towards these issues, we propose an adaptive dilated convolution (ADC). It can generate and fuse multi-scale features of the same spatial size by setting different dilation rates for different channels. More importantly, these dilation rates are generated by a regression module. This enables ADC to adaptively adjust the fused scales and thus generalize better to various human sizes. ADC can be trained end-to-end and easily plugged into existing methods. Extensive experiments show that ADC brings consistent improvements to various HPE methods. The source code will be released for further research.
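A simplified sketch: a small regression head predicts fusion weights over a set of candidate dilation rates, and the same-resolution multi-dilation features are combined accordingly. ADC itself regresses continuous per-channel dilation rates; the discretized, per-branch version below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class AdaptiveDilationSketch(nn.Module):
    """Hedged sketch: fuse same-resolution features from several dilation
    rates, weighted by a regression module conditioned on the input."""
    def __init__(self, ch=64, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates)
        self.regress = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch, len(rates)), nn.Softmax(dim=1))

    def forward(self, x):                          # x: (B, ch, H, W)
        w = self.regress(x)                        # (B, num_rates)
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # (B,R,ch,H,W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)
```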

【2】 CogSense: A Cognitively Inspired Framework for Perception Adaptation

Authors: Hyukseong Kwon, Amir Rahimi, Kevin G. Lee, Amit Agarwal, Rajan Bhattacharyya
Affiliations: HRL Laboratories, LLC, Malibu Canyon Road, Malibu, CA, USA
Link: https://arxiv.org/abs/2107.10456
Abstract: This paper proposes the CogSense system, which is inspired by sense-making cognition and perception in the mammalian brain, to perform perception error detection and perception parameter adaptation using probabilistic signal temporal logic. As a specific application, a contrast-based perception adaptation method is presented and validated. The proposed method evaluates perception errors using heterogeneous probe functions computed from the detected objects and subsequently solves a contrast optimization problem to correct perception errors. The CogSense probe functions utilize characteristics of the geometry, dynamics, and detected-blob image quality of the objects to develop axioms in a probabilistic signal temporal logic framework. By evaluating these axioms, we can formally verify whether the detections are valid or erroneous. Further, using the CogSense axioms, we generate probabilistic signal temporal logic-based constraints to finally solve the contrast-based optimization problem to reduce false positives and false negatives.
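One axiom check can be sketched as a bound on how often a probe function stays in a valid range over a time window; the bounds and probability threshold below are illustrative assumptions, not the paper's axioms.

```python
def axiom_satisfied(probe_values, low, high, min_prob=0.9):
    """Hedged sketch of one probabilistic signal-temporal-logic style check:
    a probe function (e.g., a detected blob's aspect ratio) should stay
    within [low, high] with probability >= min_prob over a time window."""
    inside = [low <= v <= high for v in probe_values]
    return sum(inside) / len(probe_values) >= min_prob
```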

【3】 DeepScale: An Online Frame Size Adaptation Framework to Accelerate Visual Multi-object Tracking

Authors: Keivan Nalaie, Rong Zheng
Affiliations: McMaster University
Link: https://arxiv.org/abs/2107.10404
Abstract: In surveillance and search-and-rescue applications, it is important to perform multi-target tracking (MOT) in real time on low-end devices. Today's MOT solutions employ deep neural networks, which tend to have high computation complexity. Recognizing the effects of frame size on tracking performance, we propose DeepScale, a model-agnostic frame size selection approach that operates on top of existing fully convolutional network-based trackers to accelerate tracking throughput. In the training stage, we incorporate detectability scores into a one-shot tracker architecture so that DeepScale can learn representation estimations for different frame sizes in a self-supervised manner. During inference, based on user-controlled parameters, it can find a suitable trade-off between tracking accuracy and speed by adapting frame sizes at run time. Extensive experiments and benchmark tests on MOT datasets demonstrate the effectiveness and flexibility of DeepScale. Compared to a state-of-the-art tracker, a variant of DeepScale achieves a 1.57x speed-up with only moderate degradation (~2.4) in tracking accuracy on the MOT15 dataset in one configuration.
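The run-time adaptation amounts to picking the frame size that balances estimated accuracy against estimated latency under a user-controlled weight; the scoring form and the numbers below are illustrative assumptions, not DeepScale's actual estimator.

```python
def select_frame_size(sizes, acc_est, lat_est, alpha=0.5):
    """Hedged sketch: choose the frame size maximizing a user-weighted
    trade-off between estimated accuracy and speed (1 / latency)."""
    return max(sizes, key=lambda s: alpha * acc_est[s] + (1 - alpha) / lat_est[s])

# Illustrative numbers only: accuracy estimates and per-frame latencies (ms).
best = select_frame_size(
    [1088, 864, 704], acc_est={1088: 0.60, 864: 0.58, 704: 0.55},
    lat_est={1088: 90.0, 864: 60.0, 704: 45.0})
```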

【4】 Self-transfer learning via patches: A prostate cancer triage approach based on bi-parametric MRI

Authors: Alvaro Fernandez-Quilez, Trygve Eftestøl, Morten Goodwin, Svein Reidar Kjosavik, Ketil Oppedal
Affiliations: Department of Quality and Health Technology, University of Stavanger, Norway; Stavanger Medical Imaging Laboratory (SMIL), Department of Radiology, Stavanger University Hospital, Norway
Note: 13 pages. Article under review
Link: https://arxiv.org/abs/2107.10806
Abstract: Prostate cancer (PCa) is the second most common cancer diagnosed among men worldwide. The current PCa diagnostic pathway comes at the cost of substantial overdiagnosis, leading to unnecessary treatment and further testing. Bi-parametric magnetic resonance imaging (bp-MRI) based on apparent diffusion coefficient (ADC) maps and T2-weighted (T2w) sequences has been proposed as a triage test to differentiate between clinically significant (cS) and non-clinically significant (ncS) prostate lesions. However, analysis of the sequences relies on expertise, requires specialized training, and suffers from inter-observer variability. Deep learning (DL) techniques hold promise in tasks such as classification and detection. Nevertheless, they rely on large amounts of annotated data, which is not common in the medical field. In order to palliate such issues, existing works rely on transfer learning (TL) and ImageNet pre-training, which has been proven to be sub-optimal for the medical imaging domain. In this paper, we present a patch-based pre-training strategy to distinguish between cS and ncS lesions, which exploits the region of interest (ROI) of the patched source domain to efficiently train, via TL, a classifier in the full-slice target domain that does not require annotations there. We provide a comprehensive comparison between several CNN architectures and different settings, which are presented as a baseline. Moreover, we explore cross-domain TL, which exploits both MRI modalities and improves single-modality results. Finally, we show how our approaches outperform the standard approaches by a considerable margin.

Semi/Weakly/Unsupervised | Active Learning | Uncertainty (1 paper)

【1】 Triplet is All You Need with Random Mappings for Unsupervised Visual Representation Learning

Authors: Wenbin Li, Xuesong Yang, Meihao Kong, Lei Wang, Jing Huo, Yang Gao, Jiebo Luo
Affiliations: Nanjing University, China; University of Wollongong, Australia; University of Rochester, USA
Note: 12 pages
Link: https://arxiv.org/abs/2107.10419
Abstract: Contrastive self-supervised learning (SSL) has achieved great success in unsupervised visual representation learning by maximizing the similarity between two augmented views of the same image (positive pairs) while contrasting other, different images (negative pairs). However, this type of method, such as SimCLR and MoCo, relies heavily on a large number of negative pairs and thus requires either large batches or memory banks. In contrast, some recent non-contrastive SSL methods, such as BYOL and SimSiam, attempt to discard negative pairs by introducing asymmetry and show remarkable performance. Unfortunately, to avoid collapsed solutions caused by not using negative pairs, these methods require sophisticated asymmetry designs. In this paper, we argue that negative pairs are still necessary, but one is sufficient, i.e., triplet is all you need. A simple triplet-based loss can achieve surprisingly good performance without requiring large batches or asymmetry. Moreover, we observe that unsupervised visual representation learning can gain significantly from randomness. Based on this observation, we propose a simple plug-in RandOm MApping (ROMA) strategy: randomly map samples into other spaces and enforce these randomly projected samples to satisfy the same correlation requirement. The proposed ROMA strategy not only achieves state-of-the-art performance in conjunction with the triplet-based loss, but can also further boost other SSL methods.
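Both ingredients fit in a few lines: a single-negative triplet loss on normalized embeddings, plus the same loss recomputed after a shared random projection (the ROMA idea of enforcing the same requirement in a randomly mapped space). The margin and projection size below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_roma_loss(anchor, positive, negative, margin=0.2, proj_dim=128):
    """Hedged sketch: triplet loss with one negative, applied both in the
    original embedding space and after a shared random projection."""
    def trip(a, p, n):
        a, p, n = (F.normalize(x, dim=1) for x in (a, p, n))
        return F.triplet_margin_loss(a, p, n, margin=margin)
    loss = trip(anchor, positive, negative)
    R = torch.randn(anchor.size(1), proj_dim, device=anchor.device)  # random map
    return loss + trip(anchor @ R, positive @ R, negative @ R)
```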

Temporal | Action Recognition | Pose | Video | Motion Estimation (6 papers)

【1】 DOVE: Learning Deformable 3D Objects by Watching Videos

Authors: Shangzhe Wu, Tomas Jakab, Christian Rupprecht, Andrea Vedaldi
Affiliations: Visual Geometry Group, University of Oxford (project page: dove3d.github.io)
Note: Project Page: this https URL
Link: https://arxiv.org/abs/2107.10844
Abstract: Learning deformable 3D objects from 2D images is an extremely ill-posed problem. Existing methods rely on explicit supervision to establish multi-view correspondences, such as template shape models and keypoint annotations, which restricts their applicability to objects "in the wild". In this paper, we propose to use monocular videos, which naturally provide correspondences across time, allowing us to learn 3D shapes of deformable object categories without explicit keypoints or template shapes. Specifically, we present DOVE, which learns to predict the 3D canonical shape, deformation, viewpoint, and texture from a single 2D image of a bird, given a bird video collection as well as automatically obtained silhouettes and optical flows as training data. Our method reconstructs temporally consistent 3D shape and deformation, which allows us to animate and re-render the bird from arbitrary viewpoints from a single image.

【2】 Deep 3D-CNN for Depression Diagnosis with Facial Video Recording of Self-Rating Depression Scale Questionnaire

Authors: Wanqing Xie, Lizhong Liang, Yao Lu, Hui Luo, Xiaofeng Liu
Affiliations: Wanqing Xie is with Harbin Engineering University and Suzhou Fanhan Information Technology Company; Lizhong Liang and Yao Lu are with Sun Yat-sen University
Note: 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2021)
Link: https://arxiv.org/abs/2107.10712
Abstract: The Self-Rating Depression Scale (SDS) questionnaire is commonly utilized for effective preliminary depression screening. The uncontrolled self-administered measure, on the other hand, may be readily influenced by insouciant or dishonest responses, yielding findings that differ from clinician-administered diagnostics. Facial expression (FE) and behaviors are important in clinician-administered assessments, but they are underappreciated in self-administered evaluations. In this study, we use a new dataset of 200 participants to demonstrate the validity of self-rating questionnaires and their accompanying question-by-question video recordings. We offer an end-to-end system that handles the facial video recording, conditioned on the questionnaire answers and the response time, to automatically interpret sadness from the SDS assessment and the associated video. We modified a 3D-CNN for temporal feature extraction and compared various state-of-the-art temporal modeling techniques. The superior performance of our system shows the validity of combining facial video recordings with SDS scores for more accurate self-diagnosis.

【3】 AnonySIGN: Novel Human Appearance Synthesis for Sign Language Video Anonymisation

Authors: Ben Saunders, Necati Cihan Camgoz, Richard Bowden
Affiliations: University of Surrey
Link: https://arxiv.org/abs/2107.10685
Abstract: The visual anonymisation of sign language data is an essential task to address privacy concerns raised by large-scale dataset collection. Previous anonymisation techniques have either significantly affected sign comprehension or required manual, labour-intensive work. In this paper, we formally introduce the task of Sign Language Video Anonymisation (SLVA) as an automatic method to anonymise the visual appearance of a sign language video whilst retaining the meaning of the original sign language sequence. To tackle SLVA, we propose AnonySign, a novel automatic approach for visual anonymisation of sign language data. We first extract pose information from the source video to remove the original signer appearance. We next generate a photo-realistic sign language video of a novel appearance from the pose sequence, using image-to-image translation methods in a conditional variational autoencoder framework. An approximate posterior style distribution is learnt, which can be sampled from to synthesise novel human appearances. In addition, we propose a novel style loss that ensures style consistency in the anonymised sign language videos. We evaluate AnonySign on the SLVA task with extensive quantitative and qualitative experiments highlighting both the realism and the anonymity of our novel human appearance synthesis. In addition, we formalise an anonymity perceptual study as an evaluation criterion for the SLVA task and showcase that video anonymisation using AnonySign retains the original sign language content.

【4】 PoseDet: Fast Multi-Person Pose Estimation Using Pose Embedding

Authors: Chenyu Tian, Ran Yu, Xinyuan Zhao, Weihao Xia, Yujiu Yang, Haoqian Wang
Affiliations: Tsinghua University; Northwestern University
Note: 10 pages, 5 figures
Link: https://arxiv.org/abs/2107.10466
Abstract: Current methods for multi-person pose estimation typically treat the localization and the association of body joints separately. This is convenient but inefficient, leading to additional computation and a waste of time. This paper presents a novel framework, PoseDet (Estimating Pose by Detection), to localize and associate body joints simultaneously at higher inference speed. Moreover, we propose the keypoint-aware pose embedding, which represents an object in terms of the locations of its keypoints. The proposed pose embedding contains semantic and geometric information, allowing us to access discriminative and informative features efficiently. It is utilized for candidate classification and body joint localization in PoseDet, leading to robust predictions for various poses. This simple framework achieves an unprecedented speed and competitive accuracy on the COCO benchmark compared with state-of-the-art methods. Extensive experiments on the CrowdPose benchmark show its robustness in crowded scenes. Source code is available.

【5】 iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability

Authors: Aman Chadha, Vinija Jain
Affiliations: Department of Computer Science, Stanford University
Note: 12 pages, 1 figure, 7 tables
Link: https://arxiv.org/abs/2107.10300
Abstract: Causality knowledge is vital to building robust AI systems. Deep learning models often perform poorly on tasks that require causal reasoning, which is often derived using some form of commonsense knowledge not immediately available in the input but implicitly inferred by humans. Prior work has unraveled spurious observational biases that models fall prey to in the absence of causality. While language representation models preserve contextual knowledge within learned embeddings, they do not factor in causal relationships during training. By blending causal relationships with the input features of an existing model that performs visual cognition tasks (such as scene understanding, video captioning, video question-answering, etc.), better performance can be achieved owing to the insight that causal relationships bring. Recently, several models have been proposed that tackle the task of mining causal data from either the visual or the textual modality. However, there is no widespread research that mines causal relationships by juxtaposing the visual and language modalities. While images offer a rich and easy-to-process resource to mine causality knowledge from, videos are denser and consist of naturally time-ordered events. Also, textual information offers details that could be implicit in videos. We propose iReason, a framework that infers visual-semantic commonsense knowledge using both videos and natural language captions. Furthermore, iReason's architecture integrates a causal rationalization module to aid interpretability, error analysis, and bias detection. We demonstrate the effectiveness of iReason through a two-pronged comparative analysis with language representation learning models (BERT, GPT-2) as well as current state-of-the-art multimodal causality models.

【6】 mmPose-NLP: A Natural Language Processing Approach to Precise Skeletal Pose Estimation using mmWave Radars

Authors: Arindam Sengupta, Siyang Cao
Affiliations: Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ, USA
Note: Submitted to IEEE Transactions
Link: https://arxiv.org/abs/2107.10327
Abstract: In this paper we present mmPose-NLP, a novel Natural Language Processing (NLP) inspired Sequence-to-Sequence (Seq2Seq) skeletal key-point estimator using millimeter-wave (mmWave) radar data. To the best of the authors' knowledge, this is the first method to precisely estimate up to 25 skeletal key-points using mmWave radar data alone. Skeletal pose estimation is critical in several applications, ranging from autonomous vehicles, traffic monitoring, patient monitoring, and gait analysis to defense security forensics, and aids both preventative and actionable decision making. The use of mmWave radars for this task, over traditionally employed optical sensors, provides several advantages, primarily operational robustness to scene lighting and adverse weather conditions, where optical sensor performance degrades significantly. The mmWave radar point-cloud (PCL) data is first voxelized (analogous to tokenization in NLP), and N frames of the voxelized radar data (analogous to a text paragraph in NLP) are subjected to the proposed mmPose-NLP architecture, where the voxel indices of the 25 skeletal key-points (analogous to keyword extraction in NLP) are predicted. The voxel indices are converted back to real-world 3-D coordinates using the voxel dictionary used during the tokenization process. Mean Absolute Error (MAE) metrics were used to measure the accuracy of the proposed system against the ground truth, with mmPose-NLP offering <3 cm localization errors in the depth, horizontal, and vertical axes. The effect of the number of input frames on performance/accuracy was also studied for N = {1, 2, ..., 10}. A comprehensive methodology, results, discussion, and limitations are presented in this paper. All source code and results are made available on GitHub to further research and development in this critical yet emerging domain of skeletal key-point estimation using mmWave radars.
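The tokenization analogy is concrete enough to sketch: radar points become discrete voxel indices (tokens), and predicted indices are mapped back to 3-D coordinates through the same grid. The grid origin and resolution below are illustrative assumptions.

```python
import numpy as np

def voxelize(points, voxel_size=0.1, origin=(-3.0, -3.0, 0.0)):
    """Hedged sketch of the tokenization step: map each (x, y, z) radar
    point to a discrete voxel index, analogous to tokenization in NLP."""
    idx = np.floor((points - np.asarray(origin)) / voxel_size).astype(int)
    return np.unique(idx, axis=0)             # occupied voxel "tokens"

def devoxelize(voxel_idx, voxel_size=0.1, origin=(-3.0, -3.0, 0.0)):
    """Inverse lookup: predicted voxel indices back to 3-D cell centres."""
    return np.asarray(origin) + (np.asarray(voxel_idx) + 0.5) * voxel_size
```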

Medical (1 paper)

【1】 Reading Race: AI Recognises Patient's Racial Identity In Medical Images

Authors: Imon Banerjee, Ananth Reddy Bhimireddy, John L. Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, Natalie Dullerud, Marzyeh Ghassemi, Shih-Cheng Huang, Po-Chih Kuo, Matthew P. Lungren, Lyle Palmer, Brandon J. Price, Saptarshi Purkayastha, Ayis Pyrros, Luke Oakden-Rayner, Chima Okechukwu, Laleh Seyyed-Kalantari, Hari Trivedi, Ryan Wang, Zachary Zaiman, Haoran Zhang, Judy W. Gichoya
Affiliations: University of Toronto
Note: Submitted to the Lancet
Link: https://arxiv.org/abs/2107.10356
Abstract: Background: In medical imaging, prior studies have demonstrated disparate AI performance by race, yet there is no known correlate for race in medical imaging that would be obvious to human experts interpreting the images. Methods: Using private and public datasets, we evaluate: A) performance quantification of deep learning models to detect race from medical images, including the ability of these models to generalize to external environments and across multiple imaging modalities; B) assessment of possible confounding anatomic and phenotypic population features, such as disease distribution and body habitus, as predictors of race; and C) investigation into the underlying mechanism by which AI models can recognize race. Findings: Standard deep learning models can be trained to predict race from medical images with high performance across multiple imaging modalities. Our findings hold under external validation conditions, as well as when models are optimized to perform clinically motivated tasks. We demonstrate this detection is not due to trivial proxies or imaging-related surrogate covariates for race, such as underlying disease distribution. Finally, we show that performance persists over all anatomical regions and the frequency spectrum of the images, suggesting that mitigation efforts will be challenging and demand further study. Interpretation: We emphasize that the model's ability to predict self-reported race is itself not the issue of importance. However, our finding that AI can trivially predict self-reported race -- even from corrupted, cropped, and noised medical images -- in a setting where clinical experts cannot, creates an enormous risk for all model deployments in medical imaging: if an AI model secretly used its knowledge of self-reported race to misclassify all Black patients, radiologists would not be able to tell, using the same data the model has access to.

GAN | Adversarial | Attacks | Generation (4 papers)

【1】 3D Shape Generation with Grid-based Implicit Functions

Authors: Moritz Ibing, Isaak Lim, Leif Kobbelt
Affiliations: Visual Computing Institute, RWTH Aachen University
Note: CVPR 2021
Link: https://arxiv.org/abs/2107.10607
Abstract: Previous approaches to generating shapes in a 3D setting train a GAN on the latent space of an autoencoder (AE). Even though this produces convincing results, it has two major shortcomings. As the GAN is limited to reproducing the dataset the AE was trained on, we cannot reuse a trained AE for novel data. Furthermore, it is difficult to add spatial supervision to the generation process, as the AE only gives us a global representation. To remedy these issues, we propose to train the GAN on grids (i.e., each cell covers a part of a shape). In this representation, each cell is equipped with a latent vector provided by an AE. This localized representation enables more expressiveness (since the cell-based latent vectors can be combined in novel ways) as well as spatial control of the generation process (e.g., via bounding boxes). Our method outperforms the current state of the art on all established evaluation measures proposed for quantitatively evaluating the generative capabilities of GANs. We show the limitations of these measures and propose the adaptation of a robust criterion from statistical analysis as an alternative.

【2】 Abstract Reasoning via Logic-guided Generation

Authors: Sihyun Yu, Sangwoo Mo, Sungsoo Ahn, Jinwoo Shin
Affiliations: Korea Advanced Institute of Science and Technology (KAIST); Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Note: ICML 2021 Workshop on Self-Supervised Learning for Reasoning and Perception (Spotlight Talk)
Link: https://arxiv.org/abs/2107.10493
Abstract: Abstract reasoning, i.e., inferring complicated patterns from given observations, is a central building block of artificial general intelligence. While humans find the answer by either eliminating wrong candidates or first constructing the answer, prior deep neural network (DNN)-based methods focus on the former, discriminative approach. This paper aims to design a framework for the latter approach and bridge the gap between artificial and human intelligence. To this end, we propose logic-guided generation (LoGe), a novel generative DNN framework that reduces abstract reasoning to an optimization problem in propositional logic. LoGe is composed of three steps: extract propositional variables from images, reason about the answer variables with a logic layer, and reconstruct the answer image from the variables. We demonstrate that LoGe outperforms black-box DNN frameworks for generative abstract reasoning on the RAVEN benchmark, i.e., reconstructing answers based on capturing correct rules of various attributes from observations.

【3】 Improve Learning from Crowds via Generative Augmentation

Authors: Zhendong Chu, Hongning Wang
Affiliations: Department of Computer Science, University of Virginia, Charlottesville, VA, USA
Note: KDD 2021
Link: https://arxiv.org/abs/2107.10449
Abstract: Crowdsourcing provides an efficient label collection schema for supervised machine learning. However, to control annotation cost, each instance in the crowdsourced data is typically annotated by a small number of annotators. This creates a sparsity issue and limits the quality of machine learning models trained on such data. In this paper, we study how to handle sparsity in crowdsourced data using data augmentation. Specifically, we propose to directly learn a classifier by augmenting the raw sparse annotations. We implement two principles of high-quality augmentation using Generative Adversarial Networks: 1) the generated annotations should follow the distribution of authentic ones, which is measured by a discriminator; 2) the generated annotations should have high mutual information with the ground-truth labels, which is measured by an auxiliary network. Extensive experiments and comparisons against an array of state-of-the-art learning-from-crowds methods on three real-world datasets prove the effectiveness of our data augmentation framework. This shows the potential of our algorithm for low-budget crowdsourcing in general.

【4】 MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking

Authors: Xiao Wang, Xiujun Shu, Shiliang Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, Feng Wu
Affiliations: Shiliang Zhang and Yonghong Tian are also with Peking University; Bo Jiang is with the School of Computer Science and Technology, Anhui University; Feng Wu is also with the University of Science and Technology of China
Note: Under peer review
Link: https://arxiv.org/abs/2107.10433
Abstract: Many RGB-T trackers attempt to attain robust feature representations by utilizing an adaptive weighting scheme (or attention mechanism). Different from these works, we propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data by adaptively adjusting the convolutional kernels for various input images in practical tracking. Given an image pair as input, we first encode their features with the backbone network. Then, we concatenate these feature maps and generate dynamic modality-aware filters with two independent networks. The visible and thermal filters are used to conduct a dynamic convolution on their corresponding input feature maps, respectively. Inspired by residual connections, both the generated visible and thermal feature maps are summed with the input feature maps. The augmented feature maps are then fed into a RoI-align module to generate instance-level features for subsequent classification. To address issues caused by heavy occlusion, fast motion, and out-of-view targets, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism. A spatio-temporal recurrent neural network is used to capture the direction-aware context for accurate global attention prediction. Extensive experiments on three large-scale RGB-T tracking benchmark datasets validate the effectiveness of our proposed algorithm. The project page of this paper is available at https://sites.google.com/view/mfgrgbttrack/.
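Dynamic filter generation can be sketched with the standard grouped-convolution trick: predict one depthwise 3x3 kernel per channel per sample from the fused features, apply it to that sample's feature map, and add the residual. The generator head and kernel size are illustrative assumptions, not MFGNet's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterSketch(nn.Module):
    """Hedged sketch of modality-aware filter generation: per-sample
    depthwise 3x3 kernels predicted from concatenated visible/thermal
    features, applied via grouped convolution, with a residual add."""
    def __init__(self, ch=64):
        super().__init__()
        self.gen = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(2 * ch, ch * 9))

    def forward(self, fused, x):   # fused: (B, 2*ch, H, W); x: (B, ch, H, W)
        B, C, H, W = x.shape
        k = self.gen(fused).view(B * C, 1, 3, 3)          # per-sample kernels
        y = F.conv2d(x.reshape(1, B * C, H, W), k, padding=1, groups=B * C)
        return y.view(B, C, H, W) + x                     # residual connection
```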

Face | Crowd Counting (1 paper)

【1】 Structure Destruction and Content Combination for Face Anti-Spoofing

Authors: Ke-Yue Zhang, Taiping Yao, Jian Zhang, Shice Liu, Bangjie Yin, Shouhong Ding, Jilin Li
Affiliations: Youtu Lab, Tencent, Shanghai, China
Note: Accepted by IJCB 2021
Link: https://arxiv.org/abs/2107.10628
Abstract: In pursuit of consolidating face verification systems, prior face anti-spoofing studies excavate hidden cues in original images to discriminate real persons from diverse attack types with the assistance of auxiliary supervision. However, they are limited by the following two inherent disturbances in their training process: 1) the complete facial structure in a single image, and 2) implicit subdomains in the whole dataset. Because of these, such methods are prone to rely on memorization of the entire training dataset and show sensitivity to non-homologous domain distributions. In this paper, we propose a Structure Destruction Module and a Content Combination Module to address these two limitations separately. The former destroys images into patches to construct a non-structural input, while the latter recombines patches from different subdomains or classes into a mixup construct. Based on this splitting-and-splicing operation, a Local Relation Modeling Module is further proposed to model the second-order relationships between patches. We evaluate our method on extensive public datasets, with promising experimental results demonstrating the reliability of our method against state-of-the-art competitors.

Image/Video Retrieval | Re-ID (1 paper)

【1】 Copy and Paste method based on Pose for ReID

Authors: Cheng Yang
Affiliations: Huazhong University of Science and Technology, Wuhan, China
Link: https://arxiv.org/abs/2107.10479
Abstract: Re-identification (ReID) aims at matching objects across surveillance cameras with different viewpoints. The field is developing very fast, but at this stage there is no processing method for the ReID task in multiple scenarios, even though this does happen all the time in real life, for example in security scenarios. This paper explores a new ReID scenario that differs in perspective, background, and pose (walking or cycling). Ordinary ReID processing methods cannot handle this scenario well. The most direct remedy would be to collect image datasets for this scenario, but that is very expensive. To solve this problem, this paper proposes a simple and effective way to generate images for such new scenarios, named the Copy and Paste method based on Pose (CPP). CPP is a method based on keypoint detection that uses copy and paste to composite a new semantic image dataset from two different semantic image datasets. For example, we can use pedestrians and bicycles to generate images that show the same person riding different bicycles. CPP is suitable for ReID tasks in new scenarios, and it outperforms state-of-the-art methods on the original datasets in original ReID tasks. Specifically, it can also achieve better generalization performance on third-party public datasets. Code and datasets composited by CPP will be available in the future.

Super-Resolution | Denoising | Deblurring | Dehazing (1 paper)

【1】 Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

Authors: Xintao Wang, Liangbin Xie, Chao Dong, Ying Shan
Affiliations: Applied Research Center (ARC), Tencent PCG; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Note: Tech report. Training code, testing code, and executable files are at this https URL
Link: https://arxiv.org/abs/2107.10833
Abstract: Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts in the synthesis process. In addition, we employ a U-Net discriminator with spectral normalization to increase discriminator capability and stabilize the training dynamics. Extensive comparisons have shown its superior visual performance over prior works on various real datasets. We also provide efficient implementations to synthesize training pairs on the fly.
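The "high-order" idea is simply to run a classical degradation chain more than once with freshly sampled parameters. The sketch below uses a box blur, Gaussian noise, and coarse quantization as crude stand-ins for the paper's sampled blur kernels, noise models, and JPEG compression; all ranges are illustrative assumptions (grayscale image in [0, 1]).

```python
import numpy as np

def high_order_degradation(img, rng, orders=2):
    """Hedged sketch: repeat a blur -> noise -> 'compression' chain with
    re-sampled parameters each round (second-order degradation for orders=2)."""
    for _ in range(orders):
        pad = np.pad(img, 1, mode="edge")                 # 3x3 box blur
        img = sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
        img = img + rng.normal(0.0, rng.uniform(0.0, 0.05), img.shape)  # noise
        q = int(rng.integers(16, 64))                     # crude quantization,
        img = np.clip(np.round(img * q) / q, 0.0, 1.0)    # stands in for JPEG
    return img

lowq = high_order_degradation(np.random.rand(64, 64), np.random.default_rng(0))
```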

Point Cloud | SLAM | Radar | LiDAR | Depth/RGB-D (1 paper)

【1】 Correspondence-Free Point Cloud Registration with SO(3)-Equivariant Implicit Shape Representations

Authors: Minghan Zhu, Maani Ghaffari, Huei Peng
Affiliations: University of Michigan
Note: 7 pages, 2 figures. Submitted to CoRL 2021
Link: https://arxiv.org/abs/2107.10296
Abstract: This paper proposes a correspondence-free method for point cloud rotational registration. We learn an embedding for each point cloud in a feature space that preserves the SO(3)-equivariance property, enabled by recent developments in equivariant neural networks. The proposed shape registration method achieves three major advantages by combining equivariant feature learning with implicit shape models. First, the necessity of data association is removed because of the permutation-invariance property of network architectures similar to PointNet. Second, the registration in feature space can be solved in closed form using Horn's method, owing to the SO(3)-equivariance property. Third, the registration is robust to noise in the point cloud because of implicit shape learning. Experimental results show superior performance compared with existing correspondence-free deep registration methods.
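The closed-form step is the classic Horn/Kabsch solution. The sketch below recovers a rotation aligning two ordered 3-D point sets; in the paper it is applied to learned SO(3)-equivariant feature embeddings rather than raw, ordered points, so the input here is an illustrative simplification.

```python
import numpy as np

def horn_rotation(src, dst):
    """Recover R minimizing ||(R @ src.T).T - dst|| for ordered (N, 3) sets
    via SVD (Kabsch/Horn closed-form solution)."""
    H = (src - src.mean(0)).T @ (dst - dst.mean(0))       # 3x3 covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                # avoid reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T            # dst ~ (R @ src.T).T
```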

Other Neural Networks | Deep Learning | Models | Modeling (2 papers)

【1】 HANT: Hardware-Aware Network Transformation

Authors: Pavlo Molchanov, Jimmy Hall, Hongxu Yin, Jan Kautz, Nicolo Fusi, Arash Vahdat Affiliations: NVIDIA; Microsoft Research Link: https://arxiv.org/abs/2107.10624 Abstract: Given a trained network, how can we accelerate it to meet efficiency needs for deployment on particular hardware? The commonly used hardware-aware network compression techniques address this question with pruning, kernel fusion, quantization, and lowered precision. However, these approaches do not change the underlying network operations. In this paper, we propose hardware-aware network transformation (HANT), which accelerates a network by replacing inefficient operations with more efficient alternatives using a neural-architecture-search-like approach. HANT tackles the problem in two phases: in the first phase, a large number of alternative operations for every layer of the teacher model are trained using layer-wise feature map distillation. In the second phase, the combinatorial selection of efficient operations is relaxed to an integer optimization problem that can be solved in a few seconds. We extend HANT with kernel fusion and quantization to improve throughput even further. Our experimental results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a <0.4% drop in top-1 accuracy on the ImageNet dataset. When comparing at the same latency level, HANT can accelerate EfficientNet-B4 to the same latency as EfficientNet-B1 while having 3% higher accuracy. We examine a large pool of operations, up to 197 per layer, and we provide insights into the selected operations and final architectures.
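
The second-phase selection can be pictured as a pick-one-op-per-layer knapsack: minimize the summed per-layer distillation error subject to a latency budget. The tiny dynamic program below is an illustrative stand-in for the integer program the paper actually solves; all inputs are hypothetical and latencies are assumed pre-discretized to integer ticks.

```python
# Pick exactly one candidate op per layer, minimizing total distillation
# error under an integer latency budget (an illustrative HANT-phase-2 analogue).
import math

def select_ops(err, lat, budget):
    """err[l][k], lat[l][k]: distillation error / integer latency of
    candidate op k at layer l. Returns (total_error, choice per layer)."""
    best = {0: (0.0, [])}                   # reachable latency -> (error, picks)
    for layer_err, layer_lat in zip(err, lat):
        nxt = {}
        for used, (e, picks) in best.items():
            for k, (ek, lk) in enumerate(zip(layer_err, layer_lat)):
                u = used + lk
                if u > budget:
                    continue
                cand = (e + ek, picks + [k])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand
        best = nxt
    if not best:
        return math.inf, []
    return min(best.values(), key=lambda t: t[0])

# Two layers, two candidates each (op 0 = keep teacher op, op 1 = cheap op):
err = [[0.00, 0.05], [0.00, 0.08]]
lat = [[4, 1], [6, 2]]
print(select_ops(err, lat, budget=7))       # -> (0.05, [1, 0])
```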

【2】 External-Memory Networks for Low-Shot Learning of Targets in Forward-Looking-Sonar Imagery

Authors: Isaac J. Sledge, Christopher D. Toole, Joseph A. Maestri, Jose C. Principe Affiliations: Department of Electrical and Computer Engineering, Department of Biomedical Engineering, and the Computational NeuroEngineering Laboratory (CNEL), University of Florida Notes: Submitted to IEEE Journal of Oceanic Engineering Link: https://arxiv.org/abs/2107.10504 Abstract: We propose a memory-based framework for real-time, data-efficient target analysis in forward-looking-sonar (FLS) imagery. Our framework relies on first removing non-discriminative details from the imagery using a small-scale DenseNet-inspired network. Doing so simplifies ensuing analyses and permits generalizing from few labeled examples. We then cascade the filtered imagery into a novel NeuralRAM-based convolutional matching network (NRMN) for low-shot target recognition. We employ a small-scale FlowNet (LFN) to align and register FLS imagery across local temporal scales. The LFN enables target-label consensus voting across images and generally improves target detection and recognition rates. We evaluate our framework using real-world FLS imagery with multiple broad target classes that have high intra-class variability and rich sub-class structure. We show that few-shot learning, with anywhere from ten to thirty class-specific exemplars, performs similarly to supervised deep networks trained on hundreds of samples per class. Effective zero-shot learning is also possible. High performance is realized from the inductive-transfer properties of NRMNs when distractor elements are removed.
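
The abstract describes a matching network that probes an external memory of a few labeled exemplars. Below is a generic, hedged sketch of that matching step (cosine similarity to stored exemplar embeddings); it illustrates the low-shot mechanism only, not the NRMN architecture itself, and all names are assumptions.

```python
# Classify a query embedding against an external memory of exemplar
# embeddings via nearest cosine similarity (matching-network style).
import numpy as np

def cosine_match(query, memory_feats, memory_labels):
    """query: (D,); memory_feats: (M, D); memory_labels: (M,).
    Returns the label of the most similar stored exemplar."""
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory_feats / (np.linalg.norm(memory_feats, axis=1, keepdims=True) + 1e-8)
    return memory_labels[int(np.argmax(m @ q))]
```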

Others (4 papers)

【1】 Geometric Data Augmentation Based on Feature Map Ensemble

Authors: Takashi Shibata, Masayuki Tanaka, Masatoshi Okutomi Affiliations: NTT Corporation; Tokyo Institute of Technology Notes: Accepted to ICIP 2021 Link: https://arxiv.org/abs/2107.10524 Abstract: Deep convolutional networks have become the mainstream in computer vision applications. Although CNNs have been successful in many computer vision tasks, they are not free from drawbacks: the performance of a CNN degrades dramatically under geometric transformations such as large rotations. In this paper, we propose a novel CNN architecture that improves robustness against geometric transformations without modifying the existing CNN backbone. The key is to enclose the existing backbone with a geometric transformation (and the corresponding inverse transformation) and a feature map ensemble. The proposed method can thus inherit the strengths of existing CNNs, and it can be combined with state-of-the-art data augmentation algorithms to further improve their performance. We demonstrate the effectiveness of the proposed method on standard datasets such as CIFAR, CUB-200, and Mnist-rot-12k.
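
One plausible reading of "enclose the backbone with a transform, its inverse, and a feature map ensemble" is sketched below for the four 90-degree rotations: rotate the input, run the backbone, rotate each feature map back, and average. This is a hedged illustration of the wrapper idea, not the paper's exact architecture; `backbone` is any hypothetical fully convolutional network.

```python
# Rotation-ensemble wrapper around an unmodified CNN backbone (PyTorch).
# Assumes square inputs so that rotated feature maps share one shape.
import torch

def rotation_ensemble(backbone, x):
    """x: (B, C, H, W) with H == W. Returns the averaged, re-aligned feature map."""
    feats = []
    for k in range(4):                                  # 0, 90, 180, 270 degrees
        f = backbone(torch.rot90(x, k, dims=(2, 3)))    # transform, then backbone
        feats.append(torch.rot90(f, -k, dims=(2, 3)))   # inverse transform
    return torch.stack(feats, dim=0).mean(dim=0)        # feature map ensemble
```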

【2】 A Public Ground-Truth Dataset for Handwritten Circuit Diagram Images

Authors: Felix Thoma, Johannes Bayer, Yakun Li Notes: 6 pages, 3 figures, raw version as submitted to GREC 2021 Link: https://arxiv.org/abs/2107.10373 Abstract: The development of digitization methods for line drawings (especially in the area of electrical engineering) relies on the availability of public training and evaluation data. This paper presents such an image set along with annotations. The dataset consists of 1,152 images of 144 circuits drawn by 12 drafters, with 48,563 annotations. Each image depicts an electrical circuit diagram, taken by consumer-grade cameras under varying lighting conditions and perspectives. A variety of pencil types and surface materials has been used. For each image, all individual electrical components are annotated with bounding boxes and one of 45 class labels. In order to simplify a graph extraction process, helper symbols like junction points and crossovers are introduced, and texts are annotated as well. The geometric and taxonomic problems arising from this task, the classes themselves, and statistics of their appearances are stated. The performance of a standard Faster R-CNN on the dataset is provided as an object-detection baseline.
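
For readers who want to reproduce a baseline of this kind, a plausible starting point is torchvision's reference Faster R-CNN with the box head resized to the dataset's 45 component classes (plus the implicit background class). This is an assumed setup, not the authors' exact training recipe.

```python
# Hedged baseline sketch: torchvision Faster R-CNN with a resized box predictor.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=46)

# Standard torchvision detection training step: images is a list of CHW
# tensors, targets a list of dicts with "boxes" (N, 4) and "labels" (N,).
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[10.0, 10.0, 100.0, 80.0]]),
            "labels": torch.tensor([3])}]
model.train()
losses = model(images, targets)    # dict of classification/regression losses
print(sum(losses.values()))
```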

【3】 Rethinking Trajectory Forecasting Evaluation

Authors: Boris Ivanovic, Marco Pavone Affiliations: NVIDIA Research Notes: 4 pages, 2 figures Link: https://arxiv.org/abs/2107.10297 Abstract: Forecasting the behavior of other agents is an integral part of the modern robotic autonomy stack, especially in safety-critical scenarios with human-robot interaction, such as autonomous driving. In turn, there has been a significant amount of interest and research in trajectory forecasting, resulting in a wide variety of approaches. Common to all works, however, is the use of the same few accuracy-based evaluation metrics, e.g., displacement error and log-likelihood. While these metrics are informative, they are task-agnostic, and predictions that are evaluated as equal can lead to vastly different outcomes, e.g., in downstream planning and decision making. In this work, we take a step back and critically evaluate current trajectory forecasting metrics, proposing task-aware metrics as a better measure of performance in systems where prediction is being deployed. We additionally present one example of such a metric, incorporating planning-awareness within existing trajectory forecasting metrics.
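
The "same few accuracy-based metrics" can be made concrete: average and final displacement error (ADE/FDE). The task-aware variant below, which up-weights errors on timesteps where the forecast agent is near the ego vehicle, is a made-up illustration of the paper's idea, not the metric it proposes.

```python
# Standard accuracy-based metrics plus one illustrative task-aware reweighting.
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, 2) trajectories. Returns (ADE, FDE)."""
    d = np.linalg.norm(pred - gt, axis=1)   # per-timestep displacement error
    return d.mean(), d[-1]

def task_aware_ade(pred, gt, agent_to_ego_dist, scale=10.0):
    """agent_to_ego_dist: (T,) distances (meters) from agent to ego vehicle.
    Errors on nearby agents matter more for downstream planning."""
    d = np.linalg.norm(pred - gt, axis=1)
    w = np.exp(-np.asarray(agent_to_ego_dist) / scale)
    return float((w * d).mean())
```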

【4】 Fristograms: Revealing and Exploiting Light Field Internals

Authors: Thorsten Herfet, Kelvin Chelli, Tobias Lange, Robin Kremer Affiliations: Saarland Informatics Campus, Saarbrücken, Germany Notes: 6 pages, 7 figures Link: https://arxiv.org/abs/2107.10563 Abstract: In recent years, light field (LF) capture and processing has become an integral part of media production. The richness of information available in LFs has enabled novel applications like post-capture depth-of-field editing, 3D reconstruction, segmentation and matting, saliency detection, object detection and recognition, and mixed reality. The efficacy of such applications depends on certain underlying requirements, which are often ignored. For example, some operations such as noise reduction or hyperfan filtering are only possible if a scene point is a Lambertian radiator. Some other operations, such as the removal of obstacles or looking behind objects, are only possible if there is at least one ray capturing the required scene point. Consequently, the ray distribution representing a certain scene point is an important characteristic for evaluating processing possibilities. The primary idea in this paper is to establish a relation between the capturing setup and the rays of the LF. To this end, we discretize the view frustum. Traditionally, a uniform discretization of the view frustum results in voxels that each represent a single sample on a regularly spaced 3-D grid. Instead, we use frustum-shaped voxels (froxels), obtained by a depth- and capturing-setup-dependent discretization of the view frustum. Based on this discretization, we count the number of rays mapping to the same pixel on the capturing device(s). By means of this count, we propose histograms of ray counts over the froxels (fristograms). Fristograms can be used as a tool to analyze and reveal interesting aspects of the underlying LF, like the number of rays originating from a scene point and the color distribution of these rays. As an example, we show their ability to significantly reduce the number of rays, which enables noise reduction while maintaining the realistic rendering of non-Lambertian or partially occluded regions.
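
One plausible (assumed) data layout: a froxel is the footprint of one pixel crossed with one depth bin, so a fristogram becomes a per-pixel histogram of ray counts over depth. The sketch below implements that reading; the ray format `(pixel_y, pixel_x, depth)` is hypothetical, and the depth bin edges stand in for the paper's capturing-setup-dependent discretization.

```python
# Count captured rays per froxel (pixel footprint x depth bin).
import numpy as np

def fristogram(rays, img_shape, depth_edges):
    """rays: (N, 3) array of (pixel_y, pixel_x, depth).
    Returns ray counts of shape (H, W, len(depth_edges) - 1)."""
    H, W = img_shape
    counts = np.zeros((H, W, len(depth_edges) - 1), dtype=np.int64)
    bins = np.digitize(rays[:, 2], depth_edges) - 1    # depth bin per ray
    ok = (bins >= 0) & (bins < counts.shape[2])
    np.add.at(counts,                                   # handles repeated froxels
              (rays[ok, 0].astype(int), rays[ok, 1].astype(int), bins[ok]), 1)
    return counts
```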
