Visit www.arxivdaily.com for daily digests with abstracts, covering CS | Physics | Math | Economics | Statistics | Finance | Biology | Electrical Engineering, plus search, favorites, and posting features!
cs.CV: 56 papers in total today
Transformer (4 papers)
【1】 Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images
Authors: Libo Wang, Rui Li, Dongzhi Wang, Chenxi Duan, Teng Wang, Xiaoliang Meng
Affiliations: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China; Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou
Link: https://arxiv.org/abs/2106.12413
Abstract: Semantic segmentation from very fine resolution (VFR) urban scene images plays a significant role in several application scenarios including autonomous driving, land cover classification, and urban planning. However, the tremendous details contained in VFR images severely limit the potential of existing deep learning approaches. More seriously, the considerable variations in scale and appearance of objects further deteriorate the representational capacity of those semantic segmentation methods, leading to the confusion of adjacent objects. Addressing such issues represents a promising research field in the remote sensing community, which paves the way for scene-level landscape pattern analysis and decision making. In this manuscript, we propose a bilateral awareness network (BANet) which contains a dependency path and a texture path to fully capture the long-range relationships and fine-grained details in VFR images. Specifically, the dependency path is conducted based on ResT, a novel Transformer backbone with memory-efficient multi-head self-attention, while the texture path is built on stacked convolution operations. Besides, using the linear attention mechanism, a feature aggregation module (FAM) is designed to effectively fuse the dependency features and texture features. Extensive experiments conducted on three large-scale urban scene image segmentation datasets, i.e., the ISPRS Vaihingen dataset, the ISPRS Potsdam dataset, and the UAVid dataset, demonstrate the effectiveness of our BANet. Specifically, a 64.6% mIoU is achieved on the UAVid dataset.
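The FAM is only specified at a high level in the abstract; below is a minimal PyTorch sketch of how a linear-attention fusion of the two paths could look. The module name, the elu+1 kernel feature map (from Katharopoulos et al.'s linear transformers), and the residual fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionFusion(nn.Module):
    """Hypothetical sketch of a feature aggregation module (FAM): texture
    features attend to dependency features with O(N) linear attention."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)  # queries from the texture path
        self.to_k = nn.Conv2d(dim, dim, 1)  # keys from the dependency path
        self.to_v = nn.Conv2d(dim, dim, 1)  # values from the dependency path
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, texture, dependency):
        b, c, h, w = texture.shape
        q = self.to_q(texture).flatten(2).transpose(1, 2)     # (B, N, C)
        k = self.to_k(dependency).flatten(2).transpose(1, 2)  # (B, N, C)
        v = self.to_v(dependency).flatten(2).transpose(1, 2)  # (B, N, C)
        q, k = F.elu(q) + 1, F.elu(k) + 1                     # kernel feature map
        kv = torch.einsum('bnc,bnd->bcd', k, v)               # (B, C, C), O(N*C^2)
        z = 1.0 / (torch.einsum('bnc,bc->bn', q, k.sum(1)) + 1e-6)
        out = torch.einsum('bnc,bcd,bn->bnd', q, kv, z)
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return texture + self.proj(out)  # residual fusion of the two paths

fam = LinearAttentionFusion(64)
fused = fam(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```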
【2】 Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image
Authors: Zeyu Gao, Bangyang Hong, Xianli Zhang, Yang Li, Chang Jia, Jialun Wu, Chunbao Wang, Deyu Meng, Chen Li
Affiliations: School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi, China; National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an
Note: Accepted by MICCAI 2021
Link: https://arxiv.org/abs/2106.12265
Abstract: Histological subtype of papillary (p) renal cell carcinoma (RCC), type 1 vs. type 2, is an essential prognostic factor. The two subtypes of pRCC have a similar pattern, i.e., the papillary architecture, yet some subtle differences, including cellular and cell-layer level patterns. However, the cellular and cell-layer level patterns can hardly be captured by existing CNN-based models in large-size histopathological images, which brings obstacles to directly applying these models to such a fine-grained classification task. This paper proposes a novel instance-based Vision Transformer (i-ViT) to learn robust representations of histopathological images for the pRCC subtyping task by extracting finer features from instance patches (obtained by cropping around segmented nuclei and assigning predicted grades). The proposed i-ViT takes top-K instances as input and aggregates them to capture both the cellular and cell-layer level patterns with a position-embedding layer, a grade-embedding layer, and a multi-head multi-layer self-attention module. To evaluate the performance of the proposed framework, experienced pathologists were invited to select 1162 regions of interest from 171 whole slide images of type 1 and type 2 pRCC. Experimental results show that the proposed method achieves better performance than existing CNN-based models by a significant margin.
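To make the aggregation step concrete, here is a hypothetical sketch of how top-K instance embeddings plus grade and position embeddings could be aggregated by a transformer encoder with a class-token readout. All dimensions and layout choices are assumptions; the paper's code may differ.

```python
import torch
import torch.nn as nn

class InstanceViT(nn.Module):
    """Illustrative sketch (not the authors' code): aggregate top-K nucleus
    instance embeddings with grade and position embeddings, then classify
    the region via a transformer encoder and a class token."""
    def __init__(self, feat_dim=256, num_grades=4, num_classes=2, k=64, depth=4, heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, k + 1, feat_dim))
        self.grade_embed = nn.Embedding(num_grades, feat_dim)  # predicted nucleus grade
        layer = nn.TransformerEncoderLayer(feat_dim, heads, feat_dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, inst_feats, inst_grades):
        # inst_feats: (B, K, D) CNN features of patches cropped around nuclei
        # inst_grades: (B, K) integer grade predicted for each nucleus
        x = inst_feats + self.grade_embed(inst_grades)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])  # subtype logits read from the class token

model = InstanceViT()
logits = model(torch.randn(2, 64, 256), torch.randint(0, 4, (2, 64)))
```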
【3】 LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction
Authors: Farid Yagubbayli, Alessio Tonioni, Federico Tombari
Affiliations: Technical University of Munich; Google Inc.
Link: https://arxiv.org/abs/2106.12102
Abstract: Most modern deep learning-based multi-view 3D reconstruction techniques use RNNs or fusion modules to combine information from multiple images after encoding them. These two separate steps are loosely connected and do not consider all available information while encoding each view. We propose LegoFormer, a transformer-based model that unifies object reconstruction under a single framework and parametrizes the reconstructed occupancy grid by its decomposition factors. This reformulation allows the prediction of an object as a set of independent structures that are then aggregated to obtain the final reconstruction. Experiments conducted on ShapeNet display the competitive performance of our network with respect to state-of-the-art methods. We also demonstrate how the use of self-attention leads to increased interpretability of the model output.
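The decomposition-factor parametrization is easy to state precisely: the occupancy grid is a sum of rank-1 tensors, each the outer product of three per-axis vectors. A toy sketch of the reconstruction step (shapes and the clamp are assumptions):

```python
import torch

# Illustrative reconstruction of an occupancy grid from predicted decomposition
# factors: each of R "blocks" is a rank-1 tensor, the outer product of three
# per-axis vectors, and the grid is their aggregation.
B, R, N = 2, 12, 32                     # batch, number of factors, grid side
fx = torch.rand(B, R, N)                # factors along x (e.g. transformer outputs)
fy = torch.rand(B, R, N)                # factors along y
fz = torch.rand(B, R, N)                # factors along z
grid = torch.einsum('bri,brj,brk->bijk', fx, fy, fz).clamp(max=1.0)  # (B, N, N, N)
```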
【4】 P2T: Pyramid Pooling Transformer for Scene Understanding
Authors: Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng
Affiliations: Nankai University
Link: https://arxiv.org/abs/2106.12011
Abstract: This paper jointly resolves two problems in vision transformers: i) the computation of Multi-Head Self-Attention (MHSA) has high computational/space complexity; ii) recent vision transformer networks are overly tuned for image classification, ignoring the difference between image classification (simple scenarios, more similar to NLP) and downstream scene understanding tasks (complicated scenarios, rich structural and contextual information). To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful context abstraction, and its natural property of spatial invariance is suitable for addressing the loss of structural information (problem ii). Hence, we propose to adapt pyramid pooling to MHSA to alleviate its high requirement on computational resources (problem i). In this way, this pooling-based MHSA can well address the above two problems and is thus flexible and powerful for downstream scene understanding tasks. Plugged with our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various downstream scene understanding tasks such as semantic segmentation, object detection, instance segmentation, and visual saliency detection, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T. Note that this technical report will be kept updated.
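A minimal sketch of what pooling-based MHSA can look like: keys and values are computed from tokens adaptively average-pooled at several pyramid scales, so attention cost drops from O(N^2) to O(N·M) with M ≪ N. Pool sizes and layout below are assumptions, not the official P2T code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingAttention(nn.Module):
    """Sketch of pooling-based MHSA: queries come from all tokens, keys/values
    from multi-scale pooled context tokens."""
    def __init__(self, dim, heads=8, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, N, C) flattened feature map with N = h * w
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        pooled = [F.adaptive_avg_pool2d(feat, s).flatten(2).transpose(1, 2)
                  for s in self.pool_sizes]           # multi-scale context tokens
        ctx = torch.cat(pooled, dim=1)                # (B, M, C), M = sum of s*s
        k, v = self.kv(ctx).chunk(2, dim=-1)
        out, _ = self.attn(self.q(x), k, v, need_weights=False)
        return out

attn = PyramidPoolingAttention(64)
y = attn(torch.randn(2, 16 * 16, 64), 16, 16)  # M = 1+4+9+36 = 50 << 256 tokens
```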
Detection (3 papers)
【1】 FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection
Authors: Shaoqing Xu, Dingfu Zhou, Jin Fang, Junbo Yin, Zhou Bin, Liangjun Zhang
Affiliations: Beihang University; National Engineering Laboratory of Deep Learning Technology and Application; Beijing Institute of Technology
Note: Accepted by ITSC 2021
Link: https://arxiv.org/abs/2106.12449
Abstract: Accurate detection of obstacles in 3D is an essential task for autonomous driving and intelligent transportation. In this work, we propose a general multimodal fusion framework, FusionPainting, to fuse 2D RGB images and 3D point clouds at the semantic level for boosting the 3D object detection task. Specifically, the FusionPainting framework consists of three main modules: a multi-modal semantic segmentation module, an adaptive attention-based semantic fusion module, and a 3D object detector. First, semantic information is obtained for the 2D images and the 3D lidar point clouds based on 2D and 3D segmentation approaches. Then the segmentation results from the different sensors are adaptively fused based on the proposed attention-based semantic fusion module. Finally, the point clouds painted with the fused semantic labels are sent to the 3D detector to obtain the 3D detection results. The effectiveness of the proposed framework has been verified on the large-scale nuScenes detection benchmark by comparing it with three different baselines. The experimental results show that the fusion strategy can significantly improve the detection performance compared to methods using only point clouds, and to methods using point clouds painted only with 2D segmentation information. Furthermore, the proposed approach outperforms other state-of-the-art methods on the nuScenes testing benchmark.
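As an illustration of the "painting" idea, the sketch below fuses per-point 2D and 3D segmentation scores with learned attention weights and concatenates the result to the point coordinates. The gating MLP, and the assumption that 2D scores have already been gathered via point-to-image projection, are hypothetical simplifications rather than the paper's module.

```python
import torch
import torch.nn as nn

class AttentiveSemanticFusion(nn.Module):
    """Hypothetical sketch of semantic-level fusion: per point, learn weights
    over the 2D-image and 3D-lidar segmentation scores, then "paint" the
    fused class scores onto the point features for the 3D detector."""
    def __init__(self, num_classes):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * num_classes + 3, 64), nn.ReLU(),
            nn.Linear(64, 2))  # one logit per modality

    def forward(self, xyz, scores_2d, scores_3d):
        # xyz: (N, 3); scores_2d/scores_3d: (N, K) per-point class scores, the
        # 2D scores gathered by projecting points into the image plane.
        w = torch.softmax(self.gate(torch.cat([scores_2d, scores_3d, xyz], -1)), -1)
        fused = w[:, :1] * scores_2d + w[:, 1:] * scores_3d  # adaptive fusion
        return torch.cat([xyz, fused], dim=-1)  # painted points, (N, 3 + K)

fusion = AttentiveSemanticFusion(num_classes=10)
painted = fusion(torch.randn(1000, 3), torch.rand(1000, 10), torch.rand(1000, 10))
```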
【2】 Diabetic Retinopathy Detection using Ensemble Machine Learning
Authors: Israa Odeh, Mouhammd Alkasassbeh, Mohammad Alauthman
Affiliations: Department of Computer Science, PSUT, Amman, Jordan; Department of Information Security, University of Petra
Link: https://arxiv.org/abs/2106.12545
Abstract: Diabetic Retinopathy (DR) is among the world's leading causes of vision loss in diabetic patients. DR is a microvascular disease that affects the eye retina, causing vessel blockage and therefore cutting off the main source of nutrition for the retinal tissues. Treatment for this visual disorder is most effective when it is detected in its earliest stages, as severe DR can result in irreversible blindness. Nonetheless, DR identification requires the expertise of ophthalmologists, which is often expensive and time-consuming. Therefore, automatic detection systems were introduced, aiming to facilitate the identification process and make it available globally in a time- and cost-efficient manner. However, due to the limited reliable datasets and medical records for this particular eye disease, the obtained prediction accuracies were relatively unsatisfying for eye specialists to rely on them as diagnostic systems. Thus, we explored an ensemble-based learning strategy, merging a substantial selection of well-known classification algorithms into one sophisticated diagnostic model. The proposed framework achieved the highest accuracy rates among all other common classification algorithms in the area. Four sub-datasets were generated to contain the top 5 and top 10 features of the Messidor dataset, selected by InfoGainEval and WrapperSubsetEval; accuracies of 70.7% and 75.1% were achieved on the InfoGainEval top-5 sub-dataset and the original dataset, respectively. The results imply the impressive performance of the sub-dataset, which significantly contributes to a less complex classification process.
【3】 FoldIt: Haustral Folds Detection and Segmentation in Colonoscopy Videos
Authors: Shawn Mathew, Saad Nadeem, Arie Kaufman
Affiliations: Department of Computer Science, Stony Brook University; Department of Medical Physics, Memorial Sloan Kettering Cancer Center
Note: MICCAI 2021 (Early Accept); *Saad Nadeem and Shawn Mathew contributed equally
Link: https://arxiv.org/abs/2106.12522
Abstract: Haustral folds are colon wall protrusions implicated in the high polyp miss rate during optical colonoscopy procedures. If segmented accurately, haustral folds can allow for better estimation of the missed surface and can also serve as valuable landmarks for registering pre-treatment virtual (CT) and optical colonoscopies, to guide navigation towards the anomalies found in pre-treatment scans. We present a novel generative adversarial network, FoldIt, for feature-consistent image translation of optical colonoscopy videos to virtual colonoscopy renderings with haustral fold overlays. A new transitive loss is introduced in order to leverage ground truth information between haustral fold annotations and virtual colonoscopy renderings. We demonstrate the effectiveness of our model on real, challenging optical colonoscopy videos as well as on textured virtual colonoscopy videos with clinician-verified haustral fold annotations. All code and scripts to reproduce the experiments of this paper will be made available via our Computational Endoscopy Platform at https://github.com/nadeemlab/CEP.
Classification | Recognition (7 papers)
【1】 Multi-Class Classification of Blood Cells - End to End Computer Vision based diagnosis case study
Authors: Sai Sukruth Bezugam
Affiliations: Electrical Engineering Department, Indian Institute of Technology Delhi
Note: 18 pages, 10 figures
Link: https://arxiv.org/abs/2106.12548
Abstract: The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples. Automated methods to detect and classify blood cell subtypes have important medical applications, and automated medical image processing and analysis offers a powerful tool for medical diagnosis. In this work we tackle the problem of white blood cell classification based on the morphological characteristics of their outer contour and color. We explore a set of preprocessing and segmentation algorithms (color-based segmentation, morphological processing, contouring) along with a set of feature extraction methods (corner detection algorithms and Histogram of Gradients (HOG)) and dimensionality reduction algorithms (Principal Component Analysis (PCA)) that are able to recognize and classify, through various unsupervised (k-nearest neighbors) and supervised (Support Vector Machine, Decision Trees, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes) algorithms, different categories of white blood cells: eosinophil, lymphocyte, monocyte, and neutrophil. We go a step further and explore various deep convolutional neural network architectures (SqueezeNet, MobileNetV1, MobileNetV2, InceptionNet, etc.) both with and without preprocessing/segmentation. We explore many algorithms in order to identify a robust algorithm with the lowest time complexity and low resource requirements. The outcome of this work can serve as a cue for selecting algorithms for automated blood cell classification as per requirements.
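A classical pipeline of the kind described (HOG features, PCA, SVM) can be sketched in a few lines of scikit-image/scikit-learn; the data here are random stand-ins, and all hyper-parameters are illustrative rather than the paper's.

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy stand-ins for preprocessed/segmented cell crops and their 4-class labels.
images = np.random.rand(200, 64, 64)
labels = np.random.randint(0, 4, 200)  # eosinophil/lymphocyte/monocyte/neutrophil

# HOG descriptor per cell image.
feats = np.array([hog(im, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2)) for im in images])

clf = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel='rbf'))
print(cross_val_score(clf, feats, labels, cv=5).mean())
```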
【2】 Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition
Authors: Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng
Affiliations: Department of Electrical and Computer Engineering, National University of Singapore; School of Computer Science, Nankai University, Tianjin, China; ECE, NUS & Sea AI Labs
Note: 9 pages
Link: https://arxiv.org/abs/2106.12368
Abstract: In this paper, we present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition. Realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without depending on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaling up to 88M, it attains 83.2% top-1 accuracy. We hope this work could encourage research on rethinking the way of encoding spatial information and facilitate the development of MLP-like models. Code is available at https://github.com/Andrew-Qibin/VisionPermutator.
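The core Permutator operation is simple to write down. Below is a simplified sketch that keeps only the permutation logic; the released code additionally splits channels into segments and uses a learned re-weighting of the branches, so dimensions and the plain sum here are illustrative.

```python
import torch
import torch.nn as nn

class PermuteMLP(nn.Module):
    """Simplified sketch of the Permutator idea: encode features separately
    along height, width, and channel with linear layers, then aggregate."""
    def __init__(self, h, w, c):
        super().__init__()
        self.mlp_h = nn.Linear(h, h)  # mixes information along the height axis
        self.mlp_w = nn.Linear(w, w)  # mixes information along the width axis
        self.mlp_c = nn.Linear(c, c)  # mixes information along channels
        self.proj = nn.Linear(c, c)

    def forward(self, x):              # x: (B, H, W, C)
        xh = self.mlp_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # act on H
        xw = self.mlp_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # act on W
        xc = self.mlp_c(x)
        return self.proj(xh + xw + xc)  # aggregate the three branches

block = PermuteMLP(14, 14, 384)
y = block(torch.randn(2, 14, 14, 384))
```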
【3】 Estimating the Robustness of Classification Models by the Structure of the Learned Feature-Space
Authors: Kalun Ho, Franz-Josef Pfreundt, Janis Keuper, Margret Keuper
Affiliations: CC-HPC, Fraunhofer ITWM, Fraunhofer-Platz, Kaiserslautern, Germany; Institute for Machine Learning and Analytics, Offenburg University; Data and Web Science Group, University of Mannheim
Link: https://arxiv.org/abs/2106.12303
Abstract: Over the last decade, the development of deep image classification networks has mostly been driven by the search for the best performance in terms of classification accuracy on standardized benchmarks like ImageNet. More recently, this focus has been expanded by the notion of model robustness, i.e. the generalization abilities of models towards previously unseen changes in the data distribution. While new benchmarks, like ImageNet-C, have been introduced to measure robustness properties, we argue that fixed test sets are only able to capture a small portion of possible data variations and are thus limited and prone to generating new overfitted solutions. To overcome these drawbacks, we suggest estimating the robustness of a model directly from the structure of its learned feature space. We introduce robustness indicators, obtained via unsupervised clustering of latent representations inside a trained classifier, and show very high correlations with the model performance on corrupted test data.
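One plausible instantiation of such indicators (the paper's exact indicators may differ) is to cluster the penultimate-layer features and score how well the cluster structure aligns with the class structure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def robustness_indicators(latents, labels, num_classes):
    """Sketch with assumed metrics: cluster the penultimate-layer features of
    a trained classifier and measure cluster/label agreement and cluster
    compactness as proxies for robustness."""
    km = KMeans(n_clusters=num_classes, n_init=10).fit(latents)
    return {
        "ari": adjusted_rand_score(labels, km.labels_),      # cluster/label agreement
        "silhouette": silhouette_score(latents, km.labels_), # compactness/separation
    }

z = np.random.randn(500, 128)         # latent representations of a test set
y = np.random.randint(0, 10, 500)     # their ground-truth labels
print(robustness_indicators(z, y, num_classes=10))
```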
【4】 A Label Management Mechanism for Retinal Fundus Image Classification of Diabetic Retinopathy
Authors: Mengdi Gao, Ximeng Feng, Mufeng Geng, Zhe Jiang, Lei Zhu, Xiangxi Meng, Chuanqing Zhou, Qiushi Ren, Yanye Lu
Note: 10 pages, 9 figures
Link: https://arxiv.org/abs/2106.12284
Abstract: Diabetic retinopathy (DR) remains the most prevalent cause of vision impairment and irreversible blindness in working-age adults. Due to the renaissance of deep learning (DL), DL-based DR diagnosis has become a promising tool for the early screening and severity grading of DR. However, training deep neural networks (DNNs) requires an enormous amount of carefully labeled data, and noisy label data may be introduced when labeling plenty of data, degrading the performance of models. In this work, we propose a novel label management mechanism (LMM) for DNNs to overcome overfitting on noisy data. LMM utilizes the maximum a posteriori probability (MAP) in Bayesian statistics and a time-weighted technique to selectively correct the labels of unclean data, gradually purifying the training data and improving classification performance. Comprehensive experiments on both synthetic noise data (Messidor & our collected DR dataset) and real-world noise data (ANIMAL-10N) demonstrate that LMM can boost the performance of models and is superior to three state-of-the-art methods.
【5】 Mutual-Information Based Few-Shot Classification
Authors: Malik Boudiaf, Ziko Imtiaz Masud, Jérôme Rony, Jose Dolz, Ismail Ben Ayed, Pablo Piantanida
Affiliations: CentraleSupélec, CNRS, Université Paris-Saclay
Note: Journal extension of this https URL. PyTorch implementation of META-DATASET available at this https URL. arXiv admin note: substantial text overlap with arXiv:2008.11297
Link: https://arxiv.org/abs/2106.12252
Abstract: We introduce Transductive Information Maximization (TIM) for few-shot learning. Our method maximizes the mutual information between the query features and their label predictions for a given few-shot task, in conjunction with a supervision loss based on the support set. We motivate our transductive loss by deriving a formal relation between the classification accuracy and mutual-information maximization. Furthermore, we propose a new alternating-direction solver, which substantially speeds up transductive inference over gradient-based optimization, while yielding competitive accuracy. We also provide a convergence analysis of our solver based on Zangwill's theory and bound-optimization arguments. TIM inference is modular: it can be used on top of any base-training feature extractor. Following standard transductive few-shot settings, our comprehensive experiments demonstrate that TIM outperforms state-of-the-art methods significantly across various datasets and networks, while used on top of a fixed feature extractor trained with simple cross-entropy on the base classes, without resorting to complex meta-learning schemes. It consistently brings between 2% and 5% improvement in accuracy over the best performing method, not only on all the well-established few-shot benchmarks but also in more challenging scenarios, with random tasks, domain shift, and larger numbers of classes, as in the recently introduced META-DATASET. Our code is publicly available at https://github.com/mboudiaf/TIM. We also publicly release a standalone PyTorch implementation of META-DATASET, along with additional benchmarking results, at https://github.com/mboudiaf/pytorch-meta-dataset.
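The transductive objective has a compact form: supervised cross-entropy on the support set plus a weighted difference between the conditional entropy H(Y|X) and the marginal entropy H(Y) of the query predictions; minimizing H(Y|X) − H(Y) maximizes their mutual information. A sketch with illustrative weighting:

```python
import torch
import torch.nn.functional as F

def tim_loss(logits_support, y_support, logits_query, alpha=0.1):
    """Sketch of the transductive information-maximization objective:
    CE on the support set + alpha * (H(Y|X) - H(Y)) on the query set.
    The weight alpha is illustrative, not the paper's setting."""
    p = logits_query.softmax(dim=-1)                       # (Nq, K)
    cond_ent = -(p * torch.log(p + 1e-12)).sum(-1).mean()  # H(Y|X): confident preds
    p_marg = p.mean(0)                                     # marginal label distribution
    marg_ent = -(p_marg * torch.log(p_marg + 1e-12)).sum() # H(Y): balanced partition
    ce = F.cross_entropy(logits_support, y_support)
    return ce + alpha * (cond_ent - marg_ent)

loss = tim_loss(torch.randn(25, 5), torch.randint(0, 5, (25,)), torch.randn(75, 5))
```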
【6】 Vision-based Behavioral Recognition of Novelty Preference in Pigs
Authors: Aniket Shirke, Rebecca Golden, Mrinal Gautam, Angela Green-Miller, Matthew Caesar, Ryan N. Dilger
Affiliations: Department of Computer Science, University of Illinois at Urbana-Champaign; Department of Animal Sciences, University of Illinois at Urbana-Champaign; Department of Agricultural & Biological Engineering, University of Illinois at Urbana-Champaign
Note: 5 pages, 7 figures, Accepted at the CVPR 2021 CV4Animals workshop
Link: https://arxiv.org/abs/2106.12181
Abstract: Behavioral scoring of research data is crucial for extracting domain-specific metrics but is bottlenecked by the ability to analyze enormous volumes of information using human labor. Deep learning is widely viewed as a key advancement to relieve this bottleneck. We identify one such domain where deep learning can be leveraged to alleviate the process of manual scoring. Novelty preference paradigms have been widely used to study recognition memory in pigs, but analysis of these videos requires human intervention. We introduce a subset of such videos in the form of the 'Pig Novelty Preference Behavior' (PNPB) dataset, fully annotated with pig actions and keypoints. In order to demonstrate the application of state-of-the-art action recognition models on this dataset, we compare LRCN, C3D, and TSM on the basis of various analytical metrics and discuss common pitfalls of the models. Our methods achieve an accuracy of 93% and a mean Average Precision of 96% in estimating piglet behavior. We open-source our code and annotated dataset at https://github.com/AIFARMS/NOR-behavior-recognition.
【7】 Team PyKale (xy9) Submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition
Authors: Xianyuan Liu, Raivo Koot, Shuo Zhou, Tao Lei, Haiping Lu
Affiliations: Institute of Optics and Electronics, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Sheffield, United Kingdom
Link: https://arxiv.org/abs/2106.12023
Abstract: This report describes the technical details of our submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition. The EPIC-Kitchens dataset is more difficult than other video domain adaptation datasets due to its multiple tasks and modalities. Firstly, to participate in the challenge, we employ a transformer to capture the spatial information from each modality. Secondly, we employ a temporal attention module to model temporal-wise inter-dependency. Thirdly, we employ an adversarial domain adaptation network to learn general features between the labeled source and unlabeled target domains. Finally, we incorporate multiple modalities to improve performance via a three-stream network with late fusion. Our network achieves performance comparable to the state-of-the-art baseline T$A^3$N and outperforms the baseline on top-1 accuracy for the verb class and on top-5 accuracy for all three tasks (verb, noun, and action). Under the team name xy9, our submission achieved 5th place in terms of top-1 accuracy for the verb class and all top-5 accuracies.
Segmentation | Semantics (5 papers)
【1】 Adapting Off-the-Shelf Source Segmenter for Target Medical Image Segmentation
Authors: Xiaofeng Liu, Fangxu Xing, Chao Yang, Georges El Fakhri, Jonghye Woo
Affiliations: Gordon Center for Medical Imaging, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA; Facebook Artificial Intelligence, Boston, MA
Note: To appear in MICCAI 2021
Link: https://arxiv.org/abs/2106.12497
Abstract: Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled and unseen target domain, and models are usually trained on data from both domains. Access to the source domain data at the adaptation stage, however, is often limited, due to data storage or privacy issues. To alleviate this, in this work, we target source-free UDA for segmentation, and propose to adapt an "off-the-shelf" segmentation model pre-trained in the source domain to the target domain, with an adaptive batch-wise normalization statistics adaptation framework. Specifically, the domain-specific low-order batch statistics, i.e., mean and variance, are gradually adapted with an exponential momentum decay scheme, while the consistency of domain-shareable high-order batch statistics, i.e., scaling and shifting parameters, is explicitly enforced by our optimization objective. The transferability of each channel is adaptively measured first, from which the contribution of each channel is balanced. Moreover, the proposed source-free UDA framework is orthogonal to unsupervised learning methods, e.g., self-entropy minimization, which can thus be simply added on top of our framework. Extensive experiments on the BraTS 2018 database show that our source-free UDA framework outperformed existing source-relaxed UDA methods for the cross-subtype UDA segmentation task and yielded comparable results for the cross-modality UDA segmentation task, compared with supervised UDA methods that use the source data.
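The low-order statistics adaptation can be sketched directly on a pretrained network: run unlabeled target batches through the model in train mode so that only the BatchNorm running statistics move, with a momentum that decays exponentially over adaptation steps. The hyper-parameters and the frozen-affine choice below are assumptions for illustration, not the paper's full framework (which also regularizes the high-order statistics and weighs channel transferability).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn_statistics(model, target_loader, momentum0=0.1, decay=0.94, steps=50):
    """Sketch of low-order BN statistics adaptation with exponential momentum
    decay; weights (including BN scale/shift) stay frozen."""
    model.train()  # BN layers update running stats only in train mode
    for p in model.parameters():
        p.requires_grad_(False)  # only statistics move, not weights
    bn_layers = [m for m in model.modules()
                 if isinstance(m, nn.modules.batchnorm._BatchNorm)]
    for step, (images, _) in zip(range(steps), target_loader):
        for bn in bn_layers:
            bn.momentum = momentum0 * (decay ** step)  # exponential momentum decay
        model(images)  # the forward pass alone updates running_mean / running_var
    model.eval()
    return model
```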
【2】 Fairness in Cardiac MR Image Analysis: An Investigation of Bias Due to Data Imbalance in Deep Learning Based Segmentation
Authors: Esther Puyol-Anton, Bram Ruijsink, Stefan K. Piechnik, Stefan Neubauer, Steffen E. Petersen, Reza Razavi, Andrew P. King
Affiliations: School of Biomedical Engineering & Imaging Sciences, King's College London, UK; Guy's and St Thomas' Hospital, London, UK; William Harvey Research Institute, NIHR Barts Biomedical Research Centre
Note: MICCAI 2021 conference
Link: https://arxiv.org/abs/2106.12387
Abstract: The subject of "fairness" in artificial intelligence (AI) refers to assessing AI algorithms for potential bias based on demographic characteristics such as race and gender, and the development of algorithms to address this bias. Most applications to date have been in computer vision, although some work in healthcare has started to emerge. The use of deep learning (DL) in cardiac MR segmentation has led to impressive results in recent years, and such techniques are starting to be translated into clinical practice. However, no work has yet investigated the fairness of such models. In this work, we perform such an analysis for racial/gender groups, focusing on the problem of training data imbalance, using a nnU-Net model trained and evaluated on cine short-axis cardiac MR data from the UK Biobank dataset, consisting of 5,903 subjects from 6 different racial groups. We find statistically significant differences in Dice performance between different racial groups. To reduce the racial bias, we investigated three strategies: (1) stratified batch sampling, in which batch sampling is stratified to ensure balance between racial groups; (2) fair meta-learning for segmentation, in which a DL classifier is trained to classify race and jointly optimized with the segmentation model; and (3) protected group models, in which a different segmentation model is trained for each racial group. We also compared the results to the scenario where we have a perfectly balanced database. To assess fairness we used the standard deviation (SD) and skewed error ratio (SER) of the average Dice values. Our results demonstrate that the racial bias results from the use of imbalanced training data, and that all proposed bias mitigation strategies improved fairness, with the best SD and SER resulting from the use of protected group models.
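Strategy (1), stratified batch sampling, is straightforward to implement as a PyTorch sampler that emits group-balanced batches; the sketch below is illustrative, not the paper's code.

```python
import numpy as np
from torch.utils.data import Sampler

class StratifiedGroupSampler(Sampler):
    """Sketch: every batch contains an equal number of samples from each
    protected (e.g. racial) group, so gradient updates are balanced."""
    def __init__(self, group_ids, batch_size):
        self.groups = {g: np.flatnonzero(np.asarray(group_ids) == g)
                       for g in np.unique(group_ids)}
        self.per_group = batch_size // len(self.groups)
        self.num_batches = min(len(v) for v in self.groups.values()) // self.per_group

    def __iter__(self):
        pools = {g: np.random.permutation(idx) for g, idx in self.groups.items()}
        for b in range(self.num_batches):
            batch = np.concatenate(
                [pools[g][b * self.per_group:(b + 1) * self.per_group] for g in pools])
            yield from np.random.permutation(batch).tolist()

    def __len__(self):
        return self.num_batches * self.per_group * len(self.groups)

# Usage: DataLoader groups each consecutive run of batch_size indices into one
# balanced batch, e.g. DataLoader(ds, batch_size=12,
#                                 sampler=StratifiedGroupSampler(races, 12))
```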
【3】 Real-time Instance Segmentation with Discriminative Orientation Maps
Authors: Wentao Du, Zhiyu Xiang, Shuya Chen, Chengyu Qiao, Yiman Chen, Tingming Bai
Affiliations: Zhejiang University, China
Link: https://arxiv.org/abs/2106.12204
Abstract: Although instance segmentation has made considerable advances over recent years, it is still a challenge to design high-accuracy algorithms with real-time performance. In this paper, we propose a real-time instance segmentation framework termed OrienMask. Upon the one-stage object detector YOLOv3, a mask head is added to predict some discriminative orientation maps, which are explicitly defined as spatial offset vectors for both foreground and background pixels. Thanks to the discrimination ability of orientation maps, masks can be recovered without the need for extra foreground segmentation. All instances that match the same anchor size share a common orientation map. This special sharing strategy reduces the amortized memory utilization for mask predictions without loss of mask granularity. Given the surviving box predictions after NMS, instance masks can be concurrently constructed from the corresponding orientation maps with low complexity. Owing to the concise design of the mask representation and its effective integration with the anchor-based object detector, our method is qualified under real-time conditions while maintaining competitive accuracy. Experiments on the COCO benchmark show that OrienMask achieves 34.8 mask AP at a speed of 42.7 fps, evaluated with a single RTX 2080 Ti. The code is available at https://github.com/duwt/OrienMask.
【4】 Bootstrap Representation Learning for Segmentation on Medical Volumes and Sequences
Authors: Zejian Chen, Wei Zhuo, Tianfu Wang, Wufeng Xue, Dong Ni
Affiliations: Shenzhen University; Tencent
Note: 17 pages
Link: https://arxiv.org/abs/2106.12153
Abstract: In this work, we propose a novel straightforward method for medical volume and sequence segmentation with limited annotations. To avert laborious annotating, the recent success of self-supervised learning (SSL) motivates pre-training on unlabeled data. Despite its success, it is still challenging to adapt typical SSL methods to volume/sequence segmentation, due to their lack of mining of local semantic discrimination and rare exploitation of volume and sequence structures. Based on the continuity between slices/frames and the common spatial layout of organs across volumes/sequences, we introduce a novel bootstrap self-supervised representation learning method that leverages the predictability of neighboring slices. At the core of our method are a simple and straightforward dense self-supervision on the predictions of local representations and a strategy of predicting locals based on global context, which enables stable and reliable supervision for both global and local representation mining among volumes. Specifically, we first propose an asymmetric network with an attention-guided predictor to enforce distance-specific prediction and supervision on slices within and across volumes/sequences. Secondly, we introduce a novel prototype-based foreground-background calibration module to enhance representation consistency. The two parts are trained jointly on labeled and unlabeled data. When evaluated on three benchmark datasets of medical volumes and sequences, our model outperforms existing methods by a large margin of 4.5% DSC on ACDC, 1.7% on Prostate, and 2.3% on CAMUS. Intensive evaluations reveal the effectiveness and superiority of our method.
【5】 Exploiting Negative Learning for Implicit Pseudo Label Rectification in Source-Free Domain Adaptive Semantic Segmentation
Authors: Xin Luo, Wei Chen, Yusong Tan, Chen Li, Yulin He, Xiaogang Jia
Affiliations: College of Computers, National University of Defense Technology
Note: 8 pages, 4 figures
Link: https://arxiv.org/abs/2106.12123
Abstract: It is desirable to transfer the knowledge stored in a well-trained source model onto a non-annotated target domain in the absence of source data. However, state-of-the-art methods for source-free domain adaptation (SFDA) are subject to strict limits: 1) access to the internal specifications of source models is a must; and 2) pseudo labels should be clean during self-training, making critical tasks that rely on semantic segmentation unreliable. Aiming at these pitfalls, this study develops a domain adaptive solution to semantic segmentation with pseudo label rectification (namely PR-SFDA), which operates in two phases: 1) Confidence-regularized unsupervised learning: a maximum squares loss is applied to regularize the target model to ensure confidence in predictions; and 2) Noise-aware pseudo label learning: negative learning enables tolerance to noisy pseudo labels in training, while positive learning achieves fast convergence. Extensive experiments have been performed on the domain adaptive semantic segmentation benchmark GTA5-to-Cityscapes. Overall, PR-SFDA achieves a performance of 49.0 mIoU, which is very close to that of the state-of-the-art counterparts. Note that the latter demand access to the source model's internal specifications, whereas PR-SFDA, in sharp contrast, needs none.
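Both training signals have short closed forms: the maximum squares loss is −(1/2)·mean(Σₖ pₖ²), and negative learning penalizes confidence in a randomly drawn complementary class. A sketch of both losses (the complementary-class sampling scheme here is an assumption):

```python
import torch

def max_squares_loss(logits):
    """Phase 1 sketch: maximum squares loss pushes predictions away from the
    decision boundary without entropy minimization's class-imbalance bias."""
    p = logits.softmax(dim=1)                      # (B, K, H, W)
    return -(p ** 2).sum(dim=1).mean() / 2

def negative_learning_loss(logits, pseudo_labels):
    """Phase 2 sketch: for a (possibly noisy) pseudo label y, pick a random
    complementary class y' != y and learn "the pixel is NOT y'", which is
    tolerant to label noise."""
    p = logits.softmax(dim=1)
    num_classes = logits.size(1)
    offset = torch.randint(1, num_classes, pseudo_labels.shape, device=logits.device)
    complementary = (pseudo_labels + offset) % num_classes   # never equals y
    p_comp = p.gather(1, complementary.unsqueeze(1)).squeeze(1)
    return -torch.log(1 - p_comp + 1e-12).mean()

logits = torch.randn(2, 19, 64, 64)
pl = torch.randint(0, 19, (2, 64, 64))
print(max_squares_loss(logits).item(), negative_learning_loss(logits, pl).item())
```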
Zero/Few Shot | Transfer | Domain Adaptation (2 papers)
【1】 Generative Self-training for Cross-domain Unsupervised Tagged-to-Cine MRI Synthesis
Authors: Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Reese Timothy, Jerry L. Prince, Georges El Fakhri, Jonghye Woo
Affiliations: Gordon Center for Medical Imaging, Department of Radiology, Massachusetts General Hospital; Dept. of Neural and Pain Sciences, University of Maryland School of Dentistry, Baltimore, MD, USA; Athinoula A. Martinos Center for Biomedical Imaging, Dept. of Radiology
Note: MICCAI 2021 (early accept, <13%)
Link: https://arxiv.org/abs/2106.12499
Abstract: Self-training based unsupervised domain adaptation (UDA) has shown great potential to address the problem of domain shift when applying a deep learning model trained in a source domain to unlabeled target domains. However, while self-training UDA has demonstrated its effectiveness on discriminative tasks, such as classification and segmentation, via reliable pseudo-label selection based on the softmax discrete histogram, self-training UDA for generative tasks, such as image synthesis, has not been fully investigated. In this work, we propose a novel generative self-training (GST) UDA framework with continuous-value prediction and a regression objective for cross-domain image synthesis. Specifically, we propose to filter the pseudo-labels with an uncertainty mask, and to quantify the predictive confidence of generated images with practical variational Bayes learning. Fast test-time adaptation is achieved by a round-based alternating optimization scheme. We validated our framework on the tagged-to-cine magnetic resonance imaging (MRI) synthesis problem, where datasets in the source and target domains were acquired from different scanners or centers. Extensive validations were carried out to verify our framework against popular adversarial training UDA methods. Results show that our GST, with tagged MRI of test subjects in new target domains, improved the synthesis quality by a large margin compared with the adversarial training UDA methods.
【2】 Transfer Learning of Deep Spatiotemporal Networks to Model Arbitrarily Long Videos of Seizures
Authors: Fernando Pérez-García, Catherine Scott, Rachel Sparks, Beate Diehl, Sébastien Ourselin
Affiliations: Department of Medical Physics and Biomedical Engineering, University College London; Wellcome EPSRC Centre for Interventional and Surgical Sciences (WEISS), University College London, UK
Note: Accepted at the 24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2021)
Link: https://arxiv.org/abs/2106.12014
Abstract: Detailed analysis of seizure semiology, the symptoms and signs which occur during a seizure, is critical for the management of epilepsy patients. Inter-rater reliability using qualitative visual analysis is often poor for semiological features. Therefore, automatic and quantitative analysis of video-recorded seizures is needed for objective assessment. We present GESTURES, a novel architecture combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to learn deep representations of arbitrarily long videos of epileptic seizures. We use a spatiotemporal CNN (STCNN) pre-trained on large human action recognition (HAR) datasets to extract features from short snippets (approx. 0.5 s) sampled from seizure videos. We then train an RNN to learn seizure-level representations from the sequence of features. We curated a dataset of seizure videos from 68 patients and evaluated GESTURES on its ability to classify seizures into focal onset seizures (FOSs) (N = 106) vs. focal to bilateral tonic-clonic seizures (TCSs) (N = 77), obtaining an accuracy of 98.9% using bidirectional long short-term memory (BLSTM) units. We demonstrate that an STCNN trained on a HAR dataset can be used in combination with an RNN to accurately represent arbitrarily long videos of seizures. GESTURES can provide accurate seizure classification by modeling sequences of semiologies.
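The overall pipeline (a frozen HAR-pretrained STCNN as a snippet encoder, a BLSTM over the snippet sequence) can be sketched as follows; the feature dimension and last-step readout are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SnippetSequenceClassifier(nn.Module):
    """Sketch of the GESTURES-style pipeline: a frozen spatiotemporal CNN turns
    each ~0.5 s snippet into a feature vector; a bidirectional LSTM summarizes
    the arbitrarily long snippet sequence into a seizure-level prediction."""
    def __init__(self, stcnn, feat_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.stcnn = stcnn.eval()            # pre-trained on a HAR dataset
        for p in self.stcnn.parameters():
            p.requires_grad_(False)
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, snippets):
        # snippets: (B, T, C, F, H, W) - T snippets of F frames each
        b, t = snippets.shape[:2]
        with torch.no_grad():
            feats = self.stcnn(snippets.flatten(0, 1)).view(b, t, -1)  # (B, T, D)
        out, _ = self.blstm(feats)
        return self.head(out[:, -1])  # classify FOS vs. TCS from the last step

# stcnn could be e.g. torchvision.models.video.r3d_18 with its fc layer removed.
```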
Semi/Weakly/Unsupervised | Active Learning | Uncertainty (4 papers)
【1】 Deep unsupervised 3D human body reconstruction from a sparse set of landmarks
Authors: Meysam Madadi, Hugo Bertiche, Sergio Escalera
Affiliations: Computer Vision Center, Edifici O, Campus UAB, Bellaterra (Barcelona), Catalonia, Spain; Dept. Mathematics and Informatics, Universitat de Barcelona, Catalonia, Spain
Link: https://arxiv.org/abs/2106.12282
Abstract: In this paper we propose the first deep unsupervised approach in human body reconstruction that estimates the body surface from a sparse set of landmarks, called DeepMurf. We apply a denoising autoencoder to estimate missing landmarks. Then we apply an attention model to estimate body joints from landmarks. Finally, a cascading network is applied to regress the parameters of a statistical generative model that reconstructs the body. Our proposed set of loss functions allows us to train the network in an unsupervised way. Results on four public datasets show that our approach accurately reconstructs the human body from real-world mocap data.
【2】 STRESS: Super-Resolution for Dynamic Fetal MRI using Self-Supervised Learning
Authors: Junshen Xu, Esra Abaci Turk, P. Ellen Grant, Polina Golland, Elfar Adalsteinsson
Affiliations: Department of Electrical Engineering and Computer Science, MIT; Fetal-Neonatal Neuroimaging and Developmental Science Center, Boston Children's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
Link: https://arxiv.org/abs/2106.12407
Abstract: Fetal motion is unpredictable and rapid on the scale of conventional MR scan times. Therefore, dynamic fetal MRI, which aims at capturing fetal motion and dynamics of fetal function, is limited to fast imaging techniques with compromises in image quality and resolution. Super-resolution for dynamic fetal MRI is still a challenge, especially when multi-oriented stacks of image slices for oversampling are not available and high temporal resolution for recording the dynamics of the fetus or placenta is desired. Further, fetal motion makes it difficult to acquire high-resolution images for supervised learning methods. To address this problem, in this work, we propose STRESS (Spatio-Temporal Resolution Enhancement with Simulated Scans), a self-supervised super-resolution framework for dynamic fetal MRI with interleaved slice acquisitions. Our proposed method simulates an interleaved slice acquisition along the high-resolution axis on the originally acquired data to generate pairs of low- and high-resolution images. It then trains a super-resolution network by exploiting both spatial and temporal correlations in the MR time series, which is used to enhance the resolution of the original data. Evaluations on both simulated and in utero data show that our proposed method outperforms other self-supervised super-resolution methods and improves image quality, which is beneficial to other downstream tasks and evaluations.
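The self-supervised pair construction can be sketched in a few lines: drop slices of the acquired series in an interleaved pattern, so that the removed slices become targets for the kept ones. This is an assumed simplification of the paper's scheme, shown only to make the idea concrete.

```python
import numpy as np

def simulate_interleaved_pairs(hr_series, factor=2):
    """Sketch: given an acquired series (T, Z, H, W), drop slices along the
    through-plane axis in an interleaved pattern to synthesize low-resolution
    inputs whose missing slices serve as the self-supervision targets."""
    pairs = []
    for offset in range(factor):
        lr = hr_series[:, offset::factor]  # kept ("acquired") slices
        target = np.delete(hr_series,
                           np.arange(offset, hr_series.shape[1], factor), axis=1)
        pairs.append((lr, target))         # network learns lr -> missing slices
    return pairs

series = np.random.rand(8, 16, 64, 64)     # toy dynamic MR time series
pairs = simulate_interleaved_pairs(series)
```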
【3】 Learning from Pseudo Lesion: A Self-supervised Framework for COVID-19 Diagnosis
Authors: Zhongliang Li, Zhihao Jin, Xuechen Li, Linlin Shen
Affiliations: Shenzhen University; Guangdong Key Laboratory of Intelligent Information Processing
Link: https://arxiv.org/abs/2106.12313
Abstract: The Coronavirus disease 2019 (COVID-19) has spread rapidly all over the world since its first report in December 2019, and thoracic computed tomography (CT) has become one of the main tools for its diagnosis. In recent years, deep learning-based approaches have shown impressive performance in myriad image recognition tasks. However, they usually require a large number of annotated data for training. Inspired by Ground Glass Opacity (GGO), a common finding in COVID-19 patients' CT scans, we propose in this paper a novel self-supervised pretraining method based on pseudo lesion generation and restoration for COVID-19 diagnosis. We used Perlin noise, a gradient-noise-based mathematical model, to generate lesion-like patterns, which were then randomly pasted onto the lung regions of normal CT images to generate pseudo COVID-19 images. The pairs of normal and pseudo COVID-19 images were then used to train an encoder-decoder architecture based U-Net for image restoration, which does not require any labeled data. The pretrained encoder was then fine-tuned using labeled data for the COVID-19 diagnosis task. Two public COVID-19 diagnosis datasets made up of CT images were employed for evaluation. Comprehensive experimental results demonstrate that the proposed self-supervised learning approach could extract better feature representations for COVID-19 diagnosis, and the accuracy of the proposed method outperformed the supervised model pretrained on large-scale images by 6.57% and 3.03% on the SARS-CoV-2 dataset and the Jinan COVID-19 dataset, respectively.
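The pseudo-lesion generation step can be reproduced with a standard lattice implementation of 2D Perlin (gradient) noise blended into the lung mask; the lattice resolution and blending strength below are assumptions, not the paper's settings.

```python
import numpy as np

def perlin_noise(shape, res, rng=None):
    """2D gradient (Perlin) noise on a (shape x shape) grid with res lattice
    cells per axis; shape must be divisible by res."""
    rng = rng or np.random.default_rng(0)
    d = shape // res
    grid = np.mgrid[0:res:1 / d, 0:res:1 / d].transpose(1, 2, 0) % 1
    ang = 2 * np.pi * rng.random((res + 1, res + 1))
    grads = np.dstack((np.cos(ang), np.sin(ang)))
    def g(ix, iy):  # corner gradient field tiled to pixel resolution
        return grads[ix:ix + res, iy:iy + res].repeat(d, 0).repeat(d, 1)
    n00 = (grid * g(0, 0)).sum(2)
    n10 = ((grid - [1, 0]) * g(1, 0)).sum(2)
    n01 = ((grid - [0, 1]) * g(0, 1)).sum(2)
    n11 = ((grid - [1, 1]) * g(1, 1)).sum(2)
    t = 6 * grid**5 - 15 * grid**4 + 10 * grid**3       # smoothstep fade
    n0 = n00 * (1 - t[..., 0]) + t[..., 0] * n10
    n1 = n01 * (1 - t[..., 0]) + t[..., 0] * n11
    return (1 - t[..., 1]) * n0 + t[..., 1] * n1

def paste_pseudo_lesion(ct_slice, lung_mask, strength=0.3):
    """Blend a GGO-like pattern into the lung region; the (clean, corrupted)
    pair then trains a U-Net to restore the original slice."""
    noise = perlin_noise(ct_slice.shape[0], res=8)
    noise = (noise - noise.min()) / (noise.max() - noise.min() + 1e-8)
    return ct_slice + strength * noise * lung_mask

img = np.random.rand(256, 256)
mask = np.zeros((256, 256)); mask[64:192, 64:192] = 1
pseudo = paste_pseudo_lesion(img, mask)
```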
【4】 Deformed2Self: Self-Supervised Denoising for Dynamic Medical Imaging
Authors: Junshen Xu, Elfar Adalsteinsson
Affiliations: Department of Electrical Engineering and Computer Science, MIT; Institute for Medical Engineering and Science, MIT, Cambridge, MA, USA
Link: https://arxiv.org/abs/2106.12175
Abstract: Image denoising is of great importance for medical imaging systems, since it can improve image quality for disease diagnosis and downstream image analyses. In a variety of applications, dynamic imaging techniques are utilized to capture the time-varying features of the subject, where multiple images are acquired for the same subject at different time points. Although the signal-to-noise ratio of each time frame is usually limited by the short acquisition time, the correlation among different time frames can be exploited to improve denoising results with shared information across time frames. With the success of neural networks in computer vision, supervised deep learning methods show prominent performance in single-image denoising, relying on large datasets with clean-vs-noisy image pairs. Recently, several self-supervised deep denoising models have been proposed, achieving promising results without needing the pairwise ground truth of clean images. In the field of multi-image denoising, however, very little work has been done on extracting correlated information from multiple slices for denoising using self-supervised deep learning methods. In this work, we propose Deformed2Self, an end-to-end self-supervised deep learning framework for dynamic imaging denoising. It combines single-image and multi-image denoising to improve image quality and uses a spatial transformer network to model motion between different slices. Further, it only requires a single noisy image with a few auxiliary observations at different time frames for training and inference. Evaluations on phantom and in vivo data with different noise statistics show that our method has performance comparable to other state-of-the-art unsupervised or self-supervised denoising methods and outperforms them under high noise levels.
Temporal | Action Recognition | Pose | Video | Motion Estimation (2 papers)
【1】 A new Video Synopsis Based Approach Using Stereo Camera
Authors: Talha Dilber, Mehmet Serdar Guzel, Erkan Bostanci
Link: https://arxiv.org/abs/2106.12362
Abstract: In today's world, the amount of data produced in every field has increased to an unexpected level. In the face of increasing data, the importance of data processing has increased remarkably. Our topic is the processing of video data, which has an important place in this growth, and the production of summary videos. Within this scope, a new method for anomaly detection with object-based unsupervised learning has been developed while creating a video summary. Using this method, the video data is processed as pixels and the result is produced as video segments. The process flow can be briefly summarized as follows. Objects in the video are detected according to their type and then tracked. The tracking history data of the objects are then processed, and a classifier is trained for each object type. Using this classifier, anomalous behavior of objects is detected. Video segments are determined by processing the video moments containing anomalous behaviors. The video summary is created by extracting the detected video segments from the original video and combining them. The model we developed has been tested and verified separately for single-camera and dual-camera systems.
【2】 Sentinel-1 and Sentinel-2 Spatio-Temporal Data Fusion for Clouds Removal
Authors: Alessandro Sebastianelli, Artur Nowakowski, Erika Puglisi, Maria Pia Del Rosso, Jamila Mifdal, Fiora Pirri, Pierre Philippe Mathieu, Silvia Liberata Ullo
Link: https://arxiv.org/abs/2106.12226
Abstract: The abundance of clouds, both spatially and temporally, often makes remote sensing applications with optical images difficult or even impossible. In this manuscript, a novel method for the restoration of clouds-corrupted optical images is presented and developed, based on a joint data fusion paradigm, where three deep neural networks are combined in order to fuse spatio-temporal features extracted from Sentinel-1 and Sentinel-2 time series of data. It is worth highlighting that both the code and the dataset have been implemented from scratch and made available to interested researchers for further analysis and investigation.
医学相关(1篇)
【1】 CxSE: Chest X-ray Slow Encoding CNN for COVID-19 Diagnosis 标题:CxSE:胸片慢速编码CNN用于COVID-19诊断
作者:Thangarajah Akilan 机构:Lakehead University 链接:https://arxiv.org/abs/2106.12157 摘要:冠状病毒以指数级的速度传播,继续扰乱我们的日常生活。需要快速检测,以便隔离阳性患者,避免进一步传播。本文提出了一种新的卷积神经网络(CNN)结构,称为慢编码CNN。在 AI AGAINST COVID19 - Screening X-ray images for COVID-19 Infections 竞赛的测试数据样本上,该模型在敏感性和阳性预测值(PPV)方面的最佳表现为SP=0.67、PP=0.98、SN=0.96、PN=0.52。其中SP和PP代表COVID-19阳性类的敏感性和PPV,SN和PN代表COVID-19阴性类的敏感性和PPV。 摘要:The coronavirus continues to disrupt our everyday lives as it spreads at an exponential rate. It needs to be detected quickly in order to quarantine positive patients so as to avoid further spread. This work proposes a new convolutional neural network (CNN) architecture called 'Slow Encoding CNN'. The proposed model's best performance wrt Sensitivity, Positive Predictive Value (PPV) was found to be SP=0.67, PP=0.98, SN=0.96, and PN=0.52 on AI AGAINST COVID19 - Screening X-ray images for COVID-19 Infections competition's test data samples. SP and PP stand for the Sensitivity and PPV of the COVID-19 positive class, while PN and SN stand for the Sensitivity and PPV of the COVID-19 negative class.
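示意:摘要中的SP/PP/SN/PN即两类各自的敏感性与PPV。下面用NumPy给出二分类混淆矩阵下这两个指标的标准计算方式(标签与预测均为虚构示例,与论文结果无关):

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # 1 = COVID-19 阳性(虚构标签)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

def sens_ppv(y_true, y_pred, positive=1):
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    sensitivity = tp / (tp + fn)   # 敏感性:真实为该类者中被正确识别的比例
    ppv = tp / (tp + fp)           # PPV:预测为该类者中确实为该类的比例
    return sensitivity, ppv

print("阳性类 (SP, PP):", sens_ppv(y_true, y_pred, positive=1))
print("阴性类 (SN, PN):", sens_ppv(y_true, y_pred, positive=0))
```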
GAN|对抗|攻击|生成相关(2篇)
【1】 Fine-Tuning StyleGAN2 For Cartoon Face Generation 标题:用于卡通人脸生成的微调StyleGAN2算法
作者:Jihye Back 机构:Seoul National University 备注:10 pages, 9 figures 链接:https://arxiv.org/abs/2106.12445 摘要:最近的研究表明,无监督图像到图像(I2I)的翻译取得了显著的成功。然而,由于数据的不平衡性,学习不同领域的联合分布仍然是一个非常具有挑战性的问题。现有的模型虽然能生成逼真的目标图像,但很难保持源图像的结构。此外,在多个领域中训练一个关于大数据的生成模型需要大量的时间和计算机资源。为了解决这些局限性,我们提出了一种新的图像到图像的转换方法,通过微调stylegan2预训练模型来生成目标域的图像。stylegan2模型适用于非平衡数据集的无监督I2I翻译;它是高度稳定的,产生逼真的图像,甚至可以从有限的数据中正确学习时,应用简单的微调技术。因此,本文提出了一种新的方法来保持源图像的结构,并在目标域生成真实感的图像。有关代码和结果,请访问https://github.com/happy-jihye/Cartoon-StyleGan2 摘要:Recent studies have shown remarkable success in the unsupervised image to image (I2I) translation. However, due to the imbalance in the data, learning joint distribution for various domains is still very challenging. Although existing models can generate realistic target images, it's difficult to maintain the structure of the source image. In addition, training a generative model on large data in multiple domains requires a lot of time and computer resources. To address these limitations, we propose a novel image-to-image translation method that generates images of the target domain by finetuning a stylegan2 pretrained model. The stylegan2 model is suitable for unsupervised I2I translation on unbalanced datasets; it is highly stable, produces realistic images, and even learns properly from limited data when applied with simple fine-tuning techniques. Thus, in this paper, we propose new methods to preserve the structure of the source images and generate realistic images in the target domain. The code and results are available at https://github.com/happy-jihye/Cartoon-StyleGan2
【2】 Alias-Free Generative Adversarial Networks 标题:无混叠生成对抗网络
作者:Tero Karras,Miika Aittala,Samuli Laine,Erik Härkönen,Janne Hellsten,Jaakko Lehtinen,Timo Aila 机构:Aalto University and NVIDIA, NVIDIA and Aalto University 链接:https://arxiv.org/abs/2106.12423 摘要:我们观察到,尽管典型的生成对抗网络具有层次化的卷积性质,其合成过程仍以一种不健康的方式依赖于绝对像素坐标。这表现为,例如,细节似乎被"粘"在图像坐标上,而不是所描绘对象的表面上。我们将其根本原因追溯到在生成器网络中造成混叠的粗心信号处理。通过将网络中的所有信号解释为连续信号,我们导出了普遍适用的小幅架构更改,以保证不需要的信息不会泄漏到分层合成过程中。得到的网络在FID上与StyleGAN2相当,但内部表示截然不同,并且即使在亚像素尺度上也对平移和旋转完全等变。我们的结果为更适合视频和动画的生成模型铺平了道路。 摘要:We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, e.g., detail appearing to be glued to image coordinates instead of the surfaces of depicted objects. We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process. The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
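示意:该工作的核心思想之一是把网络中的信号视为连续信号,在逐点非线性前后做适当的上采样与低通滤波以抑制混叠。下面是一个与论文实现无关的概念性PyTorch片段(滤波核取假设的3x3 binomial低通,上/下采样方式也是简化选择,并非StyleGAN3的确切做法):

```python
import torch
import torch.nn.functional as F

def lowpass_blur(x):
    # 假设的 3x3 binomial 低通核,按通道做深度可分离卷积
    k = torch.tensor([[1., 2., 1.]])
    k2d = k.t() @ k
    k2d = (k2d / k2d.sum()).view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1)
    return F.conv2d(x, k2d, padding=1, groups=x.shape[1])

def antialiased_relu(x):
    # 概念示意:上采样 -> 非线性 -> 低通 -> 下采样,
    # 以减少逐点非线性引入的高频混叠
    up = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
    up = lowpass_blur(F.relu(up))
    return F.interpolate(up, scale_factor=0.5, mode='bilinear', align_corners=False)

x = torch.randn(1, 8, 32, 32)
print(antialiased_relu(x).shape)   # torch.Size([1, 8, 32, 32])
```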
自动驾驶|车辆|车道检测等(1篇)
【1】 Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers 标题:EURO-PVI:密集城市中心的行人车辆互动
作者:Apratim Bhattacharyya,Daniel Olmeda Reino,Mario Fritz,Bernt Schiele 备注:To appear at CVPR 2021 链接:https://arxiv.org/abs/2106.12442 摘要:行人和自行车路径的准确预测是在密集的城市环境中开发可靠的自动驾驶车辆必不可少的。车辆与行人或骑自行车者之间的相互作用对交通参与者的轨迹有重要影响,例如停车或转弯以避免碰撞。尽管最近的数据集和轨迹预测方法促进了自主车辆的发展,但其中建模的车辆-行人(骑自行车者)交互数量很少。在这项工作中,我们提出了Euro-PVI,一个行人和骑自行车者轨迹的数据集。特别是,与现有数据集相比,我们的数据集涵盖了密集城市场景中更多样化、更复杂的交互。为了解决在预测具有密集交互的未来轨迹时所面临的挑战,我们开发了一个联合推理模型,该模型学习城市场景中多个智能体之间富有表达力的多模态共享潜在空间。这使得我们的Joint-$\beta$-cVAE方法能够更好地建模未来轨迹的分布。我们在nuScenes和Euro-PVI数据集上取得了最先进的结果,证明了捕捉自我车辆和行人(骑自行车者)之间的交互对于准确预测的重要性。 摘要:Accurate prediction of pedestrian and bicyclist paths is integral to the development of reliable autonomous vehicles in dense urban environments. The interactions between vehicle and pedestrian or bicyclist have a significant impact on the trajectories of traffic participants e.g. stopping or turning to avoid collisions. Although recent datasets and trajectory prediction approaches have fostered the development of autonomous vehicles yet the amount of vehicle-pedestrian (bicyclist) interactions modeled are sparse. In this work, we propose Euro-PVI, a dataset of pedestrian and bicyclist trajectories. In particular, our dataset caters more diverse and complex interactions in dense urban scenarios compared to the existing datasets. To address the challenges in predicting future trajectories with dense interactions, we develop a joint inference model that learns an expressive multi-modal shared latent space across agents in the urban scene. This enables our Joint-$\beta$-cVAE approach to better model the distribution of future trajectories. We achieve state of the art results on the nuScenes and Euro-PVI datasets demonstrating the importance of capturing interactions between ego-vehicle and pedestrians (bicyclists) for accurate predictions.
OCR|文本相关(1篇)
【1】 Open Images V5 Text Annotation and Yet Another Mask Text Spotter 标题:Open Images V5文本标注与Yet Another Mask Text Spotter
作者:Ilya Krylov,Sergei Nosov,Vladislav Sovrasov 机构:IOTG Computer Vision (ICV), Intel, Lobachevsky State University of Nizhny Novgorod, Russia 链接:https://arxiv.org/abs/2106.12326 摘要:大规模人类标记数据集在创建高质量的深度学习模型中起着重要的作用。在本文中,我们提出了Open Images V5数据集的文本标注。据我们所知,这是目前公开可用的最大的人工创建文本标注。基于这一标注,我们训练了一个简单的基于Mask-RCNN的网络,称为Yet Another Mask Text Spotter(YAMTS),它在ICDAR2013、ICDAR2015和Total-Text数据集上实现了有竞争力的性能,在某些情况下甚至超过了当前最先进的方法。文本检测识别模型的代码已在线提供:https://github.com/openvinotoolkit/training_extensions. 该模型可以导出为OpenVINO格式并在Intel CPU上运行。 摘要:A large scale human-labeled dataset plays an important role in creating high quality deep learning models. In this paper we present text annotation for Open Images V5 dataset. To our knowledge it is the largest among publicly available manually created text annotations. Having this annotation we trained a simple Mask-RCNN-based network, referred as Yet Another Mask Text Spotter (YAMTS), which achieves competitive performance or even outperforms current state-of-the-art approaches in some cases on ICDAR2013, ICDAR2015 and Total-Text datasets. Code for text spotting model available online at: https://github.com/openvinotoolkit/training_extensions. The model can be exported to OpenVINO-format and run on Intel CPUs.
Attention注意力(1篇)
【1】 Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation 标题:从粗到精的Q-注意:基于离散化的视觉机器人操作的有效学习
作者:Stephen James,Kentaro Wada,Tristan Laidlow,Andrew J. Davison 机构:Dyson Robotics Lab, Imperial College London 备注:Videos and code found at this https URL 链接:https://arxiv.org/abs/2106.12534 摘要:回顾过去几年,深度强化学习(RL)的最大突破是在离散动作领域。然而,机器人操作本身是一个连续控制环境,而这些连续控制强化学习算法往往依赖于演员-评论家(actor-critic)方法,由于演员和评论家的联合优化,这些方法样本效率低且训练难度大。为此,我们探讨了如何将离散动作RL算法的稳定性引入机器人操作领域。我们扩展了最近发布的ARM算法,将连续的下一最佳姿态(next-best pose)智能体替换为离散的下一最佳姿态智能体。考虑到旋转的有界性,旋转的离散化是微不足道的,而平移本质上是无界的,这使得离散化很困难。通过对三维空间的离散化,我们将平移预测转化为体素预测问题;然而,大型工作空间的体素化是内存密集型的,无法使用高密度体素,而高密度体素对获得机器人操作所需的分辨率至关重要。因此,我们建议通过逐渐提高分辨率,从粗到细地应用这种体素预测。在每一步中,我们提取最高值的体素作为预测位置,然后将其作为下一步高分辨率体素化的中心。这种从粗到精的预测应用于多个步骤,给出了几乎无损的平移预测。结果表明,与连续控制算法相比,本文提出的由粗到精算法能更有效地完成RLBench任务,甚至能仅用3次演示、在不到7分钟内从零开始(tabula rasa)训练一些真实世界任务。此外,我们还表明,通过转为体素表示,我们能够很容易地融合来自多个摄像头的观测。 摘要:Reflecting on the last few years, the biggest breakthroughs in deep reinforcement learning (RL) have been in the discrete action domain. Robotic manipulation, however, is inherently a continuous control environment, but these continuous control reinforcement learning algorithms often depend on actor-critic methods that are sample-inefficient and inherently difficult to train, due to the joint optimisation of the actor and critic. To that end, we explore how we can bring the stability of discrete action RL algorithms to the robot manipulation domain. We extend the recently released ARM algorithm, by replacing the continuous next-best pose agent with a discrete next-best pose agent. Discretisation of rotation is trivial given its bounded nature, while translation is inherently unbounded, making discretisation difficult. We formulate the translation prediction as the voxel prediction problem by discretising the 3D space; however, voxelisation of a large workspace is memory intensive and would not work with a high density of voxels, crucial to obtaining the resolution needed for robotic manipulation. We therefore propose to apply this voxel prediction in a coarse-to-fine manner by gradually increasing the resolution. In each step, we extract the highest valued voxel as the predicted location, which is then used as the centre of the higher-resolution voxelisation in the next step. This coarse-to-fine prediction is applied over several steps, giving a near-lossless prediction of the translation. We show that our new coarse-to-fine algorithm is able to accomplish RLBench tasks much more efficiently than the continuous control equivalent, and even train some real-world tasks, tabular rasa, in less than 7 minutes, with only 3 demonstrations. Moreover, we show that by moving to a voxel representation, we are able to easily incorporate observations from multiple cameras.
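示意:摘要中"由粗到精"的平移预测可以理解为:每一步在当前体素网格上取值最大的体素作为预测位置,再以该体素为中心、在更小范围内重新体素化(等效于提高分辨率)。下面用NumPy给出这一几何过程的简化示意(体素值用随机数占位,真实方法中它由Q-attention网络给出):

```python
import numpy as np

def coarse_to_fine(center, extent, steps=3, res=16):
    """每步在 res^3 网格上取最大值体素,并围绕它把工作空间范围减半。"""
    rng = np.random.default_rng(0)
    for _ in range(steps):
        q = rng.random((res, res, res))          # 占位的体素值(假设)
        idx = np.unravel_index(np.argmax(q), q.shape)
        voxel_size = (2 * extent) / res           # 网格覆盖 [center-extent, center+extent]
        # 被选体素的中心成为下一步更高分辨率体素化的中心
        center = center - extent + (np.array(idx) + 0.5) * voxel_size
        extent = extent / 2
    return center

print(coarse_to_fine(center=np.zeros(3), extent=1.0))
```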
人脸|人群计数(2篇)
【1】 3D human tongue reconstruction from single "in-the-wild" images 标题:基于单幅"野外"图像的三维人舌重建
作者:Stylianos Ploumpis,Stylianos Moschoglou,Vasileios Triantafyllou,Stefanos Zafeiriou 机构:Imperial College London, UK, Huawei Technologies Co. Ltd 备注:10 pages, 9 figures 链接:https://arxiv.org/abs/2106.12302 摘要:基于单个图像的三维人脸重建是计算机视觉领域的一个研究热点,特别是由于其在真实感三维化身创建、姿态不变人脸识别和人脸幻构等领域的广泛应用。自从90年代末引入三维变形模型以来,我们目睹了一场旨在解决这一问题的研究热潮。然而,尽管主要归功于深度学习进展,从单个图像重建三维人脸的细节水平不断提高,但舌头等更精细且高度可变形的面部组件仍然缺失于文献中所有的三维人脸模型,尽管它们对三维化身表示的真实感非常重要。在这项工作中,据我们所知,我们提出了第一个能够准确地同时重建三维人脸与舌头的端到端可训练管道。此外,我们通过引入一种专为三维舌面生成设计的新颖GAN方法,使该管道在"野外"图像上具有很强的鲁棒性。最后,我们向社区公开了第一个多样化的舌头数据集,包含来自700名不同性别、年龄和种族背景个体的1800个原始扫描。正如我们在一系列大量的定量和定性实验中所证明的,我们的模型被证明是健壮的,并且即使在不利的"野外"条件下也能真实地捕捉到三维舌头结构。 摘要:3D face reconstruction from a single image is a task that has garnered increased interest in the Computer Vision community, especially due to its broad use in a number of applications such as realistic 3D avatar creation, pose invariant face recognition and face hallucination. Since the introduction of the 3D Morphable Model in the late 90's, we witnessed an explosion of research aiming at particularly tackling this task. Nevertheless, despite the increasing level of detail in the 3D face reconstructions from single images mainly attributed to deep learning advances, finer and highly deformable components of the face such as the tongue are still absent from all 3D face models in the literature, although being very important for the realness of the 3D avatar representations. In this work we present the first, to the best of our knowledge, end-to-end trainable pipeline that accurately reconstructs the 3D face together with the tongue. Moreover, we make this pipeline robust in "in-the-wild" images by introducing a novel GAN method tailored for 3D tongue surface generation. Finally, we make publicly available to the community the first diverse tongue dataset, consisting of 1,800 raw scans of 700 individuals varying in gender, age, and ethnicity backgrounds. As we demonstrate in an extensive series of quantitative as well as qualitative experiments, our model proves to be robust and realistically captures the 3D tongue structure, even in adverse "in-the-wild" conditions.
【2】 Region-Aware Network: Model Human's Top-Down Visual Perception Mechanism for Crowd Counting 标题:区域感知网络:模拟人类自上而下的视觉感知机制进行人群计数
作者:Yuehai Chen,Jing Yang,Dong Zhang,Kun Zhang,Badong Chen,Shaoyi Du 机构:School of Automation Science and Engineering, Xi'an Jiaotong University, Institute of Artificial Intelligence and Robotics, College of Artificial Intelligence, Xi'an Jiaotong University, Xi'an, Shaanxi, China 链接:https://arxiv.org/abs/2106.12163 摘要:背景噪声和尺度变化是人群计数中常见的问题。人类通过对人群区域的关注和对人群区域的拥挤程度进行全局感知,在人群图像上快速地判断出人群的大致数量和所处的位置。因此,本文通过对人类自顶向下视觉感知机制的建模,提出了一种基于区域感知块的反馈网络RANet。首先,我们引入一个反馈架构来产生优先级图,提供输入图像中候选人群区域的先验。该先验使得RANet更加关注人群区域。然后设计区域感知块,通过全局感受野自适应地将上下文信息编码到输入图像中。更具体地说,我们以列向量的形式扫描整个输入图像及其优先级图,以获得估计其相似性的相关矩阵。得到的相关矩阵将用于建立像素之间的全局关系。在几个公共数据集上,我们的方法优于最先进的人群计数方法。 摘要:Background noise and scale variation are common problems that have been long recognized in crowd counting. Humans glance at a crowd image and instantly know the approximate number of human and where they are through attention the crowd regions and the congestion degree of crowd regions with a global receptive filed. Hence, in this paper, we propose a novel feedback network with Region-Aware block called RANet by modeling human's Top-Down visual perception mechanism. Firstly, we introduce a feedback architecture to generate priority maps that provide prior about candidate crowd regions in input images. The prior enables the RANet pay more attention to crowd regions. Then we design Region-Aware block that could adaptively encode the contextual information into input images through global receptive field. More specifically, we scan the whole input images and its priority maps in the form of column vector to obtain a relevance matrix estimating their similarity. The relevance matrix obtained would be utilized to build global relationships between pixels. Our method outperforms state-of-the-art crowd counting methods on several public datasets.
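示意:摘要提到"以列向量形式扫描输入图像及其优先级图,得到估计其相似性的相关矩阵,用于建立像素间的全局关系"。下面是这一步骤的一个通用化PyTorch草图(相似度取点积并做softmax归一化,这只是常见做法的假设,具体定义以原文为准):

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 16, 32, 32)    # 输入特征图 (N, C, H, W)
prior = torch.randn(1, 16, 32, 32)   # 优先级图经嵌入后的特征(假设)

n, c, h, w = feat.shape
cols_f = feat.view(n, c, h * w)       # 每个像素展开为一个列向量
cols_p = prior.view(n, c, h * w)

# 相关矩阵:像素两两相似度,形状 (N, HW, HW)
relevance = torch.bmm(cols_f.transpose(1, 2), cols_p)
attn = F.softmax(relevance, dim=-1)

# 利用全局关系对特征做上下文聚合
context = torch.bmm(cols_f, attn.transpose(1, 2)).view(n, c, h, w)
print(context.shape)                  # torch.Size([1, 16, 32, 32])
```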
裁剪|量化|加速|压缩相关(1篇)
【1】 Gradient-Based Interpretability Methods and Binarized Neural Networks 标题:基于梯度的可解释性方法与二值化神经网络
作者:Amy Widdicombe,Simon J. Julier 机构: 1Department of Computer Science, University College London 备注:Accepted at the ICML 2021 Workshop on Theoretic Foundation, Criticism & Application Trend of Explainable AI 链接:https://arxiv.org/abs/2106.12569 摘要:二值化神经网络(BNNs)有可能彻底改变在边缘计算平台上进行深度学习的方式。然而,可解释性方法在这些网络上的有效性尚未得到评估。本文比较了几种常用的基于显著图的可解释性技术(Gradient、SmoothGrad和GradCAM)应用于二值化神经网络或全精度神经网络(FPNN)时的表现。我们发现,基本的Gradient方法为这两类网络生成的显著图外观非常相似。然而,SmoothGrad为BNN生成的显著图明显噪声更多。GradCAM生成的显著图也因网络类型而异,其中一些BNN的解释看起来毫无意义。我们讨论了这些解释差异的可能原因,并以此作为可解释性技术应当在更广泛的网络类型上进行测试的一个例证。 摘要:Binarized Neural Networks (BNNs) have the potential to revolutionize the way that deep learning is carried out in edge computing platforms. However, the effectiveness of interpretability methods on these networks has not been assessed. In this paper, we compare the performance of several widely used saliency map-based interpretabilty techniques (Gradient, SmoothGrad and GradCAM), when applied to Binarized or Full Precision Neural Networks (FPNNs). We found that the basic Gradient method produces very similar-looking maps for both types of network. However, SmoothGrad produces significantly noisier maps for BNNs. GradCAM also produces saliency maps which differ between network types, with some of the BNNs having seemingly nonsensical explanations. We comment on possible reasons for these differences in explanations and present it as an example of why interpretability techniques should be tested on a wider range of network types.
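示意:文中比较的Gradient与SmoothGrad两种基础方法都有标准的通用实现。下面给出一个PyTorch草图(模型为占位的小网络,SmoothGrad的采样次数与噪声幅度等超参均为假设):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))

def gradient_saliency(model, x, target):
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()            # 目标类得分对输入求梯度
    return x.grad.abs().max(dim=1)[0]         # 通道维取最大作为显著图

def smoothgrad(model, x, target, n=25, sigma=0.1):
    grads = torch.zeros_like(x)
    for _ in range(n):                        # 对多次加噪输入的梯度取平均
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        model(noisy)[0, target].backward()
        grads += noisy.grad
    return (grads / n).abs().max(dim=1)[0]

x = torch.rand(1, 3, 32, 32)
print(gradient_saliency(model, x, 3).shape, smoothgrad(model, x, 3).shape)
```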
表征学习(1篇)
【1】 A Circular-Structured Representation for Visual Emotion Distribution Learning 标题:一种用于视觉情绪分布学习的圆形结构表示法
作者:Jingyuan Yang,Jie Li,Leida Li,Xiumei Wang,Xinbo Gao 机构:School of Electronic Engineering, Xidian University, Xi'an, China, School of Artificial Intelligence, Xidian University, Xi'an, China, The Chongqing Key Laboratory of Image Cognition 备注:accepted by CVPR2021 链接:https://arxiv.org/abs/2106.12450 摘要:近年来,随着社交网络上图像共享的普及,视觉情感分析(VEA)越来越受到人们的关注。由于人类的情绪具有模糊性和主观性,因此在标签分布学习(LDL)范式中处理VEA比将其视为单标签分类任务更为合理。与其他LDL任务不同,情绪之间存在内在联系,且每种情绪自身具有独特的特性,这已得到心理学理论的证明。受此启发,我们提出了一个有良好依据的圆形结构表示,以利用先验知识进行视觉情感分布学习。具体地说,我们首先构建一个情感圈来统一其中的任何情感状态。在提出的情感圈中,每个情感分布都用一个情感向量来表示,该情感向量由三个属性(情感极性、情感类型、情感强度)以及两个性质(相似性、可加性)来定义。此外,我们设计了一种新的渐进循环(PC)损失,以由粗到细的方式惩罚预测情感向量与标注情感向量之间的差异,从而以情感特定的方式进一步促进学习过程。在公开的视觉情感分布数据集上进行了大量的实验和比较,结果表明该方法优于现有的方法。 摘要:Visual Emotion Analysis (VEA) has attracted increasing attention recently with the prevalence of sharing images on social networks. Since human emotions are ambiguous and subjective, it is more reasonable to address VEA in a label distribution learning (LDL) paradigm rather than a single-label classification task. Different from other LDL tasks, there exist intrinsic relationships between emotions and unique characteristics within them, as demonstrated in psychological theories. Inspired by this, we propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning. To be specific, we first construct an Emotion Circle to unify any emotional state within it. On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes (i.e., emotion polarity, emotion type, emotion intensity) as well as two properties (i.e., similarity, additivity). Besides, we design a novel Progressive Circular (PC) loss to penalize the dissimilarities between predicted emotion vector and labeled one in a coarse-to-fine manner, which further boosts the learning process in an emotion-specific way. Extensive experiments and comparisons are conducted on public visual emotion distribution datasets, and the results demonstrate that the proposed method outperforms the state-of-the-art methods.
蒸馏|知识提取(1篇)
【1】 Co-advise: Cross Inductive Bias Distillation 标题:联合指导:交叉归纳偏置蒸馏
作者:Sucheng Ren,Zhengqi Gao,Tianyu Hua,Zihui Xue,Yonglong Tian,Shengfeng He,Hang Zhao 机构:South China University of Technology, MIT, University of Texas at Austin, Tsinghua University, Shanghai Qi Zhi Institute 链接:https://arxiv.org/abs/2106.12378 摘要:Transformer最近被从自然语言处理社区引入视觉学习任务,作为基于卷积的神经网络的一种有前途的替代。然而,当训练数据量不足(例如ImageNet)时,它的优势就会退化。为了使之实用化,我们提出了一种新的基于蒸馏的视觉Transformer训练方法。与以往仅提供重量级卷积教师的工作不同,我们引入了具有不同架构归纳偏置(例如卷积和内卷involution)的轻量级教师来共同指导学生Transformer。关键在于,具有不同归纳偏置的教师尽管在同一个数据集上训练,却习得了不同的知识,这些不同的知识在蒸馏过程中相互叠加并提升了学生的表现。配备这种交叉归纳偏置蒸馏方法后,我们的视觉Transformer(称为CivT)在ImageNet上优于所有先前相同结构的Transformer。 摘要:Transformers recently are adapted from the community of natural language processing as a promising substitute of convolution-based neural networks for visual learning tasks. However, its supremacy degenerates given an insufficient amount of training data (e.g., ImageNet). To make it into practical utility, we propose a novel distillation-based method to train vision transformers. Unlike previous works, where merely heavy convolution-based teachers are provided, we introduce lightweight teachers with different architectural inductive biases (e.g., convolution and involution) to co-advise the student transformer. The key is that teachers with different inductive biases attain different knowledge despite that they are trained on the same dataset, and such different knowledge compounds and boosts the student's performance during distillation. Equipped with this cross inductive bias distillation method, our vision transformers (termed as CivT) outperform all previous transformers of the same architecture on ImageNet.
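示意:多教师蒸馏的常见做法是把各教师软标签的KL散度损失与任务损失相加。下面是一个通用草图(温度T、权重alpha与教师输出均为占位假设,并非CivT论文的确切损失形式):

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          T=4.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)      # 普通监督损失
    kd = 0.0
    for t_logits in teacher_logits_list:              # 逐个教师做软标签蒸馏
        kd += F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(t_logits / T, dim=1),
                       reduction='batchmean') * (T * T)
    kd /= len(teacher_logits_list)
    return alpha * ce + (1 - alpha) * kd

s = torch.randn(8, 10)                                # 学生 Transformer 的输出(占位)
teachers = [torch.randn(8, 10), torch.randn(8, 10)]   # 卷积/内卷两类教师输出(占位)
y = torch.randint(0, 10, (8,))
print(multi_teacher_kd_loss(s, teachers, y))
```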
视觉解释|视频理解VQA|caption等(1篇)
【1】 Neural Fashion Image Captioning : Accounting for Data Diversity 标题:神经时尚图像字幕:考虑数据多样性
作者:Gilles Hacheme,Noureini Sayouti 机构:Ai,Innov, Aix-Marseille University (Aix-Marseille School of Economics), CNRS & EHESS, Marseille, France 链接:https://arxiv.org/abs/2106.12154 摘要:图像字幕的应用领域越来越广,时尚也不例外。对于有时托管数十万张图片的时尚网络平台而言,自动生成商品描述非常有价值。本文是最早处理时尚图像字幕生成的工作之一。为了有助于解决数据集的多样性问题,我们引入了InFashAIv1数据集,其中包含近16000张非洲时尚商品图片及其标题、价格和一般描述。除了InFashAIv1之外,我们还使用了众所周知的DeepFashion数据集。字幕是使用由CNN编码器和RNN解码器组成的Show and Tell模型生成的。我们发现,在这两个数据集上联合训练模型可以提高非洲风格时尚图片的字幕质量,这表明存在来自西方风格数据的迁移学习。InFashAIv1数据集已在Github(https://github.com/hgilles06/infashai)上发布,以鼓励更具多样性包容性的研究工作。 摘要:Image captioning has increasingly large domains of application, and fashion is not an exception. Having automatic item descriptions is of great interest for fashion web platforms hosting sometimes hundreds of thousands of images. This paper is one of the first tackling image captioning for fashion images. To contribute addressing dataset diversity issues, we introduced the InFashAIv1 dataset containing almost 16,000 African fashion item images with their titles, prices and general descriptions. We also used the well known DeepFashion dataset in addition to InFashAIv1. Captions are generated using the Show and Tell model made of CNN encoder and RNN Decoder. We showed that jointly training the model on both datasets improves captions quality for African style fashion images, suggesting a transfer learning from Western style data. The InFashAIv1 dataset is released on Github (https://github.com/hgilles06/infashai) to encourage works with more diversity inclusion.
超分辨率|去噪|去模糊|去雾(1篇)
【1】 Multi-modal and frequency-weighted tensor nuclear norm for hyperspectral image denoising 标题:多模态频率加权张量核范数在高光谱图像去噪中的应用
作者:Sheng Liu,Xiaozhen Xie,Wenfeng Kong,Jifeng Ning 链接:https://arxiv.org/abs/2106.12489 摘要:低秩性在高光谱图像(HSI)去噪任务中十分重要。基于张量奇异值分解定义的张量核范数(TNN)是刻画HSI低秩性的最新方法。然而,TNN在处理去噪任务时忽略了HSI的一些物理意义,导致了次优的去噪性能。在本文中,我们针对HSI去噪任务提出了多模态频率加权张量核范数(MFWTNN)及其非凸版本。首先,我们研究了频率分量的物理意义,并重新考虑了它们的权重,以提高TNN的低秩表示能力。同时,我们还考虑了HSI的两个空间维度和光谱维度之间的相关性,并结合上述对TNN的改进,提出了MFWTNN。其次,我们利用非凸函数来逼近频率张量的秩函数,并提出NonMFWTNN来更好地松弛MFWTNN。此外,对于主要包含噪声信息的切片,自适应地选择较大的权值,对于包含轮廓信息的切片,自适应地选择较小的权值。最后,我们开发了一种高效的基于交替方向乘子法(ADMM)的算法来求解所提出的模型,其有效性在仿真和真实HSI数据集上得到了验证。 摘要:Low-rankness is important in the hyperspectral image (HSI) denoising tasks. The tensor nuclear norm (TNN), defined based on the tensor singular value decomposition, is a state-of-the-art method to describe the low-rankness of HSI. However, TNN ignores some of the physical meanings of HSI in tackling the denoising tasks, leading to suboptimal denoising performance. In this paper, we propose the multi-modal and frequency-weighted tensor nuclear norm (MFWTNN) and the non-convex MFWTNN for HSI denoising tasks. Firstly, we investigate the physical meaning of frequency components and reconsider their weights to improve the low-rank representation ability of TNN. Meanwhile, we also consider the correlation among two spatial dimensions and the spectral dimension of HSI and combine the above improvements to TNN to propose MFWTNN. Secondly, we use non-convex functions to approximate the rank function of the frequency tensor and propose the NonMFWTNN to relax the MFWTNN better. Besides, we adaptively choose bigger weights for slices mainly containing noise information and smaller weights for slices containing profile information. Finally, we develop the efficient alternating direction method of multiplier (ADMM) based algorithm to solve the proposed models, and the effectiveness of our models are substantiated in simulated and real HSI datasets.
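示意:基于t-SVD的张量核范数通常这样计算:沿第三维做FFT,对每个频率切片求奇异值之和再汇总。下面用NumPy给出一种常见约定下的参考实现(归一化因子随文献不同而异,此处取除以n3的版本,仅供理解;MFWTNN在此基础上为不同频率切片引入权重):

```python
import numpy as np

def tensor_nuclear_norm(X):
    """X: (n1, n2, n3) 实张量;返回一种常见约定下的 TNN 值。"""
    Xf = np.fft.fft(X, axis=2)               # 沿第三维(如光谱维)做 FFT
    total = 0.0
    for k in range(X.shape[2]):               # 对每个频率切片求核范数
        s = np.linalg.svd(Xf[:, :, k], compute_uv=False)
        total += s.sum()
    return total / X.shape[2]                 # 除以 n3 的归一化约定(假设)

X = np.random.rand(16, 16, 8)
print(tensor_nuclear_norm(X))
```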
点云|SLAM|雷达|激光|深度RGBD相关(1篇)
【1】 Collaborative Visual Inertial SLAM for Multiple Smart Phones 标题:多智能手机协同视觉惯性SLAM
作者:Jialing Liu,Ruyu Liu,Kaiqi Chen,Jianhua Zhang,Dongyan Guo 机构:College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China, School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, China 备注:6 pages,4 figures,ICRA2021 链接:https://arxiv.org/abs/2106.12186 摘要:在大型场景和长期的AR应用中,建图的效率和精度至关重要。多智能体协作SLAM是多用户AR交互的前提。多个智能手机的协作有可能提高任务完成的效率和健壮性,并且可以完成单个智能体无法完成的任务。然而,这依赖于健壮的通信、高效的位置检测、健壮的建图以及智能体之间高效的信息共享。我们提出了一个采用集中式架构、部署在多台iOS移动设备上的多智能体协作单目视觉惯性SLAM。每个智能体可以独立地探索环境,在线运行视觉惯性里程计模块,然后将所有测量信息发送到具有更高计算资源的中央服务器。服务器管理接收到的所有信息,检测重叠区域,合并和优化地图,并在需要时与智能体共享信息。我们已经在公共数据集和真实环境中验证了系统的性能。该系统的建图和融合精度与VINS-Mono相当,而后者需要更高的计算资源。 摘要:The efficiency and accuracy of mapping are crucial in a large scene and long-term AR applications. Multi-agent cooperative SLAM is the precondition of multi-user AR interaction. The cooperation of multiple smart phones has the potential to improve efficiency and robustness of task completion and can complete tasks that a single agent cannot do. However, it depends on robust communication, efficient location detection, robust mapping, and efficient information sharing among agents. We propose a multi-intelligence collaborative monocular visual-inertial SLAM deployed on multiple ios mobile devices with a centralized architecture. Each agent can independently explore the environment, run a visual-inertial odometry module online, and then send all the measurement information to a central server with higher computing resources. The server manages all the information received, detects overlapping areas, merges and optimizes the map, and shares information with the agents when needed. We have verified the performance of the system in public datasets and real environments. The accuracy of mapping and fusion of the proposed system is comparable to VINS-Mono which requires higher computing resources.
多模态(2篇)
【1】 PatentNet: A Large-Scale Incomplete Multiview, Multimodal, Multilabel Industrial Goods Image Database 标题:PatentNet:一个大规模不完整的多视图、多模态、多标签工业品图像数据库
作者:Fangyuan Lei,Da Huang,Jianjian Jiang,Ruijun Ma,Senhong Wang,Jiangzhong Cao,Yusen Lin,Qingyun Dai 机构:a Guangdong Provincial Key Laboratory of Intellectual Property &Big Data, Guangzhou , China, b School of Electronic and Information, Guangdong Polytechnic Normal University, Guangzhou , China 备注:12 pages,7 figures 链接:https://arxiv.org/abs/2106.12139 摘要:在深度学习领域,大规模的图像数据集为目标识别和检索的成功带来了突破。当今,作为创新的体现,工业产品的多样性显著增大,其中不完全多视图、多模态、多标签的特性使其不同于传统数据集。在本文中,我们介绍了一个工业品数据集,即PatentNet,其中包含大量高度多样、精确且详细标注的工业品图像以及相应的文本。在PatentNet中,图像和文本来源于外观设计专利。PatentNet拥有超过600万张经专业人员手动检查、带标注的工业品图像和相应文本,是第一个持续更新的工业品图像数据库,其种类比以前用于基准测试的工业品数据集更广泛。PatentNet根据洛迦诺(Locarno)分类协定将数百万张图像组织成32个类和219个子类。通过对图像分类、图像检索和不完全多视图聚类的大量实验,我们证明了PatentNet比现有的工业图像数据集更具多样性、复杂性和挑战性,具有更高的潜力。此外,PatentNet中不完全多视图、多模态和多标签的特性能够为人工智能领域及其他领域提供无与伦比的机会。 摘要:In deep learning area, large-scale image datasets bring a breakthrough in the success of object recognition and retrieval. Nowadays, as the embodiment of innovation, the diversity of the industrial goods is significantly larger, in which the incomplete multiview, multimodal and multilabel are different from the traditional dataset. In this paper, we introduce an industrial goods dataset, namely PatentNet, with numerous highly diverse, accurate and detailed annotations of industrial goods images, and corresponding texts. In PatentNet, the images and texts are sourced from design patent. Within over 6M images and corresponding texts of industrial goods labeled manually checked by professionals, PatentNet is the first ongoing industrial goods image database whose varieties are wider than industrial goods datasets used previously for benchmarking. PatentNet organizes millions of images into 32 classes and 219 subclasses based on the Locarno Classification Agreement. Through extensive experiments on image classification, image retrieval and incomplete multiview clustering, we demonstrate that our PatentNet is much more diverse, complex, and challenging, enjoying higher potentials than existing industrial image datasets. Furthermore, the characteristics of incomplete multiview, multimodal and multilabel in PatentNet are able to offer unparalleled opportunities in the artificial intelligence community and beyond.
【2】 Listen to Your Favorite Melodies with img2Mxml, Producing MusicXML from Sheet Music Image by Measure-based Multimodal Deep Learning-driven Assembly 标题:使用img2Mxml聆听您喜爱的旋律:通过基于小节的多模态深度学习驱动组装从乐谱图像生成MusicXML
作者:Tomoyuki Shishido,Fehmiju Fati,Daisuke Tokushige,Yasuhiro Ono 机构:Shishido&Associates, Composer, Ph.D., SK Intellectual Property Law Firm, Enspirea, LLC. 备注:19 pages, 7 figures 链接:https://arxiv.org/abs/2106.12037 摘要:深度学习最近被应用于光学音乐识别(OMR)。然而,目前对各种乐谱图像的OMR处理仍缺乏精度,难以广泛应用。在这里,我们提出了一种MMdA(基于小节的多模态深度学习驱动组装)方法,允许对包括倾斜照片在内的各种图像进行端到端OMR处理。利用这种方法,先由一个深度学习模型提取小节,将其对齐并调整大小,再通过顺序或并行使用多个深度学习模型来推断给定的音乐符号组成部分。使用标准化的小节能够高效地训练模型,并准确校准每个小节中的五条谱线。具有少量特征类型的多个音乐符号组件类别模型可以表示一组多样的音符和其他音乐符号,包括和弦。这种MMdA方法为端到端OMR处理提供了一种精确的解决方案。 摘要:Deep learning has recently been applied to optical music recognition (OMR). However, currently OMR processing from various sheet music images still lacks precision to be widely applicable. Here, we present an MMdA (Measure-based Multimodal deep learning (DL)-driven Assembly) method allowing for end-to-end OMR processing from various images including inclined photo images. Using this method, measures are extracted by a deep learning model, aligned, and resized to be used for inference of given musical symbol components by using multiple deep learning models in sequence or in parallel. Use of each standardized measure enables efficient training of the models and accurate adjustment of five staff lines in each measure. Multiple musical symbol component category models with a small number of feature types can represent a diverse set of notes and other musical symbols including chords. This MMdA method provides a solution to end-to-end OMR processing with precision.
3D|3D重建等相关(1篇)
【1】 The Neurally-Guided Shape Parser: A Monte Carlo Method for Hierarchical Labeling of Over-segmented 3D Shapes 标题:神经引导的形状解析器:过分割三维形状分层标注的蒙特卡罗方法
作者:R. Kenny Jones,Rana Hanocka,Daniel Ritchie 机构:Brown University, University of Chicago 链接:https://arxiv.org/abs/2106.12026 摘要:许多基于学习的三维形状语义分割方法,以端到端方式训练的单遍(single-pass)方式将标签分配给形状原子(例如点云中的点或网格中的面)。这种方法取得了令人印象深刻的性能,但需要大量的标记训练数据。这种范式纠缠着两个可分离的子问题:(1)将形状分解为区域;(2)为这些区域分配语义标签。我们认为,解开这些子问题可以减少标记数据的负担:(1)区域分解不需要语义标记,可以以无监督的方式执行;(2)标记形状区域而不是原子会导致较小的搜索空间,应该可以用较少的标记训练数据进行学习。在本文中,我们通过介绍神经引导形状解析器(NGSP)来研究第二个论断,NGSP是一种学习如何为过分割的3D形状区域分配语义标签的方法。我们通过最大后验(MAP)推理来解决这个问题,对以输入形状为条件的标签分配后验概率进行建模。我们采用一种由神经提议网络引导的蒙特卡罗重要性采样方法;这种基于搜索的方法因假设输入形状已被分解为离散区域而变得可行。我们在来自PartNet的人造三维形状上评估了NGSP在层次语义分割任务上的表现。我们发现,相比先学习标注形状原子、再对每个形状区域聚合预测的基线方法,NGSP带来了显著的性能提升,在低数据情形下尤为明显。最后,我们证明了NGSP对区域粒度的鲁棒性,因为它在区域发生严重损坏的情况下仍然保持了很强的分割性能。 摘要:Many learning-based 3D shape semantic segmentation methods assign labels to shape atoms (e.g. points in a point cloud or faces in a mesh) with a single-pass approach trained in an end-to-end fashion. Such methods achieve impressive performance but require large amounts of labeled training data. This paradigm entangles two separable subproblems: (1) decomposing a shape into regions and (2) assigning semantic labels to these regions. We claim that disentangling these subproblems reduces the labeled data burden: (1) region decomposition requires no semantic labels and could be performed in an unsupervised fashion, and (2) labeling shape regions instead of atoms results in a smaller search space and should be learnable with less labeled training data. In this paper, we investigate this second claim by presenting the Neurally-Guided Shape Parser (NGSP), a method that learns how to assign semantic labels to regions of an over-segmented 3D shape. We solve this problem via MAP inference, modeling the posterior probability of a labeling assignment conditioned on an input shape. We employ a Monte Carlo importance sampling approach guided by a neural proposal network, a search-based approach made feasible by assuming the input shape is decomposed into discrete regions. We evaluate NGSP on the task of hierarchical semantic segmentation on manufactured 3D shapes from PartNet. We find that NGSP delivers significant performance improvements over baselines that learn to label shape atoms and then aggregate predictions for each shape region, especially in low-data regimes. Finally, we demonstrate that NGSP is robust to region granularity, as it maintains strong segmentation performance even as the regions undergo significant corruption.
其他神经网络|深度学习|模型|建模(5篇)
【1】 Feature Alignment for Approximated Reversibility in Neural Networks 标题:神经网络中近似可逆性的特征对齐
作者:Tiago de Souza Farias,Jonas Maziero 机构:Departamento de F´ısica, Centro de Ciˆencias Naturais e Exatas, Universidade Federal de Santa Maria, Avenida Roraima , Santa Maria, Rio Grande do Sul,-, Brazil 备注:21 pages 链接:https://arxiv.org/abs/2106.12562 摘要:我们介绍了特征对齐,一种在人工神经网络中获得近似可逆性的技术。通过特征提取,我们可以训练一个神经网络来学习从输出到输入的反向过程的估计映射。结合变分自动编码器,我们可以从与训练数据相同的统计量中产生新的样本。利用生成对抗网络中的概念,对结果进行了改进。最后,我们证明了该技术可以进行局部训练,节省计算内存资源。应用这些技术,我们报告了三个视觉生成任务的结果:MNIST、CIFAR-10和celebA。 摘要:We introduce feature alignment, a technique for obtaining approximate reversibility in artificial neural networks. By means of feature extraction, we can train a neural network to learn an estimated map for its reverse process from outputs to inputs. Combined with variational autoencoders, we can generate new samples from the same statistics as the training data. Improvements of the results are obtained by using concepts from generative adversarial networks. Finally, we show that the technique can be modified for training neural networks locally, saving computational memory resources. Applying these techniques, we report results for three vision generative tasks: MNIST, CIFAR-10, and celebA.
【2】 Behavior Mimics Distribution: Combining Individual and Group Behaviors for Federated Learning 标题:行为模拟分布:结合个体与群体行为的联邦学习
作者:Hua Huang,Fanhua Shang,Yuanyuan Liu,Hongying Liu 机构:Key Lab of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, China, Peng Cheng Lab, Shenzhen, China 备注:This paper has been accepted by International Joint Conference on Artificial Intelligence (IJCAI) 2021 链接:https://arxiv.org/abs/2106.12300 摘要:联邦学习(FL)已经成为一种活跃且有前途的分布式机器学习范式。由于统计上的异质性,最近的研究清楚地表明,流行的FL方法(例如FedAvg)的性能由于本地更新引起的客户端漂移而急剧恶化。本文提出了一种新的联邦学习算法(称为IGFL),它利用个体和群体的行为来模拟分布,从而提高了处理异质性的能力。与现有的FL方法不同,我们的IGFL可以同时应用于客户端和服务器端优化。作为副产品,我们在IGFL的服务器端优化中提出了一种新的基于注意力的联邦学习。据我们所知,这是第一次将注意力机制纳入联邦优化。我们进行了大量的实验,结果表明IGFL可以显著提高现有联邦学习方法的性能。特别是当个体间的数据分布不同时,IGFL可以将分类精度提高13%左右。 摘要:Federated Learning (FL) has become an active and promising distributed machine learning paradigm. As a result of statistical heterogeneity, recent studies clearly show that the performance of popular FL methods (e.g., FedAvg) deteriorates dramatically due to the client drift caused by local updates. This paper proposes a novel Federated Learning algorithm (called IGFL), which leverages both Individual and Group behaviors to mimic distribution, thereby improving the ability to deal with heterogeneity. Unlike existing FL methods, our IGFL can be applied to both client and server optimization. As a by-product, we propose a new attention-based federated learning in the server optimization of IGFL. To the best of our knowledge, this is the first time to incorporate attention mechanisms into federated optimization. We conduct extensive experiments and show that IGFL can significantly improve the performance of existing federated learning methods. Especially when the distributions of data among individuals are diverse, IGFL can improve the classification accuracy by about 13% compared with prior baselines.
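示意:作为背景,摘要中作为基线的FedAvg,其服务器端聚合就是按各客户端样本量对模型参数做加权平均(IGFL在此基础上结合个体与群体行为,具体聚合方式以原文为准)。下面是FedAvg这一步的极简NumPy草图:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: 各客户端的参数字典列表;按样本数加权平均。"""
    total = sum(client_sizes)
    agg = {}
    for key in client_weights[0]:
        agg[key] = sum(w[key] * (n / total)
                       for w, n in zip(client_weights, client_sizes))
    return agg

# 三个客户端的占位参数与各自的本地样本数(均为虚构)
clients = [{'w': np.ones(3) * i, 'b': np.array([float(i)])} for i in range(1, 4)]
sizes = [10, 30, 60]
print(fedavg(clients, sizes))
```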
【3】 APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores 标题:APNN-TC:在安培GPU张量核上加速任意精度神经网络
作者:Boyuan Feng,Yuke Wang,Tong Geng,Ang Li,Yufei Ding 机构:†University of California, Santa Barbara, Pacific Northwest National Lab. 备注:Accepted by SC'21 链接:https://arxiv.org/abs/2106.12169 摘要:近年来,量化加速神经网络得到了广泛的研究。不幸的是,先前具有不同精度的工作(例如,1位权重和2位激活)通常受到GPU上有限精度支持的限制(例如,int1和int4)。为了打破这种限制,我们引入了第一个任意精度神经网络框架(APNN-TC),以充分利用安培(Ampere)GPU张量核的量化优势。具体地说,APNN-TC首先采用了一种新的仿真算法,以int1计算原语和异或/与(XOR/AND)布尔运算来支持任意短位宽计算。第二,APNN-TC集成了任意精度层设计,通过新的批处理策略和专门的内存组织,高效地将仿真算法映射到张量核。第三,APNN-TC采用了一种新颖的任意精度神经网络设计,最大限度地减少了跨层的内存访问,进一步提高了性能。大量评估表明,在ResNet、VGG等多种神经网络模型上,APNN-TC相对CUTLASS内核可以实现显著加速。 摘要:Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary precision layer designs to efficiently map our emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary precision NN design to minimize memory access across layers and further improve performance. Extensive evaluations show that APNN-TC can achieve significant speedup over CUTLASS kernels and various NN models, such as ResNet and VGG.
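示意:用int1原语和XOR/AND布尔运算模拟低位宽计算的基本原理是:把±1向量编码为比特后,两个二值向量的点积等于 n − 2·popcount(a XOR b)。下面是一个纯Python的概念验证(真实实现运行在Tensor Core上,编码约定为示意假设):

```python
def binary_dot(a_bits, b_bits, n):
    """a_bits/b_bits: 长度 n 的 ±1 向量打包成的整数;返回其点积。"""
    x = (a_bits ^ b_bits) & ((1 << n) - 1)   # XOR:比特不同处对应乘积 -1
    return n - 2 * bin(x).count('1')          # 点积 = n - 2*popcount

def pack(v):
    """把 ±1 向量打包成整数:+1 -> 0 位,-1 -> 1 位(假设的编码约定)。"""
    bits = 0
    for i, x in enumerate(v):
        if x == -1:
            bits |= 1 << i
    return bits

a, b = [1, -1, 1, 1], [1, 1, -1, 1]
print(binary_dot(pack(a), pack(b), 4))        # 1*1 + (-1)*1 + 1*(-1) + 1*1 = 0
```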
【4】 Reachability Analysis of Convolutional Neural Networks 标题:卷积神经网络的可达性分析
作者:Xiaodong Yang,Tomoya Yamaguchi,Hoang-Dung Tran,Bardh Hoxha,Taylor T Johnson,Danil Prokhorov 链接:https://arxiv.org/abs/2106.12074 摘要:深度卷积神经网络作为一种处理复杂实际问题的有效技术,得到了广泛的应用。然而,一个根本的问题是缺乏正式的方法来分析他们的行为。为了解决这个问题,我们提出了一种计算给定输入域的网络的精确可达集的方法,其中可达集由面格结构表示。除了计算可达集外,我们的方法还可以回溯到给定输出可达集的输入域。因此,可以实现对网络行为的全面分析。此外,还介绍了一种快速分析方法,该方法通过考虑每一层中选定的敏感神经元来快速计算可达集。在CNN上对CIFAR10数据集的精确像素级可达性分析方法进行了评价,并与相关工作进行了比较。在CNN CIFAR10数据集和ImageNet数据集的VGG16体系结构上评估了快速分析方法。 摘要:Deep convolutional neural networks have been widely employed as an effective technique to handle complex and practical problems. However, one of the fundamental problems is the lack of formal methods to analyze their behavior. To address this challenge, we propose an approach to compute the exact reachable sets of a network given an input domain, where the reachable set is represented by the face lattice structure. Besides the computation of reachable sets, our approach is also capable of backtracking to the input domain given an output reachable set. Therefore, a full analysis of a network's behavior can be realized. In addition, an approach for fast analysis is also introduced, which conducts fast computation of reachable sets by considering selected sensitive neurons in each layer. The exact pixel-level reachability analysis method is evaluated on a CNN for the CIFAR10 dataset and compared to related works. The fast analysis method is evaluated over a CNN CIFAR10 dataset and VGG16 architecture for the ImageNet dataset.
【5】 High-Throughput Precision Phenotyping of Left Ventricular Hypertrophy with Cardiovascular Deep Learning 标题:心血管深度学习高通量高精度左心室肥厚表型研究
作者:Grant Duffy,Paul P Cheng,Neal Yuan,Bryan He,Alan C. Kwan,Matthew J. Shun-Shin,Kevin M. Alexander,Joseph Ebinger,Matthew P. Lungren,Florian Rader,David H. Liang,Ingela Schnittger,Euan A. Ashley,James Y. Zou,Jignesh Patel,Ronald Witteles,Susan Cheng,David Ouyang 机构:. Department of Cardiology, Smidt Heart Institute, Cedars-Sinai Medical Center, . Department of Medicine, Division of Cardiology, Stanford University, . Department of Computer Science, Stanford University 链接:https://arxiv.org/abs/2106.12511 摘要:左心室肥大(LVH)是由一系列系统性和心血管疾病引起的慢性重构引起的,这些疾病包括高血压、主动脉狭窄、肥厚型心肌病和心脏淀粉样变。LVH的早期发现和特征化可以显著影响患者的护理,但由于对肥厚认识不足、测量误差和变异性以及难以区分LVH的病因而受到限制。为了克服这一挑战,我们提出了EchoNet LVH-一个深度学习的工作流程,它可以自动量化心室肥大,精确度与人类专家相当,并预测LVH的病因。在28201个超声心动图视频上训练,我们的模型精确测量了心室壁厚度(平均绝对误差[MAE]1.4mm,95%CI 1.2-1.5mm),左心室直径(MAE 2.4mm,95%CI 2.2-2.6mm)和后壁厚度(MAE 1.2mm,95%ci1.1-1.3mm),并将心肌淀粉样变(曲线下面积0.83)和肥厚型心肌病(auc0.98)与LVH的其他病因进行分类。在来自独立的国内和国际医疗系统的外部数据集中,EchoNet LVH准确地量化了心室参数(R2分别为0.96和0.90),并在国内外部验证站点检测到心脏淀粉样变(AUC 0.79)和肥厚型心肌病(AUC 0.89)。利用多个心跳的测量,我们的模型可以更准确地识别LV几何结构的细微变化及其病因。与人类专家相比,ECHONET LVH是完全自动化的,允许重复的、精确的测量,并为心肌肥厚的精确诊断奠定基础。作为促进进一步创新的资源,我们还公开了23212个带注释的超声心动图视频的大数据集。 摘要:Left ventricular hypertrophy (LVH) results from chronic remodeling caused by a broad range of systemic and cardiovascular disease including hypertension, aortic stenosis, hypertrophic cardiomyopathy, and cardiac amyloidosis. Early detection and characterization of LVH can significantly impact patient care but is limited by under-recognition of hypertrophy, measurement error and variability, and difficulty differentiating etiologies of LVH. To overcome this challenge, we present EchoNet-LVH - a deep learning workflow that automatically quantifies ventricular hypertrophy with precision equal to human experts and predicts etiology of LVH. Trained on 28,201 echocardiogram videos, our model accurately measures intraventricular wall thickness (mean absolute error [MAE] 1.4mm, 95% CI 1.2-1.5mm), left ventricular diameter (MAE 2.4mm, 95% CI 2.2-2.6mm), and posterior wall thickness (MAE 1.2mm, 95% CI 1.1-1.3mm) and classifies cardiac amyloidosis (area under the curve of 0.83) and hypertrophic cardiomyopathy (AUC 0.98) from other etiologies of LVH. In external datasets from independent domestic and international healthcare systems, EchoNet-LVH accurately quantified ventricular parameters (R2 of 0.96 and 0.90 respectively) and detected cardiac amyloidosis (AUC 0.79) and hypertrophic cardiomyopathy (AUC 0.89) on the domestic external validation site. Leveraging measurements across multiple heart beats, our model can more accurately identify subtle changes in LV geometry and its causal etiologies. Compared to human experts, EchoNet-LVH is fully automated, allowing for reproducible, precise measurements, and lays the foundation for precision diagnosis of cardiac hypertrophy. As a resource to promote further innovation, we also make publicly available a large dataset of 23,212 annotated echocardiogram videos.
其他(7篇)
【1】 How Well do Feature Visualizations Support Causal Understanding of CNN Activations? 标题:特征可视化在多大程度上支持对CNN激活的因果理解?
作者:Roland S. Zimmermann,Judy Borowski,Robert Geirhos,Matthias Bethge,Thomas S. A. Wallis,Wieland Brendel 机构:University of Tübingen, Germany, Technische Universität Darmstadt 备注:ICML 2021 XAI workshop version. Joint first and last authors. Project website at this https URL 链接:https://arxiv.org/abs/2106.12447 摘要:理解深度卷积神经网络内部工作的一种广泛使用的方法是通过激活最大化来可视化单元响应。通过激活最大化的特征可视化被认为为人类提供了关于导致一个单元被激活的图像特征的精确信息。如果这是真的,这些合成图像应该能让人类预测干预的效果,比如遮挡图像的某个区域(比如,狗的头部)是否会改变一个单元的激活。这里,我们通过让人类预测两个方形遮挡中的哪一个导致一个单元的激活发生更大的变化来检验这个假设。大规模众包实验和专家测量均表明,平均而言,Olah等人(2017年)的极度激活的特征可视化确实有助于人类完成这项任务(准确率为67±4%;没有任何可视化时的基线性能为60±3%)。但是,与其他可视化(例如数据集样本)相比,它们没有提供任何显著优势,后者产生类似的性能(66±3%到67±3%的准确率)。综上所述,我们提出了一个客观的心理物理学任务来量化单元级可解释性方法对人类的益处,并且没有发现任何证据表明特征可视化比简单的替代可视化能为人类提供更好的"因果理解"。 摘要:One widely used approach towards understanding the inner workings of deep convolutional neural networks is to visualize unit responses via activation maximization. Feature visualizations via activation maximization are thought to provide humans with precise information about the image features that cause a unit to be activated. If this is indeed true, these synthetic images should enable humans to predict the effect of an intervention, such as whether occluding a certain patch of the image (say, a dog's head) changes a unit's activation. Here, we test this hypothesis by asking humans to predict which of two square occlusions causes a larger change to a unit's activation. Both a large-scale crowdsourced experiment and measurements with experts show that on average, the extremely activating feature visualizations by Olah et al. (2017) indeed help humans on this task ($67\pm4\%$ accuracy; baseline performance without any visualizations is $60\pm3\%$). However, they do not provide any significant advantage over other visualizations (such as e.g. dataset samples), which yield similar performance ($66\pm3\%$ to $67\pm3\%$ accuracy). Taken together, we propose an objective psychophysical task to quantify the benefit of unit-level interpretability methods for humans, and find no evidence that feature visualizations provide humans with better "causal understanding" than simple alternative visualizations.
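示意:文中的心理物理任务可以用代码表达为"分别遮挡图像的两个方块,比较哪个引起目标单元激活的变化更大"。下面是一个通用PyTorch草图(模型、单元选择与遮挡值均为占位假设):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())  # 8 个"单元"

def unit_activation(img, unit=0):
    with torch.no_grad():
        return model(img)[0, unit].item()

def occlude(img, y, x, size=8):
    out = img.clone()
    out[:, :, y:y + size, x:x + size] = 0.0   # 用零值方块遮挡(遮挡值为假设)
    return out

img = torch.rand(1, 3, 32, 32)
base = unit_activation(img)
delta_a = abs(unit_activation(occlude(img, 0, 0)) - base)
delta_b = abs(unit_activation(occlude(img, 16, 16)) - base)
print("遮挡 A 影响更大" if delta_a > delta_b else "遮挡 B 影响更大")
```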
【2】 Image-to-Image Translation of Synthetic Samples for Rare Classes 标题:稀有类合成样本的图像到图像转换
作者:Edoardo Lanzini,Sara Beery 机构:Delft University of Technology, Mekelweg , CD Delft, Netherlands, California Institute of Technology, E California Blvd, Pasadena, CA 链接:https://arxiv.org/abs/2106.12212 摘要:自然界是长尾的:稀有类被观测到的频率比常见类低若干个数量级,这导致数据高度不平衡,稀有类可能只有屈指可数的样本。从少数样本中学习是基于深度学习的分类算法面临的一个已知挑战,也是少样本学习(low-shot learning)领域的研究重点。增加这些稀有类训练数据的一个潜在途径是用合成样本扩充有限的真实数据。这已被证明是有帮助的,但是当在真实数据上测试时,真实与合成之间的域偏移会削弱这些方法的有效性。我们探索使用图像到图像转换方法,来缩小相机陷阱(用于监测野生动物的运动触发静态相机)数据中动物物种分类任务里合成图像与真实图像之间的域差距。我们利用源域与目标域之间的低层特征对齐,使图形引擎生成的稀有物种合成数据更"真实"。实验结果表明,与使用未对齐合成数据增强的系统相比,我们的方法显著降低了一个稀有物种的分类错误率。 摘要:The natural world is long-tailed: rare classes are observed orders of magnitudes less frequently than common ones, leading to highly-imbalanced data where rare classes can have only handfuls of examples. Learning from few examples is a known challenge for deep learning based classification algorithms, and is the focus of the field of low-shot learning. One potential approach to increase the training data for these rare classes is to augment the limited real data with synthetic samples. This has been shown to help, but the domain shift between real and synthetic hinders the approaches' efficacy when tested on real data. We explore the use of image-to-image translation methods to close the domain gap between synthetic and real imagery for animal species classification in data collected from camera traps: motion-activated static cameras used to monitor wildlife. We use low-level feature alignment between source and target domains to make synthetic data for a rare species generated using a graphics engine more "realistic". Compared against a system augmented with unaligned synthetic data, our experiments show a considerable decrease in classification error rates on a rare species.
【3】 A Review of Assistive Technologies for Activities of Daily Living of Elderly 标题:老年人日常生活活动辅助技术综述
作者:Nirmalya Thakur,Chia Y. Han 机构:Department of Electrical Engineering and Computer Science, College of Engineering and Applied Sciences, University of Cincinnati, Ohio, US 备注:None 链接:https://arxiv.org/abs/2106.12183 摘要:本世纪的一个显著特点是老年人口不断增加。随着年龄的增长,老年人由于身体残疾、认知问题、记忆力减退和行为紊乱而有多种需求和要求。这些限制的程度也因年龄、性别、背景、经验、技能、知识等不同而有所不同。随着年龄的增长,这些不同的需求和挑战限制了老年人独立进行日常生活活动的能力。此外,护理人员的短缺使老年人迫切需要以技术为基础的服务,帮助他们完成日常工作,以维持他们的独立生活和积极老龄化。为了满足这些需要,这项工作包括在这一领域作出三大贡献。首先,它提供了一个相当全面的审查辅助生活技术,旨在帮助老年人进行日常生活能力。其次,工作讨论了通过本次审查确定的挑战,这些挑战目前存在于智能家居和智能城市中实施老年人护理辅助生活服务的背景下。最后,该工作还概述了实施、扩展和整合该领域现有工作的方法,以便开发一个急需的框架,能够根据老年人不同和不断变化的需求为他们提供个性化的帮助和以用户为中心的行为干预。 摘要:One of the distinct features of this century has been the population of older adults which has been on a constant rise. Elderly people have several needs and requirements due to physical disabilities, cognitive issues, weakened memory and disorganized behavior, that they face with increasing age. The extent of these limitations also differs according to the varying diversities in elderly, which include age, gender, background, experience, skills, knowledge and so on. These varying needs and challenges with increasing age, limits abilities of older adults to perform Activities of Daily Living (ADLs) in an independent manner. To add to it, the shortage of caregivers creates a looming need for technology-based services for elderly people, to assist them in performing their daily routine tasks to sustain their independent living and active aging. To address these needs, this work consists of making three major contributions in this field. First, it provides a rather comprehensive review of assisted living technologies aimed at helping elderly people to perform ADLs. Second, the work discusses the challenges identified through this review, that currently exist in the context of implementation of assisted living services for elderly care in Smart Homes and Smart Cities. Finally, the work also outlines an approach for implementation, extension and integration of the existing works in this field for development of a much-needed framework that can provide personalized assistance and user-centered behavior interventions to elderly as per their varying and ever-changing needs.
【4】 Towards Consistent Predictive Confidence through Fitted Ensembles 标题:通过拟合集成实现一致的预测置信度
作者:Navid Kardan,Ankit Sharma,Kenneth O. Stanley 机构:Department of Computer Science, University of Central Florida, Orlando, USA 备注:IJCNN 2021 链接:https://arxiv.org/abs/2106.12070 摘要:深度神经网络是机器学习应用中最近取得的许多成功的背后原因。然而,这些模型在遇到分布外(OOD)样本或做出错误预测时会产生过度自信的决策。这种不一致的预测置信度限制了将独立训练的学习模型集成到一个更大的系统中。本文引入可分离概念学习框架,以真实地度量分类器在存在OOD样本时的性能。在此设置中,分类器的多个实例在类集合的分区的不同部分上进行训练。随后,在单独的测试集上评估这些模型组合的性能。与当前的OOD检测技术不同,该框架不需要辅助OOD数据集,也不将分类性能与检测性能割裂开来。此外,我们提出了一个称为拟合集成(fitted ensembles)的新的强基线,用于在深度模型中获得更一致的预测置信度,其中过度自信的预测会被原始分类任务的变换版本所纠正。拟合集成通过观察其组件之间相互矛盾的预测,无需辅助数据即可自然地检测出OOD样本。在MNIST、SVHN、CIFAR-10/100和ImageNet上的实验表明,拟合集成在OOD样本上显著优于传统集成,并且具有可扩展性。 摘要:Deep neural networks are behind many of the recent successes in machine learning applications. However, these models can produce overconfident decisions while encountering out-of-distribution (OOD) examples or making a wrong prediction. This inconsistent predictive confidence limits the integration of independently-trained learning models into a larger system. This paper introduces separable concept learning framework to realistically measure the performance of classifiers in presence of OOD examples. In this setup, several instances of a classifier are trained on different parts of a partition of the set of classes. Later, the performance of the combination of these models is evaluated on a separate test set. Unlike current OOD detection techniques, this framework does not require auxiliary OOD datasets and does not separate classification from detection performance. Furthermore, we present a new strong baseline for more consistent predictive confidence in deep models, called fitted ensembles, where overconfident predictions are rectified by transformed versions of the original classification task. Fitted ensembles can naturally detect OOD examples without requiring auxiliary data by observing contradicting predictions among its components. Experiments on MNIST, SVHN, CIFAR-10/100, and ImageNet show fitted ensemble significantly outperform conventional ensembles on OOD examples and are possible to scale.
【5】 Automatic Head Overcoat Thickness Measure with NASNet-Large-Decoder Net 标题:基于NASNet-Large-Decoder网络的磁头保护层厚度自动测量
作者:Youshan Zhang,Brian D. Davison,Vivien W. Talghader,Zhiyu Chen,Zhiyong Xiao,Gary J. Kunkel 链接:https://arxiv.org/abs/2106.12054 摘要:透射电子显微镜(TEM)是表征材料微观结构和薄膜厚度的主要工具之一。然而,从TEM图像中手动测定薄膜厚度既费时又主观,特别是在薄膜很薄且对测量精度要求很高的情况下。硬盘驱动器行业中磁头保护层(HOC)厚度的测量正是如此。因此,有必要开发软件来自动测量HOC厚度。本文首次提出了一种以NASNet-Large为编码器、后接解码器结构的HOC层分割方法,这种编码器-解码器结构是深度学习图像分割中最常用的架构之一。为了进一步改进分割结果,我们首次提出了一种后处理层来去除分割结果中不相关的部分。为了测量分割出的HOC层的厚度,我们提出了回归卷积神经网络(RCNN)模型以及正交厚度计算方法。实验结果表明,我们的模型具有更高的Dice分数和更低的均方误差,并优于目前最先进的人工测量。 摘要:Transmission electron microscopy (TEM) is one of the primary tools to show microstructural characterization of materials as well as film thickness. However, manual determination of film thickness from TEM images is time-consuming as well as subjective, especially when the films in question are very thin and the need for measurement precision is very high. Such is the case for head overcoat (HOC) thickness measurements in the magnetic hard disk drive industry. It is therefore necessary to develop software to automatically measure HOC thickness. In this paper, for the first time, we propose a HOC layer segmentation method using NASNet-Large as an encoder and then followed by a decoder architecture, which is one of the most commonly used architectures in deep learning for image segmentation. To further improve segmentation results, we are the first to propose a post-processing layer to remove irrelevant portions in the segmentation result. To measure the thickness of the segmented HOC layer, we propose a regressive convolutional neural network (RCNN) model as well as orthogonal thickness calculation methods. Experimental results demonstrate a higher dice score for our model which has lower mean squared error and outperforms current state-of-the-art manual measurement.
【6】 Volume Rendering of Neural Implicit Surfaces 标题:神经隐式曲面体绘制
作者:Lior Yariv,Jiatao Gu,Yoni Kasten,Yaron Lipman 机构:Weizmann Institute of Science, Facebook AI Research 链接:https://arxiv.org/abs/2106.12052 摘要:近年来,神经体绘制技术由于能够从稀疏的输入图像中合成出场景的新颖视图而越来越流行。到目前为止,通过神经体绘制技术学习的几何体是使用通用密度函数建模的。此外,几何体本身是利用密度函数的某个任意水平集提取的,从而导致含噪、往往低保真度的重建。本文的目标是改进神经体绘制中的几何表示和重建。我们通过将体密度建模为几何体的函数来实现这一点。这与以前的工作形成了鲜明对比,之前的工作将几何体建模为体密度的函数。更详细地说,我们将体密度函数定义为作用于符号距离函数(SDF)表示的拉普拉斯累积分布函数(CDF)。这种简单的密度表示有三个好处:(i)它为神经体绘制过程中学习到的几何提供了有用的归纳偏置;(ii)它有助于约束不透明度的近似误差,从而实现对观察光线的精确采样,而精确采样对于提供几何和辐射的精确耦合非常重要;(iii)它允许在体绘制中对形状和外观进行高效的无监督解耦。将这种新的密度表示应用于具有挑战性的场景多视图数据集,产生了高质量的几何重建,优于相关基线。此外,由于形状与外观的解耦,在场景之间交换形状和外观成为可能。 摘要:Neural volume rendering became increasingly popular recently due to its success in synthesizing novel views of a scene from a sparse set of input images. So far, the geometry learned by neural volume rendering techniques was modeled using a generic density function. Furthermore, the geometry itself was extracted using an arbitrary level set of the density function leading to a noisy, often low fidelity reconstruction. The goal of this paper is to improve geometry representation and reconstruction in neural volume rendering. We achieve that by modeling the volume density as a function of the geometry. This is in contrast to previous work modeling the geometry as a function of the volume density. In more detail, we define the volume density function as Laplace's cumulative distribution function (CDF) applied to a signed distance function (SDF) representation. This simple density representation has three benefits: (i) it provides a useful inductive bias to the geometry learned in the neural volume rendering process; (ii) it facilitates a bound on the opacity approximation error, leading to an accurate sampling of the viewing ray. Accurate sampling is important to provide a precise coupling of geometry and radiance; and (iii) it allows efficient unsupervised disentanglement of shape and appearance in volume rendering. Applying this new density representation to challenging scene multiview datasets produced high quality geometry reconstructions, outperforming relevant baselines. Furthermore, switching shape and appearance between scenes is possible due to the disentanglement of the two.
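示意:按摘要所述,体密度是作用在SDF上的拉普拉斯CDF。零均值、尺度为β的拉普拉斯分布的CDF为 Ψβ(s)=½·exp(s/β)(s≤0),1−½·exp(−s/β)(s>0),密度可按常见写法记为 σ(x)=α·Ψβ(−d(x));其中"SDF在物体外为正"的符号约定与可学习参数α、β均为依惯例所作的假设。一个最小的NumPy示意:

```python
import numpy as np

def laplace_cdf(s, beta):
    return np.where(s <= 0,
                    0.5 * np.exp(s / beta),
                    1.0 - 0.5 * np.exp(-s / beta))

def density_from_sdf(d, alpha=1.0, beta=0.1):
    """d: 采样点的符号距离(假设物体外为正);返回体密度 sigma。"""
    return alpha * laplace_cdf(-d, beta)

d = np.linspace(-0.5, 0.5, 5)    # 从物体内部过渡到外部
print(density_from_sdf(d))        # 内部趋近 alpha,外部趋近 0
```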
【7】 On Matrix Factorizations in Subspace Clustering 标题:子空间聚类中的矩阵分解
作者:Reeshad Arian,Keaton Hamm 备注:13 pages plus 4 pages of tables 链接:https://arxiv.org/abs/2106.12016 摘要:本文探讨了基于CUR分解的子空间聚类算法,并在Hopkins155运动分割数据集和Yale人脸数据集上研究了这些算法中各种超参数对聚类性能的影响。对这些数据集的各种采样方法和过采样参数进行了大量的实验,并给出了一些实际应用中参数选择的准则。 摘要:This article explores subspace clustering algorithms using CUR decompositions, and examines the effect of various hyperparameters in these algorithms on clustering performance on two real-world benchmark datasets, the Hopkins155 motion segmentation dataset and the Yale face dataset. Extensive experiments are done for a variety of sampling methods and oversampling parameters for these datasets, and some guidelines for parameter choices are given for practical applications.
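示意:CUR分解用数据矩阵自身的若干列C与行R来近似 A ≈ C U R,其中U常取C与R的伪逆组合;列/行的采样策略与采样数正是文中考察的超参数。下面是均匀采样版本的NumPy草图(采样方式与参数均为示例选择):

```python
import numpy as np

def cur(A, c, r, seed=0):
    """对 A 均匀采样 c 列、r 行,返回 (C, U, R) 使 A ≈ C @ U @ R。"""
    rng = np.random.default_rng(seed)
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # 最小二乘意义下的中间矩阵
    return C, U, R

A = np.random.rand(8, 2) @ np.random.rand(2, 10)    # 构造一个低秩矩阵
C, U, R = cur(A, c=3, r=3)
print(np.linalg.norm(A - C @ U @ R))                 # 低秩情形下误差应接近 0
```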