作者:Vladimir Braverman,Harry Lang,Enayat Ullah,Samson Zhou
摘要:在数据流的时间衰减模型中,基础数据集的元素在按时间顺序获得的情况下,越晚获得的元素更重要。 处理大型数据集的常用方法是去维持 emph {coreset},这是处理数据的简洁摘要,即允许近似恢复预定查询。 我们提供了一个通用框架,它采用任何离线核心集,并为多项式时间衰减函数提供时间衰减核心集。
我们还考虑了k-中值聚类的指数时间衰减模型,其中我们提供了利用在线设施定位算法的常数因子近似算法。 我们的算法存储O(klog(hΔ) h)点,其中h是衰减函数的半衰期,Δ是数据集的纵横比。 我们的技术也扩展到k-means聚类和M-estimators。
原文标题:Improved Algorithms for Time Decay Streams
原文摘要:In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a emph{coreset}, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions.
We also consider the exponential time decay model fork-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm storesO(klog(hΔ) h)points wherehis the half-life of the decay function andΔis the aspect ratio of the dataset. Our techniques extend tok-means clustering andM-estimators as well.
地址:https://arxiv.org/abs/1907.07574