Elasticsearch & ClickHouse 存储效能对比

2021-04-28 10:21:00 浏览数 (2)

ClickHouse是最近比较火的一款数据库,似乎在公有云用户的使用场景上和Elasticsearch有一定的重合。特别是在一些基于日志的业务数据的分析场景,ClickHouse拥有高压缩率、高速查询的特性,受到不少心急火燎的用户的青睐。本文,针对存储效能这个场景来对比一下ES和CK的表现。

因为本文比较长,这里先把最终结果展示出来(无副本情况):

image.pngimage.png

结论:因为ES使用列存(doc value)对Keyword字段以及使用BKD tree对数值字段进行存储,在无需全文检索(无需读取原始日志分析)的OLAP场景,能够达到与CK同等级的压缩率(ES的主键 _id 必须为倒排索引,无法列存压缩,所以有细微差距)

OLAP场景:CK vs ES

ClickHouse是一款专门为OLAP场景,为海量数据下商业智能领域(也就是我们所说的 BI 领域)所设计的列式存储数据库产品,它为OLAP场景做了很多专门的优化,非高精度,非高并发,非CRUD,非ACID的大数据分析场景很多情况下是很适合使用CK的。

但现在使用ES做OLAP场景的案例也不少,很多人会直接说ES从技术架构上不适合做OLAP,因为它本身设计是以搜索引擎为主,底层数据结构的lucene也不是为OLAP场景设计的。这些表述从技术上没有问题,CK在很多方面确实也是做到了极致。

但有需求即代表合理,ES目前是事实上,用在业务日志上做OLAP分析的最常见的方案之一。ES会被作为OLAP数据库的主要原因我认为有以下几个:

  • 最重要的一点就是,无论是日志的集中管理,还是IT运维、安全运维、AIOps、大数据分析,多个团队都对同样的数据由需求,同一份业务流量数据,用户行为数据不只是在OLAP场景中会使用,在做故障调查,数据深加工,还是安全审计、调查,也都会用到。如果同一份海量数据重复存储于多个不同类型的大数据系统上,会造成极度的资源浪费,而ES是市面上几乎是唯一的,能适用任何团队,任何场景的数据库,即一次投资,能够产生重叠的回报
  • ES本身在海量数据、快速查询、聚合算子的支撑,在大多数场景是满足OLAP的需求的
  • ES产品相比CK成熟很多,抛开数据摄入速率、数据压缩这些硬指标,其实在管理一个大数据平台和海量数据集时,还有很多软指标,比如平台的稳定性,成熟度,易用性等软指标也非常重要,以下是ES的相比CK多出来的点:
    • 完整的大数据链路,有数据摄入层(beats, logstash)和可视化层(kibana)
    • 有各种分析app (canvas和lens),直接提供大屏,也可以脱离SQL为业务人员提供探索式、交互式的分析能力
    • 内置的批流计算能力 (rollup,ingest pipeline, transform, machine learning)
    • 内置的数据治理能力 (ILM, SLM,searchable snapshot, reindex,CCS/CCR,数据可重建/丰富/复制)
    • 内置的集群监控和分析能力 (monitoring & profile)
    • 更加丰富的针对不同数据类型的计算能力 (full text, geo, vector)
    • 更多的安全管理 (企业级权限认证接入、文档级别、字段级别的权限设置与隔离)

选取样本数据

回到我们今天的主题,关于对比ES和CK的存储效率,为了能够更直观,避免争议,我们使用ClickHouse官网的Tutorial数据:

代码语言:txt复制
curl https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
curl https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv

下载之后,解压的原始数据的大小:

代码语言:txt复制
ls -lh
总用量 9.8G
-rw-rw-r--. 1 caishichao caishichao 7.3G 4月  26 02:03 hits_v1.tsv
-rw-rw-r--. 1 caishichao caishichao 2.5G 4月  26 02:05 visits_v1.tsv

样本数据集是Yandex.Metrica这个网络分析服务产生的数据,典型的业务流量日志数据。其中:

  • hits 表包含所有用户在服务所涵盖的所有网站上完成的每个操作。
  • visits 表包含预先构建的会话,而不是单个操作。

创建CK的数据库与表

按照Tutorial的教程,原封不动的按照如下方式创建数据库和数据表:

创建库:

代码语言:txt复制
clickhouse-client --query "CREATE DATABASE IF NOT EXISTS tutorial"

创建表:

代码语言:txt复制
CREATE TABLE tutorial.hits_v1
(
    `WatchID` UInt64,
    `JavaEnable` UInt8,
    `Title` String,
    `GoodEvent` Int16,
    `EventTime` DateTime,
    `EventDate` Date,
    `CounterID` UInt32,
    `ClientIP` UInt32,
    `ClientIP6` FixedString(16),
    `RegionID` UInt32,
    `UserID` UInt64,
    `CounterClass` Int8,
    `OS` UInt8,
    `UserAgent` UInt8,
    `URL` String,
    `Referer` String,
    `URLDomain` String,
    `RefererDomain` String,
    `Refresh` UInt8,
    `IsRobot` UInt8,
    `RefererCategories` Array(UInt16),
    `URLCategories` Array(UInt16),
    `URLRegions` Array(UInt32),
    `RefererRegions` Array(UInt32),
    `ResolutionWidth` UInt16,
    `ResolutionHeight` UInt16,
    `ResolutionDepth` UInt8,
    `FlashMajor` UInt8,
    `FlashMinor` UInt8,
    `FlashMinor2` String,
    `NetMajor` UInt8,
    `NetMinor` UInt8,
    `UserAgentMajor` UInt16,
    `UserAgentMinor` FixedString(2),
    `CookieEnable` UInt8,
    `JavascriptEnable` UInt8,
    `IsMobile` UInt8,
    `MobilePhone` UInt8,
    `MobilePhoneModel` String,
    `Params` String,
    `IPNetworkID` UInt32,
    `TraficSourceID` Int8,
    `SearchEngineID` UInt16,
    `SearchPhrase` String,
    `AdvEngineID` UInt8,
    `IsArtifical` UInt8,
    `WindowClientWidth` UInt16,
    `WindowClientHeight` UInt16,
    `ClientTimeZone` Int16,
    `ClientEventTime` DateTime,
    `SilverlightVersion1` UInt8,
    `SilverlightVersion2` UInt8,
    `SilverlightVersion3` UInt32,
    `SilverlightVersion4` UInt16,
    `PageCharset` String,
    `CodeVersion` UInt32,
    `IsLink` UInt8,
    `IsDownload` UInt8,
    `IsNotBounce` UInt8,
    `FUniqID` UInt64,
    `HID` UInt32,
    `IsOldCounter` UInt8,
    `IsEvent` UInt8,
    `IsParameter` UInt8,
    `DontCountHits` UInt8,
    `WithHash` UInt8,
    `HitColor` FixedString(1),
    `UTCEventTime` DateTime,
    `Age` UInt8,
    `Sex` UInt8,
    `Income` UInt8,
    `Interests` UInt16,
    `Robotness` UInt8,
    `GeneralInterests` Array(UInt16),
    `RemoteIP` UInt32,
    `RemoteIP6` FixedString(16),
    `WindowName` Int32,
    `OpenerName` Int32,
    `HistoryLength` Int16,
    `BrowserLanguage` FixedString(2),
    `BrowserCountry` FixedString(2),
    `SocialNetwork` String,
    `SocialAction` String,
    `HTTPError` UInt16,
    `SendTiming` Int32,
    `DNSTiming` Int32,
    `ConnectTiming` Int32,
    `ResponseStartTiming` Int32,
    `ResponseEndTiming` Int32,
    `FetchTiming` Int32,
    `RedirectTiming` Int32,
    `DOMInteractiveTiming` Int32,
    `DOMContentLoadedTiming` Int32,
    `DOMCompleteTiming` Int32,
    `LoadEventStartTiming` Int32,
    `LoadEventEndTiming` Int32,
    `NSToDOMContentLoadedTiming` Int32,
    `FirstPaintTiming` Int32,
    `RedirectCount` Int8,
    `SocialSourceNetworkID` UInt8,
    `SocialSourcePage` String,
    `ParamPrice` Int64,
    `ParamOrderID` String,
    `ParamCurrency` FixedString(3),
    `ParamCurrencyID` UInt16,
    `GoalsReached` Array(UInt32),
    `OpenstatServiceName` String,
    `OpenstatCampaignID` String,
    `OpenstatAdID` String,
    `OpenstatSourceID` String,
    `UTMSource` String,
    `UTMMedium` String,
    `UTMCampaign` String,
    `UTMContent` String,
    `UTMTerm` String,
    `FromTag` String,
    `HasGCLID` UInt8,
    `RefererHash` UInt64,
    `URLHash` UInt64,
    `CLID` UInt32,
    `YCLID` UInt64,
    `ShareService` String,
    `ShareURL` String,
    `ShareTitle` String,
    `ParsedParams` Nested(
        Key1 String,
        Key2 String,
        Key3 String,
        Key4 String,
        Key5 String,
        ValueDouble Float64),
    `IslandID` FixedString(16),
    `RequestNum` UInt32,
    `RequestTry` UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)

这里为了简单起见,我们只创建了hits这张表,表中有非常多的参数。可以看到,这里并没有显式的创建二级索引,并且很多字段是做了精确的字段大小的限制的。

通过以下命令写入数据:

代码语言:txt复制
clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv

数据写入之后,未优化之前的大小,可以看到数据总大小是2.5G,相比原先7.3G的数据,已经有了3倍的压缩

代码语言:txt复制
sudo du -h /var/lib/clickhouse/data/tutorial/hits_v1/
0	/var/lib/clickhouse/data/tutorial/hits_v1/detached
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_1_0
38M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_2_2_0
37M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_3_3_0
46M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_4_4_0
34M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_5_5_0
31M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_6_6_0
43M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_7_7_0
37M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_8_8_0
49M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_9_9_0
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_10_10_0
219M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_6_1
47M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_11_11_0
38M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_12_12_0
41M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_13_13_0
28M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_14_14_0
39M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_15_15_0
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_16_16_0
251M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_7_12_1
44M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_17_17_0
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_18_18_0
42M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_19_19_0
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_20_20_0
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_21_21_0
46M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_22_22_0
230M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_13_18_1
43M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_23_23_0
46M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_24_24_0
44M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_25_25_0
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_26_26_0
47M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_27_27_0
48M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_28_28_0
253M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_19_24_1
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_29_29_0
41M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_30_30_0
35M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_31_31_0
255M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_25_30_1
2.5G	/var/lib/clickhouse/data/tutorial/hits_v1/

检查数据是否真正写入:

代码语言:txt复制
clickhouse-client --query "SELECT COUNT(*) FROM tutorial.hits_v1"
8873898

wc -l hits_v1.tsv 
8873898 hits_v1.tsv

8873898条数据,正确。我们还可以做进一步的优化 (此操作会在后台合并数据以优化数据存储):

代码语言:txt复制
clickhouse-client --query "OPTIMIZE TABLE tutorial.hits_v1 FINAL"

优化之后,做了很多分区内的合并,文件目录明显变少,但数据总大小相差不大,变为2.4G

代码语言:txt复制
sudo du -h /var/lib/clickhouse/data/tutorial/hits_v1/
0	/var/lib/clickhouse/data/tutorial/hits_v1/detached
219M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_6_1
251M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_7_12_1
230M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_13_18_1
253M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_19_24_1
35M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_31_31_0
255M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_25_30_1
1.2G	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_31_2
2.4G	/var/lib/clickhouse/data/tutorial/hits_v1/

以上操作完全按照ClickHouse官网的Tutorial教程。这里要强调的是,此创建是在单节点,无副本的环境和配置下进行的。

创建ES的对等索引

接下来,我们来看一下同样的数据集在ES上的数据大小。先具体看一下数据:

代码语言:txt复制
# 直接读文件
head -n 1 hits_v1.tsv 
4632802378222380466	1	null Acoper «labilir mitsubishi в Липецке на Batak	1	2014-03-23 16:23:07	2014-03-23	27528801	2723656327	??t??4??g?˟O	15887	2224216775939367616	0	56	4	http://rutube.ru/patianu	http://love/v012/04	footki.yandex.ru	m.fotoalmeta	0	0	[[]	[]	[]	1339	555	29	8	0	0.	0	0	44	s?	1	1	0	0			3238011	-1	0		1136	555	117	2014-03-23 06:04:09	5	1	19832	0	utf-8	401	0	0	0	0	1018986580	0	0	0	0	02014-03-24 08:02:03	16	2	2	0	0	[]	2529497135	0?w?nȔ]MnQ??5	18372	-1	2	wi	??			0	-1	4333	1332	747	2472	0	0	7395	285	-1	-1	-1	16115	-1	-1	0		0		?	0	[]			307141980878902893	338497349483004835	0	0				[]	[]	[]	[]	[]	[]	??? ??????bKQ9	47	0

# 查看clickhouse中的数据
clickhouse-client --query "SELECT * FROM tutorial.hits_v1 LIMIT 1"
7043438415214026105	1		1	2014-03-17 19:29:13	2014-03-17	57	706706104	??:?[?Uc??m???	14050	8585742290196126178	0	56	4http://hurpasspx?Entitle,pos,p5,source=wiz&fp=2&pos=42&wp=1&pos=7&lang=ru&lr=10275961,55.84517842040/?fromServic-last/moto.am/iadebnye_uchasti-shedshipmeethodID.asp&is_vative=0&search-tag=2&rpt=simages.yandex.ru/rp/semennoe/zarplatjie-babe-podpisyvat-otvety.ru/tour=170000009886&ref=http://kryma-parca.ru/chat&users/lexus-70@mail.yandex.ru/yarportwo/11868/parta.com/load/483/?from]=200837&tease	http://kelebekgaleri.hurriyet	hurpass.uaprod.hu	sprashivai.ru	0	0	[5,353,3121,11503,11630]	[5,92,469,13425]	[348,1010]	[28,644]	1234	573	29	8	0	0.	0	0	44	s?	1	1	0	0			1117402	1266	111	117	2014-03-16 04:15:16	0	0	0	0	utf-8	510	0	0	0	6595113484233243093	229156632	0	0	02014-03-17 11:48:07	39	1	2	0	0	[]	594601839	???_Y????5Ӵ.	6826	-1	1	nD	??			0	-1	139	175	197	11	0	0	798	-1	-1	-1	-1	-1	-1	-1	0		0		?	0	[]			5301774355650737480	2999314218052725538	0	0				[]	[]	[]	[]	[]	[]	??? ??????bKQ9	104	0

# 做个样本
head -n 100 >> sample.tsv

我们将样本数据,在Kibana上加载,分析一下:

image.pngimage.png

分析出了88个字段,与ClickHouse的Tutorials里hits表上定义的128个字段不符,需要比较精确的做mapping。

通常来说,用户大概会按照ClickHouse的数据表定义来生成es的mapping,比如上面我们看到的这个tutorial.hits_v1 数据表的定义,如果按照数据库里定义的类型去进行类比定义的话,大概会是下面的mapping:

代码语言:txt复制
  "mappings": {
    "properties": {
      "WatchID": {
        "type": "unsigned_long",
        "index": false
      },
      "JavaEnable": {
        "type": "short",
        "index": false
      },
      "Title": {
        "type": "keyword",
        "index": false
      },
      "GoodEvent": {
        "type": "short",
        "index": false
      },
      "EventTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "index": false
      },
      "EventDate": {
        "type": "keyword",
        "index": false
      },
      "CounterID": {
        "type": "long",
        "index": false
      },
      "ClientIP": {
        "type": "long",
        "index": false
      },
      "ClientIP6": {
        "type": "keyword",
        "index": false
      },
      "RegionID": {
        "type": "long",
        "index": false
      },
      "UserID": {
        "type": "unsigned_long",
        "index": false
      },
      "CounterClass": {
        "type": "byte",
        "index": false
      },
      "OS": {
        "type": "short",
        "index": false
      },
      "UserAgent": {
        "type": "short",
        "index": false
      },
      "URL": {
        "type": "keyword",
        "index": false
      },
      "Referer": {
        "type": "keyword",
        "index": false
      },
      "URLDomain": {
        "type": "keyword",
        "index": false
      },
      "RefererDomain": {
        "type": "keyword",
        "index": false
      },
      "Refresh": {
        "type": "short",
        "index": false
      },
      "IsRobot": {
        "type": "short",
        "index": false
      },
      "RefererCategories": {
        "type": "integer",
        "index": false
      },
      "URLCategories": {
        "type": "integer",
        "index": false
      },
      "URLRegions": {
        "type": "long",
        "index": false
      },
      "RefererRegions": {
        "type": "long",
        "index": false
      },
      "ResolutionWidth": {
        "type": "integer",
        "index": false
      },
      "ResolutionHeight": {
        "type": "integer",
        "index": false
      },
      "ResolutionDepth": {
        "type": "short",
        "index": false
      },
      "FlashMajor": {
        "type": "short",
        "index": false
      },
      "FlashMinor": {
        "type": "short",
        "index": false
      },
      "FlashMinor2": {
        "type": "keyword",
        "index": false
      },
      "NetMajor": {
        "type": "short",
        "index": false
      },
      "NetMinor": {
        "type": "short",
        "index": false
      },
      "UserAgentMajor": {
        "type": "integer",
        "index": false
      },
      "UserAgentMinor": {
        "type": "keyword",
        "index": false
      },
      "CookieEnable": {
        "type": "short",
        "index": false
      },
      "JavascriptEnable": {
        "type": "short",
        "index": false
      },
      "IsMobile": {
        "type": "short",
        "index": false
      },
      "MobilePhone": {
        "type": "short",
        "index": false
      },
      "MobilePhoneModel": {
        "type": "keyword",
        "index": false
      },
      "Params": {
        "type": "keyword",
        "index": false
      },
      "IPNetworkID": {
        "type": "long",
        "index": false
      },
      "TraficSourceID": {
        "type": "byte",
        "index": false
      },
      "SearchEngineID": {
        "type": "integer",
        "index": false
      },
      "SearchPhrase": {
        "type": "keyword",
        "index": false
      },
      "AdvEngineID": {
        "type": "short",
        "index": false
      },
      "IsArtifical": {
        "type": "short",
        "index": false
      },
      "WindowClientWidth": {
        "type": "integer",
        "index": false
      },
      "WindowClientHeight": {
        "type": "integer",
        "index": false
      },
      "ClientTimeZone": {
        "type": "short",
        "index": false
      },
      "ClientEventTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "index": false
      },
      "SilverlightVersion1": {
        "type": "short",
        "index": false
      },
      "SilverlightVersion2": {
        "type": "short",
        "index": false
      },
      "SilverlightVersion3": {
        "type": "long",
        "index": false
      },
      "SilverlightVersion4": {
        "type": "integer",
        "index": false
      },
      "PageCharset": {
        "type": "keyword",
        "index": false
      },
      "CodeVersion": {
        "type": "long",
        "index": false
      },
      "IsLink": {
        "type": "short",
        "index": false
      },
      "IsDownload": {
        "type": "short",
        "index": false
      },
      "IsNotBounce": {
        "type": "short",
        "index": false
      },
      "FUniqID": {
        "type": "unsigned_long",
        "index": false
      },
      "HID": {
        "type": "long",
        "index": false
      },
      "IsOldCounter": {
        "type": "short",
        "index": false
      },
      "IsEvent": {
        "type": "short",
        "index": false
      },
      "IsParameter": {
        "type": "short",
        "index": false
      },
      "DontCountHits": {
        "type": "short",
        "index": false
      },
      "WithHash": {
        "type": "short",
        "index": false
      },
      "HitColor": {
        "type": "keyword",
        "index": false
      },
      "UTCEventTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "index": false
      },
      "Age": {
        "type": "short",
        "index": false
      },
      "Sex": {
        "type": "short",
        "index": false
      },
      "Income": {
        "type": "short",
        "index": false
      },
      "Interests": {
        "type": "integer",
        "index": false
      },
      "Robotness": {
        "type": "short",
        "index": false
      },
      "GeneralInterests": {
        "type": "integer",
        "index": false
      },
      "RemoteIP": {
        "type": "long",
        "index": false
      },
      "RemoteIP6": {
        "type": "keyword",
        "index": false
      },
      "WindowName": {
        "type": "integer",
        "index": false
      },
      "OpenerName": {
        "type": "integer",
        "index": false
      },
      "HistoryLength": {
        "type": "short",
        "index": false
      },
      "BrowserLanguage": {
        "type": "keyword",
        "index": false
      },
      "BrowserCountry": {
        "type": "keyword",
        "index": false
      },
      "SocialNetwork": {
        "type": "keyword",
        "index": false
      },
      "SocialAction": {
        "type": "keyword",
        "index": false
      },
      "HTTPError": {
        "type": "integer",
        "index": false
      },
      "SendTiming": {
        "type": "integer",
        "index": false
      },
      "DNSTiming": {
        "type": "integer",
        "index": false
      },
      "ConnectTiming": {
        "type": "integer",
        "index": false
      },
      "ResponseStartTiming": {
        "type": "integer",
        "index": false
      },
      "ResponseEndTiming": {
        "type": "integer",
        "index": false
      },
      "FetchTiming": {
        "type": "integer",
        "index": false
      },
      "RedirectTiming": {
        "type": "integer",
        "index": false
      },
      "DOMInteractiveTiming": {
        "type": "integer",
        "index": false
      },
      "DOMContentLoadedTiming": {
        "type": "integer",
        "index": false
      },
      "DOMCompleteTiming": {
        "type": "integer",
        "index": false
      },
      "LoadEventStartTiming": {
        "type": "integer",
        "index": false
      },
      "LoadEventEndTiming": {
        "type": "integer",
        "index": false
      },
      "NSToDOMContentLoadedTiming": {
        "type": "integer",
        "index": false
      },
      "FirstPaintTiming": {
        "type": "integer",
        "index": false
      },
      "RedirectCount": {
        "type": "byte",
        "index": false
      },
      "SocialSourceNetworkID": {
        "type": "short",
        "index": false
      },
      "SocialSourcePage": {
        "type": "keyword",
        "index": false
      },
      "ParamPrice": {
        "type": "long",
        "index": false
      },
      "ParamOrderID": {
        "type": "keyword",
        "index": false
      },
      "ParamCurrency": {
        "type": "keyword",
        "index": false
      },
      "ParamCurrencyID": {
        "type": "integer",
        "index": false
      },
      "GoalsReached": {
        "type": "long",
        "index": false
      },
      "OpenstatServiceName": {
        "type": "keyword",
        "index": false
      },
      "OpenstatCampaignID": {
        "type": "keyword",
        "index": false
      },
      "OpenstatAdID": {
        "type": "keyword",
        "index": false
      },
      "OpenstatSourceID": {
        "type": "keyword",
        "index": false
      },
      "UTMSource": {
        "type": "keyword",
        "index": false
      },
      "UTMMedium": {
        "type": "keyword",
        "index": false
      },
      "UTMCampaign": {
        "type": "keyword",
        "index": false
      },
      "UTMContent": {
        "type": "keyword",
        "index": false
      },
      "UTMTerm": {
        "type": "keyword",
        "index": false
      },
      "FromTag": {
        "type": "keyword",
        "index": false
      },
      "HasGCLID": {
        "type": "short",
        "index": false
      },
      "RefererHash": {
        "type": "unsigned_long",
        "index": false
      },
      "URLHash": {
        "type": "unsigned_long",
        "index": false
      },
      "CLID": {
        "type": "long",
        "index": false
      },
      "YCLID": {
        "type": "unsigned_long",
        "index": false
      },
      "ShareService": {
        "type": "keyword",
        "index": false
      },
      "ShareURL": {
        "type": "keyword",
        "index": false
      },
      "ShareTitle": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key1": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key2": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key3": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key4": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key5": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.ValueDouble": {
        "type": "keyword",
        "index": false
      },
      "IslandID": {
        "type": "keyword",
        "index": false
      },
      "RequestNum": {
        "type": "long",
        "index": false
      },
      "RequestTry": {
        "type": "short",
        "index": false
      }
    }
  }

以这样的方式去构建,基本上是在磁盘上为所有的字段,在不知道自己将来是否会使用,会如何使用的情况下,就位所有的字段相当于创建了二级索引。因此,在这种情况下生产的最终的索引数据会很大:

image.pngimage.png

我们可以强制合并,

image.pngimage.png

但效果不会明显:

image.pngimage.png

因此,可以看到,即时ES只索引了700多万条数据,就已经有9GB,比CK的800多万条数据2.4GB,大了将近四倍。

ES需要调优!

ES正确的做法

如之前所述,CK是专门针对海量数据的OLAP分析场景而设计的列式存储数据库。其在多个方面,包括:数据建模,数据压缩,数据存储等都对OLAP场景做了专门的优化。这些优化并没有什么魔法,只是帮助用户把这些优化做成了默认模式。实际上,大部分的优化在ES上也是可以做的。

因为ES并非为OLAP专门设计的,是一个通用场景的数据库,若要用在OLAP场景,我们需要做一些专门的优化。

正确的做法是不能够简单的创建一个“对等索引”。

建模优化

我们可以先通过Kibana上的数据可视化分析工具,查看各个字段的基本情况:

image.pngimage.png

我们看到很多在原数据表定义为UInt8UInt16的字段都是用于表示特定含义的,比如说IsMobile, IsRobot这种,在ES中根本不应该定义为numeric类型,应该是keyword。

甚至,有些数据甚至有可能在现有的大屏分析、业务分析时根本不会用,比如,下面的这些哈希值:

image.pngimage.png

我们可以更进一步,过滤所有的numeric类型:

image.pngimage.png
image.pngimage.png

我们可以得出一个结论,只有Timing结尾的字段,才值得定义type为numeric。其他都是keyword

因此,我们的数据建模应该是这样的:

代码语言:txt复制
{
  "settings": {
    "index": {
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "Timing_as_numeric": {
          "match": "*Timing",
          "match_mapping_type": "string",
          "mapping": {
            "type": "long",
            "index": false
          }
        }
      },
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "ignore_above": 1024,
            "index": false,
            "type": "keyword"
          }
        }
      }
    ],
    "date_detection": false,
    "numeric_detection": false,
    "properties": {
      "ClientEventTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
      },
      "EventTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
      },
      "UTCEventTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
      }
    }
  },
  "aliases": {}
}

filebeat与ingest pipeline优化

如果我们是使用filebeat来进行数据的采集,filebeat会默认添加很多额外的字段,比如 "agent", "host", "ecs","input","log",这些字段在OLAP场景都是不需要的,添加如下processor在filebeat.yml文件中:

代码语言:txt复制
processors:
  - drop_fields:
      fields: ["agent", "host", "ecs","input","log"]
      ignore_missing: false

同时,filebeat也会把数据的采集时间作为@timestamp添加。我们可以在ingest pipeline上把这个字段给drop掉,同样的,message字段也是多余的 (不再考虑原始日志的审计需求)

image.pngimage.png

通过正确的mapping和去掉该场景下不需要的字段,索引的大小降到了CK数据的2倍:

image.pngimage.png

OLAP专用优化

在绝大多数的OLAP场景,我们只关心汇总,聚合之后的指标,而非导出源数据,这时,我们可以进一步把索引模板调整为_source: disabled

image.pngimage.png

此时,我们看到ES索引的大小已经和CK的大小相差无几。此设置下,只是无法再通过_source获取字段值:

image.pngimage.png

但所有的聚合、过滤、排序操作还是可以正常运行的。

如果对于一些关键数据,我们仍然希望能够以行的方式读取的话,我们可以在关闭_source的情况下,通过字段的store参数进行调节:

代码语言:txt复制
        "ClientTimeZone" : {
          "type" : "keyword",
          "index" : false,
          "store" : true,
          "ignore_above" : 1024
        },
        "CodeVersion" : {
          "type" : "keyword",
          "index" : false,
          "store" : true,
          "ignore_above" : 1024
        }

通过将特定字段的数据进行存储,而非将整行数据都存储在_source当中,我们可以在读取的时候,通过stored_fields参数:

代码语言:txt复制
GET ck_hits_no_source_with_field_stored/_search
{
  "stored_fields": [ "ClientTimeZone", "CodeVersion" ]
}

获取单行字段值:

image.pngimage.png

结论

正如上文所说,存在即合理,大量用户在OLAP场景选择ES有其合理性。在没有使用"index.codec": "best_compression"的情况下,作为同样使用列存技术,同样默认使用LZ4压缩算法的两个产品,ES通过合理的优化,能够达到CK一样的压缩效果。

当然,CK在仅使用列存的情况下能够通过组合字段,能够返回行数据,而ES,则需要借助_source字段保存整行数据,或者使用字段的store属性才能返回文档值(对应CK的行数据)。因为历史接口的原因和没有强烈的用户需求,ES并没有从列存数据中返回字段的功能,但这个功能并不是不能添加的。

最后,本文仅涉及数据压缩的内容,不讨论上层的计算算子,SQL支持,Join等功能。也并不是说ES在OLAP场景比CK更合适,用户需要根据自己的情况选择合理的,ES更适合在一个通用的,广泛的,弱OLAP场景进行使用。希望本文能够帮助到哪些将ES用在OLAP场景,但又因为压缩率赶到困扰的同学。

0 人点赞