Elasticsearch & ClickHouse 存储效能对比

ClickHouse是最近比较火的一款数据库，似乎在公有云用户的使用场景上和Elasticsearch有一定的重合。特别是在一些基于日志的业务数据的分析场景，ClickHouse拥有高压缩率、高速查询的特性，受到不少心急火燎的用户的青睐。本文，针对存储效能这个场景来对比一下ES和CK的表现。

因为本文比较长，这里先把最终结果展示出来（无副本情况）：

结论：因为ES使用列存（doc value）对Keyword字段以及使用BKD tree对数值字段进行存储，在无需全文检索（无需读取原始日志分析）的OLAP场景，能够达到与CK同等级的压缩率（ES的主键 _id 必须为倒排索引，无法列存压缩，所以有细微差距）

OLAP场景：CK vs ES

ClickHouse是一款专门为OLAP场景，为海量数据下商业智能领域(也就是我们所说的 BI 领域)所设计的列式存储数据库产品，它为OLAP场景做了很多专门的优化，非高精度，非高并发，非CRUD，非ACID的大数据分析场景很多情况下是很适合使用CK的。

但现在使用ES做OLAP场景的案例也不少，很多人会直接说ES从技术架构上不适合做OLAP，因为它本身设计是以搜索引擎为主，底层数据结构的lucene也不是为OLAP场景设计的。这些表述从技术上没有问题，CK在很多方面确实也是做到了极致。

但有需求即代表合理，ES目前是事实上，用在业务日志上做OLAP分析的最常见的方案之一。ES会被作为OLAP数据库的主要原因我认为有以下几个：

最重要的一点就是，无论是日志的集中管理，还是IT运维、安全运维、AIOps、大数据分析,多个团队都对同样的数据由需求，同一份业务流量数据，用户行为数据不只是在OLAP场景中会使用，在做故障调查，数据深加工，还是安全审计、调查，也都会用到。如果同一份海量数据重复存储于多个不同类型的大数据系统上，会造成极度的资源浪费，而ES是市面上几乎是唯一的，能适用任何团队，任何场景的数据库，即一次投资，能够产生重叠的回报
ES本身在海量数据、快速查询、聚合算子的支撑，在大多数场景是满足OLAP的需求的
ES产品相比CK成熟很多，抛开数据摄入速率、数据压缩这些硬指标，其实在管理一个大数据平台和海量数据集时，还有很多软指标，比如平台的稳定性，成熟度，易用性等软指标也非常重要，以下是ES的相比CK多出来的点：
- 完整的大数据链路，有数据摄入层（beats, logstash）和可视化层(kibana)
- 有各种分析app （canvas和lens），直接提供大屏，也可以脱离SQL为业务人员提供探索式、交互式的分析能力
- 内置的批流计算能力（rollup，ingest pipeline, transform, machine learning）
- 内置的数据治理能力（ILM， SLM，searchable snapshot, reindex，CCS/CCR，数据可重建/丰富/复制）
- 内置的集群监控和分析能力（monitoring & profile）
- 更加丰富的针对不同数据类型的计算能力（full text, geo, vector）
- 更多的安全管理（企业级权限认证接入、文档级别、字段级别的权限设置与隔离）

选取样本数据

回到我们今天的主题，关于对比ES和CK的存储效率，为了能够更直观，避免争议，我们使用ClickHouse官网的Tutorial数据：

代码语言：txt复制

curl https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
curl https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv

下载之后，解压的原始数据的大小：

代码语言：txt复制

ls -lh
总用量 9.8G
-rw-rw-r--. 1 caishichao caishichao 7.3G 4月  26 02:03 hits_v1.tsv
-rw-rw-r--. 1 caishichao caishichao 2.5G 4月  26 02:05 visits_v1.tsv

样本数据集是Yandex.Metrica这个网络分析服务产生的数据，典型的业务流量日志数据。其中：

hits 表包含所有用户在服务所涵盖的所有网站上完成的每个操作。
visits 表包含预先构建的会话，而不是单个操作。

创建CK的数据库与表

按照Tutorial的教程，原封不动的按照如下方式创建数据库和数据表：

创建库：

代码语言：txt复制

clickhouse-client --query "CREATE DATABASE IF NOT EXISTS tutorial"

创建表：

代码语言：txt复制

CREATE TABLE tutorial.hits_v1
(
    `WatchID` UInt64,
    `JavaEnable` UInt8,
    `Title` String,
    `GoodEvent` Int16,
    `EventTime` DateTime,
    `EventDate` Date,
    `CounterID` UInt32,
    `ClientIP` UInt32,
    `ClientIP6` FixedString(16),
    `RegionID` UInt32,
    `UserID` UInt64,
    `CounterClass` Int8,
    `OS` UInt8,
    `UserAgent` UInt8,
    `URL` String,
    `Referer` String,
    `URLDomain` String,
    `RefererDomain` String,
    `Refresh` UInt8,
    `IsRobot` UInt8,
    `RefererCategories` Array(UInt16),
    `URLCategories` Array(UInt16),
    `URLRegions` Array(UInt32),
    `RefererRegions` Array(UInt32),
    `ResolutionWidth` UInt16,
    `ResolutionHeight` UInt16,
    `ResolutionDepth` UInt8,
    `FlashMajor` UInt8,
    `FlashMinor` UInt8,
    `FlashMinor2` String,
    `NetMajor` UInt8,
    `NetMinor` UInt8,
    `UserAgentMajor` UInt16,
    `UserAgentMinor` FixedString(2),
    `CookieEnable` UInt8,
    `JavascriptEnable` UInt8,
    `IsMobile` UInt8,
    `MobilePhone` UInt8,
    `MobilePhoneModel` String,
    `Params` String,
    `IPNetworkID` UInt32,
    `TraficSourceID` Int8,
    `SearchEngineID` UInt16,
    `SearchPhrase` String,
    `AdvEngineID` UInt8,
    `IsArtifical` UInt8,
    `WindowClientWidth` UInt16,
    `WindowClientHeight` UInt16,
    `ClientTimeZone` Int16,
    `ClientEventTime` DateTime,
    `SilverlightVersion1` UInt8,
    `SilverlightVersion2` UInt8,
    `SilverlightVersion3` UInt32,
    `SilverlightVersion4` UInt16,
    `PageCharset` String,
    `CodeVersion` UInt32,
    `IsLink` UInt8,
    `IsDownload` UInt8,
    `IsNotBounce` UInt8,
    `FUniqID` UInt64,
    `HID` UInt32,
    `IsOldCounter` UInt8,
    `IsEvent` UInt8,
    `IsParameter` UInt8,
    `DontCountHits` UInt8,
    `WithHash` UInt8,
    `HitColor` FixedString(1),
    `UTCEventTime` DateTime,
    `Age` UInt8,
    `Sex` UInt8,
    `Income` UInt8,
    `Interests` UInt16,
    `Robotness` UInt8,
    `GeneralInterests` Array(UInt16),
    `RemoteIP` UInt32,
    `RemoteIP6` FixedString(16),
    `WindowName` Int32,
    `OpenerName` Int32,
    `HistoryLength` Int16,
    `BrowserLanguage` FixedString(2),
    `BrowserCountry` FixedString(2),
    `SocialNetwork` String,
    `SocialAction` String,
    `HTTPError` UInt16,
    `SendTiming` Int32,
    `DNSTiming` Int32,
    `ConnectTiming` Int32,
    `ResponseStartTiming` Int32,
    `ResponseEndTiming` Int32,
    `FetchTiming` Int32,
    `RedirectTiming` Int32,
    `DOMInteractiveTiming` Int32,
    `DOMContentLoadedTiming` Int32,
    `DOMCompleteTiming` Int32,
    `LoadEventStartTiming` Int32,
    `LoadEventEndTiming` Int32,
    `NSToDOMContentLoadedTiming` Int32,
    `FirstPaintTiming` Int32,
    `RedirectCount` Int8,
    `SocialSourceNetworkID` UInt8,
    `SocialSourcePage` String,
    `ParamPrice` Int64,
    `ParamOrderID` String,
    `ParamCurrency` FixedString(3),
    `ParamCurrencyID` UInt16,
    `GoalsReached` Array(UInt32),
    `OpenstatServiceName` String,
    `OpenstatCampaignID` String,
    `OpenstatAdID` String,
    `OpenstatSourceID` String,
    `UTMSource` String,
    `UTMMedium` String,
    `UTMCampaign` String,
    `UTMContent` String,
    `UTMTerm` String,
    `FromTag` String,
    `HasGCLID` UInt8,
    `RefererHash` UInt64,
    `URLHash` UInt64,
    `CLID` UInt32,
    `YCLID` UInt64,
    `ShareService` String,
    `ShareURL` String,
    `ShareTitle` String,
    `ParsedParams` Nested(
        Key1 String,
        Key2 String,
        Key3 String,
        Key4 String,
        Key5 String,
        ValueDouble Float64),
    `IslandID` FixedString(16),
    `RequestNum` UInt32,
    `RequestTry` UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)

这里为了简单起见，我们只创建了hits这张表，表中有非常多的参数。可以看到，这里并没有显式的创建二级索引，并且很多字段是做了精确的字段大小的限制的。

通过以下命令写入数据：

代码语言：txt复制

clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv

数据写入之后，未优化之前的大小，可以看到数据总大小是2.5G，相比原先7.3G的数据，已经有了3倍的压缩

代码语言：txt复制

sudo du -h /var/lib/clickhouse/data/tutorial/hits_v1/
0	/var/lib/clickhouse/data/tutorial/hits_v1/detached
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_1_0
38M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_2_2_0
37M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_3_3_0
46M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_4_4_0
34M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_5_5_0
31M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_6_6_0
43M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_7_7_0
37M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_8_8_0
49M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_9_9_0
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_10_10_0
219M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_6_1
47M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_11_11_0
38M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_12_12_0
41M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_13_13_0
28M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_14_14_0
39M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_15_15_0
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_16_16_0
251M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_7_12_1
44M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_17_17_0
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_18_18_0
42M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_19_19_0
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_20_20_0
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_21_21_0
46M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_22_22_0
230M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_13_18_1
43M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_23_23_0
46M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_24_24_0
44M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_25_25_0
40M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_26_26_0
47M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_27_27_0
48M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_28_28_0
253M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_19_24_1
45M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_29_29_0
41M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_30_30_0
35M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_31_31_0
255M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_25_30_1
2.5G	/var/lib/clickhouse/data/tutorial/hits_v1/

检查数据是否真正写入：

代码语言：txt复制

clickhouse-client --query "SELECT COUNT(*) FROM tutorial.hits_v1"
8873898

wc -l hits_v1.tsv 
8873898 hits_v1.tsv

共8873898条数据，正确。我们还可以做进一步的优化 (此操作会在后台合并数据以优化数据存储)：

代码语言：txt复制

clickhouse-client --query "OPTIMIZE TABLE tutorial.hits_v1 FINAL"

优化之后，做了很多分区内的合并，文件目录明显变少，但数据总大小相差不大，变为2.4G

代码语言：txt复制

sudo du -h /var/lib/clickhouse/data/tutorial/hits_v1/
0	/var/lib/clickhouse/data/tutorial/hits_v1/detached
219M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_6_1
251M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_7_12_1
230M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_13_18_1
253M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_19_24_1
35M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_31_31_0
255M	/var/lib/clickhouse/data/tutorial/hits_v1/201403_25_30_1
1.2G	/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_31_2
2.4G	/var/lib/clickhouse/data/tutorial/hits_v1/

以上操作完全按照ClickHouse官网的Tutorial教程。这里要强调的是，此创建是在单节点，无副本的环境和配置下进行的。

创建ES的对等索引

接下来，我们来看一下同样的数据集在ES上的数据大小。先具体看一下数据：

代码语言：txt复制

# 直接读文件
head -n 1 hits_v1.tsv 
4632802378222380466	1	null Acoper «labilir mitsubishi в Липецке на Batak	1	2014-03-23 16:23:07	2014-03-23	27528801	2723656327	??t??4??g?˟O	15887	2224216775939367616	0	56	4	http://rutube.ru/patianu	http://love/v012/04	footki.yandex.ru	m.fotoalmeta	0	0	[[]	[]	[]	1339	555	29	8	0	0.	0	0	44	s?	1	1	0	0			3238011	-1	0		1136	555	117	2014-03-23 06:04:09	5	1	19832	0	utf-8	401	0	0	0	0	1018986580	0	0	0	0	02014-03-24 08:02:03	16	2	2	0	0	[]	2529497135	0?w?nȔ]MnQ??5	18372	-1	2	wi	??			0	-1	4333	1332	747	2472	0	0	7395	285	-1	-1	-1	16115	-1	-1	0		0		?	0	[]			307141980878902893	338497349483004835	0	0				[]	[]	[]	[]	[]	[]	??? ??????bKQ9	47	0

# 查看clickhouse中的数据
clickhouse-client --query "SELECT * FROM tutorial.hits_v1 LIMIT 1"
7043438415214026105	1		1	2014-03-17 19:29:13	2014-03-17	57	706706104	??:?[?Uc??m???	14050	8585742290196126178	0	56	4http://hurpasspx?Entitle,pos,p5,source=wiz&fp=2&pos=42&wp=1&pos=7&lang=ru&lr=10275961,55.84517842040/?fromServic-last/moto.am/iadebnye_uchasti-shedshipmeethodID.asp&is_vative=0&search-tag=2&rpt=simages.yandex.ru/rp/semennoe/zarplatjie-babe-podpisyvat-otvety.ru/tour=170000009886&ref=http://kryma-parca.ru/chat&users/lexus-70@mail.yandex.ru/yarportwo/11868/parta.com/load/483/?from]=200837&tease	http://kelebekgaleri.hurriyet	hurpass.uaprod.hu	sprashivai.ru	0	0	[5,353,3121,11503,11630]	[5,92,469,13425]	[348,1010]	[28,644]	1234	573	29	8	0	0.	0	0	44	s?	1	1	0	0			1117402	1266	111	117	2014-03-16 04:15:16	0	0	0	0	utf-8	510	0	0	0	6595113484233243093	229156632	0	0	02014-03-17 11:48:07	39	1	2	0	0	[]	594601839	???_Y????5Ӵ.	6826	-1	1	nD	??			0	-1	139	175	197	11	0	0	798	-1	-1	-1	-1	-1	-1	-1	0		0		?	0	[]			5301774355650737480	2999314218052725538	0	0				[]	[]	[]	[]	[]	[]	??? ??????bKQ9	104	0

# 做个样本
head -n 100 >> sample.tsv

我们将样本数据，在Kibana上加载，分析一下：

分析出了88个字段，与ClickHouse的Tutorials里hits表上定义的128个字段不符，需要比较精确的做mapping。

通常来说，用户大概会按照ClickHouse的数据表定义来生成es的mapping，比如上面我们看到的这个tutorial.hits_v1 数据表的定义，如果按照数据库里定义的类型去进行类比定义的话，大概会是下面的mapping:

代码语言：txt复制

  "mappings": {
    "properties": {
      "WatchID": {
        "type": "unsigned_long",
        "index": false
      },
      "JavaEnable": {
        "type": "short",
        "index": false
      },
      "Title": {
        "type": "keyword",
        "index": false
      },
      "GoodEvent": {
        "type": "short",
        "index": false
      },
      "EventTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "index": false
      },
      "EventDate": {
        "type": "keyword",
        "index": false
      },
      "CounterID": {
        "type": "long",
        "index": false
      },
      "ClientIP": {
        "type": "long",
        "index": false
      },
      "ClientIP6": {
        "type": "keyword",
        "index": false
      },
      "RegionID": {
        "type": "long",
        "index": false
      },
      "UserID": {
        "type": "unsigned_long",
        "index": false
      },
      "CounterClass": {
        "type": "byte",
        "index": false
      },
      "OS": {
        "type": "short",
        "index": false
      },
      "UserAgent": {
        "type": "short",
        "index": false
      },
      "URL": {
        "type": "keyword",
        "index": false
      },
      "Referer": {
        "type": "keyword",
        "index": false
      },
      "URLDomain": {
        "type": "keyword",
        "index": false
      },
      "RefererDomain": {
        "type": "keyword",
        "index": false
      },
      "Refresh": {
        "type": "short",
        "index": false
      },
      "IsRobot": {
        "type": "short",
        "index": false
      },
      "RefererCategories": {
        "type": "integer",
        "index": false
      },
      "URLCategories": {
        "type": "integer",
        "index": false
      },
      "URLRegions": {
        "type": "long",
        "index": false
      },
      "RefererRegions": {
        "type": "long",
        "index": false
      },
      "ResolutionWidth": {
        "type": "integer",
        "index": false
      },
      "ResolutionHeight": {
        "type": "integer",
        "index": false
      },
      "ResolutionDepth": {
        "type": "short",
        "index": false
      },
      "FlashMajor": {
        "type": "short",
        "index": false
      },
      "FlashMinor": {
        "type": "short",
        "index": false
      },
      "FlashMinor2": {
        "type": "keyword",
        "index": false
      },
      "NetMajor": {
        "type": "short",
        "index": false
      },
      "NetMinor": {
        "type": "short",
        "index": false
      },
      "UserAgentMajor": {
        "type": "integer",
        "index": false
      },
      "UserAgentMinor": {
        "type": "keyword",
        "index": false
      },
      "CookieEnable": {
        "type": "short",
        "index": false
      },
      "JavascriptEnable": {
        "type": "short",
        "index": false
      },
      "IsMobile": {
        "type": "short",
        "index": false
      },
      "MobilePhone": {
        "type": "short",
        "index": false
      },
      "MobilePhoneModel": {
        "type": "keyword",
        "index": false
      },
      "Params": {
        "type": "keyword",
        "index": false
      },
      "IPNetworkID": {
        "type": "long",
        "index": false
      },
      "TraficSourceID": {
        "type": "byte",
        "index": false
      },
      "SearchEngineID": {
        "type": "integer",
        "index": false
      },
      "SearchPhrase": {
        "type": "keyword",
        "index": false
      },
      "AdvEngineID": {
        "type": "short",
        "index": false
      },
      "IsArtifical": {
        "type": "short",
        "index": false
      },
      "WindowClientWidth": {
        "type": "integer",
        "index": false
      },
      "WindowClientHeight": {
        "type": "integer",
        "index": false
      },
      "ClientTimeZone": {
        "type": "short",
        "index": false
      },
      "ClientEventTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "index": false
      },
      "SilverlightVersion1": {
        "type": "short",
        "index": false
      },
      "SilverlightVersion2": {
        "type": "short",
        "index": false
      },
      "SilverlightVersion3": {
        "type": "long",
        "index": false
      },
      "SilverlightVersion4": {
        "type": "integer",
        "index": false
      },
      "PageCharset": {
        "type": "keyword",
        "index": false
      },
      "CodeVersion": {
        "type": "long",
        "index": false
      },
      "IsLink": {
        "type": "short",
        "index": false
      },
      "IsDownload": {
        "type": "short",
        "index": false
      },
      "IsNotBounce": {
        "type": "short",
        "index": false
      },
      "FUniqID": {
        "type": "unsigned_long",
        "index": false
      },
      "HID": {
        "type": "long",
        "index": false
      },
      "IsOldCounter": {
        "type": "short",
        "index": false
      },
      "IsEvent": {
        "type": "short",
        "index": false
      },
      "IsParameter": {
        "type": "short",
        "index": false
      },
      "DontCountHits": {
        "type": "short",
        "index": false
      },
      "WithHash": {
        "type": "short",
        "index": false
      },
      "HitColor": {
        "type": "keyword",
        "index": false
      },
      "UTCEventTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "index": false
      },
      "Age": {
        "type": "short",
        "index": false
      },
      "Sex": {
        "type": "short",
        "index": false
      },
      "Income": {
        "type": "short",
        "index": false
      },
      "Interests": {
        "type": "integer",
        "index": false
      },
      "Robotness": {
        "type": "short",
        "index": false
      },
      "GeneralInterests": {
        "type": "integer",
        "index": false
      },
      "RemoteIP": {
        "type": "long",
        "index": false
      },
      "RemoteIP6": {
        "type": "keyword",
        "index": false
      },
      "WindowName": {
        "type": "integer",
        "index": false
      },
      "OpenerName": {
        "type": "integer",
        "index": false
      },
      "HistoryLength": {
        "type": "short",
        "index": false
      },
      "BrowserLanguage": {
        "type": "keyword",
        "index": false
      },
      "BrowserCountry": {
        "type": "keyword",
        "index": false
      },
      "SocialNetwork": {
        "type": "keyword",
        "index": false
      },
      "SocialAction": {
        "type": "keyword",
        "index": false
      },
      "HTTPError": {
        "type": "integer",
        "index": false
      },
      "SendTiming": {
        "type": "integer",
        "index": false
      },
      "DNSTiming": {
        "type": "integer",
        "index": false
      },
      "ConnectTiming": {
        "type": "integer",
        "index": false
      },
      "ResponseStartTiming": {
        "type": "integer",
        "index": false
      },
      "ResponseEndTiming": {
        "type": "integer",
        "index": false
      },
      "FetchTiming": {
        "type": "integer",
        "index": false
      },
      "RedirectTiming": {
        "type": "integer",
        "index": false
      },
      "DOMInteractiveTiming": {
        "type": "integer",
        "index": false
      },
      "DOMContentLoadedTiming": {
        "type": "integer",
        "index": false
      },
      "DOMCompleteTiming": {
        "type": "integer",
        "index": false
      },
      "LoadEventStartTiming": {
        "type": "integer",
        "index": false
      },
      "LoadEventEndTiming": {
        "type": "integer",
        "index": false
      },
      "NSToDOMContentLoadedTiming": {
        "type": "integer",
        "index": false
      },
      "FirstPaintTiming": {
        "type": "integer",
        "index": false
      },
      "RedirectCount": {
        "type": "byte",
        "index": false
      },
      "SocialSourceNetworkID": {
        "type": "short",
        "index": false
      },
      "SocialSourcePage": {
        "type": "keyword",
        "index": false
      },
      "ParamPrice": {
        "type": "long",
        "index": false
      },
      "ParamOrderID": {
        "type": "keyword",
        "index": false
      },
      "ParamCurrency": {
        "type": "keyword",
        "index": false
      },
      "ParamCurrencyID": {
        "type": "integer",
        "index": false
      },
      "GoalsReached": {
        "type": "long",
        "index": false
      },
      "OpenstatServiceName": {
        "type": "keyword",
        "index": false
      },
      "OpenstatCampaignID": {
        "type": "keyword",
        "index": false
      },
      "OpenstatAdID": {
        "type": "keyword",
        "index": false
      },
      "OpenstatSourceID": {
        "type": "keyword",
        "index": false
      },
      "UTMSource": {
        "type": "keyword",
        "index": false
      },
      "UTMMedium": {
        "type": "keyword",
        "index": false
      },
      "UTMCampaign": {
        "type": "keyword",
        "index": false
      },
      "UTMContent": {
        "type": "keyword",
        "index": false
      },
      "UTMTerm": {
        "type": "keyword",
        "index": false
      },
      "FromTag": {
        "type": "keyword",
        "index": false
      },
      "HasGCLID": {
        "type": "short",
        "index": false
      },
      "RefererHash": {
        "type": "unsigned_long",
        "index": false
      },
      "URLHash": {
        "type": "unsigned_long",
        "index": false
      },
      "CLID": {
        "type": "long",
        "index": false
      },
      "YCLID": {
        "type": "unsigned_long",
        "index": false
      },
      "ShareService": {
        "type": "keyword",
        "index": false
      },
      "ShareURL": {
        "type": "keyword",
        "index": false
      },
      "ShareTitle": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key1": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key2": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key3": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key4": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.Key5": {
        "type": "keyword",
        "index": false
      },
      "ParsedParams.ValueDouble": {
        "type": "keyword",
        "index": false
      },
      "IslandID": {
        "type": "keyword",
        "index": false
      },
      "RequestNum": {
        "type": "long",
        "index": false
      },
      "RequestTry": {
        "type": "short",
        "index": false
      }
    }
  }

以这样的方式去构建，基本上是在磁盘上为所有的字段，在不知道自己将来是否会使用，会如何使用的情况下，就位所有的字段相当于创建了二级索引。因此，在这种情况下生产的最终的索引数据会很大：

我们可以强制合并，

但效果不会明显：

因此，可以看到，即时ES只索引了700多万条数据，就已经有9GB，比CK的800多万条数据2.4GB，大了将近四倍。

ES需要调优！

ES正确的做法

如之前所述，CK是专门针对海量数据的OLAP分析场景而设计的列式存储数据库。其在多个方面，包括：数据建模，数据压缩，数据存储等都对OLAP场景做了专门的优化。这些优化并没有什么魔法，只是帮助用户把这些优化做成了默认模式。实际上，大部分的优化在ES上也是可以做的。

因为ES并非为OLAP专门设计的，是一个通用场景的数据库，若要用在OLAP场景，我们需要做一些专门的优化。

正确的做法是不能够简单的创建一个“对等索引”。

建模优化

我们可以先通过Kibana上的数据可视化分析工具，查看各个字段的基本情况：

我们看到很多在原数据表定义为UInt8、UInt16的字段都是用于表示特定含义的，比如说IsMobile, IsRobot这种，在ES中根本不应该定义为numeric类型，应该是keyword。

甚至，有些数据甚至有可能在现有的大屏分析、业务分析时根本不会用，比如，下面的这些哈希值：

我们可以更进一步，过滤所有的numeric类型：

我们可以得出一个结论，只有Timing结尾的字段，才值得定义type为numeric。其他都是keyword。

因此，我们的数据建模应该是这样的：

代码语言：txt复制

{
  "settings": {
    "index": {
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "Timing_as_numeric": {
          "match": "*Timing",
          "match_mapping_type": "string",
          "mapping": {
            "type": "long",
            "index": false
          }
        }
      },
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "ignore_above": 1024,
            "index": false,
            "type": "keyword"
          }
        }
      }
    ],
    "date_detection": false,
    "numeric_detection": false,
    "properties": {
      "ClientEventTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
      },
      "EventTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
      },
      "UTCEventTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
      }
    }
  },
  "aliases": {}
}

filebeat与ingest pipeline优化

如果我们是使用filebeat来进行数据的采集，filebeat会默认添加很多额外的字段，比如 "agent", "host", "ecs","input","log"，这些字段在OLAP场景都是不需要的，添加如下processor在filebeat.yml文件中:

代码语言：txt复制

processors:
  - drop_fields:
      fields: ["agent", "host", "ecs","input","log"]
      ignore_missing: false

同时，filebeat也会把数据的采集时间作为@timestamp添加。我们可以在ingest pipeline上把这个字段给drop掉，同样的，message字段也是多余的 (不再考虑原始日志的审计需求)

通过正确的mapping和去掉该场景下不需要的字段，索引的大小降到了CK数据的2倍：

OLAP专用优化

在绝大多数的OLAP场景，我们只关心汇总，聚合之后的指标，而非导出源数据，这时，我们可以进一步把索引模板调整为_source: disabled

此时，我们看到ES索引的大小已经和CK的大小相差无几。此设置下，只是无法再通过_source获取字段值：

但所有的聚合、过滤、排序操作还是可以正常运行的。

如果对于一些关键数据，我们仍然希望能够以行的方式读取的话，我们可以在关闭_source的情况下，通过字段的store参数进行调节：

代码语言：txt复制

        "ClientTimeZone" : {
          "type" : "keyword",
          "index" : false,
          "store" : true,
          "ignore_above" : 1024
        },
        "CodeVersion" : {
          "type" : "keyword",
          "index" : false,
          "store" : true,
          "ignore_above" : 1024
        }

通过将特定字段的数据进行存储，而非将整行数据都存储在_source当中，我们可以在读取的时候，通过stored_fields参数：

代码语言：txt复制

GET ck_hits_no_source_with_field_stored/_search
{
  "stored_fields": [ "ClientTimeZone", "CodeVersion" ]
}

获取单行字段值：

结论

正如上文所说，存在即合理，大量用户在OLAP场景选择ES有其合理性。在没有使用"index.codec": "best_compression"的情况下，作为同样使用列存技术，同样默认使用LZ4压缩算法的两个产品，ES通过合理的优化，能够达到CK一样的压缩效果。

当然，CK在仅使用列存的情况下能够通过组合字段，能够返回行数据，而ES，则需要借助_source字段保存整行数据，或者使用字段的store属性才能返回文档值（对应CK的行数据）。因为历史接口的原因和没有强烈的用户需求，ES并没有从列存数据中返回字段的功能，但这个功能并不是不能添加的。

最后，本文仅涉及数据压缩的内容，不讨论上层的计算算子，SQL支持，Join等功能。也并不是说ES在OLAP场景比CK更合适，用户需要根据自己的情况选择合理的，ES更适合在一个通用的，广泛的，弱OLAP场景进行使用。希望本文能够帮助到哪些将ES用在OLAP场景，但又因为压缩率赶到困扰的同学。

ElasticsearchService 云数据仓库ClickHouse

0 人点赞