Thanos Internals

2019-12-29 23:20:29

Design

[Figure: architecture diagram]

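Components

  • Sidecar: exposes a StoreAPI for querying Prometheus data; its Shipper uploads data to object storage.
  • Store Gateway: exposes a StoreAPI for querying the metric data kept in object storage.
  • Compactor: compacts, downsamples, and applies retention to the data in object storage.
  • Receiver: takes in data from Prometheus's remote-write WAL, then exposes it or uploads it to object storage. (Not shown in the diagram; at the time of writing it is still a beta implementation, built on Prometheus's remote-write interface.)
  • Ruler/Rule: evaluates recording/alerting rules just as Prometheus does, except that the target data is the data in Thanos; results can be exposed for querying or uploaded.
  • Querier/Query: implements the Prometheus v1 API. It is an API aggregation layer that queries the underlying data sources, then merges and deduplicates the results.

The components above fall into three groups by role: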

  1. Metric Sources
  2. Stores
  3. Queriers

Metric Sources

Metric sources produce data. Currently there are only two kinds: Prometheus sidecars and rule nodes.

┌────────────┬─────────┐         ┌────────────┬─────────┐     ┌─────────┐
│ Prometheus │ Sidecar │   ...   │ Prometheus │ Sidecar │     │   Rule  │
└────────────┴────┬────┘         └────────────┴────┬────┘     └┬────────┘
                  │                                │           │
                Blocks                           Blocks      Blocks
                  │                                │           │
                  v                                v           v
              ┌──────────────────────────────────────────────────┐
              │                   Object Storage                 │
              └──────────────────────────────────────────────────┘

Stores

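Here "store" means the store gateway, which translates users' metrics requests into object storage API calls. It implements various strategies to minimize the number of requests to the object storage, such as filtering relevant blocks by their metadata (e.g. time range and labels) and caching frequent index lookups.

Stores implement the same gRPC Store API as the metric sources, so clients can treat both identically, with no special handling. Every Store API implementation also advertises meta information about the data it holds, which lets clients keep the set of nodes they query as small as possible.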

┌──────────────────────┐  ┌────────────┬─────────┐   ┌────────────┐
│ Google Cloud Storage │  │ Prometheus │ Sidecar │   │    Rule    │
└─────────────────┬────┘  └────────────┴────┬────┘   └─┬──────────┘
                  │                         │          │
         Block File Ranges                  │          │
                  │                     Store API      │
                  v                         │          │
                ┌──────────────┐            │          │
                │     Store    │            │      Store API
                └────────┬─────┘            │          │
                         │                  │          │
                     Store API              │          │
                         │                  │          │
                         v                  v          v
                       ┌──────────────────────────────────┐
                       │              Client              │
                       └──────────────────────────────────┘

Query Layer

Queriers are stateless: they discover stores automatically and query data from them. Based on the metadata of store and source nodes, they attempt to minimize the request fanout to fetch data for a particular query.

┌──────────────────┐  ┌────────────┬─────────┐   ┌────────────┐
│    Store Node    │  │ Prometheus │ Sidecar │   │    Rule    │
└─────────────┬────┘  └────────────┴────┬────┘   └─┬──────────┘
              │                         │          │
              │                         │          │
              │                         │          │
              v                         v          v
        ┌─────────────────────────────────────────────────────┐
        │                      Query layer                    │
        └─────────────────────────────────────────────────────┘
                ^                  ^                  ^
                │                  │                  │
       ┌────────┴────────┐  ┌──────┴─────┐       ┌────┴───┐
       │ Alert Component │  │ Dashboards │  ...  │ Web UI │
       └─────────────────┘  └────────────┘       └────────┘
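The fanout minimization described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Thanos's actual code: `StoreInfo`, its fields and `selectStores` are hypothetical names, and label matching is reduced to exact equality on the labels a node advertises.

```go
package main

import "fmt"

// StoreInfo is a hypothetical summary of what a Store API node advertises:
// the time range it covers and its external labels.
type StoreInfo struct {
	Name             string
	MinTime, MaxTime int64 // milliseconds since epoch, inclusive
	Labels           map[string]string
}

// overlaps reports whether the store's time range intersects [mint, maxt].
func (s StoreInfo) overlaps(mint, maxt int64) bool {
	return s.MinTime <= maxt && mint <= s.MaxTime
}

// matches reports whether the store could hold series satisfying the
// equality matchers. A store is ruled out only when it advertises a
// different value for a requested label; labels it does not advertise
// may still exist on individual series.
func (s StoreInfo) matches(want map[string]string) bool {
	for k, v := range want {
		if have, ok := s.Labels[k]; ok && have != v {
			return false
		}
	}
	return true
}

// selectStores returns the names of the stores worth querying.
func selectStores(stores []StoreInfo, mint, maxt int64, want map[string]string) []string {
	var out []string
	for _, s := range stores {
		if s.overlaps(mint, maxt) && s.matches(want) {
			out = append(out, s.Name)
		}
	}
	return out
}

func main() {
	stores := []StoreInfo{
		{"sidecar-1", 900, 1000, map[string]string{"prometheus": "monitoring/prometheus-1"}},
		{"sidecar-2", 900, 1000, map[string]string{"prometheus": "monitoring/prometheus-2"}},
		{"store-gw", 0, 950, nil},
	}
	want := map[string]string{"prometheus": "monitoring/prometheus-2"}
	fmt.Println(selectStores(stores, 100, 500, want)) // only store-gw covers this range
	fmt.Println(selectStores(stores, 940, 990, want)) // sidecar-1 is skipped because its label differs
}
```

Filtering before sending requests is what keeps the querier cheap: a node whose metadata rules it out never receives the query.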

Compactor

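The Compactor is an optional, standalone component that does not take part in the Thanos cluster.

Implementation

Each component is provided as a subcommand of the thanos binary.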

// /thanos-io/thanos/cmd/thanos/main.go
registerSidecar(cmds, app)
registerStore(cmds, app)
registerQuery(cmds, app)
registerRule(cmds, app)
registerCompact(cmds, app)
registerBucket(cmds, app, "bucket")
registerDownsample(cmds, app)
registerReceive(cmds, app)
registerChecks(cmds, app, "check")

// Third-party libraries used
// github.com/golang/snappy - snappy compression
// github.com/hashicorp/serf - Gossip-based Membership
// github.com/prometheus/prometheus - metrics, version, discovery, PromQL, etc.
// google.golang.org/grpc - gRPC
// github.com/go-kit/kit - log, sd, transport, ratelimit, etc.
// github.com/oklog/run - goroutine lifecycle management
// gopkg.in/alecthomas/kingpin.v2 - command-line parsing
// go.uber.org/automaxprocs/maxprocs - small helper that sets GOMAXPROCS
// github.com/opentracing/opentracing-go - tracing
// gopkg.in/check.v1;onsi/ginkgo;onsi/gomega;smartystreets/goconvey - testing
// github.com/fsnotify/fsnotify
// github.com/fortytw2/leaktest - leak test
// github.com/fatih/structtag - parsing and manipulating struct tag fields
// github.com/hashicorp/golang-lru
// github.com/mwitkow/go-conntrack - Go middleware for net.Conn tracking (Prometheus/trace)

Bucket

Runs ls/inspect/verify/repair against the metrics data in object storage.

➜ ./thanos bucket --objstore.config-file=`pwd`/cos.yaml inspect
level=info ts=2019-12-29T08:43:27.744236Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2019-12-29T08:43:27.744752Z caller=factory.go:43 msg="loading bucket configuration"
|            ULID            |        FROM         |        UNTIL        | RANGE  | UNTIL-DOWN | #SERIES | #SAMPLES  | #CHUNKS | COMP-LEVEL | COMP-FAILED |                                     LABELS                                      | RESOLUTION | SOURCE  |
|----------------------------|---------------------|---------------------|--------|------------|---------|-----------|---------|------------|-------------|---------------------------------------------------------------------------------|------------|---------|
| 01DWQ7AVJAST6B734CCBWJYXV0 | 22-12-2019 20:00:00 | 22-12-2019 22:00:00 | 2h0m0s | 38h0m0s    | 36,534  | 1,858,180 | 36,534  | 1          | false       | prometheus=monitoring/prometheus-2,prometheus_replica=prometheus-prometheus-2-0 | 0s         | sidecar |
| 01DWQ8TA11YH5AZ592JC83HGXP | 22-12-2019 22:00:00 | 23-12-2019 00:00:00 | 2h0m0s | 38h0m0s    | 38,277  | 8,785,813 | 74,002  | 1          | false       | prometheus=monitoring/prometheus-2,prometheus_replica=prometheus-prometheus-2-0 | 0s         | sidecar |

Check

tools for validation of Prometheus rules.

Compact

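A tool for compaction, downsampling, and retention. It is not safe to run concurrently, so it must be deployed on its own; it also needs local disk space for intermediate data.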

  • creating 5m downsampling for blocks larger than 40 hours (2d, 2w)
  • creating 1h downsampling for blocks larger than 10 days (2w).
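  • Downsampling is not meant to save space but to speed up queries. (It does not reduce storage; on the contrary, each raw block gains two additional blocks.)
  • Internally, compact uses tsdb.NewLeveledCompactor to perform compaction.
  • Downsampling is implemented in thanos-io/thanos/pkg/compact/downsample/downsample.go

# After compact / downsample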
➜ ./thanos bucket --objstore.config-file=`pwd`/cos.yaml inspect
level=info ts=2019-12-29T09:02:54.011068Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2019-12-29T09:02:54.011681Z caller=factory.go:43 msg="loading bucket configuration"
|            ULID            |        FROM         |        UNTIL        | RANGE  | UNTIL-DOWN | #SERIES |  #SAMPLES  | #CHUNKS | COMP-LEVEL | COMP-FAILED |                                     LABELS                                      | RESOLUTION |  SOURCE   |
|----------------------------|---------------------|---------------------|--------|------------|---------|------------|---------|------------|-------------|---------------------------------------------------------------------------------|------------|-----------|
| 01DX8E6DKHG0TEJQG85CPXEEEB | 22-12-2019 20:00:00 | 23-12-2019 00:00:00 | 4h0m0s | 36h0m0s    | 38,562  | 10,643,993 | 110,536 | 2          | false       | prometheus=monitoring/prometheus-2,prometheus_replica=prometheus-prometheus-2-0 | 0s         | compactor |
| 01DWQ97RAVF6ADC47073366595 | 22-12-2019 22:00:00 | 23-12-2019 00:00:00 | 2h0m0s | 38h0m0s    | 36,616  | 8,229,587  | 36,664  | 1          | false       | prometheus=monitoring/prometheus-1,prometheus_replica=prometheus-prometheus-1-0 | 0s         | sidecar   |
| 01DWQFP1915WVACNP2W187QWJ6 | 23-12-2019 00:00:00 | 23-12-2019 02:00:00 | 2h0m0s | 38h0m0s    | 36,625  | 8,793,602  | 73,282  | 1          | false       | prometheus=monitoring/prometheus-2,prometheus_replica=prometheus-prometheus-2-0 | 0s         | sidecar   |
| 01DX8E0MV53DH41VWNW4TJXXKN | 23-12-2019 00:00:00 | 23-12-2019 08:00:00 | 8h0m0s | 32h0m0s    | 36,625  | 35,169,407 | 293,112 | 2          | false       | prometheus=monitoring/prometheus-1,prometheus_replica=prometheus-prometheus-1-0 | 0s         | compactor |

Query

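  • Query is one of the core components of Thanos: it gathers data from StoreAPIs and returns it to clients through the Prometheus HTTP v1 API.
  • Its main job is to provide a global view. This is similar to what some exporters do, except that it aggregates at the query layer rather than at the storage layer.
  • Its other job is run-time deduplication of HA groups: several Prometheus instances can run as a highly available setup, and Query merges and deduplicates their data (for example, if one Prometheus goes down, Query can fill the gap).
  • Supported data sources:
    • Prometheus (Sidecar)
    • Object Storage (Store Gateway)
    • Global alerting/recording rules evaluations (Ruler)
    • Metrics received from Prometheus remote write streams (Thanos Receiver)
    • Another Querier (you can stack Queriers on top of each other)
    • Non-Prometheus systems, e.g. OpenTSDB
    • Author's TODO: we could implement an adapter for cloud monitoring services here, as another Store API implementation
  • Beyond being compatible with the Prometheus 2.x API, it adds support for:
    • partial response behaviour: warnings are returned when one of the StoreAPIs fails or times out
    • several additional parameters
    • custom response fields.
  • Implementation:
    • A gRPC (proxy) Store API, which queries the discovered stores (storeMatches: time range, labelSetsMatch)
    • An HTTP query API, which creates a query engine; a query ultimately becomes a gRPC request against the local gRPC API
    • DNS provider: discovers stores through DNS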
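The gap-filling side of deduplication can be illustrated with a deliberately simplified sketch. It is not the algorithm Thanos uses: it assumes exactly two replicas of the same series (for instance two instances that differ only in a replica label such as the prometheus_replica label in the inspect output above), treats one as primary, and borrows samples from the other only where the primary has a hole. `Sample`, `fillGaps` and `maxGap` are hypothetical names.

```go
package main

import "fmt"

// Sample is one data point of a series.
type Sample struct {
	T int64   // timestamp, e.g. in seconds
	V float64 // value
}

// fillGaps returns the primary replica's samples. Wherever two consecutive
// primary samples are more than maxGap apart, the secondary samples that
// fall strictly between them are inserted. Both inputs must be sorted by
// time. Gaps before the first or after the last primary sample are not
// handled in this sketch.
func fillGaps(primary, secondary []Sample, maxGap int64) []Sample {
	var out []Sample
	j := 0
	for i, p := range primary {
		if i > 0 && p.T-primary[i-1].T > maxGap {
			lo := primary[i-1].T
			// Skip secondary samples at or before the start of the gap.
			for j < len(secondary) && secondary[j].T <= lo {
				j++
			}
			// Copy secondary samples that lie inside the gap.
			for j < len(secondary) && secondary[j].T < p.T {
				out = append(out, secondary[j])
				j++
			}
		}
		out = append(out, p)
	}
	return out
}

func main() {
	// The primary replica was down between t=30 and t=90.
	primary := []Sample{{0, 1}, {15, 2}, {30, 3}, {90, 7}, {105, 8}}
	secondary := []Sample{{5, 1}, {20, 2}, {35, 3}, {50, 4}, {65, 5}, {80, 6}, {95, 7}}
	for _, s := range fillGaps(primary, secondary, 30) {
		fmt.Printf("t=%d v=%g\n", s.T, s.V)
	}
}
```

Sticking to one replica and switching only on a gap avoids interleaving two scrapers whose timestamps are slightly offset, which would otherwise double the sample density.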

Store

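Store implements the Store API on top of object storage and periodically syncs metadata and indexes to local disk. The implementation lives in bucket.go, which also implements the store gRPC service defined in thanos-io/thanos/pkg/store/storepb/rpc.proto.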

// BucketStore implements the store API backed by a bucket. It loads all index
// files to local disk.
type BucketStore struct {
	// ... (omitted)
	indexCache storecache.IndexCache

	// Sets of blocks that have the same labels. They are indexed by a hash over their label set.
	blocks    map[ulid.ULID]*bucketBlock
	blockSets map[uint64]*bucketBlockSet

	// samplesLimiter limits the number of samples per each Series() call.
	samplesLimiter *Limiter
	partitioner    partitioner
}

// Series implements the storepb.StoreServer interface.
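// Non-core code is omitted. Postings and series are cached locally: look in the
// local cache first, then query remote storage, then store the result in the local cache.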
func (s *BucketStore) Series(req *storepb.SeriesRequest, srv storepb.Store_SeriesServer) (err error) {
    
    // Concurrently get data from all blocks.
	for _, bs := range s.blockSets {
		blockMatchers, ok := bs.labelMatchers(matchers...)
		blocks := bs.getFor(req.MinTime, req.MaxTime, req.MaxResolutionWindow)
		for _, b := range blocks {
			// We must keep the readers open until all their data has been sent.
			indexr := b.indexReader(ctx)
			chunkr := b.chunkReader(ctx)

			g.Go(func() error {
				part, pstats, err := blockSeries(ctx, b.meta.ULID, b.meta.Thanos.Labels,
					indexr, chunkr, blockMatchers, req, s.samplesLimiter)
				res = append(res, part)
				return nil
			})
		}
	}

	// Merge the sub-results from each selected block.
	{
		// Merge series set into an union of all block sets. This exposes all blocks are single seriesSet.
		// Chunks of returned series might be out of order w.r.t to their time range.
		// This must be accounted for later by clients.
		set := storepb.MergeSeriesSets(res...)
		for set.Next() {
			var series storepb.Series
			series.Labels, series.Chunks = set.At()
			stats.mergedSeriesCount++
			stats.mergedChunksCount += len(series.Chunks)
			if err := srv.Send(storepb.NewSeriesResponse(&series)); err != nil {
				return status.Error(codes.Unknown, errors.Wrap(err, "send series response").Error())
			}
		}
	}
	return nil
}
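The merge step at the end of Series can be pictured with the following sketch. It is a simplified stand-in for storepb.MergeSeriesSets, not its real implementation: series are identified by a plain label string, chunks are plain identifiers, and `Series`, `mergeTwo` and `mergeAll` are hypothetical names. Each input is assumed to be sorted by labels, as the per-block results are.

```go
package main

import "fmt"

// Series is a hypothetical, simplified series: a canonical label string
// plus the identifiers of its chunks.
type Series struct {
	Labels string
	Chunks []string
}

// mergeTwo merges two label-sorted slices into one label-sorted slice.
// Series present in both inputs are combined by concatenating their chunks,
// so chunks may end up out of time order and the consumer has to cope.
func mergeTwo(a, b []Series) []Series {
	out := make([]Series, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i].Labels < b[j].Labels:
			out = append(out, a[i])
			i++
		case a[i].Labels > b[j].Labels:
			out = append(out, b[j])
			j++
		default:
			merged := Series{Labels: a[i].Labels}
			merged.Chunks = append(merged.Chunks, a[i].Chunks...)
			merged.Chunks = append(merged.Chunks, b[j].Chunks...)
			out = append(out, merged)
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	out = append(out, b[j:]...)
	return out
}

// mergeAll folds any number of per-block results into a single set.
func mergeAll(sets ...[]Series) []Series {
	var out []Series
	for _, s := range sets {
		out = mergeTwo(out, s)
	}
	return out
}

func main() {
	block1 := []Series{{`{job="a"}`, []string{"c1"}}, {`{job="b"}`, []string{"c2"}}}
	block2 := []Series{{`{job="b"}`, []string{"c3"}}, {`{job="c"}`, []string{"c4"}}}
	for _, s := range mergeAll(block1, block2) {
		fmt.Println(s.Labels, s.Chunks)
	}
}
```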

Time based partitioning

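  • By default, store serves metrics for the whole time range found in object storage, but --min-time and --max-time can restrict the time range it serves.
  • New data may not be queryable immediately: time partitioning is applied during block synchronization, which runs every 3 minutes.
  • It is recommended that the time ranges of Thanos Store gateways overlap somewhat with those of Thanos Sidecar, to guard against failures.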

Rule

You can think of Rule as a simplified Prometheus that does not require a sidecar, does not scrape targets, and does not expose a PromQL query API of its own.

Sidecar

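Role

  • Implements the Store API to query Prometheus data directly.
  • Optionally uploads TSDB data every 2 hours, so that Prometheus's retention can be set very short.
  • This Store API implementation is the exact opposite of Query's: it turns gRPC requests into HTTP requests to Prometheus.
  • The Shipper, implemented in thanos-io/thanos/pkg/shipper/shipper.go, uploads the block/meta.. produced every two hours.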
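The Shipper's selection of what to upload can be sketched like this. It is an illustration, not shipper.go: `BlockMeta` and `pendingUploads` are hypothetical, the set of already uploaded blocks is a plain map, and the sketch assumes that only blocks at compaction level 1 (those Prometheus has not compacted locally) are shipped. The ULIDs come from the inspect output earlier in this post.

```go
package main

import (
	"fmt"
	"sort"
)

// BlockMeta is a hypothetical subset of a block's meta.json.
type BlockMeta struct {
	ULID            string
	CompactionLevel int
}

// pendingUploads returns the local blocks that still need to be uploaded:
// not yet recorded as uploaded and still at compaction level 1. The result
// is sorted by ULID; ULIDs sort lexicographically by creation time, so the
// oldest block comes first.
func pendingUploads(local []BlockMeta, uploaded map[string]bool) []BlockMeta {
	var out []BlockMeta
	for _, m := range local {
		if uploaded[m.ULID] || m.CompactionLevel > 1 {
			continue
		}
		out = append(out, m)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ULID < out[j].ULID })
	return out
}

func main() {
	local := []BlockMeta{
		{"01DX8E6DKHG0TEJQG85CPXEEEB", 2},
		{"01DWQ8TA11YH5AZ592JC83HGXP", 1},
		{"01DWQ7AVJAST6B734CCBWJYXV0", 1},
	}
	uploaded := map[string]bool{"01DWQ7AVJAST6B734CCBWJYXV0": true}
	for _, m := range pendingUploads(local, uploaded) {
		fmt.Println("upload", m.ULID)
	}
}
```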

Receiver

Implements Prometheus's remote write interface.

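References

  • Official documentation
  • A very detailed introduction
  • A translation of the previous article (the translation is not very good)
  • An introduction in Chinese
  • Another introduction in Chinese
  • Comparing Thanos to VictoriaMetrics cluster: VictoriaMetrics is another option, though it uses disk storage
  • Another option, open-sourced by Uber
  • Thanos may run into performance problems when the bucket is very large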
