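etcd (https://github.com/etcd-io/etcd) has now reached its v3 line and is a core component of many middleware systems, such as k8s. In this series of articles we will analyze its source code and design. Part of the content is translated from the official documentation at https://etcd.io/docs/v3.5/install/.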
First, let's try installing from source. Enter the source directory and build:
% cd etcd
% ./scripts/build.sh
(cd etcdctl && env GO_BUILD_FLAGS= CGO_ENABLED=0 GO_BUILD_FLAGS= GOOS=darwin GOARCH=amd64 go build -trimpath -installsuffix=cgo -ldflags=-X=go.etcd.io/etcd/api/v3/version.GitSHA=8da2a5b -o=../bin/etcdctl .)
SUCCESS: etcd_build (GOARCH=amd64)
Once the build completes, check the version:
% ./bin/etcd --version
etcd Version: 3.6.0-alpha.0
Git SHA: 8da2a5b
Go Version: go1.19
Go OS/Arch: darwin/amd64
Add it to PATH:
% export PATH="$PATH:`pwd`/bin"
Then start the server:
% etcd
{"level":"warn","ts":"2023-06-07T09:17:23.681691+0800","caller":"embed/config.go:708","msg":"Running http and grpc server on single port. This is not recommended for production."}
The smallest unit etcd operates on is a key-value pair, which we can manipulate with etcdctl:
% etcdctl put greeting "Hello, etcd"
OK
% etcdctl get greeting
greeting
Hello, etcd
etcd's services fall into two categories:
- Services important for dealing with etcd's key space include:
  - KV - Creates, updates, fetches, and deletes key-value pairs.
  - Watch - Monitors changes to keys.
  - Lease - Primitives for consuming client keep-alive messages.
- Services which manage the cluster itself include:
  - Auth - Role based authentication mechanism for authenticating users.
  - Cluster - Provides membership information and configuration facilities.
  - Maintenance - Takes recovery snapshots, defragments the store, and returns per-member status information.
For example, the core interface for KV queries is:
service KV {
  Range(RangeRequest) returns (RangeResponse)
  ...
}
All responses from the etcd API have an attached response header, which includes cluster metadata for the response. It is defined as follows:
message ResponseHeader {
  uint64 cluster_id = 1;
  uint64 member_id = 2;
  int64 revision = 3;
  uint64 raft_term = 4;
}
A key-value pair is the smallest unit the API can manipulate. It is defined as follows:
message KeyValue {
  bytes key = 1;
  int64 create_revision = 2;
  int64 mod_revision = 3;
  int64 version = 4;
  bytes value = 5;
  int64 lease = 6;
}
Distributed locks built on etcd use the create revision to decide who owns the lock, while the mod revision is used to detect conflicting updates under MVCC and to implement compare-and-swap (CAS) logic. etcd maintains a 64-bit cluster-wide counter, the store revision, that is incremented each time the key space is modified. The revision serves as a global logical clock, sequentially ordering all updates to the store. The change represented by a new revision is incremental; the data associated with a revision is the data that changed the store. Internally, a new revision means writing the changes to the backend's B+tree, keyed by the incremented revision.
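The bookkeeping described above can be sketched with a small in-memory model. This is a toy written for illustration, not etcd's MVCC implementation: the names `TinyStore` and `lock_owner` are hypothetical, and the starting revision of 1 is an assumption about a freshly initialized store.

```python
from dataclasses import dataclass


@dataclass
class KeyValue:
    key: bytes
    value: bytes
    create_revision: int  # store revision at which the key was created
    mod_revision: int     # store revision of the latest modification
    version: int          # number of modifications since creation


class TinyStore:
    """Toy model of revision bookkeeping (hypothetical, not etcd code)."""

    def __init__(self):
        self.revision = 1  # assumption: a fresh store starts at revision 1
        self.kvs = {}

    def put(self, key: bytes, value: bytes) -> KeyValue:
        self.revision += 1  # every modification bumps the store revision
        old = self.kvs.get(key)
        if old is None:
            kv = KeyValue(key, value, self.revision, self.revision, 1)
        else:
            kv = KeyValue(key, value, old.create_revision,
                          self.revision, old.version + 1)
        self.kvs[key] = kv
        return kv


def lock_owner(store: TinyStore, prefix: bytes) -> KeyValue:
    """The waiter with the smallest create_revision under prefix holds the lock."""
    waiters = [kv for k, kv in store.kvs.items() if k.startswith(prefix)]
    return min(waiters, key=lambda kv: kv.create_revision)


store = TinyStore()
store.put(b"lock/a", b"client-1")            # create_revision 2
store.put(b"lock/b", b"client-2")            # create_revision 3
again = store.put(b"lock/a", b"client-1b")   # mod_revision 4, version 2
print(lock_owner(store, b"lock/").key, again)
```

Because the create revision never changes when a key is updated, the ordering of lock waiters stays stable even if they rewrite their own keys.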
etcd's data model indexes all keys over a flat binary key space. The request and response for a range query are defined as follows:
message RangeRequest {
  enum SortOrder {
    NONE = 0; // default, no sorting
    ASCEND = 1; // lowest target value first
    DESCEND = 2; // highest target value first
  }
  enum SortTarget {
    KEY = 0;
    VERSION = 1;
    CREATE = 2;
    MOD = 3;
    VALUE = 4;
  }
  bytes key = 1;
  bytes range_end = 2;
  int64 limit = 3;
  int64 revision = 4;
  SortOrder sort_order = 5;
  SortTarget sort_target = 6;
  bool serializable = 7;
  bool keys_only = 8;
  bool count_only = 9;
  int64 min_mod_revision = 10;
  int64 max_mod_revision = 11;
  int64 min_create_revision = 12;
  int64 max_create_revision = 13;
}
message RangeResponse {
  ResponseHeader header = 1;
  repeated mvccpb.KeyValue kvs = 2;
  bool more = 3;
  int64 count = 4;
}
The put request is defined in a similar way, and there is a corresponding request for deletion as well:
message PutRequest {
  bytes key = 1;
  bytes value = 2;
  int64 lease = 3;
  bool prev_kv = 4;
  bool ignore_value = 5;
  bool ignore_lease = 6;
}
message PutResponse {
  ResponseHeader header = 1;
  mvccpb.KeyValue prev_kv = 2;
}
etcd models a transaction as an atomic If/Then/Else construct over the key-value store. Transactions can be used for protecting keys from unintended concurrent updates, building compare-and-swap operations, and developing higher-level concurrency control. All comparisons are applied atomically; if all comparisons are true, the transaction is said to succeed and etcd applies the transaction's then / success request block, otherwise it is said to fail and applies the else / failure request block.
The model above maps onto the following messages:
message Compare {
  enum CompareResult {
    EQUAL = 0;
    GREATER = 1;
    LESS = 2;
    NOT_EQUAL = 3;
  }
  enum CompareTarget {
    VERSION = 0;
    CREATE = 1;
    MOD = 2;
    VALUE = 3;
  }
  CompareResult result = 1;
  // target is the key-value field to inspect for the comparison.
  CompareTarget target = 2;
  // key is the subject key for the comparison operation.
  bytes key = 3;
  oneof target_union {
    int64 version = 4;
    int64 create_revision = 5;
    int64 mod_revision = 6;
    bytes value = 7;
  }
}
message RequestOp {
  // request is a union of request types accepted by a transaction.
  oneof request {
    RangeRequest request_range = 1;
    PutRequest request_put = 2;
    DeleteRangeRequest request_delete_range = 3;
  }
}
All together, a transaction is issued with a Txn API call, which takes a TxnRequest:
message TxnRequest {
  repeated Compare compare = 1;
  repeated RequestOp success = 2;
  repeated RequestOp failure = 3;
}
The result of a transaction is defined as follows:
message TxnResponse {
  ResponseHeader header = 1;
  bool succeeded = 2;
  repeated ResponseOp responses = 3;
}
message ResponseOp {
  oneof response {
    RangeResponse response_range = 1;
    PutResponse response_put = 2;
    DeleteRangeResponse response_delete_range = 3;
  }
}
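The If/Then/Else model can be illustrated with a small in-memory sketch. It is a simplification under stated assumptions, not etcd's apply logic: `TinyTxnStore` and `acquire` are hypothetical names, operations are plain tuples instead of protobuf messages, only the numeric compare targets are modeled (VALUE comparisons are omitted), and a missing key is treated as having revision and version 0.

```python
import operator

# Maps CompareResult names to Python comparison functions.
_OPS = {
    "EQUAL": operator.eq,
    "GREATER": operator.gt,
    "LESS": operator.lt,
    "NOT_EQUAL": operator.ne,
}


class TinyTxnStore:
    """Toy transactional store (hypothetical, not etcd code)."""

    def __init__(self):
        self.revision = 1  # assumption: a fresh store starts at revision 1
        self.kvs = {}      # key -> dict of value/create_revision/mod_revision/version

    def _field(self, key, target):
        kv = self.kvs.get(key)
        return kv[target] if kv else 0  # a missing key compares as 0

    def txn(self, compares, success, failure):
        # If: every comparison must hold for the transaction to succeed.
        succeeded = all(
            _OPS[result](self._field(key, target), want)
            for key, target, result, want in compares
        )
        # Then / Else: pick the request block to apply.
        ops = success if succeeded else failure
        if any(op[0] == "put" for op in ops):
            self.revision += 1  # all writes of one txn share a single revision
        responses = []
        for op in ops:
            if op[0] == "put":
                _, key, value = op
                old = self.kvs.get(key)
                self.kvs[key] = {
                    "value": value,
                    "create_revision": old["create_revision"] if old else self.revision,
                    "mod_revision": self.revision,
                    "version": old["version"] + 1 if old else 1,
                }
                responses.append(("put", self.revision))
            else:  # ("range", key)
                _, key = op
                kv = self.kvs.get(key)
                responses.append(("range", kv["value"] if kv else None))
        return succeeded, responses


def acquire(store, owner):
    """Compare-and-swap: create the lock key only if it does not exist yet."""
    return store.txn(
        compares=[(b"lock", "create_revision", "EQUAL", 0)],
        success=[("put", b"lock", owner)],
        failure=[("range", b"lock")],
    )


store = TinyTxnStore()
print(acquire(store, b"client-1"))  # first caller wins and writes the key
print(acquire(store, b"client-2"))  # second caller fails and reads the owner
```

Comparing `create_revision` against 0 is the usual way to express "only if the key is absent", which is what turns a plain put into a compare-and-swap.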
The Watch API reports every change to a key as an Event message:
message Event {
  enum EventType {
    PUT = 0;
    DELETE = 1;
  }
  EventType type = 1;
  KeyValue kv = 2;
  KeyValue prev_kv = 3;
}
Watches are long-running requests and use gRPC streams to stream event data. A single watch stream can multiplex many distinct watches by tagging events with per-watch identifiers.
Watches make three guarantees about events:
- Ordered - events are ordered by revision; an event will never appear on a watch if it precedes an event in time that has already been posted.
- Reliable - a sequence of events will never drop any subsequence of events; if there are events ordered in time as a < b < c, then if the watch receives events a and c, it is guaranteed to receive b.
- Atomic - a list of events is guaranteed to encompass complete revisions; updates in the same revision over multiple keys will not be split over several lists of events.
A watch is created by sending a WatchCreateRequest over the stream:
message WatchCreateRequest {
  bytes key = 1;
  bytes range_end = 2;
  int64 start_revision = 3;
  bool progress_notify = 4;
  enum FilterType {
    NOPUT = 0;
    NODELETE = 1;
  }
  repeated FilterType filters = 5;
  bool prev_kv = 6;
}
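The ordering and atomicity guarantees, together with `start_revision` and `filters`, can be sketched with an append-only event log. This is a toy that replays history as a list rather than streaming it over gRPC: `TinyWatchLog` is a hypothetical name, and the "start from now" behavior of an unset `start_revision` is not modeled.

```python
from collections import namedtuple

Event = namedtuple("Event", "type key value revision")


class TinyWatchLog:
    """Toy event log (hypothetical, not etcd code)."""

    def __init__(self):
        self.revision = 1  # assumption: a fresh store starts at revision 1
        self.events = []   # append-only, so already ordered by revision

    def apply(self, changes):
        """Apply a batch of (type, key, value) changes under ONE new revision."""
        self.revision += 1
        for typ, key, value in changes:
            self.events.append(Event(typ, key, value, self.revision))
        return self.revision

    def watch(self, key, range_end=b"", *, start_revision, filters=()):
        """Return batches of events; each batch is one complete revision."""
        def matches(ev):
            if range_end:
                in_range = key <= ev.key < range_end
            else:
                in_range = ev.key == key
            filtered = ((ev.type == "PUT" and "NOPUT" in filters)
                        or (ev.type == "DELETE" and "NODELETE" in filters))
            return in_range and not filtered and ev.revision >= start_revision

        batches = {}
        for ev in self.events:
            if matches(ev):
                # Atomic: events of the same revision are never split up.
                batches.setdefault(ev.revision, []).append(ev)
        # Ordered: batches are delivered in increasing revision order.
        return [batches[rev] for rev in sorted(batches)]


log = TinyWatchLog()
log.apply([("PUT", b"a", b"1")])                        # revision 2
log.apply([("PUT", b"a", b"2"), ("PUT", b"b", b"1")])   # revision 3
log.apply([("DELETE", b"a", None)])                     # revision 4
for batch in log.watch(b"a", b"c", start_revision=3):
    print([(e.type, e.key, e.revision) for e in batch])
```

Because the log is keyed by revision, a client that reconnects can resume from the last revision it saw without missing or reordering events.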
Leases are a mechanism for detecting client liveness: when the cluster stops receiving heartbeats, it treats the client as dead. The cluster grants leases with a time-to-live. A lease expires if the etcd cluster does not receive a keepAlive within a given TTL period.
message LeaseGrantRequest {
  int64 TTL = 1;
  int64 ID = 2;
}
message LeaseRevokeRequest {
  int64 ID = 1;
}
Leases are refreshed using a bi-directional stream created with the LeaseKeepAlive API call.
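The grant / keep-alive / revoke cycle can be sketched as follows. This is a toy model with an explicit `now` argument so that the behavior is deterministic. `TinyLessor` and its sequential ID allocator are hypothetical, and it assumes the documented behavior that requesting ID 0 lets the server choose an ID and that keys attached to an expired or revoked lease are deleted.

```python
class TinyLessor:
    """Toy lease manager (hypothetical, not etcd code)."""

    def __init__(self):
        self.leases = {}   # lease ID -> [ttl, deadline]
        self.keys = {}     # lease ID -> set of attached keys
        self._next_id = 1  # hypothetical allocator; etcd picks IDs differently

    def grant(self, ttl, now, lease_id=0):
        if lease_id == 0:  # ID 0 asks the lessor to choose an ID
            while self._next_id in self.leases:
                self._next_id += 1
            lease_id = self._next_id
        elif lease_id in self.leases:
            raise ValueError("lease already exists")
        self.leases[lease_id] = [ttl, now + ttl]
        self.keys[lease_id] = set()
        return lease_id

    def attach(self, lease_id, key):
        self.keys[lease_id].add(key)

    def keep_alive(self, lease_id, now):
        """Refresh the deadline; raises KeyError if the lease is gone."""
        ttl = self.leases[lease_id][0]
        self.leases[lease_id][1] = now + ttl
        return ttl

    def revoke(self, lease_id):
        """Drop the lease and return the keys that are deleted with it."""
        del self.leases[lease_id]
        return self.keys.pop(lease_id)

    def expire(self, now):
        """Revoke every lease whose deadline has passed."""
        dead = [i for i, (_, deadline) in self.leases.items() if deadline <= now]
        deleted = set()
        for lease_id in dead:
            deleted |= self.revoke(lease_id)
        return deleted


lessor = TinyLessor()
lease = lessor.grant(ttl=10, now=0)
lessor.attach(lease, b"services/web-1")
lessor.keep_alive(lease, now=8)   # deadline moves from 10 to 18
print(lessor.expire(now=15))      # nothing expires yet
print(lessor.expire(now=18))      # the lease and its key are removed
```

Tying a key to a lease is what lets a crashed client's registration disappear on its own once the heartbeats stop.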