可用的空闲页号信息存储在freelist中,具体位于freelist.go文件中,定义如下:
代码语言:javascript复制// freelist represents a list of all pages that are available for allocation.
// It also tracks pages that have been freed but are still in use by open transactions.
type freelist struct {
ids []pgid // all free and available free page ids.
pending map[txid][]pgid // mapping of soon-to-be free page ids by tx.
cache map[pgid]bool // fast lookup of all free and pending page ids.
}
它的三个属性分别表示所有空闲可用的页,刚刚事务中被释放的可用页,但是事务还没提交,可能还被读事务在用的空闲页,以及每个页id到是否空闲的映射关系。
代码语言:javascript复制// newFreelist returns an empty, initialized freelist.
func newFreelist() *freelist {
return &freelist{
pending: make(map[txid][]pgid),
cache: make(map[pgid]bool),
}
}
加载一个磁盘页后会进行reindex操作,过程中会对空闲页id进行排序,并建立是否可用的映射关系
代码语言:javascript复制 func (f *freelist) read(p *page) {
ids := ((*[maxAllocSize]pgid)(unsafe.Pointer(&p.ptr)))[idx:count]
f.ids = make([]pgid, len(ids))
copy(f.ids, ids)
// Make sure they're sorted.
sort.Sort(pgids(f.ids))
f.reindex()
代码语言:javascript复制// reindex rebuilds the free cache based on available and pending free lists.
func (f *freelist) reindex() {
f.cache = make(map[pgid]bool, len(f.ids))
for _, id := range f.ids {
f.cache[id] = true
}
for _, pendingIDs := range f.pending {
for _, pendingID := range pendingIDs {
f.cache[pendingID] = true
}
}
}
bolt_amd64.go中可以看到最大可以分配空间是2G
代码语言:javascript复制// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0x7FFFFFFF
逻辑视图上是通过Bucket管理数据的关联关系的,关系是一个树形结构,具体的定义和数据解析位于bucket.go
代码语言:javascript复制func (b *Bucket) dereference() {
if b.rootNode != nil {
b.rootNode.root().dereference()
}
for _, child := range b.buckets {
child.dereference()
}
}
代码语言:javascript复制// Bucket represents a collection of key/value pairs inside the database.
type Bucket struct {
*bucket
tx *Tx // the associated transaction
buckets map[string]*Bucket // subbucket cache
page *page // inline page reference
rootNode *node // materialized node for the root page.
nodes map[pgid]*node // node cache
// Sets the threshold for filling nodes when they split. By default,
// the bucket will fill to 50% but it can be useful to increase this
// amount if you know that your write workloads are mostly append-only.
//
// This is non-persisted across transactions so it must be set in every Tx.
FillPercent float64
}
每一个Bucket在物理存储上都是一个B 树,它包含了root节点 pageid和全局递增的sequence, 意味着每一个Bucket的都不一样。Bucket上存了关联事务的指针,用map存储了key到子Bucket的映射关联。满足内联的条件有两个
- 该子Bucket再没有嵌套的子Bucket了。
- 整个子Bucket的大小不能超过page size/4。
如果上述两个条件都符合,就不会为这个Bucket单独申请一个页,而是通过内联的方式,将多个Bucket存储到一个页里面,以节省空间。所以Bucket里面缓存了内联页的指针。
它同时也存储了根页面的指针,以及pageid到node页的映射缓存。
代码语言:javascript复制// bucket represents the on-file representation of a bucket.
// This is stored as the "value" of a bucket key. If the bucket is small enough,
// then its root page can be stored inline in the "value", after the bucket
// header. In the case of inline buckets, the "root" will be 0.
type bucket struct {
root pgid // page id of the bucket's root-level page
sequence uint64 // monotonically incrementing, used by NextSequence()
}
其中node代表了一个具体的磁盘页,它的定义位于node.go
代码语言:javascript复制// node represents an in-memory, deserialized page.
type node struct {
bucket *Bucket
isLeaf bool
unbalanced bool
spilled bool
key []byte
pgid pgid
parent *node
children nodes
inodes inodes
}
其中key []byte和bucket *Bucket,分别代表存储的key和对应Bucket,parent指向了父亲页,children指向了孩子页列表,共同构成了一个B 树,inodes指向了页内存储的数据列表。每一个项都可以表示为
代码语言:javascript复制type inode struct {
flags uint32
pgid pgid
key []byte
value []byte
}
/ inode represents an internal node inside of a node.
// It can be used to point to elements in a page or point
// to an element which hasn't been added to a page yet.
代码语言:javascript复制type inodes []inode
key和value分别代表了页中存储的键值对信息。node的解析过程如下:
代码语言:javascript复制// dereference causes the node to copy all its inode key/value references to heap memory.
// This is required when the mmap is reallocated so inodes are not pointing to stale data.
func (n *node) dereference() {
if n.key != nil {
key := make([]byte, len(n.key))
copy(key, n.key)
n.key = key
_assert(n.pgid == 0 || len(n.key) > 0, "dereference: zero-length node key on existing node")
}
for i := range n.inodes {
inode := &n.inodes[i]
key := make([]byte, len(inode.key))
copy(key, inode.key)
inode.key = key
_assert(len(inode.key) > 0, "dereference: zero-length inode key")
value := make([]byte, len(inode.value))
copy(value, inode.value)
inode.value = value
}
for _, child := range n.children {
child.dereference()
}
// Update statistics.
n.bucket.tx.stats.NodeDeref
整个过程是一个递归的过程,获取祖先节点的过程也是一样的:
代码语言:javascript复制func (n *node) root() *node {
if n.parent == nil {
return n
}
return n.parent.root()
}