阅读顺序
《Postgresql源码(30)Postgresql索引基础B-linked-tree》
《Postgresql源码(31)Btree索引相关系统表和整体结构》
《Postgresql源码(32)Btree索引分裂前后结构差异对比》
《Postgresql源码(33)Btree索引读——整体流程&_bt_first》
《Postgresql源码(34)Btree索引读——_bt_first搜索部分分析》
《Postgresql源码(36)Btree索引读——_bt_next搜索部分分析》
function flow
代码语言:javascript复制// 最顶层的循环在ExecutePlan总,转一次拿一条
ExecScan
ExecScanFetch
IndexNext
【1】index_beginscan: (scandesc = index_beginscan 生成scandesc可以)
index_beginscan_internal[getr]
btbeginscan: 初始化IndexScanDesc和BTScanOpaque
IndexScanDesc scan = ...
BTScanOpaque so = ...
scan->opaque = so
return scan
【2】index_rescan
btrescan
【3】index_getnext(scandesc, direction)
index_getnext_tid : 开始索引遍历,拿到一个符合要求的ctid(found = ...->amgettuple(scan, direction)
实际执行btgettuple返回true说明找到了符合条件的ctid)
btgettuple : 调用_bt_first拿到第一个符合条件的,再调用_bt_next顺序扫描后面符合条件的
_bt_first[setr]:【重点分析】
index_fetch_heap
【3】index_getnext(scandesc, direction)
index_getnext_tid : 开始索引遍历,拿到一个符合要求的ctid(found = ...->amgettuple(scan, direction)
实际执行btgettuple返回true说明找到了符合条件的ctid)
btgettuple : 调用_bt_first拿到第一个符合条件的,再调用_bt_next顺序扫描后面符合条件的
_bt_next[setr]:【重点分析】
_bt_steppage
_bt_readnextpage
index_endscan
2 点查案例
以图中分裂后的索引为例,这是一个level=2的索引,有branch节点和leaf节点。
数据
代码语言:javascript复制postgres=# select * from t8 limit 10;
id | info
---- ----------------------------------
1 | 0dc60cfa723e1b809c45bdb31dfc698e
2 | 99863eeb157c1fc246d4109c469cf2d6
3 | 82a08a276210c48d8216f09e207f8f9c
4 | cb8f3c118a48c13e9be54585ca804b9f
5 | 4dcbfbf43c397b61076d3b638d7ae4f2
6 | 396881e5fc07b4156bfe31480a1adf8f
7 | 9c95a4f04931f4cae8ed6e451b9f12f2
8 | 90571ab1bcb3b22ccb6e03aadee103de
9 | 9ac75d7b5342b0021294779f3c4e8028
10 | 6e61f764b610594004651f188dcc78a0
下面开始查询leaf页面4-6中的一段数据:
代码语言:javascript复制select * from bt_page_items('t8_pkey', 4);
itemoffset | ctid | itemlen | nulls | vars | data
------------ --------- --------- ------- ------ -------------------------
1 | (9,19) | 16 | f | f | 4b 04 00 00 00 00 00 00
2 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00
3 | (6,14) | 16 | f | f | de 02 00 00 00 00 00 00
4 | (6,15) | 16 | f | f | df 02 00 00 00 00 00 00
...
select * from bt_page_items('t8_pkey', 6);
itemoffset | ctid | itemlen | nulls | vars | data
------------ ---------- --------- ------- ------ -------------------------
1 | (15,31) | 16 | f | f | 27 07 00 00 00 00 00 00
2 | (12,25) | 16 | f | f | b9 05 00 00 00 00 00 00
3 | (12,26) | 16 | f | f | ba 05 00 00 00 00 00 00
4 | (12,27) | 16 | f | f | bb 05 00 00 00 00 00 00
5 | (12,28) | 16 | f | f | bc 05 00 00 00 00 00 00
6 | (12,29) | 16 | f | f | bd 05 00 00 00 00 00 00
7 | (12,30) | 16 | f | f | be 05 00 00 00 00 00 00
8 | (12,31) | 16 | f | f | bf 05 00 00 00 00 00 00
9 | (12,32) | 16 | f | f | c0 05 00 00 00 00 00 00
10 | (12,33) | 16 | f | f | c1 05 00 00 00 00 00 00
...
postgres=# select * from t8 where ctid='(6,15)';
id | info
----- ----------------------------------
735 | 9a0a17271e0f8331f6d18ac8a0d99992
postgres=# select * from t8 where ctid='(12,27)';
id | info
------ ----------------------------------
1467 | b0994a3ab67d39f0a651ca0e34cf8c52
下面分析查询过程
代码语言:javascript复制select * from t8 where id>735 and id<1467 and id>500 and id<2000;
id | info
------ ----------------------------------
736 | 193a511bacc34d12956b3c74cdcc18e7
737 | f64a6019fa7ef376f750c22881e0e75e
738 | 35837c259964bcef7ebf50f11f30ccda
739 | 8a66aea4f98f6cff1134b86d8b580925
740 | 72e70c6f57e80cf74dd92699c4bd9940
...
1462 | 62fa0351c94047c3d5ed64149aafe50a
1463 | 1abed4413570355d624762db49d88541
1464 | 2c8b8e624e15b4147110480311b6b5ae
1465 | 0ea46790a9180bd0254f673a8cb90275
1466 | 86e2ad07cbb0b4fca0cadd234387f52f
2.1 index_getnext_tid
使用amgettuple函数拿到ctid,索引扫描入口。
- 找到了:返回ctid
- 没找到:放锁返回NULL
ItemPointer
index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
{
bool found;
...
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
...
/* If we're out of index entries, we're done */
if (!found)
{
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
return NULL;
}
...
return &scan->xs_ctup.t_self;
}
2.2 btgettuple
- 使用_bt_first拿到第一条符合条件的位置记录到so->currPos
- 继续循环_bt_next拿到后面符合条件位置
bool
btgettuple(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
...
do
{
if (!BTScanPosIsValid(so->currPos))
res = _bt_first(scan, dir);
else
{
....
res = _bt_next(scan, dir);
}
/* If we have a tuple, return it ... */
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
return res;
}
2.3 _bt_first
函数很长,中间分段分析:
代码语言:javascript复制/*
* _bt_first() -- Find the first item in a scan.
*
* We need to be clever about the direction of scan, the search
* conditions, and the tree ordering. We find the first item (or,
* if backwards scan, the last item) in the tree that satisfies the
* qualifications in the scan key. On success exit, the page containing
* the current index tuple is pinned but not locked, and data about
* the matching tuple(s) on the page has been loaded into so->currPos.
* scan->xs_ctup.t_self is set to the heap TID of the current tuple,
* and if requested, scan->xs_itup points to a copy of the index tuple.
*
* If there are no matching items in the index, we return FALSE, with no
* pins or locks held.
*
* Note that scan->keyData[], and the so->keyData[] scankey built from it,
* are both search-type scankeys (see nbtree/README for more about this).
* Within this routine, we build a temporary insertion-type scankey to use
* in locating the scan start position.
*/
函数的功能:找到第一个满足搜索条件的item。
- 位置记录到
so->currPos
- tid记录到
scan->xs_ctup.t_self
- 如果需要,
scan->xs_itup
指向索引元组的副本。
找不到满足的item返回false。
代码语言:javascript复制bool
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
OffsetNumber offnum;
StrategyNumber strat;
bool nextkey;
bool goback;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData scankeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
int keysCount = 0;
int i;
bool status = true;
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
Assert(!BTScanPosIsValid(so->currPos));
pgstat_count_index_scan(rel);
/*
* Examine the scan keys and eliminate any redundant keys; also mark the
* keys that must be matched to continue the scan.
*/
_bt_preprocess_keys(scan);
第一步:_bt_preprocess_keys开始处理冗余key,结果记录到so中
代码语言:javascript复制(gdb) p scan->numberOfKeys
$3 = 4
(gdb) p scan->keyData[0]
$4 = {sk_flags = 0, sk_attno = 1, sk_strategy = 5, sk_subtype = 23, sk_collation = 0, sk_func = {fn_addr = 0x8f65ac <int4gt>,
fn_oid = 147, fn_nargs = 2, fn_strict = 1 '