堆表结构
- header data – A header data defined by the structure PageHeaderData is allocated in the beginning of the page. It is 24 byte long and contains general information about the page. The major variables of the structure are described below.
- pd_lsn – This variable stores the LSN of XLOG record written by the last change of this page. It is an 8-byte unsigned integer, related to the WAL (Write-Ahead Logging) mechanism. 当前页的最后一个修改的lsn号。
- pd_checksum – This variable stores the checksum value of this page. (Note that this variable is supported in version 9.3 or later; in earlier versions, this part had stored the timelineId of the page.)
- pd_lower, pd_upper – pd_lower points to the end of line pointers, and pd_upper to the beginning of the newest heap tuple.
- pd_special – This variable is for indexes. In the page within tables, it points to the end of the page. (In the page within indexes, it points to the beginning of special space which is the data area held only by indexes and contains the particular data according to the kind of index types such as B-tree, GiST, GiN, etc.)
读写tuple
写
读
内存架构&典型进程
代码语言:javascript复制postgres> pstree -p 9687
- = 00001 root /sbin/launchd
- - 09687 postgres /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
|--= 09688 postgres postgres: logger process
|--= 09690 postgres postgres: checkpointer process
|--= 09691 postgres postgres: writer process
|--= 09692 postgres postgres: wal writer process
|--= 09693 postgres postgres: autovacuum launcher process
|--= 09694 postgres postgres: archiver process
|--= 09695 postgres postgres: stats collector process
|--= 09697 postgres postgres: postgres sampledb 192.168.1.100(54924) idle
--= 09717 postgres postgres: postgres sampledb 192.168.1.100(54964) idle in transaction
Local Memory Area
Each backend process allocates a local memory area for query processing; each area is divided into several sub-areas – whose sizes are either fixed or variable. Table 2.2 shows a list of the major sub-areas. The details will be described in the following chapters.
sub-area | description |
---|---|
work_mem | Executor uses this area for sorting tuples by ORDER BY and DISTINCT operations, and for joining tables by merge-join and hash-join operations. |
maintenance_work_mem | Some kinds of maintenance operations (e.g., VACUUM, REINDEX) use this area. |
temp_buffers | Executor uses this area for storing temporary tables. |
Shared Memory Area
sub-area | description |
---|---|
shared buffer pool | PostgreSQL loads pages within tables and indexes from a persistent storage to here, and operates them directly. |
WAL buffer | To ensure that no data has been lost by server failures, PostgreSQL supports the WAL mechanism. WAL data (also referred to as XLOG records) are transaction log in PostgreSQL; and WAL buffer is a buffering area of the WAL data before writing to a persistent storage. |
commit log | Commit Log(CLOG) keeps the states of all transactions (e.g., in_progress,committed,aborted) for Concurrency Control (CC) mechanism. |
SQL引擎
流程
1 语法解析
SQL解析成语法树,只检查语法错误
代码语言:javascript复制testdb=# SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data;
2 语义分析
对由解析器生成的语法树进行语义分析,并生成查询树。
- The targetlist is a list of columns that are the result of this query. In this example, this list is composed of two columns: ‘id' and ‘data’. If the input query tree uses ‘∗∗' (asterisk), the analyzer/analyser will explicitly replace it to all of the columns.
- The range table is a list of relations that are used in this query. In this example, this table holds the information of the table ‘tbl_a’ such as the oid of this table and the name of this table.
- The join tree stores the FROM clause and the WHERE clauses.
- The sort clause is a list of SortGroupClause.
3 查询重写
展开子查询的视图、一些常量函数等
4 计划
查询树--->查询计划树
事务系统
- 0 means Invalid txid.
- 1 means Bootstrap txid, which is only used in the initialization of the database cluster.
- 2 means Frozen txid
21亿之前的都是能看到的
tuple内部
- t_xmin holds the txid of the transaction that inserted this tuple.
- t_xmax holds the txid of the transaction that deleted or updated this tuple. If this tuple has not been deleted or updated, t_xmax is set to 0, which means INVALID.
insert
delete
update
事务快照
代码语言:javascript复制testdb=# SELECT txid_current_snapshot();
txid_current_snapshot
-----------------------
100:104:100,102
(1 row)
事务快照真实情况:
事务快照在不同隔离级别下的使用方法:
freeze
PG用一个计算距离的算法来计算两个事务的差值,用差值计算决定两个事务的先后关系。
这个差值保存在一个21亿的变量里面,如果超了,就会溢出了得出相反的结果。
lazy_mode
freezeLimit_txid=(OldestXmin−vacuum_freeze_min_age)
- 注意,这里只会清理最小事务ID超过5千万以上的元祖,即自己的xid<freezeLimit_txid=2500
- 元组年龄超过5千万( vacuum_freeze_min_age 默认值 5 千万),触发 lazy freeze
- 元数据 relfrozenxid 会记录 freeze 位置(2500)
- vacuum 清理已经删除的 tuple,vm 也会影响 freeze 动作,tuple4、tuple5、tuple6 没有 freeze
Eager Mode
pg_database.datfrozenxid<(OldestXmin−vacuum_freeze_table_age)
- 最老的库年龄大于 1.5 亿时( vacuum_freeze_table_age 默认值 1.5 亿),触发 eager freeze,不受VM影响扫描全部页
- 元数据 relfrozenxid 会记录 freeze 位置(100002000),表示超过 1.00002 亿的元组都已经 freeze
vacuum
fsm大根堆,父节点保存字节点中最大的空间,叶子节点指向数据页面,一个FSM页应该可以指向4000个数据页
WAL
full_page_write
xlog结构