postgresql inner memo

2022-05-12 10:43:01 浏览数 (1)

堆表结构

  1. header data – A header data defined by the structure PageHeaderData is allocated in the beginning of the page. It is 24 byte long and contains general information about the page. The major variables of the structure are described below.
    • pd_lsn – This variable stores the LSN of XLOG record written by the last change of this page. It is an 8-byte unsigned integer, related to the WAL (Write-Ahead Logging) mechanism. 当前页的最后一个修改的lsn号。
    • pd_checksum – This variable stores the checksum value of this page. (Note that this variable is supported in version 9.3 or later; in earlier versions, this part had stored the timelineId of the page.)
    • pd_lower, pd_upper – pd_lower points to the end of line pointers, and pd_upper to the beginning of the newest heap tuple.
    • pd_special – This variable is for indexes. In the page within tables, it points to the end of the page. (In the page within indexes, it points to the beginning of special space which is the data area held only by indexes and contains the particular data according to the kind of index types such as B-tree, GiST, GiN, etc.)

读写tuple

内存架构&典型进程

代码语言:javascript复制
postgres> pstree -p 9687
- = 00001 root /sbin/launchd
 - - 09687 postgres /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
   |--= 09688 postgres postgres: logger process     
   |--= 09690 postgres postgres: checkpointer process     
   |--= 09691 postgres postgres: writer process     
   |--= 09692 postgres postgres: wal writer process     
   |--= 09693 postgres postgres: autovacuum launcher process     
   |--= 09694 postgres postgres: archiver process     
   |--= 09695 postgres postgres: stats collector process     
   |--= 09697 postgres postgres: postgres sampledb 192.168.1.100(54924) idle  
   --= 09717 postgres postgres: postgres sampledb 192.168.1.100(54964) idle in transaction  

Local Memory Area

Each backend process allocates a local memory area for query processing; each area is divided into several sub-areas – whose sizes are either fixed or variable. Table 2.2 shows a list of the major sub-areas. The details will be described in the following chapters.

sub-area

description

work_mem

Executor uses this area for sorting tuples by ORDER BY and DISTINCT operations, and for joining tables by merge-join and hash-join operations.

maintenance_work_mem

Some kinds of maintenance operations (e.g., VACUUM, REINDEX) use this area.

temp_buffers

Executor uses this area for storing temporary tables.

Shared Memory Area

sub-area

description

shared buffer pool

PostgreSQL loads pages within tables and indexes from a persistent storage to here, and operates them directly.

WAL buffer

To ensure that no data has been lost by server failures, PostgreSQL supports the WAL mechanism. WAL data (also referred to as XLOG records) are transaction log in PostgreSQL; and WAL buffer is a buffering area of the WAL data before writing to a persistent storage.

commit log

Commit Log(CLOG) keeps the states of all transactions (e.g., in_progress,committed,aborted) for Concurrency Control (CC) mechanism.

SQL引擎

流程

1 语法解析

SQL解析成语法树,只检查语法错误

代码语言:javascript复制
testdb=# SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data;

2 语义分析

对由解析器生成的语法树进行语义分析,并生成查询树。

  • The targetlist is a list of columns that are the result of this query. In this example, this list is composed of two columns: ‘id' and ‘data’. If the input query tree uses ‘∗∗' (asterisk), the analyzer/analyser will explicitly replace it to all of the columns.
  • The range table is a list of relations that are used in this query. In this example, this table holds the information of the table ‘tbl_a’ such as the oid of this table and the name of this table.
  • The join tree stores the FROM clause and the WHERE clauses.
  • The sort clause is a list of SortGroupClause.

3 查询重写

展开子查询的视图、一些常量函数等

4 计划

查询树--->查询计划树

事务系统

  • 0 means Invalid txid.
  • 1 means Bootstrap txid, which is only used in the initialization of the database cluster.
  • 2 means Frozen txid

21亿之前的都是能看到的

tuple内部

  • t_xmin holds the txid of the transaction that inserted this tuple.
  • t_xmax holds the txid of the transaction that deleted or updated this tuple. If this tuple has not been deleted or updated, t_xmax is set to 0, which means INVALID.

insert

delete

update

事务快照

代码语言:javascript复制
testdb=# SELECT txid_current_snapshot();
 txid_current_snapshot 
-----------------------
 100:104:100,102
(1 row)

事务快照真实情况:

事务快照在不同隔离级别下的使用方法:

freeze

PG用一个计算距离的算法来计算两个事务的差值,用差值计算决定两个事务的先后关系。

这个差值保存在一个21亿的变量里面,如果超了,就会溢出了得出相反的结果。

lazy_mode

freezeLimit_txid=(OldestXmin−vacuum_freeze_min_age)

  • 注意,这里只会清理最小事务ID超过5千万以上的元祖,即自己的xid<freezeLimit_txid=2500
  • 元组年龄超过5千万( vacuum_freeze_min_age 默认值 5 千万),触发 lazy freeze
  • 元数据 relfrozenxid 会记录 freeze 位置(2500)
  • vacuum 清理已经删除的 tuple,vm 也会影响 freeze 动作,tuple4、tuple5、tuple6 没有 freeze

Eager Mode

pg_database.datfrozenxid<(OldestXmin−vacuum_freeze_table_age)

  • 最老的库年龄大于 1.5 亿时( vacuum_freeze_table_age 默认值 1.5 亿),触发 eager freeze,不受VM影响扫描全部页
  • 元数据 relfrozenxid 会记录 freeze 位置(100002000),表示超过 1.00002 亿的元组都已经 freeze

vacuum

fsm大根堆,父节点保存字节点中最大的空间,叶子节点指向数据页面,一个FSM页应该可以指向4000个数据页

WAL

full_page_write

xlog结构

0 人点赞