【DB宝89】PG高可用集群之Patroni + etcd + HAProxy + keepalived+普罗米修斯监控部署

2022-02-23 19:41:48 浏览数 (1)

简介

使用postgresql etcd patroni haproxy keepalived可以实现PG的高可用集群,其中,以postgresql做数据库,Patroni监控本地的PostgreSQL状态,并将本地PostgreSQL信息/状态写入etcd来存储集群状态,所以,patroni与etcd结合可以实现数据库集群故障切换(自动或手动切换),而haproxy可以实现数据库读写分离 读负载均衡(通过不同端口实现),keepalived实现VIP跳转,对haproxy提供了高可用,防止haproxy宕机。

Etcd用于Patroni节点之间共享信息。Patroni监控本地的PostgreSQL状态。如果主库(Primary)故障,Patroni把一个从库(Standby)拉起来,作为新的主(Primary)数据库, 如果一个故障PostgreSQL被抢救过来了,能够重新自动或手动加入集群。

Patroni基于Python开发的模板,结合DCS(Distributed Configuration Store,例如 ZooKeeper, etcd, Consul )可以定制PostgreSQL高可用方案。Patroni接管PostgreSQL数据库的启停,同时监控本地的PostgreSQL数据库,并将本地的PostgreSQL数据库信息写入DCS。Patroni的主备端是通过是否能获得 leader key 来控制的,获取到了leader key的Patroni为主节点,其它的为备节点。

其中Patroni不仅简单易用而且功能非常强大。

  • 支持自动failover和按需switchover
  • 支持一个和多个备节点
  • 支持级联复制
  • 支持同步复制,异步复制
  • 支持同步复制下备库故障时自动降级为异步复制(功效类似于MySQL的半同步,但是更加智能)
  • 支持控制指定节点是否参与选主,是否参与负载均衡以及是否可以成为同步备机
  • 支持通过pg_rewind自动修复旧主
  • 支持多种方式初始化集群和重建备机,包括pg_basebackup和支持wal_epgBackRestbarman等备份工具的自定义脚本
  • 支持自定义外部callback脚本
  • 支持REST API
  • 支持通过watchdog防止脑裂
  • 支持k8s,docker等容器化环境部署
  • 支持多种常见DCS(Distributed Configuration Store)存储元数据,包括etcd,ZooKeeper,Consul,Kubernetes

架构规划

部署规划如下:

主机

IP

组件

备注

lhrpgpatroni71

172.72.6.71

PostgreSQL 13.3 Patroni 2.1.2

主节点,yum安装PG,pip3安装Patroni

lhrpgpatroni72

172.72.6.72

PostgreSQL 13.3 Patroni 2.1.2

备节点1,yum安装PG,pip3安装Patroni

lhrpgpatroni73

172.72.6.73

PostgreSQL 13.3 Patroni 2.1.2

备节点2,yum安装PG,pip3安装Patroni

lhretcd74

172.72.6.74

etcd 3.3.11

leader,yum安装etcd

lhretcd75

172.72.6.75

etcd 3.3.11

follower,yum安装etcd

lhretcd76

172.72.6.76

etcd 3.3.11

follower,yum安装etcd

lhrhaproxyka77

172.72.6.77

HAProxy 1.5.18 keepalived 1.3.5

yum安装HAProxy keepalived主节点

lhrhaproxyka78

172.72.6.78

HAProxy 1.5.18 keepalived 1.3.5

yum安装HAProxy keepalived备节点

lhrmonitor

172.72.6.79

Prometheus Grafana etcd brower

已内置普罗米修斯监控软件,需要配置监控PG和HAProxy,etcd浏览

172.72.6.70

VIP

虚拟IP,在172.72.6.77和172.72.6.78之间进行漂移,附着于eth0网卡

架构图如下:

拓扑图如下:

本文包括如下几个部分:

1、搭建1主2从的PG流复制环境 2、搭建3节点的etcd集群,并配置etcd-browser WEB可视化界面 3、patroni安装及配置实现switchover和failover故障切换 4、配置HAProxy和Keepalived实现负载均衡 读写分离 5、配置普罗米修斯(prometheus grafana)监控PG HAproxy etcd集群 6、高可用测试

如果配置防火墙,需要开放的端口:

程序

端口

用途

PostgreSQL

5432

数据库监听端口

Patroni

8008

restapi监听端口

etcd

2379/2380

客户端访问端口和服务监听端口

haproxy

1080/5000/5001

Web访问端口和数据库对外服务端口

grafana

3000

Web登录端口

prometheus

9090

Web登录端口

node_exporter

9100

对外服务端口

postgres_exporter

9187

对外服务端口

haproxy_exporter

9101

对外服务端口

环境准备

代码语言:javascript复制
-- 网卡
docker network create --subnet=172.72.6.0/24 pg-network


-- pg   Patroni
docker rm -f lhrpgpatroni71
docker run -d --name lhrpgpatroni71 -h lhrpgpatroni71 
  --net=pg-network --ip 172.72.6.71 
  -p 64371:5432 -p 18008:8008 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrpgall:2.0 
  /usr/sbin/init


docker rm -f lhrpgpatroni72
docker run -d --name lhrpgpatroni72 -h lhrpgpatroni72 
  --net=pg-network --ip 172.72.6.72 
  -p 64372:5432 -p 28008:8008 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrpgall:2.0 
  /usr/sbin/init



docker rm -f lhrpgpatroni73
docker run -d --name lhrpgpatroni73 -h lhrpgpatroni73 
  --net=pg-network --ip 172.72.6.73 
  -p 64373:5432 -p 38008:8008 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrpgall:2.0 
  /usr/sbin/init



--  etcd
docker rm -f lhretcd74
docker run -d --name lhretcd74 -h lhretcd74 
  --net=pg-network --ip 172.72.6.74 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrcentos76:8.5 
  /usr/sbin/init


docker rm -f lhretcd75
docker run -d --name lhretcd75 -h lhretcd75 
  --net=pg-network --ip 172.72.6.75 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrcentos76:8.5 
  /usr/sbin/init


docker rm -f lhretcd76 
docker run -d --name lhretcd76 -h lhretcd76 
  --net=pg-network --ip 172.72.6.76 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrcentos76:8.5 
  /usr/sbin/init


-- HAProxy   keepalived
docker rm -f lhrhaproxyka77
docker run -d --name lhrhaproxyka77 -h lhrhaproxyka77 
  --net=pg-network --ip 172.72.6.77 
  -p 11080:1080 -p 15000-15001:5000-5001 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrcentos76:8.5 
  /usr/sbin/init


docker rm -f lhrhaproxyka78
docker run -d --name lhrhaproxyka78 -h lhrhaproxyka78 
  --net=pg-network --ip 172.72.6.78 
  -p 21080:1080 -p 25000-25001:5000-5001 
  -v /sys/fs/cgroup:/sys/fs/cgroup 
  --privileged=true lhrbest/lhrcentos76:8.5 
  /usr/sbin/init





-- 监控
docker rm -f lhrmonitor 
docker run -d --name lhrmonitor -h lhrmonitor 
 --net=pg-network --ip 172.72.6.79 
 -p 23389:3389 -p 29090:9090 -p 29093:9093 -p 23000:3000 -p 28000:8000 
 -v /sys/fs/cgroup:/sys/fs/cgroup 
 --privileged=true lhrbest/lhrprometheus:1.0 
 /usr/sbin/init

搭建1主2从的流复制环境

详细搭建过程可以参考:PG12高可用之1主2从流复制环境搭建及切换测试:https://www.xmmup.com/dbbao60pg12gaokeyongzhi1zhu2congliufuzhihuanjingdajianjiqiehuanceshi.html

主库操作

配置主库pg_hba.conf
代码语言:javascript复制
cat  << EOF > /var/lib/pgsql/13/data/pg_hba.conf
# TYPE  DATABASE    USER    ADDRESS       METHOD
local     all       all                    trust
host      all       all   127.0.0.1/32     trust
host      all       all    0.0.0.0/0        md5
host   replication  all    0.0.0.0/0        md5
EOF
配置主库参数
代码语言:javascript复制
-- 登陆主库环境
docker exec -it lhrpgpatroni71 bash

-- 该路径也需要在从库创建
mkdir -p /postgresql/archive
chown -R postgres.postgres /postgresql/archive

-- 修改参数
cat >> /var/lib/pgsql/13/data/postgresql.conf <<"EOF"

wal_level='replica'
archive_mode='on'
archive_command='test ! -f /postgresql/archive/%f && cp %p /postgresql/archive/%f'

max_wal_senders=10

EOF

-- 重启
systemctl restart postgresql-13.service

-- 查询参数
select * from pg_settings where name in ('wal_level','archive_mode','archive_command');
主库创建用户
代码语言:javascript复制
create role replhr login encrypted password 'lhr' replication;

从库操作

代码语言:javascript复制
mkdir -p /bk
chown postgres:postgres /bk

su - postgres
pg_basebackup -h 172.72.6.71 -p 5432 -U replhr -l bk20220120 -F p -P -R -D /bk


cat >> /bk/postgresql.conf <<"EOF"

primary_conninfo = 'host=172.72.6.71 port=5432 user=replhr password=lhr'

EOF

-- 关闭从库,删除从库的数据文件,并且将备份文件覆盖从库的数据文件
systemctl stop postgresql-13.service
rm -rf /var/lib/pgsql/13/data/*
cp -r /bk/* /var/lib/pgsql/13/data/


systemctl start postgresql-13.service

检查主从复制

代码语言:javascript复制
-- 主库查看wal日志发送状态
select * from pg_stat_replication;

-- 从库查看wal日志接收状态
select * from pg_stat_wal_receiver;

-- 也可以通过该命令查看
pg_controldata  | grep state

-- 也可以查看这个,主库是f代表false ;备库是t,代表true
select pg_is_in_recovery();


postgres=# select * from pg_stat_replication;
 pid  | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
------ ---------- --------- ------------------ ------------- ----------------- ------------- ------------------------------- -------------- ----------- ----------- ----------- ----------- ------------ ----------- ----------- ------------ --------------- ------------ -------------------------------
 3170 |    16430 | replhr  | walreceiver      | 172.72.6.73 |                 |       45000 | 2022-01-20 14:58:13.769779 08 |              | streaming | 0/7000060 | 0/7000060 | 0/7000060 | 0/7000060  |           |           |            |             0 | async      | 2022-01-20 15:13:14.383888 08
 3980 |    16430 | replhr  | walreceiver      | 172.72.6.72 |                 |       41278 | 2022-01-20 15:12:09.147278 08 |              | streaming | 0/7000060 | 0/7000060 | 0/7000060 | 0/7000060  |           |           |            |             0 | async      | 2022-01-20 15:13:09.266891 08
(2 rows)

配置etcd集群

6.74、6.75、6.76操作:

代码语言:javascript复制
-- 安装
yum install -y  etcd


-- lhretcd74
cat > /etc/etcd/etcd.conf <<"EOF"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://172.72.6.74:2380"
ETCD_LISTEN_CLIENT_URLS="http://172.72.6.74:2379,http://127.0.0.1:2379"
ETCD_NAME="etcd74"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://172.72.6.74:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://172.72.6.74:2379"
ETCD_INITIAL_CLUSTER="etcd74=http://172.72.6.74:2380,etcd75=http://172.72.6.75:2380,etcd76=http://172.72.6.76:2380"
ETCD_INITIAL_CLUSTER_TOKEN="lhretcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new" 

EOF




-- lhretcd75
cat > /etc/etcd/etcd.conf <<"EOF"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://172.72.6.75:2380"
ETCD_LISTEN_CLIENT_URLS="http://172.72.6.75:2379,http://127.0.0.1:2379"
ETCD_NAME="etcd75"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://172.72.6.75:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://172.72.6.75:2379"
ETCD_INITIAL_CLUSTER="etcd74=http://172.72.6.74:2380,etcd75=http://172.72.6.75:2380,etcd76=http://172.72.6.76:2380"
ETCD_INITIAL_CLUSTER_TOKEN="lhretcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"

EOF





-- lhretcd76
cat > /etc/etcd/etcd.conf <<"EOF"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://172.72.6.76:2380"
ETCD_LISTEN_CLIENT_URLS="http://172.72.6.76:2379,http://127.0.0.1:2379"
ETCD_NAME="etcd76"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://172.72.6.76:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://172.72.6.76:2379"
ETCD_INITIAL_CLUSTER="etcd74=http://172.72.6.74:2380,etcd75=http://172.72.6.75:2380,etcd76=http://172.72.6.76:2380"
ETCD_INITIAL_CLUSTER_TOKEN="lhretcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"

EOF



systemctl start etcd
systemctl enable etcd
systemctl status etcd


-- 启动成功后,修改每个配置文件为:
sed -i 's/ETCD_INITIAL_CLUSTER_STATE="new"/ETCD_INITIAL_CLUSTER_STATE="existing"/g' /etc/etcd/etcd.conf


-- 查看状态
etcdctl member list
etcdctl cluster-health

端口2379 用于客户端连接,而 2380 用于伙伴通讯。

etcd集群状态:

代码语言:javascript复制
[root@lhretcd74 /]# etcdctl member list
78e1f33282f17b1d: name=etcd76 peerURLs=http://172.72.6.76:2380 clientURLs=http://172.72.6.76:2379 isLeader=false
87e0134f0e4088f5: name=etcd75 peerURLs=http://172.72.6.75:2380 clientURLs=http://172.72.6.75:2379 isLeader=false
e65c83646e82fd9c: name=etcd74 peerURLs=http://172.72.6.74:2380 clientURLs=http://172.72.6.74:2379 isLeader=true
[root@lhretcd74 /]# etcdctl cluster-health
member 78e1f33282f17b1d is healthy: got healthy result from http://172.72.6.76:2379
member 87e0134f0e4088f5 is healthy: got healthy result from http://172.72.6.75:2379
member e65c83646e82fd9c is healthy: got healthy result from http://172.72.6.74:2379
cluster is healthy

etcd-browser WEB可视化界面

下载地址:https://github.com/henszey/etcd-browser 要启动 etcd-browser,还需要下载安装 nodejs,下载地址:https://nodejs.org/zh-cn/download/

代码语言:javascript复制
docker rm -f lhrmonitor 
docker run -d --name lhrmonitor -h lhrmonitor 
 --net=pg-network --ip 172.72.6.79 
 -p 23389:3389 -p 29090:9090 -p 29093:9093 -p 23000:3000 -p 28000:8000 -p 28800:8800 
 -v /sys/fs/cgroup:/sys/fs/cgroup 
 --privileged=true lhrbest/lhrprometheus:1.0 
 /usr/sbin/init


# 下载安装
wget https://codeload.github.com/henszey/etcd-browser/zip/refs/heads/master -O etcd-browser-master.zip
unzip etcd-browser-master.zip -d /usr/local/etcd

wget https://nodejs.org/dist/v16.13.2/node-v16.13.2-linux-x64.tar.xz
tar -xvf node-v16.13.2-linux-x64.tar.xz -C /usr/local/etcd
ln -s /usr/local/etcd/node-v16.13.2-linux-x64 /usr/local/etcd/nodejs



# 修改etcd-browser的配置文件
vi /usr/local/etcd/etcd-browser-master/server.js
var etcdHost = process.env.ETCD_HOST || '172.72.6.74';
var etcdPort = process.env.ETCD_PORT || 2379;
var serverPort = process.env.SERVER_PORT || 8000;

# 启动etcd-browser
cd /usr/local/etcd/etcd-browser-master
nohup /usr/local/etcd/nodejs/bin/node /usr/local/etcd/etcd-browser-master/server.js &


[root@lhrmonitor ~]# netstat -tulnp | grep 8000
tcp6       0      0 :::8000                 :::*                    LISTEN      665/node

谷歌浏览器访问 http://192.168.66.35:8000/

patroni配置

安装相应的Python模块

6.71、6.72、6.73操作:

代码语言:javascript复制
-- -i https://mirrors.aliyun.com/pypi/simple/
-- -ihttp://pypi.douban.com/simple/
python3 -m pip install --upgrade pip
python3 -m pip install psycopg2_binary
python3 -m pip install patroni[etcd]


-- 验证是否安装成功
which patroni
patronictl --help

配置patroni的yml参数文件

代码语言:javascript复制
mkdir -p /app/patroni/etc/
mkdir -p /app/patroni/log/
chown postgres.postgres -R /app/patroni/
节点lhrpgpatroni71

需要注意python的yml文件格式,有严格的缩进要求,且以空格进行缩进,不要使用Tab键,缩进控制不好的话,参数配置将出现各种问题。

代码语言:javascript复制
cat > /app/patroni/etc/patroni_config.yml <<"EOF"
scope: pg_cluster
namespace: /service
name: lhrpgpatroni71

log:
  level: INFO
  traceback_level: ERROR
  dir: /app/patroni/log
  file_num: 10
  file_size: 104857600

restapi:
  listen: 0.0.0.0:8008
  connect_address: 172.72.6.71:8008

etcd:
  host: 172.72.6.74:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    max_timelines_history: 0
    master_start_timeout: 300
    master_stop_timeout: 0
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "test ! -f /postgresql/archive/%f && cp %p /postgresql/archive/%f"

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 172.72.6.71:5432
  data_dir: /var/lib/pgsql/13/data
  pgpass: /home/postgres/.pgpass
  pg_ctl_timeout: 60
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: true
  authentication:
    replication:
      username: replhr
      password: lhr
    superuser:
      username: postgres
      password: lhr

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
EOF
节点lhrpgpatroni72
代码语言:javascript复制
cat > /app/patroni/etc/patroni_config.yml <<"EOF"
scope: pg_cluster
namespace: /service
name: lhrpgpatroni72

log:
  level: INFO
  traceback_level: ERROR
  dir: /app/patroni/log
  file_num: 10
  file_size: 104857600

restapi:
  listen: 0.0.0.0:8008
  connect_address: 172.72.6.72:8008

etcd:
  host: 172.72.6.75:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    max_timelines_history: 0
    master_start_timeout: 300
    master_stop_timeout: 0
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "test ! -f /postgresql/archive/%f && cp %p /postgresql/archive/%f"

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 172.72.6.72:5432
  data_dir: /var/lib/pgsql/13/data
  pgpass: /home/postgres/.pgpass
  pg_ctl_timeout: 60
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: true
  authentication:
    replication:
      username: replhr
      password: lhr
    superuser:
      username: postgres
      password: lhr

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
EOF
节点lhrpgpatroni73
代码语言:javascript复制
cat > /app/patroni/etc/patroni_config.yml <<"EOF"
scope: pg_cluster
namespace: /service
name: lhrpgpatroni73

log:
  level: INFO
  traceback_level: ERROR
  dir: /app/patroni/log
  file_num: 10
  file_size: 104857600

restapi:
  listen: 0.0.0.0:8008
  connect_address: 172.72.6.73:8008

etcd:
  host: 172.72.6.76:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    max_timelines_history: 0
    master_start_timeout: 300
    master_stop_timeout: 0
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "test ! -f /postgresql/archive/%f && cp %p /postgresql/archive/%f"

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 172.72.6.73:5432
  data_dir: /var/lib/pgsql/13/data
  pgpass: /home/postgres/.pgpass
  pg_ctl_timeout: 60
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: true
  authentication:
    replication:
      username: replhr
      password: lhr
    superuser:
      username: postgres
      password: lhr

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
EOF
参数解释

参考官方文档:https://patroni.readthedocs.io/en/latest/SETTINGS.html

代码语言:javascript复制
[root@pgtest1 patroni]# cat patroni_config.yml 
# 集群名称
scope: pg_cluster     # 集群名称
namespace: /service   # Patroni 将在其中保存有关集群的信息,这个路径是 etcd 存储数据的路径, Default value: "/service"
name: pgtest1         # 主机名,对于集群必须是唯一的

log:
  level: INFO              # 日志级别
  traceback_level: ERROR
  dir: /enmo/app/patroni/  # 日志写入的目录
  file_num: 10             # 要保留的日志数量
  file_size: 104857600     # 触发日志滚动的 patoni.log 文件的大小(以字节为单位)

restapi:
  listen: 192.168.58.10:8008        
  connect_address: 192.168.58.10:8008

etcd:
  # Provide host to do the initial discovery of the cluster topology:
  # 必须指定host、hosts、url、proxy或 srv 之一
  host: 192.168.58.10:2379    # etcd 端点的 host:port

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a `global configuration`
  dcs:   # 动态配置(Dynamic configuration)的参数设置,动态配置存储在 DCS(分布式配置存储)中并应用于所有集群节点
    ttl: 30
    loop_wait: 10   
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 如果从库落后于主库超过一定数量的字节,则此设置可确保不会发生故障转移
    maximum_lag_on_syncnode: -1
    max_timelines_history: 0         # DCS 中保存的最大时间线历史项目数。默认值:0。当设置为 0 时,它会在 DCS 中保留完整的历史记录。
    master_start_timeout: 300        # 在触发 failover 之前允许主服务器从故障中恢复的时间(单位:秒)
                                     # master 故障的最坏情况故障转移时间是:loop_wait   master_start_timeout   loop_wait
    master_stop_timeout: 0           # Patroni 停止 Postgres 时允许等待的秒数,仅在启用 synchronous_mode 时有效。超过参数值,则 Patroni 会向 postmaster 发送 SIGKILL。
    synchronous_mode: false          # 打开同步复制模式。在此模式下,一个从库将被选择为同步模式的从库,只有最新的领导者和同步从库才能参与领导者选举。
    synchronous_mode_strict          # 如果没有可用的同步副本,则防止禁用同步复制,从而阻止所有客户端写入主服务器。
                                 # https://patroni.readthedocs.io/en/latest/replication_modes.html

    postgresql:
      use_pg_rewind: true            # 是否使用 pg_rewind
      use_slots: true                # 是否使用复制槽
      parameters:
        max_connections: 3000
        superuser_reserved_connections: 100   # Patroni 需要使用超级用户访问数据库才能正常运行
        max_locks_per_transaction: 64
        max_worker_processes: 2
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        max_wal_senders: 10
        max_replication_slots: 10
        wal_keep_size: 4096MB  # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32MB 
        hot_standby: "on"
        listen_addresses: "*"
        port: 5432
        cluster_name: "pg_cluster"
        archive_mode: on
        archive_command: "cp %p /enmo/pgarch/%f"

    # standby_cluster:               # 如果定义了这个部分,我们想要引导一个备用集群
      # host: 127.0.0.1              # 远程主机的地址
      # port: 1111                   # 远程主机的端口
      # primary_slot_name: patroni   # 远程主服务器上用于复制的插槽。此参数是可选的,默认值来自实例名称
      # create_replica_methods
      # restore_command
      # archive_cleanup_command
      # recovery_min_apply_delay

  initdb:   # 列出要传递给 initdb 的选项
    - data-checksums          # 在9.3上需要 pg_rewind 时必须启用
    - encoding: UTF8      # 新数据库的默认编码
    - locale: UTF8        # 新数据库的默认语言环境
    - wal-segsize: 32
    - allow-group-access
    - pgdata: /enmo/pgdata
    - waldir: /enmo/pgwal

  pg_hba:   # 应该添加到 pg_hba.conf 的行列表
    - host all all 0.0.0.0/0 md5
    - host replication replicator 127.0.0.1/32 md5

  users:    # 初始化新集群后需要创建的一些额外用户
    admin:  # 用户名是admin
      password: zalando
      options:    # CREATE USER 语句的选项列表
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432                  # Postgres 监听的 IP 地址   端口
  connect_address: 192.168.58.10:5432
  data_dir: /enmo/pgdata                # Postgres 数据目录的位置
  # bin_dir: /software/pgsql/bin        # PostgreSQL 二进制文件的路径,默认值是一个空字符串,这意味着 PATH 环境变量将用于查找可执行文件
  # config_dir: /software/pgsql/data    # Postgres 配置目录的位置,默认为 data_dir
  pgpass: /home/postgres/.pgpass        # 密码文件的路径
  pg_ctl_timeout: 60                    # pg_ctl 执行 start,stop 或 restart 时应等待多长时间. 默认值为 60 秒。
  use_pg_rewind: true                   # 当它作为副本加入集群时,尝试在前领导者上使用 pg_rewind
  remove_data_directory_on_rewind_failure: false   # 如果启用此选项,Patroni 将删除 PostgreSQL 数据目录并重新创建副本。否则它会尝试跟随新的领导者。默认值为false
  remove_data_directory_on_diverged_timelines: false # 如果 Patroni 注意到时间线正在发散并且以前的 master 无法从新 master 开始流式传输,则 Patroni 将删除 PostgreSQL 数据目录并重新创建副本。此选项在 pg_rewind 无法使用时很有用。默认值为false。
  authentication:    # 用户验证
    replication:     # 复制用户
      username: replica
      password: replica
    superuser:       # 超级用户
      username: postgres
      password: postgres
    # rewind:        # pg_rewind 用户
      # username: 
      # password:

tags:
  nofailover: false     # 控制是否允许此节点参与领导者竞赛并成为领导者
  noloadbalance: false
  clonefrom: false
  nosync: false
复制

所有节点创建 patroni 服务并启动

创建服务,注意需要配置环境变量,否则启动服务会出现这种报错(FATAL: Patroni requires psycopg2>=2.5.4 or psycopg2-binary) Requires,强制依赖 etcd.service 启动成功,可以视情况而定,etcd不启动,patroni起来后不会自动拉起数据库

代码语言:javascript复制
cat > /usr/lib/systemd/system/patroni.service <<"EOF"
[Unit]
Description=patroni
After=network.target remote-fs.target nss-lookup.target

[Service]
Type=forking
User=postgres
Group=postgres
Environment="PGHOME=/usr/pgsql-13"
Environment="PGDATA=/var/lib/pgsql/13/data"
Environment="PGPORT=5432"
Environment="LD_LIBRARY_PATH=/usr/pgsql-13/lib"
Environment="PATH=/usr/pgsql-13/bin:/usr/local/bin:/usr/bin"
ExecStart=/bin/bash -c "patroni /app/patroni/etc/patroni_config.yml >> /app/patroni/log/patroni.log 2>&1 &"
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/usr/bin/killall patroni
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
EOF

# 启动服务
systemctl daemon-reload
systemctl restart patroni.service
systemctl enable patroni.service
systemctl status patroni.service

# 因为 patroni 会检测 PostgreSQL 数据库是否正常运行,如果没有运行,会自动启动PostgreSQL 数据库,所以可以考虑禁用 PostgreSQL 服务,由 patroni 托管PG的启停
systemctl disable postgresql-13.service

所有节点设置patronictl别名,方便维护

代码语言:javascript复制
echo "alias patronictl='patronictl -c /app/patroni/etc/patroni_config.yml'"  >> /etc/profile

echo "alias patronictl='patronictl -c /app/patroni/etc/patroni_config.yml'"  >> /root/.bashrc


source  /etc/profile
source  /root/.bashrc

在启动patroni之后,建议先使用patronictl禁掉auto failover功能,当启动完毕调试正常后再选择性的启动auto failover功能,因为启用auto failover功能后,killall patroni进程,会导致当前节点的数据库宕掉,如果主库处于生产状态,后果不堪设想。

代码语言:javascript复制
# Disable auto failover ,如果没有启动 patroni,执行patronictl pause 会失败
patronictl pause
# Resume auto failover,重新启用
patronictl resume

配置HAProxy和Keepalived

  • haproxy 使用C语言开发的一个开源软件,是一款具备高并发(一万以上)、高性能的TCP和HTTP负载均衡器,支持基于cookie的持久性,自动故障切换,支持正则表达式及web状态统计。
  • 使用 haproxy 设置端口区分连接主库(5000)和只读从库(5001),且端口上有负载均衡的功能(两个从库)。
  • haproxy 开源官网 https://www.haproxy.org/
  • Keepalived 免费开源,用C编写,通过VRRP协议实现多台机器之间的故障转移服务
  • keepalived 官方文档 https://www.keepalived.org/manpage.html
代码语言:javascript复制
-- 安装
yum install -y haproxy keepalived

配置HAProxy

使用系统自带的RPM安装的默认配置文件是 /etc/haproxy/haproxy.cfg

代码语言:javascript复制
cat > /etc/haproxy/haproxy.cfg <<"EOF"
global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     5000
    user        haproxy
    group       haproxy
    daemon
    nbproc      2

defaults    
    mode tcp
    log  127.0.0.1 local2 err
    option     tcplog
    option     dontlognull
    option     redispatch    
    retries    3
    maxconn    5000
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout check           5s

listen status
    bind *:1080
    mode http
    log global
    stats enable
    stats refresh 30s
    stats uri /
    stats realm Private lands
    stats auth admin:admin

listen master
    bind *:5000
    mode tcp
    option tcplog
    balance roundrobin
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 172.72.6.71:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
    server pg2 172.72.6.72:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
    server pg3 172.72.6.73:5432 maxconn 1000 check port 8008 inter 5000 rise 2 fall 2

listen replicas
    bind *:5001
    mode tcp
    option tcplog
    balance roundrobin
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 172.72.6.71:5432  maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
    server pg2 172.72.6.72:5432  maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
    server pg3 172.72.6.73:5432  maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
EOF

参数说明:

代码语言:javascript复制
# 全局定义
global
    # log   127.0.0.1 local0 info      # 全局的日志配置,使用log关键字,指定使用127.0.0.1上的syslog服务中的local0日志设备,记录日志等级为info的日志
    log         127.0.0.1 local2       # 日志输出配置,所有日志都记录在本机,通过local0输出,需要在rsyslog做配置
    chroot      /var/lib/haproxy       # 改变当前工作目录
    pidfile     /var/run/haproxy.pid   # 进程PID文件
    maxconn     3000                   # 最大连接数
    user        haproxy                # 所属用户
    group       haproxy                # 所属组
    daemon                             # 以后台形式运行haproxy
    nbproc 1                           # haproxy 启动时的进程数,<=CPU的核数,创建多个进程数,可以减少每个进程的任务队列,但是过多的进程数也可能会导致进程的崩溃。
    stats socket /var/lib/haproxy/stats
# 默认部分的定义
defaults    
    mode tcp                           # 模式 mode {tcp|http|health}, tcp是4层, http是7层, health是健康检测, 只会返回ok
    log 127.0.0.1 local2 err           # 使用 127.0.0.1 上的 syslog 服务的 local2 设备记录错误信息
    option     tcplog                  # 如果将 mode 设置为 http,那么您必须将 tcplog 更改为 httplog
    option     dontlognull             # 启用该项,日志中将不会记录空连接。所谓空连接就是在上游的负载均衡器或者监控系统为了探测该服务是否存活可用时,需要定期的连接或者获取某一固定的组件或页面,或者探测扫描端口是否在监听或开放等动作被称为空连接;官方文档中标注,如果该服务上游没有其他的负载均衡器的话,建议不要使用该参数,因为互联网上的恶意扫描或其他动作就不会被记录下来。
    option     redispatch              # 当 serverId 对应的服务器挂掉后,强制定向到其他健康的服务器
    option     abortonclose            # 当服务器负载很高的时候,自动结束掉当队列处理比较久的链接
    retries    3                       # 定义连接后端服务器的失败重连次数,连接失败次数超过此值后将会将对应后端服务器标记为不可用
    maxconn    3000                    # 默认最大连接数
    timeout queue           1m         # 当达到服务器的 maxconn 时,连接等待最大时长
    timeout connect         10s        # 连接超时
    timeout client          1m         # 客户端非活动状态的超时时长
    timeout server          1m         # 服务器超时
    timeout check           5s         # 心跳检测超时
# 配置haproxy web监控,查看统计信息
listen status
    bind *:1080                        # 定义统计页面的端口
    mode http
    log global
    stats enable                       # 通过web看状态信息
    stats refresh 30s                  # 统计页面自动刷新时间
    maxconn 10                         # 最大连接数
    stats uri /                        # 统计页面url,http//ip:1080/ 访问
    stats realm Private lands          # 设置统计页面认证时的提示内容
    stats auth admin:Admin2021         # 设置统计页面认证的用户和密码,如果要设置多个,另起一行写入即可
    stats hide-version                 # 隐藏统计页面上HAProxy的版本信息

listen master
    bind *:5000                        # 定义haproxy前端部分监听的端口
    mode tcp
    option tcplog
    balance roundrobin                 # 设置负载算法为:轮询算法
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions  # inter:间隔3秒做一个检测,fall:3次失败会被踢掉,rise:检查2次
    server pg1 172.72.6.71:5432  maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
    server pg2 172.72.6.72:5432  maxconn 1000 check port 8008 inter 5000 rise 2 fall 2
    server pg3 172.72.6.73:5432  maxconn 1000 check port 8008 inter 5000 rise 2 fall 2

启动HAProxy:

代码语言:javascript复制
systemctl start haproxy
systemctl enable haproxy
systemctl status haproxy

浏览器访问 HAProxy

浏览器登录 http://192.168.66.35:1080 输入用户名 admin 密码 admin

配置keepalived

使用系统自带的RPM安装的配置文件是 /etc/keepalived/keepalived.conf

代码语言:javascript复制
# keepalived 主节点配置文件
cat > /etc/keepalived/keepalived.conf << "EOF"
global_defs {
    router_id pg1
    script_user root
    enable_script_security
}

vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 5
    fall 3
    rise 5
    timeout 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 88
    priority 100
    advert_int 5
    authentication {
        auth_type PASS
        auth_pass postgres
    }
    virtual_ipaddress {
        172.72.6.70/24 dev eth0 label eth0:1
    }
    track_script {
        chk_haproxy
    }
}

EOF



# keepalived 备节点配置文件
cat > /etc/keepalived/keepalived.conf << "EOF"

global_defs {
    router_id pg2
    script_user root
    enable_script_security
}

vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 5
    fall 3
    rise 5
    timeout 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 88
    priority 99
    advert_int 5
    authentication {
        auth_type PASS
        auth_pass postgres
    }
    virtual_ipaddress {
        172.72.6.70/24 dev eth0 label eth0:1
    }
    track_script {
        chk_haproxy
    }
}
EOF

参数说明:

代码语言:javascript复制
! Configuration File for keepalived
# 全局定义块
global_defs {
    # 标识服务器的字符串,在局域网内应该是唯一的,不必是主机名,默认是本地主机名
    router_id pgtest1
    # 设置运行脚本默认用户和组
    script_user root
    # 如果脚本路径的任一部分对于非root用户来说,都具有可写权限,则不会以root身份运行脚本。
    enable_script_security
}
# 周期性检查脚本
vrrp_script chk_haproxy {
    # 指定要执行的脚本的路径或命令
    # 通过向进程 haproxy 发送信号 0 ,然后根据返回值来判断 haproxy 进程是否存在
    script "/usr/bin/killall -0 haproxy"
    # 脚本调用间隔的秒数,(默认值:1s)
    interval 2
    # 指定在多少秒后,脚本被认为执行失败
    timeout 2
    # 调整权重优先级,默认为2
    # keepalived 启动时就做权重运算,priority   weight ,主备端哪边权重大,VIP就在哪边启动
    weight 5
    # 执行失败多少次才认为失败
    fall 3
    # 执行成功多少次才认为是成功
    rise 5
}
# VRRP实例定义块
vrrp_instance VI_1 {
    # 指定该keepalived节点的初始状态
    state MASTER
    # vrrp实例绑定的接口,用于发送VRRP包
    interface ens33
    # 指定VRRP实例ID,范围是0-255,主备机保持一致
    virtual_router_id 88
    # 指定优先级,优先级高的将成为MASTER,备机请填写小于主机的值
    priority 100
    # 指定发送VRRP通告的间隔。单位是秒
    advert_int 5
    # 指定认证方式
    authentication {
        auth_type PASS          # PASS简单密码认证(推荐),AH:IPSEC认证(不推荐)
        auth_pass postgres      # 指定认证所使用的密码,最多8位。
    }
    # 指定VIP地址,主备机保持一致
    virtual_ipaddress {
        192.168.58.20/24 dev ens33 label ens33:1
    }
    # 添加一个 track 脚本( vrrp_script 配置的脚本)
    track_script {
        chk_haproxy
    }
}

启动keepalived:

代码语言:javascript复制
systemctl start keepalived
systemctl enable keepalived
systemctl status keepalived


# keepalived启动报错:IPVS: Can't initialize ipvs: Protocol not available
lsmod | grep ip_vs
modprobe ip_vs
modprobe ip_vs_wrr
lsmod | grep ip_vs
# 如果是容器,那么宿主机也需要加载ip_vs模块。

启动之后,会发现只有172.72.6.77节点有虚拟IP:

代码语言:javascript复制
[root@lhrhaproxyka77 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.72.6.77  netmask 255.255.255.0  broadcast 172.72.6.255
        ether 02:42:ac:48:06:4d  txqueuelen 0  (Ethernet)
        RX packets 72015  bytes 5904300 (5.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 117213  bytes 22467390 (21.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.72.6.70  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 02:42:ac:48:06:4d  txqueuelen 0  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1234  bytes 223309 (218.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1234  bytes 223309 (218.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

HAProxy 和 Keepalived 日志都记录到 /var/log/messages。

配置普罗米修斯监控

搭建过程参考:

  • 使用Docker分分钟搭建漂亮的prometheus grafana监控: https://www.xmmup.com/dbbao36shiyongdockerfenfenzhongdajianpiaoliangdeprometheusgrafanajiankong.html
  • 监控利器PMM的使用–监控MySQL、PG、MongoDB、ProxySQL等:https://www.xmmup.com/dbbao41jiankongliqipmmdeshiyong-jiankongmysqlpgmongodbproxysqldeng.html
  • https://www.modb.pro/db/154937

访问Grafana:http://172.72.6.79:3000/login 用户名和密码都为admin

export下载:https://prometheus.io/docs/instrumenting/exporters/

监控PG集群

https://github.com/prometheus-community/postgres_exporter

代码语言:javascript复制
tar -zxvf postgres_exporter-0.10.1.linux-amd64.tar.gz
mv postgres_exporter-0.10.1.linux-amd64/postgres_exporter /usr/local/bin/postgres_exporter


useradd pg71
useradd pg72
useradd pg73

su - pg71
export DATA_SOURCE_NAME="postgresql://postgres:lhr@172.72.6.71:5432/postgres?sslmode=disable"
nohup postgres_exporter --web.listen-address=":9187" >/dev/null 2>&1 &



su - pg72 
export DATA_SOURCE_NAME="postgresql://postgres:lhr@172.72.6.72:5432/postgres?sslmode=disable"
nohup postgres_exporter --web.listen-address=":9188" >/dev/null 2>&1 &




su - pg73
export DATA_SOURCE_NAME="postgresql://postgres:lhr@172.72.6.73:5432/postgres?sslmode=disable"
nohup postgres_exporter --web.listen-address=":9189" >/dev/null 2>&1 &



[root@lhrmonitor /]# netstat -tulnp | grep 918
tcp6       0      0 :::9187                 :::*                    LISTEN      10829/postgres_expo
tcp6       0      0 :::9188                 :::*                    LISTEN      10929/postgres_expo
tcp6       0      0 :::9189                 :::*                    LISTEN      11027/postgres_expo
[root@lhrmonitor /]#

测试:

代码语言:javascript复制
curl http://localhost:9187/metrics
curl http://localhost:9188/metrics
curl http://localhost:9189/metrics

Grafana导入监控模板:9628

监控HAProxy

https://github.com/prometheus/haproxy_exporter

代码语言:javascript复制
tar -zxvf haproxy_exporter-0.13.0.linux-amd64.tar.gz
mv haproxy_exporter-0.13.0.linux-amd64/haproxy_exporter /usr/local/bin/haproxy_exporter


nohup haproxy_exporter --web.listen-address=":9101" --haproxy.scrape-uri="http://admin:admin@172.72.6.77:1080/haproxy?stats;csv"  > /dev/null 2>&1 &
nohup haproxy_exporter --web.listen-address=":9102" --haproxy.scrape-uri="http://admin:admin@172.72.6.78:1080/haproxy?stats;csv"  > /dev/null 2>&1 &


[root@lhrmonitor /]# netstat -tulnp | grep 91
tcp        0      0 127.0.0.1:5910          0.0.0.0:*               LISTEN      1346/Xvnc
tcp6       0      0 :::9187                 :::*                    LISTEN      10829/postgres_expo
tcp6       0      0 :::9188                 :::*                    LISTEN      10929/postgres_expo
tcp6       0      0 :::9189                 :::*                    LISTEN      11027/postgres_expo
tcp6       0      0 :::9100                 :::*                    LISTEN      211/node_exporter
tcp6       0      0 :::9101                 :::*                    LISTEN      11462/haproxy_expor
tcp6       0      0 :::9102                 :::*                    LISTEN      11497/haproxy_expor
[root@lhrmonitor /]#

测试:

代码语言:javascript复制
curl http://localhost:9101/metrics
curl http://localhost:9102/metrics

Grafana导入监控模板:12865

监控etcd集群

prometheus 自带了对etcd的监控,不用安装etcd的exporter。

Grafana导入监控模板:3070、10323

修改prometheus服务端配置文件prometheus.yml

prometheus服务端配置文件prometheus.yml添加客户端,严格注意yml文件配置的格式。

我这里的文件是:/usr/local/prometheus/prometheus.yml

代码语言:javascript复制
# vi /usr/local/prometheus/prometheus.yml 

# 加到文件最后,注意每行前面的空格不要删除
  - job_name: "pg_monitor"
    static_configs:
      - targets:
        - "127.0.0.1:9187"
        - "127.0.0.1:9188"
        - "127.0.0.1:9189"
  - job_name: "haproxy_monitor"
    static_configs:
      - targets:
        - "127.0.0.1:9101"
        - "127.0.0.1:9102"
  - job_name: "etcd_monitor"
    static_configs:
      - targets:
        - "172.72.6.74:2379"
        - "172.72.6.75:2379"
        - "172.72.6.76:2379"


# 重新载入配置文件或重启prometheus
curl -X POST http://127.0.0.1:9090/-/reload

http://172.72.6.79:9090/targets

监控到的数据

PG:

HAProxy:

ETCD:

至此,PostgreSQL高可用之Patroni etcd HAProxy Keepalived Prometheus grafana 部署完成。

高可用测试

1. 高可用测试 Keepalived

注意:可以将2个节点的state都配置为backup,将priority都配置为100来避免自动来回切换,而影响客户体验。

测试类型

测试方式

测试命令

测试结果

进程故障

1. 主端killall进程

killall keepalived

VIP从主端自动转移到备端,5000端口和5001端口连接正常

-

-

systemctl start keepalived

VIP从备端自动转移到主端,5000端口和5001端口连接正常。

-

2. 备端killall进程

systemctl stop keepalived

VIP在主端正常运行,5000端口和5001端口连接正常

-

-

systemctl start keepalived

VIP在主端正常运行,5000端口和5001端口连接正常

-

3. 主端同时kill所有进程

-

主端VIP未卸掉,备端也启动VIP,此时主备端均存在VIP(异常现象),5000端口和5001端口连接正常

-

-

systemctl start keepalived

VIP转移到主端正常运行,备端无VIP,5000端口和5001端口连接正常

-

4. 主端只kill主进程

kill -9

VIP从主端自动转移到备端,VIP只在备端启动,5000端口和5001端口连接正常

-

-

systemctl start keepalived

VIP转移到主端正常运行,备端无VIP,5000端口和5001端口连接正常

-

5. 主端只kill子进程

-

VIP从主端自动转移到备端,等待主进程自动生成子进程后,VIP从备端自动转移到主端,5000端口和5001端口连接正常

-

6. 备端kill 进程

-

IP在主端正常运行,5000端口和5001端口连接正常

网卡故障

1. 主端down网卡

ifdown eth0

VIP从主端自动转移到备端,PostgreSQL发生故障转移到其中一个备库,5000端口和5001端口连接正常,patroni和etcd均不显示故障节点

-

2. 主端up网卡

ifup eth0

VIP从备端自动转移到主端,故障节点以备库角色添加到集群,patroni和etcd节点状态显示正常,5000端口和5001端口连接正常

-

3. 备端down网卡

ifdown eth0

VIP在主端正常运行,5000端口和5001端口连接正常,patroni和etcd均不显示故障节点,故障节点上的各个进程还在运行

-

4. 备端up网卡

ifup eth0

patroni和etcd节点状态显示正常

2. 高可用测试 HAProxy

测试类型

测试方式

测试命令

测试结果

进程故障

1. 主端killall进程

killall haproxy

keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常

-

-

systemctl start haproxy

keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常

-

2. 备端killall进程

killall haproxy

VIP在主端正常运行,5000端口和5001端口连接正常

-

-

systemctl start haproxy

VIP在主端正常运行,5000端口和5001端口连接正常

-

3. 主端同时kill所有进程

-

keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常

-

-

systemctl start haproxy

keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常

-

4. 主端只kill主进程

-

keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常

-

-

systemctl start haproxy

keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常

-

5. 主端只kill子进程

-

haproxy 的所有进程都死了,keepalived 未检测 haproxy 进程,自动将VIP从主端转移到备端,5000端口和5001端口连接正常

-

-

systemctl start haproxy

keepalived 检测到 haproxy 进程,自动将VIP从备端转移到主端,5000端口和5001端口连接正常

3. 高可用测试 Patroni

以下是在Patroni开启了auto failover的情况下进行测试

代码语言:javascript复制
[root@pgtest3 ~]# patronictl resume

测试类型

测试方式

测试命令

测试结果

进程故障

1. 主端killall进程

killall patroni

1. 触发故障切换到备库其中一个节点,备库另一个节点同步新主库,切换时间在30秒内 2. 原主库(pgtest1)的 PostgreSQL 被关闭 3. etcd haproxy keepalived 在原主库正常运行,VIP 运行在原主库 4. VIP 5000端口连接切换后的新主库,VIP 5001端口连接另一个备库

-

-

systemctl start patroni

原主库(pgtest1)变成新主库(pgtest2)的备库

-

2. 主库kill patroni 进程

kill -9

1. 触发故障切换到备库其中一个节点,备库另一个节点同步新主库,切换时间在30秒内 2. 原主库(pgtest1)的 PostgreSQL 还在运行,并且是读写模式 3. etcd haproxy keepalived 在原主库正常运行,VIP 运行在原主库 4. VIP 5000端口连接切换后的新主库,VIP 5001端口连接另一个备库

-

-

systemctl start patroni

原主库(pgtest1)被 pg_rewind 成新主库(pgtest2)的备库

-

3. 一个备库kill patroni 进程

-

1. 使用killall,将会同时关闭备库,使用kill,此备库的 PostgreSQL 还在以只读模式运行,且与主库正常同步数据 2. VIP 5000端口正常连接主库,VIP 5001端口不能连接此备库,可以连接另一个备库 3. 主库与另一个备库不受影响 4. 此备库上的 etcd haproxy keepalived 正常运行

-

-

systemctl start patroni

自动恢复正常状态,与主库保持同步

-

4. 两个备库kill patroni 进程

-

1. 使用killall,将会同时关闭备库,使用kill,两个备库的 PostgreSQL 还在以只读模式运行,且与主库正常同步数据 2. VIP 5000端口只连接主库,VIP 5001端口连接失败 3. 主库不受影响 4. 备库上的 etcd haproxy keepalived 正常运行

-

-

systemctl start patroni

自动恢复正常状态,与主库保持同步

4. 高可用测试 etcd

测试类型

测试方式

测试命令

测试结果

进程故障

1. 主库kill etcd 进程

-

不影响主库和备库, patroni 会连接其它节点上的etcd,VIP 5000/5001端口连接正常

-

2. 一个备库停止 etcd 进程

-

不影响主库和备库, patroni 会连接其它节点上的etcd,VIP 5000/5001端口连接正常

-

3. 两个备库停止 etcd 进程

-

此时超过了etcd的最大允许故障节点数,主备库3个节点均以只读模式运行,VIP 5000端口连接失败,VIP 5001端口轮询连接主备库3个节点

-

-

先启动第一个备库的 etcd 进程

主库从只读模式切换成读写模式,主从数据同步恢复正常,VIP 5000/5001端口连接正常

-

-

再启动第二个备库的 etcd 进程

自动恢复正常状态,与主库保持同步

5. 高可用测试 PostgreSQL

测试类型

测试方式

测试命令

测试结果

-

停主库PostgreSQL实例

-

主库被Patroni自动拉起,VIP 5000/5001端口连接正常

-

停备库PostgreSQL实例

-

备库被Patroni自动拉起,VIP 5000/5001端口连接正常

6. 高可用测试 操作系统

测试类型

测试方式

测试命令

测试结果

-

停PostgreSQL主库主机(同时是haproxy keepalived 的主机)

reboot

1. 触发故障切换到备库其中一个节点,备库另一个节点同步新主库,切换时间在30秒内 2. VIP漂移到备库 3. VIP 5000端口连接切换后的新主库,VIP 5001端口连接另一个备库

-

-

启动

原主库(pgtest1)变成新主库(pgtest2)的备库,VIP从keepalived的备端自动转移到主端,5000端口和5001端口连接正常

-

停备库的主机就不测试了

-

-

7. 读写分离 负载均衡测试

通过 5000 端口提供读写服务,通过 5001 端口提供只读服务。

代码语言:javascript复制
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  7 |           |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  7 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  7 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 


psql -U postgres -h 172.72.6.77 -p5000
psql -U postgres -h 172.72.6.78 -p5000


psql -U postgres -h 172.72.6.77 -p5001
psql -U postgres -h 172.72.6.78 -p5001


for i in $(seq 1 10); do psql 'host=172.72.6.70 port=5000 user=postgres dbname=postgres password=lhr' -c 'select inet_server_addr(),pg_is_in_recovery()'; done | egrep '172'



[root@db-rac ~]# for i in $(seq 1 10); do psql 'host=172.72.6.70 port=5000 user=postgres dbname=postgres password=lhr' -c 'select inet_server_addr(),pg_is_in_recovery()'; done | egrep '172'
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
 172.72.6.71      | f
[root@db-rac ~]# for i in $(seq 1 10); do psql 'host=172.72.6.70 port=5001 user=postgres dbname=postgres password=lhr' -c 'select inet_server_addr(),pg_is_in_recovery()'; done | egrep '172'
 172.72.6.72      | t
 172.72.6.73      | t
 172.72.6.72      | t
 172.72.6.72      | t
 172.72.6.73      | t
 172.72.6.73      | t
 172.72.6.72      | t
 172.72.6.72      | t
 172.72.6.73      | t
 172.72.6.72      | t

可以看出,5000是读写,而5001是只读 负载均衡。

8. switchover和failover切换

https://www.modb.pro/db/152388

https://www.modb.pro/db/155433

  • failover endpoints 允许在没有健康节点时执行手动 failover ,但同时它不允许 switchover 。
  • switchover endpoint 则相反。它仅在集群健康(有leader)时才起作用,并允许在指定时间安排切换。
switchover操作
代码语言:javascript复制
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  7 |           |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  7 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  7 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 
[root@lhrpgpatroni71 /]# patronictl switchover
Master [lhrpgpatroni71]:
Candidate ['lhrpgpatroni72', 'lhrpgpatroni73'] []: lhrpgpatroni72
When should the switchover take place (e.g. 2022-01-22T20:19 )  [now]:
Current cluster topology
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  7 |           |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  7 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  7 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 
Are you sure you want to switchover cluster pg_cluster, demoting current master lhrpgpatroni71? [y/N]: y
2022-01-22 19:19:49.35893 Successfully switched over to "lhrpgpatroni72"
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | running |  8 |         0 |
| lhrpgpatroni72 | 172.72.6.72 | Leader  | running |  8 |           |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |    |   unknown |
 ---------------- ------------- --------- --------- ---- ----------- 
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | running |  8 |         0 |
| lhrpgpatroni72 | 172.72.6.72 | Leader  | running |  8 |           |
| lhrpgpatroni73 | 172.72.6.73 | Replica | stopped |    |   unknown |
 ---------------- ------------- --------- --------- ---- ----------- 
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | running |  8 |         0 |
| lhrpgpatroni72 | 172.72.6.72 | Leader  | running |  8 |           |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  8 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 
failover操作
代码语言:javascript复制
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | running |  8 |         0 |
| lhrpgpatroni72 | 172.72.6.72 | Leader  | running |  8 |           |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  8 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 
 ---------------- ------------- --------- --------- ---- ----------- 
[root@lhrpgpatroni71 /]#
[root@lhrpgpatroni71 /]# patronictl failover
Candidate ['lhrpgpatroni71', 'lhrpgpatroni73'] []: lhrpgpatroni73
Current cluster topology
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | running |  8 |         0 |
| lhrpgpatroni72 | 172.72.6.72 | Leader  | running |  8 |           |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  8 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 
Are you sure you want to failover cluster pg_cluster, demoting current master lhrpgpatroni72? [y/N]: y
2022-01-22 19:24:19.78911 Successfully failed over to "lhrpgpatroni73"
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | running |  8 |         0 |
| lhrpgpatroni72 | 172.72.6.72 | Replica | stopped |    |   unknown |
| lhrpgpatroni73 | 172.72.6.73 | Leader  | running |  8 |           |
 ---------------- ------------- --------- --------- ---- ----------- 
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | stopped |    |   unknown |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  9 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Leader  | running |  9 |           |
 ---------------- ------------- --------- --------- ---- ----------- 
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Replica | running |  9 |         0 |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  9 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Leader  | running |  9 |           |
 ---------------- ------------- --------- --------- ---- ----------- 

POST /switchover 和 POST /failover 分别对应 Patientictl switchover 和 Patonictl failover 命令。DELETE /switchover 对应 patronictl flush switchover 命令。

基本维护

查询 patroni 集群状态

代码语言:javascript复制
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  1 |           |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  1 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  1 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 

http://172.72.6.71:8008/

http://172.72.6.71:8008/patroni

http://172.72.6.71:8008/cluster

image-20220122103714924

http://172.72.6.71:8008/config

使用 etcdctl 查看存储在etcd中的 patroni 动态配置参数

代码语言:javascript复制
[root@lhretcd74 /]# etcdctl get /service/pg_cluster/config | jq
{
  "ttl": 30,
  "loop_wait": 10,
  "retry_timeout": 10,
  "maximum_lag_on_failover": 1048576,
  "max_timelines_history": 0,
  "master_start_timeout": 300,
  "master_stop_timeout": 0,
  "synchronous_mode": false,
  "postgresql": {
    "use_pg_rewind": true,
    "use_slots": true,
    "parameters": {
      "max_connections": 3000,
      "superuser_reserved_connections": 100,
      "max_locks_per_transaction": 64,
      "max_worker_processes": 2,
      "max_prepared_transactions": 0,
      "wal_level": "logical",
      "wal_log_hints": true,
      "track_commit_timestamp": false,
      "max_wal_senders": 10,
      "max_replication_slots": 10,
      "listen_addresses": "*",
      "port": 5432,
      "cluster_name": "pg_cluster",
      "archive_mode": true,
      "archive_command": "test ! -f /postgresql/archive/%f && cp %p /postgresql/archive/%f"
    }
  }
}

[root@lhrpgpatroni71 /]# curl -s http://172.72.6.71:8008/config | jq
{
  "ttl": 30,
  "loop_wait": 10,
  "retry_timeout": 10,
  "maximum_lag_on_failover": 1048576,
  "max_timelines_history": 0,
  "master_start_timeout": 300,
  "master_stop_timeout": 0,
  "synchronous_mode": false,
  "postgresql": {
    "use_pg_rewind": true,
    "use_slots": true,
    "parameters": {
      "max_connections": 3000,
      "superuser_reserved_connections": 100,
      "max_locks_per_transaction": 64,
      "max_worker_processes": 2,
      "max_prepared_transactions": 0,
      "wal_level": "logical",
      "wal_log_hints": true,
      "track_commit_timestamp": false,
      "max_wal_senders": 10,
      "max_replication_slots": 10,
      "listen_addresses": "*",
      "port": 5432,
      "cluster_name": "pg_cluster",
      "archive_mode": true,
      "archive_command": "test ! -f /postgresql/archive/%f && cp %p /postgresql/archive/%f"
    }
  }
}

patronictl 查看 patroni 动态配置参数

代码语言:javascript复制
[root@lhrpgpatroni71 ~]# patronictl show-config
loop_wait: 10
master_start_timeout: 300
master_stop_timeout: 0
max_timelines_history: 0
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_command: test ! -f /postgresql/archive/%f && cp %p /postgresql/archive/%f
    archive_mode: true
    cluster_name: pg_cluster
    listen_addresses: '*'
    max_connections: 3000
    max_locks_per_transaction: 64
    max_prepared_transactions: 0
    max_replication_slots: 10
    max_wal_senders: 10
    max_worker_processes: 2
    port: 5432
    superuser_reserved_connections: 100
    track_commit_timestamp: false
    wal_level: logical
    wal_log_hints: true
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: false
ttl: 30

通过 patronictl 调整配置参数,在其中一个节点调整,其他节点也会自动调整,并且 patroni 自动进行 reload 操作

代码语言:javascript复制
[postgres@lhrpgpatroni71 ~]$ patronictl edit-config
# 编辑文本
Apply these changes? [y/N]: y
Configuration changed

对于需要重启数据库生效的参数,为了减少对生产的影响,可以逐个节点重启,也可以在停机窗口通过 patronictl restart 对整个集群进行重启:

代码语言:javascript复制
# 仅重启当前节点(--any)
[root@lhrpgpatroni71 ~]# patronictl restart pg_cluster --any
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  1 |           |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  1 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  1 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 
When should the restart take place (e.g. 2022-01-22T11:14)  [now]: 
Are you sure you want to restart members lhrpgpatroni71, lhrpgpatroni72, lhrpgpatroni73? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []: 
Success: restart on member lhrpgpatroni73

[root@lhrpgpatroni71 ~]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- ----------------- 
| Member         | Host        | Role    | State   | TL | Lag in MB | Pending restart |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- ----------------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  1 |           |                 |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  1 |         0 |                 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  1 |         0 | *               |
 ---------------- ------------- --------- --------- ---- ----------- ----------------- 

# 如果节点是 pending 状态的,才会执行重启操作
[root@lhrpgpatroni71 /]# patronictl restart pg_cluster --any --pending
 ---------------- ------------- --------- --------- ---- ----------- ----------------- 
| Member         | Host        | Role    | State   | TL | Lag in MB | Pending restart |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- ----------------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  1 |           |                 |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  1 |         0 |                 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  1 |         0 | *               |
 ---------------- ------------- --------- --------- ---- ----------- ----------------- 
When should the restart take place (e.g. 2022-01-22T11:23)  [now]: 
Are you sure you want to restart members lhrpgpatroni71, lhrpgpatroni72, lhrpgpatroni73? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []: 
Success: restart on member lhrpgpatroni73


# 重启所有成员
[root@lhrpgpatroni71 /]# patronictl restart pg_cluster
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  1 |           |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  1 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  1 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 
When should the restart take place (e.g. 2022-01-22T11:24)  [now]: 
Are you sure you want to restart members lhrpgpatroni73, lhrpgpatroni71, lhrpgpatroni72? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []: 
Success: restart on member lhrpgpatroni73
Success: restart on member lhrpgpatroni71
Success: restart on member lhrpgpatroni72
[root@lhrpgpatroni71 /]# patronictl list
 ---------------- ------------- --------- --------- ---- ----------- 
| Member         | Host        | Role    | State   | TL | Lag in MB |
  Cluster: pg_cluster (6981731393302868828) ------- ---- ----------- 
| lhrpgpatroni71 | 172.72.6.71 | Leader  | running |  1 |           |
| lhrpgpatroni72 | 172.72.6.72 | Replica | running |  1 |         0 |
| lhrpgpatroni73 | 172.72.6.73 | Replica | running |  1 |         0 |
 ---------------- ------------- --------- --------- ---- ----------- 

重新初始化 Reinitialize endpoint

只允许在从节点上执行,一旦调用,它将删除数据目录并启动 pg_basebackup 重新初始化指定节点上的 PostgreSQL 数据目录。 The call might fail if Patroni is in a loop trying to recover (restart) a failed Postgres. In order to overcome this problem one can specify {“force”:true} in the request body.

代码语言:javascript复制
curl -s http://172.72.6.73:8008/reinitialize -XPOST
patronictl reinit pg_cluster

重新加载 Reload endpoint

POST /reload 让 Patroni 重新读取和应用配置文件。这相当于向 Patroni 进程发送 SIGHUP 信号。如果您更改了一些需要重新启动 Postgres 的参数(如 shared_buffers),您仍然必须通过调用 POST /restart 或使用 patriotictl restart 明确地重新启动Postgres。

代码语言:javascript复制
curl -s http://172.72.6.73:8008/reload -XPOST
patronictl reload pg_cluster

使用 patronictl 执行数据库查询操作

代码语言:javascript复制
[root@lhrpgpatroni71 /]# cat a.sql
select name from pg_settings limit 2;
[root@lhrpgpatroni71 /]# patronictl query -f a.sql --username=postgres --password
Password:
name
allow_system_table_mods
application_name

参考

https://www.modb.pro/topic/152353

https://www.cnblogs.com/VicLiu/p/14333778.html

Patroni 官方文档:https://patroni.readthedocs.io/en/latest/index.html

将独立设备转换为 Patroni 集群:https://patroni.readthedocs.io/en/latest/existing_data.html

PostgreSQL 版本重大升级:https://patroni.readthedocs.io/en/latest/existing_data.html

0 人点赞