After a production PG database was stopped, starting it again took a long time. These notes record the mechanism behind that.
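- If no backup_label file exists (no backup is in progress), recovery proceeds normally: the server first reads pg_control, then reads the checkpoint record, and then performs REDO by scanning forward from the log location indicated in the checkpoint record. Because the entire content of a data page is saved in the log on the first modification of that page after a checkpoint (assuming full_page_writes is not disabled), all pages changed since the checkpoint are restored to a consistent state.
- If the database crashed while a backup was in progress, the data directory contains a backup_label file. Recovery then starts from the checkpoint location recorded in backup_label and replays the WAL written during the backup. The file is described in this comment from src/backend/access/transam/xlog.c: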
/*
* read_backup_label: check to see if a backup_label file is present
*
* If we see a backup_label during recovery, we assume that we are recovering
* from a backup dump file, and we therefore roll forward from the checkpoint
* identified by the label file, NOT what pg_control says. This avoids the
* problem that pg_control might have been archived one or more checkpoints
* later than the start of the dump, and so if we rely on it as the start
* point, we will fail to restore a consistent database state.
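The rule in that comment can be sketched in a few lines of Python. This is only an illustration of the decision, not how the server implements it; the function name `recovery_start_source` is made up for this note.

```python
from pathlib import Path


def recovery_start_source(data_dir: str) -> str:
    """Return the file that decides where REDO starts.

    Hypothetical helper: it only mirrors the rule described in the
    xlog.c comment (backup_label wins over pg_control when present).
    """
    if (Path(data_dir) / "backup_label").is_file():
        return "backup_label"
    return "pg_control"
```

Running it against a stopped cluster's data directory before starting the server shows which of the two recovery paths the server will take.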
The contents of a backup_label file look like this:
START WAL LOCATION: 472D/82000028 (file 000000060000472D00000082)
CHECKPOINT LOCATION: 472D/82150EB8
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2020-05-23 07:23:18 HKT
LABEL: 2020-05-23 07:23:17 with pg_rman
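To read these fields from a script, a minimal parser could look like the sketch below. It assumes the "KEY: value" layout shown above; `parse_backup_label` and the extra keys `start_lsn` and `start_wal_file` are names invented for this example.

```python
import re

SAMPLE = """\
START WAL LOCATION: 472D/82000028 (file 000000060000472D00000082)
CHECKPOINT LOCATION: 472D/82150EB8
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2020-05-23 07:23:18 HKT
LABEL: 2020-05-23 07:23:17 with pg_rman
"""


def parse_backup_label(text: str) -> dict:
    """Split each 'KEY: value' line into a dict entry (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so timestamps in the value survive.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    # Pull the LSN and the WAL file name out of the first line.
    match = re.match(
        r"([0-9A-F]+/[0-9A-F]+) \(file ([0-9A-F]{24})\)",
        fields.get("START WAL LOCATION", ""),
    )
    if match:
        fields["start_lsn"], fields["start_wal_file"] = match.groups()
    return fields


label = parse_backup_label(SAMPLE)
print(label["start_wal_file"], label["CHECKPOINT LOCATION"])
```

The `start_wal_file` value is the first WAL segment that recovery needs to find when it starts from this label.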
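In this case, if pg_xlog (or pg_wal) no longer holds the WAL segments written since pg_start_backup(), startup fails with errors like the ones below. Check whether you are actually restoring from a backup; if you are not, remove the backup_label file.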
LOG: could not open file "pg_xlog/000000020000000000000084" (log file 0, segment 132): No such file or directory
LOG: invalid checkpoint record
PANIC: could not locate required checkpoint record
HINT: If you are not restoring from a backup, try removing the file "/POSTGRES/data/PG820/backup_label".
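Before starting the server, you can check whether the segment that recovery will ask for is actually on disk. The sketch below derives the WAL file name from a timeline and an LSN, assuming the default 16 MB segment size, and looks for it under pg_wal or pg_xlog. The helper names are made up for this note.

```python
from pathlib import Path
from typing import Optional

# Assumption: the cluster uses the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_file_name(timeline: int, lsn: str) -> str:
    """Map an LSN such as '472D/82000028' to its 24-character WAL file name."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    segno = ((hi << 32) | lo) // WAL_SEGMENT_SIZE
    segs_per_id = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16 MB segments
    return f"{timeline:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


def find_wal_segment(data_dir: str, name: str) -> Optional[Path]:
    """Return the path of the segment if present in pg_wal or pg_xlog, else None."""
    for sub in ("pg_wal", "pg_xlog"):
        candidate = Path(data_dir) / sub / name
        if candidate.is_file():
            return candidate
    return None


# Timeline 6 and the START WAL LOCATION from the backup_label shown earlier.
print(wal_file_name(6, "472D/82000028"))  # 000000060000472D00000082
```

If `find_wal_segment` returns None for the start segment, the server will hit the "could not open file" error shown above unless the segment can be restored from the archive.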
In production, when the backup spans many archived WAL files, startup takes a long time because recovery has to replay them.
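For a rough idea of how much WAL lies between the backup start and a given end point, you can count the segments between two LSNs. This is a back-of-the-envelope sketch assuming 16 MB segments; the end LSN `4730/10000000` is a made-up value for illustration, and the function names are hypothetical.

```python
# Assumption: the cluster uses the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def lsn_to_int(lsn: str) -> int:
    """Convert an 'X/Y' LSN string into a 64-bit byte position."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    return (hi << 32) | lo


def segments_to_replay(start_lsn: str, end_lsn: str) -> int:
    """Count WAL segments touched between two LSNs, both ends included."""
    start_seg = lsn_to_int(start_lsn) // WAL_SEGMENT_SIZE
    end_seg = lsn_to_int(end_lsn) // WAL_SEGMENT_SIZE
    return end_seg - start_seg + 1


# Start LSN from the backup_label above; the end LSN is hypothetical.
count = segments_to_replay("472D/82000028", "4730/10000000")
print(count, "segments, about", count * 16, "MB of WAL")
```

The count is only an upper bound on the work: replay time also depends on how many records in those segments actually need to be applied.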
Reference: https://www.postgresql.org/message-id/D960CB61B694CF459DCFB4B0128514C293CEB7@exadv11.host.magwien.gv.at