Configuration: PostgreSQL 9.6 in a 3-node cluster. db1 is the master, db2 and db3 are replicas. WAL files are archived to AWS S3 using a custom pgrsync tool. The cluster is managed by Patroni, and archive_command and restore_command are properly configured on all nodes.
To reproduce: on db1, generate heavy writes (e.g. vacuum a large table), then stop db1 (sudo systemctl stop patroni). db3 becomes the new leader. db2 requests the WAL files it is missing, fetches them from AWS S3 via the restore_command, and becomes a replica of db3.
Now start db1 again (sudo systemctl start patroni). But db1 (the old leader, now supposed to become a replica) never comes up as a replica and repeatedly logs:
could not receive data from WAL stream: ERROR: requested WAL segment 0000002400053C55000000AE has already been removed
This error is actually raised by db3 (the new leader); db1 just logs it.
So, let’s look at the timelines. Initially db1 was on timeline 35 (0x23) and wrote the following files to the archive:
0000002300053C55000000AE (TL is 23 and archive at 53C55/AE)
0000002300053C5600000043 (TL is 23 and archive at 53C56/43)
db1 is stopped at this point.
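As an aside, the timeline/LSN mapping noted next to the file names above can be computed directly from the 24-hex-digit name. This is my own illustrative helper (not part of pgrsync), assuming the default 16 MB segment size of 9.6:

```python
# Decode a PostgreSQL 9.6 WAL segment file name: 24 hex digits,
# split as 8 (timeline ID) + 8 (high half of the LSN) + 8 (segment
# number). Illustrative only; assumes the default 16 MB segments.
WAL_SEG_SIZE = 16 * 1024 * 1024

def decode_wal_name(name):
    tli = int(name[0:8], 16)    # timeline ID (0x23 = 35)
    log = int(name[8:16], 16)   # high 32 bits of the LSN
    seg = int(name[16:24], 16)  # segment number within that log
    start_lsn = "%X/%X" % (log, seg * WAL_SEG_SIZE)
    return tli, start_lsn

print(decode_wal_name("0000002300053C55000000AE"))  # (35, '53C55/AE000000')
```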
db3’s logs show this:
received promote request
redo done at 53C56/A3FFC8C0
selected new timeline ID: 36
database system is ready to accept connections
and db3 started copying its own WAL files to the archive.
As db3 became leader, db2 started the process of becoming a replica of db3 (which it completed successfully); here is a summary of db2's logs:
restored log file "00000024.history" from archive
restored log file "0000002300053C55000000AE" from archive
restored log file "00000023.history" from archive
redo starts at 53C55/AE0001A8
restored log file "0000002300053C55000000AF" from archive
restored log file "0000002300053C5600000042" from archive
Retrieving 0000002300053C5600000043 to pg_xlog/RECOVERYXLOG FAILED (logged by pgrsync)
Retrieving 0000002300053C56000000A2 to pg_xlog/RECOVERYXLOG FAILED
restored log file "0000002400053C56000000A3" from archive
consistent recovery state reached at 53C56/A3FFE900
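For context, a timeline history file such as the 00000024.history restored above is a small plain-text file: one line per timeline switch, giving the parent timeline ID, the switch-point LSN, and a reason. Assuming the switch happened at the redo-done point shown in db3's log, its content would look roughly like:

```
35	53C56/A3FFC8C0	no recovery target specified
```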
Now db1 is started; here are its logs:
LOG: database system was interrupted while in recovery at log time 2021-01-28 04:08:01 UTC
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target
LOG: invalid record length at 53C55/AE0001E0: wanted 24, got 0
LOG: started streaming WAL from primary at 53C55/AE000000 on timeline 36
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000002400053C55000000AE has already been removed
... and repeats this over and over
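This explains the file name in the FATAL message: db1 needs WAL starting at 53C55/AE000000, but it now follows timeline 36 (0x24), so the segment it asks the new leader for carries the new timeline ID. A sketch of that name construction (my own illustration, assuming the default 16 MB segments):

```python
# Build the WAL segment file name for a given timeline and LSN,
# the way PostgreSQL 9.6 does with 16 MB segments. Illustrative.
WAL_SEG_SIZE = 16 * 1024 * 1024

def wal_segment_name(tli, lsn):
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    return "%08X%08X%08X" % (tli, hi, lo // WAL_SEG_SIZE)

# db1 resumes at 53C55/AE000000, but on the new timeline 0x24 (36),
# yielding a segment name that no node ever archived:
print(wal_segment_name(0x24, "53C55/AE000000"))  # 0000002400053C55000000AE
```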
Points to note:
- 0000002400053C55000000AE was never written to the archive by any Postgres node. The old leader (db1) copied 0000002300053C55000000AE (note: 0023, not 0024) before it was stopped.
- The new leader (db3) copied 0000002200053C55000000AE (note: 0022, not 0024).
- max_wal_size is set to 1024 (i.e. 1 GB) on all nodes.
- After db3 became the new leader, there was hardly any activity on the cluster; db3 archives a WAL file only every 10 minutes (archive_timeout = 600s).
- Is there anything wrong in the configuration that makes the old leader ask for a WAL segment which the new leader does not have?
- How can the old leader (db1) be brought back as a replica from this state, without having to erase it and start over?