Stale DB instance being promoted leads to data loss #225

@gavinThinking

Description


Hi,
I found a bug in recent testing and have provided a suggested fix below. Could you help me assess whether this solution is viable? Thank you.

Test environment

A 3-node cluster with synchronous DB replication enabled.

Full List of Resources:
Clone Set: pgsql-ha(test-db)  (promotable):
    * test-db    (ocf::***:pgsqlms):      Slave server_01
    * test-db    (ocf::***:pgsqlms):      Master server_02
    * test-db    (ocf::***:pgsqlms):      Slave server_03
Node Attributes:
  * Node: server_03:
    * master-score : 1000
  * Node: server_01:
    * master-score : 990
  * Node: server_02:
    * master-score : 1001

 synchronous_standby_names
------------------------------
 ANY 1 (server_01,server_03)

Test step

  1. Kill the server_01 database instance. The cluster can still process read and write requests without any issues.

  2. Send a write request, such as creating a new user.
    The user creation was successful, and we can retrieve the record from the database.

  3. Kill the remaining two database instances: Master server_02 and Slave server_03.

  4. When the three database instances recover, I noticed during the master election that all three report the same "lsn_location" value, which is the starting point of the last segment in the local pg_wal directory (a quick way to observe this on each node is sketched after these steps).

    $node_lsn = _get_last_received_lsn( 'in decimal' );

  5. In our logic, we added that when all three "lsn_location" values are the same, we choose the database instance with the best performance as the master. So if server_01 has the best performance, it is elected as the new master. At that point, we noticed that the user data we created earlier was lost.
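
A minimal sketch of how the symptom in step 4 can be observed from SQL on each node (this is not part of pgsqlms; connection settings and error handling are placeholders):

use strict;
use warnings;
use DBI;

# Placeholder connection settings; run the same query on each node.
my $dbh = DBI->connect( 'dbi:Pg:dbname=postgres;host=localhost;port=5432',
    'postgres', '', { RaiseError => 1, PrintError => 0 } );

my ( $receive_lsn, $replay_lsn ) = $dbh->selectrow_array(
    'SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()' );

# Per the analysis below, a freshly restarted standby reports the start of
# its last local segment as the receive LSN, while the replay LSN reflects
# how far the local WAL has actually been applied.
printf "receive: %s  replay: %s\n",
    $receive_lsn // 'NULL', $replay_lsn // 'NULL';

$dbh->disconnect;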

Test logs

Master server_02
Instance "test-db" controldata indicates a running primary instance, the instance has probably crashed

Oct 08 09:14:53.390237 server_02 pgsqlms(test-db)[56347]: INFO: pgsql_monitor: instance "test-db" is not listening
Oct 08 09:14:53.439523 server_02 pgsqlms(test-db)[56368]: INFO: _confirm_stopped: no postmaster process found for instance "test-db"
Oct 08 09:14:53.498487 server_02 pgsqlms(test-db)[56387]: INFO: _controldata: instance "test-db" state is "in production"
Oct 08 09:14:53.509602 server_02 pgsqlms(test-db)[56392]: ERROR: Instance "test-db" controldata indicates a running primary instance, the instance has probably crashed

Setting lsn_location
All three instances have the same "lsn_location" value, which is the starting point of the last segment in the local pg_wal directory:
hexadecimal 104000000, decimal 4362076160.
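
For reference, a quick standalone check (assuming the default 16 MB wal_segment_size) confirms that this decimal value is the hexadecimal LSN above and lies exactly on a segment boundary:

use strict;
use warnings;

my $lsn_dec = 4362076160;
printf "hex: %X\n", $lsn_dec;                 # 104000000, i.e. LSN 1/04000000
printf "segment aligned: %s\n",
    $lsn_dec % 0x1000000 ? 'no' : 'yes';      # yes: an exact multiple of 16 MB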

Oct 08 09:16:22.002567 server_02 pgsqlms(test-db)[64293]: INFO: Promoting instance on node "server_02"
Oct 08 09:16:22.123181 server_02 pgsqlms(test-db)[64314]: INFO: Current node TL#LSN: 6#4362076160
Oct 08 09:16:22.129258 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_02]: (unset) -> 6#4362076160
Oct 08 09:16:22.141162 server_02 pacemaker-attrd[21903]:  notice: Setting nodes-test-db[server_02]: (unset) -> server_01 server_02 server_03
Oct 08 09:16:22.158226 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_01]: (unset) -> 6#4362076160
Oct 08 09:16:22.426240 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_03]: (unset) -> 6#4362076160
Oct 08 09:16:22.568704 server_02 pgsqlms(test-db)[64354]: INFO: Action: "promote"
Oct 08 09:16:22.668577 server_02 pgsqlms(test-db)[64370]: WARNING: _confirm_role: secondary not streaming wal from primary
Oct 08 09:16:22.670723 server_02 pgsqlms(test-db)[64371]: INFO: pgsql_promote: "test-db" currently running as a standby
Oct 08 09:16:22.686690 server_02 pgsqlms(test-db)[64374]: INFO: pgsql_promote: checking if current node is the best candidate for promotion
Oct 08 09:16:22.704388 server_02 pgsqlms(test-db)[64377]: INFO: pgsql_promote: current node TL#LSN location: 6#4362076160
Oct 08 09:16:22.714073 server_02 pgsqlms(test-db)[64379]: INFO: pgsql_promote: current node score: 90
Oct 08 09:16:22.994833 server_02 pgsqlms(test-db)[64416]: INFO: pgsql_promote: comparing with "server_01": TL#LSN is 6#4362076160
Oct 08 09:16:23.037057 server_02 pgsqlms(test-db)[64426]: INFO: pgsql_promote: "server_01" has a matching TL#LSN, also checking node score
Oct 08 09:16:23.038658 server_02 pgsqlms(test-db)[64427]: INFO: pgsql_promote: comparing with "server_01": node score is 100
Oct 08 09:16:23.041612 server_02 pgsqlms(test-db)[64429]: INFO: pgsql_promote: "server_01" is a better candidate to promote (node score > server_02)
Oct 08 09:16:23.375242 server_02 pgsqlms(test-db)[64513]: INFO: pgsql_promote: comparing with "server_03": TL#LSN is 6#4362076160
Oct 08 09:16:23.384625 server_02 pgsqlms(test-db)[64515]: INFO: pgsql_promote: "server_03" has a matching TL#LSN, also checking node score
Oct 08 09:16:23.386169 server_02 pgsqlms(test-db)[64516]: INFO: pgsql_promote: comparing with "server_03": node score is 80
Oct 08 09:16:23.395775 server_02 pgsqlms(test-db)[64518]: ERROR: server_01 is the best candidate to promote, aborting current promotion
Oct 08 09:16:23.586478 server_02 pgsqlms(test-db)[64548]: INFO: pgsql_promote: move master role to server_01
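
The election above can be approximated as follows; this is only a sketch based on the log messages and the custom tie-break described in step 5, not the actual pgsqlms implementation. Because every node reports the same TL#LSN, the node score alone decides, so a node that missed the latest writes can win:

use strict;
use warnings;

# Candidate state as reported in the logs: identical timeline and LSN,
# differing node scores.
my %candidates = (
    server_01 => { tl => 6, lsn => 4362076160, score => 100 },
    server_02 => { tl => 6, lsn => 4362076160, score => 90  },
    server_03 => { tl => 6, lsn => 4362076160, score => 80  },
);

# Highest timeline wins, then highest LSN, then highest node score.
my ($best) = sort {
       $candidates{$b}{tl}    <=> $candidates{$a}{tl}
    or $candidates{$b}{lsn}   <=> $candidates{$a}{lsn}
    or $candidates{$b}{score} <=> $candidates{$a}{score}
} keys %candidates;

print "elected: $best\n";   # server_01, even though it missed the last writes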

Root cause analysis

The root cause of this issue is the incorrect use of pg_last_wal_receive_lsn() for the election, which leads to a wrong election result and subsequent data loss.
When a database instance starts as a standby, the value returned by pg_last_wal_receive_lsn() is initialized to the starting point of the last segment in the local pg_wal folder.
https://github.com/postgres/postgres/blob/e7689190b3d58404abbafe2d3312c3268a51cca3/src/backend/access/transam/xlogfuncs.c#L343
https://github.com/postgres/postgres/blob/e7689190b3d58404abbafe2d3312c3268a51cca3/src/backend/replication/walreceiverfuncs.c#L332
https://github.com/postgres/postgres/blob/e7689190b3d58404abbafe2d3312c3268a51cca3/src/backend/replication/walreceiverfuncs.c#L302
If this is the first startup of walreceiver (on this timeline), initialize flushedUpto and latestChunkStart to the starting point.

/*
 * Report the last WAL receive location (same format as pg_backup_start etc)
 *
 * This is useful for determining how much of WAL is guaranteed to be received
 * and synced to disk by walreceiver.
 */
Datum
pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	recptr;

	recptr = GetWalRcvFlushRecPtr(NULL, NULL);

	if (recptr == 0)
		PG_RETURN_NULL();

	PG_RETURN_LSN(recptr);
}

/*
 * Returns the last+1 byte position that walreceiver has flushed.
 *
 * Optionally, returns the previous chunk start, that is the first byte
 * written in the most recent walreceiver flush cycle.  Callers not
 * interested in that value may pass NULL for latestChunkStart. Same for
 * receiveTLI.
 */
XLogRecPtr
GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
	WalRcvData *walrcv = WalRcv;
	XLogRecPtr	recptr;

	SpinLockAcquire(&walrcv->mutex);
	recptr = walrcv->flushedUpto;   /* <-- the value returned by pg_last_wal_receive_lsn() */
	if (latestChunkStart)
		*latestChunkStart = walrcv->latestChunkStart;
	if (receiveTLI)
		*receiveTLI = walrcv->receivedTLI;
	SpinLockRelease(&walrcv->mutex);

	return recptr;
}

/*
 * If this is the first startup of walreceiver (on this timeline),
 * initialize flushedUpto and latestChunkStart to the starting point.
 */
if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
{
	walrcv->flushedUpto = recptr;
	walrcv->receivedTLI = tli;
	walrcv->latestChunkStart = recptr;
}

Question: In what scenarios does pg_last_wal_receive_lsn() return a value that is not what we need for the election?

Answer: When Postgres restarts as a standby, the WAL receiver process is started and pg_last_wal_receive_lsn() is initialized to the starting point of the last segment in the local WAL directory.

For example, if the last segment in the local pg_wal directory is "000000070000000000000058", then pg_last_wal_receive_lsn() will be initialized to 0x58000000 (LSN 0/58000000).
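
As an illustration, here is a standalone sketch of that initialization (assuming the default 16 MB wal_segment_size and the standard 24-character WAL file name layout: timeline, then the high and low halves of the segment number):

use strict;
use warnings;

my $segment_name     = '000000070000000000000058';
my $wal_segment_size = 16 * 1024 * 1024;          # 0x1000000, the default

# Split the file name into its three 8-hex-digit fields.
my ( $tli, $log, $seg ) = unpack '(A8)3', $segment_name;

# LSN of the first byte of that segment -- the value flushedUpto, and hence
# pg_last_wal_receive_lsn(), starts out at on this timeline.
my $segs_per_id = 0x100000000 / $wal_segment_size;                 # 256
my $start_lsn   = ( hex($log) * $segs_per_id + hex($seg) ) * $wal_segment_size;

printf "timeline %d, segment start LSN: %X/%08X\n",
    hex($tli), $start_lsn >> 32, $start_lsn & 0xFFFFFFFF;          # 0/58000000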

Proposed solution

Use pg_last_wal_replay_lsn() instead of pg_last_wal_receive_lsn() to get the lsn_location.
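
A minimal sketch of that direction (the helper name and the DBI plumbing below are illustrative only, not the actual pgsqlms code):

use strict;
use warnings;
use DBI;

# Hypothetical replacement for the receive-LSN based helper: ask the server
# how far WAL has actually been replayed instead of how far it was received.
sub _get_last_replayed_lsn {
    my ( $dbh, $in_decimal ) = @_;

    my ($lsn) = $dbh->selectrow_array('SELECT pg_last_wal_replay_lsn()');
    return undef unless defined $lsn;
    return $lsn  unless $in_decimal;

    # pg_lsn values print as "XXXXXXXX/YYYYYYYY"; convert to decimal so they
    # can be compared numerically, as the existing "in decimal" call does.
    my ( $hi, $lo ) = split m{/}, $lsn;
    return hex($hi) * 0x100000000 + hex($lo);
}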

My questions

Why did you choose pg_last_wal_receive_lsn() instead of pg_last_wal_replay_lsn() in the lsn_location design? Was there a particular reason for this choice? Thanks in advance.

Best Regards
Gavin
