WAL in PostgreSQL: 4. Setup and Tuning

Original author: Egor Rogov
So, we have got acquainted with the structure of the buffer cache and, in that context, concluded that if all the RAM contents were lost due to a failure, the write-ahead log (WAL) would be required for recovery. The size of the necessary WAL files and the recovery time are limited thanks to the checkpoint performed from time to time.

In the previous articles we already covered quite a few important settings that relate to WAL in one way or another. In this article (the last in the series) we will discuss the setup questions that remain unaddressed: WAL levels and their purpose, as well as the reliability and performance of write-ahead logging.

WAL levels


The main task of WAL is to ensure recovery after a failure. But since we have to maintain the log anyway, it can also be adapted to other tasks by adding some more information to it. There are several logging levels. They are set by the wal_level parameter, and each subsequent level includes everything that is written to WAL at the preceding level, plus something new.

Minimal


The minimum possible level is set by wal_level = minimal and ensures only recovery after a failure. To save space, operations related to bulk data processing (such as CREATE TABLE AS SELECT or CREATE INDEX) are not WAL-logged. Instead, the necessary data are written to disk right away, and the new object is added to the system catalog and becomes visible at transaction commit. If a failure occurs while the operation is in progress, the data already written remain invisible and do not violate consistency. If a failure occurs after the operation completes, everything needed is already on disk and does not have to be logged.

Let's take a look. First we'll set the necessary level (to this end, we will also need to change another parameter — max_wal_senders).

=> ALTER SYSTEM SET wal_level = minimal;
=> ALTER SYSTEM SET max_wal_senders = 0;

student$ sudo pg_ctlcluster 11 main restart

Note that changing the level requires a server restart.

Let's remember the current WAL location:

=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/353927BC
(1 row)

Now let's create a table (CREATE TABLE AS SELECT) and note the WAL location again. The amount of data retrieved by the SELECT statement does not matter here, so one row is enough.

=> CREATE TABLE wallevel AS
  SELECT 1 AS n;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/353A7DFC
(1 row)

Let's look at WAL records using the familiar pg_waldump utility.

postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -s 0/353927BC -e 0/353A7DFC

Certainly, some details may differ from one run to another, but in this case we get the following. The record of the Heap2 manager relates to vacuuming: here it is an in-page vacuum of a table from the system catalog (system objects are easy to spot by the small number in rel):

rmgr: Heap2       len (rec/tot):     59/  7587, tx:          0, lsn: 0/353927BC, prev 0/35392788, desc: CLEAN remxid 101126, blkref #0: rel 1663/16386/1247 blk 8 FPW

The record of getting the next OID for the table to be created follows:

rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn: 0/35394574, prev 0/353927BC, desc: NEXTOID 82295

And this is the creation of the table itself:

rmgr: Storage     len (rec/tot):     42/    42, tx:          0, lsn: 0/35394594, prev 0/35394574, desc: CREATE base/16386/74103

But the insertion of data into the table is not WAL-logged. What follows are numerous records of row inserts into different tables and indexes: this is PostgreSQL writing the information about the created table into the system catalog (shown here in abridged form):

rmgr: Heap        len (rec/tot):    203/   203, tx:     101127, lsn: 0/353945C0, prev 0/35394594, desc: INSERT off 71, blkref #0: rel 1663/16386/1247 blk 8
rmgr: Btree       len (rec/tot):     53/   685, tx:     101127, lsn: 0/3539468C, prev 0/353945C0, desc: INSERT_LEAF off 37, blkref #0: rel 1663/16386/2703 blk 2 FPW
...
rmgr: Btree       len (rec/tot):     53/  2393, tx:     101127, lsn: 0/353A747C, prev 0/353A6788, desc: INSERT_LEAF off 10, blkref #0: rel 1664/0/1233 blk 1 FPW

And finally the transaction commit:

rmgr: Transaction len (rec/tot):     34/    34, tx:     101127, lsn: 0/353A7DD8, prev 0/353A747C, desc: COMMIT 2019-07-23 18:59:34.923124 MSK

Replica


When we restore a PostgreSQL instance from a backup, we start with some state of the file system and gradually bring the data to the target recovery point by replaying archived WAL records. The number of such records can be quite large (for example, several days' worth), so the recovery period will span many checkpoints rather than one. Therefore it is clear that the minimal logging level is insufficient: if an operation is not logged, we simply will not know that it needs to be redone. To support restoring from a backup, all operations must be WAL-logged.
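As a rough, hypothetical sketch (the archive path is made up, and backups themselves are a topic for a separate series): in PostgreSQL 11 the restore side of such a setup boils down to a recovery.conf file in the data directory that tells the server where to fetch archived segments from; since PostgreSQL 12 the same settings live in postgresql.conf instead.

# recovery.conf (PostgreSQL 11); /mnt/wal_archive is a hypothetical archive location
restore_command = 'cp /mnt/wal_archive/%f "%p"'    # fetch the requested WAL segment
# recovery_target_time = '2019-07-23 12:00:00'     # optionally stop at a point in time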

The same is true for replication: everything that is not logged will not be sent to the replica and will not be replayed. And the desire to run queries on a replica complicates things even more.

First, we need information about the exclusive locks acquired on the main server, since they can conflict with queries on the replica. Such locks are WAL-logged, and the startup process then acquires them on the replica.

Second, we need to create data snapshots, and to do this, as we remember, information on the transactions being executed is needed. In the case of a replica, not only local transactions are meant, but also transactions on the main server. The only way to provide this information is to WAL-log it from time to time (this happens once every 15 seconds).

The WAL level that supports both restoring from a backup and physical replication is set by wal_level = replica. (Before version 9.6 there were two separate levels, archive and hot_standby, but later they were combined.)

It's this level that is used by default starting with PostgreSQL 10 (while earlier it was minimal). So let's just restore the parameters to their default values:

=> ALTER SYSTEM RESET wal_level;
=> ALTER SYSTEM RESET max_wal_senders;

student$ sudo pg_ctlcluster 11 main restart

Deleting the table and redoing exactly the same sequence of steps as last time:

=> DROP TABLE wallevel;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/353AF21C
(1 row)
=> CREATE TABLE wallevel AS
  SELECT 1 AS n;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/353BE51C
(1 row)

Now let's check WAL records.

postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -s 0/353AF21C -e 0/353BE51C

Vacuuming, getting the OID, creation of the table and registration in the system catalog — same as before so far:

rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 0/353AF21C, prev 0/353AF044, desc: CLEAN remxid 101128, blkref #0: rel 1663/16386/1247 blk 8
rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn: 0/353AF258, prev 0/353AF21C, desc: NEXTOID 82298
rmgr: Storage     len (rec/tot):     42/    42, tx:          0, lsn: 0/353AF278, prev 0/353AF258, desc: CREATE base/16386/74106
rmgr: Heap        len (rec/tot):    203/   203, tx:     101129, lsn: 0/353AF2A4, prev 0/353AF278, desc: INSERT off 73, blkref #0: rel 1663/16386/1247 blk 8
rmgr: Btree       len (rec/tot):     53/   717, tx:     101129, lsn: 0/353AF370, prev 0/353AF2A4, …
rmgr: Btree       len (rec/tot):     53/  2413, tx:     101129, lsn: 0/353BD954, prev 0/353BCC44, desc: INSERT_LEAF off 10, blkref #0: rel 1664/0/1233 blk 1 FPW

And this is something new. A record of an exclusive lock, belonging to the Standby manager; here it is a lock on the transaction ID (we will discuss why it is needed in the next series of articles):

rmgr: Standby     len (rec/tot):     42/    42, tx:     101129, lsn: 0/353BE2D8, prev 0/353BD954, desc: LOCK xid 101129 db 16386 rel 74106

And this is the record of the row insert into our table (compare the file number in rel with the one in the CREATE record):

rmgr: Heap        len (rec/tot):     59/    59, tx:     101129, lsn: 0/353BE304, prev 0/353BE2D8, desc: INSERT+INIT off 1, blkref #0: rel 1663/16386/74106 blk 0

This is the commit record:

rmgr: Transaction len (rec/tot):    421/   421, tx:     101129, lsn: 0/353BE340, prev 0/353BE304, desc: COMMIT 2019-07-23 18:59:37.870333 MSK; inval msgs: catcache 74 catcache 73 catcache 74 catcache 73 catcache 50 catcache 49 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 74106 snapshot 1214

And there is one more record, which appears from time to time and is not tied to the completed transaction; it belongs to the Standby manager and reports the transactions being executed at this point in time:

rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/353BE4E8, prev 0/353BE340, desc: RUNNING_XACTS nextXid 101130 latestCompletedXid 101129 oldestRunningXid 101130

Logical


And finally, the last level is set by wal_level = logical and supports logical decoding and logical replication. It must be enabled on the publishing server.

From the perspective of WAL records, this level is virtually the same as replica: records related to replication origins are added, as well as arbitrary logical records that applications may put into WAL. For the most part, though, logical decoding relies on the information about running transactions, since a data snapshot is needed to track changes to the system catalog.
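As a quick illustration that is not part of the original walkthrough (it assumes wal_level has already been set to logical and the server restarted), the standard test_decoding plugin lets us peek at the decoded change stream through a logical replication slot:

=> SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
=> INSERT INTO wallevel VALUES (2);
-- each decoded change is returned as text produced by the plugin
=> SELECT lsn, xid, data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
=> SELECT pg_drop_replication_slot('test_slot');  -- do not forget: a slot holds back WAL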

We will not go into details of backup and replication now since this is a topic for a separate series of articles.

Reliability of writing


It's clear that write-ahead logging must be reliable and ensure recovery under any circumstances (short of corruption of the storage media, of course). Many things affect reliability; of them we will discuss caching, data corruption and write atomicity.

Caching


Multiple caches stand between the data and nonvolatile storage (such as a hard disk platter).

When a program (any program, but PostgreSQL in our case) asks the operating system (OS) to write something to disk, the OS transfers the data to its cache in RAM. The actual write happens asynchronously, depending on the settings of the OS I/O scheduler.

When the OS decides to write the data, they land in the cache of the storage device (the hard disk). The drive electronics can also postpone the write, for example, by grouping data that are more efficient to write together. And if a RAID controller is used, one more caching level appears between the OS and the disk.

So, unless special measures are taken, it is entirely unclear when the data are actually stored reliably. Usually this does not matter, but there are critical points where PostgreSQL must be sure that the data have been written reliably. These are, first of all, WAL writes (if a WAL record does not reach disk, it will be lost along with the rest of the RAM contents) and checkpoints (we must be sure that dirty pages really reach disk). But there are other situations too, such as performing unlogged operations at the minimal level, and so on.

The OS provides the means to ensure that data are written to nonvolatile memory immediately. There are several options, but they boil down to two main ones: either a synchronization call (fsync, fdatasync) is issued after the write, or a special flag is set when opening the file (or writing to it) that requests synchronization, or even a direct write bypassing the OS cache.

As for WAL, the pg_test_fsync utility helps to choose the method that best suits a particular OS and file system; the chosen method is specified in the wal_sync_method parameter. Regular data files are synchronized using fsync.
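For example, a run along these lines (the binary path matches the cluster used throughout this series) prints the number of synchronization operations per second for each available method, and the winner can then be configured; fdatasync below is only an illustrative value, since the best method depends on the OS and storage:

postgres$ /usr/lib/postgresql/11/bin/pg_test_fsync

=> ALTER SYSTEM SET wal_sync_method = fdatasync;  -- example value, pick the fastest safe one
=> SELECT pg_reload_conf();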

A subtle point is that hardware characteristics need to be taken into account when choosing the method. For example, if the controller has a battery backup unit (BBU), there is no reason not to use the controller cache, since the BBU preserves the data in case of a power outage.

The documentation provides a lot of context on this.

In any case, synchronization is expensive and is performed no more often than absolutely necessary (we will get back to this a bit later, when discussing performance).

Generally speaking, synchronization can be turned off (the fsync parameter is responsible for this), but then you can forget about reliable storage: by turning fsync off, you accept that data may be irreversibly lost at any moment. Probably the only reasonable use case for this parameter is a temporary performance boost when the data can easily be restored from another source (for example, during an initial migration).

Data corruption


Hardware is imperfect, and data can get corrupted on the storage media, during transmission over interface cables, and so on. Some of these errors are handled at the hardware level, but others are not.

For quick detection of an issue, checksums are provided in WAL records.

Data pages can also be protected by checksums. Previously this could only be enabled at cluster initialization, but in PostgreSQL 12 checksums can be turned on and off with the pg_checksums utility (though only with the server shut down, not yet "on the fly").
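For reference, on PostgreSQL 12 enabling and then verifying checksums on an existing, stopped cluster would look roughly like this (the data directory path is just an example):

postgres$ /usr/lib/postgresql/12/bin/pg_checksums --enable -D /var/lib/postgresql/12/main
postgres$ /usr/lib/postgresql/12/bin/pg_checksums --check -D /var/lib/postgresql/12/main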

In a production environment, checksums should definitely be turned on, despite the overhead of computing and verifying them. This reduces the probability of corruption going undetected.

It reduces the probability, but does not eliminate it.
First, checksums are verified only when a page is accessed, so the corruption may go undetected until it has made its way into all the backups. This is why pg_probackup verifies the checksums of all cluster pages while taking a backup.
Second, a page filled with zeros is considered correct, so if the file system mistakenly "nullifies" a file, this can go unnoticed.
Third, checksums protect only the main fork of the data files. The other forks and the remaining files (for example, transaction statuses in XACT) are not protected at all.
Alas.

Let's see how this works. First, we make sure that checksums are turned on (note that with packages on Debian-like systems this is not the default):

=> SHOW data_checksums;
 data_checksums
----------------
 on
(1 row)

The data_checksums parameter is read-only.

This is the file where our table is located:

=> SELECT pg_relation_filepath('wallevel');
 pg_relation_filepath
----------------------
 base/16386/24890
(1 row)

Let's shut down the server and change a few bytes in the zero page, for example, by erasing the LSN of the last WAL record from the page header.

student$ sudo pg_ctlcluster 11 main stop

postgres$ dd if=/dev/zero of=/var/lib/postgresql/11/main/base/16386/24890 oflag=dsync conv=notrunc bs=1 count=8
8+0 records in
8+0 records out
8 bytes copied, 0,0083022 s, 1,0 kB/s

Strictly speaking, it was not necessary to shut down the server. It would be enough for the page to be flushed to disk and evicted from the cache (otherwise the server would keep working with the copy in the cache). But that scenario is harder to reproduce.

Now we start the server and try to read the table.

student$ sudo pg_ctlcluster 11 main start

=> SELECT * FROM wallevel;
WARNING:  page verification failed, calculated checksum 23222 but expected 50884
ERROR:  invalid page in block 0 of relation base/16386/24890

But what can we do if restoring the data from a backup is impossible? The ignore_checksum_failure parameter allows an attempt to read the table anyway, at the risk of getting corrupted data, of course.

=> SET ignore_checksum_failure = on;
=> SELECT * FROM wallevel;
WARNING:  page verification failed, calculated checksum 23222 but expected 50884
 n
---
 1
(1 row)

Of course, everything is fine in this case, since we damaged only the page header rather than the data themselves.

And one more point to note. When checksums are enabled, changes to hint bits (which we discussed earlier) are WAL-logged, since a change to any bit, even an unimportant one, also changes the checksum. With checksums disabled, WAL-logging of hint bits is controlled by the wal_log_hints parameter.

Changes to hint bits are always logged as a full page image (FPI), which considerably increases the WAL size. In this case it makes sense to turn on FPI compression with the wal_compression parameter (added in version 9.5). We will look at specific figures a bit later.

Atomicity of writing


And finally, there is the question of write atomicity. A database page is at least 8 KB (it may also be 16 or 32 KB), while at the low level writes are performed in blocks that are usually smaller (typically 512 bytes or 4 KB). Therefore, in case of a power failure, a data page may be written only partially. It's clear that during recovery it makes no sense to apply ordinary WAL records to such a page.

To cope with this, PostgreSQL WAL-logs a full page image on the first change of a page after the start of a checkpoint cycle (the same image is also logged when hint bits change). This is controlled by the full_page_writes parameter, which is on by default.

If the recovery process comes across an FPI in WAL, it writes the image to disk unconditionally (without checking the LSN): the FPI is more trustworthy because, like every WAL record, it is protected by a checksum. And it is to this known-good image that ordinary WAL records are then applied.

Although an FPI in PostgreSQL does not include free space (we discussed the page structure earlier), FPIs considerably increase the amount of WAL generated. As already mentioned, the situation can be improved by compressing FPIs (the wal_compression parameter).

To get an idea of how the WAL size changes, let's conduct a simple experiment using the pgbench utility. First, the initialization:

student$ pgbench -i test
dropping old tables...
creating tables...
generating data...
100000 of 100000 tuples (100%) done (elapsed 0.15 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done.

The full_page_writes parameter is turned on:

=> SHOW full_page_writes;
 full_page_writes
------------------
 on
(1 row)

Let's perform a checkpoint and immediately run a test for 30 seconds.

=> CHECKPOINT;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/38E04A08
(1 row)

student$ pgbench -T 30 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 30 s
number of transactions actually processed: 26851
latency average = 1.117 ms
tps = 895.006720 (including connections establishing)
tps = 895.095229 (excluding connections establishing)

=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3A69C478
(1 row)

Getting the size of WAL records:

=> SELECT pg_size_pretty('0/3A69C478'::pg_lsn - '0/38E04A08'::pg_lsn);
 pg_size_pretty
----------------
 25 MB
(1 row)

Now let's turn off the full_page_writes parameter:

=> ALTER SYSTEM SET full_page_writes = off;
=> SELECT pg_reload_conf();

And we repeat the experiment.

=> CHECKPOINT;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3A69C530
(1 row)

student$ pgbench -T 30 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 30 s
number of transactions actually processed: 27234
latency average = 1.102 ms
tps = 907.783080 (including connections establishing)
tps = 907.895326 (excluding connections establishing)

=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3BE87658
(1 row)

Getting the size of WAL records:

=> SELECT pg_size_pretty('0/3BE87658'::pg_lsn - '0/3A69C530'::pg_lsn);
 pg_size_pretty
----------------
 24 MB
(1 row)

Yes, the size decreased, but not nearly as much as we might have expected.

The reason is that the cluster was initialized with data page checksums, and therefore FPIs still have to be WAL-logged when hint bits change. In the situation above, these data make up about half of the total volume, which can be verified by looking at the statistics:

postgres$ /usr/lib/postgresql/11/bin/pg_waldump --stats -p /var/lib/postgresql/11/main/pg_wal -s 0/3A69C530 -e 0/3BE87658
Type              N      (%)   Record size      (%)   FPI size      (%)   
----              -      ---   -----------      ---   --------      ---   
XLOG           1721 (  1,03)         84329 (  0,77)   13916104 (100,00)        
Transaction   27235 ( 16,32)        926070 (  8,46)          0 (  0,00)          
Storage           1 (  0,00)            42 (  0,00)          0 (  0,00)              
CLOG              1 (  0,00)            30 (  0,00)          0 (  0,00)              
Standby           4 (  0,00)           240 (  0,00)          0 (  0,00)             
Heap2         27522 ( 16,49)       1726352 ( 15,76)          0 (  0,00)         
Heap         109691 ( 65,71)       8169121 ( 74,59)          0 (  0,00)         
Btree           756 (  0,45)         45380 (  0,41)          0 (  0,00)           
           --------               --------            --------                 
Total        166931               10951564 [44,04%]   13916104 [55,96%]  

Rows containing only zeros have been removed to make the table more compact. Pay attention to the summary row (Total) and compare the size of full page images (FPI size) with the size of normal records (Record size).

The full_page_writes parameter can be turned off only if the file system and hardware themselves guarantee write atomicity. But, as we can see, there is not much point in doing so (as long as checksums are enabled).

Now let's see how compression can help.

=> ALTER SYSTEM SET full_page_writes = on;
=> ALTER SYSTEM SET wal_compression = on;
=> SELECT pg_reload_conf();

Repeating the same experiment.

=> CHECKPOINT;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3BE87710
(1 row)

student$ pgbench -T 30 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 30 s
number of transactions actually processed: 26833
latency average = 1.118 ms
tps = 894.405027 (including connections establishing)
tps = 894.516845 (excluding connections establishing)

=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3CBD3EA8
(1 row)

Getting the size of WAL records:

=> SELECT pg_size_pretty('0/3CBD3EA8'::pg_lsn - '0/3BE87710'::pg_lsn);
 pg_size_pretty
----------------
 13 MB
(1 row)

Conclusion: when there are many FPIs (because of checksums or full_page_writes, which is almost always), it most likely makes sense to enable compression, even though it puts extra load on the CPU.

Performance


During normal operation, WAL files are written continuously and sequentially. Since there is no random access, even ordinary HDDs handle this load well. But this kind of load is quite different from the way data files are accessed.

So it is usually beneficial to store WAL on a separate physical disk (or disk array) mounted into the server's file system. Instead of the $PGDATA/pg_wal directory, a symbolic link to the corresponding directory must be created.
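A minimal sketch of such a move for the cluster used in this series (the mount point /mnt/fast_disk is hypothetical; the server must be stopped while the directory is relocated):

student$ sudo pg_ctlcluster 11 main stop
postgres$ mv /var/lib/postgresql/11/main/pg_wal /mnt/fast_disk/pg_wal
postgres$ ln -s /mnt/fast_disk/pg_wal /var/lib/postgresql/11/main/pg_wal
student$ sudo pg_ctlcluster 11 main start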

There are a couple of situations in which WAL files need to be not only written but also read. The obvious one is recovery after a failure. The second is less trivial: it occurs when streaming replication is used and a replica falls behind, so that the WAL records it needs are no longer in the OS buffers of the main server. Then the walsender process has to read the data from disk. We will discuss this in more detail when we get to replication.

WAL is written in either of the two modes:

  • Synchronous — at a transaction commit, the work cannot be continued until all WAL records of this transaction get on disk.
  • Asynchronous — a transaction completes immediately, and WAL is written in the background.

The synchronous_commit parameter, turned on by default, sets the synchronous mode.

Because synchronization involves actual (that is, slow) I/O, it pays to do it as rarely as possible. To this end, the backend process that completes a transaction and writes WAL makes a short pause, defined by the commit_delay parameter. But this happens only if there are at least commit_siblings active transactions in the system. The expectation is that during the pause some of them will complete and it will be possible to synchronize their records in one go. It's a bit like holding the elevator doors so that someone has time to hop in.

By default commit_siblings = 5 and commit_delay = 0, so in practice there is no wait. Changing commit_delay makes sense only on systems executing a very large number of OLTP transactions.

Then the process flushes WAL up to the required LSN (or a little further, if new records were added during the pause). After that the transaction is considered completed.
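On a system with a very high rate of short transactions, one might experiment with group commit along these lines (the values are purely illustrative and should be validated with a benchmark; both parameters take effect after a configuration reload):

=> ALTER SYSTEM SET commit_delay = 1000;    -- pause in microseconds before the flush
=> ALTER SYSTEM SET commit_siblings = 10;   -- wait only if at least 10 other transactions are active
=> SELECT pg_reload_conf();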

Synchronous writing ensures durability (the D in the ACID acronym): if a transaction is committed, all its WAL records are already on disk and will not be lost. The downside is that synchronous writing increases the response time (the COMMIT command does not return until synchronization completes) and reduces system performance.

You can make writing asynchronous by setting synchronous_commit = off (or local).

When writing is asynchronous, WAL records are flushed by the wal writer process, which alternates between working and sleeping (the sleep time is set by the wal_writer_delay parameter, 200 ms by default).

On waking up, the process checks whether any completely filled WAL pages have appeared since the last time. If they have, it skips the current, partially filled page and writes only the completely filled ones. (Not necessarily all at once, though: writing stops when it reaches the end of the cache and continues from the beginning of the cache next time.)

But if no page has been filled completely, the process writes the current, partially filled WAL page; it did not wake up for nothing, after all.

This algorithm aims to avoid flushing the same page several times wherever possible, which matters a great deal under a heavy stream of changes.

Asynchronous writing is more efficient than synchronous because commits do not wait for WAL pages to be written. But reliability decreases: committed data can be lost in a failure if less than 3 × wal_writer_delay has elapsed between the commit and the failure (with the default settings, that is a bit more than half a second).

This uneasy choice between efficiency and reliability is up to the system administrator.

Note that, unlike turning synchronization off (fsync = off), the asynchronous mode does not make recovery impossible. After a failure, the system will restore a consistent state, but some of the most recent transactions may be missing from it.

The synchronous_commit parameter can also be set for individual transactions. This makes it possible to gain performance by sacrificing reliability only for some of them: say, financial transactions must be committed synchronously, while losing a chat message or two is tolerable.
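For example, a single transaction can opt out of waiting for the WAL flush without affecting anything else (the chat_messages table here is purely hypothetical):

=> BEGIN;
=> SET LOCAL synchronous_commit = off;  -- applies to this transaction only
=> INSERT INTO chat_messages (body) VALUES ('hello');  -- hypothetical table
=> COMMIT;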

In reality, the two modes work together. Even with synchronous commit, WAL records of a long transaction will be written asynchronously in order to free up WAL buffers. And if, while flushing a page from the buffer cache, it turns out that the corresponding WAL record is not yet on disk, that record is immediately flushed in synchronous mode.

To get an idea of what asynchronous commit gains us, let's repeat the pgbench test in this mode.

=> ALTER SYSTEM SET synchronous_commit = off;
=> SELECT pg_reload_conf();

student$ pgbench -T 30 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 30 s
number of transactions actually processed: 45439
latency average = 0.660 ms
tps = 1514.561710 (including connections establishing)
tps = 1514.710558 (excluding connections establishing)

With synchronous commit we got roughly 900 transactions per second (tps), and with asynchronous commit about 1500 tps. Naturally, in a real system under real load the ratio will be different, but it is clear that for short transactions the effect can be quite significant.

This brings the series of articles on write-ahead logging to a close. If anything important remains unaddressed, I would appreciate your comments. Thank you all!

Next, amazing adventures await us in the world of locks, but that is another story.