Database scaling is a continually coming future. DBMS get improved and better scaled on hardware platforms, while the hardware platforms themselves increase the performance, number of cores, and memory — Achilles is trying to catch up with the turtle, but has not caught up yet. The database scaling challenge manifests itself in all its magnitude.
Postgres Professional had to face the scaling problem not only theoretically, but also in practice: through their customers. Even more than once. It's one of these real-life cases that this article
Many thanks to Elena Indrupskaya for the translation. Russian version is here.
PostgreSQL scales well on NUMA systems in the case of a single motherboard with multiple processors and multiple data buses. You can read about some optimizations here and here.
However, there is another group of systems: they have several motherboards that exchange data using the interconnect and run one instance of the OS; for the user this design looks like a single machine. And although formally such systems can also be referred to NUMA, in essence, they are closer to supercomputers because access to the local memory of a node drastically differs from access to the memory of a neighboring node.
The PostgreSQL community believes that the cause of the issues is the only instance of Postgres running on such architectures, and there is no systematic approach to resolving them yet. Designers of shared-memory oriented software initially presumed that access times to local and remote memory would be more or less comparable. When we work with many nodes, it doesn’t make sense to rely on shared memory as on a fast communication channel: because of latency, it is much «cheaper» to send a request to perform a certain action to the node where the data of interest is located than to send the data through the bus. So, cluster solutions are relevant for supercomputers and for multi-node systems in general.
This does not mean that the combination of multi-node systems and the shared-memory architecture, which is typical for Postgres, should be trashed. After all, if Postgres processes spend most of their time performing complex computations locally, this architecture will be even more efficient. In our situation, the customer had already purchased a powerful multi-node server, and we had to resolve PostgreSQL issues on it.
And the issues were serious: execution of simplest write queries (to change several field values in one row) took from a few minutes to an hour. These issues came out full force exactly due to a large number of cores and accordingly, to massive parallelism in query execution with a relatively slow exchange between nodes.
Therefore, the article will be somewhat dual-purpose:
- To share experiences on: what if on a multi-node system, the database seriously slows down. What to begin with, how to diagnose, where to go.
- Tell how the problems of the PostgreSQL database itself can be solved with a high level of parallelism. Among the rest, how a change to the algorithm of acquiring locks affects the efficiency of PostgreSQL.
The server and database
The system consisted of 8 blades with 2 sockets each. More than 300 cores in total (hyper-threading not taken into account). The fast bus (the manufacturer's proprietary technology) connected the blades. Not that a supercomputer, but the configuration was impressive for a single DBMS instance. The load was also rather big. More than 1 terabyte of data. About 3000 transactions per second. More than 1000 connections to Postgres.
Having started analyzing hour-long write waits, first we made sure that writing to disk could not cause the delay. As soon as unclear delays began to occur, the tests were done exclusively on
tmpfs. The picture did not change. So the disk was not to blame.
Starting to collect diagnoses: views
Since the issues were most likely due to the high concurrency of processes that try to access the same objects, the first thing to check was the locks. For this check, PostgreSQL has
pg_stat_activityviews. Starting as early as with version 9.6, the second one includes the information on what the process is waiting for (Amit Kapila, Ildus Kurbangaliev) —
wait_event_type. Possible values for this field are described here.
But first, let's just count:
postgres=# SELECT COUNT(*) FROM pg_locks;
count —---— 88453 (1 row)
postgres=# SELECT COUNT(*) FROM pg_stat_activity;
count —---— 1826 (1 row)
postgres=# SELECT COUNT(*) FROM pg_stat_activity WHERE state='active';
count —---— 1005 (1 row)
And these are actual figures. It reached 200 000 locks. At the same time, the following locks were acquired for the ill-fated query:
SELECT COUNT(mode), mode FROM pg_locks WHERE pid=580707 GROUP BY mode;
count | mode —-----+---------------— 93 | AccessShareLock 1 | ExclusiveLock
When reading the buffer, the DBMS uses the
sharelock, and when writing — the
exclusivelock. That is, write locks made less than 1% of all queries. In the
pg_locksview, the kinds of locks do not always appear as described in the user documentation.
Here is a table of matches:
AccessShareLock = LockTupleKeyShare RowShareLock =LockTupleShare ExclusiveLock = LockTupleNoKeyExclusive AccessExclusiveLock = LockTupleExclusive
The query «SELECT mode FROM pg_locks» showed that 234 INSERTs were waiting for execution of the CREATE INDEX command (without the CONCURRENTLY keyword) and 390 INSERTs were awaiting the
buffer content lock. A possible solution is to «teach» INSERTs from different sessions to reduce buffer overlapping.
It's time to involve perf
perfutility collects pretty much diagnostic information. In the
record… mode, it writes system event statistics to files (by default they are in
./perf_data), and in the
reportmode, it analyzes the collected data; for example, you can filter events by being related to
postgresor to a given
$ perf record-u postgres or $ perf record-p 76876 and then, say $ perf report > ./my_results
As a result we will see something like
perffor PostgreSQL diagnostics is described, for example, here, as well as in pg wiki.
In our situation, important information was obtained even in the simplest mode —
perf top, which certainly worked a la
topcommand of the operating system. With
perf top, we've seen that the processor spends most of the time in kernel locks, as well as in the
PinBuffer()is a function that increments the count of references to the buffer (mapping of a data page to RAM), through which Postgres processes know what buffers can be evicted and which cannot.
LWLockAttemptLock()is a function to acquire
LWLockis a kind of lock with two levels –
exclusive– and without
deadlockdetection; locks are preallocated in
shared memory, and waiting processes are queued.
These functions were already optimized considerably in PostgreSQL 9.5 and 9.6. Spinlocks inside them were replaced with a direct use of atomic operations.
No way without them: even if they were useless, they would have still been worth telling about since they are extremely elegant. Moreover, they are useful! Here goes an illustration from
github– not from the case discussed (neither we nor the customer are ready to disclose details yet).
These impressive pictures very clearly show what the processor cycles are spent on. The same
perfcan collect data, but
flame graphclearly visualizes the data and builds trees based on the collected call stacks. You can read about profiling with flame graphs, for example, here and download everything you need here.
In our situation, we could see a huge number of
nestloops on the flame graphs. It seems like JOINs of a large number of tables in numerous parallel read queries caused a large number of
Statistics collected by
perfshow where the CPU cycles are spent. And although we've seen that most of the CPU time passes in the locks, we did not see what exactly leads to such long waits of the locks because we do not understand where exactly the waits of the locks occur — CPU time is not spent in waits.
In order to see the waits themselves, we can query the
SELECT wait_event_type, wait_event, COUNT(*) FROM pg_stat_activity GROUP BY wait_event_type, wait_event;
LWLockTranche | buffer_content | UPDATE ************* LWLockTranche | buffer_content | INSERT INTO ******** LWLockTranche | buffer_content | \r | | insert into B4_MUTEX | | values (nextval('hib | | returning ID Lock | relation | INSERT INTO B4_***** LWLockTranche | buffer_content | UPDATE ************* Lock | relation | INSERT INTO ******** LWLockTranche | buffer_mapping | INSERT INTO ******** LWLockTranche | buffer_content | \r
(here asterisks just replace the details of the query that we do not share).
The values are visible for
buffer_content(lock on the contents of buffers) and for
buffer_mapping(locks on parts of the
To GDB for help
But why are there so many waits for these locks? For more detailed information about locks, we had to use the
GDB, we can get a stack of calls to specific processes. Using sampling, i.e., having collected a certain number of random call stacks, you can get an idea of the stacks where the longest waits happen.
Let's consider the process of collecting statistics. We will consider the «manual» collection of statistics, although in real life special scripts are used to do it automatically.
First, it is necessary to attach
gdbto the PostgreSQL process. To do this, you need to find
pidof the server process, say, from
Assume we found:
$ ps aux | grep postgres
and now we will feed the debugger with
postgres 2025 0.0 0.1 172428 1240 pts/17 S jul23 0:00 /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
igor_le:~$gdb -p 2025
Once inside the debugger, we type
where. And we get a lot of information of this kind:
(gdb) bt #0 0x00007fbb65d01cd0 in __write_nocancel () from /lib64/libc.so.6 #1 0x00000000007c92f4 in write_pipe_chunks ( data=0x110e6e8 "2018-06-01 15:35:38 MSK : [392-1] db=bp,user=bp,app=[unknown],client=192.168.70.163 (http://192.168.70.163) LOG: relation 23554 new block 493: 248.389503\n2018‐06‐01 15:35:38 MSK : [393-1] db=bp,user=bp,app=["..., len=409, dest=dest@entry=1) at elog.c:3123 #2 0x00000000007cc07b in send_message_to_server_log (edata=0xc6ee60 <errordata>) at elog.c:3024 #3 EmitErrorReport () at elog.c:1479</errordata>
Having gathered statistics, which included call stacks from all Postgres processes that were collected repeatedly at different points in time, we saw that 3706 seconds (about an hour) were spent awaiting the
buffer partition lockinside the
relation extension lock, that is, a lock on a piece of the buffer manager's hash table that was needed to evict the former buffer in order to later replace it with a new one, corresponding to the extended part of the table. A certain number was also noticeable for
buffer content lock, which corresponded to the wait of the lock of
B-treeindex pages for insertion.
At first, there were two explanations for such a monstrous waiting time:
- Someone else acquired this
LWLockand got stuck. But this is unlikely. Because nothing complicated happens inside the buffer partition lock.
- We came across some sort of pathologic behavior of
Lwlock. That is, although none acquired the lock for too long, the wait for it lasted unnecessarily long.
Diagnostic patches and treatment of trees
By reducing the number of simultaneous joins, we would certainly reduce the flow of requests for locks. But it would be like a surrender. Instead, Alexander Korotkov, chief architect of Postgres Professional (he certainly helped to prepare this article), proposed a series of patches.
First of all, it was necessary to get a more detailed picture of the disaster. Although off-the-shelf tools are pretty universal, company's own diagnostic patches still prove useful.
A patch was written to add detailed logging of the time spent in
relation extensionand of what happens inside the
RelationAddExtraBlocks()function. This way we get to know what the time inside
RelationAddExtraBlocks() is spent on.
And another patch was written in support of the previous, which reports in
pg_stat_activityon what we are doing now in
relation extension. It was done like this: when
RelationAddExtraBlocks. This process is now convenient to analyze in maximum details using
As for treatment patches (rather than diagnostic), two of them were written. The first patch changed the behavior of
B‐treeleaf locks: previously, in an insert query, the leaf was locked as
share, and only then acquired
exclusive. Now it acquires
exclusivefrom the beginning. Now this patch is already committed for PostgreSQL 12. Thanks to the committer status, which Alexander Korotkov received last year — the second PostgreSQL committer in Russia and the second in the company.
The value of
NUM_BUFFER_PARTITIONSwas also increased from 128 to 512 to reduce the load on the mapping locks: the hash table of the buffer manager was divided into smaller pieces with a view of decreasing the load on each particular piece.
After applying this patch, locks on the contents of the buffers were gone, but despite the increase of
NUM_BUFFER_PARTITIONS, there remained
buffer_mappinglocks, that is, to recall, locks on pieces of the buffer manager's hash table:
locks_count | active_session | buffer_content | buffer_mapping ----‐‐‐--‐‐‐+‐------‐‐‐‐‐‐‐‐‐+‐‐‐------‐‐‐‐‐‐‐+‐‐------‐‐‐ 12549 | 1218 | 0 | 15
And this is not much. B‐tree was no longer a bottleneck. The extension of
heapcame to the forefront.
Treatment of conscience
Then Alexander suggested the following hypothesis and solution:
We are waiting too long on the
buffer partition lockduring the buffer eviction. It is possible that on the same
buffer partition lockthere lies some highly demanded page, for example, the root of some
B‐tree. At this location, there is a persistent flow of requests for the
shared-lockfrom read queries.
The wait queue in
LWLockis «not fair». Since as many
shared locks as you want can be acquired at the same time, if
shared lockis already acquired then next
shared locks do not wait in line. Therefore, if the flow of shared locks is intensive enough to not leave gaps between them, the wait of the
exclusive lockgoes virtually to infinity.
To fix this, we can try to offer the patch for «gentleman» behavior of locks. It stirs the conscience of
shared locker-s and they honestly stand in line when
exclusive lockis already standing there (interestingly, heavyweight locks —
hwlock— have no trouble with the conscience: they always honestly stand in line)
locks_count | active_session | buffer_content | buffer_mapping | reladdextra | inserts>30sec ‐‐‐‐‐‐-‐‐‐‐‐+‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐+‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐+‐‐‐‐‐‐‐‐‐‐‐--‐-‐+‐‐‐‐‐‐-‐‐‐‐‐‐+‐‐‐‐------ 173985 | 1802 | 0 | 569 | 0 | 0
Everything is OK! There are no long
INSERTs. However, locks on pieces of hash tables remain. Well, it can’t be helped — these are the properties of the bus of our little supercomputer.
This patch was also offered to the community. But regardless of the future of these patches in the community, nothing prevents them from getting into next versions of Postgres Pro Enterprise, which are designed right for customers with heavily loaded systems.
sharelocks, which let
exclusivelocks into the queue, resolved the issue of hour-long delays on a multi-node system. The hash table of the
buffer managerfailed to work because of a too heavy flow of
share lock-s, which left no chance for the locks needed to evict the old buffers and load the new ones. Issues with the extension of the buffer for database tables only was the result of this. Previously, it was possible to fix a weakness with access to the root of
PostgreSQL was not created with an eye to NUMA-architectures and supercomputers. Accommodating Postgres to these architectures is a huge job, which would (and will possibly) require coordinated efforts of many people and even companies. But troublesome consequences of these architectural problems can be mitigated. And we have to: the load types that resulted in delays such as those described are quite typical, and we continue to receive similar distress alerts from other places. Similar troubles manifested themselves before — on systems with fewer cores, but the consequences were just not so terrible, and the symptoms were treated differently and with other patches. Now another treatment is available: although not universal, but definitely useful.
So, when PostgreSQL works with the memory of the whole system as local, no high-speed bus between nodes can be compared with the time of access to local memory. Uneasy problems arise because of this, often urgent, but interesting. And the experience of solving them is useful not only to the solvers, but also for the whole community.