Postgres Backend Buffer Pinning Algorithm
This article started from an apparently simple comment form Andres Freund on postgresql-hackers mailing list, saying that pinning buffers had a cost, when I asked him to elaborate, he gave some hints and some examples [1].
The buffer pinning logic in postgres 18 is implemented in src/backend/storage/buffer/bufmgr.c.
Definitions
Starting from Andres hint about REFCOUNT_ARRAY_ENTRIES we find this
static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
The next question is, what is a PrivateRefCountEntry ? It is a simple 8 byte structure (where Buffer is an a 32-bit integer).
typedef struct PrivateRefCountEntry
{
Buffer buffer;
int32 refcount;
} PrivateRefCountEntry;
So this is representing a hybrid data structure that combines an array that is fast when the number of pinned buffers is small, and is fast when the number of pinned buffers is high.
Let's check in details how this is coded.
ReservePrivateRefCountEntry
The ReservePrivateRefCountEntry is called when pinning a buffer, it implements a find or create logic, and uses a linear search code.
for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
{
PrivateRefCountEntry *res;
res = &PrivateRefCountArray[i];
if (res->buffer == InvalidBuffer)
{
ReservedRefCountEntry = res;
return;
}
}
// Proceed with the creation of an entry in
// PrivateRefCountHash
A recent change removed the mid-loop return ReservePrivateRefCountEntry , in favour of compiler auto-vectorization [3], so now the buffers are assigned to the last available position instead of the first.
GetPrivateRefCountEntry
Then GetPrivateRefCountEntry function is called when pinning or unpinning, and it has the following linear search code.
for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
{
res = &PrivateRefCountArray[i];
if (res->buffer == buffer)
return res;
}
if (PrivateRefCountOverflowed == 0)
return NULL;
res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
In the same spirit, this was refactored into GetPrivateRefCountEntrySlow where again the mid loop return was removed from linear search code.
Every pin/unpin will do a full scan on the array because the desired buffer can be anywhere. And if there are more than 8 pins it will also do a hash insertion/lookup. It has two small optimisations, a clock to evict buffers in a round robin way, to avoid placing the last buffer on the hash table, and the development version includes a cache for last entry.
The simplehash
Postgres includes a simplehash implementation on src/include/lib/simplehash.h , it is a templated implementation, you define some macros, and include the file, different from most headers, it doesn't have an #ifdef against multiple includes. The default implementation used to require a field to save whether an entry is valid or not, I changed it to support a more flexible way using the underlying entry key own invalid values to detect that, e.g. buffer == InvalidBuffer.
It is an open addressing variant and basically stores all the entires in one array, minimising both memory allocation and pointer dereferencing. So, quite similar to the array search, with the difference that the key gives a hint of where the values are stored, allowing to reduce significantly the number array operations for insertion and lookup.
To exemplify this here is how the first 25 prime numbers would be inserted on a 32-entry simplehash. Only 8 elements are displaced, and none is displaced more than one position, meaning that it will check at most two positions in the array.
The Resource Owner
In addition to the private reference counter pinning a buffer also calls ResourceOwnerRemember that in itself is O(1) but it requires ResourceOwnerEnlarge that does some memory allocation, and possibly copy of all resource entries, but this will get amortised if the number of pinned buffers is limited, but eventually unpin will have to call ResourceOwnerForget, that is optimized for LIFO pattern, if we use a FIFO pattern it will be worse than random pattern. But that is a topic for another chapter.
Implementation
This was implemented in a series of 5 patches.
Benchmark buffer pinning: You know benchmark code, implemented a few functions that can be use in postgres queries, and a python script that runs them and produces CSV files and SVG plots for the current build.
Refactoring reference counting: Before starting to change code and potentially breaking things I considered prudent to isolate it to limit the damage. This code was part of a 8k+ LOC file.
Using simplehash: Simply replacing the HTAB for a simplehash, and adding a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE. To allow using the InvalidBuffer special value instead of allocating extra space for a validity flag. Here I assume that the buffer buffer sequence is independent enough from the array size, so I use the buffer as the hash key directly, omitting a hash function call.
Compact PrivateRefCountEntry: The original implementation used a 4-byte key and 8-byte value. Reference count uses 32 bits, and it is unreasonable to expect one backend to pin the same buffer 1 billion times. The lock mode uses 32 bits but can only assume 4 values. So I packed them in one single uint32, giving 30 bits for count and 2 bits for lock mode. This makes the entries 8-byte long, on 64-bit CPUs this represents more than a 1/3 reduction in memory. This makes the array aligned with the 64-bit words, copying one entry can be completed in one instruction, and every entry will be aligned.
REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array lookup, it is worth trying to remove it completely and keep only the hash. For prefetch holding and always more than 8 entries and on a FIFO, the original implementation would always hit the hash, this change would save as a 8-entry array linear serarch, and some extra data moves.
Results
The first patch makes the performance slightly worse by the added function calls, but on the next commit already makes it better for all the cases, except the case with one single buffer pinned. The compact entry shows amodest improvement, being more significant when more then 1000 buffers are pinned. Removing the array brings the performance for a single pin back to the the baseline, and makes holding 100 pins about 25% slower than holding one, as opposed to 100%
Now looking at the Resowner time only we can see a striking similarity in the shapes.
If we subtract the averages point by point, we get a plot that should give us an idea of how of the performance impact of reference counting alone.
This case we see an overhead of 50ns in the patch 5 and ~200ns on the baseline.
References
[1] Andres Freund, on "Index Prefetch", 2023-03-02 https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5mw4
[2] bufmgr: Add one-entry cache for private refcount, commit 30df619
[3] ReservePrivateRefContEntry linear search without early return src/backend/storage/buffer/bufmgr.c#L307-L318