Rating: 7.9/10.
Mastering PostgreSQL 15: Advanced techniques to build and manage scalable, reliable, and fault-tolerant database applications by Hans-Jürgen Schönig
A fairly comprehensive overview of PostgreSQL for developers, covering many advanced features as well as security, performance, backups, and replication, with good descriptions and examples of when to use each feature.
Chapter 1. New features in PostgreSQL 15 include improved security defaults, better options for handling NULL values, conditional upserts via the new MERGE command, and compressed backups.
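For instance, MERGE can express a conditional upsert in a single statement; a quick sketch, with hypothetical inventory and incoming tables:

```sql
-- Upsert: update matching rows, insert the rest (PostgreSQL 15+)
MERGE INTO inventory AS i
USING incoming AS n ON i.sku = n.sku
WHEN MATCHED THEN
    UPDATE SET quantity = i.quantity + n.quantity
WHEN NOT MATCHED THEN
    INSERT (sku, quantity) VALUES (n.sku, n.quantity);
```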
Chapter 2. A transaction is a BEGIN command, some statements, and then either a COMMIT of all of them or a ROLLBACK of all of them. COMMIT AND CHAIN commits and immediately starts the next transaction. Savepoints let you roll back to one of several points within a transaction, but once a transaction has committed it can no longer be rolled back. Commands that modify the table structure can also be part of a transaction.
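A minimal savepoint sketch, using a hypothetical accounts table:

```sql
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 100);
SAVEPOINT before_bonus;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;
ROLLBACK TO SAVEPOINT before_bonus;  -- undoes the bonus, keeps the insert
COMMIT;                              -- only the insert is persisted
```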
There are many types of locks. For example, an UPDATE locks the affected rows against other writes only, so reads can still happen, but other writes touching the same rows are blocked until the lock is released. Some operations (like taking the max value across the whole table and inserting the next highest value) require locking the entire table, which is very inefficient. SELECT FOR UPDATE locks the selected rows until the transaction ends (or a lock timeout fires), and this can be expensive, especially when foreign key tables are involved. PostgreSQL can automatically detect deadlocks; when two transactions deadlock, one is forced to error out and roll back. The VACUUM command frees up space inside the table (though this does not always return disk space to the operating system) and can be configured to run automatically.
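A small row-locking sketch with SELECT FOR UPDATE (the accounts table is again hypothetical):

```sql
BEGIN;
-- Lock the row so concurrent writers must wait until we commit
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
COMMIT;  -- releases the row lock
```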
Chapter 3. Use the EXPLAIN command to analyze queries – when comparing queries, it reports cost not in milliseconds but in abstract units, eg: a sequential block read is 1 point, a random read is 4 points, setting up a parallel worker is 1000 points, etc. The set of indexes used depends on the data and not just the structure of the table; the planner is smart enough to avoid an index in cases where it won't help. A bitmap scan is used when a query involves multiple indexes on the same table, combining them to avoid reading the same block twice. PostgreSQL collects table statistics, like correlation (how closely the physical row order matches the sorted order of a column), which the query planner uses.
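For instance, on a hypothetical orders table:

```sql
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- The plan shows node types (e.g. Index Scan vs Seq Scan), each with
-- cost=startup..total in abstract planner units, not milliseconds.
```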
Indexes can take up a lot of space, sometimes more than the data itself. One way to save space is a partial index that excludes the most common values, which are not worth indexing anyway. If your data has an unusual sort order, you can define a custom operator class to impose an ordering different from the data type's default.
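A partial-index sketch, assuming a hypothetical jobs table where the vast majority of rows have status 'done':

```sql
-- Index only the interesting minority of rows
CREATE INDEX idx_jobs_pending ON jobs (created_at)
WHERE status <> 'done';
```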
There are many types of indexes. The B-tree, basically a sorted list, is useful for anything with a linear ordering, while hash indexes only support equality lookups. GiST is good for range queries on 2D spatial data. Extensions let you install further index types, like a bloom filter. The trigram index handles fuzzy string searching, tolerating typos and supporting simple regular expressions. Full-text search is done with GIN (generalized inverted index), which indexes every word after stemming and stopword removal, making it language-dependent; tokens like URLs and email addresses get special treatment.
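A fuzzy-search sketch using the pg_trgm extension (the people table is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_people_name_trgm ON people USING gin (name gin_trgm_ops);
-- The % operator matches by trigram similarity, tolerating typos
SELECT * FROM people WHERE name % 'Schonig';
```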
Chapter 4. The range type is more powerful than storing the lower and upper bounds in separate columns, but it is a PostgreSQL-specific feature. Grouping sets are a more advanced version of GROUP BY and allow multiple aggregations over different subsets in a single query. Ordered sets can query based on the order within a group, such as finding the median or a percentile.
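A grouping-sets sketch over a hypothetical sales table:

```sql
-- Per-region totals, per-region-and-product totals, and the grand
-- total, all in one query
SELECT region, product, sum(amount)
FROM sales
GROUP BY GROUPING SETS ((region), (region, product), ());
```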
A window function differs from aggregation in that the number of rows stays the same, but it can add a column that compares each row against the others in the same group: PARTITION BY defines the group, and the function can, for example, take an average within that group and return it as a column in the result. Ordering may matter for a window function, for example when referencing the rank or using a sliding window to compute a moving average. You can also write your own aggregate function, which works much like reduce in JavaScript: it takes an initial value and applies a transition function to every row in the group.
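A moving-average sketch with a window function (hypothetical sales table again):

```sql
SELECT region, day, amount,
       avg(amount) OVER (
           PARTITION BY region                        -- one group per region
           ORDER BY day                               -- ordering matters here
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW   -- sliding 3-row window
       ) AS moving_avg
FROM sales;
```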
A recursive CTE (WITH RECURSIVE) can be used in a SELECT statement, which is useful for traversing a tree structure and returning all the results with a single query. PostgreSQL also has JSON support, allowing you to return rows or write to a table in JSON format, plus functions to iterate through a JSON structure and return it as multiple rows.
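A tree-traversal sketch with a recursive CTE (the categories table is hypothetical):

```sql
WITH RECURSIVE tree AS (
    SELECT id, parent_id, name
    FROM categories WHERE parent_id IS NULL   -- start at the roots
    UNION ALL
    SELECT c.id, c.parent_id, c.name          -- then descend level by level
    FROM categories c
    JOIN tree t ON c.parent_id = t.id
)
SELECT * FROM tree;
```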
Chapter 5. PostgreSQL lets you see which query each connection is running and terminate it; since many queries are very fast, it also shows the query that most recently completed on each connection. Table statistics can reveal tables with a high number of sequential scans, suggesting that an index might help there, or conversely indexes that take up a lot of space but are rarely used and may be candidates for deletion. You can also query stats tables to track the progress of operations like replication and checkpointing to disk. A simple way to track performance problems is to find the slowest queries; there is a stats table for this too. Remember to reset it once in a while so it reflects recent information.
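A sketch of the monitoring views involved (the pid is made up):

```sql
-- One row per connection; idle connections show their last query
SELECT pid, state, query FROM pg_stat_activity;

-- Politely cancel a query, or kill the whole backend if that fails
SELECT pg_cancel_backend(12345);
SELECT pg_terminate_backend(12345);
```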
Chapter 6. There are several ways to perform a join: nested loop (viable only if at least one side is small), hash join (building a hash table from one side and probing it with the other), and merge join (if the data is sorted or there is an index). The query planner decides the best join method and the optimal join order using a search algorithm. Additionally, equality constraints are propagated across joined tables, and constant expressions and immutable functions are folded into constants whenever possible, which can allow an index lookup instead of a join, as long as the final result is unaffected.
Use the EXPLAIN ANALYZE command to compare estimated versus actual execution. Sometimes the estimates are off, leading the planner to pick a suboptimal plan. It is possible to override the planner during joins by disabling certain operations, like hash joins, but only do this for one statement and never globally. You can partition tables to distribute data among several child tables, such as by date range or by hash, and add check constraints so queries can skip partitions that are irrelevant. Parallel workers and JIT compilation can speed up queries even more; heuristics based on estimated execution cost decide how many workers to allocate, and this can be overridden in the configuration.
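A sketch of comparing a plan with and without hash joins; enable_hashjoin is a real planner setting, the table names are hypothetical:

```sql
EXPLAIN ANALYZE
SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;

BEGIN;
SET LOCAL enable_hashjoin = off;  -- affects only this transaction
EXPLAIN ANALYZE
SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;
COMMIT;
```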
Chapter 7. Functions are slightly different from stored procedures: a function may not control transactions because it might be part of a statement like a SELECT; in contrast, a stored procedure may start transactions and is invoked with the CALL command. Several languages are supported; the function body is quoted with $$ and the language is declared after it. The default language is PL/pgSQL, which is essentially SQL with added control flow, something not really possible in plain SQL, and is a little bit slower. Make sure to use the right quoting function to avoid SQL injection problems, and functions should use cursors to return data in chunks instead of materializing the entire result in memory. A trigger can make a function execute on INSERT, UPDATE, or DELETE; the function knows which table and which row have changed and runs once per row. Functions can be written in Perl or Python instead of PL/pgSQL, and you have to decide between trusted and untrusted mode – in untrusted mode, the code can access system functions with the same permissions as the host process. The SPI (Server Programming Interface) lets both languages interact with database internals.
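A minimal PL/pgSQL trigger sketch; the audit_log and orders tables are hypothetical:

```sql
CREATE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log (table_name, changed_at)
    VALUES (TG_TABLE_NAME, now());   -- TG_TABLE_NAME identifies the table
    RETURN NULL;                     -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION log_change();
```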
Chapter 8. By default, PostgreSQL only listens on the local machine, which is fine if the application runs on the same host. Otherwise, pg_hba.conf contains a set of rules for which IP addresses and users may connect, whether with a password or an SSL key. Roles and users configure access to database permissions at a system level. They are essentially the same object, except that a plain role typically cannot log in directly; the intended usage is that roles hold permissions and are granted to users, who can then log in to a database instance.
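A sketch of the role-versus-user pattern (names are made up):

```sql
CREATE ROLE readonly NOLOGIN;               -- holds permissions, cannot log in
CREATE USER alice LOGIN PASSWORD 'secret';  -- can log in
GRANT readonly TO alice;                    -- alice inherits the role's rights
```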
Once a user can connect to an instance, database-level permissions come into play. As of PostgreSQL 15, users have no permissions on schemas by default. This differs from the previous behavior, where everybody could create tables in the public schema; that is usually not intended, so now you have to explicitly grant permission even on the public schema. Permissions can be set on tables, on columns, or on rows with row-level security (RLS). Policies can be permissive or restrictive, which determines whether multiple policies on a table are combined with OR or with AND. With RLS, a SELECT statement only returns the rows the user is permitted to see. Configuring permissions is possible from the CLI, but it is often easier with a GUI tool. When dropping users and roles, the database ensures that all the objects they own are either removed or reassigned to someone else.
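A row-level-security sketch on a hypothetical documents table:

```sql
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Each user sees only their own rows
CREATE POLICY own_docs ON documents
    FOR SELECT
    USING (owner = current_user);
```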
Chapter 9. The pg_dump command is the most straightforward way to back up a database. It produces a set of plain-text statements that can be fed back into the psql command, and it guarantees a consistent dump even while writes are happening to the database. It supports several ways of configuring login options and credentials. It operates at the database level; to dump instance-level globals, use the pg_dumpall utility.
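A minimal backup/restore sketch (database and file names are made up):

```sh
# Plain-text dump of one database, restored through psql
pg_dump -U app mydb > mydb.sql
psql -U app mydb_restore < mydb.sql

# Instance-level globals such as roles and tablespaces
pg_dumpall --globals-only > globals.sql
```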
Chapter 10. The Write-Ahead Log (WAL) is a binary format that helps PostgreSQL maintain consistency in case of crashes. The WAL directory used to be called pg_xlog, which caused some people to delete it by accident; it must never be deleted manually, since that risks data loss, and the database handles reclaiming the disk space itself. WAL can be configured to automatically stream and archive the logs somewhere, enabling point-in-time recovery (PITR) rather than reverting to the previous backup. A backup consists of a base backup plus a WAL file stream. The restore procedure involves specifying the desired point in time, configuring recovery to pause while you check whether it is the right moment, and either trying again or, if it looks good, promoting the restored database. Once promoted, it cannot replay any more WAL.
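A PITR configuration sketch; the archive path and timestamp are placeholders:

```ini
# postgresql.conf on the primary: archive every completed WAL segment
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'

# During recovery: stop just before this moment and pause for inspection
recovery_target_time = '2024-05-01 12:00:00'
recovery_target_action = 'pause'
```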
Replication can be asynchronous or synchronous. Long-running operations on either the primary or a standby can cause the standby to fall behind, but various configuration options mitigate this, such as limiting the maximum delay. Synchronous replication has several levels (configurable per transaction): off is equivalent to asynchronous replication; otherwise the primary can wait for the standby to receive and write the data, or wait until the standby has applied it and made it visible to queries (the slowest but safest option). Only when the replica confirms is the transaction considered complete on the primary. Thus, synchronous replication is slower than asynchronous but carries a lower risk of data loss. Use replication slots to retain the WAL until all replicas have consumed it, but configure them carefully, or they can use more disk space than expected. Logical replication can publish only some tables instead of the whole database. Patroni is a tool to manage a cluster of PostgreSQL nodes with a REST interface to control them. It automatically handles network and node failures, performs leader election using a consensus algorithm, and uses etcd as the distributed data store.
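A sketch of the pieces involved; standby1 is a made-up name:

```sql
-- Retain WAL until the standby has consumed it (watch the disk usage
-- if the standby goes offline for a long time)
SELECT pg_create_physical_replication_slot('standby1');

-- Per-transaction durability: wait until the standby has applied the change
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
-- ... writes ...
COMMIT;
```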
Chapter 11. Use the CREATE EXTENSION command to install an extension into a database, optionally installing its dependencies as well. Extensions can add new index types, such as bloom filters or KNN queries, expose external files or other data sources as tables, inspect tables at the binary level, expose performance stats, and much more.
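An installation sketch; note that pg_stat_statements also has to be preloaded via shared_preload_libraries before it is usable:

```sql
-- CASCADE pulls in any extensions this one depends on
CREATE EXTENSION IF NOT EXISTS bloom CASCADE;

-- Query-performance statistics (requires shared_preload_libraries)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```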
Chapter 12. Various PostgreSQL views can help debug transactions that are taking unusually long, and the server can be set to log queries that exceed a certain duration, to inspect slow performance. Ways to debug performance include finding missing indexes on tables that are frequently scanned sequentially, and checking for slowness caused by running out of memory and spilling to disk.
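The slow-query threshold is a single setting; the 500 ms value here is just an illustration:

```sql
-- Log every statement that runs longer than 500 ms
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();  -- apply without restarting
```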
Chapter 13. On migrating to PostgreSQL from various other databases like Oracle and MySQL: most of the advanced features have equivalents, added across different PostgreSQL versions. Most likely the feature you are using is supported, though perhaps with different syntax, and there are tools available to help you migrate.