I run multiple web services, mostly related to i3wm.org. All of them use PostgreSQL as their
database, so the data that is stored in that PostgreSQL database is pretty
important to me and the users of these services.
Since a while now, I have been thinking about storing that data in a more
reliable way. Currently, it is stored on a single server, and is backed up to
two different locations (one on-site, one off-site) every day. The server in
question has a RAID-1 of course, but still: the setup implies that if that one
server dies, the last backup may be about a day old in the worst case, and also
it could take me significant time to get the services back up.
The areas in which I’d like to improve my setup are thus:
Durability: In case the entire server dies, I want to have an
up-to-date copy of all data.
Fault tolerance: In case the entire server dies, I want to be
able to quickly switch to a different server. A secondary machine should be
ready to take over, albeit not fully automatically because fully automatic
solutions typically are either fragile or require a higher number of servers
than I’m willing to afford.
For PostgreSQL, there are various settings and additional programs that you can
use which will provide you with some sort of clustering/replication. There is
an overview in
the PostgreSQL wiki (“Replication, Clustering, and Connection Pooling”). My
solution of choice is pgpool2 because it seems
robust and mature (yet under active development) to me, it is reasonably well
documented and I think I roughly understand what it does under the covers.
I have two servers, located in different data centers, that I will use for this
setup. The number of servers does not really matter, meaning you can easily add
a third or fourth server (increasing latency with every server of course).
However, the low number of servers places some restrictions on what we can do.
As an example, solutions that involve global consistency based on paxos/raft
quorums will not work with only two servers. As a consequence, master election
is out of the question and a human will need to do the actual
Each of the two servers will run PostgreSQL, but only one of them will run
pgpool2 at a time. The DNS records for e.g. faq.i3wm.org will point to the
server on which pgpool2 is running, so that server handles 100% of the traffic.
Let’s call the server running pgpool2 the primary, and the other server the
secondary. All queries that modify the database will still be sent to the
secondary, but the secondary does not handle any user traffic. This could be
accomplished by either not running the applications in question, or by having
them connect to the pgpool2 on the primary.
When a catastrophe happens, the DNS records will be switched to point to the
old-secondary server, and pgpool2 will be started there. Once the old-primary
server is available again, it will become the secondary server, so that in case
of another catastrophe, the same procedure can be executed again.
With a solution that involves only two servers, an often encountered problem
are split-brain situations. This means both servers think they are primary,
typically because there is a network partition, meaning the servers cannot talk
to each other. In our case, it is important that user traffic is not handled by
the secondary server. This could happen after failing over because DNS heavily
relies on caching, so switching the record does not mean that suddenly all
queries will go to the other server — this will only happen over time. A
solution for that is to either kill pgpool2 manually if possible, or have a
tool that kills pgpool2 when it cannot verify that the DNS record points to the
I apologize for the overly long lines in some places, but there does not seem
to be a way to use line continuations in the PostgreSQL configuration file.
Installing and configuring PostgreSQL
The following steps need to be done on each database server, whereas pgpool2
will only be installed on precisely one server.
Also note that a prerequisite for the configuration described below is that
hostnames are configured properly on every involved server, i.e.
-f should return the fully qualified hostname of the server in question,
and other servers must be able to connect to that hostname.
apt-get install postgresql postgresql-9.4-pgpool2 rsync ssh
cat >>/etc/postgresql/9.4/main/postgresql.conf <<'EOT'
listen_addresses = '*'
max_wal_senders = 1
wal_level = hot_standby
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/9.4/main/archive_log/backup_in_progress || (test -f /var/lib/postgresql/9.4/main/archive_log/%f || cp %p /var/lib/postgresql/9.4/main/archive_log/%f)'
install -o postgres -g postgres -m 700 -d \
systemctl restart postgresql.service
pgpool comes with an extension (implemented in C) that provides a couple of
functions which are necessary for recovery. We need to “create” the extension
in order to be able to use these functions. After running the following
command, you can double-check with
\dx that the extension was
echo 'CREATE EXTENSION "pgpool_recovery"' | \
su - postgres -c 'psql template1'
During recovery, pgpool needs to synchronize data between the PostgreSQL
servers. This is done partly by running
pg_basebackup on the
recovery target via SSH and using
rsync (which connects using
SSH). Therefore, we need to create a passwordless SSH key for the
postgres user. For simplicity, I am implying that you’ll copy the
authorized_keys files onto every
database node. You’ll also need to connect to every other database server once
in order to get the SSH host fingerprints into the known_hosts file.
su - postgres
ssh-keygen -f /var/lib/postgresql/.ssh/id_rsa -N ''
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
We’ll also need to access remote databases with
non-interactively, so we need a password file:
su - postgres
echo '*:*:*:postgres:wQgvBEusf1NWDRKVXS15Fc8' > .pgpass
chmod 0600 .pgpass
When pgpool recovers a node, it first makes sure the data directory is up to
date, then it starts PostgreSQL and tries to connect repeatedly. Once the
connection succeeded, the node is considered healthy. Therefore, we need to
postgres user permission to control
apt-get install sudo
cat >/etc/sudoers.d/pgpool-postgres <<'EOT'
postgres ALL=(ALL:ALL) NOPASSWD:/bin/systemctl start postgresql.service
postgres ALL=(ALL:ALL) NOPASSWD:/bin/systemctl stop postgresql.service
Now enable password-based authentication for all databases and replication
traffic. In case your database nodes/clients don’t share a common hostname
suffix, you may need to use multiple entries or replace the hostname suffix by
cat >>/etc/postgresql/9.4/main/pg_hba.conf <<'EOT'
host all all .zekjur.net md5
host replication postgres .zekjur.net md5
After enabling password-based authentication, we need to set a password for the
postgres user which we’ll use for making the base backup:
echo "ALTER USER postgres WITH PASSWORD 'wQgvBEusf1NWDRKVXS15Fc8';" | \
su postgres -c psql
apt-get install pgpool2
gunzip -c /usr/share/doc/pgpool2/examples/\
pgpool.conf.sample-replication.gz > pgpool.conf
To interact with pgpool2, there are a few command-line utilities whose name
pcp_. In order for these to work, we must configure a
username and password. For simplicity, I’ll re-use the password we set earlier
postgres user, but you could chose to use an entirely
echo "postgres:$(pg_md5 wQgvBEusf1NWDRKVXS15Fc8)" >> pcp.conf
In replication mode, when the client should authenticate towards the PostgreSQL
database, we also need to tell pgpool2 that we are using password-based
sed -i 's/trust$/md5/g' pool_hba.conf
sed -i 's/\(enable_pool_hba =\) off/\1 on/g' pgpool.conf
Furthermore, we need to provide all the usernames and passwords that we are going to use to pgpool2:
chown postgres.postgres pool_passwd
pg_md5 -m -u faq_i3wm_org secretpassword
For the use-case I am describing here, it is advisable to turn off
load_balance_mode, otherwise queries will be sent to all healthy
backends, which is slow because they are not in the same network. In addition,
we’ll assign a higher weight to the backend which runs on the same machine as
pgpool2, so read-only queries are sent to the local backend only.
sed -i 's/^load_balance_mode = on/load_balance_mode = off/g' \
Now, we need to configure the backends.
sed -i 's/^\(backend_\)/# \1/g' pgpool.conf
cat >>pgpool.conf <<'EOT'
backend_hostname0 = 'midna.zekjur.net'
backend_port0 = 5432
backend_weight0 = 2
backend_data_directory0 = '/var/lib/postgresql/9.4/main'
backend_hostname1 = 'alp.zekjur.net'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/postgresql/9.4/main'
Overview: How recovery works
Let’s assume that pgpool is running on
midna is handling all the traffic), and
alp.zekjur.net crashed. pgpool will automatically degrade
alp and continue operation. When you tell it to recover
alp because the machine is available again, it will do three
(“1st stage”) SSH into
alp and run pg_basebackup to get a copy of
(“2nd stage”) Disconnect all clients so that the database on
will not be modified anymore. Flush all data to disk on
then rsync the data to
alp. pg_basebackup from 1st stage will have
copied almost all of it, so this is a small amount of data — typically on the
order of 16 MB, because that’s how big one WAL file is.
Try to start PostgreSQL on
alp again. pgpool will wait for 90
seconds by default, and within that time PostgreSQL must start up in such a
state that pgpool can connect to it.
So, during the 1st stage, which copies the entire database, traffic will still
be handled normally, only during 2nd stage and until PostgreSQL started up no
queries are served.
For recovery, we need to provide pgpool2 with a couple of shell scripts that
handle the details of how the recovery is performed.
sed -i 's/^\(recovery_\|client_idle_limit_in_recovery\)/# \1/g' \
cat >>pgpool.conf <<'EOT'
recovery_user = 'postgres'
recovery_password = 'wQgvBEusf1NWDRKVXS15Fc8'
# This script is being run by invoking the pgpool_recovery() function on
# the current master(primary) postgresql server. pgpool_recovery() is
# essentially a wrapper around system(), so it runs under your database
# UNIX user (typically "postgres").
# Both scripts are located in /var/lib/postgresql/9.4/main/
recovery_1st_stage_command = '1st_stage.sh'
recovery_2nd_stage_command = '2nd_stage.sh'
# Immediately disconnect all clients when entering the 2nd stage recovery
# instead of waiting for the clients to disconnect.
client_idle_limit_in_recovery = -1
The 1st_stage.sh script logs into the backend that should be recovered and uses
pg_basebackup to copy a full backup from the master(primary)
backend. It also sets up the
recovery.conf which will be used by
PostgreSQL when starting up.
cat >/var/lib/postgresql/9.4/main/1st_stage.sh <<'EOF'
# Move the PostgreSQL data directory out of our way.
ssh -T $RECOVERY_TARGET \
"[ -d $RECOVERY_DATA ] && mv $RECOVERY_DATA $RECOVERY_DATA.$TS"
# We only use archived WAL logs during recoveries, so delete all
# logs from the last recovery to limit the growth.
# With this file present, our archive_command will actually
# archive WAL files.
# Perform a backup of the database.
ssh -T $RECOVERY_TARGET \
"pg_basebackup -h $MASTER_HOST -D $RECOVERY_DATA --xlog"
# Configure the restore_command to use the archive_log WALs we’ll copy
# over in 2nd_stage.sh.
echo "restore_command = 'cp $RECOVERY_DATA/archive_log/%f %p'" | \
ssh -T $RECOVERY_TARGET "cat > $RECOVERY_DATA/recovery.conf"
cat >/var/lib/postgresql/9.4/main/2nd_stage.sh <<'EOF'
# Force to flush current value of sequences to xlog
psql -p $port -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1|
while read i
if [ "$i" != "" ];then
psql -p $port -c "SELECT setval(oid, nextval(oid)) FROM pg_class WHERE relkind = 'S'" $i
# Flush all transactions to disk. Since pgpool stopped all connections,
# there cannot be any data that does not reside on disk until the
# to-be-recovered host is back on line.
psql -p $port -c "SELECT pgpool_switch_xlog('$MASTER_DATA/archive_log')" template1
# Copy over all archive logs at once.
rsync -avx --delete $MASTER_DATA/archive_log/ \
# Delete the flag file to disable WAL archiving again.
cat >/var/lib/postgresql/9.4/main/pgpool_remote_start <<'EOF'
ssh $1 sudo systemctl start postgresql.service
chmod +x /var/lib/postgresql/9.4/main/1st_stage.sh
chmod +x /var/lib/postgresql/9.4/main/2nd_stage.sh
chmod +x /var/lib/postgresql/9.4/main/pgpool_remote_start
Now, let’s start pgpool2 and verify that it works and that we can access our
first node. The
pcp_node_count command should return an integer
number like “2”. The psql command should be able to connect and you should see
your database tables when using
systemctl restart pgpool2.service
pcp_node_count 10 localhost 9898 postgres wQgvBEusf1NWDRKVXS15Fc8
psql -p 5433 -U faq_i3wm_org faq_i3wm_org
pgpool2 intercepts a couple of SHOW statements, so you can use the SQL command
SHOW pool_nodes to see how many nodes are there:
> SHOW pool_nodes;
node_id | hostname | port | status | lb_weight | role
0 | midna.zekjur.net | 5432 | 2 | 0.666667 | master
1 | alp.zekjur.net | 5432 | 2 | 0.333333 | slave
You could export a cgi-script over HTTP, which just always runs this command,
and then configure your monitoring software to watch for certain strings in the
output. Note that you’ll also need to configure a
www-data user. As an example, to monitor whether
alp is still a healthy backend, match for “alp.zekjur.net,5432,2”
in the output of this script:
exec echo 'SHOW pool_nodes;' | psql -t -A -F, --host localhost \
-U faq_i3wm_org faq_i3wm_org
Performing/Debugging a recovery
In order to recover node 1 (
alp in this case), use:
pcp_recovery_node 300 localhost 9898 postgres wQgvBEusf1NWDRKVXS15Fc8 1
The “300” used to be a timeout, but these days it’s only supported for
backwards compatibility and has no effect.
In case the recovery fails, the only thing you’ll get back from
pcp_recovery_node is the text “BackendError”, which is not very helpful. The
logfile of pgpool2 contains a bit more information, but to debug recovery
problems, I typically strace all PostgreSQL processes and see what the scripts
are doing/where they are failing.
pgpool2 behavior during recovery
In order to see how pgpool2 performs during recovery/degradation, you can use
this little Go program that
tries to do three things every 0.25 seconds: check that the database is healthy
SELECT 1;), run a meaningful SELECT, run an UPDATE.
When a database node goes down, a single query may fail until pgpool2 realizes
that the node needs to be degraded. If your database load is light, chances are
that pgpool2 will realize the database is down without even failing a single
2014-08-13 23:15:27.638 health: ✓ select: ✓ update: ✓
2014-08-13 23:15:28.700 insert failed: driver: bad connection
2014-08-13 23:15:28.707 health: ✓ select: ✓ update: x
During recovery, there is a time when pgpool2 will just disconnect all clients
and not answer any queries any more (2nd stage). In this case, the state lasted
for about 20 seconds:
2014-08-13 23:16:01.900 health: ✓ select: ✓ update: ✓
2014-08-13 23:16:02.161 health: ✓ select: ✓ update: ✓
# no queries answered here
2014-08-13 23:16:23.625 health: ✓ select: ✓ update: ✓
2014-08-13 23:16:24.308 health: ✓ select: ✓ update: ✓
Setting up a PostgreSQL setup that involves pgpool2 is definitely a lot of
work. It could be a bit easier if the documentation was more specific on the
details of how recovery is supposed to work and would include the configuration
that I came up with above. Ideally, something like pgpool2 would be part of
I am not yet sure how much software I’ll need to touch in order to make it
gracefully deal with the PostgreSQL connection dying and coming back up. I know
of at least one program I use (buildbot) which does not handle this situation
well at all — it needs a complete restart to work again.
Time will tell if the setup is stable and easy to maintain. In case I make
negative experiences, I’ll update this article :).