I have been using pg_standby quite a bit lately, and thus we needed some management tools around it.  One simple thing that was sorely missed was lag monitoring.  When you have 100′s of standby’s,  and many of them on the bleeding edge of keeping up with the primary, then lag monitoring is very important.  Heck, it’s even important if you just have one standby database. If your standby is lagging when you need it, you might effect your ability to have a up to date database when you fail over to it. The problem is that PostgreSQL has no way to actively query the database for it’s lag amount, because it’s actively recovering.  So one must rely on the control file for the last checkpoint time as the most accurate marker of how up to date the standby actually is. Here is some code for implementing this:Â
#!/home/postgres/python/bin/python # # # script returns number of seconds since last checkpoint # to be used for standby lag detection # import os import commands import string import time import datetime pg_controldata ="/usr/local/pgsql/bin/pg_controldata" cmd =pg_controldata+" /data/mydb" out = commands.getoutput(cmd) line = string.split(out, "\n") checkpoint_date_dict = string.split(line[16], ": ") checkpoint_date=checkpoint_date_dict[1].lstrip() # format: Thu 08 May 2008 01:58:18 PM PDT checkpoint_epoch = int(time.mktime(time.strptime(checkpoint_date, '%a %d %B %Y %H:%M:%S %p %Z'))) t = datetime.datetime.now() now = time.mktime(t.timetuple()) print "lag="+str(now-checkpoint_epoch)
The output of this script is quite simple, so it can be called from monitoring solutions like HQ or whatever you use in-house. You will need to alter the line that shows where your base installation is; pg_controldata+” /data/mydb/” in my example.
$>./pg_standby_lag.py $>lag=10
This indicates the standby is 10 seconds lagging from the primary. Perhaps you alert when you see lag over 3600 (or 1 hour).

Sorry for losing the spacing (kinda important with python), just email me if you want the script..
I’m looking to migrate to Postgres from Oracle 10g for a client, but the standby database is one thing that has me holding back. Since you have a lot of experience with it, I would like to hear more of your experiences.
Postgres standby seems pretty weak at the moment compared to Oracle. You cannot open the standby in read-only mode to do any reporting and basic things like querying for lag time are lacking. I realize it is much cheaper than Oracle (I’m looking at Postgres Plus from Enterprisedb), but Oracle’s Dataguard is kick ass and with Active Dataguard, where the standby can be applying redo logs and be live at the same time, is sweet, but costly. Postgres needs to add read-only standby at least and make the setup gui based and easy to do, but for the difference in cost I can deal with that.
The Postgres standby also needs a mount point to write the WAL files to the remote server. I want to do a standby to a remote disaster recovery site and will not have NFS mount point possible. What do people usually do for this? Do you just setup scp the files in the archive_command ? or ?
Enzo,
If you require a fully featured standby mechanism, then Oracle has a much better model. In fact, mysql does as well. This is one area where PostgreSQL is lagging behind (no pun intended), however, I believe things will be getting better in PG 8.4. I have used Dataguard for a number of years in some crazy ways and I agree with you that it’s fairly robust and feature-full.
In terms of writing wal logs, you don’t really need an NFS mount or anything to sync the wall files. You can build an rsync script into your archive_command in postgresql.conf (or even a one liner if you want). For instance in postgresql.conf:
archive_command = '/var/home/postgres/scripts/backup/pg_archiver.py -p %p -t archive_logs -l 172.16.130.12'And then in pg_archiver.py:
#!/var/home/postgres/python/bin/python
# pg_archiver.py script
from optparse import OptionParser
import commands
import sys
import os
parser = OptionParser()
parser.add_option("-p",dest="path",help="path for log (%p)")
parser.add_option("-l",dest="logsink",help="logsink machine name to rsync to")
parser.add_option("-t",dest="target",help="rsync target on logsink machine")
(options, args) = parser.parse_args()
logsink=options.logsink
target=options.target
path=options.path
# this gets the base working dir. In PG this indicates a unique identifier for a PG database
def getbase():
ppid=os.getppid()
cwd=os.readlink("/proc/"+str(ppid)+"/cwd")
rwd=cwd.replace("/","_")
return cwd,rwd
hostname=commands.getstatusoutput("hostname")
base,suffix=getbase()
destfile=base+"/"+path+"_"+hostname[1]+".archive"
try:
cmd="ln -s " + base + "/" + path + " " + destfile
commands.getstatusoutput(cmd)
cmd="rsync -av --copy-links "+destfile+" "+logsink+"::"+target+"/"+hostname[1]+suffix+"/"
out=commands.getstatusoutput(cmd)
if (out[0] == 0):
cmd="rm "+destfile
out=commands.getstatusoutput(cmd)
sys.exit(0)
except OSError, info:
print str(info)
sys.stderr.write(str(info))
sys.exit(999)
This is exactly what I needed today. I had to change %B in the format string for strptime() to %b (month abbreviation) and the %H to %I so that 05:42:00 PM would be interpreted as 17:42:00 for the comparison.