
pg_standby lag monitoring

I have been using pg_standby quite a bit lately, and so we needed some management tools around it.  One simple thing that was sorely missed was lag monitoring.  When you have hundreds of standbys, many of them barely keeping up with the primary, lag monitoring is very important.  Heck, it's important even if you have just one standby database.  If your standby is lagging when you need it, that affects your ability to fail over to an up-to-date database. The problem is that you can't simply query a standby for its lag, because the database is busy recovering and doesn't accept connections.  So one must rely on the control file's last-checkpoint time as the most accurate marker of how up to date the standby actually is. Here is some code for implementing this:

#!/home/postgres/python/bin/python
#
#
# script returns number of seconds since last checkpoint
# to be used for standby lag detection
#
import commands
import time
import datetime

pg_controldata  = "/usr/local/pgsql/bin/pg_controldata"
cmd             = pg_controldata + " /data/mydb"

out = commands.getoutput(cmd)

# find the checkpoint timestamp by its label rather than by line
# number, since pg_controldata output varies between versions
checkpoint_date = None
for line in out.split("\n"):
    if line.startswith("Time of latest checkpoint:"):
        checkpoint_date = line.split(":", 1)[1].lstrip()
        break

# format: Thu 08 May 2008 01:58:18 PM PDT
# note %I (12-hour clock) rather than %H, so %p handles AM/PM correctly
checkpoint_epoch = int(time.mktime(time.strptime(checkpoint_date, '%a %d %B %Y %I:%M:%S %p %Z')))

t = datetime.datetime.now()
now = int(time.mktime(t.timetuple()))

print "lag=" + str(now - checkpoint_epoch)

The output of this script is quite simple, so it can be called from monitoring solutions like HQ or whatever you use in-house. You will need to alter the line that points at your base installation; /data/mydb in my example.

$> ./pg_standby_lag.py
lag=10

This indicates the standby is lagging 10 seconds behind the primary. Perhaps you alert when you see lag over 3600 seconds (one hour).
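For example, here is a minimal Nagios-style wrapper around the lag script that exits nonzero when a threshold is exceeded. This is just a sketch; the script path and threshold are assumptions you would adapt to your environment:

#!/home/postgres/python/bin/python
#
# check_pg_standby_lag.py -- hypothetical alerting wrapper around the
# lag script above; exits 2 (critical) when lag exceeds the threshold,
# the convention most monitoring systems understand
#
import commands
import sys

THRESHOLD = 3600  # seconds; tune to your tolerance for lag

# path to the lag script is an assumption -- adjust for your install
out = commands.getoutput("/home/postgres/scripts/pg_standby_lag.py")
lag = int(out.split("=")[1])

if lag > THRESHOLD:
    print "CRITICAL: standby lag is " + str(lag) + " seconds"
    sys.exit(2)
print "OK: standby lag is " + str(lag) + " seconds"
sys.exit(0)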

7 Discussions on
“pg_standby lag monitoring”
  • Hi,

    Thank you very much for sharing the script with us.

    I found an issue in the script while executing it in my environment: in my case, the checkpoint line number was 17, whereas the script assumes 16.

    checkpoint_date_dict = string.split(line[16], ": ")

    Does the pg_controldata output differ from environment to environment?

  • Hi,
    I came upon this post when I was setting up warm standby replication. It was a big help. I know it comes a bit late, but thank you for this post.

    I would just like to comment on the time conversion: %I should be used for the hours, so that the resulting epoch correctly accounts for AM/PM.

    checkpoint_epoch = int(time.mktime(time.strptime(checkpoint_date, '%a %d %B %Y %I:%M:%S %p %Z')))

    I have to admit, that gave me a bit of a headache till I figured out what was wrong.
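    A quick check in the interpreter shows the difference (a sketch using the timestamp format from the post):

    import time
    # %p is ignored when the hour is parsed with %H, so 01:58:18 PM keeps hour 1
    print time.strptime('01:58:18 PM', '%H:%M:%S %p').tm_hour   # 1
    # with %I the AM/PM marker is honored, and the same string yields hour 13
    print time.strptime('01:58:18 PM', '%I:%M:%S %p').tm_hour   # 13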

    • Hi Kgorman,

      I am new to Python, so I was unable to identify the issue I am facing. Below is the script I am using; can you let me know if the script has any issue?

      Thanks,
      HR

      Script
      —————————————
      #!/usr/bin/python
      #
      #
      # script returns number of seconds since last checkpoint
      # to be used for standby lag detection
      #
      import os
      import commands
      import string
      import time
      import datetime

      #pg_controldata="/usr/pgsql-9.2/bin/pg_controldata"
      print executing pg_controldata
      cmd=pg_controldata+" /var/lib/pgsql/9.2/data"

      print (cmd)

      out=commands.getoutput(cmd)
      line=string.split(out, "\n")
      checkpoint_date_dict=string.split(line[16],": ")
      checkpoint_date=checkpoint_date_dict[1].lstrip()
      # format: Thu 08 May 2008 01:58:18 PM PDT
      checkpoint_epoch=int(time.mktime(time.strptime(checkpoint_date, '%a %d %B %Y %H:%M:%S %p %Z')))

      t=datetime.datetime.now()
      now=time.mktime(t.timetuple())

      print "lag="+str(now-checkpoint_epoch)

      ——————————————————-
      Error:

      -bash-4.1$ ./standby_lag_monitor.sh
      File "./standby_lag_monitor.sh", line 14
      print executing pg_controldata
      ^
      SyntaxError: invalid syntax

  • I'm looking to migrate a client from Oracle 10g to Postgres, but the standby database is one thing that is holding me back. Since you have a lot of experience with it, I would like to hear more about your experiences.

    Postgres standby seems pretty weak at the moment compared to Oracle. You cannot open the standby in read-only mode to do any reporting, and basic things like querying for lag time are lacking. I realize it is much cheaper than Oracle (I'm looking at Postgres Plus from EnterpriseDB), but Oracle's Data Guard is kick-ass, and Active Data Guard, where the standby can be applying redo logs and be live at the same time, is sweet but costly. Postgres needs at least to add a read-only standby and make the setup GUI-based and easy to do, but for the difference in cost I can deal with that.

    The Postgres standby also needs a mount point for writing the WAL files to the remote server. I want to run a standby at a remote disaster recovery site where an NFS mount will not be possible. What do people usually do for this? Do you just scp the files in the archive_command? Or?

    • Enzo,

      If you require a fully featured standby mechanism, then Oracle has a much better model. In fact, MySQL does as well. This is one area where PostgreSQL is lagging behind (no pun intended); however, I believe things will be getting better in PG 8.4. I have used Data Guard for a number of years in some crazy ways, and I agree with you that it's fairly robust and featureful.

      In terms of writing WAL logs, you don't really need an NFS mount or anything to sync the WAL files. You can build an rsync script into your archive_command in postgresql.conf (or even a one-liner if you want). For instance, in postgresql.conf:
      archive_command = '/var/home/postgres/scripts/backup/pg_archiver.py -p %p -t archive_logs -l 172.16.130.12'
      And then in pg_archiver.py:

      #!/var/home/postgres/python/bin/python
      # pg_archiver.py script
      from optparse import OptionParser
      import commands
      import sys
      import os

      parser = OptionParser()
      parser.add_option("-p", dest="path", help="path for log (%p)")
      parser.add_option("-l", dest="logsink", help="logsink machine name to rsync to")
      parser.add_option("-t", dest="target", help="rsync target on logsink machine")
      (options, args) = parser.parse_args()

      logsink = options.logsink
      target = options.target
      path = options.path

      # this gets the base working dir. In PG this indicates a unique identifier for a PG database
      def getbase():
          ppid = os.getppid()
          cwd = os.readlink("/proc/" + str(ppid) + "/cwd")
          rwd = cwd.replace("/", "_")
          return cwd, rwd

      hostname = commands.getstatusoutput("hostname")
      base, suffix = getbase()
      destfile = base + "/" + path + "_" + hostname[1] + ".archive"

      try:
          # symlink the segment to a uniquely named file, rsync the link
          # target, and remove the link only if the copy succeeded
          cmd = "ln -s " + base + "/" + path + " " + destfile
          commands.getstatusoutput(cmd)
          cmd = "rsync -av --copy-links " + destfile + " " + logsink + "::" + target + "/" + hostname[1] + suffix + "/"
          out = commands.getstatusoutput(cmd)
          if (out[0] == 0):
              cmd = "rm " + destfile
              out = commands.getstatusoutput(cmd)
          sys.exit(0)
      except OSError, info:
          print str(info)
          sys.stderr.write(str(info))
          sys.exit(999)
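      And if you'd rather not maintain a script at all, a bare one-liner in postgresql.conf works too. A minimal sketch, assuming a hypothetical standby host and an /archive directory on it (%p is the path of the segment to ship, %f its file name):

      archive_command = 'rsync -a %p standby.example.com:/archive/%f'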

  • This is exactly what I needed today. I had to change %B in the format string for strptime() to %b (month abbreviation) and the %H to %I so that 05:42:00 PM would be interpreted as 17:42:00 for the comparison.
