pg_reorg 1.0.4

At Hi5, we currently use pg_reorg1.0.3 in order to organize data in a clustered fashion. I posted previously about the strategy. Our version is slightly modified, the modifications I made to the C code essentially allow pg_reorg to spin/wait for locks on the objects to be released before proceeding.

The good news is the folks at NTT have incorporated a similar change in pg_reorg 1.0.4. This is a fantastic improvement, and frankly implemented in a cleaner way than my changes.

The crux of the issue is the situation where a database is being auto-vacuumed, you can’t be guaranteed that pg_reorg and the vacuum will not collide. In theory you should not need to vacuum a table which you are pg_reorg’ing because that is the point of a pg_reorg, it’s essentially a vacuum full w/ extra features because the table is being rebuilt from scratch. However in an environment where auto-vacuum is being utilized to keep tables vacuumed, both will need to co-exist.

The change is simple, use the NOWAIT option of lock table to fail if the lock can not be obtained. This is wrapped in a loop until the lock is granted. The effect is pg_reorg patiently sits and waits while your vacuums complete and then it can finish it’s work. The downside is if any of these operations run for too long, then the journal table may grow very large. So there should be some monitoring wrapped around the code if it’s intended to run in the background. For the future we need a backoff algorithm as well as perhaps a limit to the number of spin/sleep cycles, but hey this is excellent progress.

This tool is essential in my humble opinion for everyone running PostgreSQL in a high transaction/high availability environment. By the way, pg_reorg works seamlessly with Slony-I.

The code addition does the following:

for (;;)
                command("BEGIN ISOLATION LEVEL READ COMMITTED", 0, NULL);
                res = execute_nothrow(table->lock_table, 0, NULL);
                if (PQresultStatus(res) == PGRES_COMMAND_OK)
                else if (sqlstate_equals(res, SQLSTATE_LOCK_NOT_AVAILABLE))
                        /* retry if lock conflicted */
                        command("ROLLBACK", 0, NULL);
                        /* exit otherwise */
                        printf("%s", PQerrorMessage(connection));

The text below is a snip of the strace on pg_reorg while it’s waiting for the lock:

rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
sendto(3, "P\0\0\0008\0SELECT reorg.reorg_apply($"..., 529, 0, NULL, 0) = 529
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(3, "1\0\0\0\0042\0\0\0\4T\0\0\0$\0\1reorg_apply\0\0\0\0"..., 16384, 0, NULL, NULL) = 77
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
sendto(3, "P\0\0\0\177\0SELECT 1 FROM pg_locks WHE"..., 178, 0, NULL, 0) = 178
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(3, "1\0\0\0\0042\0\0\0\4T\0\0\0!\0\1?column?\0\0\0\0\0\0\0"..., 16384, 0, NULL, NULL) = 74
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})

Your Postgresql log file will show the following:

Jun 25 17:09:33 <dbname> postgres[7825]: [37-2] 2009-06-25 17:09:33 PDTSTATEMENT:  LOCK TABLE <tablename> IN ACCESS EXCLUSIVE MODE NOWAIT
Jun 25 17:09:34 <dbname> postgres[7825]: [38-1] 2009-06-25 17:09:34 PDTERROR:  could not obtain lock on relation "<tablename>"

[Read pg_reorg 1.0.4]

Fusion-io SSD

I got the opportunity to test out some of the new Fusion-io <a href=”” target=1>Solid State</a> ioDrive, and I thought I would post some results.

Fusion-io has created a SSD product called ioDrive that is based on PCIe cards vs replacing SAS or SATA drives with SSD directly. This approach allows for much lower latency because of the use of the PCIe bus vs traditional disk channels geared towards slow disk. The 320GB model I used in my test are made of Multi Level Cell (MLC) NAND flash and are quoted by Fusion-io to achieve throughput somewhere in the 70k IOPS neighborhood.

[Read Fusion-io SSD]

pgstat 0.8beta released on PgFoundry

I moved the pgstat (previously named pgd) project to pgfoundry. Thank you to the folks over there approving the project. I added a column for ‘active’ processes from pg_stat_activity as well as some fixes requested by Devrim that I really should have had done from the start. Thanks for the contribution. Downloads can be found here. Oh, and I will remove the OSX file from the .tar file in the future.

Want something added? Just comment on this post..

[Read pgstat 0.8beta released on PgFoundry]

Cluster data, save cash

Since the economy is not exactly rocking these days, I suspect there are a lot of companies out there trying to save a buck or two on infrastructure. Databases are not exactly cheap, so anything that an engineer or DBA can do to save cycles is a win. So how do you stretch your existing hardware and make it perform more transactions for the same amount of cash?

Clustering your data is an approach to reducing load and stretching the capacity of your database servers. Clustering data is a technique where data is reorganized to match the query patterns very specifically and thus reducing the amount of logical (and also physical) I/O a database performs. This technique is not RDBMS product specific, it applies to <a href=”” target=1>Oracle</a>, PostgreSQL, or most other block based row-store RDBMS. I am going to reference PostgreSQL and a very specific case where clustering data can produce huge performance gains.

[Read Cluster data, save cash]

The debate about the optimum shared_buffers size in PostgreSQL is clearly far from over. However, I have not seen any results where the buffer cache and the FS cache were tuned in unison.

Because PostgreSQL knows what buffers are in use, and knows when to flush the dirty buffers it is ideal to use as much of it as possible. Using an secondary cache (the FS) as in a ‘traditional’ PostgreSQL configuration, just introduces more workload in the system. The secondary cache needs a process to wake up and flush the buffers (pdflush), and also has to manage it’s own LRU, and likely has many duplicate buffers sitting there wasting your precious RAM.

The point I am trying to stress is that the PostgreSQL cache *and* the FS cache must be tuned together. In this case, I remove the FS cache entirely using direct I/O, and then take over that space with the fastest cache in the stack; the PostgreSQL buffer cache.

So here is a bit of testing to show the results.

The testing was performed on a set of nice new HP DL185 G5′s with 14 72GB 15k SAS Drives and Quad Core Opteron processors and 32GB of main memory. The OS is SUSE Linux EnterpriseServer 10 SP1 (x86_64) The filesystem is Veritas VxFS file system. The version of PostgreSQL tested is 8.2.5.

The test itself is a set of custom pgbench scripts. These scripts were developed using actual data from production PostgreSQL log files, and pgfouine. The access pattern is about 70% read, with 15% update and 15% insert. There is no delete activity. This matches a very specific workload on a subset of our production databases. The workload is random, and the total data size is larger than our cache size by a very large amount, this guarantees a good workout of physical disk and caches in front of them. The WAL logs and data were on the same VxFS volume.

The test runs were gathered with pgd, and graphed. Between each test the filesystem was unmounted and remounted. I tested each run 3 times and averaged the tests.

VxFS buffered I/O
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends

VxFS buffered I/O
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends

VxFS with convosync=direct,mincache=direct
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends

A modest gain can be had when using a very large (comparatively) shared_buffers setting when combining that change with direct I/O. The PostgreSQL cache does scale quite nicely up to at least a 20GB cache size when configured in this manner.

Further gains could likely be achieved by separating the WAL logs with specific VxFS tunables as well as tuning other PostgreSQL parameters due to the larger cache (bgwriter tunables, etc).

[Read Tuning for very large shared_buffers, using Veritas VxFS]

Graphing pgstat output

If you’re using the <a href=>pgd pgstat utility</a> I posted about previously, you can graph the output with very little effort using <a href=>gnuplot</a>. In my case I use pgd pgstat for capturing output for various PostgreSQL performance tests, and of course graphing that output is important.

[Read Graphing pgstat output]

pgstat: a database utility like iostat

I needed a utility for capturing various data-points about a PostgreSQL database as I performed load tests. I copied a utility I have used previously on Oracle that worked quite well. The new utility is called pgd pgstat. This utility spits out various DB metrics on the command line similar to iostat. It’s good for quick diagnosis, performance testing, etc. The output is suitable for import/graphing in excel as well, just use space delimited format when importing the data. Here is a screenshot of how the output appears. You can <a href=>download the utility</a> and give it a whirl.

<a href= target=1><img src= width=450></a>

[Read pgstat: a database utility like iostat]

In PostgreSQL, the working directory is a unique identifier for a database, and sometimes you want to use that working dir in your script(s). But what if you don’t want to actually connect and query the database? Is there a way to find out the unique identifier for the db? In a high transaction environment, making the archive process fast and efficient is very important, so keeping an eye on efficiency when coding up these little bits is paramount. The key lies in the /proc/ filesystem in Linux. Here is a little function that returns the unique identifier:

import os
def getbase():  
  return cwd,rwd

This comes in handy, for instance, in a script that processes archived WAL logs. This way no db queries are required in order to uniquely name logs by prepending the variable ‘rwd’ from the above example to it’s name. One thing to remember is that os.getppid is the parent process, so if you launch this by hand on the OS, you will not get the directory you expect.

[Read Unique identifier for a database without connecting to the database?]

After reading Allan Packers excellent blog entry ‘are proprietary databases doomed’ , a couple of additional thoughts came to mind. He mentions that open source databases (OSDB’s) are becoming “good enough”, and explains about how most users don’t need or use the vast amount of the features available in the proprietary databases, and thus, OSDB’s are becoming more popular. Coming from a proprietary background myself, scaling Oracle at Paypal and then Ebay, I know that side of the business pretty well. In the last year, I have made the professional leap to OSDB’s in a big way at Hi5, where we do quite a bit of workload on an OSDB - PostgreSQL. So Allan’s points hit home for me.

[Read Open source databases; scaling the application stack.]

I got some email and comments about the code I used for <a href=>my post</a>; Mapping Linux memory for PostgreSQL using fincore so I thought I would post the code I am using. Nothing too fancy here, it needs a config file and some more bits. I highly recommend someone do this in perl vs python and incorporate fincore as a module instead. I used <a href=>psycopg2</a>, other than that, all the other modules are stock with python <a href=>2.5.2</a>. If someone wants to show me how to query the data directory PostgreSQL is currently using and incorporate that into the script vs the static string: mydir = “/data/mydb/base”, that would be great. In order to use this script, you must change the variables for fincore, and mydir below.

# script to find show memory usage of PG buffers in OS cache
# 2008 kcg
import os
import psycopg2
import commands
import re
import sys
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-u","--username",dest="username",help="username for PostgreSQL")
parser.add_option("-m","--machine",dest="machine",help="machine to connect to.. aka: hostname")
parser.add_option("-d","--dbname",dest="dbname",help="database name to connect to")
parser.add_option("-p","--password",dest="password",help="password for PostgreSQL")
(options, args) = parser.parse_args()

osmem   ={}
# change these to match actual locations
fincore ="/home/kgorman/"
mydir   = "/data/mydb/base"

# get list of dbs on host, and return dictionary of db=oid sets
def lookup_dbs():
 connectstr="host="+options.machine+" dbname="+options.dbname+" user="+options.username+" port=5432 password="+options.password
 sql="select datname,oid from pg_database where datname = '"+options.dbname+"' and datname not like '%template%'"
 for d in curs.fetchall():
 return dbs

# get object
def lookup_oid(oid,dbname):
 connectstr="host="+options.machine+" dbname="+dbname+" user="+options.username+" port=5432 password="+options.password
 sql="select relname from pg_class where oid = "+oid
 for d in curs.fetchall():
  return d[0]

for v, i in dbs.iteritems():
  for ii in os.listdir(mydir+"/"+str(i)):
     p = re.compile('\d')
     if p.match(ii):
     	#print ii
	cmd=fincore+" "+fullpath
	#print cmd
	#print pages
	n=pages[1].split(' ')
	if p.match(size):
           if rel:

# sort and output
sdata=sorted(osmem.iteritems(), key=lambda (k,v): (v,k), reverse=True)
print "OS Cache Usage:"
while a < len(sdata):
  print sdata[a][0]+":"+str(sdata[a][1])

[Read Python script showing PostgreSQL objects in linux memory:]

pg_standby lag monitoring

I have been using pg_standby quite a bit lately, and thus we needed some management tools around it.  One simple thing that was sorely missed was lag monitoring.  When you have 100′s of standby’s,  and many of them on the bleeding edge of keeping up with the primary, then lag monitoring is very important.  Heck, it’s even important if you just have one standby database.  If your standby is lagging when you need it, you might effect your ability to have a up to date database when you fail over to it. The problem is that PostgreSQL has no way to actively query the database for it’s lag amount, because it’s actively recovering.  So one must rely on the control file for the last checkpoint time as the most accurate marker of how up to date the standby actually is. Here is some code for implementing this: 

# script returns number of seconds since last checkpoint
# to be used for standby lag detection
import os
import commands
import string
import time
import datetime

pg_controldata  ="/usr/local/pgsql/bin/pg_controldata"
cmd             =pg_controldata+" /data/mydb"

out = commands.getoutput(cmd)
line = string.split(out, "\n")
checkpoint_date_dict = string.split(line[16], ": ")
# format: Thu 08 May 2008 01:58:18 PM PDT
checkpoint_epoch = int(time.mktime(time.strptime(checkpoint_date, '%a %d %B %Y %H:%M:%S %p %Z')))

t =
now = time.mktime(t.timetuple())

print "lag="+str(now-checkpoint_epoch)

The output of this script is quite simple, so it can be called from monitoring solutions like <a href=>HQ</a> or whatever you use in-house. You will need to alter the line that shows where your base installation is; pg_controldata+” /data/mydb/” in my example.


This indicates the standby is 10 seconds lagging from the primary. Perhaps you alert when you see lag over 3600 (or 1 hour).

[Read pg_standby lag monitoring]

PostgreSQL, more-so than other databases generally relies on OS file system cache to keep performance high. It is pretty common not to see gains in performance for buffer cache sizes over about 2GB, why this is true I am unsure, but perhaps the tracking of buffer overhead becomes high? Anyways, since the OS buffer cache keeps file buffers in memory as well, there are lots of database objects in the PostgreSQL cache, OS cache, or both. We can debate <a href=>double buffering</a> later.

To see what is in the PostgreSQL cache, one can use the contrib module (and now built in in 8.3) <a href=>pg_buffercache</a> to map each object with it’s footprint in cache. But, if one wanted to see the entire cache picture, including OS cache, how would that be done?

Enter fincore, pronounced ‘eff in core’;. This utility was originally presented at <a href=>LISA2007</a> by David Plonka, Archit Gupta, and Dale Carder of University of Wisconsin at Madison. This utility (and the sister utility <a href=>fadvise</a>) are great utilities to find out what your PostgreSQL database has pulled into OS cache.

In order to see what database objects these files map to, we can just query the database itself and join each database object to it’s associated OS file using pg_class.relfilenode. Then call fincore for each file name, and output the resulting # of pages per database object. Slow but simple.

I put together some python to glue this all together. The best way would be to use perl, and use fincore as a module. In our environment at least we have standardized on python and psycopg2 for database access. So this script utilizes fincore on the command line and captures the output. Perhaps not as clean as I would have liked, but this tool is not used every day. In our case we left fincore stock as we downloaded it in order to keep these components as separate as possible. This is also a diagnostic tool; it’s quite slow and resource consumptive. If you have lots of objects things might get slow. YMMV.

Using this code, one can see the set of most used buffers in the OS cache, and compare them to the PG cache. Sizes are in bytes.


And then you can see that the PG buffer cache has the following:

    postgres=# SELECT current_database(),c.relname, count(*)*8192 as bytes
	postgres-# FROM pg_buffercache b INNER JOIN pg_class c
	postgres-# ON b.relfilenode = c.relfilenode AND
	postgres-# b.reldatabase IN (0, (SELECT oid FROM pg_database
	postgres(# WHERE datname = current_database()))
	postgres-# GROUP BY c.relname
	postgres-# ORDER BY 3 DESC LIMIT 25;

	 current_database |             relname             |   bytes
	 postgres         | testtable                       | 376979456
	 postgres         | accounts                        | 309608448
	 postgres         | accounts_pkey                   | 225583104
	 postgres         | testtable_pk                    | 141058048
	 postgres         | branches                        |  10715136
	 postgres         | tellers                         |   5488640
	 postgres         | history                         |   2457600
	 postgres         | tellers_pkey                    |    360448
	 postgres         | pg_proc                         |    131072
	 postgres         | pg_operator                     |    106496
	 postgres         | pg_attribute                    |     65536
	 postgres         | pg_operator_oprname_l_r_n_index |     65536
	 postgres         | pg_proc_oid_index               |     65536
	 postgres         | branches_pkey                   |     57344
	 postgres         | pg_statistic                    |     57344
	 postgres         | pg_attribute_relid_attnum_index |     49152
	 postgres         | pg_class                        |     49152
	 postgres         | pg_class_relname_nsp_index      |     49152
	 postgres         | pg_depend_reference_index       |     49152
	 postgres         | pg_type_typname_nsp_index       |     40960
	 postgres         | pg_proc_proname_args_nsp_index  |     40960
	 postgres         | pg_statistic_relid_att_index    |     40960
	 postgres         | pg_operator_oid_index           |     32768
	 postgres         | pg_opclass_am_name_nsp_index    |     24576
	 postgres         | pg_amop_opr_opc_index           |     24576

[Read Mapping Linux memory for PostgreSQL using fincore]