PostgreSQL, more so than other databases, generally relies on the OS filesystem cache to keep performance high. It is fairly common not to see performance gains for buffer cache sizes over about 2GB; why this is true I am unsure, but perhaps the overhead of tracking buffers becomes high. In any case, since the OS buffer cache keeps file buffers in memory as well, many database objects live in the PostgreSQL cache, the OS cache, or both. We can debate double buffering later.

To see what is in the PostgreSQL cache, one can use the contrib module pg_buffercache (built in as of 8.3) to map each object to its footprint in cache. But if one wanted to see the entire cache picture, including the OS cache, how would that be done?

Enter fincore, pronounced 'eff in core'. This utility was originally presented at LISA2007 by David Plonka, Archit Gupta, and Dale Carder of the University of Wisconsin at Madison. fincore (and its sister utility fadvise) are great tools for finding out what your PostgreSQL database has pulled into the OS cache.

To see which database objects these files map to, we can just query the database itself and join each database object to its associated OS file using pg_class.relfilenode. Then call fincore for each file name and output the resulting number of resident pages per database object. Slow but simple.

I put together some Python to glue this all together. The cleanest way would be to use Perl and call fincore as a module, but in our environment we have standardized on Python and psycopg2 for database access, so this script invokes fincore on the command line and captures its output. Perhaps not as clean as I would have liked, but this tool is not used every day. We also left fincore stock, as downloaded, to keep the components as separate as possible. This is a diagnostic tool; it's quite slow and resource-consumptive. If you have lots of objects, things might get slow. YMMV.
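The core of the glue is just mapping a pg_class row to the file fincore will examine. A minimal sketch of that mapping, assuming a stock data directory layout where a relation's first segment lives at base/&lt;database oid&gt;/&lt;relfilenode&gt; (the helper name and the example paths are illustrative, and 1 GB overflow segments with .1, .2 suffixes are ignored here):

```python
import os

def relation_path(datadir, db_oid, relfilenode):
    # First 1 GB segment of a relation in a stock data directory;
    # fincore would be invoked on this path (and any .N segments).
    return os.path.join(datadir, "base", str(db_oid), str(relfilenode))
```

Each pg_class relname can then be paired with fincore's resident-page count for that path; the exact fincore output format varies by build, so parse accordingly.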

Using this code, one can see the set of most used buffers in the OS cache, and compare them to the PG cache. Sizes are in bytes.


And then you can see that the PG buffer cache has the following:

    postgres=# SELECT current_database(), c.relname, count(*)*8192 AS bytes
    postgres-# FROM pg_buffercache b INNER JOIN pg_class c
    postgres-# ON b.relfilenode = c.relfilenode AND
    postgres-# b.reldatabase IN (0, (SELECT oid FROM pg_database
    postgres(# WHERE datname = current_database()))
    postgres-# GROUP BY c.relname
    postgres-# ORDER BY 3 DESC LIMIT 25;

     current_database |             relname             |   bytes
    ------------------+---------------------------------+-----------
     postgres         | testtable                       | 376979456
     postgres         | accounts                        | 309608448
     postgres         | accounts_pkey                   | 225583104
     postgres         | testtable_pk                    | 141058048
     postgres         | branches                        |  10715136
     postgres         | tellers                         |   5488640
     postgres         | history                         |   2457600
     postgres         | tellers_pkey                    |    360448
     postgres         | pg_proc                         |    131072
     postgres         | pg_operator                     |    106496
     postgres         | pg_attribute                    |     65536
     postgres         | pg_operator_oprname_l_r_n_index |     65536
     postgres         | pg_proc_oid_index               |     65536
     postgres         | branches_pkey                   |     57344
     postgres         | pg_statistic                    |     57344
     postgres         | pg_attribute_relid_attnum_index |     49152
     postgres         | pg_class                        |     49152
     postgres         | pg_class_relname_nsp_index      |     49152
     postgres         | pg_depend_reference_index       |     49152
     postgres         | pg_type_typname_nsp_index       |     40960
     postgres         | pg_proc_proname_args_nsp_index  |     40960
     postgres         | pg_statistic_relid_att_index    |     40960
     postgres         | pg_operator_oid_index           |     32768
     postgres         | pg_opclass_am_name_nsp_index    |     24576
     postgres         | pg_amop_opr_opc_index           |     24576
    (25 rows)

[Read Mapping Linux memory for PostgreSQL using fincore]

Incremental Commits in PostgreSQL

On large and/or busy databases it's often not feasible to update an entire table in one shot. For instance, suppose one has to perform some background DML task against a 300GB table that sees 1000 transactions per second, increasing every credit card limit by $1000. Simply executing an update statement with no predicate would attempt to lock every row in the table and perform the update. Even if the locks were granted on such a busy table, you would block all other updates to those rows (they would enqueue) until you commit (or roll back), and you would also cause lots of reads of before images. This is bad in Oracle (ORA-01555: snapshot too old), and it has vacuum implications in PostgreSQL. So in order to run our theoretical job above, we need some sort of procedure that runs in the background and commits the work every N rows. This is a pretty common technique in Oracle PL/SQL, where a simple construct exists:


DECLARE
  CURSOR mycur IS SELECT id FROM bigtab;
  counter NUMBER := 0;
BEGIN
  FOR myrec IN mycur LOOP
    UPDATE bigtab SET credit_limit = credit_limit + 1000
     WHERE id = myrec.id;
    counter := counter + 1;
    IF MOD(counter, 1000) = 0 THEN
      COMMIT;
    END IF;
  END LOOP;
  COMMIT;
END;
/





The above code incrementally commits every 1000 rows and performs updates until the cursor is exhausted. In PostgreSQL there is no simple PL/pgSQL substitute because COMMIT cannot be called in a function. So what to do? Well, Python has a nice fetchmany(size) construct that allows us to create a small-memory-footprint program that commits incrementally just like the Oracle code above. The variable incremental_commit_size is the size of the commit batch, and throttle_time is an amount to sleep after each commit to make this process even slower and lower impact. The code also uses bind variables so you don't create many cursor permutations.

    #!/usr/bin/env python
    #
    # incremental commits
    # 2008 kcg
    #
    import psycopg2
    import time

    incremental_commit_size = 1000  # number of rows per commit batch
    throttle_time = 0               # seconds to sleep after each commit

    connectstr = "host=localhost dbname=postgres user=postgres port=5432"
    handle = psycopg2.connect(connectstr)
    cursor = handle.cursor()
    cursor2 = handle.cursor()
    sql = "select id from bigtab"
    cursor.execute(sql)

    while 1:
        output = cursor.fetchmany(incremental_commit_size)
        if not output:
            break
        for row in output:
            # update table
            sql = "update bigtab set credit_limit = credit_limit + 1000 where id = %s"
            cursor2.execute(sql, [row[0]])
        # commit, invoked every incremental commit size
        handle.commit()
        time.sleep(throttle_time)

[Read Incremental Commits in PostgreSQL]

int vs bigint

So which one is the better use of space, given that an application uses 32-bit integers? In PostgreSQL, an int column takes only 4 bytes and a bigint takes 8 bytes, so one would think there are significant space savings to be had by using int. In the real world, with disk space costs where they are, it just does not matter in a practical sense. Furthermore, if you are running on an extent-based filesystem like ext3 or vxfs, you really don't see any practical benefit. Why? Your extent size is likely to be much larger than (number of tuples * 4 bytes), so the difference in size is masked because the filesystem rounds up to the next extent size. Consider the example below:
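Rough arithmetic behind that claim, using the row and page counts from the example below (a sketch, not query output):

```python
rows = 2253530               # tuple_count from pgstattuple below
naive_saving = rows * 4      # 4 bytes per row if int replaces bigint: ~8.6 MB
pages = 27822                # relpages reported for both tables
table_bytes = pages * 8192   # 8 KB pages -> 227917824 bytes, ~217 MB
```

The naive ~9 MB saving disappears into alignment padding and filesystem rounding, which is why both tables end up the same size on disk.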

vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
#1 SMP Thu May 17 14:00:09 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux
pg_ctl (PostgreSQL) 8.2.4

postgres=# \d testtable2
           Table "public.testtable2"
  Column  |          Type          | Modifiers
----------+------------------------+-----------
 id       | bigint                 | not null
 text     | character varying(200) |
 thecount | integer                |
Indexes:
    "testtable2_pk" PRIMARY KEY, btree (id)

postgres=# \d testtable3
           Table "public.testtable3"
  Column  |          Type          | Modifiers
----------+------------------------+-----------
 id       | integer                | not null
 text     | character varying(200) |
 thecount | integer                |

postgres=# select count(*) from testtable2;
  count
---------
 2253530
(1 row)

postgres=# select count(*) from testtable3;
  count
---------
 2253530
(1 row)
postgres=# analyze testtable2;
postgres=# analyze testtable3;

postgres=# select tablename, attname, avg_width, n_distinct from pg_stats where tablename = 'testtable2';
 tablename  | attname  | avg_width | n_distinct
------------+----------+-----------+------------
 testtable2 | text     |        56 |          1
 testtable2 | thecount |         4 |          0
 testtable2 | id       |         8 |         -1
(3 rows)

postgres=# select tablename, attname, avg_width, n_distinct from pg_stats where tablename = 'testtable3';
 tablename  | attname  | avg_width | n_distinct
------------+----------+-----------+------------
 testtable3 | thecount |         4 |          0
 testtable3 | id       |         4 |         -1
 testtable3 | text     |        56 |          1
(3 rows)

postgres=# SELECT * FROM pgstattuple('testtable3');
 table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
 227917824 |     2253530 | 207324760 |         90.96 |                0 |              0 |                  0 |    1897096 |         0.83
(1 row)

Two tables, the only difference being that testtable2 uses a bigint id and testtable3 uses an int id. Same number of rows, but different row sizes; even pgstattuple shows different tuple lengths.

postgres=# SELECT schemaname, tablename,
postgres-# pg_size_pretty(size) AS size_pretty,
postgres-# pg_size_pretty(total_size) AS total_size_pretty
postgres-# FROM (SELECT schemaname, tablename,
postgres(# pg_relation_size(schemaname||'.'||tablename) AS size,
postgres(# pg_total_relation_size(schemaname||'.'||tablename) AS total_size
postgres(# FROM pg_tables WHERE tablename LIKE '%testtable%') AS tables
postgres-# ORDER BY total_size DESC;
 schemaname | tablename  | size_pretty | total_size_pretty
------------+------------+-------------+-------------------
 public     | testtable2 | 217 MB      | 266 MB
 public     | testtable3 | 217 MB      | 266 MB
(2 rows)

postgres=# select relname, relpages from pg_class where relname in ('testtable2','testtable3');
  relname   | relpages
------------+----------
 testtable2 |    27822
 testtable3 |    27822
(2 rows)

postgres=# select 27822*8192;
 ?column?
-----------
 227917824
(1 row)

postgres=# select 27822*8192/1024/1024;
 ?column?
----------
      217
(1 row)

But since we are on an extent-based filesystem, we see exactly the same size on disk. So for anything other than the most extravagant setups, it's smart to use bigint rather than limiting yourself to 32-bit integers; there is no space penalty for doing so. There are performance penalties when performing arithmetic on these columns, but no downside in most practical operations, such as a row access by index:

postgres=# select id from testtable2 where id = 2003;
  id
------
 2003
(1 row)

Time: 0.131 ms

postgres=# select id from testtable3 where id = 2003;
  id
------
 2003
(1 row)

Time: 0.165 ms

insert 100000 rows into testtable2 and commit:
Elapsed Seconds: 21.438544035
TPS: 4664.49586487

insert 100000 rows into testtable3 and commit:
Elapsed Seconds: 21.7363810539
TPS: 4600.58184258

If you perform calculations on these ids, you do see a performance hit, so be careful:

postgres=# select id*10000+45/10 from testtable2 where id = 2003;
 ?column?
----------
 20030004
(1 row)

Time: 0.268 ms

postgres=# select id*10000+45/10 from testtable3 where id = 2003;
 ?column?
----------
 20030004
(1 row)

Time: 0.133 ms

[Read int vs bigint]

The next breed of DBAs?

University of California, Berkeley and Carnegie Mellon University students now have what appears to be a first-rate course on database internals. What I like about this course is that not only does it offer very practical experience, it also teaches internals. So much of how a database works has to do with how it is built, and the syllabus shows that this course should turn out excellent talent ready to develop on any database platform. However, how many of these students will go on to become DBAs? I hope many, but I am afraid it's not likely. That's too bad, because this is the kind of knowledge that makes excellent DBAs.

In the non-open-source world, Oracle DBAs are plentiful, and rightly so given the scale and complexity of the software itself. However, many DBAs coming from this side of the fence are more 'operators', in my opinion. They know how to manipulate the software to achieve a result, but do they really understand it deep down? In my experience at PayPal and eBay I was lucky enough to get good exposure to internals because, at our transaction rate, you simply had to understand them. I remember re-reading James Morle's book, Scaling Oracle 8i, and having a lightbulb moment about latches: Ah! I get why something so seemingly small as a latch means so much when you scale a really busy system.

It's great to see students getting hands-on expertise in these internals right from the start of their careers. Here's to hoping some of them take this excellent and practical knowledge and decide to become DBAs.

[Read The next breed of DBAs?]

Python/cx_Oracle utility functions

I recently created some utility functions to start up, mount, and stop Oracle via cx_Oracle and thought I would share them. You will want to make sure you have password files set up correctly in order for this stuff to work.

import cx_Oracle

# shutdown abort:
def shutdown_abort(sid):
  try:
    handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
    handle.shutdown(mode = cx_Oracle.DBSHUTDOWN_ABORT)
    return 0
  except cx_Oracle.DatabaseError, info:
    print "Error: ", info
    return 1

# startup nomount:
def startup_nomount(sid):
  try:
    handle = cx_Oracle.connect("sys", "yourpwd", sid,
                               cx_Oracle.SYSDBA | cx_Oracle.PRELIM_AUTH)
    handle.startup()   # start the instance; leaves it in nomount
    return 0
  except cx_Oracle.DatabaseError, info:
    print "Error: ", info
    return 1

# mount:
def db_mount(sid):
  try:
    handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
    cursor = handle.cursor()
    cursor.execute("alter database mount")
    return 0
  except cx_Oracle.DatabaseError, info:
    print "Error: ", info
    return 1

# open resetlogs:
def db_open_resetlogs(sid):
  try:
    handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
    cursor = handle.cursor()
    cursor.execute("alter database open resetlogs")
    return 0
  except cx_Oracle.DatabaseError, info:
    print "Error: ", info
    return 1

# open:
def db_open(sid):
  try:
    handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
    cursor = handle.cursor()
    cursor.execute("alter database open")
    return 0
  except cx_Oracle.DatabaseError, info:
    print "Error: ", info
    return 1

# flashback database to a given restore point:
def flashback_db(restore_point, sid):
  try:
    handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
    cursor = handle.cursor()
    sql = "flashback database to restore point " + restore_point
    print sql
    cursor.execute(sql)
    return 0
  except cx_Oracle.DatabaseError, info:
    print "Error: ", info
    return 1

[Read Python/cx_Oracle utility functions]

Primary Keys for Slony-I

One of the prerequisites for Slony replication is to make sure that primary keys (or at least suitable candidate columns) are defined for each table to be replicated. This is a simple example script for getting that data in PostgreSQL. It assumes the intention is to replicate all of the public schema.

select tablename
from pg_tables
where tablename not in
  (select r.relname
   from pg_class r, pg_constraint c
   where r.oid = c.conrelid
   and c.contype = 'p')
and schemaname = 'public';

If the output shows tables without primary keys, you can either add them or instruct slonik to use suitable columns that form a logical unique key. However, as the documentation states, you might as well just define real primary keys if you are going to that trouble.

[Read Primary Keys for Slony-I]

Memory sizing for PostgreSQL

I needed to make an assessment of a Linux server's memory footprint consumed by PostgreSQL; however, calculating the exact memory footprint of a PostgreSQL database is not as straightforward as one might think. Since Unix utilities like top and ps tend to include some amount of shared memory in their tallies, they are next to useless for this purpose. The other problem is that Linux fills most of the free space on the box with cached filesystem buffers, so separating what is actually used by PostgreSQL from what is really filesystem cache is confusing as well. The calculation only needed to be a reasonable estimate of the memory usage, so this method is imperfect, but it should be good for most planning purposes.

In order to come up with the memory usage, I decided to subtract the fixed PostgreSQL allocations from the used-memory figure derived from /proc/meminfo. I divided the PostgreSQL memory usage parameters into two types of consumption:

  1. fixed memory usage – memory consumption that doesn’t change rapidly
  2. variable memory usage – memory consumption that does change rapidly

An example of fixed memory is anything allocated at startup that remains relatively constant; an example of variable memory is anything multiplied by the number of connections.

fixed parameters:

  1. shared_buffers
  2. wal_buffers
  3. max_fsm_pages

variable parameters:

  1. work_mem
  2. maintenance_work_mem

Using /proc/meminfo and the postgresql.conf file one can calculate the memory footprint with a reasonably high degree of certainty using:

(MemTotal - MemFree - Cached) - ((shared_buffers*8) + (wal_buffers*8) + ((max_fsm_pages*6)/1024)) = variable size in KB

For instance:

$>cat /proc/meminfo
MemTotal: 32876680 kB
MemFree: 1762176 kB
Buffers: 256664 kB
Cached: 30040432 kB
$>cat /data/postgresql.conf | grep buffers
shared_buffers = 65535 # min 16 or max_connections*2, 8KB each
wal_buffers = 1024 # min 4, 8KB each

$>cat /data/postgresql.conf | grep fsm
max_fsm_pages = 8000000 # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 2000 # min 100, ~70 bytes each
$>cat /data/postgresql.conf | grep mem
work_mem = 32768 # size in KB
maintenance_work_mem = 768000 # min 1024, size in KB


(32876680-1762176-30040432)-(524280+8192+46875)=494725 KB
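The same estimate as a small helper function (a sketch; pg_variable_kb is a made-up name, with units per the postgresql.conf comments: buffer counts in 8 KB pages, FSM page entries at 6 bytes each, /proc/meminfo values in KB):

```python
def pg_variable_kb(memtotal_kb, memfree_kb, cached_kb,
                   shared_buffers, wal_buffers, max_fsm_pages):
    # Fixed allocations: shared/WAL buffers are 8 KB pages,
    # each FSM page entry costs 6 bytes.
    fixed_kb = shared_buffers * 8 + wal_buffers * 8 + (max_fsm_pages * 6) // 1024
    used_kb = memtotal_kb - memfree_kb - cached_kb
    return used_kb - fixed_kb

# Numbers from the example above:
# pg_variable_kb(32876680, 1762176, 30040432, 65535, 1024, 8000000) -> 494725
```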

Now that we have the variable memory size, we have defined the major components of the memory footprint. We can then calculate the average size of each backend process by dividing the total variable usage by the number of backend processes, assuming we expect very few, or perhaps no, backends consuming maintenance_work_mem at any given time (for instance, autovacuum runs).

$>ps -ef | grep "postgres: postgres" | grep -v grep | wc -l
122
$>echo 494725/122 | bc
4055

So each connection consumes on average ~4055 KB, or roughly 4 MB. We can now adjust, tune, and play what-if with those numbers as we desire.

It makes sense to sample the number of backend connections and memory usage during different periods of the day and/or during batch runs, etc to get a good read on the usage throughout the day.

[Read Memory sizing for PostgreSQL]

Python and cx_Oracle

I decided to start futzing with Python and Oracle. I generally use Perl for most everything, and sometimes PHP for web-based things, but I wanted to try Python at the urging of colleagues. I downloaded the source from here and added in cx_Oracle from here. The installation on Linux was about as smooth as possible. I installed the source at ~oracle/python so I could keep a user-space version of Python, similar to how I generally install Perl. Then I created my first application using Python: select * from dual;. No big deal, but wow, Python is a pretty straightforward language. Here are some resources for Oracle-based code:

The docs and code examples are pretty slim for Oracle, but I suppose that is to be expected as the language and its use with Oracle expand over time. Here is an example of the hello world app:

import cx_Oracle
handle = cx_Oracle.connect('scott/tiger@test')
curs = handle.cursor()
sql = """ select * from dual """
curs.execute(sql)
row = curs.fetchone()
print row[0]

[Read Python and cx_Oracle]

now on WordPress

I just moved the blog to WordPress in order to get a few (ok, a lot) more features and a better general editing environment. I imported some of the popular posts from my old blog as well.

[Read now on WordPress]

Backing up Oracle optimizer statistics

Oracle 10g has some neat features for keeping track of old statistics. Pre-10g, it was important to back up your statistics before you analyzed a table, just in case your plans went crazy. Now Oracle does this for you automatically. There is a great post on Doug's Oracle Blog that discusses this with some examples. So now one can analyze as needed without fear of being unable to roll the statistics back to their previous values. In a crisis, it might be worth checking dba_tab_stats_history to see whether an analyze ran recently and plans have since gone bad. Then simply back out the stats with dbms_stats.restore_schema_stats to revert to a known-good statistics state. Be sure to check dbms_stats.get_stats_history_retention to make sure you are keeping a long enough record of stats; you can adjust it with dbms_stats.alter_stats_history_retention as needed to keep yourself sleeping at night.

[Read Backing up Oracle optimizer statistics]

Performance Tuning: Be Proactive

I see many articles out there about performance tuning using the Oracle wait interface. I am not contesting that technique; I believe it to be sound, and I use it every day. What I am saying is that performance tuning in this manner, the 'traditional' manner, is inherently reactive. Reactive performance tuning has its place, but many times some proactive performance tuning can reduce the amount of reactive tuning a DBA needs to do. By its very definition, if one is reacting to a performance problem, it is most likely already impacting someone, and perhaps your company is losing money.

Don't wait for a performance problem; solve it before it occurs. In order to be proactive with performance tuning, one needs a system, a plan, a technique. Proactive performance tuning is all about the trends that a given database, or a suite of databases, exhibits. One must develop a system to observe performance telemetry over time and notice the larger trends that don't show up in the detailed wait interfaces when queried at performance-problem time. I gave a very simple example of this technique using the Oracle Statspack package in my previous post. Using that technique, one can keep a running report showing the relative amount of work a database does over time.

How is this proactive performance tuning? It might seem simple, but if one knows the relative work 'profile' for a given database, then one can take proactive measures to AVOID problems: for instance, reschedule batch jobs, move backup windows, or even just communicate with your business about good times for promotions. It should also be noted that such a system lets a DBA observe the improvement a given change makes on the trend as well. This is good both for knowing when you are done tuning and for illustrating those tuning successes to yourself and your management.

While it's good to have good control of your reactive tuning efforts, augmenting those skills with a proactive tuning 'system' might just help you use your reactive skills a bit less and provide more value to your organization.

[Read Performance Tuning: Be Proactive]

The benefits of testing

I was recently involved in a problem on a critical production database that had slow response time for inserts. The batch job doing the inserting was running at full tilt, and adding more processes (concurrency) was not helping. Ah, there must be some contention going on. What was interesting was not the particular technical details of increasing insert performance, but rather the differing perspectives on how to figure out the problem and how to suggest a fix. The process was inserting image data into a long raw column, and v$session_wait showed many sessions waiting on buffer busy waits.

One method was to observe the system: trace, block dump, systemstate dump. Then try to decipher the ton of data created by those processes into a hypothesis, and then suggest a fix. This method makes it difficult to pinpoint the problem because you are left with so much data; it's very time-consuming and error-prone, and it's human nature to start GUESSING. More so, once you think you have a root cause, the fix is even more problematic. You have no way to validate it. You again have to GUESS, or try it out. And trying things out on production systems that have millions of dollars running through them is not a good career path.

A different approach was to simply state the problem in a semi-simplistic way: "We are seeing buffer busy waits on insert of a long raw." Then take that statement and model it in a test environment. So I utilized some existing Perl code I had written that inserts at a configurable concurrency rate. One can easily see the problem. NOW we can conclusively model a fix. In this case, increasing freelist groups OR decreasing concurrency and running more serialized increases throughput dramatically. Now, that's not so hard, is it?

[Read The benefits of testing]