Incremental Commits in PostgreSQL

On large and/or busy databases it’s often not feasible to update an entire table in one transaction. For instance, suppose a background DML task has to change all the credit card limits on a 300GB table that sees 1000 transactions per second, increasing each card limit by $1000. Simply executing an UPDATE statement with no predicate would attempt to lock every row in the table. Even if the locks were granted on such a busy table, you would block all other updates to those rows (they would enqueue) until you commit (or roll back), and you would also generate a lot of reads of before images for read consistency. In Oracle this shows up as <a href=http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1441804355350>ORA-01555: snapshot too old</a>, and it has <a href=http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html>vacuum implications</a> in PostgreSQL. So in order to run our theoretical job above, we need some sort of procedure that runs in the background and commits the work every N rows. This is a pretty common technique in Oracle PL/SQL, where a simple construct exists:

DECLARE

  CURSOR mycur IS SELECT id FROM bigtab;
  counter NUMBER := 0;

BEGIN

  FOR myrec IN mycur LOOP

     UPDATE bigtab SET credit_limit = credit_limit + 1000 WHERE id = myrec.id;

     counter := counter + 1;

     IF MOD(counter, 1000) = 0 THEN
        COMMIT;
     END IF;

  END LOOP;

  COMMIT;

END;
/

The above code commits every 1000 rows and keeps updating until the cursor is exhausted. In PostgreSQL there is no simple <a href=http://www.postgresql.org/docs/8.2/interactive/plpgsql.html>PL/pgSQL</a> substitute because <a href=http://www.postgresql.org/docs/8.2/interactive/plpgsql-structure.html>COMMIT cannot be called in a function</a>. So what to do? Well, Python has a nice <a href=http://www.python.org/dev/peps/pep-0249/>fetchmany(size)</a> construct that allows us to create a small-memory-footprint program that commits incrementally just like the above Oracle code. The variable INCREMENTAL_COMMIT_SIZE is the size of the commit batch, and THROTTLE_TIME is an amount to sleep after each commit to make this process even slower and lower impact. The code also uses bind variables so you don’t create many cursor permutations.

```python
#!/usr/bin/env python
#
# incremental commits
# 2008 kcg
#

import psycopg2
import time

# vars
INCREMENTAL_COMMIT_SIZE = 1000   # number of rows per commit batch
THROTTLE_TIME = 0                # seconds to sleep after each commit

connectstr = "host=localhost dbname=postgres user=postgres port=5432"
handle = psycopg2.connect(connectstr)
cursor = handle.cursor()
cursor2 = handle.cursor()

sql = "select id from bigtab"
cursor.execute(sql)

while 1:

    output = cursor.fetchmany(INCREMENTAL_COMMIT_SIZE)
    if not output:
        break

    for row in output:
        # update table, binding the id
        sql = "update bigtab set credit_limit = credit_limit+1000 where id = %s"
        cursor2.execute(sql, [row[0]])

    # commit once per fetchmany batch, then optionally throttle
    handle.commit()
    time.sleep(THROTTLE_TIME)

handle.commit()
```



int vs bigint

So which one is a better use of space, given that an application only needs 32-bit integers? In PostgreSQL an int column takes 4 bytes and a bigint takes 8 bytes, so one would think there are significant space savings to be had by using int. In the real world, with disk space costs where they are, it just does not matter in a practical sense. Furthermore, if you are running on an <a href=http://en.wikipedia.org/wiki/Extent_(file_systems)>extent based filesystem</a> like ext3 or vxfs, you really don’t see any practical benefit. Why? Well, your extent size is likely to be much larger than (number of tuples * 4 bytes), so the difference in size is masked because the filesystem rounds up to the next extent boundary. Consider the example below:

vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
2.6.16.46-0.12-smp #1 SMP Thu May 17 14:00:09 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux
pg_ctl (PostgreSQL) 8.2.4

postgres=# \d testtable2;
           Table "public.testtable2"
  Column  |          Type          | Modifiers
----------+------------------------+-----------
 id       | bigint                 | not null
 text     | character varying(200) |
 thecount | integer                |
Indexes:
    "testtable2_pk" PRIMARY KEY, btree (id)

postgres=# \d testtable3;
           Table "public.testtable3"
  Column  |          Type          | Modifiers
----------+------------------------+-----------
 id       | integer                | not null
 text     | character varying(200) |
 thecount | integer                |
Indexes:

postgres=# select count(*) from testtable2;
  count  
---------
 2253530
(1 row)

postgres=# select count(*) from testtable3;
  count  
---------
 2253530

postgres=# analyze testtable2;
ANALYZE
postgres=# analyze testtable3;
ANALYZE

postgres=# select tablename, attname, avg_width,n_distinct from pg_stats where tablename = 'testtable2';
 tablename  | attname  | avg_width | n_distinct
------------+----------+-----------+------------
 testtable2 | text     |        56 |          1
 testtable2 | thecount |         4 |          0
 testtable2 | id       |         8 |         -1

postgres=# select tablename, attname, avg_width,n_distinct from pg_stats where tablename = 'testtable3';
 tablename  | attname  | avg_width | n_distinct
------------+----------+-----------+------------
 testtable3 | thecount |         4 |          0
 testtable3 | id       |         4 |         -1
 testtable3 | text     |        56 |          1
(3 rows)

postgres=# SELECT * FROM pgstattuple('testtable3');
 table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
 227917824 |     2253530 | 207324760 |         90.96 |                0 |              0 |                  0 |    1897096 |         0.83


Two tables whose only difference is that testtable2 uses a bigint id while testtable3 uses an integer id. Same number of rows, different average column widths; pgstattuple reports the tuple lengths as well.

postgres=# SELECT schemaname, tablename,
pg_size_pretty(size) AS size_pretty,
pg_size_pretty(total_size) AS total_size_pretty
FROM (SELECT *,
pg_relation_size(schemaname||'.'||tablename) AS size,
pg_total_relation_size(schemaname||'.'||tablename) AS total_size
FROM pg_tables where tablename like '%testtable%') AS TABLES
ORDER BY total_size DESC;
 schemaname | tablename  | size_pretty | total_size_pretty
------------+------------+-------------+-------------------
 public     | testtable2 | 217 MB      | 266 MB
 public     | testtable3 | 217 MB      | 266 MB

select relname, relpages from pg_class where relname in ('testtable2','testtable3');
  relname   | relpages
------------+----------
 testtable2 |    27822
 testtable3 |    27822
(2 rows)

postgres=# select 27822*8192;
 ?column?  
-----------
 227917824
(1 row)

postgres=# select 27822*8192/1024/1024;
 ?column?
----------
      217

But since we are on an extent based filesystem, we see the exact same size on disk. So for anything other than the most extravagant setups, it’s smart to use bigint rather than limiting yourself to 32-bit integers; there is at least no space penalty for doing so. There are <a href=http://archives.postgresql.org/pgsql-general/2007-06/msg00299.php>performance penalties when performing arithmetic on these columns</a>, but no real downside in most practical applications. For instance, a row access by index:

postgres=# select id from testtable2 where id = 2003;
  id  
------
 2003
(1 row)

Time: 0.131 ms

postgres=# select id from testtable3 where id = 2003;
  id  
------
 2003
(1 row)

Time: 0.165 ms

insert 100000 rows into testtable2 and commit:
Elapsed Seconds: 21.438544035
TPS: 4664.49586487

insert 100000 rows into testtable3 and commit:
Elapsed Seconds: 21.7363810539
TPS: 4600.58184258

If you perform calculations on these ids you do see a performance hit, so be careful:

postgres=# select id*10000+45/10 from testtable2 where id = 2003;
 ?column?
----------
 20030004
(1 row)

Time: 0.268 ms
postgres=# select id*10000+45/10 from testtable3 where id = 2003;
 ?column?
----------
 20030004
(1 row)

Time: 0.133 ms



The next breed of DBAs?

University of California, Berkeley and Carnegie Mellon University students now have what appears to be a <a href=http://www.sigmod.org/record/issues/0309/4.JHdbcourseS03.pdf>first rate course</a> on database internals. What I like about this course is not only that it provides very practical experience, but also that it teaches internals. So much of how a database works comes down to how it’s built, and the syllabus shows that this course should turn out excellent talent ready to develop on any database platform. However, how many of these students will go on to become DBAs? I hope many, but I am afraid it’s not likely. That’s too bad, because this is the kind of knowledge that makes excellent DBAs.

In the non-open-source world, Oracle DBAs are plentiful, and rightly so because of the scale and complexity of the software itself. However, many DBAs coming from this side of the fence are more ‘operators’, in my opinion. They know how to manipulate the software to achieve a result, but do they really understand it deep down? In my experience at <a href=www.paypal.com>Paypal</a> and <a href=www.ebay.com>Ebay</a> I was lucky enough to get good experience with internals, because at our transaction rate you simply had to understand them. I remember re-reading James Morle’s book, <a href=http://www.amazon.com/Scaling-Oracle8i-TM-Building-Architectures/dp/0201325748/ref=sr_1_2?ie=UTF8&s=books&qid=1207710869&sr=1-2>Scaling Oracle 8i</a>, and having a lightbulb moment about latches: Ah! I get why something as seemingly small as a latch means so much when you scale a really busy system.

It’s great to see that students are now getting hands-on expertise in these internals right from the start of their careers. Here’s hoping some of them take this excellent and practical knowledge and decide to become DBAs.



Python/cx_Oracle utility functions

I recently created some utility functions to start up, mount, and stop Oracle via cx_Oracle and thought I would share them. You will want to make sure you have password files set up correctly in order for this to work.

import os
import sys
import cx_Oracle

# shutdown abort:
def shutdown_abort(sid):
    try:
        os.environ['TWO_TASK'] = sid
        handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
        handle.shutdown(mode=cx_Oracle.DBSHUTDOWN_ABORT)
        return 0
    except cx_Oracle.DatabaseError, info:
        print "Error: ", info
        return 1

# startup nomount:
def startup_nomount(sid):
    try:
        os.environ['TWO_TASK'] = sid
        handle = cx_Oracle.connect("sys", "yourpwd", sid,
                                   cx_Oracle.SYSDBA | cx_Oracle.PRELIM_AUTH)
        handle.startup()
        return 0
    except cx_Oracle.DatabaseError, info:
        print "Error: ", info
        return 1

# mount:
def db_mount(sid):
    try:
        os.environ['TWO_TASK'] = sid
        handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
        cursor = handle.cursor()
        cursor.execute("alter database mount")
        return 0
    except cx_Oracle.DatabaseError, info:
        print "Error: ", info
        return 1

# open resetlogs:
def db_open_resetlogs(sid):
    try:
        os.environ['TWO_TASK'] = sid
        handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
        cursor = handle.cursor()
        cursor.execute("alter database open resetlogs")
        return 0
    except cx_Oracle.DatabaseError, info:
        print "Error: ", info
        return 1

# open:
def db_open(sid):
    try:
        os.environ['TWO_TASK'] = sid
        handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
        cursor = handle.cursor()
        cursor.execute("alter database open")
        return 0
    except cx_Oracle.DatabaseError, info:
        print "Error: ", info
        return 1

# Flashback database to a given restore point:
def flashback_db(restore_point, sid):
    try:
        os.environ['TWO_TASK'] = sid
        handle = cx_Oracle.connect("sys", "yourpwd", sid, cx_Oracle.SYSDBA)
        cursor = handle.cursor()
        sql = "flashback database to restore point " + restore_point
        print sql
        cursor.execute(sql)
        return 0
    except cx_Oracle.DatabaseError, info:
        print "Error: ", info
        sys.exit(1)
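
As a usage sketch, the helpers chain together naturally; the SID and restore point name below are just placeholders for illustration:

# hypothetical example: roll a test database back to a restore point
if shutdown_abort("ORCLTEST") == 0:
    startup_nomount("ORCLTEST")
    db_mount("ORCLTEST")
    flashback_db("before_release", "ORCLTEST")
    db_open_resetlogs("ORCLTEST")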



Primary Keys for Slony-I

One of the prerequisites for <a href=http://slony.info>Slony</a> replication is to be sure that primary keys (or at least suitable unique columns) are defined for each table to be replicated. Here is a simple example query for finding tables that lack a primary key in PostgreSQL, assuming the intention is to replicate all of the public schema:

select tablename
from pg_tables
where tablename not in
 (select r.relname
 from pg_class r, pg_constraint c
 where r.oid = c.conrelid
 and c.contype = 'p'
 )
and schemaname = 'public';

If the output shows tables without primary keys, then the decision can be made to either add them, or instruct slonik to use suitable columns that form a logical unique key. However, as the <a href=http://slony.info/documentation/definingsets.html>documentation states</a>, you might as well just define the keys if you go this far.
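
For convenience, something like the following psycopg2 sketch runs the same check and prints a reminder for each offending table; the connection string is a placeholder, and the key column itself still has to be chosen by hand:

#!/usr/bin/env python
# List public-schema tables with no primary key, as candidates to fix
# before adding them to a Slony replication set.

import psycopg2

SQL = """
select tablename
from pg_tables
where tablename not in
 (select r.relname
  from pg_class r, pg_constraint c
  where r.oid = c.conrelid
  and c.contype = 'p'
 )
and schemaname = 'public'
"""

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute(SQL)
for (tablename,) in cur.fetchall():
    print("%s has no primary key -- e.g. ALTER TABLE %s ADD PRIMARY KEY (<col>);"
          % (tablename, tablename))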



Memory sizing for PostgreSQL

I needed to make an assessment of a Linux server’s memory footprint consumed by PostgreSQL; however, calculating the exact memory footprint of a PostgreSQL database is not as straightforward as one might think. Since Unix utilities like top and ps tend to include some amount of shared memory in their tallies, they are next to useless for this purpose. The other problem is that Linux fills most free space on the box with filesystem cache, so deciphering what is actually used by PostgreSQL and what is really filesystem cache is confusing as well. The calculation only needed to be a reasonable estimate of the memory usage, so this method is imperfect, but it should be good enough for most planning purposes.

In order to come up with the memory usage, I decided to subtract the fixed PostgreSQL allocations from the total memory usage shown in /proc/meminfo. I divided the PostgreSQL memory consumption parameters into two types:

  1. fixed memory usage – memory consumption that doesn’t change rapidly
  2. variable memory usage – memory consumption that does change rapidly

Fixed memory is anything allocated at startup that remains relatively constant; variable memory is anything that scales with the number of connections.

fixed parameters:

  1. shared_buffers
  2. wal_buffers
  3. max_fsm_pages

variable parameters:

  1. work_mem
  2. maintenance_work_mem

Using /proc/meminfo and the postgresql.conf file one can calculate the memory footprint with a reasonably high degree of certainty using:

(MemTotal - MemFree - Cached) - ((shared_buffers * 8) + (wal_buffers * 8) + (max_fsm_pages * 6 / 1024)) = variable size in KB

For instance:

$>cat /proc/meminfo
MemTotal: 32876680 kB
MemFree: 1762176 kB
Buffers: 256664 kB
Cached: 30040432 kB
$>cat /data/postgresql.conf | grep buffers
shared_buffers = 65535 # min 16 or max_connections*2, 8KB each
wal_buffers = 1024 # min 4, 8KB each

$>cat /data/postgresql.conf | grep fsm
max_fsm_pages = 8000000 # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 2000 # min 100, ~70 bytes each
$>cat /data/postgresql.conf | grep mem
work_mem = 32768 # size in KB
maintenance_work_mem = 768000 # min 1024, size in KB

Thus:

(32876680-1762176-30040432)-(524280+8192+46875)=494725 KB

Now that we have the variable memory size, we have accounted for the major components of the memory footprint. We can then calculate the average size of each backend process by dividing the variable usage by the number of backend processes, assuming very few, or perhaps no, backends are consuming maintenance_work_mem at any given time (for instance, running autovacuums).

$>ps -ef | grep "postgres: postgres" | grep -v grep | wc -l
122
$>echo 494725/122 | bc
4055

So each connection consumes on average about 4 MB (~4055 KB). We can now adjust, tune, and play what-if with those numbers as we desire.

It makes sense to sample the number of backend connections and memory usage during different periods of the day and/or during batch runs, etc to get a good read on the usage throughout the day.
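
To make that sampling repeatable, here is a minimal Python sketch of the same arithmetic. Parsing /proc/meminfo is straightforward; the three configuration values are hard-coded assumptions that you would replace with the settings from your own postgresql.conf:

#!/usr/bin/env python
# Rough estimate of PostgreSQL variable memory usage, per the formula above.
# The three values below are assumptions taken from the example postgresql.conf;
# shared_buffers and wal_buffers are in 8 KB pages, max_fsm_pages is 6 bytes each.

SHARED_BUFFERS = 65535
WAL_BUFFERS = 1024
MAX_FSM_PAGES = 8000000

def meminfo_kb():
    """Return /proc/meminfo values (already reported in kB) as a dict."""
    info = {}
    for line in open('/proc/meminfo'):
        name, rest = line.split(':', 1)
        info[name] = int(rest.strip().split()[0])
    return info

m = meminfo_kb()
used_kb = m['MemTotal'] - m['MemFree'] - m['Cached']
fixed_kb = SHARED_BUFFERS * 8 + WAL_BUFFERS * 8 + MAX_FSM_PAGES * 6 // 1024
variable_kb = used_kb - fixed_kb
print("variable memory: %d KB" % variable_kb)

Dividing variable_kb by the backend count from ps gives the same per-connection average as the manual calculation above.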



Python and cx_oracle

I decided to start futzing with Python and Oracle. I generally use Perl for most everything, and sometimes PHP for web based things, but I wanted to try Python at the urging of colleagues. I downloaded the source from here and added in cx_Oracle from here. The installation on Linux was about as smooth as possible. I installed the source at ~oracle/python so I could keep a user-space version of Python, similar to how I generally install Perl. So I created my first application using Python: select * from dual;. No big deal, but wow, Python is a pretty straightforward language.

The docs and code examples are pretty slim for Oracle, but I suppose that is to be expected as the language and its use with Oracle expand over time. Here is an example of the hello world app:

#!/home/kgorman/python/bin/python
import cx_Oracle
handle = cx_Oracle.connect('scott/tiger@test')
curs = handle.cursor()
sql = """ select * from dual """
curs.execute(sql)
row = curs.fetchone()
print row[0]
curs.close()



now on WordPress

I just updated the blog to WordPress in order to get a few (ok, a lot) more features, and use the general editing environment.  I imported some of the popular posts from my old blog as well.



Backing up Oracle optimizer statistics

Oracle 10g has some neat features for keeping track of old statistics. Pre-10g, it used to be important to back up your statistics before you analyzed a table, just in case your plans went crazy. Now Oracle does this for you automatically. There is a great post on Doug’s Oracle Blog that talks about this with some examples. So now one can gather statistics as needed without fear of being unable to roll them back to the previous values. In a crisis, it is worth checking dba_tab_stats_history to see whether statistics were gathered recently and plans went bad as a result. Then simply back out the stats with dbms_stats.restore_schema_stats to revert to a known good statistics state. Be sure to check dbms_stats.get_stats_history_retention to make sure you are keeping a long enough record of stats; you can adjust it with dbms_stats.alter_stats_history_retention as needed to keep yourself sleeping at night.
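
A minimal cx_Oracle sketch of that workflow is below; the connect string, schema name, table name, and the one-day-ago timestamp are placeholder assumptions:

import cx_Oracle

handle = cx_Oracle.connect("system/yourpwd@test")   # placeholder credentials
cursor = handle.cursor()

# how many days of statistics history are being kept?
cursor.execute("select dbms_stats.get_stats_history_retention from dual")
print("retention (days): %s" % cursor.fetchone()[0])

# when were stats last gathered on a suspect table?
cursor.execute("""select stats_update_time
                  from dba_tab_stats_history
                  where owner = 'SCOTT' and table_name = 'EMP'
                  order by stats_update_time desc""")
for row in cursor.fetchall():
    print(row[0])

# roll the whole schema back to a known-good point in time (here, one day ago)
cursor.execute("""begin
                    dbms_stats.restore_schema_stats(
                      ownname         => 'SCOTT',
                      as_of_timestamp => systimestamp - interval '1' day);
                  end;""")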



Performance Tuning: Be Proactive

I see many articles out there talking about performance tuning using the Oracle wait interface. I am not contesting that technique; I believe it to be sound, and I use it every day. What I am saying is that performance tuning in this manner, the ‘traditional’ manner, is inherently reactive. Reactive performance tuning has its place, but many times some proactive performance tuning can reduce the amount of reactive performance tuning a DBA needs to do. By its very definition, if one is reacting to a performance problem, it is most likely already impacting someone, and perhaps your company is losing money.

Don’t wait for a performance problem; solve it before it occurs. In order to be proactive with performance tuning, one needs a system, a plan, a technique. Proactive performance tuning is all about the trends a given database, or a suite of databases, exhibits. One must develop a system to observe performance telemetry over time and notice larger trends that don’t show up in the detailed wait interfaces when queried at performance-problem time. I gave a very simple example of this technique using the Oracle Statspack package in my previous post. Using that technique one could keep a running report showing the relative amount of work that a database is doing over time.

How is this proactive performance tuning? It might seem simple, but if one knows the relative work ‘profile’ for a given database, then one can take proactive measures to AVOID problems: reschedule batch jobs, move backup windows, or even just communicate with your business about good times for promotions. It should also be noted that such a system allows a DBA to observe the improvement a given change makes on the trend as well. This is good both for knowing when you are done tuning and for illustrating those tuning successes to yourself and your management.
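
As a rough illustration of that kind of trend report, here is a small sketch against a standard Statspack (perfstat) schema; it assumes a single instance and placeholder credentials, and simply differences a cumulative counter between snapshots:

import cx_Oracle

handle = cx_Oracle.connect("perfstat/yourpwd@test")   # placeholder credentials
cursor = handle.cursor()

# trend one workload statistic across Statspack snapshots; the cumulative
# counter is differenced between consecutive snapshots to show work over time
cursor.execute("""
    select sn.snap_time,
           ss.value - lag(ss.value) over (order by ss.snap_id) as delta
    from perfstat.stats$sysstat ss, perfstat.stats$snapshot sn
    where ss.snap_id = sn.snap_id
    and ss.name = 'user calls'
    order by ss.snap_id""")

for snap_time, delta in cursor.fetchall():
    if delta is not None and delta >= 0:    # skip the first row and instance restarts
        print("%s %s" % (snap_time, delta))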

While it’s good to have good control of your reactive tuning efforts, augmenting those skills with a proactive tuning ‘system’ might just help you to use your reactive skills a bit less, and provide more value to your organization.



The benefits of testing

I was recently involved in a problem on a critical production database that had slow response times for inserts. The batch job doing the inserts was running at full tilt, and adding more processes (concurrency) was not helping. Ah, there must be some contention going on. What was interesting was not the particular technical details of increasing insert performance, but rather the differing perspectives on how to figure out the problem and how to suggest a fix. The insert process was inserting data into a long raw column of images, and v$session_wait showed many sessions waiting on buffer busy waits.

One method was to observe the system: trace, block dump, systemstate dump. Then try to decipher the ton of data created by those processes into a hypothesis, and then suggest a fix. This method makes it difficult to pinpoint the problem because you are left with so much data; it is very time consuming and error prone, and it’s human nature to start GUESSING. More so, once you think you have a root cause, the fix is even more problematic: you have no way to validate it. You again have to GUESS, or try it out. And trying things out on production systems that have millions of dollars running through them is not a good career path.

A different approach was to simply describe the problem in a semi-simplistic way: “We are seeing buffer busy waits on inserts into a long raw.” Then take that description and model it in a test environment. So I utilized some existing Perl code I wrote that inserts with a configurable concurrency rate, and one can easily see the problem. NOW we can conclusively model a fix. In this case, increasing freelist groups OR decreasing concurrency and running more serialized inserts increases throughput dramatically. Now, that’s not so hard, is it?
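
The original harness was Perl; a roughly equivalent Python sketch is below. The table name, credentials, and payload are placeholders (the real test inserted into a long raw of images), but varying NUM_WORKERS while watching throughput and v$session_wait reproduces the behavior:

import time
import multiprocessing
import cx_Oracle

NUM_WORKERS = 8           # concurrency level to test
ROWS_PER_WORKER = 1000
PAYLOAD = 'x' * 2000      # placeholder "image" data

def worker(n):
    # placeholder table: image_test(id number, img varchar2(2000))
    handle = cx_Oracle.connect("scott/tiger@test")
    cursor = handle.cursor()
    for i in range(ROWS_PER_WORKER):
        cursor.execute("insert into image_test (id, img) values (:1, :2)",
                       [n * ROWS_PER_WORKER + i, PAYLOAD])
    handle.commit()

if __name__ == '__main__':
    start = time.time()
    procs = [multiprocessing.Process(target=worker, args=(n,))
             for n in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    elapsed = time.time() - start
    print("Elapsed Seconds: %s" % elapsed)
    print("TPS: %s" % ((NUM_WORKERS * ROWS_PER_WORKER) / elapsed))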
