Dec 07 2009

Hello Shutterfly

shutterfly[1]I am very excited to start a new position at Shutterfly.com. Shutterfly is a well known internet property with a massivly growing customer base and thus new and interesting challenges in how to store, share, and organize data. What fantastic fun. I expect to post more on the growing NOSQL movement as well as continuing information on PostgreSQL and more traditional relational stores like Oracle.

2 responses so far

Nov 10 2009

pgstat 1.0 released

Published by kgorman under postgres

I added a couple of fixes to the code and released it as 1.0. We have been using it here at hi5 for some time w/o problems. Thanks everyone who has helped with feedback. Also thanks to Devrim GUNDUZ for his help. The latest version can be downloaded on pgfoundry.

No responses yet

Jun 25 2009

pg_reorg 1.0.4

At Hi5, we currently use pg_reorg1.0.3 in order to organize data in a clustered fashion. I posted previously about the strategy. Our version is slightly modified, the modifications I made to the C code essentially allow pg_reorg to spin/wait for locks on the objects to be released before proceeding.

The good news is the folks at NTT have incorporated a similar change in pg_reorg 1.0.4. This is a fantastic improvement, and frankly implemented in a cleaner way than my changes.

The crux of the issue is the situation where a database is being auto-vacuumed, you can’t be guaranteed that pg_reorg and the vacuum will not collide. In theory you should not need to vacuum a table which you are pg_reorg’ing because that is the point of a pg_reorg, it’s essentially a vacuum full w/ extra features because the table is being rebuilt from scratch. However in an environment where auto-vacuum is being utilized to keep tables vacuumed, both will need to co-exist.

The change is simple, use the NOWAIT option of lock table to fail if the lock can not be obtained. This is wrapped in a loop until the lock is granted. The effect is pg_reorg patiently sits and waits while your vacuums complete and then it can finish it’s work. The downside is if any of these operations run for too long, then the journal table may grow very large. So there should be some monitoring wrapped around the code if it’s intended to run in the background. For the future we need a backoff algorithm as well as perhaps a limit to the number of spin/sleep cycles, but hey this is excellent progress.

This tool is essential in my humble opinion for everyone running PostgreSQL in a high transaction/high availability environment. By the way, pg_reorg works seamlessly with Slony-I.

The code addition does the following:

for (;;)                        
        {
                command("BEGIN ISOLATION LEVEL READ COMMITTED", 0, NULL);
                res = execute_nothrow(table->lock_table, 0, NULL); 
                if (PQresultStatus(res) == PGRES_COMMAND_OK)
                {
                        PQclear(res);
                        break;
                }
                else if (sqlstate_equals(res, SQLSTATE_LOCK_NOT_AVAILABLE))
                {
                        /* retry if lock conflicted */ 
                        PQclear(res);
                        command("ROLLBACK", 0, NULL);
                        sleep(1);
                        continue;
                }
                else
                {
                        /* exit otherwise */
                        printf("%s", PQerrorMessage(connection));
                        PQclear(res);
                        exit(1);        
                }
        }

The text below is a snip of the strace on pg_reorg while it’s waiting for the lock:

rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
sendto(3, "P\0\0\0008\0SELECT reorg.reorg_apply($"..., 529, 0, NULL, 0) = 529
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(3, "1\0\0\0\0042\0\0\0\4T\0\0\0$\0\1reorg_apply\0\0\0\0"..., 16384, 0, NULL, NULL) = 77
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
sendto(3, "P\0\0\0\177\0SELECT 1 FROM pg_locks WHE"..., 178, 0, NULL, 0) = 178
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(3, "1\0\0\0\0042\0\0\0\4T\0\0\0!\0\1?column?\0\0\0\0\0\0\0"..., 16384, 0, NULL, NULL) = 74
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})

Your Postgresql log file will show the following:

Jun 25 17:09:33 <dbname> postgres[7825]: [37-2] 2009-06-25 17:09:33 PDTSTATEMENT:  LOCK TABLE <tablename> IN ACCESS EXCLUSIVE MODE NOWAIT
Jun 25 17:09:34 <dbname> postgres[7825]: [38-1] 2009-06-25 17:09:34 PDTERROR:  could NOT obtain LOCK ON relation "<tablename>"

No responses yet

May 27 2009

Fusion-io SSD

Published by kgorman under database engineering, postgres

I got the opportunity to test out some of the new Fusion-io Solid State ioDrive, and I thought I would post some results.

Fusion-io has created a SSD product called ioDrive that is based on PCIe cards vs replacing SAS or SATA drives with SSD directly. This approach allows for much lower latency because of the use of the PCIe bus vs traditional disk channels geared towards slow disk. The 320GB model I used in my test are made of Multi Level Cell (MLC) NAND flash and are quoted by Fusion-io to achieve throughput somewhere in the 70k IOPS neighborhood.

Continue Reading »

8 responses so far

Mar 03 2009

pgstat 0.8beta released on PgFoundry

Published by kgorman under postgres, python

I moved the pgstat (previously named pgd) project to pgfoundry. Thank you to the folks over there approving the project. I added a column for ‘active’ processes from pg_stat_activity as well as some fixes requested by Devrim that I really should have had done from the start. Thanks for the contribution. Downloads can be found here. Oh, and I will remove the OSX file from the .tar file in the future.

Want something added? Just comment on this post..

3 responses so far

Feb 13 2009

Cluster data, save cash

Since the economy is not exactly rocking these days, I suspect there are a lot of companies out there trying to save a buck or two on infrastructure. Databases are not exactly cheap, so anything that an engineer or DBA can do to save cycles is a win. So how do you stretch your existing hardware and make it perform more transactions for the same amount of cash?

Clustering your data is an approach to reducing load and stretching the capacity of your database servers. Clustering data is a technique where data is reorganized to match the query patterns very specifically and thus reducing the amount of logical (and also physical) I/O a database performs. This technique is not RDBMS product specific, it applies to Oracle, PostgreSQL, or most other block based row-store RDBMS. I am going to reference PostgreSQL and a very specific case where clustering data can produce huge performance gains.

Continue Reading »

4 responses so far

Dec 12 2008

Tuning for very large shared_buffers, using Veritas VxFS

Published by kgorman under database engineering, postgres

The debate about the optimum shared_buffers size in PostgreSQL is clearly far from over. However, I have not seen any results where the buffer cache and the FS cache were tuned in unison.

Because PostgreSQL knows what buffers are in use, and knows when to flush the dirty buffers it is ideal to use as much of it as possible. Using an secondary cache (the FS) as in a ‘traditional’ PostgreSQL configuration, just introduces more workload in the system. The secondary cache needs a process to wake up and flush the buffers (pdflush), and also has to manage it’s own LRU, and likely has many duplicate buffers sitting there wasting your precious RAM.

The point I am trying to stress is that the PostgreSQL cache *and* the FS cache must be tuned together. In this case, I remove the FS cache entirely using direct I/O, and then take over that space with the fastest cache in the stack; the PostgreSQL buffer cache.

So here is a bit of testing to show the results.

The testing was performed on a set of nice new HP DL185 G5’s with 14 72GB 15k SAS Drives and Quad Core Opteron processors and 32GB of main memory. The OS is SUSE Linux EnterpriseServer 10 SP1 (x86_64) 2.6.16.46-0.12-smp. The filesystem is Veritas VxFS file system. The version of PostgreSQL tested is 8.2.5.

The test itself is a set of custom pgbench scripts. These scripts were developed using actual data from production PostgreSQL log files, and pgfouine. The access pattern is about 70% read, with 15% update and 15% insert. There is no delete activity. This matches a very specific workload on a subset of our production databases. The workload is random, and the total data size is larger than our cache size by a very large amount, this guarantees a good workout of physical disk and caches in front of them. The WAL logs and data were on the same VxFS volume.

The test runs were gathered with pgd, and graphed. Between each test the filesystem was unmounted and remounted. I tested each run 3 times and averaged the tests.

Test#1:
VxFS buffered I/O
shared_buffers=500MB
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends

Test#2:
VxFS buffered I/O
shared_buffers=2GB
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends

Test#3:
VxFS with convosync=direct,mincache=direct
shared_buffers=20GB
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends

Conclusion:
A modest gain can be had when using a very large (comparatively) shared_buffers setting when combining that change with direct I/O. The PostgreSQL cache does scale quite nicely up to at least a 20GB cache size when configured in this manner.

Further gains could likely be achieved by separating the WAL logs with specific VxFS tunables as well as tuning other PostgreSQL parameters due to the larger cache (bgwriter tunables, etc).

5 responses so far

Nov 05 2008

Graphing pgstat output

Published by kgorman under postgres

If you’re using the pgd pgstat utility I posted about previously, you can graph the output with very little effort using gnuplot. In my case I use pgd pgstat for capturing output for various PostgreSQL performance tests, and of course graphing that output is important.

Continue Reading »

3 responses so far

Oct 21 2008

pgstat: a database utility like iostat

Published by kgorman under postgres

I needed a utility for capturing various data-points about a PostgreSQL database as I performed load tests. I copied a utility I have used previously on Oracle that worked quite well. The new utility is called pgd pgstat. This utility spits out various DB metrics on the command line similar to iostat. It’s good for quick diagnosis, performance testing, etc. The output is suitable for import/graphing in excel as well, just use space delimited format when importing the data. Here is a screenshot of how the output appears. You can download the utility and give it a whirl.

4 responses so far

Sep 16 2008

Unique identifier for a database without connecting to the database?

Published by kgorman under postgres

In PostgreSQL, the working directory is a unique identifier for a database, and sometimes you want to use that working dir in your script(s). But what if you don’t want to actually connect and query the database? Is there a way to find out the unique identifier for the db? In a high transaction environment, making the archive process fast and efficient is very important, so keeping an eye on efficiency when coding up these little bits is paramount. The key lies in the /proc/ filesystem in Linux. Here is a little function that returns the unique identifier:

import os
def getbase():  
  ppid=os.getppid()  
  cwd=os.readlink("/proc/"+str(ppid)+"/cwd")  
  rwd=cwd.replace("/","_")  
  return cwd,rwd

This comes in handy, for instance, in a script that processes archived WAL logs. This way no db queries are required in order to uniquely name logs by prepending the variable ‘rwd’ from the above example to it’s name. One thing to remember is that os.getppid is the parent process, so if you launch this by hand on the OS, you will not get the directory you expect.

No responses yet

Next »