The death of a Sun

As I was shutting down my last Sun server in my colo, I was thinking back about how things have changed. It’s sad to see such a great company fall. But it seems there is little reason to keep using Sun or Solaris. We will miss you Sun.

$>uname -a
SunOS bora 5.9 Generic_112233-01 sun4u sparc SUNW,Ultra-60
$>uptime
10:10am up 599 day(s), 12:52, 2 users, load average: 0.07, 0.04, 0.03
$>halt

Posted in Random | Tagged , , | View Comments

Dropping ACID

The de-facto durability story in MongoDB is essentially… there is none. Or at least single server durability. OMFG! No ACID WTF! &^%#^#?!

For the next generation of internet scale, downtime intolerant systems, ACID may not be a desirable property.

Traditional data stores like Oracle, PostgreSQL and MySQL all have a durability design built in. They are all ACID compatible databases. The ‘D’ stands for durability. Meaning, when someone issues a commit(); then the database ensures the change is durable. When a write is requested from the caller and the ack comes back as complete, then the change is on ‘disk’.

However, single server durable systems have these drawbacks at internet scale:

  1. Must repair/replay on recovery. The customer/application must wait for this to complete.
  2. Must wait for disk(s) to spin on every commit(). Performance suffers or more complicated hardware must be used.
  3. Trust the hardware to ensure the write actually happened

In MongoDB, the ack comes back to the caller as soon as the change is in RAM, which is volatile. So if the server crashes before that change flushes from memory to disk, poof your data is gone!

Incidentally, many folks run ACID compliant databases and are unknowingly not durable because they fail to setup the hardware correctly. There are many levels of caching on todays modern architectures, and one must be very careful that every level is properly synchronously flushing to disk in order to make sure the entire stack is correctly reporting when a change is considered durable. PostgreSQL has a manual page dedicated to the subject. It can be turned ‘off’ but it may lead to corruption. Oracle has Asynchronous commit which is a really nice implementation allowing application level control of commit behavior.

When using any persistent data store that is not ACID compliant, something must be in place to make sure the writes are durable at some point OR just not care if you miss some data on crash. If you are building a large cluster of databases then you can expect MTBF to increase in proportion to the number of machines being used in the cluster.

Many internet scale systems these days have a very low tolerance for downtime. So if a server crashes the MTTR must be very low. Take the death of RAID as example. My trusted source says we can expect ultra high capacity disks sooner than StorageMojo indicates. The take away is that fairly soon, RAID won’t be a viable option for an uptime sensitive internet scale data store.

Many current large scale systems currently have a concept of a replica database. It be for scaling reads or just for backup purposes or both. However, most existing systems don’t synchronously guarantee the write has made it to N replicas. Some traditional RDBMS systems can function so that a named replica(s) get written to. This is still an ACID model however.

Here is where I think Eliot’s comments and design become viable: Don’t try to be durable to any single machine. Hardware fails and disk is slow. Instead, be durable to a set of machines. That is: “ensure a write makes it to N machines in a set”. Just be durable to the memory on those machines. For instance, if you have a single master and three replicas, then a commit will be considered good when (and block until) it makes it to the master plus two slaves.

Using the model outlined by Werner Vogels, this would be:

N=3, W=2, R*=1 (fault tolerant, read what you wrote)

*When R=1, then it should mean the master not any slave.

If an application was such designed and could tolerate eventually consistent data, then one could optionally configure for:

N=3, W=2, R=3 (eventually consistent, fault tolerant, read scalable)

From a durability standpoint, this design could be called: Nsync Replication as Durability, or NRaD.

This type of design has one very nice attribute. It does not require expensive servers with fancy disk subsystems to ensure durability, thus absolutely rock bottom inexpensive hardware can be used. We no longer have any requirement on battery backed cache controllers, SATA drive caches, disk architecture or even RAID. We now have a cluster composed of an redundant array of inexpensive computers. Knowledge of direct I/O, async I/O, SATA Drive Caches, Battery Backed Caches, Partial Writes, RAID, and many other complexities just simply don’t matter much or at all. RAID may still be desirable such that a single drive failure doesn’t cause the cluster to fail over often. But the complication of the stack has dropped massively. The cost of each machine has dropped massively.

So what is the downside with a NRaD? Of course more than one machine is needed for this type of architecture. It’s important to make sure that all the machines are located in separate failure domains. A failure domain is a group of items all likely to fail at once because of a common dependency. For instance, an entire rack or a whole data center. So keeping the master and replicas in separate failure domains helps to ensure no single event brings them all down. In addition, at least two hosts are required. Latency may be increased. But not guaranteed depending on I/O capabilities of durable systems. Synchronously writing to more than one machine means the calling process must wait for this process to happen. It may introduce latency in your application, but more than spinning disks? Also, this type of architecture requires the application to tolerate an eventually consistent model (depending on config). Data is not guaranteed to be consistent on all machines.

Does this fly in the face of the CAP Theorem? No, data is not guaranteed to be on all replicas. Just N. N may be a subset. The user could configure the system to fit the specific needs of the application in question and focus on just durability or focus on durability plus read scalability.

Just to be clear, the requirements for an NRaD system might not make sense for everyone. YMMV. But for a vast majority of downtime sensitive, internet scale persistent stores, it seems like a very good fit.

If your a MongoDB user and want Nsync Replication as Durability in the server, vote/comment for it here. There is some work to do to get the product to the point where it would perform replication as durability.

Posted in Data Architecture, Database Engineering, Mongodb, MySQL, Oracle | Tagged , , , , , , , , | View Comments

mongodb 1.3.3 (devel) with mongostat

MongoDB 1.3.3 (devel) was released today. My tool called mongostat has been incorporated into mongo1.3.3 and changed from python to C++. So further development will come inside the mongodb distribution itself. Also of note is new slave lag for replication. Changelog here.

Posted in Mongodb | Tagged , | View Comments

mongostat 0.2b

I have been playing quite a bit with MongoDB lately. MongoDB is a nosql type database. It’s different than a simple k/v store however. It allows sorting, secondary indexes and such. It’s format is BSON a binary representation of JSON with some additions.

Anyway.. Wow what a mind bender working in the ‘schema-less’ nosql type databases is.

One of the things I needed in order to evaluate performance is a tool similar to pgstat2 and/or iostat for mongodb performance. So yet again, I created a tool, and posted it up on github. It’s still just beta and quite simple. Feel free to give feedback on github.

Posted in Mongodb, Python | Tagged , , , | View Comments

Hello Shutterfly

shutterfly[1]I am very excited to start a new position at Shutterfly.com. Shutterfly is a well known internet property with a massivly growing customer base and thus new and interesting challenges in how to store, share, and organize data. What fantastic fun. I expect to post more on the growing NOSQL movement as well as continuing information on PostgreSQL and more traditional relational stores like Oracle.

Posted in Database Engineering | Tagged , | View Comments

pgstat 1.0 released

I added a couple of fixes to the code and released it as 1.0. We have been using it here at hi5 for some time w/o problems. Thanks everyone who has helped with feedback. Also thanks to Devrim GUNDUZ for his help. The latest version can be downloaded on pgfoundry.

Posted in PostgreSQL | Tagged , , | View Comments

pg_reorg 1.0.4

At Hi5, we currently use pg_reorg1.0.3 in order to organize data in a clustered fashion. I posted previously about the strategy. Our version is slightly modified, the modifications I made to the C code essentially allow pg_reorg to spin/wait for locks on the objects to be released before proceeding.

The good news is the folks at NTT have incorporated a similar change in pg_reorg 1.0.4. This is a fantastic improvement, and frankly implemented in a cleaner way than my changes.

The crux of the issue is the situation where a database is being auto-vacuumed, you can’t be guaranteed that pg_reorg and the vacuum will not collide. In theory you should not need to vacuum a table which you are pg_reorg’ing because that is the point of a pg_reorg, it’s essentially a vacuum full w/ extra features because the table is being rebuilt from scratch. However in an environment where auto-vacuum is being utilized to keep tables vacuumed, both will need to co-exist.

The change is simple, use the NOWAIT option of lock table to fail if the lock can not be obtained. This is wrapped in a loop until the lock is granted. The effect is pg_reorg patiently sits and waits while your vacuums complete and then it can finish it’s work. The downside is if any of these operations run for too long, then the journal table may grow very large. So there should be some monitoring wrapped around the code if it’s intended to run in the background. For the future we need a backoff algorithm as well as perhaps a limit to the number of spin/sleep cycles, but hey this is excellent progress.

This tool is essential in my humble opinion for everyone running PostgreSQL in a high transaction/high availability environment. By the way, pg_reorg works seamlessly with Slony-I.

The code addition does the following:

for (;;)                        
        {
                command("BEGIN ISOLATION LEVEL READ COMMITTED", 0, NULL);
                res = execute_nothrow(table->lock_table, 0, NULL); 
                if (PQresultStatus(res) == PGRES_COMMAND_OK)
                {
                        PQclear(res);
                        break;
                }
                else if (sqlstate_equals(res, SQLSTATE_LOCK_NOT_AVAILABLE))
                {
                        /* retry if lock conflicted */ 
                        PQclear(res);
                        command("ROLLBACK", 0, NULL);
                        sleep(1);
                        continue;
                }
                else
                {
                        /* exit otherwise */
                        printf("%s", PQerrorMessage(connection));
                        PQclear(res);
                        exit(1);        
                }
        }

The text below is a snip of the strace on pg_reorg while it’s waiting for the lock:

rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
sendto(3, "P\0\0\0008\0SELECT reorg.reorg_apply($"..., 529, 0, NULL, 0) = 529
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(3, "1\0\0\0\0042\0\0\0\4T\0\0\0$\0\1reorg_apply\0\0\0\0"..., 16384, 0, NULL, NULL) = 77
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
sendto(3, "P\0\0\0\177\0SELECT 1 FROM pg_locks WHE"..., 178, 0, NULL, 0) = 178
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(3, "1\0\0\0\0042\0\0\0\4T\0\0\0!\0\1?column?\0\0\0\0\0\0\0"..., 16384, 0, NULL, NULL) = 74
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})

Your Postgresql log file will show the following:

Jun 25 17:09:33 <dbname> postgres[7825]: [37-2] 2009-06-25 17:09:33 PDTSTATEMENT:  LOCK TABLE <tablename> IN ACCESS EXCLUSIVE MODE NOWAIT
Jun 25 17:09:34 <dbname> postgres[7825]: [38-1] 2009-06-25 17:09:34 PDTERROR:  could NOT obtain LOCK ON relation "<tablename>"
Posted in Database Engineering, PostgreSQL | Tagged , , | View Comments

Fusion-io SSD

I got the opportunity to test out some of the new Fusion-io Solid State ioDrive, and I thought I would post some results.

Fusion-io has created a SSD product called ioDrive that is based on PCIe cards vs replacing SAS or SATA drives with SSD directly. This approach allows for much lower latency because of the use of the PCIe bus vs traditional disk channels geared towards slow disk. The 320GB model I used in my test are made of Multi Level Cell (MLC) NAND flash and are quoted by Fusion-io to achieve throughput somewhere in the 70k IOPS neighborhood.

Continue reading

Posted in Database Engineering, PostgreSQL | Tagged , , , , | View Comments

pgstat 0.8beta released on PgFoundry

I moved the pgstat (previously named pgd) project to pgfoundry. Thank you to the folks over there approving the project. I added a column for ‘active’ processes from pg_stat_activity as well as some fixes requested by Devrim that I really should have had done from the start. Thanks for the contribution. Downloads can be found here. Oh, and I will remove the OSX file from the .tar file in the future.

Want something added? Just comment on this post..

Posted in PostgreSQL, Python | View Comments

Cluster data, save cash

Since the economy is not exactly rocking these days, I suspect there are a lot of companies out there trying to save a buck or two on infrastructure. Databases are not exactly cheap, so anything that an engineer or DBA can do to save cycles is a win. So how do you stretch your existing hardware and make it perform more transactions for the same amount of cash?

Clustering your data is an approach to reducing load and stretching the capacity of your database servers. Clustering data is a technique where data is reorganized to match the query patterns very specifically and thus reducing the amount of logical (and also physical) I/O a database performs. This technique is not RDBMS product specific, it applies to Oracle, PostgreSQL, or most other block based row-store RDBMS. I am going to reference PostgreSQL and a very specific case where clustering data can produce huge performance gains.

Continue reading

Posted in Database Engineering, PostgreSQL, Python | View Comments