Database engineering, architecture, startups, and other assorted bits

Dropping ACID

The de facto durability story in MongoDB is essentially… there is none. Or at least no single-server durability. OMFG! No ACID! WTF?!

For the next generation of internet-scale, downtime-intolerant systems, ACID may not be a desirable property.

Traditional data stores like Oracle, PostgreSQL, and MySQL all have a durability design built in. They are all ACID-compliant databases. The ‘D’ stands for durability: when someone issues a commit(), the database ensures the change is durable. When a write is requested by the caller and the ack comes back as complete, the change is on ‘disk’.
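As a minimal sketch of what that ack actually promises in an ACID store, consider a simple append-only log (the function name and file layout here are made up for illustration):

```python
import os

def durable_commit(path: str, record: bytes) -> None:
    """Append a record and return only once it is on stable storage.

    Illustrative sketch of the 'D' in ACID: the ack to the caller is
    delayed until fsync() confirms the OS has pushed the bytes out of
    its own caches toward the disk.
    """
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # drain the userspace buffer
        os.fsync(f.fileno())  # force the OS page cache down to disk
    # Only now is it safe to tell the caller "committed".
```

The fsync() call is exactly where the spinning-disk wait described below comes from: every commit pays it.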

However, single server durable systems have these drawbacks at internet scale:

  1. Must repair/replay on recovery; the customer/application must wait for this to complete.
  2. Must wait for disk(s) to spin on every commit(); performance suffers, or more complicated hardware must be used.
  3. Must trust the hardware to ensure the write actually happened.

In MongoDB, the ack comes back to the caller as soon as the change is in RAM, which is volatile. So if the server crashes before that change flushes from memory to disk, poof, your data is gone!
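That write path can be sketched as follows (an illustrative toy, not MongoDB’s actual implementation; the class and names are made up): writes are acked from RAM, a background timer flushes to disk, and everything written since the last flush is the data-loss window on a crash.

```python
import os
import threading

class MemoryFirstStore:
    """Hypothetical sketch: acknowledge writes from RAM, flush to disk
    on a timer. Anything written after the last flush is lost if the
    process dies."""

    def __init__(self, path: str, flush_interval: float = 60.0):
        self.path = path
        self.buffer = []                 # volatile: lives in RAM only
        self.lock = threading.Lock()
        self._arm_timer(flush_interval)

    def write(self, record: bytes) -> str:
        with self.lock:
            self.buffer.append(record)
        return "ok"                      # ack before anything hits disk

    def flush(self) -> None:
        with self.lock:
            pending, self.buffer = self.buffer, []
        if pending:
            with open(self.path, "ab") as f:
                f.write(b"".join(pending))
                f.flush()
                os.fsync(f.fileno())     # now, and only now, durable

    def _arm_timer(self, interval: float) -> None:
        def tick():
            self.flush()
            self._arm_timer(interval)
        t = threading.Timer(interval, tick)
        t.daemon = True
        t.start()
```

Between ticks, a crash silently drops every ack’d-but-unflushed write, which is precisely the behavior being debated here.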

Incidentally, many folks run ACID-compliant databases and are unknowingly not durable because they fail to set up the hardware correctly. There are many levels of caching in today’s architectures, and one must be very careful that every level is properly synchronously flushing to disk, so the entire stack correctly reports when a change is considered durable. PostgreSQL has a manual page dedicated to the subject; fsync can be turned ‘off’, but that may lead to corruption. Oracle has Asynchronous Commit, which is a really nice implementation allowing application-level control of commit behavior.
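In PostgreSQL, for example, the trade-off boils down to two settings; a sketch of a postgresql.conf fragment (values illustrative):

```ini
# postgresql.conf
synchronous_commit = off   # commit acks before the WAL hits disk; a crash
                           # may lose recent transactions, but data stays
                           # consistent (no corruption)
fsync = on                 # keep this on; turning it off trades corruption
                           # risk for speed
```

synchronous_commit can also be set per session, so one application can drop the ‘D’ while others on the same server keep it.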

When using any persistent data store that is not ACID compliant, something must be in place to make sure the writes become durable at some point, or you simply accept losing some data on a crash. If you are building a large cluster of databases, you can expect cluster-wide MTBF to decrease (failures become more frequent) in proportion to the number of machines in the cluster.

Many internet-scale systems these days have a very low tolerance for downtime, so if a server crashes the MTTR must be very low. Take the death of RAID as an example. My trusted source says we can expect ultra-high-capacity disks sooner than StorageMojo indicates. The takeaway is that fairly soon, RAID won’t be a viable option for an uptime-sensitive internet-scale data store.

Many large-scale systems today have the concept of a replica database, whether for scaling reads, for backup, or both. However, most existing systems don’t synchronously guarantee that a write has made it to N replicas. Some traditional RDBMS systems can be configured so that named replicas get written to synchronously; this is still an ACID model, however.

Here is where I think Eliot’s comments and design become viable: don’t try to be durable to any single machine. Hardware fails and disk is slow. Instead, be durable to a set of machines. That is: “ensure a write makes it to N machines in a set”. Just be durable to the memory on those machines. For instance, if you have a single master and two slaves, then a commit will be considered good when (and block until) it makes it to the master plus one slave.

Using the model outlined by Werner Vogels, this would be:

N=3, W=2, R*=1 (fault tolerant, read what you wrote)

*When R=1, it should mean reads go to the master, not any slave.
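The commit rule above (“block until the write is on W machines of the N in the set”) can be sketched like this; the Node class and all names are hypothetical, and durability here is to memory only:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class Node:
    """Hypothetical in-memory replica: the write lives in RAM only."""
    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def apply(self, key, value) -> bool:
        self.data[key] = value        # held in memory, never fsync'd
        return True                   # ack back to the coordinator

def replicated_write(nodes, key, value, w: int) -> bool:
    """Block until the write is applied on at least w of the nodes.

    Sketch of replication-as-durability: the commit is good once w
    machines in the set hold the change, even though none touched disk.
    """
    acks = 0
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(n.apply, key, value) for n in nodes]
        for fut in as_completed(futures):
            if fut.result():
                acks += 1
            if acks >= w:
                return True           # durable to the set; stop waiting
    return False                      # fewer than w acks: commit fails
```

Note the caller unblocks as soon as the w-th ack arrives; slower replicas catch up in the background, which is where the eventual-consistency trade-off below comes from.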

If an application were so designed and could tolerate eventually consistent data, then one could optionally configure for:

N=3, W=2, R=3 (eventually consistent, fault tolerant, read scalable)

From a durability standpoint, this design could be called: Nsync Replication as Durability, or NRaD.

This type of design has one very nice attribute: it does not require expensive servers with fancy disk subsystems to ensure durability, so absolutely rock-bottom inexpensive hardware can be used. We no longer have any requirement for battery-backed cache controllers, SATA drive caches, disk architecture, or even RAID. We now have a cluster composed of a redundant array of inexpensive computers. Knowledge of direct I/O, async I/O, SATA drive caches, battery-backed caches, partial writes, RAID, and many other complexities simply doesn’t matter much, or at all. RAID may still be desirable so that a single drive failure doesn’t cause the cluster to fail over often. But the complexity of the stack has dropped massively, and so has the cost of each machine.

So what is the downside of NRaD? More than one machine is needed for this type of architecture, of course; at least two hosts are required. It’s also important to make sure all the machines are located in separate failure domains. A failure domain is a group of items all likely to fail at once because of a common dependency, for instance an entire rack or a whole data center. Keeping the master and replicas in separate failure domains helps ensure no single event brings them all down. Latency may also increase: synchronously writing to more than one machine means the calling process must wait for that to happen. Whether it introduces more latency than waiting on spinning disks depends on the network and the I/O capabilities of the equivalent durable system. Finally, this type of architecture requires the application to tolerate an eventually consistent model (depending on configuration); data is not guaranteed to be consistent on all machines.

Does this fly in the face of the CAP theorem? No. Data is not guaranteed to be on all replicas, just N, and N may be a subset. The user can configure the system to fit the specific needs of the application in question and focus on durability alone, or on durability plus read scalability.

Just to be clear, the requirements for an NRaD system might not make sense for everyone. YMMV. But for a vast majority of downtime sensitive, internet scale persistent stores, it seems like a very good fit.

If you’re a MongoDB user and want Nsync Replication as Durability in the server, vote/comment for it here. There is some work to do to get the product to the point where it performs replication as durability.

18 Discussions on
“Dropping ACID”
  • Kenny,

    I don’t disagree with you that recovery-via-replicated-data is the right answer for primary data recovery on large web clusters. A hybrid solution involving recovery from disk combined with catch-up from replicas can be even better; this allows the downed node to be brought back up with a minimum of recovery load on the other servers. Assuming, of course, that disk logging isn’t too much of a performance hit (although asynchronous disk logging could alleviate that).

    What I’m asserting instead is that web-scale durability is hard to implement, complex and high-administration (ask Google or Amazon how many engineer-hours they spent on it), and not somehow easier than disk-based durability. Nor is durability-via-replication an excuse for not having a local durability option for a general open source database; what about users who only have a single server?

    The real answer for any new database system is to have both replication-based and local durability options, with the ability to enable and disable either, possibly in more than one mode.

  • Just another note….

    Consistency is a trade-off. You may require complete consistency for a given application and maybe eventual consistency for another. You may decide to value availability as highest or keeping cost low the highest. You may require ACID for one data store, and a BASE system for another. MongoDB is just one architectural design in your toolbox. Use it to fit a given need. Be educated on how it works and fits.

    There is a lot of noise about NOSQL lately. By no means is MongoDB a *replacement* for RDBMS databases. It serves a very specific point in the tradeoffs I discussed.

  • Kenny,

    What you’re really saying is that, right now, MongoDB has no durability of any kind, either single-server or multi-server. I have to admit to some irritation that MongoDB constantly champions recent trends in software development without actually implementing them. In contrast, in the non-relational world, Redis has single-server durability as an option, and several third-party add-ons give you multi-server durability for Memcached.

    Why am I reminded of MySQL folks back in 1998 explaining how nobody needed transactions?

    There is no question that for a scalable web architecture, multi-server (failover) durability, otherwise known as High Availability, beats recoverability for uptime. This is not a new thing; it is not even recent: telecom databases have relied on HA-as-durability at least since the ’80s. Of course, having both recoverability and HA beats everything, and that’s why the high-end products (Bigtable, Dynamo, etc.) do both. After all, what if something takes your whole data center offline? Do you really want to permanently lose your data?

    Further, implementing real HA-as-durability is actually quite difficult. Tried to install Cassandra lately? It’s not simple, and there’s a reason for that. The MongoDB folks are about to learn how hard it actually is. Good luck!

    • Josh,

      Thanks for the reply. Just a couple thoughts.

      Correct, MongoDB is not yet guaranteed durable. There is a ticket that addresses it, due in 1.5.

      The recoverability issue is not a theoretical problem; it’s a practical problem at web scale. At web scale, recoverability is just not an option for many SLAs. If it takes 10 minutes to recover a DB, that’s just too long. This issue plagued us on PostgreSQL @ Hi5 as well as Oracle @ PayPal/eBay. If scalability is not the primary problem and some downtime is OK, then sure, why not just have both.

  • There is a difference between ‘not durable’ and ‘really not durable’ —

    I think that ‘not durable’ is a great compromise to improve performance without leading to significant downtime during a crash. I can reason about how much data is likely to be lost and how long recovery will take. I cannot do the same for ‘really not durable’ systems.

    Is MongoDB ‘not durable’ or ‘really not durable’?

    • ‘Not durable’ means changes were made to data but not yet fsync’d to disk, i.e., the data exists only in memory. It does not mean loss of changes that have already made it to disk; that case I would call corruption, and recovery would be required. That is, if I am reading your distinction between ‘not durable’ and ‘really not durable’ correctly.

  • Interesting thoughts.

    I would add that PostgreSQL has had Asynchronous Commit for more than two years now, first released in 8.3. So yes, we drop ACID if and when we choose to, but we don’t force people to do so if they don’t want to.

  • Nice writeup. Didn’t you mean to say MTBF will *decrease* as you add servers? In most systems that are not explicitly HA (i.e. with heartbeat and failover and such) that’s what will happen. Increasing component count = increasing failure rate = decreasing MTBF.

    Also, I think we need to start getting the word out that being durable to a set of machines doesn’t necessarily mean replication. Using erasure codes and such it’s possible to achieve high levels of data protection across machines without having a full replica at each one.

  • Hi Kenny,

    Couple of comments.

    1. As I’ve written in my “MongoDB durability is a tradeoff” article, I don’t think MongoDB acknowledges a successful write only after it reaches the slaves/pairs; it acknowledges immediately. That’s why I don’t think that MongoDB blog post is correct. There are other systems (e.g., Riak, Cassandra, Voldemort) where this happens.

    2. Not sure I’m reading your N/W/R correctly, but it was my impression that for N=3, R=3 implies a consistent read, R=2 a quorum read, and R=1 eventual consistency.


    :- alex

    • Alex,

      1: Correct, MongoDB does not yet implement replication as durability by returning only when N slaves have the write. That’s the point of the last paragraph, where I ask for help making sure it gets that feature.

      2: See my point about the read always being from the master. Aka: Just for durability purposes not read scalability.

      Thanks for the comments.

  • What I am wondering about: in case you do have a master failure, I assume getting a slave up that really has all the data is going to be tricky. Of course, in many situations you will not care, but for those use cases it might make more sense to have two preferred slaves (which may not even get any user traffic, to ensure that they are always prepared to accept data from the master as quickly as possible).
