Data clustering in MongoDB using embedded docs

I wrote a while ago about how to cluster data to save cash. That post was geared toward relational stores, but in reality the technique applies to any block store on disk. To recap, the premise is simple: when you run a query for some amount of data, you want to minimize I/O as much as possible. Even if the result is in cache, you still want to reduce logical I/O. See that post for examples.

So how does one manage this technique in MongoDB? If you're not familiar with it, MongoDB is a fairly new database that is non-relational and schema-less. However, data density and clustering are still important. Any time you can reduce the amount of logical or physical I/O needed to return a query, that is a good thing.

With SERVER-1054, the MongoDB folks implemented one key feature that helps one manage data clustering in MongoDB: the ability to show which file and offset a given document lives at. This allows the location to be inspected and action to be taken. Think of it as a measure of how fragmented your data is across blocks.

> var arr = db.foo.find({}, {'$diskLoc': 1}).showDiskLoc()
> for (var i = 0; i < arr.length(); i++) { printjson(arr[i].$diskLoc) }
{ "file" : 0, "offset" : 3241650 }
{ "file" : 0, "offset" : 3241650 }

In the above example, both of these documents live in block #3241650, so the data is dense.

If a given set of documents is typically queried together, those documents should be stored as densely as possible. Having them in contiguous blocks helps as well.
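To make "dense" concrete, here is a small sketch (plain JavaScript, runnable outside the mongo shell) that counts how many distinct blocks a result set touches, given the file/offset pairs that showDiskLoc() reports. The 4KB block size and the helper name are assumptions for illustration only:

```javascript
// A fragmentation check: how many distinct blocks does a result set touch?
const BLOCK_SIZE = 4096; // assumed block size

function distinctBlocks(diskLocs) {
  const blocks = new Set();
  for (const loc of diskLocs) {
    // Documents in the same file whose offsets fall inside the same
    // BLOCK_SIZE window share a block.
    blocks.add(loc.file + ':' + Math.floor(loc.offset / BLOCK_SIZE));
  }
  return blocks.size;
}

// Dense: two documents at nearby offsets in the same file -> one block.
console.log(distinctBlocks([
  { file: 0, offset: 3241650 },
  { file: 0, offset: 3241700 },
])); // 1

// Fragmented: scattered offsets -> three blocks, three logical I/Os.
console.log(distinctBlocks([
  { file: 0, offset: 0 },
  { file: 0, offset: 90000 },
  { file: 1, offset: 0 },
])); // 3
```

A result of 1 for a multi-document query is the ideal: one logical I/O for the whole payload.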

In a traditional RDBMS data store there are various techniques to re-cluster the data by a given key so it is arranged densely, for instance CREATE TABLE foo_clustered AS SELECT * FROM foo ORDER BY mykey. However, it's mostly a one-time affair, because future inserts may not land densely.

In MongoDB, depending on the design, that may or may not be required. A design pattern called embedding can alleviate many of the typical problems associated with data clustering and *automatically* keep your collection dense, making MongoDB feel much faster than a traditional RDBMS.

Let me give an example to illustrate. Consider the following relational data model:

postgres=# \d photos
        Table "public.photos"
  Column   |     Type      | Modifiers
-----------+---------------+-----------
 id        | integer       |
 user_id   | integer       |
 file_path | character(42) |

And the typical access path is:

select * from photos where user_id = 10;

Then one can expect, in the worst case, that each matching row lives in a different block: at least 3 I/O operations to return this query. If the rows were dense, they would all be in one block.

postgres=# select ctid, * from photos where user_id=10;
  ctid   | id | user_id |    file_path
---------+----+---------+-----------------
 (0,1)   |  1 |      10 | /home/foo/1.jpg
 (22,2)  | 24 |      10 | /home/foo/2.jpg
 (334,3) | 23 |      10 | /home/foo/3.jpg

In MongoDB the following model can be used to *always* keep the data dense and tightly clustered.

{ "_id" : ObjectId("4c252807164314895e44fb6d"),
  "user_id" : 10,
  "paths" : ['/home/foo/1.jpg','/home/foo/2.jpg','/home/foo/3.jpg']
}
And the query would be: db.photos.find({"user_id" : 10})

The data payload is exactly 1 I/O. As an embedded document grows past the block size it starts to span multiple blocks, so this is an additional design consideration: keep embedded documents smaller than the block size, or you may lose the benefit.
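One way to honor that consideration is to check the would-be document size before embedding another element. This is only a sketch: it uses JSON text length as a rough stand-in for BSON size (the mongo shell's own Object.bsonsize() would be more accurate), and the 4KB budget is an assumed block size:

```javascript
// Budget check before embedding one more path into a photos document.
// JSON text length is a rough stand-in for BSON size; BLOCK_BUDGET is
// an assumed limit matching the block size discussed above.
const BLOCK_BUDGET = 4096;

function canEmbed(doc, newPath) {
  const candidate = Object.assign({}, doc, { paths: doc.paths.concat(newPath) });
  return JSON.stringify(candidate).length <= BLOCK_BUDGET;
}

const doc = { user_id: 10, paths: ['/home/foo/1.jpg'] };
console.log(canEmbed(doc, '/home/foo/2.jpg')); // true: still well under budget
```

When the check fails, the application can start a second document for the same user rather than let one document sprawl across blocks.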

Embedding may not always be possible. But if one is aware of the potential I/O savings at design time, it's just another data point toward a more intelligent, faster data store.

MongoDB does not yet have a simple operation to rebuild a collection and re-order the data. That is the technique commonly used on the RDBMS side, shown in my previous post's examples. So design with that in mind.

[Read Data clustering in MongoDB using embedded docs]

WordPress 3.0

On June 17th, WordPress 3.0 was launched. I decided to take the plunge and upgrade; there are just so many compelling features it's hard not to. Part of the new release is the Twenty Ten theme, with some pretty exciting features including a new menu system that I hope to take advantage of, as well as featured images. I decided to go ahead and use it, hence the new look.

I try not to blog about blogging, but I couldn’t help it in this case. The upgrade is pretty compelling, and I thought I would share my thoughts.

[Read WordPress 3.0]


mongostat

The MongoDB command-line performance monitoring utility named mongostat is now (well, since 1.3.3) part of the core distribution of MongoDB. The Python version hosted on my site is now deprecated in favor of the C++ version in the distro.

[Read mongostat]

MongoSF; Videos up

The videos from MongoSF are starting to get posted now on the 10gen site. The presentations are there too. Here is my talk:


[Read MongoSF; Videos up]

Wayback Machine: snapshots still valid technique

I came across this old article I wrote for the NOCOUG newsletter in 2002 about using OS snapshots for backups. It is still very much a valid and widely used way to perform backups. The idea is simple:

  • Stop I/O temporarily
  • Snapshot the filesystem (OS snapshot, rsync, whatever)
  • Release I/O
  • Backup any logs needed to recover point in time

This technique works for many different data stores. In the article I only show Oracle, but many other databases have the same capabilities for backups. Here are some examples:


PostgreSQL:

SELECT pg_start_backup('label');
-- snapshot the DB here
SELECT pg_stop_backup();
-- backup WAL logs here

You can find all the details of this kind of backup in the PostgreSQL docs.


MongoDB:

> use admin
switched to db admin
> db.runCommand({fsync:1,lock:1})
{ "info" : "now locked against writes", "ok" : 1 }
// snapshot the DB here
> db.$cmd.sys.unlock.findOne();
{ "ok" : 1, "info" : "unlock requested" }

You can find the docs on this procedure on the MongoDB site.
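The two examples follow the same four steps from the list above; here is a sketch of that sequence as a generic JavaScript wrapper, where the lock/snapshot/unlock/archiveLogs callbacks stand in for the store-specific commands (all names here are invented for illustration):

```javascript
// Generic snapshot-backup sequence: quiesce writes, snapshot, resume,
// then archive the logs needed for point-in-time recovery.
function snapshotBackup({ lock, snapshot, unlock, archiveLogs }) {
  lock();                  // stop I/O temporarily
  let snap;
  try {
    snap = snapshot();     // OS snapshot, rsync, whatever
  } finally {
    unlock();              // release I/O even if the snapshot failed
  }
  archiveLogs();           // logs needed to recover point in time
  return snap;
}

// Usage with no-op stand-ins, recording the order of operations:
const steps = [];
snapshotBackup({
  lock: () => steps.push('lock'),
  snapshot: () => { steps.push('snapshot'); return 'snap-1'; },
  unlock: () => steps.push('unlock'),
  archiveLogs: () => steps.push('logs'),
});
console.log(steps.join(' -> ')); // lock -> snapshot -> unlock -> logs
```

The try/finally matters: the whole point of the technique is that the I/O pause is brief, so the lock must be released even when the snapshot command fails.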

I thought I would include the original article here even though it’s going on 8 years old!

OS Snapshots for Backup;
Utilizing operating system snapshots for quick and painless Oracle database backup and restore.
from VOL. 16, No. 2 · MAY, 2002 of the NOCOUG Journal

[Read Wayback Machine: snapshots still valid technique]

MongoSF Slides

I had a great time at the MongoSF Conference on Friday. There were a ton of great presentations, and lots and lots of excitement. A big thanks to 10gen for inviting me to speak. I had a great time and I hope everyone learned a lot from our experiences so far with MongoDB. I especially liked Mike Dirolf’s discussion on Python and pymongo. There have been lots of changes as of late, and most of them fantastic!

Here are my slides from my presentation:

[Read MongoSF Slides]

The death of a Sun

As I was shutting down my last Sun server in my colo, I thought back on how much things have changed. It's sad to see such a great company fall, but there seems to be little reason to keep using Sun or Solaris. We will miss you, Sun.

$ uname -a
SunOS bora 5.9 Generic_112233-01 sun4u sparc SUNW,Ultra-60
$ uptime
10:10am  up 599 day(s), 12:52,  2 users,  load average: 0.07, 0.04, 0.03

[Read The death of a Sun]

Dropping ACID

The de-facto durability story in MongoDB is essentially… there is none. Or at least no single-server durability. OMFG! No ACID! WTF! &^%#^#?!

For the next generation of internet scale, downtime intolerant systems, ACID may not be a desirable property.

Traditional data stores like Oracle, PostgreSQL and MySQL all have durability designed in: they are ACID-compliant databases, and the 'D' stands for durability. When someone issues a commit(), the database ensures the change is durable; when a write is requested and the ack comes back as complete, the change is on 'disk'.

However, single server durable systems have these drawbacks at internet scale:

  1. Must repair/replay on recovery, and the customer/application must wait for this to complete.
  2. Must wait for disk(s) to spin on every commit(); performance suffers, or more complicated hardware must be used.
  3. Must trust the hardware to ensure the write actually happened.

In MongoDB, the ack comes back to the caller as soon as the change is in RAM, which is volatile. So if the server crashes before that change is flushed from memory to disk, poof, your data is gone!

Incidentally, many folks run ACID-compliant databases and are unknowingly not durable because they fail to set up the hardware correctly. There are many levels of caching in today's modern architectures, and one must be very careful that every level synchronously flushes to disk in order for the entire stack to correctly report when a change is durable. PostgreSQL has a manual page dedicated to the subject; durability can be turned 'off', but that may lead to corruption. Oracle has asynchronous commit, which is a really nice implementation allowing application-level control of commit behavior.

When using any persistent data store that is not ACID-compliant, something must be in place to make sure the writes become durable at some point, OR you simply accept losing some data on a crash. And if you are building a large cluster of databases, expect failures more often: the cluster's effective MTBF shrinks in proportion to the number of machines in it. If each of 100 machines has an MTBF of three years, the cluster as a whole can expect a failure roughly every eleven days.

Many internet-scale systems these days have a very low tolerance for downtime, so if a server crashes the MTTR must be very low. Take the death of RAID as an example: my trusted source says we can expect ultra-high-capacity disks sooner than StorageMojo indicates. The takeaway is that fairly soon, RAID won't be a viable option for an uptime-sensitive internet-scale data store.

Many current large-scale systems have the concept of a replica database, whether for scaling reads, for backup, or both. However, most existing systems don't synchronously guarantee that a write has made it to N replicas. Some traditional RDBMS systems can be configured so that named replicas get written to, but that is still an ACID model.

Here is where I think Eliot's comments and design become viable: don't try to be durable to any single machine. Hardware fails and disk is slow. Instead, be durable to a set of machines, that is, ensure a write makes it to *N* machines in a set, and just be durable to the memory on those machines. For instance, if you have a single master and three replicas, then a commit will be considered good when (and block until) it makes it to the master plus two slaves.

Using the model outlined by Werner Vogels, this would be:

N=3, W=2, R*=1 (fault tolerant, read what you wrote)

*When R=1, it should mean the master, not any slave.

If an application were designed to tolerate eventually consistent data, one could optionally configure for:

N=3, W=2, R=3 (eventually consistent, fault tolerant, read scalable)
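The W-of-N commit rule described above can be sketched in a few lines of plain JavaScript. This is illustration only: the replica objects and function names are invented, and a real implementation would block on network acks rather than loop synchronously:

```javascript
// Replication as durability: a commit succeeds once W of the N replicas
// have acknowledged the write in memory; no disk flush is required.
function commitWrite(replicas, W) {
  let acks = 0;
  for (const replica of replicas) {
    if (replica.applyToMemory()) acks++; // in-RAM apply, no fsync
    if (acks >= W) return true;          // W copies exist: durable enough
  }
  return false; // fewer than W replicas reachable: commit fails
}

const healthy = { applyToMemory: () => true };
const down = { applyToMemory: () => false };

// N=3, W=2: one dead replica is tolerated and the commit still succeeds.
console.log(commitWrite([healthy, healthy, down], 2)); // true
console.log(commitWrite([healthy, down, down], 2));    // false
```

The second case is the interesting one: with only one live copy, the write is refused rather than acknowledged, which is exactly the trade that replaces trusting a single machine's disk.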

From a durability standpoint, this design could be called: Nsync Replication as Durability, or NRaD.

This type of design has one very nice attribute: it does not require expensive servers with fancy disk subsystems to ensure durability, so absolutely rock-bottom inexpensive hardware can be used. We no longer have any requirement for battery-backed cache controllers, SATA drive caches, disk architecture or even RAID; we now have a cluster composed of a redundant array of inexpensive computers. Knowledge of direct I/O, async I/O, SATA drive caches, battery-backed caches, partial writes, RAID, and many other complexities simply doesn't matter much, or at all. RAID may still be desirable so that a single drive failure doesn't cause the cluster to fail over often, but the complication of the stack has dropped massively, and so has the cost of each machine.

So what is the downside of NRaD? More than one machine is needed, of course, and it's important that the machines live in separate failure domains. A failure domain is a group of items likely to fail at once because of a common dependency, for instance an entire rack or a whole data center. Keeping the master and replicas in separate failure domains helps ensure no single event brings them all down. Latency may also increase: synchronously writing to more than one machine means the calling process must wait for those acks. That may add latency to your application, but is it more than waiting for spinning disks? Finally, this architecture requires the application to tolerate an eventually consistent model (depending on configuration); data is not guaranteed to be consistent on all machines.

Does this fly in the face of the CAP theorem? No; data is not guaranteed to be on all replicas, just N, and N may be a subset. The user can configure the system to fit the specific needs of the application, focusing on durability alone or on durability plus read scalability.

Just to be clear, an NRaD system might not make sense for everyone. YMMV. But for a vast majority of downtime-sensitive, internet-scale persistent stores, it seems like a very good fit.

If you're a MongoDB user and want Nsync Replication as Durability in the server, vote/comment for it here. There is some work to do to get the product to the point where it performs replication as durability.

[Read Dropping ACID]

mongodb 1.3.3 (devel) with mongostat

MongoDB 1.3.3 (devel) was released today. My tool mongostat has been incorporated into 1.3.3 and rewritten from Python to C++, so further development will happen inside the MongoDB distribution itself. Also of note is the new slave lag metric for replication. Changelog here.

[Read mongodb 1.3.3 (devel) with mongostat]

mongostat 0.2b

I have been playing quite a bit with MongoDB lately. MongoDB is a NoSQL-type database, but it's different from a simple k/v store: it allows sorting, secondary indexes and such. Its format is BSON, a binary representation of JSON with some additions.

Anyway… wow, what a mind bender working in 'schema-less' NoSQL-type databases is.

One of the things I needed in order to evaluate performance was a tool similar to pgstat2 and/or iostat for MongoDB. So yet again I created a tool and posted it on github. It's still beta and quite simple. Feel free to give feedback on github.

[Read mongostat 0.2b]

Hello Shutterfly

I am very excited to start a new position at Shutterfly. Shutterfly is a well-known internet property with a massively growing customer base, and thus new and interesting challenges in how to store, share, and organize data. What fantastic fun. I expect to post more on the growing NoSQL movement as well as continuing information on PostgreSQL and more traditional relational stores like Oracle.

[Read Hello Shutterfly]

pgstat 1.0 released

I added a couple of fixes to the code and released it as 1.0. We have been using it here at hi5 for some time without problems. Thanks to everyone who has helped with feedback, and thanks to Devrim GUNDUZ for his help. The latest version can be downloaded from pgfoundry.

[Read pgstat 1.0 released]