Stellar use cases for MongoDB

MongoDB has a nice, wide sweet spot where it’s a very useful persistence platform; however, it’s not for everything. I thought I would quickly enumerate a few great use cases that have come up in the last year and a half and why they are such a great fit for MongoDB.

  1. Documents: Using MongoDB instead of an XML-based system.

MongoDB is a document oriented data store. XML is a document language. By moving a traditional XML app to MongoDB one can gain a few key advantages. The typical pattern with XML is to fetch an entire document, work with it, and write it back to the server. This approach has many downsides, including the amount of data transmitted over the wire, collision detection/resolution, data-set size, and server-side overhead. In the MongoDB model, documents can be updated atomically in place, fetched by index, and even partially fetched. Applications become simpler, faster, and more robust.
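
To make the contrast concrete, here is a minimal mongo shell sketch (the docs collection and field names are hypothetical, not from the original post) of a partial fetch and an atomic in-place update of a single field:

// a document that might otherwise live in one large XML file
db.docs.insert({_id:"order-1", status:"open", customer:{name:"Jane", email:"jane@example.com"}, items:[{sku:"a1", qty:2}]})
// partial fetch: pull back only the customer sub-document, not the whole document
db.docs.find({_id:"order-1"}, {customer:1})
// atomic in-place update: change one field without re-writing the rest of the document
db.docs.update({_id:"order-1"}, {$set:{status:"shipped"}})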

  2. Metadata storage systems.

Any system that stores metadata can be a great use case for MongoDB. Such systems typically follow a pattern of adding attributes about some type of entity and then needing to query/sort/filter on those attributes. The prototypical example is tags. The tag implementation in MongoDB is so superior that it almost single-handedly compels one to use MongoDB for any system needing tags. Simply put:

// store a document with arbitrary metadata plus an array of tags
db.mymetadata.save({stuff:"some data here", thing:"/x/foo/bar.mpg", tags:['cats','beach','family']})
// index the tags array (a multikey index: one entry per tag)
db.mymetadata.ensureIndex({"tags":-1})
// find every document tagged 'cats'
db.mymetadata.find({tags:'cats'})
...
// excerpt of the explain() output, showing the query is bounded by the tags index
"indexBounds" : {
		"tags" : [
			[
				"cats",
				"cats"
			]
		]
	}

In many metadata systems the schema may vary depending on the metadata itself. This allows for a huge degree of flexibility in the data modeling of applications that store metadata. Imagine a media storage service that can store video and image data in the same collection, but with different attributes for each type of metadata. No joins are needed on query, and the logical I/O path is minimized! MongoDB now supports sparse indexes, so indexes on attributes that are not present in every document are kept to a minimum size.
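
As a rough sketch of that idea (the media collection and its fields are illustrative assumptions, not from the post), mixed document shapes plus a sparse index might look like this:

// video and image documents live in the same collection with different attributes
db.media.save({type:"video", path:"/x/foo/bar.mpg", duration:120, tags:["beach"]})
db.media.save({type:"image", path:"/x/foo/baz.jpg", width:1024, height:768, tags:["cats"]})
// sparse index: only documents that actually contain a 'duration' field get an index entry
db.media.ensureIndex({duration:1}, {sparse:true})
db.media.find({type:"video", duration:{$gt:60}})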

  3. Read-intensive systems

Any system where the amount of change is low and the read volume is high is a nice sweet spot for MongoDB. MongoDB scales reads nicely with both replica sets (setting SLAVE_OK) and sharding. Combine this with the document model and the metadata storage capabilities and you have an excellent system for, say, a Gawker clone. Reads can come off any one of N sharded nodes keyed by, say, story_id, and can be geographically targeted to a nearby slave. Keep your data clustered by key for super-fast I/O.
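
A minimal shell sketch of that shape (the database name, collection, and shard key are assumptions for illustration; the sharding commands are run against a mongos):

// shard the stories collection on story_id so reads spread across N nodes
db.adminCommand({enablesharding:"newsdb"})
db.adminCommand({shardcollection:"newsdb.stories", key:{story_id:1}})
// on a read-mostly client, allow queries to be served by slaves
db.getMongo().setSlaveOk()
db.getSiblingDB("newsdb").stories.find({story_id:12345})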

[Read Stellar use cases for MongoDB]


MongoSF 2011 slides: MongoDB Performance Tuning

Here are my slides from MongoSF 2011:

MongoDB Performance Tuning

[Read MongoSF 2011 slides: MongoDB Performance Tuning]


MongoSF 2011

I am very excited to speak @ MongoSF 2011. We have been doing quite a bit of performance tuning lately at Shutterfly as we deploy more and more MongoDB services. My hope is that I can share some of what we have been doing in terms of performance tuning and performance management, and that it will be valuable to folks who may face performance challenges with MongoDB. I just wanted to put up some of the specific items I will be going over:

  • Utilizing the profiler, interpreting the data, and using it to make your application faster. For instance, do you know how to see when your document updates cause a document to be read and re-written inside the datafile? (A short profiler sketch follows this list.)
  • How to tune around the single db-wide lock in MongoDB, and how to minimize its impact.
  • How to monitor using mongostat: what to look for, and what to do when you find something bad. For instance, are you looking at ‘locked %’? You should be!
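
As a quick taste of the profiler item above, here is a minimal mongo shell sketch (not from the talk itself; the 100 ms threshold is just an illustrative value):

// record operations slower than 100 ms in the system.profile collection
db.setProfilingLevel(1, 100)
// ... run the workload ...
// then inspect the slowest recorded operations
db.system.profile.find().sort({millis:-1}).limit(10)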

[Read MongoSF 2011]


PGEast and MongoDB?

When I first started playing with MongoDB I accidentally posted in a category that got my posts put up on Planet PostgreSQL. I remember getting some pretty pissed-off emails and comments. Beyond my mistake of posting to the wrong aggregator, people seemed pretty pissed off about NoSQL in general. I remember thinking how closed-minded that was. These are all different tools for different jobs, not a religion.

Well, it seems there is some hope! The PostgreSQL Conference Series folks have allocated some talks at PGEast for MongoDB. Nice job, guys. Let’s hope there are lots of good discussions; there are lots of concepts I think MongoDB can pick up from a more mature RDBMS like PostgreSQL, and the reverse too!

[Read PGEast and MongoDB?]


GigaOM article – Real World NoSQL: MongoDB at Shutterfly

Guy Harrison has a series of articles on real-world NoSQL deployments over at GigaOM. The second installment was on MongoDB, where he interviewed me about our deployment at Shutterfly. Check out Real World NoSQL: MongoDB at Shutterfly by Guy Harrison.

[Read GigaOM article – Real World NoSQL: MongoDB at Shutterfly]


Interview on NoSQLDatabases.com

The folks over at NoSQLDatabases.com have posted an interview they did with me on our implementation of MongoDB at Shutterfly. Good folks, great blog. Here is a link to the article. I talk a lot about what we have done at Shutterfly; in particular, one item I discuss is ORMs and the promise of not using heavyweight mappers in a non-relational architecture. I also talk a bit about the challenges and benefits of modeling data as documents. I hope it’s helpful for folks thinking about using something like MongoDB.

[Read Interview on NoSQLDatabases.com]


Sharing Life’s Joy using MongoDB: A Shutterfly Case Study

The video of the presentation I gave at MongoSV, Sharing Life’s Joy using MongoDB: A Shutterfly Case Study, is now online. Nice job by 10gen editing the slides into the video!

Here are just the slides:
Sharing Life’s Joy with MongoDB

[Read Sharing Life’s Joy using MongoDB: A Shutterfly Case Study]


I am speaking at MongoSV 2010

I am excited to announce I will be speaking at the MongoSV conference on Dec 3, 2010. My talk, Sharing Life’s Joy using MongoDB: A Shutterfly Case Study, will focus on how we have been using MongoDB here at Shutterfly over the last year. I plan to outline some of the specific cases where MongoDB has been a massive win, and some areas to be careful of if you are planning your own MongoDB application. This is a follow-on to my previous talk at MongoSF, with more technical depth. I will show code examples and various use cases outlining parts of our journey.

I hope to see you all there; it’s shaping up to be an amazing lineup. I think this conference will be great for anyone new to non-relational data stores, as well as people who are already neck deep in a MongoDB implementation. So sign up NOW!

[Read I am speaking at MongoSV 2010]


Shutterfly is looking for: Staff Engineer

Shutterfly is looking for: Staff Engineer – Platform
http://jobvite.com/m?35y6Xfwq #job

[Read Shutterfly is looking for: Staff Engineer]


MongoDB: Lagged Replica with Replica Sets

In an enterprise database architecture it’s very common to create a standby or replica database with a ‘lag’ in its state relative to the primary: operations applied to the primary are not seen on the replica for some predetermined amount of time. The purpose of such an architecture is to protect yourself against an accidental deletion, a code bug, corruption, a table drop, etc. If something really bad happens to the primary, it may be replicated before someone can step in and correct it. A lagged replica solves this problem by giving you a window in which to stop the replica from ingesting the change, allowing an operator to use the clean data to fix the primary or even roll back to an earlier image.

How long should you lag your replica? That’s up to you, but as a general rule of thumb 8 hours leaves you reasonable time to detect a data problem and take corrective action.

MongoDB now has this capability with the 1.7.x versions. For now you will have to use the nightly builds in order to get it, but it will be generally available when 1.7 is released. Here is how it works.

Set up a replica set as normal, but be sure to specify a slave with some amount of lag. It’s important to set priority: 0 on this slave so it never automatically becomes master. Thus, it makes sense to always have at least one primary and two replicas in a lagged-replica configuration: one primary, one replica for failover, and a lagged replica to ensure data safety.

For the setup described above, here is the config (slaveDelay is specified in seconds, so the 120 below is a two-minute delay; an 8-hour lag would be 28800):

c={_id:"sfly",
         members:[
             {_id:0,host:"host_a:27017"},
             {_id:1,host:"host_b:27017"},
             {_id:2,host:"host_c:27017",priority:0,slaveDelay:120},
             {_id:3,host:"host_d:27017",arbiterOnly:true}]
}
>
{
	"_id" : "sfly",
	"members" : [
		{
			"_id" : 0,
			"host" : "host_a:27017"
		},
		{
			"_id" : 1,
			"host" : "host_b:27017"
		},
		{
			"_id" : 2,
			"host" : "host_c:27017",
			"priority" : 0,
			"slaveDelay" : 120
		},
		{
			"_id" : 3,
			"host" : "host_d:27017",
			"arbiterOnly" : true
		}
	]
}
> rs.initiate(c);
{
	"info" : "Config now saved locally.  Should come online in about a minute.",
	"ok" : 1
}
> rs.conf()
{
	"_id" : "sfly",
	"version" : 1,
	"members" : [
		{
			"_id" : 0,
			"host" : "host_a:27017"
		},
		{
			"_id" : 1,
			"host" : "host_b:27017"
		},
		{
			"_id" : 2,
			"host" : "host_c:27017",
			"priority" : 0,
                        "slaveDelay" : 120
		},
		{
			"_id" : 3,
			"host" : "host_d:27017",
			"arbiterOnly" : true
		}
	]
}
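
Once the set is initiated, one convenient way (not from the original post) to confirm the delayed member is applying operations behind the primary is to check replication state from the shell:

// shows state and optime for each member, including the delayed slave
rs.status()
// reports how far behind the primary each slave currently is
db.printSlaveReplicationInfo()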

[Read MongoDB: Lagged Replica with Replica Sets]


OSX + Bit.ly

A couple of months ago I didn’t even know what bit.ly was; I was using TinyURL for everything. Sheez, how Web 1.0 of me. But after bit.ly started using MongoDB for its backend services, I started using it for URL shortening. I just love the idea of web services, and bit.ly was crying out for a nice OSX integration. I wanted full OSX compatibility instead of having to bring up a web browser each time I needed to shorten a URL. This Automator script turned that all around for me. Now I use bit.ly for almost every URL I ever copy/paste.

[Read OSX + Bit.ly]


Why Not Auto Increment in MongoDB

I came across [this blog post][1] with a nice pattern for auto-increment in MongoDB. It’s a great post, but there is something to think about beyond how to logically perform the operation: performance.

The idea presented in the blog is to utilize the MongoDB findAndModify command to pluck sequences from the DB using the atomic nature of the command.

# atomically fetch the next value from the 'seq' collection via findAndModify
counter = db.command("findandmodify", "seq", query={"_id": "users"}, update={"$inc": {"seq": 1}})
# use the fetched sequence value as the _id of the new document
f = {"_id": counter['value']['seq'], "data": "somedata"}
c.insert(f)

When using this technique each insert requires both the insert itself and the findAndModify command, which is a query plus an update. So now you have to perform three operations where there used to be one. Not only that, but there are three more logical I/Os due to the query, and those might be physical I/Os. This pattern is easy to see with the mongostat utility.

Maybe you still meet your performance goals. But then again maybe not.

I did some testing to play with the various options, comparing a complete insert cycle with a unique key. The test is a simple Python program that performs inserts using pymongo. The program is a single process, and I ran 3 concurrent processes just to simulate a bit of concurrency. The save uses safe_mode=False. I compared the findAndModify approach against the native BSON ObjectId approach and a Python UUID generation approach.

The results are:

| Type | Inserts/s |
| --- | --- |
| findAndModify auto-increment | 3000 |
| Native BSON ObjectIds | 20000 |
| Python UUID | 9000 |

So clearly, if the problem being solved can be handled with the native BSON ObjectId type, it should be. This is the fastest way to save data into MongoDB in a concurrent application.

f={"data","somedata"}    # let MongoDB generate objectId for _id
c.insert(f)

That said, what if an auto-increment / concurrent unique key generator is still required? One option would be to use a relational store with a native sequence-generation facility, like PostgreSQL. In my testing, PostgreSQL achieved 389,000 keys/sec when fetching from a single sequence using about 30 processes, so fetching sequences clearly outpaces MongoDB’s ability to insert them. Something like the following is possible:

cur.execute("nextval('users_seq')")
s=cur.fetchone()
f={"_id":s[0],"data":"somedata"}
c.insert(f)

The stack used in this test is:
- Sun X2270 dual quad core AMD 2376, 24GB RAM, 2 100GB SATA Drives, software RAID.
- MongoDB 1.5.7
- PostgreSQL 8.4.2
- Python 2.6.4
- Pymongo 1.7
- Linux CentOS 2.6.18-128.el5 x86_64

[1]: http://shiflett.org/blog/2010/jul/auto-increment-with-mongodb

[Read Why Not Auto Increment in MongoDB]