(Note: this is being posted with Foursquare’s permission.)
As many of you are aware, Foursquare had a significant outage this
week. The outage was caused by capacity problems on one of the
machines hosting the MongoDB database used for check-ins. This is an
account of what happened, why it happened, how it can be prevented,
and how 10gen is working to improve MongoDB in light of this outage.
It’s important to note that throughout this week, 10gen and Foursquare
engineers have been working together very closely to resolve the
* Some history
Foursquare has been hosting check-ins on a MongoDB database for some
time now. The database was originally running on a single EC2
instance with 66GB of RAM. About 2 months ago, in response to
increased capacity requirements, Foursquare migrated that single
instance to a two-shard cluster. Now, each shard was running on its
own 66GB instance, and both shards were also replicating to a slave
for redundancy. This was an important migration because it allowed
Foursquare to keep all of their check-in data in RAM, which is
essential for maintaining acceptable performance.
The data had been split into 200 evenly distributed chunks based on
user id. The first half went to one server, and the other half to the
other. Each shard had about 33GB of data in RAM at this point, and
the whole system ran smoothly for two months.
* What we missed in the interim
Over these two months, check-ins were being written continually to
each shard. Unfortunately, these check-ins did not grow evenly across
chunks. It’s easy to imagine how this might happen: assuming certain
subsets of users are more active than others, it’s conceivable that
their updates might all go to the same shard. That’s what occurred in
this case, resulting in one shard growing to 66GB and the other only
to 50GB. 
* What went wrong
On Monday morning, the data on one shard (we’ll call it shard0)
finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
machine. Whenever data size grows beyond physical RAM, it becomes
necessary to read and write to disk, which is orders of magnitude
slower than reading and writing RAM. Thus, certain queries started to
become very slow, and this caused a backlog that brought the site
We first attempted to fix the problem by adding a third shard. We
brought the third shard up and started migrating chunks. Queries were
now being distributed to all three shards, but shard0 continued to hit
disk very heavily. When this failed to correct itself, we ultimately
discovered that the problem was due to data fragmentation on shard0.
In essence, although we had moved 5% of the data from shard0 to the
new third shard, the data files, in their fragmented state, still
needed the same amount of RAM. This can be explained by the fact that
Foursquare check-in documents are small (around 300 bytes each), so
many of them can fit on a 4KB page. Removing 5% of these just made
each page a little more sparse, rather than removing pages
After the first day’s outage it had become clear that chunk migration,
sans compaction, was not going to solve the immediate problem. By the
time the second day’s outage occurred, we had already move 5% of the
data off of shard0, so we decided to start an offline process to
compact the database using MongoDB’s repairDatabase() feature. This
process took about 4 hours (partly due to the data size, and partly
because of the slowness of EBS volumes at the time). At the end of
the 4 hours, the RAM requirements for shard0 had in fact been reduced
by 5%, allowing us to bring the system back online.
Since repairing shard0 and adding a third shard, we’ve set up even
more shards, and now the check-in data is evenly distributed and there
is a good deal of extra capacity. Still, we had to address the
fragmentation problem. We ran a repairDatabase() on the slaves, and
promoted the slaves to masters, further reducing the RAM needed on
each shard to about 20GB.
* How is this issue triggered?
Several conditions need to be met to trigger the issue that brought
1. Systems are at or over capacity. How capacity is defined varies; in
the case of Foursquare, all data needed to fit into RAM for acceptable
performance. Other deployments may not have such strict RAM
2. Document size is less than 4k. Such documents, when moved, may be
too small to free up pages and, thus, memory.
3. Shard key order and insertion order are different. This prevents
data from being moved in contiguous chunks.
Most sharded deployments will not meet these criteria. Anyone whose
documents are larger than 4KB will not suffer significant
fragmentation because the pages that aren’t being used won’t be
The main thing to remember here is that once you’re at max capacity,
it’s difficult to add more capacity without some downtime when objects
are small. However, if caught in advance, adding more shards on a
live system can be done with no downtime.
For example, if we had notifications in place to alert us 12 hours
earlier that we needed more capacity, we could have added a third
shard, migrated data, and then compacted the slaves.
Another salient point: when you’re operating at or near capacity,
realize that if things get slow at your hosting provider, you may find
yourself all of a sudden effectively over capacity.
* Final Thoughts
The 10gen tech team is working hard to correct the issues exposed by
this outage. We will continue to work as hard as possible to ensure
that everyone using MongoDB has the best possible experience. We are
thankful for the support that we have received from Foursquare and our
community during this unfortunate episode. As always, please let us
know if you have any questions or concerns.
 Chunks get split when they are 200MB into 2 100MB halves. This
means that even if the number of chunks on each shard was the same,
data size is not always so. This is something we are going to be
addressing in MongoDB. We’ll be making splitting balancing look for
this imbalance so it can act upon it.
 The 10gen team is working on doing online incremental compaction
of both data files and indexes. We know this has been a concern in
non-sharded systems as well. More details about this will be coming
in the next few weeks.
来自mailinglist http://groups.google.com/group/mongodb-user/browse_thread/thread/528a94f287e9d77e#, 省去翻墙，存档备忘。