Tuning MongoDB Performance with MMS

At MongoLab we manage thousands of MongoDB clusters and regularly help customers optimize system performance. Some of the best tools available for gaining insight into our MongoDB deployments are the monitoring features of MongoDB Management Service (MMS). MMS allows us to quickly determine the health of a MongoDB system and identify the root cause of performance issues. This post covers our general approach to using MMS and MongoDB log files and provides techniques to help you optimize your own MongoDB deployment, whether you’re in the cloud or on your own hardware.

First, we will define the key metrics that we use to guide any performance investigation. Then we will go through the various combinations of metric values, discuss what they mean, and explore how to address the problems they indicate.

Key Metrics

Here we focus primarily on the metrics provided by MMS but augment our analysis with specific log file metrics as well.

MMS Metrics

MMS collects and reports metrics from both the MongoDB server process and the underlying host machine using an agent you install on the same network as your system. All metrics are of interest, but we will focus on the key metrics we use at MongoLab to begin any investigation.

  • PF/OP (derived from the Page Faults and Opcounters graphs)
  • CPU Time (IOWait and User)
  • Lock Percent and Queues

We find that by examining these key metrics you can very quickly get a good picture of what is going on inside a MongoDB system and what computing resources (CPU, RAM, disk) are performance bottlenecks. Let’s look a little more closely at each of these metrics.

PF/OP (Page Faults / Opcounters)

Between 5 and 10 page faults per second (left) compared to more than 4,000 operations per second (right).
A PF/OP of 0.001 (5 / 4000) is close enough to zero to classify as a low disk I/O requirement.

MongoDB manages data in memory using memory mapped files. As indexes and documents are accessed, the data file pages containing them are brought into memory. Meanwhile, data that isn’t accessed remains on disk. If a given memory-mapped page is not in memory when the data in it is needed, a page fault is counted because the OS loads the page from disk. But if a page is already in memory, a page fault does not occur.

The documents and indexes that tend to be regularly accessed are called the working set. As of version 2.4, MongoDB can estimate the working set size using the command:

db.serverStatus( { workingSet: 1 } )

Page faults are not necessarily a problem; they can occur on any machine in a cluster that doesn’t have sufficient RAM to hold the working set. If page faults are consistent/predictable and don’t result in queues, or if they are sporadic and don’t result in queues, they can be considered part of normal operations.

That being said, high-load databases and databases with latency-sensitive apps are often optimized with large amounts of RAM with the specific intention of avoiding page faults.

Because the exact number of page faults depends on the current load and what’s currently in memory, a better comparative metric is the ratio of page faults per second to the operation count per second. Calculate this PF/OP ratio using a ballpark sum of your operations per second.

If PF/OP is…

  • near 0 – reads rarely require disk I/O
  • near 1 – reads regularly require disk I/O
  • greater than 1 – reads require heavy disk I/O

Note: In Windows environments, page fault counts will be higher because they include “soft” page faults that don’t involve disk access. When running MongoDB in Windows, be prepared to rely more heavily on Lock, Queue, and IOWait metrics to determine the severity of page faults.

CPU Time (IOWait and User)

CPU graphs from two different instances:
One experiencing high CPU IOWait (left) and the other experiencing high CPU User (right)

The CPU Time graph shows how the CPU cores are spending their cycles. CPU IOWait reflects the fraction of time spent waiting for the network or disk, while CPU User measures computation. Note that to view CPU Time in MMS, you must also install munin.

CPU User time is usually the result of:

  • Accessing and maintaining (updating, rebalancing) indexes
  • Manipulating (rewriting, projecting) documents
  • Scanning and ordering query results
  • Running server-side JavaScript, Map/Reduce, or aggregation framework commands

Lock Percent and Queues

Screen Shot 2013-11-27 at 1.55.29 PM
Lock fluctuating with daily load (left) and corresponding queues (right)

Lock Percent and Queues tend to go hand in hand — the longer locking operations take, the more other operations wait on them. The formation of locks and queues isn’t necessarily cause for alarm in a healthy system, but they are very good severity indicators when you and your app already know things are slow.

Concurrency control in MongoDB is implemented through a per-database reader-writer lock. The Lock Percent graph shows how long MongoDB held a write lock for the database selected in the drop-down at the top of the graph. Note that if “global” is selected, the graph displays a virtual metric: the highest database lock percent on the server at that time. This means two lock spikes at different times might be from different databases, so any time you’re working with a specific database remember to select that particular database to avoid confusion.

Because read and write operations to a database queue when that database’s write lock is held and because all operations queue while the server’s global lock is held, locking is undesirable. Yet, locking is a necessary part of many operations, so it is allowable to an extent and expected when the database is under load.

The Queues graph counts specific operations waiting for the lock at any given time and therefore provides additional information about the severity of the congestion. Because each queued operation likely represents an affected application process, the time ranges of queue spikes are excellent guides for examining log files.

Log File Metrics

In addition to the key MMS metrics discussed above, we use specific information from the MongoDB log files to augment performance analysis. The MongoDB log files output statistics on any database operation that takes more than 100ms (a reasonable, configurable definition of “slow”).

  • nscanned - The number of objects the database examines to service a query. This counts either documents or index entries, any of which may need to be obtained from disk. It is a good measure of the cost of an operation both in terms of I/O, object manipulation, and memory use. Ideally nscanned is not larger than nreturned, the number of results returned. Also compare nscanned to the size of the collection. If nscanned is large, it is usually because an index is not being used, or that the index was not selective enough. Moving large numbers of documents into memory for these operations is a common cause of page faults and CPU IOWait.
  • scanAndOrder - This is true when MongoDB performs an in-memory ordering operation to satisfy the sort/orderby component of a query. For best performance, all sort/orderby clauses should be satisfied by indexes. Sorting results at query time can be inefficient (as compared with using an index that already contains documents in the desired sort order), and can even block other operations if done while holding the write lock, as with an update or findAndModify. Because sorting requires computation, CPU User time can be expected.
  • nmoved - Indicates the number of documents that moved on disk during the operation. The higher “nmoved”, the more intensive the operation. This is because when moving a document: the new document location must be in memory, the document must be copied, the old document’s location must be cleared, and index entries must be updated to point to the new location.

This sort of analysis can be done via visual inspection of your MongoDB logs or using a tool like grep. All of this information can also be obtained via MongoDB’s database profiler if it is enabled. In situations where the logs are not accessible, use the profiler to gather the statistics discussed above (and more).

Improving Performance

Now that we are armed with an understanding of the key metrics, the first step to hone-in on a performance issue begins with a coarse-grained examination of the following three metrics:

  • PF/OP
  • CPU IOWait
  • CPU User

For each, consider whether it is “high” or “low”. This determination may be non-obvious and can depend on what is considered a baseline of “normal”. Furthermore, correctly determining whether one of these metrics is high or low depends on their values relative to each other. Experience–and more data over time–will improve your ability to compare these metrics, but here are some rules of thumb:

PF/OP is high if…

  • it is greater than 0.25 (indicates that your instance is page-faulting for roughly 25% of operations)

CPU IOWait or CPU User is high if…

  • it is greater than 50% OR
  • it is more than twice as large as the next highest CPU metric

There are eight possible overall assessments given these three metrics:

8_states

Each of the rows in this table points to a category of behavior that helps you determine the root cause of performance issues and the available solutions, using database log files, other MMS metrics, and some knowledge of your application. Let’s examine each in detail.

Unbound

Unbound

Either this database isn’t being used, or it is operating optimally.

Instead of relaxing, consider creating test data and a script to load-test your queries. This way, you can get ahead of the game by actively looking for scaling challenges in a controlled environment.

CPU-Bound

CPU-Bound

In this case, page faults are not severe enough to prompt any appreciable IOWait. As long as the CPU isn’t idle, it’s somewhat expected that User time will appear the highest. This is, after all, the cost associated with maintaining indexes and servicing queries.

If locking and queueing are occurring, you’re probably seeing in-memory table scans and scanAndOrder operations that are not bottlenecking due to an abundance of RAM or a lack of load. Run through the following steps:

  1. Check your logs for operations with high nscanned or scanAndOrder during periods of high lock/queue, and index accordingly. In general we recommend that all queries be fully indexed. For suggestions on how to structure your indexes, see this blog, and/or make use of a slow query analysis tool like Dex.
  2. Check your queries for CPU-intensive operators like $all, $push/$pop/$addToSet, as well as updates to large documents, and especially updates to documents with large arrays (or large subdocument arrays). These are CPU-intensive operations that can be avoided by altering your queries and/or data model.
  3. Obtain more and/or faster CPU cores. Importantly, if your database is write-heavy, keep in mind that only one CPU per database can write at a time (owing to that thread holding the write lock). Consider moving part of that data into its own database.

Disk-Bound

Disk-Bound

If your database isn’t locking or queueing and your app isn’t having a problem, then you may not have to worry. IOWait is a sign that your host is accessing disk. For some databases, especially write-heavy ones, that’s okay.

However, if you’re experiencing trouble or if this sort of activity is coming in bursts, then getting the most out of your current disk is about finding the source of I/OWait. So, even though the ratio of PF/OP is low, we’re interested in actual Page Faults, which could still put a noticeable drag on performance.

Let’s look at some sources of page faults:

  • Page faults from insert/update operations – The write load to a database results in page faults as fresh extents are brought into memory to handle new or moved documents. Adding RAM will not alleviate this cost, so consider faster disks, like SSDs, or sharding.
  • Page faults from a shifting working set – These are the page faults that can be prevented with additional RAM or with more efficient indexing.
  • Page faults from MongoDB’s internal operations – These are few in number and cannot be prevented.

If you’ve accounted for or the majority of your page faults, the remaining I/OWait in the system should be coming from flushing the journal file and the background synchronization of memory-mapped data to disk. Check writeToDataFilesMB on the Journal Stats graph and the Background Flush Avg graph to see if spikes match the elevated IOWait. Upgrade your disks and/or look into getting your journal on a separate volume than your data. You can also separate databases using the directoryperdb mongod option.

The last resort before sharding should be verifying that your data model is optimized for writes. Check for collections with a large number of indexes, especially multi-key indexes, and look for alternatives. Check your logs for the “nmoved” flag. If you find a lot of documents are being moved during updates, examine the usePowerOf2Sizes setting for those collections. When it comes to write-friendly data models, look for anything that reduces the number of pages (index or data) that need to be modified when documents are inserted or updated.

CPU/Disk-Bound

CPUDisk-Bound

This state likely results from a combination of other states. Treat elevated IOWait and User times as independently high, and consider each of the previous two sections. We don’t see this combination a lot but it is definitely possible, such as in the case of a high-RAM, write-heavy cluster that is missing an index.

If locks and queues aren’t occurring and your app is happy, this is the ideal state for a database with fully-utilized hardware. However, if you are experiencing locks and queues, use them to guide the time ranges you focus on. When dangerous, this case often merely precedes the “Bound All Around” HHH case discussed below.

RAM-Bound

RAM-Bound

If you are running high PF/OP without CPU IOWait, then:

  • your operation counts are relatively low (indicating little to no operational activity).
  • you have fantastic disks that are covering up for what would otherwise be a “RAM-Bound and Disk-Bound” HHL case.
  • you just started the database and are “warming” your cache. Wait until memory is fully utilized and reassess.

If you are experiencing locks and queues, immediately evaluate your indexes to make sure missing indexes are not prompting the need to page fault. Check your logs during periods of high page faulting. Focus on operations with high nscanned values. Increase RAM if necessary, and you should arrive at the coveted LLL case.

If you are not locking/queueing, this is technically a sustainable state, but be aware that any growth in operation volume or data size–or even a barely-quantifiable shift in your working data set–could lead to increasing IOWait, locks, and queues.

RAM/CPU-Bound

RAMCPU-Bound

This scenario is similar to the RAM-Bound case above, except that high CPU User activity may be covering up for what would otherwise be a high IOWait. Tackle the CPU portion first, then move towards page faults, in the following steps:

  1. Look for nscanned and scanAndOrder activity. On high-write clusters, consider the sheer number of indexes in your write-collections. Maintaining indexes during inserts and updates can be responsible for this, as can other CPU-intensive operators like $all, $push/$pop/$addToSet, updates to large documents, and updates to documents with large arrays (or large subdocument arrays).
  2. Once the CPU User time has decreased, reconsider your case. If you have moved into HHL territory, go to that section. If your CPU User time is still high, beef up your CPU if possible.
  3. If your PF/OP is still high, be prepared to move to faster disk as soon as locks and queues begin to occur.
  4. Finally, consider sharding.

RAM/Disk-Bound

RAMDisk-Bound

This is possibly the most common problem case, and the solution is fairly straightforward compared to the other cases we’ve dealt with.  Your activity is disk-heavy. Index to reduce nscanned, or add RAM either by scaling your machines vertically; or scaling horizontally via sharding.

Bound All Around

BoundAllAround

The solution to this protracted situation depends on which of these three metrics is highest. In general, we recommend tackling this as “RAM/Disk-Bound” HHL case first, because lack of RAM is relatively easy to diagnose and many of the solutions for high CPU IOWait can contribute to CPU User improvements.

So, check your logs for scanAndOrder and high nscanned numbers and then add indexes to reduce them. Increase RAM if IOWait continues, and then re-evaluate your case. If you still end up here, and if your locks and queues are still a problem, consider increasing your hardware capacity.

It could be premature to consider sharding straight from this highly protracted case. Ideally, attempt a resolution that brings you to another case that we’ve explored, and consider sharding in that context.

Conclusion & Further Reading

Performance tuning can be a grim task, especially if you’re already dealing with the fallout of an under-performing application. However, with a systematic approach, the right tools, and the experience that comes from practice, you will be able to confidently identify and meet most performance challenges you encounter in the field.

For more information, check out the following resources:

Subscribe

Subscribe to our e-mail newsletter to receive updates.

, ,

  • Mark Gibaud

    Great blog post – we need more info like this to help us interpret the wealth of data in MMS.

    • http://mongolab.com/ Eric Sedor

      Thanks, Mark! While our users can always email us at support@mongolab.com with specific questions, anyone with nagging questions about MMS should feel free to comment here; we’ll do our best to elaborate and blog further on key points.