A key component of optimizing application performance is tuning the performance of the database that supports it. Each post in our Telemetry series discusses an important metric used by developers and database administrators to tune the database and describes how MongoLab users can leverage Telemetry, MongoLab’s monitoring interface, to effectively review and take action on these metrics.
Databases are optimized for working with data that is stored on disk, but usually cache as much data as possible in RAM in order to access disk as infrequently as possible. However, as it is cost-prohibitive to store in RAM all the data accessed by the application, the database must eventually go to disk. Because disks are slower than RAM, this incurs a significant time cost.
Effectively tuning a database deployment commonly involves assessing how often the database accesses disk with an eye towards reducing the need to do so. To that end, one of the best ways to analyze the RAM and disk needs of a MongoDB deployment is to focus on what are called Page Faults.
What is a Page Fault?
MongoDB manages documents and indexes in memory by using an OS facility called MMAP, which translates data files on disk to addresses in virtual memory. The database then accesses disk blocks as though it is accessing memory directly. Meanwhile, the operating system transparently keeps as much of the mapped data cached in RAM as possible, only going to disk to retrieve data when necessary.
When MMAP receives a request for a page that is not cached, a Page Fault occurs, indicating that the OS had to read the page from disk into memory.
What do Page Faults mean for my cluster?
The frequency of Page Faults indicates how often the OS goes to disk to read data. Operations that cause Page Faults are slower because they necessarily incur disk latency.
Page Faults are one of the most important metrics to look at when diagnosing poor database performance because they suggest the cluster does not have enough RAM for what you’re trying to do. Analyzing Page Faults will help you determine if you need more RAM, or need to use RAM more efficiently.
How does Telemetry help me interpret Page Faults?
Select a deployment and then look back through Telemetry over months or even years to determine the normal level of Page Faults. In instances where Page Faults deviate from that norm, check application and database logs for operations that could be responsible. If these deviations are transient and infrequent they may not pose a practical problem. However, if they are regular or otherwise impact application performance you may need to take action.
If Page Faults are steady but you suspect they are too high, consider the ratio of Page Faults to Operations. If this ratio is high it could indicate unindexed queries or insufficient RAM. The definition of “high” varies across deployments and requires knowledge of the history of the deployment, but consider taking action if any of the following are true:
- The ratio of Page Faults to Operations is greater than or equal to 1.
- Effective Lock % is regularly above 15%.
- Queues are regularly above 0.
- The app seems sluggish.
Note: Future Telemetry blog posts will cover additional metrics, such as Effective Lock % and Queues. See MongoDB’s serverStatus documentation for more information.
How do I reduce Page Faults?
How you reduce Page Faults depends on their source. There are three main reasons for excessive Page Faults.
- Not having enough RAM for the dataset. In this case, the solution is to add more RAM to the deployment by scaling either vertically to machines with more RAM, or horizontally by adding more shards to a sharded cluster.
- Inefficient use of RAM due to lack of appropriate indexes. The most inefficient queries are those that cause collection scans. When a collection scan occurs, the database is iterating over every document in a collection to identify the result set for a query. During the scan, the whole collection is read into RAM, where it is inspected by the query engine. Page Faults are generally acceptable when obtaining the actual results of a query, but collection scans cause Page Faults for documents that won’t be returned to the app. Worse, these unnecessary Page Faults are likely to evict “hot” data, resulting in even more Page Faults for subsequent queries.
- Inefficient use of RAM due to excess indexes. When the indexed fields of a document are updated, the indexes that include those fields must be updated. When a document is moved on disk, all indexes that contain the document must be updated. These affected indexes must enter RAM to be updated. As above, this can lead to thrashing memory.
Note: For assistance determining what indexes your deployment needs, MongoLab offers a Slow Query Analyzer that provides index recommendations to Shared and Dedicated plan users.
Have questions or feedback?
We’d love to hear from you as this Telemetry blog series continues. What topics would be most interesting to you? What types of performance problems have you struggled to diagnose?
Email us at firstname.lastname@example.org to let us know your thoughts, or to get our help tuning your MongoLab deployment.