Introducing Dex: the Index Bot

(update 2012-07-19: A new remote feature detailed here.)
(update 2012-10-09: Version 0.5 detailed here.)

Greetings adventurers! MongoLab is pleased to introduce Dex! We hope he assists you on your quests.


Dex is a MongoDB performance tuning tool, open-sourced under the MIT license, that compares logged queries to available indexes in the queried collection(s) and generates index suggestions based on a simple rule of thumb. To use Dex, you provide a path to your MongoDB log file and a connection URI for your database. Dex is also registered with PyPi, so you can install it easily with pip.

Quick Start

pip install dex

THEN

dex -f mongodb.log mongodb://localhost

Dex provides runtime output to STDERR as it finds recommendations:

{
"index": "{'simpleIndexedField': 1, 'simpleUnindexedFieldThree': 1}",
"namespace": "dex_test.test_collection",
"shellCommand": "db.test_collection.ensureIndex(
  {'simpleIndexedField': 1, 'simpleUnindexedFieldThree': 1},
  {'background': true})"
}

As well as summary statistics:

Total lines read: 7
Understood query lines: 7
Unique recommendations: 5
Lines impacted by recommendations: 5

Just copy and paste each shellCommand value into your MongoDB shell to create the suggested indexes.

Dex also provides the complete analysis on STDOUT when it’s done, so you will see this information repeated before Dex exits. The output to STDOUT is an entirely JSON version of the above, so Dex can be part of an automated toolchain.

For more information check out the README.md and tour the source code at https://github.com/mongolab/dex. Or if you’re feeling extra adventurous, fiddle with the source yourself!

git clone https://github.com/mongolab/dex.git

The motivation behind Dex

MongoLab manages tens of thousands of MongoDB databases, heightening our sensitivity to slow queries and their impact on CPU.  What started as a set of operational heuristics has been cast as an automated tool.  For example, “Create indexes with this order: Exact values first. Sorted fields next. Ranged fields after that.”  It’s worked very well for us and we hope that it’ll work as well for your own MongoDB databases, even if you never host one with us.  We’ll continue to improve Dex — see the future directions below — and would love your feedback,  suggestions, and requests.

How it works

At a high level, Dex reads the MongoDB log and performs three steps:

  1. Parse the query
  2. Evaluate existing indexes against query
  3. Recommend an index (if necessary)

Step 1: Parse the query

Each query is parsed into an internal representation that classifies each query term into one of:

  • EQUIV – a standard equivalence check; ex: {a: 1}
  • SORT – a sort/orderby clause; ex: .sort({a: 1})
  • RANGE – a range or set check:
    Specifically: ‘$ne’, ‘$gt’, ‘$lt’, ‘$gte’, ‘$lte’, ‘$in’, ‘$nin’, ‘$all’, ‘$not’
  • UNSUPPORTED
    • Composite ($and, $or, $nor)
    • Nested “operators” not included in RANGE above.

Step 2: Evaluate existing indexes against query 

The query is evaluated against each index according two criteria:
  • Coverage (none, partial, full) - a less granular description of fields covered. None corresponds to Fields Covered 0 and indicates the index is not used by the query. Full means the number of fields covered is equal to the number of fields in the query. Partial describes any value of fields covered value between None and Full.
  • Order (ideal or not) - describes whether the index is partially-ordered according to Dex’s notion of ideal index order. This notion of order is:
    Equivalence ○ Sort ○ Range
    This ordering is a synthesis of conventional indexing wisdom and a rule of thumb developed to avoid expensive MongoDB scanAndOrderoperations when performing sorted range queries.Note: Geospatial queries and indexes are unsupported. Index evaluation is performed but Dex will not make recommendations for Geospatial queries. Analysis continues but the index is no longer considered for recommendation purposes.

Step 3: Recommend an index (if necessary)

Once evaluation is complete, Dex considers an ideal index to have Coverage ‘full’ and Ideal Order true. If these conditions are not met, and the query itself contains no UNSUPPORTED terms, Dex reccommends an ideal index (with an index order of 1 for all fields).

Note: Dex does not really need to look at existing indexes in order suggest the ideal index for a given query. But Dex does examine existing indexes as a courtesy to users who already have them in place — both to provide analysis of partial indexes (in verbose mode), and to avoid suggesting indexes that already exist.

Future Directions

  • Line Parsers: Better coverage of log lines, with a goal of complete coverage of all indexable queries
  • Analyze the system.profile collection with -p option
  • Constrain analysis by a time range with -t option
  • Add Dex’s own “SLOW_MS” to narrow results if desired
  • Support geospatial queries
  • Improved recommendation caching, storing queries by mask and summing:
    • Number of like queries
    • Time consumed
    • Min/max time range
    • Min/max nscanned/nreturned
  • Improved recommendations:
    • Combine like recommendations (or generate recommendations from multiple like queries)
    • Measure cardinality (yay aggregation framework) in the collection to inform recommended index key order.

Conclusion

We’re really excited about Dex at MongoLab and look forward to the many improvements that are possible in the MongoDB automation space. I’m presenting Dex at the June MongoDB San Francisco User’s Group today, June 19, 2012. If you can make it on such short notice, check out the details here. We will follow up with a link to presentation slides a short time later.

Finally, for those interested in the indexing knowledge we’ve accumulated as Dex was built, check out my Cardinal $ins blog here.

As always, good luck out there!

Sincerely,
Eric@MongoLab

(update: Slides from Eric’s SF Mongo User Group talk on Jun 19, 2012 are here.)

, , ,

  • Pingback: Indexing performance tips for MongoDB at the SF Users Group | MongoLab - MongoDB Hosting

  • Pingback: Sysadmin Sunday 85 | Server Density Blog

  • Pingback: Dex – MongoDB索引优化工具 | NoSQLWiki – 关注NoSQL相关技术、新闻

  • http://twitter.com/bendemott Ben DeMott

    This is a really brilliant approach

    • http://mongolab.com/ Eric Sedor

      Thank you! That’s very kind.

  • Pingback: MongoLab at MongoDB Users Groups in July 2012 | MongoLab - MongoDB Hosting

  • Pingback: Remote Dex: Index Analysis Using the Profile Collection | MongoLab - MongoDB Hosting

  • patsplat

    Regarding analysis of system.profile — consider aggregating by normalized queries rather than just the slowest queries.  There are often thousands of similar queries which individually have acceptable performance, but which in total represent a greater load on the system than the slowest individual queries.

    A tool could identify such queries and either suggest indexes or application caching if the best indexes already exist.

    Edit: nevermind, already see that in the future plans. Sounds great!

    • http://mongolab.com/ Eric Sedor

      That’s okay, I’m glad you said something anyway. It is always nice to hear from others what features are important to them. Not only that, it is often surprisingly difficult to explain that indexes aren’t just for the queries they’re intended to help. All queries benefit from an index on any query!

      We definitely know what you’re talking about, and joke that it is a “death of a thousand cuts.”

      We’re still deciding what form the query time aggregation should take, but when it goes in, this feature will be available for both logfile and profile modes (though for the issue you point to it will be notably more useful with profiling at a lower Slow MS value). It is our hope that having these stats will contribute to the broader question of helping a user prioritize the Dex suggestions they’re getting.

      If you have any ideas on what metrics you would like to see most, please let us know!

      • patsplat

        The basic routine is:

        1. Capture all queries during a period of operation
        2. Group queries by ideal associated index and sum total time spent
        3. Optimizing / cache what the app waits on the most

        Entries in a sow query log can be symptomatic as much as casual.  Anything at the top of the total time spent is a real, actionable performance consideration.  The desired report would process all the data, then give a short list (top 10) of the most time consuming queries.

        Another routine comes when you run out of indexes.  In this event, it is helpful to know how your indexes are being used.

        1. Capture all queries during a period of operation
        2. Group queries by actual used index and sum total instances
        3. Review entire list for recurring patterns that could be improved through schema change

        In my case, I found that multiple different taxonomy / tagging fields led to an over-abundance of associated indexes (http://hexane.org/blog/2011/05/16/mongodb-content-taxonomy-schema/)

  • Pingback: DIY IaaS vs PaaS Approach Webinar and Meetup « Joyeur

  • Pingback: PaaS and DBaaS as tools | MongoLab - MongoDB Hosting

  • Pingback: MongoSeattle 2012 with talk on Dex, Slow Query Analysis | MongoLab - MongoDB Hosting

  • Roger

    Note that not all indexes actually help.  For an index to help it also has to be discriminatory.  As an example if you have a field that has the same value 90% of the time, then it isn’t particularly helpful to have an index since you’ll be doing almost the same amount of work as a table scan.  The more discriminatory the index is the more effective it is.   (This has been effecting my data sets.)

    It would also be nice if you gave Python/pymongo syntax so I can copy and paste into my ensure_index calls.

  • Pingback: Dex 0.5: Weighted Index Recommendations for MongoDB | MongoLab - MongoDB Hosting

  • Pingback: Production-Ready MongoDB | MongoLab: MongoDB-as-Service