Aggregation Framework Example

(also posted to the 10gen blog here)

In this blog post, you will run a concise set of aggregation framework examples in the mongo JavaScript shell against a MongoLab-hosted 2.2 database.  The framework includes the aggregation operators $project, $unwind, $group, and others.  These operators let you calculate values across the documents in a collection, such as averages and sums.  They also let you reshape documents, unpacking nested structures and regrouping them as needed.

The aggregation framework, one of the most powerful and highly anticipated features in the forthcoming production MongoDB 2.2 release, lets you construct a server-side processing pipeline to be run on a collection.  A rich set of operations is available for use in the pipeline to achieve various kinds of collection transforms, ranging from simple multi-document calculations (e.g., sums and averages) to complex projections and pivots.

The framework fits nicely into the range of data manipulation tools available in MongoDB, from basic built-in functions like document counts, to map-reduce and JavaScript, to custom code and language-specific packages, including Hadoop.

Overview

  1. Create a 2.2 MongoLab database with your own unique name, say <myaggdemo>.  Instructions on how to do that are here. You’ll need your mongod username and password.
  2. On your database’s home page, copy the mongo shell connection to your clipboard.
  3. git clone git://gist.github.com/1401585.git aggdemo ; cd aggdemo
  4. Edit articles.js and aggregation.js to use your database <myaggdemo> (see below)
  5. mongo <your connection> -u <mongod username> -p <mongod password> articles.js  (inserts the data into your database, 3 documents)
  6. mongo --shell <your connection> -u <mongod username> -p <mongod password> aggregation.js (performs several aggregation examples and leaves you in the mongo shell.)
  7. Type g1 in the mongo shell to see the first $group result discussed below.

(I’ve tested this with the production 2.0.6 mongo client and the latest development 2.1.2 mongo client.)

Code snippets

articles.js

/* sample articles for aggregation demonstrations */

// make sure we're using the right db; this is the same as "use aggdb;" in shell
db = db.getSiblingDB("aggdb"); //Put your MongoLab database name here.
db.article.drop();

db.article.save( {
    title : "this is my title" ,
    author : "bob" ,
    posted : new Date(1079895594000) ,
    pageViews : 5 ,
    tags : [ "fun" , "good" , "fun" ] ,
    comments : [
        { author :"joe" , text : "this is cool" } ,
        { author :"sam" , text : "this is bad" }
    ],
    other : { foo : 5 }
});
//...snip

aggregation.js

// make sure we're using the right db; this is the same as "use aggdb;" in shell
db = db.getSiblingDB("aggdb"); //Put your MongoLab database name here.
// ...snip...
// grouping
var g1 = db.runCommand(
    { aggregate : "article", pipeline : [
        { $project : {
            author : 1,
            tags : 1,
            pageViews : 1
        }},
        { $unwind : "$tags" },
        { $group : {
            _id : "$tags",
            docsByTag : { $sum : 1 },
            viewsByTag : { $sum : "$pageViews" },
            mostViewsByTag : { $max : "$pageViews" },
            avgByTag : { $avg : "$pageViews" }
        }}
    ]});
// ...snip

g1 aggregation result

{
	"result" : [
//...snip...
		{
			"_id" : "fun",
			"docsByTag" : 3,
			"viewsByTag" : 17,
			"mostViewsByTag" : 7,
			"avgByTag" : 5.666666666666667
		}
	],
	"ok" : 1
}

  • Props to Chris Westin, 10gen architect for the aggregation framework, for providing these examples.
  • See also his presentation here.

Discussion

The results of the aggregations are saved to convenient variables for examination. The group operations (g1 and g5) at the end of the aggregation.js file are noteworthy because they roll up three operators into a common pivot and aggregation example. The g1 data flow is shown above.  Click it for a larger .png version or here for a .pdf version.

  1. Collection -> Intermediate-1: Taking the initial Collection of documents as input, g1 uses $project to keep only the author, tags, and pageViews fields of each document. The output is shown in Intermediate-1.
  2. Intermediate-1 -> Intermediate-2: g1 then uses $unwind on the embedded tags array of Intermediate-1, so that each tag instance becomes its own document. The output is shown in Intermediate-2.
  3. Intermediate-2 -> Result: Finally, g1 uses $group to produce one document per distinct tag, calculating statistics such as total and average page views. The output is shown as Result.

(Note that both Intermediate forms are internal to the processing engine and are not visible to the shell directly; Intermediate-2 is actually shown as example p2.)
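
To make the three stages concrete, here is a minimal sketch that imitates the g1 data flow in plain JavaScript, with no MongoDB connection. It is not how the server implements the pipeline. The first document is the one from articles.js; the second document, and the function names project, unwindTags, and groupByTag, are assumptions made only for illustration, with the second document chosen to be consistent with the "fun" result shown earlier.

```javascript
// Plain JavaScript stand-in for the g1 pipeline; runs without MongoDB.
// Document 1 comes from articles.js. Document 2 is HYPOTHETICAL and is
// chosen only so that the "fun" statistics match the result shown earlier.
const collection = [
  { _id: 1, title: "this is my title", author: "bob", pageViews: 5,
    tags: ["fun", "good", "fun"], other: { foo: 5 } },
  { _id: 2, title: "hypothetical second title", author: "dave", pageViews: 7,
    tags: ["fun"] }
];

// Stage 1, like $project: keep author, tags, and pageViews.
// (_id is kept as well, since $project retains it unless it is excluded.)
function project(docs) {
  return docs.map(({ _id, author, tags, pageViews }) =>
    ({ _id, author, tags, pageViews }));
}

// Stage 2, like $unwind: emit one document per element of the tags array.
function unwindTags(docs) {
  return docs.flatMap(doc => doc.tags.map(tag => ({ ...doc, tags: tag })));
}

// Stage 3, like $group: one output document per distinct tag.
function groupByTag(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const g = groups.get(doc.tags) ||
      { _id: doc.tags, docsByTag: 0, viewsByTag: 0, mostViewsByTag: -Infinity };
    g.docsByTag += 1;                                             // $sum : 1
    g.viewsByTag += doc.pageViews;                                // $sum : "$pageViews"
    g.mostViewsByTag = Math.max(g.mostViewsByTag, doc.pageViews); // $max
    groups.set(doc.tags, g);
  }
  // $avg is the total divided by the number of unwound documents.
  return [...groups.values()].map(g =>
    ({ ...g, avgByTag: g.viewsByTag / g.docsByTag }));
}

const intermediate1 = project(collection);
const intermediate2 = unwindTags(intermediate1);
const result = groupByTag(intermediate2);
console.log(JSON.stringify(result, null, 2));
```

With these two documents, intermediate2 holds four documents (three from the first article, one from the second), and the "fun" group comes out with docsByTag 3, viewsByTag 17, and mostViewsByTag 7, the same figures as the g1 result above.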

For another example, you can look at g5. It also pivots on the embedded tag arrays but this time rolls up authors as embedded arrays using $addToSet, essentially completing the pivot.
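
As a rough illustration of that kind of pivot, the sketch below builds a tag-to-authors mapping in plain JavaScript, using a Set to get the "add only if not already present" behavior that $addToSet provides inside $group. It is not the g5 code from the gist; the documents, the authorsByTag function, and the authors output field are all hypothetical.

```javascript
// In-memory illustration of a tag -> authors pivot with set semantics.
// All documents and names here are HYPOTHETICAL, not taken from g5.
const articles = [
  { author: "bob",  tags: ["fun", "good", "fun"] },
  { author: "dave", tags: ["fun", "nasty"] },
  { author: "bob",  tags: ["nasty"] }
];

function authorsByTag(docs) {
  const byTag = new Map();
  for (const doc of docs) {
    for (const tag of doc.tags) {        // same effect as unwinding tags
      if (!byTag.has(tag)) byTag.set(tag, new Set());
      byTag.get(tag).add(doc.author);    // a Set ignores repeated authors
    }
  }
  // Turn each Set back into an array, one output document per tag.
  return [...byTag].map(([tag, authors]) =>
    ({ _id: tag, authors: [...authors] }));
}

const pivot = authorsByTag(articles);
console.log(JSON.stringify(pivot));
```

Because the authors are collected as a set, the duplicated "fun" tag on the first document does not produce a duplicate "bob" entry. The element order shown by this sketch is simply insertion order; do not rely on any particular order from $addToSet itself.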

NB: There’s a slight bug in the design of the g1 aggregation.  The first object has the “fun” tag twice.  I intentionally chose this one as it shows how the $unwind duplicates “fun” in the Intermediate-2 output for the first document, meaning that its aggregates are counted twice.  A free MongoLab T-shirt to the first person who can correct the code to properly calculate the aggregates.  Enter in the comments.  (@cwestin63, you’re disqualified; you get a T-shirt anyway)
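
The size of the double counting can be checked with a few lines of plain JavaScript. This is only a sketch of the arithmetic, not a corrected pipeline: the statsForTag function and its dedupe flag are made up for illustration, and the second document is a hypothetical stand-in chosen to match the "fun" result shown earlier.

```javascript
// Statistics for one tag, with and without per-document de-duplication
// of the tags array. Plain JavaScript; runs without MongoDB.
// The second document is HYPOTHETICAL (consistent with the g1 "fun" result).
const sample = [
  { pageViews: 5, tags: ["fun", "good", "fun"] },
  { pageViews: 7, tags: ["fun"] }
];

function statsForTag(docs, tag, dedupe) {
  let count = 0;
  let views = 0;
  for (const doc of docs) {
    // When dedupe is true, each tag is counted at most once per document.
    const tags = dedupe ? [...new Set(doc.tags)] : doc.tags;
    for (const t of tags) {
      if (t === tag) {
        count += 1;
        views += doc.pageViews;
      }
    }
  }
  return { docsByTag: count, viewsByTag: views, avgByTag: views / count };
}

const counted = statsForTag(sample, "fun", false); // duplicate tag counted twice
const deduped = statsForTag(sample, "fun", true);  // each document counted once
console.log(counted, deduped);
```

Without de-duplication, "fun" is credited with 3 documents and 17 views, as in the g1 result. Counting each document once gives 2 documents, 12 views, and an average of 6.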

Summary

The MongoDB 2.2 Aggregation Framework is a powerful mechanism that can help you answer questions across documents. You can try it out with minimal risk by using the MongoLab hosted experimental service. Happy aggregating!

(Update 2012-07-10 untabify indentation in aggregation.js for proper formatting. 2012-07-11 Re-arranged images.  2012-09-07 to reference 2.2)

About benwen

I'm MongoLab's VP of Sales and Marketing. And I'm here to serve our customers' needs for MongoDB hosting in the cloud.

  • Shawn Brownfield

    The fix for g1 aggregation is pretty easy.  Add these operations to the pipeline after the unwind:


    { $group : {
        _id : "$_id",
        author : { $first : "$author" },
        tags : { $addToSet : "$tags" },
        pageViews : { $first : "$pageViews" }
    }},
    { $unwind : "$tags" }

    The group command will undo the unwind on tags, but will remove duplicates ($addToSet).  We then unwind again, and we’re ready to go.

    It’d be nice to have this sort of de-duplication built in, but it isn’t too bad to do.

    • http://twitter.com/benwen benwen

      Yep, that works!  email me ben at mongolab.com to claim your T-shirt.

  • http://twitter.com/niko_nava Niko Schmuck

    Fantastic article making the matter of the aggregation pipeline clear!

    Unfortunately, your webpage seems to have crumpled up the code snippets (articles.js and following); they contain a lot of extra HTML encoding of the brackets ().

    The referenced code seems to also be available from https://gist.github.com/cwestin/1401585

    • http://mongolab.com MongoLab

      Glad you like the article.

      Thanks Niko, I’m looking into the bug.  (seems to be a change in our WordPress environment) -Ben

  • http://profiles.google.com/vaid.abhi Abhishek Vaid

    Does anyone know how I can access the whole MongoDocument in the aggregation pipeline?