
Analyze MongoLab Data with Hadoop in Mortar

The following is a guest post by Doug Daniels, CTO of Mortar Data Inc.

Today, we’re excited to announce integration between MongoLab and Mortar, the Hadoop platform for high-scale data science. If you have one of the 100,000+ databases at MongoLab, you can now seamlessly use Hadoop to:

  • Run advanced algorithms (like recommendation engines)
  • Build reports that run quickly in parallel against large collections
  • Join multiple collections (and outside data) together for analysis
  • Store results to Google Drive, back to MongoLab, or many other destinations

In this article we’ll show you how to connect your MongoLab database to Hadoop, and then use Hadoop to do something simple but very useful: gather schema information from an entire collection, including histograms of common values, data types, and more. Mortar handles all deployment, monitoring and cluster management, so no prior knowledge of Hadoop is required.

Quick Start

We first need to connect your MongoLab database to Mortar and Hadoop. If you haven’t already, head over to the MongoLab sign-up page to create an account. After completing the form, you can immediately begin to provision new databases. Make sure that you choose the AWS us-east-1 datacenter for your MongoDB.

If you’re unsure which plan is right for you, visit the MongoLab plans page or email the MongoLab team at support@mongolab.com.

Next, log in to your MongoLab console. For this tutorial, we’ll be using a replica set cluster and will connect to a secondary node. We recommend using a secondary node for analytics so that you don’t affect regular traffic on the primary node, which could otherwise degrade performance. For a deeper dive and alternate connection strategies, see the full Mongo-to-Hadoop tutorial.

In your MongoLab console, open up the MongoDB cluster and database you’d like to process with Hadoop.

[Screenshot: MongoLab databases list]

Click on the Users tab for that database. Add a new user that you can use to connect to the database. We’ll call ours “mortar_user”. If you want to save results back to the database, make sure the user has write privileges.

[Screenshot: MongoLab database Users tab]

Next, sign up for a free account at Mortar. If you don’t mind your project being public, stick with the free Public plan. If you need your project to be private, grab a free 7-day trial on the Solo plan.

Install Mortar and Connect to MongoLab

Now that your account is ready, use Mortar’s installer to set up your workstation with everything you need to run and deploy Hadoop and Pig jobs.

Next, use Mortar to fork an example project for working with Mongo data:

  mortar projects:fork git@github.com:mortardata/mortar-mongo-examples.git <your_project_name_goes_here>

Now, grab the standard Mongo URI connection details for your database from MongoLab. If you have a replica set, use the connection details for the secondary node to keep traffic off the primary.

You can get the Mongo URI by clicking on your MongoLab cluster and then choosing the Servers tab. If you have a secondary node, choose that one from the list. Then, select the database underneath it that you’d like to analyze.

[Screenshot: selected database (twitter)]

At the top of the page, you’ll see a box that says “To connect using a driver via the standard URI”. Grab your database’s Mongo URI from there, and fill in the missing <dbuser> and <dbpassword> with the user credentials you created above.

With your filled-out URI in hand, set the configuration for your Mortar project to point to your MongoLab server by running:

  cd <my_project_folder>
  mortar config:set MONGO_URI='put your Mongo URI here'

This will store your encrypted configuration at Mortar for running jobs against MongoLab.
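Before running anything through Mortar, it can help to confirm that the filled-out URI actually connects and authenticates. Below is a minimal sanity-check sketch using pymongo (assumed installed via pip install pymongo); the URI shown is a placeholder, not a real server:

  from pymongo import MongoClient
  from pymongo.errors import PyMongoError

  # Placeholder URI -- substitute the one you copied from MongoLab.
  MONGO_URI = "mongodb://mortar_user:mypassword@ds012345-a1.mongolab.com:12345/mydb"

  try:
      client = MongoClient(MONGO_URI)
      db = client.get_default_database()  # the database named in the URI
      db.command("ping")                  # fails fast if auth or networking is wrong
      print("Connected to database:", db.name)
  except PyMongoError as err:
      print("Connection check failed:", err)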

Run a Small Hadoop Job Locally

Now we’re ready to run our first Hadoop job against Mongo. As an example, we’ll run an Apache Pig script that connects to a collection in your database and emits statistics about every field in the collection. We’ll run this script on your local computer first, so choose a small collection! Otherwise, you’ll spend a lot of time trying to stream data from your MongoLab database to your local computer. We’ll try larger collections when we run in the cloud next.

In your project directory, open the params/characterize-local.params file. Change INPUT_COLLECTION to a small collection you’d like to see stats on, and OUTPUT_COLLECTION to where you’d like the stats delivered. Then run:

  mortar local:run pigscripts/characterize_collection.pig -f params/characterize-local.params

This will first download all of the dependencies you need to run a Pig job into a local sandbox for your project. Once complete, it will do a local run of the characterize_collection Pig script against your input collection.
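For reference, the params files are plain Pig parameter files with one KEY=VALUE pair per line. The two values you edited in characterize-local.params might look something like this (the collection names are purely illustrative):

  INPUT_COLLECTION=tweets_sample
  OUTPUT_COLLECTION=tweets_sample_stats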

When finished, you’ll have a new Mongo document in your output collection with detailed information about each field in the input collection, including the number of unique values in the field, example values and predicted data types.
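You can pull that result document back with any MongoDB driver. For example, a quick look with pymongo (the URI is a placeholder, and the collection name is whatever you set OUTPUT_COLLECTION to):

  from pprint import pprint
  from pymongo import MongoClient

  # Placeholder connection details -- use your own URI and output collection.
  client = MongoClient("mongodb://mortar_user:mypassword@ds012345-a1.mongolab.com:12345/mydb")
  stats = client.get_default_database()["tweets_sample_stats"]

  # Per-field stats: unique value counts, example values, predicted data types.
  pprint(stats.find_one())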

[Screenshot: characterize_collection result document]

Run a Full Hadoop Job in the Cloud

Running locally is fine for smaller datasets, but to process larger data, we’ll want to use the full power of a Hadoop cluster. With Mortar, one command deploys a snapshot of your code to a private GitHub repository, launches a private AWS Elastic MapReduce Hadoop cluster, and runs your code at scale.

Let’s try it out. Open up the params/characterize-cloud.params file. Set the INPUT_COLLECTION parameter to a larger collection that you’d like to analyze, and set OUTPUT_COLLECTION to either the same output collection you used before or a new one.

Now, run:

  mortar run pigscripts/characterize_collection.pig -f params/characterize-cloud.params --clustersize 3

This will validate your script, launch a new private, 3-node Hadoop cluster on AWS Spot Instances, and analyze your collection. Cluster startup will take about 10-15 minutes, and the job should cost less than $0.40 for the whole hour on 3 machines—Mortar passes AWS cluster costs directly back with no up-charge.

When you start your job, Mortar will print a job URL. Open it, and you’ll see real-time progress tracking, logs, and visualizations for your job.

[Screenshot: job progress tracking]

When your job finishes, your results will be ready to view in the output collection you chose.

What’s Next?

The example we ran is a fairly simple one. You’ll want to go deeper on your own data: bringing in multiple collections, joining and aggregating, and using your own code. Our Mongo-to-Hadoop tutorial will step you through the process, showing you how to work with your MongoLab data in Hadoop and Pig.

Mortar also has a growing number of open-source data apps pre-built on top of the platform, such as recommendation engines and Google Drive / Data Hero dashboards. We’re quickly adding more, but if your use case isn’t yet available, we have tutorials to help build your own data app.

If you have any questions about getting your data connected, reach out to us @mortardata or post a question in our Q&A forum.

Announcing New MongoDB Instances on Microsoft Azure

The following is a guest blog post by Brian Benz, Senior Technical Evangelist at Microsoft Open Technologies, Inc.

Since the previous release of production-ready MongoLab plans on Azure, we’ve seen demand increase significantly. The MongoLab and Microsoft teams have been working together to meet your growing requirements, and we’re excited to announce the arrival of our newest high-memory MongoDB database plans, with virtual machine choices that now provide up to 56GB of RAM per node and availability in all eight Azure datacenters worldwide.

Scott Guthrie, Executive Vice President of the Cloud and Enterprise group at Microsoft, says, “We have been working with MongoLab for a long time to bring a fully-managed Database-as-a-Service offering for MongoDB to Azure. With full production support for all VM types across all datacenters, we are excited and optimistic for the future of MongoDB on our cloud platform and think there is no better place to run your application in the cloud than Azure.”

Highlights of MongoLab’s Dedicated Cluster plans on Azure

Highly-Available MongoDB Cloud Hosting

  • Dedicated virtual machines (up to 56GB of RAM)
  • Multi-zone automatic failover using MongoDB Replica Sets

MongoDB Management Tools

  • Free daily system-level backup or custom backups with easy restore
  • Real-time and historical monitoring of key performance metrics
  • Automated query analysis and index recommendations

MongoDB Support

  • Expert, timely email support as well as a 24×7 emergency support hotline
  • Availability SLA
  • Commercial Support from MongoDB, Inc. with one-hour SLA

Getting Started

Try it out on MongoLab.com

Head over to our Azure page, click “Get Started Now”, and select the plan and datacenter where you’d like to provision your new MongoDB deployment. Once it’s ready, MongoLab will provide a connection string you can plug into your application.

Alternatively, you can log in to your MongoLab account and clone any existing Sandbox database (from Azure or any other cloud provider) to upgrade it to a new Replica Set plan.

New to MongoDB?

We have plenty of resources to help!

Visit the Microsoft Open Tech blog to see all the options available to MongoLab developers on Azure and in-depth instructions on getting started.

MongoLab’s open-source site www.mongolab.org has many resources to get you up and running with MongoDB quickly. For resources specific to beginners, the Basics page will be very helpful.

For tutorials specific to deploying on Azure with MongoLab, we have some great examples in the Azure Documentation Center.

What’s next?

We’re excited about our ongoing partnership and look forward to helping Azure users scale their production MongoDB deployments. Stay tuned for more announcements soon and feel free to write to MongoLab’s team at support@mongolab.com with questions any time.

MongoLab now manages over 100,000 databases! (102,280 to be exact)

We’re proud to announce that MongoLab is now powering over 100,000 cloud MongoDB databases in 23 datacenters worldwide! Continue Reading →

Finding duplicate keys with MongoDB’s aggregation framework

Quite frequently our users want to create a unique index on a data set but encounter some form of the following error because of duplicate key value(s):

  E11000 duplicate key error index: db.collection.$field_1_field2_1  dup key: { : 1.0 : 1.0 }
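A good first step is to use the aggregation framework to see exactly which key values are duplicated before resorting to anything destructive. Here is a minimal pymongo sketch; the database, collection, and field names are illustrative, not taken from the error above:

  from pymongo import MongoClient

  coll = MongoClient("mongodb://localhost:27017/")["db"]["collection"]

  pipeline = [
      # Group on the fields of the would-be unique index.
      {"$group": {
          "_id": {"field1": "$field1", "field2": "$field2"},
          "count": {"$sum": 1},
          "ids": {"$push": "$_id"},   # _ids of the documents sharing this key
      }},
      # Keep only key values that occur more than once.
      {"$match": {"count": {"$gt": 1}}},
  ]

  # With pymongo 3+, aggregate() returns a cursor you can iterate directly.
  for dup in coll.aggregate(pipeline):
      print(dup["_id"], "occurs", dup["count"], "times; ids:", dup["ids"])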

While MongoDB supports an option to drop duplicates, dropDups, during index builds, this option forces the creation of a unique index by deleting data. If you use the dropDups option, MongoDB will index the first occurrence of a value for a given key and then delete all subsequent documents containing that value. While this behavior may be acceptable in some cases, it’s important to be cautious whenever you are deleting data. Continue Reading →

Future of MongoDB: Fireside chat with MongoDB CTO Eliot Horowitz

Last night I attended a Meetup at MongoDB Inc.’s new Palo Alto office to hear MongoDB’s CTO, Eliot Horowitz, speak about the product roadmap. With a new production release right around the corner and MongoDB World in the not-so-distant future, the buzz and excitement around all things MongoDB is high. For those who were not able to attend, we’re going to recap all the major points Eliot made.

Continue Reading →


Managing disk space in MongoDB

In our previous post on MongoDB storage structure and dbStats metrics, we covered how MongoDB stores data and the differences between the dataSize, storageSize and fileSize metrics. We can now apply this knowledge to evaluate strategies for re-using MongoDB disk space.

When documents or collections are deleted, empty record blocks arise within the data files. MongoDB attempts to reuse this space when possible, but it never returns it to the file system. This behavior explains why fileSize never decreases despite deletes on a database.
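You can see how much dead space a database is carrying by comparing fileSize against dataSize plus indexSize from dbStats. A small pymongo sketch (connection details are placeholders):

  from pymongo import MongoClient

  # Placeholder connection details -- point this at your own database.
  db = MongoClient("mongodb://localhost:27017/")["mydb"]
  stats = db.command("dbstats")

  mb = 1024.0 * 1024.0
  used_mb = (stats["dataSize"] + stats["indexSize"]) / mb
  file_mb = stats["fileSize"] / mb

  print("data + indexes:          %.1f MB" % used_mb)
  print("fileSize on disk:        %.1f MB" % file_mb)
  print("potentially reclaimable: %.1f MB" % (file_mb - used_mb))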

If your app deletes data frequently, or if your fileSize is significantly larger than the size of your data plus indexes, you can use one of the methods below to reclaim free space. Continue Reading →


Tuning MongoDB Performance with MMS

At MongoLab we manage thousands of MongoDB clusters and regularly help customers optimize system performance. Some of the best tools available for gaining insight into our MongoDB deployments are the monitoring features of MongoDB Management Service (MMS). MMS allows us to quickly determine the health of a MongoDB system and identify the root cause of performance issues. This post covers our general approach to using MMS and MongoDB log files and provides techniques to help you optimize your own MongoDB deployment, whether you’re in the cloud or on your own hardware.

First, we will define the key metrics that we use to guide any performance investigation. Then we will go through the various combinations of metric values, discuss what they mean, and explore how to address the problems they indicate.

Continue Reading →


MongoLab Now Supports Two-Factor Authentication

(Updated 2014-01-08: Two-factor authentication is now GA)

Here at MongoLab, security is one of our foremost concerns.  Part of our stewardship of our users’ data, in addition to keeping it accessible and intact, is doing our best to secure it against unauthorized access. Today, as part of that effort, we are excited to announce our adoption of two-factor authentication (“2FA”) for MongoLab’s web-based management portal.

If you keep tabs on the glamorous InfoSec scene you probably already know what 2FA is and exactly why you should care. If that’s you, feel free to skip on down to the punchline in the last paragraph. Otherwise, bear with me and I’ll try to explain why an extra screen in your MongoLab login sequence might be a Very Good Thing indeed.

Continue Reading →
