Analyze MongoLab Data with Hadoop in Mortar

The following is a guest post by Doug Daniels, CTO of Mortar Data Inc.

Today, we’re excited to announce integration between MongoLab and Mortar, the Hadoop platform for high-scale data science. If you have one of the 100,000+ databases at MongoLab, you can now seamlessly use Hadoop to:

  • Run advanced algorithms (like recommendation engines)
  • Build reports that run quickly in parallel against large collections
  • Join multiple collections (and outside data) together for analysis
  • Store results to Google Drive, back to MongoLab, or many other destinations

In this article we’ll show you how to connect your MongoLab database to Hadoop, and then use Hadoop to do something simple but very useful: gather schema information from an entire collection, including histograms of common values, data types, and more. Mortar handles all deployment, monitoring and cluster management, so no prior knowledge of Hadoop is required.

Quick Start

We first need to connect your MongoLab database to Mortar and Hadoop. If you haven’t already, head over to the MongoLab sign-up page to create an account. After completing the form, you can immediately begin to provision new databases. Make sure that you choose the AWS us-east-1 datacenter for your MongoDB.

If you’re unsure which plan is right for you, visit the MongoLab plans page or email the MongoLab team at support@mongolab.com.

Next, log in to your MongoLab console. For this tutorial, we’ll be using a replica set cluster and will connect to a secondary node. It’s recommended to use a secondary node for analytics so that you don’t affect regular traffic on the primary node (which can lead to performance degradation). For a deeper dive and alternate connection strategies, see the full Mongo -> Hadoop tutorial.

In your MongoLab console, open up the MongoDB cluster and database you’d like to process with Hadoop.

Click on the Users tab for that database. Add a new user that you can use to connect to the database. We’ll call ours “mortar_user”. If you want to save results back to the database, make sure the user has write privileges.

Next, sign up for a free account at Mortar. If you don’t mind your project being public, stick with the free Public plan. If you need your project to be private, grab a free 7-day trial on the Solo plan.

Install Mortar and Connect to MongoLab

Now that your account is set up, use Mortar’s installer to set up your workstation with everything you need to run and deploy Hadoop and Pig jobs.

Next, use Mortar to fork an example project for working with Mongo data:

  mortar projects:fork git@github.com:mortardata/mortar-mongo-examples.git <your_project_name_goes_here>

Now, grab the standard Mongo URI connection details for your database from MongoLab. If you have a Replica Set, use the credentials for the secondary node to keep traffic off the primary.

You can get the Mongo URI by clicking on your MongoLab Cluster and then choosing the Servers tab. If you have a secondary node, choose that one from the list. Then, underneath it, select the database you’d like to analyze.

At the top of the page, you’ll see a box that says “To connect using a driver via the standard URI”. Grab your database’s Mongo URI from there, and fill in the missing <dbuser> and <dbpassword> with the user credentials you created above.
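Passwords that contain characters such as "@", ":" or "/" need to be percent-encoded before they go into a MongoDB URI. The following is a minimal Python sketch of filling in the two placeholders; the host, port, database name, and password are made up for illustration, and build_mongo_uri is a hypothetical helper, not part of Mortar or MongoLab.

```python
from urllib.parse import quote_plus


def build_mongo_uri(template: str, user: str, password: str) -> str:
    """Fill the <dbuser> and <dbpassword> placeholders, percent-encoding both."""
    for placeholder in ("<dbuser>", "<dbpassword>"):
        if placeholder not in template:
            raise ValueError(f"template is missing {placeholder}")
    return (
        template
        .replace("<dbuser>", quote_plus(user))
        .replace("<dbpassword>", quote_plus(password))
    )


# Hypothetical host, port, and database name; copy the real template
# from your own MongoLab console.
template = "mongodb://<dbuser>:<dbpassword>@ds012345-a1.example.com:12345/twitter"
uri = build_mongo_uri(template, "mortar_user", "p@ss:word/1")
print(uri)
```

The resulting string is what you would paste between the quotes of the MONGO_URI setting.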

With your filled-out URI in hand, set the configuration for your Mortar project to point to your MongoLab server by running:

  cd <my_project_folder>
  mortar config:set MONGO_URI='put your Mongo URI here'

This will store your encrypted configuration at Mortar for running jobs against MongoLab.

Run a Small Hadoop Job Locally

Now we’re ready to run our first Hadoop job against Mongo. As an example, we’ll run an Apache Pig script that connects to a collection in your database and emits statistics about every field in the collection. We’ll run this script on your local computer first, so choose a small collection! Otherwise, you’ll spend a lot of time trying to stream data from your MongoLab database to your local computer. We’ll try larger collections when we run in the cloud next.

In your project directory, open the params/characterize-local.params file. Change INPUT_COLLECTION to a small collection you’d like to see stats on, and OUTPUT_COLLECTION to where you’d like the stats delivered. Then run:

  mortar local:run pigscripts/characterize_collection.pig -f params/characterize-local.params

This will first download all of the dependencies you need to run a Pig job to a local sandbox for your project. Once complete, it will do a local run of the characterize_collection Pigscript against your input collection.

When finished, you’ll have a new Mongo document in your output collection with detailed information about each field in the input collection, including the number of unique values in the field, example values and predicted data types.
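To make the idea of "characterizing" a collection concrete, here is a small Python sketch that computes similar per-field statistics over an in-memory list of documents. It is not the logic of the characterize_collection Pig script, which is not shown here; the function name, the output keys, and the sample documents are assumptions for illustration, and only top-level fields are considered.

```python
from collections import Counter, defaultdict


def characterize(documents, max_examples=3):
    """Return per-field stats: unique value count, most common type, examples."""
    values = defaultdict(Counter)  # field -> counts of each distinct value
    types = defaultdict(Counter)   # field -> counts of each Python type name
    for doc in documents:
        for field, value in doc.items():
            types[field][type(value).__name__] += 1
            # repr() lets us count unhashable values such as lists and dicts
            values[field][repr(value)] += 1
    return {
        field: {
            "unique_values": len(values[field]),
            "predicted_type": types[field].most_common(1)[0][0],
            "examples": [v for v, _ in values[field].most_common(max_examples)],
        }
        for field in values
    }


# Made-up sample documents standing in for a small collection.
docs = [
    {"user": "ann", "retweets": 3},
    {"user": "bob", "retweets": 0},
    {"user": "ann", "retweets": 3, "lang": "en"},
]
stats = characterize(docs)
for field, info in stats.items():
    print(field, info)
```

A single pass like this works for a small sample on one machine; the point of running the job on Hadoop is that the same kind of counting can be spread across many nodes for large collections.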

Run a Full Hadoop Job in the Cloud

Running locally is fine for smaller datasets, but to process larger data, we’ll want to use the full power of a Hadoop cluster. With Mortar, one command deploys a snapshot of your code to a private Github repository, launches a private AWS Elastic MapReduce Hadoop cluster, and runs your code at scale.

Let’s try it out. Open up the params/characterize-cloud.params file. Set the INPUT_COLLECTION parameter to a larger collection that you’d like to analyze. Set the OUTPUT_COLLECTION to either the same output you used before or a new collection.

Now, run:

  mortar run pigscripts/characterize_collection.pig -f params/characterize-cloud.params --clustersize 3

This will validate your script, launch a new private, 3-node Hadoop cluster on AWS Spot Instances, and analyze your collection. Cluster startup will take about 10-15 minutes, and the job should cost less than $0.40 for the whole hour on 3 machines—Mortar passes AWS cluster costs directly back with no up-charge.

When you start your job, Mortar will print out a job URL. Open it up, and you’ll see real-time progress tracking, logs, and visualization for your job.

When your job finishes, your results will be ready to view in the output collection you chose.

What’s Next?

The example we ran is a fairly simple one. You’ll want to go deeper on your own data: bringing in multiple collections, joining and aggregating, and using your own code. Our Mongo -> Hadoop tutorial will step you through the process, showing you how to work with your MongoLab data in Hadoop and Pig.

Mortar also has a growing number of open-source data apps pre-built on top of the platform, such as recommendation engines and Google Drive / Data Hero dashboards. We’re quickly adding more, but if your use case isn’t yet available, we have tutorials to help build your own data app.

If you have any questions about getting your data connected, contact us @mortardata or drop a question to our Q&A Forum.

Announcing New MongoDB Instances on Microsoft Azure

The following is a guest blog post by Brian Benz, Senior Technical Evangelist at Microsoft Open Technologies, Inc.

Since the previous release of production-ready MongoLab plans on Azure, we’ve seen demand increase significantly. The MongoLab and Microsoft teams have been working together to develop a solution for your growing requirements and are excited to announce the arrival of our newest high-memory MongoDB database plans, with virtual machine choices that now provide up to 56GB of RAM per node with availability in all eight Azure datacenters worldwide.

Scott Guthrie, Executive Vice President of the Cloud and Enterprise group at Microsoft, says, “We have been working with MongoLab for a long time to bring a fully-managed Database-as-a-Service offering for MongoDB to Azure. With full production support for all VM types across all datacenters, we are excited and optimistic for the future of MongoDB on our cloud platform and think there is no better place to run your application in the cloud than Azure.”

Highlights of MongoLab’s Dedicated Cluster plans on Azure

Highly-Available MongoDB Cloud Hosting

  • Dedicated virtual machines (up to 56GB of RAM)
  • Multi-zone automatic failover using MongoDB Replica Sets

MongoDB Management Tools

  • Free daily system-level backup or custom backups with easy restore
  • Real-time and historical monitoring of key performance metrics
  • Automated query analysis and index recommendations

MongoDB Support

  • Expert, timely email support as well as a 24×7 emergency support hotline
  • Availability SLA
  • Commercial Support from MongoDB, Inc. with one-hour SLA

Getting Started

Try it out on MongoLab.com

Head over to our Azure page, click “Get Started Now” and select the plan and datacenter in which you’d like to provision your new MongoDB database. Once it is ready, MongoLab will provide a connection string you can plug into your application.

Alternatively, you can log in to your MongoLab account and clone any existing Sandbox database (from Azure or any other cloud provider) to upgrade an existing database to a new Replica Set plan.

New to MongoDB?

We have plenty of resources to help!

Visit the Microsoft Open Tech blog to see all the options available to MongoLab developers on Azure and in-depth instructions on getting started.

MongoLab’s open-source site www.mongolab.org has many resources to get you up and running with MongoDB quickly. For resources specific to beginners, the Basics page will be very helpful.

For specific tutorials on deploying to Azure and MongoLab, we have some great examples on the Azure Documentation Center.

What’s next?

We’re excited about our ongoing partnership and look forward to helping Azure users scale their production MongoDB deployments. Stay tuned for more announcements soon and feel free to write to MongoLab’s team at support@mongolab.com with questions any time.

Heartbleed security update

As many of you know, a serious vulnerability in the OpenSSL cryptographic software library was recently discovered: CVE-2014-0160. This vulnerability is commonly called the “Heartbleed Bug” and is described at http://heartbleed.com.

The Heartbleed vulnerability can be exploited by an attacker to gain access to the cryptographic keys used to secure communication between clients and servers using SSL, which includes most communication with web servers using HTTPS. Furthermore, this vulnerability can be used to access the system memory of running servers. As a result, an attacker can potentially listen to client-server traffic, steal passwords, and even hijack an HTTP session.

What actions are we taking?

On Monday, we patched all services most vulnerable to attack. Since then, we have been carefully reviewing the less vulnerable components in our system and either patching or disabling them as appropriate.

  • We have patched all front-facing web services that talk over HTTPS to include the latest protected version of OpenSSL.

  • We have regenerated all SSL certificates used by MongoLab web servers, and we have revoked our old certificates.

  • We have reset all browser sessions active prior to the vulnerability.

  • We have reviewed the vulnerability of all database hosts and temporarily disabled any features that use the OpenSSL library. These services will remain disabled until they have been patched. Please note that your MongoDB servers are not using the affected library and that your database instances are not vulnerable to direct attack.
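If you run your own servers or clients, a quick first check is whether the OpenSSL release they link against falls in the range affected by CVE-2014-0160, which is 1.0.1 through 1.0.1f (1.0.1g contains the fix). The Python sketch below illustrates that comparison only; is_heartbleed_affected is a hypothetical helper, it looks at release version strings alone, and it cannot tell whether a distribution has backported the fix into an older version number.

```python
import re


def is_heartbleed_affected(version: str) -> bool:
    """True if an OpenSSL release version is in the 1.0.1 through 1.0.1f range."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)([a-z]?)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized OpenSSL version: {version!r}")
    numbers = (int(match[1]), int(match[2]), int(match[3]))
    letter = match[4]  # "" for the plain 1.0.1 release
    # Only the 1.0.1 series is affected, up to and including patch letter "f".
    return numbers == (1, 0, 1) and letter <= "f"


for candidate in ("0.9.8y", "1.0.0l", "1.0.1", "1.0.1f", "1.0.1g"):
    print(candidate, is_heartbleed_affected(candidate))
```

To check the library your own Python interpreter is linked against, you could pass it the second word of ssl.OPENSSL_VERSION from the standard ssl module.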

What actions should you be taking?

We have no evidence that any customer assets have been compromised. However, there are precautionary steps you should now take to ensure that your MongoLab account and MongoLab databases are as secure as possible:

(1) You should change all mongolab.com account user passwords and audit your list of MongoLab account users to ensure that all users in your MongoLab account are legitimate.

You can change your password on the User settings page, which you can reach from the link in the upper-right corner of the screen, underneath the “Logout” button.

(2) You should re-generate all user API keys. We suggest you do this even if you have never used your MongoLab API key.

These API keys are per MongoLab user and can be regenerated on the same screen that you use to reset your password in step (1) above.

(3) You should reset all database credentials for all of the database deployments you have with MongoLab and audit the list of users in each database to ensure that all users are legitimate.

To manage database credentials, navigate to your database and click on the “Users” tab.

(4) If you are not using it already, you should enable 2-factor authentication (2FA) for your MongoLab account user.

Going forward

While we have closed all obvious attack vectors, we will continue to respond to this threat by carefully reviewing all of our infrastructure and taking any additional actions we deem necessary or prudent.

For updates please see our status page, which we will be keeping up-to-date.

Of course, if you have any questions or concerns please email us at support@mongolab.com.

Sincerely,

Will@MongoLab

MongoDB driver tips & tricks: Mongoose

Many of the support requests we get at MongoLab are questions about how to properly configure and use particular MongoDB drivers and client libraries.

This blog post is the second in a series where we are covering the popular MongoDB drivers in depth (we covered Mongoid last time). The driver we’re covering today is Mongoose, which is maintained by Aaron Heckmann (@aaronheckmann) and officially supported by MongoDB, Inc.

MongoLab now manages over 100,000 databases! (102,280 to be exact)

We’re proud to announce that MongoLab is now powering over 100,000 cloud MongoDB databases in 23 datacenters worldwide!

Finding duplicate keys with MongoDB’s aggregation framework

Quite frequently our users want to create a unique index on a data set but encounter some form of the following error because of duplicate key value(s):

E11000 duplicate key error index: db.collection.$field_1_field2_1  dup key: { : 1.0 : 1.0 }

While MongoDB supports an option to drop duplicates, dropDups, during index builds, this option forces the creation of a unique index by way of deleting data. If you use the dropDups option, MongoDB will create an index on the first occurrence of a value for a given key and then delete all subsequent documents with that value. While this behavior may be acceptable in some cases, it’s important to be cautious whenever you are deleting data.
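The excerpt above stops before showing a query, so the following is only a sketch of one common approach, not necessarily the one the full post uses: group documents by the would-be unique key with $group, count them, and keep the groups whose count is greater than one with $match. The helper name duplicate_key_pipeline and the field names are hypothetical, and the sketch assumes top-level (non-dotted) field names.

```python
def duplicate_key_pipeline(fields):
    """Build an aggregation pipeline that reports key values occurring more than once."""
    return [
        {
            "$group": {
                # Group on the combination of fields the unique index would cover.
                "_id": {field: "$" + field for field in fields},
                "count": {"$sum": 1},
                # Collect the _id of every document sharing this key for later review.
                "ids": {"$push": "$_id"},
            }
        },
        # Keep only the key combinations that appear in more than one document.
        {"$match": {"count": {"$gt": 1}}},
    ]


pipeline = duplicate_key_pipeline(["field1", "field2"])
print(pipeline)
```

With a driver such as PyMongo, a pipeline like this would be passed to the collection's aggregate method. Each result lists a duplicated key and the documents that share it, so you can decide which to keep before building the unique index instead of letting dropDups delete data for you.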

MongoDB driver tips & tricks: Mongoid 3

Many of the support requests we get at MongoLab are questions about how to properly configure and use particular MongoDB drivers.

This blog post is the first of a series where we plan to cover each of the major MongoDB drivers in depth. The driver we’ll be covering today is Mongoid, developed by Durran Jordan (@modetojoy).

Future of MongoDB: Fireside chat with MongoDB CTO Eliot Horowitz

Last night I attended a Meetup at MongoDB Inc.’s new Palo Alto office to hear MongoDB’s CTO, Eliot Horowitz, speak about the product roadmap. With a new production release right around the corner and MongoDB World in the not-so-distant future, the buzz and excitement around all things MongoDB is high. For those who were not able to attend, we’re going to recap all the major points Eliot made.
