Friday, January 30, 2015

MongoDB with Node.js

Find


The function below connects to MongoDB through Mongoose, defines a model over an existing collection, and runs a find query:
 exports.connect = function() {  
      console.log("connecting to MongoDB ... ");  
      var mongoose = require('mongoose');  
      mongoose.connect('mongodb://host/db/');  
   
      var Gngram = mongoose.model('Schema', {   
           term: String,  
           noun: String,  
           verb: String }, 'collection');  
   
      Gngram.find(  
           {  
                $and:  
                [  
                     { "noun" : { $gte : "0.4" } },  
                     { "verb" : { $gte : "0.4" } }  
                ]  
           },  
           {  
                "term" : 1  
           },  
           function(err, gngrams) {  
                if (err) console.error(err);  
                console.log(gngrams);  
           }  
      );  
 }  

In the Google n-Gram collection this query will look for terms that are used as both nouns and verbs with a distribution spread of at least 40% each.  Note that noun and verb are stored as strings here, so $gte performs a lexicographic comparison rather than a numeric one.

The query also uses projection to limit the returned data to the term field.
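The effect of the $and/$gte filter and the projection can be sketched in plain Python over a list of dicts (the sample documents are hypothetical; they just mirror the Gngram schema above):

```python
# Hypothetical sample documents mirroring the Gngram schema above.
docs = [
    {"term": "run",   "noun": "0.45", "verb": "0.55"},
    {"term": "house", "noun": "0.90", "verb": "0.10"},
    {"term": "walk",  "noun": "0.41", "verb": "0.59"},
]

# $and of two $gte conditions: both fields must be >= "0.4".
# The values are strings, so this is a lexicographic comparison,
# just as $gte over string values is in MongoDB.
matches = [d for d in docs if d["noun"] >= "0.4" and d["verb"] >= "0.4"]

# Projection { "term" : 1 }: keep only the term field.
terms = [{"term": d["term"]} for d in matches]
print(terms)
```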


Quick Reference

  • In the sort parameter, specify the fields to sort by, with a value of 1 or -1 for an ascending or descending sort respectively[1]

Tuesday, January 20, 2015

Changing the Data Path in MongoDB

Changing the dbPath

Changing the dbPath for MongoDB can be an exercise in frustration if permissions are not managed correctly.

A common error:
 craig@U14BASE01:~$ mongo  
 MongoDB shell version: 2.6.7  
 connecting to: test  
 2015-01-20T13:54:49.274-0800 warning: Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused  
 2015-01-20T13:54:49.276-0800 Error: couldn't connect to server 127.0.0.1:27017 (127.0.0.1), connection attempt failed at src/mongo/shell/mongo.js:146  
 exception: connect failed  


1. Stopping the Service

The first step is to stop the MongoDB service:
sudo service mongod stop


2. Editing the Configuration

Then edit the MongoDB configuration file:
sudo gedit /etc/mongod.conf
I recommend commenting out the dbPath variable and adding a new one.
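For example, the edit might look like this (the commented line is the stock Ubuntu default; the new path is the one used later in this post):

```
# dbpath=/var/lib/mongodb
dbpath=/media/craig/mongo/db
```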


3. Setting Permissions

We have to give the user "mongodb" permission to use this folder.  To check the current permission set use:
ls -l <dir>
You should see "mongodb nogroup" in the owner and group columns.

If the owner is not "mongodb", set it with this command:
sudo chown -R mongodb /path/to/db
Likewise, make sure that the parent directories have the same ownership.

For example:
sudo chown -R mongodb /path/to/



4. Starting and Connecting

Once complete, you should be able to start the service and connect to MongoDB using this command sequence:
sudo service mongod start
mongo


Application


After stepping through these instructions to add a new hard drive to an Ubuntu Virtual Box instance, I followed the steps above to configure a path on that drive for MongoDB.

My directory permissions look like:
  craig@U14BASE01:~$ ls -l -R /media/craig/  
 /media/craig/:  
 total 4  
 drwxr-xr-x 4 mongodb root 4096 Jan 20 13:54 mongo  
 /media/craig/mongo:  
 total 20  
 drwxr-xr-x 4 mongodb root 4096 Jan 20 13:54 db  
 drwx------ 2 mongodb root 16384 Jan 20 13:38 lost+found  
 /media/craig/mongo/db:  
 total 81932  
 drwxr-xr-x 2 mongodb nogroup   4096 Jan 20 13:54 journal  
 -rw------- 1 mongodb nogroup 67108864 Jan 20 13:54 local.0  
 -rw------- 1 mongodb nogroup 16777216 Jan 20 13:54 local.ns  
 -rwxr-xr-x 1 mongodb nogroup    5 Jan 20 13:54 mongod.lock  
 drwxr-xr-x 2 mongodb nogroup   4096 Jan 20 13:54 _tmp  
 /media/craig/mongo/db/journal:  
 total 3145740  
 -rw------- 1 mongodb nogroup 1073741824 Jan 20 13:54 j._0  
 -rw------- 1 mongodb nogroup 1073741824 Jan 20 13:54 prealloc.1  
 -rw------- 1 mongodb nogroup 1073741824 Jan 20 13:54 prealloc.2  
 /media/craig/mongo/db/_tmp:  
 total 0  
 ls: cannot open directory /media/craig/mongo/lost+found: Permission denied  

Note the presence of the "mongodb" username in the user permissions column.

It's important to keep an eye on the log file (on Ubuntu, typically /var/log/mongodb/mongod.log).

Generally, when I received a "connection refused" error it was due to incorrect permissions. However, once the permissions were set correctly, I had to type "mongo" in the terminal twice. The first attempt failed because the journal files had not yet been allocated; the second attempt succeeded, and I gained access to the shell.

In another instance, I cloned the VM that held MongoDB.  When the clone started, I was unable to connect to the MongoDB shell, with the same "connection refused" error.  Restarting the service and typing "mongo" twice in the terminal restored access.


References

  1. Setting Permissions
    1. If I change the dbPath to another directory on the same drive, this solution works for me.
  2. Changing Permissions on the Parent Directory
    1. This is the post that put me over the top.  
    2. The parent directory that held the mongo directory did not have the correct permissions.
  3. Changing the Default Path
    1. This is a detailed post.  
    2. While the solution did nothing for me, it was helpful to walk through a detailed approach that worked for someone else.

Friday, January 16, 2015

An Introduction to the Aggregation Pipeline



Finding Names

This query will find all actors with a name like "stevens":
db.test.find({"actor.displayName":/stevens/i}).pretty()  
and I get back 40 results, with the full tweet for each result:
 /* 0 */  
 {  
   "_id" : ObjectId("54b860e65756612564fbc584"),  
   "body" : "@TheRevAl Rev Cornell is the \"misguided YOUNG kid\" caught n jihadist social media. T. Rice 12, is \"20yr old\" killed b'cuz of toy pellet gun",  
   "retweetCount" : 0,  
   "generator" : {  
     "link" : "http://twitter.com/download/iphone",  
     "displayName" : "Twitter for iPhone"  
   },  
   "twitter_filter_level" : "low",  
   "gnip" : {  
     "klout_score" : 13,  
     "matching_rules" : [   
       {  
         "tag" : "SIDM_10205",  
         "value" : "\"mobile app\" OR \"mobile application\" OR \"Social Commerce\" OR \"Social Media\" OR \"Social Campaign\" OR \"Digital Optimization\" OR \"Digital Analytics\" OR \"Digital Software\" OR \"site optimization\" OR \"conversation optimization\" OR \"B2B Integration\" OR \"B2B enterprises\" OR \"B2b customer\" OR \"B2b purchase\" OR \"product information management\" OR \"web content management\" OR \"order management\" OR \"Campaign Management\" OR \"multichannel attribute\" OR \"multichannel optimization\" OR \"multi-channel attribute\" OR \"Customer Experience\""  
       }  
     ],  
     "language" : {  
       "value" : "en"  
     }  
   },  
   "favoritesCount" : 0,  
   "object" : {  
     "postedTime" : "2015-01-16T00:33:02.000Z",  
     "summary" : "@TheRevAl Rev Cornell is the \"misguided YOUNG kid\" caught n jihadist social media. T. Rice 12, is \"20yr old\" killed b'cuz of toy pellet gun",  
     "link" : "http://twitter.com/EDSDrum/statuses/555885403102662657",  
     "id" : "object:search.twitter.com,2005:555885403102662657",  
     "objectType" : "note"  
   },  
   "actor" : {  
     "preferredUsername" : "EDSDrum",  
     "displayName" : "Ernest Stevens",  
     "links" : [   
       {  
         "href" : null,  
         "rel" : "me"  
       }  
     ],  
     "twitterTimeZone" : null,  
     "image" : "https://pbs.twimg.com/profile_images/1474488395/ES_Photo__2005___2__normal.jpg",  
     "verified" : false,  
     "location" : {  
       "displayName" : "MD",  
       "objectType" : "place"  
     },  
     "statusesCount" : 350,  
     "summary" : "Financier, Educator, Artist, Business Consultant and political junkie.",  
     "languages" : [   
       "en"  
     ],  
     "utcOffset" : null,  
     "link" : "http://www.twitter.com/EDSDrum",  
     "followersCount" : 1,  
     "favoritesCount" : 1,  
     "friendsCount" : 57,  
     "listedCount" : 1,  
     "postedTime" : "2009-07-01T16:26:25.000Z",  
     "id" : "id:twitter.com:52771595",  
     "objectType" : "person"  
   },  
   "twitter_lang" : "en",  
   "twitter_entities" : {  
     "user_mentions" : [   
       {  
         "id" : 42389136,  
         "indices" : [   
           0,   
           9  
         ],  
         "id_str" : "42389136",  
         "screen_name" : "TheRevAl",  
         "name" : "Reverend Al Sharpton"  
       }  
     ],  
     "symbols" : [],  
     "trends" : [],  
     "hashtags" : [],  
     "urls" : []  
   },  
   "verb" : "post",  
   "link" : "http://twitter.com/EDSDrum/statuses/555885403102662657",  
   "provider" : {  
     "link" : "http://www.twitter.com",  
     "displayName" : "Twitter",  
     "objectType" : "service"  
   },  
   "postedTime" : "2015-01-16T00:33:02.000Z",  
   "id" : "tag:search.twitter.com,2005:555885403102662657",  
   "objectType" : "activity"  
 }  
 ... 40 results total ...  



Projection

That last query returned more data than I needed. I can use projection to solve this.
 db.test.find({"actor.displayName":/Stevens/i}, {"actor.displayName":1, _id:0})  
and now I only get back data of interest to me, rather than the entire tweet:
 /* 0 */  
 {  
   "actor" : {  
     "displayName" : "Ernest Stevens"  
   }  
 }  
   
 /* 1 */  
 {  
   "actor" : {  
     "displayName" : "Shimmel Stevenson"  
   }  
 }  
   
 /* 2 */  
 {  
   "actor" : {  
     "displayName" : "John Stevens"  
   }  
 }  
   
 /* 3 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 4 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 5 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 6 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 7 */  
 {  
   "actor" : {  
     "displayName" : "Gina Stevenson"  
   }  
 }  
   
 /* 8 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 9 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 10 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 11 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 12 */  
 {  
   "actor" : {  
     "displayName" : "Stewart Stevenson"  
   }  
 }  
   
 /* 13 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 14 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 15 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 16 */  
 {  
   "actor" : {  
     "displayName" : "Ezra Stevens"  
   }  
 }  
   
 /* 17 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 18 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 19 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 20 */  
 {  
   "actor" : {  
     "displayName" : "Manny Stevenson"  
   }  
 }  
   
 /* 21 */  
 {  
   "actor" : {  
     "displayName" : "Manny Stevenson"  
   }  
 }  
   
 /* 22 */  
 {  
   "actor" : {  
     "displayName" : "Dennis Stevens"  
   }  
 }  
   
 /* 23 */  
 {  
   "actor" : {  
     "displayName" : "Stewart Stevenson"  
   }  
 }  
   
 /* 24 */  
 {  
   "actor" : {  
     "displayName" : "Ruben Stevens"  
   }  
 }  
   
 /* 25 */  
 {  
   "actor" : {  
     "displayName" : "Simon Stevens"  
   }  
 }  
   
 /* 26 */  
 {  
   "actor" : {  
     "displayName" : "Neil Stevens"  
   }  
 }  
   
 /* 27 */  
 {  
   "actor" : {  
     "displayName" : "DavidBernard-Stevens"  
   }  
 }  
   
 /* 28 */  
 {  
   "actor" : {  
     "displayName" : "Ryan Stevens"  
   }  
 }  
   
 /* 29 */  
 {  
   "actor" : {  
     "displayName" : "julia stevens"  
   }  
 }  
   
 /* 30 */  
 {  
   "actor" : {  
     "displayName" : "julia stevens"  
   }  
 }  
   
 /* 31 */  
 {  
   "actor" : {  
     "displayName" : "Mike Stevens"  
   }  
 }  
   
 /* 32 */  
 {  
   "actor" : {  
     "displayName" : "Neil Stevens"  
   }  
 }  
   
 /* 33 */  
 {  
   "actor" : {  
     "displayName" : "Michael Stevens"  
   }  
 }  
   
 /* 34 */  
 {  
   "actor" : {  
     "displayName" : "Guido Stevens"  
   }  
 }  
   
 /* 35 */  
 {  
   "actor" : {  
     "displayName" : "Stewart Stevenson"  
   }  
 }  
   
 /* 36 */  
 {  
   "actor" : {  
     "displayName" : "Eric Stevens"  
   }  
 }  
   
 /* 37 */  
 {  
   "actor" : {  
     "displayName" : "Mike Stevens"  
   }  
 }  
   
 /* 38 */  
 {  
   "actor" : {  
     "displayName" : "Skeeve Stevens"  
   }  
 }  
   
 /* 39 */  
 {  
   "actor" : {  
     "displayName" : "Amber Stevens"  
   }  
 }  

But now I have a lot of duplicates, and it would be nice to have an aggregation with a count for the number of times each name appears.


All Distinct Names

I can find all the distinct names
 db.test.distinct("actor.displayName")  
and that gives me this output:
 /* 0 */  
 {  
   "0" : "abdul",  
   "1" : "Ayunda Dewi",  
   "2" : "CheckAContract",  
   "3" : "Lisa Peyton",  
   "4" : "NJT Advisory",  
   "5" : "E3M Filters",  
   "6" : "Ernest Stevens",  
   "7" : "Social Media Time",  
   "8" : "GoodmanHailey",  
   "9" : "ERP-VIEW.PL",  
   "10" : "12Stocks.com Tech",  
   "11" : "Technology News",  
   "12" : "resign .",  
   "13" : "Kenji Hiranabe",  
   "14" : "LittleBabyTinySteve",  
   "15" : "Bryan K. Robinson",  
   "16" : "christopherlortie",  
   "17" : "ErinFlannagan",  
   "18" : "Liz Barnett",  
   "19" : "Agile Retweets 2.2k",  
   "20" : "Ferry Irawan",  
   <snip ... 71,499 records in this dataset ...>  
Distinct returns an array (above). Arrays are not compatible with the aggregation pipeline[1].
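In Python terms, distinct is simply the set of unique values for the field, with no counts attached (hypothetical sample data):

```python
# Hypothetical sample documents with the same nesting as the tweets above.
tweets = [
    {"actor": {"displayName": "Ernest Stevens"}},
    {"actor": {"displayName": "Lisa Peyton"}},
    {"actor": {"displayName": "Ernest Stevens"}},
]

# distinct("actor.displayName"): unique values only, no frequencies.
distinct_names = sorted({t["actor"]["displayName"] for t in tweets})
print(distinct_names)
```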


By Name

This demonstrates a basic use of the aggregation pipeline.

This will count the number of tweets that contain the name "stevens" in some variation:
 db.test.aggregate(  
   [  
     { $match: { "actor.displayName": /stevens/i } },  
     { $group: { _id: null, count: { $sum: 1 } } }  
   ]  
 )  
and gives me this result:
 /* 0 */  
 {  
   "result" : [   
     {  
       "_id" : null,  
       "count" : 40  
     }  
   ],  
   "ok" : 1  
 }  
But what I'm really looking for is a count for each distinct variation of a name that contains "stevens".  I'm missing some key elements in my query above.


The Aggregation Pipeline

This query appears to do what I want:
 db.test.aggregate([  
   {$match: { "actor.displayName": /stevens/i } },  
   {$group: { _id: { actor: "$actor.displayName" }, numberOfTimes: { $sum: 1 }}},   
   {$sort:{numberOfTimes:-1}}  
 ])  
and the output is 22 results:
 /* 0 */  
 {  
   "result" : [   
     {  
       "_id" : {  
         "actor" : "patti stevens"  
       },  
       "numberOfTimes" : 10  
     },   
     {  
       "_id" : {  
         "actor" : "Edward Stevens"  
       },  
       "numberOfTimes" : 4  
     },   
     {  
       "_id" : {  
         "actor" : "Stewart Stevenson"  
       },  
       "numberOfTimes" : 3  
     },   
     {  
       "_id" : {  
         "actor" : "julia stevens"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Neil Stevens"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Manny Stevenson"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Mike Stevens"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Skeeve Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Eric Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Michael Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ryan Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "DavidBernard-Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Amber Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Simon Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ezra Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Dennis Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ruben Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ernest Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Gina Stevenson"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Guido Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "John Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Shimmel Stevenson"  
       },  
       "numberOfTimes" : 1  
     }  
   ],  
   "ok" : 1  
 }  
sorted in descending order by the number of times each name appears.
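The whole pipeline — $match, $group, $sort — can be sketched in plain Python with a Counter (hypothetical sample data):

```python
import re
from collections import Counter

# Hypothetical sample documents with the same nesting as the tweets above.
tweets = [
    {"actor": {"displayName": "patti stevens"}},
    {"actor": {"displayName": "Edward Stevens"}},
    {"actor": {"displayName": "patti stevens"}},
    {"actor": {"displayName": "Lisa Peyton"}},
]

# $match: keep documents whose displayName matches /stevens/i.
pattern = re.compile("stevens", re.IGNORECASE)
matched = (t["actor"]["displayName"] for t in tweets
           if pattern.search(t["actor"]["displayName"]))

# $group + $sort: count each distinct name, most frequent first.
counts = Counter(matched).most_common()
print(counts)
```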


Matching all Names

And if I remove the match condition, and just use this query:
 db.test.aggregate([  
   {$group: { _id: { actor: "$actor.displayName" }, numberOfTimes: { $sum: 1 }}},   
   {$sort:{numberOfTimes:-1}}  
 ])  
The result is a sorted list (like the above):
 /* 0 */  
 {  
   "result" : [   
     {  
       "_id" : {  
         "actor" : "BYOD News"  
       },  
       "numberOfTimes" : 529  
     },   
     {  
       "_id" : {  
         "actor" : "free movies"  
       },  
       "numberOfTimes" : 521  
     },   
     {  
       "_id" : {  
         "actor" : "mclovemckee"  
       },  
       "numberOfTimes" : 480  
     },   
     {  
       "_id" : {  
         "actor" : "Cell Phone Deals"  
       },  
       "numberOfTimes" : 417  
     },   
     {  
       "_id" : {  
         "actor" : "Jose Fornelino Muñiz"  
       },  
       "numberOfTimes" : 376  
     },   
     {  
       "_id" : {  
         "actor" : "mckeeliurvg"  
       },  
       "numberOfTimes" : 328  
     },   
     {  
       "_id" : {  
         "actor" : "NoSQL"  
       },  
       "numberOfTimes" : 325  
     },   
     {  
       "_id" : {  
         "actor" : "Capitalsecure"  
       },  
       "numberOfTimes" : 302  
     },   
     {  
       "_id" : {  
         "actor" : "Antonio Trento?"  
       },  
       "numberOfTimes" : 257  
     },   
     {  
       "_id" : {  
         "actor" : "Top Sinhala Blog"  
       },  
       "numberOfTimes" : 252  
     },   
     {  
       "_id" : {  
         "actor" : "?FollowerSale.com"  
       },  
       "numberOfTimes" : 221  
     },   
     {  
       "_id" : {  
         "actor" : "Jeff Ho"  
       },  
       "numberOfTimes" : 209  
     },   
     {  
       "_id" : {  
         "actor" : "Cloud Server Hosts"  
       },  
       "numberOfTimes" : 206  
     },   
     ... 71,449 results total ...
for every name, not just "stevens".


Thursday, January 15, 2015

MongoDB Administration: RoboMongo on Ubuntu

Environment

  1. Ubuntu 14.04
  2. MongoDB 2.6.7
  3. RoboMongo 0.8.4


Installation and Configuration


Steps:
  1. Installation:
    1. Downloading the File
    2. Installing the .deb File
    3. Running RoboMongo
  2. Configuration:
    1. Creating a Connection
    2. Viewing Data

1. Downloading the File

Download the .deb file from the homepage:


I downloaded the file to my
/home/craig/Downloads
directory.

2. Installing the .deb file

Double-click the .deb file and select install:

You will be required to authenticate as root to install the package. The installation time will vary based on the speed of your network connection.

A successful installation will look like this:


3. Running RoboMongo

Use the search application in Ubuntu to find "RoboMongo":

Congratulations!  You have successfully installed RoboMongo.

4. Creating a Connection

When RoboMongo is launched for the first time, you will need to create a connection to your database.

Click the "create" link on the dialog box as shown below:

Enter a name for the connection (any name is fine):

and then select the "Advanced" tab and enter a name for your database:

You can click "Save" once complete.

You can now connect to your database:


5. Viewing Data


Prior to loading any data into MongoDB, the default view looks like this:

Once I load data in, using this script, the default view looks like this:


Note that several different views of the data are possible.

I can view the data in table mode:

You can likewise view the data as formatted JSON text:

Importing Data into MongoDB with Python

Environment


  1. Python 2.7.6
    Installing the Python Driver for MongoDB
  2. MongoDB 2.6.7
    Installing MongoDB on Ubuntu 14.04
  3. Ubuntu 14.04



JSON Support for Python


Official Documentation: Simplejson is a simple, fast, complete, correct and extensible JSON encoder and decoder for Python 2.5+ and Python 3.3+. It is pure Python code with no dependencies, but includes an optional C extension for a serious speed boost.

Install simplejson using pip:
sudo pip install simplejson



Writing to MongoDB


 # -*- coding: utf-8 -*-
   
 import argparse  
 import datetime  
 import pprint  
 import pymongo  
 import json  
 import os  
 import sys  
 import fnmatch  
   
 ##      ARGPARSE USAGE  
 ##     <https://docs.python.org/2/howto/argparse.html>  
 parser = argparse.ArgumentParser(description="Import records into MongoDB")  
 group = parser.add_mutually_exclusive_group()  
 group.add_argument("-v", "--verbose", action="store_true")  
 group.add_argument("-q", "--quiet", action="store_true")  
 parser.add_argument("max", type=int, help="the maximum records to import", default=sys.maxint)  
 parser.add_argument("path", help="The input path for importing. This can be either a file or directory.")  
 parser.add_argument("db", help="The MongoDB name to import into.")  
 parser.add_argument("collection", help="The MongoDB collection to import into.")  
 args = parser.parse_args()
   
 ##      RETRIEVE files from filesystem
 def getfiles(path) :  
      if len(path) <= 1 :   
           print "!Please Supply an Input File"  
           return []  
      try :  
           input_path = str(path).strip()  
   
           if os.path.exists(input_path) == 0 :   
                print "!Input Path does not exist (input_path = ", input_path, ")"  
                return []  
   
           if os.path.isdir(input_path) == 0 :  
                if args.verbose :  
                     print "*Input Path is Valid (input_path = ", input_path, ")"  
                return [input_path]       
   
           matches = []  
           for root, dirnames, filenames in os.walk(input_path):  
                for filename in fnmatch.filter(filenames, '*.json'):  
                     matches.append(os.path.join(root, filename))  
             
           if len(matches) > 0 :  
                if args.verbose :  
                     print "*Found Files in Path (input_path = ", input_path, ", total-files = ", len(matches), ")"  
                return matches  
   
           print "!No Files Found in Path (input_path = ", input_path, ")"  
      except ValueError :  
           print "!Invalid Input (input_path, ", input_path, ")"  
      return []  
   
 ##     IMPORT records into mongo
 def read(jsonFiles) :  
      from pymongo import MongoClient  
   
      client = MongoClient('mongodb://localhost:27017/')  
      db = client[args.db]  
   
      counter = 0  
      for jsonFile in jsonFiles :  
           with open(jsonFile, 'r') as f:  
                for line in f:  
   
                      # trim the newline and skip short/blank lines
                      line = line.rstrip()  
                      if len(line) < 10 : continue  
                     try:  
                          db[args.collection].insert(json.loads(line))  
                          counter += 1  
                     except pymongo.errors.DuplicateKeyError as dke:  
                          if args.verbose :  
                               print "Duplicate Key Error: ", dke  
                     except ValueError as e:  
                          if args.verbose :  
                               print "Value Error: ", e  
   
                     # friendly log message                      
                     if 0 == counter % 100 and 0 != counter and args.verbose : print "loaded line: ", counter  
                      if counter >= args.max :   
                           break  
    
            # stop reading further files once the maximum is reached
            if counter >= args.max :  
                 break  
    
       # close the connection (db objects have no close method; the client does)
       client.close()  
   
      if 0 == counter :  
           print "Warning: No Records were Loaded"  
      else :  
           print "loaded a total of ", counter, " lines"  
   
   
 ##      EXECUTE
 files = getfiles(args.path)  
 read(files)  

This will write to MongoDB.

Command line usage is:
 python import.py 1000 /media/data/records.json mydb mycollection -v

The -v flag is optional and will log in a verbose manner to the console.


Other Considerations


I've noticed that Twitter data from the GNIP firehose can be imported directly into MongoDB.

On the other hand, Java objects serialized into JSON using the GSON package need to be restructured. For example, an array of objects serialized with GSON will look like this:
 [  
      { name : "item1" },  
      { name : "item2" },  
      { name : "item-n" }  
 ]  

If you use a web validator / formatter, such as JsonEditorOnline, this output will be parsed correctly, like this:


However, MongoDB doesn't like this syntax, and prefers this approach:
 { name : "item1" }  
 { name : "item2" }  
 { name : "item-n" }  

Note the absence of both the commas separating the items and the square brackets at the beginning and end of the structure.
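A few lines of Python (a sketch using the standard json module) will flatten a GSON-style array into the newline-delimited form MongoDB expects:

```python
import json

def array_to_ndjson(array_json):
    """Convert a JSON array string into newline-delimited JSON documents."""
    records = json.loads(array_json)
    return "\n".join(json.dumps(record) for record in records)

print(array_to_ndjson('[{"name": "item1"}, {"name": "item2"}]'))
```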


MacOS

The instructions don't vary greatly.

I prefer to use a virtualenv on my local dev environment. Virtualenv is described in this blog post.

Set up the virtualenv on the terminal:
virtualenv --system-site-packages .
source bin/activate

Once inside the virtualenv, install pymongo:
(data-imdb-populate-mongo)~/workspaces/data-imdb-populate-mongo$ pip install pymongo
Collecting pymongo
  Downloading pymongo-3.2-cp27-none-macosx_10_8_intel.whl (263kB)
    100% |████████████████████████████████| 266kB 1.4MB/s 
Installing collected packages: pymongo
Successfully installed pymongo-3.2



References

  1. Python Argparse
    1. The first part of this program uses argparse to access the command line arguments from the user to the program
  2. [Official Documentation] PyMongo Tutorial
    1. This tutorial is intended as an introduction to working with MongoDB and PyMongo
  3. Unix ULIMIT settings
    1. I've noticed the bulk insert with PyMongo has a tendency to run out of memory.  This details a method for limiting and controlling the usage of system resources that might help.
      1. [StackOverflow] PyMongo Bulk Insert Runs out of memory
      2. [MongoDB JIRA] Bug Report (fixed)

Installing the Python Driver for MongoDB

Installing the driver on Ubuntu 14.04:

  1. Check to see if pip is already installed
    whereis pip
    pip is a package management system used to install and manage software packages written in Python.
  2. Install python-dev
    sudo apt-get install build-essential python-dev
    header files and a static library for Python
  3. Install pip
    sudo apt-get install python-pip
  4. Install pymongo
    sudo pip install pymongo
    PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python


Simple Script


 import datetime  
 import pprint  
 import pymongo  

 from pymongo import MongoClient  

 client = MongoClient('mongodb://localhost:27017/')  
 db = client.mydb  

 post = { "a" : datetime.datetime.utcnow() }  
 postId = db.test.insert(post)  

 for item in db.test.find() :   
      pprint.pprint(item)  

I've saved this script as "test.py" and it can be executed by typing
python test.py
in the terminal window.

Successful output for me looks like:
 craig@U14BASE01:~$ python test.py   
 {u'_id': ObjectId('54b8048557566138d66f7fa8'),  
  u'a': datetime.datetime(2015, 1, 15, 18, 18, 45, 239000)}  

Tuesday, January 13, 2015

Using Spring Data for MongoDB

Environment:



Introduction

Spring Data for MongoDB is part of the umbrella Spring Data project, which aims to provide a familiar and consistent Spring-based programming model for new datastores while retaining store-specific features and capabilities.



Rationale


I started to use Spring Data for MongoDB because the default query API in Mongo was awkward for Java.

For example, searching for i > 50 is represented as:
 cursor = coll.find(new BasicDBObject("i", new BasicDBObject("$gt", 50)));  

The equivalent Spring enabled query is:
query(where("i").gte(50))

While these are both simple cases, the former case is both syntactically and semantically awkward.  Semantically, we lose a lot of meaning as the query grows in length.  For queries with multiple conditions, a large number of BasicDBObject instances have to be created and appended to simulate a pipeline.  Syntactically, operators like ">" and "<" have to be escaped.

In the latter case, the Java pipeline looks somewhat more like a Javascript document string.  While using MongoDB in Java may never have the syntactic elegance of Javascript (native JSON), Spring brings us closer to this.

Spring support brings further advantages around deployment, integration and environment support for MongoDB in enterprise applications.


Test Cases


I use this test case to demonstrate CRUD functionality over a working MongoDB connection. This test case is also helpful as a quick reminder of the Spring syntax for MongoDB as I prefer working with python or in the javascript shell directly.

MongoSpringCrudTest.java:
package org.swtk.sandbox.mongodb.spring;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;
import static org.junit.Assert.assertTrue;
import static org.springframework.data.mongodb.core.query.Criteria.where;
import static org.springframework.data.mongodb.core.query.Query.query;
import static org.springframework.data.mongodb.core.query.Update.update;

import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.swtk.sandbox.mongodb.spring.dto.Person;

import com.mongodb.Mongo;

public class MongoSpringCrudTest {

 private static final Log log = LogFactory.getLog(MongoSpringCrudTest.class);

 @Test
 public void run() throws Throwable {

  @SuppressWarnings("deprecation") MongoOperations mongoOps = new MongoTemplate(new Mongo(), "database");
  Person p = new Person("Joe", 34);

  /*  insert is used to initially store the object into the database. */
  mongoOps.insert(p);

  /*  find */
  p = mongoOps.findById(p.getId(), Person.class);
  assertNotNull(p);

  /*  update */
  mongoOps.updateFirst(query(where("name").is("Joe")), update("age", 35), Person.class);
  log.info("Updated: " + p);

  /*  test the update */
  p = mongoOps.findOne(query(where("name").is("Joe")), Person.class);
  assertNotNull(p);
  assertEquals(35, p.getAge());

  /*  delete */
  mongoOps.remove(p);

  assertNull(mongoOps.findOne(query(where("i").gte("50")), Person.class));

  /*  check that deletion worked */
  /*  find one ... */
  assertNull(mongoOps.findOne(query(where("name").is("Joe")), Person.class));
  /*  find all ... */
  List<Person> people = mongoOps.findAll(Person.class);
  assertNotNull(people);
  assertTrue(people.isEmpty());

  mongoOps.dropCollection(Person.class);
 }
}

Person.java (data transfer object):
package org.swtk.sandbox.mongodb.spring.dto;

public class Person {

 private String id;
 private String name;
 private int  age;

 public Person(String name, int age) {
  this.name = name;
  this.age = age;
 }

 public String getId() {
  return id;
 }

 public String getName() {
  return name;
 }

 public int getAge() {
  return age;
 }

 @Override
 public String toString() {
  return "Person [id=" + id + ", name=" + name + ", age=" + age + "]";
 }
}



The Soundex Use Case


The Soundex algorithm belongs to the class of approximate string matching (ASM) algorithms.

The goal is for homophones (e.g. Jon Smith, John Smythe) to be encoded to the same representation so that they can be matched despite minor differences in spelling.

I have a large dataset of ~150 million names. Each name is encoded and a new record is inserted into MongoDB:
package com.mycompany;

public class SoundexResult {

 private String encoding;

 private String id;

 private String value;

 public String getEncoding() {
  return encoding;
 }

 public String getId() {
  return id;
 }

 public String getValue() {
  return value;
 }

 public void setEncoding(String encoding) {
  this.encoding = encoding;
 }

 public void setId(String id) {
  this.id = id;
 }

 public void setValue(String value) {
  this.value = value;
 }
}

The Soundex encoder is provided by the Apache Commons Codec library:
<dependency>
 <groupId>commons-codec</groupId>
 <artifactId>commons-codec</artifactId>
 <version>1.9</version>
</dependency>

Since the Soundex algorithm is designed for English phonology only, each String is first checked to confirm it falls within the English alphabet, followed by a call out to the codec:
new org.apache.commons.codec.language.Soundex().encode("Jonathan");
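
The pre-check described above might look like the following sketch (the helper name `isEnglish` and the exact character set are my assumptions, not the post's actual code):

```java
// Hypothetical pre-check: accept only names made of ASCII letters,
// spaces, apostrophes, or hyphens before calling the Soundex codec.
public final class EnglishAlphabetCheck {

    public static boolean isEnglish(String value) {
        // trailing '-' inside the character class is a literal hyphen
        return value != null && !value.isEmpty() && value.matches("[A-Za-z '-]+");
    }
}
```

A name that fails this check is discarded rather than encoded; the test case below likewise expects a failure (a BusinessException) for non-English input.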

This is a simple test case that demonstrates the Soundex algorithm working correctly:
package com.mycompany;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.HashSet;
import java.util.Set;

import org.junit.Test;
import org.swtk.common.framework.exception.BusinessException;
import org.swtk.common.util.TextUtils;
import org.swtk.eng.asm.svc.SoundexService;
import org.swtk.eng.asm.svc.impl.SoundexServiceImpl;

public final class SoundexServiceTest {

 @Test
 public void difference() throws Throwable {
  assertEquals(4, getService().difference("John Checker", "Jon Cecker"));
  assertEquals(4, getService().difference("John Checker", "John Checker"));
  assertEquals(2, getService().difference("John Checker", "John Doe"));
  assertEquals(1, getService().difference("John Checker", "Barack Obama"));
  assertEquals(4, getService().difference("Checker", "Cecker"));
 }

 @Test
 public void encode() throws Throwable {
  assertTrue(hasEqualEncoding("Jon Cecker", "John Checker", "J522"));
  assertTrue(hasEqualEncoding("Jon Smythe", "John Smith", "J525"));
  assertTrue(hasEqualEncoding("Jemima", "Jemimah", "Jemina", "JHEMIMAH", "Jhemimhah", "J550"));
  assertTrue(hasEqualEncoding("Jeremiah", "Jeremy", "J650"));

  /* what are the other *620's? */
  assertEquals("C620", encode("Craig"));
  assertEquals("G620", encode("Greg"));

  assertEquals("T500", encode("Tim"));
  assertTrue(hasEqualEncoding("Trin", "Trinh", "Trim", "T650"));
 }

 private String encode(String value) throws Throwable {
  return getService().encode(value);
 }

 @Test(expected = BusinessException.class)
 public void encodeFailures() throws Throwable {
  getService().encode("م");
 }

 @Test
 public void equals() throws Throwable {
  assertTrue(getService().isEqual("Jon Cecker", "John Checker"));
  assertFalse(getService().isEqual("Barack Obama", "John Checker"));
 }

 private SoundexService getService() {
  return new SoundexServiceImpl();
 }

 private boolean hasEqualEncoding(String... values) throws Throwable {
  Set<String> set = new HashSet<String>();

  for (String value : values) {
   if (4 == value.length() && TextUtils.isNumeric(value.substring(1, value.length()))) set.add(value);
   else set.add(encode(value));
  }

  return 1 == set.size();
 }
}
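
For intuition about the four-character codes asserted above, here is a minimal sketch of the classic American Soundex rules. This is an illustration of the algorithm only, not the commons-codec source:

```java
// A minimal sketch of American Soundex: keep the first letter, map the
// remaining consonants to digits, collapse repeats, and pad to four characters.
public final class SoundexSketch {

    /* digit for each letter a..z; '0' marks vowels (and y), h/w are special-cased */
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String name) {
        String s = name.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";

        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = code(s.charAt(0));

        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') continue;       // h/w do not separate equal codes
            char digit = code(c);
            if (digit != '0' && digit != prev) out.append(digit);
            prev = digit;                             // vowels ('0') reset the previous code
        }

        while (out.length() < 4) out.append('0');     // pad to letter + three digits
        return out.toString();
    }

    private static char code(char c) {
        return CODES.charAt(c - 'A');
    }
}
```

This sketch reproduces the encodings asserted in the test case above: C620 for Craig, T500 for Tim, and J525 for the Smith/Smythe pair.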

I loaded this data into MongoDB using the mongoOps.insert(...) command. Admittedly, this test hardly flexes the Spring/Mongo use case; I expect that to come in the analysis stage. Insertion performance was tracked across 24 large files and 75 million records; roughly half of the original names were non-English and had to be discarded.



The x-axis represents the number of records loaded (in millions).  The y-axis represents the insertion time per record in milliseconds (ms).  The jagged green line is the actual insertion performance on a ms-per-record basis.  The lighter dotted green line is a linear trendline through the actual data and appears to track slightly better than O(1/2 n).  For comparison, three hypothetical (dotted) lines are drawn: the blue is O(n), the orange is O(1/2 n), and the purple is O(log n).

The load performance is very reasonable.  The total time to process the entire dataset (10 GB across a local LAN with gigabit Ethernet and minimal computation prior to the db insertion, using a quad-core VirtualBox image with 16 GB RAM running Ubuntu 14) was 19 minutes.
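
As a back-of-envelope check on those figures (75 million inserted records and 10 GB in 19 minutes), a quick calculation gives roughly 0.015 ms per record, about 66,000 inserts per second, at around 9 MB/s over the wire:

```java
// Back-of-envelope throughput from the figures quoted in the post
// (75M records, 10 GB, 19 minutes).
public final class LoadThroughput {

    /* average milliseconds spent per inserted record */
    public static double msPerRecord(long records, double minutes) {
        return minutes * 60_000.0 / records;
    }

    /* average megabytes transferred per second */
    public static double mbPerSecond(double gigabytes, double minutes) {
        return gigabytes * 1024.0 / (minutes * 60.0);
    }
}
```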

References

  1. Spring Reference Manual 1.6.1
  2. [Blog, Sept 2013] Spring Data and MongoDB: A Mismatch Made In Hell
  3. [Blog] The Soundex Algorithm