Pymongo: distinct items and an example map reduce on subset of db

Posted: June 7th, 2011 | Filed under: mongodb, pymongo, Python

I’ve been playing around with Pymongo for a few weeks now, and I’m slowly discovering quirks and differences in syntax relative to the mongodb shell. The two I’ll cover in this post are:

  1. Using distinct in pymongo
  2. map reduce example in pymongo on a queried subset of the db

Using distinct in pymongo

Let’s say you want the distinct ids seen on a given day (I’ll use May 21 as an example). From the mongodb shell, the command is simply:

db.raw_data.distinct("id", {"_date": "2011-05-21"})

Running the same command from within a Python file yields the following error:

File "test.py", line 13, in <module>
foo = db.raw_data.distinct("id", {"_date": "2011-05-21"})
TypeError: distinct() takes exactly 2 arguments (3 given)

It turns out pymongo makes you run the find first and then call distinct on the resulting cursor:

foo = db.raw_data.find({"_date": "2011-05-21"}).distinct("id")

This is effectively what the mongodb shell is doing; it’s just annoying that the syntax is different.
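To make concrete what "find, then distinct" computes, here is a minimal pure-Python sketch that needs no database. The sample documents and the helper names find and distinct are made up for illustration; they only imitate the behavior of the pymongo calls above and are not pymongo's implementation.

```python
# Made-up sample documents standing in for db.raw_data (not real data).
raw_data = [
    {"id": "a", "_date": "2011-05-21"},
    {"id": "b", "_date": "2011-05-21"},
    {"id": "a", "_date": "2011-05-21"},
    {"id": "c", "_date": "2011-05-22"},
]


def find(docs, query):
    """Keep documents whose fields equal every key/value pair in query."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]


def distinct(docs, key):
    """Return the unique values of key, in first-seen order."""
    seen = []
    for d in docs:
        if key in d and d[key] not in seen:
            seen.append(d[key])
    return seen


# Same shape as db.raw_data.find({...}).distinct("id"): filter, then dedupe.
ids = distinct(find(raw_data, {"_date": "2011-05-21"}), "id")
print(ids)
```

The document dated May 22 is filtered out before the ids are deduplicated, which is why the query has to come first.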

map reduce example in pymongo on a queried subset of the db

Following the example from my previous post, you simply add query = {...} to select the subset and out = "..." to name the collection in which you want your results to end up. Most of the examples I found on Stack Overflow or personal blogs were wrong or tried to pass these parameters in together. I tried roughly 865868 variations, and what I have below is the only combination that worked.

#!/usr/bin/env python
from bson.code import Code  # for some this needs to be pymongo.bson
from pymongo import Connection

# code for example map/reduce
db = Connection().map_reduce_example
map = Code("function () {"
           "    emit(this.id, 1);"
           "}")
reduce = Code("function (key, values) {"
              "    var total = 0;"
              "    for (var i = 0; i < values.length; i++) {"
              "        total += values[i];"
              "    }"
              "    return total;"
              "}")

# code without query
result = db.things.map_reduce(map, reduce, "map_reduce_example")

# code with simple query
result = db.raw_data.map_reduce(map, reduce, out="map_reduce_example", query={"date": "2011-05-21"})

# code with query that grabs all records from May 2011
result = db.raw_data.map_reduce(map, reduce, out="map_reduce_example2", query={"date": {"$regex": "^2011-05"}})
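If you want to sanity-check what the query does to the counts without a running server, the same logic can be sketched in plain Python. Everything here is an illustration under stated assumptions: the documents are made up, and matches, map_fn, reduce_fn and map_reduce are hypothetical stand-ins for the query filter and the JavaScript functions above, supporting only exact matches and $regex.

```python
import re
from collections import defaultdict

# Made-up documents standing in for db.raw_data (not real data).
raw_data = [
    {"id": "u1", "date": "2011-05-21"},
    {"id": "u2", "date": "2011-05-21"},
    {"id": "u1", "date": "2011-05-21"},
    {"id": "u1", "date": "2011-05-30"},
    {"id": "u3", "date": "2011-06-01"},
]


def matches(doc, query):
    """Support exact-match values and {"$regex": pattern} conditions only."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$regex" in cond:
            if value is None or not re.search(cond["$regex"], value):
                return False
        elif value != cond:
            return False
    return True


def map_fn(doc):
    """Python stand-in for the JavaScript map: emit (id, 1)."""
    yield doc["id"], 1


def reduce_fn(key, values):
    """Python stand-in for the JavaScript reduce: sum the emitted values."""
    return sum(values)


def map_reduce(docs, query=None):
    """Filter by query, group emitted values by key, then reduce each group."""
    emitted = defaultdict(list)
    for doc in docs:
        if matches(doc, query or {}):
            for key, value in map_fn(doc):
                emitted[key].append(value)
    return {key: reduce_fn(key, values) for key, values in emitted.items()}


one_day = map_reduce(raw_data, query={"date": "2011-05-21"})
all_may = map_reduce(raw_data, query={"date": {"$regex": "^2011-05"}})
print(one_day)
print(all_may)
```

This is a simplification: the real server may call reduce several times per key on partial results, and skips it for keys with a single emitted value, so a reduce function has to give the same answer either way. Summing does.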

This may seem like a silly and simple blog post to write, but I couldn’t find this documented anywhere else online, so I wanted to save 5-10 minutes for anyone else who hits the distinct error or is trying to run a map reduce with a query.


More fun with mongodb, map/reduce, and sorting records by value using pymongo

Posted: May 11th, 2011 | Filed under: mongodb, pymongo

A couple of weeks ago I promised a fun application in mongodb. That time has arrived. Suppose you have a collection of records and you want to group by some id (canonicalized url, user id, checkin venue) and see how many user actions (clicks, status updates, or checkins, respectively) are associated with each. Those familiar with mongodb would ask, “Why not do this with the group() function?” It’s limited to 20,000 unique ids. Many applications have more than this, so map/reduce is the way to go. Below I provide code to do this map reduce and then sort the results by value. From the shell, the sort is done as:

result.find().sort({"value": -1})

but if you run this from within Python using the pymongo driver, you will receive the error:

TypeError: if no direction is specified, key_or_list must be an instance of list

If this occurs, make sure you sort with:

result.find().sort(u'value', -1)
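The error message hints at the two call shapes sort accepts: a single key plus a direction, or a list of (key, direction) pairs. My guess at why a dict is rejected is that the order of keys matters for a multi-key sort, and a list of pairs makes that order explicit. The sketch below imitates that argument handling in pure Python on made-up result documents; sort_docs is a hypothetical helper, not pymongo code.

```python
# Made-up map/reduce output documents (not real data).
results = [
    {"_id": "u3", "value": 1.0},
    {"_id": "u1", "value": 3.0},
    {"_id": "u2", "value": 1.0},
]

ASCENDING, DESCENDING = 1, -1


def sort_docs(docs, key_or_list, direction=None):
    """Accept sort(key, direction) or sort([(key, direction), ...])."""
    if direction is not None:
        spec = [(key_or_list, direction)]
    elif isinstance(key_or_list, list):
        spec = key_or_list
    else:
        raise TypeError("if no direction is specified, "
                        "key_or_list must be an instance of list")
    ordered = list(docs)
    # Apply keys from least to most significant; Python's sort is stable,
    # so earlier passes survive as tie-breakers.
    for key, order in reversed(spec):
        ordered.sort(key=lambda d: d[key], reverse=(order == DESCENDING))
    return ordered


by_value = sort_docs(results, "value", DESCENDING)
by_value_then_id = sort_docs(results, [("value", DESCENDING), ("_id", ASCENDING)])
print([d["_id"] for d in by_value])
print([d["_id"] for d in by_value_then_id])
```

Passing a dict such as {"value": -1} to this sketch raises the same TypeError text as above, which mirrors what the shell-style call runs into.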

Some more hardcore/sophisticated mongodb examples/applications will be coming soon!

#!/usr/bin/env python

from bson.code import Code
from pymongo import Connection

# code for example map/reduce
db = Connection().map_reduce_example
map = Code("function () {"
           "    emit(this.id, 1);"
           "}")
reduce = Code("function (key, values) {"
              "    var total = 0;"
              "    for (var i = 0; i < values.length; i++) {"
              "        total += values[i];"
              "    }"
              "    return total;"
              "}")

# run the map/reduce and store the output in the "myresults" collection
result = db.things.map_reduce(map, reduce, "myresults")

# print the ids from most to fewest actions
for doc in result.find().sort(u'value', -1):
    print(doc)