Map/Reduce Example
------------------
This is an example of how to use the mapReduce function to perform
map/reduce style aggregation on your data.
This document has been shamelessly ported from the similar
[pymongo Map/Reduce Example](http://api.mongodb.org/python/1.4%2B/examples/map_reduce.html).
Setup
-----
To start, we'll insert some example data which we can perform
map/reduce queries on:
$ ghci
...
Prelude> :set prompt "> "
> :set -XOverloadedStrings
> import Database.MongoDB
> import Data.CompactString ()
> conn <- newConnPool 1 (host "127.0.0.1")
> let run act = access safe Master conn $ use (Database "test") act
> :{
run $ insertMany "mr1" [
["x" =: 1, "tags" =: ["dog", "cat"]],
["x" =: 2, "tags" =: ["cat"]],
["x" =: 3, "tags" =: ["mouse", "cat", "dog"]],
["x" =: 4, "tags" =: ([] :: [String])]
]
:}
Basic Map/Reduce
----------------
Now we'll define our map and reduce functions. In this case we're
performing the same operation as in the MongoDB Map/Reduce
documentation - counting the number of occurrences for each tag in the
tags array, across the entire collection.
Our map function just emits a single (key, 1) pair for each tag in the
array:
> :{
let mapFn = Javascript [] "\
\function() {\n\
\  this.tags.forEach(function(z) {\n\
\    emit(z, 1);\n\
\  });\n\
\}"
:}
The reduce function sums over all of the emitted values for a given
key:
> :{
let reduceFn = Javascript [] "\
\function (key, values) {\n\
\  var total = 0;\n\
\  for (var i = 0; i < values.length; i++) {\n\
\    total += values[i];\n\
\  }\n\
\  return total;\n\
\}"
:}
Note: We can't just return values.length as the reduce function might
be called iteratively on the results of other reduce steps.
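
To see why, here is a small pure-Haskell sketch of re-reduction. It is an illustration only and does not talk to the server: `sumReduce`, `lengthReduce`, `twoPass` and `catBatches` are hypothetical names made up for this example, and the split into batches is an assumption about how values might be grouped.

```haskell
-- Pure model of re-reduction; not part of the MongoDB driver.

-- | Mirror of reduceFn: sum the values emitted for one key.
sumReduce :: [Int] -> Int
sumReduce = sum

-- | A tempting but incorrect reducer: count the values instead of summing.
lengthReduce :: [Int] -> Int
lengthReduce = length

-- | Reduce in two passes: first each batch of values on its own,
-- then the partial results of those batches.
twoPass :: ([Int] -> Int) -> [[Int]] -> Int
twoPass reducer batches = reducer (map reducer batches)

-- | The three 1s emitted for "cat", assumed to arrive as a batch of two
-- and a batch of one.
catBatches :: [[Int]]
catBatches = [[1, 1], [1]]
```

In GHCi, `twoPass sumReduce catBatches` evaluates to 3, while `twoPass lengthReduce catBatches` evaluates to 2, which is the number of batches rather than the number of tags.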
Finally, we run mapReduce. By default, the results are returned inline, as an array in the result document:
> run $ runMR' (mapReduce "mr1" mapFn reduceFn)
Right [ results: [[ _id: "cat", value: 3.0],[ _id: "dog", value: 2.0],[ _id: "mouse", value: 1.0]], timeMillis: 379, counts: [ input: 4, emit: 6, reduce: 2, output: 3], ok: 1.0]
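
The counts above can be cross-checked without a server. The following is a rough pure-Haskell model of the same computation, assuming only the containers package. `exampleTags`, `emitPairs` and `tagCounts` are names invented here, not driver functions.

```haskell
import qualified Data.Map as Map

-- | Tags of the four example documents inserted into "mr1".
exampleTags :: [[String]]
exampleTags = [["dog", "cat"], ["cat"], ["mouse", "cat", "dog"], []]

-- | Map step: one (tag, 1) pair per tag, as mapFn emits.
emitPairs :: [[String]] -> [(String, Int)]
emitPairs docs = [ (tag, 1) | tags <- docs, tag <- tags ]

-- | Group the pairs by key and sum them, as reduceFn does for each key.
tagCounts :: [[String]] -> Map.Map String Int
tagCounts = Map.fromListWith (+) . emitPairs
```

Loaded into GHCi, `Map.toList (tagCounts exampleTags)` gives `[("cat",3),("dog",2),("mouse",1)]`, and `length (emitPairs exampleTags)` is 6, in line with the emit count reported above.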
Inlining only works if the result set is smaller than 16MB. An alternative to inlining is outputting to a collection. But what should happen if the collection already holds data from a previous run of the same MapReduce? You have three alternatives in the MRMerge data type: Replace, Merge, and Reduce. See its documentation for details. To output to a collection, set the rOut field in MapReduce:
> run $ runMR' (mapReduce "mr1" mapFn reduceFn) {rOut = Output Replace "mr1out" Nothing}
Right [ result: "mr1out", timeMillis: 379, counts: [ input: 4, emit: 6, reduce: 2, output: 3], ok: 1.0]
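
As a rough mental model of the three MRMerge alternatives, the sketch below treats the output collection as a map from _id to value. It assumes the constructors behave like the server's replace, merge and reduce output modes. `MergePolicy`, `applyPolicy`, `previous` and `fresh` are hypothetical stand-ins, not the driver's types, so check the MRMerge documentation before relying on this.

```haskell
import qualified Data.Map as Map

-- | Hypothetical stand-in for the driver's MRMerge type.
data MergePolicy = ReplaceAll | MergeKeys | ReduceKeys

-- | Combine an existing output collection with fresh results, both keyed
-- by _id. The binary function models reducing an old and a new value.
applyPolicy
  :: MergePolicy
  -> (Int -> Int -> Int)
  -> Map.Map String Int   -- ^ contents already in the output collection
  -> Map.Map String Int   -- ^ results of the new run
  -> Map.Map String Int
applyPolicy ReplaceAll _ _   new = new                     -- old contents dropped
applyPolicy MergeKeys  _ old new = Map.union new old       -- new wins on shared keys
applyPolicy ReduceKeys f old new = Map.unionWith f old new -- shared keys re-reduced

-- | Made-up earlier contents and a fresh run, for experimenting in GHCi.
previous, fresh :: Map.Map String Int
previous = Map.fromList [("cat", 5), ("bird", 1)]
fresh    = Map.fromList [("cat", 3), ("dog", 2)]
```

With these values, `Map.toList (applyPolicy MergeKeys (+) previous fresh)` is `[("bird",1),("cat",3),("dog",2)]`, whereas `ReduceKeys` yields `[("bird",1),("cat",8),("dog",2)]` and `ReplaceAll` yields just `[("cat",3),("dog",2)]`.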
You can now query the mr1out collection to see the result, or run another MapReduce on it! A shortcut for running the map-reduce and then querying the result collection right away is `runMR`:
> run $ rest =<< runMR (mapReduce "mr1" mapFn reduceFn) {rOut = Output Replace "mr1out" Nothing}
Right [[ _id: "cat", value: 3.0],[ _id: "dog", value: 2.0],[ _id: "mouse", value: 1.0]]