Introducing:
MongoDB
David J. C. Beach
Sunday, August 1, 2010
David Beach
Software Consultant (past 6 years)
Python since v1.4 (late 90’s)
Design, Algorithms, Data Structures
Sometimes Database stuff
not a “frameworks” guy
Organizer: Front Range Pythoneers
Outline
Part I: Trends in Databases
Part II: Mongo Basic Usage
Part III: Advanced Features
Part I:
Trends in Databases
Database Trends
WARNING: extreme
oversimplification
Past: “Relational” (RDBMS)
Data stored in Tables, Rows, Columns
Relationships designated by Primary, Foreign
keys
Data is controlled & queried via SQL
Trends:
Criticisms of RDBMS
Rigid data model
Hard to scale / distribute
Slow (transactions, disk seeks)
SQL not well standardized
Awkward for modern/dynamic languages
Lots of disagreement over this
There are points & counterpoints from both sides
The debate is not over
Not here to deliver a verdict
POINT: This is why we see an explosion of new databases.
Trends:
Fragmentation
As with so many things in technology, we’re seeing... FRAGMENTATION!
some examples of DB categories
Relational with ORM (Hibernate, SQLAlchemy)
ODBMS / ORDBMS (push OO-concepts into database)
Key-Value Stores (MemcacheDB, Redis, Cassandra)
Graph (neo4j)
Document Oriented (Mongo, Couch, etc...)
categories are incomplete
some don’t fit neatly into categories
Where Mongo Fits
Mongo’s Tagline (taken from website)
“The Best Features of
Document Databases,
Key-Value Stores,
and RDBMSes.”
What is Mongo
Document-Oriented Database
Produced by 10gen / Implemented in C++
Source Code Available
Runs on Linux, Mac, Windows, Solaris
Database: GNU AGPL v3.0 License
Drivers: Apache License v2.0
Mongo
Advantages
json-style documents (dynamic schemas)
flexible indexing (B-Tree)
replication and high-availability (HA)
automatic sharding support (v1.6)*
easy-to-use API
fast queries (auto-tuning planner)
fast insert & deletes (sometimes trade-offs)
many of these taken straight from home page
* sharding support available as of v1.6 (late July 2010)
Mongo
Language Bindings
C, C++, Java
Python, Ruby, Perl
PHP, JavaScript
(many more community supported ones)
Mongo
Disadvantages
No Relational Model / SQL
Can mimic with foreign IDs, but referential integrity not enforced.
No Explicit Transactions / ACID
Operations can only be atomic within a single document. (Generally)
Limited Query API
You can do a lot more with MapReduce and JavaScript!
When to use Mongo
My personal take on this...
Rich semistructured records (Documents)
Transaction isolation not essential
Humongous amounts of data
Need for extreme speed
You hate schema migrations
Caveat: I’ve never used Mongo in Production!
Part II:
Mongo Basic Usage
BRIEFLY cover:
- Download, Install, Configure
- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)
Installing Mongo
Use a 64-bit OS (Linux, Mac, Windows)
Get Binaries: www.mongodb.org
32-bit available; not for production
Mongo (the server, not PyMongo) uses memory-mapped files.
32-bit limits database to about 2 GB!
Run “mongod” process
Installing PyMongo
Download: http://pypi.python.org/pypi/pymongo/1.7
Build with setuptools
(includes C extension for speed)
# python setup.py install
# python setup.py --no_ext install
(to compile without extension)
Mongo Anatomy
Mongo Server
Database
Collection
Document
Getting a Connection
Connection required for using Mongo
>>> import pymongo
>>> connection = pymongo.Connection("localhost")
Finding a Database
Databases = logically separate stores
Navigation using properties
Will create DB if not found
>>> db = connection.mydatabase
Using a Collection
Collection is analogous to Table
Contains documents
Will create collection if not found
>>> blog = db.blog
Inserting
collection.insert(document) => document_id
>>> entry1 = {"title": "Mongo Tutorial",
...           "body": "Here's a document to insert." }
>>> blog.insert(entry1)
ObjectId('4c3a12eb1d41c82762000001')
document
Inserting (contd.)
Documents must have ‘_id’ field
Automatically generated unless assigned
12-byte unique binary value
You can also assign your own ‘_id’, can be any unique value.
>>> entry1
{'_id': ObjectId('4c3a12eb1d41c82762000001'),
 'body': "Here's a document to insert.",
 'title': 'Mongo Tutorial'}
ID generated by driver. No waiting on DB.
Mongo’s IDs are designed to be unique...
...even if hundreds of thousands of documents are generated per second, on numerous clustered machines.
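How can a driver produce unique 12-byte IDs without asking the DB? A minimal sketch, assuming the layout documented around this time: 4-byte timestamp, 3-byte machine hash, 2-byte process id, 3-byte counter. The helper name make_object_id is hypothetical, and this is not PyMongo's actual code.

```python
import hashlib
import itertools
import os
import socket
import struct
import time

# Process-wide counter (assumption: a real driver may start at a random value).
_counter = itertools.count()

def make_object_id(now=None):
    """Hypothetical helper: build a 12-byte ID on the client side."""
    seconds = int(time.time() if now is None else now)
    timestamp = struct.pack(">I", seconds)                       # 4 bytes
    host = socket.gethostname().encode("utf-8")
    machine = hashlib.md5(host).digest()[:3]                     # 3 bytes
    pid = struct.pack(">H", os.getpid() & 0xFFFF)                # 2 bytes
    count = struct.pack(">I", next(_counter) & 0xFFFFFF)[1:]     # low 3 bytes
    return timestamp + machine + pid + count

first = make_object_id()
second = make_object_id()
print(first.hex())
print(second.hex())
```

Two IDs made in the same second on the same machine and process still differ, because the counter advances on every call.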
Inserting (contd.)
Documents may have different properties
Properties may be atomic, lists, dictionaries
>>> entry2 = {"title": "Another Post",
"body": "Mongo is powerful",
"author": "David",
"tags": ["Mongo", "Power"]}
>>> blog.insert(entry2)
ObjectId('4c3a1a501d41c82762000002')
another document
Indexing
May create index on any field
If field is list => index associates all values
>>> blog.ensure_index("author")  # index by single value
>>> blog.ensure_index("tags")    # by multiple values
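What "index associates all values" means for a list field, as a toy sketch. Mongo's real indexes are B-Trees; the dict-of-sets below and the build_index helper are illustrative assumptions only.

```python
from collections import defaultdict

def build_index(documents, field):
    """Toy index: map each field value to the ids of documents holding it.

    A list-valued field contributes one index entry per list element.
    """
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        if field not in doc:
            continue
        value = doc[field]
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].add(doc_id)
    return index

# Documents keyed by a made-up integer id for brevity.
posts = {
    1: {"title": "Mongo Tutorial"},
    2: {"title": "Another Post", "author": "David",
        "tags": ["Mongo", "Power"]},
}

author_index = build_index(posts, "author")
tag_index = build_index(posts, "tags")
print(dict(author_index))
print(dict(tag_index))
```

The post with two tags shows up under both "Mongo" and "Power", which is why a query on either tag can use the index.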
Bulk Insert
Let’s produce 100,000 fake posts
import random

bulk_entries = [ ]
for i in range(100000):
    entry = { "title": "Bulk Entry #%i" % (i+1),
              "body": "What Content!",
              "author": random.choice(["David", "Robot"]),
              "tags": ["bulk",
                       random.choice(["Red", "Blue", "Green"])]
            }
    bulk_entries.append(entry)
Bulk Insert (contd.)
collection.insert(list_of_documents)
Inserts 100,000 entries into blog
Returns in 2.11 seconds
>>> blog.insert(bulk_entries)
[ObjectId(...), ObjectId(...), ...]
Bulk Insert (contd.)
driver returns early; DB is still working
...unless you specify “safe=True”
>>> blog.remove() # clear everything
>>> blog.insert(bulk_entries, safe=True)
returns in 7.90 seconds (vs. 2.11 seconds)
Querying
collection.find_one(spec) => document
spec = document of query parameters
>>> blog.find_one({"title": "Bulk Entry #12253"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #12253'}
returned in 0.04s - extremely fast
No index created for “title”!
presumably, need more entries to effectively test index performance...
Querying
(Specs)
Multiple conditions on document => “AND”
Value for tags is an “ANY” match
>>> blog.find_one({"title": "Bulk Entry #12253",
...                "tags": "Green"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #12253'}
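The two rules on this slide (all conditions must hold; a list field matches if ANY element matches) can be sketched in plain Python. The matches function is a hypothetical stand-in for what the server does, covering equality only.

```python
def matches(doc, spec):
    """Return True if doc satisfies every condition in spec (AND)."""
    for field, wanted in spec.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(actual, list):
            # "ANY" match: one equal element is enough
            # (an exact match on the whole list also counts).
            if wanted not in actual and wanted != actual:
                return False
        elif actual != wanted:
            return False
    return True

post = {"title": "Bulk Entry #12253",
        "author": "Robot",
        "tags": ["bulk", "Green"]}

print(matches(post, {"title": "Bulk Entry #12253", "tags": "Green"}))
print(matches(post, {"title": "Bulk Entry #12253", "tags": "Red"}))
```

The first spec matches because both conditions hold; the second fails on the tags condition alone.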
Querying
(Multiple)
collection.find(spec) => cursor
new items are fetched in bulk (behind the
scenes)
>>> green_items = [ ]
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)
- or -
>>> green_items = list(blog.find({"tags": "Green"}))
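"Fetched in bulk behind the scenes" can be pictured with a generator: the caller iterates one item at a time, while the data arrives in batches. This is a sketch only; the find function, batch_size, and fetch_log below are made up for illustration and are not PyMongo's cursor.

```python
def find(records, batch_size=2, fetch_log=None):
    """Yield records one by one, pulling them from the 'server' in batches."""
    start = 0
    while start < len(records):
        batch = records[start:start + batch_size]  # one simulated round trip
        if fetch_log is not None:
            fetch_log.append(len(batch))
        for record in batch:
            yield record
        start += len(batch)

# Stand-in for documents living on the server.
server_side = [{"title": "Bulk Entry #%i" % (i + 1),
                "tags": ["bulk", "Green"]} for i in range(5)]

fetches = []
green_items = list(find(server_side, batch_size=2, fetch_log=fetches))
print(len(green_items), fetches)  # 5 [2, 2, 1]
```

Five items reach the caller, but only three round trips were needed.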
Querying
(Counting)
Use the find() method + count()
Returns number of matches found
>>> blog.find({"tags": "Green"}).count()
16646
Updating
collection.update(spec, document)
updates single document matching spec
“multi=True” => updates all matching docs
>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)
Deleting
use remove(...)
it works like find(...)
>>> blog.remove({"author":"Robot"}, safe=True)
Example removed approximately 50% of records.
Took 2.48 seconds
Part III:
Advanced Features
Advanced Querying
Regular Expressions
{"tags" : re.compile(r"^(Green|Blue)$")}
Nested Values {“foo.bar.x” : 3}
$where Clause (JavaScript)
Advanced Querying
$lt, $gt, $lte, $gte, $ne
$in, $nin, $mod, $all, $size, $exists, $type
$or, $not
$elemMatch
>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})
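To get a feel for how operator specs read, here is a small evaluator for a handful of them ($lt, $gt, $lte, $gte, $in, $or). It is a simplified sketch, not the server's logic: the function names are hypothetical, the "views" field is invented for the example, and the remaining operators are left out.

```python
import operator

COMPARISONS = {"$lt": operator.lt, "$gt": operator.gt,
               "$lte": operator.le, "$gte": operator.ge}

def field_matches(actual, condition):
    """Check one field value (scalar or list) against one condition."""
    candidates = actual if isinstance(actual, list) else [actual]
    if not isinstance(condition, dict):
        return condition in candidates           # plain equality / ANY
    for op, operand in condition.items():
        if op in COMPARISONS:
            if not any(COMPARISONS[op](c, operand) for c in candidates):
                return False
        elif op == "$in":
            if not any(c in operand for c in candidates):
                return False
        else:
            raise ValueError("unsupported operator: %s" % op)
    return True

def matches(doc, spec):
    """All top-level keys must hold; $or needs one sub-spec to hold."""
    for key, condition in spec.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif key not in doc or not field_matches(doc[key], condition):
            return False
    return True

posts = [
    {"title": "A", "author": "David", "tags": ["bulk", "Green"], "views": 5},
    {"title": "B", "author": "Robot", "tags": ["bulk", "Blue"], "views": 40},
    {"title": "C", "author": "Robot", "tags": ["bulk", "Red"], "views": 90},
]

green_or_blue = {"$or": [{"tags": "Green"}, {"tags": "Blue"}]}
print([p["title"] for p in posts if matches(p, green_or_blue)])
```

Conditions inside one operator document combine with AND, so {"views": {"$gt": 10, "$lte": 50}} selects only post "B" here.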
Advanced Querying
collection.find(...)
sort(“name”) - sorting
limit(...) & skip(...) [like LIMIT & OFFSET]
distinct(...) [like SQL’s DISTINCT]
collection.group(...) - like SQL’s GROUP BY
>>> blog.find().limit(50) # find 50 articles
>>> blog.find().sort("title").limit(30) # 30 titles
>>> blog.find().distinct("author") # unique author names
won’t be showing detailed examples of all these...
there are good tutorials online for all of this
let’s move on to something even more interesting
Map/Reduce
Most powerful querying
mechanism
collection.map_reduce(mapper, reducer)
ultimate in querying power
distribute across multiple nodes
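Before the real thing, the map/reduce idea itself fits in a few lines of plain Python: map emits (key, value) pairs, values are grouped by key, reduce folds each group. This is a single-process sketch with made-up helper names; Mongo runs JavaScript functions and can spread the work across nodes.

```python
from collections import defaultdict

def map_reduce(documents, mapper, reducer):
    """Run mapper over every document, group by key, reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):       # "emit" = yield a pair
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

def map_tags(doc):
    """Emit (tag, 1) for every tag on a post."""
    for tag in doc.get("tags", []):
        yield tag, 1

def reduce_count(key, values):
    """Add up the emitted ones for a single tag."""
    return sum(values)

posts = [
    {"title": "Mongo Tutorial"},
    {"title": "Another Post", "tags": ["Mongo", "Power"]},
    {"title": "Bulk Entry #1", "tags": ["bulk", "Green"]},
    {"title": "Bulk Entry #2", "tags": ["bulk", "Blue"]},
]

tag_counts = map_reduce(posts, map_tags, reduce_count)
print(tag_counts)
```

Here the result counts posts per tag, with "bulk" counted twice and every other tag once.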
Map/Reduce
Visualized
[Figure: MapReduce logical data flow]
Diagram Credit: Hadoop: The Definitive Guide by Tom White; O’Reilly Books, Chapter 2, page 20
also see: Map/Reduce : A Visual Explanation
SQL:
SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE
    WHEN Measure2 < 100
    THEN Measure2
  END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8

Mongo (Map/Reduce):
db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() { emit(
    { d1: this.Dim1, d2: this.Dim2 },
    { msum: this.measure1, recs: 1, mmin: this.measure1,
      mmax: this.measure2 < 100 ? this.measure2 : 0 }
  );},
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);

Source: http://rickosborne.org/download/SQL-to-MongoDB.pdf
Map/Reduce
Examples
This is me, playing with Map/Reduce
Health Clinic Example
Person registers with the Clinic
Weighs in on the scale
1 year => comes in 100 times
Health Clinic Example
person = { "name": "Bob",
           "weighings": [
               {"date": date(2009, 1, 15), "weight": 165.0},
               {"date": date(2009, 2, 12), "weight": 163.2},
               ... ] }
Map/Reduce
Insert Script
import datetime
import random

all_people = [ ]
# N: number of people to generate
for i in range(N):
    person = { 'name': 'person%04i' % i }
    weighings = person['weighings'] = [ ]
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(
                    days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({ 'date': date,
                           'weight': weight })
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)
Insert Data
Performance
[Chart: Insert time, LOG-LOG scale, linear scaling]
1k: 3.14s
10k: 29.5s
100k: 292s
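A quick check of the "linear scaling" claim: on a log-log plot, linear scaling shows up as a slope near 1. The sketch below assumes the three timings pair with the 1k, 10k, and 100k axis ticks in order (3.14 s, 29.5 s, 292 s), which is my reading of the chart.

```python
import math

# (records inserted, seconds), paired by assumption from the chart.
timings = [(1000, 3.14), (10000, 29.5), (100000, 292.0)]

(n_first, t_first), (n_last, t_last) = timings[0], timings[-1]

# Slope of the line through the first and last points in log-log space.
slope = math.log10(t_last / t_first) / math.log10(n_last / n_first)

# Milliseconds per record at each size; roughly constant if scaling is linear.
per_record_ms = [1000.0 * seconds / n for n, seconds in timings]

print("log-log slope: %.3f" % slope)
print(["%.2f ms" % ms for ms in per_record_ms])
```

The slope comes out just under 1, and the per-record cost stays near 3 ms across all three sizes.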
Map/Reduce
Total Weight by Day
map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")
reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")
result = people.map_reduce(map_fn, reduce_fn)
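One property worth checking on a reducer: as far as I know, Mongo may call reduce more than once for the same key and feed earlier results back in, so reducing partial results must give the same answer as reducing everything at once. A plain-Python check of the summing reducer (the last two weights are invented for the example):

```python
def reduce_total(key, values):
    """Python rendering of the summing reducer."""
    total = 0
    for value in values:
        total += value
    return total

day = "2009-01-15"
weights = [165.0, 163.2, 171.5, 158.0]   # last two values are made up

# Reduce everything in one call.
direct = reduce_total(day, weights)

# Reduce two halves separately, then reduce the partial results.
partials = [reduce_total(day, weights[:2]), reduce_total(day, weights[2:])]
rereduced = reduce_total(day, partials)

print(direct, rereduced)
```

A sum passes this check (up to float rounding); an average computed directly in reduce would not, which is why averages are usually derived from a sum and a count.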
Map/Reduce
Total Weight by Day
>>> for doc in result.find():
...     print doc
{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{u'_id': datetime.datetime(2009, 1, 2, 0, 0), u'value': 41685.341024046182}
{u'_id': datetime.datetime(2009, 1, 3, 0, 0), u'value': 38232.326554504165}
... lots more ...
Total Weight by Day
Performance
[Chart: MapReduce time]
1k: 4.29s
10k: 38.8s
100k: 384s
Map/Reduce
Weight on Day
map_fn = Code("""function () {
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date",
                      target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")
reduce_fn = Code("""function (key, values) {
    return values[0];
};""")
result = people.map_reduce(map_fn, reduce_fn,
                           scope={"bsearch": bsearch})
Map/Reduce
bsearch() function
bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")
Weight on Day
Performance
[Chart: MapReduce time]
1k: 1.23s
10k: 10s
100k: 108s
Weight on Day
(Python Version)
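import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)
for person in people.find():
    dates = [ w['date'] for w in person['weighings'] ]
    # insertion point minus one = most recent weighing on or before target
    pos = bisect.bisect_right(dates, target_date) - 1
    val = person['weighings'][pos]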
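The lookup "most recent weighing on or before a date" has two edge cases worth spelling out: an exact date match, and a target earlier than every weighing. A sketch with a hypothetical helper, most_recent, run against in-memory data rather than a live collection (the third weighing is invented):

```python
import bisect
import datetime

def most_recent(weighings, target_date):
    """Return the last weighing dated on or before target_date, else None.

    Assumes weighings are sorted by date, as the insert script leaves them.
    """
    dates = [w["date"] for w in weighings]
    # bisect_right returns the insertion point AFTER any equal dates,
    # so the element just before it is the one we want.
    pos = bisect.bisect_right(dates, target_date) - 1
    return weighings[pos] if pos >= 0 else None

weighings = [
    {"date": datetime.datetime(2009, 1, 15), "weight": 165.0},
    {"date": datetime.datetime(2009, 2, 12), "weight": 163.2},
    {"date": datetime.datetime(2009, 11, 3), "weight": 161.0},
]

print(most_recent(weighings, datetime.datetime(2009, 10, 5)))
print(most_recent(weighings, datetime.datetime(2009, 1, 1)))
```

For October 5 this returns the February weighing, and for January 1 it returns None instead of silently wrapping around to the last entry.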
Map/Reduce
Performance
[Chart: MapReduce vs. Python time]
1k: MapReduce 1.23s, Python 0.37s
10k: MapReduce 10s, Python 2.2s
100k: MapReduce 108s, Python 26s
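How much slower was MapReduce than the plain Python loop? A small calculation, assuming the chart labels pair up as 1k: 1.23 s vs 0.37 s, 10k: 10 s vs 2.2 s, 100k: 108 s vs 26 s (my reading of the chart).

```python
# (record count, MapReduce seconds, Python seconds), paired by assumption.
timings = [(1000, 1.23, 0.37), (10000, 10.0, 2.2), (100000, 108.0, 26.0)]

for count, mapreduce_s, python_s in timings:
    ratio = mapreduce_s / python_s
    print("%6d records: MapReduce took %.1fx as long" % (count, ratio))

ratios = [mapreduce_s / python_s for _, mapreduce_s, python_s in timings]
```

On these numbers the client-side Python loop is roughly 3x to 5x faster for this particular query.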
Summary
Resources
www.mongodb.org
PyMongo: api.mongodb.org/python
MongoDB: The Definitive Guide (O’Reilly)
www.10gen.com
END OF SLIDES
Chalkboard
is not Comic Sans
This is Chalkboard, not Comic Sans.
This isn’t Chalkboard, it’s Comic Sans.
does it matter, anyway?