Skip to main content

Monary performs high-performance column queries on MongoDB

Project description

Introduction

MongoDB is a document-oriented database, organized for quick access to records (or rows) of data. When doing analytics on a large data set, it is often desirable to have it in a column-oriented format. Columns of data may be thought of as mathematical vectors, and a wealth of techniques exist for gathering statistics about data that is stored in vector form.

For small to medium sized collections, it is possible to materialize several columns of data in the memory of a modern PC. For example, an array of 100 million double-precision numbers consumes 800 million bytes, or about 0.75 GB. For larger problems, it’s still possible to materialize a substantial portion of the data, or to work with data in multiple segments. (Very large problems require more powerful weapons, such as map/reduce.)

Extracting column data from MongoDB using Python is fairly straightforward. In PyMongo, collection.find() generates a sequence of dictionary objects. When dealing with millions of records, the trick is not to keep these dictionaries in memory, as they tend to be large. Fortunately, it’s easy to move the data in to arrays as it is loaded.

First, let’s create 3.5 million rows of test data:

#!/usr/bin/env python
import random
import pymongo

NUM_BATCHES = 3500
BATCH_SIZE = 1000
# 3500 batches * 1000 per batch = 3.5 million records

c = pymongo.MongoClient()
collection = c.mydb.collection

for i in xrange(NUM_BATCHES):
    stuff = [ ]
    for j in xrange(BATCH_SIZE):
        record = dict(x1=random.uniform(0, 1),
                      x2=random.uniform(0, 2),
                      x3=random.uniform(0, 3),
                      x4=random.uniform(0, 4),
                      x5=random.uniform(0, 5)
                 )
        stuff.append(record)
    collection.insert(stuff)

Here’s an example that uses numpy arrays:

#!/usr/bin/env python
import numpy
import pymongo

c = pymongo.MongoClient()
collection = c.mydb.collection
num = collection.count()
arrays = [ numpy.zeros(num) for i in range(5) ]

for i, record in enumerate(collection.find()):
    for x in range(5):
        arrays[x][i] = record["x%i" % x+1]

for array in arrays: # prove that we did something...
    print numpy.mean(array)

With 3.5 million records, this query takes 85 seconds on an EC2 Large instance running Ubuntu 10.10 64-bit, and takes 88 seconds on my MacBook Pro (2.66 GHz Intel Core 2 Duo with 8 GB RAM).

These timings might seem impressive, given that they’re loading 200,000+ values per second. However, closer examination reveals that much of that time is spent by pymongo as it reads each query result and transforms the BSON result to a Python dictionary. (If you watch the CPU usage, you’ll see Python is using 90% or more of the CPU.)

Monary

It is possible to get (much) more speed from the query if we bypass the PyMongo driver. To demonstrate this, I’ve developed monary, a simple C library and accompanying Python wrapper which make use of MongoDB C driver. The code is designed to accept a list of desired fields, and to load exactly those fields from the BSON results into some provided array storage.

Here’s an example of the same query using monary:

#!/usr/bin/env python

from monary import Monary
import numpy

with Monary("127.0.0.1") as monary:
    arrays = monary.query(
        "mydb",                         # database name
        "collection",                   # collection name
        {},                             # query spec
        ["x1", "x2", "x3", "x4", "x5"], # field names (in Mongo record)
        ["float64"] * 5                 # Monary field types (see below)
    )

for array in arrays:                    # prove that we did something...
    print numpy.mean(array)

Monary is able to perform the same query in 4 seconds flat, for a rate of about 4 million values per second (20 times faster!) Here’s a quick summary of how this Monary query stacks up against PyMongo:

  • PyMongo Insert – EC2: 102 s – Mac: 76 s

  • PyMongo Query – EC2: 85 s – Mac: 88 s

  • Monary Query – EC2: 5.4 s – Mac: 3.8 s

Of course, this test has created some fairly ideal circumstances: It’s querying for every record in the collection, the records contain only the queried data (plus ObjectIDs), and the database is running locally. The performance may degrade if we used a remote server, if the records were larger, or if queried for a only subset of the records (requiring either that more records be scanned, or that an index be used).

Monary now knows about the following types:

  • id (Mongo’s 12-byte ObjectId)

  • int8

  • int16

  • int32

  • int64

  • float32

  • float64

  • bool

  • date (stored as int64, milliseconds since epoch)

Monary’s source code is available on bitbucket. It includes a copy of the Mongo C driver, and requires compilation and installation, which can be done via the included “setup.py” file. (The installation script works, but is in a somewhat rough state. Any help from a distutils guru would be greatly appreciated!) To run Monary from Python, you will need to have the pymongo and numpy packages installed.

Monary has been slowly gaining functionality (including the recent additions of more numeric types and the date type). Here are some planned future improvements:

  • Support for string / binary types

    (I hope to develop Monary to support some reasonable mapping of most BSON types onto array storage.)

  • Support for fetching nested fields (e.g. “x.y”)

  • Remove dependencies on PyMongo and NumPy (possibly)

    (Currently these must be installed in order to use Monary.)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Monary-0.5.0.tar.gz (33.8 kB view details)

Uploaded Source

File details

Details for the file Monary-0.5.0.tar.gz.

File metadata

  • Download URL: Monary-0.5.0.tar.gz
  • Upload date:
  • Size: 33.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for Monary-0.5.0.tar.gz
Algorithm Hash digest
SHA256 a69a3f9ddd493958432b0a57f37d6efed757bb0493ecc8d6c271685cc6505268
MD5 0baf204f6bea7f0406d47f1bf57b03bd
BLAKE2b-256 35b6230a3ec114337e324f372106b83a88efe2043f9adda551292ff57cc1262d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page