Transactional object database, implemented in pure Python.

Overview

Dobbin is a fast and convenient way to persist a Python object graph on disk.

The object graph consists of persistent nodes, which are objects that are based on one of the persistent base classes:

from dobbin.persistent import Persistent

foo = Persistent()
foo.bar = 'baz'

Each of these nodes can have arbitrary objects connected to it; the only requirement is that Python’s pickle module can serialize the objects.

Persistent objects are fully object-oriented:

class Frobnitz(Persistent):
    ...

The object graph is built by object reference:

foo.frob = Frobnitz()

To commit changes to disk, we use the commit() method from the transaction module. Note that we must first elect a root object, thus connecting the object graph to the database handle:

import transaction
from dobbin.database import Database

jar = Database('data.fs')
jar.elect(foo)

transaction.commit()

Consequently, if we want to make changes to one or more objects in the graph, we must first check out the objects in question:

from dobbin.persistent import checkout

checkout(foo)
foo.bar = 'boz'

transaction.commit()

The checkout(obj) function puts the object in local state, which allows changes to be made to it. It only works on objects that are persistent nodes.

Dobbin is available on Python 2.6 and up including Python 3.x.

Key features:

  • 100% Python, fully compliant with PEP8

  • Threads share data when possible

  • Multi-threaded, multi-process MVCC concurrency model

  • Efficient storage and streaming of binary blobs

  • Pluggable architecture

Getting the code

You can download the package from the Python package index or install the latest release using setuptools or the newer distribute (required for Python 3.x):

$ easy_install dobbin

Note that this will install the transaction module as a package dependency.

The project is hosted in a GitHub repository. Code contributions are welcome. The easiest way is to use the pull request interface.

Author and license

Written by Malthe Borch <mborch@gmail.com>.

This software is made available under the BSD license.

Notes

Frequently asked questions

This section lists frequently asked questions.

  1. How is Dobbin different from ZODB?

    There are other object databases available for Python, most notably ZODB from Zope Corporation.

    Key differences:

    • Dobbin is written 100% in Python. The persistence layer in ZODB is written in C. ZODB also comes with support for B-Trees; this is also written in C.

    • Dobbin is available on Python 3 (but requires a POSIX system).

    • ZODB comes with support for B-Trees which allows processes to load objects on demand (because of the implicit weak reference). Dobbin currently loads all data at once and keeps it in memory.

    • Dobbin uses a persistence model that tries to share data in active objects between threads, but relies on an explicit operation to put an object in a mode that allows making changes to it. ZODB shares only inactive object data.

    • ZODB comes with ZEO, an enterprise-level network database layer which allows processes on different machines to connect to the same database.

    • ZODB comes with a memory management system that evicts object data from memory based on usage. Dobbin does not attempt to manage memory consumption, but calls upon the virtual memory manager to swap inactive object data to disk.

  2. What is the database file format?

    The default storage option writes transactions sequentially to a single file.

    Each transaction consists of a number of records which consist of a Python pickle and sometimes an attached payload of data (in which case the pickle contains control information). Finally, the transaction ends with a transaction record object, also a Python pickle.

  3. Can I connect to a single database with multiple processes?

    Yes.

    The default storage option writes transactions to a single file, which alone makes up the storage record. Multiple processes can connect to the same file and share the same database, concurrently. No further configuration is required; the database uses POSIX file-locking to ensure exclusive write-access and processes automatically stay synchronized.

  4. How can I limit memory consumption?

    To avoid memory thrashing, limit the physical memory allowance of your Python processes and make sure there is enough virtual memory available (at least the size of your database) [1].

    You may want to compile Python with the --without-pymalloc flag to use native memory allocation. This may improve performance in applications that connect to large databases due to better paging.
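The record layout described in question 2 can be pictured with a small stand-alone sketch. It uses only the standard library and is not Dobbin's actual on-disk format: the helper names append_transaction and read_transactions, and the shape of the record dictionaries, are assumptions made for illustration.

```python
import io
import pickle


def append_transaction(stream, records, tx_id):
    """Append one transaction: data records first, then a closing record."""
    for record in records:
        pickle.dump({"kind": "record", "data": record}, stream)
    # The closing record marks the transaction as complete.
    pickle.dump({"kind": "commit", "tx_id": tx_id, "count": len(records)}, stream)


def read_transactions(stream):
    """Replay the log; trailing records without a closing record are ignored."""
    transactions, pending = [], []
    while True:
        try:
            entry = pickle.load(stream)
        except EOFError:
            break
        if entry["kind"] == "commit":
            transactions.append((entry["tx_id"], pending))
            pending = []
        else:
            pending.append(entry["data"])
    return transactions


log = io.BytesIO()  # stands in for the single database file
append_transaction(log, [{"name": "John"}], tx_id=1)
append_transaction(log, [{"name": "James"}, {"count": 0}], tx_id=2)
log.seek(0)
transactions = read_transactions(log)
```

Because every transaction is appended after the previous one, a reader that remembers its last position only needs to read the tail of the file to catch up.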

User’s guide

This is the primary documentation for the database. It uses an interactive narrative which doubles as a doctest. There’s a suite of regression tests included in the distribution.

You can run the tests by issuing the following command at the command-line prompt:

$ python setup.py test

Setup

The default storage option writes transactions sequentially in a single file. It’s optimized for long-running processes, e.g. application servers.

The first step is to initialize a database object. To configure it we provide a path on the file system. The path needn’t exist already.

>>> from dobbin.database import Database
>>> db = Database(database_path)

This particular path does not already exist. This is a new database. We can verify it by using the len method to determine the number of objects stored.

>>> len(db)
0

The database uses an object graph persistency model. Objects must be transitively connected to the root node of the database (by Python reference).

Since this is an empty database, there is no root object yet.

>>> db.root is None
True
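The requirement that objects be transitively connected to the root can be pictured as a reachability walk over attribute references. The following is a minimal sketch in plain Python, not Dobbin's implementation; the Node class and the reachable helper are hypothetical names.

```python
class Node:
    """Plain object standing in for a persistent node (hypothetical)."""


def reachable(root):
    """Return the ids of all Node instances reachable from root by attribute."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # already visited; this also handles cycles
        seen.add(id(node))
        for value in vars(node).values():
            if isinstance(value, Node):
                stack.append(value)
    return seen


root = Node()
root.child = Node()
root.child.parent = root  # a circular reference does not trap the walk
orphan = Node()           # never attached to the graph

connected = reachable(root)
```

In this picture, only objects whose id ends up in connected would be written on commit; orphan is not among them.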

Persistent objects

Any persistent object can be elected as the database root object. Persistent objects must inherit from the Persistent class. These objects form the basis of the concurrency model: overlapping transactions may only write to disjoint sets of objects (conflict resolution mechanisms are available to ease this requirement).

>>> from dobbin.persistent import Persistent
>>> obj = Persistent()

Persistent objects begin life in local state. In this state we can both read and write attributes. However, when we want to write to an object which has previously been persisted in the database, we must check it out explicitly using the checkout method. We will see how this works shortly.

>>> obj.name = 'John'
>>> obj.name
'John'

Electing a database root

We can elect this object as the root of the database.

>>> db.elect(obj)
>>> obj._p_jar is db
True

The object is now the root of the object graph. To persist changes on disk, we commit the transaction.

>>> transaction.commit()

As expected, the database contains one object.

>>> len(db)
1

The tx_count attribute returns the number of transactions which have been written to the database (successful and failed).

>>> db.tx_count
1

Checking out objects

The object is now persisted in the database. This means that we must now check it out before we are allowed to write to it.

>>> obj.name = "John"
Traceback (most recent call last):
 ...
TypeError: Can't set attribute on shared object.

We use the checkout method on the object to change its state to local.

>>> from dobbin.persistent import checkout
>>> checkout(obj)

The checkout method does not have a return value; this is because the object identity never actually changes. Instead, custom attribute accessor and mutator methods are used to provide a thread-local object state. This happens transparently to the user.
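To make the idea of a thread-local object state concrete, here is a minimal sketch built on threading.local. It is not Dobbin's implementation: the toy Node class is hypothetical, it starts out in shared state for brevity, and its checkout function is a stand-in that merely copies the shared state for the calling thread.

```python
import threading


class Node:
    """Toy object with one shared state and an optional per-thread copy."""

    def __init__(self, **state):
        # Bypass our own __setattr__ while setting up the internals.
        object.__setattr__(self, "_shared", dict(state))
        object.__setattr__(self, "_local", threading.local())

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for state attributes.
        local = getattr(self._local, "state", None)
        state = self._shared if local is None else local
        try:
            return state[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        local = getattr(self._local, "state", None)
        if local is None:
            raise TypeError("Can't set attribute on shared object.")
        local[name] = value


def checkout(node):
    """Give the calling thread a private copy of the state; returns nothing."""
    node._local.state = dict(node._shared)


node = Node(name="John")
checkout(node)
node.name = "James"  # visible to this thread only

seen = []
worker = threading.Thread(target=lambda: seen.append(node.name))
worker.start()
worker.join()
```

The object identity is the same in both threads; only the state that attribute access resolves to differs, which is why a checkout function of this kind has nothing to return.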

After checking out the object, we can both read and write attributes.

>>> obj.name = 'James'

When an object is first checked out by some thread, a counter is set to keep track of how many threads have checked out the object. When it falls to zero (always on a transaction boundary), it’s retracted to the previous shared state.

>>> transaction.commit()

This increases the transaction count by one.

>>> db.tx_count
2

Concurrency

The object manager (which implements the low-level functionality) is inherently thread-safe; it uses the MVCC concurrency model.

It’s up to the database which sits on top of the object manager to support concurrency between external processes sharing the same database (the included database implementation uses a file-locking scheme to extend the MVCC concurrency model to external processes; no configuration is required).
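As an illustration of how file locking can serialize writers from different processes, the following sketch appends to a file while holding an exclusive advisory lock. It uses the standard fcntl module (POSIX only); the append_exclusively helper is a hypothetical name and this is not necessarily the locking scheme Dobbin itself uses.

```python
import fcntl
import os
import tempfile


def append_exclusively(path, payload):
    """Append bytes to a file while holding an exclusive advisory lock."""
    with open(path, "ab") as stream:
        fcntl.lockf(stream, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            stream.write(payload)
            stream.flush()
            os.fsync(stream.fileno())  # push the data to disk before unlocking
        finally:
            fcntl.lockf(stream, fcntl.LOCK_UN)


fd, path = tempfile.mkstemp()
os.close(fd)
append_exclusively(path, b"tx1;")
append_exclusively(path, b"tx2;")
with open(path, "rb") as stream:
    contents = stream.read()
os.remove(path)
```

Any other process calling the same helper on the same path would wait at the lockf call until the current writer has finished, so appended transactions never interleave.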

We can demonstrate concurrency between two separate processes by running a second database instance in the same thread.

>>> new_db = Database(database_path)
>>> new_obj = new_db.root

Objects from this database are disjoint from those of the first database.

>>> new_obj is obj
False

The new database instance has already read the previously committed transactions and applied them to its object graph.

>>> new_obj.name
'James'

Let’s examine this further. If we check out a persistent object from the first database instance and commit the changes, that same object from the second database will be updated as soon as we begin a new transaction.

>>> checkout(obj)
>>> obj.name = 'Jane'
>>> transaction.commit()

The database has registered the transaction; the new instance hasn’t.

>>> db.tx_count - new_db.tx_count
1

The object graphs are not synchronized.

>>> new_obj.name
'James'

Applications must begin a new transaction to stay in sync.

>>> tx = transaction.begin()
>>> new_obj.name
'Jane'
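The catch-up behaviour shown above can be modelled with a toy log that connections replay only when a transaction begins. This is a simplified sketch under that assumption, not Dobbin's code; the SharedLog and Connection classes are hypothetical.

```python
class SharedLog:
    """Stands in for the database file: an append-only list of changesets."""

    def __init__(self):
        self.entries = []


class Connection:
    """Applies committed changesets to a private view, only on begin()."""

    def __init__(self, log):
        self.log = log
        self.state = {}
        self.tx_count = 0
        self.begin()

    def begin(self):
        # Catch up with every transaction committed since the last sync.
        for changes in self.log.entries[self.tx_count:]:
            self.state.update(changes)
        self.tx_count = len(self.log.entries)

    def commit(self, changes):
        self.begin()  # a writer syncs before appending
        self.log.entries.append(dict(changes))
        self.begin()  # then applies its own transaction


log = SharedLog()
first = Connection(log)
first.commit({"name": "James"})
second = Connection(log)  # reads everything committed so far

first.commit({"name": "Jane"})
lag = first.tx_count - second.tx_count  # second has not caught up yet
stale = second.state["name"]
second.begin()
fresh = second.state["name"]
```

Between the commit and the call to begin, the second connection keeps a consistent but older view, which is the isolation property the narrative demonstrates.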

Conflicts

When concurrent transactions attempt to modify the same objects, we get a write conflict in all but one (first to get the commit-lock wins the transaction).

Objects can provide conflict resolution capabilities such that two concurrent transactions may update the same object.

As an example, let’s create a counter object; it could represent a counter which keeps track of visitors on a website. To provide conflict resolution for instances of this class, we implement a _p_resolve_conflict method.

>>> class Counter(Persistent):
...     def __init__(self):
...         self.count = 0
...
...     def hit(self):
...         self.count += 1
...
...     @staticmethod
...     def _p_resolve_conflict(old_state, saved_state, new_state):
...         saved_diff = saved_state['count'] - old_state['count']
...         new_diff = new_state['count'] - old_state['count']
...         return {'count': old_state['count'] + saved_diff + new_diff}
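The arithmetic of this resolver can be tried in isolation with plain dictionaries. The sketch below assumes that old_state is the state both transactions started from, saved_state is what the winning transaction committed, and new_state is what the conflicting transaction wants to commit; resolve_count is a hypothetical stand-alone copy of the method body.

```python
def resolve_count(old_state, saved_state, new_state):
    """Three-way merge: apply both increments to the common ancestor."""
    saved_diff = saved_state["count"] - old_state["count"]
    new_diff = new_state["count"] - old_state["count"]
    return {"count": old_state["count"] + saved_diff + new_diff}


old = {"count": 10}    # state both transactions started from
saved = {"count": 11}  # one hit, already committed
new = {"count": 13}    # three hits, about to be committed
merged = resolve_count(old, saved, new)
print(merged)  # {'count': 14}
```

Neither increment is lost: the merged value is the ancestor plus both differences, which only works because counter updates commute.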

As a doctest technicality, we set the class on the builtins-module (there’s a difference here between Python 2.x and 3.x series, which explains the fallback import location).

>>> try:
...     import __builtin__ as builtins
... except ImportError:
...     import builtins
>>> builtins.Counter = Counter

Next we instantiate a counter instance, then add it to the object graph.

>>> counter = Counter()
>>> checkout(obj)
>>> obj.counter = counter
>>> transaction.commit()

To demonstrate the conflict resolution functionality of this class, we update the counter in two concurrent transactions. We will attempt one of the transactions in a separate thread.

>>> from threading import Semaphore
>>> flag = Semaphore()
>>> flag.acquire()
True
>>> def run():
...     counter = db.root.counter
...     assert counter is not None
...     checkout(counter)
...     counter.hit()
...     flag.acquire()
...     try: transaction.commit()
...     finally: flag.release()
>>> from threading import Thread
>>> thread = Thread(target=run)
>>> thread.start()

In the main thread we check out the same object and assign a different attribute value.

>>> checkout(counter)
>>> counter.count
0
>>> counter.hit()

Releasing the semaphore, the thread will commit the transaction.

>>> flag.release()
>>> thread.join()

As we commit the transaction running in the main thread, we expect the counter to have been increased twice.

>>> transaction.commit()
>>> counter.count
2

More objects

Persistent objects must be connected to the object graph before they can be persisted in the database. If we check out a persistent object and commit the transaction without adding it to the object graph, an exception is raised.

>>> another = Persistent()
>>> from dobbin.exc import ObjectGraphError
>>> try:
...     transaction.commit()
... except ObjectGraphError as exc:
...     print(str(exc))
<dobbin.persistent.LocalPersistent object at ...> not connected to graph.

We abort the transaction and try again, this time connecting the object using an attribute reference.

>>> transaction.abort()
>>> checkout(another)
>>> another.name = 'Karla'
>>> checkout(obj)
>>> obj.another = another

We commit the transaction and observe that the object count has grown. The new object has been assigned an oid as well (these are not in general predictable; they are assigned by the database on commit).

>>> transaction.commit()
>>> len(db)
3
>>> another._p_oid is not None
True

If we begin a new transaction, the new object will propagate to the second database instance.

>>> tx = transaction.begin()
>>> new_obj.another.name
'Karla'

As we check out the object that carries the reference and access any attribute, a deep-copy of the shared state is made behind the scenes. Persistent objects are never copied, however, which a simple identity check will confirm.

>>> checkout(obj)
>>> obj.another is another
True

Circular references are permitted.

>>> checkout(another)
>>> another.another = obj
>>> transaction.commit()

Again, we can verify the identity.

>>> another.another is obj
True

Storing files

We can persist open files (or any stream object) by enclosing them in a persistent file wrapper. The wrapper is immutable; it’s for single use only.

>>> from tempfile import TemporaryFile
>>> file = TemporaryFile()
>>> length = file.write(b'abc')
>>> pos = file.seek(0)

Note that the file is read from the current position and until the end of the file.

>>> from dobbin.persistent import PersistentFile
>>> pfile = PersistentFile(file)

Let’s store this persistent file as an attribute on our object.

>>> checkout(obj)
>>> obj.file = pfile
>>> transaction.commit()

Note that the persistent file has been given a new class. It’s the same object (in terms of object identity), but since it’s now stored in the database and is only available as a file stream, we call it a persistent stream.

>>> obj.file
<dobbin.database.PersistentStream object at ...>

We must manually close the file we provided to the persistent wrapper (or let it fall out of scope).

>>> file.close()
>>> pfile.closed
True

Using persistent streams

There are two ways to use a persistent stream: by iterating through it, in which case it automatically gets a file handle (implicitly closed when the iterator is garbage-collected), or through a file-like API.

We use the open method to open the stream; this is always required when using the stream as a file.

>>> obj.file.open()
>>> print(obj.file.read().decode('ascii'))
abc

The seek and tell methods work as expected.

>>> int(obj.file.tell())
3

We can seek to the beginning and repeat the exercise.

>>> obj.file.seek(0)
>>> print(obj.file.read().decode('ascii'))
abc

As with any file, we have to close it after use.

>>> obj.file.close()

In addition we can use iteration to read the file; in this case, we needn’t bother opening or closing the file. This is automatically done for us. Note that this makes persistent streams suitable as return values for WSGI applications.

>>> print("".join(thunk.decode('ascii') for thunk in obj.file))
abc
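To see why an iterable stream fits the WSGI model, consider the following sketch. It does not use Dobbin: an io.BytesIO object stands in for the persistent stream, and iter_chunks and application are hypothetical names.

```python
import io


def iter_chunks(stream, chunk_size=2):
    """Yield successive byte chunks and close the stream when done."""
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        stream.close()


def application(environ, start_response):
    """Minimal WSGI application that streams its body chunk by chunk."""
    body = io.BytesIO(b"abc")  # stands in for a persistent stream
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return iter_chunks(body)


statuses = []
result = application({}, lambda status, headers: statuses.append(status))
chunks = list(result)
```

A WSGI server simply iterates over the return value, so an object that opens and closes its own file handle during iteration needs no extra wrapping.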

Iteration is strictly independent from the other methods. We can observe that the file remains closed.

>>> obj.file.closed
True

We start a new transaction (to prompt database catch-up) and confirm that the file is available from the second database.

>>> tx = transaction.begin()
>>> print("".join(thunk.decode('ascii') for thunk in new_obj.file))
abc

Persistent dictionary

It’s not advisable in general to use the built-in dict type to store records in the database, in particular not if you expect frequent minor changes. Instead the PersistentDict class should be used (directly, or subclassed).

It operates as a normal Python dictionary and provides the same methods.

>>> from dobbin.persistent import PersistentDict
>>> pdict = PersistentDict()

We check out the root object and connect the dictionary to the object graph.

>>> checkout(obj)
>>> obj.pdict = pdict

You can store any key/value combination that works with standard dictionaries.

>>> pdict['obj'] = obj
>>> pdict['obj'] is obj
True

The PersistentDict stores attributes, too. Note that attributes and dictionary entries are independent from each other.

>>> pdict.name = 'Bob'
>>> pdict.name
'Bob'

Committing the changes.

>>> transaction.commit()
>>> pdict['obj'] is obj
True
>>> pdict.name
'Bob'

Snapshots

We can use the snapshot method to merge all database transactions until a given timestamp and write the snapshot as a single transaction to a new database.

>>> tmp_path = "%s.tmp" % database_path
>>> tmp_db = Database(tmp_path)

To include all transactions (i.e. the current state), we just pass the target database.

>>> db.snapshot(tmp_db)

The snapshot contains four objects.

>>> len(tmp_db)
4

They were persisted in a single transaction.

>>> tmp_db.tx_count
1

We can confirm that the state indeed matches that of the current database.

>>> tmp_obj = tmp_db.root

The object graph is equal to that of the original database.

>>> tmp_obj.name
'Jane'
>>> tmp_obj.another.name
'Karla'
>>> tmp_obj.pdict['obj'] is tmp_obj
True
>>> tmp_obj.pdict.name
'Bob'

Binary streams are included in the snapshot, too.

>>> print("".join(thunk.decode('ascii') for thunk in tmp_obj.file))
abc
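Conceptually, merging transactions up to a timestamp amounts to folding changesets into a single state. The sketch below illustrates that idea with plain dictionaries; merge_until is a hypothetical helper, the history is assumed to be ordered by timestamp, and none of this reflects how Dobbin stores or merges data internally.

```python
def merge_until(transactions, timestamp=None):
    """Fold (timestamp, changes) pairs into one state, stopping after timestamp."""
    state = {}
    for committed_at, changes in transactions:
        if timestamp is not None and committed_at > timestamp:
            break  # later transactions are left out of the snapshot
        state.update(changes)
    return state


history = [
    (1, {"name": "John"}),
    (2, {"name": "James"}),
    (3, {"name": "Jane", "another": "Karla"}),
]
current = merge_until(history)               # all transactions
earlier = merge_until(history, timestamp=2)  # state as of timestamp 2
```

The merged state can then be written out as one transaction, which is why the snapshot database above reports a transaction count of one.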

Cleanup

>>> transaction.commit()
>>> db.close()
>>> new_db.close()
>>> tmp_db.close()

This concludes the narrative.

Changes

0.3 (2012-02-02)

  • Add support for Python 3.

  • Use C-optimized pickle module when available.

0.2 (2009-10-22)

  • Subclasses may now override existing methods (e.g. __setattr__) and use super to get at the overridden method.

  • Transactions now see data in isolation.

  • When a persistent object is first created, its state is immediately local. This allows an __init__ method to initialize the object.

  • Added method to create a snapshot in time of an existing database.

  • Added PersistentDict class.

  • The Persistent class is now persisted as changesets rather than complete object state.

  • Set up tests to run using the nose testrunner (or using setuptools).

0.1 (2009-09-26)

  • Initial public release.
