Converts Django ORM objects into data documents, and keeps them in sync

What is it?

Django-denormalize converts a tree of Django ORM objects into a single data document. By ‘data document’ we mean a structure of dicts, lists and other primitive types that can be serialized to JSON or a Python pickle.
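
As a rough illustration of what such a data document looks like, the sketch below builds a small nested structure by hand and round-trips it through JSON and pickle using only the standard library. The field names and values are made up, and django-denormalize itself is not involved.

```python
import json
import pickle

# A hand-written document: only dicts, lists and primitive values.
document = {
    "id": 1,
    "title": "Example title",
    "tags": ["a", "b"],
    "publisher": {"id": 7, "name": "Example publisher"},
}

# Both serializers can round-trip such a structure without custom hooks.
from_json = json.loads(json.dumps(document))
from_pickle = pickle.loads(pickle.dumps(document))
```

Anything that is not a dict, list or primitive (a model instance, a set, a datetime) would need converting before the JSON step succeeds.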

The resulting document can be combined with the Django cache layer to build very fast views that do not hit the database. The data can also be synced to a NoSQL store such as MongoDB, for consumption by other frameworks such as Meteor (Node.js-based).

If any data changes in the ORM (even on some deep many-to-many relationship far away from the root object), django-denormalize will automatically trigger a cache invalidation of the root object’s document and/or sync the new document to your preferred NoSQL store.

This module also includes special support for content in FeinCMS objects: all regions and content types will be available under a ‘content’ dictionary.

Example

For example, suppose you have the following models:

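from django.db import models
from django.utils.translation import ugettext_lazy as _

# Author is defined first so that Book can reference it directly.
class Author(models.Model):
    name = models.CharField(_("name"), max_length=80)
    ...

class Book(models.Model):
    title = models.CharField(_("title"), max_length=80)
    year = models.PositiveIntegerField(_("year"), null=True)
    authors = models.ManyToManyField(Author)
    ...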

You can write the following class to describe your document collection:

from denormalize.models import DocumentCollection

class BookCollection(DocumentCollection):
    model = Book
    name = "books"
    prefetch_related = ['authors']

Let’s print all documents:

books = BookCollection()
for doc in books.dump_collection():
    print(doc)

Each document will have the following structure:

{
    'id': 42,
    'title': u'Cooking for Geeks',
    'year': 2010,
    'authors': [
        {
            'id': 18,
            'name': u'Jeff Potter',
            ...
        }
    ],
    ...
}
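
To make the shape of that structure concrete, here is a minimal sketch that produces it from plain Python objects. The `Author` and `Book` classes below are hypothetical stand-ins for the ORM models, and `author_to_doc` and `book_to_doc` are made-up helpers. This is not how django-denormalize generates documents internally. It only shows how a to-many relation ends up as a list of nested dicts.

```python
class Author(object):
    """Plain stand-in for an ORM model (hypothetical, no Django needed)."""
    def __init__(self, id, name):
        self.id = id
        self.name = name


class Book(object):
    """Plain stand-in for an ORM model with a to-many relation."""
    def __init__(self, id, title, year, authors):
        self.id = id
        self.title = title
        self.year = year
        self.authors = authors


def author_to_doc(author):
    return {"id": author.id, "name": author.name}


def book_to_doc(book):
    # Related objects become a list of nested dicts under the relation name.
    return {
        "id": book.id,
        "title": book.title,
        "year": book.year,
        "authors": [author_to_doc(a) for a in book.authors],
    }


doc = book_to_doc(Book(42, "Cooking for Geeks", 2010, [Author(18, "Jeff Potter")]))
```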

This in itself can be useful, but the real power of django-denormalize lies in its backends. Suppose we want to cache these documents to avoid hitting the database. Our views can then use the documents instead of accessing the Django ORM. The backend and view code look like this:

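# In models.py

from denormalize.backends.cache import CacheBackend

# The collection instance from the previous example
books = BookCollection()

backend = CacheBackend()
backend.register(books)

# In views.py (assumed here to live in the same app as models.py)

from django.http import Http404
from django.shortcuts import render

from .models import backend, books

def our_book_view(request, book_id):
    # Served from the cache when possible, generated from the ORM otherwise
    book_doc = backend.get_doc(books, book_id)
    if not book_doc:
        raise Http404("Book not found")
    return render(request, 'book.html', {'book': book_doc})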

Our CacheBackend will try to fetch the book document from the Django cache. If it cannot be found, it will generate the document from the ORM and then store it in the cache.

And best of all: if any data on the Author or Book objects for this book changes, the cache is invalidated automatically. The book_doc we retrieve will always be up to date.
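
The fetch-or-generate behaviour and the invalidation on related changes can be sketched in plain Python. Everything below is hypothetical: `ToyDocumentCache`, the `book_rows` and `author_rows` dicts and the reverse lookup are stand-ins made up for this illustration, not the CacheBackend's implementation, which relies on the Django cache and on monitoring ORM writes.

```python
class ToyDocumentCache(object):
    """Hypothetical read-through cache; not django-denormalize's CacheBackend."""

    def __init__(self, build_doc):
        self._build_doc = build_doc   # callable: root id -> document or None
        self._docs = {}               # root id -> cached document
        self.builds = 0               # how often a document was regenerated

    def get_doc(self, root_id):
        # Serve from the cache when possible, otherwise build and store.
        if root_id not in self._docs:
            doc = self._build_doc(root_id)
            if doc is None:
                return None
            self.builds += 1
            self._docs[root_id] = doc
        return self._docs[root_id]

    def invalidate(self, root_ids):
        # Drop cached documents whose underlying data changed.
        for root_id in root_ids:
            self._docs.pop(root_id, None)


# Made-up data: two books sharing one author.
author_rows = {18: "Jeff Potter"}
book_rows = {42: ("Cooking for Geeks", [18]), 43: ("Another title", [18])}


def build_book_doc(book_id):
    if book_id not in book_rows:
        return None
    title, author_ids = book_rows[book_id]
    return {
        "id": book_id,
        "title": title,
        "authors": [{"id": a, "name": author_rows[a]} for a in author_ids],
    }


def books_affected_by_author(author_id):
    # Reverse lookup: which root documents embed this author?
    return [b for b, (_, author_ids) in book_rows.items() if author_id in author_ids]


cache = ToyDocumentCache(build_book_doc)
first = cache.get_doc(42)
again = cache.get_doc(42)            # served from the cache, no rebuild

author_rows[18] = "J. Potter"        # a change on a related object...
cache.invalidate(books_affected_by_author(18))  # ...invalidates the roots
fresh = cache.get_doc(42)            # rebuilt with the new author name
```

The key design point is the reverse lookup: a write to a related object has to be mapped back to every root document that embeds it, which is also why monitored writes cost extra queries.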

How does this compare with simply using the Django page cache?

The traditional approach to Django scalability is using the page cache to cache the entire page rendered by the view. This works quite well, but it has two big disadvantages:

  • The cache will not automatically be invalidated as soon as the underlying data changes. If you set the page cache time to 60 seconds, it will take up to 60 seconds for a change to be visible on the site.

  • This approach does not work well for websites where users can log in and see customized content.

In simpler cases, these problems can be worked around by using template fragment caching, as this allows you to cache common regions, and specify which variables should be incorporated into the cache key. But even in our simple Book example, it’s not easy to invalidate the cache on changes to Author.

The disadvantages of the django-denormalize approach are:

  • You no longer have access to the Django models and their methods in your templates. You are dealing with the raw data. Of course, you can add any extra information you might need in the template by extending the DocumentCollection, or by creating custom template filters to calculate some value.

  • Writes by the ORM to models that are included in documents are slower, because they are monitored for changes.

MongoDB backend

The MongoDB backend works much like the CacheBackend:

# In models.py

from denormalize.backends.mongodb import MongoBackend

backend = MongoBackend(
    name='mongo',
    db_name='test_denormalize',
    connection_uri='mongodb://localhost')
backend.register(books)

Because the data is persistent and accessed directly through the MongoDB API, you need to take care to keep it in sync. You can trigger a full one-way sync using the following management command (TODO: not yet implemented for the MongoBackend, only for LocMemBackend. Coming soon!):

$ ./manage.py denormalize_sync mongo books

Whenever you update the data through the ORM, the corresponding document will be updated automatically. The backend preserves any extra keys you may have set on the document root in MongoDB. Make sure, however, not to add or change keys on subdocuments created by the driver, because they will be overwritten. In the book example above, it is safe to set doc['foo'], but not safe to set doc['authors'][0]['foo'].
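
The difference between root keys and subdocument keys can be illustrated with a toy merge. `merge_root` is a hypothetical helper written for this sketch. The MongoDB backend's real update logic may differ, but the observable effect described above is the same: extra root keys survive, while generated subdocuments are replaced wholesale.

```python
def merge_root(stored, generated):
    """Toy merge: keep extra root keys, replace everything that is generated.

    Hypothetical helper for illustration; the MongoDB backend's real update
    logic may differ.
    """
    merged = dict(stored)      # start from what is already in the store
    merged.update(generated)   # generated keys win, including whole subdocuments
    return merged


stored = {
    "id": 42,
    "title": "Old title",
    "foo": "kept",                                          # extra root key
    "authors": [{"id": 18, "name": "Old", "foo": "lost"}],  # extra nested key
}
generated = {
    "id": 42,
    "title": "Cooking for Geeks",
    "authors": [{"id": 18, "name": "Jeff Potter"}],
}

result = merge_root(stored, generated)
```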

You should run full syncs in a cronjob, though, to prevent your data from going out of sync over time due to network outages and changes that bypass the ORM (see ‘bugs and limitations’ below).

Creating aggregate collections

Occasionally you may want to aggregate data from more than one object of the root model. The key differences here are:

  • The output documents do not have a 1:1 relation with the input documents.

  • Any change on any root object should trigger an update.

Use cases:

  • Creating one document with a tree structure of pages or categories to generate a menu.

  • Calculating statistics about data stored in an entire table.

  • Generating an index document, mapping one field to the ids of the documents where the field has a certain value.

AggregateCollection makes this really easy. The following collection will create an index by tag:

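# Assumption: AggregateCollection is importable from the same module as
# DocumentCollection; adjust the import if it lives elsewhere.
from denormalize.models import AggregateCollection

class BookTagIndexCollection(AggregateCollection):
    model = Book
    name = 'book_tags'
    prefetch_related = ['tags']

    def aggregate(self, key):
        assert key == 'default'
        # Map each tag name to the set of ids of the books carrying that tag.
        index = {}
        for book in self.queryset().all():
            for tag in book.tags.all():
                index.setdefault(tag.name, set()).add(book.id)
        return index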
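
The index-building loop can be tried without Django. The sketch below runs the same setdefault pattern over made-up (book id, tag names) pairs. `build_tag_index` is a hypothetical helper, not part of the library. It also converts the sets to sorted lists, since `json.dumps` cannot encode sets. Whether the library performs such a conversion for you is not stated here, so treat that step as an assumption.

```python
import json

# Made-up input: (book id, tag names) pairs standing in for Book objects.
books_with_tags = [
    (1, ["python", "cooking"]),
    (2, ["python"]),
    (3, []),
]


def build_tag_index(rows):
    index = {}
    for book_id, tag_names in rows:
        for tag_name in tag_names:
            index.setdefault(tag_name, set()).add(book_id)
    # json.dumps cannot encode sets, so emit sorted lists instead.
    return dict((tag, sorted(ids)) for tag, ids in index.items())


tag_index = build_tag_index(books_with_tags)
encoded = json.dumps(tag_index, sort_keys=True)
```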

FeinCMS support

Django-denormalize has experimental special support for FeinCMS. If you use the special FeinCMSCollection, the content attribute will be set to a dict with all regions represented as lists. All content types are included by default. If you want to follow relations on content types, you need to explicitly define all relations to follow. This will become easier in the future.

Performance optimization

TODO: explain how to prevent spurious updates using denormalize.context.

Disadvantages, bugs and implementation notes

Bugs and limitations:

  • Django-denormalize has not yet been extensively tested in real-world applications. Expect bugs. And since it’s an early beta release, there is no guarantee that the API will not change without warning in the near future.

  • Using django-denormalize on models that receive a lot of writes might significantly slow down your application, as every write will trigger database queries to determine the affected documents, and regeneration of the documents that have changed. Keep your view counters and last login timestamps out of the models included in documents! (You might want to move these to a NoSQL store anyway.)

  • If you bypass the ORM (raw queries, manage.py dbshell, other applications, etc.), django-denormalize cannot detect the changes made to the models. After performing a large batch operation, flush the Django cache, or run a full sync (denormalize_sync management command) to update your NoSQL backend, depending on how you use django-denormalize.

  • If you sync to a NoSQL store and the NoSQL database is not available, you will lose the update, because it is currently not rescheduled (TODO: implement a transaction log to keep track of changes and whether they have been properly synced or not). You should run a regular full sync in a cronjob.

  • Syncing happens only one way. If you want to change data, you need to perform the modification on the ORM side, not on the NoSQL side. We do try hard not to overwrite any extra attributes you added in the NoSQL backends.

  • A full sync currently does not delete stale objects (TODO).

  • Keep the storage limitations of your backends in mind. Memcached can only store objects of up to 1MB, MongoDB has a limit of 16MB. Make sure your documents will not exceed these limits.
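
A rough way to guard against the size limits mentioned in the last item is to measure a document before handing it to a backend. The sketch below is an assumption-laden estimate: `approx_size` and `fits` are hypothetical helpers, and the size of the JSON encoding is only a proxy, because the cache stores pickles and MongoDB stores BSON. The arbitrary `headroom` factor is there to absorb that difference.

```python
import json

MEMCACHED_LIMIT = 1024 * 1024        # 1 MB, per the note above
MONGODB_LIMIT = 16 * 1024 * 1024     # 16 MB, per the note above


def approx_size(document):
    """Rough size estimate: bytes of the UTF-8 encoded JSON form.

    Pickle (cache) and BSON (MongoDB) encodings differ, so leave headroom.
    """
    return len(json.dumps(document).encode("utf-8"))


def fits(document, limit, headroom=0.8):
    # headroom is an arbitrary safety factor chosen for this sketch.
    return approx_size(document) <= limit * headroom


small = {"id": 1, "title": "x"}
large = {"id": 2, "body": "x" * (2 * 1024 * 1024)}
```

With these sample documents, `fits(small, MEMCACHED_LIMIT)` holds, while `large` is too big for Memcached but still within the MongoDB limit.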

Types of projects that would benefit most from django-denormalize:

  • Writes are rare and mostly occur due to content updates in the Django admin, like in CMS systems.

  • There are a lot more reads than writes, and you want to speed up the read views, while keeping the front-end personalized and responsive to data changes.

  • You want to use Meteor to build the front-end side of your application, but do not feel like implementing a CMS in Meteor. Django-denormalize allows you to build the CMS backend using the Django admin and FeinCMS. This was the original reason to start this project, so expect more updates to support this!

  • You want to use MongoDB to access/query your data, but prefer to keep your primary data in a traditional, proven, relational database system you have 10 years of experience with, because it makes you or your DBA sleep better.

Alternatives

Django-nonrel allows you to use the Django ORM to directly access a NoSQL database, but with limitations. If you do a lot of writes from your front-end views, or want to prevent data duplication, this might be a better solution.

PS: Need another backend? Writing one is quite simple! You only need to subclass a base class and implement a few methods.
