Manage Elasticsearch Indexes in Django
Project description
django-elastic-migrations is a Django app for creating, indexing and changing schemas of Elasticsearch indexes.
Overview
Elastic has given us basic python tools for working with its search indexes:
elasticsearch-py, a python interface to elasticsearch’s REST API
elasticsearch-dsl-py, a Django-esque way of declaring Elasticsearch schemas, built upon elasticsearch-py
Django Elastic Migrations adapts these tools into a Django app which also:
Provides Django management commands for listing indexes, as well as performing create, update, activate and drop actions on them
Implements concurrent bulk indexing powered by python multiprocessing
Gives Django test hooks for Elasticsearch
Records a history of all actions that change Elasticsearch indexes
Supports AWS Elasticsearch 6.0, 6.1 (6.2 TBD; see #3 support elasticsearch-dsl 6.2)
Enables having two or more servers share the same Elasticsearch cluster
Models
Django Elastic Migrations provides comes with three Django models: Index, IndexVersion, and IndexAction:
Index - a logical reference to an Elasticsearch index. Each Index points to multiple IndexVersions, each of which contains a snapshot of that Index schema at a particular time. Each Index has an active IndexVersion to which all actions are directed.
IndexVersion - a snapshot of an Elasticsearch Index schema at a particular point in time. The Elasticsearch index name is the name of the Index plus the primary key id of the IndexVersion model, e.g. movies-1. When the schema is changed, a new IndexVersion is added with name movies-2, etc.
IndexAction - a record of a change that impacts an Index, such as updating the index or changing which IndexVersion is active in an Index.
Management Commands
Use ./manage.py es --help to see the list of all of these commands.
Read Only Commands
./manage.py es_list
help: For each Index, list activation status and doc count for each of its IndexVersions
usage: ./manage.py es_list
Action Commands
These management commands add an Action record in the database, so that the history of each Index is recorded.
./manage.py es_create - create a new index.
./manage.py es_activate - activate a new IndexVersion. all updates and reads for that Index by will then go to that version.
./manage.py es_update - update the documents in the index.
./manage.py es_clear - remove the documents from an index.
./manage.py es_drop - drop an index.
./manage.py es_dangerous_reset - erase elasticsearch and reset the Django Elastic Migrations models.
For each of these, use --help to see the details.
Usage
Installation
pip install django-elastic-migrations; see django-elastic-migrations on PyPI
Put a reference to this package in your requirements.txt
Ensure that a valid elasticsearch-dsl-py version is accessible, and configure the path to your configured Elasticsearch singleton client in your django settings: DJANGO_ELASTIC_MIGRATIONS_ES_CLIENT = "tests.es_config.ES_CLIENT". There should only be one ES_CLIENT instantiated in your application.
Add django_elastic_migrations to INSTALLED_APPS in your Django settings file
Add the following information to your Django settings file:
DJANGO_ELASTIC_MIGRATIONS_ES_CLIENT = "path.to.your.singleton.ES_CLIENT" # optional, any unique number for your releases to associate with indexes DJANGO_ELASTIC_MIGRATIONS_GET_CODEBASE_ID = subprocess.check_output(['git', 'describe', "--tags"]).strip() # optional, can be used to have multiple servers share the same # elasticsearch instance without conflicting DJANGO_ELASTIC_MIGRATIONS_ENVIRONMENT_PREFIX = "qa1_"
Create the django_elastic_migrations tables by running ./manage.py migrate
Create an DEMIndex:
from django_elastic_migrations.indexes import DEMIndex, DEMDocType from .models import Movie from elasticsearch_dsl import Text MoviesIndex = DEMIndex('movies') @MoviesIndex.doc_type class MovieSearchDoc(DEMDocType): text = TEXT_COMPLEX_ENGLISH_NGRAM_METAPHONE @classmethod def get_queryset(self, last_updated_datetime=None): """ return a queryset or a sliceable list of items to pass to get_reindex_iterator """ qs = Movie.objects.all() if last_updated_datetime: qs.filter(last_modified__gt=last_updated_datetime) return qs @classmethod def get_reindex_iterator(self, queryset): return [ MovieSearchDoc( text="a little sample text").to_dict( include_meta=True) for g in queryset]
Add your new index to DJANGO_ELASTIC_MIGRATIONS_INDEXES in settings/common.py
Run ./manage.py es_list to see the index as available:
./manage.py es_list Available Index Definitions: +----------------------+-------------------------------------+---------+--------+-------+-----------+ | Index Base Name | Index Version Name | Created | Active | Docs | Tag | +======================+=====================================+=========+========+=======+===========+ | movies | | 0 | 0 | 0 | Current | | | | | | | (not | | | | | | | created) | +----------------------+-------------------------------------+---------+--------+-------+-----------+ Reminder: an index version name looks like 'my_index-4', and its base index name looks like 'my_index'. Most Django Elastic Migrations management commands take the base name (in which case the activated version is used) or the specific index version name.
Create the movies index in elasticsearch with ./manage.py es_create movies:
$> ./manage.py es_create movies The doc type for index 'movies' changed; created a new index version 'movies-1' in elasticsearch. $> ./manage.py es_list Available Index Definitions: +----------------------+-------------------------------------+---------+--------+-------+-----------+ | Index Base Name | Index Version Name | Created | Active | Docs | Tag | +======================+=====================================+=========+========+=======+===========+ | movies | movies-1 | 1 | 0 | 0 | 07.11.005 | | | | | | | -93-gd101 | | | | | | | a1f | +----------------------+-------------------------------------+---------+--------+-------+-----------+ Reminder: an index version name looks like 'my_index-4', and its base index name looks like 'my_index'. Most Django Elastic Migrations management commands take the base name (in which case the activated version is used) or the specific index version name.
Activate the movies-1 index version, so all updates and reads go to it.
./manage.py es_activate movies For index 'movies', activating 'movies-1' because you said so.
Assuming you have implemented get_reindex_iterator, you can call ./manage.py es_update to update the index.
$> ./manage.py es_update movies Handling update of index 'movies' using its active index version 'movies-1' Checking the last time update was called: - index version: movies-1 - update date: never Getting Reindex Iterator... Completed with indexing movies-1 $> ./manage.py es_list Available Index Definitions: +----------------------+-------------------------------------+---------+--------+-------+-----------+ | Index Base Name | Index Version Name | Created | Active | Docs | Tag | +======================+=====================================+=========+========+=======+===========+ | movies | movies-1 | 1 | 1 | 3 | 07.11.005 | | | | | | | -93-gd101 | | | | | | | a1f | +----------------------+-------------------------------------+---------+--------+-------+-----------+
Deployment
Creating and updating a new index schema can happen before you deploy. For example, if your app servers are running with the movies-1 index activated, and you have a new version of the schema you’d like to pre-index, then log into another server and run ./manage.py es_create movies followed by ./manage.py es_update movies --newer. This will update documents in all movies indexes that are newer than the active one.
After deploying, you can run ./manage.py es_activate movies to activate the latest version. Be sure to cycle your gunicorn workers to ensure the change is caught by your app servers.
During deployment, if get_reindex_iterator is implemented in such a way as to respond to the datetime of the last reindex date, then you can call ./manage.py es_update movies --resume, and it will index only those documents that have changed since the last reindexing. This way you can do most of the indexing ahead of time, and only reindex a portion at the time of the deployment.
Django Testing
(optional) update DJANGO_ELASTIC_MIGRATIONS_ENVIRONMENT_PREFIX in your Django settings. The default test prefix is test_. Every test will create its own indexes.
if 'test' in sys.argv: DJANGO_ELASTIC_MIGRATIONS_ENVIRONMENT_PREFIX = 'test_'
Override TestCase - test_utilities.py
from django_elastic_migrations import DEMIndexManager class MyTestCase(TestCase): def _pre_setup(self): DEMIndexManager.test_pre_setup() super(MyTestCase, self)._pre_setup() def _post_teardown(self): DEMIndexManager.test_post_teardown() super(MyTestCase, self)._post_teardown()
Excluding from Django’s dumpdata command
When calling django’s dumpdata command, you likely will want to exclude the database tables used in this app:
from django.core.management import call_command params = { 'database': 'default', 'exclude': [ # we don't want to include django_elastic_migrations in dumpdata, # because it's environment specific 'django_elastic_migrations.index', 'django_elastic_migrations.indexversion', 'django_elastic_migrations.indexaction' ], 'indent': 3, 'output': 'path/to/my/file.json' } call_command('dumpdata', **params)
An example of this is included with the moviegen management command.
Tuning Bulk Indexing Parameters
By default, /.manage.py es_update will divide the result of DEMDocType.get_queryset() into batches of size DocType.BATCH_SIZE. Override this number to change the batch size.
There are many configurable paramters to Elasticsearch’s bulk updater. To provide a custom value, override DEMDocType.get_bulk_indexing_kwargs() and return the kwargs you would like to customize.
Development
This project uses make to manage the build process. Type make help to see the available make targets.
Elasticsearch Docker Compose
docker-compose -f local.yml up
Requirements
This project uses pip-tools. The requirements.txt files are generated and pinned to latest versions with make upgrade:
run make requirements to run the pip install.
run make upgrade to upgrade the dependencies of the requirements to the latest versions. This process also excludes django and elasticsearch-dsl from the requirements/test.txt so they can be injected with different versions by tox during matrix testing.
Populating Local tests_movies Database Table With Data
It may be helpful for you to populate a local database with Movies test data to experiment with using django-elastic-migrations. First, migrate the database:
./manage.py migrate --run-syncdb --settings=test_settings
Next, load the basic fixtures:
./manage.py loaddata tests/100films.json
You may wish to add more movies to the database. A management command has been created for this purpose. Get a Free OMDB API key here, then run a query like this (replace MYAPIKEY with yours):
$> ./manage.py moviegen --title="Inception" --api-key="MYAPIKEY"
{'actors': 'Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Tom Hardy',
'awards': 'Won 4 Oscars. Another 152 wins & 204 nominations.',
'boxoffice': '$292,568,851',
'country': 'USA, UK',
'director': 'Christopher Nolan',
'dvd': '07 Dec 2010',
'genre': 'Action, Adventure, Sci-Fi',
'imdbid': 'tt1375666',
'imdbrating': '8.8',
'imdbvotes': '1,721,888',
'language': 'English, Japanese, French',
'metascore': '74',
'plot': 'A thief, who steals corporate secrets through the use of '
'dream-sharing technology, is given the inverse task of planting an '
'idea into the mind of a CEO.',
'poster': 'https://m.media-amazon.com/images/M/MV5BMjAxMzY3NjcxNF5BMl5BanBnXkFtZTcwNTI5OTM0Mw@@._V1_SX300.jpg',
'production': 'Warner Bros. Pictures',
'rated': 'PG-13',
'ratings': [{'Source': 'Internet Movie Database', 'Value': '8.8/10'},
{'Source': 'Rotten Tomatoes', 'Value': '86%'},
{'Source': 'Metacritic', 'Value': '74/100'}],
'released': '16 Jul 2010',
'response': 'True',
'runtime': 148,
'title': 'Inception',
'type': 'movie',
'website': 'http://inceptionmovie.warnerbros.com/',
'writer': 'Christopher Nolan',
'year': '2010'}
To save the movie to the database, use the --save flag. Also useful is the --noprint option, to suppress json. Also, if you add OMDB_API_KEY=MYAPIKEY to your environment variables, you don’t have to specify it each time:
$ ./manage.py moviegen --title "Closer" --noprint --save
Saved 1 new movie(s) to the database: Closer
Now that it’s been saved to the database, you may want to create a fixture, so you can get back to this state in the future.
$ ./manage.py moviegen --makefixture=tests/myfixture.json
dumping fixture data to tests/myfixture.json ...
[...........................................................................]
Later, you can restore this database with the regular loaddata command:
$ ./manage.py loaddata tests/myfixture.json
Installed 101 object(s) from 1 fixture(s)
There are already 100 films available using loaddata as follows:
$ ./manage.py loaddata tests/100films.json
Running Tests Locally
Run make test. To run all tests and quality checks locally, run make test-all.
To just run linting, make quality. Please note that if any of the linters return a nonzero code, it will give an InvocationError error at the end. See tox’s documentation for InvocationError for more information.
We use edx_lint to compile pylintrc. To update the rules, change pylintrc_tweaks and run make pylintrc.
Cutting a New Version
optional: run make update to update dependencies
bump version in django_elastic_migrations/__init__.py.
update CHANGELOG.rst.
make clean
python3 setup.py sdist bdist_wheel
twine check dist/django-elastic-migrations-*.tar.gz to see if there are any syntax mistakes before tagging
submit PR bumping the version
ensure test matrix is passing on travis and merge PR
pull changes to master
make clean
python3 setup.py sdist bdist_wheel
twine check dist/django-elastic-migrations-*.tar.gz to see if there are any syntax mistakes before tagging
twine upload -r testpypi dist/django-elastic-migrations-*.tar.gz
Check it at https://test.pypi.org/project/django-elastic-migrations/
python3 setup.py tag to tag the new version
twine upload -r pypi dist/django-elastic-migrations-*.tar.gz
Update new release at https://github.com/HBS-HBX/django-elastic-migrations/releases
Changelog
0.8.2 (2018-11-20)
0.8.1 (2018-11-19)
0.8.0 (2018-11-13)
In practice, Python 2 may work, but it is removed from the test matrix and won’t be updated
0.7.8 (2018-11-13)
first release on PyPI
update project dependencies
0.7.7 (2018-09-17)
0.7.6 (2018-09-11)
0.7.5 (2018-08-20)
0.7.4 (2018-08-15)
0.7.3 (2018-08-14)
fix #31 es_update movies –newer –workers does not store worker information
0.7.2 (2018-08-13)
fix #21 wrong batch update total using multiprocessing in 0.7.1
fix #23 KeyError _index_version_name in es_update –newer
address #25 use pks for queryset inside workers #29
0.7.1 (2018-08-07)
fixed gh #8 es_dangerous_reset –es-only to sync database to ES
fixed gh #17 make es_dangerous_reset remove dem models
improved test coverage
added tests for es_create --es-only
added IndexVersion.hard_delete() (not called by default)
added hard_delete flag to DropIndexAction
added hard_delete flag to DEMIndexManager.test_post_teardown()
updated __str__() of IndexAction to be more descriptive
0.7.0 (2018-08-06)
fixed gh #5: “add python 3 support and tests”
0.6.1 (2018-08-03)
fixed gh #9: “using elasticsearch-dsl 6.1, TypeError in DEMIndex.save”
0.6.0 (2018-08-01)
Added test structure for py2 - GH #2
Renamed default log handler from django-elastic-migrations to django_elastic_migrations
0.5.3 (2018-07-23)
First basic release
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file django-elastic-migrations-0.8.2.tar.gz
.
File metadata
- Download URL: django-elastic-migrations-0.8.2.tar.gz
- Upload date:
- Size: 49.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5de4637493409634a8ac1735e060e7f4bcfe95ac37ed2b8b662ef2a86f729590 |
|
MD5 | 21f9ffc0f3ad74e8704f52dfdac7d061 |
|
BLAKE2b-256 | 95d963b09cb79e0f8c3f4dc4530227ef8b9eba8320a1b43bc0d00288618bce10 |