Skip to main content

Serialization & bulk indexing package for Elasticsearch; based on elasticsearch-dsl.py, supports multi-processing, Django

Project description

This is a modern replacement of django-simple-elasticsearch (DSE). Both Django and Elasticsearch have seen major changes over the years; this is a move to keep up.

Why not just update django-simple-elasticsearch?
  • DSE is Django-specific; I wanted to build a solution that could be used in a broader scope of applications
  • To start fresh and avoid assumptions made in the DSE project
  • Dropped support for Python 2
Details
  • Flexible and modular; eg. Django support is available via a 'contrib' module
  • Supports multi-process indexing and asynchronous IO via gevent
  • Depends on elasticsearch-dsl-py rather than the low level elasticsearch-py package
    • You get a lot of functionality for free!
  • Python 3 only
  • esdocs >= 0.6 supports Elasticsearch 7.x
Installation
pip install esdocs

If multi-process indexing is desired, you will want to install it along with the necessary gevent dependencies:

pip install esdocs[gevent]
Command Line Usage
$ esdocs -h
usage: esdocs [-h] [-v] [--version] [--no_input] [--indexes INDEXES]
              [--using USING] [--multi [MULTI]]
              {list,init,update,rebuild,cleanup} ...

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  --version             show program's version number and exit
  --no_input, --noinput
                        Do not prompt for user input (assumes 'Yes' for
                        actions)
  --indexes INDEXES     Comma-separate list of index names to target
  --using USING         Elasticsearch named connection to use
  --multi [MULTI]       Enable multiple processes and optionally set number of
                        CPU cores to use (defaults to all cores)

commands:
  {list,init,update,rebuild,cleanup}
    list                List indexes
    init                Initialize indexes
    update              Update indexes
    rebuild             Rebuild indexes
    cleanup             Delete unaliased indexes

To rebuild indexes specified by document serializers in ESDOCS_SERIALIZER_MODULES:

export ESDOCS_SERIALIZER_MODULES="mypackage.module1,myotherpackage.module2"
export ESDOCS_SERIALIZER_COMPATIBILITY_HOOKS="esdocs.contrib.postgresql.compatibility.range_field"

esdocs rebuild

Multi-process indexing:

export ESDOCS_GEVENT=y
export ESDOCS_SERIALIZER_MODULES="mypackage.module1,myotherpackage.module2"
export ESDOCS_SERIALIZER_COMPATIBILITY_HOOKS="esdocs.contrib.postgresql.compatibility.range_field"

# auto detect number of CPU cores to use
esdocs rebuild --multiproc

# specify the number of cores to use
esdocs rebuild --multiproc --numprocs=4
Django

You must specify ESDOCS_SERIALIZER_MODULES in your Django settings and add esdocs.contrib.esdjango to your INSTALLED_APPS. You can optionally set ESDOCS_SERIALIZER_COMPATIBILITY_HOOKS as well:


INSTALLED_APPS = [
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    ...,
    'esdocs.contrib.esdjango'
]


ESDOCS_SERIALIZER_MODULES = [
    'mypackage.module1',
    'myotherpackage.module2'
]

# these are the current defaults for this setting
ESDOCS_SERIALIZER_COMPATIBILITY_HOOKS = [
    'esdocs.contrib.esdjango.compatibility.manager',
    'esdocs.contrib.esdjango.compatibility.geosgeometry',
    'esdocs.contrib.postgresql.compatibility.range_field'
]
Serializing Data

For esdocs to work, you need to define Document and Serializer (or DjangoSerializer) subclasses to index your data. Document comes from the excellent elasticsearch-dsl-py, while Serializer/DjangoSerializer are a part of esdocs.

  • Document defines the Elasticsearch field mappings
  • Serializer is associated with a Document
  • Serializer defines how to retrieve the dataset
  • For each record in your dataset, the Serializer will attempt to retrieve a value for each field defined on the associated Document
    • There are a number of methods you can implement on a Serializer to retrieve (or construct/munge) each value
Examples

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

esdocs-0.6.2.tar.gz (16.1 kB view hashes)

Uploaded Source

Built Distribution

esdocs-0.6.2-py3-none-any.whl (18.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page