Django extension, executing bulk update operations for PostgreSQL
Project description
django-pg-bulk-update
Django extension to update multiple table records with similar (but not equal) conditions in efficient way on PostgreSQL
Requirements
- Python 3.6+ Previous versions may also work, but are not tested in CI
- django >= 1.8
Previous versions may also work, but are not tested in CI.
django.postgres.contrib fields are also supported (available since django 1.8) django.postgres.contrib.JSONField is supported since django 1.9 - pytz for python before 3.3
- typing for python before 3.5
- psycopg2 or psycopg2-binary This library is not added to requirements file as it is not used directly (thorugh django cursor only). Any form of psycopg2 installation can be used.
- PostgreSQL 9.4+
Previous versions may also work, but haven't been tested.
JSONB operations are available for PostgreSQL 9.4+. INSERT .. ON CONFLICT is used for PostgreSQL 9.5+.
Installation
Install via pip:
pip install django-pg-bulk-update
or via setup.py:
python setup.py install
Usage
You can make queries in 2 ways:
- Declaring a custom manager for your model
- Calling query functions directly
Query functions
There are 4 query helpers in this library. There parameters are unified and described in the section below.
-
bulk_update(model, values, key_fields='id', using=None, set_functions=None, key_fields_ops=(), where=None, returning=None, batch_size=None, batch_delay=0)
This function updates multiple records of given model in single database query.
Functions forms raw sql query for PostgreSQL. It's work is not guaranteed on other databases.
Function returns number of updated records. -
bulk_update_or_create(model, values, key_fields='id', using=None, set_functions=None, update=True, key_is_unique=True, returning=None, batch_size=None, batch_delay=0)
This function finds records by key_fields. It creates not existing records with data, given in values.
Ifupdate
flag is set, it updates existing records with data, given in values.There are two ways, this function may work:
- Use INSERT ... ON CONFLICT statement. It is safe, but requires PostgreSQL 9.5+ and unique index on key fields. This behavior is used by default.
- 3-query transaction:
+ Search for existing records
+ Create not existing records (if values have any)
+ Update existing records (if values have any andupdate
flag is set)
This behavior is used by default on PostgreSQL before 9.5 and if key_is_unique parameter is set to False. Note that transactional update has a known race condition issue that can't be fixed.
Function returns number of records inserted or updated by query.
-
bulk_create(model, values, using=None, set_functions=None, returning=None, batch_size=None, batch_delay=0)
This function creates multiple records of given model in single database query.
Its functionality is the same as django's QuerySet.bulk_create, but it is implemented on this library bases and can be more effective in some cases (for instance, for wide models). -
pdnf_clause(key_fields, field_values, key_fields_ops=())
Pure django implementation of principal disjunctive normal form. It is base on combining Q() objects.
Condition will look like:SELECT ... WHERE (a = x AND b = y AND ...) OR (a = x1 AND b = y1 AND ...) OR ...
Function returns a django.db.models.Q instance
Function parameters
-
model: Type[Model]
A subclass of django.db.models.Model to update -
values: Union[Union[TUpdateValuesValid, Dict[Any, Dict[str, Any]]], Iterable[Dict[str, Any]]]
Data to update. All items must update same fields!!!
Parameter can have one of 2 forms:- Iterable of dicts. Each dict contains both key and update data. Each dict must contain all key_fields as keys. You can't update key_fields with this format.
- Dict of key_values: update_fields_dict
Can not be used forbulk_create
function. You can use this format to update key_fields- key_values can be tuple or single object. If tuple, key_values length must be equal to key_fields length. If single object, key_fields is expected to have 1 element
- update_fields_dict is a dictionary {field_name: update_value} to update
-
key_fields: Union[str, Iterable[str]]
Optional. Field names, which are used as update conditions. Parameter can have one of 2 forms:- String for single key field. Primary key is used by default.
- Iterable of strings for multiple key fields.
-
using: Optional[str]
Optional. Database alias to query. If not set, 'default' database is used. -
set_functions: Optional[Dict[str, Union[str, AbstractSetFunction]]]
Optional. Functions which will be used to set values.
If given, it should be a dictionary:- Key is a field name function is applied to
- Value can be:
-
django.db.models.expressions.BaseExpression
instance
Any django function expression returning a value. Expression can use Value, F, Func and their child classes. It can not use annotations and tables other than updated model (likeF(a__b__c)
). In create operations field default values is taken. If it is not provided, field default value is used. Expression can containdjango_pg_bulk_update.set_functions.BulkValue()
expression in it. If so, it will be replaced with field value passed invalues
parameter. If expression does not containBulkValue
instances data, passed invalues
parameter for this key is ignored. -
Function alias name
- 'eq', '='
Simple assign operator. It used by default for fields that are not mentioned in the dict. - 'incr', '+'
Adds field value to previous one. It can be used for all numeric database types. - 'concat', '||'
Concatenates field value to previous one. It can be used for string types, JSONField, HStoreField, ArrayField. - 'eq_not_null'
This function can be used, if you want to update value only if it is not None. - 'union'
This function combines ArrayField value with previous one, removing duplicates. - 'array_remove'
This function deletes value from ArrayField field using array_remove PSQL Function. - 'now', 'NOW'
This function sets field value toNOW()
database function. Doesn't expect any value invalues
parameter. If one is given it is ignored.
- 'eq', '='
-
django_pg_bulk_update.set_functions.AbstractSetFunction
instance
You can define your own set function. See section below.Increment, union and concatenate functions concern NULL as default value.
-
-
key_field_ops: Union[Dict[str, Union[str, AbstractClauseOperator]], Iterable[Union[str, AbstractClauseOperator]]]
Optional. Operators, which are used to fined records for update. Operators are applied tokey_fields
.
If some fields are not given, equality operator is used.bulk_update_or_create
function always uses equality operator Parameter can have one of 2 forms:- Iterable of operator alias names or AbstractClauseOperator instances. Order of iterable must be the same as key_fields.
- Dictionary:
- Key is a field name, function is applied to
- Value is a function alias name of set function or AbstractSetFunction instance.
Available name aliases:
- 'eq', '=', '==' Simple equality condition. It is used by default.
- '!eq', '!=', '<>' Not equal operator
- 'in' Searches for records, which have field from values list. Value should be an iterable of correct field values.
- '!in' Searches for records, which have field not from values list. Value should be an iterable of correct field values.
- 'lt', '<'
- 'lte', '<='
- 'gt', '>'
- 'gte', '>='
- 'between' Searches for records, which have field between a and b. Value should be iterable with 2 items.
- 'is_null', 'isnull' Checks field value for been NULL. Value should be boolean (true for IS NULL, false for IS NOT NULL)
- You can define your own clause operator. See section below.
-
where
: Optional[WhereNode]
This parameter is used to filter data before doing bulk update, using QuerySet filter and exclude methods. Generated condition should not contain annotations and other table references.
NOTE: parameter is not supported inbulk_update_or_create
-
returning: Optional[Union[str, Iterable[str]]]
If this parameter is set, it can be:- A field name string
- An iterable of field names
- '*' string to return all model fields
Query returns django_pg_returning.ReturningQuerySet instead of rows count.
Using this feature requires django-pg-returning library installed (it is not in requirements, though). -
batch_size: Optional[int]
If this parameter is set, values are split into batches of given size. Each batch is processed separately. Note that batch_size != number of records processed if you use key_field_ops other than 'eq' -
batch_delay: float
If batch_size is set, this parameter sets time to sleep in seconds between batches execution -
update: bool
If flag is not set, bulk_update_or_create function will not update existing records, only creating not existing. -
key_is_unique: bool
Defaults to True. Settings this flag to False forces library to use 3-query transactional update_or_create. -
field_values: Iterable[Union[Iterable[Any], dict]]
Field values to use inpdnf_clause
function. They have simpler format than update functions. It can come in 2 formats:- An iterable of tuples in key_fields order
( (x, y), (x1, y1), ...)
- An iterable of dicts with field name as key
({'a': x, 'b': y}, ...)
- An iterable of tuples in key_fields order
Examples
from django.db import models, F
from django.db.models.functions import Upper
from django_pg_bulk_update import bulk_update, bulk_update_or_create, pdnf_clause
from django_pg_bulk_update.set_functions import BulkValue
# Test model
class TestModel(models.Model):
name = models.CharField(max_length=50)
int_field = models.IntegerField()
# Create test data
created = TestModel.objects.pg_bulk_create([
{'id': i, 'name': "item%d" % i, 'int_field': 1} for i in range(1, 4)
])
print(created)
# Outputs 3
# Create test data returning
created = TestModel.objects.pg_bulk_create([
{'id': i, 'name': "item%d" % i, 'int_field': 1} for i in range(4, 6)
], returning='*')
print(created)
print(type(res), list(res.values_list('id', 'name', 'int_field')))
# Outputs:
# <class 'django_pg_returning.queryset.ReturningQuerySet'>
# [
# (4, "item4", 1),
# (5, "item5", 1)
# ]
# Update by id field
updated = bulk_update(TestModel, [{
"id": 1,
"name": "updated1",
}, {
"id": 2,
"name": "updated2"
}])
print(updated)
# Outputs: 2
# Update returning
res = bulk_update(TestModel, [{
"id": 1,
"name": "updated1",
}, {
"id": 2,
"name": "updated2"
}], returning=('id', 'name', 'int_field'))
print(type(res), list(res.values_list('id', 'name', 'int_field')))
# Outputs:
# <class 'django_pg_returning.queryset.ReturningQuerySet'>
# [
# (1, "updated1", 1),
# (2, "updated2", 1)
# ]
# Call update by name field
updated = bulk_update(TestModel, {
"updated1": {
"int_field": 2
},
"updated2": {
"int_field": 3
}
}, key_fields="name")
print(updated)
# Outputs: 2
print(list(TestModel.objects.all().order_by("id").values("id", "name", "int_field")))
# Outputs: [
# {"id": 1, "name": "updated1", "int_field": 2},
# {"id": 2, "name": "updated2", "int_field": 3},
# {"id": 3, "name": "item3", "int_field": 1}
# ]
# Increment int_field by 3 and transform name to upper case for records where id >= 2 and int_field < 3
updated = bulk_update(TestModel, {
(2, 3): {
"int_field": 3
}
}, key_fields=['id', 'int_field'], key_fields_ops={'int_field': '<', 'id': 'gte'},
set_functions={'int_field': '+', 'name': Upper('name')})
print(updated)
# Outputs: 1
print(list(TestModel.objects.all().order_by("id").values("id", "name", "int_field")))
# Outputs: [
# {"id": 1, "name": "updated1", "int_field": 2},
# {"id": 2, "name": "updated2", "int_field": 3},
# {"id": 3, "name": "incr", "int_field": 4}
# ]
res = bulk_update_or_create(TestModel, [{
"id": 3,
"name": "_concat1",
"int_field": 3
}, {
"id": 4,
"name": "concat2",
'int_field': 4
}], set_functions={'name': '||', 'int_field': F('int_field') + BulkValue()})
print(res)
# Outputs: 2
print(list(TestModel.objects.all().order_by("id").values("id", "name", "int_field")))
# Note: IntegerField defaults to 0 in create operations. So 0 + 4 = 4 for id 4.
# Outputs: [
# {"id": 1, "name": "updated1", "int_field": 2},
# {"id": 2, "name": "updated2", "int_field": 3},
# {"id": 3, "name": "incr_concat1", "int_field": 7},
# {"id": 4, "name": "concat2", "int_field": 4},
# ]
# Find records where
# id IN [1, 2, 3] AND name = 'updated2' OR id IN [3, 4, 5] AND name = 'concat2' OR id IN [2, 3, 4] AND name = 'updated1'
cond = pdnf_clause(['id', 'name'], [([1, 2, 3], 'updated2'),
([3, 4, 5], 'concat2'),
([2, 3, 4], 'updated1')], key_fields_ops={'id': 'in'})
data = TestModel.objects.filter(cond).order_by('int_field').values_list('int_field', flat=True)
print(list(data))
# Outputs: [3, 5]
Using custom manager and query set
In order to simplify using bulk_create
, bulk_update
and bulk_update_or_create
functions,
you can use a custom manager.
It automatically fills:
model
parameterusing
parameter (extracts queryset write database)where
parameter (applies queryset filters, if called as QuerySet method). Not supported in bulk_update_or_create.
You can change database to use with Manager.db_manager() or QuerySet.using() methods.
The rest parameters are the same as above.
Note: As django 2.2
introduced bulk_update method,
library methods were renamed to pg_bulk_create
, pg_bulk_update
and pg_bulk_update_or_create
respectively.
Example:
from django.db import models
from django_pg_bulk_update.manager import BulkUpdateManager
# Test model
class TestModel(models.Model):
objects = BulkUpdateManager()
name = models.CharField(max_length=50)
int_field = models.IntegerField()
# Now you can use functions like:
TestModel.objects.pg_bulk_create([
# Any data here
], set_functions=None)
TestModel.objects.pg_bulk_update([
# Any data here
], key_fields='id', set_functions=None, key_fields_ops=())
# Update only records with id greater than 5
TestModel.objects.filter(id__gte=5).pg_bulk_update([
# Any data here
], key_fields='id', set_functions=None, key_fields_ops=())
TestModel.objects.pg_bulk_update_or_create([
# Any data here
], key_fields='id', set_functions=None, update=True)
If you already have a custom manager, you can replace QuerySet to BulkUpdateQuerySet:
from django.db import models
from django.db.models.manager import BaseManager
from django_pg_bulk_update.manager import BulkUpdateQuerySet
class CustomManager(BaseManager.from_queryset(BulkUpdateQuerySet)):
pass
# Test model
class TestModel(models.Model):
objects = CustomManager()
name = models.CharField(max_length=50)
int_field = models.IntegerField()
If you already have a custom QuerySet, you can inherit it from BulkUpdateMixin:
from django.db import models
from django.db.models.manager import BaseManager
from django_pg_bulk_update.manager import BulkUpdateMixin
class CustomQuerySet(BulkUpdateMixin, models.QuerySet):
pass
class CustomManager(BaseManager.from_queryset(CustomQuerySet)):
pass
# Test model
class TestModel(models.Model):
objects = CustomManager()
name = models.CharField(max_length=50)
int_field = models.IntegerField()
Custom clause operator
You can define your own clause operator, creating AbstractClauseOperator
subclass and implementing:
names
attributedef get_django_filter(self, name)
method- One of
def get_sql_operator(self)
ordef get_sql(self, table_field, value)
When clause is formed, it callsget_sql()
method. In order to simplify method usage of simplefield <op> value
operators, by defaultget_sql()
forms this condition, callingget_sql_operator()
method, which returns .
Optionally, you can change def format_field_value(self, field, val, connection, cast_type=True, **kwargs)
method,
which formats value according to field rules
Example:
from django_pg_bulk_update import bulk_update
from django_pg_bulk_update.clause_operators import AbstractClauseOperator
class LTClauseOperator(AbstractClauseOperator):
names = {'lt', '<'}
def get_django_filter(self, name): # type: (str) -> str
"""
This method should return parameter name to use in django QuerySet.fillter() kwargs
:param name: Name of parameter
:return: String with filter
"""
return '%s__lt' % name
def get_sql_operator(self): # type: () -> str
"""
If get_sql operator is simple binary operator like "field <op> val", this functions returns operator
:return: str
"""
return '<'
# Usage examples
# import you function here before calling an update
bulk_update(TestModel, [], key_field_ops={'int_field': 'lt'})
bulk_update(TestModel, [], key_field_ops={'int_field': LTClauseOperator()})
You can use class instance directly in key_field_ops
parameter or use its aliases from names
attribute.
When update function is called, it searches for all imported AbstractClauseOperator subclasses and takes first class
which contains alias in names
attribute.
Custom set function
You can define your own set function, creating AbstractSetFunction
subclass and implementing:
names
attributesupported_field_classes
attribute
It can contain class name, class or full python import path of a class. If a class or path is given, any child class would be accepted too.- One of:
def get_sql_value(self, field, val, connection, val_as_param=True, with_table=False, for_update=True, **kwargs)
method This method defines new value to set for parameter. It is called fromget_sql(...)
method by default.def get_sql(self, field, val, connection, val_as_param=True, with_table=False, for_update=True, **kwargs)
method This method sets full sql and it params to use in set section of update query.
By default it returns:"%s" = self.get_sql_value(...)
, params
Optionally, you can change:
def format_field_value(self, field, val, connection, cast_type=False, **kwargs)
method, if input data needs special formatting.def modify_create_params(self, model, key, kwargs)
method, to change data before passing them to model constructor inbulk_update_or_create()
. This method is used in 3-query transactional update only. INSERT ... ON CONFLICT uses for_update flag ofget_sql()
andget_sql_value()
functions
Example:
from django_pg_bulk_update import bulk_update
from django_pg_bulk_update.set_functions import AbstractSetFunction
class CustomSetFunction(AbstractSetFunction):
# Set function alias names
names = {'func_alias_name'}
# Names of django field classes, this function supports. You can set None (default) to support any field.
supported_field_classes = {'IntegerField', 'FloatField', 'AutoField', 'BigAutoField'}
def get_sql_value(self, field, val, connection, val_as_param=True, with_table=False, for_update=True, **kwargs):
"""
Returns value sql to set into field and parameters for query execution
This method is called from get_sql() by default.
:param field: Django field to take format from
:param val: Value to format
:param connection: Connection used to update data
:param val_as_param: If flag is not set, value should be converted to string and inserted into query directly.
Otherwise a placeholder and query parameter will be used
:param with_table: If flag is set, column name in sql is prefixed by table name
:param for_update: If flag is set, returns update sql. Otherwise - insert SQL
:param kwargs: Additional arguments, if needed
:return: A tuple: sql, replacing value in update and a tuple of parameters to pass to cursor
"""
# If operation is incremental, it should be ready to get NULL in database
null_default, null_default_params = self._parse_null_default(field, connection, **kwargs)
# Your function/operator should be defined here
tpl = 'COALESCE("%s", %s) + %s'
if val_as_param:
sql, params = self.format_field_value(field, val, connection)
return tpl % (field.column, null_default, sql), null_default_params + params
else:
return tpl % (field.column, null_default, str(val)), null_default_params
# Usage examples
# import you function here before calling an update
bulk_update(TestModel, [], set_functions={'int_field': 'func_alias_name'})
bulk_update(TestModel, [], set_functions={'int_field': CustomSetFunction()})
You can use class instance directly in set_functions
parameter or use its aliases from names
attribute.
When update function is called, it searches for all imported AbstractSetFunction subclasses and takes first class
which contains alias in names
attribute.
Compatibility
Library supports django.contrib.postgres.fields:
- ArrayField
- JSONField
- HStoreField
- RangeField (IntegerRangeField, BigIntegerRangeField, FloatRangeField, DateTimeRangeField, DateRangeField)
Note that ArrayField and HStoreField are available since django 1.8, JSONField - since django 1.9.
RangeField supports are available since PostgreSQL 9.2, psycopg2 since 2.5 and django since 1.8.
PostgreSQL before 9.4 doesn't support jsonb, and so - JSONField.
PostgreSQL 9.4 supports JSONB, but doesn't support concatenation operator (||). In order to support this set function a special function for postgres 9.4 was written. Add a migration to create it:
from django.db import migrations,
from django_pg_bulk_update.compatibility import Postgres94MergeJSONBMigration
class Migration(migrations.Migration):
dependencies = []
operations = [
Postgres94MergeJSONBMigration()
]
PostgreSQL before 9.5 doesn't support INSERT ... ON CONFLICT statement. So 3-query transactional update will be used.
Performance
Test background:
- Django 2.0.2
- PostgreSQL 10.2
- Python 3.6.3
- 1000 pre-created records
Updating records one by one took 51,68 seconds.
Updating records with bulk_update took 0.13 seconds.
You can write your own tests, based on test.test_performance and running it.
Development
This is an Open source project developed by M1ha-Shvn
under BSD 3 license.
Feel free to create issues and make pull requests.
Library test system is based on django.test.
You can find them in tests
directory.
Tests requirements
- PostgreSQL
- Pypi libraries listed in
requirements-test.txt
file
Running tests
Running in docker
- Install docker and docker-compose
- Run
docker build . --tag django-pg-bulk-pupdate
in project directory - Run
docker-compose run run_tests
in project directory
Running in virtual environment
- Install all requirements listed above
- Create virtual environment
- Create a superuser named 'test' on your local Postgres instance:
CREATE ROLE test;
ALTER ROLE test WITH SUPERUSER;
ALTER ROLE test WITH LOGIN;
ALTER ROLE test PASSWORD 'test';
CREATE DATABASE test OWNER test;
CREATE DATABASE test2 OWNER test;
- Install requirements
pip3 install -U -r requirements-test.txt
- Start tests
python3 runtests.py
Alternatives
django 2.2+ bulk_update difference
Pros:
- bulk_update_or_create() method
- Ability to use complex set functions
- Ability to use complex conditions
- Ability to update primary key
- pdnf_clause helper
- Django 1.7+ support
- Ability to make delay between batches
- Ability to return affected rows instead of rowcount (using Postgres RETURNING feature)
Cons:
- PostgreSQL only
- Ability to update parents/children (using extra queries)
django-bulk-update difference
Pros:
- bulk_update_or_create() method
- Ability to use complex set functions
- Ability to use django expressions as set functions
- Ability to use complex conditions
- pdnf_clause helper
- Django 1.7 support
- Ability to make delay between batches
- Ability to return affected rows instead of rowcount (using Postgres RETURNING feature)
Cons:
- PostgreSQL only
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for django-pg-bulk-update-3.7.2.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0d5f6d238bfa2858540db294b938fc11712f73d45865ff302f2969be2b513014 |
|
MD5 | e5ec6326acda6f2b295f395b4ebe53b7 |
|
BLAKE2b-256 | 332f4da1bf95b2792f2904f4f0ef2782fc7783a378ea004d356ab04dac8c68f7 |
Hashes for django_pg_bulk_update-3.7.2-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d6cfed8b1fa3f11770c9bcffba3b1aca0addf4f2ca513551f83f972e8b5d3ae5 |
|
MD5 | b70f51d2deb249f998edca2862d74a1a |
|
BLAKE2b-256 | 61b4609038f497e517f41fa3779def549e07de7da0beea377871760fd9f409bf |