Skip to main content

In-memory database

Project description

https://travis-ci.org/akaterra/udb.py.svg?branch=master

Udb is an in-memory weak schema database based on the Zope Foundation BTrees, the Rtree, the Whoosh and on the native python’s dict. Udb provides indexes support and limited MongoDB-like queries.

Table of contents

Requirements

Python 3.6

Installation

pip install udb_py

To enable BTree indexes support install Zope Foundation BTrees package (installing of “persistent” package may be needed):

pip install BTrees

To enable RTree indexes support install Rtree package (requires libspatialindex, install it before):

pip install Rtree

Installing libspatialindex on MacOS with Homebrew:

brew install spatialindex

To enable Full-Text indexes support install Whoosh package:

pip install Whoosh

Quick start

Create the Udb instance with the indexes declaration:

from udb_py import Udb, UdbBtreeIndex

db = Udb({
    'a': UdbBtreeIndex(['a']),
    'b': UdbBtreeIndex(['b']),
    'cde': UdbBtreeIndex(['c', 'd', 'e']),
})

Insert records:

db.insert({'a': 1, 'b': 1, 'c': 3, 'd': 4, 'e': 5})
db.insert({'a': 2, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
db.insert({'a': 3, 'b': 3, 'c': 3, 'd': 4, 'e': 5})
db.insert({'a': 4, 'b': 4, 'c': 3, 'd': 4, 'e': 6})
db.insert({'a': 5, 'b': 5, 'c': 3, 'd': 4, 'e': 7})

Select records:

a = list(db.select({'a': 1})

[{'a': 1, 'b': 1, 'c': 3, 'd': 4, 'e': 5}]

b = list(db.select({'b': 0})

[]  # no records with b=0

c = list(db.select({'c': 3, 'd': 4}, limit=2)

[{'a': 3, 'b': 3, 'c': 3, 'd': 4, 'e': 5}, {'a': 4, 'b': 4, 'c': 3, 'd': 4, 'e': 6}]

Data schema

Data schema allows to fill the inserted or updated record with default values. The default value can be defined as a primitive value or callable:

from udb_py import Udb

db = Udb(schema={
    'a': 'a',
    'b': 'b',
    'c': lambda key, record: 'b' if record['b'] == 'b' else 'c',
})

Functional fields

auto_id - generates unique id (uuid v1 by default)

from udb_py import Udb, auto_id

db = Udb(schema={
    'id': auto_id(),
})

current_timestamp - uses current timestamp (as int value)

from udb_py import Udb, current_timestamp

db = Udb(schema={
    'timestamp': current_timestamp(),
})

fn - calls custom function

from udb_py import Udb, fn

db = Udb(schema={
    'timestamp': fn(lambda record: record['a'] + record['b']),
})

optional - returns “None” value

from udb_py import Udb, optional

db = Udb(schema={
    'a': optional,
})

Indexes

To speed up the search the records can be indexed by necessary for searching fields. The Udb also includes a simple query optimiser that can apply the most appropriate index.

BTree index (supports the range and prefix scan operations):

  • UdbBtreeIndex - btree based index supporting multiple records with the same index key.

  • UdbBtreeEmbeddedIndex - same as the UdbBtreeIndex, but supports embedded list of values.

  • UdbBtreeUniqIndex - btree based index operating with always single records, the second record inserted with the same index key will raise IndexConstraintError.

Hash index (supports only const scan operation):

  • UdbHashIndex - hash based index supporting multiple records with the same index key.

  • UdbHashEmbeddedIndex - same as the UdbHashIndex, but supports embedded list of values.

  • UdbHashUniqIndex - hash based index operating with always single records, the second record inserted with the same index key will raise IndexConstraintError.

Spatial index:

  • UdbRtreeIndex - spatial index that supports “intersection with rectangle” and “near to point” search.

Full-Text index:

  • UdbTextIndex - full text index that supports searching by words.

Index declaration

As it was shown above, for the index declaration the Udb instance should be created with the indexes parameter that provides dict with the key as an index name and value as an index instance. The index instance should be created with the sequence of fields (1 at least) which will be fetched in the declared order from the indexed record. By this sequence of fields, the index key will be generated and will be associated with the indexed record.

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex(['a', 'b', 'c'])  # "a", "b" and "c" fields will be fetched from the indexed record
})

record = {'a': 'A', 'b': 'B', 'c': 'C'}  # index key=ABC

In order that the record to be indexed it is not obliged to contain all of the fields declared in the sequence of index fields. By default the “None” value is used for the missing field.

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex(['a', 'b', 'c'])  # "a", "b" and "c" fields will be fetched from the indexed record
})

record = {'a': 'A', 'b': 'B'}  # index key=ANoneC

Required fields constraint:

from udb_py import Udb, UdbBtreeIndex, REQUIRED

db = Udb(indexes={
    'abc': UdbBtreeIndex({'a': REQUIRED, 'b': REQUIRED, 'c': REQUIRED})  # "a", "b" and "c" fields will be fetched from the indexed record
})

record = {'a': 'A', 'b': 'B'}  # won't be indexed, raises FieldRequiredError

The default value for missing field can be defined as a primitive value or callable (functional index):

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex({'a': 'a', 'b': 'b', 'c': 'c'})
})

record = {'a': 'A', 'c': 'C'}  # index key=AbC
from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex({'a': 'a', 'b': lambda key, values: 'b', 'c': 'c'})
})

record = {'a': 'A', 'c': 'C'}  # index key=AbC

Note that the default value is used as missing value for index key only. That means in case of the query is not fully covered by index, the part of the query moves to “seq” scan and then search may not return results.

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex({'a': 'a', 'b': 'b', 'c': 'c'})
})

db.insert({'a': 'A', 'b': 'B'})

results = list(db.select({'a': 'A', 'c': 'c'}))  # no results since query covering key consists of "a", "c" is searched by "seq" scan but nothing was defined in record as "c", only in index

results = list(db.select({'a': 'A', 'b': 'B', 'c': 'c'}))  # now record returned due to index key is fully covered

To define the default record value use Data schema.

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex({'a': 'a', 'b': 'b', 'c': 'c'})
}, schema={'c': 'c'})

db.insert({'a': 'A', 'b': 'B'})

results = list(db.select({'a': 'A', 'c': 'c'}))  # record returned although index key is not fully covered

results = list(db.select({'a': 'A', 'b': 'B', 'c': 'c'}))  # record returned due to index key is fully covered

Example of functional index over the size of list:

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex({
      '$size': lambda key, values: len(values['arr']) if isinstance(values['arr'], list) else 0,
    }),
})

db.insert({'arr': [1]})
db.insert({'arr': [1, 2]})
db.insert({'arr': [1]})

print(list(db.select({'$size': 2})))

Use EMPTY value to exclude zero-length records from the index:

from udb_py import Udb, UdbBtreeIndex, EMPTY

db = Udb(indexes={
    'abc': UdbBtreeIndex({
      '$size': lambda key, values: len(values['arr'] if isinstance(values['arr'], list) else 0 or EMPTY),
    }),
})

db.insert({'arr': [1]})
db.insert({'arr': [1, 2]})
db.insert({'arr': [1]})

print(list(db.select({'$size': 2})))

Float precision

To be able to index float values enable the float mode with necessary precision (number of decimals):

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex(['a']).set_float_precision(10)
})

db.insert({'a': 3.1415926525})

Querying

Udb supports limited MongoDB-like queries which can be used in the delete, select or update operations. The query generally is a python’s dict with the key as a field and value as a primitive value or an equality condition over the field. The query dict is mutable, therefore it needs to be initialized every time anew.

Supported query operations:

  • $eq - equal to a value

    udb.select({'a': {'$eq': 5}})
  • $gt - greater then value

    udb.select({'a': {'$gt': 5}})
  • $gte - greater or equal to a value

    udb.select({'a': {'$gte': 5}})
  • $in - equal to an any value in the list of a values

    udb.select({'a': {'$in': 5}})
  • $intersection - intersection with rectangle

    udb.select({'a': {'$intersection': {'minX': 5, 'minY': 5, 'maxX': 1, 'maxY': 5}}})
  • $like - like value (sql compatible)

    udb.select({'a': {'$like': 'a%b_c'}})
  • $lt - less then value

    udb.select({'a': {'$lt': 5}})
  • $lte - less or equal to a value

    udb.select({'a': {'$lte': 5}})
  • $ne - not equal to a value

    udb.select({'a': {'$ne': 5}})
    • BTree index - performs “range” scan of [-∞, value)∪(value, +∞]

  • $near - near to point with optional min and max distances

    udb.select({'a': {'$near': {'x': 5, 'y': 5, 'minDistance': 1, 'maxDistance': 5}}})
    • allocates sort buffer is case of “seq” scan

    • selects all records in case of unset maxDistance and set minDistance.

  • $nin - not equal to an any value in the list of a values

    udb.select({'a': {'$nin': [1, 2, 3]}})
    • BTree index - performs “range” scan of [-∞, value_1)∪(value_1, value_2)∪…∪(value_n, +∞]

  • $text - contains text words

    udb.select({'a': {'$text': 5}})
    • needs Full-Text index

  • primitive value - equal to a value

    udb.select({'a': 5})

Example:

records = list(udb.select({'a': 1}))
records = list(udb.select({'a': {'$gte': 1, '$lte': 3}}))
records = list(udb.select({'a': {'$in': [1, 2, 3], '$lte': 2}}))

Query validation

By default Udb does not check the query dict validity. To check its validity use validate_query method.

udb.validate_query({'a': {'$gte': [1, 2, 3]}})  # raises InvalidScanOperationValueError('a.$gte')

Comparison order

Udb database is not strictly typed for stored values, therefore it uses the following order of ascending comparisons for values of different types:

  • None

  • boolean - false less then true

  • int, float

  • string

So, for example, the record containing int value always greater than the record containing boolean value for the same field. The records having indexed field will be fetched in the order above.

Getting plan

To get the query plan use select method with get_plan=True:

from udb_py import Udb, UdbBtreeIndex

db = Udb(indexes={
    'abc': UdbBtreeIndex({'a': 'a', 'b': lambda key, values: 'b', 'c': 'c'})
})

db.select({'a': 3}, sort='-a', get_plan=True)  # [(<udb.index.udb_btree_index.UdbBtreeIndex object at 0x104994080>, 'const', 1, 2), (None, 'sort', 0, 0, 'a', False)]

Scan operations

BTree index:

  • const - index has only one index key that refers exactly to the one record in case of single valued index or to the set of records covered by the same index key in case of multi-record index (are fetched in order of insertion)

  • in - index has multiple index keys, each one refers exactly to the one record in case of single valued index or to the set of records covered by the same index key in case of multi-record index (are fetched in order of insertion)

  • range - index covers multiple records by the index keys set having minimum and maximum values

  • prefix - index covers range of records by the partial index key

  • prefix_in - index covers multiple records by the list of the partial index keys, each one covers range of records

RTree index:

  • intersection - index covers records intersected by the rectangle

  • near - index covers records near to the point

Full-text index:

  • text - index covers records containing words

No index:

  • seq - scanning that is not covered by any index, all records will be scanned (worst case)

Storages

The storage allows keeping data persistent.

UdbJsonFileStorage stores data in the JSON file. The file may be partially stored (broken) if no graceful app shutdown applied.

from udb_py import UdbJsonFileStorage

db = Udb(storage=UdbJsonFileStorage('db'))

db.load_db()

db.insert({'a': 'a'})

db.save_db()

UdbWalStorage stores data of delete, insert and update operations in the WAL (Write-Ahead-Logging) file chronologically. May partially store broken last insert/update/delete op if no graceful app shutdown applied. Use allow_corrupted_wal=True param to ignore such ops.

from udb_py import UdbWalStorage

db = Udb(storage=UdbWalStorage('db'))

db.load_db()

db.insert({'a': 'a'})

db.save_db()  # does nothing; delete, insert and update data will be stored on the fly

Select operation

Selected and inserted records are mutable, so avoid to update them directly. Otherwise use “copy on select” or “copy on insert” mode (shallow copy):

udb.set_copy_on_select()
udb.set_copy_on_insert()

To limit the result subset to particular number of records use limit parameter:

records = list(udb.select({'a': 1}, limit=5)

To fetch the result subset from the particular offset use offset parameter:

records = list(udb.select({'a': 1}, offset=5)

Delete operation

udb.delete(q={'a': 1}, offset=5)

Insert operation

udb.insert({'a': 1})

Update operation

udb.update({'a': 2}, q={'a': 1}, offset=5)

Aggregation

Aggregation mechanics allows to build aggregation pipeline over any iterable, particular over the cursor. Aggregation accepts an iterable with the stages to be applied over it.

from udb_py import Udb, aggregate

db = Udb()

db.insert({'a': [1, 2, 3]})
db.insert({'a': 2})
db.insert({'a': 3})

related_db = Udb()

related_db.insert({'x': 1})
related_db.insert({'x': 2})
related_db.insert({'x': 3})

results = list(aggregate(
  db.select(),
  ('$unwind', 'a'),  # stage 1
  ('$o2o', ('a', 'x', related_db, 'rel1')),  # stage 2
))

[{
  'a': 1, '__rev__': 0, 'rel1': {'x': 1, '__rev__': 0}
}, {
  'a': 2, '__rev__': 0, 'rel1': {'x': 2, '__rev__': 1}
}, {
  'a': 3, '__rev__': 0, 'rel1': {'x': 3, '__rev__': 2}
}, {
  'a': 2, '__rev__': 1, 'rel1': {'x': 2, '__rev__': 1}
}, {
  'a': 3, '__rev__': 2, 'rel1': {'x': 3, '__rev__': 2}
}]

Stages:

  • $facet - run multiple pipelines over previous result - (‘$facet’, {‘result_key_1’: [<pipeline 1>, <pipeline 2>, …], ‘result_key_2’: [<pipeline 1>, <pipeline 2>, …], …})

  • $group - group by keys with group operations - (‘$group’, (‘key1’, ‘key2’, …, { ‘$operation’: (arg1, arg2, … ), … })

    Operations:

    • $count - counts records - { ‘$count’: ‘save_to_key’ }

    • $last - gets last record value by key - { ‘$last’: ‘key’ }

    • $max - gets max value by key - { ‘$max’: (‘key’, ‘save_to_key’) }

    • $min - gets min value by key - { ‘$min’: (‘key’, ‘save_to_key’) }

    • $mul - multiplies values by key - { ‘$mul’: (‘key’, ‘save_to_key’) }

    • $push - pushes value by key into list - { ‘$push’: (‘key’, ‘save_to_key’) }, skips records with missing key

    • $sum - sums values by key - { ‘$sum’: (‘key’, ‘save_to_key’) }

  • $limit - (‘$limit’, limit)

  • $match - matches to query - (‘$match’, { … })

  • $o2o - one to one relation - (‘$o2o’, (‘field_from’, ‘field_to’, related_db, ‘save_to_key’)), result is None or record

  • $o2m - one to many relation - (‘$o2m’, (‘field_from’, ‘field_to’, related_db, ‘save_to_key’)), result is list of records

  • $offset - (‘$offset’, offset)

  • $project - renames keys - (‘$project’, { ‘key1_from’: ‘key1_to’, ‘key2_from’: ‘key2_to’, … }), None as “key_to” unsets the key

  • $rebase - rebases dict by key onto record values - (‘$rebase’, ‘key’, skip_existing)

  • $unwind - unwinds list by key into single records - (‘$unwind’, ‘key’), each list entry will be merged with the copy of record

Instant view

Instant view allows to get an instant slice of record by condition.

from udb_py import Udb, UdbView

db = Udb({
    'a': UdbBtreeIndex(['a']),
    'b': UdbBtreeIndex(['b']),
    'cde': UdbBtreeIndex(['c', 'd', 'e']),
})

db.insert({'a': 1, 'b': 1, 'c': 3, 'd': 4, 'e': 5})
db.insert({'a': 2, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
db.insert({'a': 3, 'b': 3, 'c': 3, 'd': 4, 'e': 5})
db.insert({'a': 4, 'b': 4, 'c': 3, 'd': 4, 'e': 6})
db.insert({'a': 5, 'b': 5, 'c': 3, 'd': 4, 'e': 7})

view = UdbView(db, {'b': {'$gte': 3}})

db.insert({'a': 6, 'b': 6, 'c': 3, 'd': 4, 'e': 8})  # updates view immediately

view.select({'a': 6})  # {'a': 5, 'b': 5, 'c': 3, 'd': 4, 'e': 7}

By default view has the same indexes as the provided Udb instance. Use indexes parameter to drop all indexes or to set your own.

view = UdbView(db, {'b': {'$gte': 3}}, indexes=None)  # view has no indexes
view = UdbView(db, {'b': {'$gte': 3}}, indexes={'a': UdbBtreeIndex(['a'])})  # view has custom indexes

Limitations

  • Nested paths for indexing and querying are not supported, only the root level

  • Transactions are not supported

Benchmarks

  • Intel Core i7, 3.58 GHz, 4 cores, disabled HT

  • 16GB 1600 MHz RAM

  • PyPy3

INSERT (BTREE, 1ST INDEX COVERS 1 FIELD)

Total time: 2.9712460041046143 sec., per sample: 2.971246004104614e-06 sec., samples per second: 336559.1400437912, total samples: 1000000

SELECT (BTREE, 1ST INDEX COVERS 1 FIELD)

Total time: 1.7301840782165527 sec., per sample: 1.7301840782165527e-06 sec., samples per second: 577973.1836573046, total samples: 1000000

INSERT (BTREE, 1ST INDEX COVERS 1 FIELD, 2ND INDEX COVERS 1 FIELD, 3RD INDEX COVERS 2 FIELDS)

Total time: 6.8810200691223145 sec., per sample: 6.881020069122315e-06 sec., samples per second: 145327.29013353275, total samples: 1000000

SELECT (BTREE, 1ST INDEX COVERS 1 FIELD, 2ND INDEX COVERS 1 FIELD, 3RD INDEX COVERS 2 FIELDS)

Total time: 1.8345210552215576 sec., per sample: 1.8345210552215576e-06 sec., samples per second: 545101.4024361953, total samples: 1000000

INSERT (HASH, 1ST INDEX COVERS 1 FIELD)

Total time: 1.781458854675293 sec., per sample: 1.781458854675293e-06 sec., samples per second: 561337.6909467103, total samples: 1000000

SELECT (HASH, 1ST INDEX COVERS 1 FIELD)

Total time: 0.8209011554718018 sec., per sample: 8.209011554718018e-07 sec., samples per second: 1218173.458929125, total samples: 1000000

INSERT (HASH, 1ST INDEX COVERS 1 FIELD, 2ND INDEX COVERS 1 FIELD, 3RD INDEX COVERS 2 FIELDS)

Total time: 4.138401985168457 sec., per sample: 4.138401985168457e-06 sec., samples per second: 241639.16496847855, total samples: 1000000

SELECT (HASH, 1ST INDEX COVERS 1 FIELD, 2ND INDEX COVERS 1 FIELD, 3RD INDEX COVERS 2 FIELDS)

Total time: 1.001291036605835 sec., per sample: 1.001291036605835e-06 sec., samples per second: 998710.628020589, total samples: 1000000

INSERT (RTREE, 1ST INDEX COVERS 1 FIELD)

Total time: 9.943094968795776 sec., per sample: 9.943094968795777e-05 sec., samples per second: 10057.230702696503, total samples: 100000

SELECT (RTREE, 1ST INDEX COVERS 1 FIELD, LIMIT = 5)

Total time: 11.716284990310669 sec., per sample: 0.00011716284990310669 sec., samples per second: 8535.128676256994, total samples: 100000

Running tests with pytest

pytest . --ignore=virtualenv -v

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

udb_py-0.0.3.tar.gz (36.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

udb_py-0.0.3-py3-none-any.whl (34.5 kB view details)

Uploaded Python 3

File details

Details for the file udb_py-0.0.3.tar.gz.

File metadata

  • Download URL: udb_py-0.0.3.tar.gz
  • Upload date:
  • Size: 36.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for udb_py-0.0.3.tar.gz
Algorithm Hash digest
SHA256 2762047504bc438092b21471556a4d926a7126dcc2e505d5f65a202a1092a04d
MD5 232d0fab1550e69d7376d2f64d342c5c
BLAKE2b-256 d6e70fb84fb7849c2b7172737de0186f7039d75889f66f2bbc969984849cdc7c

See more details on using hashes here.

File details

Details for the file udb_py-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: udb_py-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 34.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for udb_py-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b375bfff18d0517fba686f43315a9ae76b7c2b69fa1bb33f437ad6e91e4b9457
MD5 3da3bb9262c448500d9d7ad164ab5c7a
BLAKE2b-256 8b28265bc336daa859762e5133f5e07d824760493e00962d327fa37833766572

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page