Skip to main content

A wrapper around ArangoDB CRUD HTTP API

Project description

arango_crud

This respository wraps the basic CRUD operations on ArangoDB for practical use. This is not an official library; the official python library is pyArango. The main reason to use this over just requests is authorization and server failure back-off. The main reason to use this over pyArango is thread-safety, simpler interfaces, a more narrow focus, pep8 naming conventions, and complete support for either environment variable configuration or code-as-configuration.

The main reason to use pyArango over arango_crud is field validation and access to AQL. If you want to use ArangoDB as a database, use pyArango or similar. If you want to use ArangoDB as a disk-based cache, use arango_crud or similar.

Note: This package recommends a time-to-live semantic. The TTL may be set to "-1" in environment variables to be disabled, or "None" in code to be disabled. A TTL index is only created when collections are initialized, so if this library is used with TTL disabled and then TTL is enabled, one must manually add the TTL indexes. Besides standard TTL usages, using a TTL means that if there was a bug that leaked keys which was since patched, those keys won't stay around forever. Furthermore, it means a small amount of key leakage, such as through extremely unlikely race conditions which would be expensive in either performance or developer time to fix, is not harmful to the long-term health of the project. https://www.arangodb.com/arangodb-training-center/ttl-indexes/

Note: This is not intended to provide much configurability for creating databases or collections, providing only sane defaults for this particular use-case. It's recommended to use a migration structure where databases and collections are only initialized once and to call the appropriate HTTP endpoint directly: https://www.arangodb.com/docs/stable/http/collection-creating.html

Usage

Installation

Supports python 3.7 or higher.

pip install arango_crud

Initialize

Code-as-configuration BasicAuth

from arango_crud import (
    Config, BasicAuth, RandomCluster, StepBackOffStrategy
)

config = Config(
    cluster=RandomCluster(urls=['http://127.0.0.1:8529']),  # see Cluster Styles
    timeout_seconds=3,
    back_off=StepBackOffStrategy([0.1, 0.5, 1, 1, 1]),  # see Back Off Strategies
    auth=BasicAuth(username='root', password=''),
    ttl_seconds=31622400
)

Code-as-configuration JWT

from arango_crud import (
    Config, JWTAuth, JWTDiskCache, RandomCluster, StepBackOffStrategy
)

config = Config(
    cluster=RandomCluster(urls=['http://localhost:8529']),
    timeout_seconds=3,
    back_off=StepBackOffStrategy(steps=[0.1, 0.5, 1, 1, 1]),
    ttl_seconds=31622400,
    auth=JWTAuth(
        username='root',
        password='',
        cache=JWTDiskCache(  # See JWT Caches
            lock_file='.arango_jwt.lock',
            lock_time_seconds=10,
            store_file='.arango_jwt'
        )
    )
)

# encouraged for easier performance tracing, not required. happens on first
# request otherwise. Fetches the JWT token if it does not exist.
config.prepare()

Environment variables BasicAuth

test.py

from arango_crud import env_config

config = env_config()
config.prepare() # recommended, not required

run.sh

#!/usr/bin/env bash
# Cluster urls are separated by a comma
export ARANGO_CLUSTER=http://localhost:8529
export ARANGO_CLUSTER_STYLE=random
export ARANGO_TIMEOUT_SECONDS=3
export ARANGO_BACK_OFF=step
export ARANGO_BACK_OFF_STEPS=0.1,0.5,1,1,1
export ARANGO_TTL_SECONDS=31622400
export ARANGO_AUTH=basic
export ARANGO_AUTH_USERNAME=root
export ARANGO_AUTH_PASSWORD=
python test.py

Environment variables JWT

test.py

from arango_crud import env_config

config = env_config()

run.sh

#!/usr/bin/env bash
# Cluster urls are separated by a comma
export ARANGO_CLUSTER=http://localhost:8529
export ARANGO_CLUSTER_STYLE=random
export ARANGO_TIMEOUT_SECONDS=3
export ARANGO_BACK_OFF=step
export ARANGO_BACK_OFF_STEPS=0.1,0.5,1,1,1
export ARANGO_TTL_SECONDS=31622400
export ARANGO_AUTH=jwt
export ARANGO_AUTH_USERNAME=root
export ARANGO_AUTH_PASSWORD=
export ARANGO_AUTH_CACHE=disk
export ARANGO_AUTH_CACHE_LOCK_FILE=.arango_jwt.lock
export ARANGO_AUTH_CACHE_LOCK_TIME_SECONDS=10
export ARANGO_AUTH_CACHE_STORE_FILE=.arango_jwt
python test.py

CRUD

To make these runnable environment variables must be set and ArangoDB needs to be reachable. Here are the configurations for ArangoDB running locally on default development settings:

Windows:

SET ARANGO_CLUSTER=http://localhost:8529
SET ARANGO_CLUSTER_STYLE=random
SET ARANGO_TIMEOUT_SECONDS=3
SET ARANGO_BACK_OFF=step
SET ARANGO_BACK_OFF_STEPS=0.1,0.5,1,1,1
SET ARANGO_TTL_SECONDS=31622400
SET ARANGO_AUTH=basic
SET ARANGO_AUTH_USERNAME=root
SET ARANGO_AUTH_PASSWORD=

*Nix:

#!/usr/bin/env bash
export ARANGO_CLUSTER=http://localhost:8529
export ARANGO_CLUSTER_STYLE=random
export ARANGO_TIMEOUT_SECONDS=3
export ARANGO_BACK_OFF=step
export ARANGO_BACK_OFF_STEPS=0.1,0.5,1,1,1
export ARANGO_TTL_SECONDS=31622400
export ARANGO_AUTH=basic
export ARANGO_AUTH_USERNAME=root
export ARANGO_AUTH_PASSWORD=
from arango_crud import env_config
import time

config = env_config()
config.prepare()

db = config.database('my_db')
db.create_if_not_exists()
coll = db.collection('users')
coll.create_if_not_exists()

# The simplest interface
coll.create_or_overwrite_doc('tj', {'name': 'TJ'})
coll.read_doc('tj') # {'name': 'TJ'}
coll.force_delete_doc('tj') # True

# non-expiring
coll.create_or_overwrite_doc('tj', {'name': 'TJ'}, ttl=None)
coll.force_delete_doc('tj')

# custom expirations with touching. Note that touching a document is not
# a supported atomic operation on ArangoDB and is hence faked with
# read -> compare_and_swap. Presumably if the CAS fails the document was
# touched recently anyway.
coll.create_or_overwrite_doc('tj', {'name': 'TJ'}, ttl=30) # None
coll.touch_doc('tj', ttl=60) # True
coll.force_delete_doc('tj') # True

# Alternative interface. For anything except one-liners, usually nicer.
doc = coll.document('tj')
doc.body['name'] = 'TJ'
doc.create() # True
doc.body['note'] = 'Pretty cool'
doc.compare_and_swap() # True

# We may use etags to avoid redownloading an unchanged document, but be careful
# if you are modifying the body.

# Happy case:
doc2 = coll.document('tj')
doc2.read() # loads {'name': 'TJ', 'note': 'Pretty cool'} from network

doc.read_if_remote_newer() # 304 not modified, returns False
doc2.read_if_remote_newer() # 304 not modified, returns False

doc.body['note'] = 'bar'
doc.compare_and_swap()
doc.read_if_remote_newer() # 304 not modified, returns False
doc2.read_if_remote_newer() # loads {'name': 'TJ', 'note': 'bar'} from network, returns True

# Where it can get dangerous
doc.body['note'] = 'foo'
print(doc.body) # {'name': 'TJ', 'note': 'foo'}
doc.read() # always a complete download
print(doc.body) # {'name': 'TJ', 'note': 'bar'}
doc.read_if_remote_newer() # no changes on server since last read; 304 not modified, returns False
print(doc.body) # {'name': 'TJ', 'note': 'bar'}
doc.body['note'] = 'foo'
print(doc.body) # {'name': 'TJ', 'note': 'foo'}
doc.read_if_remote_newer() # no changes on server since last read; 304 not modified, returns False
print(doc.body) # {'name': 'TJ', 'note': 'foo'}
doc.read()
print(doc.body) # {'name': 'TJ', 'note': 'bar'}

doc.compare_and_delete() # True


# Simple caching
for i in range(2):
    doc = coll.document('tj')
    hit = doc.read()
    if hit:
        doc.compare_and_swap() # refreshes TTL, usefulness depends
    else:
        # .... expensive computation ....
        doc.body = {'name': 'TJ', 'note': 'Pretty cool'}
        doc.create_or_overwrite()

    print(f'cached value (loop {i + 1}/2) (hit: {hit}): {doc.body}')

The following is in a separate code-block and is commented out to prevent accidentally copy+paste into somewhere it should not be pasted. When running tests it's helpful to cleanup the collections and databases afterward. It's encouraged that if you do not need to delete collections and databases on production these operations are disabled to help prevent developer error, which is done by setting ARANGO_DISABLE_DATABASE_DELETE and ARANGO_DISABLE_COLLECTION_DELETE to true These environment variables are treated as true unless explicitly set to false. This is not a substitute for good backups and should not be considered a security feature.

# coll.force_delete()
# db.force_delete()

Contributing

This package adheres to pep8 guidelines unless an exception is listed in .flake8. Comments are explicitly line-broken at 80 characters. Code complexity measures (AbcComplexity, etc) are not used. This measures code coverage and a build of below 70% code coverage is considered failing. Note that the word "unit test" is avoided - if it's possible to test a line of code without mocking or accessing private variables that is preferred. PRs which reduce code coverage must include an explanation of why.

The examples directory should not contain non-functional lines of code, so instead of

bar = foo()
if bar is None:
    print('Foo gave none!') # prints Foo
else:
    print('Something went wrong!')

it should be the easier to read assert variant, which plays friendlier with automated testing that the examples actually work:

bar = foo()
assert bar is None

Hence any PR where the coverage in the examples directory is less than 98% when running coverage run --rcfile=.coveragerc_examples examples/run_all.py will have changes requested. The lines which do not run should be only due to random chance.

This repository is focused specifically on using ArangoDB as a disk-based cache. Functionality which doesn't support that use-case will have their PR closed with the recommendation that they fork. So AQL or graph support would likely be closed, but (bulk) get/set operations or concurrency-safe patches will likely be merged.

Inheritance is to be avoided, preferring delegation which respects contracts. Interfaces are not included in this, where an interface is a class where all the functions simply raise NotImplementedError and there is no constructor.

Setup Development (Windows)

Install ArangoDB on default development settings.

python -m venv venv
python -m pip install --upgrade pip
"venv/Scripts/activate.bat"
python -m pip install -r dev_requirements.txt
"scripts/windows_dev_env.bat"
coverage run -m unittest discover -s tests
coverage combine
coverage report
coverage run --rcfile=.coveragerc_examples examples/run_all.py
coverage report

Setup Development (*Nix)

docker pull arangodb/arangodb
docker run -e ARANGO_NO_AUTH=1 -p 8529/tcp arangodb/arangodb arangod --server-endpoint tcp://0.0.0.0:8529
python -m venv venv
python -m pip install --upgrade pip
. venv/bin/activate
. scripts/nix_dev_env.sh
python -m pip install -r dev_requirements.txt
# This pulls from .coveragerc to handle multiprocessing
coverage run -m unittest discover -s tests
coverage combine
coverage report
coverage run --rcfile=.coveragerc_examples examples/run_all.py
coverage report

Cluster Styles

When working with an ArangoDB cluster, it's important that the clients distribute their requests amongst the various coordinators. The request styles supported are random and weighted-random. Round-robin and similar are avoided as they cannot be made thread-safe and performant without context.

Random

A random url is selected from the cluster for each request with equal probability among all urls.

Weighted Random

A random node in the cluster is selected on each request, except there may be a different probability for different urls. This is useful if, for example, one of the coordinators is running on a larger server than the rest.

Example:

from arango_crud import WeightedRandomCluster

cluster = WeightedRandomCluster(
    urls=['http://localhost:8529', 'http://localhost:8530', 'http://localhost:8531'],
    weights=[1, 2, 1]
)

This will select port 8529 1/4 of the time, 8530 1/2 of the time, and 8531 1/4 of the time. If one prefers to set the exact percentages just ensure the weights sum to one (i.e., 0.25, 0.5, 0.25)

Example environment variables:

#!/usr/bin/env bash
export ARANGO_CLUSTER=http://localhost:8529,http://localhost:8530,http://localhost:8531
export ARANGO_CLUSTER_STYLE=weighted-random
export ARANGO_CLUSTER_WEIGHTS=1,2,1

Alternatives to Environment Variables

Although environment variables are sometimes extremely convenient, they can also be painful in other development environments. One can painlessly switch these out for their preferred storage mechanism since env_config accepts a dictionary which it uses to load variables from. Note that env_config will exclusively use that dictionary - it will not fall back and use an environment variable if something is missing.

The only caveat is that for simplicity of development and to reuse the same documentation, the keys need to be screaming snake case and it will not make use of nesting. If one prefers they can massage the data into this format after loading to get more conventional looking configuration files. One can also simply massage the data into the arguments for Config directly.

arango_config.json

{
    "ARANGO_CLUSTER": "http://localhost:8529,http://localhost:8530,http://localhost:8531",
    "ARANGO_CLUSTER_STYLE": "weighted-random",
    "__comment": "... see src/arango_crud/env_config.py for complete argument docs ..."
}

Which allows loading as follows:

from arango_crud import env_config
import json

with open('arango_config.json') as fin:
    cfg = json.load(fin)

arango_config = env_config(cfg)

Server Failures

When a request fails due to a server-side issue it's usually desirable to try again on a new coordinator. A small sleep is also helpful to avoid suddenly massively spiking traffic to the coordinators whenever they hiccup. This supports only a step-back-off policy. If the steps are [0.1, 0.5, 1] then on the first server error this waits 0.1 seconds then tries again. If that also fails this waits 0.5 seconds then tries again. If that fails this waits 1 second then tries again. If that fails, an error is raised.

JWT Locking and Store

It's usually not a good idea to create a lot of new tokens when a client is misbehaving, as token generation is generally meant to be expensive in order to be secure. Hence JWT is necessarily stateful on the Config - rather than just being able to create network requests we first need to fetch the JWT. Furthermore, we may need to refresh the token on arbitrary requests.

The recommended way to handle JWT's cache is JWTDiskCache. A file will contain the JWT and some metadata about it, which will be accessed in a safe way for even highly concurrent environments, meaning that every instance running arango_crud on the same machine using the same config will share JWT tokens and will only create/renew the token once per renewal period. This overhead is extremely minor for non-concurrent environments.

If you're very confident that JWT generation is not going to be a significant source of load and there is no multithreading, a naive approach can be enabled with the cache style None. See the examples jwt_disk_example.py and jwt_none_example.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arango_crud-1.0.5.tar.gz (31.4 kB view hashes)

Uploaded Source

Built Distribution

arango_crud-1.0.5-py3-none-any.whl (30.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page