Skip to main content

Python context manager to query a SQLite file stored on S3

Project description

sqlite-s3-query CircleCI Test Coverage

Python context managers to query a SQLite file stored on S3. It uses multiple HTTP range requests per query to avoid downloading the entire file, and so is suitable for large databases.

All queries using the same instance of the context will query the same version of the database object in S3. This means that a context is roughly equivalent to a REPEATABLE READ transaction, and queries should complete succesfully even if the database is replaced concurrently by another S3 client. Versioning must be enabled on the S3 bucket.

SQL statements that write to the database are not supported. If you're looking for a way to write to a SQLite database in S3, try sqlite-s3vfs.

Inspired by phiresky's sql.js-httpvfs, and dacort's Stack Overflow answer.

Installation

pip install sqlite_s3_query

The libsqlite3 binary library is also required, but this is typically already installed on most systems. The earliest version of libsqlite3 known to work is 2012-12-12 (3.7.15).

Usage

For single-statement queries, the sqlite_s3_query function can be used.

from sqlite_s3_query import sqlite_s3_query

with sqlite_s3_query(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query:

    with query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):
        for row in rows:
            print(row)

    # Exactly the same results, even if the object in S3 was replaced
    with query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):
        for row in rows:
            print(row)

For multi-statement queries, the sqlite_s3_query_multi function can be used.

from sqlite_s3_query import sqlite_s3_query_multi

with sqlite_s3_query_multi(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query_multi:
    for (columns, rows) in query_multi('''
            SELECT * FROM my_table_a WHERE my_column_a = ?;
            SELECT * FROM my_table_b WHERE my_column_b = ?;
    ''', params=('my-value-a','my-value-b')):
        for row in rows:
            print(row)

If in your project you query the same object from multiple places, functools.partial can be used to make an interface with less duplication.

from functools import partial
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

Credentials

The AWS region and the credentials are taken from environment variables, but this can be changed using the get_credentials parameter. Below shows the default implementation of this that can be overriden.

import os

def get_credentials(_):
    return (
        os.environ['AWS_REGION'],
        os.environ['AWS_ACCESS_KEY_ID'],
        os.environ['AWS_SECRET_ACCESS_KEY'],
        os.environ.get('AWS_SESSION_TOKEN'),  # Only needed for temporary credentials
    )

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_credentials=get_credentials,
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

How to use this to fetch credentials for the IAM role associated with an ECS container is shown in the example below.

import contextlib
import os
import threading
import httpx

def GetECSCredentials():
    aws_access_key_id, aws_secret_access_key, aws_session_token = None, None, None
    expiration = datetime.datetime.fromtimestamp(0)
    lock = threading.Lock()
    aws_region = os.environ['AWS_REGION']
    creds_path = os.environ['AWS_CONTAINER_CREDENTIALS_RELATIVE_URI']

    @contextlib.contextmanager
    def lock_with_timeout():
        lock.acquire(timeout=10)
        try:
            yield
        finally:
            lock.release()

    def get_credentials(now):
        nonlocal aws_access_key_id, aws_secret_access_key, aws_session_token
        nonlocal expiration

        # If this cannot be called from multiple threads at the same time, the lock can be ommitted
        with lock_with_timeout():
            if now > expiration:
                creds = httpx.get(f'http://169.254.170.2{creds_path}').json()
                aws_access_key_id = creds['AccessKeyId']
                aws_secret_access_key = creds['SecretAccessKey']
                aws_session_token = creds['Token']
                expiration = datetime.datetime.strptime(creds['Expiration'], '%Y-%m-%dT%H:%M:%SZ')

        return aws_region, aws_access_key_id, aws_secret_access_key, aws_session_token

    return get_credentials

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_credentials=GetECSCredentials(),
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

HTTP Client

The HTTP client can be changed by overriding the the default get_http_client parameter, which is shown below.

from functools import partial
import httpx
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_http_client=lambda: httpx.Client(),
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

Location of libsqlite3

The location of the libsqlite3 library can be changed by overriding the get_libsqlite3 parameter.

from ctypes import cdll
from ctypes.util import find_library
from functools import partial
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_libsqlite3=lambda: cdll.LoadLibrary(find_library('sqlite3'))
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqlite-s3-query-0.0.66.tar.gz (7.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sqlite_s3_query-0.0.66-py3-none-any.whl (8.4 kB view details)

Uploaded Python 3

File details

Details for the file sqlite-s3-query-0.0.66.tar.gz.

File metadata

  • Download URL: sqlite-s3-query-0.0.66.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.9.7

File hashes

Hashes for sqlite-s3-query-0.0.66.tar.gz
Algorithm Hash digest
SHA256 7d54f790ceeaf88e6ddaaf22a9329455eff93f12f56b5c2d6791f0f24bef71aa
MD5 25499c7525cab1bd31f9f6a3b003d0ff
BLAKE2b-256 cfdc2310d9ab921864f2f9ba743154696f7eeaf8724a792c1cb1e5ce602da44d

See more details on using hashes here.

File details

Details for the file sqlite_s3_query-0.0.66-py3-none-any.whl.

File metadata

  • Download URL: sqlite_s3_query-0.0.66-py3-none-any.whl
  • Upload date:
  • Size: 8.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.9.7

File hashes

Hashes for sqlite_s3_query-0.0.66-py3-none-any.whl
Algorithm Hash digest
SHA256 fb99e5fbf4b1f2f9fc7f49e349d4b8b1ab144e96ed703210eaad8aa73476c767
MD5 f20ce232d86d390640360c36a843561c
BLAKE2b-256 d8317c069ce1782bd61c480cc077f375aa2f034740f70aa44282ed463758793e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page