The BMS Lake API

A FastAPI plugin that exposes your data lake as an API and supports multiple output formats, such as Parquet, CSV, JSON, and Excel, among others.

The lake API also contains a minimal security layer for convenience (Basic Auth), but you can also bring your own.

In contrast to roapi, we intentionally do not expose arbitrary SQL by default; instead, a config limits the possible queries. This makes it easy for you to control what happens to your data. If you want the SQL endpoint, you can enable it.

To run the app with the default config:

from fastapi import FastAPI
import bmsdna.lakeapi

app = FastAPI()
bmsdna.lakeapi.init_lakeapi(app)

To adjust the config, do something like this:

import dataclasses
import bmsdna.lakeapi

# Get the default startup config
def_cfg = bmsdna.lakeapi.get_default_config()
# Use dataclasses.replace to set the properties you want
cfg = dataclasses.replace(def_cfg, enable_sql_endpoint=True, data_path="tests/data")
# Enable it: the first parameter is the FastAPI instance, the second the basic config, the third the config of the tables
sti = bmsdna.lakeapi.init_lakeapi(app, cfg, "config_test.yml")

Installation

The PyPI package bmsdna-lakeapi can be installed like any Python package: pip install bmsdna-lakeapi

OpenAPI

Everything works with OpenAPI and FastAPI: you can add other FastAPI routes and use the /docs and /redoc endpoints.

Engine

By default, DuckDB is the query engine. Polars and Datafusion are also supported. The query engine can be defined on a route level and on a query level with the hidden parameter $engine="duckdb|datafusion|polars".

At the moment, DuckDB seems to have an edge and performs best. Some features, such as full-text search, are only available with DuckDB.
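As a client-side sketch of the query-level override, the snippet below builds a request URL that carries the hidden $engine parameter. The host, the route path /api/v1/test/fruits, and the helper name build_query_url are assumptions made for illustration; only the parameter name $engine and its three values come from the text above.

```python
from urllib.parse import urlencode

# Engine names taken from the section above.
SUPPORTED_ENGINES = ("duckdb", "datafusion", "polars")


def build_query_url(base_url: str, engine: str, **filters: str) -> str:
    """Return base_url with the filters and the hidden $engine parameter appended."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    # urlencode percent-encodes the leading "$" as %24.
    query = urlencode({**filters, "$engine": engine})
    return f"{base_url}?{query}"


# Hypothetical host and route; adjust to your own deployment.
url = build_query_url("http://localhost:8080/api/v1/test/fruits", "polars", fruits="banana")
print(url)
```

Validating the engine name on the client is optional; it simply catches typos before a request is sent.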

Default Security

By default, Basic Authentication is enabled. To add a user, run add_lakeapi_user YOURUSERNAME --yaml-file config.yml. This adds the user to your config YAML with an argon2-hashed password and prints the generated password. If you do not want this logic, you can override the username_retriver of the default config.
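On the client side, Basic Authentication means sending an Authorization header with every request. A minimal sketch using only the standard library follows; the helper name basic_auth_header and the placeholder password are illustrative, not part of the lake API.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header for HTTP Basic Authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials: use the user you added and the password that was printed.
headers = basic_auth_header("YOURUSERNAME", "generated-password")
print(headers)
```

Most HTTP clients can build this header for you; the sketch only shows what goes over the wire.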

Standalone Mode

If you just want to run this thing, you can run it with a webserver:

Uvicorn: uvicorn bmsdna.lakeapi.standalone:app --host 0.0.0.0 --port 8080

Gunicorn: gunicorn bmsdna.lakeapi.standalone:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:80

Adjust the HTTP options as needed. You also need to pip install uvicorn or gunicorn.

You can still use environment variables for configuration.

Environment Variables

CONFIG_PATH: The path of the config file, defaults to config.yml. If you want to split the config, you can specify a folder, too
DATA_PATH: The path of the data files, defaults to data. Paths in config.yml are relative to DATA_PATH
ENABLE_SQL_ENDPOINT: Set this to 1 to enable the SQL Endpoint
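To make the defaults concrete, here is a small sketch that reads the three variables with the fallbacks listed above. It is not the library's own code; the EnvSettings class and read_env_settings function are hypothetical names, and treating every value other than "1" as "off" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EnvSettings:
    config_path: str
    data_path: str
    enable_sql_endpoint: bool


def read_env_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Read the three documented variables, falling back to the documented defaults."""
    return EnvSettings(
        config_path=env.get("CONFIG_PATH", "config.yml"),
        data_path=env.get("DATA_PATH", "data"),
        # Only "1" is documented as enabling the endpoint; anything else counts as off here.
        enable_sql_endpoint=env.get("ENABLE_SQL_ENDPOINT", "0") == "1",
    )


print(read_env_settings({}))
print(read_env_settings({"DATA_PATH": "tests/data", "ENABLE_SQL_ENDPOINT": "1"}))
```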

Config File

By default, the application relies on a config file called config.yml being present at the root of your project.

The config file looks something like this, see also our test yaml:

tables:
  - name: fruits
    tag: test
    version: 1
    api_method:
      - get
      - post
    params:
      - name: cars
        operators:
          - "="
          - in
      - name: fruits
        operators:
          - "="
          - in
    datasource:
      uri: delta/fruits
      file_type: delta

  - name: fruits_partition
    tag: test
    version: 1
    api_method:
      - get
      - post
    params:
      - name: cars
        operators:
          - "="
          - in
      - name: fruits
        operators:
          - "="
          - in
      - name: pk
        combi:
          - fruits
          - cars
      - name: combi
        combi:
          - fruits
          - cars
    datasource:
      uri: delta/fruits_partition
      file_type: delta
      select:
        - name: A
        - name: fruits
        - name: B
        - name: cars

  - name: fake_delta
    tag: test
    version: 1
    allow_get_all_pages: true
    api_method:
      - get
      - post
    params:
      - name: name
        operators:
          - "="
      - name: name1
        operators:
          - "="
    datasource:
      uri: delta/fake
      file_type: delta

  - name: fake_delta_partition
    tag: test
    version: 1
    allow_get_all_pages: true
    api_method:
      - get
      - post
    params:
      - name: name
        operators:
          - "="
      - name: name1
        operators:
          - "="
    datasource:
      uri: delta/fake
      file_type: delta

  - name: "*" # We're lazy and want to expose all in that folder. Name MUST be * and nothing else
    tag: startest
    version: 1
    api_method:
      - post
    datasource:
      uri: startest/* # Uri MUST end with /*
      file_type: delta

  - name: fruits # But we want to overwrite this one
    tag: startest
    version: 1
    api_method:
      - get
    datasource:
      uri: startest/fruits
      file_type: delta
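The last two entries combine a "*" wildcard with an explicit table that overrides it. The sketch below illustrates that precedence on plain dictionaries. It is an assumption about the intended behaviour rather than the library's implementation: the function expand_tables is hypothetical, and the folder names under startest are made up.

```python
from typing import Iterable


def expand_tables(
    tables: list[dict], folders_by_root: dict[str, Iterable[str]]
) -> dict[tuple[str, str], dict]:
    """Expand "*" entries to one table per folder; explicit entries take precedence."""
    resolved: dict[tuple[str, str], dict] = {}
    # First pass: turn every wildcard entry into one entry per folder.
    for table in tables:
        if table["name"] != "*":
            continue
        uri = table["datasource"]["uri"]
        if not uri.endswith("/*"):
            raise ValueError("a wildcard uri must end with /*")
        root = uri[:-2]
        for folder in folders_by_root.get(root, []):
            datasource = {**table["datasource"], "uri": f"{root}/{folder}"}
            resolved[(table["tag"], folder)] = {**table, "name": folder, "datasource": datasource}
    # Second pass: explicitly named tables overwrite wildcard-generated ones.
    for table in tables:
        if table["name"] != "*":
            resolved[(table["tag"], table["name"])] = table
    return resolved


tables = [
    {"name": "*", "tag": "startest", "version": 1, "api_method": ["post"],
     "datasource": {"uri": "startest/*", "file_type": "delta"}},
    {"name": "fruits", "tag": "startest", "version": 1, "api_method": ["get"],
     "datasource": {"uri": "startest/fruits", "file_type": "delta"}},
]
# Hypothetical folder listing for the startest directory.
resolved = expand_tables(tables, {"startest": ["fruits", "fake"]})
for (tag, name), table in sorted(resolved.items()):
    print(tag, name, table["api_method"])
```

Under these assumptions, fruits keeps its explicit get method while every other folder inherits post from the wildcard entry.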

Partitioning for awesome performance

In order to use partitions, you can either:

partition by a column you filter on, which is the obvious choice
partition on a special column called columnname_md5_prefix_2, which means that you partition by the first two characters of the hex-encoded MD5 hash of columnname. If you then filter by columnname, this greatly reduces the number of files searched. The number of characters is up to you; we found two to be meaningful
partition on a special column called columnname_md5_mod_NRPARTITIONS, where the partition value is str(int(hashlib.md5(COLUMNNAME).hexdigest(), 16) % NRPARTITIONS). That might look a bit complicated, but it's not that hard :) You're just taking your MD5 hash modulo NRPARTITIONS, which lets you set the exact number of partitions. Filtering still happens on columnname

You must use deltalake to use partitions, and for now all partition columns must be of type str.
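The two special partition values can be computed with hashlib before the table is written. The following is a sketch; the helper names md5_prefix and md5_mod are illustrative, and encoding the column value as UTF-8 before hashing is an assumption that has to match whatever the reader side uses.

```python
import hashlib


def md5_prefix(value: str, length: int = 2) -> str:
    """First `length` characters of the hex-encoded MD5 hash of value."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]


def md5_mod(value: str, nr_partitions: int) -> str:
    """MD5 hash of value, read as a base-16 integer, modulo nr_partitions, as a str."""
    return str(int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % nr_partitions)


rows = [{"fruits": "banana"}, {"fruits": "apple"}]
for row in rows:
    # Column names follow the columnname_md5_prefix_2 / columnname_md5_mod_NRPARTITIONS pattern.
    row["fruits_md5_prefix_2"] = md5_prefix(row["fruits"])
    row["fruits_md5_mod_8"] = md5_mod(row["fruits"], 8)
print(rows)
```

Both values are returned as strings, in line with the requirement that partition columns are of type str.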

Even more features

Paging built-in, you can use limit/offset to control what you receive
Full-text Search using DuckDB's Full Text Search Feature
jsonify_complex parameter to turn structs/lists into JSON if the client cannot deal with structs/lists
Metadata endpoints to retrieve data types, string lengths and more
Expose whole folders easily by using a "*" wildcard in both the name and the datasource.uri config, see sample in above config
Good test coverage
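As an illustration of the limit/offset paging mentioned in the list above, the sketch below walks through all pages of a result. The fetch function is a stand-in for a real HTTP call, and stopping at the first short page is one possible client-side strategy, not something the API prescribes.

```python
from typing import Callable, Iterator

Row = dict
FetchPage = Callable[[int, int], list[Row]]


def iter_rows(fetch_page: FetchPage, limit: int = 100) -> Iterator[Row]:
    """Yield all rows by requesting pages of `limit` rows at increasing offsets."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:  # A short (or empty) page means nothing is left.
            return
        offset += limit


# Stand-in for an HTTP request that passes limit and offset as query parameters.
DATA = [{"id": i} for i in range(7)]


def fake_fetch(limit: int, offset: int) -> list[Row]:
    return DATA[offset:offset + limit]


rows = list(iter_rows(fake_fetch, limit=3))
print(len(rows))
```

If the total row count is an exact multiple of limit, this loop issues one extra request that returns an empty page before it stops.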
