Generic backend for dataset annotation.

Gonk

Gonk is a backend for building and versioning deep learning datasets. Its goal is to do the heavy lifting for storage, validation, and approval workflows to make labeling high-quality datasets more efficient.

Features

  • Works with any file type
  • Strongly defined annotation formats using JSON Schema
  • Complete dataset version history through event sourcing
  • Change approval to enable collaboration with untrusted third parties

In Progress

  • Point-in-time release tagging
  • Reproducible dataset releases
  • Cloning the full dataset history
  • Common annotation schemas
  • Example clients

Installation

Installing from PyPI or from source, as described below, installs the packages as well as the gonk-api application.

Requirements

These should be installed automatically, but if you are having trouble, Gonk requires Flask, jsonschema, PyNaCl, and click. The API tests also require requests. All of these are listed in setup.py.

We use a fancy feature from the typing library (typing.Self), so Python 3.11 or higher is required.
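If an install fails on an older interpreter, a quick version check can confirm the cause. The following is a minimal sketch, not part of Gonk; the names MINIMUM_VERSION and meets_minimum are made up for illustration.

```python
import sys

# The README states that Python 3.11+ is required because of typing.Self.
MINIMUM_VERSION = (3, 11)


def meets_minimum(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least 3.11."""
    return tuple(version_info[:2]) >= MINIMUM_VERSION


if __name__ == "__main__":
    if meets_minimum():
        print("Python version is new enough for typing.Self")
    else:
        print("Python 3.11 or higher is required; found", sys.version.split()[0])
```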

PyPI

pip install gonk-ai

Source

git clone https://github.com/ComputeHeavy/gonk.git
cd gonk
pip install .

Running

The command gonk-api will run the Flask API.

To initialize, go to the folder you would like everything stored in and use -

gonk-api init --username USERNAME

You can manage users with -

gonk-api users list
gonk-api users add USERNAME
gonk-api users rekey USERNAME

When you add a user, their API key is printed once. Give it to them. If they lose it, or you want to disable their access, use rekey.

You can run the server with -

gonk-api run

This will spawn the Flask application on localhost:5000. It runs in Flask's default development mode and should not be used for production, but it is probably suitable for individuals and small teams. We will have a more robust solution in the future.
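Scripts that depend on a running instance (the API tests, for example) may want to wait until the server accepts connections. This is a small sketch using only the standard library; wait_for_server is a hypothetical helper, not something Gonk provides, and it only checks that the TCP port is open, not that the API is healthy.

```python
import socket
import time


def wait_for_server(host: str = "localhost", port: int = 5000,
                    timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not up yet; retry shortly
    return False
```

Calling wait_for_server() with no arguments checks the default address that gonk-api run uses.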

Documentation

The docs cover the API as well as all modules: gonk-ai.readthedocs.io

Design

The first two files to look at are interfaces.py and events.py. The three main interfaces are the RecordKeeper, Depot, and State. The RecordKeeper is the event storage; it acts as a linear log of events. The Depot stores objects (files and annotations). Together, those two hold the complete history of the dataset. The State acts as the application service: it validates and processes events and maintains the current state of the dataset.
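To make the division of responsibilities concrete, here is a minimal in-memory sketch of the three roles. It is not Gonk's code: the ObjectCreated event, the method names (add, events, write, read, consume), and the in-memory classes are all assumptions made for illustration, and in Gonk the State is itself an interface rather than the single concrete class shown here. See interfaces.py and events.py for the real definitions.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ObjectCreated:
    """Hypothetical event recording that an object joined the dataset."""
    object_id: str


class RecordKeeper(Protocol):
    """Append-only, linear log of events."""
    def add(self, event: ObjectCreated) -> None: ...
    def events(self) -> list[ObjectCreated]: ...


class Depot(Protocol):
    """Storage for object bytes (files and annotations)."""
    def write(self, object_id: str, data: bytes) -> None: ...
    def read(self, object_id: str) -> bytes: ...


class MemoryRecordKeeper:
    def __init__(self) -> None:
        self._log: list[ObjectCreated] = []

    def add(self, event: ObjectCreated) -> None:
        self._log.append(event)

    def events(self) -> list[ObjectCreated]:
        return list(self._log)


class MemoryDepot:
    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def write(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data

    def read(self, object_id: str) -> bytes:
        return self._objects[object_id]


class State:
    """Validates each event, then stores it and updates the current view."""

    def __init__(self, record_keeper: RecordKeeper, depot: Depot) -> None:
        self.record_keeper = record_keeper
        self.depot = depot
        self.object_ids: set[str] = set()

    def consume(self, event: ObjectCreated, data: bytes) -> None:
        # Validation: reject an object that already exists.
        if event.object_id in self.object_ids:
            raise ValueError(f"duplicate object: {event.object_id}")
        self.depot.write(event.object_id, data)
        self.record_keeper.add(event)
        self.object_ids.add(event.object_id)
```

Because the log and the object store hold everything, a fresh State could in principle be rebuilt by replaying the RecordKeeper's events in order.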

The file integrity.py has two methods for maintaining event integrity. The default is hash-chaining, where the event being added is serialized to bytes and hashed together with the previous event's hash. The other method is signatures, which will play a larger role in a peer-to-peer implementation.
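The general idea of hash-chaining can be sketched as follows. This is an illustration of the technique, not the code in integrity.py: the choice of SHA-256, the JSON serialization, the all-zero starting hash, and every function name here are assumptions.

```python
import hashlib
import json

# Assumed starting value for the first event in the chain.
GENESIS_HASH = bytes(32)


def chain_hash(previous_hash: bytes, event: dict) -> bytes:
    """Hash the serialized event together with the previous event's hash."""
    # Deterministic serialization (assumed format) so equal events hash equally.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(previous_hash + payload).digest()


def build_chain(events: list[dict]) -> list[bytes]:
    """Return one hash per event, each depending on all earlier events."""
    hashes = []
    previous = GENESIS_HASH
    for event in events:
        previous = chain_hash(previous, event)
        hashes.append(previous)
    return hashes


def verify_chain(events: list[dict], hashes: list[bytes]) -> bool:
    """Recompute the chain and compare it with the stored hashes."""
    return len(events) == len(hashes) and build_chain(events) == hashes
```

Since each hash covers the previous one, altering an early event changes every hash after it, which is what makes tampering with the log detectable.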

There are currently two sets of implementations: a file system backed Depot and RecordKeeper, and a SQLite backed RecordKeeper and State. An immediate plan is to add PostgreSQL and S3-compatible (read: R2) implementations for a higher-scale (hosted) service.
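As a rough picture of what a SQLite backed event log can look like, the sketch below stores events as JSON rows in insertion order. The class name SqliteEventLog, the table layout, and the add and read methods are hypothetical and are not taken from Gonk's SQLite implementation.

```python
import json
import sqlite3


class SqliteEventLog:
    """Hypothetical append-only event log; not Gonk's actual schema."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )

    def add(self, event: dict) -> int:
        """Append an event and return its position in the log."""
        with self.conn:  # commits on success, rolls back on error
            cursor = self.conn.execute(
                "INSERT INTO events (body) VALUES (?)", (json.dumps(event),)
            )
        return cursor.lastrowid

    def read(self, start: int = 0) -> list[dict]:
        """Return events with id greater than start, in log order."""
        rows = self.conn.execute(
            "SELECT body FROM events WHERE id > ? ORDER BY id", (start,)
        )
        return [json.loads(body) for (body,) in rows]
```

Passing a file path instead of the default ":memory:" would persist the log on disk.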

Tests

To test the core modules, run python test.py in test/core. For the API tests, initialize the API according to the README in the API test directory, start an instance, then run python test.py there.

Download files

Source Distribution

gonk-ai-0.1.4.tar.gz (36.7 kB view details)

Uploaded Source

Built Distribution

gonk_ai-0.1.4-py3-none-any.whl (40.0 kB view details)

Uploaded Python 3

File details

Details for the file gonk-ai-0.1.4.tar.gz.

File metadata

  • Download URL: gonk-ai-0.1.4.tar.gz
  • Upload date:
  • Size: 36.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for gonk-ai-0.1.4.tar.gz
Algorithm Hash digest
SHA256 08d8777bb1c363be87906995f19efc616915efb8fd04a983a8e79826b9adc024
MD5 a158638a4b1b386795c2626448f9c339
BLAKE2b-256 97b1f585512553c8a178327f0c70123df42b944a9ec5a224eee8d58c155162e9

File details

Details for the file gonk_ai-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: gonk_ai-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 40.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for gonk_ai-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 a045d32bc9f20ce21c7f536e4d82781929e704f0173b6a96c5fb8b4d09563628
MD5 7ffd5464698e648a55a2353df8af18cc
BLAKE2b-256 933279e078ed41fb94791f101ce42d2cfe9b6057d1b56e273bbb2a00494fdca6
