
Tarchia - Metadata Catalog


Tarchia

Tarchia is a table format for huge analytic tables.

Resource        Location
Source Code     https://github.com/mabel-dev/tarchia
Documentation   https://github.com/mabel-dev/tarchia
Download        https://github.com/mabel-dev/tarchia

Terminology

  • Catalog - A collection of tables.
  • Data File - Files that contain the rows of the table.
  • Manifest - Files that list and describe data files in the table.
  • Metadata - Information used to manage and describe tables.
  • Owner - The namespace a table belongs to.
  • Schema - The structure defining the columns of the table.
  • Snapshot - The state of the table at a specific point in time.
  • Table - A dataset stored in a structured and managed way.

Physical Structure

table/
 |- metadata
 |   |- manifests/
 |   |   +- manifest-00000000.avro
 |   +- snapshots/
 |       +- snapshot-00000000.json
 +- data/
     +- year=2000/
         +- month=01/
             +- day=01/
                 +- as_at_0000/
                     +- data-0000-0000.parquet
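
As a rough sketch of how a writer might compose that layout (the partition scheme is taken from the tree above; the zero-padding widths and the helper name are assumptions for illustration only):

from datetime import datetime, timezone

def data_file_path(table_root: str, event_time: datetime, as_at: int, file_index: int) -> str:
    # Compose a data file path matching the year/month/day/as_at layout shown above.
    return (
        f"{table_root}/data/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
        f"as_at_{as_at:04d}/data-{file_index:04d}-0000.parquet"
    )

print(data_file_path("table", datetime(2000, 1, 1, tzinfo=timezone.utc), 0, 0))
# table/data/year=2000/month=01/day=01/as_at_0000/data-0000-0000.parquet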

Metadata Catalog

flowchart TD
    CATALOG[(Catalog)] --> SNAPSHOT(Snapshot)
    CATALOG  --> SCHEMA(Schema)
    subgraph  
        MAN_LIST --> MANIFEST(Manifest)
        SNAPSHOT --> SCHEMA
        SNAPSHOT --> MAN_LIST(Manifest List)
    end
    MANIFEST --> DATA(Data Files)

The Catalog, hosted in a datastore such as Firestore, contains references to Schemas and Snapshots.

When a table is created, it is created with a schema but no snapshot (it doesn't have any data yet).

When a table is written to, the writer should obtain the latest schema (which may differ from the schema in the latest snapshot, if the schema has been updated since), write the data, and save a new snapshot.

Writers use a transaction system to ensure datasets/updates are complete before updating the catalog.

Clashes are managed by ensuring that the version (latest snapshot) of the dataset at the start of the transaction matches the version at the end of the transaction; if they don't match, someone else has made a change and the transaction should fail or require a hard override.
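
A minimal sketch of that check, assuming a hypothetical catalog client with get_latest_snapshot_id and commit operations (these names are illustrative, not Tarchia's actual API):

class ConcurrentUpdateError(Exception):
    """Another writer moved the table on while this transaction was in flight."""

def commit_transaction(catalog, table, snapshot_at_start, new_snapshot, hard_override=False):
    # Re-read the latest snapshot immediately before committing.
    latest = catalog.get_latest_snapshot_id(table)
    if latest != snapshot_at_start and not hard_override:
        # Someone else committed during our transaction - fail rather than clobber their change.
        raise ConcurrentUpdateError(
            f"{table} moved from {snapshot_at_start} to {latest} during the transaction"
        )
    catalog.commit(table, new_snapshot)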

Streaming datasets add new files to the dataset and create a new manifest. Streaming data cannot change schemas beyond adding and removing columns and limited type conversions (e.g. int->float); larger changes are different datasets.
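
A rough illustration of that rule; only int->float is named above, so the permitted-widenings set below is otherwise an assumption:

# Type changes treated as compatible for streaming writes; only INTEGER -> DOUBLE
# is stated above, anything further would be an assumption.
PERMITTED_WIDENINGS = {("INTEGER", "DOUBLE")}

def streaming_schema_change_allowed(old_schema: dict, new_schema: dict) -> bool:
    """old_schema and new_schema map column name -> type name.

    Adding and removing columns is fine; a changed type must be a permitted widening.
    """
    for column in old_schema.keys() & new_schema.keys():
        old_type, new_type = old_schema[column], new_schema[column]
        if old_type != new_type and (old_type, new_type) not in PERMITTED_WIDENINGS:
            return False  # a larger change means a different dataset
    return True

assert streaming_schema_change_allowed({"id": "INTEGER"}, {"id": "DOUBLE", "note": "VARCHAR"})
assert not streaming_schema_change_allowed({"id": "VARCHAR"}, {"id": "INTEGER"})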

When a table is read, we get the schema and the manifest. Each snapshot can only have one schema.

The catalog references the latest schema, latest snapshot and key information about the table.
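
As an illustration only, an entry in the catalog might carry something like the following (the field names are assumptions, not Tarchia's actual record layout):

catalog_entry = {
    "owner": "example_owner",                # namespace the table belongs to
    "name": "logs",
    "latest_schema": "schema-00000003",      # pointer to the current schema
    "latest_snapshot": "snapshot-00000042",  # pointer to the current snapshot; None for a new, empty table
    "location": "gs://bucket/logs/",         # where the data and metadata live
}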

Manifests are limited to 2048 rows (aiming for most manifest files to be under 2 MB so they fit in cache); when a manifest exceeds this limit it is split and a Manifest List is created. Manifests and Manifest Lists use B-Tree-style management to help ensure efficient pruning.

B-Tree manifests create read and write overheads when accessing and updating. Assuming about 15k rows per data file: a 1 million row dataset fits in a single manifest, a 1 billion row dataset needs 33 manifests (1 root and 1 layer with 32 children), and a 1 trillion row dataset needs about 32,568 manifests across three layers.

A 1 trillion row dataset's 32,568 manifests would be about 16 GB of data; this data can be read in parallel, reducing the time to read and process it all. Pruning would very quickly reduce the reads - eliminating just one entry from the first layer avoids reading thousands of manifests. (With data files of 50k rows, the same dataset would need only about 9,770 manifests.)
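
Those figures can be reproduced with a quick back-of-the-envelope calculation, assuming 2048 entries per manifest and the stated rows per data file:

import math

def manifest_counts(total_rows: int, rows_per_file: int, fanout: int = 2048):
    """Count the manifests needed per B-Tree layer for a dataset of total_rows."""
    nodes = math.ceil(total_rows / rows_per_file)   # number of data files (leaf entries)
    layers = []
    while nodes > 1 or not layers:
        nodes = math.ceil(nodes / fanout)           # manifests needed at this layer
        layers.append(nodes)
        if nodes == 1:
            break
    return layers, sum(layers)

print(manifest_counts(10**6, 15_000))    # ([1], 1)       - a single manifest
print(manifest_counts(10**9, 15_000))    # ([33, 1], 34)  - in line with the ~33 quoted above
print(manifest_counts(10**12, 15_000))   # three layers, roughly 32.5k manifests in total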

The manifest and snapshot files do not need to be colocated with the data files.

The data files don't need to be colocated with each other.

Updates are atomic because they take effect only when the catalog is updated. A failed update may leave artifacts (such as orphan files), but an update either succeeds or it doesn't; there is no partial success. As storage is cheap, if the cost of a failed commit is orphaned files, that is acceptable.

Pruning is only effective for columns that are sorted or nearly sorted, or for columns whose values appear for limited periods of time. Attempting to prune on a column like Gender, which has very few, highly recurrent values, is likely to be a waste of effort; pruning on dates when working with log entries is likely to be quite effective.
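
A simplified sketch of min/max pruning over manifest entries (the per-column min/max statistics carried by each entry are assumed here):

def prune(manifest_entries, column, value):
    """Keep only the data files whose [min, max] range for `column` could contain `value`."""
    survivors = []
    for entry in manifest_entries:
        # Each entry is assumed to look like:
        # {"path": "...", "stats": {"event_date": {"min": "2024-01-01", "max": "2024-01-31"}}}
        stats = entry["stats"].get(column)
        if stats is None or stats["min"] <= value <= stats["max"]:
            survivors.append(entry)  # cannot rule this file out
    return survivors

For a sorted column such as a date the ranges barely overlap, so most files are eliminated; for a column like Gender every file's range spans all values and nothing can be pruned.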

It's intended that indexes will operate at the leaf-manifest level, providing a balance between too many indexes (one per blob) and too few (one per dataset). This is still being worked through.

Git-Like Management

Git     Function                                   Tarchia
init    Initialize a new dataset                   [POST] /v1/tables/{owner}
add     Stage changes to be included in a commit   [POST] /v1/tables/{owner}/{table}/stage
commit  Save the staged changes to the dataset     [POST] /v1/transactions/commit
branch  Create a new branch of the dataset         [POST] /v1/transactions/start
push    Send committed changes to the dataset      [POST] /v1/tables/{owner}/{table}/push/{snapshot}
fork    Create a copy of a dataset                 [POST] /v1/tables/{owner}/{table}/fork
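
Reading the table above as an HTTP workflow, an init/branch/add/commit sequence might look roughly like this (a sketch using requests against an assumed local service; the owner, table name, and request bodies are illustrative, not part of any published spec):

import requests

BASE = "http://localhost:8080"   # assumed service location

owner, table = "example_owner", "planets"

# init - create the table (it starts with a schema but no snapshot)
requests.post(f"{BASE}/v1/tables/{owner}",
              json={"name": table, "schema": {"columns": [{"name": "id", "type": "INTEGER"}]}})

# branch - start a transaction against the table
txn = requests.post(f"{BASE}/v1/transactions/start",
                    json={"owner": owner, "table": table}).json()

# add - stage the files written during the transaction
requests.post(f"{BASE}/v1/tables/{owner}/{table}/stage",
              json={"transaction": txn, "files": ["data-0000-0000.parquet"]})

# commit - save the staged changes as a new snapshot
requests.post(f"{BASE}/v1/transactions/commit", json={"transaction": txn})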

Backing DB Structure

erDiagram
    OWNER {
        string name
        string user
    }
    
    CATALOG {
        string name
        string owner

    }

    OWNER ||--o{ CATALOG : "owns"

API Definition

Owner Management

End Point            GET         POST          PATCH         DELETE
/v1/owners           -           Create Owner  -             -
/v1/owners/{owner}   Read Owner  -             Update Owner  Delete Owner
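
For example, creating and then reading back an owner might look like this (a sketch with requests; the body fields follow the OWNER entity in the backing DB structure above):

import requests

BASE = "http://localhost:8080"   # assumed service location

# Create an owner - the name/user fields mirror the OWNER entity above
requests.post(f"{BASE}/v1/owners", json={"name": "example_owner", "user": "someone@example.com"})

# Read it back
owner = requests.get(f"{BASE}/v1/owners/example_owner").json()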

Schema Management

End Point                            GET          POST  PATCH          DELETE
/v1/tables/{owner}/{table}/schemas   Read Schema  -     Update Schema  -

[POST] /v1/tables/{owner}/{table}/stage
[POST] /v1/tables/{owner}/{table}/truncate
[POST] /v1/transactions/start
[POST] /v1/transactions/commit

[POST] /v1/tables/{owner}/{table}/push/{snapshot}
[POST] /v1/tables/{owner}/{table}/fork

Table Management

[POST] /v1/tables/{owner}
[GET] /v1/tables/{owner}
[PATCH] /v1/tables/{owner}/{table}
[DELETE] /v1/tables/{owner}/{table}
[GET] /v1/tables/{owner}/{table}?as_at={timestamp}&filter={filter}
[GET] /v1/tables/{owner}/{table}/snapshots/{snapshot}?filter={filter}
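
Reading a table as it stood at a point in the past is then a single call; in this sketch the timestamp format and filter syntax are assumptions:

import requests

BASE = "http://localhost:8080"   # assumed service location

# Fetch the table as at a past point in time, with a pushed-down filter
response = requests.get(
    f"{BASE}/v1/tables/example_owner/planets",
    params={"as_at": "2024-01-01T00:00:00Z", "filter": "year=2000"},
)
table_metadata = response.json()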

Request Fulfillment

I want to know what datasets there are

[GET]       /v1/tables

I want to retrieve the current instance of a dataset

[GET]       /v1/tables/{owner}/{table}?

I want to create a new dataset

[POST]      /v1/tables

I want to retrieve a dataset as at a date in the past

[GET]       /v1/tables/{owner}/{table}?as_at={timestamp}

I want to update the schema for a dataset

[PATCH]     /v1/tables/{owner}/{table}/schemas

I want to update the metadata for a dataset

[POST]      /v1/tables/{owner}/{table}/metadata

I want to add another file to a dataset

[POST]      /v1/transactions/start
[POST]      /v1/tables/{owner}/{table}/files
[POST]      /v1/transactions/commit

I want to write a new instance of a dataset

[POST]      /v1/transactions/start
[POST]      /v1/tables/{owner}/{table}/files/truncate
[POST]      /v1/tables/{owner}/{table}/files
[POST]      /v1/transactions/commit
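
Chained together, the new-instance recipe above might look like this (a sketch; the transaction token shape and request bodies are assumptions):

import requests

BASE = "http://localhost:8080"   # assumed service location
owner, table = "example_owner", "planets"

# Start a transaction - records the snapshot we are writing against
txn = requests.post(f"{BASE}/v1/transactions/start",
                    json={"owner": owner, "table": table}).json()

# Truncate within the transaction, then add the replacement files
requests.post(f"{BASE}/v1/tables/{owner}/{table}/files/truncate", json={"transaction": txn})
requests.post(f"{BASE}/v1/tables/{owner}/{table}/files",
              json={"transaction": txn, "files": ["data-0001-0000.parquet"]})

# Commit - the catalog only moves to the new snapshot if this succeeds
requests.post(f"{BASE}/v1/transactions/commit", json={"transaction": txn})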

I want to copy a dataset

[POST]      /v1/tables/{owner}/{table}/clone

I want to make a new version the latest

[POST]      /v1/tables/{owner}/{table}/promote
