# Tarchia - Metadata Catalog
Tarchia is a table format for huge analytic tables.
Resource | Location |
---|---|
Source Code | https://github.com/mabel-dev/tarchia |
Documentation | https://github.com/mabel-dev/tarchia |
Download | https://github.com/mabel-dev/tarchia |
## Terminology

- Catalog - A collection of tables.
- Data File - A file that contains rows of the table.
- Manifest - A file that lists and describes data files in the table.
- Metadata - Information used to manage and describe tables.
- Owner - The namespace a table belongs to.
- Schema - The structure defining the columns of the table.
- Snapshot - The state of the table at a specific point in time.
- Table - A dataset stored in a structured and managed way.
## Physical Structure

```
table/
 |- metadata
 |   |- manifests/
 |   |   +- manifest-00000000.avro
 |   +- snapshots/
 |       +- snapshot-00000000.json
 +- data/
     +- year=2000/
         +- month=01/
             +- day=01/
                 +- as_at_0000/
                     +- data-0000-0000.parquet
```
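A data file's location follows directly from this layout; a minimal sketch, where `partition_path` is a hypothetical helper and the meaning of the two numeric groups in the file name (as-at sequence and file sequence) is an assumption:

```python
from datetime import date

def partition_path(table_root: str, day: date, as_at: int, file_no: int) -> str:
    """Build a data-file path matching the layout above (hypothetical helper)."""
    return (
        f"{table_root}/data/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"as_at_{as_at:04d}/data-{as_at:04d}-{file_no:04d}.parquet"
    )

print(partition_path("table", date(2000, 1, 1), 0, 0))
# table/data/year=2000/month=01/day=01/as_at_0000/data-0000-0000.parquet
```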
## Metadata Catalog

```mermaid
flowchart TD
    CATALOG[(Catalog)] --> SNAPSHOT(Snapshot)
    CATALOG --> SCHEMA(Schema)
    subgraph snapshot_contents
        MAN_LIST --> MANIFEST(Manifest)
        SNAPSHOT --> SCHEMA
        SNAPSHOT --> MAN_LIST(Manifest List)
    end
    MANIFEST --> DATA(Data Files)
```
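Purely as an illustration of the relationships in the diagram, a snapshot record might carry fields like these; the field names are assumptions, not Tarchia's actual snapshot format:

```python
# Illustrative only - field names are assumptions, not Tarchia's actual format.
snapshot = {
    "snapshot_id": "snapshot-00000000",
    "parent_snapshot": None,           # previous snapshot, enabling time travel
    "schema": "schema-00000000",       # each snapshot has exactly one schema
    "manifest_list": [                 # root of the manifest B-tree
        "metadata/manifests/manifest-00000000.avro",
    ],
}
```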
The Catalog, hosted in a datastore such as Firestore, contains references to Schemas and Snapshots.
When a table is created, it is created with a schema but no snapshot (it doesn't have any data yet).
When a table is written to, the writer should obtain the latest schema (which may not be the schema of the latest snapshot, if the schema has been updated since), write the data, and save the snapshot.
Writers use a transaction system to ensure datasets/updates are complete before updating the catalog.
Clashes are managed optimistically: the version (latest snapshot) of the dataset at the start of the transaction must match the version of the dataset at the end of the transaction. If they don't match, someone else has made a change, and the transaction should fail or require a hard override.
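That check-and-swap can be sketched as follows; `catalog` is a hypothetical client object and `latest_snapshot`/`compare_and_swap` are assumed methods standing in for whatever atomic update the backing datastore provides, not Tarchia's actual API:

```python
class TransactionConflict(Exception):
    """Raised when another writer committed between transaction start and commit."""

def commit_transaction(catalog, table, new_snapshot):
    # Version of the dataset (its latest snapshot) as seen at the
    # start of the transaction.
    expected = catalog.latest_snapshot(table)

    # ... write data files and manifests to storage here ...

    # Only update the catalog if nobody else committed in the meantime;
    # compare_and_swap stands in for an atomic conditional write
    # (e.g. a conditional update in Firestore).
    if not catalog.compare_and_swap(table, expected=expected, new=new_snapshot):
        # Versions don't match: fail, or retry / hard-override by policy.
        raise TransactionConflict(f"{table} changed since snapshot {expected}")
```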
Streaming datasets add new files to the dataset and create a new manifest. Streaming data cannot change schemas beyond adding and removing columns and limited type conversions (e.g. int->float); larger changes are treated as different datasets.
When a table is read, we get the schema and the manifest. Each snapshot can only have one schema.
The catalog references the latest schema, latest snapshot and key information about the table.
Manifests are limited to 2048 rows (aiming for most manifest files to be under 2MB, to fit in cache); when a manifest exceeds this limit it is split and a Manifest List is created. Manifests and Manifest Lists use B-tree style management to help ensure efficient pruning.
B-tree manifests create read and write overheads when accessing and updating. Assuming about 15k rows per data file, a 1 million row dataset fits in a single manifest, a 1 billion row dataset needs 33 manifests (1 root and 1 layer with 32 children), and a 1 trillion row dataset needs about 32,568 manifests across three layers.
A 1 trillion row dataset's 32,568 manifests would be about 16GB of data; this data can be read in parallel, reducing the time to read and process it all. Pruning would very quickly reduce the reads: eliminating just one row at the first layer avoids reading thousands of manifests. (With data files of 50k rows there would be only about 9,770 manifests.)
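These counts can be checked with a quick back-of-the-envelope calculation; a sketch assuming 2048 entries per manifest and ~15k rows per data file, where small differences from the quoted figures come down to rounding:

```python
import math

FANOUT = 2048                 # manifest entries per manifest file
ROWS_PER_DATA_FILE = 15_000   # assumption used in the text above

def manifest_count(rows: int) -> int:
    """Total manifests in a B-tree over the table's data files."""
    nodes = math.ceil(rows / ROWS_PER_DATA_FILE)   # number of data files
    total = 0
    while nodes > 1:
        nodes = math.ceil(nodes / FANOUT)          # one more layer of manifests
        total += nodes
    return max(total, 1)                           # even a tiny table has one manifest

for rows in (10**6, 10**9, 10**12):
    print(f"{rows:>18,} rows -> {manifest_count(rows):,} manifests")
# 1 million -> 1, 1 billion -> 34, 1 trillion -> 32,570 (three layers);
# within rounding of the figures quoted above.
```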
The manifest and snapshot files do not need to be colocated with the data files.
The data files don't need to be colocated with each other.
Updates are atomic: they take effect when the catalog is updated. Failed updates may leave artifacts (such as orphan files), but an update either succeeds or it doesn't; there is no partial success. As storage is cheap, if the cost of a failed commit is orphaned files, that should be acceptable.
Pruning is only effective for columns that are sorted or nearly sorted, or for columns whose values appear for limited periods of time. Attempting to prune on a column like gender, which has very few, highly recurrent values, is likely to be a waste of effort; pruning on dates when working with log entries is likely to be quite effective.
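A sketch of what manifest-level pruning looks like, assuming each manifest entry carries per-column min/max statistics (the entry layout here is an assumption):

```python
def prune(entries, column, lower, upper):
    """Keep only manifest entries whose [min, max] range for `column`
    can overlap the query range - everything else is skipped unread."""
    survivors = []
    for entry in entries:
        lo, hi = entry["stats"][column]
        if hi >= lower and lo <= upper:   # ranges overlap: must read it
            survivors.append(entry)
    return survivors

# On a sorted column (e.g. a log timestamp) most entries fall wholly
# outside the query range and are eliminated; on a low-cardinality
# column like gender, every entry spans the whole range and nothing prunes.
entries = [
    {"path": "data-0000-0000.parquet", "stats": {"day": (1, 9)}},
    {"path": "data-0000-0001.parquet", "stats": {"day": (10, 19)}},
    {"path": "data-0000-0002.parquet", "stats": {"day": (20, 31)}},
]
print(prune(entries, "day", 12, 14))   # only the middle file survives
```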
It's intended that indexes will operate at a leaf manifest level, providing a balance between too many indexes (one per blob) and too few indexes (one per dataset). This is still to be worked through.
## Git-Like Management

Git | Function | Tarchia |
---|---|---|
init | Initialize a new dataset | [POST] /v1/tables/{owner} |
add | Stage changes to be included in commit | [POST] /v1/tables/{owner}/{table}/stage |
commit | Save the staged changes to the dataset | [POST] /v1/transactions/commit |
branch | Create a new branch of the dataset | [POST] /v1/transactions/start |
push | Send committed changes to dataset | [POST] /v1/tables/{owner}/{table}/push/{snapshot} |
fork | Create a copy of a dataset | [POST] /v1/tables/{owner}/{table}/fork |
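In HTTP terms the flow looks roughly like this; a sketch using `requests` against the endpoints above, where the base URL, owner/table names, and all payload shapes are assumptions:

```python
import requests

BASE = "http://localhost:8080"  # hypothetical server location

# init: create a new table under an owner (payload shape is an assumption)
requests.post(f"{BASE}/v1/tables/acme", json={"name": "sales"})

# branch: open a transaction against the table
txn = requests.post(
    f"{BASE}/v1/transactions/start",
    json={"owner": "acme", "table": "sales"},
).json()

# add: stage new data files into the open transaction
requests.post(
    f"{BASE}/v1/tables/acme/sales/stage",
    json={"transaction": txn, "files": ["data-0000-0001.parquet"]},
)

# commit: make the staged changes the table's new snapshot
requests.post(f"{BASE}/v1/transactions/commit", json={"transaction": txn})
```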
## Backing DB Structure

```mermaid
erDiagram
    OWNER {
        string name
        string user
    }
    CATALOG {
        string name
        string owner
    }
    OWNER ||--o{ CATALOG : "owns"
```
## API Definition

### Owner Management
End Point | GET | POST | PATCH | DELETE |
---|---|---|---|---|
/v1/owners | - | Create Owner | - | - |
/v1/owners/{owner} | Read Owner | - | Update Owner | Delete Owner |
### Schema Management
End Point | GET | POST | PATCH | DELETE |
---|---|---|---|---|
/v1/tables/{owner}/{table}/schemas | Read Schema | - | Update Schema | - |

- [POST] /v1/tables/{owner}/{table}/stage
- [POST] /v1/tables/{owner}/{table}/truncate
- [POST] /v1/transactions/start
- [POST] /v1/transactions/commit
- [POST] /v1/tables/{owner}/{table}/push/{snapshot}
- [POST] /v1/tables/{owner}/{table}/fork
### Table Management

- [POST] /v1/tables/{owner}
- [GET] /v1/tables/{owner}
- [PATCH] /v1/tables/{owner}/{table}
- [DELETE] /v1/tables/{owner}/{table}
- [GET] /v1/tables/{owner}/{table}?as_at={timestamp}&filter={filter}
- [GET] /v1/tables/{owner}/{table}/snapshots/{snapshot}?filter={filter}
## Request Fulfillment

I want to know what datasets there are
- [GET] /v1/tables

I want to retrieve the current instance of a dataset
- [GET] /v1/tables/{owner}/{table}?

I want to create a new dataset
- [POST] /v1/tables

I want to retrieve a dataset as at a date in the past
- [GET] /v1/tables/{owner}/{table}?

I want to update the schema for a dataset
- [PATCH] /v1/tables/{owner}/{table}/schemas

I want to update the metadata for a dataset
- [POST] /v1/tables/{owner}/{table}/metadata

I want to add another file to a dataset
- [POST] /v1/transactions/start
- [POST] /v1/tables/{owner}/{table}/files
- [POST] /v1/transactions/commit

I want to write a new instance of a dataset
- [POST] /v1/transactions/start
- [POST] /v1/tables/{owner}/{table}/files/truncate
- [POST] /v1/tables/{owner}/{table}/files
- [POST] /v1/transactions/commit

I want to copy a dataset
- [POST] /v1/tables/{owner}/{table}/clone

I want to make a new version the latest
- [POST] /v1/tables/{owner}/{table}/promote
## Download files

- Source distribution: tarchia-0.0.0.tar.gz
- Built distribution: tarchia-0.0.0-py3-none-any.whl
### File details: tarchia-0.0.0.tar.gz

- Download URL: tarchia-0.0.0.tar.gz
- Upload date:
- Size: 36.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4

Algorithm | Hash digest |
---|---|
SHA256 | 122dd81e24fdd2cad3bdd0747aebf611112177c0273107063099189dc2971301 |
MD5 | 27c9632f4b6e9215667ccb088ce72b3a |
BLAKE2b-256 | 9d0e5b4c2106e4435e86f2e8cb5f7290538e2785d5e55a4fb01d3f50c39d1264 |
### File details: tarchia-0.0.0-py3-none-any.whl

- Download URL: tarchia-0.0.0-py3-none-any.whl
- Upload date:
- Size: 44.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4

Algorithm | Hash digest |
---|---|
SHA256 | 61874cfce76950bc551db12391362a36ff0780b615badcd7aae720298c9a99a5 |
MD5 | 098fd9a1dd87ba9237058955d5cd3287 |
BLAKE2b-256 | 3b94c6a1a90293c351cf0848ad71ee9adc0333c8973524029c59f1a5b02d5825 |