Toolkit to enrich, validate and explore YAML metadata from a pandas DataFrame.

MetaCraft Toolkit

MetaCraft is a Python package for enriching and validating YAML schemas from a pandas.DataFrame. Metadata.update() can read YAML directly from a URL and can download remote ZIP files containing multiple schemas, much as pandas.read_csv accepts URLs.

Features

  • update: enriches YAML with statistics and sketches (tdigest, HyperLogLog), storing the results in metadata.df.
  • validate: checks the consistency between a DataFrame and the YAML (types, ranges, nulls, ...).
  • compare: detects schema drift between two schemas.
  • export_schema: converts the YAML to other formats (Spark, SQL, etc.).
  • generate_expectations: creates Great Expectations suites.
  • transform: returns a DataFrame adjusted to the schema.
  • quality_report: simple quality score (completeness + drift).
  • research: uses OpenAI to explore relationships and anomalies.
  • loglevel: controls verbosity via Metadata(loglevel="DEBUG").
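To make the kind of check described for validate concrete, here is a minimal sketch in plain pandas. It is not MetaCraft's implementation: the check_column helper, its parameters and the "integer" type label are assumptions chosen for illustration.

```python
import pandas as pd


def check_column(series, *, kind=None, minimum=None, maximum=None, nullable=True):
    """Return a list of human-readable problems found in one column.

    Hypothetical helper for illustration only; not part of MetaCraft.
    """
    problems = []
    # Type check: only the 'integer' label is handled in this sketch.
    if kind == "integer" and not pd.api.types.is_integer_dtype(series):
        problems.append(f"{series.name}: expected integer, got {series.dtype}")
    # Null check.
    if not nullable and series.isna().any():
        problems.append(f"{series.name}: contains nulls")
    # Range checks, ignoring nulls.
    values = series.dropna()
    if minimum is not None and (values < minimum).any():
        problems.append(f"{series.name}: values below minimum {minimum}")
    if maximum is not None and (values > maximum).any():
        problems.append(f"{series.name}: values above maximum {maximum}")
    return problems


df = pd.DataFrame({"survived": [0, 1, 1, 0], "age": [22, 38, 26, 35]})
ok = check_column(df["age"], kind="integer", minimum=0, maximum=120, nullable=False)
too_old = check_column(df["age"], kind="integer", maximum=30)
print(ok)       # []
print(too_old)  # ['age: values above maximum 30']
```

Returning a list of problems rather than a single boolean keeps every failure visible in one pass.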

Installation

pip install MetaCraft

Or, from a clone of the repository, install the dependencies:

pip install -r requirements.txt

Optional dependencies: openai, tdigest, datasketch.
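A quick way to see which optional dependencies are present is to probe for them with the standard library. This sketch assumes the import names match the package names listed above; available_extras is a hypothetical helper, not part of MetaCraft.

```python
from importlib.util import find_spec

# Assumed import names for the optional dependencies.
OPTIONAL = ("openai", "tdigest", "datasketch")


def available_extras(names=OPTIONAL):
    """Map each top-level module name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


print(available_extras())
```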

Quick example

import pandas as pd
import yaml
from metacraft import Metadata

# Example DataFrame
df = pd.DataFrame({
    'survived': [0, 1, 1, 0],
    'age': [22, 38, 26, 35],
})

# Minimal schema
yaml_schema = {
    'schema': [
        {'identity': {'name': 'survived'}},
        {'identity': {'name': 'age'}},
    ]
}

# Save YAML to disk
with open('schema.yaml', 'w') as f:
    yaml.safe_dump(yaml_schema, f, sort_keys=False, allow_unicode=True)

m = Metadata(loglevel="INFO")
m.update(df, 'schema.yaml', inplace=True)
m.quality_report(df)

Results

✔ schema.yaml updated
root
 |-- survived: integer (nullable = false)
 |-- age: integer (nullable = false)
<class 'metadata.dataset'>
Columns: 2 entries
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   survived                        4   integer
 1   age                             4   integer
dtypes: integer(2)
Validation passed: True
Quality score: 100.0 (A)
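The exact formula behind the quality score is not documented here, and the feature list says it also takes drift into account. As a rough illustration of the completeness part only, the sketch below computes the percentage of non-null cells with plain pandas. The completeness function is an assumption, not MetaCraft's code.

```python
import pandas as pd


def completeness(df):
    """Percentage of non-null cells across the whole DataFrame."""
    if df.size == 0:
        return 100.0  # assumption: an empty frame has nothing missing
    return float(df.notna().to_numpy().mean() * 100)


df = pd.DataFrame({"survived": [0, 1, 1, 0], "age": [22, 38, 26, 35]})
print(completeness(df))  # 100.0
```

With no missing values in the example DataFrame, the completeness is 100.0, which is consistent with the score shown above.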

Remote ZIP example

Metadata.update() can also process ZIP files hosted on the web. Just pass a URL ending in .zip:

m.update(df, 'https://example.com/schemas.zip', verbose=True)

This downloads the ZIP to a temporary directory, applies the updates, and writes the resulting file to the same folder (or to the path provided with output).
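The download-and-unpack step described above can be sketched with the standard library alone. This is not MetaCraft's code: fetch_bytes and extract_schemas are hypothetical helpers, and the demo builds a small ZIP in memory so that it runs without network access.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url, timeout=30.0):
    """Download a URL and return its body as bytes (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_schemas(zip_bytes, dest):
    """Unpack a ZIP archive into dest and return the YAML files it contained."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest)
    return sorted(
        p for p in Path(dest).rglob("*")
        if p.is_file() and p.suffix.lower() in {".yaml", ".yml"}
    )


# Offline demo: build a ZIP in memory instead of downloading one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("titanic.yaml", "schema: []\n")
    archive.writestr("README.txt", "not a schema\n")

with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in extract_schemas(buffer.getvalue(), tmp)]
print(names)  # ['titanic.yaml']
```

For a real archive, the bytes would come from fetch_bytes('https://example.com/schemas.zip') instead of the in-memory buffer.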

Editing metadata via metadata.df

After m.update(), the schema lives in m.df, an editable DataFrame. Changes can be written back to YAML with m.df.upgrade():

# 1) Set every column's logical type to integer (valid only if all columns are integers)
m.df['type.logical_type'] = 'integer'

# 2) Change the description of `age`
m.df.loc['age', 'identity.description_i18n.es'] = 'Passenger age'

# 3) Adjust the allowed range for `age`
m.df.loc['age', ['domain.numeric.min', 'domain.numeric.max']] = [0, 120]

m.df.upgrade('schema.yaml')  # save the updated YAML
m.df.revert()                # discard the changes in memory
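The dotted column names used above, such as domain.numeric.min, suggest a flattened view of the nested YAML. How MetaCraft builds m.df internally is not documented here. The sketch below only shows one way such a round trip could work in plain pandas, using pandas.json_normalize and a hypothetical unflatten helper.

```python
import pandas as pd

schema = {
    "schema": [
        {"identity": {"name": "survived"}},
        {"identity": {"name": "age"}, "domain": {"numeric": {"min": 0, "max": 120}}},
    ]
}

# Flatten: nested keys become dotted column names, one row per schema entry.
flat = pd.json_normalize(schema["schema"])
flat.index = flat["identity.name"].tolist()

# Edit one cell through the dotted column name.
flat.loc["age", "domain.numeric.max"] = 100


def unflatten(row):
    """Rebuild a nested dict from one row with dotted keys (scalar values only)."""
    nested = {}
    for dotted, value in row.items():
        if pd.isna(value):
            continue  # the key was absent for this entry
        *parents, leaf = dotted.split(".")
        node = nested
        for key in parents:
            node = node.setdefault(key, {})
        # Convert NumPy scalars to plain Python values for YAML dumping.
        node[leaf] = value.item() if hasattr(value, "item") else value
    return nested


rebuilt = {"schema": [unflatten(row) for _, row in flat.iterrows()]}
print(rebuilt["schema"][0])  # {'identity': {'name': 'survived'}}
```

One caveat of this approach: a numeric column with missing entries is stored as float, so integers such as 0 come back as 0.0 unless they are cast before saving.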

Roadmap

  • ✔️ Remote YAML support (v2025.7.30)
  • ✔️ Remote ZIP download (v2025.7.30)
  • ✔️ Optional local cache
  • ⬜ CLI (metadata-cli update titanic.csv titanic.yaml)

Metadata generator

You can try the Metadata Generator, a GPT that creates the YAML from the output of df.head().

Contributions welcome!
