
Toolkit to enrich, validate and explore YAML metadata from a pandas DataFrame.

MetaCraft Toolkit

MetaCraft is a Python package for enriching and validating YAML schemas from a pandas.DataFrame. The metadata.update() function can now read YAML directly from URLs and even download remote ZIP files with multiple schemas, just like pandas.read_csv.

Features

  • update: enriches YAML with statistics and sketches (tdigest, HyperLogLog), storing the results in metadata.df.
  • validate: checks consistency between a DataFrame and the YAML (types, ranges, nulls, etc.).
  • compare: detects drift between two schemas.
  • export_schema: converts the YAML to other formats (Spark, SQL, etc.).
  • generate_expectations: creates Great Expectations suites.
  • transform: returns a DataFrame adjusted to the schema.
  • quality_report: simple quality score (completeness + drift).
  • research: uses OpenAI to explore relationships and anomalies.
  • loglevel: controls verbosity via Metadata(loglevel="DEBUG").
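To get a feel for the quality score idea, the completeness half of such a score can be approximated with plain pandas. This is our own back-of-the-envelope sketch, not MetaCraft's actual formula:

```python
import pandas as pd

def completeness_score(df: pd.DataFrame) -> float:
    """Share of non-null cells, scaled to 0-100 (illustrative only)."""
    return float(100.0 * df.notna().mean().mean())

df = pd.DataFrame({"survived": [0, 1, None, 0], "age": [22, 38, 26, 35]})
print(completeness_score(df))  # one missing cell out of eight → 87.5
```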

Installation

pip install MetaCraft

Or, from a clone of the repository, install the dependencies with:

pip install -r requirements.txt

Optional dependencies: openai, tdigest, datasketch.

Quick example

import pandas as pd
from metacraft import Metadata

# Example DataFrame
df = pd.DataFrame({
    'survived': [0, 1, 1, 0],
    'age': [22, 38, 26, 35],
})

# Minimal schema
yaml_schema = {
    'schema': [
        {'identity': {'name': 'survived'}},
        {'identity': {'name': 'age'}},
    ]
}

# Save YAML to disk
import yaml
with open('schema.yaml', 'w') as f:
    yaml.safe_dump(yaml_schema, f, sort_keys=False, allow_unicode=True)

m = Metadata(loglevel="INFO")
m.update(df, 'schema.yaml', inplace=True)
m.quality_report(df)
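To see what validate is checking, here is a toy range check in the same spirit. The field names follow the schema layout above, but this is our own sketch, not MetaCraft's implementation:

```python
import pandas as pd

def check_numeric_ranges(df: pd.DataFrame, schema: dict) -> list:
    """Return (column, value) pairs outside the declared numeric domain."""
    violations = []
    for entry in schema["schema"]:
        name = entry["identity"]["name"]
        domain = entry.get("domain", {}).get("numeric", {})
        lo, hi = domain.get("min"), domain.get("max")
        if name not in df.columns or (lo is None and hi is None):
            continue
        for value in df[name].tolist():
            if (lo is not None and value < lo) or (hi is not None and value > hi):
                violations.append((name, value))
    return violations

schema = {"schema": [{"identity": {"name": "age"},
                      "domain": {"numeric": {"min": 0, "max": 120}}}]}
df = pd.DataFrame({"age": [22, 150]})
print(check_numeric_ranges(df, schema))  # → [('age', 150)]
```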

Customising OpenAI usage

Metadata can reuse an existing OpenAI client (or API key) and lets you define the exact parameters that will be sent to the chat endpoint. Provide default values through the constructor and override any of them per call:

from openai import OpenAI
from metacraft import Metadata

client = OpenAI(api_key="sk-...")
metadata = Metadata(
    openai_api=client,
    openai_params={"model": "gpt-4.1-mini", "temperature": 0.2, "max_tokens": 600},
)

# Override defaults ad-hoc when exporting a schema
spark_code = metadata.export_schema(
    "spark",
    response_format={"type": "text"},
    max_tokens=900,
)
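The precedence here presumably follows the usual dict-merge pattern, where per-call keyword arguments win over constructor defaults. A minimal sketch of that assumption (not MetaCraft's internals):

```python
def merge_openai_params(defaults: dict, overrides: dict) -> dict:
    """Per-call keyword arguments override constructor defaults."""
    return {**defaults, **overrides}

defaults = {"model": "gpt-4.1-mini", "temperature": 0.2, "max_tokens": 600}
print(merge_openai_params(defaults, {"max_tokens": 900}))  # max_tokens becomes 900
```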

Results

✔ schema.yaml updated
root
 |-- survived: integer (nullable = false)
 |-- age: integer (nullable = false)
<class 'metadata.dataset'>
Columns: 2 entries
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   survived                        4   integer
 1   age                             4   integer
dtypes: integer(2)
Validation passed: True
Quality score: 100.0 (A)

Remote ZIP example

metadata.update() can also process ZIP files hosted on the web. Just pass a URL ending in .zip:

m.update(df, 'https://example.com/schemas.zip', verbose=True)

This downloads the ZIP to a temporary directory, applies the updates, and leaves the resulting file in the same folder (or in the path given via the output argument).
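A rough sketch of that download-and-extract pattern using only the standard library; the helper names are ours, and MetaCraft's actual internals may differ:

```python
import io
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlopen

def extract_schemas(zip_bytes: bytes) -> list:
    """Unpack a ZIP of YAML schemas into a temporary directory."""
    workdir = Path(tempfile.mkdtemp())
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(workdir)
    return sorted(workdir.glob("**/*.yaml"))

def fetch_schemas(url: str) -> list:
    """Download a remote ZIP and return the extracted schema paths."""
    with urlopen(url) as resp:
        return extract_schemas(resp.read())
```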

Editing metadata via metadata.df

After m.update() the schema lives in m.df, an editable DataFrame. Changes can be propagated back to YAML with m.df.upgrade():

# 1) Set every column's logical type to integer
m.df['type.logical_type'] = 'integer'

# 2) Change the description of `age`
m.df.loc['age', 'identity.description_i18n.es'] = 'Passenger age'

# 3) Adjust the allowed range for `age`
m.df.loc['age', ['domain.numeric.min', 'domain.numeric.max']] = [0, 120]

m.df.upgrade('schema.yaml')  # save the updated YAML
m.df.revert()                # discard the changes in memory
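The dotted column names above mirror the nested YAML keys. A minimal sketch of that flattening, assuming a schema laid out like the quick example (our own helper, not MetaCraft's):

```python
import pandas as pd

def schema_to_df(schema: dict) -> pd.DataFrame:
    """Flatten each schema entry into one row with dotted column names."""
    rows = []
    for entry in schema["schema"]:
        flat = {}
        def walk(prefix, node):
            for key, value in node.items():
                path = f"{prefix}{key}"
                if isinstance(value, dict):
                    walk(f"{path}.", value)
                else:
                    flat[path] = value
        walk("", entry)
        rows.append(flat)
    return pd.DataFrame(rows).set_index("identity.name")

schema = {"schema": [{"identity": {"name": "age"},
                      "domain": {"numeric": {"min": 0, "max": 120}}}]}
print(schema_to_df(schema).loc["age", "domain.numeric.min"])  # → 0
```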

Roadmap

  • ✔️ Remote YAML support (v 2025‑07‑30)
  • ✔️ Remote ZIP download (v 2025‑07‑30)
  • ✔️ Optional local cache
  • ⬜ CLI (metadata-cli update titanic.csv titanic.yaml)

Metadata generator

You can try the Metadata Generator, a GPT that creates the YAML from a DataFrame's .head() output.

Contributions welcome!

