
A tool for generating and validating data schemas


Smart Schema

Smart Schema is a Python package that generates Pydantic models from CSV files, JSON data, and pandas DataFrames, and validates data against those models. It also includes utilities for splitting CSV files and inferring column types.

Features

  • Schema Generation: Automatically generate Pydantic models from:
    • CSV files
    • JSON data
    • Pandas DataFrames
  • Data Validation: Validate data against generated schemas
  • CSV Processing:
    • Split large CSV files
    • Infer column types
    • Validate CSV data
  • Model Management: Save and load generated models
  • Rich CLI: User-friendly command-line interface with detailed output

Installation

As a Binary (CLI Tool)

# Using pip
pip install smart-schema

# Using pipx (recommended for CLI tools)
pipx install smart-schema

As a Library

# Using pip
pip install smart-schema

# Using Poetry
poetry add smart-schema

# From Source
git clone https://github.com/ipriyaaanshu/smart-schema.git
cd smart-schema
pip install -e .

Command Line Interface

Smart Schema ships with a command-line tool. After installation, the smart-schema command is available in your shell.

Basic Commands

# Show help and available commands
smart-schema --help

# Show help for a specific command
smart-schema generate-model --help

Generate Models

# Generate a model from a CSV file
smart-schema generate-model data.csv --output models/product_model.py

# Generate a model with specific datetime columns
smart-schema generate-model data.csv --datetime-columns created_at,updated_at --output models/product_model.py

# Generate a model from JSON data
smart-schema generate-model data.json --type json --output models/order_model.py

Validate Data

# Validate a CSV file against a model
smart-schema validate data.csv --model models/product_model.py

# Validate and save valid records
smart-schema validate data.csv --model models/product_model.py --output valid_data.csv

# Show detailed validation errors
smart-schema validate data.csv --model models/product_model.py --verbose

Process CSV Files

# Split a large CSV file into smaller chunks
smart-schema split data.csv --rows 1000 --output split_

# Split a CSV file by column values
smart-schema split data.csv --by-column category --output category_

# Infer column types from a CSV file
smart-schema infer-types data.csv --output types.json

Common Options

# Show progress bar for long operations
smart-schema generate-model data.csv --progress

# Specify input file encoding
smart-schema generate-model data.csv --encoding utf-8

# Use a different delimiter for CSV files
smart-schema generate-model data.csv --delimiter ";"

# Skip header row in CSV files
smart-schema generate-model data.csv --no-header

Output Formats

# Save model as Python file (default)
smart-schema generate-model data.csv --output models/product_model.py

# Save model as JSON schema
smart-schema generate-model data.csv --output schema.json --format json

# Save validation report as HTML
smart-schema validate data.csv --model models/product_model.py --output report.html --format html

Examples

  1. Generate a model from a CSV file and validate it:
# Generate model
smart-schema generate-model products.csv --output models/product_model.py

# Validate the same file
smart-schema validate products.csv --model models/product_model.py
  2. Process a large CSV file:
# Split into 1000-row chunks
smart-schema split large_file.csv --rows 1000 --output chunks/chunk_

# Generate model from first chunk
smart-schema generate-model chunks/chunk_1.csv --output models/data_model.py

# Validate all chunks
for f in chunks/chunk_*.csv; do
    smart-schema validate "$f" --model models/data_model.py
done
  3. Work with JSON data:
# Generate model from JSON
smart-schema generate-model config.json --type json --output models/config_model.py

# Validate JSON data
smart-schema validate data.json --model models/config_model.py --type json
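
The chunk-validation loop in the second example depends on shell glob order, which sorts chunk_10.csv before chunk_2.csv. If you want the chunks processed in numeric order, a small Python sketch can build the same commands. It uses only the `smart-schema validate <file> --model <model>` form shown above; the helper names `chunk_number` and `build_validate_commands` are hypothetical and not part of Smart Schema.

```python
import re
import tempfile
from pathlib import Path


def chunk_number(path: Path) -> int:
    """Extract the trailing number from names like chunk_12.csv."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else 0


def build_validate_commands(chunk_dir: Path, model: str) -> list[list[str]]:
    """Return one `smart-schema validate` argv per chunk, in numeric order."""
    chunks = sorted(chunk_dir.glob("chunk_*.csv"), key=chunk_number)
    return [
        ["smart-schema", "validate", str(chunk), "--model", model]
        for chunk in chunks
    ]


# Demo with empty placeholder files; nothing is executed here.
chunk_dir = Path(tempfile.mkdtemp())
for n in (1, 2, 10):
    (chunk_dir / f"chunk_{n}.csv").touch()

commands = build_validate_commands(chunk_dir, "models/data_model.py")
order = [Path(cmd[2]).name for cmd in commands]
print(order)
```

To actually run the validations, each command list can be passed to `subprocess.run`.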

Quickstart

Basic Usage

import pandas as pd
from smart_schema import ModelGenerator, ModelValidator

# Load the CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Generate a model from the DataFrame
generator = ModelGenerator(name="Product")
model = generator.from_dataframe(
    df,
    datetime_columns=['last_updated']
)

# Validate data against the model
validator = ModelValidator(model)
valid_records, invalid_records = validator.validate_dataframe(df)

Command Line Interface

# Generate a model from a CSV file
smart-schema generate-model data.csv --output models/product_model.py

# Validate a CSV file against a model
smart-schema validate data.csv --model models/product_model.py

# Split a large CSV file
smart-schema split data.csv --rows 1000

Detailed Usage

Generating Models

From CSV Files

from smart_schema import ModelGenerator
import pandas as pd

# Read CSV file
df = pd.read_csv('data.csv')

# Generate model
generator = ModelGenerator(name="Product")
model = generator.from_dataframe(
    df,
    datetime_columns=['created_at', 'updated_at']
)

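# Save model to file
from pathlib import Path

model_file = Path("models/product_model.py")
model_file.parent.mkdir(parents=True, exist_ok=True)  # create models/ if missing
with open(model_file, "w") as f:
    # Imports the generated file may need for datetime and typing annotations
    f.write("import typing\nfrom datetime import datetime\n")
    f.write("from pydantic import BaseModel\n\n")
    f.write(f"class {model.__name__}(BaseModel):\n")
    for field_name, field in model.model_fields.items():
        # Generic annotations such as Optional[int] may lack a usable __name__
        type_name = getattr(field.annotation, "__name__", repr(field.annotation))
        f.write(f"    {field_name}: {type_name}\n")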

From JSON Data

from smart_schema import ModelGenerator

# Sample JSON data
json_data = {
    "user": {
        "id": 1,
        "name": "John Doe",
        "email": "john@example.com"
    },
    "orders": [
        {
            "order_id": "ORD-001",
            "items": [
                {"product_id": "P1", "quantity": 2}
            ]
        }
    ]
}

# Generate model
generator = ModelGenerator(name="OrderSystem")
model = generator.from_json(
    json_data,
    datetime_columns=['order_created_at']
)
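
To see what structure a generator has to discover in nested JSON like the sample above, here is a minimal standard-library sketch that maps each value to its type name. It is an illustration only: how `from_json` names and nests the resulting Pydantic models is not shown in this README, and `describe_shape` is a hypothetical helper, not part of Smart Schema.

```python
from typing import Any


def describe_shape(value: Any) -> Any:
    """Map a JSON value to a nested description of its types."""
    if isinstance(value, dict):
        return {key: describe_shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe a list by its first element; an empty list stays untyped.
        return [describe_shape(value[0])] if value else []
    return type(value).__name__


json_data = {
    "user": {"id": 1, "name": "John Doe", "email": "john@example.com"},
    "orders": [
        {
            "order_id": "ORD-001",
            "items": [{"product_id": "P1", "quantity": 2}],
        }
    ],
}

shape = describe_shape(json_data)
print(shape)
```

Each nested dictionary in the result is a candidate for its own model, and each list for a list-typed field.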

Validating Data

from smart_schema import ModelValidator

# Validate the DataFrame (`model` and `df` come from the generation example above)
validator = ModelValidator(model)
valid_records, invalid_records = validator.validate_dataframe(df)

# Print validation results
print(f"Valid records: {len(valid_records)}")
print(f"Invalid records: {len(invalid_records)}")

if invalid_records:
    print("\nInvalid Records Details:")
    for record in invalid_records:
        print(f"\nRecord: {record['record']}")
        for error in record['errors']:
            print(f"  - {error['msg']}")
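
The loop above reads each invalid entry as a dictionary with a `record` and a list of `errors`, each carrying a `msg`. The following standard-library sketch shows one way such a valid/invalid partition can be produced. It is not `ModelValidator`'s implementation: the converter-based schema, the `partition_records` helper, and the extra `loc` key are assumptions made for illustration.

```python
from typing import Any, Callable

# Hypothetical schema: field name -> converter that raises on bad input.
Schema = dict[str, Callable[[Any], Any]]


def partition_records(records: list[dict], schema: Schema):
    """Split records into (valid, invalid), collecting per-field errors."""
    valid, invalid = [], []
    for record in records:
        errors = []
        for field, convert in schema.items():
            if field not in record:
                errors.append({"loc": field, "msg": "field required"})
                continue
            try:
                convert(record[field])
            except (TypeError, ValueError) as exc:
                errors.append({"loc": field, "msg": str(exc)})
        if errors:
            invalid.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, invalid


schema = {"id": int, "price": float, "name": str}
records = [
    {"id": "1", "price": "9.99", "name": "Widget"},
    {"id": "abc", "price": "4.50", "name": "Gadget"},  # id is not an integer
    {"id": "3", "name": "Gizmo"},                      # price is missing
]

valid_records, invalid_records = partition_records(records, schema)
print(len(valid_records), len(invalid_records))
```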

Working with CSV Files

Splitting Large Files

from smart_schema.adapters.csv_splitter import split_by_rows, split_by_column

# Split by number of rows
split_by_rows(
    "large_file.csv",
    rows_per_file=1000,
    output_prefix="split_"
)

# Split by column value
split_by_column(
    "data.csv",
    column="category",
    output_prefix="category_"
)
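
For readers curious about what row-based splitting involves, here is a minimal sketch using only the standard library. It is not the code behind `split_by_rows`; the helper names `split_rows` and `_write_chunk` and the `{prefix}{n}.csv` naming are assumptions, chosen to match the chunk_1.csv style used in the CLI examples. The key detail is that the header row is repeated in every chunk so each file can be read on its own.

```python
import csv
import tempfile
from pathlib import Path


def _write_chunk(header, rows, output_prefix: str, index: int) -> Path:
    """Write one chunk file with the header followed by its rows."""
    path = Path(f"{output_prefix}{index}.csv")
    with open(path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def split_rows(source: Path, rows_per_file: int, output_prefix: str) -> list[Path]:
    """Split `source` into numbered CSV chunks, repeating the header in each."""
    written: list[Path] = []
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the file has a header row
        chunk: list[list[str]] = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                written.append(_write_chunk(header, chunk, output_prefix, len(written) + 1))
                chunk = []
        if chunk:  # final, possibly shorter, chunk
            written.append(_write_chunk(header, chunk, output_prefix, len(written) + 1))
    return written


# Demo on a throwaway 5-row file, split into chunks of 2 rows.
workdir = Path(tempfile.mkdtemp())
source = workdir / "data.csv"
source.write_text("id,category\n1,a\n2,b\n3,a\n4,c\n5,b\n")
chunks = split_rows(source, rows_per_file=2, output_prefix=str(workdir / "split_"))
print([chunk.name for chunk in chunks])
```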

Inferring Column Types

from smart_schema.adapters.csv_inference import infer_column_types
import pandas as pd

df = pd.read_csv("data.csv")
column_types = infer_column_types(df)
print("Inferred column types:", column_types)
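
To give a feel for what column type inference involves, here is a minimal standard-library sketch. It is not Smart Schema's implementation, and `infer_column_types` may use different rules and return different names. The helpers `infer_value_type` and `infer_types_from_rows`, and the int, float, datetime, str order of attempts, are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def infer_value_type(value: str) -> str:
    """Return the first type name whose parser accepts a single cell."""
    for type_name, parse in (
        ("int", int),
        ("float", float),
        ("datetime", datetime.fromisoformat),
    ):
        try:
            parse(value)
            return type_name
        except ValueError:
            continue
    return "str"


def infer_types_from_rows(rows: list[dict[str, str]]) -> dict[str, str]:
    """Infer one type name per column from string-valued rows."""
    result: dict[str, str] = {}
    for row in rows:
        for column, value in row.items():
            if value == "":
                continue  # treat empty cells as missing, not as evidence
            seen = infer_value_type(value)
            previous = result.get(column)
            if previous is None or previous == seen:
                result[column] = seen
            elif {previous, seen} == {"int", "float"}:
                result[column] = "float"  # ints widen to floats
            else:
                result[column] = "str"  # any other mix is kept as text
    return result


sample = (
    "id,price,created_at,name\n"
    "1,9.99,2024-01-15,Widget\n"
    "2,12,2024-02-01T08:30:00,Gadget\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
column_types = infer_types_from_rows(rows)
print(column_types)
```

A column that mixes integers and decimals is widened to float, while any other disagreement falls back to str, which is the conservative choice when a single bad cell should not break loading.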

Contributing

We welcome contributions! Here's how you can help:

Setting Up Development Environment

  1. Fork the repository
  2. Clone your fork:
    git clone https://github.com/yourusername/smart-schema.git
    cd smart-schema
    

