A tool for generating and validating data schemas
Smart Schema
Smart Schema is a Python package for generating and validating data schemas from CSV files, JSON data, and pandas DataFrames. It generates Pydantic models from your data and validates records against them.
Features
- Schema Generation: Automatically generate Pydantic models from:
- CSV files
- JSON data
- Pandas DataFrames
- Data Validation: Validate data against generated schemas
- CSV Processing:
- Split large CSV files
- Infer column types
- Validate CSV data
- Model Management: Save and load generated models
- Rich CLI: User-friendly command-line interface with detailed output
Installation
As a Binary (CLI Tool)
# Using pip
pip install smart-schema
# Using pipx (recommended for CLI tools)
pipx install smart-schema
As a Library
# Using pip
pip install smart-schema
# Using Poetry
poetry add smart-schema
# From Source
git clone https://github.com/ipriyaaanshu/smart-schema.git
cd smart-schema
pip install -e .
Command Line Interface
Smart Schema includes a CLI for working with data schemas. After installation, the smart-schema command is available.
Basic Commands
# Show help and available commands
smart-schema --help
# Show help for a specific command
smart-schema generate-model --help
Generate Models
# Generate a model from a CSV file
smart-schema generate-model data.csv --output models/product_model.py
# Generate a model with specific datetime columns
smart-schema generate-model data.csv --datetime-columns created_at,updated_at --output models/product_model.py
# Generate a model from JSON data
smart-schema generate-model data.json --type json --output models/order_model.py
Validate Data
# Validate a CSV file against a model
smart-schema validate data.csv --model models/product_model.py
# Validate and save valid records
smart-schema validate data.csv --model models/product_model.py --output valid_data.csv
# Show detailed validation errors
smart-schema validate data.csv --model models/product_model.py --verbose
Process CSV Files
# Split a large CSV file into smaller chunks
smart-schema split data.csv --rows 1000 --output split_
# Split a CSV file by column values
smart-schema split data.csv --by-column category --output category_
# Infer column types from a CSV file
smart-schema infer-types data.csv --output types.json
Common Options
# Show progress bar for long operations
smart-schema generate-model data.csv --progress
# Specify input file encoding
smart-schema generate-model data.csv --encoding utf-8
# Use a different delimiter for CSV files
smart-schema generate-model data.csv --delimiter ";"
# Skip header row in CSV files
smart-schema generate-model data.csv --no-header
Output Formats
# Save model as Python file (default)
smart-schema generate-model data.csv --output models/product_model.py
# Save model as JSON schema
smart-schema generate-model data.csv --output schema.json --format json
# Save validation report as HTML
smart-schema validate data.csv --model models/product_model.py --output report.html --format html
Examples
- Generate a model from a CSV file and validate it:
# Generate model
smart-schema generate-model products.csv --output models/product_model.py
# Validate the same file
smart-schema validate products.csv --model models/product_model.py
- Process a large CSV file:
# Split into 1000-row chunks
smart-schema split large_file.csv --rows 1000 --output chunks/chunk_
# Generate model from first chunk
smart-schema generate-model chunks/chunk_1.csv --output models/data_model.py
# Validate all chunks
for f in chunks/chunk_*.csv; do
    smart-schema validate "$f" --model models/data_model.py
done
- Work with JSON data:
# Generate model from JSON
smart-schema generate-model config.json --type json --output models/config_model.py
# Validate JSON data
smart-schema validate data.json --model models/config_model.py --type json
Quickstart
Basic Usage
from smart_schema import ModelGenerator, ModelValidator
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Generate a model from the DataFrame
generator = ModelGenerator(name="Product")
model = generator.from_dataframe(
    df,
    datetime_columns=['last_updated']
)
# Validate data against the model
validator = ModelValidator(model)
valid_records, invalid_records = validator.validate_dataframe(df)
Command Line Interface
# Generate a model from a CSV file
smart-schema generate-model data.csv --output models/product_model.py
# Validate a CSV file against a model
smart-schema validate data.csv --model models/product_model.py
# Split a large CSV file
smart-schema split data.csv --rows 1000
Detailed Usage
Generating Models
From CSV Files
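from pathlib import Path
from smart_schema import ModelGenerator
import pandas as pd
# Read CSV file
df = pd.read_csv('data.csv')
# Generate model
generator = ModelGenerator(name="Product")
model = generator.from_dataframe(
    df,
    datetime_columns=['created_at', 'updated_at']
)
# Save model to file (create the models/ directory first if it is missing)
model_file = Path("models/product_model.py")
model_file.parent.mkdir(parents=True, exist_ok=True)
with open(model_file, "w") as f:
    f.write("from pydantic import BaseModel\n\n")
    f.write(f"class {model.__name__}(BaseModel):\n")
    # Note: __name__ only covers plain types such as int or str; other
    # annotations (for example datetime) also need an import in the output file
    for field_name, field in model.model_fields.items():
        f.write(f"    {field_name}: {field.annotation.__name__}\n")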
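The features list mentions loading saved models, but the steps above only cover saving. As a minimal sketch of one way to load a class back from a saved Python file, the standard library's importlib can import the file by path. The helper name `load_class` and the dataclass stand-in file are assumptions made for illustration; this is not Smart Schema's own loading API.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_class(path, class_name):
    """Import the Python file at `path` and return the named class from it."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    # Register the module before executing it, as the importlib docs recommend
    sys.modules[spec.name] = module
    spec.loader.exec_module(module)
    return getattr(module, class_name)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a saved model file; a real one would define a Pydantic model
    model_file = Path(tmp) / "product_model.py"
    model_file.write_text(
        "from dataclasses import dataclass\n\n"
        "@dataclass\n"
        "class Product:\n"
        "    id: int\n"
        "    name: str\n"
    )
    Product = load_class(model_file, "Product")
    product = Product(id=1, name="Widget")
    print(product)
```

The same helper should work for a file that defines a Pydantic model, provided pydantic is installed in the environment that loads it.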
From JSON Data
from smart_schema import ModelGenerator
# Sample JSON data
json_data = {
    "user": {
        "id": 1,
        "name": "John Doe",
        "email": "john@example.com"
    },
    "orders": [
        {
            "order_id": "ORD-001",
            "items": [
                {"product_id": "P1", "quantity": 2}
            ]
        }
    ]
}
# Generate model
generator = ModelGenerator(name="OrderSystem")
model = generator.from_json(
    json_data,
    datetime_columns=['order_created_at']
)
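To get a feel for what a schema generator has to discover in nested JSON like the sample above, the following sketch walks the data and records the type of each leaf. The function `describe` is a hypothetical helper written for this illustration, and treating lists as homogeneous is an assumption; it does not show how `from_json` works internally.

```python
def describe(value):
    """Return a nested description of the types found in a JSON-like value."""
    if isinstance(value, dict):
        return {key: describe(item) for key, item in value.items()}
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so the first element is representative
        return [describe(value[0])] if value else []
    return type(value).__name__


json_data = {
    "user": {"id": 1, "name": "John Doe", "email": "john@example.com"},
    "orders": [
        {"order_id": "ORD-001", "items": [{"product_id": "P1", "quantity": 2}]}
    ],
}

structure = describe(json_data)
print(structure)
```

Each nested dictionary in the result corresponds to a place where a generated schema would likely need a nested model rather than a scalar field.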
Validating Data
from smart_schema import ModelValidator
# Validate DataFrame
validator = ModelValidator(model)
valid_records, invalid_records = validator.validate_dataframe(df)
# Print validation results
print(f"Valid records: {len(valid_records)}")
print(f"Invalid records: {len(invalid_records)}")
if invalid_records:
    print("\nInvalid Records Details:")
    for record in invalid_records:
        print(f"\nRecord: {record['record']}")
        for error in record['errors']:
            print(f"  - {error['msg']}")
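The reporting loop above relies on each invalid entry carrying a `record` and a list of `errors` with a `msg` key. The sketch below reproduces that result shape with plain Python so it can be explored without any data files. The function `validate_records`, the dictionary-based schema, and the message wording are assumptions for illustration and are not part of Smart Schema.

```python
def validate_records(records, schema):
    """Split records into valid ones and invalid ones with per-field errors."""
    valid, invalid = [], []
    for record in records:
        errors = []
        for field, expected in schema.items():
            if field not in record:
                errors.append({"loc": field, "msg": "field required"})
            elif not isinstance(record[field], expected):
                actual = type(record[field]).__name__
                errors.append(
                    {"loc": field, "msg": f"expected {expected.__name__}, got {actual}"}
                )
        if errors:
            invalid.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, invalid


# Hypothetical schema: field name mapped to the expected Python type
schema = {"id": int, "name": str}
records = [
    {"id": 1, "name": "Widget"},
    {"id": "two", "name": "Gadget"},
    {"name": "Gizmo"},
]

valid_records, invalid_records = validate_records(records, schema)
for entry in invalid_records:
    print(entry["record"], [error["msg"] for error in entry["errors"]])
```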
Working with CSV Files
Splitting Large Files
from smart_schema.adapters.csv_splitter import split_by_rows, split_by_column
# Split by number of rows
split_by_rows(
    "large_file.csv",
    rows_per_file=1000,
    output_prefix="split_"
)
# Split by column value
split_by_column(
    "data.csv",
    column="category",
    output_prefix="category_"
)
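For readers curious about what row-based splitting involves, here is a standard-library sketch that writes fixed-size chunks and repeats the header in each one. The function names `split_rows` and `_write_chunk`, and the `<prefix><n>.csv` naming, are assumptions for this example; the package's `split_by_rows` may behave differently.

```python
import csv
import tempfile
from pathlib import Path


def _write_chunk(prefix, index, header, rows):
    """Write one chunk file named <prefix><index>.csv and return its path."""
    path = Path(f"{prefix}{index}.csv")
    with open(path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def split_rows(source, rows_per_file, output_prefix):
    """Split `source` into chunks of at most `rows_per_file` data rows."""
    written = []
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, index = [], 1
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                written.append(_write_chunk(output_prefix, index, header, chunk))
                chunk, index = [], index + 1
        if chunk:  # flush the final, possibly shorter, chunk
            written.append(_write_chunk(output_prefix, index, header, chunk))
    return written


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.csv"
    source.write_text("id,category\n1,a\n2,b\n3,a\n4,b\n5,a\n")
    chunks = split_rows(source, rows_per_file=2, output_prefix=str(Path(tmp) / "split_"))
    chunk_names = [path.name for path in chunks]
    last_chunk = chunks[-1].read_text()
    print(chunk_names)
```

Reading the file one row at a time keeps memory use bounded by the chunk size, which matters for the large files this feature targets.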
Inferring Column Types
from smart_schema.adapters.csv_inference import infer_column_types
import pandas as pd
df = pd.read_csv("data.csv")
column_types = infer_column_types(df)
print("Inferred column types:", column_types)
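As a rough illustration of how column type inference can work, the sketch below tries progressively looser parsers on each column's values and keeps the first one that accepts them all. The helper names, the candidate order (int, float, ISO datetime, then str), and the type labels are assumptions for this example; `infer_column_types` may use different rules and return a different structure.

```python
import csv
import io
from datetime import datetime

# Candidate parsers, tried from narrowest to widest
PARSERS = (("int", int), ("float", float), ("datetime", datetime.fromisoformat))


def infer_type(values):
    """Return the first type name whose parser accepts every non-empty value."""
    non_empty = [value for value in values if value]
    if not non_empty:
        return "str"
    for name, parser in PARSERS:
        try:
            for value in non_empty:
                parser(value)
        except ValueError:
            continue
        return name
    return "str"


def infer_csv_types(text):
    """Infer a type name for each column of CSV text that has a header row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    return {column: infer_type([row[column] for row in rows]) for column in columns}


sample = (
    "id,price,created_at,name\n"
    "1,9.99,2024-01-05,Widget\n"
    "2,12,2024-02-10,Gadget\n"
)
inferred = infer_csv_types(sample)
print(inferred)
```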
Contributing
We welcome contributions! Here's how you can help:
Setting Up Development Environment
- Fork the repository
- Clone your fork:
git clone https://github.com/yourusername/smart-schema.git
cd smart-schema