Infer Pydantic schemas from JSON and CSV files automatically

Schemai - Schema Inference for Pydantic

Automatically infer production-ready Pydantic models from JSON and CSV files with a single command.

schemai automates a repetitive task for data engineers: converting raw JSON and CSV data into validated, typed Python models.

Features

  • One-Command Schema Inference: Generate Pydantic models from any JSON or CSV file
  • Type Inference: Automatically detects integers, floats, strings, booleans, lists, and nested objects
  • Production-Ready: Generates clean, well-structured Pydantic v2 models
  • CLI + Library: Use as a command-line tool or import as a Python library
  • Batch Processing: Process multiple files at once
  • Code Generation: Export inferred schemas as executable Python code
  • Flexible Output: Generate models, JSON schema, or Python code

Installation

Install from PyPI or from source:

# From PyPI
pip install schemai

# From source
git clone https://github.com/yourusername/schemai.git
cd schemai
pip install -e .

Quick Start

CLI Usage

Infer schema from JSON file

schemai infer data.json -n User

Output:

from pydantic import BaseModel
from typing import Optional

class User(BaseModel):
    name: Optional[str] = None
    age: Optional[int] = None
    email: Optional[str] = None
    is_active: Optional[bool] = None

Infer schema from CSV file

schemai infer customers.csv -n Customer

Save output to file

schemai infer data.json -n Product -o product_schema.py

Process multiple files

schemai batch *.json -o schemas/

Get file information

schemai info data.json

Library Usage

from schemai import SchemaInferencer

# Create inferencer
inferencer = SchemaInferencer()

# Infer from JSON
model = inferencer.infer_from_json("data.json", model_name="User")

# Infer from CSV
model = inferencer.infer_from_csv("customers.csv", model_name="Customer")

# Generate code
code = inferencer.to_code(model, class_name="User")
print(code)

# Use the model
user = model(name="Alice", age=30, email="alice@example.com")
print(user.model_dump_json())

Examples

Example 1: JSON Data

Input file (users.json):

{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "is_active": true,
  "tags": ["admin", "user"]
}

Command:

schemai infer users.json -n User

Generated Model:

from pydantic import BaseModel
from typing import Optional, List

class User(BaseModel):
    name: Optional[str] = None
    age: Optional[int] = None
    email: Optional[str] = None
    is_active: Optional[bool] = None
    tags: Optional[List[str]] = None
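
The generated model above follows a regular pattern: every field is wrapped in Optional and defaults to None. As a rough illustration of how such source text could be rendered from a mapping of field names to type annotations, here is a minimal sketch. The render_model helper is hypothetical and is not part of schemai's API; it only reproduces the layout shown in this example.

```python
def render_model(class_name: str, fields: dict[str, str]) -> str:
    """Render Pydantic model source text from {field name: annotation text}.

    Hypothetical helper for illustration; not schemai's implementation.
    """
    # Import List only when some annotation actually uses it.
    uses_list = any("List[" in annotation for annotation in fields.values())
    typing_names = "Optional, List" if uses_list else "Optional"
    lines = [
        "from pydantic import BaseModel",
        f"from typing import {typing_names}",
        "",
        f"class {class_name}(BaseModel):",
    ]
    for name, annotation in fields.items():
        # Every field is optional with a None default, as in the example output.
        lines.append(f"    {name}: Optional[{annotation}] = None")
    if not fields:
        # Keep the class body syntactically valid when there are no fields.
        lines.append("    pass")
    return "\n".join(lines) + "\n"


source = render_model("User", {"name": "str", "age": "int", "tags": "List[str]"})
print(source)
```

Keeping the annotations as plain strings means the rendering step does not need to import Pydantic itself; only the generated file does.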

Example 2: CSV Data

Input file (products.csv):

product_id,name,price,in_stock
1,Widget,19.99,true
2,Gadget,29.99,false
3,Tool,9.99,true

Command:

schemai infer products.csv -n Product

Generated Model:

from pydantic import BaseModel
from typing import Optional

class Product(BaseModel):
    product_id: Optional[int] = None
    name: Optional[str] = None
    price: Optional[float] = None
    in_stock: Optional[bool] = None
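
CSV cells are always read as strings, so a type has to be guessed from the text of each cell. The sketch below shows one way the column types in this example could be derived using only the standard library. It is an assumption about the general approach, not schemai's actual algorithm; the names infer_cell_type, widen, and infer_column_types are hypothetical, and the code assumes well-formed rows with one cell per header.

```python
import csv
import io

SAMPLE_CSV = """product_id,name,price,in_stock
1,Widget,19.99,true
2,Gadget,29.99,false
3,Tool,9.99,true
"""


def infer_cell_type(text: str) -> type:
    """Guess the narrowest Python type for the text of one CSV cell."""
    if text.strip().lower() in {"true", "false"}:
        return bool
    for candidate in (int, float):
        try:
            candidate(text)
        except ValueError:
            continue
        return candidate
    return str


def widen(first: type, second: type) -> type:
    """Combine the types seen in two rows of the same column."""
    if first is second:
        return first
    if {first, second} == {int, float}:
        return float
    # Any other disagreement falls back to str (an assumption of this sketch).
    return str


def infer_column_types(csv_text: str) -> dict[str, type]:
    """Infer one type per column from every data row."""
    column_types: dict[str, type] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, cell in row.items():
            seen = infer_cell_type(cell)
            column_types[name] = widen(column_types.get(name, seen), seen)
    return column_types


column_types = infer_column_types(SAMPLE_CSV)
print({name: kind.__name__ for name, kind in column_types.items()})
```

Checking for true/false before trying int and float keeps boolean columns from being classified as strings.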

Command Reference

schemai infer

Infer a schema from a single file.

schemai infer FILE [OPTIONS]

Options:

  • -n, --name TEXT: Name for the generated model class (default: GeneratedModel)
  • -o, --output PATH: Output file path (if not provided, prints to stdout)
  • -f, --format [model|code]: Output format (default: code)
  • --strict: Enable strict type checking

schemai batch

Process multiple files and generate schemas.

schemai batch FILES... [OPTIONS]

Options:

  • -o, --output PATH: Output directory for generated models

schemai info

Display information about a file's inferred schema.

schemai info FILE [OPTIONS]

Options:

  • --sample-rows INTEGER: Number of rows to sample from CSV
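
How --sample-rows is applied internally is not documented here. As a minimal sketch of row sampling, assuming it simply means reading the first N data rows after the header, the standard library's csv.DictReader and itertools.islice are enough. The sample_csv_rows function is a hypothetical name used only for this illustration.

```python
import csv
import io
from itertools import islice
from typing import TextIO


def sample_csv_rows(handle: TextIO, sample_rows: int) -> list[dict[str, str]]:
    """Read at most `sample_rows` data rows; the header row is not counted."""
    reader = csv.DictReader(handle)
    # islice stops early, so the rest of a large file is never read.
    return list(islice(reader, sample_rows))


csv_text = (
    "product_id,name,price,in_stock\n"
    "1,Widget,19.99,true\n"
    "2,Gadget,29.99,false\n"
    "3,Tool,9.99,true\n"
)
rows = sample_csv_rows(io.StringIO(csv_text), sample_rows=2)
print(rows)
```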

Supported File Formats

  • JSON: Objects and arrays of objects
  • CSV: Comma-separated values with header row
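
Because JSON input can be either a single object or an array of objects, an inference step typically needs to normalize both shapes into a list of records first. The following sketch illustrates that idea and also reports which fields appear in every record. Both load_records and field_presence are hypothetical helpers, not functions exported by schemai.

```python
import json


def load_records(json_text: str) -> list[dict]:
    """Parse JSON text into a list of objects, wrapping a single object in a list."""
    data = json.loads(json_text)
    if isinstance(data, dict):
        return [data]
    if isinstance(data, list) and all(isinstance(item, dict) for item in data):
        return data
    raise ValueError("expected a JSON object or an array of objects")


def field_presence(records: list[dict]) -> dict[str, bool]:
    """Map each field name to True if every record contains it, else False."""
    all_keys: dict[str, None] = {}
    for record in records:
        # dict.fromkeys keeps first-seen order while removing duplicates.
        all_keys.update(dict.fromkeys(record))
    return {key: all(key in record for record in records) for key in all_keys}


records = load_records('[{"name": "Alice", "age": 30}, {"name": "Bob"}]')
presence = field_presence(records)
print(presence)
```

A field that is missing from some records is a natural candidate for an Optional annotation with a None default.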

Type Mapping

schemai infers the following Python types:

JSON/CSV Type    Python Type
null             Optional[str]
true/false       bool
123              int
123.45           float
"text"           str
[1, 2, 3]        List[int]
{...}            dict
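
To make the table concrete, here is a minimal sketch that applies the same mapping to values parsed with the standard json module. The python_type_name function is a hypothetical illustration, not schemai's code. Picking the list item type from the first element, and using str for an empty list, are assumptions of this sketch.

```python
import json
from typing import Any


def python_type_name(value: Any) -> str:
    """Return the annotation text from the table above for one parsed JSON value."""
    if value is None:
        return "Optional[str]"
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    if isinstance(value, list):
        # Assumption: the first element decides the item type; empty lists use str.
        item = python_type_name(value[0]) if value else "str"
        return f"List[{item}]"
    if isinstance(value, dict):
        return "dict"
    raise TypeError(f"unsupported JSON value: {value!r}")


record = json.loads(
    '{"id": 123, "score": 123.45, "ok": true, "label": "text",'
    ' "tags": [1, 2, 3], "note": null, "meta": {}}'
)
annotations = {key: python_type_name(val) for key, val in record.items()}
print(annotations)
```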

Advanced Usage

Custom Type Handling

from schemai import SchemaInferencer

# Enable strict type checking (see the --strict CLI option)
inferencer = SchemaInferencer(strict=True)
model = inferencer.infer_from_json("data.json")

Generate Code Without Using CLI

from schemai import SchemaInferencer

inferencer = SchemaInferencer()
model = inferencer.infer_from_json("users.json", model_name="User")

# Get Python code
code = inferencer.to_code(model, class_name="User")

# Save to file
with open("user_schema.py", "w") as f:
    f.write(code)

Development

Setup Development Environment

# Clone repository
git clone https://github.com/yourusername/schemai.git
cd schemai

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode with dev dependencies
pip install -e ".[dev]"

Running Tests

pytest
pytest --cov=schemai  # With coverage report

Code Quality

# Format code
black schemai/

# Sort imports
isort schemai/

# Lint
ruff check schemai/

# Type checking
mypy schemai/

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and linting
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Why Schemai?

Writing Pydantic models by hand from data files is slow, repetitive work. schemai automates it, so you can focus on data transformation and analysis instead of boilerplate model definitions.

Use cases:

  • Data pipeline setup for new data sources
  • API request/response modeling
  • Data validation frameworks
  • Machine learning data preprocessing
  • Rapid prototyping with new datasets

Roadmap

  • PostgreSQL table schema inference
  • Parquet file support
  • JSON Schema inference
  • Model inheritance and composition
  • Custom validation rules
  • Type refinement with examples
  • Web UI for schema exploration
  • Integration with popular data tools (Airflow, dbt, etc.)

Support


Built with ❤️ for data engineers by data engineers.
