Infer Pydantic schemas from JSON and CSV files automatically
Schemai - Schema Inference for Pydantic
Automatically infer production-ready Pydantic models from JSON and CSV files with a single command.
schemai automates a repetitive data-engineering task: converting raw JSON and CSV data into validated, typed Python models.
Features
- One-Command Schema Inference: Generate Pydantic models from any JSON or CSV file
- Type Inference: Automatically detects integers, floats, strings, booleans, lists, and nested objects
- Production-Ready: Generates clean, well-structured Pydantic v2 models
- CLI + Library: Use as a command-line tool or import as a Python library
- Batch Processing: Process multiple files at once
- Code Generation: Export inferred schemas as executable Python code
- Flexible Output: Generate models, JSON schema, or Python code
Installation
Install from PyPI (coming soon) or from source:
# From PyPI (once published)
pip install schemai
# From source
git clone https://github.com/yourusername/schemai.git
cd schemai
pip install -e .
Quick Start
CLI Usage
Infer schema from JSON file
schemai infer data.json -n User
Output:
from pydantic import BaseModel
from typing import Optional
class User(BaseModel):
    name: Optional[str] = None
    age: Optional[int] = None
    email: Optional[str] = None
    is_active: Optional[bool] = None
Infer schema from CSV file
schemai infer customers.csv -n Customer
Save output to file
schemai infer data.json -n Product -o product_schema.py
Process multiple files
schemai batch *.json -o schemas/
Get file information
schemai info data.json
Library Usage
from schemai import SchemaInferencer
# Create inferencer
inferencer = SchemaInferencer()
# Infer from JSON
model = inferencer.infer_from_json("data.json", model_name="User")
# Infer from CSV
model = inferencer.infer_from_csv("customers.csv", model_name="Customer")
# Generate code
code = inferencer.to_code(model, class_name="User")
print(code)
# Use the model
user = model(name="Alice", age=30, email="alice@example.com")
print(user.model_dump_json())
Examples
Example 1: JSON Data
Input file (users.json):
{
"name": "John Doe",
"age": 30,
"email": "john@example.com",
"is_active": true,
"tags": ["admin", "user"]
}
Command:
schemai infer users.json -n User
Generated Model:
from pydantic import BaseModel
from typing import Optional, List
class User(BaseModel):
    name: Optional[str] = None
    age: Optional[int] = None
    email: Optional[str] = None
    is_active: Optional[bool] = None
    tags: Optional[List[str]] = None
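The generated model is ordinary Python source text. As a rough sketch of how such text could be assembled from a field-to-type mapping (this is not schemai's actual implementation; `render_model` and its arguments are hypothetical names used only for illustration):

```python
from typing import Dict


def render_model(class_name: str, fields: Dict[str, str]) -> str:
    """Render Pydantic-style class source from a field -> annotation mapping.

    Hypothetical helper for illustration; not part of schemai's API.
    """
    # Import only the typing names that the annotations actually use.
    typing_names = ["Optional"]
    if any("List[" in annotation for annotation in fields.values()):
        typing_names.append("List")

    lines = [
        "from pydantic import BaseModel",
        f"from typing import {', '.join(typing_names)}",
        "",
        "",
        f"class {class_name}(BaseModel):",
    ]
    if not fields:
        lines.append("    pass")
    for name, annotation in fields.items():
        # Every field is optional with a None default, as in the output above.
        lines.append(f"    {name}: Optional[{annotation}] = None")
    return "\n".join(lines) + "\n"


source = render_model("User", {"name": "str", "age": "int", "tags": "List[str]"})
print(source)
```

The sketch only builds a string, so it runs without Pydantic installed; importing the rendered module would of course require it.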
Example 2: CSV Data
Input file (products.csv):
product_id,name,price,in_stock
1,Widget,19.99,true
2,Gadget,29.99,false
3,Tool,9.99,true
Command:
schemai infer products.csv -n Product
Generated Model:
from pydantic import BaseModel
from typing import Optional
class Product(BaseModel):
    product_id: Optional[int] = None
    name: Optional[str] = None
    price: Optional[float] = None
    in_stock: Optional[bool] = None
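CSV cells are always read as strings, so column types have to be guessed from their text. The following is a minimal sketch of one way to do that with the standard library. It is an assumption about the general approach, not schemai's internal logic, and `infer_cell_type` and `infer_column_type` are hypothetical names:

```python
import csv
import io
from typing import List


def infer_cell_type(value: str) -> str:
    """Guess the type of one CSV cell from its text (illustrative only)."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "bool"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "str"


def infer_column_type(values: List[str]) -> str:
    """Pick one type for a column; fall back to str when cells disagree."""
    kinds = {infer_cell_type(value) for value in values}
    if not kinds:
        return "str"
    if kinds == {"int"}:
        return "int"
    if kinds <= {"int", "float"}:
        return "float"  # a mix of ints and floats widens to float
    if kinds == {"bool"}:
        return "bool"
    return "str"


SAMPLE = """product_id,name,price,in_stock
1,Widget,19.99,true
2,Gadget,29.99,false
3,Tool,9.99,true
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
column_types = {
    name: infer_column_type([row[name] for row in rows])
    for name in reader.fieldnames
}
print(column_types)
```

Falling back to `str` for mixed columns is a conservative choice: a string field accepts every cell, whereas a narrower guess could reject valid rows later.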
Command Reference
schemai infer
Infer a schema from a single file.
schemai infer FILE [OPTIONS]
Options:
-n, --name TEXT: Name for the generated model class (default: GeneratedModel)
-o, --output PATH: Output file path (if not provided, prints to stdout)
-f, --format [model|code]: Output format (default: code)
--strict: Enable strict type checking
schemai batch
Process multiple files and generate schemas.
schemai batch FILES... [OPTIONS]
Options:
-o, --output PATH: Output directory for generated models
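To illustrate what a batch run has to decide, here is a small sketch that maps each input file to an output module inside the output directory. The `<stem>_schema.py` naming and the `plan_batch_outputs` helper are assumptions made for this example; the README does not state how schemai names batch outputs.

```python
from pathlib import Path
from typing import Dict, Iterable

SUPPORTED_SUFFIXES = {".json", ".csv"}


def plan_batch_outputs(files: Iterable[str], output_dir: str) -> Dict[Path, Path]:
    """Map each supported input file to a target path in output_dir.

    Hypothetical helper; the output naming scheme is an assumption.
    """
    out = Path(output_dir)
    plan: Dict[Path, Path] = {}
    for name in files:
        src = Path(name)
        if src.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # skip anything that is not JSON or CSV
        plan[src] = out / f"{src.stem}_schema.py"
    return plan


plan = plan_batch_outputs(["users.json", "products.csv", "notes.txt"], "schemas")
for src, dst in plan.items():
    print(f"{src} -> {dst}")
```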
schemai info
Display information about a file's inferred schema.
schemai info FILE [OPTIONS]
Options:
--sample-rows INTEGER: Number of rows to sample from CSV
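Sampling keeps inspection cheap on large files because only the first rows are read. A minimal sketch of the idea using the standard library follows; `sample_rows` is a hypothetical helper, and reading from the top of the file is an assumption about how sampling might work, not a description of schemai's behavior.

```python
import csv
import io
from itertools import islice
from typing import Dict, List, TextIO


def sample_rows(handle: TextIO, sample_size: int) -> List[Dict[str, str]]:
    """Read at most sample_size data rows from an open CSV file."""
    reader = csv.DictReader(handle)
    # islice stops early, so the rest of the file is never parsed.
    return list(islice(reader, sample_size))


DATA = """product_id,name,price,in_stock
1,Widget,19.99,true
2,Gadget,29.99,false
3,Tool,9.99,true
"""

sampled = sample_rows(io.StringIO(DATA), 2)
print(sampled)
```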
Supported File Formats
- JSON: Objects and arrays of objects
- CSV: Comma-separated values with header row
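Both JSON shapes can be treated uniformly by normalizing the input to a list of records first. The sketch below shows one way to do that and to check which fields appear in every record. It is an illustration under stated assumptions, not schemai's code, and `load_records` and `field_presence` are hypothetical names.

```python
import json
from typing import Any, Dict, List


def load_records(text: str) -> List[Dict[str, Any]]:
    """Normalize a JSON object or an array of objects to a list of dicts."""
    data = json.loads(text)
    if isinstance(data, dict):
        return [data]
    if isinstance(data, list) and all(isinstance(item, dict) for item in data):
        return data
    raise ValueError("expected a JSON object or an array of objects")


def field_presence(records: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Map each field name to True if it occurs in every record."""
    names: List[str] = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)  # keep first-seen order
    return {name: all(name in record for record in records) for name in names}


records = load_records('[{"name": "Alice", "age": 30}, {"name": "Bob"}]')
presence = field_presence(records)
print(presence)
```

A field that is missing from some records is a natural candidate for an `Optional` annotation with a `None` default.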
Type Mapping
schemai infers the following Python types:
| JSON/CSV Type | Python Type |
|---|---|
| null | Optional[str] |
| true/false | bool |
| 123 | int |
| 123.45 | float |
| "text" | str |
| [1, 2, 3] | List[int] |
| {...} | dict |
Advanced Usage
Custom Type Handling
from schemai import SchemaInferencer
inferencer = SchemaInferencer(strict=True)
model = inferencer.infer_from_json("data.json")
Generate Code Without Using CLI
from schemai import SchemaInferencer
inferencer = SchemaInferencer()
model = inferencer.infer_from_json("users.json", model_name="User")
# Get Python code
code = inferencer.to_code(model, class_name="User")
# Save to file
with open("user_schema.py", "w") as f:
    f.write(code)
Development
Setup Development Environment
# Clone repository
git clone https://github.com/yourusername/schemai.git
cd schemai
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode with dev dependencies
pip install -e ".[dev]"
Running Tests
pytest
pytest --cov=schemai # With coverage report
Code Quality
# Format code
black schemai/
# Sort imports
isort schemai/
# Lint
ruff check schemai/
# Type checking
mypy schemai/
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests and linting
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see LICENSE file for details.
Why Schemai?
Writing Pydantic models by hand from data files is slow, repetitive work. schemai automates it, so you can focus on data transformation and analysis instead of boilerplate model definitions.
Use cases:
- Data pipeline setup for new data sources
- API request/response modeling
- Data validation frameworks
- Machine learning data preprocessing
- Rapid prototyping with new datasets
Roadmap
- PostgreSQL table schema inference
- Parquet file support
- JSON Schema inference
- Model inheritance and composition
- Custom validation rules
- Type refinement with examples
- Web UI for schema exploration
- Integration with popular data tools (Airflow, dbt, etc.)
Support
Built with ❤️ for data engineers by data engineers.