Transmogrify
An ETL/ELT transformation library for converting complex nested JSON data into flat, tabular formats.
Features
- Flatten deeply nested JSON/dict structures with customizable delimiter options (see the sketch after this list)
- Transform values during processing with custom functions
- Native Formats: output to PyArrow Tables, Python dictionaries, or JSON objects
- Bytes Output: serialize directly to Parquet, CSV, or JSON bytes
- File Export: write to various file formats (JSON, CSV, Parquet)
- Recover from errors in malformed data with customizable strategies
- Optimize for performance with optional dependencies
- Stream large datasets efficiently
- Deterministic ID generation for data consistency across processing runs
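To make the first feature above concrete, here is a minimal, library-independent sketch of what delimiter-based flattening produces. The flatten helper below is illustrative only and is not Transmogrify's API:

def flatten(obj, delimiter="_", prefix=""):
    # Recursively walk a nested dict, joining key paths with the delimiter.
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, delimiter, path))
        else:
            flat[path] = value
    return flat

print(flatten({"user": {"contact": {"email": "john@example.com"}}}))
# {'user_contact_email': 'john@example.com'}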
Installation
pip install transmogrify
For minimal installation without optional dependencies:
pip install transmogrify[minimal]
For development installation:
pip install transmogrify[dev]
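Note that some shells (notably zsh) treat square brackets as glob patterns, so quoting the extras is safest:

pip install "transmogrify[dev]"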
See the installation guide for more details.
Quick Example
import transmogrify as tm
# Sample nested data
data = {
    "user": {
        "id": 1,
        "name": "John Doe",
        "contact": {
            "email": "john@example.com"
        },
        "orders": [
            {"id": 101, "amount": 99.99},
            {"id": 102, "amount": 45.50}
        ]
    }
}
# Process the data
processor = tm.Processor()
result = processor.process(data)
# Native data structure output
tables = result.to_dict() # Get all tables as Python dictionaries
pa_tables = result.to_pyarrow_tables() # Get as PyArrow Tables
# Access the data in memory
main_table = tables["main"] # Main table as Python dict
orders = tables["user_orders"] # Child table as Python dict
# Bytes output for direct writing
json_bytes = result.to_json_bytes(indent=2) # Get all tables as JSON bytes
csv_bytes = result.to_csv_bytes() # Get all tables as CSV bytes
parquet_bytes = result.to_parquet_bytes() # Get all tables as Parquet bytes
# Direct write to files
with open("main_table.json", "wb") as f:
    f.write(json_bytes["main"])
# Or use PyArrow tables directly
pa_table = pa_tables["main"] # Work with PyArrow Table directly
print(f"Table has {pa_table.num_rows} rows and {pa_table.num_columns} columns")
# File output (still supported)
result.write_all_json("output_dir/json")
result.write_all_csv("output_dir/csv")
result.write_all_parquet("output_dir/parquet")
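The bytes methods appear to return a mapping of table name to serialized bytes (the json_bytes["main"] lookup above suggests as much), so persisting every table is a short loop. The output directory here is purely illustrative:

import pathlib

out = pathlib.Path("output_dir/csv")
out.mkdir(parents=True, exist_ok=True)
for name, payload in result.to_csv_bytes().items():
    # One file per flattened table, e.g. "main" and "user_orders"
    (out / f"{name}.csv").write_bytes(payload)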
Deterministic ID Generation
Transmogrify can generate consistent IDs for records across multiple processing runs:
# Configure deterministic IDs based on specific fields
processor = tm.Processor(
    deterministic_id_fields={
        "": "id",            # Root level uses the "id" field
        "user_orders": "id"  # Order records use the "id" field
    }
)

# Process the data - IDs will be consistent across runs
result = processor.process(data)

# For complex ID generation logic, use a custom function
import uuid

def custom_id_generator(record):
    # Generate a custom ID based on record contents
    if "id" in record:
        return f"CUSTOM-{record['id']}"
    return str(uuid.uuid4())  # Fallback for records without an "id"

processor = tm.Processor(id_generation_strategy=custom_id_generator)
See the deterministic IDs guide for more information.
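For instance, hashing stable business keys yields IDs that survive reprocessing without depending on a single field. This sketch assumes only what the example above shows, namely that the strategy receives the record dict and returns a string; the choice of key fields is hypothetical:

import hashlib

def hashed_id(record):
    # Illustrative strategy: identical business keys always map to the
    # same ID, no matter when the record is processed.
    key = f"{record.get('id')}|{record.get('name')}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

processor = tm.Processor(id_generation_strategy=hashed_id)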
Output Format Options
Transmogrify provides three main categories of output formats:
- Native Data Structures - Python objects like dictionaries and PyArrow Tables

  result.to_dict()            # Python dictionaries
  result.to_json_objects()    # JSON-serializable Python objects
  result.to_pyarrow_tables()  # PyArrow Tables

- Bytes Serialization - Raw bytes in JSON, CSV, or Parquet format

  result.to_json_bytes()     # JSON bytes
  result.to_csv_bytes()      # CSV bytes
  result.to_parquet_bytes()  # Parquet bytes

- File Output - Direct writing to files in different formats

  result.write_all_json()     # Write to JSON files
  result.write_all_csv()      # Write to CSV files
  result.write_all_parquet()  # Write to Parquet files
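Because the native output includes PyArrow Tables, results plug straight into the Arrow ecosystem. For example, to_pandas() is standard PyArrow (it requires pandas to be installed):

pa_tables = result.to_pyarrow_tables()
df = pa_tables["main"].to_pandas()  # standard PyArrow -> pandas conversion
print(df.head())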
Documentation
- Installation Guide
- Getting Started
- Output Formats
- In-Memory Processing
- Deterministic IDs
- API Reference
- Examples
Use Cases
- Data ETL pipelines
- API response processing
- JSON/CSV conversion
- Preparing nested data for tabular analysis
- Data normalization and standardization
- Integration with data processing frameworks
- In-memory data transformation
- Cloud-based serverless processing
- Incremental data processing with consistent IDs
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please make sure to update tests as appropriate.
License
Distributed under the MIT License. See LICENSE for more information.
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file transmog-0.1.0.tar.gz.
File metadata
- Download URL: transmog-0.1.0.tar.gz
- Upload date:
- Size: 58.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6625ea0d960892b6fc024cf65400253bbfb87c1610f127caf6814ed0e1690e9a |
| MD5 | 7deaf43cb5c90a8e891b1ab87f783c71 |
| BLAKE2b-256 | c384b04119fbcbb2682ad6238a0894a71338bcae106fd47d5650e25609bf81c2 |
File details
Details for the file transmog-0.1.0-py3-none-any.whl.
File metadata
- Download URL: transmog-0.1.0-py3-none-any.whl
- Upload date:
- Size: 69.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3df8d8432292ca0acbc7a797b564588fb600cc181cfc2ded1e81332589974a58 |
| MD5 | c00100291dff1a6939263a88cf6c3554 |
| BLAKE2b-256 | 0a285bd970c6d6decb2e45e777ce0e8ef485fad45d8168a1388af8865d9bca1e |