An ETL/ELT transformation library for converting nested JSON structures into flat, tabular formats

Transmogrify

A Python library for transforming complex nested JSON data into flat, structured formats.

Features

  • Flatten deeply nested JSON/dict structures with customizable delimiter options (see the sketch after this list)
  • Transform values during processing with custom functions
  • Output to native formats: PyArrow Tables, Python dictionaries, or JSON objects
  • Serialize directly to Parquet, CSV, or JSON bytes
  • Write to files in JSON, CSV, or Parquet format
  • Recover from errors in malformed data with customizable strategies
  • Optimize for performance with optional dependencies
  • Stream large datasets efficiently
  • Generate deterministic IDs for data consistency across processing runs
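
A minimal, library-independent sketch of the core flattening idea (plain Python, not Transmogrify's actual implementation): nested keys are joined with a configurable delimiter, and an optional function transforms each leaf value on the way out.

def flatten(obj, delimiter="_", transform=lambda v: v, prefix=""):
    # Join nested keys with the delimiter; apply transform to leaf values.
    # (Lists are left out here; Transmogrify extracts them as child tables.)
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, delimiter, transform, path))
        else:
            flat[path] = transform(value)
    return flat

flatten({"user": {"contact": {"email": "john@example.com"}}})
# {'user_contact_email': 'john@example.com'}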

Installation

pip install transmogrify

To install with the minimal set of optional dependencies:

pip install transmogrify[minimal]

For development installation:

pip install transmogrify[dev]

See the installation guide for more details.

Quick Example

import transmogrify as tm

# Sample nested data
data = {
    "user": {
        "id": 1,
        "name": "John Doe",
        "contact": {
            "email": "john@example.com"
        },
        "orders": [
            {"id": 101, "amount": 99.99},
            {"id": 102, "amount": 45.50}
        ]
    }
}

# Process the data
processor = tm.Processor()
result = processor.process(data)

# Native data structure output
tables = result.to_dict()                # Get all tables as Python dictionaries
pa_tables = result.to_pyarrow_tables()   # Get as PyArrow Tables

# Access the data in memory
main_table = tables["main"]              # Main table as Python dict
orders = tables["user_orders"]           # Child table as Python dict
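
# Illustrative shape only (exact column names depend on configuration):
# the main table flattens scalar fields into delimiter-joined columns
# such as user_contact_email, while "user_orders" holds one row per
# order, linked back to its parent record.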

# Bytes output for direct writing
json_bytes = result.to_json_bytes(indent=2)   # Get all tables as JSON bytes
csv_bytes = result.to_csv_bytes()             # Get all tables as CSV bytes
parquet_bytes = result.to_parquet_bytes()     # Get all tables as Parquet bytes

# Direct write to files
with open("main_table.json", "wb") as f:
    f.write(json_bytes["main"])

# Or use PyArrow tables directly
pa_table = pa_tables["main"]       # Work with PyArrow Table directly
print(f"Table has {pa_table.num_rows} rows and {pa_table.num_columns} columns")

# File output (still supported)
result.write_all_json("output_dir/json")
result.write_all_csv("output_dir/csv")
result.write_all_parquet("output_dir/parquet")

Deterministic ID Generation

Transmogrify can generate deterministic IDs, so records keep consistent identifiers across multiple processing runs:

# Configure deterministic IDs based on specific fields
processor = tm.Processor(
    deterministic_id_fields={
        "": "id",                     # Root level uses "id" field
        "user_orders": "id"           # Order records use "id" field
    }
)

# Process the data - IDs will be consistent across runs
result = processor.process(data)

# For complex ID generation logic, use a custom function
import uuid

def custom_id_generator(record):
    # Generate a custom ID based on record contents
    if "id" in record:
        return f"CUSTOM-{record['id']}"
    return str(uuid.uuid4())  # Fall back to a random UUID

processor = tm.Processor(id_generation_strategy=custom_id_generator)
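
For intuition, a deterministic ID is typically just a hash of the configured field's value, so equal inputs yield equal IDs on every run. A library-independent sketch using the standard library (make_stable_id is an illustrative helper, not part of Transmogrify's API):

import uuid

def make_stable_id(value):
    # uuid5 hashes a fixed namespace plus the value, so the same
    # input produces the same UUID across runs and machines
    return str(uuid.uuid5(uuid.NAMESPACE_URL, str(value)))

assert make_stable_id(101) == make_stable_id(101)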

See the deterministic IDs guide for more information.

Output Format Options

Transmogrify provides three main categories of output formats:

  1. Native Data Structures - Python objects like dictionaries and PyArrow Tables

    result.to_dict()              # Python dictionaries
    result.to_json_objects()      # JSON-serializable Python objects
    result.to_pyarrow_tables()    # PyArrow Tables
    
  2. Bytes Serialization - Raw bytes in JSON, CSV, or Parquet format (see the round-trip sketch after this list)

    result.to_json_bytes()        # JSON bytes
    result.to_csv_bytes()         # CSV bytes
    result.to_parquet_bytes()     # Parquet bytes
    
  3. File Output - Direct writing to files in different formats

    result.write_all_json()       # Write to JSON files
    result.write_all_csv()        # Write to CSV files
    result.write_all_parquet()    # Write to Parquet files
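
Because the bytes outputs are standard formats, they can be handed directly to other tools. For example, a sketch that loads the Parquet bytes back into PyArrow (this assumes to_parquet_bytes() returns a dict keyed by table name, as the Quick Example above shows for JSON bytes):

import io
import pyarrow.parquet as pq

parquet_bytes = result.to_parquet_bytes()
main = pq.read_table(io.BytesIO(parquet_bytes["main"]))
print(main.num_rows, main.schema)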
    

Use Cases

  • Data ETL pipelines
  • API response processing
  • JSON/CSV conversion
  • Preparing nested data for tabular analysis
  • Data normalization and standardization
  • Integration with data processing frameworks
  • In-memory data transformation
  • Cloud-based serverless processing
  • Incremental data processing with consistent IDs

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Please make sure to update tests as appropriate.

License

Distributed under the MIT License. See LICENSE for more information.
