
A helper that creates a compatibility layer between inputs in different formats and other parts of an application.


Schema Overseer – Local


This is a local version of Schema Overseer, intended for use within a single repository.
For the multi-repository service see schema-overseer-service.

Schema Overseer ensures strict adherence to defined data formats and raises an exception when asked to process an unsupported input schema.
In more technical terms, it is an adapter[^1] between inputs with different schemas and other application components.

Why is it important?

  • Data formats evolve over time
  • Developers need to simultaneously support both legacy and new data formats
  • Mismatches between input data format and the corresponding code can lead to unexpected and hard-to-debug runtime errors
  • As the number of supported data formats increases, application code often becomes less maintainable

Features

  • Straightforward extensibility
  • Static analysis checks via type checking
  • Detailed runtime checks
  • Incoming data validation with pydantic

Use Cases and Tutorials

  1. Maintain multiple versions of an external-facing API

  2. Manage metadata for different Machine Learning models

Installation

pip install schema-overseer-local

Quick Start

  1. Create a file adapter.py to define the adapter logic.
    For this quick start we will use a single file, but in a real application it is better to split the logic across multiple files.
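    The snippets below assume these imports at the top of adapter.py (importing SchemaRegistry and BuildError from schema_overseer_local is an assumption based on the package name; check the package's public API):

    from dataclasses import dataclass
    from typing import Any, Callable

    from pydantic import BaseModel

    from schema_overseer_local import BuildError, SchemaRegistry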

  2. Output. Define the output schema you plan to work with.
    The output schema can be any object; for this tutorial we use a dataclass. Its attributes can be any Python objects, including non-serializable ones. The output can behave the same way as the original input object or completely differently. Here is an example of different behavior:

    @dataclass
    class Output:
        value: int
        function: Callable  # non-serializable attributes are allowed
    
  3. Registry. Create a SchemaRegistry instance for Output.

    schema_registry = SchemaRegistry(Output)
    
  4. Input schemas. Define the input schemas using pydantic and register them in schema_registry.

    @schema_registry.add_schema
    class OldInputFormat(BaseModel):
        value: str
    
    @schema_registry.add_schema
    class NewInputFormat(BaseModel):
        renamed_value: int
    
  5. Builders. Implement a function to convert each registered input into Output.
    Builders require type hints to link each input format to Output.

    @schema_registry.add_builder
    def old_builder(data: OldInputFormat) -> Output:
        return Output(
            value=int(data.value),  # convert the legacy string value to an int
            function=my_function,  # my_function is defined elsewhere in the application
        )

    @schema_registry.add_builder
    def new_builder(data: NewInputFormat) -> Output:
        return Output(
            value=data.renamed_value,
            function=my_other_function,  # likewise defined elsewhere
        )
    
  6. Finally, use schema_registry inside the application to get a validated output or to handle the exception.

    schema_registry.setup()  # see "Discovery" chapter in documentation
    
    def my_service(raw_data: dict[str, Any]):
        try:
            output = schema_registry.build(source_dict=raw_data)  # build output object
        except BuildError as error:
            raise MyApplicationError() from error  # handle the exception
    
        # use output object
        output.function()
        return output.value
    

The full quickstart example lives in the repository. Run it:

git clone git@github.com:Schema-Overseer/schema-overseer-local.git
cd schema-overseer-local
poetry install
poetry run python -m tutorial.quickstart.app

Usage

Using multiple Python files

While you can define the registry, models, and builders in one or two files, it is usually better to split them into separate files, i.e., Python modules.

There are several ways to organize the file structure; we recommend one of the following:

  • Minimal — start with this one while you are still figuring out what works best
  • Expanded builders — useful when each builder involves a lot of code
  • Detached output — useful when the output is a large or complex entity

Minimal
├── __init__.py
├── builders.py
├── models
│   ├── __init__.py
│   ├── v1.py
│   ├── v2.py
│   ├── ...
│   └── vN.py
└── registry.py

Expanded builders
├── __init__.py
├── builders
│   ├── __init__.py
│   ├── v1.py
│   ├── v2.py
│   ├── ...
│   └── vN.py
├── models
│   ├── __init__.py
│   ├── v1.py
│   ├── v2.py
│   ├── ...
│   └── vN.py
└── registry.py

Detached output
├── __init__.py
├── builders.py
├── models
│   ├── __init__.py
│   ├── v1.py
│   ├── v2.py
│   ├── ...
│   └── vN.py
├── output.py
└── registry.py

Models (i.e., input data formats) are decoupled first for two reasons:

  • If models contain nested inner models, it is harder to distinguish the inner models of different root models (see the related FAQ question below).
  • If you transition to schema-overseer-service, the models are sourced from outside your code, so this split will come naturally.

Load modules automatically

[!NOTE] Python will not load modules automatically unless they are explicitly imported.
SchemaRegistry has a discovery_paths: Sequence[str] argument to load all required models.
Specified modules and packages will be loaded at SchemaRegistry.setup().

Definition (SchemaRegistry(...)) is decoupled from loading (SchemaRegistry.setup()) to prevent circular imports; that is why calling setup() is required.

The discovery_paths argument takes a sequence of strings in absolute import format. Entries can be either Python modules (single files) or Python packages (folders with __init__.py and other *.py files inside).

For example, this works for the minimal layout mentioned above:

schema_registry = SchemaRegistry(
    Output,
    discovery_paths=[
        'example_project.payload.models',  # loaded as package
        'example_project.payload.builders',  # loaded as module
    ],
)

Runtime safety and strict self-checks

In addition to static type hint checks, schema-overseer-local performs runtime checks to ensure:

  • Each registered model has only one corresponding builder.
  • All builders have a proper call signature, which includes:
    • One argument for the input data
    • No additional non-default arguments
  • All builders have proper type hints (a counterexample is sketched below)
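
To illustrate, a builder like the following would fail these self-checks because it lacks the return annotation that links it to Output (whether the error is raised at registration or at setup() is an assumption; the list above only states that the checks happen at runtime):

@schema_registry.add_builder
def broken_builder(data: OldInputFormat):  # missing the "-> Output" return annotation
    return Output(value=int(data.value), function=my_function)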

Additional runtime checks:

  • If validate_output=True is set (the default is False), the builder's return value is validated against the annotated type using pydantic.
  • By default, schema-overseer-local selects the builder of the first schema that validates the input. If check_for_single_valid_schema=True is enabled, it ensures that exactly one schema is valid for the input data; if multiple schemas are valid, a MultipleValidSchemasError is raised.
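
A minimal sketch of enabling both options (passing them to the SchemaRegistry constructor is an assumption; the text above names the flags but not where they are set):

schema_registry = SchemaRegistry(
    Output,
    validate_output=True,  # validate builder return values against the annotation with pydantic
    check_for_single_valid_schema=True,  # raise MultipleValidSchemasError on ambiguous input
)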

Object as a source

The SchemaRegistry.build() method operates in two modes:

  • Using dict-like objects as inputs and extracting fields with __getitem__
    Use build(source_dict=...) for this option
  • Using objects with data as attributes and extracting fields with getattr
    Use build(source_object=...) for this option

source_dict and source_object are mutually exclusive.
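
For example, the same payload can be passed either way (RawPayload is a purely illustrative class, reusing the quickstart imports):

# dict-like input: fields are extracted with __getitem__
output = schema_registry.build(source_dict={'renamed_value': 42})

# object input: fields are extracted with getattr
@dataclass
class RawPayload:
    renamed_value: int

output = schema_registry.build(source_object=RawPayload(renamed_value=42))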

Use one of the input schemas as the output

TODO

FAQ

Q: Why does this project exist? Isn't it too much overhead for such a simple task?

A: It depends on how many formats you need to support. With only a few, schema-overseer-local would indeed be overhead. But in projects with many different formats, such an extensive adapter layer can be helpful. Another goal of schema-overseer-local is to serve as a fast and simple introduction to schema-overseer-service, which targets sophisticated use cases involving multiple teams and repositories.

Q: How is this project better than an adapter I can code quickly myself?

A: schema-overseer-local has three important benefits: it provides type checking, it has very detailed runtime checks, and it is easily extensible.

Q: Why do I have to use type hints in builders?

A: schema-overseer-local uses the same pattern as pydantic and FastAPI for input and output validation in both runtime and static analysis. It provides an extra layer of defense against code errors. Even if your code is not entirely correctly typed or not checked with static analysis tools like mypy, the data is still validated.

Q: Should I reuse inner pydantic models in different data formats?

Code example:

class InnerModel(BaseModel):
    value: int


class InputFormatV1(BaseModel):
    inner: InnerModel


class InnerModelV2(BaseModel):
    value: int


class InputFormatV2(BaseModel):
    inner: InnerModelV2  # or reuse InnerModel?

A: Not really. While it might be tempting to adhere to the DRY[^2] principle in this context, it's generally a better approach to fully separate nested pydantic models into distinct modules, avoiding their reuse even if they are identical.
The primary rationale is future code maintainability: tracking modifications in reused models can be challenging, and the introduction of a new format version could require changes to the inner model, which would then demand separation regardless.

[^1]: Adapter pattern
[^2]: Don't repeat yourself
