A schema converter from AsyncAPI and JSON Schema to PyArrow

Schema to pyarrow converter

This library provides a tool for converting JSON Schema and AsyncAPI YAML schemas to PyArrow schemas. It supports a wide range of data types and formats, including integers, floats, strings, booleans, arrays, and objects.

When to use this library

  • Contract-First Data Engineering: By using a schema-first approach, you can define the structure of your data before it is generated. This ensures that all stakeholders agree on the format of the data, reducing errors and miscommunication.

  • Verify data format: With this library, you can verify that the data you receive has the correct format and only process it in this format. This prevents errors and ensures that your data pipeline is robust and reliable.

  • No false positives: With an explicit schema, PyArrow does not have to infer data types from the data itself. Values are read as the declared types, so a wrong guess cannot slip through unnoticed.

  • AsyncAPI support: AsyncAPI is a well-established format for defining APIs and data formats. This library supports AsyncAPI, making it easy to integrate with backend and platform teams that use this format.

  • JSON Schema support: JSON Schema is a widely used format for defining the structure of JSON data. This library supports JSON Schema, making it easy to integrate with existing JSON-based data pipelines.

Benefits

Using this library provides several benefits, including:

  • Improved data quality: By verifying the format of your data, you can ensure that it is correct and consistent.
  • Reduced errors: By avoiding false positives and ensuring that your data is processed correctly, you can reduce errors and improve the reliability of your data pipeline.
  • Increased efficiency: By using a schema-first approach, you can improve the efficiency of your data pipeline and reduce the time spent on data processing and validation.
  • Better collaboration: By using a well-established format like AsyncAPI, you can improve collaboration between teams and stakeholders, ensuring that everyone agrees on the format of the data.

Installation

Install this package via pip:

pip install schema2pyarrow

Development

To install this package in development mode, run:

pip install -e .

We are always happy to receive PRs and issues. Please see our contribution guidelines for more details.

Usage

The library provides several functions for converting schemas to PyArrow schemas. The main functions are:

  • async_api_to_pyarrow_schema(schema): Converts an AsyncAPI YAML schema to a PyArrow schema.
  • dict_to_pyarrow_schema(schema): Converts a JSON Schema dictionary to a PyArrow schema.

Here is an example of how to use the dict_to_pyarrow_schema function:

import json
from pathlib import Path
from schema2pyarrow.pyarrow_converter import dict_to_pyarrow_schema

with open(Path("tests/sample_schemas/simple_schema.json")) as f:
    data = json.load(f)

pyarrow_schema = dict_to_pyarrow_schema(data)

Here is an example of how to use async_api_to_pyarrow_schema:

import yaml
from pathlib import Path
from schema2pyarrow.pyarrow_converter import async_api_to_pyarrow_schema

with open(Path("tests/sample_schemas/complex_schema.yaml")) as f:
    data = yaml.safe_load(f)

pyarrow_schema = async_api_to_pyarrow_schema(data)

Once the schema is converted, it can be used in PyArrow to load data:

import pyarrow.json as paj
from pathlib import Path

arrow_table = paj.read_json(
    Path("sample_data.jsonl"),
    parse_options=paj.ParseOptions(
        explicit_schema=pyarrow_schema, unexpected_field_behavior="error"
    ),
)

Using the built-in CLI

This library also includes a CLI tool that can be used to convert AsyncAPI YAML schemas to PyArrow schemas.

It can be used to:

  • Convert schemas: convert multiple AsyncAPI YAML schemas to PyArrow schemas.
  • Check for errors: check for errors in the schema and only report problematic schemas (useful for a CI).

Usage

To use the CLI tool, run the following command:

schema2pyarrow path/*/**/*.yaml --check

Options

The CLI tool has the following options:

  • --check: Check for errors in the schema. Useful in a CI where you are only interested in the errors.
  • --metadata-path: A path to an AsyncAPI YAML file that will be used as an additional metadata check.

Providing custom schema data to the CLI

The CLI also lets you check that a specific block of data is part of the schema. A common use case is requiring a metadata block that contains at least a leading key and a last-updated timestamp.

To run the CLI with an additional metadata check, use the following command:

schema2pyarrow tests/sample_schemas/complex_schema.yaml --metadata-path tests/sample_schemas/metadata.yaml

If the provided metadata in the metadata file is not present in the schema under test, the CLI will exit with an error. Specifying a null datatype in the extra metadata is also supported. This is useful when verifying the existence of a specific key, regardless of its type.
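The kind of containment check described above can be sketched in plain Python. This is an illustration under stated assumptions, not the CLI's actual logic: the function name `missing_keys` is hypothetical, schemas are simplified to nested `properties` dictionaries, and a required type of `"null"` (or no type at all) is assumed to mean "the key must exist, any type is accepted".

```python
def missing_keys(required: dict, schema: dict, path: str = "") -> list:
    """Return dotted paths of required properties that are absent or typed differently.

    Hypothetical helper; both arguments are simplified JSON-Schema-like dicts.
    """
    problems = []
    have_props = schema.get("properties", {})
    for name, spec in required.get("properties", {}).items():
        here = f"{path}.{name}" if path else name
        if name not in have_props:
            problems.append(here)
            continue
        wanted = spec.get("type")
        if wanted in (None, "null"):
            continue  # assumption: null means an existence-only check
        found = have_props[name]
        if found.get("type") != wanted:
            problems.append(here)
        elif wanted == "object":
            # Descend into nested objects and check their properties as well.
            problems.extend(missing_keys(spec, found, here))
    return problems


required_block = {
    "properties": {
        "metadata": {
            "type": "object",
            "properties": {
                "key": {"type": "null"},  # any type, must exist
                "updated_at": {"type": "string"},
            },
        }
    }
}

schema_ok = {
    "properties": {
        "metadata": {
            "type": "object",
            "properties": {
                "key": {"type": "integer"},
                "updated_at": {"type": "string"},
            },
        },
        "payload": {"type": "string"},
    }
}

schema_bad = {
    "properties": {
        "metadata": {"type": "object", "properties": {"key": {"type": "integer"}}}
    }
}

print(missing_keys(required_block, schema_ok))
print(missing_keys(required_block, schema_bad))
```

An empty result means the schema under test contains the whole required block; any returned path names a key that would make such a check fail.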

More Use-Cases

This library offers a range of use cases beyond its primary functionality. Here are a few examples:

Converting an Airbyte Schema to PyArrow

Airbyte provides a schema, but the records it emits contain additional custom fields. To match the data produced by Airbyte, you need to wrap the schema with these additional fields. Here's how you can prepare your Airbyte schema:

  • Fetch the schema from Airbyte
  • Use the prepare_airbyte_schema function to add the necessary fields
  • Convert the schema to PyArrow using dict_to_pyarrow_schema

Example:

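from schema2pyarrow.pyarrow_converter import dict_to_pyarrow_schema
from schema2pyarrow.airbyte_utils import prepare_airbyte_schema

# fetch_schema_from_airbyte() is a placeholder for your own code that
# retrieves the stream's JSON Schema from Airbyte as a dictionary.
schema = fetch_schema_from_airbyte()
# Add the Airbyte-specific fields, then convert the result to PyArrow.
converted_schema = dict_to_pyarrow_schema(prepare_airbyte_schema(schema))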
