json-data-type-to-spark-sql-type-mapper
Project description
A Python mapper that converts JSON schema data types to Spark SQL types.
Introduction
Spark has built-in support for converting Avro data types into Spark SQL types, but lacks similar functionality for JSON data types. Instead, Spark can infer types from sample JSON documents. At first glance this might appear sufficient, but on closer inspection some disadvantages surface. For instance:
- It is impossible to define a StructField as optional. Every subsequent JSON document that is processed needs to supply all the fields that were present in the initial JSON document.
- Numeric values, both JSON integer and number, must be converted to the largest Spark type because their ranges are unknown. This can lead to additional storage requirements, although minimal in modern systems. When using a JSON schema that specifies ranges, the right Spark type can be selected.
- JSON arrays can be a pain. In the most common scenario they act as a list containing a single type, but they can also be used to define tuple structures with mandatory types and additional elements of any type.
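The two array flavours mentioned above can be told apart by the shape of the `items` keyword. A minimal sketch (the schema snippets are illustrative, not taken from this package):

```python
import json

# A homogeneous list: "items" is a single schema, which maps naturally
# to a Spark ArrayType with one element type.
list_schema = json.loads('{"type": "array", "items": {"type": "string"}}')

# A tuple-like array (JSON Schema draft-07 style): "items" is a list of
# schemas, one per position, optionally followed by extra elements.
tuple_schema = json.loads(
    '{"type": "array", "items": [{"type": "integer"}, {"type": "string"}]}'
)

# Distinguishing the two cases comes down to checking the type of "items".
print(isinstance(list_schema["items"], dict))   # True -> list style
print(isinstance(tuple_schema["items"], list))  # True -> tuple style
```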
This package provides a mapping function that can be used similarly to how Avro schemas are used, while keeping all relevant details to create a StructType with optimal StructFields. See the supported types.
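As a sketch of why the range hint matters, a mapper can pick a narrower integral type when the schema bounds fit. The type names follow Spark SQL's integral types; the selection thresholds below are assumptions based on the usual byte/short/int/long ranges, not this package's exact logic:

```python
# Hedged sketch: choose a Spark integral type name from JSON schema bounds.
# Thresholds mirror Spark SQL's ByteType/ShortType/IntegerType/LongType ranges.
def pick_integer_type(schema: dict) -> str:
    lo = schema.get("minimum")
    hi = schema.get("maximum")
    if lo is None or hi is None:
        return "LongType"  # bounds unknown: fall back to the largest type
    if -(2**7) <= lo and hi <= 2**7 - 1:
        return "ByteType"
    if -(2**15) <= lo and hi <= 2**15 - 1:
        return "ShortType"
    if -(2**31) <= lo and hi <= 2**31 - 1:
        return "IntegerType"
    return "LongType"

print(pick_integer_type({"type": "integer", "minimum": 0, "maximum": 100}))  # ByteType
print(pick_integer_type({"type": "integer"}))  # LongType
```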
How to use
Install package
First make sure you install the module into your environment. There are various options assuming you have a Python 3.* environment set up:
Install from PyPI
Not yet available. Working on it.
Install from TestPyPI
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ json2spark-mapper
Note: because the required package pyspark is not available in TestPyPI, the extra-index-url is needed.
From source
- clone the project
- navigate to the root directory
- issue pip install .

git clone https://github.com/vdweij/json-data-type-to-spark-sql-type-mapper.git
cd json-data-type-to-spark-sql-type-mapper
pip install .
Import module
To make the mapper function from_json_to_spark available in your Python file, use the following import statement:
from json2spark_mapper.schema_mapper import from_json_to_spark
Call mapping function
import json

with open("path-to-your-schema.json") as schema_file:
    json_schema = json.load(schema_file)

struct_type = from_json_to_spark(json_schema)
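The resulting StructType is typically handed to Spark when reading JSON data. A minimal sketch, with the schema text inlined to stay self-contained (the file name and Spark session are assumptions, and the json2spark-mapper calls are shown as comments since they require the package and a running Spark environment):

```python
import json

# Illustrative JSON schema text instead of a file (hypothetical field names).
schema_text = '{"type": "object", "properties": {"id": {"type": "integer"}}}'
json_schema = json.loads(schema_text)

# struct_type = from_json_to_spark(json_schema)            # requires json2spark-mapper
# df = spark.read.schema(struct_type).json("events.json")  # typical Spark usage
print(json_schema["properties"]["id"]["type"])  # integer
```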
Troubleshooting
Nothing here yet, as this is pretty straightforward, right?!
Issues
Please check existing issues before creating a new one.
Development
For development, install the [dev] dependencies of the package. This will also install pre-commit. Install the pre-commit hooks so that they run automatically whenever you create a new commit, ensuring code quality before pushing.
pip install .[dev]
pre-commit install
To run the unit tests locally, also install the [test] dependencies.
pip install .[test]
Pre-commit
Pre-commit is configured to lint and autofix all files with a standard set of hooks. Python code linting is done with black and ruff, with a selection of ruff plugins. Details are in .pre-commit-config.yaml.
More
See for more development information.
Project details
Download files
Source Distribution
Built Distribution
Hashes for json2spark_mapper-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | eb97a9f5373d72d7b87cbde83f8b002069b06db840b08c4808a79b6c245c1869
MD5 | 9d089552f40744bc68c4c94bc1965a12
BLAKE2b-256 | 8d9b3d2b69d3edae95bce64249d3c48e35af46651b8e153c7271d6fe7b1403d7