Infer JSON schema from CSV file
Infer schema
Infer JSON schema from CSV files.
Installation
The script can be installed via pip:
pip install infer-schema
Alternatively, since infer-schema is currently a single Python 3 script without any external dependencies, you can download it to a directory in your PATH and make it executable:
curl https://raw.githubusercontent.com/abhidg/infer-schema/main/infer_schema.py -o infer-schema
chmod +x infer-schema
./infer-schema file
Usage
See infer-schema(1) (from a local clone, use man ./infer-schema.1).
For the Python library interface, see below.
Examples
With a data file data.csv containing:
date,count
2023-11-20,10
2023-11-21,23
Running infer-schema will produce a JSON Schema that the CSV conforms to:
{
  "$schema": "https://json-schema.org/draft-07/schema",
  "description": "Description of data.csv",
  "properties": {
    "count": {
      "description": "Description for column count",
      "maximum": 23,
      "minimum": 10,
      "type": "integer"
    },
    "date": {
      "description": "Description for column date",
      "format": "date",
      "type": "string"
    }
  },
  "required": [
    "date",
    "count"
  ],
  "title": "JSON Schema for data.csv"
}
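To give a feel for how a schema like the one above can be derived, here is a minimal standard-library sketch of per-column inference. It is not the infer-schema implementation: the helper names infer_column and sketch_schema are hypothetical, and the sketch assumes every row is complete and every cell is non-empty.

```python
import csv
import io
from datetime import date


def infer_column(values):
    """Return a JSON Schema fragment for a list of non-empty string cells."""

    def all_parse(parser):
        # True when every cell can be converted by the given parser.
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    # Try the narrowest type first, then widen.
    if all_parse(int):
        numbers = [int(v) for v in values]
        return {"type": "integer", "minimum": min(numbers), "maximum": max(numbers)}
    if all_parse(float):
        numbers = [float(v) for v in values]
        return {"type": "number", "minimum": min(numbers), "maximum": max(numbers)}
    if all_parse(date.fromisoformat):
        return {"type": "string", "format": "date"}
    return {"type": "string"}


def sketch_schema(text):
    """Build a partial schema (properties and required) from CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    return {
        "properties": {c: infer_column([r[c] for r in rows]) for c in columns},
        "required": columns,
    }


sample = "date,count\n2023-11-20,10\n2023-11-21,23\n"
result = sketch_schema(sample)
print(result)
```

The sketch leaves out titles, descriptions, enums and null handling, which the real tool adds on top of this kind of type detection.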
Python library
The same result can be obtained using the Python module:
from infer_schema import infer_schema
schema = infer_schema("data.csv")
print(schema)
Parameters
infer_schema(file: Union[Path, str], enum_threshold: int = 10, enum_fields: List[str] = [], bound_types: Set[DType] = {"integer", "number"}, explicit_nulls: bool = False)
Here DType is one of number, integer or string.
- file (Path or str): CSV file
- enum_threshold (int, default = 10): Threshold of number of unique values in a column below which the field is typed enum
- enum_fields (List[str], default = []): Forces the listed fields to be classed as enums, useful for including fields that do not meet the enum_threshold criterion
- bound_types (Set[DType], default = {"integer", "number"}): Types for which bounds should be encoded into the schema. The default is the numeric types, for which minimum / maximum are determined. For strings, minLength and maxLength are determined. Set to None to disable bound detection
- explicit_nulls (bool, default = False): By default, fields that have null and another type are typed as non-required with the non-null type. Another interpretation is to assume the field will be present and allow it to be dual-typed with null.
Returns: JSON Schema as a dictionary
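The enum and null-handling parameters described above can be illustrated with a small standalone sketch. This is one reading of those descriptions, not the library's code: the helpers enum_fragment and nullable_field are hypothetical, "below" is assumed to mean strictly fewer unique values than the threshold, and dual-typing is assumed to use the standard JSON Schema type-array form.

```python
def enum_fragment(values, enum_threshold=10, forced=False):
    """Return an enum fragment, or None if the column should not be an enum.

    `forced` stands in for the column being listed in enum_fields.
    """
    unique = sorted(set(values))
    # Assumption: strictly fewer unique values than the threshold.
    if forced or len(unique) < enum_threshold:
        return {"enum": unique}
    return None


def nullable_field(non_null_type, explicit_nulls=False):
    """Return (property fragment, is_required) for a column that contains nulls."""
    if explicit_nulls:
        # Field is assumed always present, but its value may be null.
        return {"type": [non_null_type, "null"]}, True
    # Default reading: field may be absent; when present it has the non-null type.
    return {"type": non_null_type}, False


print(enum_fragment(["red", "green", "red"]))
print(enum_fragment([str(i) for i in range(10)]))
print(nullable_field("integer"))
print(nullable_field("integer", explicit_nulls=True))
```

Under these assumptions, a column with ten distinct values is not an enum at the default threshold unless it is forced, and explicit_nulls only changes how a nullable column is typed and whether it is listed as required.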
Development
Install pre-commit to set up ruff linting and formatting.
To generate the man page, scdoc is required:
scdoc < infer-schema.1.scd > infer-schema.1