Infer JSON schema from CSV file

Infer schema

Infer JSON schema from CSV files.

Installation

The script can be installed via pip:

pip install infer-schema

Alternatively, because infer-schema is currently a single Python 3 script with no external dependencies, you can download it to a directory on your PATH and make it executable:

curl https://raw.githubusercontent.com/abhidg/infer-schema/main/infer_schema.py -o infer-schema
chmod +x infer-schema
./infer-schema file

Usage

See the infer-schema(1) man page (from a local clone, use man ./infer-schema.1).

For the Python library interface, see below.

Examples

Given a data file such as data.csv:

date,count
2023-11-20,10
2023-11-21,23

Running infer-schema on the file produces a JSON Schema to which the CSV conforms:

{
  "$schema": "https://json-schema.org/draft-07/schema",
  "description": "Description of data.csv",
  "properties": {
    "count": {
      "description": "Description for column count",
      "maximum": 23,
      "minimum": 10,
      "type": "integer"
    },
    "date": {
      "description": "Description for column date",
      "format": "date",
      "type": "string"
    }
  },
  "required": [
    "date",
    "count"
  ],
  "title": "JSON Schema for data.csv"
}
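
To make "conforms to" concrete, the following is a minimal sketch that checks the sample rows against the keywords used in the schema above (type, minimum, maximum, format and required). It is a hand-rolled check using only the standard library, not a full JSON Schema validator and not part of infer-schema; the names check_row, PROPERTIES and REQUIRED are made up for this illustration.

```python
import csv
import io
from datetime import date

# Keywords copied from the example schema above; only these are checked.
PROPERTIES = {
    "count": {"type": "integer", "minimum": 10, "maximum": 23},
    "date": {"type": "string", "format": "date"},
}
REQUIRED = ["date", "count"]

# The sample data file from the example, inlined so the sketch is self-contained.
CSV_TEXT = "date,count\n2023-11-20,10\n2023-11-21,23\n"


def check_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    for name in REQUIRED:
        if row.get(name) in (None, ""):
            problems.append(f"{name}: missing required value")
    for name, rules in PROPERTIES.items():
        raw = row.get(name)
        if raw in (None, ""):
            continue  # already reported above if the field is required
        if rules["type"] == "integer":
            try:
                value = int(raw)
            except ValueError:
                problems.append(f"{name}: {raw!r} is not an integer")
                continue
            if value < rules["minimum"]:
                problems.append(f"{name}: {value} is below minimum {rules['minimum']}")
            if value > rules["maximum"]:
                problems.append(f"{name}: {value} is above maximum {rules['maximum']}")
        elif rules.get("format") == "date":
            try:
                date.fromisoformat(raw)
            except ValueError:
                problems.append(f"{name}: {raw!r} is not an ISO 8601 date")
    return problems


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
report = [check_row(row) for row in rows]
print(report)  # one (empty) problem list per row
```

Because the inferred bounds come from the data itself, a later row with a count of 24 would fail the maximum check, so the bounds may need loosening by hand before the schema is used on new data.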

Python library

The same result can be obtained using the Python module:

from infer_schema import infer_schema

schema = infer_schema("data.csv")
print(schema)
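
Since infer_schema returns a dictionary, print(schema) shows a Python dict rather than JSON text. The sketch below serializes such a dictionary with the standard json module. The schema literal is a trimmed stand-in for the return value so that the snippet runs without the package installed, and the indent and sort_keys settings are choices made here to resemble the example output, not documented behaviour of infer-schema.

```python
import json

# Stand-in for the dictionary returned by infer_schema("data.csv"),
# trimmed to a few keys from the example above.
schema = {
    "title": "JSON Schema for data.csv",
    "required": ["date", "count"],
    "$schema": "https://json-schema.org/draft-07/schema",
}

# indent=2 and sort_keys=True are assumptions chosen to mimic the
# layout of the example output.
text = json.dumps(schema, indent=2, sort_keys=True)
print(text)
```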

Parameters

infer_schema(file: Union[Path, str], enum_threshold: int = 10, enum_fields: List[str] = [], bound_types: Set[DType] = {"integer", "number"}, explicit_nulls: bool = False)

Here DType is one of number, integer or string.

  • file (Path or str): CSV file

  • enum_threshold (int, default = 10): Number of unique values in a column below which the field is typed as an enum

  • enum_fields (List[str], default = []): Fields to force to be classed as enums; useful for including fields that do not meet the enum_threshold criterion

  • bound_types (Set[DType], default = {"integer", "number"}): Types for which bounds are encoded into the schema. By default only the numeric types are bounded, using minimum and maximum. For strings, minLength and maxLength are determined. Set to None to disable bound detection

  • explicit_nulls (bool, default = False): By default, fields that contain nulls alongside another type are typed as non-required with the non-null type. Setting this to True applies the other interpretation: the field is assumed to be present and is allowed to be dual-typed with null.

Returns: JSON Schema as a dictionary
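
The parameters above can be illustrated with a small sketch of the ideas behind them. This is not the library's implementation: the helper names are hypothetical, the strict "below" comparison for enum_threshold is an assumption, and the shapes returned for explicit_nulls are one plausible reading of the description. The real enum rule evidently involves further conditions, since the date column in the example above has only two distinct values and is not typed as an enum.

```python
from typing import Dict, List, Tuple


def looks_like_enum(values: List[str], enum_threshold: int = 10) -> bool:
    """Assumed rule: enum when the distinct-value count is below the threshold."""
    return len(set(values)) < enum_threshold


def numeric_bounds(values: List[float]) -> Dict[str, float]:
    """Bounds recorded for integer / number columns."""
    return {"minimum": min(values), "maximum": max(values)}


def string_bounds(values: List[str]) -> Dict[str, int]:
    """Bounds recorded for string columns when "string" is in bound_types."""
    lengths = [len(v) for v in values]
    return {"minLength": min(lengths), "maxLength": max(lengths)}


def nullable_property(dtype: str, explicit_nulls: bool = False) -> Tuple[dict, bool]:
    """Return (property schema, listed in "required"?) for a column with nulls."""
    if explicit_nulls:
        # Field assumed present, but its value may be null.
        return {"type": [dtype, "null"]}, True
    # Default: keep the non-null type and make the field optional.
    return {"type": dtype}, False


print(looks_like_enum(["red", "green", "red"]))
print(numeric_bounds([10, 23]))
print(string_bounds(["2023-11-20", "2023-11-21"]))
print(nullable_property("integer"))
print(nullable_property("integer", explicit_nulls=True))
```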

Development

Install pre-commit to set up ruff linting and formatting.

To generate the man page, scdoc is required:

scdoc < infer-schema.1.scd > infer-schema.1

This version

0.1
