Infer schema

Infer JSON schema from CSV files.

Installation

The script can be installed via pip:

pip install infer-schema

Alternatively, because infer-schema is currently a single Python 3 script with no external dependencies, you can download it to a directory on your PATH and make it executable:

curl https://raw.githubusercontent.com/abhidg/infer-schema/main/infer_schema.py -o infer-schema
chmod +x infer-schema
./infer-schema file

Usage

For command-line usage, see infer-schema(1). From a local clone, run man ./infer-schema.1.

For the Python library interface, see below.

Examples

Given a data file such as data.csv:

date,count
2023-11-20,10
2023-11-21,23

Running infer-schema on this file produces a JSON Schema that the CSV conforms to:

{
  "$schema": "https://json-schema.org/draft-07/schema",
  "description": "Description of data.csv",
  "properties": {
    "count": {
      "description": "Description for column count",
      "maximum": 23,
      "minimum": 10,
      "type": "integer"
    },
    "date": {
      "description": "Description for column date",
      "format": "date",
      "type": "string"
    }
  },
  "required": [
    "date",
    "count"
  ],
  "title": "JSON Schema for data.csv"
}
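
To make "conforms to" concrete, the following is a minimal sketch that checks the sample rows against the constraints in the schema above by hand, using only the standard library. It is not how infer-schema works internally, and the helper name row_conforms is hypothetical; a full JSON Schema validator would normally be used instead.

```python
import csv
import io
from datetime import date

# The sample data from above, inlined so the sketch is self-contained.
CSV_TEXT = """date,count
2023-11-20,10
2023-11-21,23
"""


def row_conforms(row: dict) -> bool:
    """Hypothetical helper: check one CSV row against the example schema."""
    # Both columns are listed under "required".
    if row.get("date") is None or row.get("count") is None:
        return False
    # "date": a string with format "date", i.e. an ISO 8601 calendar date.
    try:
        date.fromisoformat(row["date"])
    except ValueError:
        return False
    # "count": an integer with minimum 10 and maximum 23.
    try:
        count = int(row["count"])
    except ValueError:
        return False
    return 10 <= count <= 23


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
print(all(row_conforms(r) for r in rows))
```

Because the bounds are taken from the observed data, a later row with a count of 24 would fall outside the inferred schema.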

Python library

The same result can be obtained using the Python module:

from infer_schema import infer_schema

schema = infer_schema("data.csv")
print(schema)

Parameters

infer_schema(file: Union[Path, str], enum_threshold: int = 10, enum_fields: List[str] = [], bound_types: Set[DType] = {"integer", "number"}, explicit_nulls: bool = False)

Here DType is one of number, integer or string.

  • file (Path or str): Path to the CSV file

  • enum_threshold (int, default = 10): Number of unique values in a column below which the field is typed as an enum

  • enum_fields (List[str], default = []): Fields that are always classed as enums. Useful for fields that do not meet the enum_threshold criterion

  • bound_types (Set[DType], default = {"integer", "number"}): Types for which bounds are encoded into the schema. By default these are the numeric types, for which minimum and maximum are determined. For strings, minLength and maxLength are determined. Set to None to disable bound detection

  • explicit_nulls (bool, default = False): By default, a field that contains both null and another type is typed as non-required with the non-null type. The alternative interpretation, enabled by this flag, is to assume the field is always present and allow it to be dual-typed with null.
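
The two interpretations of a nullable column can be sketched as follows. This is one reading of the description above, not the library's code: the helper nullable_field is hypothetical, and the exact shape infer_schema emits (for example, the order of entries in the type list) is an assumption.

```python
from typing import Any, Dict, Tuple


def nullable_field(dtype: str, explicit_nulls: bool = False) -> Tuple[Dict[str, Any], bool]:
    """Hypothetical helper: return (property schema, is_required) for a
    column that holds nulls alongside values of type `dtype`."""
    if explicit_nulls:
        # The field must be present, but its value may be null.
        return {"type": [dtype, "null"]}, True
    # The field may be absent; when present it has the non-null type.
    return {"type": dtype}, False


print(nullable_field("integer"))
print(nullable_field("integer", explicit_nulls=True))
```

In JSON Schema, a list value for type means the instance may match any of the listed types, which is what allows the dual typing with null.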

Returns: JSON Schema as a dictionary
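
Since the return value is a plain dictionary, it can be serialised with the standard json module. The sketch below uses an abridged stand-in dictionary rather than calling infer_schema, so that it runs without the package installed. Using sort_keys=True reproduces the alphabetical key order seen in the example output, though whether the command-line tool serialises this way is an assumption.

```python
import json

# Abridged stand-in for the dictionary returned by infer_schema("data.csv").
schema = {
    "$schema": "https://json-schema.org/draft-07/schema",
    "title": "JSON Schema for data.csv",
    "properties": {
        "count": {"type": "integer", "minimum": 10, "maximum": 23},
    },
    "required": ["count"],
}

# Serialise with stable, alphabetical key order and two-space indentation.
text = json.dumps(schema, indent=2, sort_keys=True)
print(text)

# Parsing the text back yields an equal dictionary.
restored = json.loads(text)
```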

Development

Install pre-commit to set up ruff linting and formatting.

To generate the man page, scdoc is required:

scdoc < infer-schema.1.scd > infer-schema.1

Version: 0.1