A library for defining rule-based overrides on messy data.

datapatch

A Python library for defining rule-based overrides on messy data. Imagine, for example, importing a dataset in which each row is associated with a country, and the country names have been entered by humans. You might find names like Northkorea or Greet Britain that you want to normalise. datapatch provides a mechanism for building a flexible lookup table (usually stored as a YAML file) to catch and repair these data issues.

Installation

You can install datapatch from the Python Package Index:

pip install datapatch

Example

Given a YAML file like this:

countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain

The file can then be used to apply the data patches to raw input:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# iter_data() is a placeholder for your own source of row dictionaries.
# This applies the patch, or falls back to the original string if none matches:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
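
To make the matching rules more concrete, here is a rough, standard-library-only sketch of what a lookup like the one above does conceptually. It is not datapatch's implementation: the names `normalise`, `OPTIONS` and `lookup_value` are hypothetical, the cleaning step only approximates the `normalize`, `lowercase` and `asciify` settings, and the order in which rules are tried is an assumption.

```python
import re
import unicodedata


def normalise(text, lowercase=True, asciify=True):
    """Roughly clean a string before comparing it (hypothetical helper)."""
    if asciify:
        # Strips diacritics only; real transliteration across alphabets
        # (e.g. Cyrillic to Latin) needs a dedicated library.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if lowercase:
        text = text.lower()
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Mirrors the `options` list from the YAML example above.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {"match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
     "value": "North Korea"},
    {"contains": ["Britain"], "value": "Great Britain"},
]


def lookup_value(raw, default=None):
    """Return the value of the first option that applies, else `default`."""
    if raw is None:
        return default
    cleaned = normalise(raw)
    for option in OPTIONS:
        # `match` compares the whole cleaned string.
        if cleaned in {normalise(m) for m in option.get("match", [])}:
            return option["value"]
        # `contains` looks for a cleaned substring.
        if any(normalise(c) in cleaned for c in option.get("contains", [])):
            return option["value"]
    return default
```

In this sketch, `lookup_value("Greet Britain")` is caught by the `contains` rule, while an unknown name such as `lookup_value("Germany", default="Germany")` falls through to the default.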

Extended options

Several options are available to configure how the data patches are applied:

countries:
  # If you mark a lookup as required, a value that matches no options will
  # raise a `datapatch.exc:LookupException`.
  required: true
  # Normalisation removes many special characters and collapses multiple spaces.
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin).
  # Set asciify to false if you want to keep non-ASCII alphabets as they are.
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
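
As a rough illustration of two of these options, the sketch below shows how a `map` block can be read as shorthand for single `match`/`value` options, and what `required` means for unmatched input. It uses only the standard library; `SketchLookupError`, `expand_map` and `get_value` are hypothetical names, not datapatch's API, and letting `required` take precedence over a `default` is an assumption made here for simplicity.

```python
class SketchLookupError(Exception):
    """Stand-in for the library's LookupException (hypothetical name)."""


def expand_map(mapping):
    """Turn a `map` block into the equivalent list of match/value options."""
    return [{"match": key, "value": value} for key, value in mapping.items()]


def get_value(options, raw, required=False, default=None):
    """Exact-match `raw` against the options; raise if required and unmatched."""
    for option in options:
        if option["match"] == raw:
            return option["value"]
    if required:
        # Assumption: a required lookup raises even when a default is given.
        raise SketchLookupError(f"No option matches {raw!r}")
    return default


# The explicit option plus the expanded `map` entries from the YAML above.
options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})
```

With `required=True`, an unmatched value fails loudly instead of passing through silently, which helps when every input is expected to be covered by a rule.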

Result objects

You can also associate more details with a result and access them:

countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR

This can be accessed as a result object with attributes:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
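
The final assertion relies on attributes that were never defined reading as `None` rather than raising an error. One way such an object could be built is sketched below with the standard library only; `SketchResult` is a hypothetical class for illustration and says nothing about how datapatch implements its result objects.

```python
class SketchResult:
    """Holds arbitrary option attributes; unknown ones read as None."""

    def __init__(self, **attributes):
        self._attributes = dict(attributes)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            # Avoid infinite recursion if `_attributes` is not set yet.
            raise AttributeError(name)
        return self._attributes.get(name)


# Mirrors the `label` and `code` attributes from the YAML above.
result = SketchResult(label="France", code="FR")
```

Here `result.label` and `result.code` return the stored values, while `result.capital` evaluates to `None`.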

License

datapatch is licensed under the terms of the MIT license, which is included as LICENSE.
