datapatch

A Python library for defining rule-based overrides on messy data. Imagine, for example, importing a dataset in which each row is associated with a country, and the country names have been entered by humans. You might find names like Northkorea or Greet Britain that you want to normalise. datapatch provides a mechanism for building a flexible lookup table (usually stored as a YAML file) to catch and repair these data issues.

Installation

You can install datapatch from the Python package index:

pip install datapatch

Example

Given a YAML file like this:

countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain

The file can be used to apply the data patches against raw input:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# This will apply the patch or default to the original string if none exists:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
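
To make the fallback behaviour concrete, here is a minimal standard-library sketch of what a rule table with `match` and `contains` entries does conceptually. It is not datapatch's implementation: the `RULES` list, `clean` and this `get_value` function are hypothetical stand-ins, and the cleaning step is a simplification of the `normalize` and `lowercase` options.

```python
from typing import Optional

# Hypothetical in-memory rules mirroring the YAML example above.
RULES = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def clean(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are forgiving."""
    return " ".join(text.lower().split())


def get_value(raw: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Return the patched value for `raw`, or `default` if no rule applies."""
    if raw is None:
        return default
    needle = clean(raw)
    for rule in RULES:
        # `match` requires the whole cleaned string to be equal.
        if any(clean(m) == needle for m in rule.get("match", [])):
            return rule["value"]
        # `contains` only requires a substring hit.
        if any(clean(c) in needle for c in rule.get("contains", [])):
            return rule["value"]
    return default


rows = [{"Country": "nordkorea"}, {"Country": "Greet Britain"}, {"Country": "Chile"}]
for row in rows:
    raw = row.get("Country")
    row["Country"] = get_value(raw, default=raw)
```

Passing the raw value as the default means unknown inputs such as "Chile" pass through unchanged, so the table only needs to list the exceptions.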

Extended options

Several options control how the data patches are applied:

countries:
  # If you mark a lookup as required, a value that matches no options will
  # raise a `datapatch.exc:LookupException`.
  required: true
  # Normalisation removes many special characters and collapses repeated spaces.
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin).
  # Set asciify to false if you want to keep non-ASCII alphabets as they are.
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
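
As a rough illustration of what these options mean, the sketch below approximates them with the standard library only. Everything in it is an assumption made for illustration: `normalize_key`, `expand_map`, `resolve` and `MissingLookup` are hypothetical names, not datapatch's API, and the accent stripping shown here does not transliterate between alphabets the way the `asciify` option is described above.

```python
import re
import unicodedata


class MissingLookup(Exception):
    """Hypothetical stand-in for datapatch's LookupException."""


def normalize_key(text, lowercase=False, asciify=False):
    """Approximate key normalisation using only the standard library."""
    if asciify:
        # Strip combining accents (ç -> c). This does NOT convert other
        # alphabets to Latin characters.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if lowercase:
        text = text.lower()
    # Replace runs of punctuation and whitespace with single spaces.
    return " ".join(re.sub(r"[^\w]+", " ", text).split())


def expand_map(mapping):
    """Turn the `map` shorthand into explicit match/value options."""
    return [{"match": key, "value": value} for key, value in mapping.items()]


def resolve(raw, options, required=False, **norm):
    """Return the patched value, None, or raise when `required` is set."""
    needle = normalize_key(raw, **norm)
    for option in options:
        if normalize_key(option["match"], **norm) == needle:
            return option["value"]
    if required:
        raise MissingLookup(f"No option matches {raw!r}")
    return None


options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})
```

With these definitions, `resolve("François", options, asciify=True)` finds the `Francois` option, while `resolve("Chile", options, required=True)` raises instead of silently returning nothing.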

Result objects

You can also have more details associated with a result and access them:

countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR

This can be accessed as a result object with attributes:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
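
The last assertion relies on unknown attributes reading as `None` rather than raising `AttributeError`. One way such an object could be built is sketched below; the `Result` class here is a hypothetical illustration of the pattern and makes no claim about how datapatch implements its result objects.

```python
class Result:
    """Hypothetical result holder: attributes defined on the option are
    returned as-is, and unknown ones read as None."""

    def __init__(self, data):
        self._data = dict(data)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. Private names still
        # raise, which avoids infinite recursion if `_data` is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._data.get(name)


option = {"match": "Frankreich", "label": "France", "code": "FR"}
# Everything except the matching rule itself becomes a result attribute.
attributes = {key: value for key, value in option.items() if key != "match"}
result = Result(attributes)
```

The trade-off of this design is that a misspelled attribute name silently yields `None`, so callers may want to check for the attributes they depend on.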

License

datapatch is licensed under the terms of the MIT license, which is included as LICENSE.
