datapatch

A Python library for defining rule-based overrides on messy data. Imagine, for example, importing a dataset in which each row is associated with a country, and the country names have been entered by hand. You might find names like Northkorea or Greet Britain that you want to normalise. datapatch provides a mechanism for building a flexible lookup table (usually stored as a YAML file) to catch and repair these data issues.

Installation

You can install datapatch from the Python package index:

pip install datapatch

Example

Given a YAML file like this:

countries:
  normalize: true
  lowercase: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain

The file can be used to apply the data patches against raw input:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# iter_data() is a placeholder for your own source of row dicts.
# This applies the patch, or falls back to the original string if none matches:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
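To make the difference between `match` and `contains` concrete, here is a minimal pure-Python sketch of how such options could be evaluated. It is not datapatch's implementation: `apply_rules`, `OPTIONS` and the dict layout are hypothetical names chosen for illustration, and the sketch assumes that `lowercase: true` means both sides are compared case-insensitively.

```python
# Hypothetical illustration only; datapatch's real matching logic may differ.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {"match": ["Northkorea", "Nordkorea", "DPRK"], "value": "North Korea"},
    {"contains": ["Britain"], "value": "Great Britain"},
]


def apply_rules(raw, options, default=None):
    """Return the value of the first option that fits `raw`, else `default`."""
    text = raw.strip().lower()
    for option in options:
        # `match` requires the whole string to equal one of the candidates.
        if any(text == m.lower() for m in option.get("match", [])):
            return option["value"]
        # `contains` only requires a candidate to appear somewhere in the string.
        if any(c.lower() in text for c in option.get("contains", [])):
            return option["value"]
    return default


for raw in ["frankreich", "Greet Britain", "Narnia"]:
    print(raw, "->", apply_rules(raw, OPTIONS, default=raw))
```

Under these assumptions, a misspelling such as "Greet Britain" is caught by the `contains` rule, while a value with no rule falls through to the default.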

Extended options

Several options control how the data patches are applied:

countries:
  # If you mark a lookup as required, a value that matches no options will
  # raise a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters, remove multiple spaces
  # and perform some basic matching across alphabets (Путин -> Putin).
  normalize: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg

Result objects

You can also attach additional details to an option and read them from the match result:

countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR

The match is returned as a result object that exposes these details as attributes:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
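The final assertion in the example suggests that an attribute not defined on the option reads as `None` instead of raising an error. One way to get that behaviour in plain Python is sketched below; `SketchResult` is a hypothetical class written for illustration, and datapatch's own result class may be built differently.

```python
# Hypothetical sketch; datapatch's own result class may be built differently.
class SketchResult:
    def __init__(self, **attributes):
        self._attributes = attributes

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        # Private names still raise, which avoids recursion on `_attributes`.
        if name.startswith("_"):
            raise AttributeError(name)
        # Unknown public attributes fall back to None.
        return self._attributes.get(name)


result = SketchResult(label="France", code="FR")
print(result.label, result.code)
assert result.capital is None
```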

License

datapatch is licensed under the terms of the MIT license, which is included as LICENSE.
