A Python library for defining rule-based overrides on messy data. Imagine, for example,
trying to import a dataset in which each row is associated with a country, where the
country names have been entered by humans. You might find misspelled or variant country
names that you want to normalise.
datapatch provides a mechanism to build a flexible lookup
table (usually stored as a YAML file) to catch and repair these data issues.
You can install
datapatch from the Python package index:
```bash
pip install datapatch
```
Given a YAML file like this:
```yaml
countries:
  normalize: true
  lowercase: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
```
The file can be used to apply the data patches against raw input:
```python
from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# This will apply the patch or default to the original string if none exists:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
```
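To make the difference between `match` and `contains` concrete, here is a minimal plain-Python sketch of how such rules might be evaluated. It is not datapatch's implementation: the names `OPTIONS`, `normalise` and `get_value` are hypothetical, and the normalisation is reduced to lowercasing and collapsing whitespace.

```python
# Illustrative sketch only: this is NOT datapatch's implementation.
# `OPTIONS`, `normalise` and `get_value` are hypothetical names.

OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def normalise(text):
    """Lowercase and collapse whitespace (a simplified stand-in)."""
    return " ".join(text.lower().split())


def get_value(raw, default=None):
    """Return the patched value for `raw`, or `default` if no option applies."""
    if raw is None:
        return default
    needle = normalise(raw)
    for option in OPTIONS:
        # `match` requires the whole normalised string to be equal.
        if any(needle == normalise(m) for m in option.get("match", [])):
            return option["value"]
        # `contains` only requires a substring hit.
        if any(normalise(c) in needle for c in option.get("contains", [])):
            return option["value"]
    return default


rows = [
    {"Country": "frankreich"},
    {"Country": "  DPRK "},
    {"Country": "Great  Britain (UK)"},
    {"Country": "Peru"},
]
for row in rows:
    raw = row.get("Country")
    row["Country"] = get_value(raw, default=raw)
```

Passing `default=raw` leaves values with no matching rule untouched, so a partial lookup table can be applied safely and extended over time.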
There's a host of options available to configure the application of the data patches:
```yaml
countries:
  # If you mark a lookup as required, a value that matches no options will
  # throw a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters, remove multiple spaces
  # and perform some basic matching across alphabets (Путин -> Putin).
  normalize: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
```
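The effect of `required` and the `map` shorthand can be sketched in plain Python. This is an assumption-laden illustration rather than the library's code: `MissingValue` is a hypothetical stand-in for `LookupException`, and `expand_map` and `lookup` are invented names.

```python
# Illustrative sketch only; not datapatch's code.


class MissingValue(Exception):
    """Hypothetical stand-in for datapatch's LookupException."""


def expand_map(mapping):
    """Turn a `map` shorthand into the long-form list of options."""
    return [{"match": key, "value": value} for key, value in mapping.items()]


def lookup(value, options, required=False, default=None):
    """Exact-match lookup; raise if `required` is set and nothing matches."""
    for option in options:
        if option["match"] == value:
            return option["value"]
    if required:
        raise MissingValue(f"No option matches {value!r}")
    return default


options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})

patched = lookup("Lux", options, required=True)

try:
    lookup("Atlantis", options, required=True)
    raised = False
except MissingValue:
    raised = True
```

Marking a lookup as required is useful when every input value is expected to be covered, since an unmapped value then fails loudly instead of passing through unnoticed.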
You can also have more details associated with a result and access them:
```yaml
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
```
This can be accessed as a result object with attributes:
```python
from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
```
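The final `assert` above implies that attributes not defined on an option come back as `None` rather than raising. One way such behaviour could be obtained is sketched below; the `Result` class here is hypothetical and is not taken from datapatch's source.

```python
# Illustrative sketch only; `Result` is a hypothetical class.


class Result:
    """Expose option attributes; unknown attributes resolve to None."""

    def __init__(self, data):
        self._data = dict(data)

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails.
        # Private names are rejected to avoid recursion on `_data`.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._data.get(name)


result = Result({"label": "France", "code": "FR"})
print(result.label, result.code)
```

With this design, callers can probe optional attributes such as `result.capital` without wrapping each access in a `try`/`except`.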
datapatch is licensed under the terms of the MIT license.