Forma

Automatic format error detection on tabular data.

Forma is an open-source library, written in Python, for automatic, domain-agnostic detection of format errors in tabular data. The library is a by-product of the BigDataStack research project.
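To make the idea of a "format error" concrete, here is a toy sketch that does not use Forma and is not a description of its algorithm. It assumes a simple, hypothetical generalization (digits become D, letters become A, everything else is kept) and treats a value as suspicious when few other values in the column share its pattern. The column values and the helper name to_pattern are made up for illustration.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Generalize a string: digits -> 'D', letters -> 'A', other characters unchanged."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


# hypothetical column of dates with one inconsistently formatted entry
column = ["2020-01-05", "2020-02-17", "2020/03/09", "2020-04-21"]

patterns = [to_pattern(v) for v in column]
counts = Counter(patterns)

# fraction of the column that shares each value's pattern; a low share is suspicious
shares = [counts[p] / len(column) for p in patterns]
for value, share in zip(column, shares):
    print(value, share)
```

In this toy column the third value is the only one written with slashes, so it receives the lowest share.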

Install

Run pip install forma to install the library in your environment.

How to use

We will work with the popular MovieLens dataset.

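import pandas as pd

# load the data; ratings.dat separates fields with '::', which requires the python parsing engine
col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('../data/ratings.dat', delimiter='::', names=col_names, engine='python')
# show the first five rows
ratings_df.head()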
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
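If you do not have ratings.dat at hand, the loading step can be tried on an in-memory sample. The sketch below assumes the file stores one rating per line with '::' between fields, as the read_csv call above implies, and reuses the first three rows of the output above; the variable names raw and sample_df are illustrative only.

```python
import io

import pandas as pd

# three lines in the assumed 'user::movie::rating::timestamp' layout
raw = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
)

col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
# a multi-character delimiter is treated as a regular expression,
# which only the python engine supports
sample_df = pd.read_csv(raw, delimiter='::', names=col_names, engine='python')

print(sample_df)
print(sample_df.dtypes)
```

All four columns are parsed as integers here, which is a quick check that the delimiter was split correctly.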

Initialize the detector, fit it on the data, and run the detection. The result is a pandas DataFrame with an extra column p, which records the probability that the row contains a format error.

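# FormatDetector and PatternGenerator are provided by forma; import them before running this cell
# initialize the detector
detector = FormatDetector()
# fit the detector on the first 100 rows
detector.fit(ratings_df[:100], PatternGenerator(other='leaf'))
# compute the format-error probability for each row
assessed_df = detector.detect()

# inspect the first rows of the result
assessed_df.head()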
100%|██████████| 4/4 [00:00<00:00, 222.64it/s]
   user_id  movie_id  rating  timestamp         p
0        1      1193       5  978300760  0.041667
1        1       661       3  978302109  0.128333
2        1       914       3  978301968  0.128333
3        1      3408       4  978300275  0.041667
4        1      2355       5  978824291  0.041667
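A typical next step is to pull out the rows with the highest p for manual review. The sketch below uses only pandas and rebuilds the five rows of the result above by hand, so it runs without Forma. The cut-off of 0.1 is an arbitrary, hypothetical value chosen for illustration and is not a recommendation from the library.

```python
import pandas as pd

# the five rows of the result above, re-created so the example is self-contained
assessed_df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1],
    'movie_id': [1193, 661, 914, 3408, 2355],
    'rating': [5, 3, 3, 4, 5],
    'timestamp': [978300760, 978302109, 978301968, 978300275, 978824291],
    'p': [0.041667, 0.128333, 0.128333, 0.041667, 0.041667],
})

threshold = 0.1  # hypothetical cut-off; tune it for your own data

# keep rows whose error probability exceeds the cut-off, most suspicious first
suspects = (
    assessed_df[assessed_df['p'] > threshold]
    .sort_values('p', ascending=False, kind='stable')
)

print(suspects)
```

With this cut-off, the two rows with p = 0.128333 are selected and the remaining three are left out.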
