Automatic format error detection on tabular data

Forma

Forma is an open-source library, written in Python, that enables automatic, domain-agnostic detection of format errors in tabular data. The library is a by-product of the BigDataStack research project.

Install

Run pip install forma to install the library in your environment.

How to use

We will work with the popular MovieLens dataset.

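import pandas as pd

# load the data
col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
# the multi-character '::' separator requires the Python parsing engine
ratings_df = pd.read_csv('../data/ratings.dat', delimiter='::', names=col_names, engine='python')
# show the first five rows
ratings_df.head()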
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
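The read_csv call above passes engine='python' because a separator longer than one character is interpreted by pandas as a regular expression, which the default C parser does not support. The following minimal sketch reproduces the five rows shown above from an in-memory string, so the parsing step can be tried without the ratings.dat file. The `sample` buffer and the `sample_df` name are introduced only for this illustration.

```python
import io

import pandas as pd

# The five rows displayed above, written in the raw '::'-separated layout.
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
    "1::3408::4::978300275\n"
    "1::2355::5::978824291\n"
)

col_names = ['user_id', 'movie_id', 'rating', 'timestamp']

# '::' has more than one character, so pandas treats it as a regular
# expression; only the Python parsing engine handles that case.
sample_df = pd.read_csv(sample, delimiter='::', names=col_names, engine='python')

print(sample_df)
```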

Let us introduce a few artificial format errors.

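# cast every column to string so that malformed values can be stored
dirty_df = ratings_df.astype('str')

# assign with .loc: chained indexing (df.iloc[i][col] = ...) may write to a temporary copy
dirty_df.loc[3, 'timestamp'] = '9783000275'  # one digit too many
dirty_df.loc[2, 'movie_id'] = '914.'         # trailing dot
dirty_df.loc[4, 'rating'] = '10'             # two digits instead of one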
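To see why these three values count as format errors rather than merely unusual numbers, it can help to reduce each value to the sequence of character classes it contains. The sketch below is only an illustration of that idea and is not forma's implementation: `to_pattern` is a hypothetical helper, and the class letters D, A and S are arbitrary choices made for this example.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Map each character to a coarse class: D = digit, A = letter, S = other."""
    classes = []
    for ch in value:
        if ch.isdigit():
            classes.append('D')
        elif ch.isalpha():
            classes.append('A')
        else:
            classes.append('S')
    return ''.join(classes)


# Timestamps of the five rows after the artificial errors were introduced.
timestamps = ['978300760', '978302109', '978301968', '9783000275', '978824291']

# Count how often each pattern occurs; a rare pattern hints at a format error.
counts = Counter(to_pattern(t) for t in timestamps)
for t in timestamps:
    print(t, to_pattern(t), counts[to_pattern(t)])

# The trailing dot changes the pattern of the movie id as well.
print(to_pattern('914'), to_pattern('914.'))
```

In this small sample the ten-digit timestamp is the only value with its pattern, while the other four share the nine-digit pattern.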

Initialize the detector, fit it, and run the detection. The result is a pandas DataFrame with an extra column p, which records the probability that the row contains a format error. In the output below, the three rows in which we introduced artificial mistakes (rows 2, 3 and 4) receive the highest probabilities.

import numpy as np
# FormatDetector and PatternGenerator must be imported from the forma package

# initialize detector
detector = FormatDetector()
# fit detector, using one pattern generator per column
generators = {'user_id': PatternGenerator(other='leaf'),
              'movie_id': PatternGenerator(other='leaf'),
              'rating': PatternGenerator(other='leaf'),
              'timestamp': PatternGenerator(other='leaf')}

detector.fit(dirty_df, generator=generators, n=3)
# detect error probability, passing np.mean as the reduction function
assessed_df = detector.detect(reduction=np.mean)

# visualize results
assessed_df.head()
100%|██████████| 4/4 [00:00<00:00, 158.06it/s]
  user_id movie_id rating   timestamp        p
0       1     1193      5   978300760  0.06750
1       1      661      3   978302109  0.19750
2       1     914.      3   978301968  0.24413
3       1     3408      4  9783000275  0.31250
4       1     2355     10   978824291  0.31250
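Once the p column is available, a common next step is to rank the rows and review the most suspicious ones first. The sketch below rebuilds the five rows shown above by hand, so it runs without forma installed; in practice the frame returned by detector.detect would be used directly. The `THRESHOLD` value and the `suspects` name are assumptions made for this illustration, not part of the library.

```python
import pandas as pd

# Rebuilt by hand from the output shown above.
assessed_df = pd.DataFrame({
    'user_id': ['1', '1', '1', '1', '1'],
    'movie_id': ['1193', '661', '914.', '3408', '2355'],
    'rating': ['5', '3', '3', '4', '10'],
    'timestamp': ['978300760', '978302109', '978301968', '9783000275', '978824291'],
    'p': [0.06750, 0.19750, 0.24413, 0.31250, 0.31250],
})

# Hypothetical cut-off, chosen only so that this small example is selective.
THRESHOLD = 0.2

# Keep the rows above the cut-off and list the most suspicious ones first.
suspects = assessed_df[assessed_df['p'] > THRESHOLD].sort_values('p', ascending=False)

print(suspects)
```

With this cut-off the selection contains exactly the three rows that were modified. A suitable threshold depends on the data, so it is worth inspecting the distribution of p before fixing one.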
