Forma

Automatic format error detection on tabular data.

Forma is an open-source library, written in Python, that enables automatic, domain-agnostic format error detection on tabular data. The library is a by-product of the BigDataStack research project.
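
To make the notion of a format error concrete, here is a minimal sketch in plain Python. It is not Forma's algorithm or API. It is an illustrative stand-in that assumes a simple scheme: each value is generalized to a character-class pattern, and a value whose pattern is rare within its column is treated as suspicious. The helper names `to_pattern` and `column_rarity` are hypothetical.

```python
from collections import Counter
from typing import List


def to_pattern(value: str) -> str:
    """Generalize a string: digits become 'D', letters become 'A', anything else is kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def column_rarity(values: List[str]) -> List[float]:
    """Score each value as 1 minus the share of the column that has the same pattern."""
    patterns = [to_pattern(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [1 - counts[p] / total for p in patterns]


# Hypothetical toy column: four well-formed ratings and one with a stray dot.
ratings = ["5", "3", "3.", "4", "5"]
scores = column_rarity(ratings)
```

In this toy column the value `3.` receives the highest score, because its pattern `D.` occurs once while the pattern `D` occurs four times.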

Install

Run pip install forma to install the library in your environment.

How to use

We will work with the popular MovieLens dataset.

import pandas as pd

# load the data; fields in ratings.dat are separated by '::'
col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('../data/ratings.dat', delimiter='::', names=col_names, engine='python')
ratings_df.head()
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

Let us introduce a few artificial format errors.

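# work on a string-typed copy so every cell can hold an arbitrary format
dirty_df = ratings_df.astype('str')

# assign in a single .loc step; chained indexing (df.iloc[i][col] = ...) may write to a temporary copy
dirty_df.loc[3, 'timestamp'] = '9783000275'  # one digit too many
dirty_df.loc[2, 'movie_id'] = '914.'  # trailing dot
dirty_df.loc[4, 'rating'] = '10'  # two digits instead of one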
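
When editing individual cells, chained indexing of the form `df.iloc[i][col] = value` may assign to a temporary copy, depending on the pandas version and settings, so the intended error might never reach the DataFrame. Below is a minimal sketch of single-step assignment with `.loc`, followed by a check that the cells really changed. The three-row frame and the variable names are hypothetical stand-ins for the ratings data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the ratings data, stored as strings.
toy_df = pd.DataFrame(
    {"movie_id": ["1193", "661", "914"], "rating": ["5", "3", "3"]}
)

dirty_toy_df = toy_df.copy()

# Single-step, label-based assignment writes into the frame itself.
dirty_toy_df.loc[2, "movie_id"] = "914."
dirty_toy_df.loc[0, "rating"] = "10"

# Collect the (row, column) positions that differ from the clean frame.
changed = dirty_toy_df.ne(toy_df)
changed_cells = [
    (int(row), col)
    for col in changed.columns
    for row in changed.index[changed[col].to_numpy()]
]
print(changed_cells)
```

Comparing the dirty frame against the clean one is a cheap way to confirm that every injected error landed where it was meant to.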

Initialize the detector, fit it, and run the detection. The result is a pandas DataFrame with an extra column p, which records the probability that the row contains a format error. In the output below, the three rows where we introduced artificial errors (indices 2, 3 and 4) receive a higher probability than the two untouched rows.

import numpy as np
# FormatDetector and PatternGenerator are provided by forma (their import statement is not shown here)

# initialize detector
detector = FormatDetector()
# fit detector
detector.fit(dirty_df, generator=PatternGenerator(), n=3)
# detect error probability
assessed_df = detector.detect(reduction=np.mean)

# visualize results
assessed_df.head()
100%|██████████| 4/4 [02:58<00:00, 44.58s/it]
100%|██████████| 1000209/1000209 [07:28<00:00, 2230.59it/s]
   user_id  movie_id  rating   timestamp         p
0        1      1193       5   978300760  0.319957
1        1       661       3   978302109  0.456679
2        1      914.       3   978301968  0.509287
3        1      3408       4  9783000275  0.550982
4        1      2355      10   978824291  0.569957
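
Because p is an ordinary column, the usual pandas tools can be used to triage the result. The sketch below rebuilds the five rows printed above by hand, so that it runs without Forma, and sorts them by descending p. The 0.5 cut-off is an arbitrary value chosen for illustration, not a threshold recommended by the library, and the variable names are hypothetical.

```python
import pandas as pd

# The five rows printed above, copied by hand.
assessed_head = pd.DataFrame(
    {
        "user_id": ["1"] * 5,
        "movie_id": ["1193", "661", "914.", "3408", "2355"],
        "rating": ["5", "3", "3", "4", "10"],
        "timestamp": ["978300760", "978302109", "978301968", "9783000275", "978824291"],
        "p": [0.319957, 0.456679, 0.509287, 0.550982, 0.569957],
    }
)

# Most suspicious rows first.
ranked = assessed_head.sort_values("p", ascending=False)

# Keep rows above an illustrative cut-off (0.5 is an assumption, not a library default).
suspects = ranked[ranked["p"] > 0.5]
print(suspects)
```

With this cut-off, the selection contains exactly the three rows that were altered (indices 4, 3 and 2). On the full dataset a suitable threshold would have to be chosen by inspecting the distribution of p.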
