Automatic format error detection on tabular data
Forma
Forma is an open-source library, written in Python, that enables automatic, domain-agnostic format error detection on tabular data. The library is a by-product of the BigDataStack research project.
Install
Run `pip install forma` to install the library in your environment.
How to use
We will work with the popular MovieLens dataset.
import pandas as pd

# load the data; the multi-character '::' separator requires the python parsing engine
col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('../data/ratings.dat', delimiter='::', names=col_names, engine='python')

# show the first five rows
ratings_df.head()
| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
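The `::` separator is longer than one character, so pandas treats it as a regular expression, which is why the call above passes `engine='python'` (the default C parser does not support regex separators). The following self-contained sketch reads the same five rows from an in-memory string instead of `../data/ratings.dat`, so it can be tried without downloading the dataset. The `sample` string and the name `sample_df` are illustrative stand-ins, not part of Forma.

```python
import io

import pandas as pd

# Illustrative stand-in for the first lines of ratings.dat
# (values copied from the table above).
sample = (
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
    "1::3408::4::978300275\n"
    "1::2355::5::978824291\n"
)

col_names = ['user_id', 'movie_id', 'rating', 'timestamp']

# The file has no header row, so column names are supplied explicitly.
sample_df = pd.read_csv(
    io.StringIO(sample), delimiter='::', names=col_names, engine='python'
)

print(sample_df.shape)
print(sample_df.dtypes)
```

All four columns are parsed as integers here, which is why the walkthrough converts the frame to strings before injecting malformed values.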
Let us introduce a few artificial format errors: an extra digit in a timestamp, a trailing period in a movie ID, and a two-digit rating.
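# work on a string copy so that every value is treated as text
dirty_df = ratings_df.astype('str').copy()
# assign with .loc[row, column]; chained indexing such as .iloc[3]['timestamp'] may not modify dirty_df
dirty_df.loc[3, 'timestamp'] = '9783000275'  # extra digit
dirty_df.loc[2, 'movie_id'] = '914.'  # trailing period
dirty_df.loc[4, 'rating'] = '10'  # two-digit rating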
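To see why these three edits count as format errors rather than merely unusual values, it can help to reduce each value to a coarse character-class pattern and compare it with the pattern of the clean value. The sketch below only illustrates that intuition. It makes no claim about how Forma's `PatternGenerator` works internally, and the helper name `to_pattern` is hypothetical.

```python
def to_pattern(value: str) -> str:
    """Hypothetical helper: map digits to 'D', letters to 'A', keep anything else."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append('D')
        elif ch.isalpha():
            out.append('A')
        else:
            out.append(ch)
    return ''.join(out)


# (clean value, modified value) pairs taken from the walkthrough above
pairs = [
    ('914', '914.'),              # movie_id, row 2
    ('978300275', '9783000275'),  # timestamp, row 3
    ('5', '10'),                  # rating, row 4
]

for clean, dirty in pairs:
    print(f"{clean!r} -> {to_pattern(clean)}    {dirty!r} -> {to_pattern(dirty)}")
```

In each pair the modified value yields a pattern that differs from the clean one, either in length or in the presence of a non-digit character.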
Initialize the detector, fit it, and run detection. The returned result is a pandas DataFrame with an extra column `p`, which records the probability of a format error being present in the row. Among the first five rows, the three rows where we introduced artificial mistakes (indices 2, 3, and 4) receive the highest probabilities.
# FormatDetector and PatternGenerator come from the forma package (import them before running)
import numpy as np

# initialize detector
detector = FormatDetector()
# fit detector
detector.fit(dirty_df, generator=PatternGenerator(), n=3)
# detect error probability
assessed_df = detector.detect(reduction=np.mean)
# visualize results
assessed_df.head()
100%|██████████| 4/4 [02:58<00:00, 44.58s/it]
100%|██████████| 1000209/1000209 [07:28<00:00, 2230.59it/s]
| | user_id | movie_id | rating | timestamp | p |
|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 | 0.319957 |
| 1 | 1 | 661 | 3 | 978302109 | 0.456679 |
| 2 | 1 | 914. | 3 | 978301968 | 0.509287 |
| 3 | 1 | 3408 | 4 | 9783000275 | 0.550982 |
| 4 | 1 | 2355 | 10 | 978824291 | 0.569957 |
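Because the result is an ordinary pandas DataFrame, the rows most likely to contain a format error can be surfaced with standard pandas operations. The following is a minimal sketch that rebuilds the five rows shown above by hand, so that it runs without Forma or the dataset. In practice `assessed_df` would be the value returned by `detector.detect`. Inspecting the top three rows is an arbitrary choice made for illustration.

```python
import pandas as pd

# Hand-copied from the output above; stands in for the DataFrame
# returned by detector.detect in the walkthrough.
assessed_df = pd.DataFrame({
    'user_id': ['1', '1', '1', '1', '1'],
    'movie_id': ['1193', '661', '914.', '3408', '2355'],
    'rating': ['5', '3', '3', '4', '10'],
    'timestamp': ['978300760', '978302109', '978301968', '9783000275', '978824291'],
    'p': [0.319957, 0.456679, 0.509287, 0.550982, 0.569957],
})

# Sort so that the rows with the highest error probability come first,
# then keep the top three for manual inspection.
suspects = assessed_df.sort_values('p', ascending=False).head(3)

print(suspects)
```

On this small sample, the three rows returned are the ones that were modified earlier (indices 4, 3, and 2).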