Automatic format error detection on tabular data
Forma
Forma is an open-source library, written in Python, that enables automatic, domain-agnostic detection of format errors in tabular data. The library is a by-product of the research project BigDataStack.
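To make the term "format error" concrete, here is a toy sketch in plain Python. It is not Forma's algorithm, only an illustration under assumed rules: each value is generalized into a coarse pattern (digits become D, letters become L), and values whose pattern is rare within the column are flagged. The helper name to_pattern, the sample values, and the 0.5 cutoff are all hypothetical.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Generalize a string into a coarse format pattern (hypothetical scheme)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")   # any digit
        elif ch.isalpha():
            out.append("L")   # any letter
        else:
            out.append(ch)    # keep punctuation and separators as-is
    return "".join(out)


# A made-up column in which one date uses a different layout.
values = ["2001-03-14", "1999-12-01", "2005-07-30", "14/03/2001"]
patterns = [to_pattern(v) for v in values]
counts = Counter(patterns)

# Flag values whose pattern covers less than half of the column (arbitrary cutoff).
rare = [v for v, p in zip(values, patterns) if counts[p] / len(values) < 0.5]
print(rare)
```

Here the first three values share the pattern DDDD-DD-DD, so only the value written as DD/DD/DDDD is flagged.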
Install
Run pip install forma to install the library in your environment.
How to use
We will work with the popular MovieLens dataset.
# load the data
import pandas as pd

col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('../data/ratings.dat', delimiter='::', names=col_names, engine='python')
ratings_df.head()
| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
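If the ratings file is not at hand, the same loading step can be tried on an in-memory sample. The sketch below reuses the five rows shown above in the '::'-separated layout; the names raw and sample_df are introduced here for illustration only.

```python
import io

import pandas as pd

# The five sample rows from the table above, in the '::'-separated layout.
raw = (
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
    "1::3408::4::978300275\n"
    "1::2355::5::978824291\n"
)

col_names = ['user_id', 'movie_id', 'rating', 'timestamp']

# A multi-character delimiter needs the slower Python parsing engine.
sample_df = pd.read_csv(io.StringIO(raw), delimiter='::', names=col_names, engine='python')
print(sample_df.dtypes)
```

Passing engine='python' explicitly avoids the warning pandas emits when it falls back from the C parser for a multi-character separator.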
Initialize the detector, fit it, and run detection. The returned result is a pandas DataFrame with an extra column p, which records the probability that the row contains a format error.
# FormatDetector and PatternGenerator are provided by the forma library
# initialize detector
detector = FormatDetector()
# fit detector on the first 100 rows
detector.fit(ratings_df[:100], PatternGenerator(other='leaf'))
# detect error probability
assessed_df = detector.detect()
# visualize results
assessed_df.head()
100%|██████████| 4/4 [00:00<00:00, 222.64it/s]
| | user_id | movie_id | rating | timestamp | p |
|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 | 0.041667 |
| 1 | 1 | 661 | 3 | 978302109 | 0.128333 |
| 2 | 1 | 914 | 3 | 978301968 | 0.128333 |
| 3 | 1 | 3408 | 4 | 978300275 | 0.041667 |
| 4 | 1 | 2355 | 5 | 978824291 | 0.041667 |
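One way to act on the p column is to keep only the rows above a chosen cutoff and review them first. The sketch below uses plain pandas on a hand-built copy of the five rows shown above, so it runs without Forma installed; the 0.1 threshold is an arbitrary assumption, not a value recommended by the library.

```python
import pandas as pd

# Hand-built copy of the sample output above (not produced by running Forma here).
assessed_df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1],
    'movie_id': [1193, 661, 914, 3408, 2355],
    'rating': [5, 3, 3, 4, 5],
    'timestamp': [978300760, 978302109, 978301968, 978300275, 978824291],
    'p': [0.041667, 0.128333, 0.128333, 0.041667, 0.041667],
})

threshold = 0.1  # hypothetical cutoff; tune it for your own data

# Keep rows whose error probability exceeds the cutoff, most suspicious first.
suspicious = assessed_df[assessed_df['p'] > threshold].sort_values('p', ascending=False)
print(suspicious)
```

With this cutoff, the two rows with p = 0.128333 are selected for review and the other three are left out.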