Library for automatic feature extraction from JSON-datasets
Project description
Open source tool for machine learning on semi-structured data that creates numeric object-feature matrix from JSON. The idea of Datapot is to make the process of data preparation and feature extraction automatic, easy and effective.
Usage
Install Datapot:
$ git clone https://github.com/bashalex/datapot.git
$ cd datapot
$ pip install .
To create a Datapot object simply write the following:
>>> import datapot as dp
>>> data = dp.DataPot()
DataPot has two main methods:
fit()
transform()
Method fit(self, data, limit) goes through the first N objects (N = limit), passes the possible features to Transformers. Each Transformer evaluates if a feature from current field or a number of fields can be created. As a result a dict of features and Transformers is created.
To apply fit() to JSON file:
>>> f = open('data/matches_test.jsonlines', 'r')
>>> data.fit(f, limit=100)
>>> data
DataPot class instance
- number of features without transformation: 806
- number of new features: 315
features to transform:
(u'players.0.gold_t', [ComplexTransformer])
(u'picks_bans.0.is_pick', [BoolToIntTransformer])
(u'players.0.kills_log.0.unit', [TfidfTransformer])
(u'players.1.xp_t', [ComplexTransformer])
(u'picks_bans.1.is_pick', [BoolToIntTransformer])
(u'players.1.kills_log.0.unit', [TfidfTransformer])
...
Method transform(self, data, verbose) generates a pandas. DataFrame with new features that were detected on the fit() call. If parameter verbose is true, progress description is printed during the feature extraction.
>>> df = data.transform(f, verbose=False)
fit transformers...OK
num of new features: 315
Examples
Look for more examples of using Datapot with different datasets and more Transformer specific.
Features
Datapot provides many ways of extracting features from JSON-s.
Data types that can be processed: - Boolean - Numerical array (transform array to their sum divided by average length of array in training set) - Time series (сalculate descriptive statistical properties of a given time series) - Timestamp (date, time, day of week, day of month etc.) - Text (bag of words tf-idf, word2vec) - Categorial (one-hot encoding, dimension reduction)