A typed dataset abstraction toolkit for machine learning projects
A generic dataset interface for Machine Learning models
instancelib provides a generic architecture for datasets.
© Michiel Bron, 2021
Quick tour
Load dataset: Load the dataset in an environment
import instancelib as il
text_env = il.read_excel_dataset("./datasets/testdataset.xlsx",
                                  data_cols=["fulltext"],
                                  label_cols=["label"])
ds = text_env.dataset # A `dict-like` interface for instances
labels = text_env.labels # An object that stores all labels
labelset = labels.labelset # All labels that can be given to instances
ins = ds[20] # Get instance with identifier key `20`
ins_data = ins.data # Get the raw data for instance 20
ins_vector = ins.vector # Get the vector representation for 20 if any
ins_labels = labels.get_labels(ins)
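To make the separation between instances and labels concrete, here is a minimal pure-Python sketch of the idea: a dict-like store keyed by identifier, and a separate object that holds the labels. `ToyInstance`, `ToyLabels`, `set_labels`, and the example data are hypothetical stand-ins invented for illustration; they are not instancelib's classes and say nothing about how the library is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Sequence, Set


@dataclass(frozen=True)
class ToyInstance:
    """Hypothetical stand-in for an instance: a key, raw data, optional vector."""
    identifier: int
    data: str
    vector: Optional[Sequence[float]] = None


@dataclass
class ToyLabels:
    """Hypothetical stand-in for a label store, kept separate from the data."""
    labelset: FrozenSet[str]
    _by_key: Dict[int, Set[str]] = field(default_factory=dict)

    def set_labels(self, ins: ToyInstance, *labels: str) -> None:
        # Only labels from the declared labelset may be assigned.
        unknown = set(labels) - self.labelset
        if unknown:
            raise ValueError(f"labels not in labelset: {sorted(unknown)}")
        self._by_key.setdefault(ins.identifier, set()).update(labels)

    def get_labels(self, ins: ToyInstance) -> FrozenSet[str]:
        # Instances without labels yield an empty set.
        return frozenset(self._by_key.get(ins.identifier, set()))


# A plain dict plays the role of the dict-like dataset: identifier -> instance.
toy_ds: Dict[int, ToyInstance] = {
    20: ToyInstance(20, "a short example document"),
    21: ToyInstance(21, "another example document"),
}
toy_labels = ToyLabels(labelset=frozenset({"relevant", "irrelevant"}))
toy_labels.set_labels(toy_ds[20], "relevant")

toy_ins = toy_ds[20]                    # look up by identifier key
print(toy_ins.data)                     # raw data
print(toy_ins.vector)                   # no vector has been set in this toy
print(toy_labels.get_labels(toy_ins))   # labels live outside the instance
```

Keeping labels in their own object means the same instances can be reused with different label sets, which is the design the quick tour above hints at.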
Dataset manipulation: Divide the dataset in a train and test set
train, test = text_env.train_test_split(ds, train_size=0.70)
print(20 in train) # May be true or false, because of random sampling
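Why the membership check can go either way is easiest to see in a small stand-alone sketch of a random split over a keyed dataset. The function `toy_train_test_split`, its `seed` parameter, and the toy data are assumptions made for this illustration; this is not instancelib's implementation of `train_test_split`.

```python
import random
from typing import Dict, Hashable, Optional, Tuple, TypeVar

V = TypeVar("V")


def toy_train_test_split(
    dataset: Dict[Hashable, V],
    train_size: float,
    seed: Optional[int] = None,
) -> Tuple[Dict[Hashable, V], Dict[Hashable, V]]:
    """Randomly partition a keyed dataset into two disjoint dicts."""
    if not 0.0 < train_size < 1.0:
        raise ValueError("train_size must be strictly between 0 and 1")
    keys = list(dataset)
    random.Random(seed).shuffle(keys)        # random order decides membership
    cut = round(len(keys) * train_size)      # number of keys in the train part
    train = {k: dataset[k] for k in keys[:cut]}
    test = {k: dataset[k] for k in keys[cut:]}
    return train, test


toy_ds = {i: f"document {i}" for i in range(100)}
toy_train, toy_test = toy_train_test_split(toy_ds, train_size=0.70, seed=42)

print(len(toy_train), len(toy_test))
# Key 20 ends up in exactly one of the two parts; which one depends on the shuffle.
print((20 in toy_train) != (20 in toy_test))
```

Fixing a seed, as this sketch does, is one common way to make such a split reproducible between runs.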
Train a model:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
model = il.SkLearnDataClassifier.build(pipeline, text_env)
model.fit_provider(train, labels)
predictions = model.predict(test)
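The pipeline above is an ordinary scikit-learn pipeline, so it can be tried on its own before wrapping it. The sketch below fits the same three steps on a handful of made-up strings using only scikit-learn; the toy texts and the `pos`/`neg` targets are invented for illustration, and the sketch does not show how instancelib feeds instances or labels to the model.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data: two documents per class.
toy_texts = [
    "good great excellent",
    "great excellent wonderful",
    "bad awful terrible",
    "awful terrible horrible",
]
toy_targets = ["pos", "pos", "neg", "neg"]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),     # raw text -> token counts
    ("tfidf", TfidfTransformer()),   # token counts -> TF-IDF weights
    ("clf", MultinomialNB()),        # TF-IDF features -> class prediction
])
toy_pipeline.fit(toy_texts, toy_targets)

toy_predictions = toy_pipeline.predict(["great wonderful", "terrible horrible"])
print(list(toy_predictions))
```

Checking the bare pipeline this way helps separate problems in the model definition from problems in how the dataset is loaded.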
Installation
See installation.md for an extended installation guide.
Method | Instructions |
---|---|
pip | Install from PyPI via pip install instancelib. |
Local | Clone this repository and install via pip install -e . or locally run python setup.py install. |
Documentation
Full documentation of the latest version is provided at https://instancelib.readthedocs.org.
Example usage
See usage.py to see an example of how the package can be used.
Releases
instancelib is officially released through PyPI.
See CHANGELOG.md for a full overview of the changes for each version.
Citation
@misc{instancelib,
  title = {Python package instancelib},
  author = {Michiel Bron},
  howpublished = {\url{https://github.com/mpbron/instancelib}},
  year = {2021}
}
Library usage
This library is used in the following projects:
- python-allib. A typed Active Learning framework for Python for both Classification and Technology-Assisted Review systems.
- text_explainability. A generic explainability architecture for explaining text machine learning models
- text_sensitivity. Sensitivity testing (fairness & robustness) for text machine learning models.
Maintenance
Contributors
- Michiel Bron (@mpbron)
Todo
Tasks yet to be done:
- Implement support for ONNX models
- Implement support for Python DataLoaders
- Make the external dataset interface more user friendly
- Redesign LabelProvider to support more attribute levels
- CI/CD tests