Incremental machine learning in Python

creme is a library for online machine learning, also known as incremental learning. Online learning is a machine learning regime where a model learns one observation at a time. This is in contrast to batch learning, where all the data is processed in one go. Incremental learning is desirable when the data is too big to fit in memory, or simply when you want to handle data in a streaming fashion. In addition to many online machine learning algorithms, creme provides utilities for extracting features from a stream of data. The API is heavily inspired by that of scikit-learn, so users who are familiar with it should feel comfortable.
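
To make the idea of learning one observation at a time concrete, here is a minimal sketch in plain Python. It is not part of creme: the `RunningStats` class and the `stream` generator are hypothetical names used only for illustration. The sketch keeps a running mean and variance (Welford's algorithm) without ever storing the data.

```python
class RunningStats:
    """Running mean and variance, updated one observation at a time (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        # Population variance; 0.0 until at least one observation has been seen
        return self._m2 / self.n if self.n else 0.0


def stream():
    """Hypothetical stand-in for a data source too large to hold in memory."""
    for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
        yield value


stats = RunningStats()
for value in stream():
    stats.update(value)  # constant memory: only n, mean and _m2 are kept

print(stats.mean, stats.variance)
```

A batch approach would need all values at once to compute the same statistics. Here each observation is seen once and then discarded, which is the property that streaming feature scaling relies on.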

Useful links

Installation

☝️ creme is intended to work with Python 3.6 and above.

creme can simply be installed with pip.

pip install creme

You can also install the latest development version like so:

pip install git+https://github.com/creme-ml/creme --upgrade

As for dependencies, creme mostly relies on Python's standard library. Sometimes it relies on numpy, scipy, and scikit-learn to avoid reinventing the wheel.

Quick example

In the following example we'll use a linear regression to forecast the number of available bikes in bike stations in the city of Toulouse 🚲.

We'll use the available numeric features, as well as calculate running averages of the target. Before being fed to the linear regression, the features will be scaled using a StandardScaler. Note that each of these steps works in a streaming fashion, including the feature extraction. We'll evaluate the model by asking it to forecast 30 minutes ahead and delaying the true answers, which ensures that we're simulating a production scenario. Finally we will print the current score every 20,000 predictions.

>>> import datetime as dt
>>> from creme import compose
>>> from creme import datasets
>>> from creme import feature_extraction
>>> from creme import linear_model
>>> from creme import metrics
>>> from creme import model_selection
>>> from creme import preprocessing
>>> from creme import stats

>>> X_y = datasets.fetch_bikes()

>>> def add_hour(x):
...     x['hour'] = x['moment'].hour
...     return x

>>> model = compose.Whitelister('clouds', 'humidity', 'pressure', 'temperature', 'wind')
>>> model += (
...     add_hour |
...     feature_extraction.TargetAgg(by=['station', 'hour'], how=stats.Mean())
... )
>>> model += feature_extraction.TargetAgg(by='station', how=stats.EWMean(0.5))
>>> model |= preprocessing.StandardScaler()
>>> model |= linear_model.LinearRegression()

>>> model_selection.online_qa_score(
...     X_y=X_y,
...     model=model,
...     metric=metrics.MAE(),
...     on='moment',
...     lag=dt.timedelta(minutes=30),
...     print_every=20_000
... )
[20,000] MAE: 13.743465
[40,000] MAE: 7.990616
[60,000] MAE: 6.101015
[80,000] MAE: 5.159895
[100,000] MAE: 4.593369
[120,000] MAE: 4.19251
[140,000] MAE: 3.904753
[160,000] MAE: 3.725466
[180,000] MAE: 3.568893
MAE: 3.555296
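
The idea of delaying the true answers can be sketched in plain Python. This is an illustration under assumptions, not creme's implementation of `online_qa_score`: the `StationMean` toy model, the `delayed_mae` function and the four observations are all hypothetical. Each prediction is queued, and its ground truth is only used for scoring and learning once `lag` has elapsed.

```python
import collections
import datetime as dt


class StationMean:
    """Toy model (hypothetical): predicts the running mean of the target per station."""

    def __init__(self):
        self.sums = collections.defaultdict(float)
        self.counts = collections.defaultdict(int)

    def predict_one(self, x):
        n = self.counts[x['station']]
        return self.sums[x['station']] / n if n else 0.0

    def fit_one(self, x, y):
        self.sums[x['station']] += y
        self.counts[x['station']] += 1


def delayed_mae(stream, model, lag):
    """Score a model while revealing each true answer only after `lag` has passed."""
    pending = collections.deque()  # items are (reveal_time, x, y, y_pred)
    abs_error, n = 0.0, 0

    def reveal(until):
        nonlocal abs_error, n
        # until=None flushes everything that is still waiting
        while pending and (until is None or pending[0][0] <= until):
            _, x, y, y_pred = pending.popleft()
            abs_error += abs(y - y_pred)
            n += 1
            model.fit_one(x, y)  # the model only learns once the answer is known

    for x, y in stream:
        reveal(x['moment'])
        pending.append((x['moment'] + lag, x, y, model.predict_one(x)))
    reveal(None)
    return abs_error / n if n else 0.0


t0 = dt.datetime(2019, 1, 1, 8, 0)
observations = [
    ({'station': 'a', 'moment': t0}, 10),
    ({'station': 'a', 'moment': t0 + dt.timedelta(minutes=10)}, 12),
    ({'station': 'a', 'moment': t0 + dt.timedelta(minutes=40)}, 14),
    ({'station': 'a', 'moment': t0 + dt.timedelta(minutes=80)}, 12),
]

mae = delayed_mae(observations, StationMean(), lag=dt.timedelta(minutes=30))
print(mae)  # 6.25
```

Note that the second prediction is made before the first answer has been revealed, so the model cannot use it. This is what prevents the evaluation from leaking information that would not be available in production.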

We can also draw the model to understand how the data flows through.

>>> dot = model.draw()

Using only a few lines of code, we've built a robust model and evaluated it by simulating a production scenario. You can find a more detailed version of this example here. creme has a lot more to offer, so we kindly refer you to the documentation if you want to know more.

Comparison with other solutions

  • scikit-learn: Some of its estimators have a partial_fit method which allows them to update themselves with new observations. However, online learning isn't treated as a first-class citizen, which can make things awkward. You should definitely use scikit-learn if your data fits in memory and you can afford to retrain your model from scratch every time new data is available.
  • Vowpal Wabbit: VW is probably the fastest out-of-core learning system available. At its core it implements a state-of-the-art adaptive gradient descent algorithm with many tricks. It also has some mechanisms for doing active learning and using bandits. However, it isn't a "true" online learning system as it assumes the data is available in a file and can be looped over multiple times. It is also somewhat difficult for newcomers to grok.
  • LIBOL: This is a good library written by academics with some great documentation. It's written in C++ and seems to be pretty fast. However, it only focuses on the learning aspect of online learning, not on other mundane yet useful tasks such as feature extraction and preprocessing. Moreover, it hasn't been updated for a few years.
  • Spark Streaming: This is an extension of Apache Spark which caters to big data practitioners. It processes data in mini-batches instead of actually doing real streaming operations. It also has some compatibility with MLlib for implementing online learning algorithms, such as streaming linear regression and streaming k-means. However, it is a somewhat overwhelming solution which might be overkill for certain use cases.
  • TensorFlow: Deep learning systems are in some sense online learning systems because they use online gradient descent. However, popular libraries are mostly attuned to batch situations. Because frameworks such as Keras and PyTorch are so popular and very well backed, there is no real point in implementing neural networks in creme. Additionally, for a lot of problems neural networks might not be the right tool, and you might want to use a simple logistic regression or a decision tree (for which online algorithms exist).

Feel free to open an issue if you feel like other solutions are worth mentioning.
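
As an illustration of the point above that simple models have online algorithms, here is a minimal sketch of logistic regression trained by online gradient descent in plain Python. It is not creme's `linear_model` implementation. The class name, the learning rate and the synthetic stream are assumptions made for the example, and features are dicts as in the quick example.

```python
import math


class OnlineLogisticRegression:
    """Hypothetical sketch: logistic regression updated one observation at a time."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.weights = {}  # weights are created lazily as new features appear
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-35.0, min(35.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def fit_one(self, x, y):
        # Gradient of the log loss with respect to the linear output z
        error = self.predict_proba_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error
        return self


def synthetic_stream(n):
    """Deterministic toy stream: the label is 1 when feature 'a' exceeds feature 'b'."""
    for i in range(n):
        a = (i % 10) / 10
        b = ((i * 7) % 10) / 10
        yield {'a': a, 'b': b}, int(a > b)


model = OnlineLogisticRegression()
for x, y in synthetic_stream(200):
    model.fit_one(x, y)  # a single gradient step per observation, no stored dataset

p_high = model.predict_proba_one({'a': 1.0, 'b': 0.0})
p_low = model.predict_proba_one({'a': 0.0, 'b': 1.0})
print(p_high, p_low)
```

Each update touches only the features present in the current observation, so the memory footprint depends on the number of distinct features and not on the length of the stream.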

Contributing

Like many subfields of machine learning, online learning is far from being an exact science, so there is still a lot to do. Feel free to contribute in any way you like; we're always open to new ideas and approaches. If you want to contribute to the code base, please check out the CONTRIBUTING.md file. Also take a look at the issue tracker and see if anything takes your fancy.

Last but not least you are more than welcome to share with us how you're using creme or online learning in general! We believe that online learning solves a lot of pain points in practice and we would love to share experiences.

This project follows the all-contributors specification. Contributions of any kind are welcome!

  • Max Halford 📆 💻
  • AdilZouitine 💻
  • Raphael Sourty 💻
  • Geoffrey Bolmier 💻
  • vincent d warmerdam 💻
  • VaysseRobin 💻
  • Lygon Bowen-West 💻
  • Florent Le Gac 💻
  • Adrian Rosebrock 📝

License

See the license file.
