
Data engineering, simplified. LineaPy creates a frictionless path for taking your data science artifact from development to production.


LineaPy

Capture, analyze, and transform messy notebooks into production-ready pipelines
with just two lines of code.

Follow LineaPy on Twitter, and join the LineaPy Slack to ask questions or learn about our workshops!

👇 Try It Out! 👇

Demo video: https://user-images.githubusercontent.com/13392380/169427654-487d8d4b-3eda-462a-a96c-51c151f39ab9.mp4


What Problems Can LineaPy Solve?

Use Case 1: Cleaning Messy Notebooks

When working in a Jupyter notebook day-to-day, it is easy to write messy code: we jump between cells, delete cells, edit cells, and execute the same cell multiple times, until we think we have some good results (e.g., tables, models, charts). This highly dynamic, interactive style of notebook use can create problems. For instance, colleagues who try to rerun the notebook may not be able to reproduce our results. Worse, as time passes, we ourselves may forget the exact steps that produced those results, and hence be unable to help our colleagues.

One way to deal with this problem is to keep the notebook sequential by constantly re-executing it in full during development. However, this interrupts our natural workflow and stream of thought, decreasing productivity. Hence, it is far more common to clean up the notebook after development. This is a time-consuming process, and it is still not immune to reproducibility issues caused by deleted cells and out-of-order cell executions.

To see how LineaPy can help here, check out this demo or Open in Colab, Open in Binder.

Use Case 2: Revisiting Previous Work

Data science is often a team effort where one person's work builds on results from another's. For instance, a data scientist building a model may use features engineered by various colleagues. In using results generated by other people, we may encounter issues such as missing values, numbers that look suspicious, and unintelligible variable names. When that happens, we may need to check how these results came into being in the first place. Often, this means tracing back through the code that generated the result in question (e.g., a feature table).

In practice, this can become a challenging task because it may not be clear who produced the result. Even if we know whom to ask, that person might not remember where the exact version of the code lives. Worse, they may have overwritten the code without version control, or may have left the organization without a proper handover of the relevant knowledge. In any of these cases, identifying the root of the issue becomes extremely difficult, which may render the result unreliable or even unusable.

To see how LineaPy can help here, check out this demo or Open in Colab, Open in Binder.

Use Case 3: Building Pipelines

As our notebooks mature, they may get used like pipelines. For instance, a notebook might process the latest data to update dashboards, or pre-process data and dump it to the filesystem for downstream model development. Since other people rely on up-to-date results from our work, we may be expected to re-execute these processes on a regular basis. Running a notebook manually is a brittle process prone to errors, so we may want to set up proper production pipelines.

If relevant engineering support is not available, we need to clean up and refactor the notebook code ourselves so it can run in orchestration systems or job schedulers (e.g., cron, Apache Airflow, Prefect). Of course, this assumes we already know what those tools are and how to work with them; if not, we first need to spend time learning them. All this operational work is time-consuming manual labor, leaving less time for our core duties as data scientists.

To see how LineaPy can help here, check out this demo or Open in Colab, Open in Binder.

Getting Started

Installation

To install LineaPy, run:

pip install lineapy

Or, if you want the latest version of LineaPy directly from the source, run:

pip install git+https://github.com/LineaLabs/lineapy.git --upgrade

LineaPy offers several extras to extend its core capabilities:

| Version | Installation Command | Enables |
|----------|------------------------------|---------|
| minimal | pip install lineapy[minimal] | Minimal dependencies for LineaPy |
| dev | pip install lineapy[dev] | All LineaPy dependencies for testing and development |
| s3 | pip install lineapy[s3] | Dependencies to save artifacts to S3 |
| graph | pip install lineapy[graph] | Dependencies to visualize the LineaPy node graph |
| postgres | pip install lineapy[postgres] | Dependencies to use a PostgreSQL backend |

The minimal version of LineaPy does not include black or isort, which may result in less organized output code and scripts.

By default, LineaPy uses SQLite as its artifact store, which keeps the package light and simple. However, SQLite has several limitations; notably, it does not support multiple concurrent writes to a database (attempting them results in a database lock). If you want a more robust database, please follow the instructions for using PostgreSQL.
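To see why concurrent writes are a problem for SQLite, here is a small self-contained sketch (using Python's standard sqlite3 module, not LineaPy itself) of two connections contending for the same database file:

```python
import os
import sqlite3
import tempfile

# Minimal demonstration of the SQLite limitation noted above: while one
# connection holds the write lock, a second connection's write fails with
# "database is locked" instead of succeeding.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the connection in autocommit mode so we can
# manage the transaction explicitly.
writer = sqlite3.connect(path, timeout=0.1, isolation_level=None)
writer.execute("CREATE TABLE artifacts (name TEXT)")
writer.execute("BEGIN IMMEDIATE")  # acquire the database write lock
writer.execute("INSERT INTO artifacts VALUES ('model')")

second = sqlite3.connect(path, timeout=0.1, isolation_level=None)
err = None
try:
    second.execute("INSERT INTO artifacts VALUES ('table')")
except sqlite3.OperationalError as exc:
    err = exc  # "database is locked"

print(err)
writer.execute("COMMIT")
second.close()
writer.close()
```

A short busy timeout (0.1 s) is set here so the example fails fast; in a real multi-writer workload the second writer would simply stall and then error out, which is why a client-server backend like PostgreSQL is recommended for concurrent use.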

Quick Start

Once LineaPy is installed, we are ready to start using the package. We start with a simple example using the Iris dataset to demonstrate how to use LineaPy to 1) store a variable's history, 2) get its cleaned-up code, and 3) build an executable pipeline for the variable.

import lineapy
import pandas as pd
from sklearn.linear_model import LinearRegression, ElasticNet

# Load data
df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# Some very basic feature engineering
color_map = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}
df["variety_color"] = df["variety"].map(color_map)
df2 = df.copy()
df2["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0)
df2["d_virginica"] = df["variety"].apply(lambda x: 1 if x == "Virginica" else 0)

# Initialize two models
model1 = LinearRegression()
model2 = ElasticNet()

# Fit both models
model1.fit(
    X=df2[["petal.width", "d_versicolor", "d_virginica"]],
    y=df2["sepal.width"],
)
model2.fit(
    X=df[["petal.width", "variety_color"]],
    y=df["sepal.width"],
)

Now, we reach the end of our development session and decide to save the ElasticNet model. We can store the model as a LineaPy artifact as follows:

# Store the model as an artifact
lineapy.save(model2, "iris_elasticnet_model")

A LineaPy artifact encapsulates both the value and code, so we can easily retrieve the model's code, like so:

# Retrieve the model artifact
artifact = lineapy.get("iris_elasticnet_model")

# Check code for the model artifact
print(artifact.get_code())

which will print:

import pandas as pd
from sklearn.linear_model import ElasticNet

df = pd.read_csv(
    "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
)
color_map = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}
df["variety_color"] = df["variety"].map(color_map)
model2 = ElasticNet()
model2.fit(
    X=df[["petal.width", "variety_color"]],
    y=df["sepal.width"],
)

Note that these are the minimal essential steps to produce the model. That is, LineaPy has automatically cleaned up the original code by removing extraneous operations that do not affect the model.
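This cleanup is, in essence, a program slice: starting from the saved variable, only the statements it transitively depends on are kept. The following toy sketch illustrates the idea with a hand-written dependency table; it is illustrative only and is not LineaPy's actual implementation, which tracks dependencies automatically:

```python
# Toy illustration of program slicing, the idea behind the cleanup above.
# Each statement records the variable it defines and the variables it reads
# (the statement strings are shorthand for the Quick Start code).
statements = [
    ("df = read_csv(...)",            {"defines": "df", "reads": set()}),
    ("color_map = {...}",             {"defines": "color_map", "reads": set()}),
    ("df['variety_color'] = ...",     {"defines": "df", "reads": {"df", "color_map"}}),
    ("df2 = df.copy()",               {"defines": "df2", "reads": {"df"}}),
    ("model1 = LinearRegression()",   {"defines": "model1", "reads": set()}),
    ("model2 = ElasticNet()",         {"defines": "model2", "reads": set()}),
    ("model1.fit(df2, ...)",          {"defines": "model1", "reads": {"model1", "df2"}}),
    ("model2.fit(df, ...)",           {"defines": "model2", "reads": {"model2", "df"}}),
]

def slice_for(target):
    """Walk the statements backwards, keeping only those the target needs."""
    needed, kept = {target}, []
    for code, info in reversed(statements):
        if info["defines"] in needed:
            kept.append(code)
            needed |= info["reads"]
    return list(reversed(kept))

print(slice_for("model2"))
```

Running this keeps only the read_csv, color_map, variety_color, and model2 statements, mirroring the cleaned-up code printed above, while the df2 and model1 statements are dropped as extraneous.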

Say we are now asked to retrain the model on a regular basis to account for updates in the source data. We need to set up a pipeline to train the model, and LineaPy makes this as simple as a single line of code:

lineapy.to_pipeline(
    artifacts=[artifact.name],
    pipeline_name="iris_model_pipeline",
    output_dir="output/",
    framework="AIRFLOW",
)

which generates several files that can be used to execute the pipeline from the UI or CLI.

In sum, LineaPy automates time-consuming, manual steps in a data science workflow, helping us move our work into production more quickly.

Interfaces

Jupyter and IPython

To use LineaPy in an interactive computing environment such as Jupyter Notebook/Lab or IPython, launch the environment with the lineapy command, like so:

lineapy jupyter notebook
lineapy jupyter lab
lineapy ipython

This will automatically load the LineaPy extension in the corresponding interactive shell application.

Alternatively, if the application is already running without the extension loaded (e.g., the Jupyter server was started with plain jupyter notebook or jupyter lab), you can load the extension on the fly with:

%load_ext lineapy

executed at the top of your session. Please note:

  • You will need to run this as the first command in a given session; executing it in the middle of a session will lead to erroneous behaviors by LineaPy.

  • This loads the extension to the current session only, i.e., it does not carry over to different sessions; you will need to repeat it for each new session.

Hosted Jupyter Environment

In hosted Jupyter notebook environments such as JupyterHub, Google Colab, Kaggle, or Databricks, or in any environment not started from the CLI (such as the Jupyter extension within VS Code), you first need to install lineapy directly within your notebook:

!pip install lineapy

Then you can manually load the LineaPy extension with:

%load_ext lineapy

For environments running IPython versions older than 7.0 (such as Google Colab), we need to upgrade to IPython>=7.0 before the above steps. We can upgrade IPython via:

!pip install --upgrade ipython

and restart the notebook runtime:

exit()

Then we can set up LineaPy as described above.

CLI

We can also use LineaPy as a CLI command. Run:

lineapy python --help

to see available options.

Usage Reporting

LineaPy collects anonymous usage data that helps our team improve the product. Only LineaPy's API calls and CLI commands are reported. We strip out as much potentially sensitive information as possible, and we will never collect user code, data, variable names, or stack traces.

You can opt out of usage tracking by setting an environment variable:

export LINEAPY_DO_NOT_TRACK=true

What Next?

To learn more about LineaPy, please check out the project documentation, which contains many examples you can follow along with. Some key resources include:

| Resource | Description |
|----------|-------------|
| Docs | This is our knowledge hub: when in doubt, start here! |
| Concepts | Learn about key concepts underlying LineaPy! |
| Tutorials | These notebook tutorials will help you better understand core functionalities of LineaPy |
| Use Cases | These domain examples illustrate how LineaPy can help in real-world applications |
| API Reference | Need more technical details? This reference may help! |
| Contribute | Want to contribute? These instructions will help you get set up! |
| Slack | Have questions or unresolved issues? Join our community and ask away! |
