PureML

Track, version, compare and review your data and models.

Quick Access

Documentation         Watch Demo         Quick example         Get Instant Help         Sign Up for free





Intro

PureML is an open-source version control system for machine learning.

  1. Quick start
  2. How it works
  3. Demo
  4. Main Features
  5. Tutorials
  6. Core design principles
  7. Core abstractions
  8. Why to get involved

Quick start

You can install and run PureML using pip.

Using pip

  1. Install PureML
    pip install pureml
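To confirm that the install succeeded, one option is to query the installed version through the Python standard library. This is a minimal sketch; `installed_version` is a hypothetical helper name and is not part of PureML.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pureml")
    print(f"pureml version: {found}" if found else "pureml is not installed")
```

The helper returns `None` instead of raising, so it can be used as a quick pre-flight check in a setup script.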
    

How it works

Just add a few lines of code. You don't need to change the way you work.

PureML is a Python library that uploads metadata to S3.

Generating Data Lineage

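  1. Load Data
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Import path is an assumption; check the PureML documentation for your version.
from pureml.decorators import load_data, transformer, dataset

@load_data(name='loading data')
def loading_data():
    return pd.read_csv('churn.csv')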
  2. Transform Data
@transformer(name='fill missing values')
def fill_missing_values(df):
    # fillna() requires a fill value; 0 is an assumed placeholder, pick one that suits your data
    return df.fillna(0)
    

@transformer(name='encode ordinal')
def encode_ordinal(df):
    col_ord = ['state', 'phone number']
    df_ord = df[col_ord]
    feat = OrdinalEncoder().fit_transform(df_ord)    
    df[col_ord] = feat
    
    return df

@transformer(name='encode binary')
def encode_binary(df):

    df['voice mail plan'] = df['voice mail plan'].map({'yes':1, 'no':0})
    df['international plan'] = df['international plan'].map({'yes':1, 'no':0})
    df['churn'] = df['churn'].map({True:1, False:0})

    return df
  3. Register Dataset
@dataset(name='telecom churn', parent='encode binary')
def build_dataset():
    df = loading_data()

    df = fill_missing_values(df)

    df = encode_ordinal(df)

    df = encode_binary(df)

    return df

df = build_dataset()

This is how the generated data lineage will look in the UI.
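To make the idea of decorator-driven lineage concrete, here is a minimal pure-Python sketch of how decorators can record each step as it runs. It is an illustration only, not PureML's implementation: `track`, `LINEAGE`, and the toy row data are hypothetical, and nothing is uploaded anywhere.

```python
import functools

# Hypothetical in-memory lineage log; PureML's real storage is not shown here.
LINEAGE = []


def track(kind, name):
    """Decorator factory that records each successful call as a lineage step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Record the step only after the wrapped function has returned
            LINEAGE.append({"kind": kind, "name": name, "function": func.__name__})
            return result
        return wrapper
    return decorator


@track("load_data", "loading data")
def loading_data():
    # Toy rows standing in for the CSV file
    return [{"churn": True}, {"churn": None}]


@track("transformer", "fill missing values")
def fill_missing_values(rows):
    return [{k: (False if v is None else v) for k, v in row.items()} for row in rows]


@track("transformer", "encode binary")
def encode_binary(rows):
    return [{k: int(v) for k, v in row.items()} for row in rows]


@track("dataset", "telecom churn")
def build_dataset():
    return encode_binary(fill_missing_values(loading_data()))


rows = build_dataset()
print(rows)
print([step["name"] for step in LINEAGE])
```

Because each wrapper appends its entry after the wrapped function returns, the log lists the steps in execution order, with the enclosing dataset step last.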

Demo

Live demo

Build and run a PureML project to create data lineage and a model using our demo Colab notebook.

Demo video (2 min)

PureML quick start demo



Main Features

Data Lineage: Automatic generation of data lineage
Dataset Versioning: Automatic semantic versioning of datasets
Model Versioning: Automatic semantic versioning of models
Comparison: Comparing different versions of models or datasets
Branches (Coming Soon): Separation between experimentation and production-ready models using branches
Review (Coming Soon): Review and approve models and datasets for the production-ready branch

Tutorials


Core design principles

Easy developer experience: An intuitive open-source package that aims to bridge the gaps in data science teams
Engineering best practices built-in: Integrating PureML functionality into your code does not disrupt your workflow
Object Versioning: A reliable object versioning mechanism to track changes to your datasets and models
Data is a first-class citizen: Your data is secure. It will never leave your system.
Reduce Friction: Access the operations performed on data through data lineage without spending time in lengthy meetings

Core abstractions

These are the fundamental concepts that PureML uses to operate.

Project: A data science project. This is where you store datasets, models, and their related objects. It is similar to a GitHub repository with object storage.
Lineage: Contains the series of transformations performed on data to generate a dataset.
Data Versioning: Versioning of the data should be comprehensible to the user and should encapsulate the changes in the data and its creation mechanism, among others.
Model Versioning: Versioning of the model should be comprehensible to the user and should encapsulate the changes in training data, model architecture, and hyperparameters.
Fetch: This functionality is used to fetch registered models and datasets.
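The relationship between a project, its registered objects, and fetch can be sketched with a small in-memory class. This is a hypothetical stand-in to show the concepts, not PureML's API: `Project`, `register`, and `fetch` as written here are illustrative names with assumed signatures.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical stand-in for a project: named objects, each with a version history."""
    name: str
    datasets: dict = field(default_factory=dict)
    models: dict = field(default_factory=dict)

    def _store(self, kind):
        if kind not in ("datasets", "models"):
            raise ValueError(f"unknown kind: {kind!r}")
        return getattr(self, kind)

    def register(self, kind, name, obj):
        """Append obj as the next version of name and return its 1-based version number."""
        history = self._store(kind).setdefault(name, [])
        history.append(obj)
        return len(history)

    def fetch(self, kind, name, version=None):
        """Return the requested version of name, or the latest when version is None."""
        history = self._store(kind)[name]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{name!r} has no version {version}")
        return history[version - 1]


project = Project("churn")
v1 = project.register("datasets", "telecom churn", {"rows": 100})
v2 = project.register("datasets", "telecom churn", {"rows": 120})
print(v1, v2)
print(project.fetch("datasets", "telecom churn"))
print(project.fetch("datasets", "telecom churn", version=1))
```

Keeping every registered object in an append-only history is what lets a fetch return either the latest version or any earlier one by number.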

Why to get involved

Version control is much more common in software than in machine learning. So why isn’t everyone using Git? Git doesn’t work well with machine learning. It can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t record information automatically from inside a training script.

GitHub wasn’t designed with data as a core project component. This, along with a number of other differences between AI and more traditional software projects, makes GitHub a bad fit for artificial intelligence and contributes to the reproducibility crisis in machine learning.

From manually tracking models to Git-based versioning systems that do not follow an intuitive versioning mechanism, there is no standardized way to track objects. With these mechanisms, it is hard enough to track your own model from a month ago or get it running, let alone a teammate's!

We are trying to build a version control system for machine learning objects: a mechanism that is object-dependent and intuitive for users.

Let's build this together. If you have faced this issue or have worked out a similar solution for yourself, please join us to help build a better system for everyone.


Reporting Bugs

To report a bug you encountered while using the PureML package, please

  1. Report it in the Discord channel
  2. Open an issue

Contributing and developing

Let's work together to improve the features for everyone.

Work with mutual respect.


License

See the Apache-2.0 file for licensing information.


