Track, version, compare and review your data and models.
⛳ Quick Access
Documentation | Watch Demo | Quick example | Get Instant Help | Sign Up for free
💎 Intro
PureML is an open-source version control system for machine learning.
- Quick start
- How it works
- Demo
- Main Features
- Core design principles
- Core abstractions
- Why to get involved
- Tutorials
⏱ Quick start
You can install and run PureML using pip.
Install PureML
pip install pureml
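After installing, one quick way to confirm that the package is visible to your interpreter is to query its installed version. The sketch below uses only the Python standard library; the helper name `installed_version` is made up for illustration and is not part of PureML.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pureml")
    print(f"pureml version: {found}" if found else "pureml is not installed")
```

If this reports that the package is missing, check that pip installed into the same environment that runs your script.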
📋 How it works
Just add a few lines of code. You don't need to change the way you work.
PureML is a Python library that uploads metadata to S3.
Generating Data Lineage
- Load Data
import pandas as pd  # load_data is assumed to be imported from PureML

@load_data(name='loading data')
def loading_data():
    return pd.read_csv('churn.csv')
- Transform Data
@transformer(name='fill missing values')
def fill_missing_values(df):
    # fillna() requires a fill value; 0 is a placeholder, pick one that suits your data
    return df.fillna(0)
from sklearn.preprocessing import OrdinalEncoder

@transformer(name='encode ordinal')
def encode_ordinal(df):
    col_ord = ['state', 'phone number']
    df_ord = df[col_ord]
    # Replace each category with an integer code
    feat = OrdinalEncoder().fit_transform(df_ord)
    df[col_ord] = feat
    return df
@transformer(name='encode binary')
def encode_binary(df):
    df['voice mail plan'] = df['voice mail plan'].map({'yes': 1, 'no': 0})
    df['international plan'] = df['international plan'].map({'yes': 1, 'no': 0})
    df['churn'] = df['churn'].map({True: 1, False: 0})
    return df
- Register Dataset
@dataset(name='telecom churn', parent='encode binary')
def build_dataset():
    df = loading_data()
    df = fill_missing_values(df)
    df = encode_ordinal(df)
    df = encode_binary(df)
    return df

df = build_dataset()
This is what the generated data lineage looks like in the UI.
For a more detailed explanation, please visit our Documentation.
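To give a feel for how decorators can capture lineage without changing the way functions are called, here is a minimal, self-contained sketch. It is not PureML's implementation: the `step` decorator, the `LINEAGE` list, and the toy data are all hypothetical, and nothing is uploaded anywhere. It only shows the general idea of recording each named step as it runs.

```python
import functools
from typing import Callable, List, Optional

# Hypothetical in-memory lineage log: one entry per executed step.
LINEAGE: List[dict] = []


def step(name: str, kind: str = "transformer", parent: Optional[str] = None) -> Callable:
    """Decorator factory that records every call of the wrapped function."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Record the step only after it has finished successfully.
            LINEAGE.append({"name": name, "kind": kind, "parent": parent,
                            "function": func.__name__})
            return result
        return wrapper
    return decorator


@step(name="loading data", kind="load_data")
def loading_data():
    return [3, None, 5]  # toy stand-in for a CSV column


@step(name="fill missing values")
def fill_missing_values(rows):
    return [0 if r is None else r for r in rows]


@step(name="toy dataset", kind="dataset", parent="fill missing values")
def build_dataset():
    return fill_missing_values(loading_data())


rows = build_dataset()
print(rows)
for entry in LINEAGE:
    print(entry["kind"], "->", entry["name"])
```

The decorated functions are called exactly as before; the log fills up as a side effect, which is why this style of tracking needs only a few extra lines in existing code.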
💻 Demo
Live demo
Build and run a PureML project to create data lineage and a model with our demo colab link.
Demo video (2 min)
PureML quick start demo
Click the image to play video
📍 Main Features
Data Lineage | Automatic generation of data lineage |
Dataset Versioning | Object-based Automatic Semantic Versioning of datasets |
Model Versioning | Object-based Automatic Semantic Versioning of models |
Comparison | Comparing different versions of models or datasets |
Branches (Coming Soon) | Separation between experimentation and production-ready models using branches |
Review (Coming Soon) | Review and approve models and datasets for the production-ready branch |
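As a rough illustration of what object-based automatic versioning can mean, the sketch below assigns a new version only when the content of an object changes. The `VersionStore` class, the `fingerprint` helper, and the `v1`, `v2` labels are assumptions made for this example; PureML's actual versioning rules may differ.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(obj: Any) -> str:
    """Stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class VersionStore:
    """Hypothetical store that issues v1, v2, ... only when content changes."""

    def __init__(self) -> None:
        self._history: Dict[str, List[str]] = {}

    def register(self, name: str, obj: Any) -> str:
        digests = self._history.setdefault(name, [])
        digest = fingerprint(obj)
        if not digests or digests[-1] != digest:
            digests.append(digest)  # new content, so record a new version
        return f"v{len(digests)}"


store = VersionStore()
first = store.register("telecom churn", {"rows": [1, 2, 3]})
same = store.register("telecom churn", {"rows": [1, 2, 3]})
changed = store.register("telecom churn", {"rows": [1, 2, 4]})
print(first, same, changed)
```

Because the version is derived from the object itself, registering unchanged data twice does not create a spurious new version.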
🔮 Core design principles
Easy developer experience | An intuitive open-source package that aims to bridge the gaps in data science teams |
Engineering best practices built-in | Integrating PureML functionality into your code does not disrupt your workflow |
Object Versioning | A reliable object versioning mechanism to track changes to your datasets and models |
Data is a first-class citizen | Your data is secure. It will never leave your system. |
Reduce Friction | Use data lineage to see which operations were performed on the data, without spending time in lengthy meetings |
⚙ Core abstractions
These are the fundamental concepts that PureML uses to operate.
Project | A data science project. This is where you store datasets, models, and their related objects. It is similar to a GitHub repository with object storage. |
Lineage | Contains a series of transformations performed on data to generate a dataset. |
Data Versioning | Versioning of the data should be comprehensible to the user and should encapsulate the changes in the data and its creation mechanism, among others. |
Model Versioning | Versioning of the model should be comprehensible to the user and should encapsulate the changes in training data, model architecture, and hyperparameters. |
Fetch | This functionality is used to fetch registered models and datasets. |
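The abstractions above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `Project` class, its `register` and `fetch` methods, and the `compare` helper are hypothetical names chosen for illustration and do not reflect PureML's API or storage format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Project:
    """Hypothetical stand-in for a project: named objects, each with versions."""
    name: str
    _objects: Dict[str, List[Any]] = field(default_factory=dict)

    def register(self, key: str, obj: Any) -> str:
        versions = self._objects.setdefault(key, [])
        versions.append(obj)
        return f"v{len(versions)}"

    def fetch(self, key: str, version: Optional[str] = None) -> Any:
        versions = self._objects[key]  # raises KeyError if never registered
        if version is None:
            return versions[-1]  # default to the latest version
        index = int(version.lstrip("v")) - 1
        if not 0 <= index < len(versions):
            raise LookupError(f"{key} has no version {version}")
        return versions[index]


def compare(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Difference (new - old) for every metric present in both versions."""
    return {k: new[k] - old[k] for k in old.keys() & new.keys()}


project = Project("churn")
project.register("model metrics", {"accuracy": 0.5, "recall": 0.25})
project.register("model metrics", {"accuracy": 0.75, "recall": 0.25})

delta = compare(project.fetch("model metrics", "v1"),
                project.fetch("model metrics"))
print(delta)
```

Fetching by name and optional version keeps callers independent of where the objects are stored, and comparing two fetched versions is then an ordinary function call.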
🤝 Why to get involved
Version control is much more common in software than in machine learning. So why isn’t everyone using Git? Git doesn’t work well with machine learning. It can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t record information automatically from inside a training script.
GitHub wasn’t designed with data as a core project component. This along with a number of other differences between AI and more traditional software projects makes GitHub a bad fit for artificial intelligence, contributing to the reproducibility crisis in machine learning.
From manual model tracking to Git-based versioning systems that do not follow an intuitive versioning mechanism, there is no standardized way to track objects. With these mechanisms, it is hard enough to track your own model from a month ago or get it running, let alone a teammate's!
We are trying to build a version control system for machine learning objects: a mechanism that is object-dependent and intuitive for users.
Let's build this together. If you have faced this issue or have worked out a similar solution for yourself, please join us to help build a better system for everyone.
🧮 Tutorials
- Registering Data lineage
- Registering models
- Quick Start: Tabular
- Quick Start: Computer Vision
- Quick Start: NLP
- Logging
🐞 Reporting Bugs
To report any bugs you have faced while using PureML package, please
⌨ Contributing and Developing
Let's work together to improve the features for everyone. For more details, please look at our Contributing Guide.
Work with mutual respect.
👨👩👧👦 Community
For quick updates and feature releases for PureML, follow us on
📄 License
See the Apache-2.0 file for licensing information.