End-to-end machine learning on your desktop or server.

Development Status
- 1 - Planning
Framework
- Jupyter
Intended Audience
- Developers
License
- OSI Approved :: GNU Affero General Public License v3
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

PyDataSci

Value Proposition

PyDataSci's AIdb is an open source, autoML tool that keeps track of the moving parts of machine learning so that data scientists can perform best practice ML without the coding overhead.

Mission

Reproducible
No more black boxes. No more screenshotting parameters and loss-accuracy graphs. A record of every dataset, feature, split, fold, parameter, run, model, and result is persisted in a lightweight fashion. So hypertune to your heart's content, compare models, and pick the best one with the proof to back it up.
Local-first
Empower non-cloud users (academic/institute HPCs, private cloud companies, remote server SSH'ers, and everyday desktop hackers) with the same quality of ML services found in public clouds (e.g. AWS SageMaker).
Integrated
Don’t disrupt the natural workflow of users by forcing them into the confines of a GUI app or specific IDE. Weave automated tracking into their existing code to work alongside the existing ecosystem of data science tools.
Scalable
Queue many hypertuning jobs locally, or run big jobs in parallel in the cloud.

Painpoint Solved

In writing a paper about comparative methods for the interpretation of deep learning activation values via graph neural networks, CNNs, and LSTMs, I found myself comparing multiple models with many parameter combinations. I was burdened by questions like: Had I already tried these parameters? How was I going to save the metrics to compare the models? I was literally screenshotting my parameters and charts. That's not conducive to the scientific method. I had done the hard part in figuring out the science, but this permuted world was just a mess. When I took a look at other tools in the space, I found they were either cloud-only, too complex or poorly documented, incomplete (bring your own database), or too proprietary and closed off. Let's be honest, the average data scientist isn't the world's best software engineer or architect, so they need a low-code fix for keeping track of everything.

Functionality

Compresses a dataset (CSV, TSV, Parquet) and stores it so that it can be tracked.
Derives informative featuresets and/or labels from that dataset.
Treats validation sets (3rd split) and cross-folds (k-fold) as first-class citizens.
Queues hypertuning jobs and batches.
Calculates and saves the performance metrics of each model.
Visually compares models to find the best one.
Scales out to run cloud jobs (data size, training time) by toggling cloud_queue = True.

Community

Much to automate there is. Simple it must be. ML is a broad space with a lot of challenges to solve. Let us know if you want to get involved. We plan to host monthly dev jam sessions and data science lightning talks. layne <at> pydatasci.com

Data types: tabular, longitudinal, image, graph, audio, video, gaming.
Analysis types: classification, regression, dimensionality reduction, feature engineering, recurrent, generative, reinforcement, NLP.

Installation

Requires Python 3+. You will only need to perform these steps the first time you use the package.

Enter the following commands one-by-one and follow any instructions returned by the command prompt to resolve errors should they arise.

Starting from the command line:

$ pip install --upgrade pydatasci
$ python

Once inside the Python shell:

>>> import pydatasci as pds
>>> pds.create_folder()
>>> pds.create_config()
>>> from pydatasci import aidb
>>> aidb.create_db()

PyDataSci makes use of the Python package, appdirs, for an operating system (OS) agnostic location to store configuration and database files. This not only keeps your $HOME directory clean, but also helps prevent careless users from deleting your database.

The installation process checks not only that the corresponding appdirs folder exists on your system but also that you have the permissions necessary to read from and write to that location. If these conditions are not met, you will be given instructions during the installation on how to create the folder and/or grant yourself the appropriate permissions.
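The kind of check described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual implementation: the function name `check_app_dir` is hypothetical, and a temporary directory stands in for the real appdirs folder.

```python
import os
import tempfile


def check_app_dir(path):
    """Report whether `path` exists and is readable and writable by the current user."""
    exists = os.path.isdir(path)
    return {
        "exists": exists,
        # os.access checks the permissions of the user running this process.
        "readable": exists and os.access(path, os.R_OK),
        "writable": exists and os.access(path, os.W_OK),
    }


# Demonstrate on a throwaway directory rather than the real appdirs location.
with tempfile.TemporaryDirectory() as tmp:
    present = check_app_dir(tmp)
    missing = check_app_dir(os.path.join(tmp, "does_not_exist"))

print(present)
print(missing)
```

When a check like this fails, the remedy differs by OS (icacls on Windows, chmod on POSIX), which is why the installer prints instructions instead of changing permissions silently.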

We have attempted to support both Windows (icacls permissions and backslashes, C:\) and POSIX systems including Mac and Linux (chmod letter permissions and forward slashes, /). Note: because the ordering of the appdirs author and app directories varies between operating systems, we do not use the appdirs appauthor directory, only the appname directory.

If you run into trouble with the installation process on your OS, please submit a GitHub issue so that we can attempt to resolve, document, and release a fix as quickly as possible.

Installation Location Based on OS
The location is the value returned by `import appdirs; appdirs.user_data_dir('pydatasci')`:

Mac:
/Users/Username/Library/Application Support/pydatasci

Linux - Alpine and Ubuntu:
/root/.local/share/pydatasci

Windows:
C:\Users\Username\AppData\Local\pydatasci
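To see where these paths come from, the following sketch approximates the per-user data folder for each OS without importing appdirs. It is a simplification under stated assumptions: the helper `approx_user_data_dir` is hypothetical, it ignores environment overrides such as XDG_DATA_HOME, and the real value should always be taken from appdirs itself.

```python
from pathlib import PurePosixPath, PureWindowsPath


def approx_user_data_dir(appname, system, home):
    """Approximate the per-user data folder for `appname` on a given OS.

    `system` is one of 'darwin', 'linux', 'windows'; `home` is the user's
    home directory. Environment overrides such as XDG_DATA_HOME are ignored.
    """
    if system == "darwin":
        return str(PurePosixPath(home) / "Library" / "Application Support" / appname)
    if system == "windows":
        return str(PureWindowsPath(home) / "AppData" / "Local" / appname)
    # Fall back to the XDG default used on Linux.
    return str(PurePosixPath(home) / ".local" / "share" / appname)


for system, home in [
    ("darwin", "/Users/Username"),
    ("linux", "/root"),
    ("windows", r"C:\Users\Username"),
]:
    print(system, approx_user_data_dir("pydatasci", system, home))
```

The Linux path above starts with /root only because the example was run as the root user; for a regular user it would sit under that user's home directory.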

create_db() is equivalent to a migration in Django or Rails in that it creates the tables defined by the object-relational mapping (ORM) models. We use the peewee ORM because it is simpler than SQLAlchemy, has good documentation, and is actively maintained (we saw a same-day GitHub response to an issue on a Saturday). With the addition of Dash-Plotly, this will make for a full-stack experience that also works directly in an IDE like Jupyter or VS Code.
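As a rough picture of what a migration-like step does, the sketch below creates tables only if they are missing. It uses the standard-library sqlite3 module as a stand-in so that it runs without dependencies; the project itself goes through peewee, and the table and column names here are hypothetical, not the package's real schema.

```python
import sqlite3

# Hypothetical, heavily simplified schema; the real tables are defined by the ORM models.
SCHEMA = {
    "dataset": "id INTEGER PRIMARY KEY, name TEXT, file_format TEXT, data BLOB",
    "label": "id INTEGER PRIMARY KEY, dataset_id INTEGER REFERENCES dataset(id), column_name TEXT",
}


def create_tables(conn):
    """Create every table in SCHEMA if it does not already exist."""
    for table, columns in SCHEMA.items():
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
    conn.commit()
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
tables = create_tables(conn)
tables_again = create_tables(conn)  # Running it twice is harmless.
print(tables)
```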

Deleting & Recreating the Database

After deleting the database, you need to either reload the aidb module or restart the Python shell before you can recreate the database.

>>> from pydatasci import aidb
>>> aidb.delete_db(True)
>>> from importlib import reload
>>> reload(aidb)
>>> aidb.create_db()

Usage

If you've already completed the Installation section above, let's get started.

import pydatasci as pds
from pydatasci import aidb

1. Add a `Dataset`.

Supported tabular file formats include: CSV, TSV, Apache Parquet. At this point, the project's support for Parquet is extremely minimal.

The bytes of the file will be stored as a BlobField in the SQLite database file. Storing the data in the database not only (a) provides an entity that we can use to keep track of experiments and link relational data to but also (b) makes the data less mutable than keeping it in the open filesystem.

dataset = aidb.Dataset.create_from_file(
	path = 'iris.tsv'
	,file_format = 'tsv'
	,name = 'tab-separated plants'
	,perform_gzip = True
)

You can choose whether to gzip-compress the file on import by passing a boolean to the perform_gzip parameter. This compression not only enables you to store up to 90% more data on your local machine, but also helps you stay under the maximum BlobField size of 2.147 GB. We handle the zipping and unzipping on the fly, so you won't even notice it.
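The round trip of compressing bytes, storing them as a blob, and decompressing on read can be sketched as follows. This is an illustration only: it uses sqlite3 directly instead of peewee, and the table layout and the `is_compressed` flag are assumptions made for the example, not the package's schema.

```python
import gzip
import sqlite3

# Stand-in for the bytes of a TSV file read from disk.
raw = ("sepal_length\tspecies\n" + "5.1\tsetosa\n" * 1000).encode("utf-8")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, is_compressed INTEGER, data BLOB)")

# Compress before writing, and remember that we did so.
blob = gzip.compress(raw)
conn.execute("INSERT INTO dataset (is_compressed, data) VALUES (?, ?)", (1, blob))

# Decompress on the way out only if the row was stored compressed.
is_compressed, stored = conn.execute(
    "SELECT is_compressed, data FROM dataset WHERE id = 1"
).fetchone()
restored = gzip.decompress(stored) if is_compressed else stored

print(len(raw), len(blob), restored == raw)
```

How much space is saved depends entirely on the data; the highly repetitive sample here compresses far better than a typical real dataset would.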

Fetch a `Dataset`.

Supported in-memory formats include: NumPy Structured Array and Pandas DataFrame.

Pandas

df = dataset.read_to_pandas()
df.head()

df2 = aidb.Dataset.read_to_pandas(id = 1)
df2.head()

NumPy

arr = dataset.read_to_numpy()
arr[:4]

arr2 = aidb.Dataset.read_to_numpy(id = 1)
arr2[:4]

We chose structured array because it keeps track of column names. For the sake of simplicity, we are reading into NumPy via Pandas. That way, if we want to revert to a simpler ndarray in the future, then we won't have to rewrite the function to read NumPy.
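One way to go from Pandas to a structured array that keeps column names is `DataFrame.to_records`. Whether the package uses this exact call is an assumption; the sketch only shows that the approach preserves the names.

```python
import io

import pandas as pd

# A tiny in-memory TSV standing in for the stored file bytes.
tsv = "sepal_length\tspecies\n5.1\tsetosa\n7.0\tversicolor\n"

df = pd.read_csv(io.StringIO(tsv), sep="\t")

# to_records returns a NumPy record array, which is a structured array
# whose fields are named after the DataFrame columns.
arr = df.to_records(index=False)

print(arr.dtype.names)
print(arr["species"][:2])
```

Fields can then be addressed by name, e.g. `arr["sepal_length"]`, which a plain ndarray would not allow.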

2. Create a `Label` if you want to perform supervised learning (aka predict a specific column).

From a Dataset, pick the column that you want to train against and predict. If you are planning on training an unsupervised model, then you don't need to do this.

label = aidb.Label.create_from_dataset(
	dataset_id = 1
	,column_name = 'species'
)

3. Derive a `Featureset` of columns from a Dataset.

This won't duplicate your data. It simply records the columns to be used in training.

a) For supervised learning, be sure to pass in the `Label` you want to predict.

supervised_bruteforce = aidb.Featureset.create_all_columns(
	dataset_id = 1
	,label_id = 1
)

supervised_selective = aidb.Featureset.create_from_dataset_columns(
	dataset_id = 1
	,label_id = 1
	,columns = ['petal_width', 'petal_length']
)

b) For unsupervised learning (aka studying variance within a `Dataset`), leave the `Label` blank.

Feature selection is about finding out which columns in your data are most informative. In performing feature engineering, a data scientist reduces the dimensionality of the data by determining the effect each feature has on the variance of the data. This makes for simpler models in the form of faster training and reduces overfitting by making the model more generalizable to future data.

unsupervised_bruteforce = aidb.Featureset.create_all_columns(
	dataset_id = 1
)

unsupervised_selective = aidb.Featureset.create_from_dataset_columns(
	dataset_id = 1
	,columns = ['petal_width', 'petal_length', 'sepal_length']
)
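The idea that a featureset records column names without copying data can be sketched in plain Python. The helper `make_featureset` below is hypothetical and is not the package's API; it validates the requested names against the dataset header and applies them only when rows are read.

```python
import csv
import io

tsv = (
    "sepal_length\tpetal_length\tpetal_width\tspecies\n"
    "5.1\t1.4\t0.2\tsetosa\n"
    "7.0\t4.7\t1.4\tversicolor\n"
)


def make_featureset(header, columns=None, label=None):
    """Return the list of feature column names, without touching the data."""
    if columns is None:
        # "All columns" means everything except the label, if there is one.
        columns = [c for c in header if c != label]
    unknown = [c for c in columns if c not in header]
    if unknown:
        raise ValueError(f"Columns not in dataset: {unknown}")
    if label in columns:
        raise ValueError("The label cannot also be a feature.")
    # Drop accidental duplicates while preserving order.
    return list(dict.fromkeys(columns))


reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
header = reader.fieldnames

supervised_all = make_featureset(header, label="species")
unsupervised_some = make_featureset(header, columns=["petal_width", "petal_width", "sepal_length"])

# The recorded names are applied only when the rows are actually read.
features = [[row[c] for c in unsupervised_some] for row in reader]
print(supervised_all)
print(unsupervised_some)
print(features)
```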

4. Split the `Dataset` rows into `Splitsets` based on how you want to train, test, and validate your models.

a) One set containing train-test splits.

b) One set containing train-validate-test splits.

c) k-fold sets containing train-test splits.

d) k-fold sets containing train-validate-test splits.
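The four layouts above can be illustrated with row indices alone. This is a generic sketch using the standard library, not the package's Splitset implementation: the function names are hypothetical, and applying k-fold to the training portion only is one possible reading of cases c) and d).

```python
import random


def split_indices(n_rows, test=0.2, validate=0.0, seed=0):
    """Shuffle row indices and cut them into train / validate / test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test)
    n_val = round(n_rows * validate)
    return {
        "test": indices[:n_test],
        "validate": indices[n_test:n_test + n_val],
        "train": indices[n_test + n_val:],
    }


def k_fold(train_indices, k):
    """Yield (fold_train, fold_held_out) pairs over the training indices only."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fold_train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fold_train, held_out


splits = split_indices(150, test=0.2, validate=0.1, seed=42)
folds = list(k_fold(splits["train"], k=5))
print({name: len(rows) for name, rows in splits.items()})
print([(len(tr), len(ho)) for tr, ho in folds])
```

Leaving `validate` at 0.0 gives the two-way split of case a); a non-zero value gives case b). Fixing the seed keeps the split reproducible across runs.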

5. Create an `Algorithm` aka model to fit to your splits.

6. Create combinations of `Hyperparamsets` for your algorithms.
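Generating every combination of hyperparameter values is commonly done with a Cartesian product. The sketch below shows that general technique; the parameter names are illustrative and `expand_grid` is a hypothetical helper, not part of the package.

```python
from itertools import product

# Hypothetical search space; the parameter names are illustrative only.
search_space = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [16, 32, 64],
    "activation": ["relu"],
}


def expand_grid(space):
    """Return one dict per combination of the listed values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


combinations = expand_grid(search_space)
print(len(combinations))
print(combinations[0])
```

The number of combinations is the product of the list lengths (2 x 3 x 1 here), so a search space grows quickly as values are added.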

7. Create a `Batch` of `Job`s to keep track of training.

PyPI Package

Steps to Build & Upload

$ pyenv activate pydatasci
$ pip3 install --upgrade wheel twine
$ python3 setup.py sdist bdist_wheel
$ python3 -m twine upload --repository pypi dist/*
$ rm -r build dist pydatasci.egg-info
# proactively update the version number in setup.py next time
$ pip install --upgrade pydatasci; pip install --upgrade pydatasci


Release history

0.0.61

Nov 22, 2020

0.0.60

Nov 22, 2020

0.0.59

Nov 22, 2020

0.0.58

Nov 22, 2020

0.0.57

Nov 20, 2020

0.0.56

Nov 18, 2020

0.0.55

Nov 18, 2020

0.0.54

Nov 12, 2020

0.0.53

Nov 12, 2020

0.0.52

Nov 5, 2020

0.0.51

Oct 26, 2020

This version

0.0.50

Oct 5, 2020

0.0.49

Oct 5, 2020

0.0.48

Sep 29, 2020

0.0.47

Sep 29, 2020

0.0.46

Sep 29, 2020

0.0.45

Sep 29, 2020

0.0.44

Sep 23, 2020

0.0.43

Sep 23, 2020

0.0.42

Sep 23, 2020

0.0.41

Sep 23, 2020

0.0.39

Sep 23, 2020

0.0.38

Sep 23, 2020

0.0.37

Sep 23, 2020

0.0.36

Sep 23, 2020

0.0.35

Sep 23, 2020

0.0.34

Sep 23, 2020

0.0.33

Sep 23, 2020

0.0.32

Sep 23, 2020

0.0.31

Sep 23, 2020

0.0.30

Sep 23, 2020

0.0.29

Sep 23, 2020

0.0.28

Sep 23, 2020

0.0.27

Sep 23, 2020

0.0.26

Sep 23, 2020

0.0.25

Sep 23, 2020

0.0.24

Sep 23, 2020

0.0.23

Sep 22, 2020

0.0.22

Sep 22, 2020

0.0.21

Sep 22, 2020

0.0.20

Sep 22, 2020

0.0.19

Sep 22, 2020

0.0.18

Sep 22, 2020

0.0.17

Sep 14, 2020

0.0.16

Sep 13, 2020

0.0.15

Sep 13, 2020

0.0.13

Sep 12, 2020

0.0.12

Sep 12, 2020

0.0.11

Sep 12, 2020

0.0.10

Sep 12, 2020

0.0.9

Sep 11, 2020

0.0.8

Sep 11, 2020

0.0.6

Sep 11, 2020

0.0.5

Sep 11, 2020

0.0.4

Sep 11, 2020

0.0.3

Sep 11, 2020

0.0.2

Sep 11, 2020

0.0.1

Sep 10, 2020

Download files


Source Distribution

pydatasci-0.0.50.tar.gz (17.3 kB)

Uploaded Oct 5, 2020 Source

Built Distribution


pydatasci-0.0.50-py3-none-any.whl (25.2 kB)

Uploaded Oct 5, 2020 Python 3

File details

Details for the file pydatasci-0.0.50.tar.gz.

File metadata

Download URL: pydatasci-0.0.50.tar.gz
Upload date: Oct 5, 2020
Size: 17.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.6

File hashes

Hashes for pydatasci-0.0.50.tar.gz
Algorithm	Hash digest
SHA256	`8e783cbb873d54465a6ec6c0e3d1d98b23073f1b969caaad5e66742fb189a472`
MD5	`962c9129f73d4957c3af91de832ff57c`
BLAKE2b-256	`bfff641e7d2578d24902fc12816dff2b35efca86fd2ae10e4a34c4b39a9732e5`


File details

Details for the file pydatasci-0.0.50-py3-none-any.whl.

File metadata

Download URL: pydatasci-0.0.50-py3-none-any.whl
Upload date: Oct 5, 2020
Size: 25.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.6

File hashes

Hashes for pydatasci-0.0.50-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9bab152bf7ea090c264a1cb97a2438c972f7416e94a6d0d5231e338f6e21a65f`
MD5	`e4b01d01eef1441733a042d05c990dba`
BLAKE2b-256	`b9a617e692a5f4e5922af59d29b6b839021846cdea88fbbed88528902ffafbdb`
