A collection of tools for low-resource indie machine learning development
A collection of machine learning tools for low-resource research and experiments
Note: THIS LIBRARY IS UNFINISHED WORK-IN-PROGRESS
Description
pip install ml-indie-tools
This module contains a collection of tools for researchers who have limited access to compute resources and who switch between laptops, Colab instances, and local workstations with a graphics card.
env_tools
checks the current environment and populates a number of flags that identify the run-time
environment and the available accelerator hardware. For Colab instances, it provides tools to mount Google Drive for
persistent data and model storage.
The usage scenarios are:
Env | Tensorflow TPU | Tensorflow GPU | Pytorch TPU | Pytorch GPU | Jax TPU | Jax GPU |
---|---|---|---|---|---|---|
Colab | x | x | / | x | x | x |
Workstation with Nvidia | / | x | / | x | / | x |
Apple Silicon | / | x | / | / | / | / |
Gutenberg_Dataset
and Text_Dataset
are NLP libraries that provide text data and can be used in conjunction
with Huggingface Datasets or directly with ML libraries.
ALU_Dataset
is a toy-dataset that allows training of integer arithmetic and logical (ALU) operations.
env_tools
A collection of tools for moving machine learning projects between local hardware and Colab instances.
Examples
Local laptop:
from ml_indie_tools.env_tools import MLEnv
ml_env = MLEnv(platform='tf', accelerator='fastest')
ml_env.describe() # -> 'OS: Darwin, Python: 3.9.9 (Conda) Tensorflow: 2.7.0, GPU: METAL'
ml_env.is_gpu # -> True
ml_env.is_tensorflow # -> True
ml_env.gpu_type # -> 'METAL'
Colab instance:
# !pip install -U ml_indie_tools
from ml_indie_tools.env_tools import MLEnv
ml_env = MLEnv(platform='tf', accelerator='fastest')
print(ml_env.describe())
print(ml_env.gpu_type)
Output:
DEBUG:MLEnv:Tensorflow version: 2.7.0
DEBUG:MLEnv:GPU available
DEBUG:MLEnv:You are on a Jupyter instance.
DEBUG:MLEnv:You are on a Colab instance.
INFO:MLEnv:OS: Linux, Python: 3.7.12, Colab Jupyter Notebook Tensorflow: 2.7.0, GPU: Tesla K80
The tensorboard extension is already loaded. To reload it, use:
%reload_ext tensorboard
OS: Linux, Python: 3.7.12, Colab Jupyter Notebook Tensorflow: 2.7.0, GPU: Tesla K80
Tesla K80
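The flags above come from inspecting the running interpreter. As a rough illustration of the kind of checks involved, the following sketch uses only the standard library. The helper `sketch_describe_env` and its dictionary keys are hypothetical and are not part of `MLEnv`. It relies on the common idiom that a Colab kernel has `google.colab` loaded and a Jupyter kernel has `ipykernel` loaded.

```python
import platform
import sys


def sketch_describe_env():
    """Return a small dict of environment flags (illustrative, not MLEnv's code)."""
    return {
        "os": platform.system(),              # e.g. 'Linux' or 'Darwin'
        "python": platform.python_version(),  # e.g. '3.9.9'
        # Assumption: Colab kernels have the google.colab package loaded.
        "is_colab": "google.colab" in sys.modules,
        # Assumption: ipykernel is loaded when running inside a Jupyter kernel.
        "is_notebook": "ipykernel" in sys.modules,
    }


env = sketch_describe_env()
print(env)
```

Accelerator detection is left out here because it depends on the installed framework (Tensorflow, Pytorch, or Jax).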
Project paths
ml_env.init_paths('my_project', 'my_model')
returns a tuple of paths that are adapted for local or Colab usage.
Local project:
ml_env.init_paths("my_project", "my_model") # -> ('.', '.', './model/my_model', './data', './logs')
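The returned tuple contains, in order: the root path and the project path (both are the current directory for local projects), the model path for saving the model and weights, the data path for training data, and the log path for logs.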
Those paths (with the exception of ./logs) are moved to Google Drive for Colab instances:
On Google Colab:
# INFO:MLEnv:You will now be asked to authenticate Google Drive access in order to store training data (cache) and model state.
# INFO:MLEnv:Changes will only happen within Google Drive directory `My Drive/Colab Notebooks/<project-name>`.
# DEBUG:MLEnv:Root path: /content/drive/My Drive
# Mounted at /content/drive
('/content/drive/My Drive',
'/content/drive/My Drive/Colab Notebooks/my_project',
'/content/drive/My Drive/Colab Notebooks/my_project/model/my_model',
'/content/drive/My Drive/Colab Notebooks/my_project/data',
'./logs')
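The two layouts shown above follow a simple pattern, which the sketch below reproduces. The helper `sketch_paths` and its `on_colab` parameter are hypothetical. This is not how `init_paths` is implemented, and it does not mount Google Drive. It only rebuilds the path strings from the two documented outputs.

```python
import posixpath


def sketch_paths(project_name, model_name, on_colab):
    """Rebuild the documented path tuple (hypothetical helper, not library code)."""
    if on_colab:
        # Layout taken from the Colab output shown above.
        root = "/content/drive/My Drive"
        project = posixpath.join(root, "Colab Notebooks", project_name)
    else:
        # Local projects use the current directory for both entries.
        root = "."
        project = "."
    model = posixpath.join(project, "model", model_name)
    data = posixpath.join(project, "data")
    logs = "./logs"  # logs stay local in both cases
    return root, project, model, data, logs


print(sketch_paths("my_project", "my_model", on_colab=False))
print(sketch_paths("my_project", "my_model", on_colab=True))
```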
See the env_tools API documentation for details.
Gutenberg_Dataset
Gutenberg_Dataset makes books from Project Gutenberg available as a dataset.
This module can either work with a local mirror of Project Gutenberg, or download files on demand. Files that are downloaded are cached to prevent unnecessary load on Gutenberg's servers.
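The download-on-demand caching described above can be pictured with a minimal sketch. The function `cached_fetch`, its parameters, and the hashed file-name scheme are assumptions made for illustration. The library's actual cache layout may differ. The downloader is passed in as a function, so the example runs without network access.

```python
import hashlib
import os
import tempfile


def cached_fetch(url, cache_dir, fetch):
    """Return the content for `url`, calling `fetch(url)` only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL to obtain a filesystem-safe cache file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".txt")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    text = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Stand-in downloader that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "book text for " + url


cache_dir = tempfile.mkdtemp()
first = cached_fetch("https://example.org/book.txt", cache_dir, fake_fetch)
second = cached_fetch("https://example.org/book.txt", cache_dir, fake_fetch)
print(len(calls))  # the second request is served from the cache
```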
Working with a local mirror of Project Gutenberg
If you plan to use a lot of files (hundreds or more) from Gutenberg, a local mirror might be the best solution. Have a look at Project Gutenberg's notes on mirrors.
A mirror image suitable for this project can be made with:
rsync -zarv --dry-run --prune-empty-dirs --del --include="*/" --include='*.'{txt,pdf,ALL} --exclude="*" aleph.gutenberg.org::gutenberg ./gutenberg_mirror
It's not mandatory to include pdf files, since they are currently not used. Note the --dry-run flag: with it, rsync only lists what it would transfer, so remove it to actually create the mirror.
Once a mirror of at least all of Gutenberg's *.txt files and of the index file GUTINDEX.ALL has been generated, it can be used via:
from ml_indie_tools.Gutenberg_Dataset import Gutenberg_Dataset
gd = Gutenberg_Dataset(root_url='./gutenberg_mirror') # Assuming this is the file-path to the mirror image
Working without a local mirror
from ml_indie_tools.Gutenberg_Dataset import Gutenberg_Dataset
gd = Gutenberg_Dataset() # The default Gutenberg site is used. Alternatively, specify a mirror with `root_url='http://...'`.
Getting Gutenberg books
After using one of the two methods to instantiate the gd object:
gd.load_index() # load the index of books
Then search for books. The result, search_result, is a list of dictionaries, one per book, containing meta-data without the actual book text.
search_result = gd.search({'author': ['kant', 'goethe'], 'language': ['german', 'english']})
Insert the actual book text into the dictionaries. Note that download count is limited if using a remote server.
search_result = gd.insert_book_texts(search_result)
# search_result entries now contain an additional field `text` with the filtered text of the book.
import pandas as pd
df = pd.DataFrame(search_result) # Display results as Pandas DataFrame
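As a mental model for a query of this shape, the sketch below filters a list of meta-data dictionaries in plain Python. The helper `sketch_search` and its matching rules (case-insensitive substring match, any listed value may match, every key must match) are assumptions for illustration. The actual behaviour of `gd.search` is defined by the API documentation. The two sample records are made up for the example.

```python
def sketch_search(records, query):
    """Keep records where, for every query key, at least one listed value occurs in the field."""
    results = []
    for record in records:
        matches_all = True
        for key, wanted in query.items():
            field = str(record.get(key, "")).lower()
            if not any(value.lower() in field for value in wanted):
                matches_all = False
                break
        if matches_all:
            results.append(record)
    return results


# Hand-written sample meta-data, not taken from the Gutenberg index.
records = [
    {"author": "Immanuel Kant", "language": "German", "title": "Kritik der reinen Vernunft"},
    {"author": "Jane Austen", "language": "English", "title": "Emma"},
]
hits = sketch_search(records, {"author": ["kant", "goethe"], "language": ["german", "english"]})
print([r["title"] for r in hits])  # -> ['Kritik der reinen Vernunft']
```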
See the Gutenberg_Dataset API documentation for details.
Text_Dataset
See the Text_Dataset API documentation for details.
ALU_Dataset
See the ALU_Dataset API documentation for details.
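To give a feel for what a toy dataset of arithmetic and logical operations can look like, here is a small self-contained sketch. The names `OPS`, `to_bits`, and `make_samples`, the chosen operations, and the bit encoding are assumptions for illustration only. The format used by ALU_Dataset itself is described in its API documentation.

```python
import random

# Hypothetical set of operations; the library's own set may differ.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def to_bits(value, width):
    """Encode an integer as `width` bits, least significant bit first (two's complement wrap)."""
    value &= (1 << width) - 1
    return [(value >> i) & 1 for i in range(width)]


def make_samples(n, width=8, seed=0):
    """Generate n samples of (bits of a, operation name, bits of b, bits of result)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        a = rng.randrange(1 << width)
        b = rng.randrange(1 << width)
        op = rng.choice(sorted(OPS))
        result = OPS[op](a, b)
        # One extra bit holds the carry of an addition or the sign of a subtraction.
        samples.append((to_bits(a, width), op, to_bits(b, width), to_bits(result, width + 1)))
    return samples


for sample in make_samples(3):
    print(sample)
```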
keras_custom_layers
A collection of Keras residual- and self-attention layers
See the keras_custom_layers API documentation for details.
History
- (2021-12-26, 0.0.x) First pre-alpha versions published for testing purposes, not ready for use.