A collection of tools for low-resource indie machine learning development
A collection of machine learning tools for low-resource research and experiments
Description
pip install ml-indie-tools
This module contains a collection of tools usable by researchers who have limited access to compute resources and who switch between laptops, Colab instances and local workstations with a graphics card.
env_tools checks the current environment, and populates a number of flags that allow identification of run-time
environment and available accelerator hardware. For Colab instances, it provides tools to mount Google Drive for
persistent data- and model-storage.
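As a rough illustration of what such environment flags can look like, here is a minimal sketch that uses only the Python standard library. It is not the env_tools implementation: the function name `detect_environment` and the dictionary keys are hypothetical, and checking `sys.modules` for `google.colab` is just a commonly used heuristic for detecting Colab.

```python
import importlib.util
import platform
import sys


def detect_environment():
    """Collect simple run-time flags using only the standard library.

    Hypothetical helper for illustration; not part of ml_indie_tools.
    """
    def installed(module_name):
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(module_name) is not None

    return {
        "os": platform.system(),             # e.g. 'Linux' or 'Darwin'
        "python": platform.python_version(),
        # Heuristic: Colab kernels have google.colab loaded already.
        "is_colab": "google.colab" in sys.modules,
        # Heuristic: Jupyter kernels have ipykernel loaded.
        "is_notebook": "ipykernel" in sys.modules,
        "has_tensorflow": installed("tensorflow"),
        "has_torch": installed("torch"),
        "has_jax": installed("jax"),
    }


flags = detect_environment()
print(flags)
```

This sketch only reports which frameworks are importable; it does not probe for GPU or TPU hardware, which requires framework-specific calls.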
The usage scenarios are:
| Env | Tensorflow TPU | Tensorflow GPU | Pytorch TPU | Pytorch GPU | Jax TPU | Jax GPU |
|---|---|---|---|---|---|---|
| Colab | x | x | / | x | x | x |
| Workstation with Nvidia | / | x | / | x | / | x |
| Apple Silicon | / | x | / | / | / | / |
(x: supported, /: not supported)
Gutenberg_Dataset and Text_Dataset are NLP libraries that provide text data and can be used in conjunction
with Huggingface Datasets or directly with ML libraries.
ALU_Dataset is a toy-dataset that allows training of integer arithmetic and logical (ALU) operations.
env_tools
A collection of tools that allow moving machine learning projects between local hardware and Colab instances.
Examples
Local laptop:
from ml_indie_tools.env_tools import MLEnv
ml_env = MLEnv(platform='tf', accelerator='fastest')
ml_env.describe() # -> 'OS: Darwin, Python: 3.9.9 (Conda) Tensorflow: 2.7.0, GPU: METAL'
ml_env.is_gpu # -> True
ml_env.is_tensorflow # -> True
ml_env.gpu_type # -> 'METAL'
Colab instance:
# !pip install -U ml_indie_tools
from ml_indie_tools.env_tools import MLEnv
ml_env = MLEnv(platform='tf', accelerator='fastest')
print(ml_env.describe())
print(ml_env.gpu_type)
Output:
DEBUG:MLEnv:Tensorflow version: 2.7.0
DEBUG:MLEnv:GPU available
DEBUG:MLEnv:You are on a Jupyter instance.
DEBUG:MLEnv:You are on a Colab instance.
INFO:MLEnv:OS: Linux, Python: 3.7.12, Colab Jupyter Notebook Tensorflow: 2.7.0, GPU: Tesla K80
The tensorboard extension is already loaded. To reload it, use:
%reload_ext tensorboard
OS: Linux, Python: 3.7.12, Colab Jupyter Notebook Tensorflow: 2.7.0, GPU: Tesla K80
Tesla K80
Project paths
ml_env.init_paths('my_project', 'my_model') returns a tuple of paths that are adapted for local or Colab usage.
Local project:
ml_env.init_paths("my_project", "my_model") # -> ('.', '.', './model/my_model', './data', './logs')
The tuple contains, in order: the root path and the project path (both are the current directory for local projects), the path used to save model and weights, the path for training data, and the path for logs.
On Colab instances, these paths (with the exception of ./logs) are placed on Google Drive:
On Google Colab:
# INFO:MLEnv:You will now be asked to authenticate Google Drive access in order to store training data (cache) and model state.
# INFO:MLEnv:Changes will only happen within Google Drive directory `My Drive/Colab Notebooks/<project-name>`.
# DEBUG:MLEnv:Root path: /content/drive/My Drive
# Mounted at /content/drive
('/content/drive/My Drive',
'/content/drive/My Drive/Colab Notebooks/my_project',
'/content/drive/My Drive/Colab Notebooks/my_project/model/my_model',
'/content/drive/My Drive/Colab Notebooks/my_project/data',
'./logs')
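The two outputs above follow a simple layout rule, which can be sketched as below. This is a reconstruction from the example outputs only, not the library's code: the function `sketch_paths`, its parameters and the variable names are hypothetical, and the Google Drive mount step is left out entirely.

```python
import posixpath


def sketch_paths(project_name, model_name, on_colab,
                 drive_root="/content/drive/My Drive"):
    """Reproduce the path layout from the examples (hypothetical helper)."""
    if on_colab:
        # On Colab, everything except the logs lives on Google Drive.
        root_path = drive_root
        project_path = posixpath.join(root_path, "Colab Notebooks", project_name)
    else:
        # Locally, root and project are both the current directory.
        root_path = "."
        project_path = "."
    model_path = posixpath.join(project_path, "model", model_name)
    data_path = posixpath.join(project_path, "data")
    log_path = "./logs"  # logs stay local in both cases
    return root_path, project_path, model_path, data_path, log_path


print(sketch_paths("my_project", "my_model", on_colab=False))
# -> ('.', '.', './model/my_model', './data', './logs')
```

`posixpath` is used instead of `os.path` so that the sketch produces forward slashes on every operating system, matching the outputs shown above.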
See the env_tools API documentation for details.
Gutenberg_Dataset
Gutenberg_Dataset makes books from Project Gutenberg available as a dataset.
This module can either work with a local mirror of Project Gutenberg, or download files on demand. Files that are downloaded are cached to prevent unnecessary load on Gutenberg's servers.
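The download-then-cache idea can be sketched in a few lines. This is a generic illustration, not how Gutenberg_Dataset stores its cache: `cached_fetch`, the SHA-256 file naming and the injected `fetch` callable are all assumptions, and the fetch function is faked here so that the example runs without network access.

```python
import hashlib
import os
import tempfile


def cached_fetch(url, cache_dir, fetch):
    """Return the text for url, calling fetch(url) only on a cache miss.

    Hypothetical helper; the cache file name is the SHA-256 of the URL.
    """
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".txt")
    if os.path.exists(cache_file):
        # Cache hit: no request is sent to the server.
        with open(cache_file, "r", encoding="utf-8") as f:
            return f.read()
    text = fetch(url)
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(text)
    return text


calls = []


def fake_fetch(url):
    # Stand-in for a real download; records each call.
    calls.append(url)
    return "book text for " + url


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.org/book.txt", tmp, fake_fetch)
    second = cached_fetch("https://example.org/book.txt", tmp, fake_fetch)

print(len(calls))  # -> 1, the second call was served from the cache
```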
Working with a local mirror of Project Gutenberg
If you plan to use a lot of files (hundreds or more) from Gutenberg, a local mirror might be the best solution. Have a look at Project Gutenberg's notes on mirrors.
A mirror image suitable for this project can be made with:
rsync -zarv --dry-run --prune-empty-dirs --del --include="*/" --include='*.'{txt,pdf,ALL} --exclude="*" aleph.gutenberg.org::gutenberg ./gutenberg_mirror
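It's not mandatory to include pdf-files, since they are currently not used. Note that the command above contains the --dry-run flag, which makes rsync only list what it would transfer; remove the flag to actually create the mirror.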
Once a mirror of at least all of Gutenberg's *.txt files and of index-file GUTINDEX.ALL has been generated, it can be used via:
from ml_indie_tools.Gutenberg_Dataset import Gutenberg_Dataset
gd = Gutenberg_Dataset(root_url='./gutenberg_mirror') # Assuming this is the file-path to the mirror image
Working without a local mirror
from ml_indie_tools.Gutenberg_Dataset import Gutenberg_Dataset
gd = Gutenberg_Dataset() # the default Gutenberg site is used. Alternatively, specify a specific mirror with `root_url=http://...`.
Getting Gutenberg books
After using one of the two methods to instantiate the gd object:
gd.load_index() # load the index of books
Then search the index for books. search_result is a list of dictionaries containing meta-data, without the actual book text:
search_result = gd.search({'author': ['kant', 'goethe'], 'language': ['german', 'english']})
Insert the actual book text into the dictionaries. Note that the number of downloads is limited when using a remote server.
search_result = gd.insert_book_texts(search_result)
# search_result entries now contain an additional field `text` with the filtered text of the book.
import pandas as pd
df = pd.DataFrame(search_result) # Display results as Pandas DataFrame
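To make the shape of such a query concrete, here is a small stand-alone sketch of filtering a list of meta-data dictionaries with a criteria dict like the one passed to gd.search() above. The matching rules are an assumption (case-insensitive substring match, any listed value may match within a key, all keys must match) and may differ from what Gutenberg_Dataset actually does. The function `search_records`, the `title` key and the three hand-written sample records are illustrative only.

```python
def search_records(records, criteria):
    """Filter dict records by a {key: [accepted substrings]} criteria dict.

    Hypothetical helper with assumed semantics: a record matches when, for
    every key, at least one accepted substring occurs in the record's value.
    """
    matches = []
    for record in records:
        ok = True
        for key, wanted in criteria.items():
            value = str(record.get(key, "")).lower()
            if not any(w.lower() in value for w in wanted):
                ok = False
                break
        if ok:
            matches.append(record)
    return matches


# Hand-written sample records, not real index entries.
records = [
    {"title": "Kritik der reinen Vernunft", "author": "Immanuel Kant", "language": "German"},
    {"title": "Faust", "author": "Johann Wolfgang von Goethe", "language": "German"},
    {"title": "Moby Dick", "author": "Herman Melville", "language": "English"},
]

result = search_records(
    records,
    {"author": ["kant", "goethe"], "language": ["german", "english"]},
)
print([r["title"] for r in result])  # -> ['Kritik der reinen Vernunft', 'Faust']
```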
See the Gutenberg_Dataset API documentation for details.
Text_Dataset
See the Text_Dataset API documentation for details.
ALU_Dataset
See the ALU_Dataset API documentation for details. A sample project is at ALU_Net
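For readers unfamiliar with the idea, the following sketch shows one way a toy dataset of integer arithmetic and logical operations could be generated. It does not reflect the actual ALU_Dataset encoding: the operator set, the bit-level feature layout, the wrap-around of results to `bit_width` bits and all function names are assumptions chosen for illustration.

```python
import random

# Assumed operator set; the real dataset may support different operations.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}
OP_NAMES = sorted(OPS)


def to_bits(value, width):
    """Encode a non-negative integer as a list of bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def encode_sample(a, op, b, bit_width=8):
    """Return (features, label) for one operation as flat bit lists."""
    mask = (1 << bit_width) - 1
    # Results wrap around to bit_width bits (two's complement for negatives).
    result = OPS[op](a, b) & mask
    one_hot = [1 if name == op else 0 for name in OP_NAMES]
    features = to_bits(a, bit_width) + to_bits(b, bit_width) + one_hot
    return features, to_bits(result, bit_width)


def make_samples(count, bit_width=8, seed=0):
    """Generate `count` random samples, reproducible for a given seed."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        a = rng.randrange(1 << bit_width)
        b = rng.randrange(1 << bit_width)
        op = rng.choice(OP_NAMES)
        samples.append(encode_sample(a, op, b, bit_width))
    return samples


samples = make_samples(5)
features, label = encode_sample(3, "+", 4)
print(label)  # -> [1, 1, 1, 0, 0, 0, 0, 0], i.e. 7 with the least significant bit first
```

Each sample is a pair of flat bit lists, so it can be converted directly into tensors for any of the supported frameworks.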
keras_custom_layers
A collection of Keras residual- and self-attention layers
See the keras_custom_layers API documentation for details.
External projects
Check out the following Jupyter-notebook-based projects for example usage:
Text generation
Arithmetic and logic operations
History
- (2022-03-12, 0.1.0) First version for external use.
- (2021-12-26, 0.0.x) First pre-alpha versions published for testing purposes, not ready for use.