A statistics and machine learning package.

🔥 Ember

Ember is a statistics and ML library for my personal use, written in C++ and Python. I mainly built it for educational purposes, but it's quite functional and can be used to train models on several datasets.

Look here to see the methods it supports.

Installation

System Support

This library supports both x86_64/amd64 and arm64/aarch64. Check if your system is supported out of the box in the table below. The library requires very few dependencies, so as long as your machine has a C++ compiler and Python, you should be able to get it working by fiddling with the CMake and setuptools files.

x86_64 Python 3.13 Python 3.12 Python 3.11 Python 3.10 Python 3.9 Python 3.8 Python 3.7
Ubuntu 24.04
Ubuntu 22.04
Ubuntu 20.04
ArchLinux 6.6.68 LTS
Debian 13
Debian 12
Debian 11
Debian 10
LinuxMint 22
LinuxMint 21
MacOS 10.15 Catalina
MacOS 10.14 Mojave
MacOS 10.13 High Sierra
MacOS 10.12 Sierra
MacOS 10.11 El Capitan
MacOS 10.10 Yosemite
MacOS 10.9 Mavericks
MacOS 10.8 Mountain Lion
MacOS 10.7 Lion
Windows 11
Windows 10
Windows 8
Windows 7
ARM64 Python 3.13 Python 3.12 Python 3.11 Python 3.10 Python 3.9 Python 3.8 Python 3.7
Ubuntu 24.04
Ubuntu 22.04
Ubuntu 20.04
MacOS 15.x Sequoia
MacOS 14.x Sonoma
MacOS 13.x Ventura
MacOS 12.x Monterey
MacOS 11.x Big Sur
Windows 12

Compiling the aten Library

Your machine will need system dependencies such as CMake, a C++ compiler, and pybind11. The library uses C++17. Preferably you will have git and conda installed already. For more specific instructions on installing these on your system, refer to the more detailed installation guide.

Git clone the repo, then pip install, which will run setup.py.

git clone git@github.com:mbahng/pyember.git 
cd pyember 
pip install .

This runs cmake on aten/CMakeLists.txt, which calls the following.

  1. It always calls aten/src/CMakeLists.txt that compiles and links the source files in the C++ tensor library.
  2. If BUILD_PYTHON_BINDINGS=ON (the default), it also calls aten/bindings/CMakeLists.txt to generate a .so file that can be imported into ember.
  3. If BUILD_DEV=ON, it calls aten/test/CMakeLists.txt to further compile the C++ unit testing suite.

If there are problems with building, you should check, in order,

  1. Whether build/ has been created. This is the first step in setup.py.
  2. Whether main.cpp and, if BUILD_DEV=ON, the C++ unit test files have been compiled, i.e. whether the build/src/main and build/test/tests executables exist.
  3. Whether build/*/aten.cpython-3**-darwin.so exists (somewhere in the build directory, depending on the machine; the -darwin suffix is for macOS and differs on other platforms). It is produced by the Makefile generated by aten/bindings/CMakeLists.txt.
  4. The setup() function will immediately copy this .so file to ember/aten.cpython-3**-darwin.so. You should see a success message saying that it has been moved or an error. The .so file must live within ember, the actual library, since ember/__init__.py must access it within the same directory level.
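
The checklist above can be walked through with a small script. This is only a sketch: the function name check_build is made up for illustration, and it assumes it is run from the repository root with the paths named in the list. It is not part of Ember.

```python
from pathlib import Path


def check_build(repo_root="."):
    """Report which build artifacts exist, in the order of the checklist above."""
    root = Path(repo_root)
    build = root / "build"
    ember = root / "ember"
    return {
        "build/ directory": build.is_dir(),
        "build/src/main executable": (build / "src" / "main").is_file(),
        # Only expected when the C++ tests were built (BUILD_DEV=ON).
        "build/test/tests executable": (build / "test" / "tests").is_file(),
        # The shared object can sit anywhere under build/, so search recursively.
        "aten shared object under build/": build.is_dir()
        and any(build.rglob("aten.cpython-3*.so")),
        # setup.py is described as copying the shared object next to ember/__init__.py.
        "aten shared object copied into ember/": ember.is_dir()
        and any(ember.glob("aten.cpython-3*.so")),
    }


if __name__ == "__main__":
    for name, ok in check_build().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

The first MISSING line tells you which step of the build to look at.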

Testing and Development

The pip install accepts two additional environment variables. Note that the following command is whitespace-sensitive: there must be no spaces around either =.

CMAKE_DEBUG=1 CMAKE_DEV=1 pip install .
  1. Setting CMAKE_DEBUG=1 compiles the aten library with debug mode (-g) on, which I use when using gdb/lldb on the compiled code.
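  2. Setting CMAKE_DEV=1 compiles the C++ testing suite as well. If you want to do this, you will also need to install GoogleTest. A code snippet for Ubuntu and Debian is shown below.
# Install the GoogleTest sources, then build and install the static libraries.
sudo apt-get install libgtest-dev
cd /usr/src/gtest
sudo cmake CMakeLists.txt
sudo make
sudo cp lib/*.a /usr/lib
# Optional cleanup of the apt package lists (mainly useful in Docker images).
sudo rm -rf /var/lib/apt/lists/*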
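
To make the two environment variables concrete, here is a sketch of how a setup.py could turn them into CMake flags. The function name cmake_flags_from_env and the exact mapping are my assumptions for illustration; the real setup.py may do this differently. CMAKE_BUILD_TYPE is a standard CMake variable, and BUILD_DEV is the option named in the build steps.

```python
import os


def cmake_flags_from_env(env=None):
    """Translate CMAKE_DEBUG / CMAKE_DEV into CMake -D flags (assumed mapping)."""
    env = os.environ if env is None else env
    debug = env.get("CMAKE_DEBUG") == "1"
    dev = env.get("CMAKE_DEV") == "1"
    return [
        # A Debug build type is the usual CMake way to compile with -g.
        f"-DCMAKE_BUILD_TYPE={'Debug' if debug else 'Release'}",
        # BUILD_DEV=ON additionally builds the C++ unit tests in aten/test.
        f"-DBUILD_DEV={'ON' if dev else 'OFF'}",
    ]


print(cmake_flags_from_env({"CMAKE_DEBUG": "1", "CMAKE_DEV": "1"}))
```

Because the variables are read from the process environment, they have to be set on the same command line as pip install (or exported beforehand).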

If you would like to run tests and/or develop the package yourself, you can run the script ./run_tests.sh all (pass python instead of all to run just the Python tests, or cpp to run just the C++ tests), which will

  1. Run all C++ unit tests for aten, ensuring that all functions work correctly.
  2. Run all Python unit tests for ember, ensuring that additional functions work correctly and that the C++ functions are bound correctly.

The stub (.pyi) files for aten are located in ember/aten.

Repository Structure

I tried to model a lot of the structure on PyTorch and tinygrad. Very briefly,

  1. aten/ contains the header and source files for the C++ low-level tensor library, such as basic operations and an autograd engine.
    1. aten/src contains all the source files and definitions.
    2. aten/bindings contains the pybindings.
    3. aten/test contains all the C++ testing modules for aten.
  2. ember/ contains the actual library, supporting high level models, objectives, optimizers, dataloaders, and samplers.
    1. ember/aten contains the stub files.
    2. ember/datasets contains all preprocessing tools, such as datasets/loaders, standardizing, cross validation checks.
    3. ember/models contains all machine learning models.
    4. ember/objectives contains all loss functions and regularizers.
    5. ember/optimizers contains all the optimizers/solvers, such as iterative (e.g. SGD), greedy (e.g. decision tree splitting), and one-shot (e.g. least-squares solution).
    6. ember/samplers contains all samplers (e.g. MCMC, SGLD).
  3. docs/ contains detailed documentation about each function.
  4. examples/ contains example Python scripts for training models.
  5. tests/ contains the Python testing modules for the ember library.
  6. docker/ contains docker images of all the operating systems and architectures I tested ember on. General workflows on setting up the environment can be found there for supported machines.
  7. setup.py allows you to pip install this as a package.
  8. run_tests.sh is the main test-running script.

For a more detailed explanation, look here.

Getting Started

Ember Tensors and GradTensors

ember.Tensors represent data and parameters, while ember.GradTensors represent gradients. An advantage of this package is that rather than just supporting batch vector operations and matrix multiplications, we can also perform general contractions of rank $(N, M)$-tensors, a generalization of matrix multiplication. This allows us to represent and use higher-order derivatives of arbitrary functions $f: \mathbb{R}^{\mathbf{M}} \rightarrow \mathbb{R}^{\mathbf{N}}$, where $\mathbf{M} = (M_1, \ldots, M_m)$ and $\mathbf{N} = (N_1, \ldots, N_n)$ are vectors, not just scalars, giving the dimensions of each space.
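
To illustrate what a general contraction means, here is a sketch in NumPy (not Ember's API). It assumes the Jacobian of a function from $\mathbb{R}^{(3, 4)}$ to $\mathbb{R}^{(2, 5)}$ is stored with the output axes first, so its shape is (2, 5, 3, 4); that layout is my assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix multiplication is a contraction over one shared index.
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
assert np.allclose(np.tensordot(A, B, axes=1), A @ B)

# Jacobian of some f: R^(3, 4) -> R^(2, 5), stored as output axes + input axes.
J = rng.normal(size=(2, 5, 3, 4))
dx = rng.normal(size=(3, 4))  # a perturbation of the input

# Contract the trailing (3, 4) axes of J with dx; the result lives in R^(2, 5).
dy = np.tensordot(J, dx, axes=2)
print(dy.shape)  # (2, 5)

# The same contraction spelled out index by index.
assert np.allclose(dy, np.einsum("abij,ij->ab", J, dx))
```

With one input axis and one output axis this reduces to an ordinary matrix-vector product.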

Tensors are multidimensional arrays that can be initialized in a number of ways. GradTensors are initialized during the backpropagation method, but we can explicitly set them if desired.

import ember 

a = ember.Tensor([2]) # scalar
b = ember.Tensor([1, 2, 3])  # vector 
c = ember.Tensor([[1, 2], [3, 4]]) # matrix (rank-2 tensor)
d = ember.Tensor([[[1, 2]]]) # rank-3 tensor

Say that you have a series of elementary operations on tensors.

a = ember.Tensor([2, -3]) 
h = a ** 2
b = ember.Tensor([3, 5])

c = b * h

d = ember.Tensor([10, 1])
e = c.dot(d)

f = ember.Tensor([-2])

g = f * e

Automatic Differentiation

The C++ backend records the operations used to compute g as a directed acyclic graph (DAG). Running g.backprop() builds this DAG, computes the gradients, and returns a topological sorting of its nodes. The gradients themselves, which are technically Jacobian matrices, are stored on the tensors: each mapping x -> y constructs a gradient tensor on x with value dy/dx. With backprop(intermediate=False), the chain rule is not applied yet, so each gradient stays the local Jacobian of a single operation; with intermediate=True, the chain rule is applied to give the derivative of the tensor we called backprop on w.r.t. each of the other tensors. The values printed below are local Jacobians of this kind, e.g. a.grad is dh/da.

top_sort = g.backprop()
print(a.grad) # [[4.0, 0.0], [0.0, -6.0]]
print(h.grad) # [[3.0, 0.0], [0.0, 5.0]]
print(b.grad) # [[4.0, 0.0], [0.0, 9.0]]
print(c.grad) # [[10.0, 1.0]]
print(d.grad) # [[12.0, 45.0]]
print(e.grad) # [[-2.0]]
print(f.grad) # [[165.0]]
print(g.grad) # [[1.0]]
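
As a sanity check, the same numbers can be reproduced by hand in NumPy (a sketch independent of Ember). Each printed gradient is the local Jacobian of one operation, and multiplying them along the path from a to g applies the chain rule.

```python
import numpy as np

a = np.array([2.0, -3.0])
b = np.array([3.0, 5.0])
d = np.array([10.0, 1.0])
f = np.array([-2.0])

h = a ** 2  # [4, 9]
c = b * h   # [12, 45]
e = c @ d   # 165
g = f * e   # [-330]

# Local Jacobians, one per edge of the graph (they match the printout above).
dh_da = np.diag(2 * a)   # [[4, 0], [0, -6]]
dc_dh = np.diag(b)       # [[3, 0], [0, 5]]
de_dc = d.reshape(1, 2)  # [[10, 1]]
dg_de = f.reshape(1, 1)  # [[-2]]

# Chain rule: multiply the local Jacobians along a -> h -> c -> e -> g.
dg_da = dg_de @ de_dc @ dc_dh @ dh_da
print(dg_da)  # entries are -240 and 60
```

This agrees with differentiating g = -2 (30 a_1^2 + 5 a_2^2) directly, which gives (-120 a_1, -20 a_2) = (-240, 60).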

Finally, we can visualize this using the networkx package.
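
A minimal sketch of that graph, built by hand with networkx rather than extracted from Ember (the node names are just the variable names from the example):

```python
import networkx as nx

# Edges point from each input tensor to the tensor computed from it.
graph = nx.DiGraph()
graph.add_edges_from([
    ("a", "h"),              # h = a ** 2
    ("b", "c"), ("h", "c"),  # c = b * h
    ("c", "e"), ("d", "e"),  # e = c.dot(d)
    ("f", "g"), ("e", "g"),  # g = f * e
])

assert nx.is_directed_acyclic_graph(graph)

# A topological order lists every tensor after all of its inputs.
order = list(nx.topological_sort(graph))
print(order)

# Reversing it gives an order that starts from g and walks back to the leaves.
print(list(reversed(order)))
```

With matplotlib installed, nx.draw(graph, with_labels=True) renders the graph.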

Linear Regression

To perform linear regression, use the LinearRegression model.

import ember 

ds = ember.datasets.LinearDataset(N=20, D=14)
dl = ember.datasets.Dataloader(ds, batch_size=2)
model = ember.models.LinearRegression(15) 
mse = ember.objectives.MSELoss()

for epoch in range(500): 
  loss = None
  for x, y in dl: 
    y_ = model.forward(x)  
    loss = mse(y, y_)
    loss.backprop()
    model.step(1e-5) 

  print(loss)
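
For reference, here is roughly what such a training loop does numerically, written in plain NumPy. This is a sketch of minibatch SGD on a mean squared error, not Ember's implementation; the synthetic data, the learning rate of 1e-2, and the explicit bias term are my choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 14
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.5  # noiseless linear targets with bias 0.5

w, bias = np.zeros(D), 0.0
lr, batch_size = 1e-2, 2


def mse(X, y, w, bias):
    return np.mean((X @ w + bias - y) ** 2)


initial_loss = mse(X, y, w, bias)
for epoch in range(500):
    for start in range(0, N, batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        residual = xb @ w + bias - yb           # forward pass and error
        grad_w = 2 * xb.T @ residual / len(yb)  # d(MSE)/dw on this batch
        grad_b = 2 * residual.mean()            # d(MSE)/d(bias) on this batch
        w -= lr * grad_w                        # the gradient step
        bias -= lr * grad_b

final_loss = mse(X, y, w, bias)
print(initial_loss, final_loss)
```

The loss over the full dataset should end well below where it started.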

K Nearest Neighbors

To fit a simple K Nearest Neighbors regressor, use the following model. The forward method scans over the whole dataset, so we must pass the dataset to the model at instantiation. Note that we do not need a dataloader or a backpropagation step since we aren't iteratively updating parameters; we only compute the loss to compare values of K.

import ember
from ember.models import KNearestRegressor
from ember.datasets import LinearDataset

ds = LinearDataset(N=20, D=3)
model = KNearestRegressor(dataset=ds, K=1)
mse = ember.objectives.MSELoss() 

for k in range(1, 21): # hyperparameter tuning
  model.K = k
  print(f"{k} ===") 
  loss = 0
  for i in range(len(ds)): 
    x, y = ds[i] 
    y_ = model.forward(x) 
    loss = loss + mse(y, y_) 

  print(loss)
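
The idea behind the forward pass can be sketched in a few lines of NumPy. This is an illustration, not Ember's code: knn_predict is a made-up name, and Euclidean distance with a plain mean over the neighbors' targets are my assumptions.

```python
import numpy as np


def knn_predict(X_train, y_train, x, k):
    """Predict the target at x as the mean target of its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)  # scan the whole dataset
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    return y_train[nearest].mean()


X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 2.0, 4.0, 6.0])

print(knn_predict(X_train, y_train, np.array([1.1]), k=1))  # 2.0
print(knn_predict(X_train, y_train, np.array([1.1]), k=2))  # 3.0
```

In this sketch, a query that is itself a training point is its own nearest neighbor, so evaluating on the training set with k=1 gives zero error as long as there are no duplicate inputs.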

Multilayer Perceptrons

To instantiate an MLP, just call it from models. Here we make a 2-layer MLP with a dummy dataset. For now only SGD with batch size 1 is supported.

import ember 

ds = ember.datasets.LinearDataset(N=20, D=14)
dl = ember.datasets.Dataloader(ds, batch_size=2)
model = ember.models.MultiLayerPerceptron(15, 10) 
mse = ember.objectives.MSELoss()

for epoch in range(500):  
  loss = None
  for x, y in dl: 
    y_ = model.forward(x) 
    loss = mse(y, y_)
    loss.backprop() 
    model.step(1e-5)

  print(loss)

Its output over 1 minute of training:

LOSS = 256733.64437981808
LOSS = 203239.08846901066
LOSS = 160223.4554735339
LOSS = 125704.33716141782
LOSS = 98074.96981384761
LOSS = 76026.19871949886
LOSS = 58491.92389906721
LOSS = 44604.493032865605
LOSS = 33658.23285350788
LOSS = 25079.638682869212
LOSS = 18403.01062298029
LOSS = 13250.54496118543
LOSS = 9316.069468116035
LOSS = 6351.758695807299
LOSS = 4157.286052245369
LOSS = 2570.96819208677
LOSS = 1462.5380952427417
LOSS = 727.2493587808174
LOSS = 281.0683664354656
LOSS = 56.75530418715159

Datasets

Models and Training

Monte Carlo Samplers

Contributing

To implement new functionality in the aten library, you must

  1. Add the class or function header in aten/src/Tensor.h
  2. Add the implementation in the correct file (or create a new one) in aten./*Tensor/*.cpp. Make sure to update aten/bindings/CMakeLists.txt if needed.
  3. Add its pybindings (if a public function that will be used in ember) in aten/bindings/*bindings.cpp. Make sure to update aten/bindings/CMakeLists.txt if needed.
  4. Add relevant C++ tests in aten/test/.
  5. Not necessary, but it's good to test it out on a personal script for a sanity check.
  6. Add to the stub files in ember/aten/*.pyi.
  7. Add Python tests in tests/.
  8. If everything passes, you can submit a pull request.
