slangmod
Small language model.
Ever wondered how large language models (LLMs) like ChatGPT, Claude, Llama, DeepSeek, etc., actually work, like, really work? I did. And I figured there is only one way to find out: Make one yourself. From scratch.
Of course, I wasn't expecting to beat the big players at their own game,
but I wanted to know what you can do on consumer hardware (meaning a
state-of-the-art gaming PC with a single graphics card supported by
PyTorch). So, naturally, it was going to be a small
language model. These hardware limitations are reflected in software
design choices. Specifically, slangmod does not employ any type of
parallelization that would keep multiple GPUs busy at the same time, and all
training data are loaded into CPU RAM at once, to be drip-fed to the model
on the GPU from there (1 billion tokens stored as 64-bit integers take up
8 GB, or about 7.5 GiB).
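The memory figure above follows from simple arithmetic: one 64-bit integer occupies 8 bytes. A minimal sketch of that estimate, useful for checking whether a corpus fits into your RAM (the function name `token_memory` is made up for this illustration and is not part of slangmod):

```python
BYTES_PER_TOKEN = 8  # one 64-bit integer per token ID


def token_memory(n_tokens: int, bytes_per_token: int = BYTES_PER_TOKEN) -> tuple[float, float]:
    """Return the memory footprint as (decimal gigabytes, binary gibibytes)."""
    total_bytes = n_tokens * bytes_per_token
    return total_bytes / 10**9, total_bytes / 2**30


gb, gib = token_memory(1_000_000_000)
print(f"{gb:.2f} GB = {gib:.2f} GiB")  # prints "8.00 GB = 7.45 GiB"
```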
Having said that, slangmod provides everything you need to
- preprocess and clean your text corpus;
- choose and train one of the HuggingFace tokenizers;
- specify a Transformer model including the type of positional encodings and the feedforward block;
- train your model with a choice of optimizers and learning-rate schedulers, employing early stopping if you like;
- monitor convergence and experiment on hyperparameters;
- explore text-generation algorithms like top-k, top-p or beam search;
- and, finally, chat with your model.
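To give a feeling for what the text-generation algorithms in this list do, here is a minimal, pure-Python sketch of top-k and top-p (nucleus) filtering. It is not slangmod's implementation: all function names are invented for this illustration, and a real model would operate on PyTorch tensors rather than lists.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, kept):
    """Zero every token not in `kept` and rescale the rest to sum to one."""
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, kept)


def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax([2.0, 1.0, 0.5, -1.0])  # toy scores for a 4-token vocabulary
rng = random.Random(0)
token_k = sample(top_k_filter(probs, k=2), rng)    # one of tokens 0 or 1
token_p = sample(top_p_filter(probs, p=0.9), rng)  # one of tokens 0, 1 or 2
```

Top-k always keeps a fixed number of candidates, whereas top-p adapts the number of candidates to how peaked the distribution is.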
To do all these things, slangmod provides a command-line interface (CLI)
with fine-grained configuration options on one hand, and the raw building
blocks it is made of on the other hand. Leveraging the foundational
functionalities provided by the fiercely functional
swak package, any other workflow
can thus be quickly coded up.
Installation
- Create a new virtual environment running at least Python 3.12.
- The easiest way of installing slangmod is from the Python package index PyPI, where it is hosted. Simply type
  pip install slangmod
  or treat it like any other Python package in your dependency management.
- While it is, in principle, possible to run slangmod on the CPU, this is only intended for debugging purposes. To get any results in finite time, you also need a decent graphics card, and you must have a working installation of PyTorch to make good use of it. Because there is no way of knowing which version of CUDA (or ROCm) you have installed on your machine and how you installed it, PyTorch is not an explicit dependency of slangmod. You will have to install it yourself, e.g., following these instructions. If you are using pipenv for dependency management, you can also have a look at the Pipfile in the root of the slangmod repository and tailor it to your needs. Personally, I go
  pipenv sync --categories=cpu
  for a CPU-only installation of PyTorch and
  pipenv sync --categories=cuda
  if I want GPU support.
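Since PyTorch has to be installed separately, it can be worth checking what your environment actually provides before starting a training run. The following is a small sketch, not part of slangmod, and the helper name `describe_torch` is made up. It only uses standard PyTorch calls and degrades gracefully if PyTorch is missing:

```python
import importlib.util


def describe_torch() -> str:
    """Report whether PyTorch is importable and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch  # imported lazily so the check itself never fails

    if torch.cuda.is_available():
        # ROCm builds of PyTorch report their devices through torch.cuda too.
        return f"PyTorch {torch.__version__} with GPU: {torch.cuda.get_device_name(0)}"
    return f"PyTorch {torch.__version__}, CPU only"


print(describe_torch())
```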
Documentation
The documentation for both the CLI and the API of slangmod is hosted
on GitHub Pages.