Small language model.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

yedivanseven

These details have not been verified by PyPI

Project description

slangmod

small language model

Ever wondered how large language models (LLMs) like ChatGPT, Claude, Llama, DeepSeek, etc., actually work, like, really work? I did. And I figured there is only one way to find out: Make one yourself. From scratch.

Of course, I wasn't expecting to beat the big players at their own game, but I wanted to know what you can do on consumer hardware (meaning a state-of-the-art gaming PC with a single graphics card supported by PyTorch). So, naturally, it was going to be a small language model. These hardware limitations are reflected in software design choices. Specifically, slangmod does not employ any type of parallelization that would keep multiple GPUs busy at the same time, and all training data are loaded into CPU RAM at once, to be drip-fed to the model on the GPU from there (1 billion tokens stored as 64-bit integers take up 8 GB, or about 7.5 GiB).
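The memory figure quoted above follows from simple arithmetic: each token is stored as one 64-bit integer, i.e., 8 bytes. The following is a minimal sketch of that calculation; the function name `token_memory` is purely illustrative and not part of slangmod. It shows that 1 billion tokens come to 8 GB in decimal units, or about 7.45 GiB in binary units, which is where a figure of roughly 7.5 comes from.

```python
BYTES_PER_TOKEN = 8  # one 64-bit integer per token


def token_memory(n_tokens: int) -> tuple[float, float]:
    """Return the size of n_tokens 64-bit integers as (GB, GiB)."""
    n_bytes = n_tokens * BYTES_PER_TOKEN
    return n_bytes / 1e9, n_bytes / 2**30


gb, gib = token_memory(1_000_000_000)
print(f"{gb:.2f} GB = {gib:.2f} GiB")  # prints "8.00 GB = 7.45 GiB"
```

Plugging in the size of your own corpus gives a quick estimate of whether it will fit into the CPU RAM of your machine.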

Having said that, slangmod provides everything you need to

preprocess and clean your text corpus;
choose and train one of the HuggingFace tokenizers;
specify a Transformer model including the type of positional encodings and the feedforward block;
train your model with a choice of optimizers and learning-rate schedulers, employing early stopping if you like;
monitor convergence and experiment on hyperparameters;
explore text-generation algorithms like top-k, top-p or beam search;
and, finally, chat with your model.

To do all these things, slangmod provides a command-line interface (CLI) with fine-grained configuration options on one hand, and the raw building blocks it is made of on the other hand. Leveraging the foundational functionalities provided by the swak package, any other workflow can thus be quickly coded up.
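To give a flavor of the text-generation algorithms mentioned in the list above, here is a minimal, pure-Python sketch of top-k and top-p filtering applied to a toy next-token distribution. This is not slangmod's implementation, and the function names `top_k_filter`, `top_p_filter`, and `sample` are made up for illustration. A real model would do the same thing on tensors of vocabulary size.

```python
import random


def _renormalize(probs: list[float], kept: set[int]) -> list[float]:
    """Zero out all probabilities not in `kept` and rescale the rest to sum to 1."""
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


def top_k_filter(probs: list[float], k: int) -> list[float]:
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs: list[float], p: float) -> list[float]:
    """Keep the smallest set of most probable tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, kept)


def sample(probs: list[float], rng: random.Random) -> int:
    """Draw one token index according to the (filtered) probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
print([round(q, 3) for q in top_k_filter(toy, k=2)])    # [0.625, 0.375, 0.0, 0.0]
print([round(q, 3) for q in top_p_filter(toy, p=0.7)])  # [0.625, 0.375, 0.0, 0.0]
print(sample(top_k_filter(toy, k=2), random.Random(0)))  # index 0 or 1
```

Both filters cut off the unlikely tail of the distribution before sampling. Top-k does so with a fixed number of candidates, top-p with a fixed amount of probability mass.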

Installation

Python package

Create a new virtual environment running at least Python 3.12.
The easiest way of installing slangmod is from the Python Package Index (PyPI), where it is hosted. Simply type
```
pip install slangmod
```
or treat it like any other Python package in your dependency management.
While it is, in principle, possible to run slangmod on the CPU, this is only intended for debugging purposes. To get any results in finite time, you also need a decent graphics card, and you must have a working installation of PyTorch to make good use of it. Because there is no way of knowing which version of CUDA (or ROCm) you have installed on your machine and how you installed it, PyTorch is not an explicit dependency of slangmod. You will have to install it yourself, e.g., following these instructions. If you are using pipenv for dependency management, you can also have a look at the Pipfile in the root of the slangmod repository and tailor it to your needs. Personally, I go
```
pipenv sync --categories=cpu
```
for a CPU-only installation of PyTorch (for debugging only) and
```
pipenv sync --categories=cuda
```
if I want GPU support.
Finally, with the virtual environment you just created active, open a console and type
```
slangmod -h
```
to check that everything works.
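Since PyTorch has to be installed separately, it can also be worth checking from within Python that it is present and can see your graphics card. The sketch below is not part of slangmod, and the helper name `describe_device` is made up. It only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, which ROCm builds of PyTorch expose as well.

```python
import importlib.util


def describe_device() -> str:
    """Report whether PyTorch is installed and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch  # imported lazily so the check above works without it

    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "PyTorch installed, CPU only"


print(describe_device())
```

If this reports a CPU-only installation even though you have a supported graphics card, the most likely cause is a PyTorch build that does not match your driver stack.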

Docker image

A Docker image with GPU-enabled PyTorch and all other dependencies inside is available on Docker Hub.

docker pull yedivanseven/slangmod

To use it, you must have a host machine that

has an NVIDIA GPU,
has the drivers for it installed, and
exposes it via the container toolkit.

Change into a working directory, i.e., the one from which slangmod will read its config file slangmod.toml and to which it will save its outputs, and mount this directory to the path /workdir inside the container when you run it.

docker run --rm -v ./:/workdir yedivanseven/slangmod

This will invoke slangmod -h.

In the event that you still want to clean your raw text with the help of slangmod, you will also have to mount the folder with those dirty files when you start a Docker container.

docker run --rm -v ./:/workdir -v /path/to/raw/docs:/raw yedivanseven/slangmod clean ...
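If you prefer to launch the container from a script, the two commands above can be assembled programmatically. The following is a small sketch under that assumption. The helper `docker_command` is hypothetical and merely reproduces the mounts shown above with absolute host paths, which sidesteps any question of how relative paths are handled.

```python
from pathlib import Path

IMAGE = "yedivanseven/slangmod"


def docker_command(workdir=".", raw_docs=None, args=()) -> list[str]:
    """Build the argument list for running slangmod inside the container.

    workdir  -- host folder mounted to /workdir (config file and outputs)
    raw_docs -- optional host folder with raw text, mounted to /raw
    args     -- arguments passed on to slangmod, e.g. ["clean"]
    """
    cmd = ["docker", "run", "--rm", "-v", f"{Path(workdir).resolve()}:/workdir"]
    if raw_docs is not None:
        cmd += ["-v", f"{Path(raw_docs).resolve()}:/raw"]
    return cmd + [IMAGE, *args]


# Only print the command here; pass the list to subprocess.run to execute it.
print(" ".join(docker_command(".", "/path/to/raw/docs", ["clean"])))
```

Without any extra arguments, the resulting command corresponds to the plain invocation that shows the help text.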

For all other command-line options and to find out about this config TOML file, refer to the ...

Documentation

The documentation for both the CLI and the API of slangmod is hosted on GitHub Pages.

Project details

Release history

0.1.4 (Feb 14, 2025)
0.1.3 (Feb 10, 2025)
0.1.2 (Feb 9, 2025)
0.1.1 (Feb 9, 2025)
0.1.0 (Feb 9, 2025)
0.0.9 (Feb 9, 2025)
0.0.8 (Feb 9, 2025)
0.0.7 (Feb 9, 2025), this version
0.0.6 (Feb 9, 2025)
0.0.5 (Feb 9, 2025)
0.0.4 (Feb 8, 2025)
0.0.3 (Feb 8, 2025)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

slangmod-0.0.7.tar.gz (57.0 kB)

Uploaded Feb 9, 2025 Source

Built Distribution

slangmod-0.0.7-py3-none-any.whl (84.8 kB)

Uploaded Feb 9, 2025 Python 3

File details

Details for the file slangmod-0.0.7.tar.gz.

File metadata

Download URL: slangmod-0.0.7.tar.gz
Upload date: Feb 9, 2025
Size: 57.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for slangmod-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`3fe78fe27bcc13c5f0ee04bc0a6063cad49ea5b90c652b9d52bd7af057917c6c`
MD5	`20720566e207385ebfab5a63b5c764a0`
BLAKE2b-256	`598048eb283fc2861e1fe991ed98735fc6c0832b1663b634bf2533b1cb22c196`

See more details on using hashes here.
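To check a downloaded file against the SHA256 digest listed above, something along the following lines should do. It is a minimal sketch using only the standard library. The function names are illustrative, and the file is assumed to sit in the current directory under its original name.

```python
import hashlib

# SHA256 digest of slangmod-0.0.7.tar.gz as listed above
EXPECTED_SHA256 = "3fe78fe27bcc13c5f0ee04bc0a6063cad49ea5b90c652b9d52bd7af057917c6c"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file at `path` has the expected SHA256 digest."""
    return sha256_of_file(path) == expected.lower()


# Usage, once the archive has been downloaded:
# print(matches("slangmod-0.0.7.tar.gz"))
```

Reading in chunks keeps memory usage flat regardless of file size, which hardly matters for a 57 kB archive but costs nothing either.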

Provenance

The following attestation bundles were made for slangmod-0.0.7.tar.gz:

Publisher: publish-package.yml on yedivanseven/slangmod

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slangmod-0.0.7.tar.gz
- Subject digest: 3fe78fe27bcc13c5f0ee04bc0a6063cad49ea5b90c652b9d52bd7af057917c6c
- Sigstore transparency entry: 169889179
- Sigstore integration time: Feb 9, 2025
Source repository:
- Permalink: yedivanseven/slangmod@1fac5d97391ea1c359d5d63666f5a43b600b21cc
- Branch / Tag: refs/tags/v0.0.7
- Owner: https://github.com/yedivanseven
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-package.yml@1fac5d97391ea1c359d5d63666f5a43b600b21cc
- Trigger Event: release

File details

Details for the file slangmod-0.0.7-py3-none-any.whl.

File metadata

Download URL: slangmod-0.0.7-py3-none-any.whl
Upload date: Feb 9, 2025
Size: 84.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for slangmod-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8ea84b6e44f43158d43d46dd24b73ae53728c014552a7be7537dc8f0d20d056f`
MD5	`98e36ead4546a7d2f989aff2157cf989`
BLAKE2b-256	`8c5b3f2d33552f7e7cb666c5dfdd810956fa986f42af400dc4c347bb8e5bb851`

See more details on using hashes here.

Provenance

The following attestation bundles were made for slangmod-0.0.7-py3-none-any.whl:

Publisher: publish-package.yml on yedivanseven/slangmod

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slangmod-0.0.7-py3-none-any.whl
- Subject digest: 8ea84b6e44f43158d43d46dd24b73ae53728c014552a7be7537dc8f0d20d056f
- Sigstore transparency entry: 169889181
- Sigstore integration time: Feb 9, 2025
Source repository:
- Permalink: yedivanseven/slangmod@1fac5d97391ea1c359d5d63666f5a43b600b21cc
- Branch / Tag: refs/tags/v0.0.7
- Owner: https://github.com/yedivanseven
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-package.yml@1fac5d97391ea1c359d5d63666f5a43b600b21cc
- Trigger Event: release

slangmod 0.0.7
