Fast, accurate language detection using static LLM embeddings

WordLlama Detect

WordLlama Detect is a WordLlama-like library focused on language identification. It identifies 148 languages with high accuracy and fast, CPU-only inference that depends only on NumPy. WordLlama Detect was trained from static token embeddings extracted from Gemma3-series LLMs.

Overview

Features:

NumPy-only inference with no PyTorch dependency
Pre-trained model covering 148 languages, 103 of which exceed 95% accuracy
Sparse lookup table (13 MB)
Fast inference: >70k texts/s on a single thread
Simple interface

Installation

pip install wldetect

Or install from source:

git clone https://github.com/dleemiller/WordLlamaDetect.git
cd WordLlamaDetect
uv sync

Quick Start

Python API

from wldetect import WLDetect

# Load bundled model (no path needed)
wld = WLDetect.load()

# Detect language for single text
lang, confidence = wld.predict("Hello, how are you today?")
# ('eng_Latn', 0.9564036726951599)
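
Since `predict` returns a `(language, confidence)` pair, callers may want to discard low-confidence results. The following is a minimal sketch, not part of the library: `detect_or_none`, `min_confidence`, and `fake_predict` are hypothetical names, and the 0.5 cutoff is an arbitrary illustration. The helper accepts any callable with the same return shape as `wld.predict`, so it runs without the package installed.

```python
from typing import Callable, Optional, Tuple

# Any callable shaped like wld.predict: text -> (language_code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def detect_or_none(
    predict: Predictor, text: str, min_confidence: float = 0.5
) -> Optional[str]:
    """Return the predicted language code, or None when confidence is too low.

    Hypothetical helper: the 0.5 default is an arbitrary illustration,
    not a threshold recommended by the library.
    """
    lang, confidence = predict(text)
    return lang if confidence >= min_confidence else None


def fake_predict(text: str) -> Tuple[str, float]:
    """Stand-in predictor so the example runs without wldetect installed."""
    return ("eng_Latn", 0.96) if text.strip() else ("eng_Latn", 0.10)


print(detect_or_none(fake_predict, "Hello, how are you today?"))  # eng_Latn
print(detect_or_none(fake_predict, "   "))  # None
```

With the real package, passing `wld.predict` in place of `fake_predict` should work the same way, assuming it keeps the return shape shown above.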

CLI Usage

# Detect from text
uv run wldetect detect --text "Bonjour le monde"

# Detect from file
uv run wldetect detect --file input.txt

Included Model

WLDetect ships with a pre-trained model based on concatenated Gemma3-27B + Gemma3-4B token embeddings:

Languages: 148 (from OpenLID-v2 dataset)
Accuracy: 92.92% on FLORES+ dev set
F1 (macro): 92.74%
Language codes: ISO 639-3 + ISO 15924 script (e.g., eng_Latn, cmn_Hans, arb_Arab)
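
Each label combines a three-letter ISO 639-3 language code with a four-letter ISO 15924 script code, joined by an underscore. A small sketch of splitting a label into its parts follows; `LangLabel` and `parse_label` are hypothetical names introduced for illustration and are not provided by the library.

```python
from typing import NamedTuple


class LangLabel(NamedTuple):
    language: str  # ISO 639-3 code, e.g. "eng"
    script: str    # ISO 15924 code, e.g. "Latn"


def parse_label(label: str) -> LangLabel:
    """Split a label such as 'eng_Latn' into its language and script parts."""
    language, sep, script = label.partition("_")
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"unexpected label format: {label!r}")
    return LangLabel(language, script)


for label in ("eng_Latn", "cmn_Hans", "arb_Arab"):
    parsed = parse_label(label)
    print(parsed.language, parsed.script)
```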

Tip: See docs/languages.md for the complete list of supported languages with performance metrics.

Note: Gemma3 is a good choice for this application because it was trained on over 140 languages. The tokenizer, vocab size (262k) and multi-language training are critical for performance.

Architecture

Simple Inference Pipeline (NumPy-only)

Tokenize: Use HuggingFace fast tokenizer (512-length truncation)
Lookup: Index into pre-computed exponential lookup table (vocab_size × n_languages)
Pool: LogSum pooling over the token sequence (sum the looked-up exp values, then take the log)
Softmax: Calculate language probabilities

The lookup table is precomputed as exp((embeddings * token_weights) @ projection.T + bias), where embeddings are frozen token embeddings from Gemma3. The model is trained with focal loss on OpenLID-v2. During training, token vectors are aggregated using logsumexp pooling along the sequence dimension, which is why inference can sum the stored exp values and take a single log.
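
The pipeline described above can be sketched in NumPy with toy sizes. This is an illustration under stated assumptions, not the library's implementation: the arrays are random stand-ins, `predict_proba` is a hypothetical name, and `token_weights` is assumed to hold one scalar weight per token (the source does not give its shape). The final check confirms that summing the stored exp values and taking the log matches logsumexp over the raw logits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the real table is vocab_size x n_languages.
vocab_size, hidden_dim, n_languages = 50, 8, 4

# Random stand-ins for the frozen embeddings and the trained parameters.
embeddings = rng.normal(size=(vocab_size, hidden_dim))
token_weights = rng.normal(size=(vocab_size, 1))  # assumed: one scalar per token
projection = rng.normal(size=(n_languages, hidden_dim))
bias = rng.normal(size=n_languages)

# Per-token logits, then the exp lookup table from the formula above.
logits = (embeddings * token_weights) @ projection.T + bias
lookup_exp = np.exp(logits)


def predict_proba(token_ids: np.ndarray) -> np.ndarray:
    """Lookup -> sum of exp values -> log -> softmax over languages."""
    pooled = np.log(lookup_exp[token_ids].sum(axis=0))  # log-sum pooling
    shifted = pooled - pooled.max()                     # stabilise the softmax
    probs = np.exp(shifted)
    return probs / probs.sum()


token_ids = np.array([3, 17, 17, 42])  # pretend tokenizer output
probs = predict_proba(token_ids)

# The same pooled scores computed directly as logsumexp over the logits.
reference = np.logaddexp.reduce(logits[token_ids], axis=0)
assert np.allclose(np.log(lookup_exp[token_ids].sum(axis=0)), reference)
print(probs)
```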

Important: To optimize artifact size and compute, we apply exp(logits) before saving the lookup table, then threshold the result to make the table sparse. This reduces the artifact size about 10x (~130 MB to 13 MB) with negligible performance degradation.

Sparse Lookup Table

The lookup table uses sparse COO (Coordinate) format with configurable sparsification threshold:

Sparsity: 97.15% (values below the threshold of 10 are set to zero)
Format: COO (row, col, data), with indices stored as int32 and values as fp32
Performance impact: Negligible (0.003% accuracy loss)
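
As a rough illustration of the thresholding and COO layout described above, here is a NumPy-only sketch on a tiny table. The function names `to_coo` and `gather_rows` are hypothetical, and the on-disk layout used by the library may differ. The size estimate at the end assumes a 262,144-token vocabulary (the text only says "262k") and 12 bytes per stored entry (two int32 indices plus one fp32 value).

```python
import numpy as np


def to_coo(dense: np.ndarray, threshold: float):
    """Drop values below `threshold` and return (row, col, data) arrays."""
    rows, cols = np.nonzero(dense >= threshold)
    return (
        rows.astype(np.int32),
        cols.astype(np.int32),
        dense[rows, cols].astype(np.float32),
    )


def gather_rows(rows, cols, data, n_cols, token_ids):
    """Rebuild only the dense rows needed for a token sequence."""
    out = np.zeros((len(token_ids), n_cols), dtype=np.float32)
    for i, tok in enumerate(token_ids):
        mask = rows == tok
        out[i, cols[mask]] = data[mask]
    return out


dense = np.array([[0.5, 12.0, 3.0],
                  [40.0, 0.1, 11.0],
                  [2.0, 9.9, 0.0]], dtype=np.float32)
rows, cols, data = to_coo(dense, threshold=10.0)
print(rows.tolist(), cols.tolist(), data.tolist())
# [0, 1, 1] [1, 0, 2] [12.0, 40.0, 11.0]

# Rough size estimate for the full table under the assumptions above.
nnz = 262_144 * 148 * (1 - 0.9715)
print(round(nnz * 12 / 1e6, 1), "MB")  # 13.3 MB
```

Under these assumptions the estimate lands close to the 13 MB artifact size quoted above.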

Performance

FLORES+ Benchmark Results

Evaluated on FLORES+ dataset (148 languages, ~1k sentences per language):

Split	Accuracy	F1 (macro)	F1 (weighted)	Samples
dev	92.92%	92.74%	92.75%	150,547
devtest	92.86%	92.71%	92.69%	153,824

See docs/languages.md for detailed results.

Inference Speed

Benchmarked on a 12th-gen Intel Core i9 (single thread):

Single text: 71,500 texts/second (0.014 ms/text)
Batch (1000): 82,500 texts/second (12.1 ms/batch)
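
The throughput and latency figures above are two views of the same measurement. The sketch below shows the conversion and a simple timing loop; `ms_per_text` and `measure_throughput` are hypothetical helpers, not part of the library, and results on other hardware will differ.

```python
import time
from typing import Callable, Sequence


def ms_per_text(texts_per_second: float) -> float:
    """Convert throughput into average latency per text, in milliseconds."""
    return 1000.0 / texts_per_second


def measure_throughput(predict: Callable[[str], object], texts: Sequence[str]) -> float:
    """Time a predictor over `texts` and return texts per second."""
    start = time.perf_counter()
    for text in texts:
        predict(text)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed


print(round(ms_per_text(71_500), 3))         # 0.014 ms per text
print(round(1000 * ms_per_text(82_500), 1))  # 12.1 ms per batch of 1000
```

To time the real model, something like `measure_throughput(wld.predict, texts)` should work, assuming `wld` was loaded as in the Quick Start.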

Supported Languages

The bundled model supports 148 languages from the OpenLID-v2 dataset. Languages use ISO 639-3 language codes with ISO 15924 script codes (e.g., eng_Latn, cmn_Hans, arb_Arab).

See model_config.yaml for the complete list of supported languages.

Training

Installation for Training

# CPU or default CUDA version
uv sync --extra training

# With CUDA 12.8 (Blackwell)
uv sync --extra cu128

Training Pipeline

Configure the model in configs/models/custom-model.yaml:

model:
  name: google/gemma-3-27b-pt
  hidden_dim: 5376
  shard_pattern: model-00001-of-00012.safetensors
  embedding_layer_name: language_model.model.embed_tokens.weight

languages:
  eng_Latn: 0
  spa_Latn: 1
  fra_Latn: 2
  # ... add more languages

inference:
  max_sequence_length: 512
  pooling: logsumexp

Configure training in configs/training/custom-training.yaml:

model_config_path: "configs/models/custom-model.yaml"

dataset:
  name: "laurievb/OpenLID-v2"
  filter_languages: true

training:
  batch_size: 1536
  learning_rate: 0.002
  epochs: 2

Train:

uv run wldetect train --config configs/training/custom-training.yaml

Artifacts are saved to artifacts/:

lookup_table_exp.safetensors - Sparse exp lookup table (for inference)
projection.safetensors - Projection matrix (fp32, for fine-tuning)
model_config.yaml - Model configuration
model.pt - Full PyTorch checkpoint

Training Commands

# Train model
uv run wldetect train --config configs/training/gemma3-27b.yaml

# Evaluate on FLORES+
uv run wldetect eval --model-path artifacts/ --split dev

# Generate sparse lookup table from checkpoint (default: threshold=10.0)
uv run wldetect create-lookup \
  --checkpoint artifacts/checkpoints/checkpoint_step_100000.pt \
  --config configs/training/gemma3-27b.yaml \
  --output-dir artifacts/

Training Details

Embedding extraction: Downloads only embedding tensor shards from HuggingFace (not full models)
Dataset: OpenLID-v2 with configurable language filtering and balancing
Model: Simple linear projection (hidden_dim → n_languages) with dropout
Pooling: LogSumExp or max pooling over token sequences
Training time: ~2-4 hours on GPU for 2 epochs (150 languages, 5000 samples/language)
Evaluation: Automatic FLORES+ evaluation after training
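
To make the training objective concrete, here is a forward-pass sketch of a linear projection with logsumexp pooling and focal loss, written in NumPy for readability. It is a simplified illustration, not the project's training code (which uses PyTorch): dropout and the per-token weights are omitted, the inputs are random, and the focusing parameter `gamma=2.0` is an assumed default that the source does not state.

```python
import numpy as np


def logsumexp(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log(sum(exp(x))) along `axis`."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))


def focal_loss(logits: np.ndarray, target: int, gamma: float = 2.0) -> float:
    """Focal loss for one example: -(1 - p_t)**gamma * log(p_t)."""
    log_probs = logits - logsumexp(logits, axis=0)  # log-softmax
    log_pt = log_probs[target]
    return float(-((1.0 - np.exp(log_pt)) ** gamma) * log_pt)


rng = np.random.default_rng(0)
seq_len, hidden_dim, n_languages = 6, 8, 3

token_embeddings = rng.normal(size=(seq_len, hidden_dim))  # frozen inputs
projection = rng.normal(size=(n_languages, hidden_dim))    # trainable
bias = np.zeros(n_languages)                               # trainable

token_logits = token_embeddings @ projection.T + bias      # (seq_len, n_languages)
pooled = logsumexp(token_logits, axis=0)                   # (n_languages,)

loss = focal_loss(pooled, target=1)
print(loss)
```

With `gamma=0` the expression reduces to ordinary cross-entropy; larger values down-weight examples the model already classifies confidently.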

License

Apache 2.0 License

Citations

If you use WordLlama Detect in your research or project, please consider citing it as follows:

@software{miller2025wordllamadetect,
  author = {Miller, D. Lee},
  title = {WordLlama Detect: The Language of the Token},
  year = {2025},
  url = {https://github.com/dleemiller/WordLlamaDetect},
  version = {0.1.0}
}

Acknowledgments

OpenLID-v2 dataset: laurievb/OpenLID-v2
FLORES+ dataset: openlanguagedata/flores_plus
HuggingFace transformers and tokenizers libraries
Google Gemma model team

Project details

Maintainers

dleemiller

Release history

0.1.2: Dec 16, 2025
0.1.1: Dec 16, 2025 (this version)
0.1.0: Dec 9, 2025

Download files

Source Distribution

wldetect-0.1.1.tar.gz (5.3 MB)
Uploaded Dec 16, 2025 Source

Built Distribution

wldetect-0.1.1-py3-none-any.whl (5.4 MB)
Uploaded Dec 16, 2025 Python 3

File details

Details for the file wldetect-0.1.1.tar.gz.

File metadata

Download URL: wldetect-0.1.1.tar.gz
Upload date: Dec 16, 2025
Size: 5.3 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for wldetect-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`c781f3f3ba61f0d18122277bc1785602549d2545ed1b3e583353533430439487`
MD5	`cc5292500cdd2f841cb10b82ce1446b5`
BLAKE2b-256	`28ebf483f25ce75c527519fe82ac48cc799e0c8af16df48aa72b274446300fb0`

Provenance

The following attestation bundles were made for wldetect-0.1.1.tar.gz:

Publisher: publish.yml on dleemiller/WordLlamaDetect

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: wldetect-0.1.1.tar.gz
- Subject digest: c781f3f3ba61f0d18122277bc1785602549d2545ed1b3e583353533430439487
- Sigstore transparency entry: 766552295
- Sigstore integration time: Dec 16, 2025
Source repository:
- Permalink: dleemiller/WordLlamaDetect@be6d5e6c4fb17c117230e7f05616cf1332bb63d1
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/dleemiller
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@be6d5e6c4fb17c117230e7f05616cf1332bb63d1
- Trigger Event: release

File details

Details for the file wldetect-0.1.1-py3-none-any.whl.

File metadata

Download URL: wldetect-0.1.1-py3-none-any.whl
Upload date: Dec 16, 2025
Size: 5.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for wldetect-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eb5a3b0dc99f215aa553621b0ae90dff1f6a47be8b9a4de76c97b95e4e4760a4`
MD5	`eba781d343cd88784498fea478a92631`
BLAKE2b-256	`7846ba4169c2aabd992b16be4f960bd33bd3476c2a24435df30bc14b6f04b495`

Provenance

The following attestation bundles were made for wldetect-0.1.1-py3-none-any.whl:

Publisher: publish.yml on dleemiller/WordLlamaDetect

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: wldetect-0.1.1-py3-none-any.whl
- Subject digest: eb5a3b0dc99f215aa553621b0ae90dff1f6a47be8b9a4de76c97b95e4e4760a4
- Sigstore transparency entry: 766552296
- Sigstore integration time: Dec 16, 2025
Source repository:
- Permalink: dleemiller/WordLlamaDetect@be6d5e6c4fb17c117230e7f05616cf1332bb63d1
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/dleemiller
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@be6d5e6c4fb17c117230e7f05616cf1332bb63d1
- Trigger Event: release
