Fast, accurate language detection using static LLM embeddings
WordLlama Detect
WordLlama Detect is a WordLlama-like library focused on the task of language identification. It identifies 148 languages with high accuracy and fast, NumPy-only CPU inference. WordLlama Detect is trained on static token embeddings extracted from Gemma3-series LLMs.
Overview
Features:
- NumPy-only inference with no PyTorch dependency
- Pre-trained model covering 148 languages, 103 of them at >95% accuracy
- Sparse lookup table (13MB)
- Fast inference: >70k texts/s single thread
- Simple interface
Installation
pip install wldetect
Or install from source:
git clone https://github.com/dleemiller/WordLlamaDetect.git
cd WordLlamaDetect
uv sync
Quick Start
Python API
from wldetect import WLDetect
# Load bundled model (no path needed)
wld = WLDetect.load()
# Detect language for single text
lang, confidence = wld.predict("Hello, how are you today?")
# ('eng_Latn', 0.9564036726951599)
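To label several texts, simply reuse the loaded model in a loop (a minimal sketch using only the predict call shown above; the example strings and printed values are illustrative):
# Reuse the loaded model for multiple texts
texts = ["Guten Morgen", "¿Dónde está la biblioteca?", "こんにちは"]
for text in texts:
    lang, confidence = wld.predict(text)
    print(f"{text!r} -> {lang} ({confidence:.3f})")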
CLI Usage
# Detect from text
uv run wldetect detect --text "Bonjour le monde"
# Detect from file
uv run wldetect detect --file input.txt
Included Model
WLDetect ships with a pre-trained model based on concatenated Gemma3-27B + Gemma3-4B token embeddings:
- Languages: 148 (from OpenLID-v2 dataset)
- Accuracy: 92.92% on FLORES+ dev set
- F1 (macro): 92.74%
- Language codes: ISO 639-3 + ISO 15924 script (e.g., eng_Latn, cmn_Hans, arb_Arab)
[!TIP] See docs/languages.md for the complete list of supported languages with performance metrics.
[!NOTE]
Gemma3 is a good choice for this application because it was trained on over 140 languages. The tokenizer, large vocabulary (262k tokens), and multilingual training are critical for performance.
Architecture
Simple Inference Pipeline (NumPy-only)
- Tokenize: Use HuggingFace fast tokenizer (512-length truncation)
- Lookup: Index into pre-computed exponential lookup table (vocab_size × n_languages)
- Pool: LogSum pooling over token sequence
- Softmax: Calculate language probabilities
The lookup table is pre-computed as exp((embeddings * token_weights) @ projection.T + bias), where embeddings are the frozen token embeddings from Gemma3 and the projection is trained with focal loss on OpenLID-v2.
During training, token vectors are aggregated using logsumexp pooling along the sequence dimension.
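As a rough illustration of these four steps, here is a minimal NumPy sketch of the inference path, assuming a dense copy of the exp lookup table and pre-tokenized input (names are illustrative, not the library's internal API):
import numpy as np

def detect(token_ids, exp_table, languages):
    # Lookup: gather pre-computed exp(logits) rows for each token
    token_scores = exp_table[token_ids]               # (seq_len, n_languages)
    # Pool: logsumexp over the sequence reduces to log of a plain sum,
    # because the table already stores exp(logits)
    pooled = np.log(token_scores.sum(axis=0) + 1e-12)
    # Softmax: convert pooled scores into language probabilities
    probs = np.exp(pooled - pooled.max())
    probs /= probs.sum()
    best = int(probs.argmax())
    return languages[best], float(probs[best])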
[!IMPORTANT]
To optimize artifact size and compute, we perform exp(logits) before saving the lookup table. Then we apply a threshold to make the table sparse. This reduces the artifact size 10x (~130 MB -> 13 MB) with negligible performance degradation.
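A sketch of that sparsification step, assuming a dense (vocab_size × n_languages) array of logits; the threshold of 10 matches the default mentioned below:
import numpy as np

def sparsify(logits_table, threshold=10.0):
    # Exponentiate once so inference only needs a sum, then threshold
    exp_table = np.exp(logits_table.astype(np.float32))
    exp_table[exp_table < threshold] = 0.0
    # Keep surviving entries as COO triplets: (row, col, value)
    rows, cols = np.nonzero(exp_table)
    data = exp_table[rows, cols]
    return rows.astype(np.int32), cols.astype(np.int32), data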
Sparse Lookup Table
The lookup table uses sparse COO (Coordinate) format with configurable sparsification threshold:
- Sparsity: 97.15% (values below the threshold of 10 are set to zero)
- Format: COO (row, col, data) indices stored as int32, values as fp32
- Performance impact: Negligible (0.003% accuracy loss)
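At load time, the COO triplets can be scattered back into a dense array so that token lookups stay simple row gathers (a sketch under the same assumptions as above):
import numpy as np

def densify(rows, cols, data, vocab_size, n_languages):
    # Entries missing from the COO triplets were thresholded away,
    # so zero is the correct fill value
    table = np.zeros((vocab_size, n_languages), dtype=np.float32)
    table[rows, cols] = data
    return table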
Performance
FLORES+ Benchmark Results
Evaluated on FLORES+ dataset (148 languages, ~1k sentences per language):
| Split | Accuracy | F1 (macro) | F1 (weighted) | Samples |
|---|---|---|---|---|
| dev | 92.92% | 92.74% | 92.75% | 150,547 |
| devtest | 92.86% | 92.71% | 92.69% | 153,824 |
See docs/languages.md for detailed results.
Inference Speed
Benchmarked on a 12th-gen Intel i9 (single thread):
- Single text: 71,500 texts/second (0.014 ms/text)
- Batch (1000): 82,500 texts/second (12.1 ms/batch)
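To reproduce a rough single-thread number on your own hardware, a simple timing loop over the public predict call is enough (a sketch; throughput will vary with CPU and text length):
import time
from wldetect import WLDetect

wld = WLDetect.load()
texts = ["Hello, how are you today?"] * 10_000

start = time.perf_counter()
for text in texts:
    wld.predict(text)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:,.0f} texts/second")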
Supported Languages
The bundled model supports 148 languages from the OpenLID-v2 dataset. Languages use ISO 639-3 language codes with ISO 15924 script codes (e.g., eng_Latn, cmn_Hans, arb_Arab).
See model_config.yaml for the complete list of supported languages.
Training
Installation for Training
# CPU or default CUDA version
uv sync --extra training
# With CUDA 12.8 (Blackwell)
uv sync --extra cu128
Training Pipeline
- Configure model in configs/models/custom-config.yaml:
model:
name: google/gemma-3-27b-pt
hidden_dim: 5376
shard_pattern: model-00001-of-00012.safetensors
embedding_layer_name: language_model.model.embed_tokens.weight
languages:
eng_Latn: 0
spa_Latn: 1
fra_Latn: 2
# ... add more languages
inference:
max_sequence_length: 512
pooling: logsumexp
- Configure training in configs/training/custom-training.yaml:
model_config_path: "configs/models/custom-config.yaml"
dataset:
name: "laurievb/OpenLID-v2"
filter_languages: true
training:
batch_size: 1536
learning_rate: 0.002
epochs: 2
- Train:
uv run wldetect train --config configs/training/custom-training.yaml
Artifacts saved to artifacts/:
- lookup_table_exp.safetensors - Sparse exp lookup table (for inference)
- projection.safetensors - Projection matrix (fp32, for fine-tuning)
- model_config.yaml - Model configuration
- model.pt - Full PyTorch checkpoint
Training Commands
# Train model
uv run wldetect train --config configs/training/gemma3-27b.yaml
# Evaluate on FLORES+
uv run wldetect eval --model-path artifacts/ --split dev
# Generate sparse lookup table from checkpoint (default: threshold=10.0)
uv run wldetect create-lookup \
--checkpoint artifacts/checkpoints/checkpoint_step_100000.pt \
--config configs/training/gemma3-27b.yaml \
--output-dir artifacts/
Training Details
- Embedding extraction: Downloads only embedding tensor shards from HuggingFace (not full models)
- Dataset: OpenLID-v2 with configurable language filtering and balancing
- Model: Simple linear projection (hidden_dim → n_languages) with dropout
- Pooling: LogSumExp or max pooling over token sequences
- Training time: ~2-4 hours on GPU for 2 epochs (150 languages, 5000 samples/language)
- Evaluation: Automatic FLORES+ evaluation after training
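For orientation, the training-time model described above amounts to little more than dropout plus a linear projection over frozen embeddings, pooled with logsumexp. A minimal PyTorch sketch (illustrative only; the per-token weights and focal loss from the actual pipeline are omitted):
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, embeddings: torch.Tensor, n_languages: int, dropout: float = 0.1):
        super().__init__()
        # Frozen Gemma3 token embeddings: (vocab_size, hidden_dim)
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(embeddings.shape[1], n_languages)

    def forward(self, token_ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = self.dropout(self.embed(token_ids))      # (batch, seq, hidden_dim)
        logits = self.proj(x)                        # (batch, seq, n_languages)
        # Mask out padding positions, then pool over the sequence dimension
        logits = logits.masked_fill(~mask.unsqueeze(-1).bool(), float("-inf"))
        return torch.logsumexp(logits, dim=1)        # (batch, n_languages)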
License
Apache 2.0 License
Citations
If you use WordLlama Detect in your research or project, please consider citing it as follows:
@software{miller2025wordllamadetect,
author = {Miller, D. Lee},
title = {WordLlama Detect: The Language of the Token},
year = {2025},
url = {https://github.com/dleemiller/WordLlamaDetect},
version = {0.1.0}
}
Acknowledgments
- OpenLID-v2 dataset: laurievb/OpenLID-v2
- FLORES+ dataset: openlanguagedata/flores_plus
- HuggingFace transformers and tokenizers libraries
- Google Gemma model team