18 projects
gherbal
FastText-based multilingual language identification with Hugging Face integration
wikilangs
A Python package for consuming Wikipedia language models, including tokenizers, n-gram models, Markov chains, and vocabularies.
prepress
A modern, polyglot release management tool
babelvec
Position-aware, cross-lingually aligned word embeddings built on FastText
picomon
Beautiful TUI dashboard for monitoring GPUs (AMD, NVIDIA, Apple Silicon)
contrastive-ft
CRAFT: Contrastive Representation Aware Fine-Tuning toolkit
zippy-data
High-performance, multi-language dataset storage format
vocabulous
A bootstrapping language detection system that builds dictionaries from noisy and ambiguous training data
borgllm
Universal LLM client with API key rotation, rate limit management, and custom fallback strategies. Drop-in OpenAI SDK replacement with optional LangChain support.
curriculus
Progressive curriculum learning for LLM training with fine-grained schedule control
unscript
A writing-system-aware library for cleaning text for NLP, training, and analysis.
wikisets
Flexible Wikipedia dataset builder with sampling and pretraining support
residuals
Instruction residuals (task vectors) for efficient LLM continuous pre-training
hypersets
Fast, efficient alternative to Hugging Face load_dataset using DuckDB for querying, sampling and transforming remote datasets
datapluck
Export & import Hugging Face datasets to spreadsheets and various file formats.
monitoro-herd
Python SDK for Monitoro Herd
monitoro
Official Python SDK for the Monitoro API
sawalni
Official Python SDK for the Sawalni API