7 projects
oldp
Open Legal Data Platform
commonlid
Evaluate language identification models on CommonLID and other benchmarks.
cld3-py
Compact Language Detector v3 (CLD3) Python bindings — modernised fork with prebuilt wheels for Python 3.10–3.14.
url-is-in
A Python package for efficiently checking whether a URL is part of a large whitelist or blacklist of URLs and domain names.
llm-datasets
A collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
lm-datasets
A collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
finetune-eval
Finetune_Eval_Harness