Fast language detection for Python powered by Rust

papagan

Fast language detection for Python, powered by Rust (via PyO3 + maturin).

10 languages bundled, weighted per-word output, fully typed (PEP 561).

Install

uv add papagan
# or
pip install papagan

Pre-built wheels ship for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64). Python 3.10+.

Quick start

from papagan import Detector

detector = Detector()

# Document-level detection
output = detector.detect("Die Katze sitzt auf der Matte")
lang, confidence = output.top()
print(f"{lang}: {confidence:.3f}")
# de: 0.996

# Full distribution
for lang, score in output.distribution():
    print(f"  {lang}: {score:.3f}")

Per-word detail

Useful for mixed-language text or debugging:

detailed = detector.detect_detailed("The cat is black. Die Katze ist schwarz.")

for word in detailed.words:
    top_lang, top_score = max(word.scores, key=lambda x: x[1])
    print(f"  {word.token:<10} [{word.source}]  {top_lang} ({top_score:.2f})")
# the        [dict]   en (0.85)
# cat        [ngram]  en (0.99)
# ...
# katze      [ngram]  de (1.00)

# The aggregate handles mixed input gracefully:
print(detailed.aggregate.distribution())
# [('de', 0.52), ('en', 0.48)]
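For mixed-language text it can help to see where the language switches. The following is a minimal sketch, not part of papagan: `language_runs` and the `WordRecord` alias are hypothetical names, the input is modelled as plain `(token, scores)` tuples mirroring `word.token` and `word.scores` above, and the sample scores are made up for illustration.

```python
from itertools import groupby

# Hypothetical stand-in for per-word records: (token, [(lang, score), ...]).
WordRecord = tuple[str, list[tuple[str, float]]]


def language_runs(words: list[WordRecord]) -> list[tuple[str, list[str]]]:
    """Group consecutive tokens that share the same top-scoring language."""
    # Label every token with its highest-scoring language.
    labelled = [
        (max(scores, key=lambda pair: pair[1])[0], token)
        for token, scores in words
    ]
    # groupby merges only adjacent items, which is what a "run" means here.
    return [
        (lang, [token for _, token in group])
        for lang, group in groupby(labelled, key=lambda item: item[0])
    ]


# Illustrative scores only; real values come from detect_detailed().
sample: list[WordRecord] = [
    ("the", [("en", 0.85), ("de", 0.15)]),
    ("cat", [("en", 0.99), ("de", 0.01)]),
    ("die", [("de", 0.70), ("en", 0.30)]),
    ("katze", [("de", 1.00), ("en", 0.00)]),
]
print(language_runs(sample))
# [('en', ['the', 'cat']), ('de', ['die', 'katze'])]
```

With a real detector, the input could be built as `[(w.token, w.scores) for w in detailed.words]`.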

Batch detection

For multi-document workloads, detect_batch fans out across cores via rayon and releases the GIL while it runs, so other Python threads keep making progress and scaling out with ThreadPoolExecutor behaves as expected:

docs = ["The cat sat", "Die Katze sitzt", "Le chat est assis", "El gato está sentado"]

results = detector.detect_batch(docs)              # list[Output]
detailed = detector.detect_detailed_batch(docs)    # list[Detailed]

for o in results:
    print(o.top())

On a 1000-paragraph batch (Leipzig news, avg 84 words each, 8-core M-series), detect_batch is ~3.5× faster than calling detect() in a Python loop (90 ms vs. 26 ms). On 1870 short titles it is ~5× faster (16 ms vs. 3 ms), since rayon setup amortizes better over dict-hit-heavy tokens.

Batches of fewer than 4 documents fall back to the normal per-call path, so small batches do not regress.
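Because the GIL is released during a batch call, one way to process a large or streaming corpus is to hand fixed-size chunks to a thread pool. The sketch below is an assumption-laden illustration rather than a papagan API: `detect_chunked` is a hypothetical helper, the batch function is passed in as a callable (in practice `detector.detect_batch`), and a stub stands in for it so the block runs without papagan installed. Whether this beats a single `detect_batch` call depends on the workload, since rayon already spreads one batch across cores.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def detect_chunked(
    detect_batch: Callable[[list[str]], list[T]],
    docs: Sequence[str],
    chunk_size: int = 256,
    workers: int = 4,
) -> list[T]:
    """Run detect_batch over fixed-size chunks on a thread pool, preserving order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = [list(docs[i:i + chunk_size]) for i in range(0, len(docs), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, so output lines up with docs.
        results = pool.map(detect_batch, chunks)
        return [item for chunk_result in results for item in chunk_result]


# Stub standing in for detector.detect_batch (hypothetical, for illustration).
def fake_detect_batch(batch: list[str]) -> list[int]:
    return [len(doc) for doc in batch]


print(detect_chunked(fake_detect_batch, ["ab", "cde", "f"], chunk_size=2))
# [2, 3, 1]
```

Keeping `chunk_size` at 4 or above avoids the small-batch fallback described above.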

Restrict to specific languages

Detection is faster and more confident when you know the input's language set in advance:

detector = Detector(only=["en", "de"])
# or with the builder:
detector = Detector.builder().only(["en", "de"]).build()

Configuration

detector = Detector(
    only=["en", "de", "fr"],       # restrict to a subset
    unknown_threshold=0.25,         # below this => ("?", ...) aka Lang.Unknown
    parallel_threshold=32,          # parallelize per-word work at 32+ tokens (default)
    # set parallel_threshold to a very large number to opt out of rayon entirely
)
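To make the effect of unknown_threshold concrete, here is a rough sketch of the rule the comment above describes, applied to a plain distribution list. It is not papagan's implementation: `top_or_unknown` is a hypothetical helper, and both the strict `<` comparison and the score returned alongside "?" are assumptions, since the configuration comment leaves them unspecified.

```python
def top_or_unknown(
    distribution: list[tuple[str, float]],
    unknown_threshold: float = 0.25,
) -> tuple[str, float]:
    """Return the top (lang, score), or ("?", score) when the top score is too low."""
    if not distribution:
        return ("?", 0.0)  # assumption: empty input counts as unknown
    lang, score = max(distribution, key=lambda pair: pair[1])
    if score < unknown_threshold:  # assumption: strict comparison
        return ("?", score)  # assumption: the top score is kept for inspection
    return (lang, score)


mixed = [("de", 0.52), ("en", 0.48)]
print(top_or_unknown(mixed))        # ('de', 0.52)
print(top_or_unknown(mixed, 0.6))   # ('?', 0.52)
```

A higher threshold trades coverage for fewer low-confidence labels, which matters most on short or mixed-language input.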

Supported languages

Code  Language    Code  Language
de    German      it    Italian
en    English     nl    Dutch
es    Spanish     pl    Polish
fr    French      pt    Portuguese
ru    Russian     tr    Turkish

All 10 languages are bundled — no feature flags to set.

Type hints

The package ships .pyi stubs and a py.typed marker (PEP 561):

from papagan import Detector, Lang, Output, WordScore, LangCode, MatchSource

def classify(text: str) -> LangCode:
    lang, _score = Detector().detect(text).top()
    return lang  # typed as Literal["de", "en", ..., "?"]

Your type checker (mypy, pyright) will see full signatures for all classes, including the LangCode and MatchSource Literal types.

Benchmarks

Measured on Darwin arm64, 2026-04-22. Open fixtures: Tatoeba sentences (CC-BY 2.0 FR) and Leipzig news paragraphs (CC-BY 4.0). ns/tok is the per-token rate — the cleanest cross-library comparison since it normalizes out workload size. Full cross-binding matrix and reproduction commands live in the repository README.

Library             Tokens  Bytes   Loop (ms)  Loop (ns/tok)  Batch (ms)  Batch (ns/tok)
papagan             35k     222 KB  32.94      949            9.67        279
papagan             87k     620 KB  79.94      923            23.74       274
py3langid           35k     222 KB  223.59     6 442          -           -
py3langid           87k     620 KB  120.81     1 395          -           -
langdetect          35k     222 KB  6 959.25   200 526        -           -
langdetect          87k     620 KB  1 348.72   15 570         -           -
lingua (all langs)  35k     222 KB  3 700.01   106 613        -           -
lingua (all langs)  87k     620 KB  2 675.98   30 893         -           -

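papagan runs at ~950 ns/token in a Python loop and ~275 ns/token with detect_batch, flat across workload size. On the short-sentence workload (the 35k-token rows) that is ~7× ahead of py3langid, ~110× ahead of lingua, and ~210× ahead of langdetect, comparing loop to loop; on the 87k-token rows the gaps narrow to roughly 1.5×, 33×, and 17× respectively. detect_batch releases the GIL, so ThreadPoolExecutor scales as expected.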
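The ratios quoted here can be rechecked from the table with a few lines of arithmetic. This is a small worked sketch: `ns_per_token` is a hypothetical helper, and because the table rounds token counts to "35k", the recomputed per-token rate lands slightly off the published 949.

```python
def ns_per_token(elapsed_ms: float, tokens: int) -> float:
    """Convert a total runtime in milliseconds to nanoseconds per token."""
    return elapsed_ms * 1_000_000 / tokens


# 32.94 ms over roughly 35k tokens (rounded count, so not exactly 949).
print(round(ns_per_token(32.94, 35_000)))  # 941

# Loop ns/tok values copied from the 35k-token rows of the table.
loop_ns_per_tok = {
    "papagan": 949,
    "py3langid": 6_442,
    "lingua": 106_613,
    "langdetect": 200_526,
}
baseline = loop_ns_per_tok["papagan"]
speedups = {
    name: round(value / baseline, 1)
    for name, value in loop_ns_per_tok.items()
    if name != "papagan"
}
print(speedups)
# {'py3langid': 6.8, 'lingua': 112.3, 'langdetect': 211.3}

# Batch vs. loop for papagan itself on the same rows.
print(round(949 / 279, 1))  # 3.4
```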

Accuracy

99.42 % on Tatoeba (5,000 sentences) and 99.86 % on FLORES-200 devtest (10,120 sentences) across the 10 supported languages. Per-language precision/recall is best on isolated scripts (Russian, Turkish, Polish — ~perfect) and slightly weaker on the close Romance cluster (Spanish/Portuguese/Italian); full per-language table in the repository README.
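For readers who want to compute the same kind of figures on their own data, the sketch below derives overall accuracy and per-language precision/recall from (gold, predicted) label pairs. It is not the project's evaluation harness: `accuracy` and `per_language_metrics` are hypothetical helpers, and the four pairs are toy data chosen to show a Spanish/Portuguese confusion.

```python
from collections import Counter


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (gold, predicted) pairs where the prediction matches."""
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


def per_language_metrics(
    pairs: list[tuple[str, str]],
) -> dict[str, tuple[float, float]]:
    """Map each language to (precision, recall)."""
    tp: Counter[str] = Counter()
    fp: Counter[str] = Counter()
    fn: Counter[str] = Counter()
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted this language, but it was wrong
            fn[gold] += 1  # missed the true language
    metrics = {}
    for lang in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[lang] + fp[lang]
        actual = tp[lang] + fn[lang]
        precision = tp[lang] / predicted if predicted else 0.0
        recall = tp[lang] / actual if actual else 0.0
        metrics[lang] = (precision, recall)
    return metrics


# Toy data: one Spanish sentence misread as Portuguese.
pairs = [("es", "es"), ("es", "pt"), ("pt", "pt"), ("ru", "ru")]
print(accuracy(pairs))              # 0.75
print(per_language_metrics(pairs))
# {'es': (1.0, 0.5), 'pt': (0.5, 1.0), 'ru': (1.0, 1.0)}
```

In practice the predicted label would come from `detector.detect(text).top()[0]` for each test sentence.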

License

Dual-licensed under MIT or Apache-2.0, at your option.
