# chardet

Universal character encoding detector.
chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x, just much faster and more accurate. The detection engine is reimplemented in Rust and exposed to Python via PyO3. Python 3.10+.
> [!WARNING]
> This Rust reimplementation is an AI experiment. It is not an official upstream replacement.
## Why chardet 7.0?
98.1% accuracy on 2,510 test files. 43x faster than chardet 6.0.0 and 6.8x faster than charset-normalizer. Language detection for every result. MIT licensed.
| | chardet 7.0 (Rust core) | chardet 6.0.0 | charset-normalizer |
|---|---|---|---|
| Accuracy (2,510 files) | 98.1% | 88.2% | 78.5% |
| Speed | 546 files/s | 13 files/s | 80 files/s |
| Language detection | 95.1% | -- | -- |
| Peak memory | 26.2 MiB | 29.5 MiB | 101.2 MiB |
| Streaming detection | yes | yes | no |
| Encoding era filtering | yes | no | no |
| Supported encodings | 99 | 84 | 99 |
| License | MIT | LGPL | MIT |
## Installation

```bash
pip install chardet
```

For source builds (or editable local development), install a Rust toolchain as well, because the extension module is built from `rust/` with maturin.
## Quick Start

```python
import chardet

# Plain ASCII is reported as its superset Windows-1252 by default,
# in keeping with the WHATWG guidelines for encoding detection.
chardet.detect(b"Hello, world!")
# {'encoding': 'Windows-1252', 'confidence': 1.0, 'language': 'en'}

# UTF-8 with typographic punctuation
chardet.detect("It\u2019s a lovely day \u2014 let\u2019s grab coffee.".encode("utf-8"))
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': 'es'}

# Japanese EUC-JP
chardet.detect("これは日本語のテストです。文字コードの検出を行います。".encode("euc-jp"))
# {'encoding': 'euc-jis-2004', 'confidence': 1.0, 'language': 'ja'}

# Get all candidate encodings ranked by confidence
text = "Le café est une boisson très populaire en France et dans le monde entier."
results = chardet.detect_all(text.encode("windows-1252"))
for r in results:
    print(r["encoding"], r["confidence"])
# windows-1252 0.44
# iso-8859-15 0.44
# mac-roman 0.42
# cp858 0.42
```
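Because `detect_all()` returns candidates ranked by confidence, a common pattern is to try each candidate until one decodes the bytes cleanly. A minimal sketch, where `decode_best` is a hypothetical helper (not part of the chardet API):

```python
import chardet

def decode_best(data: bytes) -> str:
    """Decode using the highest-confidence candidate that round-trips."""
    for candidate in chardet.detect_all(data):
        enc = candidate["encoding"]
        if enc is None:  # the detector may give up on very short inputs
            continue
        try:
            return data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # fall through to the next-ranked candidate
    return data.decode("utf-8", errors="replace")  # last-resort fallback

text = decode_best("Le café est une boisson très populaire.".encode("windows-1252"))
```

The `LookupError` guard matters because a detector may name an encoding label that the local Python build does not provide a codec for.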
## Streaming Detection

For large files or network streams, use `UniversalDetector` to feed data incrementally:

```python
from chardet import UniversalDetector

detector = UniversalDetector()
with open("unknown.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break

result = detector.close()
print(result)
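When the input has no line structure (binary dumps, network buffers), feeding fixed-size chunks works just as well. A sketch, where `detect_stream` is a hypothetical helper and `io.BytesIO` stands in for any binary file-like object:

```python
import io
from chardet import UniversalDetector

def detect_stream(stream, chunk_size: int = 4096) -> dict:
    """Feed fixed-size chunks until the detector is confident or data runs out."""
    detector = UniversalDetector()
    while not detector.done:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        detector.feed(chunk)
    return detector.close()  # close() finalizes and returns the result dict

result = detect_stream(io.BytesIO("café crème, s'il vous plaît ".encode("utf-8") * 200))
```

Stopping as soon as `detector.done` is set avoids reading the rest of a large stream once the answer is already certain.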
## Encoding Era Filtering

Restrict detection to specific encoding eras to reduce false positives:

```python
from chardet import detect_all
from chardet.enums import EncodingEra

data = "Москва является столицей Российской Федерации и крупнейшим городом страны.".encode("windows-1251")

# All encoding eras are considered by default — 4 candidates across eras
for r in detect_all(data):
    print(r["encoding"], round(r["confidence"], 2))
# windows-1251 0.5
# mac-cyrillic 0.47
# kz-1048 0.22
# ptcp154 0.22

# Restrict to modern web encodings — 1 confident result
for r in detect_all(data, encoding_era=EncodingEra.MODERN_WEB):
    print(r["encoding"], round(r["confidence"], 2))
# windows-1251 0.5
```
## CLI

```bash
chardetect somefile.txt
# somefile.txt: utf-8 with confidence 0.99

chardetect --minimal somefile.txt
# utf-8

# Pipe from stdin
cat somefile.txt | chardetect
```
## What's New in 7.0

- Rust reimplementation of the detector core — the full detection pipeline is implemented in `rust/src` and exposed to Python via `chardet_rs._chardet_rs` (PyO3)
- Python API compatibility layer — `detect()`, `detect_all()`, `UniversalDetector`, and `chardetect` keep the familiar chardet API while delegating execution to Rust
- 12-stage detection pipeline — BOM detection, structural probing, byte validity filtering, and bigram statistical models are now executed in native code
- 43x faster than chardet 6.0.0, 6.8x faster than charset-normalizer
- 98.1% accuracy — +9.9pp vs chardet 6.0.0, +19.6pp vs charset-normalizer
- Language detection — 95.1% accuracy across 49 languages, returned with every result
- 99 encodings — full coverage including EBCDIC, Mac, DOS, and Baltic/Central European families
- `EncodingEra` filtering — scope detection to modern web encodings, legacy ISO/Mac/DOS, mainframe, or all
- Thread-safe detection calls — `detect()` and `detect_all()` are safe to call concurrently; free-threaded execution is covered in CI for Python 3.13t
## Documentation

Full documentation is available at chardet.readthedocs.io.
## License Discussion

There is an active licensing dispute around this AI-assisted rewrite.

### Timeline
- On March 4, 2026, issue #327 was opened by a user identifying as Mark Pilgrim (the original chardet author), arguing that relicensing from LGPL to MIT is not permitted.
- On March 6, 2026, The Register reported on the dispute, including statements from multiple people in the OSS ecosystem.
### Core Disagreement
- Relicensing claim: maintainers stated the new version is a sufficiently new implementation and can be MIT-licensed.
- Derivative-work claim: critics argue the rewrite remains derivative of prior LGPL work because of project continuity, prior code exposure, and intentional API/behavior compatibility.
- Clean-room dispute: one side treats AI-assisted regeneration plus low similarity metrics as evidence of independence; the other side argues that AI training provenance and maintainer prior exposure weaken clean-room arguments.
### Points Raised in Public Discussion
- Similarity analysis (for example, references to JPlag comparisons) was cited as evidence that 7.0 differs structurally from prior versions.
- Counterarguments focused less on line-by-line similarity and more on copyright/licensing doctrine for derivative works.
- Broader concerns were raised about whether AI-assisted rewrites could undermine copyleft obligations in practice.
- The Register also framed this as part of a larger unresolved legal question: how copyright and licensing apply when code is heavily AI-assisted.
### Current Status
- The disagreement is public and unresolved.
- This repository includes this summary for transparency and context.
- If licensing compliance is material to your use case, obtain legal advice before adoption.
This section is informational only and is not legal advice.
## License

MIT
## Source Distribution

### File details

Details for the file `chardet_rust-0.1.4.tar.gz`.

### File metadata

- Download URL: chardet_rust-0.1.4.tar.gz
- Upload date:
- Size: 115.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `99a29f175b4589d6d1b2625887835d0bb3617ff715be1e92879261c8afd82b33` |
| MD5 | `573f07312d2a5ef22188ac7954ab02fe` |
| BLAKE2b-256 | `583d38c34272a843f757ed39028499c7449e74a0acbf3b7b1d057cef034c05ed` |