
Fast, accurate charset/encoding detection. Zero ML. Zero dependencies. Optional Rust acceleration.



Charset detection that stays fast, honest, and dependency-free.


Author: Oğuzhan Kır · GitHub · Documentation


bytesense reads raw bytes and tells you which encoding likely produced them—without shipping neural nets, without pulling in other Python packages at install time, and with an explainable why on every result. It is designed as a modern alternative to chardet and charset-normalizer for teams that want predictable performance and a small install footprint.

If you already use chardet.detect() or charset_normalizer.detect(), you can swap in bytesense.detect() with minimal code churn.

Comparison with chardet & charset-normalizer

Same idea as the charset-normalizer README: a feature matrix and measured accuracy on the bundled suites, so you can see where bytesense stands—not hand-wavy marketing.

Feature matrix

Feature Chardet Charset Normalizer bytesense
Fast
Universal¹
Reliable without distinguishable standards
Reliable with distinguishable standards
License MIT MIT MIT
Native Python (no C extension required)
Optional native acceleration Rust (pip install "bytesense[fast]")
Detect spoken language
Explainable detection (why) Partial Rich metadata Yes (always)
Streaming-first API Limited Via API patterns StreamDetector
Wheel size (typical) ~500 kB ~150 kB Small (pure Python + tables)
Custom codecs via stdlib registration ✅ (same codecs model)

¹ Universal in the charset-normalizer sense: coverage follows what your Python build exposes through codecs as decode candidates—not a fixed “99 encodings” count on every platform.

Accuracy (bundled tests, three-way)

The numbers below come from the same ./scripts/compare_libraries.sh run that CI exercises: chardet, charset-normalizer, and bytesense on identical inputs. Strict = the detected codec name matches the reference (after resolving aliases with codecs.lookup). Functional = decoding with the detected codec yields the same Unicode text as decoding with the reference encoding, so several labels can count as correct if they decode to the same text (the same spirit as charset-normalizer’s comparisons).
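
The two metrics can be sketched with the standard library alone. The function names `strict_match` and `functional_match` are illustrative assumptions; the comparison script may implement the checks differently.

```python
import codecs


def strict_match(detected: str, expected: str) -> bool:
    """Strict: both labels resolve to the same codec after alias lookup."""
    return codecs.lookup(detected).name == codecs.lookup(expected).name


def functional_match(data: bytes, detected: str, expected: str) -> bool:
    """Functional: both codecs decode the bytes to identical text."""
    try:
        return data.decode(detected) == data.decode(expected)
    except (LookupError, UnicodeDecodeError):
        return False


sample = "caf\u00e9".encode("latin_1")
print(strict_match("latin1", "iso-8859-1"))          # aliases of one codec
print(strict_match("cp1252", "latin_1"))             # different codecs
print(functional_match(sample, "cp1252", "latin_1")) # same text for this input
```

This is why a detector can score lower on Strict than on Functional: cp1252 and latin_1 are different labels, yet they decode this sample identically.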

39-case suite (synthetic + charset-normalizer’s published data/ samples, expected labels aligned with their tests):

Package Strict Functional
bytesense 100% (39/39) 100% (39/39)
charset-normalizer 92.3% (36/39) 97.4% (38/39)
chardet 79.5% (31/39) 97.4% (38/39)

24-case hard stress (paragraph-sized, ambiguous SBCS / CJK / UTF-16 / ISO-2022 — benchmarks/test_hard_scenarios.py):

Package Strict Functional
bytesense 83.3% (20/24) 100% (24/24)
charset-normalizer 41.7% (10/24) 58.3% (14/24)
chardet 70.8% (17/24) 91.7% (22/24)

Updated March 2026 — CPython 3.12, chardet 7.x, charset-normalizer 3.4.x, bytesense 0.1.0. Your CPU and dependency versions will differ; re-run ./scripts/compare_libraries.sh to refresh.

Speed (pytest-benchmark, same machine snapshot)

Rough single-thread means (pytest benchmarks/test_bench_detection.py, speed filter; your numbers will differ):

Sample (from_bytes / detect) bytesense chardet
utf8_bom ~4 µs ~104 µs
utf8_ascii_only ~46 µs ~114 µs

UTF-8 with BOM is an early exit in bytesense; chardet still runs its full probe. Profile your own payloads.
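
A BOM early exit of this kind can be sketched in a few lines. This is an assumed illustration using the `codecs.BOM_*` constants, not bytesense's actual code; `sniff_bom` is a hypothetical name.

```python
import codecs
from typing import Optional

# Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if the payload starts with a known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"hello"))
```

Because the check only inspects the first four bytes, its cost does not grow with input size.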

How to reproduce

pip install -e ".[dev]"
python scripts/build_fingerprints.py
python scripts/fetch_cn_benchmark_samples.py   # CN `data/` mirrors into benchmarks/data/cn_official/
./scripts/compare_libraries.sh                   # prints the tables above (accuracy + hard stress)

Options: ./scripts/compare_libraries.sh --no-fetch (no download), --no-hard (skip 24-case stress). Full benches: ./scripts/run_all_benchmarks.sh.

Fingerprint script: Lines like utf_16 “skipped” are normal — UTF-16/32 are handled via BOM / null-byte rules, not the single-byte histogram table.
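
For BOM-less UTF-16, a null-byte rule can look roughly like the sketch below. The thresholds and the name `guess_utf16_by_nulls` are assumptions chosen for illustration; bytesense's real rules may be stricter or cover UTF-32 as well.

```python
from typing import Optional


def guess_utf16_by_nulls(data: bytes, threshold: float = 0.4) -> Optional[str]:
    """Guess UTF-16 byte order from where the zero bytes sit."""
    if len(data) < 4 or len(data) % 2:
        return None
    even, odd = data[0::2], data[1::2]
    even_nulls = even.count(0) / len(even)
    odd_nulls = odd.count(0) / len(odd)
    # Mostly-Latin text in UTF-16 LE has its zero high bytes at odd offsets.
    if odd_nulls >= threshold and even_nulls < 0.05:
        return "utf_16_le"
    if even_nulls >= threshold and odd_nulls < 0.05:
        return "utf_16_be"
    return None


print(guess_utf16_by_nulls("hello world!".encode("utf_16_le")))
print(guess_utf16_by_nulls("hello world!".encode("utf_16_be")))
print(guess_utf16_by_nulls(b"hello world!"))
```

A byte histogram cannot capture this positional pattern, which is consistent with the fingerprint script skipping these encodings.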

Troubleshooting fetch_cn_benchmark_samples.py (SSL on macOS): Use .venv/bin/python scripts/.... certifi is in [dev]. If HTTPS still fails: Install Certificates.command for your Python, or python scripts/fetch_cn_benchmark_samples.py --insecure as a last resort.

Installation

pip install bytesense                 # pure Python — always works
pip install "bytesense[fast]"       # same API; uses Rust wheel when available for your platform

Test PyPI (pre-release smoke test): after publishing to test.pypi.org, install with an extra index so pip can fetch build tools from PyPI:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ bytesense

CLI

bytesense ships with a small CLI for files on disk.

usage: bytesense [-h] [-v] [-m] [--version] FILE [FILE ...]

positional arguments:
  FILE              File(s) to analyse

optional arguments:
  -h, --help        show this help and exit
  -v, --verbose     Include the ``why`` field in JSON (``confidence_interval`` is always shown)
  -m, --minimal     Print only the detected encoding name
  --version         Show version and exit

bytesense ./README.md
bytesense -m ./README.md
python -m bytesense.cli --version

stdout is JSON: a single object for one file, or a list of objects when several files are given without -m:

{
  "encoding": "utf_8",
  "confidence": 0.98,
  "confidence_interval": [0.93, 1.0],
  "language": "English",
  "alternatives": [],
  "bom_detected": false,
  "chaos": 0.02,
  "coherence": 0.41,
  "byte_count": 2048,
  "path": "/absolute/path/to/file"
}

Use -v to add the human-readable why string to each object.
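
If you consume the CLI output from another script, a small normalizer keeps the single-file and multi-file shapes uniform. This sketch assumes only the fields shown in the sample above; `parse_cli_output` is a hypothetical helper, not something bytesense ships.

```python
import json


def parse_cli_output(stdout: str) -> list:
    """Normalize CLI stdout to a list of per-file records."""
    payload = json.loads(stdout)
    return payload if isinstance(payload, list) else [payload]


sample_stdout = """
{
  "encoding": "utf_8",
  "confidence": 0.98,
  "confidence_interval": [0.93, 1.0],
  "language": "English",
  "alternatives": [],
  "bom_detected": false,
  "chaos": 0.02,
  "coherence": 0.41,
  "byte_count": 2048,
  "path": "/absolute/path/to/file"
}
"""

for record in parse_cli_output(sample_stdout):
    low, high = record["confidence_interval"]
    print(record["path"], record["encoding"], f"{low:.2f}..{high:.2f}")
```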

Python

Full result object

from bytesense import from_bytes, from_path

result = from_path("notes.txt")
print(result.encoding, result.confidence, result.language)
print(result.why)

Streaming

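import requests  # third-party; any iterable of byte chunks works here
from bytesense import StreamDetector

# Assumption: `response` is a streamed requests response; the URL is a placeholder.
response = requests.get("https://example.com", stream=True)

det = StreamDetector()
for chunk in response.iter_content(1024):
    det.feed(chunk)
    if det.confidence >= 0.99:  # stop reading once the detector is confident
        break
print(det.encoding, det.language)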

Repair mojibake

from bytesense import repair

garbled = "Ã©tÃ©"   # "été" encoded as UTF-8, then misread as Latin-1
result = repair(garbled)
if result.improved:
    print(result.repaired)    # "été"
    print(result.chain)       # ("latin_1", "utf_8")
    print(result.improvement) # e.g. 0.34
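
The ("latin_1", "utf_8") chain corresponds to a plain encode/decode round trip, which can be reproduced with the standard library. This is only a sketch of that one chain; how `repair` searches and scores chains is not shown here, and `undo_latin1_mojibake` is a hypothetical name.

```python
def undo_latin1_mojibake(text: str) -> str:
    """Re-encode as Latin-1 and decode as UTF-8; return input if that fails."""
    try:
        return text.encode("latin_1").decode("utf_8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(undo_latin1_mojibake("\u00c3\u00a9t\u00c3\u00a9"))  # garbled input
print(undo_latin1_mojibake("\u00e9t\u00e9"))              # already clean
```

Clean text passes through unchanged because its Latin-1 bytes are not valid UTF-8, so the round trip fails and the original string is returned.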

Stream from HTTP

from bytesense import detect_stream
import urllib.request

with urllib.request.urlopen("https://example.com") as resp:
    result = detect_stream(
        iter(lambda: resp.read(1024), b""),
        stop_confidence=0.99,
    )
print(result.encoding)
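
The `iter(callable, sentinel)` form used above keeps calling `read(1024)` until it returns `b""`. A quick way to see the chunking without a network call is to substitute an in-memory stream:

```python
import io

stream = io.BytesIO(b"x" * 2500)

# Two-argument iter(): call the lambda repeatedly, stop when it returns b"".
chunks = list(iter(lambda: stream.read(1024), b""))

print([len(chunk) for chunk in chunks])  # [1024, 1024, 452]
```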

HTML/XML hints

from bytesense import best_hint, from_bytes

html = b'<meta charset="cp1252"><p>H\xebllo</p>'  # \xeb is "ë" in cp1252; bytes literals must be ASCII
hint = best_hint(html, headers={"Content-Type": "text/html"})
result = from_bytes(html)
print(hint, result.encoding)
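
For intuition, a declared charset can be pulled out of markup with a regular expression. This is a rough sketch under the assumption that only `<meta>` declarations matter; it is not how `best_hint` is implemented, and `declared_charset` is a hypothetical name.

```python
import codecs
import re
from typing import Optional

_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(markup: bytes, scan_limit: int = 4096) -> Optional[str]:
    """Return the canonical codec name declared in a <meta> tag, if any."""
    match = _META_CHARSET.search(markup[:scan_limit])
    if match is None:
        return None
    label = match.group(1).decode("ascii")
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None  # declared label is unknown to this Python build


print(declared_charset(b'<meta charset="cp1252"><p>H\xebllo</p>'))
print(declared_charset(b"<p>no declaration</p>"))
```

A declaration is only a hint: pages are often mislabelled, which is why the example above also runs `from_bytes` on the same payload.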

Multi-encoding documents

from bytesense import detect_multi

# Example: bytes from a legacy .eml or mixed scrape
your_bytes = b"..."  # replace with your document
result = detect_multi(your_bytes)
print(f"Uniform encoding: {result.is_uniform}")
for seg in result.segments:
    print(f"  [{seg.start}:{seg.end}] {seg.encoding}  {seg.text[:40]!r}")

Drop-in detect()

from bytesense import detect

print(detect(b"hello"))  # dict with keys "encoding", "confidence", "language"

More detail: MkDocs site · API · Quick start · Examples

Why bytesense

  • Decode as late as possible. Histograms, BOMs, and null-byte layout often rule out whole families of encodings before you spend CPU on full decodes.
  • Shortlist, then verify. Cosine similarity against pre-generated fingerprints (see scripts/build_fingerprints.py) keeps the expensive “mess + coherence” phase on a handful of candidates.
  • No black boxes. No training step, no weights to tune, no network calls—just tables and statistics you can inspect.
  • Rust is optional. pip install bytesense never requires a compiler; Rust only accelerates hot paths when a wheel matches your platform.
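
The shortlist idea can be illustrated with toy fingerprints. The sketch below builds 256-bin byte histograms and ranks candidates by cosine similarity; the fingerprints are derived from a single sentence for demonstration, whereas the real tables come from scripts/build_fingerprints.py, and all function names here are hypothetical.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """Relative frequency of each byte value 0..255."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(value, 0) / total for value in range(256)]


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(data: bytes, fingerprints: dict, top_n: int = 3) -> list:
    """Return the top_n codec names whose fingerprint best matches the data."""
    hist = byte_histogram(data)
    ranked = sorted(
        fingerprints,
        key=lambda name: cosine_similarity(hist, fingerprints[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy fingerprints from one Russian sentence (illustration only).
reference = "Привет, мир! Это пример текста."
fingerprints = {
    name: byte_histogram(reference.encode(name))
    for name in ("cp1251", "koi8_r", "utf_8")
}

query = "Это текст".encode("koi8_r")
print(shortlist(query, fingerprints, top_n=1))
```

Only the shortlisted codecs then need a full decode, which is where the CPU saving comes from.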

How it works (short)

  1. Fingerprint the byte distribution and compare to pre-computed vectors.
  2. Decode only the shortlisted encodings (strict), in a controlled order.
  3. Mess — score how “garbled” the decoded text looks (printable ratio, bigrams, etc.).
  4. Coherence — score language plausibility using character-frequency priors.
  5. Rank and return the best hypothesis plus a human-readable why.
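
Steps 2, 3, and 5 can be sketched as follows. The mess score here is a deliberately crude stand-in (share of control, private-use, or unassigned characters); bytesense's actual scoring uses more signals, and the coherence step is omitted. Function names are hypothetical.

```python
import unicodedata


def mess_score(text: str) -> float:
    """Share of characters that look like decoding debris (lower is better)."""
    if not text:
        return 0.0
    suspicious = sum(
        1
        for ch in text
        if unicodedata.category(ch) in {"Cc", "Co", "Cn"} and ch not in "\t\r\n"
    )
    return suspicious / len(text)


def rank_candidates(data: bytes, candidates: list) -> list:
    """Strictly decode each candidate, score it, and sort best first."""
    results = []
    for name in candidates:
        try:
            text = data.decode(name)  # errors="strict" is the default
        except (UnicodeDecodeError, LookupError):
            continue  # candidate ruled out by the strict decode
        results.append((name, mess_score(text)))
    return sorted(results, key=lambda item: item[1])


data = "Привет, мир".encode("utf_8")
ranked = rank_candidates(data, ["ascii", "utf_8", "latin_1"])
print(ranked)
```

In this run ascii is rejected by the strict decode, and latin_1 survives decoding but scores worse because UTF-8 continuation bytes turn into C1 control characters.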

Known limitations

  • Very short inputs (dozens of bytes) are inherently ambiguous; any detector will guess.
  • Mixed-language text can confuse language coherence.
  • Like any heuristic detector, adversarial or random binary data may yield a best-effort encoding with low confidence.
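
The short-input limitation is easy to demonstrate: a few bytes often decode cleanly under several codecs, and nothing in the bytes themselves says which reading is right. A minimal sketch with a hypothetical helper:

```python
def plausible_decodings(data: bytes, candidates: tuple) -> dict:
    """Map each codec that decodes the bytes without error to its text."""
    decoded = {}
    for name in candidates:
        try:
            decoded[name] = data.decode(name)
        except UnicodeDecodeError:
            pass
    return decoded


short = b"caf\xe9"
readings = plausible_decodings(
    short, ("utf_8", "latin_1", "cp1252", "cp1251", "koi8_r")
)
for name, text in readings.items():
    print(name, repr(text))
```

Four of the five candidates accept these four bytes, and two of them produce different but well-formed text, so a detector needs more context to choose.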

Contributing

Issues and PRs are welcome: CONTRIBUTING.md · Issues

License

MIT — see LICENSE.

Copyright © Oğuzhan Kır.
