bytesense

Fast, accurate charset/encoding detection. Zero ML. Zero dependencies. Optional Rust acceleration.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

oguzhankir

These details have not been verified by PyPI

Project description


Charset detection that stays fast, honest, and dependency-free.

Author: Oğuzhan Kır · GitHub · Documentation

bytesense reads raw bytes and tells you which encoding likely produced them—without shipping neural nets, without pulling in other Python packages at install time, and with an explainable why on every result. It is designed as a modern alternative to chardet and charset-normalizer for teams that want predictable performance and a small install footprint.

If you already use chardet.detect() or charset_normalizer.detect(), you can swap in bytesense.detect() with minimal code churn.

Comparison with chardet & charset-normalizer

Following the charset-normalizer README, this section gives a feature matrix and measured accuracy on the bundled suites, so you can see where bytesense stands from numbers rather than marketing claims.

Feature matrix

Feature	Chardet	Charset Normalizer	bytesense
Fast	✅	✅	✅
Universal¹	❌	✅	✅
Reliable without distinguishable standards	✅	✅	✅
Reliable with distinguishable standards	✅	✅	✅
License	MIT	MIT	MIT
Native Python (no C extension required)	✅	✅	✅
Optional native acceleration	—	—	Rust (`pip install "bytesense[fast]"`)
Detect spoken language	✅	✅	✅
Explainable detection (`why`)	Partial	Rich metadata	Yes (always)
Streaming-first API	Limited	Via API patterns	`StreamDetector`
Source / wheel size (typical)	~500 kB	~150 kB	~45 kB sdist (pure Python + tables; platform wheels add Rust binary)
Custom codecs via stdlib registration	❌	✅	✅ (same `codecs` model)

¹ Universal in the charset-normalizer sense: coverage follows what your Python build exposes through codecs as decode candidates—not a fixed “99 encodings” count on every platform.
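To see what "whatever your Python build exposes through codecs" can mean in practice, here is a minimal standard-library sketch that enumerates text codecs on the current interpreter. The helper name `available_text_codecs` is illustrative and is not part of bytesense; how bytesense assembles its own candidate list is not shown in this README.

```python
import codecs
from encodings.aliases import aliases


def available_text_codecs():
    """Return canonical names of codecs that decode bytes to str on this build."""
    names = set()
    for module_name in set(aliases.values()):
        try:
            info = codecs.lookup(module_name)
            # Probe with empty input: text codecs return str, while
            # bytes-to-bytes transforms such as base64 return bytes.
            decoded = codecs.decode(b"", module_name)
        except Exception:
            # Platform-specific codecs (for example mbcs outside Windows)
            # and non-text transforms can fail here; skip them.
            continue
        if isinstance(decoded, str):
            names.add(info.name)
    return sorted(names)


found = available_text_codecs()
print(len(found), "text codecs available on this build")
```

The count printed will vary between platforms and Python versions, which is the point of the footnote.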

Accuracy (bundled tests, three-way)

The numbers below come from the same ./scripts/compare_libraries.sh run that CI exercises: chardet, charset-normalizer, and bytesense on identical inputs. Strict = the detected codec name matches the reference after both are normalized through codecs.lookup aliases. Functional = decoding with the detected encoding yields the same Unicode text as decoding with the reference encoding, so several labels can count as correct if they decode to the same text (the same spirit as charset-normalizer’s comparisons).
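The two scoring rules can be sketched with the standard library alone. This is an illustration of the definitions above, not the code in scripts/compare_libraries.sh; the function names `strict_match` and `functional_match` are assumptions made for this example.

```python
import codecs


def strict_match(detected, reference):
    """True when both labels resolve to the same codec via codecs.lookup."""
    try:
        return codecs.lookup(detected).name == codecs.lookup(reference).name
    except LookupError:
        return False


def functional_match(data, detected, reference):
    """True when both encodings decode the bytes to identical text."""
    try:
        return data.decode(detected) == data.decode(reference)
    except (LookupError, UnicodeDecodeError):
        return False


sample = "café".encode("latin_1")
print(strict_match("utf8", "UTF-8"))                    # aliases of one codec
print(strict_match("cp1252", "latin_1"))                # different codecs
print(functional_match(sample, "cp1252", "latin_1"))    # same decoded text
```

The last two calls show why the Functional column can be higher than the Strict column: cp1252 and latin_1 are different codecs, yet they decode these particular bytes to the same string.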

39-case suite (synthetic + charset-normalizer’s published data/ samples, expected labels aligned with their tests):

Package	Strict	Functional
bytesense	100% (39/39)	100% (39/39)
charset-normalizer	92.3% (36/39)	97.4% (38/39)
chardet	79.5% (31/39)	97.4% (38/39)

24-case hard stress (paragraph-sized, ambiguous SBCS / CJK / UTF-16 / ISO-2022 — benchmarks/test_hard_scenarios.py):

Package	Strict	Functional
bytesense	83.3% (20/24)	100% (24/24)
charset-normalizer	41.7% (10/24)	58.3% (14/24)
chardet	70.8% (17/24)	91.7% (22/24)

Updated March 2026 — CPython 3.12, chardet 7.x, charset-normalizer 3.4.x, bytesense 0.1.x. Your CPU and dependency versions will differ; re-run ./scripts/compare_libraries.sh to refresh.

Speed (pytest-benchmark)

Same harness as the accuracy suite (benchmarks/test_bench_detection.py): bytesense from_bytes vs chardet detect on identical bytes. Times are mean per call (µs) from a single machine (CPython 3.12); your numbers will differ — see charset-normalizer’s performance section and chardet 7’s README for their published tables.

Sample	bytesense mean (µs)	chardet mean (µs)	Notes
`utf8_bom`	~8	~170	BOM fast path vs chardet full scan
`utf8_portuguese` (fast-path bench)	~8	—	Very small UTF-8 payload
`utf8_ascii_only` (full detection)	~70	—	Non-ASCII sentinel byte avoids pure-ASCII shortcut
`utf8_english_unicode`	~64	—	Latin + a few extended chars
`large_utf8_1mb`	~4,100	~1,300	Workload-dependent — on this run chardet was faster on the 1 MB buffer; always profile your own file sizes

charset-normalizer is exercised in the same file (test_bench_cn_* / charset_normalizer.from_bytes); install [dev] and run the benchmark group to print side-by-side timings for your CPU.

UTF-8 with BOM is an early path in bytesense; chardet still runs a heavier probe on many builds.
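As a rough illustration of why a BOM check is cheap, the sketch below tests a handful of byte prefixes from the standard `codecs` module. It is an assumption-laden example, not bytesense's implementation; `sniff_bom` is a hypothetical helper name.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
]


def sniff_bom(data):
    """Return a codec name if the buffer starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"hello"))
```

The returned names (`utf_8_sig`, `utf_16`, `utf_32`) are standard Python codecs that consume the BOM while decoding, so the mark does not leak into the decoded text.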

How to reproduce

pip install -e ".[dev]"
python scripts/build_fingerprints.py
python scripts/fetch_cn_benchmark_samples.py   # CN `data/` mirrors into benchmarks/data/cn_official/
./scripts/compare_libraries.sh                   # prints the tables above (accuracy + hard stress)

Options: ./scripts/compare_libraries.sh --no-fetch (no download), --no-hard (skip 24-case stress). Full benches: ./scripts/run_all_benchmarks.sh.

Fingerprint script: Lines like utf_16 “skipped” are normal — UTF-16/32 are handled via BOM / null-byte rules, not the single-byte histogram table.
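For BOM-less UTF-16, a null-byte rule can look roughly like the following. This is a simplified sketch under stated assumptions: the thresholds are arbitrary values chosen for the example, `guess_utf16_by_nulls` is a hypothetical name, and the rule only works for text dominated by characters below U+0100.

```python
def guess_utf16_by_nulls(data, min_null_ratio=0.4, max_other_ratio=0.05):
    """Guess UTF-16 byte order from where the zero bytes fall."""
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    if usable < 4:
        return None
    even = data[0:usable:2]
    odd = data[1:usable:2]
    even_nulls = even.count(0) / len(even)
    odd_nulls = odd.count(0) / len(odd)
    # Little-endian stores the low byte first, so mostly-Latin text has
    # its zero bytes at odd offsets; big-endian is the mirror image.
    if odd_nulls >= min_null_ratio and even_nulls <= max_other_ratio:
        return "utf_16_le"
    if even_nulls >= min_null_ratio and odd_nulls <= max_other_ratio:
        return "utf_16_be"
    return None


print(guess_utf16_by_nulls("hello world".encode("utf_16_le")))
print(guess_utf16_by_nulls("hello world".encode("utf_16_be")))
print(guess_utf16_by_nulls(b"hello world"))
```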

Troubleshooting fetch_cn_benchmark_samples.py (SSL on macOS): Run the script with the virtualenv interpreter (.venv/bin/python scripts/...); certifi is included in [dev]. If HTTPS still fails, run Install Certificates.command for your Python, or use python scripts/fetch_cn_benchmark_samples.py --insecure as a last resort.

Installation

pip install bytesense                 # pure Python — always works
pip install "bytesense[fast]"       # same API; uses Rust wheel when available for your platform

Test PyPI (pre-release smoke test): after publishing to test.pypi.org, install with an extra index so pip can fetch build tools from PyPI:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ bytesense

CLI

bytesense ships with a small CLI for files on disk.

usage: bytesense [-h] [-v] [-m] [--version] FILE [FILE ...]

positional arguments:
  FILE              File(s) to analyse

optional arguments:
  -h, --help        show this help and exit
  -v, --verbose     Include the ``why`` field in JSON (``confidence_interval`` is always shown)
  -m, --minimal     Print only the detected encoding name
  --version         Show version and exit

bytesense ./README.md
bytesense -m ./README.md
python -m bytesense.cli --version

stdout is JSON: a single object for one file, or a list of objects when several files are passed without -m:

{
  "encoding": "utf_8",
  "confidence": 0.98,
  "confidence_interval": [0.93, 1.0],
  "language": "English",
  "alternatives": [],
  "bom_detected": false,
  "chaos": 0.02,
  "coherence": 0.41,
  "byte_count": 2048,
  "path": "/absolute/path/to/file"
}

Use -v to add the human-readable why string to each object.
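If you consume the CLI output from another program, it can help to normalize the two JSON shapes into one. The sketch below assumes only the fields shown in the sample object above; `load_reports` and `below_confidence` are hypothetical helper names, and the embedded JSON is the sample from this README rather than real tool output.

```python
import json

SAMPLE_STDOUT = """
{
  "encoding": "utf_8",
  "confidence": 0.98,
  "confidence_interval": [0.93, 1.0],
  "language": "English",
  "alternatives": [],
  "bom_detected": false,
  "chaos": 0.02,
  "coherence": 0.41,
  "byte_count": 2048,
  "path": "/absolute/path/to/file"
}
"""


def load_reports(stdout_text):
    """Parse CLI stdout and always return a list of report dicts."""
    parsed = json.loads(stdout_text)
    return parsed if isinstance(parsed, list) else [parsed]


def below_confidence(reports, threshold):
    """Return the paths whose confidence falls under the threshold."""
    return [r["path"] for r in reports if r["confidence"] < threshold]


reports = load_reports(SAMPLE_STDOUT)
print(reports[0]["encoding"], below_confidence(reports, 0.99))
```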

Python

Full result object

from bytesense import from_bytes, from_path

result = from_path("notes.txt")
print(result.encoding, result.confidence, result.language)
print(result.why)

Streaming

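import requests  # third-party; `response` below is assumed to be a requests.Response
from bytesense import StreamDetector

# stream=True defers the body download so iter_content() yields chunks lazily
response = requests.get("https://example.com", stream=True)
det = StreamDetector()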
for chunk in response.iter_content(1024):
    det.feed(chunk)
    if det.confidence >= 0.99:
        break
print(det.encoding, det.language)
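One subtlety with chunked input is that a multi-byte character can straddle a chunk boundary, so each chunk cannot be validated in isolation. The standard-library sketch below shows the issue with an incremental decoder. It illustrates the general problem only and makes no claim about how StreamDetector handles it internally; `chunks_are_valid_utf8` is a hypothetical helper.

```python
import codecs


def chunks_are_valid_utf8(chunks):
    """Validate UTF-8 across chunk boundaries without joining the chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        for chunk in chunks:
            decoder.decode(chunk)  # buffers an incomplete trailing sequence
        decoder.decode(b"", final=True)  # raises if a sequence is left unfinished
    except UnicodeDecodeError:
        return False
    return True


# "é" is C3 A9 in UTF-8; the split below lands between its two bytes.
print(chunks_are_valid_utf8([b"h\xc3", b"\xa9llo"]))
print(chunks_are_valid_utf8([b"h\xc3"]))
```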

Repair mojibake

from bytesense import repair

garbled = "Ã©tÃ©"   # UTF-8 read as Latin-1
result = repair(garbled)
if result.improved:
    print(result.repaired)    # "été"
    print(result.chain)       # ("latin_1", "utf_8")
    print(result.improvement) # e.g. 0.34
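To see where the garbled string in this example comes from, the round trip can be reproduced with plain `str.encode` and `bytes.decode`. This sketch covers a single known pair of encodings; it is not how `repair` searches for a chain, and the helper names are made up for the illustration.

```python
def make_mojibake(text, right="utf_8", wrong="latin_1"):
    """Encode correctly, then decode with the wrong codec."""
    return text.encode(right).decode(wrong)


def undo_mojibake(garbled, right="utf_8", wrong="latin_1"):
    """Reverse the mistake; return None when the round trip is impossible."""
    try:
        return garbled.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


garbled = make_mojibake("été")
print(garbled)
print(undo_mojibake(garbled))
print(undo_mojibake("été"))  # already-correct text does not survive the reversal
```

The last call shows why a blind reversal needs a guard: text that was never garbled usually fails the reverse decode, which is one signal a repair routine can use to leave it alone.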

Stream from HTTP

from bytesense import detect_stream
import urllib.request

with urllib.request.urlopen("https://example.com") as resp:
    result = detect_stream(
        iter(lambda: resp.read(1024), b""),
        stop_confidence=0.99,
    )
print(result.encoding)
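The `iter(callable, sentinel)` form used above is a built-in Python idiom: it calls the function repeatedly and stops when the return value equals the sentinel. A small sketch with an in-memory stream (standing in for the HTTP response) shows the chunks it produces; `read_chunks` is an illustrative wrapper name.

```python
import io


def read_chunks(stream, size=1024):
    """Yield successive reads until read() returns empty bytes at EOF."""
    return iter(lambda: stream.read(size), b"")


fake_response = io.BytesIO(b"x" * 2500)
sizes = [len(chunk) for chunk in read_chunks(fake_response)]
print(sizes)
```

Any object with a `read(n)` method that returns `b""` at end of stream fits this pattern, including open binary files.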

HTML/XML hints

from bytesense import best_hint, from_bytes

# Bytes literals cannot contain non-ASCII characters, so encode a str instead.
html = '<meta charset="cp1252"><p>Hëllo</p>'.encode("cp1252")
hint = best_hint(html, headers={"Content-Type": "text/html"})
result = from_bytes(html)
print(hint, result.encoding)
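For intuition about what a declared-charset hint involves, here is a deliberately small regex sketch that pulls a label out of a `<meta>` tag and validates it with `codecs.lookup`. It is not `best_hint`'s implementation and ignores HTTP headers and XML declarations; `sniff_meta_charset` and the 1024-byte scan window are assumptions for the example.

```python
import codecs
import re

_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(html, scan_limit=1024):
    """Return the canonical codec name declared in a <meta> tag, or None."""
    match = _META_CHARSET.search(html[:scan_limit])
    if match is None:
        return None
    label = match.group(1).decode("ascii")
    try:
        # Declared labels are untrusted input; keep only ones Python knows.
        return codecs.lookup(label).name
    except LookupError:
        return None


page = '<meta charset="cp1252"><p>Hëllo</p>'.encode("cp1252")
print(sniff_meta_charset(page))
```

A declared label is only a hint: pages are frequently mislabeled, which is why the example above still runs detection on the bytes and prints both values.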

Multi-encoding documents

from bytesense import detect_multi

# Example: bytes from a legacy .eml or mixed scrape
your_bytes = b"..."  # replace with your document
result = detect_multi(your_bytes)
print(f"Uniform encoding: {result.is_uniform}")
for seg in result.segments:
    print(f"  [{seg.start}:{seg.end}] {seg.encoding} — {seg.text[:40]!r}")
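To make the problem concrete, the sketch below builds a small mixed-encoding buffer by hand and shows that no single strict decode recovers it, while per-segment decoding does. The segment boundaries here are known in advance because the example constructs them; finding them automatically is the job `detect_multi` is described as doing, and this sketch does not attempt it.

```python
utf8_part = "Grüße aus Köln. ".encode("utf_8")
cp1252_part = "Réponse: à bientôt.".encode("cp1252")
mixed = utf8_part + cp1252_part

# Hand-written (start, end, encoding) triples for this constructed buffer.
segments = [
    (0, len(utf8_part), "utf_8"),
    (len(utf8_part), len(mixed), "cp1252"),
]


def decodes_strictly(data, encoding):
    """True when the whole buffer decodes without errors."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def decode_segments(data, spans):
    """Decode each span with its own encoding and join the pieces."""
    return "".join(data[start:end].decode(enc) for start, end, enc in spans)


print(decodes_strictly(mixed, "utf_8"))   # the cp1252 tail is invalid UTF-8
print(mixed.decode("cp1252"))             # decodes, but the head is mojibake
print(decode_segments(mixed, segments))
```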

Drop-in detect()

from bytesense import detect

print(detect(b"hello"))  # dict with keys "encoding", "confidence", "language"

More detail: MkDocs site · API · Quick start · Examples

Why bytesense

Decode as late as possible. Histograms, BOMs, and null-byte layout often rule out whole families of encodings before you spend CPU on full decodes.
Shortlist, then verify. Cosine similarity against pre-generated fingerprints (see scripts/build_fingerprints.py) keeps the expensive “mess + coherence” phase on a handful of candidates.
No black boxes. No training step, no weights to tune, no network calls—just tables and statistics you can inspect.
Rust is optional. pip install bytesense never requires a compiler; Rust only accelerates hot paths when a wheel matches your platform.
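The "shortlist" idea can be sketched in a few lines: turn the input into a 256-bin byte histogram and rank stored histograms by cosine similarity. Everything in this example is a toy under stated assumptions: the fingerprints are built on the fly from one Russian sentence rather than loaded from the tables that scripts/build_fingerprints.py generates, and `byte_histogram`, `cosine_similarity`, and `shortlist` are illustrative names.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Relative frequency of each byte value 0..255."""
    total = len(data)
    if total == 0:
        return [0.0] * 256
    counts = Counter(data)
    return [counts[value] / total for value in range(256)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(data, fingerprints, top=2):
    """Return the `top` (score, name) pairs most similar to the input."""
    probe = byte_histogram(data)
    scored = [(cosine_similarity(probe, fp), name) for name, fp in fingerprints.items()]
    scored.sort(reverse=True)
    return scored[:top]


# Toy fingerprints: one reference sentence encoded three different ways.
REFERENCE = "Съешь же ещё этих мягких французских булок, да выпей чаю"
FINGERPRINTS = {
    name: byte_histogram(REFERENCE.encode(name))
    for name in ("cp1251", "koi8_r", "utf_8")
}

unknown = "В чащах юга жил бы цитрус? Да, но фальшивый экземпляр".encode("cp1251")
for score, name in shortlist(unknown, FINGERPRINTS):
    print(f"{name}: {score:.3f}")
```

No decoding happens during the ranking, which is the "decode as late as possible" principle: only the few best-scoring candidates would go on to the more expensive verification phase.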

How it works (short)

Fingerprint the byte distribution and compare to pre-computed vectors.
Decode only the shortlisted encodings (strict), in a controlled order.
Mess — score how “garbled” the decoded text looks (printable ratio, bigrams, etc.).
Coherence — score language plausibility using character-frequency priors.
Rank and return the best hypothesis plus a human-readable why.
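The decode-and-score steps above can be approximated as follows. This is a much-reduced sketch: the "mess" score here only counts control and unassigned characters, whereas the description above also mentions printable ratio and bigrams, and the coherence step is omitted entirely. `mess_score` and `rank_candidates` are hypothetical names.

```python
import unicodedata

_SUSPICIOUS = {"Cc", "Cn", "Co", "Cs"}  # control, unassigned, private use, surrogate


def mess_score(text):
    """Fraction of characters that look like decoding debris (0.0 is clean)."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch not in "\t\n\r" and unicodedata.category(ch) in _SUSPICIOUS
    )
    return bad / len(text)


def rank_candidates(data, candidates):
    """Strict-decode each candidate and sort survivors by ascending mess."""
    ranked = []
    for name in candidates:
        try:
            text = data.decode(name, errors="strict")
        except (UnicodeDecodeError, LookupError):
            continue  # a strict failure eliminates the candidate outright
        ranked.append((mess_score(text), name))
    ranked.sort(key=lambda pair: pair[0])  # stable: ties keep candidate order
    return ranked


data = "it’s “quoted”".encode("utf_8")
for score, name in rank_candidates(data, ["ascii", "latin_1", "utf_8"]):
    print(f"{name}: mess={score:.2f}")
```

Here `ascii` is rejected by the strict decode, `latin_1` survives but turns the curly quotes into C1 control characters, and `utf_8` decodes cleanly, so it ranks first.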

Known limitations

Very short inputs (dozens of bytes) are inherently ambiguous; any detector will guess.
Mixed-language text can confuse language coherence.
As with any heuristic detector, adversarial or random binary data may yield a best-effort encoding with low confidence.
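The first limitation is easy to demonstrate with the standard library: a three-byte input decodes without error under several single-byte encodings, each giving different text, and nothing in the bytes themselves says which reading was intended. `plausible_decodings` is an illustrative helper, not a bytesense function.

```python
def plausible_decodings(data, candidates):
    """Map each candidate that strict-decodes the bytes to its decoded text."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue
    return results


short_input = b"\xe9t\xe9"
for name, text in plausible_decodings(
    short_input, ["utf_8", "latin_1", "cp1251", "koi8_r"]
).items():
    print(f"{name}: {text!r}")
```

Only `utf_8` is ruled out here; the remaining candidates all produce valid, different strings, so a detector has to fall back on priors about language and byte frequency, which are weak on so little data.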

Contributing

Issues and PRs are welcome: CONTRIBUTING.md · Issues

License

MIT — see LICENSE.

Project details

Release history

0.1.2 (this version): Mar 26, 2026
0.1.1: Mar 26, 2026
0.1.0: Mar 26, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bytesense-0.1.2.tar.gz (44.2 kB)

Uploaded Mar 26, 2026 Source

Built Distributions


bytesense-0.1.2-cp312-cp312-win_amd64.whl (130.9 kB)

Uploaded Mar 26, 2026, CPython 3.12, Windows x86-64

bytesense-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl (224.0 kB)

Uploaded Mar 26, 2026, CPython 3.12, manylinux: glibc 2.34+ x86-64

bytesense-0.1.2-cp312-cp312-macosx_11_0_arm64.whl (209.5 kB)

Uploaded Mar 26, 2026, CPython 3.12, macOS 11.0+ ARM64

File details

Details for the file bytesense-0.1.2.tar.gz.

File metadata

Download URL: bytesense-0.1.2.tar.gz
Upload date: Mar 26, 2026
Size: 44.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bytesense-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`dc12ac0f8efb75cb9ba68bcf063ea1dfcb32c5ebc05d5713ce7285e18010fd35`
MD5	`87dce94184a94919b6b831645e0bb231`
BLAKE2b-256	`8effbaf933f21cdbf0423dd91da635e1fad2fd575fb92c2990d7b88e87aa563e`


Provenance

The following attestation bundles were made for bytesense-0.1.2.tar.gz:

Publisher: release.yml on oguzhankir/bytesense

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bytesense-0.1.2.tar.gz
- Subject digest: dc12ac0f8efb75cb9ba68bcf063ea1dfcb32c5ebc05d5713ce7285e18010fd35
- Sigstore transparency entry: 1186698436
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: oguzhankir/bytesense@bce7a33178d56371954a01509d3e1cd9490cf435
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/oguzhankir
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bce7a33178d56371954a01509d3e1cd9490cf435
- Trigger Event: release

File details

Details for the file bytesense-0.1.2-cp312-cp312-win_amd64.whl.

File metadata

Download URL: bytesense-0.1.2-cp312-cp312-win_amd64.whl
Upload date: Mar 26, 2026
Size: 130.9 kB
Tags: CPython 3.12, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bytesense-0.1.2-cp312-cp312-win_amd64.whl
Algorithm	Hash digest
SHA256	`83c67a7054e13c794a31e1e84901b563ff28a55c3ba7bc5f29856e99f9752391`
MD5	`e7efdbcb91ad49a38c084d0f6833bd83`
BLAKE2b-256	`5ed28c975bff3cb915bd355a2c26e35f83ef25b649848309449f9895679382ef`


Provenance

The following attestation bundles were made for bytesense-0.1.2-cp312-cp312-win_amd64.whl:

Publisher: release.yml on oguzhankir/bytesense

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bytesense-0.1.2-cp312-cp312-win_amd64.whl
- Subject digest: 83c67a7054e13c794a31e1e84901b563ff28a55c3ba7bc5f29856e99f9752391
- Sigstore transparency entry: 1186698446
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: oguzhankir/bytesense@bce7a33178d56371954a01509d3e1cd9490cf435
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/oguzhankir
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bce7a33178d56371954a01509d3e1cd9490cf435
- Trigger Event: release

File details

Details for the file bytesense-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

Download URL: bytesense-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl
Upload date: Mar 26, 2026
Size: 224.0 kB
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bytesense-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`ccb120db85b857973e457426d38cd12bf858c9fcdaf0a9a7758ada5941fdf5f6`
MD5	`18cb3084658416afbc552d1b4a6fbcd1`
BLAKE2b-256	`841089ad34bb40f0603dd525baec20f7dc4d401a31856342ac8316775ce3aced`


Provenance

The following attestation bundles were made for bytesense-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: release.yml on oguzhankir/bytesense

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bytesense-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl
- Subject digest: ccb120db85b857973e457426d38cd12bf858c9fcdaf0a9a7758ada5941fdf5f6
- Sigstore transparency entry: 1186698442
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: oguzhankir/bytesense@bce7a33178d56371954a01509d3e1cd9490cf435
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/oguzhankir
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bce7a33178d56371954a01509d3e1cd9490cf435
- Trigger Event: release

File details

Details for the file bytesense-0.1.2-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

Download URL: bytesense-0.1.2-cp312-cp312-macosx_11_0_arm64.whl
Upload date: Mar 26, 2026
Size: 209.5 kB
Tags: CPython 3.12, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bytesense-0.1.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`553a0edf1e5a9c3c98157f75117a760e5467fd53caa6d670cf1387db5b98669e`
MD5	`d387121714a83d0a42bc87d29332aae5`
BLAKE2b-256	`dfbdea23cd79c3600f4df2db20569003e7e8ed6ae068680841230568abe8bfe6`


Provenance

The following attestation bundles were made for bytesense-0.1.2-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: release.yml on oguzhankir/bytesense

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bytesense-0.1.2-cp312-cp312-macosx_11_0_arm64.whl
- Subject digest: 553a0edf1e5a9c3c98157f75117a760e5467fd53caa6d670cf1387db5b98669e
- Sigstore transparency entry: 1186698451
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: oguzhankir/bytesense@bce7a33178d56371954a01509d3e1cd9490cf435
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/oguzhankir
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bce7a33178d56371954a01509d3e1cd9490cf435
- Trigger Event: release
