cChardet
cChardet is a high-speed universal character encoding detector, implemented as a binding to uchardet.
Supported Languages/Encodings
- International (Unicode)
  - UTF-8
  - UTF-16BE / UTF-16LE
  - UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
- Arabic
  - ISO-8859-6
  - WINDOWS-1256
- Bulgarian
  - ISO-8859-5
  - WINDOWS-1251
- Chinese
  - ISO-2022-CN
  - BIG5
  - EUC-TW
  - GB18030
  - HZ-GB-2312
- Croatian
  - ISO-8859-2
  - ISO-8859-13
  - ISO-8859-16
  - Windows-1250
  - IBM852
  - MAC-CENTRALEUROPE
- Czech
  - Windows-1250
  - ISO-8859-2
  - IBM852
  - MAC-CENTRALEUROPE
- Danish
  - ISO-8859-1
  - ISO-8859-15
  - WINDOWS-1252
- English
  - ASCII
- Esperanto
  - ISO-8859-3
- Estonian
  - ISO-8859-4
  - ISO-8859-13
  - Windows-1252
  - Windows-1257
- Finnish
  - ISO-8859-1
  - ISO-8859-4
  - ISO-8859-9
  - ISO-8859-13
  - ISO-8859-15
  - WINDOWS-1252
- French
  - ISO-8859-1
  - ISO-8859-15
  - WINDOWS-1252
- German
  - ISO-8859-1
  - WINDOWS-1252
- Greek
  - ISO-8859-7
  - WINDOWS-1253
- Hebrew
  - ISO-8859-8
  - WINDOWS-1255
- Hungarian
  - ISO-8859-2
  - WINDOWS-1250
- Irish Gaelic
  - ISO-8859-1
  - ISO-8859-9
  - ISO-8859-15
  - WINDOWS-1252
- Italian
  - ISO-8859-1
  - ISO-8859-3
  - ISO-8859-9
  - ISO-8859-15
  - WINDOWS-1252
- Japanese
  - ISO-2022-JP
  - SHIFT_JIS
  - EUC-JP
- Korean
  - ISO-2022-KR
  - EUC-KR / UHC
- Lithuanian
  - ISO-8859-4
  - ISO-8859-10
  - ISO-8859-13
- Latvian
  - ISO-8859-4
  - ISO-8859-10
  - ISO-8859-13
- Maltese
  - ISO-8859-3
- Polish
  - ISO-8859-2
  - ISO-8859-13
  - ISO-8859-16
  - Windows-1250
  - IBM852
  - MAC-CENTRALEUROPE
- Portuguese
  - ISO-8859-1
  - ISO-8859-9
  - ISO-8859-15
  - WINDOWS-1252
- Romanian
  - ISO-8859-2
  - ISO-8859-16
  - Windows-1250
  - IBM852
- Russian
  - ISO-8859-5
  - KOI8-R
  - WINDOWS-1251
  - MAC-CYRILLIC
  - IBM866
  - IBM855
- Slovak
  - Windows-1250
  - ISO-8859-2
  - IBM852
  - MAC-CENTRALEUROPE
- Slovene
  - ISO-8859-2
  - ISO-8859-16
  - Windows-1250
  - IBM852
  - M
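Some of the encoding names listed above may not be registered as Python codecs under that exact spelling, so it can help to check a reported name before decoding with it. The following is a minimal sketch under that assumption; `python_codec_for` is a hypothetical helper written for illustration and is not part of cChardet.

```python
import codecs


def python_codec_for(name):
    """Return Python's canonical codec name for `name`, or None if unknown.

    Hypothetical helper for illustration; not part of cChardet.
    """
    if not name:
        return None
    try:
        # codecs.lookup() resolves aliases and raises LookupError when the
        # name is not known to Python's codec registry.
        return codecs.lookup(name).name
    except LookupError:
        return None


for label in ("UTF-8", "WINDOWS-1252", "X-ISO-10646-UCS-4-34121"):
    print(label, "->", python_codec_for(label))
```

A `None` result means the bytes need special handling (or a manual mapping) before they can be decoded in Python.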
Example
```python
# -*- coding: utf-8 -*-
import cchardet as chardet

# Read the sample file as raw bytes; detection works on bytes, not text.
with open(r"src/tests/samples/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()
    result = chardet.detect(msg)
    print(result)
```
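The example above stops at printing the detection result. A common next step is to decode the bytes with the reported encoding. The sketch below assumes the result is a dict with an `"encoding"` key, which may be `None` when nothing could be detected; `decode_detected` and its `fallback` parameter are hypothetical names, not part of cChardet.

```python
def decode_detected(data, detect, fallback="utf-8"):
    """Decode `data` using the encoding reported by `detect`.

    `detect` is any callable that returns a dict with an "encoding" key,
    for example cchardet.detect. Hypothetical helper, not part of cChardet.
    """
    result = detect(data)
    # Fall back when the detector reports no encoding at all.
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode leniently instead.
        return data.decode(fallback, errors="replace")
```

With cChardet installed, this could be called as `decode_detected(msg, chardet.detect)` after the `import cchardet as chardet` line shown in the example. Passing the detector in as an argument keeps the helper usable with chardet as well.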
Benchmark
$ cd src/
$ pip install chardet
$ python tests/bench.py
Results
CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
RAM: DDR3 1600MHz 16GB
Platform: Ubuntu 16.04 amd64
Python 2.7.13
| Library | Request (call/s) |
|---|---|
| chardet v3.0.2 | 0.36 |
| cchardet v2.0.1 | 1396.42 |
Python 3.6.1
| Library | Request (call/s) |
|---|---|
| chardet v3.0.2 | 0.35 |
| cchardet v2.0.1 | 1467.77 |
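For readers who want to reproduce a call/s figure on their own data, the measurement can be sketched roughly as follows. This is not the project's `tests/bench.py`; `calls_per_second` and its `repeat` parameter are hypothetical, and the sketch targets Python 3 (`time.perf_counter`).

```python
import time


def calls_per_second(detect, data, repeat=100):
    """Return how many `detect(data)` calls complete per second.

    Hypothetical helper for illustration; not taken from tests/bench.py.
    """
    start = time.perf_counter()
    for _ in range(repeat):
        detect(data)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very coarse timers.
    return repeat / elapsed if elapsed > 0 else float("inf")
```

Calling it once with `cchardet.detect` and once with `chardet.detect` on the same bytes would give two comparable rates, assuming both packages are installed.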
LICENSE
See COPYING file.
CHANGES
2.1.4 (2018-09-26)
- disable LTO because it degraded performance
2.1.3 (2018-09-26)
- support Python 3.7
2.1.2 (2018-09-26)
- enable LTO for wheel builds
- update Cython
2.1.1 (2017-07-01)
- fix differing results with different chunk sizes
- fix assignments to nsSMState in nsCodingStateMachine that resulted in unspecified behavior
- include COPYING in package
2.0.1 (2017-04-25)
2.0.0 (2017-04-06)
- Improve tests
2.0a4 (2017-04-05)
- Update uchardet repo (Fix buffer overflow)
2.0a3 (2017-03-29)
- Implement UniversalDetector (like chardet)
2.0a2 (2017-03-28)
- Update uchardet repo (Fix memory leak)
2.0a1 (2017-03-28)
- Replace uchardet-enhanced with uchardet
- Remove Detector class
1.1.3 (2017-02-26)
- Support AArch64
1.1.2 (2017-01-08)
- Support Python 3.6
1.1.1 (2016-11-05)
- Use len() function (9e61cb9e96b138b0d18e5f9e013e144202ae4067)
- Remove detect function in _cchardet.pyx (25b581294fc0ae8f686ac9972c8549666766f695)
- Support manylinux1 wheel
1.1.0 (2016-10-17)
- Add Detector class
- Improve unit tests
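The 2.0a3 entry above describes `UniversalDetector` as being like chardet's, which suggests incremental use: feed chunks until the detector is confident, then close it and read the result. The sketch below assumes the detector exposes `feed()`, `close()`, `done` and `result` in that style; `detect_stream` is a hypothetical helper, not part of cChardet.

```python
def detect_stream(chunks, detector):
    """Feed byte chunks to `detector` and return its final result.

    `detector` is assumed to offer feed(), close(), `done` and `result`
    in the style of chardet's UniversalDetector. Hypothetical helper.
    """
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            # The detector is already confident; skip the remaining input.
            break
    detector.close()
    return detector.result
```

With cChardet installed, a call might look like `detect_stream(f, cchardet.UniversalDetector())`, where `f` is a file opened in `"rb"` mode so that iteration yields bytes.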