Universal encoding detector. This library is faster than chardet.
Project description
cChardet is a high-speed universal character encoding detector, implemented as a binding to charsetdetect.
Supported codecs
Big5
EUC-JP
EUC-KR
GB18030
HZ-GB-2312
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-2
ISO-8859-5
ISO-8859-7
ISO-8859-8
KOI8-R
Shift_JIS
TIS-620
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
EUC-TW
X-ISO-10646-UCS-4-2143
X-ISO-10646-UCS-4-3412
x-mac-cyrillic
Requires
Cython: http://www.cython.org/
For example, on Ubuntu 12.04:
$ sudo apt-get install build-essential python-dev cython
Installation
$ cd /tmp
$ git clone git://github.com/PyYoshi/cChardet.git
$ cd cChardet
$ python setup.py build
$ sudo python setup.py install
or
$ sudo easy_install cchardet
Example
# -*- coding: utf-8 -*-
import cchardet as chardet

with open(r"test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()
    result = chardet.detect(msg)
    print(result)
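A common next step is to decode the bytes with the detected encoding. The following is a minimal sketch, not part of cChardet itself. It assumes the detector returns a dictionary with "encoding" and "confidence" keys, as chardet-style detectors do. The helper name decode_detected and the parameters min_confidence and fallback are hypothetical and chosen for illustration only.

```python
def decode_detected(data, detect, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using the encoding reported by a chardet-style detector.

    `detect` is any callable that takes bytes and returns a dict with
    "encoding" and "confidence" keys (assumed shape, see the note above).
    """
    result = detect(data)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Fall back when nothing was detected or the guess is too weak.
    if not encoding or confidence < min_confidence:
        encoding = fallback

    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The detector reported a name Python has no codec for.
        return data.decode(fallback, errors="replace")
```

With cChardet installed, this could be called as decode_detected(msg, chardet.detect), where msg and chardet are the names from the example above. Passing the detector as an argument keeps the helper usable with either chardet or cchardet.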
Test
$ sudo pip install -U chardet nose    (or: sudo easy_install -U chardet nose)
$ cd test
$ nosetests --nocapture tests.py
Benchmark
code: tests.TestCchardetSpeed
sample: test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt
Performance:
CPU: Intel Core i7 860 2.8GHz
RAM: DDR3-1333 16GB
Platform: Kubuntu 12.04 amd64, Python 2.7.3 64-bit
Result:
chardet: 0.32 (call/s)
cchardet: 975.32 (call/s)
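A calls-per-second figure like the ones above could be measured roughly as follows. This is only a sketch under assumptions and is not the code in tests.TestCchardetSpeed. The function name calls_per_second and the repeat parameter are hypothetical.

```python
from timeit import default_timer


def calls_per_second(detect, data, repeat=100):
    """Return how many detect(data) calls complete per second."""
    start = default_timer()
    for _ in range(repeat):
        detect(data)
    elapsed = default_timer() - start
    # Guard against a zero reading on very coarse timers.
    return repeat / elapsed if elapsed > 0 else float("inf")
```

To compare the two libraries, the same sample bytes would be passed once with chardet.detect and once with cchardet.detect as the detect argument.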
License
The MIT License: src/cchardet
Other libraries' licenses: see the src/ext directory.
Thanks
Contact
Sorry for my poor English :)