Encoding detection collection for Python.
Project description
**For the current version
please /download/install cssutils
which includes v0.9.8 of encutils as well.
Files here have been deleted and
it is only kept here for reference.**
===================================================
encutils - encoding detection collection for Python
===================================================
:Version: 0.9
:Author: Christof Hoeke, see http://cthedot.de/encutils/
:Contributor: Robert Siemer
:Copyright: 2005-2009: Christof Hoeke
:License: encutils has a dual-license, please choose whatever you prefer:
* encutils is published under the
`LGPL 3 or later <http://cthedot.de/encutils/license/>`__
* encutils is published under the
`Creative Commons License <http://creativecommons.org/licenses/by/3.0/>`__.
encutils is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
encutils is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License
along with encutils. If not, see <http://www.gnu.org/licenses/>.
A collection of helper functions to detect encodings of text files (like HTML, XHTML, XML, CSS, etc.) retrieved via HTTP, file or string.
:func:`getEncodingInfo` is probably the main function of interest which uses
other supplied functions itself and gathers all information together and
supplies an :class:`EncodingInfo` object.
example::
>>> import encutils
>>> info = encutils.getEncodingInfo(url='http://cthedot.de/encutils/')
>>> print info # = str(info)
utf-8
>>> print repr(info) # doctest:+ELLIPSIS
<encutils.EncodingInfo object encoding='utf-8' mismatch=False at...>
>>> print info.logtext
HTTP media_type: text/html
HTTP encoding: utf-8
Encoding (probably): utf-8 (Mismatch: False)
<BLANKLINE>
references
XML
RFC 3023 (http://www.ietf.org/rfc/rfc3023.txt)
easier explained in
- http://feedparser.org/docs/advanced.html
- http://www.xml.com/pub/a/2004/07/21/dive.html
HTML
http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2
TODO
- parse @charset of HTML elements?
- check for more texttypes if only text given
please /download/install cssutils
which includes v0.9.8 of encutils as well.
Files here have been deleted and
it is only kept here for reference.**
===================================================
encutils - encoding detection collection for Python
===================================================
:Version: 0.9
:Author: Christof Hoeke, see http://cthedot.de/encutils/
:Contributor: Robert Siemer
:Copyright: 2005-2009: Christof Hoeke
:License: encutils has a dual-license, please choose whatever you prefer:
* encutils is published under the
`LGPL 3 or later <http://cthedot.de/encutils/license/>`__
* encutils is published under the
`Creative Commons License <http://creativecommons.org/licenses/by/3.0/>`__.
encutils is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
encutils is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License
along with encutils. If not, see <http://www.gnu.org/licenses/>.
A collection of helper functions to detect encodings of text files (like HTML, XHTML, XML, CSS, etc.) retrieved via HTTP, file or string.
:func:`getEncodingInfo` is probably the main function of interest which uses
other supplied functions itself and gathers all information together and
supplies an :class:`EncodingInfo` object.
example::
>>> import encutils
>>> info = encutils.getEncodingInfo(url='http://cthedot.de/encutils/')
>>> print info # = str(info)
utf-8
>>> print repr(info) # doctest:+ELLIPSIS
<encutils.EncodingInfo object encoding='utf-8' mismatch=False at...>
>>> print info.logtext
HTTP media_type: text/html
HTTP encoding: utf-8
Encoding (probably): utf-8 (Mismatch: False)
<BLANKLINE>
references
XML
RFC 3023 (http://www.ietf.org/rfc/rfc3023.txt)
easier explained in
- http://feedparser.org/docs/advanced.html
- http://www.xml.com/pub/a/2004/07/21/dive.html
HTML
http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2
TODO
- parse @charset of HTML elements?
- check for more texttypes if only text given