Infer HTML encoding from response headers & content
guessenc
Infer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.
Basic Usage
The main function exported by guessenc is infer_encoding().
>>> import requests
>>> from guessenc import infer_encoding
>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.raise_for_status()
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')
This tells us that the detected encoding is cp1256, and that it was found in a <meta> tag with http-equiv='Content-Type'.
Detail on the signature of infer_encoding():
def infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None
) -> Pair:
    ...
The content parameter represents the page HTML, such as response.content.
The headers parameter represents the HTTP response headers, such as response.headers. If provided, this should be a data structure supporting case-insensitive lookup, such as requests.structures.CaseInsensitiveDict or multidict.CIMultiDict.
Both parameters are optional.
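To illustrate why the headers argument should support case-insensitive lookup, here is a minimal sketch using only the standard library. The LowercaseKeyHeaders class is a hypothetical stand-in written for this example; in practice requests.structures.CaseInsensitiveDict or multidict.CIMultiDict would play this role.

```python
from collections.abc import Mapping
from typing import Dict, Iterator


class LowercaseKeyHeaders(Mapping):
    """Hypothetical read-only mapping that ignores key case on lookup."""

    def __init__(self, raw: Dict[str, str]) -> None:
        # Store every key lowercased so lookups can be normalised the same way.
        self._store = {key.lower(): value for key, value in raw.items()}

    def __getitem__(self, key: str) -> str:
        return self._store[key.lower()]

    def __iter__(self) -> Iterator[str]:
        return iter(self._store)

    def __len__(self) -> int:
        return len(self._store)


plain = {"content-type": "text/html; charset=utf-8"}
headers = LowercaseKeyHeaders(plain)

# A plain dict misses the header when the casing differs...
plain_lookup = plain.get("Content-Type")
# ...while the case-insensitive mapping still finds it.
ci_lookup = headers.get("Content-Type")
```

Servers vary in how they capitalise header names, so a plain dict built from raw headers can silently hide the charset from any code that looks it up by a fixed spelling.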
The return type is a tuple.
The first element of the tuple is a member of the Source enum (see Search Process below). The source indicates where the detected encoding comes from.
The second element of the tuple is either a str, which is the canonical name of the detected encoding, or None if no encoding is found.
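Because the second element may be None, calling code usually needs a fallback before decoding. The following is a small sketch of one way to handle that; decode_body and its utf-8 default are assumptions made for this example and are not part of guessenc.

```python
from typing import Optional, Tuple


def decode_body(
    content: bytes, encoding: Optional[str], fallback: str = "utf-8"
) -> Tuple[str, str]:
    """Decode content with the detected encoding, or a fallback when it is None."""
    chosen = encoding or fallback
    # errors="replace" avoids an exception on bytes that do not fit the codec.
    return content.decode(chosen, errors="replace"), chosen


# With guessenc installed, the call site would look roughly like:
#     source, encoding = infer_encoding(resp.content, resp.headers)
#     text, used = decode_body(resp.content, encoding)
arabic = "مرحبا".encode("cp1256")
text, used = decode_body(arabic, "cp1256")
text_none, used_none = decode_body(b"plain ascii", None)
```

Returning the codec that was actually used alongside the text makes it easy to log when the fallback was taken.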
Where Do Other Libraries Fall Short?
The requests library "[follows] RFC 2616 to the letter" in using the HTTP headers to determine the encoding of the response content. This means, among other things, using ISO-8859-1 as a fallback if no charset is given, despite the fact that UTF-8 has absolutely dwarfed all other encodings in usage on web pages.
# requests/adapters.py
response.encoding = get_encoding_from_headers(response.headers)
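A short illustration of what that ISO-8859-1 fallback does to a page that is really UTF-8. The sample string is made up for this example; only built-in str and bytes methods are used.

```python
page = "café"
raw = page.encode("utf-8")  # the bytes a UTF-8 server would send

# Decoding with the ISO-8859-1 fallback never raises, but it garbles
# every non-ASCII character into two unrelated ones.
as_latin1 = raw.decode("iso-8859-1")
as_utf8 = raw.decode("utf-8")

print(as_latin1)  # cafÃ©
print(as_utf8)    # café
```

Since ISO-8859-1 maps every byte to some character, the wrong guess produces no error at all, which is why the mistake tends to go unnoticed until the text is displayed.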
If requests does not find an HTTP Content-Type header at all, it will fall back to detection via chardet rather than looking in the HTML tags for meaningful information. There's nothing at all wrong with this; it just means that the requests maintainers have chosen to focus on the power of requests as an HTTP library, not an HTML library. If you want more fine-grained control over encoding detection, try infer_encoding().
This is not to single out requests, either. Other libraries do the same dance with encoding detection: aiohttp checks the Content-Type header, or otherwise defaults to UTF-8 without looking anywhere else.
Search Process
The function guessenc.infer_encoding() looks in a handful of places to extract an encoding, in this order, and stops when it finds one:
- In the charset value from the Content-Type HTTP entity header.
- In the charset value from a <meta charset="xxxx"> HTML tag.
- In the charset value from a <meta> tag with http-equiv="Content-Type".
- Using the chardet library.
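The search order above can be sketched with the standard library alone. This is an illustration of the idea, not guessenc's actual implementation: the regular expressions are deliberately simplified, the function names and string labels are invented for this example, and the final chardet step is left out so the block has no third-party dependencies.

```python
import re
from email.message import Message
from typing import Mapping, Optional, Tuple

# Simplified patterns: real-world HTML needs more tolerant parsing than this.
META_CHARSET_RE = re.compile(rb"""<meta\s+charset=["']?([\w-]+)""", re.IGNORECASE)
META_HTTP_EQUIV_RE = re.compile(
    rb"""<meta\s+http-equiv=["']?content-type["']?\s+content=["'][^"']*charset=([\w-]+)""",
    re.IGNORECASE,
)


def charset_from_header(headers: Mapping[str, str]) -> Optional[str]:
    """Pull the charset parameter out of a Content-Type header value."""
    value = headers.get("Content-Type")
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset()


def sketch_search(
    content: bytes, headers: Mapping[str, str]
) -> Tuple[str, Optional[str]]:
    """Try each place in order and stop at the first hit (chardet step omitted)."""
    charset = charset_from_header(headers)
    if charset:
        return "header", charset
    for label, pattern in (
        ("meta charset", META_CHARSET_RE),
        ("meta http-equiv", META_HTTP_EQUIV_RE),
    ):
        match = pattern.search(content)
        if match:
            return label, match.group(1).decode("ascii").lower()
    return "not found", None


html = b"<meta http-equiv='Content-Type' content='text/html; charset=cp1256'>"
result = sketch_search(html, {"Content-Type": "text/html"})
```

Here the header carries no charset, so the sketch falls through to the HTML and reports the value declared in the http-equiv tag. Unlike infer_encoding(), it does not normalise the name to a canonical codec name.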
Each of the above "sources" is signified by a corresponding member of the Source enum:
class Source(enum.Enum):
    """Indicates where our detected encoding came from."""
    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4
If none of the 4 sources from the list above returns a viable encoding, this is indicated by Source.COULD_NOT_DETECT.
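One way calling code might branch on the source is sketched below. The Source class here is a local copy of the enum shown above so that the example runs without guessenc installed, and describe() is a hypothetical helper, not something the library provides.

```python
import enum
from typing import Optional, Tuple


class Source(enum.Enum):
    """Local copy of the enum shown above, for illustration only."""
    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4


def describe(result: Tuple[Source, Optional[str]]) -> str:
    """Summarise a (source, encoding) pair for logging."""
    source, encoding = result
    if source is Source.COULD_NOT_DETECT:
        return "no encoding detected"
    if source is Source.CHARDET:
        # chardet infers the encoding from byte patterns rather than a declaration.
        return f"{encoding} (statistical guess)"
    return f"{encoding} (declared via {source.name})"


declared = describe((Source.META_HTTP_EQUIV, "cp1256"))
guessed = describe((Source.CHARDET, "utf-8"))
missing = describe((Source.COULD_NOT_DETECT, None))
```

Keeping the source next to the encoding lets an application treat a declared charset differently from a statistical guess, for example by logging the latter for review.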