
# guessenc

Infer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.

## Basic Usage

The main function exported by `guessenc` is `infer_encoding()`.

```python
>>> import requests
>>> from guessenc import infer_encoding

>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.raise_for_status()
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')
```


This tells us that the detected encoding is `cp1256`, and that it was retrieved from a `<meta>` HTML tag with `http-equiv="Content-Type"`.

Detail on the signature of `infer_encoding()`:

```python
def infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None,
) -> Pair:
    ...
```


The `content` parameter represents the page HTML, such as `response.content`.

The `headers` parameter represents the HTTP response headers, such as `response.headers`. If provided, this should be a data structure supporting case-insensitive lookup, such as `requests.structures.CaseInsensitiveDict` or `multidict.CIMultiDict`.
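To make the case-insensitivity requirement concrete, here is a minimal, illustrative stand-in (not part of guessenc or requests) that lower-cases keys on the way in and out:

```python
class CIDict(dict):
    """Toy case-insensitive dict, for illustration only.

    Real code should use requests.structures.CaseInsensitiveDict
    or multidict.CIMultiDict instead.
    """

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def get(self, key, default=None):
        return super().get(key.lower(), default)


headers = CIDict()
headers["Content-Type"] = "text/html; charset=cp1256"
print(headers["content-type"])  # same entry, regardless of key casing
```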

Both parameters are optional.

The return type is a two-tuple:

- The first element is a member of the `Source` enum (see Search Process below), indicating where the detected encoding comes from.
- The second element is either a `str`, which is the canonical name of the detected encoding, or `None` if no encoding is found.
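"Canonical name" can be read as the normalized name known to Python's codec registry. As a stdlib-only illustration (independent of guessenc), `codecs.lookup()` collapses spelling variants into one canonical form:

```python
import codecs

# Several spellings of the same encoding resolve to one canonical name.
for alias in ("UTF8", "utf_8", "U8"):
    print(alias, "->", codecs.lookup(alias).name)  # all print 'utf-8'
```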

## Where Do Other Libraries Fall Short?

The requests library "[follows] RFC 2616 to the letter" in using the HTTP headers to determine the encoding of the response content. This means, among other things, using ISO-8859-1 as a fallback if no charset is given, despite the fact that UTF-8 has absolutely dwarfed all other encodings in usage on web pages.
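That header-only logic can be sketched as follows. This is a simplified stand-in for the real helper in `requests.utils`, written here only to show the RFC 2616 fallback, not the actual implementation:

```python
from typing import Mapping, Optional

def get_encoding_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Simplified sketch of RFC 2616-style charset resolution."""
    content_type = headers.get("content-type")
    if not content_type:
        return None
    # Look for an explicit charset parameter, e.g. "text/html; charset=utf-8".
    for param in content_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key.lower() == "charset":
            return value.strip("'\"")
    # RFC 2616 fallback: text/* defaults to ISO-8859-1 when no charset is given.
    if content_type.split(";")[0].strip().startswith("text/"):
        return "ISO-8859-1"
    return None

print(get_encoding_from_headers({"content-type": "text/html"}))  # ISO-8859-1
```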

```python
# requests/adapters.py
response.encoding = get_encoding_from_headers(response.headers)
```

If `requests` does not find an HTTP Content-Type header at all, it falls back to detection via `chardet` rather than looking in the HTML tags for meaningful information. There is nothing wrong with this; it just means that the `requests` maintainers have chosen to focus on the power of `requests` as an HTTP library, not an HTML library. If you want more fine-grained control over encoding detection, try `infer_encoding()`.

This is not to single out `requests`, either; other libraries do the same dance with encoding detection. For example, `aiohttp` checks the Content-Type header and otherwise defaults to UTF-8, without looking anywhere else.
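That decision reduces to a one-liner; the function below is a hypothetical summary of the behavior described above, not aiohttp's actual code:

```python
import re

def aiohttp_style_encoding(content_type: str) -> str:
    """Charset from the Content-Type header value if present, else UTF-8."""
    match = re.search(r"charset=([\w-]+)", content_type)
    return match.group(1) if match else "utf-8"

print(aiohttp_style_encoding("text/html"))  # utf-8
```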

## Search Process

The function `guessenc.infer_encoding()` looks in a handful of places to extract an encoding, in this order, and stops when it finds one:

1. The `charset` value from the Content-Type HTTP entity header.
2. The `charset` value from a `<meta charset="xxxx">` HTML tag.
3. The `charset` value from a `<meta>` tag with `http-equiv="Content-Type"`.
4. Detection via the `chardet` library.
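The four steps above can be sketched with the standard library alone. This is an illustrative reimplementation under simplifying assumptions (naive regexes, plain strings instead of the library's `Source` enum, `chardet` treated as optional), not guessenc's actual code:

```python
import re
from typing import Mapping, Optional, Tuple

def sketch_infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    # 1. charset from the Content-Type HTTP header.
    if headers:
        match = re.search(r"charset=([\w-]+)", headers.get("Content-Type", ""))
        if match:
            return ("CHARSET_HEADER", match.group(1))
    html = content or b""
    # 2. <meta charset="xxxx">.
    match = re.search(rb'<meta\s+charset=["\']?([\w-]+)', html, re.I)
    if match:
        return ("META_CHARSET", match.group(1).decode("ascii"))
    # 3. <meta http-equiv="Content-Type" content="...; charset=xxxx">.
    match = re.search(
        rb'<meta[^>]+http-equiv=["\']?content-type["\']?[^>]*charset=([\w-]+)',
        html,
        re.I,
    )
    if match:
        return ("META_HTTP_EQUIV", match.group(1).decode("ascii"))
    # 4. Fall back to chardet, if it is installed.
    try:
        import chardet
        encoding = chardet.detect(html)["encoding"]
        if encoding:
            return ("CHARDET", encoding)
    except ImportError:
        pass
    return ("COULD_NOT_DETECT", None)
```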

Each of the above "sources" is signified by a corresponding member of the `Source` enum:

```python
class Source(enum.Enum):
    """Indicates where our detected encoding came from."""

    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4
```


If none of the four sources above yields a viable encoding, this is indicated by `Source.COULD_NOT_DETECT`.
