Infer HTML encoding from response headers & content

guessenc

Infer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.

Basic Usage

The main function exported by guessenc is infer_encoding().

>>> import requests
>>> from guessenc import infer_encoding

>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.raise_for_status()
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')

This tells us that the detected encoding is cp1256, and that it was retrieved from an HTML <meta> tag with http-equiv='Content-Type'.

Detail on the signature of infer_encoding():

def infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None
) -> Pair:
    ...

The content parameter is the raw page HTML as bytes, such as response.content.

The headers parameter is the set of HTTP response headers, such as response.headers. If provided, it should be a mapping that supports case-insensitive lookup, such as requests.structures.CaseInsensitiveDict or multidict.CIMultiDict.
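To illustrate why case-insensitive lookup matters for the headers argument, here is a minimal sketch of such a mapping built only on the standard library. The class name CaseInsensitiveHeaders is hypothetical and is not part of guessenc, requests, or multidict; in practice the two classes named above are the better choice.

```python
from collections.abc import Mapping


class CaseInsensitiveHeaders(Mapping):
    """Hypothetical read-only mapping whose keys compare case-insensitively."""

    def __init__(self, data):
        # Store by lowercased key, but remember the original spelling.
        self._store = {k.lower(): (k, v) for k, v in dict(data).items()}

    def __getitem__(self, key):
        return self._store[key.lower()][1]

    def __iter__(self):
        # Iterate over the keys as they were originally spelled.
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)


headers = CaseInsensitiveHeaders({"content-type": "text/html; charset=cp1256"})
# A plain dict would miss this lookup because the casing differs.
print(headers.get("Content-Type"))
```

Servers are free to send header names in any casing, so a plain dict keyed on "content-type" would not answer a lookup for "Content-Type".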

Both parameters are optional.

The return type, Pair, is a two-element tuple.

The first element of the tuple is a member of the Source enum (see Search Process below). The source indicates where the detected encoding comes from.

The second element of the tuple is either a str, which is the canonical name of the detected encoding, or None if no encoding is found.
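Because the second element may be None, callers usually need a fallback before decoding. The sketch below shows one way to consume the returned pair; the helper name decode_html and the UTF-8 fallback are assumptions made for illustration, not part of guessenc.

```python
from typing import Optional, Tuple


def decode_html(
    content: bytes,
    pair: Tuple[object, Optional[str]],
    fallback: str = "utf-8",
) -> str:
    """Decode page bytes using a (source, encoding) pair.

    Hypothetical helper: `pair` is expected to have the same shape as the
    tuple returned by infer_encoding(). If the encoding is None, the
    fallback codec is used instead.
    """
    _source, encoding = pair
    # errors="replace" keeps decoding from raising on stray bytes.
    return content.decode(encoding or fallback, errors="replace")


# Intended use (requires guessenc and a fetched response):
#     text = decode_html(resp.content, infer_encoding(resp.content, resp.headers))
```

Choosing errors="replace" trades strictness for robustness; use errors="strict" if a wrong guess should fail loudly.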

Where Do Other Libraries Fall Short?

The requests library "[follows] RFC 2616 to the letter" in using the HTTP headers to determine the encoding of the response content. Among other things, this means falling back to ISO-8859-1 when a text Content-Type carries no charset, even though UTF-8 is now by far the most common encoding on web pages.

# requests/adapters.py
response.encoding = get_encoding_from_headers(response.headers)
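The practical cost of an ISO-8859-1 fallback is easy to reproduce with the standard library alone. This small example does not call requests; it only shows what happens when UTF-8 bytes are decoded with the fallback codec.

```python
# UTF-8 bytes as a server might send them without declaring a charset.
raw = "café".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never raises, but it turns each
# multi-byte UTF-8 sequence into several unrelated characters (mojibake).
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding the page actually uses restores the text.
right = raw.decode("utf-8")

print(wrong)  # cafÃ©
print(right)  # café
```

Because ISO-8859-1 maps every byte to a character, the wrong decode succeeds silently, which is why the error tends to surface only as garbled text downstream.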

If requests does not find an HTTP Content-Type header at all, it falls back to detection via chardet rather than looking in the HTML tags for meaningful information. There is nothing wrong with this; it simply means that the requests maintainers have chosen to focus on requests as an HTTP library, not an HTML library. If you want finer-grained control over encoding detection, try infer_encoding().

This is not to single out requests, either. Other libraries do the same dance with encoding detection: aiohttp checks the Content-Type header and otherwise defaults to UTF-8 without looking anywhere else.

Search Process

The function guessenc.infer_encoding() looks in a handful of places to extract an encoding, in this order, and stops when it finds one:

  1. In the charset value from the Content-Type HTTP entity header.
  2. In the charset value from a <meta charset="xxxx"> HTML tag.
  3. In the charset value from a <meta> tag with http-equiv="Content-Type".
  4. Using the chardet library.

Each of the above "sources" is signified by a corresponding member of the Source enum:

import enum

class Source(enum.Enum):
    """Indicates where our detected encoding came from."""

    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4

If none of the four sources from the list above returns a viable encoding, this is indicated by Source.COULD_NOT_DETECT.
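The search order can be sketched with the standard library alone. This is not guessenc's actual implementation: the names sketch_infer, _charset_param, _canonical, and _MetaScanner are hypothetical, the Source enum is repeated only so the block is self-contained, and step 4 (chardet) is left out so that nothing beyond the standard library is needed.

```python
import codecs
import enum
from email.message import Message
from html.parser import HTMLParser
from typing import Mapping, Optional, Tuple


class Source(enum.Enum):
    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4


def _charset_param(content_type: str) -> Optional[str]:
    """Pull the charset parameter out of a Content-Type style value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def _canonical(label: Optional[str]) -> Optional[str]:
    """Map an encoding label to Python's codec name, or None if unknown."""
    if not label:
        return None
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


class _MetaScanner(HTMLParser):
    """Records the first charset found in each kind of <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.meta_charset: Optional[str] = None
        self.http_equiv: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if self.meta_charset is None and a.get("charset"):
            self.meta_charset = a["charset"]
        equiv = (a.get("http-equiv") or "").lower()
        if self.http_equiv is None and equiv == "content-type":
            self.http_equiv = _charset_param(a.get("content") or "")


def sketch_infer(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None,
) -> Tuple[Source, Optional[str]]:
    # Step 1: charset from the Content-Type HTTP header.
    if headers is not None:
        enc = _canonical(_charset_param(headers.get("Content-Type", "")))
        if enc:
            return Source.CHARSET_HEADER, enc
    if content:
        # latin-1 maps every byte to a character, so ASCII markup survives.
        scanner = _MetaScanner()
        scanner.feed(content.decode("latin-1"))
        scanner.close()
        # Steps 2 and 3: <meta charset>, then <meta http-equiv>.
        for source, label in (
            (Source.META_CHARSET, scanner.meta_charset),
            (Source.META_HTTP_EQUIV, scanner.http_equiv),
        ):
            enc = _canonical(label)
            if enc:
                return source, enc
    # Step 4 (chardet) is omitted in this sketch.
    return Source.COULD_NOT_DETECT, None
```

Normalizing through codecs.lookup() is one way to obtain a canonical name, which is how a label such as windows-1256 would come back as cp1256; whether guessenc normalizes the same way is an assumption.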
