Skip to main content

[DEPRECATED] Infer HTML encoding from response headers & content. Use charset-normalizer or bs4.UnicodeDammit instead.

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

guessenc

Archived / deprecated. This project is no longer maintained. Most of what it does is now well covered by charset-normalizer (the default encoding detector in modern requests and httpx) and by bs4.UnicodeDammit, which also handles HTML <meta> tag sniffing. Please use those instead.

License PyPI Status Python

Infer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.

Basic Usage

The main function exported by guessenc is infer_encoding().

>>> import requests
>>> from guessenc import infer_encoding

>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.raise_for_status()
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')

This tells us that the detected encoding is cp1256, and that it was retrieved from a HTML tag with http-equiv='Content-Type'.

Detail on the signature of infer_encoding():

def infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None
) -> Pair:
    ...

The content represents the page HTML, such as response.content.

The headers represents the HTTP response headers, such as response.headers. If provided, this should be a data structure supporting a case-insensitive lookup, such as requests.structures.CaseInsensitiveDict or multidict.CIMultiDict.

Both parameters are optional.

The return type is a tuple.

The first element of the tuple is a member of the Source enum (see Search Process below). The source indicates where the detected encoding comes from.

The second element of the tuple is either a str, which is the canonical name of the detected encoding, or None if no encoding is found.

Where Do Other Libraries Fall Short?

The requests library "[follows] RFC 2616 to the letter" in using the HTTP headers to determine the encoding of the response content. This means, among other things, using ISO-8859-1 as a fallback if no charset is given, despite the fact that UTF-8 has absolutely dwarfed all other encodings in usage on web pages.

# requests/adapters.py
response.encoding = get_encoding_from_headers(response.headers)

If requests does not find an HTTP Content-Type header at all, it will fall back to detection via chardet rather than looking in the HTML tags for meaningful information. There's nothing at all wrong with this; it just means that the requests maintainers have chosen to focus on the power of requests as an HTTP library, not an HTML library. If you want more fine-grained control over encoding detection, try infer_encoding().

This is not to single out requests either; there are other libraries that do the same dance with encoding detection; aiohttp checks the Content-Type header, or otherwise defaults to UTF-8 without looking anywhere else.

Search Process

The function guessenc.infer_encoding() looks in a handful of places to extract an encoding, in this order, and stops when it finds one:

  1. In the charset value from the Content-Type HTTP entity header.
  2. In the charset value from a <meta charset="xxxx"> HTML tag.
  3. In the charset value from a <meta> tag with http-equiv="Content-Type".
  4. Using the chardet library.

Each of the above "sources" is signified by a corresponding member of the Source enum:

class Source(enum.Enum):
    """Indicates where our detected encoding came from."""

    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4

If none of the 4 sources from the list above return a viable encoding, this is indicated by Source.COULD_NOT_DETECT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

guessenc-0.3.1.tar.gz (332.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

guessenc-0.3.1-py3-none-any.whl (6.1 kB view details)

Uploaded Python 3

File details

Details for the file guessenc-0.3.1.tar.gz.

File metadata

  • Download URL: guessenc-0.3.1.tar.gz
  • Upload date:
  • Size: 332.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for guessenc-0.3.1.tar.gz
Algorithm Hash digest
SHA256 0e0e7f10d8e021cc6cfb8b354bb16e4cc66c1406f4a786836f7438ce8e0697bf
MD5 9c69e2ca5b4f79b8f08f810edb11a251
BLAKE2b-256 58917ee40e9ffaeec40c63842210ac57d8af7421775f1f533df2729698a3d9fd

See more details on using hashes here.

File details

Details for the file guessenc-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: guessenc-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 6.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for guessenc-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 be25ebf2cda9f877371e66b34dda6813bff274ccfa4f32c123bd14c5dcfa4825
MD5 832dd4e3d042971d6f6907db603544a9
BLAKE2b-256 4524e92ac2a858a4fdf9a6227d52875decdb7288acd0d2b3dac2e41a11ba4d2d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page