
tldextract accurately separates a URL's subdomain, domain, and public suffix, using the Public Suffix List (PSL). By default, this covers the public ICANN TLDs and their exceptions; you can optionally include the PSL's private domains as well.

Why? Naive URL parsing, such as splitting a hostname on dots, fails for domains like forums.bbc.co.uk (taking the second-to-last label yields "co" instead of "bbc"). tldextract handles these edge cases so you don't have to.
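
A quick sketch of the failure mode, using only the standard library:

from urllib.parse import urlsplit

host = urlsplit('http://forums.bbc.co.uk').hostname
naive_domain = host.split('.')[-2]  # 'co' -- wrong; the domain is 'bbc'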

Quick Start

>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

>>> tldextract.extract('http://forums.bbc.co.uk/')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

>>> # Access the parts you need
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> ext.top_domain_under_public_suffix
'bbc.co.uk'
>>> ext.fqdn
'forums.bbc.co.uk'

Install

pip install tldextract

How-to Guides

How to disable HTTP suffix list fetching for production

# Never fetch the suffix list over HTTP; use the cached copy or the
# snapshot bundled with the package.
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
no_fetch_extract('http://www.google.com')

How to set a custom cache location

Via environment variable:

export TLDEXTRACT_CACHE="/path/to/cache"

Or in code:

custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/cache/')

How to update TLD definitions

Command line:

tldextract --update

Or delete the cache folder:

rm -rf $HOME/.cache/python-tldextract

How to treat private domains as suffixes

extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract('waiterrant.blogspot.com')
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)

How to use a local suffix list

extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///path/to/your/list.dat"],
    cache_dir='/path/to/cache/',
    fallback_to_snapshot=False)
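
With fallback_to_snapshot=False, extraction raises an error if the list can't be loaded from your file or the cache, instead of silently falling back to the snapshot bundled with the package.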

How to use a remote suffix list

extract = tldextract.TLDExtract(
    suffix_list_urls=["https://myserver.com/suffix-list.dat"])

How to add extra suffixes

extract = tldextract.TLDExtract(
    extra_suffixes=["foo", "bar.baz"])
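
For illustration, the extra entries are then honored alongside the PSL (somedomain.foo is a made-up hostname):

>>> extract('somedomain.foo')
ExtractResult(subdomain='', domain='somedomain', suffix='foo', is_private=False)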

How to validate URLs before extraction

from urllib.parse import urlsplit

import tldextract

split_url = urlsplit("https://example.com:8080/path")
result = tldextract.extract_urllib(split_url)  # reuses the already-split URL
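
tldextract itself stays lenient (see URL validation below), so any strictness must come from your own check. A minimal sketch, building on the snippet above; extract_validated is a hypothetical helper that accepts only absolute http(s) URLs:

def extract_validated(url):
    split_url = urlsplit(url)
    if split_url.scheme not in ("http", "https") or not split_url.hostname:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return tldextract.extract_urllib(split_url)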

Command Line

$ tldextract http://forums.bbc.co.uk
forums bbc co.uk

$ tldextract --update  # Update cached suffix list
$ tldextract --help    # See all options

Understanding Domain Parsing

Public Suffix List

tldextract uses the Public Suffix List, a community-maintained list of domain suffixes. The PSL contains both:

  • Public suffixes: Where anyone can register a domain (.com, .co.uk, .org.kg)
  • Private suffixes: Operated by companies for customer subdomains (blogspot.com, github.io)

Web browsers use this same list for security decisions like cookie scoping.

Suffix vs. TLD

While .com is a top-level domain (TLD), many public suffixes, like .co.uk, span more than one label and so aren't technically TLDs. The PSL's term "public suffix" covers both.
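
For example:

>>> tldextract.extract('http://example.com').suffix
'com'
>>> tldextract.extract('http://example.co.uk').suffix
'co.uk'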

Default behavior with private domains

By default, tldextract treats private suffixes as regular domains:

>>> tldextract.extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)

To treat them as suffixes instead, see How to treat private domains as suffixes.

Caching behavior

By default, tldextract fetches the latest Public Suffix List on first use and caches it indefinitely in $HOME/.cache/python-tldextract. See How to set a custom cache location and How to update TLD definitions to control this.

URL validation

tldextract is very lenient: it prioritizes ease of use over strict validation and will attempt extraction on any string, even partial URLs or non-URLs.
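
For instance, an IP address passes through without error:

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)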

FAQ

Can you add/remove suffix ____?

tldextract doesn't maintain the suffix list. Submit changes to the Public Suffix List.

Meanwhile, use the extra_suffixes parameter, or fork the PSL and pass it to this library with the suffix_list_urls parameter.

My suffix is in the PSL but not extracted correctly

Check if it's in the "PRIVATE" section. See How to treat private domains as suffixes.

Why does it parse invalid URLs?

See URL validation and How to validate URLs before extraction.

Contribute

Setting up

  1. git clone this repository.
  2. Change into the new directory.
  3. pip install --upgrade --editable '.[testing]'

Running tests

tox --parallel       # Test all Python versions
tox -e py311         # Test specific Python version
ruff format .        # Format code

History

This package started from a Stack Overflow answer about regex-based domain extraction. The regex approach fails for many domains, so this library switched to the Public Suffix List for accuracy.
