tldextract

tldextract accurately separates a URL's subdomain, domain, and public suffix,
using the Public Suffix List (PSL).
Why? Naive URL parsing, like splitting on dots, fails for domains like
forums.bbc.co.uk (it yields "co" rather than "bbc" as the domain). tldextract
handles the edge cases so you don't have to.
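To see the failure concretely, with nothing but the standard library:

>>> 'forums.bbc.co.uk'.split('.')[-2]  # naive guess at the registrable domain
'co'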
Quick Start
>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)
>>> tldextract.extract('http://forums.bbc.co.uk/')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)
>>> # Access the parts you need
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> ext.top_domain_under_public_suffix
'bbc.co.uk'
>>> ext.fqdn
'forums.bbc.co.uk'
Install
pip install tldextract
How-to Guides
How to disable HTTP suffix list fetching for production
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
no_fetch_extract('http://www.google.com')
# ExtractResult(subdomain='www', domain='google', suffix='com', is_private=False)
# With no URLs to fetch, the bundled snapshot of the suffix list is used.
How to set a custom cache location
Via environment variable:
export TLDEXTRACT_CACHE="/path/to/cache"
Or in code:
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/cache/')
How to update TLD definitions
Command line:
tldextract --update
Or delete the cache folder:
rm -rf $HOME/.cache/python-tldextract
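A refresh can also be triggered from Python; a minimal sketch, assuming the update() method exposed on TLDExtract instances in recent releases:

import tldextract

extract = tldextract.TLDExtract()
extract.update(fetch_now=True)  # re-fetch the suffix list and rebuild the cache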
How to treat private domains as suffixes
extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract('waiterrant.blogspot.com')
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
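The setting can also be supplied per call rather than per extractor; a sketch, assuming the include_psl_private_domains keyword that recent releases accept on the call itself:

import tldextract

extract = tldextract.TLDExtract()
extract('waiterrant.blogspot.com', include_psl_private_domains=True)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)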
How to use a local suffix list
extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///path/to/your/list.dat"],
    cache_dir='/path/to/cache/',
    fallback_to_snapshot=False)  # raise rather than fall back to the bundled snapshot
How to use a remote suffix list
extract = tldextract.TLDExtract(
    suffix_list_urls=["https://myserver.com/suffix-list.dat"])
How to add extra suffixes
extract = tldextract.TLDExtract(
    extra_suffixes=["foo", "bar.baz"])
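With those entries added, extraction treats them like any other suffix (somewhere.bar.baz is a made-up host for illustration):

extract('somewhere.bar.baz')
# the added entry wins: suffix='bar.baz', domain='somewhere', subdomain=''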
How to validate URLs before extraction
import tldextract
from urllib.parse import urlsplit

extractor = tldextract.TLDExtract()
split_url = urlsplit("https://example.com:8080/path")  # strict parse first
result = extractor.extract_urllib(split_url)  # extract from the parsed netloc
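If you want hard failures on junk input, you can gate extraction yourself; a minimal sketch, assuming the simple scheme-and-host policy shown here (the strict_extract helper is not part of tldextract):

import tldextract
from urllib.parse import urlsplit

extractor = tldextract.TLDExtract()

def strict_extract(url):
    # Reject anything that isn't an absolute URL with a host.
    split = urlsplit(url)
    if not split.scheme or not split.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return extractor.extract_urllib(split)

strict_extract('https://forums.bbc.co.uk/')  # ExtractResult(subdomain='forums', ...)
strict_extract('not a url')                  # raises ValueError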
Command Line
$ tldextract http://forums.bbc.co.uk
forums bbc co.uk
$ tldextract --update # Update cached suffix list
$ tldextract --help # See all options
Understanding Domain Parsing
Public Suffix List
tldextract uses the Public Suffix List, a
community-maintained list of domain suffixes. The PSL contains both:
- Public suffixes: where anyone can register a domain (.com, .co.uk, .org.kg)
- Private suffixes: operated by companies for customer subdomains (blogspot.com, github.io)
Web browsers use this same list for security decisions like cookie scoping.
Suffix vs. TLD
While .com is a top-level domain (TLD), many suffixes like .co.uk are
technically second-level. The PSL uses "public suffix" to cover both.
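The distinction shows up directly in the suffix attribute:

>>> tldextract.extract('https://example.com').suffix
'com'
>>> tldextract.extract('https://example.co.uk').suffix
'co.uk'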
Default behavior with private domains
By default, tldextract treats private suffixes as regular domains:
>>> tldextract.extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)
To treat them as suffixes instead, see How to treat private domains as suffixes.
Caching behavior
By default, tldextract fetches the latest Public Suffix List on first use and
caches it indefinitely in $HOME/.cache/python-tldextract.
URL validation
tldextract is deliberately lenient. It prioritizes ease of use over strict
validation and will attempt extraction on any string, even partial URLs or
non-URLs.
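For example, a bare, suffix-less hostname still yields a result rather than an error:

>>> tldextract.extract('localhost')
ExtractResult(subdomain='', domain='localhost', suffix='', is_private=False)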
FAQ
Can you add/remove suffix ____?
tldextract doesn't maintain the suffix list. Submit changes to
the Public Suffix List.
Meanwhile, use the extra_suffixes parameter, or fork the PSL and pass it to
this library with the suffix_list_urls parameter.
My suffix is in the PSL but not extracted correctly
Check if it's in the "PRIVATE" section. See How to treat private domains as suffixes.
Why does it parse invalid URLs?
See URL validation and How to validate URLs before extraction.
Contribute
Setting up
1. git clone this repository.
2. Change into the new directory.
3. pip install --upgrade --editable '.[testing]'
Running tests
tox --parallel # Test all Python versions
tox -e py311 # Test specific Python version
ruff format . # Format code
History
This package started from a StackOverflow answer about regex-based domain extraction. The regex approach fails for many domains, so this library switched to the Public Suffix List for accuracy.