URL normalization for Python

Project description

url-normalize

A Python library for standardizing and normalizing URLs. Ideal for database deduplication, caching, web crawling, and anywhere you need to ensure that equivalent URLs resolve to the exact same string.

from url_normalize import url_normalize

# Lowercases scheme and host, removes the default port, collapses extra slashes, resolves dot segments
url_normalize("HTTP://User:Pass@www.FOO.com:80///foo/../bar/./baz?q=1#frag")
# -> 'http://User:Pass@www.foo.com/bar/baz?q=1#frag'

Features

url-normalize provides a robust URI normalization function that handles IDN domains, scheme/host lowercasing, and RFC-compliant path normalization.

  • IDN Support: Full internationalized domain name handling (using IDNA2008 with UTS46).
  • Humanization: Convert normalized URLs to a readable display format while preserving round-trip normalization.
  • RFC Compliance:
    • Proper percent-encoding (minimal, uppercase hex).
    • Dot-segment removal in paths.
    • Default port and authority handling.
    • UTF-8 NFC normalization.
  • Configurable Defaults:
    • Customizable default scheme (https by default).
    • Configurable default domain for absolute paths.
  • Query Parameter Control:
    • Parameter filtering with allowlists.
    • Support for domain-specific parameter rules.
  • Versatile URL Handling: Handles empty strings, double-slash URLs (//domain.tld), and shebang (#!) URLs.
  • Developer Friendly:
    • Python 3.10+ compatibility.
    • 100% test coverage.
    • Modern type hints and string handling.

Inspired by Sam Ruby's urlnorm.py.
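
To make the RFC-compliance features above more concrete, here is a minimal standard-library sketch of three of the steps: scheme/host lowercasing, default-port removal, and dot-segment removal. It is not the library's implementation; the names sketch_normalize, remove_dot_segments and DEFAULT_PORTS are hypothetical, the port table is an assumed subset, and IDN handling, percent-encoding and IPv6 hosts are left out.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed subset of well-known default ports, for illustration only.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' segments; also collapses repeated slashes (a simplification)."""
    output: list[str] = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue  # '' comes from '//' runs or the leading slash; '.' is a no-op
        if segment == "..":
            if output:
                output.pop()  # step up one level, never above the root
            continue
        output.append(segment)
    normalized = "/" + "/".join(output)
    # Keep the trailing slash when the input ended in '/', '/.' or '/..'
    if output and path.endswith(("/", "/.", "/..")):
        normalized += "/"
    return normalized


def sketch_normalize(url: str) -> str:
    """Hypothetical helper: lowercase scheme/host, drop default port, clean the path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.hostname or ""  # urlsplit already lowercases the hostname
    userinfo = parts.netloc.rpartition("@")[0]
    if userinfo:
        netloc = f"{userinfo}@{netloc}"  # userinfo is case-sensitive, keep as-is
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{netloc}:{parts.port}"
    path = remove_dot_segments(parts.path)
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(sketch_normalize("HTTP://User:Pass@www.FOO.com:80///foo/../bar/./baz?q=1#frag"))
# -> http://User:Pass@www.foo.com/bar/baz?q=1#frag
```

Comparing the sketch with url_normalize on your own inputs is a quick way to see which extra cases (IDN, encoding, empty strings) the library covers.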

Installation

Install as a library:

pip install url-normalize

Or install as a standalone CLI tool using uv:

uv tool install url-normalize

Usage

Python API

Basic Normalization

from url_normalize import url_normalize

# Basic normalization (uses https by default)
print(url_normalize("www.foo.com:80/foo"))
# Output: https://www.foo.com/foo

# With custom default scheme
print(url_normalize("www.foo.com/foo", default_scheme="http"))
# Output: http://www.foo.com/foo

Query Parameter Filtering

You can strip out tracking parameters and only keep the ones you care about using allowlists.

# With query parameter filtering enabled (only allowlisted params are kept; here q survives and utm_source is dropped)
print(url_normalize("www.google.com/search?q=test&utm_source=test", filter_params=True))
# Output: https://www.google.com/search?q=test

# With custom parameter allowlist as a list
print(url_normalize(
    "example.com?page=1&id=123&ref=test",
    filter_params=True,
    param_allowlist=["page", "id"]
))
# Output: https://example.com/?page=1&id=123

# With domain-specific parameter allowlists
print(url_normalize(
    "example.com?page=1&id=123&ref=test",
    filter_params=True,
    param_allowlist={"example.com": ["page", "id"]}
))
# Output: https://example.com/?page=1&id=123
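
The allowlist idea can be sketched with the standard library alone. The helper below is hypothetical (filter_query is not part of url-normalize) and only illustrates the two allowlist shapes shown above: a flat list applied to every host, or a dict keyed by host name. It assumes the URL already has a scheme.

```python
from __future__ import annotations

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_query(url: str, allowlist: list[str] | dict[str, list[str]]) -> str:
    """Hypothetical helper: keep only query parameters named in the allowlist."""
    parts = urlsplit(url)
    if isinstance(allowlist, dict):
        # Domain-specific rules: look up the allowed names for this host only.
        allowed = set(allowlist.get(parts.hostname or "", []))
    else:
        allowed = set(allowlist)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in allowed
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(filter_query("https://example.com/?page=1&id=123&ref=test", ["page", "id"]))
# -> https://example.com/?page=1&id=123
print(filter_query("https://example.com/?page=1&ref=test", {"other.org": ["page"]}))
# -> https://example.com/
```

An allowlist fails closed: an unknown tracking parameter is dropped without anyone having to list it first, which is usually what deduplication wants.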

Default Domain & Scheme

Useful for resolving host-relative URLs (absolute paths such as /images/logo.png) found on a specific page.

# With default domain for absolute paths
print(url_normalize("/images/logo.png", default_domain="example.com"))
# Output: https://example.com/images/logo.png

# With default domain and custom scheme
print(url_normalize("/images/logo.png", default_scheme="http", default_domain="example.com"))
# Output: http://example.com/images/logo.png
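
As a rough illustration of what supplying defaults means, the sketch below fills in a missing scheme and, for absolute paths, a missing domain. The function with_defaults is a hypothetical stand-in written for this example, not the library's code, and it performs no further normalization.

```python
from __future__ import annotations


def with_defaults(
    url: str,
    default_scheme: str = "https",
    default_domain: str | None = None,
) -> str:
    """Hypothetical helper: add a scheme and/or domain when the input lacks them."""
    if not url:
        return url  # nothing to complete
    if url.startswith("//"):
        # Protocol-relative URL such as //domain.tld/path
        return f"{default_scheme}:{url}"
    if url.startswith("/"):
        # Absolute path: only resolvable when a default domain is supplied
        if default_domain:
            return f"{default_scheme}://{default_domain}{url}"
        return url
    if "://" not in url:
        return f"{default_scheme}://{url}"
    return url


print(with_defaults("/images/logo.png", default_domain="example.com"))
# -> https://example.com/images/logo.png
print(with_defaults("/images/logo.png", default_scheme="http", default_domain="example.com"))
# -> http://example.com/images/logo.png
print(with_defaults("//domain.tld/x"))
# -> https://domain.tld/x
```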

Humanizing URLs

Convert normalized URLs back into a user-friendly format for display, particularly useful for IDN domains and percent-encoded paths.

from url_normalize import url_humanize

# Human-readable display form that still normalizes back to the same URL
print(url_humanize("https://xn--e1afmkfd.xn--80akhbyknj4f/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F"))
# Output: https://пример.испытание/Служебная

# Humanization accepts the same normalization options
print(url_humanize("/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F", default_domain="xn--e1afmkfd.xn--80akhbyknj4f"))
# Output: https://пример.испытание/Служебная
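
For intuition about what humanizing involves, here is a standard-library sketch that decodes a Punycode host and a percent-encoded path. It is an approximation under stated assumptions, not how url_humanize is implemented: the stdlib "idna" codec follows the older IDNA 2003 rules rather than IDNA2008 with UTS46, userinfo is dropped, and the hypothetical sketch_humanize decodes every escape in the path, including reserved characters that a careful implementation would leave encoded.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def sketch_humanize(url: str) -> str:
    """Hypothetical helper: show an ASCII-encoded URL in readable Unicode form."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Decode xn-- labels with the stdlib codec (IDNA 2003, an assumption here).
    readable_host = host.encode("ascii").decode("idna")
    if parts.port is not None:
        readable_host += f":{parts.port}"
    # Decode UTF-8 percent-escapes in the path for display.
    readable_path = unquote(parts.path)
    return urlunsplit(
        (parts.scheme, readable_host, readable_path, parts.query, parts.fragment)
    )


encoded = (
    "https://xn--e1afmkfd.xn--80akhbyknj4f/"
    "%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F"
)
print(sketch_humanize(encoded))
# -> https://пример.испытание/Служебная
```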

Command-line Usage

You can also use url-normalize directly from the terminal to process URLs.

$ url-normalize "www.foo.com:80/foo"
# Output: https://www.foo.com/foo

# With custom default scheme
$ url-normalize -s http "www.foo.com/foo"
# Output: http://www.foo.com/foo

# With query parameter filtering
$ url-normalize -f "www.google.com/search?q=test&utm_source=test"
# Output: https://www.google.com/search?q=test

# With custom allowlist
$ url-normalize -f -p page,id "example.com?page=1&id=123&ref=test"
# Output: https://example.com/?page=1&id=123

# With default domain for absolute paths
$ url-normalize -d example.com "/images/logo.png"
# Output: https://example.com/images/logo.png

# With default domain and custom scheme
$ url-normalize -d example.com -s http "/images/logo.png"
# Output: http://example.com/images/logo.png

# Human-readable display form
$ url-normalize -H "https://xn--e1afmkfd.xn--80akhbyknj4f/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F"
# Output: https://пример.испытание/Служебная

# Via uv tool/uvx
$ uvx url-normalize www.foo.com:80/foo
# Output: https://www.foo.com:80/foo

Documentation

For a complete history of changes, see CHANGELOG.md.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Download files

Source Distribution

url_normalize-3.0.0.tar.gz (21.8 kB)

Uploaded Source

Built Distribution

url_normalize-3.0.0-py3-none-any.whl (16.9 kB)

Uploaded Python 3

File details

Details for the file url_normalize-3.0.0.tar.gz.

File metadata

  • Download URL: url_normalize-3.0.0.tar.gz
  • Upload date:
  • Size: 21.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for url_normalize-3.0.0.tar.gz
Algorithm     Hash digest
SHA256        0552cbf2831a32a28994a13d29bca58a60e10ff6c0380e343ec6d1c2a0d232d8
MD5           761659341e1ba88b738feb1c1e8a19a3
BLAKE2b-256   8bcd846d87d6d49d963b04ef4429b73d71d3c17468059956bab360866a9b0aec
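
A downloaded file can be checked against the SHA256 digest listed above with a few lines of Python. This is a generic sketch: sha256_matches is a hypothetical helper name, and the path in the usage comment assumes the archive was saved to the current directory.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_matches(path: str | Path, expected_hex: str) -> bool:
    """Hypothetical helper: compare a file's SHA256 digest with an expected hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Usage (assumes the sdist sits in the current directory):
# sha256_matches(
#     "url_normalize-3.0.0.tar.gz",
#     "0552cbf2831a32a28994a13d29bca58a60e10ff6c0380e343ec6d1c2a0d232d8",
# )
```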

File details

Details for the file url_normalize-3.0.0-py3-none-any.whl.

File metadata

  • Download URL: url_normalize-3.0.0-py3-none-any.whl
  • Upload date:
  • Size: 16.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for url_normalize-3.0.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        95234bd359f86831c1fd87c248877f2a6887db2f3b5087120083f2fffcba4889
MD5           e9c5f443121d353bf369d93a6041b9a8
BLAKE2b-256   138af72344eab18674fd7b174f35abbce41ed88fea72927f111726732d0ca779
