URL normalization library for consistent URL representation

tk-normalizer

URL normalization library for creating consistent URL representations.

Purpose

URL normalization makes URLs that point to the same resource compare as equal even when they differ in casing, scheme (http vs. https), the www prefix, tracking parameters, or query parameter order. This library produces a normalized representation of a URL for consistent storage, comparison, and analysis.

Installation

pip install tk-normalizer

Quick Start

from tk_normalizer import TkNormalizer

# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized))  # Output: example.com/path?a=1&b=2

# Get full details with dict()
print(dict(normalized))  # Returns all fields including query_string, path, and hashes

Features

URL Normalization

The following URLs all normalize to the same form:

https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign

All normalize to: example.com

Normalization Process

URLs are normalized through the following steps:

  • ✅ Protocol and www subdomains removed
  • ✅ Lowercased
  • ✅ Trailing slashes removed
  • ✅ Query parameters reordered alphabetically by key
  • ✅ Duplicate query parameter key/value pairs removed
  • ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
  • ✅ Non-HTTP(S) protocols rejected
  • ✅ Localhost URLs rejected
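
The steps above can be approximated with the standard library alone. The following is a minimal sketch, not tk-normalizer's implementation: `sketch_normalize` and `TRACKING_KEYS` are hypothetical names, the tracking list is a small illustrative subset, ports are dropped, and the library's edge-case behavior may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Illustrative subset of tracking keys; NOT the library's actual list.
TRACKING_KEYS = {"gclid", "fbclid", "msclkid"}


def sketch_normalize(url: str) -> str:
    """Approximate the listed steps (hypothetical helper, stdlib only)."""
    parts = urlsplit(url.strip().lower())  # lowercase the whole URL first
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = parts.hostname or ""  # note: hostname drops any port
    if host in ("", "localhost"):
        raise ValueError("missing host or localhost URL")
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")  # remove trailing slashes
    # Drop tracking keys; the set removes duplicate key/value pairs.
    pairs = {
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_") and key not in TRACKING_KEYS
    }
    query = urlencode(sorted(pairs))  # sort by key, then value
    # The fragment is never copied over, so it is dropped implicitly.
    return host + path + (f"?{query}" if query else "")
```

Under these assumptions, `sketch_normalize("http://www.Example.com/path?b=2&a=1&utm_source=test")` returns `example.com/path?a=1&b=2`, and each of the six example URLs listed earlier returns `example.com`.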

Tracking Parameters Removed

The following tracking parameters are automatically removed during normalization:

  • utm_* (all utm parameters)
  • gclid, fbclid, dclid (click identifiers)
  • _ga, _gid, _fbp, _hjid (analytics cookies)
  • msclkid (Microsoft Ads)
  • aff_id, affid (affiliate tracking)
  • referrer, adgroupid, srsltid
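
To make the matching rule concrete, here is a small sketch of a filter built from the list above: `utm_*` is treated as a prefix match and every other entry as an exact key match. The names `is_tracking_param` and `strip_tracking` are hypothetical and not part of tk-normalizer's API; case-insensitive matching is an assumption.

```python
from urllib.parse import parse_qsl, urlencode

# Exact-match keys copied from the list above.
TRACKING_KEYS = frozenset({
    "gclid", "fbclid", "dclid",
    "_ga", "_gid", "_fbp", "_hjid",
    "msclkid",
    "aff_id", "affid",
    "referrer", "adgroupid", "srsltid",
})


def is_tracking_param(key: str) -> bool:
    """True for utm_* keys (prefix match) or any exact key in the list."""
    key = key.lower()  # assumption: keys are matched case-insensitively
    return key.startswith("utm_") or key in TRACKING_KEYS


def strip_tracking(query: str) -> str:
    """Return the query string with tracking parameters removed."""
    kept = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if not is_tracking_param(key)
    ]
    return urlencode(kept)
```

A parameter that is not on the list, such as a newly introduced click identifier, passes through this filter unchanged.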

Advanced Usage

Getting Full Normalization Details

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")

# Use str() for just the normalized URL
print(str(normalizer))  # blog.example.com/page?a=1&b=2

# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
#   'normalized_url': 'blog.example.com/page?a=1&b=2',
#   'parent_normalized_url': 'blog.example.com',
#   'root_normalized_url': 'example.com',
#   'query_string': 'a=1&b=2',
#   'path': '/page',
#   'normalized_url_hash': '...',
#   'parent_normalized_url_hash': '...',
#   'root_normalized_url_hash': '...'
# }
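
The relationship between the three URL levels in this output can be sketched as follows. This is an illustration only: `split_levels` is a hypothetical helper, and taking the last two host labels as the root domain is a naive assumption that gives the wrong answer for multi-part suffixes such as `co.uk`. How the library itself determines the root domain is not documented here.

```python
def split_levels(normalized_url: str) -> tuple[str, str]:
    """Return (parent, root) for an already-normalized URL string."""
    # Everything before the first "/" or "?" is the host (the parent level).
    host = normalized_url.split("/", 1)[0].split("?", 1)[0]
    labels = host.split(".")
    # Naive assumption: the root domain is the last two labels of the host.
    root = ".".join(labels[-2:])
    return host, root


parent, root = split_levels("blog.example.com/page?a=1&b=2")
print(parent)  # blog.example.com
print(root)    # example.com
```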

Error Handling

from tk_normalizer import TkNormalizer, InvalidUrlException

try:
    normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
    print(f"Invalid URL: {e}")

Accessing Individual Components

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("https://blog.example.com/path?a=1")

# Dict-like access to individual fields
print(normalizer["normalized_url"])       # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"])   # example.com
print(normalizer["query_string"])          # a=1
print(normalizer["path"])                  # /path

# Iterate over available fields
for key in normalizer:
    print(f"{key}: {normalizer[key]}")

# Get all field names
print(normalizer.keys())
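
For readers curious why `str()`, `dict()`, indexing, and iteration all work on the same object: in Python, `dict(obj)` accepts any object that provides `keys()` and `__getitem__`. The sketch below shows one way to get that behavior with `collections.abc.Mapping`. `FieldView` is a hypothetical class written for illustration, not tk-normalizer's actual class.

```python
from collections.abc import Mapping


class FieldView(Mapping):
    """Hypothetical read-only container; not the library's actual class."""

    def __init__(self, fields: dict):
        self._fields = dict(fields)

    def __getitem__(self, key):
        return self._fields[key]

    def __iter__(self):
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)

    def __str__(self):
        # str() yields only the normalized URL, mirroring the usage above.
        return self._fields["normalized_url"]


view = FieldView({"normalized_url": "blog.example.com/path?a=1", "path": "/path"})
print(str(view))          # blog.example.com/path?a=1
print(dict(view))         # plain dict with both fields
print(list(view.keys()))  # ['normalized_url', 'path']
```

`Mapping` supplies `keys()`, `items()`, `values()`, and `in` checks once `__getitem__`, `__iter__`, and `__len__` are defined.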

Hashing

For efficient storage and comparison, SHA-256 hashes are computed for:

  • The normalized URL
  • The parent normalized URL (domain without path)
  • The root normalized URL (root domain without subdomains)

This provides fixed-length representations suitable for database indexing.
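
As an illustration of the fixed-length property, the sketch below hashes the three levels with `hashlib`. The UTF-8 encoding and lowercase hex output are assumptions, and `sha256_hex` is a hypothetical helper, so the digests are not guaranteed to match the values the library returns.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Hash a string with SHA-256 (assumes UTF-8 input, hex output)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


levels = {
    "normalized_url": "blog.example.com/page?a=1&b=2",
    "parent_normalized_url": "blog.example.com",
    "root_normalized_url": "example.com",
}
# One hash per level, keyed like the *_hash fields shown earlier.
hashes = {f"{name}_hash": sha256_hex(value) for name, value in levels.items()}

for name, digest in hashes.items():
    print(name, len(digest))  # every hex digest is 64 characters long
```

Because every digest has the same length regardless of URL length, a fixed-width column can hold it and serve as an index key.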

Important Caveats

While this normalization process works well for most use cases, there are some limitations:

  1. www subdomain removal: Technically, www.example.com and example.com could serve different content, though this is rare in practice.

  2. Case sensitivity: URLs are lowercased, but some servers treat paths as case-sensitive, so /Page and /page may be distinct resources that normalize to the same URL.

  3. Tracking parameters: New tracking parameters emerge over time and may not be in the removal list.

  4. Fragment removal: URL fragments (#anchors) are removed, so distinct views of a single-page application that routes on the fragment may collapse to one normalized URL.

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=tk_normalizer

# Run linting
ruff check src tests

Running Tests

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_normalizer.py

# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues and questions, please use the GitHub issue tracker.

Credits

Based on the URL normalization functionality from tk-core, extracted and packaged for standalone use.

Download files

Source Distribution

tk_normalizer-1.1.0.tar.gz (15.3 kB)

Uploaded Source

Built Distribution

tk_normalizer-1.1.0-py3-none-any.whl (8.0 kB)

Uploaded Python 3

File details

Details for the file tk_normalizer-1.1.0.tar.gz.

File metadata

  • Download URL: tk_normalizer-1.1.0.tar.gz
  • Upload date:
  • Size: 15.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tk_normalizer-1.1.0.tar.gz
Algorithm Hash digest
SHA256 61188fe8b07343bfbc7d3c92846ca32e5a2450d4aeea9cf88587a0afb613c18c
MD5 644b89d2563a1a5964b252bb8e990fbd
BLAKE2b-256 c411911039305c2a8bfd7dc2cec3e2ce6be4fe3bcdecfd4652f3b998f0ba4c27

Provenance

The following attestation bundles were made for tk_normalizer-1.1.0.tar.gz:

Publisher: deploy_to_pypi.yml on terakeet/tk-normalizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tk_normalizer-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: tk_normalizer-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tk_normalizer-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e494ac5e3ce9dfa0f0331b936fdaed1cad883ce7f0a315b7387e4305246a0107
MD5 f97d61aea8c1015875137ddf51224718
BLAKE2b-256 8af24114a35336992e8d39c52aa114a570282916d0c5739b30ab83d2fac6273c

Provenance

The following attestation bundles were made for tk_normalizer-1.1.0-py3-none-any.whl:

Publisher: deploy_to_pypi.yml on terakeet/tk-normalizer
