URL normalization library for consistent URL representation

Reason this release was yanked:

Was not built correctly.

tk-normalizer

URL normalization library for creating consistent URL representations.

Purpose

URL normalization establishes equivalence between URLs that differ only in letter case, scheme (http vs. https), a www prefix, trailing slashes, fragments, tracking parameters, or query parameter order. This library produces normalized representations of URLs for consistent storage, comparison, and analysis.

Installation

pip install tk-normalizer

Quick Start

from tk_normalizer import normalize_url

# Simple usage with the convenience function
normalized = normalize_url("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(normalized)  # Output: example.com/path?a=1&b=2

# Using the class directly for more control
from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(normalizer.normalized_url)  # example.com/path?a=1&b=2
print(normalizer.get_normalized_url())  # Full details including hashes

Features

URL Normalization

The following URLs all normalize to the same form:

https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign

All normalize to: example.com

Normalization Process

Normalization applies the following rules:

  • ✅ Protocol and www subdomains removed
  • ✅ Lowercased
  • ✅ Trailing slashes removed
  • ✅ Query parameters reordered alphabetically by key
  • ✅ Duplicate query parameter key/value pairs removed
  • ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
  • ✅ Non-HTTP(S) protocols rejected
  • ✅ Localhost URLs rejected
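
The rules above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: the function name `sketch_normalize`, the small `TRACKING_KEYS` subset, the choice to lowercase query values, the use of `ValueError`, and the decision to ignore ports are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: a small subset of the tracking keys; utm_* is handled by prefix.
TRACKING_KEYS = {"gclid", "fbclid", "dclid", "msclkid"}


def sketch_normalize(url: str) -> str:
    """Illustrative approximation of the listed rules (hypothetical helper)."""
    parts = urlsplit(url.strip())

    # Reject anything that is not HTTP(S).
    if parts.scheme.lower() not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")

    # urlsplit().hostname is already lowercased and excludes any port.
    host = (parts.hostname or "").lower()
    if host in ("", "localhost"):
        raise ValueError("missing host or localhost URL")
    if host.startswith("www."):
        host = host[len("www."):]

    # Lowercase the path and drop trailing slashes; the fragment is ignored.
    path = parts.path.lower().rstrip("/")

    # Drop tracking keys and exact duplicate pairs, then sort by key.
    pairs = []
    for key, value in parse_qsl(parts.query.lower(), keep_blank_values=True):
        if key.startswith("utm_") or key in TRACKING_KEYS:
            continue
        if (key, value) not in pairs:
            pairs.append((key, value))
    pairs.sort()

    query = urlencode(pairs)
    return host + path + ("?" + query if query else "")


print(sketch_normalize("http://www.Example.com/path?b=2&a=1&utm_source=test"))
# example.com/path?a=1&b=2
```

The sketch drops the scheme rather than storing it, which is what makes `http://` and `https://` variants compare equal.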

Tracking Parameters Removed

The following tracking parameters are automatically removed during normalization:

  • utm_* (all utm parameters)
  • gclid, fbclid, dclid (click identifiers)
  • _ga, _gid, _fbp, _hjid (analytics cookies)
  • msclkid (Microsoft Ads)
  • aff_id, affid (affiliate tracking)
  • referrer, adgroupid, srsltid
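
One way to model this list is a prefix rule for `utm_*` plus a set of exact keys. The sketch below is an assumption about how such a filter could look, not the library's API: `make_tracking_filter`, `DEFAULT_TRACKING_KEYS`, and the extra key `my_campaign_id` are hypothetical names, and the library's own list may contain more entries than the ones shown here.

```python
from typing import Callable, Iterable

# Exact-match keys copied from the list above; "utm_" is handled as a prefix.
DEFAULT_TRACKING_KEYS = frozenset({
    "gclid", "fbclid", "dclid",
    "_ga", "_gid", "_fbp", "_hjid",
    "msclkid",
    "aff_id", "affid",
    "referrer", "adgroupid", "srsltid",
})


def make_tracking_filter(extra_keys: Iterable[str] = ()) -> Callable[[str], bool]:
    """Return a predicate that is True for query keys treated as tracking."""
    keys = DEFAULT_TRACKING_KEYS | {k.lower() for k in extra_keys}

    def is_tracking(key: str) -> bool:
        key = key.lower()
        return key.startswith("utm_") or key in keys

    return is_tracking


# "my_campaign_id" is a made-up key showing how the list could be extended.
is_tracking = make_tracking_filter(extra_keys=["my_campaign_id"])
params = [("a", "1"), ("utm_medium", "email"), ("my_campaign_id", "x"), ("gclid", "abc")]
kept = [(k, v) for k, v in params if not is_tracking(k)]
print(kept)  # [('a', '1')]
```

Keeping the keys as data rather than hard-coded conditions makes it easier to add new tracking parameters as they appear.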

Advanced Usage

Getting Full Normalization Details

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")
result = normalizer.get_normalized_url()

print(result)
# {
#   'normalized_url': 'blog.example.com/page?a=1&b=2',
#   'parent_normal_url': 'blog.example.com',
#   'root_normal_url': 'example.com',
#   'normalized_url_hash': '...',
#   'parent_normal_url_hash': '...',
#   'root_normal_url_hash': '...'
# }

Error Handling

from tk_normalizer import normalize_url, InvalidUrlException

try:
    normalized = normalize_url("not a valid url")
except InvalidUrlException as e:
    print(f"Invalid URL: {e}")
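
When processing many URLs, it is often more useful to collect rejected inputs than to stop at the first failure. The sketch below shows that pattern with the normalizer and exception type passed in as arguments, so it runs without the package installed. `normalize_many` and `fake_normalize` are hypothetical helpers, not part of tk-normalizer.

```python
from __future__ import annotations

from typing import Callable, Iterable


def normalize_many(
    urls: Iterable[str],
    normalize: Callable[[str], str],
    error_type: type[Exception] = ValueError,
) -> tuple[dict[str, str], list[str]]:
    """Return ({original: normalized}, [inputs that raised error_type])."""
    normalized: dict[str, str] = {}
    rejected: list[str] = []
    for url in urls:
        try:
            normalized[url] = normalize(url)
        except error_type:
            rejected.append(url)
    return normalized, rejected


def fake_normalize(url: str) -> str:
    """Stand-in normalizer so the sketch is runnable on its own."""
    if " " in url:
        raise ValueError(url)
    return url.lower()


ok, bad = normalize_many(["HTTP://A.example", "not a valid url"], fake_normalize)
print(ok)   # {'HTTP://A.example': 'http://a.example'}
print(bad)  # ['not a valid url']
```

With the package installed, the same helper would presumably be called as `normalize_many(urls, normalize_url, InvalidUrlException)`, using the two names imported in the example above.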

Accessing Individual Components

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("https://blog.example.com/path?a=1")

# Access individual normalized components
print(normalizer.normalized_url)     # blog.example.com/path?a=1
print(normalizer.parent_normal_url)  # blog.example.com
print(normalizer.root_normal_url)    # example.com
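
To make the relationship between the three values concrete, here is a rough sketch of how the parent and root forms could be derived from an already-normalized URL. It is not the library's method: `split_levels` is a hypothetical helper, and the "last two labels" rule for the root domain is a deliberate simplification that gives wrong answers for multi-part suffixes such as `co.uk`.

```python
from __future__ import annotations


def split_levels(normalized_url: str) -> tuple[str, str]:
    """Return (parent, root) for a normalized URL like 'blog.example.com/path?a=1'."""
    # Parent: the host, i.e. everything before the first '/' or '?'.
    host = normalized_url.split("/", 1)[0].split("?", 1)[0]

    # Root: naive "last two labels" rule (assumption; ignores public suffixes).
    labels = host.split(".")
    root = ".".join(labels[-2:]) if len(labels) > 2 else host
    return host, root


print(split_levels("blog.example.com/path?a=1"))
# ('blog.example.com', 'example.com')
```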

Hashing

For efficient storage and comparison, SHA-256 hashes are computed for:

  • The normalized URL
  • The parent normal URL (domain without path)
  • The root normal URL (root domain without subdomains)

This provides fixed-length representations suitable for database indexing.
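
As an illustration of why a hash makes a convenient index key, the sketch below computes a SHA-256 digest of a normalized URL with `hashlib`. The UTF-8 encoding and hexadecimal output are assumptions; this README does not state how the library encodes its hashes, and `url_hash` is a hypothetical helper.

```python
import hashlib


def url_hash(normalized_url: str) -> str:
    """Return the SHA-256 digest of a normalized URL as 64 hex characters."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()


digest = url_hash("example.com/path?a=1&b=2")
print(len(digest))  # 64
```

Because the digest length does not depend on the URL length, it fits a fixed-width column, and equal normalized URLs always produce equal digests.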

Important Caveats

While this normalization process works well for most use cases, there are some limitations:

  1. www subdomain removal: Technically, www.example.com and example.com could serve different content, though this is rare in practice.

  2. Case sensitivity: URLs are lowercased, including the path, but some servers treat paths as case-sensitive, so /Page and /page may be distinct resources that normalize to the same form.

  3. Tracking parameters: New tracking parameters emerge over time and may not be in the removal list.

  4. Fragment removal: URL fragments (#anchors) are removed, which may affect single-page applications.

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=tk_normalizer

# Run linting
ruff check src tests

Running Tests

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_normalizer.py

# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues and questions, please use the GitHub issue tracker.

Credits

Based on the URL normalization functionality from tk-core, extracted and packaged for standalone use.

Download files

Source Distribution

tk_normalizer-1.0.0.tar.gz (14.2 kB)

Uploaded Source

Built Distribution

tk_normalizer-1.0.0-py3-none-any.whl (7.9 kB)

Uploaded Python 3

File details

Details for the file tk_normalizer-1.0.0.tar.gz.

File metadata

  • Download URL: tk_normalizer-1.0.0.tar.gz
  • Upload date:
  • Size: 14.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.6

File hashes

Hashes for tk_normalizer-1.0.0.tar.gz

Algorithm    Hash digest
SHA256       b9b019f744cbf10d61a28bb13550257facbb719e269762b3bb845d69ec6fc3ca
MD5          460d1405dffd47159b59c21bf565ca0c
BLAKE2b-256  1d8a74442ec60916c198025647ae7a9c77af9f1f85b9ff6148f91c5aa6f1b89c

File details

Details for the file tk_normalizer-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: tk_normalizer-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 7.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.6

File hashes

Hashes for tk_normalizer-1.0.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       55a98ef916e50b379f73ef36a962388615e1b2fb5efb2619aa6cba57d0eadf45
MD5          f5ad6f7467fcc508c6972666e45baa91
BLAKE2b-256  fb1490991748297b557e64a94141a6d50bda42018ae4d868e5bd601040cfbbf4
