tk-normalizer
URL normalization library for creating consistent URL representations.
Purpose
URL normalization makes URLs compare as equal when they differ only in letter case, scheme (http vs. https), a leading www, or query parameter order. This library produces normalized representations of URLs for consistent storage, comparison, and analysis.
Installation
pip install tk-normalizer
Quick Start
from tk_normalizer import TkNormalizer
# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized)) # Output: example.com/path?a=1&b=2
# Get full details with dict()
print(dict(normalized)) # Returns all fields including query_string, path, and hashes
Features
URL Normalization
The following URLs all normalize to the same form:
https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign
All normalize to: example.com
Normalization Process
URLs are normalized through the following steps:
- ✅ Protocol and www subdomains removed
- ✅ Lowercased
- ✅ Trailing slashes removed
- ✅ Query parameters reordered alphabetically by key
- ✅ Duplicate query parameter key/value pairs removed
- ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
- ✅ Non-HTTP(S) protocols rejected
- ✅ Localhost URLs rejected
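The steps above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: the function name `sketch_normalize`, the `TRACKING_KEYS` subset, the use of `ValueError`, and the choice to ignore ports are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Illustrative subset of tracking keys (assumption); the library's list is longer.
TRACKING_KEYS = {"gclid", "fbclid", "dclid", "msclkid"}


def sketch_normalize(url: str) -> str:
    """Approximate the documented normalization steps (hypothetical helper)."""
    parts = urlsplit(url.strip().lower())  # lowercase the whole URL first
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported or missing scheme: {parts.scheme!r}")

    host = parts.hostname or ""  # note: .hostname drops any port (assumption)
    if host.startswith("www."):
        host = host[len("www."):]
    if not host or host == "localhost":
        raise ValueError("missing or local host")

    # Drop tracking parameters, collapse duplicate key/value pairs, sort by key.
    pairs = {
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_") and key not in TRACKING_KEYS
    }
    query = urlencode(sorted(pairs))

    path = parts.path.rstrip("/")  # trailing slashes removed
    # The fragment is never copied over, so it is dropped implicitly.
    return host + path + (f"?{query}" if query else "")


print(sketch_normalize("http://www.Example.com/path?b=2&a=1&utm_source=test"))
# example.com/path?a=1&b=2
```

Using a set for the query pairs is what removes exact duplicates; sorting the resulting tuples orders them by key, then by value.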
Tracking Parameters Removed
The following tracking parameters are automatically removed during normalization:
- utm_* (all utm parameters)
- gclid, fbclid, dclid (click identifiers)
- _ga, _gid, _fbp, _hjid (analytics cookies)
- msclkid (Microsoft Ads)
- aff_id, affid (affiliate tracking)
- referrer, adgroupid, srsltid
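A filter over the parameters listed above might look like the sketch below. The names `is_tracking_param` and `strip_tracking` are hypothetical, and the key set only mirrors the list in this section; the library may match additional parameters or match them differently.

```python
from urllib.parse import parse_qsl, urlencode

# Keys copied from the list above; prefixes cover the utm_* family.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = frozenset({
    "gclid", "fbclid", "dclid",
    "_ga", "_gid", "_fbp", "_hjid",
    "msclkid",
    "aff_id", "affid",
    "referrer", "adgroupid", "srsltid",
})


def is_tracking_param(key: str) -> bool:
    """Return True if the key looks like a tracking parameter (hypothetical helper)."""
    key = key.lower()
    return key.startswith(TRACKING_PREFIXES) or key in TRACKING_KEYS


def strip_tracking(query: str) -> str:
    """Remove tracking parameters from a raw query string, keeping the original order."""
    kept = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if not is_tracking_param(key)
    ]
    return urlencode(kept)


print(strip_tracking("utm_source=x&a=1&gclid=abc&b=2"))  # a=1&b=2
```

Keeping the prefixes and exact keys in separate constants makes it easier to add newly observed tracking parameters without touching the matching logic.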
Advanced Usage
Getting Full Normalization Details
from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")
# Use str() for just the normalized URL
print(str(normalizer)) # blog.example.com/page?a=1&b=2
# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
# 'normalized_url': 'blog.example.com/page?a=1&b=2',
# 'parent_normalized_url': 'blog.example.com',
# 'root_normalized_url': 'example.com',
# 'query_string': 'a=1&b=2',
# 'path': '/page',
# 'normalized_url_hash': '...',
# 'parent_normalized_url_hash': '...',
# 'root_normalized_url_hash': '...'
# }
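To illustrate how the parent and root fields relate to the normalized URL, here is a rough sketch. Both helper names are hypothetical, and `naive_root_of` simply keeps the last two host labels, which is an assumption that breaks for multi-part suffixes such as co.uk; how the library actually derives the root domain is not described here.

```python
def parent_of(normalized_url: str) -> str:
    """Return the host portion: everything before the first '/' or '?'."""
    for separator in ("/", "?"):
        normalized_url = normalized_url.split(separator, 1)[0]
    return normalized_url


def naive_root_of(host: str) -> str:
    """Keep only the last two labels of the host (naive; wrong for e.g. co.uk)."""
    return ".".join(host.split(".")[-2:])


parent = parent_of("blog.example.com/page?a=1&b=2")
print(parent)                 # blog.example.com
print(naive_root_of(parent))  # example.com
```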
Error Handling
from tk_normalizer import TkNormalizer, InvalidUrlException
try:
    normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
    print(f"Invalid URL: {e}")
Accessing Individual Components
from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("https://blog.example.com/path?a=1")
# Dict-like access to individual fields
print(normalizer["normalized_url"]) # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"]) # example.com
print(normalizer["query_string"]) # a=1
print(normalizer["path"]) # /path
# Iterate over available fields
for key in normalizer:
    print(f"{key}: {normalizer[key]}")
# Get all field names
print(normalizer.keys())
Hashing
For efficient storage and comparison, SHA-256 hashes are computed for:
- The normalized URL
- The parent normalized URL (domain without path)
- The root normalized URL (root domain without subdomains)
This provides fixed-length representations suitable for database indexing.
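As a minimal sketch of the idea, a SHA-256 digest of a normalized URL can be computed with `hashlib`. The helper name `sha256_hex`, the UTF-8 encoding, and the hexadecimal output format are assumptions; the library's hash fields may be encoded differently.

```python
import hashlib


def sha256_hex(normalized_url: str) -> str:
    """Hash a normalized URL string (assumes UTF-8 input and hex output)."""
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()


short_digest = sha256_hex("example.com")
long_digest = sha256_hex("blog.example.com/page?a=1&b=2")

# Both digests are 64 hex characters, regardless of the input length.
print(len(short_digest), len(long_digest))
```

Because equivalent URLs normalize to the same string, they also produce the same digest, so a fixed-width hash column can serve as the lookup key.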
Important Caveats
While this normalization process works well for most use cases, there are some limitations:
- www subdomain removal: Technically, www.example.com and example.com could serve different content, though this is rare in practice.
- Case sensitivity: URLs are lowercased, but some servers are case-sensitive for paths.
- Tracking parameters: New tracking parameters emerge over time and may not be in the removal list.
- Fragment removal: URL fragments (#anchors) are removed, which may affect single-page applications.
Development
Setting Up Development Environment
# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=tk_normalizer
# Run linting
ruff check src tests
Running Tests
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_normalizer.py
# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For issues and questions, please use the GitHub issue tracker.
Credits
Based on the URL normalization functionality from tk-core, extracted and packaged for standalone use.