URL normalization library for consistent URL representation
tk-normalizer
URL normalization library for creating consistent URL representations.
Purpose
URL normalization makes URLs compare as equal when they differ only in protocol, www prefix, letter case, trailing slash, fragment, tracking parameters, or query parameter order. This library creates normalized representations of URLs for consistent storage, comparison, and analysis.
Installation
pip install tk-normalizer
Quick Start
from tk_normalizer import TkNormalizer
# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized)) # Output: example.com/path?a=1&b=2
# Get full details with dict()
print(dict(normalized)) # Returns all fields including query_string, path, and hashes
Features
URL Normalization
The following URLs all normalize to the same form:
https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign
All normalize to: example.com
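One practical use of this equivalence is grouping raw URLs that point at the same page. The sketch below is only an illustration: `group_by_normalized` is a hypothetical helper, not part of tk-normalizer, and it accepts any callable that maps a URL to its normalized string. With this library, that callable could be `lambda u: str(TkNormalizer(u))`.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_by_normalized(
    urls: Iterable[str], normalize: Callable[[str], str]
) -> dict[str, list[str]]:
    """Group raw URLs under their normalized form (hypothetical helper)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        # URLs that normalize to the same string land in the same bucket.
        groups[normalize(url)].append(url)
    return dict(groups)
```

Each key of the returned dict is a normalized URL, and each value lists the original URLs that collapsed into it, in input order.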
Normalization Process
URLs are normalized through the following steps:
- ✅ Protocol and www subdomains removed
- ✅ Lowercased
- ✅ Trailing slashes removed
- ✅ Fragments (#anchors) removed
- ✅ Query parameters reordered alphabetically by key
- ✅ Duplicate query parameter key/value pairs removed
- ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
- ✅ Non-HTTP(S) protocols rejected
- ✅ Localhost URLs rejected
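The steps above can be approximated with the standard library. This is a minimal sketch and not tk-normalizer's actual implementation: the function name `sketch_normalize`, the use of `ValueError`, and the shortened tracking key set are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed subset of tracking keys; the library's real list is longer.
TRACKING_KEYS = {"gclid", "fbclid", "dclid", "msclkid"}


def sketch_normalize(url: str) -> str:
    """Illustrative normalization only; not the library's code."""
    parts = urlsplit(url.strip().lower())
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = parts.hostname or ""  # hostname excludes any port
    if host.startswith("www."):
        host = host[len("www."):]
    if host in ("", "localhost"):
        raise ValueError("missing or local host")
    # Drop tracking keys, remove duplicate pairs, sort by key (then value).
    pairs = sorted({
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_") and key not in TRACKING_KEYS
    })
    path = parts.path.rstrip("/")
    query = urlencode(pairs)
    # The fragment is never read, so it is dropped implicitly.
    return host + path + ("?" + query if query else "")
```

Under these assumptions, `sketch_normalize("http://www.Example.com/path?b=2&a=1&utm_source=test")` returns `example.com/path?a=1&b=2`, matching the Quick Start output.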
Tracking Parameters Removed
The following tracking parameters are automatically removed during normalization:
- utm_* (all utm parameters)
- gclid, fbclid, dclid (click identifiers)
- _ga, _gid, _fbp, _hjid (analytics cookies)
- msclkid (Microsoft Ads)
- aff_id, affid (affiliate tracking)
- referrer, adgroupid, srsltid
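A filter over this list might look like the sketch below. The key names come from the list above; the helper names `is_tracking_param` and `strip_tracking`, and the case-insensitive match, are assumptions rather than the library's API.

```python
from urllib.parse import parse_qsl

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = frozenset({
    "gclid", "fbclid", "dclid",        # click identifiers
    "_ga", "_gid", "_fbp", "_hjid",    # analytics cookies
    "msclkid",                         # Microsoft Ads
    "aff_id", "affid",                 # affiliate tracking
    "referrer", "adgroupid", "srsltid",
})


def is_tracking_param(key: str) -> bool:
    """Return True if the query key is in the removal list (hypothetical helper)."""
    key = key.lower()
    return key.startswith(TRACKING_PREFIXES) or key in TRACKING_KEYS


def strip_tracking(query: str) -> list[tuple[str, str]]:
    """Parse a query string and keep only non-tracking key/value pairs."""
    return [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if not is_tracking_param(key)
    ]
```

Matching `utm_` by prefix rather than by exact name means new `utm_*` variants are covered without updating the set.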
Advanced Usage
Getting Full Normalization Details
from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")
# Use str() for just the normalized URL
print(str(normalizer)) # blog.example.com/page?a=1&b=2
# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
# 'normalized_url': 'blog.example.com/page?a=1&b=2',
# 'parent_normalized_url': 'blog.example.com',
# 'root_normalized_url': 'example.com',
# 'query_string': 'a=1&b=2',
# 'path': '/page',
# 'normalized_url_hash': '...',
# 'parent_normalized_url_hash': '...',
# 'root_normalized_url_hash': '...'
# }
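The relationship between the normalized, parent, and root forms in the output above can be sketched as follows. This is a naive illustration, not the library's logic: `parent_and_root` is a hypothetical helper, and taking the last two host labels as the root is an assumption that gives wrong results for multi-part suffixes such as `co.uk`.

```python
def parent_and_root(normalized_url: str) -> tuple[str, str]:
    """Derive (parent, root) from a normalized URL (naive, hypothetical helper)."""
    # Parent: the host only, with any path and query string removed.
    host = normalized_url.split("/", 1)[0].split("?", 1)[0]
    # Root: last two labels of the host. This ignores public suffix rules.
    root = ".".join(host.split(".")[-2:])
    return host, root
```

For `blog.example.com/page?a=1&b=2`, this returns `("blog.example.com", "example.com")`.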
Error Handling
from tk_normalizer import TkNormalizer, InvalidUrlException
try:
    normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
    print(f"Invalid URL: {e}")
Accessing Individual Components
from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("https://blog.example.com/path?a=1")
# Dict-like access to individual fields
print(normalizer["normalized_url"]) # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"]) # example.com
print(normalizer["query_string"]) # a=1
print(normalizer["path"]) # /path
# Iterate over available fields
for key in normalizer:
    print(f"{key}: {normalizer[key]}")
# Get all field names
print(normalizer.keys())
Hashing
For efficient storage and comparison, SHA-256 hashes are computed for:
- The normalized URL
- The parent normalized URL (domain without path)
- The root normalized URL (root domain without subdomains)
This provides fixed-length representations suitable for database indexing.
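A SHA-256 digest of a normalized URL can be reproduced with the standard library, as sketched below. The helper name `sha256_hex`, the UTF-8 encoding, and the hex output format are assumptions; the library's exact hashing input and encoding may differ.

```python
import hashlib


def sha256_hex(normalized_url: str) -> str:
    """Hash a normalized URL string (assumes UTF-8 input and hex output)."""
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()


digest = sha256_hex("example.com/path?a=1&b=2")
print(len(digest))  # 64 hex characters, regardless of URL length
```

The fixed 64-character width is what makes the digest convenient as an indexed database column.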
Important Caveats
While this normalization process works well for most use cases, there are some limitations:
- www subdomain removal: Technically, www.example.com and example.com could serve different content, though this is rare in practice.
- Case sensitivity: URLs are lowercased, but some servers are case-sensitive for paths.
- Tracking parameters: New tracking parameters emerge over time and may not be in the removal list.
- Fragment removal: URL fragments (#anchors) are removed, which may affect single-page applications.
Development
Setting Up Development Environment
# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=tk_normalizer
# Run linting
ruff check src tests
Running Tests
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_normalizer.py
# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Deploying to PyPI
We have a workflow that deploys to our PyPI package whenever a release is created. To publish a new version:
- Open a PR for your change.
- Increment the version number in the pyproject.toml file as part of your changes.
- After approval, merge the changes to the main branch.
- Create a new release in GitHub.
  - The release section is on the right-hand side of the repo's home page.
  - You should see the current release there.
- For consistency in the release:
  - Create a new tag that matches the version number you changed earlier.
  - Add a title with a brief description of the changes.
  - Add a short description or links to JIRA tickets for the updates.
- After you create the release, a workflow is triggered that deploys the updated version to PyPI.
- To confirm the deployment, check the PyPI package page after the workflow finishes.
NOTE: DO NOT change the name of the workflow file. If you do, the deployment will not work until the trusted publisher configuration in PyPI is updated.
If you have questions or concerns, reach out.
Deploying to Snowflake -- UDF Update
The normalizer implementation in Snowflake is defined at TERAKEET.COMMON.NORMALIZE_URL. Once the package has been deployed to PyPI, another workflow runs and updates the UDF in Snowflake. The UDF requires the new PyPI version of the package, and the update simply overrides the current function. This happens immediately after the PyPI deployment completes.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For issues and questions, please use the GitHub issue tracker.
Credits
Based on the URL normalization functionality from tk-core, extracted and packaged for standalone use.
File details
Details for the file tk_normalizer-1.2.0.tar.gz.
File metadata
- Download URL: tk_normalizer-1.2.0.tar.gz
- Upload date:
- Size: 17.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b8f32a45aa3f1d22b791b4d1f657c5d5532631e33bceed1e5b7199890940a10 |
| MD5 | 5e9b2214c3a83c992e47f138f08bd662 |
| BLAKE2b-256 | 1c74b402164476c62467df29b70b33d885befc6f1323b90c009b6036a946ca2f |
Provenance
The following attestation bundles were made for tk_normalizer-1.2.0.tar.gz:
Publisher: deploy_to_pypi.yml on terakeet/tk-normalizer
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: tk_normalizer-1.2.0.tar.gz
  - Subject digest: 5b8f32a45aa3f1d22b791b4d1f657c5d5532631e33bceed1e5b7199890940a10
  - Sigstore transparency entry: 1451062606
  - Sigstore integration time:
  - Permalink: terakeet/tk-normalizer@74f99005a143b658e971ead8d0f05bd678cedad3
  - Branch / Tag: refs/tags/1.2.0
  - Owner: https://github.com/terakeet
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: deploy_to_pypi.yml@74f99005a143b658e971ead8d0f05bd678cedad3
  - Trigger Event: release
File details
Details for the file tk_normalizer-1.2.0-py3-none-any.whl.
File metadata
- Download URL: tk_normalizer-1.2.0-py3-none-any.whl
- Upload date:
- Size: 8.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e4b5dd0140c474590658fd604a14ddaf025f69990360b05a99f1d029014481c |
| MD5 | 30bccaf205d0843f55ca0b8c575d5e97 |
| BLAKE2b-256 | e474c762d88400bfe71ceba3b09070f14e2781be6d0c0232885d8607ecba9c3a |
Provenance
The following attestation bundles were made for tk_normalizer-1.2.0-py3-none-any.whl:
Publisher: deploy_to_pypi.yml on terakeet/tk-normalizer
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: tk_normalizer-1.2.0-py3-none-any.whl
  - Subject digest: 1e4b5dd0140c474590658fd604a14ddaf025f69990360b05a99f1d029014481c
  - Sigstore transparency entry: 1451062957
  - Sigstore integration time:
  - Permalink: terakeet/tk-normalizer@74f99005a143b658e971ead8d0f05bd678cedad3
  - Branch / Tag: refs/tags/1.2.0
  - Owner: https://github.com/terakeet
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: deploy_to_pypi.yml@74f99005a143b658e971ead8d0f05bd678cedad3
  - Trigger Event: release