URL normalization library for consistent URL representation

Maintainers

tk-innovation

tk-normalizer

URL normalization library for creating consistent URL representations.

Purpose

URL normalization makes URLs that differ only in letter case, protocol (scheme), www prefix, tracking parameters, or query parameter order compare as equal. This library produces a normalized representation of each URL for consistent storage, comparison, and analysis.

Installation

pip install tk-normalizer

Quick Start

from tk_normalizer import TkNormalizer

# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized))  # Output: example.com/path?a=1&b=2

# Get full details with dict()
print(dict(normalized))  # Returns all fields including query_string, path, and hashes

Features

URL Normalization

The following URLs all normalize to the same form:

https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign

All normalize to: example.com

Normalization Process

Normalization applies the following rules:

✅ Protocol and www subdomains removed
✅ Lowercased
✅ Trailing slashes removed
✅ Query parameters reordered alphabetically by key
✅ Duplicate query parameter key/value pairs removed
✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
✅ Non-HTTP(S) protocols rejected
✅ Localhost URLs rejected
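
The rules above can be approximated with the standard library alone. The following is a minimal sketch for illustration, not the library's actual implementation: `sketch_normalize` and the abbreviated `TRACKING_KEYS` set are hypothetical names, and the sketch ignores ports and credentials.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Abbreviated, illustrative tracking list (assumption: the real list is longer).
TRACKING_KEYS = {"gclid", "fbclid"}


def sketch_normalize(url: str) -> str:
    """Approximate the documented rules using only the standard library."""
    parts = urlsplit(url.strip().lower())          # lowercase the whole URL
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = parts.hostname or ""                    # hostname drops any port
    if host.startswith("www."):
        host = host[4:]                            # drop the www subdomain
    if not host or host == "localhost":
        raise ValueError("missing or local host")
    path = parts.path.rstrip("/")                  # drop trailing slashes
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_") and key not in TRACKING_KEYS
    ]
    # set() removes duplicate key/value pairs; sorted() orders by key, then value.
    query = urlencode(sorted(set(pairs)))
    # The fragment (parts.fragment) is never copied into the result.
    return host + path + (f"?{query}" if query else "")


print(sketch_normalize("http://www.Example.com/path?b=2&a=1&utm_source=test"))
```

Raising `ValueError` here is a choice made for the sketch; the library itself reports invalid input through `InvalidUrlException`.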

Tracking Parameters Removed

The following tracking parameters are automatically removed during normalization:

utm_* (all utm parameters)
gclid, fbclid, dclid (click identifiers)
_ga, _gid, _fbp, _hjid (analytics cookies)
msclkid (Microsoft Ads)
aff_id, affid (affiliate tracking)
referrer, adgroupid, srsltid
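
To make the filtering rule concrete, here is a small sketch that drops the parameters listed above from a query string. The helper names `is_tracking_key` and `strip_tracking` are hypothetical, and matching keys case-insensitively is an assumption of the sketch rather than documented library behavior.

```python
from urllib.parse import parse_qsl, urlencode

# Exact-match keys copied from the list above.
TRACKING_KEYS = frozenset({
    "gclid", "fbclid", "dclid",
    "_ga", "_gid", "_fbp", "_hjid",
    "msclkid",
    "aff_id", "affid",
    "referrer", "adgroupid", "srsltid",
})
# Prefix matches cover the whole utm_* family.
TRACKING_PREFIXES = ("utm_",)


def is_tracking_key(key: str) -> bool:
    """Return True if a query parameter name looks like a tracking parameter."""
    key = key.lower()  # assumption: compare case-insensitively
    return key in TRACKING_KEYS or key.startswith(TRACKING_PREFIXES)


def strip_tracking(query: str) -> str:
    """Remove tracking parameters from a raw query string, keeping order."""
    kept = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if not is_tracking_key(key)
    ]
    return urlencode(kept)


print(strip_tracking("a=1&utm_source=x&gclid=abc&b=2"))
```

Keeping the exact-match set and the prefix tuple separate makes it easy to extend either one when a new tracking parameter appears.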

Advanced Usage

Getting Full Normalization Details

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")

# Use str() for just the normalized URL
print(str(normalizer))  # blog.example.com/page?a=1&b=2

# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
#   'normalized_url': 'blog.example.com/page?a=1&b=2',
#   'parent_normalized_url': 'blog.example.com',
#   'root_normalized_url': 'example.com',
#   'query_string': 'a=1&b=2',
#   'path': '/page',
#   'normalized_url_hash': '...',
#   'parent_normalized_url_hash': '...',
#   'root_normalized_url_hash': '...'
# }
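
The relationship between the three URL fields can be sketched as follows. This is an illustration only: `parent_and_root` is a hypothetical helper, and taking the last two host labels as the root is a naive assumption. How the library derives the root is not described here, and the naive rule gives the wrong answer for multi-part suffixes such as example.co.uk, where a Public Suffix List lookup would be needed.

```python
def parent_and_root(normalized_url: str) -> tuple[str, str]:
    """Split a scheme-less normalized URL into (parent, root)."""
    # The parent is the host: everything before the first "/" or "?".
    host = normalized_url.split("/", 1)[0].split("?", 1)[0]
    labels = host.split(".")
    # Naive assumption: the root domain is the last two labels of the host.
    root = ".".join(labels[-2:])
    return host, root


print(parent_and_root("blog.example.com/page?a=1&b=2"))
```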

Error Handling

from tk_normalizer import TkNormalizer, InvalidUrlException

try:
    normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
    print(f"Invalid URL: {e}")
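
When normalizing many URLs, one invalid input usually should not abort the whole batch. The sketch below collects successes and failures separately. `normalize_many` is a hypothetical helper, not part of tk-normalizer; it takes the normalizer and the exception type as arguments so the pattern stays independent of any one library.

```python
from typing import Callable, Iterable


def normalize_many(
    urls: Iterable[str],
    normalize: Callable[[str], object],
    error_type: type[Exception] = Exception,
) -> tuple[dict[str, str], dict[str, str]]:
    """Normalize each URL, returning (normalized_by_url, error_by_url)."""
    normalized: dict[str, str] = {}
    errors: dict[str, str] = {}
    for url in urls:
        try:
            # str() on the result yields just the normalized URL.
            normalized[url] = str(normalize(url))
        except error_type as exc:
            # Record the failure and keep going with the remaining URLs.
            errors[url] = str(exc)
    return normalized, errors
```

With the library installed, the call would presumably be `normalize_many(urls, TkNormalizer, InvalidUrlException)`.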

Accessing Individual Components

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("https://blog.example.com/path?a=1")

# Dict-like access to individual fields
print(normalizer["normalized_url"])       # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"])   # example.com
print(normalizer["query_string"])          # a=1
print(normalizer["path"])                  # /path

# Iterate over available fields
for key in normalizer:
    print(f"{key}: {normalizer[key]}")

# Get all field names
print(normalizer.keys())

Hashing

For efficient storage and comparison, SHA-256 hashes are computed for:

The normalized URL
The parent normalized URL (domain without path)
The root normalized URL (root domain without subdomains)

This provides fixed-length representations suitable for database indexing.
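
As an illustration of why the hashes are fixed-length, the sketch below hashes a normalized URL with the standard library. UTF-8 encoding and hexadecimal output are assumptions of the sketch; the exact encoding the library uses is not stated here, so these digests may not match the library's.

```python
import hashlib


def sha256_hex(normalized_url: str) -> str:
    """Return the SHA-256 digest of a normalized URL as 64 hex characters."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()


short = sha256_hex("example.com")
long = sha256_hex("blog.example.com/a/very/long/path?a=1&b=2")
print(len(short), len(long))  # both digests have the same length
```

Because every digest has the same length regardless of URL length, a fixed-width column with an index can be used for lookups.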

Important Caveats

While this normalization process works well for most use cases, there are some limitations:

www subdomain removal: Technically, www.example.com and example.com could serve different content, though this is rare in practice.
Case sensitivity: URLs are lowercased, but some servers are case-sensitive for paths.
Tracking parameters: New tracking parameters emerge over time and may not be in the removal list.
Fragment removal: URL fragments (#anchors) are removed, which may affect single-page applications.

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=tk_normalizer

# Run linting
ruff check src tests

Running Tests

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_normalizer.py

# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues and questions, please use the GitHub issue tracker.

Credits

Based on the URL normalization functionality from tk-core, extracted and packaged for standalone use.

Release history

1.2.0

May 6, 2026

This version

1.1.0

Dec 11, 2025

1.0.1

Aug 20, 2025

1.0.0 yanked

Aug 20, 2025

Reason this release was yanked:

Was not built correctly.

Download files

Source Distribution

tk_normalizer-1.1.0.tar.gz (15.3 kB)

Uploaded Dec 11, 2025 Source

Built Distribution

tk_normalizer-1.1.0-py3-none-any.whl (8.0 kB)

Uploaded Dec 11, 2025 Python 3

File details

Details for the file tk_normalizer-1.1.0.tar.gz.

File metadata

Download URL: tk_normalizer-1.1.0.tar.gz
Upload date: Dec 11, 2025
Size: 15.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tk_normalizer-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`61188fe8b07343bfbc7d3c92846ca32e5a2450d4aeea9cf88587a0afb613c18c`
MD5	`644b89d2563a1a5964b252bb8e990fbd`
BLAKE2b-256	`c411911039305c2a8bfd7dc2cec3e2ce6be4fe3bcdecfd4652f3b998f0ba4c27`

Provenance

The following attestation bundles were made for tk_normalizer-1.1.0.tar.gz:

Publisher: deploy_to_pypi.yml on terakeet/tk-normalizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tk_normalizer-1.1.0.tar.gz
- Subject digest: 61188fe8b07343bfbc7d3c92846ca32e5a2450d4aeea9cf88587a0afb613c18c
- Sigstore transparency entry: 760446816
- Sigstore integration time: Dec 11, 2025
Source repository:
- Permalink: terakeet/tk-normalizer@02209d0753c6be25cb59e8a6b80224c921308ba5
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/terakeet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: deploy_to_pypi.yml@02209d0753c6be25cb59e8a6b80224c921308ba5
- Trigger Event: release

File details

Details for the file tk_normalizer-1.1.0-py3-none-any.whl.

File metadata

Download URL: tk_normalizer-1.1.0-py3-none-any.whl
Upload date: Dec 11, 2025
Size: 8.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tk_normalizer-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e494ac5e3ce9dfa0f0331b936fdaed1cad883ce7f0a315b7387e4305246a0107`
MD5	`f97d61aea8c1015875137ddf51224718`
BLAKE2b-256	`8af24114a35336992e8d39c52aa114a570282916d0c5739b30ab83d2fac6273c`

Provenance

The following attestation bundles were made for tk_normalizer-1.1.0-py3-none-any.whl:

Publisher: deploy_to_pypi.yml on terakeet/tk-normalizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tk_normalizer-1.1.0-py3-none-any.whl
- Subject digest: e494ac5e3ce9dfa0f0331b936fdaed1cad883ce7f0a315b7387e4305246a0107
- Sigstore transparency entry: 760446817
- Sigstore integration time: Dec 11, 2025
Source repository:
- Permalink: terakeet/tk-normalizer@02209d0753c6be25cb59e8a6b80224c921308ba5
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/terakeet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: deploy_to_pypi.yml@02209d0753c6be25cb59e8a6b80224c921308ba5
- Trigger Event: release
