A Python package for efficiently checking whether a URL is part of a large whitelist or blacklist of URLs and domain names.
url-is-in
A Python package for efficiently checking whether URLs are part of large whitelists or blacklists. Built for speed and scalability, url-is-in chooses a matching algorithm based on dataset size and supports both URL-based and SURT-based matching.
Features
- 🌐 URL Normalization: Uses SURT (Sort-friendly URI Reordering Transform) for consistent URL comparison
- 🔍 Subdomain Matching: Optional subdomain matching for domain-based filtering
- 📊 Scalable: Efficiently handles large URL lists using trie matching (tested with more than 1 million URLs)
- 🎯 Flexible: Support for both URL and SURT-based matching
- 🐍 Python 3.8-3.13: Modern Python support with type hints
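To make the SURT idea concrete, here is a minimal, simplified sketch of the transform using only the standard library. The function name `simple_surt` is hypothetical and not part of url-is-in; the real `surt` library applies further canonicalization that this sketch leaves out. It only shows the core reordering: host labels are reversed so that all URLs of a domain share a common string prefix.

```python
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Return a simplified SURT-like key: reversed host labels, then the path.

    Illustration only; this is not the url-is-in or surt implementation.
    """
    # Accept bare hostnames by giving them a scheme so urlsplit finds the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""          # urlsplit lowercases the hostname
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"             # an empty path is treated as the root
    return f"{reversed_host}){path}"


print(simple_surt("https://example.com"))             # com,example)/
print(simple_surt("https://test.org/specific/path"))  # org,test)/specific/path
```

Because the most significant host label comes first, a sorted list of such keys groups every URL of a domain together, which is what makes prefix-based matching practical.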
Installation
Using pip
pip install url-is-in
Using uv (recommended for development)
uv add url-is-in
From source
git clone https://github.com/commoncrawl/url-is-in.git
cd url-is-in
pip install -e .
Requirements
- Python: 3.8 or higher
- Dependencies:
  - surt: For URL normalization and SURT conversion
Quick Start
Basic URL Matching
from url_is_in import URLMatcher
# Create a matcher with a list of URLs
urls = [
'https://example.com',
'https://test.org/specific/path',
'https://github.com/user/repo'
]
matcher = URLMatcher(urls)
# Check if URLs match
print(matcher.is_in('https://example.com/any/path')) # True
print(matcher.is_in('https://test.org/specific/path/file.html')) # True
print(matcher.is_in('https://other.com')) # False
Subdomain Matching
from url_is_in import URLMatcher
# Enable subdomain matching (default: True)
matcher = URLMatcher(['https://example.com'], match_subdomains=True)
print(matcher.is_in('https://www.example.com')) # True
print(matcher.is_in('https://api.example.com')) # True
print(matcher.is_in('https://sub.example.com/path')) # True
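The effect of the `match_subdomains` flag can be pictured with a small standalone sketch. The helper `host_in_list` below is hypothetical and says nothing about how url-is-in implements the check internally; it only illustrates the rule that a host matches a listed domain if it is equal to it or ends with a dot followed by it.

```python
from typing import Iterable


def host_in_list(host: str, listed: Iterable[str], match_subdomains: bool = True) -> bool:
    """Check a hostname against listed domains (illustrative helper, not the package API)."""
    host = host.lower().rstrip(".")
    for entry in listed:
        if host == entry:
            return True
        # Requiring the leading dot prevents "notexample.com" from matching "example.com".
        if match_subdomains and host.endswith("." + entry):
            return True
    return False


print(host_in_list("api.example.com", ["example.com"]))                          # True
print(host_in_list("notexample.com", ["example.com"]))                           # False
print(host_in_list("api.example.com", ["example.com"], match_subdomains=False))  # False
```

In SURT form the same rule becomes a prefix test, since the SURT host part of a subdomain such as `com,example,www)` begins with `com,example,`.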
Hostnames and wildcards
from url_is_in import URLMatcher
# Create a matcher with a list of hostnames and a wildcard
urls = [
'example.com',
'*.org',
'github.com'
]
matcher = URLMatcher(urls)
# Check if URLs match
print(matcher.is_in('https://example.com/any/path')) # True
print(matcher.is_in('https://test.org/specific/path/file.html')) # True
print(matcher.is_in('https://other.com')) # False
SURT-based Matching
For advanced use cases, you can work directly with SURT strings:
from url_is_in import SURTMatcher
# Work with SURT strings directly
surts = [
'com,example)/',
'org,test)/specific/path',
'com,github)/user/repo'
]
matcher = SURTMatcher(surts)
# Check SURT strings
print(matcher.is_in('com,example)/any/path')) # True
print(matcher.is_in('org,test)/other')) # False
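The results above are consistent with plain prefix matching on SURT strings. As a minimal sketch of that idea, assuming nothing about the package internals, `str.startswith` accepts a tuple of prefixes and returns True if any of them matches. The function name `surt_is_in` is hypothetical.

```python
# Listed SURT prefixes, stored as a tuple so str.startswith can test them all at once.
PREFIXES = (
    "com,example)/",
    "org,test)/specific/path",
    "com,github)/user/repo",
)


def surt_is_in(surt: str) -> bool:
    """Return True if the SURT string starts with any listed prefix (illustration only)."""
    return surt.startswith(PREFIXES)


print(surt_is_in("com,example)/any/path"))  # True
print(surt_is_in("org,test)/other"))        # False
```

Note that a pure prefix test would also accept `org,test)/specific/pathology`; whether url-is-in enforces path-segment boundaries is not stated here, so check the package behavior if that distinction matters.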
Algorithm Selection
By default, the package selects a matching algorithm based on the number of URLs:
from url_is_in import URLMatcher
# Automatic selection (default)
matcher = URLMatcher(urls, mode="auto") # Trie for >100 URLs, tuple for ≤100
# Manual selection
fast_matcher = URLMatcher(urls, mode="trie") # Always use trie
simple_matcher = URLMatcher(urls, mode="tuple") # Always use tuple
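To show why a trie scales better than scanning a tuple of prefixes, here is a minimal character-trie sketch. The class `PrefixTrie` and the helper `pick_mode` are hypothetical names for illustration, not the package's implementation; only the threshold of 100 is taken from the comment above. A lookup walks the query once, so its cost depends on the length of the query rather than on the number of stored prefixes.

```python
from typing import Dict, Iterable


class PrefixTrie:
    """Character trie answering: does any stored prefix start this string?"""

    _END = ""  # sentinel key; never collides with a single-character key

    def __init__(self, prefixes: Iterable[str]) -> None:
        self._root: Dict[str, dict] = {}
        for prefix in prefixes:
            node = self._root
            for ch in prefix:
                node = node.setdefault(ch, {})
            node[self._END] = {}  # mark the end of a complete stored prefix

    def has_prefix_of(self, text: str) -> bool:
        node = self._root
        for ch in text:
            if self._END in node:      # a stored prefix ends here: match
                return True
            node = node.get(ch)
            if node is None:           # no stored prefix continues with this character
                return False
        return self._END in node       # text is exactly a stored prefix, or too short


def pick_mode(n_urls: int) -> str:
    """Mirror the documented auto rule: trie for more than 100 URLs, tuple otherwise."""
    return "trie" if n_urls > 100 else "tuple"


trie = PrefixTrie(["com,example)/", "org,test)/specific/path"])
print(trie.has_prefix_of("com,example)/any/path"))  # True
print(trie.has_prefix_of("org,test)/other"))        # False
print(pick_mode(50), pick_mode(5000))               # tuple trie
```

For short lists the tuple scan is simple and has little setup cost, which is a plausible reason for the small-list default; the trie pays a build cost up front in exchange for lookups that do not slow down as the list grows.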
Setting up a development environment
# Clone the repository
git clone https://github.com/commoncrawl/url-is-in.git
cd url-is-in
# Install with development dependencies
uv sync --extra dev
# Run tests
pytest
# Run linting
ruff check .
ruff format .
Running tests
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-fail-under=95
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
File details
Details for the file url_is_in-0.1.1.tar.gz.
File metadata
- Download URL: url_is_in-0.1.1.tar.gz
- Upload date:
- Size: 87.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b119529b14ba34693d4a15a9b5cadd2ce9a26b88a641916d1c264d15b0ad2df |
| MD5 | 02571238f76dc28a209e2c68e93b219a |
| BLAKE2b-256 | c5f81097336f6ac5ec3d5ad741d8b250ea6d76c620b403af1c2bd11ee697cfe1 |
Provenance
The following attestation bundles were made for url_is_in-0.1.1.tar.gz:
Publisher: publish.yml on commoncrawl/url-is-in
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: url_is_in-0.1.1.tar.gz
  - Subject digest: 6b119529b14ba34693d4a15a9b5cadd2ce9a26b88a641916d1c264d15b0ad2df
- Sigstore transparency entry: 575898964
- Sigstore integration time:
- Permalink: commoncrawl/url-is-in@8aa5c5375bc94a9cf3e94958110aedb4d36798a5
- Branch / Tag: refs/heads/main
- Owner: https://github.com/commoncrawl
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8aa5c5375bc94a9cf3e94958110aedb4d36798a5
- Trigger Event: workflow_dispatch
File details
Details for the file url_is_in-0.1.1-py3-none-any.whl.
File metadata
- Download URL: url_is_in-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 51d5b938eea8c28bb3b4a851559dc574d9f6707783799ee9fe0b3322370d124f |
| MD5 | 2e01c1d099800cd9f9517d215ac4b841 |
| BLAKE2b-256 | c4e46618b69821063e1c3441198fc5bb249b885b9a955fc2538bcabf7aa2774b |
Provenance
The following attestation bundles were made for url_is_in-0.1.1-py3-none-any.whl:
Publisher: publish.yml on commoncrawl/url-is-in
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: url_is_in-0.1.1-py3-none-any.whl
  - Subject digest: 51d5b938eea8c28bb3b4a851559dc574d9f6707783799ee9fe0b3322370d124f
- Sigstore transparency entry: 575898994
- Sigstore integration time:
- Permalink: commoncrawl/url-is-in@8aa5c5375bc94a9cf3e94958110aedb4d36798a5
- Branch / Tag: refs/heads/main
- Owner: https://github.com/commoncrawl
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8aa5c5375bc94a9cf3e94958110aedb4d36798a5
- Trigger Event: workflow_dispatch