A simple, fast, and configurable filter for sensitive data in URLs
Project description
filter-url
A simple, fast, and configurable Python utility to censor sensitive data (passwords, API keys, tokens) from URLs, making them safe for logging, monitoring, and debugging.
Key Features
- Comprehensive Censoring: Censors passwords in userinfo (`user:[...]@host`), query parameter values, and parts of the URL path.
- Flexible Rules: Filter query parameters by exact key names or by powerful regular expressions.
- Advanced Path Filtering: Use regex with named capture groups to censor specific dynamic parts of a URL path while leaving the rest intact.
- Order Preserving: Guarantees that the order of query parameters in the output is identical to the input.
- Logging Integration: Provides a ready-to-use `logging.Filter` subclass for seamless integration into your application's logging setup.
- Lightweight: Zero external dependencies.
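The order-preserving behavior can be sketched with the standard library alone. This is a minimal illustration of the approach, not filter-url's actual implementation; the `censor_query` function is hypothetical, and the `SENSITIVE` set simply mirrors the documented default keys:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl

# Mirrors the library's documented default bad_keys
SENSITIVE = {"password", "token", "key", "secret", "auth", "apikey", "credentials"}

def censor_query(url: str, placeholder: str = "[...]") -> str:
    """Replace values of sensitive query keys, keeping parameter order."""
    parts = urlsplit(url)
    # parse_qsl yields (key, value) pairs in their original order
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    censored = [(k, placeholder if k.lower() in SENSITIVE else v) for k, v in pairs]
    # Rebuild the query by hand so the placeholder is not percent-encoded
    query = "&".join(f"{k}={v}" for k, v in censored)
    return urlunsplit(parts._replace(query=query))
```

Because `parse_qsl` returns pairs in document order and the query string is rebuilt from that same list, the output parameters always appear in the same order as the input.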
Installation
pip install filter-url
Quick Start
The quickest way to use the library is the standalone filter_url() function, which uses a default set of rules to catch common sensitive keys.
from filter_url import filter_url
dirty_url = "https://user:my-secret-password@example.com/data?token=abc-123-xyz"
# Use the function with default filters
clean_url = filter_url(dirty_url)
print(clean_url)
# >> https://user:[...]@example.com/data?token=[...]
Usage & Examples
Basic Filtering (Standalone Function)
The filter_url() function is great for one-off tasks. You can pass your own filtering rules directly to it. If a rule is not provided, a sensible default is used.
from filter_url import filter_url
# Define custom rules
custom_path_re = r'/user/(?P<user_id>\d+)/profile'
dirty_url = "https://example.com/user/123456/profile?credit_card_number=5555"
# Censor using a custom path regex
clean_url = filter_url(
url=dirty_url,
bad_path_re=custom_path_re
)
print(clean_url)
# >> https://example.com/user/[...]/profile?credit_card_number=5555
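The mechanics behind named-capture-group censoring can be illustrated with a stdlib-only sketch. This is not the library's code; `censor_groups` is a hypothetical helper written for illustration, and it assumes the named groups appear left to right without overlapping:

```python
import re

def censor_groups(path: str, pattern: str, placeholder: str = "[...]") -> str:
    """Replace each named-group match in `path` with the placeholder."""
    regex = re.compile(pattern)
    m = regex.search(path)
    if not m or not regex.groupindex:
        return path  # no match or no named groups: leave the path intact
    out, last = [], 0
    # groupindex preserves definition order; assumes groups do not overlap
    for name in regex.groupindex:
        start, end = m.span(name)
        out.append(path[last:start])
        out.append(placeholder)
        last = end
    out.append(path[last:])
    return "".join(out)
```

Only the captured spans are replaced; everything outside the named groups survives verbatim, which is why the `/profile` suffix in the example above is left intact.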
Advanced: Using the FilterURL Class for Performance
When you need to filter a large number of URLs with the same configuration, it's much more efficient to instantiate the FilterURL class once. This pre-compiles the regular expressions and avoids redundant work in a loop.
from filter_url import FilterURL
# Create the filter instance ONCE with your custom rules.
# The regexes are compiled here.
my_filter = FilterURL(
bad_keys={'api_key'},
bad_keys_re=[r'session']
)
urls_to_process = [
"https://service.com/api?api_key=key-1",
"https://service.com/api?user_session=sess-2",
"https://service.com/api?id=3"
]
# Reuse the same instance in a loop for high performance
clean_urls = [my_filter.remove_sensitive(url) for url in urls_to_process]
# clean_urls will be:
# [
# 'https://service.com/api?api_key=[...]',
# 'https://service.com/api?user_session=[...]',
# 'https://service.com/api?id=3'
# ]
The class keeps an internal cache of filtered URLs; you can tune its size or disable it entirely with the cache_size parameter (see the API reference below).
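Conceptually, such a cache behaves like the standard library's LRU memoization. The following sketch is not FilterURL's internals; the `censor_cached` function and its regex are illustrative assumptions showing what a `cache_size`-style setting buys you:

```python
from functools import lru_cache
import re

# Hypothetical stand-in for the library's censoring step
_SECRET_RE = re.compile(r"(password|token)=[^&]+")

@lru_cache(maxsize=512)  # conceptually what cache_size=512 provides
def censor_cached(url: str) -> str:
    """Censor and memoize: a repeated URL skips the regex work entirely."""
    return _SECRET_RE.sub(r"\1=[...]", url)
```

Calling `censor_cached` twice with the same URL performs the substitution only once; `censor_cached.cache_info()` reports the hit count.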
Integration with Python's logging Module
This is the most powerful feature for real-world applications. The URLFilter automatically censors URLs in your logs. The filter works in two ways:
- (Preferred) It looks for a `url` key in the `extra` dictionary of your logging call.
- (Fallback) If `fallback=True` (the default), it searches for URLs in the positional arguments of the log message.
import logging
import sys
from filter_url import URLFilter
# 1. Configure a logger
logger = logging.getLogger('my_app')
logger.setLevel(logging.INFO)
if logger.hasHandlers():
logger.handlers.clear()
# 2. Simply add our filter. Let's use custom rules for this example
custom_filter = URLFilter(
bad_keys={'access_token'},
fallback=True # Default, but shown for clarity
)
logger.addFilter(custom_filter)
# 3. Use a standard Formatter. No special formatter is needed
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('%(levelname)s: %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
# --- Usage Examples ---
# Case 1: (Preferred) Pass the URL via 'extra'
logger.info(
"User login attempt failed",
extra={'url': "https://auth.service.com/login?access_token=12345"}
)
# Case 2: (Fallback) The URL is an argument in the message string
logger.info(
"API call to %s resulted in a 404 error.",
"https://api.service.com/data/v1/user?password=abc"
)
# Case 3: No URL in the message. Nothing extra is added
logger.info("Application started successfully.")
Be aware of a minor trade-off between the logging filter and the FilterURL class. If each URL is logged only once, the logging filter is the perfect solution: it keeps your code straightforward and clean. If you process URLs and output them several times at different stages, prepare them in advance with the FilterURL class to save CPU cycles. Filtered URLs are stored in FilterURL's internal cache to mitigate this difference, but it can still be noticeable under load.
Expected Output:
INFO: User login attempt failed | (URL data: https://auth.service.com/login?access_token=[...])
INFO: API call to https://api.service.com/data/v1/user?password=[...] resulted in a 404 error. | (URL data: https://api.service.com/data/v1/user?password=[...])
INFO: Application started successfully.
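Under the hood, a `logging.Filter` can rewrite a record's positional arguments in place before any formatting happens, which is how fallback-style censoring becomes possible. The sketch below is not URLFilter's actual code; the `RedactingFilter` class and its regex are illustrative assumptions showing the mechanism:

```python
import logging
import re

# Illustrative pattern: matches values of obviously sensitive query keys
_SECRET_RE = re.compile(r"(token|password)=[^&\s]+")

class RedactingFilter(logging.Filter):
    """Rewrite URL-looking positional args in place; never drop the record."""
    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(record.args, tuple):  # args may also be a dict
            record.args = tuple(
                _SECRET_RE.sub(r"\1=[...]", a) if isinstance(a, str) else a
                for a in record.args
            )
        return True  # returning True keeps the record in the pipeline
```

Because the rewrite happens inside `filter()`, every handler downstream only ever sees the censored arguments.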
Corner Cases & Considerations
- Log String vs. Valid URL: The primary goal of this library is to produce a human-readable, safe string for logging. An output string containing `[...]` in the userinfo (password) section is not a valid URL according to RFC standards and may fail if you try to parse it again with `urllib.parse`.
- Performance: For filtering a large number of URLs, always instantiate the `FilterURL` class once and reuse the instance. The standalone `filter_url()` function re-compiles regexes on every call and is less performant for batch jobs.
- Logging Filter Precedence: When using `URLFilter`, providing a URL in the `extra` dictionary is always the preferred method. The `fallback` search only triggers if a `url` key is not found in `extra`. Note that the fallback search costs extra CPU cycles, which may be unwanted.
API Reference
- filter_url(url, censored, bad_keys, bad_keys_re, bad_path_re): A standalone function for one-off URL censoring.
  - url:str - (required) the URL to censor
  - censored:str - (optional) the placeholder to use instead of redacted parts, '[...]' by default
  - bad_keys:list - (optional) a list of query-string keys that may contain sensitive data. Default:
    ["password", "token", "key", "secret", "auth", "apikey", "credentials"]
  - bad_keys_re:list - (optional) a list of regexes matching query-string keys that may contain sensitive data. Default:
    [r"session", r"csrf", r".*_secret", r".*_token", r".*_key"]
  - bad_path_re:str - (optional) a regex matched against the path part of the URL; every group it defines will be redacted. Default: None. Examples:
    custom_path_re_named = r"/api/v1/(?P<api_key>[^/]+)/resource"
    custom_path_re_simple = r"(?<=/user/)\d+(?=/delete)"
- FilterURL(bad_keys, bad_keys_re, bad_path_re, cache_size): A class that holds a compiled filter configuration for efficient, repeated use. The meaning of bad_keys:list, bad_keys_re:list, and bad_path_re:str and their defaults are the same as for filter_url() (see above).
  - cache_size:int - (optional) size of the cache of filtered URLs; 0 or None disables caching. Default: 512
  - .remove_sensitive(url, censored): The method that performs the censoring.
    - censored:str - (optional) the placeholder to use instead of redacted parts, '[...]' by default
- URLFilter(bad_keys, bad_keys_re, bad_path_re, fmt, url_filter_instance, fallback, cache_size, name): A logging.Filter subclass for easy integration with Python's logging module. bad_keys:list, bad_keys_re:list, and bad_path_re:str are the same as for filter_url() (see above).
  - fmt:str - (optional) format used to append the filtered URL to the log message, default: ' | (URL={filtered_url})' ({filtered_url} is replaced with the filtered URL)
  - url_filter_instance:FilterURL - (optional) a pre-configured FilterURL-like instance to use for filtering. Default: None (one is created by the filter)
  - fallback:bool - (optional) whether to search the message arguments for URLs when none is given explicitly via extra={'url': ...}. Default: True
  - cache_size:int - (optional) size of the cache of filtered URLs; 0 or None disables caching. Default: 512
  - name:str - (optional) the name of the filter (inherited from logging.Filter)
License
This project is licensed under the MIT License.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file filter_url-1.2.0.tar.gz.
File metadata
- Download URL: filter_url-1.2.0.tar.gz
- Upload date:
- Size: 11.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d0138995c96917aa75048227d714e0eee849dfa4fdff28a918af3f228403a66e |
| MD5 | e39d07221feda451fe039cb42455bb32 |
| BLAKE2b-256 | ca38b0243052d7f287f219bd47339f523ff97e3b38b8c7902b6576aa0866e67f |
File details
Details for the file filter_url-1.2.0-py3-none-any.whl.
File metadata
- Download URL: filter_url-1.2.0-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 37e47a9170d7bb7d2eb1f11bd9afc9d1d33fba9413b4ba8c503604fae853bf95 |
| MD5 | 4c14f13da5c34b02335cac8483b805af |
| BLAKE2b-256 | 0b9ae38227deafaa6934017f94980286727548e4f4f44b0497cdb4be39649e2e |