Library to find URLs and check their validity.
urlfinderlib
This is a Python (3.6+) library for finding URLs in documents and checking their validity.
Supported Documents
Extracts URLs from the following types of documents:
- HTML files
- PDF files
- Text files (ASCII or UTF-8)
- XML files
Each extracted URL is validated: it must contain a domain with a valid TLD (or a valid IP address) and must not contain any invalid characters.
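The validation rule above can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the function name `looks_valid`, the tiny `KNOWN_TLDS` set, and the `INVALID_CHARS` set are all assumptions made for illustration, and a real check would consult the full IANA TLD list.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny allow-list; a real check would use the
# complete IANA TLD list.
KNOWN_TLDS = {"com", "org", "net", "io"}

# Characters treated as invalid in this sketch (an assumption, not the
# library's actual rule set).
INVALID_CHARS = set(' <>"{}|\\^`')


def looks_valid(url: str) -> bool:
    """Return True if the URL has a plausible host and no invalid characters."""
    if any(ch in INVALID_CHARS for ch in url):
        return False
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    # Accept any syntactically valid IPv4 or IPv6 address.
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass
    # Otherwise require at least two non-empty labels and a known TLD.
    labels = host.split(".")
    return len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_TLDS
```

With these assumptions, `looks_valid("http://example.com/index.html")` and `looks_valid("http://127.0.0.1/admin")` pass, while a URL with an unknown TLD or an embedded space is rejected.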
URL Permutations
This library was originally written to find both valid URLs and the obfuscated or slightly malformed URLs used by malicious actors, so that they can be used as indicators of compromise (IOCs). As such, the extracted URLs also include the following permutations:
- URL with any Unicode characters in its domain
- URL with any Unicode characters in its domain converted to their IDNA equivalent
For both domain variations, the following permutations are also returned:
- URL with its path %-encoded
- URL with its path %-decoded
- URL with encoded HTML entities in its path
- URL with decoded HTML entities in its path
- URL with its path %-decoded and HTML entities decoded
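As a rough illustration of what these permutations look like, the sketch below builds them with the standard library only. The helper names `idna_variant` and `path_permutations` are hypothetical, and the library's own logic may differ in which characters it encodes and how it handles edge cases.

```python
import html
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def idna_variant(url: str) -> str:
    """Return the URL with a Unicode domain converted to its IDNA form.

    Simplification: any userinfo in the netloc is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        return url
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


def path_permutations(url: str) -> set:
    """Return the URL with its path in several encoded/decoded forms."""
    parts = urlsplit(url)
    path = parts.path
    decoded = unquote(path)
    candidates = {
        quote(decoded, safe="/&;"),        # path %-encoded
        decoded,                           # path %-decoded
        html.escape(html.unescape(path)),  # HTML entities encoded
        html.unescape(path),               # HTML entities decoded
        html.unescape(decoded),            # %-decoded and entities decoded
    }
    return {urlunsplit(parts._replace(path=p)) for p in candidates}
```

Applying `path_permutations` to both the Unicode and the IDNA form of a URL would give the full set described above.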
Child URLs
This library also attempts to extract or decode child URLs found in the paths of URLs. The following formats are supported:
- Barracuda protected URLs
- Base64-encoded URLs found within the URL's path
- Google redirect URLs
- Mandrill/Mailchimp redirect URLs
- Outlook Safe Links URLs
- Proofpoint protected URLs
- URLs found in the URL's path query parameters
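Two of the generic cases in this list, URLs carried in query parameters and Base64-encoded URLs in the path, can be sketched with the standard library. The function names are hypothetical and the vendor-specific formats (Barracuda, Proofpoint, Safe Links, and so on) are not covered; this only shows the general idea, not how the library does it.

```python
import base64
import binascii
from urllib.parse import parse_qs, urlsplit

URL_PREFIXES = ("http://", "https://")


def child_urls_from_query(url: str) -> list:
    """Return query parameter values that themselves look like URLs."""
    found = []
    # parse_qs %-decodes each value for us.
    for values in parse_qs(urlsplit(url).query).values():
        found.extend(v for v in values if v.startswith(URL_PREFIXES))
    return found


def child_urls_from_base64(url: str) -> list:
    """Return URLs hidden in Base64-encoded path segments."""
    found = []
    for segment in urlsplit(url).path.split("/"):
        if not segment:
            continue
        # Restore padding that is often stripped from URL-embedded Base64.
        padded = segment + "=" * (-len(segment) % 4)
        try:
            decoded = base64.urlsafe_b64decode(padded).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            continue
        if decoded.startswith(URL_PREFIXES):
            found.append(decoded)
    return found
```

A child URL found this way would normally be fed back through the same extraction and validation steps as any other URL.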
Basic usage
from urlfinderlib import find_urls

# Read the document as bytes and print every URL found in it.
with open('/path/to/file', 'rb') as f:
    print(find_urls(f.read()))
base_url Parameter
If you are trying to find URLs inside of an HTML file, the paths in the URLs are often relative to their location on the server hosting the HTML. You can use the base_url parameter in this case to extract these "relative" URLs.
from urlfinderlib import find_urls

# base_url is used to resolve relative URLs found in the HTML.
with open('/path/to/file', 'rb') as f:
    print(find_urls(f.read(), base_url='http://example.com'))
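To see why a base URL is needed, the sketch below shows how relative links resolve against one using the standard library's `urljoin`. Whether urlfinderlib uses `urljoin` internally is not stated here; the helper `resolve_links` is hypothetical and only demonstrates the concept.

```python
from urllib.parse import urljoin


def resolve_links(base_url: str, hrefs: list) -> list:
    """Resolve each (possibly relative) href against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


resolved = resolve_links(
    "http://example.com/docs/index.html",
    ["images/logo.png", "/about", "https://other.example/x"],
)
# resolved == [
#     "http://example.com/docs/images/logo.png",
#     "http://example.com/about",
#     "https://other.example/x",
# ]
```

Without a base URL, a link such as `images/logo.png` has no scheme or domain, so it could not pass the validation described earlier.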
Source Distribution
File details
Details for the file urlfinderlib-0.12.3.tar.gz.
File metadata
- Download URL: urlfinderlib-0.12.3.tar.gz
- Upload date:
- Size: 10.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 75cfbf37888e34c00004c924502747908b615c5be0100dafd66dcd0865812704
MD5 | 870c95d7c31425afaeccca176c273f66
BLAKE2b-256 | d28a4f8e4981896e80c3b1886bc483160cc932470536cea0805998d9273a51b9