Library to find URLs and check their validity.

Project description

urlfinderlib

This is a Python (3.10+) library for finding URLs in documents and checking their validity.

Supported Documents

The library extracts URLs from the following types of documents:

  • Binary files (finds URLs within strings)
  • CSV files
  • HTML files
  • iCalendar/vCalendar files
  • PDF files
  • Text files (ASCII or UTF-8)
  • XML files

Every extracted URL is validated: it must contain a domain with a valid TLD (or a valid IP address) and must not contain any invalid characters.
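
The validation rule above can be illustrated with a minimal standard-library sketch. This is not the library's own implementation: the function name `looks_valid`, the tiny `SAMPLE_TLDS` set, and the choice of which characters count as invalid are all assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny TLD set; a real check would use a full TLD list.
SAMPLE_TLDS = {"com", "org", "net", "io"}


def looks_valid(url: str) -> bool:
    """Rough check: no whitespace/control characters, and a host that is
    either an IP address or a dotted name ending in a known TLD."""
    # Assumption: whitespace and control characters are treated as invalid.
    if any(ch.isspace() or ord(ch) < 0x20 for ch in url):
        return False
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    # Accept literal IPv4/IPv6 addresses.
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass
    # Otherwise require at least two non-empty labels and a known TLD.
    labels = host.split(".")
    return len(labels) >= 2 and all(labels) and labels[-1] in SAMPLE_TLDS


print(looks_valid("http://example.com/index.html"))
print(looks_valid("http://192.168.1.1/a"))
print(looks_valid("http://example.invalidtld/"))
```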

URL Permutations

This library was originally written to find both valid URLs and the obfuscated or slightly malformed URLs used by malicious actors, so that they can be used as indicators of compromise (IOCs). As such, the extracted URLs will also include the following permutations:

  • URL with any Unicode characters in its domain
  • URL with any Unicode characters in its domain converted to their IDNA equivalent

For both domain variations, the following permutations are also returned:

  • URL with its path %-encoded
  • URL with its path %-decoded
  • URL with encoded HTML entities in its path
  • URL with decoded HTML entities in its path
  • URL with its path %-decoded and HTML entities decoded
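
As a rough illustration of the permutations listed above, the following sketch builds similar variants using only the standard library. The helper names `path_permutations` and `idna_variant` are hypothetical, and the exact set the library returns may differ; Python's built-in `idna` codec is used here as a stand-in for whatever IDNA handling the library performs.

```python
import html
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def path_permutations(url: str) -> set[str]:
    """Return the URL with its path encoded/decoded in several ways."""
    parts = urlsplit(url)
    path = parts.path
    paths = {
        path,                               # as found
        quote(unquote(path), safe="/"),     # %-encoded
        unquote(path),                      # %-decoded
        html.escape(path, quote=False),     # HTML entities encoded
        html.unescape(path),                # HTML entities decoded
        html.unescape(unquote(path)),       # %-decoded, then entities decoded
    }
    return {urlunsplit(parts._replace(path=p)) for p in paths}


def idna_variant(url: str) -> str:
    """Return the URL with a Unicode hostname converted to its IDNA form.
    Simplification: any port or userinfo in the netloc is dropped."""
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    return urlunsplit(parts._replace(netloc=host))


for variant in sorted(path_permutations("http://example.com/a%20b")):
    print(variant)
```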

Child URLs

This library also attempts to extract or decode child URLs found in the paths of URLs. The following formats are supported:

  • Barracuda protected URLs
  • Base64-encoded URLs found within the URL's path
  • Google redirect URLs
  • Mandrill/Mailchimp redirect URLs
  • Outlook Safe Links URLs
  • Proofpoint protected URLs
  • URLs found in the URL's path query parameters
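
Two of the generic cases above (Base64-encoded URLs in the path and URLs in query parameters) can be sketched with the standard library alone. The function name `child_urls` is hypothetical, the vendor-specific formats (Barracuda, Proofpoint, Safe Links, and so on) are not covered, and the library's own decoding logic may well differ.

```python
import base64
import binascii
from urllib.parse import parse_qsl, urlsplit

URL_PREFIXES = ("http://", "https://")


def child_urls(url: str) -> set[str]:
    """Collect URLs embedded in query parameter values or Base64 path segments."""
    found = set()
    parts = urlsplit(url)

    # Query parameter values that are themselves URLs (parse_qsl %-decodes them).
    for _, value in parse_qsl(parts.query):
        if value.startswith(URL_PREFIXES):
            found.add(value)

    # Path segments that Base64-decode to a URL.
    for segment in parts.path.split("/"):
        if not segment:
            continue
        # Restore any padding that was stripped from the segment.
        padded = segment + "=" * (-len(segment) % 4)
        try:
            decoded = base64.urlsafe_b64decode(padded).decode("utf-8")
        except (binascii.Error, ValueError):
            continue
        if decoded.startswith(URL_PREFIXES):
            found.add(decoded)

    return found


# redirect.example is a placeholder host used only for this illustration.
token = base64.urlsafe_b64encode(b"http://example.org/x").decode().rstrip("=")
print(child_urls(f"http://redirect.example/r/{token}?u=https%3A%2F%2Fexample.com%2Fa"))
```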

Basic usage

from urlfinderlib import find_urls

# Read the file as bytes and print every URL found in it
with open('/path/to/file', 'rb') as f:
    print(find_urls(f.read()))

base_url Parameter

URLs inside an HTML file are often relative to the file's location on the server that hosts it. In that case, pass the base_url parameter so that these "relative" URLs can be extracted as full URLs.

from urlfinderlib import find_urls

# base_url is used to resolve relative URLs found in the HTML
with open('/path/to/file', 'rb') as f:
    print(find_urls(f.read(), base_url='http://example.com'))
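
To show what resolving a relative URL against a base URL means, here is a small standard-library sketch using `urllib.parse.urljoin`. It is only an illustration of the concept, not how the library handles base_url internally; the `HrefCollector` class and the sample HTML are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect href/src attribute values, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.urls.append(urljoin(self.base_url, value))


collector = HrefCollector("http://example.com")
collector.feed('<a href="/about.html">About</a><img src="img/logo.png">')
print(collector.urls)
# ['http://example.com/about.html', 'http://example.com/img/logo.png']
```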

Download files

Source Distribution

urlfinderlib-0.21.0.tar.gz (13.8 kB view details)

Uploaded Source

Built Distribution

urlfinderlib-0.21.0-py3-none-any.whl (19.9 kB view details)

Uploaded Python 3

File details

Details for the file urlfinderlib-0.21.0.tar.gz.

File metadata

  • Download URL: urlfinderlib-0.21.0.tar.gz
  • Upload date:
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for urlfinderlib-0.21.0.tar.gz
Algorithm    Hash digest
SHA256       646f24b5d02bdc85d110fe6e0e19c030cc92c886bc62f26111265a65099c54a3
MD5          a981c6d4f32cdfdac72ec7876fc8acb5
BLAKE2b-256  b933688b2b057cf8b84ad31e9365cb2263dd6faef8fb0e3fdd441c768d62677d

Provenance

The following attestation bundles were made for urlfinderlib-0.21.0.tar.gz:

Publisher: pypi.yml on ACE-Collective/urlfinderlib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file urlfinderlib-0.21.0-py3-none-any.whl.

File metadata

  • Download URL: urlfinderlib-0.21.0-py3-none-any.whl
  • Upload date:
  • Size: 19.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for urlfinderlib-0.21.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       659b7350ce7de669e3171d6d6056a4b68a73d3c563c245762f54059d25dff906
MD5          a69f1a9a22d29e0c590aa06a11dcd6f9
BLAKE2b-256  e6fe8d6d73077014f6cc3b27ae8d60c567406cf2cd23f0efb140576a1cbdc86f

Provenance

The following attestation bundles were made for urlfinderlib-0.21.0-py3-none-any.whl:

Publisher: pypi.yml on ACE-Collective/urlfinderlib

