Universal content extraction library with tiered fetching strategies and anti-bot bypass

OmniFetch Python Library

Python implementation of OmniFetch, a universal content extraction library.

Features

  • Universal Extraction: Fetches content from any URL, handling standard sites, single-page applications (SPAs), and paywalled pages.
  • Tiered System:
    1. Light Fetch: Fast, standard HTTP request.
    2. Headless Browser: Renders dynamic, JavaScript-heavy sites (requires a deployed Netlify endpoint; see Headless Browser Support).
    3. Search Fallback: Finds alternative sources for paywalled or blocked content.
  • Smart Parsing: Converts HTML to clean Markdown or JSON.
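
The tier ordering above can be pictured as a simple fallback chain. The following is a minimal sketch of that idea, not OmniFetch's actual internals: fetch_with_tiers and the three stub tiers are hypothetical names introduced only for illustration, and each stub just simulates an outcome instead of making network calls.

```python
from typing import Callable, Optional, Sequence

# A tier takes a URL and returns extracted text, or None when it cannot
# produce usable content. These callables are hypothetical stand-ins,
# not OmniFetch internals.
Tier = Callable[[str], Optional[str]]


def fetch_with_tiers(url: str, tiers: Sequence[Tier]) -> Optional[str]:
    """Try each tier in order and return the first non-empty result."""
    for tier in tiers:
        try:
            content = tier(url)
        except Exception:
            # A failing tier should not abort the chain; try the next one.
            continue
        if content:
            return content
    return None


# Stub tiers for demonstration only; none of them touches the network.
def light_fetch(url: str) -> Optional[str]:
    return None  # pretend the plain HTTP request was blocked


def headless_fetch(url: str) -> Optional[str]:
    raise RuntimeError("no headless endpoint configured")


def search_fallback(url: str) -> Optional[str]:
    return f"content for {url} found via search"


result = fetch_with_tiers(
    "https://example.com/article",
    [light_fetch, headless_fetch, search_fallback],
)
print(result)
```

Ordering the tiers from cheapest to most expensive means most URLs never reach the slower strategies.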

Installation

pip install omnifetch-lib

Quick Start

from omnifetch import omni_fetch

# Text extraction (Markdown)
result = omni_fetch('https://example.com', mode='TEXT')
print(result.content)

# JSON extraction (Structured Data)
json_result = omni_fetch('https://example.com', mode='JSON')
print(json_result.content['title'])

Configuration

def omni_fetch(
    url: str,
    mode: str = 'TEXT',                      # 'TEXT' for Markdown, 'JSON' for structured data
    timeout: int = 30,                       # Request timeout in seconds
    netlify_endpoint: Optional[str] = None,  # Headless browser endpoint (Tier 2)
    headers: Optional[dict] = None,          # Custom HTTP headers
    skip_headless: bool = False,             # Skip Tier 2
    skip_search: bool = False,               # Skip Tier 3
    force_title: Optional[str] = None        # Override title for search fallback
) -> OmniFetchResult
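
As an illustration of how these parameters combine, the sketch below builds keyword arguments that restrict a call to Tier 1 only. It uses only parameter names from the signature above; the helper light_only_options and the example User-Agent string are hypothetical. The actual omni_fetch call is left as a comment so the snippet runs without the library or network access.

```python
from typing import Any, Dict, Optional


def light_only_options(
    timeout: int = 10,
    user_agent: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments that restrict a call to Tier 1 (Light Fetch)."""
    options: Dict[str, Any] = {
        "mode": "TEXT",
        "timeout": timeout,
        "skip_headless": True,  # do not try Tier 2
        "skip_search": True,    # do not try Tier 3
    }
    if user_agent is not None:
        options["headers"] = {"User-Agent": user_agent}
    return options


options = light_only_options(timeout=5, user_agent="my-crawler/0.1")
print(options)

# With the library installed, the options would be passed as:
# from omnifetch import omni_fetch
# result = omni_fetch("https://example.com", **options)
```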

Advanced Usage

Handling Blocked Domains (e.g., X/Twitter)

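Some domains block direct scraping. OmniFetch handles this automatically by falling back to search (Tier 3). For opaque URLs, where the path carries no descriptive words (a numeric status ID, for example), you can pass force_title to give the search a meaningful query and improve its results.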

result = omni_fetch(
    'https://x.com/someuser/status/12345',
    mode='TEXT',
    force_title='Specific Tweet Content Title' # Helps find the content via search
)
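
One way to decide when force_title is worth supplying is to check whether the URL path contains a readable slug. The heuristic below is an assumption made for illustration and is not part of OmniFetch: looks_opaque and search_options are hypothetical helpers, and the "two or more words of three letters or longer in one path segment" rule is an arbitrary threshold.

```python
import re
from typing import Any, Dict, Optional
from urllib.parse import urlparse


def looks_opaque(url: str) -> bool:
    """Return True when no path segment looks like a multi-word slug."""
    path = urlparse(url).path
    for segment in path.split("/"):
        # Split on hyphens and underscores, keeping alphabetic words
        # that are at least three characters long.
        words = [
            w for w in re.split(r"[-_]+", segment)
            if w.isalpha() and len(w) >= 3
        ]
        if len(words) >= 2:
            return False
    return True


def search_options(url: str, known_title: Optional[str]) -> Dict[str, Any]:
    """Add force_title only when the URL is opaque and a title is known."""
    if known_title and looks_opaque(url):
        return {"force_title": known_title}
    return {}


print(looks_opaque("https://x.com/someuser/status/12345"))
print(looks_opaque("https://example.com/blog/how-to-bake-bread"))

# With the library installed:
# from omnifetch import omni_fetch
# url = "https://x.com/someuser/status/12345"
# result = omni_fetch(url, **search_options(url, "Specific Tweet Content Title"))
```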

Headless Browser Support

To enable Tier 2 (Headless Browser) for dynamic sites, deploy the provided Netlify function and pass its URL as netlify_endpoint.

result = omni_fetch(
    'https://dynamic-site.com',
    netlify_endpoint='https://your-site.netlify.app/.netlify/functions/headless-fetch'
)
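
To avoid hard-coding the endpoint, it can be read from the environment. This is a sketch under stated assumptions: the variable name OMNIFETCH_NETLIFY_ENDPOINT and the helper headless_options are hypothetical, and the library itself is not documented to read any environment variable. When no endpoint is set, the helper opts out of Tier 2 explicitly through skip_headless.

```python
import os
from typing import Any, Dict, Mapping

# Hypothetical variable name chosen for this example; the library does
# not read it on its own.
ENV_VAR = "OMNIFETCH_NETLIFY_ENDPOINT"


def headless_options(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Return Tier 2 keyword arguments based on the environment."""
    endpoint = env.get(ENV_VAR, "").strip()
    if endpoint:
        return {"netlify_endpoint": endpoint}
    # No endpoint available, so skip the headless tier explicitly.
    return {"skip_headless": True}


print(headless_options({}))
print(headless_options({ENV_VAR: "https://your-site.netlify.app/.netlify/functions/headless-fetch"}))

# With the library installed:
# from omnifetch import omni_fetch
# result = omni_fetch("https://dynamic-site.com", **headless_options())
```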

Development Installation

pip install -e .

Running Tests

pip install -e ".[dev]"
pytest

See the main README.md for full documentation.

Download files

Source Distribution

omnifetch_lib-1.1.0.tar.gz (15.6 kB)

Uploaded Source

Built Distribution

omnifetch_lib-1.1.0-py3-none-any.whl (17.9 kB)

Uploaded Python 3

File details

Details for the file omnifetch_lib-1.1.0.tar.gz.

File metadata

  • Download URL: omnifetch_lib-1.1.0.tar.gz
  • Upload date:
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for omnifetch_lib-1.1.0.tar.gz
Algorithm Hash digest
SHA256 5810b33a96c434abdb538c8887dec9dce46c9e9a51269b7daa39fc79be628f32
MD5 a8527986601f0027a4adce525956376c
BLAKE2b-256 8c3a9cf13246efbf4f4abeaac6408014fe9f2924e49e3aefd9ed670f67b485d8

File details

Details for the file omnifetch_lib-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: omnifetch_lib-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 17.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for omnifetch_lib-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f3f31f21fc49190b51a3cecb5df424aad2a2cba50c5d047e4ebe81c2f659e723
MD5 822d79df45ae85ad82e85225cf9f2ded
BLAKE2b-256 674a397bbdb4e9842e786524156d833fbbed8ccdb52f28c5d5ca0d3475f8b516
