A modular AI-powered web scraper for data pipelines.

WebSense

Python 3.10+ | License: MIT

"Making sense of the web."

WebSense is a Python library that transforms raw websites into structured, meaningful data. It leverages AI through the ask2api library to semantically understand page content, allowing you to extract complex data structures without writing brittle CSS selectors or XPath expressions.

Features

  • Semantic Understanding: Uses LLMs to interpret content meaning, not just match patterns
  • Resilient: Adapts to layout changes; if the meaning is there, WebSense finds it
  • Minimalist API: Extract data in 3 lines of code
  • Auto-Cleaning: Noise-removal filters strip page clutter so extraction focuses on meaningful content
  • Flexible Schemas: Use JSON schemas or provide examples for schema inference
  • Web Search Integration: Search the web and scrape top results in one go
  • Multi-Source Consolidation: Aggregate information from multiple websites into one structured result
  • Modular Design: Fetch, search, clean, and parse stages can be customized independently

Installation

pip install websense

For development:

git clone https://github.com/atasoglu/websense.git
cd websense
pip install -e ".[dev]"

Quick Start

Extract data with just an example:

from websense import Scraper

scraper = Scraper()

data = scraper.scrape(
    "https://github.com/atasoglu/ask2api",
    example={
        "project_name": "string",
        "description": "string",
        "stars": 0,
        "is_active": True
    }
)

print(data)
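
To make "schema inference from an example" more concrete, here is a minimal sketch of how an example dictionary could be turned into a JSON schema. The helper infer_schema is hypothetical and is not part of WebSense or ask2api; the library's actual inference may work differently.

```python
from typing import Any


def infer_schema(example: Any) -> dict:
    """Hypothetical helper: derive a JSON Schema fragment from an example value."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(example, bool):
        return {"type": "boolean"}
    if isinstance(example, int):
        return {"type": "integer"}
    if isinstance(example, float):
        return {"type": "number"}
    if isinstance(example, str):
        return {"type": "string"}
    if example is None:
        return {"type": "null"}
    if isinstance(example, list):
        # Use the first element as the template for every item; an empty list
        # places no constraint on its items.
        return {"type": "array", "items": infer_schema(example[0]) if example else {}}
    if isinstance(example, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(value) for key, value in example.items()},
            "required": list(example),
        }
    raise TypeError(f"unsupported example value: {example!r}")


schema = infer_schema(
    {"project_name": "string", "description": "string", "stars": 0, "is_active": True}
)
print(schema)
```

In this sketch every key of the example becomes a required property, which is an assumption made for simplicity.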

You can provide a strict JSON schema for validation:

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"}
    },
    "required": ["title", "price"]
}

data = scraper.scrape("https://example.com/product", schema=schema)
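
If you want to double-check a result on your side, the following is a minimal sketch of a required-field and type check using only the standard library. The function check_against_schema is hypothetical, covers only flat schemas like the one above, and is not how WebSense itself validates; a full JSON Schema validator would be the more complete choice.

```python
# Mapping from JSON Schema type names to Python types (a small subset).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def check_against_schema(data: dict, schema: dict) -> list[str]:
    """Hypothetical helper: return a list of problems, empty if none were found."""
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue  # optional field that is absent
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        value = data[key]
        # bool is a subclass of int, so reject it explicitly for numeric types.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            errors.append(f"field {key} is not of type {type_name}")
    return errors


product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title", "price"],
}

print(check_against_schema({"title": "Widget", "price": 9.99}, product_schema))
print(check_against_schema({"title": "Widget", "price": "9.99"}, product_schema))
```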

Specify a different language model for extraction:

scraper = Scraper(model="gpt-4")

Web Search & Consolidation

Search the web and consolidate information from the top 3 results:

data = scraper.search_and_scrape(
    "latest news about SpaceX Starship",
    max_results=3,
    example={
        "status": "string",
        "last_launch": "string",
        "summary": "brief overview"
    }
)

WebSense scrapes each of the top results and uses an LLM-based "judge" to reconcile them into a single consolidated result.
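
To illustrate what consolidation means, here is a deliberately simple stand-in that merges per-source results by majority vote on each field. This is not WebSense's method: the library uses an LLM-based judge, and the function consolidate below is a hypothetical sketch with made-up sample records.

```python
from collections import Counter


def consolidate(records: list[dict]) -> dict:
    """Hypothetical helper: pick the most common non-empty value for each field."""
    merged = {}
    # Preserve the order in which fields first appear across all records.
    keys = list(dict.fromkeys(key for record in records for key in record))
    for key in keys:
        # Values must be hashable (strings, numbers, booleans) for Counter to work.
        values = [record[key] for record in records if record.get(key) is not None]
        if values:
            merged[key] = Counter(values).most_common(1)[0][0]
    return merged


# Made-up per-source results, shaped like the example passed to search_and_scrape.
per_source = [
    {"status": "scheduled", "last_launch": "flight A"},
    {"status": "scheduled", "last_launch": "flight B"},
    {"status": "delayed", "last_launch": "flight B"},
]
print(consolidate(per_source))
```

Majority voting only works for short, exact-match values; free-text fields such as a summary are where an LLM judge is genuinely needed.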

CLI Usage

WebSense provides a command-line interface for quick data extraction:

# Extract structured data from a webpage
websense scrape https://example.com --example schema.json --verbose

# Search the web and consolidate top 3 results
websense search-scrape "Nvidia stock performance 2024" --top-k 3 --example '{"price": "str"}'

# Search only (returns titles and URLs)
websense search "query" --verbose

# Get cleaned content only
websense content https://example.com --output content.md

Available options for the scrape command:

Option          Description
--model, -m     LLM model name
--schema, -s    JSON schema (file path or raw JSON string)
--example, -e   JSON example (file path or raw JSON string)
--output, -o    Output file path
--timeout, -t   Request timeout (default: 10)
--retries, -r   Retry attempts (default: 3)
--verbose, -v   Enable verbose output
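
As a rough picture of how a timeout and a retry count typically interact, here is a minimal sketch of a retry loop around a fetch function. The names fetch_with_retries and urllib_fetch are hypothetical, and two points are assumptions rather than documented behavior: that the timeout is in seconds, and that the retry count means total attempts. WebSense's own fetcher may behave differently.

```python
import time
import urllib.request
from typing import Callable


def urllib_fetch(url: str, timeout: float) -> str:
    """Fetch a page with the standard library (timeout assumed to be seconds)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retries(
    fetch: Callable[[str, float], str],
    url: str,
    timeout: float = 10,
    retries: int = 3,
    backoff: float = 0.0,
) -> str:
    """Hypothetical helper: call fetch up to `retries` times before giving up."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url, timeout)
        except OSError as exc:  # covers URLError, timeouts and connection errors
            last_error = exc
            if attempt < retries:
                time.sleep(backoff * attempt)  # linear backoff between attempts
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

A call such as fetch_with_retries(urllib_fetch, "https://example.com") would then mirror the CLI defaults listed above.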

Pro Tip: You can pass raw JSON strings directly to the CLI:

websense scrape https://example.com -e '{"title": "string"}'
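
One plausible way to accept "file path or raw JSON string" in a single argument is sketched below. The helper load_json_arg is hypothetical and only illustrates the idea; the actual CLI may resolve the argument differently.

```python
import json
from pathlib import Path
from typing import Any


def load_json_arg(value: str) -> Any:
    """Hypothetical helper: treat value as a file path if it exists, else as raw JSON."""
    try:
        is_file = Path(value).is_file()
    except (OSError, ValueError):
        # Some strings cannot be interpreted as a path at all (for example,
        # overly long ones), so fall back to parsing them as JSON.
        is_file = False
    if is_file:
        return json.loads(Path(value).read_text(encoding="utf-8"))
    return json.loads(value)


print(load_json_arg('{"title": "string"}'))
```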

How It Works

WebSense follows a three-stage pipeline:

  1. Fetch (fetcher.py): Downloads and retrieves the webpage
  2. Clean (cleaner.py): Removes noise and extracts meaningful text
  3. Parse (parser.py): Uses AI to extract structured data based on your schema/example
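
The three stages can be pictured with the following sketch, which uses only the standard library. All names here (clean, run_pipeline, and the fake fetch and parse functions) are hypothetical stand-ins, not the contents of fetcher.py, cleaner.py, or parser.py; in particular, the real parse stage calls an LLM, which is replaced by a trivial function below.

```python
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text while skipping tags that usually hold noise."""

    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def clean(html: str) -> str:
    """Stage 2 (sketch): strip noise and keep one text chunk per line."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


def run_pipeline(
    url: str,
    fetch: Callable[[str], str],
    parse: Callable[[str, dict], dict],
    example: dict,
) -> dict:
    """Chain the stages: fetch -> clean -> parse."""
    html = fetch(url)             # Stage 1: download the page
    text = clean(html)            # Stage 2: remove noise
    return parse(text, example)   # Stage 3: extract structured data


# Stand-ins so the sketch runs without network access or an LLM.
PAGE = (
    "<html><head><style>p{}</style></head><body><nav>Home</nav>"
    "<h1>Widget</h1><p>Price: 9.99</p><script>x()</script></body></html>"
)
fake_fetch = lambda url: PAGE
fake_parse = lambda text, example: {"title": text.splitlines()[0]}

print(run_pipeline("https://example.com/product", fake_fetch, fake_parse, {"title": "string"}))
```

Passing the stages in as functions reflects the modular design described in the feature list: each one can be swapped without touching the others.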

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit changes (git commit -m 'Add my feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Open a Pull Request

License

MIT

