
A modular AI-powered web scraper for data pipelines.


WebSense

Python 3.10+ | License: MIT

"Making sense of the web."

WebSense is a Python library that transforms raw websites into structured, meaningful data. It leverages AI through the ask2api library to semantically understand page content, allowing you to extract complex data structures without writing brittle CSS selectors or XPath expressions.

Features

  • Semantic Understanding: Uses LLMs to interpret content meaning, not just match patterns
  • Resilient: Adapts to layout changes; if the meaning is there, WebSense finds it
  • Minimalist API: Extract data in 3 lines of code
  • Auto-Cleaning: Noise-removal filters strip irrelevant markup so extraction focuses on meaningful content
  • Flexible Schemas: Use JSON schemas or provide examples for schema inference
  • Modular Design: Fetch, clean, and parse stages can be customized independently

Installation

pip install websense

For development:

git clone https://github.com/atasoglu/websense.git
cd websense
pip install -e ".[dev]"

Quick Start

Extract data with just an example:

from websense import Scraper

scraper = Scraper()

data = scraper.scrape(
    "https://github.com/atasoglu/ask2api",
    example={
        "project_name": "string",
        "description": "string",
        "stars": 0,
        "is_active": True
    }
)

print(data)
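
To make the idea of schema inference concrete, here is a minimal sketch of how an example like the one above could be turned into a JSON schema. The function name infer_schema is hypothetical and this is not WebSense's or ask2api's actual implementation; it only illustrates the general technique of mapping Python example values to JSON Schema types.

```python
def infer_schema(example):
    """Build a JSON-Schema-style dict from an example value (illustrative only)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(example, bool):
        return {"type": "boolean"}
    if isinstance(example, int):
        return {"type": "integer"}
    if isinstance(example, float):
        return {"type": "number"}
    if isinstance(example, str):
        return {"type": "string"}
    if example is None:
        return {"type": "null"}
    if isinstance(example, list):
        schema = {"type": "array"}
        if example:
            # Assume all items look like the first one.
            schema["items"] = infer_schema(example[0])
        return schema
    if isinstance(example, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(value) for key, value in example.items()},
            # Assumption: every key present in the example is treated as required.
            "required": list(example),
        }
    raise TypeError(f"unsupported example value: {example!r}")


example = {
    "project_name": "string",
    "description": "string",
    "stars": 0,
    "is_active": True,
}
schema = infer_schema(example)
print(schema)
```

Treating every example key as required is a design choice made for this sketch; the library may behave differently.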

You can provide a strict JSON schema for validation:

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"}
    },
    "required": ["title", "price"]
}

data = scraper.scrape("https://example.com/product", schema=schema)
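
If you want to double-check a result yourself, a small standalone check along these lines may help. It is a sketch, not part of WebSense: check_against_schema is a hypothetical helper, it handles only flat objects with the "required" and "type" keywords, and the sample dictionaries stand in for real scrape output.

```python
# Mapping from JSON Schema type names to Python types (flat values only).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means the object passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key not in data:
            continue
        type_name = rule.get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # Type not covered by this sketch.
        value = data[key]
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {type_name}, got {type(value).__name__}"
            )
    return problems


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title", "price"],
}

good = {"title": "Widget", "price": 9.99, "in_stock": True}
bad = {"title": "Widget", "in_stock": "yes"}

print(check_against_schema(good, schema))
print(check_against_schema(bad, schema))
```

For full JSON Schema support, a dedicated validator library is the better choice; this sketch only shows what "required" and "type" mean in practice.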

Specify a different language model for extraction:

scraper = Scraper(model="gpt-4")

See the environment variables in the ask2api repository to configure your LLM provider.

The examples/ directory contains real-world use cases.

How It Works

WebSense follows a three-stage pipeline:

  1. Fetch (fetcher.py): Downloads the webpage
  2. Clean (cleaner.py): Removes noise and extracts meaningful text
  3. Parse (parser.py): Uses AI to extract structured data based on your schema/example
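
The three stages above can be sketched with the standard library alone. This is an illustration of the fetch, clean, parse shape and not the code in fetcher.py, cleaner.py, or parser.py: every function name here is hypothetical, the cleaner only drops script, style, and noscript content, and the AI step is replaced by a caller-supplied extract function.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that never hold readable content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def fetch(url, timeout=10.0):
    """Stage 1: download the page and decode it to text."""
    request = Request(url, headers={"User-Agent": "pipeline-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def clean(html):
    """Stage 2: strip markup and noise, keep the readable text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def parse(text, extract):
    """Stage 3: hand the cleaned text to an extractor (an LLM call in practice)."""
    return extract(text)


def run_pipeline(url, extract, fetcher=fetch):
    """Chain the stages; each one can be swapped out independently."""
    return parse(clean(fetcher(url)), extract)


sample_html = (
    "<html><head><style>p{}</style><script>var x = 1;</script></head>"
    "<body><h1>Widget</h1><p>Price: 9.99</p></body></html>"
)
print(clean(sample_html))
```

Passing the fetcher and extractor as arguments keeps each stage replaceable, which mirrors the modular design described in the feature list.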

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit changes (git commit -m 'Add my feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Open a Pull Request

License

MIT
