A modular AI-powered web scraper for data pipelines.

Project description

WebSense

"Making sense of the web."

WebSense is a Python library that transforms raw websites into structured, meaningful data. It leverages AI through the ask2api library to semantically understand page content, allowing you to extract complex data structures without writing brittle CSS selectors or XPath expressions.

Features

Semantic Understanding: Uses LLMs to interpret content meaning, not just match patterns
Resilient: Adapts to layout changes; if the meaning is there, WebSense finds it
Minimalist API: Extract data in 3 lines of code
Auto-Cleaning: Noise-removal filters strip irrelevant markup so extraction focuses on meaningful content
Flexible Schemas: Use JSON schemas or provide examples for schema inference
Web Search Integration: Search the web and scrape top results in one go
Multi-Source Consolidation: Aggregate information from multiple websites into one structured result
Modular Design: Fetch, search, clean, and parse stages can be customized independently

Installation

pip install websense

For development:

git clone https://github.com/atasoglu/websense.git
cd websense
pip install -e ".[dev]"

Quick Start

Extract data with just an example:

from websense import Scraper

scraper = Scraper()

data = scraper.scrape(
    "https://github.com/atasoglu/ask2api",
    example={
        "project_name": "string",
        "description": "string",
        "stars": 0,
        "is_active": True
    }
)

print(data)
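To make the idea of schema inference from an example more concrete, here is a minimal sketch of how an example dictionary could be turned into a JSON schema. The function name `infer_schema` is hypothetical and this is not WebSense's or ask2api's actual implementation; it only illustrates the general technique of mapping example values to JSON Schema types.

```python
from typing import Any


def infer_schema(example: Any) -> dict:
    """Derive a JSON Schema fragment from an example value (illustrative only)."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(example, bool):
        return {"type": "boolean"}
    if isinstance(example, int):
        return {"type": "integer"}
    if isinstance(example, float):
        return {"type": "number"}
    if isinstance(example, str):
        return {"type": "string"}
    if example is None:
        return {"type": "null"}
    if isinstance(example, list):
        # Assumption: the first element, if any, is representative of all items.
        schema = {"type": "array"}
        if example:
            schema["items"] = infer_schema(example[0])
        return schema
    if isinstance(example, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(value) for key, value in example.items()},
            "required": list(example),
        }
    raise TypeError(f"Unsupported example value: {example!r}")


example = {"project_name": "string", "description": "string", "stars": 0, "is_active": True}
schema = infer_schema(example)
print(schema["properties"]["stars"])      # {'type': 'integer'}
print(schema["properties"]["is_active"])  # {'type': 'boolean'}
```

In this sketch every key of the example is treated as required, which is a design assumption; a real implementation might treat all fields as optional instead.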

You can provide a strict JSON schema for validation:

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"}
    },
    "required": ["title", "price"]
}

data = scraper.scrape("https://example.com/product", schema=schema)
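If you want to double-check the returned dictionary yourself, a small standard-library check might look like the sketch below. The helper `check_against_schema` is hypothetical and covers only `required` and primitive `type` rules; it is not how WebSense validates internally, and a full validator such as the `jsonschema` package supports far more of the specification.

```python
from __future__ import annotations

# Map JSON Schema primitive type names to Python types.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_against_schema(data: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problems found."""
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key not in data:
            continue
        type_name = rule.get("type")
        expected = _JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        if expected is None:
            continue  # Unknown or composite type rule: skipped in this sketch.
        value = data[key]
        # bool is a subclass of int, so reject it explicitly for numeric types.
        bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if bool_as_number or not isinstance(value, expected):
            errors.append(f"{key}: expected {type_name}, got {type(value).__name__}")
    return errors


product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title", "price"],
}

print(check_against_schema({"title": "Widget"}, product_schema))
# ['missing required field: price']
```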

Specify a different language model for extraction:

scraper = Scraper(model="gpt-4")

Web Search & Consolidation

Search the web and consolidate information from the top 3 results:

data = scraper.search_and_scrape(
    "latest news about SpaceX Starship",
    max_results=3,
    example={
        "status": "string",
        "last_launch": "string",
        "summary": "brief overview"
    }
)

WebSense fetches multiple sources and uses an LLM-based "judge" to synthesize the most accurate data across them.
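To illustrate what consolidating several per-source results into one means, the sketch below merges results with a simple per-field majority vote. This is deliberately a different, much simpler technique than the LLM-based judge described above, and the function name `consolidate` and the sample values are made up for illustration.

```python
from __future__ import annotations

from collections import Counter


def consolidate(results: list[dict]) -> dict:
    """For each field, keep the value reported by the most sources (majority vote)."""
    # Collect field names in the order they are first seen.
    fields = []
    for result in results:
        for key in result:
            if key not in fields:
                fields.append(key)

    merged = {}
    for key in fields:
        # Ignore sources that did not report the field (missing or None).
        values = [r[key] for r in results if r.get(key) is not None]
        if values:
            # most_common(1) returns [(value, count)]; ties go to the value seen first.
            merged[key] = Counter(values).most_common(1)[0][0]
    return merged


# Made-up placeholder results from three hypothetical sources.
per_source = [
    {"status": "on schedule", "last_launch": "date A"},
    {"status": "on schedule", "last_launch": "date A"},
    {"status": "delayed", "last_launch": None},
]
print(consolidate(per_source))
# {'status': 'on schedule', 'last_launch': 'date A'}
```

A vote like this only works for values that match exactly; free-text fields such as a summary are where a language-model judge is more useful.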

CLI Usage

WebSense provides a command-line interface for quick data extraction:

# Extract structured data from a webpage
websense scrape https://example.com --example schema.json --verbose

# Search the web and consolidate top 3 results
websense search-scrape "Nvidia stock performance 2024" --top-k 3 --example '{"price": "str"}'

# Search only (returns titles and URLs)
websense search "query" --verbose

# Get cleaned content only
websense content https://example.com --output content.md

Available options for scrape command:

Option	Description
`--model, -m`	LLM model name
`--schema, -s`	JSON schema (file path or raw JSON string)
`--example, -e`	JSON example (file path or raw JSON string)
`--output, -o`	Output file path
`--timeout, -t`	Request timeout (default: 10)
`--retries, -r`	Retry attempts (default: 3)
`--verbose, -v`	Enable verbose output
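As a rough picture of what timeout and retry settings usually control, here is a minimal fetch-with-retries sketch built on the standard library. It is not WebSense's fetcher: the function name `fetch_with_retries`, the `opener` and `backoff` parameters, the interpretation of the timeout as seconds, and the reading of "retries" as the total number of attempts are all assumptions.

```python
import time
import urllib.request
from urllib.error import URLError


def fetch_with_retries(url, timeout=10.0, retries=3, opener=urllib.request.urlopen, backoff=0.0):
    """Fetch a URL as text, trying up to `retries` times before giving up."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # `timeout` (assumed to be seconds) bounds each individual attempt.
            with opener(url, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except (URLError, TimeoutError) as exc:
            last_error = exc
            # Wait a little longer after each failed attempt (no wait when backoff is 0).
            time.sleep(backoff * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Passing the opener as a parameter keeps the retry logic testable without network access, since a fake opener can simulate temporary failures.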

Pro Tip: You can pass raw JSON strings directly to the CLI:

websense scrape https://example.com -e '{"title": "string"}'
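An option that accepts either a file path or a raw JSON string can be resolved with a small helper like the one sketched below. The name `load_json_argument` and the "existing file wins" rule are assumptions for illustration, not WebSense's actual CLI code.

```python
import json
from pathlib import Path


def load_json_argument(value: str):
    """Treat `value` as a JSON file path if such a file exists, otherwise as raw JSON."""
    try:
        is_file = Path(value).is_file()
    except (OSError, ValueError):
        # Very long or otherwise invalid path strings cannot be files; treat as raw JSON.
        is_file = False
    if is_file:
        return json.loads(Path(value).read_text(encoding="utf-8"))
    return json.loads(value)


print(load_json_argument('{"title": "string"}'))  # {'title': 'string'}
```

If the argument is neither an existing file nor valid JSON, `json.loads` raises `json.JSONDecodeError`, which a CLI would typically turn into a usage error.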

How It Works

WebSense follows a three-stage pipeline:

Fetch (fetcher.py): Downloads the webpage
Clean (cleaner.py): Removes noise and extracts meaningful text
Parse (parser.py): Uses AI to extract structured data based on your schema/example
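The three stages can be pictured as three functions composed in order. The sketch below is an assumption-laden illustration, not the code in fetcher.py, cleaner.py, or parser.py: the names `clean`, `run_pipeline`, `fake_fetch`, and `fake_parse` are hypothetical, the cleaner is a basic standard-library HTML text extractor, and the parse stage is a stand-in where a real pipeline would call a language model.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text while skipping a few noisy elements."""

    SKIPPED = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def clean(html: str) -> str:
    """Clean stage: reduce HTML to its meaningful text, one chunk per line."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


def run_pipeline(url, fetch, parse, clean_fn=clean):
    """Run fetch -> clean -> parse; each stage is a plain callable and can be swapped."""
    return parse(clean_fn(fetch(url)))


def fake_fetch(url):
    # Stand-in for the fetch stage: returns canned HTML instead of using the network.
    return (
        "<html><head><style>p{}</style></head><body><nav>Menu</nav>"
        "<h1>Widget</h1><p>Price: 9.99</p><script>track()</script></body></html>"
    )


def fake_parse(text):
    # Stand-in for the AI parse stage: a real parser would send `text` to an LLM.
    lines = text.splitlines()
    return {"title": lines[0], "line_count": len(lines)}


print(run_pipeline("https://example.com/product", fake_fetch, fake_parse))
# {'title': 'Widget', 'line_count': 2}
```

Keeping each stage behind a plain callable is one way to get the independent customization mentioned under Features.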

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/my-feature)
Commit changes (git commit -m 'Add my feature')
Push to the branch (git push origin feature/my-feature)
Open a Pull Request

License

MIT

Project details

Release history

0.4.1: Jan 29, 2026
0.4.0: Jan 29, 2026 (this version)
0.3.0: Jan 28, 2026
0.2.0: Jan 28, 2026
0.1.1: Jan 27, 2026
0.1.0: Jan 27, 2026

Download files

Source Distribution

websense-0.4.0.tar.gz (20.0 kB), uploaded Jan 29, 2026, Source

Built Distribution

websense-0.4.0-py3-none-any.whl (13.6 kB), uploaded Jan 29, 2026, Python 3

File details

Details for the file websense-0.4.0.tar.gz.

File metadata

Download URL: websense-0.4.0.tar.gz
Upload date: Jan 29, 2026
Size: 20.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for websense-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`fbb8d3e9647199cc807d520fc38dd89374dc9e88d480c36ffeb5d49f0fbd6eae`
MD5	`f03c1aa9806f674509d6d193b7247e40`
BLAKE2b-256	`8f590340d7b4d3345a157812c2ca3d26bec7104d3d69cc1e5f60c9d9ac49eb2a`

File details

Details for the file websense-0.4.0-py3-none-any.whl.

File metadata

Download URL: websense-0.4.0-py3-none-any.whl
Upload date: Jan 29, 2026
Size: 13.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for websense-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1eeda2dfbb2dafdc47c03be95e51e42136b664066fe319a396cb4a28be8dba93`
MD5	`08e504fdcda6e42019e15cbfd77bd548`
BLAKE2b-256	`87b488bbd6f73a34ebe6327d5c76f81dd41ee244d65fcf81243f78ae5f9c8de9`
