A smart web scraper with LLM-powered extraction capabilities

Web Scraper with LLM Extraction

A lightweight web scraping library with LLM-based extraction. It combines Playwright-driven scraping with AI-powered content extraction through either the OpenAI or the OpenRouter API.

Features

  • Configurable web scraping with Playwright
  • Support for both headless and visible browser modes
  • Content cleaning and preprocessing
  • LLM-based information extraction
  • Support for both OpenAI and OpenRouter APIs
  • Customizable schema definitions with type specifications:
    • String fields
    • Array fields
    • Object fields with nested properties
  • Ad blocking and media handling
  • Automatic handling of srcset attributes
  • HTML minification support

Installation

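# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "aiohttp>=3.8.0"
pip install "beautifulsoup4>=4.9.3"
pip install "fake-useragent>=0.1.11"
pip install "playwright>=1.20.0"
pip install "pydantic>=2.0.0"
pip install "tiktoken>=0.3.0"
pip install "openai>=1.0.0"
pip install "lxml>=4.9.0"
pip install scrapeneatly

# Playwright needs its browser binaries downloaded once
playwright install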

Quick Start

import asyncio
from scrapeneatly import scrape_product

async def main():
    # Define what you want to extract
    fields = {
        "title": {
            "description": "Product title",
            "type": "string"
        },
        "images": {
            "description": "Product images",
            "type": "array",
            "items": {"type": "string"}
        }
    }

    result = await scrape_product(
        url="https://example.com/product",
        fields_to_extract=fields,
        provider="openai",  # or "openrouter"
        api_key="your-api-key",
        # model="anthropic/claude-2",  # OpenRouter only; omit for OpenAI
    )

    if result["success"]:
        print(result["data"])

if __name__ == "__main__":
    asyncio.run(main())
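
The Quick Start only prints on success. A minimal sketch of handling both outcomes follows. It relies only on the `success` and `data` keys shown above; the `error` key and the `unwrap_result` helper are assumptions made for illustration and are not part of the documented API.

```python
from typing import Any


def unwrap_result(result: dict[str, Any]) -> Any:
    """Return result["data"] on success, otherwise raise RuntimeError."""
    if result.get("success"):
        return result["data"]
    # Assumption: a failed result may carry a message under "error".
    message = result.get("error", "no error details provided")
    raise RuntimeError(f"Extraction failed: {message}")
```

Inside `main()`, this could replace the `if result["success"]` check with `data = unwrap_result(result)`, so a failed scrape raises instead of passing silently.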

Advanced Usage

Specifying Field Types

fields = {
    "price": {
        "description": "Product price",
        "type": "string"
    },
    "variants": {
        "description": "Product variants",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "color": {"type": "string"},
                "size": {"type": "string"}
            }
        }
    }
}
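
Nested schemas like the one above are easy to get subtly wrong, so it can help to check them before spending an API call. The sketch below is a hypothetical pre-flight check, not something scrapeneatly provides. It assumes the only valid types are the three listed under Features (`string`, `array`, `object`), that arrays carry an `items` spec, and that objects carry `properties`.

```python
from typing import Any

ALLOWED_TYPES = {"string", "array", "object"}


def validate_field(name: str, spec: Any) -> list[str]:
    """Return a list of problems found in one field spec (empty if none)."""
    if not isinstance(spec, dict):
        return [f"{name}: spec must be a dict"]
    field_type = spec.get("type")
    if field_type not in ALLOWED_TYPES:
        return [f"{name}: unsupported type {field_type!r}"]
    problems: list[str] = []
    if field_type == "array":
        # Assumption: every array declares the shape of its items.
        if "items" not in spec:
            problems.append(f"{name}: array field needs an 'items' spec")
        else:
            problems.extend(validate_field(f"{name}[]", spec["items"]))
    elif field_type == "object":
        # Assumption: every object declares its nested properties.
        props = spec.get("properties")
        if not isinstance(props, dict):
            problems.append(f"{name}: object field needs 'properties'")
        else:
            for prop_name, prop_spec in props.items():
                problems.extend(validate_field(f"{name}.{prop_name}", prop_spec))
    return problems


def validate_fields(fields: dict[str, Any]) -> list[str]:
    """Validate every top-level field and collect all problems."""
    problems: list[str] = []
    for name, spec in fields.items():
        problems.extend(validate_field(name, spec))
    return problems
```

Calling `validate_fields(fields)` on the schema above returns an empty list. A schema with a typo such as `"type": "strng"` would produce one message naming the offending field.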

Using OpenRouter with Custom Model

# Run inside an async function, as in Quick Start
result = await scrape_product(
    url="your_url",
    fields_to_extract=fields,
    provider="openrouter",
    api_key="your-openrouter-key",
    model="google/gemini-2.0-flash-001"
)

Using OpenAI Models

The OpenAI provider uses GPT-4o; do not pass a model argument.

# Run inside an async function, as in Quick Start
result = await scrape_product(
    url="your_url",
    fields_to_extract=fields,
    provider="openai",
    api_key="your-openai-api-key",
)
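
The two provider examples differ only in `provider`, `api_key`, and the optional `model`. The sketch below gathers that difference into one hypothetical helper, `provider_kwargs`, and reads the key from the environment rather than hard-coding it. The environment variable names are a common convention assumed here; scrapeneatly is not documented to read them itself.

```python
import os
from typing import Mapping, Optional

# Assumption: conventional variable names, not something the library defines.
ENV_VARS = {"openai": "OPENAI_API_KEY", "openrouter": "OPENROUTER_API_KEY"}


def provider_kwargs(
    provider: str,
    model: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Build the provider-related keyword arguments for scrape_product."""
    if provider not in ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    # The README asks callers not to specify a model for OpenAI.
    if provider == "openai" and model is not None:
        raise ValueError("do not specify a model for the OpenAI provider")
    api_key = env.get(ENV_VARS[provider])
    if not api_key:
        raise ValueError(f"set {ENV_VARS[provider]} before running")
    kwargs = {"provider": provider, "api_key": api_key}
    if model is not None:
        kwargs["model"] = model
    return kwargs
```

A call would then look like `scrape_product(url="your_url", fields_to_extract=fields, **provider_kwargs("openrouter", model="google/gemini-2.0-flash-001"))`.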

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
