A smart web scraper with LLM-powered extraction capabilities
Web Scraper with LLM Extraction
A lightweight web scraping library that pairs Playwright-based scraping with LLM-powered content extraction through either the OpenAI or the OpenRouter API.
Features
- Configurable web scraping with Playwright
- Support for both headless and visible browser modes
- Content cleaning and preprocessing
- LLM-based information extraction
- Support for both OpenAI and OpenRouter APIs
- Customizable schema definitions with type specifications:
  - String fields
  - Array fields
  - Object fields with nested properties
- Ad blocking and media handling
- Automatic handling of srcset attributes
- HTML minification support
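To illustrate what handling a srcset attribute involves, here is a minimal sketch that picks the largest candidate URL from a srcset string. It is not the library's implementation: the function name `best_srcset_url` is hypothetical, and the naive comma split assumes the URLs themselves contain no commas.

```python
from typing import Optional


def best_srcset_url(srcset: str) -> Optional[str]:
    """Return the URL of the largest candidate in a srcset string."""
    best_url: Optional[str] = None
    best_score = -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        url = parts[0]
        score = 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1:
            try:
                # Descriptors look like "480w" or "2x": drop the unit suffix.
                score = float(parts[1][:-1])
            except ValueError:
                continue  # ignore malformed descriptors
        if score > best_score:
            best_url, best_score = url, score
    return best_url
```

Called with `"small.jpg 480w, large.jpg 1024w"`, the function returns `"large.jpg"`.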
Installation
pip install "aiohttp>=3.8.0"
pip install "beautifulsoup4>=4.9.3"
pip install "fake-useragent>=0.1.11"
pip install "playwright>=1.20.0"
pip install "pydantic>=2.0.0"
pip install "tiktoken>=0.3.0"
pip install "openai>=1.0.0"
pip install "lxml>=4.9.0"
pip install scrapeneatly
Quick Start
import asyncio
from scrapeneatly import scrape_product

async def main():
    # Define what you want to extract
    fields = {
        "title": {
            "description": "Product title",
            "type": "string"
        },
        "images": {
            "description": "Product images",
            "type": "array",
            "items": {"type": "string"}
        }
    }
    result = await scrape_product(
        url="https://example.com/product",
        fields_to_extract=fields,
        provider="openai",  # or "openrouter"
        api_key="your-api-key",
        # model="anthropic/claude-2",  # optional, OpenRouter only
    )
    if result["success"]:
        print(result["data"])

if __name__ == "__main__":
    asyncio.run(main())
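The Quick Start passes the API key as a string literal. A common alternative is to read it from the environment so the key stays out of source control. The sketch below assumes the variable names `OPENAI_API_KEY` and `OPENROUTER_API_KEY`; both names and the helper `api_key_for` are illustrative and not part of scrapeneatly.

```python
import os

# Hypothetical variable names; use whatever your deployment defines.
ENV_VARS = {"openai": "OPENAI_API_KEY", "openrouter": "OPENROUTER_API_KEY"}


def api_key_for(provider: str) -> str:
    """Look up the API key for a provider in the environment."""
    try:
        env_var = ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set {env_var} before running the scraper")
    return key
```

The result can then be passed to the call shown above as `api_key=api_key_for("openai")`.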
Advanced Usage
Specifying Field Types
fields = {
    "price": {
        "description": "Product price",
        "type": "string"
    },
    "variants": {
        "description": "Product variants",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "color": {"type": "string"},
                "size": {"type": "string"}
            }
        }
    }
}
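Nested field definitions are easy to mistype, so it can help to check them before spending an API call. The following is a minimal sketch of such a check, assuming only the three types named in the feature list (string, array, object) are valid; the library may accept more and may validate differently. The helpers `check_spec` and `check_fields` are hypothetical names, not part of scrapeneatly.

```python
# Assumption: only the types listed under Features are accepted.
ALLOWED_TYPES = {"string", "array", "object"}


def check_spec(spec: dict, path: str) -> list:
    """Collect problems found in one field specification."""
    problems = []
    field_type = spec.get("type")
    if field_type not in ALLOWED_TYPES:
        problems.append(f"{path}: unsupported type {field_type!r}")
    elif field_type == "array":
        items = spec.get("items")
        if not isinstance(items, dict):
            problems.append(f"{path}: array needs an 'items' spec")
        else:
            # Recurse into the element specification.
            problems.extend(check_spec(items, f"{path}[]"))
    elif field_type == "object":
        # Recurse into each nested property.
        for name, sub in spec.get("properties", {}).items():
            problems.extend(check_spec(sub, f"{path}.{name}"))
    return problems


def check_fields(fields: dict) -> list:
    """Return a list of problems; an empty list means the schema looks fine."""
    problems = []
    for name, spec in fields.items():
        problems.extend(check_spec(spec, name))
    return problems
```

For the `fields` dictionary above, `check_fields(fields)` returns an empty list.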
Using OpenRouter with Custom Model
result = await scrape_product(
    url="your_url",
    fields_to_extract=fields,
    provider="openrouter",
    api_key="your-openrouter-key",
    model="google/gemini-2.0-flash-001"
)
Using OpenAI Models
The OpenAI provider uses GPT-4o, so do not pass a model argument.
result = await scrape_product(
    url="your_url",
    fields_to_extract=fields,
    provider="openai",
    api_key="your-openai-api-key",
)
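Because `scrape_product` is a coroutine, several pages can be scraped concurrently with standard asyncio tools. The sketch below caps the number of simultaneous scrapes with a semaphore. It takes the scraping coroutine as a parameter, so it makes no assumptions about scrapeneatly beyond the `success` key shown earlier; the helper name `scrape_many` and the `error` key in the failure result are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


async def scrape_many(
    urls: list,
    scrape_one: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 3,
) -> dict:
    """Run scrape_one over urls with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> dict:
        async with semaphore:
            try:
                return await scrape_one(url)
            except Exception as exc:
                # Keep one failure from cancelling the other scrapes.
                return {"success": False, "error": str(exc)}

    # gather preserves input order, so results line up with urls.
    results = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(zip(urls, results))
```

Here `scrape_one` could be a small wrapper such as `lambda url: scrape_product(url=url, fields_to_extract=fields, provider="openai", api_key=key)`. A low `max_concurrent` keeps the number of open browser pages and in-flight API requests modest.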
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file scrapeneatly-0.1.0.tar.gz.
File metadata
- Download URL: scrapeneatly-0.1.0.tar.gz
- Upload date:
- Size: 14.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a53b6c7f0a41e3d52a05b4c5041bfec0fdb68da1b5f0966de6e134d871949fd |
| MD5 | c918819f9b33e2186abd8ad5f546e2de |
| BLAKE2b-256 | f8340a9e58f344840e7b62a4363fda6bb0d6b0fe9f76f0b85c23014cd9d4eded |
File details
Details for the file scrapeneatly-0.1.0-py3-none-any.whl.
File metadata
- Download URL: scrapeneatly-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d95bbcf316d8b5852bc84518db8ca05172e9b8449e60c6b2b7349cf743a129f0 |
| MD5 | deb809bbf2af7bc8aa5841b3cbcdeca3 |
| BLAKE2b-256 | 16d9730c50563d624fe380ca2feb404d8c5d9f536a29912f75fd131626a16816 |