AI web scraping workflow.
Scraipe
Scraipe is a high performance asynchronous scraping and analysis framework that leverages Large Language Models (LLMs) to extract structured information.
Installation
Ensure you have Python 3.10+ installed, then install Scraipe with the extra scrapers and analyzers (on shells such as zsh that expand square brackets, quote the specifier as "scraipe[extras]"):

pip install scraipe[extras]
Features
- High Performance: IO-bound tasks such as scraping and querying LLMs are fully asynchronous under the hood.
- Custom Scraping: Scraipe comes with ready-made scrapers (such as NewsScraper) and supports plugging in your own scraper implementations.
- LLM Analysis: Process text using OpenAI’s API with built-in validation via Pydantic.
- Workflow Management: Combine scraping and analysis in a single workflow, ideal for work in Jupyter notebooks.
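To illustrate the validation step the LLM Analysis feature relies on, here is a minimal sketch using plain Pydantic (v2 API), independent of Scraipe's own wiring:

```python
from typing import List
from pydantic import BaseModel, ValidationError

class ExpectedOutput(BaseModel):
    celebrities: List[str]

# A well-formed LLM response parses into typed data
parsed = ExpectedOutput.model_validate_json('{"celebrities": ["Ada Lovelace", "Alan Turing"]}')
print(parsed.celebrities)  # ['Ada Lovelace', 'Alan Turing']

# A malformed response raises ValidationError instead of silently passing through
try:
    ExpectedOutput.model_validate_json('{"celebrities": "not a list"}')
except ValidationError:
    print("rejected")
```

Because the schema is enforced at parse time, downstream code can trust the structure of every analysis result.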
Usage Example
1. Setup — import the required modules:

```python
from scraipe import Workflow
from scraipe.extras import NewsScraper, OpenAiAnalyzer
```
2. Configure the scraper and analyzer:

```python
from typing import List
from pydantic import BaseModel

# Configure the scraper
scraper = NewsScraper()

# Define an instruction and optional Pydantic schema for the analyzer
instruction = '''
Extract a list of celebrities mentioned in the article text.
Return a JSON dictionary with the schema:
{"celebrities": ["celebrity1", "celebrity2", ...]}
'''

class ExpectedOutput(BaseModel):
    celebrities: List[str]

analyzer = OpenAiAnalyzer("YOUR_OPENAI_API_KEY", instruction, pydantic_schema=ExpectedOutput)
```
3. Use the workflow:

```python
workflow = Workflow(scraper, analyzer)

# Provide a list of URLs to scrape
news_links = ["https://example.com/article1", "https://example.com/article2"]
workflow.scrape(news_links)

# Analyze the scraped content
workflow.analyze()

# Export results as a CSV file
export_df = workflow.export()
export_df.to_csv('celebrities.csv', index=False)
```
Contributing
Contributions are welcome. Please open an issue or submit a pull request for improvements.
License
This project is licensed under the MIT License.
Maintainer
This project is maintained by Nibs.
Project details
Download files
File details
Details for the file scraipe-0.1.24.tar.gz.
File metadata
- Download URL: scraipe-0.1.24.tar.gz
- Upload date:
- Size: 14.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `eec004f0472e5bcd39b4f6fc71a02d5ec7a50a64bae3854bcc4b529e16c5b624` |
| MD5 | `ee6af846511800dfecb61d66d2beae9f` |
| BLAKE2b-256 | `fe08c9ab03b204187eb11544b6c624cfec1e93ff14385f659867bcada6bcefd5` |
File details
Details for the file scraipe-0.1.24-py3-none-any.whl.
File metadata
- Download URL: scraipe-0.1.24-py3-none-any.whl
- Upload date:
- Size: 20.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `52c92697771ec125d52fbb5c226d81d784e314c24e962ae978d6aae8ef0f54e6` |
| MD5 | `1b3b7811e8eec2076646bafb8562b387` |
| BLAKE2b-256 | `487cf533f07baf5f938f2ee622f0a40f63476bde484d7b378cbfea3cf8c770dd` |