AI web scraping workflow.
Project description
Scraipe
Scraipe is a high-performance, asynchronous scraping and analysis framework that leverages Large Language Models (LLMs) to extract structured information.
Installation
Ensure you have Python 3.10+ installed. Install Scraipe with all built-in scrapers/analyzers:
```bash
pip install scraipe[extended]
```
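Note: some shells (such as zsh) treat the square brackets as glob patterns, so you may need to quote the extras: `pip install "scraipe[extended]"`.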
Alternatively, install the core library and develop your own scrapers/analyzers with:
```bash
pip install scraipe
```
Features
- Versatile Scraping: Leverage custom scrapers that handle Telegram messages, news articles, and links that require multiple ingress rules.
- LLM Analysis: Process text using OpenAI models, with built-in Pydantic validation of the structured output (see the validation sketch after this list).
- Workflow Management: Combine scraping and analysis in a single fault-tolerant workflow, ideal for Jupyter notebooks.
- High Performance: IO-bound tasks run asynchronously under the hood while the API stays synchronous.
- Modular: Extend the framework with new scrapers or analyzers as your data sources evolve.
- Customizable Ingress: Easily define and update rules to route different types of links to their appropriate scrapers.
- Detailed Logging: Monitor scraping and analysis operations through comprehensive logging for improved debugging and transparency.
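The Pydantic validation mentioned above applies to schemas like the one requested in the usage example below. As a rough, standalone sketch of that style of validation (the `CelebrityList` model and `raw_output` string are illustrative, not part of Scraipe's API):

```python
from pydantic import BaseModel

class CelebrityList(BaseModel):
    celebrities: list[str]

# A raw LLM response that should match the requested JSON schema
raw_output = '{"celebrities": ["Celebrity One", "Celebrity Two"]}'

# Pydantic v2: raises ValidationError if the output does not match the schema
validated = CelebrityList.model_validate_json(raw_output)
print(validated.celebrities)
```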
Usage Example
1. Setup: import the required modules.

   ```python
   from scraipe import Workflow
   from scraipe.extended import NewsScraper, OpenAiAnalyzer
   ```

2. Configure the scraper and analyzer.

   ```python
   # Configure the scraper
   scraper = NewsScraper()

   # Define an instruction for the analyzer
   instruction = '''
   Extract a list of celebrities mentioned in the article text.
   Return a JSON dictionary with the schema:
   {"celebrities": ["celebrity1", "celebrity2", ...]}
   '''
   analyzer = OpenAiAnalyzer("YOUR_OPENAI_API_KEY", instruction)
   ```

3. Use the workflow.

   ```python
   workflow = Workflow(scraper, analyzer)

   # Provide a list of URLs to scrape
   news_links = ["https://example.com/article1", "https://example.com/article2"]
   workflow.scrape(news_links)

   # Analyze the scraped content
   workflow.analyze()

   # Export results as a CSV file
   export_df = workflow.export()
   export_df.to_csv('celebrities.csv', index=False)
   ```
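The `to_csv` call above implies that `workflow.export()` returns a pandas DataFrame. Assuming that, you can inspect or filter the results before writing them out; a minimal sketch (the exact column names are not documented here, so nothing below depends on them):

```python
# Assumes export() returns a pandas DataFrame, as the to_csv call above suggests.
print(export_df.shape)             # (rows, columns) of the exported results
print(export_df.columns.tolist())  # column names produced by the workflow
print(export_df.head())            # preview the first few rows before saving
```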
Contributing
Contributions are welcome. Please open an issue or submit a pull request for improvements.
License
This project is licensed under the MIT License.
Maintainer
This project is maintained by Nibs.
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
File details
Details for the file scraipe-0.1.26.tar.gz.
File metadata
- Download URL: scraipe-0.1.26.tar.gz
- Upload date:
- Size: 14.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `55df849f1b41a390adc2c43b682c98d5d50099a8421ec931796df3d71f4430d1` |
| MD5 | `7078973ae640846fac7442871b56fdad` |
| BLAKE2b-256 | `363e7c56df7b06e836ed77116e7fdceb13ac059fd0e445ec34b525ee2eb726ba` |
File details
Details for the file scraipe-0.1.26-py3-none-any.whl.
File metadata
- Download URL: scraipe-0.1.26-py3-none-any.whl
- Upload date:
- Size: 20.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e19e156303dbc0848d75252639109b781eb7601fa40155b4045694da7a169f69` |
| MD5 | `15280465437f5f1f2ef5012e09539abf` |
| BLAKE2b-256 | `c56c6751b2037551a02f9d9c3d8c53d7cceaba7dec5b52a07025faf791fbbc49` |