llama-web-scraper

A web scraping toolkit that can optionally use Llama models.

Llama Web Scraper (llama-web-scraper) is a toolkit for building intelligent web scrapers within the LlamaSearch AI ecosystem. It combines traditional web scraping techniques with AI models for tasks such as content extraction, page-structure understanding, and handling dynamic websites.
Key Features

- Web Scraping Engine: Core components for fetching and parsing web pages (scraper/).
- AI-Powered Extraction: Utilizes AI models for intelligent content extraction, potentially handling complex layouts or JavaScript-rendered pages (ai/, models/).
- Command-Line Interface: Provides CLI tools for initiating and configuring scraping tasks (cli/).
- Utilities: Includes helper functions for requests, parsing, and data handling (utils/).
- Core Orchestration: Manages the scraping workflow (core.py, main.py).
- Configurable: Allows defining target URLs, scraping rules, AI models, output formats, etc. (config.py).
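To illustrate the kind of rule-based extraction a scraping engine like the one in scraper/ performs, here is a minimal, self-contained sketch using only Python's standard-library HTML parser. It is not code from this package; the class and tag rules are purely illustrative.

```python
from html.parser import HTMLParser

# Minimal rule-based extractor: collects the text content of selected tags.
class SimpleExtractor(HTMLParser):
    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)   # tag names to capture
        self.current = None     # tag currently being captured
        self.results = {}       # tag name -> extracted text

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.results[self.current] = self.results.get(self.current, "") + data

html = (
    "<html><head><title>Example Article</title></head>"
    "<body><h1>Hello</h1><p>Body text.</p></body></html>"
)
parser = SimpleExtractor(["title", "h1"])
parser.feed(html)
print(parser.results)  # {'title': 'Example Article', 'h1': 'Hello'}
```

A real engine would add HTTP fetching, CSS/XPath selectors, and error handling; the AI-powered path would replace or augment the fixed tag rules.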
Installation
pip install llama-web-scraper
# Or install directly from GitHub for the latest version:
# pip install git+https://github.com/llamasearchai/llama-web-scraper.git
Usage
Command-Line Interface (CLI)
Example commands for scraping a single URL with AI-assisted extraction, or for running a job defined in a configuration file:
llama-web-scraper scrape --url https://example.com/article --output article.json --use-ai
llama-web-scraper run --config scrape_job.yaml
Python Client / Embedding
Illustrative programmatic usage (the API below is provisional and may change):

from llama_web_scraper import Scraper, ScrapeConfig

config = ScrapeConfig.load("config.yaml")
scraper = Scraper(config)

# Scrape a single URL
results = scraper.scrape_url(
    "https://blog.example.com/latest-post",
    extract_elements=["title", "body", "author"],
)
print(results)

# Run a scraping job defined in the configuration
# job_results = scraper.run_job("news_sites_job")
Architecture Overview
graph TD
    A["User / CLI (cli)"] --> B{"Core Scraper Orchestrator (core.py, main.py)"};
    B -- Initiates Scrape --> C{"Scraping Engine (scraper/)"};
    C -- Fetches --> D((Target Website));
    D -- HTML/Content --> C;
    C -- Raw Content --> E{"AI Processing Module (ai/, models/)"};
    E -- Extracts Data --> C;
    C --> F[Structured Scraped Data];
    F --> B;
    B --> G["Output (File, DB, API)"];
    H["Utilities (utils/)"] -- Used by --> C;
    H -- Used by --> E;
    I["Configuration (config.py)"] -- Configures --> B;
    I -- Configures --> C;
    I -- Configures --> E;

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#aef,stroke:#333,stroke-width:1px
- Interface: User initiates scraping via the CLI or programmatically.
- Orchestrator: Manages the scraping task based on configuration.
- Scraping Engine: Handles fetching web content (HTML, etc.) from target sites.
- AI Processing: (Optional) AI models analyze the raw content for intelligent extraction, structure understanding, or interpreting JavaScript-rendered pages.
- Data Extraction: Relevant data is extracted either through rules or AI processing.
- Output: Structured data is saved to a file, database, or sent to another service.
- Config/Utils: Configuration defines targets and rules; utilities provide support.
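The workflow described above can be sketched as a small pipeline. This is not the package's actual internals; the function names are hypothetical, and the fetch step returns canned HTML so the sketch runs offline.

```python
# Illustrative orchestration: fetch -> (optional AI processing) -> extract -> output.

def fetch(url):
    # Stand-in for the scraping engine's HTTP fetch; returns canned HTML here.
    return "<html><title>Demo</title><body>Hello world</body></html>"

def rule_based_extract(html):
    # Trivial rule: pull the contents of the <title> tag.
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return {"title": html[start:end]}

def run_job(url, use_ai=False, ai_extract=None):
    raw = fetch(url)
    # Fall back to rule-based extraction when no AI extractor is configured.
    extractor = ai_extract if (use_ai and ai_extract) else rule_based_extract
    record = extractor(raw)
    record["source_url"] = url
    return record  # in the real tool this would be written to a file, DB, or API

result = run_job("https://example.com")
print(result)  # {'title': 'Demo', 'source_url': 'https://example.com'}
```

The orchestrator's main design point is that the extraction step is pluggable: configuration decides whether a rule-based or AI-backed extractor handles the raw content.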
Configuration
(Details on configuring target URLs/sites, scraping rules (CSS selectors, XPath), AI model selection, JavaScript rendering options, output formats, rate limits, proxy usage, etc., will be added here.)
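As a sketch only, a job file such as the scrape_job.yaml referenced in the Usage section might look like the following; every field name here is hypothetical and should be checked against config.py:

```yaml
# Hypothetical scrape job definition (illustrative field names only)
targets:
  - url: https://example.com/article
    rules:
      title: "h1"        # CSS selector
      body: "article p"
ai:
  enabled: true
output:
  format: json
  path: results/
rate_limit:
  requests_per_second: 1
```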
Development
Setup
# Clone the repository
git clone https://github.com/llamasearchai/llama-web-scraper.git
cd llama-web-scraper
# Install in editable mode with development dependencies
pip install -e ".[dev]"
Testing
pytest tests/
Contributing
Contributions are welcome! Please refer to CONTRIBUTING.md and submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file llama_web_scraper-0.1.0.tar.gz.
File metadata
- Download URL: llama_web_scraper-0.1.0.tar.gz
- Upload date:
- Size: 15.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ef15a04814d239175b9ebb735cc4469e8b41634dbe24dc639b2279c1d9becf26 |
| MD5 | 7ef02440326d043de721ddaa5ec7ecf9 |
| BLAKE2b-256 | d84b6faeece59debaf497562086b9477a79bd71a02a84d5e9bfee6f688e88fb6 |
File details
Details for the file llama_web_scraper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llama_web_scraper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19131e692bdaaaad82b499a045139c107bece450eeac21b96ed47dbc61eaa24d |
| MD5 | fe0a536c00e4516abba9f62bb522d1fb |
| BLAKE2b-256 | f63ac0cf7534e11b083a739f186e5a45609317845cc15ef29752d2b8ca96256f |