
Web scraping tool potentially using Llama models.


llama-web-scraper


Llama Web Scraper (llama-web-scraper) is a toolkit for building intelligent web scrapers within the LlamaSearch AI ecosystem. It combines traditional web scraping techniques with AI models for tasks like content extraction, understanding page structure, and handling dynamic websites.

Key Features

  • Web Scraping Engine: Core components for fetching and parsing web pages (scraper/).
  • AI-Powered Extraction: Utilizes AI models for intelligent content extraction, potentially handling complex layouts or JavaScript-rendered pages (ai/, models/).
  • Command-Line Interface: Provides CLI tools for initiating and configuring scraping tasks (cli/).
  • Utilities: Includes helper functions for requests, parsing, and data handling (utils/).
  • Core Orchestration: Manages the scraping workflow (core.py, main.py).
  • Configurable: Allows defining target URLs, scraping rules, AI models, output formats, etc. (config.py).
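To make the fetch-and-parse step concrete, here is a minimal, dependency-free sketch of rule-based extraction in the spirit of the scraper/ engine. The class and field names are illustrative only, not part of this package's API:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of <title> and <h1> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self._capture = None  # tag currently being captured, if any
        self.fields = {}      # extracted field -> text

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._capture = tag

    def handle_data(self, data):
        if self._capture:
            # Keep only the first occurrence of each field
            self.fields.setdefault(self._capture, data.strip())

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

html = "<html><head><title>Example</title></head><body><h1>Hello</h1></body></html>"
parser = TitleExtractor()
parser.feed(html)
print(parser.fields)  # {'title': 'Example', 'h1': 'Hello'}
```

A real engine would layer CSS selectors, XPath, and JavaScript rendering on top of this, but the shape is the same: fetch raw HTML, walk it, and collect named fields.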

Installation

pip install llama-web-scraper
# Or install directly from GitHub for the latest version:
# pip install git+https://github.com/llamasearchai/llama-web-scraper.git

Usage

Command-Line Interface (CLI)

Example invocations for scraping a single URL or running a job defined in a configuration file (exact flags may evolve; see --help for the current interface):

llama-web-scraper scrape --url https://example.com/article --output article.json --use-ai
llama-web-scraper run --config scrape_job.yaml

Python Client / Embedding

The scraper can also be embedded programmatically. The example below is illustrative; consult the source for the current interface.

# Illustrative usage; the exact API may differ from what is shown here.
from llama_web_scraper import Scraper, ScrapeConfig

config = ScrapeConfig.load("config.yaml")
scraper = Scraper(config)

# Scrape a single URL
results = scraper.scrape_url(
    "https://blog.example.com/latest-post",
    extract_elements=["title", "body", "author"],
)
print(results)

# Run a scraping job defined in the configuration
job_results = scraper.run_job("news_sites_job")

Architecture Overview

graph TD
    A["User / CLI (cli)"] --> B{"Core Scraper Orchestrator (core.py, main.py)"};
    B -- Initiates Scrape --> C{"Scraping Engine (scraper/)"};
    C -- Fetches --> D((Target Website));
    D -- HTML/Content --> C;
    C -- Raw Content --> E{"AI Processing Module (ai/, models/)"};
    E -- Extracts Data --> C;
    C --> F[Structured Scraped Data];
    F --> B;
    B --> G["Output (File, DB, API)"];

    H["Utilities (utils/)"] -- Used by --> C;
    H -- Used by --> E;
    I["Configuration (config.py)"] -- Configures --> B;
    I -- Configures --> C;
    I -- Configures --> E;

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#aef,stroke:#333,stroke-width:1px

  1. Interface: User initiates scraping via the CLI or programmatically.
  2. Orchestrator: Manages the scraping task based on configuration.
  3. Scraping Engine: Handles fetching web content (HTML, etc.) from target sites.
  4. AI Processing: (Optional) AI models analyze the raw content for intelligent extraction, structure understanding, or rendering JavaScript.
  5. Data Extraction: Relevant data is extracted either through rules or AI processing.
  6. Output: Structured data is saved to a file, database, or sent to another service.
  7. Config/Utils: Configuration defines targets and rules; utilities provide support.
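The flow above can be sketched as a small pipeline. Fetching is stubbed out here, and every function name is a hypothetical stand-in for the real orchestrator, not this package's API:

```python
import json

def fetch(url: str) -> str:
    """Stub for the scraping engine's fetch step (a real version would do HTTP)."""
    return "<title>Post</title><p>Body text</p>"

def extract(html: str, rules: dict) -> dict:
    """Rule-based extraction: each rule maps a field name to a (start, end) tag pair."""
    out = {}
    for field, (start, end) in rules.items():
        if start in html and end in html:
            out[field] = html.split(start, 1)[1].split(end, 1)[0]
    return out

def run_job(url: str, rules: dict) -> str:
    """Orchestrator: fetch -> extract -> serialize, mirroring steps 1-6 above."""
    html = fetch(url)
    data = extract(html, rules)
    return json.dumps(data)

rules = {"title": ("<title>", "</title>"), "body": ("<p>", "</p>")}
print(run_job("https://example.com/article", rules))
# {"title": "Post", "body": "Body text"}
```

The optional AI step would slot in between fetch and extract, replacing or augmenting the rule table with model-driven extraction.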

Configuration

(Details on configuring target URLs/sites, scraping rules (CSS selectors, XPath), AI model selection, JavaScript rendering options, output formats, rate limits, proxy usage, etc., will be added here.)
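Until official documentation lands, a hypothetical scrape_job.yaml might look like the following. Every key name here is an assumption for illustration, not the project's actual schema:

```yaml
# Hypothetical configuration; all key names are illustrative only.
job: news_sites_job
targets:
  - url: https://blog.example.com/latest-post
    rules:
      title: "h1.post-title"   # CSS selector
      body: "//article/p"      # XPath
use_ai: true
output:
  format: json
  path: results/
rate_limit: 1.0                # requests per second
```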

Development

Setup

# Clone the repository
git clone https://github.com/llamasearchai/llama-web-scraper.git
cd llama-web-scraper

# Install in editable mode with development dependencies
pip install -e ".[dev]"

Testing

pytest tests/

Contributing

Contributions are welcome! Please refer to CONTRIBUTING.md and submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.



Download files

Download the file for your platform.

Source Distribution

llama_web_scraper-0.1.0.tar.gz (15.9 kB)

Uploaded Source

Built Distribution


llama_web_scraper-0.1.0-py3-none-any.whl (13.2 kB)

Uploaded Python 3

File details

Details for the file llama_web_scraper-0.1.0.tar.gz.

File metadata

  • Download URL: llama_web_scraper-0.1.0.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for llama_web_scraper-0.1.0.tar.gz
  • SHA256: ef15a04814d239175b9ebb735cc4469e8b41634dbe24dc639b2279c1d9becf26
  • MD5: 7ef02440326d043de721ddaa5ec7ecf9
  • BLAKE2b-256: d84b6faeece59debaf497562086b9477a79bd71a02a84d5e9bfee6f688e88fb6


File details

Details for the file llama_web_scraper-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for llama_web_scraper-0.1.0-py3-none-any.whl
  • SHA256: 19131e692bdaaaad82b499a045139c107bece450eeac21b96ed47dbc61eaa24d
  • MD5: fe0a536c00e4516abba9f62bb522d1fb
  • BLAKE2b-256: f63ac0cf7534e11b083a739f186e5a45609317845cc15ef29752d2b8ca96256f

