
Web scraping tool potentially using Llama models.


llama-web-scraper


Llama Web Scraper (llama-web-scraper) is a toolkit for building intelligent web scrapers within the LlamaSearch AI ecosystem. It combines traditional web scraping techniques with AI models for tasks like content extraction, understanding page structure, and handling dynamic websites.

Key Features

  • Web Scraping Engine: Core components for fetching and parsing web pages (scraper/).
  • AI-Powered Extraction: Utilizes AI models for intelligent content extraction, potentially handling complex layouts or JavaScript-rendered pages (ai/, models/).
  • Command-Line Interface: Provides CLI tools for initiating and configuring scraping tasks (cli/).
  • Utilities: Includes helper functions for requests, parsing, and data handling (utils/).
  • Core Orchestration: Manages the scraping workflow (core.py, main.py).
  • Configurable: Allows defining target URLs, scraping rules, AI models, output formats, etc. (config.py).
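To make the fetch-and-parse step concrete, here is a minimal, dependency-free sketch of rule-based extraction in the spirit of the scraper/ engine. The class and field names are illustrative only, not part of this package's API:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of <title> and <h1> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self._capture = None  # tag currently being captured, if any
        self.fields = {}      # extracted field -> text

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._capture = tag

    def handle_data(self, data):
        if self._capture:
            # Keep only the first occurrence of each field
            self.fields.setdefault(self._capture, data.strip())

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

html = "<html><head><title>Example</title></head><body><h1>Hello</h1></body></html>"
parser = TitleExtractor()
parser.feed(html)
print(parser.fields)  # {'title': 'Example', 'h1': 'Hello'}
```

A real engine would layer CSS selectors, XPath, and JavaScript rendering on top of this, but the shape is the same: fetch raw HTML, walk it, and collect named fields.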

Installation

pip install llama-web-scraper
# Or install directly from GitHub for the latest version:
# pip install git+https://github.com/llamasearchai/llama-web-scraper.git

Usage

Command-Line Interface (CLI)

Example invocations for scraping a single URL or running a job defined in a configuration file (exact flags may evolve; see --help for the current interface):

llama-web-scraper scrape --url https://example.com/article --output article.json --use-ai
llama-web-scraper run --config scrape_job.yaml

Python Client / Embedding

The scraper can also be embedded programmatically. The example below is illustrative; consult the source for the current interface.

# Illustrative usage; the exact API may differ from what is shown here.
from llama_web_scraper import Scraper, ScrapeConfig

config = ScrapeConfig.load("config.yaml")
scraper = Scraper(config)

# Scrape a single URL
results = scraper.scrape_url(
    "https://blog.example.com/latest-post",
    extract_elements=["title", "body", "author"],
)
print(results)

# Run a scraping job defined in the configuration
job_results = scraper.run_job("news_sites_job")

Architecture Overview

graph TD
    A["User / CLI (cli)"] --> B{"Core Scraper Orchestrator (core.py, main.py)"};
    B -- Initiates Scrape --> C{"Scraping Engine (scraper/)"};
    C -- Fetches --> D((Target Website));
    D -- HTML/Content --> C;
    C -- Raw Content --> E{"AI Processing Module (ai/, models/)"};
    E -- Extracts Data --> C;
    C --> F[Structured Scraped Data];
    F --> B;
    B --> G["Output (File, DB, API)"];

    H["Utilities (utils/)"] -- Used by --> C;
    H -- Used by --> E;
    I["Configuration (config.py)"] -- Configures --> B;
    I -- Configures --> C;
    I -- Configures --> E;

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#aef,stroke:#333,stroke-width:1px

  1. Interface: User initiates scraping via the CLI or programmatically.
  2. Orchestrator: Manages the scraping task based on configuration.
  3. Scraping Engine: Handles fetching web content (HTML, etc.) from target sites.
  4. AI Processing: (Optional) AI models analyze the raw content for intelligent extraction, structure understanding, or rendering JavaScript.
  5. Data Extraction: Relevant data is extracted either through rules or AI processing.
  6. Output: Structured data is saved to a file, database, or sent to another service.
  7. Config/Utils: Configuration defines targets and rules; utilities provide support.
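The flow above can be sketched as a small pipeline. Fetching is stubbed out here, and every function name is a hypothetical stand-in for the real orchestrator, not this package's API:

```python
import json

def fetch(url: str) -> str:
    """Stub for the scraping engine's fetch step (a real version would do HTTP)."""
    return "<title>Post</title><p>Body text</p>"

def extract(html: str, rules: dict) -> dict:
    """Rule-based extraction: each rule maps a field name to a (start, end) tag pair."""
    out = {}
    for field, (start, end) in rules.items():
        if start in html and end in html:
            out[field] = html.split(start, 1)[1].split(end, 1)[0]
    return out

def run_job(url: str, rules: dict) -> str:
    """Orchestrator: fetch -> extract -> serialize, mirroring steps 1-6 above."""
    html = fetch(url)
    data = extract(html, rules)
    return json.dumps(data)

rules = {"title": ("<title>", "</title>"), "body": ("<p>", "</p>")}
print(run_job("https://example.com/article", rules))
# {"title": "Post", "body": "Body text"}
```

The optional AI step would slot in between fetch and extract, replacing or augmenting the rule table with model-driven extraction.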

Configuration

(Details on configuring target URLs/sites, scraping rules (CSS selectors, XPath), AI model selection, JavaScript rendering options, output formats, rate limits, proxy usage, etc., will be added here.)
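Until official documentation lands, a hypothetical scrape_job.yaml might look like the following. Every key name here is an assumption for illustration, not the project's actual schema:

```yaml
# Hypothetical configuration; all key names are illustrative only.
job: news_sites_job
targets:
  - url: https://blog.example.com/latest-post
    rules:
      title: "h1.post-title"   # CSS selector
      body: "//article/p"      # XPath
use_ai: true
output:
  format: json
  path: results/
rate_limit: 1.0                # requests per second
```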

Development

Setup

# Clone the repository
git clone https://github.com/llamasearchai/llama-web-scraper.git
cd llama-web-scraper

# Install in editable mode with development dependencies
pip install -e ".[dev]"

Testing

pytest tests/

Contributing

Contributions are welcome! Please refer to CONTRIBUTING.md and submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.



Download files

Download the file for your platform.

Source Distribution

llama_web_scraper-0.1.0.tar.gz (15.9 kB)

Uploaded Source

Built Distribution


llama_web_scraper-0.1.0-py3-none-any.whl (13.2 kB)

Uploaded Python 3

File details

Details for the file llama_web_scraper-0.1.0.tar.gz.

File metadata

  • Download URL: llama_web_scraper-0.1.0.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for llama_web_scraper-0.1.0.tar.gz
  • SHA256: ef15a04814d239175b9ebb735cc4469e8b41634dbe24dc639b2279c1d9becf26
  • MD5: 7ef02440326d043de721ddaa5ec7ecf9
  • BLAKE2b-256: d84b6faeece59debaf497562086b9477a79bd71a02a84d5e9bfee6f688e88fb6


File details

Details for the file llama_web_scraper-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for llama_web_scraper-0.1.0-py3-none-any.whl
  • SHA256: 19131e692bdaaaad82b499a045139c107bece450eeac21b96ed47dbc61eaa24d
  • MD5: fe0a536c00e4516abba9f62bb522d1fb
  • BLAKE2b-256: f63ac0cf7534e11b083a739f186e5a45609317845cc15ef29752d2b8ca96256f

