
LLM data collection and synthetic fine-tuning dataset pipeline


LLM Web Crawler

A cross-platform CLI tool that collects and processes web data into LLM fine-tuning datasets: it discovers URLs, fetches content, and prepares synthetic training data.

Installation

Via pip (recommended - all platforms)

pip install llm-web-crawler

Then run:

dataforge

From source

git clone https://github.com/ianktoo/data-forge.git
cd data-forge
pip install -e .
dataforge

Standalone executables (no Python required)

Download pre-built executables for your platform from GitHub Releases:

  • Windows: dataforge-windows-x64.exe
  • macOS: dataforge-macos-x64
  • Linux: dataforge-linux-x64

Download and run directly; no Python installation is required. On macOS and Linux, you may need to make the file executable first (chmod +x).

Features

  • URL Discovery: Automatically discovers URLs from sitemaps and robots.txt
  • Parallel Processing: Asynchronous collection of web content
  • LLM Integration: Prepare data for fine-tuning with LiteLLM support
  • Multi-format Output: Export datasets as JSON, Arrow, or Parquet
  • Cross-platform: Runs on Windows, macOS, and Linux
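URL discovery from a sitemap boils down to extracting the <loc> entries from the sitemap XML. A minimal standard-library sketch of the idea (illustrative only; this is not dataforge's internal API):

```python
# Sketch of sitemap-based URL discovery using only the standard library.
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

def discover_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> entry from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

urls = discover_urls(SITEMAP_XML)
print(urls)
```

In practice the sitemap location is usually found via the site's robots.txt (a `Sitemap:` directive), which is why the tool consults both.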

Usage

dataforge

The CLI provides an interactive interface for:

  • Configuring data sources
  • Setting collection parameters
  • Monitoring progress
  • Exporting datasets
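Behind these prompts, collection runs asynchronously so many pages can be fetched at once. A minimal sketch of that pattern with asyncio, using a stubbed fetch in place of a real httpx request:

```python
# Illustrative sketch of concurrent page collection with asyncio.
# The real tool performs HTTP with httpx; fetch() here is a stand-in.
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"<html>content of {url}</html>"

async def collect(urls: list[str]) -> list[str]:
    # Fetch all URLs concurrently instead of one at a time.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(collect(["https://example.com/a", "https://example.com/b"]))
print(len(pages))
```

asyncio.gather preserves input order, so results can be matched back to their source URLs.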

Development

Setup

git clone https://github.com/ianktoo/data-forge.git
cd data-forge
pip install -e ".[dev]"

Run tests

pytest

Code quality

# Linting
ruff check src/ tests/

# Type checking
mypy src/

Releasing New Versions

This project uses GitHub Actions for automated CI/CD:

Release process

  1. Update the version in pyproject.toml
  2. Commit your changes:
    git add .
    git commit -m "Bump version to X.Y.Z"
    git push
    
  3. Create and push a git tag:
    git tag vX.Y.Z
    git push origin vX.Y.Z
    

What happens automatically

  • Build: Cross-platform executables are built for Windows, macOS, and Linux
  • Release: Executables are attached to a GitHub Release
  • Publish: Package is published to PyPI as llm-web-crawler

Users can then:

  • Install via pip install llm-web-crawler
  • Or download standalone executables from Releases

Project Structure

data-forge/
├── src/dataforge/           # Main package
│   ├── cli/                 # CLI interface (typer + questionary)
│   ├── collectors/          # Web content collectors
│   ├── processors/          # Data processors
│   └── main.py              # Entry point
├── tests/                   # Test suite
├── .github/workflows/       # CI/CD automation
│   ├── build-executables.yml  # PyInstaller builds
│   └── publish-pypi.yml       # PyPI publishing
└── pyproject.toml           # Project metadata & dependencies

Dependencies

Core:

  • typer - CLI framework
  • rich - Terminal formatting
  • questionary - Interactive prompts
  • httpx - HTTP client
  • beautifulsoup4 - HTML parsing
  • litellm - LLM API abstraction
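The fetch-and-parse step that httpx and beautifulsoup4 cover can be sketched with only the standard library; this stand-in shows the shape of link extraction, not the project's actual parsing code:

```python
# Minimal link extraction with the stdlib html.parser module.
# dataforge itself uses beautifulsoup4, but the idea is the same.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag encountered.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

page = '<html><body><a href="/docs">Docs</a><a href="/api">API</a></body></html>'
parser = LinkCollector()
parser.feed(page)
print(parser.links)
```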

Data:

  • sqlmodel - Database ORM
  • pydantic - Data validation
  • datasets - Hugging Face datasets
  • pyarrow - Arrow/Parquet support
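For the JSON output path, fine-tuning records are typically serialized one JSON object per line (JSONL). A small sketch, with a hypothetical prompt/completion schema that is not necessarily dataforge's actual record format:

```python
# Writing collected examples as JSONL, one record per line.
# The prompt/completion field names here are illustrative only.
import json

records = [
    {"prompt": "Summarize: Example Domain", "completion": "A placeholder website."},
    {"prompt": "Summarize: Docs page", "completion": "Project documentation."},
]

jsonl = "".join(json.dumps(rec) + "\n" for rec in records)
print(jsonl, end="")
```

The same list of dicts can be handed to the `datasets` library and written out as Arrow or Parquet via pyarrow.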

License

See LICENSE file for details.



Download files

Download the file for your platform.

Source Distribution

llm_web_crawler-0.2.0.tar.gz (52.1 kB)


Built Distribution


llm_web_crawler-0.2.0-py3-none-any.whl (56.0 kB)


File details

Details for the file llm_web_crawler-0.2.0.tar.gz.

File metadata

  • Download URL: llm_web_crawler-0.2.0.tar.gz
  • Upload date:
  • Size: 52.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_web_crawler-0.2.0.tar.gz
Algorithm Hash digest
SHA256 139b4369fe3ead3c2fa05ee6429af0a3cb0a4ae6b4c6c2fdc2ea6ebf1c2affa5
MD5 90733e318fbaf06dc2bc6bb456568d32
BLAKE2b-256 11aea21ee5b3868af5863c607685ba9921839b1adac118e17a0f035204efdb61


Provenance

The following attestation bundles were made for llm_web_crawler-0.2.0.tar.gz:

Publisher: publish-pypi.yml on ianktoo/data-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_web_crawler-0.2.0-py3-none-any.whl.


File hashes

Hashes for llm_web_crawler-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 72f82cd518ae79613ca05225b81e4601de6a7a50735afdda35d937bb46ec795a
MD5 e26fbdc841d38bcc8502506bf9bf5692
BLAKE2b-256 d8c83c771a9af724ae893ee14c7e467b326ea5d620286597e2a7ecc045013b28


Provenance

The following attestation bundles were made for llm_web_crawler-0.2.0-py3-none-any.whl:

Publisher: publish-pypi.yml on ianktoo/data-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
