LLM data collection and synthetic fine-tuning dataset pipeline

LLM Web Crawler

A cross-platform CLI tool for collecting and processing web data for LLM fine-tuning datasets. Discovers URLs, fetches content, and prepares synthetic training data.

Installation

Via pip (recommended - all platforms)

pip install llm-web-crawler

Then run:

dataforge

From source

git clone https://github.com/ianktoo/data-forge.git
cd data-forge
pip install -e .
dataforge

Standalone executables (no Python required)

Download pre-built executables for your platform from GitHub Releases:

  • Windows: dataforge-windows-x64.exe
  • macOS: dataforge-macos-x64
  • Linux: dataforge-linux-x64

Download and run directly; on macOS and Linux you may need to mark the file executable first with chmod +x.

Features

  • URL Discovery: Automatically discovers URLs from sitemaps and robots.txt
  • Parallel Processing: Asynchronous collection of web content
  • LLM Integration: Prepare data for fine-tuning with LiteLLM support
  • Multi-format Output: Support for various output formats (JSON, Arrow, Parquet)
  • Cross-platform: Runs on Windows, macOS, and Linux
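
The URL-discovery step can be illustrated with a small parsing sketch. This is not the tool's actual API (the function names here are hypothetical); it only shows the underlying idea of pulling Sitemap: directives out of robots.txt and <loc> entries out of a sitemap, using the standard library:

```python
import re
from xml.etree import ElementTree

def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Extract Sitemap: directives from a robots.txt body."""
    return re.findall(r"(?im)^sitemap:\s*(\S+)", robots_txt)

def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract <loc> entries from a sitemap XML body."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ElementTree.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]

robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)
print(sitemaps_from_robots(robots))  # ['https://example.com/sitemap.xml']
print(urls_from_sitemap(sitemap))    # ['https://example.com/a', 'https://example.com/b']
```

In the real tool the two bodies would come from HTTP responses; here they are inline strings so the sketch is self-contained.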

Usage

dataforge

The CLI provides an interactive interface for:

  • Configuring data sources
  • Setting collection parameters
  • Monitoring progress
  • Exporting datasets
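
The parallel-collection behaviour behind these steps can be sketched with asyncio. This is a simplified stand-in, not dataforge's internals: fetch() fakes the network call (a real collector would use httpx.AsyncClient), and the semaphore shows how a concurrency limit, the kind of collection parameter the CLI asks for, would be enforced:

```python
import asyncio

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    # Stand-in for a real httpx.AsyncClient.get() call.
    async with sem:
        await asyncio.sleep(0.01)  # simulate network latency
        return f"<html>content of {url}</html>"

async def collect(urls: list[str], max_concurrency: int = 5) -> list[str]:
    # Fetch all URLs concurrently, at most max_concurrency at a time.
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

pages = asyncio.run(collect([f"https://example.com/page{i}" for i in range(3)]))
print(len(pages))  # 3
```

asyncio.gather preserves input order, so results line up with the URL list even though fetches complete out of order.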

Development

Setup

git clone https://github.com/ianktoo/data-forge.git
cd data-forge
pip install -e ".[dev]"

Run tests

pytest

Code quality

# Linting
ruff check src/ tests/

# Type checking
mypy src/

Releasing New Versions

This project uses GitHub Actions for automated CI/CD:

Release process

  1. Update the version in pyproject.toml
  2. Commit your changes:
    git add .
    git commit -m "Bump version to X.Y.Z"
    git push
    
  3. Create and push a git tag:
    git tag vX.Y.Z
    git push origin vX.Y.Z
    

What happens automatically

  • Build: Cross-platform executables are built for Windows, macOS, and Linux
  • Release: Executables are attached to a GitHub Release
  • Publish: Package is published to PyPI as llm-web-crawler

Users can then:

  • Install via pip install llm-web-crawler
  • Or download standalone executables from Releases

Project Structure

data-forge/
├── src/dataforge/           # Main package
│   ├── cli/                 # CLI interface (typer + questionary)
│   ├── collectors/          # Web content collectors
│   ├── processors/          # Data processors
│   └── main.py              # Entry point
├── tests/                   # Test suite
├── .github/workflows/       # CI/CD automation
│   ├── build-executables.yml  # PyInstaller builds
│   └── publish-pypi.yml       # PyPI publishing
└── pyproject.toml           # Project metadata & dependencies

Dependencies

Core:

  • typer - CLI framework
  • rich - Terminal formatting
  • questionary - Interactive prompts
  • httpx - HTTP client
  • beautifulsoup4 - HTML parsing
  • litellm - LLM API abstraction
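
To make the "prepare data for fine-tuning" step concrete, here is a minimal sketch of turning one synthetic Q/A pair into a JSONL line in the chat-messages layout commonly used for fine-tuning. The function name and record shape are illustrative assumptions, not dataforge's actual output schema:

```python
import json

def to_finetune_record(source_text: str, question: str, answer: str) -> str:
    """Serialize one synthetic Q/A pair as a chat-format JSONL line."""
    record = {
        "messages": [
            {"role": "system", "content": "Answer using the provided document."},
            {"role": "user", "content": f"{question}\n\n{source_text}"},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record, ensure_ascii=False)

line = to_finetune_record(
    "Paris is the capital of France.",
    "What is the capital of France?",
    "Paris.",
)
print(line)
```

One such line per collected page, with the assistant turn generated through litellm, yields a fine-tuning-ready JSONL file.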

Data:

  • sqlmodel - Database ORM
  • pydantic - Data validation
  • datasets - Hugging Face datasets
  • pyarrow - Arrow/Parquet support

License

See LICENSE file for details.

Download files

Download the file for your platform.

Source Distribution

llm_web_crawler-0.2.1.tar.gz (52.1 kB)

Built Distribution

llm_web_crawler-0.2.1-py3-none-any.whl (56.0 kB)

File details

Details for the file llm_web_crawler-0.2.1.tar.gz.

File metadata

  • File: llm_web_crawler-0.2.1.tar.gz
  • Size: 52.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing: Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_web_crawler-0.2.1.tar.gz:

  • SHA256: dc5052815d5bf981cf86ee7142425d392095eeeb86dd7bed8883b8c5b0b8b8fd
  • MD5: f71c1f5c3913bec91dca19cda52c4a29
  • BLAKE2b-256: f78158ba54a79708fe89f02ee9c45c1e5ac176c5542d832946d9beb35e2fb24c

Provenance

The following attestation bundles were made for llm_web_crawler-0.2.1.tar.gz:

Publisher: publish-pypi.yml on ianktoo/data-forge

File details

Details for the file llm_web_crawler-0.2.1-py3-none-any.whl.

File hashes

Hashes for llm_web_crawler-0.2.1-py3-none-any.whl:

  • SHA256: 61f0eca8e0c26033645c19e45cd3ac85d5c4ebd5dea7e08c030e8b286daabd78
  • MD5: 1742540956e0429fb7fd1f3dc2195965
  • BLAKE2b-256: d1a81d5a2b83c92e5f65358fb33570271f9c269e4250b144a26dae1593c3cce7

Provenance

The following attestation bundles were made for llm_web_crawler-0.2.1-py3-none-any.whl:

Publisher: publish-pypi.yml on ianktoo/data-forge
