
LLM data collection and synthetic fine-tuning dataset pipeline


LLM Web Crawler

A cross-platform CLI tool that collects and processes web data into LLM fine-tuning datasets: it discovers URLs, fetches content, and prepares synthetic training data.

Installation

Via pip (recommended - all platforms)

pip install llm-web-crawler

Then run:

dataforge

From source

git clone https://github.com/ianktoo/data-forge.git
cd data-forge
pip install -e .
dataforge

Standalone executables (no Python required)

Download pre-built executables for your platform from GitHub Releases:

  • Windows: dataforge-windows-x64.exe
  • macOS: dataforge-macos-x64
  • Linux: dataforge-linux-x64

Download and run directly; no Python installation is required. On macOS and Linux, you may need to make the file executable first (chmod +x).

Features

  • URL Discovery: Automatically discovers URLs from sitemaps and robots.txt
  • Parallel Processing: Asynchronous collection of web content
  • LLM Integration: Prepare data for fine-tuning with LiteLLM support
  • Multi-format Output: Export datasets as JSON, Arrow, or Parquet
  • Cross-platform: Runs on Windows, macOS, and Linux
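URL discovery from a sitemap boils down to extracting the <loc> entries from the sitemap XML. A minimal standard-library sketch of the idea (illustrative only; this is not dataforge's internal API):

```python
# Sketch of sitemap-based URL discovery using only the standard library.
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

def discover_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> entry from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

urls = discover_urls(SITEMAP_XML)
print(urls)
```

In practice the sitemap location is usually found via the site's robots.txt (a `Sitemap:` directive), which is why the tool consults both.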

Usage

dataforge

The CLI provides an interactive interface for:

  • Configuring data sources
  • Setting collection parameters
  • Monitoring progress
  • Exporting datasets
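Behind these prompts, collection runs asynchronously so many pages can be fetched at once. A minimal sketch of that pattern with asyncio, using a stubbed fetch in place of a real httpx request:

```python
# Illustrative sketch of concurrent page collection with asyncio.
# The real tool performs HTTP with httpx; fetch() here is a stand-in.
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"<html>content of {url}</html>"

async def collect(urls: list[str]) -> list[str]:
    # Fetch all URLs concurrently instead of one at a time.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(collect(["https://example.com/a", "https://example.com/b"]))
print(len(pages))
```

asyncio.gather preserves input order, so results can be matched back to their source URLs.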

Development

Setup

git clone https://github.com/ianktoo/data-forge.git
cd data-forge
pip install -e ".[dev]"

Run tests

pytest

Code quality

# Linting
ruff check src/ tests/

# Type checking
mypy src/

Releasing New Versions

This project uses GitHub Actions for automated CI/CD:

Release process

  1. Update the version in pyproject.toml
  2. Commit your changes:
    git add .
    git commit -m "Bump version to X.Y.Z"
    git push
    
  3. Create and push a git tag:
    git tag vX.Y.Z
    git push origin vX.Y.Z
    

What happens automatically

  • Build: Cross-platform executables are built for Windows, macOS, and Linux
  • Release: Executables are attached to a GitHub Release
  • Publish: Package is published to PyPI as llm-web-crawler

Users can then:

  • Install via pip install llm-web-crawler
  • Or download standalone executables from Releases

Project Structure

data-forge/
├── src/dataforge/           # Main package
│   ├── cli/                 # CLI interface (typer + questionary)
│   ├── collectors/          # Web content collectors
│   ├── processors/          # Data processors
│   └── main.py              # Entry point
├── tests/                   # Test suite
├── .github/workflows/       # CI/CD automation
│   ├── build-executables.yml  # PyInstaller builds
│   └── publish-pypi.yml       # PyPI publishing
└── pyproject.toml           # Project metadata & dependencies

Dependencies

Core:

  • typer - CLI framework
  • rich - Terminal formatting
  • questionary - Interactive prompts
  • httpx - HTTP client
  • beautifulsoup4 - HTML parsing
  • litellm - LLM API abstraction
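The fetch-and-parse step that httpx and beautifulsoup4 cover can be sketched with only the standard library; this stand-in shows the shape of link extraction, not the project's actual parsing code:

```python
# Minimal link extraction with the stdlib html.parser module.
# dataforge itself uses beautifulsoup4, but the idea is the same.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag encountered.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

page = '<html><body><a href="/docs">Docs</a><a href="/api">API</a></body></html>'
parser = LinkCollector()
parser.feed(page)
print(parser.links)
```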

Data:

  • sqlmodel - Database ORM
  • pydantic - Data validation
  • datasets - Hugging Face datasets
  • pyarrow - Arrow/Parquet support
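For the JSON output path, fine-tuning records are typically serialized one JSON object per line (JSONL). A small sketch, with a hypothetical prompt/completion schema that is not necessarily dataforge's actual record format:

```python
# Writing collected examples as JSONL, one record per line.
# The prompt/completion field names here are illustrative only.
import json

records = [
    {"prompt": "Summarize: Example Domain", "completion": "A placeholder website."},
    {"prompt": "Summarize: Docs page", "completion": "Project documentation."},
]

jsonl = "".join(json.dumps(rec) + "\n" for rec in records)
print(jsonl, end="")
```

The same list of dicts can be handed to the `datasets` library and written out as Arrow or Parquet via pyarrow.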

License

See LICENSE file for details.



Download files

Download the file for your platform.

Source Distribution

llm_web_crawler-0.2.0.tar.gz (52.1 kB)


Built Distribution


llm_web_crawler-0.2.0-py3-none-any.whl (56.0 kB)


File details

Details for the file llm_web_crawler-0.2.0.tar.gz.

File metadata

  • Download URL: llm_web_crawler-0.2.0.tar.gz
  • Upload date:
  • Size: 52.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_web_crawler-0.2.0.tar.gz
Algorithm Hash digest
SHA256 139b4369fe3ead3c2fa05ee6429af0a3cb0a4ae6b4c6c2fdc2ea6ebf1c2affa5
MD5 90733e318fbaf06dc2bc6bb456568d32
BLAKE2b-256 11aea21ee5b3868af5863c607685ba9921839b1adac118e17a0f035204efdb61


Provenance

The following attestation bundles were made for llm_web_crawler-0.2.0.tar.gz:

Publisher: publish-pypi.yml on ianktoo/data-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_web_crawler-0.2.0-py3-none-any.whl.


File hashes

Hashes for llm_web_crawler-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 72f82cd518ae79613ca05225b81e4601de6a7a50735afdda35d937bb46ec795a
MD5 e26fbdc841d38bcc8502506bf9bf5692
BLAKE2b-256 d8c83c771a9af724ae893ee14c7e467b326ea5d620286597e2a7ecc045013b28


Provenance

The following attestation bundles were made for llm_web_crawler-0.2.0-py3-none-any.whl:

Publisher: publish-pypi.yml on ianktoo/data-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
