An autonomous text-to-dataset agent — turns any website or document corpus into a typed, research-grade dataset in minutes. Works with Gemma, Llama, Gemini, GPT and any OpenAI-compatible model.

These details have not been verified by PyPI

Project links

Project description

⛏ Gemma Miner

Turn any website or document corpus into a typed, research-grade dataset — in minutes, autonomously.

Gemma Miner is an autonomous agent that takes a one-sentence brief — "build me a stats-ready dataset of CNIL sanctions", "3 000 AI clinical trials", "every Hacker News 'Who is hiring' post mentioning RAG" — and produces a typed Parquet dataset with codebook, charts and HF-ready card.

It handles harvest → typed schema design → per-row extraction → self-verification → export to Parquet/CSV/HuggingFace in a single run. Works with Ollama (local Gemma 4 31B), OpenRouter, Together AI, Featherless — or any OpenAI-compatible endpoint.

Why it exists

Most "scrape this site" tools give you a JSON dump of raw fields. That's not a dataset — it's a starting point. Gemma Miner closes the loop:

Read the source (HTML, JSON API, PDF / DOCX / XLSX / archives).
Design a codebook of 20–60 typed analytical variables — booleans, enums, integers, dates — appropriate for the corpus.
Extract every row through the codebook with deterministic type coercion (dates → ISO, enums snapped to nearest valid value, booleans null-when-silent, no placeholder stuffing).
Self-verify before declaring done. If verification fails, retry with corrective feedback.
Export to Parquet + CSV + a Markdown codebook, and optionally push to the Hugging Face Hub with a one-line command.

The result is a dataset you can drop into pandas, DuckDB or scikit-learn without a second cleaning pass.

See it in action

Two real datasets built end-to-end by Gemma Miner — click through to read their cards and load them:

Dataset	Rows × Cols	Source	Try it
🇫🇷 CNIL Sanctions 2011-2025	374 × 34	cnil.fr	`load_dataset("moncefem/cnil-sanctions-2011-2025")`
🧬 Clinical Trials of AI 2000-2025	3 000 × 30	clinicaltrials.gov	`load_dataset("moncefem/clinical-trials-ai-2000-2025")`

Install

# core (most users want this)
uv pip install gemma-miner

# + parsers (PDF / DOCX / XLSX / EPUB / archives)
uv pip install "gemma-miner[parsers]"

# + Hugging Face push
uv pip install "gemma-miner[hf]"

# + analysis (pandas, matplotlib, numpy for the chart scripts)
uv pip install "gemma-miner[analysis]"

# everything
uv pip install "gemma-miner[all]"

Plain pip install gemma-miner works identically — uv is just faster.

First launch (Claude-Code style REPL)

gemma-miner

On first launch you get a setup wizard that walks you through:

Pick a provider (ollama / openrouter / together / featherless).
Paste an API key (or skip for Ollama — fully local).
Pick a default model. For Ollama, the wizard shows the live list of models you have installed (queried from /api/tags).

Your choice is saved to ~/.config/gemma-miner/config.toml (chmod 600). Switch any time with /config inside the REPL, or gemma-miner configure from the shell.

Inside the REPL:

Type / to see the live command palette (filters as you keep typing).
Just type plain English to start a run: "build me a dataset of the top 100 Hacker News stories with id, title, points, comments."
Multi-line prompts: end the first line with """ to open a heredoc, close with """.
The agent runs with a live Rich activity feed showing every phase and per-row extraction progress.

What the REPL knows how to do

Slash command	What it does
`/help`	Full help panel
`/config`	Re-run the provider + API-key setup wizard
`/datasets`	List datasets produced under `./runs/`
`/workdir [<path>]`	Show or change the base workdir
`/provider [<name>]`	Show or switch LLM provider (persisted)
`/model [<id>]`	Show or switch model (persisted per provider)
`/gemma-full-local`	Switch every phase to Ollama Gemma (auto-picks the largest installed Gemma)
`/resume <path>`	Resume a previous run — load its dataset + codebook + memory
`/push <repo_id>`	Push the last dataset to Hugging Face Hub
`/history`, `/clear`, `/trace`, `/quit`	Standard shell controls

After a run completes, the chat agent has the dataset in memory — ask follow-up questions like "which row had the most points?" or "summarise the breakdown by sector" and it answers from the data without triggering another scrape.

One-shot mode (no REPL)

# free-text prompt — Gemma Miner parses URL + count + fields automatically
gemma-miner "Build a dataset of every CNIL sanction from \
https://www.cnil.fr/fr/les-sanctions-prononcees-par-la-cnil with date, \
organisation type, breaches, decision text and 25 analytical variables."

# explicit flags for power users
gemma-miner run \
  --goal "Top 100 Hacker News stories" \
  --min-rows 100 \
  --required-fields rank,id,title,points \
  --unique-field id \
  --workdir ./runs/hn \
  --provider ollama \
  --model gemma4:31b

Python API

from gemma42 import (
    FieldsContract, MinRowsContract, UniqueFieldContract,
    make_llm, run_agent,
)

result = run_agent(
    goal=(
        "Build a dataset of the top 100 Hacker News stories using the public "
        "JSON API. Each row needs rank, id, title, domain, points."
    ),
    contracts=[
        MinRowsContract(min_rows=100),
        FieldsContract(required_fields=["rank", "id", "title", "points"]),
        UniqueFieldContract(field="id"),
    ],
    unique_key="id",
    workdir="./runs/hn",
    llm=make_llm("openrouter", model="google/gemini-3.1-flash-lite"),
)
print(result.dataset_path)

The Python module is still named gemma42 internally (the brand was previously gemma42); the PyPI package is gemma-miner. Both CLI commands (gemma-miner and gemma42) are equivalent.

Architecture

goal (one sentence)
    │
    ▼
┌─────────────────────────────────────────┐
│   AgentState (dataset + contracts +     │
│   memory + plan + workdir)              │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│   Phase machine (recomputed every turn  │
│   from observable state):               │
│                                         │
│   DISCOVER_LISTING → ENUMERATE →        │
│   DISCOVER_DETAIL → PROCESS →           │
│   CODEBOOK → EXTRACT → EXPORT → FINISH  │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│   One LLM call per turn → one tool      │
│   call (HTTP / HTML / Python / extract /│
│   codebook ops / dataset / queue / …)   │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│   Self-verification before finish.      │
│   On fail, re-enter the loop with the   │
│   verifier's feedback in the prompt.    │
└─────────────────────────────────────────┘

Key design choices:

One tool call per turn. Each step is auditable; the trace is a flat JSONL of decisions.
Re-rendered state brief every turn instead of chat history. No context drift, no stale observations.
Phase-narrowed tool list. The model sees 5–8 relevant tools per turn, not 30 — small models behave dramatically better this way.
Null-not-false discipline. Booleans are null when the source is silent. The system prompt forbids placeholder stuffing and the contract checks surface low-cardinality "constants" as evidence.
Deterministic IDs. Bronze (raw harvest) and silver (typed extraction) join by stable content-hash id — re-runs converge.
Hysteresis. Once the silver dataset is populated, the phase machine refuses to fall back into harvest for marginal gains.

Providers

Provider	What it gives you	Default model
Ollama	100 % local, no API key	`gemma4:31b` (wizard shows your installed models)
OpenRouter	Cheapest router for cloud models	`google/gemini-3.1-flash-lite`
Together AI	Fast OSS models	`google/gemma-4-31b-it`
Featherless	Serverless GPU for OSS models	`google/gemma-4-31b-it`
Anything else	Any OpenAI-compatible endpoint via `--base-url`	—

Run gemma-miner providers to print the full list.

Push to Hugging Face

After a run, push to a public dataset repo from inside the REPL:

› /push moncefem/my-cool-dataset
✓ uploaded → https://huggingface.co/datasets/moncefem/my-cool-dataset

Or from the shell:

gemma-miner export-hf ./runs/hn/dataset.jsonl --repo-id you/hn-top100

Needs HF_TOKEN (or HUGGINGFACE_HUB_TOKEN) in the environment and the hf extra installed.

Safety

The bash and python tools refuse destructive operations (rm, dd, mkfs, sudo, fork bombs, …) at the tool layer.
File operations are confined to the run's workdir.
The config file is chmod 600 so API keys aren't readable by other users on a shared machine.

Don't run agent code on production boxes — use a container or VM.

Contributing

Bugs, ideas, and pull requests welcome at https://github.com/moncifem/gemma-miner.

The test suite runs offline:

uv pip install -e ".[dev]"
pytest -q

License

Apache License 2.0.

If you use Gemma Miner in a paper, project, or product, attribution to the upstream source and to Gemma Miner is appreciated:

@software{elmouden_gemma_miner_2025,
  title  = {Gemma Miner: an autonomous text-to-dataset agent},
  author = {EL-Mouden, Moncif and contributors},
  year   = {2025},
  url    = {https://github.com/moncifem/gemma-miner},
}

⛏ Made with care by Moncif EL-Mouden. Powered by your favourite small open model.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.4

May 19, 2026

0.1.3

May 19, 2026

0.1.1

May 18, 2026

This version

0.1.0

May 18, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gemma_miner-0.1.0.tar.gz (170.5 kB view details)

Uploaded May 18, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

gemma_miner-0.1.0-py3-none-any.whl (191.9 kB view details)

Uploaded May 18, 2026 Python 3

File details

Details for the file gemma_miner-0.1.0.tar.gz.

File metadata

Download URL: gemma_miner-0.1.0.tar.gz
Upload date: May 18, 2026
Size: 170.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for gemma_miner-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e2cbc642d95299978b6b71af7785c5af444eae17482183b80b2519ce3a23deec`
MD5	`2ac195a55ed4dafcd7416a7efe88d8c1`
BLAKE2b-256	`6742840bd49ac5c0c97a9559e04329f9a44572a0483f150f8d66f99c8e18b4f7`

See more details on using hashes here.

File details

Details for the file gemma_miner-0.1.0-py3-none-any.whl.

File metadata

Download URL: gemma_miner-0.1.0-py3-none-any.whl
Upload date: May 18, 2026
Size: 191.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for gemma_miner-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d7e1f5528262872199c9a047971cffeffb45fb3e0bf5c1457c1e10c16151d454`
MD5	`0e4df3bbd380bb85fb7ac7eaebcb118b`
BLAKE2b-256	`35676a5841c563847b33c7c46b0999cbdf1a7b8262c1ac5b4e537d58cdcfb58a`

See more details on using hashes here.

gemma-miner 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

⛏ Gemma Miner

Why it exists

See it in action

Install

First launch (Claude-Code style REPL)

What the REPL knows how to do

One-shot mode (no REPL)

Python API

Architecture

Providers

Push to Hugging Face

Safety

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes