An autonomous text-to-dataset agent — turns any website or document corpus into a typed, research-grade dataset in minutes. Works with Gemma, Llama, Gemini, GPT and any OpenAI-compatible model.
Project description
⛏ Gemma Miner
Turn any website or document corpus into a typed, research-grade dataset — in minutes, autonomously.
Gemma Miner is an autonomous agent that takes a one-sentence brief — "build me a stats-ready dataset of CNIL sanctions", "3 000 AI clinical trials", "every Hacker News 'Who is hiring' post mentioning RAG" — and produces a typed Parquet dataset with codebook, charts and HF-ready card.
It handles harvest → typed schema design → per-row extraction → self-verification → export to Parquet/CSV/HuggingFace in a single run. Works with Ollama (local Gemma 4 31B), OpenRouter, Together AI, Featherless — or any OpenAI-compatible endpoint.
Why it exists
Most "scrape this site" tools give you a JSON dump of raw fields. That's not a dataset — it's a starting point. Gemma Miner closes the loop:
- Read the source (HTML, JSON API, PDF / DOCX / XLSX / archives).
- Design a codebook of 20–60 typed analytical variables — booleans, enums, integers, dates — appropriate for the corpus.
- Extract every row through the codebook with deterministic type coercion (dates → ISO, enums snapped to nearest valid value, booleans null-when-silent, no placeholder stuffing).
- Self-verify before declaring done. If verification fails, retry with corrective feedback.
- Export to Parquet + CSV + a Markdown codebook, and optionally push to the Hugging Face Hub with a one-line command.
The result is a dataset you can drop into pandas, DuckDB or scikit-learn without a second cleaning pass.
See it in action
Two real datasets built end-to-end by Gemma Miner — click through to read their cards and load them:
| Dataset | Rows × Cols | Source | Try it |
|---|---|---|---|
| 🇫🇷 CNIL Sanctions 2011-2025 | 374 × 34 | cnil.fr | load_dataset("moncefem/cnil-sanctions-2011-2025") |
| 🧬 Clinical Trials of AI 2000-2025 | 3 000 × 30 | clinicaltrials.gov | load_dataset("moncefem/clinical-trials-ai-2000-2025") |
Install
gemma-miner is a CLI tool, so the cleanest install is uv tool install
(or pipx install) — it puts a gemma-miner binary on your PATH inside an
isolated environment:
# recommended — installs as an isolated CLI tool
uv tool install gemma-miner
# with optional extras:
uv tool install "gemma-miner[parsers]" # PDF / DOCX / XLSX / EPUB / archives
uv tool install "gemma-miner[hf]" # huggingface_hub + datasets (for /push)
uv tool install "gemma-miner[analysis]" # pandas + matplotlib + numpy
uv tool install "gemma-miner[all]" # everything above
Then just run it from anywhere:
gemma-miner # interactive REPL
gemma-miner configure # re-run the setup wizard
gemma-miner --help # full command list
Alternative installs:
| Use case | Command |
|---|---|
| Try it once without installing | uv run --with gemma-miner gemma-miner |
| Add it as a library to an existing project | cd your-project && uv add gemma-miner |
| Plain pip (no uv) | pipx install gemma-miner |
First launch (Claude-Code style REPL)
gemma-miner
On first launch you get a setup wizard that walks you through:
- Pick a provider (
ollama/openrouter/together/featherless). - Paste an API key (or skip for Ollama — fully local).
- Pick a default model. For Ollama, the wizard shows the live list of
models you have installed (queried from
/api/tags).
Your choice is saved to ~/.config/gemma-miner/config.toml (chmod 600).
Switch any time with /config inside the REPL, or gemma-miner configure
from the shell.
Inside the REPL:
- Type
/to see the live command palette (filters as you keep typing). - Just type plain English to start a run: "build me a dataset of the top 100 Hacker News stories with id, title, points, comments."
- Multi-line prompts: end the first line with
"""to open a heredoc, close with""". - The agent runs with a live Rich activity feed showing every phase and per-row extraction progress.
What the REPL knows how to do
| Slash command | What it does |
|---|---|
/help |
Full help panel |
/config |
Re-run the provider + API-key setup wizard |
/datasets |
List datasets produced under ./runs/ |
/workdir [<path>] |
Show or change the base workdir |
/provider [<name>] |
Show or switch LLM provider (persisted) |
/model [<id>] |
Show or switch model (persisted per provider) |
/gemma-full-local |
Switch every phase to Ollama Gemma (auto-picks the largest installed Gemma) |
/resume <path> |
Resume a previous run — load its dataset + codebook + memory |
/push <repo_id> |
Push the last dataset to Hugging Face Hub |
/history, /clear, /trace, /quit |
Standard shell controls |
After a run completes, the chat agent has the dataset in memory — ask follow-up questions like "which row had the most points?" or "summarise the breakdown by sector" and it answers from the data without triggering another scrape.
One-shot mode (no REPL)
# free-text prompt — Gemma Miner parses URL + count + fields automatically
gemma-miner "Build a dataset of every CNIL sanction from \
https://www.cnil.fr/fr/les-sanctions-prononcees-par-la-cnil with date, \
organisation type, breaches, decision text and 25 analytical variables."
# explicit flags for power users
gemma-miner run \
--goal "Top 100 Hacker News stories" \
--min-rows 100 \
--required-fields rank,id,title,points \
--unique-field id \
--workdir ./runs/hn \
--provider ollama \
--model gemma4:31b
Python API
from gemma42 import (
FieldsContract, MinRowsContract, UniqueFieldContract,
make_llm, run_agent,
)
result = run_agent(
goal=(
"Build a dataset of the top 100 Hacker News stories using the public "
"JSON API. Each row needs rank, id, title, domain, points."
),
contracts=[
MinRowsContract(min_rows=100),
FieldsContract(required_fields=["rank", "id", "title", "points"]),
UniqueFieldContract(field="id"),
],
unique_key="id",
workdir="./runs/hn",
llm=make_llm("openrouter", model="google/gemini-3.1-flash-lite"),
)
print(result.dataset_path)
The Python module is still named
gemma42internally (the brand was previouslygemma42); the PyPI package isgemma-miner. Both CLI commands (gemma-minerandgemma42) are equivalent.
Architecture
goal (one sentence)
│
▼
┌─────────────────────────────────────────┐
│ AgentState (dataset + contracts + │
│ memory + plan + workdir) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Phase machine (recomputed every turn │
│ from observable state): │
│ │
│ DISCOVER_LISTING → ENUMERATE → │
│ DISCOVER_DETAIL → PROCESS → │
│ CODEBOOK → EXTRACT → EXPORT → FINISH │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ One LLM call per turn → one tool │
│ call (HTTP / HTML / Python / extract /│
│ codebook ops / dataset / queue / …) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Self-verification before finish. │
│ On fail, re-enter the loop with the │
│ verifier's feedback in the prompt. │
└─────────────────────────────────────────┘
Key design choices:
- One tool call per turn. Each step is auditable; the trace is a flat JSONL of decisions.
- Re-rendered state brief every turn instead of chat history. No context drift, no stale observations.
- Phase-narrowed tool list. The model sees 5–8 relevant tools per turn, not 30 — small models behave dramatically better this way.
- Null-not-false discipline. Booleans are
nullwhen the source is silent. The system prompt forbids placeholder stuffing and the contract checks surface low-cardinality "constants" as evidence. - Deterministic IDs. Bronze (raw harvest) and silver (typed extraction) join by stable content-hash id — re-runs converge.
- Hysteresis. Once the silver dataset is populated, the phase machine refuses to fall back into harvest for marginal gains.
Providers
| Provider | What it gives you | Default model |
|---|---|---|
| Ollama | 100 % local, no API key | gemma4:31b (wizard shows your installed models) |
| OpenRouter | Cheapest router for cloud models | google/gemini-3.1-flash-lite |
| Together AI | Fast OSS models | google/gemma-4-31b-it |
| Featherless | Serverless GPU for OSS models | google/gemma-4-31b-it |
| Anything else | Any OpenAI-compatible endpoint via --base-url |
— |
Run gemma-miner providers to print the full list.
Push to Hugging Face
After a run, push to a public dataset repo from inside the REPL:
› /push moncefem/my-cool-dataset
✓ uploaded → https://huggingface.co/datasets/moncefem/my-cool-dataset
Or from the shell:
gemma-miner export-hf ./runs/hn/dataset.jsonl --repo-id you/hn-top100
Needs HF_TOKEN (or HUGGINGFACE_HUB_TOKEN) in the environment and the
hf extra installed.
Safety
- The
bashandpythontools refuse destructive operations (rm,dd,mkfs,sudo, fork bombs, …) at the tool layer. - File operations are confined to the run's workdir.
- The config file is
chmod 600so API keys aren't readable by other users on a shared machine.
Don't run agent code on production boxes — use a container or VM.
Contributing
Bugs, ideas, and pull requests welcome at https://github.com/moncifem/gemma-miner.
The test suite runs offline:
uv pip install -e ".[dev]"
pytest -q
License
If you use Gemma Miner in a paper, project, or product, attribution to the upstream source and to Gemma Miner is appreciated:
@software{elmouden_gemma_miner_2025,
title = {Gemma Miner: an autonomous text-to-dataset agent},
author = {EL-Mouden, Moncif and contributors},
year = {2025},
url = {https://github.com/moncifem/gemma-miner},
}
⛏ Made with care by Moncif EL-Mouden. Powered by your favourite small open model.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file gemma_miner-0.1.3.tar.gz.
File metadata
- Download URL: gemma_miner-0.1.3.tar.gz
- Upload date:
- Size: 177.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
68af974576b30d04c96db84d53d2688aacdef6fe494af868139ab055f5d24361
|
|
| MD5 |
2846fa1b0ec040e7ef37610ea49aff3c
|
|
| BLAKE2b-256 |
0eb06ed49ffaa71784f4dc54751eb33c2ea34b94e497fc768377218d33065f5e
|
File details
Details for the file gemma_miner-0.1.3-py3-none-any.whl.
File metadata
- Download URL: gemma_miner-0.1.3-py3-none-any.whl
- Upload date:
- Size: 198.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d7024415e7ada80051caf8a8be4d1ea3aa82c3638d794e002a7d904ea5336f81
|
|
| MD5 |
804d6289221e146632002063578d3260
|
|
| BLAKE2b-256 |
716687ddf745b916776e5854d6d53cd16256905d75f0de4abcd3e932125322de
|