Skip to main content

An autonomous text-to-dataset agent — turns any website or document corpus into a typed, research-grade dataset in minutes. Works with Gemma, Llama, Gemini, GPT and any OpenAI-compatible model.

Project description

Gemma Miner — extract, analyze, discover

⛏ Gemma Miner

Turn any website or document corpus into a typed, research-grade dataset — in minutes, autonomously.

PyPI Python License HF Datasets

Gemma Miner is an autonomous agent that takes a one-sentence brief — "build me a stats-ready dataset of CNIL sanctions", "3 000 AI clinical trials", "every Hacker News 'Who is hiring' post mentioning RAG" — and produces a typed Parquet dataset with codebook, charts and HF-ready card.

It handles harvest → typed schema design → per-row extraction → self-verification → export to Parquet/CSV/HuggingFace in a single run. Works with Ollama (local Gemma 4 31B), OpenRouter, Together AI, Featherless — or any OpenAI-compatible endpoint.


Why it exists

Most "scrape this site" tools give you a JSON dump of raw fields. That's not a dataset — it's a starting point. Gemma Miner closes the loop:

  1. Read the source (HTML, JSON API, PDF / DOCX / XLSX / archives).
  2. Design a codebook of 20–60 typed analytical variables — booleans, enums, integers, dates — appropriate for the corpus.
  3. Extract every row through the codebook with deterministic type coercion (dates → ISO, enums snapped to nearest valid value, booleans null-when-silent, no placeholder stuffing).
  4. Self-verify before declaring done. If verification fails, retry with corrective feedback.
  5. Export to Parquet + CSV + a Markdown codebook, and optionally push to the Hugging Face Hub with a one-line command.

The result is a dataset you can drop into pandas, DuckDB or scikit-learn without a second cleaning pass.

See it in action

Two real datasets built end-to-end by Gemma Miner — click through to read their cards and load them:

Dataset Rows × Cols Source Try it
🇫🇷 CNIL Sanctions 2011-2025 374 × 34 cnil.fr load_dataset("moncefem/cnil-sanctions-2011-2025")
🧬 Clinical Trials of AI 2000-2025 3 000 × 30 clinicaltrials.gov load_dataset("moncefem/clinical-trials-ai-2000-2025")

Install

# core (most users want this)
uv pip install gemma-miner

# + parsers (PDF / DOCX / XLSX / EPUB / archives)
uv pip install "gemma-miner[parsers]"

# + Hugging Face push
uv pip install "gemma-miner[hf]"

# + analysis (pandas, matplotlib, numpy for the chart scripts)
uv pip install "gemma-miner[analysis]"

# everything
uv pip install "gemma-miner[all]"

Plain pip install gemma-miner works identically — uv is just faster.

First launch (Claude-Code style REPL)

gemma-miner

On first launch you get a setup wizard that walks you through:

  1. Pick a provider (ollama / openrouter / together / featherless).
  2. Paste an API key (or skip for Ollama — fully local).
  3. Pick a default model. For Ollama, the wizard shows the live list of models you have installed (queried from /api/tags).

Your choice is saved to ~/.config/gemma-miner/config.toml (chmod 600). Switch any time with /config inside the REPL, or gemma-miner configure from the shell.

Inside the REPL:

  • Type / to see the live command palette (filters as you keep typing).
  • Just type plain English to start a run: "build me a dataset of the top 100 Hacker News stories with id, title, points, comments."
  • Multi-line prompts: end the first line with """ to open a heredoc, close with """.
  • The agent runs with a live Rich activity feed showing every phase and per-row extraction progress.

What the REPL knows how to do

Slash command What it does
/help Full help panel
/config Re-run the provider + API-key setup wizard
/datasets List datasets produced under ./runs/
/workdir [<path>] Show or change the base workdir
/provider [<name>] Show or switch LLM provider (persisted)
/model [<id>] Show or switch model (persisted per provider)
/gemma-full-local Switch every phase to Ollama Gemma (auto-picks the largest installed Gemma)
/resume <path> Resume a previous run — load its dataset + codebook + memory
/push <repo_id> Push the last dataset to Hugging Face Hub
/history, /clear, /trace, /quit Standard shell controls

After a run completes, the chat agent has the dataset in memory — ask follow-up questions like "which row had the most points?" or "summarise the breakdown by sector" and it answers from the data without triggering another scrape.

One-shot mode (no REPL)

# free-text prompt — Gemma Miner parses URL + count + fields automatically
gemma-miner "Build a dataset of every CNIL sanction from \
https://www.cnil.fr/fr/les-sanctions-prononcees-par-la-cnil with date, \
organisation type, breaches, decision text and 25 analytical variables."

# explicit flags for power users
gemma-miner run \
  --goal "Top 100 Hacker News stories" \
  --min-rows 100 \
  --required-fields rank,id,title,points \
  --unique-field id \
  --workdir ./runs/hn \
  --provider ollama \
  --model gemma4:31b

Python API

from gemma42 import (
    FieldsContract, MinRowsContract, UniqueFieldContract,
    make_llm, run_agent,
)

result = run_agent(
    goal=(
        "Build a dataset of the top 100 Hacker News stories using the public "
        "JSON API. Each row needs rank, id, title, domain, points."
    ),
    contracts=[
        MinRowsContract(min_rows=100),
        FieldsContract(required_fields=["rank", "id", "title", "points"]),
        UniqueFieldContract(field="id"),
    ],
    unique_key="id",
    workdir="./runs/hn",
    llm=make_llm("openrouter", model="google/gemini-3.1-flash-lite"),
)
print(result.dataset_path)

The Python module is still named gemma42 internally (the brand was previously gemma42); the PyPI package is gemma-miner. Both CLI commands (gemma-miner and gemma42) are equivalent.

Architecture

goal (one sentence)
    │
    ▼
┌─────────────────────────────────────────┐
│   AgentState (dataset + contracts +     │
│   memory + plan + workdir)              │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│   Phase machine (recomputed every turn  │
│   from observable state):               │
│                                         │
│   DISCOVER_LISTING → ENUMERATE →        │
│   DISCOVER_DETAIL → PROCESS →           │
│   CODEBOOK → EXTRACT → EXPORT → FINISH  │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│   One LLM call per turn → one tool      │
│   call (HTTP / HTML / Python / extract /│
│   codebook ops / dataset / queue / …)   │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│   Self-verification before finish.      │
│   On fail, re-enter the loop with the   │
│   verifier's feedback in the prompt.    │
└─────────────────────────────────────────┘

Key design choices:

  • One tool call per turn. Each step is auditable; the trace is a flat JSONL of decisions.
  • Re-rendered state brief every turn instead of chat history. No context drift, no stale observations.
  • Phase-narrowed tool list. The model sees 5–8 relevant tools per turn, not 30 — small models behave dramatically better this way.
  • Null-not-false discipline. Booleans are null when the source is silent. The system prompt forbids placeholder stuffing and the contract checks surface low-cardinality "constants" as evidence.
  • Deterministic IDs. Bronze (raw harvest) and silver (typed extraction) join by stable content-hash id — re-runs converge.
  • Hysteresis. Once the silver dataset is populated, the phase machine refuses to fall back into harvest for marginal gains.

Providers

Provider What it gives you Default model
Ollama 100 % local, no API key gemma4:31b (wizard shows your installed models)
OpenRouter Cheapest router for cloud models google/gemini-3.1-flash-lite
Together AI Fast OSS models google/gemma-4-31b-it
Featherless Serverless GPU for OSS models google/gemma-4-31b-it
Anything else Any OpenAI-compatible endpoint via --base-url

Run gemma-miner providers to print the full list.

Push to Hugging Face

After a run, push to a public dataset repo from inside the REPL:

› /push moncefem/my-cool-dataset
✓ uploaded → https://huggingface.co/datasets/moncefem/my-cool-dataset

Or from the shell:

gemma-miner export-hf ./runs/hn/dataset.jsonl --repo-id you/hn-top100

Needs HF_TOKEN (or HUGGINGFACE_HUB_TOKEN) in the environment and the hf extra installed.

Safety

  • The bash and python tools refuse destructive operations (rm, dd, mkfs, sudo, fork bombs, …) at the tool layer.
  • File operations are confined to the run's workdir.
  • The config file is chmod 600 so API keys aren't readable by other users on a shared machine.

Don't run agent code on production boxes — use a container or VM.

Contributing

Bugs, ideas, and pull requests welcome at https://github.com/moncifem/gemma-miner.

The test suite runs offline:

uv pip install -e ".[dev]"
pytest -q

License

Apache License 2.0.

If you use Gemma Miner in a paper, project, or product, attribution to the upstream source and to Gemma Miner is appreciated:

@software{elmouden_gemma_miner_2025,
  title  = {Gemma Miner: an autonomous text-to-dataset agent},
  author = {EL-Mouden, Moncif and contributors},
  year   = {2025},
  url    = {https://github.com/moncifem/gemma-miner},
}

⛏ Made with care by Moncif EL-Mouden. Powered by your favourite small open model.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gemma_miner-0.1.0.tar.gz (170.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gemma_miner-0.1.0-py3-none-any.whl (191.9 kB view details)

Uploaded Python 3

File details

Details for the file gemma_miner-0.1.0.tar.gz.

File metadata

  • Download URL: gemma_miner-0.1.0.tar.gz
  • Upload date:
  • Size: 170.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for gemma_miner-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e2cbc642d95299978b6b71af7785c5af444eae17482183b80b2519ce3a23deec
MD5 2ac195a55ed4dafcd7416a7efe88d8c1
BLAKE2b-256 6742840bd49ac5c0c97a9559e04329f9a44572a0483f150f8d66f99c8e18b4f7

See more details on using hashes here.

File details

Details for the file gemma_miner-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: gemma_miner-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 191.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for gemma_miner-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d7e1f5528262872199c9a047971cffeffb45fb3e0bf5c1457c1e10c16151d454
MD5 0e4df3bbd380bb85fb7ac7eaebcb118b
BLAKE2b-256 35676a5841c563847b33c7c46b0999cbdf1a7b8262c1ac5b4e537d58cdcfb58a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page