An agent-native CLI for building balanced, duplicate-free synthetic datasets.
okgv - organizing knowledge: graphs and vectors
The standard way to generate synthetic datasets for training/fine-tuning ML models is to build a complex pipeline, which requires real design and engineering effort. A cheaper option is to drive an LLM through a coding harness or agents, but two requirements make this hard at scale: the dataset has to stay balanced, and it has to avoid near-duplicate instances. Both get harder as the instance count grows.
The reason is context. Suppose you want math questions spanning algebra and calculus, each rated easy, medium, or hard, balanced both across subjects (algebra vs. calculus) and across difficulty levels. The naive approach, asking the LLM to "ensure diversity", forces it to hold every previously generated instance in context to know what is still missing. Deduplication hits the same wall: spotting a near-duplicate requires comparing the candidate against all prior instances, which again means keeping them in context. Past a few hundred instances this becomes infeasible.
okgv moves that state out of the prompt and into storage. It models a dataset as a tree: each topic is a node, its sub-topics are its children, and every instance is an entry attached to a single topic node. Each entry is also stored as a vector embedding. The agent never has to remember what it generated; it queries the store instead.
This makes the agent work one topic at a time. It checks which topics are underrepresented to pick what to generate next, and before adding a new entry it measures the candidate against the entries already under that same topic. The closest matches come back with their full content, so the agent can decide whether the candidate is too similar to keep. The dataset never lives in the prompt, and the result stays easy to inspect.
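The gap-finding step can be pictured with the math example from earlier. The following is a minimal sketch in plain Python, not okgv's implementation: the `find_gaps` helper, the entry dictionaries, and the `target_per_cell` parameter are all hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import product


def find_gaps(entries, subjects, levels, target_per_cell):
    """Return ((subject, difficulty), shortfall) pairs, largest shortfall first.

    Hypothetical helper: it only illustrates that balance can be decided
    from counts alone, without reading any entry content.
    """
    counts = Counter((e["subject"], e["difficulty"]) for e in entries)
    gaps = []
    for cell in product(subjects, levels):
        shortfall = target_per_cell - counts[cell]
        if shortfall > 0:
            gaps.append((cell, shortfall))
    # sort() is stable, so cells with equal shortfall keep grid order
    gaps.sort(key=lambda item: -item[1])
    return gaps


entries = [
    {"subject": "algebra", "difficulty": "easy"},
    {"subject": "algebra", "difficulty": "easy"},
    {"subject": "calculus", "difficulty": "hard"},
]
gaps = find_gaps(
    entries,
    subjects=["algebra", "calculus"],
    levels=["easy", "medium", "hard"],
    target_per_cell=2,
)
for cell, shortfall in gaps:
    print(cell, "needs", shortfall, "more")
```

The point of the sketch is that the agent needs only a handful of counts to choose its next topic, however many entries the dataset already holds.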
Handing an agent full ownership of generation requires a degree of trust that isn't always warranted. For that reason, okgv also supports a review stage: entries can be inspected and approved or discarded, interactively through a TUI by a human, or via CLI commands by an agent prompted to act as the reviewer.
When to use okgv (and when not to)
okgv is not a vector database and not a large-scale curation pipeline. It is a thin, agent-native layer for building a dataset incrementally, where the generating agent makes the novelty and balance decisions in the loop. The design choices follow from that niche.
It is meant to be driven directly by a coding agent. You point the agent at the task; it reads okgv cli-prompt, then runs the generation loop itself through the CLI, finding gaps, checking novelty, submitting. You don't build an API-call pipeline, and the agent doesn't have to hold the growing dataset in its context to stay balanced and avoid duplicates, because that state lives in okgv and the agent queries it.
Use okgv when:
- An agent drives generation and you want it to decide, per candidate, whether a new entry is novel enough to keep, with the nearest existing entries surfaced as full-content context rather than reduced to a similarity score.
- The dataset is naturally hierarchical and must stay balanced across that hierarchy. The topic tree doubles as the balance stratum and the dedup scope.
- You want a human or a second agent to review generated entries before they ship.
- You want zero infrastructure: one portable SQLite file, no server, JSON in and out.
- You need the dataset to be an auditable artifact: inspectable, reviewable, with every submission recorded in the log and reversible with undo, and checkable for cross-table consistency with reconcile. Useful when you have to trust the data, such as eval sets or regulated domains.
- Each leaf topic stays bounded (roughly up to a few thousand entries). Similarity is scoped to the exact topic, so its cost tracks the per-topic count, not the total. The overall dataset can be large as long as individual leaf topics stay small. Where possible, group entries into finer sub-topics to keep each leaf small.
Reach for something else when:
- You need reproducible, deterministic dedup over a fixed corpus. okgv puts the keep/discard call in the agent's hands (see below), which is non-deterministic and costs an LLM call per candidate. If you want a repeatable cosine cutoff instead, a vector store with a metadata filter does that without okgv.
- Individual leaf topics grow very large (tens of thousands of entries each). sqlite-vec scores vectors by brute force; the per-topic filter bounds how many it scores, but only down to the leaf-topic size, so a single huge topic wants a real ANN index. Splitting it into finer sub-topics is often enough to stay within okgv; if the entries genuinely can't be partitioned, reach for dedicated tooling.
- The data has no meaningful hierarchy and balance doesn't matter (for example, a flat set of diverse paraphrases). The tree collapses to a single node, the balance machinery does nothing, and okgv degrades to a dedup wrapper you don't need.
- You want a full synthetic-data orchestration framework with provided generation steps and integrations. okgv deliberately stays narrower than that.
In short: okgv trades determinism and per-topic scale for agent-driven, in-the-loop control and zero setup. If that trade matches your workflow, it fits.
Why a guide, not a filter
A similarity threshold can answer one question: is this candidate too close to something we already have? It returns a number, and a number can reject but it cannot steer. The agent learns only that its attempt failed, not why, so the next attempt is a blind retry that may land in the same crowded region again.
okgv keeps the decision in the loop on purpose. Before a candidate is submitted, similar returns the nearest existing entries with their full content, not just a score. So "too similar" becomes "too similar to this specific entry," and the agent can generate deliberately away from it. A collision stops being a dead end and becomes direction for the next generation.
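To make the contrast with a bare score concrete, here is a rough sketch of a topic-scoped, brute-force nearest-neighbour lookup that returns content alongside the score. It is an illustration only: `nearest_in_topic`, the in-memory `store` list, and the three-dimensional toy vectors are assumptions, and okgv's own similar command (backed by sqlite-vec and a real embedding model) may work differently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_in_topic(candidate_vec, store, topic, k=3):
    """Score only the entries under `topic`; return (score, content) pairs, best first."""
    scored = [
        (cosine(candidate_vec, entry["vec"]), entry["content"])
        for entry in store
        if entry["topic"] == topic  # exact-topic scope: other topics are never scored
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy store; real embeddings would have hundreds of dimensions.
store = [
    {"topic": "algebra/basics", "content": "Solve 2x + 3 = 7", "vec": [1.0, 0.0, 0.0]},
    {"topic": "algebra/basics", "content": "Factor x^2 - 9", "vec": [0.0, 1.0, 0.0]},
    {"topic": "calculus/limits", "content": "Evaluate lim x->0 sin(x)/x", "vec": [1.0, 0.0, 0.0]},
]

hits = nearest_in_topic([0.9, 0.1, 0.0], store, "algebra/basics", k=2)
for score, content in hits:
    print(f"{score:.3f}  {content}")
```

Because each hit carries the text of the existing entry, the caller can see what it collided with and write the next candidate away from it, which a single number would not allow.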
A threshold is cheaper and deterministic, and for filtering a fixed corpus it is the right tool. But when the goal is to generate a balanced, diverse dataset, what matters is filling the gaps, and that needs feedback the agent can act on. Showing it the nearest existing entry turns each near-miss into a more informed next attempt.
Quickstart
pip install "okgv[embeddings]"
cd my-dataset-project
okgv init # generic scaffold to fill in
okgv init --template qa # or start from a worked preset
okgv init --list # see all presets
okgv init scaffolds a project you fill in (existing files are never overwritten). --template picks a starting point: default is the blank scaffold below; classification, qa, function-calling, rag, and paraphrase are worked, documented projects, one per shape in Dataset Patterns, so you can start from the one closest to your problem and edit. The files are the same in every preset, only their contents differ:
| File | What it is | You edit it… |
|---|---|---|
| .env | Config: schema specifier, DB path, embedding model, review mode | by hand, set OKGV_SCHEMA and EMBED_MODEL |
| config/schema.py | Entry schema template (MyEntrySchema): fields, validators, DB mapping | by hand, or hand to an agent with prompts/schema-guide.md |
| config/structure.json | Topic hierarchy as nested JSON ({} = leaf) | by hand, or hand to an agent with prompts/structure-prompt.md |
| generation-guide.md | The brief an agent reads to generate entries. Has a TODO goal for you to fill | by hand, describe your dataset's goal |
| prompts/schema-guide.md | Guide an agent follows to design config/schema.py with you | leave as-is |
| prompts/structure-prompt.md | Guide an agent follows to design config/structure.json with you | leave as-is |
| prompts/reviewer-prompt.md | Guide a reviewer agent follows to approve/reject queued entries | leave as-is |
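As a rough illustration of the nested-JSON convention for config/structure.json ({} marks a leaf), the sketch below flattens a small tree into slash-separated topic paths. The topic names are made up, `leaf_paths` is a hypothetical helper rather than part of okgv, and the sketch ignores any special keys (such as _meta) that a real structure file might carry.

```python
import json


def leaf_paths(tree, prefix=""):
    """Walk a nested dict and return the slash-joined path of every leaf ({} node)."""
    paths = []
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if children:
            paths.extend(leaf_paths(children, path))  # inner node: recurse
        else:
            paths.append(path)  # empty dict: this topic is a leaf
    return paths


# Example tree, invented for illustration.
structure = json.loads("""
{
  "algebra": {
    "linear_algebra": {"basics": {}},
    "polynomials": {}
  },
  "calculus": {
    "limits": {},
    "derivatives": {}
  }
}
""")

paths = leaf_paths(structure)
for p in paths:
    print(p)
```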
The three prompts/ files are the point of okgv: each hands one phase of the work to a coding agent. You don't write the schema, the structure, or the entries yourself; you point an agent at the matching guide and it drives the CLI.
1. Design the schema (what every entry looks like):
"read prompts/schema-guide.md and help me design my schema" # prompt for the agent
The agent interviews you about fields, constraints, and storage, then writes config/schema.py. Set OKGV_SCHEMA=config.schema:YourSchema in .env.
2. Design the topic tree (how entries are scoped and balanced):
"read prompts/structure-prompt.md and help me design my topic structure"
okgv create-structure --file config/structure.json # load the agreed tree into okgv.db
3. Generate entries (fill in generation-guide.md's goal first):
"read generation-guide.md and start generating"
The agent runs the loop itself: cli-prompt to learn the CLI, find gaps, check novelty with similar, submit. The dataset never lives in its context.
4. Review (optional, if OKGV_REVIEW=all or --review was used):
"read prompts/reviewer-prompt.md and review the pending queue" # agent reviewer
okgv review -i # or human TUI
5. Export for training:
okgv export --output dataset.jsonl --exclude-in-review
okgv export --output dataset.jsonl --split "train=0.8,val=0.1,test=0.1" # stratified splits
The export is one JSONL file, or one per split. --split divides each topic × balance-field stratum by the given fractions, so train/val/test all keep the dataset's distribution. --split-method shuffle (default) gives exact split sizes for a one-shot export; --split-method hash assigns each entry by a hash of its id, which keeps splits stable as the knowledge base grows and is re-exported (proportions hold only in expectation). okgv warns about cells too small to fill the smallest split (pass --strict to fail instead). Preview with --dry-run to see per-split counts, balance, and warnings before writing.
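The idea behind hash-based assignment can be sketched in a few lines. This is an assumption-laden illustration, not okgv's code: the choice of SHA-256, the `assign_split` helper, and the `entry-N` ids are all hypothetical, and the sketch assigns ids globally rather than within each stratum.

```python
import hashlib


def assign_split(entry_id, fractions):
    """Map an id to a split name via a stable hash.

    `fractions` is an ordered list of (name, fraction) pairs summing to 1.
    The same id always lands in the same split, however many other ids exist.
    """
    digest = hashlib.sha256(entry_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if u < cumulative:
            return name
    return fractions[-1][0]  # guard against floating-point rounding


fractions = [("train", 0.8), ("val", 0.1), ("test", 0.1)]
ids = [f"entry-{i}" for i in range(1000)]
splits = {entry_id: assign_split(entry_id, fractions) for entry_id in ids}

train_count = sum(1 for s in splits.values() if s == "train")
print("train share:", train_count / len(ids))
```

Adding new ids later never moves an existing id to another split, which is the stability property described above; the cost is that split sizes match the requested fractions only approximately.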
See examples/function-calling/ for a complete worked project: a filled-in schema, topic structure, and generation guide. More worked projects live under examples/.
How it works
Everything lives in one portable SQLite file (okgv.db): the topic tree, entries, their vector embeddings (via sqlite-vec), the submission log, and the review queue. No server, zero setup.
Topics form a path-identified tree (algebra/linear_algebra/basics). The tree is both the balance stratum and the dedup scope: counts and stats are recursive across descendants, but similarity search is scoped to the exact target topic, so its cost tracks the per-topic count, not the dataset total. An agent works one topic at a time, runs report (or least-topic for a quick single-level answer) to find gaps, and checks similar (full-content, not a score) before submitting.
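The difference between recursive counts and exact-topic similarity scope can be shown with path strings alone. The sketch below is illustrative only: `recursive_counts` and `exact_scope` are hypothetical helpers over an in-memory list, whereas okgv keeps this state in its SQLite file.

```python
from collections import Counter


def recursive_counts(entry_topics):
    """Count entries per topic path, crediting every ancestor of each entry's topic."""
    counts = Counter()
    for path in entry_topics:
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            counts["/".join(parts[:depth])] += 1
    return counts


def exact_scope(entry_topics, topic):
    """Indices of entries attached to exactly `topic` (descendants are excluded)."""
    return [i for i, path in enumerate(entry_topics) if path == topic]


# Topic of each stored entry, invented for illustration.
entry_topics = [
    "algebra/linear_algebra/basics",
    "algebra/linear_algebra/basics",
    "algebra/polynomials",
    "calculus/limits",
]

counts = recursive_counts(entry_topics)
print(counts["algebra"])                                           # counts include descendants
print(exact_scope(entry_topics, "algebra"))                        # nothing is attached to "algebra" itself
print(exact_scope(entry_topics, "algebra/linear_algebra/basics"))  # only this leaf's entries
```

Balance questions use the first function, so a parent topic reflects everything beneath it; similarity uses the second, so the number of vectors compared is bounded by the size of one topic.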
See Architecture & Internals for the details: topic structure, similarity scoping, session logging, and reliability.
Documentation
If you are designing a dataset, read Principles then Patterns first; they cover how to shape the tree before you touch the CLI. If you are driving okgv (or pointing an agent at it), Commands and Schema are the reference. The agent itself learns the CLI from okgv cli-prompt, not from these docs.
Design, how to shape a dataset:
| Doc | Contents |
|---|---|
| Design Principles | How to think about structural choices: how deep the tree, what makes a good partition, when a dimension is a branch vs a field, sizing leaves, the depth/balance/dedup trade-offs |
| Dataset Patterns | Worked dataset shapes (classification, Q&A, tool-use, RAG eval, paraphrase), where each rule belongs, when to use _meta and when not, choosing similarity scope |
Reference, how to drive it:
| Doc | Contents |
|---|---|
| Entry Schema & Configuration | Install, env vars, embedding backends, defining a schema, validators, field descriptions, balance fields |
| Commands | Full command reference, examples, agent workflow, error handling |
| Review System | Review modes, CLI and TUI workflows, review states |
| Architecture & Internals | Storage, topic structure, similarity scoping, session logging, reliability |