Standalone ingest-to-pgvector: source → chunker → embedder → extractor → table. int8 by default.

chunkshop (Python)

Reference implementation of the chunkshop ingest tool. v0.2.0, alpha.

New here? Start with the end-to-end tutorial — a guided walkthrough from zero (no Postgres) to a running semantic query.

This file is the field-by-field reference: every CLI flag, every YAML field, the troubleshooting table. Use it alongside the tutorial once you know what you're doing.

For the high-level shape and mermaid diagram, see the top-level README.

Install

From source (recommended while alpha):

cd chunkshop/python
uv sync --extra dev

As a path dependency from another project:

[tool.uv.sources]
chunkshop = { path = "../chunkshop/python", editable = true }

Optional extras:

| Extra | What you get |
| --- | --- |
| extractors | rake-nltk + nltk for the RAKE extractor. |
| keybert | keybert + sentence-transformers for the keybert_phrases extractor. |
| spacy | spacy for the spacy_entities NER extractor. |
| lang | langdetect for the lang_detect extractor. |
| nlp | Umbrella: keybert + spacy + lang in one install. |
| lede | Sibling extractive_summary repo as a path dep — enables summary_embed with lede.tfidf.summarize. |
| sumy | sumy + NLTK corpora for the sumy adapter shim (chunkshop.summarizers.sumy). |
| quantize | onnx for on-the-fly quantization scratch. |
| dev | pytest, pytest-asyncio, onnx. |

Python ≥ 3.12 required.

Prerequisites

  • Postgres ≥ 14 with the pgvector extension installed (CREATE EXTENSION vector; must succeed in your target DB).
  • Disk space for model cache in ~/.cache/fastembed/ — ~85 MB for int8 bge-base, ~550 MB for nomic.
  • An env var holding your DSN. The target config references it by name, not by value.

Quick run

export CHUNKSHOP_DSN="postgresql://postgres:postgres@localhost:5432/mydb"

# Point at the sample corpus in docs/samples/ for a real end-to-end run:
chunkshop ingest --config ../docs/samples/sample.yaml

# Or copy the template and edit it:
cp src/chunkshop/configs/example-files-to-bge.yaml my-cell.yaml
chunkshop ingest --config my-cell.yaml

Success looks like:

{
  "cell_name": "example_files",
  "docs_processed": 47,
  "chunks_written": 312,
  "wall_seconds": 18.4,
  "error": null
}

CLI

Two subcommands: ingest (one cell) and orchestrate (many cells in parallel).

chunkshop ingest

Runs one YAML end-to-end.

chunkshop ingest --config PATH [--doc-limit N] [--log PATH] [--omp-threads N]

| Flag | YAML override | Purpose |
| --- | --- | --- |
| -c, --config | — | Required. Path to YAML. |
| --doc-limit | runtime.doc_limit | Smoke-test mode; stop after N docs. |
| --log | runtime.log_path | Append stdout log lines to this file. |
| --omp-threads | runtime.omp_num_threads | Cap BLAS/OMP threads before ORT loads. |

Exit code: 0 on success, 1 if the cell errored. Stdout = a JSON summary.

chunkshop orchestrate

Runs N cells in parallel as subprocesses.

chunkshop orchestrate (--config-dir DIR | --config PATH [--config PATH ...])
                      [--concurrency N]
                      [--checkpoints "60,120,300,600"]
                      [--timeout SECONDS]
                      [--smoke | --full]

| Flag | Default | Purpose |
| --- | --- | --- |
| -d, --config-dir | — | Run every *.yaml/*.yml in the directory. |
| -c, --config | — | Explicit path; repeatable. Mutually exclusive with --config-dir. |
| --concurrency | 4 | Max parallel cells (subprocess pool size). |
| --checkpoints | 60,120,300,600 | Seconds at which to print a status report. |
| --timeout | 7200 (2 h) | Overall wall limit; survivors get SIGTERM to their process group. |
| --smoke | off | Force doc_limit=1 + concurrency=1. Useful for "does it crash". |

Stdout = checkpoint reports during the run, JSON summary at the end.

chunkshop bakeoff

Runs a chunker × embedder matrix against a corpus with hand-written gold queries, scores recall@k + MRR per combo, writes a leaderboard + a runnable recommended.yaml. Config-driven — the matrix lives in YAML, not on the command line.

chunkshop bakeoff --config PATH [--dsn DSN] [--yes] [--keep-schema]

| Flag | Default | Purpose |
| --- | --- | --- |
| --config | — | Path to the bakeoff YAML. Required. |
| --dsn | $CHUNKSHOP_DSN | Postgres DSN. Required (env var or flag). |
| --yes | off | Bypass the >50-cell matrix confirmation prompt. |
| --keep-schema | off | Keep the bakeoff schema after the run — useful for debugging. |

Outputs land in skill-output/bakeoff/{name}/:

  • results.json — raw per-combo + per-query data.
  • report.md — leaderboard sorted by MRR, per-query detail, statistical-power caveat.
  • recommended.yaml — top combo pre-filled as a runnable chunkshop ingest cell.

Full walkthrough: ../docs/tutorial-bakeoff.md. Recipe card: ../docs/quickstart-bakeoff.md.

YAML reference

Every cell config has five sections plus an optional runtime. Extra keys are rejected (extra="forbid" in pydantic), so typos fail loudly.

cell_name: my_cell
source:   { ... }
chunker:  { ... }
embedder: { ... }
extractor: { ... }   # optional, defaults to {type: none}
target:   { ... }
runtime:  { ... }    # optional, sensible defaults below
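
Putting the sections together, a minimal cell might look like this (a sketch — the paths, schema, and table names are illustrative, and defaults are omitted):

```yaml
cell_name: notes_to_bge
source:
  type: files
  glob: "./notes/**/*.md"
chunker:
  type: sentence_aware
  doc_type: prose
embedder:
  type: fastembed
  model_name: Xenova/bge-base-en-v1.5-int8
  dim: 768
target:
  dsn_env: CHUNKSHOP_DSN
  schema: notes
  table: chunks
```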

source

| type | Required fields | Optional fields |
| --- | --- | --- |
| files | glob | id_from: path \| stem \| sha1 (default stem), encoding (utf-8) |
| json_corpus | path | documents_key (documents), id_field (id), content_field (content), title_field (title) |
| pg_table | dsn_env, schema, table, id_column, content_column | title_column, where |
| http | urls or sitemap | — (stub today) |
| s3 | bucket | prefix (stub today) |

chunker

Seven chunkers in three families. Pick one per cell.

Structural — split on headings, paragraphs, or word counts:

| type | Required | Defaults |
| --- | --- | --- |
| sentence_aware | — | doc_type: prose (or code), max_chars: 2000, min_chars: 200 |
| fixed_overlap | — | window_words: 300, step_words: 150 |
| hierarchy | — | prefix_heading: true, min_section_chars: 100, max_chars: 2000 |
| neighbor_expand | base: (nested chunker) | window: 1 |
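
Since neighbor_expand wraps another chunker, its YAML nests the base config. A sketch (field values are illustrative):

```yaml
chunker:
  type: neighbor_expand
  window: 1          # pull in 1 neighboring chunk on each side
  base:
    type: sentence_aware
    doc_type: prose
```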

Semantic — splits on embedding-drift boundaries (no heading needed):

| type | Required | Defaults |
| --- | --- | --- |
| semantic | — | boundary_model: "sentence-transformers/all-MiniLM-L6-v2-int8", breakpoint_percentile: 95, min_sentences_per_chunk: 3, max_chunk_chars: 2000, sentence_splitter: "naive" |

Pass boundary_model: "same" to reuse the cell's main embedder (trades speed for memory). See ../docs/tutorial-semantic.md.

Summary-layer — wrap any base chunker and change what gets embedded vs. what gets stored (summary_embed) or emit fine+coarse rows linked by group_id (hierarchical_summary):

| type | Required | Defaults |
| --- | --- | --- |
| summary_embed | base:, summarizer: | — |
| hierarchical_summary | base:, summarizer: | grouping: {strategy: fixed_n, n: 5} |

The summarizer config is a discriminated union: {mode: external, field: ...} pulls a pre-computed summary from a source document metadata field; {mode: callable, module: "lede.tfidf", function: "summarize", kwargs: {...}} imports lazily at first use; {mode: passthrough} reuses the raw chunk as the summary (baseline). See ../docs/summaries.md and ../docs/tutorial-summaries.md.
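
For instance, a summary_embed cell using the callable mode might look like this (a sketch built from the modes above; the kwargs shown are hypothetical and depend on the summarizer's own signature):

```yaml
chunker:
  type: summary_embed
  base:
    type: hierarchy
  summarizer:
    mode: callable
    module: "lede.tfidf"
    function: "summarize"
    kwargs: {max_sentences: 3}   # hypothetical — whatever the callable accepts
```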

Full per-chunker guidance: ../docs/chunkers.md.

embedder

Only fastembed today.

| Field | Required | Default | Notes |
| --- | --- | --- | --- |
| type | yes | — | Literal fastembed. |
| model_name | yes | — | e.g. Xenova/bge-base-en-v1.5-int8. See embedders.md. |
| dim | yes | — | Must match the model. Mismatch fails loudly at first embed. |
| batch_size | no | 64 | Per-call batch to fastembed.embed. |
| threads | no | None | None = auto (bad on shared boxes). Set to 4 typically. |

extractor

| type | Fields |
| --- | --- |
| none | — (default) |
| rake_keywords | top_k: 10, min_chars: 3 (defaults) |

RAKE downloads NLTK corpora (stopwords, punkt) on first use to ~/nltk_data/.

target

| Field | Required | Default | Notes |
| --- | --- | --- | --- |
| dsn_env | no | AGE_BAKEOFF_PGRG_DSN | Name of the env var holding your DSN. Override this to CHUNKSHOP_DSN in your configs. |
| schema | yes | — | Lowercase ident; must match ^[a-z_][a-z0-9_]*$. Created if missing. |
| table | yes | — | Same ident rule. |
| mode | no | overwrite | One of overwrite, append, create_if_missing. See ../docs/tutorial-multi-source.md. |
| source_tag | when mode=append | null | Ident-safe tag written to every row's source column. Required for append; optional (but recommended) for overwrite/create_if_missing. |
| promote_metadata | no | [] | List of {path, type} pairs lifting jsonb metadata paths into typed columns. The path is lowercased, with dots becoming __ in the column name. |
| force_overwrite | no | false | Bypasses the "refuse to drop a table that holds rows from a foreign source_tag" safety check in overwrite mode. |
| overwrite (soft-deprecated) | no | false | Legacy boolean. Still honored when mode=overwrite (acts as the DROP+CREATE switch). Prefer the new mode field for new configs. |
| hnsw | no | true | false for tiny test tables where HNSW is slower than a seq scan. |
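
A sketch of promote_metadata (the metadata paths and the type strings here are hypothetical — use whatever paths your documents actually carry):

```yaml
target:
  dsn_env: CHUNKSHOP_DSN
  schema: mydata
  table: chunks
  promote_metadata:
    - {path: "doc.lang", type: "text"}   # lands in a column named doc__lang
    - {path: "doc.year", type: "int"}
```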

Multi-source ingest

Multiple cells can write to the same table by tagging each cell's rows with a source_tag. Cell A creates the table with mode: create_if_missing, Cell B appends with mode: append and its own tag. Queries filter or group by the source column. See ../docs/tutorial-multi-source.md for the end-to-end walkthrough.

target:
  dsn_env: CHUNKSHOP_DSN
  schema: mydata
  table: all_docs
  mode: append
  source_tag: support_tickets
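
Once both cells have written, queries can filter or group on the tag. A sketch (the table name matches the config above; only the source column is guaranteed by this README):

```sql
-- Per-source row counts after both cells have run
SELECT source, count(*) FROM mydata.all_docs GROUP BY source;

-- Restrict a query to one cell's rows
SELECT count(*) FROM mydata.all_docs WHERE source = 'support_tickets';
```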

runtime

| Field | Default | Notes |
| --- | --- | --- |
| omp_num_threads | 1 | Sets OMP/MKL/OPENBLAS/NUMEXPR env vars before ORT loads. |
| doc_limit | null | Stop after N docs. Smoke-test lever. |
| log_path | null | Mirror stdout heartbeats to this file. Parent dirs auto-created. |
| heartbeat_every | 25 | Log a progress line every N docs. |

Environment variables

| Var | When chunkshop reads it |
| --- | --- |
| $<target.dsn_env> (default AGE_BAKEOFF_PGRG_DSN) | At sink construction; must be a valid libpq DSN. |
| OMP_NUM_THREADS and friends | Set by the runner before any numpy/ORT import. |
| HF_HOME / HF_HUB_CACHE | Respected by fastembed's downloader if you've moved the cache. |

Troubleshooting

"no files matched glob: /path/**/*.md"

Your source.glob didn't match anything. Test it in a shell first:

ls /path/**/*.md | head

Note that chunkshop uses Python's glob.glob(..., recursive=True): ** only matches across directories when it's its own path component (/foo/**/*.md, not /foo/**.md).
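
A quick self-contained check of that ** behavior (pure stdlib, no chunkshop needed):

```python
import glob
import os
import tempfile

# Build a tiny tree: top.md, a/mid.md, a/b/deep.md
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.md", os.path.join("a", "mid.md"), os.path.join("a", "b", "deep.md")):
        open(os.path.join(root, rel), "w").close()

    # ** as its own component recurses (and matches zero directories too)
    hits = glob.glob(os.path.join(root, "**", "*.md"), recursive=True)

    # ** fused into a component degrades to a plain *, one directory deep
    flat = glob.glob(os.path.join(root, "**.md"), recursive=True)

print(len(hits))  # 3 — all three files
print(len(flat))  # 1 — only top.md
```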

"relation already exists" on second run

target.overwrite is false by default. Either flip it to true (drops + recreates) or drop the table yourself. The ON CONFLICT DO UPDATE in the writer will also happily upsert into an existing table.

"model X produced dim Y, config says dim=Z"

Your YAML's embedder.dim doesn't match the model's output. Look up the right dim in ../docs/embedders.md: bge-small=384, bge-base=768, nomic=768.

"CREATE EXTENSION IF NOT EXISTS vector" fails with permission denied

Your DB role can't create extensions. Ask a superuser to run it once per database:

CREATE EXTENSION IF NOT EXISTS vector;

Then re-run chunkshop — the sink's CREATE EXTENSION IF NOT EXISTS will be a no-op.

"table/schema must match ^[a-z_][a-z0-9_]*$"

chunkshop refuses to interpolate mixed-case or quoted identifiers — SQL injection safety via allowlist. Lowercase your schema and table.
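
The allowlist amounts to a one-line regex check; a minimal sketch (the function name here is illustrative, not chunkshop's API):

```python
import re

# Same pattern the error message quotes: lowercase start, then lowercase/digits/underscores
IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")

def safe_ident(name: str) -> bool:
    """True if the identifier can be interpolated into SQL without quoting."""
    return bool(IDENT.match(name))

print(safe_ident("all_docs"))   # True
print(safe_ident("AllDocs"))    # False — mixed case
print(safe_ident("2024_docs"))  # False — leading digit
```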

Ingest is slow and my CPU fans are loud

Three knobs. Pick one:

  • Drop embedder.batch_size from 64 to 32 — less memory pressure, slower per-doc.
  • Set embedder.threads: 4 (or 2) — caps ORT's worker pool.
  • If running under orchestrate, reduce --concurrency.

See the thread-tuning table in ../docs/embedders.md.

First run hangs on "downloading model"

Fastembed is pulling the ONNX model from HuggingFace, so a hang here usually means a network or HF outage. Check curl -sI https://huggingface.co/ and your proxy settings. The file lands in ~/.cache/fastembed/<model-name>/.

nltk errors on first rake_keywords run

The extractor downloads stopwords, punkt, punkt_tab into ~/nltk_data/ on first use. Behind a strict firewall? Pre-download once:

import nltk
for r in ("stopwords", "punkt", "punkt_tab"):
    nltk.download(r)

Using chunkshop as a library

from chunkshop import load_config
from chunkshop.runner import run_cell

cfg = load_config("my-cell.yaml")
result = run_cell(cfg)
print(result.docs_processed, result.chunks_written, result.wall_seconds)

Or skip the YAML and build a CellConfig directly — every section is a plain pydantic model.

Tests

cd python
uv run pytest

Most tests are offline. test_embedder_fastembed.py and test_int8_registry.py download the int8 bge-base model on first run and cache it — budget ~85 MB + a few seconds the first time.
