BatchFactory
Composable, cache‑aware pipelines for parallel LLM workflows, API calls, and dataset generation.
Status — v0.2 alpha. Stable enough for prototypes; expect fast‑moving APIs.
Install
```bash
pip install batchfactory            # latest tag
pip install --upgrade batchfactory  # grab the newest patch
```
Quick‑start
```python
import batchfactory as bf
from batchfactory.op import *

project = bf.CacheFolder("quickstart", 1, 0, 0)
broker = bf.brokers.ConcurrentLLMCallBroker(project["cache/llm_broker.jsonl"])

# Rewrite the first three passages of every *.txt file into four-line poems.
g = (
    ReadMarkdownLines("./data/*.txt", key_field="keyword", directory_str_field="directory")
    | Shuffle(42)
    | TakeFirstN(3)
    | GenerateLLMRequest(
        'Rewrite the passage from "{directory}" titled "{keyword}" as a four-line poem.',
        model="gpt-4o-mini@openai",
    )
    | ConcurrentLLMCall(project["cache/llm_call.jsonl"], broker)
    | ExtractResponseText()
    | WriteJsonl(project["out/poems.jsonl"], output_fields=["keyword", "text", "directory"])
    | Print()
)

g.compile().execute(dispatch_brokers=True)
```
Run it twice – everything after the first run is served from the on‑disk ledger.
Why BatchFactory? Three killer moves
| 📦 Mass data distillation & cleanup | 🎭 Multi‑agent, multi‑round workflows | 🔥 Hierarchical spawning for long text |
|---|---|---|
| Chain `GenerateLLMRequest` → `ConcurrentLLMCall` → `ExtractResponseText` behind keyword or file sources to bulk‑create, filter, or refine datasets (think millions of Q&A rows, code explanations, translation pairs) with caching and cost tracking built in. | `Repeat` plus chat helpers let you spin up translation swarms, code‑review pairs, or tutoring agents in five minutes of code: conversations live in `chat_history`; cost and revisions are tracked automatically. | `SpawnFromList` explodes complex items into fine‑grained subtasks, runs them in parallel, then `CollectAllToList` stitches the results back together – perfect for beat → scene → arc analysis or any long, messy document pipeline. |
Loop snippet (Role‑Playing)
```python
Teacher = Character("teacher_name", TEACHER_PROMPT)
Student = Character("student_name", STUDENT_PROMPT)

g = ( ReadMarkdownLines("story.txt", "keyword")
    | SetField({"teacher_name": "Alice", "student_name": "Bob"})
    | Teacher("Teacher, please explain the passage first", 0)
    | Repeat( Student("Student asks a question or answers", 1)
            | Teacher("Respond or continue the explanation", 2), 3)
    | Teacher("Please summarize", 3)
    | ChatHistoryToText() | Print() )
```
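For intuition, `Repeat`-style composition can be mimicked with plain functions. This is a toy illustration, not BatchFactory's actual implementation; `pipe`, `repeat`, and the two fake agents below are hypothetical stand-ins:

```python
from functools import reduce

def pipe(*ops):
    """Compose single-argument ops left to right, like chaining with `|`."""
    return lambda x: reduce(lambda acc, op: op(acc), ops, x)

def repeat(op, n):
    """Apply a composed op n times, like Repeat(op, n)."""
    return lambda x: reduce(lambda acc, _: op(acc), range(n), x)

# Fake agents: each turn appends one message to the chat history.
student = lambda history: history + ["student: question"]
teacher = lambda history: history + ["teacher: answer"]

# Opening turn, three student/teacher rounds, then a summary turn.
lesson = pipe(teacher, repeat(pipe(student, teacher), 3), teacher)
transcript = lesson([])
# 1 opening + 3 rounds x 2 turns + 1 summary = 8 turns
```

In the real graph, `Repeat` plays the same structural role: it wraps a sub-pipeline and feeds each entry through it a fixed number of rounds.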
Spawn snippet (chapter → paragraph → chapter synopsis)
```python
project = bf.CacheFolder("spawn_demo", 1, 0, 0)
broker = bf.brokers.ConcurrentLLMCallBroker(project["cache/llm.jsonl"])

g = ( ReadMarkdownLines("novel/*.md", "chapter")            # each entry = a chapter
    | SpawnFromList("paragraphs", "para")                   # fan-out per paragraph
    | GenerateLLMRequest("Summarise:\n{para}", model="gpt-4o-mini@openai")
    | ConcurrentLLMCall(project["cache/para_sum.jsonl"], broker)
    | CollectAllToList("text", "chapter_summaries")         # wait until ALL paras are done
    | GenerateLLMRequest("Chapter synopsis:\n{chapter_summaries}",
                         model="gpt-4o-mini@openai")
    | ConcurrentLLMCall(project["cache/ch_sum.jsonl"], broker) )
```

(Pseudocode – see the examples for implementation details.)
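The fan-out/fan-in shape above can be sketched with ordinary concurrency primitives. A toy sketch, not BatchFactory's scheduler; `summarise` and `chapter_synopsis` are hypothetical stand-ins for the LLM nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def summarise(para: str) -> str:
    """Stand-in for an LLM call summarising one paragraph."""
    return para.split(".")[0]  # pretend the first sentence is the summary

def chapter_synopsis(chapter: list[str]) -> str:
    # Fan-out: one subtask per paragraph, run in parallel.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarise, chapter))
    # Fan-in: only after ALL paragraph summaries exist, reduce to one synopsis.
    return " / ".join(summaries)

chapter = ["First one. More detail.", "Second one. Extra detail."]
synopsis = chapter_synopsis(chapter)
# "First one / Second one"
```

`SpawnFromList`/`CollectAllToList` add what this sketch lacks: unique child ids, on-disk caching of each subtask, and resumability.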
Core concepts (one‑liner view)
| Term | Story in one sentence |
|---|---|
| Entry | Tiny record with an immutable `idx`, mutable `data`, and an auto‑incrementing `rev`. |
| Op | Atomic node; compose with `\|`. |
| GraphSegment | Lightweight wrapper: `.to_segment()` wraps a node so segments chain with `\|`. |
| execute() | High‑level driver that resumes, pumps, and dispatches brokers. |
| Broker | Pluggable engine handling expensive / async jobs (LLM, search, human labels). |
| Ledger | Append‑only JSONL cache behind every broker & graph, enabling instant restart. |

(You can call `pump()` manually, but 99 % of users stick to `execute()`.)
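The `Entry` contract above can be sketched as a small dataclass. An illustrative sketch following the table's description, not BatchFactory's actual class; the `update` helper is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """Toy Entry: stable idx, mutable data, revision bumped on every change."""
    idx: str                              # identity, never changes after creation
    data: dict = field(default_factory=dict)
    rev: int = 0                          # auto-incremented on each update

    def update(self, **changes) -> None:
        """Mutate data in place and bump the revision counter."""
        self.data.update(changes)
        self.rev += 1

e = Entry(idx="doc-001")
e.update(keyword="sunrise")
e.update(text="a four-line poem")
# e.rev is now 2; e.idx is unchanged
```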
Primitive index (short list)
| Family | Node | Blurb |
|---|---|---|
| Sources | `ReadMarkdownLines`, `FromList` | Ingest files or raw dicts. |
| Transforms | `Apply`, `Filter`, `SetField` | Python‑powered field tweaks. |
| Spawn / Collect | `SpawnFromList`, `CollectAllToList` | Map‑reduce with unique child ids. |
| Control flow | `If`, `While`, `Repeat` | Branch, loop, iterate. |
| LLM | `GenerateLLMRequest` → `ConcurrentLLMCall` → `ExtractResponseText` | Prompt, call, harvest. |
| Utilities | `CleanupLLMData`, `PrintTotalCost` | Tidy temporary fields, audit cost. |
(Shared‑idx ops `Replicate` / `Collect` are deprecated and will be removed in v0.3.)
Example gallery
| ✨ Example | Demonstrates |
|---|---|
| 01 – Basic pipeline | linear LLM transform & caching |
| 02 – Role‑playing loop | concise multi‑agent RPG using Repeat + chat helpers |
| 03 – Split & summarise | fan‑out/fan‑in summarisation (deprecated style) |
| 04 – Long‑text segmentation | Spawn + CollectAll power pattern |
| 05 – Math ops (unit) | loop + conditional logic under pure Python |
Broker & cache highlights
- Each expensive call is hashed to a `job_idx` — duplicate prompts are free.
- `BrokerFailureBehavior = RETRY | STAY | EMIT` lets you decide how failures propagate.
- On restart, `execute()` reuses cached results and sends only the missing jobs.
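The dedup idea can be sketched in a few lines: hash the request to a stable id, and consult an append-only ledger before dispatching. An illustrative sketch with hypothetical names (`job_idx`, `call_llm`), not BatchFactory's internals:

```python
import hashlib
import json

def job_idx(request: dict) -> str:
    """Stable hash of a request; identical prompts map to the same id."""
    canonical = json.dumps(request, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

ledger: dict[str, str] = {}   # in-memory stand-in for the on-disk JSONL ledger

def call_llm(request: dict) -> str:
    idx = job_idx(request)
    if idx in ledger:                        # cache hit: duplicate prompt is free
        return ledger[idx]
    result = f"<response to {request['prompt']!r}>"  # placeholder for a real API call
    ledger[idx] = result                     # recorded, so a restart skips this job
    return result

a = call_llm({"prompt": "Summarise chapter 1", "model": "gpt-4o-mini"})
b = call_llm({"prompt": "Summarise chapter 1", "model": "gpt-4o-mini"})
# a == b, and the second call did no work
```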
Roadmap → v0.3
- Enforce unique `idx` end‑to‑end → a new `JoinByParent` replaces the deprecated shared‑idx ops.
- Built‑in vector‑store & semantic‑search nodes.
- Streamlined cost & progress reporting.
- More batteries‑included tutorials.
© 2025 · MIT License