Skip to main content

Composable, cache-aware batch processing pipelines for LLMs, APIs, and dataset generation.

Project description

BatchFactory

Composable, cache‑aware pipelines for parallel LLM workflows, API calls, and dataset generation.

Status — v0.2 alpha. Stable enough for prototypes; expect fast‑moving APIs.


Install

pip install batchfactory            # latest tag
pip install --upgrade batchfactory  # grab the newest patch

Quick‑start

import batchfactory as bf
from batchfactory.op import *

project = bf.CacheFolder("quickstart", 1, 0, 0)
broker  = bf.brokers.ConcurrentLLMCallBroker(project["cache/llm_broker.jsonl"])

# Rewrite the first three passages of every *.txt file into four‑line poems.

g = (
    ReadMarkdownLines("./data/*.txt", key_field="keyword", directory_str_field="directory")
    | Shuffle(42)
    | TakeFirstN(3)
    | GenerateLLMRequest(
        'Rewrite the passage from "{directory}" titled "{keyword}" as a four‑line poem.',
        model="gpt-4o-mini@openai",
    )
    | ConcurrentLLMCall(project["cache/llm_call.jsonl"], broker)
    | ExtractResponseText()
    | WriteJsonl(project["out/poems.jsonl"], output_fields=["keyword", "text", "directory"])
    | Print()
)

g.compile().execute(dispatch_brokers=True)

Run it twice – everything after the first run is served from the on‑disk ledger.


Why BatchFactory?  Three killer moves

📦 Mass data distillation & cleanup 🎭 Multi‑agent, multi‑round workflows 🔥 Hierarchical spawning for long text
Chain GenerateLLMRequest → ConcurrentLLMCall → ExtractResponseText behind keyword or file sources to bulk‑create, filter, or refine datasets (think millions of Q&A rows, code explanations, translation pairs) with caching and cost tracking built‑in. Repeat plus chat helpers let you spin up translation swarms, code‑review pairs, or tutoring agents in 5 minutes of code – conversations live in chat_history, cost and revisions are automatic. SpawnFromList explodes complex items into fine‑grained subtasks, runs them in parallel, then CollectAllToList stitches results back – perfect for beat→scene→arc analy­sis or any long, messy document pipeline.

Loop snippet (Role‑Playing)

Teacher = Character("teacher_name", TEACHER_PROMPT)
Student = Character("student_name", STUDENT_PROMPT)

g = ( ReadMarkdownLines("story.txt", "keyword")
      | SetField({"teacher_name":"Alice", "student_name":"Bob"})
      | Teacher("老师,请先讲解课文", 0)
      | Repeat( Student("同学提问或回答", 1)
                | Teacher("回应或继续讲解", 2), 3)
      | Teacher("请总结", 3)
      | ChatHistoryToText() | Print() )

Spawn snippet (chapter → paragraph → chapter synopsis)

project = bf.CacheFolder("spawn_demo", 1, 0, 0)
broker  = bf.brokers.ConcurrentLLMCallBroker(project["cache/llm.jsonl"])

g = ( ReadMarkdownLines("novel/*.md", "chapter")                 # each entry = a chapter
      | SpawnFromList("paragraphs", "para")                         # fan‑out per paragraph
      | GenerateLLMRequest("Summarise:\n{para}", model="gpt-4o-mini@openai")
      | ConcurrentLLMCall(project["cache/para_sum.jsonl"], broker)
      | CollectAllToList("text", "chapter_summaries")               # wait until ALL paras done
      | GenerateLLMRequest("Chapter synopsis:\n{chapter_summaries}",
                           model="gpt-4o-mini@openai")
      | ConcurrentLLMCall(project["cache/ch_sum.jsonl"], broker) )

pseudocode, please see examples for implementation detail


Core concepts (one‑liner view)

Term Story in one sentence
Entry Tiny record with immutable idx, mutable data, auto‑incrementing rev.
Op Atomic node; compose with `
GraphSegment A lightweight helper: .to_segment() just wraps a node so `
execute() High‑level driver that resumes, pumps, and dispatches brokers.
Broker Pluggable engine handling expensive / async jobs (LLM, search, human labels).
Ledger Append‑only JSONL cache behind every broker & graph enabling instant restart.

(You can call pump() manually, but 99 % of users stick to execute().)


Primitive index (short list)

Family Node Blurb
Sources ReadMarkdownLines, FromList ingest files or raw dicts
Transforms Apply, Filter, SetField python‑powered field tweaks
Spawn / Collect SpawnFromList, CollectAllToList map‑reduce with unique child ids
Control flow If, While, Repeat branch, loop, iterate
LLM GenerateLLMRequest → ConcurrentLLMCall → ExtractResponseText prompt, call, harvest
Utilities CleanupLLMData, PrintTotalCost tidy temporary fields, audit cost

(Shared‑idx ops Replicate / Collect are deprecated and vanish in v0.3.)


Example gallery

✨ Example Demonstrates
01 – Basic pipeline linear LLM transform & caching
02 – Role‑playing loop concise multi‑agent RPG using Repeat + chat helpers
03 – Split & summarise fan‑out/fan‑in summarisation (deprecated style)
04 – Long‑text segmentation Spawn + CollectAll power pattern
05 – Math ops (unit) loop + conditional logic under pure Python

Broker & cache highlights

  • Each expensive call is hashed → job_idxduplicate prompts are free.
  • BrokerFailureBehavior = RETRY | STAY | EMIT lets you decide how failures propagate.
  • On restart, execute() reuses cached results and sends only the missing jobs.

Roadmap → v0.3

  • Enforce unique idx end‑to‑end → new JoinByParent replaces deprecated shared‑idx ops.
  • Built‑in vector‑store & semantic‑search nodes.
  • Streamlined cost & progress reporting.
  • More batteries‑included tutorials.

© 2025 · MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

batchfactory-0.2.0.tar.gz (36.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

batchfactory-0.2.0-py3-none-any.whl (39.5 kB view details)

Uploaded Python 3

File details

Details for the file batchfactory-0.2.0.tar.gz.

File metadata

  • Download URL: batchfactory-0.2.0.tar.gz
  • Upload date:
  • Size: 36.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.5

File hashes

Hashes for batchfactory-0.2.0.tar.gz
Algorithm Hash digest
SHA256 09e92dea0ff4c3d2f62f48a7b6f0f081649cd02fbdf4260289e46c369ca94a2e
MD5 5a2d69b60981cf64d98ed761f8f73697
BLAKE2b-256 be47b30d00ba457da97d9dfd280d44945ffc41a56e9df504f59b47f5d244014a

See more details on using hashes here.

File details

Details for the file batchfactory-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: batchfactory-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 39.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.5

File hashes

Hashes for batchfactory-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ee434e2a661843495c8ba4dc9da928c343d1c55b439006b80e5d2aa917bb6a68
MD5 b9c41e1d5b26805d67c0f5f8c2546d25
BLAKE2b-256 4b3458dd32243dd7893a1b41648033f7e44786af5f2be6fc506443e74ce7061b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page