cereproc.py
old/cereproc.py processes large documents by splitting them into chunks suitable for the Cerebras zai-glm-4.6 model, generating completions for each chunk, and reassembling the results while maintaining context.
Quick Start
export CEREBRAS_API_KEY="csk-..."
uv run old/cereproc.py --input_data document.md --output_data document.out.md
Add optional guidance using inline prompts or instruction files:
uv run old/cereproc.py \
--input_data huge.md \
--file_prompt prompts/style.md \
--prompt "Write concise technical summaries." \
--data_format code \
--chunk_size 28000 \
--sample_size 256 \
--verbose
CLI
NAME
cerebrate-file - Process large documents by chunking for Cerebras zai-glm-4.6
SYNOPSIS
cerebrate-file INPUT_DATA <flags>
POSITIONAL ARGUMENTS
INPUT_DATA
Path to input file to process
FLAGS
-o, --output_data=OUTPUT_DATA
Output file path (default: overwrite input)
-f, --file_prompt=FILE_PROMPT
Path to file with initial instructions
-p, --prompt=PROMPT
Inline prompt text (appended after file_prompt)
-c, --chunk_size=CHUNK_SIZE
Target max chunk size in tokens (default: 32000)
--max_tokens_ratio=MAX_TOKENS_RATIO
Completion budget as % of chunk size (default: 100)
--data_format=DATA_FORMAT
Chunking strategy: text | semantic | markdown | code (default: markdown)
-s, --sample_size=SAMPLE_SIZE
Tokens from previous request/response to maintain context (default: 200)
--temp=TEMP
Model temperature (default: 0.7)
--top_p=TOP_P
Model top-p sampling (default: 0.8)
--model=MODEL
Override default model name (default: zai-glm-4.6)
-v, --verbose
Enable debug logging
-e, --explain
Parse and update frontmatter metadata
--dry_run
Show chunking details without calling the API
Streaming via STDIN/STDOUT
Use - to read from stdin or write to stdout:
cat huge.md | uv run cerebrate_file --input_data - --output_data - > processed.md
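The `-` convention can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code; the helper names `read_input` and `write_output` are hypothetical.

```python
import sys
from pathlib import Path


def read_input(path: str) -> str:
    """Return the whole input; "-" means read from standard input."""
    if path == "-":
        return sys.stdin.read()
    return Path(path).read_text(encoding="utf-8")


def write_output(path: str, text: str) -> None:
    """Write the result; "-" means write to standard output."""
    if path == "-":
        sys.stdout.write(text)
        return
    Path(path).write_text(text, encoding="utf-8")
```

Because the whole document is read before chunking, a pipe works the same way as a file path here.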
Processing Pipeline
- Load .env and validate CEREBRAS_API_KEY and CLI arguments.
- Construct base prompt from --file_prompt and --prompt, separated by two newlines. Count its tokens.
- Read input file, preserving frontmatter. Parse metadata if --explain is enabled.
- Split document body using one of these strategies:
  - text: line-based greedy splitting
  - semantic: paragraph-aware via semantic-text-splitter
  - markdown: structure-preserving Markdown splitting
  - code: regex-based source code boundaries
- For each chunk, optionally prepend/append continuity examples (--sample_size tokens each) from prior interactions, ensuring total tokens stay under the 131K limit.
- Stream responses from Cerebras, with automatic retry and backoff on transient errors (tenacity).
- Write final output atomically. Update frontmatter if --explain is active.
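To make the text strategy concrete, here is a minimal sketch of line-based greedy splitting. It is not the tool's implementation: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, whereas the real tool presumably counts tokens with a model tokenizer, and `split_greedy` is an illustrative name.

```python
def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def split_greedy(text: str, chunk_size: int) -> list[str]:
    """Pack whole lines into chunks of at most chunk_size tokens each.

    A single line longer than chunk_size becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for line in text.splitlines(keepends=True):
        line_tokens = count_tokens(line)
        # Flush the current chunk before it would exceed the budget.
        if current and current_tokens + line_tokens > chunk_size:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("".join(current))
    return chunks
```

Since lines are never split or altered, joining the chunks reproduces the input exactly, which keeps reassembly simple.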
Explain Mode Metadata
When --explain is set, the script looks for frontmatter containing:
- title
- author
- id
- type
- date
Missing fields are filled via a structured JSON query to the model. Use --dry_run to preview parsed metadata without making network calls.
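Deciding which fields need to be requested from the model could look like the sketch below. It assumes the frontmatter has already been parsed into a plain dict; `missing_fields` is a hypothetical helper, and treating empty strings as missing is an assumption.

```python
REQUIRED_FIELDS = ("title", "author", "id", "type", "date")


def missing_fields(metadata: dict) -> list[str]:
    """Return required keys that are absent or empty in the parsed frontmatter."""
    return [key for key in REQUIRED_FIELDS if metadata.get(key) in (None, "")]
```

For example, `missing_fields({"title": "Notes", "author": ""})` reports every field except `title`.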
Dry Run Workflow
Use --dry_run to inspect:
- Chunk sizes
- Token budgets
- Message structure
No API calls are made in this mode.
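The budget arithmetic that a dry run reports can be approximated as follows. This is a sketch under stated assumptions: the completion budget is taken as a percentage of the chunk's token count (per --max_tokens_ratio), and the 131K limit is assumed to cover the prompt, two continuity samples, the chunk, and the completion together. The function names are hypothetical.

```python
CONTEXT_LIMIT = 131_000  # assumed from the "131K limit" in the pipeline description


def completion_budget(chunk_tokens: int, max_tokens_ratio: int = 100) -> int:
    """Completion budget as a percentage of the chunk size."""
    return chunk_tokens * max_tokens_ratio // 100


def fits_context(prompt_tokens: int, chunk_tokens: int,
                 sample_size: int = 200, max_tokens_ratio: int = 100) -> bool:
    """Check an assumed worst case: prompt + two samples + chunk + completion."""
    budget = completion_budget(chunk_tokens, max_tokens_ratio)
    total = prompt_tokens + 2 * sample_size + chunk_tokens + budget
    return total <= CONTEXT_LIMIT
```

Under these assumptions, a 32000-token chunk with a 100% ratio leaves ample headroom, while a 70000-token chunk does not fit.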
Dependencies
Install with uv or your preferred package manager:
- fire
- loguru
- python-dotenv
- tenacity
- cerebras-cloud-sdk
- semantic-text-splitter
- qwen-tokenizer
- tqdm
- python-frontmatter
Configuration
The tool uses a layered configuration system. Settings are loaded in this order (later sources override earlier ones):
- Built-in defaults: default_config.toml bundled with the package
- User config: ~/.config/cerebrate-file/config.toml
- Project config: .cerebrate-file.toml in the current directory
- Environment variables: e.g., CEREBRATE_PRIMARY_MODEL
If no custom config exists, the built-in defaults are used automatically.
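The override order can be illustrated with a small merge routine. Whether the tool merges nested tables key by key or replaces them wholesale is not stated, so the deep merge here is an assumption, and `deep_merge` and `resolve_config` are hypothetical names.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where override wins, merging nested tables key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers in order; later layers override earlier ones."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Illustrative layers: built-in defaults, then user config, then project config.
defaults = {"inference": {"temperature": 0.7, "top_p": 0.8}}
user = {"inference": {"temperature": 0.98}}
project = {"inference": {"chunk_size": 28000}}
config = resolve_config(defaults, user, project)
```

With these layers, the user file changes only the temperature and the project file adds a chunk size, while top_p still comes from the defaults.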
Config File Locations
| Platform | User Config Path |
|---|---|
| macOS/Linux | ~/.config/cerebrate-file/config.toml |
| Windows | %APPDATA%\cerebrate-file\config.toml |
For project-specific settings, create .cerebrate-file.toml in your project root.
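A lookup that follows the table above might be written like this. It is a sketch rather than the package's own resolver: `user_config_path` is a hypothetical helper, and the fallback used when APPDATA is unset is an assumption.

```python
import os
import sys
from pathlib import Path
from typing import Mapping


def user_config_path(platform: str = sys.platform,
                     env: Mapping[str, str] = os.environ) -> Path:
    """Return the user config path for the given platform string."""
    if platform.startswith("win"):
        # Assumed fallback when APPDATA is not defined.
        base = Path(env.get("APPDATA", str(Path.home() / "AppData" / "Roaming")))
    else:
        base = Path.home() / ".config"
    return base / "cerebrate-file" / "config.toml"
```

Passing `platform` and `env` explicitly makes the function easy to exercise on any operating system.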
Example Config
[inference]
temperature = 0.98
top_p = 0.8
chunk_size = 32000
sample_size = 200
[models.primary]
name = "zai-glm-4.6"
provider = "cerebras"
api_key_env = "CEREBRAS_API_KEY"
max_context_tokens = 131000
max_output_tokens = 40960
[models.fallback1]
enabled = true
name = "zai-org/GLM-4.6"
provider = "chutes"
api_key_env = "CHUTES_API_KEY"
api_base = "https://llm.chutes.ai/v1"
Environment Setup
Set CEREBRAS_API_KEY before running. The tool will warn about placeholder keys and validate basic formatting. Use --verbose for extra runtime info and rate-limit headers.
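A basic key check in the spirit of this section could look like the sketch below. The exact rules the tool applies are not documented here: the `csk-` prefix is inferred from the Quick Start example, and the placeholder list and the `check_api_key` name are hypothetical.

```python
from typing import Optional

# Hypothetical placeholder values; the tool's real list is not documented here.
PLACEHOLDERS = {"csk-...", "your-api-key", "changeme"}


def check_api_key(key: Optional[str]) -> list[str]:
    """Return human-readable problems; an empty list means the key looks plausible."""
    if not key:
        return ["CEREBRAS_API_KEY is not set"]
    problems: list[str] = []
    if key in PLACEHOLDERS or key.endswith("..."):
        problems.append("key looks like a placeholder")
    if not key.startswith("csk-"):
        problems.append("key does not start with 'csk-'")
    return problems
```

In practice the argument would come from `os.environ.get("CEREBRAS_API_KEY")`, and any returned problems would be logged as warnings.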
Testing Tips
- Run with --dry_run to check chunking logic quickly.
- Test on a small sample file with --verbose to observe:
  - Context blending between chunks
  - Output statistics
- Only then run on larger inputs.