No project description provided
Project description
this_file: README.md
cereproc.py
old/cereproc.py is a single-file utility that splits oversized documents into
Cerebras-friendly chunks, calls the qwen-3-coder-480b chat completion model
for each chunk, and stitches the results back together while keeping context
intact.
Quick Start
export CEREBRAS_API_KEY="csk-..."
uv run old/cereproc.py --input_data document.md --output_data document.out.md
Add optional guidance by supplying an inline prompt or a separate instructions file:
uv run old/cereproc.py \
--input_data huge.md \
--file_prompt prompts/style.md \
--prompt "Write concise technical summaries." \
--data_format code \
--chunk_size 28000 \
--sample_size 256 \
--verbose
CLI Flags
--input_data PATH(required) Text/Markdown/code file to process.--output_data PATHDestination file (defaults to the input path).--file_prompt PATHLoad reusable instructions; appended before the inline prompt.--prompt TEXTFreeform instructions appended after the file prompt.--chunk_size INTTarget chunk size in tokens (default32000).--data_format text|semantic|markdown|codeChunking strategy (defaultmarkdown).--sample_size INTContinuity example size in tokens (default200, use0to disable).--max_tokens_ratio INTCompletion budget as%of chunk tokens (default100).--temp FLOATand--top_p FLOATSampling controls (defaults0.7/0.8).--model TEXTCerebras model name override (defaultqwen-3-coder-480b).--verboseEnable detailed logging and chunk previews.--dry_runInspect chunking and request envelopes without calling the API.--explainParse Markdown frontmatter, ensure required metadata fields, and ask the model to fill gaps before processing.
Processing Pipeline
- Load
.envvalues and validateCEREBRAS_API_KEYplus CLI arguments. - Build a base prompt from
--file_promptand--prompt(always separated by two newlines) and count its tokens. - Read the input file (frontmatter preserved) and optionally parse metadata
when
--explainis active. - Chunk the body using the selected strategy:
text: greedy line-based splitting.semantic: paragraph-aware viasemantic-text-splitter.markdown: structure-aware Markdown splitter.code: regex-guided boundaries for source files.
- For each chunk, optionally blend in continuity examples drawn from the
previous request/response pair (
--sample_sizetokens each way), truncated to stay within the 131K-token context budget. - Stream completions from Cerebras with adaptive rate-limit backoff and retry
(
tenacity) on transient failures. - Write the concatenated result atomically, preserving or updating frontmatter
when
--explainmetadata is present.
Explain Mode Metadata
When --explain is set, the script expects frontmatter containing
title, author, id, type, and date. Missing keys trigger a structured
JSON request to the model that fills only the absent values. Dry-run mode skips
this network call while still showing parsed metadata.
Dry-Run Workflow
Use --dry_run to sanity-check chunk sizes, token budgets, and message shapes
without spending quota. The script prints the first two chunk envelopes, token
counts, and previews, then exits before creating the Cerebras client.
Dependencies
Install requirements with uv (or your preferred tool):
firelogurupython-dotenvtenacitycerebras-cloud-sdksemantic-text-splitterqwen-tokenizertqdmpython-frontmatter
Environment
Set CEREBRAS_API_KEY before running. The utility warns on placeholder keys
and gently validates formatting. Use --verbose to surface additional runtime
information and rate-limit headers.
Testing Tips
Run with --dry_run for fast validation, then process a short sample file in
--verbose mode to observe continuity handling and output statistics before you
launch against larger documents.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cerebrate_file-1.0.5.tar.gz.
File metadata
- Download URL: cerebrate_file-1.0.5.tar.gz
- Upload date:
- Size: 8.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ef5dfe525ed327283d7b20b7ca12f31f2e834e76b80442e8cc6a6e05b0c849bb
|
|
| MD5 |
139b2103da7e3f7301c0a9146c5cf130
|
|
| BLAKE2b-256 |
5fd4c1ced07e083ce0fc00b12b76017e81897d10d6f573040c5583fc0e9f053f
|
File details
Details for the file cerebrate_file-1.0.5-py3-none-any.whl.
File metadata
- Download URL: cerebrate_file-1.0.5-py3-none-any.whl
- Upload date:
- Size: 41.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2843b27d938c07774f5bfab6da7d570ca25a3d1102c5b4dde7772e59241fe9f5
|
|
| MD5 |
a509479096b9d9de98359c105bdb8dfc
|
|
| BLAKE2b-256 |
2076e0abfdb16ee92fa8112fc1c3b68a50218b819ccf539e818f0d33613d9b28
|