# PromptForge
The Swiss Army knife for prompt engineering -- build, test, version, and optimize prompts 10x faster.
## Why PromptForge?
Writing a good prompt is half the battle. Managing hundreds of prompts across projects, tracking which version performs best, estimating token costs before you hit "send" -- that is the other half, and it is where most teams waste time.
PromptForge gives you a single toolkit to:
- Build parameterized prompt templates with `{{variable}}` injection
- Test prompts with built-in A/B testing and batch evaluation
- Version every change with full diff and fork support
- Score output quality on coherence, relevance, safety, completeness, and clarity
- Compress prompts to use fewer tokens without losing meaning
- Chain prompts into multi-step pipelines with automatic output passing
- Count tokens and estimate costs across 16+ LLM models
- Export to JSON, YAML, or LangChain format
All backed by a file-based store (no database required) with support for OpenAI, Anthropic, Google Gemini, and local models (Ollama, LM Studio).
## Features

### Prompt Templates with Variable Injection

Define templates with `{{variable}}` placeholders. PromptForge auto-detects variables, validates required ones, and fills defaults.

### Full Version History

Every save creates a version snapshot. Diff any two versions, fork from a specific version, and track the evolution of your prompts over time.

### A/B Testing Framework

Compare two prompts head-to-head on the same inputs. PromptForge runs both, scores outputs on five quality dimensions, and declares a winner.

### Batch Testing

Run a suite of inputs through a single prompt and get back quality scores, latencies, token usage, and estimated cost per run.

### Chain-of-Thought Pipelines

Link prompts into multi-step pipelines where each step's output flows into the next as a `{{step_N_output}}` variable.

### Quality Scoring

Deterministic, heuristic-based scoring across five dimensions -- coherence, relevance, safety, completeness, and clarity -- with a weighted overall score. No LLM call required.

### Prompt Compression

Reduce token count with configurable aggression (0.0 to 1.0) by removing filler phrases, collapsing whitespace, shortening verbose instructions, and compressing lists.

### Token Counting and Cost Estimation

Accurate token counts via tiktoken, with pricing tables for GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.1, Mistral, and more. Compare costs across all models in one call.

### Multi-Provider LLM Engine

Unified interface for OpenAI, Anthropic, Google Gemini, and local models (Ollama / LM Studio). Switch providers with a single parameter.

### Export to JSON, YAML, and LangChain

Export individual prompts or your entire library. LangChain export generates PromptTemplate-compatible JSON with input_variables, template_format, and metadata.
## Installation

```bash
# Install from PyPI
pip install ai-prompt-forge

# Or from source
git clone https://github.com/theihtisham/ai-prompt-forge.git
cd ai-prompt-forge
pip install -e .

# With dev dependencies (pytest, coverage)
pip install -e ".[dev]"
```
### Requirements

- Python 3.10+
- API keys (set as environment variables):
  - `OPENAI_API_KEY` for OpenAI
  - `ANTHROPIC_API_KEY` for Anthropic
  - `GOOGLE_API_KEY` for Google Gemini
- Local models need no key (uses `http://localhost:11434` by default)
## Quick Start

### Create and Render a Template

````python
from promptforge.models import PromptTemplate, PromptCategory

template = PromptTemplate(
    name="Code Reviewer",
    description="Reviews code for bugs and style issues",
    category=PromptCategory.CODING,
    template="""Review the following {{language}} code for bugs, style issues,
and performance problems. Be specific and actionable.

Code:
```{{language}}
{{code}}
```

Focus on: {{focus_areas}}""",
)

# Render with variables
rendered = template.render({
    "language": "python",
    "code": "def add(a, b): return a + b",
    "focus_areas": "edge cases, type safety",
})
print(rendered)
````
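Variable auto-detection like this can be done with a regular expression over the template text. A minimal sketch of the idea (not PromptForge's actual implementation; `find_variables` is a hypothetical helper):

```python
import re

def find_variables(template: str) -> list[str]:
    """Return unique {{variable}} names in order of first appearance."""
    seen: dict[str, None] = {}
    for name in re.findall(r"\{\{(\w+)\}\}", template):
        seen.setdefault(name)
    return list(seen)

print(find_variables("Review this {{language}} code:\n{{code}}\nFocus: {{focus_areas}}"))
# → ['language', 'code', 'focus_areas']
```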
### Save, Version, and Fork
```python
from promptforge.store import PromptStore
store = PromptStore() # defaults to ~/.promptforge/data
# Save (creates version 1)
template = store.save_prompt(template)
# Update the template and save again (creates version 2)
template.template = "You are an expert {{language}} developer. Review:\n{{code}}"
template.version = 2
store.save_prompt(template)
# Diff versions
diff = store.diff_versions(template.id, old_ver=1, new_ver=2)
print(diff.summary) # "+2 lines, -1 lines"
print(diff.additions) # list of added lines
print(diff.deletions) # list of removed lines
# Fork from version 1 into a new prompt
forked = store.fork_version(template.id, version=1, new_name="Simple Reviewer")
```
### List and Search Prompts

```python
# List all prompts
all_prompts = store.list_prompts()

# Filter by category
coding_prompts = store.list_prompts(category="coding")

# Filter by tag
tagged = store.list_prompts(tag="production")

# Full-text search
results = store.list_prompts(search="code review")
```
### Generate with LLM Providers

```python
from promptforge.engine import get_provider

# OpenAI
openai = get_provider("openai", model="gpt-4o")
response = openai.generate("Explain recursion in one paragraph.")
print(response.text)
print(f"Tokens: {response.total_tokens}, Latency: {response.latency_ms:.0f}ms")

# Anthropic
anthropic = get_provider("anthropic", model="claude-3-5-sonnet-20241022")
response = anthropic.generate("Explain recursion in one paragraph.", temperature=0.3)

# Google Gemini
google = get_provider("google", model="gemini-2.0-flash")
response = google.generate("Explain recursion in one paragraph.")

# Local (Ollama)
local = get_provider("local", model="llama3.1")
response = local.generate("Explain recursion in one paragraph.")
```
### A/B Testing

```python
from promptforge.testing import ABTester
from promptforge.models import ABTestConfig, ModelProvider

config = ABTestConfig(
    name="Review style comparison",
    prompt_a_id=prompt_a.id,
    prompt_b_id=prompt_b.id,
    test_inputs=[
        "Write a function to reverse a linked list",
        "Implement binary search in Python",
        "Create a REST API endpoint for user login",
    ],
    provider=ModelProvider.OPENAI,
    model="gpt-4o-mini",
)

tester = ABTester(store=store)
results, summary = tester.run_test(config)
print(f"Wins A: {summary.wins_a}, Wins B: {summary.wins_b}, Ties: {summary.ties}")
print(f"Recommendation: {summary.recommendation}")
```
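The win/tie bookkeeping behind a summary like this can be sketched as a per-input comparison of overall quality scores (illustrative only; `tally` and the tie margin are assumptions, not the real `ABTester` logic):

```python
def tally(scores_a: list[float], scores_b: list[float], tie_margin: float = 0.05) -> dict:
    """Count per-input wins for two prompts given their overall quality scores."""
    wins_a = wins_b = ties = 0
    for a, b in zip(scores_a, scores_b):
        if abs(a - b) <= tie_margin:  # scores too close to call
            ties += 1
        elif a > b:
            wins_a += 1
        else:
            wins_b += 1
    return {"wins_a": wins_a, "wins_b": wins_b, "ties": ties}

print(tally([0.9, 0.7, 0.9], [0.7, 0.85, 0.6]))
# → {'wins_a': 2, 'wins_b': 1, 'ties': 0}
```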
### Batch Testing

```python
from promptforge.testing import BatchTester
from promptforge.models import BatchTestConfig, ModelProvider

config = BatchTestConfig(
    prompt_id=template.id,
    inputs=[
        "Write a sorting function",
        "Debug this null pointer exception",
        "Explain the difference between TCP and UDP",
    ],
    provider=ModelProvider.OPENAI,
    model="gpt-4o-mini",
)

tester = BatchTester(store=store)
result = tester.run_batch(config)
print(f"Success: {result.successful}/{result.total_inputs}")
print(f"Avg latency: {result.avg_latency_ms:.0f}ms")
print(f"Total tokens: {result.total_tokens_used}")
print(f"Estimated cost: ${result.estimated_cost_usd:.4f}")
```
### Chain Pipelines

```python
from promptforge.chain import ChainBuilder
from promptforge.models import ModelProvider

builder = ChainBuilder(store=store)

# Create a 3-step pipeline
pipeline = builder.create_pipeline(
    name="Research Pipeline",
    description="Research -> Summarize -> Format",
    step_prompt_ids=[research_prompt.id, summarize_prompt.id, format_prompt.id],
)

# Run it
result = builder.run_pipeline(
    pipeline_id=pipeline.id,
    initial_variables={"topic": "quantum computing"},
    provider=ModelProvider.OPENAI,
    model="gpt-4o-mini",
)
print(result.final_output)
print(f"Total tokens: {result.total_tokens_used}")
print(f"Steps completed: {len(result.step_results)}")
```
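To make the automatic output passing concrete: a downstream step's template references `{{step_1_output}}` like any other variable. A minimal substitution sketch (illustrative; the template text and the regex pass are assumptions, not `ChainBuilder` internals):

```python
import re

# Hypothetical step-2 template: it consumes the previous step's output.
summarize_template = "Summarize the following research notes:\n{{step_1_output}}"

# Variables available when step 2 runs: the initial variables plus step 1's output.
render_vars = {
    "topic": "quantum computing",
    "step_1_output": "Qubits exploit superposition and entanglement...",
}

# Substitute every {{name}} with its value.
rendered = re.sub(r"\{\{(\w+)\}\}", lambda m: render_vars[m.group(1)], summarize_template)
print(rendered)
```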
### Quality Scoring

```python
from promptforge.quality import QualityScorer

scorer = QualityScorer()
scores = scorer.score_all(
    output="To implement binary search, first sort the array...",
    prompt="Explain how binary search works",
)
print(f"Coherence: {scores['coherence']:.2f}")
print(f"Relevance: {scores['relevance']:.2f}")
print(f"Safety: {scores['safety']:.2f}")
print(f"Completeness: {scores['completeness']:.2f}")
print(f"Clarity: {scores['clarity']:.2f}")
print(f"Overall: {scores['overall']:.2f}")
```
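The weighted overall is, conceptually, a weighted average of the five dimension scores. A sketch with hypothetical weights (PromptForge's actual weighting may differ):

```python
# Hypothetical weights -- the library's real values may differ.
WEIGHTS = {"coherence": 0.25, "relevance": 0.25, "safety": 0.20,
           "completeness": 0.15, "clarity": 0.15}

def weighted_overall(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-1) into one overall score."""
    return sum(scores[dim] * w for dim, w in WEIGHTS.items())

print(weighted_overall({"coherence": 0.8, "relevance": 0.9, "safety": 1.0,
                        "completeness": 0.7, "clarity": 0.6}))
```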
### Prompt Compression

```python
from promptforge.compression import PromptCompressor

compressor = PromptCompressor()
result = compressor.compress(
    """Please note that it is important to understand that in order to
implement binary search, you should first make sure to sort the array
prior to beginning the search algorithm. Keep in mind that the time
complexity is O(log n).""",
    aggression=0.7,
)
print(result["compressed"])
print(f"Reduction: {result['reduction_pct']}%")
print(f"Steps applied: {', '.join(result['steps_applied'])}")
```
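A filler-phrase pass like the one the compressor applies can be sketched with regular expressions. The phrase list below is illustrative, not the library's actual rule set:

```python
import re

# Hypothetical filler phrases -- a real compressor would have a larger, tuned list.
FILLERS = [r"please note that\s*", r"it is important to understand that\s*",
           r"keep in mind that\s*", r"make sure to\s*", r"in order to\b"]

def strip_fillers(text: str) -> str:
    """Remove filler phrases, then collapse whitespace."""
    for pattern in FILLERS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

print(strip_fillers("Please note that in order to sort, make sure to compare pairs."))
# → "sort, compare pairs."
```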
### Token Counting and Cost Estimation

```python
from promptforge.token_counter import count_tokens, estimate_cost, analyze_text, compare_models

# Count tokens
tokens = count_tokens("Explain quantum computing in detail.", model="gpt-4o")
print(f"Token count: {tokens}")

# Estimate cost
cost = estimate_cost(input_tokens=500, output_tokens=200, model="gpt-4o")
print(f"Estimated cost: ${cost:.6f}")

# Full analysis
analysis = analyze_text("Your long prompt here...", model="gpt-4o", estimated_output_tokens=500)
print(f"Input tokens: {analysis['input_tokens']}")
print(f"Context usage: {analysis['context_usage_pct']}%")
print(f"Total cost: ${analysis['total_cost_usd']:.6f}")

# Compare across all models
comparisons = compare_models("Your prompt text", estimated_output_tokens=500)
for c in comparisons:
    print(f"{c['model']:25s} {c['total_tokens']:6d} tokens  ${c['total_cost_usd']:.6f}")
```
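The cost estimate itself is plain arithmetic over a per-million-token pricing table. A sketch of that arithmetic with hypothetical prices (real prices change over time; rely on the library's tables, not these numbers):

```python
# Hypothetical (input, output) prices in USD per million tokens -- illustration only.
PRICING = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def estimate_cost_usd(input_tokens: int, output_tokens: int, model: str) -> float:
    """Estimate a request's cost from token counts and per-million-token prices."""
    in_price, out_price = PRICING[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

print(f"${estimate_cost_usd(500, 200, 'gpt-4o'):.6f}")
```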
### Export

```python
from promptforge.exporter import PromptExporter
from promptforge.models import ExportFormat

exporter = PromptExporter(store=store)

# Export a single prompt
json_str = exporter.export_prompt(template, fmt=ExportFormat.JSON)
yaml_str = exporter.export_prompt(template, fmt=ExportFormat.YAML)
langchain_str = exporter.export_prompt(template, fmt=ExportFormat.LANGCHAIN)

# Export all prompts
all_json = exporter.export_all(fmt=ExportFormat.JSON)

# Export specific prompts by ID
selected = exporter.export_by_ids(["abc123", "def456"], fmt=ExportFormat.YAML)
```
## Architecture

```
promptforge/
├── models.py         # Pydantic data models (PromptTemplate, ABTestConfig, etc.)
├── store.py          # File-based persistence layer (~/.promptforge/data/)
├── engine.py         # LLM provider integrations (OpenAI, Anthropic, Google, Local)
├── quality.py        # Output quality scorer (coherence, relevance, safety, clarity)
├── testing.py        # A/B testing and batch testing frameworks
├── chain.py          # Chain-of-thought pipeline builder
├── compression.py    # Prompt compression engine
├── token_counter.py  # Token counting and cost estimation (tiktoken)
├── exporter.py       # Export to JSON, YAML, and LangChain format
└── __init__.py       # Package metadata
```
### Data Flow

```
PromptTemplate ──render()──> Rendered Prompt
      │                            │
      │                            ├──> engine.generate()        ──> LLMResponse
      │                            ├──> quality.score_all()      ──> QualityScore
      │                            ├──> compression.compress()
      │                            └──> token_counter.analyze_text()
      │
      ├──> store.save_prompt()      (creates version snapshot)
      ├──> store.diff_versions()    (compare two versions)
      ├──> store.fork_version()     (branch from a version)
      └──> exporter.export_prompt() (JSON / YAML / LangChain)
```
### Storage Layout

All data is stored under `~/.promptforge/data/` (configurable via `PROMPTFORGE_DATA_DIR`):

```
~/.promptforge/data/
├── prompts/     # Prompt templates (one JSON file per prompt)
├── versions/    # Version snapshots ({id}_v{N}.json)
├── tests/       # A/B test configs and batch results
└── chains/      # Chain pipelines and run results
```
## API Reference

### promptforge.models

| Class | Description |
|---|---|
| `PromptTemplate` | Core template model with `render()` and `extract_variables()` |
| `VariableDefinition` | Variable schema with name, default, required, and example |
| `PromptCategory` | Enum: coding, writing, analysis, creative, business, education, etc. |
| `ModelProvider` | Enum: openai, anthropic, google, local |
| `ExportFormat` | Enum: json, yaml, langchain |
| `QualityMetric` | Enum: coherence, relevance, safety, completeness, clarity |
| `VersionDiff` | Diff result between two prompt versions |
| `ABTestConfig` | A/B test configuration |
| `ABTestResult` | Single A/B test run result |
| `ABTestSummary` | Aggregated A/B test summary with recommendation |
| `BatchTestConfig` | Batch test configuration |
| `BatchTestResult` | Batch test result with cost and latency stats |
| `ChainPipeline` | Chain-of-thought pipeline definition |
| `ChainStep` | Single step in a chain pipeline |
| `ChainRunResult` | Result of running a chain pipeline |
| `TokenCount` | Token count result for a specific model |
| `QualityScore` | Quality scores with weighted overall |
### promptforge.store.PromptStore

| Method | Description |
|---|---|
| `save_prompt(prompt)` | Save a template (creates version snapshot) |
| `get_prompt(prompt_id)` | Retrieve a template by ID |
| `list_prompts(category, tag, search)` | List prompts with optional filters |
| `delete_prompt(prompt_id)` | Delete a template and all versions |
| `get_version(prompt_id, version)` | Get a specific version |
| `list_versions(prompt_id)` | List all version numbers |
| `diff_versions(prompt_id, old_ver, new_ver)` | Diff two versions |
| `fork_version(prompt_id, version, new_name)` | Fork from a specific version |
| `save_chain(chain)` | Save a chain pipeline |
| `get_chain(chain_id)` | Retrieve a chain pipeline |
| `list_chains()` | List all chain pipelines |
### promptforge.engine

| Function / Class | Description |
|---|---|
| `get_provider(provider, model, api_key)` | Factory function to create a provider |
| `OpenAIProvider` | OpenAI API integration |
| `AnthropicProvider` | Anthropic API integration |
| `GoogleProvider` | Google Gemini API integration |
| `LocalProvider` | Local model integration (Ollama, LM Studio) |
| `LLMResponse` | Standardized response with text, tokens, latency |
### promptforge.quality.QualityScorer

| Method | Description |
|---|---|
| `score_coherence(text)` | Score internal consistency and structure (0-1) |
| `score_relevance(output, prompt)` | Score relevance to the input prompt (0-1) |
| `score_safety(text)` | Score safety (penalizes harmful content) (0-1) |
| `score_completeness(output, prompt)` | Score how completely the prompt is addressed (0-1) |
| `score_clarity(text)` | Score readability and understandability (0-1) |
| `score_all(output, prompt)` | All five scores plus weighted overall |
### promptforge.testing

| Class | Method | Description |
|---|---|---|
| `ABTester` | `run_test(config, variables)` | Run A/B test, returns results and summary |
| `BatchTester` | `run_batch(config, variables)` | Run batch test, returns result with stats |
### promptforge.chain.ChainBuilder

| Method | Description |
|---|---|
| `create_pipeline(name, step_prompt_ids)` | Create a new pipeline with ordered steps |
| `add_step(pipeline_id, prompt_id)` | Add a step to an existing pipeline |
| `run_pipeline(pipeline_id, initial_variables)` | Execute pipeline steps sequentially |
| `list_pipelines()` | List all saved pipelines |
### promptforge.compression.PromptCompressor

| Method | Description |
|---|---|
| `compress(text, aggression)` | Compress a prompt (aggression 0.0-1.0) |
### promptforge.token_counter

| Function | Description |
|---|---|
| `count_tokens(text, model)` | Count tokens for a given model |
| `estimate_cost(input_tokens, output_tokens, model)` | Estimate cost in USD |
| `analyze_text(text, model, estimated_output_tokens)` | Full token and cost analysis |
| `compare_models(text, estimated_output_tokens)` | Compare cost across all models |
| `get_supported_models()` | List all supported model names |
### promptforge.exporter.PromptExporter

| Method | Description |
|---|---|
| `export_prompt(prompt, fmt)` | Export a single prompt (json/yaml/langchain) |
| `export_all(fmt)` | Export all prompts |
| `export_by_ids(prompt_ids, fmt)` | Export selected prompts by ID |
## Supported Models

| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4, gpt-4-32k, gpt-3.5-turbo |
| Anthropic | claude-3.5-sonnet, claude-3-opus, claude-3-haiku, claude-3.5-haiku |
| Google | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash |
| Local | llama-3.1-70b, llama-3.1-8b, mistral-large, mixtral-8x7b |
## Development

```bash
# Clone the repository
git clone https://github.com/theihtisham/ai-prompt-forge.git
cd ai-prompt-forge

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=promptforge --cov-report=term-missing
```
### Project Structure

```
12-prompt-forge/
├── pyproject.toml        # Build config, dependencies, and metadata
├── requirements.txt      # Pinned dependencies
├── LICENSE               # MIT License
├── .gitignore
└── src/
    └── promptforge/
        ├── __init__.py       # Package metadata (v1.0.0)
        ├── models.py         # Pydantic data models and enums
        ├── store.py          # File-based storage layer
        ├── engine.py         # LLM provider integrations
        ├── quality.py        # Output quality scorer
        ├── testing.py        # A/B and batch testing
        ├── chain.py          # Chain-of-thought pipelines
        ├── compression.py    # Prompt compression engine
        ├── token_counter.py  # Token counting and cost estimation
        └── exporter.py       # Multi-format export
```
## Contributing

Contributions are welcome. To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Write tests for your changes
- Ensure all tests pass (`pytest`)
- Commit with a descriptive message
- Open a pull request
### Guidelines
- Follow the existing code style (type hints on all public APIs, Pydantic models for data)
- Add tests for new functionality
- Keep the public API backward-compatible
- Document new parameters and return types
## License
This project is licensed under the MIT License.