pplyz
pplyz (short for LLM Analyser) is a minimal CSV→LLM→CSV transformer powered by LiteLLM. Point it at a CSV, pick the input columns, describe the JSON structure you want back, and pplyz writes the result into new columns.
Quick run (uvx)
# No install: pulls pplyz from PyPI / TestPyPI into an ephemeral venv
uvx pplyz --input data/sample.csv --columns question,answer --output enriched.csv
Add --preview to try a handful of rows first. Use --list to print the supported models.
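For example:
```
uvx pplyz --input data/sample.csv --columns question,answer --preview
uvx pplyz --list
```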
Requirements
- Python 3.12+
- uv
- An API key for an LLM provider supported by LiteLLM (OpenAI, Gemini, Anthropic, Groq, etc.)
Configuration
pplyz loads configuration in this order (later wins):
- Existing environment variables
- ./pplyz.local.toml
- ~/.config/pplyz/config.toml (override the directory via PPLYZ_CONFIG_DIR)
Start from the template:
cp pplyz.local.toml.example pplyz.local.toml
Example config:
[env]
OPENAI_API_KEY = "sk-..."
[pplyz]
default_model = "gpt-4o-mini"
Only set the providers you need. Common keys are OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY, and GROQ_API_KEY; set PPLYZ_DEFAULT_MODEL to force a model string such as groq/llama-3.1-8b-instant.
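For example, a config that switches the default to Groq might look like this (the key value is a placeholder):
```
[env]
GROQ_API_KEY = "gsk-..."

[pplyz]
default_model = "groq/llama-3.1-8b-instant"
```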
Local development
uv sync
uv run python main.py --help
Pre-commit runs pytest + Ruff automatically. Use uv tool run pre-commit install before committing.
Support
- Issues / PRs welcome.
- Licensed under MIT (see LICENSE).
Examples
Preview a few rows before committing to a full run:
pplyz --input data/yourfile.csv --columns title,abstract --preview --preview-rows 5
Example 1: Sentiment Analysis
pplyz --input data/reviews.csv --columns review_text --output sentiment_results.csv
When prompted, enter:
Analyze the sentiment and classify as positive, negative, or neutral. Also provide a confidence score.
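Pairing the prompt with an explicit field definition keeps the output columns stable (these field names are illustrative):
```
pplyz --input data/reviews.csv --columns review_text --fields "sentiment:str,confidence:float" --output sentiment_results.csv
```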
Example 2: Academic Paper Analysis
pplyz --input data/hsa-miR-144-3p.csv --columns title,abstract --output analysis.csv
When prompted, enter:
Extract the following information: research_topic, methodology, key_findings, and clinical_significance
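A matching field definition mirrors the prompt (again, the names are illustrative):
```
pplyz --input data/hsa-miR-144-3p.csv --columns title,abstract --fields "research_topic:str,methodology:str,key_findings:str,clinical_significance:str" --output analysis.csv
```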
Example 3: Using Different LLM Providers
# Use the default model (gemini/gemini-2.5-flash-lite)
pplyz --input data/papers.csv --columns title,abstract --output results.csv
# Use OpenAI GPT-4o
pplyz --input data/papers.csv --columns title,abstract --output results.csv --model gpt-4o
# Use Anthropic Claude 3.5 Sonnet
pplyz --input data/papers.csv --columns title,abstract --output results.csv --model claude-3-5-sonnet-20241022
# Use OpenAI GPT-4o Mini (cost-effective)
pplyz --input data/papers.csv --columns title,abstract --output results.csv --model gpt-4o-mini
Example 4: Boolean-Based Classification (Recommended)
pplyz \
--input data/articles.csv \
--columns title,abstract \
--fields 'is_relevant:bool,summary:str' \
--output classified.csv
When prompted, enter:
Determine if this article is relevant to climate change research (true/false).
Provide a one-sentence summary.
Why this works well: Boolean fields guarantee only true/false values, avoiding ambiguous responses like "yes", "maybe", or "unclear".
Example 5: List Available Models
pplyz --list
Command-Line Options
| Option | Short | Description | Required |
|---|---|---|---|
| --input | -i | Path to input CSV file | Yes |
| --columns | -c | Comma-separated list of columns to use as LLM input | Yes |
| --fields | -f | Field definition string (e.g., "field1:str,field2:int"); required to keep output columns consistent | Yes |
| --output | -o | Path to output CSV file (defaults to overwriting the input file when omitted) | No |
| --model | -m | LLM model name (default: PPLYZ_DEFAULT_MODEL or gemini/gemini-2.5-flash-lite) | No |
| --list | -l | List supported models and exit | No |
| --preview | -p | Preview results on sample rows without saving | No |
| --preview-rows | | Number of rows to preview (default: 3) | No |
| --no-resume | -R | Reprocess all rows even if output exists | No |
Note: When --output is omitted, the processor overwrites the input CSV in place. Use --preview first or point --output to a different file if you want to keep the original rows untouched.
Best Practices for Field Definitions
Defining Fields
Use --fields (required) to describe the JSON structure you expect back. The CLI enforces this flag to keep output columns consistent:
--fields "is_relevant:bool,summary:str,keywords:list[str]"
Any field without :type defaults to str. Supported types: bool, int, float, str, list[str], list[int], list[float], list[bool], and dict.
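As a mental model, a field string maps to a simple name→type dictionary. Below is a minimal, hypothetical parser sketch illustrating the grammar described above (pplyz's real implementation lives in pplyz/schemas.py and may differ):
```python
def parse_fields(spec: str) -> dict[str, str]:
    """Parse a "name:type,name:type" spec; bare names default to str."""
    schema: dict[str, str] = {}
    for part in spec.split(","):
        name, _, ftype = part.strip().partition(":")
        schema[name] = ftype or "str"  # a field without :type defaults to str
    return schema

# parse_fields("is_relevant:bool,summary,keywords:list[str]")
# -> {"is_relevant": "bool", "summary": "str", "keywords": "list[str]"}
```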
Field Design Patterns
Pattern 1: Binary Classification (Most Reliable) ✅
Use boolean fields for yes/no decisions - these are strictly enforced across all LLM providers:
pplyz \
--input data.csv \
--columns title,abstract \
--fields "is_relevant:bool,reason:str" \
--output results.csv
Why boolean? Unlike string fields, boolean type guarantees only true or false values - no unexpected strings like "yes", "maybe", or "unknown".
Pattern 2: Multi-level Classification (Requires Care) ⚠️
For granular classifications (e.g., high/medium/low), understand the limitations:
--fields "relevance:str,confidence:str,notes:str"
Important: String fields don't enforce enum constraints. LLMs may return unexpected values.
Best Practice:
- Use very explicit prompts listing all allowed values
- Consider post-processing for production use
- Consider using boolean flags instead (see Pattern 3)
Pattern 3: Hybrid Approach (Recommended for Production) 🎯
Combine boolean gates with descriptive fields:
--fields "is_dish_related:bool,confidence_level:str,explanation:str"
Then in your prompt:
First, determine if the article is DISH-related (true/false).
Then rate confidence as "high", "medium", or "low".
Provide a brief explanation.
Advantages:
- Boolean field provides reliable binary classification
- String fields provide additional context
- Works consistently across all providers
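Downstream, the boolean gate makes filtering trivial. A sketch using the standard library (the column name is illustrative; pplyz prefixes generated columns, so check your actual header):
```python
import csv

# Keep only rows the LLM flagged as DISH-related.
with open("classified.csv", newline="") as f:
    relevant = [
        row for row in csv.DictReader(f)
        # Booleans round-trip through CSV as text, so compare case-insensitively.
        if row.get("is_dish_related", "").strip().lower() == "true"
    ]
print(f"{len(relevant)} relevant rows")
```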
Supported Field Types
| Type | Enforcement Level | Use Case | Example |
|---|---|---|---|
| bool | ✅ Strict (true/false only) | Binary decisions | "is_relevant": "bool" |
| int | ✅ Strict (integers only) | Counts, scores | "score": "int" |
| float | ✅ Strict (numbers only) | Decimal scores | "confidence": "float" |
| str | ⚠️ Flexible (any text) | Descriptions, reasons | "summary": "str" |
| list[str] | ⚠️ Flexible (JSON array) | Tags, keywords | "tags": "list[str]" |
Understanding Output Validation Levels
LLM Analyser supports three levels of validation through LiteLLM:
1. JSON Syntax Only (basic mode; a LiteLLM sketch follows this list)
   - Ensures valid JSON structure
   - No field or type validation
2. Type Validation (with --fields)
   - Enforces field presence and types
   - Boolean, int, float strictly enforced
   - Strings accept any text
3. Enum/Pattern Validation (limited support)
   - Only supported by some models (e.g., OpenAI GPT-4o)
   - Not universally available
   - Requires post-processing for most models
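For reference, here is a minimal sketch of requesting JSON output through LiteLLM directly; it illustrates the mechanism, not pplyz's exact internals:
```python
import json

import litellm

# response_format switches the request into LiteLLM's JSON mode where supported.
response = litellm.completion(
    model="gemini/gemini-2.5-flash-lite",
    messages=[{
        "role": "user",
        "content": 'Return JSON like {"is_relevant": true, "summary": "..."} for: <row data>',
    }],
    response_format={"type": "json_object"},
)

# The reply text should now be parseable JSON.
data = json.loads(response.choices[0].message.content)
print(data)
```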
Provider-Specific Behavior
| Provider | Boolean/Int/Float | String Enums | Recommendation |
|---|---|---|---|
| OpenAI (GPT-4o) | ✅ Excellent | ✅ Good | Best for strict schemas |
| Anthropic (Claude) | ✅ Excellent | ⚠️ Moderate | Good JSON, use strong prompts |
| Gemini (Flash) | ✅ Good | ❌ Limited | Boolean + post-processing |
| Others | Varies | ❌ Limited | Test thoroughly |
How It Works
- Input: The tool reads a CSV file and extracts specified columns
- Processing: For each row:
- Selected column data is formatted as JSON
- Combined with your prompt and sent to the LLM
- LLM generates structured JSON output
- Retry logic handles API errors automatically
- Output: Generated data is added as new columns with a configurable prefix
- Error Handling:
- Automatic retries for rate limits (HTTP 429) and transient errors (500, 502, 503, 504)
- Retries spaced by a configurable backoff schedule (see the sketch after this list)
- Graceful error logging for rows that fail after retries
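A minimal sketch of that retry loop, assuming LiteLLM's RateLimitError and ServiceUnavailableError exception classes and the delay schedule documented in config.py (the real pplyz/llm_client.py may differ):
```python
import time

import litellm

RETRY_BACKOFF_SCHEDULE = [1, 2, 3, 5, 10, 10, 10, 10, 10]  # seconds, mirroring config.py

def complete_with_retry(**kwargs):
    """Call litellm.completion, sleeping through the schedule on transient errors."""
    last_error = None
    for delay in [0] + RETRY_BACKOFF_SCHEDULE:  # first attempt, then one retry per delay
        if delay:
            time.sleep(delay)
        try:
            return litellm.completion(**kwargs)
        except (litellm.RateLimitError, litellm.ServiceUnavailableError) as exc:
            last_error = exc  # 429 / 5xx: wait, then try again
    raise last_error
```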
Supported Models
LLM Analyser supports 100+ models via LiteLLM. Common examples:
Gemini Models:
- gemini/gemini-2.5-flash-lite - Fast and lightweight (default)
- gemini/gemini-2.0-flash-lite - Fast and lightweight
- gemini/gemini-1.5-pro - High quality
- gemini/gemini-1.5-flash - Balanced
OpenAI Models:
- gpt-4o - Latest GPT-4o
- gpt-4o-mini - Cost-effective
- gpt-3.5-turbo - Fast
Anthropic Models:
- claude-3-5-sonnet-20241022 - High quality
- claude-3-haiku-20240307 - Fast
For the full list, run pplyz --list (or -l; from source, use uv run python main.py --list) or visit LiteLLM Providers.
Configuration Reference
Key configuration options can be found in pplyz/config.py and pplyz/settings.py:
- DEFAULT_MODEL: Default model (gemini/gemini-2.5-flash-lite). Override by setting PPLYZ_DEFAULT_MODEL in your config TOML or environment.
- USE_JSON_MODE: Force JSON output via LiteLLM (True)
- MAX_RETRIES: Maximum number of retry attempts (len(RETRY_BACKOFF_SCHEDULE) + 1)
- RETRY_BACKOFF_SCHEDULE: Fixed delays between retries ([1, 2, 3, 5, 10, 10, 10, 10, 10] seconds)
- REQUEST_DELAY: Delay between API requests (0.5 seconds)
- pplyz/settings.py: Resolves user-level configuration from ~/.config/pplyz/
Project Structure
pplyz/
├── main.py # Dev entry point (calls pplyz.cli)
├── pplyz/
│ ├── __init__.py # Package initialization
│ ├── cli.py # Command-line interface
│ ├── config.py # Static configuration settings
│ ├── llm_client.py # LLM API client with retry logic
│ ├── processor.py # CSV processing logic
│ ├── schemas.py # Output schema definitions
│ ├── settings.py # Runtime configuration loader
│ └── utils.py # Shared helpers
├── data/ # Input CSV files
├── pyproject.toml # Project dependencies
├── pplyz.local.toml.example # Example local configuration
└── README.md # This file
Error Handling
The tool includes comprehensive error handling:
- API Rate Limits: Automatic retry following the backoff schedule
- Transient Errors: Retries for temporary server issues
- Invalid JSON: Clear error messages when LLM output isn't valid JSON
- Missing Columns: Validation before processing starts
- Keyboard Interrupt: Graceful shutdown on Ctrl+C
Limitations
- Sequential Processing: Rows are processed one at a time to respect API rate limits
- API Costs: Each row requires an API call, which may incur costs (varies by provider)
- Model Constraints: Output quality depends on the LLM model's capabilities
- JSON Mode: Some providers may not support JSON mode (will fall back to prompt-based JSON)
Troubleshooting
API Key Error
Error: API key for gemini not found. Please set the GEMINI_API_KEY environment variable.
Solution: Make sure you've set the appropriate API key in pplyz.local.toml, ~/.config/pplyz/config.toml, or as an environment variable.
Rate Limit Errors
litellm.RateLimitError: 429 Rate limit exceeded
Solution: The tool retries automatically, waiting between attempts per the backoff schedule. If errors persist, consider:
- Increasing REQUEST_DELAY in config.py
- Using a different model with higher rate limits
- Upgrading your API plan
Invalid JSON Response
ValueError: Failed to parse LLM response as JSON
Solution: Make your prompt more specific about requiring JSON output, or try rephrasing the task.
Unexpected Output Values
Problem: The LLM returns values outside your expected range (e.g., "yes"/"no" instead of "high"/"medium"/"low").
Why this happens:
- String fields (str) only enforce the type, not specific values
- LLMs may interpret instructions creatively
- Enum constraints are not universally supported across providers
Solutions:
1. Use boolean fields for binary decisions (RECOMMENDED):
   --fields "is_relevant:bool,reason:str"
   Boolean fields strictly enforce true/false values across all providers.
2. Choose models with better instruction-following:
   - OpenAI GPT-4o: best enum constraint support
   - Anthropic Claude: strong instruction-following
   - Gemini Flash: good for general use
3. Strengthen your prompt:
   - Provide explicit examples
   - Use "You MUST output only: X, Y, or Z"
   - Add example output in the prompt
4. Use preview mode to validate:
   --preview --preview-rows 5
   Check the actual outputs before processing the full dataset.
Note: Even with strict prompts, string fields may return unexpected values. Consider post-processing for production use.
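If you do need post-processing, a small normalization pass over the finished CSV usually suffices. A sketch (the column name and allowed values are illustrative):
```python
import csv

ALLOWED = {"high", "medium", "low"}  # the values your prompt asked for

with open("results.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    value = row.get("confidence_level", "").strip().lower()
    # Collapse anything outside the expected set to a sentinel for manual review.
    row["confidence_level"] = value if value in ALLOWED else "unknown"
# ...then write rows back out with csv.DictWriter.
```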
Development
Running Tests
Install development dependencies:
uv sync --extra dev
Run all tests:
uv run pytest
Run tests with coverage:
uv run pytest --cov=pplyz --cov-report=html
Run specific test file:
uv run pytest tests/test_config.py
Run tests by marker:
uv run pytest -m unit
Test Structure
tests/
├── test_config.py # Configuration tests
├── test_llm_client.py # LLM client tests (mocked)
├── test_processor.py # CSV processor tests
├── test_main.py # CLI tests
└── fixtures/ # Test data
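The mocked LLM client tests can avoid real API calls by patching the completion call. A hypothetical sketch of the pattern (it patches litellm directly; the project's actual tests may target pplyz.llm_client instead):
```python
import json
from unittest.mock import MagicMock, patch

import litellm

def test_structured_output_is_parsed():
    # Fake LiteLLM response carrying the JSON payload we expect back.
    fake_response = MagicMock()
    fake_response.choices[0].message.content = '{"is_relevant": true, "reason": "on topic"}'

    # Patch completion so the test never hits a real API.
    with patch.object(litellm, "completion", return_value=fake_response) as mock_completion:
        raw = litellm.completion(model="gpt-4o-mini", messages=[{"role": "user", "content": "hi"}])
        data = json.loads(raw.choices[0].message.content)

    assert data["is_relevant"] is True
    mock_completion.assert_called_once()
```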
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
License
This project is licensed under MIT (see LICENSE) and is provided as-is for educational and research purposes.