
LLM-powered CSV data analyzer with structured output generation using LiteLLM (also known as LLM Analyser)

Project description

pplyz

Minimal CSV→LLM→CSV transformer powered by LiteLLM. Point it at a CSV, pick columns, describe the desired JSON schema, and pplyz writes the result back as new columns.

Quick run (uvx)

# No install: pulls pplyz from PyPI / TestPyPI into an ephemeral venv
uvx pplyz --input data/sample.csv --columns question,answer --output enriched.csv

Add --preview to try a handful of rows first. Use --list to print the supported models.

Requirements

  • Python 3.12+
  • uv
  • One LLM API key supported by LiteLLM (OpenAI, Gemini, Anthropic, Groq, etc.)

Configuration

pplyz loads configuration in this order (later wins):

  1. Existing environment variables
  2. ./pplyz.local.toml
  3. ~/.config/pplyz/config.toml (override via PPLYZ_CONFIG_DIR)

Start from the template:

cp pplyz.local.toml.example pplyz.local.toml

Example config:

[env]
OPENAI_API_KEY = "sk-..."

[pplyz]
default_model = "gpt-4o-mini"

Only set the providers you need. Common keys: OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY, GROQ_API_KEY, or PPLYZ_DEFAULT_MODEL to force a model string like groq/llama-3.1-8b-instant.

Local development

uv sync
uv run python main.py --help

Pre-commit runs pytest and Ruff automatically. Run uv tool run pre-commit install once before committing.

Support

  • Issues / PRs welcome.
  • Licensed under MIT (see LICENSE).

Preview a few rows before committing to a full run:

pplyz --input data/yourfile.csv --columns title,abstract --preview --preview-rows 5

Examples

Example 1: Sentiment Analysis

pplyz --input data/reviews.csv --columns review_text --output sentiment_results.csv

When prompted, enter:

Analyze the sentiment and classify as positive, negative, or neutral. Also provide a confidence score.

Example 2: Academic Paper Analysis

pplyz --input data/hsa-miR-144-3p.csv --columns title,abstract --output analysis.csv

When prompted, enter:

Extract the following information: research_topic, methodology, key_findings, and clinical_significance

Example 3: Using Different LLM Providers

# Use Gemini 2.5 Flash Lite (default)
pplyz --input data/papers.csv --columns title,abstract --output results.csv

# Use OpenAI GPT-4o
pplyz --input data/papers.csv --columns title,abstract --output results.csv --model gpt-4o

# Use Anthropic Claude 3.5 Sonnet
pplyz --input data/papers.csv --columns title,abstract --output results.csv --model claude-3-5-sonnet-20241022

# Use OpenAI GPT-4o Mini (cost-effective)
pplyz --input data/papers.csv --columns title,abstract --output results.csv --model gpt-4o-mini

Example 4: Boolean-Based Classification (Recommended)

pplyz \
  --input data/articles.csv \
  --columns title,abstract \
  --fields 'is_relevant:bool,summary:str' \
  --output classified.csv

When prompted, enter:

Determine if this article is relevant to climate change research (true/false).
Provide a one-sentence summary.

Why this works well: Boolean fields guarantee only true/false values, avoiding ambiguous responses like "yes", "maybe", or "unclear".

Example 5: List Available Models

pplyz --list

Command-Line Options

| Option | Short | Description | Required |
| --- | --- | --- | --- |
| --input | -i | Path to input CSV file | Yes |
| --columns | -c | Comma-separated list of columns to use as LLM input | Yes |
| --fields | -f | Field definition string (e.g., "field1:str,field2:int"); required to keep output columns consistent | Yes |
| --output | -o | Path to output CSV file (defaults to overwriting the input file when omitted) | No |
| --model | -m | LLM model name (default: PPLYZ_DEFAULT_MODEL or gemini/gemini-2.5-flash-lite) | No |
| --list | -l | List supported models and exit | No |
| --preview | -p | Preview results on sample rows without saving | No |
| --preview-rows | | Number of rows to preview (default: 3) | No |
| --no-resume | -R | Reprocess all rows even if output exists | No |

Note: When --output is omitted, the processor overwrites the input CSV in place. Use --preview first or point --output to a different file if you want to keep the original rows untouched.

Best Practices for Field Definitions

Defining Fields

Use --fields (required) to describe the JSON structure you expect back. The CLI enforces this flag to keep output columns consistent:

--fields "is_relevant:bool,summary:str,keywords:list[str]"

Any field without :type defaults to str. Supported types: bool, int, float, str, list[str], list[int], list[float], list[bool], and dict.
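
As a rough illustration of the --fields syntax, a spec string can be thought of as mapping field names to Python types. This is a hypothetical sketch only, not pplyz's actual parser:

# Hypothetical sketch: turn a --fields spec into a name -> type mapping.
# Illustrative only; pplyz's real parser may differ.
FIELD_TYPES = {
    "bool": bool, "int": int, "float": float, "str": str,
    "list[str]": list, "list[int]": list, "list[float]": list, "list[bool]": list,
    "dict": dict,
}

def parse_fields(spec: str) -> dict[str, type]:
    fields = {}
    for item in spec.split(","):
        name, _, type_name = item.strip().partition(":")
        fields[name] = FIELD_TYPES.get(type_name or "str", str)  # no ':type' defaults to str
    return fields

print(parse_fields("is_relevant:bool,summary:str,keywords:list[str]"))
# {'is_relevant': <class 'bool'>, 'summary': <class 'str'>, 'keywords': <class 'list'>}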

Field Design Patterns

Pattern 1: Binary Classification (Most Reliable) ✅

Use boolean fields for yes/no decisions - these are strictly enforced across all LLM providers:

pplyz \
  --input data.csv \
  --columns title,abstract \
  --fields "is_relevant:bool,reason:str" \
  --output results.csv

Why boolean? Unlike string fields, boolean type guarantees only true or false values - no unexpected strings like "yes", "maybe", or "unknown".

Pattern 2: Multi-level Classification (Requires Care) ⚠️

For granular classifications (e.g., high/medium/low), understand the limitations:

--fields "relevance:str,confidence:str,notes:str"

Important: String fields don't enforce enum constraints. LLMs may return unexpected values.

Best Practice:

  1. Use very explicit prompts listing all allowed values
  2. Consider post-processing for production use
  3. Consider using boolean flags instead (see Pattern 3)

Pattern 3: Hybrid Approach (Recommended for Production) 🎯

Combine boolean gates with descriptive fields:

--fields "is_dish_related:bool,confidence_level:str,explanation:str"

Then in your prompt:

First, determine if the article is DISH-related (true/false).
Then rate confidence as "high", "medium", or "low".
Provide a brief explanation.

Advantages:

  • Boolean field provides reliable binary classification
  • String fields provide additional context
  • Works consistently across all providers

Supported Field Types

| Type | Enforcement Level | Use Case | Example |
| --- | --- | --- | --- |
| bool | ✅ Strict (true/false only) | Binary decisions | "is_relevant": "bool" |
| int | ✅ Strict (integers only) | Counts, scores | "score": "int" |
| float | ✅ Strict (numbers only) | Decimal scores | "confidence": "float" |
| str | ⚠️ Flexible (any text) | Descriptions, reasons | "summary": "str" |
| list[str] | ⚠️ Flexible (JSON array) | Tags, keywords | "tags": "list[str]" |

Understanding Output Validation Levels

LLM Analyser supports three levels of validation through LiteLLM:

  1. JSON Syntax Only (basic mode)

    • Ensures valid JSON structure
    • No field or type validation
  2. Type Validation (with --fields)

    • Enforces field presence and types
    • Boolean, int, float strictly enforced
    • Strings accept any text
  3. Enum/Pattern Validation (limited support)

    • Only supported by some models (e.g., OpenAI GPT-4o)
    • Not universally available
    • Requires post-processing for most models
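
For reference, here is a minimal sketch of requesting JSON-mode output directly through LiteLLM (level 1 above). It assumes litellm is installed and an OpenAI API key is configured; the exact calls pplyz makes internally may differ:

# Minimal sketch: request JSON-formatted output via LiteLLM's OpenAI-compatible API.
import json
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": 'Reply with a JSON object like {"is_relevant": true, "summary": "..."}.'},
        {"role": "user", "content": "Title: ...\nAbstract: ..."},
    ],
    response_format={"type": "json_object"},  # JSON mode; not supported by every provider
)
print(json.loads(response.choices[0].message.content))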

Provider-Specific Behavior

| Provider | Boolean/Int/Float | String Enums | Recommendation |
| --- | --- | --- | --- |
| OpenAI (GPT-4o) | ✅ Excellent | ✅ Good | Best for strict schemas |
| Anthropic (Claude) | ✅ Excellent | ⚠️ Moderate | Good JSON, use strong prompts |
| Gemini (Flash) | ✅ Good | ❌ Limited | Boolean + post-processing |
| Others | Varies | ❌ Limited | Test thoroughly |

How It Works

  1. Input: The tool reads a CSV file and extracts specified columns
  2. Processing: For each row:
    • Selected column data is formatted as JSON
    • Combined with your prompt and sent to the LLM
    • LLM generates structured JSON output
    • Retry logic handles API errors automatically
  3. Output: Generated data is added as new columns with a configurable prefix
  4. Error Handling:
    • Automatic retries for rate limits (HTTP 429) and transient errors (500, 502, 503, 504)
    • Exponential backoff with configurable wait times
    • Graceful error logging for rows that fail after retries
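
A simplified sketch of that flow (hypothetical code, not pplyz's implementation; the analyze stub stands in for the LLM call and the llm_ prefix is illustrative):

# Sketch of the read -> analyze -> write-back flow described above.
import csv

def analyze(row_subset: dict) -> dict:
    """Stand-in for the LLM call; pretend it returns the structured JSON result."""
    return {"is_relevant": True, "summary": "..."}

with open("data/papers.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

columns = ["title", "abstract"]
for row in rows:
    result = analyze({c: row.get(c, "") for c in columns})
    for key, value in result.items():
        row[f"llm_{key}"] = value  # generated data lands in new, prefixed columns

with open("enriched.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)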

Supported Models

LLM Analyser supports 100+ models via LiteLLM. Common examples:

Gemini Models:

  • gemini/gemini-2.5-flash-lite - Fast and lightweight (default)
  • gemini/gemini-2.0-flash-lite - Fast and lightweight
  • gemini/gemini-1.5-pro - High quality
  • gemini/gemini-1.5-flash - Balanced

OpenAI Models:

  • gpt-4o - Latest GPT-4o
  • gpt-4o-mini - Cost-effective
  • gpt-3.5-turbo - Fast

Anthropic Models:

  • claude-3-5-sonnet-20241022 - High quality
  • claude-3-haiku-20240307 - Fast

For the full list, run pplyz --list (or -l); from a source checkout, use uv run python main.py --list instead. You can also visit LiteLLM Providers.

Configuration

Key configuration options can be found in pplyz/config.py and pplyz/settings.py:

  • DEFAULT_MODEL: Default model (gemini/gemini-2.5-flash-lite). Override by setting PPLYZ_DEFAULT_MODEL in your config TOML or environment.
  • USE_JSON_MODE: Force JSON output via LiteLLM (True)
  • MAX_RETRIES: Maximum number of retry attempts (len(RETRY_BACKOFF_SCHEDULE) + 1)
  • RETRY_BACKOFF_SCHEDULE: Fixed delays between retries ([1, 2, 3, 5, 10, 10, 10, 10, 10] seconds)
  • REQUEST_DELAY: Delay between API requests (0.5 seconds)
  • pplyz/settings.py: Resolves user-level configuration from ~/.config/pplyz/
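
To make the retry settings concrete, here is a rough sketch of how a fixed backoff schedule like the one above can drive retries. This is illustrative only, not the actual pplyz implementation:

# Sketch: retry a request using a fixed backoff schedule (len(schedule) + 1 attempts).
import time

RETRY_BACKOFF_SCHEDULE = [1, 2, 3, 5, 10, 10, 10, 10, 10]  # seconds to wait before each retry

def call_with_retries(make_request):
    last_error = None
    for delay in [0] + RETRY_BACKOFF_SCHEDULE:  # first attempt waits 0 seconds
        if delay:
            time.sleep(delay)
        try:
            return make_request()
        except Exception as exc:  # in practice, only rate-limit / transient errors are retried
            last_error = exc
    raise last_error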

Project Structure

pplyz/
├── main.py                 # Dev entry point (calls pplyz.cli)
├── pplyz/
│   ├── __init__.py        # Package initialization
│   ├── cli.py             # Command-line interface
│   ├── config.py          # Static configuration settings
│   ├── llm_client.py      # LLM API client with retry logic
│   ├── processor.py       # CSV processing logic
│   ├── schemas.py         # Output schema definitions
│   ├── settings.py        # Runtime configuration loader
│   └── utils.py           # Shared helpers
├── data/                   # Input CSV files
├── pyproject.toml         # Project dependencies
├── pplyz.local.toml.example # Example local configuration
└── README.md              # This file

Error Handling

The tool includes comprehensive error handling:

  • API Rate Limits: Automatic retry with exponential backoff
  • Transient Errors: Retries for temporary server issues
  • Invalid JSON: Clear error messages when LLM output isn't valid JSON
  • Missing Columns: Validation before processing starts
  • Keyboard Interrupt: Graceful shutdown on Ctrl+C

Limitations

  • Sequential Processing: Rows are processed one at a time to respect API rate limits
  • API Costs: Each row requires an API call, which may incur costs (varies by provider)
  • Model Constraints: Output quality depends on the LLM model's capabilities
  • JSON Mode: Some providers may not support JSON mode (will fall back to prompt-based JSON)

Troubleshooting

API Key Error

Error: API key for gemini not found. Please set the GEMINI_API_KEY environment variable.

Solution: Make sure you've set the appropriate API key in pplyz.local.toml, ~/.config/pplyz/config.toml, or as an environment variable.

Rate Limit Errors

litellm.RateLimitError: 429 Rate limit exceeded

Solution: The tool will automatically retry with exponential backoff. If errors persist, consider:

  • Increasing REQUEST_DELAY in config.py
  • Using a different model with higher rate limits
  • Upgrading your API plan

Invalid JSON Response

ValueError: Failed to parse LLM response as JSON

Solution: Make your prompt more specific about requiring JSON output, or try rephrasing the task.
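
If a model wraps its JSON in Markdown fences or surrounding prose, a small cleanup step before parsing often recovers the object. A hedged sketch, not part of pplyz:

# Sketch: strip common wrappers from an LLM reply before parsing it as JSON.
import json
import re

def parse_loose_json(text: str) -> dict:
    text = text.strip()
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)  # remove ```json ... ``` fences
    if not text.startswith("{"):
        match = re.search(r"\{.*\}", text, flags=re.DOTALL)  # fall back to the first {...} block
        if match:
            text = match.group(0)
    return json.loads(text)

print(parse_loose_json('```json\n{"is_relevant": true}\n```'))  # {'is_relevant': True}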

Unexpected Output Values

Problem: LLM returns values outside your expected range (e.g., "yes"/"no" instead of "high"/"medium"/"low").

Why this happens:

  • String fields ("str") only enforce type, not specific values
  • LLMs may interpret instructions creatively
  • Enum constraints are not universally supported across all providers

Solutions:

  1. Use boolean fields for binary decisions (RECOMMENDED):

    --fields "is_relevant:bool,reason:str"
    

    Boolean fields strictly enforce true/false values across all providers.

  2. Choose models with better instruction-following:

    • OpenAI GPT-4o: Best enum constraint support
    • Anthropic Claude: Strong instruction-following
    • Gemini Flash: Good for general use
  3. Strengthen your prompt:

    • Provide explicit examples
    • Use "You MUST output only: X, Y, or Z"
    • Add example output in the prompt
  4. Use preview mode to validate:

    --preview --preview-rows 5
    

    Check actual outputs before processing full dataset.

Note: Even with strict prompts, string fields may return unexpected values. Consider post-processing for production use.
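
For example, a small post-processing pass can map free-form strings back onto the values you expect (a sketch; the allowed values here are just an example):

# Sketch: normalize a free-form string field to an expected set of values after processing.
ALLOWED = {"high", "medium", "low"}
SYNONYMS = {"hi": "high", "middle": "medium", "med": "medium", "lo": "low"}

def normalize_confidence(value, default="low"):
    value = str(value).strip().lower()
    if value in ALLOWED:
        return value
    return SYNONYMS.get(value, default)

print(normalize_confidence("Middle"))   # medium
print(normalize_confidence("unclear"))  # low (falls back to the default)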

Development

Running Tests

Install development dependencies:

uv sync --extra dev

Run all tests:

uv run pytest

Run tests with coverage:

uv run pytest --cov=pplyz --cov-report=html

Run specific test file:

uv run pytest tests/test_config.py

Run tests by marker:

uv run pytest -m unit

Test Structure

tests/
├── test_config.py       # Configuration tests
├── test_llm_client.py   # LLM client tests (mocked)
├── test_processor.py    # CSV processor tests
├── test_main.py         # CLI tests
└── fixtures/            # Test data
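
As an illustration of the mocked-client approach, a test can patch litellm.completion and assert on the parsed result. The analyze_row helper below is hypothetical and only stands in for the real client code:

# Hypothetical example of mocking the LLM call in a test; names are illustrative.
import json
from unittest.mock import MagicMock, patch

import litellm

def analyze_row(row: dict, model: str = "gpt-4o-mini") -> dict:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": json.dumps(row)}],
    )
    return json.loads(response.choices[0].message.content)

@patch("litellm.completion")
def test_analyze_row_parses_json(mock_completion):
    fake = MagicMock()
    fake.choices[0].message.content = '{"is_relevant": true, "summary": "ok"}'
    mock_completion.return_value = fake

    result = analyze_row({"title": "t", "abstract": "a"})

    assert result == {"is_relevant": True, "summary": "ok"}
    mock_completion.assert_called_once()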

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

This project is provided as-is for educational and research purposes.

Acknowledgments
