
DataOps LLM Engine

Python 3.10+ · License: MIT

LLM-powered data operations for Excel/CSV files using natural language

DataOps LLM Engine is a standalone Python SDK that allows you to perform arbitrary data operations on Excel/CSV files using natural language instructions. It uses Large Language Models (LLMs) to generate and execute safe Python code for data manipulation, without requiring predefined tools.

Features

  • Natural Language Interface: Describe what you want in plain English
  • Multi-Provider LLM Support: Works with OpenAI, Anthropic, Google, and 100+ providers via LiteLLM
  • 7-Layer Security Model: AST validation, import whitelisting, subprocess isolation, and more
  • Flexible Input: CSV, Excel files, or pandas DataFrames
  • Dry-Run Mode: Preview generated code before execution
  • REST API: Optional FastAPI wrapper for HTTP access
  • Zero Configuration: Works out of the box with sensible defaults

Quick Start

Installation

pip install dataops-llm

Or install from source:

git clone https://github.com/yourusername/dataops-llm-engine.git
cd dataops-llm-engine
pip install -e .

Basic Usage

from dataops_llm import process

# Process a CSV file with natural language
result = process(
    file_path="companies.csv",
    instruction="Remove duplicates by email and normalize company names to lowercase"
)

if result.success:
    result.save("companies_cleaned.csv")
    print(result.report)

Configuration

Create a .env file with your LLM API key:

LITELLM_API_KEY=sk-...
LITELLM_MODEL=gpt-4

Or pass configuration directly:

result = process(
    file_path="data.csv",
    instruction="Filter rows where revenue > 1000000",
    llm_config={
        "api_key": "sk-...",
        "model": "claude-3-5-sonnet-20241022"
    }
)

How It Works

User Instruction
    ↓
1. Intent Extraction (LLM) → Structured intent
    ↓
2. Execution Planning (LLM) → Step-by-step plan
    ↓
3. Code Generation (LLM) → Pandas code
    ↓
4. Code Validation (AST) → Security checks
    ↓
5. Sandbox Execution (subprocess) → Result
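The five stages above can be sketched as a simple function chain. This is a minimal illustration of the pipeline shape, not the engine's actual implementation; every function name here is hypothetical, and the LLM stages are replaced with stubs:

```python
import ast

# Hypothetical sketch of the five-stage pipeline; each stage is a stub
# standing in for an LLM call or a sandbox run.

def extract_intent(instruction: str) -> dict:
    # Stage 1: an LLM would turn free text into a structured intent.
    return {"action": "deduplicate", "raw": instruction}

def plan(intent: dict) -> list[str]:
    # Stage 2: an LLM would expand the intent into ordered steps.
    return [f"step: {intent['action']}"]

def generate_code(steps: list[str]) -> str:
    # Stage 3: an LLM would emit pandas code for the steps.
    return "df = df.drop_duplicates()"

def validate(code: str) -> str:
    # Stage 4: AST-based security checks would run here.
    ast.parse(code)  # at minimum, the code must parse
    return code

def execute(code: str) -> str:
    # Stage 5: the real engine runs this in an isolated subprocess.
    return f"executed: {code}"

result = execute(validate(generate_code(plan(extract_intent("Remove duplicates")))))
print(result)
```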

Examples

Example 1: Data Cleaning

from dataops_llm import process

result = process(
    file_path="messy_data.csv",
    instruction="""
    1. Remove rows with null emails
    2. Trim whitespace from all text columns
    3. Convert dates to ISO format
    4. Remove duplicate rows based on email
    """
)

Example 2: Aggregation

result = process(
    file_path="sales.xlsx",
    instruction="Group by region and calculate total sales and average order value"
)
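For an instruction like this, the generated code might resemble the following pandas. This is a hand-written approximation, not actual engine output; the toy data and column names (`region`, `sales`, `order_value`) are assumptions standing in for sales.xlsx:

```python
import pandas as pd

# Toy sales data standing in for sales.xlsx (column names are assumptions).
df = pd.DataFrame({
    "region": ["East", "East", "West"],
    "sales": [100.0, 200.0, 50.0],
    "order_value": [10.0, 20.0, 5.0],
})

# Group by region: total sales and average order value per region.
summary = df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_order_value=("order_value", "mean"),
).reset_index()

print(summary)
```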

Example 3: Dry-Run Mode

result = process(
    file_path="data.csv",
    instruction="Filter rows where age > 25",
    dry_run=True,
    return_code=True
)

print(result.generated_code)  # See what code would be executed

Example 4: Working with DataFrames

import pandas as pd
from dataops_llm import process

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

result = process(
    file_path=df,
    instruction="Add a column 'age_group' categorizing ages as young/middle/senior"
)

Security

DataOps LLM Engine implements a 7-layer security model to prevent malicious code execution:

  1. LLM Prompt Engineering: System prompts explicitly forbid dangerous operations
  2. Import Whitelisting: Only pandas, numpy, datetime, re, math allowed
  3. AST Validation: Parse and inspect code before execution
  4. Call Blacklisting: Block eval, exec, open, subprocess, network calls
  5. Subprocess Isolation: Code runs in separate process
  6. Resource Limits: 60s timeout, 512MB memory limit
  7. File System Isolation: Execution in temporary directory

Allowed imports: pandas, numpy, datetime, re, math

Blocked operations: File I/O, network access, subprocess execution, eval/exec, pickling
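Layers 2–4 can be illustrated with a small AST check. This is a simplified sketch of the idea (whitelisting imports and blacklisting dangerous calls), not the engine's actual validator:

```python
import ast

# Whitelist/blacklist values taken from this README; the checker itself
# is a simplified sketch, not the engine's real validator.
ALLOWED_IMPORTS = {"pandas", "numpy", "datetime", "re", "math"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}

def check_code(code: str) -> list[str]:
    """Return a list of violations found in the code (empty means it passed)."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import not whitelisted: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import not whitelisted: {node.module}")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
                violations.append(f"blocked call: {node.func.id}")
    return violations

print(check_code("import os\nopen('x')"))  # two violations found
print(check_code("import pandas as pd"))   # no violations
```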

See docs/security.md for detailed security documentation.

API Reference

process()

Main SDK function for processing data.

Parameters:

  • file_path (str | Path | DataFrame): Input file path or DataFrame
  • instruction (str): Natural language instruction
  • llm_config (dict, optional): LLM configuration
    • api_key: LLM provider API key
    • model: Model name (default: "gpt-4")
    • temperature: Sampling temperature (default: 0.1)
    • max_tokens: Maximum tokens (default: 2000)
  • sandbox_config (dict, optional): Sandbox configuration
    • timeout: Max execution time in seconds (default: 60)
    • memory_limit_mb: Max memory in MB (default: 512)
  • dry_run (bool): Preview mode without execution (default: False)
  • return_code (bool): Include generated code in result (default: False)

Returns:

  • DataOpsResult: Result object with:
    • success: Whether operation succeeded
    • dataframe: Resulting DataFrame
    • report: Human-readable report
    • generated_code: Generated code (if requested)
    • execution_time: Time taken in seconds
    • warnings: Warning messages
    • metadata: Additional metadata
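The documented fields can be pictured as a plain dataclass. This is an illustrative stand-in mirroring the field list above; the actual DataOpsResult class may be defined differently:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class DataOpsResultSketch:
    """Illustrative stand-in mirroring the documented DataOpsResult fields."""
    success: bool                          # whether the operation succeeded
    dataframe: Any = None                  # resulting DataFrame
    report: str = ""                       # human-readable report
    generated_code: Optional[str] = None   # only populated when return_code=True
    execution_time: float = 0.0            # time taken in seconds
    warnings: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

r = DataOpsResultSketch(success=True, report="2 duplicate rows removed")
print(r.success, r.report)
```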

REST API

Run the FastAPI server:

python -m dataops_llm.web.app

Or with uvicorn:

uvicorn dataops_llm.web.app:app --reload

API endpoints:

  • GET / - API information
  • GET /api/v1/health - Health check
  • POST /api/v1/process - Process data

Example request:

import requests
import base64

with open("data.csv", "rb") as f:
    file_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8000/api/v1/process",
    json={
        "instruction": "Remove duplicates",
        "file_base64": file_base64,
        "file_format": "csv"
    }
)

result = response.json()

Configuration Options

Environment Variables

# LLM Configuration
LITELLM_API_KEY=sk-...
LITELLM_MODEL=gpt-4
LITELLM_TEMPERATURE=0.1
LITELLM_MAX_TOKENS=2000

# Sandbox Configuration
SANDBOX_TIMEOUT=60
SANDBOX_MEMORY_MB=512

# Application
APP_LOG_LEVEL=INFO

See .env.example for all options.
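A typical pattern for reading these variables with the documented defaults looks like the following. This is a generic sketch, not the engine's actual config loader:

```python
import os

def get_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

# Defaults as documented in this README.
sandbox_timeout = get_int("SANDBOX_TIMEOUT", 60)
sandbox_memory_mb = get_int("SANDBOX_MEMORY_MB", 512)
print(sandbox_timeout, sandbox_memory_mb)
```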

Supported LLM Providers

Via LiteLLM, supports 100+ providers:

  • OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet, claude-3-opus
  • Google: gemini-pro, gemini-ultra
  • Azure OpenAI: All OpenAI models via Azure
  • AWS Bedrock: Claude, Llama, etc.
  • And many more...

Limitations

  • File Size: Max 1M rows × 1K columns (configurable)
  • Execution Time: Max 60 seconds (configurable)
  • Memory: Max 512MB (configurable)
  • Operations: Only pandas-compatible operations
  • Not Supported: File I/O, network access, subprocess execution
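The execution-time limit can be illustrated with the standard-library pattern for running code in a separate process with a timeout. This is a generic sketch of the mechanism; the engine's sandbox adds memory limits and filesystem isolation on top:

```python
import subprocess
import sys

def run_limited(code: str, timeout_s: float) -> bool:
    """Run Python code in a child process; return True if it finished in time."""
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout_s, check=False)
        return True
    except subprocess.TimeoutExpired:
        # The child is killed when the timeout elapses.
        return False

print(run_limited("print('ok')", 10))                  # finishes in time: True
print(run_limited("import time; time.sleep(5)", 0.5))  # killed by timeout: False
```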

Development

Setup

# Clone repository
git clone https://github.com/yourusername/dataops-llm-engine.git
cd dataops-llm-engine

# Install dependencies
pip install -e ".[dev]"

# Copy environment template
cp .env.example .env
# Edit .env with your API keys

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=dataops_llm

# Run security tests only
pytest tests/test_sandbox/test_security.py

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Citation

If you use this project in your research, please cite:

@software{dataops_llm_engine,
  title={DataOps LLM Engine: Natural Language Data Operations},
  author={Islam Abd-Elhady},
  year={2025},
  url={https://github.com/yourusername/dataops-llm-engine}
}

Acknowledgments

  • Built with LiteLLM for multi-provider LLM support
  • Uses pandas for data manipulation
  • Powered by FastAPI for REST API

Roadmap

  • Multi-step workflow chaining
  • SQL database support
  • Custom function definitions
  • Operation history and rollback
  • Web UI for non-developers
  • Docker-based sandbox option
  • Response caching
  • Streaming progress updates

Download files

Download the file for your platform.

Source Distribution

dataops_llm-0.1.0.tar.gz (40.9 kB)

Uploaded Source

Built Distribution


dataops_llm-0.1.0-py3-none-any.whl (44.9 kB)

Uploaded Python 3

File details

Details for the file dataops_llm-0.1.0.tar.gz.

File metadata

  • Download URL: dataops_llm-0.1.0.tar.gz
  • Upload date:
  • Size: 40.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataops_llm-0.1.0.tar.gz:

  • SHA256: b2f4fe7c7478a3177b6f121ff2bd9ca8864c81125cde510a72c64c02d4aa892f
  • MD5: 7aff8ce95547894f535f490d30204186
  • BLAKE2b-256: 0e3049ded2d672515ff339db37d47153785dcdd2a329dc83382bb7c65c905ed0


Provenance

The following attestation bundles were made for dataops_llm-0.1.0.tar.gz:

Publisher: publish.yml on Islam-hady9/dataops-llm-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dataops_llm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: dataops_llm-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 44.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataops_llm-0.1.0-py3-none-any.whl:

  • SHA256: 0f5f9c625bca4a4d92a664b34c0be689bd76b32720eac5cb04d209479f4aad3c
  • MD5: 4bb6ccb7d8cef4baa18b76ee12d0816e
  • BLAKE2b-256: 6a1ca3297cc841ec96a8c0b20913ea1707b32fae496e8478e2b8a62473d70997


Provenance

The following attestation bundles were made for dataops_llm-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Islam-hady9/dataops-llm-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
