DataOps LLM Engine
LLM-powered data operations for Excel/CSV files using natural language
DataOps LLM Engine is a standalone Python SDK that allows you to perform arbitrary data operations on Excel/CSV files using natural language instructions. It uses Large Language Models (LLMs) to generate and execute safe Python code for data manipulation, without requiring predefined tools.
Features
- Natural Language Interface: Describe what you want in plain English
- Multi-Provider LLM Support: Works with OpenAI, Anthropic, Google, and 100+ providers via LiteLLM
- 7-Layer Security Model: AST validation, import whitelisting, subprocess isolation, and more
- Flexible Input: CSV, Excel files, or pandas DataFrames
- Dry-Run Mode: Preview generated code before execution
- REST API: Optional FastAPI wrapper for HTTP access
- Zero Configuration: Works out of the box with sensible defaults
Quick Start
Installation
pip install dataops-llm
Or install from source:
git clone https://github.com/yourusername/dataops-llm-engine.git
cd dataops-llm-engine
pip install -e .
Basic Usage
from dataops_llm import process
# Process a CSV file with natural language
result = process(
    file_path="companies.csv",
    instruction="Remove duplicates by email and normalize company names to lowercase",
)
if result.success:
    result.save("companies_cleaned.csv")
    print(result.report)
Configuration
Create a .env file with your LLM API key:
LITELLM_API_KEY=sk-...
LITELLM_MODEL=gpt-4
Or pass configuration directly:
result = process(
    file_path="data.csv",
    instruction="Filter rows where revenue > 1000000",
    llm_config={
        "api_key": "sk-...",
        "model": "claude-3-5-sonnet-20241022",
    },
)
How It Works
User Instruction
↓
1. Intent Extraction (LLM) → Structured intent
↓
2. Execution Planning (LLM) → Step-by-step plan
↓
3. Code Generation (LLM) → Pandas code
↓
4. Code Validation (AST) → Security checks
↓
5. Sandbox Execution (subprocess) → Result
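Step 5 can be pictured with a minimal sketch: the generated code is written to a script in a temporary directory and executed in a separate process with a wall-clock timeout. This is illustrative only; the actual sandbox also enforces memory limits and the other restrictions listed under Security.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_sandboxed(code: str, timeout: int = 60) -> str:
    """Run generated code in a child process with a timeout.

    Sketch only: the real sandbox additionally applies memory limits
    and filesystem isolation.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        script = Path(tmpdir) / "generated.py"
        script.write_text(code)
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=tmpdir,              # execute inside the temp directory
            capture_output=True,
            text=True,
            timeout=timeout,         # raises TimeoutExpired if exceeded
        )
        if result.returncode != 0:
            raise RuntimeError(result.stderr)
        return result.stdout

print(run_sandboxed("print(2 + 2)"))  # → 4
```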
Examples
Example 1: Data Cleaning
from dataops_llm import process
result = process(
    file_path="messy_data.csv",
    instruction="""
    1. Remove rows with null emails
    2. Trim whitespace from all text columns
    3. Convert dates to ISO format
    4. Remove duplicate rows based on email
    """,
)
Example 2: Aggregation
result = process(
    file_path="sales.xlsx",
    instruction="Group by region and calculate total sales, average order value",
)
Example 3: Dry-Run Mode
result = process(
    file_path="data.csv",
    instruction="Filter rows where age > 25",
    dry_run=True,
    return_code=True,
)
print(result.generated_code)  # See what code would be executed
Example 4: Working with DataFrames
import pandas as pd
from dataops_llm import process
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})
result = process(
    file_path=df,
    instruction="Add a column 'age_group' categorizing ages as young/middle/senior",
)
Security
DataOps LLM Engine implements a 7-layer security model to prevent malicious code execution:
- LLM Prompt Engineering: System prompts explicitly forbid dangerous operations
- Import Whitelisting: Only pandas, numpy, datetime, re, math allowed
- AST Validation: Parse and inspect code before execution
- Call Blacklisting: Block eval, exec, open, subprocess, network calls
- Subprocess Isolation: Code runs in separate process
- Resource Limits: 60s timeout, 512MB memory limit
- File System Isolation: Execution in temporary directory
Allowed imports: pandas, numpy, datetime, re, math
Blocked operations: File I/O, network access, subprocess execution, eval/exec, pickling
See docs/security.md for detailed security documentation.
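The import-whitelisting and call-blacklisting layers can be sketched with the standard library's `ast` module: parse the generated code, walk the tree, and collect any import outside the whitelist or call to a blocked builtin. This is a simplified illustration of the technique, not the engine's actual validator.

```python
import ast

# Whitelist/blacklist as documented above
ALLOWED_IMPORTS = {"pandas", "numpy", "datetime", "re", "math"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}

def validate(code: str) -> list[str]:
    """Return a list of security violations found in the code.

    Sketch only: checks top-level module names against the whitelist
    and direct calls to blocked builtins.
    """
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"blocked call: {node.func.id}")
    return violations

print(validate("import os\neval('1+1')"))
# → ['import not allowed: os', 'blocked call: eval']
```

A real validator would also catch indirect access such as `getattr(__builtins__, "eval")`, which is why AST checks are only one layer of the model rather than the whole defense.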
API Reference
process()
Main SDK function for processing data.
Parameters:
- file_path (str | Path | DataFrame): Input file path or DataFrame
- instruction (str): Natural language instruction
- llm_config (dict, optional): LLM configuration
  - api_key: LLM provider API key
  - model: Model name (default: "gpt-4")
  - temperature: Sampling temperature (default: 0.1)
  - max_tokens: Maximum tokens (default: 2000)
- sandbox_config (dict, optional): Sandbox configuration
  - timeout: Max execution time in seconds (default: 60)
  - memory_limit_mb: Max memory in MB (default: 512)
- dry_run (bool): Preview mode without execution (default: False)
- return_code (bool): Include generated code in result (default: False)
Returns:
- DataOpsResult: Result object with:
  - success: Whether the operation succeeded
  - dataframe: Resulting DataFrame
  - report: Human-readable report
  - generated_code: Generated code (if requested)
  - execution_time: Time taken in seconds
  - warnings: Warning messages
  - metadata: Additional metadata
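The documented result shape can be pictured as a simple dataclass. This is a sketch of the fields listed above, not the SDK's actual class definition:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class DataOpsResult:
    """Illustrative shape of the result object documented above."""
    success: bool
    dataframe: Any = None                  # a pandas.DataFrame in practice
    report: str = ""
    generated_code: Optional[str] = None   # populated when return_code=True
    execution_time: float = 0.0
    warnings: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

r = DataOpsResult(success=True, report="Removed 3 duplicate rows")
print(r.success, r.warnings)  # → True []
```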
REST API
Run the FastAPI server:
python -m dataops_llm.web.app
Or with uvicorn:
uvicorn dataops_llm.web.app:app --reload
API endpoints:
- GET / - API information
- GET /api/v1/health - Health check
- POST /api/v1/process - Process data
Example request:
import requests
import base64

with open("data.csv", "rb") as f:
    file_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8000/api/v1/process",
    json={
        "instruction": "Remove duplicates",
        "file_base64": file_base64,
        "file_format": "csv",
    },
)
result = response.json()
Configuration Options
Environment Variables
# LLM Configuration
LITELLM_API_KEY=sk-...
LITELLM_MODEL=gpt-4
LITELLM_TEMPERATURE=0.1
LITELLM_MAX_TOKENS=2000
# Sandbox Configuration
SANDBOX_TIMEOUT=60
SANDBOX_MEMORY_MB=512
# Application
APP_LOG_LEVEL=INFO
See .env.example for all options.
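Reading these variables with the documented defaults can be sketched with the standard library alone (the SDK's own configuration loader may differ, e.g. it may use python-dotenv to read the .env file first):

```python
import os

def load_llm_config() -> dict:
    """Read LLM settings from the environment, falling back to the
    defaults documented above. Sketch only."""
    return {
        "api_key": os.environ.get("LITELLM_API_KEY"),
        "model": os.environ.get("LITELLM_MODEL", "gpt-4"),
        "temperature": float(os.environ.get("LITELLM_TEMPERATURE", "0.1")),
        "max_tokens": int(os.environ.get("LITELLM_MAX_TOKENS", "2000")),
    }

cfg = load_llm_config()
print(cfg["model"], cfg["temperature"])
```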
Supported LLM Providers
Via LiteLLM, supports 100+ providers:
- OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo
- Anthropic: claude-3-5-sonnet, claude-3-opus
- Google: gemini-pro, gemini-ultra
- Azure OpenAI: All OpenAI models via Azure
- AWS Bedrock: Claude, Llama, etc.
- And many more...
Limitations
- File Size: Max 1M rows × 1K columns (configurable)
- Execution Time: Max 60 seconds (configurable)
- Memory: Max 512MB (configurable)
- Operations: Only pandas-compatible operations
- No: File I/O, network access, subprocess execution
Development
Setup
# Clone repository
git clone https://github.com/yourusername/dataops-llm-engine.git
cd dataops-llm-engine
# Install dependencies
pip install -e ".[dev]"
# Copy environment template
cp .env.example .env
# Edit .env with your API keys
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=dataops_llm
# Run security tests only
pytest tests/test_sandbox/test_security.py
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.
Citation
If you use this project in your research, please cite:
@software{dataops_llm_engine,
  title={DataOps LLM Engine: Natural Language Data Operations},
  author={Islam Abd-Elhady},
  year={2025},
  url={https://github.com/yourusername/dataops-llm-engine}
}
Support
- Documentation: docs/
- Issues: GitHub Issues
- Security: For security concerns, email security@yourdomain.com
Acknowledgments
- Built with LiteLLM for multi-provider LLM support
- Uses pandas for data manipulation
- Powered by FastAPI for REST API
Roadmap
- Multi-step workflow chaining
- SQL database support
- Custom function definitions
- Operation history and rollback
- Web UI for non-developers
- Docker-based sandbox option
- Response caching
- Streaming progress updates