DataOps LLM Engine
LLM-powered data operations for Excel/CSV files using natural language
DataOps LLM Engine is a standalone Python SDK that allows you to perform arbitrary data operations on Excel/CSV files using natural language instructions. It uses Large Language Models (LLMs) to generate and execute safe Python code for data manipulation, without requiring predefined tools.
Features
- Natural Language Interface: Describe what you want in plain English
- Multi-Provider LLM Support: Works with OpenAI, Anthropic, Google, and 100+ providers via LiteLLM
- 7-Layer Security Model: AST validation, import whitelisting, subprocess isolation, and more
- Flexible Input: CSV, Excel files, or pandas DataFrames
- Dry-Run Mode: Preview generated code before execution
- REST API: Optional FastAPI wrapper for HTTP access
- Zero Configuration: Works out of the box with sensible defaults
Quick Start
Installation
pip install dataops-llm
Or install from source:
git clone https://github.com/yourusername/dataops-llm-engine.git
cd dataops-llm-engine
pip install -e .
Basic Usage
from dataops_llm import process
# Process a CSV file with natural language
result = process(
    file_path="companies.csv",
    instruction="Remove duplicates by email and normalize company names to lowercase",
)
if result.success:
    result.save("companies_cleaned.csv")
    print(result.report)
Configuration
Create a .env file with your LLM API key:
LITELLM_API_KEY=sk-...
LITELLM_MODEL=gpt-4
Or pass configuration directly:
result = process(
    file_path="data.csv",
    instruction="Filter rows where revenue > 1000000",
    llm_config={
        "api_key": "sk-...",
        "model": "claude-3-5-sonnet-20241022",
    },
)
How It Works
User Instruction
↓
1. Intent Extraction (LLM) → Structured intent
↓
2. Execution Planning (LLM) → Step-by-step plan
↓
3. Code Generation (LLM) → Pandas code
↓
4. Code Validation (AST) → Security checks
↓
5. Sandbox Execution (subprocess) → Result
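Step 5 can be pictured with a minimal sketch: the generated code is written to a script in a temporary directory and executed in a separate process with a wall-clock timeout. This is illustrative only; the actual sandbox also enforces memory limits and the other restrictions listed under Security.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_sandboxed(code: str, timeout: int = 60) -> str:
    """Run generated code in a child process with a timeout.

    Sketch only: the real sandbox additionally applies memory limits
    and filesystem isolation.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        script = Path(tmpdir) / "generated.py"
        script.write_text(code)
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=tmpdir,              # execute inside the temp directory
            capture_output=True,
            text=True,
            timeout=timeout,         # raises TimeoutExpired if exceeded
        )
        if result.returncode != 0:
            raise RuntimeError(result.stderr)
        return result.stdout

print(run_sandboxed("print(2 + 2)"))  # → 4
```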
Examples
Example 1: Data Cleaning
from dataops_llm import process
result = process(
    file_path="messy_data.csv",
    instruction="""
    1. Remove rows with null emails
    2. Trim whitespace from all text columns
    3. Convert dates to ISO format
    4. Remove duplicate rows based on email
    """,
)
Example 2: Aggregation
result = process(
    file_path="sales.xlsx",
    instruction="Group by region and calculate total sales, average order value",
)
Example 3: Dry-Run Mode
result = process(
    file_path="data.csv",
    instruction="Filter rows where age > 25",
    dry_run=True,
    return_code=True,
)
print(result.generated_code)  # See what code would be executed
Example 4: Working with DataFrames
import pandas as pd
from dataops_llm import process
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})
result = process(
    file_path=df,
    instruction="Add a column 'age_group' categorizing ages as young/middle/senior",
)
Security
DataOps LLM Engine implements a 7-layer security model to prevent malicious code execution:
- LLM Prompt Engineering: System prompts explicitly forbid dangerous operations
- Import Whitelisting: Only pandas, numpy, datetime, re, math allowed
- AST Validation: Parse and inspect code before execution
- Call Blacklisting: Block eval, exec, open, subprocess, network calls
- Subprocess Isolation: Code runs in separate process
- Resource Limits: 60s timeout, 512MB memory limit
- File System Isolation: Execution in temporary directory
Allowed imports: pandas, numpy, datetime, re, math
Blocked operations: File I/O, network access, subprocess execution, eval/exec, pickling
See docs/security.md for detailed security documentation.
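The import-whitelisting and call-blacklisting layers can be sketched with the standard library's `ast` module: parse the generated code, walk the tree, and collect any import outside the whitelist or call to a blocked builtin. This is a simplified illustration of the technique, not the engine's actual validator.

```python
import ast

# Whitelist/blacklist as documented above
ALLOWED_IMPORTS = {"pandas", "numpy", "datetime", "re", "math"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}

def validate(code: str) -> list[str]:
    """Return a list of security violations found in the code.

    Sketch only: checks top-level module names against the whitelist
    and direct calls to blocked builtins.
    """
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"blocked call: {node.func.id}")
    return violations

print(validate("import os\neval('1+1')"))
# → ['import not allowed: os', 'blocked call: eval']
```

A real validator would also catch indirect access such as `getattr(__builtins__, "eval")`, which is why AST checks are only one layer of the model rather than the whole defense.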
API Reference
process()
Main SDK function for processing data.
Parameters:
- file_path (str | Path | DataFrame): Input file path or DataFrame
- instruction (str): Natural language instruction
- llm_config (dict, optional): LLM configuration
  - api_key: LLM provider API key
  - model: Model name (default: "gpt-4")
  - temperature: Sampling temperature (default: 0.1)
  - max_tokens: Maximum tokens (default: 2000)
- sandbox_config (dict, optional): Sandbox configuration
  - timeout: Max execution time in seconds (default: 60)
  - memory_limit_mb: Max memory in MB (default: 512)
- dry_run (bool): Preview mode without execution (default: False)
- return_code (bool): Include generated code in result (default: False)
Returns:
- DataOpsResult: Result object with:
  - success: Whether the operation succeeded
  - dataframe: Resulting DataFrame
  - report: Human-readable report
  - generated_code: Generated code (if requested)
  - execution_time: Time taken in seconds
  - warnings: Warning messages
  - metadata: Additional metadata
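The documented result shape can be pictured as a simple dataclass. This is a sketch of the fields listed above, not the SDK's actual class definition:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class DataOpsResult:
    """Illustrative shape of the result object documented above."""
    success: bool
    dataframe: Any = None                  # a pandas.DataFrame in practice
    report: str = ""
    generated_code: Optional[str] = None   # populated when return_code=True
    execution_time: float = 0.0
    warnings: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

r = DataOpsResult(success=True, report="Removed 3 duplicate rows")
print(r.success, r.warnings)  # → True []
```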
REST API
Run the FastAPI server:
python -m dataops_llm.web.app
Or with uvicorn:
uvicorn dataops_llm.web.app:app --reload
API endpoints:
- GET / - API information
- GET /api/v1/health - Health check
- POST /api/v1/process - Process data
Example request:
import requests
import base64

with open("data.csv", "rb") as f:
    file_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8000/api/v1/process",
    json={
        "instruction": "Remove duplicates",
        "file_base64": file_base64,
        "file_format": "csv",
    },
)
result = response.json()
Configuration Options
Environment Variables
# LLM Configuration
LITELLM_API_KEY=sk-...
LITELLM_MODEL=gpt-4
LITELLM_TEMPERATURE=0.1
LITELLM_MAX_TOKENS=2000
# Sandbox Configuration
SANDBOX_TIMEOUT=60
SANDBOX_MEMORY_MB=512
# Application
APP_LOG_LEVEL=INFO
See .env.example for all options.
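Reading these variables with the documented defaults can be sketched with the standard library alone (the SDK's own configuration loader may differ, e.g. it may use python-dotenv to read the .env file first):

```python
import os

def load_llm_config() -> dict:
    """Read LLM settings from the environment, falling back to the
    defaults documented above. Sketch only."""
    return {
        "api_key": os.environ.get("LITELLM_API_KEY"),
        "model": os.environ.get("LITELLM_MODEL", "gpt-4"),
        "temperature": float(os.environ.get("LITELLM_TEMPERATURE", "0.1")),
        "max_tokens": int(os.environ.get("LITELLM_MAX_TOKENS", "2000")),
    }

cfg = load_llm_config()
print(cfg["model"], cfg["temperature"])
```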
Supported LLM Providers
Via LiteLLM, supports 100+ providers:
- OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo
- Anthropic: claude-3-5-sonnet, claude-3-opus
- Google: gemini-pro, gemini-ultra
- Azure OpenAI: All OpenAI models via Azure
- AWS Bedrock: Claude, Llama, etc.
- And many more...
Limitations
- File Size: Max 1M rows × 1K columns (configurable)
- Execution Time: Max 60 seconds (configurable)
- Memory: Max 512MB (configurable)
- Operations: Only pandas-compatible operations
- No: File I/O, network access, subprocess execution
Development
Setup
# Clone repository
git clone https://github.com/yourusername/dataops-llm-engine.git
cd dataops-llm-engine
# Install dependencies
pip install -e ".[dev]"
# Copy environment template
cp .env.example .env
# Edit .env with your API keys
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=dataops_llm
# Run security tests only
pytest tests/test_sandbox/test_security.py
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.
Citation
If you use this project in your research, please cite:
@software{dataops_llm_engine,
  title={DataOps LLM Engine: Natural Language Data Operations},
  author={Islam Abd-Elhady},
  year={2025},
  url={https://github.com/yourusername/dataops-llm-engine}
}
Support
- Documentation: docs/
- Issues: GitHub Issues
- Security: For security concerns, email security@yourdomain.com
Acknowledgments
- Built with LiteLLM for multi-provider LLM support
- Uses pandas for data manipulation
- Powered by FastAPI for REST API
Roadmap
- Multi-step workflow chaining
- SQL database support
- Custom function definitions
- Operation history and rollback
- Web UI for non-developers
- Docker-based sandbox option
- Response caching
- Streaming progress updates