ETL with LLM operations.
📜 DocETL: Powering Complex Document Processing Pipelines
DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:
- An interactive UI playground for iterative prompt engineering and pipeline development
- A Python package for running production pipelines from the command line or Python code
🚀 Getting Started
There are two main ways to use DocETL:
1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)
DocWrangler helps you iteratively develop your pipeline:
- Experiment with different prompts and see results in real-time
- Build your pipeline step by step
- Export your finalized pipeline configuration for production use
DocWrangler is hosted at docetl.org/playground. To run the playground locally, you can either:
- Use Docker (recommended for quick start): make docker
- Set up the development environment manually
See the Playground Setup Guide for detailed instructions.
2. 📦 Python Package (For Production Use)
If you want to use DocETL as a Python package:
Prerequisites
- Python 3.10 or later
- OpenAI API key
pip install docetl
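The prerequisites and the install step can be checked from Python before running a pipeline. This is a minimal sketch using only the standard library; the helper names `check_python` and `installed_version` are illustrative and not part of DocETL.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 10)  # minimum version listed in the prerequisites


def check_python(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str = "docetl") -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Calling `check_python()` and `installed_version()` after `pip install docetl` shows whether the interpreter is recent enough and which release was installed.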
Create a .env file in your project directory:
OPENAI_API_KEY=your_api_key_here # Required for LLM operations (or the key for the LLM of your choice)
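To confirm that the key in `.env` is readable, the file can be parsed with a few lines of standard-library Python. This is a simplified sketch, not a full dotenv implementation: it assumes one `KEY=VALUE` pair per line and treats ` #` as the start of a trailing comment, as in the line above. The function names `parse_env` and `load_env` are hypothetical.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and full-line comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" and any surrounding quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def load_env(path: str = ".env") -> dict:
    """Read a .env file and export its values without overriding existing ones."""
    file = Path(path)
    values = parse_env(file.read_text()) if file.exists() else {}
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

After `load_env()`, checking `"OPENAI_API_KEY" in os.environ` indicates whether the key is visible to the current process.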
To see examples of how to use DocETL, check out the tutorial.
🎮 DocWrangler Setup
To run DocWrangler locally, you have two options:
Option A: Using Docker (Recommended for Quick Start)
The easiest way to get the DocWrangler playground running:
- Create the required environment files:
Create .env in the root directory:
OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=0.0.0.0
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
Create .env.local in the website directory:
OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
- Run Docker:
make docker
This will:
- Create a Docker volume for persistent data
- Build the DocETL image
- Run the container with the UI accessible at http://localhost:3000
To clean up Docker resources (note that this will delete the Docker volume):
make docker-clean
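The two environment files for Option A can also be generated by a script instead of being typed by hand. The sketch below writes the values listed above; the `website/` path relative to the repository root and the function names are assumptions, and the placeholder API keys must be replaced with real ones.

```python
from pathlib import Path

# Values copied from the Option A instructions; the keys are placeholders.
ROOT_ENV = {
    "OPENAI_API_KEY": "your_api_key_here",
    "BACKEND_ALLOW_ORIGINS": "http://localhost:3000,http://127.0.0.1:3000",
    "BACKEND_HOST": "0.0.0.0",
    "BACKEND_PORT": "8000",
    "BACKEND_RELOAD": "True",
    "FRONTEND_HOST": "0.0.0.0",
    "FRONTEND_PORT": "3000",
}
WEBSITE_ENV = {
    "OPENAI_API_KEY": "sk-xxx",
    "OPENAI_API_BASE": "https://api.openai.com/v1",
    "MODEL_NAME": "gpt-4o-mini",
    "NEXT_PUBLIC_BACKEND_HOST": "localhost",
    "NEXT_PUBLIC_BACKEND_PORT": "8000",
}


def render_env(values: dict) -> str:
    """Render a mapping as KEY=VALUE lines."""
    return "".join(f"{key}={value}\n" for key, value in values.items())


def write_env_files(repo_root: str, overwrite: bool = False) -> list:
    """Write .env and website/.env.local; skip existing files unless overwrite is set."""
    root = Path(repo_root)
    targets = {
        root / ".env": ROOT_ENV,
        root / "website" / ".env.local": WEBSITE_ENV,  # assumed location
    }
    written = []
    for path, values in targets.items():
        if path.exists() and not overwrite:
            continue
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(render_env(values))
        written.append(path)
    return written
```

Existing files are left untouched by default so that a real API key is not overwritten with a placeholder.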
Option B: Manual Setup (Development)
For development or if you prefer not to use Docker:
- Clone the repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl
- Set up environment variables in .env in the root/top-level directory:
OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
And create an .env.local file in the website directory with the following:
OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
- Install dependencies:
make install # Install Python package
make install-ui # Install UI dependencies
Note that the OpenAI API key, base URL, and model name in .env.local are used only by the UI assistant, not by the DocETL pipeline execution engine.
- Start the development server:
make run-ui-dev
- Visit http://localhost:3000/playground to access the interactive UI.
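The development server can take a moment to start, so a script that depends on the UI may want to wait until the URL responds. The following is a minimal polling sketch using only the standard library; `wait_for_http` is a hypothetical helper, and treating any non-5xx response as "up" is an assumption.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll a URL until it answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True  # got a 2xx/3xx response
        except urllib.error.HTTPError as exc:
            if exc.code < 500:
                return True  # server is up, even if the path returned 4xx
        except (urllib.error.URLError, OSError):
            pass  # connection refused or timed out; retry
        time.sleep(interval)
    return False
```

For example, `wait_for_http("http://localhost:3000/playground")` could be called after `make run-ui-dev` to block until the playground is reachable.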
🛠️ Development Setup
If you're planning to contribute to or modify DocETL, you can verify your setup by running the test suite:
make tests-basic # Runs basic test suite (costs < $0.01 with OpenAI)
For detailed documentation and tutorials, visit our documentation.
Download files
File details
Details for the file docetl-0.2.2.tar.gz.
File metadata
- Download URL: docetl-0.2.2.tar.gz
- Upload date:
- Size: 152.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dffb4a2e8c14a61bf0d8e64a3e0cf599dbda0d74057e41e288b31f9461c3be0d |
| MD5 | c4ebb9556cbab91a10157a00c9ded89c |
| BLAKE2b-256 | 218201eee1d44f75f952a4044572f619ea7c2c95ee528406b497089f47f33fd5 |
Provenance
The following attestation bundles were made for docetl-0.2.2.tar.gz:
Publisher: release.yml on ucbepic/docetl
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docetl-0.2.2.tar.gz
- Subject digest: dffb4a2e8c14a61bf0d8e64a3e0cf599dbda0d74057e41e288b31f9461c3be0d
- Sigstore transparency entry: 166710592
- Sigstore integration time:
- Permalink: ucbepic/docetl@3ecb385d7639634aaf6bc31cdd6fa888f27bf0c3
- Branch / Tag: refs/tags/0.2.2
- Owner: https://github.com/ucbepic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@3ecb385d7639634aaf6bc31cdd6fa888f27bf0c3
- Trigger Event: push
File details
Details for the file docetl-0.2.2-py3-none-any.whl.
File metadata
- Download URL: docetl-0.2.2-py3-none-any.whl
- Upload date:
- Size: 180.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4b075257c68f4ceaba5e04f78eb9a65160242ec913ef47d9afdb7a8ba9861242 |
| MD5 | abf40315f96655d212cf6c1ddc5fc41b |
| BLAKE2b-256 | be25f9abc5860cfa05495c1ecd25b9fad5397d4b59e847d15963ad66edfc658d |
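A downloaded file can be checked against the SHA256 digests listed for the two distributions. The sketch below uses `hashlib` from the standard library; the function names and the idea of looking up the digest by file name are illustrative choices, not part of PyPI or DocETL tooling.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file hash tables for release 0.2.2.
EXPECTED_SHA256 = {
    "docetl-0.2.2.tar.gz": "dffb4a2e8c14a61bf0d8e64a3e0cf599dbda0d74057e41e288b31f9461c3be0d",
    "docetl-0.2.2-py3-none-any.whl": "4b075257c68f4ceaba5e04f78eb9a65160242ec913ef47d9afdb7a8ba9861242",
}


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected: dict = EXPECTED_SHA256) -> bool:
    """Return True if the file name is known and its digest matches."""
    name = Path(path).name
    return name in expected and sha256_of(path) == expected[name]
```

`verify("docetl-0.2.2.tar.gz")` returns `True` only when the local file's digest matches the listed value.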
Provenance
The following attestation bundles were made for docetl-0.2.2-py3-none-any.whl:
Publisher: release.yml on ucbepic/docetl
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docetl-0.2.2-py3-none-any.whl
- Subject digest: 4b075257c68f4ceaba5e04f78eb9a65160242ec913ef47d9afdb7a8ba9861242
- Sigstore transparency entry: 166710593
- Sigstore integration time:
- Permalink: ucbepic/docetl@3ecb385d7639634aaf6bc31cdd6fa888f27bf0c3
- Branch / Tag: refs/tags/0.2.2
- Owner: https://github.com/ucbepic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@3ecb385d7639634aaf6bc31cdd6fa888f27bf0c3
- Trigger Event: push