ETL with LLM operations.

Project description

📜 DocETL: Powering Complex Document Processing Pipelines

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:

An interactive UI playground for iterative prompt engineering and pipeline development
A Python package for running production pipelines from the command line or Python code

💡 Need Help Writing Your Pipeline?
Want to use an LLM like ChatGPT or Claude to help you write your pipeline? See docetl.org/llms.txt for a long prompt that you can copy and paste into ChatGPT or Claude before describing your task.

🌟 Community Projects

📚 Educational Resources

🚀 Getting Started

There are two main ways to use DocETL:

1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)

DocWrangler helps you iteratively develop your pipeline:

Experiment with different prompts and see results in real-time
Build your pipeline step by step
Export your finalized pipeline configuration for production use

DocWrangler is hosted at docetl.org/playground. To run the playground locally instead, you can either:

Use Docker (recommended for quick start): make docker
Set up the development environment manually

See the Playground Setup Guide for detailed instructions.

2. 📦 Python Package (For Production Use)

If you want to use DocETL as a Python package:

Prerequisites

Python 3.10 or later
OpenAI API key

Install the package from PyPI:

pip install docetl

Create a .env file in your project directory:

OPENAI_API_KEY=your_api_key_here  # Required for LLM operations (or the key for the LLM of your choice)
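
Before running a pipeline, it can help to confirm that the key is actually readable from the .env file. The sketch below is a minimal, standard-library-only check. The helper names read_env_file and missing_keys are hypothetical and not part of DocETL, and the parser only handles simple KEY=value lines with an optional trailing comment.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=value lines; hypothetical helper, not part of DocETL."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, full-line comments, and anything without "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" such as the one in the example above.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_keys(env, required=("OPENAI_API_KEY",)):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

Calling missing_keys(read_env_file(".env")) returns an empty list when the key is present and non-empty. If you use a different LLM provider, pass that provider's key name in required instead.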

To see examples of how to use DocETL, check out the tutorial.

🎮 DocWrangler Setup

To run DocWrangler locally, you have two options:

Option A: Using Docker (Recommended for Quick Start)

The easiest way to get the DocWrangler playground running:

Create the required environment files:

Create .env in the root directory:

OPENAI_API_KEY=your_api_key_here
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1

Create .env.local in the website directory:

OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
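
The TEXT_FILE_ENCODINGS value in the root .env is a comma-separated list. How the backend consumes it is not described here; one plausible reading is that the encodings are tried in order until one decodes the file. The sketch below illustrates that fallback pattern under that assumption. decode_with_fallback is a hypothetical helper, not a DocETL function.

```python
import os


def decode_with_fallback(data, encodings):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {list(encodings)} could decode the data")


# Read the list in the same comma-separated form as the .env example,
# using the value shown there as the default (an assumption for this sketch).
ENCODINGS = [
    name.strip()
    for name in os.environ.get(
        "TEXT_FILE_ENCODINGS", "utf-8,latin1,cp1252,iso-8859-1"
    ).split(",")
    if name.strip()
]
```

In Python, latin1 can decode any byte sequence, so in this sketch any entries listed after it would never be reached. Order matters if you customize the list.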

Run Docker:

make docker

This will:

Create a Docker volume for persistent data
Build the DocETL image
Run the container with the UI accessible at http://localhost:3000

To clean up Docker resources (note that this will delete the Docker volume):

make docker-clean

AWS Bedrock

This framework supports integration with AWS Bedrock. To enable:

Configure AWS credentials:

aws configure

Test your AWS credentials:

make test-aws

Run with AWS support:

AWS_PROFILE=your-profile AWS_REGION=your-region make docker

Or using Docker Compose:

AWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up

Environment variables:

AWS_PROFILE: Your AWS CLI profile (default: 'default')
AWS_REGION: AWS region (default: 'us-west-2')

Bedrock models are prefixed with bedrock. See the LiteLLM docs for more details.
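
As a rough illustration of the defaults listed above and of the model prefix, the sketch below resolves AWS_PROFILE and AWS_REGION with the stated fallbacks and builds a Bedrock-style model name. The bedrock/ separator follows LiteLLM's provider/model naming. The helper names and the model ID used in the usage note are placeholders chosen for illustration, not values taken from DocETL.

```python
import os


def aws_settings(environ=None):
    """Resolve AWS settings, falling back to the defaults documented above."""
    environ = os.environ if environ is None else environ
    return {
        "profile": environ.get("AWS_PROFILE", "default"),
        "region": environ.get("AWS_REGION", "us-west-2"),
    }


def bedrock_model_name(model_id):
    """Prefix a model ID with 'bedrock/' unless it already has the prefix."""
    if model_id.startswith("bedrock/"):
        return model_id
    return f"bedrock/{model_id}"
```

For example, bedrock_model_name("example-model-id") yields "bedrock/example-model-id", which is the form you would then put in a pipeline's model field.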

Option B: Manual Setup (Development)

For development or if you prefer not to use Docker:

Clone the repository:

git clone https://github.com/ucbepic/docetl.git
cd docetl

Set up environment variables in .env in the root/top-level directory:

OPENAI_API_KEY=your_api_key_here
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1

And create an .env.local file in the website directory with the following:

OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false

Install dependencies:

make install      # Install Python deps with uv and set up pre-commit
make install-ui   # Install UI dependencies

If you prefer using uv directly instead of Make:

curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --all-groups --all-extras

Note that the OpenAI API key, base URL, and model name in .env.local are used only by the UI assistant, not by the DocETL pipeline execution engine.

Start the development server:

make run-ui-dev

Visit http://localhost:3000/playground to access the interactive UI.

🛠️ Development Setup

If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:

make tests-basic  # Runs basic test suite (costs < $0.01 with OpenAI)

For detailed documentation and tutorials, visit our documentation.

Project details

Release history

Version	Release date	Notes
0.2.6	Dec 28, 2025	This version
0.2.5	Aug 9, 2025
0.2.4	May 21, 2025
0.2.3	Apr 29, 2025
0.2.2	Jan 29, 2025
0.2.1	Jan 9, 2025
0.2	Dec 4, 2024
0.1.7	Oct 14, 2024
0.1.6	Oct 3, 2024
0.1.5	Sep 30, 2024
0.1.4	Sep 30, 2024	Yanked: has a bug in parsing an LLM response
0.1.3	Sep 29, 2024
0.1.2	Sep 23, 2024
0.1.1	Sep 17, 2024
0.1.0	Sep 1, 2024	Yanked: empty release

Download files

Source Distribution

docetl-0.2.5.tar.gz (175.1 kB)

Uploaded Aug 9, 2025 Source

Built Distribution

docetl-0.2.5-py3-none-any.whl (205.9 kB)

Uploaded Aug 9, 2025 Python 3

File details

Details for the file docetl-0.2.5.tar.gz.

File metadata

Download URL: docetl-0.2.5.tar.gz
Upload date: Aug 9, 2025
Size: 175.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docetl-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`005bb2ea0bc3fee92059cbbe85ded49f33f1a85b2aeb1a0ed6af7b8ec51872e8`
MD5	`f5b77b9080aa110b0fed0dec087b82e2`
BLAKE2b-256	`5784607c3111451be767c0483d1dfe1063f00d87c486e6bdf89d7dce91abc60d`
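
A downloaded file can be checked against the SHA256 digest listed above with Python's hashlib. This is a minimal sketch: the function names are hypothetical, and the file is assumed to have been downloaded to a local path already.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Published SHA256 for docetl-0.2.5.tar.gz, copied from the table above.
EXPECTED_SHA256 = "005bb2ea0bc3fee92059cbbe85ded49f33f1a85b2aeb1a0ed6af7b8ec51872e8"
```

With the archive saved locally, matches_digest("docetl-0.2.5.tar.gz", EXPECTED_SHA256) should return True for an intact download. Reading in chunks keeps memory use flat regardless of file size.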

Provenance

The following attestation bundles were made for docetl-0.2.5.tar.gz:

Publisher: release.yml on ucbepic/docetl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docetl-0.2.5.tar.gz
- Subject digest: 005bb2ea0bc3fee92059cbbe85ded49f33f1a85b2aeb1a0ed6af7b8ec51872e8
- Sigstore transparency entry: 373846747
- Sigstore integration time: Aug 9, 2025
Source repository:
- Permalink: ucbepic/docetl@42f6a1e32cdfb6c6b96bad38b571e680a254a5fe
- Branch / Tag: refs/tags/0.2.5
- Owner: https://github.com/ucbepic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@42f6a1e32cdfb6c6b96bad38b571e680a254a5fe
- Trigger Event: push

File details

Details for the file docetl-0.2.5-py3-none-any.whl.

File metadata

Download URL: docetl-0.2.5-py3-none-any.whl
Upload date: Aug 9, 2025
Size: 205.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docetl-0.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`33b7bb5ed6be0b6100a425332defcce63389b55a2fed8cd9bc2de3eb3218bd7a`
MD5	`a531032fc813b06508dbb67e48bed929`
BLAKE2b-256	`6a433d18168a6edb20cb6b6aa4c9596abe94212c3eefd0d03c8b42868330c678`

Provenance

The following attestation bundles were made for docetl-0.2.5-py3-none-any.whl:

Publisher: release.yml on ucbepic/docetl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docetl-0.2.5-py3-none-any.whl
- Subject digest: 33b7bb5ed6be0b6100a425332defcce63389b55a2fed8cd9bc2de3eb3218bd7a
- Sigstore transparency entry: 373846756
- Sigstore integration time: Aug 9, 2025
Source repository:
- Permalink: ucbepic/docetl@42f6a1e32cdfb6c6b96bad38b571e680a254a5fe
- Branch / Tag: refs/tags/0.2.5
- Owner: https://github.com/ucbepic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@42f6a1e32cdfb6c6b96bad38b571e680a254a5fe
- Trigger Event: push
