ETL with LLM operations.
DocETL: Powering Complex Document Processing Pipelines
Website (Includes Demo) | Documentation | Discord | NotebookLM Podcast (thanks Shabie from our Discord community!) | Paper (coming soon!)
DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define LLM-powered operations on complex data.
When to Use DocETL
DocETL is a good fit when you want to maximize correctness and output quality for complex tasks over a collection of documents or unstructured datasets. Consider using DocETL if:
- You want to perform semantic processing on a collection of data
- You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, then group by the result of the map call and reduce each group)
- You're unsure how to best express your task to maximize LLM accuracy
- You're working with long documents that don't fit into a single prompt or are too lengthy for effective LLM reasoning
- You have validation criteria and want tasks to automatically retry when the validation fails
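The map-reduce and retry-on-validation ideas in the list above can be sketched in plain Python. This is not DocETL's API: `fake_llm_extract_topic`, `map_with_retry`, `run_pipeline`, and the sample documents are hypothetical names made up for illustration, and the "LLM" is a deterministic stub so the sketch runs offline at no cost.

```python
from collections import defaultdict
from typing import Callable

VALID_TOPICS = {"billing", "support"}


def fake_llm_extract_topic(text: str, attempt: int) -> str:
    """Hypothetical stand-in for an LLM call that labels a document.

    On the first attempt it returns an empty answer for documents that
    contain "flaky", to simulate an output that fails validation.
    """
    if attempt == 0 and "flaky" in text:
        return ""
    return "billing" if "invoice" in text else "support"


def map_with_retry(
    text: str,
    fn: Callable[[str, int], str],
    validate: Callable[[str], bool],
    max_retries: int = 2,
) -> str:
    """Call fn until validate(result) passes or the retries run out."""
    for attempt in range(max_retries + 1):
        result = fn(text, attempt)
        if validate(result):
            return result
    raise ValueError(f"validation failed after {max_retries + 1} attempts")


def run_pipeline(docs: list[str]) -> dict[str, int]:
    # Map: label each document, retrying when the label is not valid.
    mapped = [
        (map_with_retry(doc, fake_llm_extract_topic, lambda r: r in VALID_TOPICS), doc)
        for doc in docs
    ]
    # Group by: collect documents under the label produced by the map step.
    groups: dict[str, list[str]] = defaultdict(list)
    for topic, doc in mapped:
        groups[topic].append(doc)
    # Reduce: collapse each group to one value (here, a simple count).
    return {topic: len(items) for topic, items in groups.items()}


docs = ["invoice overdue", "flaky invoice export", "cannot log in"]
print(run_pipeline(docs))
```

In a real pipeline the map and reduce steps would each be LLM calls; the stub only shows how the three stages and the validation loop fit together.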
Installation
See the documentation for installing from PyPI.
Prerequisites
Before installing DocETL, ensure you have Python 3.10 or later installed on your system. You can check your Python version by running:
python --version
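On some systems the interpreter is invoked as `python3` rather than `python`. As an alternative to reading the version string by eye, the same check can be done from inside the interpreter. This is a minimal sketch; `MIN_VERSION` and `meets_minimum` are hypothetical names, and the threshold is taken from the prerequisite stated above.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in the prerequisites


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


current = ".".join(str(part) for part in sys.version_info[:3])
if meets_minimum():
    print(f"Python {current} satisfies the 3.10+ requirement")
else:
    print(f"Python {current} is too old; 3.10 or later is required")
```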
Installation Steps (from Source)
- Clone the DocETL repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl
- Install Poetry (if not already installed):
pip install poetry
- Install the project dependencies:
poetry install
- Set up your OpenAI API key:
Create a .env file in the project root and add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
Alternatively, you can set the OPENAI_API_KEY environment variable in your shell.
- Run the basic test suite to ensure everything is working (this costs less than $0.01 with OpenAI):
make tests-basic
That's it! You've successfully installed DocETL and are ready to start processing documents.
For more detailed information on usage and configuration, please refer to our documentation.
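Before running the test suite, it can help to confirm that the API key from the setup step is actually visible. The sketch below uses only the standard library to read a simple `KEY=value` file. It does not show how DocETL itself loads the key: `load_env_file` and `resolve_api_key` are hypothetical helpers, and preferring the shell variable over the `.env` file is an assumption made for this example.

```python
import os
from pathlib import Path


def load_env_file(path=".env") -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env_file=".env"):
    """Assumed precedence: shell environment first, then the .env file."""
    return os.environ.get("OPENAI_API_KEY") or load_env_file(env_file).get("OPENAI_API_KEY")


key = resolve_api_key()
print("OPENAI_API_KEY found" if key else "OPENAI_API_KEY missing")
```

The script deliberately prints only whether a key was found, never the key itself.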
File details
Details for the file docetl-0.1.7.tar.gz.
File metadata
- Download URL: docetl-0.1.7.tar.gz
- Upload date:
- Size: 127.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 610bbaece6545b1187da433fc6fdec6757246e6fb28e5185fc099ec1571b1efc
MD5 | bc45b9c748c6cb33ff2857eb8263510a
BLAKE2b-256 | ed1b4dca0a47704ef5ece82f81c8e975e16ef971c16416e8ff3dfb74a6c3ba92
File details
Details for the file docetl-0.1.7-py3-none-any.whl.
File metadata
- Download URL: docetl-0.1.7-py3-none-any.whl
- Upload date:
- Size: 147.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 25514024f8a9021cd045ebba33836cde1d395d8c21b332072a599cc25c2e97c6
MD5 | fb27dec710f62e9f025e52006ffa710d
BLAKE2b-256 | 315cfde7e5cbf590c51afe18a5cab914f5b38c7a56b1ea2b49f15820de9b0da1
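The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`; `file_digest` and `matches` are hypothetical helper names, and the script assumes the files were saved in the current directory under the names shown above.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large downloads are not read into memory at once."""
    if algorithm == "blake2b-256":
        # BLAKE2b with a 32-byte (256-bit) digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)  # e.g. "sha256" or "md5"
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path, expected_hex, algorithm="sha256") -> bool:
    """Compare the computed digest with the published one, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "docetl-0.1.7.tar.gz": "610bbaece6545b1187da433fc6fdec6757246e6fb28e5185fc099ec1571b1efc",
    "docetl-0.1.7-py3-none-any.whl": "25514024f8a9021cd045ebba33836cde1d395d8c21b332072a599cc25c2e97c6",
}

for name, digest in EXPECTED_SHA256.items():
    if Path(name).is_file():
        print(name, "OK" if matches(name, digest) else "MISMATCH")
    else:
        print(name, "not found in the current directory")
```

SHA256 is used for the comparison here; the same helper accepts `"md5"` or `"blake2b-256"` if you want to check the other listed digests.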