📜 DocETL: Powering Complex Document Processing Pipelines



DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:

  1. An interactive UI playground for iterative prompt engineering and pipeline development
  2. A Python package for running production pipelines from the command line or Python code

💡 Need Help Writing Your Pipeline?
Want to use an LLM like ChatGPT or Claude to help you write your pipeline? See docetl.org/llms.txt for a large prompt you can copy and paste into ChatGPT or Claude before describing your task.

🌟 Community Projects

📚 Educational Resources

🚀 Getting Started

There are two main ways to use DocETL:

1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)

DocWrangler helps you iteratively develop your pipeline:

  • Experiment with different prompts and see results in real-time
  • Build your pipeline step by step
  • Export your finalized pipeline configuration for production use


DocWrangler is hosted at docetl.org/playground. To run the playground locally instead, you can either:

  • Use Docker (recommended for quick start): make docker
  • Set up the development environment manually

See the Playground Setup Guide for detailed instructions.

2. 📦 Python Package (For Production Use)

If you want to use DocETL as a Python package:

Prerequisites

  • Python 3.10 or later
  • OpenAI API key

Install the package:

pip install docetl

Create a .env file in your project directory:

OPENAI_API_KEY=your_api_key_here  # Required for LLM operations (or the key for the LLM of your choice)
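Before running a pipeline, it can help to confirm that the key is actually visible. The following is a minimal preflight sketch using only the standard library. The helper names `read_env_file` and `has_api_key` are hypothetical and not part of DocETL, and the parser only handles simple `KEY=VALUE` lines like the one above.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "value  # note".
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def has_api_key(env_path=".env", name="OPENAI_API_KEY"):
    """Return True if the key is set in the environment or in the .env file."""
    if os.environ.get(name):
        return True
    path = Path(env_path)
    if not path.is_file():
        return False
    value = read_env_file(path).get(name, "")
    # Treat the placeholder from the example above as "not set".
    return bool(value) and value != "your_api_key_here"


if __name__ == "__main__":
    print("API key configured:", has_api_key())
```

If you use a different LLM provider, pass that provider's variable name as `name`.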

To see examples of how to use DocETL, check out the tutorial.

2. 🎮 DocWrangler Setup

To run DocWrangler locally, you have two options:

Option A: Using Docker (Recommended for Quick Start)

The easiest way to get the DocWrangler playground running:

  1. Create the required environment files:

Create .env in the root directory:

OPENAI_API_KEY=your_api_key_here
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1

Create .env.local in the website directory:

OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false

  2. Run Docker:

make docker

This will:

  • Create a Docker volume for persistent data
  • Build the DocETL image
  • Run the container with the UI accessible at http://localhost:3000
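The image build can take a while, so the UI may not respond immediately. As a rough sketch (not part of DocETL), the hypothetical helper below polls a URL until the server answers. It assumes the UI is served at the address listed above and bypasses any configured HTTP proxy, since the target is local.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll url until it returns any HTTP response or the timeout elapses."""
    # An empty ProxyHandler ignores proxy environment variables for local URLs.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is up.
            return True
        except OSError:
            # Connection refused or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_http("http://localhost:3000")` returns True once the container is serving requests, or False after one minute.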

To clean up Docker resources (note that this will delete the Docker volume):

make docker-clean
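
The TEXT_FILE_ENCODINGS entry in the .env file above is a comma-separated list of encodings. How the backend consumes it is not described here, so the sketch below is only an assumption about the usual pattern: try each encoding in order and keep the first one that decodes. The function names are hypothetical.

```python
import os


def configured_encodings(default="utf-8,latin1,cp1252,iso-8859-1"):
    """Read the comma-separated list from TEXT_FILE_ENCODINGS."""
    raw = os.environ.get("TEXT_FILE_ENCODINGS", default)
    return [part.strip() for part in raw.split(",") if part.strip()]


def decode_with_fallback(data, encodings):
    """Return (text, encoding) for the first encoding that decodes data."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the data")
```

Order matters in a scheme like this: latin1 maps every possible byte, so any encoding listed after it would never be tried.
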

AWS Bedrock

This framework supports integration with AWS Bedrock. To enable:

  1. Configure AWS credentials:

aws configure

  2. Test your AWS credentials:

make test-aws

  3. Run with AWS support:

AWS_PROFILE=your-profile AWS_REGION=your-region make docker

Or using Docker Compose:

AWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up

Environment variables:

  • AWS_PROFILE: Your AWS CLI profile (default: 'default')
  • AWS_REGION: AWS region (default: 'us-west-2')

Bedrock models are prefixed with bedrock/. See the LiteLLM docs for more details.
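
To make the defaults and the model prefix concrete, here is a small sketch. The helper names are hypothetical, and the model ID in the usage note is a placeholder rather than a real Bedrock model.

```python
import os


def aws_settings(env=None):
    """Resolve AWS_PROFILE and AWS_REGION using the defaults listed above."""
    env = os.environ if env is None else env
    return {
        "profile": env.get("AWS_PROFILE") or "default",
        "region": env.get("AWS_REGION") or "us-west-2",
    }


def bedrock_model(model_id):
    """Add the bedrock/ prefix unless the name already carries it."""
    prefix = "bedrock/"
    return model_id if model_id.startswith(prefix) else prefix + model_id
```

For example, `bedrock_model("your-model-id")` gives `bedrock/your-model-id`, which is the form you would put in a pipeline's model field.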

Option B: Manual Setup (Development)

For development or if you prefer not to use Docker:

  1. Clone the repository:

git clone https://github.com/ucbepic/docetl.git
cd docetl

  2. Set up environment variables in .env in the root/top-level directory:

OPENAI_API_KEY=your_api_key_here
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1

Then create a .env.local file in the website directory with the following:

OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false

  3. Install dependencies:

make install      # Install Python deps with uv and set up pre-commit
make install-ui   # Install UI dependencies

If you prefer using uv directly instead of Make:

curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --all-groups --all-extras

Note that the OpenAI API key, API base, and model name in .env.local are used only by the UI assistant, not by the DocETL pipeline execution engine.

  4. Start the development server:

make run-ui-dev

  5. Visit http://localhost:3000/playground to access the interactive UI.
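
If the UI loads but cannot reach the backend, a mismatch between the two env files is one likely cause. The sketch below cross-checks them under an assumption inferred from the variable names: the NEXT_PUBLIC_BACKEND_* values should match BACKEND_HOST and BACKEND_PORT, and FRONTEND_PORT should appear in BACKEND_ALLOW_ORIGINS. The function names are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlsplit


def load_env(path):
    """Load simple KEY=VALUE lines, skipping blank lines and # comments."""
    pairs = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            pairs[key.strip()] = value.strip()
    return pairs


def check_consistency(root_env, ui_env):
    """Return a list of problems; an empty list means the files agree."""
    problems = []
    for root_key, ui_key in [
        ("BACKEND_HOST", "NEXT_PUBLIC_BACKEND_HOST"),
        ("BACKEND_PORT", "NEXT_PUBLIC_BACKEND_PORT"),
    ]:
        if root_env.get(root_key) != ui_env.get(ui_key):
            problems.append(f"{root_key} and {ui_key} differ")
    # Collect the ports named in the allowed CORS origins.
    origins = root_env.get("BACKEND_ALLOW_ORIGINS", "").split(",")
    ports = {str(urlsplit(o.strip()).port) for o in origins if o.strip()}
    if root_env.get("FRONTEND_PORT") not in ports:
        problems.append("FRONTEND_PORT is not covered by BACKEND_ALLOW_ORIGINS")
    return problems


if __name__ == "__main__":
    if Path(".env").is_file() and Path("website/.env.local").is_file():
        for problem in check_consistency(load_env(".env"), load_env("website/.env.local")):
            print("warning:", problem)
```

Run it from the repository root; it prints nothing when the example values above are used unchanged.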

🛠️ Development Setup

If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:

make tests-basic  # Runs basic test suite (costs < $0.01 with OpenAI)

For detailed documentation and tutorials, visit our documentation.
