
📜 DocETL: Powering Complex Document Processing Pipelines

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:

  1. An interactive UI playground for iterative prompt engineering and pipeline development
  2. A Python package for running production pipelines from the command line or Python code

💡 Need Help Writing Your Pipeline?
Want an LLM such as ChatGPT or Claude to help you write your pipeline? See docetl.org/llms.txt for a long prompt that you can paste into ChatGPT or Claude before describing your task.

🌟 Community Projects

📚 Educational Resources

🚀 Getting Started

There are two main ways to use DocETL:

1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)

DocWrangler helps you iteratively develop your pipeline:

  • Experiment with different prompts and see results in real-time
  • Build your pipeline step by step
  • Export your finalized pipeline configuration for production use

DocWrangler is hosted at docetl.org/playground. To run the playground locally instead, you can either:

  • Use Docker (recommended for quick start): make docker
  • Set up the development environment manually

See the Playground Setup Guide for detailed instructions.

2. 📦 Python Package (For Production Use)

If you want to use DocETL as a Python package:

Prerequisites

  • Python 3.10 or later
  • OpenAI API key

pip install docetl

Create a .env file in your project directory:

OPENAI_API_KEY=your_api_key_here  # Required for LLM operations (or the key for the LLM of your choice)
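
Before running a pipeline, it can help to confirm the two prerequisites above. The following is a minimal sketch and not part of DocETL: the helper names parse_env and preflight are hypothetical, and the parser only handles simple KEY=value lines with optional trailing # comments, as in the example above.

```python
import sys
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        # Drop a trailing comment (assumes values never contain '#').
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def preflight(env, version=sys.version_info):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if tuple(version[:2]) < (3, 10):
        problems.append("Python 3.10 or later is required")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is missing or empty")
    return problems


if __name__ == "__main__":
    env_path = Path(".env")  # the project-directory .env described above
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    for problem in preflight(env):
        print("problem:", problem)
```

If you use a different LLM provider, swap OPENAI_API_KEY for that provider's key name in the check.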

⚠️ Important: Two Different .env Files

  • Root .env: Used by the backend Python server that executes DocETL pipelines
  • website/.env.local: Used by the frontend TypeScript code in DocWrangler (UI features like improve prompt and chatbot)

To see examples of how to use DocETL, check out the tutorial.

2. 🎮 DocWrangler Setup

To run DocWrangler locally, you have two options:

Option A: Using Docker (Recommended for Quick Start)

The easiest way to get the DocWrangler playground running:

  1. Create the required environment files:

Create .env in the root directory (for the backend Python server that executes pipelines):

OPENAI_API_KEY=your_api_key_here  # Used by DocETL pipeline execution engine
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1

Create .env.local in the website directory (for DocWrangler UI features like improve prompt and chatbot):

OPENAI_API_KEY=sk-xxx  # Used by TypeScript features: improve prompt, chatbot, etc.
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini  # Model used by the UI assistant

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false

  2. Run Docker:

make docker

This will:

  • Create a Docker volume for persistent data
  • Build the DocETL image
  • Run the container with the UI accessible at http://localhost:3000
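
Once the container is up, one quick way to confirm that something is listening on the UI port is a plain TCP connection attempt. This is a minimal sketch, not a DocETL utility: is_port_open is a hypothetical helper, and port 3000 is taken from the URL above (adjust it if your setup maps a different host port).

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("UI reachable:", is_port_open("localhost", 3000))
```

A successful connection only shows that the port is open; it does not verify that the playground itself has finished starting.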

To clean up Docker resources (note that this will delete the Docker volume):

make docker-clean
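
The TEXT_FILE_ENCODINGS entry in the root .env above is a comma-separated list. The README does not say how the backend uses it; the sketch below assumes an ordered "try each encoding until one works" fallback, and decode_with_fallback is a hypothetical helper written for illustration.

```python
def decode_with_fallback(data, encodings):
    """Try each encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the configured encodings could decode the data")


# Same value as TEXT_FILE_ENCODINGS in the example .env.
ENCODINGS = [e.strip() for e in "utf-8,latin1,cp1252,iso-8859-1".split(",")]

text, used = decode_with_fallback("café".encode("utf-8"), ENCODINGS)
print(used)  # utf-8

text, used = decode_with_fallback("café".encode("cp1252"), ENCODINGS)
print(used)  # latin1
```

Under this assumed scheme the order matters: latin1 can decode any byte sequence, so entries listed after it would never be reached.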

AWS Bedrock

DocETL supports integration with AWS Bedrock. To enable it:

  1. Configure AWS credentials:

aws configure

  2. Test your AWS credentials:

make test-aws

  3. Run with AWS support:

AWS_PROFILE=your-profile AWS_REGION=your-region make docker

Or using Docker Compose:

AWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up

Environment variables:

  • AWS_PROFILE: Your AWS CLI profile (default: 'default')
  • AWS_REGION: AWS region (default: 'us-west-2')

Bedrock models are prefixed with bedrock. See the LiteLLM docs for more details.
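
The defaults and the model prefix described above can be mirrored in a few lines. This is a minimal sketch under stated assumptions: resolve_aws_settings and is_bedrock_model are hypothetical helpers, not DocETL functions, and the "bedrock/" form with a slash follows the LiteLLM naming convention for Bedrock models.

```python
import os


def resolve_aws_settings(env=None):
    """Return (profile, region), falling back to the defaults listed above."""
    env = os.environ if env is None else env
    profile = env.get("AWS_PROFILE") or "default"
    region = env.get("AWS_REGION") or "us-west-2"
    return profile, region


def is_bedrock_model(model_name):
    """True if the model name uses the LiteLLM-style 'bedrock/<model-id>' form."""
    return model_name.startswith("bedrock/")


if __name__ == "__main__":
    profile, region = resolve_aws_settings()
    print(f"profile={profile} region={region}")
```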

Option B: Manual Setup (Development)

For development or if you prefer not to use Docker:

  1. Clone the repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl

  2. Set up environment variables in .env in the root (top-level) directory (for the backend Python server):

OPENAI_API_KEY=your_api_key_here  # Used by DocETL pipeline execution engine
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1

Then create a .env.local file in the website directory (for DocWrangler UI features):

OPENAI_API_KEY=sk-xxx  # Used by TypeScript features: improve prompt, chatbot, etc.
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini  # Model used by the UI assistant

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false

  3. Install dependencies:

make install      # Install Python deps with uv and set up pre-commit
make install-ui   # Install UI dependencies

If you prefer using uv directly instead of Make:

curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --all-groups --all-extras

  4. Start the development server:

make run-ui-dev

  5. Visit http://localhost:3000/playground to access the interactive UI.
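
If the UI loads but cannot reach the backend, a mismatch between the two environment files from step 2 is one thing worth ruling out. The sketch below is hypothetical and not part of DocETL: it assumes, as the example values suggest but the README does not state, that in a manual setup NEXT_PUBLIC_BACKEND_HOST and NEXT_PUBLIC_BACKEND_PORT should match BACKEND_HOST and BACKEND_PORT, and that BACKEND_ALLOW_ORIGINS should include an origin on FRONTEND_PORT.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumes values never contain '#'
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def check_env_pair(root_env, ui_env):
    """Compare the root .env with website/.env.local; return a list of mismatches."""
    issues = []
    if root_env.get("BACKEND_HOST") != ui_env.get("NEXT_PUBLIC_BACKEND_HOST"):
        issues.append("BACKEND_HOST and NEXT_PUBLIC_BACKEND_HOST differ")
    if root_env.get("BACKEND_PORT") != ui_env.get("NEXT_PUBLIC_BACKEND_PORT"):
        issues.append("BACKEND_PORT and NEXT_PUBLIC_BACKEND_PORT differ")
    origins = [
        o.strip()
        for o in root_env.get("BACKEND_ALLOW_ORIGINS", "").split(",")
        if o.strip()
    ]
    frontend_port = root_env.get("FRONTEND_PORT", "")
    if not frontend_port or not any(o.endswith(":" + frontend_port) for o in origins):
        issues.append("no BACKEND_ALLOW_ORIGINS entry uses FRONTEND_PORT")
    return issues


if __name__ == "__main__":
    root_path, ui_path = Path(".env"), Path("website/.env.local")
    if root_path.exists() and ui_path.exists():
        for issue in check_env_pair(
            parse_env(root_path.read_text()), parse_env(ui_path.read_text())
        ):
            print("mismatch:", issue)
```

Run it from the repository root so that both relative paths resolve.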

🛠️ Development Setup

If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:

make tests-basic  # Runs basic test suite (costs < $0.01 with OpenAI)

For detailed guides and tutorials, visit our documentation.
