AI-powered PDF redaction tool with conversational interface


✂️🤖 RedactFlow: Agentic PDF Sanitizer

RedactFlow is a PDF sanitization tool that uses an agentic workflow to detect and redact sensitive information in your documents. It combines LLM-based detection with a human-in-the-loop (HITL) interface, so every AI-detected redaction can be reviewed, edited, and approved before it is applied.

🚀 Quick Start

Prerequisites

Python 3.8+
Node.js 16+ (for the React frontend)
An Azure account with access to Azure OpenAI and Azure Document Intelligence
A Supabase account for user authentication and subscription management (free tier available)

Option 1: Modern React Frontend (Recommended)

The application now features a modern React frontend with a FastAPI backend for better performance and user experience.

1. Clone and Setup

git clone https://github.com/matthewyijielu0317/RedactFlow.git
cd RedactFlow

2. Backend Setup

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install backend dependencies
pip install -r backend/requirements.txt

3. Frontend Setup

# Install frontend dependencies
cd frontend
npm install
cd ..

4. Environment Configuration

Note on Environment Files: This project uses two .env files:

Root .env: Backend configuration (Python/FastAPI)

frontend/.env: Frontend configuration (React)

This separation is necessary because:

React (Create React App) only reads .env from its own directory

Backend and frontend have different environment variable requirements

Important: Keep the Supabase URL and anon key in sync between both files!

Root .env file (Backend Configuration):

Create a .env file in the root directory:

# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
AZURE_OPENAI_API_KEY=your_azure_openai_api_key

# Azure Document Intelligence Configuration
AZURE_DI_ENDPOINT=your_azure_di_endpoint
AZURE_DI_KEY=your_azure_di_key

# Google Cloud Configuration (Optional - for image detection)
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro

# Supabase Configuration (for backend authentication)
SUPABASE_URL=your_supabase_project_url
SUPABASE_JWT_SECRET=your_supabase_jwt_secret
SUPABASE_ANON_KEY=your_supabase_anon_key
SUPABASE_SERVICE_KEY=your_supabase_service_key

# Stripe Configuration (Optional, for payment integration)
STRIPE_SECRET_KEY=your_stripe_secret_key
STRIPE_PUBLISHABLE_KEY=your_stripe_publishable_key
STRIPE_WEBHOOK_SECRET=your_stripe_webhook_secret

# Tavily Search (Optional)
TAVILY_KEY=your_tavily_key

frontend/.env file (Frontend Configuration):

Create a .env file in the frontend/ directory (use frontend/.env.example as a template):

# Supabase Configuration
# ⚠️ IMPORTANT: These values should match the SUPABASE_URL and SUPABASE_ANON_KEY in root .env
# Get these values from: https://app.supabase.com/project/_/settings/api
REACT_APP_SUPABASE_URL=your_supabase_project_url
REACT_APP_SUPABASE_ANON_KEY=your_supabase_anon_key
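
Because the two files must agree on the Supabase URL and anon key, a small check can catch drift before you start the servers. The following is a minimal sketch, not part of RedactFlow: the helper names `parse_env`, `find_mismatches`, and `check_files` are hypothetical, and the parser only handles plain `KEY=VALUE` lines.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# (backend key, frontend key) pairs that must hold the same value.
SYNCED_KEYS = [
    ("SUPABASE_URL", "REACT_APP_SUPABASE_URL"),
    ("SUPABASE_ANON_KEY", "REACT_APP_SUPABASE_ANON_KEY"),
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_mismatches(backend_env: str, frontend_env: str) -> List[Tuple[str, str]]:
    """Return the key pairs whose values differ between the two files."""
    backend = parse_env(backend_env)
    frontend = parse_env(frontend_env)
    return [
        (b_key, f_key)
        for b_key, f_key in SYNCED_KEYS
        if backend.get(b_key) != frontend.get(f_key)
    ]


def check_files(root: str = ".") -> List[Tuple[str, str]]:
    """Compare <root>/.env with <root>/frontend/.env."""
    base = Path(root)
    return find_mismatches(
        (base / ".env").read_text(),
        (base / "frontend" / ".env").read_text(),
    )
```

Calling `check_files()` from the project root returns an empty list when both files agree, and the mismatched key pairs otherwise.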

5. Run the Application

Option A: Use the provided scripts (Recommended)

# Terminal 1: Start Backend
chmod +x start_backend.sh
./start_backend.sh

# Terminal 2: Start Frontend
chmod +x start_frontend.sh
./start_frontend.sh

Option B: Manual startup

# Terminal 1: Start Backend
source venv/bin/activate
cd backend
python main.py

# Terminal 2: Start Frontend
cd frontend
npm start

6. Access the Application

Frontend: http://localhost:3000
Backend API: http://localhost:8000
API Documentation: http://localhost:8000/docs

Option 2: Original Streamlit Interface

If you prefer the original Streamlit interface:

# Setup (same as above)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run Streamlit app
streamlit run app.py

🎯 Features

Modern React Frontend

User Authentication: Secure login/signup with Supabase (email/password and Google OAuth)
Subscription Management: Track user plans and page limits
Interactive PDF Canvas: Draw manual redactions directly on the PDF
Real-time Preview: See redactions applied instantly
Smart Detection Panel: Quick prompts for common redaction types
Workflow Progress: Visual progress tracking through the AI workflow
Responsive Design: Works on desktop and mobile devices

Core AI Features

Agentic Workflow: Utilizes a robust agentic workflow powered by LangGraph
Dual OCR Technology: Employs a creative dual OCR process for both high-level content understanding and precise word-level coordinate mapping
Intelligent Detection: Leverages large language models (LLMs) to analyze document content and identify sensitive information
Image Detection (Optional): Uses Google Gemini's spatial understanding API to detect and redact logos, icons, stamps, seals, and other graphical elements
Evaluator Feedback Loop: Includes an evaluator agent that provides feedback to the detector, iteratively improving accuracy
Human-in-the-Loop (HITL) Interface: Review, edit, and approve AI-detected redactions, plus add manual redactions
Flexible and Configurable: Easily configurable with your own Azure OpenAI and Document Intelligence API keys

🏗️ System Architecture

The RedactFlow system is built around a LangGraph state machine that orchestrates the flow of data through a series of nodes:

Workflow Nodes

Orchestrator: Entry point that interprets user prompts and routes requests
Searcher: (Optional) Searches for external regulations and compliance information
Detector: Core detection using dual OCR and dual LLM architecture
Evaluator: Reviews detected data and provides feedback to improve accuracy
Human-in-the-Loop (HITL): Pauses workflow for user review and approval
Redactor: Applies final redactions to create sanitized PDF

The Detector Workflow

The full workflow runs through the nodes below. The Detector, with its dual OCR and dual LLM architecture, is the core of the pipeline:

Orchestrator: The entry point of the workflow. It interprets the user's prompt and decides whether to route the request to the Searcher for external regulation lookup or directly to the Detector.
Searcher: (Optional) Searches for external regulations and compliance information to enrich the detection criteria.
Detector: The core of the sensitive data detection process. Uses content batching with a dual OCR and dual LLM architecture to analyze all pages together in a single API call instead of one call per page.
Evaluator: Reviews all detected sensitive data across the entire document in a single LLM call and provides feedback to improve accuracy.
Corrector: Applies evaluator feedback to refine detections across all pages in a single LLM call, so that improvements are applied consistently throughout the document.
Human-in-the-Loop (HITL): Pauses the workflow and waits for the user to review, edit, and approve the redactions through the UI (the React frontend or the legacy Streamlit app).
Redactor: Applies the final redactions to the PDF, creating a sanitized version of the document.
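
The routing between these nodes can be illustrated with a plain-Python state machine. This is only a sketch of the control flow described above, not RedactFlow's LangGraph code: the state keys (`needs_regulations`, `approved`), the placeholder values, and the choice to stop on rejection are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
# Each node may update the shared state and returns the name of the next node.
Node = Callable[[State], str]


def orchestrator(state: State) -> str:
    # Route to the Searcher only when regulation lookup is requested.
    return "searcher" if state.get("needs_regulations") else "detector"


def searcher(state: State) -> str:
    state["regulations"] = ["<external criteria placeholder>"]
    return "detector"


def detector(state: State) -> str:
    state["detections"] = ["<detected item placeholder>"]
    return "evaluator"


def evaluator(state: State) -> str:
    state["feedback"] = "<evaluator feedback placeholder>"
    return "corrector"


def corrector(state: State) -> str:
    # A real corrector would refine state["detections"] using the feedback.
    return "hitl"


def hitl(state: State) -> str:
    # The real workflow pauses here for the user; this sketch reads a flag.
    return "redactor" if state.get("approved") else "END"


def redactor(state: State) -> str:
    state["redacted"] = True
    return "END"


NODES: Dict[str, Node] = {
    "orchestrator": orchestrator,
    "searcher": searcher,
    "detector": detector,
    "evaluator": evaluator,
    "corrector": corrector,
    "hitl": hitl,
    "redactor": redactor,
}


def run(state: State, start: str = "orchestrator") -> List[str]:
    """Walk the graph from `start` and return the visited node names."""
    trace: List[str] = []
    current = start
    while current != "END":
        trace.append(current)
        current = NODES[current](state)
    return trace
```

For example, `run({"approved": True})` visits the orchestrator, detector, evaluator, corrector, HITL, and redactor nodes in that order, skipping the optional Searcher.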

The Innovative Content Batching Architecture

Content batching changes how RedactFlow processes multi-page documents. Instead of sending each page to the LLM individually, RedactFlow combines all pages into single prompts and processes an entire document in just 3 LLM API calls, which cuts both the number of API calls and the cost.
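
As a rough illustration of the batching idea, the sketch below joins the text of every page into one prompt with explicit page markers, so a single LLM call can see the whole document. The function name `build_batched_prompt` and the marker format are hypothetical; RedactFlow's actual prompts are not shown in this README.

```python
from typing import List


def build_batched_prompt(user_prompt: str, pages: List[str]) -> str:
    """Combine all page texts into a single prompt with page markers."""
    sections = [
        f"--- PAGE {number} ---\n{text.strip()}"
        for number, text in enumerate(pages, start=1)
    ]
    return f"Redaction request: {user_prompt}\n\n" + "\n\n".join(sections)


prompt = build_batched_prompt(
    "names, addresses, phone numbers",
    ["Text of the first page", "Text of the second page"],
)
```

Keeping the page markers in the prompt lets the model report which page each finding came from, which matters later when findings are mapped back to coordinates.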

How Content Batching Works:

Dual OCR in Parallel:
- Page-level OCR: Extracts content for high-level semantic analysis
- Word-level OCR: Gets precise coordinates of each word
Dual LLM Analysis:
- Sensitive Identification LLM: Analyzes content to identify sensitive information
- Mapping LLM: Maps sensitive content to precise word-level coordinates
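
To make the mapping step concrete, the sketch below locates a sensitive phrase in word-level OCR output and merges the matching word boxes into one redaction rectangle. RedactFlow uses an LLM for this mapping; the exact-match search here is a deliberately simpler stand-in, and the `Word` type, its `(x0, y0, x1, y1)` box layout, and `locate_phrase` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Word:
    text: str
    page: int
    box: Box


def normalize(token: str) -> str:
    """Lowercase a token and drop surrounding punctuation."""
    return token.strip(".,:;()").lower()


def locate_phrase(phrase: str, words: List[Word]) -> Optional[Tuple[int, Box]]:
    """Return (page, bounding box) of the first occurrence of `phrase`."""
    target = [normalize(t) for t in phrase.split()]
    n = len(target)
    if n == 0:
        return None
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if [normalize(w.text) for w in window] != target:
            continue
        if len({w.page for w in window}) != 1:
            continue  # skip matches that straddle a page break
        # Union of the word boxes gives one rectangle covering the phrase.
        box = (
            min(w.box[0] for w in window),
            min(w.box[1] for w in window),
            max(w.box[2] for w in window),
            max(w.box[3] for w in window),
        )
        return window[0].page, box
    return None


words = [
    Word("Name:", 1, (10, 10, 40, 20)),
    Word("Jane", 1, (45, 10, 70, 20)),
    Word("Doe", 1, (72, 10, 95, 21)),
]
match = locate_phrase("Jane Doe", words)
```

A single union box works for phrases on one line; a phrase that wraps across lines would need one box per line, which this sketch does not attempt.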

🎮 How to Use

Using the React Frontend

Upload a PDF: Drag and drop or click to upload your PDF file
Set Detection Prompt: Enter what you want to redact (e.g., "names, addresses, phone numbers")
Run Detection: Click "Run Detection" to start the AI workflow
Review Results:
- Review AI-detected sensitive information
- Edit or delete incorrect detections
- Add manual redactions by drawing on the PDF
Approve or Reject:
- Approve: Proceed to final redaction
- Reject: Modify your prompt and try again
Download: Get your redacted PDF

Key Features

Manual Redaction: Draw rectangles directly on the PDF to mark sensitive areas
Edit AI Detections: Modify content, reasons, or bounding boxes
Real-time Preview: See changes applied instantly
Workflow Control: Approve, reject, or go back to review stage

RedactFlow improves detection accuracy through a feedback loop between the Evaluator, Corrector, and Human-in-the-Loop (HITL) nodes.

Content Batching Feedback Loop: After the Detector identifies sensitive data across all pages, the Evaluator analyzes the entire document in a single LLM call, comparing the results with the user prompt and the document context. Because it sees all pages at once, it can spot patterns and inconsistencies across the document and produce feedback for every page in one pass. The Corrector then applies this feedback to refine detections across all pages, again in a single LLM call, so that improvements are applied consistently throughout the document.

📁 Project Structure

RedactFlow/
├── frontend/                 # React frontend
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── types.ts         # TypeScript type definitions
│   │   └── App.tsx          # Main React app
│   ├── package.json         # Frontend dependencies
│   └── tsconfig.json        # TypeScript configuration
├── backend/                  # FastAPI backend
│   ├── main.py              # FastAPI server
│   ├── requirements.txt     # Python dependencies
│   └── static/              # Generated PDF previews
├── nodes/                    # LangGraph workflow nodes
│   ├── orchestrator.py      # Main workflow orchestration
│   ├── detector_node.py     # Dual OCR/LLM detection
│   ├── evaluator_node.py    # Detection evaluation
│   ├── hitl_node.py         # Human-in-the-loop logic
│   └── redactor_node.py     # PDF redaction
├── output/                   # Generated files
│   ├── original/            # Original uploaded PDFs
│   ├── preview/             # Preview images
│   └── redacted/            # Final redacted PDFs
├── app.py                   # Original Streamlit app
├── requirements.txt         # Root Python dependencies
└── start_*.sh              # Startup scripts

The combination of the automated content batching feedback loop and human oversight helps ensure that the final redacted document is accurate and meets the user's specific needs.

Performance & Real-World Results

RedactFlow has been tested with real-world documents, including complex immigration forms (I-20).

Real I-20 Form Processing Results:

Document: 4-page I-20 Certificate of Eligibility for Nonimmigrant Student Status
OCR Extraction: 318 page elements, 1,297 word elements
Detection Results: 39 sensitive items with precise coordinates
API Efficiency: 3 LLM calls vs 48+ traditional individual calls
Cost Savings: ~90% reduction in API costs
Processing Time: Significant reduction through content batching
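
The call-count figure above can be checked with simple arithmetic. The sketch below uses only the two numbers reported in the list (3 batched calls and a lower bound of 48 individual calls); the helper name `reduction` is hypothetical.

```python
def reduction(before: int, after: int) -> float:
    """Fractional reduction when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 1 - after / before


batched_calls = 3
individual_calls = 48  # lower bound; the README reports "48+"

call_reduction = reduction(individual_calls, batched_calls)
print(f"Fewer API calls: {call_reduction:.2%}")  # prints "Fewer API calls: 93.75%"
```

The number of calls drops by at least 93.75%. Cost does not scale with call count alone, since each batched call carries more tokens than a single per-item call, so a somewhat smaller cost saving such as the reported ~90% is plausible.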

Detection Categories Successfully Identified:

Student Information: Names, SEVIS IDs, birth dates, citizenship
Academic Program: School names, degree types, program duration, majors
Official Information: School codes, approval dates, certification details
Financial Data: Tuition amounts, funding sources
Immigration Data: Document numbers, status classifications

Quality Assurance:

Content Batching Evaluation: Comprehensive quality checks across all pages in single LLM calls
Feedback Integration: Automatic correction of detection gaps
Coordinate Precision: Exact pixel-level redaction boundaries
Fallback Mechanisms: Robust error handling and recovery

Preview before and after user's feedback

Preview Before Human Feedback

Example user prompt: You failed to detect the UID and financial amount. Also, include the date.

Preview After Human Feedback

The final redacted version is below:

Final Redacted Version

🔧 Configuration

Environment Variables

Create a .env file with the following variables:

# Required: Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key_here

# Required: Azure Document Intelligence
AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_DI_KEY=your_di_key_here

# Optional: Google Cloud Image Detection
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro
# GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json  # If using service account

# Optional: Tavily Search (for external regulation lookup)
TAVILY_KEY=your_tavily_key_here

# Demo/Deployment Controls
# Limit analysis to the first N pages (0 disables the cap)
MAX_ANALYZED_PAGES=4
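
The page cap semantics (first N pages, with 0 disabling the cap) can be sketched as follows. This is an illustration of the documented behavior rather than RedactFlow's own code: the function names are hypothetical, and the fallback of 4 when the variable is unset is taken from the defaults listed for image detection.

```python
import os
from typing import List, Mapping, Sequence, TypeVar

T = TypeVar("T")


def read_page_cap(env: Mapping[str, str] = os.environ, default: int = 4) -> int:
    """Read MAX_ANALYZED_PAGES; an unset or empty value falls back to `default`."""
    raw = env.get("MAX_ANALYZED_PAGES", "").strip()
    if not raw:
        return default
    cap = int(raw)
    if cap < 0:
        raise ValueError("MAX_ANALYZED_PAGES must be 0 or a positive integer")
    return cap


def apply_page_cap(pages: Sequence[T], cap: int) -> List[T]:
    """Keep the first `cap` pages; a cap of 0 keeps every page."""
    return list(pages) if cap == 0 else list(pages[:cap])
```

With `MAX_ANALYZED_PAGES=4`, a six-page document is analyzed only through page 4; with `MAX_ANALYZED_PAGES=0`, all six pages are analyzed.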

Azure Setup

Azure OpenAI Service:
- Create an Azure OpenAI resource
- Deploy a GPT-4 model
- Get your endpoint and API key
Azure Document Intelligence:
- Create a Document Intelligence resource
- Get your endpoint and key

Google Cloud Setup (for Image Detection - Optional)

RedactFlow uses Google Gemini's spatial understanding API to detect logos, icons, stamps, seals, and other graphical elements within PDF documents. This feature is optional but highly recommended for comprehensive document sanitization.

Quick Setup Steps

Install Google Cloud SDK:

# macOS (Homebrew)
brew install --cask google-cloud-sdk
source "$(brew --prefix)/share/google-cloud-sdk/path.zsh.inc"

# Verify installation
gcloud --version

Fix Permissions (if needed): If you encounter permission errors during installation, run:
```
sudo chown -R $USER:staff ~/.config
brew reinstall gcloud-cli
```
Authenticate:
```
gcloud auth application-default login
```
A browser window will open. Sign in with a Google account that has access to the redactflow-486302 project.

Set Quota Project:

gcloud auth application-default set-quota-project redactflow-486302

Update Your .env File: Add these lines to your root .env file:

# Google Cloud Image Detection
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro

Restart the Backend:
```
./start_backend.sh
```

Alternative: Service Account Key

If you don't want to install gcloud, you can use a service account key file:

Place the key file at secrets/redactflow-service-account.json

Add to .env:

GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro

Restart the backend (no gcloud CLI needed)

Environment Variables for Image Detection

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `GOOGLE_CLOUD_PROJECT` | Yes | — | Your GCP project ID |
| `ENABLE_IMAGE_DETECTION` | Yes | `false` | Set to `true` to enable |
| `GEMINI_MODEL_ID` | No | `gemini-2.5-flash` | Model choice (see below) |
| `GEMINI_LOCATION` | No | `global` | Vertex AI region |
| `IMAGE_DETECT_CONCURRENCY` | No | `4` | Max pages processed in parallel |
| `MAX_ANALYZED_PAGES` | No | `4` | Page limit for detection |
| `GOOGLE_APPLICATION_CREDENTIALS` | No | — | Path to service account JSON |
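
One way to read these variables with the defaults from the table is sketched below. The `ImageDetectionConfig` class and `load_image_detection_config` function are hypothetical names used for illustration, and treating only the literal string `true` as enabled is an assumption; the backend may parse these values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ImageDetectionConfig:
    project: Optional[str]
    enabled: bool
    model_id: str
    location: str
    concurrency: int
    max_pages: int
    credentials_path: Optional[str]


def load_image_detection_config(
    env: Mapping[str, str] = os.environ,
) -> ImageDetectionConfig:
    """Build the config from environment variables, applying table defaults."""
    enabled = env.get("ENABLE_IMAGE_DETECTION", "false").strip().lower() == "true"
    project = env.get("GOOGLE_CLOUD_PROJECT") or None
    if enabled and not project:
        raise ValueError(
            "GOOGLE_CLOUD_PROJECT is required when ENABLE_IMAGE_DETECTION=true"
        )
    return ImageDetectionConfig(
        project=project,
        enabled=enabled,
        model_id=env.get("GEMINI_MODEL_ID", "gemini-2.5-flash"),
        location=env.get("GEMINI_LOCATION", "global"),
        concurrency=int(env.get("IMAGE_DETECT_CONCURRENCY", "4")),
        max_pages=int(env.get("MAX_ANALYZED_PAGES", "4")),
        credentials_path=env.get("GOOGLE_APPLICATION_CREDENTIALS") or None,
    )
```

Failing early when detection is enabled without a project ID surfaces a misconfigured `.env` at startup instead of on the first detection request.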

Model Choices

| Model | Speed | Cost | Accuracy | Best for |
| --- | --- | --- | --- | --- |
| `gemini-2.5-flash` | Fast | Low | Good | Cost-sensitive, high volume |
| `gemini-2.5-pro` | Moderate | Mid | Better | Precision matters, small logos |

For detailed setup instructions and troubleshooting, see docs/IMAGE_DETECTION_SETUP.md.

Supabase Setup (Authentication + Storage)

Create a Supabase Project:
- Go to Supabase and create a new project
- Wait for the project to be fully provisioned
Get API Credentials:
- Navigate to Project Settings → API
- Copy your Project URL → SUPABASE_URL
- Copy your anon/public key → SUPABASE_ANON_KEY
- Copy your service_role key → SUPABASE_SERVICE_KEY (keep this secret!)
- Copy your JWT Secret from Project Settings → API → JWT Settings → SUPABASE_JWT_SECRET

Run Database Migrations (in order):

Open Supabase Dashboard → SQL Editor and run these scripts:

| Order | Script | What it creates |
| --- | --- | --- |
| 1 | `scripts/create_storage_tables.sql` | `users` table (synced from auth.users via trigger), `files` table (metadata, OCR, annotations), `source-files` and `redacted-files` Storage buckets, RLS policies, indexes |
| 2 | `scripts/create_subscriptions_table.sql` | Subscription management table (for Stripe integration) |

Important: create_storage_tables.sql must be run first. It creates the core users and files tables, Storage buckets, and Row Level Security policies that the application depends on. Without it, the backend will fail to start.

Configure Environment Variables:
- Copy .env.example to .env in the project root and fill in your Supabase credentials
- Copy frontend/.env.example to frontend/.env and fill in the frontend Supabase credentials
- Make sure SUPABASE_URL and SUPABASE_ANON_KEY match between both files

🐛 Troubleshooting

Common Issues

Port Already in Use:

# Kill processes on ports 3000 and 8000
lsof -ti:3000 | xargs kill -9
lsof -ti:8000 | xargs kill -9
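
Before killing anything, you can check from Python whether a port is actually occupied. This is a small cross-platform sketch using only the standard library; `port_in_use` is a hypothetical helper, and it only probes the IPv4 loopback address, so a server bound solely to IPv6 would not be detected.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8000)` returns `True` while the backend is running and `False` once it has stopped.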

TypeScript Errors:
```
cd frontend
npm install
```

Python Dependencies:

pip install --upgrade pip
pip install -r backend/requirements.txt

Node Modules Issues:

cd frontend
rm -rf node_modules package-lock.json
npm install

Google Cloud Authentication Errors: If you see "Your default credentials were not found" errors:

# Option A: Re-authenticate with gcloud
gcloud auth application-default login
gcloud auth application-default set-quota-project redactflow-486302

# Option B: Use service account key
# Add to .env: GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json

Permission Denied for .config Directory:

# Fix ownership of .config directory
sudo chown -R $USER:staff ~/.config

# Reinstall gcloud if needed
brew reinstall gcloud-cli

Image Detection Not Working:
- Ensure ENABLE_IMAGE_DETECTION=true in your .env file
- Verify Google Cloud credentials are set up correctly
- Check backend logs for specific error messages
- See docs/IMAGE_DETECTION_SETUP.md for detailed troubleshooting

Development

Frontend Development: cd frontend && npm start
Backend Development: cd backend && python main.py
API Testing: Visit http://localhost:8000/docs for interactive API documentation

📄 File Descriptions

frontend/src/App.tsx: Main React application with PDF canvas and workflow management
backend/main.py: FastAPI server handling PDF processing and AI workflow
nodes/orchestrator.py: LangGraph workflow orchestration
nodes/detector_node.py: Dual OCR and dual LLM detection logic
nodes/recall_node.py: AI recall additions and feedback
nodes/hitl_node.py: Human-in-the-loop workflow control
nodes/redactor_node.py: PDF redaction application
app.py: Original Streamlit interface (legacy)

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with LangGraph for workflow orchestration
Powered by Azure OpenAI and Azure Document Intelligence
Frontend built with React and Tailwind CSS
Backend powered by FastAPI

Project details

Release history

0.0.2

Apr 7, 2026

This version

0.0.1

Apr 7, 2026

