AI-powered PDF redaction tool with conversational interface
RedactFlow: Agentic PDF Sanitizer
RedactFlow is a powerful and intelligent PDF sanitization tool that uses a sophisticated agentic workflow to detect and redact sensitive information from your documents. It combines state-of-the-art AI models with a human-in-the-loop (HITL) interface to ensure accurate and reliable redaction.
Quick Start
Prerequisites
- Python 3.8+
- Node.js 16+ (for the React frontend)
- An Azure account with access to Azure OpenAI and Azure Document Intelligence
- A Supabase account for user authentication and subscription management (free tier available)
Option 1: Modern React Frontend (Recommended)
The application now features a modern React frontend with a FastAPI backend for better performance and user experience.
1. Clone and Setup
git clone https://github.com/matthewyijielu0317/RedactFlow.git
cd RedactFlow
2. Backend Setup
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install backend dependencies
pip install -r backend/requirements.txt
3. Frontend Setup
# Install frontend dependencies
cd frontend
npm install
cd ..
4. Environment Configuration
Note on Environment Files: This project uses two .env files:
- Root .env: Backend configuration (Python/FastAPI)
- frontend/.env: Frontend configuration (React)
This separation is necessary because:
- React (Create React App) only reads .env from its own directory
- Backend and frontend have different environment variable requirements
- Important: Keep Supabase URLs in sync between both files!
Root .env file (Backend Configuration):
Create a .env file in the root directory:
# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
# Azure Document Intelligence Configuration
AZURE_DI_ENDPOINT=your_azure_di_endpoint
AZURE_DI_KEY=your_azure_di_key
# Google Cloud Configuration (Optional - for image detection)
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro
# Supabase Configuration (for backend authentication)
SUPABASE_URL=your_supabase_project_url
SUPABASE_JWT_SECRET=your_supabase_jwt_secret
SUPABASE_ANON_KEY=your_supabase_anon_key
SUPABASE_SERVICE_KEY=your_supabase_service_key
# Stripe Configuration (Optional, for payment integration)
STRIPE_SECRET_KEY=your_stripe_secret_key
STRIPE_PUBLISHABLE_KEY=your_stripe_publishable_key
STRIPE_WEBHOOK_SECRET=your_stripe_webhook_secret
# Tavily Search (Optional)
TAVILY_KEY=your_tavily_key
frontend/.env file (Frontend Configuration):
Create a .env file in the frontend/ directory (use frontend/.env.example as template):
# Supabase Configuration
# IMPORTANT: These values should match the SUPABASE_URL and SUPABASE_ANON_KEY in root .env
# Get these values from: https://app.supabase.com/project/_/settings/api
REACT_APP_SUPABASE_URL=your_supabase_project_url
REACT_APP_SUPABASE_ANON_KEY=your_supabase_anon_key
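Because the two files must agree on the Supabase values, a small check can catch drift before startup. The following is a minimal sketch, not part of RedactFlow itself: the helper names `parse_env` and `find_mismatches` are hypothetical, and the parser only handles plain `KEY=VALUE` lines (no inline comments or multi-line values).

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Backend variable name -> frontend variable name, as listed in the two files above.
SHARED_KEYS = {
    "SUPABASE_URL": "REACT_APP_SUPABASE_URL",
    "SUPABASE_ANON_KEY": "REACT_APP_SUPABASE_ANON_KEY",
}


def find_mismatches(backend_env: dict, frontend_env: dict) -> list:
    """Return the backend keys whose value differs from the frontend counterpart."""
    return [
        backend_key
        for backend_key, frontend_key in SHARED_KEYS.items()
        if backend_env.get(backend_key) != frontend_env.get(frontend_key)
    ]


if __name__ == "__main__":
    root_file, frontend_file = Path(".env"), Path("frontend/.env")
    if root_file.exists() and frontend_file.exists():
        mismatches = find_mismatches(
            parse_env(root_file.read_text()), parse_env(frontend_file.read_text())
        )
        print("Out of sync:", mismatches or "none")
    else:
        print("Expected both .env and frontend/.env to exist")
```

Run it from the repository root after editing either file; an empty result means the shared values match.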
5. Run the Application
Option A: Use the provided scripts (Recommended)
# Terminal 1: Start Backend
chmod +x start_backend.sh
./start_backend.sh
# Terminal 2: Start Frontend
chmod +x start_frontend.sh
./start_frontend.sh
Option B: Manual startup
# Terminal 1: Start Backend
source venv/bin/activate
cd backend
python main.py
# Terminal 2: Start Frontend
cd frontend
npm start
6. Access the Application
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
Option 2: Original Streamlit Interface
If you prefer the original Streamlit interface:
# Setup (same as above)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Run Streamlit app
streamlit run app.py
Features
Modern React Frontend
- User Authentication: Secure login/signup with Supabase (email/password and Google OAuth)
- Subscription Management: Track user plans and page limits
- Interactive PDF Canvas: Draw manual redactions directly on the PDF
- Real-time Preview: See redactions applied instantly
- Smart Detection Panel: Quick prompts for common redaction types
- Workflow Progress: Visual progress tracking through the AI workflow
- Responsive Design: Works on desktop and mobile devices
Core AI Features
- Agentic Workflow: Utilizes a robust agentic workflow powered by LangGraph
- Dual OCR Technology: Employs a creative dual OCR process for both high-level content understanding and precise word-level coordinate mapping
- Intelligent Detection: Leverages large language models (LLMs) to analyze document content and identify sensitive information
- Image Detection (Optional): Uses Google Gemini's spatial understanding API to detect and redact logos, icons, stamps, seals, and other graphical elements
- Evaluator Feedback Loop: Includes an evaluator agent that provides feedback to the detector, iteratively improving accuracy
- Human-in-the-Loop (HITL) Interface: Review, edit, and approve AI-detected redactions, plus add manual redactions
- Flexible and Configurable: Easily configurable with your own Azure OpenAI and Document Intelligence API keys
System Architecture
The RedactFlow system is built around a LangGraph state machine that orchestrates the flow of data through a series of nodes:
Workflow Nodes
- Orchestrator: Entry point that interprets user prompts and routes requests
- Searcher: (Optional) Searches for external regulations and compliance information
- Detector: Core detection using dual OCR and dual LLM architecture
- Evaluator: Reviews detected data and provides feedback to improve accuracy
- Human-in-the-Loop (HITL): Pauses workflow for user review and approval
- Redactor: Applies final redactions to create sanitized PDF
The Detector Workflow
The most innovative part of RedactFlow is the Detector Workflow with its unique dual OCR and dual LLM architecture:
- Orchestrator: The entry point of the workflow. It interprets the user's prompt and decides whether to route the request to the Searcher for external regulation lookup or directly to the Detector.
- Searcher: (Optional) Searches for external regulations and compliance information to enrich the detection criteria.
- Detector: The core of the sensitive data detection process. Uses an innovative content batching approach with dual OCR and LLM architecture to analyze all pages together in a single API call, achieving massive efficiency gains.
- Evaluator: Reviews all detected sensitive data across the entire document in a single LLM call and provides comprehensive feedback to improve accuracy.
- Corrector: Applies evaluator feedback to refine detections across all pages in a single LLM call, ensuring consistent quality improvements throughout the document.
- Human-in-the-Loop (HITL): Pauses the workflow and waits for the user to review, edit, and approve the redactions through the Streamlit UI.
- Redactor: Applies the final redactions to the PDF, creating a sanitized version of the document.
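The node sequence above can be pictured as a small state machine. The sketch below is plain Python rather than LangGraph, and every node body is a placeholder: the keyword-based routing, the state keys, and the `run_workflow` helper are all assumptions made for illustration, not RedactFlow's actual implementation.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def orchestrator(state: State) -> str:
    # Placeholder routing: the real node interprets the prompt with an LLM.
    return "searcher" if "regulation" in state["prompt"].lower() else "detector"


def searcher(state: State) -> str:
    state["criteria"] = "external compliance notes (placeholder)"
    return "detector"


def detector(state: State) -> str:
    state["detections"] = ["placeholder detection"]
    return "evaluator"


def evaluator(state: State) -> str:
    state["feedback"] = "placeholder feedback"
    return "corrector"


def corrector(state: State) -> str:
    state["detections"] = state["detections"] + ["added after feedback"]
    return "hitl"


def hitl(state: State) -> str:
    # The real workflow pauses here; this sketch reads a pre-set decision instead.
    return "redactor" if state.get("approved") else "end"


def redactor(state: State) -> str:
    state["redacted"] = True
    return "end"


NODES: Dict[str, Callable[[State], str]] = {
    "orchestrator": orchestrator,
    "searcher": searcher,
    "detector": detector,
    "evaluator": evaluator,
    "corrector": corrector,
    "hitl": hitl,
    "redactor": redactor,
}


def run_workflow(prompt: str, approved: bool) -> State:
    """Walk the nodes from the orchestrator until one of them returns 'end'."""
    state: State = {"prompt": prompt, "approved": approved, "visited": []}
    current = "orchestrator"
    while current != "end":
        state["visited"].append(current)
        current = NODES[current](state)
    return state


if __name__ == "__main__":
    print(run_workflow("redact names and addresses", approved=True)["visited"])
```

Each node returns the name of the next node, which keeps the optional Searcher branch and the HITL approval gate explicit.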
The Innovative Content Batching Architecture
The Content Batching Architecture changes how RedactFlow processes multi-page documents. Instead of processing pages individually, it combines all pages into single prompts and processes an entire document in just 3 LLM API calls, which sharply reduces the number of API requests.
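To make the batching idea concrete, here is a minimal sketch of combining per-page text into one prompt. The function name `build_batched_prompt`, the page delimiter, and the instruction wording are illustrative assumptions; the prompts RedactFlow actually sends are not shown in this README.

```python
from typing import List


def build_batched_prompt(user_prompt: str, pages: List[str]) -> str:
    """Combine every page into one prompt, tagging each page so results can be traced back."""
    sections = [
        f"=== PAGE {number} ===\n{text.strip()}"
        for number, text in enumerate(pages, start=1)
    ]
    return (
        f"Redaction request: {user_prompt}\n\n"
        + "\n\n".join(sections)
        + "\n\nFor each sensitive item, report the page number, the exact text, and a reason."
    )


if __name__ == "__main__":
    print(build_batched_prompt("names and dates", ["Name: Jane Doe", "Issued: 2024-01-05"]))
```

The page markers let a single response refer back to individual pages, which is what allows one call to replace a call per page.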
How Content Batching Works:
- Dual OCR in Parallel:
  - Page-level OCR: Extracts content for high-level semantic analysis
  - Word-level OCR: Gets precise coordinates of each word
- Dual LLM Analysis:
  - Sensitive Identification LLM: Analyzes content to identify sensitive information
  - Mapping LLM: Maps sensitive content to precise word-level coordinates
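The mapping step pairs a piece of sensitive text with word-level coordinates. RedactFlow uses an LLM for this; the sketch below swaps in simple case-insensitive matching of consecutive words purely to show the data flow. The `Word` type, the box layout `(x0, y0, x1, y1)`, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Word:
    text: str
    box: Box


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that covers every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def locate_phrase(words: List[Word], phrase: str) -> Optional[Box]:
    """Find the first run of consecutive words matching the phrase and return its bounding box."""
    target = phrase.lower().split()
    if not target:
        return None
    tokens = [w.text.lower() for w in words]
    for start in range(len(tokens) - len(target) + 1):
        if tokens[start:start + len(target)] == target:
            return union_box([w.box for w in words[start:start + len(target)]])
    return None


if __name__ == "__main__":
    page_words = [
        Word("Name:", (10, 10, 50, 20)),
        Word("Jane", (55, 10, 85, 20)),
        Word("Doe", (90, 11, 115, 21)),
    ]
    print(locate_phrase(page_words, "Jane Doe"))
```

A phrase that wraps across lines would produce one oversized box with this approach, so a fuller version would return one box per line.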
How to Use
Using the React Frontend
- Upload a PDF: Drag and drop or click to upload your PDF file
- Set Detection Prompt: Enter what you want to redact (e.g., "names, addresses, phone numbers")
- Run Detection: Click "Run Detection" to start the AI workflow
- Review Results:
- Review AI-detected sensitive information
- Edit or delete incorrect detections
- Add manual redactions by drawing on the PDF
- Approve or Reject:
- Approve: Proceed to final redaction
- Reject: Modify your prompt and try again
- Download: Get your redacted PDF
Key Features
- Manual Redaction: Draw rectangles directly on the PDF to mark sensitive areas
- Edit AI Detections: Modify content, reasons, or bounding boxes
- Real-time Preview: See changes applied instantly
- Workflow Control: Approve, reject, or go back to review stage
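A rectangle drawn on a preview image has to be expressed in PDF units before it can be applied as a redaction. The sketch below shows one way that conversion could work, assuming the preview is rendered at a known DPI and that both coordinate systems share a top-left origin. The function name and those assumptions are illustrative; this README does not describe how RedactFlow performs the conversion.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def canvas_rect_to_pdf_points(rect_px: Rect, render_dpi: float) -> Rect:
    """Convert a rectangle in preview-image pixels to PDF points (1 point = 1/72 inch)."""
    scale = 72.0 / render_dpi
    x0, y0, x1, y1 = rect_px
    # Normalize so the rectangle is valid even if the user dragged right-to-left or bottom-to-top.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return (left * scale, top * scale, right * scale, bottom * scale)


if __name__ == "__main__":
    print(canvas_rect_to_pdf_points((300, 150, 100, 50), render_dpi=144))
```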
RedactFlow's architecture is designed to continuously improve its detection accuracy through a sophisticated feedback loop between the Evaluator, Corrector, and HumanInLoop nodes.
- Content Batching Feedback Loop: After the Detector identifies sensitive data across all pages, the Evaluator node analyzes the entire document in a single LLM call, comparing results with the user prompt and document context. It generates comprehensive feedback for all pages simultaneously by processing all content together, identifying patterns and inconsistencies across the entire document. The Corrector then applies this feedback to refine detections across all pages in a single LLM call, ensuring consistent quality improvements throughout the document.
Project Structure
RedactFlow/
├── frontend/ # React frontend
│   ├── src/
│   │   ├── components/ # React components
│   │   ├── types.ts # TypeScript type definitions
│   │   └── App.tsx # Main React app
│   ├── package.json # Frontend dependencies
│   └── tsconfig.json # TypeScript configuration
├── backend/ # FastAPI backend
│   ├── main.py # FastAPI server
│   ├── requirements.txt # Python dependencies
│   └── static/ # Generated PDF previews
├── nodes/ # LangGraph workflow nodes
│   ├── orchestrator.py # Main workflow orchestration
│   ├── detector_node.py # Dual OCR/LLM detection
│   ├── evaluator_node.py # Detection evaluation
│   ├── hitl_node.py # Human-in-the-loop logic
│   └── redactor_node.py # PDF redaction
├── output/ # Generated files
│   ├── original/ # Original uploaded PDFs
│   ├── preview/ # Preview images
│   └── redacted/ # Final redacted PDFs
├── app.py # Original Streamlit app
├── requirements.txt # Root Python dependencies
└── start_*.sh # Startup scripts
This powerful combination of an automated content batching feedback loop and human oversight ensures that the final redacted document is accurate, reliable, and meets the user's specific needs.
Performance & Real-World Results
RedactFlow has been tested with real-world documents, including complex immigration forms (I-20), demonstrating exceptional performance:
Real I-20 Form Processing Results:
- Document: 4-page I-20 Certificate of Eligibility for Nonimmigrant Student Status
- OCR Extraction: 318 page elements, 1,297 word elements
- Detection Results: 39 sensitive items with precise coordinates
- API Efficiency: 3 LLM calls vs 48+ traditional individual calls
- Cost Savings: ~90% reduction in API costs
- Processing Time: Significant reduction through content batching
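As a quick sanity check on the figures above, the reduction in call count can be computed directly from the two numbers reported for this document. Note that this measures calls, not cost: actual savings also depend on how many tokens each call carries, which is not reported here.

```python
def call_reduction(batched_calls: int, per_item_calls: int) -> float:
    """Fraction of LLM calls avoided by batching."""
    return 1 - batched_calls / per_item_calls


if __name__ == "__main__":
    # 3 batched calls versus the 48 individual calls quoted above.
    print(f"{call_reduction(3, 48):.2%}")  # prints 93.75%
```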
Detection Categories Successfully Identified:
- Student Information: Names, SEVIS IDs, birth dates, citizenship
- Academic Program: School names, degree types, program duration, majors
- Official Information: School codes, approval dates, certification details
- Financial Data: Tuition amounts, funding sources
- Immigration Data: Document numbers, status classifications
Quality Assurance:
- Content Batching Evaluation: Comprehensive quality checks across all pages in single LLM calls
- Feedback Integration: Automatic correction of detection gaps
- Coordinate Precision: Exact pixel-level redaction boundaries
- Fallback Mechanisms: Robust error handling and recovery
Preview before and after user feedback
Example user feedback prompt: "You failed to detect the UID and financial amount. Also, include the date."
[Figure: final redacted version after applying the feedback]
Configuration
Environment Variables
Create a .env file with the following variables:
# Required: Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key_here
# Required: Azure Document Intelligence
AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_DI_KEY=your_di_key_here
# Optional: Google Cloud Image Detection
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro
# GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json # If using service account
# Optional: Tavily Search (for external regulation lookup)
TAVILY_KEY=your_tavily_key_here
# Demo/Deployment Controls
# Limit analysis to the first N pages (0 disables the cap)
MAX_ANALYZED_PAGES=4
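The page cap semantics ("0 disables the cap") are easy to get wrong, so here is a minimal sketch of how such a setting could be read and applied. Both helper names are hypothetical, and the fallback of 4 mirrors the value shown above rather than anything confirmed from the backend code.

```python
import os


def pages_to_analyze(total_pages: int, max_pages: int) -> range:
    """Return the 0-based page indices to analyze; a cap of 0 (or less) disables the limit."""
    if max_pages <= 0:
        return range(total_pages)
    return range(min(total_pages, max_pages))


def read_page_cap(default: int = 4) -> int:
    """Read MAX_ANALYZED_PAGES from the environment, falling back to the default on bad input."""
    raw = os.environ.get("MAX_ANALYZED_PAGES", str(default))
    try:
        return int(raw)
    except ValueError:
        return default


if __name__ == "__main__":
    cap = read_page_cap()
    print(f"cap={cap}, pages analyzed for a 10-page PDF: {list(pages_to_analyze(10, cap))}")
```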
Azure Setup
- Azure OpenAI Service:
  - Create an Azure OpenAI resource
  - Deploy a GPT-4 model
  - Get your endpoint and API key
- Azure Document Intelligence:
  - Create a Document Intelligence resource
  - Get your endpoint and key
Google Cloud Setup (for Image Detection - Optional)
RedactFlow uses Google Gemini's spatial understanding API to detect logos, icons, stamps, seals, and other graphical elements within PDF documents. This feature is optional but highly recommended for comprehensive document sanitization.
Quick Setup Steps
- Install Google Cloud SDK:
# macOS (Homebrew)
brew install --cask google-cloud-sdk
source "$(brew --prefix)/share/google-cloud-sdk/path.zsh.inc"
# Verify installation
gcloud --version
- Fix Permissions (if needed): If you encounter permission errors during installation, run:
sudo chown -R $USER:staff ~/.config
brew reinstall gcloud-cli
- Authenticate:
gcloud auth application-default login
A browser window will open. Sign in with your Google account that has access to the redactflow-486302 project.
- Set Quota Project:
gcloud auth application-default set-quota-project redactflow-486302
- Update Your .env File: Add these lines to your root .env file:
# Google Cloud Image Detection
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro
- Restart the Backend:
./start_backend.sh
Alternative: Service Account Key
If you don't want to install gcloud, you can use a service account key file:
- Place the key file at secrets/redactflow-service-account.json
- Add to .env:
GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro
- Restart the backend (no gcloud CLI needed)
Environment Variables for Image Detection
| Variable | Required | Default | Description |
|---|---|---|---|
| GOOGLE_CLOUD_PROJECT | Yes | - | Your GCP project ID |
| ENABLE_IMAGE_DETECTION | Yes | false | Set to true to enable |
| GEMINI_MODEL_ID | No | gemini-2.5-flash | Model choice (see below) |
| GEMINI_LOCATION | No | global | Vertex AI region |
| IMAGE_DETECT_CONCURRENCY | No | 4 | Max pages processed in parallel |
| MAX_ANALYZED_PAGES | No | 4 | Page limit for detection |
| GOOGLE_APPLICATION_CREDENTIALS | No | - | Path to service account JSON |
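A setting like IMAGE_DETECT_CONCURRENCY bounds how many pages are in flight at once. One way to picture that is a thread pool sized to the setting, as in the sketch below. The function `detect_on_all_pages` and the `detect_page` callback are hypothetical stand-ins; the real backend may use a different concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def detect_on_all_pages(
    pages: List[bytes],
    detect_page: Callable[[bytes], list],
    concurrency: int = 4,
) -> List[list]:
    """Run detect_page on every page with at most `concurrency` pages in flight.

    Results are returned in the same order as the input pages.
    """
    with ThreadPoolExecutor(max_workers=max(1, concurrency)) as pool:
        return list(pool.map(detect_page, pages))


if __name__ == "__main__":
    # Stand-in detector that just reports the size of each page image.
    fake_pages = [b"page-1", b"page-22", b"page-333"]
    print(detect_on_all_pages(fake_pages, lambda image: [len(image)], concurrency=2))
```

Threads suit this case because each page detection is dominated by waiting on a network call rather than by local computation.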
Model Choices
| Model | Speed | Cost | Accuracy | Best for |
|---|---|---|---|---|
| gemini-2.5-flash | Fast | Low | Good | Cost-sensitive, high volume |
| gemini-2.5-pro | Moderate | Mid | Better | Precision matters, small logos |
For detailed setup instructions and troubleshooting, see docs/IMAGE_DETECTION_SETUP.md.
Supabase Setup (Authentication + Storage)
- Create a Supabase Project:
  - Go to Supabase and create a new project
  - Wait for the project to be fully provisioned
- Get API Credentials:
  - Navigate to Project Settings → API
  - Copy your Project URL → SUPABASE_URL
  - Copy your anon/public key → SUPABASE_ANON_KEY
  - Copy your service_role key → SUPABASE_SERVICE_KEY (keep this secret!)
  - Copy your JWT Secret from Project Settings → API → JWT Settings → SUPABASE_JWT_SECRET
- Run Database Migrations (in order):
Open Supabase Dashboard → SQL Editor and run these scripts:
| Order | Script | What it creates |
|---|---|---|
| 1 | scripts/create_storage_tables.sql | users table (synced from auth.users via trigger), files table (metadata, OCR, annotations), source-files and redacted-files Storage buckets, RLS policies, indexes |
| 2 | scripts/create_subscriptions_table.sql | Subscription management table (for Stripe integration) |
Important: create_storage_tables.sql must be run first. It creates the core users and files tables, Storage buckets, and Row Level Security policies that the application depends on. Without it, the backend will fail to start.
- Configure Environment Variables:
  - Copy .env.example to .env in the project root and fill in your Supabase credentials
  - Copy frontend/.env.example to frontend/.env and fill in the frontend Supabase credentials
  - Make sure SUPABASE_URL and SUPABASE_ANON_KEY match between both files
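To show what the JWT secret is for, here is a standard-library sketch of checking an HS256-signed token, assuming the project issues HS256 tokens signed with SUPABASE_JWT_SECRET. The function names are hypothetical and the sketch checks only the signature and the `exp` claim; a production backend would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: str) -> str:
    """Build an HS256 token; included only so the example can create a token to verify."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(payload).encode())
    signature = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature is valid and the token has not expired, else None."""
    try:
        header_b64, body_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(body_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:
        return None  # wrong number of segments, bad base64, or bad JSON
    if not isinstance(header, dict) or not isinstance(payload, dict):
        return None
    if header.get("alg") != "HS256":
        return None
    expected = hmac.new(secret.encode(), f"{header_b64}.{body_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None
    current_time = time.time() if now is None else now
    expires_at = payload.get("exp")
    if isinstance(expires_at, (int, float)) and current_time >= expires_at:
        return None
    return payload


if __name__ == "__main__":
    demo_token = sign_hs256({"sub": "user-1", "exp": time.time() + 60}, "demo-secret")
    print(verify_hs256(demo_token, "demo-secret") is not None)
```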
Troubleshooting
Common Issues
- Port Already in Use:
# Kill processes on ports 3000 and 8000
lsof -ti:3000 | xargs kill -9
lsof -ti:8000 | xargs kill -9
- TypeScript Errors:
cd frontend
npm install
- Python Dependencies:
pip install --upgrade pip
pip install -r backend/requirements.txt
- Node Modules Issues:
cd frontend
rm -rf node_modules package-lock.json
npm install
- Google Cloud Authentication Errors: If you see "Your default credentials were not found" errors:
# Option A: Re-authenticate with gcloud
gcloud auth application-default login
gcloud auth application-default set-quota-project redactflow-486302
# Option B: Use service account key
# Add to .env: GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json
- Permission Denied for .config Directory:
# Fix ownership of .config directory
sudo chown -R $USER:staff ~/.config
# Reinstall gcloud if needed
brew reinstall gcloud-cli
- Image Detection Not Working:
  - Ensure ENABLE_IMAGE_DETECTION=true in your .env file
  - Verify Google Cloud credentials are set up correctly
  - Check backend logs for specific error messages
  - See docs/IMAGE_DETECTION_SETUP.md for detailed troubleshooting
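Before killing processes, it can help to confirm which of the two ports is actually occupied. The sketch below is a cross-platform check using only the standard library; `port_in_use` is a hypothetical helper, and it only detects listeners reachable on 127.0.0.1.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in (("frontend", 3000), ("backend", 8000)):
        print(f"{name} port {port}:", "in use" if port_in_use(port) else "free")
```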
Development
- Frontend Development: cd frontend && npm start
- Backend Development: cd backend && python main.py
- API Testing: Visit http://localhost:8000/docs for interactive API documentation
File Descriptions
- frontend/src/App.tsx: Main React application with PDF canvas and workflow management
- backend/main.py: FastAPI server handling PDF processing and AI workflow
- nodes/orchestrator.py: LangGraph workflow orchestration
- nodes/detector_node.py: Dual OCR and dual LLM detection logic
- nodes/recall_node.py: AI recall additions and feedback
- nodes/hitl_node.py: Human-in-the-loop workflow control
- nodes/redactor_node.py: PDF redaction application
- app.py: Original Streamlit interface (legacy)
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with LangGraph for workflow orchestration
- Powered by Azure OpenAI and Azure Document Intelligence
- Frontend built with React and Tailwind CSS
- Backend powered by FastAPI