AI-powered PDF redaction tool with conversational interface

✂️🤖 RedactFlow: Agentic PDF Sanitizer

RedactFlow is a PDF sanitization tool that uses an agentic workflow to detect and redact sensitive information from your documents. It combines LLM-based detection with a human-in-the-loop (HITL) review interface so that redactions can be checked and corrected before they are applied.

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Node.js 16+ (for the React frontend)
  • An Azure account with access to Azure OpenAI and Azure Document Intelligence
  • A Supabase account for user authentication and subscription management (free tier available)

Option 1: Modern React Frontend (Recommended)

The application now has a React frontend with a FastAPI backend, which is the recommended way to run it.

1. Clone and Setup

git clone https://github.com/matthewyijielu0317/RedactFlow.git
cd RedactFlow

2. Backend Setup

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install backend dependencies
pip install -r backend/requirements.txt

3. Frontend Setup

# Install frontend dependencies
cd frontend
npm install
cd ..

4. Environment Configuration

Note on Environment Files: This project uses two .env files:

  • Root .env: Backend configuration (Python/FastAPI)
  • frontend/.env: Frontend configuration (React)

This separation is necessary because:

  • React (Create React App) only reads .env from its own directory
  • Backend and frontend have different environment variable requirements
  • Important: Keep the Supabase URL and anon key in sync between both files!

Root .env file (Backend Configuration):

Create a .env file in the root directory:

# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
AZURE_OPENAI_API_KEY=your_azure_openai_api_key

# Azure Document Intelligence Configuration
AZURE_DI_ENDPOINT=your_azure_di_endpoint
AZURE_DI_KEY=your_azure_di_key

# Google Cloud Configuration (Optional - for image detection)
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro

# Supabase Configuration (for backend authentication)
SUPABASE_URL=your_supabase_project_url
SUPABASE_JWT_SECRET=your_supabase_jwt_secret
SUPABASE_ANON_KEY=your_supabase_anon_key
SUPABASE_SERVICE_KEY=your_supabase_service_key

# Stripe Configuration (Optional, for payment integration)
STRIPE_SECRET_KEY=your_stripe_secret_key
STRIPE_PUBLISHABLE_KEY=your_stripe_publishable_key
STRIPE_WEBHOOK_SECRET=your_stripe_webhook_secret

# Tavily Search (Optional)
TAVILY_KEY=your_tavily_key

frontend/.env file (Frontend Configuration):

Create a .env file in the frontend/ directory (use frontend/.env.example as template):

# Supabase Configuration
# ⚠️ IMPORTANT: These values should match the SUPABASE_URL and SUPABASE_ANON_KEY in root .env
# Get these values from: https://app.supabase.com/project/_/settings/api
REACT_APP_SUPABASE_URL=your_supabase_project_url
REACT_APP_SUPABASE_ANON_KEY=your_supabase_anon_key
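
Because the two files must agree on the Supabase URL and anon key, a small check can catch drift before startup. The sketch below is not part of RedactFlow: `parse_env` and `find_mismatches` are hypothetical helper names, and the parser only handles plain KEY=VALUE lines like the ones shown above.

```python
from typing import Dict, List

# Backend variable name -> frontend variable name (both taken from this README).
SHARED_KEYS = {
    "SUPABASE_URL": "REACT_APP_SUPABASE_URL",
    "SUPABASE_ANON_KEY": "REACT_APP_SUPABASE_ANON_KEY",
}


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_mismatches(root_env: Dict[str, str], frontend_env: Dict[str, str]) -> List[str]:
    """Return backend keys that are missing or differ from their frontend counterpart."""
    return [
        backend_key
        for backend_key, frontend_key in SHARED_KEYS.items()
        if backend_key not in root_env
        or root_env[backend_key] != frontend_env.get(frontend_key)
    ]
```

To use it, read `.env` and `frontend/.env` with `pathlib.Path(...).read_text()`, pass both through `parse_env`, and treat a non-empty result from `find_mismatches` as a configuration error.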

5. Run the Application

Option A: Use the provided scripts (Recommended)

# Terminal 1: Start Backend
chmod +x start_backend.sh
./start_backend.sh

# Terminal 2: Start Frontend
chmod +x start_frontend.sh
./start_frontend.sh

Option B: Manual startup

# Terminal 1: Start Backend
source venv/bin/activate
cd backend
python main.py

# Terminal 2: Start Frontend
cd frontend
npm start

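6. Access the Application

  • Frontend: http://localhost:3000
  • Backend API: http://localhost:8000
  • Interactive API documentation: http://localhost:8000/docs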

Option 2: Original Streamlit Interface

If you prefer the original Streamlit interface:

# Setup (same as above)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run Streamlit app
streamlit run app.py

🎯 Features

Modern React Frontend

  • User Authentication: Secure login/signup with Supabase (email/password and Google OAuth)
  • Subscription Management: Track user plans and page limits
  • Interactive PDF Canvas: Draw manual redactions directly on the PDF
  • Real-time Preview: See redactions applied instantly
  • Smart Detection Panel: Quick prompts for common redaction types
  • Workflow Progress: Visual progress tracking through the AI workflow
  • Responsive Design: Works on desktop and mobile devices

Core AI Features

  • Agentic Workflow: Utilizes a robust agentic workflow powered by LangGraph
  • Dual OCR Technology: Employs a creative dual OCR process for both high-level content understanding and precise word-level coordinate mapping
  • Intelligent Detection: Leverages large language models (LLMs) to analyze document content and identify sensitive information
  • Image Detection (Optional): Uses Google Gemini's spatial understanding API to detect and redact logos, icons, stamps, seals, and other graphical elements
  • Evaluator Feedback Loop: Includes an evaluator agent that provides feedback to the detector, iteratively improving accuracy
  • Human-in-the-Loop (HITL) Interface: Review, edit, and approve AI-detected redactions, plus add manual redactions
  • Flexible and Configurable: Easily configurable with your own Azure OpenAI and Document Intelligence API keys

🏗️ System Architecture

The RedactFlow system is built around a LangGraph state machine that orchestrates the flow of data through a series of nodes:

Workflow Nodes

  • Orchestrator: Entry point that interprets user prompts and routes requests
  • Searcher: (Optional) Searches for external regulations and compliance information
  • Detector: Core detection using dual OCR and dual LLM architecture
  • Evaluator: Reviews detected data and provides feedback to improve accuracy
  • Human-in-the-Loop (HITL): Pauses workflow for user review and approval
  • Redactor: Applies final redactions to create sanitized PDF

The Detector Workflow

The full workflow, centered on the Detector with its dual OCR and dual LLM architecture, runs through the following nodes:

  • Orchestrator: The entry point of the workflow. It interprets the user's prompt and decides whether to route the request to the Searcher for external regulation lookup or directly to the Detector.
  • Searcher: (Optional) Searches for external regulations and compliance information to enrich the detection criteria.
  • Detector: The core of the sensitive data detection process. Uses a content batching approach with dual OCR and LLM architecture to analyze all pages together in a single API call instead of one call per page.
  • Evaluator: Reviews all detected sensitive data across the entire document in a single LLM call and provides comprehensive feedback to improve accuracy.
  • Corrector: Applies evaluator feedback to refine detections across all pages in a single LLM call, ensuring consistent quality improvements throughout the document.
  • Human-in-the-Loop (HITL): Pauses the workflow and waits for the user to review, edit, and approve the redactions through the review UI (the React frontend or the original Streamlit app).
  • Redactor: Applies the final redactions to the PDF, creating a sanitized version of the document.
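
The node order above can be sketched in plain Python. This is only an illustration of the routing, not RedactFlow's LangGraph code: the function names and the keyword list are hypothetical, and the README does not say how the real Orchestrator decides whether to call the Searcher.

```python
from typing import Dict, List

# Hypothetical trigger words standing in for the Orchestrator's routing decision.
REGULATION_HINTS = ("gdpr", "hipaa", "ferpa", "regulation", "compliance")


def needs_regulation_lookup(prompt: str) -> bool:
    """Return True if the prompt appears to ask for a regulation-based redaction."""
    lowered = prompt.lower()
    return any(hint in lowered for hint in REGULATION_HINTS)


def plan_route(prompt: str) -> List[str]:
    """Return the node order for a prompt; the Searcher is visited only when needed."""
    route = ["orchestrator"]
    if needs_regulation_lookup(prompt):
        route.append("searcher")
    route += ["detector", "evaluator", "corrector", "hitl", "redactor"]
    return route


def run(prompt: str, approve: bool) -> Dict[str, object]:
    """Walk the route and stop before the Redactor if the reviewer rejects."""
    visited: List[str] = []
    for node in plan_route(prompt):
        if node == "redactor" and not approve:
            break
        visited.append(node)
    return {"visited": visited, "redacted": approve}
```

For example, `plan_route("Redact everything covered by HIPAA")` includes the `searcher` step, while `run("names and phone numbers", approve=False)` stops at `hitl` without redacting.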

The Innovative Content Batching Architecture

Content batching is what lets RedactFlow handle multi-page documents efficiently. Instead of processing pages individually with separate calls per page, RedactFlow combines all pages into single prompts and processes an entire document in just 3 LLM API calls.
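
As a minimal sketch of the batching idea, the function below joins every page's text into one prompt with page markers, so a single LLM call can see the whole document. The marker format and the name `build_batched_prompt` are assumptions for illustration; the prompts RedactFlow actually sends are not shown in this README.

```python
from typing import List


def build_batched_prompt(instruction: str, pages: List[str]) -> str:
    """Join all page texts into one prompt, tagging each page with its number."""
    sections = [
        f"=== PAGE {number} ===\n{text.strip()}"
        for number, text in enumerate(pages, start=1)
    ]
    return instruction.strip() + "\n\n" + "\n\n".join(sections)
```

Keeping the page number in each marker lets the model's answer refer back to a specific page even though all pages travel in one request.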

How Content Batching Works:

  1. Dual OCR in Parallel:

    • Page-level OCR: Extracts content for high-level semantic analysis
    • Word-level OCR: Gets precise coordinates of each word
  2. Dual LLM Analysis:

    • Sensitive Identification LLM: Analyzes content to identify sensitive information
    • Mapping LLM: Maps sensitive content to precise word-level coordinates
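
To make the mapping step concrete, here is a deterministic stand-in that locates a sensitive phrase among word-level OCR results and returns one bounding box around it. RedactFlow uses an LLM for this step; the `Word` type, the `locate` function, and the exact-match rule are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Word:
    text: str
    box: Box


def union(boxes: List[Box]) -> Box:
    """Smallest box that contains every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def locate(phrase: str, words: List[Word]) -> Optional[Box]:
    """Find the first run of consecutive words matching the phrase (case-insensitive)."""
    target = [token.lower() for token in phrase.split()]
    if not target:
        return None
    for start in range(len(words) - len(target) + 1):
        window = words[start:start + len(target)]
        if [w.text.lower() for w in window] == target:
            return union([w.box for w in window])
    return None
```

Exact matching breaks on OCR noise and on phrases that wrap across lines, which is one plausible reason to hand the mapping to an LLM instead.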

🎮 How to Use

Using the React Frontend

  1. Upload a PDF: Drag and drop or click to upload your PDF file
  2. Set Detection Prompt: Enter what you want to redact (e.g., "names, addresses, phone numbers")
  3. Run Detection: Click "Run Detection" to start the AI workflow
  4. Review Results:
    • Review AI-detected sensitive information
    • Edit or delete incorrect detections
    • Add manual redactions by drawing on the PDF
  5. Approve or Reject:
    • Approve: Proceed to final redaction
    • Reject: Modify your prompt and try again
  6. Download: Get your redacted PDF

Key Features

  • Manual Redaction: Draw rectangles directly on the PDF to mark sensitive areas
  • Edit AI Detections: Modify content, reasons, or bounding boxes
  • Real-time Preview: See changes applied instantly
  • Workflow Control: Approve, reject, or go back to review stage

RedactFlow refines its detections through a feedback loop between the Evaluator, Corrector, and Human-in-the-Loop (HITL) nodes.

  • Content Batching Feedback Loop: After the Detector identifies sensitive data across all pages, the Evaluator analyzes the entire document in a single LLM call, comparing the results with the user prompt and document context. Because it sees all pages together, it can flag patterns and inconsistencies across the document and return feedback for every page at once. The Corrector then applies this feedback to all pages in a single LLM call, so refinements are consistent throughout the document.
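
The correction step can be pictured as merging evaluator feedback into the detection list. The feedback shape below (`remove` and `add` lists) is a hypothetical format chosen for illustration; the README does not describe the structure the Evaluator actually returns, and the real Corrector is an LLM call.

```python
from typing import Dict, List


def apply_feedback(detections: List[str], feedback: Dict[str, List[str]]) -> List[str]:
    """Drop items flagged as false positives, then append items flagged as missed."""
    rejected = set(feedback.get("remove", []))
    kept = [item for item in detections if item not in rejected]
    for item in feedback.get("add", []):
        if item not in kept:
            kept.append(item)
    return kept
```

For example, feedback of `{"remove": ["Page 1"], "add": ["N0012345678"]}` applied to `["Jane Doe", "Page 1"]` yields `["Jane Doe", "N0012345678"]`.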

📁 Project Structure

RedactFlow/
├── frontend/                 # React frontend
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── types.ts         # TypeScript type definitions
│   │   └── App.tsx          # Main React app
│   ├── package.json         # Frontend dependencies
│   └── tsconfig.json        # TypeScript configuration
├── backend/                  # FastAPI backend
│   ├── main.py              # FastAPI server
│   ├── requirements.txt     # Python dependencies
│   └── static/              # Generated PDF previews
├── nodes/                    # LangGraph workflow nodes
│   ├── orchestrator.py      # Main workflow orchestration
│   ├── detector_node.py     # Dual OCR/LLM detection
│   ├── evaluator_node.py    # Detection evaluation
│   ├── hitl_node.py         # Human-in-the-loop logic
│   └── redactor_node.py     # PDF redaction
├── output/                   # Generated files
│   ├── original/            # Original uploaded PDFs
│   ├── preview/             # Preview images
│   └── redacted/            # Final redacted PDFs
├── app.py                   # Original Streamlit app
├── requirements.txt         # Root Python dependencies
└── start_*.sh              # Startup scripts

Together, the automated content batching feedback loop and human oversight are meant to ensure that the final redacted document is accurate and meets the user's specific needs.

Performance & Real-World Results

RedactFlow has been tested with real-world documents, including complex immigration forms (I-20). The figures below come from one such run:

Real I-20 Form Processing Results:

  • Document: 4-page I-20 Certificate of Eligibility for Nonimmigrant Student Status
  • OCR Extraction: 318 page elements, 1,297 word elements
  • Detection Results: 39 sensitive items with precise coordinates
  • API Efficiency: 3 LLM calls vs 48+ traditional individual calls
  • Cost Savings: ~90% reduction in API costs
  • Processing Time: Significant reduction through content batching
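
Taking the call counts above at face value, the reduction in the number of LLM calls can be checked with a line of arithmetic. This only counts calls; actual cost also depends on tokens per call, so the "~90%" cost figure is the author's estimate rather than something this calculation proves.

```python
def call_reduction(batched_calls: int, individual_calls: int) -> float:
    """Fraction of LLM calls avoided by batching."""
    return 1 - batched_calls / individual_calls


reduction = call_reduction(batched_calls=3, individual_calls=48)
print(f"{reduction:.2%}")  # prints "93.75%"
```

A drop from 48 calls to 3 is a 93.75% reduction in call count, which is in line with the reported cost savings.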

Detection Categories Successfully Identified:

  • Student Information: Names, SEVIS IDs, birth dates, citizenship
  • Academic Program: School names, degree types, program duration, majors
  • Official Information: School codes, approval dates, certification details
  • Financial Data: Tuition amounts, funding sources
  • Immigration Data: Document numbers, status classifications

Quality Assurance:

  • Content Batching Evaluation: Comprehensive quality checks across all pages in single LLM calls
  • Feedback Integration: Automatic correction of detection gaps
  • Coordinate Precision: Exact pixel-level redaction boundaries
  • Fallback Mechanisms: Robust error handling and recovery

Preview before and after user feedback

[Image: preview before human feedback]

Example user feedback prompt: "You failed to detect the UID and financial amount. Also, include the date."

[Image: preview after human feedback]

The final redacted version is below:

[Image: final redacted version]

🔧 Configuration

Environment Variables

Create a .env file with the following variables:

# Required: Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key_here

# Required: Azure Document Intelligence
AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_DI_KEY=your_di_key_here

# Optional: Google Cloud Image Detection
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro
# GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json  # If using service account

# Optional: Tavily Search (for external regulation lookup)
TAVILY_KEY=your_tavily_key_here

# Demo/Deployment Controls
# Limit analysis to the first N pages (0 disables the cap)
MAX_ANALYZED_PAGES=4
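
The page cap described in the comments above (first N pages, with 0 disabling the cap) can be expressed as a small helper. The function name `pages_to_analyze` is hypothetical, and this is a sketch of the documented behavior rather than RedactFlow's own code.

```python
def pages_to_analyze(total_pages: int, max_pages: int) -> range:
    """Return 0-based page indexes to analyze; max_pages == 0 means no cap."""
    if max_pages < 0:
        raise ValueError("max_pages must be >= 0")
    limit = total_pages if max_pages == 0 else min(total_pages, max_pages)
    return range(limit)
```

With `MAX_ANALYZED_PAGES=4`, a 10-page PDF would have pages 0 through 3 analyzed, while a 3-page PDF is analyzed in full.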

Azure Setup

  1. Azure OpenAI Service:

    • Create an Azure OpenAI resource
    • Deploy a GPT-4 model
    • Get your endpoint and API key
  2. Azure Document Intelligence:

    • Create a Document Intelligence resource
    • Get your endpoint and key

Google Cloud Setup (for Image Detection - Optional)

RedactFlow uses Google Gemini's spatial understanding API to detect logos, icons, stamps, seals, and other graphical elements within PDF documents. This feature is optional but highly recommended for comprehensive document sanitization.
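
Gemini's documentation describes bounding boxes as `[ymin, xmin, ymax, xmax]` values normalized to a 0-1000 scale. Assuming RedactFlow receives boxes in that form (the README does not say), converting them to pixel coordinates for a rendered page could look like the sketch below; `to_pixels` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def to_pixels(box_2d: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to (x0, y0, x1, y1) pixels."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )
```

Note the axis order: the normalized box lists y before x, while most drawing and PDF libraries expect x first.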

Quick Setup Steps

  1. Install Google Cloud SDK:

    # macOS (Homebrew)
    brew install --cask google-cloud-sdk
    source "$(brew --prefix)/share/google-cloud-sdk/path.zsh.inc"
    
    # Verify installation
    gcloud --version
    
  2. Fix Permissions (if needed): If you encounter permission errors during installation, run:

    sudo chown -R $USER:staff ~/.config
    brew reinstall gcloud-cli
    
  3. Authenticate:

    gcloud auth application-default login
    

    A browser window will open. Sign in with your Google account that has access to the redactflow-486302 project.

  4. Set Quota Project:

    gcloud auth application-default set-quota-project redactflow-486302
    
  5. Update Your .env File: Add these lines to your root .env file:

    # Google Cloud Image Detection
    GOOGLE_CLOUD_PROJECT=redactflow-486302
    ENABLE_IMAGE_DETECTION=true
    GEMINI_MODEL_ID=gemini-2.5-pro
    
  6. Restart the Backend:

    ./start_backend.sh
    

Alternative: Service Account Key

If you don't want to install gcloud, you can use a service account key file:

  1. Place the key file at secrets/redactflow-service-account.json
  2. Add to .env:
    GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json
    GOOGLE_CLOUD_PROJECT=redactflow-486302
    ENABLE_IMAGE_DETECTION=true
    GEMINI_MODEL_ID=gemini-2.5-pro
    
  3. Restart the backend (no gcloud CLI needed)

Environment Variables for Image Detection

Variable                       | Required | Default          | Description
GOOGLE_CLOUD_PROJECT           | Yes      | (none)           | Your GCP project ID
ENABLE_IMAGE_DETECTION         | Yes      | false            | Set to true to enable
GEMINI_MODEL_ID                | No       | gemini-2.5-flash | Model choice (see below)
GEMINI_LOCATION                | No       | global           | Vertex AI region
IMAGE_DETECT_CONCURRENCY       | No       | 4                | Max pages processed in parallel
MAX_ANALYZED_PAGES             | No       | 4                | Page limit for detection
GOOGLE_APPLICATION_CREDENTIALS | No       | (none)           | Path to service account JSON
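
The defaults in the table can be captured in a small loader. This is a sketch under stated assumptions: `ImageDetectionConfig` and `load_config` are hypothetical names, and how the backend really parses these variables is not shown in this README.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ImageDetectionConfig:
    project: Optional[str]
    enabled: bool
    model_id: str
    location: str
    concurrency: int
    max_pages: int


def load_config(env: Mapping[str, str]) -> ImageDetectionConfig:
    """Build the config from an environment mapping, using the table's defaults."""
    enabled = env.get("ENABLE_IMAGE_DETECTION", "false").strip().lower() == "true"
    project = env.get("GOOGLE_CLOUD_PROJECT") or None
    if enabled and project is None:
        raise ValueError("GOOGLE_CLOUD_PROJECT is required when image detection is enabled")
    return ImageDetectionConfig(
        project=project,
        enabled=enabled,
        model_id=env.get("GEMINI_MODEL_ID", "gemini-2.5-flash"),
        location=env.get("GEMINI_LOCATION", "global"),
        concurrency=int(env.get("IMAGE_DETECT_CONCURRENCY", "4")),
        max_pages=int(env.get("MAX_ANALYZED_PAGES", "4")),
    )
```

Passing `os.environ` as `env` reads the live environment; passing a plain dict makes the loader easy to test.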

Model Choices

Model            | Speed    | Cost | Accuracy | Best for
gemini-2.5-flash | Fast     | Low  | Good     | Cost-sensitive, high volume
gemini-2.5-pro   | Moderate | Mid  | Better   | Precision matters, small logos
For detailed setup instructions and troubleshooting, see docs/IMAGE_DETECTION_SETUP.md.

Supabase Setup (Authentication + Storage)

  1. Create a Supabase Project:

    • Go to Supabase and create a new project
    • Wait for the project to be fully provisioned
  2. Get API Credentials:

    • Navigate to Project Settings → API
    • Copy your Project URL → SUPABASE_URL
    • Copy your anon/public key → SUPABASE_ANON_KEY
    • Copy your service_role key → SUPABASE_SERVICE_KEY (keep this secret!)
    • Copy your JWT Secret from Project Settings → API → JWT Settings → SUPABASE_JWT_SECRET
  3. Run Database Migrations (in order):

    Open Supabase Dashboard → SQL Editor and run these scripts:

    Order | Script                                 | What it creates
    1     | scripts/create_storage_tables.sql      | users table (synced from auth.users via trigger), files table (metadata, OCR, annotations), source-files and redacted-files Storage buckets, RLS policies, indexes
    2     | scripts/create_subscriptions_table.sql | Subscription management table (for Stripe integration)

    Important: create_storage_tables.sql must be run first. It creates the core users and files tables, Storage buckets, and Row Level Security policies that the application depends on. Without it, the backend will fail to start.

  4. Configure Environment Variables:

    • Copy .env.example to .env in the project root and fill in your Supabase credentials
    • Copy frontend/.env.example to frontend/.env and fill in the frontend Supabase credentials
    • Make sure SUPABASE_URL and SUPABASE_ANON_KEY match between both files

🐛 Troubleshooting

Common Issues

  1. Port Already in Use:

    # Kill processes on ports 3000 and 8000
    lsof -ti:3000 | xargs kill -9
    lsof -ti:8000 | xargs kill -9
    
  2. TypeScript Errors:

    cd frontend
    npm install
    
  3. Python Dependencies:

    pip install --upgrade pip
    pip install -r backend/requirements.txt
    
  4. Node Modules Issues:

    cd frontend
    rm -rf node_modules package-lock.json
    npm install
    
  5. Google Cloud Authentication Errors: If you see "Your default credentials were not found" errors:

    # Option A: Re-authenticate with gcloud
    gcloud auth application-default login
    gcloud auth application-default set-quota-project redactflow-486302
    
    # Option B: Use service account key
    # Add to .env: GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json
    
  6. Permission Denied for .config Directory:

    # Fix ownership of .config directory
    sudo chown -R $USER:staff ~/.config
    
    # Reinstall gcloud if needed
    brew reinstall gcloud-cli
    
  7. Image Detection Not Working:

    • Ensure ENABLE_IMAGE_DETECTION=true in your .env file
    • Verify Google Cloud credentials are set up correctly
    • Check backend logs for specific error messages
    • See docs/IMAGE_DETECTION_SETUP.md for detailed troubleshooting

Development

  • Frontend Development: cd frontend && npm start
  • Backend Development: cd backend && python main.py
  • API Testing: Visit http://localhost:8000/docs for interactive API documentation

📄 File Descriptions

  • frontend/src/App.tsx: Main React application with PDF canvas and workflow management
  • backend/main.py: FastAPI server handling PDF processing and AI workflow
  • nodes/orchestrator.py: LangGraph workflow orchestration
  • nodes/detector_node.py: Dual OCR and dual LLM detection logic
  • nodes/recall_node.py: AI recall additions and feedback
  • nodes/hitl_node.py: Human-in-the-loop workflow control
  • nodes/redactor_node.py: PDF redaction application
  • app.py: Original Streamlit interface (legacy)

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments
