Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI

Document redaction (doc_redaction)

Redact personally identifiable information (PII) from documents (PDF, PNG, JPG), Word files (DOCX), or tabular data (XLSX/CSV/Parquet). Please see the User Guide for a full walkthrough of all the features in the app.

🚀 Quick Start - Installation and first run

Follow these instructions to get the document redaction application running on your local machine.

1. Installation

Option 1 - Recommended: Install from source repo

Clone the repository and install in editable mode:

git clone https://github.com/seanpedrick-case/doc_redaction.git
cd doc_redaction
pip install -e .

Install extras (Paddle or Transformers/Torch VLM)

To install with PaddleOCR (with a transformers backend as of v2.4.0):

pip install -e ".[paddle]"

If you want to run VLMs / LLMs with the transformers package:

pip install -e ".[vlm]"

Note that the versions of both PaddleOCR and Torch installed by default are the CPU-only versions. If you want to install the GPU-enabled version of torch, it is advised to install the following version:

pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu129
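After installing, it can be useful to confirm which Torch build actually ended up in your environment. The following is a minimal sketch, not part of doc_redaction; the function name torch_cuda_status is hypothetical, and it only relies on the standard torch.cuda.is_available() call.

```python
import importlib.util


def torch_cuda_status() -> str:
    """Return 'not-installed', 'cpu-only', or 'cuda' for this environment."""
    # find_spec avoids an ImportError when torch is absent.
    if importlib.util.find_spec("torch") is None:
        return "not-installed"
    import torch  # imported lazily so the check works without torch present

    return "cuda" if torch.cuda.is_available() else "cpu-only"


if __name__ == "__main__":
    print(torch_cuda_status())
```

If this reports cpu-only after you installed the CUDA wheel, the most likely causes are a driver mismatch or a later install step replacing the GPU build with the default CPU one.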

Option 2 - Install from PyPI

Create a virtual environment (recommended) and install doc_redaction.

python -m venv venv
# Windows:
.\venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

The package is published on PyPI as doc-redaction (import name doc_redaction):

pip install doc_redaction
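To confirm the install from Python, you can query the distribution metadata. This is a small sketch using only the standard library; the helper name installed_version is hypothetical. importlib.metadata normalises distribution names, so either spelling of the project name should resolve to the same installed package.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version("doc_redaction"))
```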

Optional extras (same as in pyproject.toml). For installing PaddleOCR:

pip install "doc_redaction[paddle]"

For running VLMs / LLMs with the transformers package:

pip install "doc_redaction[vlm]"

For programmatic use (CLI-first API matching Gradio api_name routes), see Python Package usage (Python). The console script cli_redact is available after install.

Web UI from a PyPI install: You can start the Gradio UI after pip install doc_redaction by running (note that the prerequisites tesseract and poppler will need to be correctly installed following step 2 below):

python -m app

Important: your working directory matters. When you run python -m app, the app treats your current folder as the “app folder”:

- It will look for configuration at config/app_config.env relative to the folder you run it from (and python -m doc_redaction.install_deps will also write config/app_config.env there).
- It may create new folders in that location (for example config/, output/, input/, logs/, usage/, feedback/, and temporary/cache folders depending on your settings).
- The UI example files and bundled assets are packaged with the PyPI install (they live inside the installed doc_redaction package). If you run from a “random” directory after a PyPI install, the app can still locate its packaged examples; your working directory mainly affects where config/, input/, output/, logs, and temp folders are created.

In practice, the smoothest UI experience (examples, bundled assets, docs links, predictable relative paths) is still usually via a repository checkout or Docker, but PyPI install is sufficient to launch the UI as long as you run it from a suitable working folder and have the system dependencies available (or run python -m doc_redaction.install_deps first).
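To preview where those files and folders would land before launching, you can resolve them against a candidate working folder. This is an illustrative sketch only: the folder names come from the list above, the helper expected_app_paths is hypothetical, and which folders the app really creates depends on your settings.

```python
from pathlib import Path

# Folder names taken from the description above; the app may create
# fewer or more of these depending on its configuration.
APP_SUBFOLDERS = ("config", "output", "input", "logs", "usage", "feedback")


def expected_app_paths(working_dir):
    """Map each expected folder (and the config file) to an absolute path."""
    base = Path(working_dir).resolve()
    paths = {name: base / name for name in APP_SUBFOLDERS}
    paths["config_file"] = base / "config" / "app_config.env"
    return paths


if __name__ == "__main__":
    for name, path in expected_app_paths(".").items():
        print(f"{name}: {path}")
```

Running it from the folder where you intend to start python -m app shows which paths that launch would use.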

Option 3 - Docker installation

The Redaction app can be installed using the Dockerfile or the Docker compose files (llama.cpp, vLLM) provided in the repo.

With Llama.cpp / vLLM inference server

The project has Docker and Docker compose files that pair the Redaction app with local inference servers powered by llama.cpp or vLLM. Llama.cpp is more flexible than vLLM on low-VRAM systems, as Llama.cpp will offload to CPU/system RAM automatically rather than failing, as vLLM tends to do.

For Llama.cpp, you can use the docker-compose_llama.yml file, and for vLLM, you can use the docker-compose_vllm.yml file. To run, Docker / Docker Desktop should be installed; you can then run the commands suggested at the top of each file to start the servers.

You will need roughly 40 GB of disk space to run everything, depending on the model chosen in the compose file. For the vLLM server, you will need 24 GB VRAM. For the Llama.cpp server, 24 GB VRAM is needed to run at full speed, but the n-gpu-layers and n-cpu-moe parameters in the Docker compose file can be adjusted to fit your system. I would suggest that 8 GB VRAM is the bare minimum for decent inference speed. See the Unsloth guide for more details on working with GGUF files for Qwen 3.5.

Without Llama.cpp / vLLM inference server

If you want a working Docker installation without GPU support, you can install from the Dockerfile in the repo. A working example of this, with the CPU version of PaddleOCR, can be found on Hugging Face. You can adjust the INSTALL_PADDLEOCR, PADDLE_GPU_ENABLED, INSTALL_VLM, and TORCH_GPU_ENABLED config variables to control the PaddleOCR and Transformers packages for local VLM support. Note that GPU-enabled PaddleOCR and GPU-enabled Transformers/Torch often don't work well together, which is one reason why the Llama.cpp/vLLM inference server Docker installation option is provided above.

The main Dockerfile produces two final images via build targets: gradio (default web UI, non-root user, named volumes for writable paths) and lambda (AWS Lambda handler). Build examples:

docker build -f Dockerfile --target gradio -t doc-redaction-gradio .
docker build -f Dockerfile --target lambda -t doc-redaction-lambda .

Pi agent (agentic redaction)

The Pi orchestration UI uses a separate multi-stage image at agent-redact/pi-agent/Dockerfile. It shares the same Python 3.12 slim base as the main app; a small Node stage installs the pi CLI, which is copied into the runtime image.

Build target	Typical use
`dev`	Local development with docker-compose_llama_agentic.yml — the repo is bind-mounted; only Pi CLI + Python deps are in the image.
`runtime`	Hugging Face Space and AWS ECS — agent code is baked in; runs as non-root `user` with named volumes for workspace, uploads, and session dirs (read-only root filesystem friendly).

Build from the repository root:

docker build -f agent-redact/pi-agent/Dockerfile --target dev -t pi-agent-dev .
docker build -f agent-redact/pi-agent/Dockerfile --target runtime -t pi-agent-runtime .

For llama.cpp + Pi together, see the compose examples at the top of docker-compose_llama_agentic.yml. Further detail: agent-redact/README.md.

Option 4 - Installation on AWS with CDK

The repo contains a CDK folder with all the files you need to set up and deploy to an AWS environment with CDK. The installation wizard is cdk_install.py, which provides a number of options to deploy the Document Redaction App to AWS for demonstration or production. More details on CDK deployment can be found in the Installation Guide.

2. Install prerequisites: Tesseract and Poppler

This application relies on two external tools for OCR (Tesseract) and PDF processing (Poppler). Please install them on your system before proceeding.

Automated dependency setup (recommended)

If you don’t have admin rights (or you just want the simplest setup), you can have the project download and configure Tesseract and Poppler into a local redaction_deps/ folder inside the doc_redaction folder.

You need the installer script available first, which means either:

- Repository checkout: git clone ... and run the command from the repo root (recommended for the web UI), or
- PyPI install: pip install doc_redaction and run from a writable folder where you want redaction_deps/ and config/app_config.env to be created/updated.

From the repository root (or your chosen working folder) after creating/activating your venv and installing Python requirements:

python -m doc_redaction.install_deps

This writes TESSERACT_FOLDER / POPPLER_FOLDER into config/app_config.env so the app can find the binaries without you editing your system PATH.

To just check whether your machine can already see the tools:

python -m doc_redaction.install_deps --verify-only

On Windows

If you don’t use the automated setup above, you can install the dependencies manually by downloading installers and adding the programs to your system's PATH.

Install Tesseract OCR:
- Download the installer from the official Tesseract at UB Mannheim page (e.g., tesseract-ocr-w64-setup-v5.X.X...exe).
- Run the installer.
- IMPORTANT: During installation, ensure you select the option to "Add Tesseract to system PATH for all users" or a similar option. This is crucial for the application to find the Tesseract executable.
Install Poppler:
- Download the latest Poppler binary for Windows. A common source is the Poppler for Windows GitHub releases page. Download the .zip file (e.g., poppler-25.07.0-win.zip).
- Extract the contents of the zip file to a permanent location on your computer, for example, C:\Program Files\poppler\.
- You must add the bin folder from your Poppler installation to your system's PATH environment variable.
  - Search for "Edit the system environment variables" in the Windows Start Menu and open it.
  - Click the "Environment Variables..." button.
  - In the "System variables" section, find and select the Path variable, then click "Edit...".
  - Click "New" and add the full path to the bin directory inside your Poppler folder (e.g., C:\Program Files\poppler\poppler-24.02.0\bin).
  - Click OK on all windows to save the changes.
To verify, open a new Command Prompt and run tesseract --version and pdftoppm -v. If they both return version information, you have successfully installed the prerequisites.

On Linux (Debian/Ubuntu)

Open your terminal and run the following command to install Tesseract and Poppler:

sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils

On Linux (Fedora/CentOS/RHEL)

Open your terminal and use the dnf or yum package manager:

sudo dnf install -y tesseract poppler-utils
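Whichever platform you are on, a quick way to see whether both tools are reachable from Python is to look them up on PATH. This sketch uses only the standard library and is not the project's own check; the helper find_tools is hypothetical, and it does not consider TESSERACT_FOLDER / POPPLER_FOLDER settings from config/app_config.env.

```python
import shutil

# Executables mentioned above: tesseract (OCR) and pdftoppm (part of Poppler).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_tools(tools=REQUIRED_TOOLS):
    """Return {tool name: full path, or None if not found on PATH}."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, location in find_tools().items():
        print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```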

3. Run the Application

With all dependencies installed, you can now start the Gradio application GUI. For a guide on how to use it, please see the User Guide.

python app.py

After running the command, the application will start, and you will see a local URL in your terminal (usually http://127.0.0.1:7860).

Open this URL in your web browser to use the document redaction tool.
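If you start the app from a script or a container, you may want to check that the URL responds before opening it. The sketch below is a generic reachability check using the standard library; the function name app_is_up is hypothetical and the default URL is simply the usual local address mentioned above.

```python
import urllib.request


def app_is_up(url="http://127.0.0.1:7860", timeout=3.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers connection errors, timeouts, and HTTP error codes;
        # ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    print("running" if app_is_up() else "not reachable")
```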

Command line interface

For example CLI commands, please refer to this guide or the examples in cli_redact.py.

If you installed from PyPI, use the installed console script:

cli_redact --help

From a repository checkout, you can also run:

python cli_redact.py --help

Python package commands

For examples of using the Python package, please see Python Package usage (Python).

4. ⚙️ Configuration (Optional)

You can customise the application's behavior by creating a configuration file. This allows you to change settings without modifying the source code, such as enabling AWS features, changing logging behavior, or pointing to local Tesseract/Poppler installations. A full overview of all the settings you can modify in the app_config.env file can be seen in tools/config.py, with explanations on the documentation website for the GitHub repo.

To get started:

- Copy config/app_config.env.example to config/app_config.env.
- Modify the values in config/app_config.env to suit your needs. The application will automatically load these settings on startup.

If you do not create this file, the application will run with default settings.
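To inspect what a config file would set, you can read it as plain KEY=VALUE lines. This is a simplified sketch and not how the app itself loads settings (that mechanism is not described here); the helpers parse_env_text and read_env_file are hypothetical, and the parser assumes one setting per line with optional surrounding quotes and # comment lines.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes from the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def read_env_file(path="config/app_config.env"):
    """Read and parse an env file; returns {} if the file does not exist."""
    file_path = Path(path)
    if not file_path.exists():
        return {}
    return parse_env_text(file_path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    for key, value in read_env_file().items():
        print(f"{key} = {value}")
```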

Configuration Breakdown

Here is an overview of the most important settings, separated by whether they are for local use or require AWS.

Local & General Settings (No AWS Required)

These settings are useful for all users, regardless of whether you are using AWS.

TESSERACT_FOLDER / POPPLER_FOLDER
- Use these if you installed Tesseract or Poppler to a custom location on Windows and did not add them to the system PATH.
- Provide the path to the respective installation folders (for Poppler, point to the bin sub-directory).
- Examples: POPPLER_FOLDER=C:/Program Files/poppler-24.02.0/bin/ and TESSERACT_FOLDER=tesseract/
SHOW_LANGUAGE_SELECTION=True
- Set to True to display a language selection dropdown in the UI for OCR processing.
DEFAULT_LOCAL_OCR_MODEL=tesseract
- Choose the backend for local OCR. Options are tesseract, paddle, or hybrid-paddle. "tesseract" is the default and is recommended. "hybrid-paddle" is a combination of the two: a first pass is done with Tesseract, and then a second pass is done with PaddleOCR on words with low confidence. "paddle" only returns whole-line text extraction, so it will only work for OCR, not redaction.
SESSION_OUTPUT_FOLDER=False
- If True, redacted files will be saved in unique subfolders within the output/ directory for each session.
DISPLAY_FILE_NAMES_IN_LOGS=False
- For privacy, file names are not recorded in usage logs by default. Set to True to include them.

AWS-Specific Settings

These settings are only relevant if you intend to use AWS services like Textract for OCR and Comprehend for PII detection.

RUN_AWS_FUNCTIONS=True
- This is the master switch. You must set this to True to enable any AWS functionality. If it is False, all other AWS settings will be ignored.
UI Options:
- SHOW_AWS_TEXT_EXTRACTION_OPTIONS=True: Adds "AWS Textract" as an option in the text extraction dropdown.
- SHOW_AWS_PII_DETECTION_OPTIONS=True: Adds "AWS Comprehend" as an option in the PII detection dropdown.
Core AWS Configuration:
- AWS_REGION=example-region: Set your AWS region (e.g., us-east-1).
- DOCUMENT_REDACTION_BUCKET=example-bucket: The name of the S3 bucket the application will use for temporary file storage and processing.
AWS Logging:
- SAVE_LOGS_TO_DYNAMODB=True: If enabled, usage and feedback logs will be saved to DynamoDB tables.
- ACCESS_LOG_DYNAMODB_TABLE_NAME, USAGE_LOG_DYNAMODB_TABLE_NAME, etc.: Specify the names of your DynamoDB tables for logging.
Advanced AWS Textract Features:
- SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS=True: Enables UI components for large-scale, asynchronous document processing via Textract.
- TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET=example-bucket-output: A separate S3 bucket for the final output of asynchronous Textract jobs.
- LOAD_PREVIOUS_TEXTRACT_JOBS_S3=True: If enabled, the app will try to load the status of previously submitted asynchronous jobs from S3.
Cost Tracking (for internal accounting):
- SHOW_COSTS=True: Displays an estimated cost for AWS operations. Can be enabled even if AWS functions are off.
- GET_COST_CODES=True: Enables a dropdown for users to select a cost code before running a job.
- COST_CODES_PATH=config/cost_codes.csv: The local path to a CSV file containing your cost codes.
- ENFORCE_COST_CODES=True: Makes selecting a cost code mandatory before starting a redaction.
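The interaction between the master switch and the other flags can be sketched as follows. This is an illustration of the rules stated above, not the app's code: the helper names are hypothetical, the way booleans are parsed is an assumption, and unset flags are treated as off here, which may not match the app's real defaults.

```python
# Flags described above as depending on RUN_AWS_FUNCTIONS.
AWS_DEPENDENT_FLAGS = (
    "SHOW_AWS_TEXT_EXTRACTION_OPTIONS",
    "SHOW_AWS_PII_DETECTION_OPTIONS",
    "SAVE_LOGS_TO_DYNAMODB",
    "SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS",
    "LOAD_PREVIOUS_TEXTRACT_JOBS_S3",
)


def as_bool(value):
    """Interpret common truthy strings; an assumption about the env format."""
    return str(value).strip().lower() in {"true", "1", "yes"}


def effective_flags(settings):
    """Return which flags would take effect given a dict of raw settings."""
    aws_on = as_bool(settings.get("RUN_AWS_FUNCTIONS", "False"))
    flags = {
        name: aws_on and as_bool(settings.get(name, "False"))
        for name in AWS_DEPENDENT_FLAGS
    }
    # SHOW_COSTS is described as usable even when AWS functions are off.
    flags["SHOW_COSTS"] = as_bool(settings.get("SHOW_COSTS", "False"))
    return flags


if __name__ == "__main__":
    print(effective_flags({"RUN_AWS_FUNCTIONS": "False", "SHOW_COSTS": "True"}))
```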

Now that you have the app installed, please refer to the User Guide for more information on how to use it for basic and advanced redaction.

For agents (API quickstart)

If you are an LLM/agent interacting with this app over HTTP (e.g. Hugging Face Spaces), do not guess inputs from the UI. Use the Gradio schema as the source of truth:

- Discover schema: GET /gradio_api/info
- Upload files: POST /gradio_api/upload (multipart field files) → returns server-internal paths like /tmp/gradio_tmp/...
- Call: POST /gradio_api/call/{api_name} with body {"data":[...]} (argument order must match /gradio_api/info)
- Poll: GET /gradio_api/call/{api_name}/{event_id} until complete
- Download outputs: GET /gradio_api/file={path} (note: some deployments return 403 without session cookies)
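The call and poll steps above can be sketched with the standard library. This is a minimal illustration, not an official client: it assumes the call response contains an event_id key and that the poll endpoint streams "event:" / "data:" lines ending with a complete event, as in Gradio's documented curl flow. Verify both against your deployment. The function names are hypothetical, and file upload and download are left out.

```python
import json
import urllib.request


def parse_sse_result(stream_text):
    """Return the JSON payload of the first 'complete' event, or None."""
    event = None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if event == "error":
                raise RuntimeError(f"Gradio call failed: {payload}")
            if event == "complete":
                return json.loads(payload)
    return None


def call_and_wait(base_url, api_name, data, timeout=600):
    """POST the call, then read the event stream until the server closes it."""
    call_url = f"{base_url.rstrip('/')}/gradio_api/call/{api_name.lstrip('/')}"
    request = urllib.request.Request(
        call_url,
        data=json.dumps({"data": data}).encode("utf-8"),  # data => POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        event_id = json.load(response)["event_id"]
    with urllib.request.urlopen(f"{call_url}/{event_id}", timeout=timeout) as stream:
        return parse_sse_result(stream.read().decode("utf-8"))
```

The contents of data must still be built from /gradio_api/info for the route you call; nothing here infers argument order.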

Choose the correct route (prefer short `gr.api` endpoints)

Fetch /gradio_api/info and then prefer the simplest route that exists:

- Apply edited review CSV to a PDF: /review_apply
- Redact a PDF/image document: /doc_redact (optional handwrite_signature_checkbox for AWS Textract, e.g. Extract handwriting, Extract signatures)
- Summarise a PDF: /pdf_summarise
- Redact tabular files (CSV/XLSX/Parquet/DOCX): /tabular_redact

If those endpoints are not present in your deployment, fall back to the long UI-chained routes (/apply_review_redactions, /redact_data, etc.) and build data[] strictly from /gradio_api/info.
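Route selection can be expressed as "first candidate that the schema lists". The sketch below assumes the JSON from /gradio_api/info has a named_endpoints mapping keyed by route name; check this against your deployment's actual response. The helper pick_route is hypothetical.

```python
def pick_route(info, candidates):
    """Return the first candidate route present in the schema, else None."""
    available = info.get("named_endpoints", {})
    for name in candidates:
        if name in available:
            return name
    return None


if __name__ == "__main__":
    # Stand-in for the parsed JSON body of GET /gradio_api/info.
    example_info = {"named_endpoints": {"/apply_review_redactions": {}}}
    # Prefer the short route, fall back to the long UI-chained one.
    print(pick_route(example_info, ["/review_apply", "/apply_review_redactions"]))
```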

Common gotchas

- Arity errors (needed: N, got: M) mean you called a session-heavy UI handler with the wrong data[]. Prefer the short endpoints above.
- handle_file() gotcha (for gradio_client users): do not wrap server-internal upload paths (e.g. /tmp/gradio_tmp/...) with handle_file(). Pass them as plain strings.
- Container-only outputs: outputs may be written to container paths (e.g. /home/user/app/output/). Plan to download via file=... or use a mounted output directory in Docker.
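Arity errors can often be caught before sending a request by comparing data[] with the schema. This sketch assumes each entry under named_endpoints carries a parameters list and that every listed parameter must be supplied; both are assumptions to confirm against your /gradio_api/info response. The helper arity_problem is hypothetical.

```python
def arity_problem(info, api_name, data):
    """Return a message if len(data) differs from the schema, else None."""
    endpoint = info["named_endpoints"][api_name]
    expected = len(endpoint["parameters"])
    if len(data) != expected:
        return f"{api_name}: needed {expected} arguments, got {len(data)}"
    return None


if __name__ == "__main__":
    # Stand-in schema with two parameters for illustration only.
    example_info = {"named_endpoints": {"/doc_redact": {"parameters": [{}, {}]}}}
    print(arity_problem(example_info, "/doc_redact", ["only-one-argument"]))
```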

Optional: MCP server

If you want external agents to call this app reliably without re-implementing Gradio upload/call/poll/download details, consider an MCP server that wraps the main tasks (redact_document, apply_review_redactions, redact_tabular, summarise_document) behind a small tool interface. See the relevant documentation.

Use as a library: After installing from PyPI (pip install doc_redaction), you can call the same workflows as the Gradio api_name routes from Python. See the documentation: Python Package usage (Python).

To extract text from documents, the 'Local' options are PikePDF for PDFs with selectable text, and OCR with Tesseract. Use AWS Textract to extract more complex elements, e.g. handwriting, signatures, or unclear text. PaddleOCR and VLM support is also provided (see the installation instructions above).

For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.

Additional options on the 'Redaction settings' tab include the type of information to redact (e.g. people, places), custom terms to include/exclude from redaction, fuzzy matching, language settings, and whole page redaction. After redaction is complete, you can view and modify suggested redactions on the 'Review redactions' tab to quickly create a final redacted document.

NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed by a human before using the final outputs.

Release history

- 2.6.0: Jul 17, 2026 (this version)
- 2.5.0: Jul 4, 2026
- 2.4.2: Jun 17, 2026
- 2.4.1: Jun 16, 2026
- 2.4.0: Jun 16, 2026
- 2.3.0: May 29, 2026
- 2.2.8: May 15, 2026
- 2.2.7: May 7, 2026
- 2.2.6: Apr 29, 2026
- 2.2.5: Apr 28, 2026
- 2.2.4: Apr 27, 2026
- 2.2.3: Apr 25, 2026
- 2.2.2: Apr 24, 2026
- 2.2.1: Apr 24, 2026
- 2.2.0: Apr 23, 2026

Download files

Source Distribution

doc_redaction-2.6.0.tar.gz (3.4 MB)

Uploaded Jul 17, 2026 Source

Built Distribution

doc_redaction-2.6.0-py3-none-any.whl (3.4 MB)

Uploaded Jul 17, 2026 Python 3

File details

Details for the file doc_redaction-2.6.0.tar.gz.

File metadata

Download URL: doc_redaction-2.6.0.tar.gz
Upload date: Jul 17, 2026
Size: 3.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for doc_redaction-2.6.0.tar.gz
Algorithm	Hash digest
SHA256	`3ddd9194f4cac4f5ea55cb7879ae478485d2a2e1a49effbeb39762186980540d`
MD5	`f22ab4da11687e8ed8e32aab3be13392`
BLAKE2b-256	`f87882a08879759fa2e3e96cc85c4a4527df6b9b7ca18534678be48d93cc2974`
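A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch; the helper names are hypothetical and the file path in the usage example is whatever location you downloaded the archive to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with the published value (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    published = "3ddd9194f4cac4f5ea55cb7879ae478485d2a2e1a49effbeb39762186980540d"
    print(matches_expected("doc_redaction-2.6.0.tar.gz", published))
```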

File details

Details for the file doc_redaction-2.6.0-py3-none-any.whl.

File metadata

Download URL: doc_redaction-2.6.0-py3-none-any.whl
Upload date: Jul 17, 2026
Size: 3.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for doc_redaction-2.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5dd6d851df70abc528ce2d4715baf6a39ad7abb89bc0533ab7000d6bc07f914d`
MD5	`20e1c63e056009837c6f54f1f8b5a109`
BLAKE2b-256	`adeb9333140c278f8dc424c4503aa34e348c0046ec7aa33c59f9c81767ee3af1`
