Document Processing System
A system for processing documents to detect their type, extract fields, and generate templates.
Features
- Document type detection (Form or Table)
- Field detection and extraction
- Template generation
- Data extraction from filled documents
- Web API and UI for document processing
Installation
- Clone the repository:
git clone https://github.com/yourusername/document-processing.git
cd document-processing
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the dependencies:
pip install -r src/requirements.txt
Usage
Command-line Interface
Process a document using the command-line interface:
python src/process_document.py path/to/document.pdf --output-dir output --filled
Options:
- --output-dir: Directory to save the output files (default: a processed directory in the same location as the input file)
- --filled: Flag indicating that the document is filled (default: False)
CLI Examples
Quick Start Example:
# Install dependencies
pip install -r src/requirements.txt
# Process a sample document
python src/process_document.py sample_documents/invoice.pdf
# The command will output JSON with document structure
# and save files to a 'processed' directory next to the original file
Basic usage with default options (process an unfilled document):
python src/process_document.py sample_documents/invoice.pdf
Process a filled form and extract the data:
python src/process_document.py sample_documents/filled_form.pdf --filled
Specify an output directory:
python src/process_document.py sample_documents/contract.pdf --output-dir ./processed_results
Process a filled document and specify an output directory:
python src/process_document.py sample_documents/filled_invoice.jpg --filled --output-dir ./processed_results
Process an image document (JPG or PNG):
python src/process_document.py sample_documents/form.png
The CLI will output the JSON result to the console and save detailed results to the specified output directory. The output includes:
- document_type: The detected type (form or table)
- template: The generated template structure
- fields or rows: The detected fields or table rows
- extracted_data: (Only when --filled is used) The extracted values from the document
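The console output can be consumed by another script. The following is a minimal sketch, assuming the JSON uses exactly the keys listed above and that table results carry a `rows` key in place of `fields`. The helper name `summarize_result` and the sample string are hypothetical and not part of the project.

```python
import json


def summarize_result(result: dict) -> str:
    """Return a one-line summary of a processing result (hypothetical helper)."""
    doc_type = result.get("document_type", "unknown")
    # Assumption: forms carry "fields" and tables carry "rows".
    items = result.get("fields") or result.get("rows") or []
    summary = f"{doc_type}: {len(items)} detected item(s)"
    # extracted_data is only expected when --filled was used.
    if "extracted_data" in result:
        summary += ", extracted data present"
    return summary


# Hypothetical console output captured from the CLI.
raw = (
    '{"document_type": "form", "template": {}, '
    '"fields": [{"name": "Field 1", "value": "", "type": "text"}]}'
)
print(summarize_result(json.loads(raw)))
```

In practice the string in `raw` would come from the CLI's standard output, for example via `subprocess.run(..., capture_output=True)`.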
Web Interface
The application consists of two parts that need to be run separately:
- Start the API server:
python run_server.py
- Start the Next.js frontend (in a separate terminal):
cd www
npm install # Only needed the first time
npm run dev
Then open your browser and navigate to http://localhost:3000 to access the web interface.
API
The system provides a REST API for document processing:
Process Document
POST /api/
Parameters:
- document: Document file (PDF, PNG, JPG)
- is_filled: Whether the document is filled or not (default: False)
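A request with these two parameters could be built using only the Python standard library. This is a sketch under stated assumptions: that the endpoint accepts `multipart/form-data`, that `is_filled` is sent as the string `true` or `false`, and that the server's base URL is supplied by the caller, since the README does not state the API's host or port. The helper names `build_multipart` and `make_request` are hypothetical.

```python
import os
import uuid
import urllib.request
from typing import Tuple


def build_multipart(filename: str, content: bytes, is_filled: bool) -> Tuple[bytes, str]:
    """Encode the two documented parameters as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="is_filled"\r\n\r\n'
        # Assumption: the server parses "true"/"false" strings.
        f"{'true' if is_filled else 'false'}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def make_request(base_url: str, path: str, is_filled: bool = False) -> urllib.request.Request:
    """Prepare (but do not send) a POST /api/ request for the file at `path`."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(os.path.basename(path), fh.read(), is_filled)
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Usage (base URL is a placeholder; substitute the address printed by run_server.py):
# with urllib.request.urlopen(make_request("http://<host>:<port>", "form.pdf", True)) as resp:
#     print(resp.read().decode("utf-8"))
```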
Response:
{
  "document_type": "form",
  "template": {
    "name": "Form Template",
    "type": "form",
    "sections": [
      {
        "name": "Section 1",
        "fields": [
          {
            "name": "Field 1",
            "type": "text",
            "required": false,
            "roi": {
              "left": 100,
              "top": 200,
              "right": 300,
              "bottom": 250,
              "page": 0
            }
          }
        ]
      }
    ]
  },
  "fields": [
    {
      "name": "Field 1",
      "value": "",
      "type": "text"
    }
  ],
  "extracted_data": {
    "form_name": "Form Template",
    "sections": [
      {
        "name": "Section 1",
        "fields": [
          {
            "name": "Field 1",
            "value": "Extracted Value",
            "type": "text"
          }
        ]
      }
    ]
  }
}
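A client will usually want the extracted values in a flatter shape. The sketch below walks a response shaped like the example above and is only an illustration: the helper names `flatten_extracted_data` and `roi_size` are hypothetical, and the units of the ROI coordinates are not stated in this README.

```python
from typing import Dict, Tuple


def flatten_extracted_data(response: dict) -> Dict[str, str]:
    """Map "Section / Field" labels to extracted values."""
    values = {}
    # extracted_data is absent when the document was not processed as filled.
    extracted = response.get("extracted_data") or {}
    for section in extracted.get("sections", []):
        for field in section.get("fields", []):
            values[f"{section['name']} / {field['name']}"] = field.get("value", "")
    return values


def roi_size(roi: dict) -> Tuple[int, int]:
    """Width and height of a field's region of interest, in the ROI's own units."""
    return roi["right"] - roi["left"], roi["bottom"] - roi["top"]


# Trimmed copy of the example response.
sample = {
    "document_type": "form",
    "template": {
        "sections": [
            {
                "name": "Section 1",
                "fields": [
                    {
                        "name": "Field 1",
                        "type": "text",
                        "roi": {"left": 100, "top": 200, "right": 300, "bottom": 250, "page": 0},
                    }
                ],
            }
        ]
    },
    "extracted_data": {
        "form_name": "Form Template",
        "sections": [
            {
                "name": "Section 1",
                "fields": [{"name": "Field 1", "value": "Extracted Value", "type": "text"}],
            }
        ],
    },
}

print(flatten_extracted_data(sample))  # {'Section 1 / Field 1': 'Extracted Value'}
first_roi = sample["template"]["sections"][0]["fields"][0]["roi"]
print(roi_size(first_roi))  # (200, 50)
```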
Architecture
The system consists of the following components:
- Document Processor: Detects the type of document and extracts ROIs.
- Field Detector: Detects and extracts fields from the document.
- Template Generator: Generates templates from the document.
- Data Extractor: Extracts data from filled documents using templates.
- Web API: Provides a REST API for document processing.
- Web UI: Provides a user interface for document processing.
Supported Document Types
- Form: Documents with a vertical layout in which each page holds the entries for a single entity.
- Table: Documents with a horizontal layout in which each row holds one entry, so a single page can contain multiple entries.
Supported Field Types
- Text: Free-flowing text.
- Boxed Text: Text written one character per box, with the boxes together forming a word or phrase.
- Number: A number (e.g., phone number).
- Date: A date.
- Radio Button: A radio button.
- Signature: Free-flowing text whose orientation may not be horizontal.
- Checkbox: A checkbox.
- Image: An image.
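Client code that consumes templates may want to validate the `type` string of each field against this list. The following is a sketch only: the value `"text"` appears in the sample API response, but every other string value below is an assumed spelling, and the names `FieldType` and `parse_field_type` are hypothetical.

```python
from enum import Enum


class FieldType(Enum):
    TEXT = "text"                  # appears in the sample API response
    BOXED_TEXT = "boxed_text"      # assumed spelling
    NUMBER = "number"              # assumed spelling
    DATE = "date"                  # assumed spelling
    RADIO_BUTTON = "radio_button"  # assumed spelling
    SIGNATURE = "signature"        # assumed spelling
    CHECKBOX = "checkbox"          # assumed spelling
    IMAGE = "image"                # assumed spelling


def parse_field_type(raw: str) -> FieldType:
    """Convert a raw type string to a FieldType; raises ValueError if unknown."""
    return FieldType(raw.strip().lower())


print(parse_field_type("text"))  # FieldType.TEXT
```

Using an enum keeps unknown type strings from passing silently, which is useful if the API later adds field types that the client does not yet handle.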
License
This project is licensed under the MIT License - see the LICENSE file for details.