Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more
DocumentLens
Text Analysis & Academic Intelligence Microservice
Transform text content into actionable insights through comprehensive linguistic analysis, writing quality assessment, and academic integrity checking.
🚀 Quick Start
# Docker deployment (recommended)
docker-compose up -d
# Or raw deployment
./deploy.sh
# API available at: http://localhost:8002
# Documentation: http://localhost:8002/docs
📊 API Endpoints
Core Analysis
- GET /health - Service health check
- POST /text - Text analysis (readability, quality, word frequency)
- POST /academic - Academic analysis (citations, DOI resolution, integrity)
- POST /files - File upload + analysis (PDF, DOCX, TXT, MD)
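As a rough illustration of calling POST /text, the sketch below builds a JSON request with the Python standard library. The request body shape (a single "text" field) is an assumption, not something this README confirms; check http://localhost:8002/docs for the actual schema. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # default address from the Quick Start


def build_text_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /text endpoint."""
    # ASSUMPTION: the endpoint accepts a JSON object with a "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/text",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_text_request("The quick brown fox jumps over the lazy dog.")
    print(req.get_method(), req.full_url)
    # With the service running, the request could be sent with: send(req)
```

Keeping request construction separate from sending makes the payload easy to inspect before the service is running.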
Advanced Text Analysis
- POST /advanced/ngrams - N-gram extraction with optional filter terms
- POST /advanced/ner - Named entity recognition
- POST /advanced/search/keywords - Batch keyword search across multiple terms
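To make "n-gram extraction with optional filter terms" concrete, here is a small local sketch of the general technique. It is not the service's implementation; the tokenizer, the function name, and the rule "keep an n-gram if it contains at least one filter term" are all assumptions made for illustration.

```python
import re
from collections import Counter


def extract_ngrams(text, n=2, filter_terms=None, top_k=10):
    """Count word n-grams, optionally keeping only those containing a filter term."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Simple lowercase word tokenizer (an assumption; the service may differ).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Sliding window: zip the token list against itself shifted by 0..n-1.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    wanted = {term.lower() for term in filter_terms} if filter_terms else None
    counts = Counter(
        gram for gram in ngrams
        if wanted is None or wanted.intersection(gram)
    )
    return [(" ".join(gram), count) for gram, count in counts.most_common(top_k)]


if __name__ == "__main__":
    sample = "Climate risk and climate risk disclosure"
    print(extract_ngrams(sample, n=2, top_k=3))
    print(extract_ngrams(sample, n=2, filter_terms=["disclosure"]))
```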
Document Intelligence
- POST /files/infer-metadata - Infer year, company, industry, document type from content
- POST /text/infer-metadata - Metadata inference from raw text
- Page-level text extraction (via include_extracted_text=true on /files)
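The following sketch shows one way a client might upload a file to /files with include_extracted_text=true, using only the standard library. Two details are assumptions: that the flag is passed as a query parameter, and that the multipart form field is named "files". The helper name is hypothetical; the interactive docs at /docs are the authority on both.

```python
import uuid
import urllib.parse
import urllib.request


def build_upload_request(filename, content, base_url="http://localhost:8002",
                         field_name="files", include_extracted_text=True):
    """Build (but do not send) a multipart/form-data upload for /files."""
    boundary = uuid.uuid4().hex
    # One multipart part carrying the file bytes.
    # ASSUMPTION: the form field name is "files".
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    # ASSUMPTION: the flag is sent as a query parameter.
    query = urllib.parse.urlencode(
        {"include_extracted_text": str(include_extracted_text).lower()}
    )
    return urllib.request.Request(
        url=f"{base_url}/files?{query}",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_upload_request("report.pdf", b"%PDF-1.4 demo bytes")
    print(req.get_method(), req.full_url, len(req.data), "bytes")
    # With the service running: urllib.request.urlopen(req, timeout=60)
```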
Integration
- Root endpoint: GET / - Service info and available endpoints
- For presentations: Use PresentationLens
- For recordings: Use RecordingLens
🎯 Use Cases
- Text Analysis: Readability, writing quality, word frequency for any text content
- Academic Analysis: Citation verification, DOI resolution, AI detection, integrity checking
- Document Intelligence: Extract and analyze text from PDFs and Word documents
- Sustainability Research: Batch keyword analysis for TCFD, GRI, SDGs, SASB frameworks
- Corporate Report Analysis: Auto-detect metadata (year, company, industry) from annual reports
- Multi-Service Workflows: Integrate with specialized analysis services
Desktop Application Support
DocumentLens powers the document-lens-desktop Electron application for researchers analyzing corporate sustainability reports. Features include:
- Smart metadata inference (company name, year, industry, document type)
- Framework keyword analysis (TCFD, GRI, SDGs, SASB)
- Batch processing with SQLite storage
- Offline operation via bundled Python backend
🏗️ Microservices Ecosystem
DocumentLens is part of a focused microservices architecture:
| Service | Purpose | Repository |
|---|---|---|
| DocumentLens | Text analysis & academic intelligence | This repo |
| PresentationLens | Presentation design & structure analysis | presentation-lens |
| RecordingLens | Student recordings (video/audio) analysis | recording-lens |
| CodeLens | Source code quality & analysis | code-lens |
| SubmissionLens | Student submission router & frontend | submission-lens |
Integration Pattern
graph LR
A[Student Submission] --> B[SubmissionLens Frontend]
B --> C{File Type Router}
C -->|Text/PDF/DOCX| D[DocumentLens]
C -->|PPTX| E[PresentationLens]
C -->|Video/Audio| F[RecordingLens]
C -->|Source Code| G[CodeLens]
E --> D
F --> D
G --> D
D --> H[Combined Feedback]
H --> B
B --> I[Student Dashboard]
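The routing step in the diagram above can be sketched as a lookup from file extension to a chain of services. This is an illustration only, not SubmissionLens code: the extension lists for RecordingLens and CodeLens are assumptions, and the function name is hypothetical. The chain ends in DocumentLens because the diagram shows the other services feeding into it.

```python
from pathlib import PurePath

# Extension sets per service. Only Text/PDF/DOCX and PPTX come from the diagram;
# the media and source-code extensions are assumed examples.
ROUTES = {
    "DocumentLens": {".txt", ".md", ".pdf", ".docx"},
    "PresentationLens": {".pptx"},
    "RecordingLens": {".mp4", ".mov", ".mp3", ".wav"},  # assumed
    "CodeLens": {".py", ".js", ".java", ".c"},  # assumed
}


def route(filename):
    """Return the ordered list of services a submitted file would pass through."""
    suffix = PurePath(filename).suffix.lower()
    for service, extensions in ROUTES.items():
        if suffix in extensions:
            if service == "DocumentLens":
                return ["DocumentLens"]
            # Specialized services hand their output on to DocumentLens.
            return [service, "DocumentLens"]
    raise ValueError(f"No service registered for file type: {suffix or filename}")


if __name__ == "__main__":
    for name in ("essay.pdf", "slides.pptx", "talk.mp4", "main.py"):
        print(name, "->", " -> ".join(route(name)))
```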
🚀 Deployment
Docker Deployment (Recommended)
git clone https://github.com/michael-borck/document-lens.git
cd document-lens
docker-compose up -d # Single container deployment
Raw/Native Deployment
git clone https://github.com/michael-borck/document-lens.git
cd document-lens
./deploy.sh # Handles venv, dependencies, and production server
🧪 Testing
# Install dev dependencies
uv sync --extra dev
# Run all tests
uv run pytest tests/ -v
# Run specific test file
uv run pytest tests/test_files.py -v
# Run only PDF tests
uv run pytest tests/ -m pdf -v
# Skip slow tests
uv run pytest tests/ -m "not slow" -v
# Run with coverage report
uv run pytest tests/
Test Structure
- tests/conftest.py - Shared fixtures and test client setup
- tests/test_health.py - Health/smoke tests
- tests/test_text_analysis.py - Text analysis endpoint tests
- tests/test_academic_analysis.py - Academic analysis endpoint tests
- tests/test_files.py - PDF file upload tests
Test Data
Place test files (PDF, DOCX, etc.) in the test-data/ directory. The test suite automatically discovers and uses these files for parameterized tests.
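File discovery of this kind can be sketched with pathlib. The helper below is a hypothetical example, not the project's actual conftest.py; the function name and the set of suffixes (taken from the formats listed for /files) are assumptions.

```python
from pathlib import Path

# Suffixes to collect; assumed from the formats listed for POST /files.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}


def discover_test_files(directory="test-data", suffixes=SUPPORTED_SUFFIXES):
    """Return a sorted list of supported files under the given directory."""
    root = Path(directory)
    if not root.is_dir():
        # A missing directory yields no test cases rather than an error.
        return []
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    for path in discover_test_files():
        print(path)
```

In a test module, the returned list could be passed to `pytest.mark.parametrize("path", discover_test_files(), ids=lambda p: p.name)` so that each file becomes its own test case.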
📚 Documentation
- DEPLOYMENT.md - Deployment guide for Docker and raw installations
- DOCUMENTLENS_SETUP.md - Setup and usage instructions
- .env.example - Configuration template
- docs/ - Additional architecture and integration documentation
DocumentLens: Pure text intelligence at the heart of content analysis