Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more

DocumentLens

Text Analysis & Academic Intelligence Microservice

Transform text content into actionable insights through comprehensive linguistic analysis, writing quality assessment, and academic integrity checking.

🚀 Quick Start

# Docker deployment (recommended)
docker-compose up -d

# Or raw deployment
./deploy.sh

# API available at: http://localhost:8002
# Documentation: http://localhost:8002/docs
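After starting the service it can be useful to confirm that it is reachable before sending real work. The following is a minimal sketch using only the Python standard library. It assumes that GET /health answers with HTTP 200 when the service is up; the response body is not inspected because its shape is not documented here. The helper names `health_url` and `is_healthy` are hypothetical.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8002"  # address from the Quick Start


def health_url(base_url: str = BASE_URL) -> str:
    """Build the URL of the GET /health endpoint."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str = BASE_URL, timeout: float = 3.0) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not up".
        return False
```

Calling `is_healthy()` in a retry loop is one way to wait for the container to finish starting.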

📊 API Endpoints

Core Analysis

GET /health - Service health check
POST /text - Text analysis (readability, quality, word frequency)
POST /academic - Academic analysis (citations, DOI resolution, integrity)
POST /files - File upload + analysis (PDF, DOCX, TXT, MD)
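As an illustration of calling the text endpoint, the sketch below prepares a POST /text request with the standard library. The request schema is not described in this README, so the single JSON field `"text"` is an assumption; the interactive documentation at /docs is the authoritative source. `build_text_request` and `analyse_text` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"


def build_text_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare (but do not send) a POST /text request.

    ASSUMPTION: the body is JSON with one "text" field; check /docs for the real schema.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyse_text(text: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON reply (its fields are not assumed here)."""
    with urllib.request.urlopen(build_text_request(text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect and adjust once the actual schema is known.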

Advanced Text Analysis

POST /advanced/ngrams - N-gram extraction with optional filter terms
POST /advanced/ner - Named entity recognition
POST /advanced/search/keywords - Batch keyword search across multiple terms
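To make "n-gram extraction with optional filter terms" concrete, here is a small local sketch of the general technique. It is not the service's implementation: the tokenisation rule, the function name `extract_ngrams`, and the choice to match filter terms against single words are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Optional


def extract_ngrams(text: str, n: int = 2,
                   filter_terms: Optional[Iterable[str]] = None) -> Counter:
    """Count word n-grams; with filter_terms, keep only n-grams containing one of them."""
    # Simplistic tokeniser (assumption): lowercase runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    wanted = {term.lower() for term in filter_terms} if filter_terms else None
    counts = Counter()
    # zip over n shifted copies of the word list yields consecutive n-word windows.
    for gram in zip(*(words[i:] for i in range(n))):
        if wanted is None or wanted.intersection(gram):
            counts[" ".join(gram)] += 1
    return counts
```

For example, `extract_ngrams(report_text, n=2, filter_terms=["climate"])` would keep only the bigrams that mention "climate".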

Document Intelligence

POST /files/infer-metadata - Infer year, company, industry, document type from content
POST /text/infer-metadata - Metadata inference from raw text
POST /files with include_extracted_text=true - Page-level text extraction
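A file upload that also requests the extracted text could be prepared as sketched below. Two details are assumptions, since this README does not specify them: the multipart form field is named `"file"`, and `include_extracted_text` is sent as a query parameter. `build_upload_request` is a hypothetical helper; the code only uses the Python standard library.

```python
import urllib.parse
import urllib.request
import uuid

BASE_URL = "http://localhost:8002"


def build_upload_request(filename: str, content: bytes,
                         include_extracted_text: bool = True,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare (but do not send) a multipart POST /files request.

    ASSUMPTIONS: form field name "file"; include_extracted_text as a query parameter.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    query = urllib.parse.urlencode(
        {"include_extracted_text": str(include_extracted_text).lower()}
    )
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/files?{query}",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The prepared request can be passed to `urllib.request.urlopen` once the field and parameter names have been checked against /docs.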

Integration

GET / - Service info and available endpoints
For presentations: Use PresentationLens
For recordings: Use RecordingLens

🎯 Use Cases

Text Analysis: Readability, writing quality, word frequency for any text content
Academic Analysis: Citation verification, DOI resolution, AI detection, integrity checking
Document Intelligence: Extract and analyze text from PDFs and Word documents
Sustainability Research: Batch keyword analysis for TCFD, GRI, SDGs, SASB frameworks
Corporate Report Analysis: Auto-detect metadata (year, company, industry) from annual reports
Multi-Service Workflows: Integrate with specialized analysis services
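The batch keyword use case can be pictured as counting several term lists against one document. The sketch below runs locally and is only an illustration of the idea: the keyword lists are placeholders, not the lists DocumentLens ships with, and `count_framework_keywords` is a hypothetical name.

```python
import re
from typing import Dict, List

# Placeholder terms for illustration only; NOT the service's framework keyword lists.
FRAMEWORK_KEYWORDS: Dict[str, List[str]] = {
    "TCFD": ["scenario analysis", "climate-related risk"],
    "GRI": ["materiality", "stakeholder engagement"],
}


def count_framework_keywords(text: str,
                             frameworks: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Count whole-phrase, case-insensitive occurrences of each term, grouped by framework."""
    lowered = text.lower()
    return {
        framework: {
            # \b anchors keep "materiality" from matching inside a longer word.
            term: len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", lowered))
            for term in terms
        }
        for framework, terms in frameworks.items()
    }
```

Running the same function over many reports gives a per-framework count table that can be compared across companies or years.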

Desktop Application Support

DocumentLens powers the document-lens-desktop Electron application for researchers analyzing corporate sustainability reports. Features include:

Smart metadata inference (company name, year, industry, document type)
Framework keyword analysis (TCFD, GRI, SDGs, SASB)
Batch processing with SQLite storage
Offline operation via bundled Python backend

🏗️ Microservices Ecosystem

DocumentLens is part of a focused microservices architecture:

Service	Purpose	Repository
DocumentLens	Text analysis & academic intelligence	This repo
PresentationLens	Presentation design & structure analysis	presentation-lens
RecordingLens	Student recordings (video/audio) analysis	recording-lens
CodeLens	Source code quality & analysis	code-lens
SubmissionLens	Student submission router & frontend	submission-lens

Integration Pattern

graph LR
    A[Student Submission] --> B[SubmissionLens Frontend]
    B --> C{File Type Router}
    C -->|Text/PDF/DOCX| D[DocumentLens]
    C -->|PPTX| E[PresentationLens]
    C -->|Video/Audio| F[RecordingLens]
    C -->|Source Code| G[CodeLens]
    E --> D
    F --> D
    G --> D
    D --> H[Combined Feedback]
    H --> B
    B --> I[Student Dashboard]
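The file type router in the diagram can be sketched as a lookup from file extension to service name. The extension sets below are illustrative assumptions, not the actual SubmissionLens configuration, and `route_submission` is a hypothetical function. The diagram's second hop, where the specialised services pass text on to DocumentLens, is not modelled here.

```python
from pathlib import Path

# Illustrative extension sets (assumptions); the real router may use other rules.
ROUTES = {
    "DocumentLens": {".txt", ".md", ".pdf", ".docx"},
    "PresentationLens": {".pptx"},
    "RecordingLens": {".mp4", ".mp3", ".wav"},
    "CodeLens": {".py", ".js", ".java"},
}


def route_submission(filename: str) -> str:
    """Return the name of the service that should receive this file first."""
    suffix = Path(filename).suffix.lower()
    for service, extensions in ROUTES.items():
        if suffix in extensions:
            return service
    raise ValueError(f"unsupported file type: {filename}")
```

Raising on unknown extensions, rather than defaulting to one service, makes unsupported uploads visible early.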

🚀 Deployment

Docker Deployment (Recommended)

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
docker-compose up -d  # Single container deployment

Raw/Native Deployment

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
./deploy.sh  # Handles venv, dependencies, and production server

🧪 Testing

# Install dev dependencies
uv sync --extra dev

# Run all tests
uv run pytest tests/ -v

# Run specific test file
uv run pytest tests/test_files.py -v

# Run only PDF tests
uv run pytest tests/ -m pdf -v

# Skip slow tests
uv run pytest tests/ -m "not slow" -v

# Run the full suite (a coverage report appears only if coverage options are set in the pytest configuration)
uv run pytest tests/

Test Structure

tests/conftest.py - Shared fixtures and test client setup
tests/test_health.py - Health/smoke tests
tests/test_text_analysis.py - Text analysis endpoint tests
tests/test_academic_analysis.py - Academic analysis endpoint tests
tests/test_files.py - PDF file upload tests

Test Data

Place test files (PDF, DOCX, etc.) in the test-data/ directory. The test suite automatically discovers and uses these files for parameterized tests.
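Automatic discovery of test files is commonly done by scanning the directory and feeding the result to pytest's parametrization. The sketch below shows one way this could look; it is not taken from the project's conftest.py, and `discover_test_files` is a hypothetical helper.

```python
from pathlib import Path

# Formats listed for POST /files in this README.
SUPPORTED_SUFFIXES = frozenset({".pdf", ".docx", ".txt", ".md"})


def discover_test_files(data_dir: Path, suffixes=SUPPORTED_SUFFIXES) -> list:
    """Return supported files directly under data_dir, sorted for stable test order."""
    if not data_dir.is_dir():
        return []  # an empty list simply yields no parameterized cases
    return sorted(
        p for p in data_dir.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


# In a test module the list would typically be used like this (pytest assumed):
#   @pytest.mark.parametrize("path", discover_test_files(Path("test-data")),
#                            ids=lambda p: p.name)
#   def test_upload(path): ...
```

Sorting the paths keeps test IDs in the same order between runs, which makes failures easier to compare.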

📚 Documentation

DEPLOYMENT.md - Deployment guide for Docker and raw installations
DOCUMENTLENS_SETUP.md - Setup and usage instructions
.env.example - Configuration template
docs/ - Additional architecture and integration documentation

DocumentLens: Pure text intelligence at the heart of content analysis

Release history

0.2.1 - May 8, 2026
0.2.0 - May 8, 2026
0.1.2 - May 7, 2026
0.1.1 - May 7, 2026
0.1.0 - May 5, 2026 (this version)

Download files

Source Distribution

document_analyser-0.1.0.tar.gz (317.8 kB)

Uploaded May 5, 2026 Source

Built Distribution

document_analyser-0.1.0-py3-none-any.whl (66.9 kB)

Uploaded May 5, 2026 Python 3

File details

Details for the file document_analyser-0.1.0.tar.gz.

File metadata

Download URL: document_analyser-0.1.0.tar.gz
Upload date: May 5, 2026
Size: 317.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for document_analyser-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e0a07ec757e912a559d53f08b7e86cdb8a44879ca46ab237d658518dadd2b734`
MD5	`12f840a220758816cebb8df2de321978`
BLAKE2b-256	`7a6559181cc0aa4de98df75ff3f5e7e85efa24b28f9aca83794a1bfaf4b1ec4b`

File details

Details for the file document_analyser-0.1.0-py3-none-any.whl.

File metadata

Download URL: document_analyser-0.1.0-py3-none-any.whl
Upload date: May 5, 2026
Size: 66.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for document_analyser-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0d87f7a2a7355f2304206f62ef45e38e32411e62c56da4a1c615411955936cec`
MD5	`2decf558abb5390102d8c66691ddb581`
BLAKE2b-256	`54db3128217411bdbfa3f9e8c348f2b04617957118c1847b5f22722052ff67e6`
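The published digests can be used to check a downloaded file before installing it. The following sketch computes a SHA256 digest with the standard library and compares it with an expected value; `sha256_of` and `verify_download` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter(callable, sentinel) keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

For the wheel listed above, the expected value would be the SHA256 entry from its hash table, passed as a string alongside the path of the downloaded file.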

document-analyser 0.1.0
