
Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more

Reason this release was yanked:

Package renamed — see "document-analyser" instead

DocumentLens

Text Analysis & Academic Intelligence Microservice

DocumentLens analyzes text for readability, writing quality, and word frequency, and checks academic work for citation, DOI, and integrity issues.

🚀 Quick Start

# Docker deployment (recommended)
docker-compose up -d

# Or raw deployment
./deploy.sh

# API available at: http://localhost:8002
# Documentation: http://localhost:8002/docs

📊 API Endpoints

Core Analysis

  • GET /health - Service health check
  • POST /text - Text analysis (readability, quality, word frequency)
  • POST /academic - Academic analysis (citations, DOI resolution, integrity)
  • POST /files - File upload + analysis (PDF, DOCX, TXT, MD)
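As a rough illustration of calling the core endpoints, the sketch below builds a POST /text request with the Python standard library. The JSON body shape (a single "text" field) is an assumption, not something this README documents; check the interactive docs at http://localhost:8002/docs for the actual schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # default address from the Quick Start


def build_text_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /text request.

    Assumption: the endpoint accepts a JSON body with a "text" field.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the service running:
#   result = send(build_text_request("The quick brown fox jumps over the lazy dog."))
```

Keeping request construction separate from sending makes the payload easy to inspect before the service is running.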

Advanced Text Analysis

  • POST /advanced/ngrams - N-gram extraction with optional filter terms
  • POST /advanced/ner - Named entity recognition
  • POST /advanced/search/keywords - Batch keyword search across multiple terms
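To make "n-gram extraction with optional filter terms" concrete, here is a minimal local sketch of the idea. It is not the service's implementation: the tokenizer, the function name extract_ngrams, and the rule that a filter term must appear as one of the n-gram's words are all assumptions made for illustration.

```python
import re
from collections import Counter


def extract_ngrams(text, n=2, filter_terms=None, top_k=10):
    """Return the top_k most frequent word n-grams as (ngram, count) pairs.

    If filter_terms is given, keep only n-grams containing at least one
    of those terms as a whole word (an assumed filtering rule).
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of width n across the token list.
    windows = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(window) for window in windows)
    if filter_terms:
        wanted = {term.lower() for term in filter_terms}
        counts = Counter(
            {gram: c for gram, c in counts.items() if wanted & set(gram.split())}
        )
    return counts.most_common(top_k)


sample = "climate risk and climate risk disclosure"
print(extract_ngrams(sample, n=2))
print(extract_ngrams(sample, n=2, filter_terms=["disclosure"]))
```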

Document Intelligence

  • POST /files/infer-metadata - Infer year, company, industry, document type from content
  • POST /text/infer-metadata - Metadata inference from raw text
  • Page-level text extraction (via include_extracted_text=true on /files)
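The sketch below shows one way a client might prepare a /files upload with page-level text extraction turned on. Two details are assumptions: that include_extracted_text is passed as a query parameter, and that the upload uses a multipart form field named "file". Neither is confirmed by this README, so verify both against /docs.

```python
import uuid
from urllib.parse import urlencode


def files_url(base_url="http://localhost:8002", include_extracted_text=True):
    """Build the /files URL (assumes the flag is a query parameter)."""
    query = urlencode({"include_extracted_text": str(include_extracted_text).lower()})
    return f"{base_url}/files?{query}"


def build_multipart(field_name, filename, content,
                    content_type="application/octet-stream"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = [
        f"--{boundary}".encode(),
        (f'Content-Disposition: form-data; name="{field_name}"; '
         f'filename="{filename}"').encode(),
        f"Content-Type: {content_type}".encode(),
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


# "file" is a hypothetical field name; the payload is a placeholder.
body, header = build_multipart("file", "report.pdf", b"%PDF-1.7 ...", "application/pdf")
print(files_url())
print(header)
```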

Integration

🎯 Use Cases

  • Text Analysis: Readability, writing quality, word frequency for any text content
  • Academic Analysis: Citation verification, DOI resolution, AI detection, integrity checking
  • Document Intelligence: Extract and analyze text from PDFs and Word documents
  • Sustainability Research: Batch keyword analysis for TCFD, GRI, SDGs, SASB frameworks
  • Corporate Report Analysis: Auto-detect metadata (year, company, industry) from annual reports
  • Multi-Service Workflows: Integrate with specialized analysis services
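The batch keyword use case can be pictured with a small local sketch that counts whole-phrase matches per framework. The keyword lists here are illustrative placeholders rather than official TCFD or GRI vocabularies, and count_framework_keywords is a hypothetical helper, not part of the DocumentLens API.

```python
import re

# Illustrative placeholder keywords only, not official framework term lists.
FRAMEWORK_KEYWORDS = {
    "TCFD": ["climate risk", "scenario analysis"],
    "GRI": ["materiality", "stakeholder engagement"],
}


def count_framework_keywords(text, frameworks=FRAMEWORK_KEYWORDS):
    """Count case-insensitive whole-phrase matches for each framework keyword."""
    lowered = text.lower()
    results = {}
    for framework, keywords in frameworks.items():
        results[framework] = {
            keyword: len(re.findall(rf"\b{re.escape(keyword.lower())}\b", lowered))
            for keyword in keywords
        }
    return results


report = "Climate risk drives our scenario analysis. Climate risk is material."
print(count_framework_keywords(report))
```

Word boundaries keep "material" from being counted as "materiality", which matters when keyword lists contain near-duplicates.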

Desktop Application Support

DocumentLens powers the document-lens-desktop Electron application for researchers analyzing corporate sustainability reports. Features include:

  • Smart metadata inference (company name, year, industry, document type)
  • Framework keyword analysis (TCFD, GRI, SDGs, SASB)
  • Batch processing with SQLite storage
  • Offline operation via bundled Python backend

🏗️ Microservices Ecosystem

DocumentLens is part of a focused microservices architecture:

Service          | Purpose                                   | Repository
DocumentLens     | Text analysis & academic intelligence     | This repo
PresentationLens | Presentation design & structure analysis  | presentation-lens
RecordingLens    | Student recordings (video/audio) analysis | recording-lens
CodeLens         | Source code quality & analysis            | code-lens
SubmissionLens   | Student submission router & frontend      | submission-lens

Integration Pattern

graph LR
    A[Student Submission] --> B[SubmissionLens Frontend]
    B --> C{File Type Router}
    C -->|Text/PDF/DOCX| D[DocumentLens]
    C -->|PPTX| E[PresentationLens]
    C -->|Video/Audio| F[RecordingLens]
    C -->|Source Code| G[CodeLens]
    E --> D
    F --> D
    G --> D
    D --> H[Combined Feedback]
    H --> B
    B --> I[Student Dashboard]
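The routing in the diagram above can be sketched as a lookup from file extension to service, with every non-DocumentLens service forwarding to DocumentLens afterwards. Only the PDF, DOCX, and PPTX mappings come from the diagram; the other extensions and the route function itself are assumptions for illustration.

```python
from pathlib import Path

# .pdf, .docx, and .pptx follow the diagram; the rest are assumed examples.
ROUTES = {
    ".txt": "DocumentLens",
    ".md": "DocumentLens",
    ".pdf": "DocumentLens",
    ".docx": "DocumentLens",
    ".pptx": "PresentationLens",
    ".mp4": "RecordingLens",
    ".mp3": "RecordingLens",
    ".py": "CodeLens",
}


def route(filename: str) -> list[str]:
    """Return the ordered list of services a submission passes through."""
    suffix = Path(filename).suffix.lower()
    service = ROUTES.get(suffix)
    if service is None:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}")
    if service == "DocumentLens":
        return [service]
    # Specialized services hand their extracted text on to DocumentLens.
    return [service, "DocumentLens"]


print(route("essay.pdf"))
print(route("talk.PPTX"))
```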

🚀 Deployment

Docker Deployment (Recommended)

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
docker-compose up -d  # Single container deployment

Raw/Native Deployment

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
./deploy.sh  # Handles venv, dependencies, and production server

🧪 Testing

# Install dev dependencies
uv sync --extra dev

# Run all tests
uv run pytest tests/ -v

# Run specific test file
uv run pytest tests/test_files.py -v

# Run only PDF tests
uv run pytest tests/ -m pdf -v

# Skip slow tests
uv run pytest tests/ -m "not slow" -v

# Run with coverage report (assumes the pytest-cov plugin is installed)
uv run pytest tests/ --cov

Test Structure

  • tests/conftest.py - Shared fixtures and test client setup
  • tests/test_health.py - Health/smoke tests
  • tests/test_text_analysis.py - Text analysis endpoint tests
  • tests/test_academic_analysis.py - Academic analysis endpoint tests
  • tests/test_files.py - PDF file upload tests

Test Data

Place test files (PDF, DOCX, etc.) in the test-data/ directory. The test suite automatically discovers and uses these files for parameterized tests.
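File discovery of this kind is often done with a small helper that feeds pytest's parametrization. The sketch below shows the general pattern, not the project's actual conftest.py; the helper name and the suffix list are assumptions.

```python
from pathlib import Path

# Assumed to match the formats listed for POST /files.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}


def discover_test_files(directory="test-data"):
    """Return supported files under directory, sorted for stable test IDs."""
    root = Path(directory)
    if not root.is_dir():
        return []  # no data directory means no parameterized cases
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Typical use in a test module (requires pytest):
#
#   import pytest
#
#   @pytest.mark.parametrize("path", discover_test_files(), ids=lambda p: p.name)
#   def test_upload(path):
#       ...
```

Returning an empty list when the directory is missing lets the suite collect cleanly on a fresh checkout.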

📚 Documentation

  • DEPLOYMENT.md - Deployment guide for Docker and raw installations
  • DOCUMENTLENS_SETUP.md - Setup and usage instructions
  • .env.example - Configuration template
  • docs/ - Additional architecture and integration documentation

DocumentLens: Pure text intelligence at the heart of content analysis
