Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more

DocumentLens

Text Analysis & Academic Intelligence Microservice

Transform text content into actionable insights through comprehensive linguistic analysis, writing quality assessment, and academic integrity checking.

🚀 Quick Start

# Docker deployment (recommended)
docker-compose up -d

# Or raw deployment
./deploy.sh

# API available at: http://localhost:8002
# Documentation: http://localhost:8002/docs

📊 API Endpoints

Core Analysis

  • GET /health - Service health check
  • POST /text - Text analysis (readability, quality, word frequency)
  • POST /academic - Academic analysis (citations, DOI resolution, integrity)
  • POST /files - File upload + analysis (PDF, DOCX, TXT, MD)
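A minimal sketch of calling the core endpoints from Python using only the standard library. The base URL comes from the Quick Start; the JSON body shape (a single "text" field) and the JSON response are assumptions, so check the schema at /docs before relying on them. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # default address from the Quick Start


def health_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request for the health check endpoint."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def text_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /text.

    ASSUMPTION: the endpoint accepts a JSON body with a "text" field.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a request and decode the response.

    ASSUMPTION: the service replies with a JSON object.
    Requires a running DocumentLens instance.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `send(health_request())` performs the health check and `send(text_request("Some text."))` submits text for analysis.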

Advanced Text Analysis

  • POST /advanced/ngrams - N-gram extraction with optional filter terms
  • POST /advanced/ner - Named entity recognition
  • POST /advanced/search/keywords - Batch keyword search across multiple terms

Document Intelligence

  • POST /files/infer-metadata - Infer year, company, industry, document type from content
  • POST /text/infer-metadata - Metadata inference from raw text
  • Page-level text extraction (via include_extracted_text=true on /files)
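A hedged sketch of building an upload request for /files with page-level text extraction enabled, using only the standard library. The form field name "file" and passing include_extracted_text as a query parameter are assumptions; the interactive documentation at /docs is the place to confirm both.

```python
import uuid
import urllib.request


def files_request(
    filename: str,
    content: bytes,
    base_url: str = "http://localhost:8002",
    include_extracted_text: bool = True,
) -> urllib.request.Request:
    """Build a multipart/form-data POST request for /files.

    ASSUMPTIONS: the upload field is named "file" and
    include_extracted_text is read from the query string.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    flag = "true" if include_extracted_text else "false"
    return urllib.request.Request(
        f"{base_url}/files?include_extracted_text={flag}",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The returned request can be sent with `urllib.request.urlopen` once the service is running.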

🎯 Use Cases

  • Text Analysis: Readability, writing quality, word frequency for any text content
  • Academic Analysis: Citation verification, DOI resolution, AI detection, integrity checking
  • Document Intelligence: Extract and analyze text from PDFs and Word documents
  • Sustainability Research: Batch keyword analysis for TCFD, GRI, SDGs, SASB frameworks
  • Corporate Report Analysis: Auto-detect metadata (year, company, industry) from annual reports
  • Multi-Service Workflows: Integrate with specialized analysis services
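The batch keyword use case can be pictured with a short local sketch that counts whole-phrase matches per framework. The keyword lists below are hypothetical examples, not the lists shipped with DocumentLens, and the matching rule (case-insensitive, word boundaries) is an assumption.

```python
import re
from typing import Dict, List

# HYPOTHETICAL keyword lists, for illustration only
FRAMEWORK_KEYWORDS: Dict[str, List[str]] = {
    "TCFD": ["climate risk", "scenario analysis"],
    "GRI": ["materiality", "stakeholder engagement"],
}


def count_keywords(
    text: str, keywords_by_group: Dict[str, List[str]]
) -> Dict[str, Dict[str, int]]:
    """Count case-insensitive whole-phrase occurrences of each keyword, per group."""
    lowered = text.lower()
    results: Dict[str, Dict[str, int]] = {}
    for group, keywords in keywords_by_group.items():
        results[group] = {
            keyword: len(
                re.findall(rf"\b{re.escape(keyword.lower())}\b", lowered)
            )
            for keyword in keywords
        }
    return results
```

Calling `count_keywords(report_text, FRAMEWORK_KEYWORDS)` returns one count table per framework, which is the kind of result a batch search across many terms would produce.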

Desktop Application Support

DocumentLens powers the document-lens-desktop Electron application for researchers analyzing corporate sustainability reports. Features include:

  • Smart metadata inference (company name, year, industry, document type)
  • Framework keyword analysis (TCFD, GRI, SDGs, SASB)
  • Batch processing with SQLite storage
  • Offline operation via bundled Python backend

🏗️ Microservices Ecosystem

DocumentLens is part of a focused microservices architecture:

| Service | Purpose | Repository |
|---|---|---|
| DocumentLens | Text analysis & academic intelligence | This repo |
| PresentationLens | Presentation design & structure analysis | presentation-lens |
| RecordingLens | Student recordings (video/audio) analysis | recording-lens |
| CodeLens | Source code quality & analysis | code-lens |
| SubmissionLens | Student submission router & frontend | submission-lens |

Integration Pattern

graph LR
    A[Student Submission] --> B[SubmissionLens Frontend]
    B --> C{File Type Router}
    C -->|Text/PDF/DOCX| D[DocumentLens]
    C -->|PPTX| E[PresentationLens]
    C -->|Video/Audio| F[RecordingLens]
    C -->|Source Code| G[CodeLens]
    E --> D
    F --> D
    G --> D
    D --> H[Combined Feedback]
    H --> B
    B --> I[Student Dashboard]
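
The routing step in the diagram above can be sketched as a lookup from file extension to service. The extension sets are assumptions chosen for illustration (the diagram only names broad categories), and `route` is a hypothetical helper, not SubmissionLens code.

```python
from pathlib import Path
from typing import List

# ASSUMPTION: illustrative extensions for each category in the diagram
ROUTES = {
    "DocumentLens": {".txt", ".md", ".pdf", ".docx"},
    "PresentationLens": {".pptx"},
    "RecordingLens": {".mp4", ".mov", ".mp3", ".wav"},
    "CodeLens": {".py", ".js", ".java"},
}

# Per the diagram, the specialized services pass their output on to DocumentLens
FORWARDS_TO_DOCUMENTLENS = {"PresentationLens", "RecordingLens", "CodeLens"}


def route(filename: str) -> List[str]:
    """Return the ordered list of services a submission would pass through."""
    suffix = Path(filename).suffix.lower()
    for service, suffixes in ROUTES.items():
        if suffix in suffixes:
            if service in FORWARDS_TO_DOCUMENTLENS:
                return [service, "DocumentLens"]
            return [service]
    raise ValueError(f"No service registered for file type: {suffix or filename}")
```

Keeping the mapping in a single table means a new file type needs only a new entry rather than a new branch.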

🚀 Deployment

Docker Deployment (Recommended)

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
docker-compose up -d  # Single container deployment

Raw/Native Deployment

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
./deploy.sh  # Handles venv, dependencies, and production server

🧪 Testing

# Install dev dependencies
uv sync --extra dev

# Run all tests
uv run pytest tests/ -v

# Run specific test file
uv run pytest tests/test_files.py -v

# Run only PDF tests
uv run pytest tests/ -m pdf -v

# Skip slow tests
uv run pytest tests/ -m "not slow" -v

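# Run with coverage report (assumes coverage options are set in the project's
# pytest configuration; otherwise add pytest-cov's --cov flag)
uv run pytest tests/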

Test Structure

  • tests/conftest.py - Shared fixtures and test client setup
  • tests/test_health.py - Health/smoke tests
  • tests/test_text_analysis.py - Text analysis endpoint tests
  • tests/test_academic_analysis.py - Academic analysis endpoint tests
  • tests/test_files.py - PDF file upload tests

Test Data

Place test files (PDF, DOCX, etc.) in the test-data/ directory. The test suite automatically discovers and uses these files for parameterized tests.
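
File discovery of this kind might look like the sketch below. It is an assumption about the approach, not the contents of tests/conftest.py; the helper name and the suffix set are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List

# ASSUMPTION: mirrors the upload formats listed for POST /files
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}


def discover_test_files(
    directory: str = "test-data",
    suffixes: Iterable[str] = SUPPORTED_SUFFIXES,
) -> List[Path]:
    """Return supported files in the directory, sorted for a stable test order."""
    root = Path(directory)
    if not root.is_dir():
        return []  # a missing directory yields no test cases instead of an error
    allowed = {suffix.lower() for suffix in suffixes}
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in allowed
    )
```

The resulting list can be passed to `pytest.mark.parametrize` so that each discovered file becomes its own test case.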

📚 Documentation

  • DEPLOYMENT.md - Deployment guide for Docker and raw installations
  • DOCUMENTLENS_SETUP.md - Setup and usage instructions
  • .env.example - Configuration template
  • docs/ - Additional architecture and integration documentation

DocumentLens: Pure text intelligence at the heart of content analysis
