
Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more

document-analyser

Extracts text from documents and returns readability metrics, word counts, and structural information. Accepts PDF, DOCX, PPTX, and plain text formats.

Part of the analyser family.

Install

pip install document-analyser

Requires Python 3.11+.

Usage

Python

from app.analyser import DocumentAnalyser

result = DocumentAnalyser().analyse("report.pdf")

print(f"Words:       {result['word_count']}")
print(f"Sentences:   {result['sentence_count']}")
print(f"Readability: {result['readability']['flesch_reading_ease']:.1f} (Flesch)")
print(result["text"][:500])

CLI

# Human-readable summary
document-analyser report.pdf

# Machine-readable JSON
document-analyser thesis.docx --json

# Start the HTTP server
document-analyser serve --port 8000
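To call the CLI from another Python program, one option is to shell out and parse the JSON. This is a minimal sketch, not part of the package: `analyse_via_cli` is a hypothetical helper, and it assumes that `--json` writes a single JSON object to stdout and that the command exits non-zero on failure.

```python
import json
import subprocess
from typing import Any, Callable


def analyse_via_cli(path: str, run: Callable[..., Any] = subprocess.run) -> dict:
    """Run `document-analyser <path> --json` and parse its stdout as JSON.

    `run` is injectable so the helper can be exercised without the CLI installed.
    """
    # Assumption: --json prints exactly one JSON object to stdout.
    completed = run(
        ["document-analyser", path, "--json"],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(completed.stdout)
```

With the package installed, `analyse_via_cli("thesis.docx")["word_count"]` would then read the word count from the parsed result.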

HTTP API

curl -X POST http://localhost:8000/analyse \
  -F "file=@report.pdf"
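The same upload can be sketched in Python with only the standard library. The endpoint and the `file` field name come from the curl command above; the helper names `build_multipart` and `analyse_over_http` are hypothetical, and the sketch assumes the server answers with a JSON body.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def analyse_over_http(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a file to /analyse and decode the response as JSON."""
    file = Path(path)
    body, content_type = build_multipart("file", file.name, file.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/analyse",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Assumption: the response body is JSON.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

`analyse_over_http("report.pdf")` needs the server from `document-analyser serve --port 8000` to be running; `build_multipart` can be checked on its own.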

Supported formats

Format      Extensions
PDF         .pdf
Word        .docx
PowerPoint  .pptx
Plain text  .txt, .md

Output

{
  "format": "pdf",
  "file_path": "/path/to/report.pdf",
  "file_size": 204800,
  "page_count": 12,
  "word_count": 4823,
  "sentence_count": 312,
  "paragraph_count": 89,
  "text": "Executive summary...",
  "readability": {
    "flesch_reading_ease": 52.3,
    "flesch_kincaid_grade": 11.2,
    "gunning_fog": 13.8,
    "smog_index": 12.1,
    "automated_readability_index": 11.9
  }
}
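The counts in the result can be combined into a few derived figures. The sketch below works on the sample values shown above; `derived_metrics` and `flesch_band` are hypothetical helpers, the band labels follow the conventional Flesch reading ease scale, and the code assumes `sentence_count` and `page_count` are present and non-zero.

```python
import json

# Sample result copied from the Output section (some fields omitted).
sample = json.loads("""
{
  "format": "pdf",
  "page_count": 12,
  "word_count": 4823,
  "sentence_count": 312,
  "paragraph_count": 89,
  "readability": {"flesch_reading_ease": 52.3, "flesch_kincaid_grade": 11.2}
}
""")


def flesch_band(score: float) -> str:
    """Map a Flesch reading ease score to its conventional difficulty band."""
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for floor, label in bands:
        if score >= floor:
            return label
    return "very confusing"


def derived_metrics(result: dict) -> dict:
    """Compute simple ratios from an analysis result."""
    # Assumes sentence_count and page_count are present and non-zero.
    words = result["word_count"]
    return {
        "words_per_sentence": round(words / result["sentence_count"], 2),
        "words_per_page": round(words / result["page_count"], 1),
        "flesch_band": flesch_band(result["readability"]["flesch_reading_ease"]),
    }


print(derived_metrics(sample))
```

For the sample, this gives about 15.46 words per sentence and 401.9 words per page, and places the Flesch score of 52.3 in the "fairly difficult" band.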

The analyser family

Low-level analysis tools. Each accepts files directly and returns structured JSON. Build your own UI or pipeline on top.

Package            Handles
speech-analyser    audio and video files: transcript and speech metrics
video-analyser     video files: frames, scenes, and visual quality
document-analyser  PDF, DOCX, PPTX, TXT: text and readability
code-analyser      source code: style, complexity, and quality metrics
records-analyser   CSV, Excel, SQLite, Parquet, JSON: data profiling
auto-analyser      any file: detects format and routes to the right tool

Licence

MIT
