Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more

document-analyser

Extracts text from documents and returns readability metrics, word counts, and structural information (page, sentence, and paragraph counts). Accepts PDF, DOCX, PPTX, and plain text (.txt, .md) files.

Part of the analyser family.

Install

pip install document-analyser

Requires Python 3.11+.

Usage

Python

from app.analyser import DocumentAnalyser

result = DocumentAnalyser().analyse("report.pdf")

print(f"Words:       {result['word_count']}")
print(f"Sentences:   {result['sentence_count']}")
print(f"Readability: {result['readability']['flesch_reading_ease']:.1f} (Flesch)")
print(result["text"][:500])

CLI

# Human-readable summary
document-analyser report.pdf

# Machine-readable JSON
document-analyser thesis.docx --json

# Start the HTTP server
document-analyser serve --port 8000
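
The JSON mode can also be driven from a script. The sketch below wraps the `document-analyser <file> --json` invocation shown above. It assumes the tool prints a single JSON document to stdout and exits non-zero on failure, neither of which is stated here; `build_command` and `analyse_via_cli` are hypothetical helper names, not part of the package.

```python
import json
import subprocess


def build_command(path: str) -> list[str]:
    """Return the argv for a JSON-mode run, mirroring the CLI example."""
    return ["document-analyser", path, "--json"]


def analyse_via_cli(path: str) -> dict:
    """Run the CLI on one file and parse what it prints.

    Assumption: --json writes one JSON document to stdout, and a failed
    run returns a non-zero exit code (so check=True raises).
    """
    completed = subprocess.run(
        build_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.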

HTTP API

curl -X POST http://localhost:8000/analyse \
  -F "file=@report.pdf"
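
For callers that prefer Python over curl, the same upload can be sketched with the standard library alone. The `/analyse` path and the `file` form field come from the curl example; that the response body is JSON is an assumption, and `build_multipart` and `analyse_remote` are hypothetical helper names.

```python
import json
import mimetypes
import uuid
from pathlib import Path
from urllib import request


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def analyse_remote(path: str, url: str = "http://localhost:8000/analyse") -> dict:
    """POST a file to a running server.

    Assumption: the server replies with a JSON body.
    """
    file = Path(path)
    body, content_type = build_multipart("file", file.name, file.read_bytes())
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the server started via `document-analyser serve --port 8000`, `analyse_remote("report.pdf")` would send the same request as the curl command.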

Supported formats

Format        Extensions
PDF           .pdf
Word          .docx
PowerPoint    .pptx
Plain text    .txt, .md

Output

{
  "format": "pdf",
  "file_path": "/path/to/report.pdf",
  "file_size": 204800,
  "page_count": 12,
  "word_count": 4823,
  "sentence_count": 312,
  "paragraph_count": 89,
  "text": "Executive summary...",
  "readability": {
    "flesch_reading_ease": 52.3,
    "flesch_kincaid_grade": 11.2,
    "gunning_fog": 13.8,
    "smog_index": 12.1,
    "automated_readability_index": 11.9
  }
}
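
The raw scores are easier to act on once mapped to a label. The sketch below reads the fields shown in the output above and applies the conventional Flesch Reading Ease bands. The banding and the `flesch_band` and `summarise` helpers are illustrative additions, not something the package is known to provide.

```python
# Conventional Flesch Reading Ease bands, highest floor first.
FLESCH_BANDS = [
    (90.0, "very easy"),
    (80.0, "easy"),
    (70.0, "fairly easy"),
    (60.0, "standard"),
    (50.0, "fairly difficult"),
    (30.0, "difficult"),
]


def flesch_band(score: float) -> str:
    """Map a Flesch Reading Ease score to its conventional label."""
    for floor, label in FLESCH_BANDS:
        if score >= floor:
            return label
    return "very difficult"


def summarise(result: dict) -> dict:
    """Derive average sentence length and a readability label from a result."""
    words = result["word_count"]
    sentences = result["sentence_count"]
    return {
        "words_per_sentence": round(words / sentences, 1) if sentences else 0.0,
        "flesch_band": flesch_band(result["readability"]["flesch_reading_ease"]),
    }


# Values copied from the sample output.
sample = {
    "word_count": 4823,
    "sentence_count": 312,
    "readability": {"flesch_reading_ease": 52.3},
}
print(summarise(sample))
# {'words_per_sentence': 15.5, 'flesch_band': 'fairly difficult'}
```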

The analyser family

Low-level analysis tools. Each accepts files directly and returns structured JSON. Build your own UI or pipeline on top.

Package             Handles
speech-analyser     audio and video files: transcript and speech metrics
video-analyser      video files: frames, scenes, and visual quality
document-analyser   PDF, DOCX, PPTX, TXT: text and readability
code-analyser       source code: style, complexity, and quality metrics
records-analyser    CSV, Excel, SQLite, Parquet, JSON: data profiling
auto-analyser       any file: detects format and routes to the right tool

Licence

MIT

Download files

Source Distribution

document_analyser-0.2.0.tar.gz (233.7 kB)

Built Distribution

document_analyser-0.2.0-py3-none-any.whl (70.8 kB)
