
Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more


document-analyser

Extracts text from documents and returns word counts, readability metrics, and structural counts (pages, paragraphs, sentences). Accepts PDF, DOCX, PPTX, and plain-text (.txt, .md) files.

Part of the analyser family.

Install

pip install document-analyser

Requires Python 3.11+.

Usage

Python

from app.analyser import DocumentAnalyser

result = DocumentAnalyser().analyse("report.pdf")

print(f"Words:       {result['word_count']}")
print(f"Sentences:   {result['sentence_count']}")
print(f"Readability: {result['readability']['flesch_reading_ease']:.1f} (Flesch)")
print(result["text"][:500])
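To compare several documents at once, the same call can be wrapped in a small helper. This is a minimal sketch: `summarise` is a hypothetical name, and it assumes only the `word_count` and `readability.flesch_reading_ease` keys shown in the Output section. The analyser is passed in as a callable so the helper can be tried without the package installed.

```python
from pathlib import Path
from typing import Any, Callable, Iterable

Result = dict[str, Any]


def summarise(
    paths: Iterable[Path], analyse: Callable[[str], Result]
) -> list[tuple[str, int, float]]:
    """Return (file name, word count, Flesch reading ease) for each path."""
    rows = []
    for path in paths:
        result = analyse(str(path))
        rows.append(
            (
                Path(path).name,
                result["word_count"],
                result["readability"]["flesch_reading_ease"],
            )
        )
    # Hardest-to-read documents first (a lower Flesch score means harder text).
    return sorted(rows, key=lambda row: row[2])


# With the package installed, pass the real method:
#   from app.analyser import DocumentAnalyser
#   rows = summarise(Path("docs").glob("*.pdf"), DocumentAnalyser().analyse)
```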

CLI

# Human-readable summary
document-analyser report.pdf

# Machine-readable JSON
document-analyser thesis.docx --json

# Start the HTTP server
document-analyser serve --port 8000
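The `--json` mode can also be driven from a script. The sketch below assumes the command writes a single JSON document to stdout and exits with a non-zero code on failure. Neither behaviour is stated above, so treat both as assumptions. `analyse_via_cli` is a hypothetical helper name.

```python
import json
import subprocess
from typing import Any, Callable


def analyse_via_cli(
    path: str, run: Callable[..., Any] = subprocess.run
) -> dict[str, Any]:
    """Run `document-analyser <path> --json` and parse its stdout as JSON."""
    completed = run(
        ["document-analyser", path, "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the JSON document is the only thing written to stdout.
    return json.loads(completed.stdout)
```

The `run` parameter exists only so the helper can be exercised with a stand-in for `subprocess.run`. In normal use, call `analyse_via_cli("thesis.docx")`.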

HTTP API

curl -X POST http://localhost:8000/analyse \
  -F "file=@report.pdf"
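The same upload can be made from Python using only the standard library. This is a sketch under assumptions: the endpoint and the `file` field name come from the curl command above, while the JSON response body is assumed to match the Output section. `build_multipart` and `analyse_via_http` are hypothetical helper names.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def analyse_via_http(path: str, url: str = "http://localhost:8000/analyse") -> dict:
    """POST a file to the /analyse endpoint and decode the JSON response."""
    file = Path(path)
    body, content_type = build_multipart("file", file.name, file.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # Assumption: the server replies with a JSON body on success.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```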

Supported formats

Format      Extensions
PDF         .pdf
Word        .docx
PowerPoint  .pptx
Plain text  .txt, .md
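When walking a directory, it can help to skip unsupported files before calling the analyser. This is a minimal sketch using the extensions from the table above. The case-insensitive match is a choice made for this example, not documented behaviour of the package, and `is_supported` is a hypothetical helper.

```python
from pathlib import Path

# Extensions taken from the supported-formats table.
SUPPORTED = {".pdf", ".docx", ".pptx", ".txt", ".md"}


def is_supported(path: str) -> bool:
    """True if the file extension is one the table lists (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED
```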

Output

{
  "format": "pdf",
  "file_path": "/path/to/report.pdf",
  "file_size": 204800,
  "page_count": 12,
  "word_count": 4823,
  "sentence_count": 312,
  "paragraph_count": 89,
  "text": "Executive summary...",
  "readability": {
    "flesch_reading_ease": 52.3,
    "flesch_kincaid_grade": 11.2,
    "gunning_fog": 13.8,
    "smog_index": 12.1,
    "automated_readability_index": 11.9
  }
}
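A result like this is plain JSON, so it can be post-processed directly. The sketch below derives the average sentence length and maps the Flesch score to a conventional difficulty band. The bands are the standard Flesch interpretation, not something the package is documented to return, and `flesch_band` is a hypothetical helper.

```python
import json

# Conventional Flesch reading-ease bands as (lower bound, label), highest first.
BANDS = [
    (90, "very easy"),
    (80, "easy"),
    (70, "fairly easy"),
    (60, "standard"),
    (50, "fairly difficult"),
    (30, "difficult"),
]


def flesch_band(score: float) -> str:
    """Return the label of the first band whose lower bound the score reaches."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "very difficult"


# A trimmed copy of the sample output shown in this section.
sample = json.loads(
    '{"word_count": 4823, "sentence_count": 312,'
    ' "readability": {"flesch_reading_ease": 52.3}}'
)
words_per_sentence = sample["word_count"] / sample["sentence_count"]
band = flesch_band(sample["readability"]["flesch_reading_ease"])
print(f"{words_per_sentence:.1f} words per sentence, {band}")
```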

The analyser family

Low-level analysis tools. Each accepts files directly and returns structured JSON. Build your own UI or pipeline on top.

Package            Handles
speech-analyser    audio and video files: transcript and speech metrics
video-analyser     video files: frames, scenes, and visual quality
document-analyser  PDF, DOCX, PPTX, TXT: text and readability
code-analyser      source code: style, complexity, and quality metrics
records-analyser   CSV, Excel, SQLite, Parquet, JSON: data profiling
auto-analyser      any file: detects format and routes to the right tool

Licence

MIT
