
Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more


document-analyser

Extracts text from documents and returns word counts, readability metrics, and structural information such as page, sentence, and paragraph counts. Accepts PDF, DOCX, PPTX, and plain-text (.txt, .md) files.

Part of the analyser family.

Install

pip install document-analyser

Requires Python 3.11+.

Usage

Python

from app.analyser import DocumentAnalyser

# analyse() returns a dict-like result; its keys are listed under Output
result = DocumentAnalyser().analyse("report.pdf")

print(f"Words:       {result['word_count']}")
print(f"Sentences:   {result['sentence_count']}")
print(f"Readability: {result['readability']['flesch_reading_ease']:.1f} (Flesch)")
print(result["text"][:500])

CLI

# Human-readable summary
document-analyser report.pdf

# Machine-readable JSON
document-analyser thesis.docx --json

# Start the HTTP server
document-analyser serve --port 8000
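
The --json mode can also be driven from a script. The following is a minimal sketch, assuming the command prints a single JSON object (shaped like the one under Output) to stdout and exits non-zero on failure; neither is confirmed by this README. The helper name analyse_via_cli and its run parameter are hypothetical and exist only for illustration.

```python
import json
import subprocess
from typing import Callable


def analyse_via_cli(path: str, run: Callable = subprocess.run) -> dict:
    """Run `document-analyser <path> --json` and parse its stdout as JSON.

    Assumption: --json writes exactly one JSON object to stdout.
    `run` is injectable so the helper can be exercised without the CLI installed.
    """
    completed = run(
        ["document-analyser", path, "--json"],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.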

HTTP API

curl -X POST http://localhost:8000/analyse \
  -F "file=@report.pdf"
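
The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: the endpoint and the form field name "file" are taken from the curl command above, while the response being a JSON object like the one under Output is assumed. The helper names build_multipart and analyse_via_http are hypothetical.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def analyse_via_http(path: str, url: str = "http://localhost:8000/analyse") -> dict:
    """POST a file to the running server and decode the response.

    Assumption: the server replies with a JSON body.
    """
    file = Path(path)
    body, content_type = build_multipart("file", file.name, file.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server started via document-analyser serve --port 8000, analyse_via_http("report.pdf") would issue the equivalent of the curl call.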

Supported formats

Format      Extensions
PDF         .pdf
Word        .docx
PowerPoint  .pptx
Plain text  .txt, .md
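
When processing a whole directory, it can help to select only files with a supported extension first. The sketch below copies the extensions from the table above; the helper name supported_files is hypothetical, and matching extensions case-insensitively is an assumption, since the README does not say how the analyser treats names such as REPORT.PDF.

```python
from pathlib import Path

# Extensions copied from the "Supported formats" table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt", ".md"}


def supported_files(root: str) -> list[Path]:
    """Return files under `root` (recursively) with a supported extension, sorted by path."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        # Assumption: extension matching is case-insensitive.
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Each returned path can then be passed to the analyser as a string, for example with str(path).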

Output

{
  "format": "pdf",
  "file_path": "/path/to/report.pdf",
  "file_size": 204800,
  "page_count": 12,
  "word_count": 4823,
  "sentence_count": 312,
  "paragraph_count": 89,
  "text": "Executive summary...",
  "readability": {
    "flesch_reading_ease": 52.3,
    "flesch_kincaid_grade": 11.2,
    "gunning_fog": 13.8,
    "smog_index": 12.1,
    "automated_readability_index": 11.9
  }
}
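
As an illustration of consuming this structure, the sketch below turns a result into a one-line summary. The key names come from the example above; the helper names flesch_label and summarise are hypothetical, and the score bands are the commonly cited interpretation of the Flesch reading-ease scale, not something this package is known to return.

```python
import json

# Commonly cited Flesch reading-ease bands: (lower bound, label).
FLESCH_BANDS = [
    (90, "very easy"),
    (80, "easy"),
    (70, "fairly easy"),
    (60, "standard"),
    (50, "fairly difficult"),
    (30, "difficult"),
]


def flesch_label(score: float) -> str:
    """Map a Flesch reading-ease score to a descriptive band."""
    for lower, label in FLESCH_BANDS:
        if score >= lower:
            return label
    return "very difficult"


def summarise(result: dict) -> str:
    """Build a short summary from the counts and the readability block."""
    words = result["word_count"]
    sentences = result["sentence_count"]
    ease = result["readability"]["flesch_reading_ease"]
    # Guard against empty documents to avoid dividing by zero.
    avg = words / sentences if sentences else 0.0
    return f"{words} words, {avg:.1f} words per sentence, {flesch_label(ease)} ({ease:.1f})"


# Values taken from the example output above.
sample = json.loads(
    '{"word_count": 4823, "sentence_count": 312,'
    ' "readability": {"flesch_reading_ease": 52.3}}'
)
print(summarise(sample))  # 4823 words, 15.5 words per sentence, fairly difficult (52.3)
```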

The analyser family

Low-level analysis tools. Each accepts files directly and returns structured JSON. Build your own UI or pipeline on top.

Package            Handles
speech-analyser    audio and video files: transcript and speech metrics
video-analyser     video files: frames, scenes, and visual quality
document-analyser  PDF, DOCX, PPTX, TXT: text and readability
code-analyser      source code: style, complexity, and quality metrics
records-analyser   CSV, Excel, SQLite, Parquet, JSON: data profiling
auto-analyser      any file: detects format and routes to the right tool

Licence

MIT
