Multi-modal document analysis microservice — extracts text, readability, and structure from PDFs, DOCX, and more
document-analyser
Extracts text from documents and returns readability metrics, word counts, and structural information. Accepts PDF, DOCX, PPTX, and plain text formats.
Part of the analyser family.
Install
pip install document-analyser
Requires Python 3.11+.
Usage
Python
from app.analyser import DocumentAnalyser
result = DocumentAnalyser().analyse("report.pdf")
print(f"Words: {result['word_count']}")
print(f"Sentences: {result['sentence_count']}")
print(f"Readability: {result['readability']['flesch_reading_ease']:.1f} (Flesch)")
print(result["text"][:500])
CLI
# Human-readable summary
document-analyser report.pdf
# Machine-readable JSON
document-analyser thesis.docx --json
# Start the HTTP server
document-analyser serve --port 8000
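The `--json` mode can also be driven from another Python process. The following is a minimal sketch, assuming the CLI writes a single JSON object to stdout and exits with a non-zero code on failure; neither behaviour is confirmed above. The helper name `analyse_via_cli` and its `run` parameter are hypothetical and exist only for illustration.

```python
import json
import subprocess
from typing import Callable


def analyse_via_cli(
    path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> dict:
    """Run `document-analyser <path> --json` and parse its stdout as JSON.

    Assumption: the CLI prints exactly one JSON object to stdout.
    `run` is injectable so the call can be exercised without the CLI installed.
    """
    completed = run(
        ["document-analyser", path, "--json"],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.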
HTTP API
curl -X POST http://localhost:8000/analyse \
-F "file=@report.pdf"
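The same request can be sent from Python with only the standard library. This is a sketch under stated assumptions: the `/analyse` path and the `file` form field come from the curl call above, while the JSON response shape, the `application/octet-stream` part type, and the helper names `build_multipart` and `analyse_over_http` are assumptions made for illustration.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def analyse_over_http(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a file to /analyse and decode the response.

    Assumption: the server replies with a JSON object.
    """
    file_path = Path(path)
    body, content_type = build_multipart("file", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/analyse",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server from the CLI section running, `analyse_over_http("report.pdf")` would send the same upload as the curl command.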
Supported formats
| Format | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx |
| PowerPoint | .pptx |
| Plain text | .txt .md |
Output
{
"format": "pdf",
"file_path": "/path/to/report.pdf",
"file_size": 204800,
"page_count": 12,
"word_count": 4823,
"sentence_count": 312,
"paragraph_count": 89,
"text": "Executive summary...",
"readability": {
"flesch_reading_ease": 52.3,
"flesch_kincaid_grade": 11.2,
"gunning_fog": 13.8,
"smog_index": 12.1,
"automated_readability_index": 11.9
}
}
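The raw scores are easier to act on once they are turned into a short summary. The sketch below reads a trimmed copy of the output above, derives the average sentence length, and maps `flesch_reading_ease` onto the conventional Flesch difficulty bands. The helper names `flesch_band` and `summarise` are hypothetical and are not part of the package.

```python
import json

# Trimmed copy of the sample output shown above.
SAMPLE = """
{
  "word_count": 4823,
  "sentence_count": 312,
  "readability": {"flesch_reading_ease": 52.3}
}
"""

# Conventional Flesch reading-ease bands: (lower bound, label), highest first.
FLESCH_BANDS = [
    (90.0, "very easy"),
    (80.0, "easy"),
    (70.0, "fairly easy"),
    (60.0, "standard"),
    (50.0, "fairly difficult"),
    (30.0, "difficult"),
]


def flesch_band(score: float) -> str:
    """Return the difficulty label for a Flesch reading-ease score."""
    for lower_bound, label in FLESCH_BANDS:
        if score >= lower_bound:
            return label
    return "very difficult"  # anything below 30


def summarise(result: dict) -> dict:
    """Derive average sentence length and a reading-level label from a result."""
    sentences = result["sentence_count"]
    words_per_sentence = round(result["word_count"] / sentences, 1) if sentences else 0.0
    return {
        "words_per_sentence": words_per_sentence,
        "reading_level": flesch_band(result["readability"]["flesch_reading_ease"]),
    }


print(summarise(json.loads(SAMPLE)))
# prints {'words_per_sentence': 15.5, 'reading_level': 'fairly difficult'}
```

Higher Flesch reading-ease scores mean easier text, which is why the bands are checked from the top down.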
The analyser family
Low-level analysis tools. Each accepts files directly and returns structured JSON. Build your own UI or pipeline on top.
| Package | Handles |
|---|---|
| speech-analyser | audio and video files — transcript and speech metrics |
| video-analyser | video files — frames, scenes, and visual quality |
| document-analyser | PDF, DOCX, PPTX, TXT — text and readability |
| code-analyser | source code — style, complexity, and quality metrics |
| records-analyser | CSV, Excel, SQLite, Parquet, JSON — data profiling |
| auto-analyser | any file — detects format and routes to the right tool |
Licence
MIT