Text extraction from Microsoft Word files. Parses Word documents natively and can optionally run local OCR with Tesseract for embedded images or scanned pages. Supports page selection and bytes input. Legacy .doc is read-only and OCR is not available.
Wizard Docx
WizardDocx is a Python library focused on text extraction from Microsoft Word documents.
It parses Word documents natively and can apply local OCR with Tesseract to embedded images or scanned pages inside .docx files.
Legacy .doc is supported in read-only mode without OCR.
Installation
Requires Python 3.9+.
pip install wizarddocx
For OCR capabilities, ensure you have Tesseract installed on your system.
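Before passing ocr=True, it can help to confirm that the Tesseract executable is reachable. The following is a minimal sketch using only the standard library; the helper name `tesseract_available` is hypothetical and is not part of WizardDocx, and it assumes the binary is installed under the name `tesseract` and discoverable on PATH.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on PATH.

    Hypothetical helper, not part of WizardDocx. It only checks that the
    binary can be located; it does not verify installed language packs.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found: OCR can be requested.")
    else:
        print("Tesseract not found on PATH: install it before using OCR.")
```

If the check fails, extraction without OCR should still be possible, since OCR is described as optional.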
Quick start
import wizarddocx as wd
text = wd.extract_text("example.docx")
print(text)
Text extraction
Parameters
• input_data: [str, bytes, Path]
• extension: The file extension, required only if input_data is bytes.
• pages: Page selection for .docx.
  • Examples: 1, "1-3", [1, 3, "5-8"]
• ocr: Enables OCR using Tesseract. Applies to DOCX and image-based files; not available for DOC.
• language_ocr: Language code for OCR. Defaults to 'eng'.
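To make the accepted shapes of the pages argument concrete, here is a sketch of how a selection such as 1, "1-3", or [1, 3, "5-8"] could be expanded into explicit page numbers. This is not the library's implementation: the function `expand_pages` is hypothetical, and it assumes pages are 1-based and that ranges are inclusive, as the examples suggest.

```python
from typing import Iterable, List, Union

PageItem = Union[int, str]
PageSpec = Union[PageItem, Iterable[PageItem]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand a page selection into a sorted list of unique page numbers.

    Hypothetical illustration only. Assumes 1-based pages and inclusive
    "start-end" ranges.
    """
    # Treat a single int or string as a one-element selection.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        start, sep, end = item.partition("-")
        if sep:
            first, last = int(start), int(end)
            if first > last:
                raise ValueError(f"invalid page range: {item!r}")
            pages.update(range(first, last + 1))
        else:
            pages.add(int(start))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)


if __name__ == "__main__":
    print(expand_pages([1, 3, "5-8"]))
```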
Examples
Basic:
import wizarddocx as wd
txt = wd.extract_text("docs/report.docx")
From bytes:
from pathlib import Path
import wizarddocx as wd
raw = Path("img.docx").read_bytes()
txt_img = wd.extract_text(raw, extension="docx")
Page selection and OCR:
import wizarddocx as wd
sel = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
ocr_txt = wd.extract_text("scan.docx", ocr=True, language_ocr="ita")
Supported Formats
| Format | OCR Option |
|---|---|
| DOC | Not available |
| DOCX | Optional |
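The table above can be encoded as a small lookup so that OCR is only requested for formats that support it. The sketch below is an assumption-laden illustration: `OCR_SUPPORT` and `extraction_kwargs` are hypothetical names, not part of WizardDocx, and the returned dictionary simply mirrors the extension, ocr, and language_ocr parameters described earlier.

```python
from pathlib import Path

# Mirrors the Supported Formats table: extension -> whether OCR is available.
OCR_SUPPORT = {"docx": True, "doc": False}


def extraction_kwargs(filename: str, want_ocr: bool = False,
                      language_ocr: str = "eng") -> dict:
    """Build keyword arguments for an extraction call from a file name.

    Hypothetical helper. The OCR options are silently dropped for formats
    where the table says OCR is not available.
    """
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in OCR_SUPPORT:
        raise ValueError(f"unsupported format: {ext or filename!r}")
    kwargs = {"extension": ext}
    if want_ocr and OCR_SUPPORT[ext]:
        kwargs["ocr"] = True
        kwargs["language_ocr"] = language_ocr
    return kwargs


if __name__ == "__main__":
    print(extraction_kwargs("scan.docx", want_ocr=True, language_ocr="ita"))
    print(extraction_kwargs("legacy.doc", want_ocr=True))
```

With raw bytes in hand, the result could then be unpacked into a call such as wd.extract_text(raw, **extraction_kwargs("scan.docx", want_ocr=True)), assuming the parameters behave as documented above.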
License
RESOURCES
Contact & Author
Author: Mattia Rubino
Email: textwizard.dev@gmail.com