Text extraction from Microsoft Word files. Parses Word documents natively and can optionally run local OCR with Tesseract for embedded images or scanned pages. Supports page selection and bytes input. Legacy .doc is read-only and OCR is not available.

Wizard Docx

WizardDocx is a Python library focused on text extraction from Microsoft Word documents.
It parses Word documents natively and can apply local OCR with Tesseract to embedded images or scanned pages inside .docx files.
Legacy .doc is supported in read-only mode without OCR.


Installation

Requires Python 3.9+.

pip install wizarddocx

For OCR capabilities, ensure you have Tesseract installed on your system.
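Before passing ocr=True, it can help to confirm that Tesseract is reachable. The sketch below is not part of WizardDocx; it assumes the executable is named "tesseract" and is expected on PATH (how the library itself locates Tesseract is not stated here), and the helper name tesseract_available is hypothetical.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable named `binary` is found on PATH.

    Assumption: Tesseract is installed under its default executable name.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH.")
    else:
        print("Tesseract not found on PATH; install it before enabling OCR.")
```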


Quick start

import wizarddocx as wd

text = wd.extract_text("example.docx")
print(text)

Text extraction

Parameters

  • input_data: The document to read, given as a file path (str or Path) or as raw bytes.
  • extension: The file extension, required only if input_data is bytes.
  • pages: page selection for .docx.
    • Examples: 1, "1-3", [1, 3, "5-8"]
  • ocr: Enables OCR using Tesseract. Applies to DOCX and image-based content; not available for DOC.
  • language_ocr: Language code for OCR. Defaults to 'eng'.
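To make the accepted forms of pages concrete, here is one way such a specification could be expanded into page numbers. This is only an illustration, not the library's implementation: the helper name expand_pages is hypothetical, and it assumes pages are 1-based and that a range such as "5-8" includes both endpoints.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand 1, "1-3" or [1, 3, "5-8"] into a sorted list of page numbers.

    Assumptions (not confirmed by the library): pages are 1-based and
    ranges are inclusive on both ends.
    """
    # A single int or str is treated as a one-element selection.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        start, sep, end = item.partition("-")
        if sep:
            first, last = int(start), int(end)
            if first > last:
                raise ValueError(f"descending range: {item!r}")
            pages.update(range(first, last + 1))
        else:
            pages.add(int(start))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers must be >= 1")
    return sorted(pages)
```

Under these assumptions, [1, 3, "5-8"] selects pages 1, 3, 5, 6, 7 and 8.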

Examples

Basic:

import wizarddocx as wd

txt = wd.extract_text("docs/report.docx")

From bytes:

from pathlib import Path
import wizarddocx as wd

raw = Path("img.docx").read_bytes()
txt_img = wd.extract_text(raw, extension="docx")
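When bytes arrive without a file name, the extension has to come from somewhere. A rough sketch, assuming the only candidates are Word files: .docx files are ZIP containers and legacy .doc files are OLE2 compound files, so their leading bytes differ. The helper guess_extension is hypothetical and not part of WizardDocx, and a ZIP signature alone does not prove the content is a Word document.

```python
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file header


def guess_extension(raw: bytes) -> Optional[str]:
    """Guess "docx" or "doc" from the leading bytes, or return None.

    Heuristic only: other ZIP-based or OLE2-based formats share
    these signatures.
    """
    if raw.startswith(ZIP_MAGIC):
        return "docx"
    if raw.startswith(OLE2_MAGIC):
        return "doc"
    return None
```

The result can then be passed as the extension argument, with a fallback when None is returned.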

Page selection and OCR:

import wizarddocx as wd

sel = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
ocr_txt = wd.extract_text("scan.docx", ocr=True, language_ocr="ita")

Supported Formats

Format    OCR option
DOC       Not available
DOCX      Optional
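When processing a mixed folder of .doc and .docx files, the table above can be encoded so that OCR is only requested where it is available. This is a sketch built on the documented ocr and language_ocr parameters; the helper ocr_kwargs and the lookup table are hypothetical, and what the library does when ocr=True is passed for a .doc file is not stated here.

```python
from pathlib import Path
from typing import Any, Dict, Union

# Mirrors the Supported Formats table: extension -> is OCR available?
OCR_BY_EXTENSION = {"docx": True, "doc": False}


def ocr_kwargs(path: Union[str, Path], want_ocr: bool,
               language: str = "eng") -> Dict[str, Any]:
    """Build keyword arguments for extract_text based on the file extension.

    Returns OCR arguments only for formats where OCR is listed as optional;
    returns an empty dict otherwise.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in OCR_BY_EXTENSION:
        raise ValueError(f"unsupported format: {ext!r}")
    if want_ocr and OCR_BY_EXTENSION[ext]:
        return {"ocr": True, "language_ocr": language}
    return {}
```

A call would then look like wd.extract_text(path, **ocr_kwargs(path, want_ocr=True, language="ita")).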

License

AGPL-3.0-or-later.

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
