Text extraction from PDFs, Word files, spreadsheets, and images. Local OCR with Tesseract and optional Azure Document Intelligence for text, tables, and key–value pairs. Includes page/sheet selection and a hybrid PDF mode.


Wizard Extract

WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With Azure prebuilt-layout it can also return tables and key-value pairs.

Installation
Quick start
API overview
Text extraction
Azure OCR
License
Resources

Installation

Requires Python 3.9+.

pip install wizardextract

Optional extras:

Azure OCR: pip install "wizardextract[azure]"

For OCR capabilities, ensure you have Tesseract installed on your system.
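
Before enabling OCR it can help to confirm that a Tesseract executable is discoverable. The check below is a minimal sketch using only the standard library. It assumes the executable is named `tesseract` and is found through PATH, which is how Tesseract is commonly installed but is not something this README states. `find_tesseract` is a hypothetical helper, not part of WizardExtract.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; local OCR will likely fail.")
    else:
        print(f"Tesseract found at {path}")
```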

Quick start

import wizardextract as we

text = we.extract_text("example.pdf")
print(text)

API overview

Method	Purpose
`extract_text`	Local text extraction with optional Tesseract OCR
`extract_text_azure`	Cloud extraction via Azure (text, tables, key-value)

Text extraction

Parameters

input_data: File path (str or Path) or raw file content (bytes).
extension: The file extension, required only if input_data is bytes.
pages: Page/sheet selection.
• Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
• Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
ocr: Enables OCR using Tesseract. Optional for PDF and DOCX; image formats use OCR by default (see Supported Formats).
language_ocr: Language code for OCR. Defaults to 'eng'.
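
To make the page selection forms for paged formats concrete, the sketch below expands a value such as 1, "1-3", or [1, 3, "5-8"] into an explicit list of page numbers. It illustrates the notation only and is not the library's own parser. `expand_pages` is a hypothetical name, pages are assumed to be numbered from 1, and Excel sheet names are not handled.

```python
from typing import List, Union

PageSpec = Union[int, str, List[Union[int, str]]]


def expand_pages(pages: PageSpec) -> List[int]:
    """Expand 1, "1-3", or [1, 3, "5-8"] into a sorted list of unique page numbers."""
    items = pages if isinstance(pages, list) else [pages]
    result = set()
    for item in items:
        if isinstance(item, int):
            result.add(item)
            continue
        # A string may hold one page, one range, or comma-separated parts.
        for part in str(item).split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
                if start > end:
                    raise ValueError(f"invalid range: {part!r}")
                result.update(range(start, end + 1))
            else:
                result.add(int(part))
    # Assumption: page numbers start at 1.
    if any(p < 1 for p in result):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(result)
```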

Examples

Basic:

import wizardextract as we

txt = we.extract_text("docs/report.pdf")

From bytes:

from pathlib import Path
import wizardextract as we

raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")
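
When the bytes come from a file whose name is known, the extension can be derived from the name instead of being typed by hand. The helper below is a small sketch with a hypothetical name (`read_with_extension`). It returns the extension in lowercase without the leading dot, matching the "png" form used in the example above.

```python
from pathlib import Path
from typing import Tuple, Union


def read_with_extension(path: Union[str, Path]) -> Tuple[bytes, str]:
    """Read a file and return (raw bytes, lowercase extension without the dot)."""
    p = Path(path)
    ext = p.suffix.lstrip(".").lower()
    if not ext:
        raise ValueError(f"cannot determine extension for {p.name!r}")
    return p.read_bytes(), ext


# Possible usage with the documented call:
#   raw, ext = read_with_extension("img.png")
#   txt_img = we.extract_text(raw, extension=ext)
```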

Paged selection and OCR:

import wizardextract as we

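# Extract pages 1, 3, and 5 through 7 of a PDF.
sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
# Run OCR on a scanned TIFF with the OCR language set to Italian.
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")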
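
The same calls can be applied to a whole folder. The sketch below is one way to do that, not a feature of the library. `extract_folder` is a hypothetical helper that takes the extraction function as an argument, so it does not import WizardExtract itself, and the default extension list is an arbitrary choice.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Union


def extract_folder(
    folder: Union[str, Path],
    extract: Callable[[Path], str],
    extensions: Iterable[str] = ("pdf", "docx", "txt"),
) -> Dict[str, str]:
    """Run `extract` on every matching file in `folder` and return {file name: text}."""
    wanted = {e.lower() for e in extensions}
    results: Dict[str, str] = {}
    # Sort for a stable processing order; subdirectories are skipped.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lstrip(".").lower() in wanted:
            results[path.name] = extract(path)
    return results
```

Because `input_data` accepts a Path, a call such as `extract_folder("docs", we.extract_text)` should work with the function documented above.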

Supported Formats

Format	OCR Option
PDF	Optional
DOC	No
DOCX	Optional
XLSX	No
XLS	No
TXT	No
CSV	No
JSON	No
HTML	No
HTM	No
TIF	Default
TIFF	Default
JPG	Default
JPEG	Default
PNG	Default
GIF	Default
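
For callers that want to decide up front whether to pass `ocr=True`, the table can be mirrored as a lookup. The sketch below transcribes the table into a dictionary. `OCR_MODE` and `ocr_mode` are hypothetical names, and the library may expose this information differently or not at all.

```python
from pathlib import Path

# OCR behaviour per extension, transcribed from the Supported Formats table.
OCR_MODE = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no", "csv": "no",
    "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(filename: str) -> str:
    """Return "optional", "no", or "default" for a file name, based on its extension."""
    ext = Path(filename).suffix.lstrip(".").lower()
    try:
        return OCR_MODE[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext or filename!r}") from None
```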

Azure OCR

Parameters

input_data: File path (str or Path) or raw file content (bytes).
extension: File extension when bytes are passed.
language_ocr: OCR language code (ISO-639).
pages: Page selection (int, "1,3,5-7", or list).
azure_endpoint: Azure Document Intelligence endpoint URL.
azure_key: Azure API key.
azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
hybrid: If True, PDFs are processed in hybrid mode: native text is read via PyMuPDF and images are sent to OCR.
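
Since `azure_endpoint` and `azure_key` are credentials, it may be preferable to read them from the environment rather than hard-coding them. The sketch below shows one way to do so. The variable names `AZURE_DOCINTEL_ENDPOINT` and `AZURE_DOCINTEL_KEY` and the helper `load_azure_settings` are assumptions made for illustration; the library does not define them.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical variable names; use whatever your deployment defines.
ENDPOINT_VAR = "AZURE_DOCINTEL_ENDPOINT"
KEY_VAR = "AZURE_DOCINTEL_KEY"


def load_azure_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the Azure endpoint and key as keyword arguments, or raise if either is unset."""
    env = os.environ if env is None else env
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"azure_endpoint": env[ENDPOINT_VAR], "azure_key": env[KEY_VAR]}
```

The returned keys match the parameter names listed above, so the result can be unpacked into the call, for example `we.extract_text_azure("invoice.pdf", azure_model_id="prebuilt-layout", **load_azure_settings())`.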

Example

import wizardextract as we

res = we.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)

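# Extracted text.
print(res.text)
# First extracted table, if any (returned by "prebuilt-layout").
print(res.pretty_tables[:1])
# Key-value pairs (returned by "prebuilt-layout").
print(res.key_value)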

License

AGPL-3.0-or-later.

Resources

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com

Release history

1.0.1: Aug 29, 2025
1.0.0: Aug 27, 2025 (this version)

