Text extraction from PDFs, Word files, spreadsheets, and images. Local OCR with Tesseract and optional Azure Document Intelligence for text, tables, and key–value pairs. Includes page/sheet selection and a hybrid PDF mode.

Wizard Extract

WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With Azure prebuilt-layout it can also return tables and key-value pairs.


Installation

Requires Python 3.9+.

pip install wizardextract

Optional extras:

  • Azure OCR: pip install "wizardextract[azure]"

For OCR capabilities, ensure you have Tesseract installed on your system.
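Before enabling OCR, it can help to confirm that the Tesseract executable is reachable. The sketch below uses only the standard library. `find_tesseract` is a hypothetical helper, not part of WizardExtract, and it assumes the executable is named `tesseract` and is discoverable on PATH.

```python
import shutil
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; local OCR will likely fail.")
    else:
        print(f"Tesseract found at {path}")
```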


Quick start

import wizardextract as we

text = we.extract_text("example.pdf")
print(text)

API overview

Method               Purpose
extract_text         Local text extraction with optional Tesseract OCR
extract_text_azure   Cloud extraction via Azure (text, tables, key-value)

Text extraction

Parameters

  • input_data: A file path (str or Path) or the raw file content (bytes).
  • extension: The file extension, required only if input_data is bytes.
  • pages: Page/sheet selection.
    • Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
    • Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
  • ocr: Enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
  • language_ocr: Language code for OCR. Defaults to 'eng'.

Examples

Basic:

import wizardextract as we

txt = we.extract_text("docs/report.pdf")

From bytes:

from pathlib import Path
import wizardextract as we

raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")
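When the bytes come from an upload or a download, the extension usually has to be derived from the original file name. The sketch below shows one way to do that. `extension_from_name` is a hypothetical helper, and it assumes the library expects the extension without a leading dot, as in the `extension="png"` call above.

```python
from pathlib import Path


def extension_from_name(filename: str) -> str:
    """Return the lowercase extension without the leading dot, e.g. 'png'."""
    # Path.suffix keeps only the last suffix, so "archive.tar.gz" yields "gz".
    return Path(filename).suffix.lstrip(".").lower()
```

The result can then be passed as the `extension` argument together with the bytes. An empty string means the name had no extension and the format must be supplied another way.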

Paged selection and OCR:

import wizardextract as we

sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")
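To make the selection syntax concrete, the sketch below expands a page specification into an explicit list of page numbers. It is an illustration only, not the library's parser: `expand_pages` is a hypothetical name, and the code assumes pages are 1-based, ranges are inclusive, and strings may also use the comma-separated form ("1,3,5-7") listed for the Azure method.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand 1, "1-3", or [1, 3, "5-8"] into a sorted list of unique page numbers."""
    # Normalise a single int or string into a one-element list.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # Strings may hold comma-separated parts such as "1,3,5-7".
        for part in item.split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
                if start > end:
                    raise ValueError(f"invalid range: {part!r}")
                pages.update(range(start, end + 1))
            else:
                pages.add(int(part))
    # Assumption: page numbers start at 1.
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)
```

Under these assumptions, `[1, 3, "5-7"]` selects pages 1, 3, 5, 6, and 7.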

Supported Formats

Format   OCR Option
PDF      Optional
DOC      No
DOCX     Optional
XLSX     No
XLS      No
TXT      No
CSV      No
JSON     No
HTML     No
HTM      No
TIF      Default
TIFF     Default
JPG      Default
JPEG     Default
PNG      Default
GIF      Default
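The table can be turned into a small lookup, for example to decide whether passing `ocr=True` makes sense for a given file. This is a sketch built from the table above, not part of WizardExtract: `OCR_MODE` and `ocr_mode` are hypothetical names, and reading "Default" as "OCR is always applied" is an assumption.

```python
from pathlib import Path

# OCR behaviour per extension, copied from the Supported Formats table.
OCR_MODE = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no", "csv": "no",
    "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(filename: str) -> str:
    """Return 'optional', 'no', or 'default' for a listed format, else raise ValueError."""
    ext = Path(filename).suffix.lstrip(".").lower()
    try:
        return OCR_MODE[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext or filename!r}") from None
```

A caller could, for instance, pass `ocr=True` only when `ocr_mode(path)` returns `"optional"`.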

Azure OCR

Parameters

  • input_data: A file path (str or Path) or the raw file content (bytes).
  • extension: File extension when bytes are passed.
  • language_ocr: OCR language code (ISO-639).
  • pages: Page selection (int, "1,3,5-7", or list).
  • azure_endpoint: Azure Document Intelligence endpoint URL.
  • azure_key: Azure API key.
  • azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
  • hybrid: If True, PDFs are processed in hybrid mode: native text is read with PyMuPDF and images are sent to OCR.

Example

import wizardextract as we

res = we.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)

print(res.text)
print(res.pretty_tables[:1])
print(res.key_value)
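Hardcoding the endpoint and key, as in the placeholder example above, is rarely desirable outside of a quick test. The sketch below collects the Azure arguments from environment variables instead. `azure_kwargs_from_env` and the variable names `AZURE_DI_ENDPOINT` and `AZURE_DI_KEY` are hypothetical choices for this illustration; only the keyword names come from the parameter list above.

```python
import os
from typing import Dict


def azure_kwargs_from_env(model_id: str = "prebuilt-read") -> Dict[str, str]:
    """Build the Azure-related keyword arguments from environment variables."""
    # AZURE_DI_ENDPOINT and AZURE_DI_KEY are hypothetical names chosen for this sketch.
    try:
        endpoint = os.environ["AZURE_DI_ENDPOINT"]
        key = os.environ["AZURE_DI_KEY"]
    except KeyError as exc:
        raise RuntimeError(f"missing environment variable: {exc.args[0]}") from None
    # Only the two model ids documented above are accepted here.
    if model_id not in ("prebuilt-read", "prebuilt-layout"):
        raise ValueError(f"unexpected model id: {model_id!r}")
    return {
        "azure_endpoint": endpoint,
        "azure_key": key,
        "azure_model_id": model_id,
    }
```

The returned dict can be unpacked into the call, for example `we.extract_text_azure("invoice.pdf", **azure_kwargs_from_env("prebuilt-layout"))`, which keeps credentials out of source control.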

License

AGPL-3.0-or-later.

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
