Text extraction from PDFs, Word files, spreadsheets, and images. Local OCR with Tesseract and optional Azure Document Intelligence for text, tables, and key–value pairs. Includes page/sheet selection and a hybrid PDF mode.

Wizard Extract

WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With Azure prebuilt-layout it can also return tables and key-value pairs.


Installation

Requires Python 3.9+.

pip install wizardextract

Optional extras:

  • Azure OCR: pip install "wizardextract[azure]"

Local OCR requires Tesseract to be installed on your system.


Quick start

import wizardextract as we

text = we.extract_text("example.pdf")
print(text)

API overview

| Method | Purpose |
| --- | --- |
| extract_text | Local text extraction with optional Tesseract OCR |
| extract_text_azure | Cloud extraction via Azure (text, tables, key-value) |

Text extraction

Parameters

  • input_data: File path (str or Path) or raw file content (bytes).
  • extension: The file extension, required only if input_data is bytes.
  • pages: Page/sheet selection.
    • Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
    • Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
  • ocr: Enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
  • language_ocr: Language code for OCR. Defaults to 'eng'.
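To make the paged `pages` formats concrete, here is a minimal sketch of how such a selection could be expanded into explicit page numbers. `expand_pages` is a hypothetical helper written for illustration, not part of WizardExtract, and the library's own parsing may differ. It assumes 1-based page numbers and also accepts comma-separated strings.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand a page selection into a sorted list of unique page numbers.

    Hypothetical helper: illustrates the selection formats only.
    """
    # A bare int or str is treated as a one-element selection.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # A string may hold comma-separated parts: single pages or ranges.
        for part in item.split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
                if start > end:
                    raise ValueError(f"invalid range: {part!r}")
                pages.update(range(start, end + 1))
            else:
                pages.add(int(part))
    # Assumption: pages are numbered from 1.
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)
```

Under these assumptions, `expand_pages([1, 3, "5-8"])` yields pages 1, 3, 5, 6, 7, and 8.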

Examples

Basic:

import wizardextract as we

txt = we.extract_text("docs/report.pdf")

From bytes:

from pathlib import Path
import wizardextract as we

raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")

Paged selection and OCR:

import wizardextract as we

sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")

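Supported formats

| Format | OCR Option |
| --- | --- |
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG | Default |
| JPEG | Default |
| PNG | Default |
| GIF | Default |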
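When processing mixed inputs, it can help to know a file's OCR behaviour before calling the library. The sketch below transcribes the format list above into a lookup. `OCR_MODE` and `ocr_mode` are hypothetical names introduced here, not WizardExtract API, and lowercasing the extension is an assumption.

```python
from pathlib import Path
from typing import Union

# OCR behaviour per extension, transcribed from the supported-formats list.
OCR_MODE = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no",
    "csv": "no", "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(filename: Union[str, Path]) -> str:
    """Return "optional", "no", or "default" for a file name.

    Hypothetical helper: the extension is lowercased and the dot removed.
    """
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in OCR_MODE:
        raise ValueError(f"unsupported format: {ext!r}")
    return OCR_MODE[ext]
```

A caller could, for example, pass `ocr=True` to `extract_text` only when `ocr_mode` returns "optional", since image formats already use OCR by default.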

Azure OCR

Parameters

  • input_data: File path (str or Path) or raw file content (bytes).
  • extension: File extension when bytes are passed.
  • language_ocr: OCR language code (ISO-639).
  • pages: Page selection (int, "1,3,5-7", or list).
  • azure_endpoint: Azure Document Intelligence endpoint URL.
  • azure_key: Azure API key.
  • azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
  • hybrid: If True, PDFs are handled in hybrid mode: native text is read via PyMuPDF and images go through OCR.

Example

import wizardextract as we

res = we.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)

print(res.text)
print(res.pretty_tables[:1])
print(res.key_value)
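Rather than hard-coding the endpoint and key as in the example above, the Azure arguments can be read from the environment. This is a sketch only: `azure_settings` and the variable names `AZURE_DI_ENDPOINT`, `AZURE_DI_KEY`, and `AZURE_DI_MODEL_ID` are hypothetical, and the "prebuilt-read" fallback is this helper's choice, not a documented library default.

```python
import os
from typing import Dict, Mapping


def azure_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the Azure keyword arguments from environment variables.

    Hypothetical helper; the variable names are assumptions.
    """
    try:
        return {
            "azure_endpoint": env["AZURE_DI_ENDPOINT"],
            "azure_key": env["AZURE_DI_KEY"],
            # Falls back to the text-only model if no model is configured.
            "azure_model_id": env.get("AZURE_DI_MODEL_ID", "prebuilt-read"),
        }
    except KeyError as exc:
        raise RuntimeError(
            f"missing environment variable: {exc.args[0]}"
        ) from None


# Usage (requires wizardextract and valid Azure credentials):
# import wizardextract as we
# res = we.extract_text_azure("invoice.pdf", **azure_settings())
```

Keeping the key out of source files also makes it easier to rotate credentials without touching code.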

License

AGPL-3.0-or-later.

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
