Text extraction from PDFs, Word files, spreadsheets, and images. Local OCR with Tesseract and optional Azure Document Intelligence for text, tables, and key–value pairs. Includes page/sheet selection and a hybrid PDF mode.
Wizard Extract
WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With Azure prebuilt-layout it can also return tables and key-value pairs.
Installation
Requires Python 3.9+.
pip install wizardextract
Optional extras:
- Azure OCR:
pip install "wizardextract[azure]"
For OCR capabilities, ensure you have Tesseract installed on your system.
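Before enabling OCR it can help to confirm that a Tesseract executable is reachable. The sketch below uses only the standard library; the helper name `tesseract_available` is hypothetical, and it assumes the binary is located through `PATH`, which this README does not state explicitly.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH."""
    # Assumption: the OCR backend is looked up via PATH under this name.
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found; install it before using ocr=True")
```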
Quick start
import wizardextract as we
text = we.extract_text("example.pdf")
print(text)
API overview
| Method | Purpose |
|---|---|
| `extract_text` | Local text extraction with optional Tesseract OCR |
| `extract_text_azure` | Cloud extraction via Azure (text, tables, key-value) |
Text extraction
Parameters
- `input_data`: `str`, `bytes`, or `Path`.
- `extension`: The file extension, required only if `input_data` is `bytes`.
- `pages`: Page/sheet selection.
  - Paged (PDF, DOCX, TIFF): `1`, `"1-3"`, `[1, 3, "5-8"]`
  - Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
- `ocr`: Enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
- `language_ocr`: Language code for OCR. Defaults to `'eng'`.
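To make the accepted `pages` forms for paged documents concrete, here is a minimal sketch of how such a selection could be expanded into explicit page numbers. It is not the library's implementation: `expand_pages` is a hypothetical helper, and treating page numbers as 1-based with inclusive ranges is an assumption drawn from the examples. Sheet names for Excel files are outside its scope.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand 1, "1-3", "1,3,5-7" or [1, 3, "5-8"] into sorted unique page numbers."""
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # A string may hold comma-separated parts: single pages or inclusive ranges.
        for part in str(item).split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
                if start > end:
                    raise ValueError(f"invalid range: {part!r}")
                pages.update(range(start, end + 1))
            else:
                pages.add(int(part))
    # Assumption: pages are numbered from 1.
    if any(p < 1 for p in pages):
        raise ValueError("page numbers must be >= 1")
    return sorted(pages)
```

For example, `expand_pages([1, 3, "5-8"])` yields pages 1, 3, 5, 6, 7, and 8.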
Examples
Basic:
import wizardextract as we
txt = we.extract_text("docs/report.pdf")
From bytes:
from pathlib import Path
import wizardextract as we
raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")
Paged selection and OCR:
import wizardextract as we
sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")
Supported Formats
| Format | OCR Option |
|---|---|
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG | Default |
| JPEG | Default |
| PNG | Default |
| GIF | Default |
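When processing mixed folders, the table above can be mirrored as a lookup so that callers know in advance whether OCR applies to a file. This is only a sketch: `OCR_MODE` and `ocr_mode` are hypothetical names that are not part of the package, and the PDF entry is taken from the `ocr` parameter description ("Applies to PDF/DOCX").

```python
from pathlib import Path

# OCR option per extension, copied from the Supported Formats table.
OCR_MODE = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no",
    "csv": "no", "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(path: str) -> str:
    """Return 'optional', 'no', or 'default' for a file, based on its extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in OCR_MODE:
        raise ValueError(f"unsupported extension: {ext!r}")
    return OCR_MODE[ext]
```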
Azure OCR
Parameters
- `input_data`: `str`, `bytes`, or `Path`.
- `extension`: File extension when `bytes` are passed.
- `language_ocr`: OCR language code (ISO-639).
- `pages`: Page selection (`int`, `"1,3,5-7"`, or list).
- `azure_endpoint`: Azure Document Intelligence endpoint URL.
- `azure_key`: Azure API key.
- `azure_model_id`: `"prebuilt-read"` (text only) or `"prebuilt-layout"` (text + tables + key-value).
- `hybrid`: If `True`, for PDFs: native text via PyMuPDF and images via OCR.
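Because `azure_endpoint` and `azure_key` are credentials, one option is to read them from the environment instead of hard-coding them. The sketch below shows that pattern; the helper `azure_kwargs` and the variable names `AZURE_DI_ENDPOINT` and `AZURE_DI_KEY` are assumptions chosen for illustration and are not defined by the library.

```python
import os
from typing import Dict, Mapping, Optional


def azure_kwargs(
    env: Optional[Mapping[str, str]] = None,
    model_id: str = "prebuilt-read",
) -> Dict[str, str]:
    """Build the Azure-related keyword arguments from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever your deployment defines.
    required = ("AZURE_DI_ENDPOINT", "AZURE_DI_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "azure_endpoint": env["AZURE_DI_ENDPOINT"],
        "azure_key": env["AZURE_DI_KEY"],
        "azure_model_id": model_id,
    }
```

The resulting dictionary could then be unpacked into the call, for instance `we.extract_text_azure("invoice.pdf", **azure_kwargs(model_id="prebuilt-layout"))`.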
Example
import wizardextract as we
res = we.extract_text_azure(
"invoice.pdf",
language_ocr="ita",
azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
azure_key="<KEY>",
azure_model_id="prebuilt-layout",
hybrid=True,
)
print(res.text)
print(res.pretty_tables[:1])
print(res.key_value)
License
RESOURCES
Contact & Author
Author: Mattia Rubino
Email: textwizard.dev@gmail.com