Extract, clean, and analyze text from PDFs, Office docs, images, CSV/HTML. Local OCR (Tesseract), Azure DI, NER (spaCy/Stanza), language detection, spell-check, statistics, and HTML tools.
TextWizard
TextWizard is a Python library to extract, clean, and analyze text from PDFs, DOCX, images, CSV, HTML/XML, and more. It includes local OCR (Tesseract), cloud OCR with Azure Document Intelligence, multi-backend NER, language detection, lexical statistics, and HTML utilities.
Contents
- Installation
- Quick start
- API overview
- Text extraction
- Azure OCR
- HTML cleaning
- XML cleaning
- CSV cleaning
- Named-Entity Recognition (NER)
- Spell checking
- Language detection
- Text statistics
- Text similarity
- Beautiful HTML
- HTML to Markdown
- License
- Resources
What is TextWizard?
TextWizard is a Python toolkit for end-to-end text ingestion: it extracts, cleans, and analyzes content from PDFs, Office documents, images, HTML/XML, CSV, and plain text. It unifies local OCR (Tesseract) and Azure Document Intelligence, normalizes noisy markup, and exposes text, tables, and key-value pairs through one consistent API.
It targets production pipelines: deterministic I/O, page selection and hybrid PDF handling, multi-backend NER (spaCy, Stanza), language detection for 160+ languages, compact spell-checking tries, lexical statistics, and HTML utilities (sanitization, pretty-print, HTML→Markdown). The goal is to be a dependable, high-level building block for practical text extraction and cleanup in Python.
Installation
Requires Python 3.9+.
pip install textwizard
Optional extras:
- Azure OCR: pip install "textwizard[azure]"
- NER: pip install "textwizard[ner]"
- Everything: pip install "textwizard[all]"
For OCR capabilities, ensure you have Tesseract installed on your system.
For spaCy models, download one first, e.g.: python -m spacy download en_core_web_sm
Quick start
import textwizard as tw
text = tw.extract_text("example.pdf")
print(text)
API overview
| Method | Purpose |
|---|---|
| extract_text | Local text extraction with optional Tesseract OCR |
| extract_text_azure | Cloud extraction via Azure (text, tables, key-value) |
| clean_html | High-level HTML cleaning with semantic flags |
| clean_xml | XML cleanup and normalization |
| clean_csv | CSV cleanup with configurable dialect |
| extract_entities | NER via spaCy / Stanza / spaCy-Stanza |
| correctness_text | Spell checking |
| lang_detect | Language detection |
| analyze_text_statistics | Lexical metrics (entropy, Zipf, Gini, …) |
| text_similarity | Similarity: cosine, jaccard, levenshtein |
| beautiful_html | Pretty-print HTML |
| html_to_markdown | Convert HTML → Markdown |
Text extraction
Parameters
- input_data: [str, bytes, Path]
- extension: The file extension, required only if input_data is bytes.
- pages: Page/sheet selection.
  • Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
  • Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
- ocr: Enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
- language_ocr: Language code for OCR. Defaults to 'eng'.
Examples
Basic:
import textwizard as tw
txt = tw.extract_text("docs/report.pdf")
From bytes:
from pathlib import Path
import textwizard as tw
raw = Path("img.png").read_bytes()
txt_img = tw.extract_text(raw, extension="png")
Paged selection and OCR:
import textwizard as tw
sel = tw.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = tw.extract_text("scan.tiff", ocr=True, language_ocr="ita")
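The page selectors accepted by pages (a single int, a range string such as "1-3", or a mixed list) can be pictured as expanding into a flat list of page numbers. The sketch below is only an illustration of that idea: expand_pages is a hypothetical helper, not part of TextWizard, and the library's own parsing may differ (for instance in how it validates or orders pages).

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand 1, "1-3", "1,3,5-7" or [1, 3, "5-8"] into sorted page numbers.

    Hypothetical helper for illustration; not TextWizard's implementation.
    """
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # A string may hold several comma-separated parts, e.g. "1,3,5-7".
        for part in item.split(","):
            part = part.strip()
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
                pages.update(range(start, end + 1))  # ranges are inclusive
            elif part:
                pages.add(int(part))
    return sorted(pages)


print(expand_pages([1, 3, "5-7"]))  # [1, 3, 5, 6, 7]
```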
Supported Formats
| Format | OCR Option |
|---|---|
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG | Default |
| JPEG | Default |
| PNG | Default |
| GIF | Default |
Azure OCR
Parameters
- input_data: [str, bytes, Path]
- extension: File extension when bytes are passed.
- language_ocr: OCR language code (ISO-639).
- pages: Page selection (int, "1,3,5-7", or list).
- azure_endpoint: Azure Document Intelligence endpoint URL.
- azure_key: Azure API key.
- azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
- hybrid: If True, for PDFs: native text via PyMuPDF and images via OCR.
Example
import textwizard as tw
res = tw.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)
print(res.text)
print(res.pretty_tables[:1])
print(res.key_value)
HTML cleaning
Parameters
- text: str HTML input.
- remove_script: Remove executable tags (<script>, <template>).
- remove_metadata_tags: Remove metadata (<link>, <meta>, <base>, <noscript>, <style>, <title>).
- remove_flow_tags: Remove flow content (<address>, <div>, <input>, …).
- remove_sectioning_tags: Remove sectioning content (<article>, <aside>, <nav>, …).
- remove_heading_tags: Remove heading tags (<h1>–<h6>).
- remove_phrasing_tags: Remove phrasing content (<audio>, <code>, <textarea>, …).
- remove_embedded_tags: Remove embedded content (<iframe>, <embed>, <img>).
- remove_interactive_tags: Remove interactive content (<button>, <input>, <select>).
- remove_palpable: Remove palpable elements (<address>, <math>, <table>, …).
- remove_doctype: Remove <!DOCTYPE html>.
- remove_comments: Remove HTML comments.
- remove_specific_attributes: Remove specific attributes (supports wildcards).
- remove_specific_tags: Remove specific tags (supports wildcards).
- remove_empty_tags: Drop empty tags.
- remove_content_tags: Remove content of given tags.
- remove_tags_and_contents: Remove tags and their contents.
Behavior
There are three modes with different return types:
| Mode | How to trigger | Output | Description |
|---|---|---|---|
| A – text-only | No parameters provided (all None) | str (plain text) | Extracts text, skips script-supporting tags, inserts safe spaces. |
| B – structural clean | At least one flag is True | str (serialized HTML) | Removes/unwraps per flags. Supports wildcard tag/attribute removal, content stripping, empty-tag pruning. |
| C – text with preservation | Parameters present and all False | str (text + preserved markup) | Extracts text but preserves groups explicitly set to False (and comments/doctype if set False). |
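One way to read the mode table is as a three-way decision over the flags that were actually passed. The sketch below encodes that reading; pick_mode is a hypothetical helper written for this explanation, and it assumes that list-valued options such as remove_specific_attributes count as "set" when non-empty. It is not how clean_html is necessarily implemented.

```python
def pick_mode(**flags: object) -> str:
    """Return "A", "B" or "C" following the mode table (illustrative only)."""
    given = [value for value in flags.values() if value is not None]
    if not given:
        return "A"  # nothing passed: plain text out
    if any(given):
        return "B"  # at least one truthy flag: serialized HTML out
    return "C"      # only False flags: text with preserved markup


print(pick_mode())                                                # A
print(pick_mode(remove_script=True, remove_comments=False))       # B
print(pick_mode(remove_heading_tags=False))                       # C
```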
Examples
A) Text-only (no params)
import textwizard as tw
txt = tw.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt) # -> "Hello"
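For intuition, text-only extraction of this kind can be approximated with the standard library's html.parser. The following is a minimal sketch under that assumption; TextOnly and html_to_text are hypothetical names, and TextWizard's own parser, spacing rules, and tag handling are likely more thorough.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring anything inside script-like tags."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)  # single spaces between text nodes


print(html_to_text("<div><p>Hello</p><script>x()</script></div>"))  # Hello
```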
B) Structural clean (HTML out)
import textwizard as tw
html = """
<html><head><title>x</title><script>evil()</script></head>
<body>
<article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article><!-- comment -->
</body></html>
"""
out = tw.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)
Output
<html>
<body>
<article><h1>Title</h1><p>hello</p></article>
</body></html>
C) Text with preservation (False flags)
import textwizard as tw
html = "<html><body><article><h1>T</h1><p>Body</p><!-- c --></article></body></html>"
txt = tw.clean_html(
    html,
    remove_sectioning_tags=False,  # keep <article> in output
    remove_heading_tags=False,     # keep <h1> in output
    remove_comments=False,         # keep comments
)
print(txt)
Output
<article><h1>T</h1>Body<!-- c --></article>
Wildcard selectors
import textwizard as tw
html = '<div id="hero" data-track="x" onclick="h()"><img src="a.png"></div>'
out = tw.clean_html(
    html,
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_specific_tags=["im_"],
)
print(out)
Output
<html><head></head><body><div></div></body></html>
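The attribute patterns above behave like shell-style globs. As a rough mental model, and assuming glob semantics similar to Python's fnmatch (the library's matcher may differ), attribute filtering could look like this. strip_attributes is a hypothetical helper, not a TextWizard function.

```python
from fnmatch import fnmatchcase


def strip_attributes(attrs: dict, patterns: list) -> dict:
    """Drop every attribute whose name matches one of the glob patterns."""
    return {
        name: value
        for name, value in attrs.items()
        if not any(fnmatchcase(name, pattern) for pattern in patterns)
    }


attrs = {"id": "hero", "data-track": "x", "onclick": "h()", "class": "box"}
print(strip_attributes(attrs, ["id", "data-*", "on*"]))  # {'class': 'box'}
```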
XML cleaning
Parameters
- text: str | bytes XML input.
- remove_comments: Remove <!-- ... -->.
- remove_processing_instructions: Remove <? ... ?>.
- remove_cdata_sections: Unwrap <![CDATA[...]]>.
- remove_empty_tags: Drop empty elements.
- remove_namespaces: Drop prefixes and xmlns.
- remove_duplicate_siblings: Keep only the first identical sibling.
- collapse_whitespace: Collapse runs of whitespace.
- remove_specific_tags: Delete tags (supports wildcards).
- remove_content_tags: Keep tag but delete inner content.
- remove_attributes: Delete attributes (supports wildcards).
- remove_declaration: Drop <?xml ...?>.
- normalize_entities: Convert entities like &amp; → &.
Examples
import textwizard as tw
xml = "<root xmlns='ns'><a/><b>ok</b><!-- x --></root>"
fixed = tw.clean_xml(
    xml,
    remove_namespaces=True,
    remove_empty_tags=True,
    remove_comments=True,
    normalize_entities=True,
)
print(fixed)
Output
<root><b>ok</b></root>
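To make the three operations in this example concrete, here is a standard-library sketch that performs comparable steps with xml.etree.ElementTree. The helper names (strip_namespaces, prune_empty, clean) are hypothetical, and this is not TextWizard's implementation; in particular, the sketch relies on ElementTree's default parser discarding comments and it drops tail text that follows a removed element.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root: ET.Element) -> None:
    """Rewrite '{uri}tag' names to plain 'tag' on every element."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]


def prune_empty(elem: ET.Element) -> None:
    """Remove descendants with no children, text, or attributes (depth-first)."""
    for child in list(elem):
        prune_empty(child)
        is_empty = (
            len(child) == 0
            and not (child.text or "").strip()
            and not child.attrib
        )
        if is_empty:
            elem.remove(child)  # tail text after the child is dropped as well


def clean(xml: str) -> str:
    root = ET.fromstring(xml)  # default parser already discards comments
    strip_namespaces(root)
    prune_empty(root)
    return ET.tostring(root, encoding="unicode")


print(clean("<root xmlns='ns'><a/><b>ok</b><!-- x --></root>"))
# <root><b>ok</b></root>
```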
CSV cleaning
Behavior
- Columns can be removed by name (with header) or 0-based index.
- remove_row_index uses 0-based indices over the parsed rows. If a header exists, it is row 0.
- remove_values blanks matching cells. Supports wildcards * and ?.
- remove_empty_columns / remove_empty_rows run after other edits.
- Output is serialized with the provided dialect (delimiter, quotechar, quoting, etc.).
Parameters
- text: Raw CSV string.
- delimiter, quotechar, escapechar, doublequote, skipinitialspace, lineterminator, quoting.
- remove_columns: Name or 0-based index (or list).
- remove_row_index: 0-based index (or list).
- remove_values: Literal values or wildcard patterns to blank out.
- remove_duplicates_rows: Remove duplicate rows.
- trim_whitespace: Strip whitespace inside fields.
- remove_empty_columns: Drop empty columns.
- remove_empty_rows: Drop empty rows.
Example
import textwizard as tw
csv_data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""
out = tw.clean_csv(
    csv_data,
    delimiter=",",
    remove_columns=["id", "salary"],
    remove_values=["John", "50000"],
    trim_whitespace=True,
    remove_empty_columns=True,
    remove_empty_rows=True,
    remove_duplicates_rows=True,
)
print(out)
Output
name,age,city
,30,New York
Jane,25,
,35,Los Angeles
Mark,45,
Sarah,40,New York
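The order of operations matters here: columns are dropped and values blanked before duplicates and empty rows are removed. The sketch below walks through a comparable pipeline using only the csv and fnmatch modules. clean_rows is a hypothetical function, the step order is an assumption inferred from the Behavior notes, and it covers only a subset of the options.

```python
import csv
import io
from fnmatch import fnmatchcase


def clean_rows(text, drop_columns=(), blank_patterns=()):
    """Illustrative CSV cleanup; not TextWizard's implementation."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    # 1) keep only the columns whose header name is not listed
    keep = [i for i, name in enumerate(header) if name not in drop_columns]
    rows = [[row[i] for i in keep] for row in rows]
    # 2) blank cells matching any pattern (* and ? wildcards)
    rows = [
        ["" if any(fnmatchcase(cell, p) for p in blank_patterns) else cell
         for cell in row]
        for row in rows
    ]
    # 3) drop duplicate rows, keeping the first occurrence
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # 4) drop rows whose cells are all empty, then serialize
    unique = [row for row in unique if any(cell.strip() for cell in row)]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(unique)
    return out.getvalue()


sample_csv = (
    "id,name,age,city,salary\n"
    "1,John,30,New York,50000\n"
    "2,Jane,25,,40000\n"
    "3,,35,Los Angeles,60000\n"
    "4,Mark,45,,70000\n"
    "5,Sarah,40,New York,\n"
    "1,John,30,New York,50000\n"
)
print(clean_rows(sample_csv, drop_columns=["id", "salary"], blank_patterns=["John", "50000"]))
```

On this input the sketch yields the same six lines as the documented output, which suggests (but does not prove) that the library applies its steps in a similar order.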
Named-Entity Recognition (NER)
Parameters
- text: str input.
- engine: 'spacy' | 'stanza' | 'spacy_stanza' (default 'spacy').
- model: spaCy model name or path (spaCy engine only).
- language: ISO code for Stanza engines.
- device: 'auto' | 'cpu' | 'gpu' (default 'auto').
Examples
import textwizard as tw
sample = (
    "Alex Rivera traveled to Springfield to meet the research team at Northstar Analytics on 14 March 2025. "
    "The next day, he signed a pilot agreement with Horizon Bank and gave a talk at the University of Westland at 10:30 AM."
)
res = tw.extract_entities(sample)
print([e.text for e in res.entities["PERSON"]])
print([e.text for e in res.entities["GPE"]])
print([e.text for e in res.entities["ORG"]])
Output
['Alex Rivera']
['Springfield']
['Northstar Analytics', 'Horizon Bank', 'the University of Westland']
Spell checking
Parameters
- text: String to analyze.
- language: ISO code.
- dict_dir: Folder with *.marisa.zst dictionaries. If None, the user data dir is used and dictionaries are downloaded on demand.
- use_mmap: True to memory-map the uncompressed trie.
Example
import textwizard as tw
check = tw.correctness_text("Thiss sentense has a typo.", language="en")
print(check)
Output
{'errors_count': 2, 'errors': ['thiss', 'sentense']}
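Dictionary lookup through a trie is the core idea behind this check. The sketch below uses a plain nested-dict trie and a five-word toy dictionary purely for illustration; build_trie, in_trie, and check are hypothetical names, and the library itself works with compressed MARISA tries rather than Python dicts.

```python
import re


def build_trie(words):
    """Nested-dict trie; the "$" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def in_trie(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return "$" in node


def check(text, trie):
    # Lower-case, then keep runs of letters only (no digits or underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    errors = [tok for tok in tokens if not in_trie(trie, tok)]
    return {"errors_count": len(errors), "errors": errors}


toy_trie = build_trie(["this", "sentence", "has", "a", "typo"])
print(check("Thiss sentense has a typo.", toy_trie))
# {'errors_count': 2, 'errors': ['thiss', 'sentense']}
```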
Language detection
Language identification uses character n-gram profiles: candidate languages are first gated by priors and linguistic cues, then a probability is estimated for each candidate. Supports 161 languages. Returns a top-1 ISO code or a probability-ordered list.
Parameters
- text: Input string (Unicode).
- top_k: Number of candidates to return (default 3).
- profiles_dir: Override the bundled profiles directory.
- use_mmap: If True, memory-map the profile tries (lower RAM; first access may be slightly slower).
- return_top1: If True, return only the best language code; otherwise a list of (lang, prob).
Examples
Top-1 (single code)
import textwizard as tw
text = "Ciao, come stai oggi?"
lang = tw.lang_detect(text, return_top1=True)
print(lang)
Output
it
Top-k distribution
import textwizard as tw
text = "The quick brown fox jumps over the lazy dog."
langs = tw.lang_detect(text, top_k=5, return_top1=False)
print(langs)
Output
[('en', 0.9999376335362183), ('mg', 4.719212057614953e-05), ('fy', 1.4727973350205069e-05), ('rm', 2.8718519851832537e-07), ('la', 1.5918465665694727e-07)]
Batch example
import textwizard as tw
tests = [
    "これは日本語のテスト文です。",
    "Alex parle un peu français, aber nicht so viel.",
    "¿Dónde está la estación de tren?",
]
for s in tests:
    print("TOP1:", tw.lang_detect(s, return_top1=True))
Output
TOP1: ja
TOP1: fr
TOP1: es
Custom profiles & mmap
from pathlib import Path
import textwizard as tw
langs = tw.lang_detect(
    "Buongiorno a tutti!",
    profiles_dir=Path("/opt/textwizard/profiles"),  # custom profiles
    use_mmap=True,                                  # lower RAM
    top_k=3,
)
print(langs)
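The general technique named at the top of this section, character n-gram profiles followed by per-language probability estimation, can be demonstrated on a toy scale. The sketch below trains trigram counts from two one-line samples and ranks languages with add-one smoothing and a softmax. Everything in it (trigrams, train, detect, the sample sentences) is hypothetical, and it omits the candidate gating and priors that the library describes.

```python
import math
from collections import Counter


def trigrams(text):
    padded = f"  {text.lower()}  "  # pad so word edges form trigrams too
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one trigram Counter per language from {lang: text}."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def detect(text, profiles, top_k=3):
    vocab = set().union(*profiles.values())  # every trigram seen in training
    grams = trigrams(text)
    log_scores = {}
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # add-one smoothing keeps unseen trigrams from zeroing the score
        log_scores[lang] = sum(
            math.log((counts[g] + 1) / (total + len(vocab))) for g in grams
        )
    # softmax over log-scores gives a probability per language
    peak = max(log_scores.values())
    weights = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    norm = sum(weights.values())
    ranked = sorted(
        ((lang, w / norm) for lang, w in weights.items()),
        key=lambda pair: -pair[1],
    )
    return ranked[:top_k]


toy_profiles = train({
    "en": "the quick brown fox jumps over the lazy dog",
    "it": "ciao a tutti come stai oggi tutto bene grazie",
})
print(detect("come stai", toy_profiles))
```

Real profiles are built from far larger corpora; with samples this small the probabilities are only meaningful as a ranking.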
Text statistics
Parameters
Computes: entropy, zipf.slope, zipf.r2, vocab_gini, type_token_ratio, hapax_ratio, simpson_index, yule_k, avg_word_length.
Tokens are lower-cased and split on whitespace.
Example
import textwizard as tw
stats = tw.analyze_text_statistics("a a a b b c d e f g")
print(stats)
Output
{'entropy': 2.646, 'zipf': {'slope': -0.605, 'r2': 0.838}, 'vocab_gini': 0.229, 'type_token_ratio': 0.7, 'hapax_ratio': 0.5, 'simpson_index': 0.82, 'yule_k': 800.0, 'avg_word_length': 1.0}
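Several of these metrics follow from the token frequency table alone. The sketch below recomputes seven of them with textbook formulas (Shannon entropy in bits, Gini over type frequencies, Simpson as 1 − Σp², Yule's K scaled by 10⁴, hapax count divided by token count). lexical_stats is a hypothetical function and the formulas are assumptions chosen because they agree with the sample output; the Zipf fit is left out, and the input is assumed to be non-empty.

```python
import math
from collections import Counter


def lexical_stats(text):
    """Recompute a subset of the metrics (illustrative, non-empty text only)."""
    tokens = text.lower().split()  # lower-case, split on whitespace
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)  # token count, type count
    probs = [c / n for c in counts.values()]
    freqs = sorted(counts.values())
    # Gini coefficient over the ascending type frequencies
    gini = 2 * sum(i * f for i, f in enumerate(freqs, 1)) / (v * n) - (v + 1) / v
    return {
        "entropy": round(-sum(p * math.log2(p) for p in probs), 3),
        "vocab_gini": round(gini, 3),
        "type_token_ratio": round(v / n, 3),
        "hapax_ratio": round(sum(1 for c in counts.values() if c == 1) / n, 3),
        "simpson_index": round(1 - sum(p * p for p in probs), 3),
        "yule_k": round(1e4 * (sum(c * c for c in counts.values()) - n) / (n * n), 3),
        "avg_word_length": round(sum(map(len, tokens)) / n, 3),
    }


print(lexical_stats("a a a b b c d e f g"))
```

For the sample sentence, these formulas give the same values as the documented output for the seven keys they cover.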
Text similarity
Compute a similarity score between two strings using one of three measures.
Returns a float in [0.0, 1.0] (1.0 ≡ identical).
Parameters
- a, b: Strings to compare.
- method: "cosine" | "jaccard" | "levenshtein" (default "cosine").
Notes
- Tokenization for cosine/jaccard uses lowercase word tokens matched by \w+ (Unicode letters, digits, underscore).
- Quick guide:
| Method | Best for | Trade-offs |
|---|---|---|
| cosine | “bag of words” overlap incl. short phrases | needs some tokens; bigram TF helps with order |
| jaccard | set overlap (unique words) | ignores frequency; robust to duplicates |
| levenshtein | character-level edits | O(len(a)·len(b)); great for short strings |
Example
import textwizard as tw
s1 = tw.text_similarity("kitten", "sitting", method="levenshtein")
s2 = tw.text_similarity("hello world", "hello brave world", method="jaccard")
s3 = tw.text_similarity("abc def", "abc xyz", method="cosine")
print(s1, s2, s3, sep="\n")
Output
0.5714285714285714
0.6666666666666666
0.33333333333333337
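The three measures can be written out in a few lines each. The sketch below is an independent reconstruction, not the library's code: it assumes cosine is computed over a term-frequency vector of unigrams plus bigrams (suggested by the "bigram TF" note in the guide) and that Levenshtein distance is normalized by the longer string's length. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def _tokens(s):
    return re.findall(r"\w+", s.lower())


def jaccard(a, b):
    sa, sb = set(_tokens(a)), set(_tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def cosine(a, b):
    def features(s):
        toks = _tokens(s)
        bigrams = [" ".join(pair) for pair in zip(toks, toks[1:])]
        return Counter(toks + bigrams)  # unigram + bigram term frequencies

    fa, fb = features(a), features(b)
    dot = sum(fa[k] * fb[k] for k in fa)
    norm = math.sqrt(sum(x * x for x in fa.values())) * math.sqrt(
        sum(x * x for x in fb.values())
    )
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b))  # similarity in [0, 1]


print(levenshtein("kitten", "sitting"))
print(jaccard("hello world", "hello brave world"))
print(cosine("abc def", "abc xyz"))
```

Under these assumptions the three calls land on 4/7, 2/3, and 1/3, matching the documented scores up to floating-point rounding.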
Beautiful HTML
Pretty-print raw HTML without changing its semantics. Controls indentation, attribute quoting/sorting, whitespace normalization, and optional DOCTYPE insertion.
Parameters
- html: Raw HTML string.
- indent: Spaces per indentation level (default 2).
- quote_attr_values: "always" | "spec" | "legacy" (default "spec").
- quote_char: " or ' (default ").
- use_best_quote_char: If True, auto-pick the quote char that needs fewer escapes.
- minimize_boolean_attributes: If True, render compact booleans (e.g., disabled).
- use_trailing_solidus: If True, add a trailing slash on void elements (<br />).
- space_before_trailing_solidus: Add a space before that slash when used.
- escape_lt_in_attrs: Escape < and > inside attribute values.
- escape_rcdata: Escape within RCData (<script>, <style>, <textarea>).
- resolve_entities: Prefer named entities when serializing.
- alphabetical_attributes: Sort attributes alphabetically.
- strip_whitespace: Trim/collapse whitespace in text nodes.
- include_doctype: Prepend <!DOCTYPE html> if missing.
- expand_mixed_content: Put each child of mixed-content nodes on its own line.
- expand_empty_elements: Render empty non-void elements on two lines.
Example
import textwizard as tw
html = """
<body>
<button id='btn1' class="primary" disabled="disabled">
Click <b>me</b>
</button>
<img alt="Logo" src="/static/logo.png">
</body>
"""
pretty = tw.beautiful_html(
    html=html,
    indent=4,
    alphabetical_attributes=True,
    minimize_boolean_attributes=True,
    quote_attr_values="always",
    strip_whitespace=True,
    include_doctype=True,
    expand_mixed_content=True,
    expand_empty_elements=True,
)
print(pretty)
Output
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <button class="primary" disabled id="btn1">
            Click
            <b>
                me
            </b>
        </button>
        <img alt="Logo" src="/static/logo.png">
    </body>
</html>
HTML to Markdown
Parameters
html: Raw HTML input.
Example
import textwizard as tw
md = tw.html_to_markdown("<h1>Hello</h1><p>World</p>")
print(md)
Output
# Hello
World
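As a rough picture of what such a conversion involves, the sketch below maps headings and paragraphs to Markdown blocks with the standard library's html.parser. MiniMarkdown and to_markdown are hypothetical names; this handles only <h1>–<h6> and <p>, separates blocks with a blank line (the usual Markdown paragraph break), and is not how html_to_markdown is implemented.

```python
from html.parser import HTMLParser


def _heading_level(tag):
    """Return 1-6 for h1-h6, else 0."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return 0


class MiniMarkdown(HTMLParser):
    """Convert <h1>-<h6> and <p> into Markdown blocks; other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._prefix = ""
        self._buf = []

    def handle_starttag(self, tag, attrs):
        level = _heading_level(tag)
        if level:
            self._prefix = "#" * level + " "
        elif tag == "p":
            self._prefix = ""

    def handle_data(self, data):
        self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" or _heading_level(tag):
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.blocks.append(self._prefix + text)
            self._buf, self._prefix = [], ""


def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


print(to_markdown("<h1>Hello</h1><p>World</p>"))
```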
License
Resources
Contact & Author
Author: Mattia Rubino
Email: textwizard.dev@gmail.com