Profile of quiser

wizardhtml

Last released Aug 29, 2025

WHATWG-compliant HTML5 toolkit: DFA tokenizer, spec-guided tree builder, DOM, configurable serializer, high-level cleaner, pretty-printer, and HTML to Markdown.

wizardextract

Text extraction from PDFs, Word files, spreadsheets, and images. Local OCR with Tesseract and optional Azure Document Intelligence for text, tables, and key–value pairs. Includes page/sheet selection and a hybrid PDF mode.

textwizard

Last released Aug 29, 2025

Extract, clean, and analyze text from PDFs, Office docs, images, CSV/HTML. Local OCR (Tesseract), Azure DI, NER (spaCy/Stanza), language detection, spell-check, statistics, and HTML tools.

wizardspell

Last released Aug 28, 2025

Dictionary-based spell checking with Unicode-aware tokenization and light normalization. Supports 62 languages via compressed Marisa-Trie dictionaries and returns a compact report of misspellings.
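As a rough illustration of the general technique described here (not wizardspell's actual API or internals), dictionary-based checking can be sketched with the standard library alone. The word set, the `normalize` and `misspellings` helpers, and the token pattern are all hypothetical, and a plain Python set stands in for the compressed trie dictionaries.

```python
import re
import unicodedata

# Hypothetical toy dictionary; a real checker would load a full word list
# (the package description mentions compressed Marisa-Trie dictionaries).
WORDS = {"the", "quick", "brown", "fox", "caf\u00e9"}

# Runs of Unicode letters, optionally joined by an apostrophe.
# [^\W\d_] means "a word character that is not a digit or underscore".
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def normalize(text):
    """Light normalization: compose accents (NFC), then case-fold."""
    return unicodedata.normalize("NFC", text).casefold()


def misspellings(text, words=WORDS):
    """Return a compact report mapping each unknown token to its count."""
    report = {}
    for match in TOKEN_RE.finditer(normalize(text)):
        token = match.group()
        if token not in words:
            report[token] = report.get(token, 0) + 1
    return report
```

Normalizing to NFC before tokenizing matters because a decomposed accent (a base letter followed by a combining mark) would otherwise split a word such as "café" into a partial token.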

wizardlangid

Last released Aug 28, 2025

Language identification via character n-gram profiles. Candidate gating guided by priors and linguistic cues, then probability estimation for each language. Supports 161 languages. Returns a top-1 ISO code or a probability-ordered list.
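The idea of scoring text against character n-gram profiles can be sketched as follows. This is a minimal illustration under stated assumptions, not wizardlangid's implementation: the `ngrams`, `train`, and `rank` helpers are hypothetical, the scoring is a simple add-one-smoothed log-likelihood turned into probabilities with a softmax, and the candidate gating step is omitted entirely.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count profile per language from sample text."""
    return {lang: Counter(ngrams(text, n)) for lang, text in samples.items()}


def rank(text, profiles, n=3):
    """Return (language, probability) pairs, most probable first."""
    vocab = set().union(*profiles.values())
    grams = ngrams(text, n)
    scores = {}
    for lang, counts in profiles.items():
        # Add-one smoothing so unseen n-grams do not zero out a language.
        total = sum(counts.values()) + len(vocab) + 1
        scores[lang] = sum(math.log((counts[g] + 1) / total) for g in grams)
    # Softmax over log-scores, shifted by the maximum for numerical stability.
    top = max(scores.values())
    weights = {lang: math.exp(s - top) for lang, s in scores.items()}
    z = sum(weights.values())
    return sorted(((lang, w / z) for lang, w in weights.items()),
                  key=lambda pair: pair[1], reverse=True)
```

Taking the first element of the ranked list gives a top-1 answer, while the full list corresponds to a probability-ordered output. Real profiles would be trained on far more text than the toy samples this sketch can reasonably handle.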

wizarddocx

Last released Aug 28, 2025

Text extraction from Microsoft Word files. Parses Word documents natively and can optionally run local OCR with Tesseract for embedded images or scanned pages. Supports page selection and bytes input. Legacy .doc files are read-only, and OCR is not available for them.

Mattia Rubino

6 projects

wizardhtml

wizardextract

textwizard

wizardspell

wizardlangid

wizarddocx