Universal text extraction from many document formats with external-tool fallbacks

all2txt

all2txt is a Python library (and CLI) for extracting text from many document formats.

Russian version: README.ru.md

It is designed for legacy and mixed corpora where files may come from Word, LibreOffice/OpenOffice, OLE-based formats, or plain text formats.

Features

  • Unified API to get text from a file as a Python string
  • Save extracted text to .txt in a chosen output encoding
  • Return extended decode results with method, detected encoding, warnings, and metadata
  • Native Python extractors (no extra deps): .txt, .log, .ini, .conf, .tex, .bib, .strings, .md, .rst, .csv, .tsv, .json, .xml, .html, .htm, .mht, .mhtml, .eml, .plist, .rtf, .docx, .odt, .ods, .xlsx, .pptx, .fb2, .epub, .pages, .numbers, .key
  • Requires optional dep: .pdf - pip install all2txt[pdf]; .mobi - pip install all2txt[mobi]; .msg - pip install all2txt[msg]
  • Supported via external converter: .azw and similar ebook formats are best handled through Calibre ebook-convert
  • External-tool fallbacks (install separately):
    Tool | Covers
    Microsoft Word (COM, Windows) | .doc, .docx, .rtf, .odt
    LibreOffice / OpenOffice headless | Office formats, .odt, .epub
    antiword | Old .doc
    wvText | Old .doc (Linux/Unix)
    catdoc/catppt/xls2csv | Legacy .doc/.ppt/.xls
    macOS textutil | Apple/macOS office and rich-text conversions
    Calibre ebook-convert | .epub, .mobi, .djvu, .azw, .fb2, +100 formats
    DjVuLibre djvutxt | .djvu, .djv
    pstotext/ps2ascii | .ps, .eps
    extract_chmLib/chm2txt | .chm
    OLE stream scan | Legacy MS Office binaries .doc, .xls, .ppt

Installation

# from PyPI
pip install all2txt

# local development install
pip install -e .

What gets installed by default

pip install all2txt installs only the core package itself.

Current base Python dependencies: none.

This means the default install gives you:

  • the main Python API: decode_file, decode_result, decode_to_txt, TextDecoder
  • built-in extractors for plain text, markup, Office XML/ZIP-based formats, email-like formats, and several archive-like document containers
  • the CLI command all2txt
  • best-effort fallback logic including python-bytes
  • built-in plugin code shipped inside the package, including OCR plugin registration hooks

This also means the default install does not automatically install:

  • pypdf
  • pywin32
  • olefile
  • mobi
  • extract-msg
  • external OS tools such as LibreOffice, Word, Calibre, Tesseract, DjVuLibre, antiword, catdoc, etc.

Optional Python dependencies:

pip install -e .[all]
# or separately:
pip install -e .[pdf]   # PDF via pypdf
pip install -e .[win]   # Word COM on Windows
pip install -e .[ole]   # OLE binary fallback
pip install -e .[mobi]  # MOBI native extractor
pip install -e .[msg]   # Outlook .msg parsing
pip install -e .[ocr]   # OCR-related Python helpers; OCR still needs external tools

If you install from PyPI instead of editable mode, the same extras look like this:

pip install all2txt[all]
pip install all2txt[pdf]
pip install all2txt[win]
pip install all2txt[ole]
pip install all2txt[mobi]
pip install all2txt[msg]
pip install all2txt[ocr]

What each extra adds

Extra | Installs | What it enables
pdf | pypdf | native PDF text extraction and PDF metadata
win | pywin32 | Microsoft Word COM extraction on Windows
ole | olefile | OLE stream fallback for old .doc/.xls/.ppt
mobi | mobi | native .mobi extraction
msg | extract-msg | Outlook .msg parsing
ocr | pypdf | OCR helper path for scanned PDF workflows; external OCR tools still required
all | pypdf, pywin32, olefile, mobi, extract-msg | most optional Python-side features in one install

Notes:

  • all already includes pypdf, so in practice it also covers the Python side of ocr
  • ocr does not install Tesseract, OCRmyPDF, Poppler, ImageMagick, or DjVu tools; those are system tools and must be installed separately
  • if a dependency is missing, the library tries to degrade gracefully and usually records warnings or falls back to another strategy
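
To check from Python which of these optional packages are importable, a minimal standard-library sketch is shown below. The import names (pypdf, win32com, olefile, mobi, extract_msg) are the usual top-level modules of the listed distributions; that mapping is an assumption here, not something all2txt documents, and all2txt --available remains the supported way to inspect the environment.

```python
from importlib.util import find_spec

# Assumed mapping: extra name -> top-level module that the extra provides.
EXTRA_MODULES = {
    "pdf": "pypdf",
    "win": "win32com",
    "ole": "olefile",
    "mobi": "mobi",
    "msg": "extract_msg",
}


def available_modules(modules):
    """Return {module_name: True/False} without importing the modules."""
    return {name: find_spec(name) is not None for name in modules}


status = available_modules(EXTRA_MODULES.values())
for extra, module in EXTRA_MODULES.items():
    state = "found" if status[module] else "missing"
    print(f"{extra:5} {module:12} {state}")
```

A missing module here only means the corresponding extra is not installed; the library itself is described as degrading gracefully in that case.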

External tools (install once on the OS):

# Calibre - covers EPUB, MOBI, DJVU, AZW, FB2 and 100+ formats
# https://calibre-ebook.com/download

# DjVuLibre - for .djvu files
# Windows: https://djvu.sourceforge.net/  |  Linux: apt install djvulibre-bin

Recommended installation patterns

Minimal install:

pip install all2txt

One command for all optional Python dependencies:

pip install all2txt[all]

This is the shortest answer to "install everything that pip can install for this library".

What it includes immediately:

  • pypdf
  • pywin32
  • olefile
  • mobi
  • extract-msg

What it still does not include:

  • Microsoft Word
  • LibreOffice / OpenOffice
  • Calibre
  • Tesseract OCR
  • OCRmyPDF
  • Poppler
  • DjVuLibre
  • antiword / wvText / catdoc tools

Those are external system tools and must be installed separately.

Good default for Windows office-heavy corpora:

pip install all2txt[all]

If you mainly process old Cyrillic Office files on Windows, also ensure one of these is installed on the OS:

  • Microsoft Word
  • LibreOffice

If you mainly process scanned PDF/DjVu/image files:

pip install all2txt[ocr]

and separately install OCR tools such as:

  • Tesseract OCR
  • OCRmyPDF
  • Poppler (pdftoppm)
  • DjVuLibre (ddjvu / djvutxt)
  • ImageMagick (magick)

How to add functionality later

You can start with the minimal install and add only what you need.

Examples:

# add PDF support later
pip install all2txt[pdf]

# add Outlook .msg support later
pip install all2txt[msg]

# add legacy OLE fallback later
pip install all2txt[ole]

# add everything Python-side later
pip install all2txt[all]

To inspect what is currently available in your environment, run:

all2txt --available

It will show:

  • which extras are effectively available
  • which external tools were found in PATH
  • which format groups are currently available at native, tool, OCR, or fallback level
  • suggested installation commands for missing pieces
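
For scripts that need a similar PATH check without calling the CLI, the sketch below uses shutil.which. It is not how all2txt --available is implemented; the executable names are assumptions based on the common binary names of these tools, and find_tools is a hypothetical helper.

```python
import shutil

# Assumed executable names for some of the tools listed in this README.
TOOL_EXECUTABLES = {
    "libreoffice": ["soffice", "libreoffice"],
    "antiword": ["antiword"],
    "catdoc": ["catdoc"],
    "calibre": ["ebook-convert"],
    "djvutxt": ["djvutxt"],
    "tesseract": ["tesseract"],
}


def find_tools(tools=TOOL_EXECUTABLES, which=shutil.which):
    """Map each tool to the first executable found in PATH, or None."""
    found = {}
    for tool, candidates in tools.items():
        found[tool] = next((p for p in map(which, candidates) if p), None)
    return found


for tool, location in find_tools().items():
    print(f"{tool:12} {location or 'not found'}")
```

The which parameter exists only so the lookup can be replaced in tests.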

Format install matrix

Format group | Works after pip install all2txt | Better with Python extra | Best with external tools
.txt .log .ini .conf .md .rst .csv .tsv .json .xml .html .htm .mht .mhtml .eml .plist .tex .bib .strings | yes, native | not needed | not needed
.docx .odt .ods .xlsx .pptx .fb2 .epub .pages .numbers .key | yes, native | not needed | optional, only for edge cases
.pdf | limited fallback only | pip install all2txt[pdf] | for scanned PDFs add Tesseract / OCRmyPDF / Poppler
.msg | limited fallback only | pip install all2txt[msg] | usually not needed
.mobi | limited fallback only | pip install all2txt[mobi] | Calibre can improve coverage
.azw and similar ebooks | no true native parser | not applicable | Calibre ebook-convert
.doc | best-effort fallback only | pip install all2txt[win] and/or pip install all2txt[ole] | Microsoft Word, LibreOffice, antiword, wvText, catdoc
.xls | best-effort fallback only | pip install all2txt[ole] | LibreOffice, xls2csv
.ppt | best-effort fallback only | pip install all2txt[ole] | LibreOffice, catppt
.djvu .djv | limited fallback only | no dedicated Python extra | DjVuLibre, Calibre, or OCR tools
.ps .eps | limited fallback only | no dedicated Python extra | pstotext / ps2ascii
.chm | limited fallback only | no dedicated Python extra | extract_chmLib / chm2txt
scanned images / scanned PDFs | placeholder or fallback behavior only | pip install all2txt[ocr] helps on Python side | Tesseract, OCRmyPDF, Poppler, DjVuLibre, ImageMagick

Practical recommendation:

  • for most users start with pip install all2txt[all]
  • for Office-heavy Windows corpora also install Microsoft Word or LibreOffice
  • for scanned documents also install OCR tools
  • if you are unsure, run all2txt --available

Python usage

For most code and notebook scenarios there are 4 entry points to remember:

  • decode_file(path) -> returns only text as str
  • decode_result(path) -> returns DecodeResult with text, metadata and warnings
  • decode_to_txt(path, out_path) -> writes extracted text to a .txt file
  • TextDecoder(...) -> reusable decoder with shared settings for many files

Quick start

from all2txt import TextDecoder, decode_file, decode_result, decode_to_txt

text = decode_file("sample.docx")

decoder = TextDecoder(
  preferred_tools=["word", "libreoffice", "ole"],
  encoding="cp1251",
  fallback_encodings=["koi8-r", "cp866"],
  output_encoding="cp1251",
)

result = decoder.decode_result("legacy.doc")
text_only = decoder.decode_file("legacy.doc")
print(result.used_method)
print(result.detected_encoding)
print(result.metadata)

decode_to_txt(
  "legacy.doc",
  "out/legacy.txt",
  preferred_tools=["word", "libreoffice", "ole"],
  encoding="cp1251",
  fallback_encodings=["koi8-r", "cp866"],
  output_encoding="cp1251",
)

Important:

  • decode_file(path) is the shortest API, but it only accepts preferred_tools
  • if you need encoding, fallback_encodings, or output_encoding, use decode_result(...), decode_to_txt(...), or TextDecoder(...)
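
To make the role of encoding and fallback_encodings concrete, here is a minimal sketch of an "ordered strict decode" chain. It illustrates the general idea only and is an assumption about behavior, not the library's actual implementation; decode_with_fallbacks is a hypothetical helper.

```python
def decode_with_fallbacks(data, encoding="utf-8", fallback_encodings=()):
    """Return (text, encoding_used); try each encoding strictly, in order."""
    for candidate in (encoding, *fallback_encodings):
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the primary encoding and replacement characters.
    return data.decode(encoding, errors="replace"), encoding


raw = "Привет".encode("cp1251")
text, used = decode_with_fallbacks(raw, "utf-8", ["cp1251", "koi8-r", "cp866"])
print(used, text)
```

Order matters in such a chain: single-byte code pages such as cp1251 or koi8-r accept almost any byte sequence, so the first single-byte fallback usually wins even when it is the wrong one. List the most likely encoding of the corpus first.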

Which function to use

Function | Returns | When useful
decode_file(path) | str | You only need the extracted text
decode_result(path) | DecodeResult | You want text + method + encoding + metadata + warnings
decode_to_txt(path, out) | Path | You want to convert files into .txt on disk
TextDecoder(...) | reusable decoder object | You process many files with the same settings

decode_result(...) example

from all2txt import decode_result

res = decode_result(
  "data/legacy.doc",
  encoding="utf-8",
  fallback_encodings=["cp1251", "koi8-r", "cp866"],
)

print(type(res).__name__)
print(res.text[:500])
print(res.used_method)
print(res.source_format)
print(res.detected_encoding)
print(res.metadata)
print(res.warnings)

Jupyter Notebook / pandas example

If you work in .ipynb, the most practical pattern is: one document = one row in a DataFrame.

from pathlib import Path
import pandas as pd
from all2txt import TextDecoder

root = Path("docs")

decoder = TextDecoder(
  preferred_tools=["word", "libreoffice", "antiword", "ole", "strings"],
  encoding="utf-8",
  fallback_encodings=["cp1251", "koi8-r", "cp866"],
)

extensions = {
  ".txt", ".doc", ".docx", ".rtf", ".pdf",
  ".xls", ".xlsx", ".ppt", ".pptx",
  ".odt", ".ods", ".epub", ".fb2", ".mobi",
  ".html", ".xml", ".json", ".csv", ".tsv",
  ".eml", ".msg", ".djvu", ".djv", ".chm",
}

rows = []

for path in root.rglob("*"):
  if not path.is_file() or path.suffix.lower() not in extensions:
    continue

  try:
    res = decoder.decode_result(path)
    rows.append({
      "path": str(path),
      "file_name": path.name,
      "ext": path.suffix.lower(),
      "text": res.text,
      "chars": len(res.text),
      "used_method": res.used_method,
      "encoding": res.detected_encoding,
      "language": res.metadata.get("language"),
      "title": res.metadata.get("title"),
      "author": res.metadata.get("author"),
      "warnings": res.warnings,
      "status": "ok",
      "error": "",
    })
  except Exception as exc:
    rows.append({
      "path": str(path),
      "file_name": path.name,
      "ext": path.suffix.lower(),
      "text": "",
      "chars": 0,
      "used_method": "",
      "encoding": "",
      "language": None,
      "title": None,
      "author": None,
      "warnings": [],
      "status": "failed",
      "error": str(exc),
    })

df = pd.DataFrame(rows)
df_ok = df[(df["status"] == "ok") & (df["text"].str.len() > 0)].copy()

This makes it easy to:

  • build a text corpus for ML or embedding pipelines
  • filter documents by extraction method or language
  • inspect failed files separately
  • keep warnings for later quality control

Error handling example

from all2txt import TextDecoder, ExtractorError

decoder = TextDecoder(encoding="utf-8", fallback_encodings=["cp1251", "koi8-r"])

try:
  result = decoder.decode_result("docs/problematic.doc")
  print(result.text[:300])
  print(result.warnings)
except FileNotFoundError:
  print("File does not exist")
except ExtractorError as exc:
  print("Extraction failed:", exc)

Save extracted corpus as TXT files

from pathlib import Path
from all2txt import decode_to_txt

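src_dir = Path("docs")
out_dir = Path("decoded_txt")
# Create the output folder up front in case decode_to_txt does not.
out_dir.mkdir(parents=True, exist_ok=True)

for src in src_dir.rglob("*.doc"):
  # Outputs are flattened into out_dir, so same-named sources overwrite each other.
  dst = out_dir / src.with_suffix(".txt").name
  decode_to_txt(src, dst)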
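
The loop above writes every result into one flat folder, so two files named report.doc in different subfolders map to the same output name. One way around that is to mirror the source tree; the sketch below shows only the path computation, and mirrored_txt_path is a hypothetical helper, not part of all2txt.

```python
from pathlib import Path


def mirrored_txt_path(src, src_dir, out_dir):
    """Map src_dir/a/b.doc to out_dir/a/b.txt, keeping subfolders."""
    relative = Path(src).relative_to(src_dir)
    return Path(out_dir) / relative.with_suffix(".txt")


dst = mirrored_txt_path("docs/2004/report.doc", "docs", "decoded_txt")
print(dst.as_posix())
```

In a real loop, calling dst.parent.mkdir(parents=True, exist_ok=True) before decode_to_txt(src, dst) avoids relying on the library to create missing folders.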

Metadata

decode_result(...) returns a DecodeResult object with:

  • text
  • used_method
  • source_format
  • detected_encoding
  • warnings
  • metadata

Metadata is best-effort and may include:

  • title
  • author
  • date
  • language
  • page_count
  • subject, from, to for email-like formats
  • source path, file name, format and file size

CLI usage

# Single file
all2txt input.doc -o output.txt

# Show what is available in the current environment
all2txt --available

# Directory batch
all2txt ./docs -o ./decoded --glob "*.doc*"

# Keep directory structure and write a CSV report
all2txt ./docs -o ./decoded --keep-structure --report report.csv

# Retry only files without output yet
all2txt ./docs -o ./decoded --failed-only

# Show what would happen without writing files
all2txt ./docs --dry-run --glob "*.doc*"

# Set preferred fallback order
all2txt input.doc --method-order word libreoffice ole

# Control encodings
all2txt input.txt -o output.txt --input-encoding cp1251 --fallback-encodings koi8-r cp866 --output-encoding cp1251

CLI options of interest:

  • --available / --doctor / --help-env
  • --dry-run
  • --report report.csv
  • --failed-only
  • --keep-structure
  • --method-order ...
  • --input-encoding ...
  • --fallback-encodings ...
  • --output-encoding ...

--report report.csv writes one row per processed file and includes fields such as:

  • status
  • used_method
  • encoding
  • chars
  • metadata_json
  • warnings
  • warnings_json
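
A report like this can be summarized with the standard csv module. The sketch below counts rows per status and used_method; the sample rows and their values are made up for illustration, and summarize_report is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def summarize_report(fileobj):
    """Count rows per (status, used_method) pair in a report CSV."""
    reader = csv.DictReader(fileobj)
    return Counter(
        (row.get("status", ""), row.get("used_method", "")) for row in reader
    )


# Hypothetical report content, for illustration only.
sample = io.StringIO(
    "status,used_method,chars\n"
    "ok,native,1200\n"
    "ok,libreoffice,845\n"
    "failed,,0\n"
    "ok,native,90\n"
)
for (status, method), count in summarize_report(sample).most_common():
    print(status, method or "-", count)
```

For a real file, pass open("report.csv", newline="", encoding="utf-8") instead of the in-memory sample; the UTF-8 encoding of the report is an assumption.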

Plugins

External packages can register custom extractors through the entry point group all2txt.extractors. The loaded object should be callable and expose a suffixes attribute.

Built-in optional plugin included in this package:

  • ocr_plugin for .pdf, .djvu, .djv and image formats
  • It tries OCR tools in a soft-fallback mode and does not break standard extraction if OCR is unavailable
  • For pure image files without OCR tools, it returns a best-effort placeholder text with warnings instead of crashing
  • Typical external OCR tools are tesseract, ocrmypdf, pdftoppm, ddjvu, or magick depending on file type

Minimal example:

from all2txt import register_extractor

def extract_custom(path, default_encoding, fallback_encodings=None):
  return path.read_text(encoding=default_encoding)

register_extractor(".custom", extract_custom)

Useful when you have:

  • internal corporate formats
  • pre-cleaned text containers
  • custom archive wrappers
  • files that need a project-specific parser before standard NLP processing

Publish a plugin to PyPI

If you want to extend all2txt without changing the core package, publish a separate plugin package.

Suggested package name pattern:

  • all2txt-yourformat

Minimal package structure:

  • src/all2txt_yourformat/__init__.py
  • src/all2txt_yourformat/plugin.py
  • pyproject.toml

Example plugin code (plugin.py):

from pathlib import Path


def yourformat_extractor(path: Path, default_encoding="utf-8", fallback_encodings=None):
  return path.read_text(encoding=default_encoding, errors="replace")


yourformat_extractor.suffixes = [".yourfmt"]

Minimal pyproject.toml for plugin package:

[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "all2txt-yourformat"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = ["all2txt>=0.1.0"]

[project.entry-points."all2txt.extractors"]
yourformat = "all2txt_yourformat.plugin:yourformat_extractor"

[tool.setuptools]
package-dir = {"" = "src"}

[tool.setuptools.packages.find]
where = ["src"]

How to publish:

python -m pip install --upgrade build twine
python -m build
python -m twine upload dist/*

How users install your plugin:

pip install all2txt-yourformat

How users verify plugin activation:

  • run all2txt --available
  • decode a test file with .yourfmt extension
  • check used_method in decode_result(...)
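
Independently of all2txt, the standard library can confirm that the entry point was registered at install time. The sketch below lists everything in the all2txt.extractors group; list_extractor_plugins is a hypothetical helper, and the branch for older Python versions reflects the importlib.metadata API change in Python 3.10.

```python
from importlib.metadata import entry_points


def list_extractor_plugins(group="all2txt.extractors"):
    """Return {entry_point_name: "module:attr"} for installed plugins."""
    eps = entry_points()
    if hasattr(eps, "select"):   # Python 3.10+
        selected = eps.select(group=group)
    else:                        # Python 3.9: plain dict of lists
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


print(list_extractor_plugins())
```

With the pyproject.toml shown earlier installed, the result would be expected to contain the key yourformat pointing at all2txt_yourformat.plugin:yourformat_extractor. An empty dict means no plugin package is installed in the current environment.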

Why this is important:

  • no need to fork or modify all2txt core
  • independent release cycle per format
  • easy team ownership for domain-specific formats

How fallback works

The library first tries native Python extractors for known formats. If extraction fails or text is empty, it tries external tools in order. Default order:

  1. word
  2. libreoffice
  3. openoffice
  4. antiword
  5. wvtext
  6. catdoc
  7. textutil
  8. calibre
  9. djvutxt
  10. pstotext
  11. chm
  12. ole
  13. strings
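
The "try in order until something returns non-empty text" rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: first_successful, the extractor callables, and the warning strings are all hypothetical.

```python
def first_successful(path, extractors, order):
    """Try extractors in order; return (text, method, warnings)."""
    warnings = []
    for name in order:
        extractor = extractors.get(name)
        if extractor is None:
            warnings.append(f"{name}: not available")
            continue
        try:
            text = extractor(path)
        except Exception as exc:  # a failing tool must not stop the chain
            warnings.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return text, name, warnings
        warnings.append(f"{name}: empty output")
    return "", None, warnings


def broken(path):
    raise RuntimeError("tool crashed")


# Hypothetical stand-ins for real extractors.
extractors = {
    "word": broken,
    "libreoffice": lambda p: "  ",
    "ole": lambda p: "recovered text",
}
order = ["word", "libreoffice", "antiword", "ole"]
text, method, warnings = first_successful("legacy.doc", extractors, order)
print(method, warnings)
```

Collecting a warning for every skipped step mirrors the idea of keeping warnings alongside the result, so a low-quality fallback can be spotted later.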

Format-specific notes: for .djvu, the djvutxt, calibre, or OCR plugin routes may help depending on the file. For .mobi, the native extractor requires pip install all2txt[mobi]; calibre is the next fallback. For unsupported or partially supported binaries, the library can still fall back to python-bytes best-effort recovery.
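
To give a feel for what a bytes-level recovery of this kind can and cannot do, here is a sketch in the spirit of the Unix strings tool: it keeps runs of printable ASCII and drops everything else. It is an assumption-laden illustration, not the python-bytes implementation; printable_runs is a hypothetical helper.

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


blob = b"\x00\x01Hello, world\x00\xff\x02ab\x00legacy text\x9c"
print(printable_runs(blob))
```

An ASCII-only scan like this skips Cyrillic text stored in single-byte code pages or UTF-16, which is one reason such output should be treated as noisy, last-resort text.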

Notes

  • For old .doc files, best quality is usually from Word COM or LibreOffice.
  • For legacy text corpora, pass explicit encoding and fallback_encodings to improve old Cyrillic file decoding.
  • output_encoding allows saving extracted text back to an older target encoding when needed.
  • OCR is implemented as a separate plugin layer: if OCR tooling is missing, the main decoder still continues with non-OCR fallbacks.
  • The core now includes a Python-only binary text recovery fallback (python-bytes) so decoding remains available even without external office/OCR tools.
  • OLE mode is a best-effort fallback and may include noisy text.
  • EPUB extraction follows the OPF spine order (reading order), falling back to alphabetical order.
  • iWork extraction first tries macOS textutil, then falls back to package parsing and printable-string recovery from .iwa chunks.
  • For scanned PDFs/DjVu, OCR is required; the OCR engines themselves are not bundled with this version and must be installed separately (see Tesseract).
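
The EPUB note above refers to the reading order defined by the package document. As an illustration of what "spine order" means, the sketch below resolves it with the standard library only: container.xml points to the OPF file, the manifest maps ids to hrefs, and the spine lists ids in reading order. It is not the library's extractor; spine_hrefs is a hypothetical helper that ignores details such as percent-encoded hrefs.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_hrefs(epub):
    """Return content document paths inside the archive, in reading order."""
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    return [
        posixpath.join(base, manifest[ref.get("idref")])
        for ref in package.findall("opf:spine/opf:itemref", NS)
        if ref.get("idref") in manifest
    ]


# Build a tiny in-memory EPUB-like archive to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "META-INF/container.xml",
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles>'
        "</container>",
    )
    zf.writestr(
        "OEBPS/content.opf",
        '<package xmlns="http://www.idpf.org/2007/opf">'
        '<manifest><item id="a" href="a.xhtml"/><item id="z" href="z.xhtml"/></manifest>'
        '<spine><itemref idref="z"/><itemref idref="a"/></spine>'
        "</package>",
    )
print(spine_hrefs(buf))
```

In the sample archive the spine lists z.xhtml before a.xhtml, so the reading order differs from the alphabetical one.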
