Universal text extraction from many document formats with external-tool fallbacks

These details have not been verified by PyPI

Project description

all2txt

all2txt is a Python library (and CLI) for extracting text from many document formats.

It is designed for legacy and mixed corpora where files may come from Word, LibreOffice/OpenOffice, OLE-based formats, or plain text formats.

Features

Unified API to get text from a file as a Python string
Save extracted text to .txt in a chosen output encoding
Return extended decode results with method, detected encoding, warnings, and metadata
Native Python extractors (no extra deps): .txt, .log, .ini, .conf, .tex, .bib, .strings, .md, .rst, .csv, .tsv, .json, .xml, .html, .htm, .mht, .mhtml, .eml, .plist, .rtf, .docx, .odt, .ods, .xlsx, .pptx, .fb2, .epub, .pages, .numbers, .key
Requires optional dep: .pdf - pip install all2txt[pdf]; .mobi - pip install all2txt[mobi]; .msg - pip install all2txt[msg]
Supported via external converter: .azw and similar ebook formats are best handled through Calibre ebook-convert

External-tool fallbacks (install separately):

Tool	Covers
Microsoft Word (COM, Windows)	`.doc`, `.docx`, `.rtf`, `.odt`
LibreOffice / OpenOffice headless	Office formats, `.odt`, `.epub`
antiword	Old `.doc`
wvText	Old `.doc` (Linux/Unix)
catdoc/catppt/xls2csv	Legacy `.doc`/`.ppt`/`.xls`
macOS `textutil`	Apple/macOS office and rich-text conversions
Calibre `ebook-convert`	`.epub`, `.mobi`, `.djvu`, `.azw`, `.fb2`, +100 formats
DjVuLibre `djvutxt`	`.djvu`, `.djv`
pstotext/ps2ascii	`.ps`, `.eps`
extract_chmLib/chm2txt	`.chm`
OLE stream scan	Legacy MS Office binaries `.doc`, `.xls`, `.ppt`

Installation

# from PyPI
pip install all2txt

# local development install
pip install -e .

What gets installed by default

pip install all2txt installs only the core package itself.

Current base Python dependencies: none.

This means the default install gives you:

the main Python API: decode_file, decode_result, decode_to_txt, TextDecoder
built-in extractors for plain text, markup, Office XML/ZIP-based formats, email-like formats, and several archive-like document containers
the CLI command all2txt
best-effort fallback logic including python-bytes
built-in plugin code shipped inside the package, including OCR plugin registration hooks

This also means the default install does not automatically install:

pypdf
pywin32
olefile
mobi
extract-msg
external OS tools such as LibreOffice, Word, Calibre, Tesseract, DjVuLibre, antiword, catdoc, etc.

Optional Python dependencies:

pip install -e .[all]
# or separately:
pip install -e .[pdf]   # PDF via pypdf
pip install -e .[win]   # Word COM on Windows
pip install -e .[ole]   # OLE binary fallback
pip install -e .[mobi]  # MOBI native extractor
pip install -e .[msg]   # Outlook .msg parsing
pip install -e .[ocr]   # OCR-related Python helpers; OCR still needs external tools

If you install from PyPI instead of editable mode, the same extras look like this:

pip install all2txt[all]
pip install all2txt[pdf]
pip install all2txt[win]
pip install all2txt[ole]
pip install all2txt[mobi]
pip install all2txt[msg]
pip install all2txt[ocr]

What each extra adds

Extra	Installs	What it enables
`pdf`	`pypdf`	native PDF text extraction and PDF metadata
`win`	`pywin32`	Microsoft Word COM extraction on Windows
`ole`	`olefile`	OLE stream fallback for old `.doc/.xls/.ppt`
`mobi`	`mobi`	native `.mobi` extraction
`msg`	`extract-msg`	Outlook `.msg` parsing
`ocr`	`pypdf`	OCR helper path for scanned PDF workflows; external OCR tools still required
`all`	`pypdf`, `pywin32`, `olefile`, `mobi`, `extract-msg`	most optional Python-side features in one install

Notes:

all already includes pypdf, so in practice it also covers the Python side of ocr
ocr does not install Tesseract, OCRmyPDF, Poppler, ImageMagick, or DjVu tools; those are system tools and must be installed separately
if a dependency is missing, the library tries to degrade gracefully and usually records warnings or falls back to another strategy
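
The graceful-degradation note above can be pictured with a small guard around optional imports. This is only a sketch of the general pattern, not all2txt's actual code: `load_optional` and the wording of the warning are hypothetical.

```python
import importlib
from typing import Any, List, Optional


def load_optional(module_name: str, extra: str, notes: List[str]) -> Optional[Any]:
    """Import an optional module, or record a warning and return None.

    Hypothetical helper for illustration; not part of all2txt.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Record a hint instead of failing, so the caller can try another strategy.
        notes.append(
            f"{module_name} is not installed; try: pip install all2txt[{extra}]"
        )
        return None


notes: List[str] = []
pypdf = load_optional("pypdf", "pdf", notes)
if pypdf is None:
    print("PDF extractor unavailable:", notes[-1])
```

The caller decides what to do with `None`, for example moving on to the next strategy while keeping the collected notes for later inspection.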

External tools (install once on the OS):

# Calibre - covers EPUB, MOBI, DJVU, AZW, FB2 and 100+ formats
# https://calibre-ebook.com/download

# DjVuLibre - for .djvu files
# Windows: https://djvu.sourceforge.net/  |  Linux: apt install djvulibre-bin

Recommended installation patterns

Minimal install:

pip install all2txt

One command for all optional Python dependencies:

pip install all2txt[all]

This is the shortest answer to "install everything that pip can install for this library".

What it includes immediately:

pypdf
pywin32
olefile
mobi
extract-msg

What it still does not include:

Microsoft Word
LibreOffice / OpenOffice
Calibre
Tesseract OCR
OCRmyPDF
Poppler
DjVuLibre
antiword / wvText / catdoc tools

Those are external system tools and must be installed separately.

Good default for Windows office-heavy corpora:

pip install all2txt[all]

If you mainly process old Cyrillic Office files on Windows, also ensure one of these is installed on the OS:

Microsoft Word
LibreOffice

If you mainly process scanned PDF/DjVu/image files:

pip install all2txt[ocr]

and separately install OCR tools such as:

Tesseract OCR
OCRmyPDF
Poppler (pdftoppm)
DjVuLibre (ddjvu / djvutxt)
ImageMagick (magick)

How to add functionality later

You can start with the minimal install and add only what you need.

Examples:

# add PDF support later
pip install all2txt[pdf]

# add Outlook .msg support later
pip install all2txt[msg]

# add legacy OLE fallback later
pip install all2txt[ole]

# add everything Python-side later
pip install all2txt[all]

To inspect what is currently available in your environment, run:

all2txt --available

It will show:

which extras are effectively available
which external tools were found in PATH
which format groups are currently available at native, tool, OCR, or fallback level
suggested installation commands for missing pieces
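
If you want a similar check from Python, a rough manual equivalent can be built from the standard library. This is a sketch, not how `all2txt --available` is implemented: `probe_environment` is a hypothetical helper, and the executable names are assumptions that vary by platform.

```python
import importlib.util
import shutil
from typing import Dict

# Import names of the packages behind the extras listed in this README.
PYTHON_MODULES = {
    "pdf": "pypdf",
    "win": "win32com",      # provided by pywin32
    "ole": "olefile",
    "mobi": "mobi",
    "msg": "extract_msg",   # provided by extract-msg
}

# Assumed executable names; your OS or install may use different ones.
EXTERNAL_TOOLS = [
    "soffice", "ebook-convert", "tesseract", "ocrmypdf",
    "pdftoppm", "djvutxt", "antiword", "catdoc",
]


def probe_environment() -> Dict[str, Dict[str, bool]]:
    """Report which optional modules are importable and which tools are on PATH."""
    extras = {
        extra: importlib.util.find_spec(name) is not None
        for extra, name in PYTHON_MODULES.items()
    }
    tools = {tool: shutil.which(tool) is not None for tool in EXTERNAL_TOOLS}
    return {"extras": extras, "tools": tools}


report = probe_environment()
for section, entries in report.items():
    for name, found in entries.items():
        print(f"{section:7} {name:15} {'found' if found else 'missing'}")
```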

Format install matrix

Format group	Works after `pip install all2txt`	Better with Python extra	Best with external tools
`.txt .log .ini .conf .md .rst .csv .tsv .json .xml .html .htm .mht .mhtml .eml .plist .tex .bib .strings`	yes, native	not needed	not needed
`.docx .odt .ods .xlsx .pptx .fb2 .epub .pages .numbers .key`	yes, native	not needed	optional, only for edge cases
`.pdf`	limited fallback only	`pip install all2txt[pdf]`	for scanned PDFs add Tesseract / OCRmyPDF / Poppler
`.msg`	limited fallback only	`pip install all2txt[msg]`	usually not needed
`.mobi`	limited fallback only	`pip install all2txt[mobi]`	Calibre can improve coverage
`.azw` and similar ebooks	no true native parser	not applicable	Calibre `ebook-convert`
`.doc`	best-effort fallback only	`pip install all2txt[win]` and/or `pip install all2txt[ole]`	Microsoft Word, LibreOffice, antiword, wvText, catdoc
`.xls`	best-effort fallback only	`pip install all2txt[ole]`	LibreOffice, xls2csv
`.ppt`	best-effort fallback only	`pip install all2txt[ole]`	LibreOffice, catppt
`.djvu .djv`	limited fallback only	no dedicated Python extra	DjVuLibre, Calibre, or OCR tools
`.ps .eps`	limited fallback only	no dedicated Python extra	pstotext / ps2ascii
`.chm`	limited fallback only	no dedicated Python extra	extract_chmLib / chm2txt
scanned images / scanned PDFs	placeholder or fallback behavior only	`pip install all2txt[ocr]` helps on Python side	Tesseract, OCRmyPDF, Poppler, DjVuLibre, ImageMagick

Practical recommendation:

for most users start with pip install all2txt[all]
for Office-heavy Windows corpora also install Microsoft Word or LibreOffice
for scanned documents also install OCR tools
if you are unsure, run all2txt --available

Python usage

For most code and notebook scenarios there are 4 entry points to remember:

decode_file(path) -> returns only text as str
decode_result(path) -> returns DecodeResult with text, metadata and warnings
decode_to_txt(path, out_path) -> writes extracted text to a .txt file
TextDecoder(...) -> reusable decoder with shared settings for many files

Quick start

from all2txt import TextDecoder, decode_file, decode_result, decode_to_txt

text = decode_file("sample.docx")

decoder = TextDecoder(
  preferred_tools=["word", "libreoffice", "ole"],
  encoding="cp1251",
  fallback_encodings=["koi8-r", "cp866"],
  output_encoding="cp1251",
)

result = decoder.decode_result("legacy.doc")
text_only = decoder.decode_file("legacy.doc")
print(result.used_method)
print(result.detected_encoding)
print(result.metadata)

decode_to_txt(
  "legacy.doc",
  "out/legacy.txt",
  preferred_tools=["word", "libreoffice", "ole"],
  encoding="cp1251",
  fallback_encodings=["koi8-r", "cp866"],
  output_encoding="cp1251",
)

Important:

decode_file(path) is the shortest API, but it only accepts preferred_tools
if you need encoding, fallback_encodings, or output_encoding, use decode_result(...), decode_to_txt(...), or TextDecoder(...)
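
To see why `encoding` and `fallback_encodings` matter for legacy files, here is a minimal sketch of ordered decoding with fallbacks. It illustrates the general idea only; `decode_with_fallbacks` is a hypothetical helper and may differ from what the library does internally.

```python
from typing import Iterable, Tuple


def decode_with_fallbacks(
    data: bytes, encoding: str, fallback_encodings: Iterable[str] = ()
) -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict decoding in the given order."""
    for candidate in [encoding, *fallback_encodings]:
        try:
            return data.decode(candidate), candidate
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Last resort: never fail, replace undecodable bytes instead.
    return data.decode("utf-8", errors="replace"), "utf-8"


raw = "Привет".encode("cp1251")  # bytes as an old Cyrillic file might store them
text, used = decode_with_fallbacks(raw, "utf-8", ["cp1251", "koi8-r"])
print(text, used)
```

Single-byte encodings such as koi8-r and cp866 accept nearly any byte sequence, so strict decoding stops at the first one in the list. Put the most likely encoding first.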

Which function to use

Function	Returns	When useful
`decode_file(path)`	`str`	You only need the extracted text
`decode_result(path)`	`DecodeResult`	You want text + method + encoding + metadata + warnings
`decode_to_txt(path, out)`	`Path`	You want to convert files into `.txt` on disk
`TextDecoder(...)`	reusable decoder object	You process many files with the same settings

`decode_result(...)` example

from all2txt import decode_result

res = decode_result(
  "data/legacy.doc",
  encoding="utf-8",
  fallback_encodings=["cp1251", "koi8-r", "cp866"],
)

print(type(res).__name__)
print(res.text[:500])
print(res.used_method)
print(res.source_format)
print(res.detected_encoding)
print(res.metadata)
print(res.warnings)

Jupyter Notebook / pandas example

If you work in .ipynb, the most practical pattern is: one document = one row in a DataFrame.

from pathlib import Path
import pandas as pd
from all2txt import TextDecoder

root = Path("docs")

decoder = TextDecoder(
  preferred_tools=["word", "libreoffice", "antiword", "ole", "strings"],
  encoding="utf-8",
  fallback_encodings=["cp1251", "koi8-r", "cp866"],
)

extensions = {
  ".txt", ".doc", ".docx", ".rtf", ".pdf",
  ".xls", ".xlsx", ".ppt", ".pptx",
  ".odt", ".ods", ".epub", ".fb2", ".mobi",
  ".html", ".xml", ".json", ".csv", ".tsv",
  ".eml", ".msg", ".djvu", ".djv", ".chm",
}

rows = []

for path in root.rglob("*"):
  if not path.is_file() or path.suffix.lower() not in extensions:
    continue

  try:
    res = decoder.decode_result(path)
    rows.append({
      "path": str(path),
      "file_name": path.name,
      "ext": path.suffix.lower(),
      "text": res.text,
      "chars": len(res.text),
      "used_method": res.used_method,
      "encoding": res.detected_encoding,
      "language": res.metadata.get("language"),
      "title": res.metadata.get("title"),
      "author": res.metadata.get("author"),
      "warnings": res.warnings,
      "status": "ok",
      "error": "",
    })
  except Exception as exc:
    rows.append({
      "path": str(path),
      "file_name": path.name,
      "ext": path.suffix.lower(),
      "text": "",
      "chars": 0,
      "used_method": "",
      "encoding": "",
      "language": None,
      "title": None,
      "author": None,
      "warnings": [],
      "status": "failed",
      "error": str(exc),
    })

df = pd.DataFrame(rows)
df_ok = df[(df["status"] == "ok") & (df["text"].str.len() > 0)].copy()

This makes it easy to:

build a text corpus for ML or embedding pipelines
filter documents by extraction method or language
inspect failed files separately
keep warnings for later quality control

Error handling example

from all2txt import TextDecoder, ExtractorError

decoder = TextDecoder(encoding="utf-8", fallback_encodings=["cp1251", "koi8-r"])

try:
  result = decoder.decode_result("docs/problematic.doc")
  print(result.text[:300])
  print(result.warnings)
except FileNotFoundError:
  print("File does not exist")
except ExtractorError as exc:
  print("Extraction failed:", exc)

Save extracted corpus as TXT files

from pathlib import Path
from all2txt import decode_to_txt

src_dir = Path("docs")
out_dir = Path("decoded_txt")

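for src in src_dir.rglob("*.doc"):
  # Mirror the source tree so same-named files in different folders do not overwrite each other
  dst = out_dir / src.relative_to(src_dir).with_suffix(".txt")
  dst.parent.mkdir(parents=True, exist_ok=True)
  decode_to_txt(src, dst)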

Metadata

decode_result(...) returns a DecodeResult object with:

text
used_method
source_format
detected_encoding
warnings
metadata

Metadata is best-effort and may include:

title
author
date
language
page_count
subject, from, to for email-like formats
source path, file name, format and file size

CLI usage

# Single file
all2txt input.doc -o output.txt

# Show what is available in the current environment
all2txt --available

# Directory batch
all2txt ./docs -o ./decoded --glob "*.doc*"

# Keep directory structure and write a CSV report
all2txt ./docs -o ./decoded --keep-structure --report report.csv

# Retry only files without output yet
all2txt ./docs -o ./decoded --failed-only

# Show what would happen without writing files
all2txt ./docs --dry-run --glob "*.doc*"

# Set preferred fallback order
all2txt input.doc --method-order word libreoffice ole

# Control encodings
all2txt input.txt -o output.txt --input-encoding cp1251 --fallback-encodings koi8-r cp866 --output-encoding cp1251

CLI options of interest:

--available / --doctor / --help-env
--dry-run
--report report.csv
--failed-only
--keep-structure
--method-order ...
--input-encoding ...
--fallback-encodings ...
--output-encoding ...

--report report.csv writes one row per processed file and includes fields such as:

status
used_method
encoding
chars
metadata_json
warnings
warnings_json
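
A report like this is easy to summarize with the standard library. The sketch below relies only on the `status` and `used_method` columns named above; the sample rows and the `ok` / `failed` status values are made up for illustration, and `summarize_report` is a hypothetical helper.

```python
import csv
import io
from collections import Counter
from typing import TextIO, Tuple


def summarize_report(handle: TextIO) -> Tuple[Counter, Counter]:
    """Count rows per status and per extraction method in a CSV report."""
    statuses: Counter = Counter()
    methods: Counter = Counter()
    for row in csv.DictReader(handle):
        statuses[row.get("status", "")] += 1
        methods[row.get("used_method", "")] += 1
    return statuses, methods


# Made-up sample data; a real report may contain more columns and other values.
sample = io.StringIO(
    "status,used_method,encoding,chars\n"
    "ok,libreoffice,utf-8,1200\n"
    "ok,ole,cp1251,340\n"
    "failed,,,0\n"
)
statuses, methods = summarize_report(sample)
print(statuses)
print(methods)
```

For a real file, pass an open handle instead, for example `open("report.csv", newline="", encoding="utf-8")`; the report's encoding is an assumption here.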

Plugins

External packages can register custom extractors through the entry point group all2txt.extractors. The loaded object should be callable and expose a suffixes attribute.

Built-in optional plugin included in this package:

ocr_plugin for .pdf, .djvu, .djv and image formats
It tries OCR tools in a soft-fallback mode and does not break standard extraction if OCR is unavailable
For pure image files without OCR tools, it returns a best-effort placeholder text with warnings instead of crashing
Typical external OCR tools are tesseract, ocrmypdf, pdftoppm, ddjvu, or magick depending on file type

Minimal example:

from pathlib import Path
from all2txt import register_extractor

def extract_custom(path, default_encoding, fallback_encodings=None):
  # Accept both str and Path inputs
  return Path(path).read_text(encoding=default_encoding)

register_extractor(".custom", extract_custom)

Useful when you have:

internal corporate formats
pre-cleaned text containers
custom archive wrappers
files that need a project-specific parser before standard NLP processing

Publish a plugin to PyPI

If you want to extend all2txt without changing the core package, publish a separate plugin package.

Suggested package name pattern:

all2txt-yourformat

Minimal package structure:

src/all2txt_yourformat/__init__.py
src/all2txt_yourformat/plugin.py
pyproject.toml

Example plugin code (plugin.py):

from pathlib import Path


def yourformat_extractor(path: Path, default_encoding="utf-8", fallback_encodings=None):
  return path.read_text(encoding=default_encoding, errors="replace")


yourformat_extractor.suffixes = [".yourfmt"]

Minimal pyproject.toml for plugin package:

[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "all2txt-yourformat"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = ["all2txt>=0.1.0"]

[project.entry-points."all2txt.extractors"]
yourformat = "all2txt_yourformat.plugin:yourformat_extractor"

[tool.setuptools]
package-dir = {"" = "src"}

[tool.setuptools.packages.find]
where = ["src"]

How to publish:

python -m pip install --upgrade build twine
python -m build
python -m twine upload dist/*

How users install your plugin:

pip install all2txt-yourformat

How users verify plugin activation:

run all2txt --available
decode a test file with .yourfmt extension
check used_method in decode_result(...)
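
You can also check from Python which packages advertise extractors. The sketch below lists entry points in the `all2txt.extractors` group using only `importlib.metadata`; `list_extractor_plugins` is a hypothetical helper, and it only shows what is installed, not what all2txt has actually loaded.

```python
from importlib.metadata import entry_points
from typing import Dict, List

GROUP = "all2txt.extractors"


def list_extractor_plugins(group: str = GROUP) -> Dict[str, List[str]]:
    """Map entry point names to the suffixes their extractor objects declare."""
    eps = entry_points()
    # Python 3.10+ offers .select(); on 3.9 entry_points() returns a dict of groups.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    found: Dict[str, List[str]] = {}
    for ep in selected:
        obj = ep.load()
        # The README asks for a callable that exposes a `suffixes` attribute.
        if callable(obj) and hasattr(obj, "suffixes"):
            found[ep.name] = list(obj.suffixes)
    return found


plugins = list_extractor_plugins()
print(plugins or "no extractor plugins installed")
```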

Why this is important:

no need to fork or modify all2txt core
independent release cycle per format
easy team ownership for domain-specific formats

How fallback works

The library first tries native Python extractors for known formats. If extraction fails or text is empty, it tries external tools in order. Default order:

word
libreoffice
openoffice
antiword
wvtext
catdoc
textutil
calibre
djvutxt
pstotext
chm
ole
strings
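
The ordering logic above can be sketched as a simple loop over candidate methods. This is an illustration of the idea under stated assumptions, not the library's implementation: `run_fallback_chain` and the stub extractors are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Extractor = Callable[[str], str]


def run_fallback_chain(
    path: str, chain: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str, List[str]]:
    """Try each (name, extractor) in order; return (method, text, warnings)."""
    warnings: List[str] = []
    for name, extractor in chain:
        try:
            text = extractor(path)
        except Exception as exc:
            # A failing method is recorded and skipped, not fatal.
            warnings.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text, warnings
        warnings.append(f"{name}: empty text")
    return None, "", warnings


# Stub extractors standing in for real tools.
def broken(path: str) -> str:
    raise RuntimeError("tool not found")


def empty(path: str) -> str:
    return ""


def works(path: str) -> str:
    return "recovered text"


method, text, notes = run_fallback_chain(
    "legacy.doc", [("word", broken), ("libreoffice", empty), ("ole", works)]
)
print(method, notes)
```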

Format-specific notes:

for .djvu, the djvutxt, calibre, or OCR plugin routes may help, depending on the file
for .mobi, the native extractor requires pip install all2txt[mobi]; calibre is the next fallback
for unsupported or partially supported binaries, the library can still fall back to python-bytes best-effort recovery
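
This README does not describe how python-bytes works internally. As a rough picture of what a strings-style recovery generally looks like, the sketch below pulls runs of printable ASCII out of a binary blob; `recover_printable_runs` is a hypothetical helper, and handling only ASCII is a simplification.

```python
import re
from typing import List


def recover_printable_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII bytes, in file order."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


blob = b"\x00\x01Hello world\x00\xff\x02ab\x00Legacy text\x03"
runs = recover_printable_runs(blob)
print(runs)
```

Output of this kind is noisy by nature, which is why such a fallback is best treated as a last resort.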

Notes

For old .doc files, best quality is usually from Word COM or LibreOffice.
For legacy text corpora, pass explicit encoding and fallback_encodings to improve old Cyrillic file decoding.
output_encoding allows saving extracted text back to an older target encoding when needed.
OCR is implemented as a separate plugin layer: if OCR tooling is missing, the main decoder still continues with non-OCR fallbacks.
The core now includes a Python-only binary text recovery fallback (python-bytes) so decoding remains available even without external office/OCR tools.
OLE mode is a best-effort fallback and may include noisy text.
EPUB extraction follows the OPF spine order (reading order), falling back to alphabetical order.
iWork extraction first tries macOS textutil, then falls back to package parsing and printable-string recovery from .iwa chunks.
For scanned PDFs/DjVu, OCR is required. The OCR engines themselves are not bundled with this version; install Tesseract or similar tools separately.
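
The EPUB note above refers to the reading order declared in the package (OPF) file. The sketch below shows how that order can be read with the standard library; it follows the EPUB container layout, but `spine_order` is a hypothetical helper and not all2txt's own code.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from typing import List

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_order(epub: zipfile.ZipFile) -> List[str]:
    """Return content file names in spine order, or alphabetically as a fallback."""
    alphabetical = sorted(
        n for n in epub.namelist() if n.endswith((".xhtml", ".html", ".htm"))
    )
    try:
        container = ET.fromstring(epub.read("META-INF/container.xml"))
        rootfile = container.find(".//c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]
        opf = ET.fromstring(epub.read(opf_path))
    except (KeyError, AttributeError, ET.ParseError):
        return alphabetical
    # The manifest maps item ids to hrefs relative to the OPF file.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in opf.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)
    ordered = [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in opf.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.attrib.get("idref") in manifest
    ]
    return ordered or alphabetical


# Build a tiny in-memory EPUB whose spine order differs from alphabetical order.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "META-INF/container.xml",
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>',
    )
    zf.writestr(
        "OEBPS/content.opf",
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
        '<manifest><item id="a" href="appendix.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="b" href="chapter1.xhtml" media-type="application/xhtml+xml"/></manifest>'
        '<spine><itemref idref="b"/><itemref idref="a"/></spine></package>',
    )
    zf.writestr("OEBPS/appendix.xhtml", "<html/>")
    zf.writestr("OEBPS/chapter1.xhtml", "<html/>")

with zipfile.ZipFile(buf) as zf:
    order = spine_order(zf)
print(order)
```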

Project details

Release history

0.1.2

Mar 23, 2026

This version

0.1.0

Mar 17, 2026

Download files

Source Distribution

all2txt-0.1.0.tar.gz (32.6 kB)

Uploaded Mar 17, 2026 Source

Built Distribution

all2txt-0.1.0-py3-none-any.whl (27.5 kB)

Uploaded Mar 17, 2026 Python 3

File details

Details for the file all2txt-0.1.0.tar.gz.

File metadata

Download URL: all2txt-0.1.0.tar.gz
Upload date: Mar 17, 2026
Size: 32.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for all2txt-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4d71c776cfff9425aa277017ac1644416869e8fd861b97f89f9772cc1a24fcc3`
MD5	`37cf59d210ffc153d4cb130f29285f46`
BLAKE2b-256	`e8be33550fa9a3ef6b1d0cb1bd94c42b667ed34384f5cd6d753f400505de207a`

Provenance

The following attestation bundles were made for all2txt-0.1.0.tar.gz:

Publisher: publish.yml on steenzh27/all2txt

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: all2txt-0.1.0.tar.gz
- Subject digest: 4d71c776cfff9425aa277017ac1644416869e8fd861b97f89f9772cc1a24fcc3
- Sigstore transparency entry: 1116129925
- Sigstore integration time: Mar 17, 2026
Source repository:
- Permalink: steenzh27/all2txt@485ec167a56f9e8302e260a0f1b3ec573b00703b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/steenzh27
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@485ec167a56f9e8302e260a0f1b3ec573b00703b
- Trigger Event: release

File details

Details for the file all2txt-0.1.0-py3-none-any.whl.

File metadata

Download URL: all2txt-0.1.0-py3-none-any.whl
Upload date: Mar 17, 2026
Size: 27.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for all2txt-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`469526ca42eda9419ac583a62849884d7db30acdae32c4271288dc4bc7e275b9`
MD5	`964c8d45f81ec2937031d48566ca6e87`
BLAKE2b-256	`ca88383862764eb9b10f978eea18261e3e4e3a81503eeb6380731da85b360795`

Provenance

The following attestation bundles were made for all2txt-0.1.0-py3-none-any.whl:

Publisher: publish.yml on steenzh27/all2txt

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: all2txt-0.1.0-py3-none-any.whl
- Subject digest: 469526ca42eda9419ac583a62849884d7db30acdae32c4271288dc4bc7e275b9
- Sigstore transparency entry: 1116129966
- Sigstore integration time: Mar 17, 2026
Source repository:
- Permalink: steenzh27/all2txt@485ec167a56f9e8302e260a0f1b3ec573b00703b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/steenzh27
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@485ec167a56f9e8302e260a0f1b3ec573b00703b
- Trigger Event: release
