Local-first desktop document OCR: PDF operations, smart OCR routing, and Burmese-first text handling.


Lexo

Lexo stands for Local EXtraction and OCR: a local-first desktop document OCR tool. It turns PDFs and images into clean, editable text, with strong support for Burmese (Myanmar script) using free, high-accuracy Google Docs OCR.

Everything except OCR runs on your machine. The only network call during document processing is the optional OCR provider, which uses your own Google account, so there is nothing to pay for.

Features

  • PDF operations: extract page ranges, split, crop, rotate, merge, and split two-up spreads into separate pages.
  • Visual crop and split editor in the GUI: drag a crop box on the rendered page to remove headers and page numbers, and split scanned two-up spreads. Works on a PDF or a batch of images.
  • Smart OCR routing: digital PDFs use their embedded text layer (instant and lossless); only scanned pages are OCR'd.
  • OCR via Google Docs OCR: free, high-accuracy (especially for Burmese), run on your own Google account. Providers are pluggable behind a single interface.
  • Burmese-aware text handling: NFC normalization and zero-width-space-safe cleaning.
  • Proofread before you export: the desktop app shows each page beside an editable text pane.
  • Exports: plain text (the default), Markdown (with YAML frontmatter), and JSONL (for NLP and LLM workflows).
  • A desktop GUI and a scriptable CLI, both driving the same engine.
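
The smart OCR routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lexo's actual implementation: the `plan_pages` function, the `PagePlan` type, and the `min_chars` threshold are all hypothetical, and the sketch assumes that `--force-ocr` means "OCR every page even when a text layer exists".

```python
from dataclasses import dataclass


@dataclass
class PagePlan:
    """Routing decision for one page (hypothetical type, for illustration)."""

    page_number: int  # 1-based page number
    use_ocr: bool     # True: send to OCR; False: use the embedded text layer


def plan_pages(
    page_texts: list[str],
    min_chars: int = 20,      # assumed threshold; Lexo's real heuristic may differ
    force_ocr: bool = False,  # assumed meaning of the --force-ocr flag
) -> list[PagePlan]:
    """Decide, page by page, whether the embedded text layer is usable."""
    plans = []
    for number, text in enumerate(page_texts, start=1):
        # A page with almost no extractable text is treated as a scan.
        has_text_layer = len(text.strip()) >= min_chars
        plans.append(PagePlan(number, force_ocr or not has_text_layer))
    return plans


embedded = ["A digital page with a real embedded text layer.", "", "   "]
for plan in plan_pages(embedded):
    print(plan.page_number, "OCR" if plan.use_ocr else "text layer")
```

Deciding per page rather than per document is what lets a mixed PDF keep its lossless text where it exists and pay the OCR cost only for the scanned pages.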

Install

Lexo is a Python package. With uv:

uv tool install lexo            # the `lexo` CLI and `lexo gui`

Without uv, use any standard Python installer:

pipx install lexo
# or
python -m pip install lexo

Everything is included in the one install. There are no separate system dependencies to set up.

Quick start

# Digital PDF: extract the embedded text, instantly (plain text by default)
lexo extract report.pdf -o report.txt

# Scanned PDF or image: OCR it (Burmese by default) with your Google account
lexo login
lexo ocr scan.pdf --lang my -o scan.txt

# PDF operations
lexo pdf extract book.pdf --pages "1-3,7,10-" -o subset.pdf
lexo pdf split book.pdf --every 10
lexo pdf crop book.pdf --top 8 --bottom 8 -o trimmed.pdf

# Launch the desktop app
lexo gui

Run lexo --help (or lexo pdf --help) for the full command list.
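
The `--pages "1-3,7,10-"` value in the example above combines ranges, single pages, and an open-ended range. A minimal sketch of how such a spec could be expanded, assuming 1-based inclusive page numbers and that a trailing `-` means "through the last page" (the `parse_pages` helper is hypothetical and is not part of Lexo's API):

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1-3,7,10-" into a list of 1-based page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start = int(start_s)
            # An empty end ("10-") is assumed to mean "up to the last page".
            end = int(end_s) if end_s.strip() else page_count
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_pages("1-3,7,10-", page_count=12))
# → [1, 2, 3, 7, 10, 11, 12]
```

Validating against `page_count` up front means a typo in the spec fails before any output file is written.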

Commands

Command                                                       Purpose
lexo extract <pdf>                                            Extract the embedded text layer of a digital PDF
lexo ocr <pdf|image>                                          OCR a scanned document (--lang, --force-ocr)
lexo pdf info|extract|split|crop|rotate|merge|split-spread    PDF operations
lexo login / lexo logout                                      Sign in to / out of Google (token stored in the OS keychain)
lexo gui                                                      Launch the desktop app
lexo info                                                     Show the version and where Lexo stores its data
lexo check-update                                             Check PyPI for a newer release

All output formats are available via --format text|markdown|jsonl.
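
For NLP and LLM workflows, JSONL output can be read one record per line. The sketch below uses only the Python standard library and deliberately assumes nothing about the record fields, since the JSONL schema is not documented here; `read_jsonl` is a hypothetical helper, and `scan.jsonl` is just an example file name.

```python
import json
from collections.abc import Iterator
from pathlib import Path
from typing import Any


def read_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


# Example usage (assumes scan.jsonl was written with --format jsonl):
# for record in read_jsonl(Path("scan.jsonl")):
#     print(sorted(record))  # inspect the available keys first
```

Opening the file as UTF-8 explicitly matters for Burmese text, because the platform default encoding is not UTF-8 everywhere.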

Google Docs OCR setup (one-time)

OCR uses Google Docs OCR, which is free and runs on your own Google account. You bring your own OAuth client credentials (credentials.json). It is a one-time setup:

  1. Create or pick a Google Cloud project at the Google Cloud Console.
  2. Enable the Google Drive API: APIs & Services -> Library -> search "Google Drive API" -> Enable.
  3. Configure the OAuth consent screen: APIs & Services -> OAuth consent screen -> User type External -> add an app name and your email, then add your own Google account under Test users.
  4. Create the OAuth client: APIs & Services -> Credentials -> Create credentials -> OAuth client ID -> Application type Desktop app -> Create -> Download JSON, and rename the file to credentials.json.
  5. Place credentials.json where Lexo looks for it (first match wins):
    • the path in the LEXO_GOOGLE_CREDENTIALS environment variable, or
    • your Lexo config directory (run lexo info to see it), or
    • the current working directory.
  6. Sign in: run lexo login (or in the GUI, Account -> Sign in with Google). A browser opens; approve access. The token is saved in your OS keychain, and credentials.json is only read during login.

Notes:

  • Lexo requests only the least-privilege drive.file scope, so it can touch only the temporary files it creates while running OCR.
  • While the OAuth app stays in Testing status, Google expires the sign-in roughly every 7 days, so you may need to run lexo login again periodically.
  • Sign out any time with lexo logout (or Account -> Sign out); this removes the stored token.

Burmese notes

  • The OCR language hint defaults to my; override with --lang.
  • Extracted text is normalized to Unicode NFC and zero-width spaces are preserved.
  • A Myanmar Unicode font (Noto Sans Myanmar, SIL Open Font License) is bundled so Burmese renders in the GUI regardless of installed system fonts. The license travels with it as OFL.txt.
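
A minimal sketch of NFC normalization combined with zero-width-space-safe cleaning, using only the Python standard library. The `clean_text` function and the set of characters it removes are assumptions made for illustration; which characters Lexo actually strips is not stated here. The point of the sketch is that U+200B (zero-width space), which Burmese text commonly uses to mark word or line-break positions, passes through untouched.

```python
import re
import unicodedata

ZWSP = "\u200b"  # zero-width space: preserved on purpose

# Invisible characters dropped in this sketch (an assumed set, ZWSP excluded):
# U+FEFF byte order mark and U+00AD soft hyphen.
REMOVABLE = {"\ufeff", "\u00ad"}


def clean_text(text: str) -> str:
    """Normalize to NFC and tidy ordinary whitespace without touching ZWSP."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch not in REMOVABLE)
    # Collapse runs of spaces and tabs only; the pattern never matches U+200B.
    lines = (re.sub(r"[ \t]+", " ", line).strip(" ") for line in text.split("\n"))
    return "\n".join(lines)


sample = "\ufeffမြန်မာ" + ZWSP + "စာ   \n"
cleaned = clean_text(sample)
print(ZWSP in cleaned)  # True: the zero-width space survives cleaning
```

Using an explicit `[ \t]+` pattern instead of a generic "strip all invisible characters" pass is what keeps the cleaning step from silently removing the zero-width spaces.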

Tech stack

Area                 Tools
Language             Python 3.11+
CLI                  Typer
Desktop GUI          PySide6 (Qt)
PDF engine           PyMuPDF
Images               Pillow
OCR                  Google Docs OCR via the Google Drive API (google-api-python-client + google-auth)
Credentials          keyring (OS keychain)
Settings             pydantic-settings (env-var config)
Logging              structlog
Paths                platformdirs
Build & packaging    uv + Hatchling
Quality              Ruff, mypy, pytest
CI/CD                GitHub Actions, PyPI Trusted Publishing

Development

uv sync
uv run ruff check src tests
uv run mypy src/lexo
uv run pytest

Design notes live in docs/ARCHITECTURE.md.

License

AGPL-3.0, to align with PyMuPDF. See LICENSE.

