Local-first desktop document OCR: PDF operations, smart OCR routing, and Burmese-first text handling.

Maintainers

PhilixTheExplorer

Lexo

Lexo stands for Local EXtraction and OCR: a local-first desktop document OCR tool. It turns PDFs and images into clean, editable text, with strong support for Burmese (Myanmar script) using free, high-accuracy Google Docs OCR.

All document processing runs on your machine. The only network calls are the optional OCR provider, which uses your own Google account (so there is nothing to pay for), and the optional lexo check-update query to PyPI.

Features

PDF operations: extract page ranges, split, crop, rotate, merge, and split two-up spreads into separate pages.
Visual crop and split editor in the GUI: drag a crop box on the rendered page to remove headers and page numbers, and split scanned two-up spreads. Works on a PDF or a batch of images.
Smart OCR routing: digital PDFs use their embedded text layer (instant and lossless); only scanned pages are OCR'd.
OCR via Google Docs OCR: free, high-accuracy (especially for Burmese), run on your own Google account. Providers are pluggable behind a single interface.
Burmese-aware text handling: NFC normalization and zero-width-space-safe cleaning.
Proofread before you export: the desktop app shows each page beside an editable text pane.
Exports: plain text (the default), Markdown (with YAML frontmatter), and JSONL (for NLP and LLM workflows).
A desktop GUI and a scriptable CLI, both driving the same engine.
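
The smart OCR routing rule in the list above can be sketched in a few lines. This is a minimal illustration, not Lexo's actual implementation: the function name `route_pages`, the `min_chars` threshold, and the idea of deciding by counting non-whitespace characters are all assumptions made for this example. The `force_ocr` parameter only mirrors the documented `--force-ocr` flag. The input is the embedded text of each page, as a PDF library would return it.

```python
from typing import Literal

Route = Literal["text-layer", "ocr"]


def route_pages(
    embedded_texts: list[str],
    min_chars: int = 1,
    force_ocr: bool = False,
) -> list[Route]:
    """Decide, page by page, whether to reuse the embedded text or run OCR.

    Hypothetical heuristic: a page counts as "digital" when its embedded
    text holds at least `min_chars` non-whitespace characters.
    """
    routes: list[Route] = []
    for text in embedded_texts:
        visible = sum(1 for ch in text if not ch.isspace())
        if force_ocr or visible < min_chars:
            routes.append("ocr")
        else:
            routes.append("text-layer")
    return routes


# Page 1 has a text layer; pages 2 and 3 are effectively empty (scanned).
pages = ["Chapter 1\nIt was a bright day.", "", "   \n"]
print(route_pages(pages))  # ['text-layer', 'ocr', 'ocr']
```

Deciding per page rather than per file means a mixed document (digital front matter, scanned appendix) only pays the OCR cost for the pages that need it.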

Install

Lexo is a Python package. With uv:

uv tool install lexo            # the `lexo` CLI and `lexo gui`

Without uv, use any standard Python installer:

pipx install lexo
# or
python -m pip install lexo

Everything is included in a single install; there are no separate system dependencies to set up.

Quick start

# Digital PDF: extract the embedded text, instantly (plain text by default)
lexo extract report.pdf -o report.txt

# Scanned PDF or image: OCR it (Burmese by default) with your Google account
lexo login
lexo ocr scan.pdf --lang my -o scan.txt

# PDF operations
lexo pdf extract book.pdf --pages "1-3,7,10-" -o subset.pdf
lexo pdf split book.pdf --every 10
lexo pdf crop book.pdf --top 8 --bottom 8 -o trimmed.pdf

# Launch the desktop app
lexo gui

Run lexo --help (or lexo pdf --help) for the full command list.
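
The `--pages "1-3,7,10-"` argument in the quick start suggests a compact page-range syntax. The sketch below shows one way such a spec could be expanded. The semantics are assumptions, not confirmed behavior of Lexo: pages are taken to be 1-based, ranges inclusive, and an open-ended range such as `10-` is taken to run to the last page. The function name `parse_page_ranges` is hypothetical.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1-3,7,10-" into 1-based page numbers.

    Assumed semantics: 1-based pages, inclusive ranges, and a missing
    start or end meaning "from the first page" or "to the last page".
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text) if start_text.strip() else 1
            end = int(end_text) if end_text.strip() else page_count
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_ranges("1-3,7,10-", page_count=12))
# [1, 2, 3, 7, 10, 11, 12]
```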

Commands

Command	Purpose
`lexo extract <pdf>`	Extract the embedded text layer of a digital PDF
`lexo ocr <pdf|image>`	OCR a scanned document (`--lang`, `--force-ocr`)
`lexo pdf info|extract|split|crop|rotate|merge|split-spread`	PDF operations
`lexo login` / `lexo logout`	Sign in to / out of Google (token stored in the OS keychain)
`lexo gui`	Launch the desktop app
`lexo info`	Show the version and where Lexo stores its data
`lexo check-update`	Check PyPI for a newer release

All output formats are available via --format text|markdown|jsonl.
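
To give a feel for the difference between the JSONL and Markdown outputs, here is a minimal sketch of both. The record fields (`source`, `page`, `text`) and the frontmatter keys are hypothetical; Lexo's real schema is not documented here and may differ.

```python
import json


def to_jsonl(source: str, pages: list[str]) -> str:
    """One JSON object per line, one line per page (hypothetical schema)."""
    lines = [
        # ensure_ascii=False keeps Burmese text readable instead of \uXXXX escapes.
        json.dumps({"source": source, "page": number, "text": text}, ensure_ascii=False)
        for number, text in enumerate(pages, start=1)
    ]
    return "".join(line + "\n" for line in lines)


def to_markdown(source: str, pages: list[str]) -> str:
    """YAML frontmatter followed by the page texts (hypothetical fields)."""
    # A JSON string literal is also a valid double-quoted YAML scalar.
    frontmatter = f"---\nsource: {json.dumps(source)}\npages: {len(pages)}\n---\n"
    body = "\n\n".join(pages)
    return f"{frontmatter}\n{body}\n"


pages = ["မင်္ဂလာပါ", "Hello"]
print(to_jsonl("scan.pdf", pages), end="")
print(to_markdown("scan.pdf", pages), end="")
```

JSONL suits NLP and LLM pipelines because each line can be parsed independently, while the Markdown form keeps the metadata and the text in one human-readable file.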

Google Docs OCR setup (one-time)

OCR uses Google Docs OCR, which is free and runs on your own Google account. You bring your own OAuth client credentials (credentials.json). It is a one-time setup:

Create or pick a Google Cloud project at the Google Cloud Console.
Enable the Google Drive API: APIs & Services -> Library -> search "Google Drive API" -> Enable.
Configure the OAuth consent screen: APIs & Services -> OAuth consent screen -> User type External -> add an app name and your email, then add your own Google account under Test users.
Create the OAuth client: APIs & Services -> Credentials -> Create credentials -> OAuth client ID -> Application type Desktop app -> Create -> Download JSON, and rename the file to credentials.json.
Place credentials.json where Lexo looks for it (first match wins):
- the path in the LEXO_GOOGLE_CREDENTIALS environment variable, or
- your Lexo config directory (run lexo info to see it), or
- the current working directory.
Sign in: run lexo login (or in the GUI, Account -> Sign in with Google). A browser opens; approve access. The token is saved in your OS keychain, and credentials.json is only read during login.

Notes:

Lexo requests only the least-privilege drive.file scope, so it can touch only the temporary files it creates while running OCR.
While the OAuth app stays in Testing status, Google expires the sign-in roughly every 7 days, so you may need to run lexo login again periodically.
Sign out any time with lexo logout (or Account -> Sign out); this removes the stored token.

Burmese notes

The OCR language hint defaults to my; override with --lang.
Extracted text is normalized to Unicode NFC, and zero-width spaces (U+200B) are preserved rather than stripped during cleaning.
A Myanmar Unicode font (Noto Sans Myanmar, SIL Open Font License) is bundled so Burmese renders in the GUI regardless of installed system fonts. The license travels with it as OFL.txt.
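
The NFC normalization and zero-width-space-safe cleaning described above can be sketched with the standard library alone. The function name `clean_text` and the exact cleaning rules (collapsing runs of spaces and tabs, trimming line ends) are assumptions for illustration; only the two guarantees, NFC output and preserved U+200B, come from the notes above.

```python
import re
import unicodedata

ZWSP = "\u200b"  # zero-width space, an invisible break point between words


def clean_text(text: str) -> str:
    """Normalize to NFC and tidy whitespace without touching zero-width spaces."""
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of ordinary spaces and tabs only; U+200B is not in this class.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim ordinary spaces at line ends; strip(" ") leaves U+200B alone.
    lines = [line.strip(" ") for line in text.split("\n")]
    return "\n".join(lines)


# U+1025 + U+102E is the decomposed form of U+1026 (MYANMAR LETTER UU).
assert clean_text("\u1025\u102e") == "\u1026"

sample = "မြန်မာ" + ZWSP + "စာ  \t ပေ  "
cleaned = clean_text(sample)
print(ZWSP in cleaned)        # True
print(cleaned.endswith("ပေ"))  # True
```

Spelling out the whitespace class as `[ \t]` instead of relying on a broad "invisible character" filter is what keeps the cleaning step from deleting the zero-width spaces.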

Tech stack

Area	Tools
Language	Python 3.11+
CLI	Typer
Desktop GUI	PySide6 (Qt)
PDF engine	PyMuPDF
Images	Pillow
OCR	Google Docs OCR via the Google Drive API (`google-api-python-client` + `google-auth`)
Credentials	keyring (OS keychain)
Settings	pydantic-settings (env-var config)
Logging	structlog
Paths	platformdirs
Build & packaging	uv + Hatchling
Quality	Ruff, mypy, pytest
CI/CD	GitHub Actions, PyPI Trusted Publishing

Development

uv sync
uv run ruff check src tests
uv run mypy src/lexo
uv run pytest

Design notes live in docs/ARCHITECTURE.md.

License

AGPL-3.0, to align with PyMuPDF. See LICENSE.

Release history

0.1.1 (this version): Jun 17, 2026
0.1.0: Jun 17, 2026

Download files

Source Distribution

lexo-0.1.1.tar.gz (440.0 kB)

Uploaded Jun 17, 2026 Source

Built Distribution

lexo-0.1.1-py3-none-any.whl (453.5 kB)

Uploaded Jun 17, 2026 Python 3

File details

Details for the file lexo-0.1.1.tar.gz.

File metadata

Download URL: lexo-0.1.1.tar.gz
Upload date: Jun 17, 2026
Size: 440.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lexo-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`3d9e66d2521d354489e4df77553601f2d05ee680684441bbdc16105e6720751b`
MD5	`526a341cb2511ce2dad1398714ed785f`
BLAKE2b-256	`66f21f379af91bde082ea0b6c03f59ad91324bb0c833a397f5166d25e9d702df`
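
A downloaded file can be checked against the SHA256 digest in the table above with a short script. This is a minimal sketch: the helper names `sha256_of` and `matches` are made up for this example, and the archive is assumed to sit in the current directory under its published name.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for lexo-0.1.1.tar.gz
EXPECTED_SHA256 = "3d9e66d2521d354489e4df77553601f2d05ee680684441bbdc16105e6720751b"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


archive = Path("lexo-0.1.1.tar.gz")  # assumed download location
if archive.is_file():
    print("digest OK" if matches(archive, EXPECTED_SHA256) else "digest MISMATCH")
else:
    print(f"{archive} not found; download it first")
```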

Provenance

The following attestation bundles were made for lexo-0.1.1.tar.gz:

Publisher: release.yml on PhilixTheExplorer/lexo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lexo-0.1.1.tar.gz
- Subject digest: 3d9e66d2521d354489e4df77553601f2d05ee680684441bbdc16105e6720751b
- Sigstore transparency entry: 1849919672
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: PhilixTheExplorer/lexo@85f35fbfe1248f8a988006b99e92c181e59a4e94
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/PhilixTheExplorer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@85f35fbfe1248f8a988006b99e92c181e59a4e94
- Trigger Event: push

File details

Details for the file lexo-0.1.1-py3-none-any.whl.

File metadata

Download URL: lexo-0.1.1-py3-none-any.whl
Upload date: Jun 17, 2026
Size: 453.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lexo-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`90fa85211aead4be5b4415bbb6d0d2c684e37c2bf7db15be0b781bf85ed04388`
MD5	`61ccc78586589212a605d8dd2073bd3b`
BLAKE2b-256	`3018c15f9888770038efdbe020dd452d77a7302222cfb8da0e47f3868d0ae587`

Provenance

The following attestation bundles were made for lexo-0.1.1-py3-none-any.whl:

Publisher: release.yml on PhilixTheExplorer/lexo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lexo-0.1.1-py3-none-any.whl
- Subject digest: 90fa85211aead4be5b4415bbb6d0d2c684e37c2bf7db15be0b781bf85ed04388
- Sigstore transparency entry: 1849919813
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: PhilixTheExplorer/lexo@85f35fbfe1248f8a988006b99e92c181e59a4e94
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/PhilixTheExplorer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@85f35fbfe1248f8a988006b99e92c181e59a4e94
- Trigger Event: push
