Local-first desktop document OCR: PDF operations, smart OCR routing, and Burmese-first text handling.
Lexo
Lexo stands for Local EXtraction and OCR: a local-first desktop document OCR tool. It turns PDFs and images into clean, editable text, with strong support for Burmese (Myanmar script) using free, high-accuracy Google Docs OCR.
PDF operations and embedded-text extraction run entirely on your machine. The only network call is the optional OCR provider, which uses your own Google account, so there is nothing to pay for.
Features
- PDF operations: extract page ranges, split, crop, rotate, merge, and split two-up spreads into separate pages.
- Visual crop and split editor in the GUI: drag a crop box on the rendered page to remove headers and page numbers, and split scanned two-up spreads. Works on a PDF or a batch of images.
- Smart OCR routing: digital PDFs use their embedded text layer (instant and lossless); only scanned pages are OCR'd.
- OCR via Google Docs OCR: free, high-accuracy (especially for Burmese), run on your own Google account. Providers are pluggable behind a single interface.
- Burmese-aware text handling: NFC normalization and zero-width-space-safe cleaning.
- Proofread before you export: the desktop app shows each page beside an editable text pane.
- Exports: plain text (the default), Markdown (with YAML frontmatter), and JSONL (for NLP and LLM workflows).
- A desktop GUI and a scriptable CLI, both driving the same engine.
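The smart OCR routing described above can be pictured as a per-page decision. The following is a minimal sketch, not Lexo's actual implementation: the names `PagePlan` and `plan_routes` and the `min_chars` threshold are assumptions made for illustration, and the real heuristic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagePlan:
    page_number: int  # 1-based page number
    route: str        # "text-layer" or "ocr"


def plan_routes(
    page_texts: list[str],
    min_chars: int = 20,      # assumed threshold, not taken from Lexo
    force_ocr: bool = False,  # mirrors the idea behind --force-ocr
) -> list[PagePlan]:
    """Decide page by page whether the embedded text is usable or OCR is needed."""
    plans = []
    for number, text in enumerate(page_texts, start=1):
        # A page with almost no embedded text is treated as a scan.
        has_text_layer = len(text.strip()) >= min_chars
        route = "text-layer" if has_text_layer and not force_ocr else "ocr"
        plans.append(PagePlan(number, route))
    return plans
```

With PyMuPDF, the per-page input could come from `page.get_text()`; only pages routed to `"ocr"` would then be sent to the OCR provider.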
Install
Lexo is a Python package. With uv:
uv tool install lexo # the `lexo` CLI and `lexo gui`
Without uv, use any standard Python installer:
pipx install lexo
# or
python -m pip install lexo
Everything is included in the one install. There are no separate system dependencies to set up.
Quick start
# Digital PDF: extract the embedded text, instantly (plain text by default)
lexo extract report.pdf -o report.txt
# Scanned PDF or image: OCR it (Burmese by default) with your Google account
lexo login
lexo ocr scan.pdf --lang my -o scan.txt
# PDF operations
lexo pdf extract book.pdf --pages "1-3,7,10-" -o subset.pdf
lexo pdf split book.pdf --every 10
lexo pdf crop book.pdf --top 8 --bottom 8 -o trimmed.pdf
# Launch the desktop app
lexo gui
Run lexo --help (or lexo pdf --help) for the full command list.
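The `--pages "1-3,7,10-"` argument above reads as a comma-separated list of pages and ranges. Here is a rough sketch of how such a spec could be expanded, assuming 1-based inclusive ranges and that an open-ended `10-` means "through the last page". The function name `parse_page_spec`, the handling of an open start such as `-3`, and the choice to raise on out-of-range pages are assumptions, not Lexo's documented behavior.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1-3,7,10-" into 1-based page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # An empty side means "from the first page" or "to the last page".
            start = int(start_text) if start_text.strip() else 1
            end = int(end_text) if end_text.strip() else page_count
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        pages.extend(range(start, end + 1))
    return pages
```

For a 12-page document, `parse_page_spec("1-3,7,10-", 12)` yields pages 1, 2, 3, 7, 10, 11, and 12.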
Commands
| Command | Purpose |
|---|---|
| `lexo extract <pdf>` | Extract the embedded text layer of a digital PDF |
| `lexo ocr <pdf\|image>` | OCR a scanned document (`--lang`, `--force-ocr`) |
| `lexo pdf info\|extract\|split\|crop\|rotate\|merge\|split-spread` | PDF operations |
| `lexo login` / `lexo logout` | Sign in to / out of Google (token stored in the OS keychain) |
| `lexo gui` | Launch the desktop app |
| `lexo info` | Show the version and where Lexo stores its data |
| `lexo check-update` | Check PyPI for a newer release |
All output formats are available via --format text|markdown|jsonl.
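To give a feel for why JSONL suits NLP and LLM workflows, here is a small sketch that writes one JSON object per page. The record fields (`source`, `page`, `text`) and the function name `pages_to_jsonl` are hypothetical; Lexo's actual JSONL schema is not described here and may differ.

```python
import json


def pages_to_jsonl(source: str, page_texts: list[str]) -> str:
    """Serialize per-page text as JSON Lines: one JSON object per line."""
    lines = []
    for number, text in enumerate(page_texts, start=1):
        record = {"source": source, "page": number, "text": text}
        # ensure_ascii=False keeps Burmese readable instead of \uXXXX escapes.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Each line can then be parsed independently with `json.loads`, which makes the output easy to stream or shard.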
Google Docs OCR setup (one-time)
OCR uses Google Docs OCR, which is free and runs on your own Google account. You
bring your own OAuth client credentials (credentials.json). It is a one-time
setup:
- Create or pick a Google Cloud project at the Google Cloud Console.
- Enable the Google Drive API: APIs & Services -> Library -> search "Google Drive API" -> Enable.
- Configure the OAuth consent screen: APIs & Services -> OAuth consent screen -> User type External -> add an app name and your email, then add your own Google account under Test users.
- Create the OAuth client: APIs & Services -> Credentials -> Create credentials -> OAuth client ID -> Application type Desktop app -> Create -> Download JSON, and rename the file to `credentials.json`.
- Place `credentials.json` where Lexo looks for it (first match wins):
  - the path in the `LEXO_GOOGLE_CREDENTIALS` environment variable, or
  - your Lexo config directory (run `lexo info` to see it), or
  - the current working directory.
- Sign in: run `lexo login` (or in the GUI, Account -> Sign in with Google). A browser opens; approve access. The token is saved in your OS keychain, and `credentials.json` is only read during login.
Notes:
- Lexo requests only the least-privilege `drive.file` scope, so it can touch only the temporary files it creates while running OCR.
- While the OAuth app stays in Testing status, Google expires the sign-in roughly every 7 days, so you may need to run `lexo login` again periodically.
- Sign out any time with `lexo logout` (or Account -> Sign out); this removes the stored token.
Burmese notes
- The OCR language hint defaults to `my`; override with `--lang`.
- Extracted text is normalized to Unicode NFC and zero-width spaces are preserved.
- A Myanmar Unicode font (Noto Sans Myanmar, SIL Open Font License) is bundled so Burmese renders in the GUI regardless of installed system fonts. The license travels with it as `OFL.txt`.
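As a concrete picture of NFC normalization with zero-width spaces preserved, here is a minimal sketch using only the standard library. The function name `clean_text` and the exact whitespace rules (collapsing runs of ASCII spaces and tabs, trimming line ends) are assumptions; Lexo's own cleaning steps may do more or less than this.

```python
import re
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"  # often used as an invisible word break in Burmese text


def clean_text(text: str) -> str:
    """Normalize to NFC and tidy ASCII whitespace without touching U+200B."""
    normalized = unicodedata.normalize("NFC", text)
    cleaned_lines = []
    for line in normalized.split("\n"):
        # Only ASCII spaces and tabs are collapsed; the pattern deliberately
        # excludes ZERO_WIDTH_SPACE, so word-break hints survive.
        line = re.sub(r"[ \t]+", " ", line).strip(" \t")
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
```

Keeping U+200B matters because downstream tools may rely on it for line breaking and word segmentation in Myanmar script.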
Tech stack
| Area | Tools |
|---|---|
| Language | Python 3.11+ |
| CLI | Typer |
| Desktop GUI | PySide6 (Qt) |
| PDF engine | PyMuPDF |
| Images | Pillow |
| OCR | Google Docs OCR via the Google Drive API (google-api-python-client + google-auth) |
| Credentials | keyring (OS keychain) |
| Settings | pydantic-settings (env-var config) |
| Logging | structlog |
| Paths | platformdirs |
| Build & packaging | uv + Hatchling |
| Quality | Ruff, mypy, pytest |
| CI/CD | GitHub Actions, PyPI Trusted Publishing |
Development
uv sync
uv run ruff check src tests
uv run mypy src/lexo
uv run pytest
Design notes live in docs/ARCHITECTURE.md.
License
AGPL-3.0, to align with PyMuPDF. See LICENSE.