Python wrapper for LiteParse - fast, lightweight PDF and document parsing
LiteParse Python
Python wrapper for LiteParse - fast, lightweight document parsing with optional OCR.
Important: This package is a Python wrapper around the LiteParse Node.js CLI. You must have Node.js (>= 18) installed on your system. The CLI will be auto-installed via npm on first use if not already present, or you can install it manually beforehand.
Installation
Step 1: Install Node.js
LiteParse requires Node.js (>= 18). Install it from nodejs.org or via your package manager.
Step 2: Install the LiteParse CLI
npm install -g @llamaindex/liteparse
Step 3: Install the Python package
pip install liteparse
Note: If you skip Step 2, the Python package will attempt to auto-install the CLI via npm install -g @llamaindex/liteparse on first use (requires npm in your PATH).
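Because the auto-install depends on Node.js and npm being reachable, it can help to check for them before the first parse. The following is a minimal sketch using only the Python standard library; `parse_node_major` and `check_node` are hypothetical helper names and are not part of the liteparse package.

```python
import re
import shutil
import subprocess


def parse_node_major(version_output):
    """Extract the major version from `node --version` output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    return int(match.group(1)) if match else None


def check_node(min_major=18):
    """Return True if node and npm are on PATH and node meets the minimum major version."""
    node = shutil.which("node")
    npm = shutil.which("npm")
    if node is None or npm is None:
        return False
    try:
        output = subprocess.run(
            [node, "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return False
    major = parse_node_major(output)
    return major is not None and major >= min_major
```

Calling `check_node()` before constructing the parser gives a clear yes/no answer instead of a failed npm install on first use.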
Quick Start
from liteparse import LiteParse
# Create parser
parser = LiteParse()
# Parse a document
result = parser.parse("document.pdf")
print(result.text)
# Access structured data
for page in result.pages:
    print(f"Page {page.pageNum}: {len(page.textItems)} text items")
Configuration
All parsing options are passed per-call to parse():
from liteparse import LiteParse
parser = LiteParse()
result = parser.parse(
    "document.pdf",
    ocr_enabled=False,
    max_pages=10,
    dpi=150,
    preserve_very_small_text=True,
)
print(result.text)
Parsing from bytes
If you already have file contents in memory (e.g. from a web upload), pass them directly:
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
result = parser.parse(pdf_bytes)
print(result.text)
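For the web-upload case, the bytes usually arrive as a binary file-like object rather than a path. The sketch below shows one way to guard that hand-off; `parse_upload` is a hypothetical helper, and the only LiteParse call it relies on is `parser.parse(bytes)` as shown above.

```python
def parse_upload(parser, upload):
    """Read a binary file-like object fully and pass its bytes to parser.parse().

    `parser` is expected to be a LiteParse instance; `upload` is any object
    with a read() method that returns bytes (for example an open file or
    io.BytesIO).
    """
    data = upload.read()
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("upload must be opened in binary mode")
    if not data:
        raise ValueError("upload is empty")
    return parser.parse(bytes(data))
```

With a real parser this could be called as parse_upload(LiteParse(), io.BytesIO(pdf_bytes)).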
Batch Processing
For parsing multiple files, batch mode is significantly faster as it reuses the PDF engine:
from liteparse import LiteParse
parser = LiteParse()
# Parse all documents in a directory
result = parser.batch_parse(
    input_dir="./documents",
    output_dir="./output",
    ocr_enabled=False,
    recursive=True,  # Include subdirectories
    extension_filter=".pdf",  # Only PDF files
)
print(f"Output written to: {result.output_dir}")
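Before starting a large batch, it can be useful to preview which files a recursive, extension-filtered run is likely to pick up. This sketch uses pathlib only and assumes a case-insensitive suffix match; the CLI's actual matching rules may differ. `preview_batch_inputs` is a hypothetical helper, not part of liteparse.

```python
from pathlib import Path


def preview_batch_inputs(input_dir, extension_filter=".pdf", recursive=True):
    """List files under input_dir whose suffix matches extension_filter.

    Assumption: matching is a case-insensitive comparison of the file suffix.
    """
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = extension_filter.lower()
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == wanted)
```

For example, len(preview_batch_inputs("./documents")) gives a rough count of the inputs before any parsing starts.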
Supported Formats
- PDF (.pdf)
- Microsoft Office (.docx, .xlsx, .pptx, etc.) - requires LibreOffice
- OpenDocument (.odt, .ods, .odp) - requires LibreOffice
- Images (.png, .jpg, .tiff, etc.) - requires ImageMagick
- And more!
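Since Office, OpenDocument, and image inputs depend on external tools, a quick PATH check can explain why a conversion fails. The sketch below is an assumption-laden example: the executable names (soffice or libreoffice, magick or convert) are the common ones for these tools, but how LiteParse itself locates them is not documented here. `find_converters` is a hypothetical helper.

```python
import shutil

# Assumed executable names; LiteParse's own lookup may use different ones.
# Note: on Windows, "convert" is also an unrelated system utility.
CANDIDATES = {
    "LibreOffice": ("soffice", "libreoffice"),
    "ImageMagick": ("magick", "convert"),
}


def find_converters(which=shutil.which):
    """Map each tool to the first matching executable path, or None if absent."""
    found = {}
    for tool, names in CANDIDATES.items():
        found[tool] = next((path for path in map(which, names) if path), None)
    return found
```

A None value for a tool suggests that the formats depending on it will not be parseable until the tool is installed.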
Performance Tips
- Disable OCR if your documents have selectable text:
  result = parser.parse("doc.pdf", ocr_enabled=False)
- Use batch mode for multiple files to avoid cold-start overhead:
  parser.batch_parse("./input", "./output")
- Limit pages if you only need specific pages:
  result = parser.parse("doc.pdf", target_pages="1-5")
File details
Details for the file liteparse-1.2.1.tar.gz.
File metadata
- Download URL: liteparse-1.2.1.tar.gz
- Upload date:
- Size: 29.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6528cd6ceb2a9bdac342c52dde0df19204b82e185ff4919bb3225f2ffe132e3b |
| MD5 | 283941294259ca8d4635bbb438f06cac |
| BLAKE2b-256 | 2c2408bf22b40fffb231074ad413d136e25f34515ba3f552bb148ad125bd1f2c |
File details
Details for the file liteparse-1.2.1-py3-none-any.whl.
File metadata
- Download URL: liteparse-1.2.1-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4036277ffef70e5dfbbcf95f46fafb11114f4a6aaebf1c6b5b17b5afba5b1247 |
| MD5 | bd9e16bb05eb2a99fc125afe25714419 |
| BLAKE2b-256 | e8fb91f20a67c3c2784a543a6a6627173363f622fb4a5158fa412dc20080c94d |
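A manually downloaded file can be checked against the SHA256 digests listed above. This is a minimal sketch using hashlib; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the digests are copied from the File hashes tables for version 1.2.1.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the "File hashes" tables for liteparse 1.2.1.
EXPECTED_SHA256 = {
    "liteparse-1.2.1.tar.gz":
        "6528cd6ceb2a9bdac342c52dde0df19204b82e185ff4919bb3225f2ffe132e3b",
    "liteparse-1.2.1-py3-none-any.whl":
        "4036277ffef70e5dfbbcf95f46fafb11114f4a6aaebf1c6b5b17b5afba5b1247",
}


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path):
    """Return True if the file's SHA256 equals the digest listed for its name."""
    name = Path(path).name
    if name not in EXPECTED_SHA256:
        raise KeyError(f"no published digest recorded for {name}")
    return sha256_of(path) == EXPECTED_SHA256[name]
```

pip can perform the same check automatically when a requirements file pins the package with --hash=sha256:... entries.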