Structured Information Extraction for Nepali and multilingual PDFs
LamiSema
Structured information extraction for Nepali PDFs.
The problem
Standard PDF text-extraction tools silently return wrong output for most Nepali PDFs. These PDFs come in three types, each needing a different approach:
| PDF type | Example source | Standard tool result |
|---|---|---|
| Unicode-native | Modern government portals | ✅ Works fine |
| Legacy-encoded | Pre-2010 docs using Preeti/Kantipur font | ❌ Returns garbage (g]kfn instead of नेपाल) |
| Scanned | Physical forms, old records | ❌ Returns empty string |
LamiSema detects the type first and automatically routes to the right strategy.
How it works
flowchart LR
PDF[PDF Input] --> P[Pre-flight\nDetect encoding type]
P -->|unicode_native| T[Text layer\npdfplumber]
P -->|legacy_encoded| O[OCR\nTesseract nep+eng]
P -->|scanned| O
T --> N[NER + Date\nNormalization]
O --> N
N --> J[Structured JSON\nwith confidence scores]
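The pre-flight step can be pictured with a small heuristic. The sketch below is not LamiSema's actual detector; it assumes the text layer has already been extracted to a string, and the function names, the `STRATEGY` table, and the 0.3 threshold are all hypothetical. It only illustrates why the three types can be told apart: scanned PDFs have no text layer, Unicode-native ones contain Devanagari code points, and legacy-encoded ones contain ASCII garbage such as `g]kfn`.

```python
def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    deva = sum(1 for ch in chars if "\u0900" <= ch <= "\u097F")
    return deva / len(chars)


def classify_text_layer(text: str, threshold: float = 0.3) -> str:
    """Guess the PDF type from its extracted text layer (illustrative heuristic only)."""
    if not text.strip():
        # No text layer at all: the pages are most likely images.
        return "scanned"
    if devanagari_ratio(text) >= threshold:
        return "unicode_native"
    # Text exists but is not Devanagari. For a document known to be Nepali this
    # suggests a legacy font such as Preeti; an English-only PDF would also
    # land here, which is why a real detector needs more than this check.
    return "legacy_encoded"


# Routing table mirroring the flowchart: only Unicode-native PDFs use the text layer.
STRATEGY = {
    "unicode_native": "text_layer",
    "legacy_encoded": "ocr",
    "scanned": "ocr",
}

for sample in ["नेपाल", "g]kfn", ""]:
    kind = classify_text_layer(sample)
    print(repr(sample), kind, STRATEGY[kind])
```

A character-ratio check like this is cheap enough to run before any extraction, which is the point of a separate pre-flight stage.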
Install
pip install lamisema
System dependency — Tesseract with the Nepali language pack:
# macOS
brew install tesseract tesseract-lang
# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-nep
Python usage
from lamisema import LamiSema
pipeline = LamiSema()
with open("report.pdf", "rb") as f:
    result = pipeline.extract(f.read(), filename="report.pdf")
print(result.encoding_type) # "legacy_encoded"
print(result.overall_confidence) # 0.74
print(result.pages[0].entities) # [Entity(type="DATE_BS", text="२०८१ साल असार १५", ...)]
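Downstream code will usually want to walk every page and decide whether a result is trustworthy enough to use unreviewed. The following is a minimal sketch of that pattern. The `Entity`, `Page`, and `Result` dataclasses are stand-ins that mirror only the attributes shown above (they are not the library's real classes), and the helper names and the 0.8 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


# Stand-in types: they copy only the attributes used in the example above.
@dataclass
class Entity:
    type: str
    text: str


@dataclass
class Page:
    entities: list[Entity] = field(default_factory=list)


@dataclass
class Result:
    encoding_type: str
    overall_confidence: float
    pages: list[Page] = field(default_factory=list)


def collect_entities(result: Result, wanted_type: str) -> list[str]:
    """Return the text of every entity of one type, across all pages."""
    return [
        entity.text
        for page in result.pages
        for entity in page.entities
        if entity.type == wanted_type
    ]


def needs_review(result: Result, threshold: float = 0.8) -> bool:
    """Flag results whose overall confidence falls below an arbitrary cut-off."""
    return result.overall_confidence < threshold


sample = Result(
    encoding_type="legacy_encoded",
    overall_confidence=0.74,
    pages=[Page([Entity("DATE_BS", "२०८१ साल असार १५")])],
)
print(collect_entities(sample, "DATE_BS"))
print(needs_review(sample))
```

With the real pipeline, the same two helpers would be applied to the object returned by `pipeline.extract(...)`, provided its attributes match the ones printed above.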
REST API
Start the server:
lamisema serve
# → http://localhost:9001/docs
# 1. Upload
curl -X POST http://localhost:9001/upload -F "file=@report.pdf"
# → { "doc_id": "DOC-A1B2C3D4" }
# 2. Detect encoding (fast, no extraction)
curl http://localhost:9001/preflight/DOC-A1B2C3D4
# 3. Extract everything
curl -X POST http://localhost:9001/extract/DOC-A1B2C3D4
# 4. Get result
curl http://localhost:9001/result/DOC-A1B2C3D4
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Health check |
| POST | /upload | Upload a PDF, get doc_id |
| GET | /preflight/{doc_id} | Encoding type + font analysis |
| POST | /extract/{doc_id} | Full extraction + NER |
| GET | /result/{doc_id} | Retrieve completed result |
| POST | /normalize-dates | Normalize BS dates in raw text |
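To make "normalize BS dates" concrete, here is a rough sketch of what normalizing a Bikram Sambat date string can involve. It is not the code behind `/normalize-dates`: the regular expression, the month table (with a few common spelling variants), and the `normalize_bs_date` name are assumptions for illustration. The output stays in the BS calendar; converting to a Gregorian date needs a per-year lookup table and is not attempted here.

```python
import re

# Bikram Sambat month names mapped to month numbers (a few spelling variants included).
BS_MONTHS = {
    "बैशाख": 1, "वैशाख": 1, "जेठ": 2, "असार": 3, "साउन": 4,
    "भदौ": 5, "असोज": 6, "कार्तिक": 7, "मंसिर": 8,
    "पुष": 9, "पुस": 9, "माघ": 10, "फागुन": 11, "चैत": 12, "चैत्र": 12,
}

# Four Devanagari digits, an optional "साल" (year) marker, a month word, then the day.
BS_DATE = re.compile(r"([०-९]{4})\s*(?:साल\s*)?(\S+)\s+([०-९]{1,2})")


def normalize_bs_date(text: str):
    """Return the first BS date in `text` as 'YYYY-MM-DD' (still BS), or None."""
    match = BS_DATE.search(text)
    if match is None:
        return None
    year, month_name, day = match.groups()
    month = BS_MONTHS.get(month_name)
    if month is None:
        return None
    # int() accepts Devanagari decimal digits, so no manual digit mapping is needed.
    return f"{int(year):04d}-{month:02d}-{int(day):02d}"


print(normalize_bs_date("२०८१ साल असार १५"))
```

The example string is the `DATE_BS` entity from the Python usage section, so the sketch shows one plausible normalized form for that entity.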
Try the demo app
A full-stack demo (Next.js frontend + API + MinIO) is in application-demo/.
cd application-demo
cp .env.example .env
docker compose -f docker-compose.local.yaml up --build
# → http://localhost:3000
Docs
Full documentation at lamisema.readthedocs.io
License
MIT — see LICENSE
File details
Details for the file lamisema-1.1.0.tar.gz.
File metadata
- Download URL: lamisema-1.1.0.tar.gz
- Upload date:
- Size: 229.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 64956d29d79a0ab998a9b0ee85ade30e6f36370762aadfd1daa5dcfe5a42e40e |
| MD5 | 07fb4070fbe1d1622b93b56c1b30292f |
| BLAKE2b-256 | 8b5d5c28406589a9d2ac46268d2587366fc094102e9f2b73d3fc33ac0baa5bbf |
Provenance
The following attestation bundles were made for lamisema-1.1.0.tar.gz:
Publisher: publish.yml on sanjiblamichhane/lamisema
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lamisema-1.1.0.tar.gz
- Subject digest: 64956d29d79a0ab998a9b0ee85ade30e6f36370762aadfd1daa5dcfe5a42e40e
- Sigstore transparency entry: 1340160718
- Sigstore integration time:
- Permalink: sanjiblamichhane/lamisema@698f0c18588ec37bfff4cc546ce029f10402a502
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/sanjiblamichhane
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@698f0c18588ec37bfff4cc546ce029f10402a502
- Trigger Event: release
File details
Details for the file lamisema-1.1.0-py3-none-any.whl.
File metadata
- Download URL: lamisema-1.1.0-py3-none-any.whl
- Upload date:
- Size: 31.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 732d4d9e21c4fe0731bda8a5516e25829bde98537b4057c34ea49375957c25f3 |
| MD5 | 8a26dc906cacb248f32e5ee166383a4a |
| BLAKE2b-256 | fb0af0b0c8f1dffccc7c7dbc38d642135a6472b9b5b5e954126447eee0195fc5 |
Provenance
The following attestation bundles were made for lamisema-1.1.0-py3-none-any.whl:
Publisher: publish.yml on sanjiblamichhane/lamisema
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lamisema-1.1.0-py3-none-any.whl
- Subject digest: 732d4d9e21c4fe0731bda8a5516e25829bde98537b4057c34ea49375957c25f3
- Sigstore transparency entry: 1340160728
- Sigstore integration time:
- Permalink: sanjiblamichhane/lamisema@698f0c18588ec37bfff4cc546ce029f10402a502
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/sanjiblamichhane
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@698f0c18588ec37bfff4cc546ce029f10402a502
- Trigger Event: release