Structured Information Extraction for Nepali and multilingual PDFs

LamiSema

Structured information extraction for Nepali PDFs.

Python 3.10+ | License: MIT


The problem

Standard PDF text-extraction tools silently return wrong output on most Nepali PDFs. Nepali PDFs come in three types, each needing a different approach:

PDF type | Example source | Standard tool result
Unicode-native | Modern government portals | ✅ Works fine
Legacy-encoded | Pre-2010 docs using Preeti/Kantipur font | ❌ Returns garbage (g]kfn instead of नेपाल)
Scanned | Physical forms, old records | ❌ Returns empty string
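
To see why a legacy-encoded text layer reads as garbage: fonts such as Preeti draw Devanagari glyphs at ASCII code points, so the PDF stores ASCII characters that only look like Nepali once rendered. The sketch below remaps just the five characters from the g]kfn example. `PREETI_FRAGMENT` and `remap_fragment` are hypothetical names, the mapping is a hand-written fragment rather than a complete font table, and this is not how LamiSema handles such files (per the flow under "How it works", it sends them to OCR).

```python
# Illustrative fragment only: the five ASCII-to-Devanagari pairs needed for
# the "g]kfn" example. A real legacy font table is far larger and needs
# reordering rules that a plain character map cannot express.
PREETI_FRAGMENT = {
    "g": "\u0928",  # न
    "]": "\u0947",  # vowel sign e
    "k": "\u092a",  # प
    "f": "\u093e",  # vowel sign aa
    "n": "\u0932",  # ल
}


def remap_fragment(text: str) -> str:
    """Replace each known ASCII character with its Devanagari counterpart."""
    return "".join(PREETI_FRAGMENT.get(ch, ch) for ch in text)


raw = "g]kfn"  # what a standard extractor returns for a Preeti-encoded word
print(raw, "->", remap_fragment(raw))
```

Characters missing from the fragment pass through unchanged, which is one reason a naive character map is not a dependable fix for whole documents.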

LamiSema detects the type first and automatically routes to the right strategy.


How it works

flowchart LR
    PDF[PDF Input] --> P[Pre-flight\nDetect encoding type]
    P -->|unicode_native| T[Text layer\npdfplumber]
    P -->|legacy_encoded| O[OCR\nTesseract nep+eng]
    P -->|scanned| O
    T --> N[NER + Date\nNormalization]
    O --> N
    N --> J[Structured JSON\nwith confidence scores]
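
The pre-flight step in the diagram can be pictured as a small classifier over the PDF's text layer, followed by a lookup that picks the extraction strategy. The sketch below is a guess at the general idea, not LamiSema's actual detection logic: `classify_text_layer`, `route`, and the font-name hints are all assumptions made for illustration.

```python
# Assumed hints: the two legacy font names mentioned in the problem table.
LEGACY_FONT_HINTS = ("preeti", "kantipur")

# Routing as drawn in the flowchart: only unicode_native uses the text layer.
STRATEGY = {
    "unicode_native": "text_layer",
    "legacy_encoded": "ocr",
    "scanned": "ocr",
}


def classify_text_layer(text: str, font_names: list[str]) -> str:
    """Return one of the three labels used in the diagram."""
    if not text.strip():
        return "scanned"  # no usable text layer at all
    fonts = " ".join(font_names).lower()
    if any(hint in fonts for hint in LEGACY_FONT_HINTS):
        return "legacy_encoded"  # ASCII text drawn with a legacy Nepali font
    return "unicode_native"


def route(encoding_type: str) -> str:
    """Map an encoding label to the extraction strategy from the flowchart."""
    return STRATEGY[encoding_type]


print(route(classify_text_layer("g]kfn", ["ABCDEF+Preeti"])))
```

A real detector would likely need more signals than font names, since embedded fonts are often renamed or subsetted.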

Install

pip install lamisema

System dependency — Tesseract with the Nepali language pack:

# macOS
brew install tesseract tesseract-lang

# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-nep
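
Before running the pipeline, it can help to confirm that Tesseract is on the PATH and that the `nep` language data is installed. The sketch below relies on Tesseract's `--list-langs` flag; `parse_langs` and `has_nepali_pack` are hypothetical helper names, not part of LamiSema.

```python
import shutil
import subprocess


def parse_langs(output: str) -> set[str]:
    """Parse `tesseract --list-langs` output, skipping the header line."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def has_nepali_pack() -> bool:
    """True if a `tesseract` binary is on PATH and reports the `nep` language."""
    if shutil.which("tesseract") is None:
        return False
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Older builds may print the list to stderr, so look at both streams.
    return "nep" in parse_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    print("Nepali pack installed:", has_nepali_pack())
```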

Python usage

from lamisema import LamiSema

pipeline = LamiSema()

with open("report.pdf", "rb") as f:
    result = pipeline.extract(f.read(), filename="report.pdf")

print(result.encoding_type)        # "legacy_encoded"
print(result.overall_confidence)   # 0.74
print(result.pages[0].entities)    # [Entity(type="DATE_BS", text="२०८१ साल असार १५", ...)]
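
The `DATE_BS` entity above holds a Bikram Sambat date written in Devanagari. As a rough idea of what normalizing such a string involves, the sketch below converts Devanagari digits to ASCII and pulls out year, month, and day. `parse_bs_date` and the month table are assumptions for illustration, not LamiSema's API. The table lists one common spelling per month, while real documents use several variants, and conversion to a Gregorian date needs per-year calendar tables, so it is left out.

```python
import re
from typing import Optional

# Devanagari digits (U+0966..U+096F) mapped to ASCII 0-9 for str.translate.
DIGITS = {0x0966 + i: str(i) for i in range(10)}

# One common spelling per month, in calendar order (assumed, not exhaustive).
BS_MONTHS = ["बैशाख", "जेठ", "असार", "साउन", "भदौ", "असोज",
             "कात्तिक", "मंसिर", "पुस", "माघ", "फागुन", "चैत"]

# year, optional word "saal" (U+0938 U+093E U+0932), month name, day
PATTERN = re.compile(
    r"(\d{4})\s*(?:\u0938\u093e\u0932)?\s*(\S+)\s+(\d{1,2})"
)


def parse_bs_date(text: str) -> Optional[tuple[int, int, int]]:
    """Return (year, month, day) in BS, or None if the text does not match."""
    ascii_text = text.translate(DIGITS)
    match = PATTERN.search(ascii_text)
    if match is None:
        return None
    year, month_name, day = match.groups()
    if month_name not in BS_MONTHS:
        return None  # unknown or variant spelling
    return int(year), BS_MONTHS.index(month_name) + 1, int(day)


print(parse_bs_date("२०८१ साल असार १५"))
```

Returning `None` for unrecognized spellings keeps the sketch from guessing; a production normalizer would more plausibly attach a lower confidence score instead.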

REST API

Start the server:

lamisema serve
# → http://localhost:9001/docs

# 1. Upload
curl -X POST http://localhost:9001/upload -F "file=@report.pdf"
# → { "doc_id": "DOC-A1B2C3D4" }

# 2. Detect encoding (fast, no extraction)
curl http://localhost:9001/preflight/DOC-A1B2C3D4

# 3. Extract everything
curl -X POST http://localhost:9001/extract/DOC-A1B2C3D4

# 4. Get result
curl http://localhost:9001/result/DOC-A1B2C3D4

Method | Endpoint | Description
GET | / | Health check
POST | /upload | Upload a PDF, get doc_id
GET | /preflight/{doc_id} | Encoding type + font analysis
POST | /extract/{doc_id} | Full extraction + NER
GET | /result/{doc_id} | Retrieve completed result
POST | /normalize-dates | Normalize BS dates in raw text
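
The preflight, extract, and result calls can also be driven from Python. The sketch below uses only the standard library and starts from a `doc_id` obtained through the upload step. It assumes every endpoint responds with JSON and that `/extract` has finished by the time it returns; neither is confirmed here, and `endpoint`, `call`, and `run_pipeline` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9001"  # default address from `lamisema serve`


def endpoint(path: str, doc_id: str, base_url: str = BASE_URL) -> str:
    """Build a URL such as http://localhost:9001/preflight/DOC-A1B2C3D4."""
    return f"{base_url}/{path}/{doc_id}"


def call(url: str, method: str = "GET") -> dict:
    """Send a request with no body and decode the (assumed) JSON response."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def run_pipeline(doc_id: str) -> dict:
    """Run steps 2 to 4 of the curl walkthrough for an uploaded document."""
    preflight = call(endpoint("preflight", doc_id))
    print("preflight:", preflight)
    call(endpoint("extract", doc_id), method="POST")
    return call(endpoint("result", doc_id))
```

With the server running and a document uploaded, `run_pipeline("DOC-A1B2C3D4")` would return whatever the `/result` endpoint sends back.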

Try the demo app

A full-stack demo (Next.js frontend + API + MinIO) is in application-demo/.

cd application-demo
cp .env.example .env
docker compose -f docker-compose.local.yaml up --build
# → http://localhost:3000

Docs

Full documentation at lamisema.readthedocs.io


License

MIT — see LICENSE

Download files

Source Distribution

lamisema-1.1.0.tar.gz (229.1 kB)

Built Distribution

lamisema-1.1.0-py3-none-any.whl (31.9 kB)

File details

Details for the file lamisema-1.1.0.tar.gz.

File metadata

  • Download URL: lamisema-1.1.0.tar.gz
  • Size: 229.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lamisema-1.1.0.tar.gz
Algorithm | Hash digest
SHA256 | 64956d29d79a0ab998a9b0ee85ade30e6f36370762aadfd1daa5dcfe5a42e40e
MD5 | 07fb4070fbe1d1622b93b56c1b30292f
BLAKE2b-256 | 8b5d5c28406589a9d2ac46268d2587366fc094102e9f2b73d3fc33ac0baa5bbf

Provenance

The following attestation bundles were made for lamisema-1.1.0.tar.gz:

Publisher: publish.yml on sanjiblamichhane/lamisema

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lamisema-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: lamisema-1.1.0-py3-none-any.whl
  • Size: 31.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lamisema-1.1.0-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 732d4d9e21c4fe0731bda8a5516e25829bde98537b4057c34ea49375957c25f3
MD5 | 8a26dc906cacb248f32e5ee166383a4a
BLAKE2b-256 | fb0af0b0c8f1dffccc7c7dbc38d642135a6472b9b5b5e954126447eee0195fc5

Provenance

The following attestation bundles were made for lamisema-1.1.0-py3-none-any.whl:

Publisher: publish.yml on sanjiblamichhane/lamisema
