Indian place-name lookup, OCR address cleanup, all-India village vocabulary, correction, extraction, and address intelligence

Indic Places Library

Indian place-name lookup, fuzzy matching, OCR address cleanup, and merged-word address segmentation for Python.

indic-places is useful when Indian addresses extracted from OCR output or scanned PDFs have words joined together without spaces. It uses a large Indian place-name vocabulary to identify cities, towns, villages, districts, postal-place aliases, and common address tokens.

Install from PyPI

Install latest version:

pip install --upgrade indic-places

Force a fresh install of the latest version, bypassing the pip cache:

python -m pip install --no-cache-dir --upgrade --force-reinstall indic-places

Install exact version:

python -m pip install indic-places==1.1.7

Add to requirements.txt:

indic-places>=1.1.7

Import

PyPI package name:

indic-places

Python import name:

from indic_places import IndicPlaces

Data Stats

Metric                                  Count
Structured GeoNames + postal records    815,477
Unique Indian place names               817,641
Runtime OCR/custom place aliases        652,331
Coverage                                India-wide

Quick Python Usage

from indic_places import IndicPlaces

ip = IndicPlaces()

address = "PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586"
print(ip.normalize_address_spacing(address))

Output:

PILASSERY ADIVARAM PUTHUPPADI ADIVARAM PUDUPADI KATTIPARA ADIVARAM THAMARASSERY KOZHIKODE - 673586

Use from CMD / Terminal

Check installed version

python -c "import importlib.metadata as m; print(m.version('indic-places'))"

Normalize an OCR address from CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.normalize_address_spacing('PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586'))"

Place lookup from CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.lookup('Bangalor', top_n=5))"

Extract places from text using CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.extract_places('PONMINISSERY HOUSE PERAMBRA THRISSUR 680689'))"

Word segmentation from CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); r=ip.segment('iliveinmumbaiorkerala'); print(r.segmented)"

CLI Usage

If the console script is available after install:

indic-places stats
indic-places lookup Bangalor
indic-places segment iliveinmumbaiorkerala
indic-places extract "PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"

If the command is not found, use:

python -m indic_places.cli stats

What This Library Solves

OCR may return Indian addresses like:

PILASSERYADIVARAMPUTHUPPADIADIVARAM
KUNNUMPURATHHOUSEKALLARAP.O
THAMARASSERYKOZHIKODE

This library helps convert merged OCR text into cleaner address text by using Indian place names and address vocabulary.

Example:

from indic_places import IndicPlaces

ip = IndicPlaces()

raw = "KUNNUMPURATHHOUSE KALLARA P.O KOTTAYAM - 686611"
clean = ip.normalize_address_spacing(raw)

print(clean)

Output style:

KUNNUMPURATH HOUSE KALLARA P.O KOTTAYAM - 686611
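
To make the idea of vocabulary-driven spacing concrete, here is a minimal sketch of how a merged token can be split against a word list. It is not the package's actual algorithm: TOY_VOCAB, split_merged, and space_address are hypothetical names used only for illustration, and the real vocabulary is far larger.

```python
from typing import Optional

# Toy vocabulary for illustration only (assumption, not the package data).
TOY_VOCAB = {"KUNNUMPURATH", "HOUSE", "KALLARA", "THAMARASSERY", "KOZHIKODE"}


def split_merged(token: str, vocab: set[str]) -> Optional[list[str]]:
    """Split token into vocabulary words using the fewest pieces, or return None."""
    n = len(token)
    # best[i] holds the shortest word list that exactly covers token[:i].
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = token[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


def space_address(text: str, vocab: set[str] = TOY_VOCAB) -> str:
    """Insert spaces into merged tokens; leave unknown tokens untouched."""
    out = []
    for token in text.split():
        parts = split_merged(token.upper(), vocab)
        out.append(" ".join(parts) if parts else token)
    return " ".join(out)


print(space_address("KUNNUMPURATHHOUSE KALLARA P.O KOTTAYAM - 686611"))
# KUNNUMPURATH HOUSE KALLARA P.O KOTTAYAM - 686611
```

Tokens that cannot be fully covered by the vocabulary (here P.O, KOTTAYAM, and the PIN code) pass through unchanged, which is why a large place-name vocabulary matters for real OCR text.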

Main Features

  • Indian place-name lookup
  • OCR merged-address spacing
  • Fuzzy lookup for misspelled place names
  • Word segmentation for merged text
  • Place extraction from address text
  • India-wide GeoNames and postal vocabulary
  • Runtime OCR/custom place aliases from indic_places/data/custom_places.txt

Recommended Integration Pattern

For large OCR/document pipelines, do not construct IndicPlaces() for every document or address. Create one instance and reuse it.

from indic_places import IndicPlaces

_PLACE_ENGINE = IndicPlaces()


def clean_address(address: str) -> str:
    address = " ".join(str(address or "").split()).strip(" ,:-|")

    if not address:
        return ""

    return _PLACE_ENGINE.normalize_address_spacing(address)

Use it after your extraction logic has already identified the address candidate.

raw_address = "PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586"
final_address = clean_address(raw_address)
print(final_address)
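
For batch jobs, the same reuse pattern can be extended to a list of candidates. The sketch below is one possible shape, not part of the package: clean_many is a hypothetical helper that takes the normalizer as a parameter, caches repeated inputs, and falls back to the pre-cleaned text if the normalizer raises.

```python
from typing import Callable, Iterable


def clean_many(addresses: Iterable[object], normalize: Callable[[str], str]) -> list[str]:
    """Clean a batch of address candidates with one shared normalizer."""
    cache: dict[str, str] = {}
    cleaned: list[str] = []
    for raw in addresses:
        # Same whitespace and edge-punctuation cleanup as clean_address().
        text = " ".join(str(raw or "").split()).strip(" ,:-|")
        if not text:
            cleaned.append("")
            continue
        if text not in cache:
            try:
                cache[text] = normalize(text)
            except Exception:
                # Keep the pre-cleaned text rather than fail the whole batch.
                cache[text] = text
        cleaned.append(cache[text])
    return cleaned
```

With the shared engine from above, a call would look like clean_many(rows, _PLACE_ENGINE.normalize_address_spacing), where rows is your own list of extracted address strings.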

Use with an Existing Address Extractor

If your project already has a final address cleanup function, call normalize_address_spacing() there.

from indic_places import IndicPlaces

_PLACE_ENGINE = IndicPlaces()


def finalize_address(address: str) -> str:
    address = " ".join(str(address or "").split()).strip(" ,:-|")

    if not address:
        return ""

    address = _PLACE_ENGINE.normalize_address_spacing(address)

    return " ".join(address.split()).strip(" ,:-|")

If your extraction function stores a best address candidate before returning, normalize before storing the final value.

def evaluate_and_store_address(candidate: str) -> bool:
    candidate = finalize_address(candidate)

    if not candidate:
        return False

    # Store candidate in your output dictionary/model.
    return True

Complete Address Analysis

Use analyze_address() when you want spacing, extraction, correction, and details together.

from indic_places import IndicPlaces

ip = IndicPlaces()

result = ip.analyze_address("indrapuriratibadbhopalmadhyapradesh")

print(result["clean_address"])
print(result["places"])
print(result["corrections"])

It returns a dictionary of this shape:

{
    "raw_address": "...",
    "clean_address": "...",
    "places": [
        {
            "text_found": "...",
            "name": "...",
            "state": "...",
            "district": "...",
            "pincode": "...",
            "score": ...
        }
    ],
    "corrections": [
        {
            "input": "...",
            "corrected": "...",
            "state": "...",
            "district": "...",
            "pincode": "..."
        }
    ],
    "tokens": [...]
}

This is useful for OCR address pipelines that need spacing, extraction, correction, and state/district/pincode details from a single call.
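
Downstream code often needs only the single most likely place from that dictionary. The sketch below shows one way to pick it, relying only on the keys listed in the shape above. best_place and summarize are hypothetical helpers, and the code assumes "score" is numeric with higher values meaning a better match.

```python
from typing import Any, Optional


def best_place(result: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the highest-scoring entry of result["places"], or None."""
    places = result.get("places") or []
    if not places:
        return None
    # Assumption: "score" is numeric and higher means a better match.
    return max(places, key=lambda p: p.get("score") or 0)


def summarize(result: dict[str, Any]) -> str:
    """Build a one-line summary: clean address plus best place details."""
    place = best_place(result)
    clean = result.get("clean_address", "")
    if place is None:
        return clean
    details = [place.get(k, "") for k in ("name", "district", "state", "pincode")]
    # Skip empty fields such as a missing pincode.
    return f"{clean} [{', '.join(d for d in details if d)}]"
```

A typical call would be summarize(ip.analyze_address(raw_text)).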

Correct Place Name Search

indic-places also supports correction-style place search through:

ip.correct_place_name(...)
ip.correct_place(...)

Use these when the user input is incomplete, misspelled, or has missing letters.

Examples:

from indic_places import IndicPlaces

ip = IndicPlaces()

print(ip.correct_place_name("bhop"))          # Bhopal
print(ip.correct_place_name("bhopa"))         # Bhopal
print(ip.correct_place_name("kera"))          # Kerala
print(ip.correct_place_name("jhark"))         # Jharkhand
print(ip.correct_place_name("hrissu kerala")) # Thrissur

For correction with details:

from indic_places import IndicPlaces

ip = IndicPlaces()

print(ip.correct_place("bhop"))
print(ip.correct_place("kera"))
print(ip.correct_place("jhark"))

Expected output style:

{'name': 'Bhopal', 'state': 'MADHYA PRADESH', 'district': 'BHOPAL', 'pincode': ''}
{'name': 'Kerala', 'state': 'KERALA', 'district': '', 'pincode': ''}
{'name': 'Jharkhand', 'state': 'JHARKHAND', 'district': '', 'pincode': ''}

Difference between lookup and correction:

lookup()              = gives search suggestions
correct_place_name()  = gives one clean corrected place name
correct_place()       = gives corrected place name with state/district/pincode details

This is useful for search boxes, OCR correction, address parsing, and user-entered location cleanup.
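
To illustrate what "one clean corrected place name" means, here is a small stand-alone sketch of correction-style search built on the standard library. It is not how indic-places is implemented: TOY_PLACES and toy_correct are hypothetical, and the approach (prefix match, then difflib similarity) is a simple stand-in chosen for clarity.

```python
import difflib
from typing import Optional

# Toy vocabulary for illustration only (assumption, not the package data).
TOY_PLACES = ["Bhopal", "Kerala", "Jharkhand", "Thrissur"]


def toy_correct(query: str, places: list[str] = TOY_PLACES, cutoff: float = 0.6) -> Optional[str]:
    """Return one corrected name: prefix match first, then closest fuzzy match."""
    q = query.strip().casefold()
    if not q:
        return None
    by_key = {p.casefold(): p for p in places}
    # 1) Incomplete input: accept a name that starts with the query.
    prefix_hits = sorted(k for k in by_key if k.startswith(q))
    if prefix_hits:
        return by_key[prefix_hits[0]]
    # 2) Misspelled input: fall back to difflib's similarity ratio.
    close = difflib.get_close_matches(q, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None


print(toy_correct("bhop"))    # Bhopal
print(toy_correct("hrissu"))  # Thrissur
```

Unlike a suggestion list, the function commits to a single answer or returns None, which mirrors the lookup-versus-correction distinction described above.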

CMD Examples

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place_name('bhop'))"
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place_name('kera')); print(ip.correct_place_name('jhark'))"
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place('hrissu kerala'))"

Lookup Places

from indic_places import IndicPlaces

ip = IndicPlaces()

results = ip.lookup("Bangalor", top_n=5)

for r in results:
    print(r.name, r.state, r.district, r.pincode, r.score)
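
Because lookup() returns suggestions rather than a decision, callers usually apply their own acceptance rule. The sketch below shows one option. pick_confident is a hypothetical helper that relies only on the score attribute shown above; the score scale is not documented here, so min_score is left for you to choose after inspecting real results.

```python
from typing import Any, Iterable, Optional


def pick_confident(results: Iterable[Any], min_score: float) -> Optional[Any]:
    """Return the top-scoring result if it reaches min_score, else None."""
    # Assumption: higher score means a better match.
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    if ranked and ranked[0].score >= min_score:
        return ranked[0]
    return None
```

Returning None for weak matches lets the caller keep the original text instead of replacing it with a doubtful suggestion.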

Extract Places from Text

from indic_places import IndicPlaces

ip = IndicPlaces()

text = "PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"
places = ip.extract_places(text)

for p in places:
    print(p.name, p.state, p.district, p.pincode)
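
Extracted places carry a pincode field, so it can help to compare it with the PIN code that actually appears in the text. The following sketch is independent of the package: find_pincodes and pincode_mismatches are hypothetical helpers, and the pattern assumes the usual six-digit Indian PIN format that does not start with 0.

```python
import re

# Six digits, first digit non-zero, not embedded in a longer digit run.
PIN_RE = re.compile(r"(?<!\d)[1-9]\d{5}(?!\d)")


def find_pincodes(text: str) -> list[str]:
    """Return every standalone six-digit PIN-like number in text, in order."""
    return PIN_RE.findall(text)


def pincode_mismatches(text: str, place_pincodes: list[str]) -> list[str]:
    """Return non-empty place PIN codes that do not appear in the text."""
    in_text = set(find_pincodes(text))
    return [pin for pin in place_pincodes if pin and pin not in in_text]


print(find_pincodes("PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"))
# ['680689']
```

With the loop above, a check could look like pincode_mismatches(text, [p.pincode for p in places]); a non-empty result flags places whose stored PIN code disagrees with the document.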

Word Segmentation

from indic_places import IndicPlaces

ip = IndicPlaces()

result = ip.segment("iliveinmumbaiorkerala")
print(result.segmented)
print(result.score)

Data Files

Runtime package data:

indic_places/data/address_terms.txt
indic_places/data/custom_places.txt
indic_places/data/places_index.json.gz

Supporting/reference data in repository:

data/unique_place_names.txt
data/geonames_india_places_full.csv.gz
data/by_state_geonames/

Data Sources and Attribution

This package includes place-name vocabulary derived from open geographical datasets, including the GeoNames India gazetteer and postal data.

GeoNames data is licensed under Creative Commons Attribution 4.0. Please credit GeoNames when using data derived from GeoNames.

Suggested attribution:

This product includes data derived from GeoNames (https://www.geonames.org/), licensed under CC BY 4.0.

The data is provided as-is and may contain spelling variants, alternate names, outdated entries, or OCR-specific aliases.

Privacy and Project Neutrality

This package is public and project-neutral.

It does not include private project names, private customer data, private document text, or proprietary extraction logic. Use it as a reusable Indian place-name and OCR address cleanup utility.

Troubleshooting

Old version still installing

python -m pip uninstall indic-places -y
python -m pip install --no-cache-dir --upgrade --force-reinstall indic-places

Check installed version

python -c "import importlib.metadata as m; print(m.version('indic-places'))"

Command not found

python -m indic_places.cli stats

Works locally but not after pip install

Make sure the package data files are included in the published wheel. Check the packaging configuration in:

MANIFEST.in
pyproject.toml

and confirm that these files are shipped:

indic_places/data/custom_places.txt
indic_places/data/address_terms.txt
indic_places/data/places_index.json.gz
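
One way to confirm that the data files made it into an installed copy is to look for them through importlib.resources. This is a generic sketch, not a tool shipped with the package: missing_data_files is a hypothetical helper and it requires Python 3.9 or newer.

```python
from importlib import resources


def missing_data_files(package: str, relative_paths: list[str]) -> list[str]:
    """Return the relative paths that are not files inside the installed package."""
    root = resources.files(package)  # available since Python 3.9
    missing = []
    for rel in relative_paths:
        node = root
        # Walk one path segment at a time so nested paths work everywhere.
        for part in rel.split("/"):
            node = node.joinpath(part)
        if not node.is_file():
            missing.append(rel)
    return missing
```

For this package the call would be missing_data_files("indic_places", ["data/custom_places.txt", "data/address_terms.txt", "data/places_index.json.gz"]); an empty list means all three files are present in the installed package.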

Source Code

GitHub repository:

https://github.com/Tinku746286/indic_names_library

For normal users, install from PyPI:

pip install --upgrade indic-places

All-India Village Vocabulary

The package can include village names imported from official or LGD-style (Local Government Directory) village datasets.

The import script adds only unique village names to:

indic_places/data/custom_places.txt

Duplicate protection uses a normalized key, so names already present with different case, spacing, or punctuation are not added again.

This improves village-level matching and correction, for example:

ip.correct_place_name("DORA CHHAPR")
ip.correct_place_name("MOHAN CHHAPR")
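
The duplicate protection described above can be pictured with a short sketch. The import script's real key function is not shown in this document, so dedupe_key and add_unique below are hypothetical; they simply collapse case, spacing, and punctuation as the text describes.

```python
def dedupe_key(name: str) -> str:
    """Collapse case, spacing, and punctuation so equivalent names share a key."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def add_unique(existing: list[str], incoming: list[str]) -> list[str]:
    """Return the incoming names whose key is not already present."""
    seen = {dedupe_key(n) for n in existing}
    added = []
    for name in incoming:
        key = dedupe_key(name)
        # Skip blank entries and anything already known under another spelling.
        if key and key not in seen:
            seen.add(key)
            added.append(name)
    return added
```

Keying on alphanumeric characters only means that variants differing in hyphens, dots, or extra spaces map to the same entry, so only the first spelling seen is kept.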
