Indian place-name lookup, OCR address cleanup, place extraction, and address analysis over a large Indian place-name vocabulary, with fast correction through an instant precheck and a SQLite prefix index

Indic Places Library

Indian place-name lookup, fuzzy matching, OCR address cleanup, and merged-word address segmentation for Python.

indic-places is useful when Indian addresses are extracted from OCR/scanned PDFs and words are joined together without spaces. It uses a large Indian place-name vocabulary to identify cities, towns, villages, districts, postal-place aliases, and common address tokens.

Install from PyPI

Install latest version:

pip install --upgrade indic-places

Force latest version without cache:

python -m pip install --no-cache-dir --upgrade --force-reinstall indic-places

Install exact version:

python -m pip install indic-places==1.1.7

Add to requirements.txt:

indic-places>=1.1.7

Import

PyPI package name:

indic-places

Python import name:

from indic_places import IndicPlaces

Data Stats

Metric	Count
Structured GeoNames + postal records	815,477
Unique structured place names	817,641
Runtime OCR/custom place aliases	1,502,371
Approx. total vocabulary entries across structured + custom layers	2,317,848
Coverage	India-wide + expanded South India/Kerala LGD vocabulary

Note: custom_places.txt is the large runtime OCR/custom vocabulary layer. Structured GeoNames/postal records remain stored separately in places.json and places_index.json.gz.

Quick Python Usage

from indic_places import IndicPlaces

ip = IndicPlaces()

address = "PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586"
print(ip.normalize_address_spacing(address))

Output:

PILASSERY ADIVARAM PUTHUPPADI ADIVARAM PUDUPADI KATTIPARA ADIVARAM THAMARASSERY KOZHIKODE - 673586

Use from CMD / Terminal

Check installed version

python -c "import importlib.metadata as m; print(m.version('indic-places'))"

Normalize an OCR address from CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.normalize_address_spacing('PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586'))"

Place lookup from CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.lookup('Bangalor', top_n=5))"

Extract places from text using CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.extract_places('PONMINISSERY HOUSE PERAMBRA THRISSUR 680689'))"

Word segmentation from CMD

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); r=ip.segment('iliveinmumbaiorkerala'); print(r.segmented)"

CLI Usage

If the console script is available after install:

indic-places stats

indic-places lookup Bangalor

indic-places segment iliveinmumbaiorkerala

indic-places extract "PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"

If the command is not found, use:

python -m indic_places.cli stats

What This Library Solves

OCR may return Indian addresses like:

PILASSERYADIVARAMPUTHUPPADIADIVARAM
KUNNUMPURATHHOUSEKALLARAP.O
THAMARASSERYKOZHIKODE

This library helps convert merged OCR text into cleaner address text by using Indian place names and address vocabulary.

Example:

from indic_places import IndicPlaces

ip = IndicPlaces()

raw = "KUNNUMPURATHHOUSE KALLARA P.O KOTTAYAM - 686611"
clean = ip.normalize_address_spacing(raw)

print(clean)

Output style:

KUNNUMPURATH HOUSE KALLARA P.O KOTTAYAM - 686611
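
The general idea of splitting merged text against a vocabulary can be sketched as follows. This is a minimal illustration, not the library's implementation: the VOCAB set and the split_merged() helper are hypothetical, and the real vocabulary is far larger.

```python
from __future__ import annotations

from functools import lru_cache

# Hypothetical mini-vocabulary; the real library ships a far larger one.
VOCAB = {"KUNNUMPURATH", "HOUSE", "KALLARA", "THAMARASSERY", "KOZHIKODE"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)


def split_merged(text: str) -> list[str] | None:
    """Split text into vocabulary words, or return None if no full split exists."""

    @lru_cache(maxsize=None)
    def solve(start: int) -> tuple[str, ...] | None:
        if start == len(text):
            return ()
        # Try longer words first so long place names win over short fragments.
        for end in range(min(len(text), start + MAX_WORD_LEN), start, -1):
            word = text[start:end]
            if word in VOCAB:
                rest = solve(end)
                if rest is not None:
                    return (word, *rest)
        return None

    result = solve(0)
    return list(result) if result is not None else None


print(split_merged("KUNNUMPURATHHOUSE"))      # ['KUNNUMPURATH', 'HOUSE']
print(split_merged("THAMARASSERYKOZHIKODE"))  # ['THAMARASSERY', 'KOZHIKODE']
```

The sketch backtracks when a long first match leaves an unsplittable remainder, which is why it memoizes results per start position.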

Main Features

Indian place-name lookup
OCR merged-address spacing
Fuzzy lookup for misspelled place names
Word segmentation for merged text
Place extraction from address text
India-wide GeoNames and postal vocabulary
Runtime OCR/custom place aliases from indic_places/data/custom_places.txt

Recommended Integration Pattern

For large OCR/document pipelines, create IndicPlaces() once and reuse it instead of constructing a new instance for every document.

from indic_places import IndicPlaces

_PLACE_ENGINE = IndicPlaces()


def clean_address(address: str) -> str:
    address = " ".join(str(address or "").split()).strip(" ,:-|")

    if not address:
        return ""

    return _PLACE_ENGINE.normalize_address_spacing(address)

Use it after your extraction logic has already identified the address candidate.

raw_address = "PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586"
final_address = clean_address(raw_address)
print(final_address)
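
If building the engine at import time is undesirable, the same create-once pattern can be made lazy. The sketch below uses a hypothetical StandInEngine class in place of IndicPlaces so that it runs without the package installed; only the caching pattern is the point.

```python
from functools import lru_cache


class StandInEngine:
    """Hypothetical stand-in for an expensive-to-build engine such as IndicPlaces."""

    instances_built = 0

    def __init__(self) -> None:
        StandInEngine.instances_built += 1

    def normalize_address_spacing(self, address: str) -> str:
        # Placeholder behaviour: only collapses whitespace.
        return " ".join(address.split())


@lru_cache(maxsize=1)
def get_engine() -> StandInEngine:
    """Build the engine on first use and return the same object afterwards."""
    return StandInEngine()


def clean_address(address: str) -> str:
    return get_engine().normalize_address_spacing(address)


for raw in ["A  B", "C   D", "E F"]:
    clean_address(raw)

print(StandInEngine.instances_built)  # 1
```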

Use with an Existing Address Extractor

If your project already has a final address cleanup function, call normalize_address_spacing() there.

from indic_places import IndicPlaces

_PLACE_ENGINE = IndicPlaces()


def finalize_address(address: str) -> str:
    address = " ".join(str(address or "").split()).strip(" ,:-|")

    if not address:
        return ""

    address = _PLACE_ENGINE.normalize_address_spacing(address)

    return " ".join(address.split()).strip(" ,:-|")

If your extraction function stores a best address candidate before returning, normalize before storing the final value.

def evaluate_and_store_address(candidate: str):
    candidate = finalize_address(candidate)

    if not candidate:
        return False

    # Store candidate in your output dictionary/model.
    return True

Complete Address Analysis

Use analyze_address() when you want spacing, extraction, correction, and details together.

from indic_places import IndicPlaces

ip = IndicPlaces()

result = ip.analyze_address("indrapuriratibadbhopalmadhyapradesh")

print(result["clean_address"])
print(result["places"])
print(result["corrections"])

It returns:

{
    "raw_address": "...",
    "clean_address": "...",
    "places": [
        {
            "text_found": "...",
            "name": "...",
            "state": "...",
            "district": "...",
            "pincode": "...",
            "score": ...
        }
    ],
    "corrections": [
        {
            "input": "...",
            "corrected": "...",
            "state": "...",
            "district": "...",
            "pincode": "..."
        }
    ],
    "tokens": [...]
}

This is useful for OCR address pipelines where you want:

spacing + extraction + correction + state/district/pincode details
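
A downstream consumer of this structure might look like the sketch below. The sample_result dictionary is hand-written to match the documented keys and is not actual library output; the scores and the best_place() and summarize() helpers are hypothetical.

```python
from typing import Optional

# Hand-written sample shaped like the documented structure.
# Values are illustrative only and were not produced by the library.
sample_result = {
    "raw_address": "indrapuriratibadbhopalmadhyapradesh",
    "clean_address": "INDRAPURI RATIBAD BHOPAL MADHYA PRADESH",
    "places": [
        {"text_found": "RATIBAD", "name": "Ratibad", "state": "MADHYA PRADESH",
         "district": "BHOPAL", "pincode": "", "score": 80},
        {"text_found": "BHOPAL", "name": "Bhopal", "state": "MADHYA PRADESH",
         "district": "BHOPAL", "pincode": "", "score": 95},
    ],
    "corrections": [],
    "tokens": ["INDRAPURI", "RATIBAD", "BHOPAL", "MADHYA", "PRADESH"],
}


def best_place(result: dict) -> Optional[dict]:
    """Return the highest-scoring place entry, or None when no place was found."""
    places = result.get("places") or []
    return max(places, key=lambda p: p.get("score", 0), default=None)


def summarize(result: dict) -> str:
    """Combine the cleaned address with the top place's administrative details."""
    top = best_place(result)
    if top is None:
        return result.get("clean_address", "")
    return f'{result["clean_address"]} | {top["name"]}, {top["district"]}, {top["state"]}'


print(summarize(sample_result))
```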

Correct Place Name Search

indic-places also supports correction-style place search through:

ip.correct_place_name(...)
ip.correct_place(...)

Use these when the user input is incomplete, misspelled, or has missing letters.

Examples:

from indic_places import IndicPlaces

ip = IndicPlaces()

print(ip.correct_place_name("bhop"))          # Bhopal
print(ip.correct_place_name("bhopa"))         # Bhopal
print(ip.correct_place_name("kera"))          # Kerala
print(ip.correct_place_name("jhark"))         # Jharkhand
print(ip.correct_place_name("hrissu kerala")) # Thrissur

For correction with details:

from indic_places import IndicPlaces

ip = IndicPlaces()

print(ip.correct_place("bhop"))
print(ip.correct_place("kera"))
print(ip.correct_place("jhark"))

Expected output style:

{'name': 'Bhopal', 'state': 'MADHYA PRADESH', 'district': 'BHOPAL', 'pincode': ''}
{'name': 'Kerala', 'state': 'KERALA', 'district': '', 'pincode': ''}
{'name': 'Jharkhand', 'state': 'JHARKHAND', 'district': '', 'pincode': ''}

Difference between lookup and correction:

lookup()              = gives search suggestions
correct_place_name()  = gives one clean corrected place name
correct_place()       = gives corrected place name with state/district/pincode details

This is useful for search boxes, OCR correction, address parsing, and user-entered location cleanup.
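
The distinction between ranked suggestions and a single corrected value can be illustrated with the standard library's difflib. This is only an analogy: the NAMES list is hypothetical, and the library's own vocabulary and scoring are not shown here.

```python
import difflib

# Hypothetical candidate list; the library's own vocabulary and scoring differ.
NAMES = ["Bangalore", "Mangalore", "Bhopal", "Kerala"]
_BY_KEY = {n.lower(): n for n in NAMES}


def suggest(query: str, top_n: int = 5) -> list:
    """Return several ranked suggestions (lookup-style)."""
    keys = difflib.get_close_matches(query.lower(), list(_BY_KEY), n=top_n, cutoff=0.6)
    return [_BY_KEY[k] for k in keys]


def correct(query: str) -> str:
    """Return one best name, or the input unchanged when nothing is close."""
    matches = suggest(query, top_n=1)
    return matches[0] if matches else query


print(suggest("Bangalor"))
print(correct("Bangalor"))  # Bangalore
```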

CMD Examples

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place_name('bhop'))"

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place_name('kera')); print(ip.correct_place_name('jhark'))"

python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place('hrissu kerala'))"

Lookup Places

from indic_places import IndicPlaces

ip = IndicPlaces()

results = ip.lookup("Bangalor", top_n=5)

for r in results:
    print(r.name, r.state, r.district, r.pincode, r.score)

Extract Places from Text

from indic_places import IndicPlaces

ip = IndicPlaces()

text = "PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"
places = ip.extract_places(text)

for p in places:
    print(p.name, p.state, p.district, p.pincode)
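
For intuition, a much simpler extraction pass can be sketched with a vocabulary set and a PIN-code pattern. This is not how extract_places() is implemented; KNOWN_PLACES and extract_simple() are hypothetical names used only for illustration.

```python
import re

# Hypothetical vocabulary entries, for illustration only.
KNOWN_PLACES = {"PERAMBRA", "THRISSUR", "KOZHIKODE"}

# Indian PIN codes have six digits and never start with 0.
PIN_RE = re.compile(r"\b[1-9][0-9]{5}\b")


def extract_simple(text: str) -> dict:
    """Collect tokens found in the vocabulary plus the first PIN code, if any."""
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    places = [t for t in tokens if t in KNOWN_PLACES]
    pin = PIN_RE.search(text)
    return {"places": places, "pincode": pin.group(0) if pin else ""}


print(extract_simple("PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"))
# {'places': ['PERAMBRA', 'THRISSUR'], 'pincode': '680689'}
```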

Word Segmentation

from indic_places import IndicPlaces

ip = IndicPlaces()

result = ip.segment("iliveinmumbaiorkerala")
print(result.segmented)
print(result.score)

Data Files

Runtime package data:

indic_places/data/address_terms.txt
indic_places/data/custom_places.txt
indic_places/data/places_index.json.gz

Supporting/reference data in repository:

data/unique_place_names.txt
data/geonames_india_places_full.csv.gz
data/by_state_geonames/

Data Sources and Attribution

This package includes place-name vocabulary derived from open geographical datasets, including GeoNames India gazetteer and postal data.

GeoNames data is licensed under Creative Commons Attribution 4.0. Please credit GeoNames when using data derived from GeoNames.

Suggested attribution:

This product includes data derived from GeoNames (https://www.geonames.org/), licensed under CC BY 4.0.

The data is provided as-is and may contain spelling variants, alternate names, outdated entries, or OCR-specific aliases.

Privacy and Project Neutrality

This package is public and project-neutral.

It does not include private project names, private customer data, private document text, or proprietary extraction logic. Use it as a reusable Indian place-name and OCR address cleanup utility.

Troubleshooting

Old version still installing

python -m pip uninstall indic-places -y
python -m pip install --no-cache-dir --upgrade --force-reinstall indic-places

Check installed version

python -c "import importlib.metadata as m; print(m.version('indic-places'))"

Command not found

python -m indic_places.cli stats

Works locally but not after pip install

Check that MANIFEST.in and pyproject.toml are configured so the package data files are included in the published wheel:

MANIFEST.in
pyproject.toml
indic_places/data/custom_places.txt
indic_places/data/address_terms.txt
indic_places/data/places_index.json.gz
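
One way to confirm that data files actually shipped with an installed package is to probe them with importlib.resources (Python 3.9+). The missing_package_files() helper below is a hypothetical utility, shown against the standard-library json package so the snippet runs anywhere.

```python
from importlib import resources


def missing_package_files(package: str, relative_paths: list) -> list:
    """Return the relative paths that cannot be found inside the installed package."""
    root = resources.files(package)
    missing = []
    for rel in relative_paths:
        node = root
        # Walk one path segment at a time so this works on every Traversable type.
        for part in rel.split("/"):
            node = node.joinpath(part)
        if not node.is_file():
            missing.append(rel)
    return missing


# Demonstrated on a standard-library package so the snippet runs anywhere.
print(missing_package_files("json", ["decoder.py", "not_there.txt"]))  # ['not_there.txt']
```

For this package, the equivalent call would be missing_package_files("indic_places", ["data/custom_places.txt", "data/address_terms.txt", "data/places_index.json.gz"]); an empty list means all three files were installed.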

Source Code

GitHub repository:

https://github.com/Tinku746286/indic_names_library

For normal users, install from PyPI:

pip install --upgrade indic-places

All-India village vocabulary

The package can include village names imported from official LGD-style (Local Government Directory) village datasets.

The import script adds only unique village names to:

indic_places/data/custom_places.txt

Duplicate protection uses a normalized key, so names already present with different case, spacing, or punctuation are not added again.

This improves village-level matching and correction, for example:

ip.correct_place_name("DORA CHHAPR")
ip.correct_place_name("MOHAN CHHAPR")
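
The duplicate protection described above can be sketched as follows. The exact normalization used by the import script is not documented here, so normalized_key() is an assumption: it lowercases the name and keeps only ASCII letters and digits.

```python
import re


def normalized_key(name: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def add_unique(existing: list, incoming: list) -> list:
    """Append only names whose normalized key is not already present."""
    seen = {normalized_key(n) for n in existing}
    merged = list(existing)
    for name in incoming:
        key = normalized_key(name)
        if key and key not in seen:
            seen.add(key)
            merged.append(name)
    return merged


print(add_unique(["Dora Chhapra"], ["DORA-CHHAPRA", "dora  chhapra", "Mohan Chhapra"]))
# ['Dora Chhapra', 'Mohan Chhapra']
```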

Fast correction index

correct_place_name() uses an in-memory candidate index so it does not scan every village/place name for every query.

This improves correction speed after adding large all-India village vocabulary data.

from indic_places import IndicPlaces

ip = IndicPlaces()

print(ip.correct_place_name("DORA CHHAPR"))
print(ip.correction_candidate_count("DORA CHHAPR"))
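
The benefit of a candidate index can be seen in a small sketch that groups names by their leading characters, so a query is compared only against one group. The NAMES list, the three-character prefix, and both helper functions are assumptions made for illustration; the library's actual index layout may differ.

```python
from collections import defaultdict

# Hypothetical names for illustration.
NAMES = ["Dora Chhapra", "Dorali", "Mohan Chhapra", "Bhopal", "Bhopa"]


def build_prefix_index(names, prefix_len=3):
    """Group names by the first few characters of their lowercase form."""
    index = defaultdict(list)
    for name in names:
        index[name.lower()[:prefix_len]].append(name)
    return index


INDEX = build_prefix_index(NAMES)


def candidates(query, prefix_len=3):
    """Return only the names sharing the query's leading characters."""
    return INDEX.get(query.lower()[:prefix_len], [])


print(candidates("DORA CHHAPR"))       # ['Dora Chhapra', 'Dorali']
print(len(candidates("DORA CHHAPR")))  # 2
```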

Faster and safer correction

correct_place_name() first checks a fast administrative-name index for common states/districts/cities, then falls back to the larger village/place index.

This prevents short local aliases such as Bhopa, Kera, or Jharka from outranking the expected results Bhopal, Kerala, and Jharkhand.

from indic_places import IndicPlaces

ip = IndicPlaces()

print(ip.correct_place_name("bhop"))       # Bhopal
print(ip.correct_place_name("kera"))       # Kerala
print(ip.correct_place_name("jhark"))      # Jharkhand
print(ip.correct_place_name("hrissu"))     # Thrissur
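
A two-tier check of this kind can be sketched with plain Python and difflib. The tier contents, the cutoffs, and the correct_two_tier() function are hypothetical; the sketch only shows why a small administrative list consulted first keeps short local aliases from winning.

```python
from __future__ import annotations

import difflib

# Hypothetical tiers; the names come from the examples above.
ADMIN_NAMES = ["Bhopal", "Kerala", "Jharkhand", "Thrissur"]
LOCAL_NAMES = ["Bhopa", "Kera", "Jharka", "Dora Chhapra"]


def _closest(query: str, names: list[str], cutoff: float) -> str | None:
    by_key = {n.lower(): n for n in names}
    hit = difflib.get_close_matches(query.lower(), list(by_key), n=1, cutoff=cutoff)
    return by_key[hit[0]] if hit else None


def correct_two_tier(query: str) -> str:
    q = query.strip().lower()
    # Tier 1: administrative names that start with the query win outright.
    for name in ADMIN_NAMES:
        if len(q) >= 4 and name.lower().startswith(q):
            return name
    # Tier 1b: close fuzzy match against the small administrative list.
    admin_hit = _closest(q, ADMIN_NAMES, cutoff=0.8)
    if admin_hit:
        return admin_hit
    # Tier 2: fall back to the larger local vocabulary.
    return _closest(q, LOCAL_NAMES, cutoff=0.6) or query


print(correct_two_tier("bhop"))         # Bhopal
print(correct_two_tier("hrissu"))       # Thrissur
print(correct_two_tier("dora chhapr"))  # Dora Chhapra
```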

South India subdistrict, village, and locality vocabulary

South Indian subdistrict, village, post-office, locality, colony, and area names can be imported into:

indic_places/data/custom_places.txt

The importer keeps only unique names and filters rows to South Indian states/UTs by state column when available.
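
The filtering step can be sketched with the csv module. The column names (name, state), the SOUTH_STATES set, and the south_india_names() function are assumptions for illustration; the real importer's input format is not described here.

```python
import csv
import io

# Assumed set of South Indian states/UTs for this sketch.
SOUTH_STATES = {"KERALA", "TAMIL NADU", "KARNATAKA", "ANDHRA PRADESH", "TELANGANA", "PUDUCHERRY"}


def south_india_names(csv_text, name_col="name", state_col="state"):
    """Return unique names, filtered by state only when a state column exists."""
    reader = csv.DictReader(io.StringIO(csv_text))
    has_state = state_col in (reader.fieldnames or [])
    names, seen = [], set()
    for row in reader:
        if has_state and row[state_col].strip().upper() not in SOUTH_STATES:
            continue
        name = row[name_col].strip()
        key = name.casefold()
        if name and key not in seen:
            seen.add(key)
            names.append(name)
    return names


SAMPLE = """name,state
Perambra,Kerala
Perambra,KERALA
Bhopal,Madhya Pradesh
Hosur,Tamil Nadu
"""
print(south_india_names(SAMPLE))  # ['Perambra', 'Hosur']
```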

Kerala LGD locality vocabulary

Kerala LGD data can be imported from downloaded LGD ZIP/XLS files. The importer extracts unique names from district, subdistrict, block, village, panchayat, urban local body, traditional local body, and ward-style files.

Output files:

data/kerala_lgd_names_unique.txt
data/kerala_lgd_names_full.csv.gz
indic_places/data/custom_places.txt

Raw downloaded source files should stay ignored under:

data/kerala_lgd_input/

Multi-state LGD locality vocabulary

Multiple LGD state downloads can be imported at once. The importer extracts unique district, subdistrict, block, village, panchayat, urban local body, traditional local body, and ward-style names.

Default states:

TAMIL NADU, KARNATAKA, ANDHRA PRADESH, TELANGANA, PUDUCHERRY

Output files:

data/multi_state_lgd_names_unique.txt
data/multi_state_lgd_names_full.csv.gz
indic_places/data/custom_places.txt

Raw downloaded source files should stay ignored under:

data/multi_state_lgd_input/

Normalize and correct OCR addresses

Use `normalize_and_correct_address()` when OCR creates both merged words and spelling noise.

```python
from indic_places import IndicPlaces

ip = IndicPlaces()

raw = "PILASSERYADIVAAMPUTHUPADIADIVARAMTHAMARASSERYKOZHIKODE"

print(ip.normalize_and_correct_address(raw))
```

Expected style:

```text
PILASSERY ADIVARAM PUTHUPADI ADIVARAM THAMARASSERY KOZHIKODE
```

For debug details:

```python
print(ip.normalize_and_correct_address(raw, return_details=True))
```

OCR boundary rebalancing

`normalize_and_correct_address()` also handles cases where the spacing layer attaches the first letter of the next place to the previous token.
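
The kind of repair described here can be sketched as moving one trailing letter to the following token when that turns both tokens into known words. The VOCAB set and the rebalance() function are hypothetical and far simpler than whatever the library does internally.

```python
# Hypothetical vocabulary for illustration.
VOCAB = {"PILASSERY", "ADIVARAM", "THAMARASSERY", "KOZHIKODE"}


def rebalance(tokens):
    """Move one trailing letter to the next token when that makes both tokens known words."""
    out = list(tokens)
    for i in range(len(out) - 1):
        left, right = out[i], out[i + 1]
        if left in VOCAB and right in VOCAB:
            continue  # both already valid; leave them alone
        moved_left, moved_right = left[:-1], left[-1:] + right
        if moved_left in VOCAB and moved_right in VOCAB:
            out[i], out[i + 1] = moved_left, moved_right
    return out


print(rebalance(["PILASSERYA", "DIVARAM", "THAMARASSERY", "KOZHIKODE"]))
# ['PILASSERY', 'ADIVARAM', 'THAMARASSERY', 'KOZHIKODE']
```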

```python
from indic_places import IndicPlaces

ip = IndicPlaces()

raw = "PILASSERYADIVAAMPUTHUPADIADIVARAMTHAMARASSERYKOZHIKODE"
print(ip.normalize_and_correct_address(raw))
```

Expected style:

```text
PILASSERY ADIVARAM PUTHUPADI ADIVARAM THAMARASSERY KOZHIKODE
```

Safer normalize-and-correct flow

normalize_and_correct_address() uses safer OCR-boundary repair and avoids changing a single noisy token into an unrelated multi-word place.

Example bad output avoided:

```text
PUTHUPADI -> Rampur Thadi
```

Example:

```python
from indic_places import IndicPlaces

ip = IndicPlaces()
raw = "PILASSERYADIVAAMPUTHUPADIADIVARAMTHAMARASSERYKOZHIKODE"
print(ip.normalize_and_correct_address(raw))
```

Known-token protection

normalize_and_correct_address() skips correction for tokens that already exactly exist in the place vocabulary.

This prevents valid tokens from being over-corrected.

Example avoided:

```text
ADIVARAM -> Immidivaram
```

Common OCR place aliases

`normalize_and_correct_address()` includes a conservative OCR alias layer for noisy address tokens.

Examples handled:

```text
ADIVAAM -> ADIVARAM
IMMIDIVARAM -> ADIVARAM
THAMARASSERI -> THAMARASSERY
PILASSERYA -> PILASSERY
```
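
A conservative alias layer can be sketched as a dictionary lookup that is skipped for tokens already in the vocabulary. The alias pairs are taken from the examples above; the VOCAB subset and the apply_aliases() function are hypothetical.

```python
# Alias pairs taken from the examples above; the vocabulary is a hypothetical subset.
OCR_ALIASES = {
    "ADIVAAM": "ADIVARAM",
    "IMMIDIVARAM": "ADIVARAM",
    "THAMARASSERI": "THAMARASSERY",
    "PILASSERYA": "PILASSERY",
}
VOCAB = {"ADIVARAM", "THAMARASSERY", "PILASSERY", "KOZHIKODE"}


def apply_aliases(tokens):
    """Rewrite known OCR variants, leaving vocabulary tokens and unknown tokens untouched."""
    fixed = []
    for token in tokens:
        key = token.upper()
        if key in VOCAB:
            fixed.append(token)  # known token: never rewritten
        else:
            fixed.append(OCR_ALIASES.get(key, token))
    return fixed


print(apply_aliases(["PILASSERYA", "ADIVAAM", "ADIVARAM", "KOZHIKODE", "XYZ"]))
# ['PILASSERY', 'ADIVARAM', 'ADIVARAM', 'KOZHIKODE', 'XYZ']
```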

Accuracy-safe SQLite fast search index

For large vocabularies, build the optional SQLite search index:

```cmd
python build_fast_sqlite_index.py
```

This creates:

indic_places/data/fast_places.sqlite

When this file exists, correct_place_name() uses the same scoring logic as before, but candidate retrieval comes from SQLite buckets instead of building a huge in-memory index.

The SQLite buckets mirror the old in-memory strategy: exact, prefix, missing-first-letter, consonant, and fallback buckets.
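
A bucketed candidate table of this kind can be sketched with the standard sqlite3 module. The schema, the bucket key definitions, and the function names below are assumptions made for illustration and do not describe the actual layout of fast_places.sqlite; the fallback bucket is omitted for brevity.

```python
import sqlite3

# Hypothetical names for illustration.
NAMES = ["Thrissur", "Bhopal", "Dora Chhapra"]


def consonant_key(text):
    """Keep only consonant letters, lowercased."""
    return "".join(c for c in text.lower() if c.isalpha() and c not in "aeiou")


def bucket_keys(name):
    n = name.lower()
    return [
        ("exact", n),
        ("prefix", n[:3]),
        ("missing_first", n[1:4]),  # matches queries that lost their first letter
        ("consonant", consonant_key(n)),
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buckets (kind TEXT, key TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_buckets ON buckets (kind, key)")
conn.executemany(
    "INSERT INTO buckets VALUES (?, ?, ?)",
    [(kind, key, name) for name in NAMES for kind, key in bucket_keys(name)],
)


def candidates(query):
    """Probe each bucket in order and return matching names without duplicates."""
    q = query.lower()
    probes = [
        ("exact", q),
        ("prefix", q[:3]),
        ("missing_first", q[:3]),
        ("consonant", consonant_key(q)),
    ]
    found = []
    for kind, key in probes:
        rows = conn.execute(
            "SELECT name FROM buckets WHERE kind = ? AND key = ?", (kind, key)
        )
        for (name,) in rows:
            if name not in found:
                found.append(name)
    return found


print(candidates("hrissu"))  # ['Thrissur']
print(candidates("bhop"))    # ['Bhopal']
```

Only the returned candidates would then be passed to the scoring step, which is how retrieval can change without changing the scores.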

Fast admin override

correct_place_name() uses a small high-confidence admin alias list instead of scanning all structured records for admin overrides.

This keeps outputs such as bhop -> Bhopal, kera -> Kerala, and jhark -> Jharkhand, while allowing multi-word village/locality queries such as DORA CHHAPR to go directly to the fast vocabulary/SQLite search.

SQLite prefix shortcut

For large vocabularies, correct_place_name() uses a safe SQLite prefix shortcut for names that are only missing the last few characters.

Example:

DORA CHHAPR -> Dora Chhapra
MOHAN CHHAPR -> Mohan Chhapra

This avoids scoring thousands of candidates for common OCR truncation cases.
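
A prefix shortcut over SQLite can be sketched as an indexed range scan. The table layout, the max_missing limit, and the prefix_shortcut() function are hypothetical; the sketch accepts a result only when exactly one stored name extends the query by a few characters.

```python
from __future__ import annotations

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (key TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?)",
    [(n.lower(), n) for n in ["Dora Chhapra", "Mohan Chhapra", "Dorali"]],
)


def prefix_shortcut(query: str, max_missing: int = 3) -> str | None:
    """Return a name only when exactly one stored key extends the query by a few characters."""
    q = query.strip().lower()
    if not q:
        return None
    # Range scan over the primary key: keys in [q, q + U+FFFF) start with q.
    rows = conn.execute(
        "SELECT key, name FROM places WHERE key >= ? AND key < ? ORDER BY key",
        (q, q + "\uffff"),
    ).fetchall()
    close = [name for key, name in rows if len(key) - len(q) <= max_missing]
    return close[0] if len(close) == 1 else None


print(prefix_shortcut("DORA CHHAPR"))   # Dora Chhapra
print(prefix_shortcut("MOHAN CHHAPR"))  # Mohan Chhapra
print(prefix_shortcut("xyz"))           # None
```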

Instant precheck for common corrections

correct_place_name() now performs an instant precheck for common OCR variants before fuzzy search.

Examples:

BUHAR -> Buhara
GIJRAT -> Gujarat
UTTRAKAND -> Uttarakhand
DORA CHHAPR -> Dora Chhapra

This prevents common short queries from entering the slow fuzzy-search path.
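
The precheck idea can be sketched as a dictionary lookup that runs before any fuzzy matching. The pairs in COMMON_FIXES come from the examples above; the VOCAB list, the cutoff, and the correct() function are hypothetical, with difflib standing in for the library's fuzzy search.

```python
import difflib

# Pairs taken from the examples above.
COMMON_FIXES = {
    "buhar": "Buhara",
    "gijrat": "Gujarat",
    "uttrakand": "Uttarakhand",
    "dora chhapr": "Dora Chhapra",
}
# Hypothetical larger vocabulary used by the slow path.
VOCAB = ["Gujarat", "Uttarakhand", "Buhara", "Dora Chhapra", "Kozhikode"]


def correct(query):
    """Return (corrected_name, path_taken)."""
    key = " ".join(query.lower().split())
    if key in COMMON_FIXES:
        return COMMON_FIXES[key], "precheck"
    by_key = {v.lower(): v for v in VOCAB}
    hit = difflib.get_close_matches(key, list(by_key), n=1, cutoff=0.7)
    return (by_key[hit[0]] if hit else query), "fuzzy"


print(correct("GIJRAT"))    # ('Gujarat', 'precheck')
print(correct("Kozhikod"))  # ('Kozhikode', 'fuzzy')
```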

Release history

This version

1.4.13

Apr 25, 2026

1.4.12

Apr 25, 2026

1.4.10

Apr 25, 2026

1.4.7

Apr 25, 2026

1.4.2

Apr 25, 2026

1.4.1

Apr 25, 2026

1.4.0

Apr 24, 2026

1.3.9

Apr 24, 2026

1.3.8

Apr 24, 2026

1.3.7

Apr 24, 2026

1.3.5

Apr 24, 2026

1.3.4

Apr 24, 2026

1.2.7

Apr 24, 2026

1.2.3

Apr 24, 2026

1.2.2

Apr 24, 2026

1.1.7

Apr 24, 2026

1.1.6

Apr 24, 2026

1.1.5

Apr 24, 2026

1.1.4

Apr 24, 2026

1.1.3

Apr 24, 2026

1.1.2

Apr 24, 2026

1.1.1

Apr 24, 2026

1.1.0

Apr 24, 2026

Download files

Source Distribution

indic_places-1.4.13.tar.gz (12.5 MB)

Uploaded Apr 25, 2026 Source

Built Distribution

indic_places-1.4.13-py3-none-any.whl (12.6 MB)

Uploaded Apr 25, 2026 Python 3

File details

Details for the file indic_places-1.4.13.tar.gz.

File metadata

Download URL: indic_places-1.4.13.tar.gz
Upload date: Apr 25, 2026
Size: 12.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for indic_places-1.4.13.tar.gz
Algorithm	Hash digest
SHA256	`3427a1823ae9520a5278fe06a453bb7dd8b78fec6fc7ec9d2ff9f91b82baf868`
MD5	`ac67377498ad25f5adcdd43d32e3c124`
BLAKE2b-256	`f6548bb86574946f1be3f3c2268d49d4f01fa038bcb78746657ab394bacbcf92`

File details

Details for the file indic_places-1.4.13-py3-none-any.whl.

File metadata

Download URL: indic_places-1.4.13-py3-none-any.whl
Upload date: Apr 25, 2026
Size: 12.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for indic_places-1.4.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0da610b2aae72fa2040dbc52804d43e31683b0a61eee81b2d5ecb0814eb43c6f`
MD5	`3f529f2af080e586070e1dc1e116911b`
BLAKE2b-256	`ee6f0f4a6e8c54254ac4d4f4703fdcf5e997fa5dc628f8cd44dad2aa3e9fecda`
