Indian place-name lookup, OCR address cleanup, Kerala LGD locality vocabulary, correction, extraction, and address intelligence
Indic Places Library
Indian place-name lookup, fuzzy matching, OCR address cleanup, and merged-word address segmentation for Python.
indic-places is useful when Indian addresses are extracted from OCR/scanned PDFs and words are joined together without spaces. It uses a large Indian place-name vocabulary to identify cities, towns, villages, districts, postal-place aliases, and common address tokens.
Install from PyPI
Install latest version:
pip install --upgrade indic-places
Force latest version without cache:
python -m pip install --no-cache-dir --upgrade --force-reinstall indic-places
Install exact version:
python -m pip install indic-places==1.1.7
Add to requirements.txt:
indic-places>=1.1.7
Import
PyPI package name:
indic-places
Python import name:
from indic_places import IndicPlaces
Data Stats
| Metric | Count |
|---|---|
| Structured GeoNames + postal records | 815,477 |
| Unique Indian place names | 817,641 |
| Runtime OCR/custom place aliases | 652,331 |
| Coverage | India-wide |
Quick Python Usage
from indic_places import IndicPlaces
ip = IndicPlaces()
address = "PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586"
print(ip.normalize_address_spacing(address))
Output:
PILASSERY ADIVARAM PUTHUPPADI ADIVARAM PUDUPADI KATTIPARA ADIVARAM THAMARASSERY KOZHIKODE - 673586
Use from CMD / Terminal
Check installed version
python -c "import importlib.metadata as m; print(m.version('indic-places'))"
Normalize an OCR address from CMD
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.normalize_address_spacing('PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586'))"
Place lookup from CMD
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.lookup('Bangalor', top_n=5))"
Extract places from text using CMD
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.extract_places('PONMINISSERY HOUSE PERAMBRA THRISSUR 680689'))"
Word segmentation from CMD
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); r=ip.segment('iliveinmumbaiorkerala'); print(r.segmented)"
CLI Usage
If the console script is available after install:
indic-places stats
indic-places lookup Bangalor
indic-places segment iliveinmumbaiorkerala
indic-places extract "PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"
If the command is not found, use:
python -m indic_places.cli stats
What This Library Solves
OCR may return Indian addresses like:
PILASSERYADIVARAMPUTHUPPADIADIVARAM
KUNNUMPURATHHOUSEKALLARAP.O
THAMARASSERYKOZHIKODE
This library helps convert merged OCR text into cleaner address text by using Indian place names and address vocabulary.
Example:
from indic_places import IndicPlaces
ip = IndicPlaces()
raw = "KUNNUMPURATHHOUSE KALLARA P.O KOTTAYAM - 686611"
clean = ip.normalize_address_spacing(raw)
print(clean)
Output style:
KUNNUMPURATH HOUSE KALLARA P.O KOTTAYAM - 686611
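The idea of splitting merged tokens with a place-name vocabulary can be sketched with a small dynamic-programming splitter. This is an illustration only, not the library's implementation: `VOCAB`, `split_merged`, and `space_address` are hypothetical names, and the vocabulary is a toy sample.

```python
from typing import List, Optional, Set

# Toy vocabulary for illustration; the real package ships a far larger list.
VOCAB: Set[str] = {
    "KUNNUMPURATH", "HOUSE", "KALLARA",
    "THAMARASSERY", "KOZHIKODE", "KOTTAYAM",
}


def split_merged(token: str, vocab: Set[str] = VOCAB) -> Optional[List[str]]:
    """Split token into vocabulary words; return None if no full split exists."""
    n = len(token)
    # best[i] is the split of token[:i] with the fewest words, or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = token[start:end]
            if prefix is None or word not in vocab:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


def space_address(text: str, vocab: Set[str] = VOCAB) -> str:
    """Re-space each whitespace-separated token; leave unknown tokens as-is."""
    out = []
    for token in text.split():
        parts = split_merged(token.upper(), vocab)
        out.append(" ".join(parts) if parts else token)
    return " ".join(out)


print(space_address("KUNNUMPURATHHOUSE KALLARA P.O KOTTAYAM - 686611"))
```

Tokens that cannot be fully covered by the vocabulary (such as `P.O`, `-`, or the PIN code) pass through unchanged, which keeps the sketch from damaging text it does not understand.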
Main Features
- Indian place-name lookup
- OCR merged-address spacing
- Fuzzy lookup for misspelled place names
- Word segmentation for merged text
- Place extraction from address text
- India-wide GeoNames and postal vocabulary
- Runtime OCR/custom place aliases from indic_places/data/custom_places.txt
Recommended Integration Pattern
For large OCR/document pipelines, create IndicPlaces() once and reuse it instead of constructing a new instance for every address.
from indic_places import IndicPlaces
_PLACE_ENGINE = IndicPlaces()
def clean_address(address: str) -> str:
    address = " ".join(str(address or "").split()).strip(" ,:-|")
    if not address:
        return ""
    return _PLACE_ENGINE.normalize_address_spacing(address)
Use it after your extraction logic has already identified the address candidate.
raw_address = "PILASSERYADIVARAMPUTHUPPADIADIVARAM PUDUPADIKATTIPARAADIVARAM THAMARASSERYKOZHIKODE - 673586"
final_address = clean_address(raw_address)
print(final_address)
Use with an Existing Address Extractor
If your project already has a final address cleanup function, call normalize_address_spacing() there.
from indic_places import IndicPlaces
_PLACE_ENGINE = IndicPlaces()
def finalize_address(address: str) -> str:
    address = " ".join(str(address or "").split()).strip(" ,:-|")
    if not address:
        return ""
    address = _PLACE_ENGINE.normalize_address_spacing(address)
    return " ".join(address.split()).strip(" ,:-|")
If your extraction function stores a best address candidate before returning, normalize before storing the final value.
def evaluate_and_store_address(candidate: str):
    candidate = finalize_address(candidate)
    if not candidate:
        return False
    # Store candidate in your output dictionary/model.
    return True
Complete Address Analysis
Use analyze_address() when you want spacing, extraction, correction, and details together.
from indic_places import IndicPlaces
ip = IndicPlaces()
result = ip.analyze_address("indrapuriratibadbhopalmadhyapradesh")
print(result["clean_address"])
print(result["places"])
print(result["corrections"])
It returns:
{
    "raw_address": "...",
    "clean_address": "...",
    "places": [
        {
            "text_found": "...",
            "name": "...",
            "state": "...",
            "district": "...",
            "pincode": "...",
            "score": ...
        }
    ],
    "corrections": [
        {
            "input": "...",
            "corrected": "...",
            "state": "...",
            "district": "...",
            "pincode": "..."
        }
    ],
    "tokens": [...]
}
This is useful for OCR address pipelines where you want:
spacing + extraction + correction + state/district/pincode details
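A downstream step often needs only one state/district/pincode per address. The sketch below shows one way to reduce a result of the shape documented above. `best_place` and `summarize` are hypothetical helpers, `SAMPLE` is hand-written data rather than captured library output, and the code assumes that a larger `score` means a stronger match.

```python
from typing import Any, Dict, Optional


def best_place(result: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the highest-scoring entry of result["places"], or None."""
    places = result.get("places") or []
    if not places:
        return None
    # Assumption: a larger "score" means a stronger match.
    return max(places, key=lambda place: place.get("score") or 0)


def summarize(result: Dict[str, Any]) -> Dict[str, str]:
    """Flatten an analysis result into address + state/district/pincode."""
    place = best_place(result) or {}
    return {
        "address": result.get("clean_address") or result.get("raw_address", ""),
        "state": place.get("state", ""),
        "district": place.get("district", ""),
        "pincode": place.get("pincode", ""),
    }


# Hand-written sample in the documented shape (values are illustrative only).
SAMPLE = {
    "raw_address": "indrapuriratibadbhopalmadhyapradesh",
    "clean_address": "INDRAPURI RATIBAD BHOPAL MADHYA PRADESH",
    "places": [
        {"text_found": "RATIBAD", "name": "Ratibad", "state": "MADHYA PRADESH",
         "district": "BHOPAL", "pincode": "", "score": 0.55},
        {"text_found": "BHOPAL", "name": "Bhopal", "state": "MADHYA PRADESH",
         "district": "BHOPAL", "pincode": "", "score": 0.91},
    ],
    "corrections": [],
    "tokens": [],
}

print(summarize(SAMPLE))
```

If the real score scale or ordering differs, only the `key` function in `best_place` needs to change.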
Correct Place Name Search
indic-places also supports correction-style place search through:
ip.correct_place_name(...)
ip.correct_place(...)
Use these when the user input is incomplete, misspelled, or has missing letters.
Examples:
from indic_places import IndicPlaces
ip = IndicPlaces()
print(ip.correct_place_name("bhop")) # Bhopal
print(ip.correct_place_name("bhopa")) # Bhopal
print(ip.correct_place_name("kera")) # Kerala
print(ip.correct_place_name("jhark")) # Jharkhand
print(ip.correct_place_name("hrissu kerala")) # Thrissur
For correction with details:
from indic_places import IndicPlaces
ip = IndicPlaces()
print(ip.correct_place("bhop"))
print(ip.correct_place("kera"))
print(ip.correct_place("jhark"))
Expected output style:
{'name': 'Bhopal', 'state': 'MADHYA PRADESH', 'district': 'BHOPAL', 'pincode': ''}
{'name': 'Kerala', 'state': 'KERALA', 'district': '', 'pincode': ''}
{'name': 'Jharkhand', 'state': 'JHARKHAND', 'district': '', 'pincode': ''}
Difference between lookup and correction:
lookup() = gives search suggestions
correct_place_name() = gives one clean corrected place name
correct_place() = gives corrected place name with state/district/pincode details
This is useful for search boxes, OCR correction, address parsing, and user-entered location cleanup.
CMD Examples
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place_name('bhop'))"
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place_name('kera')); print(ip.correct_place_name('jhark'))"
python -c "from indic_places import IndicPlaces; ip=IndicPlaces(); print(ip.correct_place('hrissu kerala'))"
Lookup Places
from indic_places import IndicPlaces
ip = IndicPlaces()
results = ip.lookup("Bangalor", top_n=5)
for r in results:
    print(r.name, r.state, r.district, r.pincode, r.score)
Extract Places from Text
from indic_places import IndicPlaces
ip = IndicPlaces()
text = "PONMINISSERY HOUSE PERAMBRA THRISSUR 680689"
places = ip.extract_places(text)
for p in places:
    print(p.name, p.state, p.district, p.pincode)
Word Segmentation
from indic_places import IndicPlaces
ip = IndicPlaces()
result = ip.segment("iliveinmumbaiorkerala")
print(result.segmented)
print(result.score)
Data Files
Runtime package data:
indic_places/data/address_terms.txt
indic_places/data/custom_places.txt
indic_places/data/places_index.json.gz
Supporting/reference data in repository:
data/unique_place_names.txt
data/geonames_india_places_full.csv.gz
data/by_state_geonames/
Data Sources and Attribution
This package includes place-name vocabulary derived from open geographical datasets, including GeoNames India gazetteer and postal data.
GeoNames data is licensed under Creative Commons Attribution 4.0. Please credit GeoNames when using data derived from GeoNames.
Suggested attribution:
This product includes data derived from GeoNames (https://www.geonames.org/), licensed under CC BY 4.0.
The data is provided as-is and may contain spelling variants, alternate names, outdated entries, or OCR-specific aliases.
Privacy and Project Neutrality
This package is public and project-neutral.
It does not include private project names, private customer data, private document text, or proprietary extraction logic. Use it as a reusable Indian place-name and OCR address cleanup utility.
Troubleshooting
Old version still installing
python -m pip uninstall indic-places -y
python -m pip install --no-cache-dir --upgrade --force-reinstall indic-places
Check installed version
python -c "import importlib.metadata as m; print(m.version('indic-places'))"
Command not found
python -m indic_places.cli stats
Works locally but not after pip install
Make sure package data files are included in the published wheel:
MANIFEST.in
pyproject.toml
indic_places/data/custom_places.txt
indic_places/data/address_terms.txt
indic_places/data/places_index.json.gz
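One way to confirm that the data files actually shipped with an installed package is to probe them through `importlib.resources` (Python 3.9+). This is a minimal sketch: `missing_package_files` and `EXPECTED_FILES` are hypothetical names, and the paths are the runtime data files listed above, written relative to the package root.

```python
from importlib.resources import files
from typing import Iterable, List

# Runtime data files named in this README, relative to the indic_places package.
EXPECTED_FILES = [
    "data/custom_places.txt",
    "data/address_terms.txt",
    "data/places_index.json.gz",
]


def missing_package_files(package: str, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that are not present inside an installed package."""
    root = files(package)  # raises ModuleNotFoundError if the package is absent
    missing = []
    for rel in relative_paths:
        node = root
        # Walk one path segment at a time so this works for any resource backend.
        for part in rel.split("/"):
            node = node.joinpath(part)
        if not node.is_file():
            missing.append(rel)
    return missing
```

Running `missing_package_files("indic_places", EXPECTED_FILES)` in the environment where the wheel was installed should return an empty list; any path it returns was left out of the build.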
Source Code
GitHub repository:
https://github.com/Tinku746286/indic_names_library
For normal users, install from PyPI:
pip install --upgrade indic-places
All-India village vocabulary
The package can include village names imported from official or LGD-style (Local Government Directory) village datasets.
The import script adds only unique village names to:
indic_places/data/custom_places.txt
Duplicate protection uses a normalized key, so names already present with different case, spacing, or punctuation are not added again.
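A normalized key of this kind can be sketched as follows. The exact normalization used by the import script is not shown in this README, so `normalized_key` and `add_unique` are hypothetical, and the rule used here (casefold, then drop everything that is not a letter or digit) is an assumption. The village names are sample data.

```python
import re
from typing import Iterable, List


def normalized_key(name: str) -> str:
    """Casefold and drop spaces/punctuation so spelling variants compare equal."""
    return re.sub(r"[\W_]+", "", name.casefold())


def add_unique(existing: Iterable[str], incoming: Iterable[str]) -> List[str]:
    """Return the incoming names whose normalized key is not already known."""
    seen = {normalized_key(name) for name in existing}
    added = []
    for name in incoming:
        key = normalized_key(name)
        if key and key not in seen:
            seen.add(key)
            added.append(name.strip())
    return added


print(add_unique(
    ["Dora Chhapra"],
    ["DORA  CHHAPRA", "dora-chhapra", "Mohan Chhapra", "mohan chhapra", "  "],
))
```

Only the first spelling of each new name is kept, so later variants of the same village do not grow the vocabulary file.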
This improves village-level matching and correction, for example:
ip.correct_place_name("DORA CHHAPR")
ip.correct_place_name("MOHAN CHHAPR")
Fast correction index
correct_place_name() uses an in-memory candidate index so it does not scan every village/place name for every query.
This improves correction speed after adding large all-India village vocabulary data.
from indic_places import IndicPlaces
ip = IndicPlaces()
print(ip.correct_place_name("DORA CHHAPR"))
print(ip.correction_candidate_count("DORA CHHAPR"))
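To illustrate why a candidate index avoids a full scan, here is a small trigram-based sketch. It is not the package's actual index: `CandidateIndex`, `trigrams`, the 0.6 similarity threshold, and the sample names are all assumptions, and ranking uses `difflib.SequenceMatcher` from the standard library.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional, Set


def trigrams(text: str) -> Set[str]:
    """Character trigrams of the casefolded text with whitespace removed."""
    compact = "".join(text.casefold().split())
    if len(compact) < 3:
        return {compact} if compact else set()
    return {compact[i:i + 3] for i in range(len(compact) - 2)}


class CandidateIndex:
    """Inverted index from trigram to names, so a query only meets likely matches."""

    def __init__(self, names: Iterable[str]) -> None:
        self._by_gram: Dict[str, Set[str]] = defaultdict(set)
        for name in names:
            for gram in trigrams(name):
                self._by_gram[gram].add(name)

    def candidates(self, query: str) -> Set[str]:
        """Names sharing at least one trigram with the query."""
        found: Set[str] = set()
        for gram in trigrams(query):
            found |= self._by_gram.get(gram, set())
        return found

    def correct(self, query: str, min_ratio: float = 0.6) -> Optional[str]:
        """Best-matching candidate, or None if nothing reaches min_ratio."""
        q = query.casefold()
        best_name, best_ratio = None, 0.0
        for name in sorted(self.candidates(query)):  # sorted for determinism
            ratio = SequenceMatcher(None, q, name.casefold()).ratio()
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        return best_name if best_ratio >= min_ratio else None


index = CandidateIndex(["Bhopal", "Thrissur", "Kerala", "Jharkhand",
                        "Kozhikode", "Dora Chhapra", "Mohan Chhapra"])
print(index.correct("DORA CHHAPR"), len(index.candidates("DORA CHHAPR")))
```

The expensive similarity comparison runs only on the handful of names that share a trigram with the query, so the cost depends on the candidate count rather than on the size of the full vocabulary.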
Faster and safer correction
correct_place_name() first checks a fast administrative-name index for common states/districts/cities, then falls back to the larger village/place index.
This prevents short local aliases such as Bhopa, Kera, or Jharka from outranking well-known names such as Bhopal, Kerala, and Jharkhand.
from indic_places import IndicPlaces
ip = IndicPlaces()
print(ip.correct_place_name("bhop")) # Bhopal
print(ip.correct_place_name("kera")) # Kerala
print(ip.correct_place_name("jhark")) # Jharkhand
print(ip.correct_place_name("hrissu")) # Thrissur
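The two-tier behaviour described above can be sketched with `difflib.get_close_matches`. This is only an illustration of the ordering idea, not the package's code: `ADMIN_NAMES`, `LOCAL_NAMES`, `closest`, `correct_tiered`, and the 0.6 cutoff are all hypothetical.

```python
from difflib import get_close_matches
from typing import List, Optional

# Tiny sample tiers; the real indexes are far larger.
ADMIN_NAMES = ["Bhopal", "Kerala", "Jharkhand", "Thrissur"]
LOCAL_NAMES = ["Bhopa", "Kera", "Jharka", "Dora Chhapra"]


def closest(query: str, names: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Case-insensitive best fuzzy match from names, or None below the cutoff."""
    by_lower = {name.casefold(): name for name in names}
    hits = get_close_matches(query.casefold(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


def correct_tiered(query: str, cutoff: float = 0.6) -> Optional[str]:
    """Try the administrative tier first; fall back to the local tier."""
    return closest(query, ADMIN_NAMES, cutoff) or closest(query, LOCAL_NAMES, cutoff)


# A single merged list lets the short alias win; the tiered lookup does not.
print(closest("bhop", ADMIN_NAMES + LOCAL_NAMES))  # Bhopa
print(correct_tiered("bhop"))                      # Bhopal
print(correct_tiered("dora chhapr"))               # Dora Chhapra
```

Because "bhop" is closer to "Bhopa" than to "Bhopal" by raw similarity, the ordering of the tiers, not the similarity score alone, is what yields the expected administrative name.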
South India subdistrict, village, and locality vocabulary
South Indian subdistrict, village, post-office, locality, colony, and area names can be imported into:
indic_places/data/custom_places.txt
The importer keeps only unique names and filters rows to South Indian states/UTs by state column when available.
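The filtering rule can be sketched over CSV-style rows. Everything specific here is an assumption: the column names `name` and `state`, the helper `south_india_names`, and the membership of `SOUTH_INDIA` (this README does not list the states/UTs the importer uses). Rows without a state column are kept, matching the "when available" wording.

```python
import csv
import io
from typing import Dict, Iterable, List

# Assumed set of South Indian states/UTs; adjust to match your source data.
SOUTH_INDIA = {
    "ANDHRA PRADESH", "KARNATAKA", "KERALA", "TAMIL NADU",
    "TELANGANA", "PUDUCHERRY", "LAKSHADWEEP",
}


def south_india_names(rows: Iterable[Dict[str, str]],
                      name_col: str = "name",
                      state_col: str = "state") -> List[str]:
    """Unique names from rows, filtered by state when a state column exists."""
    names, seen = [], set()
    for row in rows:
        if state_col in row:
            state = (row.get(state_col) or "").strip().upper()
            if state not in SOUTH_INDIA:
                continue
        name = (row.get(name_col) or "").strip()
        key = name.casefold()
        if name and key not in seen:
            seen.add(key)
            names.append(name)
    return names


data = ("name,state\n"
        "Perambra,Kerala\n"
        "Bhopal,Madhya Pradesh\n"
        "perambra,KERALA\n"
        "Hosur,Tamil Nadu\n")
print(south_india_names(csv.DictReader(io.StringIO(data))))
```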
Kerala LGD locality vocabulary
Kerala LGD data can be imported from downloaded LGD ZIP/XLS files. The importer extracts unique names from district, subdistrict, block, village, panchayat, urban local body, traditional local body, and ward-style files.
Output files:
data/kerala_lgd_names_unique.txt
data/kerala_lgd_names_full.csv.gz
indic_places/data/custom_places.txt
Raw downloaded source files should stay ignored under:
data/kerala_lgd_input/
File details
Details for the file indic_places-1.3.9.tar.gz.
File metadata
- Download URL: indic_places-1.3.9.tar.gz
- Upload date:
- Size: 6.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb68464020404cb380969764714d0e190ca221d7c98d30b925c76fb99a8d69a5 |
| MD5 | 29e3a18968cb4c4e0238f9f09e05abcf |
| BLAKE2b-256 | 6f2eea529ef223b8594adbfec4b06726f2bf1d8a88bed84158cfa80c3d331311 |
File details
Details for the file indic_places-1.3.9-py3-none-any.whl.
File metadata
- Download URL: indic_places-1.3.9-py3-none-any.whl
- Upload date:
- Size: 6.3 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 54136fae0e0bbe5c173a109aaab65fcf6caac81c33e569b8b1ea1a6199e321ec |
| MD5 | d77c36cc792c9d15f83c43c4f4ee068d |
| BLAKE2b-256 | a66a02cef8dfff7bfd8df96994d63389d2927b40e591d080ad62e5d42f1f2257 |