OnFlow Location Platform for parsing, converting, and standardizing Vietnamese administrative units
OnFlow Location Platform
OnFlow Location Platform is a Python package and data workspace for working with Vietnamese administrative units across two address systems:
- LEGACY: the historical 63-province structure
- FROM_2025: the post-reform 34-province structure
The repository contains:
- a runtime package for parsing and converting Vietnamese addresses
- packaged lookup assets under src/onflow_location_platform/data
- data preparation inputs under data
- collection, processing, generation, and validation scripts under scripts
Highlights
- Parse free-text Vietnamese administrative addresses into structured AdminUnit objects
- Convert legacy addresses to the 2025 administrative structure
- Standardize province, district, ward, and address columns in pandas DataFrames
- Query packaged lookup data from the bundled SQLite database
- Maintain a reproducible workflow from raw inputs to generated runtime assets
Unified Interface: OnFlowLocation
The OnFlowLocation class provides a unified, property-based API to all platform operations.
from onflow_location_platform import OnFlowLocation
# Initialize with address or codes
location_info = OnFlowLocation(
    address="842 Nguyễn Kiệm, Hạnh Thông, hồ chí minh",
    provide_code=91,
    district_code=913,
    ward_code=31066,
)
# Parse or convert using properties
print(location_info.convert_address_new_to_old.format_address())
print(location_info.convert_address_old_to_new_by_code.format_address())
Package Naming
- Repository / distribution name: onflow-location-platform
- Python import path: onflow_location_platform
- Runtime data directory: src/onflow_location_platform/data
The public import path is onflow_location_platform.
Repository Layout
.
├── data/
│ ├── alias_keywords/
│ ├── raw/
│ ├── interim/
│ └── processed/
├── scripts/
│ ├── collecting_data/
│ ├── processing_data/
│ ├── generating_module_data/
│ └── testing_package/
├── src/
│ ├── onflow_location_platform/
│ │ ├── converter/
│ │ ├── data/
│ │ ├── database/
│ │ ├── pandas/
│ │ └── parser/
└── setup.py
How It Works
The repository operates as a small data platform plus a runtime package:
external sources
-> data/raw
-> data/interim
-> data/processed
-> scripts/generating_module_data
-> src/onflow_location_platform/data
-> src/onflow_location_platform parser / converter / pandas / database APIs
In practice, the workflow is:
- Collect source files from public endpoints or manual downloads into data/raw.
- Clean, map, and enrich those inputs into intermediate and processed datasets under data/interim and data/processed.
- Generate compact runtime assets for the package in src/onflow_location_platform/data, including parser dictionaries, conversion mappings, and the bundled SQLite database.
- Use the public API from src/onflow_location_platform to parse addresses, convert legacy addresses, standardize DataFrame columns, or query lookup data.
This means the runtime package does not depend on the workspace CSV files at execution time. It depends on the generated assets already bundled in src/onflow_location_platform/data.
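As a rough illustration of that separation, the sketch below reads a generated JSON asset from a single data directory and never touches the workspace CSV files. The helper name load_runtime_asset and the asset contents are hypothetical; only the file name parser_from_2025.json comes from this README, and the real asset schema may differ.

```python
import json
import tempfile
from pathlib import Path


def load_runtime_asset(data_dir: Path, name: str) -> dict:
    """Read one generated JSON asset from the packaged data directory."""
    path = data_dir / name
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# A temporary directory stands in for src/onflow_location_platform/data.
# The single entry written here is made up for the example.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "parser_from_2025.json").write_text(
        json.dumps({"ho chi minh": "Thành phố Hồ Chí Minh"}, ensure_ascii=False),
        encoding="utf-8",
    )
    asset = load_runtime_asset(data_dir, "parser_from_2025.json")

print(asset["ho chi minh"])
```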
Installation
Local editable install
python -m venv envs
source envs/bin/activate
pip install -e .
Local editable install with script dependencies
python -m venv envs
source envs/bin/activate
pip install -e '.[scripts]'
Quick Start
Parse a 2025-format address
from onflow_location_platform import parse_address_new
unit = parse_address_new("Tân Sơn Hòa, Hồ Chí Minh")
print(unit.format_address())
Parse a legacy-format address
from onflow_location_platform import parse_address_old
unit = parse_address_old(
    "Đường 15, Long Bình, Quận 9, Hồ Chí Minh",
    level=3,
)
print(unit.short_province, unit.short_district, unit.short_ward)
Convert a legacy address to the 2025 structure
from onflow_location_platform import convert_address_old_to_new
unit = convert_address_old_to_new("59 Nguyễn Sỹ Sách, Phường 15, Tân Bình, Hồ Chí Minh")
print(unit.format_address())
Convert a 2025 address back to legacy (first)
from onflow_location_platform import convert_address_new_to_old
unit = convert_address_new_to_old(
    "Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="first",
)
print(unit.format_address())
# Phường 1, Quận Gò Vấp, Thành Phố Hồ Chí Minh
Convert a 2025 address back to legacy (all)
from onflow_location_platform import convert_address_new_to_old
units = convert_address_new_to_old(
    "Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="all",
)
for unit in units:
    if unit:
        print(unit.format_address())
# Phường 1, Quận Gò Vấp Thành Phố Hồ Chí Minh
# Phường 3, Quận Gò Vấp Thành Phố Hồ Chí Minh
Convert a 2025 address back to legacy (geo)
from onflow_location_platform import convert_address_new_to_old
unit = convert_address_new_to_old(
    "842 Nguyễn Kiệm, Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="geo",
)
print(unit.format_address())
# 842 Nguyễn Kiệm, Phường 3, Quận Gò Vấp Thành Phố Hồ Chí Minh
Convert legacy codes to the 2025 structure
Use this when you already have the three legacy numeric primary-key codes (province, district, ward) stored in a database, instead of a free-text address.
from onflow_location_platform import convert_address_old_to_new_by_code
result = convert_address_old_to_new_by_code(
    provide_code=38,
    district_code=399,
    ward_code=16003,
)
print(result.format_address())
# Xã Hoằng Thanh, Tỉnh Thanh Hóa
To convert individual legacy administrative codes, use the specific functions:
from onflow_location_platform import (
    convert_province_old_to_new_by_code,
    convert_district_old_to_new_by_code,
    convert_ward_old_to_new_by_code,
)
# Convert a legacy province code
province = convert_province_old_to_new_by_code(38)
print(province.format_address())
# Tỉnh Thanh Hóa
# Convert a legacy district code (returns the new mapped province)
district = convert_district_old_to_new_by_code(399)
print(district.format_address())
# Tỉnh Thanh Hóa
# Convert a legacy ward code (returns the new mapped ward and province)
ward = convert_ward_old_to_new_by_code(16003)
print(ward.format_address())
# Xã Hoằng Thanh, Tỉnh Thanh Hóa
Standardize administrative unit columns in pandas
import pandas as pd
from onflow_location_platform.pandas import standardize_admin_unit_columns
df = pd.DataFrame(
    [
        {"province": "ha noi", "ward": "hong ha"},
        {"province": "hà nội", "ward": "ba đình"},
    ]
)
result = standardize_admin_unit_columns(
    df,
    province="province",
    ward="ward",
)
print(result)
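The two input rows above differ only in diacritics and casing. One common way to make such values comparable is to reduce them to an accent-free lookup key, sketched below with the standard library. This is an illustration, not the package's implementation: normalize_key and CANONICAL_PROVINCES are hypothetical names, and the package lists unidecode as a dependency, so its own normalization may work differently.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Lowercase, strip Vietnamese diacritics, and collapse whitespace."""
    # "đ" has no canonical decomposition, so it is mapped to "d" by hand.
    lowered = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.split())


# Hypothetical lookup table keyed by the normalized name.
CANONICAL_PROVINCES = {normalize_key("Hà Nội"): "Hà Nội"}

for raw in ["ha noi", "hà nội", "  Hà   Nội "]:
    print(repr(raw), "->", CANONICAL_PROVINCES.get(normalize_key(raw)))
```

All three spellings resolve to the same key, which is the property a column standardizer needs before it can map free-text values to canonical names.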
Query the bundled SQLite lookup data
from onflow_location_platform.database import get_data, query
print(get_data(fields=["province", "ward"], table="admin_units", limit=5))
print(query("SELECT province, ward FROM admin_units LIMIT 5"))
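To show the shape of the SQL used above without the package installed, the sketch below runs the same statement against an in-memory SQLite database. The two-column schema and the two rows are assumptions made for the example; the bundled dataset.db may contain more columns and tables, and the return type of get_data and query is not specified in this README.

```python
import sqlite3

# In-memory stand-in for the bundled database; schema and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE admin_units (province TEXT, ward TEXT)")
conn.executemany(
    "INSERT INTO admin_units (province, ward) VALUES (?, ?)",
    [("Hồ Chí Minh", "Tân Sơn Hòa"), ("Hồ Chí Minh", "Hạnh Thông")],
)

# Same statement as in the query(...) example above.
cursor = conn.execute("SELECT province, ward FROM admin_units LIMIT 5")
rows = [dict(row) for row in cursor]
conn.close()

print(rows)
```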
Public API
parse_address_new(address, keep_street=True, level=2)
Parse a Vietnamese address in the 2025 34-province structure into an AdminUnit.
- keep_street=True keeps street text when enough address segments are available
- level=1 parses province
- level=2 parses ward and province
parse_address_old(address, keep_street=True, level=3)
Parse a Vietnamese address in the legacy 63-province structure into an AdminUnit.
- keep_street=True keeps street text when enough address segments are available
- level=1 parses province
- level=2 parses district and province
- level=3 parses ward, district, and province
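The interaction between level and keep_street can be pictured with a deliberately naive comma splitter. This is only a sketch under the assumption that segments are comma-separated and ordered from most to least specific; split_segments is a hypothetical helper, and the real parser relies on packaged dictionaries and aliases rather than on comma positions.

```python
def split_segments(address: str, level: int, keep_street: bool = True):
    """Split an address into (leading text, last `level` segments)."""
    if level < 1:
        raise ValueError("level must be at least 1")
    parts = [part.strip() for part in address.split(",") if part.strip()]
    admin = parts[-level:]       # up to `level` trailing segments
    leading = parts[:-level]     # whatever precedes them, possibly empty
    rest = ", ".join(leading) if keep_street and leading else None
    return rest, admin


# Four segments at level=3: one segment is left over as street text.
print(split_segments("Đường 15, Long Bình, Quận 9, Hồ Chí Minh", level=3))
# Two segments at level=2: nothing is left over, so no street text is kept.
print(split_segments("Tân Sơn Hòa, Hồ Chí Minh", level=2))
```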
convert_address_old_to_new(address)
Convert a legacy-format address into a normalized AdminUnit in the 2025 structure.
convert_address_old_to_new_by_code(provide_code, district_code, ward_code, address=None)
Convert legacy administrative codes to the 2025 structure without any text parsing. Accepts the three legacy numeric primary-key codes stored in a database.
- provide_code: legacy province numeric code (e.g. 79)
- district_code: legacy district numeric code (e.g. 760)
- ward_code: legacy ward numeric code (e.g. 26737)
- address: optional raw address string attached to the returned AdminUnit
- both int and str inputs are accepted; leading zeros are added automatically
- raises KeyError if any code is not found, or ValueError if the codes are inconsistent (district does not belong to province, ward does not belong to district)
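The validation rules above can be sketched with toy lookup tables. Everything in this block is an assumption made for illustration: the helper names, the parent links in the tables, and the padding widths of 2, 3, and 5 digits for province, district, and ward codes. It mirrors the documented behavior (zero padding, KeyError for unknown codes, ValueError for inconsistent ones) without claiming to match the package's internals.

```python
# Toy parent links (child code -> parent code); not taken from the real dataset.
PROVINCES = {"01", "79"}
DISTRICT_TO_PROVINCE = {"001": "01", "760": "79"}
WARD_TO_DISTRICT = {"26737": "760"}


def normalize_code(code, width: int) -> str:
    """Accept int or str and left-pad with zeros to the assumed width."""
    return str(code).strip().zfill(width)


def check_codes(provide_code, district_code, ward_code):
    """Return padded codes, or raise KeyError / ValueError as documented."""
    province = normalize_code(provide_code, 2)
    district = normalize_code(district_code, 3)
    ward = normalize_code(ward_code, 5)
    if province not in PROVINCES:
        raise KeyError(f"unknown province code: {province}")
    if district not in DISTRICT_TO_PROVINCE:
        raise KeyError(f"unknown district code: {district}")
    if ward not in WARD_TO_DISTRICT:
        raise KeyError(f"unknown ward code: {ward}")
    if DISTRICT_TO_PROVINCE[district] != province:
        raise ValueError(f"district {district} is not in province {province}")
    if WARD_TO_DISTRICT[ward] != district:
        raise ValueError(f"ward {ward} is not in district {district}")
    return province, district, ward


print(check_codes(79, "760", 26737))  # mixed int and str inputs
print(normalize_code(1, 2))           # zero padding of a short code
```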
convert_province_old_to_new_by_code(pk_id)
Resolve a single legacy province code to the 2025 province.
Returns an AdminUnit with only province-level fields populated.
convert_district_old_to_new_by_code(pk_id)
Resolve a single legacy district code to the 2025 province it now belongs to. Districts do not exist in the 2025 format; only the parent province is returned.
convert_ward_old_to_new_by_code(pk_id)
Resolve a single legacy ward code to the 2025 ward and province.
convert_address_new_to_old(address, multi_match="first")
Convert a 2025-format address into legacy administrative units.
- multi_match="first" returns one AdminUnit (first candidate)
- multi_match="all" returns list[AdminUnit] (all candidates)
- multi_match="geo" returns one AdminUnit selected by nearest geocoded point, then falls back to first if needed
- input must explicitly contain a 2025 province keyword; otherwise an empty result is returned
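The three multi_match modes amount to different ways of choosing from a candidate list. The sketch below is one possible reading of them, not the package's code: pick_candidate, the dict-based candidates, the centroid field, and the haversine distance are all assumptions, and the coordinates are placeholders rather than real locations.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_candidate(candidates, multi_match="first", point=None):
    """Choose among legacy candidates according to the requested mode."""
    if multi_match not in ("first", "all", "geo"):
        raise ValueError(f"unsupported multi_match: {multi_match!r}")
    if multi_match == "all":
        return list(candidates)
    if not candidates:
        return None
    if multi_match == "geo" and point is not None:
        located = [c for c in candidates if c.get("centroid") is not None]
        if located:
            return min(located, key=lambda c: haversine_km(point, c["centroid"]))
    # "first", or "geo" without a usable point: fall back to the first candidate.
    return candidates[0]


# Placeholder candidates; names and coordinates are illustrative only.
candidates = [
    {"name": "candidate A", "centroid": (0.0, 0.0)},
    {"name": "candidate B", "centroid": (0.0, 1.0)},
]
print(pick_candidate(candidates, "first")["name"])
print(pick_candidate(candidates, "geo", point=(0.0, 0.9))["name"])
print(pick_candidate(candidates, "geo")["name"])
```

Without a geocoded point, the geo mode in this sketch degrades to the same answer as first, which matches the fallback described in the list above.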
standardize_admin_unit_columns(...)
Standardize province, district, and ward columns in a pandas DataFrame.
convert_address_column(...)
Convert a full address column and optionally attach old/new administrative attributes.
get_data(...) and query(sql)
Read lookup data from the bundled SQLite database in src/onflow_location_platform/data/dataset.db.
Data Directories
Runtime assets
These files are bundled with the Python package and used at runtime:
- src/onflow_location_platform/data/parser_legacy.json
- src/onflow_location_platform/data/parser_from_2025.json
- src/onflow_location_platform/data/converter_2025.json
- src/onflow_location_platform/data/dataset.db
Workspace data
The data directory is used for data preparation and notebook workflows:
- data/alias_keywords: curated alias inputs used when generating parser assets
- data/raw: collected source files
- data/interim: intermediate transformation outputs
- data/processed: processed datasets for analysis and validation
The repository currently ignores data/raw, data/interim, and data/processed, so those folders act as workspace outputs rather than committed package contents.
Scripts
Operational scripts and notebooks are grouped by purpose:
- scripts/collecting_data: external data collection and scraping
- scripts/processing_data: mapping, cleaning, enrichment, and dataset building
- scripts/generating_module_data: generation of packaged parser and converter assets
- scripts/testing_package: smoke tests and manual validation
Example smoke test:
PYTHONPATH=. envs/bin/python scripts/testing_package/manual_parse_smoke_test.py
Example benchmark:
envs/bin/python scripts/testing_package/benchmark_public_api.py
Example collection script:
PYTHONPATH=. envs/bin/python scripts/collecting_data/scrape_sapnhap_bando_provinces_and_wards.py --date 2026-03-31 --verbose
Notes:
- Some scripts require the optional dependencies from .[scripts]
- Collection scripts require network access
- Several workflows are notebook-driven rather than packaged as command-line tools
Benchmark
The repository includes a reproducible micro-benchmark for the public SDK API:
envs/bin/python scripts/testing_package/benchmark_public_api.py
Sample results from a local run on 2026-03-31 using Python 3.11.13 on macOS-15.5-arm64:
| API | Sample Input | Iterations / run | Best ms/op | Mean ms/op | Ops/s |
|---|---|---|---|---|---|
| parse_address_new | "Tân Sơn Hòa, Hồ Chí Minh" | 10,000 | 0.1181 | 0.1199 | 8469.8 |
| parse_address_old | "Long Bình, Quận 9, Hồ Chí Minh" (level=3) | 10,000 | 0.1003 | 0.1039 | 9965.8 |
| convert_address_old_to_new | "Phường 15, Tân Bình, Hồ Chí Minh" | 5,000 | 0.2364 | 0.2402 | 4230.0 |
These numbers are indicative and will vary by machine, Python version, and whether the tested input path requires external geocoding.
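For readers who want comparable figures for their own functions, the sketch below computes the same three metrics with timeit. It is not the repository's benchmark script: the bench helper, the repeat count, and the choice to derive Ops/s from the best run are assumptions (the table's Ops/s values are roughly 1000 divided by the best ms/op, for example 1000 / 0.2364 ≈ 4230).

```python
import timeit


def bench(func, iterations=10_000, repeats=5):
    """Time `func` and report best and mean ms per call plus calls per second."""
    totals = timeit.repeat(func, number=iterations, repeat=repeats)
    per_op_ms = [total / iterations * 1000 for total in totals]
    best = min(per_op_ms)
    mean = sum(per_op_ms) / len(per_op_ms)
    ops = 1000 / best if best > 0 else float("inf")
    return {"best_ms_per_op": best, "mean_ms_per_op": mean, "ops_per_s": ops}


# A trivial stand-in workload; swap in a call such as parse_address_new(...).
stats = bench(lambda: "Tân Sơn Hòa, Hồ Chí Minh".split(","), iterations=1000, repeats=3)
print({key: round(value, 4) for key, value in stats.items()})
```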
Development Notes
- Python requirement: >=3.7
- Runtime dependencies: geopy, pandas, shapely, tqdm, unidecode
- Optional script dependencies: beautifulsoup4, numpy, requests, seleniumbase
License
setup.py declares the project as MIT-licensed. A standalone LICENSE file is not present in the current workspace snapshot.