OnFlow Location Platform for parsing, converting, and standardizing Vietnamese administrative units

OnFlow Location Platform

OnFlow Location Platform is a Python package and data workspace for working with Vietnamese administrative units across two address systems:

  • LEGACY: the historical 63-province structure
  • FROM_2025: the post-reform 34-province structure

The repository contains:

  • a runtime package for parsing and converting Vietnamese addresses
  • packaged lookup assets under src/onflow_location_platform/data
  • data preparation inputs under data
  • collection, processing, generation, and validation scripts under scripts

Highlights

  • Parse free-text Vietnamese administrative addresses into structured AdminUnit objects
  • Convert legacy addresses to the 2025 administrative structure
  • Standardize province, district, ward, and address columns in pandas DataFrames
  • Query packaged lookup data from the bundled SQLite database
  • Maintain a reproducible workflow from raw inputs to generated runtime assets

Unified Interface: OnFlowLocation

The OnFlowLocation class provides a unified, property-based API to all platform operations.

from onflow_location_platform import OnFlowLocation

# Initialize with address or codes
location_info = OnFlowLocation(
    address="842 Nguyễn Kiệm, Hạnh Thông, hồ chí minh",
    provide_code=91,
    district_code=913,
    ward_code=31066
)

# Parse or convert using properties
print(location_info.convert_address_new_to_old.format_address())
print(location_info.convert_address_old_to_new_by_code.format_address())

Package Naming

The public import path is onflow_location_platform.

Repository Layout

.
├── data/
│   ├── alias_keywords/
│   ├── raw/
│   ├── interim/
│   └── processed/
├── scripts/
│   ├── collecting_data/
│   ├── processing_data/
│   ├── generating_module_data/
│   └── testing_package/
├── src/
│   ├── onflow_location_platform/
│   │   ├── converter/
│   │   ├── data/
│   │   ├── database/
│   │   ├── pandas/
│   │   └── parser/
└── setup.py

How It Works

The repository operates as a small data platform plus a runtime package:

external sources
    -> data/raw
    -> data/interim
    -> data/processed
    -> scripts/generating_module_data
    -> src/onflow_location_platform/data
    -> src/onflow_location_platform parser / converter / pandas / database APIs

In practice, the workflow is:

  1. Collect source files from public endpoints or manual downloads into data/raw.
  2. Clean, map, and enrich those inputs into intermediate and processed datasets under data/interim and data/processed.
  3. Generate compact runtime assets for the package in src/onflow_location_platform/data, including parser dictionaries, conversion mappings, and the bundled SQLite database.
  4. Use the public API from src/onflow_location_platform to parse addresses, convert legacy addresses, standardize DataFrame columns, or query lookup data.

This means the runtime package does not depend on the workspace CSV files at execution time. It depends on the generated assets already bundled in src/onflow_location_platform/data.

Installation

Local editable install

python -m venv envs
source envs/bin/activate
pip install -e .

Local editable install with script dependencies

python -m venv envs
source envs/bin/activate
pip install -e '.[scripts]'

Quick Start

Parse a 2025-format address

from onflow_location_platform import parse_address_new

unit = parse_address_new("Tân Sơn Hòa, Hồ Chí Minh")
print(unit.format_address())

Parse a legacy-format address

from onflow_location_platform import parse_address_old

unit = parse_address_old(
    "Đường 15, Long Bình, Quận 9, Hồ Chí Minh",
    level=3,
)

print(unit.short_province, unit.short_district, unit.short_ward)

Convert a legacy address to the 2025 structure

from onflow_location_platform import convert_address_old_to_new

unit = convert_address_old_to_new("59 Nguyễn Sỹ Sách, Phường 15, Tân Bình, Hồ Chí Minh")
print(unit.format_address())

Convert a 2025 address back to legacy (first)

from onflow_location_platform import convert_address_new_to_old

unit = convert_address_new_to_old(
    "Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="first",
)
print(unit.format_address())
# Phường 1, Quận Gò Vấp, Thành Phố Hồ Chí Minh

Convert a 2025 address back to legacy (all)

from onflow_location_platform import convert_address_new_to_old

units = convert_address_new_to_old(
    "Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="all",
)
for unit in units:
    if unit:
        print(unit.format_address())
# Phường 1, Quận Gò Vấp Thành Phố Hồ Chí Minh
# Phường 3, Quận Gò Vấp Thành Phố Hồ Chí Minh

Convert a 2025 address back to legacy (geo)

from onflow_location_platform import convert_address_new_to_old

unit = convert_address_new_to_old(
    "842 Nguyễn Kiệm, Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="geo",
)
print(unit.format_address())
# 842 Nguyễn Kiệm, Phường 3, Quận Gò Vấp Thành Phố Hồ Chí Minh

Convert legacy codes to the 2025 structure

Use this when you already have the three legacy numeric primary-key codes (province, district, ward) stored in a database, instead of a free-text address.

from onflow_location_platform import convert_address_old_to_new_by_code

result = convert_address_old_to_new_by_code(
    provide_code=38,
    district_code=399,
    ward_code=16003
)
print(result.format_address())
# Xã Hoằng Thanh, Tỉnh Thanh Hóa

To convert individual legacy administrative codes, use the specific functions:

from onflow_location_platform import (
    convert_province_old_to_new_by_code,
    convert_district_old_to_new_by_code,
    convert_ward_old_to_new_by_code
)

# Convert a legacy province code
province = convert_province_old_to_new_by_code(38)
print(province.format_address())
# Tỉnh Thanh Hóa

# Convert a legacy district code (returns the new mapped province)
district = convert_district_old_to_new_by_code(399)
print(district.format_address())
# Tỉnh Thanh Hóa

# Convert a legacy ward code (returns the new mapped ward and province)
ward = convert_ward_old_to_new_by_code(16003)
print(ward.format_address())
# Xã Hoằng Thanh, Tỉnh Thanh Hóa

Get 2025 administrative info directly using new codes

from onflow_location_platform import (
    get_new_admin_unit_by_new_code,
    get_new_province_by_new_code,
    get_new_ward_by_new_code
)

# Get 2025 province info from a new province code
province = get_new_province_by_new_code("01")
print(province.province) # Thành phố Hà Nội

# Get 2025 ward info from a new ward code
ward = get_new_ward_by_new_code("00097")
print(ward.format_address()) # Phường Hồng Hà, Thành phố Hà Nội

# Get full information from both new province and ward codes
full_unit = get_new_admin_unit_by_new_code(province_code="01", ward_code="00097")

Standardize administrative unit columns in pandas

import pandas as pd
from onflow_location_platform.pandas import standardize_admin_unit_columns

df = pd.DataFrame(
    [
        {"province": "ha noi", "ward": "hong ha"},
        {"province": "hà nội", "ward": "ba đình"},
    ]
)

result = standardize_admin_unit_columns(
    df,
    province="province",
    ward="ward",
)

print(result)
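
The two sample rows above spell the same province with and without diacritics. A minimal sketch of accent-insensitive matching, using only the standard library, is shown below. The names fold_name, CANONICAL_PROVINCES, and standardize_province are hypothetical and are not part of the package; the package's own matching logic may differ.

```python
import unicodedata


def fold_name(text: str) -> str:
    """Lowercase, strip Vietnamese diacritics, and collapse whitespace."""
    # "đ"/"Đ" have no canonical decomposition, so map them by hand first.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits base letters from combining marks; drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


# Hypothetical lookup table: folded key -> canonical display name.
CANONICAL_PROVINCES = {
    fold_name("Hà Nội"): "Thành phố Hà Nội",
    fold_name("Thành phố Hà Nội"): "Thành phố Hà Nội",
}


def standardize_province(raw: str):
    """Return the canonical name for a raw province string, or None."""
    return CANONICAL_PROVINCES.get(fold_name(raw))


print(standardize_province("ha noi"))
print(standardize_province("hà nội"))
```

Both calls resolve to the same canonical name because the lookup key ignores case, accents, and repeated spaces.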

Query the bundled SQLite lookup data

from onflow_location_platform.database import get_data, query

print(get_data(fields=["province", "ward"], table="admin_units", limit=5))
print(query("SELECT province, ward FROM admin_units LIMIT 5"))
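
To try out SQL before running it against the bundled database, a stand-in table can be built in memory with the standard sqlite3 module. This is only a sketch: the table name admin_units and the province and ward columns come from the example above, but the real schema may contain more columns, and wards_in is a hypothetical helper. Whether query() accepts bound parameters is not documented here, so the sketch talks to sqlite3 directly.

```python
import sqlite3

# In-memory stand-in for the bundled database (assumed two-column schema).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE admin_units (province TEXT, ward TEXT)")
conn.executemany(
    "INSERT INTO admin_units (province, ward) VALUES (?, ?)",
    [
        ("Thành phố Hà Nội", "Phường Hồng Hà"),
        ("Tỉnh Thanh Hóa", "Xã Hoằng Thanh"),
    ],
)


def wards_in(province: str, limit: int = 5):
    """Return up to `limit` rows for one province as plain dicts."""
    rows = conn.execute(
        "SELECT province, ward FROM admin_units WHERE province = ? LIMIT ?",
        (province, limit),
    ).fetchall()
    return [dict(row) for row in rows]


print(wards_in("Thành phố Hà Nội"))
```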

Public API

parse_address_new(address, keep_street=True, level=2)

Parse a Vietnamese address in the 2025 34-province structure into an AdminUnit.

  • keep_street=True keeps street text when enough address segments are available
  • level=1 parses province
  • level=2 parses ward and province

parse_address_old(address, keep_street=True, level=3)

Parse a Vietnamese address in the legacy 63-province structure into an AdminUnit.

  • keep_street=True keeps street text when enough address segments are available
  • level=1 parses province
  • level=2 parses district and province
  • level=3 parses ward, district, and province
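
One way to picture the level and keep_street parameters is as a right-to-left assignment of comma-separated segments. The sketch below is a simplified illustration under that assumption; split_by_level and LEGACY_FIELDS are hypothetical names, and the real parser relies on packaged dictionaries and aliases rather than on commas alone.

```python
from typing import Dict, Optional

# Legacy fields ordered from most specific to least specific.
LEGACY_FIELDS = ["ward", "district", "province"]


def split_by_level(
    address: str, level: int = 3, keep_street: bool = True
) -> Dict[str, Optional[str]]:
    """Assign the rightmost `level` segments to administrative fields."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    segments = [part.strip() for part in address.split(",") if part.strip()]
    fields = LEGACY_FIELDS[-level:]  # level=2 -> ["district", "province"]
    result = {"street": None, "ward": None, "district": None, "province": None}
    # Walk from the right: the last segment is the province, and so on.
    for name, value in zip(reversed(fields), reversed(segments[-level:])):
        result[name] = value
    # Whatever is left of the administrative segments is street text.
    leftover = segments[:-level]
    if keep_street and leftover:
        result["street"] = ", ".join(leftover)
    return result


print(split_by_level("Đường 15, Long Bình, Quận 9, Hồ Chí Minh", level=3))
```

With fewer segments than the requested level, the more specific fields simply stay empty in this sketch.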

convert_address_old_to_new(address)

Convert a legacy-format address into a normalized AdminUnit in the 2025 structure.

convert_address_old_to_new_by_code(provide_code, district_code, ward_code, address=None)

Convert legacy administrative codes to the 2025 structure without any text parsing. Accepts the three legacy numeric primary-key codes stored in a database.

  • provide_code — legacy province numeric code (e.g. 79)
  • district_code — legacy district numeric code (e.g. 760)
  • ward_code — legacy ward numeric code (e.g. 26737)
  • address — optional raw address string attached to the returned AdminUnit
  • both int and str inputs are accepted; leading zeros are added automatically
  • raises KeyError if any code is not found, or ValueError if the codes are inconsistent (district does not belong to province, ward does not belong to district)
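
The input rules above (int or str accepted, zero padding, KeyError for unknown codes, ValueError for inconsistent ones) can be sketched as follows. This is an illustration, not the package's implementation: normalize_code, check_legacy_codes, and the toy lookup tables are hypothetical, and the padding widths of 2, 3, and 5 digits are an assumption.

```python
def normalize_code(value, width: int) -> str:
    """Return a zero-padded string code, accepting int or str input."""
    text = str(value).strip()
    if not text.isdigit():
        raise ValueError(f"code must be numeric, got {value!r}")
    return text.zfill(width)


# Toy lookup tables; the real package reads these from bundled assets.
PROVINCES = {"38", "79"}
DISTRICT_TO_PROVINCE = {"399": "38"}
WARD_TO_DISTRICT = {"16003": "399"}


def check_legacy_codes(provide_code, district_code, ward_code):
    """Validate that the three legacy codes exist and belong together."""
    province = normalize_code(provide_code, 2)  # assumed width
    district = normalize_code(district_code, 3)  # assumed width
    ward = normalize_code(ward_code, 5)  # assumed width
    if province not in PROVINCES:
        raise KeyError(f"unknown province code {province}")
    if district not in DISTRICT_TO_PROVINCE:
        raise KeyError(f"unknown district code {district}")
    if ward not in WARD_TO_DISTRICT:
        raise KeyError(f"unknown ward code {ward}")
    if DISTRICT_TO_PROVINCE[district] != province:
        raise ValueError("district does not belong to province")
    if WARD_TO_DISTRICT[ward] != district:
        raise ValueError("ward does not belong to district")
    return province, district, ward


print(check_legacy_codes(38, "399", 16003))
```

Callers that load codes from a database can wrap the real conversion call in try/except for KeyError and ValueError in the same way.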

convert_province_old_to_new_by_code(pk_id)

Resolve a single legacy province code to the 2025 province. Returns an AdminUnit with only province-level fields populated.

convert_district_old_to_new_by_code(pk_id)

Resolve a single legacy district code to the 2025 province it now belongs to. Districts do not exist in the 2025 format; only the parent province is returned.

convert_ward_old_to_new_by_code(pk_id)

Resolve a single legacy ward code to the 2025 ward and province.

get_new_admin_unit_by_new_code(province_code, ward_code, address=None)

Get a 2025 AdminUnit directly from a 2025 province code and ward code.

get_new_province_by_new_code(province_code)

Get a 2025 AdminUnit directly from a 2025 province code.

get_new_ward_by_new_code(ward_code, address=None)

Get a 2025 AdminUnit directly from a 2025 ward code. Returns the ward and its corresponding province.

convert_address_new_to_old(address, multi_match="first")

Convert a 2025-format address into legacy administrative units.

  • multi_match="first" returns one AdminUnit (first candidate)
  • multi_match="all" returns list[AdminUnit] (all candidates)
  • multi_match="geo" returns one AdminUnit, the candidate nearest to the geocoded input point, and falls back to the "first" behavior if needed
  • input must explicitly contain a 2025 province keyword; otherwise an empty result is returned
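
The three multi_match modes can be pictured as different ways of choosing among candidate legacy units. The sketch below assumes candidates are (label, coordinates) pairs and uses the haversine distance for the "geo" mode; pick and haversine_km are hypothetical helpers, the coordinates are placeholders rather than real locations, and the package's actual selection logic and empty-result type may differ.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees
Candidate = Tuple[str, Optional[Point]]


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick(candidates: List[Candidate], multi_match: str = "first",
         point: Optional[Point] = None):
    """Select candidates according to the multi_match mode."""
    if multi_match not in ("first", "all", "geo"):
        raise ValueError(f"unsupported multi_match: {multi_match!r}")
    if multi_match == "all":
        return list(candidates)
    if not candidates:
        return None
    if multi_match == "geo" and point is not None:
        located = [c for c in candidates if c[1] is not None]
        if located:
            return min(located, key=lambda c: haversine_km(point, c[1]))
    # "first", or "geo" without a usable point: take the first candidate.
    return candidates[0]


# Placeholder coordinates, not real locations.
candidates = [
    ("Phường 1, Quận Gò Vấp", (0.0, 0.0)),
    ("Phường 3, Quận Gò Vấp", (0.0, 1.0)),
]
print(pick(candidates, "geo", point=(0.0, 0.9)))
```

Because "all" returns a list while the other modes return a single item, calling code should branch on the mode it requested.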

standardize_admin_unit_columns(...)

Standardize province, district, and ward columns in a pandas DataFrame.

convert_address_column(...)

Convert a full address column and optionally attach old/new administrative attributes.

get_data(...) and query(sql)

Read lookup data from the bundled SQLite database in src/onflow_location_platform/data/dataset.db.
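
To check where the bundled database lives in an installed environment, the package location can be resolved with the standard library. This is a sketch: bundled_db_path is a hypothetical helper, and it assumes the installed layout mirrors the data/dataset.db path given above.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def bundled_db_path(package: str = "onflow_location_platform") -> Optional[Path]:
    """Return the expected dataset.db path, or None if the package is missing."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py.
    return Path(spec.origin).parent / "data" / "dataset.db"


path = bundled_db_path()
print(path if path else "package not installed")
```

The helper only computes the path; call path.exists() before opening the file.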

Data Directories

Runtime assets

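These files are bundled with the Python package under src/onflow_location_platform/data and used at runtime. They include parser dictionaries, conversion mappings, and the SQLite database dataset.db.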

Workspace data

The data directory is used for data preparation and notebook workflows. It holds the alias_keywords, raw, interim, and processed subfolders.

The repository currently ignores data/raw, data/interim, and data/processed, so those folders act as workspace outputs rather than committed package contents.

Scripts

Operational scripts and notebooks are grouped by purpose under scripts: collecting_data, processing_data, generating_module_data, and testing_package.

Example smoke test:

PYTHONPATH=. envs/bin/python scripts/testing_package/manual_parse_smoke_test.py

Example benchmark:

envs/bin/python scripts/testing_package/benchmark_public_api.py

Example collection script:

PYTHONPATH=. envs/bin/python scripts/collecting_data/scrape_sapnhap_bando_provinces_and_wards.py --date 2026-03-31 --verbose

Notes:

  • Some scripts require the optional dependencies from .[scripts]
  • Collection scripts require network access
  • Several workflows are notebook-driven rather than packaged as command-line tools

Benchmark

The repository includes a reproducible micro-benchmark for the public SDK API:

envs/bin/python scripts/testing_package/benchmark_public_api.py

Sample results from a local run on 2026-03-31 using Python 3.11.13 on macOS-15.5-arm64:

| API | Sample Input | Iterations / run | Best ms/op | Mean ms/op | Ops/s |
|---|---|---|---|---|---|
| parse_address_new | "Tân Sơn Hòa, Hồ Chí Minh" | 10,000 | 0.1181 | 0.1199 | 8469.8 |
| parse_address_old | "Long Bình, Quận 9, Hồ Chí Minh" (level=3) | 10,000 | 0.1003 | 0.1039 | 9965.8 |
| convert_address_old_to_new | "Phường 15, Tân Bình, Hồ Chí Minh" | 5,000 | 0.2364 | 0.2402 | 4230.0 |

These numbers are indicative and will vary by machine, Python version, and whether the tested input path requires external geocoding.
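
For a quick local comparison, similar statistics can be gathered with the standard timeit module. The sketch below is not the repository's benchmark script: bench and stand_in are hypothetical, and deriving Ops/s from the best time is an assumption that matches the sample table only to within rounding. Replace stand_in with a call to the API under test.

```python
import timeit
from statistics import mean


def bench(func, iterations: int = 10_000, repeats: int = 5) -> dict:
    """Time `func` and report best/mean milliseconds per call and ops/s."""
    # Each entry is the total seconds for `iterations` calls.
    totals = timeit.repeat(func, number=iterations, repeat=repeats)
    per_op_ms = [total / iterations * 1000.0 for total in totals]
    best = min(per_op_ms)
    return {
        "best_ms_per_op": best,
        "mean_ms_per_op": mean(per_op_ms),
        "ops_per_s": 1000.0 / best if best > 0 else float("inf"),
    }


def stand_in():
    # Placeholder workload; swap in e.g. a parse call when the package is installed.
    return "Tân Sơn Hòa, Hồ Chí Minh".split(",")


stats = bench(stand_in, iterations=1_000, repeats=3)
print(stats)
```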

Development Notes

  • Python requirement: >=3.7
  • Runtime dependencies: geopy, pandas, shapely, tqdm, unidecode
  • Optional script dependencies: beautifulsoup4, numpy, requests, seleniumbase

License

setup.py declares the project as MIT-licensed. A standalone LICENSE file is not present in the current workspace snapshot.
