
OnFlow Location Platform

OnFlow Location Platform is a Python package and data workspace for working with Vietnamese administrative units across two address systems:

  • LEGACY: the historical 63-province structure
  • FROM_2025: the post-reform 34-province structure

The repository contains:

  • a runtime package for parsing and converting Vietnamese addresses
  • packaged lookup assets under src/data
  • data preparation inputs under data
  • collection, processing, generation, and validation scripts under scripts

Highlights

  • Parse free-text Vietnamese administrative addresses into structured AdminUnit objects
  • Convert legacy addresses to the 2025 administrative structure
  • Standardize province, district, ward, and address columns in pandas DataFrames
  • Query packaged lookup data from the bundled SQLite database
  • Maintain a reproducible workflow from raw inputs to generated runtime assets

Package Naming

  • Repository / distribution name: onflow-location-platform
  • Python import path: onflow_location_platform
  • Runtime data directory: src/data

The public import path is onflow_location_platform.

Repository Layout

.
├── data/
│   ├── alias_keywords/
│   ├── raw/
│   ├── interim/
│   └── processed/
├── scripts/
│   ├── collecting_data/
│   ├── processing_data/
│   ├── generating_module_data/
│   └── testing_package/
├── src/
│   ├── converter/
│   ├── data/
│   ├── database/
│   ├── pandas/
│   └── parser/
└── setup.py

How It Works

The repository operates as a small data platform plus a runtime package:

external sources
    -> data/raw
    -> data/interim
    -> data/processed
    -> scripts/generating_module_data
    -> src/data
    -> src parser / converter / pandas / database APIs

In practice, the workflow is:

  1. Collect source files from public endpoints or manual downloads into data/raw.
  2. Clean, map, and enrich those inputs into intermediate and processed datasets under data/interim and data/processed.
  3. Generate compact runtime assets for the package in src/data, including parser dictionaries, conversion mappings, and the bundled SQLite database.
  4. Use the public API from src to parse addresses, convert legacy addresses, standardize DataFrame columns, or query lookup data.

The runtime package therefore does not read the workspace CSV files at execution time; it relies only on the generated assets already bundled in src/data.
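Step 3 of this workflow can be sketched with the standard library alone. The CSV content and the province-to-wards layout below are hypothetical stand-ins for a file under data/processed and an asset under src/data; the actual generation scripts live in scripts/generating_module_data.

```python
import csv
import io
import json

# Hypothetical processed CSV, standing in for a file under data/processed.
processed_csv = io.StringIO(
    "province,ward\n"
    "Hà Nội,Hồng Hà\n"
    "Hà Nội,Ba Đình\n"
    "Hồ Chí Minh,Tân Sơn Hòa\n"
)

# Build a compact province -> wards lookup, the kind of asset that could
# be written into src/data for the runtime package to load later.
lookup: dict[str, list[str]] = {}
for row in csv.DictReader(processed_csv):
    lookup.setdefault(row["province"], []).append(row["ward"])

# Serialize without escaping non-ASCII so Vietnamese names stay readable.
asset = json.dumps(lookup, ensure_ascii=False)
```

At runtime the package would only load the serialized asset, never the workspace CSVs, which is what keeps the installed distribution self-contained.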

Installation

Local editable install

python -m venv envs
source envs/bin/activate
pip install -e .

Local editable install with script dependencies

python -m venv envs
source envs/bin/activate
pip install -e '.[scripts]'

Quick Start

Parse a 2025-format address

from onflow_location_platform import parse_2025_address

unit = parse_2025_address("Tân Sơn Hòa, Hồ Chí Minh")
print(unit.format_address())

Parse a legacy-format address

from onflow_location_platform import parse_legacy_address

unit = parse_legacy_address(
    "Đường 15, Long Bình, Quận 9, Hồ Chí Minh",
    level=3,
)

print(unit.short_province, unit.short_district, unit.short_ward)

Convert a legacy address to the 2025 structure

from onflow_location_platform import convert_legacy_address_to_2025

unit = convert_legacy_address_to_2025("59 Nguyễn Sỹ Sách, Phường 15, Tân Bình, Hồ Chí Minh")
print(unit.format_address())

Standardize administrative unit columns in pandas

import pandas as pd
from onflow_location_platform.pandas import standardize_admin_unit_columns

df = pd.DataFrame(
    [
        {"province": "ha noi", "ward": "hong ha"},
        {"province": "hà nội", "ward": "ba đình"},
    ]
)

result = standardize_admin_unit_columns(
    df,
    province="province",
    ward="ward",
)

print(result)
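The core idea behind standardization, matching "ha noi" and "hà nội" to one canonical entry, can be illustrated with a stdlib-only sketch. The package itself depends on unidecode for this kind of work; the function below is a hypothetical simplification using unicodedata, not the library's implementation.

```python
import unicodedata

def normalize_name(text: str) -> str:
    """Lowercase and strip Vietnamese diacritics to build a matching key (sketch)."""
    # đ/Đ do not decompose under NFD, so map them explicitly first.
    text = text.lower().replace("đ", "d")
    # NFD splits accented characters into base character + combining marks,
    # which we then drop.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Both spellings in the DataFrame above collapse to the same key, so either form can resolve to one canonical province record.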

Query the bundled SQLite lookup data

from onflow_location_platform.database import get_data, query

print(get_data(fields=["province", "ward"], table="admin_units", limit=5))
print(query("SELECT province, ward FROM admin_units LIMIT 5"))

Public API

parse_2025_address(address, keep_street=True, level=2)

Parse a Vietnamese address in the 2025 34-province structure into an AdminUnit.

  • keep_street=True keeps street text when enough address segments are available
  • level=1 parses province
  • level=2 parses ward and province

parse_legacy_address(address, keep_street=True, level=3)

Parse a Vietnamese address in the legacy 63-province structure into an AdminUnit.

  • keep_street=True keeps street text when enough address segments are available
  • level=1 parses province
  • level=2 parses district and province
  • level=3 parses ward, district, and province
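The level parameter can be pictured as deciding how many trailing comma-separated segments count as administrative units, with anything before them treated as street text. This is a hypothetical sketch of that segmentation, not the package's parser, which also handles aliases and fuzzy matching.

```python
def split_segments(address: str, level: int = 3) -> tuple[str, list[str]]:
    """Naive sketch: the last `level` segments are admin units, the rest is street."""
    segments = [s.strip() for s in address.split(",")]
    street = ", ".join(segments[:-level])  # empty when no street segments remain
    return street, segments[-level:]

street, units = split_segments("Đường 15, Long Bình, Quận 9, Hồ Chí Minh", level=3)
# units are ordered ward, district, province
```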

convert_legacy_address_to_2025(address)

Convert a legacy-format address into a normalized AdminUnit in the 2025 structure.
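One plausible shape for the conversion step is a lookup from legacy (province, district, ward) triples to 2025 (province, ward) pairs. The mapping fragment below is purely illustrative, echoing the Quick Start addresses; the real conversion table ships as a generated asset in src/data.

```python
# Hypothetical fragment of a (province, district, ward) -> (province, ward)
# mapping; pairs shown are illustrative, not authoritative mapping data.
LEGACY_TO_2025 = {
    ("Hồ Chí Minh", "Tân Bình", "Phường 15"): ("Hồ Chí Minh", "Tân Sơn Hòa"),
}

def convert(province: str, district: str, ward: str) -> tuple[str, str]:
    # Fall back to dropping the district level when no explicit mapping exists.
    return LEGACY_TO_2025.get((province, district, ward), (province, ward))
```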

standardize_admin_unit_columns(...)

Standardize province, district, and ward columns in a pandas DataFrame.

convert_address_column(...)

Convert a full address column and optionally attach old/new administrative attributes.

get_data(...) and query(sql)

Read lookup data from the bundled SQLite database in src/data/dataset.db.
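Because the lookup data is plain SQLite, it can also be inspected with the standard library directly. The snippet below builds an in-memory stand-in for src/data/dataset.db; the admin_units table name and columns follow the query examples above, while the rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the bundled src/data/dataset.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admin_units (province TEXT, ward TEXT)")
conn.executemany(
    "INSERT INTO admin_units VALUES (?, ?)",
    [("Hà Nội", "Ba Đình"), ("Hồ Chí Minh", "Tân Sơn Hòa")],
)

# The same SQL shown in the Quick Start query example.
rows = conn.execute("SELECT province, ward FROM admin_units LIMIT 5").fetchall()
```

Against the real database, the same SQL could be run by connecting to the bundled file instead of ":memory:", though the packaged query helper is the supported interface.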

Data Directories

Runtime assets

These files are bundled with the Python package and used at runtime: the parser dictionaries, the conversion mappings, and the SQLite database at src/data/dataset.db.

Workspace data

The data directory is used for data preparation and notebook workflows, covering the alias_keywords, raw, interim, and processed subdirectories.

The repository currently ignores data/raw, data/interim, and data/processed, so those folders act as workspace outputs rather than committed package contents.

Scripts

Operational scripts and notebooks are grouped by purpose into collecting_data, processing_data, generating_module_data, and testing_package.

Example smoke test:

PYTHONPATH=. envs/bin/python scripts/testing_package/manual_parse_smoke_test.py

Example benchmark:

envs/bin/python scripts/testing_package/benchmark_public_api.py

Example collection script:

PYTHONPATH=. envs/bin/python scripts/collecting_data/scrape_sapnhap_bando_provinces_and_wards.py --date 2026-03-31 --verbose

Notes:

  • Some scripts require the optional dependencies from .[scripts]
  • Collection scripts require network access
  • Several workflows are notebook-driven rather than packaged as command-line tools

Benchmark

The repository includes a reproducible micro-benchmark for the public API:

envs/bin/python scripts/testing_package/benchmark_public_api.py

Sample results from a local run on 2026-03-31 using Python 3.11.13 on macOS-15.5-arm64:

API                             Sample input                         Iterations/run  Best ms/op  Mean ms/op  Ops/s
parse_2025_address              "Tân Sơn Hòa, Hồ Chí Minh"           10,000          0.1181      0.1199      8469.8
parse_legacy_address (level=3)  "Long Bình, Quận 9, Hồ Chí Minh"     10,000          0.1003      0.1039      9965.8
convert_legacy_address_to_2025  "Phường 15, Tân Bình, Hồ Chí Minh"   5,000           0.2364      0.2402      4230.0

These numbers are indicative and will vary by machine, Python version, and whether the tested input path requires external geocoding.
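The shape of such a micro-benchmark can be reproduced with stdlib timeit. The work function below is a trivial placeholder, not the benchmark script's actual target; swapping in a real API call such as parse_2025_address would yield figures comparable to the table.

```python
import timeit

# Placeholder workload standing in for a public API call.
def work() -> None:
    "Tân Sơn Hòa, Hồ Chí Minh".split(",")

iterations = 1000
# Three timed runs of `iterations` calls each; take the best run,
# as micro-benchmarks conventionally do to reduce scheduling noise.
runs = timeit.repeat(work, number=iterations, repeat=3)
best_ms_per_op = min(runs) / iterations * 1000
ops_per_s = 1000 / best_ms_per_op
```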

Development Notes

  • Python requirement: >=3.7
  • Runtime dependencies: geopy, pandas, shapely, tqdm, unidecode
  • Optional script dependencies: beautifulsoup4, numpy, requests, seleniumbase

License

setup.py declares the project as MIT-licensed. A standalone LICENSE file is not present in the current workspace snapshot.

Download files

  • Source distribution: onflow_location_platform-1.0.1.tar.gz (1.9 MB)
  • Built distribution: onflow_location_platform-1.0.1-py3-none-any.whl (1.9 MB)

File details

onflow_location_platform-1.0.1.tar.gz

  • Size: 1.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

  Algorithm    Hash digest
  SHA256       d03a4761c2429a98ae0a80324eec16cf1c4f295219a2745a2a437a5c494bea1f
  MD5          5d406ba41aacb2f39afb5b39197d54e3
  BLAKE2b-256  7636a840db273a90429566e5ca9c032fda77873f7e81a46f300cccaa9c702619

onflow_location_platform-1.0.1-py3-none-any.whl

  • Size: 1.9 MB
  • Tags: Python 3

  Algorithm    Hash digest
  SHA256       9081dc7d9160ecf0d620b3d1e2e03b4d9d68a88c484a4f7e34e7d7e31c6b79a1
  MD5          81ea030cac7ec36f610350b4de3f77b7
  BLAKE2b-256  75f50a303d87121b23340d36a98c9a0a2a458bd811d871fd471e9e48a57a6d58
