
Build administrative evolution keys across time with exact-match constrained Gemini adjudication

Project description

AdminLineageAI

AdminLineageAI builds crosswalks between administrative units such as districts (ADM2), subdistricts (ADM3), states (ADM1), and countries (ADM0) across datasets that may come from completely different sources and periods. It uses AI to compare likely matches, reason over spelling variants, language-specific forms, and administrative splits, merges, and renames, and produces a usable crosswalk plus review artifacts.

Matching administrative units by hand is labour-intensive. This package aims to reduce that manual work while keeping a clear review trail and reproducibility.

The package generates candidate matches between two datasets, asks Gemini to choose among them, and writes a final evolution key plus review artifacts as CSV and Parquet.

Published package: adminlineage on PyPI

This is an experimental utility. Treat these crosswalks as assistive outputs and cross-verify them, especially in important cases.

Possible use cases

Below are a few scenarios where this package can help. We would also love to hear about other user experiences and use cases.

  • Matching messy source names against a standard list. Suppose you have a dataset from a government scheme and need to match it against a standard administrative list such as a census table. The scheme source may write Paschimi Singhbhum while the census uses West Singhbhum. Plain fuzzy matching misses cases like this unless you manually standardize prefixes and suffixes first, whereas an AI model can resolve it because it knows that paschim means west in Hindi. The same kind of issue shows up across many widely spoken languages.
  • Handling administrative churn. Districts and other units are regularly split, merged, renamed, or grouped differently, and there is often no up-to-date public evolution list for newly created units. The package runs a wide Google search to find possible predecessors or successors for each administrative unit in the primary dataset.
  • Creating entirely new evolution crosswalks between two time periods at a given administrative level, where no crosswalk exists.

Important Features

  • The default settings of the package are tuned for the best results at minimal token cost. Feel free to change them to suit your needs.
  • To keep token costs minimal, exact string matching plus pruning of matching candidates runs on the primary side before the first stage.
  • Hierarchical matching with exact_match. If your data are nested, you can match names within exact scopes such as country, state, or district. For example, you can match district names only within the same state, or subdistrict names only within the same district. This works well, but the exact-match column strings need to line up exactly across both datasets.
  • Replay and reproducibility. Academic pipelines often need to be rerun many times. With replay enabled, repeated semantic requests can reuse prior completed LLM work instead of calling the API again. The seed parameter helps keep request identity deterministic and makes reruns easier to reproduce.

The supported live workflow in AdminLineageAI is:

  • Compatible with any gemini-3+ model
  • Google Search grounding enabled
  • strict JSON output from the model
  • user-controlled batching with automatic split fallback on failed multi-row requests
  • an optional bounded second-stage rescue pass for unmatched rows when string_exact_match_prune is set to from or to

The bounded second stage works like this:

  • first pass still does the normal grounded shortlist adjudication
  • if string_exact_match_prune="from", the rescue pass revisits rows with merge="only_in_from"
  • if string_exact_match_prune="to", it revisits rows with merge="only_in_to"
  • it runs one grounded research call to look for a predecessor or successor name
  • if that research comes back as unknown with no lineage hint, the row is left alone and the rescue pass stops there
  • otherwise it searches the full opposite table, rebuilds a short global shortlist, and runs one final strict JSON decision call without additional search grounding
  • the second stage is sequential, one-pass, resumable, and writes second_stage_results.jsonl

How To Use

You do not need the CLI to use AdminLineageAI. The simplest path is the Python API.

  1. Install the published package.
pip install adminlineage

Install the optional parquet dependency if you want parquet output support:

pip install "adminlineage[io]"
  2. Set a Gemini API key in GEMINI_API_KEY, or use another environment variable name and pass it explicitly.
GEMINI_API_KEY=your_api_key_here

The package can load a nearby .env file when it looks for the key.

  3. Choose the name column on each side, and add optional exact-match columns, IDs, or extra context columns if you have them.

  4. Run the matcher.

import pandas as pd
import adminlineage

df_from = pd.read_csv("from_units.csv")
df_to = pd.read_csv("to_units.csv")

crosswalk_df, metadata = adminlineage.build_evolution_key(
    df_from,
    df_to,
    country="India",
    year_from=1951,
    year_to=2001,
    map_col_from="district",
    map_col_to="district",
    exact_match=["state"],
    id_col_from="unit_id",
    id_col_to="unit_id",
    relationship="auto",
    string_exact_match_prune="from",
    evidence=False,
    reason=False,
    model="gemini-3.1-flash-lite-preview",
    gemini_api_key_env="GEMINI_API_KEY",
    replay_enabled=True,
    seed=42,
)

print(crosswalk_df[["from_name", "to_name", "merge", "score"]].head())
print(metadata["artifacts"])
  5. Review the outputs. By default, AdminLineageAI writes artifacts under outputs/<country>_<year_from>_<year_to>_<map_col_from>. The main ones are evolution_key.csv, review_queue.csv, and run_metadata.json.
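To see which rows need attention, you can filter by score the same way review_queue.csv is built. The rows below are illustrative, and the 0.6 cutoff matches the documented default review_score_threshold:

```python
import pandas as pd

# Hypothetical rows mimicking a slice of the evolution-key schema;
# the real output has many more columns.
crosswalk = pd.DataFrame({
    "from_name": ["Faizabad", "Allahabad", "Agra"],
    "to_name":   ["Ayodhya", "Prayagraj", "Agra"],
    "merge":     ["both", "both", "both"],
    "score":     [0.55, 0.92, 1.00],
})

# Rows under the review threshold end up in review_queue.csv.
review = crosswalk[crosswalk["score"] < 0.6]
print(review[["from_name", "to_name", "score"]])
```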

Common Options

  • exact_match: Restricts matching to rows that agree exactly on one or more scope columns such as country, state, or district.
  • string_exact_match_prune: Controls how aggressively exact string hits are removed from later AI work. Use this to control token spend.
  • relationship: Declares the kind of relationship you expect, or leave it as auto.
  • max_candidates: Limits how many candidate rows are shown to the model for each source row. The default is 6.
  • evidence: Adds a short factual summary column.
  • reason: Adds a longer explanation column.
  • replay_enabled: Reuses prior completed LLM work when the semantic request matches.
  • seed: Keeps request identity deterministic for more reproducible reruns.
  • output_dir: Changes where run artifacts are written.

Matching Flow Example

This example follows a nested district-level match inside India > Uttar Pradesh from 2011 to 2025. Here string_exact_match_prune='to' (this makes to the primary side and from the secondary side, where all candidates stay global).

flowchart TD
    A["From table (2011)<br/>India / Uttar Pradesh / Agra<br/>India / Uttar Pradesh / Kanpur Dehat<br/>India / Uttar Pradesh / Faizabad<br/>India / Uttar Pradesh / Allahabad"]
    B["To table (2025)<br/>India / Uttar Pradesh / Agra<br/>India / Uttar Pradesh / Kanpur Rural<br/>India / Uttar Pradesh / Ayodhya<br/>India / Uttar Pradesh / Prayagraj"]
    C["Nested settings<br/>map_col='district'<br/>exact_match=['state']<br/>string_exact_match_prune='to'<br/>this sets 'to' as the primary side<br/>and 'from' as the secondary side<br/>where all candidates stay global"]
    D["Validate inputs and normalize names"]
    E["Exact string match pruning before LLM"]
    F["Agra -> Agra<br/>no LLM used here<br/>just exact string match"]
    H["AI matches remaining rows on primary side<br/>(Kanpur Rural, Ayodhya, Prayagraj)<br/>using grounded Gemini search<br/>"]
    I["AI matches Kanpur Dehat -> Kanpur Rural<br/>because it has context that 'dehat' means 'rural' in Hindi"]
    J{"Do Ayodhya or Prayagraj stay unmatched<br/>after first stage?"}
    L["Do intensive Gemini search of potential predecessor / successor of Ayodhya / Prayagraj<br/>if they were renamed, merged, split, or transferred"]
    M["If Gemini finds a potential predecessor / successor for that district<br/>match it with the global district list from the secondary side"]
    N["Write final evolution key<br/>Agra -> Agra<br/>Kanpur Dehat -> Kanpur Rural<br/>Faizabad -> Ayodhya<br/>Allahabad -> Prayagraj"]
    O["Write artifacts<br/>evolution_key.csv<br/>review_queue.csv<br/>run_metadata.json<br/>replay bundle"]

    subgraph G["First stage"]
        H
        I
    end

    subgraph P["Second stage"]
        L
        M
    end

    A --> C
    B --> C
    C --> D
    D --> E
    E --> F
    E --> H
    H --> I
    I --> J
    J -- "No" --> N
    J -- "Yes" --> L
    L --> M
    A --> N
    B --> N
    M --> N
    N --> O
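The exact-string pruning step in the flow above can be sketched with plain pandas. The district names mirror the flowchart; the real pipeline additionally normalizes names before comparing:

```python
import pandas as pd

from_units = pd.DataFrame({"district": ["Agra", "Kanpur Dehat", "Faizabad", "Allahabad"]})
to_units   = pd.DataFrame({"district": ["Agra", "Kanpur Rural", "Ayodhya", "Prayagraj"]})

# Exact string matches are settled before any LLM call.
exact = from_units.merge(to_units, on="district")

# With string_exact_match_prune="to", names already matched leave the
# primary ("to") side, and only the remaining rows go to the AI stage.
remaining_to = to_units[~to_units["district"].isin(exact["district"])]
print(exact["district"].tolist())
print(remaining_to["district"].tolist())
```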

Hand Check Against Scheme Ground Truth

This is a quick hand check against a human-made evolution key for a government scheme implemented nationally in India. The scheme side is 2025 districts, mapped back to their predecessor 2011 districts.

The comparison is oriented from the scheme side: for each district_2025 in the hand key, does the evolution key recover the expected district_2011 predecessor? Names were normalized before comparison, and spelling or transliteration-only differences were counted as aligning. A row counts as a match only when the evolution key has a non-blank from_name.

  • aligns means the evolution key points to the same 2011 district name
  • disagrees means the evolution key points to a different 2011 district
  • no match means the evolution key does not provide any non-blank from_name
Outcome Count Share of 612 hand-coded district pairs
Aligns with scheme hand mapping 595 97.22%
Disagrees with scheme hand mapping 11 1.80%
Evolution key provides no 2011 match 6 0.98%

Takeaway: most scheme districts map back to the same 2011 predecessor as the hand key, a few disagree, and a small number have no match. Treat this as a sanity check, not a full audit.
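The counts and shares in the table can be reproduced with a few lines of arithmetic:

```python
total = 612
counts = {"aligns": 595, "disagrees": 11, "no_match": 6}

# The three outcomes partition the hand-coded pairs.
assert sum(counts.values()) == total

# Shares as percentages, rounded to two decimals as in the table.
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(shares)
```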

Optional CLI Workflow

The CLI is useful when you want a saved YAML config for repeatable runs, but it is optional.

adminlineage preview --config examples/config/example.yml
adminlineage validate --config examples/config/example.yml
adminlineage run --config examples/config/example.yml
adminlineage export --input outputs/india_1951_2001_subdistrict/evolution_key.csv --format jsonl

The package includes these example assets:

  • examples/config/example.yml
  • examples/loaders/sample_loader.py
  • examples/adminlineage_gemini_3_1_flash_lite.ipynb

Python API

Public objects available from import adminlineage:

  • build_evolution_key
  • preview_plan
  • validate_inputs
  • export_crosswalk
  • get_output_schema_definition
  • OUTPUT_SCHEMA_VERSION
  • __version__

build_evolution_key

Build the evolution key and write run artifacts.

Required arguments:

Argument Type Meaning
df_from pd.DataFrame Earlier-period table
df_to pd.DataFrame Later-period table
country str Country label used in prompts and metadata
year_from int | str Earlier-period label
year_to int | str Later-period label
map_col_from str Source name column

Optional arguments:

Argument Type Default Meaning
map_col_to str | None None Target name column. Falls back to map_col_from when omitted.
exact_match list[str] | None None Columns that must agree before comparison.
id_col_from str | None None Source ID column.
id_col_to str | None None Target ID column.
extra_context_cols list[str] | None None Extra columns added to the model payload.
relationship str auto One of auto, father_to_father, father_to_child, child_to_father, child_to_child.
string_exact_match_prune str none Pruning mode: none keeps exact-string hits in later AI work; from removes matched source rows from AI work; to removes matched source and target rows from later AI work.
evidence bool False Adds a short evidence summary and includes the evidence column.
reason bool False Adds a longer explanation in the reason column.
model str gemini-3.1-flash-lite-preview Gemini model name.
gemini_api_key_env str GEMINI_API_KEY Environment variable name used for the API key.
batch_size int 5 Maximum number of source rows per Gemini request. When a multi-row request fails, the pipeline retries in smaller batches.
max_candidates int 6 Candidate shortlist size per source row.
output_dir str | Path outputs Base output directory for run artifacts.
seed int 42 Deterministic seed for repeatable request identity.
temperature float 0.75 Gemini temperature.
enable_google_search bool True Enables grounded Gemini adjudication.
request_timeout_seconds int | None 90 Per-request timeout.
env_search_dir str | Path | None None Starting directory used when searching for .env.
replay_enabled bool False Reuses prior completed LLM work when the semantic request matches.
replay_store_dir str | Path | None None Replay store path. Falls back to .adminlineage_replay internally when replay is enabled.

Return value:

  • tuple[pd.DataFrame, dict]
  • first item: the crosswalk DataFrame
  • second item: run metadata with counts, warnings, request details, and artifact paths

preview_plan

Preview grouping and candidate-generation behavior without calling Gemini.

adminlineage.preview_plan(
    df_from,
    df_to,
    *,
    country,
    year_from,
    year_to,
    map_col_from,
    map_col_to=None,
    exact_match=None,
    id_col_from=None,
    id_col_to=None,
    extra_context_cols=None,
    string_exact_match_prune="none",
    max_candidates=6,
)

Return value: a diagnostics dict describing validity, group sizes, exact-string hits, and candidate budgets.

validate_inputs

Validate the two input tables without running the pipeline.

adminlineage.validate_inputs(
    df_from,
    df_to,
    *,
    country,
    map_col_from,
    map_col_to=None,
    exact_match=None,
    id_col_from=None,
    id_col_to=None,
)

Return value: a diagnostics dict that reports whether the inputs are valid and what is missing or duplicated.

export_crosswalk

Convert a materialized crosswalk file into another format.

adminlineage.export_crosswalk(
    input_path="outputs/india_1951_2001_subdistrict/evolution_key.csv",
    output_format="jsonl",
    output_path=None,
)

Return value: the written output path.

Supported output formats:

  • csv
  • parquet
  • jsonl

get_output_schema_definition

Return a machine-readable description of the materialized output schema.

schema = adminlineage.get_output_schema_definition(include_evidence=False)

Arguments:

Argument Type Default Meaning
include_evidence bool False Includes the evidence column in the returned schema definition.

Return value: a dict containing the schema version, ordered output columns, required columns, and enum values, including the merge indicator enum.

OUTPUT_SCHEMA_VERSION

String constant for the current materialized output schema version.

__version__

String constant for the package version.

Optional CLI Reference

Commands:

adminlineage run --config path/to/config.yml
adminlineage preview --config path/to/config.yml
adminlineage validate --config path/to/config.yml
adminlineage export --input path/to/evolution_key.csv --format {csv|parquet|jsonl} [--output path]

preview and validate do not call Gemini. run writes the full artifact set. export converts an existing materialized crosswalk file. If you are using the Python API directly, you can ignore this section.

CLI YAML Config Reference

Top-level sections:

  • request
  • data
  • llm
  • pipeline
  • cache
  • retry
  • replay
  • output

request

Key Default Meaning
country required Country label used in prompts and metadata.
year_from required Earlier-period label.
year_to required Later-period label.
map_col_from required Source name column.
map_col_to null Target name column. Falls back to map_col_from.
exact_match [] Columns that must agree before comparison.
id_col_from null Source ID column.
id_col_to null Target ID column.
extra_context_cols [] Extra columns added to the model payload.
relationship auto Relationship mode.
string_exact_match_prune none Exact-string pruning mode.
evidence false Adds the evidence column.
reason false Adds the reason column.

data

Key Default Meaning
mode files One of files or python_hook.
from_path null Required when mode: files.
to_path null Required when mode: files.
callable null Required when mode: python_hook. Uses module:function syntax.
params {} Arbitrary config payload passed to the loader hook.

Loader contract for python_hook mode:

def load_data(config: dict) -> tuple[pd.DataFrame, pd.DataFrame]:
    ...

The included example hook is examples/loaders/sample_loader.py.

For file mode, data.from_path and data.to_path are resolved relative to the config file location, not your shell location.
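A minimal hook satisfying this contract might look like the sketch below. The assumption that data.params arrives under config["params"] is illustrative; the package's own sample is examples/loaders/sample_loader.py:

```python
import pandas as pd

def load_data(config: dict) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Illustrative loader hook: build both tables from the config payload."""
    params = config.get("params", {})  # assumed location of data.params
    df_from = pd.DataFrame(params.get("from_rows", []))
    df_to = pd.DataFrame(params.get("to_rows", []))
    return df_from, df_to

# The CLI would import this via module:function syntax, e.g. my_loader:load_data.
df_from, df_to = load_data({
    "params": {
        "from_rows": [{"state": "Uttar Pradesh", "district": "Faizabad"}],
        "to_rows":   [{"state": "Uttar Pradesh", "district": "Ayodhya"}],
    }
})
```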

llm

Key Default Meaning
provider gemini Use gemini for live runs or mock for dry runs and testing.
model gemini-3.1-flash-lite-preview Gemini model name.
gemini_api_key_env GEMINI_API_KEY Environment variable name for the API key.
temperature 0.75 Gemini temperature.
seed 42 Deterministic seed.
enable_google_search true Enables grounded adjudication.
request_timeout_seconds 90 Per-request timeout.

pipeline

Key Default Meaning
batch_size 5 Maximum number of source rows per Gemini request. Failed multi-row requests are retried in smaller batches.
max_candidates 6 Candidate shortlist size per source row. You can raise this if you want a wider shortlist.
review_score_threshold 0.6 Rows below this score are flagged for review.

cache

Key Default Meaning
enabled true Enables the SQLite LLM cache.
backend sqlite Current cache backend.
path llm_cache.sqlite Cache database path.

retry

Key Default Meaning
max_attempts 6 Maximum retry attempts for transient LLM failures.
base_delay_seconds 1.0 Initial retry delay.
max_delay_seconds 20.0 Maximum retry delay.
jitter_seconds 0.2 Random jitter added to retry timing.
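These keys describe a standard exponential backoff with jitter. The exact formula inside the package is not documented here, but a plausible schedule under these defaults looks like this sketch:

```python
import random

def retry_delays(max_attempts=6, base=1.0, max_delay=20.0, jitter=0.2, seed=0):
    """Sketch of one plausible backoff schedule; not the package's actual code."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_attempts - 1):  # no delay after the final attempt
        delay = min(max_delay, base * (2 ** attempt))  # 1, 2, 4, 8, 16...
        delays.append(delay + rng.uniform(0, jitter))  # small random jitter
    return delays

print([round(d, 2) for d in retry_delays()])
```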

replay

Key Default Meaning
enabled false Enables exact replay for fully completed runs.
store_dir .adminlineage_replay Replay bundle directory.

Relative replay store paths are resolved from the config file location. This section only matters if you are using the CLI workflow.

output

Key Default Meaning
write_csv true Writes evolution_key.csv.
write_parquet true Writes evolution_key.parquet.

Minimal config shape:

request:
  country: India
  year_from: 1951
  year_to: 2001
  map_col_from: subdistrict
  map_col_to: subdistrict
  exact_match: [state, district]
  id_col_from: unit_id
  id_col_to: unit_id
  relationship: auto
  string_exact_match_prune: none
  evidence: false
  reason: false

data:
  mode: files
  from_path: ../data/from_units.csv
  to_path: ../data/to_units.csv

llm:
  provider: gemini
  model: gemini-3.1-flash-lite-preview
  gemini_api_key_env: GEMINI_API_KEY
  temperature: 0.75
  seed: 42
  enable_google_search: true
  request_timeout_seconds: 90

pipeline:
  batch_size: 5
  max_candidates: 6
  review_score_threshold: 0.6

cache:
  enabled: true
  backend: sqlite
  path: llm_cache.sqlite

retry:
  max_attempts: 6
  base_delay_seconds: 1.0
  max_delay_seconds: 20.0
  jitter_seconds: 0.2

replay:
  enabled: false
  store_dir: .adminlineage_replay

output:
  write_csv: true
  write_parquet: true

Outputs And Utilities

Main Artifacts

Artifact Meaning
evolution_key.csv Main crosswalk output.
evolution_key.parquet Parquet version of the crosswalk output.
review_queue.csv Rows that need manual review.
run_metadata.json Run counts, warnings, request details, and artifact paths.
links_raw.jsonl Incremental per-row decision log used for resumability and replay publishing.
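Because links_raw.jsonl is a JSON-lines log, it can be inspected with the standard library alone. The field names in this sketch are assumptions, not the package's documented schema:

```python
import io
import json

# Two illustrative decision-log lines in JSON-lines shape.
raw = io.StringIO(
    '{"from_name": "Faizabad", "to_name": "Ayodhya", "score": 0.91}\n'
    '{"from_name": "Allahabad", "to_name": "Prayagraj", "score": 0.93}\n'
)

# One JSON object per line; blank lines are skipped.
decisions = [json.loads(line) for line in raw if line.strip()]
print(len(decisions), decisions[0]["to_name"])
```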

Crosswalk Columns

Column Meaning
from_name, to_name Raw source and target names.
from_canonical_name, to_canonical_name Normalized names used during matching.
from_id, to_id User IDs when supplied, otherwise fallback internal IDs.
score Confidence in the chosen link, in [0, 1].
link_type One of rename, split, merge, transfer, no_match, unknown.
relationship One of father_to_father, father_to_child, child_to_father, child_to_child, unknown.
merge both for matched rows, only_in_from for source-only rows, only_in_to for target-only rows appended after the source pass.
evidence Short grounded summary. Included only when evidence=True.
reason Longer explanation. Present as a column, but empty unless reason=True.
exact-match columns Copied context columns from the request, such as state or district.
country, year_from, year_to Request metadata.
run_id Deterministic run identifier.
from_key, to_key Internal stable keys used by the pipeline.
constraints_passed Constraint checks recorded for that row.
review_flags, review_reason QA flags and their comma-joined summary.

review_queue.csv is a filtered subset of the crosswalk for rows that were flagged for manual review. Target-only rows remain in the final evolution key with merge="only_in_to".
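The merge values partition the final key, so a quick pandas filter separates matched rows from the one-sided ones. The rows here are illustrative:

```python
import pandas as pd

key = pd.DataFrame({
    "from_name": ["Agra", "Kanpur Dehat", None],
    "to_name":   ["Agra", "Kanpur Rural", "New District"],
    "merge":     ["both", "both", "only_in_to"],
})

matched     = key[key["merge"] == "both"]
target_only = key[key["merge"] == "only_in_to"]  # stays in the final key
print(len(matched), len(target_only))
```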

Operational Notes

  • exact_match scopes the candidate search. If you set exact_match=["state", "district"], a row only compares against rows from the same (state, district) group. This is the main hierarchical matching mechanism in the package.
  • Candidate generation happens before Gemini. max_candidates controls how many shortlist entries the model sees for each source row. The default is 6, but you can still raise it explicitly.
  • Exact string handling happens before the model call. string_exact_match_prune controls whether already matched rows remain in later AI work.
  • Live Gemini work is grounded with Google Search and returns strict JSON. The pipeline then materializes CSV and Parquet outputs itself.
  • When string_exact_match_prune is from or to, the package can run one bounded second-stage rescue pass on unmatched primary-side rows. That pass does one grounded research call, and only does a second shortlist decision call if the research returned a usable lineage_hint.
  • Replay is opt-in. When replay_enabled=True, rerunning the same semantic request reuses the prior completed LLM output instead of calling Gemini again.
  • seed helps keep request identity deterministic and makes runs easier to reproduce.
  • Cache is configured in CLI config. When enabled, the package uses a SQLite cache at cache.path.
  • Retry behavior is configurable in CLI config. Transient Gemini failures are retried according to the retry section before a row is marked unresolved.
  • export_crosswalk and adminlineage export convert an existing materialized crosswalk into csv, parquet, or jsonl.

A Few Practical Defaults

  • model="gemini-3.1-flash-lite-preview"
  • temperature=0.75
  • enable_google_search=True
  • evidence=False
  • reason=False
  • relationship="auto"
  • string_exact_match_prune="none"

Those are the current defaults. Change them when you need replay, evidence, stricter scoping, or different review thresholds.

Reporting Issues

If you run into a bug, a broken match, or a confusing output, please open an issue on GitHub.

The most useful issue reports include:

  • the package version, for example adminlineage.__version__
  • whether you used the Python API, CLI, or one of the notebooks
  • the model name and the main matching settings, especially exact_match, string_exact_match_prune, batch_size, max_candidates, and enable_google_search
  • whether the run was fresh, resumed from an existing output directory, or reused replay artifacts
  • a small sanitized input example that reproduces the problem
  • the relevant rows from evolution_key.csv or review_queue.csv
  • run_metadata.json, and when relevant links_raw.jsonl, grounding_notes.jsonl, and second_stage_results.jsonl
  • the traceback or log excerpt if the run failed

Citation

If you use AdminLineageAI in published work, please cite the package and briefly report the workflow you used.

Suggested software citation:

Siddiqui, T. I., and Vetharenian Hari. (2026). AdminLineageAI (Version 0.2.1) [Python package]. https://pypi.org/project/adminlineage/

If the workflow matters for interpretation, report the key settings in your methods or appendix:

  • country and time span
  • administrative level and exact-match scope
  • string_exact_match_prune mode
  • Gemini model name
  • whether Google Search grounding was enabled
  • whether the bounded second-stage rescue pass was active
  • whether outputs were manually reviewed or corrected

Download files

Download the file for your platform.

Source Distribution

adminlineage-0.2.2.tar.gz (61.5 kB)

Uploaded Source

Built Distribution


adminlineage-0.2.2-py3-none-any.whl (63.7 kB)

Uploaded Python 3

File details

Details for the file adminlineage-0.2.2.tar.gz.

File metadata

  • Download URL: adminlineage-0.2.2.tar.gz
  • Upload date:
  • Size: 61.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for adminlineage-0.2.2.tar.gz
Algorithm Hash digest
SHA256 2675a35b02741f28fb1d8f7db76db91ccc17a5f3d589817996b316ef41fe112e
MD5 e10f863603c7a27b4bcd8ccc132417e3
BLAKE2b-256 8965d63cd9752603bc5b0e4f97c8be0dcb42709ac130e7b4a0248846303868c0


Provenance

The following attestation bundles were made for adminlineage-0.2.2.tar.gz:

Publisher: publish.yml on TahaIbrahimSiddiqui/AdminLineageAI

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file adminlineage-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: adminlineage-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 63.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for adminlineage-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 e5f5b8a389925e42a2a6ee0e9017525afad2a634f79fec94c5ea7ab520a27912
MD5 d9d912c62afb8de1af63f82df1d05a13
BLAKE2b-256 b693484947196be2242843b0140a947dcf7c8dc18a429592ef49db2f0acdc3e0


Provenance

The following attestation bundles were made for adminlineage-0.2.2-py3-none-any.whl:

Publisher: publish.yml on TahaIbrahimSiddiqui/AdminLineageAI

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
