Derive the global air transportation networks (pax and cargo) from Wikipedia

wikipediaGATN

Overview

wikipediaGATN scrapes Wikipedia airport pages to assemble the Global Air Transportation Networks (GATN): two directed graphs in which each node is an airport (identified by its IATA code) and each directed edge represents a scheduled route between two airports for passengers (pax) or cargo.

The package handles the full pipeline:

  1. Crawling — breadth-first traversal from a seed airport, following destination links to neighbouring airport pages.
  2. Parsing — extraction of IATA/ICAO codes, geographic coordinates, and route tables from Wikipedia infoboxes and HTML tables, supplemented by the authoritative OurAirports database for metadata.
  3. IATA recovery — resolution of destination URLs that lack an obvious code, prioritizing offline lookups in the OurAirports database before falling back to Wikipedia scraping.
  4. Export — sparse adjacency matrices (.npz), node lists, airport metadata CSVs ready for network analysis, and interactive Plotly visualisations (.html).
  5. Updates — on-demand maintenance of the network through incremental scraping and synchronization with upstream OurAirports metadata changes.

The resulting networks can be used for empirical studies of air-travel connectivity, epidemic-spread modelling, and transportation network analysis. They also make useful teaching examples for courses on graphs and networks, data science, and computational social science.
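For orientation, the sketch below shows the shape of such a network as a networkx DiGraph keyed by IATA codes. It is purely illustrative (not part of the package API), and the routes shown are examples only:

import networkx as nx

# Illustrative only: a tiny pax network with IATA codes as nodes and one
# directed edge per scheduled route.
pax = nx.DiGraph()
pax.add_edge("YWG", "YYZ")  # Winnipeg -> Toronto Pearson
pax.add_edge("YYZ", "LHR")  # Toronto Pearson -> London Heathrow

print(pax.number_of_nodes(), "airports,", pax.number_of_edges(), "routes")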

Setting up

If using a virtual environment

source /path/to/venv/bin/activate

If running the code from a clone of the repository (before installing the package), you need to run everything from the repository's top-level directory. Set

export PYTHONPATH=src

and then call the code using, e.g.,

python -m scripts.grab_info_from_IATA

Note the module-style invocation: the -m flag, dots instead of slashes to indicate subdirectories, and no .py extension.

Required post-install step — spaCy language model

The NLP fallback for airline/destination extraction requires the en_core_web_sm model, which cannot be declared as a standard PyPI dependency:

python -m spacy download en_core_web_sm
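To confirm the model is available before running a crawl, a quick check along these lines works (this snippet is illustrative and not part of the package):

import spacy

# Verify that the English model required by the NLP fallback can be loaded.
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model available:", nlp.meta["name"], nlp.meta["version"])
except OSError:
    print("Model missing; run: python -m spacy download en_core_web_sm")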

Dependencies

Package                     Purpose
requests, beautifulsoup4    Wikipedia HTTP requests and HTML parsing
mwparserfromhell            Wikitext infobox parsing
spacy                       NLP fallback for unstructured route tables
geopy, pycountry            Coordinate and ISO 3166-2 parsing
numpy, scipy                Sparse adjacency matrix construction
pandas                      CSV I/O and data manipulation
networkx                    Graph construction and layout
plotly                      Interactive HTML visualisation

Example use

The following builds a network for all airports reachable within two hops of Winnipeg (YWG) and exports it as a sparse adjacency matrix:

from wikipediaGATN.wikipedia_network_level import iterate_search_until_distance_N
from wikipediaGATN.result_processing import (
    create_outbound_connections_list,
    run_two_pass_iata_extraction,
    create_outbound_adjacency_matrix,
)

# 1. Crawl Wikipedia — save one JSON file per airport to data/tmp_results/
iterate_search_until_distance_N("YWG", dist=2, delay=0.5, verbose=True)

# 2. Build connections CSV (maps destination URLs to IATA codes)
connections_csv, unmapped_csv = create_outbound_connections_list(
    verbose=True, export_unmapped=True
)

# 3. Recover IATA codes for any destinations that could not be mapped automatically
#    (scrapes Wikipedia; allow ~15 minutes for a large unmapped set)
run_two_pass_iata_extraction(batch_size=50, delay=0.5, verbose=True)

# 4. Re-run connections with the enriched mapping
create_outbound_connections_list(verbose=True)

# 5. Export sparse adjacency matrices to data/public/
matrix_npz, nodes_txt = create_outbound_adjacency_matrix(symmetric=False, verbose=True)
matrix_sym_npz, nodes_sym_txt = create_outbound_adjacency_matrix(symmetric=True, verbose=True)
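The exported files can then be loaded back for analysis. The following is a minimal sketch, assuming matrix_npz points to a SciPy sparse matrix saved with scipy.sparse.save_npz and nodes_txt is a plain-text list with one IATA code per line, in matrix order:

import scipy.sparse as sp
import networkx as nx

# Load the adjacency matrix and the node labels (assumed formats, see above).
A = sp.load_npz(matrix_npz)
with open(nodes_txt) as fh:
    iata_codes = [line.strip() for line in fh if line.strip()]

# Rebuild the directed route graph with IATA codes as node labels.
G = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)
G = nx.relabel_nodes(G, dict(enumerate(iata_codes)))

print(G.number_of_nodes(), "airports,", G.number_of_edges(), "routes")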

For a full global crawl (several hours) replace step 1 with:

from wikipediaGATN.wikipedia_network_level import iterate_search_until_empty
iterate_search_until_empty("YWG", delay=0.5, verbose=True)

To resume after an interruption:

from wikipediaGATN.wikipedia_network_level import continue_existing_search_until_empty
continue_existing_search_until_empty(delay=0.5, verbose=True)
