wikipediaGATN
Derive the global air transportation networks (pax and cargo) from Wikipedia.

Overview
wikipediaGATN scrapes Wikipedia airport pages to assemble the Global Air Transportation Networks (GATN): two directed graphs in which each node is an airport (identified by its IATA code) and each directed edge represents a scheduled route between two airports for passengers (pax) or cargo.
The package handles the full pipeline:
- Crawling — breadth-first traversal from a seed airport, following destination links to neighbouring airport pages.
- Parsing — extraction of IATA/ICAO codes, geographic coordinates, and route tables from Wikipedia infoboxes and HTML tables, supplemented by the authoritative OurAirports database for metadata.
- IATA recovery — resolution of destination URLs that lack an obvious code, prioritizing offline lookups in the OurAirports database before falling back to Wikipedia scraping.
- Export — sparse adjacency matrices (.npz), node lists, airport metadata CSVs ready for network analysis, and interactive Plotly visualisations (.html).
- Updates — on-demand maintenance of the network through incremental scraping and synchronization with upstream OurAirports metadata changes, keeping the graphs up to date.
The resulting networks can be used for empirical studies of air-travel connectivity, epidemic-spread modelling, and transportation network analysis. They also make good teaching examples for courses on graphs/networks, data science, and computational social science.
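For orientation, here is a toy sketch of the structure the crawler produces, built directly with networkx (the routes shown are illustrative, not scraped data):

import networkx as nx

# Nodes are airports keyed by IATA code; directed edges are scheduled routes.
pax = nx.DiGraph()
pax.add_edge("YWG", "YYZ")  # Winnipeg -> Toronto
pax.add_edge("YYZ", "YWG")  # Toronto -> Winnipeg
pax.add_edge("YWG", "YVR")  # Winnipeg -> Vancouver
print(pax.out_degree("YWG"))  # 2 outbound routes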
Setting up
If using a virtual environment
source /path/to/venv/bin/activate
If running the code from a source checkout, before the package is installed, everything must be invoked from the repository's top-level directory. Set
export PYTHONPATH=src
and then call the code using, e.g.,
python -m scripts.grab_info_from_IATA
Note the module-style invocation: the -m flag, a dot (rather than /) as the path separator, and no .py extension.
Required post-install step — spaCy language model
The NLP fallback for airline/destination extraction requires the en_core_web_sm model, which cannot be declared as a standard PyPI dependency:
python -m spacy download en_core_web_sm
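To verify the model is available before starting a long crawl, a minimal check (this relies on standard spaCy behaviour and is not part of the package):

import spacy

try:
    nlp = spacy.load("en_core_web_sm")  # raises OSError if the model is missing
except OSError:
    raise SystemExit("Missing model: run 'python -m spacy download en_core_web_sm'")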
Dependencies
| Package | Purpose |
|---|---|
| requests, beautifulsoup4 | Wikipedia HTTP requests and HTML parsing |
| mwparserfromhell | Wikitext infobox parsing |
| spacy | NLP fallback for unstructured route tables |
| geopy, pycountry | Coordinate and ISO 3166-2 parsing |
| numpy, scipy | Sparse adjacency matrix construction |
| pandas | CSV I/O and data manipulation |
| networkx | Graph construction and layout |
| plotly | Interactive HTML visualisation |
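All of the above are pulled in automatically when the package is installed; given the distribution name shown under File details below, the usual install is:

pip install wikipediaGATN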
Example use
The following builds a network for all airports reachable within two hops of Winnipeg (YWG) and exports it as a sparse adjacency matrix:
from wikipediaGATN.wikipedia_network_level import iterate_search_until_distance_N
from wikipediaGATN.result_processing import (
create_outbound_connections_list,
run_two_pass_iata_extraction,
create_outbound_adjacency_matrix,
)
# 1. Crawl Wikipedia — save one JSON file per airport to data/tmp_results/
iterate_search_until_distance_N("YWG", dist=2, delay=0.5, verbose=True)
# 2. Build connections CSV (maps destination URLs to IATA codes)
connections_csv, unmapped_csv = create_outbound_connections_list(
verbose=True, export_unmapped=True
)
# 3. Recover IATA codes for any destinations that could not be mapped automatically
# (scrapes Wikipedia; allow ~15 minutes for a large unmapped set)
run_two_pass_iata_extraction(batch_size=50, delay=0.5, verbose=True)
# 4. Re-run connections with the enriched mapping
create_outbound_connections_list(verbose=True)
# 5. Export sparse adjacency matrices to data/public/
matrix_npz, nodes_txt = create_outbound_adjacency_matrix(symmetric=False, verbose=True)
matrix_sym_npz, nodes_sym_txt = create_outbound_adjacency_matrix(symmetric=True, verbose=True)
For a full global crawl (several hours), replace step 1 with:
from wikipediaGATN.wikipedia_network_level import iterate_search_until_empty
iterate_search_until_empty("YWG", delay=0.5, verbose=True)
To resume after an interruption:
from wikipediaGATN.wikipedia_network_level import continue_existing_search_until_empty
continue_existing_search_until_empty(delay=0.5, verbose=True)
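The exported artifacts can then be loaded back for analysis. A minimal sketch, using matrix_npz and nodes_txt from step 5 above and assuming the node list holds one IATA code per line:

import numpy as np
import scipy.sparse as sp
import networkx as nx

A = sp.load_npz(matrix_npz)               # directed adjacency matrix
nodes = np.loadtxt(nodes_txt, dtype=str)  # IATA codes in row/column order

# Busiest origin airport by outbound route count
out_deg = np.asarray(A.sum(axis=1)).ravel()
print(nodes[out_deg.argmax()], int(out_deg.max()))

# Hand off to networkx for richer analysis
G = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)
G = nx.relabel_nodes(G, dict(enumerate(nodes)))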
File details
Details for the file wikipediagatn-0.1.2.tar.gz.
File metadata
- Size: 78.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: Hatch/1.16.5 cpython/3.14.4 HTTPX/0.28.1
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | c6e6e9803425614989d73dd7aaf2cb3556ad8aba8102058e6827b16b5c46b0a4 |
| MD5 | 9ca7174232c6fbb28d13e1025f7ba096 |
| BLAKE2b-256 | 76a1d566729044e893b6dfa3ff5c50277a73d3f9910dffcf3087571e0449424a |
File details
Details for the file wikipediagatn-0.1.2-py3-none-any.whl.
File metadata
- Size: 71.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: Hatch/1.16.5 cpython/3.14.4 HTTPX/0.28.1
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 2ffb068bac992c88a89a335fef1d37e78a9b616e2eb32eb06ab0f2e670e1b154 |
| MD5 | 08d2c73df1ccca9e378d440bcbd4b0be |
| BLAKE2b-256 | 356cec94cef8ee74e34a403b014a69947a059f8e9c00cd590befb4e84f3d7b6c |