Derive the global air transportation networks (pax and cargo) from Wikipedia

wikipediaGATN

Overview

wikipediaGATN scrapes Wikipedia airport pages to assemble the Global Air Transportation Networks (GATN): two directed graphs in which each node is an airport (identified by its IATA code) and each directed edge represents a scheduled route between two airports for passengers (pax) or cargo.

The package handles the full pipeline:

  1. Crawling — breadth-first traversal from a seed airport, following destination links to neighbouring airport pages.
  2. Parsing — extraction of IATA/ICAO codes, geographic coordinates, and route tables from Wikipedia infoboxes and HTML tables, supplemented by the authoritative OurAirports database for metadata.
  3. IATA recovery — resolution of destination URLs that lack an obvious code, prioritizing offline lookups in the OurAirports database before falling back to Wikipedia scraping.
  4. Export — sparse adjacency matrices (.npz), node lists, airport metadata CSVs ready for network analysis, and interactive Plotly visualisations (.html).
  5. Updates — on-demand maintenance of the network through incremental scraping and synchronization with upstream OurAirports metadata changes.

The resulting networks can be used for empirical studies of air-travel connectivity, epidemic-spread modelling, and transportation network analysis. They also make useful teaching examples for courses on graphs and networks, data science, and computational social science.
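For orientation, the sketch below shows the shape of such a network as a networkx DiGraph keyed by IATA codes. It is purely illustrative (not part of the package API), and the routes shown are examples only:

import networkx as nx

# Illustrative only: a tiny pax network with IATA codes as nodes and one
# directed edge per scheduled route.
pax = nx.DiGraph()
pax.add_edge("YWG", "YYZ")  # Winnipeg -> Toronto Pearson
pax.add_edge("YYZ", "LHR")  # Toronto Pearson -> London Heathrow

print(pax.number_of_nodes(), "airports,", pax.number_of_edges(), "routes")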

Setting up

If using a virtual environment

source /path/to/venv/bin/activate

If running the code from a clone of the repository (before installing the package), you need to run everything from the repository's top-level directory. Set

export PYTHONPATH=src

and then call the code using, e.g.,

python -m scripts.grab_info_from_IATA

Note the module-style invocation: the -m flag, dots instead of slashes to indicate subdirectories, and no .py extension.

Required post-install step — spaCy language model

The NLP fallback for airline/destination extraction requires the en_core_web_sm model, which cannot be declared as a standard PyPI dependency:

python -m spacy download en_core_web_sm
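To confirm the model is available before running a crawl, a quick check along these lines works (this snippet is illustrative and not part of the package):

import spacy

# Verify that the English model required by the NLP fallback can be loaded.
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model available:", nlp.meta["name"], nlp.meta["version"])
except OSError:
    print("Model missing; run: python -m spacy download en_core_web_sm")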

Dependencies

Package                     Purpose
requests, beautifulsoup4    Wikipedia HTTP requests and HTML parsing
mwparserfromhell            Wikitext infobox parsing
spacy                       NLP fallback for unstructured route tables
geopy, pycountry            Coordinate and ISO 3166-2 parsing
numpy, scipy                Sparse adjacency matrix construction
pandas                      CSV I/O and data manipulation
networkx                    Graph construction and layout
plotly                      Interactive HTML visualisation

Example use

The following builds a network for all airports reachable within two hops of Winnipeg (YWG) and exports it as a sparse adjacency matrix:

from wikipediaGATN.wikipedia_network_level import iterate_search_until_distance_N
from wikipediaGATN.result_processing import (
    create_outbound_connections_list,
    run_two_pass_iata_extraction,
    create_outbound_adjacency_matrix,
)

# 1. Crawl Wikipedia — save one JSON file per airport to data/tmp_results/
iterate_search_until_distance_N("YWG", dist=2, delay=0.5, verbose=True)

# 2. Build connections CSV (maps destination URLs to IATA codes)
connections_csv, unmapped_csv = create_outbound_connections_list(
    verbose=True, export_unmapped=True
)

# 3. Recover IATA codes for any destinations that could not be mapped automatically
#    (scrapes Wikipedia; allow ~15 minutes for a large unmapped set)
run_two_pass_iata_extraction(batch_size=50, delay=0.5, verbose=True)

# 4. Re-run connections with the enriched mapping
create_outbound_connections_list(verbose=True)

# 5. Export sparse adjacency matrices to data/public/
matrix_npz, nodes_txt = create_outbound_adjacency_matrix(symmetric=False, verbose=True)
matrix_sym_npz, nodes_sym_txt = create_outbound_adjacency_matrix(symmetric=True, verbose=True)
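The exported files can then be loaded back for analysis. The following is a minimal sketch, assuming matrix_npz points to a SciPy sparse matrix saved with scipy.sparse.save_npz and nodes_txt is a plain-text list with one IATA code per line, in matrix order:

import scipy.sparse as sp
import networkx as nx

# Load the adjacency matrix and the node labels (assumed formats, see above).
A = sp.load_npz(matrix_npz)
with open(nodes_txt) as fh:
    iata_codes = [line.strip() for line in fh if line.strip()]

# Rebuild the directed route graph with IATA codes as node labels.
G = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)
G = nx.relabel_nodes(G, dict(enumerate(iata_codes)))

print(G.number_of_nodes(), "airports,", G.number_of_edges(), "routes")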

For a full global crawl (several hours) replace step 1 with:

from wikipediaGATN.wikipedia_network_level import iterate_search_until_empty
iterate_search_until_empty("YWG", delay=0.5, verbose=True)

To resume after an interruption:

from wikipediaGATN.wikipedia_network_level import continue_existing_search_until_empty
continue_existing_search_until_empty(delay=0.5, verbose=True)
