
wiki-table-scrape

Scrape HTML tables from a Wikipedia page into CSV format.

wikitablescrape can be used as a shell command or imported as a Python package.
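
For example, the shell command can be driven from a Python script using only the standard library. A minimal sketch, reusing the hurricane example shown below (the subprocess wrapper is illustrative; only the --url and --header flags come from this page):

import csv
import io
import subprocess

# Illustrative only: drives the CLI documented on this page via subprocess.
result = subprocess.run(
    [
        "wikitablescrape",
        "--url=https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes",
        "--header=costliest",
    ],
    capture_output=True,
    text=True,
    check=True,
)

# Parse the CSV text written to stdout into a list of rows.
rows = list(csv.reader(io.StringIO(result.stdout)))
print(rows[0])  # column headers, including "Season"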

Why?

This tool makes it easy to download any Wikipedia table from the command line in a format ready for text processing.

This is especially useful when combined with a tool like xsv.

Year Distribution of Costliest Atlantic Hurricanes
wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes' --header='costliest' | xsv select "Season" | xsv stats --median | xsv select field,min,max,median,mean,stddev | xsv table
field   min   max   median  mean                stddev
Season  1965  2018  2002    1999.1228070175441  12.900523823770502
Country / Market Distribution of Best-selling Music Artists
wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_best-selling_music_artists' --header='100 million' | xsv select 'Country / Market' | xsv frequency | xsv table
field             value                         count
Country / Market  United States                 26
Country / Market  United Kingdom                10
Country / Market  United Kingdom United States  1
Country / Market  Australia                     1
Country / Market  Spain                         1
Country / Market  Japan                         1
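
The same frequency count can be reproduced without xsv using only the Python standard library. A minimal sketch, assuming the scraped table has been saved to artists.csv (a hypothetical filename):

import csv
from collections import Counter

# Tally how often each country/market appears in the table.
# "artists.csv" is a hypothetical file produced by redirecting the command above.
with open("artists.csv", newline="", encoding="utf-8") as f:
    counts = Counter(row["Country / Market"] for row in csv.DictReader(f))

for value, count in counts.most_common():
    print(value, count)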

Installation

You can install the package from PyPI or build it from source; both require Python 3.

As a system-level Python package

python3 -m pip install wikitablescrape
wikitablescrape --help

In a virtual environment

python3 -m venv venv
. venv/bin/activate
pip install wikitablescrape
wikitablescrape --help

Build from source

git clone https://github.com/rocheio/wiki-table-scrape
cd ./wiki-table-scrape
python3 -m venv venv
. venv/bin/activate
python setup.py install
wikitablescrape --help

Sample Commands

Write a single table to stdout
wikitablescrape --url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films" --header="films by year" | tee >(head -1) >(tail -5) >/dev/null
"Year","Title","Worldwide gross","Budget","Reference(s)"
"2015","Star Wars: The Force Awakens","$2,068,223,624","$245,000,000",""
"2016","Captain America: Civil War","$1,153,304,495","$250,000,000",""
"2017","Star Wars: The Last Jedi","$1,332,539,889","$200,000,000",""
"2018","Avengers: Infinity War","$2,048,359,754","$316,000,000–400,000,000",""
"2019","Avengers: Endgame","$2,796,255,086","$356,000,000",""
Download all tables on a page into a folder of CSV files
wikitablescrape --url="https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list" --output-folder="/tmp/scrape"
Parsing all tables from 'https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list' into '/tmp/scrape'
Writing table 1 to /tmp/scrape/table_1_top_100_list.csv
Writing table 2 to /tmp/scrape/table_2_countries.csv
Writing table 3 to /tmp/scrape/table_3_cities.csv
Writing table 4 to /tmp/scrape/table_4_buildings_&_structures_&_statues.csv
Writing table 5 to /tmp/scrape/table_5_people.csv
Writing table 6 to /tmp/scrape/table_6_people_singers.csv
Writing table 7 to /tmp/scrape/table_7_people_actors.csv
Writing table 8 to /tmp/scrape/table_8_people_romantic_actors.csv
Writing table 9 to /tmp/scrape/table_9_people_athletes.csv
Writing table 10 to /tmp/scrape/table_10_people_modern_political_leaders.csv
Writing table 11 to /tmp/scrape/table_11_people_pre_modern_people.csv
Writing table 12 to /tmp/scrape/table_12_people_3rd_millennium_people.csv
Writing table 13 to /tmp/scrape/table_13_progression_of_the_most_viewed_millennial_persons_on_wikipedia.csv
Writing table 14 to /tmp/scrape/table_14_music_bands_historical_most_viewed_3rd_millennium_persons.csv
Writing table 15 to /tmp/scrape/table_15_sport_teams_historical_most_viewed_3rd_millennium_persons.csv
Writing table 16 to /tmp/scrape/table_16_films_and_tv_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 17 to /tmp/scrape/table_17_albums_historical_most_viewed_3rd_millennium_persons.csv
Writing table 18 to /tmp/scrape/table_18_books_and_book_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 19 to /tmp/scrape/table_19_books_and_book_series_pre_modern_books_and_texts.csv
head -5 /tmp/scrape/table_3_cities.csv
"Rank","Page","Continent","Views in millions"
"1","New York City","North America","75"
"2","Singapore","Asia","63"
"3","London","Europe","61"
"4","Hong Kong","Asia","50"

Testing

./scripts/test.sh

# Show coverage data in a browser
coverage html && open htmlcov/index.html

Sample Articles for Scraping

Contributing

If you would like to contribute to this module, please open an issue or pull request.

More Information

If you'd like to read more about this module, please check out my blog post from the initial release.
