wiki-table-scrape

Scrape HTML tables from a Wikipedia page into CSV format.

wikitablescrape can be used as a shell command or imported as a Python package.
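The package's Python API is not documented in this README, so the sketch below drives the shell command from Python instead. It uses only the `--url`, `--header`, and `--output-folder` flags that appear in the sample commands; the helper names `build_command` and `scrape_to_text` are hypothetical and not part of wikitablescrape.

```python
import subprocess


def build_command(url, header=None, output_folder=None):
    """Build the argument list for the wikitablescrape CLI (hypothetical helper)."""
    command = ["wikitablescrape", f"--url={url}"]
    if header is not None:
        # Select a single table by the header text it appears under.
        command.append(f"--header={header}")
    if output_folder is not None:
        # Write tables to CSV files in this folder instead of stdout.
        command.append(f"--output-folder={output_folder}")
    return command


def scrape_to_text(url, header):
    """Run the CLI and return the CSV it prints to stdout.

    Assumes wikitablescrape is installed and on PATH; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(
        build_command(url, header=header),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with headers that contain spaces, such as "films by year".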

Why?

This tool makes it easy to download any Wikipedia table via CLI in a format ready for text processing.

This is especially useful when combined with a command-line CSV toolkit such as xsv.

Year Distribution of Costliest Atlantic Hurricanes

wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes' --header='costliest' | xsv select "Season" | xsv stats --median | xsv select field,min,max,median,mean,stddev | xsv table
field   min   max   median  mean                stddev
Season  1965  2018  2002    1999.1228070175441  12.900523823770502
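If xsv is not available, a similar summary can be computed with the Python standard library. This is a minimal sketch: `summarize_column` is a hypothetical helper, and it assumes the column holds plain numbers. It reports the population standard deviation, which may not match xsv's figure exactly.

```python
import csv
import io
import statistics


def summarize_column(csv_text, column):
    """Return min, max, median, mean and population stddev for one numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = []
    for row in reader:
        # Skip blank or missing cells so one empty value does not abort the summary.
        cell = (row.get(column) or "").strip()
        if cell:
            values.append(float(cell))
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "stddev": statistics.pstdev(values),
    }
```

To use it in a pipeline, read the scraped CSV from standard input, for example `summarize_column(sys.stdin.read(), "Season")`.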

Country / Market Distribution of Best-selling Music Artists

wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_best-selling_music_artists' --header='100 million' | xsv select 'Country / Market' | xsv frequency | xsv table
field             value                         count
Country / Market  United States                 26
Country / Market  United Kingdom                10
Country / Market  United Kingdom United States  1
Country / Market  Australia                     1
Country / Market  Spain                         1
Country / Market  Japan                         1
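The frequency count above can also be approximated in Python once the CSV text is in hand. The sketch below uses `collections.Counter`; `column_frequency` is a hypothetical helper name, and ties are returned in the order the values first appear, which may differ from xsv's ordering.

```python
import csv
import io
from collections import Counter


def column_frequency(csv_text, column):
    """Count how often each value occurs in one column, most common first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Missing cells are counted under the empty string.
    counts = Counter((row.get(column) or "").strip() for row in reader)
    return counts.most_common()
```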

Installation

You can download the package from PyPI or build from source using Python 3.

As a system-level Python package

python3 -m pip install wikitablescrape
wikitablescrape --help

In a virtual environment

python3 -m venv venv
. venv/bin/activate
pip install wikitablescrape
wikitablescrape --help

Build from source

git clone https://github.com/rocheio/wiki-table-scrape
cd ./wiki-table-scrape
python3 -m venv venv
. venv/bin/activate
python -m pip install .
wikitablescrape --help
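To confirm from Python which version ended up installed, the standard library's `importlib.metadata` (Python 3.8+) can look it up. This sketch assumes the distribution is registered under the name `wikitablescrape`, matching the PyPI package name; `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(distribution="wikitablescrape"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the active environment.
        return None
```

Run it inside the same virtual environment you installed into; a `None` result usually means a different interpreter is active.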

Sample Commands

Write a single table to stdout

wikitablescrape --url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films" --header="films by year" | tee >(head -1) >(tail -5) >/dev/null
"Year","Title","Worldwide gross","Budget","Reference(s)"
"2015","Star Wars: The Force Awakens","$2,068,223,624","$245,000,000",""
"2016","Captain America: Civil War","$1,153,304,495","$250,000,000",""
"2017","Star Wars: The Last Jedi","$1,332,539,889","$200,000,000",""
"2018","Avengers: Infinity War","$2,048,359,754","$316,000,000–400,000,000",""
"2019","Avengers: Endgame","$2,796,255,086","$356,000,000",""

Download all tables on a page into a folder of CSV files

wikitablescrape --url="https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list" --output-folder="/tmp/scrape"
Parsing all tables from 'https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list' into '/tmp/scrape'
Writing table 1 to /tmp/scrape/table_1_top_100_list.csv
Writing table 2 to /tmp/scrape/table_2_countries.csv
Writing table 3 to /tmp/scrape/table_3_cities.csv
Writing table 4 to /tmp/scrape/table_4_buildings_&_structures_&_statues.csv
Writing table 5 to /tmp/scrape/table_5_people.csv
Writing table 6 to /tmp/scrape/table_6_people_singers.csv
Writing table 7 to /tmp/scrape/table_7_people_actors.csv
Writing table 8 to /tmp/scrape/table_8_people_romantic_actors.csv
Writing table 9 to /tmp/scrape/table_9_people_athletes.csv
Writing table 10 to /tmp/scrape/table_10_people_modern_political_leaders.csv
Writing table 11 to /tmp/scrape/table_11_people_pre_modern_people.csv
Writing table 12 to /tmp/scrape/table_12_people_3rd_millennium_people.csv
Writing table 13 to /tmp/scrape/table_13_progression_of_the_most_viewed_millennial_persons_on_wikipedia.csv
Writing table 14 to /tmp/scrape/table_14_music_bands_historical_most_viewed_3rd_millennium_persons.csv
Writing table 15 to /tmp/scrape/table_15_sport_teams_historical_most_viewed_3rd_millennium_persons.csv
Writing table 16 to /tmp/scrape/table_16_films_and_tv_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 17 to /tmp/scrape/table_17_albums_historical_most_viewed_3rd_millennium_persons.csv
Writing table 18 to /tmp/scrape/table_18_books_and_book_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 19 to /tmp/scrape/table_19_books_and_book_series_pre_modern_books_and_texts.csv

head -5 /tmp/scrape/table_3_cities.csv
"Rank","Page","Continent","Views in millions"
"1","New York City","North America","75"
"2","Singapore","Asia","63"
"3","London","Europe","61"
"4","Hong Kong","Asia","50"

Testing

./scripts/test.sh

# Show coverage data in a browser (`open` is the macOS launcher; on Linux, use `xdg-open`)
coverage html && open htmlcov/index.html

Sample Articles for Scraping

Contributing

If you would like to contribute to this module, please open an issue or pull request.

More Information

If you'd like to read more about this module, please check out my blog post from the initial release.

Download files

Source Distribution

wikitablescrape-1.0.4.tar.gz (9.9 kB)

Built Distribution

wikitablescrape-1.0.4-py3-none-any.whl (10.6 kB, Python 3)
