A Python library to scrape Wikipedia articles easily

Project description

WikiScraper

WikiScraper is a Python library that makes it easy to scrape Wikipedia articles. It can scrape a single page or all linked articles recursively, and supports both .txt and .csv output.

Features

  • Scrape a single Wikipedia page or all linked articles recursively.
  • Supports .txt and .csv output formats.
  • Optionally add titles to scraped content.
  • Logging options: file saves only, or every action (including errors and skipped links).
  • Append all scraped articles into a single file or save separately.
  • Works with Wikipedia in multiple languages and handles errors gracefully.
  • Polite crawling with configurable delay between requests.

Installation

pip install wikiscraper-py

Usage

Scrape a single page

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="txt", add_title=True)
scraper.scrape_one("https://en.wikipedia.org/wiki/Senate_of_Colombia")

Scrape all linked articles

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="txt", add_title=True, all_on_one_file=True, polite_time=2)
scraper.scrape_all("https://en.wikipedia.org/wiki/Senate_of_Colombia")

CSV Output Example

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="csv", add_title=True, all_on_one_file=True)
scraper.scrape_all("https://en.wikipedia.org/wiki/Senate_of_Colombia")

  • If add_title=True and the output format is CSV:
    • The first column contains the article title.
    • The second column contains the article text.
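
As a quick check of that layout, here is a minimal sketch that reads the combined CSV back with Python's standard csv module. It assumes all_on_one_file=True, so everything landed in data/wikipedia_all.csv (see Directory Structure below), and it assumes UTF-8 encoding, which the library does not document.

import csv

# Assumes the combined output file data/wikipedia_all.csv and UTF-8 encoding.
with open("data/wikipedia_all.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        title, text = row[0], row[1]  # column layout per the bullets above
        print(title, "-", len(text), "characters")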

Parameters

  • file_type: 'txt' or 'csv'. Default is 'txt'.
  • add_title: Add the article title at the top of the file or first CSV column. Default is False.
  • log_saving: Log only file saves. Default is True.
  • log_all: Log all actions including errors and skipped links. Default is False.
  • polite_time: Delay between requests in seconds. Default is 3.
  • all_on_one_file: Append all articles into a single file when scraping multiple pages. Default is True.
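
For reference, the call below spells out every documented parameter with its default value. The first four keyword names already appear in the usage examples above; passing log_saving and log_all as constructor keywords is an assumption based on this list.

from wikiscraper import WikiScraper

scraper = WikiScraper(
    file_type="txt",       # 'txt' or 'csv'
    add_title=False,       # prepend the article title / fill the first CSV column
    log_saving=True,       # log file saves only
    log_all=False,         # log every action, including errors and skipped links
    polite_time=3,         # seconds to wait between requests
    all_on_one_file=True,  # append all articles into a single file
)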

Directory Structure

Scraped files are saved in the data/ folder created automatically in the working directory.

  • If all_on_one_file=True: all articles are appended to data/wikipedia_all.txt or .csv.
  • If all_on_one_file=False: each article is saved as a separate file with its title as filename.
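
A small sketch to confirm where the files ended up. It only relies on data/ being created in the current working directory, as stated above, and assumes that separately saved plain-text articles are written with a .txt extension.

from pathlib import Path
from wikiscraper import WikiScraper

# One file per article (all_on_one_file=False), plain-text output.
scraper = WikiScraper(file_type="txt", add_title=True, all_on_one_file=False)
scraper.scrape_one("https://en.wikipedia.org/wiki/Senate_of_Colombia")

# List whatever the scraper wrote into data/ in the working directory.
for path in sorted(Path("data").glob("*.txt")):
    print(path.name, path.stat().st_size, "bytes")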

Error Handling

  • Skips invalid Wikipedia URLs.
  • Logs network errors and pages without titles.
  • Automatically filters out non-article links (categories, special pages, user pages, etc.).
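
To watch this in action, enable full logging and feed the scraper a URL that is not a Wikipedia article. This is only a sketch: it assumes the skip-and-log behaviour described above also applies to scrape_one, and the log destination and format are not documented here.

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="txt", log_all=True)

# A non-Wikipedia URL should be skipped rather than raising (per the list above);
# with log_all=True the skip and any network errors are logged.
scraper.scrape_one("https://example.com/not-a-wikipedia-article")
scraper.scrape_one("https://en.wikipedia.org/wiki/Senate_of_Colombia")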

License

MIT License

Download files

Download the file for your platform.

Source Distribution

wikiscraper_py-0.0.1.tar.gz (5.3 kB)

Built Distribution

wikiscraper_py-0.0.1-py3-none-any.whl (5.8 kB)

File details

Details for the file wikiscraper_py-0.0.1.tar.gz.

File metadata

  • Download URL: wikiscraper_py-0.0.1.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for wikiscraper_py-0.0.1.tar.gz:

  • SHA256: 82dc4cef87da847f66b9dda20748b548686e894f4e9339fcac137f64b541129c
  • MD5: c9444086c67b07e449e30d177d4ff17e
  • BLAKE2b-256: 6208c61639c9330a99a8b25713e0a9c6d45b73ae7c2a1c0dce8dd1f1416bca3d


File details

Details for the file wikiscraper_py-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: wikiscraper_py-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for wikiscraper_py-0.0.1-py3-none-any.whl:

  • SHA256: 50c4b6ad3922e344051e54cafda9b7edda3677f2e14695e454d07acb70dd5469
  • MD5: 366be2e2cbbf0902169cb2f28101e3ed
  • BLAKE2b-256: d484b2c86d481316990ac7f191c7ce876a6173fb4603bc99449607217a1d21c4

