WikiScraper

WikiScraper is a Python library for scraping Wikipedia articles. It can scrape a single page or all linked articles recursively, and supports both .txt and .csv output.

Features

  • Scrape a single Wikipedia page or all linked articles recursively.
  • Supports .txt and .csv output formats.
  • Optionally add titles to scraped content.
  • Configurable logging: log file saves only, or all actions including errors and skipped links.
  • Append all scraped articles into a single file or save separately.
  • Works with Wikipedia in multiple languages (see the example after this list).
  • Handles errors gracefully.
  • Polite crawling with configurable delay between requests.
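
Since pages are addressed by URL, other language editions presumably need no extra configuration. A minimal sketch, using the same API shown under Usage below with the German Wikipedia (the URL is illustrative):

from wikiscraper import WikiScraper

# The wiki's language follows from the host in the URL (de.wikipedia.org here).
scraper = WikiScraper(file_type="txt", add_title=True)
scraper.scrape_one("https://de.wikipedia.org/wiki/Python")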

Installation

pip install wikiscraper-py

Usage

Scrape a single page

from wikiscraper import WikiScraper

# Save the article as a .txt file, with its title at the top.
scraper = WikiScraper(file_type="txt", add_title=True)
scraper.scrape_one("https://en.wikipedia.org/wiki/Python")

Scrape all linked articles

from wikiscraper import WikiScraper

# Append every linked article to one file, waiting 2 seconds between requests.
scraper = WikiScraper(file_type="txt", add_title=True, all_on_one_file=True, polite_time=2)
scraper.scrape_all("https://en.wikipedia.org/wiki/Python")

CSV Output Example

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="csv", add_title=True, all_on_one_file=True)
scraper.scrape_all("https://en.wikipedia.org/wiki/Ethiopia")

  • If add_title=True and output is CSV:
    • The first column will contain the article title.
    • The second column will contain the article text.
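
The combined CSV can then be read back with Python's standard csv module. A minimal sketch, assuming the default combined output path data/wikipedia_all.csv described under Directory Structure below:

import csv

# Each row: column 1 is the article title, column 2 the article text.
with open("data/wikipedia_all.csv", newline="", encoding="utf-8") as f:
    for title, text in csv.reader(f):
        print(f"{title}: {len(text)} characters")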

Parameters

  • file_type: 'txt' or 'csv'. Default is 'txt'.
  • add_title: Add the article title at the top of the file or first CSV column. Default is False.
  • log_saving: Log only file saves. Default is True.
  • log_all: Log all actions including errors and skipped links. Default is False.
  • polite_time: Delay between requests in seconds. Default is 3.
  • all_on_one_file: Append all articles into a single file when scraping multiple pages. Default is True.
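
Putting these together, a constructor call with every option spelled out looks like the sketch below. The values shown are the documented defaults, and it assumes log_saving and log_all are constructor arguments like the options shown under Usage:

from wikiscraper import WikiScraper

scraper = WikiScraper(
    file_type="txt",       # 'txt' or 'csv'
    add_title=False,       # prepend the article title to each file / CSV row
    log_saving=True,       # log file saves
    log_all=False,         # log all actions, including errors and skipped links
    polite_time=3,         # seconds to wait between requests
    all_on_one_file=True,  # append all articles to a single file
)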

Directory Structure

Scraped files are saved in the data/ folder created automatically in the working directory.

  • If all_on_one_file=True: all articles are appended to data/wikipedia_all.txt or .csv.
  • If all_on_one_file=False: each article is saved as a separate file with its title as filename.
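
After a run, the output can be inspected with the standard library, for example:

from pathlib import Path

# List everything WikiScraper wrote to the data/ folder.
for path in sorted(Path("data").iterdir()):
    print(f"{path.name} ({path.stat().st_size} bytes)")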

Error Handling

  • Skips invalid Wikipedia URLs.
  • Logs network errors and pages without titles.
  • Automatically filters out non-article links (categories, special pages, user pages, etc.); see the sketch below.
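
The non-article filter works roughly along the lines sketched below. This is an illustration of the idea, not WikiScraper's actual code; the prefixes are standard English Wikipedia namespaces:

# Illustrative sketch only -- not WikiScraper's implementation.
NON_ARTICLE_PREFIXES = (
    "Category:", "Special:", "User:", "File:",
    "Help:", "Template:", "Talk:", "Wikipedia:", "Portal:",
)

def is_article_link(href: str) -> bool:
    # Article URLs look like /wiki/<Title>; namespaced pages
    # (categories, special pages, user pages, ...) carry a prefix.
    if not href.startswith("/wiki/"):
        return False
    return not href[len("/wiki/"):].startswith(NON_ARTICLE_PREFIXES)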

License

MIT License
