A Python library to scrape Wikipedia articles easily
WikiScraper
WikiScraper is a Python library for scraping Wikipedia articles. It can scrape a single page or all linked articles recursively, and writes output in `.txt` or `.csv` format.
Features
- Scrape a single Wikipedia page or all linked articles recursively.
- Supports `.txt` and `.csv` output formats.
- Optionally add titles to scraped content.
- Configurable logging: file saves only, or all actions.
- Append all scraped articles into a single file or save separately.
- Supports Wikipedia in multiple languages and handles errors gracefully.
- Polite crawling with configurable delay between requests.
Installation
```bash
pip install wikiscraper-py
```
Usage
Scrape a single page
```python
from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="txt", add_title=True)
scraper.scrape_one("https://en.wikipedia.org/wiki/Senate_of_Colombia")
```
Scrape all linked articles
```python
from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="txt", add_title=True, all_on_one_file=True, polite_time=2)
scraper.scrape_all("https://en.wikipedia.org/wiki/Senate_of_Colombia")
```
CSV Output Example
```python
from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="csv", add_title=True, all_on_one_file=True)
scraper.scrape_all("https://en.wikipedia.org/wiki/Senate_of_Colombia")
```
- If `add_title=True` and output is CSV:
  - The first column will contain the article title.
  - The second column will contain the article text.
Parameters
- `file_type`: `'txt'` or `'csv'`. Default is `'txt'`.
- `add_title`: Add the article title at the top of the file or in the first CSV column. Default is `False`.
- `log_saving`: Log only file saves. Default is `True`.
- `log_all`: Log all actions, including errors and skipped links. Default is `False`.
- `polite_time`: Delay between requests in seconds. Default is `3`.
- `all_on_one_file`: Append all articles into a single file when scraping multiple pages. Default is `True`.
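For reference, here is a constructor call that sets every documented option explicitly, using the defaults listed above. This is a sketch based on the keyword arguments shown in the usage examples, not additional API:

```python
from wikiscraper import WikiScraper

# All documented options set explicitly to their listed defaults.
scraper = WikiScraper(
    file_type="txt",       # 'txt' or 'csv'
    add_title=False,       # prepend the title (or fill the first CSV column)
    log_saving=True,       # log file saves only
    log_all=False,         # log all actions, including errors and skipped links
    polite_time=3,         # seconds to wait between requests
    all_on_one_file=True,  # append all articles into a single file
)
```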
Directory Structure
Scraped files are saved in a `data/` folder, which is created automatically in the working directory.
- If `all_on_one_file=True`: all articles are appended to `data/wikipedia_all.txt` or `data/wikipedia_all.csv`.
- If `all_on_one_file=False`: each article is saved as a separate file, with its title as the filename.
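Given that layout, the combined CSV output can be read back directly. A minimal sketch, assuming the two-column layout described above and a standard CSV dialect (the dialect itself is an assumption):

```python
import csv

# Read the combined output produced with file_type="csv" and
# all_on_one_file=True. The (title, text) column layout follows the
# description above; the exact CSV dialect is an assumption.
with open("data/wikipedia_all.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        title, text = row[0], row[1]
        print(f"{title}: {len(text)} characters")
```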
Error Handling
- Skips invalid Wikipedia URLs.
- Logs network errors and pages without titles.
- Automatically filters out non-article links (categories, special pages, user pages, etc.).
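The library does not document how it recognizes non-article links, but a minimal sketch of this kind of namespace filter might look like the following. The prefix list and function name are illustrative assumptions, not the library's actual code:

```python
from urllib.parse import unquote, urlparse

# Illustrative only: Wikipedia namespace prefixes that mark
# non-article pages. Not the library's actual implementation.
NON_ARTICLE_PREFIXES = (
    "Category:", "Special:", "User:", "File:", "Help:",
    "Talk:", "Template:", "Portal:", "Wikipedia:",
)

def is_article_link(url: str) -> bool:
    """Return True if the URL looks like a plain Wikipedia article."""
    path = unquote(urlparse(url).path)
    if not path.startswith("/wiki/"):
        return False
    title = path[len("/wiki/"):]
    return not title.startswith(NON_ARTICLE_PREFIXES)

# is_article_link("https://en.wikipedia.org/wiki/Senate_of_Colombia") -> True
# is_article_link("https://en.wikipedia.org/wiki/Category:Colombia")  -> False
```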
License
MIT License
File details
Details for the file `wikiscraper_py-0.0.1.tar.gz`.
File metadata
- Download URL: wikiscraper_py-0.0.1.tar.gz
- Upload date:
- Size: 5.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `82dc4cef87da847f66b9dda20748b548686e894f4e9339fcac137f64b541129c` |
| MD5 | `c9444086c67b07e449e30d177d4ff17e` |
| BLAKE2b-256 | `6208c61639c9330a99a8b25713e0a9c6d45b73ae7c2a1c0dce8dd1f1416bca3d` |
File details
Details for the file `wikiscraper_py-0.0.1-py3-none-any.whl`.
File metadata
- Download URL: wikiscraper_py-0.0.1-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `50c4b6ad3922e344051e54cafda9b7edda3677f2e14695e454d07acb70dd5469` |
| MD5 | `366be2e2cbbf0902169cb2f28101e3ed` |
| BLAKE2b-256 | `d484b2c86d481316990ac7f191c7ce876a6173fb4603bc99449607217a1d21c4` |