Simple scraper that extracts data from a Wikipedia article, identified by its URL slug
wikiscraper
Simple scraper that extracts data from a Wikipedia article, identified by its URL slug: title, images, summary, section paragraphs, and sidebar info.
Developed by Alexandre MEYER
This work is licensed under a Creative Commons Attribution 4.0 International License.
Installation
$ pip install wikiscraper
Initialization
Import
import wikiscraper as ws
Main request
# Set the Wikipedia language edition for the query
# (ISO 639-1 code; defaults to "en" for English)
ws.lang("fr")
# Fetch the article content by its URL slug
# (Example: https://fr.wikipedia.org/wiki/Paris)
result = ws.searchBySlug("Paris")
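The language code and the slug are both visible in an article's URL, so they can be read off a full link before calling the library. The following is a minimal sketch that uses only the standard library; `split_wikipedia_url` is a hypothetical helper, not part of wikiscraper. It returns the slug exactly as it appears in the URL, since the package description does not say whether percent-encoded slugs are accepted.

```python
from urllib.parse import urlparse


def split_wikipedia_url(url):
    """Return (language code, slug) for a URL such as
    https://fr.wikipedia.org/wiki/Paris.

    Hypothetical helper for illustration; not part of wikiscraper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    prefix = "/wiki/"
    if not host.endswith(".wikipedia.org") or not parsed.path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {url!r}")
    # The first host label is the language edition ("fr" in fr.wikipedia.org).
    lang = host.split(".")[0]
    # Everything after /wiki/ is the slug; urlparse has already split off
    # any #fragment, so section anchors are not included.
    slug = parsed.path[len(prefix):]
    return lang, slug


lang, slug = split_wikipedia_url("https://fr.wikipedia.org/wiki/Paris")
print(lang, slug)
```

The two values could then be passed to `ws.lang(lang)` and `ws.searchBySlug(slug)` as shown above.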
Examples
# Get article's title
result.getTitle()
Sidebar
# Get the value associated with a sidebar label
result.getSideInfo("Gentilé")
Summary
# Get all paragraphs of summary
print(result.getSummary())
# Get the second paragraph of summary
print(result.getSummary()[1])
# Optional: get only the first x paragraphs of the summary
print(result.getSummary(2))
Images
# Get all illustration images
img = result.getImage()
# Get a specific image by its position in the page
print(img[0]) # Main image
Sections
# Get the paragraphs of a specific section, identified by its header titles
# All arguments are optional: .getSection(h2Title, h3Title, h4Title)
# Example: https://fr.wikipedia.org/wiki/Paris#Politique_et_administration
print(result.getSection('Politique et administration', 'Statut et organisation administrative', 'Historique')[0])
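In the example above, the header title is written with spaces ('Politique et administration'), while the URL fragment uses underscores (Politique_et_administration). The sketch below shows that mapping for simple titles; `header_to_anchor` is a hypothetical helper, not part of wikiscraper, and it does not handle titles whose characters Wikipedia encodes in other ways.

```python
def header_to_anchor(title):
    """Turn a visible header title into the URL fragment form
    by replacing spaces with underscores.

    Hypothetical helper for illustration; covers only the simple case.
    """
    return title.strip().replace(" ", "_")


anchor = header_to_anchor("Politique et administration")
print("https://fr.wikipedia.org/wiki/Paris#" + anchor)
```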