The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!
Newspaper-Scraper
Intro
While tools like newspaper3k and goose3 can extract articles from news websites, they need a dedicated article URL for older articles and do not support paywalled content. This package aims to solve these issues by providing a unified interface for indexing, extracting, and processing articles from newspapers.
- Indexing: Index articles from a newspaper website using the beautifulsoup package for public articles and selenium for paywalled content.
- Extraction: Extract article content using the goose3 package.
- Processing: Process articles for NLP features using the spaCy package.
The indexing functionality is based on a dedicated file for each newspaper. A few newspapers are already supported, but it is easy to add new ones.
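To give a rough idea of what the indexing step involves, the following sketch collects article links from an archive page. It is not the package's actual implementation: it uses Python's built-in `html.parser` as a stand-in for beautifulsoup so that it runs without extra dependencies, and the names `ArticleLinkParser`, `index_archive_page`, the sample markup, and the `https://news.example` base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleLinkParser(HTMLParser):
    """Collect absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            # Keep the first occurrence only, preserving page order.
            if url not in self.urls:
                self.urls.append(url)


def index_archive_page(html, base_url):
    """Return the de-duplicated article URLs found in an archive page."""
    parser = ArticleLinkParser(base_url)
    parser.feed(html)
    return parser.urls


# Hypothetical archive page markup, for illustration only.
sample_html = """
<ul>
  <li><a href="/politik/article-1.html">Article 1</a></li>
  <li><a href="/wirtschaft/article-2.html">Article 2</a></li>
  <li><a href="/politik/article-1.html">Article 1 (duplicate link)</a></li>
</ul>
"""
urls = index_archive_page(sample_html, "https://news.example")
print(urls)
```

A real newspaper file would additionally need to know where that site's archive pages live and how to tell public from premium articles, which is why each newspaper gets its own dedicated file.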
Supported Newspapers
Newspaper | Country | Time span | Number of articles |
---|---|---|---|
Der Spiegel | Germany | Since 2000 | tbd |
Die Welt | Germany | Since 2000 | tbd |
Bild | Germany | Since 2006 | tbd |
Die Zeit | Germany | Since 1946 | tbd |
Handelsblatt | Germany | Since 2003 | tbd |
Setup
It is recommended to install the package in a dedicated Python environment.
To install the package via pip, run the following command:
pip install newspaper-scraper
To also include the NLP extraction functionality (via spaCy), run the following command:
pip install newspaper-scraper[nlp]
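If you are unsure whether the optional NLP dependency ended up in your environment, a quick check along these lines may help. This is a small sketch using only the standard library; the helper name `nlp_extra_available` is made up for illustration and is not part of the package.

```python
import importlib.util


def nlp_extra_available():
    """Return True if spaCy can be imported in the current environment."""
    return importlib.util.find_spec("spacy") is not None


if nlp_extra_available():
    print("spaCy found: NLP processing should be available.")
else:
    print("spaCy not found: install it with the [nlp] extra.")
```

Note that some shells, such as zsh, treat square brackets as glob patterns, so the argument may need quoting there: `pip install "newspaper-scraper[nlp]"`.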
Usage
To index, extract and process all public and premium articles from Der Spiegel, published in August 2021, run the following code:
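import newspaper_scraper as ns
# credentials is assumed to be a local module defining username and password.
from credentials import username, password

with ns.Spiegel(db_file='articles.db') as news:
    # Index all articles published in the given date range.
    news.index_articles_by_date_range('2021-08-01', '2021-08-31')
    # Scrape public articles, then premium articles using the login.
    news.scrape_public_articles()
    news.scrape_premium_articles(username=username, password=password)
    # Compute NLP features for the scraped articles.
    news.nlp()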
This will create a SQLite database file called articles.db in the current working directory. The database contains the following tables:
- tblArticlesIndexed: Contains all indexed articles with their scraping/processing status and whether they are public or premium content.
- tblArticlesScraped: Contains metadata for all parsed articles, provided by goose3.
- tblArticlesProcessed: Contains NLP features of the cleaned article text, provided by spaCy.
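Once a run has finished, the resulting file can be inspected with Python's built-in `sqlite3` module. The sketch below only lists the tables and counts their rows, since the column layout of each table is not documented here; the helper name `table_row_counts` is hypothetical.

```python
import os
import sqlite3


def table_row_counts(db_file):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_file)
    try:
        # sqlite_master is SQLite's built-in schema catalogue.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()


# Only inspect the file if it exists, so that no empty database is created.
if os.path.exists("articles.db"):
    for name, count in table_row_counts("articles.db").items():
        print(f"{name}: {count} rows")
```

From there, the tables can be loaded into whatever analysis tool you prefer once you have looked at their actual columns.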