Script to fetch news using API and convert to wiki-text
Project description
news_fetcher
Script to fetch news URLs from news websites to database using API.
Installation
Install Python 3.8 or higher, install poetry, run poetry install --no-dev
.
Then you can just run poetry run COMMAND
to run specific commands under python virtual environment created by poetry.
Or you can enter poetry shell (by running poetry shell
) and then type script commands.
You can also use pip
.
Installation example
Assuming Python 3.8 or higher and poetry are installed.
Initialise and update virtual environment (assuming you are in the folder with this README file):
poetry install --no-dev
Run script:
poetry run python news_fetcher/news_fetcher.py --help
Windows installation example
Assuming Python 3.8 or higher is installed.
Install poetry (in Windows PowerShell):
(Invoke-WebRequest -Uri https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py -UseBasicParsing).Content | python
You may need to restart PowerShell or reboot your computer.
To install or update libraries, run batch file update.bat
.
Run script:
poetry run python news_fetcher/news_fetcher.py --help
Files
run_all.sh
is the Shell script for running all steps. It requires that environment variables are set in.env
file:MEDIAWIKI_CREDENTIALS
,DATABASE_URL
,WIKI_TOOL_DIRECTORY
,DATA_FILE
,SOURCE_PATH
,SOURCE_NAME
,TARGET_API_URL
,WIKI_PREFIX
,BOT_NAME
,REQUESTS_INTERVAL
.news_fetcher/news_fetcher.py
is the script entry point.news_fetcher/db.py
is the DB initialization module.news_fetcher/models.py
is the module with DB models.news_fetcher/module.py
is the module with base class for "source modules" which are used to grab news from different sources.
Prostoprosport source module
This modules fetches news using prostoprosport.ru API.
news_fetcher/prostoprosport.py
is the source module.data/categories_from_js.json
is a category URL data grabbed from JS.data/categories_bonus.json
is an additional category URL data grabbed from RSS.
RSS source module
This modules fetches news using RSS.
news_fetcher/rss.py
is the source module.
DB models
Source
Source website.
slug_name
— string website ID (primary key), for example:birmingham-post
.
Tag
Tag for news articles.
tag_id
— numerical ID (primary key).title
— tag text, for example:Sport
(must be unique).
Article
News article from source website.
article_id
— numerical ID (primary key).source
— source website (foreign key).slug_name
— string identifier (must be unique per source website), for example:sir-stanley-matthews-1915-2000-a-potteries-hero
.title
— human-readable article title, for example: Sir Stanley Matthews 1915-2000: A Potteries hero; Stanley stayed loyal to his beloved.date
— publication date, for example: 2020-02-24T00:00:00.source_url
— full article URL, for example: https://www.thefreelibrary.com/Sir+Stanley+Matthews+1915-2000%3A+A+Potteries+hero%3B+Stanley+stayed...-a060517953.source_url_ok
— true if URL can be retrieved, false if it can not, null if it was not checked yet.author_name
— human-readable author name, may be null.wikitext_paragraphs
— article content converted into wiki-text stored as JSON list of paragraphs, may be null if not fetched yet.misc_data
— miscellaneous data stored as JSON, specific format and structure is module-dependent.tags
— article tags (many-to-many relation withTag
model through technicalArticleTag
model with table namedarticle_m2m_tag
).
Usage
Getting help
--help
Show help message and exit. If this option is used with command, then help message for that specific command will be printed.
Common options
--source-module TEXT
(required) — source module name, can beprostoprosport
orrss
Prostoprosport module options
--data-file FILENAME
— file with categories data (can be built usingprocess-categories
command)--source-path TEXT
— API method name, can benews
ormain_news
RSS module options
--data-file FILENAME
— JSON file with configuration, should contain folllowing keys:css_selector
— CSS selector for article paragraphs on web pagesource_title
— source titlesource_template_name
— template name for generated wiki-pages (optional, source title is used by default)removed_last_lines
— count of paragraphs at the end of article that should be skipped (optional, 0 by default)disable_bold_font
— true to avoid bold font in generated page (optional, false by default)extra_first_lines
— array of strings to add at the beginning of generated page (optional, empty by default)
--source-name
(required) — source slug name (identifier) for DB--source-path TEXT
(required) — RSS feed URL
Command fetch-news
Fetch news for page range and write data to DB. Pages are numbered from most recent (1) to least recent. Note that page numbers are now used in Prostoprosport source module only.
Options
--first-page INTEGER
— number of first page to load, should not be less than 1--last-page INTEGER
— number of last page to load, should not be less than 1. If it is less than first page number, no data will be fetched
Example 1
Fetch most recent page (1):
python news_fetcher/prostoprosport_news_fetcher.py fetch-news
Example 2
Fetch pages 5 most recent pages (5 to 1):
python news_fetcher/prostoprosport_news_fetcher.py fetch-news --last-page 5
Example 3
Fetch pages 11 to 20:
python news_fetcher/prostoprosport_news_fetcher.py fetch-news --first-page 11 --last-page 20
Notes
- (OBSOLETE) Prostoprosport.ru API did not provide URLs, only category slugs and IDs, category-to-URL mappings are grabbed from JavaScript on website. Therefore URLs were not guaranteed to be correct.
- Now all news are placed under
/post/
URL path, without category URL.
Command fetch-news-pages
Fetch news pages contents for pages which were:
- From current source
- Not marked as "invalid URL" during previous fetch
- Not already fetched
Example
python news_fetcher/prostoprosport_news_fetcher.py fetch-news-pages
Command generate-wiki-pages
Generate MediaWiki pages as text files for fetched news pages not marked as uploaded.
Options
--output-file FILE
— output JSON file with list of generated pages, it contains dictionary, where keys are page titles, and values are page file paths--output-directory FILE
— directory to place generated MediaWiki page files--bot-name STRING
— name of bot user account to use in page template
Example
python news_fetcher/prostoprosport_news_fetcher.py generate-wiki-pages --output-file ../data/pages.json --output-directory ../data/pages/
Command mark-uploaded-pages
Mark news articles as uploaded in database.
Options
--input-file FILE
input JSON file generated bygenerate-wiki-pages
command
Example
python news_fetcher/prostoprosport_news_fetcher.py mark-uploaded-pages --input-file ../data/pages.json
Module-specific command process-categories
in prostoprosport
source module
Build categories mapping file. It will contain data about base URL for category slugs and IDs. For example, category rpl
have base URL (without leading slash) football/russia/rpl
.
Options
--input-from-js-file FILE
— JSON file with categories data grabbed from JavaScript, default isdata/categories_from_js.json
--input-bonus-file FILE
— JSON file with additional data, default isdata/categories_bonus.json
--input-colors-file FILE
— JSON file with ID-to-color mapping data grabbed from JavaScript, default isdata/category_colors.json
--output-file FILE
— output JSON file, default isdata1/categories_data.json
Example
python news_fetcher/prostoprosport.py process-categories
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file news_fetcher-0.3.1.tar.gz
.
File metadata
- Download URL: news_fetcher-0.3.1.tar.gz
- Upload date:
- Size: 17.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.11.6 Linux/6.5.0-28-generic
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 24490e7fc49b081c1d791629f25c3625c63ab053ac772f8b053133233963991c |
|
MD5 | ee0a1baa77cf58478b49c3d6270a6040 |
|
BLAKE2b-256 | 411df3065d530334f3c31b4c6c360bb2a942bfaf8988bfbeb0fa50dbbc2bfc52 |
File details
Details for the file news_fetcher-0.3.1-py3-none-any.whl
.
File metadata
- Download URL: news_fetcher-0.3.1-py3-none-any.whl
- Upload date:
- Size: 18.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.11.6 Linux/6.5.0-28-generic
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b9732a17712ff17f47bbf1fad228716471cf12e85429da31c6dc43b15c03c4f4 |
|
MD5 | 5f69fccc0702e44f082fe9847987e2b3 |
|
BLAKE2b-256 | 27cbf5b7432f3d77d17cc9943e477cca75c3d3851a50802ff30fe9bc3ceef13b |