Simplified Python article discovery & extraction.


Newspaper4k: Article Scraping & Curation, a continuation of the beloved newspaper3k by codelucas


The Newspaper4k project grew from a fork of the well-known newspaper3k by codelucas, which has not been updated since September 2020. The initial goal of this fork was to keep the project alive, add new features, and fix bugs. As of version 0.9.3 there are many new features and improvements that make Newspaper4k a great tool for article scraping and curation. To make the migration to Newspaper4k easier, all the classes and methods from the original project were kept, with the new features added on top of them. All API calls from the original project still work as expected, so users familiar with newspaper3k will feel right at home with Newspaper4k.

At the time of the fork, the original project had over 400 open issues, which I duplicated into this repository. As of v0.9.3, only about 180 of them still need to be verified (many are already fixed, but checking them is pretty cumbersome - hint hint ... anyone contributing?). If you have any issues or feature requests, please open an issue here.

Experimental ChatGPT helper bot for Newspaper4k: ChatGPT helper

Python compatibility

  • Python 3.10+ minimum

Quick start

pip install newspaper4k

Using the CLI

You can start directly from the command line, using the included CLI:

python -m newspaper --url="https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html" --language=en --output-format=json --output-file=article.json

More information about the CLI can be found in the CLI documentation.

(CLI demo)

Using the Python API

Alternatively, you can use Newspaper4k in Python:

Processing one article / URL at a time

import newspaper

article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')

print(article.authors)
# ['Hannah Brewitt']

print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00

print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...

print(article.top_image)
# https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill

print(article.movies)
# []

article.nlp()
print(article.keywords)
# ['patrick', 'mahomes', 'history', 'nfl', 'week', 'broncos', 'denver', 'p', 'm', '00', 'pittsburgh',...]


print(article.summary)
# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime

(Source demo)

Parsing and scraping whole news sources (websites) using the Source class

This way you can build a Source object from a newspaper website. This class allows you to retrieve all the articles and categories on the website. When you build the source, articles are not yet downloaded: the build() call parses the front page, detects category links (if possible), reads any RSS feeds published by the news site, and creates a list of article links. You then need to call download_articles() to download the articles, but note that this can take significant time.

download_articles() downloads the articles in a multithreaded fashion using ThreadPoolExecutor from the concurrent.futures package. The number of concurrent threads can be configured via Configuration.number_threads or passed as an argument to build().

import newspaper

cnn_paper = newspaper.build('http://cnn.com', number_threads=3)
print(cnn_paper.category_urls())

# ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
# 'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
# 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']

article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
# ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
# 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
# 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']


article = cnn_paper.articles[0]
article.download()
article.parse()

print(article.title)
# المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى

Or, if you want to get articles from the website in bulk (keep in mind that this could take a long time and could get your IP blocked by the news site):

import newspaper

cnn_source = newspaper.build('http://cnn.com', number_threads=3)

print(len(cnn_source.article_urls()))

articles = cnn_source.download_articles()

print(len(articles))

print(articles[0].title)

As of version 0.9.3, Newspaper4k supports Google News as a special Source object.

First, make sure you have the gnews extra installed, since we rely on the GNews package to fetch articles from Google News. You can install it using pip like this:

pip install newspaper4k[gnews]

Then you can use the GoogleNewsSource class to get articles from Google News:

from newspaper.google_news import GoogleNewsSource

source = GoogleNewsSource(
    country="US",
    period="7d",
    max_results=10,
)

source.build(top_news=True)

print(source.article_urls())
# ['https://www.cnn.com/2024/03/18/politics/trump-464-million-dollar-bond/index.html', 'https://www.cnn.com/2024/03/18/politics/supreme-court-new-york-nra/index.html', ...
source.download_articles()
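
Besides top news, the 0.9.3 changelog notes that GoogleNewsSource can also search by keywords, topic or location. A minimal sketch, assuming a keyword parameter on build() (the exact parameter name is an assumption, not confirmed above):

from newspaper.google_news import GoogleNewsSource

source = GoogleNewsSource(country="US", period="7d", max_results=10)

# Hypothetical keyword-based build; the parameter name is assumed from the
# changelog ("search and parse news based on keywords, topic, location
# or website").
source.build(keyword="artificial intelligence")

print(source.article_urls())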

Multilanguage features

Newspaper can detect the article language seamlessly based on the article meta tags. Additionally, you can specify the language for the website / article yourself. If no language is specified, Newspaper will attempt to auto-detect the language from the available metadata. The fallback language is English.

Language detection is crucial for accurate article extraction. If the wrong language is detected or provided, chances are that no article text will be returned. Before parsing, check that your language is supported by our package.

from newspaper import Article

article = Article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358')
article.download()
article.parse()

print(article.title)
# 晶片大战:台湾厂商助攻华为突破美国封锁?

if article.config.use_meta_language:
  # If we use the autodetected language, this config attribute will be true
  print(article.meta_lang)
else:
  print(article.config.language)

# zh
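
If auto-detection fails for a particular site, you can set the language explicitly. A minimal sketch, assuming the same BBC Zhongwen article and the 'zh' language code:

import newspaper

# Explicitly set the language instead of relying on meta-tag detection.
article = newspaper.article(
    'https://www.bbc.com/zhongwen/simp/chinese-news-67084358',
    language='zh',
)

print(article.config.language)
# zh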

Docs

Check out The Docs for full and detailed guides on using newspaper.

Features

  • Multi-threaded article download framework
  • Newspaper website category structure detection
  • News URL identification
  • Google News integration
  • Text extraction from HTML
  • Top image extraction from HTML
  • All image extraction from HTML
  • Keyword building from the extracted text
  • Automatic article text summarization
  • Author extraction from text
  • Easy-to-use command line interface (python -m newspaper ...)
  • Output in various formats (JSON, CSV, text)
  • Works in 80+ languages (English, Chinese, German, Arabic, ...); see LANGUAGES.md for the full list of supported languages.

Evaluation

Using the dataset from ScrapingHub, I created an evaluator script that compares the performance of newspaper against its previous versions. This way we can see how updates improve or worsen the library's performance.

ScrapingHub Article Extraction Benchmark

Version           | Corpus BLEU Score | Corpus Precision Score | Corpus Recall Score | Corpus F1 Score
Newspaper3k 0.2.8 | 0.8660            | 0.9128                 | 0.9071              | 0.9100
Newspaper4k 0.9.0 | 0.9212            | 0.8992                 | 0.9336              | 0.9161
Newspaper4k 0.9.1 | 0.9224            | 0.8895                 | 0.9242              | 0.9065
Newspaper4k 0.9.2 | 0.9426            | 0.9070                 | 0.9087              | 0.9078
Newspaper4k 0.9.3 | 0.9531            | 0.9585                 | 0.9339              | 0.9460

Precision, recall and F1 are computed using the overlap of shingles with n-grams of size 4. The corpus BLEU score is computed using NLTK's bleu_score.
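
To make the metric concrete, here is a minimal sketch of 4-gram shingle precision / recall / F1. This is an illustration only, not the actual evaluator script; tokenization and counting details may differ:

def shingles(text, n=4):
    # Set of n-gram shingles: tuples of n consecutive whitespace tokens.
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def precision_recall_f1(extracted, reference, n=4):
    ext, ref = shingles(extracted, n), shingles(reference, n)
    if not ext or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(ext & ref)
    precision = overlap / len(ext)  # share of extracted shingles that are correct
    recall = overlap / len(ref)     # share of reference shingles recovered
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1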

We also use our own, newly created dataset, the Newspaper Article Extraction Benchmark (NAEB), a collection of over 400 articles from 200 different news sources, to evaluate the performance of the library.

Newspaper Article Extraction Benchmark

Version           | Corpus BLEU Score | Corpus Precision Score | Corpus Recall Score | Corpus F1 Score
Newspaper3k 0.2.8 | 0.8445            | 0.8760                 | 0.8556              | 0.8657
Newspaper4k 0.9.0 | 0.8357            | 0.8547                 | 0.8909              | 0.8724
Newspaper4k 0.9.1 | 0.8373            | 0.8505                 | 0.8867              | 0.8682
Newspaper4k 0.9.2 | 0.8422            | 0.8888                 | 0.9240              | 0.9061
Newspaper4k 0.9.3 | 0.8695            | 0.9140                 | 0.8921              | 0.9029

Requirements and dependencies

The following system packages are required:

  • Pillow: libjpeg-dev zlib1g-dev libpng12-dev
  • Lxml: libxml2-dev libxslt-dev
  • Python Development version: python-dev

If you are on Debian / Ubuntu, install using the following:

  • Install python3 and python3-dev:

    $ sudo apt-get install python3 python3-dev
    
  • Install the pip3 command, needed to install the newspaper4k package:

    $ sudo apt-get install python3-pip
    
  • lxml requirements:

    $ sudo apt-get install libxml2-dev libxslt-dev
    
  • For PIL to recognize .jpg images:

    $ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev
    

NOTE: If you have problems installing libpng12-dev, try installing libpng-dev instead.

  • Install the distribution via pip:

    $ pip3 install newspaper4k
    

If you are on macOS, install using the following; you may use either Homebrew or MacPorts:

$ brew install libxml2 libxslt

$ brew install libtiff libjpeg webp little-cms2

$ pip3 install newspaper4k

Contributing

see CONTRIBUTING.md

LICENSE

Authored and maintained by Andrei Paraschiv.

Newspaper was originally developed by Lucas Ou-Yang (codelucas); the original repository can be found here. Newspaper is licensed under the MIT license.

Credits

Thanks to Lucas Ou-Yang for creating the original Newspaper3k project and to all contributors of the original project.

Changelog

0.9.4 (2025-11-15)

New Features

Bumped the minimum Python version to 3.10. Versions 3.8 and 3.9 are no longer supported, but might still work.

  • misc: switch to uv from poetry(2345076) (by Andrei)
  • parse: add brotli compression(6ff72bd) (by Andrei)
  • install: dependency versions pin(10cae21) (by Andrei)
  • typing: update type hint for data parameter to allow None(80279d1) (by Andrei)
  • tests: split tests into unit, integration and e2e. Only unit tests are run on each PR; integration and e2e tests are run locally during development.
  • tests: added coverage report generation. Coverage uploaded to coveralls.io

Refactor

  • rework: :art: reformat with ruff, new line-width 120, sort imports(cb560f7) (by Andrei)
  • doc: :tada: Better explanation of min_word_count, min_sent_count configuration(7ed25e9) (by Andrei)
  • chore: typing-extensions, lxml compatibility (#639)(c5e4170) (by Chris)

Bugs fixed:

  • parse: :boom: fix: repair Google News URL decoding and update network request handling(059b45c) (by Andrei)
  • misc: erroneous debug statements (#663)(2ab8208) (by Michael Braun)
  • docs: :memo: correct spelling of 'memoize_articles' to 'memorize_articles' in user guide(3039463) (by Andrei)
  • lang: :bug: correct iso-code for nepali language (#624)(88fc5a7) (by Andrei)
  • requests: Fixed issue [BUG] Responses with no headers break some of the internal code #635(802ae11) (by Andrei)

0.9.3.1 (2024-03-18)

Some fixes regarding Python >= 3.11 dependencies. The pinned NumPy version was incompatible with Colab; this is now fixed. Also, there was a typo in the Nepali language code - it was "np" instead of "ne". This is now fixed.

0.9.3 (2024-03-18)

Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module; it is much easier to add new languages now. Additionally, added support for Google News as a source: you can now search and parse news based on keywords, topic, location or website. Integrated cloudscraper as an optional dependency; if installed, it will be used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection. We now use two evaluation datasets - the one from ScrapingHub and one created by us from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of changes.

We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset

New Features

  • lang: :zap: Rework of tokenizer. Additionally implemented a new (easier) way of adding languages to the package(0833859) (by Andrei)
  • lang: :rocket: added support for another 13 languages(fd41af5) (by Andrei)
  • lang: :memo: Added stopwords for af, br, ca, eo, eu, ga, gl, gu, ha, hy, ku, ms, so, st, tl, ur, yo, zu from https://github.com/stopwords-iso(bba7a99) (by Andrei)
  • lang: :memo: Added Burmese language(13670c3) (by Andrei)
  • lang: :memo: Added Slovak language support(4ff82a8) (by Andrei)
  • lang: :memo: Added Czech Language support(afcdc27) (by Andrei)
  • lang: :memo: Added Latvian language support(89f3152) (by Andrei)
  • lang: :memo: Added Telugu Language support(f0f8133) (by Andrei)
  • lang: :memo: Added Marathi language support(ef40042) (by Andrei)
  • lang: :memo: Added Georgian language support(afca45b) (by Andrei)
  • lang: :memo: Added Tamil language support(0bd48ec) (by Andrei)
  • lang: :memo: Added Bengali language support(7a08fc2) (by Andrei)
  • parse: :sparkles: added filter that limits source.build to a specific category. Use source.build(url, only_in_path=True) to scrape only stories that are under the starting URL path(665f6fe) (by Andrei)
  • parse: :fire: Source object is now pickleable(af3f80f) (by Andrei)
  • parse: :fire: article is now pickleable(f564524) (by Andrei)
  • sources: :sparkles: New integration of Google news using GNews module. You can now use GoogleNewsSource to search and parse news based on keywords, topic, location or website(33c3409) (by Andrei)
  • sources: :sparkles: new option when building sources. You can limit the article parsing to the source home page only. Other categories or feeds are then ignored(6b8c23e) (by Andrei)
  • misc: :chart_with_upwards_trend: added cloudscraper as an optional dependency. If installed, it will be used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection(720bfe4) (by Andrei)
  • misc: better typing support and type hinting Author: Tom Parker-Shemilt <palfrey@***.net>
  • misc: Simplify favicon return Author: Tom Parker-Shemilt <palfrey@***.net>
  • misc: Basic mypy support Author: Tom Parker-Shemilt <palfrey@***.net>
  • core: added language dependencies, cloudscrape and gnews as optional(cd921a3) (by Andrei)
  • doc: 📝 adding evaluation results
  • doc: 🚀 Documentation Update. Added Examples, documented new features
  • doc: 🔥 Added typing and docstrings to most of the code

Refactor

  • lang: moving all language related files in languages folder
  • lang: added valid_languages function that returns available languages
  • misc: ⚡ removed ParsingCandidate, RawHelper, URLHelper classes. Removed link_hash from article (was never used)
  • parse: article.link_hash is no longer available
  • parse: ✨ Tidying up the gravity scoring process. No changes in the final score result
  • parse: 🚀 compute word statistics for a node taking children nodes into account
  • core: Minimum Python now 3.8; Also test 3.10/11/12 Author: Tom Parker-Shemilt <palfrey@***.net>
  • core: run gh actions on PR's. Author: Tom Parker-Shemilt <palfrey@***.net>
  • core: Set SETUPTOOLS_USE_DISTUTILS. setuptools as per numpy recommendations. Upgrade numpy and pandas for >= 3.9. Author: Tom Parker-Shemilt <palfrey@***.net>
  • core: Upgrade regex, virtualenv to avoid breaking pre-commit, distutils for everyone. Author: Tom Parker-Shemilt <palfrey@***.net>
  • parse: 💥 deprecated text_cleaned and clean_doc; removed clean_top_node. article.clean_top_node is no longer available, and accessing it fails

Bugs fixed:

  • lang: :zap: better is_highlink_density for non-latin languages(a3b6250) (by Andrei)
  • parse: :bug: fixed an issue with non latin high density detection(17a2dad) (by Andrei)
  • parse: :bug: better feed discovery in Source objects(7a3abe9) (by Andrei)
  • parse: :fire: better binary content detection(7ad77cf) (by Andrei)
  • parse: :zap: Better title parsing. Added language specific regex for article titles(d5e8b2b) (by Andrei)
  • parse: :zap: get feeds fixed, it was not parsing the main page for possible feeds(2f7b698) (by Andrei)
  • parse: :fire: better article paragraph detection(0096999) (by Andrei)
  • parse: :zap: added figure as a tag to be removed before text generation(5a226e0) (by Andrei)
  • parse: :zap: Bug with autodetecting website language. If no language supplied, the detected language was not used(07076cb) (by Andrei)
  • misc: :sparkles: tidying up some code in urls.py(3bb4ca9) (by Andrei)
  • misc: :ambulance: python-setup github action version bump(5bb581e) (by Andrei)
  • misc: :art: mypy stubs for gnews and cloudscraper + small typing fixes(2644f7a) (by Andrei)
  • cli: JSON output to stdout was missing (by Andrei)
  • types: :art: added stubs for gnews(86d7128) (by Andrei)

0.9.2 (2024-01-14)

Some major changes in document parsing. In previous versions there was a high chance that parts of the article body were missing; in addition, in some cases the order of the paragraphs was not correct. This release should fix these issues.

Highlighted features:

  • You can now use the module as a command line interface (CLI). Usage: python -m newspaper --url https://www.test.com. More information in the documentation.
  • I have added an evaluation script against a dataset from scrapinghub. This will help keeping track of future improvements.
  • Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous news_pool was replaced with a fetch_news() function; see the sketch after this list.
  • Caching is now much more flexible. You can disable it completely or for one request.
  • You can now use the newspaper.article() function for convenience. It creates, downloads and parses an article in one step. It takes all the parameters of the Article class.
  • Sites protected by Cloudflare are now better detected and raise an exception. The reason will be in the exception message.
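
A minimal sketch of the news_pool replacement described above; the module path newspaper.mthreading and the threads parameter are assumptions based on the project documentation:

from newspaper.mthreading import fetch_news

# fetch_news downloads a list of items concurrently; per the changelog it
# replaces news_pool. The return value is assumed to be the list of
# downloaded Article objects.
urls = [
    'https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html',
]
results = fetch_news(urls, threads=4)

print(len(results))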

New feature:

  • category: :sparkles: improved category link parsing / category link detection(41677b0) (by Andrei)
  • category: :zap: Added option to disable the category_url cache for Source objects. Refactored the cache_disk decorator(670aad9) (by Andrei)
  • cli: :sparkles: added command line interface (CLI) for the module. Usage: python -m newspaper --url https://www.test.com(f46b443) (by Andrei)
  • cli: added output format "text"(31b9079) (by Andrei)
  • core: Article.download() and Article.parse() now return self, so calls can be chained (see the sketch after this list)(3be1e47) (by Andrei)
  • lang: :art: automatically load nltk punkt if not present (d0fcdd8) (by Andrei)
  • nlp: added the keyword scores as a dictionary attribute on Article. Additionally, config.MAX_KEYWORDS is now really taken into consideration when computing article keywords(f51a04f) (by Andrei)
  • parse: :rocket: improvements in the article body extraction. some sections that were ignored are now added to the extracted text.(1af12d2) (by Andrei)
  • parse: :sparkles: better parametrization of top_node detection. magic constants moved out of the score computation(6485c40) (by Andrei)
  • parse: :triangular_flag_on_post: added some Author detection tags (Issue #347)(4aebf29) (by Andrei)
  • parse: added fine-grained score for top node article attribute booster(0d41fc7) (by Andrei)
  • parse: Added twitch as a video provider (Issue #349, #348)(f4d8f0f) (by Andrei)
  • parse: minor improvement on top node detection(95d5cfa) (by Andrei)
  • parse: parsing rules improvements suggested by @aleksandar-devedzic in issue #577(8677dbe) (by Andrei)
  • requests: :bookmark: Added redirection history from the request calls in Article.download(8ca3d40) (by Andrei)
  • requests: :chart_with_upwards_trend: added a binary file detection. Files that are known binary content-types or have in the first 1000 bytes more than 40% non-ascii characters will raise an exception in article.download.(e7a60dd) (by Andrei)
  • tests: :sparkles: added evaluation script to test against the dataset from https://github.com/scrapinghub/article-extraction-benchmark/(737c226) (by Andrei)
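
As a quick illustration of the chaining change noted above (the URL is a placeholder):

from newspaper import Article

# download() and parse() return self as of 0.9.2, so calls can be chained.
article = Article('https://example.com/some-article').download().parse()

print(article.title)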

Bugs fixed:

  • bug: :lipstick: the option / function / parameter was named memoize_articles instead of memorize_articles(aaef712) (by Andrei)

  • bug: MEMO_DIR is now Path object. addition with str forgotten from refactoring(0b98e71) (by Andrei)

  • depend: removed feedfinder2 as dependency. was not used(c230aca) (by Andrei)

  • doc: some minor documentation changes(764742a) (by Andrei)

  • lang added additional stopwords for "fa". Issue #398(3453538) (by Andrei)

  • lang: :speech_balloon: fixed Serbian stopwords. Added Cyrillic version (Issue #389)(dfcb760) (by Andrei)

  • parse itemprop containing but not equal to articleBody(510be0e) (by Andrei)

  • parse: :art: removed some additional advertising snippets(bd30d48) (by Andrei)

  • parse: :chart_with_upwards_trend: removed possible image caption remains from cleaned article text (Issue #44)(7298140) (by Andrei)

  • parse: :globe_with_meridians: image parsing and movie parsing improvements. get links from additional attributes such as "data-src".(c02bb23) (by Andrei)

  • parse: :memo: exclude some tags from get_text. Tags such as script, option can add garbage to the text output(f0e1965) (by Andrei)

  • parse: :memo: Improved newline generation based on block-level tags; <br>'s are better taken into account.(22327d8) (by Andrei)

  • parse: added youtu.be to video sources(bf516a1) (by Andrei)

  • parse: additional fixes for caption(3e7fdcc) (by Andrei)

  • refactor: deprecated non-pythonic configuration attributes (ALL CAPS vs. lower case). For the moment, both approaches work(691e12f) (by Andrei)

  • sec: bump nltk and requests min version(553ef27) (by Andrei)

  • sources: :bug: fixed a problem with some types of article links.(9a5c0e2) (by Andrei)

0.9.1 (2023-11-08)

New feature:

  • version bump(f7107be) (by Andrei)
  • tests: Add test case for(592f6f6) (by Andrei)
  • parse: added possibility to follow "read more" links in articles(0720de1) (by Andrei)
  • core: Allow passing any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462; see the sketch after this list)(5ff5d27) (by Andrei)
  • lang Macedonian file raises an error(cadea6a) (by Murat Çorlu)
  • parse: extended data parsing of json-ld metadata (issue #518)(fc413af) (by Andrei)
  • tests: added script to create test cases(9df8c16) (by Andrei)
  • parse: added tag for date detection issue #835(41152eb) (by Andrei)
  • parse: added og:regDate to known date tags(dc35e29) (by Andrei)
  • tests: convert unittest to pytest(45c4e8d) (by Andrei)
  • doc add autodoc for readthedocs (22e9dca) (by Andrei)
  • doc: Added docstring to Article, Source and Configuration.(8e54946) (by Andrei)
  • doc: some clarifications in the documentation(e8126d5) (by Andrei)
  • doc: some template changes(0261054, bfbac2c) (by Andrei)
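
A minimal sketch of the requests pass-through noted above (the URL is a placeholder; verify=False disables TLS certificate verification, so only use it for sites you trust):

from newspaper import Article

# Extra keyword arguments are forwarded to requests (0.9.1+);
# verify=False ignores certificate errors, per issue #462.
article = Article('https://self-signed.example.com/article', verify=False)
article.download()
article.parse()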

Bugs fixed:

  • core: typing annotation for set, Python 3.8(895343f) (by Andrei)
  • parse: improve meta tag content for articles and pubdate(37bb0b7) (by Andrei)
  • parse: :memo: improved author detection. improved video links detection(23c547f) (by Andrei)
  • parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM.(6874d05) (by Andrei)
  • core: small changes, replace os.path with pathlib(5598d95) (by Andrei)
  • parse: use one file of stopwords for english, the one in the standard folder #503(6bdf813) (by Andrei)
  • parse: better author parsing based on issue #493(f93a9c2) (by Andrei)
  • parse: make the url date parsing stricter. Issue #514(0cc1e83) (by Andrei)
  • parse: replace \n with space in sentence split (Issue #506)(3ccb87c) (by Andrei)
  • parsing: catch URL errors resulting from parsed image links(9140a04) (by Andrei)
  • repo: correct python versions in pipeline(7e671df) (by Andrei)
  • repo: gitignore update(8855f00) (by Andrei)

[0.9.0] (2023-10-29)

First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped version numbers so that it is clear that this is a fork and not the original project.

New feature:

  • tests: started moving tests to pytest(f294a01) (by Andrei)
  • parser: add yoast schema parse for date extraction(39a5cff) (by Andrei)

Bugs fixed:

  • docs: update README.md(d5f9209) (by Andrei)
  • parse: feed_url parsing, issue #915(ec2d474) (by Andrei)
  • parse: better content detection. Added <article> and <div> tags as candidates for the content parent_node(447a429) (by Andrei)
  • core: close pickle files - PR #938(d7608da) (by Andrei)
  • parse: improved publication date extraction(4d137eb) (by Andrei)
  • core: some linter errors, whitespaces and spelling(79553f6) (by Andrei)

... see here for earlier changes.

