A parser that extracts articles from any URL or HTML

Project description

article-parser

Extract an article or news story from a URL or raw HTML, and parse out its title and content.

How to install

article-parser is available on PyPI: https://pypi.org/project/article-parser/

$ pip install article-parser
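
To confirm that the install succeeded, one option is to ask the standard library for the installed version. The helper name `installed_version` below is hypothetical and not part of article-parser; it only wraps `importlib.metadata`, which is available in Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="article-parser"):
    """Return the installed version string of a distribution, or None if absent.

    Hypothetical helper for illustration; not part of article-parser.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "article-parser is not installed")
```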

Basic Usage

>>> import article_parser

article_parser.parse(
  url='',               # The URL of the article.
  html='',              # The HTML of the article.
  threshold=0.9,        # The ratio of text to the entire document, default 0.9.
  output='html',        # Output format: ``markdown`` or ``html``, default ``html``.
  **kwargs              # Optional keyword arguments passed on to the HTTP request, e.g. ``timeout``.
)

## output markdown
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)

## output html
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
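
The calls above can be wrapped in a small function that checks its arguments before any network request is made. This is a minimal sketch, not part of article-parser: the name `parse_article` and its `parse` parameter are hypothetical, and the rule that either `url` or `html` must be given is an assumption of the wrapper rather than documented library behavior. It relies only on what the examples show, namely that `article_parser.parse` returns a `(title, content)` pair.

```python
def parse_article(url=None, html=None, output="html", parse=None, **kwargs):
    """Validate arguments, then delegate to article_parser.parse.

    `parse` is a hypothetical injection point (not part of article-parser)
    so the wrapper can be exercised without network access.
    """
    if output not in ("html", "markdown"):
        raise ValueError(f"output must be 'html' or 'markdown', got {output!r}")
    if not url and not html:
        # Assumption of this wrapper: at least one input source is required.
        raise ValueError("pass either url or html")
    if parse is None:
        # Imported lazily so this module loads even if the package is missing.
        import article_parser
        parse = article_parser.parse
    title, content = parse(url=url or "", html=html or "", output=output, **kwargs)
    return title, content
```

Extra keyword arguments such as `timeout=5` are passed through unchanged, so `parse_article(url=..., output='markdown', timeout=5)` mirrors the Markdown call shown above.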

Example

Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn

  • Markdown
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
Serbia's Novak Djokovic kisses the trophy after winning the final against
Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
21, 2020. [Photo/Agencies]

ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
Schwartzman in the men's final of the ATP Italian Open on Monday.

Djokovic, the world number one and the top seed at the tournament, won 7-5,
6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
than Rafael Nadal.

The Serb said he did not play his best tennis this time in Rome, but could
find it when needed.

Simona Halep, top seed of the women's draw, won her first title in Rome after
defending champion Karolina Pliskova of the Czech Republic retired while
trailing 6-0, 2-1 in the final.
  • HTML
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
<div id="Content">

<figure class="image" style="display: table;">
<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
   Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
 </figcaption>
</figure>
<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
</div>
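
In the HTML output above, the image `src` is protocol-relative (`//img2.chinadaily.com.cn/...`), whereas the Markdown output carries a full `http://` URL. If the HTML is going to be saved or rendered outside the original page, the `src` values may need to be made absolute. The following is a rough sketch using only the standard library; `absolutize_src` is a hypothetical helper, not an article-parser function, and the regular expression is a simplification that a real HTML parser would handle more robustly.

```python
import re
from urllib.parse import urljoin

# Matches a whitespace-preceded src="..." or src='...' attribute.
# A regex is enough for a sketch; it is not a full HTML parser.
_SRC_ATTR = re.compile(r'''(\ssrc=)(["'])(.*?)\2''')


def absolutize_src(content_html, page_url):
    """Rewrite every src attribute to an absolute URL resolved against page_url."""
    def _fix(match):
        prefix, quote, src = match.groups()
        return f"{prefix}{quote}{urljoin(page_url, src)}{quote}"

    return _SRC_ATTR.sub(_fix, content_html)
```

With the article URL from the example as `page_url`, `urljoin` fills in the page's scheme, so the protocol-relative image address becomes an `http://` URL.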

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

article_parser-1.8.0.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

article_parser-1.8.0-py3-none-any.whl (5.6 kB)

Uploaded Python 3

File details

Details for the file article_parser-1.8.0.tar.gz.

File metadata

  • Download URL: article_parser-1.8.0.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for article_parser-1.8.0.tar.gz
Algorithm Hash digest
SHA256 bf56405a6b3c0aad1dcdada74874dbb6932ce177e813aa4b4e8198634d663c52
MD5 16572af3198521f63b76184111a01330
BLAKE2b-256 c6a7213ebca6eb776362e377efc7ddb8d3c588844ba7881d6dc3f407d55b4264

File details

Details for the file article_parser-1.8.0-py3-none-any.whl.

File hashes

Hashes for article_parser-1.8.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b848f87968b5b71b0b7d182da2abf0b2b6a95f53f76906593c3bae4aee7f3d48
MD5 e60d07cd45dd2b78bd646350657acd7f
BLAKE2b-256 23228def5b461d8c42d1e9428c9ff3b07e3ccf491e0391b42093a5cf84967648
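
A downloaded file can be checked against the SHA256 digests listed above before it is installed. This is a minimal sketch using `hashlib` from the standard library; the function names `sha256_of` and `matches_sha256` are hypothetical, and the file path is whatever location the file was saved to.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches_sha256("article_parser-1.8.0.tar.gz", "bf56405a6b3c0aad1dcdada74874dbb6932ce177e813aa4b4e8198634d663c52")` should return `True` for an unmodified copy of the source distribution.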
