A parser that extracts an article from a URL or from raw HTML

article-parser

Extract an article or news story from a URL or from HTML, parse its title and content, and output the result in Markdown format.

How to install

article-parser is available on PyPI: https://pypi.org/project/article-parser/

$ pip install article-parser

Basic Usage

>>> import article_parser

article_parser.parse(
  url='',              ## The URL of the article. optional
  html='',             ## The HTML of the article. optional
  options={
    'markdown': True,  ## Output in Markdown format. Default: True. optional
    'threshold': 0.9   ## Content ratio threshold. Default: 0.9. optional
  })

## output markdown (the default)
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")

## output html
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", options={'markdown': False})
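A common next step is to save the parsed result. The following is a minimal sketch, assuming only that `parse` returns a `(title, content)` pair as in the calls above. The helpers `slugify` and `save_article` and the file-naming scheme are hypothetical and not part of article-parser.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Turn a title into a filesystem-friendly slug (hypothetical helper)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    return slug or "untitled"


def save_article(title: str, content: str, directory: str = ".") -> Path:
    """Write a parsed article to <directory>/<slug>.md and return the path."""
    path = Path(directory) / f"{slugify(title)}.md"
    # Put the title on top as a Markdown heading, then the parsed body.
    path.write_text(f"# {title}\n\n{content}\n", encoding="utf-8")
    return path


# Usage with article-parser (needs the package installed and network access):
#   import article_parser
#   title, content = article_parser.parse(
#       url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")
#   save_article(title, content)
```

The helpers take plain strings, so they work the same whether the article was parsed from `url=` or from `html=`.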

Example

Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn

  • Markdown
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
Serbia's Novak Djokovic kisses the trophy after winning the final against
Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
21, 2020. [Photo/Agencies]

ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
Schwartzman in the men's final of the ATP Italian Open on Monday.

Djokovic, the world number one and the top seed at the tournament, won 7-5,
6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
than Rafael Nadal.

The Serb said he did not play his best tennis this time in Rome, but could
find it when needed.

Simona Halep, top seed of the women's draw, won her first title in Rome after
defending champion Karolina Pliskova of the Czech Republic retired while
trailing 6-0, 2-1 in the final.
  • HTML
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", options={'markdown': False})
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
<div id="Content">

<figure class="image" style="display: table;">
<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
   Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
 </figcaption>
</figure>
<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
</div>
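Note that the HTML output above keeps the page's protocol-relative image address (`src="//img2.chinadaily.com.cn/..."`), whereas the Markdown output shows a full `http://` URL. If the HTML is to be rendered outside the original site, the `src` values may need to be made absolute. Below is a rough sketch using only the standard library; `absolutize_src` is a hypothetical helper, not part of article-parser, and the regex approach is a simplification that assumes double-quoted `src` attributes.

```python
import re
from urllib.parse import urljoin

# Match src="..." but not attributes that merely end in "src" (e.g. data-src).
SRC_ATTR = re.compile(r'(?<![\w-])(src=")([^"]+)(")')


def absolutize_src(html: str, base_url: str) -> str:
    """Resolve every double-quoted src value in `html` against `base_url`."""
    def fix(match: re.Match) -> str:
        absolute = urljoin(base_url, match.group(2))
        return match.group(1) + absolute + match.group(3)

    return SRC_ATTR.sub(fix, html)


page = "http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html"
snippet = '<img src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>'
print(absolutize_src(snippet, page))
```

`urljoin` takes the scheme from the base URL for protocol-relative references, so the image address here becomes `http://img2.chinadaily.com.cn/...`.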

Download files

Source Distribution

  • article-parser-1.0.0.tar.gz (5.1 kB), uploaded as Source

Built Distributions

  • article_parser-1.0.0-py3.10.egg (6.8 kB), uploaded as Source
  • article_parser-1.0.0-py3-none-any.whl (5.2 kB), uploaded as Python 3

File details

Details for the file article-parser-1.0.0.tar.gz.

File metadata

  • Download URL: article-parser-1.0.0.tar.gz
  • Upload date:
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.9.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for article-parser-1.0.0.tar.gz
Algorithm Hash digest
SHA256 e1fdbcf4eafaa8b43941aa5be347017537d2cb99ad4641cb23e5ede310219ad9
MD5 4aad49a31ac0dbe72234ab62c093f422
BLAKE2b-256 3cdeab44ce34210326b825c02680391378daec391e49d46e5590c8c446025591
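A downloaded file can be checked against the SHA256 digest listed above. This is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches_sha256` are illustrative, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the sdist was saved to the current directory):
#   matches_sha256(
#       "article-parser-1.0.0.tar.gz",
#       "e1fdbcf4eafaa8b43941aa5be347017537d2cb99ad4641cb23e5ede310219ad9",
#   )
```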

File details

Details for the file article_parser-1.0.0-py3.10.egg.

File metadata

  • Download URL: article_parser-1.0.0-py3.10.egg
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.9.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for article_parser-1.0.0-py3.10.egg
Algorithm Hash digest
SHA256 c834e2b2e1e7f5c98b3db9a4f585b3d6a136071c927fd57efab96bf24c38c0a2
MD5 085faca67c88778969aeab95c0d86bbb
BLAKE2b-256 d9d50b6a2c2f365fcee82ebf385b60f59a5f1b64c2ecf44d6be7520468d9f0f9

File details

Details for the file article_parser-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: article_parser-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 5.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.9.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for article_parser-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d52168d5c967283b877800b48580b10abf031f21692300c6d4a56b109939bb51
MD5 b4996bf9cbe9605db96290dab85cbe9c
BLAKE2b-256 a3338e9fa9088e42b1fc65c57be5d120e78093461acf35d7dac9a347dcf71f0d
