A parser that extracts an article from a URL or HTML

article-parser

Extracts an article or news story from a URL or from raw HTML, parses the title and content, and outputs the content in Markdown format (or, optionally, as HTML).

How to install

article-parser is available on PyPI: https://pypi.org/project/article-parser/

$ pip install article-parser

Basic Usage

>>> import article_parser

article_parser.parse(
  url='',               # The URL of the article. Optional
  html='',              # The HTML of the article. Optional
  proxies={},           # Proxies used for the request, e.g. to avoid IP blocking
  options={
    'markdown': True,   # Output in Markdown format. Default True. Optional
    'threshold': 0.9,   # Content ratio threshold. Default 0.9. Optional
    'timeout': 5        # Timeout for fetching the webpage, in seconds. Default 5. Optional
  })

## output markdown
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")

## output html
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", options={'markdown': False})
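The keyword arguments listed above can be assembled in one place before calling `article_parser.parse`. The following is a minimal sketch, not part of the library: `DEFAULT_OPTIONS` and `build_parse_kwargs` are hypothetical names, and the default values are simply copied from the option list in this README.

```python
from typing import Any, Dict, Optional

# Defaults as documented in this README (assumed to match the library).
DEFAULT_OPTIONS: Dict[str, Any] = {"markdown": True, "threshold": 0.9, "timeout": 5}


def build_parse_kwargs(
    url: str = "",
    html: str = "",
    proxies: Optional[Dict[str, str]] = None,
    **overrides: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: build the keyword arguments for article_parser.parse."""
    # Both inputs are documented as optional, but parsing needs at least one.
    if not url and not html:
        raise ValueError("provide either url or html")
    # Reject option names that this README does not document.
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    # Documented defaults first, then the caller's overrides.
    options = {**DEFAULT_OPTIONS, **overrides}
    return {"url": url, "html": html, "proxies": proxies or {}, "options": options}
```

With the package installed, the result could then be passed on as `article_parser.parse(**build_parse_kwargs(url="...", markdown=False))`.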

Example

Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn

  • Markdown
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html")
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
Serbia's Novak Djokovic kisses the trophy after winning the final against
Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
21, 2020. [Photo/Agencies]

ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
Schwartzman in the men's final of the ATP Italian Open on Monday.

Djokovic, the world number one and the top seed at the tournament, won 7-5,
6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
than Rafael Nadal.

The Serb said he did not play his best tennis this time in Rome, but could
find it when needed.

Simona Halep, top seed of the women's draw, won her first title in Rome after
defending champion Karolina Pliskova of the Czech Republic retired while
trailing 6-0, 2-1 in the final.
  • HTML
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", options={'markdown': False})
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
<div id="Content">

<figure class="image" style="display: table;">
<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
   Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
 </figcaption>
</figure>
<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
</div>
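In the HTML output above, the image `src` is protocol-relative (`//img2.chinadaily.com.cn/...`), whereas the Markdown output carries a full `http://` URL. If the HTML is to be stored or rendered outside the original page, the image URLs may need to be resolved against the article URL. The sketch below uses only the standard library; `absolutize_img_src` is a hypothetical helper, not part of article-parser, and the regular expression only handles double-quoted `src` attributes.

```python
import re
from urllib.parse import urljoin

# Matches the double-quoted src attribute of an <img> tag (simplified on purpose).
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)


def absolutize_img_src(html: str, base_url: str) -> str:
    """Hypothetical helper: resolve every <img src> in html against base_url."""
    def fix(match: "re.Match[str]") -> str:
        # urljoin fills in the scheme (and host or path, if missing) from base_url.
        return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)

    return IMG_SRC.sub(fix, html)
```

Here `base_url` would be the same URL that was passed to `article_parser.parse`, and `html` the returned `content` when `'markdown': False` is set.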
