A parser that extracts an article from a URL or HTML

article-parser

Extract an article or news story from a URL or from HTML, parse its title and content, and output the result as Markdown or HTML.

How to install

article-parser is available on PyPI: https://pypi.org/project/article-parser/

$ pip install article-parser

Basic Usage

>>> import article_parser

article_parser.parse(
  url='',               # The URL of the article.
  html='',              # The HTML of the article.
  threshold=0.9,        # The ratio of text to the entire document; default 0.9.
  output='html',        # Output format, either ``markdown`` or ``html``; default ``html``.
  **kwargs              # Optional keyword arguments that `request` takes, e.g. ``timeout``.
)
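The `threshold` comment describes a ratio of text to the entire document. How article-parser computes this internally is not shown here. As an illustration only, one way to measure such a ratio is to compare the visible text of a candidate fragment with the visible text of the whole page. `visible_text` and `text_ratio` below are hypothetical helpers built on the standard library; they are not part of the package.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.parts).split())


def text_ratio(fragment_html, document_html):
    """Hypothetical metric: share of the document's text found in the fragment."""
    total = len(visible_text(document_html))
    if total == 0:
        return 0.0
    return len(visible_text(fragment_html)) / total
```

Under this reading, a fragment that holds 18 of a page's 20 visible characters scores 0.9, which would just meet the default threshold.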


## output markdown
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)

## output html
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
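Because `parse` returns the title and the content separately, a small wrapper can join them into a single Markdown document. The sketch below assumes only what the calls above show: `parse` accepts `output='markdown'` and returns a `(title, content)` pair. `to_markdown_document` is a hypothetical helper, and the parse function is passed in as an argument so the helper can be tried without network access.

```python
from typing import Callable, Tuple


def to_markdown_document(parse: Callable[..., Tuple[str, str]], **parse_kwargs) -> str:
    """Return one Markdown string: the title as a level-1 heading, then the body.

    `parse` is expected to behave like article_parser.parse (an assumption):
    called with output='markdown', it returns a (title, content) tuple.
    """
    title, content = parse(output="markdown", **parse_kwargs)
    return f"# {title.strip()}\n\n{content.strip()}\n"
```

With the package installed, a call might look like `to_markdown_document(article_parser.parse, url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)`.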

Example

Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn

  • Markdown
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
Serbia's Novak Djokovic kisses the trophy after winning the final against
Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
21, 2020. [Photo/Agencies]

ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
Schwartzman in the men's final of the ATP Italian Open on Monday.

Djokovic, the world number one and the top seed at the tournament, won 7-5,
6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
than Rafael Nadal.

The Serb said he did not play his best tennis this time in Rome, but could
find it when needed.

Simona Halep, top seed of the women's draw, won her first title in Rome after
defending champion Karolina Pliskova of the Czech Republic retired while
trailing 6-0, 2-1 in the final.
  • HTML
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
<div id="Content">

<figure class="image" style="display: table;">
<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
   Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
 </figcaption>
</figure>
<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
</div>
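
In the HTML output above, the image keeps the protocol-relative address from the source page (`src="//img2.chinadaily.com.cn/..."`). If the HTML is rendered somewhere other than the original site, such addresses may need to be made absolute. A minimal sketch using only the standard library follows; `absolutize_src` is a hypothetical helper, not part of article-parser, and the regular expression is a simplification that only handles quoted `src` attributes.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or src='...', but not attributes such as data-src.
_SRC_RE = re.compile(r'''(?<![\w-])(src=)(["'])(.*?)\2''', re.IGNORECASE)


def absolutize_src(html: str, base_url: str) -> str:
    """Resolve every quoted src attribute in `html` against `base_url`."""

    def _fix(match):
        prefix, quote, value = match.groups()
        return f"{prefix}{quote}{urljoin(base_url, value)}{quote}"

    return _SRC_RE.sub(_fix, html)
```

Passing the article URL as `base_url` lets `urljoin` supply the missing scheme for protocol-relative addresses and resolve ordinary relative paths as well.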
