General extractor of news pages.

Project description

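GNE: A General-Purpose News Article Extractor

GeneralNewsExtractor (GNE) is a general-purpose module for extracting the main content of news web pages. Given the HTML of a news page, it returns the body text, title, author, publication time, the URLs of images in the body, and the source code of the tag that contains the body. GNE performs very well on hundreds of Chinese news sites, including Toutiao (今日头条), NetEase News (网易新闻), Gamersky (游民星空), Guancha (观察者网), ifeng.com (凤凰网), Tencent News (腾讯新闻), ReadHub, and Sina News (新浪新闻), with accuracy close to 100%.

Usage is simple: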

from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
html = 'HTML source of the news page'
result = extractor.extract(html)
print(result)
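
The snippet above prints the whole result. As a minimal sketch of picking out individual fields, the helper below assumes that `extract()` returns a dict and that the fields listed in the description appear under the keys `title`, `author`, `publish_time`, `content`, and `images`. Those key names and the helper names `summarize` and `summarize_with_gne` are assumptions made for illustration, so check them against the GNE documentation before relying on them.

```python
from typing import Any, Dict, Protocol


class SupportsExtract(Protocol):
    """Anything with an extract(html) method, such as GeneralNewsExtractor."""

    def extract(self, html: str) -> Dict[str, Any]: ...


# Assumed key names (not confirmed by this page); verify against the GNE docs.
ASSUMED_KEYS = ("title", "author", "publish_time", "content", "images")


def summarize(extractor: SupportsExtract, html: str) -> Dict[str, Any]:
    """Run the extractor and keep only the assumed fields.

    A key missing from the result is reported as None instead of raising.
    """
    result = extractor.extract(html)
    return {key: result.get(key) for key in ASSUMED_KEYS}


def summarize_with_gne(html: str) -> Dict[str, Any]:
    """Convenience wrapper; gne is imported lazily so this file loads without it."""
    from gne import GeneralNewsExtractor  # requires `pip install gne`

    return summarize(GeneralNewsExtractor(), html)
```

Using `.get()` rather than indexing keeps the sketch tolerant of a page where some field could not be extracted, or of a key name that differs from the assumption.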

Installation

pip install gne
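
To confirm that the install succeeded, one option is to ask the standard library which version of the distribution is present. This is a small sketch using `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "gne") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print("gne is not installed" if version is None else f"gne {version}")
```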

Documentation

https://generalnewsextractor.readthedocs.io/

Help make GNE better

https://github.com/kingname/GeneralNewsExtractor


Download files

Source Distribution

gne-0.3.1.tar.gz (30.7 kB)

Uploaded Source

File details

Details for the file gne-0.3.1.tar.gz.

File metadata

  • Download URL: gne-0.3.1.tar.gz
  • Upload date:
  • Size: 30.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.16

File hashes

Hashes for gne-0.3.1.tar.gz
Algorithm Hash digest
SHA256 47b32182d61ac3c038dba51b50c7177582bb630917217170cb1cbacdcf5836bc
MD5 ea27b652ed9c3ae6dc70ea94822a01de
BLAKE2b-256 eda6cb28c7319c6b95989d893a5841f64377c25c4604845d55809f6718ba770e
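
A downloaded copy of the archive can be checked against the SHA256 digest listed above. The following is a minimal sketch with the standard `hashlib` module; the function names and the local file path are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for gne-0.3.1.tar.gz in the file hashes above.
EXPECTED_SHA256 = "47b32182d61ac3c038dba51b50c7177582bb630917217170cb1cbacdcf5836bc"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("gne-0.3.1.tar.gz")` would return `True` only if a file at that hypothetical local path is byte-for-byte the published archive.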
