Common domestic and foreign news website data collection framework

Project description

INSTRUCTION

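This project distills the page conventions of common domestic (Chinese) and international news websites into a set of general-purpose parsing methods. They have worked well in development practice and are simple to use.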

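The following fields are parsed:

1. News list page

  • News URL
  • News title

2. News content extraction

  • Article title
  • Article publication time
  • Article content
  • Main article image
  • Article images
  • Article videos
  • Website name
  • Website logo
  • Website domain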
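
The field lists above can be pictured as two small record types. This is only a sketch for orientation: the class names `ArticleListItem` and `ArticleInfo` and every attribute name are hypothetical, and the actual structure returned by GNS may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ArticleListItem:
    """One entry from a news list page (hypothetical container)."""
    url: str    # news URL
    title: str  # news title


@dataclass
class ArticleInfo:
    """Fields extracted from an article page (hypothetical container)."""
    title: str                                       # article title
    publish_time: Optional[str] = None               # article publication time
    content: str = ""                                # article content
    top_image: Optional[str] = None                  # main article image
    images: List[str] = field(default_factory=list)  # article images
    videos: List[str] = field(default_factory=list)  # article videos
    site_name: Optional[str] = None                  # website name
    site_logo: Optional[str] = None                  # website logo
    domain: Optional[str] = None                     # website domain
```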

Demo

[Screenshots: img.png, img_1.png, img_3.png]

USAGE

First, install the project dependencies: pip install -r requirements.txt

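This project supports two modes of use:

  1. url mode: pass only the url. This requires playwright, plus a browser engine installed by running playwright install when prompted. The complete HTML is downloaded through the browser.
  2. html mode: pass both the url and the html. In this mode GNS makes no network requests; the url is used only to resolve the website logo and to build media file URLs.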
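
To illustrate why html mode still needs a url, here is a minimal standard-library sketch of resolving relative media paths against the page url. It is not GNS's implementation; the helper names `absolute_image_urls` and `site_domain` and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def absolute_image_urls(page_url, html):
    """Resolve each image src against the page URL (hypothetical helper)."""
    collector = ImageSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


def site_domain(page_url):
    """Return the host part of the page URL (hypothetical helper)."""
    return urlparse(page_url).netloc


# Made-up page URL and markup, used only for illustration.
page_url = "https://news.example.com/a/story.html"
sample_html = (
    '<article><img src="/media/a.jpg">'
    '<img src="https://cdn.example.com/b.png"></article>'
)

print(absolute_image_urls(page_url, sample_html))
print(site_domain(page_url))
```

Without the page url, a relative path such as /media/a.jpg could not be turned into a usable link, which is why the url is still passed alongside the html.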

If you only want to use html mode, you do not need to install playwright.
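
A caller that wants to support both modes might check up front whether playwright is importable. The sketch below shows one way to do that; `playwright_available` and `choose_mode` are hypothetical helpers, not part of GNS.

```python
import importlib.util


def playwright_available():
    """Return True if the playwright package can be imported."""
    return importlib.util.find_spec("playwright") is not None


def choose_mode(html=None):
    """Pick "html" when HTML is supplied, otherwise "url" if playwright is installed.

    Hypothetical helper: it only decides which arguments to pass on to GNS.
    """
    if html is not None:
        return "html"
    if playwright_available():
        return "url"
    raise RuntimeError(
        "No html was supplied and playwright is not installed; "
        "install playwright or pass html explicitly."
    )


print(choose_mode(html="<html></html>"))
```

Checking before the call lets a script fail with a clear message instead of an import error raised partway through a crawl.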

Parsing an article list page

from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional; pagination is optional
articles = GNS.article_list(url="https://www.voachinese.com/", html=_html, pagination=1)
print(articles)

Parsing an article detail page

from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional
article_info = GNS.article(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
print(article_info)

For questions, contact: jinchenghz@foxmail.com

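Disclaimer: This project is for learning and reference only. Do not use it for illegal purposes; you bear sole responsibility for any consequences of doing so.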

Download files

Source Distribution

generalnewsscraper-0.1.0.tar.gz (21.0 kB)

Uploaded Source

Built Distribution

generalnewsscraper-0.1.0-py3-none-any.whl (22.9 kB)

Uploaded Python 3

File details

Details for the file generalnewsscraper-0.1.0.tar.gz.

File metadata

  • Download URL: generalnewsscraper-0.1.0.tar.gz
  • Upload date:
  • Size: 21.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10

File hashes

Hashes for generalnewsscraper-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c4de3a77e784bcb7ae7cd48c4e5945f139563b9ba81d1c5e72314d90b3e52430
MD5 b5ce9bb2c0d8dd1198ac9f3ebf4cfc3b
BLAKE2b-256 a5deee75756e0c914d932d701df03d2479d6c0331c3f7834e6f336307db59083

File details

Details for the file generalnewsscraper-0.1.0-py3-none-any.whl.

File hashes

Hashes for generalnewsscraper-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 38150f1be11779021dd68fbc1a9e3ebd76f9c5039cc8f1be1a15ffbc4f6a49b9
MD5 eaff2bbcaa47143fa62425def9698363
BLAKE2b-256 6a5e6b91977a13e7d21a25a8ae182c4f929d2d903c156506591c3884dd262a78
