Common domestic and foreign news website data collection framework
Project description
INSTRUCTION
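This project distills the page-structure rules of common Chinese and international news websites into a set of general-purpose parsing methods. They have worked well in development practice and are simple to use.
The following fields are parsed:
1. News list page
- News URL
- News title
2. News content extraction
- Article title
- Article publication time
- Article content
- Main article image
- Article images
- Article videos
- Website name
- Website logo
- Website domain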
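The field list above could be modeled as a pair of typed records. The sketch below is purely illustrative: the class names and attribute names (`ListItem`, `ArticleInfo`, `publish_time`, `top_image`, and so on) are assumptions made up for this example, and the keys GNS actually returns may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ListItem:
    """One entry from a news list page (hypothetical shape)."""
    url: str
    title: str


@dataclass
class ArticleInfo:
    """Fields extracted from an article detail page (hypothetical shape)."""
    title: str
    publish_time: Optional[str] = None
    content: str = ""
    top_image: Optional[str] = None
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
    site_name: Optional[str] = None
    site_logo: Optional[str] = None
    domain: Optional[str] = None


# Fields that a page does not provide simply keep their defaults.
example = ArticleInfo(title="Example headline", domain="www.voachinese.com")
print(asdict(example))
```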
USAGE
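First install the project dependencies: pip install -r requirements.txt
This project supports two modes of use:
- url mode: pass only the url. This requires playwright to be installed, along with a browser engine installed via playwright install as prompted. The complete HTML is then downloaded through the browser.
- html mode: pass both the url and the html. In this mode GNS makes no network requests; the url is used only for the website logo and for joining media file URLs.
If you only want to use html mode, you do not need to install playwright.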
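To illustrate why html mode still needs a url, the sketch below resolves media paths against the page URL using the standard library's `urljoin`. It is not GNS's internal code; the helper names `absolutize` and `site_root` and the sample URL and paths are hypothetical.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical article URL, used only as the base for resolution.
page_url = "https://www.voachinese.com/a/example/123.html"


def absolutize(base_url: str, src: str) -> str:
    """Resolve a possibly relative media path against the page URL."""
    return urljoin(base_url, src)


def site_root(base_url: str) -> str:
    """Return scheme and host, e.g. as a base for locating a site logo."""
    parts = urlparse(base_url)
    return f"{parts.scheme}://{parts.netloc}"


print(absolutize(page_url, "/img/logo.png"))             # root-relative path
print(absolutize(page_url, "photo.jpg"))                 # path relative to the article
print(absolutize(page_url, "//cdn.example.com/v.mp4"))   # scheme-relative URL
print(site_root(page_url))
```

Without the page URL, relative `src` values found in the HTML could not be turned into downloadable links.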
Parsing an article list page
from GeneralNewsScraper import GNS
_html = """ example HTML """
# html is optional; pagination is optional
articles = GNS.article_list(url="https://www.voachinese.com/", html=_html, pagination=1)
print(articles)
Parsing an article detail page
from GeneralNewsScraper import GNS
_html = """ example HTML """
# html is optional
article_info = GNS.article(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
print(article_info)
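The two calls above can be chained: parse a list page, then parse each article it links to. The following is a minimal sketch under stated assumptions, not part of the library. The helper `scrape_site` is hypothetical, and it assumes each list entry is a dict with a "url" key, which the source does not confirm. The parsing functions are passed in as arguments so the logic can be tried with stand-ins.

```python
from typing import Any, Callable, Dict, Iterable, List


def scrape_site(
    list_url: str,
    list_fn: Callable[..., Iterable[Dict[str, Any]]],
    article_fn: Callable[..., Any],
    limit: int = 5,
) -> List[Any]:
    """Parse a list page, then parse at most `limit` of the linked articles."""
    results: List[Any] = []
    for item in list_fn(url=list_url):
        if len(results) >= limit:
            break
        # Assumption: the article link is stored under the "url" key.
        article_url = item.get("url")
        if not article_url:
            continue  # skip entries without a usable link
        results.append(article_fn(url=article_url))
    return results


# Stand-in functions so the sketch runs without network access or GNS.
def fake_list(url: str) -> List[Dict[str, Any]]:
    return [
        {"url": url + "a/1.html", "title": "One"},
        {"title": "entry without a link"},
        {"url": url + "a/2.html", "title": "Two"},
    ]


def fake_article(url: str) -> Dict[str, Any]:
    return {"url": url}


demo = scrape_site("https://example.com/", fake_list, fake_article)
print(demo)
```

If the list result does have this shape, the real functions could be passed in the same way, for example scrape_site("https://www.voachinese.com/", GNS.article_list, GNS.article). That call runs in url mode and therefore needs playwright.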
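For questions, contact: jinchenghz@foxmail.com
Disclaimer: this project is for learning and reference only. Do not use it for illegal purposes; you bear the consequences of any such use.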
Download files
Source Distribution
generalnewsscraper-0.1.0.tar.gz (21.0 kB)
Built Distribution
generalnewsscraper-0.1.0-py3-none-any.whl (22.9 kB)
File details
Details for the file generalnewsscraper-0.1.0.tar.gz.
File metadata
- Download URL: generalnewsscraper-0.1.0.tar.gz
- Upload date:
- Size: 21.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4de3a77e784bcb7ae7cd48c4e5945f139563b9ba81d1c5e72314d90b3e52430 |
| MD5 | b5ce9bb2c0d8dd1198ac9f3ebf4cfc3b |
| BLAKE2b-256 | a5deee75756e0c914d932d701df03d2479d6c0331c3f7834e6f336307db59083 |
File details
Details for the file generalnewsscraper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: generalnewsscraper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38150f1be11779021dd68fbc1a9e3ebd76f9c5039cc8f1be1a15ffbc4f6a49b9 |
| MD5 | eaff2bbcaa47143fa62425def9698363 |
| BLAKE2b-256 | 6a5e6b91977a13e7d21a25a8ae182c4f929d2d903c156506591c3884dd262a78 |