Common domestic and foreign news website data collection framework
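INTRODUCTION
The best article-body locator! Stars are welcome!
This project distills the page patterns of common Chinese and international news websites into a set of general-purpose parsing methods. They have worked well in development practice, are simple to use, and support an asynchronous mode for high-concurrency collection.
The following fields are parsed: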
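1. News list page
- News URL
- News title
2. News content extraction
- Article title
- Article publication time
- Article content
- Article main image
- Article images
- Article videos
- Website name
- Website logo
- Website domain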
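As a rough illustration of the field list above, the following sketch models the two result shapes as Python dataclasses. The class and attribute names (`ListItem`, `ArticleInfo`, `publish_time`, and so on) are assumptions made for this example. The README does not specify the actual keys or types that GNS returns, so inspect the printed output of a real call before relying on any name.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ListItem:
    """One entry from a news list page (hypothetical shape)."""
    url: str
    title: str


@dataclass
class ArticleInfo:
    """Fields extracted from an article page (hypothetical names)."""
    title: str
    publish_time: Optional[str] = None
    content: str = ""
    top_image: Optional[str] = None
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
    site_name: Optional[str] = None
    site_logo: Optional[str] = None
    domain: Optional[str] = None


# Fields that a page does not provide simply keep their defaults.
example = ArticleInfo(title="Sample headline", domain="www.voachinese.com")
print(asdict(example))
```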
USAGE
Install the project:
pip install GeneralNewsScraper
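The project can be used in two ways:
- url mode: pass only a url. This requires playwright, plus running playwright install as prompted to install a browser engine. The complete HTML is downloaded through the browser.
- html mode: pass both url and html. GNS then makes no network requests; the url is used only to resolve the website logo and to join media-file URLs.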
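In html mode the url still matters because links inside a page are often relative. Below is a minimal sketch of that kind of URL joining, using the standard library's `urllib.parse.urljoin`. It illustrates the general technique and is not taken from the GNS source; the helper name `absolutize`, the article URL, and the sample paths are made up for the example.

```python
from typing import List
from urllib.parse import urljoin


def absolutize(page_url: str, links: List[str]) -> List[str]:
    """Resolve each link against the page URL; absolute links pass through."""
    return [urljoin(page_url, link) for link in links]


page_url = "https://www.voachinese.com/a/example.html"  # hypothetical article URL
media = absolutize(page_url, [
    "/img/logo.png",                      # root-relative path
    "photos/1.jpg",                       # relative to the page's directory
    "https://cdn.example.com/video.mp4",  # already absolute
])
print(media)
```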
Parse an article list page (synchronous)
from GeneralNewsScraper import GNS

_html = """ sample HTML """
# html is optional; pagination is optional
articles = GNS.article_list(url="https://www.voachinese.com/", html=_html, pagination=1)
print(articles)
Parse an article list page (asynchronous)
import asyncio
from GeneralNewsScraper import GNS

async def run_article_list_async():
    _html = """ sample HTML """
    # html is optional; pagination is optional
    articles = await GNS.article_list_async(url="https://www.voachinese.com/", html=_html, pagination=1)
    print(articles)

asyncio.run(run_article_list_async())
Parse an article detail page (synchronous)
from GeneralNewsScraper import GNS

_html = """ sample HTML """
# html is optional
article_info = GNS.article(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
print(article_info)
Parse an article detail page (asynchronous)
import asyncio
from GeneralNewsScraper import GNS

async def run_article_async():
    _html = """ sample HTML """
    # html is optional
    article_info = await GNS.article_async(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
    print(article_info)

asyncio.run(run_article_async())
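Since the asynchronous mode is meant for high-concurrency collection, it can help to cap how many pages are in flight at once. The sketch below shows one common way to do that with `asyncio.Semaphore` and `asyncio.gather`. The helper `gather_bounded` and the stand-in coroutine `fake_article` are hypothetical names for this example; the stand-in only simulates latency, so the snippet runs without network access or GNS installed.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(
    urls: List[str],
    worker: Callable[[str], Awaitable[dict]],
    limit: int = 5,
) -> List[dict]:
    """Run worker(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> dict:
        async with semaphore:
            return await worker(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_article(url: str) -> dict:
    """Stand-in for a real parser call; it only simulates latency."""
    await asyncio.sleep(0.01)
    return {"url": url}


urls = [f"https://example.com/a/{i}.html" for i in range(10)]
results = asyncio.run(gather_bounded(urls, fake_article, limit=3))
print(len(results))
```

To try this against real pages, one could pass `lambda u: GNS.article_async(url=u)` as the worker, on the assumption (not confirmed by this README) that the call is safe to run concurrently.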
Parse the details of every article on a list page (synchronous)
from GeneralNewsScraper import GNS

_html = """ sample HTML """
# html is optional
article_info_list = GNS.article_parse_all(url="https://www.voachinese.com/", html=_html)
print(article_info_list)
Parse the details of every article on a list page (asynchronous)
import asyncio
from GeneralNewsScraper import GNS

async def run_article_parse_all_async():
    _html = """ sample HTML """
    # html is optional
    article_info_list = await GNS.article_parse_all_async(url="https://www.voachinese.com/", html=_html)
    print(article_info_list)

asyncio.run(run_article_parse_all_async())
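Demo
For questions, contact: jinchenghz@foxmail.com
Disclaimer: This project is for learning and reference only. Do not use it for illegal purposes; you bear the consequences of any such use.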
Download files
Source Distribution
generalnewsscraper-0.2.3.tar.gz (25.2 kB)
Built Distribution
generalnewsscraper-0.2.3-py3-none-any.whl (27.4 kB)
File details
Details for the file generalnewsscraper-0.2.3.tar.gz.
File metadata
- Download URL: generalnewsscraper-0.2.3.tar.gz
- Upload date:
- Size: 25.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb8f67da2991494188c7e1aa190d6e6318333fc27701467dbabe9f9354abd0b6 |
| MD5 | a6fb112a58eac972bbaf04ae9be93fd8 |
| BLAKE2b-256 | 475432d42f12a1c7766b010d15ec275014e11f08dbaa7f1993024c8f7aa92733 |
File details
Details for the file generalnewsscraper-0.2.3-py3-none-any.whl.
File metadata
- Download URL: generalnewsscraper-0.2.3-py3-none-any.whl
- Upload date:
- Size: 27.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d378c89d0de4e91c63378892828e1505b6973326fdb8f84beaaecd556e0a3cc |
| MD5 | 8a67668f4b3e263dae42b46e7bea8c0f |
| BLAKE2b-256 | 9eea54d64d19956fdc789cde0cf006c72f4322a2516cbfc473c0c7d611de68eb |