A general-purpose data collection framework for common domestic and international news websites
Project description
INSTRUCTION
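This project distills the page structure rules of common domestic and international news websites into a set of general-purpose parsing methods. They have worked well in development practice, are simple to use, and support an asynchronous mode for high-concurrency collection.
The following fields are parsed:
1. News list page
- News URL
- News title
2. News content extraction
- Article title
- Article publication time
- Article content
- Article main image
- Article images
- Article videos
- Website name
- Website logo
- Website domain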
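As a rough mental model, the article fields listed above can be pictured as a single record. The sketch below is only an illustration: the class name `ArticleRecord` and every attribute name are hypothetical and are not taken from GNS, whose actual output keys may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical container mirroring the article fields listed above.

    These names are assumptions for illustration, not the GNS output schema.
    """
    title: str
    publish_time: Optional[str] = None
    content: str = ""
    top_image: Optional[str] = None
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
    site_name: Optional[str] = None
    site_logo: Optional[str] = None
    domain: Optional[str] = None


# Build a record with only some fields known; the rest keep their defaults.
record = ArticleRecord(title="Example headline", domain="example.com")
record_dict = asdict(record)
print(record_dict)
```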
USAGE
Install the package:
pip install GeneralNewsScraper
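This project offers two usage modes:
- url mode: pass only the url. This requires playwright to be installed, along with a browser engine installed via playwright install as prompted. The complete html is downloaded through the browser.
- html mode: pass both the url and the html. In this mode GNS makes no network requests; the url is used only for the website logo and for joining media file URLs.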
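To make the role of the url in html mode more concrete, here is a minimal sketch of how relative media paths found in a page can be resolved against the page url using only the standard library. It illustrates the general technique and is not GNS's internal code; the helper names `resolve_media` and `site_domain` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin, urlparse

# Page url taken from the examples in this README.
page_url = "https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html"


def resolve_media(base_url, src):
    """Turn a (possibly relative) src attribute into an absolute URL."""
    return urljoin(base_url, src)


def site_domain(base_url):
    """Extract the host part of the page url."""
    return urlparse(base_url).netloc


# Hypothetical src values as they might appear in the html.
logo = resolve_media(page_url, "/img/logo.png")           # root-relative path
video = resolve_media(page_url, "//cdn.example.com/v.mp4")  # scheme-relative URL
absolute = resolve_media(page_url, "https://example.com/a.jpg")  # already absolute

print(logo, video, absolute, site_domain(page_url))
```

Without a url there is no base to resolve such paths against, which is one plausible reason the url is still required even when the html is supplied.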
Parse an article list page (synchronous)
from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional; pagination is optional
articles = GNS.article_list(url="https://www.voachinese.com/", html=_html, pagination=1)
print(articles)
Parse an article list page (asynchronous)
import asyncio
from GeneralNewsScraper import GNS

async def run_article_list_async():
    _html = """ sample html """
    # html is optional; pagination is optional
    articles = await GNS.article_list_async(url="https://www.voachinese.com/", html=_html, pagination=1)
    print(articles)

asyncio.run(run_article_list_async())
Parse an article detail page (synchronous)
from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional
article_info = GNS.article(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
print(article_info)
Parse an article detail page (asynchronous)
import asyncio
from GeneralNewsScraper import GNS

async def run_article_async():
    _html = """ sample html """
    # html is optional
    article_info = await GNS.article_async(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
    print(article_info)

asyncio.run(run_article_async())
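The asynchronous entry points are what make high-concurrency collection practical. The sketch below shows one common way to parse many article URLs at once while capping the number of requests in flight. It is self-contained: `fake_article_async` is a hypothetical stand-in for a real async parser, and `parse_many` and the example.com URLs are assumptions for illustration, not part of GNS.

```python
import asyncio


async def fake_article_async(url):
    """Stand-in for an async article parser; NOT the GNS API."""
    await asyncio.sleep(0.01)  # simulate network and parsing latency
    return {"url": url, "title": f"title for {url}"}


async def parse_many(urls, parse, max_concurrency=5):
    """Run `parse` over all urls with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        # Each task waits here until a slot is free.
        async with semaphore:
            return await parse(url)

    # gather returns results in the same order as the input urls.
    return await asyncio.gather(*(guarded(u) for u in urls))


urls = [f"https://example.com/news/{i}" for i in range(10)]
results = asyncio.run(parse_many(urls, fake_article_async, max_concurrency=3))
print(len(results), results[0])
```

To try this with GNS, one could pass a small coroutine that awaits `GNS.article_async(url=url)` in place of `fake_article_async`, assuming the call accepts a url alone as the examples above indicate.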
Parse the details of every article on a list page (synchronous)
from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional
article_info_list = GNS.article_parse_all(url="https://www.voachinese.com/", html=_html)
print(article_info_list)
Parse the details of every article on a list page (asynchronous)
import asyncio
from GeneralNewsScraper import GNS

async def run_article_parse_all_async():
    _html = """ sample html """
    # html is optional
    article_info_list = await GNS.article_parse_all_async(url="https://www.voachinese.com/", html=_html)
    print(article_info_list)

asyncio.run(run_article_parse_all_async())
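For questions, contact: jinchenghz@foxmail.com
Disclaimer: This project is for learning and reference only. Do not use it for illegal purposes; you bear the consequences if you do.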
Download files
Source Distribution
generalnewsscraper-0.2.1.tar.gz (22.1 kB)
Built Distribution
generalnewsscraper-0.2.1-py3-none-any.whl (24.5 kB)
File details
Details for the file generalnewsscraper-0.2.1.tar.gz.
File metadata
- Download URL: generalnewsscraper-0.2.1.tar.gz
- Upload date:
- Size: 22.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e324a016b6789e870a632d69e300f1b41c09cac5dd2595b88e958a8a26c4947f |
| MD5 | 90308dead8959ea4fa4a29c04aef04b8 |
| BLAKE2b-256 | 090e2746a179d4e974156f27befbf1b0a38e2e94a3f685fe8b1fd437be3be0b3 |
File details
Details for the file generalnewsscraper-0.2.1-py3-none-any.whl.
File metadata
- Download URL: generalnewsscraper-0.2.1-py3-none-any.whl
- Upload date:
- Size: 24.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 197d7987907d33dae24b803910732746e54f1a2e4c964e9217c9809ed7420deb |
| MD5 | 7c101409e8ae5e07aa1a6370596b077f |
| BLAKE2b-256 | cd9279d2c4bec768655b83306475b475e28037369ee80db02eaeeac8dfe4930b |