Common domestic and foreign news website data collection framework

Project description

INSTRUCTION

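This project distills the page conventions of common domestic and international news websites into a set of general-purpose parsing methods. They have worked well in development practice, are simple to use, and support an asynchronous mode for high-concurrency collection.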

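The following fields are parsed:

1. News list page

News URL
News title

2. News content extraction

Article title
Article publication time
Article content
Article main image
Article images
Article videos
Website name
Website logo
Website domain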
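To make the field list concrete, here is a minimal sketch of how the parsed fields could be modelled as typed records. The class names and attribute names (`ListEntry`, `ArticleFields`, `publish_time`, `top_image`, and so on) are hypothetical and chosen only for illustration; the actual keys and return types used by GNS are not documented here and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ListEntry:
    """One item from a news list page (hypothetical shape)."""
    url: str    # news URL
    title: str  # news title


@dataclass
class ArticleFields:
    """Fields extracted from an article page (hypothetical names)."""
    title: str                                        # article title
    publish_time: Optional[str] = None                # article publication time
    content: str = ""                                 # article content
    top_image: Optional[str] = None                   # article main image
    images: List[str] = field(default_factory=list)   # article images
    videos: List[str] = field(default_factory=list)   # article videos
    site_name: Optional[str] = None                   # website name
    site_logo: Optional[str] = None                   # website logo
    domain: Optional[str] = None                      # website domain
```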

USAGE

Install the project:

pip install GeneralNewsScraper

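The project supports two modes of use:

URL mode: pass only the url. This requires installing playwright and then running playwright install, as prompted, to install the browser engines. The complete HTML is downloaded through the browser.
HTML mode: pass both the url and the html. In this mode GNS makes no network requests; the url is used only to resolve the website logo and to join media file URLs.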
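The role of the url in HTML mode can be illustrated with the standard library alone. The sketch below shows how relative media paths found in a page are typically resolved against the page URL, and how a domain can be read from it. The helper names `resolve_media_url` and `site_domain` are hypothetical, and this is an assumption about the general technique, not a description of what GNS does internally.

```python
from urllib.parse import urljoin, urlparse


def resolve_media_url(page_url: str, src: str) -> str:
    """Resolve an image or video path from the HTML against the page URL."""
    return urljoin(page_url, src)


def site_domain(page_url: str) -> str:
    """Return the host part of the page URL."""
    return urlparse(page_url).netloc


page = "https://www.voachinese.com/a/example.html"
print(resolve_media_url(page, "/img/photo.jpg"))
print(resolve_media_url(page, "https://cdn.example.com/video.mp4"))
print(site_domain(page))
```

Root-relative paths are joined onto the scheme and host of the page URL, while sources that are already absolute pass through unchanged.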

Parse an article list page (synchronous)

from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional; pagination is optional
articles = GNS.article_list(url="https://www.voachinese.com/", html=_html, pagination=1)
print(articles)

Parse an article list page (asynchronous)

import asyncio
from GeneralNewsScraper import GNS

async def run_article_list_async():
    _html = """ sample html """
    # html is optional; pagination is optional
    articles = await GNS.article_list_async(url="https://www.voachinese.com/", html=_html, pagination=1)
    print(articles)

asyncio.run(run_article_list_async())

Parse an article detail page (synchronous)

from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional
article_info = GNS.article(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
print(article_info)

Parse an article detail page (asynchronous)

import asyncio
from GeneralNewsScraper import GNS

async def run_article_async():
    _html = """ sample html """
    # html is optional
    article_info = await GNS.article_async(url="https://www.voachinese.com/a/exiled-chinese-businessman-guo-s-trial-nears-close/7693596.html", html=_html)
    print(article_info)

asyncio.run(run_article_async())
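For high-concurrency collection, the asynchronous calls can be fanned out over many URLs while capping how many run at once. The following is a minimal sketch using only asyncio. `scrape_many`, `max_concurrency`, and `fake_parse` are hypothetical names introduced for illustration; `fake_parse` stands in for a real parser so the example runs without the package or network access.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def scrape_many(
    urls: List[str],
    parse: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 5,
) -> List[Any]:
    """Run parse(url) for every URL, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> Any:
        # The semaphore blocks here once max_concurrency workers are active.
        async with semaphore:
            return await parse(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


async def fake_parse(url: str) -> dict:
    """Stand-in parser (assumption): pretends to do I/O, then echoes the URL."""
    await asyncio.sleep(0.01)
    return {"url": url}


if __name__ == "__main__":
    pages = ["https://www.voachinese.com/a/1", "https://www.voachinese.com/a/2"]
    print(asyncio.run(scrape_many(pages, fake_parse, max_concurrency=2)))
```

With the package installed, `parse` could presumably be a small wrapper such as `lambda u: GNS.article_async(url=u)`, based on the call shown above.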

Parse the details of every article on a list page (synchronous)

from GeneralNewsScraper import GNS

_html = """ sample html """
# html is optional
article_info_list = GNS.article_parse_all(url="https://www.voachinese.com/", html=_html)
print(article_info_list)
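Conceptually, parsing every article on a list page combines the two earlier steps: extract the list entries, then parse each entry's detail page. The sketch below shows that composition with injected functions so it runs without the package. It is an assumption about the general pattern, not a description of how `article_parse_all` is implemented; `parse_all`, `list_fn`, `article_fn`, and the `"url"` key on each list entry are all hypothetical.

```python
from typing import Any, Callable, Dict, List


def parse_all(
    list_url: str,
    list_fn: Callable[[str], List[Dict[str, str]]],
    article_fn: Callable[[str], Any],
) -> List[Any]:
    """Parse a list page, then parse each article it links to."""
    results = []
    for entry in list_fn(list_url):
        try:
            results.append(article_fn(entry["url"]))
        except Exception as exc:
            # One failing article should not abort the whole batch.
            print(f"skipped {entry['url']}: {exc}")
    return results


# Stand-in functions (assumptions) that mimic a list parser and a detail parser.
def demo_list(url: str) -> List[Dict[str, str]]:
    return [{"url": url + "a/1", "title": "first"}, {"url": url + "a/2", "title": "second"}]


def demo_article(url: str) -> Dict[str, str]:
    return {"url": url, "title": "parsed " + url.rsplit("/", 1)[-1]}


print(parse_all("https://www.voachinese.com/", demo_list, demo_article))
```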

Parse the details of every article on a list page (asynchronous)

import asyncio
from GeneralNewsScraper import GNS


async def run_article_parse_all_async():
    _html = """ sample html """
    # html is optional
    article_info_list = await GNS.article_parse_all_async(url="https://www.voachinese.com/", html=_html)
    print(article_info_list)

# Run the coroutine from module level, not from inside the function itself
asyncio.run(run_article_parse_all_async())

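Demo

For questions, contact: jinchenghz@foxmail.com

Disclaimer: This project is for learning and reference only. Do not use it for illegal purposes; you bear the consequences if you do.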

Project details

Release history

0.2.4 (Feb 24, 2025)
0.2.3 (Feb 24, 2025)
0.2.2 (Feb 8, 2025), this version
0.2.1 (Nov 14, 2024)
0.2.0 (Nov 11, 2024)
0.1.0 (Aug 14, 2024)

Download files

Source Distribution

generalnewsscraper-0.2.2.tar.gz (25.0 kB)

Uploaded Feb 8, 2025 Source

Built Distribution

generalnewsscraper-0.2.2-py3-none-any.whl (27.1 kB)

Uploaded Feb 8, 2025 Python 3

File details

Details for the file generalnewsscraper-0.2.2.tar.gz.

File metadata

Download URL: generalnewsscraper-0.2.2.tar.gz
Upload date: Feb 8, 2025
Size: 25.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10

File hashes

Hashes for generalnewsscraper-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`8fc7e5784ec2813e2ee90a622015f9f1cde210073de830129e5b973a853b2370`
MD5	`9e0a1f1f09eb938065046114946b7707`
BLAKE2b-256	`83c1c051a0c7b35d5ae566b9d8afa836e5dd4551f32d96be2866513cc51efe17`
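A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; the function names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local path in the usage comment is an assumption about where the file was saved.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the sdist was saved to the current directory):
# matches_published_hash(
#     "generalnewsscraper-0.2.2.tar.gz",
#     "8fc7e5784ec2813e2ee90a622015f9f1cde210073de830129e5b973a853b2370",
# )
```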

File details

Details for the file generalnewsscraper-0.2.2-py3-none-any.whl.

File metadata

Download URL: generalnewsscraper-0.2.2-py3-none-any.whl
Upload date: Feb 8, 2025
Size: 27.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10

File hashes

Hashes for generalnewsscraper-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a490077a3b12a756dbeebfc9eb9cdc6b4795f93b696f47fca1c723cab5c037db`
MD5	`6628e67b4eef965c1c647bfa8695f6cb`
BLAKE2b-256	`2262d34529dce63ba7a2cd8f82b96631725cbbb7210440ebfeea3896be28f0a6`
