An RSS news crawler that automatically fetches and stores the latest news from RSS feeds
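RSS News Crawler
Overview
This tool automatically fetches RSS news feeds and stores the results in a SQLite database. It supports:
- Parallel fetching of multiple RSS feeds
- Content deduplication and compressed storage
- Filtering to news published on the current day
- Logging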
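
To make the parallel-fetch and same-day-filter features more concrete, here is a minimal sketch of how they could work. It is not the package's actual implementation: `fetch_all`, `only_today`, and the caller-supplied `fetch_one` function are hypothetical names, and the use of a thread pool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, datetime


def fetch_all(feed_urls, fetch_one, max_workers=8):
    """Run fetch_one(url) for every feed concurrently; return {url: result}.

    fetch_one is a caller-supplied (hypothetical) function that downloads and
    parses a single feed. A feed that raises maps to None, so one bad source
    does not abort the whole run.
    """
    feed_urls = list(feed_urls)

    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with feed_urls.
        results = list(pool.map(safe_fetch, feed_urls))
    return dict(zip(feed_urls, results))


def only_today(items, today=None):
    """Keep items whose 'publish_time' (a datetime) falls on the given day."""
    today = today or date.today()
    return [item for item in items if item["publish_time"].date() == today]
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular RSS parsing library.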
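Installation
```bash
pip install rss-news-crawler
```
## Usage
```python
from rss_news_crawler import NewsCrawler

# Create the crawler object
crawler = NewsCrawler(
    db_name='news.db',               # path to the SQLite database
    log_file='news.log',             # path to the log file
    rss_feeds_file='rss_feeds.txt',  # path to the RSS feed list; default feeds are used if the file is missing or empty
)

# Fetch the RSS feeds and store the news
crawler.fetch_and_store_news()
```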
rss_feeds.txt file format
One RSS feed URL per line, for example:
https://www.example.com/rss.xml
https://www.example.com/rss2.xml
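
A feed file in this format could be read along the following lines. This is only a sketch of the documented behavior (one URL per line, fall back to defaults when the file is missing or empty); `load_feeds` and `DEFAULT_FEEDS` are hypothetical names, and the placeholder default below is not the package's real default list.

```python
from pathlib import Path

# Placeholder default; the package's real default feed list is not documented here.
DEFAULT_FEEDS = ["https://www.example.com/rss.xml"]


def load_feeds(path, defaults=DEFAULT_FEEDS):
    """Return the feed URLs listed in path, one per line.

    Blank lines are skipped. If the file is missing or yields no URLs,
    a copy of the defaults is returned instead.
    """
    file = Path(path)
    if not file.is_file():
        return list(defaults)
    lines = file.read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    return urls or list(defaults)
```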
Database schema
id INTEGER PRIMARY KEY AUTOINCREMENT,
publish_time DATETIME NOT NULL,
crawl_time DATETIME NOT NULL,
title TEXT NOT NULL,
content BLOB NOT NULL,
url TEXT NOT NULL UNIQUE
)
The content column stores the news text after deduplication and compression; compression is performed by feed_handler.compress_content().
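
As an illustration of how compressed content and URL-based deduplication might fit this schema, here is a sketch using only the standard library. Several things are assumptions: the table name `news` is hypothetical (the page does not show it), `zlib` stands in for whatever `feed_handler.compress_content()` actually uses, and `store` and `read_content` are illustrative helpers, not part of the package.

```python
import sqlite3
import zlib
from datetime import datetime

# The column definitions follow the schema above; the table name is hypothetical.
SCHEMA = """
CREATE TABLE IF NOT EXISTS news (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    publish_time DATETIME NOT NULL,
    crawl_time DATETIME NOT NULL,
    title TEXT NOT NULL,
    content BLOB NOT NULL,
    url TEXT NOT NULL UNIQUE
)
"""


def store(conn, publish_time, title, text, url):
    """Insert one article; return True if stored, False if the URL already exists."""
    blob = zlib.compress(text.encode("utf-8"))  # zlib is an assumption
    cur = conn.execute(
        "INSERT OR IGNORE INTO news (publish_time, crawl_time, title, content, url) "
        "VALUES (?, ?, ?, ?, ?)",
        (publish_time.isoformat(), datetime.now().isoformat(), title, blob, url),
    )
    conn.commit()
    # The UNIQUE constraint on url makes a duplicate insert a no-op (rowcount 0).
    return cur.rowcount == 1


def read_content(conn, url):
    """Return the decompressed article text for url, or None if absent."""
    row = conn.execute("SELECT content FROM news WHERE url = ?", (url,)).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

With a connection from `sqlite3.connect('news.db')` and `conn.execute(SCHEMA)` run once, `store` can be called for each fetched item and `read_content` returns the original text.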
Example
from rss_news_crawler import NewsCrawler

crawler = NewsCrawler(
    db_name='news.db',
    log_file='news.log',
    rss_feeds_file='rss_feeds.txt',
)
crawler.fetch_and_store_news()
File details
Details for the file rss-news-crawler-0.1.2.tar.gz.
File metadata
- Download URL: rss-news-crawler-0.1.2.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92e4e0fc9f3000990ea72cd613b523b9088eac1352aa58b6188574a86604de66 |
| MD5 | 7e7672167927dafc2b8f75a39f80990d |
| BLAKE2b-256 | 09ab3f6b4dd47bf26026eac9a87e21c03c2e3062f37265dd58ec740384aa0aa4 |
File details
Details for the file rss_news_crawler-0.1.2-py3-none-any.whl.
File metadata
- Download URL: rss_news_crawler-0.1.2-py3-none-any.whl
- Upload date:
- Size: 6.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ecc6903f33372e3512560c9e93ebc3f57ae5f6a8c9868a0a03485a3e73f3f53c |
| MD5 | c4b04a6b433f57a9dc8ab68d9da1bd01 |
| BLAKE2b-256 | 4ae0cac364d5ab33b87018bf4265cf3cdbbc26de8dba584a169b1ebc27286920 |