An RSS news crawler that automatically fetches and stores the latest news from RSS feeds

Project description

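RSS News Crawler

Overview

This tool automatically fetches news from RSS feeds and stores it in a SQLite database. It supports:

  • Parallel fetching of multiple RSS feeds
  • Content deduplication and compressed storage
  • Filtering for news published on the current day
  • Logging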
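
The parallel fetching and same-day filtering listed above could look roughly like the following. This is a minimal sketch, not the package's actual implementation: `fetch_all`, `only_today`, the injected `fetch_one` callable, and the `publish_time` dictionary key are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, datetime


def fetch_all(urls, fetch_one, max_workers=8):
    """Call fetch_one(url) for every URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    it raised, so one failing feed does not abort the others.
    """
    results = {}
    if not urls:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going when a single feed fails
                results[url] = exc
    return results


def only_today(items, today=None):
    """Keep only items whose publish_time (a datetime) falls on `today`."""
    today = today or date.today()
    return [item for item in items if item["publish_time"].date() == today]
```

Passing the fetch function in as an argument keeps the sketch independent of any particular HTTP or feed-parsing library.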

Installation

    pip install rss-news-crawler

Usage

    from rss_news_crawler import NewsCrawler

    # Create the crawler
    crawler = NewsCrawler(
        db_name='news.db',               # path to the SQLite database
        log_file='news.log',             # path to the log file
        rss_feeds_file='rss_feeds.txt',  # path to the RSS feed list; default feeds are used if the file is missing or empty
    )

    # Fetch the RSS feeds and store the news
    crawler.fetch_and_store_news()

Format of rss_feeds.txt

One RSS feed URL per line, for example:

https://www.example.com/rss.xml
https://www.example.com/rss2.xml
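
A feed file in this format, with the fallback to default feeds when the file is missing or empty, could be read along these lines. This is only a sketch: `load_feeds` and `DEFAULT_FEEDS` are hypothetical names, and the placeholder URL stands in for whatever defaults the package actually ships.

```python
from pathlib import Path

# Placeholder default; the package's real default feeds are not documented here.
DEFAULT_FEEDS = ["https://www.example.com/rss.xml"]


def load_feeds(path, defaults=DEFAULT_FEEDS):
    """Return the feed URLs listed one per line in `path`.

    Falls back to `defaults` when the file does not exist or contains
    no non-blank lines.
    """
    file = Path(path)
    if not file.is_file():
        return list(defaults)
    lines = file.read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    return urls or list(defaults)
```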

Database table schema

    id INTEGER PRIMARY KEY AUTOINCREMENT,
    publish_time DATETIME NOT NULL,
    crawl_time DATETIME NOT NULL,
    title TEXT NOT NULL,
    content BLOB NOT NULL,
    url TEXT NOT NULL UNIQUE
)
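
To illustrate how the `UNIQUE` constraint on `url` can serve as a deduplication key, here is a small sketch using Python's standard `sqlite3` module. The table name `news` and the `store` helper are assumptions made for this example; the source does not show the table name or the package's own insert logic.

```python
import sqlite3
from datetime import datetime

# Table name "news" is an assumption; the columns follow the schema above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS news (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    publish_time DATETIME NOT NULL,
    crawl_time DATETIME NOT NULL,
    title TEXT NOT NULL,
    content BLOB NOT NULL,
    url TEXT NOT NULL UNIQUE
)
"""


def store(conn, publish_time, title, content, url):
    """Insert one article; return True if stored, False if the URL already exists."""
    crawl_time = datetime.now().isoformat(sep=" ", timespec="seconds")
    cur = conn.execute(
        "INSERT OR IGNORE INTO news "
        "(publish_time, crawl_time, title, content, url) VALUES (?, ?, ?, ?, ?)",
        (publish_time.isoformat(sep=" "), crawl_time, title, content, url),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the row to be ignored.
    return cur.rowcount == 1
```

With `INSERT OR IGNORE`, re-crawling a feed that still lists an already stored article leaves the existing row untouched instead of raising an error.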

The content column stores the news content after deduplication and compression; compression is done with feed_handler.compress_content().
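
The algorithm behind `feed_handler.compress_content()` is not described here. As an illustration of storing text in a BLOB column, the sketch below assumes zlib compression of UTF-8 text; `compress_text` and `decompress_text` are hypothetical helpers, and the package may use a different scheme.

```python
import zlib


def compress_text(text: str) -> bytes:
    """Encode text as UTF-8 and compress it with zlib (assumed scheme)."""
    return zlib.compress(text.encode("utf-8"))


def decompress_text(blob: bytes) -> str:
    """Reverse compress_text: decompress and decode back to a string."""
    return zlib.decompress(blob).decode("utf-8")
```

If the package does use a different format, rows it wrote should be read back with its own helpers rather than with this sketch.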

Example

    from rss_news_crawler import NewsCrawler

    crawler = NewsCrawler(
        db_name='news.db',
        log_file='news.log',
        rss_feeds_file='rss_feeds.txt',
    )

    crawler.fetch_and_store_news()
