Dark Keeper is an open-source, simple web parser for podcast sites
Dark Keeper
Dark Keeper is an open-source, simple web parser for podcast sites. You can also use it for other sites.
Goal: parse the full information for each podcast episode, such as its number, description, and download link.
Features
- simple web spider that walks the site
- cache for all downloaded pages
- parse any information from pages
- export parsed data to MongoDB
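The first two features can be illustrated with a minimal sketch of a spider that walks links breadth-first and caches every downloaded page. This is not Dark Keeper's own code: `FAKE_SITE`, `fetch`, `crawl`, and `downloads` are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (links found on the page, page text).
FAKE_SITE = {
    "/": (["/page/2", "/about"], "episode 1"),
    "/page/2": (["/", "/page/3"], "episode 2"),
    "/page/3": ([], "episode 3"),
    "/about": (["/"], "about us"),
}

downloads = []  # records every simulated download, to show the cache working


def fetch(url, cache):
    """Return the page for url, downloading only on a cache miss."""
    if url not in cache:
        downloads.append(url)
        cache[url] = FAKE_SITE[url]  # stands in for a real HTTP request
    return cache[url]


def crawl(start_url, cache):
    """Walk the site breadth-first, visiting each URL exactly once."""
    queue = deque([start_url])
    seen = {start_url}
    data = []
    while queue:
        url = queue.popleft()
        links, text = fetch(url, cache)
        data.append({"url": url, "text": text})
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return data


cache = {}
records = crawl("/", cache)
print([record["url"] for record in records])
```

Running `crawl` a second time with the same `cache` visits the same pages without adding anything to `downloads`, which is the point of caching downloaded pages.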
Quick start
$ mkvirtualenv keeper
(keeper)$ pip install dark-keeper
(keeper)$ cat app.py
from dark_keeper import BaseParser, DarkKeeper
from dark_keeper.exports import ExportMongo
from dark_keeper.http import HttpClient
from dark_keeper.storages import UrlsStorage, DataStorage


class PodcastParser(BaseParser):
    def parse_urls(self, content):
        urls = content.parse_urls('.posts-list > .container-fluid .text-left a')

        return urls

    def parse_data(self, content):
        data = []
        for post_item in content.get_block_items('.posts-list .posts-list-item'):
            post_data = dict(
                title=post_item.parse_text('.number-title'),
                desc=post_item.parse_text('.post-podcast-content'),
                mp3=post_item.parse_attr('.post-podcast-content audio', 'src'),
            )

            if post_data['title'] and post_data['mp3']:
                data.append(post_data)

        return data


if __name__ == '__main__':
    pk = DarkKeeper(
        http_client=HttpClient(
            delay=2,
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.125',
        ),
        parser=PodcastParser(),
        urls_storage=UrlsStorage(base_url='https://radio-t.com/'),
        data_storage=DataStorage(),
        export_mongo=ExportMongo(mongo_uri='mongodb://localhost/podcasts.radio-t.com'),
    )
    pk.run()
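The `mongo_uri` above carries a dotted path, `podcasts.radio-t.com`. How `ExportMongo` interprets it is not stated here. One common convention, assumed in this sketch, is to split the path at the first dot into a database name and a collection name. The helper `split_mongo_target` is hypothetical and uses only the standard library.

```python
from urllib.parse import urlsplit


def split_mongo_target(mongo_uri):
    """Split the URI path at the first dot into (database, collection).

    Assumption: this mirrors a common MongoDB URI convention; it is not
    taken from Dark Keeper's source.
    """
    path = urlsplit(mongo_uri).path.lstrip("/")
    database, _, collection = path.partition(".")
    return database, collection or None


print(split_mongo_target("mongodb://localhost/podcasts.radio-t.com"))
```

Under that assumption the episodes would land in the `radio-t.com` collection of the `podcasts` database.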