A political news crawler for 34 Taiwanese stream media.
Project description
title: 'Taiwan_News_Crawler' disqus: hackmd
GitHub https://github.com/milkpool/Taiwan_News_Crawler
pypl https://pypi.org/project/Taiwan_News_Crawler/
Introduction
This open-source library is a political news crawler for 34 Taiwanese mainstream media. The crawlered media is listed below.
Media Type | Meida Name (CN) | Media Name (EN) | ID | Abbreviation |
---|---|---|---|---|
Print Media | 自由時報 | Liberty News | 0 | ltn |
Print Media | 蘋果日報 | Apple Daily | 1 | appledaily |
Print Media | 聯合報 | UDN News | 2 | udn |
Print Media | 中國時報 | China Times | 3 | chinatimes |
Broadcast Media | TVBS | TVBS | 4 | tvbs |
Broadcast Media | ETtoday | ETtoday | 5 | ettoday |
Broadcast Media | 台視 | TTV | 6 | ttv |
Broadcast Media | 中視 | CTV | 7 | ctv |
Broadcast Media | 華視 | CTS | 8 | cts |
Broadcast Media | 民視 | FTV News | 9 | ftv |
Broadcast Media | 公視 | PTS | 10 | pts |
Broadcast Media | 三立新聞 | STEN | 11 | sten |
Broadcast Media | 中天新聞 | CTITV | 12 | ctitv |
Broadcast Media | 年代新聞 | ERA News | 13 | era |
Broadcast Media | 非凡新聞 | USTV | 14 | ustv |
Electronic Media | 中央通訊社 | CNA | 15 | cna |
Electronic Media | 關鍵評論網 | The News Lens | 16 | thenewslens |
Electronic Media | 民報 | People News | 17 | peoplenews |
Electronic Media | 上報 | Up Media | 18 | upmedia |
Electronic Media | 大紀元 | Epoch Times | 19 | epochtimes |
Electronic Media | 信傳媒 | CM Media | 20 | cmmedia |
Electronic Media | 匯流新聞網 | CNEWS | 21 | cnews |
Electronic Media | 新頭殼 | Newtalk | 22 | newtalk |
Electronic Media | 風傳媒 | Storm Media | 23 | storm |
Electronic Media | 今日新聞 | NOW News | 24 | nownews |
Electronic Media | 鏡週刊 | Mirror Media | 25 | mirrormedia |
Electronic Media | 台灣好新聞 | Taiwan Hot | 26 | taiwanhot |
Electronic Media | 中央廣播電台 | RTI News | 27 | rti |
Electronic Media | 世界日報 | World Journal | 28 | worldjournal |
Electronic Media | 風向新聞 | Kairos News | 29 | kairos |
Electronic Media | 民眾日報 | Mypeople News | 30 | mypeople |
Electronic Media | 芋傳媒 | Taro News | 31 | taronews |
News Website | Pchome新聞 | Pchome News | 32 | pchome |
News Website | Yahoo!奇摩新聞 | YAHOO! News | 33 | yahoo |
Installation
1. Install the library package with pip.
pip install Taiwan_News_Crawler
2. Download the webdriver for Chrome on the official website.
Usage
1. Build a crawler with the webdriver path inputted.
import Taiwan_News_Crawler
## Build news crawler
webdriver_path = 'WEBDRIVER_PATH'
mycrawler = NewsCrawler.crawler(webdriver_path)
2. Crawl political news of certain media.
The crawler_news
crawls latest news with specified media id or name.
There are two parameters to input:
- media: int/str, the media id or name needed to be crawlered
- amount: int, the amount of crawlering news. Default number is 20.
## Crawl new with media id
news_id_0 = mycrawler.crawler_news(media=0)
## Crawl new with media name
news_ltn = mycrawler.crawler_news(media='ltn', amount=10)
news_udn = mycrawler.crawler_news(media="聯合報", amount=50)
3. Crawl political news with news url.
The crawler_by_url
identifies the news media with url and gets the information.
The url parameter is a list of string. Url with different media is acceptable.
news = mycrawler.crawler_by_url(url=['NEWS_URL_1', 'NEWS_URL_2'])
4. Print the crawled news.
The output of our crawler is in json format. There are fields of the output, which are shown below:
- title: str, the news title
- url: str, the news full url
- author: list, the news author. More than one author is possible. Shown as an empty list if it's not avaiable.
- time: list, the published time.
- time (str): the complete published time. ex. "2020/01/10 13:17"
- time_year (str): the published year. ex. "2020"
- time_month (str): the published month. ex. "01"
- time_day (str): the published day. ex. "10"
- time_hour_min (str): the published timing. ex. "13:17"
- context: str, the news article.
- tag: list, the tags of the news. Empty list for not avaiable. More than one tag is possible.
- related_news: list, the related or recommended news the media provides.
- related_title: str, the related news' title.
- related_url: str, the related news' url link.
- source_img: list, the pictures in the report.
- img_title: str, the related news' title. None for not avaliable.
- img_url: str, the related news' url link.
- sourcce_video: list, the video in the report.
- video_title: str, thr video title. None for not avaliable.
- video_url: str, the video url link.
news_ltn = mycrawler.crawler_news(media='CTS', amount=10)
# print the first news information
news_no_1 = news_ltn[0]
for key, value in new_no_1.items():
print(key)
print(value)
print()
title
'菲律賓禁台令 總統:若政治考量請三思'
url
'https://news.cts.com.tw/cts/politics/202002/202002141990582.html'
author
[u'陳詩雅', u'李鴻杰']
time
[{'time_day': '14', 'time_hour_min': '19:39', 'time_year': '2020', 'time_month': '02', 'time': '2020/02/14 19:39'}]
context
'華視新聞 陳詩雅 李鴻杰 台南報導 / 台南市面對武漢肺炎疫情,總統蔡英文、行政院長蘇貞昌和副院長陳其邁,今(14)日分頭前進工廠視察。總統下午就南下視察防疫用酒精生產。面對菲律賓發出禁台令,總統表示,若是因為政治考量,要求菲律賓三思,台灣不能容忍這樣的事情,也必然會有相應的處理,最新消息,菲律賓已經撤回對台灣的禁令。75度防疫酒精台酒一天就可以產20萬瓶,面對武漢肺癌疫情,總統蔡英文再度前進工廠生產線,這次看的是酒精,成為70年來,第一位造訪台南隆田酒廠的現任總統,目前酒精產量穩定,可以支撐現階段需求,不過防疫戰火延燒到菲律賓,對於菲律賓在10日無預警禁止台灣人入境,總統說如果是政治考量菲律賓要三思。總統 蔡英文:「如果是基於政治考量的話,我們就要請他們三思,因為我們不能夠容忍這樣的事情,也必然會有一些相應的處理,」在總統受訪之後,菲律賓在晚間取消對台禁令,記者 vs. 總統 蔡英文:「台灣最紅的小孩子就是小明。」總統一聽到小明忍不住笑了,但是又立刻收起微笑,因為被問到對於無我國國籍中配子女禁止入境,馬前總統PO文說不要讓歧視凌駕人道,總統 蔡英文:「這沒有歧視的問題,只有疫情處理跟疫情掌控,跟保護我們國人的健康,是最重要的原則,我是覺得既然已經做過總統,應該知道說在做一個相關的決策,現在所最重要的還是以疫情的掌控為最優先」,另外還有國人滯留湖北,總統則再次強調,弱勢優先檢疫優先兩大原則,會持續溝通。'
tag
[u'撤回禁令', u'菲律賓', u'蔡英文', u'酒精', u'武漢肺炎']
related_news
[{u'好消息! 傳菲將解除對台灣旅行禁令': 'https://news.cts.com.tw/cts/politics/202002/202002141990583.html'}, {u'菲律賓發布禁台令 對移工衝擊大': 'https://news.cts.com.tw/cts/international/202002/202002131990465.html'}, {u'菲律賓禁台入境 我方擬祭出反制': 'https://news.cts.com.tw/cts/international/202002/202002131990381.html'}, {u'菲內閣重議禁台措施 我研擬反擊': 'https://news.cts.com.tw/cts/international/202002/202002131990380.html'}, {u'出國怕被誤認 「來自台灣」小物熱賣': 'https://news.cts.com.tw/cts/general/202002/202002121990318.html'}, {u'菲律賓突發禁台令 台灣恐爆缺工潮': 'https://news.cts.com.tw/cts/society/202002/202002121990317.html'}, {u'菲禁令滯留長灘島 部落客:如歷險記': 'https://news.cts.com.tw/cts/society/202002/202002121990316.html'}, {u'菲禁台團入境 當地旅遊業損失逾50萬': 'https://news.cts.com.tw/cts/society/202002/202002121990257.html'}, {u'禁台入境轉折 外交部:菲國內部不同調': 'https://news.cts.com.tw/cts/general/202002/202002111990194.html'}, {u'菲禁台客入境 旅客到機場無法登機': 'https://news.cts.com.tw/cts/international/202002/202002111990193.html'}, {u'菲律賓移工無法入台 勞動部祭出因應措施': 'https://news.cts.com.tw/cts/life/202002/202002111990179.html'}, {u'菲遵循「一個中國」 國民黨:政府應強硬展立場並反制': 'https://news.cts.com.tw/cts/politics/202002/202002111990154.html'}]
source_img
[{None: 'https://news.cts.com.tw/photo/cts/202002/202002141990582_l.jpg'}]
source_video
[{None: 'https://www.youtube.com/embed/6cV1YNTOjyI?rel=0&playsinline=1'}]
media
'華視'
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file Taiwan_News_Crawler-2020.2.15.tar.gz
.
File metadata
- Download URL: Taiwan_News_Crawler-2020.2.15.tar.gz
- Upload date:
- Size: 22.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f44795575518bc9e8dbadea2103546311f155ee216d749f83b38a1beda47dc50 |
|
MD5 | a9e6ea0af438fc92ba738a6f572bb043 |
|
BLAKE2b-256 | 3815ef975d7df2ad8c8752dfcaa173efab971d9ca157c45ec4c73c14c4c099f8 |
File details
Details for the file Taiwan_News_Crawler-2020.2.15-py2.py3-none-any.whl
.
File metadata
- Download URL: Taiwan_News_Crawler-2020.2.15-py2.py3-none-any.whl
- Upload date:
- Size: 88.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8f6221089ac8c72dda251cc246820f463a2a0fd0b7867df23b6647282176bbc7 |
|
MD5 | 09c7ff99ec112479f99c064d138f65c9 |
|
BLAKE2b-256 | bdb0bfea18e4a57abe3cd2c998460f77548f58fc80cc4bbd8c26f842f1f72300 |