Skip to main content

A political news crawler for 34 Taiwanese stream media.

Project description


title: 'Taiwan_News_Crawler' disqus: hackmd

GitHub https://github.com/milkpool/Taiwan_News_Crawler

info python python

pypl https://pypi.org/project/Taiwan_News_Crawler/

info

Introduction

This open-source library is a political news crawler for 34 Taiwanese mainstream media. The crawlered media is listed below.

Media Type Meida Name (CN) Media Name (EN) ID Abbreviation
Print Media 自由時報 Liberty News 0 ltn
Print Media 蘋果日報 Apple Daily 1 appledaily
Print Media 聯合報 UDN News 2 udn
Print Media 中國時報 China Times 3 chinatimes
Broadcast Media TVBS TVBS 4 tvbs
Broadcast Media ETtoday ETtoday 5 ettoday
Broadcast Media 台視 TTV 6 ttv
Broadcast Media 中視 CTV 7 ctv
Broadcast Media 華視 CTS 8 cts
Broadcast Media 民視 FTV News 9 ftv
Broadcast Media 公視 PTS 10 pts
Broadcast Media 三立新聞 STEN 11 sten
Broadcast Media 中天新聞 CTITV 12 ctitv
Broadcast Media 年代新聞 ERA News 13 era
Broadcast Media 非凡新聞 USTV 14 ustv
Electronic Media 中央通訊社 CNA 15 cna
Electronic Media 關鍵評論網 The News Lens 16 thenewslens
Electronic Media 民報 People News 17 peoplenews
Electronic Media 上報 Up Media 18 upmedia
Electronic Media 大紀元 Epoch Times 19 epochtimes
Electronic Media 信傳媒 CM Media 20 cmmedia
Electronic Media 匯流新聞網 CNEWS 21 cnews
Electronic Media 新頭殼 Newtalk 22 newtalk
Electronic Media 風傳媒 Storm Media 23 storm
Electronic Media 今日新聞 NOW News 24 nownews
Electronic Media 鏡週刊 Mirror Media 25 mirrormedia
Electronic Media 台灣好新聞 Taiwan Hot 26 taiwanhot
Electronic Media 中央廣播電台 RTI News 27 rti
Electronic Media 世界日報 World Journal 28 worldjournal
Electronic Media 風向新聞 Kairos News 29 kairos
Electronic Media 民眾日報 Mypeople News 30 mypeople
Electronic Media 芋傳媒 Taro News 31 taronews
News Website Pchome新聞 Pchome News 32 pchome
News Website Yahoo!奇摩新聞 YAHOO! News 33 yahoo

Installation

1. Install the library package with pip.

pip install Taiwan_News_Crawler

2. Download the webdriver for Chrome on the official website.

Usage

1. Build a crawler with the webdriver path inputted.

import Taiwan_News_Crawler

## Build news crawler
webdriver_path = 'WEBDRIVER_PATH'
mycrawler = NewsCrawler.crawler(webdriver_path)

2. Crawl political news of certain media.

The crawler_news crawls latest news with specified media id or name. There are two parameters to input:

  • media: int/str, the media id or name needed to be crawlered
  • amount: int, the amount of crawlering news. Default number is 20.
## Crawl new with media id
news_id_0 = mycrawler.crawler_news(media=0)

## Crawl new with media name
news_ltn = mycrawler.crawler_news(media='ltn', amount=10)
news_udn = mycrawler.crawler_news(media="聯合報", amount=50)

3. Crawl political news with news url.

The crawler_by_url identifies the news media with url and gets the information. The url parameter is a list of string. Url with different media is acceptable.

news = mycrawler.crawler_by_url(url=['NEWS_URL_1', 'NEWS_URL_2'])

4. Print the crawled news.

The output of our crawler is in json format. There are fields of the output, which are shown below:

  • title: str, the news title
  • url: str, the news full url
  • author: list, the news author. More than one author is possible. Shown as an empty list if it's not avaiable.
  • time: list, the published time.
    • time (str): the complete published time. ex. "2020/01/10 13:17"
    • time_year (str): the published year. ex. "2020"
    • time_month (str): the published month. ex. "01"
    • time_day (str): the published day. ex. "10"
    • time_hour_min (str): the published timing. ex. "13:17"
  • context: str, the news article.
  • tag: list, the tags of the news. Empty list for not avaiable. More than one tag is possible.
  • related_news: list, the related or recommended news the media provides.
    • related_title: str, the related news' title.
    • related_url: str, the related news' url link.
  • source_img: list, the pictures in the report.
    • img_title: str, the related news' title. None for not avaliable.
    • img_url: str, the related news' url link.
  • sourcce_video: list, the video in the report.
    • video_title: str, thr video title. None for not avaliable.
    • video_url: str, the video url link.
news_ltn = mycrawler.crawler_news(media='CTS', amount=10)
# print the first news information
news_no_1 = news_ltn[0]
for key, value in new_no_1.items():
    print(key)
    print(value)
    print()
title
'菲律賓禁台令 總統:若政治考量請三思'

url
'https://news.cts.com.tw/cts/politics/202002/202002141990582.html'


author
[u'陳詩雅', u'李鴻杰']

time
[{'time_day': '14', 'time_hour_min': '19:39', 'time_year': '2020', 'time_month': '02', 'time': '2020/02/14 19:39'}]

context
'華視新聞 陳詩雅 李鴻杰 台南報導  / 台南市面對武漢肺炎疫情,總統蔡英文、行政院長蘇貞昌和副院長陳其邁,今(14)日分頭前進工廠視察。總統下午就南下視察防疫用酒精生產。面對菲律賓發出禁台令,總統表示,若是因為政治考量,要求菲律賓三思,台灣不能容忍這樣的事情,也必然會有相應的處理,最新消息,菲律賓已經撤回對台灣的禁令。75度防疫酒精台酒一天就可以產20萬瓶,面對武漢肺癌疫情,總統蔡英文再度前進工廠生產線,這次看的是酒精,成為70年來,第一位造訪台南隆田酒廠的現任總統,目前酒精產量穩定,可以支撐現階段需求,不過防疫戰火延燒到菲律賓,對於菲律賓在10日無預警禁止台灣人入境,總統說如果是政治考量菲律賓要三思。總統 蔡英文:「如果是基於政治考量的話,我們就要請他們三思,因為我們不能夠容忍這樣的事情,也必然會有一些相應的處理,」在總統受訪之後,菲律賓在晚間取消對台禁令,記者 vs. 總統 蔡英文:「台灣最紅的小孩子就是小明。」總統一聽到小明忍不住笑了,但是又立刻收起微笑,因為被問到對於無我國國籍中配子女禁止入境,馬前總統PO文說不要讓歧視凌駕人道,總統 蔡英文:「這沒有歧視的問題,只有疫情處理跟疫情掌控,跟保護我們國人的健康,是最重要的原則,我是覺得既然已經做過總統,應該知道說在做一個相關的決策,現在所最重要的還是以疫情的掌控為最優先」,另外還有國人滯留湖北,總統則再次強調,弱勢優先檢疫優先兩大原則,會持續溝通。'

tag
[u'撤回禁令', u'菲律賓', u'蔡英文', u'酒精', u'武漢肺炎']

related_news
[{u'好消息! 傳菲將解除對台灣旅行禁令': 'https://news.cts.com.tw/cts/politics/202002/202002141990583.html'}, {u'菲律賓發布禁台令 對移工衝擊大': 'https://news.cts.com.tw/cts/international/202002/202002131990465.html'}, {u'菲律賓禁台入境 我方擬祭出反制': 'https://news.cts.com.tw/cts/international/202002/202002131990381.html'}, {u'菲內閣重議禁台措施 我研擬反擊': 'https://news.cts.com.tw/cts/international/202002/202002131990380.html'}, {u'出國怕被誤認 「來自台灣」小物熱賣': 'https://news.cts.com.tw/cts/general/202002/202002121990318.html'}, {u'菲律賓突發禁台令 台灣恐爆缺工潮': 'https://news.cts.com.tw/cts/society/202002/202002121990317.html'}, {u'菲禁令滯留長灘島 部落客:如歷險記': 'https://news.cts.com.tw/cts/society/202002/202002121990316.html'}, {u'菲禁台團入境 當地旅遊業損失逾50萬': 'https://news.cts.com.tw/cts/society/202002/202002121990257.html'}, {u'禁台入境轉折 外交部:菲國內部不同調': 'https://news.cts.com.tw/cts/general/202002/202002111990194.html'}, {u'菲禁台客入境 旅客到機場無法登機': 'https://news.cts.com.tw/cts/international/202002/202002111990193.html'}, {u'菲律賓移工無法入台 勞動部祭出因應措施': 'https://news.cts.com.tw/cts/life/202002/202002111990179.html'}, {u'菲遵循「一個中國」 國民黨:政府應強硬展立場並反制': 'https://news.cts.com.tw/cts/politics/202002/202002111990154.html'}]

source_img
[{None: 'https://news.cts.com.tw/photo/cts/202002/202002141990582_l.jpg'}]

source_video
[{None: 'https://www.youtube.com/embed/6cV1YNTOjyI?rel=0&playsinline=1'}]

media
'華視'

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Taiwan_News_Crawler-2020.2.15.tar.gz (22.6 kB view details)

Uploaded Source

Built Distribution

Taiwan_News_Crawler-2020.2.15-py2.py3-none-any.whl (88.2 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file Taiwan_News_Crawler-2020.2.15.tar.gz.

File metadata

  • Download URL: Taiwan_News_Crawler-2020.2.15.tar.gz
  • Upload date:
  • Size: 22.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.7.4

File hashes

Hashes for Taiwan_News_Crawler-2020.2.15.tar.gz
Algorithm Hash digest
SHA256 f44795575518bc9e8dbadea2103546311f155ee216d749f83b38a1beda47dc50
MD5 a9e6ea0af438fc92ba738a6f572bb043
BLAKE2b-256 3815ef975d7df2ad8c8752dfcaa173efab971d9ca157c45ec4c73c14c4c099f8

See more details on using hashes here.

File details

Details for the file Taiwan_News_Crawler-2020.2.15-py2.py3-none-any.whl.

File metadata

  • Download URL: Taiwan_News_Crawler-2020.2.15-py2.py3-none-any.whl
  • Upload date:
  • Size: 88.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.7.4

File hashes

Hashes for Taiwan_News_Crawler-2020.2.15-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 8f6221089ac8c72dda251cc246820f463a2a0fd0b7867df23b6647282176bbc7
MD5 09c7ff99ec112479f99c064d138f65c9
BLAKE2b-256 bdb0bfea18e4a57abe3cd2c998460f77548f58fc80cc4bbd8c26f842f1f72300

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page