Python Library for Crawling Major Korean Daily Newspapers and Providing a Synonym Dictionary
Project description
Korean_News_Crawler
This is a Python library for crawling articles from major Korean daily newspaper sites and providing a synonym dictionary. It is a beta version that has not yet been officially registered on PyPI.
The copyright of each article belongs to the original media company. We take no legal responsibility for your use of the articles, and by using this library you are assumed to have agreed to this.
This is an open-source project, and we are always looking for contributors and collaborators. Please feel free to contact us.
Supported News Sites
- 조선일보(Chosun Ilbo)
- 동아일보(Dong-a Ilbo)
- 한국일보(Hankook Ilbo)
- 한겨레(Hankyeoreh)
- 중앙일보(JoongAng Ilbo)
- 국민일보(Kukmin Ilbo)
- 경향신문(Kyunghyang Shinmun)
- 문화일보(Munhwa Ilbo)
- 내일신문(Naeil News)
- 세계일보(Segye Ilbo)
- 서울신문(Seoul Shinmun)
Contributors
Indigo_Coder
Installation
```
pip install korean_news_crawler
```
BeautifulSoup, Selenium, and Requests are required.
Quick Usage
```python
from korean_news_crawler import Chosun

chosun = Chosun()

# Crawl a single article
print(chosun.dynamic_crawl("https://www.chosun.com/..."))

# Crawl a list of Chosun Ilbo article URLs
chosun_url_list = []  # fill with article URLs
print(chosun.dynamic_crawl(chosun_url_list))
```
API
- Chosun()
- Donga()
- Hankook()
- Hankyoreh()
- Joongang()
- Kukmin()
- Kyunghyang()
- Munhwa()
- Naeil()
- Segye()
- Seoul()
korean_news_crawler.Chosun(delay_time=None, saving_html=False)
Crawler for Chosun Ilbo articles.
Parameters
Parameters | Type | Description |
---|---|---|
delay_time | float or tuple | Optional, defaults to None. A float adds a fixed delay (in seconds) between requests; a tuple (min, max) adds a random delay drawn from that range. |
saving_html | bool | Optional, defaults to False. If False, the URL is requested on every call. If True, the HTML is saved on the first request and reused on later calls, which reduces load on the news site's server. |
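The package's internals are not shown here, but the two options can be understood with a small sketch of how a crawler typically applies them (the names `apply_delay` and `CachedFetcher` are illustrative helpers, not part of the package's API):

```python
import random
import time

def apply_delay(delay_time):
    """Sleep according to a delay_time setting: None, a float, or a (min, max) tuple."""
    if delay_time is None:
        return 0.0                          # no delay configured
    if isinstance(delay_time, tuple):
        wait = random.uniform(*delay_time)  # random delay within [min, max]
    else:
        wait = float(delay_time)            # fixed delay
    time.sleep(wait)
    return wait

class CachedFetcher:
    """Mimics saving_html=True: fetch each URL once, then serve the saved copy."""
    def __init__(self, fetch):
        self.fetch = fetch   # function mapping url -> html
        self.cache = {}

    def get(self, url):
        if url not in self.cache:
            self.cache[url] = self.fetch(url)  # first request: hit the server
        return self.cache[url]                 # later requests: reuse saved HTML
```

With `saving_html=True`-style caching, repeated calls for the same URL hit the server only once; with a tuple `delay_time` such as `(1.0, 3.0)`, each request waits a different random interval, which looks less like automated traffic.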
Attributes
Attributes | Type | Description |
---|---|---|
delay_time | float or tuple | The configured request delay. |
saving_html | bool | Whether fetched HTML is saved and reused. |
Methods
Methods | Description |
---|---|
dynamic_crawl(url) | Returns article text, rendering the page with Selenium. |
static_crawl(url) | Returns article text, parsing the static HTML with BeautifulSoup. |
dynamic_crawl(url)
- Returns article text, rendering the page with Selenium.
Parameters | Type | Description |
---|---|---|
url | str or list | When url is a str, only that URL is crawled. When url is a list, each URL in the list is crawled in turn. |
Returns Type | Description |
---|---|
list | A list of article texts, one entry per URL. |
static_crawl(url)
- Returns article text, parsing the static HTML with BeautifulSoup.
Parameters | Type | Description |
---|---|---|
url | str or list | When url is a str, only that URL is crawled. When url is a list, each URL in the list is crawled in turn. |
Returns Type | Description |
---|---|
list | A list of article texts, one entry per URL. |
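Since both methods accept either a single URL string or a list of URLs and always return a list, the str-or-list handling can be sketched as follows (these helper names are illustrative, not the package's actual code):

```python
def normalize_urls(url):
    """Accept a single URL string or a list of URLs; always return a list."""
    if isinstance(url, str):
        return [url]
    return list(url)

def crawl_all(url, crawl_one):
    """Apply a per-URL crawl function and collect the results in a list."""
    return [crawl_one(u) for u in normalize_urls(url)]
```

This is why passing a single URL still yields a one-element list: callers can always iterate over the return value without checking its type.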
The remaining classes have exactly the same constructor parameters (delay_time=None, saving_html=False), attributes (delay_time, saving_html), and methods (dynamic_crawl(url), static_crawl(url)) as korean_news_crawler.Chosun() above; they differ only in the newspaper they crawl.
Class | Newspaper |
---|---|
korean_news_crawler.Donga(delay_time=None, saving_html=False) | Dong-a Ilbo |
korean_news_crawler.Hankook(delay_time=None, saving_html=False) | Hankook Ilbo |
korean_news_crawler.Hankyoreh(delay_time=None, saving_html=False) | Hankyoreh |
korean_news_crawler.Joongang(delay_time=None, saving_html=False) | JoongAng Ilbo |
korean_news_crawler.Kukmin(delay_time=None, saving_html=False) | Kukmin Ilbo |
korean_news_crawler.Kyunghyang(delay_time=None, saving_html=False) | Kyunghyang Shinmun |
korean_news_crawler.Munhwa(delay_time=None, saving_html=False) | Munhwa Ilbo |
korean_news_crawler.Naeil(delay_time=None, saving_html=False) | Naeil News |
korean_news_crawler.Segye(delay_time=None, saving_html=False) | Segye Ilbo |
korean_news_crawler.Seoul(delay_time=None, saving_html=False) | Seoul Shinmun |