vnnews crawler
A Python package that helps crawl updates from top Vietnamese news providers.
II. REFERENCES
2.1. How to use this package?
- You can install the latest `vnnews` crawler version from source with the following command: `pip install git+https://github.com/thinh-vu/vnnews.git@main`
  (*) You might need to insert a `!` before your command when running terminal commands on Google Colab.
- To start using functions, you need to import them:
from vnnews import *
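A wildcard import pulls every public name into your namespace. As a minimal sketch, assuming `url_extract` and `fix_url` are exposed at the top level of `vnnews` (the wildcard import suggests this, but it is not stated explicitly here), you could import only the names you need and degrade gracefully when the package is missing:

```python
import importlib.util

# Check whether the package is installed before importing from it.
VNNEWS_INSTALLED = importlib.util.find_spec("vnnews") is not None

if VNNEWS_INSTALLED:
    # Assumption: both functions are importable from the package root.
    from vnnews import url_extract, fix_url
else:
    url_extract = fix_url = None
    print("vnnews is not installed; run the pip command above first.")
```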
2.2. List of popular online news sources for investors
- VN Express
- Tuổi trẻ Online
- CafeF
- Cafebiz
- Kinh tế Sài Gòn Online
- VN Economy
- Pháp Luật Tp.HCM
- Đầu tư Online
- Nhịp cầu đầu tư
- Diễn đàn doanh nghiệp
- Diễn đàn kinh tế Việt Nam - Vietnamnet
- Forbes Việt Nam
- Vietstock
- Tin nhanh chứng khoán
- Cafe Land
- Kenh14
- Dân trí
- Thanh niên
- Vietnamnet
- Nhân dân điện tử
- Lao động
- Đời sống & pháp luật
2.3. Function references
- `url_extract(url, key, tag_class='', type='link', bs_on=True, user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101')`
  - Purpose: Extract article info from a news source, using BeautifulSoup to pull data from an HTML/XML web page.
  - Arguments:
    - url (:obj:`str`, required): URL of the target news source. E.g. 'https://cafef.vn/'
    - key (:obj:`str`, required): HTML tag that contains the information you want to extract. E.g. 'h3', 'article', 'div'
    - tag_class (:obj:`str`, optional): The HTML class attribute of the target tag, which specifies one or more class names for an element. E.g. 'pdate' in the tag `<span class="pdate">19-11-2022 - 15:32 PM</span>` on CafeF. Defaults to ''.
    - type (:obj:`str`, optional): 'link' by default, to extract only the article links from a news homepage. Use a blank value '' when extracting article details on the article page.
    - bs_on (:obj:`bool`, optional): True by default. Pass a blank value '' if an issue is raised with the default.
    - user_agent (:obj:`str`, optional): A default desktop value is provided. You can find more user agent values here: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
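To make the roles of `key` and `tag_class` concrete, here is a minimal standard-library sketch that collects links nested inside a chosen tag. It is not the package's implementation (which relies on BeautifulSoup); `LinkCollector`, `extract_links`, and the sample HTML with its paths are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags nested inside a given tag (and class)."""

    def __init__(self, key, tag_class=""):
        super().__init__()
        self.key = key
        self.wanted = set(tag_class.split())  # empty set matches any class
        self.depth = 0  # > 0 while inside a matching `key` element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.key:
            classes = set((attrs.get("class") or "").split())
            # Nested `key` elements inside a match are counted as well,
            # so that their closing tags keep the depth balanced.
            if self.depth > 0 or self.wanted <= classes:
                self.depth += 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self.key and self.depth > 0:
            self.depth -= 1


def extract_links(html, key, tag_class=""):
    """Hypothetical helper: return hrefs found inside `key` elements."""
    parser = LinkCollector(key, tag_class)
    parser.feed(html)
    return parser.links


# Made-up sample page: two headline links and one unrelated link.
SAMPLE_HTML = """
<h3 class="title"><a href="/bai-viet-1.chn">Article 1</a></h3>
<h3 class="title"><a href="/bai-viet-2.chn">Article 2</a></h3>
<p><a href="/about">About</a></p>
"""

print(extract_links(SAMPLE_HTML, key="h3"))
# ['/bai-viet-1.chn', '/bai-viet-2.chn']
```

The link inside `<p>` is skipped because it does not sit inside the requested `h3` tag, which is the same reason the choice of `key` matters for each news site.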
- `fix_url(host, url)`
  - Purpose: Fix an article url that is missing its host by adding the host name to it.
  - Arguments:
    - host (:obj:`str`, required): the host name of the news source. E.g. 'https://vneconomy.vn'
    - url (:obj:`str`, required): the url string of the target news source. This might not contain the host at the beginning. E.g. '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
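The idea of completing a host-less url can be sketched with the standard library. This is an illustration only, assuming the goal is to prepend the host to relative urls; `fix_url_sketch` is a hypothetical stand-in and may behave differently from the package's `fix_url`.

```python
from urllib.parse import urljoin


def fix_url_sketch(host, url):
    """Hypothetical stand-in: return an absolute url for `url`."""
    # urljoin resolves relative urls against the host and leaves
    # urls that are already absolute untouched.
    return urljoin(host, url)


relative = "/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm"
print(fix_url_sketch("https://vneconomy.vn", relative))
# https://vneconomy.vn/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm
```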
2.4. Let's get our hands dirty
- VN Express
- Get the list of article urls:
url_extract('https://vnexpress.net/kinh-doanh', key='h3')
- Extract article details:
url_extract('https://vnexpress.net/thuong-mai-va-dau-tu-ben-vung-se-giup-apec-ung-pho-nguy-co-suy-thoai-4538015.html', key='span', tag_class='date', type='')
- Tuổi trẻ Online
- Get the list of article urls:
url_extract('https://tuoitre.vn/phap-luat.htm', key='h3')
- Extract article details:
url_extract('https://tuoitre.vn/gap-thu-tuong-xuc-dong-chuyen-co-giao-mam-non-miet-mai-lam-thien-nguyen-cho-vung-xa-20221119175021292.htm', key='div', tag_class='date-time', type='')
- CafeF
- Get the list of article urls:
url_extract('https://cafef.vn/bat-dong-san.chn', key='h3', type='link')
- Extract article details:
url_extract('https://cafef.vn/dau-se-la-phan-khuc-bds-giu-duoc-nhiet-trong-thoi-gian-toi-2022111913083069.chn', key='span', tag_class='pdate', type='')
- Cafebiz
- Get the list of article urls:
url_extract('https://cafebiz.vn/vi-mo.chn', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://cafebiz.vn/tai-sao-nha-o-my-la-tai-san-con-o-nhat-ban-thi-lai-chang-khac-gi-hang-tieu-dung-176221119095831295.chn', key='span', tag_class='time', type='')
- Kinh tế Sài Gòn Online
- Get the list of article urls:
url_extract('https://thesaigontimes.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://thesaigontimes.vn/kinh-te-tuan-hoan-mo-ra-nhung-mo-hinh-kinh-doanh-moi/', key='time', tag_class='', type='')
- VN Economy
- Get the list of article urls:
url_extract('https://vneconomy.vn/', key='h3', type='link', bs_on=False)
- Extract article details:
url_extract('https://vneconomy.vn/xuat-khau-det-may-van-tu-tin-voi-muc-tieu-42-ty-usd.htm', key='div', tag_class='detail__meta', type='')
- Pháp Luật Tp.HCM
- Get the list of article urls:
url_extract('https://m.plo.vn/phap-luat/', key='h3', type='link')[0][1]
- Extract article details:
test = url_extract('https://plo.vn/dieu-tra-trung-tam-dang-kiem-cap-so-song-sinh-cho-xe-tai-post705918.html', key='time', tag_class='', type='')
- Đầu tư Online
- Get the list of article urls:
url_extract('https://baodautu.vn/', key='article', type='link', bs_on='')
- Extract article details:
url_extract('https://baodautu.vn/nguoi-dan-rong-ra-cau-cuu-khi-nao-co-so-do-tu-du-an-cua-cong-ty-bach-dat-an-d177946.html', key='span', tag_class='post-time', type='')
- Nhịp cầu đầu tư
- Get the list of article urls:
url_extract('https://m.nhipcaudautu.vn/kinh-doanh/', key='article', type='link', bs_on='', user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)')
- Extract article details:
url_extract('https://m.nhipcaudautu.vn/ti-le-don-bay-tai-chinh-toan-thi-truong-giam-dan-tu-quy-i-3348999/', key='span', tag_class='date-post', type='')
- Diễn đàn doanh nghiệp
- Get the list of article urls:
url_extract('https://diendandoanhnghiep.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://diendandoanhnghiep.vn/https-diendandoanhnghiep-vn-dien-mat-troi-mai-nha-can-hoan-thien-co-che-ho-tro-doanh-nghiep-phat-trien-225626-html-e313.html', key='span', tag_class='created_time', type='')
- Diễn đàn kinh tế Việt Nam - Vietnamnet
- Get the list of article urls:
url_extract('https://vef.vn/diem-nong/', key='article', type='link', bs_on='')
- Extract article details: ``
- Forbes Việt Nam
- Get the list of article urls:
url_extract('https://forbes.vn', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://forbes.vn/m-village-cua-nguyen-hai-ninh-xay-lang-trong-pho/', key='div', tag_class='forbes-single__heading-time', type='')
- Vietstock
- Get the list of article urls:
url_extract('https://vietstock.vn/', key='h4', type='link', bs_on='')
- Extract article details:
url_extract('https://vietstock.vn/2022/11/thieu-hut-iphone-14-nguoi-dung-viet-lua-chon-iphone-doi-cu-4264-1017483.htm', key='span', tag_class='date', type='')
- Tin nhanh chứng khoán
- Get the list of article urls (this call currently doesn't work):
url_extract('https://m.tinnhanhchungkhoan.vn/', key='h2', type='link', bs_on='')
- Extract article details:
url_extract('https://www.tinnhanhchungkhoan.vn/big-trends-sau-con-mua-troi-lai-sang-post310328.html', key='time', tag_class='', type='')
- Cafe Land
- Get the list of article urls:
url_extract('https://cafeland.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://cafeland.vn/phan-tich/bien-doi-khi-hau-dang-leo-thang-nhung-doanh-nghiep-chu-yeu-doi-pho-114941.html', key='div', tag_class='info-date right', type='')
- Kenh14
- Get the list of article urls:
url_extract('https://m.kenh14.vn/doi-song.chn', key='h3', type='link')
- Extract article details:
url_extract('https://m.kenh14.vn/phia-sau-nhung-gen-z-okela-co-luc-that-bai-co-luc-khong-on-lam-nhung-chua-bao-gio-ngung-no-luc-20221119153833146.chn', key='span', tag_class='kbwcm-time', type='')
- Dân trí
- Get the list of article urls:
url_extract('https://dantri.com.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://dantri.com.vn/the-gioi/moscow-cao-buoc-ukraine-kich-dong-xung-dot-quan-su-nga-nato-20221119145209276.htm', key='time', tag_class='author-time', type='')
- Thanh niên
- Get the list of article urls: ``
- Extract article details: ``
- Vietnamnet
- Get the list of article urls: ``
- Extract article details: ``
- Nhân dân điện tử
- Get the list of article urls: ``
- Extract article details: ``
- Lao động
- Get the list of article urls: ``
- Extract article details: ``
- Đời sống & pháp luật
- Get the list of article urls: ``
- Extract article details: ``
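Since each source needs its own combination of `key`, `type`, and `bs_on`, it can help to keep those settings in one place. The sketch below copies a few list-page settings from the examples above into a dictionary; `LIST_PAGES` and `list_page_kwargs` are hypothetical names, not part of the package.

```python
# List-page settings copied from the examples above (a subset).
LIST_PAGES = {
    "VN Express": {"url": "https://vnexpress.net/kinh-doanh", "key": "h3"},
    "Tuổi trẻ Online": {"url": "https://tuoitre.vn/phap-luat.htm", "key": "h3"},
    "CafeF": {"url": "https://cafef.vn/bat-dong-san.chn", "key": "h3",
              "type": "link"},
    "Cafebiz": {"url": "https://cafebiz.vn/vi-mo.chn", "key": "h3",
                "type": "link", "bs_on": ""},
    "Vietstock": {"url": "https://vietstock.vn/", "key": "h4",
                  "type": "link", "bs_on": ""},
}


def list_page_kwargs(source):
    """Return a copy of the keyword arguments recorded for one source."""
    try:
        return dict(LIST_PAGES[source])
    except KeyError:
        raise ValueError(
            f"No list-page settings recorded for {source!r}"
        ) from None


kwargs = list_page_kwargs("Vietstock")
print(kwargs["key"])  # h4
# With the package installed, the call would be: url_extract(**kwargs)
```

Keeping the settings as plain data means a site redesign only requires editing one dictionary entry rather than every call site.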
III. APPENDICES
- Demo video: How to select the key
- Explore User Agents by Operating System: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
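A user agent is simply a request header that tells the site which browser and device is asking for the page, which is why mobile sites may need a mobile value. The sketch below builds (without sending) a request that carries such a header using the standard library; how the package sends its own requests is not documented here, so `build_request` is a hypothetical helper for illustration.

```python
from urllib.request import Request

# Both strings appear in the function reference and examples above.
DESKTOP_UA = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101"
MOBILE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)"


def build_request(url, user_agent=DESKTOP_UA):
    """Create (but do not send) a request carrying the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://m.nhipcaudautu.vn/kinh-doanh/", MOBILE_UA)
# urllib stores header names capitalized as 'User-agent'.
print(req.get_header("User-agent"))
# Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)
```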
IV. 🙋♂️ CONTACT INFORMATION
You can contact me at one of my social network profiles:
If you want to support my open-source projects, you can "buy me a coffee" via Patreon or Momo e-wallet (VN). Your support helps cover my blog hosting fees and lets me develop high-quality content.