vnnews crawler

A Python package that helps crawl updates from top Vietnamese news providers.



II. REFERENCES

2.1. How to use this package?

  • You can install the latest vnnews crawler version from source with the following command: pip install git+https://github.com/thinh-vu/vnnews.git@main

(*) You might need to insert a ! before your command when running terminal commands on Google Colab.

  • To start using functions, you need to import them: from vnnews import *
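If you prefer not to use a star import, the two functions documented in this README can be imported by name. The snippet below is a minimal sketch: the `HAVE_VNNEWS` flag is a hypothetical name used only for this illustration, and the `try`/`except` is there so the snippet still runs when the package has not been installed yet.

```python
# Import only the two functions documented in this README.
# HAVE_VNNEWS is a hypothetical flag for this example, not part of the package.
try:
    from vnnews import url_extract, fix_url
    HAVE_VNNEWS = True
except ImportError:
    HAVE_VNNEWS = False

if HAVE_VNNEWS:
    print("vnnews is installed; url_extract and fix_url are ready to use.")
else:
    print("vnnews is not installed; run the pip command above first.")
```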

2.2. List of popular online news sources for investors

  1. VN Express
  2. Tuổi trẻ Online
  3. CafeF
  4. Cafebiz
  5. Kinh tế Sài Gòn Online
  6. VN Economy
  7. Pháp Luật Tp.HCM
  8. Đầu tư Online
  9. Nhịp cầu đầu tư
  10. Diễn đàn doanh nghiệp
  11. Diễn đàn kinh tế Việt Nam - Vietnamnet
  12. Forbes Việt Nam
  13. Vietstock
  14. Tin nhanh chứng khoán
  15. Cafe Land
  16. Kenh14
  17. Dân trí
  18. Thanh niên
  19. Vietnamnet
  20. Nhân dân điện tử
  21. Lao động
  22. Đời sống & pháp luật

2.3. Function references

  • url_extract (url, key, tag_class='', type='link', bs_on=True, user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101')

    • Purpose: Extract article info from a news source, using BeautifulSoup to pull data from an HTML/XML web page.
    • Arguments:
      • url (:obj:str, required): URL of the target news source. E.g. 'https://cafef.vn/'
      • key (:obj:str, required): HTML tag that contains the information you want to extract. E.g. 'h3', 'article', 'div'
      • tag_class (:obj:str, optional): Value of the HTML class attribute on the target element; '' by default. E.g. 'pdate' in the tag <span class="pdate">19-11-2022 - 15:32 PM</span> on CafeF.
      • type (:obj:str, optional): 'link' by default, which extracts only the article links from a news homepage. Use a blank value '' when extracting article details from an article page.
      • bs_on (:obj:bool, optional): True by default. Pass a blank value '' (or False) if a call with the default setting fails.
      • user_agent (:obj:str, optional): A default desktop value is provided. You can find more user agent values here: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
  • fix_url(host, url)

    • Purpose: Complete an article url that lacks the host at the beginning by adding the host to it.
    • Arguments:
      • host (:obj:str, required): the host name of the news source. E.g. 'https://vneconomy.vn'
      • url (:obj:str, required): the url string of the target article. This might not contain the host at the beginning. E.g. '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
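To make the key / tag_class idea and the host-plus-url idea more concrete, here is a minimal sketch that uses only the Python standard library. It is not the package's implementation (the package relies on BeautifulSoup); `TagTextCollector`, `extract_text`, and `join_host` are hypothetical names invented for this illustration, and the sample HTML is made up to mirror the CafeF 'pdate' example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TagTextCollector(HTMLParser):
    """Collect the text of each <key> element, optionally filtered by class."""

    def __init__(self, key, tag_class=""):
        super().__init__()
        self.key = key
        self.wanted = set(tag_class.split())  # empty set means "any class"
        self.depth = 0       # > 0 while inside a matching element
        self.buffer = []     # text pieces of the element being read
        self.results = []    # finished text, one entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag != self.key:
            return
        if self.depth:       # nested element with the same tag name
            self.depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.key and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_text(html, key, tag_class=""):
    """Hypothetical helper: text of every <key> element with the given class."""
    parser = TagTextCollector(key, tag_class)
    parser.feed(html)
    parser.close()
    return parser.results


def join_host(host, url):
    """Hypothetical stand-in for the idea behind fix_url: add the host if missing."""
    return urljoin(host, url)


sample = (
    '<div><span class="pdate">19-11-2022 - 15:32 PM</span>'
    '<span class="author">x</span></div>'
)
print(extract_text(sample, key="span", tag_class="pdate"))
print(join_host("https://vneconomy.vn",
                "/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm"))
```

The sketch only shows how a tag name plus a class value narrows the selection, and how a relative link becomes a full URL. The real functions may behave differently in their details.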

2.4. Let's get our hands dirty

  1. VN Express
    • Get the list of article urls: url_extract('https://vnexpress.net/kinh-doanh', key='h3')
    • Extract article details: url_extract('https://vnexpress.net/thuong-mai-va-dau-tu-ben-vung-se-giup-apec-ung-pho-nguy-co-suy-thoai-4538015.html', key='span', tag_class='date', type='')
  2. Tuổi trẻ Online
    • Get the list of article urls: url_extract('https://tuoitre.vn/phap-luat.htm', key='h3')
    • Extract article details: url_extract('https://tuoitre.vn/gap-thu-tuong-xuc-dong-chuyen-co-giao-mam-non-miet-mai-lam-thien-nguyen-cho-vung-xa-20221119175021292.htm', key='div', tag_class='date-time', type='')
  3. CafeF
    • Get the list of article urls: url_extract('https://cafef.vn/bat-dong-san.chn', key='h3', type='link')
    • Extract article details: url_extract('https://cafef.vn/dau-se-la-phan-khuc-bds-giu-duoc-nhiet-trong-thoi-gian-toi-2022111913083069.chn', key='span', tag_class='pdate', type='')
  4. Cafebiz
    • Get the list of article urls: url_extract('https://cafebiz.vn/vi-mo.chn', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://cafebiz.vn/tai-sao-nha-o-my-la-tai-san-con-o-nhat-ban-thi-lai-chang-khac-gi-hang-tieu-dung-176221119095831295.chn', key='span', tag_class='time', type='')
  5. Kinh tế Sài Gòn Online
    • Get the list of article urls: url_extract('https://thesaigontimes.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://thesaigontimes.vn/kinh-te-tuan-hoan-mo-ra-nhung-mo-hinh-kinh-doanh-moi/', key='time', tag_class='', type='')
  6. VN Economy
    • Get the list of article urls: url_extract('https://vneconomy.vn/', key='h3', type='link', bs_on=False)
    • Extract article details: url_extract('https://vneconomy.vn/xuat-khau-det-may-van-tu-tin-voi-muc-tieu-42-ty-usd.htm', key='div', tag_class='detail__meta', type='')
  7. Pháp Luật Tp.HCM
    • Get the list of article urls: url_extract('https://m.plo.vn/phap-luat/', key='h3', type='link')[0][1]
    • Extract article details: url_extract('https://plo.vn/dieu-tra-trung-tam-dang-kiem-cap-so-song-sinh-cho-xe-tai-post705918.html', key='time', tag_class='', type='')
  8. Đầu tư Online
    • Get the list of article urls: url_extract('https://baodautu.vn/', key='article', type='link', bs_on='')
    • Extract article details: url_extract('https://baodautu.vn/nguoi-dan-rong-ra-cau-cuu-khi-nao-co-so-do-tu-du-an-cua-cong-ty-bach-dat-an-d177946.html', key='span', tag_class='post-time', type='')
  9. Nhịp cầu đầu tư
    • Get the list of article urls: url_extract('https://m.nhipcaudautu.vn/kinh-doanh/', key='article', type='link', bs_on='', user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)')
    • Extract article details: url_extract('https://m.nhipcaudautu.vn/ti-le-don-bay-tai-chinh-toan-thi-truong-giam-dan-tu-quy-i-3348999/', key='span', tag_class='date-post', type='')
  10. Diễn đàn doanh nghiệp
    • Get the list of article urls: url_extract('https://diendandoanhnghiep.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://diendandoanhnghiep.vn/https-diendandoanhnghiep-vn-dien-mat-troi-mai-nha-can-hoan-thien-co-che-ho-tro-doanh-nghiep-phat-trien-225626-html-e313.html', key='span', tag_class='created_time', type='')
  11. Diễn đàn kinh tế Việt Nam - Vietnamnet
    • Get the list of article urls: url_extract('https://vef.vn/diem-nong/', key='article', type='link', bs_on='')
    • Extract article details: (not provided)
  12. Forbes Việt Nam
    • Get the list of article urls: url_extract('https://forbes.vn', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://forbes.vn/m-village-cua-nguyen-hai-ninh-xay-lang-trong-pho/', key='div', tag_class='forbes-single__heading-time', type='')
  13. Vietstock
    • Get the list of article urls: url_extract('https://vietstock.vn/', key='h4', type='link', bs_on='')
    • Extract article details: url_extract('https://vietstock.vn/2022/11/thieu-hut-iphone-14-nguoi-dung-viet-lua-chon-iphone-doi-cu-4264-1017483.htm', key='span', tag_class='date', type='')
  14. Tin nhanh chứng khoán
    • Get the list of article urls (currently doesn't work): url_extract('https://m.tinnhanhchungkhoan.vn/', key='h2', type='link', bs_on='')
    • Extract article details: url_extract('https://www.tinnhanhchungkhoan.vn/big-trends-sau-con-mua-troi-lai-sang-post310328.html', key='time', tag_class='', type='')
  15. Cafe Land
    • Get the list of article urls: url_extract('https://cafeland.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://cafeland.vn/phan-tich/bien-doi-khi-hau-dang-leo-thang-nhung-doanh-nghiep-chu-yeu-doi-pho-114941.html', key='div', tag_class='info-date right', type='')
  16. Kenh14
    • Get the list of article urls: url_extract('https://m.kenh14.vn/doi-song.chn', key='h3', type='link')
    • Extract article details: url_extract('https://m.kenh14.vn/phia-sau-nhung-gen-z-okela-co-luc-that-bai-co-luc-khong-on-lam-nhung-chua-bao-gio-ngung-no-luc-20221119153833146.chn', key='span', tag_class='kbwcm-time', type='')
  17. Dân trí
    • Get the list of article urls: url_extract('https://dantri.com.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://dantri.com.vn/the-gioi/moscow-cao-buoc-ukraine-kich-dong-xung-dot-quan-su-nga-nato-20221119145209276.htm', key='time', tag_class='author-time', type='')
  18. Thanh niên
    • Get the list of article urls: (not provided)
    • Extract article details: (not provided)
  19. Vietnamnet
    • Get the list of article urls: (not provided)
    • Extract article details: (not provided)
  20. Nhân dân điện tử
    • Get the list of article urls: (not provided)
    • Extract article details: (not provided)
  21. Lao động
    • Get the list of article urls: (not provided)
    • Extract article details: (not provided)
  22. Đời sống & pháp luật
    • Get the list of article urls: (not provided)
    • Extract article details: (not provided)
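Because every source needs its own key and tag_class, it can help to keep those settings in one place instead of repeating them in each call. The sketch below is one possible way to do that, not something the package provides: `SOURCES`, `list_call`, and `detail_call` are hypothetical names, and the settings are copied from three of the examples above. It makes no network requests.

```python
# Selector settings copied from three of the examples above.
# SOURCES, list_call and detail_call are hypothetical names for this sketch.
SOURCES = {
    "VN Express": {
        "list_url": "https://vnexpress.net/kinh-doanh",
        "list_args": {"key": "h3"},
        "detail_args": {"key": "span", "tag_class": "date", "type": ""},
    },
    "CafeF": {
        "list_url": "https://cafef.vn/bat-dong-san.chn",
        "list_args": {"key": "h3", "type": "link"},
        "detail_args": {"key": "span", "tag_class": "pdate", "type": ""},
    },
    "Dân trí": {
        "list_url": "https://dantri.com.vn/",
        "list_args": {"key": "h3", "type": "link", "bs_on": ""},
        "detail_args": {"key": "time", "tag_class": "author-time", "type": ""},
    },
}


def list_call(source):
    """Return (url, keyword arguments) for listing the articles of a source."""
    entry = SOURCES[source]
    return entry["list_url"], dict(entry["list_args"])


def detail_call(source, article_url):
    """Return (url, keyword arguments) for reading one article's details."""
    entry = SOURCES[source]
    return article_url, dict(entry["detail_args"])


url, kwargs = list_call("CafeF")
print(url, kwargs)
# With the package installed, the actual request would be:
#     url_extract(url, **kwargs)
```

Adding a new source then means adding one dictionary entry, and a selector that stops working after a site redesign only has to be fixed in one place.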

III. APPENDICES

  • Demo video: How to select the key
  • Explore User Agents by Operating System: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/

IV. 🙋‍♂️ CONTACT INFORMATION

You can contact me at one of my social network profiles:


If you want to support my open-source projects, you can "buy me a coffee" via Patreon or Momo e-wallet (VN). Your support helps cover my blog hosting fees and lets me keep developing high-quality content.

