vnnews crawler

A Python package that helps crawl updates from top Vietnamese news providers.



II. REFERENCES

2.1. How to use this package?

  • You can install the latest vnnews crawler version from source with the following command: pip install git+https://github.com/thinh-vu/vnnews.git@main

(*) You might need to insert a ! before your command when running terminal commands on Google Colab.

  • To start using functions, you need to import them: from vnnews import *
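If you prefer not to use a star import, the two functions documented in section 2.3 can be imported by name. The snippet below is a minimal sketch: it assumes that url_extract and fix_url are exposed at the top level of the package (as the star import suggests), and the VNNEWS_AVAILABLE flag is a hypothetical name used only for illustration.

```python
# Import the documented functions by name instead of using "import *".
# Assumption: both names are available at the top level of the package.
try:
    from vnnews import url_extract, fix_url
    VNNEWS_AVAILABLE = True
except ImportError:
    # The package (or one of its dependencies) is not installed here.
    VNNEWS_AVAILABLE = False

print("vnnews available:", VNNEWS_AVAILABLE)
```

Explicit imports keep it clear which names come from vnnews and avoid shadowing names that are already defined in your notebook.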

2.2. List of popular online news sources for investors

  1. VN Express
  2. Tuổi trẻ Online
  3. CafeF
  4. Cafebiz
  5. Kinh tế Sài Gòn Online
  6. VN Economy
  7. Pháp Luật Tp.HCM
  8. Đầu tư Online
  9. Nhịp cầu đầu tư
  10. Diễn đàn doanh nghiệp
  11. Diễn đàn kinh tế Việt Nam - Vietnamnet
  12. Forbes Việt Nam
  13. Vietstock
  14. Tin nhanh chứng khoán
  15. Cafe Land
  16. Kenh14
  17. Dân trí
  18. Thanh niên
  19. Vietnamnet
  20. Nhân dân điện tử
  21. Lao động
  22. Đời sống & pháp luật

2.3. Function references

  • url_extract(url, key, tag_class='', type='link', bs_on=True, user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101')

    • Purpose: Extract article info from a news source, using BeautifulSoup to pull data from an HTML/XML web page.
    • Arguments:
      • url (:obj:str, required): url of the target news source. Eg. 'https://cafef.vn/'
      • key (:obj:str, required): the HTML tag that contains the information you want to extract. Eg. 'h3', 'article', 'div'
      • tag_class (:obj:str, optional): the HTML class attribute, which specifies one or more class names for an element. Blank '' as default. Eg. 'pdate' in the tag <span class="pdate">19-11-2022 - 15:32 PM</span> on CafeF.
      • type (:obj:str, optional): 'link' as default, to extract only the article links from a news homepage. Use a blank value '' when extracting article details on the article page.
      • bs_on (:obj:bool, optional): True as default. Pass a blank value '' or False, as in the examples in section 2.4, if the call fails with the default.
      • user_agent (:obj:str, optional): a default value for Desktop is provided. You can find more user agent values here: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
  • fix_url(host, url)

    • Purpose: Build the full article url from the host name and a url string that might not contain the host.
    • Arguments:
      • host (:obj:str, required): the host name of the news source. Eg. 'https://vneconomy.vn'
      • url (:obj:str, required): the url string of the target news source. This might not contain the host at the beginning. Eg. '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
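
To make the roles of key, tag_class and host more concrete, here is a minimal sketch of the same idea built only on the Python standard library. It is not the vnnews implementation (which uses BeautifulSoup): the TagCollector class, the extract function and the SAMPLE_HTML markup are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up markup standing in for a news page (not taken from a real site).
SAMPLE_HTML = """
<h3 class="title"><a href="/kinh-doanh/bai-viet-1.htm">Bai viet 1</a></h3>
<h3 class="title"><a href="https://example.vn/bai-viet-2.htm">Bai viet 2</a></h3>
<span class="pdate">19-11-2022 - 15:32 PM</span>
"""


class TagCollector(HTMLParser):
    """Collect the text and links inside every <key class="tag_class"> element."""

    def __init__(self, key, tag_class=""):
        super().__init__()
        self.key = key
        self.wanted = set(tag_class.split())  # empty set matches any class
        self.depth = 0                        # > 0 while inside a match
        self.results = []                     # one dict per matching element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            classes = set((attrs.get("class") or "").split())
            if tag != self.key or not self.wanted <= classes:
                return
            self.results.append({"text": "", "links": []})
            self.depth = 1
        elif tag == self.key:
            self.depth += 1  # nested element with the same tag name
        if tag == "a" and attrs.get("href"):
            self.results[-1]["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.key:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1]["text"] += data


def extract(html, key, tag_class="", host=""):
    """Return the text and absolute links of each matching element."""
    parser = TagCollector(key, tag_class)
    parser.feed(html)
    parser.close()
    return [
        {
            "text": item["text"].strip(),
            # urljoin plays the role described for fix_url: relative links
            # get the host prepended, absolute links are left unchanged.
            "links": [urljoin(host, link) for link in item["links"]],
        }
        for item in parser.results
    ]


article_links = extract(SAMPLE_HTML, key="h3", host="https://example.vn")
publish_dates = extract(SAMPLE_HTML, key="span", tag_class="pdate")
print(article_links)
print(publish_dates)
```

The first call mirrors the "list of article urls" use case (a tag name only), while the second mirrors the "article details" use case (a tag name plus a class).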

2.4. Let's get our hands dirty

  1. VN Express
    • Get the list of article urls: url_extract('https://vnexpress.net/kinh-doanh', key='h3')
    • Extract article details: url_extract('https://vnexpress.net/thuong-mai-va-dau-tu-ben-vung-se-giup-apec-ung-pho-nguy-co-suy-thoai-4538015.html', key='span', tag_class='date', type='')
  2. Tuổi trẻ Online
    • Get the list of article urls: url_extract('https://tuoitre.vn/phap-luat.htm', key='h3')
    • Extract article details: url_extract('https://tuoitre.vn/gap-thu-tuong-xuc-dong-chuyen-co-giao-mam-non-miet-mai-lam-thien-nguyen-cho-vung-xa-20221119175021292.htm', key='div', tag_class='date-time', type='')
  3. CafeF
    • Get the list of article urls: url_extract('https://cafef.vn/bat-dong-san.chn', key='h3', type='link')
    • Extract article details: url_extract('https://cafef.vn/dau-se-la-phan-khuc-bds-giu-duoc-nhiet-trong-thoi-gian-toi-2022111913083069.chn', key='span', tag_class='pdate', type='')
  4. Cafebiz
    • Get the list of article urls: url_extract('https://cafebiz.vn/vi-mo.chn', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://cafebiz.vn/tai-sao-nha-o-my-la-tai-san-con-o-nhat-ban-thi-lai-chang-khac-gi-hang-tieu-dung-176221119095831295.chn', key='span', tag_class='time', type='')
  5. Kinh tế Sài Gòn Online
    • Get the list of article urls: url_extract('https://thesaigontimes.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://thesaigontimes.vn/kinh-te-tuan-hoan-mo-ra-nhung-mo-hinh-kinh-doanh-moi/', key='time', tag_class='', type='')
  6. VN Economy
    • Get the list of article urls: url_extract('https://vneconomy.vn/', key='h3', type='link', bs_on=False)
    • Extract article details: url_extract('https://vneconomy.vn/xuat-khau-det-may-van-tu-tin-voi-muc-tieu-42-ty-usd.htm', key='div', tag_class='detail__meta', type='')
  7. Pháp Luật Tp.HCM
    • Get the list of article urls: url_extract('https://m.plo.vn/phap-luat/', key='h3', type='link')[0][1]
    • Extract article details: url_extract('https://plo.vn/dieu-tra-trung-tam-dang-kiem-cap-so-song-sinh-cho-xe-tai-post705918.html', key='time', tag_class='', type='')
  8. Đầu tư Online
    • Get the list of article urls: url_extract('https://baodautu.vn/', key='article', type='link', bs_on='')
    • Extract article details: url_extract('https://baodautu.vn/nguoi-dan-rong-ra-cau-cuu-khi-nao-co-so-do-tu-du-an-cua-cong-ty-bach-dat-an-d177946.html', key='span', tag_class='post-time', type='')
  9. Nhịp cầu đầu tư
    • Get the list of article urls: url_extract('https://m.nhipcaudautu.vn/kinh-doanh/', key='article', type='link', bs_on='', user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)')
    • Extract article details: url_extract('https://m.nhipcaudautu.vn/ti-le-don-bay-tai-chinh-toan-thi-truong-giam-dan-tu-quy-i-3348999/', key='span', tag_class='date-post', type='')
  10. Diễn đàn doanh nghiệp
    • Get the list of article urls: url_extract('https://diendandoanhnghiep.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://diendandoanhnghiep.vn/https-diendandoanhnghiep-vn-dien-mat-troi-mai-nha-can-hoan-thien-co-che-ho-tro-doanh-nghiep-phat-trien-225626-html-e313.html', key='span', tag_class='created_time', type='')
  11. Diễn đàn kinh tế Việt Nam - Vietnamnet
    • Get the list of article urls: url_extract('https://vef.vn/diem-nong/', key='article', type='link', bs_on='')
    • Extract article details: (no example provided)
  12. Forbes Việt Nam
    • Get the list of article urls: url_extract('https://forbes.vn', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://forbes.vn/m-village-cua-nguyen-hai-ninh-xay-lang-trong-pho/', key='div', tag_class='forbes-single__heading-time', type='')
  13. Vietstock
    • Get the list of article urls: url_extract('https://vietstock.vn/', key='h4', type='link', bs_on='')
    • Extract article details: url_extract('https://vietstock.vn/2022/11/thieu-hut-iphone-14-nguoi-dung-viet-lua-chon-iphone-doi-cu-4264-1017483.htm', key='span', tag_class='date', type='')
  14. Tin nhanh chứng khoán
    • Get the list of article urls (this call doesn't work): url_extract('https://m.tinnhanhchungkhoan.vn/', key='h2', type='link', bs_on='')
    • Extract article details: url_extract('https://www.tinnhanhchungkhoan.vn/big-trends-sau-con-mua-troi-lai-sang-post310328.html', key='time', tag_class='', type='')
  15. Cafe Land
    • Get the list of article urls: url_extract('https://cafeland.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://cafeland.vn/phan-tich/bien-doi-khi-hau-dang-leo-thang-nhung-doanh-nghiep-chu-yeu-doi-pho-114941.html', key='div', tag_class='info-date right', type='')
  16. Kenh14
    • Get the list of article urls: url_extract('https://m.kenh14.vn/doi-song.chn', key='h3', type='link')
    • Extract article details: url_extract('https://m.kenh14.vn/phia-sau-nhung-gen-z-okela-co-luc-that-bai-co-luc-khong-on-lam-nhung-chua-bao-gio-ngung-no-luc-20221119153833146.chn', key='span', tag_class='kbwcm-time', type='')
  17. Dân trí
    • Get the list of article urls: url_extract('https://dantri.com.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://dantri.com.vn/the-gioi/moscow-cao-buoc-ukraine-kich-dong-xung-dot-quan-su-nga-nato-20221119145209276.htm', key='time', tag_class='author-time', type='')
  18. Thanh niên
    • Get the list of article urls: (no example provided)
    • Extract article details: (no example provided)
  19. Vietnamnet
    • Get the list of article urls: (no example provided)
    • Extract article details: (no example provided)
  20. Nhân dân điện tử
    • Get the list of article urls: (no example provided)
    • Extract article details: (no example provided)
  21. Lao động
    • Get the list of article urls: (no example provided)
    • Extract article details: (no example provided)
  22. Đời sống & pháp luật
    • Get the list of article urls: (no example provided)
    • Extract article details: (no example provided)
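
The article-detail calls above differ only in their key and tag_class values, so one way to reuse them is a lookup table keyed by host name. The sketch below is an illustration under stated assumptions, not part of vnnews: DETAIL_SELECTORS and detail_kwargs are hypothetical names, the values are copied from the examples in this section, and they may stop matching if a site changes its markup.

```python
from urllib.parse import urlparse

# (key, tag_class) pairs copied from the "Extract article details" examples.
DETAIL_SELECTORS = {
    "vnexpress.net": ("span", "date"),
    "tuoitre.vn": ("div", "date-time"),
    "cafef.vn": ("span", "pdate"),
    "cafebiz.vn": ("span", "time"),
    "thesaigontimes.vn": ("time", ""),
    "vneconomy.vn": ("div", "detail__meta"),
    "plo.vn": ("time", ""),
    "baodautu.vn": ("span", "post-time"),
    "nhipcaudautu.vn": ("span", "date-post"),
    "diendandoanhnghiep.vn": ("span", "created_time"),
    "forbes.vn": ("div", "forbes-single__heading-time"),
    "vietstock.vn": ("span", "date"),
    "tinnhanhchungkhoan.vn": ("time", ""),
    "cafeland.vn": ("div", "info-date right"),
    "kenh14.vn": ("span", "kbwcm-time"),
    "dantri.com.vn": ("time", "author-time"),
}


def detail_kwargs(article_url):
    """Return keyword arguments for url_extract, or None for unknown hosts."""
    host = urlparse(article_url).netloc.lower()
    # Drop the "www." and mobile "m." prefixes used by some of the examples.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    selector = DETAIL_SELECTORS.get(host)
    if selector is None:
        return None
    key, tag_class = selector
    return {"key": key, "tag_class": tag_class, "type": ""}


kwargs = detail_kwargs("https://cafef.vn/dau-se-la-phan-khuc-bds-giu-duoc-nhiet-trong-thoi-gian-toi-2022111913083069.chn")
print(kwargs)
# With vnnews installed, the result could then be passed on as:
# url_extract(article_url, **kwargs)
```

Keeping the selectors in one table means a markup change on a news site only requires updating a single entry.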

III. APPENDICES

  • Demo video: How to select the key
  • Explore User Agents by Operating System: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/

IV. 🙋‍♂️ CONTACT INFORMATION

You can contact me at one of my social network profiles:


If you want to support my open-source projects, you can "buy me a coffee" via Patreon or Momo e-wallet (VN). Your support helps cover my blog hosting fee and lets me develop high-quality content.

