A Python package to scrape data from Ghana news portals.
GhanaNews Scraper
A simple, unofficial Python package to scrape data from GhanaWeb, MyJoyOnline, DailyGraphic, CitiBusinessNews, YenGH, 3News, MyNewsGh, and PulseGh. Affiliated projects: Bank of Ghana Fx Rates and GhanaShops-Scraper.
NOTE: This library may keep changing due to changes to the respective websites.
How to install

```shell
pip install ghananews-scraper
```
Example Google Colab Notebook
Warning: DO NOT run the GhanaWeb code in an online Google Colab notebook.
Some GhanaWeb URLs:

```python
urls = [
    "https://www.ghanaweb.com/GhanaHomePage/regional/",
    "https://www.ghanaweb.com/GhanaHomePage/editorial/",
    "https://www.ghanaweb.com/GhanaHomePage/health/",
    "https://www.ghanaweb.com/GhanaHomePage/diaspora/",
    "https://www.ghanaweb.com/GhanaHomePage/tabloid/",
    "https://www.ghanaweb.com/GhanaHomePage/africa/",
    "https://www.ghanaweb.com/GhanaHomePage/religion/",
    "https://www.ghanaweb.com/GhanaHomePage/NewsArchive/",
    "https://www.ghanaweb.com/GhanaHomePage/business/",
    "https://www.ghanaweb.com/GhanaHomePage/SportsArchive/",
    "https://www.ghanaweb.com/GhanaHomePage/entertainment/",
    "https://www.ghanaweb.com/GhanaHomePage/television/",
]
```
Outputs

- All outputs will be saved in a `.csv` file. Other file formats are not yet supported.
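Since every scraper writes its results to CSV, the output can be post-processed with the standard library alone. A minimal sketch (the `load_articles` helper and the `"*.csv"` pattern are hypothetical, not part of this package, and the column names in each file depend on which scraper produced it):

```python
# Hypothetical post-processing helper (not part of ghananews-scraper):
# gather the rows of every CSV the scrapers wrote into one list of dicts.
import csv
import glob

def load_articles(pattern="*.csv"):
    """Read each CSV matching `pattern` and return all rows as dicts."""
    rows = []
    for path in glob.glob(pattern):
        with open(path, newline="", encoding="utf-8") as fh:
            rows.extend(csv.DictReader(fh))
    return rows

articles = load_articles()
print(f"Loaded {len(articles)} articles")
```

From here the combined rows can be deduplicated, filtered by keyword, or loaded into pandas for analysis.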
Usage

```python
from ghanaweb.scraper import GhanaWeb

url = 'https://www.ghanaweb.com/GhanaHomePage/politics/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/NewsArchive/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/health/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/crime/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/regional/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/year-in-review/'

web = GhanaWeb(url=url)
# scrape data and save to the current working directory
web.download(output_dir=None)
```
Scrape a list of articles from GhanaWeb

```python
from ghanaweb.scraper import GhanaWeb

urls = [
    'https://www.ghanaweb.com/GhanaHomePage/politics/',
    'https://www.ghanaweb.com/GhanaHomePage/health/',
    'https://www.ghanaweb.com/GhanaHomePage/crime/',
    'https://www.ghanaweb.com/GhanaHomePage/regional/',
    'https://www.ghanaweb.com/GhanaHomePage/year-in-review/',
]

for url in urls:
    print(f"Downloading: {url}")
    web = GhanaWeb(url=url)
    # downloads to the current working directory
    # if no location is specified
    # web.download(output_dir="/Users/tsiameh/Desktop/")
    web.download(output_dir=None)
```
Scrape data from MyJoyOnline

- It is recommended to use option 1. Option 2 might run into timeout issues.
- DO NOT run this in a Google Colab notebook; run it as a `.py` script, in Visual Studio Code, or in a terminal, because it depends on the `selenium` package.
- You may pass `driver_name="chrome"` or `driver_name="firefox"`.
```python
# Option 1.
from myjoyonline.scraper import MyJoyOnlineNews

url = 'https://myjoyonline.com/politics'
print(f"Downloading data from: {url}")

joy = MyJoyOnlineNews(url=url)
# joy = MyJoyOnlineNews(url=url, driver_name="firefox")
joy.download()
```
```python
# Option 2.
from myjoyonline.scraper import MyJoyOnlineNews

urls = [
    'https://myjoyonline.com/news',
    'https://myjoyonline.com/politics',
    'https://myjoyonline.com/entertainment',
    'https://myjoyonline.com/business',
    'https://myjoyonline.com/sports',
    'https://myjoyonline.com/opinion',
    'https://myjoyonline.com/technology',
]

for url in urls:
    print(f"Downloading data from: {url}")
    joy = MyJoyOnlineNews(url=url)
    # downloads to the current working directory
    # if no location is specified
    # joy.download(output_dir="/Users/tsiameh/Desktop/")
    joy.download()
```
Scrape data from CitiBusinessNews

- Here are some publisher names: `citibusinessnews`, `aklama`, `ellen`, `emmanuel-oppong`, `nerteley`, `edna-agnes-boakye`, `nii-larte-lartey`, `naa-shika-caesar`, `ogbodu`
- Note: using publisher names fetches more data than the URL.
```python
from citionline.scraper import CitiBusinessOnline

urls = [
    "https://citibusinessnews.com/ghanabusinessnews/features/",
    "https://citibusinessnews.com/ghanabusinessnews/telecoms-technology/",
    "https://citibusinessnews.com/ghanabusinessnews/international/",
    "https://citibusinessnews.com/ghanabusinessnews/news/government/",
    "https://citibusinessnews.com/ghanabusinessnews/news/",
    "https://citibusinessnews.com/ghanabusinessnews/business/",
    "https://citibusinessnews.com/ghanabusinessnews/news/economy/",
    "https://citibusinessnews.com/ghanabusinessnews/news/general/",
    "https://citibusinessnews.com/ghanabusinessnews/news/top-stories/",
    "https://citibusinessnews.com/ghanabusinessnews/business/tourism/",
]

for url in urls:
    print(f"Downloading data from: {url}")
    citi = CitiBusinessOnline(url=url)
    citi.download()
```

```python
# OR: scrape using a publisher name
from citionline.authors import CitiBusiness

citi = CitiBusiness(author="citibusinessnews", limit_pages=4)
citi.download()
```
Scrape data from DailyGraphic

```python
from graphiconline.scraper import GraphicOnline

urls = [
    "https://www.graphic.com.gh/news.html",
    "https://www.graphic.com.gh/news/politics.html",
    "https://www.graphic.com.gh/lifestyle.html",
    "https://www.graphic.com.gh/news/education.html",
    "https://www.graphic.com.gh/native-daughter.html",
    "https://www.graphic.com.gh/international.html",
]

for url in urls:
    print(f"Downloading data from: {url}")
    graphic = GraphicOnline(url=url)
    graphic.download()
```
Scrape data from YenGH

```python
# OPTION: 1
from yen.scrapy import YenNews

url = 'https://www.yen.com.gh/'
print(f"Downloading data from: {url}")

yen = YenNews(url=url)
yen.download()
```

```python
# OPTION: 2
from yen.scrapy import YenNews

urls = [
    'https://www.yen.com.gh/',
    'https://yen.com.gh/politics/',
    'https://yen.com.gh/world/europe/',
    'https://yen.com.gh/education/',
    'https://yen.com.gh/ghana/',
    'https://yen.com.gh/people/',
    'https://yen.com.gh/world/asia/',
    'https://yen.com.gh/world/africa/',
    'https://yen.com.gh/entertainment/',
    'https://yen.com.gh/business-economy/money/',
    'https://yen.com.gh/business-economy/technology/',
]

for url in urls:
    print(f"Downloading data from: {url}")
    yen = YenNews(url=url)
    yen.download()
```
Scrape data from MyNewsGh

```python
# scrape from multiple URLs
from mynewsgh.scraper import MyNewsGh

urls = [
    "https://www.mynewsgh.com/category/politics/",
    "https://www.mynewsgh.com/category/news/",
    "https://www.mynewsgh.com/category/entertainment/",
    "https://www.mynewsgh.com/category/business/",
    "https://www.mynewsgh.com/category/lifestyle/",
    "https://www.mynewsgh.com/tag/feature/",
    "https://www.mynewsgh.com/category/world/",
    "https://www.mynewsgh.com/category/sports/",
]

for url in urls:
    print(f"Downloading data from: {url}")
    my_news = MyNewsGh(url=url, limit_pages=50)
    my_news.download()
```

```python
# scrape from a single URL
from mynewsgh.scraper import MyNewsGh

url = "https://www.mynewsgh.com/category/politics/"
my_news = MyNewsGh(url=url, limit_pages=None)
my_news.download()
```
Scrape data from 3News

```python
from threenews.scraper import ThreeNews

# DO NOT RUN ALL AUTHORS: select ONLY a few
# DO NOT CHANGE THE AUTHOR NAMES
authors = [
    "laud-nartey",
    "3xtra",
    "essel-issac",
    "arabaincoom",
    "bbc",
    "betty-kankam-boadu",
    "kwameamoh",
    "fiifi_forson",
    "fdoku",
    "frankappiah",
    "godwin-asediba",
    "afua-somuah",
    "irene",
    "joyce-sesi",
    "3news_user",
    "ntollo",
    "pwaberi-denis",
    "sonia-amade",
    "effah-steven",
    "michael-tetteh",
]

for author in authors:
    print(f"Downloading data from author: {author}")
    three_news = ThreeNews(author=author, limit_pages=50)
    three_news.download()
```

```python
# OR
from threenews.scraper import ThreeNews

three = ThreeNews(author="laud-nartey", limit_pages=None)
three.download()
```
Scrape data from PulseGh

- Select ONLY a few URLs.
- Note: these values may change.

| Category | Number of pages |
|---|---|
| News | 40 |
| Entertainment | 40 |
| Business | 40 |
| Lifestyle | 40 |
| Business/Domestic | 26 |
| Business/International | 40 |
| Sports/Football | 99 |
| News/Politics | 40 |
| News/Local | 40 |
| News/World | 40 |
| News/Filla | 38 |
| Entertainment/Celebrities | 40 |
| Lifestyle/Fashion | 40 |
```python
from pulsegh.scraper import PulseGh

urls = [
    "https://www.pulse.com.gh/news",
    "https://www.pulse.com.gh/news/politics",
    "https://www.pulse.com.gh/entertainment",
    "https://www.pulse.com.gh/lifestyle",
    "https://www.pulse.com.gh/sports",
    "https://www.pulse.com.gh/sports/football",
    "https://www.pulse.com.gh/business/international",
    "https://www.pulse.com.gh/business/domestic",
    "https://www.pulse.com.gh/business",
    "https://www.pulse.com.gh/quizzes",
    "https://www.pulse.com.gh/news/filla",
    "https://www.pulse.com.gh/news/world",
]

for url in urls:
    print(f"Downloading data from: {url}")
    pulse = PulseGh(url=url, limit_pages=5)
    pulse.download()
```

```python
# News has 40 pages
from pulsegh.scraper import PulseGh

pulse = PulseGh(url="https://www.pulse.com.gh/news", total_pages=40, limit_pages=20)
pulse.download()
```

```python
# Sports/Football has 99 pages
from pulsegh.scraper import PulseGh

pulse = PulseGh(url="https://www.pulse.com.gh/sports/football", total_pages=99, limit_pages=None)
pulse.download()
```
BuyMeCoffee
Credits
Theophilus Siameh
File details

Details for the file ghananews-scraper-1.0.28.tar.gz.

- Download URL: ghananews-scraper-1.0.28.tar.gz
- Upload date:
- Size: 25.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | ca30b391640fb115deff266832afc9f7c1fe4959dac6ec7f056f3a6b9ff0321c |
| MD5 | e07f562944d05e19f9e3df0b53b58081 |
| BLAKE2b-256 | b99965c93752cbeb94499d700a4b8ebcca19f194d98011c2fb998e2db0aa6ae7 |
File details

Details for the file ghananews_scraper-1.0.28-py3-none-any.whl.

- Download URL: ghananews_scraper-1.0.28-py3-none-any.whl
- Upload date:
- Size: 39.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | aea87925109ee6ebb8ad9d174c0d0dbe349a93a471e262d732742a63e5a47f9c |
| MD5 | a5e7fd3d280f939af3199c4685bc086f |
| BLAKE2b-256 | 30affb7d4fd9992a1e9f39c8f524dafc0d13b368bf3d22e4c71acd78b06e1c5f |